# Table of Contents
* [Learning Objectives:](#Learning-Objectives:)
* [`str` (string) type](#str-%28string%29-type)
	* [String methods](#String-methods)
	* [String indexing](#String-indexing)
		* [Immutability of Python `str` type](#Immutability-of-Python-str-type)
	* [String slicing](#String-slicing)
	* [String conversions and `format`](#String-conversions-and-format)
		* [Format strings (old-style)](#Format-strings-%28old-style%29)
		* [The `str.format()` mini-language](#The-str.format%28%29-mini-language)
		* [Details on the `str.format()` mini-language](#Details-on-the-str.format%28%29-mini-language)


# Learning Objectives:

After completion of this module, learners should be able to:

* apply string indexing rules
* apply the `str.format` mini-language to generate formatted output

# `str` (string) type

Strings (or `str` objects) are textual data with *delimiters* to denote where the string starts and ends. String literals are constructed with single quote characters (i.e., `'`), double quote characters (i.e., `"`), or a trio of single or double quote characters (i.e., `'''` or `"""`) as delimiters. Triple-quoted strings can span multiple lines; all enclosed whitespace, including line breaks, is included in the string.

In [None]:
a = 'a string'
print(a, type(a))

In [None]:
string_1 = 'Single quotes as delimiters permit "double" quotes inside.'
string_2 = "Double quotes as delimiters don't have problems with 'single' quotes inside."
string_3 = '''
Triple (single) quotes don't have problems with 'single' or "double" quotes
inside. They don't even have problems with line breaks!
'''
print(string_1)
print(string_2)
print(string_3)

To embed a single quote character (one or more) within a string delimited by single quotes, a backslash character is needed as an *escape character*. The same applies for double quote characters embedded within strings delimited by double quote characters. Other escaped string literals can be found in the [Python documentation](https://docs.python.org/3/reference/lexical_analysis.html#strings). Notice that as of Python 3, string characters are [Unicode code points](https://en.wikipedia.org/wiki/Code_point).

In [None]:
empty_str = ''
string_1 = 'Single quotes as delimiters permit \'escaped single\' quotes inside.'
string_2 = "Double quotes as delimiters permit \"escaped double\" quotes inside."
string_3 = '''Other escaped characters include the literal backslash \\,
Unicode characters with hex values like \\u00CC == \u00CC,'''
string_4 = 'the\ttab character \\t &\nthe line feed \\n.'
print('empty_str = %r' % empty_str)
print(string_1)
print(string_2)
print(string_3,string_4)
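
To make the effect of escaping concrete, the following minimal sketch compares the length of a few literals. The variable names are chosen only for illustration; the point is that an escape sequence takes several characters to type but stands for a single character in the resulting string.

```python
# An escape sequence is written with two or more characters in source code,
# but it produces exactly one character in the string object.
newline = '\n'        # one character: a line feed
backslash_n = '\\n'   # two characters: a literal backslash followed by 'n'
delta = '\u03B4'      # one character: GREEK SMALL LETTER DELTA

print(len(newline), len(backslash_n), len(delta))
print(repr(newline), repr(backslash_n), repr(delta))
```

Using `repr` rather than `print` alone is a convenient way to see which characters a string really contains.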

In [None]:
print("Unicode characters may be entered by name: \N{GREEK SMALL LETTER DELTA}. ",
      "And also by codepoint: \u03B4")

In [None]:
print("Strings can contain either literal non-ASCII characters",
      "Say in Русский.  Or they can contain escapes to codepoints",
      "such as \u0420\u0443\u0441\u0441\u043a\u0438\u0439")

In [None]:
import unicodedata
unicodedata.lookup("GREEK SMALL LETTER DELTA")

In [None]:
hex(ord("δ"))

In [None]:
unicodedata.name("δ")

In [None]:
old_s = 'Mary had a little lamb\nIts fleece was white as snow\nAnd everywhere that Mary went\nThe lamb was sure to go.'
print(old_s)

In [None]:
new_s = """Mary had a little lamb
Its fleece was white as snow
And everywhere that Mary went
The lamb was sure to go."""
print(new_s)

In [None]:
# == compares the contents of new_s and old_s; `is` checks whether they are the same object.
print(new_s == old_s) 
new_s is old_s

In [None]:
s = "Ain't it a shame?!"  # Single quote in double quotes
s

In [None]:
s = 'Ain\'t it a shame?!' # Another example of escaping characters within strings
print(s)

In [None]:
s = """He said "Ain't that a shame"!""" # Triple quotes to include both single/double
print(s)

In [None]:
print('He said "Hi" to me')

In [None]:
print("He said \"Hi\" to me") # Another example of escaping characters within strings

## String methods

As objects, strings have a variety of *methods* (functions) that can be invoked and operate on data contained in the calling `str` object.

| | | | |
| :-: | :-: | :-: | :-: |
| `capitalize` | `casefold` | `center` | `count` |
| `encode` | `endswith` | `expandtabs` | `find` |
| `format` | `format_map` | `index` | `isalnum` |
| `isalpha` | `isdecimal` | `isdigit` | `isidentifier` |
| `islower` | `isnumeric` | `isprintable` | `isspace` |
| `istitle` | `isupper` | `join` | `ljust` |
| `lower` | `lstrip` | `maketrans` | `partition` |
| `replace` | `rfind` | `rindex` | `rjust` |
| `rpartition` | `rsplit` | `rstrip` | `split` |
| `splitlines` | `strip` | `swapcase` | `title` |
| `translate` | `upper` | `zfill` | |

Many of these methods have purposes indicated clearly by their names. We can use `help` or the [Python documentation](https://docs.python.org/3/library/stdtypes.html#string-methods) to determine their function. Let's examine a few here.

Given a `str` object with identifier, say, `a_string`, any string *`method`* is invoked using `a_string.`*`method()`* (that is, the string the method operates on is written as a prefix of the method name in the call). Other arguments may be required within the parentheses, depending on which method is used.

With strings, as with all objects, the object instance itself is the first thing passed to the method, defined in the class.  So, for example, writing `a_string.method(other, args)` does the same thing as calling `str.method(a_string, other, args)`.
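
The equivalence between the two call styles can be checked directly. This is a small sketch using `str.replace`; the sample text and the names `bound_call` and `class_call` are chosen here only for illustration.

```python
a_string = "an aging willow"

# Usual style: the string comes first, followed by a dot and the method name.
bound_call = a_string.replace("aging", "ancient")

# Equivalent style: call the method on the class and pass the string explicitly.
class_call = str.replace(a_string, "aging", "ancient")

print(bound_call)
print(bound_call == class_call)
```

In practice the first form is almost always used; the second mostly appears when a method is passed around as a function, for example `map(str.strip, lines)`.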

In [None]:
haiku = """
    an aging willow
    its image unsteady
    in the flowing stream
"""
print(haiku) # We construct a multi-line string to experiment with first.

In [None]:
# haiku.count('i') returns the number of times the character 'i' occurs
haiku.count('i')

In [None]:
# haiku.count('in') returns the number of times the substring 'in' occurs
haiku.count('in')

In [None]:
haiku.split()

In [None]:
sum('in' == word for word in haiku.split())

In [None]:
sum('in' in word for word in haiku.split())

In [None]:
haiku.count('x') # returns 0 because 'x' is not a substring of haiku

In [None]:
print(haiku.strip()) # Removes leading/trailing whitespace (but not internal whitespace)

In [None]:
# Splits string haiku on line feed characters; returns a *list*
lines = haiku.split('\n')
lines

In [None]:
# We're going to jump ahead slightly in this example, i.e., using a list comprehension
# The following removes empty lines as well as trailing whitespace
[line.strip() for line in haiku.split('\n') if line]

In [None]:
# joining pieces back together
print("\n".join([line.strip() for line in haiku.split('\n') if line]))

In [None]:
print(haiku.upper()) # Convert to upper case, return a new string
print(haiku)

In [None]:
# replaces a source substring with target substring, return as new string
print(haiku.replace('unsteady','wavering'))

In [None]:
# nothing happens when the source substring is not found
print(haiku.replace('uneasy','wavering')) 

In [None]:
'uneasy' in haiku # Should evaluate to False

In [None]:
'unsteady' in haiku

In [None]:
haiku.endswith('stream') # Whoops, need to strip the whitespace...

In [None]:
haiku.rstrip().endswith('stream') # This is what we expected...

In [None]:
# Another jump ahead to map().  This is another way of iterating implicitly
lines = map(str.strip, haiku.split('\n'))
[x for x in lines if x.endswith(('willow','stream'))]

In [None]:
haiku.isalpha() # Only True when all characters are alphabetic (whitespace does not count)

In [None]:
"David".isalpha() # Should be True

In [None]:
haiku.isdigit()

In [None]:
"12345".isdigit()

In [None]:
# This asks "are all the *letters* lowercase?",
# not "are all the characters lowercase letters?"
haiku.islower()

In [None]:
"abc123#$%^&".islower()

In [None]:
# However, there must also *be* some letters for this to be true
"12345".islower()

In [None]:
help(str.islower)

In [None]:
haiku.isupper()

In [None]:
# Converts string haiku into a list of words
# ... more specifically, divide the string around any sequence of whitespace
words = haiku.split() 
words

In [None]:
print(words)
"__".join(words)

In [None]:
"_".join(haiku) # Treats string haiku as list of letters; joins all with '_'

In [None]:
prefixes = ('Ti', 'Da', 'Le')
"David".startswith(prefixes)

In [None]:
# Replaces line-feed with empty strings; puts all on one line
print(haiku.replace('\n',''))
# split into substrings on 'w'; returns list
haiku.replace('\n','').split('w')

In [None]:
# Returns leading index of first occurrence of 'aging' inside 'haiku'
haiku.find('aging')

In [None]:
# haiku.find(substring) returns -1 if the substring is not found
print(haiku.find('old'))

In [None]:
help(str.find)

In [None]:
# str.rfind() searches from the right; str.rindex() also exists, and behaves as expected
haiku.rfind('st'), haiku.find('st')

In [None]:
# haiku.index() is like haiku.find() with different error-handling behaviour
haiku.index('aging') 

In [None]:
haiku.index('old') # ValueError because 'old' not substring of haiku

In [None]:
haiku[0], haiku[10]

In [None]:
haiku[:8]

In [None]:
haiku[8:]

In [None]:
haiku[8:20]

In [None]:
haiku[8:20] + haiku[20:30] == haiku[8:30]

In [None]:
haiku[-1]

In [None]:
haiku[-20:-10]

The distinct behaviors of `str.find` and `str.index` suggest two distinct methods for safeguarding output from a program. The first method, `str.find`, returns `-1` when the substring argument does not produce a match. By contrast, the second method, `str.index`, raises an *exception* (specifically, a `ValueError`) to inform us that the substring was not found in the string.

Using `str.find`, we can construct an `if-else` block to flag the error. Notice that if we don't check for the return value `-1`, the statement `print(haiku[position:])` prints the last character of `haiku` (which happens to be `\n`, a line feed character).

Note on good programming practice: It is more dangerous to let your program *succeed* in returning a wrong answer than it is to raise an uncaught exception that you *have to* fix before working with the program.  The philosophy behind this is often expressed with the slogan *"Fail early, fail hard!"*

In [None]:
# pos = haiku.find('old')
# end = haiku[pos:]
haiku[-1:]

In [None]:
position = haiku.find('old')
if position != -1:
    print(haiku[position:])
else:
    print("Not found")

A more Pythonic idiom to catch an error is to use a *`try-except`* block instead. With the `try-except` block, the Python interpreter attempts to execute the statement `position = haiku.index('old')`. In this case, rather than returning an innocuous value `-1` (as `haiku.find` would do), `haiku.index` *raises an exception* (in this case, the exception `ValueError`). When an exception is raised within a `try` block, the code within the `except` block executes instead. It is generally considered better practice to raise exceptions in functions/modules that can be caught in higher-level namespaces.

Programmers who have worked with languages such as C++ and Java may think of exceptions as terrible events that indicate a program is badly broken.  In contrast, Pythonic code follows the philosophy that *"Exceptions are not that exceptional!"*  Allowing exceptions to occur, and catching them in the appropriate place is good and expected coding style.

In [None]:
# More Pythonic not to allow a bad answer to pass silently
try:
    position = haiku.index('old')
    print(haiku[position:])
except Exception as e: # Exception is the broad class of all exceptions
    # This except block catches *any* exception whatsoever
    print(repr(e))

In [None]:
# Even more Pythonic to catch only the exception we know how to deal with
try:
    pos = haiku.index('old')
    print(haiku[pos:])
except ValueError as e: # In this version, we flag the particular exception
    # This except block executes only with a ValueError.
    print("Not found")
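
The idea of raising an exception in a low-level function and catching it at a higher level can be sketched as follows. Both `text_from` and `describe` are hypothetical helpers invented for this illustration; they are not part of the standard library or of this module.

```python
def text_from(text, marker):
    """Return the part of `text` that starts at `marker`.

    Hypothetical helper: it does not handle a missing marker itself,
    so the ValueError raised by str.index propagates to the caller.
    """
    return text[text.index(marker):]


def describe(text, marker):
    """Hypothetical higher-level caller that decides what a miss means."""
    try:
        return text_from(text, marker)
    except ValueError:
        return "Not found"


line = "in the flowing stream"
found = describe(line, "flowing")
missing = describe(line, "old")
print(found)
print(missing)
```

The low-level function stays simple and never returns a misleading value; the decision about how to react to the failure is made in one place, by the code that knows what the result is for.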

## String indexing

An important feature of string manipulation in Python is *string indexing* and *string slicing*. *Indexing* refers to extracting individual elements (characters) from Python strings. The syntax for indexing uses square brackets around an integer index to refer to a character inside the string.
* Indexing starts at 0 at the beginning (left) of the string, e.g., `s[3]` refers to the *fourth* character of the string `s` counting from the left.
  * Sometimes neologisms that build ordinal numbers directly from cardinal names make clear the difference between "which character" and "which index position", e.g., "zeroeth", "oneth", "two-eth", "three-eth" to name indices. If the use of fake words pains you, don't use these.
* Negative indices start from the end (right) of the string, e.g., `s[-2]` is the second-to-last character in the string.
* Trying to index with an index too large for the string raises an exception (an `IndexError`).

In [None]:
s = "My name is David"
print(s[11])  # Indexed from zero, so s[11] is the 12th character; expect 'D'
print(s[-3])  # expect 'v'
print(len(s)) # Prints the length of the string

In [None]:
# Should raise an IndexError (last character in the string has index 15)
print(s[16])

In [None]:
s[-17]

In [None]:
# We could have checked the length using:
len(s)   # Indices 0 .. 15
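
The indexing rules above can be summarized in a short sketch. It assumes the same example string as the preceding cells and introduces `n` for its length; for a string of length `n`, the valid indices run from `-n` to `n - 1`.

```python
s = "My name is David"
n = len(s)

# For 1 <= k <= n, the negative index -k names the same character as n - k.
print(s[-3], s[n - 3])
same = all(s[-k] == s[n - k] for k in range(1, n + 1))
print(same)

# Indices outside the range -n .. n - 1 raise an IndexError.
outcomes = {}
for index in (n - 1, -n, n, -n - 1):
    try:
        outcomes[index] = s[index]
    except IndexError:
        outcomes[index] = "IndexError"
print(outcomes)
```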

### Immutability of Python `str` type

A confusing feature for newcomers coming from C-family languages to Python is that strings are *immutable*; that is, individual characters/substrings within a string object cannot be overwritten once the string has been created. Thus, expressions involving string indices (or slices, see below) can occur on the right-hand side of an assignment operator, but never on the left-hand side. There are a handful of other immutable data structures in Python whose items cannot be reassigned after the object has been created. Making certain data structures immutable enables optimizations in using dictionaries (see below).

In [None]:
print(s)
# This assignment works; s[3] is on the right-hand side of the assignment operator
c3 = s[3]
c3

In [None]:
s[3] = 'g' # This raises an exception (TypeError)
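
Since a character cannot be overwritten in place, the usual workaround is to build a *new* string. The following sketch shows two common ways; the names `changed` and `renamed` are chosen here only for illustration.

```python
s = "My name is David"

# Assemble a new string from slices of the old one instead of assigning to s[3].
changed = s[:3] + 'g' + s[4:]
print(changed)

# String methods such as replace() likewise return a new string.
renamed = s.replace("David", "Dana")
print(renamed)

# The original object is untouched by either operation.
print(s)
```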

## String slicing

Beyond indexing individual elements of strings, we can also extract *slices* (substrings) from strings by specifying a (half-open) range of indices within brackets.
* The slice `my_string[a:b]` extracts a substring with characters `my_string[a]`, `my_string[a+1]`, `my_string[a+2]`, … `my_string[b-2]`, `my_string[b-1]` from the string `my_string` (under the assumption `b>a≥0`).
* If `a≥b`, the slice `my_string[a:b]` is the empty string.
* If `step>0` and `b>a≥0`, then the slice `my_string[a:b:step]` extracts a substring from `my_string` starting from position `a` up to (but not including) position `b` in steps of length `step`. Of course, the endpoints can be given as negative integers as well, in which case positions are measured from the right of the string.
* If `step<0`, the slice `my_string[a:b:step]` extracts a substring traversing the string `my_string` from right to left, starting at `a` and terminating at (but not including) `b`.
* The slice `my_string[:b]` slices from the beginning of the string (position `0`) up to (but not including) position `b`.
* The slice `my_string[a:]` slices from position `a` up to *and including* the end of the string.

These rules are more easily understood looking at examples.

In [None]:
print(s)
# Slicing: pull out char 4 up to (but *not* including) char 9 from string s
print(s[4:9]) 

In [None]:
s[11:] # From s[11] to the very end

In [None]:
s[:11] # From the very start up to (but not including) character 11

In [None]:
s[:11] + s[11:]

In [None]:
s[-5] # s[-5] means the character 5 preceding the end

In [None]:
s[-5:] # Equivalent to s[11:] for this string

In [None]:
s[-5:-3] # Again, remember slicing is half-open (non-inclusive)

In [None]:
s[1:10:3] # Specify stride of length 3 (i.e., count in steps of 3)

In [None]:
s[15:8:-2]
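
A few edge cases of the slicing rules are worth trying explicitly. This sketch uses the same example string; the variable names are illustrative only. The last line shows a behavior not listed in the rules above: unlike indexing, a slice bound beyond the end of the string does not raise an exception.

```python
s = "My name is David"

empty = s[9:4]          # start >= stop with a positive step: the empty string
backwards = s[::-1]     # both endpoints omitted, step -1: the whole string reversed
strided = s[15:8:-2]    # right to left from index 15, stopping before index 8
clipped = s[11:100]     # an out-of-range bound is silently clipped to the end

print(repr(empty))
print(backwards)
print(strided)
print(clipped)
```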

The next couple cells show why using half-open intervals makes reasoning about slices easier.  The end of one slice adds seamlessly to the start of one with the same index.  This helps avoid what are called "[fence-post errors](https://en.wikipedia.org/wiki/Off-by-one_error)."

There's an old computer science joke that illustrates this:

> There are two hard things in computer science: cache invalidation; naming things; and off-by-one errors.

In [None]:
s[3:7] + s[7:10] # Remember, + operator concatenates strings...

In [None]:
s[:5] + s[5:]
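
The same property makes it easy to cut a string into consecutive pieces without losing or duplicating a character. Here is a minimal sketch; the chunk size `width` is an arbitrary value chosen for illustration.

```python
s = "My name is David"
width = 5  # arbitrary chunk size, chosen only for this example

# Consecutive half-open slices [start, start + width) tile the string exactly:
# each slice ends at the index where the next one begins.
chunks = [s[start:start + width] for start in range(0, len(s), width)]
print(chunks)
print("".join(chunks) == s)
```

With closed intervals, each boundary would need a `+ 1` or `- 1` adjustment, which is exactly where off-by-one errors tend to creep in.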

## String conversions and `format`

It is possible to cast numeric values from strings if the strings represent appropriate numeric literals.

In [None]:
print(float("3.14"), int("-8"))
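
When a string does not represent an appropriate literal, `int` and `float` raise a `ValueError`. The sketch below wraps the conversion in a hypothetical helper, `to_number`, written only for this illustration, that tries `int` first, then `float`, and returns `None` if both fail.

```python
def to_number(text):
    """Hypothetical helper: return an int if possible, else a float, else None."""
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            continue  # this conversion failed; try the next one
    return None


samples = ("-8", "3.14", " 42 ", "1e3", "forty-two")
results = [to_number(text) for text in samples]
for text, value in zip(samples, results):
    print(repr(text), "->", repr(value))
```

Note that `int("3.14")` fails rather than truncating, and that surrounding whitespace is tolerated by both conversions.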

### Format strings (old-style)

Prior to Python 2.6, the principal way of converting numeric data and other Python data into strings was *string interpolation*. The syntax of this style resembles the conventions of the C programming language's `printf` function. The basic trick is to use a `%` character preceding one of the characters in the table below to specify what will be substituted into the format string.

String interpolation remains widely used, but the newer `str.format()` method is more powerful, albeit also often more complicated.

| Conversion | Meaning
| :-:   | :-:
|`d`     |      Signed integer decimal.
|`i`     |      Signed integer decimal.
|`o`     |      Signed octal.
|`u`     |      Obsolete type; identical to `d`.
|`x`     |      Signed hexadecimal (lowercase).
|`X`     |      Signed hexadecimal (uppercase).
|`e`     |      Floating point exponential format (lowercase).
|`E`     |      Floating point exponential format (uppercase).
|`f`     |      Floating point decimal format.
|`F`     |      Floating point decimal format.
|`g`     |      Same as "`e`" if exponent is less than `-4` or not less than precision, "`f`" otherwise.
|`G`     |      Same as "`E`" if exponent is less than `-4` or not less than precision, "`F`" otherwise.
|`c`     |      Single character (accepts integer or single character string).
|`r`     |      String (converts any Python object using `repr()`).
|`s`     |      String (converts any Python object using `str()`).
|`%`     |      No argument is converted, results in a "`%`" character in the result.

A string interpolation expression consists of the format string, followed by the `%` operator, followed by a Python tuple of values to convert. The generic syntax for variable substitution in a format string is

        %[flags][width][.precision]type

where `flags`, `width`, and `precision` are optional parameters.
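
Some of the conversion types in the table are easiest to understand by trying them. The following sketch exercises `%s`, `%r`, `%c`, `%%`, and `%g`; the variable names are chosen only for illustration.

```python
# %s uses str(), %r uses repr(); for a string, repr() keeps the quotes.
text_forms = "%s vs. %r" % ("text", "text")

# %c accepts either an integer code point or a one-character string.
chars = "%c%c" % (72, "i")

# %% produces a literal percent sign and consumes no argument.
percent = "%d%% complete" % 50

# %g switches between fixed and exponential notation depending on the exponent.
general = "%g | %g | %g" % (1234.5, 0.00001234, 123456789.0)

for line in (text_forms, chars, percent, general):
    print(line)
```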

In [None]:
a = str(42.5)
print(a)
# String interpolation (C-style, more or less)
# i.e. %[flags][width][.precision]type
from math import pi
print("Pi is about %d, in %s" % (pi, "Indiana"))
"For rough use, we often just use %0.4f" % pi

In [None]:
# Notice that if just one value is being interpolated into the string, 
# we can give the bare value.  However, if multiple, must use a tuple.
"For rough use, we often just use %0.7f" % pi

In [None]:
"Better precision is %.17f" % pi

In [None]:
"Past 17 digits, floating point precision is meaningless: %.30f" % pi

In [None]:
"Octal %o; Decimal %i; HEX %X; hex %x; Octal w/ marker %#o; Hex w/ marker %#X" % (
       13,         13,     13,     13,                 13,                13)

In [None]:
"Explicit signs %+d, %+d" % (-13, 13)

In [None]:
"Zero padded ints %+06d, %+06d" % (-13, 13)

In [None]:
"Space padded ints %6d, %6d" % (-13, 13)

In [None]:
"A scientific notation format %.3e using 'e'" % 1234567890

### The `str.format()` mini-language

The `format()` function and the `str.format()` method are enormously powerful, and occasionally enormously confusing. An excellent summary of the differences between old-style interpolation and `str.format()` (with examples) can be found at [Pyformat](https://pyformat.info/).

Let's try a few examples both with old-style string interpolation and with `str.format`.

In [None]:
# Define a tuple of numeric values, say dollar amounts.
expenses = (1234.5678, 9900000.1, 83, .02)
for n, item in enumerate(expenses):
    print("Purchase %d:\t$%.2f" % (n+1, item))

We can do better than the last cell using a `format` specifier. In particular, two things we want in formatted currencies are comma separators in large numbers and right alignment.

In [None]:
format_string = "Purchase {}:\t${:>13,.2f}" # Format string using new format mini-language
for n, item in enumerate(expenses):
    print(format_string.format(n+1, item))
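
The specification `>13,.2f` packs several options into a few characters. One way to read it is to build it up piece by piece with the built-in `format()` function, as in this sketch (the variable names are illustrative only).

```python
amount = 1234.5678

precision_only = format(amount, ".2f")     # two digits after the decimal point
with_commas = format(amount, ",.2f")       # add a thousands separator
aligned = format(amount, ">13,.2f")        # right-align in a field 13 characters wide

# str.format applies the same specification, written after the colon.
via_method = "{:>13,.2f}".format(amount)

for piece in (precision_only, with_commas, aligned, via_method):
    print(repr(piece))
```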

We compactly described the currency format above. However, we may rather have the dollar sign close to its amount. This needs to be done in two stages.

In [None]:
format_string = "Purchase {}:\t{:>14}"
for n, item in enumerate(expenses):
    amount = "${:,.2f}".format(item)
    print(format_string.format(n+1, amount))

### Details on the `str.format()` mini-language

For a complete description of the `str.format()` mini-language, see the Python documentation: https://docs.python.org/3.4/library/string.html#formatstrings

| Option | Meaning
|:------:|:--------------------------------------
| `<`      | The field will be left-aligned within the available space. The default for strings.
| `>`      | The field will be right-aligned within the available space. The default for numbers.
| `=`      | Forces the padding to be placed after the sign (if any) but before the digits. This is used for printing fields in the form "`+000000120`". This alignment option is only valid for numeric types.
| `^`      | Forces the field to be centered within the available space.
| `+`      | A sign should be used for both positive as well as negative numbers.
| `-`      | A sign should be used only for negative numbers; the default behavior.
| `space`  | A leading space should be used on positive numbers, a minus sign on negative numbers.
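
The options in the table can be tried out as follows. This is a minimal sketch; the vertical bars are included only to make the field boundaries visible. The last example additionally uses `0` as a fill character placed before `=`, which is part of the mini-language but not listed in the table above.

```python
# Alignment within a field that is 9 characters wide.
alignment = "|{:<9}|{:^9}|{:>9}|".format("left", "mid", "right")

# Sign options applied to a positive and a negative integer.
plus = "|{:+d}|{:+d}|".format(13, -13)
minus = "|{:-d}|{:-d}|".format(13, -13)
space = "|{: d}|{: d}|".format(13, -13)

# '=' puts the padding between the sign and the digits; here the fill is '0'.
padded = "|{:0=+10d}|".format(120)

for line in (alignment, plus, minus, space, padded):
    print(line)
```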

Notice that `str.format()` permits several data structures for specifying values.

In [None]:
# parameters can be out of order...
print("The capital of {1:s} is {2:s}, a {0:s} city".format(
                      "Northern", "California", "Sacramento", "USA"))

In [None]:
# Using keyword arguments to specify values to format
print("The capital of {state} is {capital}".format(
                      capital="Sacramento", state="California", country="USA"))
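
Values can also be supplied from existing tuples and dictionaries. The sketch below shows argument unpacking with `*` and `**`, as well as index and key lookup inside a replacement field; the names `place` and `facts` are chosen only for illustration.

```python
place = ("Northern", "California", "Sacramento")
facts = {"state": "California", "capital": "Sacramento"}

# A tuple unpacked into positional arguments.
by_position = "The capital of {1} is {2}, a {0} city".format(*place)

# A dictionary unpacked into keyword arguments.
by_keyword = "The capital of {state} is {capital}".format(**facts)

# Indexing into a single positional argument inside the field.
by_index = "The capital of {0[1]} is {0[2]}".format(place)

# Key lookup inside the field (note: no quotes around the key).
by_key = "The capital of {d[state]} is {d[capital]}".format(d=facts)

for line in (by_position, by_keyword, by_index, by_key):
    print(line)
```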

See the documents linked above for full details.