### String

It's extremely common to deal with lots of text in a real-world context. Python has plenty of support for textual data. We'll go through some of the most common operations here. 

Python represents text as strings, which are basically lists of characters.

In [1]:
x = "Hello, world"
type(x)

str

There are a couple of ways of specifying strings, each of which is most convenient in the right context:
* Using single quotes: 'Does the name "John" ring a bell?'
* Using double quotes: "It's Friday"
* Using triple quotes: """A long    

long    
string"""    
* Raw strings: r"\w*\d{4}" or r"C:\Windows" or r"""\w*\d{4}""" (more below)

Use single quotes when the string itself contains double quotes. Likewise, use double quotes when the string itself contains single quotes. Normal strings can't span multiple lines: that's what triple-quoted strings are useful for.

Inside strings, you can introduce special characters by escaping them with a backslash. Here are some common examples:    
* \n: Newline
* \r: Carriage return
* \t: Tab
* \': Single quote (works even inside single-quoted strings)
* \": Double quote (works even inside double-quoted strings)
* \\: Backslash itself    

Here's an example:

In [2]:
print("A complex string\nwith\tmany so-called \"quoted\' characters,\nintroduced with \\")

A complex string
with	many so-called "quoted' characters,
introduced with \


In regular expressions (which we'll cover later if you've never seen them), all this quoting really gets in the way. That's where raw strings are useful. Raw strings do not interpret escape sequences: a backslash is kept as a literal backslash.    
You might also find raw strings useful for specifying pathnames in Windows.

In [3]:
print( "Search for backslashed words and digits (hard): '\\\\\\w+\\d{4}'")
print(r"Search for backslashed words and digits (easy): '\\\w+\d{4}'")
print( "Path (hard): C:\\Users\\me\\folder\\data.txt")
print(r"Path (easy): C:\Users\me\folder\data.txt")

Search for backslashed words and digits (hard): '\\\w+\d{4}'
Search for backslashed words and digits (easy): '\\\w+\d{4}'
Path (hard): C:\Users\me\folder\data.txt
Path (easy): C:\Users\me\folder\data.txt
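
One quick way to see what the r prefix changes is to compare string lengths. The snippet below is only an illustrative sketch; the variable names are made up for this example.

```python
# In a normal string, \n is a single character (a newline).
normal = "a\nb"
# In a raw string, the backslash and the n stay as two separate characters.
raw = r"a\nb"

print(len(normal))  # 3
print(len(raw))     # 4

# A raw string is still an ordinary str; only the way it is written differs.
print(raw == "a\\nb")  # True
```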


As mentioned above, strings are basically lists of characters, which you can access via indexing or slicing:

In [4]:
s = "Hello"
print(s[0])
print(s[2:4])
print(s[-3:])

H
ll
llo
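
One difference from lists is worth knowing: strings are immutable, so you can read a character by index but not assign to it. A minimal sketch of what happens, and of the usual workaround of building a new string (the names below are just for illustration):

```python
s = "Hello"

# Assigning to an index is not allowed for strings.
try:
    s[0] = "J"
    changed_in_place = True
except TypeError:
    changed_in_place = False

# Instead, build a new string from slices of the old one.
t = "J" + s[1:]

print(changed_in_place)  # False
print(t)                 # Jello
```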


#### Ex How would you classify a filename based on its extension as either a JPEG, a text file, a CSV, or something else?

In [18]:
a="C:\\Users\\me\\folder\\data.txt"
a[-3:]=="txt"

True

You can iterate through the characters in a string with a for loop:

In [19]:
for c in "Hello":
    print(c)

H
e
l
l
o


#### Ex Given a string for a phone number, e.g. "+32-123-456789", count the number of hyphens in it

In [26]:
import re
text="+32-123-456789"
a = re.findall(r'-', text)  # list of every hyphen found in text
print(len(a))

2
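
A regular expression works, but for a fixed character it is probably more than you need. As a possible alternative, the sketch below counts the hyphens with the built-in str.count method and, for comparison, with an explicit loop over the characters.

```python
text = "+32-123-456789"

# str.count returns the number of non-overlapping occurrences of a substring.
hyphens = text.count("-")

# The same count, written as an explicit loop over the characters.
hyphens_loop = 0
for c in text:
    if c == "-":
        hyphens_loop += 1

print(hyphens)       # 2
print(hyphens_loop)  # 2
```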


#### Ex Repeat exercise 0.9.2 using list comprehensions

Strings have lots of other specialized methods. Here are the most common:

In [27]:
s = "hello world"
s.upper()

'HELLO WORLD'

In [28]:
s = "My NaMe Is EaRl"
s.lower()

'my name is earl'

In [29]:
s = "John Doe"
s.startswith("John")

True

In [30]:
s = "myfile.jpg"
s.endswith(".jpg")

True
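
With lower and endswith in hand, the earlier exercise on classifying filenames can be approached along the following lines. This is just one possible sketch: classify_filename is a hypothetical helper, and the set of extensions it recognizes is an assumption.

```python
def classify_filename(filename):
    """Return a rough file type based on the extension (hypothetical helper)."""
    name = filename.lower()  # so that "PHOTO.JPG" is treated like "photo.jpg"
    if name.endswith(".jpg") or name.endswith(".jpeg"):
        return "JPEG"
    elif name.endswith(".txt"):
        return "text"
    elif name.endswith(".csv"):
        return "CSV"
    else:
        return "other"

print(classify_filename("C:\\Users\\me\\folder\\data.txt"))  # text
print(classify_filename("Holiday.JPG"))                      # JPEG
print(classify_filename("notes.docx"))                       # other
```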

There are plenty of functions to classify characters:

In [31]:
print("9".isdigit())
print("a".isalpha())
print("\t".isspace())

True
True
True


Two particularly useful operations are split and join.  split takes one string and a delimiter and returns a list of substrings between such delimiters:

In [32]:
csvline = "one,two,three"
csvline.split(",")

['one', 'two', 'three']

join takes a list of strings and a delimiter, then joins the strings with the delimiter in between. The syntax is a bit special:

In [33]:
",".join(["one", "two", "three"])

'one,two,three'

The delimiter can be the empty string. That's actually quite useful:

In [34]:
"".join(["abra","cadabra"])

'abracadabra'
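
One thing to watch out for: join only accepts strings. If the list holds numbers, they have to be converted first. A small sketch with made-up data:

```python
numbers = [1, 2, 3]

# join only accepts strings, so joining the integers directly raises TypeError.
try:
    ",".join(numbers)
    joined_directly = True
except TypeError:
    joined_directly = False

# Convert each item to a string first.
line = ",".join([str(n) for n in numbers])

print(joined_directly)  # False
print(line)             # 1,2,3
```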

#### Ex Using join and a list comprehension, remove anything except the digits in the phone number "+32-123-456789"

Finally, strip() will remove leading and trailing whitespace (spaces, tabs and newlines) from a string:

In [35]:
"  hi\t".strip()

'hi'
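
strip has two one-sided relatives, lstrip and rstrip, and all three accept an optional argument listing the characters to remove. The sketch below uses repr so that the remaining whitespace is visible; the sample strings are invented for illustration.

```python
padded = "  hi\t\n"

print(repr(padded.strip()))   # 'hi'
print(repr(padded.lstrip()))  # 'hi\t\n'
print(repr(padded.rstrip()))  # '  hi'

# With an argument, strip removes any of the given characters from both ends.
dashed = "--hi--".strip("-")
print(dashed)                 # hi
```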

#### Ex Using split, strip and a list comprehension, take a line from a CSV and make a list of the fields in there, with leading and trailing spaces removed

#### Ex Look up the help string for split. Now split the string "one,two,three" into two parts, one before the first comma and then everything after that

In [41]:
a = "one,two,three"
a.split(",", 1)  # the second argument (maxsplit) limits the number of splits to one

['one', 'two,three']

#### Ex Look up the help string for replace. Now replace every hyphen in the phone number "+32-123-456789" with a dot ('.')

In [48]:
a = "+32-123-456789"
b = a.replace("-", ".")  # replace is a method of the string itself
print(b)

+32.123.456789


Now that we've had plenty of practice with strings, we're ready to introduce format. This method allows you to insert values within a string and control how those values are displayed. Here are a couple of representative examples:

In [49]:
print("a = {0}, a^2 = {1}. What do you think, {2}?"
      .format(10, 10**2, "John"))

a = 10, a^2 = 100. What do you think, John?
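
Numbered fields are not the only option. As a small sketch, the same message can be written with named fields, and empty braces are filled in order (Python 2.7 and later). The field names used here are arbitrary choices for this example.

```python
# Fields can be named instead of numbered.
message = "a = {value}, a^2 = {square}. What do you think, {name}?".format(
    value=10, square=10**2, name="John")
print(message)  # a = 10, a^2 = 100. What do you think, John?

# Empty braces are filled in order, left to right.
auto = "{} + {} = {}".format(1, 2, 1 + 2)
print(auto)     # 1 + 2 = 3
```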


Here's a more interesting example of how you might control field alignment and width to format a table:

In [50]:
data = [
    ['name', 'age', 'phone'],          # Header row
    ['John', 25, '+32-123-456789'],    # Data row 1
    ['Melissa', 38, '+1-510-123456'],  # Data row 2
    ['Joey', 3, '<none>'],             # Data row 3
]
for line in data:
    print('{0:<15}{1:>4}{2:^20}'.format(line[0], line[1], line[2]))

name            age       phone        
John             25   +32-123-456789   
Melissa          38   +1-510-123456    
Joey              3       <none>       


Another useful feature of format is its control of precision in formatting floating point numbers and percentages:

In [52]:
print("{0:.1f}".format(3.141592))
print("{0:.4f}".format(3.141592))
print("{0:.2%}".format(45./76.))

3.1
3.1416
59.21%


### Tuples

Python has special syntax for short, immutable lists, called tuples, that makes many operations a snap.    
A tuple is created like a list, except with parentheses instead of square brackets:


In [53]:
t = (1,2,3)
type(t)

tuple

Most operations that work with lists also work with tuples:

In [54]:
print(len(t))
print(t[0])
print(t[:2])
print(sum(t))

3
1
(1, 2)
6
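
What does not work is anything that would change the tuple, since tuples are immutable. A minimal sketch of the difference (variable names are illustrative):

```python
t = (1, 2, 3)

# Tuples cannot be modified in place.
try:
    t[0] = 10
    modified = True
except TypeError:
    modified = False

# Operations that build a new tuple are fine.
longer = t + (4,)

print(modified)  # False
print(longer)    # (1, 2, 3, 4)
```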


An empty tuple is written as a pair of parentheses with nothing in between, just as an empty list is written as an empty pair of square brackets:

In [55]:
empty_tuple = ()
print(empty_tuple)
print(len(empty_tuple))

()
0


A tuple with one item needs slightly special syntax to differentiate it from an expression in parentheses:

In [56]:
the_number_one = (1)
type(the_number_one)

int

In [57]:
a_lonely_tuple = (1,)   # Need an extra comma!
type(a_lonely_tuple)

tuple

#### Ex Using list comprehension, take a number n and make a list of all the ways of expressing it as a product of two positive integers.

Tuples are useful because of the packing and unpacking operations.    
Python will automatically make a tuple (packing) whenever it sees a set of values separated by commas:

In [58]:
stock = 'GOOG', 100, 45.32
type(stock)

tuple

In [59]:
'{0}{1}{2}'.format('GOOG', 100, 45.32)

'GOOG10045.32'

In [60]:
stock

('GOOG', 100, 45.32)

Python will automatically extract the members of a tuple if you assign to a list of variables:

In [61]:
name, quantity, price = stock
print(name)
print(quantity)
print(price)

GOOG
100
45.32


This syntax enables several idioms:

In [62]:
a = 5
b = 2

a, b = b, a  # Swap

print("a is now {0} and b is now {1}".format(a, b))

a is now 2 and b is now 5


In [63]:
# Iteration through a list of tuples
portfolio = [
    ('GOOG', 100, 45.32),
    ('APPL',  50, 67.89),
    ('MSFT',   1,  0.43),
]
for name, quantity, price in portfolio:
    print("I have {0} shares of {1}, priced at {2}, "
          "for a value of EUR {3:,.2f}"
          .format(quantity, name, price, price*quantity))

I have 100 shares of GOOG, priced at 45.32, for a value of EUR 4,532.00
I have 50 shares of APPL, priced at 67.89, for a value of EUR 3,394.50
I have 1 shares of MSFT, priced at 0.43, for a value of EUR 0.43


Notice that the tuples in portfolio are unpacked right in the for statement. The above is equivalent to the slightly more verbose code below:

In [64]:
# Iteration through a list of tuples
portfolio = [
    ('GOOG', 100, 45.32),
    ('APPL',  50, 67.89),
    ('MSFT',   1,  0.43),
]
for stock in portfolio:
    name, quantity, price = stock
    print("I have {0} shares of {1}, priced at {2}, "
          "for a value of EUR {3:,.2f}"
          .format(quantity, name, price, price*quantity))

I have 100 shares of GOOG, priced at 45.32, for a value of EUR 4,532.00
I have 50 shares of APPL, priced at 67.89, for a value of EUR 3,394.50
I have 1 shares of MSFT, priced at 0.43, for a value of EUR 0.43


Tuple unpacking is also useful together with the zip operation on lists. If you have two "parallel" lists A and B, zip(A, B) pairs up the corresponding items of the two lists as tuples.

In [65]:
A = [1,2,3]
B = ["one", "two", "three"]
list(zip(A, B))  # list() is needed in Python 3, where zip returns an iterator

[(1, 'one'), (2, 'two'), (3, 'three')]

The tuples can then be unpacked in a for loop:

In [66]:
for num, word in zip(A, B):
    print("The word for {0} in English is {1}".format(num, word))

The word for 1 in English is one
The word for 2 in English is two
The word for 3 in English is three
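
zip is not limited to two lists, and it stops as soon as the shortest input runs out. The sketch below illustrates both points with made-up data; list() is wrapped around zip so that the result prints the same way in Python 2 and Python 3.

```python
names = ["GOOG", "MSFT"]
quantities = [100, 1]
prices = [45.32, 0.43]

# Three parallel lists give tuples with three members each.
rows = list(zip(names, quantities, prices))
print(rows)   # [('GOOG', 100, 45.32), ('MSFT', 1, 0.43)]

# If the lists differ in length, zip stops at the shortest one.
pairs = list(zip([1, 2, 3, 4], ["a", "b"]))
print(pairs)  # [(1, 'a'), (2, 'b')]
```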


#### Ex Given a list l, produce a list of tuples for the consecutive items in l.

If you need the indices of items in a list, you can use enumerate:

In [67]:
word = "Crab"  # You can think of a string as a list of characters
for i, c in enumerate(word):
    print("Letter {0} is {1}".format(i + 1, c))

Letter 1 is C
Letter 2 is r
Letter 3 is a
Letter 4 is b
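
If you want the numbering to start at 1, you do not have to add 1 yourself: enumerate takes an optional starting index as its second argument. A short sketch of the same loop written that way:

```python
word = "Crab"

# The optional second argument sets the first index.
for position, c in enumerate(word, 1):
    print("Letter {0} is {1}".format(position, c))

# enumerate yields (index, item) tuples, which the loop above unpacks.
numbered = list(enumerate(word, 1))
print(numbered)  # [(1, 'C'), (2, 'r'), (3, 'a'), (4, 'b')]
```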


Another useful idiom has to do with splitting strings of a known structure:

In [68]:
filename = 'mypicture.jpg'
basename, extension = filename.split(".")
print(basename)
print(extension)

mypicture
jpg
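
This idiom assumes the filename contains exactly one dot. With a name like "archive.tar.gz" (an invented example), the unpacking fails. The sketch below shows the failure and two possible ways around it: rsplit with a split limit, and os.path.splitext from the standard library.

```python
import os.path

filename = "archive.tar.gz"

# filename.split(".") gives three parts here, so unpacking into two names fails.
try:
    basename, extension = filename.split(".")
    unpacked = True
except ValueError:
    unpacked = False
print(unpacked)   # False

# rsplit with a maximum of one split cuts only at the last dot.
basename, extension = filename.rsplit(".", 1)
print(basename)   # archive.tar
print(extension)  # gz

# os.path.splitext does the same job but keeps the dot on the extension.
root, ext = os.path.splitext(filename)
print(root)       # archive.tar
print(ext)        # .gz
```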


In [69]:
csvline = "GOOG,100,45.32"
name, quantity_str, price_str = csvline.split(",")
print(name)
print(quantity_str)
print(price_str)

GOOG
100
45.32
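
Note that split always returns strings, which is why the variables above are named quantity_str and price_str. A minimal sketch of the conversion step you would typically need before doing arithmetic:

```python
csvline = "GOOG,100,45.32"
name, quantity_str, price_str = csvline.split(",")

# Convert the text fields to numbers before doing arithmetic.
quantity = int(quantity_str)
price = float(price_str)

value = quantity * price
print("{0}: {1:,.2f}".format(name, value))  # GOOG: 4,532.00
```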


In [70]:
csvfile_with_header = (
"""Name,Quantity,Price
GOOG,100,45.32
APPL,50,67.89
...""")
header_line, rest = csvfile_with_header.split("\n", 1)
print("Header line: '{0}'".format(header_line))
print("Rest: '{0}'".format(rest))
print("Rest lines: '{0}'".format(rest.split("\n")))

Header line: 'Name,Quantity,Price'
Rest: 'GOOG,100,45.32
APPL,50,67.89
...'
Rest lines: '['GOOG,100,45.32', 'APPL,50,67.89', '...']'


### Dictionaries

The final core Python data type we'll talk about is the dictionary. Dictionaries allow you to associate keys with values. Here's an example:

In [71]:
d = {'GOOG': 45.32, 'APPL': 67.89, 'MSFT': 0.43}
type(d)

dict

Dictionaries support lookup by key:

In [72]:
print(d['GOOG'])

45.32
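
Looking up a key that is not in the dictionary raises a KeyError. When a missing key is a normal situation rather than a bug, the get method may be more convenient, since it returns a default instead. A small sketch (the default value 0.0 is an arbitrary choice for this example):

```python
d = {'GOOG': 45.32, 'APPL': 67.89, 'MSFT': 0.43}

# Looking up a missing key with square brackets raises KeyError.
try:
    d['IBM']
    found = True
except KeyError:
    found = False
print(found)               # False

# get returns None, or a default you choose, instead of raising.
print(d.get('IBM'))        # None
print(d.get('IBM', 0.0))   # 0.0
print(d.get('GOOG', 0.0))  # 45.32
```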


You can see if a dictionary contains a given key:

In [73]:
'APPL' in d

True

In [74]:
'IBM' in d

False

You can add new keys to a dictionary:

In [75]:
d['IBM'] = 13.24
print(d)

{'GOOG': 45.32, 'APPL': 67.89, 'IBM': 13.24, 'MSFT': 0.43}


Or delete keys:

In [76]:
del d['APPL']
print(d)

{'GOOG': 45.32, 'IBM': 13.24, 'MSFT': 0.43}


A key is associated with only one value. You can replace it as follows:

In [77]:
print(d)
d['GOOG'] = 0.01   # Google terminates Gmail...
print(d)

{'GOOG': 45.32, 'IBM': 13.24, 'MSFT': 0.43}
{'GOOG': 0.01, 'IBM': 13.24, 'MSFT': 0.43}


If you use a for loop against a dictionary, you'll iterate through its keys:

In [78]:
for name in d:
    print("The price of {0} is {1}".format(name, d[name]))

The price of GOOG is 0.01
The price of IBM is 13.24
The price of MSFT is 0.43


More often than not, you'll want to iterate through the key-value pairs. Tuple unpacking makes this super-easy:

In [79]:
for name, price in d.items():
    print("The price of {0} is {1}".format(name, price))

The price of GOOG is 0.01
The price of IBM is 13.24
The price of MSFT is 0.43


In [80]:
list(d.items())  # list() is needed in Python 3, where items() returns a view

[('GOOG', 0.01), ('IBM', 13.24), ('MSFT', 0.43)]

In Python versions before 3.7 there is no guaranteed order of iteration; from Python 3.7 on, dictionaries iterate in insertion order. If you want the items sorted by key, sort them as follows:

In [81]:
for name, price in sorted(d.items()):
    print("The price of {0} is {1}".format(name, price))

The price of GOOG is 0.01
The price of IBM is 13.24
The price of MSFT is 0.43
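
Sorting the items sorts by key, because the key is the first member of each tuple. To sort by value instead, one option is to pass a key function to sorted. The sketch below uses a lambda (a short anonymous function) for that purpose; the variable names are illustrative.

```python
d = {'GOOG': 0.01, 'IBM': 13.24, 'MSFT': 0.43}

# key tells sorted what to compare: here, the second member (the price) of each pair.
by_price = sorted(d.items(), key=lambda item: item[1])
print(by_price)  # [('GOOG', 0.01), ('MSFT', 0.43), ('IBM', 13.24)]

# reverse=True sorts from highest to lowest.
most_expensive_first = sorted(d.items(), key=lambda item: item[1], reverse=True)
print(most_expensive_first[0])  # ('IBM', 13.24)
```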


#### Ex Write some code to create a dictionary that maps each word in a sentence to the number of times it appears in it. For simplicity, assume words are only separated by spaces