## Strings
Immutable Unicode arrays

In [25]:
'spam eggs'  # single quotes best quotes

'spam eggs'

In [1]:
"doesn't"

"doesn't"

In [28]:
'"Isn\'t," they said.'

'"Isn\'t," they said.'

In [31]:
print('"Isn\'t," they said.')

"Isn't," they said.


In [36]:
print('C:\some\name')  # here \n means newline!

C:\some
ame


In [37]:
print(r'C:\some\name')

C:\some\name


In [38]:
print("""\
Usage: thingy [OPTIONS]
     -h                        Display this usage message
     -H hostname               Hostname to connect to
""")

Usage: thingy [OPTIONS]
     -h                        Display this usage message
     -H hostname               Hostname to connect to



In [39]:
3 * 'un' + 'ium'

'unununium'

In [40]:
'Py' 'thon'

'Python'

In [41]:
text = ('Put several strings within parentheses '
        'to have them joined together.')
text

'Put several strings within parentheses to have them joined together.'

In [32]:
prefix = 'Py'


In [35]:
prefix + 'thon'

'Python'

In [2]:
shrug = r'¯\_(ツ)_/¯'
shrug[0]

'¯'

In [45]:
shrug[4]

'ツ'

In [47]:
shrug[-2]  # second-last character

'/'

In [49]:
shrug[2:5]

'_(ツ'

In [57]:
shrug[:6]

'¯\\_(ツ)'

In [58]:
shrug[6:]

'_/¯'

In [56]:
shrug[-3:]

'_/¯'

```
   ┌───┬───┬───┬───┬───┬───┬───┬───┬───┐
     ¯   \   _   (   ツ  )   _   /   ¯  
   └───┴───┴───┴───┴───┴───┴───┴───┴───┘
   0   1   2   3   4   5   6   7   8   9
  -9  -8  -7  -6  -5  -4  -3  -2  -1
```

### Why 0-based indexes? 💡

http://www.cs.utexas.edu/users/EWD/transcriptions/EWD08xx/EWD831.html

prof.dr. Edsger W. Dijkstra
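
Dijkstra's note argues for half-open ranges that start at zero. A minimal sketch of what that convention gives Python slices and `range()`; the names `word`, `a` and `b` are illustrative only:

```python
word = 'Python'
a, b = 2, 5

# a half-open slice [a:b) contains exactly b - a items
assert len(word[a:b]) == b - a

# adjacent slices share one boundary and never overlap
assert word[:a] + word[a:b] + word[b:] == word

# range() follows the same convention: len(word) indexes, starting at 0
indexes = list(range(len(word)))
print(indexes)  # [0, 1, 2, 3, 4, 5]
```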

In [2]:
word = 'Python'
word[42]  # only 6 characters

IndexError: string index out of range

In [3]:
word[4:42]

'on'

In [4]:
word[42:]

''

In [67]:
word[0] = 'C'

TypeError: 'str' object does not support item assignment

In [68]:
'C' + word[1:]  # concatenate with +

'Cython'

In [70]:
s = 'supercalifragilisticexpialidocious'
len(s)

34

### 🍎🍎🍎 Python Core: builtins not members
`len()` and other common operations are implemented as builtin functions to ensure a consistent interface and to avoid polluting the namespace of custom objects
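
A small sketch of how a custom object can opt in to `len()`. The `Playlist` class is hypothetical and exists only for this illustration:

```python
class Playlist:
    """Hypothetical container used only for this illustration."""

    def __init__(self, tracks):
        self._tracks = list(tracks)

    def __len__(self):
        # len(obj) looks up __len__ on the object's type
        return len(self._tracks)


playlist = Playlist(['intro', 'verse', 'outro'])

# the same builtin works for str, list and the custom class
sizes = [len('spam'), len([1, 2, 3]), len(playlist)]
print(sizes)  # [4, 3, 3]
```

The class exposes no public `length` or `size` name of its own; callers use the builtin they already know.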

In [1]:
'\N{SNOWMAN}' == '☃'

True

In [10]:
'these' < 'those'  # code point order ('Z' < 'a')

True

In [78]:
str(5000)  # all objects may be turned into strings

'5000'

### ⏱⏱⏱ Performance Matters: Strings

- small strings are "interned"
- stored as arrays of unicode code points
- 1 (latin1), 2 (UCS2) or 4 bytes (UCS4) per code point depending on *largest code point*
- `S[x]` is O(1)
- `len(S)` is O(1)
- `S1 + S2` is O(n + m)
- CPython folds constant string expressions such as `'Py' + 'thon'` at compile time
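
The storage width can be observed with `sys.getsizeof()`. This is a rough sketch that assumes CPython 3.3 or later; other implementations may store strings differently, and `bytes_per_code_point` is a helper invented for this example:

```python
import sys


def bytes_per_code_point(ch):
    # compare two long strings so the fixed object header cancels out
    return sys.getsizeof(ch * 1001) - sys.getsizeof(ch * 1000)


widths = {
    'a': bytes_per_code_point('a'),    # ASCII
    'é': bytes_per_code_point('é'),    # Latin-1
    '€': bytes_per_code_point('€'),    # Basic Multilingual Plane
    '😺': bytes_per_code_point('😺'),  # outside the BMP
}
print(widths)
```

On CPython this should report 1, 1, 2 and 4 bytes: a single wide character makes every code point in that string wider.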

In [122]:
'fragil' in 'supercalifragilisticexpialidocious'

True

In [121]:
'supercalifragilisticexpialidocious'.startswith('super')

True

In [125]:
'supercalifragilisticexpialidocious'.index('list')

14

In [126]:
'supercalifragilisticexpialidocious'.split('li')

['superca', 'fragi', 'sticexpia', 'docious']

### 🍎🍎🍎 Python Core: Iteration
All iterable collections support the same protocol

In [71]:
for c in 'supercalifragilisticexpialidocious':
    print(c)

s
u
p
e
r
c
a
l
i
f
r
a
g
i
l
i
s
t
i
c
e
x
p
i
a
l
i
d
o
c
i
o
u
s
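
The protocol behind `for` can also be driven by hand with `iter()` and `next()`. A minimal sketch, where `first_two` is a hypothetical helper written for this example:

```python
def first_two(iterable):
    # iter() asks any iterable for an iterator; next() pulls one item
    iterator = iter(iterable)
    return next(iterator), next(iterator)


results = [
    first_two('spam'),              # str yields characters
    first_two([10, 20, 30]),        # list yields items
    first_two((1.5, 2.5)),          # tuple yields items
    first_two({'a': 1, 'b': 2}),    # dict yields keys
    first_two(range(5)),            # range yields ints
]
print(results)
# [('s', 'p'), (10, 20), (1.5, 2.5), ('a', 'b'), (0, 1)]
```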


### Review
- what type is used for single characters?
- how do you prevent backslash characters in literal strings from being interpreted specially?
- what keyword do you use to check for the existence of a substring?
- what keyword is used to loop over strings character-by-character?
- how do you extract the last three characters from a string?
- how efficient is repeatedly appending to the end of a string?

## Lists

Dynamic, mutable sequences; items are *usually* all of the same type

In [4]:
squares = [1, 4, 9, 16, 25]
squares

[1, 4, 9, 16, 25]

In [5]:
squares[0]

1

In [9]:
squares[-3:]  # slicing creates new list

[9, 16, 25]

In [10]:
squares[:]

[1, 4, 9, 16, 25]

In [41]:
list(squares)  # be explicit

[1, 4, 9, 16, 25]

In [11]:
squares + [36, 49, 64, 81, 100]

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

In [24]:
cubes = [1, 8, 27, 65, 125]  # something's wrong here
4 ** 3  # the cube of 4 is 64, not 65!

64

In [25]:
cubes[3] = 64
cubes

[1, 8, 27, 64, 125]

In [26]:
cubes.append(216)  # add the cube of 6
cubes.append(7 ** 3)  # and the cube of 7
cubes

[1, 8, 27, 64, 125, 216, 343]

In [2]:
cubes = []
for i in range(1, 8):
    cubes.append(i ** 3)
cubes

[1, 8, 27, 64, 125, 216, 343]

In [3]:
[i ** 3 for i in range(1, 8)]

[1, 8, 27, 64, 125, 216, 343]

In [75]:
rooms = []
for letter in 'ABC':
    for number in [10, 20, 30]:
        rooms.append(letter + str(number))
rooms

['A10', 'A20', 'A30', 'B10', 'B20', 'B30', 'C10', 'C20', 'C30']

In [76]:
[letter + str(number) for letter in 'ABC' for number in [10, 20, 30]]

['A10', 'A20', 'A30', 'B10', 'B20', 'B30', 'C10', 'C20', 'C30']

In [7]:
[letter for letter in 'hellO WOrld' if letter.isupper()]

['O', 'W', 'O']

### 🏠🏠🏠 Idiomatic Python: Comprehensions
Use list comprehensions instead of `map()`, `lambda` and `filter()`
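
For comparison, a sketch of the same filter-and-transform written both ways; the `words` list is made up for this example:

```python
words = ['spam', 'Eggs', 'ham', 'Toast']

# functional style: filter() and map() with lambdas
shouted_functional = list(
    map(lambda w: w.upper(), filter(lambda w: w.islower(), words)))

# comprehension: the same selection and transformation in one expression
shouted = [w.upper() for w in words if w.islower()]

print(shouted)  # ['SPAM', 'HAM']
```

Both produce the same list; the comprehension reads left to right as "what to keep, from where, under which condition".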

In [77]:
letters = list('abcdefg')
letters

['a', 'b', 'c', 'd', 'e', 'f', 'g']

In [74]:
letters[2:5] = 'ⒸⒹⒺ'  # iterates RHS
letters

['a', 'b', 'Ⓒ', 'Ⓓ', 'Ⓔ', 'f', 'g']

In [20]:
# same as: del letters[2:5]
letters[2:5] = []
letters

['a', 'b', 'f', 'g']

In [21]:
# same as letters.clear()
letters[:] = []
letters

[]

In [2]:
my_words = 'these are mine'.split()
my_words

['these', 'are', 'mine']

In [3]:
your_words = my_words  # simple assignment never copies
your_words

['these', 'are', 'mine']

In [28]:
my_words.clear()
your_words

[]

In [72]:
my_words = ['these', 'are', 'mine']
your_words = list(my_words)
my_words.clear()
your_words

['these', 'are', 'mine']

In [31]:
# builtin function len()
letters = ['a', 'b', 'c', 'd']
len(letters)

4

In [39]:
# in keyword to test membership
'd' in letters

True

In [5]:
a = ['a', 'b', 'c']
n = [1, 2, 3]
x = [a, n]
x

[['a', 'b', 'c'], [1, 2, 3]]

In [6]:
x[0]

['a', 'b', 'c']

In [7]:
x[1] == [1, 2, 3]

True

In [8]:
[1, 2, 3] < [1, 2, 4]

True

In [34]:
x[0][1]

'b'

### ⏱⏱⏱ Performance Matters: Lists

- lists are implemented as arrays in memory
- `L[x]` is O(1)
- `L[x] = ...` is O(1)
- `L.append(e)` is O(1) amortized worst case
- `L.pop()` is O(1) from end
- `len(L)` is O(1)
- most other operations O(n)
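
One way to see why `append()` is amortized O(1) is to watch how rarely the underlying array is regrown. This sketch assumes CPython, where `sys.getsizeof()` reflects the allocated capacity of a list:

```python
import sys

items = []
sizes_seen = set()
for i in range(1000):
    items.append(i)
    # the reported size only changes when the underlying array is regrown
    sizes_seen.add(sys.getsizeof(items))

# far fewer distinct sizes than appends: most appends reuse spare capacity
print(len(items), len(sizes_seen))
```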

### Review
- what sort of objects does the `list()` constructor accept?
- what keyword is used to test for membership in a list?
- what keyword is used to loop over list contents member-by-member?
- does assigning a list to a new name copy its contents?
- how do you remove the last three items from a list?
- how do you write a loop with an index variable over values `0, 1, 2, 3, 4`?
- how do you filter values from a list comprehension?

### Exercise: word extraction

1. paste the text from a web page into a Python string  
   e.g. `content = '''<paste-here>'''`
2. extract all the 5-letter words from that content using a list comprehension
3. extract all the capitalized words into a list using a list comprehension

## Tuples
Immutable, lightweight, often *mixed types* & *position-significant*

In [44]:
t = 12345, 54321, 'hello!'
t[0]

12345

In [45]:
t

(12345, 54321, 'hello!')

In [47]:
u = t, (1, 2, 3, 4, 5)
u

((12345, 54321, 'hello!'), (1, 2, 3, 4, 5))

In [48]:
t[0] = 88888

TypeError: 'tuple' object does not support item assignment

In [2]:
v = ([1, 2, 3], [3, 2, 1])
v

([1, 2, 3], [3, 2, 1])

In [3]:
v[0][1] = 500
v

([1, 500, 3], [3, 2, 1])

In [52]:
empty = ()
singleton = 'hello',    # <-- note trailing comma
len(empty)

0

In [53]:
len(singleton)

1

In [54]:
singleton

('hello',)

In [1]:
tuple([10, 20, 30])

(10, 20, 30)

In [56]:
# sequence unpacking
x, y, z = t
y

54321

### ⏱⏱⏱ Performance Matters: Tuples

- tuples are fixed-size and allocated only once
- prefer tuples for safety and limited memory use
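
A rough illustration of both points. The size comparison assumes CPython and the variable names are illustrative:

```python
import sys

triple_list = [3, 4, 5]
triple_tuple = (3, 4, 5)

# CPython detail: a tuple stores its items inline, with no spare capacity
tuple_is_smaller = sys.getsizeof(triple_tuple) < sys.getsizeof(triple_list)

# a tuple of hashable items is itself hashable, so it can be a dict key
labels = {triple_tuple: 'right triangle'}

try:
    hash(triple_list)
    list_is_hashable = True
except TypeError:
    # lists are mutable, so they refuse to be hashed
    list_is_hashable = False

print(tuple_is_smaller, list_is_hashable)
```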

### Review
- how do you access the 2nd item in a tuple?
- what sort of objects does the `tuple()` constructor accept?
- how do you delete the last three items in a tuple?
- are all items in a tuple immutable?

## Dicts
Mutable mapping from any hashable keys → any values; insertion-ordered (a CPython implementation detail in 3.6, guaranteed by the language since 3.7)

In [2]:
tel = {'jack': 4098, 'sape': 4139}
tel['guido'] = 4127
tel

{'jack': 4098, 'sape': 4139, 'guido': 4127}

In [3]:
tel['jack']

4098

In [4]:
del tel['sape']
tel['irv'] = 4127
tel

{'jack': 4098, 'guido': 4127, 'irv': 4127}

In [5]:
list(tel)

['jack', 'guido', 'irv']

In [6]:
for element in tel:
    print(element)

jack
guido
irv


In [7]:
sorted(tel)

['guido', 'irv', 'jack']

In [72]:
list(tel.items())

[('jack', 4098), ('guido', 4127), ('irv', 4127)]

In [76]:
for name, number in tel.items():
    print(name, end=' -- ')
    print(number)

jack -- 4098
guido -- 4127
irv -- 4127


In [34]:
'guido' in tel

True

In [35]:
'jack' not in tel

False

In [36]:
dict([('sape', 4139), ('guido', 4127), ('jack', 4098)])

{'sape': 4139, 'guido': 4127, 'jack': 4098}

In [37]:
dict(sape=4139, guido=4127, jack=4098)

{'sape': 4139, 'guido': 4127, 'jack': 4098}

In [38]:
# order ignored when comparing
{'a': 1, 'b': 2} == {'b': 2, 'a': 1}

True

In [40]:
# repeated keys are not kept: the last value wins
{'a': 1, 'a': 3}

{'a': 3}

In [8]:
{0: 'zero', 0.0: 'zilch', 0j: 'nada'}

{0: 'nada'}

In [11]:
# dict comprehension (like list comprehension)
{letter: ord(letter) for letter in 'ABCDEFG'}

{'A': 65, 'B': 66, 'C': 67, 'D': 68, 'E': 69, 'F': 70, 'G': 71}

### ⏱⏱⏱ Performance Matters: Dicts

- implemented as compact array + index hash table
- `D[x]` is O(1) on average
- `x in D` is O(1) on average
- `D[x] = ...` is O(1) on average, amortized over occasional resizes
- `del D[x]` and `D.pop(x)` are O(1) on average
- `len(D)` is O(1)

A plain dict is not optimized for storing many records that share the same keys. Consider a namedtuple or a pandas DataFrame to work with many similar records more efficiently.
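
A minimal sketch of the namedtuple alternative, reusing the phone numbers from above. The `Contact` type and its field names are assumptions made for this example:

```python
from collections import namedtuple

# hypothetical record type; field names are stored once on the class
Contact = namedtuple('Contact', ['name', 'extension'])

contacts = [
    Contact('jack', 4098),
    Contact('guido', 4127),
    Contact('irv', 4127),
]

# fields are available by name or by position
extensions = {c.name: c.extension for c in contacts}
first_name, first_extension = contacts[0]

print(extensions)  # {'jack': 4098, 'guido': 4127, 'irv': 4127}
```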

### Review
- what type of objects can be used as dict keys? values?
- what keyword is used to check if a key is present in a dict?
- what dict member function is used to return all key, value pairs?

## Sets
Unordered collection of hashable values with no duplicates

In [12]:
basket = {'apple', 'orange', 'apple', 'pear', 'orange', 'banana'}
print(basket)

{'apple', 'orange', 'pear', 'banana'}


In [13]:
'orange' in basket

True

In [15]:
basket - {'orange', 'broccoli'}

{'apple', 'banana', 'pear'}

In [21]:
basket & {'pear', 'orange', 'broccoli'}

{'orange', 'pear'}

In [49]:
basket > set()  # Can't use {}, that's a dict

True

In [17]:
basket > {'apple', 'broccoli'}

False

In [25]:
set(['a', 'a', 'b', 'c', 'd'])

{'a', 'b', 'c', 'd'}

In [26]:
set('aabcd')

{'a', 'b', 'c', 'd'}

In [27]:
set('typing quotes and commas is very tiring'.split())

{'and', 'commas', 'is', 'quotes', 'tiring', 'typing', 'very'}

In [28]:
for element in set('abcd'):
    print(element)

d
a
c
b


### Review
- what type of objects can sets contain?
- is the order of insertion in sets preserved?

## Files

In [84]:
f = open('workfile', 'w', encoding='utf-8')
f.write('Line 1\nLine 2\n')
f.close()

In [87]:
with open('workfile', encoding='utf-8') as f:
    read_data = f.read()
read_data

'Line 1\nLine 2\n'

In [83]:
f.close()
f.read()

ValueError: I/O operation on closed file.
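
A sketch of what `with` does when the body fails part-way through. It writes to a temporary directory so no real files are touched, and the `RuntimeError` is simulated:

```python
import os
import tempfile

# a throwaway path so the sketch does not touch real files
path = os.path.join(tempfile.mkdtemp(), 'workfile')

try:
    with open(path, 'w', encoding='utf-8') as f:
        f.write('Line 1\n')
        raise RuntimeError('simulated failure mid-write')
except RuntimeError:
    pass

# the with block closed the file even though the body raised
closed_after_error = f.closed
print(closed_after_error)  # True
```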

In [89]:
with open('workfile', encoding='utf-8') as f:
    for line in f:
        print('---')
        print(line)

---
Line 1

---
Line 2



In [91]:
value = {'the_answer': 42}
with open('important', 'w', encoding='utf-8') as f:
    f.write(value)

TypeError: write() argument must be str, not dict

In [95]:
import json

value = {'the_answer': 42, 'my_mood': '😺'}
with open('important', 'w', encoding='utf-8') as f:
    json.dump(value, f)

| JSON | Python | notes |
|:- |:- |:- |
| `null` | `None` | |
| `true` / `false` | `True` / `False` | |
| `"string"` | str | |
| `1234` / `1234.5` | int / float | large int support\*, nan\*, inf\* |
| `[1, 2, 3]` | list | |
| `{"a":"b"}` | dict | only str keys, no repeated keys, ordered\* |

\*may not be preserved by other JSON implementations
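
A short round-trip sketch of the mapping above, showing which Python details do not survive. The `original` dict is made up for this example:

```python
import json

original = {1: 'one', 'pair': (1, 2), 'mood': '😺'}

# ensure_ascii=False keeps non-ASCII characters readable in the output
encoded = json.dumps(original, ensure_ascii=False)
decoded = json.loads(encoded)

print(encoded)  # {"1": "one", "pair": [1, 2], "mood": "😺"}
print(decoded)  # {'1': 'one', 'pair': [1, 2], 'mood': '😺'}
```

The int key comes back as a str and the tuple comes back as a list, so `decoded != original`.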

In [99]:
with open('important', encoding='utf-8') as f:
    content = f.read()
content

'{"the_answer": 42, "my_mood": "\\ud83d\\ude3a"}'

In [101]:
json.loads(content)['the_answer']

42

### Review
- what keyword is used to execute a context manager?
- why should context managers be used to open files?
- why might you prefer reading and writing JSON instead of plain text?


## String Formatting


In [105]:
year = 2019
rtype = 'Contracts Quarterly'
f'Total {rtype} records for {year}'

'Total Contracts Quarterly records for 2019'

In [109]:
year = 2019
rtype = 'Contracts Quarterly'
'Total {} records for {}'.format(rtype, year)

'Total Contracts Quarterly records for 2019'

In [106]:
total = 1_018_410
count = 596_196
percentage = count / total
'{:-9} records  {:2.2%}'.format(count, percentage)

'   596196 records  58.54%'

In [107]:
total = 1_018_410
count = 596_196
f'{count:-9} records  {count / total:2.2%}'

'   596196 records  58.54%'

In [12]:
for x in range(11):
    print('{ones:3d} {squares:4d} {cubes:5d}'.format(
        ones=x, squares=x * x, cubes=x * x * x))

  0    0     0
  1    1     1
  2    4     8
  3    9    27
  4   16    64
  5   25   125
  6   36   216
  7   49   343
  8   64   512
  9   81   729
 10  100  1000


In [116]:
# old string formatting
import math
'The value of pi is approximately %5.3f.' % math.pi

'The value of pi is approximately 3.142.'
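
A few more format specifiers that come up often, as a sketch. The values are reused from the cells above and the selection is illustrative, not exhaustive:

```python
count = 596_196
label = 'records'

formatted = [
    f'{count:,}',        # thousands separator
    f'{count:>12,}',     # right-aligned in 12 columns
    f'{label:<10}|',     # left-aligned, padded to 10 columns
    f'{label:^11}|',     # centred in 11 columns
    f'{0.58542:.1%}',    # percentage with one decimal place
    f'{255:#x}',         # hexadecimal with 0x prefix
]
for line in formatted:
    print(line)
```

The same specifiers work after the colon in `str.format()` fields and with the `format()` builtin.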

### Exercise: git log processing

1. capture the output from `git log` on a repository with at least 50 contributors into a text file
   e.g. `git log > gitlog.txt`
2. extract the email addresses from this text file using file and string operations
3. print out all unique email addresses along with the names of the contributors in sorted order

## Extra Material

In [31]:
# unicode ordinal from string
ord('☂')

9730

In [30]:
# unicode character from ordinal
chr(9730)

'☂'

In [75]:
# nul characters fully supported
len('\0\0\0')

3

In [4]:
# literal bytes (not unicode)
bs = b'hello\n'
bs

b'hello\n'

In [7]:
# iteration gives ints between 0 and 255
for b in bs:
    print(b)

104
101
108
108
111
10


In [8]:
# convert bytes to str
bs.decode('utf-8')

'hello\n'

In [9]:
# convert str to bytes
'\N{SNOWMAN}'.encode('utf-8')

b'\xe2\x98\x83'

In [None]:
# generic reverse iterator for collections
for c in reversed('supercalifragilisticexpialidocious'):
    print(c, end='')

In [None]:
# opaque iterator
reversed('supercalifragilisticexpialidocious')

In [None]:
# join() is a str member, where it should be
''.join(reversed('supercalifragilisticexpialidocious'))

In [None]:
# the better way to reverse a string: using slice "step"
'supercalifragilisticexpialidocious'[::-1]

In [None]:
# nested dict
dept_titles = {
    'csps-efpc': {
        'en': 'Canada School of Public Service',
        'fr': 'École de la fonction publique du Canada'
    },
    'tbs-sct': {
        'en': 'Treasury Board of Canada Secretariat',
        'fr': 'Secrétariat du Conseil du Trésor du Canada',
    },
}

dept_titles['tbs-sct']['fr']

In [None]:
# convert to flat dict with tuple keys
flat_titles = {
    (dept, lang): title
    for (dept, langs) in dept_titles.items()
    for (lang, title) in langs.items()}

flat_titles

In [None]:
# access with tuple key
flat_titles['csps-efpc', 'fr']

### ⏱⏱⏱ Performance Matters: {} 💕 ()

- safely combine any hashable keys to flatten dicts
- faster and lower memory use than nested dicts

### frozenset
Immutable unordered collection of hashable values with no duplicates

In [60]:
frozenset('abc')

frozenset({'b', 'c', 'a'})

In [61]:
d = {frozenset('abc'): 'found it'}
d[frozenset('cba')]

'found it'

In [63]:
{frozenset('pqr'), frozenset(), frozenset('rpq'), frozenset('rp'),
 frozenset()}

{frozenset(), frozenset({'p', 'r'}), frozenset({'r', 'q', 'p'})}