**ids-pdl03-tut.ipynb**: This Jupyter notebook is provided by Joachim Vogt for the _Python Data Lab_ of the module _Introduction to Data Science_ offered in Fall 2022 at Jacobs University Bremen. Module instructors are Hilke Brockmann, Adalbert Wilhelm, and Joachim Vogt. Jupyter notebooks and other learning resources are available from a dedicated _module platform_.

# Python strings and lists

This tutorial continues the discussion of Python data types and introduces important built-in data structures. Follow the instructions below to learn about

- [ ] string variables,
- [ ] type conversion,
- [ ] lists and tuples,
- [ ] dictionaries,
- [ ] file input and output.

If you wish to keep track of your progress, you may edit this markdown cell, check a box in the list above after having worked through the respective part of this notebook, and save the file.

*Short exercises* are embedded in this notebook. *Sample solutions* can be found at the end of the document.

Documentation and online resources:
- The Python programming language is documented at [Python.org](https://www.python.org) maintained by the Python Software Foundation (PSF).

## Preparation

The following data file is expected to reside in the working directory. Identify the file on the module platform and upload it to the same folder as this Jupyter notebook.

- `gdp-per-capita-in-us-dollar-world-bank.csv`: GDP per capita in constant 2010 US dollars 1960-2020, published by the [World Bank, 2021-07-30](http://data.worldbank.org/data-catalog/world-development-indicators), available from [Our World in Data](https://ourworldindata.org/grapher/gdp-per-capita-in-us-dollar-world-bank).

## Introduction to Python strings

To become familiar with the concept of string variables, let us start with simple examples. A sequence of characters delimited by single quotes (`'Beeblebrox'`) constitutes a Python string, which is then assigned to the variable `lastname`.

In [1]:
lastname = 'Beeblebrox'
print(lastname)

Beeblebrox


A set of dedicated functions (methods) is associated with string variables, e.g., `count()` to check how often a certain substring occurs in a string, `capitalize()` to make the first letter a capital, `upper()` and `lower()` to convert a string to uppercase and lowercase, respectively. Try tab completion (`lastname.<Tab>`) to obtain a list of functions associated with this variable type.

In [2]:
#Tab completion: Uncomment the following line and press the <Tab> key after the period
#lastname.

In the following code cell, uncomment individual lines to select a function and observe the output.

In [3]:
#print(lastname.count('eb'))
#print(lastname.upper())
#print(lastname.lower())
#print(lastname.lower().capitalize())
#print(lastname.replace('e','a'))

Strings are concatenated using the `+` operator.

In [4]:
firstname = 'Zaphod'
fullname = firstname + ' ' + lastname
print(fullname)

Zaphod Beeblebrox


Individual characters are addressed by specifying their position (index) in square brackets. In Python, the first position is at index 0.

In [5]:
print(fullname[2])

p


Substrings can be extracted through *slicing*, i.e., by specifying ranges of indices separated by a colon `:` as follows. Note that in Python, `[m:n]` gives the range from `m` to `n-1` (i.e., `n` is not included).

In [6]:
print(fullname[2:5])

pho


Negative indices count from the end of a string, i.e., index `-1` refers to the last character.

In [7]:
print(fullname[-1])
print(fullname[-5:-2])

x
ebr
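Slices may also omit one of the two indices or carry a third number, the stride. The following minimal sketch illustrates these forms for strings; the variable names `first`, `last`, `every_other`, and `reversed_name` are chosen here for illustration only.

```python
fullname = 'Zaphod Beeblebrox'

first = fullname[:6]            # omitted start index: same as fullname[0:6]
last = fullname[7:]             # omitted end index: up to the end of the string
every_other = fullname[::2]     # stride 2: every second character
reversed_name = fullname[::-1]  # negative stride: characters in reverse order

print(first)
print(last)
print(every_other)
print(reversed_name)
```

The same slicing rules apply to lists.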


## String variables

Strings are sequences of characters enclosed in single quotes, double quotes, or triple quotes. The Python string type is `str`.

In [8]:
sq = 'This string is defined using single quotes.'
print(sq)
print(type(sq))
dq = "This string is defined using double quotes."
print(dq)
print(type(dq))
tq = '''This string is defined using triple quotes.'''
print(tq)
print(type(tq))

This string is defined using single quotes.
<class 'str'>
This string is defined using double quotes.
<class 'str'>
This string is defined using triple quotes.
<class 'str'>


String variables containing single quotes or double quotes may be delimited with the other type of quote, or with triple quotes.

In [9]:
question = "What's the question?"
print(question)
sentence = '''She asked, "What's the question?"'''
print(sentence)

What's the question?
She asked, "What's the question?"


Besides quotes, the backslash `\` also deserves special attention. Placed at the end of a line, it allows for the multi-line definition of strings as follows.

In [10]:
CCl2F2 = 'Dichloro\
difluoro\
methane'
print(CCl2F2)

Dichlorodifluoromethane


Special characters like quotes or the backslash are included in Python strings by means of _escape sequences_, e.g., `\'` for a single quote, and `\\` for the backslash.

In [11]:
question = 'What\'s the question?'
print(question)
ab = 'a\\b'
print(ab)

What's the question?
a\b


Further examples of escape sequences are `\n` (newline), `\t` (tab), `\b` (backspace), `\r` (return). In the following code cell, uncomment individual lines to select an escape sequence and observe the output.

In [12]:
#print('a\\b')
#print('a\b')
#print('a\tb')
#print('a\nb')

Numerical values can be incorporated in Python strings using the `format()` function. 

In [13]:
a = 3.14159265
str1 = 'The number pi is {}.'.format(a)
print(str1)

The number pi is 3.14159265.


The numerical value of the function argument is inserted at the position of the curly braces `{}`. The output can be controlled using format specifiers such as `.4f`, which produces a fixed-point number with four digits after the decimal point.

In [14]:
a = 3.14159265359
str2 = 'The number pi is {:.4f}.'.format(a)
print(str2)

The number pi is 3.1416.
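Format specifiers can also set the field width, select scientific notation, or pad integers with zeros. The following sketch shows a few common cases; the sample values and the variable names `fixed`, `scientific`, and `padded` are chosen for illustration only.

```python
a = 3.14159265359

fixed = '{:8.2f}'.format(a)              # field width 8, two digits after the decimal point
scientific = '{:.3e}'.format(12345.678)  # scientific notation, three digits after the decimal point
padded = '{:05d}'.format(42)             # integer padded with zeros to width 5

print('[{}]'.format(fixed))   # brackets make the leading spaces visible
print(scientific)
print(padded)
```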


Here is an example with several arguments. 

In [15]:
a = 3.14159265359
b = 2.71828182846
str3 = 'The numbers pi and e are {pi:.4f} and {e:.4f}, respectively.'.format(pi=a,e=b)
print(str3)

The numbers pi and e are 3.1416 and 2.7183, respectively.
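As an alternative to `format()`, Python 3.6 and later versions offer formatted string literals (*f-strings*), in which variable names are written directly inside the curly braces. The sketch below repeats the example above in both styles; the names `str_format` and `str_fstring` are introduced here for illustration.

```python
a = 3.14159265359
b = 2.71828182846

# Same sentence, once with format() and once as an f-string (prefix f before the quote)
str_format = 'The numbers pi and e are {pi:.4f} and {e:.4f}, respectively.'.format(pi=a, e=b)
str_fstring = f'The numbers pi and e are {a:.4f} and {b:.4f}, respectively.'

print(str_fstring)
print('Identical results:', str_format == str_fstring)
```

Both styles accept the same format specifiers, so the choice is largely a matter of readability.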


### Exercise: String variables

Inspect the following Python strings. Predict the output and double-check your assessment by uncommenting the line with the respective `print()` function.

In [16]:
### Example 01
str01 = 'jonathan' + ' ' + 'swift'
#print(str01)
### Example 02
str02 = 'jonathan'.capitalize() + ' ' + 'swift'.upper()
#print(str02)
### Example 03
str03 = "Gulliver's " + 'Travels'
#print(str03)
### Example 04
str04 = str02 + ': ' + str03
#print(str04)
### Example 05
str05 = 'm\nq'
#print(str05)
### Example 06
str06 = 'm\\nq'
#print(str06)
### Example 07
str07 = 'm\
nq'
#print(str07)
### Example 08
a = 4
p = 3
str08 = '{}**{} gives {}.'.format(a,p,a**p)
#print(str08)
### Example 09
str09 = 'The product of {x} and {y} is {z}.'.format(z=a*p,x=p,y=a)
#print(str09)
import math
### Example 10
str10 = 'The result of exp(-1) is {:.4f}.'.format(math.exp(-1))
#print(str10)

## Conversion of scalar data types

The function `format()` can be used to turn a number into a string. 

In [17]:
'{}'.format(2.718281828)

'2.718281828'

More directly, such a type conversion is done by the function `str()`.

In [18]:
str(2.718281828)

'2.718281828'

Strings containing numbers must be converted to a numerical data type before meaningful operations can be performed. Conversion from string to integer is accomplished by means of the function `int()`.

In [19]:
s1 = '3'
s2 = '4'
print('s1={}, s2={}, s1+s2={}'.format(s1,s2,s1+s2))
n1 = int(s1)
n2 = int(s2)
print('n1={}, n2={}, n1+n2={}'.format(n1,n2,n1+n2))

s1=3, s2=4, s1+s2=34
n1=3, n2=4, n1+n2=7


The function `float()` converts to floating-point numbers.

In [20]:
s1 = '3.2'
s2 = '4.3'
print('s1={}, s2={}, s1+s2={}'.format(s1,s2,s1+s2))
n1 = float(s1)
n2 = float(s2)
print('n1={}, n2={}, n1+n2={}'.format(n1,n2,n1+n2))

s1=3.2, s2=4.3, s1+s2=3.24.3
n1=3.2, n2=4.3, n1+n2=7.5
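If a string does not represent a number of the requested type, `int()` and `float()` raise a `ValueError`. For instance, `int('3.2')` fails because the string contains a decimal point. The following sketch shows one possible way to handle such cases; the helper function `to_number()` is hypothetical and not part of Python.

```python
def to_number(text):
    """Return an int if possible, otherwise a float, otherwise None."""
    try:
        return int(text)
    except ValueError:
        pass                 # not an integer, try float next
    try:
        return float(text)
    except ValueError:
        return None          # neither an integer nor a float

for s in ['3', '4.3', '5.14e3', 'three']:
    print('{!r:>9} -> {!r}'.format(s, to_number(s)))
```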


One may also convert integers to floating-point numbers. In the reverse direction, the digits after the decimal point are cut off, i.e., the number is truncated toward zero rather than rounded.

In [21]:
i1 = 3
print('i1={}    : float(i1)={}'.format(i1,float(i1)))
f2 = 4.3
print('f2={}  : int(f2)={}'.format(f2,int(f2)))
f3 = -4.3
print('f3={} : int(f3)={}'.format(f3,int(f3)))

i1=3    : float(i1)=3.0
f2=4.3  : int(f2)=4
f3=-4.3 : int(f3)=-4
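Truncation by `int()` differs from rounding. As a brief comparison, the sketch below applies `int()`, `math.floor()`, `math.ceil()`, and the built-in `round()` to a few sample values chosen for illustration.

```python
import math

# int() truncates toward zero, floor() rounds down, ceil() rounds up,
# round() goes to the nearest integer
for x in [4.3, -4.3, 4.7, -4.7]:
    print('x={:5.1f}  int={:3d}  floor={:3d}  ceil={:3d}  round={:3d}'.format(
        x, int(x), math.floor(x), math.ceil(x), round(x)))
```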


### Exercise: Type conversion

Complete the code cell below according to the instructions included as comments.

In [22]:
### Define two large integers differing by a small number
i1 = 12345678901234567890
i2 = 12345678901234567891
### Compute and print the difference of integers i1 and i2. 

### Convert integers i1 and i2 to floats f1 and f2, respectively.

### Compute and print the difference of floats f1 and f2.

### Define three string variables s3, s4, s5 containing floating-point numbers.

### Convert s3, s4, s5 to floats and print the results.


## Python lists

Python lists are collections of objects of possibly different types. Lists are enclosed in square brackets `[...]`.

In [23]:
lst1 = [2,'three',4,5.6]

Like strings, lists can be concatenated using the `+` operator.

In [24]:
lst2 = ['seven',8.9,10]
lst3 = lst1 + lst2
print(lst3)

[2, 'three', 4, 5.6, 'seven', 8.9, 10]


The number of elements in a list is returned by the function `len()`.

In [25]:
print('Number of elements in lst1: {}'.format(len(lst1)))
print('Number of elements in lst2: {}'.format(len(lst2)))
print('Number of elements in lst3: {}'.format(len(lst3)))

Number of elements in lst1: 4
Number of elements in lst2: 3
Number of elements in lst3: 7


Individual elements in lists can be selected using their index, analogous to individual characters in strings. In Python, the element at the first position is at index 0.

In [26]:
print('List     :',lst3)
pos = 5
print('Position : {}'.format(pos))
ind = pos-1
print('Index    : {}'.format(ind))
elem = lst3[ind]
print('Element  : {}'.format(elem))
print('Type     : {}'.format(type(elem)))

List     : [2, 'three', 4, 5.6, 'seven', 8.9, 10]
Position : 5
Index    : 4
Element  : seven
Type     : <class 'str'>


This selection mechanism also works for ranges of the form `m:n`, addressing list elements from index `m` to index `n-1` (i.e., index `n` is not included). This is called (list) *slicing*.

In [27]:
m = 2
n = 5
print('Complete list lst3      :',lst3)
print('Partial  list lst3[{}:{}] :'.format(m,n),lst3[m:n])

Complete list lst3      : [2, 'three', 4, 5.6, 'seven', 8.9, 10]
Partial  list lst3[2:5] : [4, 5.6, 'seven']


The form `:n` (omission of first index) is equivalent to `0:n`. The form `m:` (omission of the second index) is equivalent to `m:n` where `n` is the length of the list (total number of list elements).

In [28]:
m = 2
n = 4
print('Complete list lst3     :',lst3)
print('Partial  list lst3[:{}] :'.format(n),lst3[:n])
print('Partial  list lst3[{}:] :'.format(m),lst3[m:])

Complete list lst3     : [2, 'three', 4, 5.6, 'seven', 8.9, 10]
Partial  list lst3[:4] : [2, 'three', 4, 5.6]
Partial  list lst3[2:] : [4, 5.6, 'seven', 8.9, 10]


Negative indices count from the end of a list.

In [29]:
m = -4
n = -2
print('Complete list lst3     :',lst3)
print('Partial  list lst3[:{}] :'.format(n),lst3[:n])
print('Partial  list lst3[{}:] :'.format(m),lst3[m:])

Complete list lst3     : [2, 'three', 4, 5.6, 'seven', 8.9, 10]
Partial  list lst3[:-2] : [2, 'three', 4, 5.6, 'seven']
Partial  list lst3[-4:] : [5.6, 'seven', 8.9, 10]


Adding a third index yields *strides*: `m:n:s`. E.g., to select every second element starting at the second position (index 1), one may write `1:7:2` or simply `1::2`.

In [30]:
m = 1
n = 7
s = 2
print('Complete list lst3        :',lst3)
print('Partial  list lst3[{}:{}:{}] :'.format(m,n,s),lst3[m:n:s])
print('Partial  list lst3[{}::{}]  :'.format(m,s),lst3[m::s])

Complete list lst3        : [2, 'three', 4, 5.6, 'seven', 8.9, 10]
Partial  list lst3[1:7:2] : ['three', 5.6, 8.9]
Partial  list lst3[1::2]  : ['three', 5.6, 8.9]


Unlike tuples discussed below, lists are mutable objects, i.e., their entries can be changed, even to an object of a different type.

In [31]:
lst4 = ['two',3,4.2,5]
print('lst4 :',lst4)
lst4[3] = 'five'
print('lst4 :',lst4)

lst4 : ['two', 3, 4.2, 5]
lst4 : ['two', 3, 4.2, 'five']


Another notable property of lists concerns assignments to another variable. Instead of an independent copy, the assignment produces a second reference (an alias) to the same object. Manipulating the underlying object through operations on either of the two variables affects both of them.

In [32]:
lst5 = ['cat','dog']
print('lst5 :',lst5)
lst6 = lst5
lst6[0] = 'mouse'
print('lst6 :',lst6)
print('lst5 :',lst5)

lst5 : ['cat', 'dog']
lst6 : ['mouse', 'dog']
lst5 : ['mouse', 'dog']
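The `is` operator offers a way to check whether two variables refer to the same object. In the minimal sketch below, `lst7` is a name introduced for illustration only; it is created through a full slice `[:]`, which produces a new list with the same entries.

```python
lst5 = ['cat', 'dog']
lst6 = lst5        # second reference to the same list object
lst7 = lst5[:]     # full slice: a new list with the same entries

print('lst6 is lst5 :', lst6 is lst5)   # same object
print('lst7 is lst5 :', lst7 is lst5)   # different objects
print('lst7 == lst5 :', lst7 == lst5)   # equal contents
```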


To create a copy of a list that can be manipulated independently of the original, use the `copy()` method.

In [33]:
lst5 = ['cat','dog']
print('lst5 :',lst5)
lst6 = lst5.copy()
lst6[0] = 'mouse'
print('lst6 :',lst6)
print('lst5 :',lst5)

lst5 : ['cat', 'dog']
lst6 : ['mouse', 'dog']
lst5 : ['cat', 'dog']
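Note that `copy()` produces a so-called shallow copy: the new list is independent, but objects stored inside it (such as nested lists) are still shared with the original. The sketch below contrasts this with `copy.deepcopy()` from the standard library; the variable names `nested`, `shallow`, and `deep` are chosen for illustration.

```python
import copy

nested = [['cat', 'dog'], ['mouse']]
shallow = nested.copy()         # new outer list, inner lists are shared
deep = copy.deepcopy(nested)    # inner lists are copied as well

nested[0][0] = 'lion'           # modify an inner list in place

print('nested  :', nested)
print('shallow :', shallow)     # shows the change because the inner list is shared
print('deep    :', deep)        # unaffected
```

For lists containing only numbers or strings, as in the examples of this notebook, the shallow copy is sufficient.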


Tuples are like lists but immutable, i.e., their entries cannot be changed after definition. Tuples are enclosed in round parentheses while their entries are referenced using square brackets.

In [34]:
tpl = (1,2)
print(tpl,type(tpl))
print('tpl[1] = {}'.format(tpl[1]))

(1, 2) <class 'tuple'>
tpl[1] = 2


Uncomment the instruction in the code cell below to obtain an error message showing that tuples are immutable.

In [35]:
#tpl[1] = 3
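The error can also be caught and displayed without interrupting the program. The following sketch uses a `try`/`except` block for this purpose and, in addition, shows how a tuple is unpacked into separate variables; the names `x`, `y`, and `single` are chosen for illustration.

```python
tpl = (1, 2)

# Attempting to change an entry raises a TypeError
try:
    tpl[1] = 3
except TypeError as err:
    print('TypeError:', err)

# Tuple unpacking: assign the entries to separate variables
x, y = tpl
print('x = {}, y = {}'.format(x, y))

# A tuple with a single entry requires a trailing comma
single = (1,)
print(type(single), type((1)))
```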

### Exercise: Python lists

Complete the code cell below according to the instructions included as comments.

In [36]:
### Define a sample Python list with elements of different types.
mylist = ['Hi!',7.3,'Hello!',1.2345e2,1.2345e32,1024,'Guten Tag!',2**64-1]
print(mylist)
### Apply list slicing to extract and print a list with the second, third, and fourth element.

### Apply list slicing to extract and print a list with the first, third, and fifth element.

### Create an independent copy of the list and exchange the third element with 'Bonjour!'


['Hi!', 7.3, 'Hello!', 123.45, 1.2345e+32, 1024, 'Guten Tag!', 18446744073709551615]


## Dictionaries

Python dictionaries are data structures mapping keys to values, i.e., they consist of key:value pairs. Since Python 3.7, dictionaries preserve the order in which keys were inserted; in earlier versions, the order was not guaranteed. The following illustrative example may be understood as a database for different types of fruit.

In [37]:
DictOfFruits = {'Apples':19,'Bananas':13,'Oranges':17,'Pears':11}
print('Dictionary : ',DictOfFruits)

Dictionary :  {'Apples': 19, 'Bananas': 13, 'Oranges': 17, 'Pears': 11}


Keys and values can be accessed separately.

In [38]:
print('Keys       : ',DictOfFruits.keys())
print('Values     : ',DictOfFruits.values())

Keys       :  dict_keys(['Apples', 'Bananas', 'Oranges', 'Pears'])
Values     :  dict_values([19, 13, 17, 11])


Items are pairs in the form of (key,value) tuples.

In [39]:
print('Items      : ',DictOfFruits.items())

Items      :  dict_items([('Apples', 19), ('Bananas', 13), ('Oranges', 17), ('Pears', 11)])


Individual values are accessed by specifying the respective key in square brackets.

In [40]:
print(DictOfFruits['Oranges'])

17
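Like lists, dictionaries are mutable. The sketch below illustrates a few common operations: adding, updating, and deleting entries, testing for a key, and looping over all items. The entry `'Kiwis'` and the variable `total` are made up for illustration.

```python
DictOfFruits = {'Apples': 19, 'Bananas': 13, 'Oranges': 17, 'Pears': 11}

DictOfFruits['Kiwis'] = 7        # add a new key:value pair
DictOfFruits['Apples'] += 1      # update an existing value
del DictOfFruits['Pears']        # remove an entry

print('Kiwis' in DictOfFruits)         # membership test on keys
print(DictOfFruits.get('Pears', 0))    # get() returns a default if the key is missing

# Loop over (key, value) pairs
for fruit, number in DictOfFruits.items():
    print('{:8s}: {}'.format(fruit, number))

total = sum(DictOfFruits.values())
print('Total   : {}'.format(total))
```

Accessing a missing key with square brackets, e.g., `DictOfFruits['Pears']` after the deletion, raises a `KeyError`, which is why `get()` is used above.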


## Reading and writing files

String and type conversion operations are useful when dealing with data files. Here we consider files organized in sequential records or lines, and illustrate file operations using data on three cities in Northern Germany.

In [41]:
header = 'City,Area[km2],Altitude[m]'
line01 = 'Bremen,318.21,11'
line02 = 'Hannover,204.3,55'
line03 = 'Oldenburg,103.09,5'

Consult the Python documentation on the functions `open()`, `writelines()`, `close()`, and study the following set of instructions. A new data file `cities.txt` is opened for writing (mode `'w'`), then the header is written, followed by the three data records. The function `writelines()` accepts any sequence of strings; when a single string is written at a time, as done here, the function `write()` can be used in the same way. Note that except for the last record (end of file or EOF), the newline character `\n` must be appended so that subsequent lines are separated in the data file. Check your working directory to verify that the file has been created.

In [42]:
fout = open('cities.txt','w')
fout.writelines(header+'\n')
fout.writelines(line01+'\n')
fout.writelines(line02+'\n')
fout.writelines(line03)
fout.close()

In a similar manner, such a file can be read so that its content becomes available in Python. The function `read()` puts all content into a single string.

In [43]:
fin = open('cities.txt','r')
content = fin.read()
fin.close()
content

'City,Area[km2],Altitude[m]\nBremen,318.21,11\nHannover,204.3,55\nOldenburg,103.09,5'
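File operations are often written using the `with` statement, which closes the file automatically at the end of the indented block, even if an error occurs. The following sketch writes and reads the same data in this way; the list `records` is introduced here for illustration, and the file `cities.txt` in the working directory is overwritten.

```python
records = ['City,Area[km2],Altitude[m]',
           'Bremen,318.21,11',
           'Hannover,204.3,55',
           'Oldenburg,103.09,5']

# Write all records, separated by newline characters
with open('cities.txt', 'w') as fout:
    fout.write('\n'.join(records))

# Read the whole file into a single string
with open('cities.txt', 'r') as fin:
    content = fin.read()

print('File closed:', fin.closed)
print(content)
```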

Applying the function `split()` with the newline character `\n` as its argument gives a list of strings, each corresponding to one line from the data file. For the present file, the function `splitlines()` gives the same result.

In [None]:
lines = content.split('\n')
#lines = content.splitlines()
print(lines,type(lines))
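The two functions differ when the text ends with a newline character, as is often the case for data files. A short sketch with a made-up string `text`:

```python
text = 'a\nb\n'                       # note the trailing newline

with_split = text.split('\n')         # yields an empty string after the last newline
with_splitlines = text.splitlines()   # no empty string at the end

print(with_split)
print(with_splitlines)
```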

Each line can now be processed further to extract information.

In [None]:
BremenRecord = lines[1].split(',')
BremenArea = float(BremenRecord[1])
print('Bremen area [km2]: {}'.format(BremenArea))
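The same steps can be applied to all records at once. The sketch below combines string splitting, type conversion, and dictionaries to collect the data in a nested dictionary; the name `cities` and the keys `'area'` and `'altitude'` are chosen for illustration, and the file content is repeated as a string so that the example runs on its own.

```python
content = ('City,Area[km2],Altitude[m]\n'
           'Bremen,318.21,11\n'
           'Hannover,204.3,55\n'
           'Oldenburg,103.09,5')

lines = content.splitlines()

cities = {}
for line in lines[1:]:                       # skip the header line
    name, area, altitude = line.split(',')   # unpack the three fields
    cities[name] = {'area': float(area), 'altitude': int(altitude)}

for name, data in cities.items():
    print('{:10s}: area = {:7.2f} km2, altitude = {:3d} m'.format(
        name, data['area'], data['altitude']))
```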

The above data file example is primarily meant to illustrate basic string and list processing. To read (load) and write (save, store) specific file formats, numerous convenience functions exist in popular modules such as NumPy and pandas to facilitate file input and output.

### Exercise: File handling

Using the functions `open()` and `read()` as described above, load the contents of the file `gdp-per-capita-in-us-dollar-world-bank.csv`. Using the function `count()`, display how often the strings `'Australia'`, `'Bhutan'`, `'Canada'`, `'Denmark'`, `'Ecuador'`, `'Fiji'`, and `'Gambia'` show up in the data file.

---
---

## Solutions to the exercises

### Solution: String variables

In [None]:
### Example 01
str01 = 'jonathan' + ' ' + 'swift'
print(str01)
### Example 02
str02 = 'jonathan'.capitalize() + ' ' + 'swift'.upper()
print(str02)
### Example 03
str03 = "Gulliver's " + 'Travels'
print(str03)
### Example 04
str04 = str02 + ': ' + str03
print(str04)
### Example 05
str05 = 'm\nq'
print(str05)
### Example 06
str06 = 'm\\nq'
print(str06)
### Example 07
str07 = 'm\
nq'
print(str07)
### Example 08
a = 4
p = 3
str08 = '{}**{} gives {}.'.format(a,p,a**p)
print(str08)
### Example 09
str09 = 'The product of {x} and {y} is {z}.'.format(z=a*p,x=p,y=a)
print(str09)
import math
### Example 10
str10 = 'The result of exp(-1) is {:.4f}.'.format(math.exp(-1))
print(str10)

### Solution: Type conversion

In [None]:
### Define two large integers differing by a small number
i1 = 12345678901234567890
i2 = 12345678901234567891
### Compute and print the difference of integers i1 and i2. 
print('i2-i1 : {}'.format(i2-i1))
### Convert integers i1 and i2 to floats f1 and f2, respectively.
f1 = float(i1)
f2 = float(i2)
### Compute and print the difference of floats f1 and f2.
print('f2-f1 : {}'.format(f2-f1))
### Define three string variables containing floating-point numbers.
s3 = '8.3'
s4 = '5.14e3'
s5 = '9.22e-2'
### Convert s3, s4, s5 to floats and print the results.
f3 = float(s3)
print('f3 = {}'.format(f3))
f4 = float(s4)
print('f4 = {}'.format(f4))
f5 = float(s5)
print('f5 = {}'.format(f5))
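The difference of the two floats vanishes because floating-point numbers carry only a limited number of significant digits, whereas Python integers are exact. On common platforms, Python floats are double-precision numbers with a 53-bit significand, so integers beyond `2**53` can no longer all be represented exactly. A minimal sketch illustrating this limit (the name `limit` is chosen for illustration):

```python
i1 = 12345678901234567890
i2 = 12345678901234567891

# Both integers are converted to the same floating-point number
same_float = float(i1) == float(i2)
print('float(i1) == float(i2) :', same_float)

# Around 2**53, neighboring integers start to coincide after conversion
limit = 2**53
print('float(limit)   == float(limit+1) :', float(limit) == float(limit + 1))
print('float(limit-1) == float(limit)   :', float(limit - 1) == float(limit))
```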

### Solution: Python lists

In [None]:
### Define a sample Python list with elements of different types.
mylist = ['Hi!',7.3,'Hello!',1.2345e2,1.2345e32,1024,'Guten Tag!',2**64-1]
print(mylist)
### Apply list slicing to extract and print a list with the second, third, and fourth element.
print(mylist[1:4])
### Apply list slicing to extract and print a list with the first, third, and fifth element.
print(mylist[:5:2])
### Create an independent copy of the list and exchange the third element with 'Bonjour!'
newlist = mylist.copy()
newlist[2] = 'Bonjour!'
print(newlist)

### Solution: File handling

In [None]:
fin = open('gdp-per-capita-in-us-dollar-world-bank.csv','r')
content = fin.read()
fin.close()
country = 'Australia'
print("Number of occurrences of the string \'{}\' : {}".format(country,content.count(country)))
country = 'Bhutan'
print("Number of occurrences of the string \'{}\' : {}".format(country,content.count(country)))
country = 'Canada'
print("Number of occurrences of the string \'{}\' : {}".format(country,content.count(country)))
country = 'Denmark'
print("Number of occurrences of the string \'{}\' : {}".format(country,content.count(country)))
country = 'Ecuador'
print("Number of occurrences of the string \'{}\' : {}".format(country,content.count(country)))
country = 'Fiji'
print("Number of occurrences of the string \'{}\' : {}".format(country,content.count(country)))
country = 'Gambia'
print("Number of occurrences of the string \'{}\' : {}".format(country,content.count(country)))

---
---