# Getting Started with Jupyter Notebook
- Jupyter notebook is an extremely useful tool for developing and presenting projects (particularly in Python).  You can include code segments and view their output directly in your browser.  You can also add rich text, visualisations, equations and more.

- Unlike Grok (from COMP10001), Jupyter lets you run your code one cell at a time, so you do not have to run all of your code at once to see some output.

## Cells
Jupyter notebooks contain two main types of cells:
- Markdown cells: These can be used to contain text, equations and other non-code items.  The cell that you're reading right now is a markdown cell (double click on this cell to edit it). You can use [Markdown](https://www.markdownguide.org/) to format your text. If you prefer, you can also format your text using <b>HTML</b>.  Closing the editor (top right of this cell when in edit mode, or pressing `shift + enter`) will format and display your text. 
- Code cells: These contain code segments that can be executed individually. When executed, the output of the code will be displayed below the code cell. Press `shift + enter` to run the cell. Note that you could also use `print()` to show output.

Try to understand the output of the following cell:

In [5]:
x = 2
y = 3
z = x + y
concat = 'z = ' + str(z) + '.00'
print(concat == f'z = {z:.2f}') # if you do not remember string formatting, look up f-strings
concat

True


'z = 5.00'

Note that if you swap the last two lines, the string will not be displayed.

You can select a cell by clicking it. The buttons `+Code` and `+Text` will add the corresponding cell below the selected cell. Check out the `Commands` menu to see what other commands are available.

# Python review

This review assumes that you know basic Python well. The topics reviewed here are more of an intermediate level. Here is a list of the topics reviewed:

- lists, sets, dictionaries
- comprehensions
- `zip()`
- `numpy` and vectorisation
- default and keyword arguments
- lambda functions
- `.strip()`, `.split()`, and `.join()`
- Reading and writing text files

Feel free to look up online anything that is not explained in this notebook. The only aspects explained in detail here are the ones that are likely to reappear in this subject.

# Data structures and iteration

## Lists
A list is a sequence of objects. Adding an item to the end of a list can be done using the `.append()` method. The following list contains the first 1000 perfect squares.

In [6]:
squares_1k = []
for x in range(1, 1001):
    squares_1k.append(x*x)

We also review how slicing works. The syntax `lst[a:b:x]` refers to a subsequence of the list, starting from index `a` (inclusive), ending on index `b` (exclusive), where every `x`-th item is selected. Default values are assumed if any of these are omitted.

In [7]:
print(squares_1k[:10]) # first ten squares
print(squares_1k[-10:]) # last ten squares
print(squares_1k[1:21:2]) # first ten even squares

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
[982081, 984064, 986049, 988036, 990025, 992016, 994009, 996004, 998001, 1000000]
[4, 16, 36, 64, 100, 144, 196, 256, 324, 400]
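The default values mentioned above can be seen in a small sketch. The list `letters` is a made-up example used only for illustration; it is not part of the notebook's data.

```python
# Hypothetical example list, used only to illustrate slice defaults.
letters = ['a', 'b', 'c', 'd', 'e', 'f']

every_second = letters[::2]    # start and stop omitted: whole list, every 2nd item
from_index_2 = letters[2:]     # stop and step omitted: runs to the end of the list
reversed_copy = letters[::-1]  # negative step: walks the list backwards

print(every_second)   # ['a', 'c', 'e']
print(from_index_2)   # ['c', 'd', 'e', 'f']
print(reversed_copy)  # ['f', 'e', 'd', 'c', 'b', 'a']
```

Slicing always returns a new list, so `letters` itself is unchanged.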


## Sets

A set is a collection of *unique* objects. Note that there is no notion of a sequence here.

Suppose we want to build a set containing all possible digit sums of the first 1000 perfect squares. We first write a function that calculates the sum of digits of a number.

<blockquote style="padding: 10px; background-color: #24292E;">

## Exercise

1. Write a function `sum_digits(num)` that calculates the sum of digits of a positive integer `num`.  
_Hint: You would need to perform some type conversion._

In [8]:
def sum_digits(num):
    ### answer 1 here
    total = 0
    for digit in str(num):  # type conversion: int -> str, then each character -> int
        total += int(digit)
    return total

sum_digits(129123123)

24

You can create an empty set with `set()`. Note that `{}` is an empty dictionary instead. You can insert an object to a set with the `.add()` method. Trying to insert an object that the set already has will do nothing.

In [9]:
squares_dsum_set = set()
for num in squares_1k:
    squares_dsum_set.add(sum_digits(num))
squares_dsum_set

{1, 4, 7, 9, 10, 13, 16, 18, 19, 22, 25, 27, 28, 31, 34, 36, 37, 40, 43, 46}

Note that `set(iterable)` can convert `iterable` into a set, removing any duplicates. Searching for an object in a set is much faster than searching for an object in a list (constant time on average vs linear time).
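As a minimal sketch of the two points above, the following converts a list with repeats into a set and then tests membership. The list `digit_sums` is a hypothetical example, not data from this notebook.

```python
# Hypothetical data: a list of digit sums that contains repeats.
digit_sums = [7, 9, 7, 13, 9, 10]

unique_sums = set(digit_sums)  # duplicates are dropped
print(len(digit_sums), len(unique_sums))  # 6 4

# Membership tests use the `in` operator for both lists and sets;
# the set version does not need to scan every element.
print(13 in unique_sums)  # True
print(8 in unique_sums)   # False
```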

## Dictionaries

A dictionary is a lookup table that maps unique *keys* to their respective *values*. You can insert a key-value pair into a dictionary with the syntax `dictionary[key] = value`. If `key` already exists in `dictionary`, then an overwrite would occur. Like sets, any lookup done with a dictionary also takes constant time only.

In [10]:
squares_to_dsum = {} # empty dictionary
for num in squares_1k[:20]:
    squares_to_dsum[num] = sum_digits(num)
squares_to_dsum

{1: 1,
 4: 4,
 9: 9,
 16: 7,
 25: 7,
 36: 9,
 49: 13,
 64: 10,
 81: 9,
 100: 1,
 121: 4,
 144: 9,
 169: 16,
 196: 16,
 225: 9,
 256: 13,
 289: 19,
 324: 9,
 361: 10,
 400: 4}

We can use `.keys()` to iterate through the keys of a dictionary, and `.values()` to iterate through the values of a dictionary. Using a for loop to directly iterate through a dictionary (`for x in dictionary`) will iterate through its keys.
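A short sketch of these iteration styles, together with the overwrite behaviour described earlier. The dictionary `dsum` is a small hypothetical example.

```python
# Hypothetical dictionary mapping a few squares to their digit sums.
dsum = {16: 7, 25: 7, 49: 13}

keys_direct = [k for k in dsum]  # iterating over the dict itself yields its keys
keys_method = list(dsum.keys())  # the same keys, via .keys()
values = list(dsum.values())     # the values, in insertion order

print(keys_direct)  # [16, 25, 49]
print(keys_method)  # [16, 25, 49]
print(values)       # [7, 7, 13]

dsum[16] = 0        # assigning to an existing key overwrites its value
print(dsum[16])     # 0
print(len(dsum))    # 3 (no new key was added)
```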

However, the most common way to iterate through a dictionary is to use `.items()`, which goes through tuples, each containing a key and its corresponding value. The idiomatic approach is to unpack the tuple like so:

In [11]:
for key, value in squares_to_dsum.items():
    print(f'The digits of {key} sum to {value}.')

The digits of 1 sum to 1.
The digits of 4 sum to 4.
The digits of 9 sum to 9.
The digits of 16 sum to 7.
The digits of 25 sum to 7.
The digits of 36 sum to 9.
The digits of 49 sum to 13.
The digits of 64 sum to 10.
The digits of 81 sum to 9.
The digits of 100 sum to 1.
The digits of 121 sum to 4.
The digits of 144 sum to 9.
The digits of 169 sum to 16.
The digits of 196 sum to 16.
The digits of 225 sum to 9.
The digits of 256 sum to 13.
The digits of 289 sum to 19.
The digits of 324 sum to 9.
The digits of 361 sum to 10.
The digits of 400 sum to 4.


## List/Set/Dictionary Comprehensions

When constructing `squares_1k`, `squares_dsum_set`, and `squares_to_dsum` above, we added each item explicitly within a for loop. There is a more convenient way to construct these data structures using *comprehensions*. The basic syntax is `expr_in_terms_of_x for x in iterable`. We reconstruct these three pieces of data using comprehensions:

In [12]:
# list comprehension
squares_1k = [x*x for x in range(1, 1001)]
# set comprehension
squares_dsum_set = {sum_digits(num) for num in squares_1k}
# dictionary comprehension
squares_to_dsum = {num: sum_digits(num) for num in squares_1k[:20]}

If you are unsure what these lines of code mean, compare them with the initial construction of `squares_1k`, `squares_dsum_set`, and `squares_to_dsum`. The results are the same, but written in a much simpler way.

An extension is to incorporate the use of `if` to filter some iterates in a comprehension. Suppose we wanted to get the list of squares that are not divisible by 3:

In [13]:
squares_nodiv3 = []
for x in range(1, 1001):
    if x % 3 != 0:
        squares_nodiv3.append(x*x)
squares_nodiv3[:10]

[1, 4, 16, 25, 49, 64, 100, 121, 169, 196]

This can also be achieved by a one-line comprehension:

In [14]:
squares_nodiv3 = [x*x for x in range(1, 1001) if x % 3 != 0]
squares_nodiv3[:10]

[1, 4, 16, 25, 49, 64, 100, 121, 169, 196]

A very similar syntax can be used with the `sum()`, `min()`, and `max()` functions, without having to wrap the argument in another data structure. For example, the following counts how many of the first 1000 perfect squares contain the digit 0:


In [16]:
sum(1 for square in squares_1k if '0' in str(square))

416
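The same pattern works for `min()` and `max()`. The sketch below uses a hypothetical list `words` to keep the arithmetic easy to check by hand.

```python
# Hypothetical word list, used only for illustration.
words = ['kiwi', 'banana', 'fig', 'cherry']

longest_len = max(len(w) for w in words)                  # largest word length
shortest_with_i = min(len(w) for w in words if 'i' in w)  # only words containing 'i'
total_chars = sum(len(w) for w in words)                  # all lengths added up

print(longest_len)      # 6
print(shortest_with_i)  # 3
print(total_chars)      # 19
```

Note that no square brackets are needed inside the function call: the expression is evaluated one item at a time rather than being stored in a list first.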

<blockquote style="padding: 10px; background-color: #24292E;">

## Exercise

2. Use a list comprehension to construct a list of all perfect *cubes* that start with the digit 1, up to one million (which is 100^3).

3. Rewrite the `sum_digits` function from Exercise 1 without an explicit for loop.

In [17]:
str(1002)[0]

'1'

In [18]:
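### answer 2 here
# keep a cube only if the cube itself (not its root) starts with the digit 1
perfect_cubes = [val**3 for val in range(1, 101) if str(val**3)[0] == '1']
print(perfect_cubes)

[1, 125, 1000, 1331, 1728, 10648, 12167, 13824, 15625, 17576, 19683, 103823, 110592, 117649, 125000, 132651, 140608, 148877, 157464, 166375, 175616, 185193, 195112, 1000000]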


In [19]:
def sum_digits(num):
    ### answer 3 here, should only require one line
    x = sum(int(val) for val in str(num))
    return x

## Iteration with multiple sequences

Sometimes we may have multiple sequences with corresponding entries to iterate through together. Suppose we have three lists, consisting of names, heights, and ages.

In [20]:
names = ['Alec', 'Bindu', 'Clair', 'Dylan', 'Elliot']
heights = [1.72, 1.67, 1.58, 1.63, 1.75]
ages = [29, 32, 21, 35, 24]

for i in range(len(names)):
    print(f'{names[i]} is {ages[i]} years old and {heights[i]}m tall.')

Alec is 29 years old and 1.72m tall.
Bindu is 32 years old and 1.67m tall.
Clair is 21 years old and 1.58m tall.
Dylan is 35 years old and 1.63m tall.
Elliot is 24 years old and 1.75m tall.


There is a more Pythonic alternative to the iteration above, which is to use the `zip()` function:

In [21]:
for name, height, age in zip(names, heights, ages):
    print(f'{name} is {age} years old and {height}m tall.')

Alec is 29 years old and 1.72m tall.
Bindu is 32 years old and 1.67m tall.
Clair is 21 years old and 1.58m tall.
Dylan is 35 years old and 1.63m tall.
Elliot is 24 years old and 1.75m tall.


Using `zip()` helps make the code slightly more readable, and saves coding time when each iteration is more complex. The output of `zip()` with $k$ inputs is an iterable object, where each element is a $k$-tuple with one element from each input. This is most easily seen by converting the `zip` object into a list and printing it:

In [22]:
list(zip(names, heights, ages))

[('Alec', 1.72, 29),
 ('Bindu', 1.67, 32),
 ('Clair', 1.58, 21),
 ('Dylan', 1.63, 35),
 ('Elliot', 1.75, 24)]

Note that the syntax `for name, height, age in zip(names, heights, ages)` works because of tuple unpacking.
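Tuple unpacking can be sketched on its own, outside of a loop. The names `record` and `pairs` below are hypothetical. The last line also shows that `zip()` stops at the shortest input, which is worth keeping in mind when the sequences may differ in length.

```python
# Hypothetical 3-tuple, like one element produced by zip(names, heights, ages).
record = ('Alec', 1.72, 29)
name, height, age = record  # one variable per tuple element
print(name, height, age)    # Alec 1.72 29

# The same unpacking happens on every iteration of a for loop.
pairs = [('a', 1), ('b', 2)]
labels = [f'{letter}{number}' for letter, number in pairs]
print(labels)  # ['a1', 'b2']

# zip() stops as soon as the shortest input runs out.
short_zip = list(zip([1, 2, 3], 'ab'))
print(short_zip)  # [(1, 'a'), (2, 'b')]
```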

## NumPy

Here, we give a quick introduction to the NumPy module, the common Python package for scientific computing. The standard shorthand for NumPy in code is `np` (see import statement below). Lists in NumPy are known as *arrays*. Unlike Python lists, the objects in a NumPy array must all be of the same data type. One way to create NumPy arrays is to convert Python lists to NumPy arrays.

In [23]:
import numpy as np

arr = np.array([523, 67, 1024, 9])
arr # notice the equal width formatting

array([ 523,   67, 1024,    9])
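The single-data-type rule can be checked through the `dtype` attribute. In this sketch (the array names are hypothetical), mixing integers and a float in the input list results in an array where every entry is stored as a float.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])  # the integers are converted so all entries share one type

print(ints.dtype.kind)  # i  (an integer type; the exact width depends on the platform)
print(mixed.dtype)      # float64
print(mixed[0])         # 1.0
```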

There are also NumPy functions to create special arrays, for example:

In [24]:
print(np.zeros(5)) # array of 5 zeros
print(np.arange(37, 43)) # integers from 37 (inclusive) to 43 (exclusive)

[0. 0. 0. 0. 0.]
[37 38 39 40 41 42]


A convenient feature of NumPy is *vectorisation*. If you had two lists of integers and wanted a third list consisting of the sum of corresponding integers from the two lists, the simplest approach looks like the following:

In [25]:
lst1 = [34, 52, 46, 74, 60]
lst2 = [71, 22, 53, 21, 43]
[x + y for x, y in zip(lst1, lst2)]

[105, 74, 99, 95, 103]

On the other hand, vectorisation allows NumPy arrays to be added together directly:

In [26]:
arr1 = np.array(lst1)
arr2 = np.array(lst2)
arr1 + arr2

array([105,  74,  99,  95, 103])

Vectorisation automatically applies the appropriate operations to all elements. If an operation is applied to an array and a single value, then the single value is broadcast along the whole array:

In [27]:
print(arr1 + 100)
print(arr1 > 50)

[134 152 146 174 160]
[False  True False  True  True]


It is also possible to have NumPy arrays of higher dimensions. If you convert a list of lists of integers to a NumPy array, you will get a 2D array (each internal list is a row of the 2D array):

In [28]:
lists = [[3, 7, 12, 19, 20], [2, 6, 8, 11, 17], [1, 4, 9, 15, 18], [5, 10, 13, 14, 16]]
arr2d = np.array(lists)
print(arr2d.shape) # (number of rows, number of columns)
print(arr2d.size) # number of entries
arr2d

(4, 5)
20


array([[ 3,  7, 12, 19, 20],
       [ 2,  6,  8, 11, 17],
       [ 1,  4,  9, 15, 18],
       [ 5, 10, 13, 14, 16]])

Axis 0 runs down the rows, whereas axis 1 runs across the columns. This distinction is important for functions that take an `axis` argument, such as `np.mean()`.

In [29]:
print(np.mean(arr2d)) # average of all 20 values
print(np.mean(arr2d, axis=0)) # average along axis 0, down the rows (one average per column)
print(np.mean(arr2d, axis=1)) # average along axis 1, across the columns (one average per row)

10.5
[ 2.75  6.75 10.5  14.75 17.75]
[12.2  8.8  9.4 11.6]
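One way to remember the `axis` argument is that the named axis is the one that disappears from the shape. The sketch below uses a small hypothetical array `grid` and `.sum()` so that the results are easy to verify by hand.

```python
import numpy as np

# Hypothetical 2D array with 2 rows and 3 columns.
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

col_totals = grid.sum(axis=0)  # collapses the rows: one total per column
row_totals = grid.sum(axis=1)  # collapses the columns: one total per row

print(grid.shape)                          # (2, 3)
print(col_totals.tolist(), col_totals.shape)  # [5, 7, 9] (3,)
print(row_totals.tolist(), row_totals.shape)  # [6, 15] (2,)
```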


Indexing for Python lists also applies to NumPy arrays. However, NumPy allows even more flexible indexing. Examples:

In [30]:
print(arr1[[0, 2, 4]]) # can use a list for indexing
print(arr1[arr1 > 50]) # can use a boolean array for indexing
print(arr2d[1:,2:]) # 2D indexing does not require two pairs of square brackets

[34 46 60]
[52 74 60]
[[ 8 11 17]
 [ 9 15 18]
 [13 14 16]]


The last example is a little complicated: it selects the sub-array starting from row 1, column 2, going both downwards and to the right.
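To make 2D indexing easier to follow, here is a sketch on a smaller hypothetical array `grid`, where the row index comes first and the column index second. The results are converted with `.tolist()` purely to keep the printed output simple.

```python
import numpy as np

# Hypothetical 3x4 array holding the integers 0 to 11, row by row.
grid = np.arange(12).reshape(3, 4)

first_row = grid[0]      # a single index selects a whole row
second_col = grid[:, 1]  # ':' keeps every row, 1 picks the second column
single = grid[2, 3]      # row 2, column 3
block = grid[1:, 2:]     # from row 1 downwards, from column 2 to the right

print(first_row.tolist())   # [0, 1, 2, 3]
print(second_col.tolist())  # [1, 5, 9]
print(single)               # 11
print(block.tolist())       # [[6, 7], [10, 11]]
```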

In Week 1's workshop, you will be introduced to the Pandas module, which is heavily used in this subject whenever we work with data. Its representation of tabular data is built upon NumPy, so the concepts surrounding vectorisation and 2D arrays here also apply for Pandas. Indexing will be covered in more detail, so do not worry about it for now.

In this subject, we also use NumPy for some mathematical functions, e.g. logarithms and square roots. NumPy has plenty of other useful functions, so be sure to look it up if you find yourself trying to implement some tedious array or mathematical operation.

In [43]:
print(np.log2(64)) # logarithm with base 2
print(np.log(64)) # natural logarithm (base e)
print(np.sqrt(64))

6.0
4.1588830833596715
8.0


<blockquote style="padding: 10px; background-color: #24292E;">

## Exercise

4. Use NumPy to find the sum of the first 100 perfect cubes.

In [44]:
### answer 4 here
np.sum(np.arange(1, 101) ** 3) # cube the integers 1 to 100, then add them up

25502500

# String methods

A quick review of some helpful string methods:
- `str.strip()`: Returns a string where whitespace (space, newline, tab, etc.) is removed from the start and end of `str`.
- `str.split(sep)`: Returns a list of strings created by splitting up the original string by the string `sep`. If `sep` is not provided, then the original string is split by any whitespace.
- `str.join(list of strings)`: Returns a string made by joining the list of strings together, where the strings are separated by `str`. You can think of this as the inverse of `.split()`.

In [45]:
ori_text = """Alec ,  Bindu, 
   Clair  ,Dylan,  Elliot 
   """

name_list = []
for token in ori_text.split(','):
    print(token) # to show that the split strings are messy with whitespace
    name_list.append(token.strip())
print(name_list) # shows that whitespace has been removed

cleaned_text = ', '.join(name_list) # put back together into one string
print(cleaned_text)

Alec 
  Bindu
 
   Clair  
Dylan
  Elliot 
   
['Alec', 'Bindu', 'Clair', 'Dylan', 'Elliot']
Alec, Bindu, Clair, Dylan, Elliot
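The behaviour of `.split()` without a separator, and the idea that `.join()` undoes `.split()`, can be sketched as follows. The string `messy` is a hypothetical example.

```python
# Hypothetical string with spaces, a tab, and a newline between the names.
messy = '  Alec \t Bindu\nClair  '

tokens = messy.split()  # no separator: split on any run of whitespace
print(tokens)           # ['Alec', 'Bindu', 'Clair']

stripped = messy.strip()  # only the two ends are cleaned
print(repr(stripped))     # 'Alec \t Bindu\nClair'

rejoined = '-'.join(tokens)
print(rejoined)                       # Alec-Bindu-Clair
print(rejoined.split('-') == tokens)  # True

# With an explicit separator, consecutive separators give empty strings.
print('a,,b'.split(','))  # ['a', '', 'b']
```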


# File I/O

On Ed, the 'Files' panel on the left displays the files available. However, the default view only allows Jupyter notebooks to be opened. To access the other files, you have to switch to 'Full View' instead (top right menu under the three dots).

In Python, the preferred way to open files is to use `with open(...) as fp:`, which avoids the need to explicitly write code to close the file. The `open(filename, mode='r')` function must be called with a filename argument, and optionally takes another argument for the mode:
- 'r' (default) opens a file for reading
- 'w' opens a file for writing (any existing content is overwritten)
- 'a' opens a file for appending

After obtaining a file pointer `fp`, the following methods are possible in read mode:
- `fp.read()`: Returns all content as a single string.
- `fp.readlines()`: Returns a list of strings, one string for each line. These strings would end with `'\n'`.
- `fp.readline()`: Returns the next line as a string.

If the file is opened in write/append mode, then writing/appending is done with `fp.write()`. Unlike `print()`, `fp.write()` does not automatically end with a newline.
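The modes and methods above can be tried out safely with a throwaway file. This is only a sketch: the file name `demo.txt` is hypothetical, and it is created inside a temporary directory so that nothing in your workspace is overwritten.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.txt')  # hypothetical file name

    with open(path, 'w') as fp:
        fp.write('first line\n')   # the newline must be written explicitly
        fp.write('second line\n')

    with open(path, 'a') as fp:
        fp.write('third line\n')   # append mode keeps the existing content

    with open(path) as fp:         # mode defaults to 'r'
        first = fp.readline()      # the next (here: first) line
        rest = fp.readlines()      # the remaining lines, as a list

    with open(path) as fp:
        whole = fp.read()          # everything as a single string

print(repr(first))  # 'first line\n'
print(rest)         # ['second line\n', 'third line\n']
print(repr(whole))  # 'first line\nsecond line\nthird line\n'
```

Note that `readlines()` only returns the lines that have not been read yet, which is why `rest` does not include the first line.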

In [46]:
with open('iris.csv', 'r') as fp: # the 'r' here is not needed
    print(fp.readline()) # print out the first line of the file

sepal_length,sepal_width,petal_length,petal_width,species



<blockquote style="padding: 10px; background-color: #24292E;">

## Exercise

5. Read in the comma-separated values file `iris.csv`, then use `.split()` and `.join()` to rewrite the file such that the values are separated by tabs (`'\t'`) instead of commas. Write the new contents into the file `iris.tsv`.

In [58]:
### answer 5 here
with open('iris.csv') as fin, open('iris.tsv', 'w') as fout:
    # splitting on ',' leaves the newlines inside the pieces, so the rows are preserved
    fout.write('\t'.join(fin.read().split(',')))

# Functions

## Function arguments
In this subject, we will often call functions from other modules, which may offer a large range of flexibility in the form of having many possible function arguments. Because of this, many functions have *default arguments*, which are arguments that assume a default value. The default value is only used if another value is not provided when the function is called.

As an example, the `mode` argument in `open()` has the default value `'r'`. If no mode is provided when `open()` is used, then the file will be opened in read mode by default. It is optional to provide a value for default arguments when calling a function. You can also write functions with default arguments (which is where the default value is specified).

In [59]:
def print_info(name, height=None, age=None):
    if height is None:
        height_str = 'unknown'
    else:
        height_str = str(height)+'m'

    if age is None:
        age_str = 'unknown'
    else:
        age_str = str(age)+' years'

    print(f'{name}: height {height_str}, age {age_str}')

For the function above, `name` must be provided whenever `print_info()` is called, but `height` and `age` are optional.

In [60]:
print_info('Alec')

Alec: height unknown, age unknown


If you choose to specify values for default arguments, it is common to include their keywords (think of keyword as the argument name). This is because some functions may have many default arguments, and we may only want to overwrite a handful of those values.

In [61]:
print_info('Alec', height=1.72, age=29) # specify height and age
print_info('Bindu', height=1.67) # specify height only
print_info('Clair', age=21) # specify age only
print_info('Dylan', 1.63) # specify height only, not as common without keyword
print_info('Elliot', age=24, height=1.75) # note order does not matter if keywords are specified

Alec: height 1.72m, age 29 years
Bindu: height 1.67m, age unknown
Clair: height unknown, age 21 years
Dylan: height 1.63m, age unknown
Elliot: height 1.75m, age 24 years


## Lambda functions

It is possible for function arguments to be functions themselves. For example, `sorted()` can take a `key` argument that is a function, which is applied to each item before the items are compared. Suppose we want to sort the first 1000 perfect squares by their digit sum:

In [39]:
squares_sorted = sorted(squares_1k, key=sum_digits)
squares_sorted[:20]

[1,
 100,
 10000,
 1000000,
 4,
 121,
 400,
 10201,
 12100,
 40000,
 16,
 25,
 1024,
 1600,
 2401,
 2500,
 22201,
 102400,
 160000,
 240100]

An alternative is to use a *lambda function*, rather than defining a function via `def sum_digits(num)` etc. This is useful if `sum_digits` has not already been defined.

In [40]:
squares_sorted = sorted(squares_1k, key=lambda x: sum(int(d) for d in str(x)))
squares_sorted[:20]

[1,
 100,
 10000,
 1000000,
 4,
 121,
 400,
 10201,
 12100,
 40000,
 16,
 25,
 1024,
 1600,
 2401,
 2500,
 22201,
 102400,
 160000,
 240100]

Note that the functionality of `lambda x: sum(int(d) for d in str(x))` is equivalent to the implementation of `sum_digits` (see Exercise 3). In general, the expression `lambda args: expr` is equivalent to

```
def func(args):
    return expr
```

except that the name `func` does not exist. For this reason lambda functions are also known as *anonymous* functions. If lambda expressions are assigned to a variable, then the variable can be used as a function:

In [41]:
add = lambda x, y: x+y # note lambda functions can take multiple arguments
add(2, 3)

5

Lambda functions can simply be used as a shorthand. They will reappear when we meet functions from the Pandas module that can take in other functions as arguments.
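The `key` argument is not specific to `sorted()`: the built-in `max()` and `min()` functions accept it too. The following sketch uses a hypothetical list `words` and two small lambda functions.

```python
# Hypothetical word list, used only for illustration.
words = ['pear', 'fig', 'banana']

by_last_letter = sorted(words, key=lambda w: w[-1])  # compare the last letters only
longest = max(words, key=lambda w: len(w))           # the item with the largest key
shortest = min(words, key=lambda w: len(w))          # the item with the smallest key

print(by_last_letter)  # ['banana', 'fig', 'pear']
print(longest)         # banana
print(shortest)        # fig
```

In each case the original items are returned, not the values computed by the key function.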

<blockquote style="padding: 10px; background-color: #24292E;">

## Exercise

6. Use `zip()` and a dictionary comprehension to create a dictionary that maps the following names to the corresponding height.

7. Sort the list of names according to their corresponding heights, making use of a lambda function.  
_Hint: You can use_ `height_dict` _from Exercise 6 in your lambda function._

In [73]:
names = ['Alec', 'Bindu', 'Clair', 'Dylan', 'Elliot']
heights = [1.72, 1.67, 1.58, 1.63, 1.75]
### answer 6 here
height_dict = {name: height for name, height in zip(names, heights)}
height_dict

{'Alec': 1.72, 'Bindu': 1.67, 'Clair': 1.58, 'Dylan': 1.63, 'Elliot': 1.75}

In [90]:
### answer 7 here
sorted(names, key=lambda name: height_dict[name]) # sort the names, comparing by height

['Clair', 'Dylan', 'Bindu', 'Alec', 'Elliot']