# Introduction to Python for data manipulation

Students have already been introduced to Python, but a refresher is advisable, at least on the parts that are most relevant to machine learning.

Python already comes with many handy features for working with data. It has a rich set of

- data types: int, float, bool, str, ...
- datetime and string handling
- control structures: if, while, for, ...
- data structures: tuples, lists, arrays, dictionaries, ...
- modularity: classes, functions, modules, libraries (external modules), ...
- support for functional programming: map, filter, lambda (anonymous) functions, ...

as well as some useful idioms, including

- list comprehensions
- list unpacking

In this workbook, we will review some of these features.

In [None]:
# First, some imports.
# Like C, the core python runtime is kept relatively small by including only essential items.
# However, for any practical use, some module imports are needed.
import math
import os

## Python variables have a type.

Containers such as lists can hold values of mixed types, and the meaning of an operator depends on the types of its operands.

Sometimes explicit type conversion is needed.

In [None]:
a = "Monday"
# check type of a
print("type of a is {}".format(type(a)))

b = 2
# check type of b
print("type of b is {}".format(type(b)))

# Note that '+' between strings is equivalent to concatenation
print("a + a is {}".format(a + a)) 

# Note that 'b' needs explicit conversion to string so it can be concatenated with 'a'.
print("a + b is {}".format(a + str(b))) 

# Note that multiplying a string by an integer replicates that string
print("a * b is {}".format(a * b)) 
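
The two points above (mixed-type containers and explicit conversion) can be sketched as follows. The names `mixed`, `count_text` and `day` are illustrative only and are not used elsewhere in this workbook.

```python
# A list may hold values of different types (illustrative values).
mixed = [1, 2.5, "three", True]
type_names = [type(item).__name__ for item in mixed]
print(type_names)  # ['int', 'float', 'str', 'bool']

# Explicit conversion from a string to a number
count_text = "42"
count = int(count_text)
print(count + 1)  # 43

# Using '+' between a str and an int without conversion raises a TypeError
day = "Monday"
try:
    day + 2
except TypeError as err:
    error_message = str(err)
    print("TypeError: {}".format(error_message))
```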

## Python can be very flexible

Writing Python functions is easy. Indeed, functions are recommended as a way to avoid duplicating code across cells in a notebook, which can lead to subtle and hard-to-find errors.

The basic idea is to use `def` to define the function, and then to call it in that or a later code cell. Here is an example.

In [None]:
# See https://realpython.com/python-kwargs-and-args/

# add 2 numbers. If the second number is not given, it defaults to 1
def add2(a,b=1):
  return a+b

# add any number of numbers
def addN(*args):
  total = 0
  # Iterate over the Python args tuple
  for x in args:
    total += x
  return total

# create a dict from key,value pairs (python provides a constructor so this is not needed, just for exposition!)
def createDict(**kwargs):
  aDict = {}
  # iterate over the keys of the kwargs dict
  for key in kwargs:
    val = kwargs[key]
    aDict[key] = val
  return aDict

print("add2(3,4) = {}".format(add2(3,4)))
print("add2(3) = {}, should be 4".format(add2(3)))

print("addN(3,4,5,6) = {}, should be 18".format(addN(3,4,5,6)))

print("createDict(pi=3.14,e=2.72) = {}".format(createDict(pi=3.14,e=2.72)))
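
The unpacking idiom listed in the introduction works in the opposite direction to `*args` and `**kwargs`: it spreads an existing collection out into separate names or arguments. The following is a minimal sketch; `add_all` is a hypothetical helper, similar in spirit to `addN` above.

```python
# Unpack a tuple into separate names
point = (3, 4)
x, y = point

# Extended unpacking: the starred name collects the remaining items in a list
first, *rest = [1, 2, 3, 4]

# Hypothetical helper that accepts any number of positional arguments
def add_all(*args):
    return sum(args)

numbers = [3, 4, 5, 6]
total = add_all(*numbers)  # same as add_all(3, 4, 5, 6)

# '**' unpacks a dict into keyword arguments
constants = {"pi": 3.14, "e": 2.72}
copied = dict(**constants)

print(x, y, first, rest, total, copied)
```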

# String handling

Strings are sequences of characters, so, like lists, they can be indexed and _sliced_, which makes extracting parts of a string very convenient.

As with lists, string indexes start at 0, and a slice includes its lower bound and excludes its upper bound.

The `.split()` method, when called on a string, returns a list of its parts, which provides a handy way to select parts of a string.

In [None]:
alphabet = 'abcdefghijklmnopqrstuvwxyz'
print('alphabet[0:3] = {} = alphabet[:3] = {} = abc'.format(alphabet[0:3],alphabet[:3]))
print('alphabet[-3:] = {} = xyz'.format(alphabet[-3:]))

stmt = 'The cat sat on the mat'
print('second word (cat) is stmt.split(\' \')[1] = {}'.format(stmt.split(' ')[1]))
print('last word (mat) is stmt.split(\' \')[-1] = {}'.format(stmt.split(' ')[-1]))
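
A few further slicing and splitting variations are sketched below. The sentence with a doubled space is an illustrative input chosen to show how `split(' ')` and `split()` differ.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

# A third slice value gives the step
every_fifth = alphabet[::5]      # 'afkpuz'
# A negative step walks backwards
first_three_reversed = alphabet[2::-1]  # 'cba'

# Note the two spaces after 'The'
spaced = 'The  cat sat on the mat'
by_single_space = spaced.split(' ')  # yields an empty string between the two spaces
by_whitespace = spaced.split()       # splits on any run of whitespace

# join() is the inverse of split()
rejoined = '-'.join(by_whitespace)

print(every_fifth, first_three_reversed)
print(by_single_space)
print(by_whitespace)
print(rejoined)
```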

# Dates and times

When working with data, we frequently need to use date-based criteria, e.g., to select between two datetimes.

Python offers the `datetime` and `time` modules to facilitate this.

In [None]:
import datetime as dt
import time as tm

`time.time()` (called below as `tm.time()`) returns the current time as the number of seconds since 00:00 UTC on 1 Jan 1970 (sometimes this is called "Unix time").

In [None]:
tm.time()

We can convert the timestamp to datetime (more friendly to humans!).

In [None]:
dtNow = dt.datetime.fromtimestamp(tm.time())
dtNow

We can access the parts of the datetime directly. Note that python automatically collects these values into a tuple (an immutable collection - cannot be changed after it is created, unlike a list which is mutable).

In [None]:
dtNow.year, dtNow.month, dtNow.day, dtNow.hour, dtNow.minute, dtNow.second 

`duration` below is a `timedelta` object (in the example it represents a duration of 7 days = 1 week).

In [None]:
duration = dt.timedelta(days=7)  # duration of 7 days
duration

`date.today()` returns the current local date. We can also find the date a week ago, without needing to do complicated date arithmetic.

In [None]:
today = dt.date.today()

In [None]:
weekAgo = today - duration  # the date 7 days ago

As well as doing arithmetic with date objects, we can also compare them. The code below also shows how assertions can be added to code, which can be useful for debugging. The logic is that today should be later than a week ago, so that condition should be true. If it is false, we get an `AssertionError` with the associated text, but also a logical conundrum!

In [None]:
assert today > weekAgo, f"call Einstein: {weekAgo} is later than {today} - help!"  # compare dates
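
The date-based selection mentioned at the start of this section can be sketched with the same building blocks. The order dates below are made-up illustrative data, and the names `order_dates`, `start` and `end` are assumptions introduced for this example.

```python
import datetime as dt

# Made-up order dates for illustration
order_dates = [
    dt.date(2024, 1, 5),
    dt.date(2024, 1, 12),
    dt.date(2024, 1, 20),
    dt.date(2024, 2, 1),
]

start = dt.date(2024, 1, 10)
end = start + dt.timedelta(days=14)  # two weeks after start

# Keep only the dates with start <= date < end
in_window = []
for order_date in order_dates:
    if start <= order_date < end:
        in_window.append(order_date)
print(in_window)

# Parse a datetime from text, then format it back into a string
parsed = dt.datetime.strptime("2024-01-12 09:30", "%Y-%m-%d %H:%M")
label = parsed.strftime("%d/%m/%Y")
print(label)
```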

# map(): applying functions to iterables

Python borrows some idioms from functional programming, notably map(), which applies a function to one or more iterables (generally tuples or lists).

In the example below, we create a basket of groceries for Lidl and for Tesco, with their prices. We then compare them item by item. Note the use of an `OrderedDict`, which remembers the order in which its keys were inserted (it does not sort them). Because both baskets are filled in the same order, their values line up and we are comparing like with like. Since Python 3.7, an ordinary `dict` also preserves insertion order, so the `OrderedDict` here mainly makes the reliance on ordering explicit.

Note that we can extract the values in a dictionary or ordered dictionary `d` using `d.values()`.

In [None]:
from collections import OrderedDict

tescoPrice = OrderedDict()
tescoPrice['bread'] = 1.89
tescoPrice['milk'] = 0.89

lidlPrice = OrderedDict()
lidlPrice['bread'] = 1.79
lidlPrice['milk'] = 0.90

cheapestPriceMap = map(min, list(tescoPrice.values()), list(lidlPrice.values()))

`map()` iterates over the two lists in parallel, applying `min()` to each pair of prices to pick out the lower price for that grocery item. However, we have lost the label that says which item each price belongs to.

This is where Python's `zip()` function comes in handy, to combine the grocery item names from `tescoPrice.keys()` (or equivalently from `lidlPrice.keys()`) with the cheapest prices we found for each grocery item.

This results in a sequence of tuples, where each tuple is a pair, with the first item being the grocery item name, and the second being the cheapest price for that item.

We can then give this data structure to the `dict` constructor, to convert it into a dict, which is easier to work with.

In [None]:
groceryItems = list(tescoPrice.keys())
cheapestPrices = list(cheapestPriceMap)
cheapestPrice = dict(zip(groceryItems, cheapestPrices))
cheapestPrice
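
One detail of the steps above is worth a small sketch: in Python 3, `map()` and `zip()` return lazy iterators that can be consumed only once, which is why the results are converted with `list()` or `dict()`. The names `shop_a` and `shop_b` and their prices are illustrative stand-ins for the baskets above.

```python
items = ["bread", "milk"]
shop_a = [1.89, 0.89]
shop_b = [1.79, 0.90]

# Nothing is computed until the map object is iterated over
cheapest_map = map(min, shop_a, shop_b)
first_pass = list(cheapest_map)   # consumes the iterator
second_pass = list(cheapest_map)  # the iterator is now exhausted

# zip() pairs each item name with its cheapest price
pairs = list(zip(items, first_pass))
cheapest = dict(pairs)

print(first_pass, second_pass)
print(pairs)
print(cheapest)
```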

Apart from calculating the cheapest price for each grocery item, we might also be interested in which shop offered that price. To do this we create a new dict, where the keys are the grocery items and each value is a tuple pair, with the shop name followed by the cheapest price.

In [None]:
cheapestItemShopPrice = dict()
for groceryItem in groceryItems:
  tescoOffer = tescoPrice[groceryItem]
  lidlOffer = lidlPrice[groceryItem]
  if (cheapestPrice[groceryItem] == tescoOffer):
    cheapestItemShopPrice[groceryItem] = ("Tesco", tescoOffer)
  else:
    cheapestItemShopPrice[groceryItem] = ("Lidl", lidlOffer)
cheapestItemShopPrice
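
The `if`/`else` above works for two shops. As a hedged sketch of one way to generalise to any number of shops, the version below builds (shop, price) pairs and lets `min()` choose by price. The nested `prices` dict is a hypothetical layout, not the one used in the cells above.

```python
from operator import itemgetter

# Hypothetical layout: shop name -> {item -> price}
prices = {
    "Tesco": {"bread": 1.89, "milk": 0.89},
    "Lidl": {"bread": 1.79, "milk": 0.90},
}

cheapest_shop_price = {}
for item in ["bread", "milk"]:
    # Collect one (shop, price) pair per shop for this item
    offers = []
    for shop, shop_prices in prices.items():
        offers.append((shop, shop_prices[item]))
    # itemgetter(1) tells min() to compare the pairs by price
    cheapest_shop_price[item] = min(offers, key=itemgetter(1))

print(cheapest_shop_price)
```

If two shops tie on price, `min()` returns the first of the tied offers it encounters.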

One of the advantages of `map()` is that, because the function is applied to each item independently, the same pattern is more readily run concurrently than the equivalent explicit loops.
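
The built-in `map()` itself runs sequentially. As a minimal sketch of the concurrent form, the standard library's `concurrent.futures` executors offer a `map()` method with the same shape. The function `cheapest` and the price lists are illustrative, and for a task this small the threads would add overhead rather than speed.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative function applied to each pair of prices
def cheapest(price_a, price_b):
    return min(price_a, price_b)

shop_a = [1.89, 0.89]
shop_b = [1.79, 0.90]

# Sequential version with the built-in map()
sequential = list(map(cheapest, shop_a, shop_b))

# Same call shape, but the work is handed to a pool of threads;
# results come back in the same order as the inputs
with ThreadPoolExecutor(max_workers=2) as executor:
    threaded = list(executor.map(cheapest, shop_a, shop_b))

print(sequential, threaded)
```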


# List comprehensions

A very common scenario in preparing data for machine learning is that we have a list and we wish to filter it to remove unwanted items.

This can be done using a for loop.

However, the more idiomatic way in python is to use a _list comprehension_.

A _list comprehension_ is just another way of expressing a loop. It tends to be more terse and, when you get used to it, it can be as readable as the more traditional way of expressing loops. In some circumstances, it can run faster, because the interpreter can execute it more efficiently than the equivalent loop with repeated `append` calls. However, the main advantage is probably that it can be written as a single expression.

In the following code, we show how a list can be filtered in a loop and also using a list comprehension.

In [None]:
mixedNums = [-2, -1, 0, 1, 2]
nonneg = []
for x in mixedNums:
  if (x >= 0):
    nonneg.append(x)
nonneg

Now for the list comprehension version:

In [None]:
nonneg = [x for x in mixedNums if x >= 0]
nonneg
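
A comprehension can also transform items as it filters them, and the same syntax with braces builds a dict. A short sketch, reusing the values of `mixedNums` from above (the names `squares` and `signs` are illustrative):

```python
mixedNums = [-2, -1, 0, 1, 2]

# Filter and transform in one expression: squares of the non-negative numbers
squares = [x * x for x in mixedNums if x >= 0]

# Dict comprehension: map each number to a label
signs = {x: ("negative" if x < 0 else "non-negative") for x in mixedNums}

print(squares)
print(signs)
```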

# Lambda functions

Clearly the list comprehension version is shorter than the for loop. This will generally be true for relatively simple cases like this. Writing code on one line has other benefits too, such as when we wish to apply _lambda functions_ to data.

A lambda function is an unnamed function whose body is a single expression that is evaluated for its inputs. If the intention is to create a list, that expression could be a list comprehension (a for loop would not work as the body of a lambda function, because it is a statement rather than an expression).

In [None]:
mult2 = lambda a,b : a*b
mult2(3,4) # Should be 12

`lambda` can be used to define functions with any number of arguments, but the body must be a single expression: statements such as assignments, `for` loops or `return` are not allowed.
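
Lambda functions are most often passed directly to another function rather than given a name. The sketch below shows three typical uses; the `prices` dict is made-up illustrative data.

```python
prices = {"bread": 1.79, "milk": 0.89, "eggs": 2.49}

# As a sort key: order the (item, price) pairs by price
by_price = sorted(prices.items(), key=lambda pair: pair[1])

# As a filter predicate: keep the pairs costing less than 2
under_two = list(filter(lambda pair: pair[1] < 2, prices.items()))

# A lambda whose body is a list comprehension
keep_nonneg = lambda nums: [n for n in nums if n >= 0]
nonneg = keep_nonneg([-2, -1, 0, 1, 2])

print(by_price)
print(under_two)
print(nonneg)
```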

# Python regular expressions

Regular expressions are used to describe patterns in text (strings). Typically regexes are used to find matches, to split strings, to create new strings according to a pattern, etc. Text data plays a big role in machine learning. Sometimes the built-in string handling features are just not enough - regular expressions greatly extend what "regular" python can do with strings.

We could easily devote several weeks to regexes, but it is probably better to know the basics, and then to use stackoverflow or equivalent to deal with more complex cases.

A good starting point for understanding Python's `re` module is [this tutorial](https://docs.python.org/3/howto/regex.html).
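
As a taste of the basics, the sketch below uses four functions from the `re` module to find, capture, split and replace. The sample text and patterns are illustrative only.

```python
import re

text = "Order 66 shipped on 2024-01-12; order 7 ships 2024-02-01."

# findall(): every non-overlapping match of the pattern
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# search(): the first match, with parentheses capturing groups
match = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)
year = match.group(1)

# split(): split on a semicolon or comma, plus any following whitespace
parts = re.split(r"[;,]\s*", "bread; milk,eggs")

# sub(): replace every match with new text
masked = re.sub(r"\d{4}-\d{2}-\d{2}", "<date>", text)

print(dates)
print(year)
print(parts)
print(masked)
```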

# Exercises for students

1. Use a) a for loop and b) a list comprehension to change "The cat sat on a mat" to all uppercase.

2. Rewrite the prices example to use a list comprehension instead of `map()` to generate the list of cheapest prices.

3. Generate timings to compare the `map()`, `for` loop and list comprehension ways of comparing prices. You might find the [timeit module](https://docs.python.org/3/library/timeit.html) useful. For more realistic timings, you will need to generate much larger dictionaries of prices. For the random item names, one way is to generate them as needed with [python's uuid module](https://docs.python.org/3/library/uuid.html). For the prices, [python's random module](https://docs.python.org/3/library/random.html) should help.

4. Given the text of [President Abraham Lincoln's _Gettysburg Address_](https://github.com/timburks/gott/blob/master/test/gettysburg-address.txt), use python's text handling features to

   - Count the number of words
   - Find the number of times each word appears (be careful - the delimiters are not always whitespace!).