## Syllabus of the Practical Lectures: Machine Learning II

**Week 1: Python Review**


*   Basic Data Structures
*   Functions
*   Using Code outside of Notebooks
*   Pandas, Numpy and Matplotlib

**Week 2: Data Visualization - Part I**

*   Visualizing Data Using Matplotlib
*   Visualizing Data Using Seaborn
*   Best Practices on Visualization

**Week 3: Data Visualization - Part II**

*   Looking at Continuous Variables
*   Visualization as a Tool For Modelling
*   Exercises

**Week 4: Intro to Clustering and Grouping Data**

*   Intro to data grouping
*   RFM Analysis
*   "Matrix" Thinking and understanding cohorts


**Week 5: Clustering algorithms**

*   K-means
*   Hierarchical clustering

**Week 6: Self-organizing maps (SOMs)**

*   Definition of SOMs
*   Training a SOM
*   Applications of SOMs

**Week 7: Clustering algorithms II**

*   DBSCAN
*   Mean-shift algorithm

**Week 8: Dimensionality reduction**

*   Principal component analysis (PCA)

**Week 9: Dimensionality reduction II**

*   t-distributed Stochastic Neighbor Embedding (t-SNE)
*   Uniform Manifold Approximation and Projection (UMAP)

**Week 10: Anomaly detection**

*   Combining Clustering and Supervised Methods to detect anomalies
*   Statistical Based Methods
*   DBScan to Detect Anomalies
*   SOM Anomaly

**Week 11: Association rule learning**

*   Apriori algorithm
*   Eclat algorithm

**Week 12: Deep learning for unsupervised learning**

*   Autoencoders

**Week 13: Practical Project Help**

*   Helping students kickstart (or continue) their practical project.

# Python Review!

Throughout these notebooks, we are going to write (and call) Python code in several ways. Some of our functions will be abstracted (i.e. written) outside of the notebook, which keeps the notebook cleaner and avoids cluttering it with long functions.
<br>
<br>
We'll be using a file called `utils.py` that will contain code that we can call and execute, just like any other Python library. Sometimes we will also see functions inside notebooks, when we'll highlight important pieces of the code.
<br>
<br>
Next to this notebook, you'll have all the files needed for this lecture - if we run the code below, we'll mount our Google Drive inside our Google Colab environment. Later, we'll copy our files into our Google Colab directory.

In [12]:
# Mount Drive files
from google.colab import drive
import sys, os
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


Keep in mind that we need to give access to our folders with our Google Account. This will let us do the next step, copy the contents into our Google Colab environment:

In [13]:
# Copy contents of folder into Google Colab Environment
# Please verify these paths in your Google Drive and update if necessary
!cp -r "/content/gdrive/MyDrive/Colab Notebooks/Machine Learning II/Week 1/utils.py" /content
!cp -r "/content/gdrive/MyDrive/Colab Notebooks/Machine Learning II/Week 1/data" /content/data

cp: cannot stat '/content/gdrive/MyDrive/Colab Notebooks/Machine Learning II/Week 1/data': No such file or directory


`!cp -r` is a command-line statement. Here, we are not running Python code but a shell command, which a notebook lets us execute by prefixing it with `!`.
<br>
<br>
`cp` means `copy` and `-r` means recursively, a fancy way of saying that a folder is copied together with all of its contents.
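
For comparison, here is a minimal sketch of the same idea in pure Python, using the standard `shutil` module. The folders and files below are temporary and made up for the demonstration (they are not the course files), so this only illustrates the mechanics of copying a file versus copying a folder recursively:

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical source folder, created only for this demonstration
source_root = Path(tempfile.mkdtemp())
(source_root / "data").mkdir()
(source_root / "data" / "example.csv").write_text("a,b\n1,2\n")
(source_root / "utils.py").write_text("# placeholder module\n")

destination_root = Path(tempfile.mkdtemp())

# Similar to `cp file destination`: copy a single file
shutil.copy(source_root / "utils.py", destination_root)

# Similar to `cp -r folder destination/folder`: copy a folder and everything inside it
shutil.copytree(source_root / "data", destination_root / "data")

copied_files = sorted(path.name for path in destination_root.rglob("*"))
print(copied_files)
```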

Most Python scripts start with their library imports at the top - this lets readers of your code know what dependencies your script has.

In [14]:
import pandas as pd
import numpy as np
import math

Now comes a magic part - we'll import some functions from this magical `utils` module.
<br>
<br>
*Note: We normally say that a module is a single .py file, while a package/library is a collection of .py files that we can use. For example, the pandas library contains many .py files (https://github.com/pandas-dev/pandas/tree/main/pandas/core)*
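
To make the module/package distinction a bit more concrete, here is a small sketch using two standard-library imports chosen just for illustration: `json` (a folder of several .py files) and `string` (a single .py file). One way to tell them apart is that packages carry a `__path__` attribute, while plain modules do not:

```python
import json    # a package: a folder containing several .py files
import string  # a module: a single string.py file

# Packages expose a __path__ attribute pointing at their folder; plain modules do not
json_is_package = hasattr(json, "__path__")
string_is_package = hasattr(string, "__path__")

print(json_is_package, string_is_package)
```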

In [15]:
# Import our custom modules
from utils import (
    print_string_utils,
    extract_lengthy_capitals_utils
)

We'll have the chance to work with these functions in a bit!

# Python Basic Data Types

Let's start with defining an object:


In [16]:
number = 1

After running the code, we can call `number` throughout our code, and it will output the value we've stored in it:

In [17]:
number

1

We can also use a `print` statement to check what our object has:

In [18]:
print(number)

1


The most basic (let's call them atomic, for now) data types in Python are numbers - integers (`int`) and floating-point numbers (`float`). We can conveniently check the data type of an object by calling `type` on it:

In [19]:
type(2)

int

In [20]:
type(2.5)

float

We can also call `type` on stored variables:

In [21]:
number_1 = 4.4
type(number_1)

float

`Python` complies with the mathematical rules we know, such as the order of operations:

In [22]:
(1/2)*(32+32)

32.0

In [23]:
1/2*32+32

48.0

We can also call some neat functions, using the `math` library:

In [24]:
math.sqrt(9)

3.0

... and as these functions return numbers, we can use them in other calculations!

In [25]:
math.sqrt(9)*20

60.0

Two other important data types are `str` (strings) and `bool`:

In [26]:
type('Hi class!')

str

In [27]:
type(True)

bool

The data types define the operations and the `methods` we can call on the objects. For example, `replace` is a `str` method that we can use to replace any piece of text with another:

In [28]:
fruit = 'banana'
fruit.replace('a','u')

'bununu'

But `replace` only accepts strings as arguments - passing it numbers will not work!

In [29]:
number_fruits = 15
fruit.replace(1,2)

TypeError: replace() argument 1 must be str, not int

The `replace` method only exists on objects of the `str` type, and it only accepts `str` arguments. The same goes for other methods that only work with specific object types.

A method is called by writing a `.` after an object, followed by the method name and `()`. A `function` is called by using the name of the function followed by `()`.
<br>
<br>
Although similar (they can both return values and use parameters), `functions` are more general (they can often be applied to multiple data types) and are not attached to a particular object, as methods are in `object oriented programming`. For example, the `len` function can be applied to `str` and `list` (a data structure we will see in a minute):

In [30]:
len('banana')

6
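
As a small side-by-side sketch of the difference (the `basket` list is made up for this example), notice where the object goes in each call:

```python
fruit = 'banana'
basket = ['banana', 'apple', 'kiwi']

# `len` is a function: the object goes inside the parentheses
length_of_string = len(fruit)
length_of_list = len(basket)

# `upper` is a str method: it is called with a dot after the object
shouted_fruit = fruit.upper()

print(length_of_string, length_of_list, shouted_fruit)
```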

# Python Basic Data Structures

Some data structures in Python can hold other types of data. For example, a `list`, a very convenient object in Python, is able to hold `str`, `int`, `float`, `bool`, other `lists`, etc.

In [31]:
an_example_list = [1,2,'A','B']

We can access our lists by using indexes (calling `[]`):

In [32]:
an_example_list[0]

1

To make our life more difficult, `Python` indexes start at 0, while R indexes start at 1 😩

We can slice to retrieve multiple elements:

In [33]:
an_example_list[1:3]

[2, 'A']

Slices work as follows: `object[i:j]` means that we are selecting the elements of `object` from index `i` up to index `j-1`.
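
A few more slicing variations, sketched on a made-up `letters` list. Either side of the `:` can be left out, and negative indexes count from the end:

```python
letters = ['a', 'b', 'c', 'd', 'e']

first_two = letters[0:2]    # elements at indexes 0 and 1 (index 2 is excluded)
from_second = letters[1:]   # leaving out the end slices until the last element
up_to_third = letters[:3]   # leaving out the start slices from the first element
last_one = letters[-1]      # negative indexes count from the end

print(first_two, from_second, up_to_third, last_one)
```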

`lists` are mutable, meaning that we can change them in-place - for example, I can change the first element by indexing it and assigning it to something new:

In [34]:
an_example_list[0] = 'New element!'

In [35]:
an_example_list

['New element!', 2, 'A', 'B']

On the other hand, `str` objects are not mutable:

In [36]:
my_text = 'Europe'
my_text[5] = 'a'

TypeError: 'str' object does not support item assignment
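
Since we can't change a string in-place, one common workaround is to build a new string instead. This is just one possible sketch, combining a slice with the new letter:

```python
my_text = 'Europe'

# Build a new string from a slice of the old one plus the new letter
new_text = my_text[:5] + 'a'

print(my_text)   # the original string is untouched
print(new_text)
```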

Also, lists preserve the data types that our underlying objects have:

In [37]:
an_example_list

['New element!', 2, 'A', 'B']

In [38]:
type(an_example_list[1])

int

In [39]:
type(an_example_list[2])

str

Another important data structure is the dictionary (`dict`), which stores data as `key-value` pairs:

In [40]:
languages = {
    'SQL': 1,
    'Python': 2,
    'R': 3,
    'Java': 4,
    'Javascript': 5,
    'Julia': 6
}

Notice that I'm using this "vertical" format to add new key-value pairs. This is not mandatory but is generally considered a best practice if your line of code would otherwise go over 79 characters (the limit recommended by the PEP 8 style guide).
<br>
<br>
Google Colab even scolds us if we go over that mark, by putting a ruler on the editor:

In [41]:
languages = {'SQL': 1, 'Python': 2, 'R': 3, 'Java':4, 'Javascript': 5, 'Julia': 6}

We can access dictionaries by their key:

In [42]:
languages['SQL']

1

In [43]:
languages['Python']

2

Three important methods of dictionaries:

In [44]:
languages.items()

dict_items([('SQL', 1), ('Python', 2), ('R', 3), ('Java', 4), ('Javascript', 5), ('Julia', 6)])

In [45]:
languages.keys()

dict_keys(['SQL', 'Python', 'R', 'Java', 'Javascript', 'Julia'])

In [46]:
languages.values()

dict_values([1, 2, 3, 4, 5, 6])
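
These three methods become really useful once we combine them with other functions. A small sketch on a shortened version of the `languages` dictionary:

```python
languages = {'SQL': 1, 'Python': 2, 'R': 3}

language_names = list(languages.keys())   # keys as a plain list
total_rank = sum(languages.values())      # 1 + 2 + 3
has_python = 'Python' in languages        # membership is checked against the keys
pairs = list(languages.items())           # list of (key, value) tuples

print(language_names, total_rank, has_python, pairs)
```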

Another important data structure is the `set`, which only holds distinct values:

In [47]:
set([1,1,1,1,2])

{1, 2}
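
Sets also support operations such as intersection and union. Here is a short sketch with two made-up grocery baskets:

```python
first_basket = ['apple', 'fish', 'apple', 'cookies']
second_basket = ['fish', 'bread']

distinct_first = set(first_basket)                # duplicates are dropped
in_both = distinct_first & set(second_basket)     # intersection: items in both baskets
in_either = distinct_first | set(second_basket)   # union: items in at least one basket

print(distinct_first, in_both, in_either)
```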

# Python Control Flow

Control flow will be important to understand some of the functions we will use throughout the course. We'll mostly use `for`, `while` and `if` statements.

`for` enables us to iterate through a specific object:

In [48]:
for letter in 'Europe':
  print(letter)

E
u
r
o
p
e


In [49]:
list_integers = [2, 4, 6, 10]

for number in list_integers:
  print(number**2)

4
16
36
100


`enumerate` is also cool because it enables us to iterate through indexes and elements:

In [50]:
for index, number in enumerate(list_integers):
  print(index, number)

0 2
1 4
2 6
3 10


`if` is a statement to create conditional situations in our code:

In [51]:
var = 12

if var == 12:
  print("It's true!")
else:
  print("It's not true!")

It's true!


In [52]:
var = 15

if var == 12:
  print("It's true!")
else:
  print("It's not true!")

It's not true!


We can also use `elif` to create other conditions:

In [53]:
var = 15

if var < 12:
  print("var is less than 12.")
elif var < 15:
  print("var is less than 15.")
else:
  print("var is greater than or equal to 15!")

var is greater than or equal to 15!


`while` loops keep going as long as a certain condition is met - watch out, as this makes them really prone to infinite loops!

In [54]:
n = 1
while n <= 10:
  print('Number is {}'.format(n))
  n = n+1

  # n = n+1 or n+=1 are exactly the same code!

Number is 1
Number is 2
Number is 3
Number is 4
Number is 5
Number is 6
Number is 7
Number is 8
Number is 9
Number is 10
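
One possible way to protect ourselves against an accidental infinite loop is to add a safety counter and leave the loop with `break`. The `max_iterations` limit below is an arbitrary value picked for this sketch:

```python
n = 1
max_iterations = 1000  # safety limit chosen arbitrarily for this sketch
iterations = 0

while n <= 10:
  n = n + 1
  iterations += 1
  # If we ever forget to update n, this guard stops the loop anyway
  if iterations >= max_iterations:
    print('Stopping: too many iterations!')
    break

print(n, iterations)
```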


In Python, indentation controls the code! An indented block belongs to the statement (`for`, `if`, `while`, ...) that sits one level above it.
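
A tiny sketch of how indentation changes what runs inside and outside a loop (the list names are made up for this example):

```python
numbers = [1, 2, 3]
inside_loop = []
after_loop = []

for number in numbers:
  inside_loop.append(number * 2)   # indented: runs once per element

after_loop.append('done')          # not indented: runs once, after the loop ends

print(inside_loop, after_loop)
```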

# Python Functions

Python functions are a centerpiece of the Python language. They let us create replicable and reusable code that can be run with different parameters. For example, if we want to create a statement such as `The square root of 4 is 2` for any number:

In [55]:
def print_square_root(number):
  statement = "The square root of {} is {}".format(number, math.sqrt(number))
  return statement

In [56]:
print_square_root(4)

'The square root of 4 is 2.0'

In [57]:
print_square_root(16)

'The square root of 16 is 4.0'

In [58]:
print_square_root(25)

'The square root of 25 is 5.0'

It's important that our functions have a `return` statement - otherwise, we will be creating functions that are not able to "speak" with the outside world. For example, if I store the value returned by the `print_square_root` function:

In [None]:
return_sentence = print_square_root(25)

I can view the `return_sentence` because I've plucked it from the function using the `return` statement:

In [None]:
return_sentence

But, if I use a `print`, instead of a `return`:

In [None]:
def print_square_root_print(number):
  statement = "The square root of {} is {}".format(number, math.sqrt(number))
  print(statement)

In [None]:
return_sentence_print = print_square_root_print(25)

In [None]:
return_sentence_print

The variable `return_sentence_print` does not contain anything (it is `None`) because I didn't pluck anything from the function using `return`. Also, functions have their own scope, meaning that everything that we've calculated inside the function does not exist outside of it (unless stated in the `return` statement or using the `global` keyword):

In [None]:
statement

Although `statement` was declared inside the function, Python does not know what it is outside of it and raises a `NameError`.
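
If we want to see this behaviour without stopping the notebook with an error, one option is to wrap the lookup in a `try`/`except` block. The function name `square_root_sentence` is made up for this sketch, which also assumes that no variable called `statement` exists outside of the function:

```python
import math

def square_root_sentence(number):
  # `statement` only exists while the function is running
  statement = "The square root of {} is {}".format(number, math.sqrt(number))
  return statement

returned_value = square_root_sentence(25)

try:
  statement  # not defined in this (outer) scope
  lookup_result = 'found'
except NameError as error:
  lookup_result = 'NameError: {}'.format(error)

print(returned_value)
print(lookup_result)
```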

During this notebook, we'll also use functions that are outside of the notebook. For example, the `print_string_utils` function just prints a string:

In [None]:
print_string_utils("Hi class, I'm outside of the notebook!")

Any function outside of a notebook can be called by its name once it has been imported. For example, we are using `from utils import print_string_utils` to import the `print_string_utils` function from the `utils.py` file. Notebooks are great for teaching but not so great for proper software development - keeping code in files outside of the notebook is a common pattern in software engineering.

Anatomy of a good Python function:
*  Has a docstring
*  Arguments and return values are explained
*  Has type hinting
*  Only does one thing
*  Is shorter than 10 to 15 lines of code.

For example - take this function that extracts all the country capitals that have more than 5 letters:

In [None]:
def extract_lengthy_capitals(list_capitals: list) -> list:
  '''
  Skims through a list of capitals and extracts
  all capitals that have more than 5 letters by
  returning a new list.

  Arguments:
  - list_capitals(list): List of capitals to skim
  through.

  Returns:
  - lengthy_capitals(list): List of capitals with
  more than 5 letters.
  '''
  lengthy_capitals = []

  for capital in list_capitals:
    if len(capital) > 5:
      lengthy_capitals.append(capital)

  return lengthy_capitals

In [None]:
extract_lengthy_capitals(['Paris','Lisbon','Oslo','Riga'])

In [None]:
extract_lengthy_capitals(['Paris','Lisbon','Oslo','Riga','Madrid'])

The function works as we expect and contains a lot of elements that are very cool:

![discuss](https://cdn-icons-png.flaticon.com/512/1189/1189168.png)

*Note: When we see the image above, we'll pause during the classes to discuss a bit of the topics we've been approaching!*
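
As a side note, the same filter could also be sketched with a list comprehension, a compact way of writing a `for` loop with an `if` in a single line. The function name `extract_lengthy_capitals_comprehension` is made up for this example, and this is just an alternative style, not a replacement for the version above:

```python
def extract_lengthy_capitals_comprehension(list_capitals: list) -> list:
  '''
  Hypothetical variant of extract_lengthy_capitals
  that uses a list comprehension.

  Arguments:
  - list_capitals(list): List of capitals to skim
  through.

  Returns:
  - (list): List of capitals with more than 5 letters.
  '''
  return [capital for capital in list_capitals if len(capital) > 5]

lengthy = extract_lengthy_capitals_comprehension(
    ['Paris', 'Lisbon', 'Oslo', 'Riga', 'Madrid']
)
print(lengthy)
```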

Of course, we can also have these beautiful functions outside of our notebook:

In [None]:
extract_lengthy_capitals_utils(['Paris','Lisbon'])

# Numpy and Pandas

Two of the most important libraries for machine learning and data science are `NumPy` and `pandas` - let's start by working with `NumPy`, which we've imported with the `np` alias:

In [None]:
my_array = np.array(1)

Above, we've defined a *scalar*: a single value stored in an array with zero dimensions. We can add a dimension (and more elements) to the array by providing a list inside `np.array`:

In [None]:
my_array_multiple_elements = np.array([1,2])

We can see the `shape` of an array, using the `.shape` attribute:

In [None]:
my_array_multiple_elements.shape

Defining a multidimensional array is easy:

In [None]:
my_array_multiple_dimensions = np.array([[1,2],[1,2]])

In [None]:
my_array_multiple_dimensions.shape
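
To compare the three arrays side by side, here is a short sketch that checks both the number of dimensions (`.ndim`) and the `.shape` of each one. The variable names are shortened versions of the ones used above:

```python
import numpy as np

scalar_array = np.array(1)                  # a single value, no dimensions
vector_array = np.array([1, 2])             # one dimension with two elements
matrix_array = np.array([[1, 2], [1, 2]])   # two dimensions: two rows, two columns

print(scalar_array.ndim, scalar_array.shape)
print(vector_array.ndim, vector_array.shape)
print(matrix_array.ndim, matrix_array.shape)
```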

We can apply calculations directly to whole arrays - although the functions for that come from the `np` library, instead of `math`:

In [None]:
np.sqrt(my_array_multiple_dimensions)
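
To see why we reach for `np` instead of `math` here, this sketch applies both square root functions to a small made-up array. `np.sqrt` works element by element, while `math.sqrt` expects a single number:

```python
import math
import numpy as np

values = np.array([[1, 4], [9, 16]])

# np.sqrt is applied element by element and returns a new array
element_wise_roots = np.sqrt(values)

# math.sqrt expects a single number, so a multi-element array raises a TypeError
try:
  math.sqrt(values)
  math_sqrt_worked = True
except TypeError:
  math_sqrt_worked = False

print(element_wise_roots)
print(math_sqrt_worked)
```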

Also, we can define arrays with more than 2 dimensions:

In [None]:
threed_array = np.array([[[1, 2, 3, 4],
                          [5, 6, 7, 8],
                          [9, 10, 11, 12]],

                        [[13, 14, 15, 16],
                          [17, 18, 19, 20],
                          [21, 22, 23, 24]]])

In [None]:
threed_array.shape

... and access elements using `[]` to index, just like other Python objects. The difference is that we can now index multiple dimensions by separating the index of each dimension with a comma:

In [None]:
threed_array[0, 1, 1]

This returns the value `6`, as it is the value that sits in the second row (index 1) and second column (index 1) of the first matrix (index 0).

In [None]:
threed_array
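
We can also mix single indexes with slices to pull out whole rows or columns. In this sketch the array is rebuilt with `np.arange` and `reshape` as a shortcut, which produces the same values as `threed_array` above:

```python
import numpy as np

# Same values as `threed_array`: 2 matrices, each with 3 rows and 4 columns
threed_array = np.arange(1, 25).reshape(2, 3, 4)

single_value = threed_array[0, 1, 1]   # first matrix, second row, second column
whole_row = threed_array[0, 2]         # first matrix, third row, all columns
first_column = threed_array[1, :, 0]   # second matrix, all rows, first column

print(single_value, whole_row, first_column)
```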

`numpy` is a very cool library to perform fast calculations using Python. But the fact that an array is designed to store a single data type at a time makes it a bit harder to work with, particularly in a data science setting.
<br>
<br>
To overcome this limitation, we have the (arguably) most important Python object for Data Scientists - the `pandas` data frame!

Let's create our first `pandas` DataFrame. *Note: Keep in mind that this is one of the many ways that we can use to create pandas dataframes!*

In [None]:
pd.DataFrame([0,1,2], columns=['col'], index=['A','B','C'])

We can pass more data by passing a list of lists:

In [None]:
pd.DataFrame([[0,'A',2],[1,'B',3]], columns=['col_1', 'col_2','col_3'], index=['A','B'])

In [None]:
dataframe_example = pd.DataFrame(
    [[0,'A',2],[1,'B',3]],
    columns=['col_1', 'col_2','col_3'],
    index=['A','B']
)

Notice that the two previous code blocks are exactly the same (except for the assignment to `dataframe_example`). The latter `pd.DataFrame` call produces exactly the same result as the former, with the difference that we are stacking the code to avoid overflowing 79 characters.

Here, we can play around with our `dataframe` by using indexes - here are four examples of indexing using the `loc` (label-based) and `iloc` (position-based) syntax:

In [None]:
dataframe_example.loc[:,'col_1']

In [None]:
dataframe_example.iloc[:,0]

In [None]:
dataframe_example.loc['A',:]

In [None]:
dataframe_example.iloc[0,:]
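
Both `loc` and `iloc` can also pick a single cell by combining a row and a column. A short sketch on the same `dataframe_example`, showing that the label-based and the position-based lookups land on the same value:

```python
import pandas as pd

dataframe_example = pd.DataFrame(
    [[0,'A',2],[1,'B',3]],
    columns=['col_1', 'col_2','col_3'],
    index=['A','B']
)

by_label = dataframe_example.loc['B', 'col_3']   # row label 'B', column label 'col_3'
by_position = dataframe_example.iloc[1, 2]       # second row, third column

print(by_label, by_position)
```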

We can also load `csv` files directly into a pandas DataFrame using the `read_csv` function:

In [None]:
mtcars = pd.read_csv('/content/data/mtcars.csv')

Calling `head` or `tail` will give us a preview of our dataframe:

In [None]:
mtcars.head()

In [None]:
help(pd.DataFrame.tail)

That's it for our Python review week! Before we leave, let's attempt some exercises to refresh our memory on the Python language.

In [None]:
(lambda a : a + 10)(100)

In [None]:
def x(a):
  return a+10

# Exercise Section

This first exercise section will mostly test your knowledge of `Python`, particularly Python objects, `pandas` and `numpy`.
<br>
To solve the exercises, simply replace the snippets marked with `### YOUR CODE HERE` with the code that solves the exercise. Afterwards, check the solutions file to see if what you've developed matches the **result** of the solutions. If you achieve the same result with slightly different code, that is completely ok!
<br>
<br>
Find the solutions in the `solutions.py` file that lies in the same folder as this notebook.

### Exercise 1

Multiply the number 20 by the logarithm of the number 5 and store it in a variable named `result_1`.

In [None]:
### YOUR CODE HERE

### Exercise 2

Compute 100*1.05 and then divide the result by the product of 200 and the square root of 20. Save the result in an object named `result_2`.

Hint: Check the `math.sqrt` function!

In [None]:
### YOUR CODE HERE

### Exercise 3

Check the type of the string `'This is a string'` and store it in a `type_arg` named object.

In [None]:
### YOUR CODE HERE

### Exercise 4

Create a list with the elements 2, 4, 10 and 'A'.
Store the list in a `list_1` named object.

In [None]:
### YOUR CODE HERE

### Exercise 5

Subset the 1, 2 and 3 elements in the `list_1` and store it in an object named `list_1_subset`

In [None]:
### YOUR CODE HERE

### Exercise 6

Extract the distinct elements from the following list `[0.01,0.01,0.02,0.03]`. Call the object `set_list`. *Hint: Remember the set object!*

In [None]:
### YOUR CODE HERE

### Exercise 7
Create a list called continents with the values "Europe", "Africa" and "Asia".

In [None]:
### YOUR CODE HERE

### Exercise 8
Iterate through the continents list and create a new list with the continent names in lowercase. You can use either a loop or a list comprehension.
Name the new list *continents_lowercase*.

In [None]:
### YOUR CODE HERE

### Exercise 9
Create a new dictionary called basket_groceries with the following key and values:
- 'apple': 1,
- 'cookies': 2,
- 'fish': 200


In [None]:
### YOUR CODE HERE

### Exercise 10
Use a loop to sum all the values in the basket_groceries dictionary. Store the result in a variable named total_basket.


In [None]:
### YOUR CODE HERE

### Exercise 11
Implement a function named `retrieve_letters`, where you retrieve a list with specific letters given by an argument.

Your `retrieve_letters` should take two arguments: string and character.

For example, for input `string='this is a sentence'` and `character='e'`, the returning list is `['e','e','e']` as that is the number of e's contained in the string.

In [None]:
### YOUR CODE HERE

### Exercise 12

Implement a function named `only_evens`, where you remove odd numbers from an input list. Your `only_evens` should take one argument: number_list, a list with elements.

If you find a character element in the `number_list`, return a string with the sentence: `"Invalid list!"`

If the list only contains integers, return a list only with the integers that are even from the input list. Hint: A number `n` is even if `n%2 == 0` is `True`.

Example of input and output expected:

- If input list is [1,2,3], output should be [2].
- If input list is [2,4], output should be [2,4].
- If input list is [1,'a',3] output should be 'Invalid list!'

In [None]:
### YOUR CODE HERE

### Exercise 13

Create an array of 0's with two rows and 7 columns using `numpy`. Call the object `np_zero_array`.

*Hint: Check the function `np.zeros`!*

In [None]:
### YOUR CODE HERE

### Exercise 14

Create an array with the following format:
* [1,2,3]
* [4,5,6]
* [7,8,9]

Call the object `matrix_example`

In [None]:
### YOUR CODE HERE

### Exercise 15

Select the first row from the matrix and all the columns. Store the result in an object named `first_row`.

In [None]:
### YOUR CODE HERE

### Exercise 16
Read the `WBA_data.csv` file in the `data` folder into a pandas dataframe called `wba_data`.
<br>
<br>
Hint: You can add the file to Google Colab directly or use a path relative to the drive mount.

In [None]:
### YOUR CODE HERE

### Exercise 17

Print the top 10 rows of the DataFrame using the appropriate pandas method:

In [None]:
### YOUR CODE HERE

### Exercise 18

Print the bottom 10 rows of the DataFrame using the appropriate pandas method:

In [None]:
### YOUR CODE HERE

### Exercise 19

Create a new column called year with the year available in the `date` column:

In [None]:
### YOUR CODE HERE

### Exercise 20
Filter the rows for 2014 and store them in a wba_2014 dataframe.

In [None]:
### YOUR CODE HERE

### Exercise 21

Obtain the average of the `open` variable by year and store the result in a wba_agg dataframe. *Hint: Check the `groupby` function!*

In [None]:
### YOUR CODE HERE