# A comfortable introduction to Machine Learning

##### By Keiron O'Shea (keo7), and Chuan Lu (cul)

As a means of easing you into the module and the Python ecosystem in general, we will explore basic classification techniques using decision trees.

## Supervised and Unsupervised Learning

Supervised learning describes the technique in which a machine learning model is built through the use of labeled training data. For example, if we wanted to build a model to predict whether an image **is** a hot dog or **not** a hot dog, we would create a database of pictures in which each image carries a label stating whether or not it shows a hot dog. Once trained, the model can determine whether or not a hot dog is visible in a new image.

![Supervised Learning](images/supervised_classification.png)

**Figure One:** An example of a supervised classification task, in which the training examples in the orange segment are pre-labeled as being "hot dog", and those in the white segment are pre-labeled as being "not a hot dog".

Unsupervised learning focuses on building machine learning models without the use of labeled data. As there are no labels available, the model must extract structure from the data itself. This is very useful in areas where we cannot confirm a stratification with certainty, as it can surface patterns that would otherwise have gone unnoticed.
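
To make the distinction concrete, here is a minimal sketch using scikit-learn on a made-up one-dimensional dataset. The numbers, the variable names and the choice of `KMeans` as the unsupervised model are illustrative assumptions and are not part of this practical. The supervised model receives labels through `fit(X, y)`, whereas the unsupervised model only sees the features and has to group the samples by itself.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Toy data: six samples with a single feature (hypothetical values).
X_toy = np.array([[1.0], [1.2], [0.9], [8.0], [8.3], [7.9]])
# Labels are only available in the supervised setting.
y_toy = np.array([0, 0, 0, 1, 1, 1])

# Supervised: the model learns from both the features and the labels.
supervised = DecisionTreeClassifier(random_state=0)
supervised.fit(X_toy, y_toy)
supervised_pred = supervised.predict([[1.1], [8.1]])

# Unsupervised: the model only sees the features and finds two groups itself.
unsupervised = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = unsupervised.fit_predict(X_toy)

print(supervised_pred)  # predicted labels for the two new samples
print(cluster_ids)      # cluster index assigned to each training sample
```

Note that the cluster indices are arbitrary: the unsupervised model can tell the two groups apart, but it has no way of knowing which one we would call "0" or "1".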

In this practical we will focus on supervised classification techniques.

### Task One: Would you survive the Titanic?

On April 15th 1912, during her maiden voyage, the RMS Titanic sank after a collision with an iceberg - killing 1,502 of the 2,224 passengers and crew on board. One of the main reasons the death toll was so high was a lack of lifeboats.

To get you up and running with the ecosystem, we ask you to complete an analysis of which "class" of person was most likely to survive.

#### Loading in the data

In the ```data``` directory, you will find a file named ```titanic.csv```. Open it with LibreOffice or Microsoft Office and study the data carefully. Each column heading has the following meaning:

- ```survival```: Whether or not the passenger survived (0 = False, 1 = True)
- ```class```: Travel class of passenger (1 = First, 2 = Second...)
- ```name```: Name of the passenger
- ```sex```: Sex of the passenger
- ```age```: Age of the passenger, in years
- ```sibsp```: Number of siblings/spouses aboard
- ```parch```: Number of parents/children aboard
- ```ticket```: Ticket number of passenger
- ```fare```: Fare paid by passenger
- ```cabin```: Cabin in which the passenger stayed
- ```embarked```: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
- ```boat```: Lifeboat (if Survival == 1)
- ```body```: Body number (if Survival == 0, and body recovered)

To load this data into this notebook, we will make use of ```pandas```. ```pandas``` "is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language". In this practical, we will guide you through the use of this package - but in the future we do expect you to make use of the package's documentation. This can be found here:

https://pandas.pydata.org/pandas-docs/stable/

As the data is in the form of a comma-separated value (```csv```) file, we will make use of ```pandas```' ```read_csv``` function. Documentation for this can be found here:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

Before we do this, we must first load the library into our project. We can do this using the following:

In [None]:
import pandas

### Hello Pandas
Pandas is a Python library that contains high-level data structures and manipulation tools designed for data analysis. Think of Pandas as a Python version of Excel. Scikit-learn, on the other hand, is an open-source machine learning library for Python. While Scikit-learn does a lot of the heavy lifting, it is equally important to process the raw data in such a way that we are able to 'feed' it to Scikit-learn. Hence, the ability to manipulate raw data with Pandas makes it an indispensable part of our toolkit.

Whilst ```import pandas``` is an acceptable way of loading in the library, when working on large projects it can be a bit tiresome to write ```pandas``` in full every time you need to use the library. Fortunately for us, we can make use of ```as``` when importing the library to shorten the call. We can do this as follows:

In [None]:
import pandas as pd

Now we load the data:

## Data Exploration

Before building a model, we want to explore the data first: some data cleaning, visualisation and simple statistics will be useful here. 

In [None]:
data = pd.read_csv("./data/nb-titanic.csv").dropna()

And to just get a quick glimpse of the data that we have loaded, we can just call ```data.head(n_rows)``` where ```n_rows``` is equal to the number of rows we want to see.

In [None]:
data.head(10)
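
If you would like to see what ```read_csv```, ```dropna``` and ```head``` do without touching the real file, the following sketch runs them on a tiny made-up table held in memory. The values and the name `toy` are purely illustrative; `io.StringIO` simply lets a string stand in for a file.

```python
import io
import pandas as pd

# A tiny, made-up CSV held in memory instead of a file on disk.
csv_text = """survived,age,fare
1,29.0,211.34
0,,7.25
1,35.0,
"""

toy = pd.read_csv(io.StringIO(csv_text))
print(toy.shape)  # rows and columns before cleaning

# dropna() removes every row that has at least one missing value.
toy_clean = toy.dropna()
print(toy_clean.head(2))
```

Only the first row survives here, because the other two rows each have an empty field. Keep this in mind whenever you call ```dropna()```: it can discard a lot of data.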

Before we feed our data into a classifier, we first have to do a bit of manipulation to the ```DataFrame``` object. For the purposes of this practical we will convert much of the string data into categorical data. This is a fairly simple task in which we can leverage ```pandas``` to make things easier:

In [None]:
# Drop the irrelevant variables
data = data.drop(['name', 'ticket', 'cabin'], axis=1)

# Fill in missing values with a mean
age_mean = data['age'].mean()
data['age'] = data['age'].fillna(age_mean)

# Fill in missing values with the mode for discrete variables.
# Series.mode() works on string columns and returns a Series,
# so we take its first entry.
mode_embarked = data['embarked'].mode()[0]
data['embarked'] = data['embarked'].fillna(mode_embarked)
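
The two imputation strategies can be tried out on a small made-up frame. This is only a sketch: the values and the name `toy` are invented, and the mode is taken with pandas' own `Series.mode()`, which handles string columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "embarked": ["S", "S", None],
})

# Continuous variable: replace missing values with the column mean.
toy["age"] = toy["age"].fillna(toy["age"].mean())

# Discrete variable: replace missing values with the most frequent value.
toy["embarked"] = toy["embarked"].fillna(toy["embarked"].mode()[0])

print(toy)
```

The mean is computed from the values that are present (20 and 40), so the missing age becomes 30.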

As there are only two unique values for the column Sex, mapping them to 0 and 1 raises no problems of ordering.

In [None]:
data['gender'] = data['sex'].map({'female': 0, 'male': 1}).astype(int)

For the column Embarked, however, replacing {C, S, Q} by {1, 2, 3} would seem to imply the ordering C < S < Q, when in fact the ports have no natural order.

To avoid this problem, we create dummy variables. Essentially this involves creating one new column per port, holding the value 1 if the passenger embarked there and 0 otherwise. Pandas has a built-in function to create these columns automatically.

In [None]:
pd.get_dummies(data['embarked'], prefix='embarked').head(10)

In [None]:
data = pd.concat([data, pd.get_dummies(data['embarked'], prefix='embarked')], axis=1)
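
To see exactly which columns are produced, here is the same pattern applied to a four-row made-up frame. The column name `port` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical column with three distinct categories.
toy = pd.DataFrame({"port": ["C", "S", "Q", "S"]})

# One indicator column per category, named <prefix>_<category>.
dummies = pd.get_dummies(toy["port"], prefix="port")

# Attach the indicator columns next to the original data.
toy = pd.concat([toy, dummies], axis=1)
print(toy)
```

Each row has exactly one indicator switched on, so no ordering between the categories is implied. Depending on your pandas version the indicators are shown as `True`/`False` or as `1`/`0`.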

Exercise:

Write the code to create dummy variables for the column Sex.

In [None]:
# Your code here




In [None]:
data = data.drop(['sex', 'embarked'], axis=1)

# Put column name to a list
cols = data.columns.tolist()

# Reorder the column names and the dataframe (data) according to the new column order
cols = [cols[1]] + cols[0:1] + cols[2:]
data = data[cols]
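
The list arithmetic used above swaps the first two columns and leaves the rest untouched. A minimal sketch on a made-up frame (the column names here are assumptions chosen for illustration):

```python
import pandas as pd

# Hypothetical frame in which the label sits in the second column.
toy = pd.DataFrame({"pclass": [1, 3], "survived": [1, 0], "age": [29.0, 22.0]})

cols = toy.columns.tolist()

# Second column first, then the old first column, then everything else.
cols = [cols[1]] + cols[0:1] + cols[2:]

# Indexing with a list of names returns the columns in that order.
toy = toy[cols]
print(toy.columns.tolist())
```

Having the label in the first column makes it easy to separate it from the features later on.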

We review our processed training data.

In [None]:
data.head(10)

In [None]:
# Summarise the dataset: descriptive statistics
data.describe()

### Visualising the data
Data visualisation can be performed using Pandas and Matplotlib.

In [None]:
# %matplotlib inline: To make matplotlib inline graphics
%matplotlib inline 
import matplotlib.pyplot as plt

In [None]:
# Histograms for checking the distributions of the variables.
data.survived.value_counts().plot(kind='bar')

In [None]:
y = data["survived"].values

In [None]:
data['age'].plot(kind='hist') # Histogram for age

In [None]:
# Boxplots to compare the distribution of continuous variables by groups
data.boxplot(column='age', by='survived')
data.boxplot(column='fare', by='survived')

In [None]:
# Scatter plots
# Visualise the data by groups in colors
df0 = data[data['survived'] == 0]  # passengers who did not survive
df1 = data[data['survived'] == 1]  # passengers who survived
ax = df0.plot(kind='scatter', x='age', y='fare', color='red', label='Not Survived')
df1.plot(kind='scatter', x='age', y='fare', color='green', label='Survived', ax=ax)

Exercise:

What are the other variables that you would like to visualise in order to understand the association between those variables and survival data? 

In [None]:
# Your answer or code here



Now, using the code above, analyse the column definitions and determine which features you would like the decision tree classifier to learn from.

In [None]:
X = data.values[:,1:] # remember to exclude the output column (the first column here)
print(X.shape)

Now we can check whether this data has been set up correctly by ensuring that there is an equal number of samples in both ```X``` and ```y```. If this throws an exception, alter your code to make it work. If you are still stuck, call over a demonstrator to help.

In [None]:
if X.shape[0] != y.shape[0]:
    raise Exception("Sample counts do not align! Try again!")


### Set up your classifier

## Decision Tree classifiers

Ensemble learning is the technique of building multiple models to train, and then combining them in a manner that is likely to produce better results than individual models. These models don't have to be classifiers, and can be trained to deal with most tasks. A decision tree is a structure that, as the name alludes to, splits the data into branches and makes a simple decision at each level. From this, we are able to arrive at the final output by walking down the tree. The figure below is a simplistic decision tree that attempts to determine whether or not it is raining using features taken from a weather machine:

![Simple Decision Tree](images/dt.png)

**Figure Two:** A simplistic decision tree that attempts to determine whether or not it is raining using features taken from a weather machine.
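
Walking down a tree can be written out by hand as nested `if` statements. The sketch below is purely illustrative: the feature names `humidity` and `cloud_cover` and both thresholds are made up and are not taken from the figure.

```python
def is_raining(humidity, cloud_cover):
    """Walk a tiny hand-written decision tree (hypothetical thresholds)."""
    # Root node: first question asked of every sample.
    if humidity > 80:
        # Second level: only reached by humid samples.
        if cloud_cover > 50:
            return True   # leaf: humid and cloudy
        return False      # leaf: humid but clear
    return False          # leaf: dry air


print(is_raining(90, 70))
print(is_raining(90, 10))
print(is_raining(40, 70))
```

A learned decision tree has the same shape. The difference is that the training algorithm chooses the questions and thresholds from the data rather than having them written by hand.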

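Now, create a decision tree classifier and ```fit``` your data on it. If you are struggling to do this, take a look at the following pseudocode, in which ```inputs``` and ```labels``` stand for your own feature matrix and label vector: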

```python3
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(inputs, labels)
```

To complete this task you will need to:

- Split the data using ```train_test_split``` to accurately evaluate your models (an 80/20 split will suffice)
- Create a decision tree classifier and train it using the training dataset (see https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)
- Evaluate model performance against the testing dataset
- Evaluate the model using the decision surface

In [None]:
# Write your code here!


How did it perform? Was it good? Was it bad?

The model is probably badly overfitted, and as such is unlikely to be a good general classification model. Just to prove this point, run the following code to split data into training and test sets:

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.base import clone
from sklearn.metrics import classification_report

# Split the data into 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Clone your classifier, with default parameters.
clf = clone(clf)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Just get a classification report.
print(classification_report(y_test, y_pred))

#### Model performance

It's important that we evaluate our model to see if it's a capable classifier. To do this we can make use of a number of metrics. Take a look at the ```sklearn.metrics``` documentation, and study what sort of metrics are suited for this task:

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics

Once you have done that, read the following pseudocode and try to evaluate your model yourself:

```python3
from sklearn.metrics import foo_score

clf = Classifier()
clf.fit(X, y)
y_pred = clf.predict(X)

foo = foo_score(y, y_pred)

print("Foo Score: %f" % (foo))

```

In [None]:
# Type your code here


Don't worry too much about this right now if the model was bad. If you have time, you can play around with this at a later date.
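
One way to see overfitting directly is to compare accuracy on the training data with accuracy on held-out data. The sketch below does this on synthetic data generated by scikit-learn's `make_classification`, so it says nothing about the Titanic data itself. All variable names and parameter values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data: flip_y randomly reassigns 20% of the labels.
X_syn, y_syn = make_classification(
    n_samples=500, n_features=10, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

# An unconstrained tree keeps splitting until every training sample is classified.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, tree.predict(X_tr))
test_acc = accuracy_score(y_te, tree.predict(X_te))
print("Training accuracy: %f" % train_acc)
print("Testing accuracy:  %f" % test_acc)
```

The tree memorises the training set, noise included, so its training accuracy is perfect while its accuracy on unseen data is lower. A large gap between the two scores is the usual sign of overfitting.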

In [None]:
# Check your classifier 
clf

Exercise:

Are there any parameters in the classifier that can be changed? In particular, have a look at ```criterion```: how would you change it so that splits are chosen by information gain? Look up the documentation to see how to do it.
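
As background for this exercise: scikit-learn's default ```criterion``` is Gini impurity, while information gain is defined in terms of entropy. The following sketch computes both by hand for a tiny made-up set of labels. The function names are our own and are not part of scikit-learn.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits of the class distribution."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    total = len(parent)
    children = (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)
    return entropy(parent) - children


# Hypothetical node with two survivors and two non-survivors.
parent = [1, 1, 0, 0]
print(gini(parent))
print(entropy(parent))
# A split that separates the classes perfectly removes all uncertainty.
print(information_gain(parent, [1, 1], [0, 0]))
```

Both measures are zero for a node containing a single class and largest when the classes are evenly mixed, which is why either can be used to score candidate splits.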

In [None]:
# A different decision tree with an alternative model configuration
# Your code here:



## Helpers

### Pandas Cheatsheet

![Pandas Cheatsheet](images/pandascheat.png)

### List slicing tips

If you're new to the Python programming language, understanding list slicing may be a bit difficult. Here's a quick guide.

Given the following list:

In [None]:
l = ["This", "is", "a", "list", "of", "strings"]

If I wanted to get the first element of that list, I'd simply write:

In [None]:
l[0]

If I wanted to get the last element of that list, I'd simply write:

In [None]:
l[-1]

If I wanted to get everything after the first element:

In [None]:
l[1:]

And if I wanted to get everything before the last element:

In [None]:
l[:-1]

And finally, everything after the first and before the last:

In [None]:
l[1:-1]
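
The same slicing rules extend to two-dimensional NumPy arrays, which is what an expression such as ```data.values[:,1:]``` relies on: the part before the comma selects rows and the part after it selects columns. A minimal sketch on a made-up array (the values and names are illustrative):

```python
import numpy as np

# Hypothetical table: the first column is the label, the rest are features.
table = np.array([
    [1, 29.0, 211.3],
    [0, 22.0, 7.25],
])

# ":" keeps every row, "1:" keeps every column after the first.
features = table[:, 1:]

# ":" keeps every row, "0" picks out the first column only.
labels = table[:, 0]

print(features.shape)
print(labels)
```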