# Advanced Machine Learning with Python 


## Module 2: Working with real-world data

In this module you will learn how to load the online retail dataset in Python, then visualise and summarise it to produce insights that will guide your machine learning endeavours in the next modules. We will plot the data and calculate summary statistics, building the dataset from which we will ultimately try to predict future customer behaviour.


### Learning Activity: Load the required libraries

First we need to load the required Python libraries. Libraries are like extensions to base Python that add functionality or make common tasks more convenient.

In [None]:
import scipy
import numpy as np
import pandas as pd
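import plotly  # plotly.plotly (online mode) moved to the separate chart_studio package in plotly 4; the offline plotting used below does not need it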

import visplots

from sklearn import preprocessing, metrics
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from plotly.graph_objs import *
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot

init_notebook_mode()

print("libraries all imported, ready to go")

### The dataset

The dataset we will be using throughout this workshop is an *adapted and aggregated* version of the online retail case study, available from the UCI Machine Learning repository (https://archive.ics.uci.edu/ml/datasets/Online+Retail). The dataset has been designed for this workshop with the purpose of modelling the behaviour of customers ("returning" vs. "non-returning" customers) based on their activity (such as balance, max spent and number of orders, among others).

### Learning Activity: Importing the data

As a first step we load the dataset from the provided `retail_data.csv` file with `pandas`. To achieve this you will use the `read_csv()` function. We just need to point to the location of the dataset and choose the name under which we want to store the data, i.e. `retail`, and `pandas` will do the rest.

At this stage, the data has only been loaded. Let's have a look at the first few rows; we can use the `.head()` method to achieve this.

In [None]:
# Import the data and explore the first few rows
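
One possible way to fill in the cell above is sketched here. Because `retail_data.csv` is not reproduced in this text, the sketch reads a tiny inline stand-in whose column names are hypothetical; with the real file you would pass the file path to `pd.read_csv()` instead.

```python
import io

import pandas as pd

# Tiny inline stand-in for retail_data.csv; the column names are hypothetical.
csv_text = """balance,max_spent,n_orders,returning
120.5,40.0,3,yes
15.0,15.0,1,no
310.2,95.5,7,yes
"""

# pd.read_csv accepts a file path or any file-like object.
# With the real file this would be: retail = pd.read_csv("retail_data.csv")
retail = pd.read_csv(io.StringIO(csv_text))

# Show the first rows (head() returns at most the first 5 by default).
print(retail.head())
```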

In order to feed the data into our classification models and sklearn, the imported retail DataFrame needs to be converted into a `numpy` array. For more information on numpy arrays, see http://scipy-lectures.github.io/intro/numpy/array_object.html. 

In addition, it is **always** good practice to check the dimensionality of the imported data using the `shape` attribute before constructing any classification model, to make sure you have really imported all the data and imported it in the correct way (e.g. one common mistake is to get the separator wrong and end up with only one column).

In [None]:
# Convert to numpy array and check the dimensionality
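
The separator mistake mentioned above is easy to reproduce. The following sketch uses a small, made-up semicolon-separated table (not the retail data) to show how the `shape` of the converted array reveals the problem.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated data standing in for a real file.
csv_text = "a;b;c\n1;2;3\n4;5;6\n"

# Default separator is a comma, so each whole line ends up in a single column.
wrong = pd.read_csv(io.StringIO(csv_text))
# Passing the correct separator recovers the three columns.
right = pd.read_csv(io.StringIO(csv_text), sep=";")

# Convert the DataFrames to numpy arrays and compare their dimensions.
wrong_array = wrong.to_numpy()
right_array = right.to_numpy()

print(wrong_array.shape)  # 2 rows, 1 column
print(right_array.shape)  # 2 rows, 3 columns
```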

### Learning Activity: Split the data into input features, X, and outputs, y

Next, we need to split our initial dataset into the data matrix _X_ (independent variables) and the associated class vector _y_ (dependent or target variable). The input features, _X_, are the variables that you use to predict the outcome. In this dataset, there are ten input features stored in columns 1-10, all of which have continuous values. In Python's zero-based indexing these are indices 0-9; because the upper bound of a slice is excluded, the slice is written `0:10`. The output label, _y_, records whether the customer has returned or not ("yes" vs. "no"), and is stored in the final (eleventh) column (index 10). To split the data, we assign the columns of the input features and the column of the output labels to different arrays:

In [None]:
# Split to input matrix X and class vector y
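
As a sketch of the slicing described above, the example below builds a small synthetic table with ten numeric columns and a final "yes"/"no" column. The data and the column names are made up for illustration; only the slicing pattern carries over to the retail array.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: ten numeric feature columns plus a label column.
rng = np.random.default_rng(0)
retail = pd.DataFrame(rng.normal(size=(6, 10)),
                      columns=[f"f{i}" for i in range(1, 11)])
retail["returning"] = ["yes", "no", "yes", "yes", "no", "yes"]

# Mixed column types give an array of dtype object.
npArray = retail.to_numpy()

# Columns 0..9 are the features (the upper bound 10 is excluded).
X = npArray[:, 0:10].astype(float)
# Column 10 holds the class labels.
y = npArray[:, 10]

print(X.shape, y.shape)
```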

Try printing the size of the input matrix _X_ and class vector _y_ using the `shape` attribute:

In [None]:
# Print the dimensions of X and y

## Exploratory Data Analysis

Visualisation is an integral part of Data Science. Exploratory data analysis (EDA) is the field dealing with the analysis of data sets as a means of summarising their main characteristics, most often using visual methods.

Plotly is an online collaborative data analysis and graphing tool that we will use in order to construct fully interactive graphs. The Plotly API allows you to access all of the library's interactive functionality directly from Python (or other programming languages such as R, JavaScript and MATLAB, among others). Crucially, Plotly has recently been made **open-source**, which now enables plotting **offline** without requiring access to their API. _Plotly Offline_ brings interactive Plotly graphs to the _offline_ Jupyter (IPython) Notebook environment.


### Learning Activity:  Investigate the y frequencies

An important aspect to understand before applying any classification algorithm is how the output labels are distributed. Are they evenly distributed or not? Imbalances in distribution of labels can often lead to poor classification results for the minority class even if the classification results for the majority class are very good.

In [None]:
# Print the y frequencies

In our current dataset, you can see that the _y_ values are categorical (i.e. they can only take one of a discrete set of values) and have a non-numeric representation, "yes" vs. "no". This can be problematic for scikit-learn and plotting functions in Python, since they assume numerical values, so we need to map the text categories to numerical representations using `LabelEncoder` and its `fit_transform()` method from the `preprocessing` module:

In [None]:
# Convert the categorical to numeric values, and print the y frequencies
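
A minimal sketch of this encoding step on a short, made-up label vector follows. The name `yFreq` follows the text; counting with `np.unique` is one possible way to obtain the frequencies and is an assumption about how the cell is meant to be completed.

```python
import numpy as np
from sklearn import preprocessing

# Made-up labels standing in for the retail class vector.
y = np.array(["yes", "no", "yes", "yes", "no", "yes"])

# LabelEncoder sorts the classes, so "no" maps to 0 and "yes" to 1.
le = preprocessing.LabelEncoder()
y_encoded = le.fit_transform(y)

# Count how often each encoded label occurs: rows of (label, count).
labels, counts = np.unique(y_encoded, return_counts=True)
yFreq = np.column_stack([labels, counts])

print(le.classes_)
print(yFreq)
```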

Visualising the class frequencies is a good way to get a feel for how the data is distributed. As a simple example, try plotting the frequencies of the class labels (held in yFreq), "1" and "0" (corresponding to "yes" and "no" respectively), and see how they are distributed using a barplot from Plotly:

In [None]:
# Display the y frequencies in a barplot with Plotly

# (1) Create the Data object
# (2) Create a Layout object
# (3) Create a Figure object
# (4) Plot


More examples on Plotly barplots can be found at https://plot.ly/python/bar-charts/. In addition, a full list of arguments on barplots can be found at https://plot.ly/python/reference/#bar/.


### Learning Activity: Data scaling

It is usually advisable to scale your data prior to fitting a classification model to avoid attributes with greater numeric ranges dominating those with smaller numeric ranges. In order to investigate the range and descriptive statistics of our features, we can apply the `describe()` function from `pandas` to the original `retail` DataFrame (**_not_** the numpy array!). For instance:

In [None]:
# Apply the describe() function on the retail DataFrame

Boxplots are a powerful visual aid, commonly used in order to investigate simultaneously the range differences of the input features. Boxplots are a standardised way of displaying the distribution of the data based on the "five number summary" (minimum, first quartile, median, third quartile, and maximum). For example, try and plot the features of the _raw_ matrix _X_ using the script for the boxplots:

In [None]:
# Create a boxplot of the raw data

There are many ways of scaling but one common scaling mechanism is auto-scaling, where for each column, the values are centred around the mean and divided by their standard deviation. This scaling mechanism can be applied by calling the `scale()` function in scikit-learn’s `preprocessing` module.

In [None]:
# Auto-scale the data
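
To make the definition of auto-scaling concrete, the sketch below applies `preprocessing.scale()` to two synthetic columns with very different ranges (not the retail features) and compares the result with the formula written out by hand.

```python
import numpy as np
from sklearn import preprocessing

# Two synthetic features with very different numeric ranges.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(100, 50, 200), rng.normal(0, 0.1, 200)])

# Auto-scale: centre each column on its mean, divide by its standard deviation.
XScaled = preprocessing.scale(X)

# The same operation written out explicitly.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# After scaling, every column has mean ~0 and standard deviation ~1.
print(XScaled.mean(axis=0).round(6))
print(XScaled.std(axis=0).round(6))
```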

Try to re-run the previous plotting script and have a look at the outcome of the boxplot after scaling. Alternatively, if you feel more adventurous, you can create an enhanced version of the boxplot. You can find more online examples at https://plot.ly/python/box-plots/, and also a full list of boxplot arguments at https://plot.ly/python/reference/#box.


In [None]:
# Create a boxplot of the scaled data (simple or enhanced)

### Learning Activity: Investigate the relationship between input features

You can visualise the relationship between two variables (features) using a simple scatter plot. This step can give you a good first indication of which ML model to apply and how complex it needs to be (linear vs. non-linear). At this stage, let’s plot the first two variables against each other. We can also relate associations between features to their _y_ classifications by making the colour of the points dependent on the corresponding _y_ value:

In [None]:
# Create an enhanced scatter plot of the first two features


Examples of Plotly scatterplots can be found at https://plot.ly/python/line-and-scatter/ (or for a list of arguments refer to https://plot.ly/python/reference/#scatter/).


### Bonus Activity:  Try plotting different combinations of three features (f1, f2, f3) in the same plot.


The scatterplots we have seen so far investigated the relationship between two variables (features). A three-dimensional graph lets you introduce a third axis, typically called the _z_ axis, and can help you understand the relationship between three variables. Plotly's fully interactive functionality allows you to plot, hover, zoom and rotate 3-dimensional scatterplots. For a full list of arguments on 3d plots in Plotly visit https://plot.ly/python/reference/#scatter3d. Other examples on 3D scatterplots using Plotly can be found at https://plot.ly/python/3d-scatter-plots/.

_Hint: Investigate the Scatter3d object from Plotly_

_Axes in 3D Plotly plots work a bit differently than in 2D (axes are bound to a Scene object -- use help(Scene))._


In [None]:
# Create a 3D scatterplot using the first three features

### Bonus Activity: Try different combinations of f1 and f2 (in a grid/scatterplot matrix if you can).


A scatterplot matrix shows a grid of scatterplots where each attribute is plotted against all other attributes. For example, try to create a scatterplot matrix of the first four features.
You can find further information on how to create and customise subplots with Plotly at https://plot.ly/python/subplots/.

_Hints: You may want to use nested loops that iterate through the rows and columns of the grid, and also import and make use of the_ `make_subplots()` _function from Plotly_

In [None]:
# Create a grid plot of scatterplots using a combination of features

## Module 3: Decision Trees and Random Forests

In this module, you will implement two popular and extremely powerful Machine Learning models - Decision Trees and Random Forests - using Python and scikit-learn. For every classification model built with scikit-learn, we will follow four main steps: 1) **Building or instantiating** the classification model (using either default, pre-defined or optimised parameters), 2) **Training** the model, 3) **Testing** the model, and 4) **Performance evaluation** using various metrics to test its generalisation ability. Thorough validation techniques will be applied throughout these steps as a means of ensuring real-world metrics and avoiding cases of overfitting (or underfitting). Finally, you will learn how to optimise the hyperparameters of a model as a way of boosting its overall performance.


### Learning Activity: Split the data into training and test sets

Training and testing a classification model on the same dataset is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data (poor generalisation). To use different datasets for training and testing, we need to split the online retail dataset into two disjoint sets: train and test (**Holdout method**) using the `train_test_split()` function. 

The `random_state` argument specifies a value for the seed of the random generator. By setting this seed to a particular value, each time the code is executed, the split between train and test datasets will be exactly the same. If this value is not specified, a different split will be performed each time since the random generator driving the split will be seeded by a pseudo-random number.

In [None]:
# Split into training and test sets

The output of `train_test_split()` consists of four arrays. _XTrain_ and _yTrain_ are the two arrays you use to train your model. _XTest_ and _yTest_ are the two arrays that you use to evaluate your model. By default, scikit-learn splits the data so that 25% of it is used for testing, but you can also specify the proportion of data you want to use for training and testing. You can check the [`train_test_split()` documentation](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) on how to set this parameter.

You can check the sizes of the different training and test sets by using the `shape` attribute:

In [None]:
# Print the dimensionality of the individual splits
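
The split and the effect of `random_state` can be sketched on a small synthetic array (20 samples, made up for illustration). Repeating the call with the same seed gives an identical split, and the default settings hold out a quarter of the samples.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 20 samples, 2 features, alternating class labels.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Holdout split; the seed value 1 is an arbitrary choice.
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1)

# Same seed, same split.
XTrain2, XTest2, yTrain2, yTest2 = train_test_split(X, y, random_state=1)

# By default 25% of the samples go to the test set.
print(XTrain.shape, yTrain.shape)
print(XTest.shape, yTest.shape)
```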

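You can also investigate how the class labels are distributed within the *yTest* vector by using the `itemfreq` function as before. (`itemfreq` was deprecated in SciPy 1.0 and is no longer available in recent releases; `np.unique(yTest, return_counts=True)` gives the same counts.)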

In [None]:
# Calculate the frequency of classes in yTest

We can see that 59 random samples of class 0 (non-returning customers) and 441 random samples of class 1 (returning customers) are included in the _yTest_ set.

### Learning Activity:  Decision Trees

Decision Tree classifiers construct classification models in the form of a tree structure. A decision tree progressively splits the training set into smaller subsets. Each node of the tree represents a subset of the data. When a new sample is presented to the tree, it is classified according to the test conditions generated for the nodes along its path through the tree.

Let us build a simple decision tree with 3 layers. (See [here](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) for the documentation of the Decision Tree classifier.)

In [None]:
# Build the Decision Tree classifier using a pre-defined parameter

# Train the model

# Test the model
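
The build, train and test steps in the cell above could look like the sketch below. It runs on synthetic data from `make_classification` rather than the retail dataset, and it interprets "3 layers" as `max_depth=3`, which is an assumption.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the retail data (10 features, 2 classes).
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1)

# Build the classifier with a pre-defined depth limit.
dtc = DecisionTreeClassifier(max_depth=3, random_state=0)

# Train the model on the training split only.
dtc.fit(XTrain, yTrain)

# Test the model on the held-out split.
predDT = dtc.predict(XTest)
print(metrics.accuracy_score(yTest, predDT))
```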

### Learning Activity: Calculate validation metrics for your classifier

In a classification task, once you have created your predictive model, you will need to evaluate it. Evaluation functions help you to do this by reporting the performance of the model through four main performance metrics: precision, recall and specificity for the different classes, and overall accuracy. To understand these metrics, it is useful to create a _confusion matrix_, which records all the true positive, true negative, false positive and false negative values.

We can compute the confusion matrix for our classifier using the `confusion_matrix` function in the `metrics` module.


In [None]:
# Get the confusion matrix for your classifier using metrics.confusion_matrix


Because performance metrics are such an important step of model evaluation, scikit-learn offers a wrapper around these functions, `metrics.classification_report`, to facilitate their computation. It also offers the function `metrics.accuracy_score` that we tried before to compute the overall accuracy.


In [None]:
# Report the metrics using metrics.classification_report
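
To see how the metrics relate to the confusion matrix, the sketch below uses two short, made-up label vectors and derives precision, recall, specificity and accuracy by hand before printing scikit-learn's report.

```python
import numpy as np
from sklearn import metrics

# Made-up true labels and predictions (1 = returning, 0 = non-returning).
yTest = np.array([1, 1, 1, 1, 0, 0, 1, 0, 1, 1])
yPred = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])

# Rows are true classes, columns are predicted classes.
cm = metrics.confusion_matrix(yTest, yPred)
tn, fp, fn, tp = cm.ravel()

# Metrics for the positive class, derived from the four counts.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / cm.sum()

print(cm)
print(metrics.classification_report(yTest, yPred))
```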

###  Learning activity: Boundary visualisation of Decision Trees

We can visualise the classification boundary created by the Decision Tree using the `visplots.dtDecisionPlot` function. You can check the arguments passed in this function by using the `help` command. In addition to the mandatory arguments, the function `visplots.dtDecisionPlot` takes as optional arguments the ones from the `DecisionTreeClassifier` function, so you can have a look at the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier).

In [None]:
# Check the arguments of the function

# Plot the boundary of Decision Trees

### Learning Activity:  Random Forests

The random forests model is an _ensemble method_ since it aggregates a group of decision trees into an [ensemble](http://scikit-learn.org/stable/modules/ensemble.html). Ensemble learning involves the combination of several models to solve a single prediction problem. It works by generating multiple classifiers/models which learn and make predictions independently. Those predictions are then combined into a single (mega) prediction that should be as good as or better than the prediction made by any one classifier. Unlike single decision trees, which are likely to suffer from high variance or high bias (depending on how they are tuned), Random Forests use averaging to find a natural balance between the two extremes.

Let us start by building a simple Random Forest model which consists of 150 independently trained decision trees. For further details and examples on how to construct a Random Forest, see [here](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

In [None]:
# Build a Random Forest classifier with 150 decision trees
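
A sketch of the Random Forest with 150 trees follows, again on synthetic `make_classification` data rather than the retail dataset. The seeds are arbitrary choices made only for reproducibility.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the retail data.
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1)

# Build an ensemble of 150 decision trees.
rfc = RandomForestClassifier(n_estimators=150, random_state=0)

# Train on the training split and predict the held-out split.
rfc.fit(XTrain, yTrain)
predRF = rfc.predict(XTest)

print(metrics.accuracy_score(yTest, predRF))
```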

### Learning Activity: Visualising the RF accuracy

We can also investigate how the overall test accuracy changes as `n_estimators` (the number of decision trees) in our model increases. In order to do so, we can use the provided `rfAvgAcc` function from `visplots`:

In [None]:
# Visualise the average accuracy of the RF model

### Learning Activity: Feature Importance 

Random forests allow you to compute a heuristic for determining how “important” a feature is in predicting a target. One such heuristic, permutation importance, measures the change in prediction accuracy when the values of a given feature are permuted (scrambled) across the datapoints. The more the accuracy drops when the feature is permuted, the more “important” we can conclude the feature is.

We can use the `feature_importances_` attribute of the RF classifier to obtain the relative importance of each feature, which we can then visualise using a simple bar plot. Note that in scikit-learn this attribute is impurity-based (it reflects how much each feature reduces node impurity across the trees), so its values can differ from a permutation-based estimate.

In [None]:
# Display the importance of the features in a barplot
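
The sketch below reads `feature_importances_` from a fitted forest and, for comparison, computes a permutation-based estimate with `sklearn.inspection.permutation_importance`. The data is synthetic, and computing the permutation scores on the held-out split is a choice made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the retail data.
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1)

rfc = RandomForestClassifier(n_estimators=50, random_state=0)
rfc.fit(XTrain, yTrain)

# Impurity-based importances: one value per feature, summing to 1.
importances = rfc.feature_importances_
ranking = np.argsort(importances)[::-1]
for idx in ranking:
    print(f"feature {idx}: {importances[idx]:.3f}")

# Permutation importance: mean accuracy drop when a feature is shuffled.
perm = permutation_importance(rfc, XTest, yTest, n_repeats=5, random_state=0)
print(perm.importances_mean.round(3))
```

The `importances` array (or `perm.importances_mean`) can be passed as the bar heights of a Plotly `Bar` trace to produce the bar plot.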

###  Learning activity: Boundary visualisation of Random Forests

We can visualise the classification boundary created by the Random Forest using the `visplots.rfDecisionPlot` function. You can check the arguments passed in this function by using the `help` command. In addition to the mandatory arguments, the function `visplots.rfDecisionPlot` takes as optional arguments the ones from the `RandomForestClassifier` function, so you can have a look at the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

In [None]:
# Check the arguments of the function

# Plot the boundary of Random Forest

### Learning Activity: Tuning Random Forests with grid search

Random forests offer several parameters that can be tuned. In this case, parameters such as `n_estimators`, `max_features`, `max_depth` and `min_samples_leaf` can be some of the parameters to be optimised. The optimal choice for these parameters is highly *data-dependent*. Rather than trying predefined values for each hyperparameter one by one, we can automate this process. The scikit-learn library provides the grid search function [`GridSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), which allows us to exhaustively search for the optimum combination of parameters by evaluating models trained with a particular algorithm with all provided parameter combinations. Further details and examples on grid search with scikit-learn can be found [here](http://scikit-learn.org/stable/modules/grid_search.html). You can use the `GridSearchCV` function with the validation technique of your choice (in this example, 10-fold cross-validation has been applied) to search for a parametrisation of the RF algorithm that gives a better model.

As a first step, create a dictionary of allowed parameter ranges for `n_estimators` and `max_depth` (or include more of the parameters you would like to tune) and conduct a grid search with cross validation using the `GridSearchCV` function:

In [None]:
# Conduct a grid search with 10-fold cross-validation using the dictionary of parameters
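
A grid search along these lines is sketched below. The parameter values are deliberately small, illustrative choices so that the example runs quickly, and the data is synthetic rather than the retail dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the retail data.
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1)

# Dictionary of parameter values to try (illustrative ranges).
parameters = {"n_estimators": [10, 50], "max_depth": [2, 4]}

# Exhaustive search over all 4 combinations with 10-fold cross-validation.
gridCV = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid=parameters, cv=10)
gridCV.fit(XTrain, yTrain)

# Best combination and its mean cross-validated accuracy.
print(gridCV.best_params_)
print(gridCV.best_score_)
```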

By default, parameter search uses overall accuracy (`sklearn.metrics.accuracy_score`) as a metric in classification. For some applications, other scoring functions and metrics are better suited (for example in _unbalanced classification_, the overall accuracy score may often be misleading). An alternative scoring function such as the ones provided at http://scikit-learn.org/stable/modules/model_evaluation.html can be specified via the `scoring` parameter in `GridSearchCV`.

### Learning Activity: Visualising the grid search results in a heatmap

We can also graphically represent the results of the grid search using a heatmap:

In [None]:
# Create a heatmap to visualise the results of the grid search with cross-validation
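
A heatmap needs the cross-validation scores arranged as a matrix with one axis per hyperparameter. One way to build that matrix, sketched here on synthetic data with illustrative parameter values, is to load `cv_results_` into a DataFrame and pivot it.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)

parameters = {"n_estimators": [10, 30], "max_depth": [2, 4, 6]}
gridCV = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid=parameters, cv=10)
gridCV.fit(X, y)

# One row per parameter combination, including the mean CV score.
results = pd.DataFrame(gridCV.cv_results_)

# Rows: max_depth values; columns: n_estimators values.
scores = results.pivot(index="param_max_depth",
                       columns="param_n_estimators",
                       values="mean_test_score")
print(scores)
```

The resulting `scores.values`, `scores.columns` and `scores.index` can then serve as the `z`, `x` and `y` arguments of a Plotly `Heatmap` trace.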

### Learning Activity: Testing and evaluating the generalisation performance

When evaluating the resulting model it is important to do it on held-out samples that were not seen during the grid search process (_XTest_). We therefore evaluate the model built with the optimal parameters on the independent _XTest_ dataset:

In [None]:
# Build the classifier using the *optimal* parameters detected by grid search

## Module 4: Logistic Regression

Logistic regression represents one of many ways of starting simple - by modelling the underlying relationships fully _linearly_. Logistic regression is based on linear regression, but rather than the predicted output being a continuous value, it predicts the probability that a sample belongs to a class based on the values of the input variables. In the case of classification, we can use this to then assign the sample to the most likely class. For more details, see: http://www.omidrouhani.com/research/logisticregression/html/logisticregression.htm


### Learning Activity: Implement Logistic Regression

In scikit-learn, you can learn a logistic regression model using the [LogisticRegression object](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). As with linear regression, there are certain assumptions that you might make or constraints that you wish your model to fulfil, e.g. whether or not you want a constant to be included in the function. You can also specify the way you wish learning to take place by using different solvers or how you wish errors to be penalised.


In [None]:
# Build a Logistic Regression classifier with the default parameters
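
A default logistic regression model could be built as in the sketch below (synthetic data, not the retail dataset). It also shows that the predicted class is simply the one with the highest predicted probability.

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the retail data.
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1)

# Build and train the classifier with default parameters.
logReg = LogisticRegression()
logReg.fit(XTrain, yTrain)

# Class membership probabilities: one column per class, rows sum to 1.
probs = logReg.predict_proba(XTest)

# Hard class assignments for the same samples.
predLR = logReg.predict(XTest)

print(probs[:3].round(3))
print(metrics.accuracy_score(yTest, predLR))
```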

### Test Activity: Apply class weights

Repeat the previous activity, but this time also set the argument `class_weight` to `'balanced'`. If this argument is not given, all classes are assumed to have weight one. The `balanced` mode uses the values of _y_ to automatically adjust weights inversely proportional to class frequencies in the input data as `n_samples / (n_classes * np.bincount(y))`. More information can be found in the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [None]:
# Build a Logistic Regression classifier with the default parameters and class_weight='balanced'
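
The weighting formula can be checked by hand. The sketch below uses a made-up label vector with 59 samples of class 0 and 441 of class 1 (the class counts quoted earlier for _yTest_, reused purely as an illustration) and compares the result with scikit-learn's `compute_class_weight` helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced labels: 59 samples of class 0, 441 samples of class 1.
y = np.array([0] * 59 + [1] * 441)

n_samples = len(y)
n_classes = len(np.unique(y))

# The "balanced" formula: rarer classes receive larger weights.
weights = n_samples / (n_classes * np.bincount(y))

# The same weights as computed by scikit-learn.
balanced = compute_class_weight(class_weight="balanced",
                                classes=np.unique(y), y=y)

print(weights.round(3))   # roughly 4.237 for class 0 and 0.567 for class 1
print(balanced.round(3))
```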

### Test Activity: Boundary visualisation of Logistic Regression

We can visualise the classification boundary created by the logistic regression model using the built-in function `visplots.logregDecisionPlot`. You can check the arguments passed in this function by using the `help` command. In addition to the mandatory arguments, the function `visplots.logregDecisionPlot` takes as optional arguments the ones from the `LogisticRegression` function, so you can have a look at the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [None]:
# Check the arguments of the function

# Plot the boundary of Logistic Regression

### Test Activity: Tuning Logistic Regression

Two hyperparameters that are often tuned for logistic regression models are the norm used in penalisation (`penalty`), which can be either `l1` or `l2` (default `l2`) and the inverse of regularisation strength, `C` (default `1.0`). 

In [None]:
# Define the parameters to be optimised and their ranges

# Tune the parameters of Logistic Regression
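
One way to tune `penalty` and `C` is sketched below on synthetic data. The `C` values are illustrative, and `solver="liblinear"` is chosen here because scikit-learn's default solver does not support the `l1` penalty; both are assumptions rather than the workshop's prescribed settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the retail data.
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1)

# Parameters to be optimised and their ranges (illustrative values).
parameters = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10, 100]}

# liblinear handles both l1 and l2 penalties.
gridCV = GridSearchCV(LogisticRegression(solver="liblinear"),
                      param_grid=parameters, cv=10)
gridCV.fit(XTrain, yTrain)

print(gridCV.best_params_)
print(gridCV.best_score_)
```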

### Test Activity: Testing and evaluating the generalisation performance

When evaluating the resulting model it is important to do it on held-out samples that were not seen during the grid search process (_XTest_). We therefore evaluate the model built with the optimal parameters on the independent _XTest_ dataset:

In [None]:
# Build the classifier using the *optimal* parameters detected by grid search

### Bonus Activity: Visualise the results of GridSearchCV in a heatmap

Plot the results of the grid search using a heatmap.

In [None]:
# Visualise the results of GridSearchCV in a heatmap

## Module 5: Support Vector Machines

Support Vector Machines (SVMs) attempt to build a decision boundary that accurately separates the samples of different classes by *maximising* the margin between them.

### Learning Activity: Linear SVMs

At first, let us build a linear SVM model using the _default_ value for the hyperparameter `C` (based on the scikit-learn documentation, the default case is `C = 1.0`). The regularisation parameter `C` trades off misclassification of training examples against simplicity of the decision surface. A low `C` tolerates training misclassifications and allows softer margins, while for high `C` the misclassifications become more significant, leading to hard-margin SVMs and potentially to overfitting.

Thorough documentation on how to implement linear SVMs with scikit-learn can be found [here](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).


In [None]:
# Build a linear SVM classifier with the default hyperparameter C
# (where C = 1.0; this argument is optional and could be omitted)
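
A linear SVM with the default `C` could be built as follows. The sketch uses synthetic data from `make_classification` instead of the retail dataset.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the (scaled) retail data.
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1)

# Linear kernel; C=1.0 is the default and could be omitted.
linearSVM = SVC(kernel="linear", C=1.0)
linearSVM.fit(XTrain, yTrain)

# Predict the held-out samples and report the overall accuracy.
predLinear = linearSVM.predict(XTest)
print(metrics.accuracy_score(yTest, predLinear))
```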

### Test Activity: Boundary visualisation of linear SVMs

We can visualise the classification boundary created by the linear SVM using the `visplots.svmDecisionPlot` function. You can check the arguments passed in this function by using the `help` command. In this case, you need to set the kernel to `linear`. In addition to the mandatory arguments, the function `visplots.svmDecisionPlot` takes as optional arguments the ones from the `SVC` function, so you can have a look at the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).


In [None]:
# Check the arguments of the function

# Plot the SVM boundaries of the linear kernel

### Learning Activity: Non-linear (RBF) SVMs

In addition to the regularisation parameter `C`, which is common for all types of SVM, the gamma hyperparameter in the RBF kernel controls the nonlinearity of the SVM boundaries. The larger the gamma, the more nonlinear the boundaries surrounding individual samples. Lower values of gamma lead to broader, more linear boundaries.

At first, let us build an RBF SVM model (set the `kernel` parameter to `rbf`) using the default values for the hyperparameters `C` (`C=1.0`) and `gamma` (`gamma='auto'` in the scikit-learn release this workshop was written for; from version 0.22 onwards the default is `gamma='scale'`). Thorough documentation on how to implement SVMs with scikit-learn can be found at http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

In [None]:
# Build a non-linear (RBF) classifier using the default parameters for C and gamma
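
An RBF SVM could be built as in the sketch below, on synthetic data. `gamma="auto"` is passed explicitly here so that the behaviour does not depend on which scikit-learn version supplies the default.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the (scaled) retail data.
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1)

# RBF kernel; gamma="auto" means 1 / n_features.
rbfSVM = SVC(kernel="rbf", C=1.0, gamma="auto")
rbfSVM.fit(XTrain, yTrain)

# Predict the held-out samples and report the overall accuracy.
predRbf = rbfSVM.predict(XTest)
print(metrics.accuracy_score(yTest, predRbf))
```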

### Test Activity: Apply class weights

Repeat the previous activity, but this time also set the argument `class_weight` to `'balanced'`. If this argument is not given, all classes are assumed to have weight one. The `balanced` mode uses the values of _y_ to automatically adjust weights inversely proportional to class frequencies in the input data as `n_samples / (n_classes * np.bincount(y))`.

In [None]:
# Build a non-linear (RBF) classifier using class_weight='balanced'

- What do you observe?

### Test Activity: Boundary visualisation of non-linear (RBF) SVMs


1) Try to visualise the classification boundary created by the RBF SVM using the `visplots.svmDecisionPlot` function with the default parameters of `C` and `gamma`. You can check the arguments passed in this function by using the `help` command. Remember to set the correct kernel! <br/>
2) Try different combinations of `C` and `gamma`, re-plot the boundaries and investigate how different values of the hyperparameters affect the separating hyperplane <br/>
3) Apply `class_weight = 'balanced'` and visualise the boundaries

In [None]:
# Plot the boundaries of the RBF kernel 

### Learning Activity: Hyperparameter tuning for non-linear SVMs

Hyperparameters are parameters that are not directly learnt within estimators. In scikit-learn they are passed as arguments to the constructor of the estimator classes.
Proper choice of `C` and `gamma` is critical for the performance of SVMs. Optimisation (tuning) of the hyperparameters can be achieved by applying a coarse grid search, often followed by a finer search in the "neighborhood" of good parameters. *This may take a few minutes to run*.

In [None]:
# Define the parameters to be optimised and their ranges

# Tune the parameters of the SVM
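
A coarse search over `C` and `gamma` could be set up as in the sketch below. The logarithmically spaced ranges and the 5-fold cross-validation are illustrative assumptions, and the data is synthetic, so the example runs in seconds rather than minutes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the (scaled) retail data.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1)

# Coarse, logarithmically spaced grid: 5 values of C times 5 of gamma.
parameters = {"C": np.logspace(-2, 2, 5), "gamma": np.logspace(-3, 1, 5)}

# Cross-validated search over all 25 combinations.
gridCV = GridSearchCV(SVC(kernel="rbf"), param_grid=parameters, cv=5)
gridCV.fit(XTrain, yTrain)

# Best combination found on the training folds.
print(gridCV.best_params_)
print(gridCV.best_score_)
```

A finer grid can then be centred on `gridCV.best_params_` and searched in the same way.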

### Test Activity: Testing and evaluating the generalisation performance

When evaluating the resulting model it is important to do it on held-out samples that were not seen during the grid search process (_XTest_). We therefore evaluate the model built with the optimal parameters on the independent _XTest_ dataset:

In [None]:
# Build the classifier using the *optimal* parameters detected by grid search

### Bonus Activity: Visualise the results of GridSearchCV in a heatmap

Plot the results of the grid search using a heatmap.

In [None]:
# Visualise the results of GridSearchCV in a heatmap

### END OF DAY 1 ###