# Introductory applied machine learning (INFR10069)

# Lab 1: Data analysis and visualisation

In this lab we work with a spam filtering dataset. We will perform exploratory data analysis and visualisation and, finally, learn how to perform classification using Naive Bayes. For this, we will use the packages introduced in Lab 1 and the `scikit-learn` package (`sklearn`), a machine learning library for Python that works with numpy arrays and pandas DataFrame objects.

**Please Note**: Throughout this lab we make reference to [`methods`](https://en.wikipedia.org/wiki/Method_%28computer_programming%29) for specific objects e.g. "make use of the predict method of the MultinomialNB classifier". If you get confused, refer to the documentation and just ctrl+f for the object concerned:
* [Scikit-learn API documentation](http://scikit-learn.org/stable/modules/classes.html) 
* [Seaborn API documentation](https://seaborn.github.io/api.html)
* [Matplotlib Pyplot documentation](http://matplotlib.org/1.5.3/api/pyplot_summary.html)
* [Pandas API documentation](http://pandas.pydata.org/pandas-docs/stable/api.html)
* [Numpy documentation](http://docs.scipy.org/doc/numpy/reference/)

There are also tonnes of great examples online; googling key words with the word "example" will serve you well.

First, we need to import the packages (run all the code cells as you read along):

In [None]:
# Import packages
from __future__ import division, print_function # Imports from __future__ since we're running Python 2
import os
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix
%matplotlib inline

*Clarification*:

* The `%matplotlib inline` command is a special ipython [built in magic command](http://ipython.readthedocs.io/en/stable/interactive/magics.html) which forces the matplotlib plots to be rendered within the notebook.

## Spambase dataset

The [Spambase](http://archive.ics.uci.edu/ml/datasets/Spambase) dataset consists of tagged emails from a single email account. You should read through the description available for this data to get a feel for what you're dealing with. We have downloaded the dataset for you.

You will find the dataset located at `./datasets/spambase.csv` (the `datasets` directory is adjacent to this file). Execute the cell below to load the csv into a pandas DataFrame object.

In [None]:
# Load the dataset
data_path = os.path.join(os.getcwd(), 'datasets', 'spambase.csv')
spambase = pd.read_csv(data_path, delimiter = ',')

We have now loaded the data. Let's get a feel for what the data looks like by using the `head()` method.

In [None]:
spambase.head(5) # Display the 5 first rows of the dataframe

### ========== Question 1 ==========

**a)** Display the number of attributes in the dataset (i.e. number of columns).

In [None]:
# Your code goes here
spambase.shape[1]

**b)** Display the number of observations (i.e. number of rows).

In [None]:
# Your code goes here
spambase.shape[0]

**c)** Display the mean and standard deviation of each attribute.

In [None]:
# Your code goes here
spambase.describe()

We now want to *remove* some of the attributes from our data. There are various reasons for wanting to do so, for instance we might think that these are not relevant to the task we want to perform (i.e. e-mail classification) or they might have been contaminated with noise during the data collection process.

## Data cleaning

### ========== Question 2 ==========

**a)** Delete the `capital_run_length_average`, `capital_run_length_longest` and  `capital_run_length_total` attributes. *Hint*: You should make use of the [`drop`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html) method. *Tip*: some pandas methods have the argument `inplace` which you can use to determine whether they alter the object they are called upon and return nothing, or return a new object. This is particularly useful if you are dealing with huge datasets where you would typically want to operate `inplace`.

In [None]:
# Your code goes here
spambase.drop(["capital_run_length_average", "capital_run_length_longest", 
                          "capital_run_length_total"], axis=1, inplace=True)
## or, less efficiently
# spambase = spambase.drop(["capital_run_length_average", "capital_run_length_longest", 
#                           "capital_run_length_total"], axis=1)
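
The difference between the two variants can be seen on a tiny frame. The following is a minimal sketch; the frame `toy` and its column names are made up for illustration and are not part of the Spambase data.

```python
import pandas as pd

# Hypothetical toy frame; the column names are invented for this example.
toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# Without inplace: a new DataFrame is returned and `toy` is left untouched.
dropped = toy.drop(['c'], axis=1)
print(list(toy.columns))      # 'c' is still present here
print(list(dropped.columns))  # 'c' is gone in the returned object

# With inplace=True: `toy` itself is modified and the call returns None.
result = toy.drop(['c'], axis=1, inplace=True)
print(result)
print(list(toy.columns))
```

A common slip is writing `toy = toy.drop(..., inplace=True)`, which rebinds `toy` to `None`.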

**b)** Display the new number of attributes. Does it look like what you expected?

In [None]:
# Your code goes here
spambase.shape[1]

The remaining attributes represent relative frequencies of various important words and characters in emails. This is true for all attributes except `is_spam`, which represents whether the e-mail was annotated as spam or not. So each e-mail is represented by a 54-dimensional feature vector (plus the `is_spam` label), where each entry describes how often a particular word or character occurs in the e-mail. This is the so-called [bag of words](http://en.wikipedia.org/wiki/Bag_of_words_model) representation and is clearly a very crude approximation since it does not take into account the order of the words in the emails.
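
To see why word order is lost, here is a minimal sketch of a bag-of-words count. The helper `bag_of_words` and the two example sentences are hypothetical; Spambase itself was built from a fixed vocabulary rather than by this function.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each lower-cased, whitespace-separated token occurs."""
    return Counter(text.lower().split())

# Two made-up messages containing the same words in a different order.
first = bag_of_words("free money for you")
second = bag_of_words("you money for free")

# The two representations are identical, so word order cannot be recovered.
print(first == second)
```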

### ========== Question 3 ==========

Now let's get a feeling of the distribution of ham (i.e. valid) vs. spam emails. We can do this by using a [countplot](https://seaborn.github.io/generated/seaborn.countplot.html?highlight=countplot#seaborn.countplot) in seaborn.

**a)** Produce a seaborn [countplot](https://seaborn.github.io/generated/seaborn.countplot.html?highlight=countplot#seaborn.countplot) object that shows the distribution of ham/spam e-mails. Assign it to a variable (e.g. `ax` to emphasise it is a [matplotlib.axes.Axes](http://matplotlib.org/api/axes_api.html#axes) object)
  
**b)** In the same cell, modify the labels on the x axis (`xticklabels`) to `Ham` and `Spam` (by default they should be set to `0.0` and `1.0`). *Hint: Axes objects have a [`set_xticklabels`](http://matplotlib.org/api/axes_api.html#matplotlib.axes.Axes.set_xticklabels) method!* 
  
**c)** Finally, again in the same cell, remove the `is_spam` label from the x axis (`xlabel`) since it does not add any information to the graph

You may notice `<matplotlib.text.Text at ...memory_location...>` printed by the ipython notebook. This is just because the notebook is inferring how to display the last object in the cell. To explicitly plot the Axes object, use the `matplotlib.pyplot.show()` method at the very end of the cell, i.e. `plt.show()` (we imported the `matplotlib.pyplot` module as `plt` above)

In [None]:
# Your code goes here
ax = sns.countplot(x='is_spam', data=spambase)
ax.set_xticklabels(['Ham', 'Spam'])
plt.xlabel('')
plt.show()

Now we want to simplify the problem by transforming our dataset. We will replace all numerical values which represent word frequencies with a binary value representing whether each word was present in a document or not.

### ========== Question 4 ==========

**a)** Create a new dataframe called `spambase_binary` from `spambase`. *Hint*: Look into the [`copy`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.copy.html) method in pandas. *Tip*: Be careful: in Python, unless you explicitly say otherwise, assignment typically just creates another reference to the same object, e.g.
```python
i = [1, 3]
j = i
i[1] = 5
print(j)
```
outputs:
```
[1, 5]
```
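
The same behaviour applies to DataFrames. The sketch below uses a made-up one-column frame to contrast plain assignment with `copy`; the variable names are illustrative only.

```python
import pandas as pd

original = pd.DataFrame({'x': [1, 2, 3]})

alias = original                # a second name for the same object
independent = original.copy()   # a separate object (deep copy by default)

# Modify the original and check which of the other two names see the change.
original.loc[0, 'x'] = 99

print(alias.loc[0, 'x'])        # follows the change
print(independent.loc[0, 'x'])  # keeps the old value
```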

In [None]:
# Your code goes here
spambase_binary = spambase.copy(deep=True)

**b)** Convert all attributes in `spambase_binary` to Boolean values: 1 if the word or character is present in the email, or 0 otherwise.

In [None]:
# Your code goes here
spambase_binary[spambase_binary > 0] = 1

**c)** Display the 5 last observations of the transformed dataset.

In [None]:
# Your code goes here
spambase_binary.tail(5)

## Visualisation

Now we want to get a feeling for how the presence or absence of some specific words could affect the outcome (whether an email is classified as *ham* or *spam*). We will be focusing on three specific words, namely `make`, `internet` and `edu`.

### ========== Question 5 ==========

**a)** Using seaborn, produce one figure with three [countplots](https://seaborn.github.io/generated/seaborn.countplot.html?highlight=countplot#seaborn.countplot), one for each of the frequency variables for the words `make`, `internet` and `edu`. For each variable, the count plot should have two bars: the number of emails containing the word (i.e. the variable = 1), and the number not containing that word (i.e. the variable = 0).

In [None]:
# Your code goes here
words_of_interest = ['word_freq_' + word for word in ['make', 'internet', 'edu']]
n_words = len(words_of_interest)
plt.subplots(n_words, figsize=(8,10))
for i, word in enumerate(words_of_interest):
    plt.subplot(n_words,1,i+1)
    # Grey colour chosen such that the meaning of 1 and 0 is
    # not confused between plots. It is good practice to use
    # colour only when it is meaningful, and to not mix 
    # colour meanings between similar plots
    sns.countplot(x=word, data=spambase_binary, color='Gray') 
plt.tight_layout()
plt.show()

**b)** Repeat the above but split the bars showing the proportion of emails that are spam/ham. *Hint*: This only requires you to use the `hue` input argument to use different colours for the `is_spam` variable.

In [None]:
# Your code goes here
words_of_interest = ['word_freq_' + word for word in ['make', 'internet', 'edu']]
n_words = len(words_of_interest)
plt.subplots(n_words, figsize=(8,10))
for ii, word in enumerate(words_of_interest):
    plt.subplot(n_words, 1, ii+1)
    sns.countplot(x=word, hue='is_spam', data=spambase_binary)
plt.tight_layout()
plt.show()

## Multinomial Naive Bayes classification

Given the transformed dataset, we now wish to train a Naïve Bayes classifier to distinguish spam from regular email by fitting a distribution of the number of occurrences of each word for all the spam and non-spam e-mails. Read about the [Naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) and the underlying assumption if you are not already familiar with it. In this lab we focus on the [Multinomial Naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes). 

We will make use of the `MultinomialNB` class in `sklearn`. **Check out the user guide [description](http://scikit-learn.org/stable/modules/naive_bayes.html#multinomial-naive-bayes) and [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB) to familiarise yourself with this class.**

All classifiers in `sklearn` implement a `fit()` and `predict()` [method](https://en.wikipedia.org/wiki/Method_%28computer_programming%29). The first learns the parameters of the model and the latter classifies inputs. For a Naive Bayes classifier, the [`fit()`](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB.fit) method takes at least two input arguments `X` and `y`, where `X` are the input features and `y` are the labels associated with each example in the training dataset (i.e. targets). 
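
As a minimal sketch of the `fit()`/`predict()` pattern, the example below trains on a tiny made-up dataset (`X_toy`, `y_toy` are not the lab data) in which the first feature only appears with class 0 and the second only with class 1.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up toy data: 4 examples, 2 binary features, labels 0/1.
X_toy = np.array([[1, 0],
                  [1, 0],
                  [0, 1],
                  [0, 1]])
y_toy = np.array([0, 0, 1, 1])

clf = MultinomialNB()
clf.fit(X_toy, y_toy)        # learns the parameters of the model
pred = clf.predict(X_toy)    # classifies the given inputs
print(pred)
```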

As a first step we extract the input features and targets from the DataFrame. To do so, we will use the [`values`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.values.html) property. For the input features we want to select all columns except `is_spam` and for this we may use the [`drop`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html) method which discards the specified columns along the given axis. In fact, we can combine these two operations in one step.
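
Combining `drop` and `values` in one step might look like the following sketch. The frame `toy` and its column names are hypothetical stand-ins for the real dataset.

```python
import pandas as pd

# Hypothetical frame with two feature columns and one label column.
toy = pd.DataFrame({'word_a': [1, 0], 'word_b': [0, 1], 'label': [1, 0]})

# Drop the label column, then take the underlying numpy array.
features = toy.drop('label', axis=1).values
# The label column on its own, also as a numpy array.
targets = toy['label'].values

print(features.shape, targets.shape)
```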

### ========== Question 6 ==========

**a)** Create a Pandas DataFrame object `X` containing only the features (i.e. exclude the label `is_spam`). We need to do this as it is the input Scikit-learn objects expect for fitting. *Hint*: make use of the `drop` method.

In [None]:
# Your code goes here
X = spambase_binary.drop('is_spam', axis=1)

**b)** Create a Pandas Series object `y` that contains only the label from `spambase_binary`.

In [None]:
# Your code goes here
y = spambase_binary['is_spam']

**c)** Display the dimensionality (i.e. `shape`) of each of the two arrays. *Hint:* The shape of `X` and `y` should be `(4601, 54)` and `(4601,)` respectively.

In [None]:
# Your code goes here
print('Input features shape: {}'.format(X.shape))
print('Targets shape: {}'.format(y.shape))

### ========== Question 7 ==========

Now we want to train a Multinomial Naive Bayes classifier. Initialise a `MultinomialNB` object and [`fit`](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB.fit) the classifier using the `X` and `y` arrays extracted in the cell above.

In [None]:
# Your code goes here
mnb = MultinomialNB()
mnb.fit(X=X, y=y)

## Model evaluation

We can evaluate the classifier by looking at the classification accuracy, and the [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix). 

Scikit-learn model objects have built in scoring methods. The default [`score` method for `MultinomialNB`](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB.score) estimates the classification accuracy score. Alternatively, you can compute the prediction for the training data and make use of the [`accuracy_score`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) function (that is in fact what the classifier's `score()` method does under the hood).
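
The equivalence of the two routes can be checked on a small made-up dataset. This is only a sketch; `X_toy` and `y_toy` are invented values.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Made-up count data with two features and two classes.
X_toy = np.array([[2, 0], [1, 0], [0, 3], [1, 1]])
y_toy = np.array([0, 0, 1, 1])

clf = MultinomialNB().fit(X_toy, y_toy)

via_score = clf.score(X_toy, y_toy)                      # built-in method
via_metric = accuracy_score(y_toy, clf.predict(X_toy))   # explicit metric

print(via_score == via_metric)
```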

Scikit-learn also has a [`confusion_matrix`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix) implementation which returns a `K` × `K` numpy array (a square matrix), where `K` is the number of classes (2 in our case).
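
A small sketch with hand-written labels shows how the entries are laid out: rows correspond to the true class and columns to the predicted class. The label lists are made up.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for five examples.
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]

cm_toy = confusion_matrix(y_true, y_pred)
print(cm_toy)
# Row 0: true class 0 -> two predicted as 0, one predicted as 1
# Row 1: true class 1 -> one predicted as 0, one predicted as 1
```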

### ========== Question 8 ========== 

**a)** Display the log-prior probabilities for each class. *Hint:* use tab-completion to figure out which attribute of the `MultinomialNB` structure you are interested in.

In [None]:
# Your code goes here 
'Class log-priors: {}'.format(mnb.class_log_prior_)

**b)** Predict the output of the classifier by using the training data as input. *Hint*: make use of the `predict` method of the `MultinomialNB` classifier.

In [None]:
# Your code goes here 
tr_pred = mnb.predict(X=X)

**c)** Compute the classification accuracy on the training data by either using the `accuracy_score` metric or the `score` method of the `MultinomialNB`. 

In [None]:
# Your code goes here 
ca = accuracy_score(y, mnb.predict(X)) # or ca = mnb.score(X, y)
print('Classification accuracy on the training data: {}'.format(ca))

**d)** Compute the resulting confusion matrix by using the built-in scikit-learn `confusion_matrix` function and display the result.

In [None]:
# Your code goes here 
cm = confusion_matrix(y, tr_pred)
cm

**e)** Normalise the produced confusion matrix by the true class and display the result. In other words, the matrix should show you what proportion of `Ham` emails were predicted as `Ham`/`Spam` and vice versa.

In [None]:
# Your code goes here 
cm_norm = cm/cm.sum(axis=1)[:, np.newaxis]
cm_norm
## Confusion matrix has values c_ij such that true label i 
## is predicted as label j, i.e. rows should sum to 1
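
The `[:, np.newaxis]` step relies on numpy broadcasting: the row totals are reshaped into a column so that every row is divided by its own sum. A minimal sketch with made-up counts:

```python
import numpy as np

# Made-up 2x2 confusion matrix of raw counts (cast to float for true division).
cm_toy = np.array([[8, 2],
                   [1, 9]]).astype(float)

row_totals = cm_toy.sum(axis=1)                    # shape (2,)
cm_toy_norm = cm_toy / row_totals[:, np.newaxis]   # (2, 2) divided by (2, 1)

print(cm_toy_norm)
print(cm_toy_norm.sum(axis=1))  # each row sums to 1
```

Dividing by `row_totals` without the extra axis would instead divide each column by a row total, which is not the intended normalisation.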

**f)** By making use of the `plot_confusion_matrix` provided below, visualise the normalised confusion matrix. Plot the appropriate labels on both axes by making use of the `classes` input argument.

In [None]:
def plot_confusion_matrix(cm, classes=None, title='Confusion matrix'):
    """Plots a confusion matrix."""
    if classes is not None:
        sns.heatmap(cm, xticklabels=classes, yticklabels=classes, vmin=0., vmax=1., annot=True)
    else:
        sns.heatmap(cm, vmin=0., vmax=1.)
    plt.title(title)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
# Your code goes here
plt.figure()
plot_confusion_matrix(cm_norm, classes=['ham', 'spam'])

### ========== Question 9 ==========

Study the output produced, most importantly the percentages of correctly and incorrectly classified instances. You will probably notice that your classifier does rather well despite making a very strong assumption on the form of the data. If we didn't make this assumption, what would be the main practical problems? *Hint*: If you've forgotten the assumption of the Naive Bayes model, check Wikipedia and/or the sklearn documentation.

*Your answer goes here:*

The classifier is doing well, but it's only been tested so far on the same data we used for training it, so we can't be sure that it can generalise to new examples. 

The main practical problem if we didn't make the Naive Bayes assumption (conditional independence given the label) is that we would have to estimate a full covariance matrix of size 54 × 54 (i.e. ~1500 free parameters) and we only have about 4600 samples, so the covariance estimate might be dominated by noise. Assuming conditional independence allows us to estimate a diagonal covariance matrix, i.e. estimate a variance for each variable independently and assume all covariances between distinct variables are 0.
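
The scale of the problem can be made concrete by counting parameters. The sketch below is rough arithmetic per class for `d = 54` features; the full joint table over binary features is included as an additional comparison that the answer above does not discuss.

```python
d = 54  # number of features after removing `is_spam`

# Free parameters of a symmetric d x d covariance matrix.
full_cov_params = d * (d + 1) // 2

# One parameter per feature under conditional independence.
naive_params = d

# A full joint table over d binary features needs 2^d - 1 free probabilities.
full_joint_params = 2 ** d - 1

print(full_cov_params, naive_params, full_joint_params)
```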

### ========== Question 10 ==========

The empirical log probability of the input features given a class, $\log P\left(x_i  |  y\right)$, is given by the `feature_log_prob_` attribute of the classifier. For each feature there are two such conditional log probabilities, one for each class.

**a)** What dimensionality do you expect the `feature_log_prob_` array to have? Why?

*Your answer goes here:*

There is a probability for each feature (variable) conditional on each of the two classes, so the dimensionality should logically be (54, 2) or (2, 54). Scikit-learn stores the array as `(n_classes, n_features)`, i.e. (2, 54).
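
Where these values come from can be checked by hand on a toy problem. The sketch below assumes the default additive smoothing (`alpha=1.0`) and uses made-up data with three features; it recomputes the log probabilities from the fitted `feature_count_` attribute.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up toy data: 4 examples, 3 binary features, 2 classes.
X_toy = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 1, 1],
                  [0, 0, 1]])
y_toy = np.array([0, 0, 1, 1])

clf = MultinomialNB(alpha=1.0).fit(X_toy, y_toy)
print(clf.feature_log_prob_.shape)

# Recompute by hand: add alpha to the per-class feature counts,
# then normalise within each class and take the log.
smoothed = clf.feature_count_ + 1.0
manual = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))
print(np.allclose(manual, clf.feature_log_prob_))
```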

**b)** Inspect the log probabilities of the features. Verify that it has the expected dimensionality (i.e. `shape`).

In [None]:
# Your code goes here
mnb.feature_log_prob_.shape

**c)** Create a list of the names of the features that have higher log probability when the email is `Ham` than `Spam` i.e. what features imply an email is more likely to be `Ham`? *Hint:* There are many ways to do this. Try it on your own then, if you get stuck, you can do it using index numbers (look up [`np.argwhere`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.argwhere.html)), or using a boolean mask (look up [pandas indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html)). The column names of a Pandas DataFrame are contained in the `columns` attribute.

In [None]:
# Your code goes here

feature_names = X.columns
hammy_bool = mnb.feature_log_prob_[0] > mnb.feature_log_prob_[1]

## Simple boolean selection
ham_features = feature_names[hammy_bool].tolist()
print(ham_features)

## With np.argwhere
idx1 = np.argwhere(hammy_bool).squeeze().tolist()
print(feature_names[idx1].tolist())

## with list comprehension 
idx2 = [ii for ii, x in enumerate(hammy_bool) if x]
print(feature_names[idx2].tolist())

### ========== Question 11 ==========

For the final part of this section we will now pretend we are spammers wishing to fool a spam checking system based on Naïve Bayes into classifying a spam e-mail as ham (i.e. a valid e-mail). For this we will use a test set consisting of just one data point (i.e. e-mail). This tiny dataset is called `spambase_test` and has already been pre-processed for you which means that the redundant attributes have been removed and word frequencies have been replaced by word presence/absence.

**a)** Load `./datasets/spambase_test.csv` dataset into a new pandas structure

In [None]:
# Your code goes here
data_path_test = os.path.join(os.getcwd(), 'datasets', 'spambase_test.csv')
spambase_test = pd.read_csv(data_path_test, delimiter = ',')

**b)** Use `spambase_test` to create a pandas DataFrame object `X_test`, containing the test features, and a pandas Series object `y_test`, containing the test outcome.

In [None]:
# Your code goes here
X_test = spambase_test.drop('is_spam', axis=1)
y_test = spambase_test['is_spam'] 

**c)** Feed the input features into the classifier and compare the outcome to the true label. Make sure you don't feed the target into the classifier as you will receive an error (why?). Does the classifier classify the spam e-mail correctly?

In [None]:
# Your code goes here
prediction = mnb.predict(X_test)
# Take the single element explicitly rather than converting a whole Series/array
print('Actual label: {}\t Prediction: {}'.format(int(y_test.iloc[0]), int(prediction[0])))

**d)** Pick one (perhaps random) attribute that has higher probability for the ham class (using your feature names in Question 10c) and set the corresponding value in `X_test` to 1. Now predict the new outcome. Has it changed? If not, keep modifying more attributes until you have achieved the desired outcome (i.e. model classifies the e-mail as ham).

In [None]:
# Your code goes here
# Set ham-leaning features to 1, one at a time, until the prediction changes
for feat_name in ham_features:
    # Use a single .loc call (row label, column label); chained indexing such as
    # X_test.iloc[0][feat_name] = 1 may assign to a temporary copy instead
    X_test.loc[X_test.index[0], feat_name] = 1
    prediction = mnb.predict(X_test)
    if prediction[0] != y_test.iloc[0]:
        print('The word "', feat_name[10:], '" did the trick!')
        break
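
Why switching on a single word can flip the decision becomes clearer when the class scores are written out. The sketch below uses made-up toy data and a hypothetical helper `manual_predict`; it scores each class as its log prior plus the sum of the log probabilities of the features that are present, which mirrors what `predict` computes for `MultinomialNB`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up toy data: feature 0 is typical of class 0, feature 1 of class 1.
X_toy = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1]])
y_toy = np.array([0, 0, 0, 1, 1])
clf = MultinomialNB().fit(X_toy, y_toy)

def manual_predict(model, x):
    """Hypothetical helper: pick the class with the highest log score."""
    scores = model.class_log_prior_ + x.dot(model.feature_log_prob_.T)
    return model.classes_[np.argmax(scores, axis=1)]

x_new = np.array([[0, 1]])             # looks like class 1
before = manual_predict(clf, x_new)[0]

x_new[0, 0] = 1                        # switch on the class-0 feature
after = manual_predict(clf, x_new)[0]

print(before, after)
print(after == clf.predict(x_new)[0])  # agrees with the built-in method
```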

### ========== Question 12 ==========

**This is an extension for people keen to learn more advanced plotting.** We'll be happy to discuss your conclusions in the lab.

**a)** Create a plot of the spam/ham log probabilities for all of the features. This will help you find the spammiest/hammiest words to use in your emails! *Hint*: you can do this however you like, but try 'adapting' [this matplotlib demo](http://matplotlib.org/examples/api/barchart_demo.html)

In [None]:
# Your code goes here
log_probs = mnb.feature_log_prob_
ticklabs = feature_names

plt.figure(figsize=(12,6))
plt.plot(log_probs.T)
plt.xticks(range(len(feature_names)), 
           [f[10:] for f in ticklabs],
           rotation='vertical')
plt.show()

# or...more fancily
import matplotlib as mpl
colour_palette = mpl.rcParams['axes.prop_cycle'].by_key()['color']
N = len(ticklabs)
ind = np.arange(N)  # the x locations for the groups
width = 0.35        # the width of the bars
fig, ax = plt.subplots(figsize=(12,6))
ham_rects = ax.bar(ind, log_probs[0], width, color=colour_palette[0])
spam_rects = ax.bar(ind + width, log_probs[1], width, color=colour_palette[1])
ax.set_ylabel('Log Probability')
ax.set_xticks(ind + width)
ax.set_xticklabels([f[10:] for f in ticklabs], rotation='vertical')
ax.legend((ham_rects[0], spam_rects[0]), ('Ham', 'Spam'), loc='best')
plt.show()

**b)** The features are in the order they appear in the dataset. Can you order them by probability of being `Ham`?

In [None]:
# Your code goes here
sort_idx = np.argsort(mnb.feature_log_prob_[0])
log_probs = mnb.feature_log_prob_[:, sort_idx]
ticklabs = feature_names[sort_idx]


plt.figure(figsize=(12,6))
plt.plot(log_probs.T)
plt.xticks(range(len(feature_names)), 
           [f[10:] for f in ticklabs],
           rotation='vertical')
plt.show()

# or...more fancily
import matplotlib as mpl
colour_palette = mpl.rcParams['axes.prop_cycle'].by_key()['color']
N = len(ticklabs)
ind = np.arange(N)  # the x locations for the groups
width = 0.35        # the width of the bars
fig, ax = plt.subplots(figsize=(12,6))
ham_rects = ax.bar(ind, log_probs[0], width, color=colour_palette[0])
spam_rects = ax.bar(ind + width, log_probs[1], width, color=colour_palette[1])
ax.set_ylabel('Log Probability')
ax.set_xticks(ind + width)
ax.set_xticklabels([f[10:] for f in ticklabs], rotation='vertical')
ax.legend((ham_rects[0], spam_rects[0]), ('Ham', 'Spam'), loc='best')
plt.show()

**c)** What about ordering by the absolute difference between `Ham` and `Spam` log probability?

In [None]:
# Your code goes here
sort_idx = np.argsort(abs(mnb.feature_log_prob_[0]-mnb.feature_log_prob_[1]))
log_probs = mnb.feature_log_prob_[:, sort_idx]
ticklabs = feature_names[sort_idx]


plt.figure(figsize=(12,6))
plt.plot(log_probs.T)
plt.xticks(range(len(feature_names)), 
           [f[10:] for f in ticklabs],
           rotation='vertical')
plt.show()

# or...more fancily
import matplotlib as mpl
colour_palette = mpl.rcParams['axes.prop_cycle'].by_key()['color']
N = len(ticklabs)
ind = np.arange(N)  # the x locations for the groups
width = 0.35        # the width of the bars
fig, ax = plt.subplots(figsize=(12,6))
ham_rects = ax.bar(ind, log_probs[0], width, color=colour_palette[0])
spam_rects = ax.bar(ind + width, log_probs[1], width, color=colour_palette[1])
ax.set_ylabel('Log Probability')
ax.set_xticks(ind + width)
ax.set_xticklabels([f[10:] for f in ticklabs], rotation='vertical')
ax.legend((ham_rects[0], spam_rects[0]), ('Ham', 'Spam'), loc='best')
plt.show()