# 2019 Machine Learning course at the Faculty of Physics, Astronomy and Applied Computer Science

## Lab class no. 2 - Classification
by Piotr Warchoł


This is an amalgamation of a couple of open-source notebooks from the web, with some adjustments.

Today, let's work on two canonical datasets: MNIST and the spam dataset.

In [4]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# To plot pretty figures
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
import seaborn as sns
%matplotlib inline

import pandas as pd
import numpy as np
#import os

import warnings
warnings.filterwarnings('ignore')


# to make this notebook's output stable across runs
np.random.seed(42)

## 1. MNIST 
The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems.

In [5]:
import sklearn

In [6]:
from six.moves import urllib
# fetch_mldata was removed from scikit-learn (0.22), so the file is downloaded directly below

from scipy.io import loadmat
mnist_alternative_url = "https://github.com/amplab/datascience-sp14/raw/master/lab7/mldata/mnist-original.mat"
mnist_path = "./mnist-original.mat"
response = urllib.request.urlopen(mnist_alternative_url)
with open(mnist_path, "wb") as f:
    content = response.read()
    f.write(content)
mnist_raw = loadmat(mnist_path)
mnist = {
    "data": mnist_raw["data"].T,
    "target": mnist_raw["label"][0],
    "COL_NAMES": ["label", "data"],
    "DESCR": "mldata.org dataset: mnist-original",
}
print("Success!")

URLError: <urlopen error [Errno -3] Temporary failure in name resolution>

In [None]:
# Alternative loader; on OpenML the dataset is registered as 'mnist_784'
#from sklearn.datasets import fetch_openml
#mnist = fetch_openml('mnist_784', version=1)
#mnist

In [None]:
X, y = mnist["data"], mnist["target"]
X.shape

In [None]:
y.shape

In [None]:
#number of pixels for each image
28*28

In [None]:
# let's look at one of the images
%matplotlib inline
import matplotlib
some_digit = X[35000]
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap = matplotlib.cm.binary,
           interpolation="nearest")
plt.axis("off")
plt.show()

In [None]:
# Let's plot more

def plot_digit(data):
    image = data.reshape(28, 28)
    plt.imshow(image, cmap = matplotlib.cm.binary,
               interpolation="nearest")
    plt.axis("off")

def plot_digits(instances, images_per_row=10, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    images = [instance.reshape(size,size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap = matplotlib.cm.binary, **options)
    plt.axis("off")

In [None]:
plt.figure(figsize=(9,9))
example_images = np.r_[X[:12000:600], X[13000:30600:600], X[30600:60000:590]]
plot_digits(example_images, images_per_row=10)
plt.show()

In [None]:
# there was an "artificial" order in the data, so we shuffle it
shuffle_index = np.random.permutation(70000)
X, y = X[shuffle_index], y[shuffle_index]

In [None]:
#split for training and testing
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

#### We will only try to address the problem of whether a digit is a 5 or not, using logistic regression from sklearn
(we perform just this binary classification)

In [None]:
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)

In [None]:
from sklearn.linear_model import LogisticRegression

logreg_clf = LogisticRegression(max_iter=5, random_state=42)
logreg_clf.fit(X_train, y_train_5)

In [None]:
some_other_digit=X[2340]
plot_digit(some_other_digit)

In [None]:
# did we do ok? First, look at the example we showed as an image
logreg_clf.predict([some_other_digit])

How did we do across the whole training sample?

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(logreg_clf, X_train, y_train_5, cv=3, scoring="f1")

In [None]:
cross_val_score(logreg_clf, X_train, y_train_5, cv=3, scoring="accuracy")

Let's look at the confusion matrix and calculate some accuracy metrics

In [None]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

y_train_pred = cross_val_predict(logreg_clf, X_train, y_train_5, cv=3)
conf_mat=confusion_matrix(y_train_5, y_train_pred)
conf_mat

In [None]:
conf_mat_normalized = conf_mat.astype('float') / conf_mat.sum(axis=1)[:, np.newaxis]
sns.heatmap(conf_mat_normalized)
plt.ylabel('True label')
plt.xlabel('Predicted label')
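
Dividing each row of the confusion matrix by its sum turns counts into per-class rates, so the diagonal then holds the recall of each class. A minimal sketch on made-up counts (the numbers and the `toy_` names are illustrative assumptions, not results from this lab):

```python
import numpy as np

# Hypothetical 2x2 confusion matrix: rows = true class, columns = predicted class
toy_conf_mat = np.array([[90, 10],
                         [5, 15]])

# Divide every row by its total so that each row sums to 1
toy_row_totals = toy_conf_mat.sum(axis=1)[:, np.newaxis]
toy_conf_mat_normalized = toy_conf_mat.astype(float) / toy_row_totals

print(toy_conf_mat_normalized)
```

Without this normalization the heatmap is dominated by the majority class (here the "not 5" digits), which makes errors on the rare class hard to see.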

If the prediction were perfect:

In [None]:
y_train_perfect_predictions = y_train_5
conf_mat_perf=confusion_matrix(y_train_5, y_train_perfect_predictions)
conf_mat_perf

In [None]:
# Normalize each row so that it sums to 1, then plot the perfect-prediction matrix
conf_mat_perf_normalized = conf_mat_perf.astype('float') / conf_mat_perf.sum(axis=1)[:, np.newaxis]
sns.heatmap(conf_mat_perf_normalized)
plt.ylabel('True label')
plt.xlabel('Predicted label')

Back to reality

In [None]:
from sklearn.metrics import precision_score, recall_score

In [None]:
precision_score(y_train_5, y_train_pred)

In [None]:
recall_score(y_train_5, y_train_pred)

#### Now check if you can correctly calculate the two "by hand", from the confusion matrix

#### Check F1 score the same way, by hand and with sklearn

In [None]:
from sklearn.metrics import f1_score





Now, let's look at precision vs. recall with respect to thresholding

In [None]:
y_scores = logreg_clf.decision_function([some_digit])
y_scores

In [None]:
y_scores = cross_val_predict(logreg_clf, X_train, y_train_5, cv=3,
                             method="decision_function")

Note: there is an [issue](https://github.com/scikit-learn/scikit-learn/issues/9589) introduced in Scikit-Learn 0.19.0 where the result of `cross_val_predict()` is incorrect in the binary classification case when using `method="decision_function"`, as in the code above. The resulting array has an extra first dimension full of 0s. We need to add this small hack for now to work around this issue:

In [None]:
y_scores.shape

In [None]:
# hack to work around issue #9589 introduced in Scikit-Learn 0.19.0
if y_scores.ndim == 2:
    y_scores = y_scores[:, 1]

In [None]:
from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

In [None]:
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)
    plt.xlabel("Threshold", fontsize=16)
    plt.legend(loc="upper left", fontsize=16)
    plt.ylim([0, 1])

plt.figure(figsize=(8, 4))
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.xlim([-7, 7])

plt.show()

Clearly, we need to balance precision and recall by choosing a suitable threshold for the predictions.
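
To make the trade-off concrete, here is a small sketch that applies a few thresholds to decision scores and recomputes both metrics. The labels, scores and thresholds are invented for illustration; with the real data one would use `y_train_5` and `y_scores` instead.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and decision scores (made up for illustration)
toy_labels = np.array([0, 0, 1, 0, 1, 1, 0, 1])
toy_scores = np.array([-3.0, -1.0, -0.5, 0.2, 0.8, 1.5, 2.0, 3.0])

toy_results = []
for toy_threshold in (-2.0, 0.0, 1.0):
    # Predict the positive class whenever the score exceeds the threshold
    toy_pred = (toy_scores > toy_threshold).astype(int)
    p = precision_score(toy_labels, toy_pred)
    r = recall_score(toy_labels, toy_pred)
    toy_results.append((toy_threshold, p, r))
    print("threshold={:+.1f}  precision={:.2f}  recall={:.2f}".format(toy_threshold, p, r))
```

Raising the threshold can only remove positive predictions, so recall never increases; precision usually (but not always) goes up.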

In [None]:
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

In [None]:
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)

plt.figure(figsize=(8, 6))
plot_roc_curve(fpr, tpr)
plt.show()
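
Each point on the ROC curve corresponds to one threshold, and the area under the curve is a common single-number summary of it. A minimal sketch on four made-up samples (the `toy_` names and values are assumptions for illustration, not part of the lab data):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical labels and scores
toy_labels = np.array([0, 0, 1, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])

# One (fpr, tpr) pair per candidate threshold
toy_fpr, toy_tpr, toy_thresholds = roc_curve(toy_labels, toy_scores)

# Fraction of (positive, negative) pairs that the scores rank correctly
toy_auc = roc_auc_score(toy_labels, toy_scores)
print(toy_auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```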

#### Now produce the confusion matrix, calculate the F1 score, accuracy, precision and recall, and plot the ROC curve for the TEST data
Are the results as good?

## 2. Spam filtering

Download the spam.csv file from http://cs.if.uj.edu.pl/piotrek/ML2019/spam.csv

In [7]:
data = pd.read_csv("http://cs.if.uj.edu.pl/piotrek/ML2019/datasets/spam.csv", encoding='latin-1')

URLError: <urlopen error [Errno -3] Temporary failure in name resolution>

In [None]:
data.head()

Let's drop the unwanted columns and rename the remaining ones appropriately.

In [None]:
#Drop column and name change
data = data.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
data = data.rename(columns={"v1":"label", "v2":"text"})

In [None]:
data.tail()

In [None]:
#Count observations in each label
data.label.value_counts()

In [None]:
# convert label to a numerical variable
data['label_num'] = data.label.map({'ham':0, 'spam':1})

In [None]:
data.head()

### Train Test Split
Before performing the text transformation, let us do a train/test split. In fact, we could perform k-fold cross-validation; however, for simplicity, I am doing a single train/test split.
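
For comparison, a k-fold variant could look roughly like the sketch below. The tiny corpus, the `toy_` names and the choice of a pipeline are assumptions made for illustration; this is not the approach used in the rest of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus standing in for the spam data
toy_texts = ["win a free prize now", "free cash win", "claim your free prize",
             "call now to win cash", "are we meeting today", "see you at lunch",
             "can you call me later", "lunch today at noon"]
toy_labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# Keeping the vectorizer inside the pipeline refits the vocabulary on each
# training fold, so nothing from the held-out fold leaks into the features
toy_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
toy_cv_scores = cross_val_score(toy_pipeline, toy_texts, toy_labels,
                                cv=4, scoring="accuracy")
print(toy_cv_scores)
```

The result is one accuracy value per fold; their mean and spread give a more stable estimate than a single split, at the cost of fitting the model several times.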

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train,X_test,y_train,y_test = train_test_split(data["text"],data["label"], test_size = 0.2, random_state = 10)

In [None]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

### Text Transformation
Various text transformation techniques such as stop word removal, lowercasing, tf-idf transformation, pruning and stemming can be performed using the sklearn.feature_extraction module. Then, the data can be converted into a bag-of-words representation. <br> <br>
For this problem, let us see how our model performs without removing stop words.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

see http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [None]:
vect = CountVectorizer()

Note: we can also perform a tf-idf transformation.
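
As a rough sketch of what that would look like, `TfidfVectorizer` can be used in place of `CountVectorizer`. The three-sentence corpus and the `toy_` names below are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus
toy_corpus = ["free prize now", "see you now", "free free cash"]

toy_tfidf = TfidfVectorizer()
toy_weights = toy_tfidf.fit_transform(toy_corpus)  # sparse matrix, one row per document

print(sorted(toy_tfidf.vocabulary_))   # the learned terms
print(toy_weights.toarray().round(2))  # tf-idf weights instead of raw counts
```

With the default settings each row is scaled to unit length, and words that occur in many documents receive a lower weight than rare ones.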

In [None]:
vect.fit(X_train)

The `vect.fit` function learns the vocabulary. We can get all the feature names from `vect.get_feature_names()`. <br> <br> Let us print the first twenty features, the last twenty, and a few from the middle.

In [None]:
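# Note: scikit-learn >= 1.0 provides get_feature_names_out(); get_feature_names() was removed in 1.2
print(vect.get_feature_names()[0:20])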
print(vect.get_feature_names()[1000:1010])
print(vect.get_feature_names()[-20:])

In [None]:
X_train_df = vect.transform(X_train)

Now, let's transform the Test data.

In [None]:
X_test_df = vect.transform(X_test)

In [None]:
type(X_test_df)
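
To see what `fit` and `transform` do, here is a minimal sketch on two made-up training sentences and one test sentence (the `toy_` names are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_train = ["free prize now", "see you now"]
toy_test = ["free lunch now now"]

toy_vect = CountVectorizer()
toy_vect.fit(toy_train)                          # learn the vocabulary from training text only
toy_test_counts = toy_vect.transform(toy_test)   # sparse matrix of word counts

print(sorted(toy_vect.vocabulary_))   # ['free', 'now', 'prize', 'see', 'you']
print(toy_test_counts.toarray())      # [[1 2 0 0 0]]
```

The word "lunch" never appeared in the training text, so it is silently dropped from the test representation. This is why the vocabulary is fitted on the training data only and then reused for the test data.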

### Machine Learning models:

#### Multinomial Naive Bayes
Generally, Naive Bayes works well on text data. Multinomial Naive Bayes is well suited to classification with discrete features such as word counts.

In [None]:
prediction = dict()
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train_df,y_train)

In [None]:
prediction["Multinomial"] = model.predict(X_test_df)

In [None]:
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

In [None]:
accuracy_score(y_test,prediction["Multinomial"])

#### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train_df,y_train)

In [None]:
prediction["Logistic"] = model.predict(X_test_df)

In [None]:
accuracy_score(y_test,prediction["Logistic"])

#### $k$-NN classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train_df,y_train)

In [None]:
prediction["knn"] = model.predict(X_test_df)

In [None]:
accuracy_score(y_test,prediction["knn"])

### Parameter Tuning using GridSearchCV

Based on the models above, Naive Bayes has given the best accuracy. However, let's try to tune the parameters of $k$-NN using GridSearchCV.

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
k_range = np.arange(1,30)

In [None]:
k_range

In [None]:
param_grid = dict(n_neighbors=k_range)
print(param_grid)

In [None]:
model = KNeighborsClassifier()
grid = GridSearchCV(model,param_grid)
grid.fit(X_train_df,y_train)

In [None]:
grid.best_estimator_

In [None]:
grid.best_params_

In [None]:
grid.best_score_

In [None]:
grid.cv_results_
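
The raw `cv_results_` dictionary is hard to read; wrapping it in a DataFrame is one convenient way to inspect it. The sketch below uses synthetic data from `make_classification` and a shortened range of $k$ as stand-ins, so the numbers it prints say nothing about the spam problem.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the vectorized messages
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

toy_grid = GridSearchCV(KNeighborsClassifier(),
                        {"n_neighbors": np.arange(1, 6)}, cv=3)
toy_grid.fit(X_toy, y_toy)

# One row per candidate value of n_neighbors
toy_results = pd.DataFrame(toy_grid.cv_results_)
print(toy_results[["param_n_neighbors", "mean_test_score", "rank_test_score"]])
```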


### Model Evaluation

In [None]:
print(classification_report(y_test, prediction['Multinomial'], target_names = ["Ham", "Spam"]))

In [None]:
conf_mat = confusion_matrix(y_test, prediction['Multinomial'])
conf_mat_normalized = conf_mat.astype('float') / conf_mat.sum(axis=1)[:, np.newaxis]

In [None]:
sns.heatmap(conf_mat_normalized)
plt.ylabel('True label')
plt.xlabel('Predicted label')

### Understand what happened
(note, this is for test data)

In [None]:
print(conf_mat)

From the confusion matrix above, we can see that 5 Ham messages are misclassified as Spam, and 8 Spam messages are misclassified as Ham. Let's see which text messages were misclassified. Looking at those messages may help us come up with more advanced feature engineering.

In [None]:
# None means "no limit"; the older value -1 is deprecated since pandas 1.0
pd.set_option('display.max_colwidth', None)

I increased the pandas dataframe width to display the misclassified texts in full width. 

#### Misclassified as Spam

In [None]:
X_test[y_test < prediction["Multinomial"] ]

#### Misclassified as Ham

In [None]:
X_test[y_test > prediction["Multinomial"] ]
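
The two comparisons above appear to rely on the labels being strings that are ordered alphabetically, so that "ham" < "spam". A more explicit way to build the same masks is sketched below on made-up data (all `toy_` names and messages are hypothetical):

```python
import numpy as np
import pandas as pd

toy_text = pd.Series(["hello there", "win cash", "free prize", "see you"])
toy_true = pd.Series(["ham", "spam", "spam", "ham"])
toy_pred = np.array(["spam", "spam", "ham", "ham"])

# Ham messages that the model flagged as spam
toy_false_spam = toy_text[(toy_true == "ham") & (toy_pred == "spam")]

# Spam messages that the model let through as ham
toy_false_ham = toy_text[(toy_true == "spam") & (toy_pred == "ham")]

print(toy_false_spam.tolist())  # ['hello there']
print(toy_false_ham.tolist())   # ['free prize']
```

Spelling out the two conditions keeps the selection correct even if the labels are later renamed or encoded differently.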

## Linear Discriminant analysis
This is not part of the homework, but I strongly encourage you to do it.

- Implement a linear discriminant analysis model for a p=1 (one-dimensional) predictor and 2 classes.
- Train it on data points sampled from two partly overlapping Gaussian distributions associated with the classes.
- Plot histograms of the data points used for training.
- Test it with additional samples.
- Implement a function that will produce a confusion matrix (with precision and recall scores) based on the results of fitting the model to the data.
- Make the threshold adjustable.
- Produce the confusion matrix for 3 different thresholds.
- Do the same for two partly overlapping uniform distributions.
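
As a reminder, and assuming the standard textbook form of LDA with one predictor and a variance $\sigma^2$ shared by all classes, an observation $x$ is assigned to the class $k$ with the largest discriminant score

$$
\delta_k(x) = x\,\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log \pi_k ,
$$

where the parameters are estimated from $n$ training points, $n_k$ of which belong to class $k$ (with $K$ classes in total):

$$
\hat{\mu}_k = \frac{1}{n_k}\sum_{i:\,y_i=k} x_i, \qquad
\hat{\sigma}^2 = \frac{1}{n-K}\sum_{k=1}^{K}\sum_{i:\,y_i=k}\left(x_i-\hat{\mu}_k\right)^2, \qquad
\hat{\pi}_k = \frac{n_k}{n}.
$$

For two classes with equal priors the decision boundary is $x = (\mu_1 + \mu_2)/2$; shifting this boundary is one way to make the threshold adjustable.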

# Homework 2 

- Implement a Naive Bayes model (remember about smoothing).
- Find a reasonably interesting but not too complicated dataset for which you will be able to use this model to perform binary classification. Do the latter.
- Produce the confusion matrix and calculate accuracy, precision and recall.
- Check how your model does against its version from sklearn and against logistic regression from sklearn.