# Machine learning training, testing and evaluating

**Scikit-learn** is an open-source machine learning library for Python that offers a variety of regression, classification and clustering algorithms. Detailed documentation is available at http://scikit-learn.org

In this section we'll continue from the previous session, where I added some extra columns to the dataframe to see whether they help us to predict whether a text message is **ham** or **spam**. We also checked for missing values. The code below summarises the previous lecture. Run all of this code to build the dataset with the new variables in it.

In this lecture we'll perform a fairly simple classification exercise with scikit-learn. 

In [51]:
import numpy as np
import pandas as pd
import io

In [52]:
from google.colab import files
uploaded = files.upload()
# Read tsv file into a dataframe object
# Press tab to check you are in the correct folder location and to browse
# to the tsv file
# The sep argument indicates that this file is separated by tabs
dataframe = pd.read_csv(io.BytesIO(uploaded['SMSSpamCollection.tsv']), sep="\t")

Saving SMSSpamCollection.tsv to SMSSpamCollection (2).tsv


One of the methods we could use to determine whether a text message is **HAM** or **SPAM** is to examine the number of characters in each message.

Let's create a loop that uses a list to hold the number of characters in each message. Then we'll add the length to the end of each row in the dataframe.

In [53]:
message_length_col = []
for index, row in dataframe.iterrows():
    length_message_text = len(row.message)
    # add the length of each message to list
    message_length_col.append(length_message_text)

Next we'll add this list to the dataframe. I'm also inserting this data under the column heading **length**.

See this link for further information 
https://www.geeksforgeeks.org/adding-new-column-to-existing-dataframe-in-pandas/

In [54]:
# now we'll add the contents of this list to a new column
# called "length" to the end of our dataframe
# See https://www.geeksforgeeks.org/adding-new-column-to-existing-dataframe-in-pandas/
dataframe['length'] = message_length_col
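As an aside, the same column can be built without an explicit loop. The sketch below is a minimal alternative, assuming the messages are plain strings. The three-row `sample` dataframe is a hypothetical stand-in for the SMS data, not part of the original notebook.

```python
import pandas as pd

# Hypothetical three-row stand-in for the SMS dataframe
sample = pd.DataFrame({
    "label": ["ham", "spam", "ham"],
    "message": ["Ok lar...", "Free entry now!", "See you soon"],
})

# .str.len() counts the characters in every message in a single call
sample["length"] = sample["message"].str.len()
print(sample)
```

The loop version is easier to follow when you are learning; the vectorised version is usually shorter and faster on larger datasets.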

In [55]:
ham_data = []
spam_data = []
for index, row in dataframe.iterrows():
    # If the label data is recognised to be "ham"
    if row["label"] == "ham":
        ham_data.append(row)
    else:
        spam_data.append(row)

# Convert list to a dataframe before performing descriptive statistics on it
ham_dataframe = pd.DataFrame(ham_data)
spam_dataframe = pd.DataFrame(spam_data)
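The ham/spam split above can also be sketched with a boolean mask, and `groupby` gives a quick per-label summary. This is an illustration only: the `sample` dataframe and its values are made up, and only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical stand-in data using the notebook's column names
sample = pd.DataFrame({
    "label": ["ham", "spam", "ham", "spam"],
    "length": [29, 155, 49, 147],
})

# Boolean mask: True where the row is labelled "ham"
is_ham = sample["label"] == "ham"
ham_sample = sample[is_ham]
spam_sample = sample[~is_ham]

# Mean message length per label
mean_length = sample.groupby("label")["length"].mean()
print(mean_length)
```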

In [56]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [57]:
# Define variables first
punct_length_col = []
punct_count = 0

for index, textrow in dataframe.iterrows():
    doc_object = nlp(textrow.message)
    for word in doc_object:
        if word.pos_ == 'PUNCT':           
            punct_count += 1
    # Sentence is checked so add count to list
    punct_length_col.append(punct_count)
    punct_count =0   

It is important to ensure that the list we're going to insert into the dataframe contains the same number of entries as the dataframe has rows. Otherwise we'll get an error.

In [58]:
#Add punct list to dataframe
dataframe['punct'] = punct_length_col

In [59]:
# View top of dataframe content
dataframe.head()

| | label | message | length | punct |
|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 111 | 4 |
| 1 | ham | Ok lar... Joking wif u oni... | 29 | 2 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 155 | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 49 | 2 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 61 | 1 |


There could be enough difference between the text length of **SPAM** and **HAM** messages to distinguish one from the other, but the difference in the number of punctuation marks between **HAM** and **SPAM** messages is less distinct.

We'll now create a ML model using `scikit-learn`. In the next lecture we'll use the text content to build a more accurate ML model.
 

# Machine Learning model creation through scikit-learn

Every algorithm is accessed in `scikit-learn` using an **estimator**.

The general syntax to import a model is:

>`from sklearn.family import Model`

For example, to use the `linear regression` model, we would use:

>`from sklearn.linear_model import LinearRegression`.

We can set all of the **estimator** parameters when it is instantiated. If we don't, default values will be applied to the ML model. We can press `SHIFT + TAB` in Jupyter notebook to view all of the possible parameters for each model.

For example, if we were creating a linear regression model and wanted it to be normalised, older scikit-learn releases let us set the `normalize` parameter to `True`. This parameter was deprecated in version 1.0 and removed in 1.2, so check that it exists in your version. We can view the parameters of a ML model by using the `print(model)` command, or `model.get_params()` for the complete list.
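Here is a minimal sketch of setting parameters at instantiation and reading them back. The values `C=0.5` and `max_iter=200` are arbitrary choices for illustration, not recommendations.

```python
from sklearn.linear_model import LogisticRegression

# Override two defaults when the estimator is created
model = LogisticRegression(C=0.5, max_iter=200)

# get_params() returns every parameter, including the untouched defaults
params = model.get_params()
print(params["C"], params["max_iter"], params["solver"])
```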

Once the model is created, then we need to fit the model with data. As described in an earlier lecture, the data is split into **training** and **testing** data. We'll work through this process in the upcoming code.

Once the data is split, we can fit (or train) our model on the **training** data using the `model.fit()` command. The syntax is:

>`model.fit(X_train, y_train)`

Note that I'm using specific syntax to do this. Refer to the earlier lecture on supervised learning for further information. 

Once the model has been fit and trained on the training data, it is then ready for prediction on the test dataset. We predict data from the trained model using this command:

>`predict = model.predict(X_test)`

We then evaluate the ML model by comparing predicted value to actual test values. With **classification** models, we will examine accuracy, F1 score etc.
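The whole fit, predict and evaluate cycle can be sketched on toy data before we apply it to the SMS dataset. Everything below is an assumption made for illustration: the single random feature, the threshold of 50 used to create labels, and the choice of `LogisticRegression` as the estimator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical toy data: one feature, labelled "spam" when it exceeds 50
rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(200, 1))
y = np.where(X[:, 0] > 50, "spam", "ham")

# Hold back 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Fit on the training data, then predict labels for the unseen test data
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Compare predicted labels with the known test labels
accuracy = accuracy_score(y_test, predictions)
print(accuracy)
```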

# Split the data into training and testing datasets

Before we instantiate our ML model, we'll divide the dataset into 2 smaller training and testing datasets.
If we want to divide the DataFrame into two smaller sets, we could use
> `train, test = train_test_split(dataframe)`

We'll also set up our Features (X) and Labels (y). The Label is simple - we're trying to predict the `label` column in our data. For Features we'll use the `length` and `punct` columns. 

*Please note, **X** is capitalised and **y** is lowercase.*

In [60]:
from sklearn.model_selection import train_test_split

In [61]:
# X is the feature data
# We're creating a list of column names to
# use from our dataframe.
# We need 2 brackets as there is more than 1 entry
X = dataframe[['length', 'punct']]

# This is the label data - 1 entry
# so only need 1 set of brackets
y = dataframe['label']

# Use SHIFT + TAB to see full options and to
# copy some contents below
# test_size is the proportion of the data to use for testing
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 1)

Let's examine the size of the training and testing datasets.

In [62]:
# Contains 2 columns
print('X train data shape', X_train.shape)
print('X test data shape', X_test.shape)

X train data shape (3900, 2)
X test data shape (1672, 2)


In [27]:
# 1 column of label data
print('y train data shape', y_train.shape)
print('y test data shape', y_test.shape)

y train data shape (3900,)
y test data shape (1672,)
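The row counts above follow from `test_size = 0.3`. As far as I know, scikit-learn rounds the test share up and gives the remainder to training. The sketch below checks that arithmetic on a stand-in array with 5572 rows (3900 + 1672); the zero-filled arrays are placeholders, not the real features.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 5572  # 3900 + 1672, the row counts printed above

# 30% of 5572 is 1671.6, which is rounded up to a whole number of rows
expected_test = math.ceil(0.3 * n_rows)
expected_train = n_rows - expected_test

# Placeholder arrays with the same number of rows as the SMS data
X_demo = np.zeros((n_rows, 2))
y_demo = np.zeros(n_rows)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1)
print(X_tr.shape, X_te.shape)
```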


In [63]:
# Index position matches with index
# position in X_train
y_train

4393     ham
216      ham
4471     ham
3889     ham
5030    spam
        ... 
905      ham
5192     ham
3980     ham
235     spam
5157     ham
Name: label, Length: 3900, dtype: object

In [64]:
print(X_test)

      length  punct
1078      28      1
4028      45      3
958       26      0
4642       7      1
4674     107      4
...      ...    ...
3954     114      6
619       59      2
1987      24      1
2358      52      1
3594      22      1

[1672 rows x 2 columns]


In [65]:
y_test

1078     ham
4028     ham
958      ham
4642     ham
4674     ham
        ... 
3954    spam
619      ham
1987     ham
2358     ham
3594     ham
Name: label, Length: 1672, dtype: object

# Training classifiers with our data
Now that we have our training and testing datasets created, we can use this data to build classification models. We'll build several models and see how each one differs according to accuracy.

There are several classifiers that we can use for our text datasets. In high-dimensional data such as text, classes can often be separated linearly, and the simplicity of classifiers such as Naive Bayes and linear SVMs might lead to better generalisation than is achieved by other classifiers. See https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html?highlight=classifiers for more information on classifiers.

See https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/ for more information on various classifiers that are available to us.

There are 4 classifiers that we are going to train and test, and then examine for overall accuracy. There are more models than these, but these four are commonly used for sentiment analysis and text classification.

(a) Linear classifier<br> 
(b) Naïve Bayes<br> 
(c) Random Forest<br> 
(d) Support Vector Classifier<br> 

**Note** - as with all classifiers, we need to evaluate overall accuracy of the model after it is built. The theory conveyed by some classifiers does not necessarily carry over to real datasets.
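Because every scikit-learn estimator shares the same `fit` and `predict` methods, the four classifiers can be trained in one loop. This is a rough sketch on made-up data: the random features and the threshold used to create labels are assumptions, so the scores it prints say nothing about the SMS dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Hypothetical non-negative features (MultinomialNB rejects negative values)
rng = np.random.default_rng(0)
X = rng.integers(0, 200, size=(300, 2))
y = np.where(X[:, 0] > 120, "spam", "ham")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVC": SVC(gamma="auto"),
}

# Same two calls for every model: fit on training data, predict on test data
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(name, round(scores[name], 3))
```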

## Train a Logistic Regression classifier (model)

One of the simplest multi-class classification tools is [logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

The following steps are used with all classifiers, not just logistic regression.

### Step 1 - import the model we want to use

In [66]:
from sklearn.linear_model import LogisticRegression

### Step 2 - create an instance of the model we imported

Next we create a specific instance of the model we want to construct. Note that I'm using the same model that I imported in Step 1.

There are lots of different settings available for the model. To see these settings, press the `SHIFT + TAB` keys **twice** when the cursor is over the line of code. Then scroll down through the settings window. In this example we are going to use the **L-BFGS** algorithmic solver. See this link for further information on the various parameters available for the logistic regression model https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html 

The default values are usually good. 

### Step 3 - build the model and fit data to it

We build the model with any specific options we want to set that are not the default values. In this example, I'm using the **L-BFGS** option.

Once the model is built, I provide training data to the model.

In [67]:
lin_reg_model = LogisticRegression(solver='lbfgs')
# fit() returns the fitted model, so the notebook displays its parameters below
lin_reg_model.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

## Testing model accuracy

Now we are going to test the accuracy of the model using the test data.

Firstly we import the `sklearn.metrics` module which includes score functions, performance metrics and pairwise metrics and distance computations. See https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics for more information. 

Now we create a **predictions** set with some test data. The model has not yet seen this data. It contains **length** and **punctuation** data for each text message, for which we already have the correct answer in the **y_test** dataset.

In [68]:
from sklearn import metrics

# Create a prediction set:
# The model has not yet seen contents of X_test
# which is a dataset of message length and punctuation
# And we know to expect answers in y_test
# which is a list of expected label output
lin_reg_model_predictions = lin_reg_model.predict(X_test)

Let's look at the contents of the **predictions** set.

In [69]:
# This is the predicted output from the model
lin_reg_model_predictions

array(['ham', 'ham', 'ham', ..., 'ham', 'ham', 'ham'], dtype=object)

The `metrics` module contains a `confusion_matrix` function that we can use to build a confusion matrix to evaluate model accuracy.

See https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix for more information.

In [35]:
# Now we compare what the model predicted 
# with what is expected as output
# Print a confusion matrix
print(metrics.confusion_matrix(y_test,lin_reg_model_predictions))

[[1389   53]
 [ 222    8]]


The total number of confusions (misclassified messages) in this model is 275 (53 + 222).

We can create a dataframe and add labels to make the confusion matrix easier to read and interpret.

In [36]:
# You can make the confusion matrix less confusing by adding labels:
dataframe_labels = pd.DataFrame(metrics.confusion_matrix(y_test,lin_reg_model_predictions), 
                  index=['correct ham','correct spam'], 
                  columns=['predicted ham','predicted spam'])
dataframe_labels

| | predicted ham | predicted spam |
|---|---|---|
| correct ham | 1389 | 53 |
| correct spam | 222 | 8 |
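To see how the matrix is laid out, here is a tiny example with five hand-written labels (made up for illustration). Rows are the true labels and columns are the predicted labels, in the order passed to `labels`.

```python
from sklearn.metrics import confusion_matrix

# Five hypothetical messages: true labels and a model's predictions
y_true = ["ham", "ham", "spam", "spam", "ham"]
y_pred = ["ham", "spam", "spam", "ham", "ham"]

# Row 0 = true ham, row 1 = true spam; columns follow the same order
matrix = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])
print(matrix)
```

Reading across the first row: two ham messages were predicted as ham and one as spam. The second row shows one spam message predicted as ham and one as spam.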


From the confusion matrix we can see that the model correctly classified only 8 spam messages, and incorrectly classified 53 ham messages as spam.

The model is better at classifying ham, with 1389 ham messages correctly classified. However, 222 of the 230 spam messages were incorrectly classified as ham, so the model is very poor at detecting spam.

Accuracy = (TP + TN) / Total
= (1389 + 8) / 1672
= 1397 / 1672 = 0.84

For comparison, predicting ham for every message would score 1442 / 1672 = 0.86.
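The accuracy can be checked directly from the confusion matrix: correct predictions sit on the diagonal, and everything off the diagonal is an error. This is a small worked check using the four numbers printed above.

```python
import numpy as np

# Confusion matrix from the logistic regression model above
matrix = np.array([[1389, 53],
                   [222, 8]])

correct = int(np.trace(matrix))   # diagonal: 1389 + 8
total = int(matrix.sum())         # every message in the test set
errors = total - correct          # off-diagonal: 53 + 222
accuracy = correct / total
print(correct, errors, round(accuracy, 4))
```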

Let's look at a classification report to show precision, recall and F1-score.

Overall the model is good at predicting **ham**, and not so good at **spam**.

In [70]:
# Print a classification report
print(metrics.classification_report(y_test,lin_reg_model_predictions))

              precision    recall  f1-score   support

         ham       0.86      0.96      0.91      1442
        spam       0.13      0.03      0.05       230

    accuracy                           0.84      1672
   macro avg       0.50      0.50      0.48      1672
weighted avg       0.76      0.84      0.79      1672
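The spam row of the report can be reproduced by hand from the confusion matrix counts. The sketch below uses the standard definitions of precision, recall and F1, treating spam as the positive class.

```python
# Counts for the spam class, taken from the confusion matrix above
tp = 8      # spam correctly predicted as spam
fp = 53     # ham wrongly predicted as spam
fn = 222    # spam wrongly predicted as ham

precision = tp / (tp + fp)   # of everything predicted spam, how much was spam
recall = tp / (tp + fn)      # of all real spam, how much was found
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

Rounded to two decimal places, these match the 0.13, 0.03 and 0.05 shown for spam in the report.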



We can also show the overall accuracy of the model with this command.

In [38]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,lin_reg_model_predictions))

0.8355263157894737


This accuracy value indicates that the overall accuracy of the model is lower than if the model were just to predict **HAM** for all messages in the test dataset (0.86).
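One way to make this baseline explicit is scikit-learn's `DummyClassifier`, which can always predict the most frequent class. The sketch below is illustrative only: it rebuilds labels with the same ham/spam counts as our test set (1442 and 230) and uses placeholder features, rather than the real data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels with the same class counts as the test set; features are placeholders
y_demo = np.array(["ham"] * 1442 + ["spam"] * 230)
X_demo = np.zeros((len(y_demo), 2))

# Always predicts the most frequent label it saw during fit ("ham" here)
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_accuracy = baseline.score(X_demo, y_demo)
print(round(baseline_accuracy, 4))
```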

Let's evaluate other models that are available to us.

## Train a naïve Bayes classifier

One of the most common classifiers is the naive Bayes model. See http://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes for more information. It is particularly good for document classification and spam filtering.

We will build this model using the same steps we used earlier. We'll import the model from scikit-learn, create an instance of the model, fit (train) the model using our training data, and then predict data using the model.


In [71]:
# First import the model we want to use
from sklearn.naive_bayes import MultinomialNB

# Create an instance of the model - common model for text data and spam filtering
nb_model = MultinomialNB()

# Fit model to training data
nb_model.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [40]:
# Predict answers to data from the X_test dataset
# containing text length and punctuation count
nb_model_predictions = nb_model.predict(X_test)

# Show results in a confusion matrix
print(metrics.confusion_matrix(y_test,nb_model_predictions))

[[1442    0]
 [ 230    0]]


Overall confusion is 230, down from 275 for the logistic regression classifier.

However, we can see that this model never predicts spam: every message in the test data is classified as ham.

We can look at this in more detail with the `sklearn metrics` report. The warning from scikit-learn shows us that the model cannot predict spam within the text messages.

In [41]:
print(metrics.classification_report(y_test,nb_model_predictions))

              precision    recall  f1-score   support

         ham       0.86      1.00      0.93      1442
        spam       0.00      0.00      0.00       230

    accuracy                           0.86      1672
   macro avg       0.43      0.50      0.46      1672
weighted avg       0.74      0.86      0.80      1672



  _warn_prf(average, modifier, msg_start, len(result))
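The warning appears because precision for spam is undefined when the model never predicts spam (there is nothing to divide by). Recent scikit-learn versions (0.22 and later, as far as I know) accept a `zero_division` argument that sets the value to report instead. The four labels below are made up for illustration.

```python
from sklearn.metrics import precision_score

# Hypothetical case: one real spam message, but the model only predicts ham
y_true = ["ham", "ham", "ham", "spam"]
y_pred = ["ham", "ham", "ham", "ham"]

# zero_division=0 reports 0.0 for the undefined precision without a warning
spam_precision = precision_score(
    y_true, y_pred, pos_label="spam", zero_division=0)
print(spam_precision)
```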


And we can view the overall accuracy of the model. Note how the overall accuracy appears to suggest that the model is quite accurate, but it is no better than a model that predicts ham for every message in the test data.

In [72]:
print(metrics.accuracy_score(y_test,nb_model_predictions))

0.8624401913875598


## Random forest model

A random forest is a bagging model, and part of the tree model family. A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. 

See https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html for more information.

We will build this model using the same steps we used earlier. We'll import the model from scikit-learn, create an instance of the model, fit (train) the model using our training data, and then predict data using the model.

In [73]:
# First import the model we want to use
from sklearn.ensemble import RandomForestClassifier

# Create an instance of the model using the default settings
rf_model = RandomForestClassifier()

# Fit model to training data
rf_model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
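Note that `random_state=None` in the output above means the forest is built with fresh randomness on each run, so results can differ slightly between runs. A minimal sketch of making the model repeatable is shown below. The data is made up, and the seed value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical noisy data so that the trees have something to disagree on
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 200, size=(300, 2))
noise = rng.integers(0, 60, size=300)
y_demo = np.where(X_demo[:, 0] + noise > 130, "spam", "ham")

# Two forests built with the same seed give identical predictions
first = RandomForestClassifier(random_state=42).fit(X_demo, y_demo)
second = RandomForestClassifier(random_state=42).fit(X_demo, y_demo)
same = np.array_equal(first.predict(X_demo), second.predict(X_demo))
print(same)
```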

In [44]:
# Predict answers to data from the X_test dataset
# containing text length and punctuation count
rf_model_predictions = rf_model.predict(X_test)

# Show results in a confusion matrix
print(metrics.confusion_matrix(y_test,rf_model_predictions))

[[1352   90]
 [ 105  125]]


Overall confusion is 195 (90 + 105), down from 230 for the Naïve Bayes classifier.

Now we can see that this model is better than Naïve Bayes classifier at predicting spam in our text messages.


In [45]:
print(metrics.classification_report(y_test,rf_model_predictions))

              precision    recall  f1-score   support

         ham       0.93      0.94      0.93      1442
        spam       0.58      0.54      0.56       230

    accuracy                           0.88      1672
   macro avg       0.75      0.74      0.75      1672
weighted avg       0.88      0.88      0.88      1672



It appears that this model is very good at predicting HAM text messages, and better than the other 2 models when it comes to predicting SPAM.

In [46]:
print(metrics.accuracy_score(y_test,rf_model_predictions))

0.8833732057416268


The overall accuracy of the model suggests that it is better than Logistic Regression (0.84) and Naïve Bayes (0.86).



## Train a Support Vector Classifier (SVC)

Let's examine whether an SVM will improve the model accuracy.

See https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC for further information on scikit-learn SVC.

In [47]:
# Import a Support Vector Classification model (SVC)
from sklearn.svm import SVC

In [48]:
# Set gamma to "auto" explicitly; some older scikit-learn
# versions show a warning when gamma is left unset
svc_model = SVC(gamma="auto")
svc_model.fit(X_train, y_train)

svc_model_predictions = svc_model.predict(X_test)

print(metrics.confusion_matrix(y_test, svc_model_predictions))

[[1371   71]
 [ 104  126]]
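For reference, the scikit-learn documentation describes `gamma="auto"` as `1 / n_features` and `gamma="scale"` as `1 / (n_features * X.var())`. The sketch below simply evaluates both formulas on five rows copied from the `X_test` printout earlier. It does not train a model.

```python
import numpy as np

# Five (length, punct) rows copied from the X_test output shown earlier
X_demo = np.array([[28, 1], [45, 3], [26, 0], [7, 1], [107, 4]], dtype=float)
n_features = X_demo.shape[1]

# "auto" depends only on the number of features
gamma_auto = 1.0 / n_features

# "scale" also divides by the variance of all the feature values
gamma_scale = 1.0 / (n_features * X_demo.var())
print(gamma_auto, gamma_scale)
```

With features as spread out as message length, the two settings give very different values, so the choice of `gamma` can change the results.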


Overall confusion is (104 + 71) = 175. This is better than the Naive Bayes model (230), the Random Forest model (195) and the Logistic Regression classifier (275).

And we can examine the metrics report for further detail on overall accuracy of the model.

In [49]:
print(metrics.classification_report(y_test,svc_model_predictions))

              precision    recall  f1-score   support

         ham       0.93      0.95      0.94      1442
        spam       0.64      0.55      0.59       230

    accuracy                           0.90      1672
   macro avg       0.78      0.75      0.77      1672
weighted avg       0.89      0.90      0.89      1672



It appears that this model is very good at predicting **HAM** text messages, and better than the other 3 models when it comes to predicting **SPAM**.

In [50]:
print(metrics.accuracy_score(y_test,svc_model_predictions))

0.8953349282296651


This model is better than the Naive Bayes model (0.86 overall accuracy) and the Logistic Regression classifier (0.84 overall accuracy) and slightly better than the Random Forest model (0.88).
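To pull the four results together, the sketch below recomputes the error count and accuracy for each model from the confusion matrices printed in this lecture. The Random Forest numbers come from one particular run, so yours may differ slightly.

```python
import numpy as np

# Confusion matrices copied from the outputs above
matrices = {
    "Logistic Regression": [[1389, 53], [222, 8]],
    "Naive Bayes": [[1442, 0], [230, 0]],
    "Random Forest": [[1352, 90], [105, 125]],
    "SVC": [[1371, 71], [104, 126]],
}

summary = {}
for name, values in matrices.items():
    matrix = np.array(values)
    correct = int(np.trace(matrix))       # diagonal entries
    errors = int(matrix.sum()) - correct  # off-diagonal entries
    accuracy = round(correct / int(matrix.sum()), 4)
    summary[name] = (errors, accuracy)
    print(name, errors, accuracy)
```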