# Doing ML - the right way

Welcome back! 

In the previous tutorial, we saw how to load and clean our data. We created features to translate text into numbers and we trained a machine learning model using those features. In this tutorial, we will discuss evaluation metrics, imbalanced data, and why it is important to split your dataset into a training and a test set. We will then briefly discuss how to recognise whether we are dealing with high bias or high variance, and finally what happens when there is a mismatch between the training and test set.

Note that throughout this tutorial, some code cells contain the expression `None`. You are expected to delete this `None` and put your own code in its place (one `None` always corresponds to one line of code).

In [None]:
# all the libraries you need for this tutorial
# please run this cell before you proceed further
import re
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, precision_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split

## Data

We are going to use the same dataset as last time: a collection of (Dutch) WhatsApp messages collected by Radboud University. It was (partially) annotated by the NFI for meetings, i.e. messages that indicated meetings were being planned, or that referred to a meeting that had already been planned. You can find more information on the dataset here: https://easy.dans.knaw.nl/ui/datasets/id/easy-dataset:112987.

Last time we worked with the translated messages, but this time we will work with the original Dutch messages. No worries if you don't speak Dutch. You will not be selecting features manually this time.

As before, our dataset contains three columns called 'id' (= identifier of the message), 'text' (= the actual message), and 'label' (1 when a meeting is discussed, and 0 when it is not). There is only one difference: the dataset used in this tutorial has more data points. Let's continue with loading our data and see what we have.

In [None]:
# Using what you learned in the previous tutorials, load the file "wg4_doing_ML_the_right_way.csv" 
# as a pandas dataframe called 'df' and print its first 10 entries
whatsapp_file = None  # name of the file
df = None  # load file
None  # print first 10 entries

In [None]:
# let's see how many entries in total and per label we have
print('Number of entries = ', df.shape[0])
label_counts = Counter(df['label'])
print('Number of entries with label 0 (no meeting) =', label_counts[0])
print('Number of entries with label 1 (meeting planned) =', label_counts[1])
print('Portion of entries with label 1 =', round(label_counts[1]/df.shape[0], 3))

The dataset should contain 1558 entries in total. If not, please double check which file you are loading. 

Look at that! There are many more entries with label 0 than entries with label 1. In the previous tutorial, there were as many entries with label 0 as with label 1, remember? Later, we will discuss whether this is a problem and what we can do in such a situation, but for now let's prepare our dataset.

In [None]:
# As in the previous tutorial let's clean our texts. This time we will make a function to clean texts
# using the same code we used in the previous tutorial.

def delete_substring(pattern, string): 
    return re.sub(pattern, '', str(string))

def clean_text (df): 
    # Lower the text 
    df["clean_text"] = df["text"].str.lower()
    # Delete [removed] from the text
    df["clean_text"] = df.apply(lambda x: delete_substring(r'\[removed]', x["clean_text"]), axis = 1)
    # Remove anything that is not a letter, space or digit
    df["clean_text"] = df.apply(lambda x: delete_substring(r'[^\w\s]', x["clean_text"]), axis = 1)
    return df
     
df = clean_text(df)

# Print the first 10 rows of the dataset
df.head(10)

In [None]:
# Next step is to split our dataset into training and test set.
# we use 80% of our entries for our training set and the remaining 20% for the test set
# we set random_state to a number for reproducibility. In this way, every time you run this cell, the split of the data
# will be exactly the same

df_train, df_test = train_test_split(df, test_size=0.2, random_state=4)
print('Training set:')
print('(number of rows, number of columns) =', df_train.shape)
print('Number of entries with label 0 =', Counter(df_train['label'])[0])
print('Number of entries with label 1 =', Counter(df_train['label'])[1])
print('Portion of entries with label 1 =', round(Counter(df_train['label'])[1]/df_train.shape[0], 3))

print('____________________________________________________')
print('Test set:')
print('(number of rows, number of columns) =', df_test.shape)
print('Number of entries with label 0 =', Counter(df_test['label'])[0])
print('Number of entries with label 1 =', Counter(df_test['label'])[1])
print('Portion of entries with label 1 =', round(Counter(df_test['label'])[1]/df_test.shape[0], 3))

We already saw that 19.5% of our entries have label 1. Notice how close to 19.5% the portion of entries with label 1 in the training and test set is. With a random split of a dataset of this size, the class proportions in both parts usually end up close to those of the whole dataset. If you want `train_test_split` to guarantee this, you can pass the labels to its `stratify` parameter.
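As a small illustration of the `stratify` parameter of `train_test_split`, here is a sketch on a made-up data frame (the name `toy_df` and its contents are invented for this example and are not part of the WhatsApp dataset). With stratification, both parts keep the 20% share of label 1.

```python
from collections import Counter

import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data frame (made up for illustration): 100 rows, 20% of them with label 1
toy_df = pd.DataFrame({
    "text": [f"message {i}" for i in range(100)],
    "label": [1] * 20 + [0] * 80,
})

# stratify=... asks scikit-learn to keep the label proportions the same in both parts
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, random_state=4, stratify=toy_df["label"]
)

print("Train label counts:", Counter(toy_train["label"]))
print("Test label counts: ", Counter(toy_test["label"]))
```

Stratification matters most when the dataset is small or the minority class is rare, because a purely random split could then leave very few minority entries in the test set.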

In [None]:
# Let's continue by creating our features
cv = CountVectorizer()
train_features = cv.fit_transform(df_train['clean_text'])  
test_features = cv.transform(df_test['clean_text'])
# QUESTION: do you remember why we call fit_transform on the training data only?

# let's see the number of features after transformation
print('Training set: (number of rows, number of columns) =', train_features.shape)
print('Test set: (number of rows, number of columns) =', test_features.shape)

WOW! We have way more columns than rows! The model that we use can handle such a sparse dataset.
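To get a feeling for why the feature matrix is so wide and sparse, here is a minimal sketch on three made-up messages (the names `toy_texts` and `toy_cv` are invented for this example). Each distinct word becomes a column, and most messages contain only a few of those words.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up messages, only for illustration
toy_texts = ["tot morgen", "waar ben je dan", "tot morgenavond he"]

toy_cv = CountVectorizer()
toy_features = toy_cv.fit_transform(toy_texts)

# One column per distinct word in the fitted vocabulary
print(sorted(toy_cv.vocabulary_))
print("shape =", toy_features.shape)

# Share of cells that are non-zero: most words do not occur in most messages
density = toy_features.nnz / (toy_features.shape[0] * toy_features.shape[1])
print("share of non-zero cells =", round(density, 3))
```

On a real collection of messages the vocabulary grows much faster than the number of words per message, so the share of non-zero cells becomes far smaller than in this toy case.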

## Evaluation metrics

Before we continue, let's define a function that evaluates a model in terms of accuracy, recall, and confusion matrix. It is something that we will use a lot in this tutorial, so it is good to have such a function instead of copy-pasting the same code all the time (This is why we have functions!!).

But first things first, what is accuracy? Recall? Confusion matrix? Are you confused already?

Assume that we only have 10 entries in our dataset and that we have already trained a model to predict whether a message is about planning a meeting or not (label 1 or 0, respectively). The table below shows the first 10 entries of our dataset and the 'predicted' labels. Let's evaluate our model based on those entries.

| id     | clean_text                                          | label | predicted label |
| -----: | --------------------------------------------------: | ----: | --------------: |
| 116725 | ja k denk dat wij wel eerder klaar zijn was zo...   |   0   |  1              |
| 116726 | ik heb trouwens ook een samenvatting van dat b...   |   0   |  0              |
| 116727 | ik kom speciaal voor jou supetvroeg naar het o...   |   1   |  0              |
| 116728 | maar zelfs als ik kan blijven slapen dan moet ...   |   1   |  1              |
| 116729 |                ligt eraan wanneer je m nodig hebt   |   0   |  1              |
| 116730 |                        wilde plannen dit weekend    |   0   |  0              |
| 116731 |     lol heb al  maanden niet gestofzuigd            |   0   |  0              |
| 116732 | op de officiële site tenminste dat denk ik dat...   |   0   |  1              |
| 116733 |                              tot morgenavond he     |   1   |  0              |
| 116734 |                                   waar ben je dan   |   0   |  0              |



Its **accuracy** is simply how often our model has predicted the correct label. As we can see, our model predicted the correct label 5 out of 10 times, so our accuracy is 5/10 = 0.5.

What about its recall? Well, for recall we first have to pick which class is important to us. In this case, the minority class (label 1) is the one that we are interested in. Then, **recall** is the fraction of entries with label 1 for which the model has also predicted 1. Back to our dummy example, we see that out of 3 entries with label 1, the model provided the correct label only once (the model is not so great, which is why we don't recommend blindly trusting its labels). So our recall is 1/3 = 0.33.

Finally, the **confusion matrix** is simply a 'summary' of the model performance. It provides per label how many entries have been classified correctly and how many not. So the confusion matrix in this case is the following:

| Actual label   | Predicted: no meeting (0)  | Predicted: meeting (1) |
| -------------- | -------------------------- | ---------------------- |
| No meeting (0) | 4                          | 3                      |
| Meeting (1)    | 2                          | 1                      |

Now we can clearly see that 4 entries with label 0 (no meeting) were correctly predicted as 0, while 3 entries with label 0 were misclassified as 1 (meeting). Similarly, we can observe that 2 entries with label 1 were given the wrong prediction by our model and 1 entry with label 1 was correctly classified. We can also use the confusion matrix to calculate the accuracy = (4+1)/(4+3+2+1) = 5/10 = 0.5 and the recall = 1/(2+1) = 1/3 = 0.33.
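The arithmetic above can also be checked by counting in plain Python. This is only a sketch for the 10-row example table; the variable names are chosen for illustration, and the scikit-learn functions used later in this tutorial do the same counting for you.

```python
# Labels and predictions copied from the 10-row example table above
labels      = [0, 0, 1, 1, 0, 0, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 1, 0, 0, 1, 0, 0]

pairs = list(zip(labels, predictions))
tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # no meeting, predicted no meeting
fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # no meeting, predicted meeting
fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # meeting, predicted no meeting
tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # meeting, predicted meeting

accuracy = (tp + tn) / len(pairs)  # share of all entries predicted correctly
recall = tp / (tp + fn)            # share of actual meetings that were found

print("confusion matrix =", [[tn, fp], [fn, tp]])  # [[4, 3], [2, 1]]
print("accuracy =", accuracy)                      # 0.5
print("recall =", round(recall, 2))                # 0.33
```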


To complete the code in the next cell, please check the following links: 
1. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
2. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score
3. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix

In [None]:
# Let's define a simple function that, given predictions and the true labels of some observations, returns
# the accuracy, recall, and confusion matrix
def evaluate_model(ground_truth, predictions):
    
    # Replace None in each line below so that model accuracy, recall, and confusion matrix are calculated.
    acc = None  # accuracy
    recall = None  # recall
    con_ma = None  # confusion matrix
    
    return acc, recall, con_ma

In [None]:
# Run the following cell to test your function
toy_predictions = [0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0]
toy_labels = [0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0]
toy_accuracy, toy_recall, toy_conf_matrix = evaluate_model(toy_labels, toy_predictions)
print('accuracy =', toy_accuracy)
print('recall =', round(toy_recall, 3))
print('Confusion matrix =', toy_conf_matrix.tolist())

If your accuracy is equal to 0.6, good job! If recall is equal to 0.615, amazing! Finally, if the confusion matrix is [[10, 7],[5, 8]] then wonderful! 

## Imbalanced data

You are dealing with imbalanced data when the frequency of one group of entries is much higher than the frequency of another group. In our case, there are 1254 entries with label 0 and only 304 entries with label 1. We refer to the group of entries with label 0 as our majority class and to the group of entries with label 1 as our minority class.

So, why is this a problem? Let's follow the steps of the previous tutorial and closely inspect the performance of our model. This time we will fit a support vector machine model on our data (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html). Like logistic regression, this is a popular classification model, and it often copes well with sparse, high-dimensional datasets. For more information, see [*the Wikipedia page*](https://en.wikipedia.org/wiki/Support_vector_machine).

In [None]:
# Let's first initialize a support vector machine model with the default configuration.
base_model = SVC()

# Replace None below with the command for fitting the base_model on train_features
None  

# Derive predictions on our test set. 
base_predictions = base_model.predict(test_features) 

In [None]:
# Let's check the accuracy of our model
base_accuracy, base_recall, base_conf_matrix = evaluate_model(df_test['label'], base_predictions)
base_accuracy

Oh nice, almost 81%! Not bad at all, right? But what about our recall? When the text is actually about planning a meeting, how often did our model predict the correct label?

In [None]:
base_recall

Oh, that is not nice at all! So of all the entries with label 1 in our test set, only 3.2% were correctly classified. Let's inspect the confusion matrix to see what happened.

In [None]:
# Just run the cell. The code below is just to make the confusion matrix easier to read.
cmd = ConfusionMatrixDisplay(base_conf_matrix)
cmd.plot();

What do you observe?

So, what can we do about this? Below, we briefly discuss a few ways of working with imbalanced data:

 - Handling imbalanced data at the data level: For example, oversampling the minority class (e.g. creating new entries by duplication) or undersampling the majority class (e.g. removing entries). Be mindful of the shortcomings of these techniques. By undersampling your majority class, valuable information may get lost, and by oversampling your minority class, your model might end up 'recognizing' those instances so well that it isn't able to generalize to other instances that are also part of the minority class.

 - Handling imbalanced data at the algorithm level: Carefully choose the parameters of the model. For example, logistic regression and support vector machines in scikit-learn have a parameter called 'class_weight' that you can set to 'balanced'. When you do that, mistakes made on the minority class during training are penalized more heavily than mistakes on the majority class, so the model puts more effort into correctly classifying the minority class.

 - Of course, one can always use a combination of the aforementioned approaches. 
 
 
Here, we only covered a few approaches. Feel free to search online for other ways to handle imbalanced data. And remember that each case is unique: a method that worked well in one situation might not be the best in another.
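To make the data-level approaches concrete, here is a minimal sketch of random oversampling and random undersampling with pandas. The data frame `toy_train` is made up for this example. Note that resampling should only be applied to the training set, so that the test set still reflects the real class proportions.

```python
import pandas as pd

# Made-up training data: 8 entries with label 0 and 2 entries with label 1
toy_train = pd.DataFrame({
    "text": [f"message {i}" for i in range(10)],
    "label": [0] * 8 + [1] * 2,
})
majority = toy_train[toy_train["label"] == 0]
minority = toy_train[toy_train["label"] == 1]

# Random oversampling: draw minority rows with replacement until both classes have equal size
oversampled = pd.concat([
    majority,
    minority.sample(n=len(majority), replace=True, random_state=0),
])

# Random undersampling: keep only as many majority rows as there are minority rows
undersampled = pd.concat([
    majority.sample(n=len(minority), random_state=0),
    minority,
])

print("oversampled: ", oversampled["label"].value_counts().to_dict())
print("undersampled:", undersampled["label"].value_counts().to_dict())
```

The oversampled frame contains exact duplicates of the two minority rows, which illustrates the risk mentioned above: the model sees the same few examples again and again.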
 
 In this tutorial, we will set the parameter class_weight to 'balanced' when setting up our support vector machine model.
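To see what 'balanced' means in numbers, the sketch below computes the class weights with scikit-learn's `compute_class_weight` helper and compares them with the formula n_samples / (n_classes * count of the class). It uses the label counts of the whole dataset as an illustration; in the tutorial itself the model is fitted on the training set, so the weights it uses will differ slightly.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Label counts of the full dataset as reported above (1254 x label 0, 304 x label 1)
y = np.array([0] * 1254 + [1] * 304)

# Weights as scikit-learn computes them for class_weight='balanced'
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

# The same weights by hand: n_samples / (n_classes * count_of_class)
manual = len(y) / (2 * np.bincount(y))

print("weight for label 0 =", round(weights[0], 3))
print("weight for label 1 =", round(weights[1], 3))
print("matches the manual formula:", np.allclose(weights, manual))
```

The rarer a class is, the larger its weight, so a mistake on a label-1 entry counts roughly four times as much as a mistake on a label-0 entry here.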

In [None]:
# We initialize a new support vector machine with the class_weight set to 'balanced'.

model = None

model.fit(train_features, df_train['label'])
predictions = model.predict(test_features) 

In [None]:
# Let's check the performance of this model
accuracy, recall, conf_matrix = evaluate_model(df_test['label'], predictions)
accuracy

Hmm, that is slightly less than before. Let's see our recall and confusion matrix!

In [None]:
print('Recall = ', recall)
print('Confusion Matrix:')
cmd = ConfusionMatrixDisplay(conf_matrix)
cmd.plot();

Oh, our recall went from 0.03 to 0.5. That's quite an improvement. So, even though we lost a bit of overall performance, we gained quite a lot in correctly classifying the minority class. (Of course, there is still quite some room to further improve this model!)

By the way, in the previous tutorial (wg3_intro_to_ML), we actually undersampled the majority class for you by randomly removing entries with label 0. Which approach is better for this dataset? Can you think why? 

Note that in the previous tutorial, we used logistic regression as our model. If you want to directly compare the undersampling method to the use of the 'class_weight' parameter, it would be fairer to also use a logistic regression model in this tutorial. So, feel free to go back and replace the 'SVC' model with 'LogisticRegression' (who knows, you might even get better results).

## Training vs test set

One of the most important steps is to split our dataset into a training and a test set, but why? In some cases, we have to split it into more parts, but that is a story for another day. Let's see what happens when we evaluate our model on the training set.

In [None]:
# metrics on the training set
# Replace None with the appropriate line of code
train_predictions = None  # derive prediction using train_features

# compute the evaluation metrics on the training set
train_accuracy, train_recall, train_conf_matrix = evaluate_model(df_train['label'], train_predictions)

In [None]:
print('Training set:')
print('Accuracy =', train_accuracy)
print('Recall =', train_recall)

print('_______________________________')
print('Test set:')
print('Accuracy =', accuracy)
print('Recall =', recall)

What do you observe? 


Our training accuracy and recall are higher than our test accuracy and recall. A machine learning model will almost always achieve better performance on data that it has already seen. That is why we split the dataset into two parts: we train our model on one part and evaluate its performance on the other part.
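An extreme example makes this point visible. The sketch below is not the model used in this tutorial: it swaps in a 1-nearest-neighbour classifier on synthetic data (both chosen only for illustration), because such a model simply memorises its training set. Its training accuracy is perfect, which says nothing about how it does on unseen data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with 10% randomly flipped labels (label noise)
X, y = make_classification(n_samples=400, n_features=20, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# A 1-nearest-neighbour classifier memorises the training set:
# every training point is its own nearest neighbour
memoriser = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

train_acc = accuracy_score(y_train, memoriser.predict(X_train))
test_acc = accuracy_score(y_test, memoriser.predict(X_test))
print("train accuracy =", train_acc)
print("test accuracy  =", round(test_acc, 3))
```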

## Bias or variance? High or low?

What conclusions can be drawn from the difference in performance between the training and test set? One can argue that we are dealing with overfitting (high variance), and that is indeed the case here. This means that our model 'learned' the training data so well that it is not able to generalize and properly classify unseen data. You are dealing with this kind of situation when the performance on the training set is much higher than the performance on the test set.

In our case, this can be due to class imbalance. We could consider trying other methods, or a combination of methods, to handle it better. Or it might be because we are using far too many features, in which case we have to figure out how to reduce the dimension of the data without losing information. Finally, we could go back to our model and 'play' with the parameters that control its regularization (and so help against overfitting). Sometimes, it helps to just simplify the model. There are several things that one can try.

Let's try playing with the regularization parameter of the SVC model. For SVC, the regularization parameter is called C. 

What is the default value for the regularization parameter C? You can find the answer in the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html. As we have not specified the parameter, this is the value that we have been using until now.


In [None]:
# We want to initialize new SVC models with higher and lower regularization than the default value to see the effect on overfitting

# Let's make a function for this 
# Replace None with the appropriate code to initialize the SVC model with the regularization parameter C

def create_regularized_SVC(C): 
    model = None
    return model

In [None]:
# Here is a function to fit and evaluate your new SVC model. We have put all the functions for fitting, evaluating and 
# printing the metrics in one function to make the code easier to reuse. 

def fit_and_evaluate_SVC(model):
    model.fit(train_features, df_train['label'])
    predictions = model.predict(test_features)
    accuracy, recall, conf_matrix = evaluate_model(df_test['label'], predictions)
    
    train_predictions = model.predict(train_features) 
    train_accuracy, train_recall, train_conf_matrix = evaluate_model(df_train['label'], train_predictions)
    
    print('_______________________________')
    print('Training set:')
    print('Accuracy =', train_accuracy)
    print('Recall =', train_recall)

    print('_______________________________')
    print('Test set:')
    print('Accuracy =', accuracy)
    print('Recall =', recall)
    
# Initialize an SVC model with low regularization (for SVC models, a high value for C means low regularization) by setting C to 100
# make use of the create_regularized_SVC function you have created.
# Replace None with the appropriate code.

model = None

# We will fit and evaluate the model 
fit_and_evaluate_SVC(model)


So with lower regularization the overfitting has gotten worse. Look at the accuracy and recall on the training data! They are above 99%! The accuracy on the test data is also a bit higher than before, but the recall is much lower. The model now fits the training data so closely that it is not able to generalize to unseen data such as the data in the test set.

Let's try the opposite: We will make the regularization stronger by lowering the value for C below the default value.

In [None]:
# Initialize an SVC model with stronger regularization than the default (the lower the value for C, the stronger the regularization).
# Create an SVC model with C set to 0.5 by replacing the None with the appropriate code

model = None

# We will fit and evaluate the model 
fit_and_evaluate_SVC(model)


Increasing the regularization has reduced overfitting, as the accuracy on the training data and on the test data is now similar (both around 0.8). In that sense, the model generalizes well.

However, this does not mean that the model is good. Its recall is terrible! In fact, the recall is zero. What does that mean? Why do you think the accuracy is still around 80%?

We can also combine multiple methods. What if we increase the regularization and also set the class_weight parameter to `balanced` in our SVC model?

In [None]:
# Let's make a function for creating an SVC model that accepts a regularization parameter as input and has the class_weight set to 'balanced'
# Replace None with the appropriate code

def create_regularized_SVC_balanced(C): 
    model = None
    return model

# Create a SVC model with the regularization parameter (C) set to 0.5 and class_weight = 'balanced' 

model = create_regularized_SVC_balanced(0.5)

# We will fit and evaluate the model 
fit_and_evaluate_SVC(model)

That looks better! The higher performance on the test set means that the model is able to generalize better. In other words, this model works better on data that it has not seen before.

Have fun and play around with the regularization parameter to see what happens and if you can improve the model even more. 
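If you want to try several values at once, a loop like the one below can help. This is only a sketch on synthetic, imbalanced stand-in data (generated with `make_classification`, not the WhatsApp messages), and the list of C values is an arbitrary choice for illustration. In the notebook you would use `train_features`, `test_features`, and the labels from `df_train` and `df_test` instead.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (about 80% label 0), not the WhatsApp messages
X, y = make_classification(
    n_samples=600, n_features=30, weights=[0.8, 0.2], flip_y=0.05, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

results = {}
for C in [0.01, 0.1, 10, 100]:  # small C = strong regularization, large C = weak regularization
    svc = SVC(C=C, class_weight="balanced").fit(X_train, y_train)
    train_recall = recall_score(y_train, svc.predict(X_train))
    test_recall = recall_score(y_test, svc.predict(X_test))
    results[C] = (train_recall, test_recall)
    print(f"C={C}: train recall={train_recall:.3f}, test recall={test_recall:.3f}")
```

A large gap between the two columns for a given C hints at overfitting, while low values in both columns hint at a model that is too constrained.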


## Data mismatch

So far we have used WhatsApp messages in Dutch to train and evaluate our model (and we have just discussed why we need to split our dataset into a training and a test set). Now that we have a trained model and we know how well it performs, can we apply it to any kind of text?

If your answer is 'no, because our model still needs some work to improve its performance', assume that it performs great on training and test set. What about now? Can we use it on any text that we are interested in? Let's start with the basics, can we use it on English text? What if the text is in Dutch but it's an email? Or a transcription of a telephone conversation? What do you expect to occur in these situations?

To get a feeling for what to expect in these situations, let's use our model on a few instances that are not part of the dataset with the WhatsApp messages. We will use data from the Instagram page @liefdevantoen. It contains personal ads from newspapers published between 1840 and 1940, some of which discuss meetings. Let's see how our model classifies those!

In [None]:
# let's load the new dataset
new_file = 'wg4_doing_ml_right_cross_domain_data.csv'
cross_domain_df = pd.read_csv(new_file)
cross_domain_df.head(10)

In [None]:
# how many entries in total and per label are there?
print('Number of entries = ', cross_domain_df.shape[0])
print('Number of entries with label 0 (no meeting) =', Counter(cross_domain_df['label'])[0])
print('Number of entries with label 1 (meeting planned) =', Counter(cross_domain_df['label'])[1])

In [None]:
# Let's prepare our dataset the same way as the WhatsApp data using the clean_text function we made

cross_domain_df = clean_text(cross_domain_df)

# and let's transform our text to features
# replace None so that we get the same features as before (remember? we fitted a CountVectorizer earlier)
features = None


In [None]:
# replace None to derive the model predictions
cross_domain_predictions = None

In [None]:
# derive the evaluation metrics
cd_accuracy, cd_recall, cd_conf_matrix = evaluate_model(cross_domain_df['label'], cross_domain_predictions)

In [None]:
# replace None to print the accuracy
None


In [None]:
# print the recall
print('Recall =', cd_recall)

In [None]:
# print the confusion matrix
print('Confusion Matrix:')
cmd = ConfusionMatrixDisplay(cd_conf_matrix)
cmd.plot();

So what happened to our metrics? Why?

Over time, language and spelling conventions have changed, which can make it more difficult for our model to classify the messages correctly. In such situations, one has to think carefully about how to close the gap between the data available for training and testing the model and the data for which we actually want predictions. A thoughtful choice of features can be quite helpful here.
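One concrete mechanism behind this gap is that a fitted `CountVectorizer` silently ignores every word it did not see during fitting. The sketch below uses made-up chat messages and a made-up old-fashioned advertisement (none of them taken from the real datasets) to show that a text from another domain can end up with no non-zero features at all.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up examples: modern chat-style training texts and one old-fashioned advertisement
chat_texts = ["tot morgen", "waar ben je dan", "zullen we morgen afspreken"]
old_ad = ["heer zoekt kennismaking met eene dame"]

# The vocabulary is learned from the chat texts only
toy_cv = CountVectorizer().fit(chat_texts)
ad_features = toy_cv.transform(old_ad)

# Split the ad into words the same way the vectorizer does, and look them up
ad_words = toy_cv.build_analyzer()(old_ad[0])
known = [word for word in ad_words if word in toy_cv.vocabulary_]

print("words in the ad:", ad_words)
print("words known to the vectorizer:", known)
print("non-zero features for the ad:", ad_features.nnz)
```

When most words of a new text fall outside the vocabulary, the model has very little information to base its prediction on.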

# Further (optional) exercises

1. You learned about cross-validation today. In this workbook we have only used one train-test split. You can now experiment with cross-validation. Have a look at this cross_val_score function:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score. By default it does a 5-fold cross-validation. You can also try to implement a 10-fold cross-validation instead by changing the cv parameter. 

2. You also learned about oversampling as a solution for imbalanced data. A good package for this is imbalanced-learn (shortened: imblearn). By running the cell below you can install and import that package.


In [None]:
!pip install imbalanced-learn

from imblearn.over_sampling import SMOTE, RandomOverSampler

Now you can experiment with both random oversampling (RandomOverSampler) and SMOTE (SMOTE) on the training set. You will use train_features and df_train['label']. Have a look at how it changes the label counts using the Counter() function.

Also see these links:

- https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.RandomOverSampler.html
- https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html?highlight=smote#imblearn.over_sampling.SMOTE
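Coming back to the first optional exercise, a call to `cross_val_score` could look like the sketch below. It runs on synthetic stand-in data generated with `make_classification`, and the choice of an SVC with `class_weight='balanced'` and of recall as the score is only an assumption for illustration. With your own data you would pass your features and labels instead of `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (about 80% label 0)
X, y = make_classification(n_samples=500, n_features=20, weights=[0.8, 0.2], random_state=0)

svc = SVC(class_weight="balanced")

# cv=10 requests 10-fold cross-validation; scoring='recall' reports recall for label 1
scores = cross_val_score(svc, X, y, cv=10, scoring="recall")

print("recall per fold:", scores.round(3))
print("mean recall =", round(scores.mean(), 3))
```

The spread of the per-fold scores gives an impression of how much a single train-test split can depend on chance.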

# Kaggle Challenge

Do you want to put your data science skills to the test? See if you can implement your own model that predicts poisonous mushrooms in the Kaggle challenge: https://www.kaggle.com/t/3fb3213893214f28825b0f8848e471c9