# Introduction to Machine Learning

Welcome to the practical on Machine Learning. 
Today we are going to start by loading and preprocessing a new dataset. Then we'll see how feature extraction can be done and how these features can be used to train a model.   

The dataset we will be working with is a collection of WhatsApp messages that was collected by Radboud University. You can find more information on the dataset here: https://easy.dans.knaw.nl/ui/datasets/id/easy-dataset:112987.

This dataset was originally in Dutch, but we have translated it automatically into English. You might notice some strange mistakes that are caused by the translation.

This dataset has been (partially) annotated by the NFI in order to tag *meetings*, i.e., messages which indicate that meetings were being planned, or which refer to a meeting that had already been planned. Our task is to train a model which can (correctly) classify messages. That is, does a message make reference to a meeting? 

Let's get started!

## Loading our data and preprocessing

In [None]:
# let's load our dataset to see what it looks like
import pandas as pd

whatsapp_file = 'wg3_intro_to_ml_data_balanced_translated.csv'
df = pd.read_csv(whatsapp_file)

display(df.head())
print('Total number of entries:', len(df))

You can still see the original text (in Dutch) in column `dutch_text` and the new translated English text in the `text` column. Let's remove the Dutch text.

In [None]:
# Here we simply drop (or leave out) the column with the Dutch text
df = df.drop(labels = ['dutch_text'], axis=1)
df.head()

As you can see above, our data consists of 3 columns called 'id', 'text' and 'label' and it contains 608 entries.  The labels of this data are 1 and 0: 1 for when a meeting is discussed, and 0 otherwise.

In [None]:
# let's see how many labels we have of each
# we can do this using Counter
from collections import Counter
Counter(df['label'])

In [None]:
# let's investigate the text entries of our dataset a little bit closer
# setting the max column width
pd.options.display.max_colwidth = 90 # you can adjust this number so that the text is legible on your screen

# display the first 15 entries of the text column
df["text"][:15]

As you can see, there is a lot going on here. The messages contain capital letters and punctuation, and it seems that someone has replaced certain words with [REMOVED].

In order to classify whether people are discussing meeting up, we do not need these things. In fact, they could make our life a lot harder (you will find out why later in this notebook). So let's clean up a bunch of these things.

In [None]:
# We create a new column called "clean_text" that contains the cleaned data
# use .lower() to lowercase any string
df["clean_text"] = df["text"].str.lower()

Now that we have lowercased the text, we still want to remove punctuation and [REMOVED] from it.
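To make these two clean-up steps concrete before applying them to the whole column, here is a minimal sketch on a single message. The message is made up for illustration and does not come from the dataset; the two patterns are the kind of regular expressions you would pass to `re.sub`.

```python
import re

# hypothetical example message, not taken from the dataset
message = "hey [removed], are you free at 8? see you there!"

# step 1: delete the literal placeholder "[removed]"
# (the backslashes make the square brackets literal characters)
without_placeholder = re.sub(r'\[removed\]', '', message)

# step 2: delete every character that is not a word character (\w) or whitespace (\s)
without_punctuation = re.sub(r'[^\w\s]', '', without_placeholder)

print(without_punctuation)
```

Note that `\w` matches letters, digits and the underscore, so the digit `8` survives, and that deleting a substring can leave a double space behind.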

In [None]:
# We will import the re package which allows us to search for sequences in our text that match a predefined pattern.
import re

# re allows us to substitute any given string with something else, in this case an empty string, effectively deleting it
def delete_substring(pattern, string): 
    return re.sub(pattern, '', str(string))
     
# we use df.apply() to call the function delete_substring on each row, which takes as input the substring to 
# delete ([removed]), and the cleaned text
df["clean_text"] = df.apply(lambda x: delete_substring(r'\[removed]', x["clean_text"]), axis = 1)

# below we substitute anything that is not a word character (letter, digit or underscore) or whitespace
# (in practice that is punctuation) with an empty string
df["clean_text"] = df.apply(lambda x: delete_substring(r'[^\w\s]', x["clean_text"]), axis = 1)

# display the first 15 entries of column clean_text
df["clean_text"][:15]

## Feature creation
Machine learning algorithms take numbers as input. This can be a bit counter-intuitive since we work with human-language data. 

The simplest method of creating input (called *features*) from language, is to count whether/how often certain words occur in our sample. In our case, words of interest may be *where* and *when*.

In [None]:
# So let's create some features:

relevant_words = [
        "where",
        "when" 
]

### Exercise: complete the function below

In [None]:
# now it's your turn! 
# write a function below which returns 1 if a certain word is in a string, and 0 if that word is not in the string

def count(word, string):
    ### YOUR CODE HERE! ###
    pass
    
# you can test if it works by uncommenting the code below:
# word = "hello"
# string = "hello world!"
# count(word, string) # should return 1

### Using occurrence of words as features

In [None]:
# the code below looks quite complicated, but don't get discouraged, I'll explain what happens:

# we loop through our list of relevant words, which contains, in this case, the words "where" and "when"
for word in relevant_words:
    # we create a new column in the dataframe for each relevant word, where the column name is the word itself (df[word])
    # we use df.apply() to call the function count() on each row, which takes as input the relevant word, and the cleaned text
    df[word] = df.apply(lambda x: count(word, x["clean_text"]), axis=1)

# the result can be seen below, for each relevant word we have a column in the data, that contains a 1 
# if that particular word occurs in the text, and 0 if it does not
df[["clean_text", "label", "where", "when"]].head()

As you can see above, the words we've chosen are actually not present in messages 2 and 3, which are messages about 
meetings (i.e., column label is equal to 1). This indicates that this choice of features may not be adequate for training a model, but we will work with it for the purpose of illustration.

## Splitting the data

There is one more thing we need to do before we can start training: divide our data into a training set and a test set. This is an important step for detecting *overfitting*: a model that fits its training data too closely and therefore fails to generalise to additional data. Such a model cannot reliably be applied to classify unseen data. Keeping a small part of the data separate allows us to check whether our model is overfitting on the training data or not.

In [None]:
# Luckily, sklearn has a function that does this for us
from sklearn.model_selection import train_test_split

# we use 20% of our data for testing, as indicated in the variable test_size
# the random_state variable controls the shuffling of the data before the split is applied
df_train, df_test = train_test_split(df, test_size=0.2, random_state=4)

print(len(df_train))
print(len(df_test))

df_train[["clean_text", 
          "label",
          "where", 
          "when" 
        ]].head()

### Visualizing the data

As you can see from the indices of the data, the dataset has been shuffled before dividing the dataset into train and test sets. Before we train our model, let's have a look at our data and the features we just created.

In [None]:
import numpy as np

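# this is just some code to make sure not all dots overlap, so we can see how many points are where
# we work on a copy of the training set so that pandas does not warn about modifying a slice of df
df_train = df_train.copy()
df_train["jittered_where"] = df_train["where"] - 0.3 * np.random.rand(len(df_train)) - 0.05
df_train["jittered_when"] = df_train["when"] - 0.3 * np.random.rand(len(df_train)) - 0.05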

# let's plot our data!
ax2 = df_train.plot(kind='scatter',
                     x="jittered_where",
                     y="jittered_when",
                     c="label",
                     colormap="winter")



Above you see a scatterplot, on the x-axis is whether the word "where" occurs (1) or not (0), and on the y-axis you can see if the word "when" occurs (1) or not (0). If the messages discuss a meeting, they are green; otherwise, they are coloured blue. 

As you may have noticed, our features do not allow for a perfect split between the samples that discuss a meeting and those that do not. So, most likely, our machine learning model won't be able to perfectly distinguish them either.

## Training a model
Let's see how a model can be trained using the features above. We use scikit-learn again (also known as sklearn), a common package that includes not only a function to split datasets but many machine learning models as well. See also [*the website*](https://scikit-learn.org/stable/).

We will start with one of the most basic models in machine learning: *logistic regression*. [*Logistic regression*](https://en.wikipedia.org/wiki/Logistic_regression) models the probability of an event based on the input variables. In this case, it models the probability that a message is about a meeting, given the features that we created above.

In [None]:
from sklearn.linear_model import LogisticRegression

# we initiate the model so it's ready to use
model = LogisticRegression()

Now that we have created the model, we can fit it to our data (the data being the features we have derived and the associated labels). 'Fitting' simply means finding the weights in our logistic regression equation that best 'fit' our data, i.e. which most often give the right result (NB: technically, we are optimising a loss function).
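To see what these weights look like, here is a small sketch on made-up toy data (two binary features, similar in spirit to our `where` and `when` columns). The names `X_toy`, `y_toy` and `toy_model` are only for illustration. The sketch reads the fitted weights from the model and checks that passing the weighted sum through the sigmoid function reproduces the probabilities that scikit-learn reports.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy data (made up): two binary features per sample
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [1, 1]])
y_toy = np.array([0, 0, 1, 1, 0, 1])

toy_model = LogisticRegression()
toy_model.fit(X_toy, y_toy)

# the fitted weights (one per feature) and the intercept
weights = toy_model.coef_[0]
intercept = toy_model.intercept_[0]

# logistic regression turns the weighted sum into a probability with the sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

manual_probabilities = sigmoid(X_toy @ weights + intercept)

# predict_proba returns one column per class; column 1 is the probability of label 1
model_probabilities = toy_model.predict_proba(X_toy)[:, 1]

print(weights, intercept)
print(np.allclose(manual_probabilities, model_probabilities))
```

A positive weight means that the presence of that feature pushes the predicted probability towards label 1; a negative weight pushes it towards label 0.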

In [None]:
model.fit(
     df_train[[ 
         "where", # 'where'
         "when" # 'when'
        ]], 
    df_train['label']
)

And that's it! We now have a trained model that can be used to predict whether new messages are about a meeting. Congratulations, you have just done machine learning!

Now let's have a quick look at how well our model does. Keep in mind that we look at the performance on the test data (df_test). The performance on the training data (df_train) is probably higher, as the model has already seen that data. It's like doing an exam when you've already discussed the questions in class!

In [None]:
from sklearn.metrics import accuracy_score

model_predictions = model.predict(
    df_test[["where", "when"]]) 

accuracy_score(df_test['label'], model_predictions)

An accuracy of 0.52 is not great: the model performs only slightly better than the toss of a coin. In fact, we might almost as well have assigned our samples randomly to one of the classes.
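One way to make such a comparison explicit is to compute a baseline. The sketch below uses scikit-learn's `DummyClassifier`, which ignores the features and always predicts the most frequent training label. The labels here are made up and assumed to be perfectly balanced; they only stand in for a real train/test split.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# made-up, perfectly balanced labels standing in for a train and test set
y_train_toy = np.array([0, 1] * 50)
y_test_toy = np.array([0, 1] * 10)

# the features are ignored by this baseline, so zeros are enough
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

# a baseline that always predicts the most frequent training label
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_toy, y_train_toy)

baseline_accuracy = accuracy_score(y_test_toy, baseline.predict(X_test_toy))
print(baseline_accuracy)
```

On balanced labels a constant prediction is right half of the time, so any useful model should score clearly above 0.5.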

Can you think of ways to improve our model? For instance, other words that would be relevant for our case?  

### Exercise: Improve the performance of our model by adding a few extra words to our list of relevant_words. 

The code that you need in order to do this is below. You can also use the code block directly below to inspect some messages with a positive label. This is optional but can be very helpful.

In [None]:
# (optional) If you want, you can use this block to write some code to have a look specifically at messages about meetings

def show_messages_with_label(label): 
    #YOUR CODE FOR SELECTING DATA WITH A CERTAIN LABEL
    pass

# Uncomment the code below to check if your function works. It should return True. 
# We are checking if the length (len()) of the data is equal to the sum of all the labels which should
# be the case if all the messages have the label 1

# df_with_one_label = show_messages_with_label(1)
# print(sum(df_with_one_label['label']) == len(df_with_one_label)) 
      
# Uncomment the code below to use your function to select 10 messages with the positive (1) label

# show_messages_with_label(label=1)['clean_text'].head(10)

In [None]:
# Remember to add some extra words to the relevant_words to improve performance. 
# Make sure you get the indentation right.

# Possible words that help the model are: 

relevant_words = [
    "where", 
    "when",
]

for word in relevant_words:
    df[word] = df.apply(lambda x: count(word, x["clean_text"]), axis=1)

    
# uncomment the print statements if you wanna check what your data looks like now
# print(df.head())

df_train, df_test = train_test_split(df, test_size=0.2, random_state=4)

# display(df_train[relevant_words])

model.fit(
    df_train[relevant_words], 
    df_train['label']
)

model_predictions = model.predict(
    df_test[relevant_words]) 

new_accuracy = accuracy_score(df_test['label'], model_predictions)
print(new_accuracy)

# 0.5163934426229508 is the accuracy of our previous model
if new_accuracy > 0.5163934426229508:
    print("Congratulations, you improved the accuracy of the model!")
else:
    print("Your model did not outperform our previous model, try again")


## Count Vectorizer
There may be a lot more words that indicate meetings being discussed, that we did not think of yet. One way of not having to think of all these words is by using a count vectorizer. This function creates a matrix (= list of lists) that counts how often any word that occurs at least once in the entire dataset (= the vocabulary) occurs in each sample. 

An example from the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html 

```
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

vectorizer.get_feature_names_out()
array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], ...)

print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
```
As you can see, 'and' is the first column of the array above. It occurs only in the third sample, so there is a 1 in that row, and a 0 in the others. The second word is 'document'. It occurs once in the first and fourth samples, and twice in the second sample. Therefore there is a 1 in the first and fourth rows of the second column, and a 2 in the second row of the second column.

### Question: Can you think of a reason why we lowercased our data and removed all punctuation?

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
features_train = cv.fit_transform(df_train["clean_text"])
features_test = cv.transform(df_test["clean_text"])
print(features_train.shape)
print(features_test.shape)

As you can see above, our training data now has 486 (= size of the subset) rows and 1045 (= vocabulary size) columns.

Mind that we call fit_transform() on the training data and transform() on the test data. This means that we create features based on the training data, and then check if those features occur in the test data. If you were to call fit_transform() twice you would just have two completely different datasets based on completely different vocabularies (try it!).

This means that any words in the test set that are not in the vocabulary of the training set are disregarded. This may sound inconvenient, but you cannot train on what you don't have. It is yet another example of why you need a large and diverse training set.

In [None]:
# when we print a part of the feature names, we see that they contain words from our data

print(cv.get_feature_names_out()[:15])

# you can see that the features look like a big list of lists of word counts
# most of them are zeroes because only very few words from the vocabulary actually occur in each sample

print(features_train.toarray())

Now that we have these features, all we have to do is feed them to the model we defined before. We do not need to specify each of our features like we did before: we stored them in a variable called `features_train`, which we can pass to the model as a whole.

In [None]:
model.fit(features_train, df_train["label"])
model_predictions = model.predict(features_test)
print(accuracy_score(df_test['label'], model_predictions))

And look! We significantly improved the performance of our model!

### Further exercises:

1. Last time you learned about word clouds. See if you can make a word cloud for the messages that are about meetings (with label 1). You will need to run the cell below to install and import the functions you will need. You can look here for help: 
https://www.python-graph-gallery.com/wordcloud/

You will also need to join all the texts into one long string using the following: 

```
joined_text = ' '.join(texts_to_join)
```

You should do this after you have selected all the texts with a positive label

In [None]:
!pip install wordcloud

from wordcloud import WordCloud
import matplotlib.pyplot as plt

2. When we split the data into a train and a test set, we set a *random_state*, and we mentioned that *random_state* controls the shuffling of the data:

```
df_train, df_test = train_test_split(df, test_size=0.2, random_state=4)
```
 
In detail, the random state lets you control the randomness that train_test_split() uses. If you don't set a random_state, your train and test sets will be different every time you run this function (even if you run it on the same data). Remove the random state and run the cell a few times to see how the top of the training set changes every time. If you retrained the model after every new split, you would see that its performance differs slightly each time as well. This makes sense because the train and test sets are different. This is why we often use cross-validation when we report the performance of our models in academic papers: we run the model multiple times with different train/test splits, to see how our model performs on average.
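As a rough illustration of cross-validation, the sketch below uses scikit-learn's `cross_val_score` on synthetic stand-in data generated with `make_classification`. The data and the names `X_toy` and `y_toy` are assumptions for the example; in this notebook you would pass your own features and labels instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in data; in the notebook you would use your own features and labels
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# cv=5 splits the data into 5 parts; each part is used once as the test set
scores = cross_val_score(LogisticRegression(), X_toy, y_toy, cv=5, scoring="accuracy")

print(scores)         # one accuracy per split
print(scores.mean())  # the average over all splits
```

Reporting the mean (and ideally the spread) of these scores gives a more stable picture than a single train/test split.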

3. Before we generated features for our dataset, we converted the messages to lower case and removed punctuation. This is called *normalisation*. Other normalisation steps include removing all numbers or, if they are essential, converting them to textual form; removing extra white space; and removing stop words (i.e., words which can be filtered out with little loss of information, such as "the", "and", "of", etc.). Try implementing them!

For stop words you could have a look here: https://www.geeksforgeeks.org/removing-stop-words-nltk-python/

4. Now that you have finished the notebook, go back to the section "Training a model" and experiment with different models. You can replace logistic regression with other models, such as Naive Bayes (from sklearn.naive_bayes import GaussianNB) or a Support Vector Machine (from sklearn.svm import SVC). Note that GaussianNB expects a dense array, so if you use it with the count vectorizer features you need to convert them first with .toarray(). These models are frequently used with language data, but you can also check the documentation (https://scikit-learn.org/stable/supervised_learning.html) to try out other models.
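As a starting point for this exercise, here is a minimal sketch of how models can be swapped. The mini corpus and its labels are made up, and `MultinomialNB` is used here instead of `GaussianNB` because it accepts the sparse matrix that `CountVectorizer` produces; that swap is our choice for the example, not part of the exercise text.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# made-up mini corpus: 1 = about a meeting, 0 = not
texts = ["shall we meet at eight", "see you at the station tomorrow",
         "lol that is funny", "i am so tired today",
         "meet me at the cafe", "that movie was great"]
labels = [1, 1, 0, 0, 1, 0]

features = CountVectorizer().fit_transform(texts)

# every scikit-learn classifier offers the same fit() / predict() interface
for candidate in [LogisticRegression(), MultinomialNB(), SVC()]:
    candidate.fit(features, labels)
    predictions = candidate.predict(features)
    print(type(candidate).__name__, predictions)
```

The predictions above are made on the same messages the models were trained on, so they only show that the interface works. To compare models fairly, evaluate them on a held-out test set as in the rest of this notebook.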

# Kaggle Challenge

Do you want to put your data science skills to the test? See if you can implement your own model that predicts poisonous mushrooms in the Kaggle challenge: https://www.kaggle.com/t/3fb3213893214f28825b0f8848e471c9