
# Machine Learning Workflow Revisited

In the initial notebooks of the course we introduced you to the machine learning workflow based on the `Iris` and `Digits` datasets.

We now revisit the machine learning workflow and dive a bit deeper into some of the aspects we covered in class. We will also start to work on datasets that you have created.
This notebook provides an introduction to creating and evaluating a text classification model using the `scikit-learn` library. 

### Machine Learning Workflow

The machine learning workflow as we know it involves the following steps:

1. Dataset Curation
2. Dataset Provisioning
3. Model Training Run
4. Evaluation
5. Iterative Optimisation

As a new element we will introduce text as input samples. All samples we have worked on before were based on features that were already numeric.
With text as input, we first have to consider how to transform the data into numeric features.


## 1. Dataset Curation

The data we will use are tweets collected from the UK around the time-frame of the original Brexit discussion.
Please note that the data is not filtered in any way and might contain offensive content. 

The data has been annotated with two classes:
* Brexit : tweets that relate to the topic Brexit
* non-Brexit : tweets about other topics
        

In [21]:
import pandas as pd
tweets_df = pd.read_csv("./cleaned_labeled_tweets.csv")

print(f"The columns of the dataframe are: {tweets_df.columns}.")
print(f"The shape of the dataframe is: {tweets_df.shape}")
tweets_df.describe()

The columns of the dataframe are: Index(['tweet', 'label'], dtype='object').
The shape of the dataframe is: (715, 2)


| | tweet | label |
|---|---|---|
| count | 715 | 715 |
| unique | 679 | 2 |
| top | @_AnimalAdvocate she's one evil woman | non-Brexit |
| freq | 6 | 700 |


### 1.1. Exercise: Explore the dataset

Explore the dataset and get familiar with it by using methods such as `head()` and `tail()`.
It is always advisable to get familiar with the data before modelling.

At a minimum you should verify that the data has loaded correctly via `read_csv()`.
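A minimal sketch of such an exploration is shown here on a small, made-up DataFrame (`toy_df` and its contents are assumptions for illustration, not the course data). The same calls can be applied to `tweets_df`.

```python
import pandas as pd

# Hypothetical stand-in for tweets_df with the same two columns
toy_df = pd.DataFrame({
    "tweet": ["Brexit vote today", "lovely weather", "coffee time",
              "Brexit deal talks", "lovely weather"],
    "label": ["Brexit", "non-Brexit", "non-Brexit", "Brexit", "non-Brexit"],
})

print(toy_df.head(2))  # first two rows
print(toy_df.tail(2))  # last two rows

# Number of samples per class; a strong imbalance matters for the evaluation later
label_counts = toy_df["label"].value_counts()
print(label_counts)

# Number of tweets whose text repeats an earlier row
n_duplicates = toy_df["tweet"].duplicated().sum()
print(n_duplicates)
```

Checking the label distribution and duplicates early is useful here, since the `describe()` output above already hints at both (679 unique tweets out of 715, and 700 tweets with the same label).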


## 2. Dataset Provisioning

After loading the data we have to provision it. That is, we transform the data into the format required by the machine learning algorithm.
We'll split our dataset into a training set and a testing set. The training set is used to train the model, and the testing set is used to evaluate its performance.
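As a rough sketch of what such a split produces, the following example uses made-up, imbalanced toy data (all names and values are assumptions). It also passes the optional `stratify` parameter of `train_test_split`, which this notebook does not use, to keep the class proportions equal in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 4 "Brexit" and 16 "non-Brexit" samples
toy_tweets = pd.Series([f"tweet number {i}" for i in range(20)])
toy_labels = pd.Series(["Brexit"] * 4 + ["non-Brexit"] * 16)

# test_size=0.25 puts 5 of the 20 samples into the test set;
# stratify=toy_labels preserves the 1:4 class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_tweets, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

print(len(X_tr), len(X_te))
print(y_te.value_counts())
```

Without stratification, a rare class can end up with very few (or no) samples in the test set, which makes the evaluation less reliable.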
        

In [4]:
from sklearn.model_selection import train_test_split
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(tweets_df['tweet'], tweets_df['label'], test_size=0.2, random_state=42)
        


## 3. Model Training


### 3.1 Create a Machine Learning Pipeline

We create a pipeline that first converts the text data into a format suitable for machine learning (using `CountVectorizer`), and then applies a classification algorithm (in this case, `MultinomialNB`).

The [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) in scikit-learn converts text input into numerical form, so that it can be used for machine learning. The process includes:

1. **Tokenization**: Splits text into individual words (tokens).
2. **Text Cleaning**: Lowercases the text and drops punctuation by default; stop words are removed only if the `stop_words` parameter is set.
3. **Building a Vocabulary**: Creates a list of all unique words from the entire text dataset.
4. **Counting Occurrences**: Counts how many times each word appears in each document.
5. **Vectorization**: Converts each document into a vector where each element represents the count of a word from the vocabulary in that document.
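The first two steps can be inspected in isolation. The sketch below uses `build_analyzer()`, which returns the function a `CountVectorizer` applies to each document, on a made-up tweet (the sample text is an assumption for illustration).

```python
from sklearn.feature_extraction.text import CountVectorizer

# The analyzer combines preprocessing (lowercasing) and tokenization
analyze = CountVectorizer().build_analyzer()

# Hypothetical tweet with a hashtag, a mention and a link
sample = "Brexit means BREXIT! #brexit @user https://t.co/abc"
tokens = analyze(sample)
print(tokens)
# ['brexit', 'means', 'brexit', 'brexit', 'user', 'https', 'co', 'abc']
```

With the default settings, tokens are runs of at least two word characters, so `#`, `@` and the single letter `t` disappear, while fragments of the URL become terms of their own.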


        

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
# Create a text processing and classification pipeline
ml_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
        


### 3.2 Train the Model

With the pipeline set up, we can now train our model on the training data.
        

In [7]:

# Train the model
ml_pipeline.fit(X_train, y_train)
        

Two things happen when we fit the pipeline:

1. **The CountVectorizer is fitted.**
    * It cleans up all the text samples as described above
    * It then identifies all unique terms that appear in the input dataset
    * It stores those terms in a dictionary (the `vocabulary_` attribute) that maps each term to a column index
    * Finally it transforms the tweets into a vectorised form
2. **The model is fitted (trained)**
    * The vectorised tweets (by convention we call this `X`) and the labels (by convention we call this `y`) are passed to the ML model
    * The ML model is trained 
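To make the two steps concrete, the following sketch performs them once through a pipeline and once by hand, on made-up toy data (all texts and labels are assumptions for illustration). Both variants should yield the same predictions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data
texts = ["brexit vote today", "brexit deal talks",
         "lovely sunny weather", "coffee and cake"]
labels = ["Brexit", "Brexit", "non-Brexit", "non-Brexit"]

# Variant A: the pipeline runs both steps in one call
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(texts, labels)

# Variant B: the same two steps done manually
vec = CountVectorizer()
X = vec.fit_transform(texts)  # step 1: learn the vocabulary and vectorise
clf = MultinomialNB()
clf.fit(X, labels)            # step 2: train the model on X and y

# At prediction time the vectoriser is only applied (transform), not refitted
new_texts = ["another brexit vote", "sunny weather today"]
pipe_pred = pipe.predict(new_texts)
manual_pred = clf.predict(vec.transform(new_texts))
print(pipe_pred, manual_pred)
```

The pipeline mainly saves us from forgetting the `transform` step when predicting, and it keeps the vocabulary tied to the training data only.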



### 3.3 Exercise: Exploring and Understanding the `CountVectorizer`

Explore the `CountVectorizer`.

Visit the documentation of the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and find the appropriate ways to:

1. Look at the vocabulary that resulted from the fitting process.
    * Identify how many unique terms were contained
    * Look at the kinds of terms that are contained (you might be surprised at what is included)
2. Identify the methods for encoding and decoding sample texts or sample tweets with the fitted vectoriser
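One possible starting point for both items is sketched here on a tiny made-up corpus (the corpus and variable names are assumptions). It relies on the `vocabulary_` attribute and on the `transform` and `inverse_transform` methods of a fitted `CountVectorizer`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus
toy_corpus = ["Brexit vote today", "Lovely weather today"]
vec = CountVectorizer().fit(toy_corpus)

# vocabulary_ maps each term to its column index in the vectorised output
print(vec.vocabulary_)
n_terms = len(vec.vocabulary_)
print(n_terms)

# Encoding: text -> count vector
encoded = vec.transform(["Brexit weather"])
print(encoded.toarray())

# Decoding: count vector -> terms with a non-zero count (word order is lost)
decoded = vec.inverse_transform(encoded)
print(decoded)
```

For the fitted pipeline, the same attribute and methods are available on its first step.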




In [8]:
corpus = [
     'This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?',
 ]
vectorizer1 = CountVectorizer()
X = vectorizer1.fit_transform(corpus)
print(vectorizer1.get_feature_names_out())
print(X.toarray())
vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
X2 = vectorizer2.fit_transform(corpus)
print(vectorizer2.get_feature_names_out())
print(X2.toarray())

# Take the fitted CountVectorizer out of the trained pipeline (its first step)
vectorizer = ml_pipeline.steps[0][1]
# Encode a sample text with the vocabulary learned from the tweets
vectorized = vectorizer.transform(["This is the first document. nice"])
vectorized_df = pd.DataFrame(vectorized.toarray(), columns=vectorizer.get_feature_names_out())

# Collect the terms of the sample that have a non-zero count
non_zero_entries = []
for column in vectorized_df.columns:
    count = vectorized_df[column].iloc[0]
    if count > 0:
        non_zero_entries.append(column)
        print(f"column {column}")
        print(f"vector: {count}")

vectorized_df[non_zero_entries]

['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
['and this' 'document is' 'first document' 'is the' 'is this'
 'second document' 'the first' 'the second' 'the third' 'third one'
 'this document' 'this is' 'this the']
[[0 0 1 1 0 0 1 0 0 0 0 1 0]
 [0 1 0 1 0 1 0 1 0 0 1 0 0]
 [1 0 0 1 0 0 0 0 1 1 0 1 0]
 [0 0 1 0 1 0 1 0 0 0 0 0 1]]
column first
vector: 1
column is
vector: 1
column nice
vector: 1
column the
vector: 1
column this
vector: 1


| | first | is | nice | the | this |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 |


### 3.4 Questions

Some questions to think about and discuss.

1. What do you think about the method to transform the text into numerical form? Could there be a better way to do this kind of transformation? What potential problems can you imagine resulting from this approach?
2. What happens if we encounter a term that is not part of the vocabulary of the CountVectorizer?
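The second question can also be investigated empirically. The sketch below fits a `CountVectorizer` on a made-up corpus (an assumption for illustration) and then transforms texts that contain unseen terms.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus; the vocabulary contains exactly five terms
vec = CountVectorizer().fit(["Brexit vote today", "Lovely weather today"])

# One known term ("brexit") and one unknown term ("referendum")
mixed = vec.transform(["Brexit referendum"]).toarray()
# Only unknown terms
unseen = vec.transform(["completely unknown words"]).toarray()

print(mixed)
print(unseen)
```

Comparing the two vectors with the vocabulary shows how much of each input text actually reaches the model.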


## 4. Evaluation

After training the model, we use it to make predictions on the test dataset.
        

In [13]:

# Predict labels for the test set
predictions = ml_pipeline.predict(X_test)
        


### 4.1 Measure Accuracy
As a first step, we evaluate the model's performance by looking at its accuracy.
        

In [14]:
from sklearn.metrics import accuracy_score
# Evaluate the model
print("Accuracy:", accuracy_score(y_test, predictions))
        

Accuracy: 0.972027972027972


### 4.2 Analysing Measurements

What do you think about the measured accuracy?
Are you satisfied with the model performance?
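One way to put the number into perspective is to compare it with a trivial baseline. The sketch below rebuilds the label distribution from the `describe()` output in section 1 (715 tweets, 700 of them labelled `non-Brexit`) and scores a hypothetical "model" that always predicts the majority class. Note that this baseline is computed on the full dataset, not on the test split.

```python
from sklearn.metrics import accuracy_score

# Label counts taken from the describe() output: 700 of 715 tweets are "non-Brexit"
y_true = ["non-Brexit"] * 700 + ["Brexit"] * 15

# Trivial baseline: always predict the majority class
y_baseline = ["non-Brexit"] * len(y_true)

baseline_accuracy = accuracy_score(y_true, y_baseline)
print(round(baseline_accuracy, 3))
# 0.979
```

A model that ignores the input entirely reaches an accuracy in the same range as the one measured above, which suggests that accuracy alone says little on a dataset this imbalanced.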

### 4.3 Exercise: Digging into the Metrics

Scikit-learn provides several methods for analysing the predictions of a model. Some of these methods include:

1. **[Confusion Matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)**

2. **[Classification Report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)**

Use the above two methods to analyse the performance and discuss your findings.

In [19]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))

              precision    recall  f1-score   support

      Brexit       1.00      0.20      0.33         5
  non-Brexit       0.97      1.00      0.99       138

    accuracy                           0.97       143
   macro avg       0.99      0.60      0.66       143
weighted avg       0.97      0.97      0.96       143

[[  1   4]
 [  0 138]]
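To see how the report follows from the confusion matrix, the sketch below rebuilds label lists that are consistent with the matrix above (the lists are reconstructed for illustration, not the actual test data) and recomputes precision and recall for the `Brexit` class. Rows correspond to the true class and columns to the predicted class, in the order `Brexit`, `non-Brexit`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Reconstructed labels matching the matrix [[1, 4], [0, 138]]
y_true = ["Brexit"] * 5 + ["non-Brexit"] * 138
y_pred = ["Brexit"] * 1 + ["non-Brexit"] * 4 + ["non-Brexit"] * 138

labels = ["Brexit", "non-Brexit"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Precision: share of predicted "Brexit" tweets that really are "Brexit" (1 of 1)
precision = precision_score(y_true, y_pred, pos_label="Brexit")
# Recall: share of true "Brexit" tweets that the model found (1 of 5)
recall = recall_score(y_true, y_pred, pos_label="Brexit")
print(precision, recall)
```

The model labels only one test tweet as `Brexit` and misses four of the five, so the high accuracy is driven almost entirely by the majority class.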
