
# Machine Learning Vectorisation & Optimisation

This exercise-focused notebook builds on the concepts introduced in L6S4N2.

We shift our focus to the steps of the machine learning workflow that are highlighted in **bold**.

1. Dataset Curation
2. **Dataset Provisioning**
3. Model Training Run
4. Evaluation
5. **Iterative Optimisation**




## 1. Dataset Curation

The data we will use are tweets collected from the UK around the time-frame of the original Brexit discussion.
Please note that the data is not filtered in any way and might contain offensive content. 

The data has been annotated with two classes:
* Brexit : tweets that relate to the topic Brexit
* non-Brexit : tweets about other topics

Contrary to the previous notebook, the dataset features a more balanced distribution between the classes and consists of roughly 100k tweets.
        

In [1]:
import pandas as pd
import numpy as np


tweets_df = pd.read_csv("./Brexit-Non-Brexit-100K.csv", delimiter=";", encoding='utf-8')

# We have to do some minimal clean-up of the dataset and replace missing values with empty strings (an empty string is still a string)
# If we don't do this we will run into an exception when we use the CountVectorizer
tweets_df['tweet'] = tweets_df['tweet'].fillna('')


print(f"The columns of the dataframe are: {tweets_df.columns}.")
print(f"The shape of the dataframe is: {tweets_df.shape}")

The columns of the dataframe are: Index(['tweet', 'label'], dtype='object').
The shape of the dataframe is: (99997, 2)


### 1.1. Exercise: Sanity Check & Getting Familiar with the Data

As seasoned data engineers, you know that, at a minimum, you should verify that `read_csv()` has loaded the data correctly.
Take a look at the dataframe.
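
A minimal sketch of such a sanity check. It uses a small hypothetical stand-in frame (`toy_df`) with the same `tweet` and `label` columns so that it runs without the CSV file; with the real data you would call the same methods on `tweets_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for tweets_df; the real frame is loaded from the CSV above.
toy_df = pd.DataFrame({
    "tweet": ["Brexit vote tonight", np.nan, "Lovely weather in Leeds", "Leave or remain?"],
    "label": ["Brexit", "non-Brexit", "non-Brexit", "Brexit"],
})

# First rows: do the columns contain what their names promise?
print(toy_df.head())

# Missing values per column: NaN tweets would break the vectoriser later on.
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# Relative class frequencies: how balanced is the dataset?
class_shares = toy_df["label"].value_counts(normalize=True)
print(class_shares)
```

Looking at a handful of rows, the missing-value counts and the class shares is usually enough to spot a wrong delimiter or encoding early.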



In [6]:
tweets_df.groupby("label").count()

            tweet
label
Brexit      26579
non-Brexit  73418



## 2. Dataset Provisioning

We'll split our dataset into a training set and a testing set as was done before. The training set is used to train the model, and the testing set is used to evaluate its performance.
        

In [8]:
from sklearn.model_selection import train_test_split
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(tweets_df['tweet'], tweets_df['label'], test_size=0.2, random_state=42)
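
Because the two classes are not equally frequent, a purely random split can shift the class proportions slightly between the training and the testing set. A possible variant, sketched here on hypothetical toy data, is to pass the `stratify` argument of `train_test_split` so that both parts keep the original class ratio. This is an optional alternative and not what the split above does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 non-Brexit and 4 Brexit tweets (ratio 2:1).
toy_df = pd.DataFrame({
    "tweet": [f"tweet number {i}" for i in range(12)],
    "label": ["non-Brexit"] * 8 + ["Brexit"] * 4,
})

# stratify=... keeps the 2:1 class ratio in both the training and the test part.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_df["tweet"], toy_df["label"],
    test_size=0.25, random_state=42, stratify=toy_df["label"],
)

print(y_tr.value_counts())
print(y_te.value_counts())
```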
        


## 3. Model Training


### 3.1 Create a Machine Learning Pipeline

We utilize the same pipeline that first converts the text data into a format suitable for machine learning (using `CountVectorizer`), and then applies a classification algorithm (in this case, `MultinomialNB`).


        

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
# Create a text processing and classification pipeline
# Note that this has named steps so that we can access the individual parts of the pipeline.
ml_pipeline = Pipeline([
    ('countVectorizer',CountVectorizer()),
    ('classifier', MultinomialNB())
])
     
        


### 3.2 Train the Model

With the pipeline set up, we can now train our model on the training data.
        

In [14]:

# Train the model
ml_pipeline.fit(X_train, y_train)
        

Two things happen when we execute the pipeline:

1. **The CountVectorizer is fitted.**
    * It cleans up all the text samples as described above
    * It then identifies all unique terms that appear in the input dataset
    * It stores those terms in a dictionary
    * Finally it transforms the tweets into a vectorised form
2. **The model is fitted (trained)**
    * The vectorised tweets (by convention we call this `X`) and the labels (by convention we call this `y`) are passed to the ML model
    * The ML model is trained 
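
The vectorisation step listed above can be observed in isolation on a tiny example. The sketch below uses a hypothetical three-tweet corpus (`toy_tweets`) rather than the real data and relies only on the default settings of `CountVectorizer`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini corpus standing in for the tweets.
toy_tweets = ["Brexit means Brexit", "Vote leave", "Leave means leave"]

vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and returns the document-term matrix.
X_toy = vectorizer.fit_transform(toy_tweets)

# vocabulary_ maps each unique (lower-cased) term to its column index.
print(vectorizer.vocabulary_)
# One row per tweet, one column per term; the entries are term counts.
print(X_toy.toarray())
```

Each row of the matrix is the vectorised form of one tweet, which is what the classifier receives as `X`.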



### 3.3 Exercise: Exploring and Understanding the `CountVectorizer`

Explore the `CountVectorizer`.

Visit the documentation of the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) 


**a) Accessing the elements of the pipeline**

You can access the elements of the pipeline in the following way:

```python
ml_pipeline.named_steps['countVectorizer']
```

**b) Check the size of the vocabulary** 

There is an attribute `vocabulary_` that you can use to access the fitted vocabulary (all unique terms found in the tweets) of the `CountVectorizer`.
Guess how many unique terms you expect, then check the actual size of the vocabulary.

**c) Check what is contained in the vocabulary**

There are multiple good reasons for checking the content of the vocabulary.
This applies not only to the use of the `CountVectorizer` in this example, but to all pre-processing and transformation steps applied to any data that touches your `ML` or `Analytics` workflow.

<div style="background-color: #e8f4f8; border-left: 5px solid #6fa8dc; padding: 10px; margin: 10px 0; font-size: 1em; line-height: 1.4;">
    <p><b>Practical Relevance:</b></p>
    <p>Familiarizing yourself with the data and its representation as features is extremely important. It serves the following purposes:</p>
        <ul>
            <li>Identify data transformation errors: Data transformation and loading are prone to all kinds of errors, from using a wrong <code>delimiter</code> or applying a wrong <code>encoding</code> to mistakes based on wrong <code>escaping</code> assumptions. (Don't worry if you are not yet familiar with all these terms; we will dive into them in the Data Engineering Fundamentals 1 course.) Many of these mistakes can be identified visually.
            </li>
            <li>Getting to know your features: It is important to develop an intuition for the features that your models are fed. In this case each unique term is a feature, because we are working with text and apply the vectorisation defined by the <code>CountVectorizer</code>. Knowing what our features look like gives us ideas about how to improve them and about the advantages and limitations of the feature representation we chose.</li>
        </ul>
</div>


Assume we have a list named `vocabulary`. The following options are available to analyse its contents.

### 1. Storing in a CSV File Using Pandas

```python
import pandas as pd

# Convert list to DataFrame
df = pd.DataFrame(vocabulary, columns=['vocabulary'])

# Save to CSV
df.to_csv('vocabulary.csv', index=False)
```

### 2. Printing Using `pprint`

```python
from pprint import pprint

# Pretty-print the list
pprint(vocabulary)
```

### 3. Printing by Placing in a Pandas DataFrame

```python
import pandas as pd

# Convert list to DataFrame
df = pd.DataFrame(vocabulary, columns=['vocabulary'])

# Print DataFrame
print(df)
```

In each of these methods, the list `vocabulary` is represented in different forms: as a CSV file, pretty-printed list, and a DataFrame, providing flexibility in how the data can be stored and displayed.


Use the code cell below to explore the different approaches and check the content of the vocabulary. **Note:** you might have to transform what you get from the `CountVectorizer` in order to make it work.
 


In [23]:
from pprint import pprint
pprint(len(ml_pipeline.named_steps['countVectorizer'].vocabulary_))

249101



## 4. Evaluation

After training the model, we use it to make predictions on the test dataset.
        

In [18]:

# Predict labels for the test set
predictions = ml_pipeline.predict(X_test)
        


### 4.1 Measure Accuracy
We first evaluate the model's performance by looking at its accuracy; the detailed classification report follows in the exercise in 4.3.
        

In [19]:
from sklearn.metrics import accuracy_score
# Evaluate the model
print("Accuracy:", accuracy_score(y_test, predictions))
        

Accuracy: 0.58765


### 4.2 Analysing Measurements

What do you think about the measured accuracy?
Are you satisfied with the model performance?

### 4.3 Exercise: Digging into the Metrics

Scikit-learn provides several methods for analyzing the predictions of a model. Some of these methods include:

1. **[Confusion Matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)**

2. **[Classification Report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)**

As in the last notebook, use these methods to analyse the performance and behaviour of the model in more detail.
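
As a reminder of how the confusion matrix is read, here is a small sketch on hypothetical labels (`y_true_toy`, `y_pred_toy`); for the real model you would pass `y_test` and `predictions` instead.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for six tweets; not taken from the real test set.
y_true_toy = ["Brexit", "Brexit", "non-Brexit", "non-Brexit", "non-Brexit", "non-Brexit"]
y_pred_toy = ["Brexit", "Brexit", "Brexit", "Brexit", "non-Brexit", "non-Brexit"]

labels = ["Brexit", "non-Brexit"]
# Rows are the true classes, columns the predicted classes, both in `labels` order.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=labels)
print(cm)
```

In this toy case every Brexit tweet is found, but half of the non-Brexit tweets are wrongly predicted as Brexit, which shows up in the second row of the matrix.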

In [22]:
from sklearn.metrics import classification_report
 
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

      Brexit       0.39      0.98      0.56      5373
  non-Brexit       0.98      0.44      0.61     14627

    accuracy                           0.59     20000
   macro avg       0.69      0.71      0.59     20000
weighted avg       0.83      0.59      0.60     20000
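
One way to put the accuracy into context is to compare it with a trivial model that always predicts the more frequent class. The sketch below uses only the `support` column of the report above and the accuracy measured in 4.1; the variable names are illustrative.

```python
# Support values copied from the classification report above.
support = {"Brexit": 5373, "non-Brexit": 14627}
total = sum(support.values())

# Accuracy of a trivial model that always predicts the most frequent class.
majority_class = max(support, key=support.get)
majority_accuracy = support[majority_class] / total

model_accuracy = 0.58765  # accuracy measured in section 4.1
print(f"Always predicting '{majority_class}': {majority_accuracy:.4f}")
print(f"MultinomialNB pipeline: {model_accuracy:.4f}")
```

In this run the trivial strategy would reach a higher accuracy than the trained pipeline, which is a reason to look at per-class precision and recall rather than at accuracy alone.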



## 5. Efficiently Optimizing Machine Learning Models

### 5.1 Define a Baseline Model
**Purpose**
- Establishing a baseline is essential to set a reference point to compare the performance of future, more complex models.
- It provides a minimum performance threshold and is typically simple and easy to understand.


### 5.1 Exercise: Baseline Model Documentation

Document the baseline model in the markdown cell below.
Think about the things that should be documented.

Remember this serves two purposes:
* Provide you with a basis for your optimisations
* Document and save your work so that you or the next team member can reproducibly retrain the model or extend it at a later stage.

Discuss with your lecturer the practical relevance of this.

Use the cell below to write your documentation:



**Baseline Documentation**

Initial pipeline:

- Pipeline steps: `CountVectorizer()` followed by `MultinomialNB()`, both with default parameters
- Vocabulary size: 249101
- Accuracy: 0.58765
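
One possible way to make such documentation reusable is to keep it as a structured record alongside the notebook. The sketch below is only a suggestion: the values are copied from this notebook, while the field names and the JSON format are assumptions.

```python
import json

# Values copied from this notebook; the field names are illustrative.
baseline_record = {
    "dataset": "Brexit-Non-Brexit-100K.csv",
    "split": {"test_size": 0.2, "random_state": 42},
    "pipeline": ["CountVectorizer()", "MultinomialNB()"],
    "vocabulary_size": 249101,
    "accuracy": 0.58765,
}

# Serialise to JSON so the record can be stored next to the notebook.
baseline_json = json.dumps(baseline_record, indent=2)
print(baseline_json)
```

Recording the split parameters together with the metrics makes it possible to reproduce the exact baseline run later.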


### 5.2 Algorithm Optimisation

Look at the current model we employed in the pipeline.

### 5.2 Exercise: Hyperparameter Optimisation

Try to identify if the current training can be optimised by trying different hyperparameters.
Take note of your training success. 
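
Trying parameter values by hand works, but the search can also be automated. The following sketch uses scikit-learn's `GridSearchCV`, which is not part of the original notebook, on hypothetical toy data; the parameter grid is an arbitrary illustration, not a recommendation. Because it scores each combination with cross-validation on the training data, the test set stays untouched for the final evaluation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data so that the sketch runs on its own.
toy_tweets = [
    "brexit vote leave", "brexit deal remain", "leave the eu brexit", "brexit referendum vote",
    "nice weather today", "football match tonight", "coffee with friends", "weather and football",
] * 3
toy_labels = (["Brexit"] * 4 + ["non-Brexit"] * 4) * 3

pipeline = Pipeline([
    ("countVectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "countVectorizer__min_df": [1, 2],
    "classifier__alpha": [0.1, 1.0],
}

# 3-fold cross-validation on the training data for every parameter combination.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(toy_tweets, toy_labels)

print(search.best_params_)
print(search.best_score_)
```

With the real data you would call `search.fit(X_train, y_train)` and evaluate `search.best_estimator_` once on `X_test`.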


In [85]:
# Train and evaluate in this cell
ml_pipeline_2 = Pipeline([
    ('countVectorizer', CountVectorizer(max_df=0.16, min_df=456)),
    ('classifier', MultinomialNB())
])
ml_pipeline_2.fit(X_train, y_train)
predictions = ml_pipeline_2.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))

# Log of the runs tried so far, one per line: accuracy, max_df, min_df
# 0.8686   0.8   0.05
# 0.8861   1.0   2
# 0.91845  1.0   3
# 0.92775  1.0   4
# 0.9331   1.0   5
# 0.93655  1.0   6
# 0.9399   1.0   7
# 0.94155  1.0   8
# 0.95645  1.0   30
# 0.9583   1.0   50
# 0.95875  1.0   60
# 0.96145  1.0   200
# 0.9627   1.0   300
# 0.96295  1.0   400
# 0.96355  1.0   450
# 0.9636   1.0   455
# 0.9639   0.11  455


Accuracy: 0.9639

**Document your training**

### 5.2 Exercise: Algorithm Optimisation

Identify other suitable classification algorithms that are available in scikit-learn (only those).
Train and try to optimise your performance by swapping out the algorithm used for training.
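
A possible way to compare several algorithms under identical conditions is to loop over candidate classifiers while keeping the vectorisation step fixed. The sketch below runs on hypothetical toy data, and the selection of candidates is only an example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data; in the notebook use X_train, X_test, y_train, y_test.
train_tweets = ["brexit vote leave", "brexit deal remain", "nice weather today", "football match tonight"]
train_labels = ["Brexit", "Brexit", "non-Brexit", "non-Brexit"]
test_tweets = ["brexit vote", "weather today"]
test_labels = ["Brexit", "non-Brexit"]

candidates = {
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "LinearSVC": LinearSVC(),
}

results = {}
for name, classifier in candidates.items():
    # Same vectoriser for every candidate, so only the algorithm changes.
    pipeline = Pipeline([("countVectorizer", CountVectorizer()), ("classifier", classifier)])
    pipeline.fit(train_tweets, train_labels)
    # For classifiers, Pipeline.score returns the accuracy on the given data.
    results[name] = pipeline.score(test_tweets, test_labels)

print(results)
```

Changing one pipeline step at a time makes it easier to attribute an improvement to the algorithm rather than to the features.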

In [98]:
# Train and evaluate in this cell
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

# Both steps are swapped here: TF-IDF weighting replaces the raw term counts,
# and logistic regression replaces the naive Bayes classifier.
ml_pipeline_2 = Pipeline([
    ('tfidfVectorizer', TfidfVectorizer(max_df=0.15, min_df=10)),
    ('classifier', LogisticRegression(max_iter=1000))
])
ml_pipeline_2.fit(X_train, y_train)
predictions = ml_pipeline_2.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))

Accuracy: 0.97695


**Document your training**