<h1 style="color: #6686D6; font-family: 'Helvetica Neue', sans-serif;"><small>L7S2N1</small> Optimise Me</h1>

<h3 style="color: #6686D6; font-family: 'Helvetica Neue', sans-serif;">1. Initial Setup</h3>

The cells below restore our stack for pre-processing and model training.

#### 1.1 Dataset Loading

In [1]:
import pandas as pd
import numpy as np

tweets_df = pd.read_csv("./Brexit-Non-Brexit-100K.csv", delimiter=";", encoding='utf-8')
# Minimal clean-up: replace missing values with empty strings (an empty string is still a string).
# Without this step, CountVectorizer raises an exception when it encounters a NaN value.
tweets_df['tweet'] = tweets_df['tweet'].fillna('')

from sklearn.model_selection import train_test_split

# Split the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(tweets_df['tweet'], tweets_df['label'], test_size=0.2, random_state=42)
        

#### 1.2 Model Loading

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
# Create a text processing and classification pipeline
ml_pipeline = make_pipeline(CountVectorizer(min_df=0.001), MultinomialNB())
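
The `min_df` argument above deserves a short illustration. The sketch below uses a hypothetical four-document toy corpus (not taken from the Brexit dataset) to show how a float `min_df` is read as a proportion of documents, so rare tokens are dropped from the vocabulary. By the same logic, if the training split held 80,000 tweets (an assumption based on the file name and the 80/20 split), `min_df=0.001` would require a token to occur in at least 80 tweets.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, for illustration only
docs = [
    "brexit vote today",
    "brexit deal vote",
    "brexit news",
    "sunny weather today",
]

# Default min_df=1: every token that occurs in at least one document is kept
vocab_all = sorted(CountVectorizer().fit(docs).vocabulary_)

# A float min_df is a proportion of documents: 0.5 keeps only tokens
# that occur in at least half of the four documents
vocab_half = sorted(CountVectorizer(min_df=0.5).fit(docs).vocabulary_)

print(vocab_all)   # seven distinct tokens
print(vocab_half)  # ['brexit', 'today', 'vote']
```

A higher `min_df` shrinks the vocabulary and speeds up training, but it can also discard rare tokens that carry signal, which is why it is worth tuning.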

#### 1.3 Baseline Documentation



In [3]:
from sklearn.metrics import accuracy_score
ml_pipeline.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, ml_pipeline.predict(X_test) ))

Accuracy: 0.9582
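
Accuracy is a single number and hides which kind of errors the model makes. As a minimal sketch, the example below computes a confusion matrix next to the accuracy. The labels and predictions are hypothetical and are not results from the Brexit dataset.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes (label order: 0, 1)
cm = confusion_matrix(y_true, y_pred)

print("Accuracy:", acc)  # 6 of 8 correct -> 0.75
print(cm)                # [[4 1]
                         #  [1 2]]
```

With the real pipeline, the same call would be `confusion_matrix(y_test, ml_pipeline.predict(X_test))`.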


<h3 style="color: #6686D6; font-family: 'Helvetica Neue', sans-serif;">2. Two Optimisation Vectors</h3>

The accuracy above already represents a rather strong baseline.
We now want to explore how we can improve it further by focusing on two areas:
* the data provisioning process
* the ML training process

For the following exercises please form groups of 2-3 people.

#### 2.1 Data Curation & Provisioning Optimisation

For the moment we will exclude a common source of improvement: expanding our dataset.
Adding data is often an effective way to improve overall performance.
However, in this case, where we work with Twitter data, collecting additional data can be quite tedious and labour-intensive.

Instead we will focus on the transformation and pre-processing step of the data.

#### 2.2 Exercise: Pre-processing and Vectorisation

Try to improve the performance by changing the vectorisation step we have used.
Read up on the options under [Text feature extraction](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction).
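
As a starting point for the exercise, the sketch below shows two of the options described in that documentation: n-gram ranges and TF-IDF weighting. The three-document corpus is hypothetical, and these settings are examples to experiment with, not a recommended configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, for illustration only
docs = ["Brexit vote today", "no deal Brexit", "vote on the deal"]

# Option 1: count unigrams and bigrams instead of unigrams only
bigram_vect = CountVectorizer(ngram_range=(1, 2))
bigram_vect.fit(docs)
features = sorted(bigram_vect.vocabulary_)
print(len(features))           # 7 unigrams + 7 bigrams = 14
print("no deal" in features)   # True

# Option 2: weight tokens by TF-IDF instead of using raw counts
tfidf_vect = TfidfVectorizer()
X_tfidf = tfidf_vect.fit_transform(docs)
print(X_tfidf.shape)           # (3, 7): three documents, seven unigrams
```

Either vectoriser can be dropped into `make_pipeline` in place of the baseline `CountVectorizer`.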


In [15]:
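# Use this cell to re-train.
# Note: this run keeps the baseline vectoriser settings and swaps the classifier for LogisticRegression.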
from sklearn.linear_model import LogisticRegression
ml_pipeline = make_pipeline(CountVectorizer(min_df=0.001), LogisticRegression(max_iter=1000))
ml_pipeline.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, ml_pipeline.predict(X_test) ))

Accuracy: 0.97485


#### 2.3 Algorithm Optimisation

The learning algorithm is the step most commonly targeted for optimisation.

#### 2.4 Exercise: Optimise ML Algorithm

Visit the scikit-learn documentation below and identify ML algorithms that could be used for our classification task.

[Listing of supervised ML algorithms](https://scikit-learn.org/stable/supervised_learning.html)

Use the cell below to train and optimise on the algorithm side.
If you achieved an improvement in 2.2, carry over the changes you made to the vectorisation step.
Consider changing the algorithm class as well as the hyperparameters of the algorithm you use.
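
One way to approach this exercise is to loop over several candidate classifiers with the same vectoriser. The sketch below does this on a hypothetical toy dataset. The choice of candidates and the hyperparameter values (`alpha`, `C`) are illustrative assumptions, not tuned settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data, for illustration only (1 = Brexit-related, 0 = other)
train_texts = [
    "brexit vote in parliament", "leave the eu now",
    "brexit deal rejected", "eu referendum result",
    "lovely sunny weather", "football match tonight",
    "new pasta recipe", "rain again today",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["brexit referendum", "football in the rain"]
test_labels = [1, 0]

# Candidate algorithms with example hyperparameters
candidates = {
    "MultinomialNB": MultinomialNB(alpha=0.5),
    "LogisticRegression": LogisticRegression(C=1.0, max_iter=1000),
    "LinearSVC": LinearSVC(C=1.0),
}

scores = {}
for name, clf in candidates.items():
    # Same vectorisation step for every candidate, so only the algorithm varies
    model = make_pipeline(CountVectorizer(), clf)
    model.fit(train_texts, train_labels)
    scores[name] = model.score(test_texts, test_labels)  # accuracy on the test set

for name, score in scores.items():
    print(name, score)
```

In the notebook, `train_texts` and `test_texts` would be replaced by `X_train` and `X_test`, and the labels by `y_train` and `y_test`.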


In [5]:
ml_pipeline = make_pipeline(CountVectorizer(min_df=0.001), MultinomialNB())
ml_pipeline.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, ml_pipeline.predict(X_test) ))

Accuracy: 0.9582


<h3 style="color: #6686D6; font-family: 'Helvetica Neue', sans-serif;">3. Automating Optimisation</h3>

One observation you should be able to make is that the number of possible combinations in an optimisation setup is quite large.
This is sometimes called the parameter space.
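
To get a feel for how quickly the parameter space grows, the sketch below counts the combinations in a small grid using scikit-learn's `ParameterGrid`. The grid values are illustrative, and the `clf__C` entry is a hypothetical extra parameter added only to show the growth.

```python
from sklearn.model_selection import ParameterGrid

# Two vectoriser parameters: 4 * 3 = 12 combinations
grid = {
    "vect__max_df": [0.3, 0.4, 0.5, 0.6],
    "vect__min_df": [0.001, 0.002, 0.003],
}
n_candidates = len(ParameterGrid(grid))
print(n_candidates)  # 12

# Adding one classifier parameter with three values triples the space
grid["clf__C"] = [0.1, 1.0, 10.0]
n_extended = len(ParameterGrid(grid))
print(n_extended)    # 36

# With 5-fold cross-validation, every combination is trained five times
n_fits = n_extended * 5
print(n_fits)        # 180
```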

To help us find the best setup among these combinations, we can automate the execution of our training runs.

Use the code sample below to define training runs that further optimise the best setup you have achieved so far.

In [24]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 2))),
    ('clf', LogisticRegression(max_iter=1000))
])

# Define the parameter grid to search
param_grid = {
    'vect__max_df': (0.3, 0.4, 0.5, 0.6),
    'vect__min_df': (0.001, 0.002, 0.003),
}

# Create a GridSearchCV object
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, verbose=1)

# Perform the grid search on the data
grid_search.fit(X_train, y_train)

# Print the best parameters and the best score
print("Best Parameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best Parameters:  {'vect__max_df': 0.3, 'vect__min_df': 0.001}
Best Score:  0.9765866585411589
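
Note that `best_score_` is the mean cross-validated score on the training data, not the accuracy on the held-out test set. The sketch below shows how the per-candidate results can be inspected and how the refitted best model is used for prediction. It runs on a hypothetical toy dataset with a deliberately tiny grid (`clf__C` with two values, `cv=2`), chosen only to keep the example fast.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data, for illustration only (1 = Brexit-related, 0 = other)
texts = [
    "brexit vote", "brexit deal", "leave the eu", "eu referendum",
    "sunny day", "rain today", "football match", "new recipe",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0]}, cv=2)
search.fit(texts, labels)

# cv_results_ holds one entry per candidate; mean_test_score is the
# mean score over the validation folds
results = search.cv_results_
for params, mean in zip(results["params"], results["mean_test_score"]):
    print(params, round(mean, 3))

# With refit=True (the default), the best candidate is retrained on all
# data passed to fit() and used by predict() and score()
held_out = ["brexit referendum", "rain and football"]
preds = search.predict(held_out)
print(preds)
```

For the grid search above, the corresponding check on unseen data would be `grid_search.score(X_test, y_test)`.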


#### 3.1 Cross-Validation

The code above uses `GridSearchCV`. The `CV` part stands for cross-validation. Cross-validation is a more sophisticated approach to splitting our data into test and train portions.
It has the same motivation as a single train-test split, but instead of splitting once, cross-validation evaluates the model on N different splits.
In N-fold cross-validation, the data is partitioned into N folds of roughly equal size, and each fold serves as the test set exactly once. 10-fold cross-validation therefore produces ten train-test setups, and we measure the performance on each of them.

1. **Random Sampling:** In cross-validation, the dataset is divided into a fixed number of parts or "folds", often after shuffling. A random or stratified division makes it more likely that each fold is a representative sample of the overall dataset. Note that when `cv` is an integer and the estimator is a classifier, scikit-learn uses stratified folds without shuffling by default.

2. **Sequential Evaluation:** The model is trained and tested multiple times, each time with a different fold acting as the test set and the remaining folds combined to form the training set. This sequential process ensures that every data point is used for both training and testing across the iterations.

3. **Independent Testing:** In each iteration, the model is tested on data that it hasn't seen during training. This is crucial for evaluating the model's performance on new, unseen data, mimicking real-world situations where the model will encounter data variations.

4. **Aggregated Results:** After all iterations, the performance metrics (like accuracy, precision, etc.) from each fold are aggregated. This aggregated result is a more reliable measure of the model's performance than a single train-test split, as it accounts for the variability in the dataset.

5. **Detecting Overfitting:** Cross-validation helps in detecting overfitting. Overfitting occurs when a model performs exceptionally well on the training data but poorly on new data. Because the model is evaluated on several held-out folds, a large gap between training and validation scores, or a large variation between folds, becomes visible.

6. **Enhanced Reliability:** Evaluating on several folds ensures that the result is not biased by one particular arrangement or peculiarity of the data. This makes the performance estimate, and the decisions based on it, more trustworthy for practical applications.

In summary, cross-validation is a robust method for assessing the generalizability and effectiveness of a machine learning model. It tests the model on several different held-out portions of the data, which gives a better indication of its likely performance in real-world applications.
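
The mechanics described above can be sketched with `StratifiedKFold`, the splitter scikit-learn uses by default for classifiers when `cv` is an integer. The ten-sample dataset below is hypothetical. The example checks that every sample lands in a test fold exactly once.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical dataset: ten samples, five per class
X = np.arange(10).reshape(-1, 1)
y = np.array([0] * 5 + [1] * 5)

skf = StratifiedKFold(n_splits=5)

test_counts = np.zeros(len(y), dtype=int)  # how often each sample is tested
fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    test_counts[test_idx] += 1
    fold_sizes.append(len(test_idx))
    # Train and test indices never overlap within a fold
    print(f"fold {fold}: train={train_idx}, test={test_idx}")

print(test_counts)  # every sample is used for testing exactly once
print(fold_sizes)   # five folds with two test samples each
```

Passing `shuffle=True` together with a `random_state` to `StratifiedKFold` randomises the fold assignment while keeping it reproducible.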

#### 3.2 Discussion & Outlook

* Large Parameter Spaces
* Effectiveness of Automation
* Auto-ML

<h3 style="color: #6686D6; font-family: 'Helvetica Neue', sans-serif;">4. Model Documentation</h3>

#### 4.1 Exercise: Best Runs

Navigate to the following Microsoft List and start documenting your runs.

* [Run documentation](https://bernerfachhochschule.sharepoint.com/:l:/s/ti-bscdataengineering/FHp4B79xst1HjZfgPQHlnt4BRGeVdXH10vrHsrmSkZ8j0Q?e=zia76E) 

#### 4.2 Discussion: Performance only a Part of the Story

In this notebook we focused on improving the accuracy of the model.
For real-world applications of our models, other aspects are also very important:


* **Overfitting and Underfitting:** Ensuring that the model generalizes well to new, unseen data, rather than memorizing the training data (overfitting) or being too simple to capture the patterns in the data (underfitting).

* **Model Complexity and Efficiency:** The trade-off between the model's complexity and its performance. More complex models might perform better but can require more data and computational resources.

* **Real-World Application Fit:** The quality of the training data, and whether the data is representative of the real-world scenarios where the model will be applied. Biased data can lead to biased predictions.

* **Robustness and Stability:** The model's ability to perform consistently across different datasets and in the presence of noisy or imperfect data.

* **Interpretability and Explainability:** Understanding why the model makes certain predictions, which is critical in many applications, especially those that require trust and transparency.

* **Ethical Considerations:** Ensuring that the model does not perpetuate or exacerbate unfair biases, and is ethically sound in its application.

* **Cost of Errors:** The real-world impact of errors made by the model, which can vary significantly depending on the application.

* **Scalability:** How well the model can be scaled to handle larger datasets or be deployed in different environments.

* **Regulatory Compliance:** Ensuring that the model complies with relevant laws and regulations, particularly in industries like finance and healthcare.

* **Environmental Impact:** The energy consumption and environmental impact of training and deploying the model, especially for large, complex models.

Each of these considerations plays a crucial role in the overall evaluation and deployment of machine learning models, and the importance of each can vary depending on the specific application and context.