# Supervised Learning Project

## 1. Data Selection and Preparation

This section will include:
- Loading the data.
- Exploring the data.
- Preparing the data for the modeling process, which includes splitting the data into training, testing and validation sets.

In [153]:
import pandas as pd
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

import warnings
warnings.filterwarnings("ignore")

In [154]:
df = pd.read_csv('./data/IMDB Dataset.csv')

In [155]:
df.head(5)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [156]:
df.sample(5)

Unnamed: 0,review,sentiment
30419,I was delighted to see this gem of a film avai...,positive
35397,I was blown away by the re-imagined Battlestar...,positive
18209,The opening sequence alone is worth the cost o...,positive
19121,"In a way, this film reminded me of ""Jumping Ja...",positive
36351,Can you people please stop believing everythin...,negative


In [157]:
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})
X = df['review']
y = df['sentiment']

The data has two columns: `review` and `sentiment`. The sentiment is either positive or negative. We encode it as 1 for positive and 0 for negative so that it can be used in the modeling process.

In [158]:
# Verify the data after mapping the sentiment:
df.sample(5)

Unnamed: 0,review,sentiment
873,It is always difficult to bring a 450 pages bo...,1
24889,If I had realized John Wayne was in this movie...,0
39542,This really is a cringe making exercise. Dress...,0
17153,"Well, I notice IMDB has not offered any plot i...",0
48509,"Anyone can make a movie these days. Budget, pr...",0


In [159]:
# Data split into train, test and validation sets
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

In [160]:
print("Training set size:", len(X_train))
print("Validation set size:", len(X_val))
print("Test set size:", len(X_test))

Training set size: 30000
Validation set size: 10000
Test set size: 10000


The data has been split into 3 sets:
- training 60% (size 30000)
- testing 20% (size 10000)
- validation 20% (size 10000)

The `random_state` parameter seeds the random shuffling that is applied before the data is split. It is set to 42 so that the same split is produced on every run.
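The 60/20/20 proportions follow from the two-stage split: the first call holds out 20% as the test set, and the second call takes 25% of the remaining 80%, which is 20% of the whole. A minimal sketch of that arithmetic, assuming a total of 50,000 rows (the sum of the three printed set sizes); `two_stage_split_sizes` is a hypothetical helper and not part of sklearn:

```python
from fractions import Fraction


def two_stage_split_sizes(n_rows, test_fraction, val_fraction_of_rest):
    """Return (train, validation, test) sizes for a two-stage split."""
    # Stage 1: hold out the test set from the full data.
    n_test = int(n_rows * test_fraction)
    n_rest = n_rows - n_test
    # Stage 2: hold out the validation set from what is left.
    n_val = int(n_rest * val_fraction_of_rest)
    n_train = n_rest - n_val
    return n_train, n_val, n_test


# test_size=0.2 in the first split, test_size=0.25 in the second split
sizes = two_stage_split_sizes(50_000, Fraction(1, 5), Fraction(1, 4))
print(sizes)  # (30000, 10000, 10000)
```

Exact fractions are used here only to keep the illustration free of floating-point rounding.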

In [161]:
# We vectorize the text data so that it is numerical and can be fed into the models

vectorizer = TfidfVectorizer(stop_words='english')
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)
X_test_vec = vectorizer.transform(X_test)

In [178]:
print("Shape of the vectorized data:")
print("Training set:", X_train_vec.shape)
print("Validation set:", X_val_vec.shape)
print("Test set:", X_test_vec.shape)

Shape of the vectorized data:
Training set: (30000, 82265)
Validation set: (10000, 82265)
Test set: (10000, 82265)


For every split of X, which contains the reviews, we convert the text into TF-IDF vectors so that our models can process them. It is important to note that this makes the data high-dimensional (82,265 features), which we can see in the shapes above.

The `fit_transform` method is used only on the training set, so that the vectorizer learns its vocabulary from the training data alone. The validation and test sets are only transformed, so they are encoded with the training vocabulary and no information from them leaks into training.
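The difference between fitting and transforming can be illustrated on a toy example. This is only a sketch: the three short texts are made up, and `stop_words='english'` is left out to keep the vocabulary easy to read.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up texts (assumption), standing in for training and unseen reviews
train_texts = ["a wonderful little film", "a boring and slow film"]
new_texts = ["a wonderful surprise"]

demo_vectorizer = TfidfVectorizer()
train_vec = demo_vectorizer.fit_transform(train_texts)  # learns the vocabulary
new_vec = demo_vectorizer.transform(new_texts)          # reuses that vocabulary

vocab = demo_vectorizer.vocabulary_
print(sorted(vocab))
print(train_vec.shape, new_vec.shape)

# "surprise" never occurred in the training texts, so it has no column
print("surprise" in vocab)
```

Both matrices have the same number of columns, and a word that was not seen during fitting is simply ignored when new text is transformed.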

## 2. Model Selection and Training

This section will include:
- Three different models which will be trained and validated on the data
- Explanations of the metrics used
- Explanation of what hyperparameters are and how we adjust them
- Explanation of what overfitting is and how we can spot it

### Explanations of the metrics used:

NB: _These metrics are used for both validation and testing of the models._

- **Accuracy**: the percentage of correct predictions. It gives an idea of how well the model is performing overall. It is calculated as the number of correct predictions divided by the total number of predictions.
<br> 
    `accuracy = (TP + TN) / (TP + TN + FP + FN)`

- **Precision**: tells us how many of the positive predictions were actually correct. It is calculated as the number of true positives divided by the sum of true positives and false positives.
<br>
    `precision = TP / (TP + FP)`


- **Recall**: tells us how many of the actual positive cases were predicted correctly. It is calculated as the number of true positives divided by the number of true positives and false negatives.
<br>
    `recall = TP / (TP + FN)`


- **F1 Score**: A combined measurement of precision and recall, namely the harmonic mean of the two. It gives a more balanced view of the model's performance, because it takes both false positives and false negatives into account.
<br>
    `F1 = 2 * (precision * recall) / (precision + recall)`


Source: [sklearn.metrics](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics) & [F-Score: What are Accuracy, Precision, Recall, and F1 Score?](https://klu.ai/glossary/accuracy-precision-recall-f1)

All of the metrics we use are taken from the `sklearn.metrics` module.
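As a worked example of the four formulas above, the metrics can be computed by hand from confusion-matrix counts. The counts below are made up for illustration and are not taken from our models; `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return accuracy, precision, recall, f1


# Hypothetical counts for 100 predictions
acc, prec, rec, f1_demo = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1_demo:.3f}")
# accuracy=0.850 precision=0.889 recall=0.800 f1=0.842
```

Here precision is higher than recall because the model makes fewer false positives (5) than false negatives (10); the F1 score lies between the two.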



### Choice of models:
- **Model 1: Multinomial Naive Bayes (source: [Multinomial Naive Bayes - geeksforgeeks](https://www.geeksforgeeks.org/multinomial-naive-bayes/))**
    - Reasoning: It should be suitable for text classification problems such as our sentiment analysis task, which is the process of determining whether a text is positive or negative. The model estimates the probability of each word given the sentiment of the review, and then uses these probabilities to predict the sentiment of a new review.

- **Model 2: Logistic Regression (source: [Logistic Regression in Machine Learning - geeksforgeeks](https://www.geeksforgeeks.org/understanding-logistic-regression/))**
    - Reasoning: The model should work well for binary classification tasks, in other words taking the input data and determining which category it fits into (0 or 1). Furthermore, the model should be good at finding out which features (words) are important in a review, which should help in predicting the sentiment of the review.

- **Model 3: Support Vector Machine (source: [Support Vector Machine (SVM) Algorithm - geeksforgeeks](https://www.geeksforgeeks.org/support-vector-machine-algorithm/))**
    - Reasoning: `LinearSVC` is a support vector machine with a linear kernel, which is a good choice for text classification and for data in a high-dimensional space. Our data is high-dimensional because we have vectorized the reviews.
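To make the idea behind Multinomial Naive Bayes more concrete, here is a minimal from-scratch sketch. It is an illustration only, not sklearn's implementation: it uses raw word counts instead of TF-IDF values, and the four tiny "reviews" as well as the helper names `train_nb` and `predict_nb` are made up.

```python
import math
from collections import Counter


def train_nb(docs, labels, alpha=1.0):
    """Estimate log P(class) and smoothed log P(word | class); alpha must be > 0."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        class_docs = [doc for doc, label in zip(docs, labels) if label == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(word for doc in class_docs for word in doc.split())
        # alpha is added to every word count, so unseen words never get probability 0
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_likelihood


def predict_nb(model, text):
    """Return the class with the highest log posterior; unknown words are skipped."""
    log_prior, log_likelihood = model
    scores = {
        c: log_prior[c]
        + sum(log_likelihood[c][w] for w in text.split() if w in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


toy_docs = ["great fun film", "wonderful great acting", "boring bad film", "bad awful plot"]
toy_labels = [1, 1, 0, 0]
toy_model = train_nb(toy_docs, toy_labels, alpha=1.0)

print(predict_nb(toy_model, "great acting"))  # 1
print(predict_nb(toy_model, "bad plot"))      # 0
```

The word "great" never appears in a negative toy review, yet the smoothing still gives it a small non-zero probability for that class, which is the role of the `alpha` hyperparameter.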

## Model training and validation:

In [162]:
# View results for GridSearchCV search for best hyperparameters
def view_results(grid):
    
    results = grid.cv_results_
    
    print("GridSearchCV validation results:")
    print("")
    print("Best parameters:", grid.best_params_)
    print("")
    print("Mean accuracy:", results['mean_test_accuracy'])
    print("Mean precision:", results['mean_test_precision'])
    print("Mean recall:", results['mean_test_recall'])
    print("Mean F1 score:", results['mean_test_f1'])

In [163]:
# Scoring metrics used for GridSearchCV
scoring = {
    'accuracy': 'accuracy',
    'precision': 'precision',
    'recall': 'recall',
    'f1': 'f1'
}

In [164]:
# Function used to predict and calculate the accuracy, precision, recall and F1 score
def evaluate_model(model, X_v, y):
    
    y_pred = model.predict(X_v)

    test_accuracy = accuracy_score(y, y_pred)
    test_precision = precision_score(y, y_pred)
    test_recall = recall_score(y, y_pred)
    test_f1 = f1_score(y, y_pred)

    return test_accuracy, test_precision, test_recall, test_f1

### Model 1. Multinomial Naive Bayes

In [165]:
# Determine best hyperparameters for Multinomial Naive Bayes
mnb_param_grid = {
    'alpha': [0.1, 0.5, 1.0, 2.0]
}

mnb_grid = GridSearchCV(MultinomialNB(), mnb_param_grid, cv=5, scoring=scoring, refit='f1')
mnb_grid.fit(X_train_vec, y_train)

view_results(mnb_grid)

mnb = mnb_grid.best_estimator_

GridSearchCV validation results:

Best parameters: {'alpha': 1.0}

Mean accuracy: [0.8566     0.8602     0.86156667 0.86106667]
Mean precision: [0.85855164 0.86813816 0.87485682 0.8817984 ]
Mean recall: [0.85159035 0.84722735 0.84172298 0.83185612]
Mean F1 score: [0.85504084 0.85753623 0.8579405  0.85605236]


In [166]:
# Validate the Multinomial Naive Bayes model on the validation set
val_accuracy, val_precision, val_recall, val_f1 = evaluate_model(mnb, X_val_vec, y_val)
print(f'Multinomial Naive Bayes (validation test set):\nAccuracy: {val_accuracy}\nPrecision: {val_precision}\nRecall: {val_recall}\nF1: {val_f1}')

Multinomial Naive Bayes (validation test set):
Accuracy: 0.8608
Precision: 0.8856902710653498
Recall: 0.8325103693462375
F1: 0.8582773365913255


In [167]:
# Save the best Multinomial Naive Bayes model
with open('./models/MultinomialNB.pkl', 'wb') as f:
    pickle.dump(mnb, f)

### Model 2. Logistic Regression

In [168]:
# Determine the best hyperparameters for the Logistic Regression model
param_grid = [
    {
        'penalty': ['l1'],
        'C': [0.01, 0.1, 1, 10],
        'solver': ['liblinear', 'saga'],
        'max_iter': [1000]
    },
    {
        'penalty': ['l2'],
        'C': [0.01, 0.1, 1, 10],
        'solver': ['lbfgs', 'liblinear', 'newton-cg', 'sag', 'saga'],
        'max_iter': [1000]
    }
]

logreg_grid = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring=scoring, refit='f1')
logreg_grid.fit(X_train_vec, y_train)

view_results(logreg_grid)

logreg = logreg_grid.best_estimator_

GridSearchCV validation results:

Best parameters: {'C': 10, 'max_iter': 1000, 'penalty': 'l2', 'solver': 'saga'}

Mean accuracy: [0.5457     0.6158     0.7894     0.78963333 0.87393333 0.87396667
 0.8736     0.87376667 0.80613333 0.807      0.80566667 0.80573333
 0.80583333 0.85773333 0.8577     0.85763333 0.85756667 0.85753333
 0.88943333 0.8894     0.8894     0.8894     0.8894     0.89193333
 0.89223333 0.89126667 0.89223333 0.89226667]
Mean precision: [0.22755129 0.5735932  0.75253008 0.75215348 0.86200571 0.86196959
 0.86906398 0.86920884 0.78793142 0.7908828  0.78772923 0.78779101
 0.78793774 0.83629265 0.83658489 0.83638944 0.83628461 0.83627558
 0.87772297 0.87751975 0.87747289 0.87751975 0.87747032 0.88458111
 0.88529661 0.88399171 0.88529661 0.885355  ]
Mean recall: [0.35187111 0.8823332  0.85816854 0.8597123  0.8884413  0.88857553
 0.87776877 0.87797015 0.83427223 0.83125167 0.83333252 0.83339964
 0.83339964 0.88729971 0.88676275 0.886897   0.886897   0.88682989
 0.90327541 

In [169]:
# Validate the Logistic Regression model on the validation test set
val_accuracy, val_precision, val_recall, val_f1 = evaluate_model(logreg, X_val_vec, y_val)
print(f'Logistic Regression (validation test set):\nAccuracy: {val_accuracy}\nPrecision: {val_precision}\nRecall: {val_recall}\nF1: {val_f1}')

Logistic Regression (validation test set):
Accuracy: 0.8943
Precision: 0.8971054718477399
Recall: 0.8937388899861742
F1: 0.8954190165232018


In [170]:
# Save the best Logistic Regression model
with open('./models/LogisticRegression.pkl', 'wb') as f:
    pickle.dump(logreg, f)

### Model 3. Support Vector Machine

In [None]:
# Determine the best hyperparameters for the SVM model
svm_param_grid = {
    'C': [0.01, 0.1, 1, 10]
}

svm_grid = GridSearchCV(LinearSVC(), svm_param_grid, cv=5, scoring=scoring, refit='f1')
svm_grid.fit(X_train_vec, y_train)

view_results(svm_grid)

svm = svm_grid.best_estimator_

GridSearchCV validation results:

Best parameters: {'C': 0.1}

Mean accuracy: [0.8548     0.89113333 0.89116667 0.87566667]
Mean precision: [0.83070108 0.87744313 0.88470507 0.87187744]
Mean recall: [0.88884349 0.90763844 0.89790563 0.87884304]
Mean F1 score: [0.85876528 0.89226151 0.89124099 0.87531825]


In [172]:
# Validate the SVM model on the validation test set
val_accuracy, val_precision, val_recall, val_f1 = evaluate_model(svm, X_val_vec, y_val)
print(f'Linear SVC (validation test set):\nAccuracy: {val_accuracy}\nPrecision: {val_precision}\nRecall: {val_recall}\nF1: {val_f1}')

Linear SVC (validation test set):
Accuracy: 0.8945
Precision: 0.8917122752150117
Recall: 0.9010468101915861
F1: 0.8963552411828274


In [173]:
# Save the best SVM model
with open('./models/LinearSVC.pkl', 'wb') as f:
    pickle.dump(svm, f)

### Hyperparameters and Overfitting

- **What are hyperparameters?**:
<br>
You can think of hyperparameters as the settings of an ML model, which have to be defined before training begins. They can have a big impact on the performance of the model, and it is important to be aware that this impact can be negative as well as positive.

- **Which hyperparameters have been adjusted?**:

    For all models we have used `GridSearchCV` (sources: [mygreatlearning](https://www.mygreatlearning.com/blog/gridsearchcv/) & [scikit-learn.org](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)) to adjust the hyperparameters. It is a class from sklearn that tries every combination in a grid of candidate values and uses cross-validation to find the best hyperparameters for the model. For each combination we get metrics such as accuracy, precision, recall and F1 score, so we can see how each hyperparameter value affects the model.
    
    For each model we have set the parameters of `GridSearchCV` to:
    
    - the first parameter is the model, and the second is the grid of hyperparameter values to try.

    - `cv=5`: this is the number of folds in the cross-validation. We have set it to 5 because it gives a reasonable balance between a reliable score estimate and the computational cost.

    - `scoring=scoring`: so we get accuracy, precision, recall and F1 score for each hyperparameter combination.

    - `refit='f1'`: this is the metric used to select the best hyperparameters, after which the best model is refitted on the whole training set. We have chosen F1 score because it balances precision and recall.
    
    After fitting, `grid.best_params_` shows which hyperparameter values gave the best mean cross-validated F1 score on the training data. We also use `grid.best_estimator_` to get the best model.

    
    Furthermore, we also use our own validation set to validate the model and look at the metrics to see if the model is overfitting. A combination of these two methods is used to adjust the hyperparameters.
    <br>


    **List of hyperparameters adjusted for each model**:
    - **Multinomial Naive Bayes**:
        - Alpha (1): this is a smoothing parameter, which we have adjusted to handle zero probabilities for words that were not seen for a given class during training.

    - **Logistic Regression**:
        - C (10): This is the inverse of the regularization strength. We have adjusted this hyperparameter to control overfitting. If C is high, the regularization is weak and the model tries to classify all the training data correctly, which can lead to overfitting. On the other hand, if C is low, the regularization is strong and the model tolerates some misclassification of the training data, which can prevent overfitting.

        - Penalty (l2) & solver (saga): 
        The penalty has been adjusted to find which one gives the best model performance. We have adjusted the solver because a different solver can optimize the model better. Furthermore, it makes sense that saga is the preferred solver, because it should be good at handling large, sparse data, which is important for our vectorized text data.

        - max_iter (1000): changed to 1000 from the default of 100, because the data is high-dimensional and the solvers need more iterations to converge.

    - **Support Vector Machine**:
        - C (0.1): Same role as in Logistic Regression; we have adjusted it to prevent overfitting.

        - max_iter (1000): we don't adjust this hyperparameter, because the default value is already 1000.
        
- **Overfitting**: 
<br>
Overfitting is when a model is fitted so closely to the training data that it only understands the training data. If the model is presented with new data, it can't predict that data correctly. 
<br><br> 
One way to discover whether a model is overfitted is to compare its scores on the validation set with its scores on the training data. If the validation accuracy is clearly lower than the training accuracy, the model is likely overfitting.
<br>
How to prevent overfitting depends on the type of model. For example, in Multinomial Naive Bayes the alpha hyperparameter can be adjusted to prevent overfitting. Another example is adjusting the C hyperparameter in Logistic Regression and Support Vector Machine.
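The points above about `GridSearchCV` and about spotting overfitting can be combined in a small sketch. It assumes synthetic data from `make_classification` as a stand-in for the vectorized reviews, and it adds `return_train_score=True` (not used in the cells above) so that the scores on the training folds are recorded as well. The names `demo_scoring` and `demo_grid` are only for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): few samples, many uninformative features
X_demo, y_demo = make_classification(
    n_samples=200, n_features=50, n_informative=5, random_state=42
)

demo_scoring = {"accuracy": "accuracy", "f1": "f1"}
demo_grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 1, 100]},
    cv=5,
    scoring=demo_scoring,
    refit="f1",
    return_train_score=True,  # also keep the scores on the training folds
)
demo_grid.fit(X_demo, y_demo)

demo_results = demo_grid.cv_results_
for params, train_f1, val_f1 in zip(
    demo_results["params"],
    demo_results["mean_train_f1"],
    demo_results["mean_test_f1"],
):
    # A large positive gap means the model does much better on data it has seen
    gap = train_f1 - val_f1
    print(f"C={params['C']}: train F1={train_f1:.3f}, "
          f"validation F1={val_f1:.3f}, gap={gap:.3f}")

print("Best C by mean validation F1:", demo_grid.best_params_["C"])
```

A gap that grows as C increases would suggest that the weaker regularization lets the model fit noise in the training folds; the exact numbers depend on the data and are not shown here.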

## 3. Model Evaluation

This section will include:
- Evaluation of the models on the test set.
- Saving the metric scores for each model.
- Selection of the best model.

In [174]:
# Function that saves model result information (accuracy, precision, recall and F1 score) to a text file:
def save_resualts(name, **kwargs):
    path = f'./models/{name}_results.txt'

    # Write one "metric: value" line per keyword argument
    with open(path, 'w') as f:
        for key, value in kwargs.items():
            f.write(f"{key}: {value}\n")

### 3.1 Multinomial Naive Bayes evaluation:

In [175]:
accuracy, precision, recall, f1 = evaluate_model(mnb, X_test_vec, y_test)
save_resualts('MultinomialNB', accuracy=accuracy, precision=precision, recall=recall, f1=f1)

print(f'MultinomialNB:\nAccuracy: {accuracy}\nPrecision: {precision}\nRecall: {recall}\nF1: {f1}')

MultinomialNB:
Accuracy: 0.8632
Precision: 0.8797848127457066
Recall: 0.843818217900377
F1: 0.8614262560777958


### 3.2 Logistic Regression evaluation:

In [176]:
accuracy, precision, recall, f1 = evaluate_model(logreg, X_test_vec, y_test)
save_resualts('LogisticRegression', accuracy=accuracy, precision=precision, recall=recall, f1=f1)

print(f'LogisticRegression:\nAccuracy: {accuracy}\nPrecision: {precision}\nRecall: {recall}\nF1: {f1}')

LogisticRegression:
Accuracy: 0.8891
Precision: 0.8827425009738995
Recall: 0.8993847985711451
F1: 0.8909859431829352


### 3.3 Support Vector Machine evaluation:

In [177]:
accuracy, precision, recall, f1 = evaluate_model(svm, X_test_vec, y_test)
save_resualts('LinearSVC', accuracy=accuracy, precision=precision, recall=recall, f1=f1)

print(f'LinearSVC:\nAccuracy: {accuracy}\nPrecision: {precision}\nRecall: {recall}\nF1: {f1}')

LinearSVC:
Accuracy: 0.8911
Precision: 0.8792242703533026
Recall: 0.9087120460408811
F1: 0.8937249926807846


### The best fitting model:

When looking at the metrics for the models on the test set, we can see that the best model is the **Support Vector Machine**, with the highest accuracy (0.891), recall (0.909) and F1 score (0.894). Its precision (0.879) is marginally the lowest of the three models. However, it is important to note that **Logistic Regression** performs almost as well, with accuracy (0.889), recall (0.899) and F1 score (0.891) close behind and the highest precision (0.883).
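The comparison can be checked directly from the test scores printed in sections 3.1 to 3.3. The sketch below copies those scores, rounded to four decimals, into a dictionary and picks the best model per metric; the names `test_scores` and `best_per_metric` are only for this illustration.

```python
# Test-set scores from sections 3.1 to 3.3, rounded to four decimals
test_scores = {
    "MultinomialNB": {"accuracy": 0.8632, "precision": 0.8798, "recall": 0.8438, "f1": 0.8614},
    "LogisticRegression": {"accuracy": 0.8891, "precision": 0.8827, "recall": 0.8994, "f1": 0.8910},
    "LinearSVC": {"accuracy": 0.8911, "precision": 0.8792, "recall": 0.9087, "f1": 0.8937},
}

# For each metric, find the model with the highest score
best_per_metric = {
    metric: max(test_scores, key=lambda name: test_scores[name][metric])
    for metric in ["accuracy", "precision", "recall", "f1"]
}

for metric, name in best_per_metric.items():
    print(f"{metric}: {name} ({test_scores[name][metric]})")
# accuracy: LinearSVC (0.8911)
# precision: LogisticRegression (0.8827)
# recall: LinearSVC (0.9087)
# f1: LinearSVC (0.8937)
```

LinearSVC leads on three of the four metrics, while the differences to Logistic Regression are small on all of them.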