# Supervised Learning Project

## 1. Data Selection and Preparation

This section will include:
- loading the data,
- exploring it,
- preparing it for the modeling process, which includes splitting the data into training, validation and test sets.

In [2]:
import pandas as pd
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [3]:
df = pd.read_csv('./data/IMDB Dataset.csv')

In [4]:
df.head(5)

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [5]:
df.sample(5)

| | review | sentiment |
|---|---|---|
| 13543 | This could have been a rather entertaining fil... | negative |
| 16309 | Take a look at those faces alongside the entra... | positive |
| 33898 | Spoiler Alert I worked as an extra on this Lif... | positive |
| 38579 | This film was released the year I was born and... | positive |
| 11031 | First of all this movie is not a comedy; unles... | negative |


In [None]:
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})
X = df['review']
y = df['sentiment']

The data has two columns: `review` and `sentiment`. Each sentiment is either positive or negative. We have encoded these values as 1 for positive and 0 for negative, so that they can be used in the modeling process.
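
One detail of `Series.map` with a dictionary is that any value that is not a key in the dictionary becomes `NaN`. The following is a minimal sketch of how the encoding could be checked. It uses a made-up four-row series (with one deliberately unexpected label, `neutral`) rather than the real dataset, so the names and values here are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical miniature of the `sentiment` column, including one unexpected label
sentiment = pd.Series(["positive", "negative", "positive", "neutral"])

# Same mapping as in the notebook
encoded = sentiment.map({"positive": 1, "negative": 0})

# Labels missing from the dictionary become NaN, so count them explicitly
unmapped = int(encoded.isna().sum())

print(encoded.tolist())
print("unmapped labels:", unmapped)
```

If the count of unmapped labels is zero on the real data, every review received either a 0 or a 1.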

In [None]:
# Verify the data after mapping the sentiment:
df.sample(5)

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | 1 |
| 1 | A wonderful little production. <br /><br />The... | 1 |
| 2 | I thought this was a wonderful way to spend ti... | 1 |
| 3 | Basically there's a family where a little boy ... | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | 1 |
| ... | ... | ... |
| 49995 | I thought this movie did a down right good job... | 1 |
| 49996 | Bad plot, bad dialogue, bad acting, idiotic di... | 0 |
| 49997 | I am a Catholic taught in parochial elementary... | 0 |
| 49998 | I'm going to have to disagree with the previou... | 0 |


In [None]:
# Data split into train, test and validation sets
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

In [9]:
print("Training set size:", len(X_train))
print("Validation set size:", len(X_val))
print("Test set size:", len(X_test))

Training set size: 30000
Validation set size: 10000
Test set size: 10000


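The data has been split into 3 sets:
- training 60% (size 30000)
- testing 20% (size 10000)
- validation 20% (size 10000)

The first call to `train_test_split` holds out 20% of the data for testing. The second call takes 25% of the remaining 80% (that is, 20% of the whole dataset) for validation, which leaves 60% for training.

The `random_state` parameter is the seed for the random shuffle that is applied before the data is split. It is set to 42 so that the shuffle, and therefore the split, is the same each time the notebook is run.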
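
The two-step split and the effect of `random_state` can be illustrated on a small stand-in dataset. This is only a sketch: the 100 integers and the helper name `three_way_split` are assumptions made for the example, not part of the project code.

```python
from sklearn.model_selection import train_test_split

# Stand-in data: 100 integers instead of the 50,000 reviews
X_demo = list(range(100))
y_demo = [i % 2 for i in X_demo]


def three_way_split(X, y, seed):
    """Hypothetical helper that repeats the notebook's two-step split."""
    # Step 1: 20% of everything goes to the test set
    X_temp, X_test, y_temp, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    # Step 2: 25% of the remaining 80% (= 20% of everything) goes to validation
    X_train, X_val, y_train, y_val = train_test_split(
        X_temp, y_temp, test_size=0.25, random_state=seed
    )
    return X_train, X_val, X_test


first = three_way_split(X_demo, y_demo, seed=42)
second = three_way_split(X_demo, y_demo, seed=42)

sizes = [len(part) for part in first]
print("train/val/test sizes:", sizes)
print("same split on both runs:", first == second)
```

With the same seed the two runs produce identical subsets, which is what makes the results in this notebook reproducible.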

In [None]:
# Vectorize the text data so that it is numerical and can be fed into the model

vectorizer = TfidfVectorizer(stop_words='english')
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)
X_test_vec = vectorizer.transform(X_test)

All three X sets, which contain the reviews, are converted into TF-IDF vectors so that our models can process them.

The `fit_transform` method is used only on the training set, so that the vectorizer learns its vocabulary (and its IDF weights) from the training data alone. The validation and test sets are only transformed, so they are encoded with the vocabulary of the training set and no information from them leaks into training.
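
The difference between `fit_transform` and `transform` can be seen on a toy corpus. The sketch below uses three made-up two-word reviews (an assumption for illustration) and the default `TfidfVectorizer` settings rather than `stop_words='english'`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents; "great" appears only in the validation document
train_docs = ["good movie", "bad movie"]
val_docs = ["great movie"]

demo_vectorizer = TfidfVectorizer()

# The vocabulary is learned from the training documents only
X_train_demo = demo_vectorizer.fit_transform(train_docs)

# The validation document is encoded with that fixed vocabulary
X_val_demo = demo_vectorizer.transform(val_docs)

vocab = sorted(demo_vectorizer.vocabulary_)
print("vocabulary:", vocab)
print("validation matrix shape:", X_val_demo.shape)
print("non-zero entries in validation row:", X_val_demo.nnz)
```

The word "great" was never seen during fitting, so it is ignored and only "movie" contributes to the validation vector.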

## 2. Model Selection and Training

This section will include:
- Three different models which will be trained and validated on the data
- Explanations of the metrics used
- Explanation of what hyperparameters are and how we adjust them
- Explanation of what overfitting is and how we can spot it

### Explanations of the metrics used:

NB: _These metrics are used for both validation and testing of the models._

- **Accuracy**: the percentage of correct predictions. It gives an idea of how well the model is performing overall. It is calculated as the number of correct predictions divided by the total number of predictions.
<br> 
    `accuracy = (TP + TN) / (TP + TN + FP + FN)`

- **Precision**: tells us how many of the positive predictions were actually correct. It is calculated as the number of true positives divided by the sum of true positives and false positives.
<br>
    `precision = TP / (TP + FP)`


- **Recall**: tells us how many of the actual positive cases were predicted correctly. It is calculated as the number of true positives divided by the sum of true positives and false negatives.
<br>
    `recall = TP / (TP + FN)`


- **F1 Score**: a combined measurement of precision and recall, namely the harmonic mean of the two. It gives a more balanced view of the model's performance, because it takes both false positives and false negatives into account.
<br>
    `F1 = 2 * (precision * recall) / (precision + recall)`

All of the metrics we are going to use come from the `sklearn.metrics` module.
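
To make the four formulas concrete, here is a small worked example in plain Python. The confusion-matrix counts (40 true positives, 45 true negatives, 5 false positives, 10 false negatives) are invented for illustration, and the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Apply the four formulas above to raw confusion-matrix counts.

    Assumes the denominators are non-zero; no zero-division handling is included.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return accuracy, precision, recall, f1


# Invented counts for 100 predictions
accuracy, precision, recall, f1 = classification_metrics(tp=40, tn=45, fp=5, fn=10)

print(f"accuracy={accuracy:.4f}")
print(f"precision={precision:.4f}")
print(f"recall={recall:.4f}")
print(f"f1={f1:.4f}")
```

Here 85 of 100 predictions are correct (accuracy 0.85), 40 of 45 positive predictions are right (precision about 0.889), and 40 of 50 actual positives are found (recall 0.8), so the F1 score lands between precision and recall at about 0.842.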



### Choice of models:
- **Model 1: Multinomial Naive Bayes**
    - Reasoning:

- **Model 2: Logistic Regression**
    - Reasoning:

- **Model 3: Support Vector Machine**
    - Reasoning:

In [11]:
# Compute the validation metrics for a fitted model
def validate_model(model):
    """Return accuracy, precision, recall and F1 of `model` on the validation set."""
    y_val_pred = model.predict(X_val_vec)

    val_accuracy = accuracy_score(y_val, y_val_pred)
    val_precision = precision_score(y_val, y_val_pred)
    val_recall = recall_score(y_val, y_val_pred)
    val_f1 = f1_score(y_val, y_val_pred)

    return val_accuracy, val_precision, val_recall, val_f1

### Model 1 training & validation: Multinomial Naive Bayes

In [12]:
mnb = MultinomialNB()
mnb.fit(X_train_vec, y_train)

val_accuracy, val_precision, val_recall, val_f1 = validate_model(mnb)
print(f'Multinomial Naive Bayes:\nAccuracy: {val_accuracy}\nPrecision: {val_precision}\nRecall: {val_recall}\nF1: {val_f1}')

Multinomial Naive Bayes:
Accuracy: 0.8608
Precision: 0.8856902710653498
Recall: 0.8325103693462375
F1: 0.8582773365913255


### Model 2: Logistic Regression

In [24]:
log_reg = LogisticRegression(max_iter=1000) # Maximum number of iterations taken for the solvers to converge.

log_reg.fit(X_train_vec, y_train)

val_accuracy, val_precision, val_recall, val_f1 = validate_model(log_reg)
print(f'Logistic Regression:\nAccuracy: {val_accuracy}\nPrecision: {val_precision}\nRecall: {val_recall}\nF1: {val_f1}')

Logistic Regression:
Accuracy: 0.8915
Precision: 0.8909198113207547
Recall: 0.8953189808413984
F1: 0.8931139789183332


### Model 3: Support Vector Machine

In [14]:
svm = LinearSVC()
svm.fit(X_train_vec, y_train)

val_accuracy, val_precision, val_recall, val_f1 = validate_model(svm)
print(f'Linear SVC:\nAccuracy: {val_accuracy}\nPrecision: {val_precision}\nRecall: {val_recall}\nF1: {val_f1}')

Linear SVC:
Accuracy: 0.8921
Precision: 0.8936758893280632
Recall: 0.8931463559154651
F1: 0.8934110441568704


### Hyperparameters and Overfitting

- **What are hyperparameters?**:
<br>
You can think of hyperparameters as the settings of an ML model. They have to be defined before training begins, rather than being learned from the data. They can have a big impact on the performance of the model; however, it is important to be aware that this impact can be negative as well as positive.

- **Which hyperparameters have been adjusted?**:
    - Multinomial Naive Bayes:
        - x
        - y
        - z

    - Logistic Regression:
        - x
        - y
        - z

    - Support Vector Machine:
        - x
        - y
        - z


- **Overfitting**:
<br>
This is when a model fits the training data very well but predicts new data poorly, because it has become too specific to the training data.
<br><br>
One way to discover whether a model is overfitted is to compare its accuracy on the training set with its accuracy on the validation set. If the training accuracy is high but the validation accuracy is clearly lower, then the model is overfitting. Low accuracy on both sets points to underfitting instead.
<br>
How to adjust a model so that it does not overfit depends on the model. For example, in Multinomial Naive Bayes the `alpha` (smoothing) hyperparameter can be adjusted to reduce overfitting.
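
The two ideas above can be combined in one small experiment: set a hyperparameter before training, then compare training and validation accuracy to look for a gap. The sketch below is only an illustration under stated assumptions. The twelve short reviews are made up, the `alpha` values are arbitrary, and on such a tiny corpus the numbers say nothing about the real IMDB data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Made-up training and validation reviews (1 = positive, 0 = negative)
train_texts = [
    "a wonderful and moving film", "great acting and a great story",
    "I loved every minute", "wonderful story, loved it",
    "a boring and terrible film", "bad acting and a bad plot",
    "I hated every minute", "terrible plot, hated it",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
val_texts = [
    "a great and wonderful story", "loved the acting",
    "a terrible and boring plot", "hated the film",
]
val_labels = [1, 1, 0, 0]

demo_vectorizer = TfidfVectorizer()
X_tr = demo_vectorizer.fit_transform(train_texts)
X_va = demo_vectorizer.transform(val_texts)

results = {}
for alpha in (0.01, 1.0, 10.0):
    # alpha is a hyperparameter: it is chosen here, before fit() is called
    model = MultinomialNB(alpha=alpha)
    model.fit(X_tr, train_labels)

    train_acc = accuracy_score(train_labels, model.predict(X_tr))
    val_acc = accuracy_score(val_labels, model.predict(X_va))
    results[alpha] = (train_acc, val_acc)

    # A large positive gap (train much better than validation) hints at overfitting
    print(f"alpha={alpha}: train={train_acc:.2f} val={val_acc:.2f} gap={train_acc - val_acc:+.2f}")
```

In the project itself the same loop could be run with `X_train_vec`, `y_train`, `X_val_vec` and `y_val`, keeping the test set untouched until the final evaluation.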

## 3. Model Evaluation

This section will include:
- Final train & test/evaluation of the models
- Selection of the best model.


We have made a function (`evaluate_model`) to calculate these metrics (accuracy, precision, recall, F1 score) for each model on the test set.

In [15]:
def evaluate_model(model):
    
    y_test_pred = model.predict(X_test_vec)

    test_accuracy = accuracy_score(y_test, y_test_pred)
    test_precision = precision_score(y_test, y_test_pred)
    test_recall = recall_score(y_test, y_test_pred)
    test_f1 = f1_score(y_test, y_test_pred)

    return test_accuracy, test_precision, test_recall, test_f1
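
Since `validate_model` and `evaluate_model` differ only in the data they use, one possible refactoring is a single function that takes the data as arguments. The sketch below is an assumption about how that could look, not the project's code: `score_model` is a hypothetical name, and `AlwaysPositive` is a toy stand-in for a fitted classifier so that the example runs on its own.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def score_model(model, X, y_true):
    """Hypothetical helper: metrics of a fitted `model` on any (X, y_true) pair."""
    y_pred = model.predict(X)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }


class AlwaysPositive:
    """Toy stand-in for a fitted classifier; it predicts 1 for every sample."""

    def predict(self, X):
        return [1] * len(X)


scores = score_model(AlwaysPositive(), X=["a", "b", "c", "d"], y_true=[1, 0, 1, 1])
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

With the notebook's variables, this could be called as `score_model(mnb, X_val_vec, y_val)` during validation and `score_model(mnb, X_test_vec, y_test)` for the final evaluation.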

In [None]:
# Save a model's results (accuracy, precision, recall and F1 score) to a text file:
def save_resualts(name, **kwargs):
    path = f'./models/{name}_results.txt'

    with open(path, 'w') as f:
        for key, value in kwargs.items():
            f.write(f"{key}: {value}\n")

### Multinomial Naive Bayes evaluation:

In [17]:
accuracy, precision, recall, f1 = evaluate_model(mnb)
save_resualts('MultinomialNB', accuracy=accuracy, precision=precision, recall=recall, f1=f1)

print(f'MultinomialNB:\nAccuracy: {accuracy}\nPrecision: {precision}\nRecall: {recall}\nF1: {f1}')

MultinomialNB:
Accuracy: 0.8632
Precision: 0.8797848127457066
Recall: 0.843818217900377
F1: 0.8614262560777958


In [18]:
# Save the model (the with-block closes the file once the dump is done)
with open('./models/MultinomialNB.pkl', 'wb') as f:
    pickle.dump(mnb, f)

### Logistic Regression evaluation:

In [19]:
accuracy, precision, recall, f1 = evaluate_model(log_reg)
save_resualts('LogisticRegression', accuracy=accuracy, precision=precision, recall=recall, f1=f1)

print(f'LogisticRegression:\nAccuracy: {accuracy}\nPrecision: {precision}\nRecall: {recall}\nF1: {f1}')

LogisticRegression:
Accuracy: 0.8911
Precision: 0.8814213982232523
Recall: 0.9057352649335185
F1: 0.8934129392189488


In [20]:
# Save the model (the with-block closes the file once the dump is done)
with open('./models/LogisticRegression.pkl', 'wb') as f:
    pickle.dump(log_reg, f)

### Support Vector Machine evaluation:

In [21]:
accuracy, precision, recall, f1 = evaluate_model(svm)
save_resualts('LinearSVC', accuracy=accuracy, precision=precision, recall=recall, f1=f1)

print(f'LinearSVC:\nAccuracy: {accuracy}\nPrecision: {precision}\nRecall: {recall}\nF1: {f1}')

LinearSVC:
Accuracy: 0.8892
Precision: 0.8839617112717327
Recall: 0.8979956340543759
F1: 0.8909234101201023


In [22]:
# Save the model (the with-block closes the file once the dump is done)
with open('./models/LinearSVC.pkl', 'wb') as f:
    pickle.dump(svm, f)
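
A pickled classifier on its own expects TF-IDF vectors, not raw text, so the fitted vectorizer is needed again at prediction time. One possible approach, sketched here under assumptions, is to pickle the vectorizer together with the model. The tiny corpus, the dictionary keys and the file name `demo_bundle.pkl` are all hypothetical, and a temporary directory is used so the example leaves no files behind.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data (1 = positive, 0 = negative)
docs = ["good movie", "great film", "bad movie", "awful film"]
labels = [1, 1, 0, 0]

demo_vectorizer = TfidfVectorizer()
demo_model = MultinomialNB().fit(demo_vectorizer.fit_transform(docs), labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_bundle.pkl")  # hypothetical file name

    # Store the vectorizer and the model side by side
    with open(path, "wb") as f:
        pickle.dump({"vectorizer": demo_vectorizer, "model": demo_model}, f)

    # Load them back, as a separate script would do
    with open(path, "rb") as f:
        bundle = pickle.load(f)

new_reviews = ["good film", "awful movie"]
before = demo_model.predict(demo_vectorizer.transform(new_reviews)).tolist()
after = bundle["model"].predict(bundle["vectorizer"].transform(new_reviews)).tolist()

print("predictions before saving:", before)
print("predictions after loading:", after)
```

The reloaded pair gives the same predictions as the original objects, which is the property to check before relying on saved files.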

### The best-fitting model

When looking at the metrics for the models on the test set, we can see that the best model is **x**, with the highest accuracy (x), precision (x), recall (x) and F1 score (x).
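
The comparison can also be done programmatically. The sketch below copies the test-set metrics printed in the evaluation cells above (rounded to four decimals) into a dictionary and finds the top model for each metric. The dictionary layout and variable names are assumptions for illustration.

```python
# Test-set metrics copied from the evaluation outputs above (rounded to 4 decimals)
test_results = {
    "MultinomialNB": {"accuracy": 0.8632, "precision": 0.8798, "recall": 0.8438, "f1": 0.8614},
    "LogisticRegression": {"accuracy": 0.8911, "precision": 0.8814, "recall": 0.9057, "f1": 0.8934},
    "LinearSVC": {"accuracy": 0.8892, "precision": 0.8840, "recall": 0.8980, "f1": 0.8909},
}

best_per_metric = {}
for metric in ("accuracy", "precision", "recall", "f1"):
    # Pick the model name whose value for this metric is the largest
    best_per_metric[metric] = max(test_results, key=lambda name: test_results[name][metric])

for metric, name in best_per_metric.items():
    print(f"{metric}: {name} ({test_results[name][metric]})")
```

On these numbers no single model leads on all four metrics, so the final choice depends on which metric matters most for the task.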