# Supervised Learning Project

## 1. Data Selection and Preparation

This section will include:
- loading the data,
- exploring it,
- preparing it for modeling, which includes splitting the data into training, validation, and test sets.

In [91]:
import pandas as pd
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [92]:
df = pd.read_csv('./data/IMDB Dataset.csv')

In [93]:
df.head(5)

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [94]:
df.sample(5)

| | review | sentiment |
|---|---|---|
| 24898 | John Carpenter's Halloween is quite frankly a ... | positive |
| 47378 | Sheltered young woman, home-schooled and possi... | negative |
| 946 | I first remember bumping into this zaniness fr... | positive |
| 41825 | The slightly overlong set up episode of the pr... | positive |
| 49240 | Yeah, I "get" Pasolini and his milieu, but at ... | negative |


In [95]:
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})
X = df['review']
y = df['sentiment']

In [96]:
df

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | 1 |
| 1 | A wonderful little production. <br /><br />The... | 1 |
| 2 | I thought this was a wonderful way to spend ti... | 1 |
| 3 | Basically there's a family where a little boy ... | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | 1 |
| ... | ... | ... |
| 49995 | I thought this movie did a down right good job... | 1 |
| 49996 | Bad plot, bad dialogue, bad acting, idiotic di... | 0 |
| 49997 | I am a Catholic taught in parochial elementary... | 0 |
| 49998 | I'm going to have to disagree with the previou... | 0 |


In [97]:
# Data split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
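The cell above produces a training set and a test set. If a separate validation set is also wanted (for example, for tuning hyperparameters without touching the test set), one common approach is to call `train_test_split` twice. The following is a minimal sketch under assumptions: the toy `texts` and `labels` lists are hypothetical stand-ins for `X` and `y`, and the 60/20/20 proportions and the use of `stratify` are illustrative choices, not taken from the notebook.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the review texts (X) and labels (y).
texts = [f"review {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]

# Step 1: hold out 20% of the data as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Step 2: split the remaining 80% into training and validation sets.
# 0.25 of the remaining 80% equals 20% of the full data.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)

print(len(X_train), len(X_val), len(X_test))
```

With this layout, models would be compared on the validation set and the test set would be used only once, for the final estimate.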

In [98]:
vectorizer = TfidfVectorizer(stop_words='english')
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

## 2. Model Selection and Training

This section will include:
- three different models which will be trained, tested and validated on the data
- explanation of what hyperparameters are and how we adjust them

Choice of models:
- Model 1: Multinomial Naive Bayes
- Model 2: Logistic Regression
- Model 3: Support Vector Machine
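Hyperparameters are settings chosen before training rather than learned from the data, such as `C` in `LogisticRegression` or `alpha` in `MultinomialNB`. One possible way to adjust them is a cross-validated grid search. The sketch below is illustrative only: the tiny corpus, the candidate values for `C`, and `cv=2` are assumptions made so that the example runs on its own, not settings used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny hypothetical corpus standing in for the review data.
texts = [
    "great film, loved it", "wonderful acting and story",
    "a brilliant, moving picture", "really enjoyable and fun",
    "terrible plot, awful acting", "boring and far too long",
    "a dull, lifeless mess", "bad script and worse direction",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Putting the vectorizer inside the pipeline means it is refit on each
# training fold, so no information leaks from the held-out fold.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Candidate values for the regularization strength C (assumed values).
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_, search.best_score_)
```

On the real data, a larger `cv` (such as 5) and a wider grid would likely be more appropriate; the same pattern applies to `alpha` for `MultinomialNB` and `C` for `LinearSVC`.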

### Model 1: Multinomial Naive Bayes

In [99]:
mnb = MultinomialNB()
mnb.fit(X_train_vec, y_train)

# Save the fitted model; the context manager closes the file afterwards
with open('./models/MultinomialNB.pkl', 'wb') as f:
    pickle.dump(mnb, f)

### Model 2: Logistic Regression

In [100]:
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_vec, y_train)

# Save the fitted model; the context manager closes the file afterwards
with open('./models/LogisticRegression.pkl', 'wb') as f:
    pickle.dump(log_reg, f)

### Model 3: Support Vector Machine

In [101]:
svm = LinearSVC()
svm.fit(X_train_vec, y_train)

# Save the fitted model; the context manager closes the file afterwards
with open('./models/LinearSVC.pkl', 'wb') as f:
    pickle.dump(svm, f)
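The cells in this section pickle only the classifiers. The fitted `vectorizer` is not saved, so a reloaded model could not be applied to raw review text without it. One possible remedy, sketched below under assumptions, is to bundle the vectorizer and the classifier into a single scikit-learn pipeline and pickle that. The toy corpus, the temporary directory, and the file name `nb_pipeline.pkl` are hypothetical and chosen only to keep the example self-contained.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus standing in for the review data.
texts = [
    "great film, loved it", "wonderful acting and story",
    "terrible plot, awful acting", "boring and far too long",
]
labels = [1, 1, 0, 0]

# Vectorizer and classifier travel together in one object.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Round trip through pickle, using a temporary directory for the demo.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "nb_pipeline.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored pipeline accepts raw text directly.
new_reviews = ["loved the acting", "awful and boring"]
same = list(restored.predict(new_reviews)) == list(model.predict(new_reviews))
print(same)
```

An alternative with the same effect would be to pickle `vectorizer` to its own file next to the three model files.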

## 3. Model Evaluation

This section will include:
- evaluation of the models and explanations of the metrics used
- the best model selection
- explanation of what overfitting is and how we can spot it


### Evaluation metrics:
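- **Accuracy**: the share of all reviews that are classified correctly.
- **Precision**: of the reviews predicted as positive, the share that are actually positive.
- **Recall**: of the reviews that are actually positive, the share that the model predicts as positive.
- **F1 Score**: the harmonic mean of precision and recall.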
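All four metrics can be derived from four counts: true positives (`tp`), false positives (`fp`), false negatives (`fn`), and true negatives (`tn`). The short sketch below works through the arithmetic with made-up counts; the function name `classification_metrics` and the numbers are hypothetical and serve only as an illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return accuracy, precision, recall, f1


# Hypothetical counts for 20 reviews: 8 + 6 are classified correctly.
acc, prec, rec, f1_value = classification_metrics(tp=8, fp=2, fn=4, tn=6)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1_value:.3f}")
```

Here accuracy is 14/20, precision is 8/10, recall is 8/12, and F1 is 16/22, which shows how F1 sits between precision and recall but closer to the lower of the two.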

We have made a function (`evaluate_model`) to calculate these metrics:

In [102]:
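def evaluate_model(model):
    """Return accuracy, precision, recall and F1 of `model` on the test set."""
    # Relies on the notebook-level X_test_vec and y_test defined above
    y_pred = model.predict(X_test_vec)

    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    return accuracy, precision, recall, f1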

In [103]:
def save_resualts(name, **kwargs):
    """Write each keyword argument as a 'key: value' line to ./models/<name>_results.txt."""
    path = f'./models/{name}_results.txt'

    with open(path, 'w') as f:
        for key, value in kwargs.items():
            f.write(f"{key}: {value}\n")

### Multinomial Naive Bayes evaluation:

In [104]:
accuracy, precision, recall, f1 = evaluate_model(mnb)
save_resualts('MultinomialNB', accuracy=accuracy, precision=precision, recall=recall, f1=f1)

print(f'MultinomialNB:\nAccuracy: {accuracy}\nPrecision: {precision}\nRecall: {recall}\nF1: {f1}')

MultinomialNB:
Accuracy: 0.8652
Precision: 0.8753304860687411
Recall: 0.854137725739234
F1: 0.8646042587384491


### Logistic Regression evaluation:

In [105]:
accuracy, precision, recall, f1 = evaluate_model(log_reg)
save_resualts('LogisticRegression', accuracy=accuracy, precision=precision, recall=recall, f1=f1)

print(f'LogisticRegression:\nAccuracy: {accuracy}\nPrecision: {precision}\nRecall: {recall}\nF1: {f1}')

LogisticRegression:
Accuracy: 0.8942
Precision: 0.8827148625264373
Recall: 0.9110934709267712
F1: 0.8966796875


### Support Vector Machine evaluation:

In [106]:
accuracy, precision, recall, f1 = evaluate_model(svm)
save_resualts('LinearSVC', accuracy=accuracy, precision=precision, recall=recall, f1=f1)

print(f'LinearSVC:\nAccuracy: {accuracy}\nPrecision: {precision}\nRecall: {recall}\nF1: {f1}')

LinearSVC:
Accuracy: 0.8945
Precision: 0.8889105818039829
Recall: 0.9035522921214527
F1: 0.896171636649936


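### The best-fitting model

On the test set, Logistic Regression and LinearSVC perform almost identically, and both clearly outperform Multinomial Naive Bayes (accuracy 0.8652, F1 0.8646). LinearSVC has the highest accuracy (0.8945) and precision (0.8889), while Logistic Regression has the highest recall (0.9111) and F1 score (0.8967). The gap between the two is so small that a single train/test split does not clearly separate them.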

### What is overfitting?

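Overfitting occurs when a model fits the noise and quirks of its training data rather than the underlying pattern, so it scores well on the training set but noticeably worse on unseen data. It can be spotted by comparing the same metric on the training and test sets: a large gap in favor of the training set is the typical warning sign. The evaluations in this section report test-set scores only, so they do not by themselves show whether these models overfit.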
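One way to check for overfitting is to compute the same metric on the training data and on the test data and look at the difference. The sketch below is a minimal illustration under assumptions: the helper `train_test_gap` and the toy corpus are hypothetical and not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC


def train_test_gap(model, X_train, y_train, X_test, y_test):
    """Return (train accuracy, test accuracy, gap) for a fitted classifier."""
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    return train_acc, test_acc, train_acc - test_acc


# Tiny hypothetical corpus standing in for the review data.
train_texts = [
    "great film, loved it", "wonderful acting and story",
    "a brilliant, moving picture", "really enjoyable and fun",
    "terrible plot, awful acting", "boring and far too long",
    "a dull, lifeless mess", "bad script and worse direction",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = [
    "loved this wonderful story", "a fun and moving film",
    "boring script and bad acting", "awful, dull and too long",
]
test_labels = [1, 1, 0, 0]

# Fit the vectorizer on the training texts only, then reuse it for the test texts.
vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_texts)
X_te = vec.transform(test_texts)

clf = LinearSVC().fit(X_tr, train_labels)
train_acc, test_acc, gap = train_test_gap(clf, X_tr, train_labels, X_te, test_labels)
print(f"train={train_acc:.3f} test={test_acc:.3f} gap={gap:.3f}")
```

In this notebook, the equivalent call would presumably be `train_test_gap(svm, X_train_vec, y_train, X_test_vec, y_test)`, and likewise for `mnb` and `log_reg`. A training accuracy far above the test accuracy would suggest overfitting.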