# Feature Selection, Model Training, Evaluation, & Tuning

## Table of Contents

- [1. Data Cleaning and Preprocessing](#1.-Data-Cleaning-and-Preprocessing)
    - [1.1. Importing and Cleaning Data](#1.1.-Importing-and-Cleaning-Data)
    - [1.2. Filtering and Splitting Data](#1.2.-Filtering-and-Splitting-Data)
- [2. Training and Evaluating Baseline Models](#2.-Training-and-Evaluating-Baseline-Models)
- [3. Tuning Hyperparameters and Assessing Performance](#3.-Tuning-Hyperparameters-and-Assessing-Performance)
- [4. Displaying Evaluation Metrics for Baseline and Optimized Models](#4.-Displaying-Evaluation-Metrics-for-Baseline-and-Optimized-Models)

## 1. Data Cleaning and Preprocessing

### 1.1. Importing and Cleaning Data

In [1]:
# Import libraries for data manipulation
import pandas as pd
import numpy as np
import re

# Import libraries for text vectorization, model training, and feature selection
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2

# Import classifiers for model training such as Logistic Regression and Random Forest
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB

# Import model selection tools and evaluation metrics like accuracy, AUC, and F1 score
from sklearn import model_selection
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import roc_auc_score, f1_score

# Import libraries for tokenization, lemmatization, and stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Suppress all warnings during execution
import warnings
warnings.filterwarnings('ignore')

# Set the maximum column width for displaying DataFrame
pd.options.display.max_colwidth = 1000


# Load the IMDB dataset from a CSV file and display the first 3 rows
imdb = pd.read_csv('Data/IMDB_Dataset.csv')
imdb.head(3)

Unnamed: 0,Ratings,Reviews,Movies,Resenhas
0,1.0,"*Disclaimer: I only watched this movie as a conditional agreement. And I see films for free. I wouldn't be caught dead giving my hard earned money to these idiots.Well, to explain the depth of this 'film', I could write my shortest review, ever. Don't see this movie. It is by far the stupidest, lamest, most lazy, and unbelievably UNFUNNY movie I have ever seen. It is a total disaster. But since my hatred for this movie, and the others like it, extends far beyond one viewing, I think I'll go on for a bit.I don't know any of the people in the movie besides Carmen Electra, Vanessa Minnillo, and Kim Kardashian, but it doesn't matter. They're all horrible, though I think that was the point. The editing is flat out horrible, and possibly blatant continuity errors make this crapfast even crappier than I thought it would be. Now I know that these films are not supposed to be serious at all, but come on, it's film-making 101 that if someone gets a minor facial cut, it should be there in the...",Disaster Movie,"* Isenção de responsabilidade: eu só assisti esse filme como um acordo condicional. E eu vejo filmes de graça. Eu não seria pego morto dando meu dinheiro suado a esses idiotas. Bem, para explicar a profundidade desse 'filme', eu poderia escrever minha crítica mais curta de todos os tempos. Não vê este filme. É de longe o filme mais estúpido, lamenta, preguiçoso e inacreditavelmente UNFUNNY que eu já vi. É um desastre total. Mas como o meu ódio por este filme e por outros, se estende muito além de uma exibição, acho que vou continuar um pouco. Não conheço nenhuma das pessoas do filme além de Carmen Electra, Vanessa Minnillo, e Kim Kardashian, mas isso não importa. Eles são todos horríveis, embora eu ache que esse seja o ponto. A edição é horrível e, possivelmente, erros de continuidade flagrantes tornam essa porcaria ainda mais horrível do que eu pensava. 
Agora eu sei que esses filmes não devem ser sérios, mas vamos lá, é o cinema 101 que se alguém fizer um pequeno corte facial, ele..."
1,1.0,"I am writing this in hopes that this gets put over the previous review of this ""film"". How anyone can find this slop entertaining is completely beyond me. First of all a spoof film entitled ""Disaster Movie"", should indeed be a spoof on disaster films. Now I have seen 1 (yes count them, 1) disaster film being spoofed, that being ""Twister"". How does Juno, Iron Man, Batman, The Hulk, Alvin and the Chipmunks, Amy Winehouse, or Hancock register as Disaster films? Selzterwater and Failburg once again have shown that they lack any sort of writing skill and humor. Having unfortunately been tortured with Date Movie and Epic Movie I know exactly what to expect from these two...no plot, no jokes just bad references and cheaply remade scenes from other films. Someone should have informed them that satire is more than just copy and paste from one film to another, though I shouldn't say that because some of these actually just seem to be taken from trailers.There is nothing clever or witty or re...",Disaster Movie,"Estou escrevendo isso na esperança de que isso seja colocado sobre a revisão anterior deste ""filme"". Como alguém pode achar divertido esse desleixo está completamente além de mim. Antes de mais nada, um filme de paródia intitulado ""Filme de desastre"" deveria ser, de fato, uma paródia de filmes de desastre. Agora eu já vi 1 (sim, conte-os, 1) filme de desastre sendo falsificado, sendo ""Twister"". Como Juno, Homem de Ferro, Batman, O Hulk, Alvin e os Esquilos, Amy Winehouse ou Hancock se registram como filmes de Desastre? Selzterwater e Failburg mostraram mais uma vez que não possuem nenhum tipo de habilidade e humor de escrita. Infelizmente, tendo sido torturado com Date Movie e Epic Movie, sei exatamente o que esperar desses dois ... nenhum enredo, nenhuma piada, apenas más referências e cenas refeitas de outros filmes. 
Alguém deveria ter informado a eles que a sátira é mais do que apenas copiar e colar de um filme para outro, embora eu não deva dizer isso porque alguns deles realme..."
2,1.0,"Really, I could write a scathing review of this turd sandwich, but instead, I'm just going to be making a few observations and points I've deduced.There's just no point in watching these movies anymore. Does any reader out there remember Scary Movie? Remember how it was original with a few comedic elements to it? There was slapstick, some funny lines, it was a pretty forgettable comedy, but it was worth the price of admission. Well, That was the last time this premise was funny. STOP MAKING THESE MOVIES. PLEASE.I could call for a boycott of these pieces of monkey sh*t, but we all know there's going to be a line up of pre pubescent annoying little buggers, spouting crappy one liners like, ""THIS IS SPARTA!"" and, ""IM RICK JAMES BITCH"" so these movies will continue to make some form of monetary gain, considering the production value of this movie looks like it cost about 10 cents to make.Don't see this movie. Don't spend any money on it. Go home, rent Airplane, laugh your ass off, and ...",Disaster Movie,"Realmente, eu poderia escrever uma crítica contundente sobre esse sanduíche de cocô, mas, em vez disso, vou fazer algumas observações e pontos que deduzi. Não há mais sentido assistir a esses filmes. Algum leitor por aí se lembra do filme de terror? Lembra como era original, com alguns elementos cômicos? Havia palhaçada, algumas frases engraçadas, era uma comédia bastante esquecível, mas valia o preço da entrada. Bem, essa foi a última vez que essa premissa foi engraçada. PARE DE FAZER ESTES FILMES. POR FAVOR, eu poderia pedir um boicote a esses pedaços de macaco, mas todos sabemos que haverá uma fila de buggers irritantes e pré-pubescentes, jorrando uns forros ruins como: ""ISTO É SPARTA!"" e ""IM RICK JAMES BITCH"", para que esses filmes continuem gerando algum ganho monetário, considerando que o valor de produção deste filme parece custar cerca de 10 centavos de dólar. Não gaste dinheiro com isso. 
Vá para casa, alugue a Airplane, ria e julgue silenciosamente as pessoas que estão fal..."


In [2]:
# Customize stop words by adding and removing specific words
stop_words = stopwords.words('english')
new_stopwords = ["would", "could", "shall", "might"]

stop_words.extend(new_stopwords)
stop_words.remove("not")
stop_words = set(stop_words)


# Remove URLs from text
def remove_urls(text):
    return re.sub(r"http\S+|www\S+", "", text)

# Remove emails from text
def remove_emails(text):
    return re.sub(r"\S+@\S+", "", text)

# Remove special characters from text
def remove_special_chars(text):
    return re.sub(r"\W+", " ", text)

# Collapse consecutive repeats of the same word into one occurrence (e.g., "very very" -> "very")
def reduce_repeated_sequences(text):
    text = re.sub(r"\b(\w+)(\s+\1)+\b", r"\1", text)
    return text

# Remove runs of specific filler words entirely (e.g., "blah blah blah" -> "")
def remove_repeated_sequences(text):
    target_words = ['blah', 'mario', 'la']
    text = re.sub(r"\b(" + "|".join(target_words) + r")(\s+\1)+\b", "", text)
    return text

# Remove stopwords from text
def remove_stopwords(text):
    filtered_words = []
    for word in text.split():
        normalized_word = word.lower()
        if normalized_word not in stop_words and normalized_word.isalpha():
            filtered_words.append(normalized_word)
    return " ".join(filtered_words)

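# Expand common contractions in text
# Note: matching is case-sensitive, so capitalized forms such as "Don't" are left unexpanded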
def expand_contractions(text):
    contractions_dict = {
        "shouldn't": "should not",
        "weren't": "were not",
        "won't": "will not",
        "mightn't": "might not",
        "couldn't": "could not",
        "can't": "cannot",
        "didn't": "did not",
        "don't": "do not",
        "needn't": "need not",
        "haven't": "have not",
        "hasn't": "has not",
        "'re": " are",
        "'m": " am",
        "'ll": " will",
        "'ve": " have"
    }
    for contraction, expansion in contractions_dict.items():
        text = re.sub(contraction, expansion, text)
    return text
    
# Clean text by applying multiple preprocessing steps
def clean_data(text):
    text = remove_urls(text)
    text = remove_emails(text)
    text = expand_contractions(text)
    text = remove_special_chars(text)
    text = remove_stopwords(text)
    text = remove_repeated_sequences(text)
    text = reduce_repeated_sequences(text)
    text = " ".join(text.split())
    return text


# Apply text cleaning function to 'Reviews' and display the first 3 rows of the dataset
imdb['Reviews_Clean'] = imdb['Reviews'].apply(clean_data)
imdb.head(3)

Unnamed: 0,Ratings,Reviews,Movies,Resenhas,Reviews_Clean
0,1.0,"*Disclaimer: I only watched this movie as a conditional agreement. And I see films for free. I wouldn't be caught dead giving my hard earned money to these idiots.Well, to explain the depth of this 'film', I could write my shortest review, ever. Don't see this movie. It is by far the stupidest, lamest, most lazy, and unbelievably UNFUNNY movie I have ever seen. It is a total disaster. But since my hatred for this movie, and the others like it, extends far beyond one viewing, I think I'll go on for a bit.I don't know any of the people in the movie besides Carmen Electra, Vanessa Minnillo, and Kim Kardashian, but it doesn't matter. They're all horrible, though I think that was the point. The editing is flat out horrible, and possibly blatant continuity errors make this crapfast even crappier than I thought it would be. Now I know that these films are not supposed to be serious at all, but come on, it's film-making 101 that if someone gets a minor facial cut, it should be there in the...",Disaster Movie,"* Isenção de responsabilidade: eu só assisti esse filme como um acordo condicional. E eu vejo filmes de graça. Eu não seria pego morto dando meu dinheiro suado a esses idiotas. Bem, para explicar a profundidade desse 'filme', eu poderia escrever minha crítica mais curta de todos os tempos. Não vê este filme. É de longe o filme mais estúpido, lamenta, preguiçoso e inacreditavelmente UNFUNNY que eu já vi. É um desastre total. Mas como o meu ódio por este filme e por outros, se estende muito além de uma exibição, acho que vou continuar um pouco. Não conheço nenhuma das pessoas do filme além de Carmen Electra, Vanessa Minnillo, e Kim Kardashian, mas isso não importa. Eles são todos horríveis, embora eu ache que esse seja o ponto. A edição é horrível e, possivelmente, erros de continuidade flagrantes tornam essa porcaria ainda mais horrível do que eu pensava. 
Agora eu sei que esses filmes não devem ser sérios, mas vamos lá, é o cinema 101 que se alguém fizer um pequeno corte facial, ele...",disclaimer watched movie conditional agreement see films free caught dead giving hard earned money idiots well explain depth film write shortest review ever see movie far stupidest lamest lazy unbelievably unfunny movie ever seen total disaster since hatred movie others like extends far beyond one viewing think go bit not know people movie besides carmen electra vanessa minnillo kim kardashian matter horrible though think point editing flat horrible possibly blatant continuity errors make crapfast even crappier thought know films not supposed serious come film making someone gets minor facial cut next shot someone gets cut sword blood least cut though since narnia films get away give disaster movie pass jokes thoughtless mindless physical gags obviously take popular movies last year late well including best picture nominees know saddest thing stupid movies not care much money make many cameos sorry ass excuses films taking away jobs actors writers directors truly deserve attention ...
1,1.0,"I am writing this in hopes that this gets put over the previous review of this ""film"". How anyone can find this slop entertaining is completely beyond me. First of all a spoof film entitled ""Disaster Movie"", should indeed be a spoof on disaster films. Now I have seen 1 (yes count them, 1) disaster film being spoofed, that being ""Twister"". How does Juno, Iron Man, Batman, The Hulk, Alvin and the Chipmunks, Amy Winehouse, or Hancock register as Disaster films? Selzterwater and Failburg once again have shown that they lack any sort of writing skill and humor. Having unfortunately been tortured with Date Movie and Epic Movie I know exactly what to expect from these two...no plot, no jokes just bad references and cheaply remade scenes from other films. Someone should have informed them that satire is more than just copy and paste from one film to another, though I shouldn't say that because some of these actually just seem to be taken from trailers.There is nothing clever or witty or re...",Disaster Movie,"Estou escrevendo isso na esperança de que isso seja colocado sobre a revisão anterior deste ""filme"". Como alguém pode achar divertido esse desleixo está completamente além de mim. Antes de mais nada, um filme de paródia intitulado ""Filme de desastre"" deveria ser, de fato, uma paródia de filmes de desastre. Agora eu já vi 1 (sim, conte-os, 1) filme de desastre sendo falsificado, sendo ""Twister"". Como Juno, Homem de Ferro, Batman, O Hulk, Alvin e os Esquilos, Amy Winehouse ou Hancock se registram como filmes de Desastre? Selzterwater e Failburg mostraram mais uma vez que não possuem nenhum tipo de habilidade e humor de escrita. Infelizmente, tendo sido torturado com Date Movie e Epic Movie, sei exatamente o que esperar desses dois ... nenhum enredo, nenhuma piada, apenas más referências e cenas refeitas de outros filmes. 
Alguém deveria ter informado a eles que a sátira é mais do que apenas copiar e colar de um filme para outro, embora eu não deva dizer isso porque alguns deles realme...",writing hopes gets put previous review film anyone find slop entertaining completely beyond first spoof film entitled disaster movie indeed spoof disaster films seen yes count disaster film spoofed twister juno iron man batman hulk alvin chipmunks amy winehouse hancock register disaster films selzterwater failburg shown lack sort writing skill humor unfortunately tortured date movie epic movie know exactly expect two plot jokes bad references cheaply remade scenes films someone informed satire copy paste one film another though not say actually seem taken trailers nothing clever witty remotely smart way two write cannot believe people still pay see travesties insult audience though enjoy films doubt smart enough realize rating unfortunately not number low enough yes includes negatives rate deserves top worst films time right date movie epic faliure mean movie meet spartans rather forced hour manos hands fate marathon watch slop
2,1.0,"Really, I could write a scathing review of this turd sandwich, but instead, I'm just going to be making a few observations and points I've deduced.There's just no point in watching these movies anymore. Does any reader out there remember Scary Movie? Remember how it was original with a few comedic elements to it? There was slapstick, some funny lines, it was a pretty forgettable comedy, but it was worth the price of admission. Well, That was the last time this premise was funny. STOP MAKING THESE MOVIES. PLEASE.I could call for a boycott of these pieces of monkey sh*t, but we all know there's going to be a line up of pre pubescent annoying little buggers, spouting crappy one liners like, ""THIS IS SPARTA!"" and, ""IM RICK JAMES BITCH"" so these movies will continue to make some form of monetary gain, considering the production value of this movie looks like it cost about 10 cents to make.Don't see this movie. Don't spend any money on it. Go home, rent Airplane, laugh your ass off, and ...",Disaster Movie,"Realmente, eu poderia escrever uma crítica contundente sobre esse sanduíche de cocô, mas, em vez disso, vou fazer algumas observações e pontos que deduzi. Não há mais sentido assistir a esses filmes. Algum leitor por aí se lembra do filme de terror? Lembra como era original, com alguns elementos cômicos? Havia palhaçada, algumas frases engraçadas, era uma comédia bastante esquecível, mas valia o preço da entrada. Bem, essa foi a última vez que essa premissa foi engraçada. PARE DE FAZER ESTES FILMES. POR FAVOR, eu poderia pedir um boicote a esses pedaços de macaco, mas todos sabemos que haverá uma fila de buggers irritantes e pré-pubescentes, jorrando uns forros ruins como: ""ISTO É SPARTA!"" e ""IM RICK JAMES BITCH"", para que esses filmes continuem gerando algum ganho monetário, considerando que o valor de produção deste filme parece custar cerca de 10 centavos de dólar. Não gaste dinheiro com isso. 
Vá para casa, alugue a Airplane, ria e julgue silenciosamente as pessoas que estão fal...",really write scathing review turd sandwich instead going making observations points deduced point watching movies anymore reader remember scary movie remember original comedic elements slapstick funny lines pretty forgettable comedy worth price admission well last time premise funny stop making movies please call boycott pieces monkey sh know going line pre pubescent annoying little buggers spouting crappy one liners like sparta im rick james bitch movies continue make form monetary gain considering production value movie looks like cost cents make see movie spend money go home rent airplane laugh ass silently judge people talking movie monday favor
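
The two repeated-sequence helpers in the cell above do different things, which is easy to miss from their names. The sketch below re-declares the same two regular expressions under hypothetical names (`drop_filler_runs`, `collapse_repeats`) and applies them to a made-up sentence in the same order `clean_data` uses. It is only an illustration of the regex behavior, not part of the pipeline.

```python
import re

def drop_filler_runs(text, fillers=("blah", "mario", "la")):
    # Delete a run of a listed filler word entirely ("blah blah blah" -> "")
    pattern = r"\b(" + "|".join(fillers) + r")(\s+\1)+\b"
    return re.sub(pattern, "", text)

def collapse_repeats(text):
    # Collapse a run of any repeated word into a single occurrence
    return re.sub(r"\b(\w+)(\s+\1)+\b", r"\1", text)

# Made-up sample sentence (an assumption, not taken from the dataset)
sample = "movie blah blah blah was very very long"

step1 = drop_filler_runs(sample)      # filler run removed, extra space left behind
step2 = collapse_repeats(step1)       # "very very" -> "very"
cleaned = " ".join(step2.split())     # normalize whitespace, as clean_data does
print(cleaned)
```

Note that a single, unrepeated `blah` would survive both steps, since each pattern requires at least one repetition.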


### 1.2. Filtering and Splitting Data

In [3]:
# Label ratings as positive (1, rating >= 7), negative (0, rating <= 4), or neutral (2), then drop the neutral reviews
imdb['Ratings_Label'] = imdb['Ratings'].apply(lambda x: 1 if x >= 7 else (0 if x <= 4 else 2))

imdb_filtered = imdb[imdb['Ratings_Label'] < 2]
imdb_filtered = imdb_filtered[['Reviews_Clean', 'Ratings_Label']].copy()

imdb_filtered['Ratings_Label'].value_counts()

Ratings_Label
0    60000
1    60000
Name: count, dtype: int64
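
As a quick check of the thresholds, the labeling rule can be applied to a handful of ratings. This is a minimal sketch: the function name `label_rating` and the sample ratings are hypothetical, and only the thresholds come from the cell above.

```python
def label_rating(rating):
    # 1 = positive (>= 7), 0 = negative (<= 4), 2 = neutral (everything in between)
    if rating >= 7:
        return 1
    if rating <= 4:
        return 0
    return 2

# Hypothetical ratings covering both boundaries
ratings = [1.0, 4.0, 5.0, 6.0, 7.0, 10.0]
labels = [label_rating(r) for r in ratings]

# Keep only the binary classes, mirroring the `Ratings_Label < 2` filter
kept = [(r, l) for r, l in zip(ratings, labels) if l < 2]
print(labels)
print(kept)
```

Ratings of 5 and 6 receive label 2 and are dropped, which leaves a binary classification problem.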

In [4]:
# Define a tokenizer class that lemmatizes words in the text
class LemmatizerTokenizer(object):
    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()
    def __call__(self, text):
        return [self.lemmatizer.lemmatize(word) for word in word_tokenize(text)]


# Split the data into training and testing sets, then display the label counts for both sets
train_data, test_data = train_test_split(imdb_filtered, stratify=imdb_filtered['Ratings_Label'],
                                         test_size=.3, random_state=42)

X_train = train_data['Reviews_Clean']
y_train = train_data['Ratings_Label']

X_test = test_data['Reviews_Clean']
y_test = test_data['Ratings_Label']

display(y_train.value_counts())
display(y_test.value_counts())

Ratings_Label
0    42000
1    42000
Name: count, dtype: int64

Ratings_Label
1    18000
0    18000
Name: count, dtype: int64
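
The equal class counts above follow from `stratify=`, which splits each class separately so both sets keep the original class ratio. The idea can be sketched in plain Python as follows. This is an illustration under simplifying assumptions, not how `train_test_split` is implemented, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.3, seed=42):
    # Group row indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within the class, then carve off the test share
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy balanced label vector (an assumption for illustration)
labels = [0] * 100 + [1] * 100
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```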

## 2. Training and Evaluating Baseline Models
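
The vectorizer in this section uses `ngram_range=(1, 3)`, so each review contributes single words, word pairs, and word triples as candidate features. A minimal sketch of that expansion is shown here, assuming whitespace-separated tokens. `word_ngrams` is a hypothetical helper and does not reproduce the lemmatization, `min_df`, or `max_features` filtering that `TfidfVectorizer` applies.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    # Slide a window of each size n over the token list
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# Made-up cleaned review fragment
tokens = "not worth watching".split()
grams = word_ngrams(tokens)
print(grams)
```

Keeping `not` out of the stop word list matters here: the bigram `not worth` carries the opposite sentiment of `worth` alone.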

In [6]:
# Vectorize text data using TfidfVectorizer with lemmatization and 1- to 3-grams
tfidf_vect_n = TfidfVectorizer(tokenizer=LemmatizerTokenizer(), max_features=5000, ngram_range=(1,3), min_df=10)

X_train_tfidf = tfidf_vect_n.fit_transform(X_train).toarray()
X_test_tfidf = tfidf_vect_n.transform(X_test).toarray()


# Evaluate model performance on training and testing sets, returning accuracy, AUC, and F1 scores
def evaluate_model(model, model_name="Model"):

    train_accuracy = accuracy_score(y_train, model.predict(X_train_tfidf))
    test_accuracy = accuracy_score(y_test, model.predict(X_test_tfidf))

    # train_precision = precision_score(y_train, model.predict(X_train_tfidf))
    # test_precision = precision_score(y_test, model.predict(X_test_tfidf))

    # train_recall = recall_score(y_train, model.predict(X_train_tfidf))
    # test_recall = recall_score(y_test, model.predict(X_test_tfidf))

    train_auc = roc_auc_score(y_train, model.predict_proba(X_train_tfidf)[:, 1])
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test_tfidf)[:, 1])

    train_f1 = f1_score(y_train, model.predict(X_train_tfidf))
    test_f1 = f1_score(y_test, model.predict(X_test_tfidf))

    metrics_df = pd.DataFrame({f"{model_name}": ["Accuracy", "AUC", "F1"],
                               "Train": [train_accuracy, train_auc, train_f1],
                               "Test": [test_accuracy, test_auc, test_f1]})
    
    metrics_df = metrics_df.set_index(f"{model_name}")
    metrics_df = metrics_df.round(4)
    
    return metrics_df


# Train baseline Logistic Regression model and evaluate its performance
log_model = LogisticRegression()
log_model.fit(X_train_tfidf, y_train)

evaluate_model(log_model, model_name="Baseline Logistic Regression")

Baseline Logistic Regression,Train,Test
Accuracy,0.9073,0.8916
AUC,0.9674,0.9581
F1,0.9075,0.8918
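
For reference, the three metrics reported by `evaluate_model` can be computed by hand on a tiny example. The sketch below uses hypothetical helper names and made-up labels and scores. It follows the standard definitions rather than scikit-learn's internals, and it assumes binary labels with 1 as the positive class and a 0.5 decision threshold.

```python
def accuracy(y_true, y_pred):
    # Share of predictions that match the true label
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred):
    # Harmonic mean of precision and recall for the positive class
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def roc_auc(y_true, scores):
    # Probability that a random positive outscores a random negative (ties count half)
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1]
scores = [0.1, 0.6, 0.4, 0.9]
y_pred = [int(s >= 0.5) for s in scores]

acc_value = accuracy(y_true, y_pred)
f1_value = f1(y_true, y_pred)
auc_value = roc_auc(y_true, scores)
print(acc_value, f1_value, auc_value)
```

Accuracy and F1 depend on the thresholded predictions, while AUC depends only on how the scores rank positives against negatives, which is why `evaluate_model` passes `predict_proba` output to `roc_auc_score`.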


In [7]:
# Train baseline Naive Bayes model and evaluate its performance
nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)

evaluate_model(nb_model, model_name="Baseline Naive Bayes")

Baseline Naive Bayes,Train,Test
Accuracy,0.8721,0.865
AUC,0.9453,0.9407
F1,0.8739,0.8671


In [8]:
# Train baseline Random Forest model and evaluate its performance
rf_model = RandomForestClassifier()
rf_model.fit(X_train_tfidf, y_train)

evaluate_model(rf_model, model_name="Baseline Random Forest")

Baseline Random Forest,Train,Test
Accuracy,0.9999,0.8553
AUC,1.0,0.9283
F1,0.9999,0.8555


In [9]:
# Train baseline Ada Boost model and evaluate its performance
ab_model = AdaBoostClassifier()
ab_model.fit(X_train_tfidf, y_train)

evaluate_model(ab_model, model_name="Baseline Ada Boost")

Baseline Ada Boost,Train,Test
Accuracy,0.7957,0.7953
AUC,0.8847,0.8845
F1,0.8044,0.8046
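
One way to read the baseline outputs is through the gap between train and test accuracy, a rough indicator of overfitting. The sketch below uses the accuracy values printed in the cells above. The variable names are hypothetical.

```python
# (train accuracy, test accuracy) as printed in the baseline cell outputs above
baseline_accuracy = {
    "Logistic Regression": (0.9073, 0.8916),
    "Naive Bayes": (0.8721, 0.8650),
    "Random Forest": (0.9999, 0.8553),
    "Ada Boost": (0.7957, 0.7953),
}

# Train-minus-test gap per model, rounded to match the 4-decimal reporting
gaps = {name: round(train - test, 4) for name, (train, test) in baseline_accuracy.items()}
largest = max(gaps, key=gaps.get)

for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {gap:.4f}")
```

The default Random Forest nearly memorizes the training set yet has only the third-highest test accuracy, which motivates the depth and leaf-size constraints explored in the tuning section.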


## 3. Tuning Hyperparameters and Assessing Performance

In [10]:
# Tune hyperparameters using GridSearchCV and return the fitted search object and the best estimator's parameters
def tune_hyperparameters(estimator, param_grid, scoring_metric, cross_validation, verbosity=0):
    
    grid_search = model_selection.GridSearchCV(estimator=estimator, scoring=scoring_metric, param_grid=param_grid,
                                               verbose=verbosity, cv=cross_validation)
    grid_search.fit(X_train_tfidf, y_train)

    print(f"Best Score: {grid_search.best_score_.round(4)}")
    
    print("Best Hyperparameters:")
    best_params = grid_search.best_estimator_.get_params()
    for param_name in sorted(param_grid.keys()):
        print(f"• {param_name}: {best_params[param_name]}")
    print()
        
    return grid_search, best_params


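# # Tune Logistic Regression model hyperparameters
# # Note: not every penalty/solver pair is valid (liblinear supports only 'l1' and 'l2', and 'elasticnet'
# # also needs l1_ratio); GridSearchCV scores failed fits as NaN by default
# log_param_grid = {'penalty': ['l1', 'l2', 'elasticnet', None],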
#                   'solver': ['liblinear', 'saga'],
#                   'C': [0.01, 0.1, 1, 10, 100],
#                   'max_iter': [100, 200, 500, 1000],
#                   'class_weight': ['balanced', None]}

# opt_log_model, opt_log_params = tune_hyperparameters(LogisticRegression(), param_grid=log_param_grid,
#                                                      scoring_metric="accuracy", cross_validation=5, verbosity=10)


# Train optimized Logistic Regression model and evaluate its performance
opt_log_model = LogisticRegression(C=1, max_iter=200, class_weight='balanced', solver='saga', penalty='l1')
opt_log_model.fit(X_train_tfidf, y_train)

evaluate_model(opt_log_model, model_name="Optimized Logistic Regression")

Optimized Logistic Regression,Train,Test
Accuracy,0.9079,0.8926
AUC,0.9676,0.958
F1,0.9082,0.8929


In [11]:
# # Tune Naive Bayes model hyperparameters
# nb_param_grid = {'alpha': [0.1, 0.5, 1.0, 2.0],
#                  'fit_prior': [True, False]}

# opt_nb_model, opt_nb_params = tune_hyperparameters(MultinomialNB(), param_grid=nb_param_grid,
#                                                    scoring_metric="accuracy", cross_validation=5, verbosity=10)


# Train optimized Naive Bayes model and evaluate its performance
opt_nb_model = MultinomialNB(alpha=2.0, fit_prior=True)
opt_nb_model.fit(X_train_tfidf, y_train)

evaluate_model(opt_nb_model, model_name="Optimized Naive Bayes")

Optimized Naive Bayes,Train,Test
Accuracy,0.8719,0.8657
AUC,0.9454,0.9408
F1,0.8737,0.8678


In [12]:
# # Tune Random Forest model hyperparameters
# rf_param_grid = {"n_estimators": [300, 400, 500],
#                  "criterion": ["gini", "entropy"],
#                  "max_depth": [23, 30, 35, None],
#                  "max_features": ["sqrt", "log2"],
#                  "min_samples_leaf": [7, 11, 13],
#                  "min_samples_split": [11, 13, 15]}

# opt_rf_model, opt_rf_params = tune_hyperparameters(RandomForestClassifier(), param_grid=rf_param_grid,
#                                                    scoring_metric="accuracy", cross_validation=5, verbosity=10)


# Train optimized Random Forest model and evaluate its performance
opt_rf_model = RandomForestClassifier(criterion='gini', max_features='log2', n_estimators=400,
                                      min_samples_leaf=7, min_samples_split=13, max_depth=35)
opt_rf_model.fit(X_train_tfidf, y_train)

evaluate_model(opt_rf_model, model_name="Optimized Random Forest")

Optimized Random Forest,Train,Test
Accuracy,0.8988,0.8655
AUC,0.9634,0.9365
F1,0.899,0.8667


In [13]:
# # Tune Ada Boost model hyperparameters
# ab_param_grid = {"estimator": [DecisionTreeClassifier(max_depth=1),
#                                DecisionTreeClassifier(max_depth=4),
#                                DecisionTreeClassifier(max_depth=8),
#                                DecisionTreeClassifier(max_depth=10)],
#                  "learning_rate": [0.001, 0.01, 0.1, 0.5, 0.8, 1],
#                  "n_estimators": [50, 100, 200, 300],
#                  "algorithm": ["SAMME"]}

# opt_ab_model, opt_ab_params = tune_hyperparameters(AdaBoostClassifier(), param_grid=ab_param_grid,
#                                                    scoring_metric="accuracy", cross_validation=5, verbosity=10)


# Train optimized Ada Boost model and evaluate its performance
opt_ab_model = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=300, learning_rate=1, algorithm='SAMME')
opt_ab_model.fit(X_train_tfidf, y_train)

evaluate_model(opt_ab_model, model_name="Optimized Ada Boost")

Optimized Ada Boost,Train,Test
Accuracy,0.8214,0.8211
AUC,0.9041,0.9034
F1,0.8235,0.8235
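
The grid searches in this section are commented out, presumably because of their cost. A rough estimate of that cost is the number of parameter combinations times the number of cross-validation folds. The sketch below counts the candidate values in each commented-out grid (the counts are read off the cells above) and assumes `cross_validation=5` as in those calls. It ignores the final refit on the full training set and says nothing about how long each individual fit takes.

```python
from math import prod

# Number of candidate values per hyperparameter, counted from the
# commented-out grids in this section
grid_sizes = {
    "Logistic Regression": {"penalty": 4, "solver": 2, "C": 5, "max_iter": 4, "class_weight": 2},
    "Naive Bayes": {"alpha": 4, "fit_prior": 2},
    "Random Forest": {"n_estimators": 3, "criterion": 2, "max_depth": 4,
                      "max_features": 2, "min_samples_leaf": 3, "min_samples_split": 3},
    "Ada Boost": {"estimator": 4, "learning_rate": 6, "n_estimators": 4, "algorithm": 1},
}

cv_folds = 5
fit_counts = {}
for model_name, sizes in grid_sizes.items():
    # An exhaustive grid search tries every combination once per fold
    n_candidates = prod(sizes.values())
    fit_counts[model_name] = (n_candidates, n_candidates * cv_folds)
    print(f"{model_name}: {n_candidates} candidates, {n_candidates * cv_folds} fits")
```

With 84,000 training rows and a dense 5,000-column TF-IDF matrix, the Random Forest grid alone implies over two thousand forest fits, so hard-coding the selected parameters after one search is a practical choice.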


## 4. Displaying Evaluation Metrics for Baseline and Optimized Models

In [14]:
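# Create a DataFrame to display evaluation metrics for baseline models
# Note: the Random Forest test scores below differ slightly from the Section 2 cell output
# (0.8553, 0.9283, 0.8555), likely because they were recorded from a different run;
# the model is fit without a fixed random_state, so its scores vary between runs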
metrics = [
    {'Baseline Model': 'Log Regression', 'Metric': 'Accuracy', 'Train': 0.9073, 'Test': 0.8916},
    {'Baseline Model': 'Log Regression', 'Metric': 'AUC', 'Train': 0.9674, 'Test': 0.9581},
    {'Baseline Model': 'Log Regression', 'Metric': 'F1', 'Train': 0.9075, 'Test': 0.8918},
    {'Baseline Model': 'Naive Bayes', 'Metric': 'Accuracy', 'Train': 0.8721, 'Test': 0.8650},
    {'Baseline Model': 'Naive Bayes', 'Metric': 'AUC', 'Train': 0.9453, 'Test': 0.9407},
    {'Baseline Model': 'Naive Bayes', 'Metric': 'F1', 'Train': 0.8739, 'Test': 0.8671},
    {'Baseline Model': 'Random Forest', 'Metric': 'Accuracy', 'Train': 0.9999, 'Test': 0.8528},
    {'Baseline Model': 'Random Forest', 'Metric': 'AUC', 'Train': 1.0000, 'Test': 0.9276},
    {'Baseline Model': 'Random Forest', 'Metric': 'F1', 'Train': 0.9999, 'Test': 0.8527},
    {'Baseline Model': 'Ada Boost', 'Metric': 'Accuracy', 'Train': 0.7957, 'Test': 0.7953},
    {'Baseline Model': 'Ada Boost', 'Metric': 'AUC', 'Train': 0.8847, 'Test': 0.8845},
    {'Baseline Model': 'Ada Boost', 'Metric': 'F1', 'Train': 0.8044, 'Test': 0.8046}
]

metrics_df = pd.DataFrame(metrics)
metrics_df.set_index(["Baseline Model", "Metric"], inplace=True)
metrics_df

Baseline Model,Metric,Train,Test
Log Regression,Accuracy,0.9073,0.8916
Log Regression,AUC,0.9674,0.9581
Log Regression,F1,0.9075,0.8918
Naive Bayes,Accuracy,0.8721,0.865
Naive Bayes,AUC,0.9453,0.9407
Naive Bayes,F1,0.8739,0.8671
Random Forest,Accuracy,0.9999,0.8528
Random Forest,AUC,1.0,0.9276
Random Forest,F1,0.9999,0.8527
Ada Boost,Accuracy,0.7957,0.7953
Ada Boost,AUC,0.8847,0.8845
Ada Boost,F1,0.8044,0.8046


In [15]:
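# Create a DataFrame to display evaluation metrics for optimized models
# Note: the Random Forest scores below differ slightly from the Section 3 cell output
# (train 0.8988, 0.9634, 0.899; test 0.8655, 0.9365, 0.8667), likely because they were
# recorded from a different run; the model is fit without a fixed random_state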
metrics = [
    {'Optimized Model': 'Log Regression', 'Metric': 'Accuracy', 'Train': 0.9079, 'Test': 0.8926},
    {'Optimized Model': 'Log Regression', 'Metric': 'AUC', 'Train': 0.9676, 'Test': 0.9580},
    {'Optimized Model': 'Log Regression', 'Metric': 'F1', 'Train': 0.9082, 'Test': 0.8929},
    {'Optimized Model': 'Naive Bayes', 'Metric': 'Accuracy', 'Train': 0.8719, 'Test': 0.8657},
    {'Optimized Model': 'Naive Bayes', 'Metric': 'AUC', 'Train': 0.9454, 'Test': 0.9408},
    {'Optimized Model': 'Naive Bayes', 'Metric': 'F1', 'Train': 0.8737, 'Test': 0.8678},
    {'Optimized Model': 'Random Forest', 'Metric': 'Accuracy', 'Train': 0.8994, 'Test': 0.8648},
    {'Optimized Model': 'Random Forest', 'Metric': 'AUC', 'Train': 0.9643, 'Test': 0.9369},
    {'Optimized Model': 'Random Forest', 'Metric': 'F1', 'Train': 0.8992, 'Test': 0.8659},
    {'Optimized Model': 'Ada Boost', 'Metric': 'Accuracy', 'Train': 0.8214, 'Test': 0.8211},
    {'Optimized Model': 'Ada Boost', 'Metric': 'AUC', 'Train': 0.9041, 'Test': 0.9034},
    {'Optimized Model': 'Ada Boost', 'Metric': 'F1', 'Train': 0.8235, 'Test': 0.8235}
]

metrics_df = pd.DataFrame(metrics)
metrics_df.set_index(["Optimized Model", "Metric"], inplace=True)
metrics_df

Optimized Model,Metric,Train,Test
Log Regression,Accuracy,0.9079,0.8926
Log Regression,AUC,0.9676,0.958
Log Regression,F1,0.9082,0.8929
Naive Bayes,Accuracy,0.8719,0.8657
Naive Bayes,AUC,0.9454,0.9408
Naive Bayes,F1,0.8737,0.8678
Random Forest,Accuracy,0.8994,0.8648
Random Forest,AUC,0.9643,0.9369
Random Forest,F1,0.8992,0.8659
Ada Boost,Accuracy,0.8214,0.8211
Ada Boost,AUC,0.9041,0.9034
Ada Boost,F1,0.8235,0.8235
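
To compare the two summary tables directly, the test-accuracy change from baseline to optimized can be computed per model. The sketch below copies the test accuracies from the hard-coded metrics in the two cells above. The variable names are hypothetical.

```python
# (baseline test accuracy, optimized test accuracy) from the two summary cells
test_accuracy = {
    "Log Regression": (0.8916, 0.8926),
    "Naive Bayes": (0.8650, 0.8657),
    "Random Forest": (0.8528, 0.8648),
    "Ada Boost": (0.7953, 0.8211),
}

# Improvement from tuning, rounded to the 4 decimals used in the tables
deltas = {name: round(opt - base, 4) for name, (base, opt) in test_accuracy.items()}
most_improved = max(deltas, key=deltas.get)
best_overall = max(test_accuracy, key=lambda name: test_accuracy[name][1])

for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f}")
print("Most improved:", most_improved)
print("Best optimized test accuracy:", best_overall)
```

On these numbers, tuning helps Ada Boost the most, while Logistic Regression keeps the highest test accuracy despite gaining almost nothing from tuning.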
