In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import pickle

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet, stopwords
from nltk.tokenize import sent_tokenize, word_tokenize, regexp_tokenize, RegexpTokenizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, ExtraTreesClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV, RandomizedSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, roc_auc_score, RocCurveDisplay, ConfusionMatrixDisplay, confusion_matrix

In [2]:
train = pd.read_csv('../data/reddit_train.csv')
train.head()

Unnamed: 0,subreddit,full_text
0,1,i want to sell my new car so i can afford to m...
1,1,disputing a medical bill i am a healthy indivi...
2,1,best credit card for travel i’m getting marrie...
3,1,anyone using laurel road? division of key bank...
4,1,signed up for personal advisor service with va...


## Model preambles

Every model in this notebook is tuned with a cross-validated randomized search, which samples hyperparameter combinations from predefined ranges and keeps the best-scoring configuration for each model. The following cells set up the tuning parameters for each model used later in this notebook. A `KFold` cross-validation splitter is also initialized and shared by all the searches.

Before running the models, make sure the helper functions, including `nlp_random_search_modeler`, can be imported from the `nlp_functions.py` file located in the current directory.

In [3]:
from nlp_functions import nlp_random_search_modeler, lemmatize_text, stem_text, custom_lemmatize, pos_lemmatizer
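
The source of `nlp_random_search_modeler` is not shown in this notebook. As a rough illustration of what a helper of this kind might do, the sketch below wraps a vectorizer and a classifier in a `Pipeline` and hands it to `RandomizedSearchCV`. The function name `random_search_sketch`, the step names `vec` and `model`, the `n_iter` default, and the toy data are all assumptions made for this example and may differ from the real helper.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def random_search_sketch(model, vectorizer, cross_validation,
                         vectorizer_params=None, model_params=None,
                         n_iter=3, random_state=42):
    """Hypothetical stand-in for nlp_random_search_modeler, not the real helper."""
    pipe = Pipeline([("vec", vectorizer), ("model", model)])
    # Prefix every parameter with its pipeline step name ("vec__", "model__")
    # so the search can route it to the right component.
    distributions = {}
    for step, params in (("vec", vectorizer_params), ("model", model_params)):
        for name, values in (params or {}).items():
            distributions[f"{step}__{name}"] = values
    return RandomizedSearchCV(pipe, distributions, n_iter=n_iter,
                              cv=cross_validation, random_state=random_state)


# Toy data with two arbitrary classes, for illustration only.
texts = [
    "how do i pay off my credit card debt",
    "best savings account for an emergency fund",
    "should i refinance my car loan this year",
    "disputing a medical bill with my insurer",
    "my cat keeps knocking plants off the shelf",
    "looking for a good hiking trail near the lake",
    "what is your favorite pasta recipe",
    "our dog learned a new trick today",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

search = random_search_sketch(
    MultinomialNB(),
    CountVectorizer(),
    cross_validation=KFold(n_splits=2, shuffle=True, random_state=42),
    vectorizer_params={"ngram_range": [(1, 1), (1, 2)], "binary": [False, True]},
    model_params={"alpha": [0.5, 1.0]},
)
search.fit(texts, labels)
print(search.best_params_)
```

Because the fitted search object exposes `score`, `predict`, and `predict_proba` through its best pipeline, it can be used the same way the fitted models are used in the performance cells of this notebook.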

In [4]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)

#### Hyperparameters for the Word Vectorizers

In [5]:
vec_params = {        
    'preprocessor': [None, lemmatize_text, stem_text, pos_lemmatizer], 
    'max_features':[None, 2250, 2500, 2750, 5750, 6000, 6250, 7750, 8000, 8250, 8500, 9000],
    'stop_words': [None, stopwords.words('english')],
    'min_df': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'max_df': [0.8, .825, .85, 0.875, 0.9, 0.925, .95],
    'ngram_range':[(1,1), (1,2)]
}

#### Hyperparameters for the Logistic Regression

In [6]:
logreg_params = {
    # In trial runs the randomized search consistently picked the default l2 penalty.
    # Note: the liblinear solver used for this model does not support penalty=None,
    # so candidates with None cannot be fitted.
    'penalty':[None, 'l1', 'l2'],
    'C': np.linspace(0.08, 0.15, 20), # trial searches suggested the best regularization strength lies in this range
}

#### Hyperparameters for the Random Forest Classifier

In [7]:
rf_params = {
    'n_estimators': [100, 150, 175, 200, 225, 250],
    'max_depth': [None, 65, 70, 75, 80, 85, 90, 95, 100],
    'min_samples_split': np.arange(6, 11, 1)
}

#### Hyperparameters for the Support Vector Machine Classifier

In [8]:
#https://numpy.org/doc/stable/reference/generated/numpy.append.html
svc_params={'C': np.append([0.1, 0.5, 1], np.linspace(2, 4, 20)),
            'kernel': ['rbf','poly', 'linear'], 
            'degree' : [2,3,4]
           }

#### Hyperparameters for the Gradient Boosting Classifier

In [9]:
gb_params = {
    'learning_rate': [0.1, 0.5, 1, 1.025, 1.05, 1.075, 1.1],
    'n_estimators': [100, 125, 150, 170, 175, 180, 185],
    'max_features': [None, 'sqrt', 'log2'], # 'sqrt' was consistently the search's preferred choice in trial runs
    'max_depth': np.append(None, np.arange(1, 21, 2))
}

### Training and Validation Split
This section separates the target from the text and splits the data into training (80%) and validation (20%) sets.

In [10]:
y = train['subreddit']
X = train['full_text']

In [11]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

## Models

### Baseline Model

The majority class (label 1) makes up 62.5% of the sample, so a model that always predicts it would reach a baseline accuracy of 62.5%. The subsequent models aim to beat this baseline by as wide a margin as possible.

In [12]:
y.value_counts(normalize=True)

1    0.625264
0    0.374736
Name: subreddit, dtype: float64
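
As a small illustration of how a majority-class baseline is derived, the sketch below computes it for a toy label list. The helper name `majority_baseline` and the toy labels are hypothetical; the labels are chosen only so that the class shares mirror the 62.5% / 37.5% split reported above.

```python
import pandas as pd


def majority_baseline(labels):
    """Return the most frequent class and the accuracy of always predicting it."""
    shares = pd.Series(labels).value_counts(normalize=True)
    return shares.idxmax(), float(shares.max())


# Toy labels: five of eight belong to class 1.
toy_labels = [1, 1, 1, 1, 1, 0, 0, 0]
majority_class, baseline_accuracy = majority_baseline(toy_labels)
print(majority_class, baseline_accuracy)  # 1 0.625
```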

### Model Setup
The `nlp_random_search_modeler` function combines the text vectorizer parameters with the model parameters defined in the model preambles, so that a randomized search can cover a wide range of hyperparameters. The classification models considered in this search are **Naive Bayes, Logistic Regression, Random Forest, Support Vector Classifier, and Gradient Boosting Classifier.**

The hyperparameter ranges were narrowed by the author over multiple iterations, both to home in on the best-performing region and to reduce the search time.
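
To give a sense of why a randomized search is used rather than an exhaustive one, the sketch below counts the combinations in a grid shaped like the vectorizer parameters and then draws a handful of candidates from it. The string values for `preprocessor` and `stop_words` are placeholders standing in for the real callables and word list, and `n_iter=10` is an arbitrary choice for this example, not the value used by the notebook's helper.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same shape as the vectorizer search space, with placeholder values
# where the notebook uses callables or a stop-word list.
toy_vec_params = {
    "preprocessor": [None, "lemmatize", "stem", "pos_lemmatize"],
    "max_features": [None, 2250, 2500, 2750, 5750, 6000, 6250, 7750, 8000, 8250, 8500, 9000],
    "stop_words": [None, "english"],
    "min_df": list(range(1, 11)),
    "max_df": [0.8, 0.825, 0.85, 0.875, 0.9, 0.925, 0.95],
    "ngram_range": [(1, 1), (1, 2)],
}

# An exhaustive grid search would have to fit every combination.
n_combinations = len(ParameterGrid(toy_vec_params))

# A randomized search only fits a fixed number of sampled candidates.
candidates = list(ParameterSampler(toy_vec_params, n_iter=10, random_state=42))

print(n_combinations)   # 13440
print(len(candidates))  # 10
```

The count covers the vectorizer settings alone; multiplying in each model's own parameters makes the full space larger still, which is why narrowing the ranges matters for search time.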

In [13]:
model_dict = {'nb': {'mod_instant': MultinomialNB(), 'mod_param':None},
              'logreg': {'mod_instant': LogisticRegression(solver='liblinear', max_iter= 5000), 'mod_param':logreg_params},
              'rf': {'mod_instant': RandomForestClassifier(random_state= 42), 'mod_param':rf_params},
              'svc':{'mod_instant': SVC(probability=True), 'mod_param': svc_params},
              'gb': {'mod_instant': GradientBoostingClassifier(random_state= 42), 'mod_param':gb_params}           
             }

In [14]:
import warnings
warnings.filterwarnings("ignore")

The next two code cells run a randomized search over all the models specified above, once with `CountVectorizer()` and once with `TfidfVectorizer()`. Each search trains and evaluates the models on different hyperparameter combinations using cross-validation.

In [15]:
%%time 
models_cvec = {}

# Iterate over each key-value pair in the model_dict dictionary
for key, dic in model_dict.items():
    # Build a randomized search with nlp_random_search_modeler for the given model,
    # CountVectorizer, cross-validation strategy, vectorizer parameters, and model parameters
    nlp = nlp_random_search_modeler(dic['mod_instant'], 
                                    CountVectorizer(), 
                                    cross_validation=kf, 
                                    vectorizer_params=vec_params, 
                                    model_params=dic['mod_param'])
    
    # Fit the nlp_random_search_modeler object to the training data
    nlp.fit(X_train, y_train)
    
    # Assign the fitted model to the models_cvec dictionary using the corresponding key
    models_cvec[key] = nlp

CPU times: user 4h 15min 22s, sys: 1min 49s, total: 4h 17min 11s
Wall time: 4h 18min 59s


In [16]:
%%time 
models_tvec = {}
for key, dic in model_dict.items():
    nlp = nlp_random_search_modeler(dic['mod_instant'], 
                                    TfidfVectorizer(), 
                                    cross_validation=kf, 
                                    vectorizer_params=vec_params, 
                                    model_params=dic['mod_param'])
    nlp.fit(X_train, y_train)
    models_tvec[key] = nlp

CPU times: user 5h 13min 26s, sys: 2min 36s, total: 5h 16min 3s
Wall time: 5h 18min 20s


## Models Performance

### Performance of the models with CountVectorizer as text processor

In [17]:
for model in ['nb', 'logreg', 'rf', 'svc', 'gb']:

    print(f"{model} model - Train Accuracy:", models_cvec[model].score(X_train, y_train))
    print(f"{model} model - Validation Accuracy:", models_cvec[model].score(X_val, y_val))
    print(f"{model} model - Validation AUC:", roc_auc_score(y_val, models_cvec[model].predict_proba(X_val)[:, 1]))
    print('\n')
    print(classification_report(y_val, models_cvec[model].predict(X_val)))
    print('\n')

nb model - Train Accuracy: 0.9204545454545454
nb model - Validation Accuracy: 0.9027484143763214
nb model - Validation AUC: 0.949950624855134


              precision    recall  f1-score   support

           0       0.89      0.84      0.86       688
           1       0.91      0.94      0.92      1204

    accuracy                           0.90      1892
   macro avg       0.90      0.89      0.89      1892
weighted avg       0.90      0.90      0.90      1892



logreg model - Train Accuracy: 0.9788583509513742
logreg model - Validation Accuracy: 0.9038054968287527
logreg model - Validation AUC: 0.9540316194081743


              precision    recall  f1-score   support

           0       0.86      0.87      0.87       688
           1       0.93      0.92      0.92      1204

    accuracy                           0.90      1892
   macro avg       0.90      0.90      0.90      1892
weighted avg       0.90      0.90      0.90      1892



rf model - Train Accuracy: 0.995110993657

### Performance of the models with TfidfVectorizer as text processor

In [18]:
for model in ['nb', 'logreg', 'rf', 'svc', 'gb']:
    
    print(f"{model} model - Train Accuracy score:", models_tvec[model].score(X_train, y_train))
    print(f"{model} model - Validation Accuracy score:", models_tvec[model].score(X_val, y_val))
    print(f"{model} model - Validation AUC:", roc_auc_score(y_val, models_tvec[model].predict_proba(X_val)[:, 1]))
    print('\n')
    print(classification_report(y_val, models_tvec[model].predict(X_val)))
    print('\n')

nb model - Train Accuracy score: 0.9187367864693446
nb model - Validation Accuracy score: 0.8969344608879493
nb model - Validation AUC: 0.9595860214015297


              precision    recall  f1-score   support

           0       0.90      0.81      0.85       688
           1       0.90      0.95      0.92      1204

    accuracy                           0.90      1892
   macro avg       0.90      0.88      0.89      1892
weighted avg       0.90      0.90      0.90      1892



logreg model - Train Accuracy score: 0.9141120507399577
logreg model - Validation Accuracy score: 0.8990486257928119
logreg model - Validation AUC: 0.9543491172834736


              precision    recall  f1-score   support

           0       0.90      0.82      0.85       688
           1       0.90      0.95      0.92      1204

    accuracy                           0.90      1892
   macro avg       0.90      0.88      0.89      1892
weighted avg       0.90      0.90      0.90      1892



rf model - Train

Of all the models evaluated, the Support Vector Classifier (SVC) paired with `TfidfVectorizer` and its tuned hyperparameters performed best. It is therefore saved for the final evaluation on the test data.
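
When comparing many fitted models, collecting the metrics into one table can be easier to read than scanning printed reports. The sketch below shows one way to do that. The helper `summarize_models`, the toy data, and the choice to sort by validation AUC are assumptions for illustration; the notebook does not state which metric drove the final selection.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def summarize_models(models, X_train, y_train, X_val, y_val):
    """Collect train/validation accuracy and validation AUC for fitted models."""
    rows = []
    for name, model in models.items():
        rows.append({
            "model": name,
            "train_accuracy": model.score(X_train, y_train),
            "val_accuracy": model.score(X_val, y_val),
            "val_auc": roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]),
        })
    return pd.DataFrame(rows).set_index("model").sort_values("val_auc", ascending=False)


# Toy data with two arbitrary classes, for illustration only.
toy_X_train = [
    "pay off credit card debt", "best savings account rates",
    "refinance my car loan", "disputing a medical bill",
    "my cat knocked over a plant", "good hiking trail near the lake",
    "favorite pasta recipe", "our dog learned a trick",
]
toy_y_train = [1, 1, 1, 1, 0, 0, 0, 0]
toy_X_val = ["credit card for travel", "loan for a new car",
             "recipe for my dog", "hiking with my cat"]
toy_y_val = [1, 1, 0, 0]

toy_models = {
    "nb": make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(toy_X_train, toy_y_train),
    "logreg": make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(toy_X_train, toy_y_train),
}
summary = summarize_models(toy_models, toy_X_train, toy_y_train, toy_X_val, toy_y_val)
print(summary)
```

With the notebook's own objects, the same helper could be called as `summarize_models(models_tvec, X_train, y_train, X_val, y_val)`, assuming every fitted object exposes `score` and `predict_proba` as the performance cells above suggest.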

In [19]:
best_model = {
    'mod_instant': model_dict['svc']['mod_instant'], 
    'model': models_tvec['svc'],
    'vectorizer': TfidfVectorizer()
             }

with open('../data/best_model.pkl', 'wb') as f:
    pickle.dump(best_model, f)
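
A saved model can later be restored with `pickle.load`. The sketch below demonstrates the round trip on a toy pipeline written to a temporary directory, so it does not touch `../data/best_model.pkl`. The toy data and the linear-kernel `SVC` are assumptions for this example only. One point worth keeping in mind: pickle stores custom functions by reference, so if the saved search uses a preprocessor from `nlp_functions.py`, that module would need to be importable wherever the file is loaded.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy data with two arbitrary classes, for illustration only.
toy_texts = [
    "pay off credit card debt", "best savings account rates",
    "refinance my car loan", "disputing a medical bill",
    "my cat knocked over a plant", "good hiking trail near the lake",
    "favorite pasta recipe", "our dog learned a trick",
]
toy_targets = [1, 1, 1, 1, 0, 0, 0, 0]
toy_model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear")).fit(toy_texts, toy_targets)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "best_model.pkl")
    # Save the fitted model inside a dictionary, mirroring the cell above.
    with open(path, "wb") as f:
        pickle.dump({"model": toy_model}, f)
    # Load it back as the final evaluation step would.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same_predictions = bool((restored["model"].predict(toy_texts) == toy_model.predict(toy_texts)).all())
print(same_predictions)
```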