# NLP and GridSearch

This section shows how the data was cleaned (links removed, tokenized, lemmatized) and how the best model was chosen. The models tested are Logistic Regression and Multinomial Naive Bayes, each paired with either CountVectorizer or TfidfVectorizer to compare performance on this dataset.

## !! The last time this was run, the grid searches took a little over an hour in total !!

**It would be best to run this, watch an episode of The Witcher or something, and then come back to it.**

In [46]:
import pandas as pd
import numpy as np 
import re
import time 

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer



In [2]:
# define function for calculating run time of GridSearching 
# some of these really took a while 

def run(start, end): 
    # round to whole seconds first so the result never reads e.g. "1m 60s"
    total_seconds = int(round(end - start))
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}m {seconds}s"

### "Cleaning"

Many Reddit posts contain special characters and emojis, along with some links. These are two very text-heavy subreddits, so there should not be much pollution from external links or images. To remove any links that are present, I target whitespace-separated tokens that contain "www.", "http://", or "https://" and remove only those tokens, keeping the rest of the string. The data is also lemmatized at this stage. "Cleaning" is in quotes here because some of the modeling below involves either including or excluding stop words, which could be considered another form of cleaning.

In [3]:
df = pd.read_csv("./datasets/combined_raw.csv")

In [4]:
# fill null values with " "
# not filling with symbol (since it will be scrubbed later), or word since this could pollute the data 
df.fillna(" ", inplace = True)

In [5]:
# first, get rid of links
# target tokens that contain "www.", "http://", or "https://"

# scrubbing links from titles
link_markers = ("www.", "http://", "https://")

def scrub_title(title): 
    # convert title to lowercase and split into individual words
    title_tokens = title.lower().split()
    
    # build a new list instead of removing while looping,
    # which would skip the token right after each removed link
    kept = [token for token in title_tokens
            if not any(marker in token for marker in link_markers)]
    
    # combine processed words back into one string
    return " ".join(kept)

# replace each title with its processed version
df["title"] = df["title"].apply(scrub_title)

In [6]:
# same thing is done for body text of posts 
link_markers = ("www.", "http://", "https://")

def scrub_body(body): 
    # lowercase, split into words, and keep only the tokens that are not links
    body_tokens = body.lower().split()
    kept = [token for token in body_tokens
            if not any(marker in token for marker in link_markers)]
    return " ".join(kept)

df["selftext"] = df["selftext"].apply(scrub_body)
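To see what the link scrubbing does, here is a minimal sketch on made-up strings. The helper name `drop_link_tokens` and the sample posts are hypothetical, not part of the dataset. The second sample has two links in a row, a case that is easy to get wrong when removing items from a list while looping over it.

```python
LINK_MARKERS = ("www.", "http://", "https://")

def drop_link_tokens(text):
    """Lowercase the text and drop any whitespace-separated token containing a link marker."""
    tokens = text.lower().split()
    kept = [tok for tok in tokens if not any(marker in tok for marker in LINK_MARKERS)]
    return " ".join(kept)

# made-up examples, not taken from the dataset
samples = [
    "Check THIS out https://example.com/a and www.example.org please",
    "see http://a.example http://b.example now",
]
cleaned = [drop_link_tokens(s) for s in samples]
print(cleaned)
# ['check this out and please', 'see now']
```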

In [7]:
# next, remove special characters: \w+ keeps only runs of letters, digits, and underscores
# combine this step with lemmatizing 
# (WordNetLemmatizer needs the WordNet corpus; run nltk.download("wordnet") once if it is missing)

tokenizer = RegexpTokenizer(r"\w+")

lemmatizer = WordNetLemmatizer()

def clean_and_lemmatize(text): 
    words = tokenizer.tokenize(text)
    lemmatized = [lemmatizer.lemmatize(word) for word in words]
    return " ".join(lemmatized)

df["selftext"] = df["selftext"].apply(clean_and_lemmatize)

In [8]:
# same tokenize + lemmatize step for titles, reusing the tokenizer and lemmatizer defined above
df["title"] = df["title"].apply(
    lambda title: " ".join(lemmatizer.lemmatize(word) for word in tokenizer.tokenize(title))
)
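The `\w+` pattern decides what survives this step. As a rough illustration, the sketch below applies the standard library's `re.findall` to a made-up sentence (an assumption here is that `RegexpTokenizer` with this pattern returns the same matches). Punctuation and emojis are dropped, digits and underscores are kept, and contractions are split at the apostrophe.

```python
import re

# made-up sentence, not taken from the dataset
sample = "don't panic: it's 3am_ok! 🙂"

# every maximal run of word characters becomes one token
tokens = re.findall(r"\w+", sample)
print(tokens)
# ['don', 't', 'panic', 'it', 's', '3am_ok']
```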

After some brainstorming with our Merciful Overlord Charlie, combining the text from both the title and selftext into one column seemed like a great way to organize the text for modeling. I don't really want to go back and redo the cleaning from the beginning for a single column right now, so I'm just going to create a new column from the two cleaned columns. I'll fix this when I have more time for fine-tuning.

In [9]:
df["text"] = df["title"] + " " + df["selftext"]


In [10]:
# out of curiosity let's CountVectorize the text column 

cvec = CountVectorizer()

curiosity = cvec.fit_transform(df["text"])

curiosity.shape

(2069, 14470)

In [11]:
# holy moly, 14,470 distinct words and that is without including word combinations! 
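As a small sketch of why word combinations would push the column count even higher, the made-up three-document corpus below is vectorized once with single words and once with single words plus word pairs. The corpus is hypothetical. Note that CountVectorizer's default token pattern ignores one-character tokens such as "a".

```python
from sklearn.feature_extraction.text import CountVectorizer

# made-up corpus, not taken from the dataset
toy_docs = ["the cat sat", "the cat ran", "a dog ran"]

# single words only (the default ngram_range of (1, 1))
unigram_shape = CountVectorizer().fit_transform(toy_docs).shape

# single words plus adjacent word pairs
bigram_shape = CountVectorizer(ngram_range=(1, 2)).fit_transform(toy_docs).shape

print(unigram_shape)  # (documents, distinct single words)
print(bigram_shape)   # (documents, distinct single words + word pairs)
```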

### Train/Test Split

In [12]:
df["abusive_relationship"].value_counts(normalize = True)

# close to equal distribution of abusive relationship occurrences, no need to stratify

0    0.535524
1    0.464476
Name: abusive_relationship, dtype: float64

In [13]:
X = df["text"]

y = df["abusive_relationship"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.3)
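If the classes were less balanced, the split could be stratified instead. A minimal sketch on synthetic labels (54 zeros and 46 ones, chosen only to mimic the proportions above), assuming the same `test_size` and `random_state`:

```python
from sklearn.model_selection import train_test_split

# synthetic stand-ins for the text and label columns
toy_X = [f"post {i}" for i in range(100)]
toy_y = [0] * 54 + [1] * 46

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, random_state=42, test_size=0.3, stratify=toy_y
)

# with stratify, the share of the positive class in each split stays close to the overall 0.46
train_share = sum(y_tr) / len(y_tr)
test_share = sum(y_te) / len(y_te)
print(round(train_share, 3), round(test_share, 3))
```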

In [142]:
train = pd.merge(X_train, y_train, left_index = True, right_index = True)

test = pd.merge(X_test, y_test, left_index = True, right_index = True)

train.to_csv("./datasets/train.csv", index = False)
test.to_csv("./datasets/test.csv", index = False)

### Modeling

This section explores the best way to model this data using either CountVectorizer or TfidfVectorizer in combination with either a Logistic Regression model or a Multinomial Naive Bayes model. I use GridSearchCV to test a range of vectorizer hyperparameter values for each combination to determine the best configuration.
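As a quick illustration of what differs between the two vectorizers, the sketch below runs both on a made-up two-document corpus. CountVectorizer returns raw term counts, while TfidfVectorizer reweights them and, with its default settings, scales each row to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# made-up corpus, not taken from the dataset
toy_docs = ["good good day", "bad day"]

counts = CountVectorizer().fit_transform(toy_docs).toarray()
tfidf = TfidfVectorizer().fit_transform(toy_docs).toarray()

# columns are in alphabetical order: bad, day, good
print(counts)
print(np.round(tfidf, 3))

# each TF-IDF row has Euclidean length 1 under the default norm="l2"
row_norms = np.linalg.norm(tfidf, axis=1)
print(row_norms)
```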

Parameters 

| Parameter | Options | Purpose | 
|:--- | :-- |:---|
| ngram range | (1, 1), (1, 2), (1, 3) | Determines whether the features are single words; single words and word pairs; or single words, word pairs, and 3-word sets | 
| stop words | None, "english" | Specifies whether to remove English stop words from the text | 
| max features | 100, 300, 500, 700, 900 | Caps the vocabulary at the most frequent terms across the corpus | 
| min df | 0.1, 0.2, 0.3 | Specifies the minimum proportion of documents a term must appear in to be part of the model | 
| max df | 0.6, 0.7, 0.8, 0.9 | Specifies the maximum proportion of documents a term may appear in and still be part of the model | 
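To make the min df and max df rows concrete, here is a sketch on a made-up four-document corpus. The thresholds 0.5 and 0.9 are picked for the toy data, not taken from the grids in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# made-up corpus, not taken from the dataset
toy_docs = [
    "apple banana",
    "apple cherry",
    "apple banana date",
    "apple egg",
]

# "apple" is in 4/4 documents, "banana" in 2/4, everything else in 1/4
# min_df=0.5 drops the rare terms, max_df=0.9 drops the term that is in every document
vec = CountVectorizer(min_df=0.5, max_df=0.9)
vec.fit(toy_docs)

kept_terms = sorted(vec.vocabulary_)
print(kept_terms)
# ['banana']
```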

### Defining Functions and Hyperparameter Sets

When I tried running all of the hyperparameters at once, it took a ridiculous length of time. Instead, I broke them up into a few smaller sets of hyperparameters and defined a function that runs GridSearchCV on each set and prints statistics from each fit. I also created a dictionary to collect the model statistics, which is later turned into a DataFrame to make it easier to see which models and hyperparameters performed best.

In [111]:
# vectorizer hyperparameters involving ngram, stop word removal, and feature numbers
vec_params_features = {
    "vec__ngram_range": [(1, 1), (1, 2), (1, 3)], 
    "vec__stop_words": [None, "english"], 
    "vec__max_features": [100, 300, 500, 700, 900]}

# vectorizer hyperparameters involving ngram, stop word removal, min document appearance, and max document appearance
vec_params_dfs = {
    "vec__ngram_range": [(1, 1), (1, 2), (1, 3)], 
    "vec__stop_words": [None, "english"], 
    "vec__min_df": [0.1, 0.2, 0.3], 
    "vec__max_df": [0.6, 0.7, 0.8, 0.9]
}

# vectorizer hyperparameters involving all hyperparameters as above, though with more limited options 
vec_params_all = {
    "vec__ngram_range": [(1, 1), (1, 2)], 
    "vec__stop_words": [None, "english"], 
    "vec__min_df": [0.1, 0.2], 
    "vec__max_df": [0.7, 0.8], 
    "vec__max_features": [300, 500, 700]
}
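A rough count of the work involved, assuming 5-fold cross-validation and ignoring the one extra refit GridSearchCV does on the best candidate. The numbers in `grids` are just the lengths of the option lists in the three dictionaries above, and the full grid uses every option from the parameter table.

```python
from math import prod

# number of options per hyperparameter in each grid
grids = {
    "features_params": {"ngram_range": 3, "stop_words": 2, "max_features": 5},
    "df_params": {"ngram_range": 3, "stop_words": 2, "min_df": 3, "max_df": 4},
    "limited_all_params": {"ngram_range": 2, "stop_words": 2, "min_df": 2,
                           "max_df": 2, "max_features": 3},
}
cv_folds = 5  # assumption: 5-fold cross-validation

# candidates = product of the option counts; each candidate is fit once per fold
candidates = {name: prod(sizes.values()) for name, sizes in grids.items()}
cv_fits = {name: n * cv_folds for name, n in candidates.items()}

# a single grid over every option: ngram (3) x stop words (2) x max features (5) x min df (3) x max df (4)
full_grid_candidates = 3 * 2 * 5 * 3 * 4

print(candidates)   # {'features_params': 30, 'df_params': 72, 'limited_all_params': 48}
print(cv_fits)      # {'features_params': 150, 'df_params': 360, 'limited_all_params': 240}
print(full_grid_candidates, full_grid_candidates * cv_folds)  # 360 1800
```

Splitting the search this way means 750 cross-validation fits per vectorizer and classifier pair instead of 1800, at the cost of not trying every combination.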


In [112]:
# dictionary to store model metrics in 
# will be transformed to DataFrame at end for easy visualization of performance differences
model_outcomes = {"Transformer": [], 
                  "Estimator": [], 
                  "Parameters": [],
                  "Best Parameters": [], 
                  "Best Score": [], 
                  "Training Score": [], 
                  "Test Score": [], 
                  "Discrepancy": [], 
                  "Runtime": []}

In [113]:
param_dict = {"df_params": vec_params_dfs, "features_params": vec_params_features, 
              "limited_all_params": vec_params_all}

In [114]:
def gridsearch_batch(vectorizer, classifier, parameter_dict, outcomes_dict): 
    time_total = 0 
    
    # loop over the parameter_dict argument (not the global param_dict)
    # so the function works with whatever grids are passed in
    for name, params in parameter_dict.items():
        pipe = Pipeline([
            ("vec", vectorizer), 
            ("class", classifier)
        ])
        
        grid = GridSearchCV(pipe, params, cv = 5)
        
        start = time.time() 
        grid.fit(X_train, y_train)
        end = time.time()
        
        # accuracy of the refit best estimator on the training and test sets
        train_score = grid.score(X_train, y_train)
        test_score = grid.score(X_test, y_test)
        
        print(f"Model with {name} took {run(start, end)} to run.")
        print(f"Best parameters: \n{grid.best_params_}")
        print(f"Best score: {grid.best_score_}")
        print(f"Training score: {train_score}")
        print(f"Test score: {test_score}")
        
        fill = [f"{vectorizer}", f"{classifier}", name, grid.best_params_, grid.best_score_, 
                train_score, test_score, (train_score - test_score), run(start, end)]
        
        # values in fill must be in the same order as the keys of outcomes_dict
        for field, value in zip(outcomes_dict, fill): 
            outcomes_dict[field].append(value)
        
        print("----------")
        
        time_total += (end - start)
    print(f"This entire process took {run(0, time_total)}")
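The `vec__` prefix in the grids is how scikit-learn routes a parameter to a pipeline step: the step name, two underscores, then the parameter name. A minimal sketch with a throwaway pipeline (same step names as above, values picked arbitrarily):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

demo_pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("class", MultinomialNB()),
])

# "vec__max_features" means: set max_features on the step named "vec"
demo_pipe.set_params(vec__max_features=300, vec__ngram_range=(1, 2))

vec_step = demo_pipe.named_steps["vec"]
print(vec_step.max_features, vec_step.ngram_range)
# 300 (1, 2)
```

GridSearchCV uses the same naming when it tries each combination from a parameter grid.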

##### TfidfVectorizer and Logistic Regression 

In [115]:
gridsearch_batch(TfidfVectorizer(), LogisticRegression(solver = "liblinear"), param_dict, model_outcomes)

Model with df_params took 9m 26s to run.
Best parameters: 
{'vec__max_df': 0.6, 'vec__min_df': 0.1, 'vec__ngram_range': (1, 2), 'vec__stop_words': 'english'}
Best score: 0.9033149171270718
Training score: 0.9330110497237569
Test score: 0.9114331723027376
----------
Model with features_params took 3m 55s to run.
Best parameters: 
{'vec__max_features': 300, 'vec__ngram_range': (1, 2), 'vec__stop_words': 'english'}
Best score: 0.914364640883978
Training score: 0.93853591160221
Test score: 0.9162640901771336
----------
Model with limited_all_params took 3m 22s to run.
Best parameters: 
{'vec__max_df': 0.7, 'vec__max_features': 300, 'vec__min_df': 0.1, 'vec__ngram_range': (1, 2), 'vec__stop_words': 'english'}
Best score: 0.9033149171270718
Training score: 0.9343922651933702
Test score: 0.9146537842190016
----------
This entire process took 16m 44s


##### TfidfVectorizer and Multinomial Naive Bayes 

In [116]:
gridsearch_batch(TfidfVectorizer(), MultinomialNB(), param_dict, model_outcomes)

Model with df_params took 8m 59s to run.
Best parameters: 
{'vec__max_df': 0.6, 'vec__min_df': 0.1, 'vec__ngram_range': (1, 2), 'vec__stop_words': None}
Best score: 0.8611878453038674
Training score: 0.8839779005524862
Test score: 0.8486312399355878
----------
Model with features_params took 3m 27s to run.
Best parameters: 
{'vec__max_features': 900, 'vec__ngram_range': (1, 3), 'vec__stop_words': 'english'}
Best score: 0.8825966850828729
Training score: 0.9164364640883977
Test score: 0.8904991948470209
----------
Model with limited_all_params took 3m 0s to run.
Best parameters: 
{'vec__max_df': 0.7, 'vec__max_features': 700, 'vec__min_df': 0.1, 'vec__ngram_range': (1, 2), 'vec__stop_words': None}
Best score: 0.8577348066298343
Training score: 0.8832872928176796
Test score: 0.8502415458937198
----------
This entire process took 15m 26s


##### CountVectorizer and Logistic Regression 

In [117]:
gridsearch_batch(CountVectorizer(), LogisticRegression(solver = "liblinear"), param_dict, model_outcomes)

Model with df_params took 9m 19s to run.
Best parameters: 
{'vec__max_df': 0.6, 'vec__min_df': 0.1, 'vec__ngram_range': (1, 1), 'vec__stop_words': 'english'}
Best score: 0.9468232044198895
Training score: 0.9854972375690608
Test score: 0.9436392914653784
----------
Model with features_params took 4m 3s to run.
Best parameters: 
{'vec__max_features': 500, 'vec__ngram_range': (1, 2), 'vec__stop_words': 'english'}
Best score: 0.9433701657458563
Training score: 0.9903314917127072
Test score: 0.9484702093397746
----------
Model with limited_all_params took 3m 26s to run.
Best parameters: 
{'vec__max_df': 0.8, 'vec__max_features': 300, 'vec__min_df': 0.1, 'vec__ngram_range': (1, 1), 'vec__stop_words': 'english'}
Best score: 0.9412983425414365
Training score: 0.9861878453038674
Test score: 0.9404186795491143
----------
This entire process took 16m 49s


##### CountVectorizer and Multinomial Naive Bayes

In [118]:
gridsearch_batch(CountVectorizer(), MultinomialNB(), param_dict, model_outcomes)

Model with df_params took 8m 59s to run.
Best parameters: 
{'vec__max_df': 0.6, 'vec__min_df': 0.1, 'vec__ngram_range': (1, 2), 'vec__stop_words': 'english'}
Best score: 0.888121546961326
Training score: 0.893646408839779
Test score: 0.8840579710144928
----------
Model with features_params took 3m 46s to run.
Best parameters: 
{'vec__max_features': 900, 'vec__ngram_range': (1, 1), 'vec__stop_words': 'english'}
Best score: 0.9019337016574586
Training score: 0.9247237569060773
Test score: 0.9066022544283414
----------
Model with limited_all_params took 3m 3s to run.
Best parameters: 
{'vec__max_df': 0.7, 'vec__max_features': 300, 'vec__min_df': 0.1, 'vec__ngram_range': (1, 2), 'vec__stop_words': 'english'}
Best score: 0.8846685082872928
Training score: 0.8957182320441989
Test score: 0.8872785829307569
----------
This entire process took 15m 48s


#### Model Performance Metrics 

In [125]:
outcomes = pd.DataFrame(model_outcomes)

# rank runs by cross-validated score and renumber the rows from 0
outcomes = outcomes.sort_values(by = "Best Score", ascending = False).reset_index(drop = True)

outcomes

| | Transformer | Estimator | Parameters | Best Parameters | Best Score | Training Score | Test Score | Discrepancy | Runtime |
|:--- |:--- |:--- |:--- |:--- |:--- |:--- |:--- |:--- |:--- |
| 0 | CountVectorizer(analyzer='word', binary=False,... | LogisticRegression(C=1.0, class_weight=None, d... | df_params | {'vec__max_df': 0.6, 'vec__min_df': 0.1, 'vec_... | 0.946823 | 0.985497 | 0.943639 | 0.041858 | 9m 19s |
| 1 | CountVectorizer(analyzer='word', binary=False,... | LogisticRegression(C=1.0, class_weight=None, d... | features_params | {'vec__max_features': 500, 'vec__ngram_range':... | 0.94337 | 0.990331 | 0.94847 | 0.041861 | 4m 3s |
| 2 | CountVectorizer(analyzer='word', binary=False,... | LogisticRegression(C=1.0, class_weight=None, d... | limited_all_params | {'vec__max_df': 0.8, 'vec__max_features': 300,... | 0.941298 | 0.986188 | 0.940419 | 0.045769 | 3m 26s |
| 3 | TfidfVectorizer(analyzer='word', binary=False,... | LogisticRegression(C=1.0, class_weight=None, d... | features_params | {'vec__max_features': 300, 'vec__ngram_range':... | 0.914365 | 0.938536 | 0.916264 | 0.022272 | 3m 55s |
| 4 | TfidfVectorizer(analyzer='word', binary=False,... | LogisticRegression(C=1.0, class_weight=None, d... | df_params | {'vec__max_df': 0.6, 'vec__min_df': 0.1, 'vec_... | 0.903315 | 0.933011 | 0.911433 | 0.021578 | 9m 26s |
| 5 | TfidfVectorizer(analyzer='word', binary=False,... | LogisticRegression(C=1.0, class_weight=None, d... | limited_all_params | {'vec__max_df': 0.7, 'vec__max_features': 300,... | 0.903315 | 0.934392 | 0.914654 | 0.019738 | 3m 22s |
| 6 | CountVectorizer(analyzer='word', binary=False,... | MultinomialNB(alpha=1.0, class_prior=None, fit... | features_params | {'vec__max_features': 900, 'vec__ngram_range':... | 0.901934 | 0.924724 | 0.906602 | 0.018122 | 3m 46s |
| 7 | CountVectorizer(analyzer='word', binary=False,... | MultinomialNB(alpha=1.0, class_prior=None, fit... | df_params | {'vec__max_df': 0.6, 'vec__min_df': 0.1, 'vec_... | 0.888122 | 0.893646 | 0.884058 | 0.009588 | 8m 59s |
| 8 | CountVectorizer(analyzer='word', binary=False,... | MultinomialNB(alpha=1.0, class_prior=None, fit... | limited_all_params | {'vec__max_df': 0.7, 'vec__max_features': 300,... | 0.884669 | 0.895718 | 0.887279 | 0.00844 | 3m 3s |
| 9 | TfidfVectorizer(analyzer='word', binary=False,... | MultinomialNB(alpha=1.0, class_prior=None, fit... | features_params | {'vec__max_features': 900, 'vec__ngram_range':... | 0.882597 | 0.916436 | 0.890499 | 0.025937 | 3m 27s |


In [127]:
# pulling best parameters for the model that yielded a high accuracy score (91%) with a smaller discrepancy (2.2%)

outcomes["Best Parameters"][3]

{'vec__max_features': 300,
 'vec__ngram_range': (1, 2),
 'vec__stop_words': 'english'}

In [130]:
outcomes["Best Parameters"][0]

{'vec__max_df': 0.6,
 'vec__min_df': 0.1,
 'vec__ngram_range': (1, 1),
 'vec__stop_words': 'english'}

##### The Final Model 

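The final model for this project is a Logistic Regression model with transformation done by TfidfVectorizer (max_features = 300, ngram_range = (1, 2), English stop words removed). It reached about 91.6% accuracy on the test set, with a gap of about 2.2 percentage points between training and test accuracy. The CountVectorizer and Logistic Regression combinations scored higher on the test set (about 94% to 95%), but their training accuracy was also 4 to 5 points above their test accuracy, which suggests more overfitting.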
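The trade-off can be checked with a little arithmetic. The sketch below recomputes the training/test gap for two of the runs, using the scores printed in the grid search output above; the dictionary layout is only for illustration.

```python
# (training accuracy, test accuracy) copied from the grid search output above
scores = {
    "TfidfVectorizer + LogisticRegression (features_params)": (0.93853591160221, 0.9162640901771336),
    "CountVectorizer + LogisticRegression (features_params)": (0.9903314917127072, 0.9484702093397746),
}

# gap between training and test accuracy, rounded to four decimals
gaps = {name: round(train - test, 4) for name, (train, test) in scores.items()}

for name, gap in gaps.items():
    print(f"{name}: gap = {gap}")
```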

In [139]:
pipe = Pipeline([
                ("vec", TfidfVectorizer(max_features = 300, 
                                        ngram_range = (1, 2), 
                                        stop_words = "english")),
                ("lr", LogisticRegression(solver = "liblinear"))
            ])

pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vec',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=300,
                                 min_df=1, ngram_range=(1, 2), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words='english', strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('lr',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scalin

In [140]:
pipe.score(X_train, y_train)

0.93853591160221

In [141]:
pipe.score(X_test, y_test)

0.9162640901771336