# NLP and GridSearch

This section demonstrates how the data was cleaned (tokenized, lemmatized, etc.) and how the best model was chosen. The models tested here are Logistic Regression and Multinomial Naive Bayes, each tried with both CountVectorizer and TfidfVectorizer to compare performance on this dataset.

In [112]:
import pandas as pd
import numpy as np 
import regex as re

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import stop_words
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer



### "Cleaning"

Reddit posts often contain special characters, emojis, and links. These are two very text-heavy subreddits, so there should not be much pollution from external links or images. To remove any links that are present, I target tokens that contain "www.", "http://", or "https://" and remove only those tokens, keeping whatever else remains in the text. The data is also lemmatized at this stage. "Cleaning" is in quotes here because some of the modeling below involves either including or excluding stopwords, which could be considered another form of cleaning.

In [56]:
df = pd.read_csv("./datasets/combined_raw.csv")

In [57]:
# fill null values with " "
# not filling with symbol (since it will be scrubbed later), or word since this could pollute the data 
df.fillna(" ", inplace = True)

In [70]:
# first, get rid of links
# target tokens that contain "www.", "http://" or "https://"

def remove_links(text):
    # convert text to lowercase and split it into individual tokens
    tokens = text.lower().split()

    # keep only non-link tokens; building a new list avoids calling
    # remove() on the list being looped over, which can skip tokens
    link_markers = ("www.", "http://", "https://")
    kept = [token for token in tokens
            if not any(marker in token for marker in link_markers)]

    # combine remaining tokens back into one string
    return " ".join(kept)

# scrubbing links from titles
df["title"] = df["title"].apply(remove_links)

In [71]:
# same thing is done for body text of posts 
df["selftext"] = df["selftext"].apply(remove_links)
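One pitfall worth keeping in mind for this kind of token filtering: calling `list.remove` on a list while looping over that same list shifts the remaining items left, so the token right after each removed one is never checked. Below is a minimal sketch on made-up tokens, comparing that pattern with a list comprehension. The `is_link` helper and the sample tokens are illustrative only and do not come from the dataset.

```python
# Hypothetical helper: flags tokens that look like links.
def is_link(token):
    return any(marker in token for marker in ("www.", "http://", "https://"))

tokens = ["see", "www.a.com", "https://b.org", "thanks"]

# Pattern 1: remove from the list being iterated (skips the next token).
in_place = list(tokens)
for token in in_place:
    if is_link(token):
        in_place.remove(token)

# Pattern 2: build a new list, so every token is checked exactly once.
filtered = [token for token in tokens if not is_link(token)]

print(in_place)   # ['see', 'https://b.org', 'thanks']
print(filtered)   # ['see', 'thanks']
```

Two links in a row are enough to let the second one slip through the first pattern, which is easy to miss on real posts.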

In [89]:
# next, remove special characters (anything that is not a letter, digit or underscore)
# combine this step with lemmatizing 

tokenizer = RegexpTokenizer(r"\w+")

lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    # tokenize on word characters, lemmatize each token, then rejoin
    words = tokenizer.tokenize(text)
    return " ".join(lemmatizer.lemmatize(word) for word in words)

df["selftext"] = df["selftext"].apply(lemmatize_text)

In [93]:
df["title"] = df["title"].apply(lemmatize_text)
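To make the effect of the `\w+` pattern concrete, here is a small sketch using the standard library `re` module on a made-up string. It assumes `RegexpTokenizer` with this pattern behaves like `re.findall`, which is my understanding of its default mode. Note that `\w` matches digits and underscores as well as letters; the letters-only pattern is shown only as an alternative and is not what the cells above use.

```python
import re

sample = "don't panic!!! call 911 :( user_name"  # made-up example text

# \w+ keeps runs of letters, digits and underscores; punctuation is dropped.
word_tokens = re.findall(r"\w+", sample)

# A letters-only pattern (an alternative, not what the notebook uses).
letter_tokens = re.findall(r"[a-z]+", sample.lower())

print(word_tokens)    # ['don', 't', 'panic', 'call', '911', 'user_name']
print(letter_tokens)  # ['don', 't', 'panic', 'call', 'user', 'name']
```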

After some brainstorming with our Merciful Overlord Charlie, combining the text from both the title and selftext into one column seemed like a great way to organize the text for modeling. I don't really want to go back and rewrite the cleaning for a single column right now, so I'm just going to create a new column from the two cleaned columns. I'll fix this when I have some more time for fine-tuning.

In [104]:
df["text"] = df["title"] + " " + df["selftext"]


In [108]:
# out of curiosity let's CountVectorize the text column 

cvec = CountVectorizer()

curiosity = cvec.fit_transform(df["text"])

curiosity.shape

(2069, 14466)

In [None]:
# holy moly, 14,466 distinct words and that is without including word combinations! 
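To get a feel for how much word combinations would add on top of single words, here is a rough sketch on two made-up sentences (not from the dataset) that compares the vocabulary size with and without bigrams. The variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["he yelled at me", "he apologized to me"]  # made-up examples

# Unigrams only versus unigrams plus bigrams.
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
uni_bi = CountVectorizer(ngram_range=(1, 2)).fit(docs)

n_unigrams = len(unigrams.vocabulary_)
n_uni_bi = len(uni_bi.vocabulary_)
print(n_unigrams, n_uni_bi)  # 6 12
```

Even on this tiny example the vocabulary doubles, which is one reason the grid search later caps the vocabulary with `max_features`.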

### Train/Test Split

In [94]:
df["abusive_relationship"].value_counts(normalize = True)

# close to equal distribution of abusive relationship occurrences, no need to stratify

0    0.535524
1    0.464476
Name: abusive_relationship, dtype: float64

In [250]:
X = df["text"]

y = df["abusive_relationship"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.3)
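If the classes were less balanced, `train_test_split` accepts a `stratify` argument that keeps the class proportions roughly equal across the train and test sets. A minimal sketch on made-up labels (the `X_toy` and `y_toy` names are illustrative and unrelated to the real data):

```python
from sklearn.model_selection import train_test_split

# Made-up data: 60 negatives and 40 positives.
X_toy = [f"post {i}" for i in range(100)]
y_toy = [0] * 60 + [1] * 40

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Share of positives in the test set; should sit close to the overall 0.4.
test_positive_rate = sum(y_te) / len(y_te)
print(len(y_te), round(test_positive_rate, 2))
```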

### Modeling

This section explores the best way to model this data using either CountVectorizer or TfidfVectorizer in combination with either a Logistic Regression model or a Multinomial Naive Bayes model. I use GridSearchCV to test a range of hyperparameter values for each combination to determine the best configuration.
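As a rough sketch of that comparison, a single grid search can also swap whole pipeline steps by listing estimators under the step name. This is an alternative to building one pipeline per combination, not necessarily how the cells below are organized. The toy corpus, labels, and the step name `clf` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy corpus, only to make the sketch runnable.
texts = [
    "he controls my money", "she checks my phone daily",
    "he threatens me often", "she isolates me from friends",
    "we planned a nice trip", "our anniversary dinner was lovely",
    "he cooked me dinner", "she supports my career",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([("vec", CountVectorizer()), ("clf", MultinomialNB())])

# Whole steps can be swapped by listing estimators under the step name.
grid = {
    "vec": [CountVectorizer(), TfidfVectorizer()],
    "clf": [LogisticRegression(solver="liblinear"), MultinomialNB()],
}

search = GridSearchCV(pipe, grid, cv=2)
search.fit(texts, labels)

n_candidates = len(search.cv_results_["params"])
print(n_candidates)  # 4
print(search.best_params_)
```

Keeping one pipeline per combination, as done below, makes it easier to give each model its own hyperparameter ranges; the single-grid form is mainly convenient for a quick head-to-head.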

In [255]:
min_list = list(np.arange(0.1, 0.3, 0.1))

max_list = list(np.arange(0.5, 1.0, 0.2))

vec_params = {
    "vec__ngram_range": [(1, 1), (1, 2), (1, 3)], 
    "vec__stop_words": [None, "english"], 
    "vec__max_features": [300, 500, 700, 900], 
#     "vec__min_df": min_list,   # pass the list itself, not a list wrapped in a list
#     "vec__max_df": max_list
}
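It can help to count how many fits a grid implies before running it. A small sketch using `ParameterGrid` from scikit-learn, with the grid restated so the snippet runs on its own and assuming five-fold cross-validation (the `cv = 5` setting used for the search in this notebook):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above, restated so the sketch is self-contained.
grid = {
    "vec__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "vec__stop_words": [None, "english"],
    "vec__max_features": [300, 500, 700, 900],
}

n_candidates = len(ParameterGrid(grid))  # 3 * 2 * 4 combinations
n_fits = n_candidates * 5                # one fit per candidate per fold
print(n_candidates, n_fits)  # 24 120
```

By default GridSearchCV also refits the best candidate once on the full training set, so the total is one more than `n_fits`. Uncommenting `min_df` and `max_df` would multiply the count by the length of each list.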

In [256]:
pipe_tfidf_log = Pipeline([
    ("vec", TfidfVectorizer()), 
    ("lr", LogisticRegression(solver = "liblinear"))
])

In [257]:
gs = GridSearchCV(pipe_tfidf_log, vec_params, cv = 5)

In [258]:
gs.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('vec',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                           

In [259]:
gs.best_params_

{'vec__max_features': 300,
 'vec__ngram_range': (1, 2),
 'vec__stop_words': 'english'}

In [260]:
gs.best_score_

0.9150552486187845

In [209]:
X_train.shape

(1448, 1)

In [145]:
y_train.shape

(1448,)