# NLP and GridSearch

This section demonstrates how the data was cleaned (tokenized, lemmatized, etc.) and how the best model was chosen. The models tested are Logistic Regression and Multinomial Naive Bayes, and each is run with both CountVectorizer and TfidfVectorizer to compare their performance on this dataset.

In [1]:
import pandas as pd
import numpy as np 

from sklearn.model_selection import train_test_split
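# The English stop word list is exposed publicly as ENGLISH_STOP_WORDS in sklearn.feature_extraction.text
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS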
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup
from nltk.stem import WordNetLemmatizer


### Train/Test Split

In [2]:
df = pd.read_csv("./datasets/combined_raw.csv")

In [5]:
df["abusive_relationship"].value_counts(normalize = True)

# The two classes are close to balanced (about 53% vs. 47%), so the split is not stratified

0    0.534749
1    0.465251
Name: abusive_relationship, dtype: float64

In [6]:
X = df[["title", "selftext"]]

y = df["abusive_relationship"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.3)
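
Because the split is not stratified, it can be worth confirming that the class proportions in the train and test sets stay close to each other. The following is a minimal sketch of that check on a made-up toy frame (the `toy` DataFrame, its `text` and `label` columns, and the 53/47 label counts are placeholders for illustration, not the real dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 100 made-up rows with a roughly 53/47 label mix
toy = pd.DataFrame({
    "text": [f"post {i}" for i in range(100)],
    "label": [0] * 53 + [1] * 47,
})

# Same split settings as above, without stratification
X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["text"]], toy["label"], random_state=42, test_size=0.3
)

# Put the class proportions of each split side by side for comparison
balance = pd.DataFrame({
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
})
print(balance)
```

If the two columns differ noticeably on the real data, passing `stratify=y` to `train_test_split` would force both splits to match the overall class proportions.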

### "Cleaning" 

Reddit posts can contain special characters, emojis, and links. These are two very text-heavy subreddits, so there should not be much pollution from external links or images. To remove any links that are present, I will target strings that contain "www." or "https:" and remove only those elements, keeping the rest of the string if anything else remains. The data will also be lemmatized at this stage.
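
The link removal and lemmatization described above could be sketched roughly as follows. This is an illustration under assumptions rather than the exact cleaning code used in this project: the helper names `remove_links` and `clean_text` are hypothetical, text is split on whitespace only, and the lemmatizer is passed in as a plain function so the sketch runs without any NLTK data.

```python
def remove_links(text):
    """Drop whitespace-separated tokens containing 'www.' or 'https:'; keep everything else."""
    kept = [
        tok for tok in str(text).split()
        if "www." not in tok and "https:" not in tok
    ]
    return " ".join(kept)


def clean_text(text, lemmatize=None):
    """Remove link tokens, then optionally lemmatize each remaining token.

    `lemmatize` is any function mapping one token to its lemma (hypothetical hook).
    """
    tokens = remove_links(text).split()
    if lemmatize is not None:
        tokens = [lemmatize(tok) for tok in tokens]
    return " ".join(tokens)


print(remove_links("read this https://example.com before replying"))
```

With NLTK's WordNet corpus downloaded (`nltk.download("wordnet")`), the lemmatizer imported earlier could be plugged in as `clean_text(post, lemmatize=WordNetLemmatizer().lemmatize)` and applied to the `title` and `selftext` columns with `Series.apply`.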