## Comment + Parent Logistic Regression

### Notebook #4

## Intro

What follows is a baseline `LogisticRegression` on the Reddit `comment` and `parent_comment` columns. We use `ColumnTransformer` to apply a separate `CountVectorizer` to each of the two columns. The features are split 750/750, for a total of 1,500, to keep the number of features consistent. The experiment tests whether the added context from the `parent_comment` improves the model.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
pd.set_option('display.max_columns', None, 'max_colwidth', 250)

import regex as re
import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords 

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


from sklearn.compose import ColumnTransformer
from sklearn.pipeline import FeatureUnion




RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\gravi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# reading csv and checking head
reddit = pd.read_csv('comment_plus_parent.csv', index_col=0)
reddit.head()



Unnamed: 0,label,comment,parent_comment
0,0,NC and NH.,"Yeah, I get that argument. At this point, I'd prefer is she lived in NC as well."
1,0,You do know west teams play against west teams more than east teams right?,The blazers and Mavericks (The wests 5 and 6 seed) did not even carry a good enough record to make the playoffs in the east last year.
2,0,"They were underdogs earlier today, but since Gronk's announcement this afternoon, the Vegas line has moved to patriots -1",They're favored to win.
3,0,"This meme isn't funny none of the ""new york nigga"" ones are.",deadass don't kill my buzz
4,0,I could use one of those tools.,Yep can confirm I saw the tool they use for that. It was made by our boy EASports_MUT


In [3]:
# verifying shape
reddit.shape

(1010714, 3)

In [3]:
# dividing rows by 4 to create sample
reddit.shape[0]/4

252678.5

In [17]:
# creating sample to be used for testing
reddit_sample = reddit.sample(252679, random_state=42)
reddit_sample.head()

Unnamed: 0,label,comment,parent_comment
139485,0,Trufant,Yo I missed a large portion of the game. Any significant injuries?
276174,0,Too soon,When/If The Bismarck Is Added This Will Be The Scariest Thing a German Player Can See
590507,0,"Also, phone lights have no throw.",No. One of the benefits/requirements of a flashlight is that most edc-size ones can be held in the mouth for hands-free use. Try that with your phone.
652534,1,"The Pirate Bay, of course!",Where is the best place to buy windows? It doesn't matter what version as I plan on upgrading to windows 10. Thank you for your time.
333855,0,Doesn't Sop have a Mac version though?,"Right, guess I should say that my work uses Mac's so I can't do acestream at work. I do acestream at home."


In [5]:
# setting X and y
y = reddit_sample.pop('label')
X = reddit_sample

In [6]:
# verifying X shape
X.shape

(252679, 2)

In [7]:
# verifying y shape
y.shape

(252679,)

In [8]:
# first split into test and remainder
X_remainder, X_test, y_remainder, y_test = train_test_split(X, y, test_size = 0.25, random_state=42)

In [9]:
# second split into train and validation
X_train, X_val, y_train, y_val = train_test_split(X_remainder, y_remainder, test_size = 0.25, random_state=42)
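
Applying `test_size=0.25` twice does not give a 50/25/25 split, because the second 25% is taken from the remainder. The arithmetic below is a small sketch of the resulting sizes, assuming `train_test_split` rounds the held-out side up when `test_size` is a float.

```python
import math

n_rows = 252_679       # size of reddit_sample, from the notebook
test_fraction = 0.25   # first split
val_fraction = 0.25    # second split, applied to the remainder only

# The held-out side is rounded up; the other side gets whatever is left
n_test = math.ceil(n_rows * test_fraction)
n_remainder = n_rows - n_test
n_val = math.ceil(n_remainder * val_fraction)
n_train = n_remainder - n_val

print(n_train, n_val, n_test)
# Share of the full sample that each split receives
print(round(n_train / n_rows, 4), round(n_val / n_rows, 4), round(n_test / n_rows, 4))
```

The split is therefore roughly 56.25% train, 18.75% validation, and 25% test, which matches the row counts reported for the vectorized DataFrames further down.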

In [10]:
# custom tokenizer function

# instantiating stemmer and stopwords
stemmer = nltk.stem.PorterStemmer()
# a set makes the per-token membership check faster than a list
stop_words = set(stopwords.words('english'))

def tokenizer(sentence):

    # replacing numbers with empty string
    sentence = re.sub(r"\d+", "", sentence)
    
    # setting to lower case once, then removing punctuation
    sentence = sentence.lower()
    for punctuation_mark in string.punctuation:
        sentence = sentence.replace(punctuation_mark, '')
        
    
    # splitting sentence into words
    words_list = sentence.split(' ')
    stemmed_words = []
    
    
    # removing stopwords and any tokens that are just empty strings
    for word in words_list:
        if (word not in stop_words) and (word != ''):
            # Stem words
            stemmed_word = stemmer.stem(word)
            stemmed_words.append(stemmed_word)

    return stemmed_words
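
One side effect of the order of operations in the tokenizer is worth noting: punctuation is stripped before the stop word check, so contractions lose their apostrophe and no longer match stop list entries that are written with one. The sketch below illustrates this with a hypothetical mini stop list (`mini_stop_words`) standing in for NLTK's English list, and skips stemming to keep the example dependency-free.

```python
import string

# Hypothetical mini stop list; an assumption for illustration,
# standing in for a list that stores contractions with an apostrophe
mini_stop_words = {"the", "is", "don't", "isn't"}

def strip_then_filter(sentence):
    """Mirror the notebook's order: strip punctuation first, then filter stop words."""
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in cleaned.split(" ") if word and word not in mini_stop_words]

tokens = strip_then_filter("Don't worry, the plan isn't risky.")
print(tokens)  # ['dont', 'worry', 'plan', 'isnt', 'risky']
```

Here `the` is filtered out, but `dont` and `isnt` survive as tokens. Whether that matters for this task is an open question; it is only something to be aware of when reading the learned vocabulary.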

In [11]:
# Instantiating transformers to pass into column transformer
transformers = [('comment', CountVectorizer(min_df=25, tokenizer=tokenizer, max_features=750, dtype=np.int8), 'comment'),
                 ('parent', CountVectorizer(min_df=25, tokenizer=tokenizer, max_features=750, dtype=np.int8), 'parent_comment')]
                

# Creating the column transformer
column_transform = ColumnTransformer(transformers)

# Fitting and transforming train, test, val
X_train_tokens = column_transform.fit_transform(X_train)
X_val_tokens = column_transform.transform(X_val)
X_test_tokens = column_transform.transform(X_test)

In [12]:
# train to df
train_vectors = pd.DataFrame(columns=column_transform.get_feature_names_out(), data=X_train_tokens.toarray())
train_vectors.shape

(142131, 1500)

In [13]:
# val to df
val_vectors = pd.DataFrame(columns=column_transform.get_feature_names_out(), data=X_val_tokens.toarray())
val_vectors.shape

(47378, 1500)

In [14]:
# test to df
test_vectors = pd.DataFrame(columns=column_transform.get_feature_names_out(), data=X_test_tokens.toarray())
test_vectors.shape

(63170, 1500)
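
The `.toarray()` calls above materialize dense arrays. As a rough sketch of the cost, the block below computes the dense size of the training matrix at `int8` and then shows, on hypothetical toy data, that `LogisticRegression` can also be fit on a SciPy sparse matrix directly. This is an alternative under stated assumptions, not what the notebook does.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression

# Dense footprint of the training matrix at int8 (1 byte per cell);
# the shape is taken from the notebook output
dense_train_bytes = 142_131 * 1_500 * np.dtype(np.int8).itemsize
print(f"{dense_train_bytes / 1e6:.0f} MB")

# Toy sparse count matrix (random, hypothetical data) to show that
# fit and score accept sparse input without calling .toarray()
rng = np.random.default_rng(42)
toy_counts = sparse.csr_matrix(rng.integers(0, 2, size=(40, 6)), dtype=np.float64)
toy_labels = np.array([0, 1] * 20)

toy_model = LogisticRegression(max_iter=1000).fit(toy_counts, toy_labels)
toy_accuracy = toy_model.score(toy_counts, toy_labels)
```

The trade-off is that a sparse matrix carries no column labels, so the feature names would have to be looked up separately through `get_feature_names_out()`.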

In [15]:
logreg = LogisticRegression(max_iter=5000, random_state=42, n_jobs=4)

logreg.fit(train_vectors, y_train)

print(f'Train score: {logreg.score(train_vectors, y_train)}')
print(f'Val score: {logreg.score(val_vectors, y_val)}')


Train score: 0.6564085245301869
Val score: 0.6476634724977838
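
One way to check how much the model leans on each column is to group the learned coefficients by the `comment__` / `parent__` prefix that `ColumnTransformer` adds to the feature names. The helper below, `mean_abs_coef_by_column`, is a hypothetical sketch, and the names and weights passed to it are made up for illustration; they are not results from this notebook.

```python
import numpy as np

def mean_abs_coef_by_column(feature_names, coefficients):
    """Average |coefficient| per source column, keyed by the 'name__token' prefix."""
    coefs = np.abs(np.asarray(coefficients, dtype=float))
    prefixes = np.array([str(name).split("__", 1)[0] for name in feature_names])
    return {p: float(coefs[prefixes == p].mean()) for p in np.unique(prefixes)}

# Hypothetical names and weights, for illustration only
toy_names = ["comment__sure", "comment__great", "parent__game", "parent__think"]
toy_coefs = [0.8, -0.4, 0.1, -0.1]

summary = mean_abs_coef_by_column(toy_names, toy_coefs)
print(summary)
```

With the fitted model, the call would be `mean_abs_coef_by_column(column_transform.get_feature_names_out(), logreg.coef_[0])`. Because the count features are not standardized, coefficient magnitudes are only a rough indicator of each column's contribution.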


---------------------------------------------------------------------------------------------------------------------------------------------------------

## Conclusion

Model performance (train accuracy 0.656, validation accuracy 0.648) was within the range of previous logreg models, so adding the `parent_comment` did not produce a clear improvement. As discussed previously, `CountVectorizer` and TF-IDF have their limitations, and added context will not necessarily improve their performance. Splitting `max_features` between the two columns is also not ideal, because it leaves the `comment`, which is the more immediate predictor, with only 750 features and can reduce its predictive power.