The first preprocessing step for the data we are using in class is cleaning it for initial visualization. The comment bodies need escape sequences removed and emojis and other invalid characters parsed out, and we need to fix any other issues in the data that could prevent a seamless exploration.

In [1]:
#import packages and load the data

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#comments_full is pulled from the top 20 posts in the kanye subreddit.
kanyeData = pd.read_csv("data/comments_full.csv", index_col=0)
scienceData = pd.read_csv("data/comments_askScience.csv")
politicalData = pd.read_csv("data/comments_PoliticalDiscussion.csv")
socialismData = pd.read_csv("data/comments_socialism.csv")

socialismData.head()

Unnamed: 0,postTag,user,comment_score,comment_body,mod_deleted,user_deleted,verified,is_gold,has_verified_email,link_karma,total_karma,created_utc,comment_karma
0,ykylq2,https://www.reddit.com/user/HankScorpio42,903,There is this Stockholm syndrome when it comes...,0,0,True,False,True,407101.0,503481.0,1462790000.0,90018.0
1,ykylq2,https://www.reddit.com/user/jacquix,280,\nThere is this Stockholm syndrome when it com...,0,0,True,False,True,558.0,9746.0,1466185000.0,9140.0
2,ykylq2,https://www.reddit.com/user/travissius,70,I hadn't noticed there was a comment section b...,0,0,True,False,True,1.0,240.0,1612706000.0,231.0
3,ykylq2,https://www.reddit.com/user/Indoril_Nereguar,75,"'It's not changed my opinion of her, I always ...",0,0,True,False,True,18262.0,61808.0,1516709000.0,42347.0
4,ykylq2,https://www.reddit.com/user/pdrock7,45,"I mean i do too, but she inspires me to be vio...",0,0,True,False,True,65905.0,122831.0,1343871000.0,55896.0


In [2]:
#regex for parsing escape sequences and other invalid characters in the comment_body.
#we are using the comment body to identify keywords, so the main goal of the comment cleaning is just separating the bodies into lists of words.
print(kanyeData.dtypes) #->most values are numbers or objects. convert comment objects to strings to split into a list of keywords?

kanyeData['comment_body'] = kanyeData['comment_body'].str.split()
kanyeData['comment_body'].head()

postTag                object
user                   object
comment_score           int64
comment_body           object
mod_deleted             int64
user_deleted            int64
verified               object
is_gold                object
has_verified_email     object
link_karma            float64
total_karma           float64
created_utc           float64
comment_karma         float64
dtype: object


0    [Also, who, is, this, fucking, interviewer, eg...
1    [The, professional, paparazzi, literally, try,...
2    [They’re, trying, to, get, him, to, say, somet...
3                 [Naw, Dawg, he, just, mentally, ill]
4                                     [Why, not, both]
Name: comment_body, dtype: object

In [None]:
#cleaning escape sequences, invalid words, deleted comments, and other things that won't serve to help our analysis. regex?
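One possible shape for this cleaning step, sketched with only the standard-library `re` module. The patterns, the `clean_comment` name, and the `[deleted]`/`[removed]` placeholder strings are assumptions for illustration and should be checked against the actual CSV contents.

```python
import re

# Assumed patterns (not from the original notebook); adjust to the real data.
ESCAPES = re.compile(r"\\[nrt]|[\n\r\t]")   # literal "\n"-style text and real control characters
NON_ASCII = re.compile(r"[^\x00-\x7F]+")    # emojis and any other non-ASCII characters
DELETED = {"[deleted]", "[removed]"}        # assumed placeholders for deleted comment bodies

def clean_comment(body):
    """Return a list of lowercase word tokens, or [] for missing/deleted bodies."""
    if not isinstance(body, str) or body.strip() in DELETED:
        return []
    text = ESCAPES.sub(" ", body)
    text = text.replace("\u2019", "'")      # keep curly apostrophes as plain ones
    text = NON_ASCII.sub(" ", text)         # note: this also drops accented letters
    return re.findall(r"[a-z0-9']+", text.lower())

print(clean_comment("\nThere is this\\n Stockholm \U0001F600 syndrome!"))
```

If this fits the data, it could be applied to the raw string column with `kanyeData['comment_body'].apply(clean_comment)` in place of the plain `.str.split()` call.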

Our analysis will look at which variables (from the data we collected) are the most useful in classifying whether a comment gets deleted and, if so, whether the user deleted it themselves or a moderator deleted it. Can we predict the outcome of a comment's status based on certain keywords, a karma threshold, or any other predictors? Could this information be used to enhance the auto moderator currently used on Reddit?
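As a rough sketch of the classification target described above, the two 0/1 flag columns in the data (`mod_deleted`, `user_deleted`) could be collapsed into a single three-way label. The `deletion_label` helper, the label strings, and the tie-breaking rule are hypothetical choices, not something the data dictates.

```python
def deletion_label(mod_deleted, user_deleted):
    """Collapse the two 0/1 deletion flags into one class label."""
    # Assumption: if both flags are set, the moderator removal takes precedence.
    if mod_deleted == 1:
        return "mod_deleted"
    if user_deleted == 1:
        return "user_deleted"
    return "kept"

# Toy (mod_deleted, user_deleted) pairs, not rows from the real CSVs.
rows = [(0, 0), (1, 0), (0, 1)]
labels = [deletion_label(m, u) for m, u in rows]
print(labels)
```

On the real frames, the same function could be applied row-wise to build a target column for model training.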

The main variable we are studying will be the comment bodies, as that content will be most critical to capturing the gist of messages that routinely get deleted or not. Thus, the data will be mostly free text, with no predefined features. As such, we will use multiple techniques to create training data to be used in model selection and training. Correlations with account creation date and comment karma will also be examined, but those columns will require less cleaning.

In [7]:
#CountVectorizer 
#We will use CountVectorizer during vectorization of datasets.
from sklearn.feature_extraction.text import CountVectorizer

kanye_comments = kanyeData['comment_body']
#each row in kanye_comments is a different bag of words.
#the comments are already tokenized, so the identity analyzer passes each list through; build a vocabulary over all documents and encode the count matrix.
vect = CountVectorizer(analyzer=lambda x: x, max_features=10)
vect.fit(kanye_comments)
print(vect.get_feature_names_out())

X = vect.transform(kanye_comments).toarray()

['I' 'a' 'and' 'in' 'is' 'of' 'that' 'the' 'to' 'you']


In [None]:
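#TF-IDF Rescaling Calculations. -> [Utilizing a param grid or pipeline could simplify this process.]
#-> A statistical measure to evaluate how relevant a word is to a document.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline
#fit on comment_body only (a whole DataFrame would be iterated as column names);
#the bodies are already token lists, so an identity analyzer replaces the default tokenizer.
tfidf_pipe = make_pipeline(CountVectorizer(analyzer=lambda tokens: tokens), TfidfTransformer())
kanye_tfidf = tfidf_pipe.fit_transform(kanyeData['comment_body'])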
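To make the rescaling concrete, here is a hand-rolled sketch on a toy corpus. It uses the smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 row normalisation, which is what scikit-learn documents as the `TfidfTransformer` defaults. The `tfidf` function and the toy token lists are illustrative assumptions, not part of our pipeline.

```python
import math
from collections import Counter

# Toy pre-tokenized documents (assumed for illustration only).
docs = [["why", "not", "both"], ["why", "not"], ["mentally", "ill"]]

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for doc in docs:
        raw = {t: tf * idf[t] for t, tf in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()} if norm else {})
    return rows

weights = tfidf(docs)
print(weights[0])
```

In the first document, "both" appears in only one document and so receives a higher weight than "why" or "not", which appear in two.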

In [None]:
#N-grams
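A minimal sketch of what n-grams look like for one of the already-tokenized comments. The `ngrams` helper is a hypothetical name introduced here, not something defined elsewhere in this notebook.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Tokens taken from one of the split comment bodies shown earlier.
tokens = ["Naw", "Dawg", "he", "just", "mentally", "ill"]
bigrams = ngrams(tokens, 2)
print(bigrams)
```

In scikit-learn, `CountVectorizer` exposes an `ngram_range` parameter (for example `(1, 2)` for unigrams plus bigrams), but it only applies when `analyzer` is not a callable. With the identity analyzer used above, n-grams would need to be built beforehand, for instance with a helper like this one.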