The first preprocessing step for the data we are using in class is cleaning it for initial visualization. The comment bodies need escape sequences removed and emojis and other invalid characters stripped, and we need to fix any other issues in the data that could prevent a seamless exploration.

In [27]:
# Import packages and load the data

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

subreddits = ["kanye", "askScience", "PoliticalDiscussion", "socialism"]

# Comments are pulled from the top 40 posts of the past month in each subreddit.
dataframes = []
for sub in subreddits:
    df = pd.read_csv(f"data/comments_{sub}.csv")
    dataframes.append(df)

In [28]:
print(dataframes[0].columns)
for i in range(len(dataframes)):
    print(subreddits[i])
    print(dataframes[i].shape)

Index(['postTag', 'user', 'comment_score', 'comment_body', 'mod_deleted',
       'user_deleted', 'verified', 'is_gold', 'has_verified_email',
       'link_karma', 'total_karma', 'created_utc', 'comment_karma'],
      dtype='object')
kanye
(8821, 13)
askScience
(11134, 13)
PoliticalDiscussion
(31235, 13)
socialism
(3038, 13)


In [29]:
# Drop comments that were removed too quickly and were not archived
for i in range(len(dataframes)):
    dataframes[i] = dataframes[i].dropna(subset=["user"])
    print(subreddits[i])
    print(dataframes[i].shape)

kanye
(8384, 13)
askScience
(6679, 13)
PoliticalDiscussion
(28076, 13)
socialism
(2716, 13)


Both targets, mod_deleted and user_deleted, are very unevenly distributed, so we need to make sure the model doesn't overfit to the majority class.


In [32]:
for i in range(len(dataframes)):
    print(subreddits[i])
    print(dataframes[i].groupby(["mod_deleted"]).size())
    print(dataframes[i].groupby(["user_deleted"]).size())
    print()

kanye
mod_deleted
0    8243
1     141
dtype: int64
user_deleted
0    8147
1     237
dtype: int64

askScience
mod_deleted
0    5146
1    1533
dtype: int64
user_deleted
0    6570
1     109
dtype: int64

PoliticalDiscussion
mod_deleted
0    27239
1      837
dtype: int64
user_deleted
0    27411
1      665
dtype: int64

socialism
mod_deleted
0.0    2487
1.0     229
dtype: int64
user_deleted
0.0    2655
1.0      61
dtype: int64
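
One common way to counter this kind of imbalance is to weight each class inversely to its frequency. The sketch below is illustrative only: `balanced_class_weights` is a hypothetical helper, and it uses the kanye `mod_deleted` counts printed above. The formula is the same heuristic scikit-learn applies when an estimator is given `class_weight="balanced"`.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# mod_deleted counts for the kanye subreddit, copied from the output above
labels = [0] * 8243 + [1] * 141
weights = balanced_class_weights(labels)
print(weights)  # the rare class (1) receives a much larger weight than class 0
```

With these weights each class contributes equally to the total loss, so a model cannot score well simply by predicting "not deleted" for every comment.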



In [2]:
# Split each comment body into a list of words. We use the comment body to identify
# keywords, so the main goal of comment cleaning is separating the bodies into word lists.
# Escape sequences and other invalid characters still need to be stripped (likely with a regex).
# Assumption: kanyeData and scienceData are per-subreddit copies of the frames loaded above;
# their original definitions are not shown in this notebook.
kanyeData = dataframes[0].copy()
scienceData = dataframes[1].copy()
print(kanyeData.dtypes)  # most columns are numeric or object; comment_body is an object column of strings

kanyeData['comment_body'] = kanyeData['comment_body'].str.split()
scienceData['comment_body'].head()

postTag                object
user                   object
comment_score           int64
comment_body           object
mod_deleted             int64
user_deleted            int64
verified               object
is_gold                object
has_verified_email     object
link_karma            float64
total_karma           float64
created_utc           float64
comment_karma         float64
dtype: object


0    There are a lot of ways you can estimate the p...
1    Fisheries scientist cosigning. They may also t...
2    To the fisheries scientists - any thought that...
3    not a fishery scientist, but I recently transl...
4    Here in New Zealand, we've had marine heat wav...
Name: comment_body, dtype: object

In [None]:
# TODO: clean escape sequences, invalid words, deleted comments, and anything else that won't help our analysis (regex?)
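
A minimal sketch of what this cleaning step might look like, using only the standard library. The helper name `clean_comment` and every pattern below are assumptions made for illustration, not the notebook's actual cleaning rules.

```python
import html
import re

# Illustrative patterns (assumptions, not the notebook's final rules)
ESCAPES = re.compile(r"\\[nrt]|[\n\r\t]")  # literal "\n"-style text or real control characters
NON_ASCII = re.compile(r"[^\x00-\x7F]+")    # emojis and other non-ASCII symbols
NON_WORD = re.compile(r"[^a-z0-9'\s]")      # punctuation remaining after lowercasing
WHITESPACE = re.compile(r"\s+")

def clean_comment(body: str) -> list[str]:
    """Return a comment body as a list of lowercase word tokens."""
    text = html.unescape(body)               # "&amp;" becomes "&"
    text = ESCAPES.sub(" ", text)
    text = NON_ASCII.sub(" ", text)
    text = NON_WORD.sub(" ", text.lower())
    return WHITESPACE.sub(" ", text).strip().split()

print(clean_comment("Great album!\\n\\nYe &amp; Jay \U0001F525"))
```

Dropping every non-ASCII character also removes accented letters, so a gentler pattern may be preferable if non-English comments matter. Placeholder bodies such as "[deleted]" are not handled here and would need a separate filter.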

Our analysis will look at which variables in the data we collected are most useful for classifying whether a comment gets deleted and, if so, whether the user deleted it themselves or a moderator deleted it. Can we predict the outcome of a comment's status from certain keywords, a karma threshold, or any other features? Could this information be used to enhance the auto moderator currently used on Reddit?

The main variable we are studying is the comment body, since its content is the most critical to understanding which kinds of messages routinely get deleted. The data is therefore mostly free text with no predefined features, so we will use multiple techniques to create training data for model selection and training. Correlations involving fields such as account creation and comment karma will also be examined, but those fields require less cleaning.

In [40]:
# CountVectorizer: bag-of-words counts for each subreddit's comments.
# max_features=25 keeps only the 25 most frequent terms after stop-word removal.
# The NLTK stop-word list must be downloaded once: nltk.download("stopwords")
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

english_stopwords = stopwords.words('english')
comments_trans = []

for sub, df in zip(subreddits, dataframes):
    comments = df["comment_body"]
    vect = CountVectorizer(stop_words=english_stopwords, max_features=25)
    vect.fit(comments)
    print(sub)
    print(vect.get_feature_names_out())
    print()
    comments_trans.append(vect.transform(comments).toarray())


kanye
['album' 'black' 'even' 'get' 'good' 'jewish' 'jews' 'kanye' 'know' 'like'
 'make' 'man' 'music' 'one' 'people' 'really' 'said' 'say' 'saying' 'see'
 'shit' 'still' 'think' 'would' 'ye']

askScience
['also' 'body' 'cells' 'could' 'different' 'earth' 'even' 'get' 'know'
 'like' 'lot' 'many' 'much' 'one' 'people' 'pressure' 'really' 'still'
 'system' 'think' 'time' 'water' 'way' 'would' 'years']

PoliticalDiscussion
['also' 'democrats' 'election' 'even' 'get' 'going' 'know' 'like' 'make'
 'much' 'one' 'party' 'people' 'really' 'republican' 'republicans' 'right'
 'think' 'time' 'trump' 'us' 'vote' 'want' 'way' 'would']

socialism
['also' 'anti' 'even' 'get' 'good' 'including' 'know' 'like' 'make' 'one'
 'people' 'please' 'really' 'right' 'see' 'social' 'socialism' 'socialist'
 'socialists' 'think' 'time' 'us' 'well' 'world' 'would']
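
To make the vocabularies above less of a black box, here is a rough pure-Python sketch of what a bag-of-words step with `max_features` does. The helpers `top_terms` and `count_matrix` and the toy comments are hypothetical; scikit-learn's real tokenizer differs (for example, it drops single-character tokens and punctuation), so treat this as an approximation of the idea only.

```python
from collections import Counter

def top_terms(comments, stop_words, max_features):
    """Return the max_features most frequent non-stop-word tokens, sorted alphabetically."""
    counts = Counter(
        token
        for comment in comments
        for token in comment.lower().split()
        if token not in stop_words
    )
    return sorted(term for term, _ in counts.most_common(max_features))

def count_matrix(comments, vocab):
    """One row per comment, one column per vocabulary term."""
    rows = []
    for comment in comments:
        counts = Counter(comment.lower().split())
        rows.append([counts[term] for term in vocab])
    return rows

toy_comments = ["the album is good", "good music good album", "the album"]
vocab = top_terms(toy_comments, stop_words={"the", "is"}, max_features=2)
matrix = count_matrix(toy_comments, vocab)
print(vocab)
print(matrix)
```

Because "music" appears only once in the toy data, it falls outside the two retained features, which is the same reason rarer words are absent from the 25-term lists above.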



In [10]:
# TF-IDF rescaling: a statistical measure of how relevant a word is to a document.
# (A parameter grid or pipeline could simplify this process.)
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline
# comment_body was split into word lists earlier, so join each list back into one string.
kanye_docs = kanyeData['comment_body'].apply(lambda x: " ".join(x))
tfidf_pipe = make_pipeline(
    CountVectorizer(stop_words=stopwords.words('english'), max_features=10),
    TfidfTransformer(),
)
kanye_tfidf = tfidf_pipe.fit_transform(kanye_docs)
kanye_tfidf.toarray()

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.76848418],
       [0.        , 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])
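
The values in this array can be reproduced by hand. The sketch below assumes scikit-learn's default `TfidfTransformer` settings (smoothed IDF and L2 row normalization); `tfidf_rows` is a hypothetical helper and the count matrix is toy data, not the kanye comments.

```python
import math

def tfidf_rows(count_rows):
    """Apply smoothed IDF weighting, then L2-normalize each row."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: number of rows in which each term appears at least once
    doc_freq = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]
    result = []
    for row in count_rows:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result

counts = [[1, 1], [1, 2], [1, 0], [0, 0]]
rows = tfidf_rows(counts)
for row in rows:
    print(row)
```

This explains two patterns in the output above: a row of all zeros is a comment containing none of the 10 retained terms, and an entry of exactly 1 appears when a comment contains only one retained term, since L2 normalization scales that single weight to unit length.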

In [12]:
# N-grams: grid search over character n-gram ranges and minimum document frequency
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

param_grid = {
    "countvectorizer__ngram_range": [(1, 2), (2, 5)],
    "countvectorizer__min_df": [2, 3],
}

grid = GridSearchCV(
    make_pipeline(CountVectorizer(analyzer="char"), LogisticRegression()),
    param_grid=param_grid,
    cv=10,
    scoring="f1_macro",
    return_train_score=True,
)
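
Since the vectorizer in this grid uses `analyzer="char"`, the `ngram_range` values refer to character n-grams rather than word n-grams. The following is a simplified sketch of what that means; `char_ngrams` is a hypothetical helper, and scikit-learn additionally lowercases text and normalizes whitespace before extracting n-grams.

```python
def char_ngrams(text, ngram_range):
    """Return every character n-gram of text for each n in the inclusive range."""
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

print(char_ngrams("ye", (1, 2)))
print(len(char_ngrams("kanye", (2, 5))))
```

A range of (2, 5) therefore produces many more features per comment than (1, 2), which is worth keeping in mind when running a 10-fold search on the larger subreddits.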