The first preprocessing step for the data we are using in class is to clean it for initial visualization. The comment bodies need escape sequences removed and emojis and other invalid characters stripped, and we need to fix any other issues in the data that could get in the way of exploration.
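As a rough illustration of the cleaning described above, here is a minimal sketch that uses only the standard library. The helper name `clean_comment` and the exact rules (which escape sequences count, dropping everything non-ASCII) are assumptions made for this example, not the cleaning applied later in the notebook.

```python
import re
import unicodedata

# Hypothetical helper: the name and the rules below are assumptions for illustration.
ESCAPES = re.compile(r"\\[nrt]|[\n\r\t]")  # literal "\n", "\r", "\t" text and the real control characters
WHITESPACE = re.compile(r"\s+")

def clean_comment(text):
    """Return `text` without escape sequences, emojis, or other non-ASCII symbols."""
    text = ESCAPES.sub(" ", text)
    # Decompose accented letters (e.g. "é" becomes "e" plus a combining accent),
    # then drop every character that is not plain ASCII, which removes emojis too.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Collapse the runs of spaces left behind by the removals.
    return WHITESPACE.sub(" ", text).strip()

print(clean_comment("Great post!\\n\\nLove it \U0001F600 caf\u00e9"))
```

Dropping all non-ASCII characters is a blunt choice: it is fine for a first look at English comments, but it would also discard legitimate non-English text.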

In [3]:
#package and data importing and loading

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re

subreddits = ["kanye", "askScience", "PoliticalDiscussion", "socialism"]

#comments are pulled from the top 40 posts from the past month in each subreddit.
dataframes = []
for sub in subreddits:
    df = pd.read_csv(f"data/comments_{sub}.csv")
    dataframes.append(df)

In [4]:
print(dataframes[0].columns)
for i in range(len(dataframes)):
    print(subreddits[i])
    print(dataframes[i].shape)

Index(['postTag', 'user', 'comment_score', 'comment_body', 'mod_deleted',
       'user_deleted', 'verified', 'is_gold', 'has_verified_email',
       'link_karma', 'total_karma', 'created_utc', 'comment_karma'],
      dtype='object')
kanye
(42654, 13)
askScience
(11134, 13)
PoliticalDiscussion
(31235, 13)
socialism
(3038, 13)


Drop comments that were removed before they could be archived, along with comments generated automatically by moderator accounts.

In [5]:
urlRegex = r"(https? *:*\/*\/*)(\s)*(www\.)?(\s)*((\w|\s)+\.)*([\w\-\s]+\/)*([\w\-]+)((\?)?[\w\s]*=\s*[\w\%&]*)*"
mods = ["https://www.reddit.com/user/AutoModerator", "https://www.reddit.com/user/socialism-ModTeam"]
for i in range(len(dataframes)):
    # Drop rows with no user, then rows posted by moderator accounts
    df = dataframes[i].dropna(subset=["user"])
    # .copy() so the column assignments below modify a real frame, not a view of a slice
    df = df[~df["user"].isin(mods)].copy()
    # Encode the boolean flags as 0/1
    for col in ['verified', 'is_gold', 'has_verified_email']:
        df[col] = df[col].apply(lambda x: 1 if x else 0)
    # Replace URLs in the comment text with a single space
    df['comment_body'] = df['comment_body'].apply(lambda x: re.sub(urlRegex, ' ', x))
    dataframes[i] = df
    print(subreddits[i])
    print(dataframes[i].shape)

kanye
(40436, 13)
askScience
(6679, 13)
PoliticalDiscussion
(28026, 13)
socialism
(2620, 13)


The target classes are very imbalanced, so we need to make sure our models do not simply learn to predict the majority class.


In [6]:
for i in range(len(dataframes)):
    print(subreddits[i])
    print(dataframes[i].groupby(["mod_deleted"]).size())
    print(dataframes[i].groupby(["user_deleted"]).size())
    print()

kanye
mod_deleted
0    39584
1      852
dtype: int64
user_deleted
0    38187
1     2249
dtype: int64

askScience
mod_deleted
0    5146
1    1533
dtype: int64
user_deleted
0    6570
1     109
dtype: int64

PoliticalDiscussion
mod_deleted
0    27189
1      837
dtype: int64
user_deleted
0    27361
1      665
dtype: int64

socialism
mod_deleted
0    2392
1     228
dtype: int64
user_deleted
0    2559
1      61
dtype: int64



In [7]:
#cleaning escape sequences, invalid words, deleted comments, and other things that won't serve to help our analysis. regex?

Our analysis looks at which of the variables we collected are most useful for classifying whether a comment gets deleted and, if so, whether the user or a moderator deleted it. Can we predict a comment's status from certain keywords, a karma threshold, or other features? Could this information be used to improve the AutoModerator currently used on Reddit?

The main feature we are studying is the comment body, since its content is the most direct signal of which messages routinely get deleted. The data is therefore mostly free text with no predefined features, so we will use several techniques to turn it into training data for model selection and training. We will also look at correlations with account creation and comment karma, but those columns require less cleaning.

In [8]:
#CountVectorizer
#We will use CountVectorizer to turn the comment text into token-count features.
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords  # needs a one-time nltk.download("stopwords")
from nltk.stem.snowball import SnowballStemmer

comments_trans = []

# Stemming is applied before vectorization but may not be necessary; max_df has a bigger impact
stemmer = SnowballStemmer("english")
def tokenizer_stem(text):
    # Whitespace split only, so punctuation stays attached to tokens at this stage
    return [stemmer.stem(word) for word in text.split()]

for i in range(len(dataframes)):
    comments = dataframes[i]["comment_body"].apply(lambda x: " ".join(tokenizer_stem(x)))
    # Top 50 unigrams/bigrams. The stop word list is unstemmed, so stems like 'becaus' slip through.
    vect = CountVectorizer(stop_words=stopwords.words('english'), max_df=.5, ngram_range=(1,2), max_features=50, strip_accents="unicode")
    vect.fit(comments)
    print(subreddits[i])
    print(vect.get_feature_names_out())
    print()
    comments_trans.append(vect.transform(comments).toarray())
    # Top 25 bigrams only, printed for inspection (not stored)
    vect = CountVectorizer(stop_words=stopwords.words('english'), max_df=.5, ngram_range=(2,2), max_features=25, strip_accents="unicode")
    vect.fit(comments)
    print(vect.get_feature_names_out())
    print()


kanye
['actual' 'also' 'ani' 'anti' 'becaus' 'black' 'call' 'even' 'fuck' 'get'
 'go' 'good' 'hate' 'jew' 'jewish' 'kany' 'know' 'like' 'lol' 'make' 'man'
 'mean' 'music' 'need' 'one' 'onli' 'peopl' 'person' 'point' 'realli'
 'right' 'said' 'say' 'see' 'shit' 'someon' 'still' 'take' 'talk' 'thing'
 'think' 'time' 'tri' 'use' 'want' 'way' 'whi' 'white' 'would' 'ye']

['act like' 'anti semit' 'black peopl' 'black people' 'death con'
 'eric andr' 'feel like' 'georg floyd' 'get help' 'jewish peopl'
 'jewish people' 'kany said' 'kany west' 'like kany' 'look like'
 'mental health' 'mental ill' 'need help' 'peopl like' 'piec shit'
 'seem like' 'social media' 'sound like' 'white peopl' 'year old']

askScience
['actual' 'also' 'ani' 'becaus' 'bodi' 'caus' 'cell' 'could' 'differ'
 'doe' 'earth' 'effect' 'enough' 'even' 'get' 'go' 'human' 'know' 'like'
 'long' 'look' 'lot' 'make' 'mani' 'mean' 'much' 'need' 'one' 'onli'
 'peopl' 'pressur' 'realli' 'say' 'see' 'someth' 'still' 'system' 'take'
 'th

In [9]:
comments_trans[0]

array([[0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [10]:
#TF-IDF rescaling. -> [Using a param grid or pipeline could simplify this process.]
#-> TF-IDF is a statistical measure of how relevant a word is to a document.
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline
# dataframes[0] holds the kanye comments; comment_body is already a string column
kanye_tfidf = make_pipeline(CountVectorizer(stop_words=stopwords.words('english'), max_features=10), TfidfTransformer()).fit_transform(dataframes[0]['comment_body'])
kanye_tfidf.toarray()
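To make the TF-IDF weighting concrete, here is a small standard-library sketch on a made-up three-document corpus. It follows the smoothed formula scikit-learn documents as the `TfidfTransformer` default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The toy corpus and the helper name `tfidf_row` are assumptions for illustration, and this is not the library's actual implementation.

```python
import math
from collections import Counter

# Made-up toy corpus
docs = ["the cat sat", "the dog sat", "the dog barked"]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(t for doc in tokenized for t in set(doc))
# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Return the L2-normalized tf-idf vector of one document over `vocab`."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # guard against empty documents
    return [v / norm for v in raw]

for doc, tokens in zip(docs, tokenized):
    row = tfidf_row(tokens)
    print(doc, {t: round(v, 3) for t, v in zip(vocab, row) if v})
```

A word that appears in every document ("the") gets the minimum idf of 1, while a word unique to one document ("cat") gets the largest weight, which is the sense in which TF-IDF measures relevance.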

In [None]:
#N-grams
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

param_grid = {"countvectorizer__ngram_range":[(1, 2), (2, 5)],
              "countvectorizer__min_df": [2, 3]
             }

grid = GridSearchCV(
    make_pipeline(CountVectorizer(analyzer="char"), LogisticRegression()),
    param_grid=param_grid, cv=10, scoring="f1_macro", return_train_score=True)
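With `analyzer="char"`, the n-grams in the grid above are built from characters rather than words. The sketch below shows roughly what that means for a tiny string. The helper `char_ngrams` is hypothetical and only approximates the vectorizer's behavior (lowercasing and whitespace normalization are assumed here to mirror its defaults).

```python
def char_ngrams(text, min_n, max_n):
    """List every character n-gram of `text` for n in [min_n, max_n]."""
    text = " ".join(text.lower().split())  # lowercase and collapse whitespace
    return [text[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(text) - n + 1)]

print(char_ngrams("Ye ok", 1, 2))
```

Character n-grams can pick up misspellings and obfuscated words that word-level tokens miss, at the cost of a much larger vocabulary, which is one reason `min_df` is part of the grid.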

In [14]:
from sklearn.model_selection import GroupShuffleSplit
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
x = ['postTag', 'comment_body', 'comment_score', 'verified', 'is_gold', 'has_verified_email',
       'link_karma', 'total_karma', 'created_utc', 'comment_karma']
y = "mod_deleted"
x_cont = x[2:]
x_text = 'comment_body'

In [24]:
#y is our target, i.e. what we are trying to predict. A comment is either deleted by a mod or deleted by the user (if deleted at all); here y is mod_deleted.
# Note: every iteration overwrites X_train, preprocessor, etc., so later cells only see the last subreddit.
for df in dataframes:
    tfidf = TfidfVectorizer(stop_words=stopwords.words('english'),max_df=.5, ngram_range=(1,2) ,max_features=50, strip_accents="unicode")
    # Use a grouped split so that comments from the same post are not split between training & test set
    X = df[x]
    Y = df[y]
    gs = GroupShuffleSplit(n_splits=2, test_size=.3, random_state=0)
    train_ind, test_ind = next(gs.split(X, Y, groups=X.postTag))
    X_train = X.iloc[train_ind].drop("postTag", axis=1)
    y_train = Y.iloc[train_ind]
    X_test = X.iloc[test_ind].drop("postTag", axis=1)
    y_test = Y.iloc[test_ind]
    
    # Preprocess continuous columns and comment body.
    # We may want to grid search over the tfidf params instead of using the params above.
    cont_pipe = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('scaler', StandardScaler())
    ])
    
    text_pipe = Pipeline([
        ('vect', tfidf)
    ])
    
    preprocessor = ColumnTransformer([
        ('cont', cont_pipe, x_cont),
        ('text', text_pipe, x_text)
    ])
    
    preprocessor.fit(X_train)
    columns = preprocessor.named_transformers_["text"][0].get_feature_names_out()
    columns = list(x_cont) + list(columns)
    X_train_trans = pd.DataFrame(preprocessor.transform(X_train).toarray(), columns=columns)
    print(X_train_trans.head()) # Transformed training data
    
    # Create model pipeline & param_grids

   comment_score  verified  is_gold  has_verified_email  link_karma  \
0      28.597130       0.0 -0.32277            0.455035   -0.084809   
1       7.747394       0.0 -0.32277            0.455035   -0.071857   
2       4.718825       0.0 -0.32277            0.455035   -0.128258   
3       2.339236       0.0 -0.32277            0.455035    0.022666   
4       0.917663       0.0 -0.32277            0.455035    0.022504   

   total_karma  created_utc  comment_karma  actually      also  ...     still  \
0    -0.078767    -0.402122      -0.058956       0.0  0.000000  ...  0.000000   
1     0.173269    -2.105970       0.260526       0.0  0.000000  ...  1.000000   
2    -0.236544     0.515729      -0.246319       0.0  0.000000  ...  0.000000   
3     1.475998     0.526639       1.973054       0.0  0.000000  ...  0.000000   
4     0.102184    -1.309254       0.126044       0.0  0.338941  ...  0.667799   

   take  things  think  time  want  way  white  would       ye  
0   0.0     0.0    0.
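The idea behind the grouped split can be sketched without scikit-learn: shuffle the distinct post tags, reserve a fraction of them for the test set, and send every comment to the side its post landed on. The function `group_shuffle_split` and the toy tags below are assumptions made for this illustration, not the `GroupShuffleSplit` implementation.

```python
import random

def group_shuffle_split(groups, test_size=0.3, seed=0):
    """Return (train_idx, test_idx) so that no group appears on both sides."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    # test_size is a fraction of the groups, not of the rows
    n_test = max(1, round(test_size * len(unique)))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx

# Made-up post tags: ten comments spread over four posts
post_tags = ["p1", "p1", "p1", "p2", "p2", "p3", "p3", "p3", "p4", "p4"]
train_idx, test_idx = group_shuffle_split(post_tags)
train_posts = {post_tags[i] for i in train_idx}
test_posts = {post_tags[i] for i in test_idx}
print("train posts:", sorted(train_posts))
print("test posts:", sorted(test_posts))
```

Because the fraction applies to posts rather than comments, the share of comments in the test set can differ noticeably from 30% when posts vary a lot in size.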

In [26]:
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.linear_model import Ridge

# preprocessor, X_train_trans, y_train, X_test and y_test are whatever the last iteration of the loop above left behind
X_test_trans = pd.DataFrame(preprocessor.transform(X_test).toarray(), columns=columns)

param_grid = {'alpha': [0, 0.5, 0.25, 0.1, 0.01, 0.001, 0.0001, 1.0, 1.25, 1.5, 5, 10],  
              "solver": ['svd', 'cholesky', 'lsqr', 'sag', 'saga']}
    
ridgeRegression = GridSearchCV(Ridge(), param_grid=param_grid, n_jobs=-1)
ridgeRegression.fit(X_train_trans, y_train)
ridgeRegression.best_params_

# Ridge is a regressor, so its default score is R^2, not classification accuracy
r2_scores = cross_val_score(estimator=ridgeRegression, X=X_train_trans, y=y_train, cv=5)
print(r2_scores.mean())

# Also R^2, this time on the held-out test set
ridgeRegression.score(X_test_trans, y_test)

-0.11181486671557539


-0.04023498412452442
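Both numbers above are R^2 scores, because `Ridge` is a regressor, and a negative R^2 means the model does worse than always predicting the mean of the target. The sketch below works this out by hand on a made-up imbalanced 0/1 target. The helper `r2_score` here is a local function written for this example, not an import from scikit-learn.

```python
def r2_score(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up imbalanced target: 2 deleted comments out of 10
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

mean_pred = [0.2] * 10  # always predict the class share
off_pred = [0.5] * 10   # a constant that is further from the truth

print(r2_score(y_true, mean_pred))  # baseline: scores zero
print(r2_score(y_true, off_pred))   # worse than the baseline: negative
```

Since the target is a binary label, a classifier scored with a metric such as `f1_macro` (already used in the grid search earlier) would probably be a more informative next step than R^2.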