### Data Cleaning and Pre-Processing

##### Load sample dataset

In [1]:
import pandas as pd

sample_df = pd.read_csv('sample dataset.csv')
sample_df

Unnamed: 0,URL ID,Comment,Date
0,fablco,"Here's why he used the word ""parasitic"":\n\n> ...",2020-02
1,fablco,ELI5 anyone?,2020-02
2,fablco,"Tiktok CEO: I know you are, but what am I?",2020-02
3,fablco,Coming from a company that is constantly tryin...,2020-02
4,fablco,"Adolescence, puberty, peer pressure, all that ...",2020-02
...,...,...,...
62950,17b9d3y,"Ofc they dont, they have spies not tied to the...",2023-10
62951,17b9d3y,Found the CCP bot,2023-10
62952,17b9d3y,Christ Reddit is such an echo chamber. Is anyo...,2023-10
62953,17b9d3y,Wtf does the comment even have to do with any ...,2023-10


##### Data cleaning

In [2]:
import re

def data_cleaning(dataset, comment_col = 'Comment'):
    
    # Drop removed/deleted comments and work on an explicit copy, so the caller's
    # DataFrame is left untouched and pandas does not emit SettingWithCopyWarning
    dataset = dataset[~dataset[comment_col].isin(['[removed]', '[deleted]'])].copy()
    dataset['Cleaned comment'] = dataset[comment_col]

    def remove_urls(text):
        url_pattern = re.compile(r'https?://\S+|www\.\S+')
        return url_pattern.sub(r'', text)

    def remove_usernames(text):
        user_mention_pattern = re.compile(r'\bu/\S+')
        return user_mention_pattern.sub(r'', text)

    def replace_tiktok(text):
        # Normalize "tik tok" (any case, any whitespace) to the single token "tiktok"
        return re.sub(r'tik\s+tok', 'tiktok', text, flags = re.IGNORECASE)

    dataset['Cleaned comment'] = dataset['Cleaned comment'].apply(remove_urls)
    dataset['Cleaned comment'] = dataset['Cleaned comment'].apply(remove_usernames)
    dataset['Cleaned comment'] = dataset['Cleaned comment'].apply(replace_tiktok)

    return dataset

cleaned_df = data_cleaning(sample_df)
cleaned_df

Unnamed: 0,URL ID,Comment,Date,Cleaned comment
0,fablco,"Here's why he used the word ""parasitic"":\n\n> ...",2020-02,"Here's why he used the word ""parasitic"":\n\n> ..."
1,fablco,ELI5 anyone?,2020-02,ELI5 anyone?
2,fablco,"Tiktok CEO: I know you are, but what am I?",2020-02,"Tiktok CEO: I know you are, but what am I?"
3,fablco,Coming from a company that is constantly tryin...,2020-02,Coming from a company that is constantly tryin...
4,fablco,"Adolescence, puberty, peer pressure, all that ...",2020-02,"Adolescence, puberty, peer pressure, all that ..."
...,...,...,...,...
62950,17b9d3y,"Ofc they dont, they have spies not tied to the...",2023-10,"Ofc they dont, they have spies not tied to the..."
62951,17b9d3y,Found the CCP bot,2023-10,Found the CCP bot
62952,17b9d3y,Christ Reddit is such an echo chamber. Is anyo...,2023-10,Christ Reddit is such an echo chamber. Is anyo...
62953,17b9d3y,Wtf does the comment even have to do with any ...,2023-10,Wtf does the comment even have to do with any ...
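
To see what the three cleaning helpers do to a single comment, here is a minimal standalone sketch that reuses the same regular expressions. The sample string and the helper name `clean_comment` are made up for illustration and are not part of the notebook.

```python
import re

# Same patterns as in data_cleaning, compiled once at module level.
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
USER_MENTION_PATTERN = re.compile(r'\bu/\S+')
TIKTOK_PATTERN = re.compile(r'tik\s+tok', flags=re.IGNORECASE)

def clean_comment(text):
    """Hypothetical helper: apply the three cleaning steps to one string."""
    text = URL_PATTERN.sub('', text)            # remove URLs
    text = USER_MENTION_PATTERN.sub('', text)   # remove u/username mentions
    text = TIKTOK_PATTERN.sub('tiktok', text)   # "Tik Tok" -> "tiktok"
    return text

example = "u/someone said Tik Tok is fine, see https://example.com/post"
print(repr(clean_comment(example)))
```

The removals leave the surrounding whitespace in place (here a leading and a trailing space). That is harmless for the later steps, because tokenization ignores extra whitespace.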


##### Data pre-processing

In [3]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
import contractions

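# word_tokenize also relies on the 'punkt' tokenizer data ('punkt_tab' in recent NLTK releases);
# download it with nltk.download(...) first if it is not already installed
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')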

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

custom_stop_words = set(["0o", "0s", "3a", "3b", "3d", "6b", "6o", "a", "a1", "a2", "a3", "a4", "ab", "able", "about", "above", "abst", "ac", "accordance", "according", "accordingly", "across", "act", "actually", "ad", "added", "adj", "ae", "af", "affected", "affecting", "affects", "after", "afterwards", "ag", "again", "against", "ah", "ain", "ain't", "aj", "al", "all", "allow", "allows", "almost", "alone", "along", "already", "also", "although", "always", "am", "among", "amongst", "amoungst", "amount", "an", "and", "announce", "another", "any", "anybody", "anyhow", "anymore", "anyone", "anything", "anyway", "anyways", "anywhere", "ao", "ap", "apart", "apparently", "appear", "appreciate", "appropriate", "approximately", "ar", "are", "aren", "arent", "aren't", "arise", "around", "as", "a's", "aside", "ask", "asking", "associated", "at", "au", "auth", "av", "available", "aw", "away", "awfully", "ax", "ay", "az", "b", "b1", "b2", "b3", "ba", "back", "bc", "bd", "be", "became", "because", "become", "becomes", "becoming", "been", "before", "beforehand", "begin", "beginning", "beginnings", "begins", "behind", "being", "believe", "below", "beside", "besides", "best", "better", "between", "beyond", "bi", "bill", "biol", "bj", "bk", "bl", "bn", "both", "bottom", "bp", "br", "brief", "briefly", "bs", "bt", "bu", "but", "bx", "by", "c", "c1", "c2", "c3", "ca", "call", "came", "can", "cannot", "cant", "can't", "cause", "causes", "cc", "cd", "ce", "certain", "certainly", "cf", "cg", "ch", "changes", "ci", "cit", "cj", "cl", "clearly", "cm", "c'mon", "cn", "co", "com", "come", "comes", "con", "concerning", "consequently", "consider", "considering", "contain", "containing", "contains", "corresponding", "could", "couldn", "couldnt", "couldn't", "course", "cp", "cq", "cr", "cry", "cs", "c's", "ct", "cu", "currently", "cv", "cx", "cy", "cz", "d", "d2", "da", "date", "dc", "dd", "de", "definitely", "describe", "described", "despite", "detail", "df", "di", "did", "didn", "didn't", 
"different", "dj", "dk", "dl", "do", "does", "doesn", "doesn't", "doing", "don", "done", "don't", "down", "downwards", "dp", "dr", "ds", "dt", "du", "due", "during", "dx", "dy", "e", "e2", "e3", "ea", "each", "ec", "ed", "edu", "ee", "ef", "effect", "eg", "ei", "eight", "eighty", "either", "ej", "el", "eleven", "else", "elsewhere", "em", "empty", "en", "end", "ending", "enough", "entirely", "eo", "ep", "eq", "er", "es", "especially", "est", "et", "et-al", "etc", "eu", "ev", "even", "ever", "every", "everybody", "everyone", "everything", "everywhere", "ex", "exactly", "example", "except", "ey", "f", "f2", "fa", "far", "fc", "few", "ff", "fi", "fifteen", "fifth", "fify", "fill", "find", "fire", "first", "five", "fix", "fj", "fl", "fn", "fo", "followed", "following", "follows", "for", "former", "formerly", "forth", "forty", "found", "four", "fr", "from", "front", "fs", "ft", "fu", "full", "further", "furthermore", "fy", "g", "ga", "gave", "ge", "get", "gets", "getting", "gi", "give", "given", "gives", "giving", "gj", "gl", "go", "goes", "going", "gone", "got", "gotten", "gr", "greetings", "gs", "gy", "h", "h2", "h3", "had", "hadn", "hadn't", "happens", "hardly", "has", "hasn", "hasnt", "hasn't", "have", "haven", "haven't", "having", "he", "hed", "he'd", "he'll", "hello", "help", "hence", "her", "here", "hereafter", "hereby", "herein", "heres", "here's", "hereupon", "hers", "herself", "hes", "he's", "hh", "hi", "hid", "him", "himself", "his", "hither", "hj", "ho", "home", "hopefully", "how", "howbeit", "however", "how's", "hr", "hs", "http", "hu", "hundred", "hy", "i", "i2", "i3", "i4", "i6", "i7", "i8", "ia", "ib", "ibid", "ic", "id", "i'd", "ie", "if", "ig", "ignored", "ih", "ii", "ij", "il", "i'll", "im", "i'm", "immediate", "immediately", "importance", "important", "in", "inasmuch", "inc", "indeed", "index", "indicate", "indicated", "indicates", "information", "inner", "insofar", "instead", "interest", "into", "invention", "inward", "io", "ip", "iq", "ir", "is", 
"isn", "isn't", "it", "itd", "it'd", "it'll", "its", "it's", "itself", "iv", "i've", "ix", "iy", "iz", "j", "jj", "jr", "js", "jt", "ju", "just", "k", "ke", "keep", "keeps", "kept", "kg", "kj", "km", "know", "known", "knows", "ko", "l", "l2", "la", "largely", "last", "lately", "later", "latter", "latterly", "lb", "lc", "le", "least", "les", "less", "lest", "let", "lets", "let's", "lf", "like", "liked", "likely", "line", "little", "lj", "ll", "ll", "ln", "lo", "look", "looking", "looks", "los", "lr", "ls", "lt", "ltd", "m", "m2", "ma", "made", "mainly", "make", "makes", "many", "may", "maybe", "me", "mean", "means", "meantime", "meanwhile", "merely", "mg", "might", "mightn", "mightn't", "mill", "million", "mine", "miss", "ml", "mn", "mo", "more", "moreover", "most", "mostly", "move", "mr", "mrs", "ms", "mt", "mu", "much", "mug", "must", "mustn", "mustn't", "my", "myself", "n", "n2", "na", "name", "namely", "nay", "nc", "nd", "ne", "near", "nearly", "necessarily", "necessary", "need", "needn", "needn't", "needs", "neither", "never", "nevertheless", "new", "next", "ng", "ni", "nine", "ninety", "nj", "nl", "nn", "no", "nobody", "non", "none", "nonetheless", "noone", "nor", "normally", "nos", "not", "noted", "nothing", "novel", "now", "nowhere", "nr", "ns", "nt", "ny", "o", "oa", "ob", "obtain", "obtained", "obviously", "oc", "od", "of", "off", "often", "og", "oh", "oi", "oj", "ok", "okay", "ol", "old", "om", "omitted", "on", "once", "one", "ones", "only", "onto", "oo", "op", "oq", "or", "ord", "os", "ot", "other", "others", "otherwise", "ou", "ought", "our", "ours", "ourselves", "out", "outside", "over", "overall", "ow", "owing", "own", "ox", "oz", "p", "p1", "p2", "p3", "page", "pagecount", "pages", "par", "part", "particular", "particularly", "pas", "past", "pc", "pd", "pe", "per", "perhaps", "pf", "ph", "pi", "pj", "pk", "pl", "placed", "please", "plus", "pm", "pn", "po", "poorly", "possible", "possibly", "potentially", "pp", "pq", "pr", "predominantly", "present", 
"presumably", "previously", "primarily", "probably", "promptly", "proud", "provides", "ps", "pt", "pu", "put", "py", "q", "qj", "qu", "que", "quickly", "quite", "qv", "r", "r2", "ra", "ran", "rather", "rc", "rd", "re", "readily", "really", "reasonably", "recent", "recently", "ref", "refs", "regarding", "regardless", "regards", "related", "relatively", "research", "research-articl", "respectively", "resulted", "resulting", "results", "rf", "rh", "ri", "right", "rj", "rl", "rm", "rn", "ro", "rq", "rr", "rs", "rt", "ru", "run", "rv", "ry", "s", "s2", "sa", "said", "same", "saw", "say", "saying", "says", "sc", "sd", "se", "sec", "second", "secondly", "section", "see", "seeing", "seem", "seemed", "seeming", "seems", "seen", "self", "selves", "sensible", "sent", "serious", "seriously", "seven", "several", "sf", "shall", "shan", "shan't", "she", "shed", "she'd", "she'll", "shes", "she's", "should", "shouldn", "shouldn't", "should've", "show", "showed", "shown", "showns", "shows", "si", "side", "significant", "significantly", "similar", "similarly", "since", "sincere", "six", "sixty", "sj", "sl", "slightly", "sm", "sn", "so", "some", "somebody", "somehow", "someone", "somethan", "something", "sometime", "sometimes", "somewhat", "somewhere", "soon", "sorry", "sp", "specifically", "specified", "specify", "specifying", "sq", "sr", "ss", "st", "still", "stop", "strongly", "sub", "substantially", "successfully", "such", "sufficiently", "suggest", "sup", "sure", "sy", "system", "sz", "t", "t1", "t2", "t3", "take", "taken", "taking", "tb", "tc", "td", "te", "tell", "ten", "tends", "tf", "th", "than", "thank", "thanks", "thanx", "that", "that'll", "thats", "that's", "that've", "the", "their", "theirs", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "thered", "therefore", "therein", "there'll", "thereof", "therere", "theres", "there's", "thereto", "thereupon", "there've", "these", "they", "theyd", "they'd", "they'll", "theyre", "they're", "they've", 
"thickv", "thin", "thing", "think", "third", "this", "thorough", "thoroughly", "those", "thou", "though", "thoughh", "thousand", "three", "throug", "through", "throughout", "thru", "thus", "ti", "til", "tip", "tj", "tl", "tm", "tn", "to", "together", "too", "took", "top", "toward", "towards", "tp", "tq", "tr", "tried", "tries", "truly", "try", "trying", "ts", "t's", "tt", "tv", "twelve", "twenty", "twice", "two", "tx", "u", "u201d", "ue", "ui", "uj", "uk", "um", "un", "under", "unfortunately", "unless", "unlike", "unlikely", "until", "unto", "uo", "up", "upon", "ups", "ur", "us", "use", "used", "useful", "usefully", "usefulness", "uses", "using", "usually", "ut", "v", "va", "value", "various", "vd", "ve", "ve", "very", "via", "viz", "vj", "vo", "vol", "vols", "volumtype", "vq", "vs", "vt", "vu", "w", "wa", "want", "wants", "was", "wasn", "wasnt", "wasn't", "way", "we", "wed", "we'd", "welcome", "well", "we'll", "well-b", "went", "were", "we're", "weren", "werent", "weren't", "we've", "what", "whatever", "what'll", "whats", "what's", "when", "whence", "whenever", "when's", "where", "whereafter", "whereas", "whereby", "wherein", "wheres", "where's", "whereupon", "wherever", "whether", "which", "while", "whim", "whither", "who", "whod", "whoever", "whole", "who'll", "whom", "whomever", "whos", "who's", "whose", "why", "why's", "wi", "widely", "will", "willing", "wish", "with", "within", "without", "wo", "won", "wonder", "wont", "won't", "words", "world", "would", "wouldn", "wouldnt", "wouldn't", "www", "x", "x1", "x2", "x3", "xf", "xi", "xj", "xk", "xl", "xn", "xo", "xs", "xt", "xv", "xx", "y", "y2", "yes", "yeah", "yet", "yj", "yl", "you", "youd", "you'd", "you'll", "your", "youre", "you're", "yours", "yourself", "yourselves", "you've", "yr", "ys", "yt", "z", "zero", "zi", "zz"])

def data_preprocessing(text):
    text = str(text)
    # Expand contractions (e.g. "don't" -> "do not") before tokenizing
    text = contractions.fix(text)
    tokens = word_tokenize(text)
    tokens = [token.lower() for token in tokens]
    # Strip punctuation and symbols; pure-punctuation tokens become empty strings
    tokens = [re.sub(r'[^a-zA-Z0-9\s]', '', token) for token in tokens]
    tokens = [word for word in tokens if word not in custom_stop_words]

    # Lemmatize each remaining token using its part-of-speech tag
    lemmatizer = WordNetLemmatizer()
    tagged = nltk.pos_tag(tokens)
    tokens = [lemmatizer.lemmatize(word, get_wordnet_pos(pos)) for word, pos in tagged]

    cleaned_text = ' '.join(tokens)
    return cleaned_text

# Work on an explicit copy so that cleaned_df is not modified in place
cleaned_preprocessed_df = cleaned_df.copy()
cleaned_preprocessed_df['Cleaned and pre-processed comment'] = cleaned_preprocessed_df['Cleaned comment'].apply(data_preprocessing)
# data_preprocessing always returns a string, so dropna() only guards against missing values;
# comments reduced to an empty string stay in the DataFrame
cleaned_preprocessed_df.dropna(subset = ['Cleaned and pre-processed comment'], inplace = True)
cleaned_preprocessed_df.sample(10)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/kurtvkazan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/kurtvkazan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!

Unnamed: 0,URL ID,Comment,Date,Cleaned comment,Cleaned and pre-processed comment
19840,iv4fob,Yeah a corporation trying to force another one...,2020-09,Yeah a corporation trying to force another one...,corporation force influence government order a...
41951,yw88rf,Tripadvisor is not an unknown website. It's bu...,2022-11,Tripadvisor is not an unknown website. It's bu...,tripadvisor unknown website busy ad link cras...
33004,vnd558,This never would have happened if Vine didn't ...,2022-06,This never would have happened if Vine didn't ...,happen vine die
3378,hi1mr0,What about chicken leg piece?,2020-06,What about chicken leg piece?,chicken leg piece
27980,nrxhw2,"So your associated data like financials, medic...",2021-06,"So your associated data like financials, medic...",data financials medical private chat social...
23318,iv4fob,Either they can be paywalled or not exist. Whi...,2020-09,Either they can be paywalled or not exist. Whi...,paywalled exist prefer
52329,123nfrx,Does TikTok pass muster with GDPR? B2b tech wi...,2023-03,Does TikTok pass muster with GDPR? B2b tech wi...,tiktok pas muster gdpr b2b tech gdpr depend ...
54734,123nfrx,Who was held accountable?,2023-03,Who was held accountable?,hold accountable
18217,i1kqsw,Nothing in that post describes anything that i...,2020-08,Nothing in that post describes anything that i...,post describes case popular social medium ap...
54044,123nfrx,"Yeah, and intentionally exacerbated by large s...",2023-03,"Yeah, and intentionally exacerbated by large s...",intentionally exacerbate large sector economy...


##### Relative pruning (data pre-processing)

In [4]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(cleaned_preprocessed_df['Cleaned and pre-processed comment'])

document_frequencies = np.sum(X > 0, axis=0).A1
num_documents = X.shape[0]

high_freq_threshold = num_documents * 0.99
low_freq_threshold = num_documents * 0.005

terms = vectorizer.get_feature_names_out()
terms_to_keep = [term for term, df in zip(terms, document_frequencies) if low_freq_threshold <= df <= high_freq_threshold]

pruned_vectorizer = CountVectorizer(vocabulary = terms_to_keep)
X_pruned = pruned_vectorizer.fit_transform(cleaned_preprocessed_df['Cleaned and pre-processed comment'])

def relative_pruning(comment, pruned_terms):
    # Keep only the words that survived the document-frequency thresholds
    words = comment.split()
    pruned_comment = ' '.join([word for word in words if word in pruned_terms])
    
    return pruned_comment

# Build the lookup set once instead of once per row
terms_to_keep_set = set(terms_to_keep)
cleaned_preprocessed_df['Cleaned and pre-processed comment'] = cleaned_preprocessed_df['Cleaned and pre-processed comment'].apply(lambda x: relative_pruning(x, terms_to_keep_set))
cleaned_preprocessed_df.dropna(subset = ['Cleaned and pre-processed comment'], inplace = True)
cleaned_preprocessed_df


Unnamed: 0,URL ID,Comment,Date,Cleaned comment,Cleaned and pre-processed comment
0,fablco,"Here's why he used the word ""parasitic"":\n\n> ...",2020-02,"Here's why he used the word ""parasitic"":\n\n> ...",word app technology bring app phone actively p...
1,fablco,ELI5 anyone?,2020-02,ELI5 anyone?,
2,fablco,"Tiktok CEO: I know you are, but what am I?",2020-02,"Tiktok CEO: I know you are, but what am I?",tiktok
3,fablco,Coming from a company that is constantly tryin...,2020-02,Coming from a company that is constantly tryin...,come company user app guy
4,fablco,"Adolescence, puberty, peer pressure, all that ...",2020-02,"Adolescence, puberty, peer pressure, all that ...",stuff dumb manipulate parent buy dumb apps tik...
...,...,...,...,...,...
62950,17b9d3y,"Ofc they dont, they have spies not tied to the...",2023-10,"Ofc they dont, they have spies not tied to the...",spy party
62951,17b9d3y,Found the CCP bot,2023-10,Found the CCP bot,ccp bot
62952,17b9d3y,Christ Reddit is such an echo chamber. Is anyo...,2023-10,Christ Reddit is such an echo chamber. Is anyo...,reddit allow think
62953,17b9d3y,Wtf does the comment even have to do with any ...,2023-10,Wtf does the comment even have to do with any ...,comment ccp think lol literally
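
The pruning rule keeps a term only if the number of documents containing it lies between the low and high thresholds. The following dependency-free sketch shows the same idea on a toy corpus. The function name `prune_by_document_frequency`, the toy comments, and the loosened thresholds (50% and 90% instead of 0.5% and 99%) are assumptions chosen so that the effect is visible on four documents. Unlike `CountVectorizer`, this sketch splits on whitespace only.

```python
from collections import Counter

def prune_by_document_frequency(docs, low_frac, high_frac):
    """Keep only words whose document frequency lies within
    [low_frac * N, high_frac * N], where N is the number of documents."""
    n_docs = len(docs)
    # Count each word once per document (document frequency, not term frequency).
    doc_freq = Counter(word for doc in docs for word in set(doc.split()))
    keep = {word for word, df in doc_freq.items()
            if low_frac * n_docs <= df <= high_frac * n_docs}
    return [' '.join(word for word in doc.split() if word in keep) for doc in docs]

toy_docs = [
    "tiktok ban china",
    "tiktok data privacy",
    "tiktok ban data",
    "tiktok rare",
]

# Toy thresholds (assumed for illustration only).
pruned = prune_by_document_frequency(toy_docs, low_frac=0.5, high_frac=0.9)
print(pruned)
```

Here `tiktok` is dropped for appearing in every document, and `china`, `privacy`, and `rare` are dropped for appearing in only one. The last document ends up as an empty string rather than a missing value, which is why a subsequent `dropna` does not remove such rows.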


##### Collapse sample dataset into yearly aggregates

In [5]:
# Copy first so that the Date conversion and the Year column do not alter cleaned_preprocessed_df
yearly_aggregates_df = cleaned_preprocessed_df.copy()
yearly_aggregates_df['Date'] = pd.to_datetime(yearly_aggregates_df['Date'])
yearly_aggregates_df['Year'] = yearly_aggregates_df['Date'].dt.year
# Concatenate all comments of a year into a single space-separated document
yearly_aggregates_df = yearly_aggregates_df.groupby('Year')['Cleaned and pre-processed comment'].agg(' '.join).reset_index()
yearly_aggregates_df


Unnamed: 0,Year,Cleaned and pre-processed comment
0,2020,word app technology bring app phone actively p...
1,2021,general idea attack people personal hate spee...
2,2022,source happen deal buy tiktok privacy issue ho...
3,2023,day platform bring fight make problem thing t...


##### Save final sample dataset as CSV

In [6]:
yearly_aggregates_df.to_csv('Cleaned and pre-processed sample dataset.csv', index = False)

##### Training dataset (90%)

In [7]:
sample_size = 0.9

training_df = cleaned_preprocessed_df['Cleaned and pre-processed comment'].sample(frac = sample_size, random_state = 10)
training_df

6082                                                      
36101                                           tiktok bad
31388                              ccp fuck fuck fuck edit
15346    include president case support ban state ameri...
31029    exact argument bad thing facebook thing argume...
                               ...                        
11460                                         parent trump
15429                                                block
29716                        app thing pretty tiktok point
8929        data delete good day company data force delete
27185    point current society popular attention intern...
Name: Cleaned and pre-processed comment, Length: 53367, dtype: object

##### Test dataset (10%)

In [8]:
heldout_df = cleaned_preprocessed_df['Cleaned and pre-processed comment'].drop(training_df.index)
heldout_df

2                                                   tiktok
12                           reddit edit user content step
38                   social medium platform reddit include
53                             case call reddit push place
89                                                 chinese
                               ...                        
62844    tiktok platform hard block ad short people car...
62875                                                  god
62923                     reddit people china know country
62933    china bot reddit huge deal russia bot great hu...
62950                                            spy party
Name: Cleaned and pre-processed comment, Length: 5930, dtype: object

### LDA Topic Modeling

##### Coherence score

In [9]:
from gensim import corpora, models
from gensim.models import CoherenceModel

coherence_texts = [doc.split() for doc in training_df]
coherence_dictionary = corpora.Dictionary(coherence_texts)
coherence_corpus = [coherence_dictionary.doc2bow(text) for text in coherence_texts]

num_topics = [30, 50, 70]
alpha_values = [0.01, 0.05, 0.1, 0.2, 0.5, 1]

for k in num_topics:

    k_models_scores = []
    
    for alpha in alpha_values:
        lda_model = models.LdaMulticore(corpus = coherence_corpus, id2word = coherence_dictionary, num_topics = k, alpha = alpha,
                                        eta = 1/k, passes = 20, iterations = 2500, random_state = 10)
        coherence_model = CoherenceModel(model = lda_model, texts = coherence_texts, dictionary = coherence_dictionary, coherence = 'u_mass')
        coherence_score = coherence_model.get_coherence()
        k_models_scores.append((lda_model, alpha, coherence_score))

        print(f"K = {k}, α = {alpha}: {coherence_score}")
        
    best_model = max(k_models_scores, key = lambda x: x[2])
    best_lda_model, best_alpha, best_coherence = best_model
    model_filename = f'lda_model_k{k}_alpha{best_alpha}.model'
    best_lda_model.save(model_filename)

    print(f"Best model for K = {k}: α = {best_alpha}, coherence = {best_coherence}, saved as: {model_filename}")
    
    del k_models_scores

K = 30, α = 0.01: -3.4818772559150033
K = 30, α = 0.05: -3.469943413614231
K = 30, α = 0.1: -3.549308930014683
K = 30, α = 0.2: -3.5777264333909904
K = 30, α = 0.5: -3.368139069665063
K = 30, α = 1: -2.69670143800721
Best model for K = 30: α = 1, coherence = -2.69670143800721, saved as: lda_model_k30_alpha1.model
K = 50, α = 0.01: -3.4357989356659333
K = 50, α = 0.05: -3.5677489623279994
K = 50, α = 0.1: -3.5526095799495874
K = 50, α = 0.2: -3.3748898818671518
K = 50, α = 0.5: -3.0525830469487256
K = 50, α = 1: -2.4967928849907617
Best model for K = 50: α = 1, coherence = -2.4967928849907617, saved as: lda_model_k50_alpha1.model
K = 70, α = 0.01: -3.147414026025409
K = 70, α = 0.05: -3.1650126794597337
K = 70, α = 0.1: -3.08173846905687
K = 70, α = 0.2: -3.0770964468887634
K = 70, α = 0.5: -2.908357345139788
K = 70, α = 1: -2.4568976999809062
Best model for K = 70: α = 1, coherence = -2.4568976999809062, saved as: lda_model_k70_alpha1.model
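
For intuition about the `u_mass` scores printed above, here is a minimal sketch of the UMass measure in the form given by Mimno et al.: for a topic's top words, sum `log((D(w_m, w_l) + 1) / D(w_l))` over ordered pairs, where `D` counts the documents containing the word or word pair. The function name `umass_coherence` and the toy documents are assumptions, and gensim's implementation may differ in details such as smoothing and averaging, so these values are not comparable with the scores above.

```python
import math

def umass_coherence(top_words, docs):
    """UMass-style coherence for one topic.
    top_words: the topic's words, most probable first.
    docs: whitespace-tokenizable strings; every top word is assumed to occur at least once."""
    doc_sets = [set(doc.split()) for doc in docs]

    def doc_count(*words):
        # Number of documents that contain all the given words.
        return sum(1 for words_in_doc in doc_sets if all(w in words_in_doc for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_occurrences = doc_count(top_words[m], top_words[l])
            score += math.log((co_occurrences + 1) / doc_count(top_words[l]))
    return score

toy_docs = ["tiktok ban china", "tiktok ban", "china data", "tiktok data"]

related = umass_coherence(["tiktok", "ban"], toy_docs)    # words that co-occur
unrelated = umass_coherence(["ban", "data"], toy_docs)    # words that never co-occur
print(related, unrelated)
```

Word pairs that rarely share a document push the score further below zero, which is why the selection loop above keeps the model with the highest (least negative) score.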


##### Perplexity score

In [10]:
from gensim.models import LdaModel

perplexity_texts = [doc.split() for doc in heldout_df]
# Caution: this dictionary is built from the held-out texts alone, so its token ids
# are not guaranteed to match the ids the saved models were trained with
perplexity_dictionary = corpora.Dictionary(perplexity_texts)
perplexity_corpus = [perplexity_dictionary.doc2bow(text) for text in perplexity_texts]

model_filenames = ['lda_model_k30_alpha1.model', 'lda_model_k50_alpha1.model', 'lda_model_k70_alpha1.model']

for model_filename in model_filenames:
    lda_model = LdaModel.load(model_filename)
    # log_perplexity returns the per-word likelihood bound, not the perplexity itself
    perplexity_score = lda_model.log_perplexity(perplexity_corpus)

    print(f"{model_filename} perplexity score: {perplexity_score}")

lda_model_k30_alpha1.model perplexity score: -7.49620582850401
lda_model_k50_alpha1.model perplexity score: -8.517374603093826
lda_model_k70_alpha1.model perplexity score: -9.38655737741072
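
Two points about these numbers may be worth a small sketch. First, `log_perplexity` returns a per-word likelihood bound; gensim's own log output converts it to a perplexity estimate as `2 ** (-bound)`. Second, the held-out corpus above is encoded with a dictionary built from the held-out texts, and independently built vocabularies generally assign different ids to the same word. The helper names below (`bound_to_perplexity`, `build_vocab`) are hypothetical, and `build_vocab` is only a simplified stand-in for a gensim `Dictionary`.

```python
def bound_to_perplexity(per_word_bound):
    """Perplexity estimate as 2 ** (-bound), the conversion gensim uses when logging."""
    return 2 ** (-per_word_bound)

def build_vocab(texts):
    """Simplified stand-in for a dictionary: ids are assigned in order of first appearance."""
    vocab = {}
    for tokens in texts:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

# Vocabulary seen during training vs. one rebuilt from held-out data only.
train_vocab = build_vocab([["tiktok", "ban"], ["china", "data"]])
heldout_vocab = build_vocab([["data", "tiktok"]])

# The same word receives different ids in the two vocabularies.
for word in ["tiktok", "data"]:
    print(word, train_vocab[word], heldout_vocab[word])

# A bound closer to zero corresponds to a lower perplexity.
print(bound_to_perplexity(-7.5), bound_to_perplexity(-9.4))
```

Under that reading, one option is to encode held-out documents with the dictionary the model was trained on (for example via `lda_model.id2word.doc2bow(text)`, assuming the model was saved with its dictionary), so that ids line up and unseen words are simply skipped.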


##### Topic modeling

The yearly aggregates (2020-2023) are treated as separate documents in the topic modeling analysis.

In [11]:
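topic_texts = [comment.split() for comment in yearly_aggregates_df['Cleaned and pre-processed comment']]
# Caution: this dictionary is built from the yearly aggregates alone, so its token ids
# are not guaranteed to match the ids the saved models were trained with
topic_dictionary = corpora.Dictionary(topic_texts)
topic_corpus = [topic_dictionary.doc2bow(text) for text in topic_texts]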

def print_topics(lda_model, topic_corpus_filtered, num_words = 5):
    
    for year, bow in zip(yearly_aggregates_df['Year'], topic_corpus_filtered):
        print(f"Year {year}:")
        topics = lda_model.get_document_topics(bow, minimum_probability = 0)
        
        for topic_id, proportion in topics:
            top_words = lda_model.show_topic(topic_id, topn = num_words)
            top_words_str = ', '.join([word for word, _ in top_words])
            print(f"Topic {topic_id} ({proportion:.2%}): {top_words_str}")
            
        print()

In [12]:
lda_model_k30 = LdaModel.load('lda_model_k30_alpha1.model')
print_topics(lda_model_k30, topic_corpus)

Year 2020:
Topic 0 (0.25%): tiktok, ban, china, network, chinese
Topic 1 (0.00%): chinese, government, tiktok, data, american
Topic 2 (2.04%): ban, software, tiktok, china, data
Topic 3 (0.14%): tiktok, china, data, people, government
Topic 4 (6.01%): tiktok, proof, govt, people, china
Topic 5 (2.55%): tiktok, ban, people, crazy, china
Topic 6 (4.55%): law, trump, congress, president, tiktok
Topic 7 (8.44%): kid, child, parent, online, year
Topic 8 (14.82%): people, tiktok, social, medium, fuck
Topic 9 (6.06%): app, device, track, phone, server
Topic 10 (2.87%): state, influence, china, people, citizen
Topic 11 (0.82%): company, government, chinese, ban, china
Topic 12 (0.01%): tiktok, people, tech, government, facebook
Topic 13 (3.83%): content, video, tiktok, youtube, algorithm
Topic 14 (7.58%): freedom, people, speech, limit, public
Topic 15 (3.10%): comment, app, remove, tiktok, post
Topic 16 (0.01%): tiktok, bad, people, american, fuck
Topic 17 (2.06%): people, tiktok, shit, talk,

In [13]:
lda_model_k50 = LdaModel.load('lda_model_k50_alpha1.model')
print_topics(lda_model_k50, topic_corpus)

Year 2020:
Topic 0 (0.00%): tiktok, china, company, ban, data
Topic 1 (0.00%): data, tiktok, company, chinese, people
Topic 2 (0.00%): tiktok, data, ban, china, chinese
Topic 3 (0.01%): data, tiktok, china, people, government
Topic 4 (2.33%): tiktok, china, people, company, ban
Topic 5 (1.39%): tiktok, china, people, track, ban
Topic 6 (1.38%): tiktok, law, congress, china, people
Topic 7 (24.45%): tiktok, china, chinese, data, people
Topic 8 (5.02%): tiktok, people, china, data, company
Topic 9 (5.09%): tiktok, app, data, user, ban
Topic 10 (0.02%): china, people, tiktok, government, chinese
Topic 11 (0.01%): company, ban, tiktok, attack, government
Topic 12 (0.00%): tiktok, data, people, company, government
Topic 13 (3.70%): content, child, kid, video, parent
Topic 14 (7.41%): people, tiktok, china, data, ban
Topic 15 (1.79%): remove, tiktok, app, people, ban
Topic 16 (0.02%): tiktok, people, data, company, china
Topic 17 (0.02%): tiktok, people, china, ban, bad
Topic 18 (2.21%): tik

In [14]:
lda_model_k70 = LdaModel.load('lda_model_k70_alpha1.model')
print_topics(lda_model_k70, topic_corpus)

Year 2020:
Topic 0 (0.00%): tiktok, china, data, company, ban
Topic 1 (0.00%): data, tiktok, company, people, chinese
Topic 2 (0.00%): tiktok, data, ban, china, chinese
Topic 3 (0.01%): data, tiktok, china, people, government
Topic 4 (2.18%): tiktok, china, people, company, ban
Topic 5 (3.19%): tiktok, china, people, data, ban
Topic 6 (3.79%): tiktok, china, people, data, law
Topic 7 (21.62%): tiktok, china, data, chinese, people
Topic 8 (6.98%): tiktok, people, china, data, company
Topic 9 (0.08%): tiktok, data, ban, user, app
Topic 10 (0.01%): china, tiktok, people, data, government
Topic 11 (0.01%): company, tiktok, ban, attack, data
Topic 12 (0.01%): tiktok, data, people, company, government
Topic 13 (0.01%): tiktok, content, people, ban, china
Topic 14 (3.32%): people, tiktok, data, china, ban
Topic 15 (1.09%): remove, tiktok, app, people, data
Topic 16 (0.33%): tiktok, data, people, company, china
Topic 17 (0.17%): tiktok, people, china, ban, data
Topic 18 (3.05%): tiktok, people

Topic 0 (0.01%): tiktok, china, data, company, ban
Topic 1 (0.00%): data, tiktok, company, people, chinese
Topic 2 (0.01%): tiktok, data, ban, china, chinese
Topic 3 (0.01%): data, tiktok, china, people, government
Topic 4 (3.80%): tiktok, china, people, company, ban
Topic 5 (3.50%): tiktok, china, people, data, ban
Topic 6 (6.72%): tiktok, china, people, data, law
Topic 7 (12.78%): tiktok, china, data, chinese, people
Topic 8 (9.05%): tiktok, people, china, data, company
Topic 9 (0.01%): tiktok, data, ban, user, app
Topic 10 (0.02%): china, tiktok, people, data, government
Topic 11 (0.01%): company, tiktok, ban, attack, data
Topic 12 (0.01%): tiktok, data, people, company, government
Topic 13 (0.01%): tiktok, content, people, ban, china
Topic 14 (2.11%): people, tiktok, data, china, ban
Topic 15 (1.33%): remove, tiktok, app, people, data
Topic 16 (4.45%): tiktok, data, people, company, china
Topic 17 (0.60%): tiktok, people, china, ban, data
Topic 18 (1.09%): tiktok, people, governmen

##### Data visualization

In [15]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

lda_vis_30 = gensimvis.prepare(lda_model_k30, topic_corpus, topic_dictionary)

pyLDAvis.display(lda_vis_30)

In [16]:
lda_vis_50 = gensimvis.prepare(lda_model_k50, topic_corpus, topic_dictionary)

pyLDAvis.display(lda_vis_50)

In [17]:
lda_vis_70 = gensimvis.prepare(lda_model_k70, topic_corpus, topic_dictionary)

pyLDAvis.display(lda_vis_70)