# Algospeak Topic Modeling
This notebook contains exploratory topic modeling for my algospeak project.  
This is a new script that continues the project.

In [28]:
import pandas as pd
import numpy as np
import sklearn
%pprint
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)

Pretty printing has been turned ON


## Data reshaping
I'm only looking at algospeak usage of the terms, so I'll keep the rows whose `mention_code` is `a` (algospeak) or `m` (mention).

In [4]:
algo_df = pd.read_csv('algospeak_top_posts.csv')
algo_df.head()

Unnamed: 0,text,author,display_name,date,likes,quotes,replies,reposts,uri,query,mention_code
0,"""Orange Man Bad"" and ""Unalive the Boer"" are un...",maiamindel.bsky.social,Maia,2025-02-23T17:07:01.256Z,92,1,1,18,at://did:plc:ur77nun2q74loi34r2e6r43u/app.bsky...,unalive,a
1,please for the love of god can we NOT normalis...,adrierising.bsky.social,adrie rose 🇯🇲,2025-02-25T03:25:40.832Z,83,0,1,10,at://did:plc:uajcdhsabyf4t7a3qgclm55x/app.bsky...,unalive,m
2,Good morning to everyone except Asian Doll who...,authorreneeb.bsky.social,ReneeB,2025-02-24T12:32:05.544Z,19,1,1,2,at://did:plc:ng6mdz23xa3jae4yr2crgocy/app.bsky...,unalive,a
3,Why did I marry someone who picks the WORST FU...,vitaminpac1.bsky.social,Vitamin Bee 🐝,2025-02-24T01:18:42.171Z,22,0,7,1,at://did:plc:37nxbbnnrozlllwyfngsalya/app.bsky...,unalive,a
4,postponing my unalive,baratiddys.bsky.social,Luis 🤍,2025-02-20T16:47:35.943Z,21,0,1,1,at://did:plc:dluxclbmnsh3bt6wyih5l6ds/app.bsky...,unalive,a


In [10]:
am_df = algo_df[algo_df['mention_code'].isin(['a', 'm'])]
am_df[['text', 'mention_code']].sample(15)

Unnamed: 0,text,mention_code
40,"The more I see this, the more I am VERY CONVINCED that Gale's more of a bottom than a top when it comes to seggs 🤡🍆💦 #BG3 #GaleDekarios #Weavelock💙💜",a
34,Drawing some nasty furry seggs...,a
4,postponing my unalive,a
54,"First sketch request of February, have some good old seggs with Cleopatra from Shin Megami Tensei V!",a
51,"I've been reading this one for a while, there's some reeeeal good seggs in there.",a
37,this is a collab for this fic!! pls do take a look if you like phaidei public seggs heheh #phaidei archiveofourown.org/works/63372577,a
41,"#VoidBound 0.6.5 Update just dropped with a 20% off!\n\nThat Caly Preggo scene we've been teasing is finally here! 🤰 💦\n\n✨️ New questlines ft. Caly & Haar, and some holographic seggs \n✨️ Brand New H-scene\n✨️ Various fixes\n\nstore.steampowered.com/app/2500710/...",a
2,Good morning to everyone except Asian Doll who openly admitted to trying to unalive Kash Doll over a damn name during BHM. 🤦🏿‍♀️,a
48,"Damn, so much seggs on my TL, y'all are horny! \n\nKeep going. 👀",a
33,FOXGIRL SEGGS,a


In [12]:
am_df.mention_code.value_counts()
# only down to 59 posts.... I think I'll need to get a lot more!

mention_code
a    49
m    10
Name: count, dtype: int64

In [13]:
#let's just see how topic modeling works with the 'unalive' portion

In [15]:
unalive_df = am_df[am_df['query'] == 'unalive']
unalive_docs = unalive_df.text
unalive_docs.head()

0                    "Orange Man Bad" and "Unalive the Boer" are universal truths that all living beings are innately attuned to. Like a message from the Creator
1    please for the love of god can we NOT normalise using "sw" and "unalive" on this app? use real words.\nsex work.\nkill.\ncunt.\n\nthere's no algorithm here.
2                                Good morning to everyone except Asian Doll who openly admitted to trying to unalive Kash Doll over a damn name during BHM. 🤦🏿‍♀️
3      Why did I marry someone who picks the WORST FUCKING MOVIES\n\nMy god I either want to die of boredom, want to unalive myself, or am too confused to decide
4                                                                                                                                           postponing my unalive
Name: text, dtype: object

In [18]:
len(unalive_docs)

29

## Topic Modeling Unalive (small version)
Since I'm only looking at 29 posts here, I'm not going to put much stock in the results. This is more of a proof of concept, looking at how topic modeling might be used to aid qualitative sociolinguistic work.

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

In [17]:
# Na-rae's function
def display_topics(model, feature_names, num_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-num_top_words - 1:-1]]))


In [20]:
# let's just look at 3 topics since there are only 29 posts
num_feats = 1000
num_topics = 3

In [21]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=num_feats, stop_words='english')
tfidf_docs = tfidf_vectorizer.fit_transform(unalive_docs)

nmf_model = NMF(n_components=num_topics, random_state=1, l1_ratio=.5, 
                init='nndsvd').fit(tfidf_docs)

display_topics(nmf_model, tfidf_vectorizer.get_feature_names_out(), 10)

Topic 0:
people say vacation wanna work wait cooked actually don necessary
Topic 1:
app love physician snuffies assisted god sw normalise cunt real
Topic 2:
saying self delete ideation suicidal word ve helpful muted euphemisms


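Topic 2 is very clear here: it comes from the couple of posts doing metadiscourse about the word itself. Topic 1 comes from one specific post about an app called snuffies, although several of its top words (app, love, god, sw, normalise, cunt, real) also match the "use real words" post.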
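
One way to check which posts are driving each topic, rather than guessing from the top words, is to look at the document-topic weights that NMF produces. The sketch below is only an illustration: `top_docs_per_topic` is a hypothetical helper, and the toy corpus is made up so the block runs on its own.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer


def top_docs_per_topic(doc_topic, docs, n_docs=2):
    """Return, for each topic, the n_docs documents with the largest weight."""
    doc_topic = np.asarray(doc_topic)
    result = {}
    for topic_idx in range(doc_topic.shape[1]):
        # Sort documents by their weight on this topic, largest first.
        order = np.argsort(doc_topic[:, topic_idx])[::-1][:n_docs]
        result[topic_idx] = [(docs[i], float(doc_topic[i, topic_idx])) for i in order]
    return result


# Toy stand-in corpus (hypothetical, not the real posts).
toy_docs = [
    "please stop saying unalive use the real word",
    "stop using euphemisms the real word is not muted here",
    "this movie is so boring I want to unalive myself",
    "boring movie night again want to leave",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_tfidf = toy_vectorizer.fit_transform(toy_docs)

toy_nmf = NMF(n_components=2, random_state=1, init="nndsvd")
# fit_transform returns the document-topic weight matrix (n_docs x n_topics).
toy_doc_topic = toy_nmf.fit_transform(toy_tfidf)

for topic_idx, ranked in top_docs_per_topic(toy_doc_topic, toy_docs).items():
    print(f"Topic {topic_idx}:")
    for text, weight in ranked:
        print(f"  {weight:.3f}  {text}")
```

In this notebook the equivalent call would presumably be `top_docs_per_topic(nmf_model.transform(tfidf_docs), list(unalive_docs))`. Converting the Series to a list matters because the helper indexes documents by position, not by the DataFrame's index labels.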

In [25]:
unalive_df[unalive_df['mention_code'] == 'm'] # you can see topic 2 reflected here

Unnamed: 0,text,author,display_name,date,likes,quotes,replies,reposts,uri,query,mention_code
1,"please for the love of god can we NOT normalise using ""sw"" and ""unalive"" on this app? use real words.\nsex work.\nkill.\ncunt.\n\nthere's no algorithm here.",adrierising.bsky.social,adrie rose 🇯🇲,2025-02-25T03:25:40.832Z,83,0,1,10,at://did:plc:uajcdhsabyf4t7a3qgclm55x/app.bsky.feed.post/3lixw3icgps2u,unalive,m
6,"Y'all, please stop saying ""unalive"" and ""self-delete"" and so forth. There's no algorithm to fight here and a lot of people have the word ""suicide"" muted for a reason. The euphemisms are neither cute nor helpful.",aliothfox.ursamajorartworks.com,Alioth Daddyfox,2025-02-21T17:55:36.580Z,117,2,8,29,at://did:plc:5mkojgjmjfhdpd5lvepg2q6h/app.bsky.feed.post/3lipete74xs2l,unalive,m
9,I watched something and it used the word 'unalive' which made me so mad that I was shaken back to lucidity,killjill.itch.io,JILLIAN F. KILLS,2025-02-23T19:07:48.841Z,14,0,1,0,at://did:plc:3lcmedxtqgg72d4ag57go72w/app.bsky.feed.post/3liujscorlc2k,unalive,m
10,"wait, people actually say unalive? we’re all so cooked",verynormalguy.bsky.social,evan n,2025-02-20T14:42:20.392Z,12,2,0,0,at://did:plc:fcfkcslvkunsrcldbzmx2imc/app.bsky.feed.post/3limjktqulc2m,unalive,m
13,"cosigned. I was in a psych ward for suicidal ideation, not for ""unaliving myself ideation"". I've had passive suicidal thoughts for most of my life. Saying ""unalive"" and ""self-delete"" etc always make me cringe, because it comes off as infantilizing (derogatory) and makes it harder to talk about",blueearotter.bsky.social,Ace (they/them) 🏳️‍⚧️,2025-02-21T18:27:15.197Z,25,0,1,5,at://did:plc:bq4rxo6fza7itac6mvmr5mnd/app.bsky.feed.post/3lipglwud522n,unalive,m
19,"Cw: suicide\n\nYou don’t have to say “unalive” here. You don’t have to say “sewercide” or any other euphemism here. Use the actual word, and you have the added bonus of it hitting someone’s mute words so they don’t see it if it’s triggering",acab.dad,John Breen,2025-02-19T17:02:38.859Z,89,1,0,20,at://did:plc:l2ktomvijz42t6hfqaxl7gq6/app.bsky.feed.post/3likawsrbsk2v,unalive,m
25,"Me: *uses ""unalive""*\n\nPsychologist: *counters with ""mandatory report""*",cosmicallyf.bsky.social,Cosmically Funny,2025-02-20T02:35:13.030Z,5,0,0,1,at://did:plc:6yp2frndek4aksqkc5wxtgoo/app.bsky.feed.post/3lilawnhewc2e,unalive,m
27,"En general no me encanta la comparación constante de la realidad con novelas de ciencia ficción pero esa vaina de cambiar el lenguaje y decir cosas como unalive en vez de morir, muerte, no atreverse a decir sexo, etc, por el bien del algoritmo, sí es bastante bastante 1984",magieps.bsky.social,Mags,2025-02-19T21:38:03.820Z,13,0,1,0,at://did:plc:pwbjwfwjrutcm2ztnrhtlvty/app.bsky.feed.post/3likqdc75xk2k,unalive,m


In [27]:
# trying the LDA model now
tf_vectorizer = CountVectorizer(max_df=0.8, max_features=num_feats, stop_words='english')
tf_docs = tf_vectorizer.fit_transform(unalive_docs)

lda_model = LatentDirichletAllocation(n_components=num_topics, max_iter=5, learning_method='online',
                                      learning_offset=50., random_state=0).fit(tf_docs)

display_topics(lda_model, tf_vectorizer.get_feature_names_out(), 10)

Topic 0:
people saying la bastante ideation decir want el doll women
Topic 1:
work algorithm people best word need week using report offers
Topic 2:
don live like long president ve say taken plan kentucky


Topic 0 obviously takes a lot from the one Spanish post in my sample. The other two topics are less focused than the NMF topics. On this small sample, NMF seems to work a bit better.
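
Part of the difference might come from the LDA settings rather than the model itself: five online passes is not many. The sketch below is an assumption on my part, not something I have run on the real data. It fits LDA twice on a made-up toy corpus, once with settings like the ones above and once with batch learning and more iterations, and prints the perplexity of each. `fit_lda` is a hypothetical helper.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in corpus (hypothetical, not the real posts).
toy_docs = [
    "please stop saying unalive use the real word",
    "stop using euphemisms the real word is not muted here",
    "this movie is so boring I want to unalive myself",
    "boring movie night again want to leave",
    "there is no algorithm here so use the actual word",
    "worst movie ever I was bored the whole night",
]

toy_counts = CountVectorizer(stop_words="english").fit_transform(toy_docs)


def fit_lda(counts, n_topics, **kwargs):
    """Fit an LDA model with a fixed seed so the two runs are comparable."""
    return LatentDirichletAllocation(
        n_components=n_topics, random_state=0, **kwargs
    ).fit(counts)


# Settings mirroring the LDA cell above: a few online passes.
lda_online = fit_lda(
    toy_counts, 2, max_iter=5, learning_method="online", learning_offset=50.0
)
# Assumed alternative: full-batch updates with more iterations.
lda_batch = fit_lda(toy_counts, 2, max_iter=100, learning_method="batch")

# Lower perplexity on the same counts suggests a closer fit to this sample;
# it says nothing about how interpretable the topics are.
online_ppl = lda_online.perplexity(toy_counts)
batch_ppl = lda_batch.perplexity(toy_counts)
print("online perplexity:", online_ppl)
print("batch perplexity: ", batch_ppl)
```

With only 29 posts, any such comparison is rough at best, so reading the top words with `display_topics` is still the main check.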