## About

A proper About section will be written later.
A classifier that predicts which niche category a given Reddit post falls into.


### To-do/Ideas for the future.
- Find a workflow that cleans all the data we scrape from Reddit via PRAW.
- LLMs can be used for data augmentation as well, not just weak supervision. That is, we can pass existing Reddit posts into an LLM as inspiration and have it generate more Reddit stories that are likely to go viral within a niche of our choice.
    - Additionally, instead of passing known good stories into a general-purpose LLM (such as Gemini or a GPT-based model), as we do right now, we could train or fine-tune a domain-specific LLM dedicated to this task (generating Reddit posts within a specific niche that are likely to go viral).

First, we need to collect data.
There are few good existing datasets for this task, so we need to create our own.
We will do this by scraping posts via PRAW and labelling them through weak supervision with a chosen LLM (I am using Gemini for this).

First, scraping data via PRAW.

In [25]:
# Install all required dependencies

%pip install -r requirements.txt --user # --user flag is needed because one of the dependencies (google-genai) needs to access a script that is hidden in non-administrator environments.

Collecting psaw (from -r requirements.txt (line 2))
  Downloading psaw-0.1.0-py3-none-any.whl.metadata (10 kB)
ERROR: Could not find a version that satisfies the requirement distutils (from versions: none)

[notice] A new release of pip is available: 24.3.1 -> 25.1.1
[notice] To update, run: pip3 install --upgrade pip
ERROR: No matching distribution found for distutils
Note: you may need to restart the kernel to use updated packages.


In [10]:
#!which pip

In [1]:
# Make your necessary imports
import praw
import pandas as pd
import time
from google import genai
import numpy as np

In [2]:
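# Initialize reddit client session
import os

# Credentials are read from environment variables instead of being hard-coded in the notebook.
# The variable names REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are a suggestion; use whatever names you export.
CLIENT_ID = os.environ["REDDIT_CLIENT_ID"]
CLIENT_SECRET = os.environ["REDDIT_CLIENT_SECRET"]
USER_AGENT = "sestee 1.0"
cli = praw.Reddit(
        client_id=CLIENT_ID,
        client_secret=CLIENT_SECRET,
        user_agent=USER_AGENT
)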


In [23]:
# Declare a way for you to scrape posts from a subreddit of your choice.
def scrape_popular_posts(subreddits, limit=None, sort_by="hot"):
    posts = []
    
    for sub_name in subreddits:
        subreddit = cli.subreddit(sub_name)
        count = 0
        if sort_by == "top":
            submissions = subreddit.top(time_filter="all", limit=limit)
        elif sort_by == "hot":
            submissions = subreddit.hot(limit=limit)
        elif sort_by == "new":
            submissions = subreddit.new(limit=limit)
        else:
            raise ValueError("Invalid sort_by value. Use 'top', 'hot', or 'new'.")
        
        for post in submissions:
            count += 1
            post_data = {
                "title": post.title,
                "selftext": post.selftext, # For reference, selftext is the ACTUAL body text of the post
                "subreddit": post.subreddit.display_name,
                "flair": post.link_flair_text,
                "score": post.score,
                "num_comments": post.num_comments,
                "upvote_ratio": post.upvote_ratio,
                "created_utc": post.created_utc,
                "id": post.id,
                "url": post.url
            }
            posts.append(post_data)
        print(sub_name, count)
    return posts

In [35]:
def scrape_hot_posts(subreddits, limit=1000):
    posts = []
    for sub_name in subreddits:
        subreddit = cli.subreddit(sub_name)
        count = 0
        for post in subreddit.hot(limit=limit):
            count += 1
            post_data = {
                "title": post.title,
                "selftext": post.selftext,
                "subreddit": post.subreddit.display_name,
                "flair": post.link_flair_text,
                "score": post.score,
                "num_comments": post.num_comments,
                "upvote_ratio": post.upvote_ratio,
                "created_utc": post.created_utc,
                "id": post.id,
                "url": post.url
            }
            posts.append(post_data)
        print(f"{sub_name}: Retrieved {count} hot posts")
    return posts
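
Both scraping functions above build the same dictionary for every post. As a minimal sketch, that shared step could be pulled into one helper. The name `post_to_record` is hypothetical, and the `SimpleNamespace` object below is only a stand-in for a PRAW submission so the example runs without a Reddit connection.

```python
from types import SimpleNamespace


def post_to_record(post):
    """Hypothetical helper: flatten one submission into a plain dict.

    The fields mirror the dictionaries built in the scraping functions above.
    """
    return {
        "title": post.title,
        "selftext": post.selftext,  # body text of the post
        "subreddit": post.subreddit.display_name,
        "flair": post.link_flair_text,
        "score": post.score,
        "num_comments": post.num_comments,
        "upvote_ratio": post.upvote_ratio,
        "created_utc": post.created_utc,
        "id": post.id,
        "url": post.url,
    }


# Stand-in for a real submission (made-up values, for illustration only).
fake_post = SimpleNamespace(
    title="Example title",
    selftext="Example body",
    subreddit=SimpleNamespace(display_name="TIFU"),
    link_flair_text=None,
    score=10,
    num_comments=2,
    upvote_ratio=0.9,
    created_utc=1747601000.0,
    id="abc123",
    url="https://www.reddit.com/",
)

record = post_to_record(fake_post)
print(record["subreddit"], record["id"])
```

Inside either loop, `posts.append(post_to_record(post))` would then replace the inline dictionary.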


In [36]:
# Figure out what subreddits you want to scrape from
subreddits = ["AskReddit", "relationships", "AmItheAsshole", "TrueOffMyChest", "TIFU"]
# Scrape the data from the subreddits
#data = scrape_popular_posts(subreddits, limit=None, sort_by="top")
data = scrape_hot_posts(subreddits)
# Save the data in a pandas dataframe
df = pd.DataFrame(data)
# Can save the dataframe to a CSV file too!
#df.to_csv("reddit_posts.csv", index=False)
df["niche"] = None # Adding a new column to the dataframe for the niche
# Display the first few rows of the dataframe
df.head(20)

AskReddit: Retrieved 880 hot posts
relationships: Retrieved 204 hot posts
AmItheAsshole: Retrieved 444 hot posts
TrueOffMyChest: Retrieved 848 hot posts
TIFU: Retrieved 837 hot posts


Unnamed: 0,title,selftext,subreddit,flair,score,num_comments,upvote_ratio,created_utc,id,url,niche
0,"People over 35, what's something you genuinely...",,AskReddit,,6317,8698,0.94,1747601000.0,1kptz1u,https://www.reddit.com/r/AskReddit/comments/1k...,
1,(SERIOUS) What’s the worst way you know someon...,,AskReddit,Serious Replies Only,1449,2343,0.9,1747616000.0,1kpz8n7,https://www.reddit.com/r/AskReddit/comments/1k...,
2,What is the most surreal “this can’t be real” ...,,AskReddit,,4112,2706,0.96,1747594000.0,1kpr4d4,https://www.reddit.com/r/AskReddit/comments/1k...,
3,Forget elephants in the room. What’s a blue wh...,,AskReddit,,980,224,0.92,1747617000.0,1kpzhtw,https://www.reddit.com/r/AskReddit/comments/1k...,
4,What’s the worst city you’ve ever visited?,,AskReddit,,5043,7168,0.92,1747580000.0,1kplv0m,https://www.reddit.com/r/AskReddit/comments/1k...,
5,What was a don’t get paid enough for this sh*t...,,AskReddit,,2476,404,0.97,1747592000.0,1kpqi0f,https://www.reddit.com/r/AskReddit/comments/1k...,
6,How sick is too sick for you to go to work?,,AskReddit,,578,1332,0.93,1747613000.0,1kpy9yk,https://www.reddit.com/r/AskReddit/comments/1k...,
7,What's something that screams 'poorly raised'?,,AskReddit,,835,1043,0.92,1747602000.0,1kpu5zm,https://www.reddit.com/r/AskReddit/comments/1k...,
8,What's the grossest thing you've seen someone ...,,AskReddit,,3468,2742,0.92,1747575000.0,1kpjupd,https://www.reddit.com/r/AskReddit/comments/1k...,
9,What has become so expensive that it's not wor...,,AskReddit,,726,1072,0.9,1747601000.0,1kptxsd,https://www.reddit.com/r/AskReddit/comments/1k...,


We now have a good chunk of the data we need. Next, we need to clean it.

In [31]:
print(len(df))
df = df.drop_duplicates()
print(len(df))

18471
17972


In [16]:
# Cleaning the data
# One of the ways we can clean the data is by removing any rows that have empty string values in the 'selftext' or 'title' columns. 
# If you take a look at the dataframe output above, you'll see this is the case for some of them.
df = df[df["selftext"].str.strip() != ""] # dropping empty 'selftext' rows
df = df[df["title"].str.strip() != ""] # dropping empty 'title' rows
print(len(df))
#df = df.dropna(subset=["selftext", "title"])  # Drop rows with NaN in 'selftext' or 'title'

# Now if we inspect the dataframe you'll see it doesn't have empty strings anymore at all.
df.head(20)

2329


Unnamed: 0,title,selftext,subreddit,flair,score,num_comments,upvote_ratio,created_utc,id,url,niche
881,No Politics!,Hello! \n\nThis is a friendly reminder that po...,relationships,,208,0,0.92,1730132000.0,1ge6159,https://www.reddit.com/r/relationships/comment...,
882,My husband (63M) goes cycling way too much lea...,"Good people, I need some help.\n\nMy husband (...",relationships,,951,276,0.92,1747570000.0,1kpiccf,https://www.reddit.com/r/relationships/comment...,
883,"Wife, 30F, ignored me, 38M, all day on our 4th...","Edit; her boss is a late 30s female, her bosse...",relationships,,32,57,0.64,1747621000.0,1kq0rp0,https://www.reddit.com/r/relationships/comment...,
884,Overcome wife wanting anime men tattooed on th...,Throwaway account for obvious reasons and this...,relationships,,32,66,0.68,1747619000.0,1kq026k,https://www.reddit.com/r/relationships/comment...,
885,I (18M) got booted from the house for being ga...,so a few days ago i got kicked out. i’m gay an...,relationships,,55,2,0.96,1747628000.0,1kq2uwd,https://www.reddit.com/r/relationships/comment...,
886,Wife (44f) of 26 years just dropped a bomb on ...,Tldr: after 26 years if she was herself she wo...,relationships,,19,14,0.89,1747620000.0,1kq0jwp,https://www.reddit.com/r/relationships/comment...,
887,My boyfriend sleeps at 8am every night and it’...,It’s exactly like the title. I am (F17) and he...,relationships,,584,100,0.82,1747554000.0,1kpejxp,https://www.reddit.com/r/relationships/comment...,
888,I’m pretty sure my boyfriend doesn’t truly car...,Me(22) and my bf (21)’s relationship of 3 year...,relationships,,5,5,0.86,1747627000.0,1kq2hk1,https://www.reddit.com/r/relationships/comment...,
889,"I love my partner, but I miss the feeling of b...","Hi everyone,\nI’m a 30F and have been with my ...",relationships,,108,44,0.89,1747563000.0,1kpgm5k,https://www.reddit.com/r/relationships/comment...,
890,I (F/29) don’t like my boyfriend’s (M/27) dog ...,I’ve been with my boyfriend for almost 2 years...,relationships,,5,10,1.0,1747622000.0,1kq0z64,https://www.reddit.com/r/relationships/comment...,
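
The cleaning steps above can be sketched as a single function. This is one possible version under stated assumptions, not the workflow the notebook settled on: the name `clean_posts` is hypothetical, duplicates are detected by post `id` only (a post scraped twice may have a different `score` each time, so a full-row comparison could miss it), and missing values are treated the same as empty strings.

```python
import pandas as pd


def clean_posts(df):
    """Hypothetical helper combining the cleaning steps used in this notebook."""
    # Deduplicate on the post id only, keeping the first occurrence.
    df = df.drop_duplicates(subset="id")
    # Drop rows whose title or selftext is missing, empty, or whitespace-only.
    for column in ("title", "selftext"):
        text = df[column].fillna("").astype(str).str.strip()
        df = df[text != ""]
    return df.reset_index(drop=True)


# Small made-up sample covering each case.
sample = pd.DataFrame({
    "id": ["a1", "a1", "b2", "c3", "d4", "e5"],
    "title": ["Post A", "Post A", "Post B", "  ", "Post D", "Post E"],
    "selftext": ["body", "body", None, "body", "   ", "body"],
    "score": [10, 12, 5, 3, 1, 7],
})

cleaned = clean_posts(sample)
print(cleaned["id"].tolist())
```

Only the rows with a usable title and body survive, and the re-scraped copy of `a1` is dropped even though its score changed.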


Now that we have our data, we will build a pipeline that labels every entry via weak supervision, filling in the "niche" column.
These are the categories we plan to classify our posts into:

| Label         | Description                                |
|---------------|--------------------------------------------|
| `advice`      | Help-seeking posts, questions, dilemmas    |
| `story`       | Personal anecdotes with a beginning, middle, end |
| `drama`       | High-stakes conflict, betrayal, gossip      |
| `rant`        | Emotional venting or unfiltered frustration |
| `humor`       | Meme-like, comedic, shitpost-style content  |
| `informative` | Tips, how-tos, PSAs, educational content    |
| `confession`  | Vulnerable personal reveals or identity-based confessions |
| `unknown`     | Doesn’t fit confidently into other categories|

Note: We can use the `unknown` category to find the biggest weaknesses of our LLM, and later fine-tune the LLM efficiently by targeting the specific weaknesses detected here.
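
Because the labels are a fixed set, it may help to normalize whatever the LLM returns before storing it. The sketch below is an assumption about how that could look, not part of the original pipeline: `normalize_label` is a hypothetical helper that trims whitespace and stray punctuation, lowercases the reply, and falls back to `unknown` for anything outside the table above.

```python
CATEGORIES = ["advice", "story", "drama", "rant", "humor", "informative", "confession", "unknown"]


def normalize_label(raw_text):
    """Map a raw model reply to one of CATEGORIES, defaulting to 'unknown'."""
    if raw_text is None:
        return "unknown"
    # Remove surrounding whitespace, then common wrapping characters.
    cleaned = raw_text.strip().lower().strip(".`'\"*")
    return cleaned if cleaned in CATEGORIES else "unknown"


for reply in ["informative\n", " Story. ", "**Rant**", "I think this is drama", None]:
    print(repr(reply), "->", normalize_label(reply))
```

Replies that contain a full sentence are deliberately mapped to `unknown` rather than guessed at, so they show up when inspecting that category.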

In [13]:
# Create an instance of the Google GenAI API client
import os

# The API key is read from an environment variable instead of being hard-coded in the notebook.
# The variable name GEMINI_API_KEY is a suggestion; use whatever name you export.
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
# gemini-2.0-flash is also a really good option, but it has a lower RPD (requests per day) limit, among other limits.
model = "gemma-3-27b-it" # There are a LOT of models to choose from. In my experience, 2.0-flash is the one I am most comfortable with and use the most. Will look into the 2.5 series once it reaches stable release.
# The Gemma 3 model here can process 10K+ requests a day, which is really good for this
# specific context because, as you saw, our dataset has 10K entries, which means 10K requests for this dataset.

template_prompt = """I want to train a transformer-based classifier that takes in the text of a reddit post and then classifies it into labels [personal advice, story, drama]. I only have a partial dataset for this. Can you help fill the rest for me?
It should JUST classify the post into one niche category. The niche categories I want you to choose from are [advice, story, drama, rant, humor, informative, confession, unknown]. unknown is for when you really are not sure what category the post belongs to.
I don't want anything else in your response aside from the 1-word niche category. I don't want any explanations or anything else. Just the 1-word niche category.
Here is the post's data:

"""

# The template prompt above is combined with each post's data to build the full prompt for every post we classify via the API.


In [16]:
# Store the name of the file that will contain the dataset.
data_filename = "reddit_posts_with_niches_large.csv"

In [14]:
# Pipeline to classify each one of the posts, making a call to the API and using the full prompt we made to get the response that contains the niche category we want.
for index, row in df.iterrows():
    if row["selftext"] == "":
        #print("Skipping empty selftext post.")
        continue
    post_data_prompt = f"Title: {row['title']}\nSelftext: {row['selftext']}\n\n"
    #print("Post that will be classified:")
    print(f"Title: {row['title']}")
    print(f"Body text: {row['selftext']}")
    #print("Classifying the post...")
    prompt = template_prompt + post_data_prompt
    #print(prompt)
    try:
        response = client.models.generate_content(
            model=model, contents=prompt
        )
    except Exception as e:
        print(e)
        time.sleep(61)
        continue
    model_niche_guess = response.text
    # It is possible that the model will give NO response (so response.text is None) because our prompt may contain NSFW language (outside our control).
    # In this case we have to either set the niche to "unknown" or skip the post. I prefer to set it to unknown because it is still a valid category.
    if model_niche_guess is None:
        print("Model returned no response. Setting niche to 'unknown'.")
        model_niche_guess = "unknown"
    # Strip surrounding whitespace (the reply can end with a newline) and lowercase it so labels are stored consistently.
    model_niche_guess = model_niche_guess.strip().lower()
    print(model_niche_guess + "\n")
    time.sleep(5)  # Sleep for 5 seconds to avoid hitting Google's RPM (requests per minute) limit
    # Now we need to add the model's guess to the dataframe
    df.loc[index, "niche"] = model_niche_guess
df.to_csv("reddit_posts_with_niches_large.csv", index=False)  # Save the dataframe with the new column to a CSV file

Title: No Politics!
Body text: Hello! 

This is a friendly reminder that politics are not allowed in this sub and any such posts/comments will be removed as soon as possible. 

Thanks for reading!
informative


Title: Why did he get so angry just because I couldn’t remember the last time I went out to a bar?
Body text: TL;DR; I (F/30) have just started seeing a guy (one week) (M/29) who I met up with last year but we stopped talking due to our schedules not aligning. Started talking again as he said he’d be more intentional and make more time for me and we had what felt like a really great connection again. We had just spent last weekend together and everything seemed fine… until it wasn’t.

Fast forward to yesterday (Friday) he called me whilst I was in the gym but I didn’t answer, I rarely miss his calls and when I text him once I left the gym I noticed he had his DND on which he rarely does. I don’t think he expected me to be in the gym on Friday night (boring I know!) I got no resp

As you can see, the models can get overloaded too. There is only so much we can expect from Google's LLMs (and from free AI services in general).
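
In the loop above, a failed request sleeps and then moves on, so that post is left unlabelled. One way to handle overload errors is to retry the same request with exponential backoff. The sketch below is a generic illustration, not the notebook's code: `call_with_retries`, `max_attempts`, and `base_delay` are hypothetical names, and `flaky_request` simulates an overloaded model so the example runs without any API access.

```python
import time


def call_with_retries(make_request, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Hypothetical helper: call make_request(), retrying with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return make_request()
        except Exception as error:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            delay = base_delay * 2 ** (attempt - 1)  # 2s, 4s, 8s, ...
            print(f"Attempt {attempt} failed ({error}); retrying in {delay:.0f}s")
            sleep(delay)


# Simulated request that fails twice before succeeding.
attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("model overloaded")
    return "story"


waits = []  # record the delays instead of actually sleeping
result = call_with_retries(flaky_request, sleep=waits.append)
print(result, waits)
```

In the classification loop, `make_request` could be `lambda: client.models.generate_content(model=model, contents=prompt)`, with the default `sleep` left in place.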

In [15]:
# Save the dataframe, including the model's guesses in the "niche" column, to a CSV file
df.to_csv("reddit_posts_with_niches_hot.csv", index=False)

In [16]:
prompt_temp = template_prompt + """Title: TIFU by thinking a woman was a boy, and groping her boob. (kind of NSFW, though it happened at work)
Body text: Obligatory this actually happened a little over a year ago, and throwaway because I don't want people on my main account to know what I do for a living.

So, I work for the TSA, and have for a few years now. It's a good job overall. I'm underpaid, but the benefits are nice, and I get overtime when I want it.

A little over a year ago, during the week leading up to Christmas, we had some really bad weather that delayed all the flights. I volunteered to stay late so that my coworkers could go home to their families. Most of the work was done anyway, so it was mostly just standing around waiting for the odd latecomer

I was working the AIT (the space tube thingy), when three passengers came up together, a middle-aged man, a middle-aged woman, and a teenage boy. I figure it's a family traveling together for the holidays, and go about my work.

Mom goes through, all is fine. Dad goes through, all is fine.

Kid comes up, I get a good look at him. Hoodie, sweatpants, shortish hair, smooth face. I figure he's about 13, maybe 14.

I hit the button, direct him to wait with me for a moment, and then gesture to the screen, which lit up on his chest area.

I tell him that I have to pat that area down. He's a little nervous, I figure that because he's so young, this is probably his first time getting a pat down, but he says okay, and I start the patdown.

I do the left side of the chest, and feel some moob, which catches me off guard because he didn't look chubby at all.

I move to the right side of the chest, read what's on the hoodie, and it all clicks at once. The hoodie has the name of the local college on it. This is an adult, not a child. He's not wearing sweatpants, \*she\* is wearing yoga pants. She doesn't even know the couple that just came through.

I look at her face, which is bright red, my hand is still on her boob, and I pull it back like I just got bit by a snake.

I immediately call for my supervisor, who comes over and asks what's wrong, and I explain the situation to her.

My supervisor covers her mouth, and at first I thought she was absolutely mortified, but then I realized she's trying not to laugh.

She takes a minute to pull herself together, tells me to go take a break, and finishes screening the passenger herself.

Once that was done, I apologize to the passenger, she tells me it's fine, that it wasn't the first time she was mistaken for a boy, and she probably should have said something before I started touching her. I leave her alone, and go talk to my supervisor to figure out exactly how fired I am.

She tells me to calm down, that it was just an honest mistake, and that she has my back if the passenger files an official complaint, but that probably won't happen, and I shouldn't be worried.

That reassured me a little, but I still groped a woman and ruined Christmas, so I feel like an absolute monster.

I swallow my shame, and finish my shift, then I go into the airport proper to find some food, because I just finished a twelve hour shift and there's no way I have the energy to cook dinner.

I saw my hapless victim sitting at her gate, waiting for her flight. I went up to her to apologize again, and saw that the flight had been delayed until morning (it was about eleven at night).

I apologize again, she says it's fine, and I ask her if she's planning to stay the whole night. She says she has to, all the hotels in the area are book.

I tell her that I'm getting some dinner, and offer to get her some food as well. After all, I already got to second base, I think it's only fair that I buy her dinner.

She agrees, and we go to one of the restaurants that is open late, get some food, and start eating.

She said she gets mistaken for a boy a lot, and it's not a big deal. I told her about how I had long hair and no beard in college, and at the gym people would frequently walk into the men's bathroom, see me, and do a double take to make sure they didn't walk into the ladies' room.

She laughed, and we ended up talking for a few hours, before I finally told her that I had to get home, and apologized again for the accidental molestation.

She said that all is forgiven, if I promise to take her on a real date when she gets back.

I agreed, she gave me her phone number, and I went home, and immediately started texting her. We kept talking until her flight finally left, and when she got back I picked her up at the airport, and a few days later took her on that date that I promised her.

We just celebrated our one year anniversary.

She has long hair now.

&#x200B;

tl;dr: Thought an adult woman was a teenage boy, touched her on the boob, everything worked out better than expected."""

response = client.models.generate_content(
        model=model, contents=prompt_temp
    )
print(response)  

# This is an example of the model REFUSING to generate a response: the prompt was blocked (block_reason is PROHIBITED_CONTENT in the printed response) because the content passed in was detected as explicit/NSFW.




candidates=None create_time=None response_id=None model_version='gemini-2.0-flash' prompt_feedback=GenerateContentResponsePromptFeedback(block_reason=<BlockedReason.PROHIBITED_CONTENT: 'PROHIBITED_CONTENT'>, block_reason_message=None, safety_ratings=None) usage_metadata=GenerateContentResponseUsageMetadata(cache_tokens_details=None, cached_content_token_count=None, candidates_token_count=None, candidates_tokens_details=None, prompt_token_count=1339, prompt_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=1339)], thoughts_token_count=None, tool_use_prompt_token_count=None, tool_use_prompt_tokens_details=None, total_token_count=1339, traffic_type=None) automatic_function_calling_history=[] parsed=None
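
The printed response shows where a refusal is reported: `prompt_feedback.block_reason` is set and there are no candidates. A minimal sketch of checking for that before reading the text follows. The helper name `label_from_response` is hypothetical, and the `SimpleNamespace` objects are stand-ins that only mimic the two attributes visible in the output above, so the example runs without calling the API.

```python
from types import SimpleNamespace


def label_from_response(response):
    """Hypothetical helper: return the reply text, or 'unknown' if the prompt was blocked."""
    feedback = getattr(response, "prompt_feedback", None)
    block_reason = getattr(feedback, "block_reason", None)
    if block_reason is not None:
        print(f"Prompt blocked ({block_reason}); labelling as 'unknown'.")
        return "unknown"
    text = getattr(response, "text", None)
    return text.strip() if text else "unknown"


# Stand-ins for a blocked response and a normal one (made-up values).
blocked = SimpleNamespace(
    text=None,
    prompt_feedback=SimpleNamespace(block_reason="PROHIBITED_CONTENT"),
)
answered = SimpleNamespace(text="story\n", prompt_feedback=None)

print(label_from_response(blocked))
print(label_from_response(answered))
```

Logging the block reason makes it possible to count how many `unknown` labels come from refusals rather than from genuinely ambiguous posts.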


We now have cleaned, labelled data.
We can now proceed to create and train our model.
First we need to choose a model architecture and create our model, before we actually start training it.

In [15]:
# creating our model and its architecture.
import tensorflow as tf

2025-05-17 22:10:04.938804: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [17]:

# Now we need to create arrays storing our features and possible labels (column names and categories respectively).
COLUMN_NAMES = ["title", "selftext", "subreddit", "flair", "score", "num_comments", "upvote_ratio", "created_utc", "id", "url"]
CATEGORIES = ["advice", "story", "drama", "rant", "humor", "informative", "confession", "unknown"]
# data_filename contains the training and testing data. We will use the first 8000 entries for training and the rest for testing to evaluate our model.
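
As a rough sketch of the two steps described in the comments above, the labels can be mapped to integer ids and the rows split at a fixed position. The names `LABEL_TO_ID`, `encode_labels`, and `split_rows` are hypothetical, and plain lists stand in for the dataframe here. Note that the rows are stored in subreddit order, so shuffling them before a positional split may be worth considering.

```python
CATEGORIES = ["advice", "story", "drama", "rant", "humor", "informative", "confession", "unknown"]

# Each category gets the integer id of its position in CATEGORIES.
LABEL_TO_ID = {name: index for index, name in enumerate(CATEGORIES)}


def encode_labels(niches):
    """Convert label strings to integer ids; unrecognized labels map to 'unknown'."""
    return [LABEL_TO_ID.get(niche, LABEL_TO_ID["unknown"]) for niche in niches]


def split_rows(rows, train_size=8000):
    """Use the first train_size rows for training and the rest for testing."""
    return rows[:train_size], rows[train_size:]


encoded = encode_labels(["story", "rant", "not-a-label"])
train_rows, test_rows = split_rows(list(range(10)), train_size=8)
print(encoded, len(train_rows), len(test_rows))
```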


In [None]:
# Now we have to create an input function
# This function is used to create/reorganize our data into a format that can be used by the model for training or testing.
def input_fn(features, labels, training=True, batch_size=256):
    # Convert the inputs to a Dataset
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels)) # This creates the dataset from the features and labels in TF's internal format.
   
    # In training mode, shuffle the data and repeat it indefinitely, so that the model doesn't just learn the order of the data.
    if training:
        dataset = dataset.shuffle(1000).repeat() # shuffle(1000) uses a buffer of 1000 elements and draws each next element at random from it; repeat() with no argument repeats the dataset indefinitely.
   
    # You now batch the data, which is basically where the dataset gets put into groups of a set size, where each batch is a subset of the dataset.
    # Each batch contains batch_size number of samples/examples.
    # This is done to speed up the training process, because it allows the model to process multiple samples at once.
    # The batch size is a hyperparameter that you can tune to find the best value for your model.
    return dataset.batch(batch_size)
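
To make the role of the shuffle buffer concrete, here is a pure-Python illustration of buffered shuffling. It is a simplified model of the idea behind `shuffle(buffer_size)`, not TensorFlow's implementation, and `buffered_shuffle` is a hypothetical name: the buffer is filled first, then each output is drawn at random from the buffer and replaced by the next input element.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        buffer.append(item)
        # Once the buffer is full, emit one randomly chosen element.
        if len(buffer) >= buffer_size:
            yield buffer.pop(rng.randrange(len(buffer)))
    # Input exhausted: drain whatever is left, still in random order.
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))


shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=0))
unshuffled = list(buffered_shuffle(range(5), buffer_size=1))
print(shuffled)
print(unshuffled)
```

With a buffer of size 1 the order is unchanged, and with a small buffer an element can only move a limited distance forward. A buffer at least as large as the dataset is needed for a fully uniform shuffle, which matters here because the rows are grouped by subreddit.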
