## About

Proper about me can be made later
Classifier that classifies what niche category a certain reddit post falls into.


### To-do/Ideas for the future.
- Need to find/determine a workflow that cleans all the data that we scrape/get from reddit via PRAW.
- Can use LLMs for data-augmentation as well, not just weak supervision. I.e, we can pass our actual existing reddit posts' data into an LLM to give it some ideas and show it some inspiration, and use that to get it to generate more reddit stories that are likely to be viral within a specific chosen niche of our choice.
    - Additionally, instead of just passing good known stories into a general-purpose LLM (like Gemini or GPT-based LLMs) like we are right now, we could train or fine-tune a domain-specific LLM that is dedicated for this task (generating reddit posts within a specific niche that are likely to go viral).

First, we need to collect data.
There aren't many very good datasets, so we need to create our own.
This will be done through data scraping via PRAW and weak supervision via a chosen LLM (I am using Gemini for this).

First, scraping data via PRAW.

In [None]:
# Install all required dependencies

%pip install -r requirements.txt --user # --user flag is needed because one of the dependencies (google-genai) needs to access a script that is hidden in non-administrator environments.

In [2]:
# Make your necessary imports
import praw
import pandas as pd
import time
from google import genai
import numpy as np

In [3]:
# Initialize reddit client session

CLIENT_ID = "0xeiOSktNDiHBw"
CLIENT_SECRET = "c-bNB_P5wRjHZmaD1eaJnx0D3mlr8Q"
USER_AGENT = "sestee 1.0"
cli = praw.Reddit(
        client_id=CLIENT_ID,
        client_secret=CLIENT_SECRET,
        user_agent=USER_AGENT
)


Version 7.7.1 of praw is outdated. Version 7.8.1 was released Friday October 25, 2024.


In [5]:
# Declare a way for you to scrape posts from a subreddit of your choice.
def scrape_popular_posts(subreddits, limit=100, sort_by="top"):
    posts = []

    for sub_name in subreddits:
        subreddit = cli.subreddit(sub_name)

        if sort_by == "top":
            submissions = subreddit.top(limit=limit)
        elif sort_by == "hot":
            submissions = subreddit.hot(limit=limit)
        elif sort_by == "new":
            submissions = subreddit.new(limit=limit)
        else:
            raise ValueError("Invalid sort_by value. Use 'top', 'hot', or 'new'.")

        for post in submissions:
            post_data = {
                "title": post.title,
                "selftext": post.selftext, # For reference, selftext is the ACTUAL body text of the post
                "subreddit": post.subreddit.display_name,
                "flair": post.link_flair_text,
                "score": post.score,
                "num_comments": post.num_comments,
                "upvote_ratio": post.upvote_ratio,
                "created_utc": post.created_utc,
                "id": post.id,
                "url": post.url
            }
            posts.append(post_data)

    return posts

In [10]:
# Figure out what subreddits you want to scrape from
subreddits = ["AskReddit", "relationships", "AmItheAsshole", "TrueOffMyChest", "TIFU"]
# Scrape the data from the subreddits
data = scrape_popular_posts(subreddits, limit=50, sort_by="top")
# Save the data in a pandas dataframe
df = pd.DataFrame(data)

# Can save the dataframe to a CSV file too!
#df.to_csv("reddit_posts.csv", index=False)
df["niche"] = None # Adding a new column to the dataframe for the niche
# Display the first few rows of the dataframe
df.head(20)

Unnamed: 0,title,selftext,subreddit,flair,score,num_comments,upvote_ratio,created_utc,id,url,niche
0,"People who haven't pooped in 2019 yet, why are...",,AskReddit,,221995,7925,0.91,1546377000.0,ablzuq,https://www.reddit.com/r/AskReddit/comments/ab...,
1,How would you feel about Reddit adding 3 NSFW ...,,AskReddit,,217929,2886,0.87,1611860000.0,l7530r,https://www.reddit.com/r/AskReddit/comments/l7...,
2,Would you watch a show where a billionaire CEO...,,AskReddit,,197603,13327,0.9,1581069000.0,f08dxb,https://www.reddit.com/r/AskReddit/comments/f0...,
3,"What if God came down one day and said ""It's p...",,AskReddit,,195917,10227,0.92,1600611000.0,iwedc5,https://www.reddit.com/r/AskReddit/comments/iw...,
4,How would you feel about a feature where if so...,,AskReddit,,186429,2772,0.9,1572833000.0,draola,https://www.reddit.com/r/AskReddit/comments/dr...,
5,"How would you feel about a ""if you accidentall...",,AskReddit,,182977,4303,0.79,1609883000.0,kr8op6,https://www.reddit.com/r/AskReddit/comments/kr...,
6,Stan Lee has passed away at 95 years old,As many of you know today is day that many of ...,AskReddit,Breaking News,175367,27643,0.87,1542052000.0,9whgf4,https://www.reddit.com/r/AskReddit/comments/9w...,
7,"Reddit, how would you feel about a law that ba...",,AskReddit,,160344,6728,0.85,1537294000.0,9gx68l,https://www.reddit.com/r/AskReddit/comments/9g...,
8,"Bill Gates said, ""I will always choose a lazy ...",,AskReddit,,154335,14767,0.93,1593522000.0,himsju,https://www.reddit.com/r/AskReddit/comments/hi...,
9,What if Earth is like one of those uncontacted...,,AskReddit,,152112,8568,0.89,1608947000.0,kka536,https://www.reddit.com/r/AskReddit/comments/kk...,


Now we have a good chunk of all the data that we need. We need to clean it.

In [None]:
# Cleaning the data
# Check and do later
#df = df.dropna(subset=["selftext", "title"])  # Drop rows with NaN in 'selftext' or 'title'
#df = df[df["selftext"].str.strip() != ""]  # Drop empty 'selftext' rows

Now that we have our data, we will create a pipeline that allows us to label all the data entries and add a "niche" column via weak supervision. All entries will then be classified.
These are the post classification categories we are planning to classify our posts into.

| Label         | Description                                |
|---------------|--------------------------------------------|
| `advice`      | Help-seeking posts, questions, dilemmas    |
| `story`       | Personal anecdotes with a beginning, middle, end |
| `drama`       | High-stakes conflict, betrayal, gossip      |
| `rant`        | Emotional venting or unfiltered frustration |
| `humor`       | Meme-like, comedic, shitpost-style content  |
| `informative` | Tips, how-tos, PSAs, educational content    |
| `confession`  | Vulnerable personal reveals or identity-based confessions |
| `unknown`     | Doesn‚Äôt fit confidently into other categories|

Note: We can use the `unknown` category to find the biggest weaknesses of our LLM, and we can then possibly fine-tune our LLM later very efficiently by especially targetting its weaknesses that we've detected here.

In [11]:
# Create an instance of the Google GenAI API client
client = genai.Client(api_key="AIzaSyDSyIBzIJ9yVnXYd6sJaE7oZ0Vqnc4kEPM")
model = "gemini-2.0-flash" # There are a LOT of models to choose from. But in my experience, I feel comfortable with AND use 2.0-flash the most. Will look into 2.5 series once they go through stable release.
template_prompt = f"""I want to train a transformer-based classifer that takes in the text of a reddit post and then classifes them into labels [personal advice, story, drama]. I only have a partial dataset for this. Can you help fill the rest for me?
It should JUST classify the post into one niche category. The niche categories I want you to choose from are [advice, story, drama, rant, humor, informative, confession, unknown]. unknown is for when you really are not sure what category the post belongs to.
I don't want anything else in your response aside from the 1-word niche category. I don't want any explanations or anything else. Just the 1-word niche category.
Here is the post's data:

"""

# This above is the main template prompt that will be used with the rest of the reddit post data to create full proper prompts for every single reddit post data entry that we will classify via the API.


In [17]:
# Pipeline to classify each one of the posts, making a call to the API and using the full prompt we made to get the response that contains the niche category we want.
for index, row in df.iterrows():
    if row["selftext"] == "":
        #print("Skipping empty selftext post.")
        continue
    post_data_prompt = f"Title: {row['title']}\nSelftext: {row['selftext']}\n\n"
    #print("Post that will be classified:")
    print(f"Title: {row['title']}")
    print(f"Body text: {row['selftext']}")
    #print("Classifying the post...")
    prompt = template_prompt + post_data_prompt
    #print(prompt)
    response = client.models.generate_content(
        model=model, contents=prompt
    )
    model_niche_guess = response.text
    # It is possible that the model will give NO response (so response.text is None) because our prompt may contain NSFW language (outside our control). 
    # In this case we have to either set the niche to "unknown" or skip the post. I prefer to set it to unknown because it is a valid category still.
    if model_niche_guess is None:
        print("Model returned no response. Setting niche to 'unknown'.")
        model_niche_guess = "unknown"
    print(model_niche_guess + "\n")
    time.sleep(5)  # Sleep for 5 seconds to avoid hitting googles rpm limit
    # Now we need to add the model's guess to the dataframe
    row["niche"] = model_niche_guess
df.to_csv("reddit_posts_with_niches.csv", index=False)  # Save the dataframe with the new column to a CSV file

Title: Stan Lee has passed away at 95 years old
Body text: As many of you know today is day that many of us have dreaded. Stan Lee has passed away at the age of 95. He leaves behind a legacy of superheroes and stories that have touched many people's lives for decades. We wanted to make this thread to honor and remember this wonderful man, so please use it discuss his life, his work, [his cameos](https://thumbs.gfycat.com/RapidClearDungenesscrab-small.gif), etc and what they meant to you. 

Excelsior!

-The AskReddit mods
informative


Title: Without saying what the category is, what are your top five?
Body text:  
unknown


Title: Professor Stephen Hawking has passed away at the age of 76
Body text: We have lost one of the greatest minds in history today as Professor Stephen William Hawking has passed away on March 14, 2018 at the age of 76.

It is a terrible loss and we wanted to create this thread for people to share their thoughts about Professor Hawking, from favorite quotes, to th

In [None]:
# Add the model's guess to the dataframe

In [16]:
prompt_temp = template_prompt + """Title: TIFU by thinking a woman was a boy, and groping her boob. (kind of NSFW, though it happened at work)
Body text: Obligatory this actually happened a little over a year ago, and throwaway because I don't want people on my main account to know what I do for a living.

So, I work for the TSA, and have for a few years now. It's a good job overall. I'm underpaid, but the benefits are nice, and I get overtime when I want it.

A little over a year ago, during the week leading up to Christmas, we had some really bad weather that delayed all the flights. I volunteered to stay late so that my coworkers could go home to their families. Most of the work was done anyway, so it was mostly just standing around waiting for the odd latecomer

I was working the AIT (the space tube thingy), when three passengers came up together, a middle-aged man, a middle-aged woman, and a teenage boy. I figure it's a family traveling together for the holidays, and go about my work.

Mom goes through, all is fine. Dad goes through, all is fine.

Kid comes up, I get a good look at him. Hoodie, sweatpants, shortish hair, smooth face. I figure he's about 13, maybe 14.

I hit the button, direct him to wait with me for a moment, and then gesture to the screen, which lit up on his chest area.

I tell him that I have to pat that area down. He's a little nervous, I figure that because he's so young, this is probably his first time getting a pat down, but he says okay, and I start the patdown.

I do the left side of the chest, and feel some moob, which catches me off guard because he didn't look chubby at all.

I move to the right side of the chest, read what's on the hoodie, and it all clicks at once. The hoodie has the name of the local college on it. This is an adult, not a child. He's not wearing sweatpants, \*she\* is wearing yoga pants. She doesn't even know the couple that just came through.

I look at her face, which is bright red, my hand is still on her boob, and I pull it back like I just got bit by a snake.

I immediately call for my supervisor, who comes over and asks what's wrong, and I explain the situation to her.

My supervisor covers her mouth, and at first I thought she was absolutely mortified, but then I realized she's trying not to laugh.

She takes a minute to pull herself together, tells me to go take a break, and finishes screening the passenger herself.

Once that was done, I apologize to the passenger, she tells me it's fine, that it wasn't the first time she was mistaken for a boy, and she probably should have said something before I started touching her. I leave her alone, and go talk to my supervisor to figure out exactly how fired I am.

She tells me to calm down, that it was just an honest mistake, and that she has my back if the passenger files an official complaint, but that probably won't happen, and I shouldn't be worried.

That reassured me a little, but I still groped a woman and ruined Christmas, so I feel like an absolute monster.

I swallow my shame, and finish my shift, then I go into the airport proper to find some food, because I just finished a twelve hour shift and there's no way I have the energy to cook dinner.

I saw my hapless victim sitting at her gate, waiting for her flight. I went up to her to apologize again, and saw that the flight had been delayed until morning (it was about eleven at night).

I apologize again, she says it's fine, and I ask her if she's planning to stay the whole night. She says she has to, all the hotels in the area are book.

I tell her that I'm getting some dinner, and offer to get her some food as well. After all, I already got to second base, I think it's only fair that I buy her dinner.

She agrees, and we go to one of the restaurants that is open late, get some food, and start eating.

She said she gets mistaken for a boy a lot, and it's not a big deal. I told her about how I had long hair and no beard in college, and at the gym people would frequently walk into the men's bathroom, see me, and do a double take to make sure they didn't walk into the ladies' room.

She laughed, and we ended up talking for a few hours, before I finally told her that I had to get home, and apologized again for the accidental molestation.

She said that all is forgiven, if I promise to take her on a real date when she gets back.

I agreed, she gave me her phone number, and I went home, and immediately started texting her. We kept talking until her flight finally left, and when she got back I picked her up at the airport, and a few days later took her on that date that I promised her.

We just celebrated our one year anniversary.

She has long hair now.

&#x200B;

tl;dr: Thought an adult woman was a teenage boy, touched her on the boob, everything worked out better than expected."""

response = client.models.generate_content(
        model=model, contents=prompt_temp
    )
print(response)  




  prompt_temp = template_prompt + """Title: TIFU by thinking a woman was a boy, and groping her boob. (kind of NSFW, though it happened at work)


candidates=None create_time=None response_id=None model_version='gemini-2.0-flash' prompt_feedback=GenerateContentResponsePromptFeedback(block_reason=<BlockedReason.PROHIBITED_CONTENT: 'PROHIBITED_CONTENT'>, block_reason_message=None, safety_ratings=None) usage_metadata=GenerateContentResponseUsageMetadata(cache_tokens_details=None, cached_content_token_count=None, candidates_token_count=None, candidates_tokens_details=None, prompt_token_count=1339, prompt_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=1339)], thoughts_token_count=None, tool_use_prompt_token_count=None, tool_use_prompt_tokens_details=None, total_token_count=1339, traffic_type=None) automatic_function_calling_history=[] parsed=None
