Step 2: Data Preparation

Load Reddit Dataset: 
Import the Wallstreetbets dataset using pandas.
Inspect the dataset structure (columns, missing values).

Preprocess Reddit Posts:
Clean the text data:
Remove punctuation, special characters, and stopwords.
Convert text to lowercase.

Extract useful fields:
Date: Convert timestamps to proper datetime format.
Title/Text: Focus on titles or combine with post text.
Filter data for relevant timeframes (e.g., GME short squeeze period).

Stock Ticker Extraction (GME):
Use Named Entity Recognition (NER) to extract stock tickers and company mentions.
Example tools: spaCy, or regex patterns that match tickers such as GME.

Save Cleaned Data:
Save the cleaned dataset as a CSV for further use.
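
The outline lists stopword removal as part of cleaning. A minimal sketch of that step, assuming whitespace tokenization on already lowercased text. The `STOPWORDS` set and the `remove_stopwords` helper are made up for illustration; a real run would likely use a fuller list, such as the ones shipped with spaCy or NLTK.

```python
# Tiny illustrative stopword set (assumption; not a standard list)
STOPWORDS = {"the", "a", "an", "to", "and", "is", "of", "in"}


def remove_stopwords(text: str) -> str:
    """Drop stopwords from whitespace-tokenized, already lowercased text."""
    return " ".join(tok for tok in text.split() if tok not in STOPWORDS)


print(remove_stopwords("gme is going to the moon"))  # gme going moon
```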

In [77]:
import pandas as pd
import numpy as np
import torch
import re
import emoji
import spacy
from transformers import pipeline


In [78]:
reddit_data = pd.read_csv('data/reddit_wsb.csv', sep=",")

In [79]:
reddit_data.iloc[12]['body']

"You guys are champs. GME... who would have thought a bunch of crazy retards could reach the front page of the New York Times.\n\nAnd when you're done with GME, it's time to punish the big banks who have been suppressing the price of silver since the Bear Stearns / JPM merge. It's all in fucking Bloomberg:\n\n[https://www.bloomberg.com/news/articles/2019-09-16/precious-metals-traders-charged-with-rigging-futures-contracts](https://www.bloomberg.com/news/articles/2019-09-16/precious-metals-traders-charged-with-rigging-futures-contracts)\n\nThere's an excellent explanation of their scheme here\n\n[https://www.listennotes.com/podcasts/palisades-gold-radio/ted-butler-squeezing-out-the-hqxQ5mOdt02/](https://www.listennotes.com/podcasts/palisades-gold-radio/ted-butler-squeezing-out-the-hqxQ5mOdt02/)\n\nYou think GME squeezed hard? Look what happened to silver half a year ago in July:\n\n&#x200B;\n\nhttps://preview.redd.it/3yssvdm7y1e61.png?width=2588&format=png&auto=webp&s=25d4cfa973d57f1f60

In [80]:
def clean_text(text: str) -> str:
    # Convert emojis to their names first; the punctuation filter below would otherwise delete them.
    text = emoji.demojize(text, delimiters=(' ', ' '))  # emojis may carry sentiment information
    text = re.sub(r'http\S+|www\.\S+', '', text)  # remove hyperlinks
    text = re.sub(r'[^\w\s$]', '', text)  # remove punctuation, keep '$' for tickers
    text = re.sub(r'\s+', ' ', text)  # collapse newlines and repeated spaces
    text = text.lower()
    return text
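
To see what the regex steps do to a raw post, here is a small stdlib-only sketch. It mirrors the hyperlink, punctuation and whitespace steps (with the dot in `www` escaped) and leaves the emoji step out so that it has no third-party dependency. The helper name `strip_and_normalize` and the sample string are made up for illustration.

```python
import re


def strip_and_normalize(text: str) -> str:
    """Regex-only cleaning: drop links and punctuation, collapse whitespace, lowercase."""
    text = re.sub(r'http\S+|www\.\S+', '', text)  # remove hyperlinks
    text = re.sub(r'[^\w\s$]', '', text)          # remove punctuation, keep '$'
    text = re.sub(r'\s+', ' ', text)              # collapse newlines and repeated spaces
    return text.lower()


raw = "GME to the moon!!\n\nSee https://example.com/dd and $AMC, too."
print(strip_and_normalize(raw))  # gme to the moon see and $amc too
```

The `$` survives the punctuation filter, so `$`-prefixed tickers are still recognizable afterwards.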

In [81]:
# Check for NA values. All of them are in 'body'.
print(f"Na values: \n{reddit_data.isna().sum()}")
reddit_data = reddit_data.dropna()

# Only around half of the data is left after this. Titles without a body are probably not important for us, so for now we continue with the remaining rows.
# The body may have held an image, but it is far more likely that it was deleted (I checked a few).
# The number of comments may also be interesting, as posts with many comments might carry more value for us.

# score - We don't know exactly what the score means. It may be upvotes.
# created - We drop this column, since we already have the timestamp column.

Na values: 
title            0
score            0
id               0
url              0
comms_num        0
created          0
body         28449
timestamp        0
dtype: int64


In [82]:
reddit_data['title'] = reddit_data['title'].map(clean_text)
reddit_data['body'] = reddit_data['body'].map(clean_text)
reddit_data = reddit_data.drop(columns=['url', 'created'])
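
The outline also calls for converting timestamps to datetime and filtering to a relevant timeframe, which the cells up to this point do not do. A minimal sketch on a toy frame: only the `timestamp` column name comes from the dataset, and the window boundaries are placeholders, not an authoritative definition of the squeeze period.

```python
import pandas as pd

# Toy frame standing in for reddit_data; only the 'timestamp' column name comes from the notebook
df = pd.DataFrame({
    "title": ["a", "b", "c"],
    "timestamp": ["2021-01-28 21:30:35", "2021-02-15 09:00:00", "2021-08-02 17:11:36"],
})

# Parse the string timestamps into a datetime column
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Hypothetical window (assumption; adjust to the period of interest)
start, end = pd.Timestamp("2021-01-01"), pd.Timestamp("2021-03-01")
in_window = df[df["timestamp"].between(start, end)]

print(len(in_window))  # 2
```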

In [83]:
# Augment the dataframe: capture every row that mentions a ticker and discusses a specific stock,
# especially GME and the other priority tickers.

# Tickers of interest (matched with or without a leading $)
priority_tickers = ['TSLA', 'AMZN', 'AMC', 'SLV', 'AG', 'GME']

# Regex to capture:
# 1. Priority tickers (with or without $)
# 2. Other tickers, but only if they have the $ sign
# Lookbehinds replace a leading \b, which cannot match between whitespace and '$'.
ticker_pattern = r'(?:(?<!\w)\$([A-Z]{2,5})|(?<![\w$])(' + '|'.join(priority_tickers) + r'))\b'

# Function to extract relevant tickers from text
def extract_tickers(text: str) -> str:
    matches = re.findall(ticker_pattern, text, re.IGNORECASE)
    # Each match is a (dollar_ticker, priority_ticker) tuple; exactly one entry is non-empty
    match_list = [(dollar or priority).lower() for dollar, priority in matches]
    # dict.fromkeys removes duplicates while keeping first-occurrence order
    return ','.join(dict.fromkeys(match_list))

# Apply extraction to the DataFrame (tickers are stored lowercase and comma-separated)
reddit_data['extracted_tickers'] = reddit_data['body'].apply(extract_tickers)
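
One regex detail is worth checking when matching `$`-prefixed tickers: `\b` only asserts a boundary between a word character and a non-word character, so `\b\$` matches only when the `$` directly follows a word character, never after a space. The sketch below compares a pattern with a leading `\b` against a lookbehind variant. The names `leading_boundary`, `lookbehind` and `find`, and the sample sentence, are made up for illustration.

```python
import re

tickers = ['TSLA', 'AMZN', 'AMC', 'SLV', 'AG', 'GME']
alternation = '|'.join(tickers)

# Variant with a leading \b in front of the dollar sign
leading_boundary = re.compile(
    r'\b(?:\$([A-Z]{2,5})|(?<!\$)(' + alternation + r'))\b', re.IGNORECASE
)
# Variant that uses lookbehinds instead of the leading \b
lookbehind = re.compile(
    r'(?:(?<!\w)\$([A-Z]{2,5})|(?<![\w$])(' + alternation + r'))\b', re.IGNORECASE
)


def find(pattern: re.Pattern, text: str) -> list:
    """Return matched tickers in lowercase, whichever group captured them."""
    return [(a or b).lower() for a, b in pattern.findall(text)]


sample = "buy $pltr and $gme, hold amc"
print(find(leading_boundary, sample))  # ['amc']
print(find(lookbehind, sample))        # ['pltr', 'gme', 'amc']
```

With the leading `\b`, both `$pltr` and `$gme` are missed because each `$` follows a space.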

In [84]:
# Flag all rows that mention GME.
# In the first iterations, I kept only the posts that discussed GME exclusively.
reddit_data['gme'] = reddit_data['extracted_tickers'].str.contains('gme').astype(int)
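
A substring check also flags any extracted ticker that merely contains `gme` (for example a `$gmed` mention). If that matters, an exact match on the comma-separated entries is one option. A minimal sketch; the helper name `mentions` is hypothetical.

```python
def mentions(extracted: str, ticker: str) -> int:
    """Return 1 if `ticker` is one of the comma-separated entries, else 0."""
    return int(ticker in extracted.split(","))


print(mentions("gme,amc", "gme"))  # 1
print(mentions("gmed", "gme"))     # 0
print(int("gme" in "gmed"))        # 1 (plain substring check)
```

On the dataframe this could be applied with `reddit_data['extracted_tickers'].apply(mentions, ticker='gme')`.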

In [85]:
# Inspect a single post and title
title = reddit_data.iloc[2]['title']
text = reddit_data.iloc[2]['body']
print(f"Title: {title}")
print(f"Body: {text}")

Title: this is the moment
Body: life isnt fair my mother always told me that when i would complain about arbitrary treatment i would play by the rules and someone else would ignore them when they would win i would appeal to the first authority for an explanation are you going to let them get away with this life isnt fair no it is not the game is the game always in this moment the fascade cracks further when the first breach was made i do not know perhaps it was socrates but today i see thousands millions once they were laughing luxuries falling out of their disgusting diseased mouths as they cackled the unmistakable stench of derision carried on their breath they told anyone outside of their elite class that we were fools for even trying they told us that we were naive we needed networks to be successful we needed polish we needed expertise we needed them the game is the game always they are no longer laughing their odious oeuvre still wafts through the air while the rot and hate and c

In [86]:
# Preview the GME rows; the result is not assigned, so the CSV below contains all rows.
reddit_data[reddit_data['gme'] == 1]
reddit_data.to_csv('data/rd_clean.csv', sep=',', index=False)
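
One thing to keep in mind when loading the saved file again: empty strings in `extracted_tickers` are written as empty fields, and `pd.read_csv` turns empty fields into NaN by default. The sketch below shows the round trip on a toy frame, using an in-memory buffer instead of `data/rd_clean.csv`. The toy rows are made up; `keep_default_na` and `parse_dates` are regular `read_csv` parameters.

```python
import io

import pandas as pd

# Toy frame with the same kind of columns as the cleaned data
df = pd.DataFrame({
    "timestamp": ["2021-01-28 21:30:35", "2021-01-28 21:26:27"],
    "extracted_tickers": ["gme", ""],
    "gme": [1, 0],
})

buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Default read: the empty ticker field comes back as NaN
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Keep empty strings as-is and parse the timestamp column
buffer.seek(0)
careful_read = pd.read_csv(buffer, parse_dates=["timestamp"], keep_default_na=False)

print(default_read["extracted_tickers"].isna().sum())  # 1
print(careful_read["extracted_tickers"].tolist())      # ['gme', '']
```

Calling `fillna('')` on the column after a default read is an alternative when other columns should keep their NaN handling.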

In [87]:
reddit_data

Unnamed: 0,title,score,id,comms_num,body,timestamp,extracted_tickers,gme
2,exit the system,0,l6uhhn,47,the ceo of nasdaq pushed to halt trading to gi...,2021-01-28 21:30:35,gme,1
6,short stock doesnt have an expiration date,317,l6uf6d,53,hedgefund whales are spreading disinfo saying ...,2021-01-28 21:26:27,,0
7,this is the moment,405,l6ub9l,178,life isnt fair my mother always told me that w...,2021-01-28 21:19:31,,0
10,we need to keep this movement going we all can...,222,l6uao1,70,i believe right now is one of those rare oppo...,2021-01-28 21:18:25,"gme,amc",1
12,once youre done with gme $ag and $slv the gent...,0,l6u9wu,16,you guys are champs gme who would have thought...,2021-01-28 21:17:10,gme,1
...,...,...,...,...,...,...,...,...
53181,ten year price prediction for tsla,156,owfbxp,204,its all contingent on them mastering fsd but i...,2021-08-02 17:11:36,,0
53182,what i learned investigating sava fud spreaders,238,owd2pn,87,tldr three bitter scientists partnered up with...,2021-08-02 15:03:27,,0
53183,daily popular tickers thread for august 02 202...,228,owd1a5,1070,your daily hype thread please keep the shitpo...,2021-08-02 15:01:03,,0
53185,daily discussion thread for august 02 2021,338,owbfjf,11688,your daily trading discussion thread please ke...,2021-08-02 13:00:16,,0
