## Part 1: Setup

### Install Packages

In [23]:
# Import Packages
import praw, time, os, pyarrow
from IPython.display import display
from dotenv import load_dotenv, dotenv_values
from requests import Session
import pandas as pd

# Load environment variables from .env file
load_dotenv('.env')
config = dotenv_values()

## Part 2: Collecting Data from Reddit

### Open Reddit Connection

In [24]:
# Create a custom session with a timeout
session = Session()
session.headers.update({'User-Agent': 'praw'})
session.timeout = 10  # Set a timeout of 10 seconds

# Login to Reddit using PRAW
reddit = praw.Reddit(
    client_id=config['CLIENT_ID'],
    client_secret=config['CLIENT_SECRET'],
    requestor_kwargs={"session": session},
    username=config['USERNAME'],
    password=config['PASSWORD'],
    user_agent="CS470 ML Project Access by u/GregorybLafayetteML"
)

# Add some peripheral config data
reddit.config.log_requests = 1
reddit.config.store_json_result = True

# Test the connection
try:
    username = reddit.user.me()
    print("Successfully logged in to Reddit!")
    print(f"Logged in as: u/{username}")
except Exception as e:
    print(f"Failed to log in: {e}")

Successfully logged in to Reddit!
Logged in as: u/GregorybLafayetteML


### Accessing Reddit Data

To access reddit posts, we'll need send a request with the number of post we want to get. The following example finds the top 10 hottest posts on the u/wallstreetbets subreddit. We'll show the post title, score, flair, and URL. 

In [25]:
top_posts = reddit.subreddit('wallstreetbets').hot(limit=10)
print("Top 10 hot posts from r/wallstreetbets:")
for post in top_posts:
    print(f"Title: {post.title}, Score: {post.score}, Flair: {post.link_flair_text}, URL: {post.url}")

Top 10 hot posts from r/wallstreetbets:
Title: What Are Your Moves Tomorrow, April 10, 2025, Score: 226, Flair: Daily Discussion, URL: https://www.reddit.com/r/wallstreetbets/comments/1jvf5zp/what_are_your_moves_tomorrow_april_10_2025/
Title: Weekly Earnings Thread 4/7 - 4/11, Score: 145, Flair: Earnings Thread, URL: https://i.redd.it/3sw567xwstse1.jpeg
Title: Tinfoil hat alert: Looks like insiders got a 20m head start for today’s face ripper rally, Score: 21186, Flair: News, URL: https://i.redd.it/pg8pcxd90vte1.jpeg
Title: Hold onto your butts, Score: 20208, Flair: Gain, URL: https://i.redd.it/ewzhovurnute1.jpeg
Title: BREAKING NEWS: Trump Says Tariffs Paused for 90 Days on Non-Retaliating Countries, Score: 25615, Flair: News, URL: https://www.bloomberg.com/news/live-blog/2025-04-08/trump-tariffs-stock-market-updates?srnd=homepage-europe&embedded-checkout=true
Title: Lost life savings, dad so mad he threatened to come to my school., Score: 9950, Flair: Loss, URL: https://www.reddit.co

For this project, we'll need far more than ten posts at a time. The reddit API will limit our access to 100 posts at a time. Fortunately, the api uses a ListingGenerator which allows us to access our metered connection in sequential blocks. The following example shows how we can utilize this behavior, grabbing blocks of 100 posts at a time. In our example, we'll grab blocks of posts until we reach 5000 posts or our access times out. Notice that the procedure ends early with around 750-800 posts collected. The results are sparce, because our connection either timed out or was metered down by reddit. The latter option is more likely.

In [26]:
# Access the subreddit
subreddit = reddit.subreddit("wallstreetbets")

# Initialize variables
batch_size = 50 # Number of posts per batch
total_posts = 5000  # Total number of posts to fetch
all_posts = []  # To store all the retrieved posts
after = None  # To keep track of the last post for pagination

# Fetch posts in batches
while len(all_posts) < total_posts:
    # Fetch the next batch of posts
    submissions = subreddit.new(limit=batch_size, params={"after": after})
    
    batch_posts = []
    for submission in submissions:
        batch_posts.append(submission)

        # Update the `after` variable with the last submission's fullname
        after = submission.fullname

    # Add the batch to the main list
    all_posts.extend(batch_posts)

    # Exit loop if no more posts are available
    if not batch_posts:
        print("No more posts to fetch.")
        break

    # Optional delay to avoid rate limits
    time.sleep(5)  # Adjust the delay as necessary

# Process the data (example: print the total number of posts fetched)
print(f"Fetched {len(all_posts)} posts in total.")

No more posts to fetch.
Fetched 797 posts in total.


Now that we have collected a large portion of posts/submssions, we'll parse the results and construct a dataframe with this data. We're going to collect more fields from this data than we might need right now, avoiding data limitations in the future.

In [27]:
# Parse are submission objects that we collected.
fields = ('title', 
          'created_utc', 
          'id', 
          'is_original_content', 
          'link_flair_text', 
          'locked',
          'name',
          'num_comments',
          'over_18',
          'permalink',
          'selftext',
          'spoiler',
          'upvote_ratio')
list_of_submissions = []

# Parse each submission into a dictionary of the lised fields.
for submission in all_posts:
    full = vars(submission)
    sub_dict = {field:full[field] for field in fields}
    list_of_submissions.append(sub_dict)

# Create a python dataframe of these submissions.
collected_data = pd.DataFrame.from_records(list_of_submissions)

# Display the dataframe.
display(collected_data)

Unnamed: 0,title,created_utc,id,is_original_content,link_flair_text,locked,name,num_comments,over_18,permalink,selftext,spoiler,upvote_ratio
0,Well didnt happen on monday but still....he wa...,1.744243e+09,1jvklop,False,Discussion,False,t3_1jvklop,1,False,/r/wallstreetbets/comments/1jvklop/well_didnt_...,Do anybody knows how is SHE' playing in this m...,False,1.00
1,Trump Pause,1.744243e+09,1jvklcp,False,Gain,False,t3_1jvklcp,1,False,/r/wallstreetbets/comments/1jvklcp/trump_pause/,,False,1.00
2,PUTS tomorrow,1.744243e+09,1jvkjwz,False,Meme,False,t3_1jvkjwz,2,False,/r/wallstreetbets/comments/1jvkjwz/puts_tomorrow/,These guys cant fit their egos in the same room,False,0.83
3,Revenge Trade vs Informed Trade?,1.744243e+09,1jvkie6,False,Discussion,False,t3_1jvkie6,2,False,/r/wallstreetbets/comments/1jvkie6/revenge_tra...,How do you know when you’re revenge trading vs...,False,1.00
4,Safe to say that Roth conversion was timed well I,1.744243e+09,1jvkgsj,False,Gain,False,t3_1jvkgsj,1,False,/r/wallstreetbets/comments/1jvkgsj/safe_to_say...,"I rarely if ever time the market well, hopeful...",False,1.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...
792,EU may delay first counter-tariffs against U.S...,1.742469e+09,1jfmfnk,False,News,False,t3_1jfmfnk,146,False,/r/wallstreetbets/comments/1jfmfnk/eu_may_dela...,"BRUSSELS, March 20 (Reuters) - The European Un...",False,0.93
793,US companies race to secure import tariff exem...,1.742469e+09,1jfmf49,False,News,False,t3_1jfmf49,56,False,/r/wallstreetbets/comments/1jfmf49/us_companie...,March 20 (Reuters) - Washington's temporary re...,False,0.96
794,"Tesla to recall more than 46,000 Cybertrucks d...",1.742466e+09,1jflrx5,False,News,False,t3_1jflrx5,378,False,/r/wallstreetbets/comments/1jflrx5/tesla_to_re...,,False,0.96
795,"Daily Discussion Thread for March 20, 2025",1.742465e+09,1jflc6z,False,Daily Discussion,False,t3_1jflc6z,15006,False,/r/wallstreetbets/comments/1jflc6z/daily_discu...,This post contains content not supported on ol...,False,0.92


### Saving Reddit Data

In [28]:
# Save the collected data to parquet format
PARQUET_PATH = './data/wallstreetbets-collection.parquet'

# Create a pyarrow schema for the data types.
schema = pyarrow.schema([
    ('title', pyarrow.string()), 
    ('created_utc', pyarrow.float64()),
    ('id', pyarrow.string()), 
    ('is_original_content', pyarrow.bool_()), 
    ('link_flair_text', pyarrow.string()), 
    ('locked', pyarrow.bool_()),
    ('name', pyarrow.string()),
    ('num_comments', pyarrow.int64()),
    ('over_18', pyarrow.bool_()),
    ('permalink', pyarrow.string()),
    ('selftext', pyarrow.string()),
    ('spoiler', pyarrow.bool_()),
    ('upvote_ratio', pyarrow.float64()),
])

# If the parqet does not exist, create it.
if not os.path.exists(PARQUET_PATH):
    collected_data.to_parquet(PARQUET_PATH, engine='pyarrow', schema=schema)
    
# If the data file already exist, merge new data with the existing one.
else:
    old_parquet = pd.read_parquet(PARQUET_PATH, engine='pyarrow', schema=schema)
    new_parquet = pd.concat([old_parquet, collected_data])
    new_parquet = new_parquet.drop_duplicates(subset=['id','title','created_utc','name','permalink'], keep='last').reset_index(drop=True)
    new_parquet.to_parquet(PARQUET_PATH, engine='pyarrow', schema=schema)


## Part _: Getting Comment Data

In [29]:
# Use the new collected data to get comment stuff.
data_for_comment_collection = collected_data

# Given a reddit post id, collect the comments and peripheral data for that post.
def get_comment_data_with_post(post_id: str):
    # Assuming we already have access to reddit.
    post = reddit.submission(id=post_id)
    
    # Note the comment section is more like a forest rather than a list.
    post.comments.replace_more(limit=None)
    for comment in submission.comments.list():
        # This will skip comments that are not top-level.
        if isinstance(comment, praw.models.MoreComments):
            continue
        
        # TODO: How is perfermance as forest grows in dfepth.
        
        # Otherwise, print the details.
        print(comment.author)
        print(comment.score)
        print(comment.created_utc)  # as a Unix timestamp
        print(comment.body)
        
# Use this methodolgy for some posts.
post_ids = data_for_comment_collection['id'][:10] # Get the fist 10 for now.
for post_id in post_ids:
    get_comment_data_with_post(post_id)
    

VisualMod
1
1742463669.0

**User Report**| | | |
:--|:--|:--|:--
**Total Submissions** | 1 | **First Seen In WSB** | 4 years ago
**Total Comments** | 11 | **Previous Best DD** | 
**Account Age** | 13 years | | 

[**Join WSB Discord**](http://discord.gg/wsbverse)
throwaway_0x90
5
1742465330.0
Can you explain this?

https://preview.redd.it/num6kav0ktpe1.png?width=754&format=png&auto=webp&s=c9394ebcc18e9a9a857d2cde6fe591c6a232af2a
effdallas
3
1742472587.0
I stopped there once. Imwoikd t feed their “New York strip steak” to my dog.  Food is gross.  Puts 
Kahluabomb
3
1742487700.0
They just built a huge warehouse down the road from my last house outside of Portland.

I love GO and they're popping up in more places. I never even considered buying their stock.
FlyingDiscsandJams
3
1742493303.0
We used to call this place The Used Food Store. Good times!
Lazy-Gene-7284
2
1742476959.0
Only thing lower margin than grocery in general is grocery focused on the lowest income group. Buy anything else

## Part _: Analysis of Reddit Data

In [30]:
# Access the data which we have collected.
PARQUET_PATH = './data/wallstreetbets-collection.parquet'
reddit_data = pd.read_parquet(PARQUET_PATH, engine='pyarrow', schema=schema)

# Show our greater data source
display(reddit_data)

Unnamed: 0,title,created_utc,id,is_original_content,link_flair_text,locked,name,num_comments,over_18,permalink,selftext,spoiler,upvote_ratio
0,Tariffs bonanza,1.743624e+09,1jpyan4,False,Discussion,False,t3_1jpyan4,16,False,/r/wallstreetbets/comments/1jpyan4/tariffs_bon...,How will the tariffs affect our market? Are we...,False,0.77
1,🤔,1.743624e+09,1jpy97b,False,Discussion,False,t3_1jpy97b,4,False,/r/wallstreetbets/comments/1jpy97b/_/,,False,0.96
2,"Is It Just Me, or Is Robinhood’s Security Stil...",1.743623e+09,1jpxza5,False,Discussion,False,t3_1jpxza5,8,False,/r/wallstreetbets/comments/1jpxza5/is_it_just_...,I feel like it’s pretty obvious (at least to m...,False,0.83
3,200 puts (~10k): If NVDA goes to $90 this week...,1.743393e+09,1jnuj43,False,YOLO,False,t3_1jnuj43,171,False,/r/wallstreetbets/comments/1jnuj43/200_puts_10...,"If it doesn’t, yall will cover my bet, right?\n",False,0.95
4,All in on Hood,1.743015e+09,1jkjfsa,False,YOLO,False,t3_1jkjfsa,88,False,/r/wallstreetbets/comments/1jkjfsa/all_in_on_h...,Perfect entry today in my opinion. Very simple...,False,0.66
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1338,EU may delay first counter-tariffs against U.S...,1.742469e+09,1jfmfnk,False,News,False,t3_1jfmfnk,146,False,/r/wallstreetbets/comments/1jfmfnk/eu_may_dela...,"BRUSSELS, March 20 (Reuters) - The European Un...",False,0.93
1339,US companies race to secure import tariff exem...,1.742469e+09,1jfmf49,False,News,False,t3_1jfmf49,56,False,/r/wallstreetbets/comments/1jfmf49/us_companie...,March 20 (Reuters) - Washington's temporary re...,False,0.96
1340,"Tesla to recall more than 46,000 Cybertrucks d...",1.742466e+09,1jflrx5,False,News,False,t3_1jflrx5,378,False,/r/wallstreetbets/comments/1jflrx5/tesla_to_re...,,False,0.96
1341,"Daily Discussion Thread for March 20, 2025",1.742465e+09,1jflc6z,False,Daily Discussion,False,t3_1jflc6z,15006,False,/r/wallstreetbets/comments/1jflc6z/daily_discu...,This post contains content not supported on ol...,False,0.92


## Part _: Sentiment Analysis

### Setup Tools

In [31]:
import nltk, re
nltk.download(['punkt',
               'punkt_tab',
               'stopwords',
               'vader_lexicon',
               'names',
               'averaged_perceptron_tagger',
               'wordnet'])

from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

[nltk_data] Downloading package punkt to /home/bengr/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/bengr/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /home/bengr/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/bengr/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package names to /home/bengr/nltk_data...
[nltk_data]   Package names is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/bengr/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /home/bengr/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Create some basic tools.

In [32]:
# Initialize the Sentiment Intensity Analyzer
sia = SentimentIntensityAnalyzer()
stop_words = set(stopwords.words('english'))

# Function to analyze sentiment of a single comment
def analyze_sentiment(comment):
    # Tokenize the comment
    tokens = word_tokenize(comment.lower())
    
    # Remove stop words
    filtered_tokens = [word for word in tokens if word not in stop_words]
    
    # Get sentiment scores
    sentiment_scores = sia.polarity_scores(' '.join(filtered_tokens))
    
    return sentiment_scores

# Function to clean text
def clean_text(text):
    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    
    # Remove special characters
    text = re.sub(r'\@\w+|\#', '', text)
    
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    return text

### Analyse a single submission

In [None]:
# Analyze the sentiment of a single submission.
# We'll look at a Discussion post with the id: '1jpxza5s', talking about robinhoods security
submission = reddit.submission(id='1jpxza5')

# Show the results of the analysis.
sentiment_scores = analyze_sentiment(submission.title)
print(f"Submission: {submission.title}")
print(f"Sentiment Scores: {sentiment_scores}")
print("-" * 80)

Submission: Is It Just Me, or Is Robinhood’s Security Still a Concern?
Sentiment Scores: {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
--------------------------------------------------------------------------------


### Creating a database of reddit threads