## Part 1: Setup

### Import Packages

In [31]:
# Import Packages
import praw, time, json
from IPython.display import display
from dotenv import load_dotenv, dotenv_values
from requests import Session
import pandas as pd

# Load environment variables from .env file
load_dotenv('.env')
config = dotenv_values()
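
Because the credentials are read from a `.env` file, a missing or empty entry only surfaces later as a failed login. The following is a minimal sketch of an early check, assuming the four keys used in the next cell. The helper `missing_keys` and the constant `REQUIRED_KEYS` are hypothetical names introduced for illustration; they are not part of python-dotenv or PRAW.

```python
# Hypothetical helper, not part of python-dotenv or PRAW.
REQUIRED_KEYS = ("CLIENT_ID", "CLIENT_SECRET", "USERNAME", "PASSWORD")

def missing_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in `config`."""
    return [key for key in required if not config.get(key)]

# Stand-in dictionary for illustration; in the notebook, pass the `config`
# object returned by dotenv_values() instead.
example_config = {"CLIENT_ID": "abc", "CLIENT_SECRET": "", "USERNAME": "someone"}
print(missing_keys(example_config))  # ['CLIENT_SECRET', 'PASSWORD']
```

Calling `missing_keys(config)` right after loading the file and stopping when the list is non-empty gives a clearer error than a failed authentication request.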

### Open Reddit Connection

In [32]:
# Create a custom session with its own User-Agent header
session = Session()
session.headers.update({'User-Agent': 'praw'})

# Login to Reddit using PRAW
# Note: requests ignores a `timeout` attribute set on a Session, so the
# timeout is passed to PRAW as a configuration option instead.
reddit = praw.Reddit(
    client_id=config['CLIENT_ID'],
    client_secret=config['CLIENT_SECRET'],
    requestor_kwargs={"session": session},
    username=config['USERNAME'],
    password=config['PASSWORD'],
    user_agent="CS470 ML Project Access by u/GregorybLafayetteML",
    timeout=10  # Seconds to wait for each request
)

# Add some peripheral config data
reddit.config.log_requests = 1
reddit.config.store_json_result = True

# Test the connection
try:
    username = reddit.user.me()
    print("Successfully logged in to Reddit!")
    print(f"Logged in as: u/{username}")
except Exception as e:
    print(f"Failed to log in: {e}")

Successfully logged in to Reddit!
Logged in as: u/GregorybLafayetteML


## Part 2: Accessing Reddit Data

To access Reddit posts, we need to send a request that specifies how many posts we want. The following example retrieves the 10 hottest posts on the r/wallstreetbets subreddit and prints each post's title, score, flair, and URL.

In [33]:
top_posts = reddit.subreddit('wallstreetbets').hot(limit=10)
print("Top 10 hot posts from r/wallstreetbets:")
for post in top_posts:
    print(f"Title: {post.title}, Score: {post.score}, Flair: {post.link_flair_text}, URL: {post.url}")

Top 10 hot posts from r/wallstreetbets:
Title: Weekly Earnings Thread 3/31 - 4/4, Score: 96, Flair: Earnings Thread, URL: https://i.redd.it/tjvgxuyo0gre1.jpeg
Title: Daily Discussion Thread for April 02, 2025, Score: 227, Flair: Daily Discussion, URL: https://www.reddit.com/r/wallstreetbets/comments/1jpkv70/daily_discussion_thread_for_april_02_2025/
Title: Tesla first quarter deliveries: 336,681 delivered, 362,615 produced, Score: 4479, Flair: News, URL: https://ir.tesla.com/press-release/tesla-first-quarter-2025-production-deliveries-and-deployments
Title: Tesla shares rise on unconfirmed report Elon Musk could be leaving DOGE post soon, Score: 1807, Flair: News, URL: https://www.cnbc.com/2025/04/02/tesla-shares-rise-on-unconfirmed-report-elon-musk-could-be-leaving-doge-post-soon.html
Title: Another Recession indicator?, Score: 444, Flair: Meme, URL: https://www.reddit.com/r/wallstreetbets/comments/1jprerv/another_recession_indicator/
Title: ‘Twas the Night Before Tariffs, Score: 6033

For this project, we'll need far more than ten posts at a time. The Reddit API limits us to 100 posts per request. Fortunately, PRAW exposes listings through a ListingGenerator, which lets us walk through our metered connection in sequential blocks. The following example uses this behavior to grab blocks of 100 posts at a time until we reach 5000 posts or our access times out. Notice that the procedure ends early with around 750-800 posts collected. The results are sparse because our connection either timed out or was metered down by Reddit; the latter is more likely. Reddit listings are also generally capped at roughly 1000 items, which bounds how far back this approach can reach.

In [34]:
# Access the subreddit
subreddit = reddit.subreddit("wallstreetbets")

# Initialize variables
batch_size = 100  # Number of posts per batch
total_posts = 5000  # Total number of posts to fetch
all_posts = []  # To store all the retrieved posts
after = None  # To keep track of the last post for pagination

# Fetch posts in batches
while len(all_posts) < total_posts:
    # Fetch the next batch of posts
    submissions = subreddit.new(limit=batch_size, params={"after": after})
    
    batch_posts = []
    for submission in submissions:
        batch_posts.append(submission)

        # Update the `after` variable with the last submission's fullname
        after = submission.fullname

    # Add the batch to the main list
    all_posts.extend(batch_posts)

    # Exit loop if no more posts are available
    if not batch_posts:
        print("No more posts to fetch.")
        break

    # Optional delay to avoid rate limits
    time.sleep(1)  # Adjust the delay as necessary

# Process the data (example: print the total number of posts fetched)
print(f"Fetched {len(all_posts)} posts in total.")

No more posts to fetch.
Fetched 787 posts in total.
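
The cursor logic in the loop above can be tried without touching Reddit at all. The following is a minimal sketch of the same `after`-based pagination against a stand-in data source. The names `fetch_all`, `fake_fetch_page`, and `FAKE_POSTS` are hypothetical and exist only for this illustration; the real listing endpoint may behave differently (for example, by returning fewer items than requested).

```python
def fetch_all(fetch_page, batch_size=100, max_items=5000):
    """Collect items page by page using an `after` cursor.

    `fetch_page(limit, after)` is a hypothetical callable that returns a list
    of (fullname, payload) tuples located after the `after` cursor.
    """
    collected = []
    after = None
    while len(collected) < max_items:
        # Never request more than is still needed to reach max_items.
        limit = min(batch_size, max_items - len(collected))
        page = fetch_page(limit=limit, after=after)
        if not page:
            break  # The listing is exhausted.
        collected.extend(page)
        after = page[-1][0]  # Cursor = fullname of the last item seen.
    return collected

# Stand-in data source: 250 fake posts served in pages, like a listing endpoint.
FAKE_POSTS = [(f"t3_{i:04d}", {"title": f"post {i}"}) for i in range(250)]

def fake_fetch_page(limit, after=None):
    names = [name for name, _ in FAKE_POSTS]
    start = 0 if after is None else names.index(after) + 1
    return FAKE_POSTS[start:start + limit]

posts = fetch_all(fake_fetch_page, batch_size=100, max_items=5000)
print(len(posts))  # 250
```

As in the notebook run, the loop stops when a page comes back empty rather than when the requested total is reached, which is why asking for 5000 items can still yield far fewer.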


Now that we have collected a large batch of posts (submissions), we'll parse the results and build a DataFrame from them. We collect more fields than we need right now so that we don't run into data limitations later.

In [35]:
# Parse the submission objects that we collected.
fields = ('title', 
          'created_utc', 
          'distinguished', 
          'edited', 
          'id', 
          'is_original_content', 
          'link_flair_text', 
          'locked',
          'name',
          'num_comments',
          'over_18',
          'permalink',
          'selftext',
          'spoiler',
          'upvote_ratio')
list_of_submissions = []

# Parse each submission into a dictionary of the listed fields.
for submission in all_posts:
    full = vars(submission)
    sub_dict = {field:full[field] for field in fields}
    list_of_submissions.append(sub_dict)

# Create a pandas DataFrame from these submissions (one row per dictionary).
collected_data = pd.DataFrame(list_of_submissions)

# Display the dataframe.
display(collected_data)

Unnamed: 0,title,created_utc,distinguished,edited,id,is_original_content,link_flair_text,locked,name,num_comments,over_18,permalink,selftext,spoiler,upvote_ratio
0,"OnlyFans founder, crypto foundation submit lat...",1.743618e+09,,False,1jpvkm3,False,News,False,t3_1jpvkm3,1,False,/r/wallstreetbets/comments/1jpvkm3/onlyfans_fo...,,False,1.00
1,Can we get a WSB broadcast?,1.743618e+09,,False,1jpvjrn,False,Discussion,False,t3_1jpvjrn,1,False,/r/wallstreetbets/comments/1jpvjrn/can_we_get_...,I’m tired of listening to CNBC using big words...,False,1.00
2,Chinese firms place $16 billion in order for n...,1.743617e+09,,False,1jpv6ez,False,News,False,t3_1jpv6ez,5,False,/r/wallstreetbets/comments/1jpv6ez/chinese_fir...,,False,0.97
3,"Amazon bids to buy TikTok as deadline looms, N...",1.743613e+09,,False,1jptm06,False,News,False,t3_1jptm06,46,False,/r/wallstreetbets/comments/1jptm06/amazon_bids...,[https://finance.yahoo.com/news/amazon-bid-buy...,False,0.97
4,Rivian posts sharp fall in quarterly deliverie...,1.743609e+09,,False,1jps2rr,False,News,False,t3_1jps2rr,54,False,/r/wallstreetbets/comments/1jps2rr/rivian_post...,(Reuters) - Rivian reported a 36% decline in f...,False,0.97
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
782,Wow. Sold Friday held calls at open. Swapped f...,1.741034e+09,,False,1j2sinn,False,Gain,False,t3_1j2sinn,11,False,/r/wallstreetbets/comments/1j2sinn/wow_sold_fr...,Checkout the last photo. I’m finally in the gr...,False,0.84
783,167% QQQ PUTS 23.5k PROFITS,1.741033e+09,,False,1j2s2yo,False,Gain,False,t3_1j2s2yo,18,False,/r/wallstreetbets/comments/1j2s2yo/167_qqq_put...,DUMP IT 🥭🥭🥭🥭,False,0.96
784,13K 0DTE SPY 585 PUT --> +17K Profit,1.741032e+09,,False,1j2ry4j,False,Gain,False,t3_1j2ry4j,3,False,/r/wallstreetbets/comments/1j2ry4j/13k_0dte_sp...,Bear wins again \n\n\nhttps://preview.redd.it...,False,0.82
785,Next up for EU sovereignty: cloud computing,1.741030e+09,,False,1j2qy9i,False,DD,False,t3_1j2qy9i,15,False,/r/wallstreetbets/comments/1j2qy9i/next_up_for...,(resubmitted with positions)\n\n**TL;DR** If E...,False,0.92
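
The `created_utc` column above is a Unix timestamp in seconds, which is hard to read in the table. The following is a minimal sketch of a per-submission parser that also adds a timezone-aware datetime and tolerates missing attributes. The function `submission_to_record` and the `created_at` key are hypothetical names introduced for illustration, and a `SimpleNamespace` stands in for a real submission object.

```python
from datetime import datetime, timezone
from types import SimpleNamespace

def submission_to_record(submission, fields):
    """Copy the listed attributes into a dict, using None for missing ones."""
    record = {field: getattr(submission, field, None) for field in fields}
    # created_utc is a Unix timestamp in seconds; add a readable UTC datetime.
    if record.get("created_utc") is not None:
        record["created_at"] = datetime.fromtimestamp(
            record["created_utc"], tz=timezone.utc
        )
    return record

# Stand-in object for illustration; it deliberately lacks `upvote_ratio`.
fake_submission = SimpleNamespace(
    id="abc123", title="Example post", created_utc=1743618000.0
)
record = submission_to_record(
    fake_submission, ("id", "title", "created_utc", "upvote_ratio")
)
print(record["created_at"].isoformat())
print(record["upvote_ratio"])  # None, because the attribute is missing
```

Using `getattr` with a default avoids the `KeyError` that a plain dictionary lookup raises when one submission lacks a field, at the cost of silently producing `None` values that need to be handled during analysis.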


## Part 3: Analysis of Reddit Data

In [36]:
## 
