## Part 1: Setup

### Install Packages

In [9]:
# Import Packages
import praw, time, os, pyarrow
from IPython.display import display
from dotenv import load_dotenv, dotenv_values
from requests import Session
import pandas as pd

# Load environment variables from .env file
load_dotenv('.env')
config = dotenv_values()

## Part 2: Collecting Data from Reddit

### Open Reddit Connection

In [10]:
# Create a custom session with a timeout
session = Session()
session.headers.update({'User-Agent': 'praw'})
session.timeout = 10  # Set a timeout of 10 seconds

# Login to Reddit using PRAW
reddit = praw.Reddit(
    client_id=config['CLIENT_ID'],
    client_secret=config['CLIENT_SECRET'],
    requestor_kwargs={"session": session},
    username=config['USERNAME'],
    password=config['PASSWORD'],
    user_agent="CS470 ML Project Access by u/GregorybLafayetteML"
)

# Add some peripheral config data
reddit.config.log_requests = 1
reddit.config.store_json_result = True

# Test the connection
try:
    username = reddit.user.me()
    print("Successfully logged in to Reddit!")
    print(f"Logged in as: u/{username}")
except Exception as e:
    print(f"Failed to log in: {e}")

Successfully logged in to Reddit!
Logged in as: u/GregorybLafayetteML


### Accessing Reddit Data

To access reddit posts, we'll need send a request with the number of post we want to get. The following example finds the top 10 hottest posts on the u/wallstreetbets subreddit. We'll show the post title, score, flair, and URL. 

In [11]:
top_posts = reddit.subreddit('wallstreetbets').hot(limit=10)
print("Top 10 hot posts from r/wallstreetbets:")
for post in top_posts:
    print(f"Title: {post.title}, Score: {post.score}, Flair: {post.link_flair_text}, URL: {post.url}")

Top 10 hot posts from r/wallstreetbets:
Title: What Are Your Moves Tomorrow, April 10, 2025, Score: 4, Flair: Daily Discussion, URL: https://www.reddit.com/r/wallstreetbets/comments/1jvf5zp/what_are_your_moves_tomorrow_april_10_2025/
Title: Weekly Earnings Thread 4/7 - 4/11, Score: 142, Flair: Earnings Thread, URL: https://i.redd.it/3sw567xwstse1.jpeg
Title: BREAKING NEWS: Trump Says Tariffs Paused for 90 Days on Non-Retaliating Countries, Score: 16932, Flair: News, URL: https://www.bloomberg.com/news/live-blog/2025-04-08/trump-tariffs-stock-market-updates?srnd=homepage-europe&embedded-checkout=true
Title: Hold onto your butts, Score: 5480, Flair: Gain, URL: https://i.redd.it/ewzhovurnute1.jpeg
Title: China announces 84% retaliatory tariffs on US goods, Score: 28652, Flair: News, URL: https://edition.cnn.com/2025/04/09/business/china-us-tariffs-retaliation-hnk-intl/index.html
Title: Lost life savings, dad so mad he threatened to come to my school., Score: 5441, Flair: Loss, URL: https:

For this project, we'll need far more than ten posts at a time. The reddit API will limit our access to 100 posts at a time. Fortunately, the api uses a ListingGenerator which allows us to access our metered connection in sequential blocks. The following example shows how we can utilize this behavior, grabbing blocks of 100 posts at a time. In our example, we'll grab blocks of posts until we reach 5000 posts or our access times out. Notice that the procedure ends early with around 750-800 posts collected. The results are sparce, because our connection either timed out or was metered down by reddit. The latter option is more likely.

In [12]:
# Access the subreddit
subreddit = reddit.subreddit("wallstreetbets")

# Initialize variables
batch_size = 50 # Number of posts per batch
total_posts = 5000  # Total number of posts to fetch
all_posts = []  # To store all the retrieved posts
after = None  # To keep track of the last post for pagination

# Fetch posts in batches
while len(all_posts) < total_posts:
    # Fetch the next batch of posts
    submissions = subreddit.new(limit=batch_size, params={"after": after})
    
    batch_posts = []
    for submission in submissions:
        batch_posts.append(submission)

        # Update the `after` variable with the last submission's fullname
        after = submission.fullname

    # Add the batch to the main list
    all_posts.extend(batch_posts)

    # Exit loop if no more posts are available
    if not batch_posts:
        print("No more posts to fetch.")
        break

    # Optional delay to avoid rate limits
    time.sleep(5)  # Adjust the delay as necessary

# Process the data (example: print the total number of posts fetched)
print(f"Fetched {len(all_posts)} posts in total.")

No more posts to fetch.
Fetched 792 posts in total.


Now that we have collected a large portion of posts/submssions, we'll parse the results and construct a dataframe with this data. We're going to collect more fields from this data than we might need right now, avoiding data limitations in the future.

In [13]:
# Parse are submission objects that we collected.
fields = ('title', 
          'created_utc', 
          'id', 
          'is_original_content', 
          'link_flair_text', 
          'locked',
          'name',
          'num_comments',
          'over_18',
          'permalink',
          'selftext',
          'spoiler',
          'upvote_ratio')
list_of_submissions = []

# Parse each submission into a dictionary of the lised fields.
for submission in all_posts:
    full = vars(submission)
    sub_dict = {field:full[field] for field in fields}
    list_of_submissions.append(sub_dict)

# Create a python dataframe of these submissions.
collected_data = pd.DataFrame.from_records(list_of_submissions)

# Display the dataframe.
display(collected_data)

Unnamed: 0,title,created_utc,id,is_original_content,link_flair_text,locked,name,num_comments,over_18,permalink,selftext,spoiler,upvote_ratio
0,See you on the water (I assume you all own yac...,1.744229e+09,1jvf7wc,False,News,False,t3_1jvf7wc,1,False,/r/wallstreetbets/comments/1jvf7wc/see_you_on_...,,False,1.00
1,"What Are Your Moves Tomorrow, April 10, 2025",1.744229e+09,1jvf5zp,False,Daily Discussion,False,t3_1jvf5zp,131,False,/r/wallstreetbets/comments/1jvf5zp/what_are_yo...,This post contains content not supported on ol...,False,0.83
2,NEVER BACK DOWN NEVER WHAT???,1.744229e+09,1jvf3yf,False,Gain,False,t3_1jvf3yf,4,False,/r/wallstreetbets/comments/1jvf3yf/never_back_...,,False,1.00
3,Leveraged Gain 🚀 TQQQ,1.744228e+09,1jvf3ih,False,Gain,False,t3_1jvf3ih,3,False,/r/wallstreetbets/comments/1jvf3ih/leveraged_g...,Sold at absolute peak yesterday. Bought back i...,False,1.00
4,BREAKING NEWS: Horsford Grills US Trade Rep On...,1.744228e+09,1jvf0on,False,News,False,t3_1jvf0on,7,False,/r/wallstreetbets/comments/1jvf0on/breaking_ne...,This is so satisfying to watch,False,1.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...
787,What's up with CRWD?,1.742412e+09,1jf5339,False,Discussion,False,t3_1jf5339,16,False,/r/wallstreetbets/comments/1jf5339/whats_up_wi...,This stock just keeps doing what it wants.\n\n...,False,0.57
788,Every single FED’s conference should look like...,1.742410e+09,1jf4ag1,False,Meme,False,t3_1jf4ag1,199,False,/r/wallstreetbets/comments/1jf4ag1/every_singl...,,False,0.96
789,The Fed holds rates steady.,1.742407e+09,1jf3cp2,False,News,False,t3_1jf3cp2,286,False,/r/wallstreetbets/comments/1jf3cp2/the_fed_hol...,,False,0.96
790,Tesla stock rebounds as California approves in...,1.742402e+09,1jf112w,False,Discussion,False,t3_1jf112w,64,False,/r/wallstreetbets/comments/1jf112w/tesla_stock...,,False,0.40


### Saving Reddit Data

In [14]:
# Save the collected data to parquet format
PARQUET_PATH = './data/wallstreetbets-collection.parquet'

# Create a pyarrow schema for the data types.
schema = pyarrow.schema([
    ('title', pyarrow.string()), 
    ('created_utc', pyarrow.float64()),
    ('id', pyarrow.string()), 
    ('is_original_content', pyarrow.bool_()), 
    ('link_flair_text', pyarrow.string()), 
    ('locked', pyarrow.bool_()),
    ('name', pyarrow.string()),
    ('num_comments', pyarrow.int64()),
    ('over_18', pyarrow.bool_()),
    ('permalink', pyarrow.string()),
    ('selftext', pyarrow.string()),
    ('spoiler', pyarrow.bool_()),
    ('upvote_ratio', pyarrow.float64()),
])

# If the parqet does not exist, create it.
if not os.path.exists(PARQUET_PATH):
    collected_data.to_parquet(PARQUET_PATH, engine='pyarrow', schema=schema)
    
# If the data file already exist, merge new data with the existing one.
else:
    old_parquet = pd.read_parquet(PARQUET_PATH, engine='pyarrow', schema=schema)
    new_parquet = pd.concat([old_parquet, collected_data])
    new_parquet = new_parquet.drop_duplicates(subset=['id','title','created_utc','name','permalink'], keep='last').reset_index(drop=True)
    new_parquet.to_parquet(PARQUET_PATH, engine='pyarrow', schema=schema)


## Part _: Getting Comment Data

In [15]:
# Use the new collected data to get comment stuff.
data_for_comment_collection = collected_data

# Given a reddit post id, collect the comments and peripheral data for that post.
def get_comment_data_with_post(post_id: str):
    # Assuming we already have access to reddit.
    post = reddit.submission(id=post_id)
    
    # Note the comment section is more like a forest rather than a list.
    post.comments.replace_more(limit=None)
    for comment in submission.comments.list():
        # This will skip comments that are not top-level.
        if isinstance(comment, praw.models.MoreComments):
            continue
        
        # TODO: How is perfermance as forest grows in dfepth.
        
        # Otherwise, print the details.
        print(comment.author)
        print(comment.score)
        print(comment.created_utc)  # as a Unix timestamp
        print(comment.body)
        
# Use this methodolgy for some posts.
post_ids = data_for_comment_collection['id'][:10] # Get the fist 10 for now.
for post_id in post_ids:
    get_comment_data_with_post(post_id)
    

VisualMod
1
1742399808.0

**User Report**| | | |
:--|:--|:--|:--
**Total Submissions** | 2 | **First Seen In WSB** | 1 year ago
**Total Comments** | 2 | **Previous Best DD** | 
**Account Age** | 7 years | | 

[**Join WSB Discord**](http://discord.gg/wsbverse)
Melodic_Fee5400
126
1742400268.0
And stock is pumping. Makes sense
Lofi-Fanboy123
27
1742400063.0
so im buying Rolls royce?
CUDAcores89
62
1742401384.0
So this article links to [this](https://www.reuters.com/technology/eu-antitrust-regulators-tell-apple-how-comply-with-tech-rules-2024-09-19/) article that details more of what the EU is asking from apple.

It looks like the EU simply wants to ensure other devices from 3rd party manufacturers will work with Apple products in a standardized fashion (smartwatches, headphones, VR headsets).

Because if that is all the EU wants, then Apple has no right to complain. Android and windows/Linux on the desktop already have device interoperability with most products.

Maybe Android users will

## Part _: Analysis of Reddit Data

In [16]:
# Access the data which we have collected.
PARQUET_PATH = './data/wallstreetbets-collection.parquet'
reddit_data = pd.read_parquet(PARQUET_PATH, engine='pyarrow', schema=schema)

# Show our greater data source
display(reddit_data)

Unnamed: 0,title,created_utc,id,is_original_content,link_flair_text,locked,name,num_comments,over_18,permalink,selftext,spoiler,upvote_ratio
0,Tariffs bonanza,1.743624e+09,1jpyan4,False,Discussion,False,t3_1jpyan4,16,False,/r/wallstreetbets/comments/1jpyan4/tariffs_bon...,How will the tariffs affect our market? Are we...,False,0.77
1,🤔,1.743624e+09,1jpy97b,False,Discussion,False,t3_1jpy97b,4,False,/r/wallstreetbets/comments/1jpy97b/_/,,False,0.96
2,"Is It Just Me, or Is Robinhood’s Security Stil...",1.743623e+09,1jpxza5,False,Discussion,False,t3_1jpxza5,8,False,/r/wallstreetbets/comments/1jpxza5/is_it_just_...,I feel like it’s pretty obvious (at least to m...,False,0.83
3,200 puts (~10k): If NVDA goes to $90 this week...,1.743393e+09,1jnuj43,False,YOLO,False,t3_1jnuj43,171,False,/r/wallstreetbets/comments/1jnuj43/200_puts_10...,"If it doesn’t, yall will cover my bet, right?\n",False,0.95
4,All in on Hood,1.743015e+09,1jkjfsa,False,YOLO,False,t3_1jkjfsa,88,False,/r/wallstreetbets/comments/1jkjfsa/all_in_on_h...,Perfect entry today in my opinion. Very simple...,False,0.66
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1284,What's up with CRWD?,1.742412e+09,1jf5339,False,Discussion,False,t3_1jf5339,16,False,/r/wallstreetbets/comments/1jf5339/whats_up_wi...,This stock just keeps doing what it wants.\n\n...,False,0.57
1285,Every single FED’s conference should look like...,1.742410e+09,1jf4ag1,False,Meme,False,t3_1jf4ag1,199,False,/r/wallstreetbets/comments/1jf4ag1/every_singl...,,False,0.96
1286,The Fed holds rates steady.,1.742407e+09,1jf3cp2,False,News,False,t3_1jf3cp2,286,False,/r/wallstreetbets/comments/1jf3cp2/the_fed_hol...,,False,0.96
1287,Tesla stock rebounds as California approves in...,1.742402e+09,1jf112w,False,Discussion,False,t3_1jf112w,64,False,/r/wallstreetbets/comments/1jf112w/tesla_stock...,,False,0.40


## Part _: Sentiment Analysis

### Setup Tools

In [17]:
import nltk
nltk.download(['punkt',
               'stopwords',
                'vader_lexicon',
                'names',
                'averaged_perceptron_tagger',
                'wordnet'])

from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

ModuleNotFoundError: No module named 'nltk'

### Analyse a single submission

In [None]:
# Analyze the sentiment of a single submission.
# We'll look at a Discussion post with the id: '1jpyan4'

submission = reddit.submission(id='1jpyan4')
submission.comments.replace_more(limit=None)  # Load all comments

comments = submission.comments.list()  # Get all comments
comment_texts = [comment.body for comment in comments]  # Extract comment texts

# Initialize the Sentiment Intensity Analyzer
sia = SentimentIntensityAnalyzer()
stop_words = set(stopwords.words('english'))

# Function to analyze sentiment of a single comment
def analyze_sentiment(comment):
    # Tokenize the comment
    tokens = word_tokenize(comment.lower())
    
    # Remove stop words
    filtered_tokens = [word for word in tokens if word not in stop_words]
    
    # Get sentiment scores
    sentiment_scores = sia.polarity_scores(' '.join(filtered_tokens))
    
    return sentiment_scores

# Show the results of the analysis.
for comment in comments:
    sentiment_scores = analyze_sentiment(comment.body)
    print(f"Comment: {comment.body}")
    print(f"Sentiment Scores: {sentiment_scores}")
    print("-" * 80)