## Part 1: Setup

### Import Packages

In [6]:
# Import Packages
import praw, time
from dotenv import load_dotenv, dotenv_values
from requests import Session

# Load environment variables from .env file
load_dotenv('.env')
config = dotenv_values()
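
If a key is missing from `.env`, the problem only shows up later as a `KeyError` when the connection is built. The following is a minimal sketch of an early check, assuming the four keys used in this notebook; `missing_keys` and `REQUIRED_KEYS` are hypothetical names introduced here for illustration, and the `example` dictionary stands in for the real `config`.

```python
# Keys this notebook reads from the .env file (assumed from the cells below)
REQUIRED_KEYS = ("CLIENT_ID", "CLIENT_SECRET", "USERNAME", "PASSWORD")


def missing_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in `config`."""
    return [key for key in required if not config.get(key)]


# Stand-in dictionary so the sketch runs without a real .env file
example = {"CLIENT_ID": "abc", "CLIENT_SECRET": "", "USERNAME": "someone"}
print(missing_keys(example))  # ['CLIENT_SECRET', 'PASSWORD']
```

In the notebook itself, the call would be `missing_keys(config)`, and an empty list means every credential was found.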

### Open Reddit Connection

In [7]:
# Create a custom session for PRAW's requests
session = Session()
session.headers.update({'User-Agent': 'praw'})

# Login to Reddit using PRAW
reddit = praw.Reddit(
    client_id=config['CLIENT_ID'],
    client_secret=config['CLIENT_SECRET'],
    requestor_kwargs={"session": session},
    username=config['USERNAME'],
    password=config['PASSWORD'],
    user_agent="CS470 ML Project Access by u/GregorybLafayetteML",
    # Wait at most 10 seconds per request. The timeout is set here because
    # requests.Session ignores a `timeout` attribute assigned to it.
    timeout=10,
)

# Test the connection
try:
    username = reddit.user.me()
    print("Successfully logged in to Reddit!")
    print(f"Logged in as: u/{username}")
except Exception as e:
    print(f"Failed to log in: {e}")

Successfully logged in to Reddit!
Logged in as: u/GregorybLafayetteML


## Part 2: Accessing Reddit Data

To access Reddit posts, we'll need to send a request with the number of posts we want to get. The following example finds the top 10 hottest posts on the r/wallstreetbets subreddit. We'll show the post title, score, flair, and URL.

In [8]:
top_posts = reddit.subreddit('wallstreetbets').hot(limit=10)
print("Top 10 hot posts from r/wallstreetbets:")
for post in top_posts:
    print(f"Title: {post.title}, Score: {post.score}, Flair: {post.link_flair_text}, URL: {post.url}")

Top 10 hot posts from r/wallstreetbets:
Title: Introducing: WSB's First Ever Paper Trading Competition, Score: 1281, Flair: Announcements , URL: https://v.redd.it/kmfu9l676fre1
Title: Daily Discussion Thread for March 31, 2025, Score: 318, Flair: Daily Discussion, URL: https://www.reddit.com/r/wallstreetbets/comments/1jnzl3k/daily_discussion_thread_for_march_31_2025/
Title: China, Japan, South Korea will jointly respond to US tariffs, Chinese state media says, Score: 9375, Flair: News, URL: https://www.reddit.com/r/wallstreetbets/comments/1jo6v9m/china_japan_south_korea_will_jointly_respond_to/
Title: Goldman Sachs sees Trump tariffs spiking inflation, stunting growth and raising recession risks, Score: 13510, Flair: News, URL: https://www.cnbc.com/2025/03/30/tariffs-to-spike-inflation-stunt-growth-and-raise-recession-risks-goldman-says-.html
Title: Household Savings not looking good, Score: 439, Flair: Discussion, URL: https://i.redd.it/59dft5gvo1se1.jpeg
Title: Half a mil in one posi

For this project, we'll need far more than ten posts at a time. The Reddit API limits each request to 100 posts. Fortunately, PRAW exposes listings through a ListingGenerator, which lets us work through our metered connection in sequential blocks. The following example uses this behavior, grabbing blocks of 100 posts at a time until we reach 5000 posts or the listing stops returning results. Notice that the procedure ends early, with around 750-800 posts collected. The results are sparse because our connection either timed out or was metered down by Reddit. The latter is more likely: the loop ended quietly with an empty batch rather than an exception, and Reddit listings are generally capped at roughly 1000 items.

In [9]:
# Access the subreddit
subreddit = reddit.subreddit("wallstreetbets")

# Initialize variables
batch_size = 100  # Number of posts per batch
total_posts = 5000  # Total number of posts to fetch
all_posts = []  # To store all the retrieved posts
after = None  # To keep track of the last post for pagination

# Fetch posts in batches
while len(all_posts) < total_posts:
    # Fetch the next batch of posts
    submissions = subreddit.new(limit=batch_size, params={"after": after})
    
    batch_posts = []
    for submission in submissions:
        batch_posts.append({
            "title": submission.title,
            "score": submission.score,
            "url": submission.url,
            "created_utc": submission.created_utc
        })

        # Update the `after` variable with the last submission's fullname
        after = submission.fullname

    # Add the batch to the main list
    all_posts.extend(batch_posts)

    # Exit loop if no more posts are available
    if not batch_posts:
        print("No more posts to fetch.")
        break

    # Optional delay to avoid rate limits
    time.sleep(1)  # Adjust the delay as necessary

# Process the data (example: print the total number of posts fetched)
print(f"Fetched {len(all_posts)} posts in total.")

No more posts to fetch.
Fetched 786 posts in total.
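
As far as we can tell from PRAW's documentation, a ListingGenerator requests successive pages on its own as it is iterated, so the manual `after` bookkeeping above may not be necessary. The following is a minimal sketch of that alternative, not a replacement tested against the live API. `collect_posts` is a hypothetical helper, and `fake_listing` is stand-in data so the cell runs without a Reddit connection.

```python
from itertools import islice
from types import SimpleNamespace


def collect_posts(listing, max_posts):
    """Take up to `max_posts` submissions from any iterable and keep four fields."""
    return [
        {
            "title": s.title,
            "score": s.score,
            "url": s.url,
            "created_utc": s.created_utc,
        }
        for s in islice(listing, max_posts)
    ]


# Stand-in for a listing: 250 fake submissions with the same attributes we read
fake_listing = (
    SimpleNamespace(
        title=f"post {i}",
        score=i,
        url=f"https://example.com/{i}",
        created_utc=1743400000.0 + i,
    )
    for i in range(250)
)

posts = collect_posts(fake_listing, 100)
print(f"Fetched {len(posts)} posts in total.")

# With a live connection, the call would look like this (limit=None asks
# PRAW for as many posts as Reddit will return):
# posts = collect_posts(reddit.subreddit("wallstreetbets").new(limit=None), 5000)
```

This version would still be subject to whatever cap Reddit places on the listing, so it should not be expected to return more posts than the loop above.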


## Part 3: Analysis of Reddit Data