# Collecting Reddit Data from r/WallStreetBets

## Part 1: Setup

### Import Packages and Load Configuration

In [1]:
# Load environment variables from .env file
from dotenv import load_dotenv, dotenv_values
load_dotenv('.env')
config = dotenv_values()

In [2]:
# Import Packages
import praw, time, os, pyarrow
from IPython.display import display
from requests import Session
import pandas as pd
from IPython import get_ipython
from tqdm import tqdm 
import kagglehub, kaggle


# Get config from colab or other environment.
def is_colab():
    return get_ipython().__class__.__module__ == "google.colab._shell"

if is_colab():
    from google.colab import userdata
    config = {}
    config['CLIENT_SECRET'] = userdata.get('CLIENT_SECRET')
    config['CLIENT_ID'] = userdata.get('CLIENT_ID')
    config['NAME'] = userdata.get('NAME')
    config['REDIRECT_URI'] = userdata.get('REDIRECT_URI')
    config['USERNAME'] = userdata.get('USERNAME')
    config['PASSWORD'] = userdata.get('PASSWORD')

else:
    load_dotenv('.env')
    config = dotenv_values()
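
A missing credential would otherwise only surface later as a failed login, so it can help to check `config` right after it is built. The following is a minimal sketch of such a check, not part of the original notebook: `missing_keys` and `REQUIRED_KEYS` are hypothetical names, and the key list assumes that only the four values passed to `praw.Reddit` in Part 2 are strictly required.

```python
# Hypothetical helper: report which required credentials are absent or empty.
REQUIRED_KEYS = ("CLIENT_ID", "CLIENT_SECRET", "USERNAME", "PASSWORD")

def missing_keys(config, required=REQUIRED_KEYS):
    """Return the required keys whose value is missing, None, or blank."""
    return [key for key in required if not str(config.get(key) or "").strip()]

# Example with a stand-in config (placeholder values, not real credentials).
example_config = {"CLIENT_ID": "abc", "CLIENT_SECRET": "", "USERNAME": "someone"}
print(missing_keys(example_config))
```

In the notebook, one could call `missing_keys(config)` and stop early if the returned list is not empty.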


## Part 2: Collecting Submissions from Reddit

### Open Reddit Connection

In [3]:
# Create a custom session with its own User-Agent header
session = Session()
session.headers.update({'User-Agent': 'praw'})

# Login to Reddit using PRAW.
# NOTE: requests ignores a `timeout` attribute set on a Session, so the
# 10-second timeout is passed through PRAW's `timeout` setting instead.
reddit = praw.Reddit(
    client_id=config['CLIENT_ID'],
    client_secret=config['CLIENT_SECRET'],
    requestor_kwargs={"session": session},
    username=config['USERNAME'],
    password=config['PASSWORD'],
    user_agent="CS470 ML Project Access by u/GregorybLafayetteML",
    timeout=10,  # Seconds to wait for each request
)

# Add some peripheral config data
reddit.config.log_requests = 1
reddit.config.store_json_result = True

# Test the connection
try:
    username = reddit.user.me()
    print("Successfully logged in to Reddit!")
    print(f"Logged in as: u/{username}")
except Exception as e:
    print(f"Failed to log in: {e}")

Successfully logged in to Reddit!
Logged in as: u/GregorybLafayetteML


### Accessing Reddit Data

To access Reddit posts, we send a request that specifies how many posts we want. The following example fetches the 10 hottest posts on the r/wallstreetbets subreddit and shows each post's title, score, flair, and URL.

In [4]:
top_posts = reddit.subreddit('wallstreetbets').hot(limit=10)
print("Top 10 hot posts from r/wallstreetbets:")
for post in top_posts:
    print(f"Title: {post.title}, Score: {post.score}, Flair: {post.link_flair_text}, URL: {post.url}")

Top 10 hot posts from r/wallstreetbets:
Title: What Are Your Moves Tomorrow, May 02, 2025, Score: 82, Flair: Daily Discussion, URL: https://www.reddit.com/r/wallstreetbets/comments/1kci11n/what_are_your_moves_tomorrow_may_02_2025/
Title: Weekly Earnings Thread 4/28 - 5/2, Score: 359, Flair: Earnings Thread, URL: https://i.redd.it/mz7c9szamzwe1.jpeg
Title: McDonald’s reports largest U.S. same-store sales decline since 2020, Score: 9205, Flair: News, URL: https://www.cnbc.com/2025/05/01/mcdonalds-mcd-q1-2025-earnings.html
Title: A judge just blew up Apple’s control of the App Store, Score: 2161, Flair: News, URL: https://www.theverge.com/news/659246/apple-epic-app-store-judge-ruling-control
Title: US weekly jobless claims jump to two-month high, Score: 777, Flair: News, URL: https://www.reuters.com/business/world-at-work/us-weekly-jobless-claims-increase-more-than-expected-2025-05-01/
Title: Is no one talking about De Minimis starting tomorrow because of this current short term news?, Sc

For this project, we'll need far more than ten posts at a time. The Reddit API limits us to 100 posts per request. Fortunately, PRAW exposes listings through a ListingGenerator, which lets us walk our metered connection in sequential blocks. The following example uses this behavior, grabbing blocks of 50 posts at a time until we reach 5000 posts or no more posts are returned. Notice that the procedure ends early, with only 811 posts collected in this run. The results are sparse because our connection either timed out or was metered down by Reddit; the latter is more likely.

In [5]:
# Access the subreddit
subreddit = reddit.subreddit("wallstreetbets")

# Initialize variables
batch_size = 50 # Number of posts per batch
total_posts = 5000  # Total number of posts to fetch
all_posts = []  # To store all the retrieved posts
after = None  # To keep track of the last post for pagination

# Fetch posts in batches
while len(all_posts) < total_posts:
    # Fetch the next batch of posts
    submissions = subreddit.new(limit=batch_size, params={"after": after})

    batch_posts = []
    for submission in tqdm(submissions, desc="Storing batch of posts", unit="post"):
        batch_posts.append(submission)

        # Update the `after` variable with the last submission's fullname
        after = submission.fullname

    # Add the batch to the main list
    all_posts.extend(batch_posts)

    # Exit loop if no more posts are available
    if not batch_posts:
        print("No more posts to fetch.")
        break

    # Optional delay to avoid rate limits
    time.sleep(1)  # Adjust the delay as necessary

# Process the data (example: print the total number of posts fetched)
print(f"Fetched {len(all_posts)} posts in total.")

Storing batch of posts: 50post [00:00, 88.87post/s]
Storing batch of posts: 50post [00:00, 89.47post/s]
Storing batch of posts: 50post [00:00, 77.52post/s]
Storing batch of posts: 50post [00:00, 68.96post/s]
Storing batch of posts: 50post [00:00, 64.40post/s]
Storing batch of posts: 50post [00:00, 77.67post/s]
Storing batch of posts: 50post [00:00, 70.22post/s]
Storing batch of posts: 50post [00:00, 72.35post/s]
Storing batch of posts: 50post [00:00, 71.43post/s]
Storing batch of posts: 50post [00:00, 76.69post/s]
Storing batch of posts: 50post [00:00, 71.03post/s]
Storing batch of posts: 50post [00:00, 82.69post/s]
Storing batch of posts: 50post [00:00, 80.49post/s]
Storing batch of posts: 50post [00:00, 70.36post/s]
Storing batch of posts: 50post [00:01, 49.72post/s]
Storing batch of posts: 50post [00:01, 25.24post/s]
Storing batch of posts: 11post [00:00, 39.05post/s]
Storing batch of posts: 0post [00:00, ?post/s]

No more posts to fetch.
Fetched 811 posts in total.
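
The cursor logic in the loop above can be isolated from the network. The sketch below is an illustration only: `collect_in_batches`, `fake_fetch`, and `FAKE_LISTING` are hypothetical stand-ins, and the fake listing of 811 items simply mimics a source that runs dry before the requested total is reached.

```python
def collect_in_batches(fetch_batch, total, batch_size):
    """Collect up to `total` items by following an `after` cursor.

    `fetch_batch(after, limit)` is a hypothetical callable returning the next
    `limit` items that follow `after` (or the first items when `after` is None).
    """
    collected = []
    after = None
    while len(collected) < total:
        # Never request more than is still needed.
        limit = min(batch_size, total - len(collected))
        batch = fetch_batch(after, limit)
        if not batch:  # the listing is exhausted
            break
        collected.extend(batch)
        after = batch[-1]  # cursor: the last item seen
    return collected

# Stand-in for Reddit: a fixed listing of 811 fake fullnames.
FAKE_LISTING = [f"t3_{i:04d}" for i in range(811)]

def fake_fetch(after, limit):
    start = 0 if after is None else FAKE_LISTING.index(after) + 1
    return FAKE_LISTING[start:start + limit]

posts = collect_in_batches(fake_fetch, total=5000, batch_size=50)
print(f"Fetched {len(posts)} posts in total.")
```

Capping `limit` at the number of items still needed keeps the result from overshooting `total` when it is not a multiple of the batch size.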





Now that we have collected a large batch of posts (submissions), we'll parse the results and build a DataFrame from them. We collect more fields than we need right now so that we aren't limited by missing data later.

In [6]:
# Parse the submission objects that we collected.
fields = ('title',
          'created_utc',
          'id',
          'is_original_content',
          'link_flair_text',
          'locked',
          'name',
          'num_comments',
          'over_18',
          'permalink',
          'selftext',
          'spoiler',
          'upvote_ratio')
list_of_submissions = []

# Parse each submission into a dictionary of the listed fields.
for submission in all_posts:
    full = vars(submission)
    sub_dict = {field:full[field] for field in fields}
    list_of_submissions.append(sub_dict)

# Create a pandas DataFrame of these submissions.
collected_data = pd.DataFrame.from_records(list_of_submissions)

# Display the dataframe.
display(collected_data)

Unnamed: 0,title,created_utc,id,is_original_content,link_flair_text,locked,name,num_comments,over_18,permalink,selftext,spoiler,upvote_ratio
0,"I call it ""Put Combo"" +44k",1.746135e+09,1kck7mt,False,Gain,False,t3_1kck7mt,2,False,/r/wallstreetbets/comments/1kck7mt/i_call_it_p...,"I call it ""Put Combo"" +44k\nThe holy grail of ...",False,1.00
1,Reddit shares rocket as high as 19% on strong ...,1.746135e+09,1kck67g,False,News,False,t3_1kck67g,5,False,/r/wallstreetbets/comments/1kck67g/reddit_shar...,,False,0.82
2,MSFT earnings,1.746132e+09,1kcj29d,False,Gain,False,t3_1kcj29d,5,False,/r/wallstreetbets/comments/1kcj29d/msft_earnings/,Was down fat on Tsla and MSFT saved my account...,False,0.97
3,Block shares plunge 17% on revenue miss,1.746132e+09,1kcizwx,False,News,False,t3_1kcizwx,26,False,/r/wallstreetbets/comments/1kcizwx/block_share...,,False,0.94
4,AAPL Results,1.746132e+09,1kciy2f,False,Discussion,False,t3_1kciy2f,24,False,/r/wallstreetbets/comments/1kciy2f/aapl_results/,Not so exciting. But new buyback can support p...,False,0.96
...,...,...,...,...,...,...,...,...,...,...,...,...,...
806,Largest Net Gain in a Single Day for the Nasda...,1.744231e+09,1jvg6np,False,Discussion,False,t3_1jvg6np,27,False,/r/wallstreetbets/comments/1jvg6np/largest_net...,https://preview.redd.it/72sswb2pevte1.png?widt...,False,0.89
807,Make my account green again with $60k gain,1.744231e+09,1jvg693,False,Gain,False,t3_1jvg693,9,False,/r/wallstreetbets/comments/1jvg693/make_my_acc...,"Dear regards, I’m back in the game. I lost $60...",False,0.96
808,TQQQ Short Put $1140 within minutes,1.744231e+09,1jvg5fe,False,Gain,False,t3_1jvg5fe,6,False,/r/wallstreetbets/comments/1jvg5fe/tqqq_short_...,,False,0.85
809,Got lucky. Sold before the pump.,1.744231e+09,1jvg4kf,False,Gain,False,t3_1jvg4kf,6,False,/r/wallstreetbets/comments/1jvg4kf/got_lucky_s...,Been buying puts on the basis that orange man ...,False,0.93
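
The `created_utc` column holds Unix timestamps in seconds, which are hard to read in the table above. As a small sketch (the helper name `utc_from_epoch` is hypothetical), the standard library can turn one into a timezone-aware datetime:

```python
from datetime import datetime, timezone

def utc_from_epoch(created_utc):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(float(created_utc), tz=timezone.utc)

# Value copied from the rounded display above; the stored rows carry more precision.
first_row = utc_from_epoch(1.746135e+09)
print(first_row.isoformat())
```

For a whole column, `pd.to_datetime(collected_data['created_utc'], unit='s', utc=True)` performs the same conversion in pandas.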


### Saving Reddit Data

In [7]:
# Save the collected data to parquet format
SUBMISSION_PARQUET_PATH = './data/wallstreetbets-collection.parquet'

# Create a pyarrow schema for the data types.
submission_schema = pyarrow.schema([
    ('title', pyarrow.string()),
    ('created_utc', pyarrow.float64()),
    ('id', pyarrow.string()),
    ('is_original_content', pyarrow.bool_()),
    ('link_flair_text', pyarrow.string()),
    ('locked', pyarrow.bool_()),
    ('name', pyarrow.string()),
    ('num_comments', pyarrow.int64()),
    ('over_18', pyarrow.bool_()),
    ('permalink', pyarrow.string()),
    ('selftext', pyarrow.string()),
    ('spoiler', pyarrow.bool_()),
    ('upvote_ratio', pyarrow.float64()),
])

In [8]:
# If the parquet file does not exist, create it.
if not os.path.exists(SUBMISSION_PARQUET_PATH):
    collected_data.to_parquet(SUBMISSION_PARQUET_PATH, engine='pyarrow', schema=submission_schema)

# If the data file already exists, merge the new data with the existing data.
else:
    old_parquet = pd.read_parquet(SUBMISSION_PARQUET_PATH, engine='pyarrow', schema=submission_schema)
    new_parquet = pd.concat([old_parquet, collected_data])
    new_parquet = new_parquet.drop_duplicates(subset=['id','title','created_utc','name','permalink'], keep='last').reset_index(drop=True)
    new_parquet.to_parquet(SUBMISSION_PARQUET_PATH, engine='pyarrow', schema=submission_schema)

# Reload the saved submissions; their ids are used below to collect comments.
SUBMISSION_PARQUET_PATH = './data/wallstreetbets-collection.parquet'
submission_collection = pd.read_parquet(SUBMISSION_PARQUET_PATH, engine='pyarrow', schema=submission_schema)
display(submission_collection)

Unnamed: 0,title,created_utc,id,is_original_content,link_flair_text,locked,name,num_comments,over_18,permalink,selftext,spoiler,upvote_ratio
0,Nivea Along,1.744832e+09,1k0t4jk,False,YOLO,False,t3_1k0t4jk,5,False,/r/wallstreetbets/comments/1k0t4jk/nivea_along/,After -7% yesterday and -10% today,False,0.67
1,Powell to Volatile Stock Market: You’re on You...,1.744836e+09,1k0unbq,False,News,False,t3_1k0unbq,2,False,/r/wallstreetbets/comments/1k0unbq/powell_to_v...,,False,0.86
2,Made back the last Wendy’s paycheck I lost,1.744834e+09,1k0tv2y,False,Gain,False,t3_1k0tv2y,6,False,/r/wallstreetbets/comments/1k0tv2y/made_back_t...,,False,0.94
3,After market observation. When I finished buyi...,1.744833e+09,1k0tnqx,False,Gain,False,t3_1k0tnqx,8,False,/r/wallstreetbets/comments/1k0tnqx/after_marke...,https://preview.redd.it/41ilvj6f39ve1.png?widt...,False,0.72
4,Ominous,1.744833e+09,1k0thnd,False,Discussion,False,t3_1k0thnd,110,False,/r/wallstreetbets/comments/1k0thnd/ominous/,NVIDIA 2024 is starting to rhyme like Cisco 20...,False,0.85
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1385,Largest Net Gain in a Single Day for the Nasda...,1.744231e+09,1jvg6np,False,Discussion,False,t3_1jvg6np,27,False,/r/wallstreetbets/comments/1jvg6np/largest_net...,https://preview.redd.it/72sswb2pevte1.png?widt...,False,0.89
1386,Make my account green again with $60k gain,1.744231e+09,1jvg693,False,Gain,False,t3_1jvg693,9,False,/r/wallstreetbets/comments/1jvg693/make_my_acc...,"Dear regards, I’m back in the game. I lost $60...",False,0.96
1387,TQQQ Short Put $1140 within minutes,1.744231e+09,1jvg5fe,False,Gain,False,t3_1jvg5fe,6,False,/r/wallstreetbets/comments/1jvg5fe/tqqq_short_...,,False,0.85
1388,Got lucky. Sold before the pump.,1.744231e+09,1jvg4kf,False,Gain,False,t3_1jvg4kf,6,False,/r/wallstreetbets/comments/1jvg4kf/got_lucky_s...,Been buying puts on the basis that orange man ...,False,0.93
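
The merge step above concatenates the old and new rows and keeps the last copy of each duplicate, which is why the saved collection (1389 as the last index) is larger than the 811 posts fetched in this run. The sketch below shows the same keep-last idea in plain Python. It is an illustration under simplifying assumptions: `merge_keep_last` is a hypothetical helper, the rows are made-up examples, and it keys on `id` alone rather than the five columns used in the notebook.

```python
def merge_keep_last(old_rows, new_rows, key="id"):
    """Combine two lists of dict rows, keeping the last row seen for each key."""
    merged = {}
    for row in list(old_rows) + list(new_rows):
        # Remove any earlier copy so the surviving row sits at its last position.
        merged.pop(row[key], None)
        merged[row[key]] = row
    return list(merged.values())

# Made-up rows: "b2" appears in both runs with a newer comment count.
old_rows = [{"id": "a1", "num_comments": 2}, {"id": "b2", "num_comments": 5}]
new_rows = [{"id": "b2", "num_comments": 9}, {"id": "c3", "num_comments": 1}]

merged = merge_keep_last(old_rows, new_rows)
print(merged)
```

Keeping the last copy means a re-collected post overwrites its older snapshot, so fields such as `num_comments` reflect the most recent run.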


## Part 3: Collecting Comments from Reddit

### Creating a database of reddit threads

In [9]:
# Use the same methodology that we used to collect submissions, but add a parent submission id and a parent comment id.
# Since comment sections can be very large, we'll expand at most 10 "load more comments" placeholders per submission.
# This may still be a lot more comments than we need for larger discussions.
def extract_comments_from_submission(submission_id: str):
    try:
        submission = reddit.submission(id=submission_id)
        submission.comments.replace_more(limit=10)  # Replace at most 10 MoreComments placeholders
        comments = []

        for comment in submission.comments.list():
            if isinstance(comment, praw.models.MoreComments):
                continue

            # NOTE: It looks like the top comment may be a user report. We'll skip it if it contains certain text.
            SKIPTEXT = '**User Report**'
            if SKIPTEXT in comment.body:
                continue

            # Append the comment data to the list
            comments.append({
                'parent_post_id': submission_id,
                'parent_comment_id': comment.parent_id,
                'comment_id': comment.id,
                'author': str(comment.author),
                'created_utc': comment.created_utc,
                'score': comment.score,
                'body': comment.body
            })

        return comments

    except Exception as e:
        # Get the HTTP error code if available
        if hasattr(e, 'response') and e.response is not None:
            error_code = e.response.status_code
            print(f"HTTP Error {error_code} while fetching comments for submission {submission_id}")
        else:
            error_code = None

        # Print an error message and return an empty list.
        print(f"Error fetching comments for submission {submission_id}: {e}")
        return []



In [10]:
# Show the results from one submission's comments
submission_id = submission_collection.iloc[0]['id']

# How many actual comments are there for this submission?
submission = reddit.submission(id=submission_id)
print(f"Submission ID: {submission_id}")
print(f"Title: {submission.title}")
print(f"Number of comments: {submission.num_comments}")

# Get the comments for the submission
results = extract_comments_from_submission(submission_id)

# Create a dataframe of the comments
comments_df = pd.DataFrame(results)

# Display the comments dataframe
display(comments_df)

Submission ID: 1k0t4jk
Title: Nivea Along
Number of comments: 6


Unnamed: 0,parent_post_id,parent_comment_id,comment_id,author,created_utc,score,body
0,1k0t4jk,t3_1k0t4jk,mngmxdi,chrissurra,1744832000.0,4,You mean Neveah
1,1k0t4jk,t3_1k0t4jk,mngn956,Alert_Barber_3105,1744832000.0,6,The skin cream?
2,1k0t4jk,t3_1k0t4jk,mngoka2,Reasonable_Roger,1744832000.0,1,Psoriasis calls
3,1k0t4jk,t1_mngmxdi,mngnbpk,Own-Foundation3873,1744832000.0,1,yesss sir
4,1k0t4jk,t1_mngn956,mngnj1n,Own-Foundation3873,1744832000.0,1,pow till it creams
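
In the table above, `parent_comment_id` is a Reddit fullname: the `t3_` prefix marks a submission and `t1_` marks a comment, so rows whose parent starts with `t3_` are top-level comments. A short sketch of how that prefix could be used follows; the helper names `split_fullname` and `is_top_level` are hypothetical, and only the two prefixes that appear in this data are handled.

```python
# Prefixes that appear in the comment data; other Reddit kinds map to "unknown".
FULLNAME_KINDS = {"t1": "comment", "t3": "submission"}

def split_fullname(fullname):
    """Split a fullname such as 't3_1k0t4jk' into (kind, base-36 id)."""
    prefix, _, base36_id = fullname.partition("_")
    return FULLNAME_KINDS.get(prefix, "unknown"), base36_id

def is_top_level(parent_comment_id):
    """A comment is top-level when its parent is the submission itself."""
    return split_fullname(parent_comment_id)[0] == "submission"

print(split_fullname("t3_1k0t4jk"))
print(is_top_level("t1_mngmxdi"))
```

Stripping the prefix also makes `parent_comment_id` directly comparable to the `comment_id` column, which stores ids without it.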


In [11]:
# Collect the comments for all the submissions.
all_comments = []
for submission in tqdm(submission_collection['id'], desc="Fetching comments for all submissions", unit="submission"):    
    comments = extract_comments_from_submission(submission)
    all_comments.extend(comments)
    # time.sleep(1)  # Optional delay to avoid rate limits

# Create a pandas DataFrame of these comments.
comments_df = pd.DataFrame.from_records(all_comments)
display(comments_df)

Fetching comments for all submissions:   4%|▍         | 62/1390 [00:47<16:56,  1.31submission/s]


KeyboardInterrupt: 
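
The run above was interrupted after 62 of 1390 submissions, and restarting the cell would begin again from the first one. One possible way to make the loop resumable is sketched below. This is an assumption about how it could be done rather than what the notebook does: `collect_resumable` is a hypothetical helper, and `fake_fetch` stands in for `extract_comments_from_submission` so the example runs without a Reddit connection.

```python
def collect_resumable(submission_ids, fetch_comments, done_ids=None):
    """Fetch comments per submission, skipping ids already in `done_ids`.

    Returns (comments, done_ids) so that an interrupted run can be continued.
    """
    done_ids = set() if done_ids is None else done_ids
    comments = []
    try:
        for submission_id in submission_ids:
            if submission_id in done_ids:
                continue
            comments.extend(fetch_comments(submission_id))
            done_ids.add(submission_id)
    except KeyboardInterrupt:
        print(f"Interrupted after {len(done_ids)} submissions; partial results kept.")
    return comments, done_ids

# Stand-in fetcher that simulates a single interruption on submission "c".
state = {"interrupted": False}

def fake_fetch(submission_id):
    if submission_id == "c" and not state["interrupted"]:
        state["interrupted"] = True
        raise KeyboardInterrupt
    return [{"parent_post_id": submission_id, "body": "example"}]

ids = ["a", "b", "c", "d"]
first, done = collect_resumable(ids, fake_fetch)
second, done = collect_resumable(ids, fake_fetch, done)
print(len(first), len(second), sorted(done))
```

The second call picks up at the submission that was interrupted, because only fully processed ids are added to `done_ids`.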

In [None]:
# Save the collected data to parquet format
COMMENT_PARQUET_PATH = './data/wallstreetbets-comment-collection.parquet'

# Create a pyarrow schema for the comment data
comment_schema = pyarrow.schema([
    ('parent_post_id', pyarrow.string()),
    ('parent_comment_id', pyarrow.string()),
    ('comment_id', pyarrow.string()),
    ('author', pyarrow.string()),
    ('created_utc', pyarrow.float64()),
    ('score', pyarrow.int64()),
    ('body', pyarrow.string())
])

In [None]:
# Write the comments to parquet file. If it exists, append to it.
if not os.path.exists(COMMENT_PARQUET_PATH):
    comments_df.to_parquet(COMMENT_PARQUET_PATH, engine='pyarrow', schema=comment_schema)
else:
    old_parquet = pd.read_parquet(COMMENT_PARQUET_PATH, engine='pyarrow', schema=comment_schema)
    new_parquet = pd.concat([old_parquet, comments_df])
    # comment_id uniquely identifies a comment, so deduplicate on it alone.
    new_parquet = new_parquet.drop_duplicates(subset=['comment_id'], keep='last').reset_index(drop=True)
    new_parquet.to_parquet(COMMENT_PARQUET_PATH, engine='pyarrow', schema=comment_schema)

## Part 4: Collecting More Data (w/ Kaggle)

### Read in the new reddit data from kaggle

In [12]:
# Download path for the kaggle dataset.
KAGGLE_DATASET_PATH = './data/kaggle-dataset'

kaggle.api.authenticate()
kaggle.api.dataset_download_files('gpreda/reddit-wallstreetsbets-posts', path='./data/', unzip=True)

Dataset URL: https://www.kaggle.com/datasets/gpreda/reddit-wallstreetsbets-posts


In [13]:
# Path to kaggle dataset.
KAGGLE_DATASET_RAW_PATH = './data/reddit_wsb.csv'


# Read the kaggle dataset.
kaggle_df = pd.read_csv(KAGGLE_DATASET_RAW_PATH)

# Drop the timestamp column if it exists.
if 'timestamp' in kaggle_df.columns:
    kaggle_df = kaggle_df.drop(columns=['timestamp'])

# Enforce the schema.
kaggle_df = kaggle_df.astype({
    'title': 'string',
    'score': 'int64',
    'id': 'string',
    'url': 'string',
    'comms_num': 'int64',
    'created': 'float64',
    'body': 'string',
})

# Display the kaggle dataset.
display(kaggle_df)

Unnamed: 0,title,score,id,url,comms_num,created,body
0,"It's not about the money, it's about sending a...",55,l6ulcx,https://v.redd.it/6j75regs72e61,6,1.611863e+09,
1,Math Professor Scott Steiner says the numbers ...,110,l6uibd,https://v.redd.it/ah50lyny62e61,23,1.611862e+09,
2,Exit the system,0,l6uhhn,https://www.reddit.com/r/wallstreetbets/commen...,47,1.611862e+09,The CEO of NASDAQ pushed to halt trading “to g...
3,NEW SEC FILING FOR GME! CAN SOMEONE LESS RETAR...,29,l6ugk6,https://sec.report/Document/0001193125-21-019848/,74,1.611862e+09,
4,"Not to distract from GME, just thought our AMC...",71,l6ufgy,https://i.redd.it/4h2sukb662e61.jpg,156,1.611862e+09,
...,...,...,...,...,...,...,...
53182,What I Learned Investigating SAVA FUD Spreaders,238,owd2pn,https://www.reddit.com/r/wallstreetbets/commen...,87,1.627906e+09,***TLDR: Three bitter scientists partnered up ...
53183,"Daily Popular Tickers Thread for August 02, 20...",228,owd1a5,https://www.reddit.com/r/wallstreetbets/commen...,1070,1.627906e+09,Your daily hype thread. Please keep the shitp...
53184,Hitler reacts to the market being irrational,7398,owc5dr,https://v.redd.it/46jxu074exe71,372,1.627902e+09,
53185,"Daily Discussion Thread for August 02, 2021",338,owbfjf,https://www.reddit.com/r/wallstreetbets/commen...,11688,1.627898e+09,Your daily trading discussion thread. Please k...


In [14]:
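# Mapping to previous schema.
# NOTE: Kaggle's 'score' is a raw vote count and 'url' is the link target, so the
# renamed 'upvote_ratio' and 'permalink' columns are not on the same scale or in
# the same format as the PRAW fields of the same name.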
kaggle_mapping = {
    'title': 'title',
    'score': 'upvote_ratio',
    'id': 'id',
    'url': 'permalink',
    'comms_num': 'num_comments',
    'created': 'created_utc',
    'body': 'selftext'
}

# Rename the columns to match our schema.
kaggle_df.rename(columns=kaggle_mapping, inplace=True)
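
The Kaggle `url` column holds the link target, which for link posts is an external address rather than a Reddit permalink (the first rows shown above point at `v.redd.it`, `sec.report`, and `i.redd.it`). The sketch below shows one way to tell the two cases apart before treating the column as `permalink`. It is an illustration only: `reddit_path_or_none` is a hypothetical helper, and returning `None` for external links is an assumed policy, not something the notebook does.

```python
from urllib.parse import urlsplit

def reddit_path_or_none(url):
    """Return the URL path if `url` points at reddit.com, otherwise None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host == "reddit.com" or host.endswith(".reddit.com"):
        return parts.path
    return None

# Both URLs appear in full elsewhere in this notebook's output.
print(reddit_path_or_none(
    "https://www.reddit.com/r/wallstreetbets/comments/1kci11n/what_are_your_moves_tomorrow_may_02_2025/"
))
print(reddit_path_or_none("https://v.redd.it/6j75regs72e61"))
```

For Reddit-hosted posts the returned path has the same relative form as the `permalink` values collected through PRAW.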

In [15]:
# Path to kaggle dataset final.
KAGGLE_DATASET_FINAL_PATH = './data/kaggle-reddit-wsb.parquet'

# Schema for the kaggle dataset.
kaggle_schema = pyarrow.schema([
    ('title', pyarrow.string()),
    ('upvote_ratio', pyarrow.int64()),
    ('id', pyarrow.string()),
    ('permalink', pyarrow.string()),
    ('num_comments', pyarrow.int64()),
    ('created_utc', pyarrow.float64()),
    ('selftext', pyarrow.string()),
])

# Write the dataframe to parquet.
kaggle_df.to_parquet(KAGGLE_DATASET_FINAL_PATH, engine='pyarrow', schema=kaggle_schema)

### Merge the new data table with the old data.

In [16]:
# Read in the datasets.
praw_dataset = pd.read_parquet(SUBMISSION_PARQUET_PATH, engine='pyarrow', schema=submission_schema)
kaggle_dataset = pd.read_parquet(KAGGLE_DATASET_FINAL_PATH, engine='pyarrow', schema=kaggle_schema)

# Keep only the columns that both datasets share.
praw_dataset = praw_dataset[['title', 'upvote_ratio', 'id', 'permalink', 'num_comments', 'created_utc', 'selftext']]
kaggle_dataset = kaggle_dataset[['title', 'upvote_ratio', 'id', 'permalink', 'num_comments', 'created_utc', 'selftext']]

# Concatenate the two datasets.
merged_dataset = pd.concat([praw_dataset, kaggle_dataset], ignore_index=True)

# Remove duplicates based on the 'id' column.
merged_dataset = merged_dataset.drop_duplicates(subset=['id'], keep='last').reset_index(drop=True)

In [17]:
# Create a final schema for the merged dataset.
merged_schema = pyarrow.schema([
    ('title', pyarrow.string()),
    ('upvote_ratio', pyarrow.float64()),
    ('id', pyarrow.string()),
    ('permalink', pyarrow.string()),
    ('num_comments', pyarrow.int64()),
    ('created_utc', pyarrow.float64()),
    ('selftext', pyarrow.string())
])

# Save the merged dataset to parquet format.
MERGED_DATASET_PATH = './data/merged-reddit-wsb.parquet'
merged_dataset.to_parquet(MERGED_DATASET_PATH, engine='pyarrow', schema=merged_schema)

In [18]:
# Display properties of the merged dataset.
merged_dataset.info()
print("-" * 50)
print("Merged dataset saved to:", MERGED_DATASET_PATH)
print("Number of columns in merged dataset:", len(merged_dataset.columns))
print("Columns in merged dataset:", merged_dataset.columns.tolist())
print("-" * 50)
print("Number of rows in praw dataset:", len(praw_dataset))
print("Number of rows in kaggle dataset:", len(kaggle_dataset))
print("Number of rows in merged dataset:", len(merged_dataset))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54577 entries, 0 to 54576
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   title         54577 non-null  object 
 1   upvote_ratio  54577 non-null  float64
 2   id            54577 non-null  object 
 3   permalink     54577 non-null  object 
 4   num_comments  54577 non-null  int64  
 5   created_utc   54577 non-null  float64
 6   selftext      26128 non-null  object 
dtypes: float64(2), int64(1), object(4)
memory usage: 2.9+ MB
--------------------------------------------------
Merged dataset saved to: ./data/merged-reddit-wsb.parquet
Number of columns in merged dataset: 7
Columns in merged dataset: ['title', 'upvote_ratio', 'id', 'permalink', 'num_comments', 'created_utc', 'selftext']
--------------------------------------------------
Number of rows in praw dataset: 1390
Number of rows in kaggle dataset: 53187
Number of rows in merged dataset: 54577
