# Reddit Data Collection and Visualization

This notebook collects comments from a specified subreddit through Reddit's API using PRAW (Python Reddit API Wrapper), filters them against predefined blacklists, and visualizes the result.

## Features:
- Fetch comments from a chosen subreddit listing ('top', 'hot', or 'controversial').
- Exclude comments from blacklisted authors and comments with specific content (e.g., '[deleted]', '[removed]').
- Visualize the collected data for insights.

## Setup and Imports

Before running this notebook, install the Python packages it imports: `praw`, `pandas`, `matplotlib`, `tqdm`, and `python-dotenv`.


In [None]:
import praw
import pandas as pd
from datetime import datetime
from typing import TypedDict
import matplotlib.pyplot as plt

# PRAW core exceptions
import prawcore
from prawcore.exceptions import Redirect, RequestException

## Configuration

Set the target subreddit and other configuration values here.

In [None]:
# Subreddit configuration
SUBREDDIT_NAME = 'Philippines'
SUBREDDIT_FILTER = 'hot'
LIMIT = 11  # Adjust as needed, up to a maximum of 1000 due to Reddit's API limit

# Data filtering criteria
AUTHOR_BLACKLIST = ['AutoModerator']
BODY_BLACKLIST = ['[deleted]', '[removed]']

# Define options for subreddit fetching, mainly the limit
OPTIONS = {
    'limit': LIMIT,
}

# Constants for file naming
CURRENT_DATETIME = datetime.today().strftime("%Y%m%d-%H%M%S")   # Current date and time for filename
FILENAME = f'data-{SUBREDDIT_NAME}-{CURRENT_DATETIME}-{SUBREDDIT_FILTER}.csv'  # Filename format
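
The filename format above can be sketched as a small function, which makes it easier to see what a given run would produce. The helper name `build_filename` and the fixed example timestamp are illustrative assumptions and are not part of the notebook.

```python
from datetime import datetime


def build_filename(subreddit: str, listing: str, when: datetime) -> str:
    """Hypothetical helper mirroring the notebook's FILENAME format."""
    stamp = when.strftime("%Y%m%d-%H%M%S")  # Same pattern as CURRENT_DATETIME
    return f"data-{subreddit}-{stamp}-{listing}.csv"


# Example timestamp chosen only for illustration
example = build_filename("Philippines", "hot", datetime(2024, 1, 2, 3, 4, 5))
print(example)  # data-Philippines-20240102-030405-hot.csv
```

Because the timestamp has one-second resolution, two runs for the same subreddit and listing get distinct filenames unless they start within the same second.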

## DataRow Definition

Define a structure for the data rows to ensure consistent data handling.

In [None]:
class DataRow(TypedDict):
    id: str
    author: str
    body: str
    score: int
    subreddit: str
    timestamp: str
    submission_name: str
    submission_text: str
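
As a rough illustration of how one row is shaped, the sketch below fills a `DataRow` with made-up placeholder values. The helper `format_timestamp` and every value in `sample_row` are assumptions for demonstration only. Note that `TypedDict` is checked by static type checkers, not at runtime.

```python
from datetime import datetime, timezone
from typing import TypedDict


class DataRow(TypedDict):
    id: str
    author: str
    body: str
    score: int
    subreddit: str
    timestamp: str
    submission_name: str
    submission_text: str


def format_timestamp(created_utc: float) -> str:
    """Hypothetical helper: turn a Unix timestamp into a UTC date string."""
    moment = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


# Placeholder values, not real Reddit data
sample_row: DataRow = {
    "id": "abc123",
    "author": "example_user",
    "body": "Example comment text",
    "score": 1,
    "subreddit": "Philippines",
    "timestamp": format_timestamp(0),
    "submission_name": "Example post title",
    "submission_text": "",
}
print(sample_row["timestamp"])  # 1970-01-01 00:00:00
```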

## Data Collection & Execution

This section connects to the Reddit API through PRAW, fetches comments from the configured subreddit, and filters them against the blacklists defined above. The resulting dataset is then saved to a CSV file for later analysis.


In [None]:
from dotenv import load_dotenv

# Load the .env file for reddit secrets
load_dotenv()
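
Once the `.env` file is loaded, it can help to fail early when a secret is missing instead of passing `None` to PRAW. The sketch below is one possible check. The function name `read_credentials` and the three variable names are assumptions, since the notebook does not state which keys the `.env` file uses.

```python
from typing import Mapping

# Hypothetical variable names; use whatever keys your .env file defines
REQUIRED_KEYS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")


def read_credentials(env: Mapping[str, str]) -> dict[str, str]:
    """Return the required secrets, or raise if any is missing or empty."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {key: env[key] for key in REQUIRED_KEYS}
```

In the notebook this could be called as `read_credentials(os.environ)` right after `load_dotenv()`.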

In [None]:
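import os

# Initialize PRAW Reddit instance with credentials & user agent.
# NOTE: the environment variable names below are assumptions (the original
# values were left blank); match them to the keys in your .env file.
reddit = praw.Reddit(
    client_id=os.getenv('REDDIT_CLIENT_ID'),
    client_secret=os.getenv('REDDIT_CLIENT_SECRET'),
    user_agent=os.getenv('REDDIT_USER_AGENT'),
    ratelimit_seconds=6000, # Give heavy allowance for rate limits to avoid TooManyRequests error
)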

data_collection: list[DataRow] = [] # List to hold all DataRow items

In [None]:
# Get subreddit instance from PRAW
subreddit_instance = reddit.subreddit(SUBREDDIT_NAME)
print(subreddit_instance)

In [None]:
# Select the subreddit section based on the filter argument (top, controversial, hot)
result = {
    'top': subreddit_instance.top(**OPTIONS),
    'controversial': subreddit_instance.controversial(**OPTIONS),
    'hot': subreddit_instance.hot(**OPTIONS),
}[SUBREDDIT_FILTER]
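
The dictionary lookup above raises a bare `KeyError` when `SUBREDDIT_FILTER` holds an unsupported value. A possible alternative, sketched below, validates the name first and then looks up the listing method with `getattr`. The names `select_listing` and `FakeSubreddit` are hypothetical; `FakeSubreddit` only stands in for a PRAW subreddit so that the example runs without network access.

```python
ALLOWED_FILTERS = ("top", "controversial", "hot")


def select_listing(subreddit, listing: str, **options):
    """Return the chosen listing, rejecting names outside ALLOWED_FILTERS."""
    if listing not in ALLOWED_FILTERS:
        raise ValueError(
            f"Unsupported filter {listing!r}; expected one of {ALLOWED_FILTERS}"
        )
    return getattr(subreddit, listing)(**options)


class FakeSubreddit:
    """Hypothetical stand-in for a PRAW subreddit, used only for this demo."""

    def top(self, limit=0):
        return [f"top-{i}" for i in range(limit)]

    def controversial(self, limit=0):
        return [f"controversial-{i}" for i in range(limit)]

    def hot(self, limit=0):
        return [f"hot-{i}" for i in range(limit)]


print(select_listing(FakeSubreddit(), "hot", limit=2))  # ['hot-0', 'hot-1']
```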

### Data Gathering Loop

In [None]:
from tqdm.auto import tqdm # Import tqdm for fancy progress bar

In [26]:
from datetime import timezone  # For timezone-aware UTC timestamps

try:
    with tqdm(total=LIMIT) as progress_bar:
        for submission in result:   # Iterate through submissions in the selected subreddit section
            progress_bar.update(1)

            submission.comments.replace_more(limit=None)    # Load all comments by replacing "MoreComments"
            comments = submission.comments.list()   # Flatten the comment tree into a list

            for comment in tqdm(comments):    # Iterate through each comment
                # Get author name, or set as empty string if not available
                author = (
                    comment.author.name
                    if isinstance(comment.author, praw.models.Redditor)
                    else ''
                )
                body = comment.body # Comment text

                # Skip comment if the author is in the blacklist
                if author in AUTHOR_BLACKLIST:
                    continue

                # Skip comment if the body is in the blacklist
                if body in BODY_BLACKLIST:
                    continue

                data_row: DataRow = {
                    'id': comment.id,
                    'subreddit': comment.subreddit.display_name,
                    'submission_name': submission.title,
                    'submission_text': submission.selftext,
                    'author': author,
                    'body': body,
                    'score': comment.score,
                    # utcfromtimestamp() is deprecated since Python 3.12; use an aware UTC datetime
                    'timestamp': datetime.fromtimestamp(
                        comment.created_utc, tz=timezone.utc
                    ).strftime('%Y-%m-%d %H:%M:%S'),
                }
                data_collection.append(data_row)    # Add the data row to the collection
except prawcore.exceptions.TooManyRequests:
    # Rate limit hit: stop early but keep the rows collected so far
    print(f"WARNING: Rate limit reached. Keeping {len(data_collection)} comments collected so far")
except Redirect:
    print("ERROR: Request redirected. Please check subreddit name and try again")
    exit(1)
except RequestException:
    print("ERROR: Request exception. Please check your network connection and try again")
    exit(1)

  0%|                                                 | 0/11 [00:00<?, ?it/s]
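
The two blacklist checks inside the loop can also be read as a single predicate. The sketch below restates them as a function and runs it on made-up `(author, body)` pairs. The name `should_keep` and the sample comments are illustrative assumptions, not data from Reddit.

```python
AUTHOR_BLACKLIST = ["AutoModerator"]
BODY_BLACKLIST = ["[deleted]", "[removed]"]


def should_keep(author: str, body: str) -> bool:
    """True when neither the author nor the exact body text is blacklisted."""
    return author not in AUTHOR_BLACKLIST and body not in BODY_BLACKLIST


# Made-up sample comments as (author, body) pairs
comments = [
    ("AutoModerator", "Welcome to the thread"),
    ("alice", "[deleted]"),
    ("", "[removed]"),
    ("bob", "Nice post"),
]
kept = [pair for pair in comments if should_keep(*pair)]
print(kept)  # [('bob', 'Nice post')]
```

The body check is an exact match, so a comment that merely mentions `[deleted]` inside a longer text is still kept.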


### Save the data

In [None]:
# Convert the list of DataRow dictionaries to a Pandas DataFrame
data_frame = pd.DataFrame(data_collection)

data_frame.to_csv(FILENAME, index=False) # Save the DataFrame to a CSV file, omitting the row index

## Data Visualization

Visualize the collected data to gain insights, such as the number of comments per post.

In [None]:
# Show the table

data_frame.head()


In [None]:
# Group the data by submission_name and count the number of comments for each post
comments_per_post = data_frame.groupby('submission_name')['id'].count()

# Sorting the counts and selecting the top N posts for better visibility in the bar chart
top_comments_per_post = comments_per_post.sort_values(ascending=False).head(10)

plt.figure(figsize=(12, 8))
top_comments_per_post.plot(kind='bar', color='lightgreen')
plt.title('Top 10 Posts by Number of Comments')
plt.xlabel('Post Title')
plt.ylabel('Number of Comments')
plt.xticks(rotation=45, ha='right')  # Rotate post titles for better readability
plt.show()

In [None]:
# Group the data by 'submission_name' and count the number of comments for each post
comments_per_post = data_frame.groupby('submission_name')['id'].count().sort_values(ascending=False)

# Convert the Series object to DataFrame for better readability
comments_per_post_df = comments_per_post.to_frame(name='Number of Comments')

# Resetting the index to have 'submission_name' as a column instead of an index
comments_per_post_df.reset_index(inplace=True)

# Optionally, rename the columns for better readability
comments_per_post_df.columns = ['Post Title', 'Number of Comments']

# Display the DataFrame
comments_per_post_df