## Project Overview

### Problem Statement
In today's digital world, understanding how Reddit communities function is crucial for moderators, users, and researchers...

### Data Collection Overview
- **Tools**:
  - PRAW (Python Reddit API Wrapper)
  - BeautifulSoup / Scrapy
  
- **Data Points to Collect**:
  - **Posts**: Title, content, upvotes...
  - **Comments**: Content, upvotes...
  
### Solution Approach
1. **Sentiment & Engagement Analysis**
   - Visualizations using Matplotlib and Seaborn...
   
2. **Correlation Analysis**
   - Apply classification algorithms using scikit-learn...

### Expected Deliverables
- Insight Report
- Actionable Recommendations

## Setup Environment

### Purpose
This section prepares our Google Colab environment for the Reddit Communities analysis project as outlined in our team's proposal. We'll install the necessary Python libraries to handle data collection, processing, analysis, and visualization.

### Key Libraries
- PRAW: For accessing the Reddit API
- pandas: For data manipulation and analysis
- numpy: For numerical computing
- matplotlib and seaborn: For data visualization
- nltk: For natural language processing and sentiment analysis
- scikit-learn: For machine learning tasks

### Alignment with Project Goals
These libraries support our objectives of:
1. Analyzing moderation strategies
2. Predicting post impact
3. Visualizing Reddit community interactions


# 1. Setup and Installation
Make sure to install all necessary libraries first:

In [2]:
%pip install praw




[notice] A new release of pip is available: 24.2 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip





# 2. Importing Libraries

In [3]:
import numpy as np
import pandas as pd
from datetime import datetime
from concurrent.futures import ThreadPoolExecutor, as_completed
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import praw  
import time 

# 3. Initialize Reddit API
Define and call a function to authenticate with the Reddit API:

In [4]:
def setup_reddit_api():
    return praw.Reddit(
        client_id="nrakGjG_wnBE_5UdcHNJoQ",
        client_secret="qmGr1q_4pGIBR0pYJE8cyhUbTbdX2w",
        user_agent="LittleCheesyExplorers/1.0 (Reddit Communities Analysis Project)"
    )

reddit = setup_reddit_api()

print(reddit.user.me())  

None


# 4. Load Read required Subreddits 
This code block will store the names of the subreddits that we want to collect data from.

The names will be stored in a text file and we will read from that and scrape based on that list.

In [5]:
with open('subreddits.txt', 'r') as file:
    subreddit_list = [line.strip() for line in file if line.strip()]

print(subreddit_list)

['AskReddit', 'ChangeMyView', 'TodayILearned', 'self', 'offmychest', 'Showerthoughts', 'personalfinance', 'AskScience', 'Writing', 'Advice', 'LetsNotMeet', 'SelfImprovement', 'DecidingToBeBetter', 'AskHistorians', 'TwoXChromosomes', 'CasualConversation', 'InternetIsBeautiful', 'nosleep', 'WritingPrompts', 'ExplainLikeImFive', 'TrueOffMyChest', 'UnpopularOpinion', 'relationships', 'TrueAskReddit', 'Confession', 'ShortScaryStories', 'ProRevenge', 'NuclearRevenge', 'LifeProTips', 'needadvice', 'TrueUnpopularOpinion']


# 5. Functions to Collect Data from Reddit
Define functions to collect posts, comments, and subreddit-level data. This is separated for modularity and ease of testing:

### 5.1 Collect Posts from a Subreddit

In [6]:
def collect_subreddit_posts(subreddit_name, post_limit=10):
    subreddit = reddit.subreddit(subreddit_name)
    posts_data = []

    for post in subreddit.hot(limit=post_limit):
        posts_data.append({
            'subreddit': subreddit_name,
            'title': post.title,
            'content': post.selftext,
            'upvotes': post.score,
            'upvote_ratio': post.upvote_ratio,
            'comments_count': post.num_comments,
            'author': post.author.name if post.author else '[deleted]',
            'timestamp': datetime.fromtimestamp(post.created_utc).strftime('%Y-%m-%d %H:%M:%S'),
            'post_id': post.id
        })

    return pd.DataFrame(posts_data)

### 5.2 Collect Data from Multiple Subreddits

In [7]:
def collect_posts_from_subreddits(subreddit_list, post_limit=10):
    all_posts = []

    for subreddit_name in subreddit_list:
        print(f"Collecting posts from r/{subreddit_name}")
        try:
            posts_df = collect_subreddit_posts(subreddit_name, post_limit)
            all_posts.append(posts_df)
            print(f"Collected {len(posts_df)} posts from r/{subreddit_name}")
        except Exception as e:
            print(f"Error collecting posts from r/{subreddit_name}: {str(e)}")

    combined_df = pd.concat(all_posts, ignore_index=True)
    return combined_df


### 5.3 Collect Subreddit-Level Data (Moderators, Rules, Subscriber Counts)
Functions to collect metadata for each subreddit, including subscriber count, rules, and moderator counts:

In [8]:
def collect_subreddit_level_data(reddit, subreddits, limit=10): 
    subreddit_level_data = []

    for subreddit_name in subreddits[:limit]:
        try:
            subreddit = reddit.subreddit(subreddit_name)

            subscriber_count = subreddit.subscribers

            try:
                rules = list(subreddit.rules())
                num_rules = len(rules)
                rule_severity = [rule.severity for rule in rules]
            except Exception as rule_error:
                num_rules = 0
                rule_severity = []
                print(f"Could not fetch rules for r/{subreddit_name}: {rule_error}")

            try:
                moderators = len(list(subreddit.moderators()))
            except Exception as mod_error:
                moderators = 0
                print(f"Could not fetch moderators for r/{subreddit_name}: {mod_error}")

            subreddit_data = {
                "subreddit_name": subreddit_name,
                "subscriber_count": subscriber_count,
                "num_rules": num_rules,
                "moderator_count": moderators,
                "rule_severity": rule_severity
            }
            subreddit_level_data.append(subreddit_data)

        except Exception as e:
            print(f"Error fetching data for subreddit {subreddit_name}: {e}")

    return subreddit_level_data


# Collecting Comments method


In [9]:
"""def collect_top_level_comments_from_csv(csv_file, comment_limit=2):
    # Read posts from CSV file
    posts_df = pd.read_csv(csv_file)
    comments_data = []

    # Iterate through each post in the CSV file
    for _, row in posts_df.iterrows():
        post_id = row['post_id']  # 'post_id' column from your CSV
        subreddit_name = row['subreddit']  # 'subreddit' column from your CSV
        
        try:
            # Fetch the post using the post ID
            post = reddit.submission(id=post_id)
            post.comments.replace_more(limit=0)  # Ensure all top-level comments are loaded
            
            # Debug: Print post details to confirm it's being processed
            print(f"Processing post {post_id} in subreddit {subreddit_name}")
            
            # Check if the post has comments
            if not post.comments:
                print(f"No comments found for post {post_id}")
                continue

            # Collect top-level comments
            for comment in post.comments[:comment_limit]:  # Limit to the top `comment_limit` comments
                if comment.parent_id.split('_')[1] == post_id:  # Ensure it's a top-level comment
                    
                    # Get comment author and karma (handle potential missing authors)
                    author_karma = 0
                    author_name = "[deleted]"
                    
                    if comment.author:
                        author_name = comment.author.name
                        author_karma = comment.author.link_karma + comment.author.comment_karma  # Total karma

                    comments_data.append({
                        'subreddit': subreddit_name,
                        'comment': comment.body,
                        'comment_author': author_name,
                        'author_karma': author_karma,
                        'post_title': post.title,
                        'post_content': post.selftext,
                        'post_upvotes': post.score,
                        'timestamp': datetime.fromtimestamp(comment.created_utc).strftime('%Y-%m-%d %H:%M:%S'),
                    })
            
            print(f"Collected {len(comments_data)} comments from post {post_id}")
        
        except Exception as e:
            print(f"Error collecting comments from post {post_id}: {e}")
        
        # Optional: Small delay to avoid hitting rate limits
        time.sleep(1) 
    
    # Return the DataFrame of collected comments
    return pd.DataFrame(comments_data)

"""

def collect_top_level_comments(csv_file, comment_limit=2):
    # Read the CSV containing post information
    posts_df = pd.read_csv(csv_file)
    comments_data = []

    # Loop through the DataFrame rows and extract post_id and subreddit
    for _, row in posts_df.iterrows():
        post_id = row['post_id']  # Ensure the CSV has a 'post_id' column
        subreddit_name = row['subreddit']  # Ensure the CSV has a 'subreddit' column

        try:
            # Get the post by its ID
            post = reddit.submission(id=post_id)
            post.comments.replace_more(limit=0)  # Ensure we get all comments

            # Debugging: Print post details to ensure it's fetched
            print(f"Processing post {post_id} from subreddit {subreddit_name}")

            # Collect top-level comments (limit to 'comment_limit' comments per post)
            for comment in post.comments[:comment_limit]:  # Limiting the number of comments
                if comment.parent_id.split('_')[1] == post_id:  # Check if it's a top-level comment
                    # Initialize author details
                    author_karma = 0
                    author_name = "[deleted]"
                    
                    if comment.author:
                        author_name = comment.author.name
                        author_karma = comment.author.link_karma + comment.author.comment_karma  # Total karma

                    # Add the comment data to the list
                    comments_data.append({
                        'subreddit': subreddit_name,
                        'comment': comment.body,
                        'comment_author': author_name,
                        'author_karma': author_karma,
                        'post_title': post.title,
                        'post_content': post.selftext,
                        'post_upvotes': post.score,
                        'timestamp': datetime.fromtimestamp(comment.created_utc).strftime('%Y-%m-%d %H:%M:%S'),
                    })

            print(f"Collected {len(comments_data)} comments from post {post_id}")
        
        except Exception as e:
            print(f"Error collecting comments from post {post_id}: {e}")
        
        # Small delay to avoid Reddit API rate limits
        time.sleep(1) 
    
    # Return the collected comments as a DataFrame
    return pd.DataFrame(comments_data)

def save_comments_to_csv(csv_file, comment_limit=2):
    comments_df = collect_top_level_comments(csv_file, comment_limit)
    
    # Check if the DataFrame has data
    if comments_df.empty:
        print("No comments collected.")
    else:
        comments_df.to_csv("top_level_comments.csv", index=False)
        print("\nComments saved to collected_comments_with_author_and_karma.csv")

"""
    # Read the post IDs and other data from the provided CSV file
    posts_df = pd.read_csv(csv_file)
    comments_data = []
    
    for _, row in posts_df.iterrows():
        post_id = row['post_id']
        subreddit_name = row['subreddit']
        
        try:
            # Fetch the post using the post_id
            post = reddit.submission(id=post_id)
            
            # Replace "MoreComments" with empty (expand all top-level comments)
            post.comments.replace_more(limit=0)
            
            # Iterate through top-level comments
            top_comments = [comment for comment in post.comments if isinstance(comment, praw.models.Comment)]
            
            # Limit the number of comments to comment_limit
            for comment in top_comments[:comment_limit]:  
                comments_data.append({
                    'subreddit': subreddit_name,
                    'comment': comment.body,
                    'comment_author': comment.author.name if comment.author else '[deleted]',
                    'author_karma': (comment.author.link_karma + comment.author.comment_karma) if comment.author else 0,
                    'post_title': post.title,
                    'post_content': post.selftext,
                    'post_upvotes': post.score,
                    'timestamp': datetime.fromtimestamp(comment.created_utc).strftime('%Y-%m-%d %H:%M:%S'),
                })
            
            print(f"Collected {len(top_comments[:comment_limit])} comments from post {post_id}")
        
        except Exception as e:
            print(f"Error collecting comments from post {post_id}: {e}")
        
        # Optional: Small delay to avoid hitting rate limits
        time.sleep(1) 
    
    return pd.DataFrame(comments_data)

"""

# Example usage
# save_comments_to_csv("reddit_posts.csv", comment_limit=2)"""

'\n    # Read the post IDs and other data from the provided CSV file\n    posts_df = pd.read_csv(csv_file)\n    comments_data = []\n    \n    for _, row in posts_df.iterrows():\n        post_id = row[\'post_id\']\n        subreddit_name = row[\'subreddit\']\n        \n        try:\n            # Fetch the post using the post_id\n            post = reddit.submission(id=post_id)\n            \n            # Replace "MoreComments" with empty (expand all top-level comments)\n            post.comments.replace_more(limit=0)\n            \n            # Iterate through top-level comments\n            top_comments = [comment for comment in post.comments if isinstance(comment, praw.models.Comment)]\n            \n            # Limit the number of comments to comment_limit\n            for comment in top_comments[:comment_limit]:  \n                comments_data.append({\n                    \'subreddit\': subreddit_name,\n                    \'comment\': comment.body,\n                    \'com

# 6 Analyze and Label Engagement for Posts
Calculates engagement scores and labels posts with engagement levels:

In [10]:

def calculate_engagement(post):
    upvotes = post['upvotes']
    comments_count = post['comments_count']
    subscribers = post['subscriber_count']
    
    if subscribers > 0:
        engagement = (upvotes + comments_count) / subscribers
    else:
        engagement = 0  
    return engagement

def define_engagement(posts_df):
    def label_engagement(score):
        if score < 0.0025:
            return "Low"
        elif score < 0.0050:
            return "Medium"
        else:
            return "High"
    posts_df['engagement_level'] = posts_df['normalized_engagement'].apply(label_engagement)
    return posts_df




### 6.1 Define Engagement Levels

In [11]:
def define_engagement(posts_df):
    def label_engagement(score):
        if score < 1:
            return "Low"
        elif score < 5:
            return "Medium"
        else:
            return "High"
    
    posts_df['engagement_level'] = posts_df['normalized_engagement'].apply(label_engagement)
    return posts_df


# 7. Add Features to Posts

In [12]:
def add_features_to_posts(df):
    df['title_length'] = df['title'].apply(len)
    df['post_length'] = df['content'].apply(lambda x: len(str(x)) if pd.notnull(x) else 0)
    
    df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
    
    df['time_of_day'] = df['timestamp'].dt.hour
    df['day_of_week'] = df['timestamp'].dt.day_name()

    return df

# 9. Running All Methods
Finally, running all the methods in order 

In [13]:
# Collect posts from all subreddits
all_posts_df = collect_posts_from_subreddits(subreddit_list, post_limit=4)

# Save all collected posts to a single CSV file
csv_filename = "subreddit_posts.csv"
all_posts_df.to_csv(csv_filename, index=False)
print(f"\nAll posts data saved to {csv_filename}")

# Print summary of collected posts
print("\nSummary of collected posts:")
print(all_posts_df['subreddit'].value_counts())

# Collect subreddit-level data (e.g., rules, moderation, etc.)
subreddit_level_data = collect_subreddit_level_data(reddit, subreddit_list, limit=20)

# Convert to DataFrame and save to CSV
subreddit_level_df = pd.DataFrame(subreddit_level_data)
subreddit_level_df.to_csv("subreddit_level_data.csv", index=False)
print("\nSubreddit-level data saved to subreddit_level_data.csv")



# Save updated posts data with engagement scores to CSV
all_posts_df.to_csv("subreddit_posts_with_engagement.csv", index=False)
print("\nSubreddit posts with engagement scores saved to subreddit_posts_with_engagement.csv")

# Run the function to collect top-level comments and save them to a new CSV
save_comments_to_csv("subreddit_posts.csv", comment_limit=2)

cleaned_labeled_data = pd.read_csv("cleaned_labeled_subreddit_posts.csv")
cleaned_labeled_data = cleaned_labeled_data.merge(subreddit_level_df[['subreddit_name', 'subscriber_count']], 
                                                  left_on='subreddit', right_on='subreddit_name', how='left')

cleaned_labeled_data['engagement'] = cleaned_labeled_data.apply(calculate_engagement, axis=1)
cleaned_labeled_data['normalized_engagement'] = cleaned_labeled_data['engagement'] * 10000

cleaned_labeled_data = define_engagement(cleaned_labeled_data)

cleaned_labeled_data.to_csv("labeled_subreddit_posts.csv", index=False)
print("\nUpdated data with engagement levels saved to labeled_subreddit_posts.csv")

print("\nDisplaying updated dataframe with engagement scores:")
print(cleaned_labeled_data.head())  

Collecting posts from r/AskReddit
Collected 4 posts from r/AskReddit
Collecting posts from r/ChangeMyView
Collected 4 posts from r/ChangeMyView
Collecting posts from r/TodayILearned
Collected 4 posts from r/TodayILearned
Collecting posts from r/self
Collected 4 posts from r/self
Collecting posts from r/offmychest
Collected 4 posts from r/offmychest
Collecting posts from r/Showerthoughts
Collected 4 posts from r/Showerthoughts
Collecting posts from r/personalfinance
Collected 4 posts from r/personalfinance
Collecting posts from r/AskScience
Collected 4 posts from r/AskScience
Collecting posts from r/Writing
Collected 4 posts from r/Writing
Collecting posts from r/Advice
Collected 4 posts from r/Advice
Collecting posts from r/LetsNotMeet
Collected 4 posts from r/LetsNotMeet
Collecting posts from r/SelfImprovement
Collected 4 posts from r/SelfImprovement
Collecting posts from r/DecidingToBeBetter
Collected 4 posts from r/DecidingToBeBetter
Collecting posts from r/AskHistorians
Collected 4

  rules = list(subreddit.rules())


Could not fetch rules for r/AskReddit: 'str' object has no attribute 'severity'
Could not fetch moderators for r/AskReddit: 'Subreddit' object has no attribute 'moderators'
Could not fetch rules for r/ChangeMyView: 'str' object has no attribute 'severity'
Could not fetch moderators for r/ChangeMyView: 'Subreddit' object has no attribute 'moderators'
Could not fetch rules for r/TodayILearned: 'str' object has no attribute 'severity'
Could not fetch moderators for r/TodayILearned: 'Subreddit' object has no attribute 'moderators'
Could not fetch rules for r/self: 'str' object has no attribute 'severity'
Could not fetch moderators for r/self: 'Subreddit' object has no attribute 'moderators'
Could not fetch rules for r/offmychest: 'str' object has no attribute 'severity'
Could not fetch moderators for r/offmychest: 'Subreddit' object has no attribute 'moderators'
Could not fetch rules for r/Showerthoughts: 'str' object has no attribute 'severity'
Could not fetch moderators for r/Showerthoug

# FINAL CLEANED CSV Creation and Storing
In this Code Block we will be creating a csv file and storing all our scraped data.