## Project Overview

### Problem Statement
In today's digital world, understanding how Reddit communities function is crucial for moderators, users, and researchers...

### Data Collection Overview
- **Tools**:
  - PRAW (Python Reddit API Wrapper)
  - BeautifulSoup / Scrapy
  
- **Data Points to Collect**:
  - **Posts**: Title, content, upvotes...
  - **Comments**: Content, upvotes...
  
### Solution Approach
1. **Sentiment & Engagement Analysis**
   - Visualizations using Matplotlib and Seaborn...
   
2. **Correlation Analysis**
   - Apply classification algorithms using scikit-learn...

### Expected Deliverables
- Insight Report
- Actionable Recommendations

## Setup Environment

### Purpose
This section prepares our Google Colab environment for the Reddit Communities analysis project as outlined in our team's proposal. We'll install the necessary Python libraries to handle data collection, processing, analysis, and visualization.

### Key Libraries
- PRAW: For accessing the Reddit API
- pandas: For data manipulation and analysis
- numpy: For numerical computing
- matplotlib and seaborn: For data visualization
- nltk: For natural language processing and sentiment analysis
- scikit-learn: For machine learning tasks

### Alignment with Project Goals
These libraries support our objectives of:
1. Analyzing moderation strategies
2. Predicting post impact
3. Visualizing Reddit community interactions


# 1. Setup and Installation
Make sure to install all necessary libraries first:

In [1]:
%pip install praw

Note: you may need to restart the kernel to use updated packages.


# 2. Importing Libraries

In [2]:
import numpy as np
import pandas as pd
from datetime import datetime
from concurrent.futures import ThreadPoolExecutor, as_completed
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import praw  # This is the Reddit API wrapper

# 3. Initialize Reddit API
Define and call a function to authenticate with the Reddit API:

In [3]:
def setup_reddit_api():
    return praw.Reddit(
        client_id="nrakGjG_wnBE_5UdcHNJoQ",
        client_secret="qmGr1q_4pGIBR0pYJE8cyhUbTbdX2w",
        user_agent="LittleCheesyExplorers/1.0 (Reddit Communities Analysis Project)"
    )

reddit = setup_reddit_api()

print(reddit.user.me())  # To test if the Reddit API connection is successful

None


# 4. Load Read required Subreddits 
This code block will store the names of the subreddits that we want to collect data from.

The names will be stored in a text file and we will read from that and scrape based on that list.

In [4]:
with open('subreddits.txt', 'r') as file:
    subreddit_list = [line.strip() for line in file if line.strip()]

print(subreddit_list)

['WritingPrompts', 'TrueOffMyChest', 'NoSleep', 'ExplainLikeImFive', 'IAmA', 'CasualConversation', 'TrueAskReddit', 'Confession', 'relationships', 'ShortScaryStories', 'ProRevenge', 'NuclearRevenge', 'LifeProTips', 'needadvice', 'TrueUnpopularOpinion']


# 5. Functions to Collect Data from Reddit
Define functions to collect posts, comments, and subreddit-level data. This is separated for modularity and ease of testing:

### 5.1 Collect Posts from a Subreddit

In [5]:
def collect_subreddit_posts(subreddit_name, post_limit=10):
    # Collect posts from a single subreddit
    subreddit = reddit.subreddit(subreddit_name)
    posts_data = []

    for post in subreddit.hot(limit=post_limit):
        posts_data.append({
            'subreddit': subreddit_name,
            'title': post.title,
            'content': post.selftext,
            'upvotes': post.score,
            'upvote_ratio': post.upvote_ratio,
            'comments_count': post.num_comments,
            'author': post.author.name if post.author else '[deleted]',
            'timestamp': datetime.fromtimestamp(post.created_utc).strftime('%Y-%m-%d %H:%M:%S'),
            'post_id': post.id
        })

    return pd.DataFrame(posts_data)

### 5.2 Collect Data from Multiple Subreddits

In [6]:
def collect_posts_from_subreddits(subreddit_list, post_limit=10):
    all_posts = []

    for subreddit_name in subreddit_list:
        print(f"Collecting posts from r/{subreddit_name}")
        try:
            posts_df = collect_subreddit_posts(subreddit_name, post_limit)
            all_posts.append(posts_df)
            print(f"Collected {len(posts_df)} posts from r/{subreddit_name}")
        except Exception as e:
            print(f"Error collecting posts from r/{subreddit_name}: {str(e)}")

    combined_df = pd.concat(all_posts, ignore_index=True)
    return combined_df


### 5.3 Collect Subreddit-Level Data (Moderators, Rules, Subscriber Counts)
Functions to collect metadata for each subreddit, including subscriber count, rules, and moderator counts:

In [7]:
def collect_subreddit_level_data(reddit, subreddits, limit=10): 
    # Collect data at the subreddit level (e.g., subscriber count, rules)
    subreddit_level_data = []

    for subreddit_name in subreddits[:limit]:
        try:
            subreddit = reddit.subreddit(subreddit_name)

            subscriber_count = subreddit.subscribers

            try:
                rules = list(subreddit.rules())
                num_rules = len(rules)
                rule_severity = [rule.severity for rule in rules]
            except Exception as rule_error:
                num_rules = 0
                rule_severity = []
                print(f"Could not fetch rules for r/{subreddit_name}: {rule_error}")

            try:
                moderators = len(list(subreddit.moderators()))
            except Exception as mod_error:
                moderators = 0
                print(f"Could not fetch moderators for r/{subreddit_name}: {mod_error}")

            subreddit_data = {
                "subreddit_name": subreddit_name,
                "subscriber_count": subscriber_count,
                "num_rules": num_rules,
                "moderator_count": moderators,
                "rule_severity": rule_severity
            }
            subreddit_level_data.append(subreddit_data)

        except Exception as e:
            print(f"Error fetching data for subreddit {subreddit_name}: {e}")

    return subreddit_level_data


### 5.5 Collect Post Comments

In [8]:
""" def collect_comments(post_id, comment_limit=5):
    # Collect comments for a single post
    comments_data = []
    try:
        submission = reddit.submission(id=post_id)
        submission.comments.replace_more(limit=0)
        for comment in submission.comments.list()[:comment_limit]:
            comments_data.append({
                'post_id': post_id,
                'comment_id': comment.id,
                'content': comment.body,
                'upvotes': comment.score,
                'author': comment.author.name if comment.author else '[deleted]',
                'timestamp': datetime.fromtimestamp(comment.created_utc).strftime('%Y-%m-%d %H:%M:%S')
            })
    except Exception as e:
        print(f"Error collecting comments for post {post_id}: {str(e)}")

    return comments_data
""" 

' def collect_comments(post_id, comment_limit=5):\n    # Collect comments for a single post\n    comments_data = []\n    try:\n        submission = reddit.submission(id=post_id)\n        submission.comments.replace_more(limit=0)\n        for comment in submission.comments.list()[:comment_limit]:\n            comments_data.append({\n                \'post_id\': post_id,\n                \'comment_id\': comment.id,\n                \'content\': comment.body,\n                \'upvotes\': comment.score,\n                \'author\': comment.author.name if comment.author else \'[deleted]\',\n                \'timestamp\': datetime.fromtimestamp(comment.created_utc).strftime(\'%Y-%m-%d %H:%M:%S\')\n            })\n    except Exception as e:\n        print(f"Error collecting comments for post {post_id}: {str(e)}")\n\n    return comments_data\n'

### 5.6  Fetch Top-Level Comments

In [9]:
""" def collect_top_level_comments(post_id, comment_limit=5):
    # Collect top-level comments for a single post
    comments_data = []
    try:
        submission = reddit.submission(id=post_id)
        submission.comments.replace_more(limit=0)
        for comment in submission.comments[:comment_limit]:
            comments_data.append({
                'post_id': post_id,
                'comment_id': comment.id,
                'content': comment.body,
                'upvotes': comment.score,
                'author': comment.author.name if comment.author else '[deleted]',
                'timestamp': datetime.fromtimestamp(comment.created_utc).strftime('%Y-%m-%d %H:%M:%S')
            })
    except Exception as e:
        print(f"Error collecting comments for post {post_id}: {str(e)}")

    return comments_data
"""

' def collect_top_level_comments(post_id, comment_limit=5):\n    # Collect top-level comments for a single post\n    comments_data = []\n    try:\n        submission = reddit.submission(id=post_id)\n        submission.comments.replace_more(limit=0)\n        for comment in submission.comments[:comment_limit]:\n            comments_data.append({\n                \'post_id\': post_id,\n                \'comment_id\': comment.id,\n                \'content\': comment.body,\n                \'upvotes\': comment.score,\n                \'author\': comment.author.name if comment.author else \'[deleted]\',\n                \'timestamp\': datetime.fromtimestamp(comment.created_utc).strftime(\'%Y-%m-%d %H:%M:%S\')\n            })\n    except Exception as e:\n        print(f"Error collecting comments for post {post_id}: {str(e)}")\n\n    return comments_data\n'

### 5.5 Analyze and Label Engagement for Posts
Calculates engagement scores and labels posts with engagement levels:

In [10]:

def calculate_engagement(post):
    upvotes = post['upvotes']
    comments_count = post['comments_count']
    subscribers = post['subscriber_count']
    
    if subscribers > 0:
        engagement = (upvotes + comments_count) / subscribers
    else:
        engagement = 0  
    return engagement

def define_engagement(posts_df):
    def label_engagement(score):
        if score < 0.0025:
            return "Low"
        elif score < 0.0050:
            return "Medium"
        else:
            return "High"
    posts_df['engagement_level'] = posts_df['normalized_engagement'].apply(label_engagement)
    return posts_df

#


### 5.6 Add Features to Posts

In [11]:
def add_features_to_posts(df):
    # Title and content lengths
    df['title_length'] = df['title'].apply(len)
    df['post_length'] = df['content'].apply(lambda x: len(str(x)) if pd.notnull(x) else 0)
    
    # Convert timestamp to datetime format
    df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
    
    # Time of day and day of week
    df['time_of_day'] = df['timestamp'].dt.hour
    df['day_of_week'] = df['timestamp'].dt.day_name()

    return df

### 5.7 Define Engagement Levels

In [12]:
def define_engagement(posts_df):
    def label_engagement(score):
        if score < 1:
            return "Low"
        elif score < 5:
            return "Medium"
        else:
            return "High"
    
    posts_df['engagement_level'] = posts_df['engagement'].apply(label_engagement)
    return posts_df


### 5.6 Main Function to Collect and Label Engagement

# Running Collection Methods

# 9. Run the Script
Finally, run the main function with your desired subreddit list and post limit:

In [13]:
# Collect posts from all subreddits
all_posts_df = collect_posts_from_subreddits(subreddit_list, post_limit=10000)

# Save all collected posts to a single CSV file
csv_filename = "subreddit_posts.csv"
all_posts_df.to_csv(csv_filename, index=False)
print(f"\nAll posts data saved to {csv_filename}")

# Print summary of collected posts
print("\nSummary of collected posts:")
print(all_posts_df['subreddit'].value_counts())

# Collect subreddit-level data (e.g., rules, moderation, etc.)
subreddit_level_data = collect_subreddit_level_data(reddit, subreddit_list, limit=5000)

# Convert to DataFrame and save to CSV
subreddit_level_df = pd.DataFrame(subreddit_level_data)
subreddit_level_df.to_csv("subreddit_level_data.csv", index=False)
print("\nSubreddit-level data saved to subreddit_level_data.csv")


all_posts_df = all_posts_df.merge(subreddit_level_df[['subreddit_name', 'subscriber_count']], 
                          left_on='subreddit', right_on='subreddit_name', how='left')
all_posts_df['engagement'] = all_posts_df.apply(calculate_engagement, axis=1)
all_posts_df['normalized_engagement'] = all_posts_df['engagement'] * 10000


Collecting posts from r/WritingPrompts
Collected 975 posts from r/WritingPrompts
Collecting posts from r/TrueOffMyChest
Collected 904 posts from r/TrueOffMyChest
Collecting posts from r/NoSleep
Collected 554 posts from r/NoSleep
Collecting posts from r/ExplainLikeImFive
Collected 328 posts from r/ExplainLikeImFive
Collecting posts from r/IAmA
Collected 560 posts from r/IAmA
Collecting posts from r/CasualConversation
Collected 754 posts from r/CasualConversation
Collecting posts from r/TrueAskReddit
Collected 225 posts from r/TrueAskReddit
Collecting posts from r/Confession
Collected 177 posts from r/Confession
Collecting posts from r/relationships
Collected 250 posts from r/relationships
Collecting posts from r/ShortScaryStories
Collected 979 posts from r/ShortScaryStories
Collecting posts from r/ProRevenge
Collected 44 posts from r/ProRevenge
Collecting posts from r/NuclearRevenge
Collected 423 posts from r/NuclearRevenge
Collecting posts from r/LifeProTips
Collected 355 posts from r/

  rules = list(subreddit.rules())


Could not fetch rules for r/WritingPrompts: 'str' object has no attribute 'severity'
Could not fetch moderators for r/WritingPrompts: 'Subreddit' object has no attribute 'moderators'
Could not fetch rules for r/TrueOffMyChest: 'str' object has no attribute 'severity'
Could not fetch moderators for r/TrueOffMyChest: 'Subreddit' object has no attribute 'moderators'
Could not fetch rules for r/NoSleep: 'str' object has no attribute 'severity'
Could not fetch moderators for r/NoSleep: 'Subreddit' object has no attribute 'moderators'
Could not fetch rules for r/ExplainLikeImFive: 'str' object has no attribute 'severity'
Could not fetch moderators for r/ExplainLikeImFive: 'Subreddit' object has no attribute 'moderators'
Could not fetch rules for r/IAmA: 'str' object has no attribute 'severity'
Could not fetch moderators for r/IAmA: 'Subreddit' object has no attribute 'moderators'
Could not fetch rules for r/CasualConversation: 'str' object has no attribute 'severity'
Could not fetch moderato

# FINAL CLEANED CSV Creation and Storing
In this Code Block we will be creating a csv file and storing all our scraped data.