## Project Overview

### Problem Statement
In today's digital world, understanding how Reddit communities function is crucial for moderators, users, and researchers...

### Data Collection Overview
- **Tools**:
  - PRAW (Python Reddit API Wrapper)
  - BeautifulSoup / Scrapy
  
- **Data Points to Collect**:
  - **Posts**: Title, content, upvotes...
  - **Comments**: Content, upvotes...
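
The collection functions later in this notebook gather posts; comments could be gathered along the same lines. The following is a minimal sketch, not the project's implementation: the helper name `collect_post_comments` and its `comment_limit` parameter are hypothetical, and it assumes `reddit` is an authenticated `praw.Reddit` instance.

```python
def collect_post_comments(reddit, post_id, comment_limit=None):
    """Return a list of dicts, one per comment on the given post (hypothetical helper)."""
    submission = reddit.submission(id=post_id)
    # Drop the "load more comments" placeholders instead of resolving them,
    # which keeps the number of API requests small.
    submission.comments.replace_more(limit=0)

    rows = []
    # .list() flattens the comment tree; slicing with None keeps every comment.
    for comment in submission.comments.list()[:comment_limit]:
        rows.append({
            'post_id': post_id,
            'comment_id': comment.id,
            'content': comment.body,
            'upvotes': comment.score,
            'author': comment.author.name if comment.author else '[deleted]',
        })
    return rows
```

Calling `pd.DataFrame(collect_post_comments(reddit, post_id))` would give a table shaped like the posts table, keyed by `post_id`.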
  
### Solution Approach
1. **Sentiment & Engagement Analysis**
   - Visualizations using Matplotlib and Seaborn...
   
2. **Correlation Analysis**
   - Apply classification algorithms using scikit-learn...

### Expected Deliverables
- Insight Report
- Actionable Recommendations

## Setup Environment

### Purpose
This section prepares our Google Colab environment for the Reddit Communities analysis project as outlined in our team's proposal. We'll install the necessary Python libraries to handle data collection, processing, analysis, and visualization.

### Key Libraries
- PRAW: For accessing the Reddit API
- pandas: For data manipulation and analysis
- numpy: For numerical computing
- matplotlib and seaborn: For data visualization
- nltk: For natural language processing and sentiment analysis
- scikit-learn: For machine learning tasks

### Alignment with Project Goals
These libraries support our objectives of:
1. Analyzing moderation strategies
2. Predicting post impact
3. Visualizing Reddit community interactions

### Installation Code
Run the following cell to install PRAW and import the libraries used in this notebook:


In [33]:
# Install PRAW first so the import below succeeds on a fresh runtime
%pip install praw

import numpy as np
import pandas as pd
import matplotlib
import seaborn
import nltk
import praw
from datetime import datetime


Note: you may need to restart the kernel to use updated packages.


# PRAW API Config

This block of code creates the connection to the Reddit application by authenticating with the app's credentials.

In [34]:
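import os

def setup_reddit_api():
    # Read credentials from environment variables instead of hardcoding them.
    # The variable names below are our own choice (an assumption); set them before running.
    return praw.Reddit(
        client_id=os.environ["REDDIT_CLIENT_ID"],
        client_secret=os.environ["REDDIT_CLIENT_SECRET"],
        user_agent="LittleCheesyExplorers/1.0 (Reddit Communities Analysis Project)"
    )

reddit = setup_reddit_api()

# Without a username and password the client is read-only, so this prints None
print(reddit.user.me())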
subreddit = reddit.subreddit("science")
print(subreddit.display_name)
print(subreddit.title)

None
science
Reddit Science


# Read Required Subreddits
This code block loads the names of the subreddits we want to collect data from.

The names are stored in a text file (`subreddits.txt`, one name per line). We read that file and scrape each subreddit on the list.

In [35]:
# Read subreddit names from the text file
with open('subreddits.txt', 'r') as file:
    subreddit_list = [line.strip() for line in file if line.strip()]

print(subreddit_list)

['AskReddit', 'news', 'funny', 'gaming', 'todayilearned', 'science']
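
The list comprehension above already skips blank lines. A slightly more defensive variant is sketched below; the function name `read_subreddit_list`, the `#` comment convention, and the `r/` prefix handling are assumptions added for illustration, not part of the original file format.

```python
def read_subreddit_list(path):
    """Read subreddit names, skipping blanks, '#' comments and duplicates."""
    names = []
    seen = set()
    with open(path, 'r', encoding='utf-8') as file:
        for line in file:
            name = line.strip()
            if not name or name.startswith('#'):
                continue
            # Accept both "science" and "r/science"
            if name.lower().startswith('r/'):
                name = name[2:]
            # Compare case-insensitively, but keep the first spelling seen
            key = name.lower()
            if key not in seen:
                seen.add(key)
                names.append(name)
    return names
```

`subreddit_list = read_subreddit_list('subreddits.txt')` would then replace the `with open(...)` block while preserving the order of the file.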


# Collection Methods

In [36]:
def collect_subreddit_posts(subreddit_name, post_limit=10):
    subreddit = reddit.subreddit(subreddit_name)
    posts_data = []

    for post in subreddit.hot(limit=post_limit):
        posts_data.append({
            'subreddit': subreddit_name,
            'title': post.title,
            'content': post.selftext,
            'upvotes': post.score,
            'upvote_ratio': post.upvote_ratio,
            'comments_count': post.num_comments,
            'author': post.author.name if post.author else '[deleted]',
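            # created_utc is a UTC epoch; pd.to_datetime keeps it in UTC (fromtimestamp would convert to local time)
            'timestamp': pd.to_datetime(post.created_utc, unit='s').strftime('%Y-%m-%d %H:%M:%S'),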
            'post_id': post.id
        })

    return pd.DataFrame(posts_data)

def collect_posts_from_subreddits(subreddit_list, post_limit=10):
    all_posts = []
    
    for subreddit_name in subreddit_list:
        print(f"Collecting posts from r/{subreddit_name}")
        try:
            posts_df = collect_subreddit_posts(subreddit_name, post_limit)
            all_posts.append(posts_df)
            print(f"Collected {len(posts_df)} posts from r/{subreddit_name}")
        except Exception as e:
            print(f"Error collecting posts from r/{subreddit_name}: {str(e)}")
    
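    # pd.concat raises ValueError on an empty list, so handle the case where every subreddit failed
    if not all_posts:
        return pd.DataFrame(columns=['subreddit', 'title', 'content', 'upvotes', 'upvote_ratio',
                                     'comments_count', 'author', 'timestamp', 'post_id'])

    combined_df = pd.concat(all_posts, ignore_index=True)
    return combined_df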
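
Hot listings can begin with moderator-pinned (stickied) threads, which may distort engagement statistics. The sketch below shows one way to filter them out before building the records. The helper name `skip_stickied` is hypothetical; it relies on the `stickied` attribute that PRAW exposes on submissions and treats a missing attribute as "not pinned".

```python
def skip_stickied(posts):
    """Yield only posts that moderators have not pinned (hypothetical helper)."""
    for post in posts:
        # Treat objects without the attribute as ordinary, unpinned posts
        if not getattr(post, 'stickied', False):
            yield post
```

In `collect_subreddit_posts`, the loop could then read `for post in skip_stickied(subreddit.hot(limit=post_limit)):`. Whether pinned threads should be excluded depends on the analysis, so this is left as an option rather than the default.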

# Running Collection Methods

In [38]:
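# Collect posts from all subreddits
# Note: Reddit listings return at most roughly 1,000 items, so a limit of 10000
# yields fewer posts per subreddit than requested (see the counts in the output).
all_posts_df = collect_posts_from_subreddits(subreddit_list, post_limit=10000)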

# Save all collected posts to a single CSV file
csv_filename = "subreddit_posts.csv"
all_posts_df.to_csv(csv_filename, index=False)
print(f"\nAll posts data saved to {csv_filename}")

# Print summary of collected posts
print("\nSummary of collected posts:")
print(all_posts_df['subreddit'].value_counts())

# Sample data output
print("\nSample post data:")
print(all_posts_df.iloc[0] if len(all_posts_df) > 0 else "No posts collected")

Collecting posts from r/AskReddit
Collected 851 posts from r/AskReddit
Collecting posts from r/news
Collected 203 posts from r/news
Collecting posts from r/funny
Collected 360 posts from r/funny
Collecting posts from r/gaming
Collected 150 posts from r/gaming
Collecting posts from r/todayilearned
Collected 504 posts from r/todayilearned
Collecting posts from r/science
Collected 774 posts from r/science

All posts data saved to subreddit_posts.csv

Summary of collected posts:
subreddit
AskReddit        851
science          774
todayilearned    504
funny            360
news             203
gaming           150
Name: count, dtype: int64

Sample post data:
subreddit                                                 AskReddit
title                           2024 United States Elections Thread
content           Please use this thread to discuss the ongoing ...
upvotes                                                          98
upvote_ratio                                                   0.68

# CSV Creation and Storing
This code block creates a CSV file and stores all the scraped data in it.