# Data Collection

## 1. Project Overview

This notebook focuses on extracting **political debate data** from Reddit using the **Reddit API**. The goal is to gather a dataset of **comments** from political subreddits that will be used for exploratory data analysis (EDA), Natural Language Processing (NLP), and ultimately, model building.

This analysis is done within the **U.S. political context**, where political ideologies and classifications are based on **American political alignments**.

The data collection process includes:
1. **Accessing the Reddit API**: Using **PRAW** (Python Reddit API Wrapper) to interact with Reddit's data.
2. **Extracting posts from specific political debate subreddits**: We'll collect posts, and the comments under them, from the following subreddits: <br><br>
   - **r/AskALiberal**
   - **r/AskConservatives**
   - **r/AskPolitics**
   - **r/AskTrumpSupporters**
     <br><br>
3. **Capturing user flairs**: We will also capture **user flairs** (the labels that users set for themselves within a subreddit), which will be used to classify the **political leaning** of each comment as either: <br><br>
    - **Democrat-aligned (0)**
    - **Republican-aligned (1)**
<br><br>
4. **Data extracted**: For each collected comment, we will store the following fields:
- **comment_id**: The unique identifier for the Reddit comment.
- **post_id**: The unique identifier of the Reddit post the comment belongs to.
- **subreddit**: The subreddit from which the comment was extracted.
- **text**: The content of the comment.
- **flair**: The commenter’s user flair, which may indicate political alignment or other user-defined information.
- **political_alignment**: The target classification, either Democrat-aligned (0) or Republican-aligned (1), based on the user flair.

### Political Leaning Classification:

| **Democrat-Aligned (`0`)** | **Republican-Aligned (`1`)** |
|---------------------------|-----------------------------|
| Democrat                  | Republican                  |
| Democratic Socialist      | Conservative                |
| Far Left                  | Right-leaning               |
| Liberal                   | Neoconservative             |
| Progressive               | Paleoconservative           |
| Social Democrat           | Nationalist                 |
| Social Liberal            | Social Conservative         |
| Neoliberal                | Constitutionalist           |
| Social Democracy          | Rightwing                   |
| Leftist                   | Conservatarian              |
| Pragmatic Progressive     | Tea Party                   |
| Leftwing                  | Trump Supporter             |
| Left-leaning              | Far Right                   |

The **user flairs** will help us classify the political alignment of the posts and comments, facilitating the analysis of political ideologies in the data.
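
As a minimal sketch of how the table above turns into labels, the lookup below maps a flair string to `0` or `1` and returns `None` for flairs outside the table. The helper name `classify_flair` and the abbreviated `FLAIR_TO_LABEL` dictionary are illustrative assumptions, and the sketch assumes exact, case-sensitive matches.

```python
# Illustrative subset of the classification table (assumption: exact, case-sensitive matches)
FLAIR_TO_LABEL = {
    'Democrat': 0, 'Liberal': 0, 'Progressive': 0,
    'Republican': 1, 'Conservative': 1, 'Trump Supporter': 1,
}

def classify_flair(flair):
    """Return 0 or 1 for a mapped flair, or None if the flair is missing or unmapped."""
    if not flair:
        return None
    return FLAIR_TO_LABEL.get(flair.strip())

for flair in ['Liberal', ' Conservative ', 'Anarchist', None]:
    print(repr(flair), '->', classify_flair(flair))
```

Returning `None` rather than guessing a label keeps ambiguous flairs (for example, libertarian or centrist ones) out of the training data.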

## 2. Importing Libraries and Loading Data

In [1]:
import pandas as pd
# For API interaction
import praw
import os
import time
import random
# For efficiently getting the top N comments
import heapq

In [2]:
# Data storage as a list of dictionaries
data = []
collected_post_ids = set()  # Track unique posts

file_name = 'political_subreddit.csv'

# If the file exists, reload old data into data[]
if os.path.exists(file_name):
    print('Reloading existing data from CSV...')
    
    # Load previous data
    old_data = pd.read_csv(file_name, encoding='utf-8').to_dict(orient='records')
    
    # Append old records to in-memory data[]
    data.extend(old_data)
    
    # Add previously collected post IDs to the tracking set
    collected_post_ids.update(entry['post_id'] for entry in old_data)

    print(f'Loaded {len(old_data)} previous entries into memory.')
    print(f'Tracking {len(collected_post_ids)} unique post IDs.')


Reloading existing data from CSV...
Loaded 7621 previous entries into memory.
Tracking 2598 unique post IDs.


## 3. Reddit API Access and Data Collection Functions

### 3.1 Comment Quality

This section discusses the challenges in collecting **high-quality comments** and addressing **class imbalance**.

1. **Initial Character Length Requirement**:
The goal was to collect comments with 400+ characters to ensure detailed responses. However, it was hard to gather enough Republican-leaning comments at that length because Democrat-aligned users dominate my chosen subreddits.
2. **Class Imbalance**:
Comments from users with Republican-aligned flairs were underrepresented in my chosen subreddits, which skewed the data towards Democrat-aligned comments.
3. **Adjustment to Character Limit**:
To address this, I lowered the character requirement in two steps:
- From 400 to 350 characters.
- Then from 350 to 250 characters, so that more Republican-leaning comments were included.
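
To make the trade-off concrete, the sketch below counts how many comments per class survive each length threshold. The `samples` list is made-up illustrative data, not the collected dataset, and `counts_at_threshold` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical (label, comment_length) pairs; not real collected data
samples = [(0, 620), (0, 410), (0, 380), (0, 260),
           (1, 450), (1, 340), (1, 270), (1, 120)]

def counts_at_threshold(samples, min_chars):
    """Count comments per label whose length is at least min_chars."""
    return Counter(label for label, length in samples if length >= min_chars)

for min_chars in (400, 350, 250):
    counts = counts_at_threshold(samples, min_chars)
    print(min_chars, {label: counts[label] for label in (0, 1)})
```

Running the same count on the real data for each candidate threshold would show how much each step narrows the gap between the two classes.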

In [None]:
# Reddit API Credentials
reddit = praw.Reddit(
    client_id='',
    client_secret='',
    user_agent='',
    username='',
    password=''
)
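
The credentials above are left blank. One common way to keep them out of the notebook is to read them from environment variables; the sketch below assumes hypothetical variable names (`REDDIT_CLIENT_ID` and so on) and a hypothetical helper `load_reddit_credentials` that returns keyword arguments matching the `praw.Reddit(...)` call above.

```python
import os

# Hypothetical variable names; any naming scheme works as long as it is consistent
REQUIRED_VARS = ('REDDIT_CLIENT_ID', 'REDDIT_CLIENT_SECRET', 'REDDIT_USER_AGENT',
                 'REDDIT_USERNAME', 'REDDIT_PASSWORD')

def load_reddit_credentials(env=os.environ):
    """Build the keyword arguments for praw.Reddit from environment variables."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f'Missing environment variables: {", ".join(missing)}')
    # 'REDDIT_CLIENT_ID' -> 'client_id', and so on
    return {name[len('REDDIT_'):].lower(): env[name] for name in REQUIRED_VARS}

# Usage (requires praw and real credentials):
# reddit = praw.Reddit(**load_reddit_credentials())
```

Failing early with the list of missing variables is easier to debug than an authentication error raised on the first API request.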

In [4]:
# Define subreddits to scrape
subreddits = ['AskALiberal', 'AskConservatives', 'AskPolitics', 'AskTrumpSupporters']

# Define flair-to-political leaning mapping
FLAIR_MAPPING = {
    # Democrat-Aligned (0)
    'Democrat': 0, 'Democratic Socialist': 0, 'Far Left': 0, 'Liberal': 0,
    'Progressive': 0, 'Social Democrat': 0, 'Social Liberal': 0, 'Neoliberal': 0, 'Social Democracy': 0,
    'Leftist': 0, 'Pragmatic Progressive': 0, 'Leftwing': 0, 'Left-leaning': 0,


    # Republican-Aligned (1)
    'Republican': 1, 'Conservative': 1, 'Right-leaning': 1, 'Neoconservative': 1,'Paleoconservative': 1,
    'Nationalist': 1, 'Social Conservative': 1, 'Constitutionalist': 1, 'Rightwing': 1, 'Conservatarian': 1,
    'Tea Party': 1, 'Trump Supporter': 1, 'Far Right': 1
}

In [5]:
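def fetch_posts(generator, retries=3):
    """Materialize a listing into a list, retrying with exponential backoff on errors."""
    attempt = 0
    while attempt < retries:
        try:
            # Note: items yielded before a failure are discarded, not recovered on retry
            return list(generator)  # Convert generator to list
        except Exception as e:
            print(f"Error fetching posts (attempt {attempt+1}): {e}")
            time.sleep(5 * (2 ** attempt))  # Exponential backoff: 5s, 10s, 20s
            attempt += 1
    return []  # Return empty list if all retries fail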
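
A possible variation on the retry logic, sketched under assumptions: instead of passing an already-created iterator, pass a zero-argument callable so that every attempt starts from a fresh iterable. The names `fetch_with_retry`, `make_iterable`, and `flaky_listing` are hypothetical, and the `sleep` parameter exists only so the example can run without waiting.

```python
import time

def fetch_with_retry(make_iterable, retries=3, base_delay=5, sleep=time.sleep):
    """Call make_iterable() and materialize it, backing off exponentially on failure."""
    for attempt in range(retries):
        try:
            return list(make_iterable())  # Fresh iterable on every attempt
        except Exception as e:
            print(f'Attempt {attempt + 1} failed: {e}')
            sleep(base_delay * (2 ** attempt))  # 5s, 10s, 20s with the defaults
    return []

# Toy source that fails once, then succeeds (stands in for an API listing)
calls = {'n': 0}

def flaky_listing():
    calls['n'] += 1
    if calls['n'] == 1:
        raise ConnectionError('temporary failure')
    return iter(['post_a', 'post_b'])

delays = []
result = fetch_with_retry(flaky_listing, sleep=delays.append)  # Record delays instead of sleeping
print(result, delays)
```

With PRAW this could be called as `fetch_with_retry(lambda: subreddit.top(time_filter='all', limit=10))`, so a failed attempt never reuses a partially consumed listing.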

In [6]:
def flatten_comments(comments, max_depth=1):
    """Recursively collect comments and their replies, descending at most max_depth reply levels."""
    all_comments = []
    for comment in comments:
        all_comments.append(comment)  # Add the comment at the current level
        if max_depth > 0:  # Only go deeper if within depth limit
            comment.replies.replace_more(limit=0)  # Drop unexpanded "load more" placeholders
            # Iterate direct replies only; replies.list() would flatten the whole
            # reply tree and bypass the depth limit
            all_comments.extend(flatten_comments(comment.replies, max_depth=max_depth - 1))
    return all_comments
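
The effect of the depth limit is easiest to see on a toy tree. The sketch below uses a hypothetical `Node` class in place of PRAW comments, with `replies` holding direct children only. That distinction matters with PRAW, where `replies.list()` returns a flattened list of the entire reply tree, so a depth limit is only meaningful when direct children are iterated.

```python
class Node:
    """Hypothetical stand-in for a comment; `replies` holds direct children only."""
    def __init__(self, name, replies=()):
        self.name = name
        self.replies = list(replies)

def flatten(nodes, max_depth=1):
    """Collect nodes plus their replies, descending at most max_depth reply levels."""
    collected = []
    for node in nodes:
        collected.append(node)
        if max_depth > 0:
            collected.extend(flatten(node.replies, max_depth - 1))
    return collected

tree = [Node('a', [Node('a1', [Node('a1x')]), Node('a2')]), Node('b')]
print([n.name for n in flatten(tree, max_depth=0)])  # ['a', 'b']
print([n.name for n in flatten(tree, max_depth=1)])  # ['a', 'a1', 'a2', 'b']
print([n.name for n in flatten(tree, max_depth=2)])  # ['a', 'a1', 'a1x', 'a2', 'b']
```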

In [7]:
# Function to scrape subreddit posts
def scrape_subreddit(subreddit_name, political_leaning, target_count=1500):
    total_count = 0
    leaning_dict = {0 : 'Democrat-Aligned', 1 : 'Republican-Aligned'}
    subreddit = reddit.subreddit(subreddit_name)
    print(f"Scraping: {subreddit_name} for {leaning_dict[political_leaning]} entries...")

    # Search categories (PRAW listings are lazy, so nothing is requested until fetch_posts runs)
    search_methods = [  # Change limit to get more posts
        ('top_all', subreddit.top(time_filter='all', limit=10)),
        # Fall back to these if there are not enough posts in 'top'
        ('controversial_all', subreddit.controversial(time_filter='all', limit=10)),
        ('hot', subreddit.hot(limit=10)),
        ('new', subreddit.new(limit=10))
    ]

    for source_name, listing in search_methods:
        if total_count >= target_count:
            break  # Target reached, so skip the remaining sources
        print(f'Fetching from: {source_name}...\n ')
        posts = fetch_posts(listing)

        if not posts:
            print(f'No posts found in {source_name}, skipping...\n')
            continue

        for post in posts:
            if total_count >= target_count:
                break
            if post.id in collected_post_ids:
                continue
            total_count += process_post(post, political_leaning)
            print(f'\nTotal Comments Collected: {total_count}')
            print('\n' + ('---' * 35) + '\n')
            time.sleep(random.uniform(0.3, 0.7))

    print(f'\nFinished collecting {total_count}/{target_count} entries from {subreddit_name}')

In [8]:
def process_post(post, political_leaning):
    """Save qualifying comments from one post to data[] and return how many were saved."""
    comment_count = 0
    leaning_dict = {0 : 'Democrat-Aligned', 1 : 'Republican-Aligned'}
    
    # Save post_id
    collected_post_ids.add(post.id)

    print(f"Post: {post.title}\n{post.selftext}")
    print('(End of Post)')
    print('\nChecking comments...\n')
    post.comments.replace_more(limit=1)

    
    all_comments = flatten_comments(post.comments, max_depth=1)

    top_comments = heapq.nlargest(30, all_comments, key=lambda c: c.score)

    for comment in top_comments:
        comment_flair = comment.author_flair_text.strip() if comment.author_flair_text else None
        comment_text = comment.body.strip()

        if not comment_flair:
            print(f'Skipping comment {comment.id} - No user flair.')
            continue
        if comment_flair not in FLAIR_MAPPING:
            print(f'Skipping comment {comment.id} - Flair: {comment_flair} not in mapping.')
            continue
        if FLAIR_MAPPING[comment_flair] != political_leaning:
            print(f'Skipping comment {comment.id} - Flair: {comment_flair} is not {leaning_dict[political_leaning]}.')
            continue
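        # Minimum length filter. This run uses 500 characters; Section 3.1 describes the other thresholds tried (400, 350, 250)
        if len(comment_text) < 500: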
            print(f'Skipping comment {comment.id} - Text too short ({len(comment_text)} chars).')
            continue

        # Save comment
        data.append({
            'comment_id': comment.id,
            'post_id': post.id,
            'subreddit': post.subreddit.display_name,
            'text': comment_text,
            'flair': comment_flair,
            'political_alignment': FLAIR_MAPPING[comment_flair]
        })

        comment_count += 1
        print(f'Collected comment: {comment.id} - User Flair: {comment_flair} - Length: {len(comment_text)} chars')  
        time.sleep(random.uniform(0.3, 0.7))
        
    return comment_count

## 4. Collecting Democrat-Aligned Posts

In [9]:
# Run the scraper for each subreddit
scrape_subreddit('AskALiberal', 0)

Scraping: AskALiberal for Democrat-Aligned entries...
Fetching from: top_all...
 
Fetching from: controversial_all...
 
Fetching from: hot...
 
Post: AskALiberal Biweekly General Chat
This Tuesday weekly thread is for general chat, whether you want to talk  politics or not, anything goes. Also feel free to ask the mods questions  below. As usual, please follow the rules.
(End of Post)

Checking comments...

Skipping comment mii1jmp - Text too short (85 chars).
Skipping comment mii2jxk - Text too short (230 chars).
Skipping comment miiam1f - Text too short (83 chars).
Skipping comment migopmw - Flair: Market Socialist not in mapping.
Skipping comment migrfy3 - Flair: Anarchist not in mapping.
Skipping comment mifxi7h - Text too short (239 chars).
Collected comment: mifnrff - User Flair: Pragmatic Progressive - Length: 553 chars
Skipping comment mifri1o - Text too short (275 chars).
Skipping comment mig1fia - Text too short (86 chars).
Skipping comment mig15kh - Text too short (80 chars)

In [10]:
df = pd.DataFrame(data)
df.shape

(7678, 6)

In [11]:
df.to_csv('political_subreddit.csv', index=False, encoding='utf-8')

In [12]:
scrape_subreddit('AskConservatives', 0)

Scraping: AskConservatives for Democrat-Aligned entries...
Fetching from: top_all...
 
Fetching from: controversial_all...
 
Post: So are Tim Pool, Dave Rubin, Lauren Southern, etc all traitorous scum or are they just “useful idiots” that spread Russian propaganda? 
i’m sorry but does this not concern anyone that this is happening with some of the biggest creators in the space? 

absolutely insane 
(End of Post)

Checking comments...

Skipping comment llxwhhe - Flair: Right Libertarian not in mapping.
Skipping comment lly22db - Text too short (27 chars).
Skipping comment lly0iiv - Flair: Left Libertarian not in mapping.
Skipping comment lly1r2f - Flair: European Liberal/Left not in mapping.
Skipping comment llxxag1 - No user flair.
Skipping comment llxv702 - Flair: Center-left not in mapping.
Skipping comment llxv0l9 - Flair: Right Libertarian not in mapping.
Skipping comment llxyggx - Flair: Conservative is not Democrat-Aligned.
Skipping comment llxvfm9 - No user flair.
Skipping comme

In [13]:
df = pd.DataFrame(data)
df.shape

(7684, 6)

In [14]:
df.to_csv('political_subreddit.csv', index=False, encoding='utf-8')

In [15]:
scrape_subreddit('AskPolitics', 0)

Scraping: AskPolitics for Democrat-Aligned entries...
Fetching from: top_all...
 
Post: Have the Trump supporters around you gotten quiet?
Mine have suddenly lost interest in discussing politics. Or egg prices. Or wars. As the inauguration nears they’ve pretty much gone silent and deep. We got one day of “God gave us Trump back!” then nothing. Especially as the cabinet nominees have been announced. 
(End of Post)

Checking comments...

Skipping comment m2lfhmh - No user flair.
Skipping comment m2li8ox - Text too short (289 chars).
Skipping comment m2lk9zk - Flair: Left but not crazy-left not in mapping.
Skipping comment m2lkgt8 - Text too short (51 chars).
Skipping comment m2ltk9s - No user flair.
Skipping comment m2lql26 - No user flair.
Skipping comment m2luaa3 - No user flair.
Skipping comment m2llew4 - Text too short (212 chars).
Skipping comment m2ll008 - No user flair.
Skipping comment m2mwuvl - No user flair.
Skipping comment m2m25ia - No user flair.
Skipping comment m2lkt4l - N

In [16]:
df = pd.DataFrame(data)
df.shape

(7694, 6)

In [17]:
df.to_csv('political_subreddit.csv', index=False, encoding='utf-8')

## 5. Collecting Republican-Aligned Posts

In [18]:
scrape_subreddit('AskTrumpSupporters', 1)

Scraping: AskTrumpSupporters for Republican-Aligned entries...
Fetching from: top_all...
 
Post: Trump claimed today that Corker was "set up" by the New York Times when they allegedly taped his interview without his knowledge; the Times immediately released the recording in which Corker not only acknowledged the recording but requested it. Is Trump guilty of spreading fake news?
[Source where the NYT](https://www.nytimes.com/2017/10/10/reader-center/trump-claims-we-tricked-bob-corker-heres-the-truth.html?_r=0) debunked Trump's claim:

>Far from being set up, Mr. Corker asked that I tape our conversation.

>“I know they’re recording it, and I hope you are, too,” he said as two of his aides listened in on other lines, one of them also taping the interview.

>As with most on-the-record discussions with an elected official, I was recording our conversation to ensure accuracy.

>And after Mr. Corker got off the phone, his two aides made sure I had recorded the call. Like the senator, they w

In [19]:
df = pd.DataFrame(data)
df.shape

(7698, 6)

In [20]:
df.to_csv('political_subreddit.csv', index=False, encoding='utf-8')

In [21]:
scrape_subreddit('AskConservatives', 1)

Scraping: AskConservatives for Republican-Aligned entries...
Fetching from: top_all...
 
Fetching from: controversial_all...
 
Fetching from: hot...
 
Fetching from: new...
 

Finished collecting 0/1500 entries from AskConservatives
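
One likely contributor to a `0/1500` result here: `process_post` adds every visited post to `collected_post_ids` regardless of leaning, and `scrape_subreddit` skips any post already in that set. A subreddit already scraped for Democrat-aligned comments therefore has its posts skipped on the Republican-aligned pass. A minimal sketch of one alternative, tracking `(post_id, leaning)` pairs instead, is shown below; `processed` and `should_process` are hypothetical names, not part of the notebook.

```python
# Sketch: track (post_id, leaning) pairs so a post scraped for one label
# can still be visited for the other. All names here are hypothetical.
processed = set()

def should_process(post_id, political_leaning):
    """Return True the first time a (post, leaning) pair is seen, and record it."""
    key = (post_id, political_leaning)
    if key in processed:
        return False
    processed.add(key)
    return True

print(should_process('abc123', 0))  # True: first Democrat-aligned pass
print(should_process('abc123', 1))  # True: same post, Republican-aligned pass
print(should_process('abc123', 0))  # False: repeat of the first pass
```

When reloading from the CSV, the pairs could be rebuilt from the stored `post_id` and `political_alignment` columns, although that would only cover posts that yielded at least one saved comment.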


In [22]:
df = pd.DataFrame(data)
df.shape

(7698, 6)

In [23]:
df.to_csv('political_subreddit.csv', index=False, encoding='utf-8')

In [24]:
scrape_subreddit('AskPolitics', 1)

Scraping: AskPolitics for Republican-Aligned entries...
Fetching from: top_all...
 
Fetching from: controversial_all...
 
Fetching from: hot...
 
Fetching from: new...
 

Finished collecting 0/1500 entries from AskPolitics


In [25]:
df = pd.DataFrame(data)
df.shape

(7698, 6)

In [26]:
df.to_csv('political_subreddit.csv', index=False, encoding='utf-8')