# Reddit Post and Comment Scraper

This Jupyter Notebook is designed to scrape posts and comments from a specified subreddit using the Python Reddit API Wrapper (PRAW). We will be scraping the post title, post ID, number of upvotes, tags, and post content for each post, as well as the commenter's name, comment body, and number of upvotes for each comment.

## Prerequisites

Before we begin, please make sure you have the following:

1. A Reddit account
2. A Reddit "app" created for API access (https://www.reddit.com/prefs/apps)
3. Your `REDDIT_CLIENT_ID` and `REDDIT_CLIENT_SECRET` from the Reddit app
4. The `praw` and `tqdm` libraries installed (`pip install praw tqdm`)

## Install and Import Libraries

First, let's install and import the required libraries.

Importing a library in Python means loading and making available a set of pre-written code or modules that can be used to perform specific tasks. When you import a library, you gain access to a set of functions and classes that can be used in your own code. This allows you to take advantage of existing code instead of having to write everything from scratch.

In this case, we're using the Python Reddit API Wrapper (PRAW) to interact with the Reddit API more easily, and `tqdm` to show progress bars while scraping.

In [None]:
%pip install praw tqdm

In [7]:
import os
import praw
import csv
import json
from typing import Dict, List
from tqdm import tqdm

## Create Reddit API Instance

Now, let's create a function to initialize the Reddit API instance using the `REDDIT_CLIENT_ID` and `REDDIT_CLIENT_SECRET`.

In [8]:
def create_reddit_instance():
    return praw.Reddit(
        client_id=os.environ["REDDIT_CLIENT_ID"],
        client_secret=os.environ["REDDIT_CLIENT_SECRET"],
        user_agent="reddit-scraper",
    )

To use a phone-call metaphor, think of `praw.Reddit` as a phonebook that holds the number we need to call the Reddit API. The `create_reddit_instance()` function looks up that number and returns a line we can use to reach it.

To set up the line, the function needs three pieces of information: `client_id`, `client_secret`, and `user_agent`. They work like the caller ID, password, and name you give when placing a call: they tell the API who you are and which application is calling, so that it can verify your identity and grant you access to its data. The function reads the first two from the environment variables `REDDIT_CLIENT_ID` and `REDDIT_CLIENT_SECRET`, so both must be set before the cell runs; otherwise `os.environ[...]` raises a `KeyError`. Because no username or password is supplied, the resulting instance is read-only, which is all a scraper needs.
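Before creating the instance, it can help to confirm that both credentials are actually present. The sketch below is one possible check, not part of PRAW: `missing_credentials` and `REQUIRED_VARS` are hypothetical names introduced here for illustration.

```python
import os
from typing import List, Mapping

# The two environment variables that create_reddit_instance() reads.
REQUIRED_VARS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")


def missing_credentials(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


# In a notebook you could set the values for the current session only, e.g.:
# os.environ["REDDIT_CLIENT_ID"] = "<your client id>"
# os.environ["REDDIT_CLIENT_SECRET"] = "<your client secret>"

missing = missing_credentials()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("Both credentials are set.")
```

Running this cell first gives a readable message instead of a `KeyError` from inside `create_reddit_instance()`.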

## Scrape Posts

Next, we'll create a function to scrape a specified number of posts from a subreddit.

In [9]:
def scrape_posts(reddit, subreddit_name: str, num_posts: int) -> List[Dict]:
    print("Scraping posts...")
    subreddit = reddit.subreddit(subreddit_name)
    posts_data = []

    for post in tqdm(subreddit.hot(limit=num_posts), total=num_posts):
        post_data = {
            "post_title": post.title,
            "post_id": post.id,
            "num_upvotes": post.score,
            "tags": post.link_flair_text,
            "post_content": post.selftext,
        }
        posts_data.append(post_data)

    return posts_data

1. The function takes three arguments: reddit (an instance of the Reddit API), subreddit_name (a string representing the name of the subreddit to scrape), and num_posts (an integer representing the number of posts to scrape).

2. The function first prints a message indicating that it is about to start scraping posts.

3. The function then uses the subreddit_name argument to get a reference to the desired subreddit using the reddit.subreddit() method.

4. The function initializes an empty list called posts_data to hold the scraped post data.

5. The function then loops through the num_posts hottest posts on the subreddit (using the subreddit.hot() method) and for each post:

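	a. The function extracts several pieces of information (the post title, post ID, score, flair text, and post content) and stores them in a dictionary called post_data. Note that post.score is the post's net score rather than a raw upvote count, post.link_flair_text is None when a post has no flair, and post.selftext is an empty string for link posts.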
	
	b. The post_data dictionary is then appended to the posts_data list.

6. Finally, the function returns the posts_data list, which contains dictionaries representing the scraped post data.
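To see what one entry of posts_data looks like without calling the API, the sketch below builds the same dictionary from a stand-in object. `post_to_record` and `fake_post` are hypothetical names, and the field values are made up for illustration.

```python
from types import SimpleNamespace
from typing import Dict


def post_to_record(post) -> Dict:
    """Build the same dictionary that scrape_posts() creates for one post."""
    return {
        "post_title": post.title,
        "post_id": post.id,
        "num_upvotes": post.score,
        "tags": post.link_flair_text,
        "post_content": post.selftext,
    }


# Stand-in for a PRAW submission; a real one comes from subreddit.hot().
fake_post = SimpleNamespace(
    title="How do I read a CSV file?",
    id="abc123",
    score=42,
    link_flair_text=None,  # a post without flair
    selftext="",           # a link post has no body text
)

record = post_to_record(fake_post)
print(record)
```

The `None` flair is worth noticing: when the record is later written with `csv.DictWriter`, it ends up as an empty cell.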

## Scrape Comments

Now, let's create a function to scrape a specified number of comments from a list of Reddit posts.

In [10]:
def scrape_comments(reddit, post_ids: List[str], num_comments: int) -> List[Dict]:
    print("Scraping comments...")
    comments_data = []

    for post_id in tqdm(post_ids, desc="Posts"):
        post = reddit.submission(id=post_id)

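        # Expand every "load more comments" placeholder; this can be slow on large threads.
        post.comments.replace_more(limit=None)
        # A post may have fewer than num_comments comments, so slice first and let tqdm size itself.
        comments = post.comments.list()[:num_comments]
        for comment in tqdm(comments, desc="Comments"):
            comment_data = {
                "post_title": post.title,
                "post_id": post_id,
                # comment.author is None when the commenter's account was deleted.
                "commenter_name": comment.author.name if comment.author else "[deleted]",
                "comment_body": comment.body,
                "num_upvotes": comment.score,
            }
            comments_data.append(comment_data)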

    return comments_data

1. Define a function called scrape_comments() that takes three arguments: a Reddit instance (reddit), a list of post IDs (post_ids), and the number of comments to scrape per post (num_comments).

2. Print the message "Scraping comments..." to the console to indicate that the function has started.

3. Create an empty list called comments_data to store the scraped comment data.

4. For each post ID in the list of post IDs:
	
	* Use the reddit.submission(id=post_id) method to get the post object corresponding to the current post ID.
	
	* Use the post.comments.replace_more(limit=None) method to replace every "load more comments" placeholder with the actual comments, so that the full comment tree is loaded. This can take a while on large threads.
	
	* Flatten the comment tree with post.comments.list() and, for each of the first num_comments comments in that list:
		
		* Create a dictionary called comment_data with the following keys and corresponding values:
		
			* "post_title": the title of the post that the comment was made on
			
			* "post_id": the ID of the post that the comment was made on
			
			* "commenter_name": the username of the commenter
			
			* "comment_body": the text content of the comment
			
			* "num_upvotes": the score of the comment
		
		* Append the comment_data dictionary to the comments_data list.

5. Return the comments_data list containing all of the scraped comment data.
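One edge case deserves attention when reading the commenter's name: PRAW sets comment.author to None when the commenter's account has been deleted, so reading comment.author.name directly would raise an AttributeError. A minimal sketch of a guard follows; `author_name` is a hypothetical helper, and the stand-in objects below only imitate the two attributes used here.

```python
from types import SimpleNamespace


def author_name(comment, placeholder: str = "[deleted]") -> str:
    """Return the commenter's username, or a placeholder if the author is missing."""
    author = getattr(comment, "author", None)
    return author.name if author is not None else placeholder


# Stand-ins for PRAW comments: one with an author, one from a deleted account.
live = SimpleNamespace(author=SimpleNamespace(name="some_user"), body="Thanks!")
gone = SimpleNamespace(author=None, body="[removed]")

print(author_name(live))
print(author_name(gone))
```

Using a placeholder keeps the row in the output, which preserves the comment body and score even when the username is no longer available.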

## Export Data to CSV

Next, we'll create a function to export the scraped data to CSV files.

In [11]:
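def export_to_csv(data: List[Dict], filename: str):
    # Nothing to write; data[0] below would raise IndexError on an empty list.
    if not data:
        print(f"No data to export to {filename}.")
        return
    print(f"Exporting data to {filename}...")
    with open(filename, "w", newline="", encoding="utf-8") as csvfile:
        # Column names come from the first row; every row is expected to share them.
        writer = csv.DictWriter(csvfile, fieldnames=data[0].keys())
        writer.writeheader()
        for row in data:
            writer.writerow(row)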

Here's a breakdown of the code:

1. Define a function called export_to_csv that takes two arguments, data and filename.

2. The function prints a message to the console indicating that it is exporting data to a CSV file with the specified filename.

3. The function then uses the open function to create (or overwrite) a file with the specified filename in write mode, encoded as UTF-8. The newline parameter is set to "" so that the csv module controls line endings itself; without it, extra blank lines can appear between rows on Windows.

4. The csv.DictWriter class is used to create a writer object that writes rows to the CSV file. The fieldnames argument is set to the keys of the first dictionary in the data list, so the columns follow that dictionary and every other dictionary is expected to have the same keys. This step assumes that data is not empty.

5. The writeheader method is called on the writer object to write the header row to the CSV file.

6. A loop is then used to iterate through each dictionary in the data list.

7. The writerow method is called on the writer object for each dictionary in the data list, writing a new row to the CSV file.

8. Finally, the function does not return a value (it implicitly returns None); its result is the file written to disk.
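To see what the writer produces without touching the disk, the sketch below writes to an in-memory buffer instead of a file. `rows_to_csv_text` is a hypothetical helper that mirrors the steps above, and the sample rows are made up.

```python
import csv
import io
from typing import Dict, List


def rows_to_csv_text(data: List[Dict]) -> str:
    """Write rows to an in-memory buffer the same way export_to_csv() writes to a file."""
    if not data:
        return ""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=list(data[0].keys()))
    writer.writeheader()
    for row in data:
        writer.writerow(row)
    return buffer.getvalue()


sample = [
    # None becomes an empty cell; a newline inside a value forces quoting.
    {"post_id": "abc123", "tags": None, "post_content": "line one\nline two"},
    # Embedded double quotes are escaped by doubling them.
    {"post_id": "def456", "tags": "Help", "post_content": 'He said, "hi"'},
]

text = rows_to_csv_text(sample)
print(text)

# Reading the text back recovers the original strings.
rows = list(csv.DictReader(io.StringIO(text, newline="")))
print(rows[0]["post_content"])
```

Post bodies often contain commas, quotes, and line breaks, so it is reassuring that the csv module quotes them on the way out and restores them on the way back in.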

## Gather and Export Data

Finally, let's create a function to gather the post and comment data and export them to separate CSV files.

In [15]:
def gather_and_export_data(subreddit: str, num_posts: int, num_comments: int):
    reddit = create_reddit_instance()
    
    posts = scrape_posts(reddit, subreddit, num_posts)
    comments = scrape_comments(reddit, [post["post_id"] for post in posts], num_comments)

    print(json.dumps(posts, indent=4))
    
    export_to_csv(posts, "posts.csv")
    export_to_csv(comments, "comments.csv")

This code defines a function called gather_and_export_data() that takes three arguments: subreddit (a string), num_posts (an integer), and num_comments (also an integer). Here's what the function does:

1. It creates a Reddit API instance using the create_reddit_instance() function.

2. It scrapes posts using the scrape_posts() function and saves the data to a variable called posts.

3. It extracts the IDs of the posts from the posts variable and passes them to the scrape_comments() function, which scrapes the comments for each post and saves the data to a variable called comments.

4. It prints the scraped posts as indented JSON using json.dumps(), so you can inspect them directly in the notebook.

5. It exports the post data to a CSV file called "posts.csv" using the export_to_csv() function.

6. It exports the comment data to a CSV file called "comments.csv" using the export_to_csv() function.

In summary, this function scrapes data from a subreddit, gathers post and comment data, and exports the data to CSV files for further analysis.

## Usage

To use the scraper, simply call the `gather_and_export_data` function with the desired subreddit name, number of posts to scrape, and number of comments to scrape per post.

### Remember to set `REDDIT_CLIENT_ID` and `REDDIT_CLIENT_SECRET` before running the next cell!

For example:

In [None]:
gather_and_export_data("learnpython", 10, 5)

This will scrape up to 10 posts and up to 5 comments for each post from the "learnpython" subreddit and save the data in "posts.csv" and "comments.csv" in the notebook's working directory.
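Once the files exist, a quick sanity check is to count how many comment rows were saved for each post. The sketch below is one way to do that with the standard library; `comments_per_post` is a hypothetical helper, and the demo writes a small made-up file to a temporary directory so that the cell runs without scraping anything.

```python
import csv
import os
import tempfile
from collections import Counter


def comments_per_post(csv_path: str) -> Counter:
    """Count how many comment rows each post_id has in a comments CSV."""
    with open(csv_path, newline="", encoding="utf-8") as csvfile:
        return Counter(row["post_id"] for row in csv.DictReader(csvfile))


# Demo data only; real rows would come from gather_and_export_data().
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "comments.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["post_id", "commenter_name", "comment_body"])
        writer.writeheader()
        writer.writerow({"post_id": "abc123", "commenter_name": "user_a", "comment_body": "First"})
        writer.writerow({"post_id": "abc123", "commenter_name": "user_b", "comment_body": "Second"})
        writer.writerow({"post_id": "def456", "commenter_name": "user_c", "comment_body": "Third"})
    counts = comments_per_post(demo_path)

print(counts)
```

With real output, calling `comments_per_post("comments.csv")` after the scraper has run should show at most the requested number of comments for each post ID.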