# Reddit Post and Comment Scraper Tutorial

## Introduction

In this tutorial, you will learn how to use the Reddit API to scrape posts and comments from a subreddit. You will also learn how to export the scraped data to CSV files.

### Prerequisites

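Before we begin, you will need to install the following third-party Python packages:

* praw
* tqdm

The script also uses argparse, which ships with the Python standard library and does not need to be installed separately.

You can install the packages using pip:

In [None]:
pip install praw tqdm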

### Step 1: Create a Reddit API instance

The first step is to create a Reddit API instance using environment variables for authentication. We will define a function called create_reddit_instance() to create the instance:

In [None]:
import os
import praw

def create_reddit_instance():
    """
    Create a Reddit API instance using environment variables for authentication.
    :return: A Reddit API instance.
    """
    return praw.Reddit(
        client_id=os.environ["REDDIT_CLIENT_ID"],
        client_secret=os.environ["REDDIT_CLIENT_SECRET"],
        user_agent="reddit-scraper",
    )

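Before running this cell, set the REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET environment variables to your own Reddit API credentials. The function reads them from the environment, so the code itself does not need to be edited, and a missing variable raises a KeyError.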
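Because a missing variable only surfaces as a KeyError inside create_reddit_instance(), it can help to check the environment up front. The following is a minimal sketch; the helper name missing_credentials() and the REQUIRED_VARS constant are assumptions made for illustration and are not part of PRAW or of the original script.

```python
import os

# Names of the environment variables read by create_reddit_instance().
REQUIRED_VARS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")


def missing_credentials(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper: `env` defaults to os.environ, but any mapping
    can be passed in, which makes the check easy to try out.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Set these environment variables first:", ", ".join(missing))
```

If the list comes back empty, both credentials are present and create_reddit_instance() should not fail on the environment lookup.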

### Step 2: Scrape posts from a subreddit

Next, we will define a function called scrape_posts() to scrape posts from a subreddit using the Reddit API:

In [None]:
from typing import Dict, List

from tqdm import tqdm

def scrape_posts(reddit, subreddit_name: str, num_posts: int) -> List[Dict]:
    """
    Scrape posts from a subreddit using the Reddit API.
    :param reddit: A Reddit API instance.
    :param subreddit_name: The name of the subreddit to scrape.
    :param num_posts: The number of posts to scrape.
    :return: A list of dictionaries containing post data.
    """
    print("Scraping posts...")
    subreddit = reddit.subreddit(subreddit_name)
    posts_data = []

    for post in tqdm(subreddit.hot(limit=num_posts), total=num_posts):
        post_data = {
            "post_title": post.title,
            "post_id": post.id,
            "num_upvotes": post.score,
            "tags": post.link_flair_text,
            "post_content": post.selftext,
            "post_sentiment": analyze_sentiment(post.selftext),
        }
        posts_data.append(post_data)

    return posts_data


The function takes three arguments: the Reddit API instance (reddit), the name of the subreddit to scrape (subreddit_name), and the number of posts to scrape (num_posts).

The function first prints a message to indicate that it is scraping posts from the subreddit. It then uses the subreddit.hot() method to get the posts currently ranked as "hot" in the subreddit (not the all-time top posts), limited to the number of posts specified by num_posts.

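The function then iterates over each post and extracts the relevant data, including the post title, ID, score (stored as num_upvotes), flair text (stored as tags), content, and sentiment. The sentiment comes from analyze_sentiment(), a helper that this tutorial does not define, so it has to be available before the cell is run. Each post becomes one dictionary, and the dictionaries are collected in a list called posts_data.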

Finally, the function returns the list of post data dictionaries.
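scrape_posts() calls analyze_sentiment(), which is not shown anywhere in this tutorial. The sketch below is a hypothetical stand-in so that the cell can run: it assumes the function takes a string and returns a label, and it uses two tiny made-up word lists rather than a real sentiment model. The original helper may well work differently.

```python
import re

# Tiny illustrative word lists (assumptions); a real project would use a
# dedicated sentiment library instead.
POSITIVE_WORDS = {"good", "great", "love", "excellent", "happy"}
NEGATIVE_WORDS = {"bad", "terrible", "hate", "awful", "sad"}


def analyze_sentiment(text: str) -> str:
    """Label text as positive, negative, or neutral by counting known words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE_WORDS for w in words) - sum(
        w in NEGATIVE_WORDS for w in words
    )
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(analyze_sentiment("I love this, great post"))  # positive
print(analyze_sentiment(""))                         # neutral
```

Link posts have an empty selftext, so a stand-in like this should handle the empty string without raising.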

### Step 3: Scrape comments from posts

Next, we will define a function called scrape_comments() to scrape comments from the posts that were scraped in the previous step:

In [None]:
def scrape_comments(reddit, post_ids: List[str], num_comments: int) -> List[Dict]:
    """
    Scrape comments from a list of Reddit posts.
    :param reddit: A Reddit API instance.
    :param post_ids: A list of post IDs to scrape comments from.
    :param num_comments: The number of comments to scrape per post.
    :return: A list of dictionaries containing comment data.
    """
    print("Scraping comments...")
    comments_data = []

    for post_id in tqdm(post_ids, desc="Posts"):
        post = reddit.submission(id=post_id)

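        # Expand every "load more comments" placeholder; this can be slow on large threads.
        post.comments.replace_more(limit=None)
        # Slice first so the progress bar total matches the real number of comments.
        comments = post.comments.list()[:num_comments]
        for comment in tqdm(comments, desc="Comments"):
            comment_data = {
                "post_title": post.title,
                "post_id": post_id,
                # comment.author is None when the account has been deleted.
                "commenter_name": comment.author.name if comment.author else "[deleted]",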
                "comment_body": comment.body,
                "num_upvotes": comment.score,
                "comment_sentiment": analyze_sentiment(comment.body),
            }
            comments_data.append(comment_data)

    return comments_data


The scrape_comments() function takes in the Reddit instance, a list of post IDs, and the number of comments to scrape per post. It returns a list of dictionaries containing comment data.

The function first initializes an empty list called comments_data to store the comment data. It then loops through each post ID and retrieves the submission object by calling reddit.submission(id=post_id).

Calling replace_more(limit=None) replaces every "load more comments" placeholder with the actual comments, so that nested replies are included. This can require many extra API requests on large threads. The function then flattens the comment tree with the list() method, keeps at most num_comments entries, and records each comment's author name, body, score, and sentiment. Each comment is stored as a dictionary and appended to the comments_data list.

We use the tqdm library to display a progress bar for both the posts and comments loops. This makes it easy to track the progress of the scraping process.
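Slicing the flattened list with [:num_comments] does not pick comments at random: which ones survive depends on the order in which the tree is flattened. As far as I know, PRAW's list() walks the tree level by level (breadth-first), so top-level comments come before replies. The sketch below models that behaviour with plain dictionaries; it is a simplified illustration under that assumption and not PRAW's own code.

```python
from collections import deque


def flatten_breadth_first(top_level):
    """Flatten a nested comment tree level by level.

    Each comment is modelled as {"body": str, "replies": [comments]}.
    """
    flat = []
    queue = deque(top_level)
    while queue:
        comment = queue.popleft()
        flat.append(comment["body"])
        # Replies are queued behind everything already waiting,
        # so shallower comments are always emitted first.
        queue.extend(comment["replies"])
    return flat


# Hypothetical thread: two top-level comments, one with a nested reply chain.
thread = [
    {"body": "A", "replies": [{"body": "A1", "replies": [{"body": "A1a", "replies": []}]}]},
    {"body": "B", "replies": []},
]

print(flatten_breadth_first(thread))      # ['A', 'B', 'A1', 'A1a']
print(flatten_breadth_first(thread)[:2])  # ['A', 'B']
```

Under this ordering, a small num_comments mostly yields top-level comments and drops the deeper replies.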

### Step 4: Export data to CSV

After scraping both posts and comments, we can use a function called export_data() to write the scraped data to separate CSV files. Here's an example implementation:

In [None]:
def export_data(posts: List[Dict], comments: List[Dict]):
    """
    Export post and comment data to separate CSV files.
    :param posts: A list of dictionaries containing post data.
    :param comments: A list of dictionaries containing comment data.
    """
    export_to_csv(posts, "posts.csv")
    export_to_csv(comments, "comments.csv")


The export_data() function delegates to an export_to_csv() helper, which is not shown in this tutorial, to write the scraped post and comment data to two separate CSV files.
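Since export_to_csv() does not appear in this tutorial, here is one way it might look, using csv.DictWriter from the standard library. This is a sketch under assumptions: the signature is inferred from the two calls in export_data(), and the column order is taken from the keys of the first row.

```python
import csv
from typing import Dict, List


def export_to_csv(rows: List[Dict], filename: str) -> None:
    """Write a list of dictionaries to a CSV file, one row per dictionary."""
    if not rows:
        # Nothing was scraped, so there are no column names to write.
        print(f"No data to write to {filename}.")
        return
    # newline="" prevents blank lines between rows on Windows.
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

Opening the file as UTF-8 matters here because post and comment bodies often contain emoji and other non-ASCII characters.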

### Step 5: Running it all in main

In [None]:
def main():
    """
    Main function for the Reddit post and comment scraper script.
    """
    subreddit, num_posts, num_comments = parse_args()
    reddit = create_reddit_instance()
    posts, comments = gather_data(reddit, subreddit, num_posts, num_comments)
    export_data(posts, comments)
    print("Done.")

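Calling main() writes the scraped post and comment data to CSV files named posts.csv and comments.csv, respectively. Note that main() also relies on two helpers, parse_args() and gather_data(), that are not shown in this tutorial.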
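main() unpacks three values from parse_args(), so that helper presumably returns the subreddit name and the two limits. A minimal sketch with argparse could look like the following. The argument names and default values are assumptions chosen for illustration, and the optional argv parameter is an addition that makes the function usable from a notebook.

```python
import argparse


def parse_args(argv=None):
    """Return (subreddit, num_posts, num_comments) parsed from the command line.

    When argv is None, argparse reads sys.argv[1:], which is what the
    no-argument call in main() relies on.
    """
    parser = argparse.ArgumentParser(description="Scrape Reddit posts and comments.")
    parser.add_argument("subreddit", help="Name of the subreddit, without the r/ prefix.")
    parser.add_argument("--num-posts", type=int, default=10, help="Number of posts to scrape.")
    parser.add_argument(
        "--num-comments", type=int, default=10, help="Number of comments to scrape per post."
    )
    args = parser.parse_args(argv)
    return args.subreddit, args.num_posts, args.num_comments


# Passing the arguments explicitly avoids picking up the notebook kernel's own argv.
print(parse_args(["python", "--num-posts", "5"]))  # ('python', 5, 10)
```

When the code is saved as a script instead, the same function can be called without arguments and the values supplied on the command line.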