# Scraping comments from Reddit using PRAW 

Below we demonstrate how Python's [praw](https://praw.readthedocs.io/en/latest/) library can be used to scrape comments from threads within Reddit subreddits via the Reddit API.

### Accessing the Reddit API 

In order to access the API you need a client ID and a client secret. To obtain these, first create a Reddit account and sign in, then follow the instructions under **First Steps** [here](https://github.com/reddit-archive/reddit/wiki/OAuth2-Quick-Start-Example).

Once you have obtained the required credentials, it is best to store them in a separate `.json` file rather than writing them directly into your code, so that they are not exposed when the code is shared. Every value is a quoted string, and the file structure is as follows:

```json
{"username": "YourUsername",
 "password": "YourPassword",
 "client_id": "YourClientID",
 "client_secret": "YourClientSecret"}
```

For the rest of this tutorial we assume that this information is stored in a file named `reddit_credentials.json`.
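A missing or misspelled key in the credentials file only shows up later as a `KeyError` when the `reddit` instance is created. As a minimal sketch, the file could be checked at load time. The helper name `load_credentials` and the constant `REQUIRED_KEYS` are illustrative and not part of praw:

```python
import json

# The four keys this tutorial expects in the credentials file.
REQUIRED_KEYS = {"username", "password", "client_id", "client_secret"}


def load_credentials(path="reddit_credentials.json"):
    """Load the credentials file and check that every expected key is present."""
    with open(path) as fin:
        creds = json.load(fin)

    # Report all missing keys at once rather than failing on the first one.
    missing = REQUIRED_KEYS - set(creds)
    if missing:
        raise KeyError("Missing keys in {0}: {1}".format(path, sorted(missing)))
    return creds
```

The returned `dict` can be used exactly like the result of `json.load` in the rest of the tutorial.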

### Creating a Reddit instance 

To access the API, we create an authorised `reddit` instance by passing the credentials (stored in `reddit_credentials.json`) to praw's `Reddit` class. As your credentials are stored in a `.json` file, once the file is loaded it can be treated as a Python `dict`:

```python

import praw
import json 


with open('reddit_credentials.json') as fin:
    creds = json.load(fin)
    
reddit = praw.Reddit(user_agent='Comment Extraction (by /u/{0})'.format(creds['username']),
                     client_id=creds['client_id'], client_secret=creds['client_secret'],
                     username=creds['username'], password=creds['password'])
```
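The `user_agent` string identifies your script to Reddit. praw's documentation suggests the pattern `<platform>:<app ID>:<version string> (by u/<Reddit username>)`. The sketch below builds a string in that shape; the function name `build_user_agent` and the default values for `platform`, `app_id` and `version` are placeholders chosen for illustration:

```python
def build_user_agent(username, platform="python", app_id="comment-extraction", version="0.1"):
    """Build a descriptive user agent string (all defaults are placeholder values)."""
    return "{0}:{1}:v{2} (by /u/{3})".format(platform, app_id, version, username)


print(build_user_agent("YourUsername"))
# python:comment-extraction:v0.1 (by /u/YourUsername)
```

The result could be passed as the `user_agent` argument in place of the string used above.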

### Accessing a Subreddit 

We can now create a `subreddit` instance by passing the name of the subreddit to our `reddit` instance. For this tutorial we will use the AskReddit subreddit: 

```python
subreddit = reddit.subreddit('AskReddit')
```

### Obtaining threads within a subreddit 

Our end goal is to scrape comments from threads within a subreddit. Our `subreddit` instance can be used to obtain `submissions` (i.e. threads) within the subreddit. There are various orders in which submissions can be accessed, described in the [praw documentation](https://praw.readthedocs.io/en/latest/getting_started/quick_start.html#obtain-submission-instances-from-a-subreddit). Here we will look at those which are `rising`, that is, threads that are currently attracting a lot of activity.

Let's start by looping through 10 `rising` threads in our subreddit and printing the title of each:

```python

for submission in subreddit.rising(limit=10):
    print(submission.title)
```

The title is just one piece of metadata relating to each thread that can be obtained. As we want as much metadata as possible, we can create a function to obtain all the metadata needed:

```python
def get_metadata(submission):

    thread={}
    thread["title"]=submission.title
    thread["score"]=submission.score
    thread["id"]=submission.id
    thread["url"]=submission.url
    thread["comms_num"]=submission.num_comments
    thread["created"]=submission.created
    thread["body"]=submission.selftext

    return thread

```

Our function will return a `dict` containing metadata about each thread. 
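The `created` value is stored as a number rather than a readable date. Assuming it is a Unix timestamp in seconds (as praw documents for `created_utc`), it could be converted to a readable UTC string as sketched below. The helper name `to_iso_utc` is illustrative:

```python
from datetime import datetime, timezone


def to_iso_utc(timestamp):
    """Convert a Unix timestamp (seconds since the epoch) to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()


print(to_iso_utc(0))  # 1970-01-01T00:00:00+00:00
```

Keeping the raw number in the saved file and converting only when displaying it avoids losing information.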

We can now loop through the 10 rising threads, obtain the metadata for each one and save the results to a file named `askreddit.json`:


```python
with open('askreddit.json', 'w') as fout:
    for submission in subreddit.rising(limit=10):
    
        thread = get_metadata(submission)
        json.dump(thread,fout)
        fout.write('\n')
    
```
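Because each thread is written as one JSON object per line, the file as a whole is not a single JSON document and cannot be read back with one `json.load` call. A minimal sketch for reading it line by line follows; the function name `read_threads` is illustrative:

```python
import json


def read_threads(path="askreddit.json"):
    """Read a file containing one JSON object per line into a list of dicts."""
    threads = []
    with open(path) as fin:
        for line in fin:
            line = line.strip()
            if line:  # skip any blank lines
                threads.append(json.loads(line))
    return threads
```

For example, `[t["title"] for t in read_threads()]` would give the titles of the saved threads.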

### Obtaining comments from a thread 

We are now able to loop through `submissions` (threads) within a subreddit and obtain metadata about each one. 

In order to obtain the comments from each `submission` we can use the `comments` attribute. As the structure of the comments in a subreddit thread is tree-like, where a top-level comment can be replied to many times, and each reply can be replied to, we need to traverse through the "comment tree" to obtain all comments.

Traversing through the tree is easy thanks to the `CommentForest` object in praw. For in-depth details see the [example](https://praw.readthedocs.io/en/latest/tutorials/comments.html#extracting-comments) in praw's documentation.
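To make the idea of traversing a comment tree concrete, the sketch below flattens a small hand-built tree in breadth-first order (all top-level comments first, then their replies, and so on), which is the order praw's comment tutorial describes. `FakeComment` is a hypothetical stand-in with made-up bodies, not a praw class, and no API calls are made:

```python
from collections import deque


class FakeComment:
    """Hypothetical stand-in for a comment: a body plus a list of replies."""

    def __init__(self, body, replies=None):
        self.body = body
        self.replies = replies or []


def flatten(top_level_comments):
    """Return every comment in the tree in breadth-first order."""
    flat = []
    queue = deque(top_level_comments)
    while queue:
        comment = queue.popleft()
        flat.append(comment)
        # Replies are queued behind everything already waiting at this depth.
        queue.extend(comment.replies)
    return flat


forest = [
    FakeComment("A", [FakeComment("A1", [FakeComment("A1a")]), FakeComment("A2")]),
    FakeComment("B", [FakeComment("B1")]),
]
print([c.body for c in flatten(forest)])
# ['A', 'B', 'A1', 'A2', 'B1', 'A1a']
```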

Looking at the structure of a thread within AskReddit (or any other subreddit), you will see that in order to view all comments you must click '1 more comment', '21 more comments' etc.

An important point to note is that, to obtain *all* comments in a thread, we need to perform a similar action. In praw, this can be achieved using the `replace_more` method, which is part of `CommentForest`.

Specifically, to obtain all comments we can call `replace_more` on the `comments` attribute of each `submission` with a limit of `None`. This only needs to be done once, after which a list of all comments can be looped over and the comment data can be obtained:


```python
def get_comments(submission):

    comments=[]

    submission.comments.replace_more(limit=None)
    for comment in submission.comments.list():
        data={}
        data['body']=comment.body
        data['id']=comment.fullname

        try:
            data['author']=comment.author.name

        except AttributeError:
            data['author']=comment.author

        data['time']=comment.created_utc
        data['parent']=comment.parent_id

        comments.append(data)

    return comments
```
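The flat list returned by `get_comments` still carries the tree structure through the `parent` field. Reddit prefixes comment IDs with `t1_` and submission IDs with `t3_`, so a top-level comment has a parent starting with `t3_`. As a sketch, the replies can be grouped by parent as follows; the function name `group_by_parent` and the sample IDs are made up for illustration:

```python
from collections import defaultdict


def group_by_parent(comments):
    """Map each parent ID to the list of IDs of its direct replies."""
    children = defaultdict(list)
    for comment in comments:
        children[comment["parent"]].append(comment["id"])
    return dict(children)


# Made-up sample in the same shape as the dicts built by get_comments.
sample = [
    {"id": "t1_aaa", "parent": "t3_xyz"},  # top-level comment on the submission
    {"id": "t1_bbb", "parent": "t1_aaa"},  # reply to t1_aaa
    {"id": "t1_ccc", "parent": "t1_aaa"},  # another reply to t1_aaa
]
print(group_by_parent(sample))
# {'t3_xyz': ['t1_aaa'], 't1_aaa': ['t1_bbb', 't1_ccc']}
```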

### Putting it together 

Calling the above function `get_comments` from within the `get_metadata` function, we can now obtain data for all comments in a subreddit thread for all threads in a subreddit!

```python
import praw
import json

def get_comments(submission):

    comments=[]

    submission.comments.replace_more(limit=None)
    for comment in submission.comments.list():
        data={}
        data['body']=comment.body
        data['id']=comment.fullname

        try:
            data['author']=comment.author.name

        except AttributeError:
            data['author']=comment.author

        data['time']=comment.created_utc
        data['parent']=comment.parent_id

        comments.append(data)

    return comments


def get_metadata(submission):

    thread={}
    thread["title"]=submission.title
    thread["score"]=submission.score
    thread["id"]=submission.id
    thread["url"]=submission.url
    thread["comms_num"]=submission.num_comments
    thread["created"]=submission.created
    thread["body"]=submission.selftext

    thread['posts']= get_comments(submission)

    return thread


with open('reddit_credentials.json') as fin:
    creds = json.load(fin)


reddit = praw.Reddit(user_agent='Comment Extraction (by /u/{0})'.format(creds['username']),
                     client_id=creds['client_id'], client_secret=creds['client_secret'],
                     username=creds['username'], password=creds['password'])


subreddit = reddit.subreddit('AskReddit')

with open('askreddit.json', 'w') as fout:

    for submission in subreddit.rising(limit=10):

        thread = get_metadata(submission)
        json.dump(thread,fout)
        fout.write('\n')
```

### Finally.... 

If you would like to speed up this code, try the tutorial [Introduction to threading in Python](Introduction%20to%20threading%20in%20Python.ipynb).