# Accessing more data from Pushshift

An important point about accessing data through any kind of API is that the speed of access is always limited: you will never download data as fast as your Internet connection allows. The limits set by the server are usually called rate limits. They serve one purpose only: to defend the service. Shared services (in this case the `Pushshift` API) need to protect themselves from excessive use, whether intended or unintended, to maintain service availability. Even highly scalable systems should have limits on consumption at some level. For the system to perform well, clients must also be designed with rate limiting in mind to reduce the chances of cascading failure. Rate limiting on both the client side and the server side is crucial for maximizing throughput and minimizing end-to-end latency across large distributed systems ([source](https://cloud.google.com/solutions/rate-limiting-strategies-techniques)).

Normally, the best way to learn about rate limits is to ask the API about them. With `Pushshift`, however, this is far from ideal. In theory, there is a `meta` endpoint that should tell you how many requests per minute you can send: [https://api.pushshift.io/meta](https://api.pushshift.io/meta). However, if you search for the answer on Reddit, you will learn that what it returns is not always accurate. Therefore, I wouldn't send more than one request per second (in theory it could be 2).
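
One way to stay under one request per second on the client side is a small throttle that is called right before every request. This is only a minimal sketch: the `Throttle` class and its `min_interval` parameter are hypothetical names introduced for illustration, not part of `Pushshift` or `requests`.

```python
import time


class Throttle:
    """Hypothetical helper: keeps calls at least `min_interval` seconds apart."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last_call = None  # monotonic timestamp of the previous call

    def wait(self):
        """Sleep just long enough to respect the minimum interval."""
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


# Usage: call throttle.wait() right before every request.
throttle = Throttle(min_interval=1.0)
throttle.wait()  # the first call returns immediately
```

Sleeping only for the remaining time, instead of always sleeping for a full second, means that slow responses do not slow the collection down even further.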

OK, but what happens if we send too many requests per second? It is quite simple: instead of the status code `200` we get `429` (or, in the case of `Pushshift`, `502` or `523`). This is the server's way of saying that we sent too many requests in the given period and have to wait ([list of possible status codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status)). In extreme situations, the server might block our IP, but that happens only rarely, when we break the rules repeatedly.
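
The decision described above can be sketched as a tiny function that sorts status codes into three groups. The function name `classify_status` and the labels it returns are assumptions made for this illustration only.

```python
RETRY_CODES = {429, 502, 523}  # the "wait and try again" codes mentioned above


def classify_status(status_code):
    """Return 'ok', 'retry' or 'error' for an HTTP status code."""
    if status_code == 200:
        return 'ok'
    if status_code in RETRY_CODES:
        return 'retry'
    return 'error'


for code in (200, 429, 502, 523, 404):
    print(code, classify_status(code))
```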

## Submissions

In practice, downloading more data from `Pushshift` is much easier than it might seem, especially for submissions. Comments are a bit trickier but still relatively easy. It takes more or less five steps:

### Step 1

As always, we need to load the modules first.

In [None]:
## Load needed modules
import requests as rq
import json
import time
import sys ## This is new, but we will actually use it only for printing
import tqdm ## This is for a progress bar

### Step 2

You don't have to do this, but I find it easier to define custom functions because otherwise Python would execute the same code repeatedly. As a rule of thumb, you should create a function if you are going to execute the same code more than twice. It is good practice to define custom functions either in a separate script or at the beginning of the script.

I guess the functions below are self-explanatory. The important point is to always spend time describing what the function does. In `python`, you typically do it the way I did (in a docstring), because then typing `help(<name of the function>)` prints it nicely.
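
As a small illustration of how a docstring and `help()` work together, here is a toy function; `add_one` is a made-up example and is not used anywhere else in this notebook.

```python
def add_one(number):
    """
    It takes a number and returns that number increased by one.

    Parameters:
    ===========
    number: int or float
        The number to be increased.
    """
    return number + 1


help(add_one)           # prints the signature together with the docstring
print(add_one.__doc__)  # the same text is stored in the __doc__ attribute
```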

In [None]:
## Define custom functions which will make our life easier
def parse_date(date, format='human'):
    """
    It takes a string and converts it into either human readable date format or epoch date format
    
    Parameters:
    ============
    date: str
        A string with either epoch date format or human readable date format.
    format: str
        A string defining the format of the input string. By default, it takes the value 'human' and the other option is 'epoch'.
    """
    if format == 'human':
        pattern = '%Y-%m-%d %H:%M:%S'
        return str(int(time.mktime(time.strptime(date, pattern))))
    elif format == 'epoch':
        pattern = '%Y-%m-%d %H:%M:%S'
        return time.strftime(pattern, time.localtime(int(date)))
    
def collect_data(source_url, payload):
    """
    It takes the Pushshift endpoint and payload as arguments and sends a request to the given URL. Depending on the status code, it either returns 
    the list of mappings, returns a list with the status code and message, or sleeps for 60 seconds and tries again to collect the data.
    
    Parameters:
    ===========
    source_url: str
        A string with url of the Pushshift endpoint.
    payload: dict
        A mapping with parameters passed to the Pushshift API.
    """
    if 'after' not in payload:
        payload['after'] = parse_date('2005-06-23 00:00:00')
    response = rq.get(source_url, params = payload)
    if response.status_code == 200:
        time.sleep(1)
        return json.loads(response.text)['data']
    elif response.status_code == 429 or response.status_code == 523 or response.status_code == 502:
        for i in range(60,0,-1):
            print(f'\rThe compulsory break finishes in {str(i)} seconds', end ='', flush=True)
            time.sleep(1)
        print('\r' + 100*' ')
        return collect_data(source_url = source_url, payload = payload)
    else:
        return [{'status' : response.status_code, 'message' : response.content }]
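
A quick way to check the date conversion is a round trip. The sketch below repeats the two conversions with the standard `time` module so that it runs on its own; the names `to_epoch` and `to_human` are made up for this example. Note that `time.mktime` and `time.localtime` work in your local time zone, so the epoch value you see depends on where your computer thinks it is.

```python
import time

PATTERN = '%Y-%m-%d %H:%M:%S'


def to_epoch(date):
    """Human readable local time -> epoch seconds (as a string)."""
    return str(int(time.mktime(time.strptime(date, PATTERN))))


def to_human(epoch):
    """Epoch seconds -> human readable local time."""
    return time.strftime(PATTERN, time.localtime(int(epoch)))


original = '2017-10-24 00:00:00'
epoch = to_epoch(original)
print(epoch)            # the exact value depends on your time zone
print(to_human(epoch))  # should print the original string again
```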

### Step 3
Define the parameters. Please note that the date format is different than before. For some reason, you can set `size` to more than 25, even though I read that it couldn't be more than that. Anyway, 100 works, so let's keep it for now.

In [None]:
source_url = 'https://api.pushshift.io/reddit/search/submission/'
payload = { 'subreddit' : 'todayilearned',
            'q' : 'science',
            'after' : parse_date('2017-10-24 00:00:00'),
            'size' : 100}


### Step 4
In general, there is probably a better way to write this code, but I find this one the easiest to understand, both for me and, I guess, for you. This chunk scrapes the first batch of the data and stores it in an object called `data`.

In [None]:
data = collect_data(source_url = source_url, payload = payload)

### Step 5
We didn't talk about the `while` loop, but you can probably tell just by looking at it that it runs the code inside until the condition is **not** met. So here the code is executed until the response from `Pushshift` is empty. I added a few nice features that you might find useful when you run this code for a time-consuming search:

1. The data is stored after every single batch is collected. In our case, that is after approximately 100 submissions. So even if you lose your internet connection or the loop breaks, you will not lose the data you have already collected.
2. We don't really know how much data we are going to collect, but the progress bar gives you an idea of how much has been collected so far.
3. Whenever something unexpected happens, the script prints the status code and the message and stops collecting data.
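
The first feature relies on the JSON Lines layout: one JSON object per line, written batch by batch. Here is a minimal offline sketch of that idea with made-up batches and a temporary file. The `flush()` call is my addition for this example; it pushes each finished batch out of Python's buffer straight away.

```python
import json
import os
import tempfile

# Made-up batches standing in for consecutive API responses
batches = [
    [{'id': 'a1', 'num_comments': 3}, {'id': 'a2', 'num_comments': 0}],
    [{'id': 'a3', 'num_comments': 7}],
]

path = os.path.join(tempfile.mkdtemp(), 'example.jl')
with open(path, 'w') as out:
    for batch in batches:
        for item in batch:
            out.write(json.dumps(item) + '\n')  # one JSON object per line
        out.flush()  # push the finished batch out of Python's buffer

# Read everything back to confirm nothing was lost
with open(path, 'r') as saved:
    restored = [json.loads(line) for line in saved]

print(len(restored))
```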

In [None]:
## Open a file in write mode
with open('submissions.jl', 'w') as file:
    ## Write out the data you already collected
    for line in data:
        line['created_utc'] = parse_date(line['created_utc'], format = 'epoch')
        file.write(json.dumps(line) + '\n')
    ## Set the progress bar
    pbar = tqdm.tqdm(position=0, leave=True,initial=100)
    ## Create a while-loop
    while len(data) > 0:
        ## Check if we got data from Reddit or a strange status code
        if len(data[0].keys()) > 2:
            ## Get the last collected data date in epoch time
            after = parse_date(data[-1]['created_utc'])
            ## Update the payload after field
            payload['after'] = after
            ## Collect the data
            data = collect_data(source_url = source_url, payload = payload)
            ## Write out the collected data to the file
            for line in data:
                if 'created_utc' in line:
                    line['created_utc'] = parse_date(line['created_utc'], format = 'epoch')
                    file.write(json.dumps(line) + '\n')
            ## Update the progress bar
            pbar.update(len(data))
        else:
            ## Print out the strange status code and its message, then stop the loop
            print(f"Something went wrong. The status code was {data[0]['status']} and the message was {data[0]['message']}.")
            break

## Comments
Accessing comments is a bit trickier than accessing submissions, but in the end it is not that hard. First, you need to decide what you want to access. You have multiple options:

* get all comments from a specific subreddit. This is the easiest option; however, collecting all comments from a big subreddit will take hours at least, and possibly days. You need to specify the `subreddit` field and nothing else (however, it is wise to change the default number of collected items to the maximum, `'size' : 100`). The payload should look something like this:
```python
payload = { 'subreddit' : 'todayilearned',
            'size' : 100 }
```
* get all comments under a specific submission. You need to specify the `link_id` field and nothing else (however, it is wise to change the default number of collected items to the maximum, `'size' : 100`). Note that the `link_id` should be the `id` of the submission. The payload should look something like this:
```python
payload = { 'link_id' : '8vjr2l',
            'size' : 100 }
```
* get all comments under a specific comment. You need to specify the `parent_id` field and nothing else (however, it is wise to change the default number of collected items to the maximum, `'size' : 100`). Note that the `parent_id` should be the `id` of the comment. The payload should look something like this:
```python
payload = { 'parent_id' : 'e1nuxpc',
            'size' : 100 }
```
Obviously, you can also specify other fields to narrow the search; however, I guess it will not be necessary for our purposes.
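
Since the three payloads differ only in one field, they could be built by one small function. This is just a sketch: `build_comment_payload` and its arguments are hypothetical names, and the only fields it knows are the three described above.

```python
def build_comment_payload(scope, value, size=100):
    """
    Hypothetical helper: build a payload for the comment endpoint.

    scope: 'subreddit', 'link_id' or 'parent_id'
    value: the subreddit name, submission id or comment id
    """
    allowed = ('subreddit', 'link_id', 'parent_id')
    if scope not in allowed:
        raise ValueError(f'scope must be one of {allowed}')
    return {scope: value, 'size': size}


print(build_comment_payload('link_id', '8vjr2l'))
```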

The procedure for scraping comments is the same as for submissions. The only differences are the `payload` and the `source_url`. We do not have to repeat the first two steps in this notebook because, just like in `R`, you need to run them only once per notebook. Therefore, the code below consists of steps 3 to 5.

### Step 3

In [None]:
source_url = 'https://api.pushshift.io/reddit/search/comment/'
payload = { 'link_id' : '8vjr2l',
            'size' : 100 }

### Step 4

In [None]:
data_comments = collect_data(source_url = source_url, payload = payload)

### Step 5

In [None]:
## Open a file in write mode
with open('comments.jl', 'w') as file:
    ## Write out the data you already collected
    for line in data_comments:
        line['created_utc'] = parse_date(line['created_utc'], format = 'epoch')
        file.write(json.dumps(line) + '\n')
    ## Set the progress bar
    pbar = tqdm.tqdm(position=0, leave=True,initial=100)
    ## Create a while-loop
    while len(data_comments) > 0:
        ## Check if we got data from Reddit or a strange status code
        if len(data_comments[0].keys()) > 2:
            ## Get the last collected data date in epoch time
            after = parse_date(data_comments[-1]['created_utc'])
            ## Update the payload after field
            payload['after'] = after
            ## Collect the data
            data_comments = collect_data(source_url = source_url, payload = payload)
            ## Write out the collected data to the file
            for line in data_comments:
                if 'created_utc' in line:
                    line['created_utc'] = parse_date(line['created_utc'], format = 'epoch')
                    file.write(json.dumps(line) + '\n')
            ## Update the progress bar
            pbar.update(len(data_comments))
        else:
            ## Print out the strange status code and its message, then stop the loop
            print(f"Something went wrong. The status code was {data_comments[0]['status']} and the message was {data_comments[0]['message']}.")
            break

# Homework

I know there is not much time left until Tuesday, but it is a relatively easy task. Please pick a subreddit of your choice. Collect all the submissions that have been posted since Klay Thompson tore his right Achilles. Afterward, pick the one with the most comments and download all of them.

**Hint**: To read data from the file into python, you simply need to execute this code:
```python
with open('submissions.jl', 'r') as file:
    data = [json.loads(line) for line in file.readlines()]
```
The code above might look complicated, but it is not that complex. The first line opens the file in read mode. The second is a shorter way of building a list with a `for` loop (a list comprehension). What it says is just:
1. `file.readlines()` - read the file line by line.
2. `for line in file.readlines()` - in a loop, create a temporary object `line` and store in it the line you just read from the file.
3. `json.loads(line)` - convert the line read from the file to a mapping object.
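
To make the three points above concrete, here is the same logic written as an ordinary loop. So that it runs without any file on disk, the sketch uses `io.StringIO` with two made-up lines as a stand-in for the opened file; the names `example_file` and `rows` are just for this example.

```python
import io
import json

# A stand-in for an opened .jl file with two made-up lines
example_file = io.StringIO(
    '{"id": "a1", "num_comments": 3}\n{"id": "a2", "num_comments": 0}\n'
)

rows = []
for line in example_file.readlines():  # read the file line by line
    rows.append(json.loads(line))      # convert each line to a mapping and store it

print(rows)
```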

In [None]:
## Read submissions
with open('submissions.jl', 'r') as file:
    data = [json.loads(line) for line in file.readlines()]
    
## Select submission with the biggest number of comments
comments_num = 0
submission_id = ''
for line in data:
    if line['num_comments'] > comments_num:
        comments_num = line['num_comments']
        submission_id = line['id']

## Define url and payload
source_url = 'https://api.pushshift.io/reddit/search/comment/'
payload = { 'link_id' : submission_id,
            'size' : 100}

## Collect first batch of data
data_comments = collect_data(source_url = source_url, payload = payload)
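
As a side note, the selection loop above could also be written with the built-in `max()`. This is only an alternative sketch on made-up data, with names (`example_data`, `top`, `top_id`) chosen so that it does not overwrite anything in the notebook. One small difference: when no submission has any comments, the loop above leaves the id empty, while `max()` still returns the first submission.

```python
# Made-up submissions; only the two fields used here are included
example_data = [
    {'id': 'a1', 'num_comments': 3},
    {'id': 'a2', 'num_comments': 12},
    {'id': 'a3', 'num_comments': 7},
]

# default=None avoids an error when the list is empty
top = max(example_data, key=lambda line: line['num_comments'], default=None)
top_id = top['id'] if top is not None else ''
print(top_id)
```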

In [None]:
## Open a file in write mode
with open('comments.jl', 'w') as file:
    ## Write out the data you already collected
    for line in data_comments:
        line['created_utc'] = parse_date(line['created_utc'], format = 'epoch')
        file.write(json.dumps(line) + '\n')
    ## Set the progress bar
    pbar = tqdm.tqdm(position=0, leave=True,initial=100)
    ## Create a while-loop
    while len(data_comments) > 0:
        ## Check if we got data from Reddit or a strange status code
        if len(data_comments[0].keys()) > 2:
            ## Get the last collected data date in epoch time
            after = parse_date(data_comments[-1]['created_utc'])
            ## Update the payload after field
            payload['after'] = after
            ## Collect the data
            data_comments = collect_data(source_url = source_url, payload = payload)
            ## Write out the collected data to the file
            for line in data_comments:
                if 'created_utc' in line:
                    line['created_utc'] = parse_date(line['created_utc'], format = 'epoch')
                    file.write(json.dumps(line) + '\n')
            ## Update the progress bar
            pbar.update(len(data_comments))
        else:
            ## Print out the strange status code and its message, then stop the loop
            print(f"Something went wrong. The status code was {data_comments[0]['status']} and the message was {data_comments[0]['message']}.")
            break