# Reddit
<div style="text-align:center"><img src="png/reddit_log.png" /></div>

## What is Reddit?

In general, Reddit is a big fat old forum. Below I copied some basic information from their [help page](https://www.reddithelp.com/hc/en-us/articles/204511479-What-is-Reddit/). They give the following answer to the question of what Reddit is:
> * Reddit is a source for what's new and popular on the Internet.
> * Users like you provide all of the content and decide, through voting, what's good and what's junk.
> * Reddit is made up of many individual communities, also known as subreddits. Each community has its own page, subject matter, users, and moderators.
> * Users post stories, links, and media to these communities, and other users vote and comment on the posts.
> * Through voting, users determine what posts rise to the top of community pages and, by extension, the public home page of the site.
> * Links that receive community approval bubble up towards #1, so the front page is constantly in motion and (hopefully) filled with fresh, interesting links.

~~Personally, I *do not have* an account on Reddit and probably not planning to have one, but~~ If you want to understand better what kind of data you can extract from Reddit, I would recommend setting up an account. As far as I understand, Reddit is a big old internet forum (similar to 4chan or the Polish Wykop) in which users post and comment on all kinds of content. Every user can perform four types of actions:

1. Create a subreddit. Basically, it is a subforum in which a group of users discusses a given topic.
2. Write a post (submission) in a given subreddit.
3. Write a comment on a given post.
4. Rate a given comment or post.

For these actions, people earn **karma**.

### What is karma?

Again, according to [Reddit's help page](https://www.reddithelp.com/hc/en-us/articles/204511829-What-is-karma-) karma is:

>A user's **karma** reflects how much a user has contributed to the Reddit community by an approximate indication of the total votes a user has earned on their submissions ("post karma") and comments ("comment karma"). When posts or comments get upvoted, that user gains some karma. You can see how much karma a user has on their profile page.
>
>Karma is only approximate: there is not a 1:1 relationship with votes. Your post karma will always be significantly lower than the total number of votes you receive on your links. Comment karma is closer to a 1:1 relationship but is still only approximate.

Therefore, there are two important pieces of information here. First, users differ in karma points, which are based on their activity and the popularity of the content they created. This information might be useful when/if we learn how to get information on users. Second, posts (submissions) and comments can be either upvoted or downvoted. This is important because, as far as I understand, the posts (submissions) with the highest scores are exposed on the front page of Reddit and so might reach users beyond the given subreddit. Also, comments with a high score are displayed higher under the submission.

## Reddit API
In general, we should now know what an API is and what Reddit is. So it is the right time to [talk about practice](https://www.youtube.com/watch?v=_UMIcM66S1M), i.e. where to find Reddit's API. This question is more complex than it might seem. There are two ways to access Reddit's data through an API:

1. **Official Reddit API.** In most cases the best way to access data from a webpage (social media platform) that has an API is to use the official one. You can find Reddit's documentation [here](https://www.reddit.com/dev/api). This webpage is not particularly beautiful, but documentation rarely is. At first glance, you will probably be overwhelmed by the amount of information you find there. However, for now, you only need to know that you are not going to use the official Reddit API because it is inconvenient: it requires authentication (having a developer account). Anyhow, if you decide to perform a more detailed analysis of Reddit, you should probably read the official documentation and visit these two pages: [Reddit's Archived GitHub repository](https://github.com/reddit-archive/reddit/wiki/API) and [Documentation on Reddit's API Python Wrapper](https://praw.readthedocs.io/en/latest/). This is a lot of reading and understanding; however, there is no other way unless...
2. **Pushshift Reddit API.** There is a Reddit user, Jason Baumgartner, who for unclear reasons (at least for me they are unclear, but I was not particularly motivated to look it up) decided to dump the whole of Reddit. Moreover, he created an API to access the data he collects. On [this](https://pushshift.io) much nicer (but not very up-to-date) webpage you will find documentation on his API.

In our case, we will use the **Pushshift Reddit API**. It is much easier to use, and for our purposes it will be more than enough. It does not allow for collecting exactly the same data as the **Official Reddit API**, but it has the huge advantage of not requiring authentication. When using Pushshift we need to remember a few things:

1. It is less reliable than the official API because it is run by a single person (half-truth).
2. It does not offer the same functionality as the official API.
3. It is likely to introduce some kind of authentication in the future.
4. It streams the data live from Reddit; hence, post scores are not that reliable (in theory it does it twice in a 24h window; however, sometimes there are delays).

## API and where to find it?

So, in simple terms, an API is an interface through which you send a specific message (request) and get something back (response). In the case of **Pushshift**, it lives under the following [url](https://api.pushshift.io). However, if you click on it, a blank page will open. For some reason it works like this, although the more common practice is to put the documentation under the API address ([Wikipedia](https://en.wikipedia.org/w/api.php) does exactly that). You can find the documentation of the Pushshift API [here](https://pushshift.io/api-parameters/). Before you move any further you should start reading it. Why? Because you need to know what the API can offer you, in other words, what kind of data you can access.

Using the Pushshift API you can access either submissions or comments, even though the [documentation](https://pushshift.io/api-parameters/) states something different. Therefore, it is better to visit their [GitHub repository](https://github.com/pushshift/api). To access submissions or comments we will use so-called endpoints. If an API is an interface, an endpoint is a communication channel. In Pushshift there are two (if you click on either of the links you should see the last 25 comments or the last 25 submissions):

* [https://api.pushshift.io/reddit/submission/search](https://api.pushshift.io/reddit/submission/search)
* [https://api.pushshift.io/reddit/comment/search](https://api.pushshift.io/reddit/comment/search)
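
To make the idea of an endpoint plus parameters more concrete, here is a minimal sketch of how a full request address is put together. It uses only the standard-library `urllib.parse` module, and the parameter values are made up for illustration; nothing is sent over the network.

```python
from urllib.parse import urlencode

# Endpoint for submissions, as listed above
endpoint = 'https://api.pushshift.io/reddit/submission/search'

# Illustrative parameters (values chosen only for this example)
params = {'subreddit': 'climate', 'size': 10}

# A request URL is the endpoint, a question mark, and the encoded parameters
request_url = endpoint + '?' + urlencode(params)
print(request_url)
```

Pasting the printed address into a browser is equivalent to clicking an endpoint link with the parameters filled in by hand.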


## Practice

### Step 1.
First things first. As in _R_, we will start by loading libraries. In _Python_, they are called modules. Below I load three modules that we will use in this script. Because we are using Google Colab we do not have to install them. If you were using _Python_ on your personal computer, you would have to install `requests` first (`json` and `time` come with _Python_). It is the same in _R_, where you have to install a package only once, and afterward you can use it till the world's end.

In [None]:
import requests as rq ## this is a module to send requests
import json ## this is a module to process json
import time ## this is a module we will need to understand time

### Step 2.
Let's define our two endpoints as `strings`. It makes everything more convenient.

In [None]:
## Comments endpoint 
url_comments = 'https://api.pushshift.io/reddit/search/comment/'

## Submissions endpoint
url_submissions = 'https://api.pushshift.io/reddit/search/submission/'

### Step 3.
When we clicked the link of each endpoint we got the last 25 comments or submissions, but when we looked at the documentation we saw that there were plenty of different parameters we could specify. I would recommend visiting this website, where the [documentation](https://github.com/pushshift/api) is presented more compactly and understandably. Below I use only two parameters, but you might want to specify more. However, you need to be careful with `after` and `before` because they are in what might seem like a strange format. It is called [epoch time](https://en.wikipedia.org/wiki/Unix_time). It counts the seconds elapsed since January 1st, 1970 (UTC), the beginning of the Unix Epoch; systems that store it as a signed 32-bit number can only count until [03:14:07 UTC on 19 January 2038](https://en.wikipedia.org/wiki/Year_2038_problem). To convert a human-readable date into this strange format we will use two functions from the `time` module:

* `time.strptime()` is a function to convert a string into a date object based on a specific pattern. In other words, we tell _Python_ that `'30.08.2011 11:05:02'` is not a normal string but a date written in the pattern `'%d.%m.%Y %H:%M:%S'`;
* `time.mktime()` is a function to convert such a date object into epoch format. For example, on a computer set to Central European Summer Time (UTC+2) it converts `'30.08.2011 11:05:02'` into the number `1314695102`. The exact number depends on your computer's time zone because `time.mktime()` interprets the date as local time.

The usage of both of them would look something like this:
```python
date_time = '30.08.2011 11:05:02'
pattern = '%d.%m.%Y %H:%M:%S'
after_time = str(int(time.mktime(time.strptime(date_time, pattern))))
```
At first glance, it might look a bit complex, but when you think about it, it is not. I do not expect you to know it by heart but to understand it on a very general level. The only thing you will have to change below is `date_time`, which you should set to the date of your choice.

In [None]:
date_time = '30.08.2011 11:05:02'
pattern = '%d.%m.%Y %H:%M:%S'
after_time = str(int(time.mktime(time.strptime(date_time, pattern))))

payload = { 'subreddit' : 'climate',
            'after' : after_time }
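
Because `time.mktime()` reads the date as local time, the same code can produce different numbers on computers in different time zones. The sketch below shows one possible alternative, assuming you want the date to be read as UTC: it swaps in `calendar.timegm()` from the standard library, which is not used elsewhere in this notebook. The variable name `after_time_utc` is made up for this example.

```python
import calendar
import time

date_time = '30.08.2011 11:05:02'
pattern = '%d.%m.%Y %H:%M:%S'

# Parse the string into a struct_time (no time zone attached yet)
parsed = time.strptime(date_time, pattern)

# Interpret the parsed time as UTC; the result is the same on every machine
after_time_utc = calendar.timegm(parsed)
print(after_time_utc)

# Convert the epoch number back to a readable UTC string to check the result
print(time.strftime(pattern, time.gmtime(after_time_utc)))
```

The second `print` should give back the original string, which is a quick way to check that the conversion went in the direction you intended.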

### Step 4.
So, now that we know the URL of the endpoint and the options we want to pass, we should send the request to this URL. We will use the `get` function from the `requests` module. However, unlike in _R_, we need to tell _Python_ which module that function comes from. Therefore, we will use `rq.get()`.
This function takes the endpoint URL as the first argument and the options we want to pass as the second argument.

In [None]:
## Let's send the request and save the response as the response object
response = rq.get(url_submissions, params = payload)

Let's check what we got.

In [None]:
response

If we just execute the chunk above we will get only a mysterious and enigmatic code 200 (displayed as `<Response [200]>`). This is good news. It means that we got a valid response from the server. There are many different codes we could get when we send a request to the server, but you should be aware of two groups: [5xx](https://github.com/500) and [4xx](https://www.pixar.com/404). In general, the former means that there is an issue on the server side and the latter that there is a problem with the request, for example that the resource you are looking for [does not exist](https://github.com/404).
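
Since a request does not always come back with code 200, it can help to check the status code and try again a few times before giving up. Below is a minimal sketch of that idea. The function `fetch_with_retries` and its arguments are hypothetical names made up for this example, and the default number of attempts and waiting time are arbitrary assumptions, not recommendations from Pushshift.

```python
import time

def fetch_with_retries(send, retries=3, wait=1.0):
    """Call send() until it returns a response with status code 200.

    send    -- zero-argument callable returning an object with .status_code
    retries -- maximum number of attempts (arbitrary default)
    wait    -- seconds to sleep between attempts (arbitrary default)
    """
    last_response = None
    for attempt in range(retries):
        last_response = send()
        if last_response.status_code == 200:
            return last_response
        # Give the server a moment before trying again
        if attempt < retries - 1:
            time.sleep(wait)
    raise RuntimeError(f'No valid response after {retries} attempts '
                       f'(last status code: {last_response.status_code})')

# Possible usage with the objects defined in this notebook:
# response = fetch_with_retries(lambda: rq.get(url_submissions, params=payload))
```

Passing the request as a small function (`lambda`) keeps the retry logic separate from the details of what is being requested.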

### Step 5.
OK, but how do we extract some data from this response object? It is easier than it might look: we need to use the `text` attribute of the object. I am not going to go into too much detail, but in _Python_ objects can have attributes (pieces of data stored on them) and methods (functions that can be run on them). It is a bit like an internal ability of the object in question.

In [None]:
response.text

However, for _Python_ our `response.text` object is just a long and not really interesting string. To be able to process it further (mainly to save it into a data file) we need to transform it into a _Python_ representation of JSON. We will do it by using the function `json.loads()`.

In [None]:
json.loads(response.text)

So if you look closely at the object above you will realize it is a single object in curly brackets (a _Python_ dictionary). It has one key, `data`, and one value, which is a list. You can either believe me or we can just check it in _Python_ by using the `keys()` method on the `json.loads(response.text)` object.

In [None]:
json.loads(response.text).keys()

To extract and assign the value from this key we will execute the following code which is a bit similar to what you are used to in _R_.

In [None]:
data = json.loads(response.text)['data']

In [None]:
len(data)
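
If the structure of the response is still hard to picture, the sketch below repeats the same steps on a tiny, made-up string that has the same outer shape (one key, `data`, holding a list). The content of `sample_text` is invented for illustration and is much smaller than a real response.

```python
import json

# A made-up response body: one key, 'data', holding a list of records
sample_text = '{"data": [{"id": "abc", "title": "First"}, {"id": "def", "title": "Second"}]}'

parsed = json.loads(sample_text)      # str -> dict
print(type(parsed))                   # the outer object is a dictionary
print(list(parsed.keys()))            # its only key is 'data'

records = parsed['data']              # the value under 'data' is a list
print(len(records))                   # number of records in the list
print(records[0]['title'])            # fields of one record are accessed by key
```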

### Step 6. 
Let's now save the data to a JSON lines file on our computer. It is not as complicated as it might seem. We simply open a file and then write every element of our data object into this file as a separate line. When saving the file we convert this strange date format to a more useful one: `'%Y-%m-%d %H:%M:%S'`. To do so, we first use the function `time.localtime()`, which converts Epoch time to a date object, and afterward we use `time.strftime()` to convert it to the desired format. Note that `time.localtime()` uses your computer's time zone; `time.gmtime()` would give UTC instead.

In [None]:
## Open the file in write mode
with open('climate.jl', 'w') as file:
    ## Iterate over all elements
    for line in data:
        ## Convert Epoch time to human-readable format
        line['created_utc'] = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(line['created_utc']))
        ## Check whether the field author_created_utc exists
        if 'author_created_utc' in line.keys():
            ## Convert Epoch time to human-readable format
            line['author_created_utc'] = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(line['author_created_utc']))
        ## Dump the line to the file
        file.write(json.dumps(line) + '\n')
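
The loop above changes the records in `data` in place, so running the cell twice will fail because the dates are already strings. One way around this is to do the conversion on a copy. The sketch below is only an illustration: the helper `with_readable_dates` is a hypothetical name, and it uses `time.gmtime()` instead of `time.localtime()`, so the resulting strings are in UTC rather than in local time.

```python
import time

def with_readable_dates(record, fields=('created_utc', 'author_created_utc')):
    """Return a copy of record with the listed epoch fields formatted as UTC strings."""
    converted = dict(record)  # shallow copy, so the original record stays untouched
    for field in fields:
        value = converted.get(field)
        # Skip fields that are missing or empty
        if value is None:
            continue
        converted[field] = time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime(value))
    return converted

example = {'id': 'abc', 'created_utc': 1314702302}  # made-up record
print(with_readable_dates(example))
print(example)  # the original still holds the epoch number
```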

## Exercise

Read into _Python_ the file you just created.

In [None]:
## Open the file in read mode
with open('climate.jl', 'r') as file:
    ## Create a new list with dictionaries as elements
    df = [ json.loads(item) for item in file.readlines() ]
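
To convince yourself that writing and reading a JSON lines file really gives back the same data, you can try a small round trip. The sketch below uses made-up records and writes to a temporary folder (via the standard-library `tempfile` module, an addition for this example) so that no existing file is overwritten.

```python
import json
import os
import tempfile

records = [{'id': 'abc', 'score': 3}, {'id': 'def', 'score': 7}]  # made-up records

# Write to a temporary folder so no existing file is overwritten
path = os.path.join(tempfile.mkdtemp(), 'example.jl')

with open(path, 'w') as file:
    for record in records:
        # One JSON object per line
        file.write(json.dumps(record) + '\n')

with open(path, 'r') as file:
    # Skip blank lines, which json.loads() cannot parse
    loaded = [json.loads(line) for line in file if line.strip()]

print(loaded == records)
```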

## Comments
It is a bit more tricky to access the comments than submissions but at the end of the day not that hard. First, you need to decide what you would want to access. You have multiple options:

* get all comments from a specific subreddit. This is the easiest option; however, for big subreddits it will take hours or even days to collect all comments. You need to specify the `subreddit` field and nothing else (however, it is wise to change the default number of collected items to the maximum: `'size' : 100`). Therefore, the payload should look something like this:
```python
payload = { 'subreddit' : 'todayilearned',
            'size' : 100 }
```
* get all comments under a specific submission. You need to specify the `link_id` field and nothing else (however, it is wise to change the default number of collected items to the maximum: `'size' : 100`). The important note is that as the `link_id` you should put the `id` of the submission. Therefore, the payload should look something like this:
```python
payload = { 'link_id' : '8vjr2l',
            'size' : 100 }
```
* get all comments under a specific comment. You need to specify the `parent_id` field and nothing else (however, it is wise to change the default number of collected items to the maximum: `'size' : 100`). The important note is that as the `parent_id` you should put the `id` of the comment. Therefore, the payload should look something like this:
```python
payload = { 'parent_id' : 'e1nuxpc',
            'size' : 100 }
```
Obviously, you can also specify other fields to narrow the search; however, I guess for our purposes it will not be necessary.
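
The three options above differ only in which field is filled in, so they can be wrapped in one small helper. This is just a sketch: `build_comment_payload` is a hypothetical function made up for this example, and the rule that exactly one target must be given is my simplification of the options listed above, not something the API enforces.

```python
def build_comment_payload(subreddit=None, link_id=None, parent_id=None, size=100):
    """Build a payload for the comments endpoint from exactly one search target."""
    targets = {'subreddit': subreddit, 'link_id': link_id, 'parent_id': parent_id}
    # Keep only the target that was actually provided
    chosen = {key: value for key, value in targets.items() if value is not None}
    if len(chosen) != 1:
        raise ValueError('Specify exactly one of subreddit, link_id, or parent_id')
    payload = dict(chosen)
    payload['size'] = size
    return payload

print(build_comment_payload(subreddit='todayilearned'))
print(build_comment_payload(link_id='8vjr2l'))
print(build_comment_payload(parent_id='e1nuxpc'))
```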

Actually, the procedure of scraping the comments is very similar to the submission one. The only difference is the `payload` and `source_url`. 

In [None]:
## Define the payload
date_time = '30.08.2022 11:05:02'
pattern = '%d.%m.%Y %H:%M:%S'
after_time = str(int(time.mktime(time.strptime(date_time, pattern))))

source_url = 'https://api.pushshift.io/reddit/search/comment/'
payload = { 'subreddit': 'todayilearned',
            'after' : after_time}

In [None]:
## Send the request
response = rq.get(source_url, params = payload)

In [None]:
## Extract the data to a list of dictionaries
data = json.loads(response.text)['data']

In [None]:
data

## Exercise

Write out the data in a JSON line file. Make sure that all the dates are in a human-readable format.

In [None]:
## Open the file in write mode
with open('comments.jl', 'w') as file:
    ## Iterate over all elements
    for line in data:
        ## Convert Epoch time to human-readable format
        line['created_utc'] = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(line['created_utc']))
        ## Check whether the field retrieved_utc exists
        if 'retrieved_utc' in line:
            line['retrieved_utc'] = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(line['retrieved_utc']))
        ## Dump the line to the file
        file.write(json.dumps(line) + '\n')
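
A single request returns at most one page of results, so collecting all comments from a subreddit means sending many requests and moving the `after` parameter forward each time. The sketch below shows one possible way to organize this. The names `collect_pages` and `fetch_page` are hypothetical, and the approach assumes that every record still has an integer `created_utc` (so it has to run before the dates are converted to strings). Records created in the same second as the newest one on a page could be skipped; I accept that here for simplicity.

```python
def collect_pages(fetch_page, payload, max_pages=5):
    """Collect records page by page, moving 'after' forward each time.

    fetch_page -- callable taking a payload dict and returning a list of records,
                  each with an integer 'created_utc' (hypothetical helper)
    payload    -- starting payload, for example {'subreddit': 'todayilearned'}
    max_pages  -- upper limit on the number of requests (arbitrary default)
    """
    collected = []
    current = dict(payload)  # copy, so the caller's payload is not modified
    for _ in range(max_pages):
        page = fetch_page(current)
        if not page:
            break  # nothing newer is available
        collected.extend(page)
        # Next, ask for records created after the newest one seen so far
        current['after'] = max(record['created_utc'] for record in page)
    return collected

# Possible usage with the objects defined in this notebook:
# fetch_page = lambda p: json.loads(rq.get(source_url, params=p).text)['data']
# all_comments = collect_pages(fetch_page, payload, max_pages=10)
```

Keeping `max_pages` small while testing is a simple way to avoid sending hundreds of requests by accident.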