<div style="text-align:center"><img src="png/reddit.png" /></div>

## What is Reddit?

It is good practice to start by learning what Reddit is. Below I copied some basic information from their [help page](https://www.reddithelp.com/hc/en-us/articles/204511479-What-is-Reddit/). They give the following answer to the question of what Reddit is:
> * Reddit is a source for what's new and popular on the Internet.
> * Users like you provide all of the content and decide, through voting, what's good and what's junk.
> * Reddit is made up of many individual communities, also known as subreddits. Each community has its own page, subject matter, users, and moderators.
> * Users post stories, links, and media to these communities, and other users vote and comment on the posts.
> * Through voting, users determine what posts rise to the top of community pages and, by extension, the public home page of the site.
> * Links that receive community approval bubble up towards #1, so the front page is constantly in motion and (hopefully) filled with fresh, interesting links.

Personally, I *do not have* a Reddit account and I am probably not planning to create one, but if you want to better understand what kind of data you can extract from there, I would recommend setting one up. As far as I understand, Reddit is a big, old internet forum (similar to 4chan or the Polish Wykop) in which users post or comment on different information. Every user can perform four types of actions:

1. Create a subreddit. Basically, it is a subforum on a given topic in which a group of users discusses it.
2. Write a post (submission) in a given subreddit.
3. Write a comment on a given post.
4. Rate a given comment or post.

For these actions, people earn **karma**.

### What is karma?

Again, according to [Reddit's help page](https://www.reddithelp.com/hc/en-us/articles/204511829-What-is-karma-) karma is:

>A user's **karma** reflects how much a user has contributed to the Reddit community by an approximate indication of the total votes a user has earned on their submissions ("post karma") and comments ("comment karma"). When posts or comments get upvoted, that user gains some karma. You can see how much karma a user has on their profile page.
>
>Karma is only approximate: there is not a 1:1 relationship with votes. Your post karma will always be significantly lower than the total number of votes you receive on your links. Comment karma is closer to a 1:1 relationship but is still only approximate.

From our perspective, there are two important pieces of information here. First, users differ in their activity and in the popularity of their content, and karma points reflect this. This might be useful when (or if) we learn how to get information on users. Second, posts and comments can be either upvoted or downvoted. This is important because, as far as I understand, the posts or comments with the highest score are shown on the front page of Reddit and might therefore influence users beyond the given subreddit. Also, comments with a high score are displayed higher under the post.

![api](png/api.jpg)

## What is an API?
Now that we know something about Reddit, let's dig a bit deeper into the world of restaurants. I mean APIs.
In general, web APIs (Application Programming Interfaces) are publicly available interfaces through which third parties (this is us!) can access some data resources in a **remote**, **reliable** and **programmable** manner. (There are plenty of private APIs too, but we do not care about them because we cannot use them.)

What does it mean in practice?

* **Remote.** Users can access the resource from anywhere, provided they have an internet connection.
* **Reliable.** The interface exposed to users is independent of the internal details of the system that produces the data. In other words, the way a user communicates with the API is independent of the way the system works. In practice it means that a user does not have to know anything about the system; it is enough to know the interface.
* **Programmable.** An API can be interacted with through a predefined set of commands/methods (an interface) in a way that can be expressed in a programming language. This is usually achieved with the HTTP protocol, which is the standard communication protocol of the Web and for which utilities are available in any major programming language.

## Reddit API
We should now know what an API is and what Reddit is, so it is the right time to [talk about practice](https://www.youtube.com/watch?v=eGDBR2L5kzI). Where can we find Reddit's API? This question is more complex than it might seem. There are two ways to access Reddit through an API:

1. **Official Reddit API.** In most cases the best way to access data from a website that has an API is to use the official one. You can find Reddit's documentation [here](https://www.reddit.com/dev/api). This webpage is not particularly beautiful, but documentation rarely is. At first glance, you will probably be overwhelmed by the amount of information there. However, for now, you only need to know that we are not going to use the official Reddit API because it is inconvenient. It requires authentication (having a developer account) and, as far as I can tell, it is not very actively developed. Still, if you decide to perform a more detailed analysis of Reddit you should probably read the official documentation and visit these two pages: [Reddit's Archived GitHub repository](https://github.com/reddit-archive/reddit/wiki/API) and [Documentation on Reddit's API Python Wrapper](https://praw.readthedocs.io/en/latest/). This is a lot of reading and understanding. However, there is no other way unless...
2. **Pushshift Reddit API.** There is a Reddit user, Jason Baumgartner, who for reasons unclear to me (I was not particularly motivated to look them up) decided to dump the whole of Reddit every month. You can find the documentation of his API on [this](https://pushshift.io) much nicer webpage.

In our case, we will use the **Pushshift Reddit API**. It is much easier to use and it will be enough for our purposes. As far as I know, it does not allow you to collect exactly the same data as the **official Reddit API**; however, it has the huge advantage of not requiring authentication. When using Pushshift we need to remember a few things:

1. It is less reliable than the official API because it is run by a single person.
2. It does not offer the same functionality as the official API.
3. It is likely to introduce some kind of authentication in the future.

## API and where to find it?

In simple terms, an API is an interface through which you send a specific message (a request) and get something back (a response). In the case of **Pushshift**, it lives under the following [url](https://api.pushshift.io). However, if you click on it, a blank page will open. I do not know why it works like this; the more common practice is to put the documentation at the API address (Wikipedia does exactly that). You can find the documentation of the Pushshift API [here](https://pushshift.io/api-parameters/). Before you move any further you should start reading it. Why? Because you need to know what the API can offer you, in other words what kind of data you can access.

Using the Pushshift API you can access either submissions or comments, even though the [documentation](https://pushshift.io/api-parameters/) states something different. Therefore, it is better to visit their [GitHub repository](https://github.com/pushshift/api). To access submissions or comments we will use so-called endpoints. In Pushshift there are two (if you click on either of the links you should see the last 25 submissions or the last 25 comments):

* [https://api.pushshift.io/reddit/submission/search](https://api.pushshift.io/reddit/submission/search)
* [https://api.pushshift.io/reddit/comment/search](https://api.pushshift.io/reddit/comment/search)

Before I tell you how to look for specific comments, let's focus for a second on what you have just seen. The data displayed was not in the format you are used to, because it was not tabular data but JSON. What is it, and how is it different from a regular data format?

## What is JSON?

Imagine that you are meant to extract the most important information from the following text and store it in a database:

>Alice is a *17*-year-old young lady. Although her main field of interest is physics (especially quantum physics and string theory), she also fancies sport. Her favorite physical activities are fishing and football. Bob, on the other hand, is a naughty 15-year-old boy who only loves literature; Szymborska's poems especially touch his heart.

So one way of doing it would be to put it in a table like this:

| Name | Sex | Age | Interest A | Interest A1 | Interest A2 | Interest B | Interest B1 | Interest B2 |
|------|-----|-----|------------|-------------|-------------|------------|-------------|-------------|
| Alice | F | 17 | physics | quantum physics | string theory | sport | fishing | football |
| Bob | M | 15 | literature | poems | n/a | n/a | n/a | n/a |

However, this would not be very informative, and we would also lose some space to the empty columns in the second row. With more than two records we would like to avoid wasting resources on empty space. Therefore, one of the most popular ways of storing data in general is the JSON format. We would store the same data in it in the following manner:
```json
[
  { "name" : "Alice",
    "sex" : "F",
    "age" : 17,
    "interests" : [ { "name" : "physics", "fields" : [ "quantum physics", "string theory" ] },
                    { "name" : "sport", "fields" : [ "football", "fishing" ] } ]
  },
  { "name" : "Bob",
    "sex" : "M",
    "age" : 15,
    "interests" : [ { "name" : "literature", "fields" : [ "poems" ] } ]
  }
]
```
Usually, this is the format we will get from an API. It is possible to get data in XML format, but then everything gets quite complicated. In most cases you don't have to worry, because it will be JSON. This is good news, because Python has a built-in object called a dictionary which is really similar to a JSON object.
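
To make the resemblance concrete, here is a minimal sketch using Python's standard `json` module. The record is modelled on Alice from the example above; the key names and variable names are my own choice.

```python
import json

# A record for Alice, written as a JSON string
record = '''
{ "name": "Alice",
  "sex": "F",
  "age": 17,
  "interests": [ { "name": "physics", "fields": ["quantum physics", "string theory"] },
                 { "name": "sport", "fields": ["football", "fishing"] } ]
}
'''

alice = json.loads(record)   # JSON object -> Python dictionary

print(type(alice).__name__)          # dict
print(alice["name"], alice["age"])   # Alice 17

# Nested JSON arrays become Python lists, so we can loop over them
for interest in alice["interests"]:
    print(interest["name"], "->", ", ".join(interest["fields"]))
```

Note that the nesting survives the conversion: `alice["interests"]` is a list of dictionaries, which is exactly the structure a flat table struggles to represent.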

## Practice

Enough talking, let's see this API thing in real life. I am not going to go into the details of Python syntax or modules. If you are curious you can either visit this great [resource](https://www.learnpython.org) or use one of [my notebooks](https://github.com/sztal/ecss-class). Either way, it is probably good to learn some basics of Python. Here, instead, I will show you some really basic code that will allow you to access data from Reddit, write it out to a JSON file and later process it in R.

### Step 1.
First things first. As in R, we will start by loading libraries. In Python they are called modules. Below I load the three modules which we will use in this script. Because we are using Google Colab we do not have to install them. If you were using Python on your personal computer you might have to install `requests` first (`json` and `time` ship with Python). It is the same in R, where you have to install a package only once and afterwards you can use it till the world's end.

In [None]:
import requests as rq ## module for sending HTTP requests
import json ## module for processing JSON
import time ## module for working with time

### Step 2.
Let's define our two endpoints as strings. It makes everything more convenient.

In [None]:
## Comments endpoint 
url_comments = 'https://api.pushshift.io/reddit/search/comment/'

## Submissions endpoint
url_submissions = 'https://api.pushshift.io/reddit/search/submission/'

### Step 3.
When we clicked the link of each endpoint before, we got the 25 most recent comments or submissions, but when we looked at the documentation we saw plenty of different parameters we could specify. I would recommend visiting the [documentation](https://github.com/pushshift/api) on GitHub, where it is presented in a more compact and understandable way. Below we put the options we want into a dictionary: the `climate` subreddit and only content from the last 20 days (`"20d"`).

In [None]:
payload = { 'subreddit' : 'climate',
            'after' : "20d" }
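
The value `"20d"` is a relative offset. As far as I understand the Pushshift documentation, `after` also accepts an absolute epoch timestamp (seconds since 1970-01-01 UTC); treat that as an assumption and check the documentation before relying on it. A minimal sketch of computing such a timestamp with the `time` module follows. The helper name `days_ago` and the variable `payload_epoch` are hypothetical.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60

def days_ago(days, now=None):
    """Return the epoch timestamp (whole seconds) for `days` days before `now`."""
    if now is None:
        now = time.time()   # current time as seconds since the epoch
    return int(now - days * SECONDS_PER_DAY)

# Hypothetical alternative payload with an absolute timestamp instead of "20d"
# (assumes the API accepts epoch values for 'after')
payload_epoch = { 'subreddit' : 'climate',
                  'after' : days_ago(20) }
print(payload_epoch)
```

An absolute timestamp is handy when you want the same time window every time you rerun the script, whereas `"20d"` moves with the clock.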

### Step 4.
Now that we know the URL of the endpoint and the options we want to pass, we should send the request to this URL. We will use the `get` function from the `requests` module. However, unlike in R, we need to tell Python which module the function comes from. Therefore, we will use `rq.get()`.
This function takes the endpoint URL as the first argument and the options we want to pass as the `params` argument.

In [None]:
## Let's send the request and save the response as response object
response = rq.get(url_submissions, params = payload)
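
Under the hood, the options end up in the query string of the URL. The sketch below rebuilds that URL with the standard library only, without sending anything, so you can see roughly what the server receives. It approximates what `requests` does for simple string parameters; it is not the library's actual implementation.

```python
from urllib.parse import urlencode

url_submissions = 'https://api.pushshift.io/reddit/search/submission/'
payload = { 'subreddit' : 'climate',
            'after' : "20d" }

# Join the endpoint and the encoded options with a question mark
full_url = url_submissions + '?' + urlencode(payload)
print(full_url)
# https://api.pushshift.io/reddit/search/submission/?subreddit=climate&after=20d
```

After a real request you can compare this with `response.url`, which holds the URL that was actually used.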

Let's check what we got. If we just execute the chunk below we will get only a mysterious `<Response [200]>`. This is good news. It means that we got a valid response from the server. There are many different codes we could get when we send a request to a server, but you should be aware of two families: [5xx](png/error_500.png) and [4xx](png/error_404.png). In general, the former means that there is an issue on the server side and the latter that something is wrong with the request, for example that the resource you are looking for does not exist (404).

In [None]:
response
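
If you want to check the code in a script rather than by eye, the response object exposes it as `response.status_code`. The helper below is a hypothetical sketch that sorts a code into the families mentioned above; it works on a plain integer, so it runs without a network connection.

```python
def describe_status(code):
    """Map an HTTP status code to a rough, human-readable category."""
    if 200 <= code < 300:
        return 'success'
    if 400 <= code < 500:
        return 'client error (problem with the request)'
    if 500 <= code < 600:
        return 'server error (problem on the server side)'
    return 'other'

for code in (200, 404, 500):
    print(code, '->', describe_status(code))
```

In the notebook you would call `describe_status(response.status_code)`. The `requests` module also provides `response.raise_for_status()`, which raises an exception for 4xx and 5xx codes.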

### Step 5.
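OK, but how do we extract some data from this response object? It is easier than it might look. We use the `text` attribute of the object, which gives us the body of the response as a string. `json.loads()` then turns that string into a Python dictionary, and the `'data'` key of that dictionary holds the actual records.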

In [None]:
data = json.loads(response.text)['data']

So our `data` object looks a lot like Alice and Bob right now. It contains a lot of objects within curly brackets (dictionaries) collected inside square brackets; therefore it is a list. A list is one of the basic objects in Python, but that is a whole different story.

In [None]:
data
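
Each element of `data` is a dictionary, so individual fields can be read by key. The sketch below uses a tiny hand-written stand-in for `response.text` so that it runs offline. The field names `title`, `author` and `score` and all values are illustrative assumptions, so inspect a real record to see which keys are actually present.

```python
import json

# Hand-written stand-in for response.text (made-up records, not real Reddit data)
fake_text = '''
{ "data": [ { "title": "First example post",  "author": "user_a", "score": 12 },
            { "title": "Second example post", "author": "user_b", "score": 3 } ] }
'''

sample = json.loads(fake_text)['data']   # same parsing step as above

print(type(sample).__name__, len(sample))   # list 2
print(sorted(sample[0].keys()))             # the keys of the first record

# Collect one field from every record; .get() avoids an error if a key is missing
titles = [record.get('title') for record in sample]
print(titles)
```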

### Step 6. 
Let's save it to a file so we can later use it in R. I am not going to talk in detail about the code below, but what it does is create a file called `climate.jl` and write every element of the `data` object to it as a separate line of JSON.

In [None]:
with open('climate.jl', 'w') as file:
    for line in data:
        ## json.dumps() needs the object to serialize; each record becomes one line
        file.write(json.dumps(line) + '\n')
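
A file like this, with one JSON object per line, is often called JSON Lines. To check that it was written correctly you can read it back line by line. The sketch below is self-contained: it writes two made-up records to a temporary file (the file name and the records are placeholders) and then loads them again.

```python
import json
import os
import tempfile

# Made-up records standing in for the real data
records = [ { 'id' : 'a1', 'title' : 'First example post' },
            { 'id' : 'b2', 'title' : 'Second example post' } ]

path = os.path.join(tempfile.mkdtemp(), 'example.jl')

# Write: one JSON object per line
with open(path, 'w') as file:
    for record in records:
        file.write(json.dumps(record) + '\n')

# Read back: parse every non-empty line separately
with open(path) as file:
    loaded = [json.loads(line) for line in file if line.strip()]

print(loaded == records)   # True
```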

### Step 7.
This step is specific to Google Colab; it just downloads the file `climate.jl` to your computer.

In [None]:
## Download the file from Google Colab
from google.colab import files
files.download('climate.jl')