# Reddit
<div style="text-align:center"><img src="png/reddit_log.png" /></div>

## What is Reddit?

In general, Reddit is a big fat old forum. Below I copied some basic information from their [help page](https://www.reddithelp.com/hc/en-us/articles/204511479-What-is-Reddit/). They give the following answer to the question of what Reddit is:
> * Reddit is a source for what's new and popular on the Internet.
> * Users like you provide all of the content and decide, through voting, what's good and what's junk.
> * Reddit is made up of many individual communities, also known as subreddits. Each community has its page, subject matter, users, and moderators.
> * Users post stories, links, and media to these communities, and other users vote and comment on the posts.
> * Through voting, users determine what posts rise to the top of community pages and, by extension, the public home page of the site.
> * Links that receive community approval bubble up towards #1, so the front page is constantly in motion and (hopefully) filled with fresh, interesting links.

~~Personally, I *do not have* an account on Reddit and am probably not planning to have one, but~~ if you want to better understand what kind of data you can extract from Reddit, I would recommend setting up an account. As far as I understand, Reddit is a big old internet forum (similar to 4chan or the Polish Wykop) in which users post and comment on all kinds of content. Every user can perform four types of actions:

1. Create a subreddit, that is, a subforum in which a group of users discusses a given topic.
2. Write a post (submission) in a given subreddit.
3. Write a comment on a given post.
4. Rate a given comment or post.

For these actions, people earn **karma**.

### What is karma?

Again, according to [Reddit's help page](https://www.reddithelp.com/hc/en-us/articles/204511829-What-is-karma-) karma is:

>A user's **karma** reflects how much a user has contributed to the Reddit community by an approximate indication of the total votes a user has earned on their submissions ("post karma") and comments ("comment karma"). When posts or comments get upvoted, that user gains some karma. You can see how much karma a user has on their profile page.
>
>Karma is only approximate: there is not a 1:1 relationship with votes. Your post karma will always be significantly lower than the total number of votes you receive on your links. Comment karma is closer to a 1:1 relationship but is still only approximate.

Therefore, there are two important pieces of information here. First, users differ in karma, which depends on their activity and on the popularity of the content they create. This information might be useful when/if we learn how to get information on users. Second, posts (submissions) and comments can be either upvoted or downvoted. This is important because, as far as I understand, the posts with the highest scores are exposed on the front page of Reddit and so may reach users beyond the given subreddit. Also, comments with a high score are displayed higher under the submission.

## Reddit API
In general, we should now know what an API is and what Reddit is. So it is the right time to [talk about practice](https://www.youtube.com/watch?v=_UMIcM66S1M), i.e. where to find Reddit's API. This question is more complex than it might seem. There used to be two ways to access Reddit's data through an API:

1. **Official Reddit API.** In most cases the best way to access data from a webpage (social media platform) that has an API is to use the official one. You can find Reddit's documentation [here](https://www.reddit.com/dev/api). This webpage is not particularly beautiful, but documentation rarely is. At first glance, you could be overwhelmed by the amount of information you find there. However, this is why [you are not alone in this journey](https://youtu.be/Ph0yhtZDWIQ?si=5omC46UFnoTktd3W).

2. **Pushshift Reddit API.** There is a Reddit user, Jason Baumgartner, who for unclear reasons (at least for me they are unclear but I have never been particularly motivated to look it up) decided to archive the whole of Reddit. Moreover, he created an API to access the data he had collected. Unfortunately, in 2023, following the changes in Reddit's terms of use, access to the data was hindered. It now requires either authentication or downloading tons of data and looking for interesting content by yourself.

In previous years, we used the **Pushshift Reddit API** because it did not require authentication, but because of the changes in Reddit's policy, we will use the **Official Reddit API**. [Those changes were introduced partially because of ChatGPT](https://www.nytimes.com/2023/04/18/technology/reddit-ai-openai-google.html).

## Creating an App

To access the Reddit API we need to create an app on Reddit (it requires having a regular account first). It might sound a bit difficult at first but it is not.

1. Go to [www.reddit.com](https://www.reddit.com) and create an account if you do not have one. 
2. When you are logged in to Reddit go to this [link](https://www.reddit.com/prefs/apps). After pressing `Create app` you should see something like below.

<div style="text-align:center"><img src="png/reddit_app.png"/></div>

For now, you should be interested in only three fields: `name`, `type of the app`, and `redirect url`. You should fill in these fields as follows:

* `name` -- the name of your app. It should be one word.
* `type of the app` -- you should select script. 
* `redirect url` -- the location where the authorization server sends the user once the app has been successfully authorized and granted an authorization code or access token. You should type the following: `http://localhost:8080`. It will make sure that after successful authorization you will stay on Reddit API.

Press `Create app`. Congratulations, you have just created your first app!

You should now see something like the following.

<div style="text-align:center"><img src="png/reddit_app2.png"/></div>

In the picture above I hid some information because those are the credentials (authorization details) you will use to tell the Reddit API who you are. You should never share them with anyone, even your spouse or a firefighter! That is because they serve to identify you. If someone misuses them it will be on you.

## Storing your credentials

There are multiple ways to store your credentials and passwords safely. We don't want them to be compromised, right? However, it is one thing to store them [safely](https://youtu.be/MnjQV--o1-0?si=hIlgl9sCyt4JhVUd) and another to have [strong passwords](https://youtu.be/mQ36sUT77qI?si=hxRw4O4UxKM_WUPy). We all know that we should use strong passwords, but do we really know why? The picture below shows how fast one can crack your password depending on its complexity.

<div style="text-align:center"><img src="png/password_table_2023.jpg"/></div>

Anyhow, the lesson we should take from the table above is twofold:

1. Use strong passwords.
2. Use password managers to propose strong passwords and store them.

If, for any reason, you are still reluctant to trust password managers, at least create complex passwords by mixing nonsense words (it is the only place where making spelling errors helps) and special characters, for example:

>`$eating#keyborads-1ncreases_staminA`

In our case, we have already generated passwords and credentials which look pretty strong. How are we going to store them?

### Environmental variables

As you probably rightly suspect, in our case, we will need our credentials to connect to the API. We don't really want to store them in the notebook because we want to be able to share the notebook (you want to share it with me and I want to share it with you). We don't want to copy and paste them every time we want to use the notebook because it would be very inefficient. Also, it would be quite easy to forget about it. What are we going to do then?

We are going to use something called environmental variables. In other words, we are going to define some variables, either on our computer or in Colab, that will be stored there. In the notebook, we will just retrieve them by their names. In Colab, click the key icon in the left-hand sidebar to define them. We need to define 5 variables:

1. `username` -- our Reddit username.
2. `password` -- our Reddit password.
3. `client_id` -- the string right below the name of our app.
4. `client_secret` -- the string labeled secret.
5. `user_agent` -- `<name of OS>:<client_id>:<version of the app> (by /u/<username>)`, in my case it is `macos:<client_id>:v1 (by /u/profesor_floretu)`
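
A user agent string in the shape recommended by Reddit's API rules, `<platform>:<app ID>:<version string> (by /u/<reddit username>)`, can be assembled with a small helper. This is only a sketch: the function name `build_user_agent` and its parameter names are made up for illustration, and the values passed to it are placeholders, not real credentials.

```python
def build_user_agent(platform: str, client_id: str, version: str, username: str) -> str:
    """Assemble a descriptive user agent string from its four parts."""
    return f"{platform}:{client_id}:{version} (by /u/{username})"


# Placeholder values only; substitute the details of your own app.
example = build_user_agent("macos", "my_client_id", "v1", "profesor_floretu")
print(example)
```

The resulting string is what you would store under the `user_agent` name.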

In [None]:
## Load module
from google.colab import userdata

## Retrieve our environmental variables and assign them to names.
client_id = userdata.get("client_id")
client_secret = userdata.get("client_secret")
password = userdata.get("password")
user_agent = userdata.get("user_agent")
username = userdata.get("username")
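
Outside Colab the `google.colab` module is not available. One possible fallback, sketched below, reads the same five names from ordinary operating-system environment variables through `os.environ`. The helper name `load_credentials` is hypothetical, and the sketch assumes the variables were defined under exactly the names listed above.

```python
import os
from typing import Mapping, Optional

REQUIRED = ("client_id", "client_secret", "password", "user_agent", "username")


def load_credentials(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return the five credentials from `env`, failing loudly if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if name not in env]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED}


# Demonstration with a plain dictionary standing in for os.environ.
fake_env = {name: f"dummy-{name}" for name in REQUIRED}
credentials = load_credentials(fake_env)
print(sorted(credentials))
```

Calling `load_credentials()` with no argument would read from the real environment. Failing early on a missing name is a design choice here: it is easier to debug than a connection that is rejected later.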

## PRAW

So, now that we finally have our credentials in the notebook, what are we going to do next? We need to pass them somehow through a request to the Reddit API, right? Intuitively, we would do it through a payload and the `requests` module, right? Yes, this is a good intuition, but fortunately we don't really have to do it this way. That is because most social media platforms have so-called wrappers. Those are modules that allow us to connect to the API and send requests. We could still do it through our web browser but the URL would be much more complicated.

That is why, in the case of Reddit, we will use the PRAW (Python Reddit API Wrapper) module. It will serve us to connect and get data from Reddit.
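
For comparison, here is roughly what the manual route would involve. The sketch below only builds the description of the token request and does not send anything. It assumes Reddit's OAuth2 password flow for script apps (a POST to the `access_token` endpoint with the client ID and secret as HTTP basic auth). The function name `build_token_request` and all values are placeholders invented for illustration.

```python
def build_token_request(client_id, client_secret, username, password, user_agent):
    """Describe the POST request that asks Reddit for an access token."""
    return {
        "url": "https://www.reddit.com/api/v1/access_token",
        "auth": (client_id, client_secret),  # sent as HTTP basic auth
        "data": {
            "grant_type": "password",
            "username": username,
            "password": password,
        },
        "headers": {"User-Agent": user_agent},
    }


# Placeholder credentials; nothing is sent over the network here.
request_kwargs = build_token_request(
    "my_id", "my_secret", "my_user", "my_pass", "macos:my_id:v1 (by /u/my_user)"
)
# With the `requests` package installed this could be sent as:
#   requests.post(**request_kwargs)
print(request_kwargs["url"])
```

After that you would still have to extract the token from the response and attach it to every later request, which is exactly the bookkeeping a wrapper takes off our hands.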

In [None]:
## Install praw module
!pip install praw

In [None]:
## Import module
import praw

## Connect to Reddit
reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    password=password,
    user_agent=user_agent,
    username=username,
    check_for_async=False,
)

### Submissions

Once we have established a connection to Reddit let's now try to get some data. We will start with searching for submissions.

In [None]:
## Get top submissions from r/funfacts
submissions = reddit.subreddit("funfacts").top()
submissions

Ok, we got something. But I guess it is not exactly what we had expected. It is neither a list of strings nor the status code we are used to. In the case of PRAW, we don't get the status code, which in my opinion is a bit of a mistake, but we can't do anything about it. Unless we want to create our own module -- which we don't want to do.

Anyhow, what the above message means is that `submissions` is a generator. A generator in _Python_ is a special type of object. It does not hold all of its elements at once; it produces them one by one as you iterate over it. Moreover, you can iterate over it only once.

In [None]:
## Iterate over top submissions from r/funfacts
for n, item in enumerate(submissions):
    print(f"Submission {n} is titled {item.title}")

Let's try to do it again.

In [None]:
## Iterate over top submissions from r/funfacts once more
for n, item in enumerate(submissions):
    print(f"Submission {n} is titled {item.title}")

How come? We did exactly the same thing but this time it returned nothing. That is because generators work like that: you can go over them only once. It is a bit like a list of Snapchats. Generators are quite useful because they do not store their whole content in memory. Each element appears when we iterate over it and disappears afterward. This is very useful when we have a very long list and we want to perform an operation on every single element.
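
The same single-pass behavior can be reproduced without Reddit at all. The sketch below uses a tiny hand-written generator with made-up titles; it is an analogy for what happens above, not a copy of PRAW's internals.

```python
def titles():
    """Yield a few made-up titles one at a time."""
    for title in ("first", "second", "third"):
        yield title


stream = titles()
first_pass = [t for t in stream]   # consumes the generator
second_pass = [t for t in stream]  # nothing is left to yield

print(first_pass)   # ['first', 'second', 'third']
print(second_pass)  # []
```

To go over the elements again you have to call `titles()` once more, or keep the results in a list.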

In our case, we are not going to be dealing with very long lists. We do not have to worry though because it is possible to store those elements in a list. For example, we can use list comprehension to put all elements of the generator in a list.

In [None]:
## Store top submissions from r/funfacts in a list
submissions = [item for item in reddit.subreddit("funfacts").top()]

In [None]:
## Check the length of the list
len(submissions)

In [None]:
## Let's investigate the first element of the list
submissions[0]

Again, what we got is a bit underwhelming. We expected a dictionary or at least something similar. What we got is an object called submission with its unique id.

I have already done the work for you: each submission consists of the following fields:

* `author` -- provides an instance of Redditor.
* `author_flair_text` -- the text content of the author’s flair, or None if not flared. In simple terms, a flair on Reddit is a kind of tag added to either post or username. They are meant to categorize posts or users.
* `clicked` -- whether or not the submission has been clicked by the client.
* `comments` -- provides an instance of CommentForest.
* `created_utc` -- time the submission was created, represented in Unix Time.
* `distinguished` -- whether or not the submission is distinguished.
* `edited` -- whether or not the submission has been edited.
* `id` -- ID of the submission.
* `is_original_content` -- whether or not the submission has been set as original content.
* `is_self` -- whether or not the submission is a selfpost (text-only).
* `link_flair_template_id` -- the link flair’s ID.
* `link_flair_text` -- the link flair’s text content, or None if not flared.
* `locked` -- whether or not the submission has been locked.
* `name` -- full name of the submission.
* `num_comments` -- the number of comments on the submission.
* `over_18` -- whether or not the submission has been marked as NSFW.
* `permalink` -- a permalink for the submission.
* `poll_data` -- a PollData object representing the data of this submission, if it is a poll submission.
* `saved` -- whether or not the submission is saved.
* `score` -- the number of upvotes for the submission.
* `selftext` -- the submissions’ selftext - an empty string if a link post.
* `spoiler` -- whether or not the submission has been marked as a spoiler.
* `stickied` -- whether or not the submission is stickied.
* `subreddit` -- provides an instance of Subreddit.
* `title` -- the title of the submission.
* `upvote_ratio` -- the percentage of upvotes from all votes on the submission.
* `url` -- the URL the submission links to, or the permalink if a selfpost.

In [None]:
## We can access each field by simply using dot notation
## In this particular case we get number of comments
submissions[0].num_comments
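
If you would rather discover the fields yourself than rely on the list above, Python's built-in `vars()` returns an object's attributes as a dictionary. The sketch below uses a `SimpleNamespace` stand-in with invented values so that it runs without a connection. With a real PRAW submission, my understanding is that fields are loaded lazily, so you may need to access one attribute (for example `title`) before calling `vars()`. The helper name `public_fields` is hypothetical.

```python
from types import SimpleNamespace


def public_fields(obj) -> list:
    """List the attribute names of `obj` that do not start with an underscore."""
    return sorted(name for name in vars(obj) if not name.startswith("_"))


# Stand-in object with invented values, not real Reddit data.
fake_submission = SimpleNamespace(
    title="A made-up title", num_comments=3, score=10, _fetched=True
)
fields = public_fields(fake_submission)
print(fields)  # ['num_comments', 'score', 'title']
```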

## Exercise

Create a list of dictionaries for the above-collected submissions. Each of the dictionaries should look something like the following.

```python
{ 'title' : 'Why is there "e" in the middle of the word Wednesday?', 'created_utc' : 24587.0, 'num_comments' : 22234, 'selftext' : '', 'score' : 3525 }
```

In [None]:
## YOUR CODE

There are quite a few things to unpack here. Let's start with the date. Instead of a human-readable date there is a `float`. This format is called [epoch time](https://en.wikipedia.org/wiki/Unix_time). It counts the seconds elapsed since 00:00:00 UTC on January 1st, 1970, the beginning of the Unix Epoch. On systems that store this count as a signed 32-bit integer, the counter runs out at [03:14:07 UTC on 19 January 2038](https://en.wikipedia.org/wiki/Year_2038_problem). To convert this strange format into something more human-readable we will use two functions from the `datetime` module:

* `datetime.fromtimestamp()` converts a float to a `datetime` object. Unless you pass a time zone, the result is expressed in the local time of the machine running the code.
* `datetime.strftime()` is a method of a `datetime` object that converts it into a string in a given format.

The usage of both of them would look something like this:

In [None]:
## From the datetime module we import datetime
from datetime import datetime

date_epoch = 24587.0
pattern = "%d-%m-%Y %H:%M:%S"
date_human = datetime.fromtimestamp(date_epoch).strftime(pattern)
date_human

Let's create a function that will do this for us, because it seems we will be converting epoch time to a human-readable format quite a lot.

In [None]:
from datetime import datetime


def convert_date(date_float: float) -> str:
    """
    Takes a date in epoch time format and converts it into a string in human-readable date format.

    Parameters:
    -----------
        date_float (float): a float representing a date in epoch time format.

    Returns:
    --------
        (str) : a string representing a date in human-readable format.
    """
    return datetime.fromtimestamp(date_float).strftime("%d-%m-%Y %H:%M:%S")
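
One thing to keep in mind: without a time zone argument, `datetime.fromtimestamp` uses the local time of the machine, so the same timestamp can print a different hour on a laptop than on a server set to UTC. The variant below is a sketch that pins the result to UTC. The name `convert_date_utc` is made up for illustration and is not part of the notebook's original code. It also shows where the 32-bit limit mentioned earlier comes from.

```python
from datetime import datetime, timezone

PATTERN = "%d-%m-%Y %H:%M:%S"


def convert_date_utc(date_float: float) -> str:
    """Convert epoch time to a string that is always expressed in UTC."""
    return datetime.fromtimestamp(date_float, tz=timezone.utc).strftime(PATTERN)


print(convert_date_utc(24587.0))    # 01-01-1970 06:49:47
print(convert_date_utc(2**31 - 1))  # 19-01-2038 03:14:07 (largest signed 32-bit value)
```

Since the field is called `created_utc`, reporting it in UTC keeps results comparable between machines; whether that or local time is more convenient depends on the analysis.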

## Exercise

Convert epoch time in our list of dictionaries into a human-readable format. Each element of your list should look like the following (the exact hour depends on the time zone of the machine running the code).

```python
{ 'title' : 'Why is there "e" in the middle of the word Wednesday?', 'created_utc' : '01-01-1970 07:49:47', 'num_comments' : 22234, 'selftext' : '', 'score' : 3525 }
```

In [None]:
## YOUR CODE

### Author

So far we have ignored probably one of the most interesting fields -- `author`. That was on purpose. We ignored it because it is not one of our familiar types. It is, again, a slightly strange object.

In [None]:
## Let's investigate it
submissions[0].author

Again, we only see that this object is called `Redditor` and apparently the username of the user who added the submission. If we want to see more information about the user we need to use dot notation. Below you will find the list of the fields we can investigate:

* `comment_karma` -- the comment karma for the Redditor.
* `comments` -- provides an instance of SubListing for comment access.
* `submissions` -- provides an instance of SubListing for submission access.
* `created_utc` -- time the account was created, represented in Unix Time.
* `has_verified_email` -- whether or not the Redditor has verified their email.
* `icon_img` -- the url of the Redditors’ avatar.
* `id` -- the ID of the Redditor.
* `is_employee` -- whether or not the Redditor is a Reddit employee.
* `is_friend` -- whether or not the Redditor is friends with the authenticated user.
* `is_mod` -- whether or not the Redditor mods any subreddits.
* `is_gold` -- whether or not the Redditor has active Reddit Premium status.
* `is_suspended` -- whether or not the Redditor is currently suspended.
* `link_karma` -- the link karma for the Redditor.
* `name` -- the Redditor’s username.
* `subreddit` -- if the Redditor has created a user-subreddit, provides a dictionary of additional attributes. See below.
* `subreddit["banner_img"]` -- the URL of the user-subreddit banner.
* `subreddit["name"]` -- the fullname of the user-subreddit.
* `subreddit["over_18"]` -- whether or not the user-subreddit is NSFW.
* `subreddit["public_description"]` -- the public description of the user-subreddit.
* `subreddit["subscribers"]` -- the number of users subscribed to the user-subreddit.
* `subreddit["title"]` -- the title of the user-subreddit.

In [None]:
## We can, for example, see the submissions posted by the author.
## The idea to get their titles is very similar to the one above.
## However, this time we are starting from the level of a Redditor.
author_submissions = [item.title for item in submissions[0].author.submissions.new()]
author_submissions

In [None]:
## Or author karma
submissions[0].author.comment_karma
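
One practical caveat: as far as I know, when an account has been deleted the `author` field is `None`, so dot notation on it raises an `AttributeError`. A defensive helper could look like the sketch below. The name `author_name` is hypothetical, and the objects are `SimpleNamespace` stand-ins with invented values rather than real PRAW objects.

```python
from types import SimpleNamespace


def author_name(item) -> str:
    """Return the author's username, or '[deleted]' when there is no author."""
    author = getattr(item, "author", None)
    return "[deleted]" if author is None else author.name


# Stand-ins for a submission with and without an author.
with_author = SimpleNamespace(author=SimpleNamespace(name="profesor_floretu"))
without_author = SimpleNamespace(author=None)

print(author_name(with_author))     # profesor_floretu
print(author_name(without_author))  # [deleted]
```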

In terms of looking for submissions of our interest, there are quite a few options. 

* `reddit.subreddit('unitedkingdom').top()` -- returns the top 100 submissions.
* `reddit.subreddit('unitedkingdom').hot()` -- returns the 100 hottest submissions.
* `reddit.subreddit('unitedkingdom').new()` -- returns the 100 newest submissions.
* `reddit.subreddit('unitedkingdom').rising()` -- returns 100 submissions that are on the rise.

You can get more submissions if you add the argument `limit=None`. Unfortunately, even then you can't get more than about 1000 submissions per listing.
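
The `limit` argument also works in the other direction, and `top()` additionally accepts a `time_filter` argument (for example `"week"` or `"month"`). The helper below is a minimal sketch with a hypothetical name, `top_titles`; it expects an already connected `reddit` object like the one created earlier.

```python
def top_titles(reddit, subreddit_name: str, limit: int = 10, time_filter: str = "week") -> list:
    """Collect the titles of the top submissions of a subreddit."""
    listing = reddit.subreddit(subreddit_name).top(time_filter=time_filter, limit=limit)
    return [submission.title for submission in listing]


# Example call (needs the connection established above):
#   top_titles(reddit, "funfacts", limit=5, time_filter="month")
```

Asking only for what you need keeps the number of requests down, which matters because the API is rate limited.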

You can also find submissions by a given phrase. For example:

In [None]:
## Get submissions about Jeremy Sochan from r/nba
js_submissions = [
    item for item in reddit.subreddit("nba").search("Jeremy Sochan", limit=None)
]
len(js_submissions)

In [None]:
## Get submissions about Jeremy Sochan from all Reddit
js_submissions = [item for item in reddit.subreddit("all").search("Jeremy Sochan")]
for item in js_submissions:
    print(f"Subreddit: {item.subreddit.title}, Title of the submission: {item.title}")

## Exercise

From the submissions about Jeremy Sochan, find the submission with the biggest number of comments.

In [None]:
## YOUR CODE

### Comments

The last important element of data from Reddit is comments. Similarly to author information, they are a bit of a peculiar data object.

In [None]:
## Submission about the best football player in the world -- Aitana Bonmati
## Note that I am getting a specific submission using its id but I could
## have used url.
submission = reddit.submission("17k3qrj")

In [None]:
## Let's see how many comments we have under this submission
num_comments = submission.num_comments
num_comments

In [None]:
## Examine the comments object
submission.comments


This seems to be a strange object we have not seen yet -- `CommentForest`. The good thing about it is that we can iterate over it.

In [None]:
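## Iterate over comments object, skipping `MoreComments` placeholders
## (they stand for not-yet-loaded comments and have no body).
from praw.models import MoreComments
for n, comment in enumerate(submission.comments):
    if isinstance(comment, MoreComments):
        continue
    ## Print body of the comment and its number.
    print(f"Comment {n}: {comment.body}")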

What we see at first glance is that we did not get the expected number of comments. We got around 40 and there should be more than 100. What happened? The reason is very simple. Comments on Reddit, as on most forums and social media, are organized in a structure called a comment tree. It means that comments can be nested in one another.

```bash
├── Aitana Bonmati is the best player ever!!11
├── Have you seen the assist she provided to Claudia Pina in the last game against Real Madrid?
│   ├── Simply brilliant!
│   ├── Yeah, it was a good run but where was Olga Cardona when we needed her?
└── IMHO Alexia Putellas is much better.
    └── Without Jenny Hermoso and Salma Paralluelo she would not have won the World Cup!
        └── Olga Carmona!!11
```
When we just iterate over comments we only get the top-level comments. In other words, if the comments section looked like the above, we would get three comments instead of seven.
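
The difference between the two counts can be reproduced with a toy version of the tree above. In this sketch each comment is a `(text, replies)` pair with shortened texts, and the recursive `flatten` function is a hypothetical helper written for illustration; it is not how PRAW stores comments.

```python
# A toy comment tree mirroring the diagram: (text, list of replies).
tree = [
    ("Aitana Bonmati is the best player ever", []),
    ("Have you seen the assist", [
        ("Simply brilliant!", []),
        ("Yeah, it was a good run", []),
    ]),
    ("IMHO Alexia Putellas is much better", [
        ("Without Jenny Hermoso and Salma Paralluelo", [
            ("Olga Carmona!!11", []),
        ]),
    ]),
]


def flatten(comments) -> list:
    """Return every comment text in the tree, parents before their replies."""
    flat = []
    for text, replies in comments:
        flat.append(text)
        flat.extend(flatten(replies))
    return flat


top_level = [text for text, _ in tree]
everything = flatten(tree)
print(len(top_level), len(everything))  # 3 7
```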

There is an easy way to deal with it.

In [None]:
## Replace all `MoreComments` placeholders so that every comment is loaded
submission.comments.replace_more(limit=None)

## Get a flat list of all comments instead of just
## the top-level comments.
comments = submission.comments.list()

Great, we should now have all the comments! Let's check the number of elements we have.

In [None]:
## Check the number of collected comments
len(comments)

In this particular case, we got exactly the same number. However, this is not always the case. Very often, the value of `submission.num_comments` may not match the number of comments extracted via PRAW. This difference is normal as that count includes deleted, removed, and spam comments.

Let's now see what elements each comment has. It has a very similar structure to submission and author objects with which we are now familiar.

* `author` -- provides an instance of Redditor.
* `body` -- the body of the comment, as Markdown.
* `body_html` -- the body of the comment, as HTML.
* `created_utc` -- time the comment was created, represented in Unix Time.
* `distinguished` -- whether or not the comment is distinguished.
* `edited` -- whether or not the comment has been edited.
* `id` -- the ID of the comment.
* `is_submitter` -- whether or not the comment author is also the author of the submission.
* `link_id` -- the submission ID that the comment belongs to.
* `parent_id` -- the ID of the parent comment (prefixed with t1_). If it is a top-level comment, this returns the submission ID instead (prefixed with t3_).
* `permalink` -- a permalink for the comment. Comment objects from the inbox have a context attribute instead.
* `replies` -- provides an instance of CommentForest.
* `saved` -- whether or not the comment is saved.
* `score` -- the number of upvotes for the comment.
* `stickied` -- whether or not the comment is stickied.
* `submission` -- provides an instance of Submission. The submission that the comment belongs to.
* `subreddit` -- provides an instance of Subreddit. The subreddit that the comment belongs to.
* `subreddit_id` -- the subreddit ID that the comment belongs to.
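
The `t1_` and `t3_` prefixes mentioned for `parent_id` make it possible to tell whether a comment replies to another comment or directly to the submission. The helper below is a small sketch; the name `split_fullname` is made up, only the two prefixes from the list above are handled, and `abc123` is an invented ID.

```python
def split_fullname(fullname: str) -> tuple:
    """Split a prefixed ID such as 't3_17k3qrj' into (kind, bare ID)."""
    prefix, _, base_id = fullname.partition("_")
    kinds = {"t1": "comment", "t3": "submission"}
    return kinds.get(prefix, "unknown"), base_id


print(split_fullname("t3_17k3qrj"))  # ('submission', '17k3qrj')
print(split_fullname("t1_abc123"))   # ('comment', 'abc123')
```

A comment whose parent kind is `submission` is a top-level comment; any other comment is a reply.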

## Exercise

Find all comments that were submitted on exactly the same day as the submission was posted.

In [None]:
## YOUR CODE