# Workbook: Making your own Reddit scraper with PRAW

In this file we go step by step through the whole process of making a Reddit web scraper. If you want to create a dataset of any subreddit you like, you can simply fill in the empty code spaces. This way, any time you need a web scraper for Reddit you can come back to this file and fill everything in. We will also guide you through cleaning and inspecting the dataset. You can skip this part if you think it is not needed, although we highly recommend working through it to get a better understanding of your dataset.

If anything is unclear, you can look at the other file in our GitHub repository, where we created the dataset of r/Feminism. You can also consult the tutorial in our report.

## 1. Installing PRAW

First make sure PRAW is installed on your computer:

In [None]:
!pip install praw

## 2. Importing the PRAW and pandas libraries

Now we have to import praw and pandas to build the scraper and analyze the dataset.

In [None]:
import praw
import pandas as pd

## 3. Creating a Reddit App and connecting to the subreddit

Use https://www.reddit.com/prefs/apps to create a Reddit app. Choose 'Create App.' Here you can fill in a name (user agent), description and redirect URI. As described in the PRAW documentation (https://praw.readthedocs.io/en/latest/getting_started/authentication.html#script-application), you should choose http://localhost:8080 as your redirect URI.

For the name you should avoid using words like 'scraping' or 'bot.' Reddit may refuse your authorization if you use these words. Lastly, select 'script' (for personal use) and press 'create app.'

The client_id is a code which can be found underneath 'personal use script.' The client_secret can be found next to 'secret.' The user_agent is the name you chose yourself.

For our scraper we chose a read-only instance, stored in the variable 'reddit_read_only.' This means the scraper only gathers data.

For a more in-depth explanation of creating the Reddit app, see the tutorial section in our report or take a look here: https://towardsdatascience.com/scraping-reddit-data-1c0af3040768.

In [None]:
reddit_read_only = praw.Reddit(client_id="",       # your client id
                               client_secret="",   # your client secret
                               user_agent="")      # your user agent

# The name of the subreddit. If you want to scrape all subreddits use 'all'.
subreddit = reddit_read_only.subreddit("")

# With these lines of code you can check if PRAW is connected to the subreddit of your choice.

# Display the name of the Subreddit
print("Display Name:", subreddit.display_name)
 
# Display the title of the Subreddit
print("Title:", subreddit.title)
 
# Display the description of the Subreddit
print("Description:", subreddit.description)
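If you would rather not type your credentials into the notebook itself, one option is to read them from environment variables. The sketch below is only an illustration: the helper name `load_reddit_credentials` and the variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET` and `REDDIT_USER_AGENT` are assumptions made for this example, not something PRAW or our report prescribes. It runs here on placeholder values so that it works without a Reddit account.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Collect the three praw.Reddit keyword arguments from a mapping.

    The environment variable names are hypothetical; pick your own.
    """
    names = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
        "user_agent": "REDDIT_USER_AGENT",
    }
    # Report every missing variable at once instead of failing on the first.
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}


# Demonstration with placeholder values instead of real credentials.
placeholder_env = {
    "REDDIT_CLIENT_ID": "id-placeholder",
    "REDDIT_CLIENT_SECRET": "secret-placeholder",
    "REDDIT_USER_AGENT": "my-research-app",
}
credentials = load_reddit_credentials(placeholder_env)
print(sorted(credentials))
```

With real environment variables set, the result could be passed on as `praw.Reddit(**load_reddit_credentials())`.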

## 4. Scraping data and creating a dataset

Now it is time to actually gather the data and put it in a pandas dataframe. For this you have to follow the three steps as explained in our guide:

1. Make an empty list
2. Make a loop to append the desired values to your list. Think about the information you need: do you want usernames, titles, upvotes, the name of the subreddit, etc.? (PRAW collects them automatically.)
3. Make a pandas dataframe and specify the column names.

Think about the type of posts you need (for example top posts or hot posts) and how many of them (the limit).

Example assignment: You want to collect 50 top posts from all subreddits. For each post you also want to know the username, the title of the thread, the number of upvotes, the number of comments, the date of creation, the text in the post and the name of the subreddit.

In [None]:
posts = []

#your code here:
for post in ...:
    posts.append([...])
df = pd.DataFrame(posts,columns=[...])
print(df)
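To illustrate the three steps without giving away the assignment, here is a minimal sketch that runs offline. It uses stand-in objects instead of real Reddit posts, so `fake_posts`, `rows`, `example_df` and the column name `body` are all invented for this example. The attribute names (`title`, `score`, `num_comments`, `selftext`) follow the ones PRAW exposes on a post.

```python
from types import SimpleNamespace

import pandas as pd

# Stand-in posts so the sketch runs without a connection. With PRAW the loop
# would instead iterate over something like subreddit.top(limit=50).
fake_posts = [
    SimpleNamespace(title="First thread", score=120, num_comments=14, selftext="Some text"),
    SimpleNamespace(title="Second thread", score=45, num_comments=3, selftext=""),
]

# Step 1: make an empty list.
rows = []

# Step 2: loop over the posts and append the values you need.
for post in fake_posts:
    rows.append([post.title, post.score, post.num_comments, post.selftext])

# Step 3: make a dataframe and name the columns in the same order.
example_df = pd.DataFrame(rows, columns=["title", "score", "num_comments", "body"])
print(example_df)
```

The order of the values in `append` has to match the order of the column names, otherwise the data ends up under the wrong header.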

## 5. Inspecting and cleaning the dataset

It is important to know what is in the dataset you created. To find out, you can run a few simple pandas commands:

In [None]:
# Checking the number of rows and columns:
df.shape

In [None]:
# Checking the data type of each column:
df.dtypes

In [None]:
# Checking the number of non-null observations per column:
df.info()

You probably noticed that you cannot see the actual dates on which the posts were created. Let's change this.

Example assignment: Convert the created column to dates in a new column and drop the original created column.

In [None]:
import datetime as dt
df['...'] = pd.to_datetime(df['...'], utc=True, unit='s')
df = df.drop(columns=['...'])
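As a worked illustration of this conversion, the sketch below uses two made-up Unix timestamps (seconds since 1 January 1970, UTC). The dataframe `dates_df` and the column names `created` and `date` are assumptions for this example; use the names from your own dataset.

```python
import pandas as pd

# Stand-in data: creation times as Unix timestamps in seconds.
dates_df = pd.DataFrame({
    "title": ["First thread", "Second thread"],
    "created": [1609459200.0, 1612137600.0],
})

# unit="s" tells pandas the numbers are seconds; utc=True makes the result timezone-aware.
dates_df["date"] = pd.to_datetime(dates_df["created"], unit="s", utc=True)

# The raw timestamp column is no longer needed.
dates_df = dates_df.drop(columns=["created"])
print(dates_df)
```

Without `unit="s"` pandas would read the numbers as nanoseconds, and every date would land in January 1970.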

Now let's take another look at the observations of your dataset. Does it have any null values? It is likely that the column containing the text of the thread has some null values, as Reddit users can post threads without text.

Example assignment: Create an overview of the rows with missing values for this column and think about how this affects your dataset and further research. Does it matter? How can you interpret this?

In [None]:
df[df.isnull().any(axis=1)]

Example assignment: Now let's say you only want a dataset with posts that actually contain text. Create a new dataframe that filters the other posts out.

In [None]:
body_df = df[df['...'].notna()]
body_df
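One thing to check in your own data: the PRAW documentation describes the text of a link post as an empty string rather than a missing value, so a freshly scraped dataframe may contain `""` where you expect nulls, and `notna()` alone would keep those rows. The sketch below shows one possible way to treat both cases as "no text". The dataframe `text_df` and the column name `body` are made up for this example.

```python
import pandas as pd

# Stand-in data with a real text, an empty string, whitespace only, and a missing value.
text_df = pd.DataFrame({
    "title": ["Text post", "Link post", "Blank post", "Missing post"],
    "body": ["Some text", "", "   ", None],
})

# Treat missing values as empty, strip whitespace, then keep non-empty texts.
has_text = text_df["body"].fillna("").str.strip() != ""
with_text_df = text_df[has_text]
print(with_text_df)
```

Whether whitespace-only posts should count as empty is a judgment call; drop the `.str.strip()` step if you want to keep them.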

Now it's time to take a closer look at the values of your dataset.

Example assignment: Interpret the values. What is the average number of upvotes? What is the maximum and minimum? The same goes for the comments.

In [None]:
df.describe(include='all')

Example assignment: As you saw in our guide, it is very likely that the title column does not consist of unique values only. Check this for yourself. If this is the case with your dataset as well, look at the rows with duplicates. How can you interpret the duplicates? Do you need to remove them from your dataset?

In [None]:
df[df.duplicated(subset=['title'])]
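Note that `duplicated` by default marks only the second and later copies of a title, not the first one. The sketch below, on a made-up dataframe `dup_df`, shows the difference between the default and `keep=False`, and one way to drop the extra copies if you decide you do not need them.

```python
import pandas as pd

# Stand-in data with one title that appears twice.
dup_df = pd.DataFrame({
    "title": ["Weekly thread", "A question", "Weekly thread"],
    "score": [10, 5, 7],
})

# Default (keep="first"): only the later copies are marked.
later_copies = dup_df[dup_df.duplicated(subset=["title"])]

# keep=False: every row whose title occurs more than once is marked.
all_copies = dup_df[dup_df.duplicated(subset=["title"], keep=False)]

# Keep the first occurrence of each title and drop the rest.
deduplicated = dup_df.drop_duplicates(subset=["title"], keep="first")

print(all_copies)
```

Looking at `all_copies` makes it easier to compare the duplicates side by side before deciding whether they are reposts or recurring threads.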

## 6. Saving your dataset to a CSV file

Now it's time to save your dataset to a CSV file:

In [None]:
df.to_csv("...", index=True)
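With `index=True` the row index is written as an extra first column without a name. If you load the file again later, you may want to tell pandas that this column is the index. The sketch below shows the round trip on a made-up dataframe `csv_df`, writing to an in-memory buffer instead of a file so that it leaves nothing on your disk.

```python
import io

import pandas as pd

# Stand-in data.
csv_df = pd.DataFrame({"title": ["First thread", "Second thread"], "score": [10, 3]})

# Write the CSV to memory; a file name would work the same way.
buffer = io.StringIO()
csv_df.to_csv(buffer, index=True)
csv_text = buffer.getvalue()
print(csv_text)

# index_col=0 turns the unnamed first column back into the index.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored)
```

If you do not need the index in the file at all, `index=False` leaves it out.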