# **1. Install PRAW**

In [2]:
#!pip install praw
import praw
import pandas as pd
import numpy as np
import datetime
import os

# **2. Create a Reddit App**

To access the Reddit API, you'll need to create an application on Reddit and obtain your API credentials. Follow these steps:

- Go to the Reddit website (https://www.reddit.com/) and log in to your account. Feel free to create a throwaway account for this project!
- Navigate to the Reddit Apps page (https://www.reddit.com/prefs/apps).
- Click the "are you a developer? create an app..." button in the top left.
- Provide a name for your app (e.g., "PRAW"), select the app type ('script'), and optionally add a description. Use http://localhost:8080 as your redirect URI.
- After submitting the form, you will reach a page that looks like the following image. You'll see your application's details, including the client ID and client secret. Keep these credentials handy for the next step.


![Praw](https://www.honchosearch.com/hubfs/Imported_Blog_Media/Client-ID-Client-Secret.png)

# **3. Initialize PRAW**

In [3]:
reddit = praw.Reddit(
    client_id='YOUR_CLIENT_ID',
    client_secret='YOUR_CLIENT_SECRET',
    user_agent='YOUR_USER_AGENT',
    username='YOUR_REDDIT_USERNAME',
    password='YOUR_REDDIT_PASSWORD'
)

Replace 'YOUR_CLIENT_ID', 'YOUR_CLIENT_SECRET', 'YOUR_USER_AGENT', 'YOUR_REDDIT_USERNAME', and 'YOUR_REDDIT_PASSWORD' with your actual Reddit API credentials.

Your user agent is a string that Reddit uses to identify the source of requests. You can make it whatever you want, but choose something descriptive and unique; including your Reddit username is recommended.

**I have removed my own credentials from this workbook. We can show you how to hide your credentials before submitting the project! The following code will need your own credentials in order to successfully work.**
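
One common way to keep credentials out of a notebook is to read them from environment variables. The sketch below assumes you have set five variables yourself. The variable names and the helper `load_reddit_credentials` are hypothetical choices, not something PRAW requires:

```python
import os

# Hypothetical variable names; pick whatever names suit your setup.
REQUIRED_VARS = (
    "REDDIT_CLIENT_ID",
    "REDDIT_CLIENT_SECRET",
    "REDDIT_USER_AGENT",
    "REDDIT_USERNAME",
    "REDDIT_PASSWORD",
)

def load_reddit_credentials(env=None):
    """Return keyword arguments for praw.Reddit, read from environment variables."""
    env = os.environ if env is None else env
    # Fail early with a clear message if anything is unset or empty.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    # Map REDDIT_CLIENT_ID -> client_id, REDDIT_PASSWORD -> password, and so on.
    return {name[len("REDDIT_"):].lower(): env[name] for name in REQUIRED_VARS}
```

With the variables set, the client could be created with `reddit = praw.Reddit(**load_reddit_credentials())`, so no secret ever appears in the notebook itself.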

# **4. Take a look at the documentation [here](https://praw.readthedocs.io/)!**

In [4]:
# Below is JUST an example of how you can use PRAW

# Choose your subreddit
#subreddit = reddit.subreddit('politics')

# Adjust the limit as needed -- Note that this will grab the first 10 posts from the subreddit's "hot" listing
#posts = subreddit.hot(limit=10)

## NOTE
- Reddit API Limitations: The Reddit API imposes limitations on the number of posts you can retrieve in a single request. The maximum number of posts per request is typically 100. Therefore, if you set the limit parameter to a value greater than 100, PRAW will make multiple requests behind the scenes to fetch the desired number of posts.
- Rate Limiting: The Reddit API also enforces rate limits to prevent abuse and ensure fair usage. The specific rate limits can vary depending on factors such as your Reddit account's age and karma. As a standard user, you're typically allowed to make 60 requests per minute. If you exceed the rate limit, you may receive an error response until the rate limit resets.
- TIP: You can use the created_utc attribute of a post to keep track of the timestamp and ensure non-overlapping pulls. The created_utc attribute represents the post's creation time in UTC.
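
The first and third notes can be sketched in a few lines. This is only an illustration: it assumes the page size of 100 stated above, and it assumes each saved post is a dictionary with a `created_utc` key. Both helper names are hypothetical:

```python
import math

def requests_needed(limit, page_size=100):
    """Estimate how many listing requests are needed to fetch `limit` posts."""
    return math.ceil(limit / page_size)

def only_newer(rows, last_seen_utc):
    """Keep rows created strictly after the newest timestamp already saved."""
    return [row for row in rows if row["created_utc"] > last_seen_utc]

# Toy data standing in for two overlapping pulls.
already_saved = [{"created_utc": 100.0}, {"created_utc": 200.0}]
new_pull = [{"created_utc": 150.0}, {"created_utc": 200.0}, {"created_utc": 250.0}]

last_seen = max(row["created_utc"] for row in already_saved)
fresh = only_newer(new_pull, last_seen)
```

Here `fresh` contains only the post with timestamp `250.0`, because the other two are not newer than the latest post already saved.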

In [7]:
subreddit_name = 'politics'
subreddit = reddit.subreddit(subreddit_name)

posts = subreddit.top(time_filter='week', limit=30)
# helper that turns one post into a row; reused for different listing calls
def process_post(post):
    try:
        # expand every "load more comments" placeholder so the full comment tree is available;
        # each expansion is an extra API request, so large threads can trigger rate limiting
        post.comments.replace_more(limit=None)
        # collect [body, score] pairs and sort them by score, highest first
        comments_sorted = sorted(
            [[comment.body, comment.score] for comment in post.comments.list()],
            key=lambda x: x[1], reverse=True
        )
        num_comments = len(comments_sorted)
        # return the desired attributes as one row
        return [post.created_utc, post.title, post.selftext, post.subreddit.display_name, comments_sorted, num_comments]
    # catch errors (for example a 429 when Reddit rate limits the request) and return an empty row
    except Exception as e:
        print(f"Error processing post {post.id}: {e}")
        return [None] * 6

data = []
#building a list of rows by calling process_post on each post
for post in posts:
    data.append(process_post(post))

#creating the columns for my data and converting the list of rows into a dataframe
columns = ['created_utc', 'title', 'selftext', 'subreddit', 'comments', 'num_comments']
politics = pd.DataFrame(data, columns=columns)

#saving my data using the exact datetime so my data can't be overwritten
current_time = datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
directory = 'scraping_data'
if not os.path.exists(directory):
    os.mkdir(directory)
    print(f"Directory '{directory}' created.")
else:
    print(f"Directory '{directory}' already exists.")
filename = f'{directory}/{subreddit_name}_top_past_wk_{current_time}.csv'

politics.to_csv(filename, index=False)

#printing my data once my code finishes
print(f"DataFrame saved to {filename}")
print(politics.head(5))

Error processing post 1dy8qjl: received 429 HTTP response
Error processing post 1dv9zdj: received 429 HTTP response
Error processing post 1dvum00: received 429 HTTP response
Error processing post 1dz1lk8: received 429 HTTP response
Error processing post 1dwomir: received 429 HTTP response
Error processing post 1duo2mp: received 429 HTTP response
Error processing post 1dv38zy: received 429 HTTP response
Error processing post 1dudhc8: received 429 HTTP response
Error processing post 1dubuqz: received 429 HTTP response
Error processing post 1dv810w: received 429 HTTP response
Error processing post 1dyk7kv: received 429 HTTP response
Error processing post 1dx20a6: received 429 HTTP response
Error processing post 1dxf1bw: received 429 HTTP response
Error processing post 1dy1s0q: received 429 HTTP response
Error processing post 1dw5jra: received 429 HTTP response
Error processing post 1dv555m: received 429 HTTP response
Error processing post 1dxhbt2: received 429 HTTP response
Error processi
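
The 429 responses above mean that Reddit rejected the requests for arriving too quickly. One possible way to cope is to wait and retry with an increasing delay. The helper below is a generic sketch, not part of PRAW. Its name, its defaults, and the choice to retry on any `Exception` are assumptions you would likely want to narrow:

```python
import time

def call_with_backoff(func, retries=4, base_delay=2.0, retry_on=(Exception,), sleep=time.sleep):
    """Call func(); on a retryable error wait base_delay * 2**attempt seconds, then try again."""
    for attempt in range(retries + 1):
        try:
            return func()
        except retry_on:
            # Give up and re-raise once the retry budget is spent.
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)
```

Because `process_post` catches its own exceptions, the retry would have to wrap the call that actually fails, for example `call_with_backoff(lambda: post.comments.replace_more(limit=None))`.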

In [56]:
politics.head()

| Unnamed: 0 | created_utc | title | selftext | subreddit | comments | num_comments |
|---|---|---|---|---|---|---|
| 0 | 1719492000.0 | Three female GOP state senators who filibuster... | | politics | [[\nAs a reminder, this subreddit [is for civi... | 447 |
| 1 | 1719499000.0 | Jack Smith brings receipts of vile threats aga... | | politics | [[\nAs a reminder, this subreddit [is for civi... | 190 |
| 2 | 1719502000.0 | MAGA Loses It Over Game-Changing Announcement ... | | politics | [[\nAs a reminder, this subreddit [is for civi... | 858 |
| 3 | 1719495000.0 | Tonight’s debate could be Trump’s last act: Th... | | politics | [[\nAs a reminder, this subreddit [is for civi... | 1193 |
| 4 | 1719510000.0 | MAGA Fumes Over New Microphone Rule at Biden-T... | | politics | [[\nAs a reminder, this subreddit [is for civi... | 330 |


In [38]:
politics.shape

(0, 4)

In [9]:
current_time = datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
print(current_time)

2024-07-09_09-59-02


In [51]:
directory = 'scraping_data'



Directory 'scraping_data' already exists.


In [52]:

filename = f'{directory}/politics_{current_time}.csv'


In [53]:
filename

'scraping_data/politics_2024-06-27_16-12-02.csv'

In [54]:
politics.to_csv(filename, index=False)

Remember, you will need to pull *at least* 1000 posts from each subreddit, not just the few dozen pulled in the examples above. As mentioned above, you can use the created_utc attribute of a post to keep track of the timestamp and ensure non-overlapping pulls. We will leave this work for you all to complete.
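
If you reach 1000 posts through several separate pulls, the saved files will probably overlap. Here is a minimal sketch of merging them with pandas. It assumes each pull has the columns used in this notebook, and it treats `created_utc` plus `title` as the duplicate key because no post ID is saved. The helper name `combine_pulls` is hypothetical:

```python
import pandas as pd

def combine_pulls(frames):
    """Stack several pulls, drop repeated posts, and order them by creation time."""
    combined = pd.concat(frames, ignore_index=True)
    combined = combined.drop_duplicates(subset=["created_utc", "title"])
    return combined.sort_values("created_utc").reset_index(drop=True)

# Toy frames standing in for two saved pulls that share one post.
pull_a = pd.DataFrame({"created_utc": [200.0, 100.0], "title": ["B", "A"]})
pull_b = pd.DataFrame({"created_utc": [200.0, 300.0], "title": ["B", "C"]})
all_posts = combine_pulls([pull_a, pull_b])
```

In practice the frames would come from `pd.read_csv` on the files in the `scraping_data` directory.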

Once you have at least 1000 posts from each subreddit, you can do some EDA (for example, finding the most common words in each subreddit). Eventually, you will want to combine your two dataframes for modeling.
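
As a starting point for the word-frequency idea, here is a small sketch using only the standard library. The stop-word list is a tiny illustrative one and the helper name `most_common_words` is hypothetical. A real analysis would likely use a fuller stop-word list and better tokenization:

```python
import re
from collections import Counter

# A tiny illustrative stop-word list, not a complete one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}

def most_common_words(texts, n=10):
    """Count lowercase words across all texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    return counts.most_common(n)

# Made-up titles, used only to show the call.
titles = ["The debate is on tonight", "Debate night: the debate"]
top_word = most_common_words(titles, n=1)
```

You could pass a dataframe column such as `politics['title']` with missing values dropped in place of the toy list.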

### Hopefully this is enough of a tutorial to help get you started! If you have any questions, let us know!

### Note: Rather than working in this template notebook, make a brand new "scraping" notebook (or script), with your own comments, so you can use this project in a portfolio!