# Reddit Scraping

### API vs Scraping?

`pip install praw`

Helpful resources:

The official documentation of praw:
https://praw.readthedocs.io/en/stable/getting_started/quick_start.html

https://www.geeksforgeeks.org/scraping-reddit-using-python/

In [1]:
import praw
import pandas as pd
from praw.models import MoreComments
import datetime

### Before Starting
(First time instructions)

0. Familiarise yourself with what Reddit is and how it works: its structure, and what posts, subreddits, communities, comments, etc. are.

1. Reddit Account:

You will need a Reddit account to scrape Reddit using praw. The account is required to create a Reddit app.

2. Creating a Reddit app (your key to the Reddit API):

Since we are using Reddit's public API, we need to create an app with Reddit, which gives us the credentials to access the API. Create an app at https://www.reddit.com/prefs/apps and follow the instructions in the geeksforgeeks link listed under Helpful resources above.

### Initialising the account details
Creating a Reddit instance

This gives your credentials to the API so that it can authenticate you and allow you access.

### praw.ini file
Since we are supplying our credentials to the Reddit API, pasting them directly into our code is risky, because you may expose your credentials when you share your code. So we create a separate file called `praw.ini` in the same directory as this code file and put the credentials there. Whenever we share the code, we should not share the praw.ini file. Whoever runs the code should create their own praw.ini file.
(I'm attaching my praw.ini file as well, just so you can see how it is supposed to look.)
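
For reference, a `praw.ini` file is a plain INI file with one section per bot. The sketch below embeds a sample with placeholder values (not real keys); the section name `bot1` matches the name passed to `praw.Reddit("bot1")`. The helper `missing_keys` is a hypothetical name, not part of praw. It only uses the standard library to check that the three fields used in this notebook are present.

```python
import configparser

# Hypothetical praw.ini content; the values are placeholders, not real keys.
SAMPLE_PRAW_INI = """
[bot1]
client_id=YOUR_CLIENT_ID
client_secret=YOUR_CLIENT_SECRET
user_agent=HSL_Crawl
"""

REQUIRED_KEYS = ("client_id", "client_secret", "user_agent")


def missing_keys(ini_text, site_name):
    """Return the required keys that are absent from the given section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    if not parser.has_section(site_name):
        return list(REQUIRED_KEYS)
    return [key for key in REQUIRED_KEYS if key not in parser[site_name]]


print(missing_keys(SAMPLE_PRAW_INI, "bot1"))  # prints []
```

If the printed list is not empty, the named fields still need to be added to that section of the file.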

In [2]:
myReddit = praw.Reddit("bot1")

Just in case the above line does not work for creating a reddit instance, you can directly put your credentials and run it as shown below:

(uncomment it and put your credentials and run)

In [None]:
# myReddit = praw.Reddit(client_id="iry4l****",         # your client id
#                                 client_secret="hKHe****",      # your client secret
#                                 user_agent="HSL_Crawl",        # your user agent
#                        )

### Submissions
Types:
- controversial
- gilded
- hot
- new
- rising
- top

These are the ways a list of submissions (posts) can be sorted. You can request any of them depending on your requirement.

What do these different types mean? Refer to this reddit post: https://www.reddit.com/r/help/comments/32eu8w/what_is_the_difference_between_newrising_hot_top/?utm_source=share&utm_medium=web2x&context=3
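
Each type listed above corresponds to a method of the same name on the subreddit object (for example `subrr.hot(limit=5)`), so the type can be chosen at run time. The following is a minimal sketch of that idea. `titles_by_sort` is a hypothetical helper, and `fake_subreddit` is a stand-in so the code runs without credentials; in this notebook you would pass `myReddit.subreddit("csk")` instead.

```python
from types import SimpleNamespace

# Listing types named in the list above.
SORT_TYPES = ("controversial", "gilded", "hot", "new", "rising", "top")


def titles_by_sort(subreddit, sort, limit=5):
    """Return up to `limit` post titles from `subreddit` using the given type."""
    if sort not in SORT_TYPES:
        raise ValueError(f"unknown sort type: {sort!r}")
    listing = getattr(subreddit, sort)  # e.g. subreddit.hot
    return [post.title for post in listing(limit=limit)]


# Stand-in listing method: yields at most `limit` fake posts.
def _fake_listing(limit=None):
    posts = [SimpleNamespace(title=f"post {i}") for i in range(10)]
    return iter(posts[:limit])


fake_subreddit = SimpleNamespace(**{name: _fake_listing for name in SORT_TYPES})
print(titles_by_sort(fake_subreddit, "hot", limit=2))  # prints ['post 0', 'post 1']
```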

#### Searching posts from a specific subreddit

In [3]:
mySubredditName = "csk"

# Your interface to access the subreddit
subrr = myReddit.subreddit(mySubredditName)

# Get the top 5 controversial posts in the subreddit
for post in subrr.controversial(limit=5):
	print(post.title)
	print()

Thoughts on this comment?

We are doomed today!

CSK and RCB fan base are quite similar and they respect a lot each other

This may sound petty. But I'm actually asking. Would ruturaj have done a better job in batting when compared to gill in the wtc?

What if?



### Possible Errors:
- 401: Wrong credentials. Verify that you have given the right credentials in praw.ini (if you have created multiple bots, check that all the credentials belong to the same bot).
- 404: Subreddit doesn't exist
- 403: This error apparently occurs when your Reddit app's user agent is not being allowed. "To solve it change your user agent to be more descriptive." "Try changing it to something either longer or without keywords like "scraping" or "bot"."
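
The hints above can be turned into a small lookup that you call from an `except` block. This is only a sketch: it assumes the raised exception exposes the HTTP response as `exc.response` with a `status_code` attribute, which is an assumption to check against your installed praw/prawcore version. `explain_error` and `FakeResponseError` are hypothetical names used for illustration.

```python
from types import SimpleNamespace

# Hints taken from the list above, keyed by HTTP status code.
STATUS_HINTS = {
    401: "Wrong credentials: check client_id and client_secret in praw.ini.",
    403: "Request refused: try a more descriptive user agent.",
    404: "Subreddit does not exist: check the subreddit name.",
}


def explain_error(exc):
    """Map an exception carrying `response.status_code` to a readable hint."""
    response = getattr(exc, "response", None)
    status = getattr(response, "status_code", None)
    return STATUS_HINTS.get(status, f"Unrecognised error: {exc!r}")


# Stand-in exception so the sketch runs without making any request.
class FakeResponseError(Exception):
    def __init__(self, status_code):
        super().__init__(f"received {status_code} HTTP response")
        self.response = SimpleNamespace(status_code=status_code)


try:
    raise FakeResponseError(401)
except Exception as exc:
    print(explain_error(exc))
```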

### Storing in a dictionary to store in a Dataframe

One thing to remember is that praw uses something called lazy access. When you ask for Reddit posts, they are not immediately fetched and stored in your variable. The API request is made only when you actually access the data.

So while working with Reddit data, it is best to read what you need from the instance immediately and store it in a database or file. You can then work from that stored copy whenever you need the data.
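
As an analogy for this behaviour, the sketch below uses a plain Python generator as a stand-in for a praw listing. It is not praw code, and a real listing may fetch posts in batches rather than one at a time. It shows that nothing is fetched until you iterate, and that after one pass the generator is empty, which is why the results are copied into a list first.

```python
fetch_log = []


def fake_listing(limit):
    """Stand-in for a lazy listing: work happens only while iterating."""
    for i in range(limit):
        fetch_log.append(i)  # simulates fetching item i
        yield {"id": f"p{i}", "title": f"post {i}"}


posts = fake_listing(3)
print(len(fetch_log))  # prints 0: nothing has been fetched yet

stored = list(posts)   # iterate once and keep the results
print(len(fetch_log))  # prints 3: items were fetched during iteration

print(list(posts))     # prints []: the generator is already used up
print([p["title"] for p in stored])  # prints ['post 0', 'post 1', 'post 2']
```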

In [4]:
# This is called a dictionary in Python. You can think of it like a structure in C.
# Dictionaries are a collection of key-value pairs. Here, "Title", "ID", "Author", etc. are the keys.
# Keys are sort of like the member variables of a structure in C.
# In this particular case, we have created a dictionary with keys such as "Title", "ID", etc. The value of each key is an empty list.
# We intend to look at all the posts and fill up these lists.
posts_dict = {
	"Title": [], 
	"Body": [],
	"ID": [], 
	"Score": [],
	"Comments Count": [],
	"URL": [],				# Any URL present in the post (like image, video, etc.). If not present, then the URL of the post itself
	"Author": [], 
	"createdUTC": [],
	"Post Name": [], 
	"Permalink": [],			#The relative URL of the post.
	# "Comments": []
	}

# The dictionary above has 10 keys (11 if you uncomment "Comments"). This is just to show you what data we can get from each Reddit post.
# Depending on what data you really need, you can create only those keys.

# Getting the top 5 controversial posts from the subreddit 'csk'. This is stored in the variable 'posts'.
# It is not a real list, but a lazy listing that produces the posts as you iterate over it.
posts = myReddit.subreddit("csk").controversial(limit=5)

# Now we iterate through 'posts', access each post one by one, and store its information.
# The datatype of each post is 'praw.models.reddit.submission.Submission'. It is a class defined in the praw module.
# It gives us access to various items like the title, body, etc.
# To access them, use attributes such as post.title, post.selftext, etc.
for post in posts:
	posts_dict["Title"].append(post.title)
	posts_dict["Body"].append(post.selftext)
	posts_dict["ID"].append(post.id)
	posts_dict["Score"].append(post.score)
	posts_dict["Comments Count"].append(post.num_comments)
	posts_dict["URL"].append(post.url)
	posts_dict["Author"].append(post.author)

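	# Time of creation of each post
	# The time of creation is in UNIX time (seconds since 1970-01-01 UTC). So we convert it to a more readable format.
	# This might be useful when you want posts of a particular time period.
	# Note: without a tz argument, fromtimestamp() returns the time in your machine's local time zone.
	# Pass tz=datetime.timezone.utc as a second argument if you want true UTC values.
	posts_dict["createdUTC"].append(datetime.datetime.fromtimestamp(post.created_utc))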
	posts_dict["Post Name"].append(post.name)
	posts_dict["Permalink"].append(post.permalink)
	# posts_dict["Comments"].append(post.comments.list())

In [6]:
# Saving the data in a pandas dataframe
filename = "redditPostsDump.csv"
myRedditDataframe = pd.DataFrame(posts_dict)
myRedditDataframe.to_csv(filename, index=False)
myRedditDataframe

Unnamed: 0,Title,Body,ID,Score,Comments Count,URL,Author,createdUTC,Post Name,Permalink
0,Thoughts on this comment?,,ult53l,1,18,https://i.redd.it/fldc7x7tsgy81.jpg,Navneeth19,2022-05-09 20:29:08,t3_ult53l,/r/csk/comments/ult53l/thoughts_on_this_comment/
1,We are doomed today!,Narine and Varun will eat our batting line-up ...,12vhlcr,2,45,https://www.reddit.com/r/csk/comments/12vhlcr/...,ffskd,2023-04-23 01:37:40,t3_12vhlcr,/r/csk/comments/12vhlcr/we_are_doomed_today/
2,CSK and RCB fan base are quite similar and the...,I noticed that CSK and RCB have similar kind o...,13kundo,0,17,https://www.reddit.com/r/csk/comments/13kundo/...,Goku4477,2023-05-18 15:45:00,t3_13kundo,/r/csk/comments/13kundo/csk_and_rcb_fan_base_a...
3,This may sound petty. But I'm actually asking....,,146m58s,27,43,https://www.reddit.com/r/csk/comments/146m58s/...,boringsimp,2023-06-11 11:37:57,t3_146m58s,/r/csk/comments/146m58s/this_may_sound_petty_b...
4,What if?,What if we have won the bid for Dinesh Karthik...,veog8g,0,6,https://www.reddit.com/r/csk/comments/veog8g/w...,dineshalagu,2022-06-18 01:18:55,t3_veog8g,/r/csk/comments/veog8g/what_if/
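
As an alternative layout, each post can be flattened into one row (a dict of plain values) and the rows collected in a list. The sketch below is a hedged variant, not the method used above: `post_to_row` is a hypothetical helper, `fake_post` is a stand-in with made-up values, and the timestamp is converted with an explicit UTC time zone. It writes CSV with the standard library only; a list of such rows can also be passed to `pd.DataFrame(rows)`.

```python
import csv
import io
from datetime import datetime, timezone
from types import SimpleNamespace


def post_to_row(post):
    """Flatten one post into a dict of plain built-in values."""
    created = datetime.fromtimestamp(post.created_utc, tz=timezone.utc)
    return {
        "Title": post.title,
        "ID": post.id,
        "Score": post.score,
        "Author": str(post.author),  # store the name as a plain string
        "createdUTC": created.isoformat(),
    }


# Stand-in post with made-up values, so the sketch runs without credentials.
fake_post = SimpleNamespace(
    title="What if?", id="abc123", score=0, author="someone", created_utc=0
)

rows = [post_to_row(fake_post)]
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```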


### Search through all subreddits based on a query

In [7]:
query = "cricket world cup"
posts = myReddit.subreddit("all").search(query, limit=10)

In [8]:
for post in posts:
    print("Post Title: ",post.title)
    print("Post Subreddit Name: ",post.subreddit)
    print()

# Notice how you have posts from different subreddits.

Post Title:  Who's stopping India in Cricket World Cup 2023?
Post Subreddit Name:  cricketworldcup

Post Title:  Speed in India for Cricket World Cup
Post Subreddit Name:  cricketworldcup

Post Title:  India remain unbeaten in Cricket World Cup 2023
Post Subreddit Name:  IndiaCricket

Post Title:  Cricket World Cup points table. Which team’s ranking is the most surprising for you?
Post Subreddit Name:  cricketworldcup

Post Title:  Netherlands plead with ICC for more fixtures after second Cricket World Cup boilover
Post Subreddit Name:  Cricket

Post Title:  Who's stopping India in Cricket World Cup 2023?
Post Subreddit Name:  IndiaCricket

Post Title:  For the first time ever, All 6 Continents won't be represented in the Cricket World Cup
Post Subreddit Name:  Cricket

Post Title:  Cricket World Cup streaming services
Post Subreddit Name:  srilanka

Post Title:  Cricket World Cup: All dynasties must end – but what now for England?
Post Subreddit Name:  Cricket

Post Title:  Ben Stokes
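
Since a search over "all" returns posts from many subreddits, it can help to count how many results came from each one. The following is a small sketch with stand-in posts (the subreddit names are taken from the output above); `count_by_subreddit` is a hypothetical helper. In this notebook you would pass the result of `myReddit.subreddit("all").search(query, limit=10)`.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_subreddit(posts):
    """Count how many search results came from each subreddit."""
    return Counter(str(post.subreddit) for post in posts)


# Stand-in search results so the sketch runs without credentials.
fake_results = [
    SimpleNamespace(subreddit="cricketworldcup"),
    SimpleNamespace(subreddit="cricketworldcup"),
    SimpleNamespace(subreddit="IndiaCricket"),
    SimpleNamespace(subreddit="cricketworldcup"),
    SimpleNamespace(subreddit="Cricket"),
]

counts = count_by_subreddit(fake_results)
print(counts["cricketworldcup"])  # prints 3
```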

#### Getting the comments of a post
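The comments of a post are also stored in the post object. However, they are not in a directly usable form. Hence, we use this extra function to get the comments. It returns the text of the top-level comments and skips the "load more comments" placeholders.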

In [9]:
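def getComments(post):
    # Collect the text of each top-level comment on the post.
    comments = []

    for comment in post.comments:
        # MoreComments objects are "load more comments" placeholders with no body.
        if isinstance(comment, MoreComments):
            continue

        comments.append(comment.body)
    return comments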
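
The function above looks only at top-level comments. To include replies as well, praw's comment forest offers `replace_more()` and `list()`: the first deals with the `MoreComments` placeholders and the second returns all comments flattened. The sketch below is a hedged variant, not a replacement for the function above. `getAllComments` and `FakeCommentForest` are hypothetical names, and the fake class only imitates those two methods so the code runs without credentials.

```python
from types import SimpleNamespace


def getAllComments(post):
    """Return the text of every comment on the post, replies included."""
    post.comments.replace_more(limit=0)  # limit=0 removes the placeholders
    return [comment.body for comment in post.comments.list()]


# Stand-in that imitates the two comment forest methods used above.
class FakeCommentForest:
    def __init__(self, comments):
        self._comments = comments

    def replace_more(self, limit=32):
        # Drop placeholder objects, which have no body.
        self._comments = [c for c in self._comments if hasattr(c, "body")]
        return []

    def list(self):
        # Flatten comments and their replies, level by level.
        flat, queue = [], list(self._comments)
        while queue:
            comment = queue.pop(0)
            flat.append(comment)
            queue.extend(comment.replies)
        return flat


reply = SimpleNamespace(body="reply", replies=[])
top = SimpleNamespace(body="top", replies=[reply])
placeholder = SimpleNamespace(replies=[])  # imitates a MoreComments object
fake_post = SimpleNamespace(comments=FakeCommentForest([top, placeholder]))

print(getAllComments(fake_post))  # prints ['top', 'reply']
```

With `limit=0` no extra requests are made for hidden comments, so very long threads may come back incomplete.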

Let's try calling this function on the results of a query.

In [10]:
query = "cricket world cup"
posts = myReddit.subreddit("all").search(query, limit=5)

for post in posts:
    print("Post Title: ",post.title)
    print("Post Subreddit Name: ",post.subreddit)
    print("Comments: ")
    print(getComments(post))
    print("----------------------------")

# Notice how you have posts from different subreddits.

Post Title:  Who's stopping India in Cricket World Cup 2023?
Post Subreddit Name:  cricketworldcup
Comments: 
['Aussies in knockouts are a different breed \U0001f972.', "Today's top order collapse was a blessing in disguise for india IMHO, law of average was doing it's thing but somehow india manages to win \n\nNow, another winning streak is about to start... India will be in the final for sure, but winning the final will be tough, they need to prepare for knockouts now", 'Netherlands', 'A semi final match', 'A thiccc short black guy from South Africa is waiting in Kolkata to give Kohli his b’day gift 💪🏿💪🏿', 'Shreyas iyer 💀', 'Proteas in the group stage.... NZ, as usual in the semis..', 'A Semi-final', 'Any opponent in the Semi-finals. ![gif](emote|free_emotes_pack|trollface)', 'Chad Netherlands waiting to upset India!!', 'South Africa, but in semis', 'Netherlands 💀', 'Probably aus...in semis', 'Dutchmen are coming', 'Shreyas Iyer and Gill has to step up. Really bad performances from t