### This notebook scrapes data from the subreddit r/India and stores it in a CSV file

In [1]:
import pandas as pd
import praw  # PRAW (Python Reddit API Wrapper) is the library used to scrape data from Reddit

### Creating a reddit instance by authenticating ourselves

In [2]:
reddit = praw.Reddit(client_id='xxxx', client_secret='xxxx', user_agent='Reddit WebScrapping', password='xxxx', username='shreyagupta0806')

Note: The client_id, client_secret and password are anonymised to keep the account credentials private.
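
One way to avoid pasting credentials into the notebook at all is to read them from environment variables. The sketch below is a minimal illustration and not part of the original notebook: the helper `load_reddit_credentials` and the variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, `REDDIT_USERNAME` and `REDDIT_PASSWORD` are hypothetical choices.

```python
import os

# Hypothetical environment variable names; pick whatever suits your setup.
REQUIRED_VARS = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "username": "REDDIT_USERNAME",
    "password": "REDDIT_PASSWORD",
}

def load_reddit_credentials(env=None):
    """Build the keyword arguments for praw.Reddit from environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS.values() if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    creds = {key: env[name] for key, name in REQUIRED_VARS.items()}
    creds["user_agent"] = "Reddit WebScrapping"  # same user agent as in the cell above
    return creds

# Usage (requires the variables to be set and praw to be installed):
# reddit = praw.Reddit(**load_reddit_credentials())
```

The function accepts an optional mapping so it can be tried out with a plain dictionary before any real environment variables are set.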

### First, we will extract information from the subreddit r/India

In [3]:
# 1. get subreddit info of india subreddit
india_subreddit = reddit.subreddit('India')

### Now, we will extract useful columns from the posts in this subreddit

Note: We extract the "hottest" posts, i.e. the posts that gathered the most upvotes in the shortest time after they were put up.

Currently, we will request 2000 posts for reference. This number can be changed through the numPosts variable initialised below.

In [4]:
# 2. extract x hottest posts' id, url, headline, content
def get_hottest_posts(numPosts):
    """
    Takes the number of posts we want to scrape and returns a
    pandas dataframe consisting of the post id (unique to a
    post - will be used in the next step), post url (can be
    important in the coming steps), post title (or the headline)
    and its content (i.e. what the post contained)
    """
    # collect the rows in a local list instead of relying on a global variable
    rows = []
    for post in india_subreddit.hot(limit=numPosts):
        rows.append([post.id, post.url, post.title, post.selftext])

    return pd.DataFrame(rows, columns=['id', 'url', 'title', 'content'])

numPosts = 2000
posts = get_hottest_posts(numPosts)

#displaying the first five rows of the posts dataframe
posts.head()

| | id | url | title | content |
|---|---|---|---|---|
| 0 | fqqdsg | https://www.reddit.com/r/india/comments/fqqdsg... | Coronavirus (COVID-19) Megathread - News and U... | **Central thread for sharing coronavirus News ... |
| 1 | fo0xj9 | https://www.reddit.com/r/india/comments/fo0xj9... | Coronavirus (Covid-19) Multi-Lingual Resources... | This post will serve as a wiki for multi-lingu... |
| 2 | fxm0xq | https://v.redd.it/o8y1vddg2qr41 | Woke up to this today. Made my day! | |
| 3 | fxpkhy | https://www.thequint.com/neon/covid-19-bhopal-... | Bhopal Doctor Lives In Car To Protect Family F... | |
| 4 | fxqifi | https://theprint.in/india/rss-says-tablighi-ja... | RSS says Tablighi Jamaat conduct not reflectio... | |


### Now that we have the basic attributes of a post, we will extract the flair of the post. 

Note: Flair is the category that the post belongs to. For the India subreddit, this category is assigned by a moderator or a community member of the subreddit.

In [5]:
# 3. extract the flairs of these posts
def get_post_flairs(posts):
    """
    This function takes the posts extracted in get_hottest_posts
    function and returns the flairs of these posts using the post
    id.
    """
    flairs = []
    # each lookup by id triggers a separate API request, so this loop is slow for many posts
    for post_id in posts['id']:
        submission = reddit.submission(id=post_id)
        flairs.append(submission.link_flair_text)
    return flairs
    
flairs = get_post_flairs(posts)

#displaying the flairs of the first five posts
flairs[:5]

['Coronavirus', 'Coronavirus', 'Non-Political', 'Coronavirus', 'Coronavirus']
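
Looking up every post by id costs one extra request per post. As far as I know, submissions returned by a listing such as `hot()` already carry `link_flair_text`, so the flair could likely be collected in the same pass as the other columns. The sketch below assumes that; `rows_with_flair` is a hypothetical helper, and the stand-in objects only mimic the attributes used in this notebook so the example runs without network access.

```python
from types import SimpleNamespace

def rows_with_flair(submissions):
    """Return one [id, url, title, content, flair] row per submission-like object."""
    rows = []
    for post in submissions:
        # getattr guards against objects that have no flair attribute at all
        flair = getattr(post, "link_flair_text", None)
        rows.append([post.id, post.url, post.title, post.selftext, flair])
    return rows

# Stand-in objects with made-up values, for illustration only.
fake_posts = [
    SimpleNamespace(id="abc123", url="https://example.com/a", title="First",
                    selftext="", link_flair_text="Politics"),
    SimpleNamespace(id="def456", url="https://example.com/b", title="Second",
                    selftext="body text", link_flair_text=None),
]
rows = rows_with_flair(fake_posts)
print(rows[0][4])  # Politics
```

With the real objects, `rows_with_flair(india_subreddit.hot(limit=numPosts))` could be passed to `pd.DataFrame` with the columns `['id', 'url', 'title', 'content', 'flair']`.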

### Finally, we append these flairs to our post data obtained from the get_hottest_posts function

In [6]:
# 4. append posts with their flairs to get the resultant dataset

posts['flair'] = flairs
posts.head()

| | id | url | title | content | flair |
|---|---|---|---|---|---|
| 0 | fqqdsg | https://www.reddit.com/r/india/comments/fqqdsg... | Coronavirus (COVID-19) Megathread - News and U... | **Central thread for sharing coronavirus News ... | Coronavirus |
| 1 | fo0xj9 | https://www.reddit.com/r/india/comments/fo0xj9... | Coronavirus (Covid-19) Multi-Lingual Resources... | This post will serve as a wiki for multi-lingu... | Coronavirus |
| 2 | fxm0xq | https://v.redd.it/o8y1vddg2qr41 | Woke up to this today. Made my day! | | Non-Political |
| 3 | fxpkhy | https://www.thequint.com/neon/covid-19-bhopal-... | Bhopal Doctor Lives In Car To Protect Family F... | | Coronavirus |
| 4 | fxqifi | https://theprint.in/india/rss-says-tablighi-ja... | RSS says Tablighi Jamaat conduct not reflectio... | | Coronavirus |


In [7]:
# Let's see how many posts were extracted
posts.shape

(758, 5)

### We might also want to see what were the potential flairs that could have been assigned in this subreddit

In [8]:
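def print_possible_flairs(subreddit):
    """Prints the link flair templates defined for the given subreddit."""
    print("Possible flairs: \n")

    # enumerate numbers the templates starting from 1
    for i, template in enumerate(subreddit.flair.link_templates, start=1):
        print(str(i) + ". " + template["text"])

print_possible_flairs(india_subreddit)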

Possible flairs: 

1. Politics
2. Non-Political
3. AskIndia
4. Policy/Economy
5. Business/Finance
6. Science/Technology
7. Scheduled
8. Sports
9. Food
10. Photography
11. CAA-NRC-NPR
12. Coronavirus


These are the possible flairs that can be assigned to a post by a moderator or community member of r/India

So we finally extracted the data from the r/India subreddit using PRAW. The only open question is why only 758 posts were extracted even though 2000 were requested. Possible explanations:
1. Either there were only 758 posts in the "hottest" category, or
2. There is some stopping condition on the number of posts that can be extracted.

The second explanation is at least partly true: Reddit listings are generally capped at roughly 1000 items, so requesting 2000 cannot return more than that. The cap alone does not explain 758, though, and I'm not sure about the first explanation. I will probably get back to this after completing the overall task, time permitting. Until then, 758 posts should be enough; if we need more, we'll get them.
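
If more posts are needed than a single listing returns, one option is to pull several listings (for example `hot`, `new` and `top`) and keep each post once, using the post id as the key. The sketch below only shows the de-duplication step; `merge_unique_posts` is a hypothetical helper and the stand-in objects are made up so the example runs offline.

```python
from types import SimpleNamespace

def merge_unique_posts(*listings):
    """Concatenate several iterables of posts, keeping the first post seen per id."""
    seen = set()
    merged = []
    for listing in listings:
        for post in listing:
            if post.id not in seen:
                seen.add(post.id)
                merged.append(post)
    return merged

# Stand-in listings that overlap on one post id.
hot = [SimpleNamespace(id="a1"), SimpleNamespace(id="b2")]
new = [SimpleNamespace(id="b2"), SimpleNamespace(id="c3")]
merged = merge_unique_posts(hot, new)
print([post.id for post in merged])  # ['a1', 'b2', 'c3']
```

With PRAW, the same helper could presumably be called as `merge_unique_posts(india_subreddit.hot(limit=None), india_subreddit.new(limit=None))`; how many additional posts this yields depends on how much the listings overlap.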

### Finally, saving our extracted data in a csv file

In [9]:
# step 5: save the data in a csv
posts.to_csv('dataset.csv', index=False) 

#index=False will prevent the row numbers from being saved as an independent column/attribute in the saved csv file