# Part 1: Collecting subreddit "India" Data

Source: https://towardsdatascience.com/scraping-reddit-data-1c0af3040768

Importing necessary libraries

In [1]:
import pandas as pd # data manipulation 
import praw # python reddit API wrapper



Authenticate ourselves by creating a Reddit instance and providing client credentials

In [2]:
# Credentials generated from the reddit developers applications page
my_client_id = '8KS6G6Nt9BU9sg'
my_client_secret = 'CkeVFda-vf0DbseDb0eEr1YMpJo'
user = 'reddit_scrape'

reddit = praw.Reddit(client_id=my_client_id, client_secret=my_client_secret, user_agent=user)
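
Hard-coding the client ID and secret in a notebook exposes them to anyone who can read the file. A minimal sketch of one alternative is shown here, assuming the credentials are stored in environment variables. The variable names (`REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, `REDDIT_USER_AGENT`) and both helper functions are hypothetical, not part of PRAW.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Read the three PRAW settings from environment variables (hypothetical names)."""
    required = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
        "user_agent": "REDDIT_USER_AGENT",
    }
    # Fail early with a clear message if any variable is unset or empty
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {key: env[var] for key, var in required.items()}


def make_reddit_client():
    """Create a Reddit instance from the environment instead of literals."""
    import praw  # imported lazily so the helper above works without praw installed
    return praw.Reddit(**load_reddit_credentials())
```

With the variables set in the shell that launches the notebook, `reddit = make_reddit_client()` would replace the cell above.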

## Get subreddit data 
Using the `reddit` instance from the previous section, we can acquire the top 1000 posts, the 1000 hottest posts, or the latest posts from a subreddit.

In [3]:
num_of_posts = 1000   # Number of posts we want in our data
new_posts = reddit.subreddit('India').top(limit=num_of_posts)

In [None]:
# Print the titles of the top posts in this subreddit
for post in new_posts:
    print(post.title)

Will donate thrice the number of upvotes (amount in Rs.) i get for this thread in next 24 hours
Indian reply to NYtimes cartoon on Paris climate accord by Satish Acharya.
The essence of the Indian soap opera, distilled into one GIF.
Fuck all Religion
German exchange Student at IIT Madras is being sent back home by the Indian immigration department because he joined the protest.
Tragedy of India
Today's The Hindu
If you are not moved by this picture, I wish I had your heart. [NP]
Megathread: India-Pakistan border skirmish
"From midnight the entire country will go under a complete lockdown," says PM Modi.
The wealth inequality in India is truly horrifying
Dear Hindus, we Indian Muslims rejected an Islamic state in 1947.. Now it's your turn to reject a Hindu Nazi state.. Speak up. NOW
Spend nearly 500 rs bucks to watch JOKER and in return all i got was 30 mins of advertiment before the movies,20 mins adver. in interval, giant subtitles, and stupid tobacco discalimer which stuck through 70

This is Burger King's Chicken Tandoori Grill burger. What they advertise is above and what they delivered to me today is below.
An aerial view of Gangaikonda Cholapuram Temple
From Deccan Herald - 12.12.19
ISRO releases photo of moon taken by Chandrayaan-2
An ode to Gandhi ( from twitter )
Every home today morning
Need all the help possible.
Found at a hospital in India.
Bengaluru City Police
Pretty much the current state of India
Just a photo of HP Petrol Pump I took recently. The line drawing is added in PS later. [OC] 3922x4902
[OC] This Contest From Maggi.
An in depth news coverage of the monsoons
Modern day mathematics for modern problems...Brilliant stuff!
#satisfying
This is too funny
Pic From Rural Kerala
The Times Of Modi!
Science students can relate
I am a Muslim and I wish all my Hindu brothers and sisters a happy Dusshera!
YES Bank
It's confirmed but also not confirmed.
The Original 10-Year Challenge
When Google gives up
A Big Lake in Chennai - 2018 vs. 2019
Something I mad

Sacred Games season 1 fanart poster, by me
Mumbai police's recent tweet on scams.
Shimla
Indian auto drivers during a traffic jam
Translation: Be careful. We ought to save ourselves too while we save them.
I saw your Meghalya post and i raise you this boi.... Someplace on the way to Kheerganga, HP.
Happy birthday SIR APJ ABDUL KALAM. The missile man would've been 88 today.
Viral Hindu-Muslim, India-Pak lesbian couple celebrates anniversary with new pics. They are stunning
Kolkata at night during Durga Pujo
Two lenses
For all our desi Calvin and Hobbes and Chacha Chaudhary fans!
LED traffic strip lights in Hyderabad
I made 13 illegal toll booth vanish overnight, milking 3-4 lakhs everyday, using RTI
Deccan Herald's Speak Out: 7th March 2019.
The 90's is (calcium) strong with this one!
Migrant Workers
When you have more phones than toilets
This year's hand made Ganesha by my sis
[Showerthought] Netflix should provide a "Skip Song" Button for Indian movies.
I see your good boi in traffic 

Deccan Herald continues to deliver
For people sorting by new, meet two new members of our family
What TF has happened to my fellow countrymen on LinkedIn?
Chowkidhaar's nightmare
I work as a psychologist for children and teens. This is from a 10th standard student I've been counselling for the past few years.
No 'Clapping' Doesn't Kill Coronavirus.. God help this society.
This pic of Mumbai Conservancy workers cleaning gutters.....
We've Just Seen the First Use of Deepfakes in an Indian Election Campaign
Cartoon of the day: Jharkhand Election Results by @CartoonistAlok
Kiki challenge. Indian edition
They're just everywhere these days.
My neighbour's door, summarises cultural values of India !!
How do you know if someone went to IIT? They'll tell you.
I finally convinced him. That awesome moment when you understand the essense of life
1981: When Atal Bihari Vajpayee broke protocol to meet Faiz Ahmed Faiz.
Hasan Minhaj Has Heard His Fans, Next 'Patriot Act' Episode on CAA-NRC
Srinivasa G

Check the type of `new_posts` to identify the class it belongs to and to find out more about its methods and attributes.
Answer: it is an object of the `ListingGenerator` class.

In [None]:
print(type(new_posts))
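
Because a listing is produced lazily, looping over it once (as the title-printing cell does) is expected to use it up, so a second loop over the same object would yield nothing. The sketch below illustrates that behaviour with a plain Python generator standing in for the `ListingGenerator`; `fake_listing` is a hypothetical helper, not a PRAW function.

```python
def fake_listing(titles):
    """Stand-in for a lazy listing: yields one item at a time."""
    for title in titles:
        yield title


listing = fake_listing(["first post", "second post", "third post"])

first_pass = [title for title in listing]   # consumes the generator
second_pass = [title for title in listing]  # nothing is left to iterate

print(first_pass)
print(second_pass)
```

This is why the later cells request a fresh listing from the subreddit rather than reusing `new_posts`.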

## Structure subreddit data into a pandas dataframe

Create a list of features to save in the dataset. These are the columns we will extract from each submission.

**Feature Description:**

* **Title:** The title of the submission.
* **Score:** The number of upvotes for the submission.
* **ID:** ID of the submission.
* **Subreddit:** Provides an instance of Subreddit.
* **URL:** The URL the submission links to, or the permalink if a selfpost.
* **Original:** Whether or not the submission has been set as original content.
* **num_comments:** The number of comments on the submission.
* **Flair:** The text of the submission's link flair, or `None` if it has no flair.
* **Body:** The submission's selftext - an empty string if a link post.
* **created_on:** Time the submission was created, represented in Unix time.


In [None]:
features = ['Title', 'Score', 'ID', 'Subreddit', 'URL', 'Original', 'num_comments', 'Flair', 'Body', 'created_on']
posts = [] # List containing the data from individual reddit posts. Each item will be a new entry in the dataframe

In [None]:
india_sub = reddit.subreddit('India')

# Loop through each submission and append its fields to the list
for post in india_sub.top(limit=num_of_posts):
    posts.append([post.title, post.score, post.id, post.subreddit, post.url, post.is_original_content, post.num_comments, post.link_flair_text, post.selftext, post.created])

# Convert to a data frame 
posts = pd.DataFrame(posts, columns=features)
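
An alternative to keeping a list of values in step with a separate list of column names is to build one dictionary per post, so each value sits next to its column name. This is a minimal sketch under that assumption; the `SimpleNamespace` objects are made-up stand-ins for PRAW submissions, and `to_record` is a hypothetical helper.

```python
from types import SimpleNamespace

import pandas as pd

# Stand-ins for PRAW submissions; every value here is invented for illustration
fake_posts = [
    SimpleNamespace(title="Example A", score=10, id="abc", num_comments=3,
                    link_flair_text="Politics", selftext="", created=1577836800.0),
    SimpleNamespace(title="Example B", score=5, id="def", num_comments=0,
                    link_flair_text=None, selftext="Some text", created=1577923200.0),
]


def to_record(post):
    """Map one submission-like object to a flat dict keyed by column name."""
    return {
        "Title": post.title,
        "Score": post.score,
        "ID": post.id,
        "num_comments": post.num_comments,
        "Flair": post.link_flair_text,
        "Body": post.selftext,
        "created_on": post.created,
    }


df = pd.DataFrame([to_record(p) for p in fake_posts])
print(df.dtypes)
```

The column order follows the order of the dictionary keys, so adding or removing a feature is a one-line change.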

Let's look at the data frame and analyse the data a little more. 

In [None]:
posts

### Correct the time format using pandas' `to_datetime`

In [None]:
# Using the to_datetime function of pandas
posts['creation_date'] = pd.to_datetime(posts['created_on'], dayfirst=True, unit='s')
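
As a small check of what the conversion does, the sketch below applies `pd.to_datetime` with `unit='s'` to two hand-picked Unix timestamps (sample values, not taken from the scraped data). The `dayfirst` argument only affects how date strings are parsed, so it is left out here.

```python
import pandas as pd

# Seconds since the Unix epoch; sample values chosen for illustration
timestamps = pd.Series([1577836800, 1577923200])

# unit="s" tells pandas the numbers are seconds rather than nanoseconds
dates = pd.to_datetime(timestamps, unit="s")
print(dates)
```

The two values correspond to midnight UTC on 1 and 2 January 2020.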

In [None]:
# Drop created_on column now 
posts.drop(['created_on'], axis=1, inplace=True)

In [None]:
posts.head()

In [None]:
posts.shape

Even after changing `num_of_posts`, I still get only 988 posts, and the dataset is heavily skewed. Reddit listings generally return no more than about 1000 items, so raising the limit does not help. Hence, I need a better scraping mechanism to get a more balanced dataset.

# Data Collection: Improved

The first thing I will do is reduce the number of target classes, which should make the models more robust and increase accuracy.

In [None]:
posts['Flair'].value_counts().sort_values(ascending=False)
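
One way to turn these counts into a shortlist is to keep only the flairs that reach a minimum number of posts. The sketch below uses a small invented `Flair` column, and the threshold `min_posts` is an arbitrary illustrative value. Note that `value_counts` already sorts in descending order and skips missing flairs by default.

```python
import pandas as pd

# Invented flair labels standing in for posts['Flair']
flair_column = pd.Series(
    ["Politics"] * 5 + ["AskIndia"] * 3 + ["Food"] * 1 + [None] * 1,
    name="Flair",
)

counts = flair_column.value_counts()  # descending by default, missing values dropped
min_posts = 3                         # illustrative cut-off, not from the source
popular = counts[counts >= min_posts].index.tolist()
print(popular)
```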

I sorted the flairs in descending order of frequency and picked the most popular ones to avoid skewed data. I will collect top posts and their comments, along with some other information relevant to the analysis and the model.

In [None]:
# Relevant flairs
flairs = ["AskIndia", "Non-Political", "[R]eddiquette", 
          "Photography", "Science/Technology",
          "Politics", "Business/Finance", "Policy/Economy",
          "Sports", "Food", "AMA", "Coronavirus", "CAA-NRC-NPR"]

In [None]:
# Data features that we will be collecting 
features = [
    'Title', 
    'Score', 
    'ID',
    'URL', 
    'num_comments', 
    'created_on', 
    'Body', 
    'Original',
    'Flair', 
    'Comments'
]

In [None]:
# Create a subreddit instance 
subreddit = reddit.subreddit('india')

In [None]:
posts = []
# Up to 250 posts for each flair
for flair in flairs: 
    relevant_subs = subreddit.search(f"flair_name:{flair}", limit=250)
    
    for sub in relevant_subs:
        # Fields in the same order as the features list (Comments is added below)
        post = [
            str(sub.title),
            str(sub.score),
            str(sub.id),
            str(sub.url),
            sub.num_comments,
            str(sub.created),
            str(sub.selftext),
            sub.is_original_content,
            str(sub.link_flair_text),
        ]
        
        # Remove the "load more comments" placeholders from the comment tree
        sub.comments.replace_more(limit=0)
        comment = ''
        for top_comment in sub.comments:
            # Accumulate every top-level comment instead of keeping only the last one
            comment += str(top_comment.body) + ' '
        
        post.append(comment)  # Add the combined comments to the end of the list 
        posts.append(post)    # Add to the main list 
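
Two small helpers can make this cell easier to test on its own. The sketch below assumes that wrapping the flair in double quotes keeps names containing spaces or punctuation together as a single search term; that quoting is an assumption about Reddit's search syntax and should be checked against real results. Both function names are hypothetical, and the comment objects are stand-ins for PRAW comments.

```python
from types import SimpleNamespace


def build_flair_query(flair):
    """Build a flair search string, quoting the flair name (assumed syntax)."""
    return f'flair_name:"{flair}"'


def join_comment_bodies(comments):
    """Concatenate the text of every comment, separated by single spaces."""
    return " ".join(str(comment.body) for comment in comments)


# Stand-ins for top-level PRAW comments
fake_comments = [SimpleNamespace(body="First"), SimpleNamespace(body="Second")]

print(build_flair_query("Science/Technology"))
print(join_comment_bodies(fake_comments))
```

Using `str.join` avoids the trailing space left by repeated concatenation and returns an empty string for posts without comments.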

In [None]:
# Convert to a data frame 
posts = pd.DataFrame(posts, columns=features)

In [None]:
posts

In [None]:
# Save the more detailed data to a CSV file
posts.to_csv('data.csv')
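
By default `to_csv` also writes the row index as an extra unnamed column, which reappears when the file is read back. A minimal sketch of a round trip without it, using a tiny invented frame and an in-memory buffer in place of `data.csv`:

```python
import io

import pandas as pd

# Tiny invented frame standing in for the scraped posts
df = pd.DataFrame({"Title": ["Example A", "Example B"], "Flair": ["Politics", "Food"]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False leaves out the row-number column
buffer.seek(0)

restored = pd.read_csv(buffer)
print(restored.columns.tolist())
```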