<a href="https://colab.research.google.com/github/NikV-JS/Reddit-Flair-Detector/blob/master/Notebooks/Reddit_Web_Scraping(Part_1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The first part of the assignment deals with extracting the required data from posts on the India subreddit. The approach is to scrape the posts through the Reddit API. First, a Reddit API application with a client ID, secret key and name is set up on the Reddit account so that we can connect to the Reddit API. Initially, for the extraction of recent data, the PRAW (Python Reddit API Wrapper) module is used. PRAW wraps the Reddit API calls in simple function calls, which makes scraping quick and efficient. In the following code block we start by installing the dependencies required for the Part I notebook.



In [0]:
!pip install -q praw pandas numpy


In [0]:
# Importing required modules for the notebook
import praw
import pandas as pd
import numpy as np
from numpy import random
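The client ID, secret key and user agent described above do not have to be written into the notebook itself. The following is a minimal sketch, assuming the three values are stored in environment variables; the variable names and the helper `load_reddit_credentials` are hypothetical and not part of PRAW.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
ENV_KEYS = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "user_agent": "REDDIT_USER_AGENT",
}

def load_reddit_credentials(env=None):
    """Return keyword arguments for praw.Reddit, read from environment variables."""
    env = os.environ if env is None else env
    # Fail early with a clear message if any value is missing or empty.
    missing = [var for var in ENV_KEYS.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {arg: env[var] for arg, var in ENV_KEYS.items()}

# Usage (requires praw and the variables above to be set):
# reddit = praw.Reddit(**load_reddit_credentials())
```

Keeping the secret out of the notebook means the file can be shared or committed without exposing the API key.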

A large number of submissions across different categories are posted on Reddit every day. For filtering purposes, posts are tagged with categories called flairs. A few flair categories are permanent, whereas others are dynamic and change with current worldwide trends. For example, posts with the 'Coronavirus' flair show up only after Feb '20. The line of code below defines the list of flair names on the Reddit website as of April 6th, 2020. There are 12 flair categories in total.

In [0]:
flairs = ["Scheduled", "Politics", "Photography", "Policy/Economy", "AskIndia", "Sports", "Non-Political", "Science/Technology", "Food", "Business/Finance", "Coronavirus", "CAA-NRC-NPR"]
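As a side note, a flair name can also be turned into a search query that filters on the flair field rather than on the text of a post. The sketch below assumes that Reddit's search syntax accepts a `flair:"..."` field filter; the helper `build_flair_query` and the list `example_flairs` are hypothetical names used only for illustration.

```python
def build_flair_query(flair_name):
    """Build a search query restricted to one flair (assumes the flair:"..." syntax)."""
    # Quote the name so flairs containing '/' or '-' are treated as a single phrase.
    return 'flair:"{}"'.format(flair_name)

# A small subset of the flair list, for illustration only.
example_flairs = ["Scheduled", "Politics", "Policy/Economy", "CAA-NRC-NPR"]
queries = [build_flair_query(name) for name in example_flairs]
print(queries)

# Usage with PRAW (needs network access and credentials):
# for submission in subreddit.search(build_flair_query("Politics"), limit=None):
#     ...
```

Whether such a query returns more or fewer posts than a plain keyword search would have to be checked against the live API.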

The approach for Reddit Flair Detection is based only on the body and title of a post, not on its top comments. Intuitively, the classifier might perform better if comments were also used, but my goal was to build a classifier based only on body and title. The main reason is to keep unpopular posts in mind: unpopular posts in generic categories and newly submitted posts tend to have no comments. If comments were included as a feature, the mean sequence length would shift towards longer inputs, which might hinder model performance on short and concise posts, on posts with just a title and a link, and on posts with no comments. Leaving out comments also provides an opportunity to study how well the model classifies vague and short posts. Accordingly, the following code block uses PRAW to make Reddit API calls that scrape the required data from posts on the India subreddit.

In [0]:
reddit = praw.Reddit(client_id='EPeQ4_tZaSnieQ', client_secret="o8wiYMDri2RMiF1um14L1rGHXEs", user_agent='Reddit WebScraping')

subreddit = reddit.subreddit('india')
topics_dict = {"flair":[], "title":[], "url":[], "comms_num": [], "body":[], "author":[]}

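for flair in flairs:
  # Search the subreddit with the flair name as the query string; limit=None asks for as many results as the API allows
  # Note: each result is labelled with the searched flair name, not with submission.link_flair_text
  get_subreddits = subreddit.search(flair, limit=None)
  for submission in get_subreddits:
    # Record the fields of interest for every returned submission
    topics_dict["flair"].append(flair)
    topics_dict["title"].append(submission.title)
    topics_dict["url"].append(submission.url)
    topics_dict["comms_num"].append(submission.num_comments)
    topics_dict["body"].append(submission.selftext)
    topics_dict["author"].append(submission.author)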

new_data = pd.DataFrame(topics_dict)
new_data.shape

(2663, 6)

The above code block succeeded in extracting data, but the size of the dataset was limited. The Reddit API limits a search query to 1,000 posts. I had settled on 6,000 cases as a suitable dataset size, which needs only about 500 posts from each flair category, so this limit by itself was not a problem. However, due to limitations on Reddit's end I ended up with 2,663 cases, with almost every category capped at around 250 cases. After some research, I found that the PRAW workarounds for this, such as cloudsearch and submissions, have been deprecated. An alternative was to use the archived Reddit posts on Google Cloud Platform BigQuery. The following code block shows the SQL query used to obtain the required data for 10 flair categories from the latest available months.

In [0]:
# BigQuery SQL used to export the archived posts. It is kept here as a string for reference only;
# run it in BigQuery, not in this notebook (the table name selects the month; 2019_06 is shown).
bigquery_sql = """
SELECT * except (domain, subreddit, author_flair_css_class, link_flair_css_class, author_flair_text,
                 from_kind, saved, hide_score, archived, from_id, name, quarantine, distinguished, stickied,
                 thumbnail, is_self, retrieved_on, gilded, subreddit_id)
FROM `fh-bigquery.reddit_posts.2019_06`
WHERE subreddit = "india" and link_flair_text in ("Scheduled", "Politics", "Photography", "Policy/Economy", "AskIndia", "Sports", "Non-Political", "Science/Technology", "Food", "Business/Finance", "Coronavirus", "CAA-NRC-NPR")
LIMIT 50000
"""

In [0]:
# Load the CSV data exported from the BigQuery SQL query
data_08 = pd.read_csv('/content/data_2019_08.csv') # path to the 2019_08 BigQuery data needs to be provided
data_07 = pd.read_csv('/content/data_2019_07.csv') # path to the 2019_07 BigQuery data needs to be provided
data_06 = pd.read_csv('/content/data_2019_06.csv') # path to the 2019_06 BigQuery data needs to be provided

In [0]:
# Count the posts per flair in each of the 4 datasets (the PRAW data and the three BigQuery months).
n_flairs = len(flairs)
length = np.zeros((n_flairs, 1))       # PRAW data
length_19_08 = np.zeros((n_flairs, 1)) # BigQuery 2019_08
length_19_07 = np.zeros((n_flairs, 1)) # BigQuery 2019_07
length_19_06 = np.zeros((n_flairs, 1)) # BigQuery 2019_06
for i, name in enumerate(flairs):
  # A boolean mask summed over rows gives the number of posts with this flair
  length[i] = (new_data['flair'] == name).sum()
  length_19_08[i] = (data_08['link_flair_text'] == name).sum()
  length_19_07[i] = (data_07['link_flair_text'] == name).sum()
  length_19_06[i] = (data_06['link_flair_text'] == name).sum()
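The per-flair counts can also be collected into a single table, which makes the datasets easier to compare side by side. This is a minimal sketch using `value_counts` on made-up toy frames; `praw_df`, `bq_df` and `flair_names` are hypothetical stand-ins for the real data, not the actual results.

```python
import pandas as pd

flair_names = ["Politics", "AskIndia", "Food"]  # toy subset of the flair list

# Toy stand-ins for the PRAW data and one BigQuery month (made-up rows).
praw_df = pd.DataFrame({"flair": ["Politics", "Food", "Politics"]})
bq_df = pd.DataFrame({"link_flair_text": ["AskIndia", "Politics", "Other"]})

# One column of counts per dataset, aligned on the flair name.
counts = pd.DataFrame({
    "praw": praw_df["flair"].value_counts(),
    "bigquery": bq_df["link_flair_text"].value_counts(),
})
# Keep only the flairs of interest, in list order; flairs absent from a dataset get 0.
counts = counts.reindex(flair_names).fillna(0).astype(int)
print(counts)
```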

In [0]:
# Merge the 4 datasets into the final dataset with only three fields: flair, title and body.
o_data = pd.concat([data_06, data_07, data_08]) # All the BigQuery data in one Pandas DataFrame
data = o_data[['link_flair_text','title','selftext']] # Keep only the required fields from o_data
data = data.rename(columns={'link_flair_text': 'flair', 'selftext': 'body'}) # Rename the Reddit API properties to appropriate column names
n_data = new_data[['flair','title','body']] # Keep only the required fields from the PRAW data
data = pd.concat([data, n_data]) # Final Pandas DataFrame
data.to_csv('data.csv', index=False) # Save the merged dataset without the index column

It is good practice to use random sampling when reducing the size of a dataset, as it avoids the bias introduced by picking data sequentially. The following code block uses random sampling, so the data produced changes each time the code runs. A total of 6,000 cases are sampled from around 22,000 cases in the original data file. A brief analysis of the number of cases per flair in the original data file shows that Politics, AskIndia and Non-Political have the most cases, while Coronavirus and CAA-NRC-NPR have few cases due to Reddit API limitations. So, out of the 6,000 cases, 7 flairs have 500 cases each, Coronavirus and CAA-NRC-NPR are capped at 250 and 109 cases respectively, and Politics, AskIndia and Non-Political have 714, 714 and 713 cases respectively. After a preliminary study of the final dataset, I feel that the small number of cases for Coronavirus and CAA-NRC-NPR shouldn't be a difficulty for the model, as the title of a submission mostly gives away the flair category in these cases.

In [0]:
data = pd.read_csv('/content/data.csv')
# Flairs with a larger share of the data get a bigger sample; every other flair defaults to 500
sample_sizes = {"Politics": 714, "AskIndia": 714, "Non-Political": 713}
samples = []
for name in flairs:
  flair_data = data[data['flair'] == name]
  # Cap the sample at the number of available posts (Coronavirus and CAA-NRC-NPR have fewer than 500)
  n = min(sample_sizes.get(name, 500), flair_data.shape[0])
  samples.append(flair_data.sample(n=n))
# Combine the per-flair samples and keep only the three required fields
final_data = pd.concat(samples)
final_data = final_data[['flair','title','body']]
final_data.to_csv('Final_Dataset.csv', index=False)
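Since the sampling above draws a different subset on every run, a fixed seed can be passed to `DataFrame.sample` through its `random_state` parameter when a repeatable dataset is wanted. This is a minimal sketch on made-up toy data; the helper `sample_per_flair` and the seed value are hypothetical, not what was used to build the final dataset.

```python
import pandas as pd

def sample_per_flair(df, sizes, default_size, seed):
    """Sample up to a target number of rows per flair, reproducibly."""
    parts = []
    for name, group in df.groupby("flair", sort=False):
        # Use the flair-specific size if given, capped at the rows available.
        n = min(sizes.get(name, default_size), len(group))
        parts.append(group.sample(n=n, random_state=seed))
    return pd.concat(parts, ignore_index=True)

# Toy data: 5 Politics posts, 3 Food posts and 1 Coronavirus post.
toy = pd.DataFrame({
    "flair": ["Politics"] * 5 + ["Food"] * 3 + ["Coronavirus"] * 1,
    "title": ["post {}".format(i) for i in range(9)],
})

first = sample_per_flair(toy, sizes={"Politics": 4}, default_size=2, seed=42)
second = sample_per_flair(toy, sizes={"Politics": 4}, default_size=2, seed=42)
print(first["flair"].value_counts().to_dict())
print(first.equals(second))  # same seed, same sample
```

The trade-off is that a fixed seed always produces the same subset, so any quirk of that particular draw carries through to every later experiment.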

The following cells verify the size of the final dataset and the number of cases per flair.

In [0]:
final_data.shape

(6000, 3)

In [0]:
# Count the posts per flair in the final dataset
flair_length = np.zeros((len(flairs), 1))
for i in range(len(flairs)):
  flair = final_data[final_data['flair'] == flairs[i]]
  flair_length[i] = flair.shape[0]

In [0]:
flair_length

array([[500.],
       [714.],
       [500.],
       [500.],
       [714.],
       [500.],
       [713.],
       [500.],
       [500.],
       [500.],
       [250.],
       [109.]])
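As a quick sanity check, the per-flair sizes reported above can be paired with their flair names and summed. This is only an arithmetic sketch using the numbers from the output; the dictionary `flair_sizes` is a hypothetical name.

```python
# Per-flair sizes as reported in the output above, in the order of the flairs list.
flair_sizes = {
    "Scheduled": 500, "Politics": 714, "Photography": 500, "Policy/Economy": 500,
    "AskIndia": 714, "Sports": 500, "Non-Political": 713, "Science/Technology": 500,
    "Food": 500, "Business/Finance": 500, "Coronavirus": 250, "CAA-NRC-NPR": 109,
}

# 7 * 500 + (714 + 714 + 713) + (250 + 109) = 3500 + 2141 + 359
total = sum(flair_sizes.values())
print(total)
```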