# Data Aquisition and Cleaning:

This notebook demonstrates the scraping of data from the r/india subreddit using Praw Module in python.
PRAW which stands for Python Reddit API Wrapper helps us scrape data sceamlessly from any subreddit.

Post scraping the data, we saves the data in a csv file for further processing using Pandas Module.

We further load this raw Data again using Pandas and preform several cleaning and regularisation techniques and save it back into a csv file.

## Import Modules

In [11]:
import praw

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from numpy import random

import gensim
import nltk
from nltk.corpus import stopwords
import re
from bs4 import BeautifulSoup

## OAuth with PRAW:
Setting up Open-Standard Authorization Protocol for the service.

In [12]:
reddit = praw.Reddit(client_id="ffKcEa2xKfnhyg", client_secret="IJqQkTrDio0xKsKYKYmgeWSoOLM",
    user_agent="flair_predication", username="ASingh1206", password="g5gh#4$iQFGNBad")

subreddit = reddit.subreddit('india')

## Data Scraping:
Here we declare the various Flairs and setup code to efficiently receive data from the <b>API into the Pandas Dataframe</b>, further onces this raw data is collected it is then stored into a <b>CSV file</b>.

The API delivers around <b>240 Posts per Flair</b> on an average. Along with this the <b>Top 5 comments</b> per post have been also been saved. 

In [None]:
flairs = ["Photography", "Science/Technology","AskIndia","Business/Finance", "Policy/Economy",
          "Sports", "Food", "Politics", "Scheduled", "Non-Political"]
        
title = [] 
score = []
url = []
body = []
author = []
fl = []
com = []
num = []

for flair in flairs:
    relevant_subreddits = subreddit.search(f"flair_name:{flair}",limit=300)
    series = 0
    for submission in relevant_subreddits:
        count = 0
        if submission.num_comments != 0:
                count = min(5, submission.num_comments)
                submission.comments.replace_more(count)
                comment = ''
                for top_level_comment in submission.comments:
                    comment = comment + ' ' + top_level_comment.body

        print(flair, "  ", series)
        series = series + 1

        title.append(submission.title)
        score.append(submission.score)
        url.append(submission.url)
        body.append(submission.selftext)
        author.append(submission.author)
        fl.append(flair)
        com.append(comment)
        num.append(submission.num_comments)
    
dict = {'title': title, 'author': author, 'url': url, 'body': body, 'score': score,'flair': fl, 'num': num, 'comments' : com}  
     
df = pd.DataFrame(dict) 
df.fillna("",inplace = True) 

# saving the dataframe 
df.to_csv('f_300.csv', index=False) 

## Loading the Data for a Preview:

In [13]:
dff = pd.read_csv('f_300.csv')
print("Shape-->", dff.shape)
dff.head()

Shape--> (2422, 8)


Unnamed: 0,title,author,url,body,score,flair,num,comments
0,Different stages of hair loss in perfect order...,BreakingBrownBread,https://i.redd.it/ydbmwsa7jpt41.jpg,,2802,Photography,88,"So, can I guess that you're as bald as Shakaa..."
1,Women gather together during Dust storm in Raj...,TheDosaMan,https://i.redd.it/uapdc9dvels41.png,,3565,Photography,74,Steve McCurry captured this stunning image in...
2,Zoom in! I took over 600 shots of last night's...,vpsj,https://i.imgur.com/RLL0xvH.jpg,,1463,Photography,81,"#Details:\n\nFirst of all, please note that t..."
3,"A Wild Gaur, Nagarahole National Park",Coconut_Kid,https://i.redd.it/1zz6atjncds41.jpg,,658,Photography,71,In wild there is probably no Majestic beast a...
4,"Everyone, Puffy the Superdog. (ZenFone 6)",bosama_in_laden,https://i.redd.it/bk0fba8havt41.jpg,,603,Photography,39,Must be nice having trees around your house. ...


## Utility Function for regularize the data:
Here the data is made uniform in nature and any kind of StopWords have also been removed.

In [None]:
STOPWORDS = set(stopwords.words('english'))
replace_by_space = re.compile('[/(){}\[\]\|@,;]')
replace_symbol = re.compile('[^0-9a-z #+_]')

def clean_text(text):
    text = str(text)
    text = replace_by_space.sub(' ', text) 
    text = replace_symbol.sub('', text) 
    text = text.lower() 
    text = ' '.join(word for word in text.split() if word not in STOPWORDS) 
    return text

In [14]:
from gensim import utils
import gensim.parsing.preprocessing as gsp

filters = [
           gsp.strip_tags, 
           gsp.strip_punctuation,
           gsp.strip_multiple_whitespaces,
           gsp.strip_numeric,
           gsp.remove_stopwords, 
           gsp.strip_short, 
           gsp.stem_text
          ]

def cclean_text(s):
    s = str(s)
    s = s.lower()
    s = utils.to_unicode(s)
    for f in filters:
        s = f(s)
    return s

## Alteration to the data:
This step was performed after we had conducted the Exploratory Data Analysis on the the data set.

In [15]:
#dff = dff.drop("score", axis=1)
#dff = dff.drop("url", axis=1)
#dff = dff.drop("num", axis=1)

dff['title'] = dff['title'].apply(cclean_text)
dff['body'] = dff['body'].apply(cclean_text)
dff['comments'] = dff['comments'].apply(cclean_text)

dff = dff.fillna('')

combined_features = dff["title"] + ". " + dff["body"] + ". " + dff["comments"] 
dff = dff.assign(combined_features = combined_features)

#saving the csv file

dff.to_csv('f_300_clean.csv',index=False)

## Review the Data post Regularizing:
This data is now ready to be used to train and test our models on.

In [16]:
dff = pd.read_csv('f_300_clean.csv')
print(dff.shape)
dff.head()

(2422, 9)


Unnamed: 0,title,author,url,body,score,flair,num,comments,combined_features
0,differ stage hair loss perfect order mumbai lo...,BreakingBrownBread,https://i.redd.it/ydbmwsa7jpt41.jpg,,2802,Photography,88,guess bald shakaal hei gui rememb train hairst...,differ stage hair loss perfect order mumbai lo...
1,women gather dust storm rajasthan,TheDosaMan,https://i.redd.it/uapdc9dvels41.png,,3565,Photography,74,steve mccurri captur stun imag delet steven mc...,women gather dust storm rajasthan. nan. steve ...
2,zoom took shot night supermoon stack detail lu...,vpsj,https://i.imgur.com/RLL0xvH.jpg,,1463,Photography,81,detail note composit shot mean star moon shot ...,zoom took shot night supermoon stack detail lu...
3,wild gaur nagarahol nation park,Coconut_Kid,https://i.redd.it/1zz6atjncds41.jpg,,658,Photography,71,wild probabl majest beast beauti gaur photo cr...,wild gaur nagarahol nation park. nan. wild pro...
4,puffi superdog zenfon,bosama_in_laden,https://i.redd.it/bk0fba8havt41.jpg,,603,Photography,39,nice have tree hous free space live like actua...,puffi superdog zenfon. nan. nice have tree hou...
