# Project 3: Reddit API Classification & Natural Language Processing

## 01. API Web Scraping & Data Cleaning - /r/Python


## Problem Statement

In this project, we will explore how well Natural Language Processing Model differentiate post content from two similar subreddits, and which combinations of model and classifier works best? What is the accuracy and how much of the miss-classification will occur between two different subreddit posts.

## Table of contents

- [1.Data Scraping](#1.Data-Scraping)<br>
- [2.Import Data and Data Cleaning](#2.Import-Data-and-Data-Cleaning)<br>
- [3.Data Frame Export](#3.Data-Frame-Export)<br>

In [1]:
import pandas as pd
import numpy as np
import random
import requests
import json
import time
import re
import string
from nltk.stem.porter import PorterStemmer
from xml.sax.saxutils import unescape

pd.set_option('max_colwidth', 100)

## 1.Data Scraping

In [2]:
urls = ['https://www.reddit.com/r/Python.json']

In [3]:
# Define function for data scraping

# def data_scrape(url, num):
    
# Get posts as list of dictionaries, each containing data on one post
posts = []

for url in urls:
    after = None

    for a in range(100):
        if after == None:
            current_url = url
        else:
            current_url = url + '?after=' + after
        print(current_url)
        res = requests.get(current_url, headers={'User-agent': 'Learn Python Bot 1.0'})
    
        if res.status_code != 200:
            print('Status error', res.status_code)
            break
    
        df_posts = pd.DataFrame(posts)
        current_dict = res.json()
        current_posts = [p['data'] for p in current_dict['data']['children']]
        print ("No of posts " + str(len(current_posts)))
        posts.extend(current_posts)
        after = current_dict['data']['after']
    
        pd.DataFrame(posts).to_csv('../datasets/python.csv', index=False)
    
        # generate a random sleep duration to look more 'natural'
        sleep_duration = random.randint(2,10)
        print(sleep_duration)
        time.sleep(sleep_duration)

https://www.reddit.com/r/Python.json
No of posts 27
4
https://www.reddit.com/r/Python.json?after=t3_glfxsx
No of posts 25
9
https://www.reddit.com/r/Python.json?after=t3_gl1zty
No of posts 25
3
https://www.reddit.com/r/Python.json?after=t3_gkwhqd
No of posts 25
7
https://www.reddit.com/r/Python.json?after=t3_gkvx67
No of posts 25
5
https://www.reddit.com/r/Python.json?after=t3_gjudun
No of posts 25
4
https://www.reddit.com/r/Python.json?after=t3_gkg5fx
No of posts 25
5
https://www.reddit.com/r/Python.json?after=t3_gk38gn
No of posts 25
3
https://www.reddit.com/r/Python.json?after=t3_gjtvq4
No of posts 25
6
https://www.reddit.com/r/Python.json?after=t3_gjdedk
No of posts 25
4
https://www.reddit.com/r/Python.json?after=t3_ginw87
No of posts 25
6
https://www.reddit.com/r/Python.json?after=t3_givm0h
No of posts 25
2
https://www.reddit.com/r/Python.json?after=t3_gj0nwx
No of posts 25
2
https://www.reddit.com/r/Python.json?after=t3_gib74w
No of posts 25
6
https://www.reddit.com/r/Python.json

## 2.Import Data and Data Cleaning

In [27]:
py_df = pd.read_csv('../datasets/python.csv')
py_df.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,created_utc,num_crossposts,media,is_video,link_flair_template_id,crosspost_parent_list,crosspost_parent,media_metadata,poll_data,author_cakeday
0,,Python,Top Level comments must be **Job Opportunities.**\n\nPlease include **Location** or any other **...,t2_628u,False,,0,False,"/r/Python Job Board for May, June, July",[],...,1588611000.0,0,,False,,,,,,
1,,Python,"Tell /r/python what you're working on this week! You can be bragging, grousing, sharing your pas...",t2_6l4z3,False,,0,False,What's everyone working on this week?,[],...,1589293000.0,0,,False,,,,,,
2,,Python,,t2_4etcgp3v,False,,0,False,Created a python script that execute Exploratory Data Analysis on any CSV file. It generates a t...,[],...,1589683000.0,0,"{'reddit_video': {'fallback_url': 'https://v.redd.it/2jw3tbx3l8z41/DASH_720?source=fallback', 'h...",True,d7dfae22-4113-11ea-b9fe-0e741fe75651,,,,,
3,,Python,,t2_e3nop1u,False,,0,False,"I made an Android app that detects and recognises traffic signs, using Kivy and OpenCV, to help ...",[],...,1589734000.0,0,"{'reddit_video': {'fallback_url': 'https://v.redd.it/3opn5k9uscz41/DASH_1080?source=fallback', '...",True,d7dfae22-4113-11ea-b9fe-0e741fe75651,,,,,
4,,Python,I have been watching the lectures from a computer science 101 course offered for free through MI...,t2_4s1r4pn9,False,,0,False,Anyone else learning python during the quarantine?,[],...,1589734000.0,0,,False,0df42996-1c5e-11ea-b1a0-0e44e1c5b731,,,,,


In [28]:
py_df.shape

(2484, 107)

In [29]:
py_df = py_df[['name','title','selftext','subreddit']]

In [30]:
py_df.shape

(2484, 4)

In [31]:
py_df.head()

Unnamed: 0,name,title,selftext,subreddit
0,t3_gdfaip,"/r/Python Job Board for May, June, July",Top Level comments must be **Job Opportunities.**\n\nPlease include **Location** or any other **...,Python
1,t3_gibxv4,What's everyone working on this week?,"Tell /r/python what you're working on this week! You can be bragging, grousing, sharing your pas...",Python
2,t3_gl7lp7,Created a python script that execute Exploratory Data Analysis on any CSV file. It generates a t...,,Python
3,t3_glikj1,"I made an Android app that detects and recognises traffic signs, using Kivy and OpenCV, to help ...",,Python
4,t3_glikya,Anyone else learning python during the quarantine?,I have been watching the lectures from a computer science 101 course offered for free through MI...,Python


In [32]:
py_df.drop_duplicates(subset='name',inplace=True)

In [33]:
py_df.shape

(866, 4)

In [34]:
py_df.isnull().sum()

name           0
title          0
selftext     373
subreddit      0
dtype: int64

In [35]:
py_df['selftext'].fillna(value='', inplace=True)

In [36]:
py_df.isnull().sum()

name         0
title        0
selftext     0
subreddit    0
dtype: int64

In [37]:
py_df['title_text'] = py_df['title'] + " " +py_df['selftext']

In [38]:
py_df['title_text'].astype(str)

0      /r/Python Job Board for May, June, July Top Level comments must be **Job Opportunities.**\n\nPle...
1      What's everyone working on this week? Tell /r/python what you're working on this week! You can b...
2      Created a python script that execute Exploratory Data Analysis on any CSV file. It generates a t...
3      I made an Android app that detects and recognises traffic signs, using Kivy and OpenCV, to help ...
4      Anyone else learning python during the quarantine? I have been watching the lectures from a comp...
                                                      ...                                                 
861    What Linux OS for Python? I'm a python novice, but I'm checking around for the job requirements ...
862                                           Showing ADS to other parties in my network Do you know how ?
863    Pycharm Venv Issues Anybody experienced anything like the below / know why this would keep happe...
864                                  

In [39]:
py_df.head()

Unnamed: 0,name,title,selftext,subreddit,title_text
0,t3_gdfaip,"/r/Python Job Board for May, June, July",Top Level comments must be **Job Opportunities.**\n\nPlease include **Location** or any other **...,Python,"/r/Python Job Board for May, June, July Top Level comments must be **Job Opportunities.**\n\nPle..."
1,t3_gibxv4,What's everyone working on this week?,"Tell /r/python what you're working on this week! You can be bragging, grousing, sharing your pas...",Python,What's everyone working on this week? Tell /r/python what you're working on this week! You can b...
2,t3_gl7lp7,Created a python script that execute Exploratory Data Analysis on any CSV file. It generates a t...,,Python,Created a python script that execute Exploratory Data Analysis on any CSV file. It generates a t...
3,t3_glikj1,"I made an Android app that detects and recognises traffic signs, using Kivy and OpenCV, to help ...",,Python,"I made an Android app that detects and recognises traffic signs, using Kivy and OpenCV, to help ..."
4,t3_glikya,Anyone else learning python during the quarantine?,I have been watching the lectures from a computer science 101 course offered for free through MI...,Python,Anyone else learning python during the quarantine? I have been watching the lectures from a comp...


In [40]:
# Convert the &amp, &gt, %lt and XML character entity reference back to &, > and <
py_df['title_text'] = py_df['title_text'].apply(unescape)

In [41]:
# Replace the http, www into blank
py_df['title_text'] = py_df['title_text'].replace(r'http\S+', '', regex=True).replace(r'www\S+', '', regex=True)

In [42]:
def clean_text(text):
    text = re.sub(r'^https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)#removing links
    text = text.lower()  #making eveything lower case
    text = re.sub('\[.*?()\]',' ',text) #removing some punctuation
    text = re.sub('[%s]'%re.escape(string.punctuation),'',text)  #removing more punctuation
    text = re.sub('\w*d\w*',' ',text) #removing words with numbers in them
    text = re.sub('\d',' ',text) #removing numbers
    text = re.sub('\n',' ',text) #removing newlines
    return text
cleaner= lambda x: clean_text(x)

In [43]:
py_df['title_text'] = py_df['title_text'].apply(cleaner)

In [44]:
py_df.head(10)

Unnamed: 0,name,title,selftext,subreddit,title_text
0,t3_gdfaip,"/r/Python Job Board for May, June, July",Top Level comments must be **Job Opportunities.**\n\nPlease include **Location** or any other **...,Python,rpython job for may june july top level comments must be job opportunities please location ...
1,t3_gibxv4,What's everyone working on this week?,"Tell /r/python what you're working on this week! You can be bragging, grousing, sharing your pas...",Python,whats everyone working on this week tell rpython what youre working on this week you can be brag...
2,t3_gl7lp7,Created a python script that execute Exploratory Data Analysis on any CSV file. It generates a t...,,Python,a python script that execute exploratory analysis on any csv file it generates a text report...
3,t3_glikj1,"I made an Android app that detects and recognises traffic signs, using Kivy and OpenCV, to help ...",,Python,i an app that recognises traffic signs using kivy opencv to help combat traffic casual...
4,t3_glikya,Anyone else learning python during the quarantine?,I have been watching the lectures from a computer science 101 course offered for free through MI...,Python,anyone else learning python the quarantine i have been watching the lectures from a computer s...
5,t3_glj128,I made a python script to get reviews about a movie or a TV series in various aspects. This scri...,,Python,i a python script to get reviews about a movie or a tv series in various aspects this script u...
6,t3_glgl86,"Build &amp; Deploy A Python Web App To Automate Twitter | Flask, Heroku, Twitter API &amp; Googl...",,Python,a python web app to automate twitter flask heroku twitter api google sheets api
7,t3_glkc76,I made a script to pull images from Reddit as a wallpaper collection,,Python,i a script to pull images from as a wallpaper collection
8,t3_glb2zc,I made a tool that allows you to search through code snippets using natural language - including...,,Python,i a tool that allows you to search through snippets using natural language m python s...
9,t3_glkcvv,GUI's in Python: best choices?,"Hi guys,\n\nI have been considering developing this desktop application with a GUI in Python. No...",Python,guis in python best choices hi guys i have been this application with a gui in python now...


## 3.Data Frame Export

In [45]:
py_df.to_csv('../datasets/Python_cleaned.csv',index=False)

The web scraping and data cleaning process had created a DataFrame table which containing titles,post and combined values(title_text) and save to a csv file.