# Project 3:  Reddit API Classification & Natural Language Processing


## 02. API Web Scraping & Data Cleaning - /r/bigdata

## Table of contents

- [1.Data Scraping](#1.Data-Scraping)<br>
- [2.Import Data and Data Cleaning](#2.Import-Data-and-Data-Cleaning)<br>
- [3.Data Frame Export](#3.Data-Frame-Export)<br>

In [1]:
import pandas as pd
import numpy as np
import random
import requests
import json
import time
import re
import string
from nltk.stem.porter import PorterStemmer
from xml.sax.saxutils import unescape

pd.set_option('max_colwidth', 100)

## 1.Data Scraping

In [2]:
urls = ['https://www.reddit.com/r/bigdata.json']

In [3]:
# Define function for data scraping

# def data_scrape(url, num):
    
# Get posts as list of dictionaries, each containing data on one post
posts = []

for url in urls:
    after = None

    for a in range(100):
        if after == None:
            current_url = url
        else:
            current_url = url + '?after=' + after
        print(current_url)
        res = requests.get(current_url, headers={'User-agent': 'Learn Python Bot 1.0'})
    
        if res.status_code != 200:
            print('Status error', res.status_code)
            break
    
        df_posts = pd.DataFrame(posts)
        current_dict = res.json()
        current_posts = [p['data'] for p in current_dict['data']['children']]
        print ("No of posts " + str(len(current_posts)))
        posts.extend(current_posts)
        after = current_dict['data']['after']
    
        pd.DataFrame(posts).to_csv('../datasets/bigdata.csv', index=False)
    
        # generate a random sleep duration to look more 'natural'
        sleep_duration = random.randint(2,10)
        print(sleep_duration)
        time.sleep(sleep_duration)

https://www.reddit.com/r/bigdata.json
No of posts 25
4
https://www.reddit.com/r/bigdata.json?after=t3_gis3xu
No of posts 25
10
https://www.reddit.com/r/bigdata.json?after=t3_gg651b
No of posts 25
10
https://www.reddit.com/r/bigdata.json?after=t3_geuszg
No of posts 25
2
https://www.reddit.com/r/bigdata.json?after=t3_gd4jpo
No of posts 25
5
https://www.reddit.com/r/bigdata.json?after=t3_gad0ah
No of posts 25
5
https://www.reddit.com/r/bigdata.json?after=t3_g7moca
No of posts 25
7
https://www.reddit.com/r/bigdata.json?after=t3_g469ow
No of posts 25
7
https://www.reddit.com/r/bigdata.json?after=t3_g1vppx
No of posts 25
3
https://www.reddit.com/r/bigdata.json?after=t3_fz4oke
No of posts 25
3
https://www.reddit.com/r/bigdata.json?after=t3_fw43h8
No of posts 25
6
https://www.reddit.com/r/bigdata.json?after=t3_ftk2ap
No of posts 25
9
https://www.reddit.com/r/bigdata.json?after=t3_frj7h1
No of posts 25
6
https://www.reddit.com/r/bigdata.json?after=t3_fnrsvr
No of posts 25
9
https://www.reddit.c

## 2.Import Data and Data Cleaning

In [4]:
bd_df = pd.read_csv('../datasets/bigdata.csv')
bd_df.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,subreddit_subscribers,created_utc,num_crossposts,media,is_video,media_metadata,crosspost_parent_list,crosspost_parent,author_cakeday,poll_data
0,,bigdata,,t2_xf2t5,False,,0,False,Spark Partitions,[],...,35554,1589729000.0,0,,False,,,,,
1,,bigdata,"From a computing perspective, there are essentially 2 types of scaling — vertical and horizontal...",t2_6g6ggfmr,False,,0,False,Hadoop Distributed File System - A comprehensive guide,[],...,35554,1589701000.0,0,,False,"{'grt1zi433az41': {'status': 'valid', 'e': 'AnimatedImage', 'm': 'image/gif', 'p': [{'y': 127, '...",,,,
2,,bigdata,,t2_ku4l5,False,,0,False,Laughing at Big Data – eBook – Great new insight into realities of IT,[],...,35554,1589613000.0,0,,False,,,,,
3,,bigdata,,t2_ku4l5,False,,0,False,Why I called bullshit on the data lakehouse nonsense,[],...,35554,1589641000.0,0,,False,,,,,
4,,bigdata,One of the big headaches of a traditional data warehouse is its hardware and software infrastruc...,t2_150ojy,False,,0,False,What Is Data Warehouse As a Service and Why Would You Need It,[],...,35554,1589615000.0,0,,False,,,,,


In [5]:
bd_df.shape

(2492, 106)

In [6]:
bd_df = bd_df[['name','title','selftext','subreddit']]

In [7]:
bd_df.shape

(2492, 4)

In [8]:
bd_df.head()

Unnamed: 0,name,title,selftext,subreddit
0,t3_glhbet,Spark Partitions,,bigdata
1,t3_glbdff,Hadoop Distributed File System - A comprehensive guide,"From a computing perspective, there are essentially 2 types of scaling — vertical and horizontal...",bigdata
2,t3_gkqdor,Laughing at Big Data – eBook – Great new insight into realities of IT,,bigdata
3,t3_gkw1lm,Why I called bullshit on the data lakehouse nonsense,,bigdata
4,t3_gkqu9a,What Is Data Warehouse As a Service and Why Would You Need It,One of the big headaches of a traditional data warehouse is its hardware and software infrastruc...,bigdata


In [9]:
bd_df.drop_duplicates(subset='name',inplace=True)

In [10]:
bd_df.shape

(971, 4)

In [11]:
bd_df.isnull().sum()

name           0
title          0
selftext     623
subreddit      0
dtype: int64

> Not all post have selftext, but the subreddit are complete and unique, so we going to reaplce the NaN values with empty string.

In [12]:
bd_df['selftext'].fillna(value='',inplace=True)

In [13]:
#Check the null value again
bd_df.isnull().sum()

name         0
title        0
selftext     0
subreddit    0
dtype: int64

In [14]:
bd_df['title_text'] = bd_df['title'] + " " + bd_df['selftext']

In [15]:
bd_df['title_text'].astype(str)

0                                                                                        Spark Partitions 
1      Hadoop Distributed File System - A comprehensive guide From a computing perspective, there are e...
2                                   Laughing at Big Data – eBook – Great new insight into realities of IT 
3                                                    Why I called bullshit on the data lakehouse nonsense 
4      What Is Data Warehouse As a Service and Why Would You Need It One of the big headaches of a trad...
                                                      ...                                                 
966                       Big Data in Retail Industry [Case Studies] - Take your Business to Next Level!! 
967                                                            A Brief Introduction Of Big Data Framework 
968                                                                         Simplifying the data pipeline 
969                                  

In [16]:
bd_df.head()

Unnamed: 0,name,title,selftext,subreddit,title_text
0,t3_glhbet,Spark Partitions,,bigdata,Spark Partitions
1,t3_glbdff,Hadoop Distributed File System - A comprehensive guide,"From a computing perspective, there are essentially 2 types of scaling — vertical and horizontal...",bigdata,"Hadoop Distributed File System - A comprehensive guide From a computing perspective, there are e..."
2,t3_gkqdor,Laughing at Big Data – eBook – Great new insight into realities of IT,,bigdata,Laughing at Big Data – eBook – Great new insight into realities of IT
3,t3_gkw1lm,Why I called bullshit on the data lakehouse nonsense,,bigdata,Why I called bullshit on the data lakehouse nonsense
4,t3_gkqu9a,What Is Data Warehouse As a Service and Why Would You Need It,One of the big headaches of a traditional data warehouse is its hardware and software infrastruc...,bigdata,What Is Data Warehouse As a Service and Why Would You Need It One of the big headaches of a trad...


In [17]:
# Convert the &amp, &gt, %lt and XML character entity reference back to &, > and <
bd_df['title_text'] = bd_df['title_text'].apply(unescape)

In [18]:
# Replace the http, www into blank
bd_df['title_text'] = bd_df['title_text'].replace(r'http\S+', '', regex=True).replace(r'www\S+', '', regex=True)

In [19]:
def clean_text(text):
    text = re.sub(r'^https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)#removing links
    text = text.lower()  #making eveything lower case
    text = re.sub('\[.*?()\]',' ',text) #removing some punctuation
    text = re.sub('[%s]'%re.escape(string.punctuation),'',text)  #removing more punctuation
    text = re.sub('\w*d\w*',' ',text) #removing words with numbers in them
    text = re.sub('\d',' ',text) #removing numbers
    text = re.sub('\n',' ',text) #removing newlines
    return text
cleaner= lambda x: clean_text(x)

In [20]:
bd_df['title_text'] = bd_df['title_text'].apply(cleaner)

In [21]:
bd_df.head(10)

Unnamed: 0,name,title,selftext,subreddit,title_text
0,t3_glhbet,Spark Partitions,,bigdata,spark partitions
1,t3_glbdff,Hadoop Distributed File System - A comprehensive guide,"From a computing perspective, there are essentially 2 types of scaling — vertical and horizontal...",bigdata,file system a comprehensive from a computing perspective there are essentially types of...
2,t3_gkqdor,Laughing at Big Data – eBook – Great new insight into realities of IT,,bigdata,laughing at big – ebook – great new insight into realities of it
3,t3_gkw1lm,Why I called bullshit on the data lakehouse nonsense,,bigdata,why i bullshit on the lakehouse nonsense
4,t3_gkqu9a,What Is Data Warehouse As a Service and Why Would You Need It,One of the big headaches of a traditional data warehouse is its hardware and software infrastruc...,bigdata,what is warehouse as a service why you it one of the big of a warehouse is its ...
5,t3_gkeozm,Big Data: Its Impact and Significance,,bigdata,big its impact significance
6,t3_gkdqs3,Computational social science #bigdatalearning #learning #socialnetworksanalysis #onlinecourse,Hi from the University of California\n\nInterested in learning more about Computational Social S...,bigdata,computational social science learning socialnetworksanalysis onlinecourse hi from the universi...
7,t3_gkakk1,Webinar on How To Choose the Right Data Science Program For Your Career,,bigdata,webinar on how to choose the right science program for your career
8,t3_gk59h2,Role of Web Scraping in the E-commerce Industry,[E commerce web scraping](https://www.loginworks.com/ecommerce-web-scraping) provides a bird’s e...,bigdata,role of web scraping in the ecommerce a ’s eye view of pricing market prevailing patt...
9,t3_gk2oof,Doing redesign of Statistics without Borders non-profit organization,Hi everyone! I’m a UX designer student and my team is working on a redesign of a non-profit orga...,bigdata,of statistics without nonprofit organization hi everyone i’m a ux my team is working...


## 3.Data Frame Export

In [22]:
bd_df.to_csv('../datasets/BigData_cleaned.csv',index=False)

The web scraping and data cleaning process had created a DataFrame table which containing titles,post and combined values(title_text) and save to a csv file.