<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3 - Web APIs & NLP (Part 1 - Data Acquisition)

## 1. Executive Summary

The world is changing rapidly as technology becomes more prevalent in our daily lives. Boosted by the COVID-19 crisis, companies have invested heavily in technology, which in turn has sped up the adoption of digital technologies. As a result, many jobs were suddenly made redundant, while new jobs requiring particular digital skills came into high demand to fill the employment gap. Many positions requiring skill sets such as data science, digital marketing, web development and software development were created in response to this employment trend.

## 2. Problem Statement

As a business analyst with General Assembly, a private organization that specializes in digital upskilling and career transformation, you are tasked with following up on current trends in digital upskilling by scraping text data from online forums such as Reddit and building a text classifier to identify whether a particular post belongs to the data science or the digital marketing subreddit. Knowing which topic attracts more meaningful discussion gives the instructional team a sense of the popularity and relevance of each skill set, and allows it to better allocate resources to improve the curriculum offered.

## 3. Importing Libraries

In [1]:
# imports
from tqdm.notebook import trange
import pandas as pd
import requests
import time

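#Improve DataFrame visualizations
#use fully qualified option names; the short forms are ambiguous in newer pandas versions
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", 100)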

## 4. Scraping 20,000 posts from Data Science subreddit

20,000 posts will be scraped from each subreddit so that enough text remains after cleaning, since some posts are expected to be removed during our data cleaning steps.

For scraping, we will use the Pushshift API to retrieve the required text information from each subreddit, then group it into a pandas DataFrame for manipulation in Python.

#### 4.1 Scraping first 100 submissions from r/datascience

In [2]:
#url for searching subreddit with Pushshift.io
url = "https://api.pushshift.io/reddit/search/submission"
    
#parameters for searching first 100 r/datascience submissions:
params = {
    'subreddit': 'datascience',
    'size': 100 
}

#scrape r/datascience data into json format
req = requests.get(url, params=params)
data = req.json()

#group data into dataframe
ds_df = pd.DataFrame(data['data'])

In [3]:
ds_df.shape

(100, 80)

In [4]:
ds_df.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_css_class,link_flair_richtext,link_flair_template_id,link_flair_text,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,suggested_sort,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,poll_data,post_hint,preview,thumbnail_height,thumbnail_width,url_overridden_by_dest,crosspost_parent,crosspost_parent_list,media,media_embed,secure_media,secure_media_embed,author_flair_background_color,author_flair_text_color,media_metadata,removed_by_category,is_gallery
0,[],False,Passion_369,,[],,text,t2_6joc8fq7,False,False,False,[],False,False,1646318273,self.datascience,https://www.reddit.com/r/datascience/comments/t5sz8q/hello_everyone_what_are_the_different_metho...,{},t5sz8q,False,True,False,False,False,True,True,False,,discussion,[],4fad7108-d77d-11e7-b0c6-0ee69f155af2,Discussion,dark,text,False,False,True,0,0,False,all_ads,/r/datascience/comments/t5sz8q/hello_everyone_what_are_the_different_methods_and/,False,6,1646318284,1,,True,False,False,datascience,t5_2sptq,704226,public,confidence,self,"Hello everyone, what are the different methods and algorithms for model aggregation in federated...",0,[],1.0,https://www.reddit.com/r/datascience/comments/t5sz8q/hello_everyone_what_are_the_different_metho...,all_ads,6,,,,,,,,,,,,,,,,,
1,[],False,TobogganFetish,,[],,text,t2_6q7dq,False,False,False,[],False,False,1646316403,self.datascience,https://www.reddit.com/r/datascience/comments/t5sc5u/is_it_worth_starting_data_science_as_an/,{},t5sc5u,False,True,False,False,False,True,True,False,,career,[],a6ee6fa0-d780-11e7-b6d0-0e0bd8823a7e,Career,dark,text,False,False,True,0,0,False,all_ads,/r/datascience/comments/t5sc5u/is_it_worth_starting_data_science_as_an/,False,6,1646316414,1,"For context, I've worked with data for 10 years but mostly in analysis/reporting roles. I've rec...",True,False,False,datascience,t5_2sptq,704203,public,confidence,self,Is it worth starting Data Science as an Individual Contributor?,0,[],1.0,https://www.reddit.com/r/datascience/comments/t5sc5u/is_it_worth_starting_data_science_as_an/,all_ads,6,,,,,,,,,,,,,,,,,
2,[],False,Malcolm101,,[],,text,t2_1hqlqa2v,False,False,False,[],False,False,1646310679,self.datascience,https://www.reddit.com/r/datascience/comments/t5qk8a/imputing_features_like_ratings_and_rankings/,{},t5qk8a,False,True,False,False,False,True,True,False,,education,[],99f9652a-d780-11e7-b558-0e52cdd59ace,Education,dark,text,False,False,True,0,0,False,all_ads,/r/datascience/comments/t5qk8a/imputing_features_like_ratings_and_rankings/,False,6,1646310689,1,Can any one tell how to deal with null values for rankings and ratings features in a movie reven...,True,False,False,datascience,t5_2sptq,704145,public,confidence,self,Imputing features like ratings and rankings,0,[],1.0,https://www.reddit.com/r/datascience/comments/t5qk8a/imputing_features_like_ratings_and_rankings/,all_ads,6,,,,,,,,,,,,,,,,,
3,[],False,call-mws,,[],,text,t2_dyaj3swo,False,False,False,[],False,False,1646305949,self.datascience,https://www.reddit.com/r/datascience/comments/t5p9ey/best_way_to_deal_with_missingempty_data_in_a/,{},t5p9ey,False,True,False,False,False,True,True,False,,discussion,[],4fad7108-d77d-11e7-b0c6-0ee69f155af2,Discussion,dark,text,False,False,False,0,0,False,all_ads,/r/datascience/comments/t5p9ey/best_way_to_deal_with_missingempty_data_in_a/,False,6,1646305959,1,"Hi. Potentially a simple, recurring questions here..\n\nI have a small dataset with around 10k r...",False,False,False,datascience,t5_2sptq,704104,public,confidence,self,Best way to deal with missing/empty data in a small dataset,0,[],1.0,https://www.reddit.com/r/datascience/comments/t5p9ey/best_way_to_deal_with_missingempty_data_in_a/,all_ads,6,,,,,,,,,,,,,,,,,
4,[],False,javioverflow,,[],,text,t2_4czmxpo8,False,False,False,[],False,False,1646304050,self.datascience,https://www.reddit.com/r/datascience/comments/t5ose6/curious_how_many_of_us_work_with_data_strea...,{},t5ose6,False,True,False,False,False,True,True,False,,discussion,[],4fad7108-d77d-11e7-b0c6-0ee69f155af2,Discussion,dark,text,False,False,False,0,0,False,all_ads,/r/datascience/comments/t5ose6/curious_how_many_of_us_work_with_data_streaming/,False,6,1646304061,1,What are your thoughts on those two?\n\n[View Poll](https://www.reddit.com/poll/t5ose6),True,False,False,datascience,t5_2sptq,704083,public,confidence,self,Curious how many of us work with data streaming or data batch,0,[],1.0,https://www.reddit.com/r/datascience/comments/t5ose6/curious_how_many_of_us_work_with_data_strea...,all_ads,6,"{'is_prediction': False, 'options': [{'id': '14220331', 'text': 'only data streaming'}, {'id': '...",,,,,,,,,,,,,,,,


In [5]:
#define parameters for collecting more submissions

def parameters(df, subreddit):
    
    params = {
        'subreddit': subreddit,
        'size': 100,
        'before': df.loc[(df.shape[0] - 1), 'created_utc']
    }

    return params
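
The `before` value is what moves the search backwards in time. The small sketch below illustrates this on a toy DataFrame; the timestamps are made up, and it assumes the rows are ordered newest first, as in the `ds_df.head()` output above. With a default integer index, `.iloc[-1]` picks the same row as `df.shape[0] - 1`.

```python
import pandas as pd

def parameters(df, subreddit):
    # same logic as the notebook cell: use the last row's timestamp as the cursor
    return {
        'subreddit': subreddit,
        'size': 100,
        'before': df['created_utc'].iloc[-1],
    }

# toy data with made-up timestamps, newest first
toy_df = pd.DataFrame({'id': ['c', 'b', 'a'], 'created_utc': [300, 200, 100]})

params = parameters(toy_df, 'datascience')
print(params['before'])  # timestamp of the oldest row collected so far
```

Each new request therefore asks only for submissions older than everything already in the DataFrame.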

In [6]:
#define function to collect more submissions from subreddit
def get_posts(params):
    
    #url for searching subreddit with Pushshift.io
    url = "https://api.pushshift.io/reddit/search/submission"
    
    #scrape submissions data from reddit into json format
    req = requests.get(url, params=params)
    data = req.json()
    
    #return data in pandas dataframe format
    df = pd.DataFrame(data['data'])
    
    return df
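
`get_posts` assumes every request succeeds and returns JSON with a `data` key. A more defensive variant might check the HTTP status and retry a few times. The following is only a sketch: the name `fetch_submissions`, the `get` hook, and the retry settings are hypothetical and not part of the notebook. The `get` hook exists so the function can be tried without network access.

```python
import time

def fetch_submissions(params, get=None, retries=3, wait=1.0):
    """Return the list of submission dicts for one request.

    `get` is any callable with the call shape of requests.get (hypothetical hook).
    """
    if get is None:
        import requests  # imported lazily so the offline demo below does not need it
        get = requests.get

    url = "https://api.pushshift.io/reddit/search/submission"
    last_error = None
    for _ in range(retries):
        try:
            resp = get(url, params=params)
            resp.raise_for_status()        # turn HTTP error codes into exceptions
            return resp.json()['data']     # same 'data' key the notebook reads
        except Exception as err:           # broad on purpose: network, HTTP and JSON errors
            last_error = err
            time.sleep(wait)               # pause before the next attempt
    raise RuntimeError(f"request failed after {retries} attempts") from last_error


# offline demo with a fake response object (no real API call is made)
class FakeResponse:
    def __init__(self, ok):
        self.ok = ok

    def raise_for_status(self):
        if not self.ok:
            raise IOError("simulated HTTP error")

    def json(self):
        return {'data': [{'id': 'abc', 'created_utc': 100}]}

calls = []

def flaky_get(url, params=None):
    calls.append(params)
    return FakeResponse(ok=len(calls) > 1)  # first call fails, second succeeds

rows = fetch_submissions({'subreddit': 'datascience', 'size': 100}, get=flaky_get, wait=0)
print(len(calls), rows[0]['id'])
```

The returned list can be wrapped with `pd.DataFrame(...)` in the same way `get_posts` does.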

#### 4.2 Scraping 20,000 submissions from r/datascience

In [7]:
#scrape 100 more submissions 199 times, aiming for 20,000 submissions in total

for i in trange(199):
    
    try:
        param = parameters(ds_df, 'datascience')
        ds_df = pd.concat([ds_df, get_posts(param)], ignore_index=True)
    
    except Exception:
        #notifies us if there is an error during scraping
        print("Error occurred while scraping")
        
    #1-second interval between requests to prevent server overload
    time.sleep(1)

  0%|          | 0/199 [00:00<?, ?it/s]

Error occurred while scraping
Error occurred while scraping
Error occurred while scraping


In [8]:
ds_df.shape

(19692, 86)

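We managed to scrape a total of 19,692 posts from the datascience subreddit. This is slightly short of the 20,000 target because three requests failed and a few pages returned fewer than 100 posts.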
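
A fixed number of iterations falls short of the target whenever a request fails or a page comes back with fewer than 100 posts. One alternative, sketched here under assumptions, is to keep paging until a target count is reached. The names `collect` and `fetch_page` are hypothetical, and the fake archive below stands in for the API; it assumes `before` returns only posts strictly older than the cursor.

```python
def collect(fetch_page, target, max_requests=1000):
    """Page backwards through time until `target` unique posts are collected.

    `fetch_page(before)` is a hypothetical callable returning a list of dicts
    with 'id' and 'created_utc', newest first; before=None means "latest".
    """
    posts, seen, before = [], set(), None
    for _ in range(max_requests):
        if len(posts) >= target:
            break
        page = fetch_page(before)
        if not page:                       # nothing older is available
            break
        for post in page:
            if post['id'] not in seen:     # skip duplicates across pages
                seen.add(post['id'])
                posts.append(post)
        before = page[-1]['created_utc']   # oldest timestamp becomes the next cursor
    return posts[:target]


# fake archive of 250 posts with made-up timestamps, newest first
archive = [{'id': f"p{t}", 'created_utc': t} for t in range(250, 0, -1)]

def fake_fetch(before, size=100):
    older = [p for p in archive if before is None or p['created_utc'] < before]
    return older[:size]

posts = collect(fake_fetch, target=230)
print(len(posts), posts[0]['created_utc'], posts[-1]['created_utc'])
# → 230 250 21
```

In a real run, a pause between requests (as in the loop above) would still be needed inside `fetch_page`.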

## 5. Scraping 20,000 posts from Digital Marketing subreddit

Again, we will attempt to scrape 20,000 posts from the digital marketing subreddit.

#### 5.1 Scraping first 100 submissions from r/digitalmarketing

In [9]:
#url for searching subreddit with Pushshift.io
url = "https://api.pushshift.io/reddit/search/submission"
    
#parameters for searching first 100 r/digitalmarketing submissions:
params = {
    'subreddit': 'digitalmarketing',
    'size': 100 
}

#scrape r/digitalmarketing data into json format
req = requests.get(url, params=params)
data = req.json()

#group data into dataframe
dm_df = pd.DataFrame(data['data'])

In [10]:
dm_df.shape

(100, 68)

In [11]:
dm_df.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,removed_by_category,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,post_hint,preview,author_cakeday,crosspost_parent,crosspost_parent_list,url_overridden_by_dest,thumbnail_height,thumbnail_width
0,[],False,June_Born,,[],,text,t2_ehakld5x,False,False,False,[],False,False,1646351513,self.DigitalMarketing,https://www.reddit.com/r/DigitalMarketing/comments/t658un/online_product_reviews/,{},t658un,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,1,0,False,all_ads,/r/DigitalMarketing/comments/t658un/online_product_reviews/,False,6,moderator,1646351524,1,[removed],True,False,False,DigitalMarketing,t5_2s3d6,76069,public,self,Online Product Reviews,0,[],1.0,https://www.reddit.com/r/DigitalMarketing/comments/t658un/online_product_reviews/,all_ads,6,,,,,,,,
1,[],False,throwawaygal1992,,[],,text,t2_4rhow1dx,False,False,False,[],False,False,1646349722,self.DigitalMarketing,https://www.reddit.com/r/DigitalMarketing/comments/t64mvc/does_anything_other_than_a_com_like_a_...,{},t64mvc,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/DigitalMarketing/comments/t64mvc/does_anything_other_than_a_com_like_a_live_org_or/,False,6,,1646349733,1,,True,False,False,DigitalMarketing,t5_2s3d6,76068,public,self,"Does anything other than a .com, like a .live .org or .net mess up with your SEO rankings on Goo...",0,[],1.0,https://www.reddit.com/r/DigitalMarketing/comments/t64mvc/does_anything_other_than_a_com_like_a_...,all_ads,6,,,,,,,,
2,[],False,catsandblankets,,[],,text,t2_elfob,False,False,False,[],False,False,1646341258,self.DigitalMarketing,https://www.reddit.com/r/DigitalMarketing/comments/t61lxq/we_share_courses_guides_often_here_but...,{},t61lxq,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/DigitalMarketing/comments/t61lxq/we_share_courses_guides_often_here_but_what_would/,False,6,,1646341269,1,The company I work for was a 10m/yr wholesale CPG company which grew to a 30m/yr as soon as they...,True,False,False,DigitalMarketing,t5_2s3d6,76061,public,self,"We share courses &amp; guides often here, but what would you suggest is a good 101 video that I ...",0,[],1.0,https://www.reddit.com/r/DigitalMarketing/comments/t61lxq/we_share_courses_guides_often_here_but...,all_ads,6,,,,,,,,
3,[],False,hasinul_babu,,[],,text,t2_j7xpnyul,False,False,False,[],False,False,1646340350,self.DigitalMarketing,https://www.reddit.com/r/DigitalMarketing/comments/t619rx/how_to_create_a_fiverr_gig_video_using...,{},t619rx,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/DigitalMarketing/comments/t619rx/how_to_create_a_fiverr_gig_video_using_powerpoint/,False,6,moderator,1646340361,1,[removed],True,False,False,DigitalMarketing,t5_2s3d6,76061,public,self,"How to create a Fiverr Gig Video using PowerPoint, Camtasia |Fiverr Skil...",0,[],1.0,https://www.reddit.com/r/DigitalMarketing/comments/t619rx/how_to_create_a_fiverr_gig_video_using...,all_ads,6,,,,,,,,
4,[],False,saguirre99,,[],,text,t2_4t8g70zj,False,False,False,[],False,False,1646340213,self.DigitalMarketing,https://www.reddit.com/r/DigitalMarketing/comments/t617ze/help_with_content_ideas_for_blog/,{},t617ze,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/DigitalMarketing/comments/t617ze/help_with_content_ideas_for_blog/,False,6,moderator,1646340224,1,[removed],True,False,False,DigitalMarketing,t5_2s3d6,76060,public,self,Help with Content Ideas for Blog,0,[],1.0,https://www.reddit.com/r/DigitalMarketing/comments/t617ze/help_with_content_ideas_for_blog/,all_ads,6,self,"{'enabled': False, 'images': [{'id': '5Cie5far9IgYCcTEUIqrA4BfHDvOgpx1ZosPJpKIcDM', 'resolutions...",,,,,,


#### 5.2 Scraping 20,000 submissions from r/digitalmarketing

In [12]:
#scrape 100 more submissions 199 times, aiming for 20,000 submissions in total

for i in trange(199):
    
    try:
        param = parameters(dm_df, 'digitalmarketing')
        dm_df = pd.concat([dm_df, get_posts(param)], ignore_index=True)
    
    except Exception:
        #notifies us if there is an error during scraping
        print("Error occurred while scraping")
        
    #1-second interval between requests to prevent server overload
    time.sleep(1)

  0%|          | 0/199 [00:00<?, ?it/s]

Error occurred while scraping
Error occurred while scraping
Error occurred while scraping


In [13]:
dm_df.shape

(19697, 79)

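We managed to scrape a total of 19,697 posts from the digital marketing subreddit. As with r/datascience, three failed requests and a few short pages account for the shortfall from 20,000.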

## 6. Export data to csv format

The two cells below export the DataFrames containing the submissions from both subreddits. They are commented out to prevent accidentally overwriting existing data. To overwrite the datasets with the latest 20,000 submissions from both subreddits, uncomment both lines and run the cells.

In [14]:
#ds_df.to_csv('../data/ds_subreddit_submissions', index=False)

In [15]:
#dm_df.to_csv('../data/dm_subreddit_submissions', index=False)
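
Commenting the export lines out is one way to protect existing files. Another option, shown here only as a sketch and not as the notebook's method, is a small helper that refuses to write when the target file already exists. The name `export_once` and the demo file name are hypothetical.

```python
import os
import tempfile

import pandas as pd

def export_once(df, path):
    """Write df to CSV only if `path` does not exist yet; return True if written."""
    if os.path.exists(path):
        print(f"{path} already exists, skipping export")
        return False
    df.to_csv(path, index=False)
    return True

# demo on a throwaway directory with made-up data
demo_df = pd.DataFrame({'title': ['a', 'b'], 'created_utc': [2, 1]})
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'demo_submissions.csv')
    first = export_once(demo_df, target)    # file is created
    second = export_once(demo_df, target)   # existing file is left untouched

print(first, second)
```

With such a helper the export cells could stay active, and a deliberate refresh would mean deleting the old file first.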