
# Project 3: Web APIs & Classification
## Notebook 1: Data Collection 
---

## Introduction

A leading online grocery platform wants to use trending topics on 2 subreddits, r/Cooking and r/AskBaking, for an upcoming marketing campaign that promotes groceries and food items to consumers. During data collection, the posts were gathered without recording which subreddit each one came from. My task is to classify the posts into the correct subreddits so that they do not have to be labelled manually.

## Problem Statement

Given a post from either r/Cooking or r/AskBaking, how can I correctly identify which subreddit it is from?

## Overview

The task is split into 2 notebooks: this notebook focuses on data collection, while the next notebook covers data cleaning, EDA and modeling.

- In this segment, we aim to collect 1000 posts from each of the 2 subreddits, r/Cooking and r/AskBaking.

## Data Collection
---

In [1]:
#Imports
import requests 
import pandas as pd
import time
import random

In [2]:
#url links for the 2 subreddits

url_1 ='https://www.reddit.com/r/Cooking.json'
url_2 = 'https://www.reddit.com/r/AskBaking.json'

In [3]:
#define the file paths 

file_path_1 = '../datasets/cooking_raw.csv'
file_path_2 = '../datasets/baking_raw.csv'
file_path_3 = '../datasets/cooking_new.csv'
file_path_4 = '../datasets/baking_new.csv'

## r/Cooking
---
##### Accessing Reddit's API to obtain the posts for r/Cooking
- 786 unique entries were saved
- the total number of entries was checked against the feature 'name', which provides a unique name for each post

In [4]:
# extract cooking posts
posts_1 = []
after = None

# we are able to access 25 posts each time, as such we will repeat this 40 times to get 1000 posts 

for a in range(40):
    if after is None and a == 0:      # first request: start from the top of the listing
        current_url = url_1
    elif after is None:               # no 'after' token means the listing is exhausted (fewer than 1000 posts); stop to avoid collecting duplicates
        print('No more posts')
        break
    else:
        current_url = url_1 + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Kitten 1.0'})  #define a different user agent
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts_1.extend(current_posts)
    after = current_dict['data']['after']
    
    # checkpoint everything collected so far, so an interrupted run keeps its data
    pd.DataFrame(posts_1).to_csv(file_path_1, index = False)

    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,6)
    print(sleep_duration)
    time.sleep(sleep_duration)


https://www.reddit.com/r/Cooking.json
6
https://www.reddit.com/r/Cooking.json?after=t3_espfu3
4
https://www.reddit.com/r/Cooking.json?after=t3_esd8xg
6
https://www.reddit.com/r/Cooking.json?after=t3_esi0l3
5
https://www.reddit.com/r/Cooking.json?after=t3_es6t40
2
https://www.reddit.com/r/Cooking.json?after=t3_es5znv
3
https://www.reddit.com/r/Cooking.json?after=t3_erwjg5
6
https://www.reddit.com/r/Cooking.json?after=t3_erkfuo
3
https://www.reddit.com/r/Cooking.json?after=t3_erpd2j
6
https://www.reddit.com/r/Cooking.json?after=t3_erhe2o
4
https://www.reddit.com/r/Cooking.json?after=t3_erhfv3
3
https://www.reddit.com/r/Cooking.json?after=t3_erbgxl
3
https://www.reddit.com/r/Cooking.json?after=t3_er436f
6
https://www.reddit.com/r/Cooking.json?after=t3_er1z53
3
https://www.reddit.com/r/Cooking.json?after=t3_eqs5oi
2
https://www.reddit.com/r/Cooking.json?after=t3_eqo08k
6
https://www.reddit.com/r/Cooking.json?after=t3_eqne0e
2
https://www.reddit.com/r/Cooking.json?after=t3_epl173
2
https://

In [5]:
#save cooking posts file to csv
print(f"Save {len(posts_1)} entries to csv file..")
pd.DataFrame(posts_1).to_csv(file_path_1, index = False)

Save 786 entries to csv file..


The column 'name' served as a unique identifier for each post:

In [6]:
#check
#prints names column in cooking
for x in posts_1:
    print(x['name'])

t3_esklbe
t3_esh9jv
t3_esnbr6
t3_es3e31
t3_ese5fw
t3_esjowz
t3_esi20j
t3_eslquq
t3_esm5np
t3_esffsc
t3_esq1y8
t3_esm721
t3_esksph
t3_espag7
t3_eslubf
t3_esl034
t3_esp6sv
t3_esntw5
t3_esqs6i
t3_esltnw
t3_esqg51
t3_esno11
t3_eskn38
t3_eshpi2
t3_espfu3
t3_esmoxc
t3_eskplr
t3_esoyv5
t3_esj9bl
t3_esm4rh
t3_eslspu
t3_eshp7z
t3_esno2e
t3_esnjxw
t3_eskhq7
t3_esb11u
t3_esngcf
t3_eskdo0
t3_esne3r
t3_esic8s
t3_esbxqh
t3_esmxtu
t3_esmvb6
t3_esp8gr
t3_esmem4
t3_esmefa
t3_esmal0
t3_esm9p6
t3_esm6x9
t3_esd8xg
t3_esiwgx
t3_eslxoo
t3_eslvx7
t3_eslso4
t3_eslre3
t3_esife8
t3_eslbdi
t3_eso5jt
t3_esl67t
t3_esepzd
t3_esbe98
t3_esbcc1
t3_esket8
t3_esh6q5
t3_esk72d
t3_esk4ns
t3_esgrx8
t3_erymjo
t3_esjpqq
t3_esg2pz
t3_esizo8
t3_esiorn
t3_esfg4t
t3_eslnw9
t3_esi0l3
t3_eshvsy
t3_esonnu
t3_eshnba
t3_esh3bh
t3_esh0o1
t3_esgfc5
t3_es7yko
t3_ermku7
t3_es88y7
t3_es9qa8
t3_esevn7
t3_esetgq
t3_eselrn
t3_es8nxa
t3_esb0x7
t3_esdt1m
t3_esds4n
t3_esdg1r
t3_eryx9q
t3_esi0e9
t3_es7mnh
t3_es1yw3
t3_es5dtc
t3_es6ttj
t3_es6t40


Performing a check to confirm that there are no duplicate posts:

In [7]:
#check if all the posts are unique 
unique = len(set([x['name'] for x in posts_1])) #get length of unique names
unique == len(posts_1)

True
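
The same uniqueness check can also be written with pandas, which additionally reports how many rows are repeated and drops them in one step. This is only a minimal sketch: `sample_posts` is a hypothetical stand-in for `posts_1`, reduced to the 'name' field.

```python
import pandas as pd

# Hypothetical sample standing in for `posts_1`; the real list holds full post dicts.
sample_posts = [{'name': 't3_a'}, {'name': 't3_b'}, {'name': 't3_a'}]

names = pd.Series([p['name'] for p in sample_posts])

# Number of rows whose name repeats an earlier row.
n_duplicates = int(names.duplicated().sum())

# Keep the first occurrence of every name.
deduped = pd.DataFrame(sample_posts).drop_duplicates(subset='name')

print(n_duplicates, len(deduped))
```

A result of 0 duplicates corresponds to the `True` returned by the set-based check.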

## r/AskBaking
---
##### Accessing Reddit's API to obtain the posts for r/AskBaking
- 988 unique entries were saved
- the total number of entries was checked against the feature 'name', which provides a unique name for each post

In [8]:
#get baking posts
posts_2 = []
after = None

for a in range(40):
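    if after is None and a == 0:     # first request: start from the top of the listing
        current_url = url_2
    elif after is None:              # no 'after' token means the listing is exhausted (fewer than 1000 posts); stop to avoid collecting duplicates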
        print('No more posts')
        break
    else:
        current_url = url_2 + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Kitten 1.0'})
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts_2.extend(current_posts)
    after = current_dict['data']['after']
    
    # checkpoint everything collected so far, so an interrupted run keeps its data
    pd.DataFrame(posts_2).to_csv(file_path_2, index = False)

    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,6)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/AskBaking.json
4
https://www.reddit.com/r/AskBaking.json?after=t3_entkpi
6
https://www.reddit.com/r/AskBaking.json?after=t3_ejz70c
5
https://www.reddit.com/r/AskBaking.json?after=t3_eeqte2
3
https://www.reddit.com/r/AskBaking.json?after=t3_ecpqgh
6
https://www.reddit.com/r/AskBaking.json?after=t3_e9stx8
2
https://www.reddit.com/r/AskBaking.json?after=t3_e5itzr
2
https://www.reddit.com/r/AskBaking.json?after=t3_e1usbh
4
https://www.reddit.com/r/AskBaking.json?after=t3_dwuqs4
2
https://www.reddit.com/r/AskBaking.json?after=t3_drz105
4
https://www.reddit.com/r/AskBaking.json?after=t3_dp93h6
5
https://www.reddit.com/r/AskBaking.json?after=t3_dijjd9
2
https://www.reddit.com/r/AskBaking.json?after=t3_de5juq
5
https://www.reddit.com/r/AskBaking.json?after=t3_d8t6nt
2
https://www.reddit.com/r/AskBaking.json?after=t3_d3dhyg
4
https://www.reddit.com/r/AskBaking.json?after=t3_czt39z
2
https://www.reddit.com/r/AskBaking.json?after=t3_cu891p
2
https://www.reddit.com/r/AskBa

In [9]:
#save file to csv
print(f"Save {len(posts_2)} entries to csv file..")
pd.DataFrame(posts_2).to_csv(file_path_2, index = False)

Save 988 entries to csv file..


The column 'name' served as a unique identifier for each post:

In [11]:
for x in posts_2:
    print(x['name'])

t3_crm4hs
t3_e1z1cn
t3_esn5w7
t3_esbf71
t3_es0iss
t3_es5y4p
t3_erewjz
t3_ernui3
t3_erk3xe
t3_eriiz2
t3_erf7nk
t3_erew1h
t3_er9d5y
t3_er04z7
t3_eqrnl9
t3_er316g
t3_eqqadw
t3_eq1muh
t3_eq35zs
t3_epn66y
t3_ep83us
t3_ep89p1
t3_epb8bk
t3_eoo8u2
t3_eoc7xs
t3_eohkzc
t3_entkpi
t3_envjyy
t3_enr4l5
t3_eniirk
t3_enij5i
t3_en9cq1
t3_emw0ge
t3_embq92
t3_emidjs
t3_emd461
t3_elz1wr
t3_elrobr
t3_elzxhx
t3_elcyap
t3_el4trn
t3_elfofm
t3_el5h8p
t3_ekx15w
t3_el31hj
t3_ektt4p
t3_ektwa8
t3_ekt62o
t3_ekgggi
t3_ekg84y
t3_ek7qbu
t3_ejz70c
t3_ek6g9v
t3_ek77y9
t3_ejzf0y
t3_ejm2ta
t3_ejohqy
t3_ej093m
t3_ej1kcu
t3_eijewi
t3_eisbv6
t3_eimtdm
t3_ei98l4
t3_ei4yo9
t3_ehzahd
t3_ei4k0a
t3_ei0m1l
t3_ehqxvm
t3_eh4a90
t3_eh2uuz
t3_egapzo
t3_eghiul
t3_efku8g
t3_efgug1
t3_ef56lu
t3_ef4l62
t3_eeqte2
t3_ef2f3z
t3_eel2to
t3_eevlxt
t3_eeu3ma
t3_eemo00
t3_eepqdu
t3_eefhuq
t3_eeljym
t3_eentl7
t3_ee6xan
t3_eealuk
t3_eeb87q
t3_ee04bu
t3_ee2n47
t3_edx4np
t3_edopy5
t3_edgttf
t3_edfha7
t3_ed91xp
t3_eddece
t3_edamqq
t3_eda3jz
t3_ed8thk


In [12]:
#check for duplicates which will be dropped during data cleaning
unique = len(set([x['name'] for x in posts_2])) 
unique == len(posts_2)

True

### Observations 
Looking at the selftext of the r/AskBaking posts, the following observations were made:
1. The top 2 posts from r/AskBaking are pinned to the top of the subreddit for newcomers and will be removed subsequently.
2. Some of the posts include links to other sites for recipes. These links are not critical for our analysis, so they will also be removed subsequently.
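
The two clean-up steps noted above could look roughly like the following sketch. It assumes that the raw post data carries a boolean `stickied` field marking pinned posts, and it uses a deliberately simple URL pattern; the sample rows are hypothetical, and the actual cleaning is done in the next notebook.

```python
import pandas as pd

# Hypothetical rows; `stickied` is assumed to be present in the raw post data.
sample = pd.DataFrame({
    'selftext': [
        'Need answers fast? Join our Discord Server',
        'Tried this recipe tonight: https://example.com/scones and it fell flat',
    ],
    'stickied': [True, False],
})

# Keep only the posts that are not pinned to the top of the subreddit.
kept = sample[~sample['stickied']].copy()

# Strip bare URLs; a simple pattern that removes everything up to the next whitespace.
kept['selftext'] = (
    kept['selftext']
    .str.replace(r'https?://\S+', '', regex=True)
    .str.strip()
)

print(kept['selftext'].tolist())
```

Markdown-style links such as `[text](url)` would need a slightly more careful pattern, since the closing bracket is swallowed along with the URL.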

In [13]:
#taking a look at the selftexts of baking
for x in posts_2:
    print(x['selftext'])

Need answers fast? Hoping to get help in real-time? [Join our Discord Server](https://discord.gg/gFFnxcE) and you may just get the help that you need and a few new friends along the way!

**Disclaimer:** This Discord server is a brand new experiment and prone to changes as we work out the kinks. Bare with us!
# This is an important notice from the Moderators:

As the Holiday season is upon many of us (with the US Thanksgiving, Christmas internationally, New Years, and any other baking-related holiday) it's important that you remember to **FLAIR YOUR POSTS**. ***Any post not flaired within 24 hours of being submitted is removed by Automod.*** If you need your questions answered, please flair your post or it will be lost to time. You should receive a DM from Automod within 15-20 minutes of your post asking you to flair it, or you can manually flair it immediately before or after hitting submit.

If you need a question answered ASAP, consider [joining our Discord Community](https://discor

3) If I can't do individual, if I bake it in one pan, should I use a spring form? 
I am going to attempt Mary Berry's white farmhouse loaf recipe, and I am wondering if I can proof it overnight in the fridge (save me getting up early :P ) or would it be better to do it on the counter (probably not a good idea, three cheers for cats) ooor just get up early and do it tomorrow! Any advice is appreciated :)
Hello!  I’ve been trying to make profiteroles / cream puffs for 2 days now and each time they end up flat with an eggy taste to them! I follow recipes exactly so I have no idea where I’d go wrong but I did notice whilst watching videos that my dough always looked more watery/liquid than those in videos. Any help would be appreciated thank you :)
I plan on making the White Bread with 80% Biga from FWSY. All I have is bread flour, while the recipe calls for all-purpose. Can I use the bread flour I have for this recipe? Thanks in advance! Extra info, I also have rye flour, can I use this i

### Saving the Data
The original data obtained from reddit was saved as 2 separate csv files, cooking_raw.csv and baking_raw.csv

In [14]:
#read the r/AskBaking csv and count the number of null rows in selftext
#35 out of the 988 posts have no selftext
df = pd.read_csv(file_path_2)
df['selftext'].isnull().sum()

35

In [15]:
#read the r/Cooking csv and count the number of null rows in selftext
#51 out of the 786 posts have no selftext
df2 = pd.read_csv(file_path_1)
df2['selftext'].isnull().sum()

51
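
Set against the number of posts saved, these counts are small: 35 of 988 r/AskBaking posts (about 3.5%) and 51 of 786 r/Cooking posts (about 6.5%) have no selftext. The sketch below shows one way to compute such a share and to keep title-only posts by filling the gaps with an empty string. The frame `raw` is a hypothetical stand-in for the csv contents, and filling with an empty string is only one possible choice.

```python
import pandas as pd

# Hypothetical frame standing in for the raw csv contents.
raw = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd'],
    'selftext': ['text', None, 'more text', None],
})

null_count = int(raw['selftext'].isnull().sum())

# Fraction of posts without body text.
null_share = raw['selftext'].isnull().mean()

# One option: keep title-only posts by substituting an empty string.
filled = raw['selftext'].fillna('')

print(null_count, null_share)
```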

### Extracting Relevant Features
The following features were extracted:

|Feature|Description|
|:---|:---|
|title|Title of post| 
|selftext|Text of post| 
|name|ID of each post| 
|author|Author of post|
|target|1 for r/Cooking, 0 for r/AskBaking| 


In [47]:
# defining a function to obtain columns title, selftext, name and author

def get_features(filepath, new_filepath, target):
    """Save the relevant columns of a raw csv, plus a target label, to a new csv."""
    
    #read original csv file
    df_orig = pd.read_csv(filepath)
    
    #extract relevant features (copy so that df_orig is left untouched)
    df = df_orig[['title', 'selftext', 'name', 'author']].copy()
    df['target'] = target
    
    df.to_csv(new_filepath, index=False)
    
    print('Feature Extraction Complete!')
   

In [48]:
# get features for r/Cooking
get_features(file_path_1,file_path_3,1)

Feature Extraction Complete!


In [49]:
#check
check = pd.read_csv(file_path_3)

In [50]:
check

| |title|selftext|name|author|target|
|:---|:---|:---|:---|:---|:---|
|0|Money has been tight this past month, so we’ve...|I typically save several dollars every trip to...|t3_esklbe|iScReAm612|1|
|1|I hate my own cooking|I have been learning how to cook over the past...|t3_esh9jv|nightglitter89x|1|
|2|Skip the fancy apron trend and get yer'self a ...|Want that hip, macho aesthetic in the kitchen ...|t3_esnbr6|ProtoNate|1|
|3|My mom's measuring cup|Maybe this is about cooking. Maybe not. Since ...|t3_es3e31|ManosVanBoom|1|
|4|How can I make steamed vegetables taste more i...|I eat a lot of steamed vegetables, particularl...|t3_ese5fw|TheTousler|1|
|...|...|...|...|...|...|
|781|What can I do with ripe/overripe bananas?|Bananas are one of the few fruits that I like ...|t3_eo9h8p|mathchem|1|
|782|Too much heat!|I love heat in my cooking by its overwhelming ...|t3_eoegcn|calfee777|1|
|783|What are some good main dishes to go with risotto|Hello! I want to make a simple risotto for my ...|t3_eoeg2x|SlingsAndArrowsOf|1|
|784|When to season eggs and why?|So, I’ve been attempting to teach myself to co...|t3_eo821h|FalseDmitriy02|1|


In [51]:
# get features for r/AskBaking
get_features(file_path_2,file_path_4,0)

Feature Extraction Complete!


In [3]:
#combining the 2 files for subsequent EDA and cleaning
df_new1 = pd.read_csv(file_path_3)
df_new2 = pd.read_csv(file_path_4)

combined_to_clean = pd.concat([df_new1, df_new2], ignore_index=True)


In [54]:
combined_to_clean

| |title|selftext|name|author|target|
|:---|:---|:---|:---|:---|:---|
|0|Money has been tight this past month, so we’ve...|I typically save several dollars every trip to...|t3_esklbe|iScReAm612|1|
|1|I hate my own cooking|I have been learning how to cook over the past...|t3_esh9jv|nightglitter89x|1|
|2|Skip the fancy apron trend and get yer'self a ...|Want that hip, macho aesthetic in the kitchen ...|t3_esnbr6|ProtoNate|1|
|3|My mom's measuring cup|Maybe this is about cooking. Maybe not. Since ...|t3_es3e31|ManosVanBoom|1|
|4|How can I make steamed vegetables taste more i...|I eat a lot of steamed vegetables, particularl...|t3_ese5fw|TheTousler|1|
|...|...|...|...|...|...|
|1769|Hi! Any idea how we can compress raw food like...| |t3_9t81vw|spit666999|0|
|1770|Sugar cookies absorbed my royal icing|Made a huge batch of sugar cookies, Flooded al...|t3_9srzqd|11_29_77|0|
|1771|Dairy free scones?|Tried this recipe tonight:\n\nhttps://www.goda...|t3_9s8e64|hexagonalshit|0|
|1772|Adding peanut butter to brownies|I would like to incorporate peanut butter (the...|t3_9rzllx|Icarus367|0|


In [55]:
#check number of rows: 786 + 988 = 1774
combined_to_clean.shape

(1774, 5)
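
The shape matches the two collections: 786 r/Cooking posts plus 988 r/AskBaking posts give 1774 rows. A minimal sketch of how that check, together with the class balance, could be made explicit is shown here. The miniature frames `cooking` and `baking` are hypothetical stand-ins for the two feature files.

```python
import pandas as pd

# Hypothetical miniature frames standing in for the two feature files.
cooking = pd.DataFrame({'name': ['t3_a', 't3_b'], 'target': 1})
baking = pd.DataFrame({'name': ['t3_c', 't3_d', 't3_e'], 'target': 0})

combined = pd.concat([cooking, baking], ignore_index=True)

# The combined frame should have exactly as many rows as its two parts.
assert len(combined) == len(cooking) + len(baking)

# Rows per class, keyed by the target label.
class_counts = combined['target'].value_counts().to_dict()

print(class_counts)
```

Knowing the class counts up front is useful for the modeling notebook, since the majority-class share sets the baseline accuracy a classifier has to beat.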

In [56]:
#export file to csv 
combined_to_clean.to_csv('../datasets/data_to_clean.csv')