# Web Scraping 

## Pulling Data From Reddit

In [2]:
import requests
import pandas as pd
import time
import random

In [9]:
url_um = 'https://www.reddit.com/r/UnresolvedMysteries.json'

In [1]:
url_coldcases = 'https://www.reddit.com/r/coldcases.json'

Defining our web scraping function

In [3]:
def web_scrapper(subreddit, url):
    posts = []
    after = None

    for a in range(50):

        # build the URL for the current page from the `after` token
        if after is None:
            current_url = url
        else:
            current_url = url+'?after='+after
        print(current_url)
    
        # request the page and stop if the status code is not 200
        response = requests.get(current_url, headers={'User-agent': 'Pony Inc 1.0'})
        if response.status_code != 200:
            print('Error in getting response')
            break
    
        reddit_dictionary = response.json()
        reddit_posts = [p['data'] for p in reddit_dictionary['data']['children']]
        posts.extend(reddit_posts)
        # `after` is None on the last page, so the next request starts over from the first page
        after = reddit_dictionary['data']['after']
    
        sleep_duration = random.randint(2,6)
        print(sleep_duration)
        print(len(posts))
        time.sleep(sleep_duration)
    pd.DataFrame(posts).to_csv(subreddit, index=False)

In [14]:
web_scrapper('unsolved_mystery', url_um)

https://www.reddit.com/r/UnresolvedMysteries.json
5
26
https://www.reddit.com/r/UnresolvedMysteries.json?after=t3_k0kkhv
6
51
https://www.reddit.com/r/UnresolvedMysteries.json?after=t3_jz6q8n
5
76
https://www.reddit.com/r/UnresolvedMysteries.json?after=t3_jy1irm
4
101
https://www.reddit.com/r/UnresolvedMysteries.json?after=t3_jwkkkv
4
126
https://www.reddit.com/r/UnresolvedMysteries.json?after=t3_jviau2
6
151
https://www.reddit.com/r/UnresolvedMysteries.json?after=t3_jtqy9q
3
176
https://www.reddit.com/r/UnresolvedMysteries.json?after=t3_jrxzey
4
201
https://www.reddit.com/r/UnresolvedMysteries.json?after=t3_jpx08g
4
226
https://www.reddit.com/r/UnresolvedMysteries.json?after=t3_jnsqdo
5
251
https://www.reddit.com/r/UnresolvedMysteries.json?after=t3_jllcl8
5
276
https://www.reddit.com/r/UnresolvedMysteries.json?after=t3_jjqs53
2
301
https://www.reddit.com/r/UnresolvedMysteries.json?after=t3_jih30p
5
326
https://www.reddit.com/r/UnresolvedMysteries.json?after=t3_jgr802
3
351
https://www

In [4]:
web_scrapper('cold_cases', url_coldcases)

https://www.reddit.com/r/coldcases.json
3
25
https://www.reddit.com/r/coldcases.json?after=t3_jc041h
4
50
https://www.reddit.com/r/coldcases.json?after=t3_ivhsxg
5
75
https://www.reddit.com/r/coldcases.json?after=t3_i093nr
5
100
https://www.reddit.com/r/coldcases.json?after=t3_h0qsms
6
125
https://www.reddit.com/r/coldcases.json?after=t3_f4ua16
4
150
https://www.reddit.com/r/coldcases.json?after=t3_eq59bk
6
175
https://www.reddit.com/r/coldcases.json?after=t3_e80moc
5
200
https://www.reddit.com/r/coldcases.json?after=t3_du29eo
3
225
https://www.reddit.com/r/coldcases.json?after=t3_d3etux
4
250
https://www.reddit.com/r/coldcases.json?after=t3_ct3g4n
2
259
https://www.reddit.com/r/coldcases.json
5
284
https://www.reddit.com/r/coldcases.json?after=t3_jc041h
2
309
https://www.reddit.com/r/coldcases.json?after=t3_ivhsxg
6
334
https://www.reddit.com/r/coldcases.json?after=t3_i093nr
2
359
https://www.reddit.com/r/coldcases.json?after=t3_h0qsms
2
384
https://www.reddit.com/r/coldcases.json?aft
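
In the r/coldcases output above, the URL returns to the first page once the count reaches 259. The listing ran out, `after` came back empty, and the loop started over. A minimal sketch of one way to stop at that point follows. The `collect_posts` helper, its `fetch_page` argument, and the fake pages are hypothetical names used only for illustration, so the example runs without a network connection.

```python
from typing import Callable, Optional


def collect_posts(fetch_page: Callable[[Optional[str]], dict], max_pages: int = 50) -> list:
    """Follow Reddit-style `after` tokens until the listing runs out."""
    posts = []
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)["data"]
        posts.extend(child["data"] for child in listing["children"])
        after = listing["after"]
        if after is None:
            # No further pages: stop instead of requesting page one again.
            break
    return posts


# Hypothetical two-page listing standing in for the live JSON responses.
fake_pages = {
    None: {"data": {"children": [{"data": {"id": "a"}}, {"data": {"id": "b"}}], "after": "t3_b"}},
    "t3_b": {"data": {"children": [{"data": {"id": "c"}}], "after": None}},
}
demo_posts = collect_posts(lambda after: fake_pages[after])
print([p["id"] for p in demo_posts])
```

In the scraper itself, `fetch_page` would wrap the `requests.get` call and the sleep between requests.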

## Initial Data Cleaning

In [3]:
unslvd_mystery = pd.read_csv('../Data/unsolved_mystery')

In [4]:
unslvd_mystery.shape

(1241, 105)

In [5]:
unslvd_mystery.drop_duplicates(subset=None, keep='first', inplace=True, ignore_index=True)

In [6]:
unslvd_mystery.shape

(1205, 105)

In [7]:
coldcases = pd.read_csv('../Data/cold_cases')

In [8]:
coldcases.shape

(1186, 110)

In [9]:
coldcases.drop_duplicates(subset=None, keep='first', inplace=True, ignore_index=True)

In [10]:
coldcases.shape

(941, 110)
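
Dropping exact duplicates only removes rows that match in every column. If a post was fetched twice and a field such as its score changed in between, both copies survive. A possible alternative, sketched on toy data, is to deduplicate on the post identifier. This assumes the scraped files contain Reddit's `id` column, which is not shown in this notebook.

```python
import pandas as pd

# Toy rows standing in for scraped posts; the real frames have 100+ columns.
scraped = pd.DataFrame(
    {
        "id": ["k0kkhv", "jz6q8n", "k0kkhv"],
        "title": ["Post A", "Post B", "Post A"],
        "score": [120, 45, 123],  # same post fetched twice, score changed
    }
)

# Full-row comparison keeps both copies because `score` differs.
exact = scraped.drop_duplicates(ignore_index=True)

# Comparing on `id` alone keeps the first copy of each post.
by_id = scraped.drop_duplicates(subset="id", keep="first", ignore_index=True)

print(len(exact), len(by_id))
```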

- Drop the first row from each subreddit because it is a moderator announcement.
- A few other posts also come from Reddit moderators; we remove them from our analysis.
- Drop one low-effort post from the paranormal subreddit.
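
The moderator posts mentioned above could also be filtered by flag rather than by row position. The sketch below runs on toy data and assumes the scraped frames carry Reddit's `stickied` and `distinguished` columns. Neither is confirmed by this notebook, so treat both names as assumptions.

```python
import pandas as pd

# Toy frame; `stickied` and `distinguished` are assumed column names.
posts = pd.DataFrame(
    {
        "title": ["Rules and announcements", "Case write-up", "Mod notice", "Another case"],
        "stickied": [True, False, False, False],
        "distinguished": ["moderator", None, "moderator", None],
    }
)

# A post counts as a moderator post if it is pinned or flagged as moderator-authored.
is_mod_post = posts["stickied"] | posts["distinguished"].eq("moderator")
community_posts = posts[~is_mod_post].reset_index(drop=True)

print(community_posts["title"].tolist())
```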

In [11]:
unslvd_mystery.drop(index=0, inplace=True)

Since the number of features is not the same for the two datasets, let's take a look at the differences.

In [12]:
mystery_features = list(unslvd_mystery.columns)
coldcases_features = list(coldcases.columns)

In [13]:
extra_feat_frm_mystery = [i for i in mystery_features if i not in coldcases_features]
extra_feat_frm_mystery

['collections', 'author_cakeday']

In [14]:
extra_feat_frm_coldcase = [i for i in coldcases_features if i not in mystery_features]
extra_feat_frm_coldcase

['thumbnail_height',
 'thumbnail_width',
 'post_hint',
 'preview',
 'crosspost_parent_list',
 'url_overridden_by_dest',
 'crosspost_parent']

After examining the extra features from each dataset, we find that they carry no information relevant to the content of the posts, so we drop them.

In [15]:
unslvd_mystery.drop(columns = extra_feat_frm_mystery, inplace=True)

In [16]:
coldcases.drop(columns = extra_feat_frm_coldcase, inplace=True)
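
As a compact alternative to the list comprehensions above, pandas can compute the column differences directly. This is a sketch on toy frames; `left` and `right` are hypothetical stand-ins for the two datasets.

```python
import pandas as pd

left = pd.DataFrame({"title": ["a"], "selftext": ["x"], "author_cakeday": [True]})
right = pd.DataFrame({"title": ["b"], "selftext": ["y"], "post_hint": ["image"]})

# Columns present in one frame but not the other.
only_left = left.columns.difference(right.columns)
only_right = right.columns.difference(left.columns)

# Keep the shared columns, preserving the left frame's original order.
shared = [c for c in left.columns if c in right.columns]
left_aligned, right_aligned = left[shared], right[shared]

print(list(only_left), list(only_right), shared)
```

Selecting the shared columns explicitly also guarantees that both frames end up with the same column order before they are combined.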

In [22]:
unslvd_mystery.dropna(subset=['selftext'], inplace=True)
coldcases.dropna(subset=['selftext'], inplace=True)
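
`dropna` removes only rows whose `selftext` is missing. Posts whose body is an empty string or a placeholder would still pass. The sketch below shows one way to filter those as well; the placeholder strings `[removed]` and `[deleted]` are an assumption about how Reddit marks withdrawn posts and were not checked against this data.

```python
import pandas as pd

# Toy frame with the kinds of bodies that carry no usable text.
posts = pd.DataFrame({"selftext": ["A long write-up", None, "[removed]", "   ", "[deleted]"]})

# Normalise missing values and whitespace, then reject empty or placeholder bodies.
text = posts["selftext"].fillna("").str.strip()
has_body = ~text.isin(["", "[removed]", "[deleted]"])
with_text = posts[has_body].reset_index(drop=True)

print(with_text["selftext"].tolist())
```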

In [30]:
unslvd_mystery.shape

(650, 103)

In [24]:
coldcases.shape

(624, 103)

Let's keep only the first 650 rows from Unresolved Mysteries so that the dataset is roughly balanced against the 624 cold cases posts.

In [29]:
unslvd_mystery = unslvd_mystery[0:650]

In [31]:
unslvd_mystery.shape

(650, 103)
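
Slicing keeps the first rows in scraped order and leaves the two classes at 650 and 624 rows. If exactly equal class sizes were wanted, one option is to downsample both frames to the size of the smaller one. This is a sketch on toy data, and the fixed `random_state` is an arbitrary choice for reproducibility.

```python
import pandas as pd

# Toy frames of unequal size standing in for the two subreddits.
larger = pd.DataFrame({"selftext": [f"mystery {i}" for i in range(10)]})
smaller = pd.DataFrame({"selftext": [f"cold case {i}" for i in range(6)]})

# Downsample both frames to the size of the smaller one.
n = min(len(larger), len(smaller))
larger_balanced = larger.sample(n=n, random_state=42)
smaller_balanced = smaller.sample(n=n, random_state=42)

print(len(larger_balanced), len(smaller_balanced))
```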

Finally, we combine the two datasets and save the result to a CSV file for our analysis.

In [32]:
all_posts2 = pd.concat([unslvd_mystery, coldcases])

In [33]:
all_posts2.to_csv('../Data/all_posts2', index=False)
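
A quick sanity check after combining is to reset the index and count rows per class. The sketch below uses toy frames and assumes each frame carries a `subreddit` column identifying its source, which is not shown in this notebook.

```python
import pandas as pd

mysteries = pd.DataFrame({"subreddit": ["UnresolvedMysteries"] * 3, "selftext": ["a", "b", "c"]})
cold = pd.DataFrame({"subreddit": ["coldcases"] * 2, "selftext": ["d", "e"]})

# ignore_index=True avoids repeated index labels in the combined frame.
combined = pd.concat([mysteries, cold], ignore_index=True)

# Row count per source subreddit.
counts = combined["subreddit"].value_counts()
print(counts.to_dict())
```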