# Project 3 - Reddit.com Web APIs & Classification

## NOTE - The data has already been extracted and saved in the 'datasets' folder. Running this code again will overwrite the previously downloaded data files. Because the processing notebook reads those files, the results and values discussed there would change.

**We therefore ran this code once to extract the data used for analysis, then switched to a different filename for subsequent trial runs. Proceed with caution or change the filename.**
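
One way to guard against an accidental overwrite is to check for the file before writing. The sketch below is an illustration only: `safe_output_path` is a hypothetical helper that is not part of the original notebook, and the `_new` suffix is an arbitrary choice.

```python
from pathlib import Path


def safe_output_path(folder, filename):
    """Return a path in `folder` that does not clobber an existing file.

    Hypothetical helper (not in the original notebook): if `filename`
    already exists, '_new' is appended to the stem until the name is free.
    """
    path = Path(folder) / filename
    while path.exists():
        path = path.with_name(path.stem + '_new' + path.suffix)
    return path
```

For example, the result of `safe_output_path('datasets', 'technicalanalysis.csv')` could be passed to `to_csv` in place of a fixed filename.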

# Importing packages

In [1]:
import requests
import pandas as pd
import time
import random

# Extracting data from reddit.com using json API

Before defining a function to extract data in bulk, we ran some code to check what we needed to extract from reddit.com.

In [2]:
url = "https://www.reddit.com/r/technicalanalysis/new.json" # extracting recent posts using the 'new' listing

In [3]:
# make the request using a dummy user agent
res = requests.get(url, headers={'User-agent': 'money datascientist'})

In [4]:
# check request status code
res.status_code

200

In [5]:
# assign the extracted json data to a dictionary
reddit_dict = res.json()
print(reddit_dict)

{'kind': 'Listing', 'data': {'modhash': '', 'dist': 25, 'children': [{'kind': 't3', 'data': {'approved_at_utc': None, 'subreddit': 'technicalanalysis', 'selftext': '', 'author_fullname': 't2_2155qzgo', 'saved': False, 'mod_reason_title': None, 'gilded': 0, 'clicked': False, 'title': 'Time Frames Guide.(Please pay attention that the maximum of the time frame range for each trading style is usually used for identifying primary trend and major support and resistance levels for that trading style, and the minimum of the ranges usually for precise entry and exit points).', 'link_flair_richtext': [], 'subreddit_name_prefixed': 'r/technicalanalysis', 'hidden': False, 'pwls': 6, 'link_flair_css_class': None, 'downs': 0, 'thumbnail_height': 75, 'top_awarded_type': None, 'hide_score': False, 'name': 't3_k6lbqs', 'quarantine': False, 'link_flair_text_color': 'dark', 'upvote_ratio': 0.85, 'author_flair_background_color': None, 'subreddit_type': 'public', 'ups': 13, 'total_awards_received': 0, 'med

In [6]:
# find out which keys are in the dict
reddit_dict.keys()

dict_keys(['kind', 'data'])

In [7]:
# drilling down to find the data level
reddit_dict['kind']

'Listing'

In [8]:
# this is the data level: a dictionary nested inside the dictionary
reddit_dict['data']

{'modhash': '',
 'dist': 25,
 'children': [{'kind': 't3',
   'data': {'approved_at_utc': None,
    'subreddit': 'technicalanalysis',
    'selftext': '',
    'author_fullname': 't2_2155qzgo',
    'saved': False,
    'mod_reason_title': None,
    'gilded': 0,
    'clicked': False,
    'title': 'Time Frames Guide.(Please pay attention that the maximum of the time frame range for each trading style is usually used for identifying primary trend and major support and resistance levels for that trading style, and the minimum of the ranges usually for precise entry and exit points).',
    'link_flair_richtext': [],
    'subreddit_name_prefixed': 'r/technicalanalysis',
    'hidden': False,
    'pwls': 6,
    'link_flair_css_class': None,
    'downs': 0,
    'thumbnail_height': 75,
    'top_awarded_type': None,
    'hide_score': False,
    'name': 't3_k6lbqs',
    'quarantine': False,
    'link_flair_text_color': 'dark',
    'upvote_ratio': 0.85,
    'author_flair_background_color': None,
    's

In [9]:
# the data level holds more dictionary keys; list them
reddit_dict['data'].keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])

In [10]:
# inspecting what is in modhash
reddit_dict['data']['modhash']

''

In [11]:
# inspecting what is in dist
reddit_dict['data']['dist']
# meaning 25 posts per pull

25

In [12]:
# inspecting what is in children
reddit_dict['data']['children']

[{'kind': 't3',
  'data': {'approved_at_utc': None,
   'subreddit': 'technicalanalysis',
   'selftext': '',
   'author_fullname': 't2_2155qzgo',
   'saved': False,
   'mod_reason_title': None,
   'gilded': 0,
   'clicked': False,
   'title': 'Time Frames Guide.(Please pay attention that the maximum of the time frame range for each trading style is usually used for identifying primary trend and major support and resistance levels for that trading style, and the minimum of the ranges usually for precise entry and exit points).',
   'link_flair_richtext': [],
   'subreddit_name_prefixed': 'r/technicalanalysis',
   'hidden': False,
   'pwls': 6,
   'link_flair_css_class': None,
   'downs': 0,
   'thumbnail_height': 75,
   'top_awarded_type': None,
   'hide_score': False,
   'name': 't3_k6lbqs',
   'quarantine': False,
   'link_flair_text_color': 'dark',
   'upvote_ratio': 0.85,
   'author_flair_background_color': None,
   'subreddit_type': 'public',
   'ups': 13,
   'total_awards_received'

In [13]:
# inspecting what is in after
reddit_dict['data']['after']

't3_jyvuta'

In [14]:
# inspecting what is in before
reddit_dict['data']['before']
# returns None (no output), meaning this is the first page of 25 records

**Observation**

The `before` and `after` keys are the identifiers in the JSON API that mark the start and end points of each pull. `after` holds the identifier (`name`) of the last post returned.
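
The structure explored above can be summarised in a small helper. This is a minimal sketch, not code from the original notebook: `parse_listing` is a hypothetical name, and `sample` is a hand-made dictionary that only mimics the shape of the real response, with the post fields trimmed to `name` and a placeholder `title`.

```python
def parse_listing(listing):
    """Split a Reddit listing dict into (posts, after_token)."""
    data = listing['data']
    posts = [child['data'] for child in data['children']]
    return posts, data['after']


# Hand-made stand-in for res.json(); real posts carry many more fields.
sample = {
    'kind': 'Listing',
    'data': {
        'modhash': '',
        'dist': 2,
        'children': [
            {'kind': 't3', 'data': {'name': 't3_jz4wta', 'title': 'first post'}},
            {'kind': 't3', 'data': {'name': 't3_jyvuta', 'title': 'second post'}},
        ],
        'after': 't3_jyvuta',
        'before': None,
    },
}

posts, after = parse_listing(sample)
print(len(posts), after)  # prints: 2 t3_jyvuta
```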

In [15]:
# now that we know the level and keys of the data we need, let's extract just those and put them into a dataframe
posts = [p['data'] for p in reddit_dict['data']['children']]
pd.DataFrame(posts).head(2)

| | approved_at_utc | subreddit | selftext | author_fullname | saved | mod_reason_title | gilded | clicked | title | link_flair_richtext | ... | permalink | parent_whitelist_status | stickied | url | subreddit_subscribers | created_utc | num_crossposts | media | is_video | media_metadata |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | technicalanalysis | | t2_2155qzgo | False | | 0 | False | Time Frames Guide.(Please pay attention that t... | [] | ... | /r/technicalanalysis/comments/k6lbqs/time_fram... | all_ads | False | https://i.redd.it/i7tp114qe6361.jpg | 5067 | 1607090000.0 | 0 | | False | |
| 1 | | technicalanalysis | Hello,\n\nCan someone tell me how I can divide... | t2_6jq9yzv8 | False | | 0 | False | Dividing Moving Averages | [] | ... | /r/technicalanalysis/comments/k6b62e/dividing_... | all_ads | False | https://www.reddit.com/r/technicalanalysis/com... | 5067 | 1607047000.0 | 0 | | False | |


In [16]:
# since every pull is limited to 25 posts, we need to find the end of the first pull
print(reddit_dict['data']['after']) # 'after' holds the 'name' (post ID) of the last record, not the poster
pd.DataFrame(posts)[['name','author']].tail(2)
# comparing 'after' with the 'name' column, and checking the author on the reddit.com page, confirms this is the last record in the pull

t3_jyvuta


| | name | author |
|---|---|---|
| 23 | t3_jz4wta | hroob777 |
| 24 | t3_jyvuta | KeepingForexSimple |


In [17]:
# we established the keys for referencing the next set of records,
# now we can assemble the new url for the next set of 25 records
url + '?after=' + reddit_dict['data']['after']

'https://www.reddit.com/r/technicalanalysis/new.json?after=t3_jyvuta'

**Observation**

Testing the above in a browser confirmed it was correct.
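
String concatenation works here because the base URL has no query string yet. As an alternative sketch, the query can be built with the standard library, which makes it easier to add further parameters. `next_page_url` is a hypothetical helper, and the `limit` parameter is an assumption based on Reddit's listing API rather than something tested in this notebook.

```python
from urllib.parse import urlencode


def next_page_url(base_url, after=None, limit=None):
    """Build a listing URL, adding only the parameters that are set."""
    params = {}
    if after is not None:
        params['after'] = after
    if limit is not None:
        params['limit'] = limit  # assumed parameter, not used in the notebook
    if not params:
        return base_url
    return base_url + '?' + urlencode(params)


base = 'https://www.reddit.com/r/technicalanalysis/new.json'
print(next_page_url(base, after='t3_jyvuta'))
```

Called with only `after`, the helper reproduces the URL assembled above.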

# Making a loop for mass extraction

Once the above was confirmed, we defined a function to mass extract data from each of the targeted subreddits for 'training' our models.

In [18]:
def scrap_loop(url, filename):
    """Page through a subreddit listing and save the posts to datasets/<filename>."""
    posts = []
    after = None

    for a in range(50): # up to 50 requests, 25 posts per request
        if after is None: # first request: there is no 'after' token yet
            current_url = url
        else:
            current_url = url + '?after=' + after # assemble the url for the next 25 posts
        print(current_url)

        # make request and check status code
        res = requests.get(current_url, headers={'User-agent': 'money datascientist'})
        if res.status_code != 200:
            print('Status error', res.status_code)
            break

        # collect the posts from this page
        current_dict = res.json()
        current_posts = [p['data'] for p in current_dict['data']['children']]
        posts.extend(current_posts)
        after = current_dict['data']['after'] # identifier ('name') of the last post on this page

        # save everything collected so far, overwriting the file on each pass
        # (same end result as re-reading the csv and concatenating the new page)
        pd.DataFrame(posts).to_csv('datasets/' + filename, index=False)

        if after is None: # no more pages: stop instead of restarting from the first page
            break

        # generate a random sleep duration to look more 'natural'
        sleep_duration = random.randint(2, 6)
        print(sleep_duration)
        time.sleep(sleep_duration)
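
The pagination logic can also be separated from the network call, which makes it possible to check the stopping behaviour without contacting reddit.com. The following is a sketch under assumptions, not the function used for the analysis: `collect_posts`, `fake_fetch`, and `FAKE_PAGES` are hypothetical names, and the pages are hand-made stand-ins for real responses.

```python
def collect_posts(fetch_page, max_pages=50):
    """Follow 'after' tokens until the listing ends or max_pages is reached.

    `fetch_page(after)` must return a listing dict shaped like res.json().
    """
    posts = []
    after = None
    for _ in range(max_pages):
        data = fetch_page(after)['data']
        posts.extend(child['data'] for child in data['children'])
        after = data['after']
        if after is None:  # the last page carries no 'after' token
            break
    return posts


# Hand-made pages, keyed by the 'after' token that requests them.
FAKE_PAGES = {
    None: {'data': {'children': [{'data': {'name': 't3_a'}}], 'after': 't3_a'}},
    't3_a': {'data': {'children': [{'data': {'name': 't3_b'}}], 'after': None}},
}


def fake_fetch(after):
    return FAKE_PAGES[after]


names = [p['name'] for p in collect_posts(fake_fetch)]
print(names)  # prints: ['t3_a', 't3_b']
```

In real use, `fetch_page` would wrap the `requests.get` call and the random sleep from `scrap_loop`.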

# Extracting data from `r/technicalanalysis`

In [19]:
url = "https://www.reddit.com/r/technicalanalysis/new.json" # extracting recent posts using the 'new' listing
filename = "test_tech.csv" # using a different dummy filename so it does not overwrite the previous file used for analysis
#filename = "technicalanalysis.csv"  # this was the filename we used
scrap_loop(url, filename)

https://www.reddit.com/r/technicalanalysis/new.json
3
https://www.reddit.com/r/technicalanalysis/new.json?after=t3_jyvuta
3
https://www.reddit.com/r/technicalanalysis/new.json?after=t3_js1pve
6
https://www.reddit.com/r/technicalanalysis/new.json?after=t3_jithsg
5
https://www.reddit.com/r/technicalanalysis/new.json?after=t3_j6tf7c
4
https://www.reddit.com/r/technicalanalysis/new.json?after=t3_irjx3v
2
https://www.reddit.com/r/technicalanalysis/new.json?after=t3_iew2c2
6
https://www.reddit.com/r/technicalanalysis/new.json?after=t3_i4o8td
6
https://www.reddit.com/r/technicalanalysis/new.json?after=t3_hskri2
3
https://www.reddit.com/r/technicalanalysis/new.json?after=t3_hlpjn8
3
https://www.reddit.com/r/technicalanalysis/new.json?after=t3_hfk0fy
2
https://www.reddit.com/r/technicalanalysis/new.json?after=t3_gyxhzh
2
https://www.reddit.com/r/technicalanalysis/new.json?after=t3_gqkhzq
3
https://www.reddit.com/r/technicalanalysis/new.json?after=t3_gkct28
4
https://www.reddit.com/r/technicalan

# Extracting data from `r/ValueInvesting`

In [20]:
url = "https://www.reddit.com/r/ValueInvesting/new.json" # extracting recent posts using the 'new' listing
filename = "test_value.csv" # using a different dummy filename so it does not overwrite the previous file used for analysis
#filename = "valueinvesting.csv" # this was the filename we used
scrap_loop(url, filename)

https://www.reddit.com/r/ValueInvesting/new.json
2
https://www.reddit.com/r/ValueInvesting/new.json?after=t3_k4d8fv
5
https://www.reddit.com/r/ValueInvesting/new.json?after=t3_k1iqyy
3
https://www.reddit.com/r/ValueInvesting/new.json?after=t3_jygm9r
3
https://www.reddit.com/r/ValueInvesting/new.json?after=t3_juw6pp
5
https://www.reddit.com/r/ValueInvesting/new.json?after=t3_jpkayh
4
https://www.reddit.com/r/ValueInvesting/new.json?after=t3_jlaas9
2
https://www.reddit.com/r/ValueInvesting/new.json?after=t3_jj1u3j
5
https://www.reddit.com/r/ValueInvesting/new.json?after=t3_je7lkn
2
https://www.reddit.com/r/ValueInvesting/new.json?after=t3_j9q0z2
3
https://www.reddit.com/r/ValueInvesting/new.json?after=t3_j5hxi9
5
https://www.reddit.com/r/ValueInvesting/new.json?after=t3_j1ey7l
4
https://www.reddit.com/r/ValueInvesting/new.json?after=t3_ixoxwk
3
https://www.reddit.com/r/ValueInvesting/new.json?after=t3_iss725
3
https://www.reddit.com/r/ValueInvesting/new.json?after=t3_iobnr0
2
https://www

This was the end of the data extraction process. Please refer to the main code file ["preprocessing_eda_model_tuning.ipynb"](preprocessing_eda_model_tuning.ipynb) for the rest of the steps.