# General Assembly - Project 3: Web APIs & Classification
## Notebook 1: Exploring Reddit API
### DSI19 / Jordan David Nalpon

### Notebook 1 Index
---
* [Executive Summary](#exe)
* [Notebook 1 - Reddit API](#nb1)
* [Import Libraries](#lib)
* [Explore Reddit API](#explore)
---

<a name="exe"></a>
# __Executive Summary__
---

## Executive Summary - Reddit

![image info](../03_images/reddit_logo.jpg)


Reddit is an American social news aggregation, web content rating, and discussion website. Registered members submit content to the site such as links, text posts, and images, which are then voted up or down by other members.

The website is made up of many smaller online discussion forums called subreddits. Subreddits are community driven and bring together like-minded people from around the world. They cover a wide range of interests such as sports, films, literature and photography.

I would group subreddits into 2 groups: image driven and text driven.

Image driven subreddits have images as their central theme. These can include photographs, memes or videos. Some examples of image driven subreddits are r/photograph, r/earthporn and r/PrequelMemes.


Text driven subreddits contain few or no images, as their main content is text. Some popular text driven subreddits are r/askreddit, r/nosleep and r/writingprompts.

### Executive Summary - Problem Statement

I hypothesize that NLP models will have a difficult time telling apart two similar text-driven subreddits. If I select 2 subreddits with relatively large and active communities, the content in each subreddit will be written by many people with different writing styles. The only factor connecting the posts is the subreddit they were posted in.

### Executive Summary - Selected Subreddits

The 2 subreddits I've selected are __r/tifu__ (today I fucked up) and __r/confessions__. Both suit my problem statement because they typically feature lengthy text posts and have active communities. Posts in both are normally personal accounts by the OP (original poster), as opposed to r/nosleep and r/writingprompts, where posts are written as fictional narratives.

r/tifu is a subreddit where people share a funny story about how they fucked up that day. The community is mostly light-hearted and cheerful, as posters are willing to share a funny story at the expense of being laughed at. Below are some notable posts from r/tifu:

https://www.reddit.com/r/tifu/comments/2tdbig/tifu_by_enraging_the_parents_of_my_girlfriend_by/
https://www.reddit.com/r/tifu/comments/3im341/tifu_by_throwing_my_steak_out_a_window/

r/confessions differs slightly from r/tifu: it treats posts seriously and is a place for people to confess something personal anonymously on the internet. The community is mostly supportive and offers suggestions to help the OP. Below are some notable posts from r/confessions:

https://www.reddit.com/r/confessions/comments/bcxd98/i_kicked_all_my_friends_off_my_hbo_account/
https://www.reddit.com/r/confessions/comments/ax83tk/i_put_my_infant_daughter_in_the_closet_shut_the/


### Executive Summary - Steps

I will tackle the problem by doing the following:
1. Explore Reddit's API to understand how the data is structured
2. Scrape the posts from the subreddits
3. Clean and format the posts
4. Perform EDA on the data and pre-process it at the same time
5. Combine all the data into a single dataframe
6. Train/Test split the data
7. Grid Search
8. Analysis and Conclusion

<p style="color:red">
<b>Is it clear what the student plans to do?</b><br>
    What is your data science problem that you are taking on?<br>
    <a href="https://monkeylearn.com/text-classification/#:~:text=Text%20classification%20is%20the%20process,spam%20detection%2C%20and%20intent%20detection.">Here's a read on text classification</a>
</p>


<p style="color:red">
<b>How will success be evaluated?</b><br>
Specify and explain your choice of success metric. Is it Accuracy / AUC / F1? Why did you choose that?
</p>

<p style="color:red">
<b>Is it clear who cares about this or why this is important to investigate?</b><br>
    Describe why classifying these 2 topics is meaningful to your stakeholders.
</p>

<p style="color:red">
<b>Does the student consider the audience and the primary and secondary stakeholders?</b><br>
    Who is this project intended for? Define your stakeholders and who will be reading this notebook
</p>

<a name="nb1"></a>
## Notebook 1 - Reddit's API
This notebook documents the exploration of Reddit's API before any web scraping begins. Data collection starts in the next notebook.

<a name="lib"></a>
### Import Libraries

In [2]:
import numpy as np
import pandas as pd
import time
import requests

<a name="explore"></a>
### Exploring Reddit API

In [4]:
# set a custom User-agent header for Reddit's API
# requests sent with the default python-requests User-agent tend to be rate-limited (HTTP 429)

headers = {'User-agent': 'anything can be inserted here'}

In [6]:
url = 'https://www.reddit.com/hot.json'

In [18]:
res = requests.get(url,headers=headers)

In [21]:
# status code 200 means the request succeeded
res.status_code

200
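
Codes other than 200 are worth telling apart when scraping. The following is a minimal sketch using only the standard library's `http.HTTPStatus`; the helper name `describe_status` is hypothetical and not part of the original notebook.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable label for an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # Codes that are not registered in HTTPStatus end up here
        return f"{code} (unknown status)"
    return f"{status.value} {status.phrase}"


# 200 is the success case seen above; 429 is the usual rate-limit response
for code in (200, 404, 429):
    print(describe_status(code))
```

In the notebook, `describe_status(res.status_code)` would give a readable label for whatever the request returned.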

In [29]:
# parse the JSON body of the response for Reddit's hot posts
json_hot = res.json()

In [30]:
# getting the keys of the json of reddit's hot posts
sorted(json_hot.keys())

['data', 'kind']

Of the two keys, `data` and `kind`, `data` appears to contain most of the information we need.

In [32]:
sorted(json_hot['data'].keys())

['after', 'before', 'children', 'dist', 'modhash']

In [40]:
pd.DataFrame(json_hot['data']['children'])

| | kind | data |
|---|---|---|
| 0 | t3 | {'approved_at_utc': None, 'subreddit': 'intere... |
| 1 | t3 | {'approved_at_utc': None, 'subreddit': 'news',... |
| 2 | t3 | {'approved_at_utc': None, 'subreddit': 'Murder... |
| 3 | t3 | {'approved_at_utc': None, 'subreddit': 'AskRed... |
| 4 | t3 | {'approved_at_utc': None, 'subreddit': 'books'... |
| 5 | t3 | {'approved_at_utc': None, 'subreddit': 'memes'... |
| 6 | t3 | {'approved_at_utc': None, 'subreddit': 'news',... |
| 7 | t3 | {'approved_at_utc': None, 'subreddit': 'wholes... |
| 8 | t3 | {'approved_at_utc': None, 'subreddit': 'woooos... |
| 9 | t3 | {'approved_at_utc': None, 'subreddit': 'politi... |
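
The `data` column above holds one nested dict per post, which is awkward to analyse directly. Below is a rough sketch of one way to flatten it into rows with a few chosen fields. The helper `flatten_children`, the default field list and the two-post sample are assumptions made for illustration; the second sample post is made up.

```python
def flatten_children(children, fields=("subreddit", "title", "ups", "name")):
    """Turn Reddit listing children into flat row dicts with the chosen fields."""
    rows = []
    for child in children:
        post = child.get("data", {})
        # .get() leaves a missing field as None instead of raising KeyError
        rows.append({field: post.get(field) for field in fields})
    return rows


# Hypothetical sample shaped like json_hot['data']['children'].
# The first title is abbreviated and the second post is invented.
sample_children = [
    {"kind": "t3", "data": {"subreddit": "interestingasfuck",
                            "title": "Give me a lever...",
                            "ups": 36332, "name": "t3_ksxx6s"}},
    {"kind": "t3", "data": {"subreddit": "news",
                            "title": "Example headline",
                            "ups": 10, "name": "t3_example"}},
]

rows = flatten_children(sample_children)
print(rows[0]["subreddit"], rows[0]["ups"])
```

Passing `rows` to `pd.DataFrame` would then give one column per field rather than a single column of dicts.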


In [41]:
json_hot['data']['children'][0]

{'kind': 't3',
 'data': {'approved_at_utc': None,
  'subreddit': 'interestingasfuck',
  'selftext': '',
  'author_fullname': 't2_98hblf5j',
  'saved': False,
  'mod_reason_title': None,
  'gilded': 0,
  'clicked': False,
  'title': 'Give me a lever long enough and a fulcrum on which to place it, and I shall move the world. -Archimedes',
  'link_flair_richtext': [],
  'subreddit_name_prefixed': 'r/interestingasfuck',
  'hidden': False,
  'pwls': 6,
  'link_flair_css_class': 'approve',
  'downs': 0,
  'thumbnail_height': 140,
  'top_awarded_type': None,
  'hide_score': False,
  'name': 't3_ksxx6s',
  'quarantine': False,
  'link_flair_text_color': 'dark',
  'upvote_ratio': 0.96,
  'author_flair_background_color': None,
  'subreddit_type': 'public',
  'ups': 36332,
  'total_awards_received': 73,
  'media_embed': {},
  'thumbnail_width': 140,
  'author_flair_template_id': None,
  'is_original_content': False,
  'user_reports': [],
  'secure_media': None,
  'is_reddit_media_domain': False,


In [43]:
# number of 'children' entries pulled in a single request
len(json_hot['data']['children'])

25

In [46]:
# 'after' holds the name of the last entry pulled; it marks where the next page starts
hot_last_entry = json_hot['data']['after']
hot_last_entry

't3_ksvfgb'
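
The value `'t3_ksvfgb'` is what Reddit calls a fullname: a type prefix, an underscore, then a base-36 id. The sketch below splits such a name. The prefix table is written from my recollection of Reddit's API documentation and should be treated as an assumption, and `split_fullname` is a hypothetical helper.

```python
# Type prefixes as I understand them from Reddit's API documentation (assumption)
KIND_PREFIXES = {
    "t1": "comment",
    "t2": "account",
    "t3": "link",       # a post, as seen in the 'kind' column above
    "t4": "message",
    "t5": "subreddit",
    "t6": "award",
}


def split_fullname(fullname):
    """Split a fullname such as 't3_ksvfgb' into (kind label, base-36 id)."""
    prefix, _, base36_id = fullname.partition("_")
    return KIND_PREFIXES.get(prefix, "unknown"), base36_id


print(split_fullname("t3_ksvfgb"))
print(split_fullname("t2_98hblf5j"))
```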

In [48]:
param = {'after' : hot_last_entry}

In [49]:
requests.get(url, params = param, headers = headers)

<Response [200]>
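
Passing `params` to `requests.get` simply appends a query string to the URL. Here is a small sketch of that step using only the standard library, so the resulting URL can be inspected without sending a request. `build_listing_url` is a hypothetical helper, and the `limit` parameter is an assumption: Reddit listings accept it, but this notebook does not use it.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.reddit.com/hot.json"


def build_listing_url(after=None, limit=None):
    """Build the hot-listing URL with optional pagination parameters."""
    params = {}
    if after is not None:
        params["after"] = after   # fullname of the last post already seen
    if limit is not None:
        params["limit"] = limit   # assumed listing parameter, not used above
    # Only append '?' when there is at least one parameter
    return f"{BASE_URL}?{urlencode(params)}" if params else BASE_URL


print(build_listing_url())
print(build_listing_url(after="t3_ksvfgb"))
```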

In [50]:
# this will create a long list containing the posts of the first 4 pages of the hot listing

post = []
after = None
for i in range(4):
    print(i)
    # the first request has no 'after'; later requests continue from the last post seen
    if after is None:
        params = {}
    else:
        params = {'after': after}
    url = 'https://www.reddit.com/hot.json'
    res = requests.get(url, params=params, headers=headers)
    if res.status_code == 200:  # 200 indicates the request succeeded
        the_json = res.json()
        post.extend(the_json['data']['children'])
        after = the_json['data']['after']
    else:
        print(res.status_code)  # print the failing status code, then stop the loop
        break
    time.sleep(1)  # wait a second between requests so as not to overload Reddit's servers

0
1
2
3


In [52]:
#shows how many posts were pulled
len(post)

100
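
The same pagination logic could be wrapped in a function so it can be reused for any listing in the next notebook. This is a sketch under stated assumptions, not the notebook's actual scraper: `collect_posts` and `fake_get_page` are hypothetical names, the page-fetching callable is injected so the example runs offline, and the de-duplication by `name` is an added precaution in case a post shows up on two pages.

```python
import time


def collect_posts(get_page, pages=4, pause=1.0):
    """Collect children from up to `pages` listing pages.

    get_page(after) must return a listing dict shaped like res.json(),
    or None if the request failed.
    """
    posts = []
    seen = set()
    after = None
    for _ in range(pages):
        listing = get_page(after)
        if listing is None:       # request failed, stop early
            break
        for child in listing["data"]["children"]:
            name = child["data"]["name"]
            if name not in seen:  # skip posts repeated across pages
                seen.add(name)
                posts.append(child)
        after = listing["data"]["after"]
        if after is None:         # no further pages to request
            break
        time.sleep(pause)         # be polite between requests
    return posts


def fake_get_page(after):
    """Stand-in for a real request: a made-up two-page listing."""
    pages = {
        None: {"data": {"children": [{"data": {"name": "t3_a"}},
                                     {"data": {"name": "t3_b"}}],
                        "after": "t3_b"}},
        "t3_b": {"data": {"children": [{"data": {"name": "t3_b"}},
                                       {"data": {"name": "t3_c"}}],
                          "after": None}},
    }
    return pages.get(after)


posts = collect_posts(fake_get_page, pages=4, pause=0)
print([p["data"]["name"] for p in posts])
```

With the real API, `get_page` would wrap the `requests.get(url, params=..., headers=headers)` call from the loop above and return `res.json()` on a 200 response, or `None` otherwise.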

In [22]:
res.content

b'{"kind": "Listing", "data": {"modhash": "", "dist": 25, "children": [{"kind": "t3", "data": {"approved_at_utc": null, "subreddit": "interestingasfuck", "selftext": "", "author_fullname": "t2_98hblf5j", "saved": false, "mod_reason_title": null, "gilded": 0, "clicked": false, "title": "Give me a lever long enough and a fulcrum on which to place it, and I shall move the world. -Archimedes", "link_flair_richtext": [], "subreddit_name_prefixed": "r/interestingasfuck", "hidden": false, "pwls": 6, "link_flair_css_class": "approve", "downs": 0, "thumbnail_height": 140, "top_awarded_type": null, "hide_score": false, "name": "t3_ksxx6s", "quarantine": false, "link_flair_text_color": "dark", "upvote_ratio": 0.96, "author_flair_background_color": null, "subreddit_type": "public", "ups": 36332, "total_awards_received": 73, "media_embed": {}, "thumbnail_width": 140, "author_flair_template_id": null, "is_original_content": false, "user_reports": [], "secure_media": null, "is_reddit_media_domain": f