# Using Reddit's API for Predicting Comments

In this project, we will practice two major skills: collecting data via an API request and then building a binary predictor.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this article, your problem statement will be: _What characteristics of a post on Reddit contribute most to the overall interaction (as measured by number of comments)?_

Your method for acquiring the data will be scraping the 'hot' threads as listed on the [Reddit homepage](https://www.reddit.com/). You'll acquire _AT LEAST FOUR_ pieces of information about each thread:
1. The title of the thread
2. The subreddit that the thread corresponds to
3. The length of time it has been up on Reddit
4. The number of comments on the thread
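
The third item, how long a thread has been up, is not returned directly. One way to derive it, assuming you rely on the `created_utc` field (a Unix timestamp in seconds) that appears in the listing JSON, is to subtract it from the time of your request. This is only a sketch: the helper name `thread_age_hours` and the fetch time used below are hypothetical.

```python
import time

def thread_age_hours(created_utc, fetched_at=None):
    """Hours between a post's creation and the moment it was fetched."""
    if fetched_at is None:
        fetched_at = time.time()  # default to "now" in Unix seconds
    return (fetched_at - created_utc) / 3600

# created_utc comes from the sample post in this notebook; the fetch time
# is made up (exactly two hours later) purely for illustration.
created = 1527262869.0
print(thread_age_hours(created, fetched_at=created + 2 * 3600))
```

Recording the fetch time alongside each scrape keeps the age reproducible if you compute features later.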

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts whether or not a given Reddit post will have above or below the _median_ number of comments.
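
The binary target can be built before any modeling. A minimal sketch, assuming "above the median" means strictly greater than the median, so ties fall in the 0 class; the helper name `label_above_median` is hypothetical, and the counts are the `num_comments` values of the ten r/datascience posts shown later in this notebook.

```python
from statistics import median

def label_above_median(comment_counts):
    """Return 1 for counts strictly above the median, else 0."""
    cutoff = median(comment_counts)
    return [1 if n > cutoff else 0 for n in comment_counts]

# num_comments for the ten sample r/datascience posts
counts = [3, 30, 0, 34, 2, 4, 2, 5, 6, 20]
labels = label_above_median(counts)
print(median(counts), labels)
```

With an even number of posts the median can sit between two observed values, which splits the classes evenly here.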

**BONUS PROBLEMS**
1. If creating a logistic regression, GridSearch Ridge and Lasso for this model and report the best hyperparameter values.
2. Scrape the actual text of the threads using Selenium (you'll learn about this in Webscraping II).
3. Write the actual article that you're pitching and turn it into a blog post that you host on your personal website.

### Scraping Thread Info from Reddit.com

#### Set up a request (using requests) to the URL below. 

*NOTE*: Reddit will return a [429 error](https://httpstatuses.com/429) (Too Many Requests) if you send the request with the following code:
```python
res = requests.get(URL)
```

This is because Reddit throttles the default user agent that the `requests` library sends. You'll need to set a custom `User-agent` to get your request to work.
```python
res = requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})
```
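
Even with a custom user agent, a 429 can still come back if requests arrive too quickly. The following is a rough sketch of one way to retry, not a prescribed approach: the helper name `get_with_retry`, its parameters, and the wait times are all assumptions. It takes the request as a callable so the retry logic can be tried without touching the network.

```python
import time

def get_with_retry(fetch, max_tries=3, base_wait=2.0, sleep=time.sleep):
    """Call fetch() until it returns something other than a 429.

    fetch is any zero-argument callable returning an object with a
    status_code attribute, for example:
        lambda: requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})
    """
    res = None
    for attempt in range(1, max_tries + 1):
        res = fetch()
        if res.status_code != 429:
            break
        if attempt < max_tries:
            sleep(base_wait * attempt)  # wait a little longer each time
    return res
```

If every attempt is throttled, the last 429 response is returned so the caller can still inspect it.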

NOTES: Problem: What characteristics of a post on Reddit contribute most to the overall interaction (as measured by number of comments)?
Pseudo-code/Approach: Use NLP to analyze Reddit posts that have varying numbers of comments.

In [1]:
import requests
import time
import pandas as pd

In [2]:
headers = {'User-agent': 'yukihadeishi .1'}
res = requests.get('https://reddit.com/hot.json', headers=headers)

In [3]:
res.status_code

200

In [4]:
curr_json = res.json()

In [5]:
curr_json['data'].keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])

In [6]:
sorted(curr_json['data'].keys())

['after', 'before', 'children', 'dist', 'modhash']

In [7]:
len(curr_json['data']['children'])

25

In [8]:
curr_json['data']['children'][24]['data']['name']

't3_8m2py9'
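
The `name` value is what Reddit's API calls a fullname: a type prefix (`t3` marks a link, that is, a post) joined by an underscore to the post's `id`. The `after` key of a listing holds a fullname of the same form, which is why it can be used to request the next page. A small sketch for taking one apart; the helper name `split_fullname` is hypothetical.

```python
def split_fullname(fullname):
    """Split a Reddit fullname like 't3_8m2py9' into (kind, id)."""
    kind, _, post_id = fullname.partition('_')
    return kind, post_id

kind, post_id = split_fullname('t3_8m2py9')
print(kind, post_id)
```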

In [9]:
sorted(curr_json['data']['children'][24]['data'].keys())

['approved_at_utc',
 'approved_by',
 'archived',
 'author',
 'author_flair_css_class',
 'author_flair_template_id',
 'author_flair_text',
 'banned_at_utc',
 'banned_by',
 'can_gild',
 'can_mod_post',
 'clicked',
 'contest_mode',
 'created',
 'created_utc',
 'distinguished',
 'domain',
 'downs',
 'edited',
 'gilded',
 'hidden',
 'hide_score',
 'id',
 'is_crosspostable',
 'is_reddit_media_domain',
 'is_self',
 'is_video',
 'likes',
 'link_flair_css_class',
 'link_flair_text',
 'locked',
 'media',
 'media_embed',
 'media_only',
 'mod_note',
 'mod_reason_by',
 'mod_reason_title',
 'mod_reports',
 'name',
 'no_follow',
 'num_comments',
 'num_crossposts',
 'num_reports',
 'over_18',
 'parent_whitelist_status',
 'permalink',
 'pinned',
 'post_categories',
 'post_hint',
 'preview',
 'pwls',
 'quarantine',
 'removal_reason',
 'report_reasons',
 'saved',
 'score',
 'secure_media',
 'secure_media_embed',
 'selftext',
 'selftext_html',
 'send_replies',
 'spoiler',
 'stickied',
 'subreddit',
 'subr

In [11]:
curr_json['data']['children'][24]['data']

{'approved_at_utc': None,
 'approved_by': None,
 'archived': False,
 'author': 'Weaverino',
 'author_flair_css_class': None,
 'author_flair_template_id': None,
 'author_flair_text': None,
 'banned_at_utc': None,
 'banned_by': None,
 'can_gild': False,
 'can_mod_post': False,
 'clicked': False,
 'contest_mode': False,
 'created': 1527291669.0,
 'created_utc': 1527262869.0,
 'distinguished': None,
 'domain': 'i.redd.it',
 'downs': 0,
 'edited': False,
 'gilded': 0,
 'hidden': False,
 'hide_score': False,
 'id': '8m2py9',
 'is_crosspostable': False,
 'is_reddit_media_domain': True,
 'is_self': False,
 'is_video': False,
 'likes': None,
 'link_flair_css_class': None,
 'link_flair_text': None,
 'locked': False,
 'media': None,
 'media_embed': {},
 'media_only': False,
 'mod_note': None,
 'mod_reason_by': None,
 'mod_reason_title': None,
 'mod_reports': [],
 'name': 't3_8m2py9',
 'no_follow': False,
 'num_comments': 163,
 'num_crossposts': 0,
 'num_reports': None,
 'over_18': False,
 'parent

In [13]:
def get_posts( sub = 'all', num_pages = 4, avoid_distinguished = True, attached = None):
    """
    Returns a list of posts from a subreddit.
    
    ===========================
    ======= Parameters ========
    ===========================

    sub = 'all' (default): type = string
        The subreddit you want to query.
        https://reddit.com/r/{sub}/ 
    -------------------------------------------------------------
    num_pages = 4 (default): type = int
        Number of pages to read from.
        The function sleeps one second per page, so this
        is also roughly its minimum run time in seconds
    -------------------------------------------------------------
    avoid_distinguished = True (default): type = bool
        Whether or not to avoid stickied, archived,
        and admin posts
    -------------------------------------------------------------
    attached = None (default): type = List
        The list that you are appending new data onto.
        Default to make a new list.  
        
    ===========================
    ========  Example =========
    ===========================    
    
    the_posts= get_posts(sub = 'jokes',
                            num_pages=1, 
                            avoid_distinguished=True)
                            
    the_posts= get_posts(sub = 'nosleep',
                            num_pages=1, 
                            avoid_distinguished=True, 
                            attached=the_posts )
    
    >>> Returns a list of ~25 posts from reddit.com/r/jokes and
                    ~25 posts from reddit.com/r/nosleep
    
    
    """
    if attached is not None:
        posts = attached
    else:
        posts = []
    counter = 0
    after = None
    while counter < num_pages:
        if after is None:
            params = {}
        else:
            params = {'after': after}
        res = requests.get(f'https://reddit.com/r/{sub}/.json',
                           params=params, headers=headers)
        if res.status_code != 200:
            # A non-200 status can mean a bad subreddit name or throttling (429)
            print(f'request failed with status {res.status_code}')
            return None
        the_json = res.json()
        if avoid_distinguished:
            page = [child for child in the_json['data'].get('children')
                    if not child['data']['stickied']
                    and not child['data']['archived']
                    and not child['data']['distinguished']]
        else:
            page = the_json['data'].get('children')
        posts.extend(page)
        after = the_json['data']['after']
        counter += 1
        time.sleep(1)
    return posts
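
Because `attached` lets you append repeated runs onto one list, the same thread can plausibly end up in the list more than once, for example when a subreddit is scraped twice. A minimal sketch for dropping repeats, assuming each post's `name` (its fullname) is a reliable unique key; the helper name `dedupe_posts` is hypothetical.

```python
def dedupe_posts(posts):
    """Keep the first occurrence of each post, keyed on its fullname."""
    seen = set()
    unique = []
    for post in posts:
        name = post['data']['name']
        if name not in seen:
            seen.add(name)
            unique.append(post)
    return unique

# Tiny hand-made stand-ins for the dictionaries get_posts returns.
sample = [
    {'data': {'name': 't3_8m4pkw', 'num_comments': 3}},
    {'data': {'name': 't3_8m2py9', 'num_comments': 163}},
    {'data': {'name': 't3_8m4pkw', 'num_comments': 3}},  # repeat
]
print(len(dedupe_posts(sample)))
```

Keeping the first occurrence preserves the original ordering of the scrape.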

In [17]:
datascience_posts = get_posts(sub='datascience', num_pages=50)

In [18]:
genetics_posts = get_posts(sub='genetics', num_pages=50)

In [22]:
datasci_toy = get_posts(sub='datascience', num_pages=5)

In [26]:
datasci_toy

[{'data': {'approved_at_utc': None,
   'approved_by': None,
   'archived': False,
   'author': 'Geckoboard',
   'author_flair_css_class': None,
   'author_flair_template_id': None,
   'author_flair_text': None,
   'banned_at_utc': None,
   'banned_by': None,
   'can_gild': False,
   'can_mod_post': False,
   'clicked': False,
   'contest_mode': False,
   'created': 1527307568.0,
   'created_utc': 1527278768.0,
   'distinguished': None,
   'domain': 'i.redd.it',
   'downs': 0,
   'edited': False,
   'gilded': 0,
   'hidden': False,
   'hide_score': False,
   'id': '8m4pkw',
   'is_crosspostable': False,
   'is_reddit_media_domain': True,
   'is_self': False,
   'is_video': False,
   'likes': None,
   'link_flair_css_class': None,
   'link_flair_text': None,
   'locked': False,
   'media': None,
   'media_embed': {},
   'media_only': False,
   'mod_note': None,
   'mod_reason_by': None,
   'mod_reason_title': None,
   'mod_reports': [],
   'name': 't3_8m4pkw',
   'no_follow': False,
   '

In [30]:
def posts_as_DataFrame(posts, features = ['subreddit', 'author',
                                          'title', 'selftext',
                                          'created_utc', 'num_comments']):
    feat_dict = [{feat : post['data'][feat] for feat in features}
                 for post in posts]
    return pd.DataFrame(feat_dict)
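
Indexing with `post['data'][feat]` raises a `KeyError` if any post lacks a requested key. Whether that happens depends on which features you ask for (optional fields such as `post_hint` may not be present on every post; treat that as an assumption to check against your own data). A hedged alternative that fills gaps with `None` instead; the helper name `extract_features` is hypothetical.

```python
def extract_features(posts, features=('subreddit', 'title', 'num_comments')):
    """Build one flat dict per post; absent keys become None."""
    return [{feat: post['data'].get(feat) for feat in features}
            for post in posts]

# Hand-made example: the second post has no 'title' key.
sample = [
    {'data': {'subreddit': 'datascience',
              'title': 'Clustering analysis using z-score',
              'num_comments': 2}},
    {'data': {'subreddit': 'datascience', 'num_comments': 0}},
]
rows = extract_features(sample)
print(rows[1])
```

The resulting list of dictionaries can be passed to `pd.DataFrame(rows)` in the same way as above.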

In [31]:
posts_as_DataFrame(datasci_toy)

| | author | created_utc | num_comments | selftext | subreddit | title |
|---|---|---|---|---|---|---|
| 0 | Geckoboard | 1.527279e+09 | 3 | | datascience | Artist recreates his favourite musician's disc... |
| 1 | InevitableRaisin | 1.527247e+09 | 30 | I'm a Product focussed generalist based in Lon... | datascience | What are the potential career paths for Data S... |
| 2 | bweber | 1.527290e+09 | 0 | | datascience | Data Science for Startups: Model Production |
| 3 | dubious_chewy | 1.527198e+09 | 34 | It seems to me like pretty much everyone agree... | datascience | Data cleaning and wrangling: I heard it's impo... |
| 4 | StingerOo | 1.527277e+09 | 2 | Hey, I'm trying to segment different socioecon... | datascience | Clustering analysis using z-score |
| 5 | pg_gargleblaster | 1.527275e+09 | 4 | What red flags make you throw away a resume or... | datascience | Hiring: What makes you disqualify someone? |
| 6 | SandWraith | 1.527217e+09 | 2 | | datascience | datasheets: A library to read from and write d... |
| 7 | PhysicalPresentation | 1.527222e+09 | 5 | I'm tasked with building a baseline for credit... | datascience | Can someone explain to me how this baseline wo... |
| 8 | Datascientist17 | 1.527234e+09 | 6 | Hi all! \nI was offered a contract job opportu... | datascience | Contract hourly rate student data science |
| 9 | URLSweatshirt | 1.527168e+09 | 20 | Hello,\n\nI am currently working on a project ... | datascience | What would be a good approach for packaging a ... |


In [None]:
import json
# Assumes `posts` holds a list returned by get_posts, e.g. datasci_toy.
with open('test_dump.json', 'w') as f:
    json.dump(posts, f)

In [None]:
with open('test_dump.json', 'r') as f:
    import_test = json.load(f)

In [None]:
len(import_test)

In [None]:
posts[1]['data'].keys()

In [None]:
posts[1]['data']['selftext']

In [None]:
posts[7]['data']['ups']

In [None]:
from sys import getsizeof

In [None]:
getsizeof(import_test)

In [None]:
curr_json.keys()

In [None]:
curr_json['data']

#### Use `res.json()` to convert the response into a dictionary format and set this to a variable. 

```python
data = res.json()
```

In [None]:
data = res.json()