# Web Scraping for Reddit & Predicting Comments

In this project, we will practice two major skills: collecting data by scraping a website, and then building a binary predictor.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this article, your problem statement will be: _What characteristics of a post on Reddit contribute most to the overall interaction (as measured by number of comments)?_

Your method for acquiring the data will be scraping the 'hot' threads as listed on the [Reddit homepage](https://www.reddit.com/). You'll acquire _AT LEAST FOUR_ pieces of information about each thread:
1. The title of the thread
2. The subreddit that the thread corresponds to
3. The length of time it has been up on Reddit
4. The number of comments on the thread

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts whether or not a given Reddit post will have above or below the _median_ number of comments.

**BONUS PROBLEMS**
1. If creating a logistic regression, use GridSearch to tune the Ridge and Lasso penalties for this model and report the best hyperparameter values.
2. Scrape the actual text of the threads using Selenium (you'll learn about this in Webscraping II).
3. Write the actual article that you're pitching and turn it into a blog post that you host on your personal website.

### Scraping Thread Info from Reddit.com

#### Set up a request (using requests) to the URL below. Use BeautifulSoup to parse the page and extract all results

https://blog.datastories.com/blog/reddit-front-page

- Front-page Reddit posts stay on the front page for an average of 4 hours 15 minutes.
- The average front-page lifetime of an image is 3.5 hours, while text posts last 4 hours 45 minutes on average.
- Internal self-posts live significantly longer than external posts: the average lifetime of a self-post is 5 hours 15 minutes, versus only 3 hours 45 minutes for an external post.
- Text posts with a positive headline last significantly longer than posts with a neutral or negative headline.
- Textual self-posts with positive headlines stay significantly longer on the front page.
- Upvotes accumulate fastest starting at 9am PST.
- For text posts, very positive or very negative posts perform significantly better than neutral ones.
- Images get many more upvotes than text posts. However, text posts get more comments and stay on the front page longer.
- Five subreddits completely dominate the front page of Reddit.

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

In [None]:
## The Plan

# scrape multiple times, eg top 100 posts, 3*day for 3 days, or more often
# score/ upvotes, downvotes, # comments, rank, time posted, len time on reddit
# title: sentiment?, numbers?, special char?
# content for text posts, sentiment, subreddit, # subscribers per sub, image?, video?, podcast?,
#     internal/external, 

## other - how long was it in the top 100? , what freq can i scrape....
# analyze title - is there a clickbait formula?
# time it takes to get to front page (top25)
# does the OP engage in the comments section? how much?
# has OP gone viral before? how often?


With the more verbose elements stripped away, the HTML for each thread on the listing page has a clear structure:
- The thread title is within an `<a>` tag with the attribute `data-event-action="title"`.
- The time since the thread was created is within a `<time>` tag with attribute `class="live-timestamp"`.
- The subreddit is within an `<a>` tag with the attribute `class="subreddit hover may-blank"`.
- The number of comments is within an `<a>` tag with the attribute `data-event-action="comments"`.

In [None]:
# # sleepy scraper
# def scraper(url):
#     time.sleep(2)
#     response = requests.get(url)
#     print('status code', response.status_code)
#     html = response.text
#     return html

In [3]:
url = "http://www.reddit.com/"

In [16]:
# html = requests.get(url, headers= {'User-Agent': 'kittenMitten'})
# html_doc = html.text
# soup = BeautifulSoup(html_doc, 'lxml')
# print(soup.prettify())
html = requests.get(url, headers= {'User-Agent': 'kittenMittenz'})
html_doc = html.text
soup = BeautifulSoup(html_doc, 'html.parser')
# time.sleep(0.01)

In [17]:
print(html.status_code)

200


## Write 4 functions to extract these items (one function for each): title, time, subreddit, and number of comments.
Example
```python
def extract_title_from_result(result):
    return result.find ...
```

##### - Make sure these functions are robust and can handle cases where the data/field may not be available.
>- Remember to check whether a field is empty or None before attempting to call methods on it.
>- Remember to use try/except if you anticipate errors.

- **Test** the functions on the results above and simple examples
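
As one possible shape for these functions, here is a minimal sketch that works on a single result at a time and returns `None` when a field is missing. The `SAMPLE_HTML` markup is hand-written for illustration (it only mimics the attributes described above, and the `div class="thing"` wrapper is an assumption about how one result is delimited), and `_text_or_none` is a hypothetical helper, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the attributes described above.
# The second result deliberately lacks most fields.
SAMPLE_HTML = """
<div class="thing" id="thing_t3_aaa111">
  <a data-event-action="title" href="/x">Perpetual Flip Calendar</a>
  <time class="live-timestamp">4 hours ago</time>
  <a class="subreddit hover may-blank" href="/r/DIY/">r/DIY</a>
  <a data-event-action="comments" href="/c">296 comments</a>
</div>
<div class="thing" id="thing_t3_bbb222">
  <a data-event-action="title" href="/y">A post with missing fields</a>
</div>
"""


def _text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


def extract_title_from_result(result):
    return _text_or_none(result.find('a', {'data-event-action': 'title'}))


def extract_time_from_result(result):
    return _text_or_none(result.find('time', {'class': 'live-timestamp'}))


def extract_subreddit_from_result(result):
    return _text_or_none(result.find('a', {'class': 'subreddit hover may-blank'}))


def extract_num_comments_from_result(result):
    return _text_or_none(result.find('a', {'data-event-action': 'comments'}))


soup_sample = BeautifulSoup(SAMPLE_HTML, 'html.parser')
results = soup_sample.find_all('div', {'class': 'thing'})
rows = [
    {
        'title': extract_title_from_result(r),
        'time': extract_time_from_result(r),
        'subreddit': extract_subreddit_from_result(r),
        'num_comments': extract_num_comments_from_result(r),
    }
    for r in results
]
print(rows)
```

Extracting per result keeps the four fields of one thread together, so a thread with a missing field yields a `None` rather than shifting every later value by one position.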

In [None]:
#Notice that within the div tag there is 
#an attribute called id and it is set to "thing_t3_788tye"

In [None]:
### title EDA
# has emoji
# has numbers
# sentiment
# len(title)
# title structure... ??

In [160]:
def titles(url, mysoup):
    return [titles.text for titles in
                  mysoup.find_all('a', {"data-event-action": "title"})]

In [161]:
def subreddit(url, mysoup):
    return [subreddit.text.replace('r/', '') for subreddit in
                     mysoup.find_all('a', {"class": "subreddit hover may-blank"})]


In [162]:
def timeup(url, mysoup):
    return [timeup.text.replace(' hours ago', '') for timeup in
                  mysoup.find_all('time', {"class": "live-timestamp"})]

In [163]:
def num_comments(url, mysoup):
    return [num_comments.text.replace(' comments', '') for num_comments in
                        mysoup.find_all('a', {"data-event-action": "comments"})]
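
The `replace()` calls above only strip the exact suffixes `' hours ago'` and `' comments'`, so other wordings pass through unchanged. A more forgiving cleanup might look like the sketch below. The helper names `parse_hours_ago` and `parse_count` are hypothetical, and the set of time units (minute, hour, day) is an assumption about how Reddit words its timestamps.

```python
import re

# Minutes per unit; the unit names are an assumption about Reddit's wording.
_MINUTES_PER_UNIT = {'minute': 1, 'hour': 60, 'day': 24 * 60}


def parse_hours_ago(text):
    """Convert strings like '4 hours ago' or '1 hour ago' to hours (float)."""
    match = re.match(r'\s*(\d+)\s+(minute|hour|day)s?\s+ago', text or '')
    if match is None:
        return None
    return int(match.group(1)) * _MINUTES_PER_UNIT[match.group(2)] / 60


def parse_count(text):
    """Convert '296 comments', '9808' or '53.3k' to an int; None if no number."""
    match = re.match(r'\s*(\d[\d,]*(?:\.\d+)?)\s*(k?)\b', text or '',
                     flags=re.IGNORECASE)
    if match is None:
        return None
    number = float(match.group(1).replace(',', ''))
    if match.group(2):  # a trailing 'k' means thousands
        number *= 1000
    return int(round(number))


print(parse_hours_ago('1 hour ago'), parse_count('53.3k'), parse_count('comment'))
```

Returning `None` for unparseable text (rather than raising) lets the values go straight into a dataframe column, where they can be dropped or imputed later.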

Now, to scale up our scraping, we need to accumulate more results.

First, look at the source of a Reddit.com page: (https://www.reddit.com/).
Try manually changing the page by clicking the 'next' button on the bottom. Look at how the url changes.

After leaving the Reddit homepage, the URLs should look something like this:
```
https://www.reddit.com/?count=25&after=t3_787ptc
```

The URL here has two query parameters
- count is the result number that the page starts with
- after is the unique id of the last result on the _previous_ page

In order to scrape lots of pages from Reddit, we'll have to change these parameters every time we make a new request so that we're not just scraping the same page over and over again. Incrementing the count by 25 every time will be easy, but the bizarre code after `after` is a bit trickier.
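
As a small illustration of how the two parameters fit together, the sketch below builds a page URL with the standard library. `build_page_url` is a hypothetical helper name, and the example id is the one from the URL shown above.

```python
from urllib.parse import urlencode


def build_page_url(count, after_id, base="https://www.reddit.com/"):
    """Return the listing URL that starts after the post with id `after_id`."""
    return base + "?" + urlencode({"count": count, "after": after_id})


print(build_page_url(25, "t3_787ptc"))
```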

To start off, let's look at a block of HTML from a Reddit page to see how we might solve this problem:
```html
<div class=" thing id-t3_788tye odd gilded link " data-author="LordSneaux" data-author-fullname="t2_j3pty" data-comments-count="1548" data-context="listing" data-domain="v.redd.it" data-fullname="t3_788tye" data-kind="video" data-num-crossposts="0" data-permalink="/r/funny/comments/788tye/not_all_heroes_wear_capes/" data-rank="25" data-score="51468" data-subreddit="funny" data-subreddit-fullname="t5_2qh33" data-timestamp="1508775581000" data-type="link" data-url="https://v.redd.it/ush0rh2tultz" data-whitelist-status="all_ads" id="thing_t3_788tye" onclick="click_thing(this)">
      <p class="parent">
      </p>
      <span class="rank">
       25
      </span>
      <div class="midcol unvoted">
       <div aria-label="upvote" class="arrow up login-required access-required" data-event-action="upvote" role="button" tabindex="0">
       </div>
       <div class="score dislikes" title="53288">
        53.3k
       </div>
       <div class="score unvoted" title="53289">
        53.3k
       </div>
       <div class="score likes" title="53290">
        53.3k
       </div>
       <div aria-label="downvote" class="arrow down login-required access-required" data-event-action="downvote" role="button" tabindex="0">
       </div>
      </div>
```

Notice that within the `div` tag there is an attribute called `id` and it is set to `"thing_t3_788tye"`. By finding the last ID on your scraped page, you can tell your _next_ request where to start (pass everything after "thing_").

For more info on this, you can take a look at the [Reddit API docs](https://github.com/reddit/reddit/wiki/JSON)

## Write one more function that finds the last `id` on the page, and stores it.

In [164]:
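def last_reddit_id(url, mysoup):
    # Despite the name, this returns the full URL behind the 'next' button,
    # which already carries the count and after parameters for the next page.
    nav = mysoup.find('div', attrs={'class': 'nav-buttons'})
    return nav.find('span', attrs={'class': 'next-button'}).find('a').attrs['href']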
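
The function above relies on the 'next' button. An alternative that follows the description more literally is to read the `id` attribute of the last thread `div` and strip the `thing_` prefix. The sketch below assumes every thread `div` has an `id` starting with `thing_`; `last_thing_id` is a hypothetical name and `SAMPLE` is hand-written markup modelled on the HTML block shown earlier.

```python
import re

from bs4 import BeautifulSoup

# Trimmed, hand-written markup modelled on the block shown earlier.
SAMPLE = """
<div class="thing" id="thing_t3_80wjzd"></div>
<div class="thing" id="thing_t3_788tye"></div>
<div class="nav-buttons"></div>
"""


def last_thing_id(soup):
    """Return the id of the last thread on the page without the 'thing_' prefix."""
    things = soup.find_all('div', id=re.compile(r'^thing_'))
    if not things:
        return None
    return things[-1]['id'][len('thing_'):]


print(last_thing_id(BeautifulSoup(SAMPLE, 'html.parser')))
```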

## (Optional) Collect more information

While we only require you to collect four features, there may be other info that you can find on the results page that might be useful. Feel free to write more functions so that you have more interesting and useful data.

In [165]:
def likes(url, mysoup):
    return [likes.text for likes in
                 mysoup.find_all('div', {'class': 'score likes'})]

## Now, let's put it all together.

Use the functions you wrote above to parse out the 4 fields - title, time, subreddit, and number of comments. Create a dataframe from the results with those 4 columns.

In [172]:
def scraper(url):
    # Pause before every request so we stay polite to Reddit's servers.
    time.sleep(2)
    html = requests.get(url, headers={'User-Agent': 'puppies'})
    print(html.status_code)
    mysoup = BeautifulSoup(html.text, 'html.parser')

    try:
        # Build the frame from one list per column; pandas raises a
        # ValueError if the lists differ in length (e.g. a post lacks a field).
        df = pd.DataFrame({'title': titles(url, mysoup),
                           'timeup [hours_ago]': timeup(url, mysoup),
                           'subreddit': subreddit(url, mysoup),
                           'num_comments': num_comments(url, mysoup),
                           'likes': likes(url, mysoup)})

        # URL of the next page, taken from the 'next' button.
        lrid = last_reddit_id(url, mysoup)
        return df, lrid

    except Exception as err:
        print('Scrape failed for', url, '->', err)
        return None

In [173]:
url = "http://www.reddit.com/"
frames = []

# Four pages of 25 posts each; every page supplies the URL of the next one.
for i in range(0, 100, 25):
    print('loop num:\t', i, 'url is:\t', url)
    result = scraper(url)
    if result is None:  # scraper() returns None when a page fails to parse
        break
    new_df, url = result
    frames.append(new_df)

# DataFrame.append was removed in pandas 2.0, so combine with pd.concat.
reddit = pd.concat(frames, ignore_index=True)

#     reddit['rank'] = reddit.index.get_values() + 1
#     reddit['datetime'] = datetime = html.headers['Date']
    
# df.to_csv('/Users/meredithjackson/Desktop/ga/week5/reddit_data.csv')
# print("status code: " + str(html.status_code))
        


loop num:	 0 url is:	 http://www.reddit.com/
200
loop num:	 25 url is:	 https://www.reddit.com/?count=25&after=t3_80wjzd
200
loop num:	 50 url is:	 https://www.reddit.com/?count=50&after=t3_80vx0y
200
loop num:	 75 url is:	 https://www.reddit.com/?count=75&after=t3_80uukq
200


In [175]:
reddit

| Unnamed: 0 | likes | num_comments | subreddit | timeup [hours_ago] | title |
|---|---|---|---|---|---|
| 0 | • | 359 | sports | 1 hour ago | "Just stay in there, you're done for tonight" |
| 1 | 45.2k | 523 | pics | 4 | Butterflies will sometimes land on a Caiman an... |
| 2 | 33.9k | 822 | funny | 4 | Two drunk gentlemen try to pass each other |
| 3 | 9808 | 112 | DunderMifflin | 4 | Hey guys, I know I'm late to the meme party bu... |
| 4 | 49.7k | 413 | gifs | 4 | These VR apps are getting out of hand |
| 5 | 18.5k | 211 | NatureIsFuckingLit | 3 | Octopus riding sea turtle 🔥 |
| 6 | 29.3k | 526 | OldSchoolCool | 5 | Bride leaving her recently bombed home to get ... |
| 7 | 6802 | 60 | StrangerThings | 3 | Feeling cute, might delete later |
| 8 | 9694 | 219 | gifsthatkeepongiving | 3 | Two drunk gentlemen try to pass each other |
| 9 | 10.1k | 296 | DIY | 4 | Perpetual Flip Calendar |


In [None]:
# post_type = [post_type for post_type in
#                      soup.find_all('div', {"class": "top matter" })]

# for i in post_type:
#     print(i.find_all('div').attrs)


In [None]:
# run scraper every 2 minutes for a few hours
# time posted
# post type -video, external link, 
# get time stamp for each scrape 
# username of poster, 
# karma points of poster- post karma, comment karma
# external link- what websites are posts going to 
# how many followers does the subreddit page have
# is post relevant to subreddit?


        
       
# post_type =
        
#         user_name = 
#         user_commment_karma = 
#         user_post_karma = 
#         external_link_site =
#         num_subr_followers = 
#         time_posted= 



### Save your results as a CSV
You may do this regularly while scraping data as well, so that if your scraper stops or your computer crashes, you don't lose all your data.
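
One way to save regularly is to append each scraped batch to the same file and write the header only once. The sketch below assumes that approach; `append_to_csv` is a hypothetical helper, and the rows and temporary path are made up purely to demonstrate it.

```python
import os
import tempfile

import pandas as pd


def append_to_csv(df, path):
    """Append df to path, writing the header only when the file is new."""
    write_header = not os.path.exists(path)
    df.to_csv(path, mode='a', header=write_header, index=False)


# Demonstration with a temporary directory and made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'reddit_data.csv')
    append_to_csv(pd.DataFrame({'title': ['a', 'b'], 'num_comments': [1, 2]}), csv_path)
    append_to_csv(pd.DataFrame({'title': ['c'], 'num_comments': [3]}), csv_path)
    saved = pd.read_csv(csv_path)

print(len(saved))
```

Passing `index=False` avoids an extra unnamed index column when the file is read back in.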

In [None]:
reddit.head()

In [None]:
# Export to csv
reddit.to_csv('/Users/meredithjackson/Desktop/ga/week5/reddit_data.csv')


## Predicting comments using Random Forests + Another Classifier

#### Load in the data of scraped results

In [None]:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [None]:
## YOUR CODE HERE

#### We want to predict a binary variable - whether the number of comments was low or high. Compute the median number of comments and create a new binary variable that is true when the number of comments is high (above the median)

We could also perform Linear Regression (or any regression) to predict the number of comments here. Instead, we are going to convert this into a _binary_ classification problem, by predicting two classes, HIGH vs LOW number of comments.

While performing regression may be better, performing classification may help remove some of the noise of the extremely popular threads. We don't _have_ to choose the `median` as the splitting point - we could also split on the 75th percentile or any other reasonable breaking point.

In fact, the ideal scenario may be to predict many levels of comment numbers. 
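
A minimal sketch of the median split, using the comment counts from the first six rows of the scraped output shown earlier. The column name `high_comments` is an assumption; any name will do.

```python
import pandas as pd

# Comment counts taken from the first six scraped rows shown earlier.
toy = pd.DataFrame({'num_comments': [359, 523, 822, 112, 413, 211]})

median_comments = toy['num_comments'].median()

# 1 when a thread is strictly above the median, 0 otherwise.
toy['high_comments'] = (toy['num_comments'] > median_comments).astype(int)

print(median_comments)
print(toy)
```

To split on another cut point, `toy['num_comments'].quantile(0.75)` could replace the median in the same comparison.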

In [None]:
## YOUR CODE HERE

#### Thought experiment: What is the baseline accuracy for this model?
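
One common definition of baseline accuracy is the share of the majority class, that is, the accuracy of always predicting the most frequent label. The sketch below uses made-up comment counts with ties at the median to show that a median split does not always give exactly 50%.

```python
from collections import Counter
from statistics import median

# Hypothetical comment counts with several ties at the median.
num_comments = [10, 20, 20, 20, 30]

cutoff = median(num_comments)
labels = [int(n > cutoff) for n in num_comments]

# Accuracy of a model that always predicts the most frequent class.
majority_share = Counter(labels).most_common(1)[0][1] / len(labels)

print(cutoff, labels, majority_share)
```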

In [None]:
## YOUR CODE HERE

#### Create a Random Forest model to predict High/Low number of comments using Sklearn. Start by ONLY using the subreddit as a feature. 
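
A rough sketch of a subreddit-only model follows. The data is a made-up toy frame (the `high_comments` labels are invented), and one-hot encoding with `pd.get_dummies` is just one reasonable way to turn the subreddit name into numeric features.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Made-up toy data: subreddit names with invented high/low labels.
toy = pd.DataFrame({
    'subreddit': ['pics', 'funny', 'gifs', 'DIY'] * 3,
    'high_comments': [1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1],
})

# One column per subreddit, 1 where the row belongs to it.
X = pd.get_dummies(toy['subreddit'])
y = toy['high_comments']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

test_accuracy = rf.score(X_test, y_test)
print('test accuracy:', test_accuracy)
```

With only twelve rows the score itself is meaningless; the point is the shape of the workflow.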

In [None]:
## YOUR CODE HERE

#### Create a few new variables in your dataframe to represent interesting features of a thread title.
- For example, create a feature that represents whether 'cat' is in the title or whether 'funny' is in the title. 
- Then build a new Random Forest with these features. Do they add any value?
- After creating these variables, use count-vectorizer to create features based on the words in the thread titles.
- Build a new random forest model with subreddit and these new features included.
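
The steps above can be sketched as follows. The first three titles come from the scraped output shown earlier (minus the emoji) and the fourth is invented; the column names `has_cat`, `has_funny` and `title_len` are assumptions.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

titles_df = pd.DataFrame({'title': [
    'Two drunk gentlemen try to pass each other',
    'Perpetual Flip Calendar',
    'Octopus riding sea turtle',
    'My cat thinks this is funny',  # invented example
]})

lower = titles_df['title'].str.lower()

# Hand-made keyword flags; \b keeps 'cat' from matching inside other words.
titles_df['has_cat'] = lower.str.contains(r'\bcat\b', regex=True).astype(int)
titles_df['has_funny'] = lower.str.contains('funny', regex=False).astype(int)
titles_df['title_len'] = titles_df['title'].str.len()

# Bag-of-words features: one column per word, one row per title.
vectorizer = CountVectorizer(stop_words='english')
word_counts = vectorizer.fit_transform(titles_df['title'])

print(titles_df[['has_cat', 'has_funny', 'title_len']])
print(word_counts.shape)
```

The sparse `word_counts` matrix can then be combined with the subreddit columns (for example with `scipy.sparse.hstack`) before fitting a new model.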

In [None]:
## YOUR CODE HERE

#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy of the model, as well as any other metrics you feel are appropriate. 
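
A sketch of cross-validated evaluation, assuming the text features are built inside a `Pipeline` so the vectorizer is refit on each training fold. The titles and labels are invented toy data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Invented toy titles with invented high (1) / low (0) labels.
toy_titles = [
    'cute cat gif', 'funny cat video', 'cat sleeping photo',
    'dog chases cat', 'funny dog gif', 'cute puppy photo',
    'tax policy update', 'quarterly budget report', 'new zoning rules',
    'city council minutes', 'parking permit changes', 'water main repairs',
]
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

pipe = Pipeline([
    ('vec', CountVectorizer()),
    ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Three stratified folds; each fold refits the vectorizer and the forest.
scores = cross_val_score(pipe, toy_titles, toy_labels, cv=3, scoring='accuracy')
print(scores, scores.mean())
```

Swapping `scoring='accuracy'` for `'f1'` or `'roc_auc'` gives other views of the same model.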

In [None]:
## YOUR CODE HERE

#### Repeat the model-building process with a non-tree-based method.
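
As one example of a non-tree-based method, the sketch below fits a multinomial Naive Bayes classifier on word counts. The choice of Naive Bayes is an assumption (logistic regression would fit the brief equally well), and the titles and labels are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy titles with invented high (1) / low (0) labels.
train_titles = [
    'cute cat gif', 'funny cat video', 'funny dog gif', 'cute puppy photo',
    'tax policy update', 'quarterly budget report',
    'new zoning rules', 'city council minutes',
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

nb_pipe = Pipeline([
    ('vec', CountVectorizer()),
    ('nb', MultinomialNB()),
])
nb_pipe.fit(train_titles, train_labels)

new_title = ['cute cat photo']
prediction = nb_pipe.predict(new_title)
probabilities = nb_pipe.predict_proba(new_title)
print(prediction, probabilities)
```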

In [None]:
## YOUR CODE HERE

#### Use Count Vectorizer from scikit-learn to create features from the thread titles. 
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 
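
The sketch below contrasts count and binary features and then ranks words by their logistic regression coefficients, which is one of several possible ways to judge which text features matter. The titles and labels are invented.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy titles with invented high (1) / low (0) labels.
toy_titles = ['cat cat cat video', 'cat photo', 'tax policy news',
              'tax tax report', 'cat gif', 'news report']
toy_labels = [1, 1, 0, 0, 1, 0]

count_vec = CountVectorizer()              # cell = number of occurrences
binary_vec = CountVectorizer(binary=True)  # cell = 1 if the word appears at all

X_count = count_vec.fit_transform(toy_titles)
X_binary = binary_vec.fit_transform(toy_titles)
print(X_count.max(), X_binary.max())

# Rank words by coefficient: large positive values push towards "high".
clf = LogisticRegression().fit(X_binary, toy_labels)
terms = sorted(binary_vec.vocabulary_, key=binary_vec.vocabulary_.get)
order = np.argsort(clf.coef_[0])

print('most negative:', [terms[i] for i in order[:3]])
print('most positive:', [terms[i] for i in order[-3:]])
```

For a random forest, the analogous ranking would come from the fitted model's `feature_importances_` attribute.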

In [None]:
## YOUR CODE HERE

# Executive Summary
---
Put your executive summary in a Markdown cell below.

### BONUS
Refer to the README for the bonus parts

In [None]:
## YOUR CODE HERE