# Web Scraping for Reddit & Predicting Comments

In this project, we will practice two major skills: collecting data by scraping a website, and then building a binary classifier on that data.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this article, your problem statement will be: _What characteristics of a post on Reddit contribute most to the overall interaction (as measured by number of comments)?_

Your method for acquiring the data will be scraping the 'hot' threads as listed on the [Reddit homepage](https://www.reddit.com/). You'll acquire _AT LEAST FOUR_ pieces of information about each thread:
1. The title of the thread
2. The subreddit that the thread corresponds to
3. The length of time it has been up on Reddit
4. The number of comments on the thread
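
The length of time in item 3 is scraped as display text rather than as a number, so it needs converting before it can be used as a feature. Here is a minimal sketch of that conversion, assuming the listing shows strings such as "5 hours ago". The helper name `elapsed_to_hours` and the unit table are hypothetical and not part of the original notebook.

```python
import re

# Assumed unit names; adjust if the scraped text uses other units.
SECONDS_PER_UNIT = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}


def elapsed_to_hours(text):
    """Convert text like '5 hours ago' to a number of hours.

    Returns None when no '<number> <unit>' pattern is found.
    """
    match = re.search(r"(\d+)\s+(second|minute|hour|day)s?\b", text.lower())
    if match is None:
        return None
    amount = int(match.group(1))
    unit = match.group(2)
    return amount * SECONDS_PER_UNIT[unit] / 3600


print(elapsed_to_hours("5 hours ago"))
print(elapsed_to_hours("30 minutes ago"))
print(elapsed_to_hours("just now"))
```

Returning `None` for unparseable text keeps bad rows visible so they can be inspected or dropped later, rather than silently treated as zero.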

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts whether or not a given Reddit post will have above or below the _median_ number of comments.

**BONUS PROBLEMS**
1. If you build a logistic regression, grid search over Ridge (L2) and Lasso (L1) penalties for this model and report the best hyperparameter values.
2. Scrape the actual text of the threads using Selenium (you'll learn about this in Webscraping II).
3. Write the actual article that you're pitching and turn it into a blog post that you host on your personal website.
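
For the first bonus problem, one possible approach is sketched below. It uses synthetic data from `make_classification` as a stand-in for the real features, and the grid of `C` values is an arbitrary assumption, so treat it as a starting point rather than the expected solution.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real feature matrix and High/Low target.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

param_grid = {
    "penalty": ["l1", "l2"],      # Lasso-style vs Ridge-style regularization
    "C": [0.01, 0.1, 1.0, 10.0],  # inverse regularization strength (assumed grid)
}

search = GridSearchCV(
    LogisticRegression(solver="liblinear"),  # liblinear supports both penalties
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`best_params_` holds the penalty and `C` that scored best across the five folds, which are the values the bonus problem asks you to report.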

### Scraping Thread Info from Reddit.com

### IMPORT ALL LIBRARIES HERE

In [461]:
# IMPORT ALL LIBRARIES HERE

import pandas as pd
from bs4 import BeautifulSoup
import requests
import time
from selenium import webdriver
import re
from selenium.webdriver.common.keys import Keys
from uuid import uuid4 as uuid

# H/T ALEX HOLT FOR UUID

### DEFINE OUR MAIN REDDIT SOUPER FUNCTION HERE!

In [462]:
def reddit_secrets(URL):
    reddit_request = requests.get(URL, headers={"User-agent": str(uuid())}) # H/T ALEX HOLT FOR UUID
    soup = BeautifulSoup(reddit_request.text, "lxml")
#     print(soup.prettify())
    return soup


#     driver = webdriver.Chrome(executable_path="/Users/AndyKashyap/Downloads/chromedriver")
#     driver.get(URL)
#     driver.title
#     assert "reddit:" in driver.title

### NEXT WE DEFINE OUR REDDIT USER SOUPER FUNCTION HERE!

In [463]:
def user_secrets(URL_USER):
    user_request = requests.get(URL_USER, headers={"User-agent": str(uuid())}) # H/T ALEX HOLT FOR UUID
    user_soup = BeautifulSoup(user_request.text, "lxml")
    return user_soup
    

### DEFINE REDDIT SCRAPER HERE. 

In [464]:

def reddit_scraper(soup,index_count):
    """PULL POST DETAILS FROM ONE LISTING PAGE INTO THE MODULE-LEVEL LISTS.

    RETURNS THE URL OF THE NEXT LISTING PAGE AND A DATAFRAME OF ALL POSTS COLLECTED SO FAR.
    """
    cols = ["Title", "Subreddit", "Username", "Upvotes", "Time Elapsed", "Submission Time", "#Comments", "Link", "UserID"]
    master_reddit = pd.DataFrame(columns = cols)

    for posts in soup.find('div', {'class':'sitetable linklisting'}):


        # POST_TITLE
        try:
            post_title.append(posts.find('a', {"data-event-action" : "title"}).text)
        except:
            pass    
        # SUBREDDIT
        try:
            post_subreddit.append(posts.find('a', {"class" : "subreddit hover may-blank"}).text)
        except:
            pass 

        # UPVOTES
        try:
            post_upvotes.append(posts.find('div', {"class" : "score unvoted"}).text)
        except:
            pass

        # USERNAME
        try:
            post_username.append(posts['data-author'])
        except:
            pass


        # SUBMISSION TIME ELAPSED
        try:
            post_timepassed.append(posts.time.text)
        except:
            pass

        # SUBMISSION TIME 
        try:
            post_timing.append(posts.time["title"])
        except:
            pass

        # NUMBER OF COMMENTS
        try:
            post_comments.append(posts.find('a', {"data-event-action" : "comments"}).text)
        except:
            pass


        # LINK DETAILS
        try:
            post_link.append(posts.find('span', {"class" : "domain"}).a.text)
        except:
            pass

        # ID
        try:
            next_id = (posts['id'].replace("thing_", ""))
            post_userid.append(next_id)
        except:
            pass
        
        try:
            for link in posts.find_all('p', {"class" : "tagline"}):
                user_link.append(link.a['href'])          # GET USER URL FROM DIV 
        except:
            pass
 

    reddit_table = pd.DataFrame({"Title" : post_title, "Subreddit" : post_subreddit, "Username": post_username,
                                 "Upvotes": post_upvotes, "Time Elapsed": post_timepassed, "Submission Time": post_timing,
                                "#Comments": post_comments, "Link": post_link, "UserID" : post_userid})


    master_reddit = pd.concat([master_reddit, reddit_table])
    
    # DEFINE THE NUMBER OF ELEMENTS TO GET TO NEXT PAGE
    URL = "https://www.reddit.com/?count=" + str(index_count) + "&after=" + next_id
    
    # RETURNING THE ITERATING URL & THE DATAFRAME
    
    return URL, master_reddit

CPU times: user 11 µs, sys: 1 µs, total: 12 µs
Wall time: 14.8 µs


In [465]:
%%time

def karma_court(user_link):
    """USER_LINK IS THE LIST OF LINKS TO THE PROFILES OF ALL USERS. WE LOOP THROUGH IT AND COLLECT KARMA."""
    for count in user_link:
        user_soup = user_secrets(count)   # CALLING THE FUNCTION HERE TO MAKE SOUP OF USER PAGE
    #    print(user_soup.title.text)   # THERE ARE 2 TYPES OF REDDIT PROFILES! PAID/PREMIUM/NEW VS OLD. HERE IS A SPLIT BETWEEN THEM

        if "overview" in user_soup.title.text:
            try:
                info = user_soup.find('div', {'class':'titlebox'})
                user_name.append(info.h1.text)
                user_karma.append(info.span.text)
                user_commentkarma.append(info.find('span', {'class':'karma comment-karma'}).text)
            except:
                pass
        else:
            try:
                info = user_soup.find('div', {"class" : "ProfileSidebar"})
                user_name.append(info.find('div', {"class" : "ProfileSidebar__displayName"}).text)
                karma_split = info.find('div', {"class": "ProfileSidebar__counterInfo"}).text.split()
                user_karma.append(karma_split[0])
                user_commentkarma.append(karma_split[2].replace("Karma", ""))
            except:
                pass

    user_master = pd.DataFrame({"Username":user_name, "User Karma":user_karma, "User Comment_Karma":user_commentkarma})
    
    return user_master

CPU times: user 6 µs, sys: 2 µs, total: 8 µs
Wall time: 12.9 µs


In [492]:
%%time

# for timer in range(1,3):
#     time.sleep(30)
    
URL = "https://www.reddit.com/"
post_comments = []
post_link = []
post_subreddit = []
post_timepassed = []
post_timing = []
post_title = []
post_upvotes = []
post_userid = []
post_username = []
user_karma = []
user_commentkarma = []
user_name = []
user_link = []

for i in range(1,10):
    new_soup = reddit_secrets(URL)
    # reddit_scraper RETURNS A TUPLE: (NEXT PAGE URL, DATAFRAME OF ALL POSTS SO FAR)
    URL, reddit_master = reddit_scraper(new_soup, i*25)

user_master = karma_court(user_link)

reddit_master.to_csv('Reddit_Master.csv', mode='a', header=False)
user_master.to_csv('User_Master.csv', mode='a', header=False)

KeyboardInterrupt: 


## Predicting comments using Random Forests + Another Classifier

#### Load in the data of scraped results
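
The scraper writes `Reddit_Master.csv` with `header=False` and the default index column, so the file has no column names and an extra leading column. A minimal sketch of reading such a file follows. The rows are made-up sample data held in memory, the comment text format ("340 comments") is an assumption, and the `NumComments` column name is hypothetical.

```python
import io

import pandas as pd

cols = ["Title", "Subreddit", "Username", "Upvotes", "Time Elapsed",
        "Submission Time", "#Comments", "Link", "UserID"]

# Made-up sample rows standing in for Reddit_Master.csv (index first, no header).
sample_csv = io.StringIO(
    "0,My cat learned to open doors,r/aww,user_a,1200,5 hours ago,"
    "Mon Jan 1 12:00:00 2018 UTC,340 comments,i.imgur.com,t3_abc\n"
    "1,New telescope images released,r/space,user_b,87,2 hours ago,"
    "Mon Jan 1 15:00:00 2018 UTC,comment,nasa.gov,t3_def\n"
    "0,My cat learned to open doors,r/aww,user_a,1200,5 hours ago,"
    "Mon Jan 1 12:00:00 2018 UTC,340 comments,i.imgur.com,t3_abc\n"
)

# Replace sample_csv with "Reddit_Master.csv" to read the real file.
df = pd.read_csv(sample_csv, header=None, names=["RowIndex"] + cols)

# Appending across runs can repeat posts, so keep one row per post ID.
df = (
    df.drop(columns="RowIndex")
    .drop_duplicates(subset="UserID")
    .reset_index(drop=True)
)

# Pull the number out of text such as "340 comments"; no number means 0.
df["NumComments"] = (
    df["#Comments"].str.extract(r"(\d+)", expand=False).fillna("0").astype(int)
)

print(df[["Title", "NumComments"]])
```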

In [None]:
## YOUR CODE HERE

#### We want to predict a binary variable: whether the number of comments was low or high. Compute the median number of comments and create a new binary variable that is true when the number of comments is high (above the median).

We could also use Linear Regression (or any other regression) to predict the number of comments directly. Instead, we are going to convert this into a _binary_ classification problem by predicting two classes, HIGH vs LOW number of comments.

Regression keeps more information about the exact comment count, but classification may help remove some of the noise from the extremely popular threads. We don't _have_ to choose the `median` as the splitting point: we could also split on the 75th percentile or any other reasonable breaking point.

In fact, the ideal scenario may be to predict several levels of comment counts rather than just two.
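
The split described above can be sketched in a few lines of pandas. The comment counts here are made-up values, and the variable names are only illustrative.

```python
import pandas as pd

# Made-up comment counts, including one extremely popular thread.
comments = pd.Series([3, 10, 25, 40, 1200], name="NumComments")

# Median split: 1 means strictly above the median, 0 means at or below it.
median_comments = comments.median()
high = (comments > median_comments).astype(int)

# Alternative split on the 75th percentile.
p75 = comments.quantile(0.75)
high_p75 = (comments > p75).astype(int)

print(median_comments, high.tolist())
print(p75, high_p75.tolist())
```

Note that the thread with 1200 comments lands in the same class as the one with 40, which is how the binary target dampens the effect of outliers.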

In [None]:
## YOUR CODE HERE

#### Thought experiment: What is the baseline accuracy for this model?
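
One common way to define the baseline is the accuracy of always predicting the most frequent class. A small sketch with made-up labels:

```python
from collections import Counter

# Made-up High (1) / Low (0) labels from a median split.
labels = [0, 0, 0, 1, 1]

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / len(labels)

print(majority_class, baseline_accuracy)
```

With a median split the baseline sits near 50%, but posts tied at the median all fall into the low class, so it can be somewhat higher.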

In [None]:
## YOUR CODE HERE

#### Create a Random Forest model to predict High/Low number of comments using scikit-learn. Start by ONLY using the subreddit as a feature.
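
As a rough sketch of this step, the subreddit can be one-hot encoded and passed to a `RandomForestClassifier`. The rows below are made up, and the `High` column name is an assumption about how you stored the binary target.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up rows; "High" is the binary target from the median split.
df = pd.DataFrame({
    "Subreddit": ["r/aww", "r/aww", "r/news", "r/news", "r/funny", "r/funny"],
    "High": [1, 1, 0, 0, 1, 0],
})

# One column per subreddit, since the forest needs numeric inputs.
X = pd.get_dummies(df["Subreddit"])
y = df["High"]

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

preds = model.predict(X)
print(preds)
```

Predicting on the training rows is only a smoke test here; a held-out set or cross-validation is needed for a meaningful score.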

In [None]:
## YOUR CODE HERE

#### Create a few new variables in your dataframe to represent interesting features of a thread title.
- For example, create a feature that represents whether 'cat' is in the title or whether 'funny' is in the title. 
- Then build a new Random Forest with these features. Do they add any value?
- After creating these variables, use `CountVectorizer` to create features based on the words in the thread titles.
- Build a new random forest model with subreddit and these new features included.
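
The steps above can be sketched as follows. The titles, labels, and feature names (`has_cat`, `has_funny`) are made up for illustration, and stacking the blocks with `scipy.sparse.hstack` is just one way of combining them.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({
    "Title": [
        "My cat learned to open doors",
        "Funny sign at the airport",
        "New telescope images released",
        "Cat vs cucumber funny compilation",
    ],
    "Subreddit": ["r/aww", "r/funny", "r/space", "r/funny"],
    "High": [1, 0, 0, 1],
})

# Hand-made keyword flags. This is a substring match, so "cat" also hits "category".
df["has_cat"] = df["Title"].str.contains("cat", case=False).astype(int)
df["has_funny"] = df["Title"].str.contains("funny", case=False).astype(int)

# Word counts from the titles.
vectorizer = CountVectorizer(stop_words="english")
title_counts = vectorizer.fit_transform(df["Title"])

# Combine word counts, keyword flags, and one-hot subreddits into one matrix.
manual = csr_matrix(df[["has_cat", "has_funny"]].to_numpy())
subreddits = csr_matrix(pd.get_dummies(df["Subreddit"]).to_numpy(dtype=int))
X = hstack([title_counts, manual, subreddits]).tocsr()

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, df["High"])
print(X.shape)
```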

In [None]:
## YOUR CODE HERE

#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy of the model, as well as any other metrics you feel are appropriate. 
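
A minimal sketch of cross-validated evaluation with several metrics at once is shown below. It runs on synthetic data from `make_classification`, and the choice of ROC AUC and F1 alongside accuracy is only a suggestion.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the real features and High/Low target.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
metrics = ["accuracy", "roc_auc", "f1"]
scores = cross_validate(model, X, y, cv=5, scoring=metrics)

for metric in metrics:
    fold_scores = scores[f"test_{metric}"]
    print(f"{metric}: mean={fold_scores.mean():.3f} std={fold_scores.std():.3f}")
```

Reporting the spread across folds as well as the mean shows whether a difference between two models is larger than the fold-to-fold noise.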

In [None]:
## YOUR CODE HERE

#### Repeat the model-building process with a non-tree-based method.
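
Logistic regression is one non-tree-based option, though any other classifier would fit this step equally well. The sketch below chains `CountVectorizer` and `LogisticRegression` in a pipeline, using made-up titles and labels.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up titles with made-up High (1) / Low (0) labels.
titles = [
    "My cat learned to open doors",
    "Funny cat compilation",
    "Cat sleeps on keyboard",
    "Funny sign at the airport",
    "Quarterly budget report released",
    "City council meeting minutes",
    "New parking rules announced",
    "Road closure schedule updated",
]
high = [1, 1, 1, 1, 0, 0, 0, 0]

model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(titles, high)

preds = model.predict(["Funny cat video"])
print(preds)

# Rank words by coefficient: larger values push toward the High class.
vocab = model.named_steps["countvectorizer"].vocabulary_
coefs = model.named_steps["logisticregression"].coef_[0]
top_terms = sorted(vocab, key=lambda term: coefs[vocab[term]], reverse=True)[:3]
print(top_terms)
```

A linear model also gives one coefficient per word, which makes it easier to see which title words are associated with the high-comment class.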

In [None]:
## YOUR CODE HERE

#### Use `CountVectorizer` from scikit-learn to create features from the thread titles.
- Compare count features with binary (presence/absence) features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 
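
The difference between count and binary features is easiest to see on a tiny example. The two documents below are made up; `binary=True` is the `CountVectorizer` option that caps every count at 1.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["cat cat dog", "dog bird"]

# Default behaviour: each cell holds how many times the word appears.
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs).toarray()

# binary=True: each cell is 1 if the word appears at all, else 0.
binary_vectorizer = CountVectorizer(binary=True)
binary = binary_vectorizer.fit_transform(docs).toarray()

# Columns follow alphabetical vocabulary order: bird, cat, dog.
print(sorted(count_vectorizer.vocabulary_))
print(counts.tolist())
print(binary.tolist())
```

Titles are short and rarely repeat a word, so the two encodings may behave similarly on this data; comparing cross-validated scores is the way to check.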

In [None]:
## YOUR CODE HERE

# Executive Summary
---
Put your executive summary in a Markdown cell below.

### BONUS
Refer to the README for the bonus parts

In [None]:
## YOUR CODE HERE