# !!!Reminder to source code!!!

# Web Scraping for Reddit & Predicting Comments

In this project, we will practice two major skills. Collecting data by scraping a website and then building a binary predictor.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this article, your problem statement will be: _What characteristics of a post on Reddit contribute most to the overall interaction (as measured by number of comments)?_

Your method for acquiring the data will be scraping the 'hot' threads as listed on the [Reddit homepage](https://www.reddit.com/). You'll acquire _AT LEAST FOUR_ pieces of information about each thread:
1. The title of the thread
2. The subreddit that the thread corresponds to
3. The length of time it has been up on Reddit
4. The number of comments on the thread

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts whether or not a given Reddit post will have above or below the _median_ number of comments.

**BONUS PROBLEMS**
1. If creating a logistic regression, GridSearch Ridge and Lasso for this model and report the best hyperparameter values.
1. Scrape the actual text of the threads using Selenium (you'll learn about this in Webscraping II).
2. Write the actual article that you're pitching and turn it into a blog post that you host on your personal website.

### Scraping Thread Info from Reddit.com

#### Set up a request (using requests) to the URL below. Use BeautifulSoup to parse the page and extract all results

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
url = "http://www.reddit.com/"

In [3]:
homepage = requests.get(url, headers = {'User-agent': 'Ross'})

In [4]:
html = homepage.text

In [5]:
print(html[:700])

<!doctype html><html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"><head><title>reddit: the front page of the internet</title><meta name="keywords" content=" reddit, reddit.com, vote, comment, submit " /><meta name="description" content="reddit: the front page of the internet" /><meta name="referrer" content="always"><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><link type="application/opensearchdescription+xml" rel="search" href="/static/opensearch.xml"/><link rel="canonical" href="https://www.reddit.com/" /><meta name="viewport" content="width=1024"><link rel="dns-prefetch" href="//out.reddit.com"><link rel="preconnect" href="//out.reddit.com"><link re


In [6]:
soup = BeautifulSoup(html, 'lxml')

In [7]:
soup

<!DOCTYPE html>
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml"><head><title>reddit: the front page of the internet</title><meta content=" reddit, reddit.com, vote, comment, submit " name="keywords"/><meta content="reddit: the front page of the internet" name="description"/><meta content="always" name="referrer"/><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><link href="/static/opensearch.xml" rel="search" type="application/opensearchdescription+xml"/><link href="https://www.reddit.com/" rel="canonical"/><meta content="width=1024" name="viewport"/><link href="//out.reddit.com" rel="dns-prefetch"/><link href="//out.reddit.com" rel="preconnect"/><link href="//www.redditstatic.com/desktop2x/img/favicon/apple-icon-57x57.png" rel="apple-touch-icon" sizes="57x57"/><link href="//www.redditstatic.com/desktop2x/img/favicon/apple-icon-60x60.png" rel="apple-touch-icon" sizes="60x60"/><link href="//www.redditstatic.com/desktop2x/img/favicon/apple-icon-72x72

In [8]:
soup.find("span", {"class": "next-button"}).a['href']

'https://www.reddit.com/?count=25&after=t3_80607r'

In [9]:
print(soup.html.title)

<title>reddit: the front page of the internet</title>


In [10]:
print(soup.html.title.text)

reddit: the front page of the internet


In [11]:
title = soup.find_all("a", {"class": "title"})
title[0].text

'Meeting his new sister for the first time'

In [12]:
for i in title:
    print(i.text)

Meeting his new sister for the first time
🔥Potter wasp🔥
Finally became the proud owner of this beautiful little guy yesterday. Always wanted my own pet and I finally did it. One of the happiest days of my life so far.
The S9 Keeps the 3.5mm Headphone Jack!
They did it again...
When a family argument gets out of hand
Gonna jump on top the cashier's table
Russian athlete filmed in 'I don’t do doping' shirt fails Olympic drug test
A Guy Who Knows a Creepshow When He Sees One.
Vaccinations Drop in Europe, and the Result Was Over 21,000 Cases of Measles
Nash Appreciation Upvote Party. Thank you BDR.
I put a bubble level on my basic drill so I know when I’m 90 degrees perpendicular to ground when drilling vertically.
!Heads Up!: Congress it trying to pass Bill H.R.1856 on Tuesday that removes protections of site owners for what their users post
TIL Coyotes use their howls and yipping to create a kind of census of coyote populations. If their howls are not answered by other packs, it triggers

In [13]:
time_since = soup.find_all("time", {"class": "live-timestamp"})
time_since[0].text

'4 hours ago'

In [14]:
subreddit = soup.find_all("a", {"class": "subreddit hover may-blank"})
subreddit[0].text

'r/Eyebleach'

In [15]:
num_comments = soup.find_all("a", {"class": "comments"})
num_comments[0].text

'14 comments'

While this has some more verbose elements removed, we can see that there is some structure to the above:
- The thread title is within an `<a>` tag with the attribute `data-event-action="title"`.
- The time since the thread was created is within a `<time>` tag with attribute `class="live-timestamp"`.
- The subreddit is within an `<a>` tag with the attribute `class="subreddit hover may-blank"`.
- The number of comments is within an `<a>` tag with the attribute data-event-action="comments"`.

## Write 4 functions to extract these items (one function for each): title, time, subreddit, and number of comments.¶
Example
```python
def extract_title_from_result(result):
    return result.find ...
```

##### - Make sure these functions are robust and can handle cases where the data/field may not be available.
>- Remember to check if a field is empty or None for attempting to call methods on it
>- Remember to use try/except if you anticipate errors.

- **Test** the functions on the results above and simple examples

In [16]:
def extract_title_from_result(soup):
    title = soup.find("p", {"class": "title"})
    return title.text if title else None

In [17]:
def extract_time_since_from_result(soup):
    time_since = soup.find("time", {"class": "live-timestamp"})
    return time_since.text if time_since else None

In [18]:
def extract_subreddit_from_result(soup):
    subreddit = soup.find("a", {"class": "subreddit hover may-blank"})
    return subreddit.text if subreddit else None

In [19]:
def extract_num_comments_from_result(soup):
    num_comments = soup.find("a", {"class": "comments"})
    return num_comments.text if num_comments else None

Now, to scale up our scraping, we need to accumulate more results.

First, look at the source of a Reddit.com page: (https://www.reddit.com/).
Try manually changing the page by clicking the 'next' button on the bottom. Look at how the url changes.

After leaving the Reddit homepage, the URLs should look something like this:
```
https://www.reddit.com/?count=25&after=t3_787ptc
```

The URL here has two query parameters
- count is the result number that the page starts with
- after is the unique id of the last result on the _previous_ page

In order to scrape lots of pages from Reddit, we'll have to change these parameters every time we make a new request so that we're not just scraping the same page over and over again. Incrementing the count by 25 every time will be easy, but the bizarre code after `after` is a bit trickier.

To start off, let's look at a block of HTML from a Reddit page to see how we might solve this problem:
```html
<div class=" thing id-t3_788tye odd gilded link " data-author="LordSneaux" data-author-fullname="t2_j3pty" data-comments-count="1548" data-context="listing" data-domain="v.redd.it" data-fullname="t3_788tye" data-kind="video" data-num-crossposts="0" data-permalink="/r/funny/comments/788tye/not_all_heroes_wear_capes/" data-rank="25" data-score="51468" data-subreddit="funny" data-subreddit-fullname="t5_2qh33" data-timestamp="1508775581000" data-type="link" data-url="https://v.redd.it/ush0rh2tultz" data-whitelist-status="all_ads" id="thing_t3_788tye" onclick="click_thing(this)">
      <p class="parent">
      </p>
      <span class="rank">
       25
      </span>
      <div class="midcol unvoted">
       <div aria-label="upvote" class="arrow up login-required access-required" data-event-action="upvote" role="button" tabindex="0">
       </div>
       <div class="score dislikes" title="53288">
        53.3k
       </div>
       <div class="score unvoted" title="53289">
        53.3k
       </div>
       <div class="score likes" title="53290">
        53.3k
       </div>
       <div aria-label="downvote" class="arrow down login-required access-required" data-event-action="downvote" role="button" tabindex="0">
       </div>
      </div>
```

Notice that within the `div` tag there is an attribute called `id` and it is set to `"thing_t3_788tye"`. By finding the last ID on your scraped page, you can tell your _next_ request where to start (pass everything after "thing_").

For more info on this, you can take a look at the [Reddit API docs](https://github.com/reddit/reddit/wiki/JSON)

In [20]:
#Request multiple URLs, get URL structure
#individual requests for each next page

url1 = requests.get("http://www.reddit.com/?count=25&after=t3_7z5lqe")
url2 = requests.get("http://www.reddit.com/?count=50&after=t3_7z8pv5")
url3 = requests.get("http://www.reddit.com/?count=75&after=t3_7z2497")
url4 = requests.get("http://www.reddit.com/?count=100&after=t3_7z9hoi")

## Write one more function that finds the last `id` on the page, and stores it.

In [21]:
def extract_id_from_result(result):
    result_id = result.find_all("a", {"id": " "})
    return result_id if result_id else None

## (Optional) Collect more information

While we only require you to collect four features, there may be other info that you can find on the results page that might be useful. Feel free to write more functions so that you have more interesting and useful data.

In [56]:
## YOUR CODE HERE

## Now, let's put it all together.

Use the functions you wrote above to parse out the 4 fields - title, time, subreddit, and number of comments. Create a dataframe from the results with those 4 columns.

In [57]:
import time

url = 'http://www.reddit.com'
x1=[]
x2=[]
x3=[]
x4=[]
for i in range(80):
    time.sleep(1)
    results = requests.get(url, headers = {'User-agent': 'Ross'})
    print(results.status_code)
    if results.status_code != 200:
        continue
    soup = BeautifulSoup(results.text, 'lxml')
    for a in soup.find_all("div", {"class": "top-matter"}):
        x1.append(extract_title_from_result(a))
        x2.append(extract_time_since_from_result(a))
        x3.append(extract_subreddit_from_result(a))
        x4.append(extract_num_comments_from_result(a))
    url = soup.find("span", {"class": "next-button"}).a['href']

200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200


AttributeError: 'NoneType' object has no attribute 'a'

In [58]:
len(x2)

888

In [59]:
import pandas as pd
df = pd.DataFrame(
    {'num_comments':x4,
     'subreddit':x3,
     'time_since':x2,
     'title':x1,
    })
df.head()

df1 = pd.DataFrame()
df1['title']=x1
df1['time_since']=x2
df1['subreddit']=x3
df1['num_comments']=x4
df1.head()

Unnamed: 0,title,time_since,subreddit,num_comments
0,Meeting his new sister for the first time (i.i...,4 hours ago,r/Eyebleach,172 comments
1,🔥Potter wasp🔥 (i.redd.it),4 hours ago,r/NatureIsFuckingLit,440 comments
2,Finally became the proud owner of this beautif...,4 hours ago,r/pics,1693 comments
3,The S9 Keeps the 3.5mm Headphone Jack! (thever...,4 hours ago,r/gadgets,3113 comments
4,They did it again... (i.redd.it),5 hours ago,r/funny,850 comments


### Save your results as a CSV
You may do this regularly while scraping data as well, so that if your scraper stops of your computer crashes, you don't lose all your data.

In [60]:
df1.head()

Unnamed: 0,title,time_since,subreddit,num_comments
0,Meeting his new sister for the first time (i.i...,4 hours ago,r/Eyebleach,172 comments
1,🔥Potter wasp🔥 (i.redd.it),4 hours ago,r/NatureIsFuckingLit,440 comments
2,Finally became the proud owner of this beautif...,4 hours ago,r/pics,1693 comments
3,The S9 Keeps the 3.5mm Headphone Jack! (thever...,4 hours ago,r/gadgets,3113 comments
4,They did it again... (i.redd.it),5 hours ago,r/funny,850 comments


In [61]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 888 entries, 0 to 887
Data columns (total 4 columns):
title           888 non-null object
time_since      888 non-null object
subreddit       888 non-null object
num_comments    888 non-null object
dtypes: object(4)
memory usage: 27.8+ KB


In [62]:
df1.describe()

Unnamed: 0,title,time_since,subreddit,num_comments
count,888,888,888,888
unique,541,22,338,288
top,The rear of the Alfa Romeo Sauber C37 is absol...,6 hours ago,r/aww,18 comments
freq,3,124,45,17


In [63]:
df1['num_comments']

0       172 comments
1       440 comments
2      1693 comments
3      3113 comments
4       850 comments
5       203 comments
6       815 comments
7      2378 comments
8      1128 comments
9       216 comments
10     1371 comments
11      133 comments
12     1680 comments
13      395 comments
14      717 comments
15      126 comments
16      481 comments
17      353 comments
18      182 comments
19      328 comments
20      241 comments
21      196 comments
22      114 comments
23       93 comments
24      120 comments
25      693 comments
26       90 comments
27      278 comments
28     2326 comments
29      726 comments
           ...      
858      20 comments
859      23 comments
860      25 comments
861      36 comments
862      40 comments
863      13 comments
864     504 comments
865      74 comments
866    9823 comments
867     107 comments
868      18 comments
869     289 comments
870       5 comments
871      23 comments
872      34 comments
873      26 comments
874      10 c

In [64]:
df1['num_comments'] = df1['num_comments'].str.replace('comments','')
df1['num_comments'] = df1['num_comments'].str.replace('comment','')

In [65]:
df1.head()

Unnamed: 0,title,time_since,subreddit,num_comments
0,Meeting his new sister for the first time (i.i...,4 hours ago,r/Eyebleach,172
1,🔥Potter wasp🔥 (i.redd.it),4 hours ago,r/NatureIsFuckingLit,440
2,Finally became the proud owner of this beautif...,4 hours ago,r/pics,1693
3,The S9 Keeps the 3.5mm Headphone Jack! (thever...,4 hours ago,r/gadgets,3113
4,They did it again... (i.redd.it),5 hours ago,r/funny,850


In [66]:
df1['num_comments'] = df1['num_comments'].apply(pd.to_numeric)
df1.dtypes

title           object
time_since      object
subreddit       object
num_comments     int64
dtype: object

In [67]:
df1['time_since'].str.contains("minutes").unique()

array([False,  True], dtype=bool)

In [68]:
df1['time_since'].str.contains("days").unique()

array([False], dtype=bool)

In [69]:
time = []
for i in df1['time_since']:
    if 'minutes ago' in i:
        x = i.replace('minutes ago','')
        x = int(x)/60
        time.append(x)
    else:
        time.append(i)
time

['4 hours ago',
 '4 hours ago',
 '4 hours ago',
 '4 hours ago',
 '5 hours ago',
 '3 hours ago',
 '5 hours ago',
 '6 hours ago',
 '5 hours ago',
 '2 hours ago',
 '6 hours ago',
 '7 hours ago',
 '7 hours ago',
 '6 hours ago',
 '6 hours ago',
 '5 hours ago',
 '7 hours ago',
 '4 hours ago',
 '6 hours ago',
 '7 hours ago',
 '5 hours ago',
 '7 hours ago',
 '3 hours ago',
 '4 hours ago',
 '4 hours ago',
 '8 hours ago',
 '4 hours ago',
 '4 hours ago',
 '8 hours ago',
 '8 hours ago',
 '8 hours ago',
 '7 hours ago',
 '7 hours ago',
 '6 hours ago',
 '6 hours ago',
 '8 hours ago',
 '5 hours ago',
 '5 hours ago',
 '7 hours ago',
 '8 hours ago',
 '7 hours ago',
 '9 hours ago',
 '4 hours ago',
 '8 hours ago',
 '8 hours ago',
 '6 hours ago',
 '8 hours ago',
 '8 hours ago',
 '3 hours ago',
 '10 hours ago',
 '9 hours ago',
 '5 hours ago',
 '5 hours ago',
 '5 hours ago',
 '4 hours ago',
 '4 hours ago',
 '6 hours ago',
 '8 hours ago',
 '10 hours ago',
 '10 hours ago',
 '8 hours ago',
 '9 hours ago',
 '8 h

In [70]:
# time = []
for i in time:
    if 'hours ago' in i:
        x = i.replace('hours ago','')
        x = int(x)
        time.append(x)
    else:
        time.append(i)
time

[0.06666666666666667,
 0.06666666666666667,
 0.06666666666666667,
 0.06666666666666667,
 0.08333333333333333,
 0.05,
 0.08333333333333333,
 0.1,
 0.08333333333333333,
 0.03333333333333333,
 0.1,
 0.11666666666666667,
 0.11666666666666667,
 0.1,
 0.1,
 0.08333333333333333,
 0.11666666666666667,
 0.06666666666666667,
 0.1,
 0.11666666666666667,
 0.08333333333333333,
 0.11666666666666667,
 0.05,
 0.06666666666666667,
 0.06666666666666667,
 0.13333333333333333,
 0.06666666666666667,
 0.06666666666666667,
 0.13333333333333333,
 0.13333333333333333,
 0.13333333333333333,
 0.11666666666666667,
 0.11666666666666667,
 0.1,
 0.1,
 0.13333333333333333,
 0.08333333333333333,
 0.08333333333333333,
 0.11666666666666667,
 0.13333333333333333,
 0.11666666666666667,
 0.15,
 0.06666666666666667,
 0.13333333333333333,
 0.13333333333333333,
 0.1,
 0.13333333333333333,
 0.13333333333333333,
 0.05,
 0.16666666666666666,
 0.15,
 0.08333333333333333,
 0.08333333333333333,
 0.08333333333333333,
 0.066666666666

In [71]:
for i in time:
    if 'days ago' in i:
        x = i.replace('days ago','')
        x = int(x)*24
        time.append(x)
    else:
        time.append(i)
time

['4 hours ago',
 '4 hours ago',
 '4 hours ago',
 '4 hours ago',
 '5 hours ago',
 '3 hours ago',
 '5 hours ago',
 '6 hours ago',
 '5 hours ago',
 '2 hours ago',
 '6 hours ago',
 '7 hours ago',
 '7 hours ago',
 '6 hours ago',
 '6 hours ago',
 '5 hours ago',
 '7 hours ago',
 '4 hours ago',
 '6 hours ago',
 '7 hours ago',
 '5 hours ago',
 '7 hours ago',
 '3 hours ago',
 '4 hours ago',
 '4 hours ago',
 '8 hours ago',
 '4 hours ago',
 '4 hours ago',
 '8 hours ago',
 '8 hours ago',
 '8 hours ago',
 '7 hours ago',
 '7 hours ago',
 '6 hours ago',
 '6 hours ago',
 '8 hours ago',
 '5 hours ago',
 '5 hours ago',
 '7 hours ago',
 '8 hours ago',
 '7 hours ago',
 '9 hours ago',
 '4 hours ago',
 '8 hours ago',
 '8 hours ago',
 '6 hours ago',
 '8 hours ago',
 '8 hours ago',
 '3 hours ago',
 '10 hours ago',
 '9 hours ago',
 '5 hours ago',
 '5 hours ago',
 '5 hours ago',
 '4 hours ago',
 '4 hours ago',
 '6 hours ago',
 '8 hours ago',
 '10 hours ago',
 '10 hours ago',
 '8 hours ago',
 '9 hours ago',
 '8 h

In [72]:
df1.time_since = time
#df1['time_since'] = df1['time_since'].str.replace('hours ago','')
#df1['time_since'] = pd.to_numeric(df1['time_since'].str.replace('hour ago',''))

df1.head()
df1.dtypes

title           object
time_since      object
subreddit       object
num_comments     int64
dtype: object

In [73]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 888 entries, 0 to 887
Data columns (total 4 columns):
title           888 non-null object
time_since      888 non-null object
subreddit       888 non-null object
num_comments    888 non-null int64
dtypes: int64(1), object(3)
memory usage: 27.8+ KB


In [74]:
df1.to_csv('results1.csv', index=False)

## Predicting comments using Random Forests + Another Classifier

#### Load in the the data of scraped results

#### We want to predict a binary variable - whether the number of comments was low or high. Compute the median number of comments and create a new binary variable that is true when the number of comments is high (above the median)

We could also perform Linear Regression (or any regression) to predict the number of comments here. Instead, we are going to convert this into a _binary_ classification problem, by predicting two classes, HIGH vs LOW number of comments.

While performing regression may be better, performing classification may help remove some of the noise of the extremely popular threads. We don't _have_ to choose the `median` as the splitting point - we could also split on the 75th percentile or any other reasonable breaking point.

In fact, the ideal scenario may be to predict many levels of comment numbers. 

In [75]:
df1 = pd.read_csv('results1.csv')

In [76]:
time = []
for i in df1['time_since']:
    if 'hours ago' in str(i):
        x = str(i).replace('hours ago','')
        x = int(x)/60
        time.append(x)
    else:
        time.append(i)
time

[0.06666666666666667,
 0.06666666666666667,
 0.06666666666666667,
 0.06666666666666667,
 0.08333333333333333,
 0.05,
 0.08333333333333333,
 0.1,
 0.08333333333333333,
 0.03333333333333333,
 0.1,
 0.11666666666666667,
 0.11666666666666667,
 0.1,
 0.1,
 0.08333333333333333,
 0.11666666666666667,
 0.06666666666666667,
 0.1,
 0.11666666666666667,
 0.08333333333333333,
 0.11666666666666667,
 0.05,
 0.06666666666666667,
 0.06666666666666667,
 0.13333333333333333,
 0.06666666666666667,
 0.06666666666666667,
 0.13333333333333333,
 0.13333333333333333,
 0.13333333333333333,
 0.11666666666666667,
 0.11666666666666667,
 0.1,
 0.1,
 0.13333333333333333,
 0.08333333333333333,
 0.08333333333333333,
 0.11666666666666667,
 0.13333333333333333,
 0.11666666666666667,
 0.15,
 0.06666666666666667,
 0.13333333333333333,
 0.13333333333333333,
 0.1,
 0.13333333333333333,
 0.13333333333333333,
 0.05,
 0.16666666666666666,
 0.15,
 0.08333333333333333,
 0.08333333333333333,
 0.08333333333333333,
 0.066666666666

In [118]:
med = df2['num_comments'].median()
med

70.0

In [119]:
df1['high_comments'] = df1['num_comments'].apply(lambda x: 1 if x > med else 0)

In [152]:
df1['high_comments'].value_counts()

1    458
0    430
Name: high_comments, dtype: int64

In [120]:
df1.head(10)

Unnamed: 0,title,time_since,subreddit,num_comments,high_comments,has_Cat,has_Dog,has_Olympic,has_Curling,has_Trump,has_Economy,has_Interesting,has_Hot,has_TIL,has_Science
0,Meeting his new sister for the first time (i.i...,4 hours ago,r/Eyebleach,172,1,0,0,0,0,0,0,0,0,0,0
1,🔥Potter wasp🔥 (i.redd.it),4 hours ago,r/NatureIsFuckingLit,440,1,0,0,0,0,0,0,0,0,0,0
2,Finally became the proud owner of this beautif...,4 hours ago,r/pics,1693,1,0,0,0,0,0,0,0,0,0,0
3,The S9 Keeps the 3.5mm Headphone Jack! (thever...,4 hours ago,r/gadgets,3113,1,0,0,0,0,0,0,0,0,0,0
4,They did it again... (i.redd.it),5 hours ago,r/funny,850,1,0,0,0,0,0,0,0,0,0,0
5,Gonna jump on top the cashier's table (i.imgur...,3 hours ago,r/Whatcouldgowrong,203,1,0,0,0,0,0,0,0,0,0,0
6,When a family argument gets out of hand (i.red...,5 hours ago,r/funny,815,1,0,0,0,0,0,0,0,0,0,0
7,Russian athlete filmed in 'I don’t do doping' ...,6 hours ago,r/news,2378,1,0,0,1,0,0,0,0,0,0,0
8,A Guy Who Knows a Creepshow When He Sees One. ...,5 hours ago,r/PoliticalHumor,1128,1,0,0,0,0,0,0,0,0,0,0
9,I put a bubble level on my basic drill so I kn...,2 hours ago,r/lifehacks,216,1,0,0,0,0,0,0,0,0,0,0


#### Thought experiment: What is the baseline accuracy for this model?

In [121]:
baseline = df1.high_comments.mean()
baseline

0.5157657657657657

#### Create a Random Forest model to predict High/Low number of comments using Sklearn. Start by ONLY using the subreddit as a feature. 

In [122]:
len(df1['subreddit'].unique()), len(df1['subreddit'])

(338, 888)

In [123]:
li = []
for x in df1['subreddit']:
    if 'People' in x:
        x = 'r/People'
        li.append(x)
    elif 'meme' in x or 'Meme' in x or 'dank' in x or 'Dank' in x:
        x = 'r/Meme'
        li.append(x)
    elif 'irl' in x:
        x = 'r/irl'
        li.append(x)
    else:
        li.append(x)

# df1['subreddit'] = li

In [124]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score



X = df2[['subreddit']]
y = df2['high_comments']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

Fitting the Model on Training Data

In [125]:
vect = CountVectorizer(stop_words='english')
vect.fit(X_train['subreddit'])
X_train = vect.transform(X_train['subreddit'])
X_train = pd.DataFrame(X_train.todense(), columns=vect.get_feature_names())

In [126]:
X_train.head()

Unnamed: 0,2healthbars,2meirl4meirl,3dprinting,abandonedporn,absolutelynotmeirl,accidentalcomedy,accidentalrenaissance,accidentalwesanderson,againsthatesubreddits,android,...,wellthatsucks,whatcouldgowrong,whitepeoplegifs,whitepeopletwitter,wholesomememes,worldnews,writingprompts,xboxone,zelda,zoomies
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [127]:
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
train_predictions = rfc.predict(X_train)

print('Train Accuracy Score', rfc.score(X_train, y_train))
print('Cross Validated Accuracy Train Score', cross_val_score(rfc, X_train, y_train, cv=5).mean())

Train Accuracy Score 0.887366818874
Cross Validated Accuracy Train Score 0.686368593239


Test Data

In [128]:
X_test = vect.transform(X_test['subreddit'])
X_test = pd.DataFrame(X_test.todense(), columns=vect.get_feature_names())


train_predictions = rfc.predict(X_test)

print('Test Accuracy Score', rfc.score(X_test, y_test))
print('Cross Validated Accuracy Test Score', cross_val_score(rfc, X_test, y_test, cv=5).mean())

Test Accuracy Score 0.741538461538
Cross Validated Accuracy Test Score 0.553952505828


#### Create a few new variables in your dataframe to represent interesting features of a thread title.
- For example, create a feature that represents whether 'cat' is in the title or whether 'funny' is in the title. 
- Then build a new Random Forest with these features. Do they add any value?
- After creating these variables, use count-vectorizer to create features based on the words in the thread titles.
- Build a new random forest model with subreddit and these new features included.

In [129]:
df1['has_Cat'] = df1['title'].apply(lambda x: 1 if 'cat ' in x.lower() else 0)

df1['has_Cat'].value_counts()

0    876
1     12
Name: has_Cat, dtype: int64

In [130]:
df1['has_Dog'] = df1['title'].apply(lambda x: 1 if 'dog ' in x.lower() else 0)

df1['has_Dog'].value_counts()

0    878
1     10
Name: has_Dog, dtype: int64

In [131]:
df1['has_Olympic'] = df1['title'].apply(lambda x: 1 if 'olympic' in x.lower() else 0)

df1['has_Olympic'].value_counts()

0    879
1      9
Name: has_Olympic, dtype: int64

In [132]:
df1['has_Curling'] = df1['title'].apply(lambda x: 1 if 'curling' in x.lower() else 0)

df1['has_Curling'].value_counts()

0    888
Name: has_Curling, dtype: int64

In [133]:
df1['has_Trump'] = df1['title'].apply(lambda x: 1 if 'trump' in x.lower() else 0)

df1['has_Trump'].value_counts()

0    882
1      6
Name: has_Trump, dtype: int64

In [134]:
df1['has_Economy'] = df1['title'].apply(lambda x: 1 if 'economy' in x.lower() else 0)

df1['has_Economy'].value_counts()

0    888
Name: has_Economy, dtype: int64

In [135]:
df1['has_Interesting'] = df1['title'].apply(lambda x: 1 if 'interesting' in x.lower() else 0)

df1['has_Interesting'].value_counts()

0    888
Name: has_Interesting, dtype: int64

In [136]:
df1['has_Hot'] = df1['title'].apply(lambda x: 1 if 'hot ' in x.lower() else 0)

df1['has_Hot'].value_counts()

0    888
Name: has_Hot, dtype: int64

In [137]:
df1['has_TIL'] = df1['title'].apply(lambda x: 1 if 'TIL ' in x else 0)

df1['has_TIL'].value_counts()

0    870
1     18
Name: has_TIL, dtype: int64

In [138]:
df1['has_Science'] = df1['title'].apply(lambda x: 1 if 'science' in x.lower() else 0)

df1['has_Science'].value_counts()

0    886
1      2
Name: has_Science, dtype: int64

In [148]:
df1['has_Basketball'] = df1['title'].apply(lambda x: 1 if 'basketball' in x.lower() else 0)

df1['has_Basketball'].value_counts()

0    886
1      2
Name: has_Basketball, dtype: int64

In [150]:
df1['has_Football'] = df1['title'].apply(lambda x: 1 if 'football' in x.lower() else 0)

df1['has_Football'].value_counts()

0    888
Name: has_Football, dtype: int64

In [151]:
df1['has_Politics'] = df1['title'].apply(lambda x: 1 if 'politics' in x.lower() else 0)

df1['has_Politics'].value_counts()

0    888
Name: has_Politics, dtype: int64

In [139]:
X = df1.drop(['high_comments', 'num_comments','time_since'], axis=1)
y = df1['high_comments']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [140]:
X_train.head()

Unnamed: 0,title,subreddit,has_Cat,has_Dog,has_Olympic,has_Curling,has_Trump,has_Economy,has_Interesting,has_Hot,has_TIL,has_Science
6,When a family argument gets out of hand (i.red...,r/funny,0,0,0,0,0,0,0,0,0,0
575,that's a bit greedy (i.redd.it),r/memes,0,0,0,0,0,0,0,0,0,0
444,The Mars Volta - L'Via L'viaquez [Progressive ...,r/Music,0,0,0,0,0,0,0,0,0,0
73,This are some skills (i.imgur.com),r/holdmybeer,0,0,0,0,0,0,0,0,0,0
669,Teeny-tiny kitty sketch (i.imgur.com),r/pics,0,0,0,0,0,0,0,0,0,0


In [141]:
vect = CountVectorizer(stop_words='english')
vect.fit(X_train['subreddit'])
X_train_t = vect.transform(X_train['subreddit'])
X_train_t = pd.DataFrame(X_train_t.todense(), columns=vect.get_feature_names())
X_train = X_train[['has_Cat', 'has_Dog', 'has_Olympic', 'has_Curling', 'has_Trump', 'has_Economy', 'has_Interesting', 'has_Hot', 'has_TIL', 'has_Science']].reset_index().drop(['index'], axis=1)
X_train = pd.concat([X_train, X_train_t], axis=1)

In [142]:
X_train.head()

Unnamed: 0,has_Cat,has_Dog,has_Olympic,has_Curling,has_Trump,has_Economy,has_Interesting,has_Hot,has_TIL,has_Science,...,whitepeopletwitter,wholesomememes,woahdude,woof_irl,worldnews,writingprompts,xboxone,youdontsurf,youtubehaiku,zelda
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [143]:
from sklearn.preprocessing import StandardScaler

In [144]:
ss = StandardScaler()

In [145]:
X_train = ss.fit_transform(X_train)

rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
train_predictions = rfc.predict(X_train)

print('Train Accuracy Score', rfc.score(X_train, y_train))
print('Cross Validated Accuracy Train Score', cross_val_score(rfc, X_train, y_train, cv=5).mean())

Train Accuracy Score 0.902356902357
Cross Validated Accuracy Train Score 0.70880881166


In [146]:
X_test_t = vect.transform(X_test['subreddit'])
X_test_t = pd.DataFrame(X_test_t.todense(), columns=vect.get_feature_names())

In [116]:
X_test = X_test[['has_Cat', 'has_Dog', 'has_Olympic', 'has_Curling', 'has_Trump', 'has_Economy', 'has_Interesting', 'has_Hot', 'has_TIL', 'has_Science']].reset_index().drop(['index'], axis=1)
X_test = pd.concat([X_test, X_test_t], axis=1)

KeyError: "['has_Science'] not in index"

In [113]:
X_test.head()

Unnamed: 0,has_Cat,has_Dog,has_Olympic,has_Curling,has_Trump,has_Economy,has_Interesting,has_Hot,has_TIL,2healthbars,...,whitepeopletwitter,wholesomememes,woahdude,woof_irl,worldnews,writingprompts,xboxone,youdontsurf,youtubehaiku,zelda
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [147]:
X_test = ss.transform(X_test)

train_predictions = rfc.predict(X_test)

print('Test Accuracy Score', rfc.score(X_test, y_test))
print('Cross Validated Accuracy Test Score', cross_val_score(rfc, X_test, y_test, cv=7).mean())

ValueError: could not convert string to float: 'r/Art'

#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy of the model, as well as any other metrics you feel are appropriate. 

In [None]:
cross_val_score(rfc, X_test, y_test, cv=7).mean()

#### Repeat the model-building process with a non-tree-based method.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)


In [None]:
print(logreg.score(X_train, y_train))
print(logreg.predict_proba(X_train))

In [None]:
predictions = logreg.predict_proba(X_test)
# logreg.score(X_test, y_test)
print(cross_val_score(logreg, X_test, y_test).mean())

In [None]:
probs = pd.DataFrame(predictions)
probs.drop([1],axis=1,inplace=True)

probs[0] = probs[0].apply(lambda x: 1 if x <.1 else 0)

In [None]:
print(confusion_matrix(y_test, probs[0]))
print(classification_report(y_test, probs[0]))

#### Use Count Vectorizer from scikit-learn to create features from the thread titles. 
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 

In [None]:
X = df1.drop(['high_comments', 'num_comments', 'subreddit'], axis=1) #Explain why drop sub
y = df1['high_comments']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
vect = CountVectorizer(stop_words='english')
vect.fit(X_train['title'])
X_train_t = vect.transform(X_train['title'])
X_train_t = pd.DataFrame(X_train_t.todense(), columns=vect.get_feature_names())
X_train = X_train[['time_since', 'has_Cat', 'has_Dog', 'has_Olympic', 'has_Number', 'has_Curling']].reset_index().drop(['index'], axis=1)
X_train = pd.concat([X_train, X_train_t], axis=1)

In [None]:
X_test_t = vect.transform(X_test['title'])
X_test_t = pd.DataFrame(X_test_t.todense(), columns=vect.get_feature_names())

In [None]:
X_test = X_test[['time_since', 'has_Cat', 'has_Dog', 'has_Olympic', 'has_Number', 'has_Curling']].reset_index().drop(['index'], axis=1)
X_test = pd.concat([X_test, X_test_t], axis=1)

In [None]:
ss = StandardScaler()
X_train_s = ss.fit_transform(X_train)

In [None]:
X_test_s = ss.transform(X_test)

In [None]:
rfc = RandomForestClassifier()
rfc.fit(X_train_s, y_train)
train_predictions = rfc.predict(X_train_s)

print('Train Accuracy Score', rfc.score(X_train_s, y_train))
print('Cross Validated Accuracy Train Score', cross_val_score(rfc, X_train_s, y_train, cv=5).mean())

In [None]:
test_predictions = rfc.predict(X_test_s)

print('Test Accuracy Score', rfc.score(X_test_s, y_test))
print('Cross Validated Accuracy Test Score', cross_val_score(rfc, X_test_s, y_test, cv=5).mean())

In [None]:
feats = pd.DataFrame(rfc.feature_importances_)
feats['columns'] = X_train.columns
feats.sort_values(0, ascending=False, inplace=True)
feats.rename(columns={0:'feature_importance'}, inplace=True)
feats.head()

In [None]:
#Print out baseline and talk about how different they are.

# Executive Summary
---
Put your executive summary in a Markdown cell below.

### BONUS
Refer to the README for the bonus parts

In [None]:
## YOUR CODE HERE