# Project 3: Data Scraping

### Scenario

You're fresh out of your Data Science bootcamp and looking to break through in the world of freelance data journalism. Nate Silver and co. at FiveThirtyEight have agreed to hear your pitch for a story in two weeks!

Your piece is going to be on how to create a Reddit post that will get the most engagement from Reddit users. Because this is FiveThirtyEight, you're going to have to get data and analyze it in order to make a compelling narrative.

### Problem Statement: What characteristics of a Reddit post are most predictive of the overall interaction on a thread (as measured by number of comments)?

GET:
- [x] The title of the thread
- [x] The subreddit that the thread corresponds to
- [x] The length of time it has been up on Reddit
- [x] The number of comments on the thread
- [x] Scrape at least 10,000 threads.

Once you've got the data, you will build a classification model that, using the text and any other relevant features, predicts whether or not a given Reddit post will have above or below the _median_ number of comments.
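
The target described above can be sketched in a few lines. This is a minimal illustration on made-up numbers, not the notebook's final modeling code; the `above_median` column name and the choice to put posts exactly at the median in the 0 class are assumptions.

```python
import pandas as pd

# Toy data standing in for the scraped posts; the values are made up.
posts_example = pd.DataFrame({
    "title": ["a", "b", "c", "d", "e"],
    "comments": [3, 10, 25, 140, 2000],
})

median_comments = posts_example["comments"].median()

# 1 if the post has strictly more comments than the median, else 0.
# Posts exactly at the median land in the 0 class under this choice.
posts_example["above_median"] = (
    posts_example["comments"] > median_comments
).astype(int)

print(median_comments)
print(posts_example)
```

Using the median as the cut point keeps the two classes roughly balanced, which makes plain accuracy a reasonable first metric.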

## Import Libraries

In [1]:
import pandas as pd
import numpy as np

import requests
import json
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from time import sleep
import praw

### Scraping Data Round 1

In [56]:
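# Setting up the Reddit scraper.
# Credentials are read from environment variables instead of being hard-coded
# in the notebook; the variable names below are an assumption.
import os

reddit = praw.Reddit(client_id=os.environ['REDDIT_CLIENT_ID'],
                     client_secret=os.environ['REDDIT_CLIENT_SECRET'],
                     user_agent='Michael_Capparelli')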

In [None]:
# First Scrape, Reddit FrontPage, Hot Posts Now
posts = []
r_all = reddit.subreddit('all')
for post in r_all.hot(limit=10_000):
    posts.append([post.subreddit,post.title,
                  post.score,post.num_comments,
                  post.created])
    
posts = pd.DataFrame(posts,columns=['sub','title','score',
                                    'comments','created'])

In [None]:
# Keep the DataFrame under its own name so the `reddit` PRAW client is not overwritten
reddit_df = posts

In [None]:
# Saving first scrape to csv
reddit_df.to_csv('data/reddit_all.csv',index=False)

https://praw.readthedocs.io/en/stable/code_overview/models/subreddit.html
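
The same collection loop is repeated in every scraping round, so it could be factored into one helper. The sketch below is one possible shape, not the code that produced the CSVs: `collect_posts` and `FIELDS` are hypothetical names, it records `created_utc` where the notebook uses `created`, and it adds the submission `id` as an extra column. Stand-in objects with made-up values replace the PRAW listing so the example runs without credentials or network access.

```python
from types import SimpleNamespace

FIELDS = ["id", "sub", "title", "score", "comments", "created"]


def collect_posts(listing):
    """Turn an iterable of submission-like objects into a list of row dicts."""
    rows = []
    for post in listing:
        rows.append({
            "id": post.id,
            "sub": str(post.subreddit),  # store the name, not the object
            "title": post.title,
            "score": post.score,
            "comments": post.num_comments,
            "created": post.created_utc,  # Unix time in seconds
        })
    return rows


# Made-up stand-ins for the submissions a PRAW listing would yield.
fake_listing = [
    SimpleNamespace(id="aaa111", subreddit="aww", title="First post",
                    score=120, num_comments=14, created_utc=1_661_800_000.0),
    SimpleNamespace(id="bbb222", subreddit="news", title="Second post",
                    score=980, num_comments=310, created_utc=1_661_886_400.0),
]

rows = collect_posts(fake_listing)
print(rows[0])
```

With a real client the call would look like `rows = collect_posts(reddit.subreddit('all').hot(limit=10_000))`, followed by `pd.DataFrame(rows, columns=FIELDS)`. Keeping the `id` gives later merges an unambiguous key for spotting the same post across scrapes.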

### Scrapping Data Round 2
- The first round of scraping had extreme outliers.
- Because that scrape covered only posts that are currently hot, it includes posts that shouldn't be considered high-engagement, such as posts with only 10 comments, so we need more data.
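
One way to put numbers on the outlier concern is to look at the quantiles of the comment counts and the share of posts at or below a low threshold. The sketch below uses made-up counts, and the threshold of 10 simply mirrors the example in the bullet above; neither comes from the scraped data.

```python
import pandas as pd

# Made-up comment counts standing in for the first scrape.
comments = pd.Series([2, 10, 45, 300, 1200, 9000], name="comments")

# Quantiles show how spread out the counts are.
summary = comments.quantile([0.05, 0.25, 0.5, 0.75, 0.95])

# Share of posts with 10 comments or fewer.
low_share = (comments <= 10).mean()

print(summary)
print(low_share)
```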

In [None]:
# Scraping all top posts, all time
posts = []
r_all = reddit.subreddit('all')
for post in r_all.top(time_filter='all',limit=10_500):
    posts.append([post.subreddit,post.title,
                  post.score,post.num_comments,
                  post.created])
    
posts = pd.DataFrame(posts,columns=['sub','title','score',
                                    'comments','created'])

In [None]:
posts

In [None]:
posts.to_csv('data/r_all.csv',index=False)

### Merging Two Scrapes and Eliminating Duplicates

In [39]:
reddit1 = pd.read_csv('data/reddit_all.csv')
reddit2 = pd.read_csv('data/r_all.csv')

In [40]:
reddit1.shape,reddit2.shape

((7332, 5), (2410, 5))

In [41]:
reddit3 = pd.concat([reddit1,reddit2]).drop_duplicates().reset_index(drop=True)

In [42]:
reddit3.shape

(9708, 5)

In [50]:
reddit3.to_csv('data/reddit_3.csv',index=False)

In [51]:
reddit3 = pd.read_csv('data/reddit_3.csv')

### Scraping Data Round 3
- The merged dataset is still short of 10,000 posts and the outliers are likely still present, so we grab hot posts from r/all again, this time with a higher limit.

In [57]:
submissions = []
r4 = reddit.subreddit('all')
for post in r4.hot(limit=10_500):
    submissions.append([post.subreddit,post.title,
                  post.score,post.num_comments,
                  post.created])
    
submissions = pd.DataFrame(submissions,columns=['sub','title','score',
                                          'comments','created'])    

In [60]:
submissions.to_csv('data/reddit_4.csv',index=False)

### Concat For Final DataFrame

In [62]:
reddit4 = pd.read_csv('data/reddit_4.csv')

In [63]:
reddit3.shape, reddit4.shape

((9708, 5), (7208, 5))

In [64]:
df = pd.concat([reddit3,reddit4]).drop_duplicates().reset_index(drop=True)

In [68]:
# Converting Unix timestamps (seconds) to datetimes
df['created'] = pd.to_datetime(df['created'],unit='s')
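
The project brief also asks for how long each thread has been up. Once `created` is a datetime, that can be derived by subtracting it from the time of the scrape. The sketch below is a self-contained illustration: `scraped_at` is a hypothetical value (the notebook does not record scrape times), and `age_hours` is an assumed column name.

```python
import pandas as pd

# Two made-up Unix timestamps, one day apart.
df_example = pd.DataFrame({"created": [1_661_800_000, 1_661_886_400]})
df_example["created"] = pd.to_datetime(df_example["created"], unit="s")

# Hypothetical scrape time; in practice, record it when the scrape runs.
scraped_at = pd.Timestamp(1_661_900_000, unit="s")

# Age of each post at scrape time, in hours.
df_example["age_hours"] = (
    (scraped_at - df_example["created"]).dt.total_seconds() / 3600
)

print(df_example)
```

Because posts from different scrapes were collected at different times, each scrape would need its own `scraped_at` for the ages to be comparable.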

### Checking that the conversion worked, that the dates vary, and that duplicates were dropped

In [90]:
df[['title','score','created']].sort_values(by='score',ascending=False)

| | title | score | created |
|---|---|---|---|
| 7299 | Times Square right now | 467439 | 2021-01-30 18:00:38 |
| 7298 | I’ve found a few funny memories during lockdow... | 438832 | 2020-06-17 16:17:27 |
| 7301 | The Senate. Upvote this so that people see it ... | 407258 | 2017-04-01 12:57:54 |
| 7304 | A short story | 369073 | 2020-06-07 16:27:35 |
| 7300 | Joe Biden elected president of the United States | 365121 | 2020-11-07 16:28:37 |
| ... | ... | ... | ... |
| 12720 | Released a demo of my brand new game, Pokémon:... | 68 | 2022-09-01 19:48:33 |
| 14517 | [Self-timer] Should I keep this dress or retur... | 66 | 2022-09-01 19:48:12 |
| 15749 | Chibi Anastasia and Viy are looking very styli... | 65 | 2022-09-01 19:48:01 |
| 16489 | Nottingham Forest are in take to hijack Michy ... | 61 | 2022-09-01 19:49:12 |
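
Eyeballing the sorted titles is a useful spot check, but `drop_duplicates()` with no arguments only removes rows that match in every column. A post captured in two scrapes with a different score or comment count would survive. A stricter check is sketched below on made-up rows; using `sub`, `title`, and `created` as the identifying key and keeping the most recent capture are both assumptions.

```python
import pandas as pd

# Made-up rows: the first two are the same post captured in two scrapes.
scrapes = pd.DataFrame({
    "sub": ["aww", "aww", "news"],
    "title": ["Same post", "Same post", "Another post"],
    "score": [13717, 13950, 19660],
    "comments": [112, 118, 1546],
    "created": ["2022-08-29 19:47:36", "2022-08-29 19:47:36",
                "2022-08-29 19:15:21"],
})

key = ["sub", "title", "created"]

# Rows identical in every column (what a bare drop_duplicates() removes).
exact_dupes = int(scrapes.duplicated().sum())

# Rows that repeat an earlier post according to the key columns.
same_post = int(scrapes.duplicated(subset=key).sum())

# Keep the last capture of each post, assuming later rows are more recent.
deduped = scrapes.drop_duplicates(subset=key, keep="last").reset_index(drop=True)

print(exact_dupes, same_post, len(deduped))
```

If `same_post` is greater than zero on the real data, the concatenated frame still holds repeated posts even though the exact-row check passes.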


### The top titles and their scores are all distinct, so we save the df to a CSV and continue in a clean notebook to avoid disturbing any of the scraping done here

In [91]:
df.to_csv('data/reddit_complete.csv',index=False)

In [93]:
df = pd.read_csv('data/reddit_complete.csv')
df

| | sub | title | score | comments | created |
|---|---|---|---|---|---|
| 0 | LeopardsAteMyFace | Did they really think he'd pay out? | 20509 | 1031 | 2022-08-29 20:28:53 |
| 1 | news | China drought causes Yangtze to dry up, sparki... | 19660 | 1546 | 2022-08-29 19:15:21 |
| 2 | MadeMeSmile | He did it! | 86623 | 871 | 2022-08-29 18:46:11 |
| 3 | aww | Loving the water. | 13717 | 112 | 2022-08-29 19:47:36 |
| 4 | politics | Biden vows to crack down on colleges 'jacking ... | 44057 | 2259 | 2022-08-29 18:28:16 |
| ... | ... | ... | ... | ... | ... |
| 16895 | Superstonk | Did the thing, better late than never! +84 | 1197 | 6 | 2022-09-01 05:10:22 |
| 16896 | Superstonk | Lucy Komisar interview Robert Shapiro on the S... | 122 | 16 | 2022-09-01 16:41:25 |
| 16897 | Superstonk | It do be like that these days! | 133 | 4 | 2022-09-01 16:11:28 |
| 16898 | Superstonk | How many times can I keep doing this? 🧱 🧱 🧱 | 89 | 3 | 2022-09-01 18:28:09 |
