# Project 3: Data Scraping

### Scenario

You're fresh out of your Data Science bootcamp and looking to break through in the world of freelance data journalism. Nate Silver and co. at FiveThirtyEight have agreed to hear your pitch for a story in two weeks!

Your piece is going to be on how to create a Reddit post that will get the most engagement from Reddit users. Because this is FiveThirtyEight, you're going to have to get data and analyze it in order to make a compelling narrative.

### Problem Statement: What characteristics of a post on Reddit are most predictive of the overall interaction on a thread (as measured by number of comments)?

For each thread, collect:
1. The title of the thread
2. The subreddit that the thread corresponds to
3. The length of time it has been up on Reddit
4. The number of comments on the thread
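
Item 3, the length of time a thread has been up, is not returned directly; it has to be derived from the creation timestamp and the moment of scraping. A minimal sketch, assuming the creation time is a Unix epoch in seconds (UTC); the helper name `age_in_hours` and the scrape time used below are illustrative and not part of the original notebook:

```python
from datetime import datetime, timezone

def age_in_hours(created_utc, scraped_at=None):
    """Hours between a post's creation time and the moment it was scraped."""
    if scraped_at is None:
        # Default to "now" when no explicit scrape time is supplied
        scraped_at = datetime.now(timezone.utc)
    created = datetime.fromtimestamp(float(created_utc), tz=timezone.utc)
    return (scraped_at - created).total_seconds() / 3600

# Hypothetical scrape time, chosen only to make the example reproducible
scrape_time = datetime(2022, 8, 30, 0, 0, 0, tzinfo=timezone.utc)
example_age = age_in_hours(1661805000.0, scraped_at=scrape_time)
print(example_age)  # 3.5
```

Recording the scrape time alongside each batch keeps this feature meaningful when scrapes are taken hours or days apart.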

Scrape at least 10,000 threads.

Once you've got the data, you will build a classification model that, using the text and any other relevant features, predicts whether or not a given Reddit post will have above or below the _median_ number of comments.
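
The classification target described above can be sketched as follows. This is a toy illustration, assuming a `comments` column like the one built later in this notebook; the column name `above_median` and the choice of a strict `>` comparison (posts exactly at the median count as "not above") are assumptions:

```python
import pandas as pd

# Toy frame with comment counts; the real data comes from the scrape
toy = pd.DataFrame({
    "sub": ["LeopardsAteMyFace", "news", "MadeMeSmile", "aww", "politics"],
    "comments": [1031, 1546, 871, 112, 2259],
})

# Median number of comments across all threads
median_comments = toy["comments"].median()

# 1 if the thread has more comments than the median, else 0
toy["above_median"] = (toy["comments"] > median_comments).astype(int)
print(toy)
```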

## Import Libraries

In [1]:
import pandas as pd
import numpy as np

import requests
import json
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from time import sleep
import praw

### Scraping Data Round 1

In [2]:
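import os

# Read the API credentials from environment variables rather than hardcoding them
# (the variable names REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are illustrative)
reddit = praw.Reddit(client_id=os.environ['REDDIT_CLIENT_ID'],
                     client_secret=os.environ['REDDIT_CLIENT_SECRET'],
                     user_agent='Michael_Capparelli')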

In [None]:
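posts = []
r_all = reddit.subreddit('all')
# Walk the "hot" listing of r/all; the API may return fewer posts than the limit
for post in r_all.hot(limit=10_000):
    # Store the subreddit name as a plain string rather than a praw object
    posts.append([str(post.subreddit),post.title,
                  post.score,post.num_comments,
                  post.created])

posts = pd.DataFrame(posts,columns=['sub','title','score',
                                    'comments','created'])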
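
The row-building step in the loop above can be factored into a small helper, which also makes it testable without network access. This is a sketch, not the notebook's original code: `post_to_row` is a hypothetical name, the stand-in object only mimics the attributes used here, and reading `created_utc` instead of `created` is an assumption that makes the later UTC conversion unambiguous.

```python
from types import SimpleNamespace

COLUMNS = ["sub", "title", "score", "comments", "created"]

def post_to_row(post):
    """Flatten a submission-like object into one row matching COLUMNS."""
    return [
        str(post.subreddit),       # subreddit name as a plain string
        post.title,
        int(post.score),
        int(post.num_comments),
        float(post.created_utc),   # Unix epoch seconds, UTC
    ]

# Stand-in for a praw submission, used only to exercise the helper offline
fake_post = SimpleNamespace(subreddit="aww", title="Loving the water.",
                            score=13717, num_comments=112,
                            created_utc=1661802000.0)
row = post_to_row(fake_post)
print(dict(zip(COLUMNS, row)))
```

In the scraping loop this would be used as `posts.append(post_to_row(post))`.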

In [None]:
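# Keep the scraped table under its own name so the praw client `reddit` is not overwritten
reddit_df = posts

In [None]:
reddit_df

In [None]:
reddit_df.to_csv('data/reddit_all.csv',index=False)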

Reference (PRAW `Subreddit` model documentation): https://praw.readthedocs.io/en/stable/code_overview/models/subreddit.html

### Scraping Data Round 2

In [None]:
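posts = []
r_all = reddit.subreddit('all')
# Walk the all-time "top" listing of r/all; the API may return fewer posts than the limit
for post in r_all.top(time_filter='all',limit=10_500):
    # Store the subreddit name as a plain string rather than a praw object
    posts.append([str(post.subreddit),post.title,
                  post.score,post.num_comments,
                  post.created])

posts = pd.DataFrame(posts,columns=['sub','title','score',
                                    'comments','created'])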

In [None]:
posts

In [None]:
posts.to_csv('data/r_all.csv',index=False)

### Merging Two Scrapes and Eliminating Duplicates

In [3]:
reddit1 = pd.read_csv('data/reddit_all.csv')
reddit2 = pd.read_csv('data/r_all.csv')

In [4]:
reddit1.shape,reddit2.shape

((7332, 5), (2410, 5))

In [8]:
reddit3 = pd.concat([reddit1,reddit2]).drop_duplicates().reset_index(drop=True)

In [9]:
reddit3.shape

(9708, 5)
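
`drop_duplicates()` with no arguments only removes rows that match in every column. A thread captured in two scrapes usually has a different score or comment count the second time, so it can survive as two rows. A minimal sketch of deduplicating on identifying columns instead, assuming `sub` plus `title` is an acceptable key (a unique post ID would be more reliable if it had been scraped); the toy values are illustrative:

```python
import pandas as pd

first = pd.DataFrame({
    "sub": ["aww", "news"],
    "title": ["Loving the water.", "Example headline"],
    "score": [13717, 19660],
    "comments": [112, 1546],
})
# Same thread seen again later, with updated counts
second = pd.DataFrame({
    "sub": ["aww"],
    "title": ["Loving the water."],
    "score": [13800],
    "comments": [120],
})

merged = (
    pd.concat([first, second])
      .drop_duplicates(subset=["sub", "title"], keep="last")  # keep the newest snapshot
      .reset_index(drop=True)
)
print(merged)
```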

In [10]:
reddit3.head()

| | sub | title | score | comments | created |
|---|---|---|---|---|---|
| 0 | LeopardsAteMyFace | Did they really think he'd pay out? | 20509 | 1031 | 1661805000.0 |
| 1 | news | China drought causes Yangtze to dry up, sparki... | 19660 | 1546 | 1661801000.0 |
| 2 | MadeMeSmile | He did it! | 86623 | 871 | 1661799000.0 |
| 3 | aww | Loving the water. | 13717 | 112 | 1661802000.0 |
| 4 | politics | Biden vows to crack down on colleges 'jacking ... | 44057 | 2259 | 1661798000.0 |

In [11]:
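from datetime import datetime

# Convert each row's own epoch value `t`; passing a fixed epoch such as 1284101485
# would give every row the same date, 2010-09-10 06:51:25
reddit3.created = reddit3['created'].apply(lambda t: datetime.utcfromtimestamp(
    int(t)).strftime('%Y-%m-%d %H:%M:%S'))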
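
The epoch-to-string conversion can also be done in one vectorized call with `pd.to_datetime(..., unit='s')`, which avoids the row-by-row `apply`. A sketch on a toy column, assuming the values are whole-second Unix epochs in UTC; a quick check that the result contains more than one distinct value helps confirm the conversion actually used each row's timestamp:

```python
import pandas as pd

# Toy column of Unix epochs (seconds), stored as floats as they are after a CSV round trip
created = pd.Series([1284101485.0, 1661805000.0])

# Cast to whole seconds, interpret as seconds since the epoch, then format as text
formatted = (
    pd.to_datetime(created.astype("int64"), unit="s")
      .dt.strftime("%Y-%m-%d %H:%M:%S")
)
print(formatted.tolist())
# ['2010-09-10 06:51:25', '2022-08-29 20:30:00']

# Sanity check: distinct inputs should give distinct outputs
assert formatted.nunique() == created.nunique()
```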

In [13]:
reddit3.to_csv('data/reddit_3.csv',index=False)

In [14]:
reddit3 = pd.read_csv('data/reddit_3.csv')

In [15]:
reddit3.head()

| | sub | title | score | comments | created |
|---|---|---|---|---|---|
| 0 | LeopardsAteMyFace | Did they really think he'd pay out? | 20509 | 1031 | 2010-09-10 06:51:25 |
| 1 | news | China drought causes Yangtze to dry up, sparki... | 19660 | 1546 | 2010-09-10 06:51:25 |
| 2 | MadeMeSmile | He did it! | 86623 | 871 | 2010-09-10 06:51:25 |
| 3 | aww | Loving the water. | 13717 | 112 | 2010-09-10 06:51:25 |
| 4 | politics | Biden vows to crack down on colleges 'jacking ... | 44057 | 2259 | 2010-09-10 06:51:25 |

### Scraping Data Round 3

In [23]:
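submissions = []
r4 = reddit.subreddit('all')
# Third pass over the "hot" listing of r/all to pick up threads posted since Round 1
for post in r4.hot(limit=10_500):
    # Store the subreddit name as a plain string rather than a praw object
    submissions.append([str(post.subreddit),post.title,
                        post.score,post.num_comments,
                        post.created])

submissions = pd.DataFrame(submissions,columns=['sub','title','score',
                                                'comments','created'])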

In [28]:
submissions.head()

| | sub | title | score | comments | created |
|---|---|---|---|---|---|
| 0 | pics | On August 7th I finished hiking the Pacific Cr... | 21288 | 1183 | 2010-09-10 06:51:25 |
| 1 | AbruptChaos | Being an ASMR TikToker | 20408 | 670 | 2010-09-10 06:51:25 |
| 2 | blackmagicfuckery | Astronaut dissolves effervescent tablet in wat... | 14074 | 350 | 2010-09-10 06:51:25 |
| 3 | interestingasfuck | My fiancée got a spider bite on her ankle, and... | 8474 | 644 | 2010-09-10 06:51:25 |
| 4 | politics | Some of the documents retrieved from Mar-a-Lag... | 32289 | 2083 | 2010-09-10 06:51:25 |

In [24]:
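# Convert each row's own epoch value `t`; passing a fixed epoch such as 1284101485
# would give every row the same date, 2010-09-10 06:51:25
submissions['created'] = submissions.created.apply(lambda t: datetime.utcfromtimestamp(
    int(t)).strftime('%Y-%m-%d %H:%M:%S'))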

### Concat For Final DataFrame

In [27]:
df = pd.concat([reddit3,submissions]).drop_duplicates().reset_index(drop=True)

In [29]:
df.to_csv('data/reddit_complete.csv',index=False)

In [30]:
df = pd.read_csv('data/reddit_complete.csv')

In [31]:
df

| | sub | title | score | comments | created |
|---|---|---|---|---|---|
| 0 | LeopardsAteMyFace | Did they really think he'd pay out? | 20509 | 1031 | 2010-09-10 06:51:25 |
| 1 | news | China drought causes Yangtze to dry up, sparki... | 19660 | 1546 | 2010-09-10 06:51:25 |
| 2 | MadeMeSmile | He did it! | 86623 | 871 | 2010-09-10 06:51:25 |
| 3 | aww | Loving the water. | 13717 | 112 | 2010-09-10 06:51:25 |
| 4 | politics | Biden vows to crack down on colleges 'jacking ... | 44057 | 2259 | 2010-09-10 06:51:25 |
| ... | ... | ... | ... | ... | ... |
| 16790 | Superstonk | Guess it's time to finally feed the bot. +1171... | 1800 | 13 | 2010-09-10 06:51:25 |
| 16791 | memes | guess they're gone forever now | 235 | 4 | 2010-09-10 06:51:25 |
| 16792 | Superstonk | JAN ‘21 apes attempting to purchase GameStop O... | 3396 | 86 | 2010-09-10 06:51:25 |
| 16793 | memes | I think there will be a problem | 100 | 11 | 2010-09-10 06:51:25 |
