# Scraping for Project 4

Most of this project takes place in the 'Project4_KGourlay' notebook; this notebook handles the scraping of the two subreddits, 'TalesFromYourServer' and 'TalesFromTheCustomer'. After scraping both subreddits, I inspected the data, concatenated it into one DataFrame, and exported that DataFrame to a CSV.

In [1]:
import datetime
import pandas as pd
import requests
import time
from bs4 import BeautifulSoup

In [2]:
headers = {'User-Agent': 'My User Agent 1.0'}

In [3]:
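def fetch_page(url, after=''):
    # 'after' is the name of the last post already seen; '' starts from the top
    params = {'after': after}
    response = requests.get(url, headers=headers, params=params)
    response.raise_for_status()  # fail loudly on an HTTP error instead of a confusing KeyError
    return response.json()['data']['children']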

In [4]:
def parse_post(post):
    # keep only the fields needed for the rest of the project
    keep = ['subreddit', 'title', 'selftext', 'downs', 'ups', 'num_comments', 'permalink', 'name', 'author']
    return {k: v for k, v in post['data'].items() if k in keep}

In [5]:
def parse_page(page):
    after = ''
    posts = []
    for post in page:
        post = parse_post(post)
        after = post['name']  # the last post's name becomes the next 'after' token
        posts.append(post)
    return posts, after
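
The parsing helpers can be exercised offline. The sketch below repeats `parse_post` and `parse_page` from the cells above and runs them on a hand-built listing. `sample_page` and its values are made up for illustration; they only mimic the shape of the `children` list that the code above reads from the response.

```python
KEEP = ['subreddit', 'title', 'selftext', 'downs', 'ups',
        'num_comments', 'permalink', 'name', 'author']

def parse_post(post):
    # Drop every field that is not in KEEP
    return {k: v for k, v in post['data'].items() if k in KEEP}

def parse_page(page):
    after = ''
    posts = []
    for post in page:
        post = parse_post(post)
        after = post['name']
        posts.append(post)
    return posts, after

# Made-up stand-in for response.json()['data']['children']
sample_page = [
    {'kind': 't3', 'data': {'name': 't3_aaa111', 'title': 'First', 'ups': 5, 'created_utc': 0}},
    {'kind': 't3', 'data': {'name': 't3_bbb222', 'title': 'Second', 'ups': 9, 'created_utc': 0}},
]

posts, after = parse_page(sample_page)
print(after)     # name of the last post on the page
print(posts[0])  # 'created_utc' is filtered out because it is not in KEEP
```

The returned `after` value is what the next request would send to continue from where this page ended.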

In [6]:
def fetch_subreddit(subreddit, pages=4):
    url = f'https://www.reddit.com/r/{subreddit}.json'
    after = ''
    all_posts = []
    for i in range(pages):
        print(f'Fetching Page {i + 1}')
        page = fetch_page(url, after)
        posts, after = parse_page(page)
        all_posts.extend(posts)
        time.sleep(5)
    return all_posts
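
One possible refinement of the loop above, offered as a sketch rather than as what this notebook ran: stop as soon as a page comes back empty. With the code above, an empty page leaves `after` as `''`, so the next request would start again from the first page. `paginate` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for `fetch_page` so the example runs without network access.

```python
def paginate(fetch, pages=4):
    """Collect posts page by page, stopping early when a page is empty."""
    after = ''
    all_posts = []
    for _ in range(pages):
        page = fetch(after)
        if not page:  # nothing left; avoid restarting from the top
            break
        all_posts.extend(post['data'] for post in page)
        after = page[-1]['data']['name']
    return all_posts

# Hypothetical in-memory "listing" of five posts, served two at a time
LISTING = [{'data': {'name': f't3_{i}'}} for i in range(5)]

def fake_fetch(after, size=2):
    names = [p['data']['name'] for p in LISTING]
    start = names.index(after) + 1 if after else 0
    return LISTING[start:start + size]

collected = paginate(fake_fetch, pages=10)
print([p['name'] for p in collected])
```

Although ten pages are requested, the loop ends after the listing runs out, and no post is collected twice.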

In [7]:
posts = fetch_subreddit('TalesFromYourServer', pages=40)

Fetching Page 1
Fetching Page 2
Fetching Page 3
Fetching Page 4
Fetching Page 5
Fetching Page 6
Fetching Page 7
Fetching Page 8
Fetching Page 9
Fetching Page 10
Fetching Page 11
Fetching Page 12
Fetching Page 13
Fetching Page 14
Fetching Page 15
Fetching Page 16
Fetching Page 17
Fetching Page 18
Fetching Page 19
Fetching Page 20
Fetching Page 21
Fetching Page 22
Fetching Page 23
Fetching Page 24
Fetching Page 25
Fetching Page 26
Fetching Page 27
Fetching Page 28
Fetching Page 29
Fetching Page 30
Fetching Page 31
Fetching Page 32
Fetching Page 33
Fetching Page 34
Fetching Page 35
Fetching Page 36
Fetching Page 37
Fetching Page 38
Fetching Page 39
Fetching Page 40


In [8]:
server = pd.DataFrame(posts)

In [9]:
server.head()

,author,downs,name,num_comments,permalink,selftext,subreddit,title,ups
0,WalkinSteveHawkin,0,t3_at8kic,250,/r/TalesFromYourServer/comments/at8kic/dedicat...,"Hi all,\n\n&amp;#x200B;\n\nI realized our new ...",TalesFromYourServer,Dedicated thread for new server advice,113
1,BraskytheSOB,0,t3_bmkpp8,10,/r/TalesFromYourServer/comments/bmkpp8/im_a_so...,Short and sweet one today. Prime steak house....,TalesFromYourServer,"""I'm a sommmelier"" woman orders house malbec. ...",110
2,heysharkdontdothat,0,t3_bmmo7x,3,/r/TalesFromYourServer/comments/bmmo7x/this_is...,Hey everyone. For anyone who had a rough night...,TalesFromYourServer,This is the story of the worst coworker to eve...,63
3,RookiePuck,0,t3_bmj1ss,17,/r/TalesFromYourServer/comments/bmj1ss/guy_tri...,Sorry this got long. So I work as a bar manage...,TalesFromYourServer,Guy tried to take back his tip,104
4,lindbulm,0,t3_bmn9sc,5,/r/TalesFromYourServer/comments/bmn9sc/yes_you...,I just cannot fathom why some people don't inc...,TalesFromYourServer,"Yes, your children need to be counted in the r...",18
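
The `&amp;#x200B;` sequence visible in the first row's `selftext` is a doubly escaped HTML entity for a zero-width space. If it needs to be removed before modelling, one option is sketched below. `clean_selftext` is a hypothetical helper, not part of this notebook, and it assumes two rounds of unescaping are enough for this data.

```python
import html

def clean_selftext(text):
    # '&amp;#x200B;' -> '&#x200B;' -> '\u200b' takes two passes of unescape
    unescaped = html.unescape(html.unescape(text))
    # Remove the zero-width spaces that the entities decode to
    return unescaped.replace('\u200b', '')

sample = "Hi all,\n\n&amp;#x200B;\n\nI realized our new server advice thread"
cleaned = clean_selftext(sample)
print(repr(cleaned))
```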


In [10]:
server['selftext'][0]

"Hi all,\n\n&amp;#x200B;\n\nI realized our new server advice thread, which has tons of good advice and comments, has been archived, so new and veteran servers can't post questions/answers.  I got permission from the mods to post a new thread, so whether you're a new server with a specific/general question or a server who has some good advice, feel free to jump in!\n\n&amp;#x200B;\n\nThe [old thread can be found here](https://www.reddit.com/r/TalesFromYourServer/comments/8b1cnk/dedicated_thread_for_new_server_advice/) if anyone is interested."

In [11]:
server.describe()

,downs,num_comments,ups
count,981.0,981.0,981.0
mean,0.0,25.671764,266.567788
std,0.0,56.63469,736.970274
min,0.0,0.0,0.0
25%,0.0,4.0,11.0
50%,0.0,9.0,29.0
75%,0.0,20.0,89.0
max,0.0,611.0,5779.0


In [12]:
server.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 981 entries, 0 to 980
Data columns (total 9 columns):
author          981 non-null object
downs           981 non-null int64
name            981 non-null object
num_comments    981 non-null int64
permalink       981 non-null object
selftext        981 non-null object
subreddit       981 non-null object
title           981 non-null object
ups             981 non-null int64
dtypes: int64(3), object(6)
memory usage: 69.1+ KB


In [13]:
server.isna().sum()

author          0
downs           0
name            0
num_comments    0
permalink       0
selftext        0
subreddit       0
title           0
ups             0
dtype: int64

In [14]:
server['subreddit'] = 0  # encode TalesFromYourServer as class 0

In [16]:
posts2 = fetch_subreddit('TalesFromTheCustomer', pages=40)

Fetching Page 1
Fetching Page 2
Fetching Page 3
Fetching Page 4
Fetching Page 5
Fetching Page 6
Fetching Page 7
Fetching Page 8
Fetching Page 9
Fetching Page 10
Fetching Page 11
Fetching Page 12
Fetching Page 13
Fetching Page 14
Fetching Page 15
Fetching Page 16
Fetching Page 17
Fetching Page 18
Fetching Page 19
Fetching Page 20
Fetching Page 21
Fetching Page 22
Fetching Page 23
Fetching Page 24
Fetching Page 25
Fetching Page 26
Fetching Page 27
Fetching Page 28
Fetching Page 29
Fetching Page 30
Fetching Page 31
Fetching Page 32
Fetching Page 33
Fetching Page 34
Fetching Page 35
Fetching Page 36
Fetching Page 37
Fetching Page 38
Fetching Page 39
Fetching Page 40


In [17]:
customer = pd.DataFrame(posts2)

In [18]:
customer.describe()

,downs,num_comments,ups
count,971.0,971.0,971.0
mean,0.0,32.527291,388.830072
std,0.0,57.291838,739.745269
min,0.0,0.0,0.0
25%,0.0,5.0,22.0
50%,0.0,11.0,59.0
75%,0.0,34.0,426.5
max,0.0,701.0,9476.0


In [19]:
customer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 971 entries, 0 to 970
Data columns (total 9 columns):
author          971 non-null object
downs           971 non-null int64
name            971 non-null object
num_comments    971 non-null int64
permalink       971 non-null object
selftext        971 non-null object
subreddit       971 non-null object
title           971 non-null object
ups             971 non-null int64
dtypes: int64(3), object(6)
memory usage: 68.4+ KB


In [20]:
customer.isna().sum()

author          0
downs           0
name            0
num_comments    0
permalink       0
selftext        0
subreddit       0
title           0
ups             0
dtype: int64

In [21]:
customer['subreddit'].unique()

array(['TalesFromTheCustomer'], dtype=object)

In [22]:
customer['subreddit'] = 1  # encode TalesFromTheCustomer as class 1

In [23]:
customer.shape

(971, 9)

In [24]:
df = pd.concat([server, customer], ignore_index=True)  # renumber rows 0..n-1 so index labels are unique

In [25]:
df.shape

(1952, 9)
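
`pd.concat` stacks the rows of both frames. By default it keeps each frame's own row labels, so labels can repeat in the result unless `ignore_index=True` is passed. A small sketch with made-up frames, where `a` and `b` are illustrative stand-ins for `server` and `customer`:

```python
import pandas as pd

a = pd.DataFrame({'title': ['s1', 's2'], 'subreddit': 0})
b = pd.DataFrame({'title': ['c1', 'c2', 'c3'], 'subreddit': 1})

stacked = pd.concat([a, b])                        # keeps labels 0,1 and 0,1,2
renumbered = pd.concat([a, b], ignore_index=True)  # fresh labels 0..4

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(renumbered['subreddit'].value_counts().to_dict())
```

Either way the row count is the sum of the two inputs, which matches the 981 + 971 = 1952 rows reported above.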

In [26]:
df['subreddit'].value_counts()

0    981
1    971
Name: subreddit, dtype: int64

In [27]:
df.groupby('subreddit').mean(numeric_only=True)  # average only the numeric columns

subreddit,downs,num_comments,ups
0,0.0,25.671764,266.567788
1,0.0,32.527291,388.830072


In [30]:
df.to_csv('reddit_scraping.csv')
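
`to_csv` writes the row index as an unnamed first column by default, so reading the file back with a plain `pd.read_csv` yields an extra `Unnamed: 0` column. The sketch below shows the round trip on a made-up frame, using an in-memory buffer instead of `reddit_scraping.csv`. Passing `index_col=0` when reading, or `index=False` when writing, avoids the extra column.

```python
import io
import pandas as pd

demo = pd.DataFrame({'title': ['a', 'b'], 'subreddit': [0, 1]})

buffer = io.StringIO()
demo.to_csv(buffer)  # index is written as the first, unnamed column

buffer.seek(0)
plain = pd.read_csv(buffer)                 # index comes back as 'Unnamed: 0'
buffer.seek(0)
indexed = pd.read_csv(buffer, index_col=0)  # first column restored as the index

print(plain.columns.tolist())
print(indexed.columns.tolist())
```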