# Data Acquisition: Reddit Scraping

In this exercise, we will search for a query (e.g., "data science") on the old Reddit site (https://www.old.reddit.com/). We will then take the URL of the results page (e.g., https://old.reddit.com/search?q=data+science) and scrape the returned posts. More specifically, we will extract the title, author, subreddit, category, timestamp, number of votes, and number of comments of each post.

<img src="../images/reddit_search.png" />




You are free to use your own query string. 
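
If you use your own query, the search URL can be built instead of typed by hand. The following is a minimal sketch, assuming the search page only needs the `q` parameter seen in the example URL; `build_search_url` is a hypothetical helper name, not part of the exercise.

```python
from urllib.parse import urlencode

BASE_URL = "https://old.reddit.com/search"

def build_search_url(query):
    """Return the search URL for a free-text query (spaces are encoded as '+')."""
    return BASE_URL + "?" + urlencode({"q": query})

print(build_search_url("data science"))
# prints: https://old.reddit.com/search?q=data+science
```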

**Activity 1:** Fetch the page and create a soup object using the Beautiful Soup library.

In [1]:
# Your code for activity 1 goes here..
#---------------------------------------

# import the library used to query a website
import requests
# import the Beautiful Soup library to access
# functions that parse the data returned from the website
from bs4 import BeautifulSoup

# import pandas to convert lists to a data frame
import pandas as pd
# import numpy
import numpy as np

# headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}

headers = {'User-Agent': 'MyAPP/1.0'}


# specify the url
url = "https://old.reddit.com/search?q=data+science"

# Open website URL and return the html to the variable 'response'
response = requests.get(url, headers=headers)

# Parse the html in the 'response' variable, and store it in Beautiful Soup format
soup = BeautifulSoup(response.content, "html.parser")
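
A failed request (for example, a blocked or rate-limited one) would still be parsed into a soup object, just one with no posts in it. The sketch below wraps the fetch in a hypothetical helper, `fetch_soup`, that raises on HTTP errors before parsing. The 10-second timeout is an arbitrary assumption, not a value required by the site.

```python
import requests
from bs4 import BeautifulSoup

def fetch_soup(url, user_agent="MyAPP/1.0", timeout=10):
    """Fetch a page and return it parsed, raising on HTTP error status codes."""
    response = requests.get(url, headers={"User-Agent": user_agent}, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")

# Usage (performs a real network request):
# soup = fetch_soup("https://old.reddit.com/search?q=data+science")
```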



In [2]:
posts = soup.find_all(class_='search-result')
posts[5].find(class_="search-result-header")

<header class="search-result-header"><a class="search-title may-blank" href="https://old.reddit.com/r/cscareerquestions/comments/ruo9su/data_science_job_market_is_shrinking/">Data science job market is shrinking</a></header>

In [3]:
posts[3]



<div class="search-result search-result-link has-thumbnail has-linkflair linkflair-career" data-fullname="t3_vtd6ln"><a class="may-blank thumbnail self" href="/r/datascience/comments/vtd6ln/the_data_science_trap/"></a><div><header class="search-result-header"><a class="search-title may-blank" href="https://old.reddit.com/r/datascience/comments/vtd6ln/the_data_science_trap/">The Data Science Trap</a><span class="linkflairlabel" title="Career">Career</span></header><div class="search-result-meta"><span class="search-result-icon search-result-icon-score"></span><span class="search-score">503 points</span> <a class="search-comments may-blank" href="https://old.reddit.com/r/datascience/comments/vtd6ln/the_data_science_trap/">226 comments</a> <span class="search-time">submitted <time datetime="2022-07-07T07:54:32+00:00" title="Thu Jul 7 07:54:32 2022 UTC">1 month ago</time></span> <span class="search-author">by <a class="author may-blank id-t2_porqj2dw" href="https://old.reddit.com/user/albe

#### Let's look at one of the entries

---

<div class="search-result search-result-link has-thumbnail has-linkflair linkflair-career" data-fullname="t3_vtd6ln"><a class="may-blank thumbnail self" href="/r/datascience/comments/vtd6ln/the_data_science_trap/"></a><div><header class="search-result-header"><a class="search-title may-blank" href="https://old.reddit.com/r/datascience/comments/vtd6ln/the_data_science_trap/">The Data Science Trap</a><span class="linkflairlabel" title="Career">Career</span></header><div class="search-result-meta"><span class="search-result-icon search-result-icon-score"></span><span class="search-score">504 points</span> <a class="search-comments may-blank" href="https://old.reddit.com/r/datascience/comments/vtd6ln/the_data_science_trap/">226 comments</a> <span class="search-time">submitted <time datetime="2022-07-07T07:54:32+00:00" title="Thu Jul 7 07:54:32 2022 UTC">1 month ago</time></span> <span class="search-author">by <a class="author may-blank id-t2_porqj2dw" href="https://old.reddit.com/user/alberto-matamoro">alberto-matamoro</a><span class="userattrs"></span></span> <span>to <a class="search-subreddit-link may-blank" href="https://old.reddit.com/r/datascience/">r/datascience</a></span></div><div class="search-expando collapsed"><div class="search-result-body"><div class="md"><p>It is no longer open to question that data scientists in the industry are merely glorified data analysts. Businesses are pouring money into STEM graduates to create colorful charts and BS reporting. Aside from hypothesis testing and linear or logistic regressions, nothing they do comes close to statistics or modeling. There have been several threads about how research scientists are the new data scientists - and these threads are full of scorn for the state of the data scientist job market. </p>
<p>Now, I'm finding that some places require doctorates in statistics, computer science, physics, and math - all for the same data analytics role. Don't get me wrong: data analytics is an important part of running a business, but that work isn't fully utilizing the capabilities of the fields listed above. This is what I call the data science trap.</p>
<p>Unfortunately, a quick LinkedIn search and a quick search of alumni from top departments at top schools reveal several who end up working as data scientists at firms notorious for hiring data scientists to be SQL monkeys. </p>
<p>I've already learned to recognize phony job descriptions for data analysts masquerading as data scientist positions. But I'm curious how one avoids the data science trap, especially for those with a graduate degree.</p>
</div>
</div></div><div class="search-expando-button collapsed"><span class="search-expando-button-label-collapsed">more</span><span class="search-expando-button-label-expanded">less</span></div></div></div>

---

Define the attributes used to select each search result in the soup; the cells below reuse them in every loop.


In [4]:
attrs = {'class': 'search-result'}
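
A class filter in Beautiful Soup matches any element whose class list contains the given name, so `attrs` matches every kind of search result. The entry shown earlier also carries the class `search-result-link`. The sketch below filters on that class instead, assuming that non-post results lack it; the `search-result-subreddit` class in the stand-in HTML is hypothetical.

```python
from bs4 import BeautifulSoup

# Tiny stand-in page: one non-post result (hypothetical class name) and one post
html = """
<div class="search-result search-result-subreddit">
  <header class="search-result-header"><a class="search-title" href="/r/datascience/">r/datascience</a></header>
</div>
<div class="search-result search-result-link">
  <header class="search-result-header"><a class="search-title may-blank" href="https://old.reddit.com/r/datascience/comments/vtd6ln/the_data_science_trap/">The Data Science Trap</a></header>
</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Matches both divs, because each has 'search-result' among its classes
all_results = demo_soup.find_all("div", attrs={"class": "search-result"})
# Matches only the div that also has the 'search-result-link' class
post_results = demo_soup.find_all("div", class_="search-result-link")

print(len(all_results), len(post_results))
# prints: 2 1
```

If the assumption holds on the live page, this kind of filter could stand in for skipping a fixed number of results by position.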

**Activity 2:** Extract the titles and URLs of the retrieved posts from the soup and print them.

In [5]:
n_post = 0

titles = []
urls = []

for post in soup.find_all('div', attrs=attrs):
    
    # Skip the first three search results; extraction starts at posts[3]
    if n_post < 3:
        n_post += 1
        continue
    
    title = post.find('header', class_="search-result-header")
    if title is None:
        continue
        
    a = title.find('a')
    #print (a.text) 
    titles.append(a.text)
    
    url = a.get('href')
    #print(url)
    urls.append(url)
    

print(titles)
print(urls)

['The Data Science Trap', 'Not getting any interviews, but getting a lot of rejections - applying to ENTRY LEVEL Data Science / Data Analysis jobs.', 'Data science job market is shrinking', "People who decided to take Data Science as their profession. What advice would you like to give to someone who's just starting in data science.", 'Data Science Salary Progression', "What is the 'Bible' of Data Science?", 'Data Science or Web Development? I want to hear your perspectives', 'Nobody talks about all of the waiting in Data Science', 'Data Science job postings read like Software Engineering jobs with the added requirements of DS/ML tools...yet still pay less than Software Engineer job postings', 'The Data Science Hierarchy of Needs', 'Are remote data science/analyst jobs a thing?', 'How "good" of a programmer do you have to be to get an entry-level data science job?', 'Has anyone left or considered leaving data science because of the need for "full stack data scientists"?', 'Thank you to
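
The two-step lookup (find the header, then the link inside it) can also be written as a single CSS selector with `select_one`. This is a sketch on a stand-in snippet trimmed from the sample entry, not a replacement for the loop above.

```python
from bs4 import BeautifulSoup

html = """
<div class="search-result search-result-link">
  <header class="search-result-header">
    <a class="search-title may-blank" href="https://old.reddit.com/r/datascience/comments/vtd6ln/the_data_science_trap/">The Data Science Trap</a>
    <span class="linkflairlabel" title="Career">Career</span>
  </header>
</div>
"""
demo_post = BeautifulSoup(html, "html.parser").find("div", class_="search-result")

# One selector: the title link inside the result header
link = demo_post.select_one("header.search-result-header a.search-title")
if link is not None:
    title, url = link.get_text(strip=True), link.get("href")
    print(title)
    print(url)
```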

**Activity 3:** Extract the author ids and profile links from the retrieved posts and print them.

In [6]:
n_post = 0

authors = []
profiles = []

for post in soup.find_all('div', attrs=attrs):
    
    if n_post < 3:
        n_post +=1 
        continue
    
    author = post.find('a', class_="author")
    #print(author)
    authors.append(author.text)
    
    profile_link = author['href']
    #print(profile_link)
    profiles.append(profile_link)
    

print(authors, profiles) 

['alberto-matamoro', 'Dean-Pearce', 'Normal-Illustrator95', 'mega_dong_04', 'da_chosen1', 'manurbs', 'Jos-I', 'mcjon77', 'CDRSkywalker1991', 'IpsoCranius', 'riderx65', 'DrSkoolie', 'no_gpu', 'tits_mcgee_92', 'Mindful_Scribe', 'BoiElroy', 'Duncan_Sarasti', 'Pleasant_Type_4547', 'LatterConcentrate6', 'AntiqueFigure6', 'AvikalpGupta', 'AdFew4357'] ['https://old.reddit.com/user/alberto-matamoro', 'https://old.reddit.com/user/Dean-Pearce', 'https://old.reddit.com/user/Normal-Illustrator95', 'https://old.reddit.com/user/mega_dong_04', 'https://old.reddit.com/user/da_chosen1', 'https://old.reddit.com/user/manurbs', 'https://old.reddit.com/user/Jos-I', 'https://old.reddit.com/user/mcjon77', 'https://old.reddit.com/user/CDRSkywalker1991', 'https://old.reddit.com/user/IpsoCranius', 'https://old.reddit.com/user/riderx65', 'https://old.reddit.com/user/DrSkoolie', 'https://old.reddit.com/user/no_gpu', 'https://old.reddit.com/user/tits_mcgee_92', 'https://old.reddit.com/user/Mindful_Scribe', 'https:
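
`post.find(...)` returns `None` when nothing matches, and `author.text` would then raise an `AttributeError`. The sketch below guards against that and still appends one entry per post, so the lists stay aligned with the other fields. The second stand-in result, with no author link, is a hypothetical case for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="search-result"><span class="search-author">by <a class="author may-blank" href="https://old.reddit.com/user/alberto-matamoro">alberto-matamoro</a></span></div>
<div class="search-result"><span class="search-author">by [deleted]</span></div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

authors_demo, profiles_demo = [], []
for demo_post in demo_soup.find_all("div", attrs={"class": "search-result"}):
    author = demo_post.find("a", class_="author")
    # Append a placeholder when the link is missing, rather than skipping the post
    authors_demo.append(author.text if author is not None else None)
    profiles_demo.append(author.get("href") if author is not None else None)

print(authors_demo)
# prints: ['alberto-matamoro', None]
```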

**Activity 4:** Extract the submission time of the retrieved posts and print them.

In [7]:
n_post = 0

tstamps = []


for post in soup.find_all('div', attrs=attrs):
    
    if n_post < 3:
        n_post +=1 
        continue
    
    tstamp = post.find('span', class_="search-time")
    tstamp = tstamp.find('time').get('datetime')
    tstamps.append(pd.to_datetime(tstamp))
                   
print(tstamps)

[Timestamp('2022-07-07 07:54:32+0000', tz='UTC'), Timestamp('2022-08-10 09:06:26+0000', tz='UTC'), Timestamp('2022-01-03 00:42:56+0000', tz='UTC'), Timestamp('2022-06-13 08:56:59+0000', tz='UTC'), Timestamp('2022-05-01 08:38:01+0000', tz='UTC'), Timestamp('2022-06-07 11:24:22+0000', tz='UTC'), Timestamp('2022-06-14 10:44:36+0000', tz='UTC'), Timestamp('2022-08-10 21:57:31+0000', tz='UTC'), Timestamp('2022-08-19 17:41:41+0000', tz='UTC'), Timestamp('2022-08-07 00:27:25+0000', tz='UTC'), Timestamp('2022-08-24 12:03:23+0000', tz='UTC'), Timestamp('2022-08-07 14:14:28+0000', tz='UTC'), Timestamp('2022-08-08 12:14:30+0000', tz='UTC'), Timestamp('2022-07-18 15:30:52+0000', tz='UTC'), Timestamp('2022-08-09 15:38:41+0000', tz='UTC'), Timestamp('2022-07-16 16:21:58+0000', tz='UTC'), Timestamp('2022-07-12 08:54:04+0000', tz='UTC'), Timestamp('2022-07-29 16:01:39+0000', tz='UTC'), Timestamp('2022-08-06 06:03:35+0000', tz='UTC'), Timestamp('2022-08-24 12:24:51+0000', tz='UTC'), Timestamp('2022-08-
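
The `datetime` attribute holds an ISO 8601 string with a UTC offset, which is why `pd.to_datetime` returns timezone-aware timestamps. As a rough standard-library equivalent, using the value from the sample entry:

```python
from datetime import datetime, timezone

raw = "2022-07-07T07:54:32+00:00"  # datetime attribute of the sample entry
parsed = datetime.fromisoformat(raw)

# The '+00:00' offset makes the parsed value timezone-aware
print(parsed.tzinfo is not None)
# prints: True
print(parsed == datetime(2022, 7, 7, 7, 54, 32, tzinfo=timezone.utc))
# prints: True
```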

**Activity 5:** Extract the subreddits of the retrieved posts and print them.


Note: We are looking for the subreddit name (e.g., `r/cscareerquestions`), which appears in markup such as:

```HTML
<span>to <a class="search-subreddit-link may-blank" href="https://old.reddit.com/r/cscareerquestions/">r/cscareerquestions</a></span>
```

In [8]:
n_post = 0

subs = []

for post in soup.find_all('div', attrs=attrs):
    
    if n_post < 3:
        n_post +=1 
        continue
    
    sublink = post.find('a', class_="search-subreddit-link")
    #print(sublink.text)
    subs.append(sublink.text)

    

print(subs) 

['r/datascience', 'r/resumes', 'r/cscareerquestions', 'r/ApplyingToCollege', 'r/datascience', 'r/datascience', 'r/cscareerquestions', 'r/datascience', 'r/datascience', 'r/datascience', 'r/datascience', 'r/datascience', 'r/datascience', 'r/datascience', 'r/datascience', 'r/datascience', 'r/datascience', 'r/visualization', 'r/datascience', 'r/datascience', 'r/datascience', 'r/datascience']


**Activity 6:** Extract the points of the retrieved posts and print them.

Note: We are looking for the score text (e.g., `16 points`), which appears in markup such as:

```HTML
<span class="search-score">16 points</span> 
```


In [9]:
n_post = 0

scores = []

for post in soup.find_all('div', attrs=attrs):
    
    if n_post < 3:
        n_post +=1 
        continue
    
    score = post.find('span', class_="search-score")
    print(score.text)
    scores.append(score.text)

    

print(scores) 

503 points
209 points
17 points
50 points
640 points
722 points
45 points
664 points
325 points
915 points
76 points
223 points
295 points
1,749 points
110 points
296 points
1,024 points
331 points
58 points
43 points
88 points
156 points
['503 points', '209 points', '17 points', '50 points', '640 points', '722 points', '45 points', '664 points', '325 points', '915 points', '76 points', '223 points', '295 points', '1,749 points', '110 points', '296 points', '1,024 points', '331 points', '58 points', '43 points', '88 points', '156 points']


**Activity 7:** Extract the number of comments of the retrieved posts and print them.


Note: We are looking for the comment count (e.g., `226 comments`), which appears in markup such as:

```HTML
<a class="search-comments may-blank" href="https://old.reddit.com/r/datascience/comments/vtd6ln/the_data_science_trap/">226 comments</a>
```

In [10]:
n_post = 0

comments = []

for post in soup.find_all('div', attrs=attrs):
    
    if n_post < 3:
        n_post +=1 
        continue
    
    comment = post.find('a', class_="search-comments")
    #print(comment.text)
    comments.append(comment.text)

    

print(comments) 

['226 comments', '165 comments', '38 comments', '26 comments', '276 comments', '189 comments', '39 comments', '221 comments', '56 comments', '68 comments', '94 comments', '105 comments', '87 comments', '128 comments', '110 comments', '222 comments', '275 comments', '65 comments', '80 comments', '40 comments', '112 comments', '121 comments']
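
Scores and comment counts are extracted as strings such as `'1,749 points'` and `'226 comments'`, which cannot be sorted or summed as numbers. The sketch below converts them to integers. `parse_count` is a hypothetical helper, and it assumes the text always starts with a plain integer that may contain thousands separators.

```python
import re

def parse_count(text):
    """Return the leading integer of strings like '1,749 points', or None if absent."""
    match = re.match(r"\s*(-?\d[\d,]*)", text)
    if match is None:
        return None
    # Drop thousands separators before converting
    return int(match.group(1).replace(",", ""))

print(parse_count("1,749 points"))
# prints: 1749
print(parse_count("226 comments"))
# prints: 226
```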


**Activity 8:** Using the above features you extracted, create a dataframe for the retrieved posts, and print the first 10 entries. 

In [11]:
df = pd.DataFrame({'title'   : titles,
                  'url'     : urls,
                  'author'  : authors,
                  'profile' : profiles,
                  'tstamp'  : tstamps,
                  'subs'    : subs,
                  'score'   : scores,
                  'comments': comments
                  })

df.head(10)

| | title | url | author | profile | tstamp | subs | score | comments |
|---|---|---|---|---|---|---|---|---|
| 0 | The Data Science Trap | https://old.reddit.com/r/datascience/comments/... | alberto-matamoro | https://old.reddit.com/user/alberto-matamoro | 2022-07-07 07:54:32+00:00 | r/datascience | 503 points | 226 comments |
| 1 | Not getting any interviews, but getting a lot ... | https://old.reddit.com/r/resumes/comments/wksm... | Dean-Pearce | https://old.reddit.com/user/Dean-Pearce | 2022-08-10 09:06:26+00:00 | r/resumes | 209 points | 165 comments |
| 2 | Data science job market is shrinking | https://old.reddit.com/r/cscareerquestions/com... | Normal-Illustrator95 | https://old.reddit.com/user/Normal-Illustrator95 | 2022-01-03 00:42:56+00:00 | r/cscareerquestions | 17 points | 38 comments |
| 3 | People who decided to take Data Science as the... | https://old.reddit.com/r/ApplyingToCollege/com... | mega_dong_04 | https://old.reddit.com/user/mega_dong_04 | 2022-06-13 08:56:59+00:00 | r/ApplyingToCollege | 50 points | 26 comments |
| 4 | Data Science Salary Progression | https://old.reddit.com/r/datascience/comments/... | da_chosen1 | https://old.reddit.com/user/da_chosen1 | 2022-05-01 08:38:01+00:00 | r/datascience | 640 points | 276 comments |
| 5 | What is the 'Bible' of Data Science? | https://old.reddit.com/r/datascience/comments/... | manurbs | https://old.reddit.com/user/manurbs | 2022-06-07 11:24:22+00:00 | r/datascience | 722 points | 189 comments |
| 6 | Data Science or Web Development? I want to hea... | https://old.reddit.com/r/cscareerquestions/com... | Jos-I | https://old.reddit.com/user/Jos-I | 2022-06-14 10:44:36+00:00 | r/cscareerquestions | 45 points | 39 comments |
| 7 | Nobody talks about all of the waiting in Data ... | https://old.reddit.com/r/datascience/comments/... | mcjon77 | https://old.reddit.com/user/mcjon77 | 2022-08-10 21:57:31+00:00 | r/datascience | 664 points | 221 comments |
| 8 | Data Science job postings read like Software E... | https://old.reddit.com/r/datascience/comments/... | CDRSkywalker1991 | https://old.reddit.com/user/CDRSkywalker1991 | 2022-08-19 17:41:41+00:00 | r/datascience | 325 points | 56 comments |
| 9 | The Data Science Hierarchy of Needs | https://old.reddit.com/r/datascience/comments/... | IpsoCranius | https://old.reddit.com/user/IpsoCranius | 2022-08-07 00:27:25+00:00 | r/datascience | 915 points | 68 comments |
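
Building the columns in separate loops works only while every list ends up the same length. If one loop skips a post that the others keep, `pd.DataFrame` raises an error or the rows no longer line up. An alternative sketch collects one record per post in a single pass. `extract_post` and `text_or_none` are hypothetical helpers, the stand-in HTML is trimmed from the sample entry, and treating the `linkflairlabel` text as the post's category is an assumption.

```python
from bs4 import BeautifulSoup
import pandas as pd

def text_or_none(tag):
    """Return the stripped text of a tag, or None when the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else None

def extract_post(post):
    """Collect all fields of one search result into a single record."""
    title_link = post.select_one("header.search-result-header a")
    author = post.find("a", class_="author")
    time_tag = post.select_one("span.search-time time")
    return {
        "title": text_or_none(title_link),
        "url": title_link.get("href") if title_link is not None else None,
        "author": text_or_none(author),
        "profile": author.get("href") if author is not None else None,
        "tstamp": time_tag.get("datetime") if time_tag is not None else None,
        "subs": text_or_none(post.find("a", class_="search-subreddit-link")),
        # Assumption: the flair label is what the exercise calls "category"
        "category": text_or_none(post.find("span", class_="linkflairlabel")),
        "score": text_or_none(post.find("span", class_="search-score")),
        "comments": text_or_none(post.find("a", class_="search-comments")),
    }

html = """
<div class="search-result search-result-link"><div>
<header class="search-result-header"><a class="search-title may-blank" href="https://old.reddit.com/r/datascience/comments/vtd6ln/the_data_science_trap/">The Data Science Trap</a><span class="linkflairlabel" title="Career">Career</span></header>
<div class="search-result-meta"><span class="search-score">503 points</span>
<a class="search-comments may-blank" href="https://old.reddit.com/r/datascience/comments/vtd6ln/the_data_science_trap/">226 comments</a>
<span class="search-time">submitted <time datetime="2022-07-07T07:54:32+00:00">1 month ago</time></span>
<span class="search-author">by <a class="author may-blank" href="https://old.reddit.com/user/alberto-matamoro">alberto-matamoro</a></span>
<span>to <a class="search-subreddit-link may-blank" href="https://old.reddit.com/r/datascience/">r/datascience</a></span></div>
</div></div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

records = [extract_post(p) for p in demo_soup.find_all("div", class_="search-result-link")]
demo_df = pd.DataFrame(records)
demo_df["tstamp"] = pd.to_datetime(demo_df["tstamp"])
print(demo_df.iloc[0]["title"], demo_df.iloc[0]["score"])
```

Because each record is built from one post, a missing field becomes a `None` in that row instead of shifting every later row.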


**Activity 9:** Save the retrieved posts in a json file.

In [12]:
df.to_json("posts.json", orient='records')
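
By default, `to_json` writes datetime columns as epoch milliseconds, so `tstamp` is stored in the file as an integer rather than a date string. The sketch below converts such a value back with the standard library, using 1657180472000, the epoch-millisecond form of the first post's timestamp (2022-07-07 07:54:32 UTC).

```python
from datetime import datetime, timezone

tstamp_ms = 1657180472000  # epoch milliseconds for 2022-07-07 07:54:32 UTC
restored = datetime.fromtimestamp(tstamp_ms / 1000, tz=timezone.utc)

print(restored.isoformat())
# prints: 2022-07-07T07:54:32+00:00
```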

---

#### Confirm the file contains the saved records

In [13]:
import json
with open("posts.json", "r") as f:
    parsed = json.load(f)
    print( json.dumps(parsed, indent=4) )


[
    {
        "title": "The Data Science Trap",
        "url": "https://old.reddit.com/r/datascience/comments/vtd6ln/the_data_science_trap/",
        "author": "alberto-matamoro",
        "profile": "https://old.reddit.com/user/alberto-matamoro",
        "tstamp": 1657180472000,
        "subs": "r/datascience",
        "score": "503 points",
        "comments": "226 comments"
    },
    {
        "title": "Not getting any interviews, but getting a lot of rejections - applying to ENTRY LEVEL Data Science / Data Analysis jobs.",
        "url": "https://old.reddit.com/r/resumes/comments/wksm4f/not_getting_any_interviews_but_getting_a_lot_of/",
        "author": "Dean-Pearce",
        "profile": "https://old.reddit.com/user/Dean-Pearce",
        "tstamp": 1660122386000,
        "subs": "r/resumes",
        "score": "209 points",
        "comments": "165 comments"
    },
    {
        "title": "Data science job market is shrinking",
        "url": "https://old.reddit.com/r/cscareerquestio

# Save your notebook, then `File > Close and Halt`