# Data Acquisition: Reddit Scraping

In this exercise, we will search for a query (e.g., "data science") on the old Reddit interface (https://www.old.reddit.com/). We will then grab the URL of the search page (e.g., https://old.reddit.com/search?q=data+science) and scrape the returned posts. We use the old Reddit interface because its HTML is simpler and its tags are easier to parse. We will focus on extracting each post's title, author, author's profile link, subreddit, tag, timestamp, number of votes, and number of comments.
<img src="../images/reddit_search.png" />



* You are free to use your own query string. 
* The search page also lists a set of matching subreddits. Ignore these subreddits and focus on extracting Reddit posts.
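
For a custom query string, the search URL can be built instead of typed by hand. The following is a minimal sketch using the standard library's `urllib.parse.urlencode`; the helper name `build_search_url` is hypothetical and not part of the exercise.

```python
from urllib.parse import urlencode

# Base address of the old Reddit search page used in this exercise
SEARCH_BASE = "https://old.reddit.com/search"

def build_search_url(query):
    """Return the old Reddit search URL for a free-text query (hypothetical helper)."""
    # urlencode escapes special characters and turns spaces into '+'
    return f"{SEARCH_BASE}?{urlencode({'q': query})}"

print(build_search_url("data science"))
# https://old.reddit.com/search?q=data+science
```

Encoding the query this way keeps characters such as `+` or `&` in a search term from being misread as part of the URL syntax.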



**Activity 1:** Fetch the page and create a soup object using the Beautiful Soup library.

In [19]:
# Your code for activity 1 goes here..
#---------------------------------------

# Import the library used to query a website
import requests
# Import the Beautiful Soup library, which provides
# functions to parse the data returned from the website
from bs4 import BeautifulSoup

# Import pandas to convert lists to a data frame
import pandas as pd
# Import numpy
import numpy as np


# A custom User-Agent identifies our client; sites such as Reddit may
# throttle or block requests that carry the default python-requests User-Agent
headers = {'User-Agent': 'MyAPP/1.0'}


# Specify the URL
url = "https://old.reddit.com/search?q=data+science"
# Request the URL and store the server's reply in the variable 'response'
response = requests.get(url, headers=headers)
# Stop early if the request failed (e.g., when rate-limited)
response.raise_for_status()

# Parse the HTML in 'response' with Python's built-in parser and store it as a soup object
soup = BeautifulSoup(response.text, "html.parser")



**Activity 2:** Extract the titles and URLs of the retrieved posts from the soup and print them.

In [20]:
# Your Code Below:
# ----------------

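# Every search hit (subreddit or post) is a <div> with the 'search-result' class
posts = soup.find_all('div', class_='search-result')
print(posts)  # raw dump, useful for inspecting tag and class names
titles_and_urls = []

for post in posts:
    # Skip the subreddit entries; we only want posts
    if 'search-result-subreddit' in post.get('class', []):
        continue
    # Result titles are <a class="search-title"> links (visible in the raw dump), not <h3> tags
    title_tag = post.find('a', class_='search-title')
    if title_tag:
        title = title_tag.get_text(strip=True)
        # Named post_url so the search 'url' from Activity 1 is not overwritten
        post_url = title_tag['href']
        titles_and_urls.append((title, post_url))

for title, post_url in titles_and_urls:
    print(f"Title: {title}, URL: {post_url}")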

[<div class="search-result search-result-subreddit" data-fullname="t5_2sptq"><header class="search-result-header"><a class="search-title may-blank" href="https://old.reddit.com/r/datascience/">Data Science</a></header><div class="search-result-meta"><span class="fancy-toggle-button search-subscribe-button toggle" data-sr_name="datascience" style=""><a class="option active add login-required" href="#" tabindex="100">join</a><a class="option remove" href="#">leave</a></span><a class="search-subreddit-link may-blank" href="https://old.reddit.com/r/datascience/">r/datascience</a> <span class="search-subscribers">2,141,035 members,</span> <span class="search-time">a community for 13 years</span> </div><div class="search-result-body">A space for data science professionals to engage in discussions and debates on the subject of data science.</div><div class="search-result-footer"><span class="search-result-icon search-result-icon-filter"></span><a class="search-link" href="https://old.reddit.c
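
To see how the class names in the dump above separate subreddit entries from posts, the selection can be tried on a small hand-written snippet. This is only a sketch: `sample_html` is a simplified, hypothetical imitation of old Reddit's search results, and the `search-result-link` class on the post entry is an assumption to verify against the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modeled on the dump above (not real Reddit output)
sample_html = """
<div class="search-result search-result-subreddit">
  <header class="search-result-header">
    <a class="search-title may-blank" href="https://old.reddit.com/r/datascience/">Data Science</a>
  </header>
</div>
<div class="search-result search-result-link">
  <header class="search-result-header">
    <a class="search-title may-blank" href="https://old.reddit.com/r/example/comments/abc/">An example post</a>
  </header>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

post_titles = []
for result in sample_soup.find_all("div", class_="search-result"):
    # A multi-valued class attribute is returned as a list of class names
    if "search-result-subreddit" in result.get("class", []):
        continue  # skip subreddit entries
    link = result.find("a", class_="search-title")
    if link:
        post_titles.append((link.get_text(strip=True), link["href"]))

print(post_titles)
```

Only the second entry survives the filter, because the first one carries the `search-result-subreddit` class.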

**Activity 3:** Extract the author ids and their profile links from the retrieved posts and print them.

In [6]:
attrs = {'class': 'search-result'}

In [23]:
# Your Code Below:
# ----------------

posts = soup.find_all('div', attrs=attrs)

authors_and_profiles = []
for post in posts:
    author_tag = post.find('a', class_='author')
    # Debug print: shows None for any result without an author link
    # (e.g., the subreddit entries listed on the search page)
    print(author_tag)
    if author_tag:
        author_id = author_tag.get_text(strip=True)
        profile_url = author_tag['href']
        authors_and_profiles.append((author_id, profile_url))

for author_id, profile_url in authors_and_profiles:
    print(f"Author ID: {author_id}, Profile URL: {profile_url}")


None
None
None


**Activity 4:** Extract the submission time of the retrieved posts and print them.

In [8]:
# Your Code Below:
# ----------------

posts = soup.find_all('div', class_='search-result')

submission_times = []

for post in posts:
    # The <time> tag keeps the machine-readable timestamp in its 'datetime' attribute
    time_tag = post.find('time')
    if time_tag and time_tag.has_attr('datetime'):
        submission_time = time_tag['datetime']
        submission_times.append(submission_time)

for submission_time in submission_times:
    print(f"Submission Time: {submission_time}")
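
The `datetime` attribute is a plain string, so sorting or filtering by time is easier after converting it. A minimal sketch, assuming the values are ISO 8601 strings; the sample values below are illustrative, not scraped.

```python
from datetime import datetime

# Illustrative values (assumed ISO 8601 format), not scraped data
sample_times = ["2023-09-18T13:00:00+00:00", "2023-09-18T12:00:00+00:00"]

# datetime.fromisoformat parses 'YYYY-MM-DDTHH:MM:SS+HH:MM' strings
parsed_times = [datetime.fromisoformat(t) for t in sample_times]

# Datetime objects compare chronologically, so they can be sorted directly
for t in sorted(parsed_times):
    print(t.isoformat())
```

If the live page uses a different timestamp format, `datetime.strptime` with an explicit format string is the usual alternative.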


**Activity 5:** Extract the subreddits of the retrieved posts and print them.

In [9]:
# Your Code Below:
# ----------------

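posts = soup.find_all('div', class_='search-result')

subreddits = []

for post in posts:
    # Skip the subreddit entries; we only want posts
    if 'search-result-subreddit' in post.get('class', []):
        continue
    # The subreddit link carries the 'search-subreddit-link' class
    # (visible in the raw dump from Activity 2), not 'subreddit'
    subreddit_tag = post.find('a', class_='search-subreddit-link')
    if subreddit_tag:
        subreddit_name = subreddit_tag.get_text(strip=True)
        subreddits.append(subreddit_name)

for subreddit in subreddits:
    print(f"Subreddit: {subreddit}")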

**Activity 6:** Extract the points of the retrieved posts and print them.

In [10]:
# Your Code Below:
# ----------------

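posts = soup.find_all('div', class_='search-result')

points = []

for post in posts:
    # Assumption: the old Reddit search page shows the score in
    # <span class="search-score">; confirm the tag and class in the page source
    points_tag = post.find('span', class_='search-score')
    if points_tag:
        post_points = points_tag.get_text(strip=True)
        points.append(post_points)

for post_points in points:
    print(f"Points: {post_points}")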


**Activity 7:** Extract the number of comments of the retrieved posts and print them.

In [11]:
# Your Code Below:
# ----------------

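posts = soup.find_all('div', class_='search-result')

comments = []

for post in posts:
    # Assumption: the old Reddit search page links to the comments with
    # <a class="search-comments">; confirm the tag and class in the page source
    comments_tag = post.find('a', class_='search-comments')
    if comments_tag:
        post_comments = comments_tag.get_text(strip=True)
        comments.append(post_comments)

for post_comments in comments:
    print(f"Comments: {post_comments}")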
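
The points and comment counts are extracted as text, so arithmetic or sorting needs numbers. A sketch under the assumption that the text contains a number that may use thousands separators (as in "2,141,035 members" in the earlier dump); `leading_int` is a hypothetical helper, and the sample strings are illustrative.

```python
import re

def leading_int(text):
    """Return the first integer found in text, or None if there is none (hypothetical helper)."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    # Drop thousands separators before converting
    return int(match.group().replace(",", ""))

print(leading_int("1,234 points"))   # 1234
print(leading_int("56 comments"))    # 56
print(leading_int("no votes yet"))   # None
```

Returning `None` for text without digits keeps missing values distinguishable from a genuine count of zero.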


**Activity 8:** Using the above features you extracted, create a dataframe for the retrieved posts, and print the first 10 entries. 

In [12]:
# Your Code Below:
# ----------------

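import pandas as pd

def text_of(tag):
    """Return the stripped text of a tag, or None if the tag is missing."""
    return tag.get_text(strip=True) if tag else None

# Build one row per post so that every column stays aligned,
# even when a post lacks one of the fields
rows = []
for post in soup.find_all('div', class_='search-result'):
    # Skip the subreddit entries; we only want posts
    if 'search-result-subreddit' in post.get('class', []):
        continue
    title_tag = post.find('a', class_='search-title')
    time_tag = post.find('time')
    rows.append({
        'Title': text_of(title_tag),
        'URL': title_tag.get('href') if title_tag else None,
        'Author': text_of(post.find('a', class_='author')),
        'Subreddit': text_of(post.find('a', class_='search-subreddit-link')),
        # 'search-score' and 'search-comments' are assumed class names;
        # confirm them in the page source
        'Points': text_of(post.find('span', class_='search-score')),
        'Comments': text_of(post.find('a', class_='search-comments')),
        'Submission Time': time_tag.get('datetime') if time_tag else None,
    })

df = pd.DataFrame(rows)

print(df.head(10))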


**Activity 9:** Save the retrieved posts in a JSON file.

In [13]:
# Your Code Below:
# ----------------
import json

# Placeholder values that only illustrate the file format;
# substitute the lists extracted in the earlier activities
titles = ['Title 1', 'Title 2']
urls = ['https://example.com/post1', 'https://example.com/post2']
authors = ['Author1', 'Author2']
subreddits = ['subreddit1', 'subreddit2']
points = [100, 200]
comments = [10, 20]
submission_times = ['2023-09-18T12:00:00', '2023-09-18T13:00:00']

# Build one dictionary per post
posts_data = []
for i in range(len(titles)):
    post = {
        'Title': titles[i],
        'URL': urls[i],
        'Author': authors[i],
        'Subreddit': subreddits[i],
        'Points': points[i],
        'Comments': comments[i],
        'Submission Time': submission_times[i]
    }
    posts_data.append(post)

# Write the list of posts as indented, human-readable JSON
with open('reddit_posts.json', 'w', encoding='utf-8') as json_file:
    json.dump(posts_data, json_file, indent=4)

print("Data saved to reddit_posts.json")


Data saved to reddit_posts.json


**Verification:** Let's reload the JSON file and print it out.

In [14]:

import json
with open('reddit_posts.json', 'r') as f:
    parsed = json.load(f)
    print( json.dumps(parsed, indent=4) )




[
    {
        "Title": "Title 1",
        "URL": "https://example.com/post1",
        "Author": "Author1",
        "Subreddit": "subreddit1",
        "Points": 100,
        "Comments": 10,
        "Submission Time": "2023-09-18T12:00:00"
    },
    {
        "Title": "Title 2",
        "URL": "https://example.com/post2",
        "Author": "Author2",
        "Subreddit": "subreddit2",
        "Points": 200,
        "Comments": 20,
        "Submission Time": "2023-09-18T13:00:00"
    }
]
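
Printing the file is a visual check; an equality assertion makes the verification automatic. A minimal sketch using a temporary file and two illustrative records, so it does not touch `reddit_posts.json`.

```python
import json
import os
import tempfile

# Illustrative records in the same shape as the saved posts
records = [
    {"Title": "Title 1", "Points": 100, "Comments": 10},
    {"Title": "Title 2", "Points": 200, "Comments": 20},
]

# Write to a temporary directory so no existing file is overwritten
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "posts.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=4)
    with open(path, "r", encoding="utf-8") as f:
        reloaded = json.load(f)

# Lists, dicts, strings, and integers survive a JSON round trip unchanged
assert reloaded == records
print("Round trip OK:", len(reloaded), "records")
```

The same comparison can be applied to the real data by asserting that the reloaded list equals the list that was passed to `json.dump`.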


# Save your notebook, then `File > Close and Halt`