Scrape any website using BeautifulSoup or Selenium. If possible, you might want to start collecting data for the project that you have in mind. Your data must have the following specifications at the very least:

Data points: 100
Features: 5

You can add some metadata present on the page if what you have is not enough. In 5 minutes, present in class how you did your scraping. You may use the following as your guideline:

Why did you choose to scrape this site?
What were the challenges you encountered?
You may go through your notebook.
Do you think that the data you collected contains personally identifiable information (PII)?
Any other learning?

In [180]:
# %pip install selenium
# %pip install webdriver_manager
# %pip install --upgrade webdriver_manager

In [181]:
import pandas as pd

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

import time
import random

First, let's initialize the Selenium webdriver. **You can either change the chromedriver_path to the path of your local chrome driver or use ChromeDriverManager**.

In [182]:
# chromedriver_path = "./chromedriver-mac-x64/chromedriver"
# service = ChromeService(chromedriver_path)

# driver = webdriver.Chrome(service=service)

In [183]:
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))

## Scraping from a single thread

Let's try to scrape one of the latest Makeup Discussion Threads from a subreddit, `r/beautytalkph`.

In [184]:
url = "https://www.reddit.com/r/beautytalkph/comments/1cy3t0z/makeup_thread_may_23_2024/?rdt=65047"
driver.get(url)

When we first open the page, scrolling to the bottom triggers lazy loading, which renders the next batch of comments. At a certain point, however, Reddit stops fetching more comments no matter how far we scroll. To view the rest, we have to click a "View more comments" button.

The script below handles both cases, and continues this process of loading more comments until we reach the last comment.

In [185]:
def scroll_to_bottom(driver):
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # First, check for "View more comments" button
        page_buttons = driver.find_elements(By.XPATH, "//button[@rpl]")
        btn_matches_found = [btn for btn in page_buttons if btn.text == "View more comments"]

        # Scroll to bottom to view more comments
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        print("Scrolled to bottom...")

        # If "View more comments" was found, click to view more comments
        if len(btn_matches_found) > 0:
            view_more_comments_btn = btn_matches_found[0]
            view_more_comments_btn.click()
            print("Clicked \"View more\"...")

        # Wait for a few seconds before checking if new content has loaded
        time.sleep(random.uniform(4, 8))

        # Check if new content has loaded by checking the "height" of the page
        current_page_height = driver.execute_script("return document.body.scrollHeight")

        if current_page_height == last_height:
            print("Done.")
            break
        else:
            last_height = current_page_height
            print("More content loaded...")

# Call the function
scroll_to_bottom(driver)

Scrolled to bottom...
More content loaded...
Scrolled to bottom...
Clicked "View more"...
More content loaded...
Scrolled to bottom...
Clicked "View more"...
More content loaded...
Scrolled to bottom...
Clicked "View more"...
More content loaded...
Scrolled to bottom...
Done.
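
The loop above stops only when the page height stops changing. As a minimal sketch of the same stop condition, the version below separates the logic from the browser and adds a cap on the number of rounds, so a page that keeps growing cannot loop forever. The names `load_until_stable`, `get_height`, `load_more`, and `max_rounds` are hypothetical and not part of the original notebook, and the fake height sequence stands in for a real page.

```python
from typing import Callable


def load_until_stable(
    get_height: Callable[[], int],
    load_more: Callable[[], None],
    max_rounds: int = 50,
) -> int:
    """Call load_more() until the height stops changing or max_rounds is reached.

    Returns the number of rounds that produced new content.
    """
    last_height = get_height()
    productive_rounds = 0

    for _ in range(max_rounds):
        load_more()
        current_height = get_height()

        # Same stop condition as scroll_to_bottom: no growth means no new content
        if current_height == last_height:
            break

        last_height = current_height
        productive_rounds += 1

    return productive_rounds


# Stand-in for a browser: a page whose height grows twice and then stays put
heights = iter([1000, 2000, 3000, 3000])
rounds = load_until_stable(lambda: next(heights), lambda: None)
print(rounds)
```

With Selenium, `get_height` could wrap `driver.execute_script("return document.body.scrollHeight")`, and `load_more` could do the scrolling, the button click, and the randomized sleep.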


By looking at the HTML through our own browser, we can see that each comment in the thread is represented by a `<shreddit-comment>` tag. The tag contains some notable attributes, such as the comment's id and author, and we can get other kinds of data within the children of the tag, such as the date and time the comment was created and the comment content.

In [186]:
def extract_comments(driver):
    """Return a DataFrame with one row per <shreddit-comment> on the current page."""
    columns = [
        "comment_id",
        "author",
        "text",
        "score",
        "depth",
        "parent_comment_id",
        "thread_id",
        "date_created"
    ]
    rows = []

    comments = driver.find_elements(By.XPATH, "//shreddit-comment")

    for comment_div in comments:
        rows.append({
            # Values stored as attributes on the <shreddit-comment> tag itself
            "comment_id": comment_div.get_attribute("thingid"),
            "author": comment_div.get_attribute("author"),
            "score": comment_div.get_attribute("score"),
            "depth": comment_div.get_attribute("depth"),
            "parent_comment_id": comment_div.get_attribute("parentid"),
            "thread_id": comment_div.get_attribute("postid"),
            # Values stored in child elements of the tag
            "text": comment_div.find_element(By.XPATH, ".//div[@slot='comment']").text,
            "date_created": comment_div.find_element(By.XPATH, ".//time").get_attribute("datetime")
        })

    # Build the DataFrame once; pd.concat inside the loop copies the frame on every iteration
    return pd.DataFrame(rows, columns=columns)

# Call the function
df_comments = extract_comments(driver)

In [187]:
len(df_comments)

219
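
The attribute lookup described above can also be tried offline, without a browser. The sketch below uses the standard library's `html.parser` (a swapped-in tool, not the Selenium approach used in this notebook) on a small hand-written fragment. The fragment and the class name `CommentAttributeParser` are hypothetical; only the attribute names (`thingid`, `author`, `score`, `depth`, `parentid`, `postid`) are taken from `extract_comments`.

```python
from html.parser import HTMLParser

# Hand-written sample, NOT real Reddit markup; it only mimics the attribute names
SAMPLE_HTML = """
<shreddit-comment thingid="t1_aaa" author="user_one" score="3" depth="0" postid="t3_xyz">
  <div slot="comment"><p>First comment</p></div>
  <shreddit-comment thingid="t1_bbb" author="user_two" score="1" depth="1"
                    parentid="t1_aaa" postid="t3_xyz">
    <div slot="comment"><p>A reply</p></div>
  </shreddit-comment>
</shreddit-comment>
"""


class CommentAttributeParser(HTMLParser):
    """Collect the attributes of every <shreddit-comment> start tag."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag != "shreddit-comment":
            return
        attributes = dict(attrs)
        self.rows.append({
            "comment_id": attributes.get("thingid"),
            "author": attributes.get("author"),
            "score": attributes.get("score"),
            "depth": attributes.get("depth"),
            # Top-level comments have no parentid attribute in this sample
            "parent_comment_id": attributes.get("parentid"),
            "thread_id": attributes.get("postid"),
        })


parser = CommentAttributeParser()
parser.feed(SAMPLE_HTML)
for row in parser.rows:
    print(row)
```

Running the extraction logic against a fixed fragment like this makes it easier to check column names and missing-value handling before spending time on a full scrape.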

### Looping Multiple Threads

Now let's open the subreddit `r/beautytalkph`, filtered to threads with the "Makeup Weekly Thread" flair. We take the links to the first three threads only, since these threads are not affected by the Shadow DOM.

In [188]:
# Open the r/beautytalkph subreddit threads
url = "https://www.reddit.com/r/beautytalkph/?f=flair_name%3A%22Makeup%20Weekly%20Thread%22"
driver.get(url)

# Get the links to the top 3 threads
n = 3
thread_links = []

# Define the XPath for the specific <a> element
specific_xpath = "//a[@slot='full-post-link']"

# Find the <a> elements using the XPath
elements = driver.find_elements(By.XPATH, specific_xpath)

# Extract the href attributes and store them in the list
for element in elements[:n]:  # Only process the first 3 elements
    link = element.get_attribute('href')
    thread_links.append(link)

# Print the extracted href attributes
print(thread_links)

['https://www.reddit.com/r/beautytalkph/comments/1d77yqa/makeup_thread_june_04_2024/', 'https://www.reddit.com/r/beautytalkph/comments/1d3fosr/makeup_thread_may_30_2024/', 'https://www.reddit.com/r/beautytalkph/comments/1d1uvhe/makeup_thread_may_28_2024/']
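
Each link above appears to carry the thread's ID as the path segment right after `/comments/`. A minimal sketch for pulling it out, assuming that URL layout holds for other threads as well (the helper name `thread_id_from_url` and the pattern are illustrative, not part of the original notebook):

```python
import re
from typing import Optional

# Assumption: the thread ID is the lowercase alphanumeric segment after /comments/
THREAD_ID_PATTERN = re.compile(r"/comments/([a-z0-9]+)/")


def thread_id_from_url(url: str) -> Optional[str]:
    """Return the thread ID embedded in a thread URL, or None if there is none."""
    match = THREAD_ID_PATTERN.search(url)
    return match.group(1) if match else None


links = [
    "https://www.reddit.com/r/beautytalkph/comments/1d77yqa/makeup_thread_june_04_2024/",
    "https://www.reddit.com/r/beautytalkph/comments/1d3fosr/makeup_thread_may_30_2024/",
    "https://www.reddit.com/r/beautytalkph/comments/1d1uvhe/makeup_thread_may_28_2024/",
]
print([thread_id_from_url(link) for link in links])
# → ['1d77yqa', '1d3fosr', '1d1uvhe']
```

This can serve as a cross-check on the `postid` attribute read from each comment, or as a fallback when that attribute is empty.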


### TODO: Get past the Shadow DOM to get the other links (If we need more links)

In [189]:
# TODO: Get the href links of the threads
# Note: in Chromium-based browsers, a shadow root is searched with CSS selectors, not XPath
# shadow_root = driver.find_element(By.XPATH, "//*[@id=\"main-content\"]/div[2]/faceplate-batch").shadow_root
# shadow_element = shadow_root.find_element(By.CSS_SELECTOR, "#faceplate_1")
# print(shadow_element)

### Visit each of the links and scrape the comments

After collecting the links, let's visit them one by one and scrape the comments in each thread by calling the functions `scroll_to_bottom` and `extract_comments`.

In [190]:
for link in thread_links:
    driver.get(link)
    # Check if driver got the link
    print("Visiting ", driver.current_url)

    # Wait for the comments to load
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//shreddit-comment")))
    
    # Scroll to the bottom of the page
    scroll_to_bottom(driver)
    
    # Get the comments
    comments = extract_comments(driver)
    print("Comments extracted: ", len(comments))

    # Append this thread's comments to the df_comments DataFrame
    df_comments = pd.concat([df_comments, comments], ignore_index=True)

print("Total comments extracted: ", len(df_comments))
print("Done")    

Visiting  https://www.reddit.com/r/beautytalkph/comments/1d77yqa/makeup_thread_june_04_2024/
Scrolled to bottom...
More content loaded...
Scrolled to bottom...
Clicked "View more"...
More content loaded...
Scrolled to bottom...
Clicked "View more"...
More content loaded...
Scrolled to bottom...
Clicked "View more"...
More content loaded...
Scrolled to bottom...
Done.
Comments extracted:  211
Visiting  https://www.reddit.com/r/beautytalkph/comments/1d3fosr/makeup_thread_may_30_2024/
Scrolled to bottom...
More content loaded...
Scrolled to bottom...
Clicked "View more"...
More content loaded...
Scrolled to bottom...
Clicked "View more"...
More content loaded...
Scrolled to bottom...
Clicked "View more"...
More content loaded...
Scrolled to bottom...
Clicked "View more"...
More content loaded...
Scrolled to bottom...
Done.
Comments extracted:  312
Visiting  https://www.reddit.com/r/beautytalkph/comments/1d1uvhe/makeup_thread_may_28_2024/
Scrolled to bottom...
More content loaded...
Scroll

In [191]:
df_comments

Unnamed: 0,comment_id,author,text,score,depth,parent_comment_id,thread_id,date_created
0,t1_l76ljn8,tonychoppa513,Hi I'm a guy and been wanting to use beauty pr...,1,0,,1cy3t0z,2024-06-05T06:34:56.870Z
1,t1_l6et4jq,hlg64,"For those who use skincare products, facial an...",0,0,,1cy3t0z,2024-05-31T01:14:27.261Z
2,t1_l60t8zd,beefymademoiselle,Any recs for red blushes for light medium w/ n...,2,0,,1cy3t0z,2024-05-28T13:32:12.778Z
3,t1_l5wpz8f,Dear_Elephant7549,i need a compact brow product reco po! i'm onl...,1,0,,1cy3t0z,2024-05-27T17:26:00.782Z
4,t1_l6lehui,sunnyisloved,"not a powder girly, but I know strokes release...",1,1,t1_l5wpz8f,1cy3t0z,2024-06-01T07:36:15.586Z
...,...,...,...,...,...,...,...,...
883,t1_l65h2lw,[deleted],Repost on hair care thread po ☺️,,0,,,2024-05-29T09:38:36.607Z
884,t1_l65x4ly,sarcastronaughty,Repost on hair care thread po ☺️,1,1,t1_l65h2lw,t3_1d1uvhe,2024-05-29T12:16:05.570Z
885,t1_l628arj,[deleted],This belongs in the Seller’s Thread!,,0,,,2024-05-28T18:42:20.725Z
886,t1_l63ucg8,cmr82,This belongs in the Seller’s Thread!,6,1,t1_l628arj,t3_1d1uvhe,2024-05-29T00:41:51.631Z
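
The table above shows a few inconsistencies worth cleaning before analysis: `thread_id` appears both with and without a `t3_` prefix, `score` is blank for deleted comments, and `date_created` is still a string. The helpers below are a minimal sketch of one way to normalize these values; the function names are hypothetical, and the assumption that `t3_` is only ever a prefix is based on the rows shown here.

```python
from datetime import datetime
from typing import Optional


def normalize_thread_id(raw: Optional[str]) -> Optional[str]:
    """Strip a leading 't3_' so both ID forms compare equal; blank becomes None."""
    if not raw:
        return None
    return raw[3:] if raw.startswith("t3_") else raw


def to_int(raw) -> Optional[int]:
    """Convert a scraped attribute such as score or depth to int; blank becomes None."""
    try:
        return int(raw)
    except (TypeError, ValueError):
        return None


def parse_timestamp(raw: str) -> datetime:
    """Parse timestamps of the form 2024-06-05T06:34:56.870Z into an aware datetime."""
    return datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%f%z")


print(normalize_thread_id("t3_1d1uvhe"), normalize_thread_id("1cy3t0z"))
print(to_int("6"), to_int(""))
print(parse_timestamp("2024-06-05T06:34:56.870Z").year)
```

On the DataFrame, these could be applied column by column, for example with `df_comments["thread_id"].map(normalize_thread_id)`.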


In [192]:
len(df_comments)

888

In [193]:
driver.quit()

### Exporting the DataFrame to a CSV file

In [194]:
# Export data to CSV
df_comments.to_csv('comments.csv', index=False)
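
Since the exported file contains Reddit usernames, the PII question from the guidelines applies to the `author` column. One possible mitigation, sketched below under the assumption that usernames are not needed for the analysis, is to replace them with salted hashes before export. The function name `pseudonymize` and the salt value are hypothetical. This is pseudonymization rather than anonymization: the comment text itself may still identify someone.

```python
import hashlib


def pseudonymize(username: str, salt: str) -> str:
    """Map a username to a stable, non-reversible label; keep the '[deleted]' marker."""
    if username == "[deleted]":
        return username
    digest = hashlib.sha256((salt + username).encode("utf-8")).hexdigest()
    # 12 hex characters keep labels short; a longer slice lowers the collision risk
    return "user_" + digest[:12]


salt = "replace-with-a-private-value"  # hypothetical; keep it out of the shared notebook
label = pseudonymize("example_user", salt)
print(label.startswith("user_"), len(label))
```

The same username always maps to the same label, so replies and repeat commenters can still be linked within the dataset.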