# Real-time Apple Podcast Review Scraping Simulator

This program opens a demo browser with the **selenium** module and scrolls down programmatically using JavaScript commands. The page's HTML expands automatically, just as it would when a "real" user scrolls down the page. The resulting HTML is then saved to disk and its content is extracted with **BeautifulSoup**. The extracted reviews are collected in a **pandas** DataFrame and written to a .csv file.

A few things to keep in mind:
- The .html page grows with every scroll, which slows down every action performed on it. After roughly 400 scrolls (equivalent to only a few thousand reviews), a single scroll took about 10-20 s.
- The .html page is already about 10 MB for roughly 5000 reviews.
- This scraping method is not very efficient.
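
To get a feel for how large the page may become, the figures above can be turned into a rough estimate. This is only a back-of-the-envelope sketch: it assumes the page size grows linearly with the number of reviews (about 2 kB per review), and the function name `estimate_html_megabytes` is hypothetical.

```python
# Figures quoted in the notes above: about 10 MB of HTML for roughly 5000 reviews.
OBSERVED_HTML_BYTES = 10_000_000
OBSERVED_REVIEWS = 5000


def estimate_html_megabytes(n_reviews):
    """Estimate the page size in MB, assuming size grows linearly with reviews."""
    bytes_per_review = OBSERVED_HTML_BYTES / OBSERVED_REVIEWS
    return n_reviews * bytes_per_review / 1_000_000


# A podcast with 20,000 reviews would give a page of roughly this many MB
print(estimate_html_megabytes(20_000))  # → 40.0
```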

To get started:
1. Set the right url, podcast_name and path_out under the **Create .html page** section
2. Run everything


## Import modules

In [None]:
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import os
import pandas as pd

## Create .html page

In [None]:
# URL of the podcast page
#url = 'https://podcasts.apple.com/us/podcast/black-girl-gone-a-true-crime-podcast/id1556267741?see-all=reviews'
url = 'https://podcasts.apple.com/us/podcast/crime-junkie/id1322200189?see-all=reviews'

podcast_name = 'crime-junkie'

# Define output folder 
path_out = ''

if not(os.path.isdir(path_out)):
    raise Exception(f"Output folder ({path_out}) does not exist, please create it first.")

# Define output file for the raw HTML (note: saved with a .txt extension)
filename_html = f'{podcast_name}_reviews.txt'
file_html = os.path.join(path_out, filename_html)

# Define output .csv file
filename_csv = f'{podcast_name}_reviews_table.csv'
file_csv = os.path.join(path_out, filename_csv)

# Set up Selenium webdriver
driver = webdriver.Chrome()
driver.get(url)

## Start scrolling

In [None]:
# Pause between scrolls in seconds; this is kind of like the frame rate
SCROLL_PAUSE_TIME = 0.5

# Get page height
last_height = driver.execute_script("return document.body.scrollHeight")

# Initialize counter
n = 0
m = 0

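# Set limits on the number of scrolls
nlim = int(2*24*3600//SCROLL_PAUSE_TIME) # Maximum number of successful scrolls (2 days of total scroll wait time)
mlim = int(5*60//SCROLL_PAUSE_TIME) # Maximum consecutive stalls; a stalled iteration pauses up to twice, so about a 10 minute time-out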

# Start iterating
while (n < nlim):
    
    # Try-except here, because after ~400 scrolls the page gets laggy and the JavaScript handler may time out
    try:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    except Exception:
        # If the handler did actually time out, then we keep track of that
        time.sleep(SCROLL_PAUSE_TIME)
        m += 1        
        
        # Too many time-outs will lead to a complete halt
        if m > mlim:
            break
        else:
            continue
    
    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)
    
    # Calculate new scroll height and compare with last scroll height
    
    # Try-except here, because after ~400 scrolls the page gets laggy and the JavaScript handler may time out
    try:
        # Get page height
        new_height = driver.execute_script("return document.body.scrollHeight")
    except Exception:
        # If the handler did actually time out, then we keep track of that
        time.sleep(SCROLL_PAUSE_TIME)
        m += 1       
        
        # Too many time-outs will lead to a complete halt
        if m > mlim:
            break
        else:
            continue
    
    # Let's check if the height has changed
    if new_height == last_height:
        # It didn't change, so let's track that
        time.sleep(SCROLL_PAUSE_TIME)
        m += 1        
        
        # Too many time-outs will lead to a complete halt
        if m > mlim:
            break
        else:
            continue
    # So it did change, let's update and move on
    else:
        m = 0
    
        # Update height
        last_height = new_height

        # Update increment
        n += 1
        
    
print(f'Done. After {n} scrolls.')
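
The loop above repeats the same stall handling in three places. As a minimal sketch, the same logic could be wrapped in one function that treats a failed script call like an unchanged page height. The function name `scroll_until_stalled` and its parameters are hypothetical; the only driver method it relies on is `execute_script`, which the cell above already uses. The `sleep` parameter exists only so the function can be tried without real waiting.

```python
import time


def scroll_until_stalled(driver, pause=0.5, max_scrolls=1000, max_stalls=600,
                         sleep=time.sleep):
    """Scroll to the bottom repeatedly.

    Stops after max_scrolls successful scrolls, or once more than
    max_stalls consecutive attempts left the page height unchanged.
    Returns the number of successful scrolls.
    """
    get_height = "return document.body.scrollHeight"
    scroll_down = "window.scrollTo(0, document.body.scrollHeight);"

    last_height = driver.execute_script(get_height)
    scrolls = 0
    stalls = 0

    while scrolls < max_scrolls and stalls <= max_stalls:
        try:
            driver.execute_script(scroll_down)
            sleep(pause)  # wait for the page to load
            new_height = driver.execute_script(get_height)
        except Exception:
            # A failed script call is counted as a stall
            new_height = last_height

        if new_height == last_height:
            # Nothing new was loaded, so track the stall and wait a bit longer
            stalls += 1
            sleep(pause)
        else:
            # The page grew, so reset the stall counter and move on
            stalls = 0
            last_height = new_height
            scrolls += 1

    return scrolls
```

With the variables from the cells above, a call might look like `n = scroll_until_stalled(driver, SCROLL_PAUSE_TIME, nlim, mlim)`.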

## Save to disk

In [None]:
# Extract the HTML content of the page.
# You can do this at any time: simply interrupt the previous step by pressing "i" twice on your keyboard
# and then run this section
html = driver.page_source

# Store to file
with open(file_html, 'w', encoding='utf-8') as f:
    f.write(html)

In [None]:
# Close the webdriver
driver.quit()

## Load html page & process

In [None]:
# The code can pick up where it previously left off, because the stored .html page is all we need
with open(file_html, 'r', encoding='utf-8') as f:
    # Read the whole file into one big string
    html = f.read()

## Get list of reviews

In [None]:
# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

# Find all the reviews
reviews = soup.find_all("div", {"class": "we-customer-review lockup"})

## Extract info from each review and store to dictionary

In [None]:
# Create empty dictionary
D = {}

# Create empty lists
ratings = []
titles = []
texts = []
timestamps = []
users = []

# Build the lists for the dictionary; this is a bit slow and poorly written, but I suppose it's fine
for review in reviews:
    # Get timestamp, e.g. '2023-02-27T13:20:53.000Z'
    timestamp = review.time.get('datetime').strip()

    # Get user
    user = review.div.span.text.strip()
    
    # Get title
    title = review.h3.text.strip()

    # Get text
    text = review.blockquote.div.p.text.strip()

    # Get rating
    rating = review.figure.get('aria-label').strip()
    rating_int = int(rating[0])
    
    # Append
    timestamps.append(timestamp)
    ratings.append(rating_int)
    titles.append(title)
    texts.append(text)
    users.append(user)
    

# Add to dictionary
D['user'] = users
D['timestamp'] = timestamps
D['rating'] = ratings
D['title'] = titles
D['text'] = texts
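
The loop above keeps the timestamp as a string and reads the rating from the first character of the `aria-label`. As a hedged sketch, both fields could be normalized with small helpers from the standard library. The helper names are hypothetical, the timestamp format is taken from the example in the comment above, and the assumption that the label starts with the star count (for example `'5 out of 5'`) is not confirmed by the source.

```python
import re
from datetime import datetime, timezone


def parse_rating(aria_label):
    """Return the leading integer of a label, or None if there is none.

    Assumes labels start with the star count, e.g. '5 out of 5'.
    """
    match = re.match(r"\s*(\d+)", aria_label or "")
    return int(match.group(1)) if match else None


def parse_timestamp(value):
    """Parse a string like '2023-02-27T13:20:53.000Z' into a UTC datetime."""
    parsed = datetime.strptime(value.strip(), "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)
```

Unlike `int(rating[0])`, `parse_rating` returns `None` instead of raising an error when a label is empty or does not start with a digit.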


## Export to CSV

In [None]:
# Store to dataframe
df = pd.DataFrame(D)

# Export to .csv
df.to_csv(file_csv, index=False, sep='\t')

In [None]:
df
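
Because the export uses a tab as separator, the file has to be read back with the same separator even though its name ends in .csv. With pandas this would be `pd.read_csv(file_csv, sep='\t')`. The sketch below does the same check with only the standard library; the function name `read_reviews` is hypothetical, and all values come back as strings.

```python
import csv


def read_reviews(path):
    """Read a tab-separated review table into a list of dictionaries."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter="\t"))
```

For example, `len(read_reviews(file_csv))` should equal `len(df)` after the export.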