# Apple Podcast Review Scraping with the app_store_scraper

This program is a wrapper for scraping Apple Podcasts reviews with **app-store-scraper** (thanks to Eric Lim; see https://pypi.org/project/app-store-scraper/, MIT license). It was adapted for use in teaching at Maastricht University by Monika Barget and Arnoud Wils in 2023.

The main script is kept as lean as possible to make it easy to use for students without previous coding experience. To use the script, carefully read and follow the instructions below.

## Install and import modules

This section makes sure that the script has all the packages it needs. Select the grey box below and click the black arrow in the toolbar. Wait for the completion message before you continue!

In [6]:
!pip install --upgrade pip
!pip install urllib3==1.26.16
!pip install app-store-web-scraper

from pprint import pprint
from verify_countries import pool_checks
from scrape_reviews import scrape_reviews
import os
import glob
import re
import pandas as pd
import urllib3
import numpy as np

print("Installations and package import complete!")

Collecting urllib3==1.26.16
  Using cached urllib3-1.26.16-py2.py3-none-any.whl.metadata (48 kB)
Using cached urllib3-1.26.16-py2.py3-none-any.whl (143 kB)
Installing collected packages: urllib3
  Attempting uninstall: urllib3
    Found existing installation: urllib3 2.2.3
    Uninstalling urllib3-2.2.3:
      Successfully uninstalled urllib3-2.2.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
app-store-web-scraper 0.2.0 requires urllib3<3,>=2.0.0, but you have urllib3 1.26.16 which is incompatible.
Successfully installed urllib3-1.26.16
Collecting urllib3<3,>=2.0.0 (from app-store-web-scraper)
  Using cached urllib3-2.2.3-py3-none-any.whl.metadata (6.5 kB)
Using cached urllib3-2.2.3-py3-none-any.whl (126 kB)
Installing collected packages: urllib3
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.26.16
    Uninstalli
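
The log above shows pip uninstalling and reinstalling urllib3 in different versions, so it can help to confirm which versions are actually installed before continuing. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is made up for this illustration and is not part of the original script.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Package names taken from the pip commands in the cell above
for name in ("urllib3", "app-store-web-scraper"):
    print(name, "->", installed_version(name))
```

If a package is reported as `None`, rerunning the installation cell is a reasonable first step.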

## Define data input

In this section, you may need to adjust a few things, depending on your research project. The first grey box below imports the country codes for the App Store. Using all available country codes is recommended; if you want to limit them, do so in the separate applestore_country_codes.py file. The second grey box defines the podcasts whose reviews will be checked and collected. Here, you may need to insert the app IDs and app names of your own selected podcasts. The existing values can be used to test whether the script works.
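
The contents of applestore_country_codes.py are not shown in this notebook, so the following is only a sketch of how a country selection could work. The list `ALL_COUNTRIES`, the function `select_countries_sketch` and its `only` parameter are all hypothetical; the real `select_countries()` is called without arguments below and may be written differently.

```python
# Hypothetical stand-in for the list kept in applestore_country_codes.py
ALL_COUNTRIES = ["DE", "FR", "GB", "NL", "US"]


def select_countries_sketch(only=None):
    """Return all codes, or only those listed in `only` (case-insensitive)."""
    if only is None:
        return list(ALL_COUNTRIES)
    wanted = {code.upper() for code in only}
    # Keep the original order of ALL_COUNTRIES
    return [code for code in ALL_COUNTRIES if code in wanted]


print(select_countries_sketch())              # every code in the list
print(select_countries_sketch(["nl", "de"]))  # a restricted selection
```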

In [7]:
# import Apple Store country codes from separate file
from applestore_country_codes import select_countries 

countries = select_countries()

# display items in list
print("The country codes have been successfully loaded: ", countries)

The country codes have been successfully loaded:  ['DZ', 'AO', 'AI', 'AR', 'AM', 'AU', 'AT', 'AZ', 'BH', 'BB', 'BY', 'BE', 'BZ', 'BM', 'BO', 'BW', 'BR', 'VG', 'BN', 'BG', 'CA', 'KY', 'CL', 'CN', 'CO', 'CR', 'HR', 'CY', 'CZ', 'DK', 'DM', 'EC', 'EG', 'SV', 'EE', 'FI', 'FR', 'DE', 'GH', 'GB', 'GR', 'GD', 'GT', 'GY', 'HN', 'HK', 'HU', 'IS', 'IN', 'ID', 'IE', 'IL', 'IT', 'JM', 'JP', 'JO', 'KE', 'KW', 'LV', 'LB', 'LT', 'LU', 'MO', 'MG', 'MY', 'ML', 'MT', 'MU', 'MX', 'MS', 'NP', 'NL', 'NZ', 'NI', 'NE', 'NG', 'NO', 'OM', 'PK', 'PA', 'PY', 'PE', 'PH', 'PL', 'PT', 'QA', 'MK', 'RO', 'RU', 'SA', 'SN', 'SG', 'SK', 'SI', 'ZA', 'KR', 'ES', 'LK', 'SR', 'SE', 'CH', 'TW', 'TZ', 'TH', 'TN', 'TR', 'UG', 'UA', 'AE', 'US', 'UY', 'UZ', 'VE', 'VN', 'YE']


In [8]:
# Define a list of App Store items with app_id and app_name for scraping
# remove or add lines within the podcast list if needed
podcasts = [
    {"app_id": 1568547321, "app_name": 'are-you-menstrual'},
    {"app_id": 1614435903, "app_name": '28ish-days-later'},
    {"app_id": 1537830674, "app_name": 'holistic-womens-health-hormones-endometriosis-pcos'}
]

# URL structure of typical App Store item:
# https://podcasts.apple.com/us/podcast/black-women-talk-tech-podcast/id1453181438
# copy ID and podcast name from your URL

# Standard URL for Apple Podcasts
base_url = "https://podcasts.apple.com/us/podcast/"

# Important: country codes will be selected from the list above

# Set output path
path_out = "output/"

print("Podcasts defined!")

Podcasts defined!
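
The comments in the cell above ask you to copy the ID and the podcast name from a URL by hand. As an optional alternative, the sketch below extracts both values with a regular expression. The function name `parse_podcast_url` is an assumption made for this example, and the pattern only covers URLs that follow the structure shown in the comments.

```python
import re


def parse_podcast_url(url):
    """Split an Apple Podcasts URL into the app_id / app_name pair used above."""
    match = re.search(r"/podcast/([^/]+)/id(\d+)", url)
    if match is None:
        raise ValueError(f"Not a recognisable Apple Podcasts URL: {url}")
    return {"app_id": int(match.group(2)), "app_name": match.group(1)}


example = "https://podcasts.apple.com/us/podcast/black-women-talk-tech-podcast/id1453181438"
print(parse_podcast_url(example))
```

The returned dictionary has the same keys as the entries in the `podcasts` list, so it could be appended to that list directly.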


## Validate data and collect reviews

Here, you only need to run the code below and monitor the output. You do not need to change anything in the script. If you encounter an error, let your tutor know. A common cause is that the external scripts called here (verify_countries and scrape_reviews) are not in the right place and cannot be found.
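
To rule out missing helper scripts before running the main cell, a quick check like the one below may help. It is a sketch under two assumptions: that the helper modules are plain .py files named after the imports used earlier, and that they should sit in the same folder as this notebook. The function name `missing_files` is made up for this example.

```python
import os

# File names inferred from the import statements earlier in this notebook
required_files = [
    "verify_countries.py",
    "scrape_reviews.py",
    "applestore_country_codes.py",
]


def missing_files(filenames, folder="."):
    """Return the entries of `filenames` that are not present in `folder`."""
    return [
        name for name in filenames
        if not os.path.isfile(os.path.join(folder, name))
    ]


missing = missing_files(required_files)
if missing:
    print("Missing helper scripts:", missing)
else:
    print("All helper scripts found.")
```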

In [11]:
# Loop through podcasts and country codes

for podcast in podcasts:
    app_id = podcast['app_id']
    app_name = podcast['app_name']

    # Construct full URL for each podcast
    podcast_url = f"{base_url}{app_name}/id{app_id}"
    print(f"Scraping URL: {podcast_url}")

    # Create the filename and path for each podcast's review file
    filename_csv = f'{app_name}_reviews_table.csv'
    file_csv = os.path.join(path_out, filename_csv)
    print(f"Saving reviews to: {file_csv}")

    # Check available countries and get the list of country codes
    countries_reviewed = pool_checks(podcast_url, countries)
    print("The following countries have reviews:", countries_reviewed)
    
    # Collect all reviews for selected countries using scrape_reviews
    all_reviews = scrape_reviews(countries_reviewed, app_name, app_id)
    print(f"All reviews collected for {app_name}!")

    # Only proceed to save if reviews were actually collected
    if all_reviews:
        try:
            # Concatenate all country-specific DataFrames into one DataFrame per podcast
            combined_reviews_df = pd.concat(all_reviews, ignore_index=True)
            
            # Save the combined DataFrame for each podcast
            combined_reviews_df.to_csv(file_csv, index=False)
            print(f"Reviews saved successfully to {file_csv}")
        
        except Exception as e:
            print(f"An error occurred while saving reviews for {app_name}: {e}")
    else:
        print(f"No reviews collected for {app_name}. Skipping save.")
        
# After creating all individual CSV files, merge them

output_filename = 'all_reviews_table'

# Check if output file exists from previous script execution
existing_files = glob.glob(os.path.join(path_out, f"{output_filename}*.csv"))
if existing_files:
    max_index = max(
        [
            int(os.path.splitext(os.path.basename(f))[0].split('_')[-1]) 
            for f in existing_files if f"{output_filename}_" in f
        ] + [1]
    )
    new_filename = f"{output_filename}_{max_index + 1}.csv"
else:
    new_filename = f"{output_filename}.csv"  # First file

file_csv2 = os.path.join(path_out, new_filename)

# Exclude existing all_reviews_table.csv
all_files = glob.glob(os.path.join(path_out, "*_reviews_table.csv"))
all_files = [f for f in all_files if os.path.basename(f) != "all_reviews_table.csv"]

if all_files:
    # Combine all review files (the per-podcast files saved above are comma-separated)
    combined_df = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)

    # Save the combined DataFrame to the new CSV file
    print("Your final dataframe has", len(combined_df), "rows.")
    combined_df.to_csv(file_csv2, index=False, sep="\t")
    print(f'Exported to {file_csv2}')
else:
    print("No review files found to combine.")

# NOTE: the review count shown on a podcast's landing page may differ from the number of reviews fetched.
# This is because only some of the users who rate a podcast also write a review.
    

Scraping URL: https://podcasts.apple.com/us/podcast/are-you-menstrual/id1568547321
Saving reviews to: output/are-you-menstrual_reviews_table.csv
The following countries have reviews: ['AU', 'BH', 'BE', 'CA', 'CO', 'HR', 'FR', 'DE', 'GB', 'IN', 'LU', 'MX', 'NL', 'NZ', 'PK', 'PH', 'PL', 'ZA', 'ES', 'US']
Scraping reviews for are-you-menstrual in country AU
Retrieved 10 reviews for country AU.
Scraping reviews for are-you-menstrual in country BH
An error occurred while scraping for country BH: string indices must be integers
Scraping reviews for are-you-menstrual in country BE
An error occurred while scraping for country BE: string indices must be integers
Scraping reviews for are-you-menstrual in country CA
Retrieved 42 reviews for country CA.
Scraping reviews for are-you-menstrual in country CO
An error occurred while scraping for country CO: string indices must be integers
Scraping reviews for are-you-menstrual in country HR
An error occurred while scraping for country HR: string indic

Now check your output data in the /output folder and download the files! Make sure to restart the kernel (circle button above) before entering new podcast URLs and running the script again. Otherwise, you may see old data copied to your new files.
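
For a quick look at the merged file without leaving the notebook, the sketch below finds the most recently modified all_reviews_table file and reads it with Python's built-in `csv` module. Note that the merge step writes tab-separated values even though the file ends in .csv. The function names `latest_merged_file` and `read_tab_separated` are assumptions made for this example; with pandas, `pd.read_csv(path, sep="\t")` reads the same file into a DataFrame.

```python
import csv
import glob
import os


def latest_merged_file(folder="output/", stem="all_reviews_table"):
    """Return the path of the most recently modified merged file, or None."""
    candidates = glob.glob(os.path.join(folder, f"{stem}*.csv"))
    return max(candidates, key=os.path.getmtime) if candidates else None


def read_tab_separated(path):
    """Read a tab-separated file into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


latest = latest_merged_file()
if latest is None:
    print("No merged file found yet.")
else:
    rows = read_tab_separated(latest)
    print(f"{latest}: {len(rows)} rows")
    if rows:
        print("Columns:", list(rows[0].keys()))
```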