# AdultSwim Show Scraper

AdultSwim has a large offering of partial and full seasons of shows on their [shows page](https://www.adultswim.com/videos/). The purpose of this scraper is to create dictionaries of metadata for episodes that are currently available on the site.

I discovered that the AdultSwim shows page is built with React while digging around in Chrome DevTools, when I happened to click on the React Developer Tools extension tab. I had been thinking about a Chrome extension that could manage watchlists for multiple streaming sites, and I was delighted to find that AdultSwim had filled their components with plenty of meta tags that would make scraping the site a breeze. This scraper is the result.

## Challenges
1. Properly labelling the seasons and episodes.
   - AdultSwim offers many partial seasons featuring mid-season episodes. I initially wrote the scraper for a show page that featured a complete series, relying on counting to label seasons and episodes, which mislabels any show whose listing starts mid-season. Thankfully, AdultSwim puts metadata in each of their season and episode components as props. These props appear in each component's rendered HTML as child `meta` elements. Each `meta` element has an `itemprop` attribute that describes the data in the accompanying `content` attribute. After identifying the `itemprop` values that describe season and episode numbers (`seasonNumber` and `episodeNumber`), I was able to easily fix my season and episode numbering.
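
To make the `itemprop`/`content` pairing concrete, here is a minimal sketch that pulls those pairs out of a small piece of markup. The HTML below is a hypothetical, simplified stand-in for a season container, not markup copied from the site, and the sketch uses only the standard library's `html.parser` (rather than Selenium and BeautifulSoup, which the scraper itself uses) so it runs without any setup. `ItempropCollector` is an illustrative name.

```python
from html.parser import HTMLParser

# Hypothetical markup: a simplified stand-in for one season container.
SAMPLE_HTML = """
<div itemprop="containsSeason">
  <meta itemprop="seasonNumber" content="2">
  <div class="episode">
    <meta itemprop="episodeNumber" content="14">
    <meta itemprop="name" content="Sample Episode">
  </div>
</div>
"""


class ItempropCollector(HTMLParser):
    """Collects (itemprop, content) pairs from every <meta> tag."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != 'meta':
            return
        attributes = dict(attrs)
        prop, content = attributes.get('itemprop'), attributes.get('content')
        # Keep only meta tags that carry both attributes.
        if prop and content:
            self.pairs.append((prop, content))


collector = ItempropCollector()
collector.feed(SAMPLE_HTML)
metadata = dict(collector.pairs)
print(metadata)
```

Even though this is the first (and only) episode in the sample, it is labelled season 2, episode 14, because the numbers come from the metadata rather than from counting.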



In [None]:
# Try replacing the following with another AS show link

adultswim_show_link = 'https://www.adultswim.com/videos/ghost-in-the-shell'

In [None]:
# Imports

from pprint import pprint

from selenium import webdriver
from bs4 import BeautifulSoup

In [None]:
# Functions

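def create_browser():
    """Returns a headless Selenium Chrome webdriver."""
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    # `options=` replaces the deprecated `chrome_options=` keyword.
    # Note: newer Selenium 4 releases also drop `executable_path` in favor of
    # a Service object, so this call assumes an older Selenium version.
    browser = webdriver.Chrome(
        executable_path='../utilities/chromedriver',
        options=options)
    return browser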


def get_html(as_show_link):
    """Returns the rendered HTML content for a given URL."""
    browser = create_browser()
    try:
        browser.get(as_show_link)
        return browser.page_source
    finally:
        # Always close the browser, even if the page fails to load.
        browser.quit()


def normalize_episodes(season):
    """Returns a dict of details for each episode in a given season."""
    # '_29ThWwPi' looks like a generated class name, so it may change
    # whenever the site is rebuilt.
    episodes = season.find_all('div', class_='_29ThWwPi')
    episode_guide = {}

    for episode in episodes:
        episode_data = {}
        # Falls back to 0 if the episode has no 'episodeNumber' meta tag.
        episode_number = 0

        for meta_data in episode.find_all('meta'):
            prop, content = meta_data.get('itemprop'), meta_data.get('content')
            if prop == 'episodeNumber':
                episode_number = content
            # Store every meta tag that has both attributes.
            if prop and content:
                episode_data[prop] = content

        episode_guide[f'episode_{episode_number}'] = episode_data

    return episode_guide


def normalize_season(season):
    """Returns the season number and a dict of episode details for a given season."""
    season_number_tag = season.find(name='meta', attrs={'itemprop':'seasonNumber'})
    season_number = season_number_tag['content']
    episode_guide = normalize_episodes(season)

    return season_number, episode_guide


def create_show_guide(show_link):
    """Returns a dict of episode details separated by seasons for a given AdultSwim show link."""
    html = get_html(show_link)
    soup = BeautifulSoup(html, 'html.parser')
    seasons = soup.find_all(name='div', attrs={'itemprop':'containsSeason'})
    show_guide = {}

    for season in seasons:
        season_number, episode_guide = normalize_season(season)
        show_guide[f'season_{season_number}'] = episode_guide

    return show_guide

In [None]:
# Execution

show_guide = create_show_guide(adultswim_show_link)
pprint(show_guide)
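
Because the guide is keyed by strings such as `season_2` and `episode_14`, sorting the keys alphabetically would put `season_10` before `season_2`. The sketch below shows one way to walk a guide in numeric order. The sample dict is hypothetical data shaped like the scraper's output, and `trailing_number` and `iter_episodes` are illustrative helpers, not part of the scraper.

```python
import re

# Hypothetical sample shaped like the dict returned by create_show_guide().
sample_guide = {
    'season_10': {'episode_2': {'name': 'C'}},
    'season_2': {
        'episode_14': {'name': 'B'},
        'episode_3': {'name': 'A'},
    },
}


def trailing_number(key):
    """Returns the integer at the end of a key such as 'episode_14' (0 if none)."""
    match = re.search(r'(\d+)$', key)
    return int(match.group(1)) if match else 0


def iter_episodes(show_guide):
    """Yields (season, episode, details) tuples in numeric order."""
    for season_key in sorted(show_guide, key=trailing_number):
        episodes = show_guide[season_key]
        for episode_key in sorted(episodes, key=trailing_number):
            yield (trailing_number(season_key),
                   trailing_number(episode_key),
                   episodes[episode_key])


ordered = [(s, e, d['name']) for s, e, d in iter_episodes(sample_guide)]
print(ordered)
```

Passing the real `show_guide` to `iter_episodes` in place of `sample_guide` should work the same way, assuming every key ends in a number.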