# AdultSwim Show Scraper

AdultSwim offers partial and full seasons of many shows on their [shows page](https://www.adultswim.com/videos/). The purpose of this scraper is to create dictionaries of metadata for the episodes that are currently available on the site.

While digging around in Chrome DevTools, I happened to click on the React Developer Tools extension tab and discovered that the AdultSwim shows page was built with React. I had been thinking about a Chrome extension that could manage watchlists across multiple streaming sites, so I was delighted to find that AdultSwim had filled their components with plenty of meta tags that would make scraping the site a breeze. That discovery led to the scraper below.

## Challenges
1. Properly labelling the seasons and episodes.
   - AdultSwim offers many partial seasons featuring mid-season episodes. I initially wrote the scraper against a show page that featured a complete series and relied on counting to label seasons and episodes, which mislabels partial seasons that do not start at episode 1. Thankfully, AdultSwim passes metadata to each season and episode component as props, and these props appear in each component's rendered HTML as child `meta` elements. Each `meta` element has an `itemprop` attribute that describes the data in its accompanying `content` attribute. After identifying the `itemprop` values for season and episode numbers, I could read the numbers directly and fix the labelling.
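
The `itemprop`/`content` idea can be tried without a browser. The sketch below is only an illustration: the sample markup is hypothetical (the real page's nesting and values will differ), and it uses the standard library's `html.parser` in place of BeautifulSoup so that it runs with no extra dependencies. The `MetaCollector` class is a name invented for this example.

```python
from html.parser import HTMLParser

# Hypothetical markup modelled on the description above; the real page's
# structure and values may differ.
SAMPLE_EPISODE_HTML = """
<div itemprop="episode">
  <meta itemprop="episodeNumber" content="14">
  <meta itemprop="name" content="Example Episode Title">
  <meta itemprop="duration" content="T24M42S">
</div>
"""


class MetaCollector(HTMLParser):
    """Collects itemprop/content pairs from <meta> elements."""

    def __init__(self):
        super().__init__()
        self.props = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'meta':
            return
        attributes = dict(attrs)
        prop, content = attributes.get('itemprop'), attributes.get('content')
        # Keep only meta elements that carry both attributes
        if prop and content:
            self.props[prop] = content


collector = MetaCollector()
collector.feed(SAMPLE_EPISODE_HTML)

# The label comes from the metadata, not from the element's position
episode_label = f"episode_{collector.props['episodeNumber']}"
print(episode_label)  # prints: episode_14
```

Because the number is read from the markup, a partial season whose first listed episode is episode 14 is still labelled `episode_14` rather than `episode_1`.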



In [1]:
# Try replacing the following with another AS show link

adultswim_show_link = 'https://www.adultswim.com/videos/ghost-in-the-shell'

In [2]:
# Imports

from pprint import pprint

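from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

In [3]:
# Functions

def create_browser():
    """Returns a headless Selenium Chrome webdriver."""
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    # Selenium 4 takes the driver path through a Service object; the older
    # executable_path and chrome_options keyword arguments were removed.
    service = Service(executable_path='../utilities/chromedriver')
    browser = webdriver.Chrome(service=service, options=options)
    return browser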


def get_html(as_show_link):
    """Returns the HTML content for a given URL."""
    browser = create_browser()
    try:
        browser.get(as_show_link)
        html = browser.page_source
    finally:
        # Always close the browser, even if the page fails to load
        browser.quit()
    return html


def normalize_episodes(season):
    """Returns a dict of details for each episode in a given season."""
    episodes = season.find_all('div', class_='_29ThWwPi')
    episode_guide = {}

    for episode in episodes:
        episode_data = {}
        episode_number = 0
        meta_tags = episode.find_all('meta')

        for meta_data in meta_tags:
            prop, content = meta_data.get('itemprop'), meta_data.get('content')
            # The episode number comes from the page's metadata, not its position
            if prop == 'episodeNumber':
                episode_number = content
            if prop and content:
                episode_data[prop] = content

        episode_guide[f'episode_{episode_number}'] = episode_data

    return episode_guide


def normalize_season(season):
    """Returns the season number and a dict of episode details for a given season."""
    season_number_tag = season.find(name='meta', attrs={'itemprop': 'seasonNumber'})
    season_number = season_number_tag['content']
    episode_guide = normalize_episodes(season)

    return season_number, episode_guide


def create_show_guide(show_link):
    """Returns a dict of episode details separated by seasons for a given AdultSwim show link."""
    html = get_html(show_link)
    soup = BeautifulSoup(html, 'html.parser')
    seasons = soup.find_all(name='div', attrs={'itemprop': 'containsSeason'})
    show_guide = {}

    for season in seasons:
        season_number, episode_guide = normalize_season(season)
        show_guide[f'season_{season_number}'] = episode_guide

    return show_guide

In [4]:
# Execution

show_guide = create_show_guide(adultswim_show_link)
pprint(show_guide)

{'season_1': {'episode_1': {'contentRating': 'TV-14 SV',
                            'datePublished': '2004-11-07T05:30:00.000Z',
                            'description': 'Public Security Section 9 is an '
                                           'elite special ops unit that works '
                                           'directly under the control of the '
                                           "Prime Minister. They've been "
                                           'called in to rescue a high-ranking '
                                           'government official from a hostage '
                                           "situation. But something doesn't "
                                           'seem right to Major Motoko '
                                           'Kusanagi. After some '
                                           'investigating, Section 9 uncovers '
                                           "a major espionage plot, and it's "
                 

## Synthesizing Into Classes

The example below combines the exploratory code above into classes that follow the builder design pattern, which makes modelling AdultSwim shows easy and repeatable. Check out [`adultswim_builder.py`](https://github.com/SkylerBurger/apis_and_scrapers/blob/master/adultswim/adultswim_builder.py) for a look under the hood at the code behind the classes.
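
For readers unfamiliar with the pattern, here is a rough idea of how a director and builder might fit together in this context. This is a minimal sketch and not the code in `adultswim_builder.py`: the class internals, the class names `Show`, `ShowBuilder`, and `Director`, and the `fetch_guide` parameter are all assumptions made for illustration. Only the attribute and method names (`season_count`, `episode_count`, `season_list`, `get_season`, `get_episode`, `build`) mirror what the cells in this notebook use. A stand-in fetch function replaces the Selenium scrape so the sketch runs offline.

```python
class Show:
    """Plain container for a scraped show guide (hypothetical internals)."""

    def __init__(self, show_link, show_guide):
        self.show_link = show_link
        self.show_guide = show_guide

    @property
    def season_list(self):
        return list(self.show_guide)

    @property
    def season_count(self):
        return len(self.show_guide)

    @property
    def episode_count(self):
        return sum(len(episodes) for episodes in self.show_guide.values())

    def get_season(self, season_number):
        return self.show_guide[f'season_{season_number}']

    def get_episode(self, season_number, episode_number):
        return self.get_season(season_number)[f'episode_{episode_number}']


class ShowBuilder:
    """Assembles a Show from a link and a guide-fetching callable."""

    def __init__(self, fetch_guide):
        self.fetch_guide = fetch_guide

    def build(self, show_link):
        return Show(show_link, self.fetch_guide(show_link))


class Director:
    """Picks the right builder for the requested kind of object."""

    def __init__(self, fetch_guide):
        self.builders = {'show': ShowBuilder(fetch_guide)}

    def build(self, kind, link):
        return self.builders[kind].build(link)


def fake_fetch_guide(show_link):
    """Stand-in for the Selenium scrape so the sketch runs offline."""
    return {
        'season_1': {'episode_1': {'duration': 'T24M42S'}},
        'season_2': {'episode_2': {'duration': 'T24M41S'}},
    }


director = Director(fake_fetch_guide)
show = director.build('show', 'https://www.adultswim.com/videos/ghost-in-the-shell')
print(show.season_count, show.episode_count, show.season_list)
```

The point of the split is that the caller only ever talks to the director, so adding another kind of object later means registering another builder rather than changing calling code.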

In [5]:
from pprint import pprint

from adultswim_builder import AdultSwimDirector

# Create a Director instance to build Shows and Collections
adultswim_director = AdultSwimDirector('../utilities/chromedriver')

In [6]:
# Tell the Director to build a Show with the given link
adultswim_show_link = 'https://www.adultswim.com/videos/ghost-in-the-shell'
ghost_in_the_shell = adultswim_director.build('show', adultswim_show_link)

In [7]:
print('Ghost in the Shell - Season Count:')
pprint(ghost_in_the_shell.season_count)

Ghost in the Shell - Season Count:
2


In [8]:
print('Ghost in the Shell - Episode Count:')
pprint(ghost_in_the_shell.episode_count)

Ghost in the Shell - Episode Count:
52


In [9]:
print('Ghost in the Shell - AdultSwim Show Link:')
pprint(ghost_in_the_shell.show_link)

Ghost in the Shell - AdultSwim Show Link:
'https://www.adultswim.com/videos/ghost-in-the-shell'


In [10]:
print('Ghost in the Shell - Season List:')
pprint(ghost_in_the_shell.season_list)

Ghost in the Shell - Season List:
['season_1', 'season_2']


In [11]:
print('Ghost in the Shell - Show Guide:')
pprint(ghost_in_the_shell.show_guide)

Ghost in the Shell - Show Guide:
{'season_1': {'episode_1': {'contentRating': 'TV-14 SV',
                            'datePublished': '2004-11-07T05:30:00.000Z',
                            'description': 'Public Security Section 9 is an '
                                           'elite special ops unit that works '
                                           'directly under the control of the '
                                           "Prime Minister. They've been "
                                           'called in to rescue a high-ranking '
                                           'government official from a hostage '
                                           "situation. But something doesn't "
                                           'seem right to Major Motoko '
                                           'Kusanagi. After some '
                                           'investigating, Section 9 uncovers '
                                           "a major espionage p

In [12]:
print('Ghost in the Shell - Season 1:')
pprint(ghost_in_the_shell.get_season(1))

Ghost in the Shell - Season 1:
{'episode_1': {'contentRating': 'TV-14 SV',
               'datePublished': '2004-11-07T05:30:00.000Z',
               'description': 'Public Security Section 9 is an elite special '
                              'ops unit that works directly under the control '
                              "of the Prime Minister. They've been called in "
                              'to rescue a high-ranking government official '
                              "from a hostage situation. But something doesn't "
                              'seem right to Major Motoko Kusanagi. After some '
                              'investigating, Section 9 uncovers a major '
                              "espionage plot, and it's up to them to prevent "
                              'a major international incident.',
               'duration': 'T24M42S',
               'expires': '2036-01-31T05:00:00.000Z',
               'thumbnailUrl': 'https://media.cdn.adultswim.com/uploads/202

In [13]:
pprint('Ghost in the Shell - Season 2, Episode 2:')
pprint(ghost_in_the_shell.get_episode(2, 2))

'Ghost in the Shell - Season 2, Episode 2:'
{'contentRating': 'TV-MA',
 'datePublished': '2005-11-27T05:30:00.000Z',
 'description': 'This episode focuses on the life of a refugee living in '
                'Japan. He has plans to "reset the world" and change things. '
                'But are his plans merely violent daydreams, or something more '
                'sinister?',
 'duration': 'T24M41S',
 'expires': '2036-01-31T05:00:00.000Z',
 'thumbnailUrl': 'https://media.cdn.adultswim.com/uploads/20200305/thumbnails/2_20351021482-gits2_2ndgig_002_air_cid-364JF.jpg',
 'uploadDate': '2019-05-02T15:00:00.000Z'}
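
Every value in the guide is a string, so anything like sorting by air date or totalling runtime needs a conversion step first. The sketch below shows one possible way to do that with the standard library. The helper names `parse_duration` and `parse_timestamp` are hypothetical, and the regular expression assumes durations always look like the `T24M41S` values shown above (an optional leading `P` is also accepted).

```python
import re
from datetime import datetime, timedelta

# Matches values such as 'T24M41S'; hours, minutes, and seconds are optional
DURATION_PATTERN = re.compile(r'^P?T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$')


def parse_duration(value):
    """Converts a duration such as 'T24M41S' to a timedelta."""
    match = DURATION_PATTERN.match(value)
    if match is None:
        raise ValueError(f'Unrecognized duration: {value!r}')
    hours, minutes, seconds = (int(part or 0) for part in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def parse_timestamp(value):
    """Converts a timestamp such as '2005-11-27T05:30:00.000Z' to an aware datetime."""
    # %z accepts a trailing 'Z' as UTC on Python 3.7 and later
    return datetime.strptime(value, '%Y-%m-%dT%H:%M:%S.%f%z')


# Values copied from the Season 2, Episode 2 output above
episode = {
    'datePublished': '2005-11-27T05:30:00.000Z',
    'duration': 'T24M41S',
}
runtime = parse_duration(episode['duration'])
published = parse_timestamp(episode['datePublished'])
print(runtime.total_seconds())  # prints: 1481.0
print(published.year)           # prints: 2005
```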
