# Scrape Data From JustWatch

## About The Notebook

>This notebook uses a third-party API to scrape data from [justwatch.com](https://www.justwatch.com/). The GitHub repository for the API can be found [here](https://github.com/dawoudt/JustWatchAPI). If you have not installed the JustWatch library yet, uncomment the next cell and run it. JustWatch.com has information on all the titles available on the Crunchyroll and Funimation platforms, and this notebook makes API calls to retrieve the shows and movies for both. The calls are made inside for loops, so I suggest pausing between running each loop; otherwise you may hit an error from making too many API calls.

In [None]:
# !pip install JustWatch

In [None]:
# import libraries
import pandas as pd
import time

# import JustWatchAPI
from justwatch import JustWatch

The cell below instantiates the JustWatch class and sets the country parameter to the United States.

In [None]:
# instantiate JustWatch class
just_watch = JustWatch(country='US')
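
Because the loops later in this notebook make many requests, a call can occasionally fail when the API is hit too often. Below is a minimal sketch of a retry wrapper with exponential backoff. `call_with_retry` is a hypothetical helper, not part of the JustWatch library, and since I do not know which exception the library raises when rate limited, it simply retries on any `Exception`.

```python
import time


def call_with_retry(func, *args, retries=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying with exponential backoff on failure.

    Hypothetical helper: the exception type raised on rate limiting is an
    assumption, so any Exception triggers a retry.
    """
    for attempt in range(retries):
        try:
            return func(*args, **kwargs)
        except Exception:
            # re-raise once the final attempt has failed
            if attempt == retries - 1:
                raise
            # wait 1x, 2x, 4x ... the base delay before trying again
            sleep(base_delay * 2 ** attempt)
```

It could wrap any of the calls below, for example `call_with_retry(just_watch.search_for_item, providers=['fmn'], page=1, content_types=['show'])`.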

## Helper Function

>The API returns a dictionary with information about the requested show or movie. I only need some of those fields, so the function below builds a new dictionary containing just the information I need.

In [None]:
def create_show_dict(show_title, show_id):    
    # show dictionary
    show_dict = {}

    # keys and values
    show_dict['jw_entity_id'] = show_title['jw_entity_id']
    show_dict['id'] = show_title['id']
    show_dict['title'] = show_title['title']
    
    if 'poster' in show_title.keys():
        show_dict['poster'] = show_title['poster']
    else:
        show_dict['poster'] = None
    
    if 'short_description' in show_title.keys():
        show_dict['description'] = show_title['short_description']
    else:
        show_dict['description'] = None
        
    if 'original_release_year' in show_title.keys():
        show_dict['release_year'] = show_title['original_release_year']
    else:
        show_dict['release_year'] = None
        
    show_dict['type'] = show_title['object_type']
    show_dict['imdb_popularity'] = None
    show_dict['tmdb_popularity'] = None
    show_dict['imdb_score'] = None
    show_dict['imdb_votes'] = None
    show_dict['tmdb_score'] = None
    
    try:
        # runs a loop on scoring to get the proper values
        for item in show_title['scoring']:
            if item['provider_type'] == 'imdb:popularity':
                show_dict['imdb_popularity'] = item['value']

            if item['provider_type'] == 'tmdb:popularity':
                show_dict['tmdb_popularity'] = item['value']

            if item['provider_type'] == 'imdb:score':
                show_dict['imdb_score'] = item['value']       

            if item['provider_type'] == 'imdb:votes':
                show_dict['imdb_votes'] = item['value']

            if item['provider_type'] == 'tmdb:score':    
                show_dict['tmdb_score'] = item['value']
    except (KeyError, TypeError):
        # leave the scores as None when 'scoring' is missing or malformed
        pass
    
    show_dict['imdb_id'] = None
    show_dict['tmdb_id'] = None

    # runs a for loop on external_ids to get the proper values
    # (.get avoids a KeyError when a title has no external_ids)
    for item in show_title.get('external_ids', []):
        if item['provider'] == 'imdb_latest':
            show_dict['imdb_id'] = item['external_id']

        if item['provider'] == 'tmdb_latest':
            show_dict['tmdb_id'] = item['external_id']
            
    if 'genre_ids' in show_title.keys():
        show_dict['genre_ids'] = show_title['genre_ids']
    else:
        show_dict['genre_ids'] = None
    
    # checks if age_certification is in dictionary keys
    if 'age_certification' in show_title.keys():
        show_dict['age_certification'] = show_title['age_certification']
    else:
        show_dict['age_certification'] = None
        
    if 'runtime' in show_title.keys():
        show_dict['runtime'] = show_title['runtime']
    else:
        show_dict['runtime'] = None
    
    if 'production_countries' in show_title.keys():
        show_dict['production_countries'] = show_title['production_countries']
    else:
        show_dict['production_countries'] = None
    
    if 'seasons' in show_title.keys():
        show_dict['seasons'] = len(show_title['seasons'])
    else:
        show_dict['seasons'] = 0
    
    return show_dict
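
The optional-key checks in `create_show_dict` all follow one pattern: copy the value if the key is present, otherwise store a default. Below is a minimal sketch of the same idea using `dict.get`. The names `extract_fields` and `scores_by_type` are hypothetical, and the sample record is invented for illustration rather than real API output.

```python
def extract_fields(record, keys, default=None):
    """Copy the listed keys from record, using default for any that are missing."""
    return {key: record.get(key, default) for key in keys}


def scores_by_type(record):
    """Map each scoring entry's provider_type to its value (empty if no scoring)."""
    return {item['provider_type']: item['value'] for item in record.get('scoring', [])}


# made-up record shaped like the fields used above, not real API output
sample = {
    'title': 'Example Show',
    'runtime': 24,
    'scoring': [{'provider_type': 'imdb:score', 'value': 8.1}],
}

fields = extract_fields(sample, ['title', 'runtime', 'poster'])
scores = scores_by_type(sample)
print(fields)
print(scores.get('imdb:score'), scores.get('tmdb:score'))
```

Missing keys such as `poster` and `tmdb:score` come back as `None`, which matches the defaults the helper function assigns.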

>Now I will make API calls inside for loops to extract information on shows and movies for Funimation and Crunchyroll. I will start with Funimation, running shows and movies separately, then create a dataframe for each and combine them into one Funimation dataframe. I will do the same for the Crunchyroll data. At the end, I will combine the Funimation dataframe with the Crunchyroll dataframe.
>
>Below are the for loops that make the API calls. I have to run for loops because `search_for_item` only returns about 30 titles per call, as one 'page', with different titles on different pages. The outer loop goes through each page of titles, and an inner loop goes through each item on the page to extract the information needed using the `create_show_dict` function.
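
The paging logic described above can be sketched independently of the API. Here `fetch_page` is a hypothetical stand-in for a call such as `search_for_item`, and the fake pages are invented so the loop can run offline.

```python
def collect_items(fetch_page, max_pages=50):
    """Request pages 1..max_pages-1 and stop at the first page with no items."""
    collected = []
    for page in range(1, max_pages):
        results = fetch_page(page)
        # an empty 'items' list means there are no more pages
        if not results['items']:
            break
        collected.extend(results['items'])
    return collected


# fake two-page result set standing in for the API (illustration only)
fake_pages = {1: [{'id': 1}, {'id': 2}], 2: [{'id': 3}]}


def fake_fetch(page):
    return {'items': fake_pages.get(page, [])}


all_items = collect_items(fake_fetch)
print([item['id'] for item in all_items])
```

The loop stops at page 3 because the fake API returns an empty `items` list there, which is the same stopping condition the cells below use.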

## Funimation Shows and Movies

### Shows

The first for loop gets the shows for Funimation.

In [None]:
# for loop to run through each item from provider
# place it into a list of dictionaries
funimation_shows = []
for n in range(1,50):
    # pause 1 second before each page request
    time.sleep(1)
    results = just_watch.search_for_item(providers=['fmn'],
                                     page=n,
                                     content_types=['show'])
    
    # check if 'items' is empty
    if not results['items']:
        break
    else:
        # for loop to run through results['items']
        for show in results['items']:

            # sets the show id so it can be used in the get_title method
            show_id = show['id']
            show_title = just_watch.get_title(title_id=show_id, content_type='show')

            show_dict = create_show_dict(show_title, show_id)

            # adds dictionary to show list
            funimation_shows.append(show_dict)

In [None]:
# create a Funimation shows dataframe from the list of dictionaries above
fun_shows_df = pd.DataFrame.from_dict(funimation_shows)

In [None]:
# take a look at the new dataframe
fun_shows_df.head(3)

### Movies

The next for loop gets the movies for Funimation.

In [None]:
# for loop to run through each item from provider
# place it into a list of dictionaries
funimation_movies = []
for n in range(1,50):
    # pause 1 second before each page request
    time.sleep(1)
    results = just_watch.search_for_item(providers=['fmn'],
                                     page=n,
                                     content_types=['movie'])
    
    # check if 'items' is empty
    if not results['items']:
        break
    else:
        # for loop to run through results['items']
        for show in results['items']:

            # sets the show id so it can be used in the get_title method
            show_id = show['id']
            show_title = just_watch.get_title(title_id=show_id, content_type='movie')

            show_dict = create_show_dict(show_title, show_id)

            # adds dictionary to show list
            funimation_movies.append(show_dict)

In [None]:
# create a Funimation movies dataframe from the list of dictionaries above
fun_movies_df = pd.DataFrame.from_dict(funimation_movies)

In [None]:
# take a look at the new dataframe
fun_movies_df.head(3)

In [None]:
# combine the two dataframes into one Funimation dataframe
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
funimation_titles = pd.concat([fun_shows_df, fun_movies_df])

In [None]:
# take a look at the shape of the dataframe
funimation_titles.shape

There are 848 titles in the Funimation dataframe, which matches the number of titles listed on the JustWatch.com website.

In [None]:
# adding streaming_app column to dataframe
funimation_titles['streaming_app'] = 'Funimation'

In [None]:
# take a look at the Funimation dataframe
funimation_titles.head(3)

## Crunchyroll Shows and Movies

### Shows

The first for loop gets the shows for Crunchyroll.

In [None]:
# for loop to run through each item from provider
# place it into a list of dictionaries
crunchyroll_shows = []
for n in range(1,50):
    # pause 2 seconds before each page request
    time.sleep(2)
    results = just_watch.search_for_item(providers=['cru'],
                                     page=n,
                                     content_types=['show'])
    
    # check if 'items' is empty
    if not results['items']:
        break
    else:
        # for loop to run through results['items']
        for show in results['items']:

            # sets the show id so it can be used in the get_title method
            show_id = show['id']
            show_title = just_watch.get_title(title_id=show_id, content_type='show')

            show_dict = create_show_dict(show_title, show_id)

            # adds dictionary to show list
            crunchyroll_shows.append(show_dict)

In [None]:
# create a Crunchyroll shows dataframe from the list of dictionaries above
crunchyroll_shows_df = pd.DataFrame.from_dict(crunchyroll_shows)

In [None]:
# look at the shape of the dataframe
crunchyroll_shows_df.shape

In [None]:
# look at the new dataframe
crunchyroll_shows_df.head(3)

### Movies

The next for loop gets the movies for Crunchyroll.

In [None]:
# for loop to run through each item from provider
# place it into a list of dictionaries
crunchyroll_movies = []
for n in range(1,50):
    # pause 2 seconds before each page request
    time.sleep(2)
    results = just_watch.search_for_item(providers=['cru'],
                                     page=n,
                                     content_types=['movie'])
    
    # check if 'items' is empty
    if not results['items']:
        break
    else:
        # for loop to run through results['items']
        for show in results['items']:

            # sets the show id so it can be used in the get_title method
            show_id = show['id']
            show_title = just_watch.get_title(title_id=show_id, content_type='movie')

            show_dict = create_show_dict(show_title, show_id)

            # adds dictionary to show list
            crunchyroll_movies.append(show_dict)

In [None]:
# create a Crunchyroll movies dataframe from the list of dictionaries above
crunchyroll_movies_df = pd.DataFrame(crunchyroll_movies)

In [None]:
# look at the shape of the dataframe
crunchyroll_movies_df.shape

In [None]:
# combine the movies and shows into one Crunchyroll dataframe
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
crunchyroll_titles = pd.concat([crunchyroll_shows_df, crunchyroll_movies_df])

In [None]:
# look at the shape of the dataframe
crunchyroll_titles.shape

There are 1,093 titles in the dataframe, which matches the number shown on the JustWatch.com website.

In [None]:
# adding streaming_app column
crunchyroll_titles['streaming_app'] = 'Crunchyroll'

In [None]:
crunchyroll_titles.head(3)

## Combining Funimation and Crunchyroll

>Finally, I am going to combine the Funimation and Crunchyroll dataframes into one and save it to a CSV file. This file will be used to create the recommendation system.
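
As a small illustration of this step, the sketch below stacks two toy dataframes with `pd.concat` and labels each with its platform. The rows are invented, and `ignore_index=True` is an optional choice in this sketch that renumbers the combined rows from 0.

```python
import pandas as pd

# toy frames standing in for the scraped data (invented rows)
fun_df = pd.DataFrame([{'id': 1, 'title': 'Show A'}])
cr_df = pd.DataFrame([{'id': 2, 'title': 'Show B'}, {'id': 3, 'title': 'Movie C'}])

# tag each frame with its platform before stacking
fun_df['streaming_app'] = 'Funimation'
cr_df['streaming_app'] = 'Crunchyroll'

# stack rows; ignore_index renumbers the combined rows from 0
combined = pd.concat([fun_df, cr_df], ignore_index=True)
print(combined.shape)
print(combined['streaming_app'].value_counts().to_dict())
```

Without `ignore_index=True`, each frame keeps its own row labels, so the combined index contains repeated values.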

In [None]:
# combine the Funimation and Crunchyroll dataframes
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
titles = pd.concat([funimation_titles, crunchyroll_titles])

In [None]:
# look at the shape of the dataframe
titles.shape

In [None]:
# look at the first few rows of the dataframe
titles.head(3)

In [None]:
# save titles to csv
titles.to_csv('./Data/titles.csv')
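
`to_csv` does not create missing folders, so the save above fails if `./Data` does not exist yet. Below is a minimal sketch that creates the folder first and reads the file back to check the round trip. It writes to a temporary directory and uses an invented one-row frame so it can run anywhere; `index=False` is my choice here and is not what the cell above does.

```python
import tempfile
from pathlib import Path

import pandas as pd

# invented one-row frame standing in for `titles`
toy_titles = pd.DataFrame([{'id': 1, 'title': 'Show A', 'streaming_app': 'Funimation'}])

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / 'Data'
    # create the output folder if it is missing
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / 'titles.csv'

    # index=False leaves the row index out of the file
    toy_titles.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded.head())
```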