## Data Collection & Preparation
### Task 1 - Data Identification & Collection:

*API used: **The Movie Database (TMDb)** <br/>
API Link: https://developers.themoviedb.org/3*

In this assignment, I will collect detailed movie data from the TMDb open web API for the last 30 years. I will also collect TV series data for one of the most popular shows, "Game of Thrones".

This notebook covers Task 1 - Data Identification and Collection. Because the API serves movie details by unique movie id, collecting enough data for analysis requires a separate API request for each movie id.

In [1]:
import json, requests, urllib.parse
from pathlib import Path
from datetime import datetime
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
from functools import reduce
import seaborn as sns

Settings for the API and data collection.

In [61]:
# API Key
api_key = "****"
# Prefix for API URLs
TMDB_Prefix = 'https://api.themoviedb.org/3'

Create directory for raw data storage, if it does not already exist:

In [62]:
dir_raw = Path("raw")
dir_raw.mkdir(parents=True, exist_ok=True)

Defining a fetch function to make the API calls for different requests.

In [63]:
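def fetch(endpoint, params=None):
    # construct the url
    url = TMDB_Prefix
    if not endpoint.startswith("/"):
        url += "/"
    url += endpoint
    # Copying the parameters so that the caller's dict (and the default) is never modified
    query = dict(params or {})
    # Logging the URL with the API key masked, so the key never appears in the notebook output
    masked = urllib.parse.urlencode({**query, "api_key": "****"}, safe="*")
    print("Fetching %s?%s" % (url, masked))
    # Passing one common parameter for all requests
    query["api_key"] = api_key
    # fetching the data (the timeout keeps a stalled request from hanging the notebook)
    response = requests.get(url + "?" + urllib.parse.urlencode(query), timeout=30)
    return json.loads(response.text)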
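
As an offline check of the URL construction used by `fetch`, the sketch below rebuilds the same kind of string without sending a request. `build_url` and the placeholder key `"KEY"` are hypothetical names introduced here for illustration; they are not part of the notebook.

```python
from urllib.parse import urlencode

def build_url(prefix, endpoint, params, key):
    """Return the request URL that a fetch-style helper would call (hypothetical helper)."""
    # Make sure exactly one slash separates the prefix and the endpoint
    if not endpoint.startswith("/"):
        endpoint = "/" + endpoint
    # Copy the parameters so the caller's dict is left untouched
    query = dict(params)
    query["api_key"] = key
    return prefix + endpoint + "?" + urlencode(query)

url = build_url("https://api.themoviedb.org/3", "discover/movie",
                {"primary_release_year": 1992, "page": 1}, "KEY")
print(url)
```

Building the URL in a pure function like this makes it easy to confirm the query string before spending any real API requests.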

Defining a function to fetch movie ids for the specified time period

In [64]:
def get_movie_ids(start, end, pages):
    movie_ids = []
    # Year range as specified by the user and the number of pages to fetch per year
    for year in range(start, end+1):
        for page in range(1, pages+1):
            try:
                # Adding the movie ids from this page of results to the list
                discover = fetch("discover/movie", {"primary_release_year":year,"page":page})
                for r in discover["results"]:
                    movie_ids.append(r["id"])
            except (requests.RequestException, KeyError, ValueError):
                # Stop paging through this year on a network error or a response without "results"
                break
    return movie_ids
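
The loop above requests a fixed number of pages per year. TMDb list responses also carry a `total_pages` field, so one possible refinement is to stop once the last available page has been read. The sketch below assumes that field is present and falls back to the page limit if it is not. `collect_ids` and `fake_fetch` are hypothetical names, and the fake fetcher stands in for the notebook's `fetch` so the example runs offline.

```python
def collect_ids(fetch_fn, year, max_pages):
    """Collect movie ids for one year, stopping at the last available page."""
    ids = []
    for page in range(1, max_pages + 1):
        data = fetch_fn("discover/movie", {"primary_release_year": year, "page": page})
        ids.extend(r["id"] for r in data.get("results", []))
        # Assumption: the response reports how many pages exist in total
        if page >= data.get("total_pages", max_pages):
            break
    return ids

def fake_fetch(endpoint, params):
    # Stand-in for fetch(): two pages of made-up ids, no network access
    pages = {1: [101, 102], 2: [103]}
    return {"results": [{"id": i} for i in pages.get(params["page"], [])],
            "total_pages": 2}

ids = collect_ids(fake_fetch, 1992, 20)
print(ids)
```

Passing the fetch function in as an argument is only a convenience for testing; in the notebook the same check could sit directly inside `get_movie_ids`.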

Defining a function to fetch the movie details for the collected movie ids

In [65]:
def fetch_movie_details(movie_ids):
    collection = {}
    for mid in movie_ids:
        # Fetching the movie details for each of the movie ids in the passed list
        movie_details = fetch("/movie/"+str(mid), {"language":"en-US"})
        collection[mid] = movie_details
    return collection

Generating a JSON dump for the movie details

In [None]:
# Fetching the movie ids for release years 1992 through 2022 (inclusive), up to 20 pages per year
movie_ids = get_movie_ids(1992,2022,20)

# Fetching the movie details for the generated movie id list
collection = fetch_movie_details(movie_ids)

# Generating the JSON dump for the movie data collection
fname = "movie_data.json"
out_path = dir_raw / fname
print("Writing data to %s" % out_path)
# The with-block closes the file even if json.dump raises
with open(out_path, "w") as fout:
    json.dump(collection, fout, indent=4, sort_keys=True)
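
One detail to keep in mind when reading the dump back: `collection` is keyed by integer movie ids, but JSON object keys are always strings, so the ids come back as `str`. The sketch below shows the round trip on a tiny made-up dictionary written to a temporary directory. The sample ids and titles are placeholders, not TMDb data.

```python
import json
import tempfile
from pathlib import Path

# Made-up stand-in for the collection built above (integer ids as keys)
sample = {11: {"title": "placeholder A"}, 5: {"title": "placeholder B"}}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "movie_data.json"
    with open(path, "w") as fout:
        json.dump(sample, fout, indent=4, sort_keys=True)
    with open(path) as fin:
        loaded = json.load(fin)

# Keys are strings after the round trip; convert them back if integers are needed
restored = {int(k): v for k, v in loaded.items()}
print(sorted(loaded), sorted(restored))
```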

Defining a function to fetch data for the specified TV show and number of seasons. The function also fetches the show's cast and crew information and the list of similar shows.

In [66]:
def fetch_tvshow_details(tvshow_id, seasons):
    show_details = {}
    # Iterating for the user specified number of seasons
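    for season in range(1, seasons+1):
        try:
            fetched_details = fetch("/tv/"+str(tvshow_id)+"/season/"+str(season), {"language":"en-US"})
        except (requests.RequestException, ValueError):
            # Break out of the loop if the request fails or the response is not valid JSON
            break
        # TMDb error bodies are expected to carry "success": false, e.g. when the season does not exist
        if fetched_details.get("success") is False:
            break
        season_name = "season-"+str(season)
        show_details[season_name] = fetched_details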
    
    # Fetching the similar show details using the API call
    similar_show_details = fetch("/tv/"+str(tvshow_id)+"/similar", {"language":"en-US"})
    show_details["similar_show"] = similar_show_details
    
    # Fetching the cast and crew details using the API call
    show_cast_details = fetch("/tv/"+str(tvshow_id)+"/credits", {"language":"en-US"})
    show_details["show_cast"] = show_cast_details
    
    return show_details
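
The dictionary returned above mixes per-season entries (`season-1`, `season-2`, ...) with the `similar_show` and `show_cast` entries. As a quick sanity check, the sketch below counts episodes per season. It assumes each season response contains an `episodes` list, which should be verified against the actual dump. `episodes_per_season` is a hypothetical helper and the sample data is made up.

```python
def episodes_per_season(show_details):
    """Map each 'season-N' key to the number of episodes found under it."""
    counts = {}
    for key, value in show_details.items():
        # Skip the 'similar_show' and 'show_cast' entries
        if not key.startswith("season-"):
            continue
        # Assumption: a season response holds its episodes in an 'episodes' list
        counts[key] = len(value.get("episodes", []))
    return counts

sample = {
    "season-1": {"episodes": [{"episode_number": 1}, {"episode_number": 2}]},
    "season-2": {"episodes": [{"episode_number": 1}]},
    "similar_show": {"results": []},
    "show_cast": {"cast": [], "crew": []},
}
print(episodes_per_season(sample))
```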

Generating the JSON dump for the "GAME OF THRONES" TV Show (All 8 Seasons)

In [68]:
# Fetching the data for "GAME OF THRONES" TV Show
show_details = fetch_tvshow_details(1399,8)

# Generating the JSON dump for the series data collection
fname = "show_data.json"
out_path = dir_raw / fname
print("Writing data to %s" % out_path)
# The with-block closes the file even if json.dump raises
with open(out_path, "w") as fout:
    json.dump(show_details, fout, indent=4, sort_keys=True)

Fetching https://api.themoviedb.org/3/tv/1399/season/1?language=en-US&api_key=****
Fetching https://api.themoviedb.org/3/tv/1399/season/2?language=en-US&api_key=****
Fetching https://api.themoviedb.org/3/tv/1399/season/3?language=en-US&api_key=****
Fetching https://api.themoviedb.org/3/tv/1399/season/4?language=en-US&api_key=****
Fetching https://api.themoviedb.org/3/tv/1399/season/5?language=en-US&api_key=****
Fetching https://api.themoviedb.org/3/tv/1399/season/6?language=en-US&api_key=****
Fetching https://api.themoviedb.org/3/tv/1399/season/7?language=en-US&api_key=****
Fetching https://api.themoviedb.org/3/tv/1399/season/8?language=en-US&api_key=****
Fetching https://api.themoviedb.org/3/tv/1399/similar?language=en-US&api_key=****
Fe

Note - In this task, I collected the necessary data through multiple TMDb API calls and created two JSON dumps, **movie_data.json** and **show_data.json**, which are saved in the "raw" directory. These will be used in Task 2 for further data preparation and analysis.