## Project Progress Report: Introduction

### Changes from proposal:
- Data Handler role changed from Arman to Lachlan
- Data Visualizer role changed from Lachlan to Arman

Reasons: Arman was ill for several classes in which APIs were covered, and the swap was also convenient given the repair timing of my (Lachlan's) broken built-in keyboard.

### Structure of the Report:

In this report, I pull the data needed to answer our core questions and then outline the next steps.

## Step 0: Imports + Getting API Key from User Data

In [1]:
import requests
from pprint import pprint
import json
import pandas as pd

In [2]:
# getting api key from user secrets
from google.colab import userdata
api_key = userdata.get('TMDB_API')

In [3]:
# I decided to store the data in parquet files after pulling it
from google.colab import drive
drive.mount('drive')

Drive already mounted at drive; to attempt to forcibly remount, call drive.mount("drive", force_remount=True).


## Step 1: Making Sure We Can Call the API

In [4]:
params = {"api_key": api_key}
url = "https://api.themoviedb.org/3/genre/movie/list"
response = requests.get(url,params)
print(response.status_code)

200


## Step 2: Pulling Data

### Core Questions

1. How does budget correlate to user score?
2. How does user score correlate to revenue?
3. Is there a correlation between release year and user score?
4. What are the most common keywords and genres on the list of top rated movies?
5. Who are the most common actors among the top rated movies?


Questions 1-3 use a general sample of movies, and questions 4-5 use the top rated movies.

For the general sample, we decided to pull the movies with the highest vote counts, since they should also have more accurate data for fields like revenue and release year.

The most voted movies dataframe needs:
- user score
- revenue
- release year
- budget

The top rated movies dataframe needs:
- keywords
- genres
- actors

Both will need the movie id from the initial API call in order to fetch additional data. The title might become useful in later steps (e.g. for storytelling), so I'll include that too.

Note: We can do approximately 40 requests/second before being rate limited.
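To stay under that limit during the per-movie requests, calls could be spaced out with a small throttle. This is only a sketch: the `Throttle` class and the choice of 30 requests/second are my own assumptions for illustration, not something the TMDB API provides.

```python
import time


class Throttle:
    """Spaces out calls so that at most `max_per_second` happen per second."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        # clock and sleep are injectable so the class can be tested without waiting
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Blocks until enough time has passed since the previous call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


# hypothetical usage: stay somewhat under the ~40 requests/second limit
throttle = Throttle(max_per_second=30)
# for each request:
#     throttle.wait()
#     response = requests.get(url, params)
```

Calling `throttle.wait()` right before each request keeps consecutive requests at least 1/30 of a second apart.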

### Step 2.1: Getting The Most Voted Movies

The [discover movie](https://developer.themoviedb.org/reference/discover-movie) request allows us to get a list of movies with their id, title (and original title), release date, and vote average (user score).

Budget and revenue can be obtained via the [movie details](https://api.themoviedb.org/3/movie/{movie_id}) request.
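Since budget and revenue both come from the same movie details response, one request per movie could return both values. The sketch below assumes a hypothetical helper, `get_budget_and_revenue`, which takes the HTTP client as an argument (for example the `requests` module) so that it can be tried out without calling the API.

```python
def get_budget_and_revenue(movie_id, api_key, http):
    """
    fetches budget and revenue for one movie with a single request

    http: any object with a requests-style get method (e.g. the requests module)
    returns (budget, revenue), or (None, None) if the request fails
    """
    url = f"https://api.themoviedb.org/3/movie/{movie_id}"
    response = http.get(url, params={"api_key": api_key})
    if response.status_code != 200:
        print(f"ERROR: {response.status_code}")
        return None, None
    details = response.json()
    # .get avoids a KeyError if a field is missing from the response
    return details.get("budget"), details.get("revenue")


# hypothetical usage:
# import requests
# budget, revenue = get_budget_and_revenue(18823, api_key, requests)
```

This would halve the number of movie details requests compared with fetching each field separately.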

In [5]:
discover_movies_url = "https://api.themoviedb.org/3/discover/movie"
params = {"api_key": api_key, "page": 1, "sort_by": 'vote_count.desc'}

# placeholder for getting all of the pages of movie data into one
aggregate = []

# pulls results pages 1-40
for i in range(1,41):
    params["page"] = i

    response = requests.get(discover_movies_url,params)

    # appends each movie from the page to aggregate
    if response.status_code == 200:
        results = response.json()["results"]
        # each entry is a dictionary of movie information
        for entry in results:
            aggregate.append(entry)
    else:
        print(f"ERROR: {response.status_code}")

In [6]:
# turns list of movie info dictionaries into a dataframe and selects desired columns
by_votes = pd.DataFrame(aggregate)
by_votes = by_votes[['id','title','original_title','release_date','vote_average','vote_count']]

In [7]:
by_votes.sample(5)

,id,title,original_title,release_date,vote_average,vote_count
293,82992,Fast & Furious 6,Fast & Furious 6,2013-05-21,6.8,10933
750,2300,Space Jam,Space Jam,1996-11-15,6.8,6246
789,39513,Paul,Paul,2011-02-14,6.7,6050
489,1979,Fantastic Four: Rise of the Silver Surfer,Fantastic Four: Rise of the Silver Surfer,2007-06-13,5.626,8208
762,333484,The Magnificent Seven,The Magnificent Seven,2016-09-14,6.5,6183


In [8]:
# defining a function for getting the budget and revenue
def get_movie_details(id,column):
    """
    fetches a piece of data from a movie by id

    for options, see https://developer.themoviedb.org/reference/movie-details

    used here for "budget" and "revenue"
    """
    movie_details_url = f"https://api.themoviedb.org/3/movie/{id}"
    params = {"api_key": api_key}

    # gives an indicator of how it's going
    # ad-hoc loading screen?
    print(id)

    response = requests.get(movie_details_url,params)
    if response.status_code == 200:
        return response.json()[column]
    else:
        print(f"ERROR: {response.status_code}")

In [None]:
# gets the budget and puts it in a column
by_votes["budget"] = by_votes["id"].apply(lambda x: get_movie_details(x,"budget"))

In [None]:
# gets the revenue and puts it in a column
by_votes["revenue"] = by_votes["id"].apply(lambda x: get_movie_details(x,"revenue"))

In [11]:
by_votes.sample(5)

,id,title,original_title,release_date,vote_average,vote_count,budget,revenue
698,18823,Clash of the Titans,Clash of the Titans,2010-03-26,5.9,6505,125000000,493214993
177,363088,Ant-Man and the Wasp,Ant-Man and the Wasp,2018-07-04,6.9,13835,140000000,622674139
617,302946,The Accountant,The Accountant,2016-10-13,7.13,7059,44000000,155160045
766,861,Total Recall,Total Recall,1990-06-01,7.3,6180,65000000,261317921
555,385128,F9,F9,2021-05-19,7.03,7513,200000000,726229501


In [12]:
# sets the movie id as the index
by_votes.set_index("id",inplace=True)

In [13]:
# budget and revenue default to 0 when there is no value
# replaces 0 values with None
by_votes.loc[by_votes['budget'] == 0, 'budget'] = None
by_votes.loc[by_votes['revenue'] == 0, 'revenue'] = None

In [14]:
# stores as a parquet file (named by_votes.gzip) and copies it to Drive
by_votes.to_parquet("by_votes.gzip")
!cp by_votes.gzip "/content/drive/My Drive/Parquets"
# might change target directory once I set up a GitHub repo
# (around this Friday)

### Step 2.2: Getting the Top-Rated Movies

The [movies: top rated](https://developer.themoviedb.org/reference/movie-top-rated-list) request allows us to get a list of movies with their id, title (and original title), and genre ids.

Genre names can be obtained via [genres: movie list](https://api.themoviedb.org/3/genre/movie/list).

Keywords can be obtained via [movies: keywords](https://api.themoviedb.org/3/movie/%7Bmovie_id%7D/keywords).

Actors can be obtained via [movies: credits](https://api.themoviedb.org/3/movie/%7Bmovie_id%7D/credits).
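The discover and top rated requests both return results one page at a time, so the paging loop could be shared between them. This is a sketch under my own assumptions: `fetch_all_pages` is a hypothetical helper, and it takes the HTTP client as an argument (for example the `requests` module).

```python
def fetch_all_pages(url, params, http, first_page=1, last_page=40):
    """
    collects the "results" entries from a range of pages into one list

    http: any object with a requests-style get method (e.g. the requests module)
    """
    aggregate = []
    for page in range(first_page, last_page + 1):
        # copies params so the caller's dictionary is not modified
        page_params = dict(params, page=page)
        response = http.get(url, params=page_params)
        if response.status_code == 200:
            # each entry is a dictionary of movie information
            aggregate.extend(response.json()["results"])
        else:
            print(f"ERROR: {response.status_code} on page {page}")
    return aggregate


# hypothetical usage:
# import requests
# url = "https://api.themoviedb.org/3/movie/top_rated"
# aggregate = fetch_all_pages(url, {"api_key": api_key}, requests)
```

A failed page is reported and skipped, so the returned list may be shorter than expected.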

In [15]:
top_movies_url = "https://api.themoviedb.org/3/movie/top_rated"
params = {"api_key": api_key, "page": 1}

# placeholder for getting all of the pages of movie data into one
aggregate = []

# pulls results pages 1-40
for i in range(1,41):
    params["page"] = i

    response = requests.get(top_movies_url,params)

    # appends each movie from the page to aggregate
    if response.status_code == 200:
        results = response.json()["results"]
        # each entry is a dictionary of movie information
        for entry in results:
            aggregate.append(entry)
    else:
        print(f"ERROR: {response.status_code}")

In [16]:
# puts into dataframe + limits to desired columns
top_movies = pd.DataFrame(aggregate)
top_movies = top_movies[['id','title','original_title','genre_ids']]

In [17]:
top_movies.sample(5)

,id,title,original_title,genre_ids
193,103663,The Hunt,Jagten,[18]
380,6715,The Cure,The Cure,"[18, 10751]"
426,313369,La La Land,La La Land,"[35, 18, 10749]"
488,501929,The Mitchells vs. the Machines,The Mitchells vs. the Machines,"[16, 12, 35]"
608,335578,Land of Mine,Under sandet,"[10752, 18, 36]"


In [18]:
# getting a dictionary of genre ids to names
genres_url = "https://api.themoviedb.org/3/genre/movie/list"
params = {"api_key": api_key}

genres = {}

response = requests.get(genres_url,params)

if response.status_code == 200:
    for item in response.json()["genres"]:
        genres[item["id"]] = item["name"]
else:
    print(f"ERROR: {response.status_code}")

pprint(genres)


In [19]:
# maps the genre names onto the lists of genre ids
top_movies["genre_names"] = top_movies["genre_ids"].apply(
    lambda ids: [genres[genre_id] for genre_id in ids]
)

In [20]:
top_movies.sample(5)

,id,title,original_title,genre_ids,genre_names
111,694,The Shining,The Shining,"[27, 53]","[Horror, Thriller]"
491,369557,Sing Street,Sing Street,"[10749, 18, 10402, 35]","[Romance, Drama, Music, Comedy]"
200,239,Some Like It Hot,Some Like It Hot,"[35, 10749, 80]","[Comedy, Romance, Crime]"
443,334533,Captain Fantastic,Captain Fantastic,"[12, 18]","[Adventure, Drama]"
34,1891,The Empire Strikes Back,The Empire Strikes Back,"[12, 28, 878]","[Adventure, Action, Science Fiction]"


In [21]:
def get_keyword_data(id):
    """
    gets keywords for a given movie id
    """
    movie_keyword_url = f"https://api.themoviedb.org/3/movie/{id}/keywords"
    params = {"api_key": api_key}

    response = requests.get(movie_keyword_url,params)
    if response.status_code == 200:
        keywords = []
        for item in response.json()["keywords"]:
            keywords.append(item["name"])
        pprint(keywords)
        return keywords
    else:
        print(f"ERROR: {response.status_code}")

In [None]:
# gets keywords and puts them in the dataframe
top_movies["keywords"] = top_movies["id"].apply(lambda x: get_keyword_data(x))

In [23]:
top_movies.sample(5)

,id,title,original_title,genre_ids,genre_names,keywords
604,31011,Mr. Nobody,Mr. Nobody,"[878, 18, 10749]","[Science Fiction, Drama, Romance]","[time travel, surrealism, time, choice, free w..."
671,1062722,Frankenstein,Frankenstein,"[18, 27, 14]","[Drama, Horror, Fantasy]","[monster, based on novel or book, supernatural..."
684,1675,Day for Night,La Nuit américaine,"[35, 18]","[Comedy, Drama]","[lovesickness, movie business, nice, alcoholic..."
763,272,Batman Begins,Batman Begins,"[18, 80, 28]","[Drama, Crime, Action]","[martial arts, undercover, loss of loved one, ..."
513,3083,Mr. Smith Goes to Washington,Mr. Smith Goes to Washington,"[35, 18]","[Comedy, Drama]","[governor, washington dc, usa, senate, senator..."


In [24]:
def get_actors_data(id):
    """
    Gets actors for a given movie id
    """
    movie_credits_url = f"https://api.themoviedb.org/3/movie/{id}/credits"
    params = {"api_key": api_key}
    print(id)

    response = requests.get(movie_credits_url,params)
    if response.status_code == 200:
        actors = []
        for item in response.json()["cast"]:
            # keeps only cast members whose known department is acting
            if item["known_for_department"] == "Acting":
                actors.append(item["name"])
        pprint(actors)
        return actors
    else:
        print(f"ERROR: {response.status_code}")

In [None]:
# gets actors and assigns to dataframe column
top_movies["actors"] = top_movies["id"].apply(lambda x: get_actors_data(x))

In [26]:
top_movies.sample(5)

,id,title,original_title,genre_ids,genre_names,keywords,actors
632,142,Brokeback Mountain,Brokeback Mountain,"[18, 10749]","[Drama, Romance]","[secret love, wyoming, usa, countryside, homop...","[Heath Ledger, Jake Gyllenhaal, Michelle Willi..."
432,14836,Coraline,Coraline,"[16, 10751, 14]","[Animation, Family, Fantasy]","[friendship, dreams, based on novel or book, v...","[Dakota Fanning, Teri Hatcher, Jennifer Saunde..."
55,3782,Ikiru,生きる,[18],[Drama],"[dying and death, japan, bureaucracy, age diff...","[Takashi Shimura, Haruo Tanaka, Nobuo Kaneko, ..."
255,106646,The Wolf of Wall Street,The Wolf of Wall Street,"[80, 18, 35]","[Crime, Drama, Comedy]","[corruption, based on novel or book, drug addi...","[Leonardo DiCaprio, Jonah Hill, Margot Robbie,..."
240,437068,A Taxi Driver,택시운전사,"[28, 18, 36]","[Action, Drama, History]","[taxi, taxi driver, protest, based on true sto...","[Song Kang-ho, Thomas Kretschmann, Yoo Hai-jin..."


In [27]:
# sets index to movie id
top_movies.set_index("id",inplace=True)

In [28]:
top_movies.sample(5)

id,title,original_title,genre_ids,genre_names,keywords,actors
635302,Demon Slayer -Kimetsu no Yaiba- The Movie: Mug...,劇場版「鬼滅の刃」無限列車編,"[16, 28, 14, 53]","[Animation, Action, Fantasy, Thriller]","[fight, magic, supernatural, psychology, gore,...","[Natsuki Hanae, Akari Kito, Hiro Shimono, Yosh..."
1091,The Thing,The Thing,"[27, 9648, 878]","[Horror, Mystery, Science Fiction]","[spacecraft, helicopter, space marine, based o...","[Kurt Russell, Keith David, Wilford Brimley, T..."
666277,Past Lives,Past Lives,"[18, 10749]","[Drama, Romance]","[new york city, immigrant, friendship, nostalg...","[Greta Lee, Teo Yoo, John Magaro, Moon Seung-a..."
666,Central Station,Central do Brasil,[18],[Drama],"[rio de janeiro, letter, wilderness, teacher, ...","[Fernanda Montenegro, Vinícius de Oliveira, Ma..."
1398,Stalker,Сталкер,"[878, 18]","[Science Fiction, Drama]","[based on novel or book, guard, wish, stalker,...","[Alisa Freyndlikh, Aleksandr Kaydanovskiy, Ana..."


In [29]:
# stores as a parquet file (named top_movies.gzip) and copies it to Drive
top_movies.to_parquet("top_movies.gzip")
!cp top_movies.gzip "/content/drive/My Drive/Parquets"

## Step 3: Next Steps

I will meet with the team to discuss how to process and visualize the data that we have pulled.

I expect that questions 1-3 can be answered, at least roughly, by plotting the relevant columns against each other in scatter plots.
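Alongside the scatter plots, a correlation coefficient would give a single number for each of these questions. In practice pandas' `Series.corr` would probably be the tool; the sketch below spells out the Pearson calculation in plain Python, with a hypothetical `pearson_r` function that skips pairs where either value is missing (as happens for movies with no budget or revenue).

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of paired values, skipping pairs with a missing value."""
    pairs = [
        (x, y)
        for x, y in zip(xs, ys)
        if x is not None and y is not None
        and not math.isnan(x) and not math.isnan(y)
    ]
    n = len(pairs)
    if n < 2:
        return None  # not enough data to compute a correlation
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)


# hypothetical usage:
# r = pearson_r(by_votes["budget"], by_votes["vote_average"])
```

A value near 1 or -1 suggests a strong linear relationship, and a value near 0 suggests little linear relationship.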

There are some additional processing steps that will need to take place between the data and the visualization. We might want to extract the year from the date in by_votes. We will need a function for tallying the genres and the keywords as well as the actors.
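As a rough idea of what those processing steps could look like, here is a sketch with two hypothetical helpers: `release_year` for pulling the year out of a `YYYY-MM-DD` release date, and `tally` for counting items across a column of lists. The names and the handling of missing values are my own assumptions.

```python
from collections import Counter
from datetime import date


def release_year(release_date):
    """Returns the year from a 'YYYY-MM-DD' string, or None if it is missing or malformed."""
    try:
        return date.fromisoformat(release_date).year
    except (TypeError, ValueError):
        return None


def tally(list_column):
    """Counts how often each item appears across an iterable of lists."""
    counts = Counter()
    for items in list_column:
        if items is None:  # e.g. a failed request left no list for this movie
            continue
        counts.update(items)
    return counts


# hypothetical usage:
# by_votes["release_year"] = by_votes["release_date"].apply(release_year)
# tally(top_movies["keywords"]).most_common(10)
```

The same `tally` function would work for the genre names, keywords, and actors columns, since each holds a list per movie.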

I will likely ask Aron to do some of the data cleaning/processing so that we all have a coding-related task to do.

I need to make a GitHub repo so that we can collaborate more seamlessly and track changes.

Note: I'm not entirely sure about my decision to pickle the dataframes. They could alternatively be stored as parquet, .json, or .csv files.

Update: I switched to storing them as parquet files because it is more secure than pickle (loading a pickle can execute arbitrary code) and because I was running into version compatibility issues.