## Project Progress Report: Introduction

### Changes from proposal:
- Data Handler role changed from Arman to Lachlan
- Data Visualizer role changed from Lachlan to Arman

Reasons: Arman was ill for several of the classes in which APIs were covered. The swap was also more convenient given the timing of the repair of my (Lachlan's) broken built-in keyboard.

### Structure of the Report:

In this report, I pull the data needed to answer our core questions and then outline next steps.

## Step 0: Imports + Getting API Key from User Data

In [None]:
import requests
from pprint import pprint
import json
import pandas as pd

In [None]:
# getting api key from user secrets
from google.colab import userdata
api_key = userdata.get('TMDB_API')

In [None]:
# I decided to store the data in .pkl files after pulling it
from google.colab import drive
drive.mount('drive')

Mounted at drive


## Step 1: Making Sure We Can Call the API

In [None]:
params = {"api_key": api_key}
url = "https://api.themoviedb.org/3/genre/movie/list"
response = requests.get(url, params=params)
print(response.status_code)

200


## Step 2: Pulling Data

### Core Questions

1. How does budget correlate to user score?
2. How does user score correlate to revenue?
3. Is there a correlation between release year and user score?
4. What are the most common keywords and genres on the list of top rated movies?
5. Who are the most common actors among the top rated movies?


Questions 1-3 are on a general sample of movies and questions 4-5 are on top rated movies.

For a general sample of movies, we decided to pull the movies with the highest vote counts, since widely voted movies should also have more complete and accurate data for fields like revenue and release year.

The most voted movies dataframe needs:
- user score
- revenue
- release year
- budget

The top rated movies dataframe needs:
- keywords
- genres
- actors

Both dataframes will need the movie id from the initial API call in order to fetch additional data. The title might become useful in later steps (e.g., for storytelling), so I'll include that too.

Note: We can do approximately 40 requests/second before being rate limited.
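The rate limit noted above could be respected with a small throttle that spaces out calls. This is a minimal sketch, not something the notebook currently does: the `Throttle` class and the choice of 30 requests/second (a safety margin under the approximate limit of 40) are assumptions for illustration.

```python
import time


class Throttle:
    """Spaces out calls so that at most `max_per_second` start each second."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock  # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Blocks until enough time has passed since the previous call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


# hypothetical usage: call throttle.wait() right before each requests.get(...)
throttle = Throttle(max_per_second=30)  # 30 is an assumed margin, not a TMDB figure
```

Calling `throttle.wait()` at the top of each fetching function would keep the per-movie loops below the limit without changing anything else about them.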

### Step 2.1: Getting The Most Voted Movies

The [discover movie](https://developer.themoviedb.org/reference/discover-movie) request allows us to get a list of movies with their id, title (and original title), release date, and vote average (user score).

Budget and revenue can be obtained via the [movie details](https://api.themoviedb.org/3/movie/{movie_id}) request.

In [None]:
discover_movies_url = "https://api.themoviedb.org/3/discover/movie"
params = {"api_key": api_key, "page": 1, "sort_by": 'vote_count.desc'}

# placeholder for getting all of the pages of movie data into one
aggregate = []

# pulls results pages 1-40
for i in range(1,41):
    params["page"] = i

    response = requests.get(discover_movies_url,params)

    # adds every movie on the page to aggregate
    if response.status_code == 200:
        # each entry in results is a dictionary of movie information
        aggregate.extend(response.json()["results"])
    else:
        print(f"ERROR: {response.status_code}")

In [None]:
# turns list of movie info dictionaries into a dataframe and selects desired columns
by_votes = pd.DataFrame(aggregate)
by_votes = by_votes[['id','title','original_title','release_date','vote_average','vote_count']]

In [None]:
by_votes.sample(5)

Unnamed: 0,id,title,original_title,release_date,vote_average,vote_count
536,395992,Life,Life,2017-03-22,6.445,7677
691,242,The Godfather Part III,The Godfather Part III,1990-12-25,7.418,6572
783,426,Vertigo,Vertigo,1958-05-28,8.2,6077
224,399055,The Shape of Water,The Shape of Water,2017-12-01,7.24,12514
776,460465,Mortal Kombat,Mortal Kombat,2021-04-07,7.004,6098


In [None]:
# defining a function for getting the budget and revenue
def get_movie_details(id,column):
    """
    fetches a piece of data from a movie by id

    for options, see https://developer.themoviedb.org/reference/movie-details

    used here for "budget" and "revenue"
    """
    movie_details_url = f"https://api.themoviedb.org/3/movie/{id}"
    params = {"api_key": api_key}

    # prints the id as a simple progress indicator
    print(id)

    response = requests.get(movie_details_url,params)
    if response.status_code == 200:
        return response.json()[column]
    else:
        print(f"ERROR: {response.status_code}")

In [None]:
# gets the budget and puts it in a column
by_votes["budget"] = by_votes["id"].apply(lambda x: get_movie_details(x,"budget"))

In [None]:
# gets the revenue and puts it in a column
by_votes["revenue"] = by_votes["id"].apply(lambda x: get_movie_details(x,"revenue"))
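Fetching budget and revenue in separate passes requests each movie's details twice. A possible alternative is to fetch the details once and pick out several fields from the same response. This is only a sketch: `pick_fields`, `get_movie_fields`, and the `fetch_json` argument are hypothetical names, and a dictionary built from one row of this notebook's sample output stands in for the API so that the block runs offline.

```python
def pick_fields(details, fields):
    """Returns the requested fields from one movie-details dictionary.

    Missing fields come back as None instead of raising KeyError.
    """
    return {field: details.get(field) for field in fields}


def get_movie_fields(movie_id, fields, fetch_json):
    """Fetches a movie's details once and returns several fields from them.

    fetch_json is a hypothetical callable that takes a movie id and returns
    the parsed details dictionary, or None if the request failed.
    """
    details = fetch_json(movie_id)
    if details is None:
        return {field: None for field in fields}
    return pick_fields(details, fields)


# stand-in for the API, using one row from this notebook's sample output
fake_api = {
    773: {"title": "Little Miss Sunshine", "budget": 8000000, "revenue": 100523181}
}
print(get_movie_fields(773, ["budget", "revenue"], fake_api.get))
```

In the notebook, `fetch_json` would wrap the `requests.get` call, and the resulting column of dictionaries could then be expanded into separate columns (for example with `.apply(pd.Series)`).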

In [None]:
by_votes.sample(5)

Unnamed: 0,id,title,original_title,release_date,vote_average,vote_count,budget,revenue
550,773,Little Miss Sunshine,Little Miss Sunshine,2006-07-26,7.692,7527,8000000,100523181
552,9693,Children of Men,Children of Men,2006-09-22,7.6,7517,76000000,70595464
592,492188,Marriage Story,Marriage Story,2019-09-28,7.732,7200,18000000,333686
483,10764,Quantum of Solace,Quantum of Solace,2008-10-29,6.335,8214,200000000,589593688
763,302699,Central Intelligence,Central Intelligence,2016-06-15,6.416,6166,50000000,216972543


In [None]:
# sets the movie id as the index
by_votes.set_index("id",inplace=True)

In [None]:
# budget and revenue default to 0 when there is no value
# replaces 0 values with None
by_votes.loc[by_votes['budget'] == 0, 'budget'] = None
by_votes.loc[by_votes['revenue'] == 0, 'revenue'] = None

In [None]:
# stores in .pkl file
by_votes.to_pickle("by_votes.pkl")
!cp by_votes.pkl "/content/drive/My Drive/"
# might change the target directory once I set up a GitHub repo
# (likely around this Friday)

### Step 2.2: Getting the Top-Rated Movies

The [movies: top rated](https://developer.themoviedb.org/reference/movie-top-rated-list) request allows us to get a list of movies with their id, title (and original title), and genre ids.

Genre names can be obtained via [genres: movie list](https://api.themoviedb.org/3/genre/movie/list).

Keywords can be obtained via [movies: keywords](https://api.themoviedb.org/3/movie/%7Bmovie_id%7D/keywords).

Actors can be obtained via [movies: credits](https://api.themoviedb.org/3/movie/%7Bmovie_id%7D/credits).

In [None]:
top_movies_url = "https://api.themoviedb.org/3/movie/top_rated"
params = {"api_key": api_key, "page": 1}

# placeholder for getting all of the pages of movie data into one
aggregate = []

# pulls results pages 1-40
for i in range(1,41):
    params["page"] = i

    response = requests.get(top_movies_url,params)

    # adds every movie on the page to aggregate
    if response.status_code == 200:
        # each entry in results is a dictionary of movie information
        aggregate.extend(response.json()["results"])
    else:
        print(f"ERROR: {response.status_code}")

In [None]:
# puts into dataframe + limits to desired columns
top_movies = pd.DataFrame(aggregate)
top_movies = top_movies[['id','title','original_title','genre_ids']]

In [None]:
top_movies.sample(5)

Unnamed: 0,id,title,original_title,genre_ids
408,20722,La Maison en Petits Cubes,つみきのいえ,[16]
698,40662,Batman: Under the Red Hood,Batman: Under the Red Hood,"[9648, 80, 16]"
605,58129,The Phantom Carriage,Körkarlen,"[18, 14, 27]"
227,4348,Pride & Prejudice,Pride & Prejudice,"[18, 10749]"
620,149870,The Wind Rises,風立ちぬ,"[18, 16, 10749, 10752, 36]"


In [None]:
# getting a dictionary of genre ids to names
genres_url = "https://api.themoviedb.org/3/genre/movie/list"
params = {"api_key": api_key}

genres = {}

response = requests.get(genres_url,params)

if response.status_code == 200:
    for item in response.json()["genres"]:
        genres[item["id"]] = item["name"]
else:
    print(f"ERROR: {response.status_code}")

pprint(genres)


In [None]:
# maps the genre names onto the lists of genre ids
top_movies["genre_names"] = top_movies["genre_ids"].apply(
    lambda ids: [genres[genre_id] for genre_id in ids]
)
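The mapping above raises a `KeyError` if a movie carries a genre id that is missing from `genres` (for instance, if the genre request failed and the dictionary stayed empty). A more forgiving lookup might look like the sketch below. The `genre_names` helper and the `"Unknown"` placeholder are assumptions, and the small dictionary only contains id-to-name pairs that appear in this notebook's output.

```python
# a few id -> name pairs that appear in this notebook's output
genres = {18: "Drama", 80: "Crime", 9648: "Mystery"}


def genre_names(genre_ids, lookup, unknown="Unknown"):
    """Maps genre ids to names, using a placeholder for ids not in lookup."""
    return [lookup.get(genre_id, unknown) for genre_id in genre_ids]


print(genre_names([80, 18, 9648], genres))  # ids from the Anatomy of a Murder row
print(genre_names([18, -1], genres))        # -1 is a made-up id for illustration
```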

In [None]:
top_movies.sample(5)

Unnamed: 0,id,title,original_title,genre_ids,genre_names
302,11658,Tae Guk Gi: The Brotherhood of War,태극기 휘날리며,"[28, 12, 18, 36, 10752]","[Action, Adventure, Drama, History, War]"
528,93,Anatomy of a Murder,Anatomy of a Murder,"[80, 18, 9648]","[Crime, Drama, Mystery]"
254,800,The Young and the Damned,Los olvidados,"[18, 80]","[Drama, Crime]"
445,1402,The Pursuit of Happyness,The Pursuit of Happyness,[18],[Drama]
717,113833,The Normal Heart,The Normal Heart,[18],[Drama]


In [None]:
def get_keyword_data(id):
    """
    gets keywords for a given movie id
    """
    movie_keyword_url = f"https://api.themoviedb.org/3/movie/{id}/keywords"
    params = {"api_key": api_key}

    response = requests.get(movie_keyword_url,params)
    if response.status_code == 200:
        keywords = []
        for item in response.json()["keywords"]:
            keywords.append(item["name"])
        pprint(keywords)
        return keywords
    else:
        print(f"ERROR: {response.status_code}")

In [None]:
# gets keywords and puts them in the dataframe
top_movies["keywords"] = top_movies["id"].apply(get_keyword_data)

In [None]:
top_movies.sample(5)

Unnamed: 0,id,title,original_title,genre_ids,genre_names,keywords
344,3090,The Treasure of the Sierra Madre,The Treasure of the Sierra Madre,"[12, 18, 37]","[Adventure, Drama, Western]","[gold, mexico, based on novel or book, greed, ..."
558,631,Sunrise: A Song of Two Humans,Sunrise: A Song of Two Humans,"[18, 10749]","[Drama, Romance]","[adultery, lake, love triangle, pig, marriage ..."
507,89,Indiana Jones and the Last Crusade,Indiana Jones and the Last Crusade,"[12, 28]","[Adventure, Action]","[saving the world, nazi, holy grail, venice, i..."
248,698687,Transformers One,Transformers One,"[16, 878, 12, 10751]","[Animation, Science Fiction, Adventure, Family]","[based on toy, giant robot, aftercreditsstinge..."
486,331482,Little Women,Little Women,"[18, 10749, 36]","[Drama, Romance, History]","[new york city, sibling relationship, based on..."


In [None]:
def get_actors_data(id):
    """
    Gets actors for a given movie id
    """
    movie_credits_url = f"https://api.themoviedb.org/3/movie/{id}/credits"
    params = {"api_key": api_key}
    print(id)

    response = requests.get(movie_credits_url,params)
    if response.status_code == 200:
        actors = []
        for item in response.json()["cast"]:
            # keeps only cast members who are known for acting
            if item["known_for_department"] == "Acting":
                actors.append(item["name"])
        pprint(actors)
        return actors
    else:
        print(f"ERROR: {response.status_code}")

In [None]:
# gets actors and assigns to dataframe column
top_movies["actors"] = top_movies["id"].apply(get_actors_data)

In [None]:
top_movies.sample(5)

Unnamed: 0,id,title,original_title,genre_ids,genre_names,keywords,actors
622,1366,Rocky,Rocky,[18],[Drama],"[underdog, philadelphia, pennsylvania, transpo...","[Sylvester Stallone, Talia Shire, Burt Young, ..."
184,100,"Lock, Stock and Two Smoking Barrels","Lock, Stock and Two Smoking Barrels","[35, 80]","[Comedy, Crime]","[ambush, joint, alcohol, shotgun, tea, machism...","[Vinnie Jones, Jason Flemyng, Dexter Fletcher,..."
481,777,Grand Illusion,La Grande Illusion,"[18, 36, 10752]","[Drama, History, War]","[prisoner, france, countryside, escape, german...","[Jean Gabin, Pierre Fresnay, Erich von Strohei..."
339,48035,Ordet,Ordet,[18],[Drama],"[faith, religion, black and white, religious f...","[Henrik Malberg, Birgitte Federspiel, Emil Has..."
666,6844,The Ten Commandments,The Ten Commandments,"[18, 36]","[Drama, History]","[epic, egypt, israel, moses, ten commandments,...","[Charlton Heston, Yul Brynner, Anne Baxter, Ed..."


In [None]:
# sets index to movie id
top_movies.set_index("id",inplace=True)

In [None]:
top_movies.sample(5)

id,title,original_title,genre_ids,genre_names,keywords,actors
160885,Tel chi el telùn,Tel chi el telùn,[35],[Comedy],[cabaret],"[Aldo Baglio, Giovanni Storti, Giacomo Poretti..."
15859,A Moment to Remember,내 머리 속의 지우개,"[18, 10749]","[Drama, Romance]","[alzheimer's disease, love tested]","[Jung Woo-sung, Son Ye-jin, Baek Jong-hak, Lee..."
505262,My Hero Academia: Two Heroes,僕のヒーローアカデミア THE MOVIE ～2人の英雄～,"[16, 28, 12, 14]","[Animation, Action, Adventure, Fantasy]","[japan, hero, superhero, school, fighting, hos...","[Daiki Yamashita, Kenta Miyake, Mirai Shida, K..."
12,Finding Nemo,Finding Nemo,"[16, 10751]","[Animation, Family]","[fish, sydney, australia, parent child relatio...","[Albert Brooks, Ellen DeGeneres, Alexander Gou..."
666277,Past Lives,Past Lives,"[18, 10749]","[Drama, Romance]","[new york city, immigrant, friendship, nostalg...","[Greta Lee, Teo Yoo, John Magaro, Moon Seung-a..."


In [None]:
# pickles
top_movies.to_pickle("top_movies.pkl")
!cp top_movies.pkl "/content/drive/My Drive/"

## Step 3: Next Steps

I will meet with the team to discuss how to process and visualize the data that we have pulled.

I expect that questions 1-3 can be answered, at least roughly, by plotting the relevant columns against each other in scatter plots.
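Alongside the scatter plots, a correlation coefficient would put a number on each relationship. The sketch below computes a Pearson coefficient by hand; the `pearson` helper is a hypothetical name, and the inputs are the budget and vote_average values from the five sampled `by_votes` rows shown earlier, which are far too few to draw any conclusion from.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# budget and vote_average from the five sampled by_votes rows shown earlier
budgets = [8000000, 76000000, 18000000, 200000000, 50000000]
scores = [7.692, 7.6, 7.732, 6.335, 6.416]

r = pearson(budgets, scores)
print(f"r = {r:.3f}")
```

On the full dataframe, pandas offers the same calculation directly, e.g. `by_votes["budget"].corr(by_votes["vote_average"])`.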

There are some additional processing steps that will need to take place between the data and the visualization. We might want to extract the year from the date in by_votes. We will need a function for tallying the genres and the keywords as well as the actors.
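The two processing steps mentioned above could be sketched as follows. The `tally` and `release_year` helpers are hypothetical names, and the sample values are taken from rows shown in this notebook's output, plus one `None` row to represent a failed request. The sketch assumes dates arrive as `YYYY-MM-DD` strings and that a missing date is empty or `None`.

```python
from collections import Counter


def tally(list_column):
    """Counts how often each item appears across a column of lists.

    Rows that are None (for example, a failed request) are skipped.
    """
    counts = Counter()
    for items in list_column:
        if items is not None:
            counts.update(items)
    return counts


def release_year(release_date):
    """Extracts the year from a 'YYYY-MM-DD' string; returns None if empty."""
    if not release_date:
        return None
    return int(release_date[:4])


# genre_names values taken from sampled top_movies rows, plus one missing row
genre_column = [["Crime", "Drama", "Mystery"], ["Drama", "Crime"], ["Drama"], None]
print(tally(genre_column).most_common(2))  # [('Drama', 3), ('Crime', 2)]
print(release_year("1958-05-28"))          # 1958 (Vertigo's release_date)
```

The same `tally` function would work for the keywords and actors columns, since all three hold lists of strings.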

I will likely ask Aron to do some of the data cleaning/processing so that we all have a coding-related task to do.

I need to make a GitHub repo so that we can collaborate more seamlessly and track changes.

Note: I'm not entirely sure about my decision to pickle the dataframes. They could alternatively be stored as Parquet, JSON, or CSV files.
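One consideration for that choice is how each format handles the list-valued columns (genre_names, keywords, actors). The sketch below uses only the standard library, with one row shaped like `top_movies`, to show that a CSV round trip turns a list into a string while a JSON round trip keeps it as a list. It says nothing about Parquet, and pickle files have their own caveat: they should only be loaded from trusted sources.

```python
import csv
import io
import json

# one row shaped like top_movies, with a list-valued column
row = {"id": 1402, "title": "The Pursuit of Happyness", "genre_names": ["Drama"]}

# CSV: the list is written as its string representation
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
buffer.seek(0)
csv_row = next(csv.DictReader(buffer))

# JSON: the list survives the round trip
json_row = json.loads(json.dumps(row))

print(type(csv_row["genre_names"]), csv_row["genre_names"])
print(type(json_row["genre_names"]), json_row["genre_names"])
```

If CSV were chosen, the list columns would need to be parsed back into lists after loading before any tallying could happen.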