In [1]:
import pandas as pd
import sqlite3

# The Movie Database (tmdb.movies.csv)

In [2]:
df_movies = pd.read_csv('./data/tmdb.movies.csv.gz', index_col=0)

In [3]:
df_movies.sort_values(by='release_date', ascending=False).head(10)

Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
26057,"[27, 80, 80, 80, 80, 80, 80]",570704,en,Murdery Christmas,0.84,2020-12-25,Murdery Christmas,0.0,1
24265,"[10749, 18]",428836,en,Ophelia,8.715,2019-06-28,Ophelia,0.0,4
24892,[99],541577,en,This Changes Everything,3.955,2019-06-28,This Changes Everything,0.0,1
24819,[18],481880,en,Trial by Fire,4.48,2019-05-17,Trial by Fire,7.0,3
24297,[18],415085,en,All Creatures Here Below,8.316,2019-05-17,All Creatures Here Below,5.0,5
24003,"[18, 9648, 53]",411144,en,We Have Always Lived in the Castle,14.028,2019-05-17,We Have Always Lived in the Castle,5.2,24
25006,[99],500850,es,El silencio de otros,3.299,2019-05-08,The Silence of Others,8.5,15
24612,[99],541576,en,Meeting Gorbachev,5.87,2019-05-03,Meeting Gorbachev,10.0,3
24691,"[18, 28, 80]",547590,en,El Chicano,5.274,2019-05-03,El Chicano,9.0,1
25012,[99],523994,en,Hesburgh,3.262,2019-04-26,Hesburgh,10.0,1


In [4]:
df_movies.head()

Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


In [5]:
df_movies.shape


(26517, 9)

In [6]:
df_movies.columns

Index(['genre_ids', 'id', 'original_language', 'original_title', 'popularity',
       'release_date', 'title', 'vote_average', 'vote_count'],
      dtype='object')

In [7]:
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26517 entries, 0 to 26516
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   genre_ids          26517 non-null  object 
 1   id                 26517 non-null  int64  
 2   original_language  26517 non-null  object 
 3   original_title     26517 non-null  object 
 4   popularity         26517 non-null  float64
 5   release_date       26517 non-null  object 
 6   title              26517 non-null  object 
 7   vote_average       26517 non-null  float64
 8   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(2), object(5)
memory usage: 2.0+ MB


In [8]:
df_movies['genre_ids'].value_counts()

[99]                     3700
[]                       2479
[18]                     2268
[35]                     1660
[27]                     1145
                         ... 
[27, 878, 28, 53, 80]       1
[27, 18, 9648]              1
[18, 27, 28, 53, 878]       1
[37, 35]                    1
[10752, 36, 18]             1
Name: genre_ids, Length: 2477, dtype: int64

### Genre Keys
These keys map to the numbers in the "genre_ids" column
(per https://www.themoviedb.org/talk/5daf6eb0ae36680011d7e6ee)
- Action          28
- Adventure       12
- Animation       16
- Comedy          35
- Crime           80
- Documentary     99
- Drama           18
- Family          10751
- Fantasy         14
- History         36
- Horror          27
- Music           10402
- Mystery         9648
- Romance         10749
- Science Fiction 878
- TV Movie        10770
- Thriller        53
- War             10752
- Western         37

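Adding a new column for genre. First, check how "genre_ids" is stored: calling list() on a single value splits it into characters, so each entry is a string rather than a list of integers.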

In [20]:
list(df_movies.iloc[1]['genre_ids'])

['[',
 '1',
 '4',
 ',',
 ' ',
 '1',
 '2',
 ',',
 ' ',
 '1',
 '6',
 ',',
 ' ',
 '1',
 '0',
 '7',
 '5',
 '1',
 ']']
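The character-by-character output suggests that each "genre_ids" value is a string such as "[14, 12, 16, 10751]". A minimal sketch of one way to parse such strings and count individual genres (rather than genre combinations) follows. The toy Series and the `genre_names` subset are stand-ins for illustration, not the real data.

```python
import ast
import pandas as pd

# Toy stand-in for df_movies['genre_ids']; values are assumed to be strings
genre_ids = pd.Series(['[12, 14, 10751]', '[99]', '[]', '[18, 99]'])

# Subset of the genre key listed above, for illustration only
genre_names = {12: 'Adventure', 14: 'Fantasy', 18: 'Drama',
               99: 'Documentary', 10751: 'Family'}

# Parse each string into a real Python list
parsed = genre_ids.map(ast.literal_eval)

# Give every genre ID its own row; empty lists become NaN and are dropped
exploded = parsed.explode().dropna()

# Translate IDs to names and count how often each genre appears
genre_counts = exploded.map(genre_names).value_counts()
print(genre_counts)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text read from a file.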

In [9]:
genre_dict = {28: 'Action', 12: 'Adventure', 16: 'Animation', 35: 'Comedy', 80: 'Crime', 99: 'Documentary', 18: 'Drama', 10751: 'Family', 14: 'Fantasy', 36: 'History', 27: 'Horror', 10402: 'Music', 9648: 'Mystery', 10749: 'Romance', 878: 'Science Fiction', 10770: 'TV Movie', 53: 'Thriller', 10752: 'War', 37: 'Western'}

In [15]:
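import ast

def create_genre(genre_ids_str):
    # genre_ids is stored as a string such as "[14, 12, 16]", so parse it into a list first
    genre_ids = ast.literal_eval(genre_ids_str)
    # Look up each ID in genre_dict and return the list of genre names
    return [genre_dict[genre_id] for genre_id in genre_ids]

In [16]:
df_movies['genre'] = df_movies['genre_ids'].map(create_genre)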

# Rotten Tomatoes (rt.movie_info.tsv)

In [10]:
df_info = pd.read_csv('./data/rt.movie_info.tsv.gz', delimiter='\t')

In [None]:
df_info.head()

In [None]:
df_info['theater_date'] = pd.to_datetime(df_info['theater_date'])

df_info.sort_values(by='theater_date', ascending=False).head(10)

In [None]:
df_info.shape

In [None]:
df_info.columns

In [None]:
df_info.info()

# The Numbers (tn.movie_budgets.csv)

In [None]:
df_budget = pd.read_csv('./data/tn.movie_budgets.csv', index_col=0)

In [None]:
df_budget['release_date'] = pd.to_datetime(df_budget['release_date'])

In [None]:
df_budget.sort_values(by='release_date', ascending=False).head(10)

In [None]:
df_budget.head()

In [None]:
df_budget.shape

In [None]:
df_budget.columns

In [None]:
df_budget.info()

# Box Office Mojo (bom.movie_gross.csv)



In [None]:
df_gross = pd.read_csv('./data/bom.movie_gross.csv')

In [None]:
df_gross.head()

In [None]:
df_gross.sort_values(by='year', ascending=False)

In [None]:
df_gross.shape

In [None]:
df_gross.columns

In [None]:
df_gross.info()

# IMDb (im.db)

In [None]:
conn = sqlite3.connect("./data/im.db")

In [None]:
q0 = """
SELECT *
FROM movie_ratings
;
"""

df_ratings = pd.read_sql(q0, conn)

In [None]:
df_ratings.shape

In [None]:
df_ratings.head()

In [None]:
q1 = """
SELECT *
FROM movie_akas
;
"""

df_movie_akas = pd.read_sql(q1, conn)

In [None]:
df_movie_akas.shape

In [None]:
df_movie_akas.head()

In [None]:
q2 = """
SELECT *
FROM movie_basics
;
"""

df_movie_basics = pd.read_sql(q2, conn)


In [None]:
df_movie_basics.shape

In [None]:
df_movie_basics.head()

In [None]:
pd.read_sql("""
SELECT *
FROM movie_basics
WHERE
    start_year > 2010 AND
    start_year <= 2022
ORDER BY
    start_year DESC
;
""", conn)


In [None]:
conn.close()

# Problem & Questions

Business Problem:

You are charged with exploring what types of films are currently doing the best at the box office. You must then translate those findings into actionable insights that the head of Microsoft's new movie studio can use to help decide what type of films to create.

Questions:

- Which movie genres are doing well currently?
    - Are movie franchises (e.g. Star Wars, Marvel, Harry Potter) more successful than one-off films?
- What does "doing well" mean?
    - Presumably, total gross > production budget. The Numbers (data source) includes both domestic and international receipts in Worldwide Gross.
    - Based on the data we have, we could measure success by worldwide gross, IMDb rating, or The Movie Database popularity.
    - Financial and audience measures both matter for a new studio, but we should probably lean toward measuring success monetarily, since the studio needs some easy wins.
        - Maybe measure against MSFT's financial position?
- Which actors appear in those successful movies?
- Which writers/directors are responsible for those movies?
- Is rating significant?
- Does movie length have any impact?
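As a rough illustration of the monetary definition of "doing well", the sketch below computes profit and return on investment from budget and gross figures. The rows are made up, and the column names (`production_budget`, `worldwide_gross`) and the "$1,000"-style string format are assumptions about the budget file, not something confirmed in this notebook.

```python
import pandas as pd

# Hypothetical rows; column names and currency-string format are assumptions
budgets = pd.DataFrame({
    'movie': ['Film A', 'Film B', 'Film C'],
    'production_budget': ['$100,000,000', '$20,000,000', '$5,000,000'],
    'worldwide_gross': ['$350,000,000', '$15,000,000', '$50,000,000'],
})

def to_number(money):
    """Strip '$' and ',' from a currency string and return an integer."""
    return int(money.replace('$', '').replace(',', ''))

# Convert the currency strings to integers so arithmetic works
for col in ['production_budget', 'worldwide_gross']:
    budgets[col] = budgets[col].map(to_number)

# Profit: worldwide gross minus production budget
budgets['profit'] = budgets['worldwide_gross'] - budgets['production_budget']
# ROI: profit relative to what the film cost to make
budgets['roi'] = budgets['profit'] / budgets['production_budget']
# Simple success flag: total gross > production budget
budgets['profitable'] = budgets['profit'] > 0

print(budgets.sort_values('roi', ascending=False))
```

ROI may be the more useful measure for a new studio, since a small film that returns many times its budget can rank above a blockbuster with a larger absolute profit.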




Data Cleaning Thoughts:

- We'll need to decide what "doing well currently" means. We can eliminate a lot of bloat by ignoring movies that were released too long ago to be relevant. Maybe we start with movies released in the last 10-15 years?

- We probably want to identify the columns we want from each csv/db and then create one dataframe to work with. In addition to making comparisons easier, I think that's our most realistic path to solving the file size issue git has with im.db.

- The im.db file has by far the most records of any of our data sources. We could start from the movie_basics table (146k rows), add columns from the other sources to it, and drop records for which we don't have enough info.

- Remember we don't have to use all data sources. We don't want to create problems with unknown values if a data file is very short (Rotten Tomatoes). Will see what happens with that.
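The cleaning ideas above (restrict to recent releases, start from movie_basics, attach columns from other tables, drop incomplete records) can be sketched roughly as follows. This uses an in-memory SQLite database with made-up rows. The column names (`movie_id`, `primary_title`, `averagerating`, `numvotes`) and the 2010 cutoff are assumptions for illustration only.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for im.db; column names below are assumptions
conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT, start_year INTEGER);
CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL, numvotes INTEGER);
INSERT INTO movie_basics VALUES
    ('tt1', 'Old Film', 2005),
    ('tt2', 'Recent Film', 2018),
    ('tt3', 'Unrated Film', 2019);
INSERT INTO movie_ratings VALUES
    ('tt1', 6.1, 500),
    ('tt2', 7.4, 12000);
""")

# Keep only recent releases and attach ratings in a single query
query = """
SELECT b.movie_id, b.primary_title, b.start_year, r.averagerating, r.numvotes
FROM movie_basics AS b
LEFT JOIN movie_ratings AS r
    ON b.movie_id = r.movie_id
WHERE b.start_year >= ?
ORDER BY b.start_year DESC
;
"""
df_base = pd.read_sql(query, conn, params=(2010,))
conn.close()

# Drop records for which we don't have enough info (here: no rating)
df_clean = df_base.dropna(subset=['averagerating'])
print(df_clean)
```

Filtering and joining in SQL keeps the resulting dataframe small. A combined dataframe like `df_clean` could then be saved with `to_csv` to a `.csv.gz` file, which may help with the file size issue.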


Taking a stab at the columns in our ideal dataframe:
   - Movie Title
   - Genre(s)
   - Release Year
   - Rating (R/PG-13/etc)
   - Total Gross
   - Production Budget
   - Associated Actors/Star/TBD
   - Writer
   - Director
   - Run Time
   - IMDB Rating
   - The Movie Database Popularity (still need to understand what this is)
   - The Movie Database Vote Count