# Capstone Breakdown Group 1A

**Computing Vision (a made-up company for the purposes of this project) sees all the big companies creating original video content and they want to get in on the fun. They have decided to create a new movie studio, but they don’t have much background in creating movies. You are charged with exploring what types of films are currently doing the best at the box office using different samples of available data. You then will translate those findings into actionable insights that the head of Computing Vision's new movie studio can use to help decide what type of films to create.**

## EDA process

Our project is divided into the following two umbrella categories:

PROFIT:

- Budget vs. Revenue
- Genre vs. Revenue
- Popularity vs. Revenue
- Foreign/Domestic Results vs. Revenue

POPULARITY:

- Director vs. Popularity/Voter Avg.
- Genre vs. Popularity/Voter Avg.
- Domestic/International vs. Popularity/Voter Avg.

## Specific insight

We are performing EDA to answer questions about the following specific points:

- Revenue compared to rating of the film (critics and audience)
- I.P. and foreign/domestic revenue
- Original language to revenue
- Market: domestic/global
- Writers and directors to revenue

## Import Packages

In [1]:
# Import packages

import numpy as np
import pandas as pd
import sqlite3
import seaborn as sns
import matplotlib.pyplot as plt 
import itertools
%matplotlib inline

## Read In Data

In [2]:
# Read data sets

rtDF = pd.read_csv("Data/rt.movie_info.tsv", sep="\t") #Rotten Tomatoes Movies
rtDF_reviews = pd.read_csv("Data/rt.reviews.tsv", sep="\t", encoding = "latin_1") #Rotten Tomatoes Reviews
bomDF = pd.read_csv("Data/bom.movie_gross.csv") #Box Office Mojo Database
tmdbDF = pd.read_csv("Data/tmdb.movies.csv",index_col=0) #The MovieDB
tnmDF = pd.read_csv("Data/tn.movie_budgets.csv") #The Numbers

conn = sqlite3.connect('Data/im.db')
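
The `conn` object is opened above but not queried in this section. As a minimal sketch, the tables in an SQLite file could be listed and previewed with pandas as shown below. The example uses an in-memory database with a hypothetical `movie_basics` table so that it runs without `Data/im.db`; the table and column names in the real file may differ.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for Data/im.db (assumption: the real file is a regular SQLite database)
demo_conn = sqlite3.connect(":memory:")
# Hypothetical table and columns, used only for illustration
demo_conn.execute("CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT)")
demo_conn.execute("INSERT INTO movie_basics VALUES ('tt0000001', 'Example Film')")
demo_conn.commit()

# sqlite_master is SQLite's built-in catalog of tables, indexes and views
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type = 'table'", demo_conn)
print(tables)

# Any table can then be previewed in the same way as the CSV files
preview = pd.read_sql("SELECT * FROM movie_basics LIMIT 5", demo_conn)
print(preview)
```

Swapping `demo_conn` for `conn` should give the list of tables actually stored in `Data/im.db`.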

## Preview Data

In [3]:
#Visualize rotten tomatoes
print(rtDF.info())
rtDF.head()


#Visualize rotten tomatoes reviews
print(rtDF_reviews.info())
rtDF_reviews.head()


#Visualize Box office mojo
print(bomDF.info())
bomDF.head()


#Visualize the movieDB
print(tmdbDF.info())
tmdbDF.head()


#Visualize the numbers
print(tnmDF.info())
tnmDF.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1560 entries, 0 to 1559
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            1560 non-null   int64 
 1   synopsis      1498 non-null   object
 2   rating        1557 non-null   object
 3   genre         1552 non-null   object
 4   director      1361 non-null   object
 5   writer        1111 non-null   object
 6   theater_date  1201 non-null   object
 7   dvd_date      1201 non-null   object
 8   currency      340 non-null    object
 9   box_office    340 non-null    object
 10  runtime       1530 non-null   object
 11  studio        494 non-null    object
dtypes: int64(1), object(11)
memory usage: 146.4+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54432 entries, 0 to 54431
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          54432 non-null  int64 
 1   review      48

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


## Cleaning Up Data

### Rotten Tomatoes DF Cleaning

In [4]:
# Drop columns that are sparsely populated or outside the scope of this analysis
rtDF.drop(columns=['currency','box_office','studio','synopsis','dvd_date'],inplace=True)

In [5]:
rtDF.head()

Unnamed: 0,id,rating,genre,director,writer,theater_date,runtime
0,1,R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971",104 minutes
1,3,R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012",108 minutes
2,5,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996",116 minutes
3,6,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994",128 minutes
4,7,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,200 minutes


In [6]:
def splits(value):
    # Split a pipe-delimited genre string into a list of genres;
    # missing values become an empty list instead of ['nan']
    if pd.isna(value):
        return []
    return str(value).split("|")

rtDF["genre"] = rtDF["genre"].apply(splits)
rtDF

Unnamed: 0,id,rating,genre,director,writer,theater_date,runtime
0,1,R,"[Action and Adventure, Classics, Drama]",William Friedkin,Ernest Tidyman,"Oct 9, 1971",104 minutes
1,3,R,"[Drama, Science Fiction and Fantasy]",David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012",108 minutes
2,5,R,"[Drama, Musical and Performing Arts]",Allison Anders,Allison Anders,"Sep 13, 1996",116 minutes
3,6,R,"[Drama, Mystery and Suspense]",Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994",128 minutes
4,7,NR,"[Drama, Romance]",Rodney Bennett,Giles Cooper,,200 minutes
...,...,...,...,...,...,...,...
1555,1996,R,"[Action and Adventure, Horror, Mystery and Sus...",,,"Aug 18, 2006",106 minutes
1556,1997,PG,"[Comedy, Science Fiction and Fantasy]",Steve Barron,Terry Turner|Tom Davis|Dan Aykroyd|Bonnie Turner,"Jul 23, 1993",88 minutes
1557,1998,G,"[Classics, Comedy, Drama, Musical and Performi...",Gordon Douglas,,"Jan 1, 1962",111 minutes
1558,1999,PG,"[Comedy, Drama, Kids and Family, Sports and Fi...",David Mickey Evans,David Mickey Evans|Robert Gunter,"Apr 1, 1993",101 minutes
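
The `runtime` and `theater_date` columns are still stored as strings at this point. The following is a possible sketch of converting them, using sample values copied from the preview above. The `sample` frame and the `runtime_min` column name are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Sample rows in the same format as rtDF, including one missing date
sample = pd.DataFrame({
    "theater_date": ["Oct 9, 1971", "Aug 17, 2012", None],
    "runtime": ["104 minutes", "108 minutes", "200 minutes"],
})

# Pull the digits out of strings such as "104 minutes"
sample["runtime_min"] = sample["runtime"].str.extract(r"(\d+)", expand=False).astype(float)

# Parse dates such as "Oct 9, 1971"; missing or unparseable values become NaT
sample["theater_date"] = pd.to_datetime(
    sample["theater_date"], format="%b %d, %Y", errors="coerce"
)

print(sample.dtypes)
```

Using `float` for the runtime keeps the conversion safe if some rows have no runtime, since missing values stay as `NaN`.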


### Rotten Tomatoes Review DF Cleaning

In [7]:
rtDF_reviews.info()
rtDF_reviews.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54432 entries, 0 to 54431
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          54432 non-null  int64 
 1   review      48869 non-null  object
 2   rating      40915 non-null  object
 3   fresh       54432 non-null  object
 4   critic      51710 non-null  object
 5   top_critic  54432 non-null  int64 
 6   publisher   54123 non-null  object
 7   date        54432 non-null  object
dtypes: int64(2), object(6)
memory usage: 3.3+ MB


Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"
5,3,... Cronenberg's Cosmopolis expresses somethin...,,fresh,Michelle Orange,0,Capital New York,"September 11, 2017"
6,3,"Quickly grows repetitive and tiresome, meander...",C,rotten,Eric D. Snider,0,EricDSnider.com,"July 17, 2013"
7,3,Cronenberg is not a director to be daunted by ...,2/5,rotten,Matt Kelemen,0,Las Vegas CityLife,"April 21, 2013"
8,3,"Cronenberg's cold, exacting precision and emot...",,fresh,Sean Axmaker,0,Parallax View,"March 24, 2013"
9,3,Over and above its topical urgency or the bit ...,,fresh,Kong Rithdee,0,Bangkok Post,"March 4, 2013"


In [8]:
print(sum(rtDF_reviews['rating'].isna()))
# Dropping top_critic, publisher, and date columns because the information provided is not relevant to the scope of this study
rtDF_reviews.drop(columns=["top_critic","publisher","date"],inplace=True)

13517


In [9]:
# Encode 'fresh' as a binary value: 1 is fresh, 0 is rotten
rtDF_reviews['fresh'] = rtDF_reviews['fresh'].map({'fresh': 1, 'rotten': 0})

In [10]:
rtDF_reviews.groupby(['id'])['fresh'].sum()

id
3       103
5        18
6        32
8        56
10       50
       ... 
1996     96
1997     10
1998      2
1999     27
2000     18
Name: fresh, Length: 1135, dtype: int64

In [11]:
rtDF_reviews.groupby(['id', 'fresh']).size()

id    fresh
3     0         60
      1        103
5     0          5
      1         18
6     0         25
              ... 
1998  1          2
1999  0         19
      1         27
2000  0         20
      1         18
Length: 2070, dtype: int64

In [12]:
# Group the reviews by movie id
rtDF_grouped = rtDF_reviews.groupby(['id'])
# Add a column with the number of fresh reviews for each movie id
rtDF_reviews['sum_fresh'] = rtDF_grouped['fresh'].transform('sum')
# Add a column with the total number of reviews for each movie id
rtDF_reviews['count_fresh'] = rtDF_grouped['fresh'].transform('count')

In [13]:
# Add a column with the share of fresh reviews per movie
# (fresh reviews / total reviews), stored as a fraction between 0 and 1
rtDF_reviews['percentage'] = rtDF_reviews['sum_fresh'] / rtDF_reviews['count_fresh']
rtDF_reviews

Unnamed: 0,id,review,rating,fresh,critic,sum_fresh,count_fresh,percentage
0,3,A distinctly gallows take on contemporary fina...,3/5,1,PJ Nabarro,103,163,0.631902
1,3,It's an allegory in search of a meaning that n...,,0,Annalee Newitz,103,163,0.631902
2,3,... life lived in a bubble in financial dealin...,,1,Sean Axmaker,103,163,0.631902
3,3,Continuing along a line introduced in last yea...,,1,Daniel Kasman,103,163,0.631902
4,3,... a perverse twist on neorealism...,,1,,103,163,0.631902
...,...,...,...,...,...,...,...,...
54427,2000,The real charm of this trifle is the deadpan c...,,1,Laura Sinagra,18,38,0.473684
54428,2000,,1/5,0,Michael Szymanski,18,38,0.473684
54429,2000,,2/5,0,Emanuel Levy,18,38,0.473684
54430,2000,,2.5/5,0,Christopher Null,18,38,0.473684
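
The columns above repeat the same per-movie values on every review row. If one row per movie is more convenient (for example, before joining with `rtDF`), the same three quantities could be computed with a single aggregation. This is a sketch on a toy frame; `reviews` and `fresh_summary` are illustrative names.

```python
import pandas as pd

# Toy stand-in for rtDF_reviews after 'fresh' has been encoded as 1/0
reviews = pd.DataFrame({
    "id": [3, 3, 3, 5, 5],
    "fresh": [1, 0, 1, 1, 1],
})

# One row per movie: fresh reviews, total reviews, and their ratio.
# The mean of a 0/1 column equals sum / count.
fresh_summary = (
    reviews.groupby("id")["fresh"]
    .agg(sum_fresh="sum", count_fresh="count", percentage="mean")
    .reset_index()
)
print(fresh_summary)
```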


In [14]:
# rtDF_counts_fresh = rtDF_reviews.groupby(['id', 'fresh']).size().to_frame('size')
# rtDF_counts_fresh

In [15]:
# rtDF_counts_fresh['size']

In [16]:
# rtDF_counts_fresh.reset_index(level=0, inplace=True)

In [17]:
# rtDF_counts_fresh['id'][1]

In [18]:
rtDF_reviews['rating'].unique()

array(['3/5', nan, 'C', '2/5', 'B-', '2/4', 'B', '3/4', '4/5', '4/4',
       '6/10', '1/4', '8', '2.5/4', '4/10', '2.0/5', '3/10', '7/10', 'A-',
       '5/5', 'F', '3.5/4', 'D+', '1.5/4', '3.5/5', '8/10', 'B+', '9/10',
       '2.5/5', '7.5/10', '5.5/10', 'C-', '1.5/5', '1/5', '5/10', 'C+',
       '0/5', '6', '0.5/4', 'D', '3.1/5', '3/6', '4.5/5', '0/4', '2/10',
       'D-', '7', '1/10', '3', 'A+', 'A', '4.0/4', '9.5/10', '2.5',
       '2.1/2', '6.5/10', '3.7/5', '8.4/10', '9', '1', '7.2/10', '2.2/5',
       '0.5/10', '5', '0', '2', '4.5', '7.7', '5.0/5', '8.5/10', '3.0/5',
       '0.5/5', '1.5/10', '3.0/4', '2.3/10', '4.5/10', '4/6', '3.5',
       '8.6/10', '6/8', '2.0/4', '2.7', '4.2/10', '5.8', '4', '7.1/10',
       '5/4', 'N', '3.5/10', '5.8/10', 'R', '4.0/5', '0/10', '5.0/10',
       '5.9/10', '2.4/5', '1.9/5', '4.9', '7.4/10', '1.5', '2.3/4',
       '8.8/10', '4.0/10', '2.2', '3.8/10', '6.8/10', '7.3', '7.0/10',
       '3.2', '4.2', '8.4', '5.5/5', '6.3/10', '7.6/10', '8.1/10',
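
The `rating` values mix several scales (fractions such as `3/5`, letter grades, bare numbers). One possible way to make part of this column comparable is to convert only the `x/y` entries to a score between 0 and 1 and leave everything else missing. The helper below is a sketch under those assumptions: `fraction_to_score` is a hypothetical name, and treating values above 1 (such as `5/4`) as invalid is a judgment call, not something the data confirms.

```python
import pandas as pd

def fraction_to_score(rating):
    """Convert 'x/y' strings to x / y; return NaN for anything else."""
    if not isinstance(rating, str) or "/" not in rating:
        # Letter grades, bare numbers and NaN are left unscored
        return float("nan")
    numerator, denominator = rating.split("/", 1)
    try:
        score = float(numerator) / float(denominator)
    except (ValueError, ZeroDivisionError):
        return float("nan")
    # Assumption: scores outside [0, 1] (e.g. '5/4') are data-entry errors
    return score if 0 <= score <= 1 else float("nan")

ratings = pd.Series(["3/5", None, "C", "2.5/4", "8", "5/4"])
scores = ratings.apply(fraction_to_score)
print(scores)
```

Letter grades and bare numbers would need their own mapping, which depends on decisions the team has not made here.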
  

### Box Office Mojo DF Cleaning

In [19]:
print(bomDF.info())
bomDF.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB
None


Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010
5,The Twilight Saga: Eclipse,Sum.,300500000.0,398000000,2010
6,Iron Man 2,Par.,312400000.0,311500000,2010
7,Tangled,BV,200800000.0,391000000,2010
8,Despicable Me,Uni.,251500000.0,291600000,2010
9,How to Train Your Dragon,P/DW,217600000.0,277300000,2010


In [20]:
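# Drop rows where domestic gross is NaN; copy to avoid SettingWithCopyWarning
bomDF = bomDF[bomDF['domestic_gross'].notna()].copy()
# Remove thousands separators from foreign gross and convert it to numeric
bomDF['foreign_gross'] = bomDF['foreign_gross'].replace(',','', regex=True)
bomDF["foreign_gross"] = pd.to_numeric(bomDF["foreign_gross"])
# Studio is not used in this analysis
bomDF.drop(columns=['studio'],inplace=True)
bomDF.info()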

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3359 entries, 0 to 3386
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3359 non-null   object 
 1   domestic_gross  3359 non-null   float64
 2   foreign_gross   2009 non-null   float64
 3   year            3359 non-null   int64  
dtypes: float64(2), int64(1), object(1)
memory usage: 131.2+ KB
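
After this step `foreign_gross` still has 1,350 missing values (2,009 of 3,359 rows are non-null). One option, sketched below on a toy frame, is to flag the missing rows and treat them as zero when computing a total. Treating a missing figure as zero is an assumption that may understate revenue; the column names `foreign_missing` and `total_gross` and the second title are hypothetical.

```python
import pandas as pd

# Toy stand-in for the cleaned bomDF; the second film has no foreign figure
bom_sample = pd.DataFrame({
    "title": ["Toy Story 3", "Example Indie Film"],  # second title is made up
    "domestic_gross": [415000000.0, 1200000.0],
    "foreign_gross": [652000000.0, None],
})

# Keep a flag so filled rows can be excluded or inspected later
bom_sample["foreign_missing"] = bom_sample["foreign_gross"].isna()

# Assumption: a missing foreign gross is treated as zero
bom_sample["total_gross"] = (
    bom_sample["domestic_gross"] + bom_sample["foreign_gross"].fillna(0)
)
print(bom_sample)
```

Keeping the flag makes it easy to rerun any later comparison with the filled rows excluded.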


### The MovieDB DF Cleaning

In [21]:
print(tmdbDF.info())
tmdbDF.head(5)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26517 entries, 0 to 26516
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   genre_ids          26517 non-null  object 
 1   id                 26517 non-null  int64  
 2   original_language  26517 non-null  object 
 3   original_title     26517 non-null  object 
 4   popularity         26517 non-null  float64
 5   release_date       26517 non-null  object 
 6   title              26517 non-null  object 
 7   vote_average       26517 non-null  float64
 8   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(2), object(5)
memory usage: 2.0+ MB
None


Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


In [22]:
GenreDict = {28:"Action", 12:"Adventure", 16:"Animation", 35:"Comedy", 80:"Crime",
             99:"Documentary",18:"Drama",10751:"Family",14:"Fantasy",36:"History",
             27:"Horror",10402:"Music",9648:"Mystery",10749:"Romance",
             878:"Science Fiction", 10770:"TV Movie",53:"Thriller",10752:"War",37:"Western"}

In [23]:
tmdbDF.head()

Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


In [28]:
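import ast

# Parse each genre_ids string (e.g. "[12, 14, 10751]") into a Python list;
# ast.literal_eval only accepts literals, so it is safer than eval
tmdbDF['genre_ids'] = tmdbDF['genre_ids'].apply(ast.literal_eval)

# Replace each genre id with the corresponding name from GenreDict
tmdbDF['genre_ids'] = tmdbDF['genre_ids'].apply(lambda ids: [GenreDict[i] for i in ids])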

In [29]:
#Check for correct output

# for i in tmdbDF['genre_ids']:
#     for j in i:
#         print(j)


Adventure
Fantasy
Family
Fantasy
Adventure
Animation
Family
...

In [30]:
#Quick count of each value 
# tmdbDF['genre_ids'].value_counts()

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas\_libs\hashtable_class_helper.pxi", line 1709, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


[Documentary]                                                 3700
[]                                                            2479
[Drama]                                                       2268
[Comedy]                                                      1660
[Horror]                                                      1145
                                                              ... 
[Comedy, Drama, Family, Fantasy, Romance, Science Fiction]       1
[Thriller, Crime, Drama, TV Movie]                               1
[Fantasy, Comedy, Science Fiction, Family]                       1
[Science Fiction, Action, Animation, Adventure]                  1
[Crime, Mystery, Thriller, Drama]                                1
Name: genre_ids, Length: 2477, dtype: int64

In [31]:
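def to_1D(series):
    # Flatten a Series of lists into a single Series of individual items
    return pd.Series([x for _list in series for x in _list])

# Count how many movies carry each genre
to_1D(tmdbDF["genre_ids"]).value_counts()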

Drama              8303
Comedy             5652
Documentary        4965
Thriller           4207
Horror             3683
Action             2612
Romance            2321
Science Fiction    1762
Family             1565
Crime              1515
Animation          1486
Adventure          1400
Music              1267
Mystery            1237
Fantasy            1139
TV Movie           1084
History             622
War                 330
Western             205
dtype: int64

In [32]:
# tmdbDF[["genre_ids", 'vote_average']]

Unnamed: 0,genre_ids,vote_average
0,"[Adventure, Fantasy, Family]",7.7
1,"[Fantasy, Adventure, Animation, Family]",7.7
2,"[Adventure, Action, Science Fiction]",6.8
3,"[Animation, Comedy, Family]",7.9
4,"[Action, Science Fiction, Adventure]",8.3
...,...,...
26512,"[Horror, Drama]",0.0
26513,"[Drama, Thriller]",0.0
26514,"[Fantasy, Action, Adventure]",0.0
26515,"[Family, Adventure, Action]",0.0


In [39]:
#create new df with row repeated for each individual genre within the list 
new_df = tmdbDF.explode('genre_ids').reset_index(drop=True)
#group by genre and return the mean vote average
new_df.groupby('genre_ids')['vote_average'].mean()
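
The preview above includes titles with a `vote_average` of 0.0, which may be unrated films that pull the genre means down. A possible refinement, sketched on a toy frame, is to filter on `vote_count` first and report the number of titles next to each mean. The threshold of 10 votes is an arbitrary assumption, and `movies`, `rated` and `genre_votes` are illustrative names.

```python
import pandas as pd

# Toy stand-in for tmdbDF with genre names already mapped
movies = pd.DataFrame({
    "genre_ids": [["Adventure", "Family"], ["Adventure"], ["Drama"], []],
    "vote_average": [7.7, 6.8, 8.3, 0.0],
    "vote_count": [10788, 12368, 22186, 1],
})

# Assumption: titles with fewer than 10 votes are dropped (arbitrary threshold)
rated = movies[movies["vote_count"] >= 10]

# One row per (movie, genre), then mean rating and number of titles per genre
genre_votes = (
    rated.explode("genre_ids")
    .groupby("genre_ids")["vote_average"]
    .agg(mean_vote="mean", n_titles="size")
    .sort_values("mean_vote", ascending=False)
)
print(genre_votes)
```

Showing `n_titles` alongside the mean helps distinguish genres with many films from those whose average rests on a handful of titles.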

### The Numbers DF Cleaning

In [None]:
print(tnmDF.info())
tnmDF.head()
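
The money columns in `tnmDF` are strings such as `$425,000,000`, so they need to be converted before any budget or revenue comparison. The following sketch uses two rows copied from the `tnmDF` preview earlier in the notebook. The `profit` column is an illustrative addition and ignores costs that are not in the data, such as marketing.

```python
import pandas as pd

# Two rows copied from the tnmDF preview
tn_sample = pd.DataFrame({
    "movie": ["Avatar", "Dark Phoenix"],
    "production_budget": ["$425,000,000", "$350,000,000"],
    "domestic_gross": ["$760,507,625", "$42,762,350"],
    "worldwide_gross": ["$2,776,345,279", "$149,762,350"],
})

money_cols = ["production_budget", "domestic_gross", "worldwide_gross"]
for col in money_cols:
    # Strip "$" and "," so the strings can be parsed as integers
    tn_sample[col] = tn_sample[col].str.replace(r"[$,]", "", regex=True).astype("int64")

# Simple profit measure: worldwide gross minus production budget
tn_sample["profit"] = tn_sample["worldwide_gross"] - tn_sample["production_budget"]
print(tn_sample)
```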


## To-do
- Plug genres into the movies tables
- Split genres in rtDF
- Fill in nulls in foreign gross
- Match up movie and genre IDs (explore relationships between dataframes)

## Joining Tables
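
As a starting point, the Box Office Mojo and The Numbers tables could be joined on movie title. This is a sketch on toy frames and not a tested join for the full data: matching on a lowercased, stripped title is an assumption, and titles such as `Alice in Wonderland (2010)` would need extra cleaning to match. The `key` column and the frame names are illustrative.

```python
import pandas as pd

# Toy stand-ins for bomDF and tnmDF
bom_sample = pd.DataFrame({"title": ["Toy Story 3", "Inception "], "year": [2010, 2010]})
tn_sample = pd.DataFrame({"movie": ["inception", "Avatar"], "id": [1, 2]})

# Normalize titles so that case and stray spaces do not block a match
bom_sample["key"] = bom_sample["title"].str.lower().str.strip()
tn_sample["key"] = tn_sample["movie"].str.lower().str.strip()

# Outer join with an indicator column to see which rows matched
merged = pd.merge(bom_sample, tn_sample, on="key", how="outer", indicator=True)
print(merged["_merge"].value_counts())
```

The `_merge` column shows how many titles appear in both tables and how many are lost, which is worth checking before switching to an inner join.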

## Exploratory Data Analysis 
Begin looking into relationships between variables and uncover information that will form our recommendations to Computing Vision (the client).

## Linear Model
Test out linear model with added variables 
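
As a first illustration of what such a model might look like, the sketch below fits a straight line of worldwide gross against production budget with `numpy.polyfit`. It uses only the five rows visible in the `tnmDF` preview (in millions of dollars), which is far too few to draw conclusions; it shows the mechanics only and is not the team's chosen model.

```python
import numpy as np

# Production budget and worldwide gross (in $ millions) from the tnmDF preview
budget = np.array([425.0, 410.6, 350.0, 330.6, 317.0])
worldwide = np.array([2776.345279, 1045.663875, 149.762350, 1403.013963, 1316.721747])

# Least-squares fit of worldwide = slope * budget + intercept
slope, intercept = np.polyfit(budget, worldwide, deg=1)
predicted = slope * budget + intercept

# R-squared: share of the variance in worldwide gross explained by the line
ss_res = np.sum((worldwide - predicted) ** 2)
ss_tot = np.sum((worldwide - worldwide.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r_squared:.3f}")
```

Adding further variables (genre, popularity, release year) would require a multiple regression on the full cleaned tables.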