In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Box Office Mojo Movie Info

In [11]:
bom_df = pd.read_csv('zippedData/bom.movie_gross.csv')

In [12]:
bom_df.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [18]:
bom_df.year.describe()

count    3387.000000
mean     2013.958075
std         2.478141
min      2010.000000
25%      2012.000000
50%      2014.000000
75%      2016.000000
max      2018.000000
Name: year, dtype: float64

### Note:
The newest movies in this dataset are from 2018, so the data may be out of date. We could possibly fit a trend to the yearly data and extrapolate it to 2019 and later.
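
One way the extrapolation idea could be sketched is a straight-line fit to yearly totals with `np.polyfit`. The `toy_bom` frame and its gross figures are invented for illustration and only stand in for `bom_df`; a linear trend is an assumption, not something the data has confirmed.

```python
import numpy as np
import pandas as pd

# Toy stand-in for bom_df; the gross figures are made up for illustration only.
toy_bom = pd.DataFrame({
    "year": [2015, 2015, 2016, 2016, 2017, 2017, 2018, 2018],
    "domestic_gross": [100.0, 120.0, 110.0, 130.0, 125.0, 135.0, 130.0, 150.0],
})

# Total domestic gross per year.
yearly = toy_bom.groupby("year")["domestic_gross"].sum()

# Centre the years on the first year so the fit is numerically well behaved.
first_year = yearly.index.min()
x = (yearly.index - first_year).to_numpy(dtype=float)

# Fit a straight line (degree-1 polynomial) to the yearly totals.
slope, intercept = np.polyfit(x, yearly.to_numpy(), deg=1)

# Project the fitted line forward to years the data does not cover.
future_years = np.array([2019, 2020])
projected = slope * (future_years - first_year) + intercept
```

With the real data, `toy_bom` would be replaced by `bom_df`, and it would be worth plotting `yearly` first to check whether a straight line is a reasonable description at all.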

## IMDB Movie Info

### NOTE:
 - tconst = unique ID string (`tt` followed by digits) that identifies a movie
 
 - nconst = unique ID string (`nm` followed by digits) that identifies a person who worked on a movie

### On-set workers and the movies they worked on

In [13]:
imdb_name_df = pd.read_csv('zippedData/imdb.name.basics.csv')

In [17]:
imdb_name_df.head()

Unnamed: 0,nconst,primary_name,birth_year,death_year,primary_profession,known_for_titles
0,nm0061671,Mary Ellen Bauder,,,"miscellaneous,production_manager,producer","tt0837562,tt2398241,tt0844471,tt0118553"
1,nm0061865,Joseph Bauer,,,"composer,music_department,sound_department","tt0896534,tt6791238,tt0287072,tt1682940"
2,nm0062070,Bruce Baum,,,"miscellaneous,actor,writer","tt1470654,tt0363631,tt0104030,tt0102898"
3,nm0062195,Axel Baumann,,,"camera_department,cinematographer,art_department","tt0114371,tt2004304,tt1618448,tt1224387"
4,nm0062798,Pete Baxter,,,"production_designer,art_department,set_decorator","tt0452644,tt0452692,tt3458030,tt2178256"


This data has information on people who worked on movies. Maybe we can use it to see which director or producer would be best to hire, to give the movie the best chance at success.
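
As a first step toward that idea, the comma-separated columns could be split so that each person-title pair gets its own row. The sketch below uses two rows copied from the `head()` output above as a stand-in for `imdb_name_df`; the names `people`, `producers`, and `producer_titles` are placeholders, not part of the dataset.

```python
import pandas as pd

# Two rows copied from the head() output above, standing in for imdb_name_df.
people = pd.DataFrame({
    "nconst": ["nm0061671", "nm0061865"],
    "primary_name": ["Mary Ellen Bauder", "Joseph Bauer"],
    "primary_profession": [
        "miscellaneous,production_manager,producer",
        "composer,music_department,sound_department",
    ],
    "known_for_titles": [
        "tt0837562,tt2398241,tt0844471,tt0118553",
        "tt0896534,tt6791238,tt0287072,tt1682940",
    ],
})

# Keep only people whose profession list includes "producer".
# fillna("") guards against missing professions in the full dataset.
jobs = people["primary_profession"].fillna("").str.split(",")
producers = people[jobs.apply(lambda job_list: "producer" in job_list)]

# One row per (person, title) so the result can later be merged on tconst.
producer_titles = (
    producers.assign(tconst=producers["known_for_titles"].str.split(","))
    .explode("tconst")[["nconst", "primary_name", "tconst"]]
    .reset_index(drop=True)
)
```

The resulting `tconst` column should line up with the `tconst` key in the other IMDB files, though `known_for_titles` only lists a handful of titles per person rather than a full filmography.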

### Movies' alternative names for different regions

In [21]:
imdb_titles_diff_lang = pd.read_csv('zippedData/imdb.title.akas.csv')

In [22]:
imdb_titles_diff_lang.head()

Unnamed: 0,title_id,ordering,title,region,language,types,attributes,is_original_title
0,tt0369610,10,Джурасик свят,BG,bg,,,0.0
1,tt0369610,11,Jurashikku warudo,JP,,imdbDisplay,,0.0
2,tt0369610,12,Jurassic World: O Mundo dos Dinossauros,BR,,imdbDisplay,,0.0
3,tt0369610,13,O Mundo dos Dinossauros,BR,,,short title,0.0
4,tt0369610,14,Jurassic World,FR,,imdbDisplay,,0.0


This dataset lists the alternative titles that movies were released under in other regions of the world. I'm not sure how useful it will be right now, but it could come in handy later.

### Directors and writers for each film

In [35]:
imdb_creative_crew = pd.read_csv('zippedData/imdb.title.crew.csv')

In [36]:
imdb_creative_crew.head()

Unnamed: 0,tconst,directors,writers
0,tt0285252,nm0899854,nm0899854
1,tt0438973,,"nm0175726,nm1802864"
2,tt0462036,nm1940585,nm1940585
3,tt0835418,nm0151540,"nm0310087,nm0841532"
4,tt0878654,"nm0089502,nm2291498,nm2292011",nm0284943


This dataset lists the writers and directors who worked on each movie. Again, it might be useful for recommending a team of directors and writers to hire based on their track record.
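
A rough sketch of how a track record per director could be computed: split the `directors` column into one row per director, then merge with ratings on `tconst`. The crew rows are copied from the output above, but the `ratings` values here are invented purely for illustration, and mean rating is only one possible definition of "track record".

```python
import pandas as pd

# Crew rows copied from the head() output above.
crew = pd.DataFrame({
    "tconst": ["tt0285252", "tt0462036", "tt0878654"],
    "directors": ["nm0899854", "nm1940585", "nm0089502,nm2291498,nm2292011"],
})

# Hypothetical ratings for those titles; the numbers are invented for illustration.
ratings = pd.DataFrame({
    "tconst": ["tt0285252", "tt0462036", "tt0878654"],
    "averagerating": [6.0, 7.0, 8.0],
    "numvotes": [100, 200, 300],
})

# Split the comma-separated director IDs into one row per (movie, director).
per_director = (
    crew.assign(nconst=crew["directors"].str.split(","))
    .explode("nconst")
    .dropna(subset=["nconst"])  # movies with no listed director drop out here
)

# Attach ratings, then summarise each director's movies.
track_record = (
    per_director.merge(ratings, on="tconst", how="inner")
    .groupby("nconst")
    .agg(n_movies=("tconst", "nunique"), mean_rating=("averagerating", "mean"))
    .sort_values("mean_rating", ascending=False)
)
```

With the real files, a minimum `n_movies` or `numvotes` threshold would probably be needed so that a director with one obscure, highly rated film does not top the list.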

### Cast members / film crew for each movie

In [38]:
imdb_film_crew = pd.read_csv('zippedData/imdb.title.principals.csv')

In [39]:
imdb_film_crew.head(10)

Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0111414,1,nm0246005,actor,,"[""The Man""]"
1,tt0111414,2,nm0398271,director,,
2,tt0111414,3,nm3739909,producer,producer,
3,tt0323808,10,nm0059247,editor,,
4,tt0323808,1,nm3579312,actress,,"[""Beth Boothby""]"
5,tt0323808,2,nm2694680,actor,,"[""Steve Thomson""]"
6,tt0323808,3,nm0574615,actor,,"[""Sir Lachlan Morrison""]"
7,tt0323808,4,nm0502652,actress,,"[""Lady Delia Morrison""]"
8,tt0323808,5,nm0362736,director,,
9,tt0323808,6,nm0811056,producer,producer,


In [40]:
imdb_film_crew.ordering.value_counts()

1     143454
2     134649
3     126538
4     117775
5     108862
6     100140
7      90820
8      80587
9      69218
10     56143
Name: ordering, dtype: int64

Hmm, I'm not really sure what `ordering` means; I need to research it to figure it out. It could be helpful when deciding which actors and actresses to cast for the movie.
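
Before researching the documentation, a couple of quick checks on the data itself might narrow it down. The counts above fall steadily from 1 to 10, which would be consistent with `ordering` being a per-movie position, but that is only a guess. The sketch below uses the ten rows from the `head(10)` output as a stand-in for `imdb_film_crew`.

```python
import pandas as pd

# Rows copied from the head(10) output above.
principals = pd.DataFrame({
    "tconst": ["tt0111414"] * 3 + ["tt0323808"] * 7,
    "ordering": [1, 2, 3, 10, 1, 2, 3, 4, 5, 6],
    "category": ["actor", "director", "producer", "editor", "actress",
                 "actor", "actor", "actress", "director", "producer"],
})

# Check 1: is any ordering value repeated within the same movie?
has_duplicates = principals.duplicated(subset=["tconst", "ordering"]).any()

# Check 2: which job categories show up at each ordering position?
position_by_category = pd.crosstab(principals["ordering"], principals["category"])
```

If `has_duplicates` stays `False` on the full file and the low positions are dominated by actors and actresses, that would support reading `ordering` as a billing-style position within each movie.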

### Average movie ratings (1-10 scale)

In [44]:
imdb_ratings = pd.read_csv('zippedData/imdb.title.ratings.csv')

In [45]:
imdb_ratings.head()

Unnamed: 0,tconst,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


Self-explanatory. Some of the most important data.

## Rotten Tomatoes Movie Info

### Movie info without the movie title

In [51]:
rt_movie_info = pd.read_csv('zippedData/rt.movie_info.tsv', delimiter = '\t')

In [52]:
rt_movie_info.head()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


In [55]:
rt_movie_info.tail()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
1555,1996,Forget terrorists or hijackers -- there's a ha...,R,Action and Adventure|Horror|Mystery and Suspense,,,"Aug 18, 2006","Jan 2, 2007",$,33886034.0,106 minutes,New Line Cinema
1556,1997,The popular Saturday Night Live sketch was exp...,PG,Comedy|Science Fiction and Fantasy,Steve Barron,Terry Turner|Tom Davis|Dan Aykroyd|Bonnie Turner,"Jul 23, 1993","Apr 17, 2001",,,88 minutes,Paramount Vantage
1557,1998,"Based on a novel by Richard Powell, when the l...",G,Classics|Comedy|Drama|Musical and Performing Arts,Gordon Douglas,,"Jan 1, 1962","May 11, 2004",,,111 minutes,
1558,1999,The Sandlot is a coming-of-age story about a g...,PG,Comedy|Drama|Kids and Family|Sports and Fitness,David Mickey Evans,David Mickey Evans|Robert Gunter,"Apr 1, 1993","Jan 29, 2002",,,101 minutes,
1559,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures


In [54]:
rt_movie_info.iloc[0,1]

'This gritty, fast-paced, and innovative police drama earned five Academy Awards, including Best Picture, Best Adapted Screenplay (written by Ernest Tidyman), and Best Actor (Gene Hackman). Jimmy "Popeye" Doyle (Hackman) and his partner, Buddy Russo (Roy Scheider), are New York City police detectives on narcotics detail, trying to track down the source of heroin from Europe into the United States. Suave Alain Charnier (Fernando Rey) is the French drug kingpin who provides a large percentage of New York City\'s dope, and Pierre Nicoli (Marcel Bozzuffi) is a hired killer and Charnier\'s right-hand man. Acting on a hunch, Popeye and Buddy start tailing Sal Boca (Tony Lo Bianco) and his wife, Angie (Arlene Faber), who live pretty high for a couple whose corner store brings in about 7,000 dollars a year. It turns out Popeye\'s suspicions are right -- Sal and Angie are the New York agents for Charnier, who will be smuggling 32 million dollars\' worth of heroin into the city in a car shipped 

Well, this looks like it is going to take a lot of work to match each movie with its information, since there is no title column. I don't know if it is worth it, but we will see.
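
Part of that work would be turning the text columns into usable types. The sketch below shows one possible cleaning pass on two rows copied from the `head()` output above; the new column names `runtime_min` and `genre_list` are placeholders, and the date format is assumed from the rows shown rather than checked against the whole file.

```python
import pandas as pd

# Two rows copied from the head() output above (only the columns being cleaned).
info = pd.DataFrame({
    "id": [1, 3],
    "genre": ["Action and Adventure|Classics|Drama", "Drama|Science Fiction and Fantasy"],
    "theater_date": ["Oct 9, 1971", "Aug 17, 2012"],
    "runtime": ["104 minutes", "108 minutes"],
})

# "104 minutes" -> 104
info["runtime_min"] = pd.to_numeric(info["runtime"].str.extract(r"(\d+)", expand=False))

# "Oct 9, 1971" -> Timestamp; the format is assumed from the rows shown above.
info["theater_date"] = pd.to_datetime(info["theater_date"], format="%b %d, %Y")

# "Drama|Romance" -> ["Drama", "Romance"]
info["genre_list"] = info["genre"].str.split("|")
```

Cleaned release years and director names might then serve as keys for matching against the other datasets, although that matching would still need to be verified by hand on a sample.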

### Movie reviews

In [62]:
rt_reviews = pd.read_csv('zippedData/rt.reviews.tsv', delimiter = '\t', encoding = 'unicode_escape')

In [63]:
rt_reviews.head()

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"


In [64]:
rt_reviews.rating.value_counts(dropna = False)

NaN       13517
3/5        4327
4/5        3672
3/4        3577
2/5        3160
          ...  
7.4           1
1.7           1
8.5           1
3 1/2         1
4.3/10        1
Name: rating, Length: 187, dtype: int64

These might all be from critics; I'm not completely sure, so this needs further exploration. The rating column also has a lot of missing values (13,517 NaN) and mixes several formats, such as 3/5, 3/4, 4.3/10, and bare numbers like 7.4.
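
Since the `rating` values above use different scales (`3/5`, `3/4`, `4.3/10`), one option is to convert every `x/y` string to a fraction between 0 and 1. This is a minimal sketch: `rating_to_fraction` is a hypothetical helper, and it deliberately returns NaN for anything without an explicit scale (such as `7.4` or `3 1/2`) rather than guessing what the scale was.

```python
import re

import numpy as np
import pandas as pd

# Matches "score/scale", where both parts are whole or decimal numbers.
_FRACTION = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*/\s*(\d+(?:\.\d+)?)\s*$")


def rating_to_fraction(raw):
    """Convert 'x/y' strings to x/y as a float; return NaN for anything else."""
    if not isinstance(raw, str):
        return np.nan  # missing values arrive as float NaN
    match = _FRACTION.match(raw)
    if match is None:
        return np.nan  # no explicit scale, so do not guess one
    score, scale = float(match.group(1)), float(match.group(2))
    if scale == 0 or score > scale:
        return np.nan  # malformed entries
    return score / scale


# Sample values taken from the value_counts() output above.
sample = pd.Series(["3/5", "3/4", "4.3/10", "3 1/2", "7.4", np.nan])
normalized = sample.apply(rating_to_fraction)
```

On the real column this would be `rt_reviews.rating.apply(rating_to_fraction)`; checking how many values remain NaN afterwards would show how much of the column uses formats this sketch does not cover.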

## The Movie Database Movie Info

In [65]:
df = pd.read_csv('zippedData/tmdb.movies.csv')

In [66]:
df.head()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


This dataset seems to be the most complete one so far. It has a lot of good information that's useful for EDA.
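
One thing to handle before EDA is `genre_ids`. Assuming `read_csv` loads it as a string such as `"[12, 28, 878]"` (worth confirming with `type(df.genre_ids.iloc[0])`), it could be parsed into real lists with `ast.literal_eval` and then exploded to one row per genre. The `tmdb` frame below holds two rows copied from the output above.

```python
import ast

import pandas as pd

# Two rows copied from the head() output above; genre_ids is assumed to be a string.
tmdb = pd.DataFrame({
    "title": ["Iron Man 2", "Toy Story"],
    "genre_ids": ["[12, 28, 878]", "[16, 35, 10751]"],
    "vote_average": [6.8, 7.9],
})

# Parse the string representation into an actual Python list of ints.
tmdb["genre_ids"] = tmdb["genre_ids"].apply(ast.literal_eval)

# One row per (movie, genre id), with the id column cast back to integers.
per_genre = tmdb.explode("genre_ids").astype({"genre_ids": int})

# Example summary: mean vote_average for each genre id.
mean_by_genre = per_genre.groupby("genre_ids")["vote_average"].mean()
```

The numeric ids would still need to be mapped to genre names from a separate lookup, which is not included in this file.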

<b>
Info we could web scrape:
    
    - data from online releases of movies during COVID and how they did
    - data to bring all of our info up to date
    - more refined data once we drill down to a certain movie genre
    - data about streaming services and the movies they are putting out
    - something else I haven't thought of
    
I know there is a dataset out there with movie budgets vs. movie revenue that we could use to augment our findings.
</b>