## Final Project Submission

* Please fill out:
* John Paul Hernandez Alcala
* Part Time
* Scheduled project review date/time: 
* Instructor name: Eli
* Blog post URL:


### Libraries used

In [1]:
#!pip install omdb #if not installed already

import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn
import omdb
import itertools as it


### Import Cleaned Budget Data

First, we bring in the movie budget DataFrame produced by MovieBudgetData.ipynb so that we can use it with the OMDb API.

In [2]:
%%capture 
%run ./MovieBudgetData.ipynb

In [3]:
%store -r dfmoviebudget
dfmoviebudget.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,425000000,760507625,2776345279
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875
2,3,"Jun 7, 2019",Dark Phoenix,350000000,42762350,149762350
3,4,"May 1, 2015",Avengers: Age of Ultron,330600000,459005868,1403013963
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747


### OMDb API Usage

#### Get API Key for OMDb, Set Key as Default, and Request Data From OMDb API 

In [4]:
with open('C:/Users/johnh/.secret/OMDb_API.txt', 'r') as f: #requires valid API_KEY to run
    API_KEY = f.read().strip() #strip() removes any trailing newline; the with block closes the file
omdb.set_default('apikey', API_KEY)
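
A minimal sketch of a slightly more defensive way to load the key. The helper name `load_api_key` and the environment variable `OMDB_API_KEY` are hypothetical and not part of the original notebook. The helper strips surrounding whitespace and fails early when the key is empty, so a bad key file shows up here and not in the middle of the API loop.

```python
import os
from pathlib import Path


def load_api_key(path=None, env_var="OMDB_API_KEY"):
    """Return an API key from a file if a path is given, else from an environment variable.

    Both `load_api_key` and `OMDB_API_KEY` are hypothetical names used for illustration.
    """
    if path is not None:
        key = Path(path).read_text().strip()  # strip() drops a trailing newline
    else:
        key = os.environ.get(env_var, "").strip()
    if not key:
        raise ValueError("No OMDb API key found")
    return key
```

In this notebook it could be used as `omdb.set_default('apikey', load_api_key('C:/Users/johnh/.secret/OMDb_API.txt'))`.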

Now we search for each of the movies through the API.

In [5]:
#We first tried a small sample to make sure everything would go smoothly, then progressed to all of the data
df2 = dfmoviebudget.copy()
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5126 entries, 0 to 5781
Data columns (total 6 columns):
id                   5126 non-null int64
release_date         5126 non-null object
movie                5126 non-null object
production_budget    5126 non-null int64
domestic_gross       5126 non-null int64
worldwide_gross      5126 non-null int64
dtypes: int64(4), object(2)
memory usage: 280.3+ KB


We will use tuples of the movie title and year to deal with duplicate movie titles in our budget data.

In [6]:
#we will make a list of our movie titles
movieTitles = list(df2.movie)

#Below looks at the release_date column of each movie and splits off the year for each movie title
movieYears = [date.split()[2] for date in df2.release_date] #e.g. 'Dec 18, 2009' -> '2009'
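
As a hedged alternative, the same year extraction can be written with the pandas string accessor in place of a Python-level loop. The small `sample` frame below is a stand-in for `dfmoviebudget`, built from the rows shown earlier.

```python
import pandas as pd

# Small stand-in for dfmoviebudget; the real frame comes from MovieBudgetData.ipynb
sample = pd.DataFrame({
    "release_date": ["Dec 18, 2009", "May 20, 2011", "Jun 7, 2019"],
    "movie": ["Avatar", "Pirates of the Caribbean: On Stranger Tides", "Dark Phoenix"],
})

# Split each date on whitespace and keep the last piece, which is the year
years = sample["release_date"].str.split().str[-1].tolist()
print(years)  # ['2009', '2011', '2019']
```

The years stay as strings, which matches how they are passed to `omdb.title` later on.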

I was not exactly sure how to combine two lists into a list of tuples. Here is the solution I found [1].

In [7]:
movTY = list(zip(movieTitles, movieYears)) #the zip() function helps match the title and year
movTY

[('Avatar', '2009'),
 ('Pirates of the Caribbean: On Stranger Tides', '2011'),
 ('Dark Phoenix', '2019'),
 ('Avengers: Age of Ultron', '2015'),
 ('Star Wars Ep. VIII: The Last Jedi', '2017'),
 ('Star Wars Ep. VII: The Force Awakens', '2015'),
 ('Avengers: Infinity War', '2018'),
 ('Justice League', '2017'),
 ('Spectre', '2015'),
 ('The Dark Knight Rises', '2012'),
 ('Solo: A Star Wars Story', '2018'),
 ('The Lone Ranger', '2013'),
 ('John Carter', '2012'),
 ('Tangled', '2010'),
 ('Spider-Man 3', '2007'),
 ('Captain America: Civil War', '2016'),
 ('Batman v Superman: Dawn of Justice', '2016'),
 ('The Hobbit: An Unexpected Journey', '2012'),
 ('Harry Potter and the Half-Blood Prince', '2009'),
 ('The Hobbit: The Desolation of Smaug', '2013'),
 ('The Hobbit: The Battle of the Five Armies', '2014'),
 ('The Fate of the Furious', '2017'),
 ('Superman Returns', '2006'),
 ('Pirates of the Caribbean: Dead Men Tell No Tales', '2017'),
 ('Quantum of Solace', '2008'),
 ('The Avengers', '2012'),
 (

In [8]:
movTY[0] #how to access one tuple in the list

('Avatar', '2009')

In [9]:
len(movTY) #Makes sure we still have all our data

5126
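
To illustrate why pairing the title with the year helps, here is a small sketch using made-up input lists (the titles and years below are illustrative and are not read from the budget data). Titles alone can collide, while the (title, year) pairs built with `zip()` stay distinct.

```python
from collections import Counter

# Illustrative titles and years, not taken from the budget data
titles = ["Robin Hood", "Robin Hood", "Avatar"]
years = ["2010", "2018", "2009"]

pairs = list(zip(titles, years))  # one (title, year) tuple per movie

# Count how often each title and each pair occurs
duplicate_titles = [t for t, n in Counter(titles).items() if n > 1]
duplicate_pairs = [p for p, n in Counter(pairs).items() if n > 1]

print(duplicate_titles)  # ['Robin Hood']
print(duplicate_pairs)   # []
```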

Now we take a look at what the data coming from OMDb looks like.

In [10]:
movieInfo = omdb.title('Avatar', year='2009')
movieInfo

{'title': 'Avatar',
 'year': '2009',
 'rated': 'PG-13',
 'released': '18 Dec 2009',
 'runtime': '162 min',
 'genre': 'Action, Adventure, Fantasy, Sci-Fi',
 'director': 'James Cameron',
 'writer': 'James Cameron',
 'actors': 'Sam Worthington, Zoe Saldana, Sigourney Weaver, Stephen Lang',
 'plot': 'A paraplegic Marine dispatched to the moon Pandora on a unique mission becomes torn between following his orders and protecting the world he feels is his home.',
 'language': 'English, Spanish',
 'country': 'USA',
 'awards': 'Won 3 Oscars. Another 86 wins & 129 nominations.',
 'poster': 'https://m.media-amazon.com/images/M/MV5BMTYwOTEwNjAzMl5BMl5BanBnXkFtZTcwODc5MTUwMw@@._V1_SX300.jpg',
 'ratings': [{'source': 'Internet Movie Database', 'value': '7.8/10'},
  {'source': 'Rotten Tomatoes', 'value': '82%'},
  {'source': 'Metacritic', 'value': '83/100'}],
 'metascore': '83',
 'imdb_rating': '7.8',
 'imdb_votes': '1,086,714',
 'imdb_id': 'tt0499549',
 'type': 'movie',
 'dvd': '22 Apr 2010',
 'box_o

We will use the above as a template for our future dataframe (i.e. keep the keys for use as column names)

In [11]:
movieDetails = movieInfo.copy()

Next, we define a function that makes sure every key in our template (movieDetails) is also present in the OMDb data we request (movieInfo2), and then appends each requested value to the matching list in the template.

In [12]:
def moviesDict(movieDetails, movieInfo2, OrgMovieTitle):
    if len(movieDetails) > len(movieInfo2):
        for key in movieDetails.keys():
            if movieInfo2.get(key):
                pass #Do nothing since this key is present
            else:#create this key so our column sizes match for each movie title
                movieInfo2[key] = 'N/A' #This will also fill in movies that do not come up in the API
    for key, value in movieDetails.items():
        for key2, value2 in movieInfo2.items():
            if key == key2:
                if value == value2:#this part turns the initial value into a list for development of the columns
                    if key == 'title':
                        movieDetails[key] = [OrgMovieTitle] #we keep the original title from movTY for joining other data later
                    else:
                        movieDetails[key] = [value2] #If not a title related item just turn into list
                else: #This appends the value from the next movie title to the existing list
                    if key == 'title': #checks for titles that came from API as 'N/A'
                        if value2 != 'N/A':
                            value.append(OrgMovieTitle) #if the title is not 'N/A', store the original title from movTY instead
                            movieDetails[key] = value 
                        else:#if it does have 'N/A' leave it in there
                            value.append(value2)
                            movieDetails[key] = value 
                    else: #Adds values to keys other than the title key.
                        value.append(value2)
                        movieDetails[key] = value        
    return movieDetails
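
For comparison, here is a hedged sketch of a simpler way to reach the same kind of table: collect one dict per movie and let pandas align the columns. The helper `records_to_frame` is hypothetical and is not the function used in this notebook. It marks missing movies with `None` where the notebook uses the string 'N/A'.

```python
import pandas as pd


def records_to_frame(records, titles, columns):
    """Build a DataFrame with one row per requested title.

    records: list of dicts (an empty dict stands for a movie the API did not find)
    titles:  the original titles, kept so the rows can be joined to other data later
    columns: the column names to keep, e.g. the keys of a template response
    """
    rows = []
    for title, record in zip(titles, records):
        row = {col: record.get(col) for col in columns}  # missing keys become None
        if record:  # keep the original title only when the API returned something
            row["title"] = title
        rows.append(row)
    return pd.DataFrame(rows, columns=columns)


# Illustrative input: one found movie and one empty response
columns = ["title", "year", "rated"]
records = [{"title": "Avatar", "year": "2009", "rated": "PG-13"}, {}]
frame = records_to_frame(records, ["Avatar", "Some Unknown Film"], columns)
print(frame)
```

Because each row is built independently, the result does not depend on the first response having a complete set of keys.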

Now we see our function in action for each movie title and associated year.

In [13]:
for mov in movTY:
    movieInfo2 = omdb.title(mov[0], year=mov[1])
    movieDetails = moviesDict(movieDetails, movieInfo2, mov[0])

#If an error is produced, it is likely because this many API requests require a subscription to the OMDb API
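
Since a single failed request would stop the loop above partway through several thousand titles, one option is to catch errors per request. The sketch below uses hypothetical names (`fetch_all`, `fake_fetch`) and a stand-in for the API so that it runs without a key. In the notebook, the `fetch` argument could be `lambda title, year: omdb.title(title, year=year)`.

```python
def fetch_all(pairs, fetch):
    """Call fetch(title, year) for every pair, recording {} when a request fails.

    `fetch` is any callable that takes a title and a year and returns a dict.
    """
    results = []
    failures = []
    for title, year in pairs:
        try:
            results.append(fetch(title, year))
        except Exception as exc:  # broad on purpose: network and API errors vary
            failures.append((title, year, repr(exc)))
            results.append({})  # keep results aligned with pairs
    return results, failures


# Stand-in for the API so the sketch runs without a key
def fake_fetch(title, year):
    if title == "Broken":
        raise RuntimeError("simulated request failure")
    return {"title": title, "year": year}


results, failures = fetch_all([("Avatar", "2009"), ("Broken", "2000")], fake_fetch)
print(len(results), len(failures))  # 2 1
```

Keeping `results` the same length as `pairs` means each response can still be matched to its original title, and `failures` gives a list of titles to retry.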

Then we convert the data from movieDetails to a Pandas DataFrame

In [14]:
dfmovieDetails = pd.DataFrame.from_dict(movieDetails)
dfmovieDetails.head(5)

Unnamed: 0,title,year,rated,released,runtime,genre,director,writer,actors,plot,...,metascore,imdb_rating,imdb_votes,imdb_id,type,dvd,box_office,production,website,response
0,Avatar,2009.0,PG-13,18 Dec 2009,162 min,"Action, Adventure, Fantasy, Sci-Fi",James Cameron,James Cameron,"Sam Worthington, Zoe Saldana, Sigourney Weaver...",A paraplegic Marine dispatched to the moon Pan...,...,83.0,7.8,1086714.0,tt0499549,movie,22 Apr 2010,"$749,700,000",20th Century Fox,,True
1,Pirates of the Caribbean: On Stranger Tides,2011.0,PG-13,20 May 2011,136 min,"Action, Adventure, Fantasy",Rob Marshall,"Ted Elliott (screenplay), Terry Rossio (screen...","Johnny Depp, Penélope Cruz, Geoffrey Rush, Ian...",Jack Sparrow and Barbossa embark on a quest to...,...,45.0,6.6,465510.0,tt1298650,movie,18 Oct 2011,"$241,063,875",Walt Disney Pictures,,True
2,Dark Phoenix,2019.0,PG-13,07 Jun 2019,113 min,"Action, Adventure, Sci-Fi",Simon Kinberg,Simon Kinberg,"James McAvoy, Michael Fassbender, Jennifer Law...",Jean Grey begins to develop incredible powers ...,...,43.0,5.8,132810.0,tt6565702,movie,03 Sep 2019,,20th Century Fox,,True
3,Avengers: Age of Ultron,2015.0,PG-13,01 May 2015,141 min,"Action, Adventure, Sci-Fi",Joss Whedon,"Joss Whedon, Stan Lee (based on the Marvel com...","Robert Downey Jr., Chris Hemsworth, Mark Ruffa...",When Tony Stark and Bruce Banner try to jump-s...,...,66.0,7.3,704588.0,tt2395427,movie,02 Oct 2015,"$429,113,729",Walt Disney Pictures,,True
4,,,,,,,,,,,...,,,,,,,,,,


In [15]:
#dfmovieDetails.to_csv('OMDb API Data') #only run if the above has valid API Key and no errors.

## Clean and Export OMDb Data

As you can see, at least one movie (row 4) was not recognized by the OMDb API. We need to clean these rows out.

In [16]:
df3 = pd.read_csv('./OMDb API Data', index_col=0) #index_col=0 reads the saved index back in instead of creating an 'Unnamed: 0' column [2]
df3

Unnamed: 0,title,year,rated,released,runtime,genre,director,writer,actors,plot,...,metascore,imdb_rating,imdb_votes,imdb_id,type,dvd,box_office,production,website,response
0,Avatar,2009,PG-13,18 Dec 2009,162 min,"Action, Adventure, Fantasy, Sci-Fi",James Cameron,James Cameron,"Sam Worthington, Zoe Saldana, Sigourney Weaver...",A paraplegic Marine dispatched to the moon Pan...,...,83.0,7.8,1086714,tt0499549,movie,22 Apr 2010,"$749,700,000",20th Century Fox,,True
1,Pirates of the Caribbean: On Stranger Tides,2011,PG-13,20 May 2011,136 min,"Action, Adventure, Fantasy",Rob Marshall,"Ted Elliott (screenplay), Terry Rossio (screen...","Johnny Depp, Penélope Cruz, Geoffrey Rush, Ian...",Jack Sparrow and Barbossa embark on a quest to...,...,45.0,6.6,462689,tt1298650,movie,18 Oct 2011,"$241,063,875",Walt Disney Pictures,,True
2,Dark Phoenix,2019,PG-13,07 Jun 2019,113 min,"Action, Adventure, Sci-Fi",Simon Kinberg,Simon Kinberg,"James McAvoy, Michael Fassbender, Jennifer Law...",Jean Grey begins to develop incredible powers ...,...,43.0,5.8,132810,tt6565702,movie,03 Sep 2019,,20th Century Fox,,True
3,Avengers: Age of Ultron,2015,PG-13,01 May 2015,141 min,"Action, Adventure, Sci-Fi",Joss Whedon,"Joss Whedon, Stan Lee (based on the Marvel com...","Robert Downey Jr., Chris Hemsworth, Mark Ruffa...",When Tony Stark and Bruce Banner try to jump-s...,...,66.0,7.3,704588,tt2395427,movie,02 Oct 2015,"$429,113,729",Walt Disney Pictures,,True
4,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5121,,,,,,,,,,,...,,,,,,,,,,
5122,,,,,,,,,,,...,,,,,,,,,,
5123,,,,,,,,,,,...,,,,,,,,,,
5124,,,,,,,,,,,...,,,,,,,,,,
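
A small, self-contained sketch of the CSV round trip, using an in-memory buffer and a made-up two-row frame in place of the real file. It shows the effect of `index_col=0`, and also that pandas reads the string 'N/A' back as NaN by default, which is why unrecognized movies appear as empty rows above.

```python
import io
import pandas as pd

# Made-up frame; the second row mimics a movie the API did not find
frame = pd.DataFrame({"title": ["Avatar", "N/A"], "year": ["2009", "N/A"]})

buffer = io.StringIO()
frame.to_csv(buffer)  # the index is written as an unnamed first column
csv_text = buffer.getvalue()

with_extra = pd.read_csv(io.StringIO(csv_text))             # index comes back as 'Unnamed: 0'
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)  # index restored, no extra column

print(list(with_extra.columns))           # ['Unnamed: 0', 'title', 'year']
print(list(restored.columns))             # ['title', 'year']
print(restored["title"].isna().tolist())  # [False, True]
```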


Let's get rid of all the movies whose title is missing (NaN).

In [17]:
df3 = df3[~df3.title.isna()]
df3.head()

Unnamed: 0,title,year,rated,released,runtime,genre,director,writer,actors,plot,...,metascore,imdb_rating,imdb_votes,imdb_id,type,dvd,box_office,production,website,response
0,Avatar,2009,PG-13,18 Dec 2009,162 min,"Action, Adventure, Fantasy, Sci-Fi",James Cameron,James Cameron,"Sam Worthington, Zoe Saldana, Sigourney Weaver...",A paraplegic Marine dispatched to the moon Pan...,...,83.0,7.8,1086714,tt0499549,movie,22 Apr 2010,"$749,700,000",20th Century Fox,,True
1,Pirates of the Caribbean: On Stranger Tides,2011,PG-13,20 May 2011,136 min,"Action, Adventure, Fantasy",Rob Marshall,"Ted Elliott (screenplay), Terry Rossio (screen...","Johnny Depp, Penélope Cruz, Geoffrey Rush, Ian...",Jack Sparrow and Barbossa embark on a quest to...,...,45.0,6.6,462689,tt1298650,movie,18 Oct 2011,"$241,063,875",Walt Disney Pictures,,True
2,Dark Phoenix,2019,PG-13,07 Jun 2019,113 min,"Action, Adventure, Sci-Fi",Simon Kinberg,Simon Kinberg,"James McAvoy, Michael Fassbender, Jennifer Law...",Jean Grey begins to develop incredible powers ...,...,43.0,5.8,132810,tt6565702,movie,03 Sep 2019,,20th Century Fox,,True
3,Avengers: Age of Ultron,2015,PG-13,01 May 2015,141 min,"Action, Adventure, Sci-Fi",Joss Whedon,"Joss Whedon, Stan Lee (based on the Marvel com...","Robert Downey Jr., Chris Hemsworth, Mark Ruffa...",When Tony Stark and Bruce Banner try to jump-s...,...,66.0,7.3,704588,tt2395427,movie,02 Oct 2015,"$429,113,729",Walt Disney Pictures,,True
6,Avengers: Infinity War,2018,PG-13,27 Apr 2018,149 min,"Action, Adventure, Sci-Fi","Anthony Russo, Joe Russo","Christopher Markus (screenplay by), Stephen Mc...","Robert Downey Jr., Chris Hemsworth, Mark Ruffa...",The Avengers and their allies must be willing ...,...,68.0,8.5,754875,tt4154756,movie,14 Aug 2018,"$664,987,816",Walt Disney Pictures,,True
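
The filter above can be written in a few equivalent ways. The sketch below, on a made-up three-row frame, checks that they select the same rows and shows that the original index labels are kept, which is why the output above jumps from row 3 to row 6.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"title": ["Avatar", np.nan, "Dark Phoenix"],
                      "year": [2009, np.nan, 2019]})

by_mask = frame[~frame.title.isna()]        # the form used in the notebook
by_notna = frame[frame["title"].notna()]    # same rows, without the negation
by_dropna = frame.dropna(subset=["title"])  # same rows, stated as a drop

print(by_mask.equals(by_notna) and by_mask.equals(by_dropna))  # True
print(by_dropna.index.tolist())                                # [0, 2]
```

If a gap-free index is preferred, `reset_index(drop=True)` can be chained onto any of the three.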


In [18]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4380 entries, 0 to 5120
Data columns (total 25 columns):
title          4380 non-null object
year           4380 non-null object
rated          4187 non-null object
released       4336 non-null object
runtime        4341 non-null object
genre          4372 non-null object
director       4316 non-null object
writer         4253 non-null object
actors         4352 non-null object
plot           4294 non-null object
language       4362 non-null object
country        4376 non-null object
awards         3784 non-null object
poster         4260 non-null object
ratings        4375 non-null object
metascore      3841 non-null float64
imdb_rating    4306 non-null float64
imdb_votes     4306 non-null object
imdb_id        4380 non-null object
type           4380 non-null object
dvd            3829 non-null object
box_office     2298 non-null object
production     3827 non-null object
website        10 non-null object
response       4380 non-null 

From the summary above, we can see that several columns contain many NaN values. Which columns we drop will depend on the analysis. For every analysis, however, we can drop the 'website' column, because it has only 10 non-null values, and the 'response' column, because it describes the API request and not the movie itself.
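
One way to make the "which columns are too sparse" decision explicit is to compute the share of non-null values per column. The sketch below uses a made-up frame and an illustrative threshold of 0.5. Neither comes from the notebook.

```python
import numpy as np
import pandas as pd

# Made-up frame with columns of differing completeness
frame = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "box_office": ["$1", np.nan, np.nan, "$4"],
    "website": [np.nan, np.nan, np.nan, "http://example.com"],
})

non_null_share = frame.notna().mean()  # fraction of non-null values per column
threshold = 0.5                        # illustrative cut-off, not from the notebook
sparse_columns = non_null_share[non_null_share < threshold].index.tolist()

print(sparse_columns)  # ['website']
```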

In [19]:
df3 = df3.drop(['website', 'response'], axis=1) #reassigning instead of inplace=True avoids the SettingWithCopyWarning on the filtered DataFrame


In [20]:
df3.to_csv('ReadyOMDbAPIData')

In [OMDbBudgetDataAnalysis.ipynb](OMDbBudgetDataAnalysis.ipynb), we will use 'ReadyOMDbAPIData' along with R.O.I. data to conduct a further analysis.

## Resources used for development:
1. https://docs.quantifiedcode.com/python-anti-patterns/readability/not_using_zip_to_iterate_over_a_pair_of_lists.html
2. https://stackoverflow.com/questions/36519086/how-to-get-rid-of-unnamed-0-column-in-a-pandas-dataframe
3. http://www.omdbapi.com/