# Taylor Imhof
# Bellevue University | DSC 540
# Final Project Milestone 4
# 2/25/2022

## Milestone 4: Connecting to an API/Pulling in the Data and Cleaning/Formatting

Perform at least 5 data transformations and/or cleansing steps to your API data.
 - Replace Headers
 - Format data into a more readable format
 - Identify outliers and bad data
 - Find duplicates
 - Fix casing or inconsistent values
 - Conduct Fuzzy Matching

## (1) Load Required Libraries

In [3]:
# load required libraries
import pandas as pd
import numpy as np
import requests
import json
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen
import urllib.request, urllib.parse

## (2) Get Dataframe From Milestone Step 3 In Order To Use Movie Title For API Calls

My primary plan of attack for using OMDb's API was to take the title from each record in my previous step's dataframe and use it for an API call. Since OMDb's API can return data for a unique movie title, I planned to request API data for each title in turn.

In [4]:
# read in csv from milestone step 3 to dataframe
website_data = pd.read_csv('cleaned_webiste_data.csv', index_col=0)
website_data.head()

Unnamed: 0,release_date,title,production_budget,domestic_gross,worldwide_gross
0,"Apr 23, 2019",Avengers: Endgame,400000000.0,858373000.0,2797801000.0
1,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,379000000.0,241071802.0,1045714000.0
2,"Apr 22, 2015",Avengers: Age of Ultron,365000000.0,459005868.0,1395317000.0
3,"Dec 16, 2015",Star Wars Ep. VII: The Force Awakens,306000000.0,936662225.0,2064616000.0
4,"Apr 25, 2018",Avengers: Infinity War,300000000.0,678815482.0,2048360000.0


After loading in the .csv from the previous milestone's dataframe into a new dataframe, I extracted the movie titles from the `title` column into a list that I could use to create unique API requests for each movie.

In [5]:
# create list of movie titles to use for unique api calls
list_of_movie_titles = website_data['title'].tolist()

## (3) Load OMDb Secret API Key

In order to keep my API key away from prying eyes, I found an interesting method of reading in the API key from a file on the local machine. I first created a simple utility function that takes the path to a JSON file and returns its contents as a Python dictionary.

In [6]:
# utility function to retrieve api json from directory on local machine
def get_keys(path):
    """
    args: path to the json file on the local machine
    returns: contents of the json file as a dict
    """
    with open(path) as f:
        return json.load(f)

In [7]:
# store api key in variable
api_json = get_keys('C:/Users/taylo/.secret/beep_boop.json')
api_key = api_json['api_key']
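
The layout of the key file is not shown above. A minimal sketch, assuming it is a flat JSON object with a single `api_key` field; the placeholder value and the temporary path are illustrative only:

```python
import json
import os
import tempfile

def get_keys(path):
    """Load a JSON file from disk and return it as a Python dict."""
    with open(path) as f:
        return json.load(f)

# write a throwaway key file (placeholder value, not a real key)
tmp_dir = tempfile.mkdtemp()
key_path = os.path.join(tmp_dir, 'example_key.json')
with open(key_path, 'w') as f:
    json.dump({'api_key': 'YOUR_KEY_HERE'}, f)

# read it back the same way the cells above do
api_json_example = get_keys(key_path)
api_key_example = api_json_example['api_key']
```

Keeping the file outside the project folder means the key never ends up in the notebook or in version control.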

I then decided to store OMDb's static API endpoint in its own variable. That way I could append the `apikey` argument, carrying my own unique key, to every request.

In [8]:
# store api url end-point as string var
omdb_url = 'http://www.omdbapi.com/?'

In [9]:
# append unique api key to required api call format
apikey = '&apikey=' + api_key
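
To make the request format concrete, here is a small sketch of how the final URL is put together. The `t` and `apikey` parameter names come from the cells in this notebook; the key value and the helper name `build_omdb_url` are placeholders, not part of the original code.

```python
from urllib.parse import urlencode

omdb_url_example = 'http://www.omdbapi.com/?'

def build_omdb_url(title, key):
    """Return the full request URL for one movie title (hypothetical helper)."""
    # urlencode escapes spaces, colons, and other reserved characters
    query = urlencode({'t': str(title), 'apikey': key})
    return omdb_url_example + query

example_url = build_omdb_url('Avengers: Endgame', 'YOUR_KEY_HERE')
print(example_url)
```

Encoding both parameters in one `urlencode` call avoids having to remember the leading `&` when concatenating strings by hand.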

## (4) Create Function To Search For OMDb API Movie Data

To query OMDb's API, I created a function that takes a movie title as a string argument and returns all of the data for that movie (in JSON format). Because I already had the list of titles from my milestone 3 dataframe, I could simply append each title to my API call and get individual movie data.

In [41]:
# utility function for api calls using movie titles and api key
def search_by_title(title):
    """
    args: movie title as a string
    
    attempts to access omdb api endpoint using string movie title
    urllib requests the data that should be in json format
    then, the json is loaded into a variable using json.loads
    if the json responded correctly the json is returned by the function
    otherwise, standard error handling is used to locate where the function went wrong
    
    returns: json (if no error encountered)
    """
    # attempt to access omdb api end-point
    try:
        url = omdb_url + urllib.parse.urlencode({'t': str(title)}) + apikey
        # print('...Attempting to get data for {}...\n...Please wait...'.format(title))
        req = Request(url, headers={'User-Agent':'Mozilla/5.0'})
        content = urlopen(req).read().decode('utf-8')
        json_data = json.loads(content)
        
        if json_data['Response'] == 'True':
            # print(type(json_data)) ## for testing purposes
            return json_data
        #else: # for debugging purposes
            # print('Some error has occurred. Please try again!')
            # print(json_data['Error'])
    except HTTPError as e:
        print('http error', e.reason)
    except URLError as e:
        print('url error:', e.reason)
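
One thing worth noting about the function above: when OMDb answers with `Response` set to something other than `'True'`, or when an exception is caught, it returns `None` implicitly. A minimal sketch of the parsing step on its own, using hand-written sample payloads rather than live API output (the `Error` message text is an assumption about what the API sends back):

```python
import json

def parse_omdb_response(content):
    """Return the decoded dict if OMDb reported success, otherwise None."""
    json_data = json.loads(content)
    if json_data.get('Response') == 'True':
        return json_data
    return None  # made explicit so callers know to filter these out

# sample payloads written by hand for illustration
found = '{"Title": "Mulan", "Year": "1998", "Response": "True"}'
not_found = '{"Response": "False", "Error": "Movie not found!"}'

results = [parse_omdb_response(found), parse_omdb_response(not_found)]
```

This is why the later filtering step has to skip `None` entries before building the dataframe.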

It was at this stage that I noticed some strange, garbled output whenever certain characters appeared in a movie title. So, I had to go back to my code for milestone 3 and change how the content was decoded. The solution was quite straightforward, but it certainly had me banging my head against the table for a few hours.
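
Garbled output of this kind typically appears when UTF-8 bytes are decoded with a single-byte codec. A small illustration, using a made-up title rather than one from the dataset, and assuming the mismatch was between UTF-8 and Latin-1:

```python
# a title containing a non-ASCII character (illustrative only)
title = 'Amélie'
raw_bytes = title.encode('utf-8')

# decoding with the wrong codec splits the two-byte character in two
garbled = raw_bytes.decode('latin-1')

# decoding with the codec the bytes were written in restores the title
restored = raw_bytes.decode('utf-8')

print(garbled)
print(restored)
```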

In [9]:
## get api movie data from omdb and append json dictionaries to empty list
list_of_json = [] # empty list to store created json dicts

# iterate across titles from milestone 3 df and use title to pull data from api call
for title in list_of_movie_titles[:2000]:
    json_data = search_by_title(title)
    list_of_json.append(json_data) # appends new dict to empty list

http error Unauthorized

After making several test API calls and getting the function to work properly, I had reached my daily limit of 1000 calls. For this reason, I wound up working through the steps again the next day, creating a second dataframe that I could combine with the original one. That might make the following cells a little hard to follow, since cells from both passes appear side by side. However, I feel that my work for pulling the API data into the dataframes should be evident.
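
A sketch of how the title list could be split up front so that no single day's run exceeds the quota. The limit of 1000 comes from the paragraph above; the helper name `batch_titles` and the sample titles are hypothetical.

```python
def batch_titles(titles, daily_limit=1000):
    """Split a list of titles into consecutive batches of at most daily_limit."""
    return [titles[i:i + daily_limit] for i in range(0, len(titles), daily_limit)]

# 2500 placeholder titles stand in for the real list
sample_titles = ['movie_{}'.format(n) for n in range(2500)]
batches = batch_titles(sample_titles)

# one batch per day: 1000, 1000, and the remaining 500
batch_sizes = [len(batch) for batch in batches]
```

Test calls made while debugging count against the same quota, so in practice a batch somewhat smaller than the limit leaves room for them.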

In [14]:
test_json_second_day

[{'Title': 'Mulan',
  'Year': '1998',
  'Rated': 'G',
  'Released': '19 Jun 1998',
  'Runtime': '88 min',
  'Genre': 'Animation, Adventure, Comedy',
  'Director': 'Tony Bancroft, Barry Cook',
  'Writer': 'Robert D. San Souci, Rita Hsiao, Chris Sanders',
  'Actors': 'Ming-Na Wen, Eddie Murphy, BD Wong',
  'Plot': "To save her father from death in the army, a young maiden secretly goes in his place and becomes one of China's greatest heroines in the process.",
  'Language': 'English, Mandarin',
  'Country': 'United States',
  'Awards': 'Nominated for 1 Oscar. 17 wins & 21 nominations total',
  'Poster': 'https://m.media-amazon.com/images/M/MV5BODkxNGQ1NWYtNzg0Ny00Yjg3LThmZTItMjE2YjhmZTQ0ODY5XkEyXkFqcGdeQXVyMTQxNzMzNDI@._V1_SX300.jpg',
  'Ratings': [{'Source': 'Internet Movie Database', 'Value': '7.7/10'},
   {'Source': 'Rotten Tomatoes', 'Value': '86%'},
   {'Source': 'Metacritic', 'Value': '71/100'}],
  'Metascore': '71',
  'imdbRating': '7.7',
  'imdbVotes': '276,291',
  'imdbID': 'tt0

In [15]:
for title in list_of_movie_titles[526:1000]:
    json_data = search_by_title(title)
    test_json_second_day.append(json_data)

In [16]:
len(test_json_second_day)

499

## (5) Filter Returned Dictionaries To Include Only Useful Columns

There were quite a few columns of data that I felt would not be necessary for my analysis. For this reason, I created a list of keys that I wanted to keep for each JSON dictionary that was returned in the previous step. Then, by utilizing a dictionary comprehension, I was able to create a new list of dictionaries with only the filtered keys. I then displayed the first five entries to ensure only the keys I wanted were used in their creation.

In [17]:
# define list of keys to keep for conversion to dataframe
keys_to_keep = ['Title',
               'Year',
               'Rated',
               'Genre',
               'Country',
               'imdbID',
               'imdbRating',
               'imdbVotes',
               'Metascore']

In [10]:
# empty list to store new filtered json results
dicts_to_keep = []

for each_dict in list_of_json:
    if each_dict is not None:
        new_dict = {kept_key: each_dict[kept_key] for kept_key in keys_to_keep}
        dicts_to_keep.append(new_dict)
        
# display new list of dictionaries with filtered keys
dicts_to_keep[:5]

[{'Title': 'Avengers: Endgame',
  'Year': '2019',
  'Rated': 'PG-13',
  'Genre': 'Action, Adventure, Drama',
  'Country': 'United States',
  'imdbID': 'tt4154796',
  'imdbRating': '8.4',
  'imdbVotes': '1,016,281',
  'Metascore': '78'},
 {'Title': 'Pirates of the Caribbean: On Stranger Tides',
  'Year': '2011',
  'Rated': 'PG-13',
  'Genre': 'Action, Adventure, Fantasy',
  'Country': 'United States, United Kingdom',
  'imdbID': 'tt1298650',
  'imdbRating': '6.6',
  'imdbVotes': '507,945',
  'Metascore': '45'},
 {'Title': 'Avengers: Age of Ultron',
  'Year': '2015',
  'Rated': 'PG-13',
  'Genre': 'Action, Adventure, Sci-Fi',
  'Country': 'United States',
  'imdbID': 'tt2395427',
  'imdbRating': '7.3',
  'imdbVotes': '818,321',
  'Metascore': '66'},
 {'Title': 'Avengers: Infinity War',
  'Year': '2018',
  'Rated': 'PG-13',
  'Genre': 'Action, Adventure, Sci-Fi',
  'Country': 'United States',
  'imdbID': 'tt4154756',
  'imdbRating': '8.4',
  'imdbVotes': '983,926',
  'Metascore': '68'},
 

In [18]:
# similar ops for second api pull data
dicts_to_keep_2 = []
for each_dict in test_json_second_day:
    if each_dict is not None:
        new_dict = {kept_key: each_dict[kept_key] for kept_key in keys_to_keep}
        dicts_to_keep_2.append(new_dict)
# display first five from new dict list
dicts_to_keep_2[:5]

[{'Title': 'Mulan',
  'Year': '1998',
  'Rated': 'G',
  'Genre': 'Animation, Adventure, Comedy',
  'Country': 'United States',
  'imdbID': 'tt0120762',
  'imdbRating': '7.7',
  'imdbVotes': '276,291',
  'Metascore': '71'},
 {'Title': 'Tropic Thunder',
  'Year': '2008',
  'Rated': 'R',
  'Genre': 'Action, Comedy, War',
  'Country': 'United States, United Kingdom, Germany',
  'imdbID': 'tt0942385',
  'imdbRating': '7.0',
  'imdbVotes': '401,689',
  'Metascore': '71'},
 {'Title': 'The Girl with the Dragon Tattoo',
  'Year': '2011',
  'Rated': 'R',
  'Genre': 'Crime, Drama, Mystery',
  'Country': 'United States, Sweden, Norway',
  'imdbID': 'tt1568346',
  'imdbRating': '7.8',
  'imdbVotes': '451,420',
  'Metascore': '71'},
 {'Title': 'Contact',
  'Year': '1997',
  'Rated': 'PG',
  'Genre': 'Drama, Mystery, Sci-Fi',
  'Country': 'United States',
  'imdbID': 'tt0118884',
  'imdbRating': '7.5',
  'imdbVotes': '268,313',
  'Metascore': '62'},
 {'Title': "You Don't Mess with the Zohan",
  'Year
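
The comprehension used in this section raises a `KeyError` if a returned record lacks one of the keys. A more forgiving sketch, assuming it is acceptable to fill absent keys with `None` (the sample records are made up and shortened):

```python
keys_wanted = ['Title', 'Year', 'Metascore']

sample_records = [
    {'Title': 'Mulan', 'Year': '1998', 'Metascore': '71', 'Plot': '...'},
    {'Title': 'Contact', 'Year': '1997'},  # no Metascore key
    None,                                  # a failed API call
]

# skip failed calls, keep only the wanted keys, tolerate missing ones
filtered = [
    {key: record.get(key) for key in keys_wanted}
    for record in sample_records
    if record is not None
]
```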

## (6) Converting JSON Dictionaries To Pandas Dataframes

I actually spent quite a bit of time on this section, when ultimately all I had to do was write one line of code using the `pd.DataFrame()` constructor. I had tried numerous styles of loops to iterate across each JSON dictionary in my list but kept running into errors that I could not find my way around. It was certainly satisfying (and quite frustrating) to find out that all I had to do was create the dataframe directly from the list of dictionaries.

In [11]:
# create dataframe from list of json dictionaries
df = pd.DataFrame(dicts_to_keep)

In [19]:
df2 = pd.DataFrame(dicts_to_keep_2)
df2.head()

Unnamed: 0,Title,Year,Rated,Genre,Country,imdbID,imdbRating,imdbVotes,Metascore
0,Mulan,1998,G,"Animation, Adventure, Comedy",United States,tt0120762,7.7,276291,71
1,Tropic Thunder,2008,R,"Action, Comedy, War","United States, United Kingdom, Germany",tt0942385,7.0,401689,71
2,The Girl with the Dragon Tattoo,2011,R,"Crime, Drama, Mystery","United States, Sweden, Norway",tt1568346,7.8,451420,71
3,Contact,1997,PG,"Drama, Mystery, Sci-Fi",United States,tt0118884,7.5,268313,62
4,You Don't Mess with the Zohan,2008,PG-13,Comedy,United States,tt0960144,5.5,198259,54


In [12]:
# display first five records to ensure load was successful
df.head()

Unnamed: 0,Title,Year,Rated,Genre,Country,imdbID,imdbRating,imdbVotes,Metascore
0,Avengers: Endgame,2019,PG-13,"Action, Adventure, Drama",United States,tt4154796,8.4,1016281,78
1,Pirates of the Caribbean: On Stranger Tides,2011,PG-13,"Action, Adventure, Fantasy","United States, United Kingdom",tt1298650,6.6,507945,45
2,Avengers: Age of Ultron,2015,PG-13,"Action, Adventure, Sci-Fi",United States,tt2395427,7.3,818321,66
3,Avengers: Infinity War,2018,PG-13,"Action, Adventure, Sci-Fi",United States,tt4154756,8.4,983926,68
4,Pirates of the Caribbean: At World's End,2007,PG-13,"Action, Adventure, Fantasy",United States,tt0449088,7.1,625898,50


In [13]:
# view shape to check dimensions
df.shape

(479, 9)

In [42]:
# view shape to check dimensions
df2.shape

(456, 9)

In [14]:
# view info() to check encoded data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 479 entries, 0 to 478
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Title       479 non-null    object
 1   Year        479 non-null    object
 2   Rated       479 non-null    object
 3   Genre       479 non-null    object
 4   Country     479 non-null    object
 5   imdbID      479 non-null    object
 6   imdbRating  479 non-null    object
 7   imdbVotes   479 non-null    object
 8   Metascore   479 non-null    object
dtypes: object(9)
memory usage: 33.8+ KB
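
As a small illustration of why the one-liner works: `pd.DataFrame()` treats each dictionary in a list as one row and uses the keys as column names. The records below are shortened, hand-written examples.

```python
import pandas as pd

records = [
    {'Title': 'Mulan', 'Year': '1998', 'imdbRating': '7.7'},
    {'Title': 'Contact', 'Year': '1997', 'imdbRating': '7.5'},
]

# each dict becomes a row; keys become columns in first-seen order
example_df = pd.DataFrame(records)
print(example_df.shape)
print(example_df.dtypes)
```

Because every value in the API response is a string, nothing is parsed as a number at this stage, which is consistent with the `info()` output above.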


## (7) Rename Columns/Headers

In order to keep consistency with how I have been naming my columns in the previous milestone's dataframes, I made all of the headers lowercase. Also, if the headers contained more than one word, I separated them with an underscore.

In [15]:
# create copy of df for preservation
renamed_df = df.copy()

# rename columns to keep consistent naming convention
renamed_df = renamed_df.rename({'Title':'title',
                               'Year':'year',
                               'Rated':'rated',
                               'Genre':'genre',
                               'Country':'country',
                               'imdbID':'imdb_id',
                               'imdbRating':'imdb_rating',
                               'imdbVotes':'imdb_votes',
                               'Metascore':'metascore'}, axis=1)
renamed_df.head()

Unnamed: 0,title,year,rated,genre,country,imdb_id,imdb_rating,imdb_votes,metascore
0,Avengers: Endgame,2019,PG-13,"Action, Adventure, Drama",United States,tt4154796,8.4,1016281,78
1,Pirates of the Caribbean: On Stranger Tides,2011,PG-13,"Action, Adventure, Fantasy","United States, United Kingdom",tt1298650,6.6,507945,45
2,Avengers: Age of Ultron,2015,PG-13,"Action, Adventure, Sci-Fi",United States,tt2395427,7.3,818321,66
3,Avengers: Infinity War,2018,PG-13,"Action, Adventure, Sci-Fi",United States,tt4154756,8.4,983926,68
4,Pirates of the Caribbean: At World's End,2007,PG-13,"Action, Adventure, Fantasy",United States,tt0449088,7.1,625898,50


In [21]:
# renaming columns second iteration of api data pull
renamed_2 = df2.copy()
renamed_2 = renamed_2.rename({'Title':'title',
                               'Year':'year',
                               'Rated':'rated',
                               'Genre':'genre',
                               'Country':'country',
                               'imdbID':'imdb_id',
                               'imdbRating':'imdb_rating',
                               'imdbVotes':'imdb_votes',
                               'Metascore':'metascore'}, axis=1)
renamed_2.head()

Unnamed: 0,title,year,rated,genre,country,imdb_id,imdb_rating,imdb_votes,metascore
0,Mulan,1998,G,"Animation, Adventure, Comedy",United States,tt0120762,7.7,276291,71
1,Tropic Thunder,2008,R,"Action, Comedy, War","United States, United Kingdom, Germany",tt0942385,7.0,401689,71
2,The Girl with the Dragon Tattoo,2011,R,"Crime, Drama, Mystery","United States, Sweden, Norway",tt1568346,7.8,451420,71
3,Contact,1997,PG,"Drama, Mystery, Sci-Fi",United States,tt0118884,7.5,268313,62
4,You Don't Mess with the Zohan,2008,PG-13,Comedy,United States,tt0960144,5.5,198259,54
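
Since the same mapping is typed out twice, the renaming could also be derived from the header names themselves. A sketch, assuming the only pattern to handle is a lowercase letter followed by an uppercase one (the helper name `to_snake_case` is hypothetical):

```python
import re
import pandas as pd

def to_snake_case(name):
    """Insert '_' at each lowercase-to-uppercase boundary, then lowercase."""
    return re.sub(r'(?<=[a-z])(?=[A-Z])', '_', name).lower()

example = pd.DataFrame(columns=['Title', 'Year', 'imdbID', 'imdbRating',
                                'imdbVotes', 'Metascore'])

# rename accepts a function as well as a dict
example = example.rename(columns=to_snake_case)
new_headers = list(example.columns)
```

The explicit dictionary used in the cells above is easier to read at a glance; the function form mainly helps when the same convention has to be applied to several dataframes.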


## (8) Check For Duplicate Observations

In [16]:
renamed_df.duplicated().value_counts()

False    470
True       9
dtype: int64

In [17]:
# create copy of df and drop all duplicate rows
no_dupes = renamed_df.copy().drop_duplicates()
no_dupes.duplicated().value_counts() # check to ensure all instances of True have been removed

False    470
dtype: int64

In [22]:
# same operations for df2
no_dupes = renamed_2.copy()
no_dupes.duplicated().value_counts() # check for dupes

False    479
dtype: int64
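
`duplicated()` with no arguments only flags rows that match in every column. A sketch of checking on the identifier column instead, assuming `imdb_id` uniquely identifies a film (the second vote count is made up for illustration):

```python
import pandas as pd

sample = pd.DataFrame({
    'title': ['Mulan', 'Mulan', 'Contact'],
    'imdb_id': ['tt0120762', 'tt0120762', 'tt0118884'],
    'imdb_votes': ['276,291', '276,305', '268,313'],  # votes differ slightly
})

# no row is an exact copy of another, so the full-row check finds nothing
full_row_dupes = sample.duplicated().sum()

# the id-based check catches the repeated film
id_dupes = sample.duplicated(subset=['imdb_id']).sum()

deduped = sample.drop_duplicates(subset=['imdb_id'], keep='first')
```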

In [27]:
type_convert_2.isnull().sum()

title           0
year            3
rated           0
genre           0
country         0
imdb_id         0
imdb_rating     0
imdb_votes      0
metascore      23
dtype: int64

In [28]:
no_nulls_2 = type_convert_2.copy().dropna()
no_nulls_2.isnull().sum()

title          0
year           0
rated          0
genre          0
country        0
imdb_id        0
imdb_rating    0
imdb_votes     0
metascore      0
dtype: int64

In [22]:
dtype_conversion.isnull().sum()

title           0
year            1
rated           0
genre           0
country         0
imdb_id         0
imdb_rating     6
imdb_votes      6
metascore      12
dtype: int64

The number of missing values was quite low, so I felt that simply dropping the rows with missing/na values would be alright.
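
One way to back up that judgment is to compute the share of rows that `dropna()` would remove before actually dropping them. A sketch on made-up values:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd'],
    'year': [1998.0, np.nan, 2011.0, 1997.0],
    'metascore': [71.0, 62.0, np.nan, 54.0],
})

# rows with at least one missing value
rows_with_na = sample.isnull().any(axis=1).sum()
share_dropped = rows_with_na / len(sample)

cleaned = sample.dropna()
```

Summing the per-column counts from `isnull().sum()` can overstate the loss, since one row may be missing several values at once.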

## (9) Check Datatypes and Convert If Necessary

There were a few columns that were encoded as strings that I needed to cast to numeric datatypes. I used a basic lambda function to remove the commas from the `imdb_votes` column that were causing it to be stored as an object. Then, I used the `pd.to_numeric()` function to cast all of the desired columns to numeric data types.

In [23]:
type_convert_2 = no_dupes.copy()
type_convert_2.dtypes

title          object
year           object
rated          object
genre          object
country        object
imdb_id        object
imdb_rating    object
imdb_votes     object
metascore      object
dtype: object

In [25]:
type_convert_2['imdb_votes'] = type_convert_2['imdb_votes'].apply(lambda x: x.replace(',', '') if isinstance(x,str) else x)

In [26]:
# define list of column names that need to be converted to numeric values
num_values = ['year', 'imdb_rating', 'imdb_votes', 'metascore']

for value in num_values:
    type_convert_2[value] = pd.to_numeric(type_convert_2[value], errors='coerce')

# check dtypes
type_convert_2.dtypes

title           object
year           float64
rated           object
genre           object
country         object
imdb_id         object
imdb_rating    float64
imdb_votes       int64
metascore      float64
dtype: object

In [18]:
# check df types
no_dupes.dtypes

title          object
year           object
rated          object
genre          object
country        object
imdb_id        object
imdb_rating    object
imdb_votes     object
metascore      object
dtype: object

Four columns needed to be converted to numeric values: `year`, `imdb_rating`, `imdb_votes`, and `metascore`.

In [19]:
# create df copy for cya
dtype_conversion = no_dupes.copy()

In [20]:
# remove commas from string value for imdb_votes columns
dtype_conversion['imdb_votes'] = dtype_conversion['imdb_votes'].apply(lambda x: x.replace(',','') if isinstance(x,str) else x)

In [21]:
# define list of column names that need to be converted to numeric values
num_values = ['year', 'imdb_rating', 'imdb_votes', 'metascore']

# iterate across each column and attempt to convert to numeric value
for value in num_values:
    dtype_conversion[value] = pd.to_numeric(dtype_conversion[value], errors='coerce')

# check dtypes again to ensure values were converted properly
dtype_conversion.dtypes

title           object
year           float64
rated           object
genre           object
country         object
imdb_id         object
imdb_rating    float64
imdb_votes     float64
metascore      float64
dtype: object

In [23]:
# create copy for cya
no_nulls_df = dtype_conversion.copy()

In [24]:
# drop all rows with na values
no_nulls_df = no_nulls_df.dropna()

# check again for missing values to make sure dropna() functioned properly
no_nulls_df.isnull().sum()

title          0
year           0
rated          0
genre          0
country        0
imdb_id        0
imdb_rating    0
imdb_votes     0
metascore      0
dtype: int64
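
To show what `errors='coerce'` does, here is a sketch on a few hand-written values. It assumes missing fields arrive as the string `'N/A'`, and it uses the vectorized `str.replace` in place of the lambda.

```python
import pandas as pd

votes = pd.Series(['1,016,281', '507,945', 'N/A'])

# strip thousands separators, then coerce anything non-numeric to NaN
votes_clean = votes.str.replace(',', '', regex=False)
votes_numeric = pd.to_numeric(votes_clean, errors='coerce')

# NaN forces a float column; the nullable Int64 dtype keeps whole numbers
votes_nullable = votes_numeric.astype('Int64')
```

This also explains why `imdb_votes` shows as `int64` in one dataframe and `float64` in the other: a column that contains even one `NaN` is stored as float.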

## (10) Extracting One Genre For Simplicity

As with my dataframe from the flat file back in milestone 2, most of the entries in the genre column contained several values. In order to make this column easier to analyze, I extracted the first genre from each observation and made it the sole value for that row.

In [25]:
# create copy just in case
extracting_genre = no_nulls_df.copy()

In [26]:
# check length of unique values for genre
len(extracting_genre['genre'].unique())

89

89 unique genre combinations would be way too many to see any noticeable trends amongst the movie data.

In [27]:
# change genre values to list of strings split at the comma
extracting_genre['genre'] = extracting_genre['genre'].str.split(',')

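# explode so each genre gets its own row (the first listed genre comes first)
one_genre_df = extracting_genre.explode('genre').reset_index(drop=True)

# keep only the first row per title, i.e. the first listed genre
# note: this also drops distinct films that happen to share a title
one_genre_df.drop_duplicates(subset=['title'], keep='first', inplace=True, ignore_index=True)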

# check shape to ensure original dimensions were preserved
one_genre_df.shape

(458, 9)

In [29]:
one_genre_2 = no_nulls_2.copy()
one_genre_2['genre'] = one_genre_2['genre'].str.split(',')
one_genre_2 = one_genre_2.explode('genre').reset_index(drop=True)
one_genre_2.drop_duplicates(subset=['title'], keep='first', inplace=True, ignore_index=True)
one_genre_2.shape

(456, 9)
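
An alternative sketch that takes the first genre directly with the `.str` accessor. Unlike the `explode` plus `drop_duplicates(subset=['title'])` approach, it cannot drop a row just because two different films share a title. The rows below are illustrative only.

```python
import pandas as pd

sample = pd.DataFrame({
    'title': ['Mulan', 'Mulan', 'Contact'],      # two films, same title
    'year': [1998, 2020, 1997],
    'genre': ['Animation, Adventure, Comedy',
              'Action, Adventure, Drama',
              'Drama, Mystery, Sci-Fi'],
})

# split on the comma, keep the first piece, trim stray whitespace
sample['genre'] = sample['genre'].str.split(',').str[0].str.strip()
first_genres = sample['genre'].tolist()
```

The same expression works for any column holding comma-separated values, so it would apply equally to `country`.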

## (11) Extracting One Country For Simplicity

This is the same operation as the one done for genre, applied to the `country` column, which can also hold several comma-separated values.

In [28]:
# create copy for cya
one_country_df = one_genre_df.copy()

In [29]:
# change country values to list of strings split at the comma
one_country_df['country'] = one_country_df['country'].str.split(',')

# explode so each country gets its own row (the first listed country comes first)
one_country = one_country_df.explode('country').reset_index(drop=True)

In [30]:
# remove duplicated rows created when splitting dataframe by each country
one_country.drop_duplicates(subset=['title'], keep='first', inplace=True, ignore_index=True)

# check shape to ensure original dimensions were preserved
one_country.shape

(458, 9)

In [30]:
one_country_2 = one_genre_2.copy()
one_country_2['country'] = one_country_2['country'].str.split(',')
one_country_2 = one_country_2.explode('country').reset_index(drop=True)
one_country_2.drop_duplicates(subset=['title'], keep='first', inplace=True, ignore_index=True)
one_country_2.shape

(456, 9)

## (12) Display Final Clean Dataframe From API

In [31]:
final_api_df = one_country.copy()
final_api_df.head()

Unnamed: 0,title,year,rated,genre,country,imdb_id,imdb_rating,imdb_votes,metascore
0,Avengers: Endgame,2019.0,PG-13,Action,United States,tt4154796,8.4,1016281.0,78.0
1,Pirates of the Caribbean: On Stranger Tides,2011.0,PG-13,Action,United States,tt1298650,6.6,507945.0,45.0
2,Avengers: Age of Ultron,2015.0,PG-13,Action,United States,tt2395427,7.3,818321.0,66.0
3,Avengers: Infinity War,2018.0,PG-13,Action,United States,tt4154756,8.4,983926.0,68.0
4,Pirates of the Caribbean: At World's End,2007.0,PG-13,Action,United States,tt0449088,7.1,625898.0,50.0


In [32]:
final_api_df.shape

(458, 9)

In [33]:
final_api_df.to_csv('omdb_api_data.csv')

In [31]:
api_df_2 = one_country_2.copy()
api_df_2.head()

Unnamed: 0,title,year,rated,genre,country,imdb_id,imdb_rating,imdb_votes,metascore
0,Mulan,1998.0,G,Animation,United States,tt0120762,7.7,276291,71.0
1,Tropic Thunder,2008.0,R,Action,United States,tt0942385,7.0,401689,71.0
2,The Girl with the Dragon Tattoo,2011.0,R,Crime,United States,tt1568346,7.8,451420,71.0
3,Contact,1997.0,PG,Drama,United States,tt0118884,7.5,268313,62.0
4,You Don't Mess with the Zohan,2008.0,PG-13,Comedy,United States,tt0960144,5.5,198259,54.0


In [32]:
api_df_2.shape

(456, 9)

In [33]:
api_df_2.to_csv('api_day_2.csv')

In [37]:
# read both saved api pulls back in and stack them into one dataframe
df1 = pd.read_csv('omdb_api_data.csv', index_col=0)
df2 = pd.read_csv('api_day_2.csv', index_col=0)
final_merged_df = pd.concat([df1, df2])
final_merged_df.shape

(914, 9)

In [38]:
final_merged_df.head()

Unnamed: 0,title,year,rated,genre,country,imdb_id,imdb_rating,imdb_votes,metascore
0,Avengers: Endgame,2019.0,PG-13,Action,United States,tt4154796,8.4,1016281.0,78.0
1,Pirates of the Caribbean: On Stranger Tides,2011.0,PG-13,Action,United States,tt1298650,6.6,507945.0,45.0
2,Avengers: Age of Ultron,2015.0,PG-13,Action,United States,tt2395427,7.3,818321.0,66.0
3,Avengers: Infinity War,2018.0,PG-13,Action,United States,tt4154756,8.4,983926.0,68.0
4,Pirates of the Caribbean: At World's End,2007.0,PG-13,Action,United States,tt0449088,7.1,625898.0,50.0


As I stated earlier, I had to wait for my API call limit to reset. Because of this, I felt the easiest way to consolidate the data from each pull was to give each pull its own dataframe and combine them using `pd.concat()`.
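
A sketch of the concatenation step on two tiny stand-in frames. Passing `ignore_index=True` and de-duplicating on `imdb_id` are suggestions rather than what the cell above does; without `ignore_index=True`, both halves keep their own 0-based index labels.

```python
import pandas as pd

day_1 = pd.DataFrame({'title': ['Avengers: Endgame', 'Mulan'],
                      'imdb_id': ['tt4154796', 'tt0120762']})
day_2 = pd.DataFrame({'title': ['Mulan', 'Contact'],
                      'imdb_id': ['tt0120762', 'tt0118884']})

# stack the two pulls and renumber the rows 0..n-1
merged = pd.concat([day_1, day_2], ignore_index=True)

# a film fetched on both days would show up twice, so check by id
merged = merged.drop_duplicates(subset=['imdb_id'], ignore_index=True)
```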

In [40]:
final_merged_df.to_csv('final_api_data.csv')