# Box Office Revenue Predictive Models

### Data Wrangling Notebook


The purpose of this project is to get further practice building and evaluating models that predict a movie's revenue as a function of features associated with it.

These models will be evaluated against relevant KPIs (R-squared, Mean Absolute Error, MAE^2), and the best model will be used to predict revenue for movies in a hypothetical world where theaters stayed open in 2020 and 2021.

This notebook's purpose is to prepare the data set(s) for exploratory analysis in a subsequent notebook. 

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import os
import ast
from imdb import IMDb

**Load Box office data**

Here I load in both .csv files with the box-office data.  The data was pre-split for a Kaggle competition years ago. In this instance I chose to merge both files into a single dataframe, and later I will use scikit-learn to create a new train/test split for modeling. 

Data Source: 

Kaggle. (May 2019). TMDB Box Office Prediction, V1. Retrieved 04/16/2021 from https://www.kaggle.com/c/tmdb-box-office-prediction/data.

In [None]:
boxoffice = pd.read_csv(r'C:\Users\deann\Documents\Data\Box Office Prediction Data\test.csv')

pd.set_option("display.max_colwidth", 20)

boxoffice.head()
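The merge described above can be sketched as follows. This is a minimal illustration rather than the exact code used for this project: the helper name `load_and_merge`, the `source` column, and the file names in the usage comment are all assumptions.

```python
import pandas as pd

def load_and_merge(train_path, test_path):
    """Read the two pre-split files and stack them into a single dataframe."""
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)
    # Tag each row with the file it came from (hypothetical helper column)
    train["source"] = "train"
    test["source"] = "test"
    # Columns present in only one of the files are filled with NaN in the other's rows
    return pd.concat([train, test], ignore_index=True, sort=False)

# Example usage with assumed file names:
# boxoffice = load_and_merge("train.csv", "test.csv")
```

Keeping a `source` column makes it easy to tell later which rows came from which file, even after a new train/test split is drawn.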

Pulling up basic summary statistics for both .csv files

In [None]:
boxoffice.describe()

Here I can see that there are few numeric features for this dataset - determining what other data types I'll be working with is going to be important before going too far with the data here. 

I'm also going to remove the scientific notation from Pandas for this dataset.  The maximum numeric value in the entire set is 1.5 billion - which has more digits when written with scientific notation than without.  In a different scenario the notation might be helpful but since it's not something I use regularly I'll remove it for the time being to make my data more readable. 

In [None]:
pd.options.display.float_format = '{:,.2f}'.format

boxoffice.describe()

While removing scientific notation does make this more readable, the floats have an unnecessary degree of precision and I'd like to include some comma separators for more readability, so I'll adjust that pandas setting here as well. 

It's time to take a look at the first portion of the combined data frame as well. 

In [None]:
pd.options.display.float_format = '{:,.2f}'.format #reformats pandas for this notebook to round floats to 0.00

boxoffice.head()     

***Dictionary Problem***

It seems that there are several values for which the data frame returns a dictionary: ***Cast, Crew, Keywords, Genres, 'Belongs to collection', and 'Spoken Languages'***.  All of these seem to be features which could conceivably contribute or indicate revenue potential for a movie, either by providing exposure (casting Tom Hanks as your lead), quality production value (bringing in Steven Spielberg as your EP), or expanding/adopting an existing market (action movies, zombie films, existing franchises like Star Wars, or a new country like China).  

For now I'm going to leave these in as-is. However, later on in this notebook I would like to be able to identify whether a given collection, genre, actor, producer, or language is associated with higher revenue.  This will likely be served best by transforming these dictionaries into lists, but I would like to do some more data cleaning before taking that step.

***Duplicates***

Before doing anything I need to check for duplicate values in this dataset and resolve those. 

In [None]:
boxoffice['original_title'].duplicated().sum()        #Counting films whose title repeats an earlier row

In [None]:
boxoffice['id'].duplicated().sum()                  #Counting films whose id repeats an earlier row
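One way to make the duplicate check explicit is `Series.duplicated`, which flags every value that repeats an earlier one. The sketch below runs on a small made-up frame; the titles and ids are placeholders, not rows from the dataset.

```python
import pandas as pd

# Hypothetical sample standing in for boxoffice
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "original_title": ["Title A", "Title B", "Title A", "Title C"],
})

# Number of rows whose value already appeared earlier in the column
dup_titles = sample["original_title"].duplicated().sum()
dup_ids = sample["id"].duplicated().sum()

# keep=False marks every member of a repeated group, which helps when inspecting them
repeated_rows = sample[sample["original_title"].duplicated(keep=False)]
```

A repeated title is not necessarily a duplicate film, since remakes share titles, so the id check is the stricter of the two.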

***Missing Values:***

In this section I'll take a look at missing values for the dataset and determine which, if any, columns to drop from the dataset.  

In [None]:
missing = pd.concat([boxoffice.isnull().sum(), 100*boxoffice.isnull().mean()], axis=1)
missing.columns = ['Count', '% Missing']
missing.sort_values(by='% Missing', ascending=False) #creates a data frame displaying missing values by column

Based on the above it seems that missing values will not be a large impediment for this project.  

There are significant missing values for both ***'Belongs to collection' and 'homepage'***, with 80% and 68% missing respectively.  These columns definitely warrant further review.  

The 'tagline' and 'Keywords' columns are missing less than 20% of their data and for now this seems workable.  Previously I've built accurate models with more data missing than that so I'll leave these columns alone for now.  

### Solving my dictionary problem

The column *boxoffice['belongs_to_collection']* has the largest number of missing values.  I need to determine what to do with this column, potentially turning it into a boolean value representing whether the film belongs to a collection at all.  

Additionally this is a good opportunity to take a look at values which are lists or dictionaries in this data set since that will present interesting opportunities for analysis. 

In [None]:
has_collection = boxoffice['belongs_to_collection'].notnull()

boxoffice['belongs_to_collection'][has_collection]   #I want to look at only actual values to better understand what they are

In [None]:
pd.set_option("display.max_colwidth", None)    #updating pandas setting so that columns aren't truncated

boxoffice['belongs_to_collection'][has_collection].head()

Now I can see that each value is a list containing a dictionary with the ID, name, path to poster, and path to backdrop for each collection. 

While it's possible that the backdrop and/or poster could have an impact on revenue for marketing reasons, I would need to assess them using computer vision, which is not in the scope of this project.  

For now I'm going to replace this dictionary with only the name of each collection since the other values aren't ones that I want to work with.

In [None]:
print(type(boxoffice.loc[0, 'belongs_to_collection']))


solution = boxoffice['belongs_to_collection'].fillna('NA')  #fills nan values with NA

***Extracting film collection names from the 'belongs_to_collections' column***

After a lot of tinkering I'm able to discover that individual values in this column are lists of dictionaries formatted as strings.  This was a huge problem and after consulting with some colleagues I was able to come up with the following solution to extract a numpy array of names for each of the 'collections' or 'NA' if there was no collection.  

For the other columns like this (Genre, Language Spoken, Cast) I'll need to write an additional for loop to iterate through the list of dictionaries and pull out each of the names as well. 

In [None]:
names = np.array([ast.literal_eval(item)[0].get('name', 'NA') if item != 'NA' else 'NA' for item in solution.values])

names[0]       #checks to see if I have successfully extracted the name for Toy Story 2. 

I'm able to use the parsing library AST to read each string, call the first dictionary in the list, then extract the value representing the name for each film collection.  I'll need to replicate this at a more advanced level to extract the names for all spoken languages, genres, cast, and crew members. 
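To make the parsing step concrete, here is a small sketch on a made-up value shaped like the ones in this column. The id, name, and paths in the sample string are placeholders, not real values from the data.

```python
import ast

# Hypothetical raw cell: a list holding one dictionary, stored as a string
raw = "[{'id': 1, 'name': 'Example Collection', 'poster_path': '/p.jpg', 'backdrop_path': None}]"

parsed = ast.literal_eval(raw)               # safely turns the string into Python objects
first = parsed[0]                            # the first (and only) dictionary in the list
collection_name = first.get('name', 'NA')    # falls back to 'NA' if the key is absent
```

`ast.literal_eval` only evaluates Python literals, so unlike `eval` it will not execute arbitrary code found in the data.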

In [None]:
boxoffice['belongs_to_collection'] = names    #reassigning the 'belongs_to_collection' column to the array of names

boxoffice['belongs_to_collection'].head()

In [None]:
boxoffice['collectionbool'] = has_collection #create a column of boolean values identifying if a movie is in a collection at all

list(boxoffice.columns)

In [None]:
pd.set_option("display.max_colwidth", 20) #reset column width to make the entire dataframe readable

boxoffice.head()

For 'belongs_to_collection' I have decided to impute the missing values as NA and have extracted the title of each film collection from the original data object in this column.  I've also created a second column that provides boolean values: either a film is part of a larger collection or it is not.

In my next notebook, one question I'd like to answer is whether being part of a collection impacts revenue at all.  I'd also like to identify whether any collections have significantly higher revenue than others (Star Wars vs Hot Tub Time Machine).  To make this easier I've created one boolean column that identifies if a movie is part of a collection at all, and a second categorical column that identifies which specific film collection it's a member of. 

### Applying the value extraction to other columns in the data frame

Now that I have a method to extract the important values from each of these columns I need to apply that to the other columns that are also lists of dictionaries formatted as strings, and decide if I'll drop some extraneous columns.

In [None]:
pd.set_option("display.max_colwidth", None)  #again altering column settings so that I can read the entire list for 'genres'

boxoffice[['original_title', 'genres']].head()

Here it seems that some movies had multiple genres assigned to them - which makes sense.  Rom-Coms are one of the most well known genres and are a hybrid.  

It doesn't seem that the genres are listed in any particular order: 'Whiplash's' are in descending order by name and ID while 'Kahaani's' are in ascending order. 

My first instinct is to turn these lists of dictionaries into a list of genres.  This is an approach that will carry over well to other columns like boxoffice.cast.  Depending on what problems arise after beginning EDA in my next notebook, I may revisit this and simplify each movie to a single genre.  For now there doesn't seem to be a statistically valid way to select one genre from the list provided for each movie.

***Declaring the 'extractnames' function***

It's immediately apparent that the method I used to pull the name for each film collection won't work for these other columns since the lists contain multiple dictionaries and I want to pull out the name from each dictionary.  Here I create a new function using the previous method that I can map over the whole dataframe to do this for me. 


I want to acknowledge Dipanjan Sarkar, one of Springboard's Data Science Mentors, for helping me debug this function and pointing me towards the AST library to work with.  

In [None]:
def extractnames(column):

    """Takes a column from boxoffice whose values are lists of dictionaries stored as strings and, for each row,
    extracts the name of every entry (genre, language, cast member, etc).
    Returns a list of lists of names (None for missing rows) that can be used to replace the column"""

    new_array = []

    for item in column.values:

        if item != 'NA':

            parsed = ast.literal_eval(item)     #parses the string into a list of dictionaries once per row

            new_item = [entry.get('name', 'NA') for entry in parsed]   #keeps only the name from each dictionary

            new_array.append(new_item)          #appends each list of names to the new_array

        else:
            new_array.append(None)              #rows that were filled with 'NA' are stored as None
    return new_array

***Applying 'extractnames' to remaining columns that are lists of dictionaries***

Here I'll apply my new function to the remaining columns to extract the genres, keywords, cast, crew, and spoken languages for each film. 

In [None]:
g = boxoffice['genres'].fillna('NA')

genre = extractnames(g)

genre = np.array(genre, dtype='object')

boxoffice['genres'] = genre

boxoffice[['original_title', 'genres']].head()

In [None]:
k = boxoffice['Keywords'].fillna('NA')

key = extractnames(k)

key = np.array(key, dtype='object')

boxoffice['Keywords'] = key

boxoffice[['original_title', 'Keywords']].head()

In [None]:
c = boxoffice['cast'].fillna('NA')

cast = extractnames(c)

cast = np.array(cast, dtype='object')

boxoffice['cast'] = cast

boxoffice[['original_title', 'cast']].head()

In [None]:
cr = boxoffice['crew'].fillna('NA')

crew = extractnames(cr)

crew = np.array(crew, dtype='object')

boxoffice['crew'] = crew

boxoffice[['original_title', 'crew']].head()

In [None]:
sl = boxoffice['spoken_languages'].fillna('NA')

spoken = extractnames(sl)

spoken = np.array(spoken, dtype='object')

boxoffice['spoken_languages'] = spoken

boxoffice[['original_title', 'spoken_languages']].head()

In [None]:
pcont = boxoffice['production_countries'].fillna('NA')

country = extractnames(pcont)

country = np.array(country, dtype='object')

boxoffice['production_countries'] = country

boxoffice[['original_title', 'production_countries']].head()

In [None]:
pcomp = boxoffice['production_companies'].fillna('NA')

company = extractnames(pcomp)

company = np.array(company, dtype='object')

boxoffice['production_companies'] = company

boxoffice[['original_title', 'production_companies']].head()

In [None]:
pd.set_option("display.max_colwidth", 20)
pd.set_option("display.max_columns", 24)

boxoffice.head()
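The repeated cells above all follow one pattern, so they could be condensed into a loop. The following is a sketch only: it defines a compact stand-in for `extractnames` (here called `extract_names_sketch`, a hypothetical name) and runs on a made-up two-row frame so that it is self-contained.

```python
import ast
import pandas as pd

def extract_names_sketch(column):
    """Compact stand-in for extractnames: a list of names per row, None where missing."""
    out = []
    for item in column.fillna('NA').values:
        if item == 'NA':
            out.append(None)
        else:
            out.append([d.get('name', 'NA') for d in ast.literal_eval(item)])
    return out

# Hypothetical sample frame with two of the list-of-dictionary columns
sample = pd.DataFrame({
    'genres': ["[{'id': 1, 'name': 'Drama'}, {'id': 2, 'name': 'Comedy'}]", None],
    'cast': ["[{'name': 'Actor A'}]", "[{'name': 'Actor B'}, {'name': 'Actor C'}]"],
})

for col in ['genres', 'cast']:
    # dtype='object' keeps each list intact as a single cell
    sample[col] = pd.Series(extract_names_sketch(sample[col]), index=sample.index, dtype='object')
```

In the notebook the list of column names would be the seven columns handled above, and the loop would replace the seven near-identical cells.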

***Removing extraneous categorical columns***

Now that I've cleaned up the columns that were complicated to work with, it's time for me to decide which categorical columns I'll drop, either because they're unlikely to be beneficial to my model or because they'd require complex text analysis before they could be used in one.

### Dropping extraneous columns. 

I have several columns that I've identified as likely needing to be dropped since they're not relevant to my analysis: 

***Homepage*** - there is a small potential that analyzing the homepage for a movie would lend clues to its potential revenue.  However all of the critical information is already in my dataframe and stripping extra info would require significant amounts of work relative to its benefit.  This column will be dropped.

***id*** - This column seems to be entirely unneeded and can be dropped as it's a duplicate of the index.

***Status*** - This is useless, all of these movies have revenue and by definition will have the status 'Released'.  This column will be dropped. 

***Poster_path*** - This presents the same challenge as the homepage url for each movie.  While the posters in theory will have a large impact on marketing and revenue, analyzing the movie posters is a significant amount of modeling work on its own and would provide marginal benefit to this model.  This is a good area of further exploration with a computer vision modeling project later on.  This column will be dropped. 

***Overview*** - Again, it is a large challenge to analyze significant portions of text to get clues for revenue.  Given time and more advanced knowledge of NLP techniques I'd like to score each movie's overview and include the score here for modeling purposes.  However that lies outside the scope of my project.  I'll take a closer look to see if there's something I can wrangle from this information. 

***Tagline*** - This column I think has some potential.  The text is short enough and would be presented in marketing campaigns (vs. Overview), so in theory it will have more of an impact on a viewer's decision to view a film and thus on revenue.  Without NLP there's limited opportunity to analyze taglines, however I wonder if there's a correlation between tagline 'length' and revenue, potentially tied in to the genres the movie represents.   I'll also take a closer look at this column.

In [None]:
droplist =['homepage', 'poster_path', 'status', 'id']       #list of columns to drop

boxoffice = boxoffice.drop(droplist, axis=1)

boxoffice.head()

In [None]:
pd.set_option("display.max_colwidth", None)

boxoffice['overview'].head()

There seems to be quite a bit of variability across these strings.  However, reading through these I am more compelled by short overviews than by longer ones.  

I'm of the mind that the length of the overview will be a good proxy for sentiment and 'rich'-ness of the content, in lieu of using NLP and sentiment analysis.  Moving forward I'll add a column that holds the length (in characters) of the overview for each film.  This also engineers an additional numeric feature for each film. 

In [None]:
boxoffice['Overview_length'] = boxoffice['overview'].str.len()

boxoffice['Overview_length'].head()                  #New column with the character length of each film's plot overview

In [None]:
boxoffice['Tag_length'] = boxoffice['tagline'].str.len()

boxoffice['Tag_length'].head()
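`Series.str.len()` returns a missing value for missing text, so films without a tagline end up with a missing length rather than zero. A small sketch with made-up taglines shows the behaviour and one possible way to handle it; treating a missing tagline as length 0 is an assumption, not a decision made in this notebook.

```python
import pandas as pd

# Hypothetical taglines, one of them missing
taglines = pd.Series(["A short tagline", None, "A somewhat longer tagline here"])

lengths = taglines.str.len()                       # missing where the tagline is missing
lengths_filled = lengths.fillna(0).astype(int)     # optional: treat 'no tagline' as length 0
```

Whether 0 or a missing value is the better encoding depends on the model: a length of 0 lets "no tagline" act as a signal of its own, while a missing value keeps it out of any averages.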

At this point I have cleaned up or eliminated my categorical columns of data, which is the majority of the missing values in the dataset.  Now I need to examine the numerical data for outliers. 

In [None]:
pd.set_option("display.max_colwidth", 20)

Collection_bool = boxoffice['collectionbool']

boxoffice.head()

In [None]:
del boxoffice['collectionbool']     #deleting the boolean column to keep histograms simple to work with

### Examining numerical features

I've removed the boolean column in order to keep things simple when plotting histograms for the numerical features of my data.  I need to check and make sure there aren't any outliers to the dataset that might dramatically alter my modeling. 

In [None]:
boxoffice.hist(figsize=(15,10))
plt.subplots_adjust(hspace=0.5);

In [None]:
boxoffice.describe()

It looks like I may have a few outliers with revenue, budget, and potentially popularity as well. 

It is also clear that I have issues with both budget and revenue.  I'll examine budget further in my exploratory data analysis notebook to follow and decide how to impute missing values (since over 25% of films in this dataset have 0.0 listed as their budget).  

However my revenue data has an extremely long right tail which is concerning, and the minimum value is $1.  Given that this is my target variable I need to make sure that I've dropped any films which do not have revenue information.  

In [None]:
boxoffice.loc[boxoffice.popularity > 50].head()    #looking at films whose popularity is significantly above the norm

After looking at TMDB's API documentation it seems that popularity is a metric that's derived from a lot of complicated user interactions (movie added to watchlist, voted for, viewed on their website, searched for, etc) to arrive at a popularity score. Given this, while there are several films clustered about a score of 200, there's no reason to drop those films as of now.  Particularly given the idiosyncrasies of cultural content going 'viral' online, this isn't too concerning right now. Moving forward it's best to keep in mind that the popularity metric is derived from TMDB's web analytics and may need to be treated with caution. 

In [None]:
boxoffice[boxoffice.budget > 2e8].head()  #examining films whose budget exceeded 200 million USD

As with popularity, budget has a long-tailed distribution.  I could drop films above 200 million in budget, however that would ultimately be arbitrary, and after some quick Google searches it's apparent that the budget information here is accurate. 

### Examining the target variable revenue for missing values or outliers.

In [None]:
#boxoffice[boxoffice.revenue > 1e9].head()   #examining films whose revenue exceeded 1 billion USD

Again, it's good to check for significant outliers or errors.  The outliers for revenue don't seem to be incorrect after doing some quick spot checking on google.  While I could drop movies that grossed over a billion dollars in revenue that would be fairly arbitrary and eliminate data points that are valid.

It is notable that the outliers for revenue have little to no overlap with the films whose popularity was exceptionally high - this is something I definitely want to explore in a later notebook. 

I also need to check my revenue feature for missing or incorrect values on the lower end:

In [None]:
#boxoffice.revenue.hist(bins=50, figsize=(15,5));

It seems almost all of our films' revenue was below 200,000,000.00 USD.  This stands to reason; a film bringing in over 200 million USD at the box office is a smash hit - still uncommon but definitely possible.  Given that such a large number of movies fall below that mark, I'll need to look in more detail at the lower values. 

My goal here is to do some spot checking on low outliers, and check for missing values where revenue is zero. My hunch is that there will be a large number of movies with revenue set to zero either because they are extremely old (e.g. Gone with the Wind) or were direct-to-DVD/streaming releases.

In [None]:
#boxoffice.revenue[boxoffice.revenue < 500000.0].hist(bins=30, figsize=(15,5)); #examine films with less than 500,000 USD revenue

In [None]:
#boxoffice.revenue[boxoffice.revenue < 1000000.0].describe()
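The checks on the low end of revenue can be sketched on a small made-up frame. The revenue values below are placeholders, and the cut-offs (500 and 1,000,000) simply mirror the ones used in the surrounding cells.

```python
import pandas as pd

# Hypothetical revenue values spanning several orders of magnitude
sample = pd.DataFrame({'revenue': [0, 12, 30, 11_000, 2_500_000, 350_000_000]})

zero_revenue = (sample['revenue'] == 0).sum()                # films recorded with no revenue at all
low_revenue = sample[sample['revenue'].between(1, 500)]      # implausibly small but non-zero values
share_below_1m = (sample['revenue'] < 1_000_000).mean()      # fraction of films under 1 million USD
```

Separating exact zeros from tiny non-zero values is useful because the two usually point to different problems: a zero tends to mean "not recorded", while a value like 12 suggests a unit or parsing error.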

I'm still zeroing in on what the issue is here.  However it seems that while over half of the films in this data set have revenue below 200 million, only about 120 or so have extremely low revenue, which is a good sign.

In [None]:
#boxoffice.revenue[boxoffice.revenue<100000].hist(bins=20, figsize=(15,5));   #examine movies with less than 100k USD revenue.

In [None]:
#boxoffice.revenue[boxoffice.revenue<100000].describe()  #examining films with less than $100,000 revenue at the box office

In [None]:
#boxoffice.revenue[boxoffice.revenue <= 5000].describe()    #examining films whose revenue was in the lowest bin of previous plot

In [None]:
#boxoffice.revenue[boxoffice.revenue <= 500].describe()    #examining films that were in the bottom 75% of the last slice

In [None]:
#low_rev = boxoffice[boxoffice.revenue <=500] #new DF with all info for films with low revenue
#low_rev.iloc[::5, :]    #Showing every 5th film that has less than $500 in revenue

Immediately some significant data quality issues become apparent.  

I've never heard of either "The Getaway" (index 1138, revenue = 30 USD) or "Chasing Liberty" (index 450, revenue = 12 USD).  However both immediately stand out since they have A-list actors Alec Baldwin and Mandy Moore in them; and "Chasing Liberty" is shown to have a budget of 23 million USD.  

This is definitely a mistake: 12 and 30 aren't revenue numbers high enough to even account for a family of 4 purchasing movie tickets (maybe in 1994 for "The Getaway", but still). 

After consulting IMDB (the data is originally sourced from TMDB via Kaggle) I notice that the actual revenue numbers are much different.  "The Getaway" had a cumulative worldwide box office gross revenue of 30 *million* USD, and "Chasing Liberty" had worldwide revenue of 12 *million* USD.  While "Chasing Liberty" was definitely a box office bomb, both numbers are significantly different than those provided in the dataset. 

It seems plausible that the values were incorrectly altered through some data cleaning process.  It seems a fairly simple mistake to convert a value that reads '12 million' or '30 million' to '12' & '30' respectively. It's possible that these films all actually grossed over a million dollars and that this is some sort of systematic error. 
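To illustrate how such an error could arise, the sketch below shows a naive cleaning function that keeps only the digits of a value, next to a variant that respects a trailing 'million'. This is purely a hypothetical reconstruction; nothing in the dataset documents how the revenue column was actually processed, and both function names are made up.

```python
import re

def naive_revenue_parse(text):
    """Hypothetical buggy cleaner: strips everything except digits."""
    digits = re.sub(r'\D', '', text)
    return int(digits) if digits else None

def careful_revenue_parse(text):
    """Variant that honours a trailing 'million' unit."""
    match = re.match(r'\s*\$?\s*([\d.,]+)\s*(million)?', text, flags=re.IGNORECASE)
    if not match:
        return None
    value = float(match.group(1).replace(',', ''))
    if match.group(2):
        value *= 1_000_000
    return int(value)

naive = naive_revenue_parse('30 million')        # the unit is silently lost
careful = careful_revenue_parse('30 million')    # the unit is applied
```

A cleaner of the first kind would reproduce the '30' and '12' values exactly, but it would not explain a case where 11,000 becomes 10, so it can account for at most part of the problem.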

I decided to investigate further and look at the film "Electra Luxx" since it seems to be the only film (shown skipping every 5 rows in the low_rev dataframe) that's part of a larger collection.  Its revenue is listed as 10 USD, while the revenue listed on IMDB is $11,000 - still a flop, but the data set that I'm working with is clearly flawed.

### Potential Solutions

After consulting with my Springboard mentor AJ Sanchez and my friend (and data scientist for DoorDash) Finn Qiao, I've decided on a course of action for my dataset in regards to resolving errors in my target values.  

I had a few courses of action:

1) I could have ignored the data errors and moved forward with the data I have.  Normally I'd speak to stakeholders about data quality, but as this is a solo project that's not an option.   Given that the decision on this is in my hands I chose not to take this route - 18% of my data falls below $1 million in revenue and I consider that to be highly suspect.  In turn this could essentially render my model useless, which is an unacceptable outcome.   

2) I could drop all films that fall below a certain revenue threshold since this issue is most noticeable at the lower revenues for films.  However, I could also be dropping films that have accurate revenue information.  Further, as noted above, there seem to be multiple kinds of errors.  In the case of "The Getaway" it seems that the string '30 million' had everything but the digits stripped, resulting in a massive error.  However for "Electra Luxx", the revenue of 11,000 was reduced to 10, and I have no idea how that happened.  This raises larger concerns about the accuracy of other revenue numbers, since I'm not sure how that could even have occurred.  

3) I could work out how to use the IMDB API, import revenue data for all of these films directly, and merge it into my dataset to replace the revenue numbers I have.  IMDB is the source that I've used to spot check revenue and other data points for each film, and is a source that is highly trusted in the industry. This does raise the question of "why not just import an entirely new dataset from IMDB".  Good question.  The reason is that there are two APIs: one is professional, and one is open source via the IMDbPy python package.  I'll have to use the open source option, which has sparse documentation and questionable functionality for retrieving data in bulk. 

4) AJ suggested that since my concerns about data quality seem to be related to lower-revenue films I could segment my dataset by revenue and train different models on each segment.  This would create multiple models and increase the complexity of my project, but mitigates rather than resolves my concern regarding data quality.  I'd like to implement this later in the project regardless, and take the opportunity to examine box office hits and bombs in more detail.
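Option 4 could be sketched with `pd.cut`, which assigns each film to a revenue band. The band edges and labels below are arbitrary placeholders chosen for illustration, not thresholds suggested by AJ or used in this project.

```python
import numpy as np
import pandas as pd

# Hypothetical revenue values
sample = pd.DataFrame({'revenue': [12, 500_000, 5_000_000, 80_000_000, 1_200_000_000]})

bins = [0, 1_000_000, 100_000_000, np.inf]     # assumed band edges, right-inclusive
labels = ['low', 'mid', 'high']
sample['revenue_segment'] = pd.cut(sample['revenue'], bins=bins, labels=labels)

# One sub-frame per segment; a separate model could then be fit on each
segments = {name: group for name, group in sample.groupby('revenue_segment', observed=True)}
```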

After consulting both AJ and Finn, I've decided to figure out how to import data from IMDB. This will require some further work to create a new data set with the correct revenue numbers, on which I can rerun the previous work in this notebook.

In [None]:
boxoffice.to_csv(r'C:\Users\deann\Documents\Data\Box Office Prediction Data\test.csv')

# Importing revenue information from IMDB

I conducted the work to download and import revenue data from IMDB in a separate workbook that can be found here:  [Learning to use IMDbPy & IMDB API](https://github.com/NickD-Dean/Springboard/blob/f0ac96f192b128bff6b958909ac999daecfc336b/Capstone%20Project%202/Code/Learning%20to%20use%20IMDbPy%20&%20IMDB%20API.ipynb)

In short I needed to download an open source python package (IMDbPy), which gives me access to look up movies by IMDB id.  Unfortunately the package only allows for passing a 1-D object to the lookup function, so I needed to write a function that iterated over my data set and looked up the revenue for each movie individually.  
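The row-by-row lookup can be sketched as below. To avoid guessing at the package's interface, the lookup itself is passed in as a function: `lookup_revenue` is a hypothetical callable that takes one IMDB id and returns a revenue figure (or raises if none is found), standing in for whatever call the separate notebook makes. The ids and figures in the demonstration are made up.

```python
def fetch_revenues(imdb_ids, lookup_revenue):
    """Call lookup_revenue once per id and collect the results; failed lookups become None."""
    results = {}
    for imdb_id in imdb_ids:
        try:
            results[imdb_id] = lookup_revenue(imdb_id)
        except Exception:
            # A network error or a film without box-office data should not stop the loop
            results[imdb_id] = None
    return results

# Stand-in lookup backed by a dictionary, used only to demonstrate the loop
fake_data = {'tt0000001': 30_000_000, 'tt0000002': 12_000_000}
revenues = fetch_revenues(['tt0000001', 'tt0000002', 'tt0000003'], fake_data.__getitem__)
```

Assuming the dataframe carries an `imdb_id` column, the resulting dictionary could then be attached with `Series.map`, leaving a missing value wherever the lookup failed.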

Ultimately I needed to drop over 2000 rows that had no revenue information in them from the 'train' data set provided by Kaggle.  However, with this IMDB data I'm able to use the 'test' data as well: both data sets are samples of the larger paid data on IMDB, and the 'test' data was only missing its revenue values. After finding what revenue data I could from IMDB for the 'test' data, I concatenated the data frames and saved the result separately.  I've read the product of that notebook into this one below. 

In [None]:
boxoffice = pd.read_csv(r'C:\Users\deann\Documents\Data\Box Office Prediction Data\BoxOfficeData.csv')

### Creating dummy variables

I need to create dummy variables in order to more easily carry out analysis on categorical features.  Currently each film in a categorical column has a value equal to a list of categories (genres, cast, languages, etc) present in that film.  

What I'd like to do is to break these out into new columns, store the new column names as individual lists, then concatenate them back to the larger dataset.

In [None]:
boxoffice['genres']

In [None]:
Genres = pd.get_dummies(boxoffice['genres'].apply(pd.Series).stack()).groupby(level=0).sum()   #sum(level=0) is not available in recent pandas versions
Genres         
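A sketch of the same idea on made-up data is shown below. It assumes that, after a round trip through a .csv file, the lists may come back as strings, so it parses them with `ast.literal_eval` first; values that are already lists pass through unchanged. `Series.explode` is used here as an alternative to `apply(pd.Series).stack()`.

```python
import ast
import pandas as pd

# Hypothetical genres column as it might look after being re-read from a .csv
genres = pd.Series(["['Drama', 'Comedy']", "['Drama']", None])

# Turn stringified lists back into real lists; leave lists and missing values untouched
parsed = genres.apply(lambda v: ast.literal_eval(v) if isinstance(v, str) else v)

# One row per (film, genre) pair, then one indicator column per genre
exploded = parsed.explode()
genre_dummies = pd.get_dummies(exploded, dtype=int).groupby(level=0).sum()
```

Films with no genre information end up as a row of zeros, so the result keeps the same index as the original dataframe and can be concatenated back onto it.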

In [None]:
print(boxoffice['revenue'].isnull().sum())

### Final check on the data and saving the cleaned dataframe

In [None]:
boxoffice['collectionbool'] = Collection_bool

boxoffice.info()

There are still some missing values in the dataset that I need to be aware of, but I'll leave things as is for now and save this as a cleaned .csv

In [None]:
boxoffice.to_csv(r'C:\Users\deann\Documents\Data\Box Office Prediction Data\boxoffice_cleaned.csv', index=False)

### Summary

This was my notebook containing the data wrangling work to create a predictive model for a film's boxoffice revenue.  

In this notebook I imported the dataset from TMDB (via Kaggle), and examined it for missing values.  Immediately it was apparent that two columns 'belongs_to_collection' and 'homepage' had significant missing values.  It was also readily apparent that several of the columns were lists of dictionaries that would be complex to work with, and that I had a few columns that weren't needed. 

Ultimately I chose to drop the columns for 'id', 'homepage', 'poster_path', and 'status'.  Both 'homepage' and 'poster_path' were links to external websites, while 'id' was a duplicate of the index and 'status' was the same value for all movies. 

I also added three new columns: 'collectionbool' is a boolean mask indicating if a movie is part of a larger collection, while 'Overview_length' and 'Tag_length' were engineered to provide a numeric value for both the film's plot overview and tagline, since both are used in marketing efforts and could contribute to a film's revenue.

I imported the ***ast*** library in order to use the literal_eval method on the values in 'belongs_to_collection'.  This column's values were lists of dictionaries formatted as strings, so I was unable to extract the dictionary values normally. Using this I extracted only the name of the film collection and replaced the original values for the column with that.  

Since that worked well, I then was able to write a function that allowed me to extract the names of the genres, keywords, production companies, countries, spoken languages, cast, and crew. 


Finally I examined the distribution of numerical values to identify any concerning outliers in the data.  While I do have some extreme values in revenue and budget, they are all accurate so I chose to leave them in.  The popularity metric has some rather large values as well, and is derived from user behavior on TMDB's website.  I chose to keep the outliers for now, however that column may ultimately be dropped if it has little correlation with the actual revenue. 

If you'd like to read more, my next notebook will examine the relationships between numeric variables and start to look at the impact of different categorical variables as well. 
