In [1]:
import pandas as pd

# Pre-processing

### Bring in the Scraped Data

In [86]:
imdf = pd.read_csv('./imdf_Z.csv', index_col=0)
# Replace the index read from the file with a fresh 0..n-1 index
imdf = imdf.reset_index(drop=True)
imdf.head()

Unnamed: 0,Actors,Awards,Box Office ($),Country,Director,Genre,Language,MPAA Rating,Metacritic Score,Movie Title,Plot,Production Studio,Release Date,Rotten Tomatoes Score,Runtime,Writer,Year of Release,imdb ID,imdb Score
0,"Rosanna Arquette, Madonna, Aidan Quinn, Mark Blum",Nominated for 1 Golden Globe. Another 1 win & ...,27400000,USA,Susan Seidelman,"Comedy, Drama",English,PG-13,71.0,Desperately Seeking Susan,"A bored suburban housewife, seeking escape fro...",MGM Home Entertainment,12 Apr 1985,85.0,104.0,Leora Barish,1985,tt0089017,5.9
1,"Jackie Chan, Danny Aiello, Sandy Alexander, Vi...",,829000,"Hong Kong, USA",James Glickenhaus,"Action, Crime, Thriller","English, Cantonese",R,,The Protector,"Two New York cops, Billy Wong and Danny Garoni...",Warner Bros. Pictures,23 Aug 1985,33.0,91.0,"James Glickenhaus, King Sang Tang (Hong Kong v...",1985,tt0089847,5.7
2,"Simon Srebnik, Michael Podchlebnik, Motke Zaïd...",14 wins.,15642,"France, UK",Claude Lanzmann,"Documentary, History, War","German, Hebrew, Polish, Yiddish, French",NOT RATED,99.0,Shoah,Claude Lanzmann's epic documentary recounts th...,IFC Films,01 Nov 1985,100.0,566.0,Claude Lanzmann,1985,tt0090015,8.4
3,"Kevin Kline, Scott Glenn, Kevin Costner, Danny...",Nominated for 2 Oscars. Another 1 win & 2 nomi...,33200000,USA,Lawrence Kasdan,"Action, Crime, Drama",English,PG-13,64.0,Silverado,A misfit bunch of friends come together to rig...,Sony Pictures Home Entertainment,10 Jul 1985,77.0,133.0,"Lawrence Kasdan, Mark Kasdan",1985,tt0090022,7.2
4,"Jeff Bridges, Rosanna Arquette, Alexandra Paul...",1 nomination.,1305114,USA,Hal Ashby,"Action, Crime, Drama","English, Spanish",R,,8 Million Ways to Die,Scudder is a detective with the Sheriff's Depa...,Twentieth Century Fox Home Entertainment,25 Apr 1986,0.0,115.0,"Lawrence Block (book), Oliver Stone (screenpla...",1986,tt0090568,5.7


Let's check the dtype of each column, in case we'd like any of them in a different format.

In [87]:
imdf.dtypes

Actors                    object
Awards                    object
Box Office ($)             int64
Country                   object
Director                  object
Genre                     object
Language                  object
MPAA Rating               object
Metacritic Score         float64
Movie Title               object
Plot                      object
Production Studio         object
Release Date              object
Rotten Tomatoes Score    float64
Runtime                  float64
Writer                    object
Year of Release            int64
imdb ID                   object
imdb Score               float64
dtype: object
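
One column the listing above leaves as `object` is `Release Date`. As a minimal sketch, assuming every date follows the `12 Apr 1985` pattern seen in the preview, it could be parsed into a datetime like this. The variable names `release` and `release_dt` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Sample strings copied from the preview above
release = pd.Series(['12 Apr 1985', '23 Aug 1985', '01 Nov 1985'],
                    name='Release Date')

# %d = day, %b = abbreviated month name, %Y = four-digit year.
# errors='coerce' turns anything that doesn't match into NaT instead of raising.
release_dt = pd.to_datetime(release, format='%d %b %Y', errors='coerce')

print(release_dt.dt.month.tolist())
```

With a real datetime column, parts such as the release month can be pulled out through the `.dt` accessor.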

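A few columns contain extra characters that we'll want to remove so the values can be treated as numbers: a leading `$` and thousands separators in the box office figures, `%` or `/100` in the Rotten Tomatoes scores, and a ` min` suffix on the runtimes. Once those are stripped, we'll convert the columns from strings to numeric types.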

In [73]:
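# Strip the leading '$' and the thousands separators, then convert to int.
# .str.lstrip skips missing values, unlike calling x.lstrip on each element.
# Note: astype(int) still assumes no box office value is missing.
imdf['Box Office ($)'] = imdf['Box Office ($)'].str.lstrip('$')
imdf['Box Office ($)'] = imdf['Box Office ($)'].str.replace(',', '', regex=False)
imdf['Box Office ($)'] = imdf['Box Office ($)'].astype(int)
# Remove a '%' or '/100' marker from the scores
imdf['Rotten Tomatoes Score'] = imdf['Rotten Tomatoes Score'].str.replace('%', '', regex=False)
imdf['Rotten Tomatoes Score'] = imdf['Rotten Tomatoes Score'].str.replace('/100', '', regex=False)
# Because of NaN values, we have to change to a float
imdf['Rotten Tomatoes Score'] = imdf['Rotten Tomatoes Score'].astype(float)
# Remove the ' min' suffix from the runtimes
imdf['Runtime'] = imdf['Runtime'].str.replace(' min', '', regex=False)
# Because of NaN values, we have to change to a float
imdf['Runtime'] = imdf['Runtime'].astype(float)
imdf.head(3)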

Unnamed: 0,Actors,Awards,Box Office ($),Country,Genre,Language,MPAA Rating,Metacritic Score,Movie Title,Plot,...,Horror,Music,Musical,Mystery,Romance,Sci-Fi,Short,Sport,Thriller,War
0,,,45125480,Ireland,[Documentary],German,,,The Great Wall,'The Great Wall has been completed at its most...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"[Chengpeng Dong, Coulee Nazha]",,115524,China,"[Comedy, Music]",Mandarin,,,City of Rock,A young musician from a small town in China tr...,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"[Lilas Nagoya, Nathalie Remadi]",,24851,France,,French,,,Unrest,,...,,,,,,,,,,
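
The cleaning step above can also be tried on a small toy frame. This is only a sketch: the rows below are made up in the style of the raw scraped strings, and `pd.to_numeric` is swapped in for `astype` so that missing values become `NaN` in every column, including the box office one. The names `raw` and `clean` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows in the raw, uncleaned style (values are illustrative only)
raw = pd.DataFrame({
    'Box Office ($)': ['$27,400,000', '$829,000', np.nan],
    'Rotten Tomatoes Score': ['85%', '33/100', np.nan],
    'Runtime': ['104 min', '91 min', np.nan],
})

clean = pd.DataFrame({
    # Remove every '$' and ',' in one pass, then parse the digits
    'Box Office ($)': pd.to_numeric(
        raw['Box Office ($)'].str.replace(r'[$,]', '', regex=True)),
    # Remove either a '%' or a '/100' marker
    'Rotten Tomatoes Score': pd.to_numeric(
        raw['Rotten Tomatoes Score'].str.replace(r'%|/100', '', regex=True)),
    # Remove the ' min' suffix
    'Runtime': pd.to_numeric(
        raw['Runtime'].str.replace(' min', '', regex=False)),
})

print(clean.dtypes)
```

Because each column contains a missing value, all three come out as floats here. Whether that is acceptable for the box office column is a judgment call the notebook leaves open.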


Each movie's actors are currently stored as a single string. That isn't very useful: we could only match on it when the entire cast line-up is identical, never on an individual actor. Below we'll split each movie's actors into a list of strings.

In [63]:
imdf['Actors'] = imdf['Actors'].str.split(', ')
imdf.head(3)

Unnamed: 0,Actors,Awards,Box Office ($),Country,Genre,Language,MPAA Rating,Metacritic Score,Movie Title,Plot,...,Horror,Music,Musical,Mystery,Romance,Sci-Fi,Short,Sport,Thriller,War
0,,,45125480,Ireland,[Documentary],German,,,The Great Wall,'The Great Wall has been completed at its most...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"[Chengpeng Dong, Coulee Nazha]",,115524,China,"[Comedy, Music]",Mandarin,,,City of Rock,A young musician from a small town in China tr...,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"[Lilas Nagoya, Nathalie Remadi]",,24851,France,,French,,,Unrest,,...,,,,,,,,,,


Looks good! However, this is still a bit limited. Below we'll create dummy variables for every actor: each actor gets their own column, with a value of 1 if they appear in a given movie and 0 if they don't.

In [61]:
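# Expand each list into columns, stack into one long Series of names,
# one-hot encode, then sum back to one row per movie.
# (.sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the replacement.)
imdf = pd.concat([imdf, pd.get_dummies(imdf['Actors'].apply(pd.Series).stack()).groupby(level=0).sum()], axis=1)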
imdf.head(1)

Unnamed: 0,Actors,Awards,Box Office ($),Country,Genre,Language,MPAA Rating,Metacritic Score,Movie Title,Plot,...,Horror,Music,Musical,Mystery,Romance,Sci-Fi,Short,Sport,Thriller,War
0,,,45125480,Ireland,[Documentary],German,,,The Great Wall,'The Great Wall has been completed at its most...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
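
The dummy-variable step can be sketched on a toy frame as follows. This is one possible approach and not the notebook's exact method: it uses `Series.explode` in place of `apply(pd.Series).stack()`, and it fills movies with no listed actors with 0, whereas the cell above leaves them as `NaN`. The titles and actor names are made up.

```python
import pandas as pd

# Toy data; the third movie has no actors listed
movies = pd.DataFrame({
    'Movie Title': ['A', 'B', 'C'],
    'Actors': ['Jane Doe, John Roe', 'John Roe', None],
})

# One list of names per movie, then one row per (movie, actor) pair
actor_lists = movies['Actors'].str.split(', ')
exploded = actor_lists.explode().dropna()

actor_dummies = (
    pd.get_dummies(exploded, dtype=int)       # one column per actor
    .groupby(level=0).sum()                   # collapse back to one row per movie
    .reindex(movies.index, fill_value=0)      # assumption: no actors means all zeros
)

movies = pd.concat([movies, actor_dummies], axis=1)
print(movies)
```

Keeping the original index through `explode` is what lets `groupby(level=0)` reassemble one row per movie.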


Now we'll do the same thing for genres so that each genre has a column.

In [56]:
imdf['Genre'] = imdf['Genre'].str.split(', ')

In [64]:
# Taking a peek
imdf['Genre'][0:5]

0                [Documentary]
1              [Comedy, Music]
2                          NaN
3    [Drama, Horror, Thriller]
4                [Documentary]
Name: Genre, dtype: object

In [75]:
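# Same pattern as for the actors: one column per genre.
# (.sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the replacement.)
imdf = pd.concat([imdf, pd.get_dummies(imdf['Genre'].apply(pd.Series).stack()).groupby(level=0).sum()], axis=1)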
imdf.head(3)

Unnamed: 0,Actors,Awards,Box Office ($),Country,Genre,Language,MPAA Rating,Metacritic Score,Movie Title,Plot,...,Horror,Music,Musical,Mystery,Romance,Sci-Fi,Short,Sport,Thriller,War
0,,,45125480,Ireland,[Documentary],German,,,The Great Wall,'The Great Wall has been completed at its most...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"[Chengpeng Dong, Coulee Nazha]",,115524,China,"[Comedy, Music]",Mandarin,,,City of Rock,A young musician from a small town in China tr...,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"[Lilas Nagoya, Nathalie Remadi]",,24851,France,,French,,,Unrest,,...,,,,,,,,,,
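
For a column with only a handful of distinct labels, such as genre, pandas also offers `Series.str.get_dummies`, which works directly on the delimited string. The sketch below assumes the column still holds the original comma-separated strings (that is, before the `str.split` cell above has run); the sample values mirror the peek at the `Genre` column.

```python
import pandas as pd

# Genre as comma-separated strings, before any splitting
genres = pd.Series(
    ['Documentary', 'Comedy, Music', None, 'Drama, Horror, Thriller'],
    name='Genre',
)

# sep=', ' tells pandas how the labels are delimited;
# a missing value produces a row of zeros
genre_dummies = genres.str.get_dummies(sep=', ')
print(genre_dummies)
```

One difference from the `concat` approach: rows with a missing genre end up as zeros here, not `NaN`.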


In [None]:
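# Same pattern for directors. Director was not split in the cells above,
# so a string naming several directors is treated as a single label.
# (.sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the replacement.)
imdf = pd.concat([imdf, pd.get_dummies(imdf['Director'].apply(pd.Series).stack()).groupby(level=0).sum()], axis=1)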

In [None]:
imdf.to_csv('imdf_full.csv')

### Next Steps
Awesome! The data is in a format that we can start to play with, so let's go ahead and do that!

Proceed to 04-EDA to continue.