In [1]:
import pandas as pd

# Pre-processing

### Bring in the Scraped Data

In [5]:
imdf = pd.read_csv('./imdb_X.csv', index_col=0)
# Discard the index read from the CSV and start again from 0
imdf = imdf.reset_index(drop=True)
imdf.head()

Let's check the dtype of each column, in case any of them need to be converted to a different type.

In [36]:
imdf.dtypes

Actors                    object
Awards                    object
Box Office ($)            object
Country                   object
Genre                     object
Language                  object
MPAA Rating               object
Metacritic Score         float64
Movie Title               object
Plot                      object
Production Studio         object
Release Date              object
Rotten Tomatoes Score     object
Runtime                   object
Writer                    object
Year of Release            int64
imdb ID                   object
imdb Score               float64
dtype: object

A few columns contain formatting characters (`$`, `,`, `%` and ` min`) that keep them from being treated as numbers. We'll strip those characters and then convert the columns from strings to numeric types.

In [73]:
# Strip the dollar sign and thousands separators, then convert to integers.
# astype(int) assumes no box office value is missing; it raises on NaN.
imdf['Box Office ($)'] = imdf['Box Office ($)'].str.lstrip('$')
imdf['Box Office ($)'] = imdf['Box Office ($)'].str.replace(',', '', regex=False)
imdf['Box Office ($)'] = imdf['Box Office ($)'].astype(int)
# NaN cannot be stored in an integer column, so the score becomes a float
imdf['Rotten Tomatoes Score'] = imdf['Rotten Tomatoes Score'].str.replace('%', '', regex=False)
imdf['Rotten Tomatoes Score'] = imdf['Rotten Tomatoes Score'].astype(float)
# Same for the runtime: drop the ' min' suffix, then convert to float
imdf['Runtime'] = imdf['Runtime'].str.replace(' min', '', regex=False)
imdf['Runtime'] = imdf['Runtime'].astype(float)
imdf.head(3)

Unnamed: 0,Actors,Awards,Box Office ($),Country,Genre,Language,MPAA Rating,Metacritic Score,Movie Title,Plot,...,Horror,Music,Musical,Mystery,Romance,Sci-Fi,Short,Sport,Thriller,War
0,,,45125480,Ireland,[Documentary],German,,,The Great Wall,'The Great Wall has been completed at its most...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"[Chengpeng Dong, Coulee Nazha]",,115524,China,"[Comedy, Music]",Mandarin,,,City of Rock,A young musician from a small town in China tr...,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"[Lilas Nagoya, Nathalie Remadi]",,24851,France,,French,,,Unrest,,...,,,,,,,,,,
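
If the box office column does contain missing values, `astype(int)` will fail. One possible alternative, sketched here on made-up toy data rather than the scraped file, is to convert with `pd.to_numeric` and store the result in pandas' nullable `Int64` dtype. The `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy data standing in for the scraped columns (values are made up)
toy = pd.DataFrame({
    'Box Office ($)': ['$1,234,567', None, '$890'],
    'Runtime': ['142 min', '95 min', None],
})

# Remove '$' and ',' as literal characters; missing entries stay missing
box = (toy['Box Office ($)']
       .str.replace('$', '', regex=False)
       .str.replace(',', '', regex=False))
# Nullable Int64 keeps whole-dollar values as integers while allowing <NA>
toy['Box Office ($)'] = pd.to_numeric(box, errors='coerce').astype('Int64')

# Runtime: drop the ' min' suffix, then convert to a number
runtime = toy['Runtime'].str.replace(' min', '', regex=False)
toy['Runtime'] = pd.to_numeric(runtime, errors='coerce')

print(toy.dtypes)
```

Note that `errors='coerce'` silently turns anything unparseable into a missing value, so it is worth counting the missing values before and after the conversion.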


Each movie's actors are currently stored as a single comma-separated string. This isn't very useful, since we could only ever match on the exact line-up of the cast rather than on an individual actor. Below we'll split each string into a list of actor names.

In [63]:
imdf['Actors'] = imdf['Actors'].str.split(', ')
imdf.head(3)

Unnamed: 0,Actors,Awards,Box Office ($),Country,Genre,Language,MPAA Rating,Metacritic Score,Movie Title,Plot,...,Horror,Music,Musical,Mystery,Romance,Sci-Fi,Short,Sport,Thriller,War
0,,,45125480,Ireland,[Documentary],German,,,The Great Wall,'The Great Wall has been completed at its most...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"[Chengpeng Dong, Coulee Nazha]",,115524,China,"[Comedy, Music]",Mandarin,,,City of Rock,A young musician from a small town in China tr...,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"[Lilas Nagoya, Nathalie Remadi]",,24851,France,,French,,,Unrest,,...,,,,,,,,,,
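
To see what the split buys us, here is a small sketch on a made-up series (the names are invented, not taken from the data). Once each entry is a list, we can ask whether a single actor appears in a movie. Rows with no cast stay as NaN after the split, so the check has to skip them.

```python
import pandas as pd

# Made-up cast strings, including one missing entry
cast = pd.Series(['Ann Lee, Bo Chen', None, 'Ann Lee'])

# Each non-missing string becomes a list of names; the missing entry stays NaN
cast_lists = cast.str.split(', ')

# True where 'Ann Lee' is in the cast, False for other casts and missing rows
has_ann = cast_lists.apply(
    lambda names: isinstance(names, list) and 'Ann Lee' in names
)
print(has_ann)
```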


Looks good! However, lists are still a bit limited. Below we'll create a dummy variable for every actor: a column that holds 1 if the actor is in a given movie and 0 if they aren't.

In [61]:
# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
imdf = pd.concat([imdf, pd.get_dummies(imdf['Actors'].apply(pd.Series).stack()).groupby(level=0).sum()], axis=1)
imdf.head(1)

Unnamed: 0,Actors,Awards,Box Office ($),Country,Genre,Language,MPAA Rating,Metacritic Score,Movie Title,Plot,...,Horror,Music,Musical,Mystery,Romance,Sci-Fi,Short,Sport,Thriller,War
0,,,45125480,Ireland,[Documentary],German,,,The Great Wall,'The Great Wall has been completed at its most...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
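
The one-liner above is dense, so here is the same idea unpacked on a made-up frame. This sketch swaps `apply(pd.Series).stack()` for `Series.explode()`, which also gives one row per (movie, actor) pair while keeping the movie's index label. The `movies` frame and the names in it are assumptions for illustration.

```python
import pandas as pd

# Made-up movies whose casts are already lists
movies = pd.DataFrame({
    'Movie Title': ['A', 'B', 'C'],
    'Actors': [['Ann Lee', 'Bo Chen'], ['Ann Lee'], ['Cy Dunn']],
})

# One row per (movie, actor); the index repeats the movie's row label
long_form = movies['Actors'].explode()

# One indicator column per actor, then collapse back to one row per movie
dummies = pd.get_dummies(long_form).groupby(level=0).sum()

# Attach the indicator columns next to the original columns
movies = pd.concat([movies, dummies], axis=1)
print(movies)
```

One difference to keep in mind: a movie with a missing cast survives `explode()` as a NaN row and ends up with 0 in every actor column, whereas the `stack()` version in this notebook leaves such rows as NaN.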


Now we'll do the same thing for genres so that each genre has a column.

In [56]:
imdf['Genre'] = imdf['Genre'].str.split(', ')

In [64]:
# Taking a peek
imdf['Genre'][0:5]

0                [Documentary]
1              [Comedy, Music]
2                          NaN
3    [Drama, Horror, Thriller]
4                [Documentary]
Name: Genre, dtype: object

In [75]:
# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
imdf = pd.concat([imdf, pd.get_dummies(imdf['Genre'].apply(pd.Series).stack()).groupby(level=0).sum()], axis=1)
imdf.head(3)

Unnamed: 0,Actors,Awards,Box Office ($),Country,Genre,Language,MPAA Rating,Metacritic Score,Movie Title,Plot,...,Horror,Music,Musical,Mystery,Romance,Sci-Fi,Short,Sport,Thriller,War
0,,,45125480,Ireland,[Documentary],German,,,The Great Wall,'The Great Wall has been completed at its most...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"[Chengpeng Dong, Coulee Nazha]",,115524,China,"[Comedy, Music]",Mandarin,,,City of Rock,A young musician from a small town in China tr...,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"[Lilas Nagoya, Nathalie Remadi]",,24851,France,,French,,,Unrest,,...,,,,,,,,,,
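
For a column like Genre, where we only need the indicator columns, pandas can also build them straight from the comma-separated strings with `Series.str.get_dummies`, without splitting into lists first. This is an alternative sketch on made-up values, not what the cells above do.

```python
import pandas as pd

# Made-up genre strings in the same comma-separated form as the raw column
genre = pd.Series(['Documentary', 'Comedy, Music', None])

# One 0/1 column per distinct genre; the separator must match the data
genre_dummies = genre.str.get_dummies(sep=', ')
print(genre_dummies)
```

Here a missing genre becomes a row of zeros rather than a row of NaN, so movies with an unknown genre can no longer be told apart from movies that simply belong to none of the listed genres.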


In [None]:
imdf.to_csv('imdf_full.csv')
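
A note on the `Unnamed: 0` column visible in the outputs above: by default `to_csv` also writes the row index, and reading the file back without `index_col` turns that index into an unnamed first column. If we don't need the index, one option is to leave it out when saving. The sketch below uses an in-memory buffer and a made-up frame instead of a file on disk.

```python
import io

import pandas as pd

# Made-up frame standing in for the processed data
df = pd.DataFrame({'Movie Title': ['A', 'B'], 'Runtime': [90.0, 120.0]})

# Write without the index, then read the CSV text back
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

print(round_trip.columns.tolist())
```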

### Next Steps
Awesome! The data is in a format that we can start to play with, so let's go ahead and do that!

Proceed to 04-Modeling to continue.