## Preparation for Feature Engineering of Movie Data

#### Create Sub-DataFrames for Categorical Columns with Nested Data:
- Nested columns:
   - **genres**
   - **keywords**
   - **production_companies**
   - **production_countries**
   - **spoken_languages**
- Organize each nested list of dictionaries by movie title
- Create feature variables (via `pd.get_dummies`) based on the column trends of each movie title
- Merge the column DataFrames into the `movie_df` data
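The dummy-variable and merge steps listed above can be sketched on a toy example. This is only an illustration under stated assumptions: `toy_df`, its titles, and the `genre_` prefix are hypothetical, and the genres are assumed to be available as Python lists rather than the space-joined strings this notebook produces.

```python
import pandas as pd

# Hypothetical toy data: one list of genre names per movie title
toy_df = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": [["Action", "Adventure"], ["Drama"], ["Action", "Science Fiction"]],
})

# Join each list with a separator that cannot occur inside a name,
# then build one indicator column per distinct genre
genre_dummies = (
    toy_df["genres"]
    .str.join("|")
    .str.get_dummies(sep="|")
    .add_prefix("genre_")
)

# Merge the indicator columns back onto the per-title rows (aligned on the index)
merged = toy_df.drop(columns="genres").join(genre_dummies)
print(merged)
```

Using `|` as the separator keeps multi-word names such as "Science Fiction" in a single column, which a space separator would split.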

In [1]:
%load_ext watermark
%watermark -a "Emily Schoof" -d -t -v -p numpy,pandas,re

Emily Schoof 2019-08-19 10:07:51 

CPython 3.7.3
IPython 7.4.0

numpy 1.16.2
pandas 0.24.2
re 2.2.1


In [2]:
# Load the dataset
%store -r movie_df
print(movie_df.shape)
movie_df.head()

(4523, 12)


Unnamed: 0,title,revenue,genres,keywords,original_language,overview,production_companies,production_countries,runtime,spoken_languages,status,release_date_dt
0,Avatar,2787965000.0,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,"In the 22nd century, a paraplegic Marine is di...","[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,2009-12-10
1,Titanic,1845034000.0,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 10749, ""n...","[{""id"": 2580, ""name"": ""shipwreck""}, {""id"": 298...",en,"84 years later, a 101-year-old woman named Ros...","[{""name"": ""Paramount Pictures"", ""id"": 4}, {""na...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",194.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,1997-11-18
2,The Avengers,1519558000.0,"[{""id"": 878, ""name"": ""Science Fiction""}, {""id""...","[{""id"": 242, ""name"": ""new york""}, {""id"": 5539,...",en,When an unexpected enemy emerges and threatens...,"[{""name"": ""Paramount Pictures"", ""id"": 4}, {""na...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",143.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,2012-04-25
3,Jurassic World,1513529000.0,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1299, ""name"": ""monster""}, {""id"": 1718,...",en,Twenty-two years after the events of Jurassic ...,"[{""name"": ""Universal Studios"", ""id"": 13}, {""na...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",124.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,2015-06-09
4,Furious 7,1506249000.0,"[{""id"": 28, ""name"": ""Action""}]","[{""id"": 830, ""name"": ""car race""}, {""id"": 3428,...",en,Deckard Shaw seeks revenge against Dominic Tor...,"[{""name"": ""Universal Pictures"", ""id"": 33}, {""n...","[{""iso_3166_1"": ""JP"", ""name"": ""Japan""}, {""iso_...",137.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,2015-04-01


## Extract 'name' key from Nested Columns

In [7]:
# Import necessary modules
import re
import pandas as pd
import numpy as np

In [8]:
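# Define patterns that match the names inside the nested data columns
# name_pattern: capitalized ASCII words (this also picks up ISO country codes such as 'US')
name_pattern = re.compile(r'[A-Z]+\s?[A-Za-z]+')
# keyword_pattern: lowercase ASCII words (this also picks up the keys 'id' and 'name')
keyword_pattern = re.compile(r'[a-z]+\s?[a-z]+')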

In [9]:
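def extract_names(series):
    """Create a new Series of space-joined names found in each nested row."""

    new_series = []

    for row in series:
        # Join every capitalized match in the row into a single string
        s = ' '.join(name_pattern.findall(row))
        new_series.append(s)

    # Reuse the original index so the result aligns when assigned back
    return pd.Series(new_series, index=series.index)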

In [10]:
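def extract_keywords(series):
    """Create a new Series of space-joined keyword names found in each nested row."""

    new_series = []

    for row in series:
        # Drop the dictionary keys 'id' and 'name' as whole matches only, so that
        # keywords containing those letters (e.g. 'spider') stay intact
        matches = [m for m in keyword_pattern.findall(row) if m not in ('id', 'name')]
        new_series.append(' '.join(matches))

    # Reuse the original index so the result aligns when assigned back
    return pd.Series(new_series, index=series.index)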
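The regex-based functions above drop lowercase words such as "of" and truncate non-ASCII names. A possible alternative, assuming every nested cell is a valid JSON string, is to parse the cell and read the `name` key directly. This is a sketch and not the method used in this notebook; `extract_names_json` and the sample rows are hypothetical.

```python
import json
import pandas as pd

def extract_names_json(series, key="name"):
    """Hypothetical helper: parse each JSON string and collect the values stored under `key`."""
    def parse(row):
        try:
            items = json.loads(row)
        except (TypeError, ValueError):
            # Missing values or malformed JSON yield an empty list
            return []
        if not isinstance(items, list):
            return []
        return [item[key] for item in items if isinstance(item, dict) and key in item]

    return series.apply(parse)

# Hypothetical sample rows shaped like the nested columns
toy = pd.Series([
    '[{"iso_3166_1": "US", "name": "United States of America"}]',
    '[{"iso_639_1": "en", "name": "English"}, {"iso_639_1": "es", "name": "Español"}]',
    '[]',
])

parsed_names = extract_names_json(toy)
print(parsed_names.tolist())
```

Keeping the result as a list per row preserves multi-word names as single items, which makes later one-hot encoding less ambiguous than splitting a space-joined string.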

#### Genres

In [11]:
# Extract names
new_genres = extract_names(movie_df.genres)

# Replace original column
movie_df['genres'] = new_genres

#### Keywords

In [12]:
# Extract names
new_keywords = extract_keywords(movie_df.keywords)

# Replace original column
movie_df['keywords'] = new_keywords

#### Production Companies

In [13]:
# Extract names
new_production_companies = extract_names(movie_df.production_companies)

# Replace original column
movie_df['production_companies'] = new_production_companies

#### Production Countries

In [14]:
# Extract names
new_production_countries = extract_names(movie_df.production_countries)

# Replace original column
movie_df['production_countries'] = new_production_countries

#### Spoken Languages

In [15]:
# Extract names
new_spoken_languages = extract_names(movie_df.spoken_languages)

# Replace original column
movie_df['spoken_languages'] = new_spoken_languages
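The cells that call `extract_names` repeat the same two steps for each column, so they could also be written as one loop. The sketch below shows the idea on a hypothetical `toy_df` with a compact restatement of `extract_names`. The `keywords` column would still need `extract_keywords`.

```python
import re
import pandas as pd

name_pattern = re.compile(r'[A-Z]+\s?[A-Za-z]+')

def extract_names(series):
    """Compact restatement of the notebook's helper, for illustration only."""
    joined = [' '.join(name_pattern.findall(row)) for row in series]
    return pd.Series(joined, index=series.index)

# Hypothetical single-row frame shaped like movie_df's nested columns
toy_df = pd.DataFrame({
    "genres": ['[{"id": 28, "name": "Action"}]'],
    "spoken_languages": ['[{"iso_639_1": "en", "name": "English"}]'],
})

# Replace each nested column with its extracted names
for column in ["genres", "spoken_languages"]:
    toy_df[column] = extract_names(toy_df[column])

print(toy_df)
```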

In [16]:
# Verify Output
movie_df.head()

Unnamed: 0,title,revenue,genres,keywords,original_language,overview,production_companies,production_countries,runtime,spoken_languages,status,release_date_dt
0,Avatar,2787965000.0,Action Adventure Fantasy Science Fiction,culture clash future space war space c...,en,"In the 22nd century, a paraplegic Marine is di...",Ingenious Film Partners Twentieth Century Fox ...,US United States America GB United Kingdom,162.0,English Espa,Released,2009-12-10
1,Titanic,1845034000.0,Drama Romance Thriller,shipwreck iceberg ship panic titanic...,en,"84 years later, a 101-year-old woman named Ros...",Paramount Pictures Twentieth Century Fox Film ...,US United States America,194.0,English Fran Deutsch Italiano,Released,1997-11-18
2,The Avengers,1519558000.0,Science Fiction Action Adventure,new york shield marvel comic superhero...,en,When an unexpected enemy emerges and threatens...,Paramount Pictures Marvel Studios,US United States America,143.0,English,Released,2012-04-25
3,Jurassic World,1513529000.0,Action Adventure Science Fiction Thriller,monster dna tyrannosaurus rex velocira...,en,Twenty-two years after the events of Jurassic ...,Universal Studios Amblin Entertainment Legenda...,US United States America,124.0,English,Released,2015-06-09
4,Furious 7,1506249000.0,Action,car race speed revenge suspense car ...,en,Deckard Shaw seeks revenge against Dominic Tor...,Universal Pictures Original Film Fuji Televisi...,JP Japan US United States America,137.0,English,Released,2015-04-01


In [17]:
# Store modified movie_df globally
movie_df_unnested = movie_df.copy()
%store movie_df_unnested 

Stored 'movie_df_unnested' (DataFrame)
