## Feature Engineering of Movie Data

#### NLP Assessment:
   - **title**
   - **overview**
   - **genres_series**
   - **keywords_series**
   - **prod_companies_series**
   - **prod_countries_series**
   - **spoken_lang_series**

- NLP processing will be applied to these columns to assess whether particular words within each category have an impact on movie revenue.

In [1]:
%load_ext watermark
%watermark -a "Emily Schoof" -d -t -v -p numpy,pandas,nltk,sklearn,spacy

Emily Schoof 2019-08-22 16:50:25 

CPython 3.7.3
IPython 7.4.0

numpy 1.16.2
pandas 0.24.2
nltk 3.4
sklearn 0.20.3
spacy 2.1.8


In [2]:
# Load datasets
%store -r movie_df_unnested
movie_df = movie_df_unnested.copy()
print(movie_df.shape)

(4641, 12)


In [3]:
movie_df.head()

Unnamed: 0,title,revenue,genres,keywords,original_language,production_companies,production_countries,runtime,spoken_languages,release_date,release_day_of_week,release_month
0,Avatar,2787965000.0,Action Adventure Fantasy Science Fiction,culture clash future space war space c...,en,Ingenious Film Partners Twentieth Century Fox ...,US United States America GB United Kingdom,162.0,English Espa,20091210,Thursday,12
1,Titanic,1845034000.0,Drama Romance Thriller,shipwreck iceberg ship panic titanic...,en,Paramount Pictures Twentieth Century Fox Film ...,US United States America,194.0,English Fran Deutsch Italiano,19971118,Tuesday,11
2,The Avengers,1519558000.0,Science Fiction Action Adventure,new york shield marvel comic superhero...,en,Paramount Pictures Marvel Studios,US United States America,143.0,English,20120425,Wednesday,4
3,Jurassic World,1513529000.0,Action Adventure Science Fiction Thriller,monster dna tyrannosaurus rex velocira...,en,Universal Studios Amblin Entertainment Legenda...,US United States America,124.0,English,20150609,Tuesday,6
4,Furious 7,1506249000.0,Action,car race speed revenge suspense car ...,en,Universal Pictures Original Film Fuji Televisi...,JP Japan US United States America,137.0,English,20150401,Wednesday,4


## NLP Assessment

In [4]:
# Import necessary modules
import pandas as pd
from nltk.corpus import stopwords
import numpy as np
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer

#### NLP Pre-Processing Functions

In [5]:
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Create an object of class PorterStemmer
porter = PorterStemmer()

# Create CountVectorizer object
vectorizer = CountVectorizer(lowercase=True, stop_words='english')

In [6]:
def column_to_text(column): 
    """Convert Column to Text Values"""
    
    text = movie_df[column].values
    text_clean = []
    
    return text, text_clean

In [7]:
def preprocess_text(text, text_clean):
    """Lemmatize each string, remove stopwords and non-alphabetic tokens, and append the result to text_clean"""
    
    # Build the stopword set once instead of reloading the list for every lemma
    stop_words = set(stopwords.words('english'))
    
    for each in text:
        # Create Doc object
        doc = nlp(each, disable=['ner', 'parser'])

        # Generate lemmas
        lemmas = [token.lemma_ for token in doc]

        # Remove stopwords and non-alphabetic characters
        a_lemmas = [lemma.lower() for lemma in lemmas if lemma.isalpha() and lemma not in stop_words]
        
        # Optional: strip each word down to its stem (disabled)
        # stems = [porter.stem(a_lemma) for a_lemma in a_lemmas]
        
        # Clean text
        clean_text = ' '.join(a_lemmas)

        # Add to list
        text_clean.append(clean_text)
    
    return text_clean
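
The filtering step inside `preprocess_text` can be tried in isolation without loading spaCy or NLTK. The sketch below is illustrative only: `TOY_STOPWORDS` and `filter_lemmas` are hypothetical names, the stop-word set is a tiny stand-in for NLTK's English list, and the input is assumed to be already lemmatized. Unlike the notebook function, it lowercases each lemma before the stop-word comparison.

```python
# Tiny stand-in for nltk's English stop-word list (assumption, not the real list)
TOY_STOPWORDS = frozenset({"the", "of", "a", "and"})

def filter_lemmas(lemmas, stop_words=TOY_STOPWORDS):
    """Keep lowercase alphabetic lemmas that are not stop words."""
    return [
        lemma.lower()
        for lemma in lemmas
        if lemma.isalpha() and lemma.lower() not in stop_words
    ]

# "7" and "-PRON-" fail isalpha(); "The", "of", "the" are stop words
toy_lemmas = ["The", "lord", "of", "the", "ring", "7", "-PRON-"]
toy_clean = ' '.join(filter_lemmas(toy_lemmas))
print(toy_clean)
```

Membership tests against a `set` or `frozenset` are constant time on average, which matters when the check runs once per token across thousands of rows.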

In [8]:
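def bag_of_words_dataframe(text_clean):
    """Build a bag-of-words DataFrame (one column per vocabulary term) from a list of cleaned strings"""
    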
    # Convert to Series
    text_corpus = pd.Series(text_clean)
    
    # Generate Matrix of Word Vectors
    bow_matrix = vectorizer.fit_transform(text_corpus)
    
    # Convert bow_matrix into a DataFrame
    bow_df = pd.DataFrame(bow_matrix.toarray())

    # Map the column names to vocabulary 
    bow_df.columns = vectorizer.get_feature_names()
    
    return bow_df
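
To see what `bag_of_words_dataframe` produces, here is a minimal sketch on a three-row toy corpus. The names `toy_corpus`, `toy_vectorizer`, and `toy_bow_df` are made up for illustration. The notebook runs on scikit-learn 0.20.3, where `get_feature_names()` is available; the sketch falls back between that method and `get_feature_names_out()` so it should also run on newer releases, where the older method has been removed.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for a cleaned text column (hypothetical data)
toy_corpus = pd.Series([
    "action adventure science fiction",
    "drama romance",
    "science fiction action",
])

toy_vectorizer = CountVectorizer(lowercase=True, stop_words='english')
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)

# Newer scikit-learn exposes get_feature_names_out(); older versions only
# have get_feature_names()
if hasattr(toy_vectorizer, "get_feature_names_out"):
    toy_columns = list(toy_vectorizer.get_feature_names_out())
else:
    toy_columns = list(toy_vectorizer.get_feature_names())

# One row per document, one column per vocabulary term
toy_bow_df = pd.DataFrame(toy_matrix.toarray(), columns=toy_columns)
print(toy_bow_df)
```

Each multi-word genre such as "science fiction" is split into separate `science` and `fiction` columns, which is the same behavior visible in the genre output of this notebook.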

In [9]:
def select_top_features(df):
    """ Select Features That Equal 1 in More Than 100 Rows """
    
    # Count the rows in which each feature equals exactly 1
    sum_ones = pd.Series((df == 1).sum(axis=0))
    
    # Select only feature terms with more than 100 such rows
    important_features = []
    for feature,sums in sum_ones.iteritems():
        if sums > 100:
            important_features.append(feature)
    
    # Make list of unimportant features
    drop_columns = []
    for column in df.columns:
        if column not in important_features:
            drop_columns.append(column)
            
    # Drop unimportant features
    df = df.drop(columns=drop_columns)
    
    return df
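
The same selection can be written with a boolean column mask instead of two loops. This is a sketch under assumptions, not the notebook's implementation: `select_top_features_vectorized` and its `min_count` parameter are hypothetical, and the toy threshold of 2 replaces the notebook's 100 so the effect is visible on four rows.

```python
import pandas as pd

def select_top_features_vectorized(df, min_count=100):
    """Keep columns whose value equals exactly 1 in more than min_count rows."""
    ones_per_column = (df == 1).sum(axis=0)
    return df.loc[:, ones_per_column > min_count]

# Hypothetical bag-of-words counts
toy = pd.DataFrame({
    "war":   [1, 1, 2, 0],   # two cells equal 1; the 2 is not counted
    "love":  [1, 1, 1, 0],   # three cells equal 1
    "alien": [0, 0, 1, 0],   # one cell equals 1
})

kept = select_top_features_vectorized(toy, min_count=2)
print(list(kept.columns))
```

Because the comparison is `df == 1`, a row where a term appears twice does not contribute to that term's total. Comparing with `df >= 1` would count every row in which the term appears at all; which of the two is wanted depends on the analysis.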

### Application of Functions on:

Columns from movie_df:
- **title**

Cleaned sub_series within preprocessing_series list:
- **genres_series**
- **keywords_series**
- **prod_companies_series**
- **prod_countries_series**
- **spoken_lang_series**

#### Title

In [13]:
title_text, title_clean = column_to_text('title')
title_clean = preprocess_text(title_text, title_clean)

In [14]:
# Test output
title_clean[0], len(title_clean)

('avatar', 4641)

In [15]:
def length_of_string(clean_list):
    """Return the character length of each string in clean_list"""
    
    lengths = []
    for string in clean_list:
        length = len(string)
        lengths.append(length)
    
    return lengths

In [16]:
title_lengths = length_of_string(title_clean)
print(len(title_lengths))

# Replace title column with the character length of each cleaned title
movie_df['title'] = pd.Series(title_lengths, index=movie_df.index)

4641


In [17]:
movie_df.shape

(4641, 12)
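
The title-length step can also be expressed with the pandas string accessor. The snippet below is a small sketch on made-up cleaned titles; `toy_titles` and `toy_lengths` are illustrative names only.

```python
import pandas as pd

# Hypothetical cleaned titles, in the style of title_clean
toy_titles = pd.Series(["avatar", "titanic", "jurassic world"])

# .str.len() returns the character count of each string, spaces included
toy_lengths = toy_titles.str.len()
print(toy_lengths.tolist())
```

Note that the length is measured on the cleaned title, so stop words and digits removed during preprocessing do not count toward it.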

### Preprocessing Text Columns

#### Genres

In [18]:
genres_text, genres_clean = column_to_text('genres')
genres_clean = preprocess_text(genres_text, genres_clean)

In [19]:
genres_clean[0]

'action adventure fantasy science fiction'

In [20]:
genres_df = bag_of_words_dataframe(genres_clean)
genres_df.shape

(4641, 22)

In [21]:
# Create second model to assess differences in analysis results
movie_revenue_ml_2 = movie_df.copy()

# Add genres_df to movie_revenue_ml_2
movie_revenue_ml_2 = pd.concat([movie_revenue_ml_2, genres_df], axis=1, sort=False)

# Drop genres column from movie_revenue_ml_2
movie_revenue_ml_2 = movie_revenue_ml_2.drop(columns=['genres'])
movie_revenue_ml_2.head()

Unnamed: 0,title,revenue,keywords,original_language,production_companies,production_countries,runtime,spoken_languages,release_date,release_day_of_week,...,horror,movie,music,mystery,romance,science,thriller,tv,war,western
0,6,2787965000.0,culture clash future space war space c...,en,Ingenious Film Partners Twentieth Century Fox ...,US United States America GB United Kingdom,162.0,English Espa,20091210,Thursday,...,0,0,0,0,0,1,0,0,0,0
1,7,1845034000.0,shipwreck iceberg ship panic titanic...,en,Paramount Pictures Twentieth Century Fox Film ...,US United States America,194.0,English Fran Deutsch Italiano,19971118,Tuesday,...,0,0,0,0,1,0,1,0,0,0
2,7,1519558000.0,new york shield marvel comic superhero...,en,Paramount Pictures Marvel Studios,US United States America,143.0,English,20120425,Wednesday,...,0,0,0,0,0,1,0,0,0,0
3,14,1513529000.0,monster dna tyrannosaurus rex velocira...,en,Universal Studios Amblin Entertainment Legenda...,US United States America,124.0,English,20150609,Tuesday,...,0,0,0,0,0,1,1,0,0,0
4,7,1506249000.0,car race speed revenge suspense car ...,en,Universal Pictures Original Film Fuji Televisi...,JP Japan US United States America,137.0,English,20150401,Wednesday,...,0,0,0,0,0,0,0,0,0,0


In [22]:
genres_df = select_top_features(genres_df)
genres_df.shape

(4641, 18)

In [23]:
# Add genres_df to movie_df
movie_df = pd.concat([movie_df, genres_df], axis=1, sort=False)

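# Note: this reassigns movie_revenue_ml_2 (replacing the version built in In [21]);
# the genres column itself stays in movie_df, as the output below shows
movie_revenue_ml_2 = movie_df.drop(columns=['genres'])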
movie_df.head()

Unnamed: 0,title,revenue,genres,keywords,original_language,production_companies,production_countries,runtime,spoken_languages,release_date,...,fantasy,fiction,history,horror,music,mystery,romance,science,thriller,war
0,6,2787965000.0,Action Adventure Fantasy Science Fiction,culture clash future space war space c...,en,Ingenious Film Partners Twentieth Century Fox ...,US United States America GB United Kingdom,162.0,English Espa,20091210,...,1,1,0,0,0,0,0,1,0,0
1,7,1845034000.0,Drama Romance Thriller,shipwreck iceberg ship panic titanic...,en,Paramount Pictures Twentieth Century Fox Film ...,US United States America,194.0,English Fran Deutsch Italiano,19971118,...,0,0,0,0,0,0,1,0,1,0
2,7,1519558000.0,Science Fiction Action Adventure,new york shield marvel comic superhero...,en,Paramount Pictures Marvel Studios,US United States America,143.0,English,20120425,...,0,1,0,0,0,0,0,1,0,0
3,14,1513529000.0,Action Adventure Science Fiction Thriller,monster dna tyrannosaurus rex velocira...,en,Universal Studios Amblin Entertainment Legenda...,US United States America,124.0,English,20150609,...,0,1,0,0,0,0,0,1,1,0
4,7,1506249000.0,Action,car race speed revenge suspense car ...,en,Universal Pictures Original Film Fuji Televisi...,JP Japan US United States America,137.0,English,20150401,...,0,0,0,0,0,0,0,0,0,0


In [24]:
movie_df.shape

(4641, 30)
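
`pd.concat(..., axis=1)` aligns rows by index label, not by position. The row counts above stay at 4641 because `movie_df` and the bag-of-words frames share the same default 0-based index. The sketch below, on made-up frames named `left` and `right`, shows what happens when the indexes differ and how resetting the index guards against it.

```python
import pandas as pd

# Hypothetical frames: left has a non-default index, right a RangeIndex 0..2
left = pd.DataFrame({"title": [6, 7, 7]}, index=[10, 11, 12])
right = pd.DataFrame({"action": [1, 0, 1]})

# Labels do not overlap, so the result is the union of both indexes
misaligned = pd.concat([left, right], axis=1, sort=False)

# Resetting the index first makes the rows line up by position
aligned = pd.concat([left.reset_index(drop=True), right], axis=1, sort=False)

print(misaligned.shape, aligned.shape)
```

Checking the shape after each concatenation, as the notebook does, is a cheap way to catch this kind of mismatch early.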

#### Keywords

In [25]:
keywords_text, keywords_clean = column_to_text('keywords')
keywords_clean = preprocess_text(keywords_text, keywords_clean)

In [26]:
keywords_clean[0]

'culture clash future space war space colony society space travel futuristic romance space alien tribe alien planet cgi marine soldier battle love affair anti war power relation mind soul'

In [27]:
keywords_df = bag_of_words_dataframe(keywords_clean)
keywords_df.shape

(4641, 6540)

In [28]:
keywords_df = select_top_features(keywords_df)

In [29]:
keywords_df.shape

(4641, 41)

In [30]:
# Add keywords_df to movie_df
movie_df = pd.concat([movie_df, keywords_df], axis=1, sort=False)
print(movie_df.shape)

# Drop keywords column from movie_df
movie_df = movie_df.drop(columns=['keywords'])
movie_df.head()

(4641, 71)


Unnamed: 0,title,revenue,genres,original_language,production_companies,production_countries,runtime,spoken_languages,release_date,release_day_of_week,...,revenge,school,secret,sex,sport,violence,war,woman,world,york
0,6,2787965000.0,Action Adventure Fantasy Science Fiction,en,Ingenious Film Partners Twentieth Century Fox ...,US United States America GB United Kingdom,162.0,English Espa,20091210,Thursday,...,0,0,0,0,0,0,2,0,0,0
1,7,1845034000.0,Drama Romance Thriller,en,Paramount Pictures Twentieth Century Fox Film ...,US United States America,194.0,English Fran Deutsch Italiano,19971118,Tuesday,...,0,0,0,0,0,0,0,1,0,0
2,7,1519558000.0,Science Fiction Action Adventure,en,Paramount Pictures Marvel Studios,US United States America,143.0,English,20120425,Wednesday,...,0,0,0,0,0,0,0,0,0,1
3,14,1513529000.0,Action Adventure Science Fiction Thriller,en,Universal Studios Amblin Entertainment Legenda...,US United States America,124.0,English,20150609,Tuesday,...,0,0,0,0,0,0,0,0,0,0
4,7,1506249000.0,Action,en,Universal Pictures Original Film Fuji Televisi...,JP Japan US United States America,137.0,English,20150401,Wednesday,...,1,0,0,0,0,0,0,0,0,0
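
The `war` column holds 2 for the first row above because `CountVectorizer` records how many times a term occurs, not just whether it occurs. If presence/absence indicators are preferred, scikit-learn's `binary=True` option caps every non-zero count at 1. The sketch below uses a single made-up keyword string; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical keyword string in which "space" and "war" each appear twice
toy_doc = ["space war space colony anti war"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_doc).toarray()[0].tolist()

flag_vec = CountVectorizer(binary=True)
flags = flag_vec.fit_transform(toy_doc).toarray()[0].tolist()

# vocabulary_ maps each term to its column position (alphabetical order)
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)
print(vocab)
print(counts)
print(flags)
```

This is an alternative configuration, not what the notebook's `vectorizer` uses; switching to it would change which columns pass a filter based on `df == 1`.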


#### Production Companies

In [31]:
prod_companies_text, prod_companies_clean = column_to_text('production_companies')
prod_companies_clean = preprocess_text(prod_companies_text, prod_companies_clean)

In [32]:
prod_companies_clean[0]

'ingenious film partners twentieth century fox film corporation dune entertainment lightstorm entertainment'

In [33]:
prod_companies_df = bag_of_words_dataframe(prod_companies_clean)
prod_companies_df.shape

(4641, 4696)

In [34]:
prod_companies_df = select_top_features(prod_companies_df)
prod_companies_df.shape

(4641, 37)

In [35]:
# Add prod_companies_df to movie_df
movie_df = pd.concat([movie_df, prod_companies_df], axis=1, sort=False)
print(movie_df.shape)

# Drop production_companies column from movie_df
movie_df = movie_df.drop(columns=['production_companies'])
movie_df.head()

(4641, 107)


Unnamed: 0,title,revenue,genres,original_language,production_countries,runtime,spoken_languages,release_date,release_day_of_week,release_month,...,pictures,production,productions,relativity,studio,touchstone,twentieth,universal,walt,warner
0,6,2787965000.0,Action Adventure Fantasy Science Fiction,en,US United States America GB United Kingdom,162.0,English Espa,20091210,Thursday,12,...,0,0,0,0,0,0,1,0,0,0
1,7,1845034000.0,Drama Romance Thriller,en,US United States America,194.0,English Fran Deutsch Italiano,19971118,Tuesday,11,...,1,0,0,0,0,0,1,0,0,0
2,7,1519558000.0,Science Fiction Action Adventure,en,US United States America,143.0,English,20120425,Wednesday,4,...,1,0,0,0,1,0,0,0,0,0
3,14,1513529000.0,Action Adventure Science Fiction Thriller,en,US United States America,124.0,English,20150609,Tuesday,6,...,1,0,0,0,0,0,0,1,0,0
4,7,1506249000.0,Action,en,JP Japan US United States America,137.0,English,20150401,Wednesday,4,...,1,1,0,0,0,0,0,1,0,0


#### Production Countries

In [36]:
prod_countries_text, prod_countries_clean = column_to_text('production_countries')
prod_countries_clean = preprocess_text(prod_countries_text, prod_countries_clean)

In [37]:
prod_countries_clean[0]

'us united states america gb united kingdom'

In [38]:
prod_countries_df = bag_of_words_dataframe(prod_countries_clean)
prod_countries_df.shape

(4641, 173)

In [39]:
prod_countries_df = select_top_features(prod_countries_df)
prod_countries_df.shape

(4641, 12)

In [40]:
# Add prod_countries_df  to movie_df
movie_df = pd.concat([movie_df, prod_countries_df], axis=1, sort=False)
print(movie_df.shape)

# Drop production_countries column from movie_df
movie_df = movie_df.drop(columns=['production_countries'])
movie_df.head()

(4641, 118)


Unnamed: 0,title,revenue,genres,original_language,runtime,spoken_languages,release_date,release_day_of_week,release_month,action,...,australia,ca,canada,fr,france,gb,germany,kingdom,states,united
0,6,2787965000.0,Action Adventure Fantasy Science Fiction,en,162.0,English Espa,20091210,Thursday,12,1,...,0,0,0,0,0,1,0,1,1,2
1,7,1845034000.0,Drama Romance Thriller,en,194.0,English Fran Deutsch Italiano,19971118,Tuesday,11,0,...,0,0,0,0,0,0,0,0,1,1
2,7,1519558000.0,Science Fiction Action Adventure,en,143.0,English,20120425,Wednesday,4,1,...,0,0,0,0,0,0,0,0,1,1
3,14,1513529000.0,Action Adventure Science Fiction Thriller,en,124.0,English,20150609,Tuesday,6,1,...,0,0,0,0,0,0,0,0,1,1
4,7,1506249000.0,Action,en,137.0,English,20150401,Wednesday,4,1,...,0,0,0,0,0,0,0,0,1,1


#### Spoken Language

In [41]:
spoken_lang_text, spoken_lang_clean = column_to_text('spoken_languages')
spoken_lang_clean = preprocess_text(spoken_lang_text, spoken_lang_clean)

In [42]:
# Test output
spoken_lang_clean[0]

'english espa'

In [43]:
spoken_lang_df = bag_of_words_dataframe(spoken_lang_clean)
spoken_lang_df.shape

(4641, 34)

In [44]:
spoken_lang_df = select_top_features(spoken_lang_df)
spoken_lang_df.shape

(4641, 5)

In [45]:
# Add spoken_lang_df to movie_df
movie_df = pd.concat([movie_df, spoken_lang_df], axis=1, sort=False)
print(movie_df.shape)

# Drop spoken_languages column from movie_df
movie_df = movie_df.drop(columns=['spoken_languages'])
movie_df.head()

(4641, 122)


Unnamed: 0,title,revenue,genres,original_language,runtime,release_date,release_day_of_week,release_month,action,adventure,...,gb,germany,kingdom,states,united,deutsch,english,espa,fran,italiano
0,6,2787965000.0,Action Adventure Fantasy Science Fiction,en,162.0,20091210,Thursday,12,1,1,...,1,0,1,1,2,0,1,1,0,0
1,7,1845034000.0,Drama Romance Thriller,en,194.0,19971118,Tuesday,11,0,0,...,0,0,0,1,1,1,1,0,1,1
2,7,1519558000.0,Science Fiction Action Adventure,en,143.0,20120425,Wednesday,4,1,1,...,0,0,0,1,1,0,1,0,0,0
3,14,1513529000.0,Action Adventure Science Fiction Thriller,en,124.0,20150609,Tuesday,6,1,1,...,0,0,0,1,1,0,1,0,0,0
4,7,1506249000.0,Action,en,137.0,20150401,Wednesday,4,1,0,...,0,0,0,1,1,0,1,0,0,0
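
The concatenate-then-drop pattern repeated for each text column could be wrapped in one helper. The function below is a hypothetical sketch (`attach_features` is not part of the notebook), shown on toy data, and it assumes both frames share the same row index.

```python
import pandas as pd

def attach_features(df, source_column, features_df):
    """Append feature columns to df, then drop the raw text column."""
    combined = pd.concat([df, features_df], axis=1, sort=False)
    return combined.drop(columns=[source_column])

# Hypothetical two-row stand-ins for movie_df and spoken_lang_df
toy_movies = pd.DataFrame({
    "revenue": [10.0, 20.0],
    "spoken_languages": ["English", "English Deutsch"],
})
toy_features = pd.DataFrame({"deutsch": [0, 1], "english": [1, 1]})

result = attach_features(toy_movies, "spoken_languages", toy_features)
print(list(result.columns))
```

Keeping the column name as an explicit argument also avoids the mismatch between a comment and the column actually dropped.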


### Store engineered dataframes for dataframe merging

In [46]:
movie_revenue_ml = movie_df.copy()
%store movie_revenue_ml
%store movie_revenue_ml_2

Stored 'movie_revenue_ml' (DataFrame)
Stored 'movie_revenue_ml_2' (DataFrame)
