## Feature Engineering of Movie Data

#### NLP Assessment:
   - **title**
   - **overview**
   - **genres_series**
   - **keywords_series**
   - **prod_companies_series**
   - **prod_countries_series**
   - **spoken_lang_series**

- NLP processing will be applied to these columns to assess whether particular words within each category affect the resulting movie revenue.
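
As a rough illustration of what "impact on revenue" could mean for a single word, the sketch below compares the mean revenue of rows that contain a token with the mean revenue of rows that do not. The rows and revenue figures are invented toy values, and `mean_revenue_by_token` is a hypothetical helper that is not part of this notebook.

```python
from collections import defaultdict

# Toy rows: (cleaned text, revenue). Values are invented for illustration only.
toy_rows = [
    ("action adventure", 300.0),
    ("action drama", 100.0),
    ("drama romance", 50.0),
    ("romance comedy", 150.0),
]

def mean_revenue_by_token(rows):
    """Return {token: (mean revenue with token, mean revenue without token)}."""
    with_token = defaultdict(list)
    for text, revenue in rows:
        # Use a set so a token repeated within one row is counted once
        for token in set(text.split()):
            with_token[token].append(revenue)

    total = sum(revenue for _, revenue in rows)
    result = {}
    for token, revenues in with_token.items():
        n_without = len(rows) - len(revenues)
        mean_with = sum(revenues) / len(revenues)
        # Mean over the remaining rows; None when every row contains the token
        mean_without = (total - sum(revenues)) / n_without if n_without else None
        result[token] = (mean_with, mean_without)
    return result

token_means = mean_revenue_by_token(toy_rows)
print(token_means["action"])
```

A gap between the two means is only a descriptive hint; it does not by itself show that the word drives revenue.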

In [1]:
# Load datasets
%store -r movie_df_unnested
movie_df = movie_df_unnested.copy()
print(movie_df.shape)

(4641, 12)


In [2]:
movie_df.head()

Unnamed: 0,title,revenue,genres,keywords,original_language,production_companies,production_countries,runtime,spoken_languages,release_date,release_day_of_week,release_month
0,Avatar,2787965000.0,Action Adventure Fantasy Science Fiction,"[culture clash, future, space war, space colon...",en,Ingenious Film Partners Twentieth Century Fox ...,US United States America GB United Kingdom,162.0,English Espa,20091210,Thursday,12
1,Titanic,1845034000.0,Drama Romance Thriller,"[shipwreck, iceberg, ship, panic, titanic, oce...",en,Paramount Pictures Twentieth Century Fox Film ...,US United States America,194.0,English Fran Deutsch Italiano,19971118,Tuesday,11
2,The Avengers,1519558000.0,Science Fiction Action Adventure,"[new york, shield, marvel comic, superhero, ba...",en,Paramount Pictures Marvel Studios,US United States America,143.0,English,20120425,Wednesday,4
3,Jurassic World,1513529000.0,Action Adventure Science Fiction Thriller,"[monster, dna, tyrannosaurus rex, velociraptor...",en,Universal Studios Amblin Entertainment Legenda...,US United States America,124.0,English,20150609,Tuesday,6
4,Furious 7,1506249000.0,Action,"[car race, speed, revenge, suspense, car, race...",en,Universal Pictures Original Film Fuji Televisi...,JP Japan US United States America,137.0,English,20150401,Wednesday,4


## NLP Assessment

In [3]:
# Import necessary modules
import pandas as pd
from nltk.corpus import stopwords
import numpy as np
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer

#### NLP Pre-Processing Functions

In [4]:
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Create an object of class PorterStemmer
porter = PorterStemmer()

# Create CountVectorizer object
vectorizer = CountVectorizer(lowercase=True, stop_words='english')

In [5]:
def column_to_text(column): 
    """Convert Column to Text Values"""
    
    text = movie_df[column].values
    text_clean = []
    
    return text, text_clean

In [6]:
def list_to_text(cat_list): 
    """Convert List to Text Values"""
    
    text_clean = []
    text = []
    
    for each in cat_list:
        texts = ' '.join(each)
        text.append(texts)   
    
    return text, text_clean

In [7]:
def preprocess_text(text, text_clean):
    """Lemmatize each string, drop stopwords and non-alphabetic tokens, and append the result to text_clean"""
    
    # Build the stopword set once instead of reloading the list for every token
    stop_words = set(stopwords.words('english'))
    
    for each in text:
        # Create Doc object
        doc = nlp(each, disable=['ner', 'parser'])

        # Generate lowercase lemmas so the stopword check is case-insensitive
        lemmas = [token.lemma_.lower() for token in doc]

        # Remove stopwords and non-alphabetic tokens
        a_lemmas = [lemma for lemma in lemmas if lemma.isalpha() and lemma not in stop_words]
        
        # Optional: strip words down to their stems (disabled)
        # stems = [porter.stem(a_lemma) for a_lemma in a_lemmas]
        
        # Clean text
        clean_text = ' '.join(a_lemmas)

        # Add to list
        text_clean.append(clean_text)
    
    return text_clean
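
The filtering step in `preprocess_text` can be sketched without spaCy or NLTK. The version below is a simplified stand-in: it skips lemmatization entirely, and `STOP_WORDS` is a tiny hypothetical set rather than NLTK's English list. It lowercases each token before the stopword check, so a capitalized stopword such as "The" is removed regardless of how the lemmatizer cases it.

```python
# Tiny stand-in stopword set (assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"the", "a", "of", "and"}

def filter_tokens(tokens, stop_words=STOP_WORDS):
    """Lowercase tokens, then drop stopwords and non-alphabetic tokens."""
    cleaned = []
    for token in tokens:
        lowered = token.lower()
        if lowered.isalpha() and lowered not in stop_words:
            cleaned.append(lowered)
    return " ".join(cleaned)

print(filter_tokens(["The", "Avengers"]))
print(filter_tokens(["Furious", "7"]))
```

Because no lemmatizer runs here, plural forms are kept as they are; the notebook's spaCy-based function would reduce them to their lemmas.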

In [8]:
def bag_of_words_dataframe(text_clean):
    """Build a Bag-of-Words DataFrame (one column per vocabulary word) from Cleaned Text"""
    
    # Convert to Series
    text_corpus = pd.Series(text_clean)
    
    # Generate Matrix of Word Vectors
    bow_matrix = vectorizer.fit_transform(text_corpus)
    
    # Convert bow_matrix into a DataFrame
    bow_df = pd.DataFrame(bow_matrix.toarray())

    # Map the column names to vocabulary
    # (scikit-learn >= 1.0; older releases use vectorizer.get_feature_names())
    bow_df.columns = vectorizer.get_feature_names_out()
    
    return bow_df
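
To make the bag-of-words step concrete, here is a minimal pure-Python sketch of the same idea: one column per vocabulary word, one row per document, each cell holding a count. `bag_of_words` is a hypothetical helper for illustration only. Unlike `CountVectorizer`, it does no stopword removal or token-length filtering, although it does mirror the alphabetical column order.

```python
def bag_of_words(docs):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in docs[i]."""
    # Sorted vocabulary, so columns come out in alphabetical order
    vocabulary = sorted({token for doc in docs for token in doc.split()})
    column = {token: j for j, token in enumerate(vocabulary)}

    rows = []
    for doc in docs:
        counts = [0] * len(vocabulary)
        for token in doc.split():
            counts[column[token]] += 1
        rows.append(counts)
    return vocabulary, rows

vocabulary, rows = bag_of_words([
    "action adventure fantasy science fiction",
    "drama romance thriller",
])
print(vocabulary)
print(rows[0])
```

Splitting on whitespace means a multi-word label such as "science fiction" ends up as two separate columns, `science` and `fiction`.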

### Application of Functions on:

Columns from movie_df:
- **title**

Cleaned sub_series within preprocessing_series list:
- **genres_series**
- **keywords_series**
- **prod_companies_series**
- **prod_countries_series**
- **spoken_lang_series**

#### Title

In [9]:
title_text, title_clean = column_to_text('title')
title_clean = preprocess_text(title_text, title_clean)

In [10]:
# Test output
title_clean[0], len(title_clean)

('avatar', 4641)

In [11]:
def length_of_string(clean_list):
    
    lengths = []
    for string in clean_list:
        length = len(string)
        lengths.append(length)
    
    return lengths
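
Note that `length_of_string` measures characters, spaces included, in the cleaned title. The sketch below uses a hypothetical helper, `char_and_word_lengths`, to show the character count next to a word count, in case the number of words is the more useful feature.

```python
def char_and_word_lengths(clean_list):
    """Return (character count, word count) for each cleaned string."""
    return [(len(string), len(string.split())) for string in clean_list]

print(char_and_word_lengths(["avatar", "jurassic world"]))
```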

In [12]:
title_lengths = length_of_string(title_clean)
print(len(title_lengths))

# Replace title column with lengths (reuse the existing index instead of hard-coding the row count)
movie_df['title'] = pd.Series(title_lengths, index=movie_df.index)

4641


In [13]:
movie_df.shape

(4641, 12)

### Use of preprocessing_series

#### Genres

In [14]:
genres_text, genres_clean = column_to_text('genres')
genres_clean = preprocess_text(genres_text, genres_clean)

In [15]:
genres_clean[0]

'action adventure fantasy science fiction'

In [16]:
genres_df = bag_of_words_dataframe(genres_clean)
print(genres_df.shape)
genres_df.head()

(4641, 22)


Unnamed: 0,action,adventure,animation,comedy,crime,documentary,drama,family,fantasy,fiction,...,horror,movie,music,mystery,romance,science,thriller,tv,war,western
0,1,1,0,0,0,0,0,0,1,1,...,0,0,0,0,0,1,0,0,0,0
1,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,1,0,1,0,0,0
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,1,1,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [17]:
# Add genres_df to movie_df
movie_df = pd.concat([movie_df, genres_df], axis=1, sort=False)

# Drop genres column from movie_df
movie_df = movie_df.drop(columns=['genres'])
movie_df.head()

Unnamed: 0,title,revenue,keywords,original_language,production_companies,production_countries,runtime,spoken_languages,release_date,release_day_of_week,...,horror,movie,music,mystery,romance,science,thriller,tv,war,western
0,6,2787965000.0,"[culture clash, future, space war, space colon...",en,Ingenious Film Partners Twentieth Century Fox ...,US United States America GB United Kingdom,162.0,English Espa,20091210,Thursday,...,0,0,0,0,0,1,0,0,0,0
1,7,1845034000.0,"[shipwreck, iceberg, ship, panic, titanic, oce...",en,Paramount Pictures Twentieth Century Fox Film ...,US United States America,194.0,English Fran Deutsch Italiano,19971118,Tuesday,...,0,0,0,0,1,0,1,0,0,0
2,7,1519558000.0,"[new york, shield, marvel comic, superhero, ba...",en,Paramount Pictures Marvel Studios,US United States America,143.0,English,20120425,Wednesday,...,0,0,0,0,0,1,0,0,0,0
3,14,1513529000.0,"[monster, dna, tyrannosaurus rex, velociraptor...",en,Universal Studios Amblin Entertainment Legenda...,US United States America,124.0,English,20150609,Tuesday,...,0,0,0,0,0,1,1,0,0,0
4,7,1506249000.0,"[car race, speed, revenge, suspense, car, race...",en,Universal Pictures Original Film Fuji Televisi...,JP Japan US United States America,137.0,English,20150401,Wednesday,...,0,0,0,0,0,0,0,0,0,0


In [18]:
movie_df.shape

(4641, 33)
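
The unchanged row count is consistent with both frames sharing the same default index. `pd.concat(..., axis=1)` aligns rows by index label rather than by position, so the sketch below, built on toy frames that are not from this dataset, shows what would happen if `movie_df` had a non-default index (for example after rows were filtered out) and how resetting the index avoids it.

```python
import pandas as pd

# Toy frames: `left` has a non-default index, as it would after filtering rows
left = pd.DataFrame({"revenue": [10.0, 20.0]}, index=[0, 2])
right = pd.DataFrame({"action": [1, 0]})  # default RangeIndex 0, 1

# concat aligns on index labels, so unmatched labels produce NaN
misaligned = pd.concat([left, right], axis=1)

# Resetting the index first restores a row-by-row join
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned.shape, aligned.shape)
```

Checking the shape after each concatenation, as the notebook does, is a cheap way to catch this kind of misalignment.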

#### Keywords

In [None]:
# keywords holds a list of strings per row, so join each list into one string first
keywords_text, keywords_clean = list_to_text(movie_df['keywords'])
keywords_clean = preprocess_text(keywords_text, keywords_clean)

In [None]:
keywords_clean[0]

In [None]:
keywords_df = bag_of_words_dataframe(keywords_clean)
print(keywords_df.shape)
keywords_df.head()

#### Production Companies

In [None]:
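# preprocessing_series is not defined in this notebook; it is assumed to be restored from an earlier notebook
prod_companies_text, prod_companies_clean = list_to_text(preprocessing_series[2])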
prod_companies_clean = preprocess_text(prod_companies_text, prod_companies_clean)

In [None]:
prod_companies_clean[0]

In [None]:
prod_companies_df = bag_of_words_dataframe(prod_companies_clean)
print(prod_companies_df.shape)
prod_companies_df.head()

#### Production Countries

In [None]:
prod_countries_text, prod_countries_clean = list_to_text(preprocessing_series[3])
prod_countries_clean = preprocess_text(prod_countries_text, prod_countries_clean)

In [None]:
prod_countries_clean[0]

In [None]:
prod_countries_df = bag_of_words_dataframe(prod_countries_clean)
print(prod_countries_df.shape)
prod_countries_df.head()

#### Spoken Language

In [None]:
spoken_lang_text, spoken_lang_clean = list_to_text(preprocessing_series[4])
spoken_lang_clean = preprocess_text(spoken_lang_text, spoken_lang_clean)

In [None]:
# Test output
spoken_lang_clean[0]

In [None]:
spoken_lang_df = bag_of_words_dataframe(spoken_lang_clean)
print(spoken_lang_df.shape)
spoken_lang_df.head()

### Store engineered dataframes for dataframe merging

In [38]:
engineered_features_dfs = [title_df, overview_df, genres_df, keywords_df,
                           prod_companies_df, prod_countries_df, spoken_lang_df]

%store engineered_features_dfs

Stored 'engineered_features_dfs' (list)
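
When these frames are merged later, their vocabularies may overlap: a word such as "war" can plausibly occur both as a genre and inside a keyword. One possible way to keep such columns apart, sketched here on toy frames with hypothetical prefixes, is to prefix each frame with its source before concatenating.

```python
import pandas as pd

# Toy bag-of-words frames; "war" appears in both vocabularies
genres_toy = pd.DataFrame({"war": [1, 0], "drama": [0, 1]})
keywords_toy = pd.DataFrame({"war": [0, 1], "ship": [1, 0]})

# Prefix each frame with its source so merged columns stay distinguishable
merged = pd.concat(
    [genres_toy.add_prefix("genre_"), keywords_toy.add_prefix("keyword_")],
    axis=1,
)
print(list(merged.columns))
```

Without the prefixes, the merged frame would carry two columns both named `war`, which makes later column selection ambiguous.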


In [101]:
movie_revenue_ml = movie_df.copy()
%store movie_revenue_ml

Stored 'movie_revenue_ml' (DataFrame)
