## Feature Engineering of Movie Data

#### NLP Assessment:
   - **title**
   - **overview**
   - **genres_series**
   - **keywords_series**
   - **prod_companies_series**
   - **prod_countries_series**
   - **spoken_lang_series**

- Natural language processing (NLP) will be applied to these columns to assess whether particular words within each category have an impact on the resulting movie revenue.

In [1]:
# Load datasets
%store -r movie_df
print(movie_df.shape)
%store -r preprocessing_series

(4523, 12)


## NLP Assessment

In [2]:
# Import necessary modules
import pandas as pd
from nltk.corpus import stopwords
import numpy as np
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer

#### NLP Pre-Processing Functions

In [3]:
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Create an object of class PorterStemmer
porter = PorterStemmer()

# Create CountVectorizer object
vectorizer = CountVectorizer(lowercase=True, stop_words='english')

In [4]:
def column_to_text(column): 
    """Convert Column to Text Values"""
    
    text = movie_df[column].values
    text_clean = []
    
    return text, text_clean

In [5]:
def list_to_text(cat_list): 
    """Convert List to Text Values"""
    
    text_clean = []
    text = []
    
    for each in cat_list:
        texts = ' '.join(each)
        text.append(texts)   
    
    return text, text_clean
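
Joining with a single space means a multi-word name is later tokenized as separate words, which appears to be why the genres output further down contains both `scienc` and `fiction`. The sketch below illustrates that effect on made-up input. `join_categories`, its `keep_phrases` flag, and the sample data are hypothetical and not part of the notebook. Note that the `isalpha()` filter in `preprocess_text` would discard tokens containing underscores, so the phrase-preserving variant would only make sense if the joined strings were passed to a vectorizer without that filter.

```python
def join_categories(cat_list, keep_phrases=False):
    """Join each inner list of category names into one string per movie.

    With keep_phrases=True, spaces inside a name become underscores so a
    multi-word name survives later whitespace tokenization as one token.
    """
    joined = []
    for names in cat_list:
        if keep_phrases:
            names = [name.replace(' ', '_') for name in names]
        joined.append(' '.join(names))
    return joined


# Hypothetical input shaped like one entry of preprocessing_series
sample = [['Action', 'Science Fiction'], ['Drama']]

plain = join_categories(sample)
phrases = join_categories(sample, keep_phrases=True)
print(plain)
print(phrases)
```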

In [16]:
def preprocess_text(text, text_clean):
    """Lemmatize, drop stopwords and non-alphabetic tokens, stem, and append each cleaned string to text_clean"""
    
    # Build the stopword set once instead of reloading the list for every token
    stop_words = set(stopwords.words('english'))
    
    for each in text:
        # Create Doc object
        doc = nlp(each, disable=['ner', 'parser'])

        # Generate lemmas
        lemmas = [token.lemma_ for token in doc]

        # Remove stopwords and non-alphabetic tokens, then lowercase
        a_lemmas = [lemma.lower() for lemma in lemmas if lemma.isalpha() and lemma not in stop_words]
        
        # Strip each word down to its stem
        stems = [porter.stem(a_lemma) for a_lemma in a_lemmas]
        
        # Clean text
        clean_text = ' '.join(stems)

        # Add to list
        text_clean.append(clean_text)
    
    return text_clean
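
One detail of the filter above is the order of operations: membership in the stopword list is tested before the lemma is lowercased. Whether spaCy returns capitalised lemmas depends on the model and token, so the following is only a sketch of the possible effect, using a tiny stand-in stop list and hypothetical helper names rather than the notebook's nltk list. In the notebook, `CountVectorizer(stop_words='english')` would likely remove many such leftovers at the next step.

```python
# Tiny stand-in stop list; the notebook uses nltk's stopwords.words('english')
STOP_WORDS = {'the', 'of', 'a', 'and'}


def filter_case_sensitive(tokens):
    """Mirror the notebook's order: test membership first, then lowercase."""
    return [t.lower() for t in tokens if t.isalpha() and t not in STOP_WORDS]


def filter_case_insensitive(tokens):
    """Lowercase first so capitalised stop words are also removed."""
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered if t.isalpha() and t not in STOP_WORDS]


tokens = ['The', 'Lord', 'of', 'the', 'Rings', '2']
sensitive = filter_case_sensitive(tokens)
insensitive = filter_case_insensitive(tokens)
print(sensitive)
print(insensitive)
```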

In [7]:
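def bag_of_words_dataframe(text_clean):
    """Return a DataFrame of word counts with one row per document and one column per vocabulary term"""
    
    # Convert to Series
    text_corpus = pd.Series(text_clean)
    
    # Generate matrix of word vectors (refits the shared vectorizer on every call)
    bow_matrix = vectorizer.fit_transform(text_corpus)
    
    # Convert bow_matrix into a DataFrame
    bow_df = pd.DataFrame(bow_matrix.toarray())

    # Map the column names to vocabulary
    # (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
    bow_df.columns = vectorizer.get_feature_names_out()
    
    return bow_df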
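
To make the shape of the resulting frames easier to read, here is a minimal standard-library sketch of what a bag-of-words count matrix contains. It is a stand-in for illustration, not scikit-learn's implementation: `bag_of_words` and the two-document corpus are hypothetical, and no stop-word removal is done. The regular expression is the same as `CountVectorizer`'s default `token_pattern`, which only keeps tokens of two or more characters.

```python
import re
from collections import Counter

# Same regex as CountVectorizer's default token_pattern: words of 2+ characters
TOKEN_RE = re.compile(r'(?u)\b\w\w+\b')


def bag_of_words(corpus):
    """Return (sorted vocabulary, per-document count rows)."""
    counts = [Counter(TOKEN_RE.findall(doc.lower())) for doc in corpus]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


# Hypothetical cleaned strings in the style of the keywords output
corpus = ['space war space coloni', 'love affair war']
vocabulary, rows = bag_of_words(corpus)
print(vocabulary)
print(rows)
```

Each row corresponds to one movie and each column to one vocabulary term, which is why every frame below has 4523 rows and a column count equal to the size of that feature's vocabulary.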

### Application of Functions on:

Columns from movie_df:
- **title**
- **overview**

Cleaned sub-series within the preprocessing_series list:
- **genres_series**
- **keywords_series**
- **prod_companies_series**
- **prod_countries_series**
- **spoken_lang_series**

#### Title

In [17]:
title_text, title_clean = column_to_text('title')
title_clean = preprocess_text(title_text, title_clean)

In [18]:
# Test output
title_clean[0]

'avatar'

In [19]:
title_df = bag_of_words_dataframe(title_clean)
print(title_df.shape)
title_df.head()

(4523, 3704)


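|   | abandon | abbi | abcd | abduct | aberdeen | abid | abov | abraham | absentia | absolut | ... | zombi | zombieland | zone | zoo | zookeep | zooland | zoom | zorro | zulu | æon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |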


#### Overview

In [20]:
overview_text, overview_clean = column_to_text('overview')
overview_clean = preprocess_text(overview_text, overview_clean)

In [21]:
# Test output
overview_clean[0]

'centuri parapleg marin dispatch moon pandora uniqu mission becom tear follow order protect alien civil'

In [22]:
overview_df = bag_of_words_dataframe(overview_clean)
print(overview_df.shape)
overview_df.head()

(4523, 13885)


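|   | aa | aaa | aam | aang | aaron | aba | abaddon | abagnal | abandon | abba | ... | zoom | zorin | zorro | zuckerberg | zula | zuzu | zyklon | æon | émigré | über |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |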


### Use of preprocessing_series

#### Genres

In [23]:
genres_text, genres_clean = list_to_text(preprocessing_series[0])
genres_clean = preprocess_text(genres_text, genres_clean)

In [24]:
genres_clean[0]

'action adventur fantasi scienc fiction'

In [25]:
genres_df = bag_of_words_dataframe(genres_clean)
print(genres_df.shape)
genres_df.head()

(4523, 22)


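|   | action | adventur | anim | comedi | crime | documentari | drama | famili | fantasi | fiction | ... | horror | movi | music | mysteri | romanc | scienc | thriller | tv | war | western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |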


#### Keywords

In [26]:
keywords_text, keywords_clean = list_to_text(preprocessing_series[1])
keywords_clean = preprocess_text(keywords_text, keywords_clean)

In [27]:
keywords_clean[0]

'cultur clash futur space war space coloni societi space travel futurist romanc space alien tribe alien planet cgi marin soldier battl love affair anti war power relat mind soul'

In [28]:
keywords_df = bag_of_words_dataframe(keywords_clean)
print(keywords_df.shape)
keywords_df.head()

(4523, 6049)


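|   | abandon | abduct | abil | abolitionist | aborigin | abort | abraham | abroad | absorb | absurd | ... | zombif | zone | zoo | zookeep | zoom | zurich | γη | 卧底肥妈 | 绝地奶霸 | 超级妈妈 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |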


#### Production Companies

In [29]:
prod_companies_text, prod_companies_clean = list_to_text(preprocessing_series[2])
prod_companies_clean = preprocess_text(prod_companies_text, prod_companies_clean)

In [30]:
prod_companies_clean[0]

'ingeni film partner twentieth centuri fox film corpor dune entertain lightstorm entertain'

In [31]:
prod_companies_df = bag_of_words_dataframe(prod_companies_clean)
print(prod_companies_df.shape)
prod_companies_df.head()

(4523, 4530)


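|   | aardman | ab | abandon | abbolita | abe | abraham | abram | absinth | absolut | abu | ... | zweit | zürich | áfrica | édition | éireann | émile | étoil | étrangèr | île | österreichisch |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |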


#### Production Countries

In [32]:
prod_countries_text, prod_countries_clean = list_to_text(preprocessing_series[3])
prod_countries_clean = preprocess_text(prod_countries_text, prod_countries_clean)

In [33]:
prod_countries_clean[0]

'us gb'

In [34]:
prod_countries_df = bag_of_words_dataframe(prod_countries_clean)
print(prod_countries_df.shape)
prod_countries_df.head()

(4523, 74)


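|   | ae | af | ao | ar | au | aw | ba | bg | bo | br | ... | se | sg | si | sk | th | tn | tr | tw | ua | za |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |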


#### Spoken Language

In [35]:
spoken_lang_text, spoken_lang_clean = list_to_text(preprocessing_series[4])
spoken_lang_clean = preprocess_text(spoken_lang_text, spoken_lang_clean)

In [36]:
# Test output
spoken_lang_clean[0]

'en es'

In [37]:
spoken_lang_df = bag_of_words_dataframe(spoken_lang_clean)
print(spoken_lang_df.shape)
spoken_lang_df.head()

(4523, 77)


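|   | af | ar | bg | bm | bn | bo | br | bs | ca | ce | ... | tr | uk | ur | vi | wo | xh | xx | yi | zh | zu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |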


### Store engineered dataframes for dataframe merging

In [38]:
engineered_features_dfs = [title_df, overview_df, genres_df, keywords_df, 
                  prod_companies_df, prod_countries_df, spoken_lang_df]

%store engineered_features_dfs

Stored 'engineered_features_dfs' (list)
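
Several of the frames share column names (for example, `abandon` appears in the title, keywords, and production companies vocabularies), so a later column-wise merge could produce duplicate or ambiguous columns. One possible precaution, sketched here with hypothetical helpers `shared_names` and `prefixed_names`, is to detect the overlap and build a rename mapping per feature before merging. The short column lists are a hand-picked subset of the names shown in the outputs above.

```python
from collections import Counter


def prefixed_names(prefix, columns):
    """Map each column name to '<prefix>_<name>' for DataFrame.rename."""
    return {col: f'{prefix}_{col}' for col in columns}


def shared_names(*column_lists):
    """Return names that occur in more than one column list."""
    seen = Counter(name for cols in column_lists for name in set(cols))
    return sorted(name for name, n in seen.items() if n > 1)


# A few column names taken from the title and keywords outputs
title_cols = ['abandon', 'abduct', 'zoom']
keyword_cols = ['abandon', 'abduct', 'zoo']

overlap = shared_names(title_cols, keyword_cols)
renamed = prefixed_names('title', title_cols)
print(overlap)
print(renamed)
```

In the notebook, such a mapping could be applied with `title_df.rename(columns=prefixed_names('title', title_df.columns))`; pandas' `DataFrame.add_prefix('title_')` achieves the same result directly.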
