# 02. Data Engineering

In this notebook, we transform the data into the format our model requires and create new features that add useful information to it.

To start this notebook we will use the DataFrame we generated in Cleaning_Data.

## Loading libraries and data

In [12]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import ast
import collections
import seaborn as sns
from functools import reduce
from scipy import stats
%matplotlib inline
plt.style.use('ggplot')

In [10]:
reves=pd.read_pickle("./data/cleaned_reves_df.pkl")

In [14]:
reves.head(2)

Unnamed: 0,belongs_to_collection,budget,genres,original_language,production_companies,revenue,runtime,title,keywords,release_year,release_month,release_weekday,cast_names,cast_gender,Directors,Producers,Screenplayers
0,Toy Story Collection,30000000.0,"[Animation, Comedy, Family]",en,[Pixar Animation Studios],373554033.0,81.0,Toy Story,"[jealousy, toy, boy, friendship, friends, riva...",1995.0,10.0,0.0,"[Tom Hanks, Tim Allen, Don Rickles, Jim Varney...","[Tom Hanks, Tim Allen, Don Rickles, Jim Varney...",[John Lasseter],"[Bonnie Arnold, Ralph Guggenheim]","[Joss Whedon, Andrew Stanton, Joel Cohen, Alec..."
1,,65000000.0,"[Adventure, Fantasy, Family]",en,"[TriStar Pictures, Teitler Film, Interscope Co...",262797249.0,104.0,Jumanji,"[board game, disappearance, based on children'...",1995.0,12.0,4.0,"[Robin Williams, Jonathan Hyde, Kirsten Dunst,...","[Robin Williams, Jonathan Hyde, Kirsten Dunst,...",[Joe Johnston],"[Scott Kroopf, William Teitler]","[Jonathan Hensleigh, Greg Taylor, Jim Strain]"


## Let's perform some data transformation

#### Belongs to collection

As we explained in the first notebook, all we need from this column is whether or not a movie belongs to a series or collection. The name of the collection by itself wouldn't add any knowledge to our model. That's why we are going to overwrite this column with 2 possible values:

- 1 for movies that belong to a collection
- 0 for movies that don't

In [16]:
reves["belongs_to_collection"] = reves["belongs_to_collection"].apply(lambda x: 1 if type(x)==str else 0)
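The same flag can also be computed without a Python-level lambda. The sketch below is only an illustration on a made-up three-row frame (`toy` and `in_collection` are hypothetical names, not part of this notebook). It assumes the column holds either a collection name or a missing value. If the column could contain non-string, non-missing values, the result would differ from the `type(x)==str` check above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a collection name, or NaN for a standalone movie.
toy = pd.DataFrame(
    {"belongs_to_collection": ["Toy Story Collection", np.nan, "Some Collection"]}
)

# notna() is True where a name is present; astype(int) maps True/False to 1/0.
toy["in_collection"] = toy["belongs_to_collection"].notna().astype(int)

print(toy["in_collection"].tolist())  # [1, 0, 1]
```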

#### Genres

In [17]:
def get_uniques(sequence):
    # Flatten the lists stored in each row and keep one copy of every value.
    # The order of the result is arbitrary because it comes from a set.
    flat_list=[item for sublist in sequence for item in sublist]
    return list(set(flat_list))

In [21]:
genres=get_uniques(reves["genres"])
print(genres)


['Western', 'Comedy', 'Animation', 'Drama', 'Romance', 'Fantasy', 'Thriller', 'Adventure', 'Horror', 'Family', 'History', 'Crime', 'Mystery', 'Music', 'TV Movie', 'War', 'Documentary', 'Foreign', 'Science Fiction', 'Action']
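As a possible alternative to flattening the lists by hand, pandas can do the same job with `explode`. This is a minimal sketch on a hypothetical toy Series (`toy_genres` is not part of this notebook), assuming every row holds a Python list. Unlike a set, it returns the values in order of first appearance, so the result is reproducible between runs.

```python
import pandas as pd

# Hypothetical toy column where every row holds a list of genre names.
toy_genres = pd.Series([["Animation", "Comedy", "Family"], ["Comedy"], []])

# explode() gives each list element its own row (an empty list becomes NaN),
# dropna() discards those NaN rows, and unique() keeps first-seen order.
unique_genres = toy_genres.explode().dropna().unique().tolist()

print(unique_genres)  # ['Animation', 'Comedy', 'Family']
```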


In [22]:
def word_to_dummy(word,ref_column):
    # For each row, count how many times `word` appears in that row's list
    # (0 or 1 when a list has no repeated values).
    word_list=[]
    for line in ref_column:
        word_list.append(sum(1 for element in line if element==str(word)))
    return word_list

In [23]:
for line in genres:
    reves[line]=word_to_dummy(line,reves["genres"])
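The loop above builds one indicator column per genre. A roughly equivalent result can be sketched with pandas built-ins. The example below uses a hypothetical toy frame (`toy`, `exploded` and `dummies` are illustrative names) and assumes each row of `genres` is a list. It is one possible approach, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: each row holds a list of genres (possibly empty).
toy = pd.DataFrame({"genres": [["Animation", "Comedy"], ["Comedy"], []]})

# One row per (movie, genre) pair; the original index is repeated.
exploded = toy["genres"].explode()

# Indicator columns per genre, summed back to one row per movie.
dummies = pd.get_dummies(exploded).groupby(level=0).sum().astype(int)

# Attach the indicator columns to the original frame by index.
toy = toy.join(dummies)

print(toy[["Animation", "Comedy"]].values.tolist())  # [[1, 1], [0, 1], [0, 0]]
```

A movie with an empty genre list simply gets 0 in every indicator column.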

In [29]:
reves.columns

Index(['belongs_to_collection', 'budget', 'genres', 'original_language',
       'production_companies', 'revenue', 'runtime', 'title', 'keywords',
       'release_year', 'release_month', 'release_weekday', 'cast_names',
       'cast_gender', 'Directors', 'Producers', 'Screenplayers', 'Western',
       'Comedy', 'Animation', 'Drama', 'Romance', 'Fantasy', 'Thriller',
       'Adventure', 'Horror', 'Family', 'History', 'Crime', 'Mystery', 'Music',
       'TV Movie', 'War', 'Documentary', 'Foreign', 'Science Fiction',
       'Action', 'Genres_count'],
      dtype='object')

Next, we create a column with the number of genres per movie. We think it may be related to revenue, since it can reflect the complexity of the movie's synopsis. We write this as a function because we will reuse it later for other columns.

In [26]:
def lenght_column(column):
    # Number of items in each row's list, read from the given column
    # of the global `reves` DataFrame.
    return [len(line) for line in reves[column]]

In [27]:
reves["Genres_count"]=lenght_column("genres")
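For list-valued columns, the per-row length can also be obtained with the `.str.len()` accessor, which works on lists as well as strings. The following is a small sketch on a hypothetical toy frame (`toy` is an illustrative name), assuming every row holds a list.

```python
import pandas as pd

# Hypothetical toy data: each row holds a list of genres (possibly empty).
toy = pd.DataFrame({"genres": [["Animation", "Comedy", "Family"], ["Comedy"], []]})

# .str.len() returns the length of each element, here the length of each list.
toy["Genres_count"] = toy["genres"].str.len()

print(toy["Genres_count"].tolist())  # [3, 1, 0]
```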

In [28]:
reves.head()

Unnamed: 0,belongs_to_collection,budget,genres,original_language,production_companies,revenue,runtime,title,keywords,release_year,...,Crime,Mystery,Music,TV Movie,War,Documentary,Foreign,Science Fiction,Action,Genres_count
0,1,30000000.0,"[Animation, Comedy, Family]",en,[Pixar Animation Studios],373554033.0,81.0,Toy Story,"[jealousy, toy, boy, friendship, friends, riva...",1995.0,...,0,0,0,0,0,0,0,0,0,3
1,0,65000000.0,"[Adventure, Fantasy, Family]",en,"[TriStar Pictures, Teitler Film, Interscope Co...",262797249.0,104.0,Jumanji,"[board game, disappearance, based on children'...",1995.0,...,0,0,0,0,0,0,0,0,0,3
2,0,16000000.0,"[Comedy, Drama, Romance]",en,[Twentieth Century Fox Film Corporation],81452156.0,127.0,Waiting to Exhale,"[based on novel, interracial relationship, sin...",1995.0,...,0,0,0,0,0,0,0,0,0,3
3,1,,[Comedy],en,"[Sandollar Productions, Touchstone Pictures]",76578911.0,106.0,Father of the Bride Part II,"[baby, midlife crisis, confidence, aging, daug...",1995.0,...,0,0,0,0,0,0,0,0,0,1
4,0,60000000.0,"[Action, Crime, Drama, Thriller]",en,"[Regency Enterprises, Forward Pass, Warner Bros.]",187436818.0,170.0,Heat,"[robbery, detective, bank, obsession, chase, s...",1995.0,...,1,0,0,0,0,0,0,0,1,4


 
 
#### Original Language

As we established in the preliminary exploration, we will transform this feature into a categorical variable with 2 possible values:

- English = 1
- Not English = 0

In [30]:
reves["original_language"] = reves["original_language"].apply(lambda x: 1 if x=="en" else 0)
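A vectorized comparison gives the same encoding as the lambda. This is a minimal sketch on a hypothetical toy Series (`toy_lang` and `is_english` are illustrative names). Note that a missing language code compares as not equal to `"en"` and therefore ends up as 0.

```python
import pandas as pd

# Hypothetical toy data: language codes, with one missing value.
toy_lang = pd.Series(["en", "fr", None])

# The comparison yields booleans; astype(int) maps True/False to 1/0.
is_english = (toy_lang == "en").astype(int)

print(is_english.tolist())  # [1, 0, 0]
```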

  
#### Production company

In [31]:
reves["production_companies_count"]=lenght_column("production_companies")


#### Cast

In [32]:
reves["number_of_characters"]=lenght_column("cast_names")