# Data Cleaning

Starting from the Full MovieLens dataset, which collects metadata on the 45,000 movies on this platform, we will carry out a study to test the hypothesis guiding this work: the film industry is not profitable. The first step is to build a dataframe containing only the elements that are useful to us.

We will reorder the original dataset and keep only the columns of interest:

## Data Structuring

We import the libraries needed for this step.

In [1]:
import pandas as pd
import os
import sys

The dataset is loaded from the directory where it is stored.

In [2]:
# Get the path of the project's root folder
root_project = os.path.dirname(os.getcwd())
# Append this path to Python's search paths so modules can be imported from here
sys.path.append(root_project)
# Build the full path to the CSV file containing the dataset
movies_csv_dir = root_project + '/resources/movies_metadata.csv'

# Read the dataset from that location
movies_csv_df = pd.read_csv(movies_csv_dir, sep=',')
movies_df = movies_csv_df.copy()
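
The same path can also be built without string concatenation. The sketch below assumes, as the cell above does, that the notebook runs from a direct subfolder of the project root and that the CSV lives in `resources/`; `movies_csv_path` is a name introduced here only for illustration.

```python
from pathlib import Path

# Assumption: the notebook is run from a direct subfolder of the project root.
root_project = Path.cwd().parent

# Path objects are joined with "/" and use the right separator on every OS.
movies_csv_path = root_project / "resources" / "movies_metadata.csv"
print(movies_csv_path)
```

`pd.read_csv` accepts a `Path` object directly, so `movies_csv_path` could be passed in place of the concatenated string.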

Listing the columns gives a quick first look at the data the dataset contains.

In [3]:
print(list(movies_df.columns), '\n')
print('number of columns --->', len(movies_df.columns))

['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id', 'imdb_id', 'original_language', 'original_title', 'overview', 'popularity', 'poster_path', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'video', 'vote_average', 'vote_count'] 

number of columns ---> 24


Of the dataset's 24 columns, only 11 are of interest. We build an ordered list of these columns, from which we will generate the dataframe for the study.

In [4]:
l = ['title', 'release_date', 'original_language', 'genres', 'budget', 'revenue', 'production_countries', 'production_companies', 'runtime', 'vote_average', 'vote_count']

In [5]:
movies_df = movies_df[l]
movies_df.head()

Unnamed: 0,title,release_date,original_language,genres,budget,revenue,production_countries,production_companies,runtime,vote_average,vote_count
0,Toy Story,1995-10-30,en,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",30000000,373554033.0,"[{'iso_3166_1': 'US', 'name': 'United States o...","[{'name': 'Pixar Animation Studios', 'id': 3}]",81.0,7.7,5415.0
1,Jumanji,1995-12-15,en,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",65000000,262797249.0,"[{'iso_3166_1': 'US', 'name': 'United States o...","[{'name': 'TriStar Pictures', 'id': 559}, {'na...",104.0,6.9,2413.0
2,Grumpier Old Men,1995-12-22,en,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",0,0.0,"[{'iso_3166_1': 'US', 'name': 'United States o...","[{'name': 'Warner Bros.', 'id': 6194}, {'name'...",101.0,6.5,92.0
3,Waiting to Exhale,1995-12-22,en,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",16000000,81452156.0,"[{'iso_3166_1': 'US', 'name': 'United States o...",[{'name': 'Twentieth Century Fox Film Corporat...,127.0,6.1,34.0
4,Father of the Bride Part II,1995-02-10,en,"[{'id': 35, 'name': 'Comedy'}]",0,76578911.0,"[{'iso_3166_1': 'US', 'name': 'United States o...","[{'name': 'Sandollar Productions', 'id': 5842}...",106.0,5.7,173.0


## Data Mining

In [6]:
movies_df.dtypes

title                    object
release_date             object
original_language        object
genres                   object
budget                   object
revenue                 float64
production_countries     object
production_companies     object
runtime                 float64
vote_average            float64
vote_count              float64
dtype: object

This output shows the type of the elements in each column, which lets us identify the transformations needed before the data can be used in the study.

At first glance we identify the following problems:

    - release_date: this column should contain elements of type datetime64

    - budget: this column must contain numeric values to be usable

    - genres, production_countries, production_companies: the information in these columns cannot be
      accessed directly; data wrangling will be required.

### Dates to datetime64 type

In [7]:
try:
    pd.to_datetime(movies_df.release_date)
except ValueError:
    print('Given date string not likely a datetime')

Given date string not likely a datetime


Attempting to convert every element of release_date to datetime64 raises the error above. This happens because at least one value in the column is not in a format that pandas can convert to datetime64.  
We need to identify these elements:

In [8]:
def not_a_datetime_detector(date_column):
    # Collect the positions of the values that pandas cannot parse as dates
    l = []
    for i, it in enumerate(date_column):
        try:
            pd.to_datetime(it)
        except Exception:
            l.append(i)
    print(l)
    return l

In [9]:
not_a_date_list = not_a_datetime_detector(date_column=movies_df.release_date)

[19730, 29503, 35587]
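
A vectorized way to locate the same kind of rows is to let pandas coerce failures to `NaT` and then compare against the original column. This is only a sketch on invented values: the toy series, the names `parsed` and `bad_positions`, and the explicit `format="%Y-%m-%d"` (which assumes ISO-style dates like those shown in the table above) are not part of the notebook.

```python
import pandas as pd

# Toy stand-in for movies_df.release_date (values invented for illustration).
release_date = pd.Series(["1995-10-30", "1", "not a date", None, "2011-11-17"])

# errors="coerce" turns anything that does not match the format into NaT
# instead of raising an exception.
parsed = pd.to_datetime(release_date, errors="coerce", format="%Y-%m-%d")

# Rows that had a value but failed to parse; genuinely missing dates are excluded.
bad_positions = release_date.index[parsed.isna() & release_date.notna()].tolist()
print(bad_positions)
```

Missing dates stay in the data as `NaT`, which matches the `NaT` entry visible in the converted column later in the notebook.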


In [10]:
movies_df.iloc[not_a_date_list]

Unnamed: 0,title,release_date,original_language,genres,budget,revenue,production_countries,production_companies,runtime,vote_average,vote_count
19730,,1,104.0,"[{'name': 'Carousel Productions', 'id': 11176}...",/ff9qCepilowshEtG2GYWwzt2bs4.jpg,,6.0,False,,,
29503,,12,68.0,"[{'name': 'Aniplex', 'id': 2883}, {'name': 'Go...",/zV8bHuSL6WXoD6FWogP9j4x80bL.jpg,,7.0,False,,,
35587,,22,82.0,"[{'name': 'Odyssey Media', 'id': 17161}, {'nam...",/zaSf5OG7V8X8gqFvly88zDdRm46.jpg,,4.3,False,,,


Looking at the data for these rows, it seems reasonable to remove them from the dataset: many of their other fields are empty and they will add nothing useful to the study.

In [11]:
movies_df = movies_df.drop(not_a_date_list)

In [12]:
movies_df = movies_df.reset_index(drop=True)

In [179]:
movies_df.release_date = pd.to_datetime(movies_df.release_date)
movies_df.release_date

0       1995-10-30
1       1995-12-15
2       1995-12-22
3       1995-12-22
4       1995-02-10
           ...    
45458          NaT
45459   2011-11-17
45460   2003-08-01
45461   1917-10-21
45462   2017-06-09
Name: release_date, Length: 45463, dtype: datetime64[ns]

### Budget to numerical values

In [14]:
movies_df.budget = pd.to_numeric(movies_df.budget)
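
The conversion above raises if any entry is not numeric; in the rows dropped earlier, the budget field held an image path rather than a number. As a hedged illustration on invented values (the toy series and the name `budget_num` are assumptions), `errors="coerce"` shows how such entries can be flagged instead of stopping the conversion:

```python
import pandas as pd

# Toy stand-in for the budget column: numbers stored as text plus one stray value.
budget = pd.Series(["30000000", "0", "/poster.jpg", "16000000"])

# errors="coerce" maps non-numeric entries to NaN instead of raising.
budget_num = pd.to_numeric(budget, errors="coerce")

print(budget_num.isna().sum())  # number of entries that could not be converted
print(budget_num.dtype)
```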

## Data wrangling

In [15]:
import json

### - genres

The genres, production_companies and production_countries columns store their data in a form that cannot be accessed directly, so the data has to be mined.

Using the json library, we transform the contents of the columns that are in this format (stored as 'str') so that they are recognised as Python objects such as lists and dictionaries.

In [16]:
# Without replacing '' with "", Python does not recognise the text as valid JSON
movies_df.genres = movies_df.genres.str.replace("\'", "\"")

# Each element of the column can now be read as a Python object
movies_df.genres = movies_df.genres.apply(json.loads)
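
Because the cells use single quotes, they look like Python literals rather than JSON. One alternative to the quote swap, not used in this notebook, is `ast.literal_eval`, which also copes with apostrophes inside a value. The sketch below runs on invented rows; the company name and the helper `parse_cell` are hypothetical.

```python
import ast
import pandas as pd

# Toy rows in the same shape as the raw columns (values invented for illustration).
raw = pd.Series([
    "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
    "[{'name': \"Child's Play Productions\", 'id': 1}]",  # apostrophe inside a name
    float("nan"),                                          # missing value
])

def parse_cell(cell):
    """Parse a Python-literal string; treat anything that is not a string as empty."""
    return ast.literal_eval(cell) if isinstance(cell, str) else []

parsed = raw.apply(parse_cell)
print(parsed[1][0]["name"])
```

With the quote swap, the second row would turn into invalid JSON because the apostrophe becomes a stray double quote; `literal_eval` reads it unchanged.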

We group each movie's genres into lists:

In [17]:
generos = set()
genres_dictio = {}

for i, element in enumerate(movies_df.genres):
    genre_list = []
    for dictio in element:
        genre_list.append(dictio['name'])
        generos.add(dictio['name'])
    # Wrap the list so that from_dict stores it in a single column
    genres_dictio[i] = [genre_list]

print(f'Number of distinct genres ---> {len(generos)}\n', 'All genres in the dataset:\n', generos)
genre_df = pd.DataFrame.from_dict(genres_dictio, orient='index', columns=['Genres'])
genre_df

Number of distinct genres ---> 20
 All genres in the dataset:
 {'Mystery', 'Documentary', 'TV Movie', 'Western', 'Crime', 'History', 'Family', 'Animation', 'Drama', 'Music', 'Adventure', 'Thriller', 'Science Fiction', 'Fantasy', 'Comedy', 'Horror', 'Foreign', 'Romance', 'War', 'Action'}


Unnamed: 0,Genres
0,"[Animation, Comedy, Family]"
1,"[Adventure, Fantasy, Family]"
2,"[Romance, Comedy]"
3,"[Comedy, Drama, Romance]"
4,[Comedy]
...,...
45458,"[Drama, Family]"
45459,[Drama]
45460,"[Action, Drama, Thriller]"
45461,[]


We replace the original column of our dataset with the one just generated:

In [18]:
movies_df['genres'] = genre_df['Genres']

It can be useful to build a dataframe with the genres column expanded, so that each genre becomes a column:

        - first, a single-column dataframe of dictionaries is created, where both key and value
          are the genres of each movie

In [19]:
genres_df_dict = pd.DataFrame.from_dict(genre_df.Genres.apply(lambda x: {gen : gen for gen in x}))
genres_df_dict.head(3)

Unnamed: 0,Genres
0,"{'Animation': 'Animation', 'Comedy': 'Comedy',..."
1,"{'Adventure': 'Adventure', 'Fantasy': 'Fantasy..."
2,"{'Romance': 'Romance', 'Comedy': 'Comedy'}"


        - from this dataframe we generate the expanded dataframe, in which each column represents one
          of the 20 genres of our original dataset

In [20]:
genre_df_expanded = pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in genres_df_dict.Genres.items() ])).transpose()

In [21]:
genre_df_expanded.head()

Unnamed: 0,Action,Adventure,Animation,Comedy,Crime,Documentary,Drama,Family,Fantasy,Foreign,History,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western
0,,,Animation,Comedy,,,,Family,,,,,,,,,,,,
1,,Adventure,,,,,,Family,Fantasy,,,,,,,,,,,
2,,,,Comedy,,,,,,,,,,,Romance,,,,,
3,,,,Comedy,,,Drama,,,,,,,,Romance,,,,,
4,,,,Comedy,,,,,,,,,,,,,,,,


In [22]:
print(list(genre_df_expanded.count()))

[6596, 3496, 1935, 13182, 4307, 3932, 20265, 2770, 2313, 1622, 1398, 4673, 1598, 2467, 6735, 3049, 767, 7624, 1323, 1042]
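
Per-genre counts like the ones above can also be obtained straight from the column of lists, without building the expanded dataframe. This is a minimal sketch on invented rows; the toy series and the name `genre_counts` are assumptions.

```python
import pandas as pd

# Toy stand-in for movies_df.genres after the cleaning step above.
genres = pd.Series([["Animation", "Comedy", "Family"], ["Comedy"], []])

# explode() gives one row per (movie, genre); empty lists become NaN,
# which value_counts() ignores by default.
genre_counts = genres.explode().value_counts()
print(genre_counts)
```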


### - production_countries

In [23]:
x = movies_df.production_countries.str.replace("\'", "\"")
l = []

# Record the positions of the cells that json.loads cannot parse
for el in range(0, len(movies_df)):
    try:
        json.loads(x[el])
    except Exception:
        l.append(el)

print(l)

[4321, 6954, 19729, 23273, 29501, 30433, 35584, 39705]


In [24]:
import re

In [25]:
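for er in l:
    if isinstance(x[er], float):
        # Missing value (NaN): treat it as an empty list
        x[er] = "[]"
    else:
        # Replace quotes that appear inside a value (former apostrophes) with a space
        x[er] = re.sub(r'(?<=[^{, :])"(?=[^,:}])', " ", x[er])

x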

0        [{"iso_3166_1": "US", "name": "United States o...
1        [{"iso_3166_1": "US", "name": "United States o...
2        [{"iso_3166_1": "US", "name": "United States o...
3        [{"iso_3166_1": "US", "name": "United States o...
4        [{"iso_3166_1": "US", "name": "United States o...
                               ...                        
45458               [{"iso_3166_1": "IR", "name": "Iran"}]
45459        [{"iso_3166_1": "PH", "name": "Philippines"}]
45460    [{"iso_3166_1": "US", "name": "United States o...
45461             [{"iso_3166_1": "RU", "name": "Russia"}]
45462     [{"iso_3166_1": "GB", "name": "United Kingdom"}]
Name: production_countries, Length: 45463, dtype: object

In [26]:
x[l]

4321     [{"iso_3166_1": "CI", "name": "Cote D Ivoire"}...
6954     [{"iso_3166_1": "CI", "name": "Cote D Ivoire"}...
19729                                                   []
23273    [{"iso_3166_1": "AU", "name": "Australia"}, {"...
29501                                                   []
30433    [{"iso_3166_1": "LA", "name": "Lao People s De...
35584                                                   []
39705    [{"iso_3166_1": "CA", "name": "Canada"}, {"iso...
Name: production_countries, dtype: object
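
To see what the regular expression in the repair step does, the sketch below applies it to a single hypothetical cell. The string `broken` is invented for illustration (modelled on the "Cote D Ivoire" rows above), and `STRAY_QUOTE` is a name introduced here.

```python
import json
import re

# Same pattern as in the cell above: a double quote that is neither opening
# nor closing a JSON string, judged by its neighbouring characters.
STRAY_QUOTE = re.compile(r'(?<=[^{, :])"(?=[^,:}])')

# Hypothetical cell after the quote swap: the apostrophe in the country name
# has become a double quote that breaks the JSON.
broken = '[{"iso_3166_1": "CI", "name": "Cote D"Ivoire"}]'

repaired = STRAY_QUOTE.sub(" ", broken)
print(repaired)
print(json.loads(repaired)[0]["name"])
```

The lookbehind rejects quotes that follow `{`, `,`, a space or `:` (opening quotes), and the lookahead rejects quotes followed by `,`, `:` or `}` (closing quotes), so only the quote inside the name is replaced.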

In [28]:
# Parse the repaired series x built in the cells above
serie = x.apply(json.loads)
use_key = "name"

uniq_vals = set({})

for i, element in enumerate(serie):
    val_list = []
    for dictio in element:
        val_list.append(dictio[use_key])
        uniq_vals.add(dictio[use_key])
    if i == 0:
        serie_dictio = {i: [val_list]}
    else:
        serie_dictio[i] = [val_list]

serie_df = pd.DataFrame.from_dict(serie_dictio, orient='index', columns=[serie.name])
serie_df

Unnamed: 0,production_countries
0,[United States of America]
1,[United States of America]
2,[United States of America]
3,[United States of America]
4,[United States of America]
...,...
45458,[Iran]
45459,[Philippines]
45460,[United States of America]
45461,[Russia]


In [64]:
def json_prep(serie):
    '''
    Return a pd.Series of JSON-like strings with every element parsed into Python objects
    '''
    try:
        # If '' aren't replaced with "", json.loads won't recognise the text as JSON
        serie = serie.str.replace("\'", "\"")

        # Now every element of the series can be parsed
        serie = serie.apply(json.loads)
        return serie
    except Exception:
        safe_word = input('Try to fix errors?(Yes/No):')
        if safe_word == 'No':
            return
        else:
            serie = serie.str.replace("\'", "\"")
            l = []

            # Record the positions of the cells that cannot be parsed
            for el in range(0, len(serie)):
                try:
                    json.loads(serie[el])
                except Exception:
                    l.append(el)

            print('errors column positions:', l)

            for er in l:
                if not isinstance(serie[er], str):
                    # Missing value: treat it as an empty list
                    serie[er] = "[]"
                else:
                    # Replace quotes inside a value (former apostrophes) with a space
                    serie[er] = re.sub(r'(?<=[^{, :])"(?=[^,:}])', " ", serie[er])

            return json_prep(serie)

In [54]:
def series_dict_to_series_list(serie, use_key, inplace=False):
    '''
    From serie (pd.Series of lists of dicts), return a single-column DataFrame holding,
    for each row, the list of values found under use_key.

    Args:
        serie (pd.Series)
        use_key (str): key to select from the dicts in the series
        inplace (bool): currently unused; intended to indicate whether serie is replaced by the new series
    '''
    serie = json_prep(serie)

    uniq_vals = set({})

    for i, element in enumerate(serie):
        val_list = []
        for dictio in element:
            val_list.append(dictio[use_key])
            uniq_vals.add(dictio[use_key])
        if i == 0:
            serie_dictio = {i: [val_list]}
        else:
            serie_dictio[i] = [val_list]

    print(f'Number of unique elements ---> {len(uniq_vals)}\n', 'All unique elements in the column:\n', uniq_vals)
    serie_df = pd.DataFrame.from_dict(serie_dictio, orient='index', columns=[serie.name])
    return serie_df

    # Perhaps this could work with classes
    '''if inplace == True:
        serie = serie_df
    else:
        return serie_df'''

Having explored the data and defined functions we can reuse, we prepare the production_countries column so that it is represented in a convenient form

In [139]:
transit_df = series_dict_to_series_list(serie=movies_df.production_countries, use_key='name')
movies_df.production_countries = transit_df
movies_df

Number of unique elements ---> 160
 All unique elements in the column:
 {'Germany', 'Philippines', 'Luxembourg', 'Indonesia', 'Latvia', 'Italy', 'Zimbabwe', 'Georgia', 'Thailand', 'Costa Rica', 'Israel', 'Burkina Faso', 'Jordan', 'French Southern Territories', 'Brunei Darussalam', 'Bahamas', 'Nicaragua', 'Serbia', 'Taiwan', 'Honduras', 'Nigeria', 'Bulgaria', 'Turkey', 'Panama', 'Nepal', 'Cameroon', 'Libyan Arab Jamahiriya', 'Tanzania', 'France', 'North Korea', 'Vietnam', 'Estonia', 'Afghanistan', 'Kenya', 'Trinidad and Tobago', 'Kuwait', 'Serbia and Montenegro', 'Bermuda', 'Ecuador', 'Croatia', 'Mauritania', 'Saudi Arabia', 'Sri Lanka', 'Slovenia', 'Dominican Republic', 'Australia', 'El Salvador', 'Puerto Rico', 'Brazil', 'Liberia', 'Angola', 'Congo', 'Qatar', 'Albania', 'Malaysia', 'Finland', 'Rwanda', 'Cayman Islands', 'India', 'Belarus', 'Ireland', 'South Korea', 'Yugoslavia', 'Uganda', 'Macao', 'Poland', 'Guinea', 'Austria', 'Soviet Union', 'Norway', 'Canada', 'Argentina', '

Unnamed: 0,title,release_date,original_language,genres,budget,revenue,production_countries,production_companies,runtime,vote_average,vote_count
0,Toy Story,1995-10-30,en,"[Animation, Comedy, Family]",30000000,373554033.0,[United States of America],"[{'name': 'Pixar Animation Studios', 'id': 3}]",81.0,7.7,5415.0
1,Jumanji,1995-12-15,en,"[Adventure, Fantasy, Family]",65000000,262797249.0,[United States of America],"[{'name': 'TriStar Pictures', 'id': 559}, {'na...",104.0,6.9,2413.0
2,Grumpier Old Men,1995-12-22,en,"[Romance, Comedy]",0,0.0,[United States of America],"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...",101.0,6.5,92.0
3,Waiting to Exhale,1995-12-22,en,"[Comedy, Drama, Romance]",16000000,81452156.0,[United States of America],[{'name': 'Twentieth Century Fox Film Corporat...,127.0,6.1,34.0
4,Father of the Bride Part II,1995-02-10,en,[Comedy],0,76578911.0,[United States of America],"[{'name': 'Sandollar Productions', 'id': 5842}...",106.0,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...
45458,Subdue,NaT,fa,"[Drama, Family]",0,0.0,[Iran],[],90.0,4.0,1.0
45459,Century of Birthing,2011-11-17,tl,[Drama],0,0.0,[Philippines],"[{'name': 'Sine Olivia', 'id': 19653}]",360.0,9.0,3.0
45460,Betrayal,2003-08-01,en,"[Action, Drama, Thriller]",0,0.0,[United States of America],"[{'name': 'American World Pictures', 'id': 6165}]",90.0,3.8,6.0
45461,Satan Triumphant,1917-10-21,en,[],0,0.0,[Russia],"[{'name': 'Yermoliev', 'id': 88753}]",87.0,0.0,0.0


We can expand this column into a new dataset

In [117]:
def expand_df(serie):
    # Creates column of dictionaries where key = values, from elements of lists in serie
    serie_df = pd.DataFrame.from_dict(serie.apply(lambda x: {gen : gen for gen in x}))
    # Creates dataframe with columns equal to unique values of serie
    serie_expanded = pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in serie_df[serie_df.columns[0]].items() ])).transpose()
    return serie_expanded
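
`expand_df` stores each value under its own column and leaves NaN elsewhere. A related layout, sketched here on invented rows, is a 0/1 indicator matrix built with `explode` and `pd.get_dummies`; the toy series and the names `exploded` and `indicators` are assumptions, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for a column of lists such as movies_df.production_countries.
countries = pd.Series([["France", "Italy"], ["France"], []])

# One row per (movie, country), then one indicator column per distinct country.
exploded = countries.explode()
indicators = pd.get_dummies(exploded).groupby(level=0).sum()

print(indicators)
print(indicators.sum().sort_values(ascending=False))
```

With this layout, `sum()` plays the role that `count()` plays on the expanded dataframe, and movies with an empty list keep a row of zeros.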

In [140]:
prod_countries_expanded = expand_df(movies_df.production_countries)

In [141]:
prod_countries_expanded.count().sort_values(ascending=False).head(20)

United States of America    21153
United Kingdom               4094
France                       3940
Germany                      2254
Italy                        2169
Canada                       1765
Japan                        1648
Spain                         964
Russia                        912
India                         828
Hong Kong                     596
Sweden                        588
Australia                     570
South Korea                   495
Belgium                       447
Denmark                       386
Finland                       383
Netherlands                   375
China                         372
Mexico                        329
dtype: int64

### - production_companies

In [124]:
df_transit = series_dict_to_series_list(serie=movies_df.production_companies, use_key='name', inplace=False)
df_transit

errors column positions: [28, 131, 168, 194, 225, 470, 715, 718, 751, 873, 891, 892, 1067, 1077, 1136, 1361, 1391, 1451, 1581, 1609, 1615, 1619, 1844, 2017, 2087, 2382, 2511, 2540, 2568, 2770, 2827, 2856, 2937, 2980, 3017, 3056, 3280, 3288, 3419, 3671, 3681, 3835, 4102, 4114, 4137, 4143, 4595, 4626, 4770, 4821, 4837, 4894, 5048, 5266, 5347, 5380, 5389, 5403, 5441, 5443, 5445, 5495, 5588, 5629, 5700, 5704, 5774, 5839, 6097, 6142, 6195, 6239, 6358, 6378, 6495, 6571, 6623, 6629, 6632, 6693, 6709, 6737, 6771, 6807, 6987, 7015, 7085, 7088, 7194, 7311, 7337, 7400, 7414, 7603, 7679, 7718, 7778, 7789, 7794, 7847, 7894, 7929, 7933, 8057, 8084, 8105, 8109, 8117, 8183, 8460, 8467, 8791, 8923, 9045, 9070, 9094, 9102, 9280, 9542, 9709, 9756, 9936, 10014, 10088, 10146, 10175, 10182, 10189, 10390, 10562, 10610, 10635, 10654, 10693, 10782, 10817, 10836, 10877, 10922, 10973, 11201, 11252, 11337, 11680, 11759, 11938, 12121, 12146, 12336, 12425, 12437, 12505, 12509, 12597, 12671, 12713, 12727, 12746, 127

TypeError: 'NoneType' object is not iterable

In [125]:
error_list = [1136, 2540, 2937, 3835, 5588, 5629, 5774, 6097, 6632, 6709, 6771, 7085, 7088, 7929, 7933, 8117, 9936, 10189, 10610, 10654, 11759, 12505, 12727, 13532, 13661, 13667, 13704, 13705, 13825, 14339, 14352, 15697, 15721, 16178, 16197, 16253, 16256, 16334, 16335, 17157, 17939, 17970, 18037, 18759, 19944, 20312, 21593, 22213, 22279, 22302, 23337, 23816, 25328, 25852, 25853, 26440, 27653, 27669, 27977, 28092, 28193, 28959, 30272, 30772, 30871, 31478, 31861, 32651, 32659, 33195, 33386, 33459, 34200, 34677, 34726, 35209, 35995, 35997, 36003, 36948, 39123, 40099, 40689, 41285, 42274, 43673, 44260, 44403, 44595, 44603, 44770, 44961]


In [127]:
len(error_list)

92

The column produces too many errors when we try to mine its contents. Although it would be very interesting for verifying the hypothesis, we opt to discard it.

In [157]:
movies_df.drop('production_companies', axis=1, inplace=True)

In [158]:
movies_df

Unnamed: 0,title,release_date,original_language,genres,budget,revenue,production_countries,runtime,vote_average,vote_count
0,Toy Story,1995-10-30,en,"[Animation, Comedy, Family]",30000000,373554033.0,[United States of America],81.0,7.7,5415.0
1,Jumanji,1995-12-15,en,"[Adventure, Fantasy, Family]",65000000,262797249.0,[United States of America],104.0,6.9,2413.0
2,Grumpier Old Men,1995-12-22,en,"[Romance, Comedy]",0,0.0,[United States of America],101.0,6.5,92.0
3,Waiting to Exhale,1995-12-22,en,"[Comedy, Drama, Romance]",16000000,81452156.0,[United States of America],127.0,6.1,34.0
4,Father of the Bride Part II,1995-02-10,en,[Comedy],0,76578911.0,[United States of America],106.0,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...
45458,Subdue,NaT,fa,"[Drama, Family]",0,0.0,[Iran],90.0,4.0,1.0
45459,Century of Birthing,2011-11-17,tl,[Drama],0,0.0,[Philippines],360.0,9.0,3.0
45460,Betrayal,2003-08-01,en,"[Action, Drama, Thriller]",0,0.0,[United States of America],90.0,3.8,6.0
45461,Satan Triumphant,1917-10-21,en,[],0,0.0,[Russia],87.0,0.0,0.0


## Exporting the datasets

We export the mined datasets to CSV files for later use in the rest of the analyses:

- Main dataset / movie metadata:

In [178]:
movies_df.head()

Unnamed: 0,title,release_date,original_language,genres,budget,revenue,production_countries,runtime,vote_average,vote_count
0,Toy Story,1995-10-30,en,"[Animation, Comedy, Family]",30000000,373554033.0,[United States of America],81.0,7.7,5415.0
1,Jumanji,1995-12-15,en,"[Adventure, Fantasy, Family]",65000000,262797249.0,[United States of America],104.0,6.9,2413.0
2,Grumpier Old Men,1995-12-22,en,"[Romance, Comedy]",0,0.0,[United States of America],101.0,6.5,92.0
3,Waiting to Exhale,1995-12-22,en,"[Comedy, Drama, Romance]",16000000,81452156.0,[United States of America],127.0,6.1,34.0
4,Father of the Bride Part II,1995-02-10,en,[Comedy],0,76578911.0,[United States of America],106.0,5.7,173.0


In [180]:
movies_df.to_csv(root_project + r"/resources/movies_df.csv", index=False)

In [186]:
movies_df.to_pickle(root_project + r"/resources/movies_df.pkl")

- Expanded 'genres' dataset:

In [174]:
genre_df_expanded.to_csv(root_project + r"/resources/genre_df_expanded.csv", index=False)

In [187]:
genre_df_expanded.to_pickle(root_project + r"/resources/genre_df_expanded.pkl")

- Expanded 'production_countries' dataset:

In [None]:
prod_countries_expanded.to_csv(root_project + r"/resources/prod_countries_expanded.csv", index=False)

In [188]:
prod_countries_expanded.to_pickle(root_project + r"/resources/prod_countries_expanded.pkl")
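
One practical difference between the two export formats: a pickle keeps Python objects as they are, while a CSV stores list-valued columns such as genres as text. The sketch below illustrates this with an in-memory round trip on an invented one-row frame; the names `buffer`, `reloaded` and `restored` are assumptions.

```python
import ast
import io
import pandas as pd

# Toy frame with a list-valued column, like genres in movies_df.
df = pd.DataFrame({"title": ["Toy Story"], "genres": [["Animation", "Comedy"]]})

# CSV round trip through an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(type(reloaded.loc[0, "genres"]))  # the list comes back as a string

# The text can be turned back into a list when needed.
restored = reloaded["genres"].apply(ast.literal_eval)
print(restored[0])
```

Code that later reads the CSV files would need a parsing step like the last one, whereas the `.pkl` files can be loaded with `pd.read_pickle` and used directly.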