## Description

This notebook runs an ETL over the data in the 'ignore' folder:  
1-Extraction: imports 4 CSV datasets with the titles of each platform and 8 CSV datasets with the scores users gave to titles.  
2-Transformation: two transformations are performed  
    a) All: merges the titles of the 4 platforms so they can be queried from a single 'All' dataset.  
    b) Score: merges the title scores into a single 'Score' dataset for score queries.   
3-Load: exports the dataframes to the following files:  
    a) All => all.parquet  
    b) Score => score.csv  
    c) Score => score.parquet  
    d) Cast => cast.csv  

## Extraction

Imports 4 datasets with the titles of each platform and 8 datasets with the scores users gave to titles.

In [2]:
import pandas as pd
import numpy as np

In [57]:
# Titles
Amazon = pd.read_csv("ignore/amazon_prime_titles.csv")
Disney = pd.read_csv("ignore/disney_plus_titles.csv")
Hulu = pd.read_csv("ignore/hulu_titles.csv")
Netflix = pd.read_csv("ignore/netflix_titles.csv")

In [93]:
# Scores
a = pd.read_csv("ignore/ratings/1.csv")
b = pd.read_csv("ignore/ratings/2.csv")
c = pd.read_csv("ignore/ratings/3.csv")
d = pd.read_csv("ignore/ratings/4.csv")
e = pd.read_csv("ignore/ratings/5.csv")
f = pd.read_csv("ignore/ratings/6.csv")
g = pd.read_csv("ignore/ratings/7.csv")
h = pd.read_csv("ignore/ratings/8.csv")
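
The eight ratings files follow a numeric naming pattern, so they could also be read in a loop. The following is a minimal sketch, not the notebook's own code: `load_ratings` is a hypothetical helper, and the demonstration writes two throwaway files to a temporary folder instead of touching the real `ignore/ratings` data.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_ratings(folder, n_files=8):
    """Read 1.csv ... n_files.csv from `folder` and stack them into one frame."""
    frames = [pd.read_csv(Path(folder) / f"{i}.csv") for i in range(1, n_files + 1)]
    return pd.concat(frames, ignore_index=True)


# Tiny demonstration with two made-up files (hypothetical rows)
with tempfile.TemporaryDirectory() as tmp:
    for i in (1, 2):
        pd.DataFrame(
            {"userId": [i], "rating": [3.5], "timestamp": [1196785679], "movieId": [f"as{i}"]}
        ).to_csv(Path(tmp) / f"{i}.csv", index=False)
    demo = load_ratings(tmp, n_files=2)

print(demo)
```

With the real data the call would presumably be `load_ratings("ignore/ratings")`, assuming the folder layout shown in the cell above.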

In [111]:
# Print summary information
print('Amazon')
print('Duplicates:',Amazon.duplicated().any())
print('Rows, Columns:',Amazon.shape)
print(Amazon.isnull().sum())
Amazon.tail(3)


Amazon
Duplicates: False
Rows, Columns: (9668, 14)
id                 0
platform           0
show_id            0
type               0
title              0
director        2082
cast            1233
country         8996
date_added      9513
release_year       0
rating           337
duration           0
listed_in          0
description        0
dtype: int64


Unnamed: 0,id,platform,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
9665,as9666,amazon,s9666,Movie,Outpost,Steve Barker,"Ray Stevenson, Julian Wadham, Richard Brake, M...",,,2008,R,90 min,Action,"In war-torn Eastern Europe, a world-weary grou..."
9666,as9667,amazon,s9667,TV Show,Maradona: Blessed Dream,,"Esteban Recagno, Ezequiel Stremiz, Luciano Vit...",,,2021,TV-MA,1 Season,"Drama, Sports","The series tells the story of Diego Maradona, ..."
9667,as9668,amazon,s9668,Movie,Harry Brown,Daniel Barber,"Michael Caine, Emily Mortimer, Joseph Gilgun, ...",,,2010,R,103 min,"Action, Drama, Suspense","Harry Brown, starring two-time Academy Award w..."


In [170]:
# Print summary information
print('Disney')
print('Duplicates:',Disney.duplicated().any())
print('Rows, Columns:',Disney.shape)
print(Disney.isnull().sum())
Disney.tail(3)

Disney
Duplicates: False
Rows, Columns: (1450, 14)
id                0
platform          0
show_id           0
type              0
title             0
director        473
cast            190
country         219
date_added        3
release_year      0
rating            3
duration          0
listed_in         0
description       0
dtype: int64


Unnamed: 0,id,platform,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
1447,ds1448,disney,s1448,Movie,Eddie the Eagle,Dexter Fletcher,"Tom Costello, Jo Hartley, Keith Allen, Dickon ...","United Kingdom, Germany, United States","December 18, 2020",2016,PG-13,107 min,"Biographical, Comedy, Drama","True story of Eddie Edwards, a British ski-jum..."
1448,ds1449,disney,s1449,Movie,Bend It Like Beckham,Gurinder Chadha,"Parminder Nagra, Keira Knightley, Jonathan Rhy...","United Kingdom, Germany, United States","September 18, 2020",2003,PG-13,112 min,"Buddy, Comedy, Coming of Age",Despite the wishes of their traditional famili...
1449,ds1450,disney,s1450,Movie,Captain Sparky vs. The Flying Saucers,Mark Waring,Charlie Tahan,United States,"April 1, 2020",2012,TV-G,2 min,"Action-Adventure, Animals & Nature, Animation",View one of Sparky's favorite home movies.


In [171]:
# Print summary information
print('Hulu')
print('Duplicates:',Hulu.duplicated().any())
print('Rows, Columns:',Hulu.shape)
print(Hulu.isnull().sum())
Hulu.tail(3)

Hulu
Duplicates: False
Rows, Columns: (3073, 14)
id                 0
platform           0
show_id            0
type               0
title              0
director        3070
cast            3073
country         1453
date_added        28
release_year       0
rating           520
duration         479
listed_in          0
description        4
dtype: int64


Unnamed: 0,id,platform,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
3070,hs3071,hulu,s3071,TV Show,The Fades,,,United Kingdom,,2011,TV-14,1 Season,"Horror, International, Science Fiction",Seventeen-year-old Paul is haunted by apocalyp...
3071,hs3072,hulu,s3072,TV Show,The Twilight Zone,,,United States,,1959,TV-PG,5 Seasons,"Classics, Science Fiction, Thriller",Rod Serling's seminal anthology series focused...
3072,hs3073,hulu,s3073,TV Show,Tokyo Magnitude 8.0,,,Japan,,2009,TV-14,1 Season,"Anime, Drama, International",The devastation is unleashed in the span of se...


In [None]:
# Print summary information
print('Netflix')
print('Duplicates:',Netflix.duplicated().any())
print('Rows, Columns:',Netflix.shape)
print(Netflix.isnull().sum())
Netflix.tail(3)
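
The four inspection cells repeat the same checks, so they could be folded into one helper. This is a sketch under assumptions: `summarize` is a hypothetical name, and the demo frame is made up rather than taken from the platform files.

```python
import pandas as pd


def summarize(name, df):
    """Print and return the per-platform checks: duplicates, shape, null counts."""
    report = {
        "duplicates": bool(df.duplicated().any()),
        "shape": df.shape,
        "nulls": df.isnull().sum().to_dict(),
    }
    print(name)
    print("Duplicates:", report["duplicates"])
    print("Rows, Columns:", report["shape"])
    print(df.isnull().sum())
    return report


# Made-up two-row frame standing in for one platform dataset
demo = pd.DataFrame({"show_id": ["s1", "s2"], "director": ["Don McKellar", None]})
report = summarize("Demo", demo)
```

Returning the values as a dictionary, in addition to printing them, makes it possible to compare platforms programmatically.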

## Titles Transformation

In [58]:
# Build the id field: each id is the first letter of the platform name
# followed by the show_id already present in the datasets (e.g. as123 for an Amazon title)
id_A = 'a' + Amazon['show_id']
Amazon.insert(0, 'id', id_A)
Amazon.insert(1, 'platform', 'amazon')

id_D = 'd' + Disney['show_id']
Disney.insert(0, 'id', id_D)
Disney.insert(1, 'platform', 'disney')

id_H = 'h' + Hulu['show_id']
Hulu.insert(0, 'id', id_H)
Hulu.insert(1, 'platform', 'hulu')

id_N = 'n' + Netflix['show_id']
Netflix.insert(0, 'id', id_N)
Netflix.insert(1, 'platform', 'netflix')
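
The same id rule applies to every platform, so it can be expressed once. A minimal sketch, assuming a hypothetical helper `add_id_and_platform` and made-up two-row frames; it returns a copy so that re-running the cell does not fail on an already existing `id` column.

```python
import pandas as pd


def add_id_and_platform(df, platform):
    """Return a copy with `id` (platform initial + show_id) and `platform` in front."""
    out = df.copy()
    out.insert(0, "id", platform[0] + out["show_id"])
    out.insert(1, "platform", platform)
    return out


# Made-up stand-ins for two of the platform datasets
raw = {
    "amazon": pd.DataFrame({"show_id": ["s1", "s2"], "title": ["A", "B"]}),
    "netflix": pd.DataFrame({"show_id": ["s1"], "title": ["C"]}),
}
tagged = {name: add_id_and_platform(df, name) for name, df in raw.items()}
print(tagged["amazon"])
```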

In [175]:
Titles = pd.concat([Amazon,Disney,Hulu,Netflix], axis=0).reset_index()
print('Titles platforms')
print('Rows, Columns:',Titles.shape)
print(Titles.isnull().sum())
Titles.head(3)


Titles platforms
Rows, Columns: (22998, 15)
index               0
id                  0
platform            0
show_id             0
type                0
title               0
director         8259
cast             5321
country         11499
date_added       9554
release_year        0
rating            864
duration          482
listed_in           0
description         4
dtype: int64


Unnamed: 0,index,id,platform,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,0,as1,amazon,s1,Movie,The Grand Seduction,Don McKellar,"Brendan Gleeson, Taylor Kitsch, Gordon Pinsent",Canada,"March 30, 2021",2014,,113 min,"Comedy, Drama",A small fishing village must procure a local d...
1,1,as2,amazon,s2,Movie,Take Care Good Night,Girish Joshi,"Mahesh Manjrekar, Abhay Mahajan, Sachin Khedekar",India,"March 30, 2021",2018,13+,110 min,"Drama, International",A Metro Family decides to fight a Cyber Crimin...
2,2,as3,amazon,s3,Movie,Secrets of Deception,Josh Webber,"Tom Sizemore, Lorenzo Lamas, Robert LaSardo, R...",United States,"March 30, 2021",2017,,74 min,"Action, Drama, Suspense",After a man discovers his wife is cheating on ...


In [176]:
# Null values in the rating field must be replaced with the string "G" (maturity rating: "general for all audiences")
Titles['rating'] = Titles['rating'].fillna('G')
print(Titles['rating'].isnull().sum())

0


In [177]:
# Dates, where present, must use the YYYY-mm-dd format
Titles['date_added'] = pd.to_datetime(Titles['date_added'])
Titles.tail(3)

Unnamed: 0,index,id,platform,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
22995,8804,ns8805,netflix,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,2019-11-01,2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
22996,8805,ns8806,netflix,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,2020-01-11,2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."
22997,8806,ns8807,netflix,s8807,Movie,Zubaan,Mozez Singh,"Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan...",India,2019-03-02,2015,TV-14,111 min,"Dramas, International Movies, Music & Musicals",A scrappy but poor boy worms his way into a ty...
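
The sample rows show dates written like "March 30, 2021". A sketch of parsing with an explicit format, assuming every non-null value follows that pattern (values that do not match become `NaT` because of `errors="coerce"`); the series below is made up from the samples shown in this notebook.

```python
import pandas as pd

# Made-up sample in the "Month day, year" style seen in the outputs
date_added = pd.Series(["March 30, 2021", "December 18, 2020", None])

# Explicit format is an assumption; unparseable or missing values turn into NaT
parsed = pd.to_datetime(date_added, format="%B %d, %Y", errors="coerce")

# Text rendering in YYYY-mm-dd, for exports that store dates as strings
as_text = parsed.dt.strftime("%Y-%m-%d")
print(as_text)
```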


In [178]:
# Text fields must be lowercase, without exception
lowercase = lambda s: s.lower() if isinstance(s, str) else s
Titles = Titles.applymap(lowercase)  # DataFrame.map is the equivalent in pandas >= 2.1, where applymap is deprecated

In [179]:
# The duration field must be split into two fields: duration_int and duration_type.
# The first is an integer and the second a string with the unit of the duration: min (minutes) or season (seasons)
Titles[['duration_int','duration_type']] = Titles['duration'].str.split(' ', expand=True)
Titles.drop(columns=['duration'], inplace=True)
Titles.nunique()          

index             9668
id               22998
platform             4
show_id           9668
type                 2
title            22042
director         10095
cast             16744
country            886
date_added        2003
release_year       101
rating             105
listed_in         1687
description      22669
duration_int       225
duration_type        3
dtype: int64
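
The split, the integer conversion and the merging of `seasons` into `season` can also be done in one pass with a regular expression. This is an alternative sketch, not the approach the notebook uses; the sample values are made up and the pattern assumes durations always look like `<number> min` or `<number> season(s)`.

```python
import pandas as pd

# Made-up sample covering the shapes seen in the data, including a missing value
duration = pd.Series(["113 min", "1 Season", "5 Seasons", None])

# Group 0 captures the number, group 1 the unit without the plural "s"
parts = duration.str.lower().str.extract(r"^(\d+)\s+(min|season)s?$")

duration_int = pd.to_numeric(parts[0]).fillna(0).astype(int)  # missing -> 0
duration_type = parts[1]                                      # missing stays NaN
print(pd.DataFrame({"duration_int": duration_int, "duration_type": duration_type}))
```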

In [180]:
# Reorder the columns (index, show_id and rating are left out)
Titles = Titles.reindex(columns=['id', 'platform', 'type', 'title', 'director', 'cast', 'country', 'date_added', 'release_year', 'duration_int', 'duration_type', 'listed_in', 'description'])
Titles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22998 entries, 0 to 22997
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   id             22998 non-null  object        
 1   platform       22998 non-null  object        
 2   type           22998 non-null  object        
 3   title          22998 non-null  object        
 4   director       14739 non-null  object        
 5   cast           17677 non-null  object        
 6   country        11499 non-null  object        
 7   date_added     13444 non-null  datetime64[ns]
 8   release_year   22998 non-null  int64         
 9   duration_int   22516 non-null  object        
 10  duration_type  22516 non-null  object        
 11  listed_in      22998 non-null  object        
 12  description    22994 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(11)
memory usage: 2.3+ MB


In [181]:
# Number of unique values
Titles.nunique()

id               22998
platform             4
type                 2
title            22042
director         10095
cast             16744
country            886
date_added        2003
release_year       101
duration_int       225
duration_type        3
listed_in         1687
description      22669
dtype: int64

In [182]:
# Convert the numeric values to integer (missing durations become 0)
Titles['duration_int'] = Titles['duration_int'].fillna(0)
Titles['duration_int'] = Titles['duration_int'].astype(int)
Titles['release_year'] = Titles['release_year'].astype(int) 
Titles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22998 entries, 0 to 22997
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   id             22998 non-null  object        
 1   platform       22998 non-null  object        
 2   type           22998 non-null  object        
 3   title          22998 non-null  object        
 4   director       14739 non-null  object        
 5   cast           17677 non-null  object        
 6   country        11499 non-null  object        
 7   date_added     13444 non-null  datetime64[ns]
 8   release_year   22998 non-null  int64         
 9   duration_int   22998 non-null  int64         
 10  duration_type  22516 non-null  object        
 11  listed_in      22998 non-null  object        
 12  description    22994 non-null  object        
dtypes: datetime64[ns](1), int64(2), object(10)
memory usage: 2.3+ MB


In [183]:
# Show which categories appear in duration_type
import collections
collections.Counter(Titles['duration_type'])

Counter({'min': 15999, 'season': 4183, 'seasons': 2334, nan: 482})

In [184]:
# Merge seasons into season
Titles['duration_type'] = Titles['duration_type'].replace('seasons','season')
collections.Counter(Titles['duration_type'])

Counter({'min': 15999, 'season': 6517, nan: 482})

In [185]:
# List the column headers
headers = list(Titles.columns)
print(headers)  

['id', 'platform', 'type', 'title', 'director', 'cast', 'country', 'date_added', 'release_year', 'duration_int', 'duration_type', 'listed_in', 'description']


In [186]:
# Reorder and trim the columns
Titles = Titles.reindex(columns=['id', 'platform', 'type', 'title', 'director', 'cast', 'country', 'date_added', 'release_year', 'duration_int', 'duration_type', 'listed_in', 'description'])
Titles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22998 entries, 0 to 22997
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   id             22998 non-null  object        
 1   platform       22998 non-null  object        
 2   type           22998 non-null  object        
 3   title          22998 non-null  object        
 4   director       14739 non-null  object        
 5   cast           17677 non-null  object        
 6   country        11499 non-null  object        
 7   date_added     13444 non-null  datetime64[ns]
 8   release_year   22998 non-null  int64         
 9   duration_int   22998 non-null  int64         
 10  duration_type  22516 non-null  object        
 11  listed_in      22998 non-null  object        
 12  description    22994 non-null  object        
dtypes: datetime64[ns](1), int64(2), object(10)
memory usage: 2.3+ MB


In [49]:
# Optionally save partial progress by setting switch to True (requires the ignore folder)
switch = False
if switch:
    Titles.to_csv('ignore/all_titles.csv', index=False)

## Score Transformation

In [154]:
# Merge the rating datasets into Score
Score = pd.concat([a,b,c,d,e,f,g,h], axis=0)
Score = Score.reset_index(drop=True)
print('Score')
print('Rows, Columns:',Score.shape)
print(Score.isnull().sum())
Score.tail(3)

Score
Rows, Columns: (11024289, 4)
userId       0
rating       0
timestamp    0
movieId      0
dtype: int64


Unnamed: 0,userId,rating,timestamp,movieId
11024286,124380,3.5,1196785679,hs305
11024287,124380,4.5,1196787089,ns7881
11024288,124380,1.5,1196785847,as883


In [155]:
# Replace timestamp with date, add the platform column, rename rating to score
Score = Score.rename(columns={'rating':'score'})                        # Rename rating to score
Score['date'] = pd.to_datetime(Score['timestamp'], unit='s').dt.date    # Add the date column
Score.drop(columns='timestamp', inplace=True)                           # Drop timestamp
Score['platform'] = Score['movieId'].str[0]                             # Add the platform column (first letter of movieId)
# Score = Score[['platform','score','year','movieId']]
Score.tail(3)

Unnamed: 0,userId,score,movieId,date,platform
11024286,124380,3.5,hs305,2007-12-04,h
11024287,124380,4.5,ns7881,2007-12-04,n
11024288,124380,1.5,as883,2007-12-04,a


In [156]:
# Expand the platform initial into the full platform name with a function
def get_initial(movieId:str):
    if movieId[0] == 'a':
        platform = 'amazon'
    elif movieId[0] == 'd':
        platform = 'disney'
    elif movieId[0] == 'h':
        platform = 'hulu'
    elif movieId[0] == 'n':
        platform = 'netflix'
    else:
        platform = ''

    return platform


# Apply the function to the 'platform' column to replace each initial with the platform name
Score['platform'] = Score['platform'].apply(get_initial)
Score.tail(2)

Unnamed: 0,userId,score,movieId,date,platform
11024287,124380,4.5,ns7881,2007-12-04,netflix
11024288,124380,1.5,as883,2007-12-04,amazon


In [157]:
# Reorder the columns and drop date
Score = Score[['userId','movieId','score','platform']]
Score.tail(2)

Unnamed: 0,userId,movieId,score,platform
11024287,124380,ns7881,4.5,netflix
11024288,124380,as883,1.5,amazon


In [158]:
# Compute the mean score per title and inspect its summary statistics
S = Score.groupby('movieId').agg({'score': 'mean'})
S.describe()


Unnamed: 0,score
count,22998.0
mean,3.533443
std,0.048564
min,3.336478
25%,3.5
50%,3.533673
75%,3.567
max,3.724512
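
A mean is easier to interpret next to the number of ratings behind it. The sketch below uses pandas named aggregation on made-up rows; the output column name `n_ratings` is an assumption, not something the notebook defines.

```python
import pandas as pd

# Made-up ratings: two for one title, one for another
score = pd.DataFrame(
    {"movieId": ["as1", "as1", "ns2"], "score": [3.0, 4.0, 2.5]}
)

# One row per title with its mean score and how many ratings produced it
summary = score.groupby("movieId").agg(
    score=("score", "mean"),
    n_ratings=("score", "size"),
)
print(summary)
```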


In [187]:
Titles.tail(2)

Unnamed: 0,id,platform,type,title,director,cast,country,date_added,release_year,duration_int,duration_type,listed_in,description
22996,ns8806,netflix,movie,zoom,peter hewitt,"tim allen, courteney cox, chevy chase, kate ma...",united states,2020-01-11,2006,88,min,"children & family movies, comedies","dragged from civilian life, a former superhero..."
22997,ns8807,netflix,movie,zubaan,mozez singh,"vicky kaushal, sarah-jane dias, raaghav chanan...",india,2019-03-02,2015,111,min,"dramas, international movies, music & musicals",a scrappy but poor boy worms his way into a ty...


In [188]:
first_release = Titles['release_year'].min()
max_duration = Titles['duration_int'].max()
users = Score['userId'].max()
print(first_release,max_duration,users)

1920 601 270896


## All Transformation

In [190]:
# Keep the userId, movieId, score and platform columns of Score, and rename id to movieId in Titles.
Score = Score[['userId','movieId','score','platform']]
Titles = Titles.rename(columns={'id':'movieId'}) # Rename so the join key matches
title = Titles[['movieId','title']]     # Keep only the titles with their id
Titles.tail(2)

Unnamed: 0,movieId,platform,type,title,director,cast,country,date_added,release_year,duration_int,duration_type,listed_in,description
22996,ns8806,netflix,movie,zoom,peter hewitt,"tim allen, courteney cox, chevy chase, kate ma...",united states,2020-01-11,2006,88,min,"children & family movies, comedies","dragged from civilian life, a former superhero..."
22997,ns8807,netflix,movie,zubaan,mozez singh,"vicky kaushal, sarah-jane dias, raaghav chanan...",india,2019-03-02,2015,111,min,"dramas, international movies, music & musicals",a scrappy but poor boy worms his way into a ty...


In [191]:
All = pd.merge(Score,Titles,how='left',on='movieId')
scores_amount = len(All)-1
All.tail(2)

Unnamed: 0,userId,movieId,score,platform_x,platform_y,type,title,director,cast,country,date_added,release_year,duration_int,duration_type,listed_in,description
11024287,124380,ns7881,4.5,netflix,netflix,movie,rocky ii,sylvester stallone,"sylvester stallone, talia shire, burt young, c...",united states,2019-08-01,1979,119,min,"dramas, sports movies","featuring a rousing climax, this engaging sequ..."
11024288,124380,as883,1.5,amazon,amazon,movie,storm boy,shawn seet,"jai courtney, geoffrey rush, finn little, trev...",,NaT,2019,99,min,"drama, faith and spirituality, kids","when michael kingley, a successful retired bus..."
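
Both frames carry a `platform` column, which is why the merge output shows `platform_x` and `platform_y`. One way to avoid the duplicate is to drop the column from one side before merging. The sketch below uses made-up rows; `validate="many_to_one"` is an optional check that each `movieId` appears only once in the titles frame.

```python
import pandas as pd

# Made-up rows shaped like Score and Titles
score = pd.DataFrame(
    {
        "userId": [1, 2],
        "movieId": ["ns7881", "as883"],
        "score": [4.5, 1.5],
        "platform": ["netflix", "amazon"],
    }
)
titles = pd.DataFrame(
    {
        "movieId": ["as883", "ns7881"],
        "platform": ["amazon", "netflix"],
        "title": ["storm boy", "rocky ii"],
    }
)

# Drop the repeated column on the right side so no _x/_y suffixes are created
merged = pd.merge(
    score,
    titles.drop(columns="platform"),
    how="left",
    on="movieId",
    validate="many_to_one",
)
print(merged)
```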


## Load

In [15]:
# Optionally reload partial progress by setting switch to True (requires the ignore folder)
switch = False
if switch:
    all = pd.read_csv('ignore/all_titles.csv')
    Score = pd.read_parquet('processed_data/score.parquet')

In [16]:
# S = Score.groupby('movieId').agg(pd.Series({'score': 'mean'}))
# S.tail(2)



Unnamed: 0_level_0,score
movieId,Unnamed: 1_level_1
ns998,3.442149
ns999,3.36952


In [17]:
all = all.rename(columns={'id':'movieId'}) 

In [18]:
all = pd.merge(S,all,how='left',on='movieId')

In [20]:
all

Unnamed: 0,movieId,score,platform,type,title,director,cast,country,date_added,release_year,duration_int,duration_type,listed_in,description
0,as1,3.330677,amazon,movie,the grand seduction,don mckellar,"brendan gleeson, taylor kitsch, gordon pinsent",canada,2021-03-30,2014,113,min,"comedy, drama",a small fishing village must procure a local d...
1,as10,3.307992,amazon,movie,david's mother,robert allan ackerman,"kirstie alley, sam waterston, stockard channing",united states,2021-04-01,1994,92,min,drama,sally goodson is a devoted mother to her autis...
2,as100,3.465116,amazon,movie,wilder napalm,glenn gordon caron,"debra winger, dennis quaid, arliss howard, m. ...",,,1993,109,min,"comedy, science fiction",two brothers with the secret power of starting...
3,as1000,3.404124,amazon,movie,sinbad: make me wanna holla,jay chapman,sinbad,,,2014,90,min,"arts, entertainment, and culture, comedy, docu...",watch the all-out stand-up special featuring a...
4,as1001,3.420043,amazon,movie,simple gifts: the chamber music society at sha...,habib azar,,,,2016,84,min,documentary,a
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22993,ns995,3.370536,netflix,movie,this lady called life,kayode kasum,"bisola aiyeola, efa iwara, molawa onajobi, tin...",nigeria,2021-04-23,2020,120,min,"dramas, international movies, romantic movies","abandoned by her family, young single mother a..."
22994,ns996,3.495951,netflix,movie,vizontele,"yılmaz erdoğan, ömer faruk sorak","yılmaz erdoğan, demet akbağ, altan erkekli, ce...",turkey,2021-04-23,2001,106,min,"comedies, dramas, international movies","in 1974, a rural town in anatolia gets its fir..."
22995,ns997,3.397895,netflix,movie,homunculus,takashi shimizu,"go ayano, ryo narita, yukino kishii, anna ishi...",japan,2021-04-22,2021,116,min,"horror movies, international movies, thrillers",truth and illusion blurs when a homeless amnes...
22996,ns998,3.442149,netflix,tv show,life in color with david attenborough,,david attenborough,"australia, united kingdom",2021-04-22,2021,1,season,"british tv shows, docuseries, international tv...","using innovative technology, this docuseries e..."


In [None]:

# Titles['platform'] = Titles['platform'].astype(str)
# Titles['type'] = Titles['type'].astype(str)
# Titles['title'] = Titles['title'].astype(str)
# Titles['director'] = Titles['director'].astype(str)
# Titles['cast'] = Titles['cast'].astype(str)
# Titles['country'] = Titles['country'].astype(str)
Titles['release_year'] = Titles['release_year'].astype(np.int16)
Titles['duration_int'] = Titles['duration_int'].astype(np.int16)
# Titles['listed_in'] = Titles['listed_in'].astype(str)
# Titles['description'] = Titles['description'].astype(str)

Score['userId'] = Score['userId'].astype(np.int32)
Score['score'] = Score['score'].astype(np.float32)  # scores have half steps (e.g. 3.5), which an integer dtype would truncate
# Score['platform'] = Score['platform'].astype(str)
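
The ratings in this dataset include half steps such as 3.5, so the choice of a smaller dtype for `score` matters. A short sketch on made-up values comparing an integer cast with a 32-bit float cast:

```python
import numpy as np
import pandas as pd

# Made-up scores with half steps, as in the Score outputs
score = pd.Series([3.5, 4.5, 1.5])

as_int8 = score.astype(np.int8)        # drops the fractional part
as_float32 = score.astype(np.float32)  # keeps half steps, half the size of float64

print(list(as_int8), list(as_float32))
print(score.memory_usage(index=False), as_float32.memory_usage(index=False))
```

Half steps are exactly representable in binary floating point, so the float32 column keeps the original values while using 4 bytes per row instead of 8.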

In [167]:
All.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11024289 entries, 0 to 11024288
Data columns (total 5 columns):
 #   Column    Dtype  
---  ------    -----  
 0   userId    int64  
 1   movieId   object 
 2   score     float64
 3   platform  object 
 4   title     object 
dtypes: float64(1), int64(1), object(3)
memory usage: 504.7+ MB


In [168]:
Score.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11024289 entries, 0 to 11024288
Data columns (total 4 columns):
 #   Column    Dtype  
---  ------    -----  
 0   userId    int64  
 1   movieId   object 
 2   score     float64
 3   platform  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 336.4+ MB


In [21]:
all.to_csv('data/all.csv')

In [74]:
Titles.to_parquet('processed_data/titles.parquet')

In [192]:
title.to_parquet('processed_data/title.parquet')

In [None]:
# Optionally export the large CSV by setting switch to True (requires the ignore folder)
switch = False
if switch:
    Score.to_csv('ignore/score.csv', index=False)
# Produces a file of about 300 MB

In [None]:
# S.to_csv('data/score_mean.csv', index=False)

In [126]:
# Write Score as score.parquet, which is smaller than score.csv
# Score = pd.read_csv('ignore/score.csv')
Score.to_parquet('processed_data/score.parquet')

In [142]:
S.to_parquet('processed_data/S.parquet')

In [169]:
All.to_parquet('processed_data/all.parquet')