#### *Role: Data Engineer*

# <center> *Exploratory Data Analysis* (EDA) - *Extract, Transform and Load* (ETL) </center>
### Individual Project
---

##### We review the incoming data: we read the CSV files provided for this assignment and try to understand what they are about, looking for possible patterns and identifying statistical distributions that may be useful later.
 
#### <center> **Assignment:** ```Build the required transformations and then make the data available by developing and running an API.``` </center>

---
 
### **Requirements for our data transformation:**

#####  1. Generate an id field: each id consists of the first letter of the platform name followed by the show_id already present in the datasets (example for Amazon titles: as123)
#####  2. Null values in the rating field must be replaced with the string “G” (the maturity rating “general for all audiences”)
#####  3. Any dates must use the format YYYY-mm-dd
#####  4. Text fields must be lowercase, without exception
#####  5. The duration field must be split into two fields: duration_int and duration_type. The first is an integer and the second a string giving the unit of duration: min (minutes) or season (seasons)
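
The five requirements above can be sketched as one reusable function. This is a minimal sketch on toy data, not the notebook's actual implementation: the name `transform_platform` and its `prefix` parameter are hypothetical, the id is written to a new `id` column (an assumption, since the notebook overwrites `show_id` instead), and unparseable dates and durations are assumed to stay missing.

```python
import pandas as pd


def transform_platform(df: pd.DataFrame, prefix: str) -> pd.DataFrame:
    """Apply the five required transformations to one platform (hypothetical helper)."""
    out = df.copy()

    # 1. id = first letter of the platform + existing show_id
    out["id"] = prefix + out["show_id"]

    # 2. Null ratings become "G" (lowercased later by step 4)
    out["rating"] = out["rating"].fillna("G")

    # 3. Dates as YYYY-mm-dd; missing or unparseable dates stay missing
    parsed = pd.to_datetime(
        out["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
    )
    out["date_added"] = parsed.dt.strftime("%Y-%m-%d")

    # 4. Lowercase every string cell, leaving other values untouched
    def _lower(value):
        return value.lower() if isinstance(value, str) else value

    out = out.apply(lambda col: col.map(_lower))

    # 5. Split duration into an integer amount and a singular unit
    parts = out["duration"].str.extract(r"^(\d+)\s+(min|season)s?$")
    out["duration_int"] = pd.to_numeric(parts[0]).astype("Int64")
    out["duration_type"] = parts[1]
    return out


# Toy rows (illustrative values only)
toy = pd.DataFrame({
    "show_id": ["s1", "s2"],
    "rating": [None, "13+"],
    "date_added": ["March 30, 2021", None],
    "duration": ["113 min", "2 Seasons"],
})
result = transform_platform(toy, "a")
print(result[["id", "rating", "date_added", "duration_int", "duration_type"]])
```

Calling the same function with `"d"`, `"h"` or `"n"` would cover the other platforms, provided their columns follow the same layout.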

In [1]:
import pandas as pd

### - We read each dataset (CSV format) with the pandas library.

### Dataset *'Amazon'*

In [2]:
amazon_prime = pd.read_csv('https://raw.githubusercontent.com/HX-FNegrete/PI01-Data-Engineering/main/Datasets/amazon_prime_titles-score.csv')

# Review the columns we have available.
amazon_prime.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,score
0,s1,Movie,The Grand Seduction,Don McKellar,"Brendan Gleeson, Taylor Kitsch, Gordon Pinsent",Canada,"March 30, 2021",2014,,113 min,"Comedy, Drama",A small fishing village must procure a local d...,99
1,s2,Movie,Take Care Good Night,Girish Joshi,"Mahesh Manjrekar, Abhay Mahajan, Sachin Khedekar",India,"March 30, 2021",2018,13+,110 min,"Drama, International",A Metro Family decides to fight a Cyber Crimin...,37
2,s3,Movie,Secrets of Deception,Josh Webber,"Tom Sizemore, Lorenzo Lamas, Robert LaSardo, R...",United States,"March 30, 2021",2017,,74 min,"Action, Drama, Suspense",After a man discovers his wife is cheating on ...,20
3,s4,Movie,Pink: Staying True,Sonia Anderson,"Interviews with: Pink, Adele, Beyoncé, Britney...",United States,"March 30, 2021",2014,,69 min,Documentary,"Pink breaks the mold once again, bringing her ...",27
4,s5,Movie,Monster Maker,Giles Foster,"Harry Dean Stanton, Kieran O'Brien, George Cos...",United Kingdom,"March 30, 2021",1989,,45 min,"Drama, Fantasy",Teenage Matt Banting wants to work with a famo...,75


#### - Inspecting the shape of our DataFrame

In [3]:
print('Number of rows and columns:', amazon_prime.shape)

Number of rows and columns: (9668, 13)


#### - Looking at our column names in detail

In [4]:
print('Column names:', amazon_prime.columns)

Column names: Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description',
       'score'],
      dtype='object')


#### - Checking how many null values each column contains

In [5]:
amazon_prime.isnull().sum()

show_id            0
type               0
title              0
director        2082
cast            1233
country         8996
date_added      9513
release_year       0
rating           337
duration           0
listed_in          0
description        0
score              0
dtype: int64

## .info()
#### - Prints a concise summary of a DataFrame, including the index type and columns, the non-null counts and the memory usage.

In [6]:
amazon_prime.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9668 entries, 0 to 9667
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       9668 non-null   object
 1   type          9668 non-null   object
 2   title         9668 non-null   object
 3   director      7586 non-null   object
 4   cast          8435 non-null   object
 5   country       672 non-null    object
 6   date_added    155 non-null    object
 7   release_year  9668 non-null   int64 
 8   rating        9331 non-null   object
 9   duration      9668 non-null   object
 10  listed_in     9668 non-null   object
 11  description   9668 non-null   object
 12  score         9668 non-null   int64 
dtypes: int64(2), object(11)
memory usage: 982.0+ KB
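
The same checks (shape, column names, null counts, `.info()`) are repeated for every dataset, so they could be bundled into one small helper. The sketch below is an assumption rather than part of the original notebook: `quick_summary` is a hypothetical name, and it adds a null percentage, which makes sparse columns easy to spot (for example, `date_added` is missing in 9513 of the 9668 Amazon rows).

```python
import pandas as pd


def quick_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, null count and null percentage (hypothetical helper)."""
    nulls = df.isnull().sum()
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "nulls": nulls,
        "null_pct": (nulls / len(df) * 100).round(2),
    })


# Toy rows (illustrative values only)
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3", "s4"],
    "director": ["Don McKellar", None, None, "Giles Foster"],
    "release_year": [2014, 2018, 2017, 1989],
})
summary = quick_summary(toy)
print("Rows and columns:", toy.shape)
print(summary)
```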


#### ``` 1. Generating the id field = "a" + show_id ```


In [7]:
amazon_prime["show_id"] = amazon_prime["show_id"].apply(lambda x: "a" + x)
amazon_prime.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,score
0,as1,Movie,The Grand Seduction,Don McKellar,"Brendan Gleeson, Taylor Kitsch, Gordon Pinsent",Canada,"March 30, 2021",2014,,113 min,"Comedy, Drama",A small fishing village must procure a local d...,99
1,as2,Movie,Take Care Good Night,Girish Joshi,"Mahesh Manjrekar, Abhay Mahajan, Sachin Khedekar",India,"March 30, 2021",2018,13+,110 min,"Drama, International",A Metro Family decides to fight a Cyber Crimin...,37
2,as3,Movie,Secrets of Deception,Josh Webber,"Tom Sizemore, Lorenzo Lamas, Robert LaSardo, R...",United States,"March 30, 2021",2017,,74 min,"Action, Drama, Suspense",After a man discovers his wife is cheating on ...,20
3,as4,Movie,Pink: Staying True,Sonia Anderson,"Interviews with: Pink, Adele, Beyoncé, Britney...",United States,"March 30, 2021",2014,,69 min,Documentary,"Pink breaks the mold once again, bringing her ...",27
4,as5,Movie,Monster Maker,Giles Foster,"Harry Dean Stanton, Kieran O'Brien, George Cos...",United Kingdom,"March 30, 2021",1989,,45 min,"Drama, Fantasy",Teenage Matt Banting wants to work with a famo...,75
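
As a side note, the prefix can also be added without `apply`, because string concatenation works directly on a pandas Series. A minimal sketch on toy data, offered as an equivalent alternative rather than as the notebook's method:

```python
import pandas as pd

toy = pd.DataFrame({"show_id": ["s1", "s2", "s3"]})

# Row-by-row version, as in the cell above
with_apply = toy["show_id"].apply(lambda x: "a" + x)

# Vectorized version: the prefix is concatenated to every element at once
vectorized = "a" + toy["show_id"]

print(vectorized.tolist())
```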


#### ```2. Replacing null values in the 'rating' field with the string “g” ```


In [8]:
amazon_prime['rating'] = amazon_prime['rating'].fillna('g')

#### Verifying that all null values were replaced:

In [9]:
# isna().sum() counts the remaining nulls; isna().count() would only return the number of rows
amazon_prime['rating'].isna().sum()

0
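
When verifying a null replacement, it helps to remember that `isna().count()` returns the number of rows whether or not nulls remain, while `isna().sum()` counts the nulls themselves. A small sketch on a toy Series (the values are illustrative only):

```python
import pandas as pd

rating = pd.Series(["13+", None, "R", None])

# count() counts the entries of the boolean mask, so it equals the row count
rows = rating.isna().count()

# sum() adds up the True values, so it equals the number of nulls
nulls_before = rating.isna().sum()

filled = rating.fillna("g")
nulls_after = filled.isna().sum()

print(rows, nulls_before, nulls_after)
```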

#### ``` 3. Any dates must use the format YYYY-mm-dd ```

In [10]:
amazon_prime['date_added'] = pd.to_datetime(amazon_prime['date_added'], format='%B %d, %Y').dt.strftime('%Y-%m-%d')


#### Verifying the format change in the 'date_added' column

In [11]:
amazon_prime.head(10)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,score
0,as1,Movie,The Grand Seduction,Don McKellar,"Brendan Gleeson, Taylor Kitsch, Gordon Pinsent",Canada,2021-03-30,2014,g,113 min,"Comedy, Drama",A small fishing village must procure a local d...,99
1,as2,Movie,Take Care Good Night,Girish Joshi,"Mahesh Manjrekar, Abhay Mahajan, Sachin Khedekar",India,2021-03-30,2018,13+,110 min,"Drama, International",A Metro Family decides to fight a Cyber Crimin...,37
2,as3,Movie,Secrets of Deception,Josh Webber,"Tom Sizemore, Lorenzo Lamas, Robert LaSardo, R...",United States,2021-03-30,2017,g,74 min,"Action, Drama, Suspense",After a man discovers his wife is cheating on ...,20
3,as4,Movie,Pink: Staying True,Sonia Anderson,"Interviews with: Pink, Adele, Beyoncé, Britney...",United States,2021-03-30,2014,g,69 min,Documentary,"Pink breaks the mold once again, bringing her ...",27
4,as5,Movie,Monster Maker,Giles Foster,"Harry Dean Stanton, Kieran O'Brien, George Cos...",United Kingdom,2021-03-30,1989,g,45 min,"Drama, Fantasy",Teenage Matt Banting wants to work with a famo...,75
5,as6,Movie,Living With Dinosaurs,Paul Weiland,"Gregory Chisholm, Juliet Stevenson, Brian Hens...",United Kingdom,2021-03-30,1989,g,52 min,"Fantasy, Kids",The story unfolds in a an English seaside town...,74
6,as7,Movie,Hired Gun,Fran Strine,"Alice Cooper, Liberty DeVitto, Ray Parker Jr.,...",United States,2021-03-30,2017,g,98 min,"Documentary, Special Interest","They are the ""First Call, A-list"" musicians, j...",50
7,as8,Movie,Grease Live!,"Thomas Kail, Alex Rudzinski","Julianne Hough, Aaron Tveit, Vanessa Hudgens, ...",United States,2021-03-30,2016,g,131 min,Comedy,"This honest, uncompromising comedy chronicles ...",84
8,as9,Movie,Global Meltdown,Daniel Gilboy,"Michael Paré, Leanne Khol Young, Patrick J. Ma...",Canada,2021-03-30,2017,g,87 min,"Action, Science Fiction, Suspense",A helicopter pilot and an environmental scient...,46
9,as10,Movie,David's Mother,Robert Allan Ackerman,"Kirstie Alley, Sam Waterston, Stockard Channing",United States,2021-04-01,1994,g,92 min,Drama,Sally Goodson is a devoted mother to her autis...,70
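
The conversion above expects every non-null value to match `'%B %d, %Y'` exactly. If a dataset contained stray whitespace or malformed entries (an assumption, nothing of the sort appears in the rows shown here), a more defensive variant could strip the strings first and coerce failures to missing values. A minimal sketch on toy data:

```python
import pandas as pd

# Toy values: a clean date, one with a leading space, a null and an invalid string
raw = pd.Series(["March 30, 2021", " April 1, 2021", None, "not a date"])

# errors="coerce" turns anything that does not match the format into NaT
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")

# NaT becomes a missing value after formatting
formatted = parsed.dt.strftime("%Y-%m-%d")
print(formatted.tolist())
```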


#### ``` 4. Text fields must be lowercase, without exception ```

In [12]:
# Lowercase every string cell; non-string values (numbers, NaN) are left unchanged
amazon = amazon_prime.applymap(lambda x: x.lower() if isinstance(x, str) else x)

#### Verifying

In [13]:
amazon.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,score
0,as1,movie,the grand seduction,don mckellar,"brendan gleeson, taylor kitsch, gordon pinsent",canada,2021-03-30,2014,g,113 min,"comedy, drama",a small fishing village must procure a local d...,99
1,as2,movie,take care good night,girish joshi,"mahesh manjrekar, abhay mahajan, sachin khedekar",india,2021-03-30,2018,13+,110 min,"drama, international",a metro family decides to fight a cyber crimin...,37
2,as3,movie,secrets of deception,josh webber,"tom sizemore, lorenzo lamas, robert lasardo, r...",united states,2021-03-30,2017,g,74 min,"action, drama, suspense",after a man discovers his wife is cheating on ...,20
3,as4,movie,pink: staying true,sonia anderson,"interviews with: pink, adele, beyoncé, britney...",united states,2021-03-30,2014,g,69 min,documentary,"pink breaks the mold once again, bringing her ...",27
4,as5,movie,monster maker,giles foster,"harry dean stanton, kieran o'brien, george cos...",united kingdom,2021-03-30,1989,g,45 min,"drama, fantasy",teenage matt banting wants to work with a famo...,75
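
`DataFrame.applymap` was deprecated in pandas 2.1 in favor of `DataFrame.map`, so the cell above may emit a warning on recent versions. One way to sidestep the rename is to go column by column with `Series.map`. The sketch below is an alternative under that assumption, with `lowercase_text` as a hypothetical helper name:

```python
import pandas as pd


def lowercase_text(df: pd.DataFrame) -> pd.DataFrame:
    """Lowercase every string cell, leaving numbers and missing values untouched."""
    def _lower(value):
        return value.lower() if isinstance(value, str) else value

    # Apply the element-wise function to each column in turn
    return df.apply(lambda col: col.map(_lower))


# Toy rows (illustrative values only)
toy = pd.DataFrame({
    "title": ["The Grand Seduction", "Monster Maker"],
    "country": ["Canada", None],
    "release_year": [2014, 1989],
})
lowered = lowercase_text(toy)
print(lowered)
```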


#### ``` 5. The duration field must be split into two fields: duration_int and duration_type. The first is an integer and the second a string giving the unit of duration: min (minutes) or season (seasons) ```

#### Changing seasons to season

In [14]:
amazon['duration'] = amazon['duration'].str.replace('seasons', 'season')

#### Transforming 'duration' into 'duration_int' and 'duration_type'

In [15]:
amazon[['duration_int','duration_type']] = amazon['duration'].str.split(n=1,expand=True)
amazon['duration_int'] = amazon['duration_int'].astype(int)


#### Verifying

In [16]:
amazon.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,score,duration_int,duration_type
0,as1,movie,the grand seduction,don mckellar,"brendan gleeson, taylor kitsch, gordon pinsent",canada,2021-03-30,2014,g,113 min,"comedy, drama",a small fishing village must procure a local d...,99,113,min
1,as2,movie,take care good night,girish joshi,"mahesh manjrekar, abhay mahajan, sachin khedekar",india,2021-03-30,2018,13+,110 min,"drama, international",a metro family decides to fight a cyber crimin...,37,110,min
2,as3,movie,secrets of deception,josh webber,"tom sizemore, lorenzo lamas, robert lasardo, r...",united states,2021-03-30,2017,g,74 min,"action, drama, suspense",after a man discovers his wife is cheating on ...,20,74,min
3,as4,movie,pink: staying true,sonia anderson,"interviews with: pink, adele, beyoncé, britney...",united states,2021-03-30,2014,g,69 min,documentary,"pink breaks the mold once again, bringing her ...",27,69,min
4,as5,movie,monster maker,giles foster,"harry dean stanton, kieran o'brien, george cos...",united kingdom,2021-03-30,1989,g,45 min,"drama, fantasy",teenage matt banting wants to work with a famo...,75,45,min
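
To make the two steps above easier to follow, here is a minimal sketch of the same idea on a toy Series. It assumes every value has the form "number, space, unit" and anchors the plural replacement to the end of the string, which is slightly stricter than the plain replacement used in the notebook.

```python
import pandas as pd

duration = pd.Series(["113 min", "1 season", "2 seasons"])

# Normalize the plural only when it appears at the end of the string
normalized = duration.str.replace(r"seasons$", "season", regex=True)

# Split once on whitespace: column 0 is the amount, column 1 the unit
parts = normalized.str.split(n=1, expand=True)
duration_int = parts[0].astype(int)
duration_type = parts[1]

print(duration_int.tolist())
print(duration_type.tolist())
```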


### Dataset *'Disney'*

In [17]:
disney_plus = pd.read_csv('https://raw.githubusercontent.com/HX-FNegrete/PI01-Data-Engineering/main/Datasets/disney_plus_titles-score.csv')
disney_plus.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,score
0,s1,Movie,Duck the Halls: A Mickey Mouse Christmas Special,"Alonso Ramirez Ramos, Dave Wasson","Chris Diamantopoulos, Tony Anselmo, Tress MacN...",,"November 26, 2021",2016,TV-G,23 min,"Animation, Family",Join Mickey and the gang as they duck the halls!,100
1,s2,Movie,Ernest Saves Christmas,John Cherry,"Jim Varney, Noelle Parker, Douglas Seale",,"November 26, 2021",1988,PG,91 min,Comedy,Santa Claus passes his magic bag to a new St. ...,17
2,s3,Movie,Ice Age: A Mammoth Christmas,Karen Disher,"Raymond Albert Romano, John Leguizamo, Denis L...",United States,"November 26, 2021",2011,TV-G,23 min,"Animation, Comedy, Family",Sid the Sloth is on Santa's naughty list.,92
3,s4,Movie,The Queen Family Singalong,Hamish Hamilton,"Darren Criss, Adam Lambert, Derek Hough, Alexa...",,"November 26, 2021",2021,TV-PG,41 min,Musical,"This is real life, not just fantasy!",35
4,s5,TV Show,The Beatles: Get Back,,"John Lennon, Paul McCartney, George Harrison, ...",,"November 25, 2021",2021,,1 Season,"Docuseries, Historical, Music",A three-part documentary from Peter Jackson ca...,11


In [18]:
print('Number of rows and columns:', disney_plus.shape)

Number of rows and columns: (1450, 13)


In [19]:
print('Column names:', disney_plus.columns)

Column names: Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description',
       'score'],
      dtype='object')


In [20]:
disney_plus.isnull().sum()

show_id           0
type              0
title             0
director        473
cast            190
country         219
date_added        3
release_year      0
rating            3
duration          0
listed_in         0
description       0
score             0
dtype: int64

In [21]:
disney_plus.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1450 entries, 0 to 1449
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       1450 non-null   object
 1   type          1450 non-null   object
 2   title         1450 non-null   object
 3   director      977 non-null    object
 4   cast          1260 non-null   object
 5   country       1231 non-null   object
 6   date_added    1447 non-null   object
 7   release_year  1450 non-null   int64 
 8   rating        1447 non-null   object
 9   duration      1450 non-null   object
 10  listed_in     1450 non-null   object
 11  description   1450 non-null   object
 12  score         1450 non-null   int64 
dtypes: int64(2), object(11)
memory usage: 147.4+ KB


#### ``` 1. Generating the id field = "d" + show_id ```


In [22]:
# Prepend "d" to every value in the "show_id" column
disney_plus["show_id"] = disney_plus["show_id"].apply(lambda x: "d" + x)

In [23]:
disney_plus.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,score
0,ds1,Movie,Duck the Halls: A Mickey Mouse Christmas Special,"Alonso Ramirez Ramos, Dave Wasson","Chris Diamantopoulos, Tony Anselmo, Tress MacN...",,"November 26, 2021",2016,TV-G,23 min,"Animation, Family",Join Mickey and the gang as they duck the halls!,100
1,ds2,Movie,Ernest Saves Christmas,John Cherry,"Jim Varney, Noelle Parker, Douglas Seale",,"November 26, 2021",1988,PG,91 min,Comedy,Santa Claus passes his magic bag to a new St. ...,17
2,ds3,Movie,Ice Age: A Mammoth Christmas,Karen Disher,"Raymond Albert Romano, John Leguizamo, Denis L...",United States,"November 26, 2021",2011,TV-G,23 min,"Animation, Comedy, Family",Sid the Sloth is on Santa's naughty list.,92
3,ds4,Movie,The Queen Family Singalong,Hamish Hamilton,"Darren Criss, Adam Lambert, Derek Hough, Alexa...",,"November 26, 2021",2021,TV-PG,41 min,Musical,"This is real life, not just fantasy!",35
4,ds5,TV Show,The Beatles: Get Back,,"John Lennon, Paul McCartney, George Harrison, ...",,"November 25, 2021",2021,,1 Season,"Docuseries, Historical, Music",A three-part documentary from Peter Jackson ca...,11


#### ``` 2. Replacing null values in the 'rating' field with the string “g” ```


In [24]:
disney_plus['rating'] = disney_plus['rating'].fillna('g')

#### Verifying that all null values were replaced:

In [25]:
# isna().sum() counts the remaining nulls; isna().count() would only return the number of rows
disney_plus['rating'].isna().sum()

0

#### ``` 3. Any dates must use the format YYYY-mm-dd ```


In [26]:
disney_plus['date_added'] = pd.to_datetime(disney_plus['date_added'], format='%B %d, %Y').dt.strftime('%Y-%m-%d')

#### Verifying the format change in the 'date_added' column

In [27]:
disney_plus.head(10)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,score
0,ds1,Movie,Duck the Halls: A Mickey Mouse Christmas Special,"Alonso Ramirez Ramos, Dave Wasson","Chris Diamantopoulos, Tony Anselmo, Tress MacN...",,2021-11-26,2016,TV-G,23 min,"Animation, Family",Join Mickey and the gang as they duck the halls!,100
1,ds2,Movie,Ernest Saves Christmas,John Cherry,"Jim Varney, Noelle Parker, Douglas Seale",,2021-11-26,1988,PG,91 min,Comedy,Santa Claus passes his magic bag to a new St. ...,17
2,ds3,Movie,Ice Age: A Mammoth Christmas,Karen Disher,"Raymond Albert Romano, John Leguizamo, Denis L...",United States,2021-11-26,2011,TV-G,23 min,"Animation, Comedy, Family",Sid the Sloth is on Santa's naughty list.,92
3,ds4,Movie,The Queen Family Singalong,Hamish Hamilton,"Darren Criss, Adam Lambert, Derek Hough, Alexa...",,2021-11-26,2021,TV-PG,41 min,Musical,"This is real life, not just fantasy!",35
4,ds5,TV Show,The Beatles: Get Back,,"John Lennon, Paul McCartney, George Harrison, ...",,2021-11-25,2021,g,1 Season,"Docuseries, Historical, Music",A three-part documentary from Peter Jackson ca...,11
5,ds6,Movie,Becoming Cousteau,Liz Garbus,"Jacques Yves Cousteau, Vincent Cassel",United States,2021-11-24,2021,PG-13,94 min,"Biographical, Documentary",An inside look at the legendary life of advent...,18
6,ds7,TV Show,Hawkeye,,"Jeremy Renner, Hailee Steinfeld, Vera Farmiga,...",,2021-11-24,2021,TV-14,1 Season,"Action-Adventure, Superhero",Clint Barton/Hawkeye must team up with skilled...,98
7,ds8,TV Show,Port Protection Alaska,,"Gary Muehlberger, Mary Miller, Curly Leach, Sa...",United States,2021-11-24,2015,TV-14,2 Seasons,"Docuseries, Reality, Survival",Residents of Port Protection must combat volat...,60
8,ds9,TV Show,Secrets of the Zoo: Tampa,,"Dr. Ray Ball, Dr. Lauren Smith, Chris Massaro,...",United States,2021-11-24,2019,TV-PG,2 Seasons,"Animals & Nature, Docuseries, Family",A day in the life at ZooTampa is anything but ...,53
9,ds10,Movie,A Muppets Christmas: Letters To Santa,Kirk R. Thatcher,"Steve Whitmire, Dave Goelz, Bill Barretta, Eri...",United States,2021-11-19,2008,G,45 min,"Comedy, Family, Musical",Celebrate the holiday season with all your fav...,7


#### ``` 4. Text fields must be lowercase, without exception ```


In [28]:
# Lowercase every string cell; non-string values (numbers, NaN) are left unchanged
disney = disney_plus.applymap(lambda x: x.lower() if isinstance(x, str) else x)

#### Verifying

In [29]:
disney.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,score
0,ds1,movie,duck the halls: a mickey mouse christmas special,"alonso ramirez ramos, dave wasson","chris diamantopoulos, tony anselmo, tress macn...",,2021-11-26,2016,tv-g,23 min,"animation, family",join mickey and the gang as they duck the halls!,100
1,ds2,movie,ernest saves christmas,john cherry,"jim varney, noelle parker, douglas seale",,2021-11-26,1988,pg,91 min,comedy,santa claus passes his magic bag to a new st. ...,17
2,ds3,movie,ice age: a mammoth christmas,karen disher,"raymond albert romano, john leguizamo, denis l...",united states,2021-11-26,2011,tv-g,23 min,"animation, comedy, family",sid the sloth is on santa's naughty list.,92
3,ds4,movie,the queen family singalong,hamish hamilton,"darren criss, adam lambert, derek hough, alexa...",,2021-11-26,2021,tv-pg,41 min,musical,"this is real life, not just fantasy!",35
4,ds5,tv show,the beatles: get back,,"john lennon, paul mccartney, george harrison, ...",,2021-11-25,2021,g,1 season,"docuseries, historical, music",a three-part documentary from peter jackson ca...,11


#### ``` 5. The duration field must be split into two fields: duration_int and duration_type. The first is an integer and the second a string giving the unit of duration: min (minutes) or season (seasons) ```


#### Changing seasons to season

In [30]:
disney['duration'] = disney['duration'].str.replace('seasons', 'season')

#### Transforming 'duration' into 'duration_int' and 'duration_type'

In [31]:
disney[['duration_int','duration_type']] = disney['duration'].str.split(n=1,expand=True)
disney['duration_int'] = disney['duration_int'].astype(int)

#### Verifying

In [32]:
disney.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,score,duration_int,duration_type
0,ds1,movie,duck the halls: a mickey mouse christmas special,"alonso ramirez ramos, dave wasson","chris diamantopoulos, tony anselmo, tress macn...",,2021-11-26,2016,tv-g,23 min,"animation, family",join mickey and the gang as they duck the halls!,100,23,min
1,ds2,movie,ernest saves christmas,john cherry,"jim varney, noelle parker, douglas seale",,2021-11-26,1988,pg,91 min,comedy,santa claus passes his magic bag to a new st. ...,17,91,min
2,ds3,movie,ice age: a mammoth christmas,karen disher,"raymond albert romano, john leguizamo, denis l...",united states,2021-11-26,2011,tv-g,23 min,"animation, comedy, family",sid the sloth is on santa's naughty list.,92,23,min
3,ds4,movie,the queen family singalong,hamish hamilton,"darren criss, adam lambert, derek hough, alexa...",,2021-11-26,2021,tv-pg,41 min,musical,"this is real life, not just fantasy!",35,41,min
4,ds5,tv show,the beatles: get back,,"john lennon, paul mccartney, george harrison, ...",,2021-11-25,2021,g,1 season,"docuseries, historical, music",a three-part documentary from peter jackson ca...,11,1,season


### Dataset *'Hulu'*

In [33]:
hulu = pd.read_csv('https://raw.githubusercontent.com/HX-FNegrete/PI01-Data-Engineering/main/Datasets/hulu_titles-score%20(2).csv')
hulu.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,score
0,s1,Movie,Ricky Velez: Here's Everything,,,,"October 24, 2021",2021,TV-MA,,"Comedy, Stand Up",​Comedian Ricky Velez bares it all with his ho...,48
1,s2,Movie,Silent Night,,,,"October 23, 2021",2020,,94 min,"Crime, Drama, Thriller","Mark, a low end South London hitman recently r...",8
2,s3,Movie,The Marksman,,,,"October 23, 2021",2021,PG-13,108 min,"Action, Thriller",A hardened Arizona rancher tries to protect an...,8
3,s4,Movie,Gaia,,,,"October 22, 2021",2021,R,97 min,Horror,A forest ranger and two survivalists with a cu...,33
4,s5,Movie,Settlers,,,,"October 22, 2021",2021,,104 min,"Science Fiction, Thriller",Mankind's earliest settlers on the Martian fro...,24


In [34]:
print('Number of rows and columns:', hulu.shape)

Number of rows and columns: (3073, 13)


In [35]:
print('Column names:', hulu.columns)

Column names: Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description',
       'score'],
      dtype='object')


In [36]:
hulu.isnull().sum()

show_id            0
type               0
title              0
director        3070
cast            3073
country         1453
date_added        28
release_year       0
rating           520
duration         479
listed_in          0
description        4
score              0
dtype: int64

In [37]:
hulu.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3073 entries, 0 to 3072
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   show_id       3073 non-null   object 
 1   type          3073 non-null   object 
 2   title         3073 non-null   object 
 3   director      3 non-null      object 
 4   cast          0 non-null      float64
 5   country       1620 non-null   object 
 6   date_added    3045 non-null   object 
 7   release_year  3073 non-null   int64  
 8   rating        2553 non-null   object 
 9   duration      2594 non-null   object 
 10  listed_in     3073 non-null   object 
 11  description   3069 non-null   object 
 12  score         3073 non-null   int64  
dtypes: float64(1), int64(2), object(10)
memory usage: 312.2+ KB


#### ``` 1. Generating the id field = "h" + show_id ```


In [38]:
# Prepend "h" to every value in the "show_id" column
hulu["show_id"] = hulu["show_id"].apply(lambda x: "h" + x)

In [39]:
hulu.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,score
0,hs1,Movie,Ricky Velez: Here's Everything,,,,"October 24, 2021",2021,TV-MA,,"Comedy, Stand Up",​Comedian Ricky Velez bares it all with his ho...,48
1,hs2,Movie,Silent Night,,,,"October 23, 2021",2020,,94 min,"Crime, Drama, Thriller","Mark, a low end South London hitman recently r...",8
2,hs3,Movie,The Marksman,,,,"October 23, 2021",2021,PG-13,108 min,"Action, Thriller",A hardened Arizona rancher tries to protect an...,8
3,hs4,Movie,Gaia,,,,"October 22, 2021",2021,R,97 min,Horror,A forest ranger and two survivalists with a cu...,33
4,hs5,Movie,Settlers,,,,"October 22, 2021",2021,,104 min,"Science Fiction, Thriller",Mankind's earliest settlers on the Martian fro...,24


#### ``` 2. Replacing null values in the 'rating' field with the string “g” ```


In [40]:
hulu['rating'] = hulu['rating'].fillna('g')

#### Verifying that all null values were replaced:

In [41]:
# isna().sum() counts the remaining nulls; isna().count() would only return the number of rows
hulu['rating'].isna().sum()

0

#### ``` 3. Any dates must use the format YYYY-mm-dd ```


In [42]:
hulu['date_added'] = pd.to_datetime(hulu['date_added'], format='%B %d, %Y').dt.strftime('%Y-%m-%d')

#### Verifying the format change in the 'date_added' column

In [43]:
hulu.head(10)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,score
0,hs1,Movie,Ricky Velez: Here's Everything,,,,2021-10-24,2021,TV-MA,,"Comedy, Stand Up",​Comedian Ricky Velez bares it all with his ho...,48
1,hs2,Movie,Silent Night,,,,2021-10-23,2020,g,94 min,"Crime, Drama, Thriller","Mark, a low end South London hitman recently r...",8
2,hs3,Movie,The Marksman,,,,2021-10-23,2021,PG-13,108 min,"Action, Thriller",A hardened Arizona rancher tries to protect an...,8
3,hs4,Movie,Gaia,,,,2021-10-22,2021,R,97 min,Horror,A forest ranger and two survivalists with a cu...,33
4,hs5,Movie,Settlers,,,,2021-10-22,2021,g,104 min,"Science Fiction, Thriller",Mankind's earliest settlers on the Martian fro...,24
5,hs6,TV Show,The Halloween Candy Magic Pet,,,,2021-10-22,2021,g,1 Season,"Family, Kids",Join Mila and Morphle on a mystery-filled Hall...,45
6,hs7,Movie,The Evil Next Door,,,,2021-10-21,2020,g,88 min,"Horror, Thriller","New to her role as a stepmom, a young woman mo...",37
7,hs8,TV Show,The Next Thing You Eat,,,,2021-10-21,2021,g,1 Season,"Cooking & Food, Documentaries, Lifestyle & Cul...",With the unique insights and experience of Ugl...,76
8,hs9,TV Show,Queens,,,,2021-10-20,2021,TV-14,1 Season,"Drama, Music",Four women in their 40s reunite for a chance t...,57
9,hs10,TV Show,The Bachelorette,,,United States,2021-10-20,2003,TV-14,3 Seasons,"Reality, Romance",ABC's romance reality show lets one lucky lady...,79


#### ``` 4. Text fields must be lowercase, without exception ```


In [44]:
# Lowercase every string cell; non-string values (numbers, NaN) are left unchanged
hulu = hulu.applymap(lambda x: x.lower() if isinstance(x, str) else x)

#### Verifying

In [45]:
hulu.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,score
0,hs1,movie,ricky velez: here's everything,,,,2021-10-24,2021,tv-ma,,"comedy, stand up",​comedian ricky velez bares it all with his ho...,48
1,hs2,movie,silent night,,,,2021-10-23,2020,g,94 min,"crime, drama, thriller","mark, a low end south london hitman recently r...",8
2,hs3,movie,the marksman,,,,2021-10-23,2021,pg-13,108 min,"action, thriller",a hardened arizona rancher tries to protect an...,8
3,hs4,movie,gaia,,,,2021-10-22,2021,r,97 min,horror,a forest ranger and two survivalists with a cu...,33
4,hs5,movie,settlers,,,,2021-10-22,2021,g,104 min,"science fiction, thriller",mankind's earliest settlers on the martian fro...,24


#### ``` 5. The duration field must be split into two fields: duration_int and duration_type. The first is an integer and the second a string giving the unit of duration: min (minutes) or season (seasons) ```


#### Changing seasons to season

In [46]:
hulu['duration'] = hulu['duration'].str.replace('seasons', 'season')

#### First we replace the NaN values in 'duration' with "0 min"

In [47]:
hulu['duration'] = hulu['duration'].fillna("0 min")

#### Transforming 'duration' into 'duration_int' and 'duration_type'

In [48]:
# Now we can convert 'duration' into 'duration_int' and 'duration_type'
hulu[['duration_int','duration_type']] = hulu['duration'].str.split(n=1,expand=True)
hulu['duration_int'] = hulu['duration_int'].astype(int)

#### Verifying

In [49]:
hulu.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,score,duration_int,duration_type
0,hs1,movie,ricky velez: here's everything,,,,2021-10-24,2021,tv-ma,0 min,"comedy, stand up",​comedian ricky velez bares it all with his ho...,48,0,min
1,hs2,movie,silent night,,,,2021-10-23,2020,g,94 min,"crime, drama, thriller","mark, a low end south london hitman recently r...",8,94,min
2,hs3,movie,the marksman,,,,2021-10-23,2021,pg-13,108 min,"action, thriller",a hardened arizona rancher tries to protect an...,8,108,min
3,hs4,movie,gaia,,,,2021-10-22,2021,r,97 min,horror,a forest ranger and two survivalists with a cu...,33,97,min
4,hs5,movie,settlers,,,,2021-10-22,2021,g,104 min,"science fiction, thriller",mankind's earliest settlers on the martian fro...,24,104,min
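
Filling missing durations with "0 min" lets `astype(int)` succeed, but it also makes a missing duration indistinguishable from a real zero. As a possible alternative (an assumption, not the approach taken in this notebook), pandas' nullable `Int64` dtype can hold integers and missing values in the same column. A minimal sketch on toy data:

```python
import pandas as pd

# Toy values: one of them is missing, as in the Hulu dataset
duration = pd.Series(["94 min", None, "3 seasons"])

# Capture the amount and the singular unit; rows that do not match stay missing
parts = duration.str.extract(r"^(\d+)\s+(min|season)s?$")

# "Int64" (capital I) is the nullable integer dtype, so missing values are kept
duration_int = pd.to_numeric(parts[0]).astype("Int64")
duration_type = parts[1]

print(duration_int)
print(duration_type)
```

Whether a placeholder such as "0 min" or an explicit missing value is preferable depends on how the API consuming the data is expected to treat unknown durations.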


### Dataset *'Netflix'*

In [50]:
netflix = pd.read_csv('https://raw.githubusercontent.com/HX-FNegrete/PI01-Data-Engineering/main/Datasets/netflix_titles-score.csv')
netflix.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,score
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",29
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",86
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,75
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",49
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,92


In [51]:
print('Number of rows and columns:', netflix.shape)

Number of rows and columns: (8807, 13)


In [52]:
print('Column names:', netflix.columns)

Column names: Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description',
       'score'],
      dtype='object')


In [53]:
netflix.isnull().sum()

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
score              0
dtype: int64

In [54]:
netflix.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
 12  score         8807 non-null   int64 
dtypes: int64(2), object(11)
memory usage: 894.6+ KB


#### ``` 1. Generating the id field = "n" + show_id ```

In [55]:
# Prepend "n" to every value in the "show_id" column
netflix["show_id"] = netflix["show_id"].apply(lambda x: "n" + x)

In [56]:
netflix.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,score
0,ns1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",29
1,ns2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",86
2,ns3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,75
3,ns4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",49
4,ns5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,92
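
The same rule (first letter of the platform name followed by show_id) applies to every dataset, so it can be sketched as one reusable function. This is only an illustration on made-up rows: `add_platform_prefix` is a hypothetical helper that is not part of the original notebook, and it assumes `show_id` holds strings with no missing values.

```python
import pandas as pd


def add_platform_prefix(df: pd.DataFrame, platform: str) -> pd.DataFrame:
    """Return a copy of df whose show_id starts with the platform's first letter."""
    prefix = platform[0].lower()
    out = df.copy()
    # Vectorized string concatenation, no per-row lambda needed.
    out["show_id"] = prefix + out["show_id"]
    return out


# Made-up sample rows, used only to illustrate the call.
sample = pd.DataFrame({"show_id": ["s1", "s2", "s3"],
                       "title": ["a", "b", "c"]})
result = add_platform_prefix(sample, "netflix")
print(result["show_id"].tolist())
```

Working on a copy leaves the input frame untouched, so running the cell twice does not stack prefixes such as "nns1".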


#### ``` 2. Replacing the null values in the 'rating' field with the string “g”  ```


In [57]:
netflix['rating'] = netflix['rating'].fillna('g')

#### Verifying that all null values have been replaced:

In [58]:
netflix['rating'].isna().sum()

0
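
When checking for remaining nulls, keep in mind that `isna().sum()` counts the missing values, whereas `isna().count()` counts every row of the boolean mask. A minimal sketch on a made-up Series (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up ratings with two missing entries.
ratings = pd.Series(["PG-13", np.nan, "TV-MA", None], name="rating")

nulls_before = int(ratings.isna().sum())   # number of missing values
filled = ratings.fillna("g")               # returns a new Series
nulls_after = int(filled.isna().sum())     # missing values left after filling
total_rows = int(filled.isna().count())    # rows in the mask, i.e. every row

print(nulls_before, nulls_after, total_rows)
```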

#### ``` 3. If there are dates, they must be in YYYY-mm-dd format ```


In [59]:
# First, normalize the values in the 'date_added' column by stripping leading and trailing whitespace
netflix['date_added'] = netflix['date_added'].str.strip()

In [60]:
netflix['date_added'] = pd.to_datetime(netflix['date_added'], format='%B %d, %Y').dt.strftime('%Y-%m-%d')

#### Verifying the format change in the 'date_added' column

In [61]:
netflix.head(10)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,score
0,ns1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,2021-09-25,2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",29
1,ns2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",86
2,ns3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,2021-09-24,2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,75
3,ns4,TV Show,Jailbirds New Orleans,,,,2021-09-24,2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",49
4,ns5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,92
5,ns6,TV Show,Midnight Mass,Mike Flanagan,"Kate Siegel, Zach Gilford, Hamish Linklater, H...",,2021-09-24,2021,TV-MA,1 Season,"TV Dramas, TV Horror, TV Mysteries",The arrival of a charismatic young priest brin...,37
6,ns7,Movie,My Little Pony: A New Generation,"Robert Cullen, José Luis Ucha","Vanessa Hudgens, Kimiko Glenn, James Marsden, ...",,2021-09-24,2021,PG,91 min,Children & Family Movies,Equestria's divided. But a bright-eyed hero be...,12
7,ns8,Movie,Sankofa,Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...","United States, Ghana, Burkina Faso, United Kin...",2021-09-24,1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies","On a photo shoot in Ghana, an American model s...",10
8,ns9,TV Show,The Great British Baking Show,Andy Devonshire,"Mel Giedroyc, Sue Perkins, Mary Berry, Paul Ho...",United Kingdom,2021-09-24,2021,TV-14,9 Seasons,"British TV Shows, Reality TV",A talented batch of amateur bakers face off in...,44
9,ns10,Movie,The Starling,Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, T...",United States,2021-09-24,2021,PG-13,104 min,"Comedies, Dramas",A woman adjusting to life after a loss contend...,3
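
The two date steps (strip, then parse and reformat) can be tried in isolation on a few made-up values. This sketch assumes English month names, as in the dataset. It also passes `errors="coerce"`, which the notebook does not use, so that unparseable strings become missing values instead of raising an error.

```python
import pandas as pd

# Made-up values: stray whitespace on both sides and one missing date.
raw = pd.Series([" September 25, 2021", "March 30, 2021 ", None],
                name="date_added")

cleaned = raw.str.strip()
# Parse "Month day, year"; anything that does not fit becomes NaT.
parsed = pd.to_datetime(cleaned, format="%B %d, %Y", errors="coerce")
# Render as YYYY-mm-dd text; missing dates stay missing.
iso = parsed.dt.strftime("%Y-%m-%d")

print(iso.tolist())
```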


#### ``` 4. Text fields must be lowercase, without exception ```


In [62]:
# Lowercase every string cell; non-string values (numbers, NaN) are left untouched
netflix = netflix.applymap(lambda x: x.lower() if isinstance(x, str) else x)

#### Verifying

In [63]:
netflix.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,score
0,ns1,movie,dick johnson is dead,kirsten johnson,,united states,2021-09-25,2020,pg-13,90 min,documentaries,"as her father nears the end of his life, filmm...",29
1,ns2,tv show,blood & water,,"ama qamata, khosi ngema, gail mabalane, thaban...",south africa,2021-09-24,2021,tv-ma,2 seasons,"international tv shows, tv dramas, tv mysteries","after crossing paths at a party, a cape town t...",86
2,ns3,tv show,ganglands,julien leclercq,"sami bouajila, tracy gotoas, samuel jouy, nabi...",,2021-09-24,2021,tv-ma,1 season,"crime tv shows, international tv shows, tv act...",to protect his family from a powerful drug lor...,75
3,ns4,tv show,jailbirds new orleans,,,,2021-09-24,2021,tv-ma,1 season,"docuseries, reality tv","feuds, flirtations and toilet talk go down amo...",49
4,ns5,tv show,kota factory,,"mayur more, jitendra kumar, ranjan raj, alam k...",india,2021-09-24,2021,tv-ma,2 seasons,"international tv shows, romantic tv shows, tv ...",in a city of coaching centers known to train i...,92
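
Recent pandas releases deprecate `DataFrame.applymap` in favor of `DataFrame.map`. One way to stay independent of that rename is to apply `Series.map` column by column. The sketch below uses made-up rows, and `lowercase_text` is a hypothetical helper name, not something defined in the notebook.

```python
import pandas as pd


def lowercase_text(df: pd.DataFrame) -> pd.DataFrame:
    """Lowercase every string cell; leave numbers and missing values alone."""
    return df.apply(
        lambda col: col.map(lambda v: v.lower() if isinstance(v, str) else v)
    )


# Made-up sample mixing text, a missing value and an integer column.
sample = pd.DataFrame({
    "title": ["Dick Johnson Is Dead", None],
    "rating": ["PG-13", "TV-MA"],
    "score": [29, 86],
})
result = lowercase_text(sample)
print(result)
```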


#### ``` 5. The duration field must be split into two fields: duration_int and duration_type. The first is an integer and the second a string indicating the unit of duration: min (minutes) or season (seasons) ```

#### Changing seasons to season

In [64]:
netflix['duration'] = netflix['duration'].str.replace('seasons', 'season')

#### First we replace the NaN values in 'duration' with "0 min"

In [65]:
netflix['duration'] = netflix['duration'].fillna("0 min")

#### Transforming 'duration' into 'duration_int' and 'duration_type'

In [66]:
netflix[['duration_int','duration_type']] = netflix['duration'].str.split(n=1,expand=True)
netflix['duration_int'] = netflix['duration_int'].astype(int)

#### Verifying

In [67]:
netflix.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,score,duration_int,duration_type
0,ns1,movie,dick johnson is dead,kirsten johnson,,united states,2021-09-25,2020,pg-13,90 min,documentaries,"as her father nears the end of his life, filmm...",29,90,min
1,ns2,tv show,blood & water,,"ama qamata, khosi ngema, gail mabalane, thaban...",south africa,2021-09-24,2021,tv-ma,2 season,"international tv shows, tv dramas, tv mysteries","after crossing paths at a party, a cape town t...",86,2,season
2,ns3,tv show,ganglands,julien leclercq,"sami bouajila, tracy gotoas, samuel jouy, nabi...",,2021-09-24,2021,tv-ma,1 season,"crime tv shows, international tv shows, tv act...",to protect his family from a powerful drug lor...,75,1,season
3,ns4,tv show,jailbirds new orleans,,,,2021-09-24,2021,tv-ma,1 season,"docuseries, reality tv","feuds, flirtations and toilet talk go down amo...",49,1,season
4,ns5,tv show,kota factory,,"mayur more, jitendra kumar, ranjan raj, alam k...",india,2021-09-24,2021,tv-ma,2 season,"international tv shows, romantic tv shows, tv ...",in a city of coaching centers known to train i...,92,2,season
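
As an alternative to replacing "seasons" and then splitting on whitespace, the number and the unit can be captured in one pass with a regular expression. This is a sketch on made-up values, not the method used above. It keeps the notebook's convention of treating a missing duration as "0 min", and the pattern assumes the unit is always `min` or `season(s)`.

```python
import pandas as pd

# Made-up durations, including a plural unit and a missing value.
durations = pd.Series(["90 min", "2 Seasons", "1 season", None],
                      name="duration")

normalized = durations.fillna("0 min").str.lower()
# Named groups become the column names; a trailing "s" is simply not captured.
parts = normalized.str.extract(
    r"^(?P<duration_int>\d+)\s+(?P<duration_type>min|season)"
)
parts["duration_int"] = pd.to_numeric(parts["duration_int"]).astype(int)

print(parts)
```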


## <center>  *What do we get out of the EDA and ETL?*</center>
#### - The EDA is a first approximation to the data. ***Note***: if we are reasonably well prepared and the data sample is “sufficient”, within “a few hours” we may already have several conclusions, for example:
#### - At this point we can tell whether what we are being asked for is viable or whether we need more data to get started.
#### - The EDA should take hours, or perhaps a day; the idea is to reach some quick conclusions so we can tell the client whether or not we can go ahead with their proposal.

# *Result:*
### - The assigned transformations were carried out.
 

### We concatenate the 4 datasets

In [68]:
all_app = pd.concat([amazon, disney, hulu, netflix], axis=0)
all_app.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,score,duration_int,duration_type
0,as1,movie,the grand seduction,don mckellar,"brendan gleeson, taylor kitsch, gordon pinsent",canada,2021-03-30,2014,g,113 min,"comedy, drama",a small fishing village must procure a local d...,99,113,min
1,as2,movie,take care good night,girish joshi,"mahesh manjrekar, abhay mahajan, sachin khedekar",india,2021-03-30,2018,13+,110 min,"drama, international",a metro family decides to fight a cyber crimin...,37,110,min
2,as3,movie,secrets of deception,josh webber,"tom sizemore, lorenzo lamas, robert lasardo, r...",united states,2021-03-30,2017,g,74 min,"action, drama, suspense",after a man discovers his wife is cheating on ...,20,74,min
3,as4,movie,pink: staying true,sonia anderson,"interviews with: pink, adele, beyoncé, britney...",united states,2021-03-30,2014,g,69 min,documentary,"pink breaks the mold once again, bringing her ...",27,69,min
4,as5,movie,monster maker,giles foster,"harry dean stanton, kieran o'brien, george cos...",united kingdom,2021-03-30,1989,g,45 min,"drama, fantasy",teenage matt banting wants to work with a famo...,75,45,min
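
When frames are stacked with `pd.concat`, each one keeps its own row labels unless `ignore_index=True` is passed, so labels such as 0, 1, 2 repeat once per platform. A small sketch with two made-up frames (the `_demo` names are illustrative only):

```python
import pandas as pd

# Two tiny made-up frames standing in for the per-platform datasets.
amazon_demo = pd.DataFrame({"show_id": ["as1", "as2"],
                            "title": ["the grand seduction",
                                      "take care good night"]})
netflix_demo = pd.DataFrame({"show_id": ["ns1"],
                             "title": ["dick johnson is dead"]})

# Keeps each frame's own labels: 0, 1, 0.
stacked = pd.concat([amazon_demo, netflix_demo], axis=0)
# Renumbers the rows from 0 to n-1.
combined = pd.concat([amazon_demo, netflix_demo], axis=0, ignore_index=True)

print(stacked.index.tolist(), combined.index.tolist())
```

Because the ids carry the platform prefix, `show_id` remains unique across the combined frame even though the original `s1`, `s2`, ... values overlap between platforms.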


### Saving the data in *'CSV'* format

In [70]:
all_app.to_csv('C:/ProgramData/MySQL/MySQL Server 8.0/Uploads/platforms_final.csv', index=False, encoding='utf-8-sig')
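
A quick sanity check of the five requirements can help catch regressions around the export. The sketch below is only an illustration: `check_requirements` is a hypothetical helper, the sample rows are made up, and the id pattern assumes ids look like a platform letter followed by "s" and digits (for example "ns1").

```python
import pandas as pd


def check_requirements(df: pd.DataFrame) -> dict:
    """Return one True/False flag per transformation requirement."""
    dates = df["date_added"].dropna().astype(str)
    return {
        # Assumed id shape: platform letter + "s" + digits, e.g. "ns1".
        "id": bool(df["show_id"].astype(str).str.match(r"^[a-z]s\d+$").all()),
        # No missing ratings should remain.
        "rating": bool(df["rating"].notna().all()),
        # Non-missing dates must look like YYYY-mm-dd.
        "date": bool(dates.str.match(r"^\d{4}-\d{2}-\d{2}$").all()),
        # Every string cell must equal its lowercase form.
        "lowercase": all(
            value == value.lower()
            for column in df.columns
            for value in df[column]
            if isinstance(value, str)
        ),
        # Only the two allowed duration units.
        "duration": bool(df["duration_type"].isin(["min", "season"]).all()),
    }


# Made-up rows shaped like the final dataset.
sample = pd.DataFrame({
    "show_id": ["as1", "ns2"],
    "title": ["the grand seduction", "blood & water"],
    "rating": ["g", "tv-ma"],
    "date_added": ["2021-03-30", "2021-09-24"],
    "duration_int": [113, 2],
    "duration_type": ["min", "season"],
})
report = check_requirements(sample)
print(report)
```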