
## Data type conversion
One of the main reasons to convert data types is efficiency: choosing an appropriate type for each column can reduce memory usage and improve performance.

Let's see how to apply this in Pandas and what improvements it yields.
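
As a rough illustration of the potential savings, the sketch below builds a small synthetic DataFrame and compares its memory footprint before and after choosing narrower dtypes. The column names and values are made up for this example and are not taken from the Netflix dataset.

```python
import pandas as pd

# Synthetic data, used only for illustration
n = 10_000
raw = pd.DataFrame({
    # Low-cardinality text stored as generic Python objects
    "kind": pd.Series(["Movie", "TV Show"] * (n // 2), dtype="object"),
    # Small integers stored in 64 bits
    "year": pd.Series([2019] * n, dtype="int64"),
})

# Same values, narrower representations
optimized = raw.astype({"kind": "category", "year": "int16"})

before = raw.memory_usage(deep=True).sum()
after = optimized.memory_usage(deep=True).sum()
print(f"before: {before} bytes, after: {after} bytes")
```

The exact numbers depend on the pandas version, but the categorical and 16-bit columns should take noticeably less space than their object and 64-bit counterparts.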

In [None]:
import pandas as pd

In [None]:
df = pd.read_excel("https://public.tableau.com/app/sample-data/netflix_titles.xlsx")
df.head()

,duration_minutes,duration_seasons,type,title,date_added,release_year,rating,description,show_id
0,90.0,,Movie,Norm of the North: King Sized Adventure,2019-09-09 00:00:00,2019.0,TV-PG,Before planning an awesome wedding for his gra...,81145628.0
1,94.0,,Movie,Jandino: Whatever it Takes,2016-09-09 00:00:00,2016.0,TV-MA,Jandino Asporaat riffs on the challenges of ra...,80117401.0
2,,1.0,TV Show,Transformers Prime,2018-09-08 00:00:00,2013.0,TV-Y7-FV,"With the help of three human allies, the Autob...",70234439.0
3,,1.0,TV Show,Transformers: Robots in Disguise,2018-09-08 00:00:00,2016.0,TV-Y7,When a prison ship crash unleashes hundreds of...,80058654.0
4,99.0,,Movie,#realityhigh,2017-09-08 00:00:00,2017.0,TV-14,When nerdy high schooler Dani finally attracts...,80125979.0


Let's pretend we have just extracted the data from some source and stored it raw, without any transformation, in the bronze layer of the data lake.

First, let's check how much memory the DataFrame uses before transforming it.


In [None]:
# Look at the last line of the output, "memory usage"
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6236 entries, 0 to 6235
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   duration_minutes  4267 non-null   object 
 1   duration_seasons  1971 non-null   object 
 2   type              6235 non-null   object 
 3   title             6235 non-null   object 
 4   date_added        6223 non-null   object 
 5   release_year      6234 non-null   float64
 6   rating            6223 non-null   object 
 7   description       6233 non-null   object 
 8   show_id           6232 non-null   float64
dtypes: float64(2), object(7)
memory usage: 3.4 MB


### Handling nulls
Before converting data types, it is important to check whether the dataset contains null values. If it does, we must decide whether to drop them, fill them with a default value, or impute them with a computed value. This matters because null values can make a type conversion fail or silently force a wider type.
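
The three options above can be sketched on a toy DataFrame. The values here are hypothetical; the last step shows one way a null can break a conversion, since casting a column that contains NaN to an integer type raises an error.

```python
import pandas as pd

# Toy data (hypothetical values) with a missing id and a missing duration
toy = pd.DataFrame({
    "show_id": [1.0, None, 3.0],
    "duration_minutes": [90.0, 100.0, None],
})

# Option 1: drop the rows whose key column is missing
dropped = toy.dropna(subset=["show_id"])

# Option 2: fill with a fixed default (sentinel) value
filled = toy.fillna({"duration_minutes": -1})

# Option 3: impute with a value computed from the data, here the column mean
imputed = toy.fillna({"duration_minutes": toy["duration_minutes"].mean()})

print(len(dropped))
print(filled["duration_minutes"].tolist())
print(imputed["duration_minutes"].tolist())

# Casting a column that still contains NaN to an integer dtype fails
try:
    toy["duration_minutes"].astype("int16")
    cast_failed = False
except ValueError as exc:
    cast_failed = True
    print("cast failed:", exc)
```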

This dataset has 6236 records in total. However, the `show_id` column has only 6232 non-null values, which tells us that 4 records have a null id.

Let's verify it.

In [None]:
df.show_id.isnull().sum()

4

Records with a null id are of no use, so we drop them.

In [None]:
df = df.dropna(subset=['show_id'])

In [None]:
df.show_id.isnull().sum()

0

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6232 entries, 0 to 6235
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   duration_minutes  4263 non-null   object 
 1   duration_seasons  1969 non-null   object 
 2   type              6232 non-null   object 
 3   title             6232 non-null   object 
 4   date_added        6221 non-null   object 
 5   release_year      6232 non-null   float64
 6   rating            6222 non-null   object 
 7   description       6232 non-null   object 
 8   show_id           6232 non-null   float64
dtypes: float64(2), object(7)
memory usage: 486.9+ KB


For practical purposes, the other columns are imputed here with a default value. In a real project, each column should be analyzed individually.

In [None]:
imputation_mapping = {
    "duration_minutes": -1,
    "duration_seasons": -1,
    "date_added": "1900-01-01 00:00:00"
}

df = df.fillna(imputation_mapping)

With the cleanup finished, we can apply the data type conversion.

In [None]:
conversion_mapping = {
    # int8 only holds -128..127, and some movies run longer than 127 minutes
    "duration_minutes": "int16",
    "duration_seasons": "int8",
    "date_added": "datetime64[ns]",
    "release_year": "int16",
    "type": "category",
    "rating": "category",
    "title": "string"
}

df = df.astype(conversion_mapping)

In [None]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6232 entries, 0 to 6235
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   duration_minutes  6232 non-null   int8          
 1   duration_seasons  6232 non-null   int8          
 2   type              6232 non-null   category      
 3   title             6232 non-null   string        
 4   date_added        6221 non-null   datetime64[ns]
 5   release_year      6232 non-null   int16         
 6   rating            6222 non-null   category      
 7   description       6232 non-null   object        
 8   show_id           6232 non-null   float64       
dtypes: category(2), datetime64[ns](1), float64(1), int16(1), int8(2), object(1), string(1)
memory usage: 2.0 MB
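
Narrow integer types only pay off if every value fits in them: `int8`, for example, covers -128 to 127. A minimal sketch of checking the range before casting, using `numpy.iinfo`; the helper name `fits_in` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

def fits_in(series: pd.Series, dtype: str) -> bool:
    """Return True if every value of `series` lies within the range of the integer `dtype`."""
    info = np.iinfo(dtype)
    return bool(series.min() >= info.min and series.max() <= info.max)

# Made-up values, including the -1 sentinel used for imputation
minutes = pd.Series([-1, 90, 131, 200])
seasons = pd.Series([-1, 1, 2, 9])

print(fits_in(minutes, "int8"))   # False: 131 and 200 exceed 127
print(fits_in(minutes, "int16"))  # True
print(fits_in(seasons, "int8"))   # True

# pandas can also pick the smallest integer type that fits
smallest = pd.to_numeric(minutes, downcast="integer")
print(smallest.dtype)  # int16
```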


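Once all the casting and cleanup is done, the result should be written back to the data lake, for example as Parquet. Since the bronze layer holds the raw data, the transformed version would normally go to the next layer (often called silver).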

`df.to_parquet(...)`

### Summarization

In [None]:
df.groupby(
    ['type', 'rating']
    ).agg(
        {
            'duration_minutes': 'mean',
            'duration_seasons': 'mean',
            'show_id': 'count'
        }
    ).rename(
        columns={
            'duration_minutes': 'mean_duration_minutes',
            'duration_seasons': 'mean_duration_seasons',
            'show_id': 'count_show_id'
        }
    )

type,rating,mean_duration_minutes,mean_duration_seasons,count_show_id
Movie,G,,,0
Movie,NC-17,,,0
Movie,NR,,,0
Movie,PG,,,0
Movie,PG-13,,,0
Movie,R,,,0
Movie,TV-14,,,0
Movie,TV-G,,,0
Movie,TV-MA,,,0
Movie,TV-PG,,,0
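
Rows with a count of 0 can appear when the grouping keys are categorical: depending on the `observed` setting, pandas emits every combination of categories, including combinations that never occur in the data. A small sketch with made-up data, passing `observed` explicitly because its default differs between pandas versions:

```python
import pandas as pd

# Made-up data: the combination ("TV Show", "R") never occurs
toy = pd.DataFrame({
    "type": pd.Categorical(["Movie", "Movie", "TV Show"]),
    "rating": pd.Categorical(["PG", "R", "PG"]),
    "show_id": [1, 2, 3],
})

# observed=False: one row per combination of categories, seen or not
all_combos = toy.groupby(["type", "rating"], observed=False)["show_id"].count()

# observed=True: only the combinations present in the data
seen_only = toy.groupby(["type", "rating"], observed=True)["show_id"].count()

print(len(all_combos))  # 4
print(len(seen_only))   # 3
```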


In [None]:
pd.pivot_table(
    df,
    index='type',
    columns='rating',
    values=['duration_minutes', 'show_id'],
    aggfunc=['mean', 'count']
)

,mean,mean,mean,mean,mean,mean,mean,mean,mean,mean,...,count,count,count,count,count,count,count,count,count,count
,duration_minutes,duration_minutes,duration_minutes,duration_minutes,duration_minutes,duration_minutes,duration_minutes,duration_minutes,duration_minutes,duration_minutes,...,show_id,show_id,show_id,show_id,show_id,show_id,show_id,show_id,show_id,show_id
rating,G,NC-17,NR,PG,PG-13,R,TV-14,TV-G,TV-MA,TV-PG,...,PG-13,R,TV-14,TV-G,TV-MA,TV-PG,TV-Y,TV-Y7,TV-Y7-FV,UR
type,,,,,,,,,,,,,,,,,,,,,
Movie,85.361111,131.5,95.123762,97.786885,108.853147,106.158103,108.137765,67.8375,95.301411,98.083527,...,286,506,1038,80,1347,431,41,69,27,7
TV Show,-1.0,,-1.0,-1.0,,-1.0,-1.0,-1.0,-1.0,-1.0,...,0,2,660,69,679,269,102,100,68,0
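
The -1.0 means for TV shows come from the sentinel value used during imputation, and the same sentinel would distort any average it is mixed into. One way to keep it out of aggregates, sketched here on made-up data, is to turn it back into a missing value before averaging:

```python
import pandas as pd

# Made-up data; -1 marks "no duration in minutes"
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show"],
    "duration_minutes": [90, 110, -1],
})

# Replace the sentinel with NaN so that mean() skips it
masked = toy["duration_minutes"].mask(toy["duration_minutes"] == -1)
means = masked.groupby(toy["type"]).mean()

print(means["Movie"])    # 100.0
print(means["TV Show"])  # nan: no real durations in this group
```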
