# Data cleaning
To clean the data we follow a systematic process that addresses both the problems observed during exploration and the problems common in raw data. Data cleaning is crucial because it ensures that the subsequent analysis is accurate and meaningful.

### Importing libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Loading the DataFrame

In [2]:
df = pd.read_csv("../data/processed/combined_data.csv")

## Identifying inconsistencies

In [3]:
# Null values per column
df.isnull().sum()

Unnamed: 0                             0
id                                     0
Gender                                 0
Customer Type                          0
Age                                    0
Type of Travel                         0
Class                                  0
Flight Distance                        0
Inflight wifi service                  0
Departure/Arrival time convenient      0
Ease of Online booking                 0
Gate location                          0
Food and drink                         0
Online boarding                        0
Seat comfort                           0
Inflight entertainment                 0
On-board service                       0
Leg room service                       0
Baggage handling                       0
Checkin service                        0
Inflight service                       0
Cleanliness                            0
Departure Delay in Minutes             0
Arrival Delay in Minutes             393
satisfaction                           0
dtype: int64

**Observation:** There are 393 null values in total, all in the "Arrival Delay in Minutes" column, which is approximately 0.30% of the rows in our DataFrame.
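
The percentage quoted above can be reproduced with a short calculation. The sketch below runs on a small hypothetical DataFrame (`toy`) rather than the real data, so its numbers differ; only the column names are taken from the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real DataFrame.
toy = pd.DataFrame({
    "Departure Delay in Minutes": [0, 5, 12, 0],
    "Arrival Delay in Minutes": [0.0, np.nan, 10.0, 3.0],
})

# Count nulls per column, then express them as a share of all rows.
null_counts = toy.isnull().sum()
null_pct = (null_counts / len(toy) * 100).round(2)

# Put both figures side by side for easier reading.
summary = pd.DataFrame({"nulls": null_counts, "percent": null_pct})
print(summary)
```

Applied to `df`, the same expression would give 393 / 129,880 ≈ 0.30% for "Arrival Delay in Minutes", taking the total as the 129,487 rows that remain after cleaning plus the 393 dropped.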

In [4]:
# Duplicate rows
df.duplicated().sum()

0

**Note:** We have verified that the DataFrame contains no duplicate rows.
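
Called without arguments, `duplicated()` only flags rows that match in every column. Because the data carries an `id` column, a stricter check is to look for repeated identifiers. The sketch below uses hypothetical toy data and assumes that `id` is meant to be unique per record, which the source does not state.

```python
import pandas as pd

# Hypothetical toy data: the id 102 appears twice with different ages.
toy = pd.DataFrame({
    "id": [101, 102, 102, 103],
    "Age": [34, 51, 52, 27],
})

# Rows identical in every column (none here, since the ages differ).
full_row_dupes = toy.duplicated().sum()

# Rows whose id repeats an earlier row (assumes id should be unique).
id_dupes = toy.duplicated(subset=["id"]).sum()

print(full_row_dupes, id_dupes)
```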

## Handling null values

In [5]:
# Assign the modified data to a new variable and drop the rows with null values
df_cleaned = df.dropna()

In [6]:
# Confirm that the null values have been removed
df_cleaned.isnull().sum()

Unnamed: 0                           0
id                                   0
Gender                               0
Customer Type                        0
Age                                  0
Type of Travel                       0
Class                                0
Flight Distance                      0
Inflight wifi service                0
Departure/Arrival time convenient    0
Ease of Online booking               0
Gate location                        0
Food and drink                       0
Online boarding                      0
Seat comfort                         0
Inflight entertainment               0
On-board service                     0
Leg room service                     0
Baggage handling                     0
Checkin service                      0
Inflight service                     0
Cleanliness                          0
Departure Delay in Minutes           0
Arrival Delay in Minutes             0
satisfaction                         0
dtype: int64
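
A complementary check is to confirm that `dropna()` removed exactly as many rows as contained a null, and no more. This is a minimal sketch on hypothetical toy data; the variable names `toy`, `toy_cleaned`, `rows_with_nulls` and `rows_removed` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two incomplete rows.
toy = pd.DataFrame({
    "Age": [25, 40, 31, 58],
    "Arrival Delay in Minutes": [4.0, np.nan, 0.0, np.nan],
})

# Number of rows that contain at least one null value.
rows_with_nulls = toy.isnull().any(axis=1).sum()

toy_cleaned = toy.dropna()

# The rows lost should match the rows that had nulls.
rows_removed = len(toy) - len(toy_cleaned)
assert rows_removed == rows_with_nulls
print(f"{rows_removed} rows removed, {len(toy_cleaned)} remaining")
```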

## Correcting data types

In [7]:
# Convert the "Arrival Delay in Minutes" column from float to int.
df_cleaned["Arrival Delay in Minutes"] = df_cleaned["Arrival Delay in Minutes"].astype("int64")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_cleaned["Arrival Delay in Minutes"] = df_cleaned["Arrival Delay in Minutes"].astype("int64")
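
The warning above is raised because pandas treats `df_cleaned` as possibly derived from `df` and cannot guarantee where the assignment lands. One common way to avoid it is to take an explicit copy when the cleaned frame is created. A minimal sketch on hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real DataFrame.
toy = pd.DataFrame({
    "Arrival Delay in Minutes": [4.0, np.nan, 0.0, 18.0],
})

# .copy() makes toy_cleaned an independent DataFrame, so later
# column assignments are unambiguous.
toy_cleaned = toy.dropna().copy()

# Safe to cast now: no NaN values remain in the column.
toy_cleaned["Arrival Delay in Minutes"] = (
    toy_cleaned["Arrival Delay in Minutes"].astype("int64")
)
print(toy_cleaned.dtypes)
```

The cast only works once the nulls are gone, because `astype("int64")` raises an error on a column that still contains NaN.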


In [8]:
# Confirm the data type correction
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 129487 entries, 0 to 129879
Data columns (total 25 columns):
 #   Column                             Non-Null Count   Dtype 
---  ------                             --------------   ----- 
 0   Unnamed: 0                         129487 non-null  int64 
 1   id                                 129487 non-null  int64 
 2   Gender                             129487 non-null  object
 3   Customer Type                      129487 non-null  object
 4   Age                                129487 non-null  int64 
 5   Type of Travel                     129487 non-null  object
 6   Class                              129487 non-null  object
 7   Flight Distance                    129487 non-null  int64 
 8   Inflight wifi service              129487 non-null  int64 
 9   Departure/Arrival time convenient  129487 non-null  int64 
 10  Ease of Online booking             129487 non-null  int64 
 11  Gate location                      129487 non-null  int64

## Handling outliers