# Practical Assignment 3: Data Cleaning
**Dataset:** Airbnb Listings – New York City 2019  
**Source:** [Kaggle - NYC Airbnb Open Data](https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data)

This assignment **cleans the 2019 New York City Airbnb dataset**. The goal is to identify and correct inconsistencies, missing values, duplicates, and other common data problems so that the dataset is ready for exploratory analysis or later modeling.

## 1. Loading and initial exploration of the dataset

In [4]:
import pandas as pd

# Load the dataset (pandas infers the zip compression from the file extension)
df = pd.read_csv("C:/Users/JOSE DANIEL/Documents/6to semestre/ciencia de datos 1/AB_NYC_2019.csv.zip")

# Preview the first rows
df.head()

| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.9419 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
| 3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
| 4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.1 | 1 | 0 |


In [5]:
# General information: column types and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     

In [6]:
# Descriptive statistics for numeric and non-numeric columns
df.describe(include='all')

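| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 48895.0 | 48879 | 48895.0 | 48874 | 48895 | 48895 | 48895.0 | 48895.0 | 48895 | 48895.0 | 48895.0 | 48895.0 | 38843 | 38843.0 | 48895.0 | 48895.0 |
| unique | NaN | 47905 | NaN | 11452 | 5 | 221 | NaN | NaN | 3 | NaN | NaN | NaN | 1764 | NaN | NaN | NaN |
| top | NaN | Hillside Hotel | NaN | Michael | Manhattan | Williamsburg | NaN | NaN | Entire home/apt | NaN | NaN | NaN | 2019-06-23 | NaN | NaN | NaN |
| freq | NaN | 18 | NaN | 417 | 21661 | 3920 | NaN | NaN | 25409 | NaN | NaN | NaN | 1413 | NaN | NaN | NaN |
| mean | 19017140.0 | NaN | 67620010.0 | NaN | NaN | NaN | 40.728949 | -73.95217 | NaN | 152.720687 | 7.029962 | 23.274466 | NaN | 1.373221 | 7.143982 | 112.781327 |
| std | 10983110.0 | NaN | 78610970.0 | NaN | NaN | NaN | 0.05453 | 0.046157 | NaN | 240.15417 | 20.51055 | 44.550582 | NaN | 1.680442 | 32.952519 | 131.622289 |
| min | 2539.0 | NaN | 2438.0 | NaN | NaN | NaN | 40.49979 | -74.24442 | NaN | 0.0 | 1.0 | 0.0 | NaN | 0.01 | 1.0 | 0.0 |
| 25% | 9471945.0 | NaN | 7822033.0 | NaN | NaN | NaN | 40.6901 | -73.98307 | NaN | 69.0 | 1.0 | 1.0 | NaN | 0.19 | 1.0 | 0.0 |
| 50% | 19677280.0 | NaN | 30793820.0 | NaN | NaN | NaN | 40.72307 | -73.95568 | NaN | 106.0 | 3.0 | 5.0 | NaN | 0.72 | 1.0 | 45.0 |
| 75% | 29152180.0 | NaN | 107434400.0 | NaN | NaN | NaN | 40.763115 | -73.936275 | NaN | 175.0 | 5.0 | 24.0 | NaN | 2.02 | 2.0 | 227.0 |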


## 2. Identifying issues
The initial exploration reveals the following issues:
1. Missing values: `last_review` and `reviews_per_month` are each missing in 10,052 rows (38,843 non-null out of 48,895), `host_name` in 21, and `name` in 16.
2. Possible duplicates (listings recorded more than once).
3. Out-of-range values in `price` (the minimum is 0) or `minimum_nights`.
4. Incorrect data types (for example, `last_review` dates stored as text).
5. Inconsistent text or categories (stray whitespace, mixed upper/lower case).
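
Each item in the list above can be measured before anything is changed. The following is a minimal sketch on a hypothetical miniature sample (the rows and the 365-night threshold are assumptions for illustration, not values from the notebook); the same expressions would apply to `df`.

```python
import pandas as pd

# Hypothetical miniature sample that mimics some of the dataset's columns
sample = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "name": ["Loft", "Room", "Room", None],
    "neighbourhood_group": [" brooklyn", "Manhattan", "Manhattan", "MANHATTAN"],
    "price": [120, 0, 0, 15000],
    "minimum_nights": [2, 1, 1, 1250],
    "last_review": ["2019-05-21", "2018-10-19", "2018-10-19", None],
})

# Issue 1: missing values per column
null_counts = sample.isnull().sum()

# Issue 2: rows that are exact copies of an earlier row
n_duplicates = int(sample.duplicated().sum())

# Issue 3: zero prices and minimum stays longer than a year (assumed threshold)
n_zero_price = int((sample["price"] <= 0).sum())
n_long_stay = int((sample["minimum_nights"] > 365).sum())

# Issue 4: dates stored as text are not recognised as a datetime column
dates_parsed = pd.api.types.is_datetime64_any_dtype(sample["last_review"])

# Issue 5: category labels that differ only by whitespace or letter case
raw_categories = sample["neighbourhood_group"].nunique()
clean_categories = sample["neighbourhood_group"].str.strip().str.title().nunique()

print(null_counts)
print(n_duplicates, n_zero_price, n_long_stay, dates_parsed)
print(raw_categories, clean_categories)
```

A gap between `raw_categories` and `clean_categories` indicates labels that only differ in formatting.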

## 3. Step-by-step data cleaning

In [7]:
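# 3.1 Handling missing values
print(df.isnull().sum())  # count the missing values in each column

# Fill missing review rates with 0; assigning the result back avoids
# the chained in-place call, which newer pandas versions warn about
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)
# Drop the few rows that lack a listing name or a host name
df.dropna(subset=['name', 'host_name'], inplace=True)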
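
Filling `reviews_per_month` with 0 is only meaningful if the missing values belong to listings that have no reviews. The sketch below shows one way to check that before imputing; the sample rows are hypothetical and the check itself is an addition, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: listings without reviews have no monthly review rate
sample = pd.DataFrame({
    "number_of_reviews": [9, 0, 45, 0],
    "reviews_per_month": [0.21, np.nan, 0.38, np.nan],
})

# A missing rate should occur exactly where there are no reviews
missing_rate = sample["reviews_per_month"].isna()
no_reviews = sample["number_of_reviews"] == 0
consistent = bool((missing_rate == no_reviews).all())

# Impute with 0 only when the assumption holds
if consistent:
    sample["reviews_per_month"] = sample["reviews_per_month"].fillna(0)

print(consistent)
print(sample)
```

If `consistent` turned out to be `False` on the real data, a different imputation strategy would be worth considering.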

In [8]:
# 3.2 Removing duplicates (drops only rows that are identical in every column)
df.drop_duplicates(inplace=True)
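
Because `drop_duplicates()` without arguments compares every column, a listing that was recorded twice under different `id` values would not be caught. A possible refinement, sketched here on hypothetical rows, is to compare a subset of columns; the choice of `key_columns` is an assumption and should be adapted to the analysis.

```python
import pandas as pd

# Hypothetical rows: ids 10 and 11 describe the same listing
sample = pd.DataFrame({
    "id": [10, 11, 12],
    "host_id": [1, 1, 2],
    "name": ["Sunny loft", "Sunny loft", "Quiet room"],
    "latitude": [40.70, 40.70, 40.80],
    "longitude": [-73.95, -73.95, -73.94],
})

# Comparing all columns finds nothing, because the ids differ
exact = sample.drop_duplicates()

# Comparing only the columns that describe the listing (assumed key)
key_columns = ["host_id", "name", "latitude", "longitude"]
deduped = sample.drop_duplicates(subset=key_columns, keep="first")

print(len(exact), len(deduped))
```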

In [9]:
# 3.3 Correcting data types (missing dates become NaT)
df['last_review'] = pd.to_datetime(df['last_review'])
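
When a date column may contain malformed entries, `pd.to_datetime` accepts an explicit `format` and `errors="coerce"`, which turns unparseable values into `NaT` instead of raising. This is a sketch on a hypothetical series; the `%Y-%m-%d` format is assumed from the dates shown in the preview.

```python
import pandas as pd

# Hypothetical raw values: two valid dates, one missing, one malformed
raw = pd.Series(["2018-10-19", "2019-05-21", None, "not a date"])

# Parse with an explicit format; invalid entries become NaT
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Values lost to coercion (excluding those that were already missing)
n_unparsed = int(parsed.isna().sum() - raw.isna().sum())

print(parsed)
print(n_unparsed)
```

Counting `n_unparsed` makes it visible when coercion silently discards data.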

In [10]:
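# 3.4 Filtering out-of-range values
# .copy() makes the result an independent DataFrame, so later column
# assignments do not trigger SettingWithCopyWarning
df = df[(df['price'] > 0) & (df['price'] < 1000)].copy()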
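
The filter above covers `price` only, although `minimum_nights` was also flagged as a candidate for out-of-range values. A sketch of how both conditions could be combined follows; the 365-night cap is an assumed threshold chosen for illustration, not a value used in the notebook.

```python
import pandas as pd

# Hypothetical rows covering valid and out-of-range cases
sample = pd.DataFrame({
    "price": [149, 0, 225, 9999, 80],
    "minimum_nights": [1, 3, 1250, 2, 10],
})

MAX_PRICE = 1000       # same upper bound as the price filter
MAX_MIN_NIGHTS = 365   # assumption: stays longer than a year are errors

# Build one boolean mask so each rule stays easy to read
valid_price = (sample["price"] > 0) & (sample["price"] < MAX_PRICE)
valid_nights = sample["minimum_nights"] <= MAX_MIN_NIGHTS

filtered = sample[valid_price & valid_nights].copy()
n_removed = len(sample) - len(filtered)

print(filtered)
print(n_removed)
```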

In [11]:
# 3.5 Cleaning text and categories (trim whitespace, use title case)
df['neighbourhood_group'] = df['neighbourhood_group'].str.strip().str.title()
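
After normalizing, it can help to compare the resulting labels against the expected set of categories. The sketch below uses hypothetical raw values; `EXPECTED_GROUPS` is an assumption based on the five New York City boroughs, matching the five unique `neighbourhood_group` values reported by `describe()`.

```python
import pandas as pd

# Assumed set of valid labels: the five New York City boroughs
EXPECTED_GROUPS = {"Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"}

# Hypothetical raw labels with stray whitespace and inconsistent case
raw = pd.Series([" brooklyn", "MANHATTAN ", "staten island", "Queens", "Bronx"])

# Same normalization as in step 3.5
clean = raw.str.strip().str.title()

# Any label left outside the expected set needs manual review
unexpected = set(clean.unique()) - EXPECTED_GROUPS

print(clean.tolist())
print(unexpected)
```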

## 4. Verifying the cleaning

In [12]:
df.info()
df.describe()
df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
Index: 48549 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   id                              48549 non-null  int64         
 1   name                            48549 non-null  object        
 2   host_id                         48549 non-null  int64         
 3   host_name                       48549 non-null  object        
 4   neighbourhood_group             48549 non-null  object        
 5   neighbourhood                   48549 non-null  object        
 6   latitude                        48549 non-null  float64       
 7   longitude                       48549 non-null  float64       
 8   room_type                       48549 non-null  object        
 9   price                           48549 non-null  int64         
 10  minimum_nights                  48549 non-null  int64         
 11  number_

id                                   0
name                                 0
host_id                              0
host_name                            0
neighbourhood_group                  0
neighbourhood                        0
latitude                             0
longitude                            0
room_type                            0
price                                0
minimum_nights                       0
number_of_reviews                    0
last_review                       9875
reviews_per_month                    0
calculated_host_listings_count       0
availability_365                     0
dtype: int64
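
Besides inspecting the output by eye, the cleaning rules can be expressed as automated checks. The following is a sketch with a hypothetical helper, `check_clean`, run on a few illustrative rows; it is not part of the original notebook.

```python
import pandas as pd

def check_clean(df):
    """Return a list of problems found; an empty list means all checks pass."""
    problems = []
    if df[["name", "reviews_per_month"]].isnull().any().any():
        problems.append("unexpected missing values")
    if df.duplicated().any():
        problems.append("duplicate rows")
    if not df["price"].between(1, 999).all():
        problems.append("price outside (0, 1000)")
    if not pd.api.types.is_datetime64_any_dtype(df["last_review"]):
        problems.append("last_review is not a datetime column")
    return problems

# Hypothetical cleaned rows; a missing last_review is tolerated
cleaned = pd.DataFrame({
    "id": [2539, 2595, 3647],
    "name": ["A", "B", "C"],
    "price": [149, 225, 150],
    "reviews_per_month": [0.21, 0.38, 0.0],
    "last_review": pd.to_datetime(["2018-10-19", "2019-05-21", None]),
})

print(check_clean(cleaned))
```

Returning a list instead of raising on the first failure shows every violated rule in a single run.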

## 5. Saving the cleaned dataset

In [13]:
df.to_csv('AB_NYC_2019_clean.csv', index=False)

In [None]:
# View the cleaned dataset
df_clean = pd.read_csv('AB_NYC_2019_clean.csv')

df_clean.head()

df_clean.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48549 entries, 0 to 48548
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48549 non-null  int64  
 1   name                            48549 non-null  object 
 2   host_id                         48549 non-null  int64  
 3   host_name                       48549 non-null  object 
 4   neighbourhood_group             48549 non-null  object 
 5   neighbourhood                   48549 non-null  object 
 6   latitude                        48549 non-null  float64
 7   longitude                       48549 non-null  float64
 8   room_type                       48549 non-null  object 
 9   price                           48549 non-null  int64  
 10  minimum_nights                  48549 non-null  int64  
 11  number_of_reviews               48549 non-null  int64  
 12  last_review                     
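
CSV files do not store column types, so a date column written with `to_csv` comes back as text unless it is parsed again on load. A minimal sketch of the round trip follows, using an in-memory buffer and hypothetical rows in place of the real file.

```python
import io
import pandas as pd

# Hypothetical cleaned rows with a proper datetime column
cleaned = pd.DataFrame({
    "id": [2539, 3647],
    "last_review": pd.to_datetime(["2018-10-19", None]),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# A plain read returns the dates as text
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates restores the datetime type on load
buffer.seek(0)
reloaded = pd.read_csv(buffer, parse_dates=["last_review"])

print(plain.dtypes)
print(reloaded.dtypes)
```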

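## 6. Conclusion
After cleaning, the dataset goes from 48,895 to 48,549 rows and is ready for exploratory analysis and modeling. Missing values in `reviews_per_month` were filled with 0, rows without `name` or `host_name` were dropped, exact duplicates were removed, `last_review` was converted to a date type, listings priced at 0 or at 1,000 and above were filtered out, and `neighbourhood_group` labels were normalized. `last_review` still has 9,875 missing values, which were left in place, so later analyses should handle them explicitly.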