# Airbnb NYC Project 🗽

This project builds a simple ETL pipeline with Python, AWS, and Power BI. It uses open Airbnb data for New York City published on Kaggle.

Goal: prepare a clean, reliable dataset for analyzing prices, room types, and availability by neighborhood.

**Technologies:**
- Python (cleaning and preprocessing)
- AWS S3 (storage)
- Power BI (visualization)
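
The notebook does not show the S3 storage step. Below is a minimal sketch of how the cleaned CSV could be uploaded, assuming the `boto3` library and already configured AWS credentials. The function name `upload_clean_csv`, the bucket name, and the object key are hypothetical placeholders, not values from this project.

```python
from pathlib import Path


def upload_clean_csv(local_path, bucket, key, client=None):
    """Upload a local file to S3 and return its s3:// URI.

    `bucket` and `key` are hypothetical placeholders. `client` can be
    injected for testing; otherwise a default boto3 S3 client is created.
    """
    path = Path(local_path)
    if not path.is_file():
        raise FileNotFoundError(f"File not found: {path}")

    if client is None:
        # Imported lazily so the function can be exercised without boto3/AWS
        import boto3

        client = boto3.client("s3")

    # boto3's S3 client exposes upload_file(Filename, Bucket, Key)
    client.upload_file(str(path), bucket, key)
    return f"s3://{bucket}/{key}"


# Example call (assumed bucket and key, requires valid AWS credentials):
# upload_clean_csv("../Output/airbnb_clean.csv", "my-airbnb-bucket", "clean/airbnb_clean.csv")
```

Passing the client as an optional argument keeps the function testable with a stub and avoids creating an AWS session at import time.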


In [2]:
import os
os.getcwd()


'c:\\Users\\octav\\Desktop\\Proyecto Airbnb NY\\ETL'

In [1]:
# Load the dataset
import pandas as pd

# Read the original CSV; the path is relative to the ETL working directory
df = pd.read_csv("../Datasets/AB_NYC_2019.csv")
df.shape


(48895, 16)

In [2]:
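# Explore the data
df.head()      # not displayed: a cell only renders its last expression
df.info()      # prints column dtypes and non-null counts
df.describe()  # summary statistics for the numeric columns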


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     

| | id | host_id | latitude | longitude | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 48895.0 | 48895.0 | 48895.0 | 48895.0 | 48895.0 | 48895.0 | 48895.0 | 38843.0 | 48895.0 | 48895.0 |
| mean | 19017140.0 | 67620010.0 | 40.728949 | -73.95217 | 152.720687 | 7.029962 | 23.274466 | 1.373221 | 7.143982 | 112.781327 |
| std | 10983110.0 | 78610970.0 | 0.05453 | 0.046157 | 240.15417 | 20.51055 | 44.550582 | 1.680442 | 32.952519 | 131.622289 |
| min | 2539.0 | 2438.0 | 40.49979 | -74.24442 | 0.0 | 1.0 | 0.0 | 0.01 | 1.0 | 0.0 |
| 25% | 9471945.0 | 7822033.0 | 40.6901 | -73.98307 | 69.0 | 1.0 | 1.0 | 0.19 | 1.0 | 0.0 |
| 50% | 19677280.0 | 30793820.0 | 40.72307 | -73.95568 | 106.0 | 3.0 | 5.0 | 0.72 | 1.0 | 45.0 |
| 75% | 29152180.0 | 107434400.0 | 40.763115 | -73.936275 | 175.0 | 5.0 | 24.0 | 2.02 | 2.0 | 227.0 |
| max | 36487240.0 | 274321300.0 | 40.91306 | -73.71299 | 10000.0 | 1250.0 | 629.0 | 58.5 | 327.0 | 365.0 |
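
The `df.info()` output shows that some columns have fewer non-null entries than the 48,895 rows (for example `name` and `host_name`), and `describe()` counts only 38,843 values for `reviews_per_month`. As a hedged sketch, a helper such as the hypothetical `missing_summary` below could quantify those gaps. It is illustrated on a small synthetic frame, not on the real dataset.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns with gaps."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Tiny synthetic frame (assumed values, not taken from AB_NYC_2019.csv)
sample = pd.DataFrame({
    "name": ["Loft", None, "Studio", "Room"],
    "host_name": ["Ana", "Luis", None, None],
    "price": [120, 80, 95, 60],
})
print(missing_summary(sample))
```

On the real data, calling `missing_summary(df)` right after loading would list the columns that need attention before any filtering.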


In [6]:
# Data cleaning

import pandas as pd
import csv

# Keep only the columns of interest
cols = [
    'id', 'name', 'host_id', 'neighbourhood_group', 'neighbourhood',
    'latitude', 'longitude', 'room_type', 'price', 'minimum_nights',
    'number_of_reviews', 'availability_365'
]
df = df[cols]

# Drop duplicate rows
df = df.drop_duplicates()

# Drop suspicious prices: zero, or 1000 and above
df = df[(df['price'] > 0) & (df['price'] < 1000)]

# Export the cleaned dataset, quoting every field
df.to_csv("../Output/airbnb_clean.csv", index=False, quoting=csv.QUOTE_ALL)
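
The cleaning steps above can also be wrapped in a reusable function, which makes them easier to check on a small input. This is only a sketch: the function name `clean_listings`, its `min_price` and `max_price` parameters, the `_row` helper, and the synthetic rows are assumptions for illustration. The index reset at the end is an addition that the original cell does not perform.

```python
import pandas as pd

COLS = [
    "id", "name", "host_id", "neighbourhood_group", "neighbourhood",
    "latitude", "longitude", "room_type", "price", "minimum_nights",
    "number_of_reviews", "availability_365",
]


def clean_listings(df: pd.DataFrame, min_price=0, max_price=1000) -> pd.DataFrame:
    """Select columns, drop duplicates, keep rows with min_price < price < max_price."""
    out = df[COLS].drop_duplicates()
    mask = (out["price"] > min_price) & (out["price"] < max_price)
    # reset_index is an addition here so the result has a clean 0..n-1 index
    return out.loc[mask].reset_index(drop=True)


def _row(listing_id, price):
    """Build one synthetic listing (hypothetical values for illustration)."""
    return {
        "id": listing_id, "name": f"Listing {listing_id}", "host_id": 1,
        "neighbourhood_group": "Brooklyn", "neighbourhood": "Williamsburg",
        "latitude": 40.71, "longitude": -73.95, "room_type": "Private room",
        "price": price, "minimum_nights": 1, "number_of_reviews": 0,
        "availability_365": 365,
    }


# One exact duplicate, one zero price, and one extreme price should be removed
sample = pd.DataFrame(
    [_row(1, 100), _row(1, 100), _row(2, 0), _row(3, 10000), _row(4, 999)]
)
cleaned = clean_listings(sample)
print(cleaned[["id", "price"]])
```

Keeping the thresholds as parameters makes it simple to revisit the cutoff of 1000 if the analysis later needs higher-priced listings.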

