![AIRBNB](https://www.stevenridercpa.au/wp-content/uploads/2022/09/airbnb-tax.jpeg)

# Deep Learning Assignment
## Semester 2 - 2023
-------

## Problem

You are given a dataset with information about accommodations listed on AirBnB, along with their prices. The training set is approximately 1.5 GB and the test set approximately 0.5 GB. The dataset has 84 predictor variables, which you may use as you see fit.

The goal is to assign the correct price to each listed accommodation.

Along with the dataset, you are given this notebook, which contains the data-loading script and a baseline model based on a feed-forward architecture.

------

## Assignment

### A) <u>Participation in the Kaggle competition</u>:
The goal of this part is to take part in the Kaggle competition and achieve, at minimum, a Mean Absolute Error below 70 points. [->Link to the competition<-](https://www.kaggle.com/t/69c648e3aa214d1f812bf2314c8d4ffa).

### B) <u>Use of grid search (or an equivalent)</u>:
To search for the best models, run a grid search that is as broad and methodical as possible. We strongly recommend [Weights and Biases](https://wandb.ai/site).
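
As a rough illustration of what an exhaustive grid search does, the sketch below enumerates every combination of a small hyperparameter grid and keeps the best-scoring one. The search space, the `evaluate` function and its toy score are all assumptions made for this example; in a real run `evaluate` would train a model with the given configuration and return its validation MAE, and a tracking tool such as Weights and Biases could log each trial.

```python
from itertools import product

# Hypothetical search space; names and values are illustrative only.
search_space = {
    "learning_rate": [1e-2, 1e-3],
    "hidden_units": [32, 64],
    "batch_size": [128, 256],
}


def evaluate(config):
    """Placeholder scoring function (assumption, not part of the assignment).

    A real version would train a model with `config` and return the
    validation MAE. This one returns a deterministic toy score so the
    sketch runs on its own; lower is better.
    """
    return (
        abs(config["learning_rate"] - 1e-3) * 1000
        + abs(config["hidden_units"] - 64) / 64
        + abs(config["batch_size"] - 128) / 128
    )


def grid_search(space, score_fn):
    """Score every combination in `space`; return the best config, its score, and all results."""
    names = list(space)
    results = []
    for values in product(*(space[name] for name in names)):
        config = dict(zip(names, values))
        results.append((config, score_fn(config)))
    best_config, best_score = min(results, key=lambda item: item[1])
    return best_config, best_score, results


best_config, best_score, results = grid_search(search_space, evaluate)
print(f"{len(results)} configurations tried")
print("best:", best_config, "score:", best_score)
```

The number of trials grows multiplicatively with each added hyperparameter, which is why it helps to log every trial in a methodical way.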

### C) <u>You must also research and implement the following techniques</u>:
#### 1. [Batch Normalization](https://machinelearningmastery.com/how-to-accelerate-learning-of-deep-neural-networks-with-batch-normalization/)
#### 2. [Gradient Normalization and/or Gradient Clipping](https://machinelearningmastery.com/how-to-avoid-exploding-gradients-in-neural-networks-with-gradient-clipping/)
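
To make the two techniques concrete, here is a minimal pure-Python sketch of the arithmetic behind them: the training-time forward pass of batch normalization for a single feature, gradient clipping by L2 norm, and gradient clipping by value. The function names and toy inputs are assumptions for this example. In Keras, the corresponding building blocks are the `BatchNormalization` layer and the `clipnorm` / `clipvalue` optimizer arguments.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale by gamma and shift by beta."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def clip_by_value(grads, limit):
    """Clamp each gradient component to the interval [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]


activations = [55.0, 90.0, 150.0, 999.0]  # toy mini-batch for one unit
normalized = batch_norm(activations)

grads = [3.0, 4.0]  # toy gradient with L2 norm 5
clipped_norm = clip_by_norm(grads, max_norm=1.0)
clipped_value = clip_by_value(grads, limit=3.5)

print(normalized)
print(clipped_norm, clipped_value)
```

Clipping by norm preserves the direction of the gradient and only shortens it, while clipping by value can change the direction because each component is clamped independently.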


In addition, as with every assignment, the submission will be graded on its tidiness, data preprocessing, visualizations, and exploration of alternative techniques.

-------

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
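# Note: each "!" command runs in its own subshell, so this cd does not persist
# and the ls below still lists /content. The working directory is changed
# with os.chdir in a later cell.
! cd "/content/drive/MyDrive/Colab Notebooks/Datasets/obligatorio_DL"
! ls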

drive  sample_data


In [6]:
import os

os.chdir("/content/drive/MyDrive/Colab Notebooks/Datasets/obligatorio_DL")

In [7]:
print(os.getcwd())

/content/drive/MyDrive/Colab Notebooks/Datasets/obligatorio_DL


In [9]:
print(os.listdir())
# os.chdir("./data/imgs")
# print(os.listdir())
# os.chdir("../../")

['public_train_data.csv', 'private_data_to_predict.csv']


## 1. Setup
### 1.1 Imports

In [10]:
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

### 1.2 Setting the seeds

In [11]:
np.random.seed(117)
tf.random.set_seed(117)

## 2. Loading the data

In [15]:
file_path = './public_train_data.csv'
df = pd.read_csv(file_path)

## 3. Exploratory data analysis
### 3.1 Dimensions

In [16]:
df.shape

(326287, 85)

### 3.2 Column information and data types

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 326287 entries, 0 to 326286
Data columns (total 85 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   id                              326287 non-null  int64  
 1   Last Scraped                    326286 non-null  object 
 2   Name                            326018 non-null  object 
 3   Summary                         315651 non-null  object 
 4   Space                           228792 non-null  object 
 5   Description                     326188 non-null  object 
 6   Experiences Offered             326287 non-null  object 
 7   Neighborhood Overview           192513 non-null  object 
 8   Notes                           130729 non-null  object 
 9   Transit                         200649 non-null  object 
 10  Access                          177108 non-null  object 
 11  Interaction                     169193 non-null  object 
 12  House Rules     
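
The non-null counts above already show that several free-text columns are largely empty. As a small worked example, the sketch below turns a few of those counts into a share of missing values per column. The counts are copied from the `df.info()` output; on the loaded DataFrame, `df.isna().mean()` would give the same shares for every column directly.

```python
# Non-null counts copied from the df.info() output above.
total_rows = 326287
non_null = {
    "Summary": 315651,
    "Space": 228792,
    "Neighborhood Overview": 192513,
    "Notes": 130729,
    "Transit": 200649,
    "Access": 177108,
    "Interaction": 169193,
}

# Share of missing values per column: 1 - non_null / total.
missing_share = {col: 1 - count / total_rows for col, count in non_null.items()}

# Print the columns from most to least incomplete.
for col, share in sorted(missing_share.items(), key=lambda item: item[1], reverse=True):
    print(f"{col:<24}{share:6.1%}")
```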

### 3.3 First rows of the dataset

In [18]:
df.head(3)

Unnamed: 0,id,Last Scraped,Name,Summary,Space,Description,Experiences Offered,Neighborhood Overview,Notes,Transit,...,Review Scores Location,Review Scores Value,License,Jurisdiction Names,Cancellation Policy,Calculated host listings count,Reviews per Month,Geolocation,Features,Price
0,0,2017-05-12,Grand Loft in the heart of historic Antwerp,Best location for visiting Antwerp!! Beautiful...,Welcome in Antwerp!! The loft is situated on t...,Best location for visiting Antwerp!! Beautiful...,none,,,,...,10.0,9.0,,,strict,2.0,2.6,"51.21938762207894, 4.4034442505151885","Host Has Profile Pic,Instant Bookable",159.0
1,1,2017-05-03,"CHARMING, CLEAN & COZY BUNGALOW!",Very centrally located and less than 15 min fr...,"Well lit, private entrance with small patio.",Very centrally located and less than 15 min fr...,none,"Quiet. Pretty tree lined streets, safe area.",Has dining table and high back desk chair.,"Uber, bus line and metro link is less than 5 m...",...,,,,"City of Los Angeles, CA",flexible,1.0,,"34.1892692286356, -118.41993491931177","Host Has Profile Pic,Is Location Exact",49.0
2,2,2017-05-09,la casa di maurizio,"nice apartment with view to via veneto , very ...",,"nice apartment with view to via veneto , very ...",none,,,,...,,,,,flexible_new,1.0,,"41.90859623057272, 12.493518028459327","Host Has Profile Pic,Is Location Exact",75.0


### 3.4 Descriptive statistics

In [19]:
df.describe()

Unnamed: 0,id,Host ID,Host Response Rate,Host Listings Count,Host Total Listings Count,Latitude,Longitude,Accommodates,Bathrooms,Bedrooms,...,Review Scores Rating,Review Scores Accuracy,Review Scores Cleanliness,Review Scores Checkin,Review Scores Communication,Review Scores Location,Review Scores Value,Calculated host listings count,Reviews per Month,Price
count,326287.0,326287.0,250845.0,325971.0,325970.0,326287.0,326287.0,326244.0,325300.0,325873.0,...,243160.0,242584.0,242732.0,242378.0,242710.0,242423.0,242347.0,325689.0,246983.0,326287.0
mean,163143.0,32367570.0,93.408264,9.586,9.586026,38.042816,-15.323924,3.270764,1.239482,1.358072,...,92.880063,9.524713,9.326067,9.691416,9.708253,9.468215,9.321031,6.881531,1.486211,138.229041
std,94191.087979,31745720.0,17.536835,57.399711,57.399797,22.910029,70.101677,2.037446,0.574784,0.921763,...,8.569521,0.855361,1.038858,0.731702,0.723143,0.805116,0.906478,42.025986,1.752082,149.790527
min,0.0,19.0,0.0,0.0,0.0,-38.224427,-123.218712,1.0,0.0,0.0,...,20.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,0.01,0.0
25%,81571.5,6869780.0,98.0,1.0,1.0,38.923154,-73.968081,2.0,1.0,1.0,...,90.0,9.0,9.0,10.0,10.0,9.0,9.0,1.0,0.32,55.0
50%,163143.0,21867370.0,100.0,1.0,1.0,42.304549,0.090277,2.0,1.0,1.0,...,95.0,10.0,10.0,10.0,10.0,10.0,10.0,1.0,0.89,90.0
75%,244714.5,47991660.0,100.0,3.0,3.0,50.863658,12.342749,4.0,1.0,2.0,...,100.0,10.0,10.0,10.0,10.0,10.0,10.0,2.0,2.04,150.0
max,326286.0,135088500.0,100.0,1114.0,1114.0,55.994889,153.637837,18.0,8.0,96.0,...,100.0,10.0,10.0,10.0,10.0,10.0,10.0,752.0,223.0,999.0
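
One way to read the `Price` column of this summary is to compare its extremes against the interquartile range. The sketch below applies the common 1.5 × IQR rule of thumb to the quartiles reported above; the choice of that rule is an assumption for illustration, not something the assignment prescribes.

```python
# Price statistics copied from the df.describe() output above.
q1, q3 = 55.0, 150.0
price_min, price_max = 0.0, 999.0

# Interquartile range and the conventional 1.5 * IQR fences.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"IQR: {iqr}, fences: [{lower_fence}, {upper_fence}]")
print("max price beyond upper fence:", price_max > upper_fence)
print("min price below lower fence:", price_min < lower_fence)
```

Under this rule the maximum price lies far above the upper fence, which hints at a long right tail that may be worth inspecting during preprocessing.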


In [20]:
df.columns

Index(['id', 'Last Scraped', 'Name', 'Summary', 'Space', 'Description',
       'Experiences Offered', 'Neighborhood Overview', 'Notes', 'Transit',
       'Access', 'Interaction', 'House Rules', 'Thumbnail Url', 'Medium Url',
       'Picture Url', 'XL Picture Url', 'Host ID', 'Host URL', 'Host Name',
       'Host Since', 'Host Location', 'Host About', 'Host Response Time',
       'Host Response Rate', 'Host Acceptance Rate', 'Host Thumbnail Url',
       'Host Picture Url', 'Host Neighbourhood', 'Host Listings Count',
       'Host Total Listings Count', 'Host Verifications', 'Street',
       'Neighbourhood', 'Neighbourhood Cleansed',
       'Neighbourhood Group Cleansed', 'City', 'State', 'Zipcode', 'Market',
       'Smart Location', 'Country Code', 'Country', 'Latitude', 'Longitude',
       'Property Type', 'Room Type', 'Accommodates', 'Bathrooms', 'Bedrooms',
       'Beds', 'Bed Type', 'Amenities', 'Square Feet', 'Security Deposit',
       'Cleaning Fee', 'Guests Included', 'Extra Peop

## 4. Baseline model

### 4.1 Select relevant features

In [21]:
features = ['Bathrooms', 'Bedrooms']  # Replace with the relevant features
target = 'Price'
# Select the columns and drop rows with missing values in a single step;
# this returns a new DataFrame and avoids pandas' SettingWithCopyWarning.
df = df[[*features, target]].dropna()


In [22]:
X = df[features]
y = df[target]

### 4.2 Split the data into training and test sets

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### 4.3 Define the model

In [24]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense


model = Sequential([
    Dense(32, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(1, activation='relu')  # Output layer for the price prediction
])

model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mae'])

### 4.4 Train

In [25]:
history = model.fit(X_train, y_train, epochs=5, batch_size=128, validation_split=0.2)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


### 4.5 Evaluate on the test set

In [None]:
loss, mae = model.evaluate(X_test, y_test)
print(f'Test Loss: {loss}, Test MAE: {mae}')

Test Loss: 41625.078125, Test MAE: 138.3116455078125
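
A quick sanity check on these numbers is to compare them with what a model that always predicts 0 would score. The sketch below does that arithmetic using the `Price` mean and standard deviation from the `df.describe()` output. It is only an approximation: those statistics were computed on the full dataset before rows were dropped and before the train/test split, and `describe()` reports the sample standard deviation.

```python
# Price statistics copied from the df.describe() output; the reported
# metrics are copied from the model.evaluate() output above.
price_mean = 138.229041
price_std = 149.790527
reported_mse = 41625.078125
reported_mae = 138.3116455078125

# For a model that always predicts 0 on non-negative prices:
#   MAE = mean(|y|) = mean(y)
#   MSE = mean(y^2) = mean(y)^2 + var(y)
mae_zero = price_mean
mse_zero = price_mean ** 2 + price_std ** 2

print(f"zero-predictor MAE ~ {mae_zero:.2f} vs reported {reported_mae:.2f}")
print(f"zero-predictor MSE ~ {mse_zero:.2f} vs reported {reported_mse:.2f}")
```

Both pairs of values are close, which suggests (but does not prove) that the baseline may be outputting values at or near zero. One possible cause is the ReLU on the output layer, which produces exactly zero whenever its input is negative; checking a histogram of the predictions would confirm or rule this out.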


## 5. Generating the output for the Kaggle competition

In [None]:
file_path2 = './private_data_to_predict.csv'  # Same directory as the training file, per the os.listdir() output
data_for_kaggle = pd.read_csv(file_path2)

In [None]:
# Predict on the competition data; predict() returns an array of shape (n, 1)
kaggle_results = model.predict(data_for_kaggle[features])

# Build the submission from the original ids (kept as integers) and the flattened predictions
submission = pd.DataFrame({
    'id': data_for_kaggle['id'],
    'expected': kaggle_results.ravel(),
})
# Replace any NaN predictions (e.g. from rows with missing features) with 0
submission['expected'] = submission['expected'].fillna(0)
submission.to_csv("output_to_submit.csv", index=False)


