# machine-learning-practica
### **ML Practice** - Exercise for the Bootcamp Inteligencia Artificial Full Stack Edición III

This project is a deliverable for the practical assignment of the Master Bootcamp Inteligencia Artificial Full Stack Edición III, run by the training center [@Keepcoding](https://github.com/KeepCoding).

---

The goal of the assignment is to predict Airbnb listing prices from the data available in the file [airbnb-listings-extract.csv](./airbnb-listings-extract.csv).

## Contents

The expected steps are the following:
1. Data preparation: train/test split
2. Exploratory analysis, for example:
    - Head, describe, dtypes, etc.
    - Outliers
    - Correlation
3. Preprocessing:
    - Variable elimination, via selection (random forest/Lasso), high correlation, a high percentage of missing values, or whatever method is deemed appropriate.
    - Variable generation
4. Modeling:
    - Cross validation
    - Evaluation; preferably of more than one model, so that they can be compared with each other.
5. Conclusion: written, not numeric; a couple of lines is more than enough.

## 1. Data preparation: train/test split

In [5]:
# We start with the libraries we will use
import numpy as np
import pandas as pd

# settings - comment/uncomment as needed
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [6]:
# Loading the data and taking a first look
datosABNB = pd.read_csv("./airbnb-listings-extract.csv", sep=";")
# although step 2 covers the exploratory analysis, splitting the data requires at least checking which column is the target
datosABNB.columns # it appears to be Price

Index(['ID', 'Listing Url', 'Scrape ID', 'Last Scraped', 'Name', 'Summary',
       'Space', 'Description', 'Experiences Offered', 'Neighborhood Overview',
       'Notes', 'Transit', 'Access', 'Interaction', 'House Rules',
       'Thumbnail Url', 'Medium Url', 'Picture Url', 'XL Picture Url',
       'Host ID', 'Host URL', 'Host Name', 'Host Since', 'Host Location',
       'Host About', 'Host Response Time', 'Host Response Rate',
       'Host Acceptance Rate', 'Host Thumbnail Url', 'Host Picture Url',
       'Host Neighbourhood', 'Host Listings Count',
       'Host Total Listings Count', 'Host Verifications', 'Street',
       'Neighbourhood', 'Neighbourhood Cleansed',
       'Neighbourhood Group Cleansed', 'City', 'State', 'Zipcode', 'Market',
       'Smart Location', 'Country Code', 'Country', 'Latitude', 'Longitude',
       'Property Type', 'Room Type', 'Accommodates', 'Bathrooms', 'Bedrooms',
       'Beds', 'Bed Type', 'Amenities', 'Square Feet', 'Price', 'Weekly Price',
       'Month

In [7]:
# sklearn imports
from sklearn.model_selection import train_test_split

# Split train/test
X1 = datosABNB.loc[:,datosABNB.columns != "Price"]
y1 = datosABNB["Price"]
X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.25, shuffle=True, random_state=0)

## 2. Exploratory analysis

#### NOTE:
> Before starting the exploratory analysis, I should say that I began it before we covered the topic in class and that I have zero experience with it. So you will see many things that are unnecessary or that could be done in better ways; I even wrote a function that reports a lot of information and on which, in retrospect, I wasted my time. Still, I enjoyed having tried and I want to show the effort I put into it. You will see that I gradually stick to what we covered in class, and that I end up repeating some analyses for the reason above. I am nonetheless proud of having tried it beforehand, which is why I am leaving it in.

In [10]:
# A quick look at everything
X_train.head(3).T

| | 14764 | 6577 | 13212 |
|---|---|---|---|
| ID | 4181571 | 13967638 | 7944109 |
| Listing Url | https://www.airbnb.com/rooms/4181571 | https://www.airbnb.com/rooms/13967638 | https://www.airbnb.com/rooms/7944109 |
| Scrape ID | 20170315084710 | 20170407214119 | 20170407214119 |
| Last Scraped | 2017-03-15 | 2017-04-08 | 2017-04-08 |
| Name | SANT BARTOMEU Apartment w/ PRIVATE PATIO - SO... | Cuarto de la luna llena | NOMAD V, Friendly Rentals MAD |
| Summary | This is a 100% renovated apartment in the hear... | Lugares de interés: Está cerca del centro de l... | This comfortable modern apartment is in a full... |
| Space | This apartment features great location in Soll... | | This comfortable modern apartment is in a full... |
| Description | This is a 100% renovated apartment in the hear... | Lugares de interés: Está cerca del centro de l... | This comfortable modern apartment is in a full... |
| Experiences Offered | none | none | none |
| Neighborhood Overview | Set in the beautiful village of Soller, the ap... | | Located in the atmospheric La Latina neighborh... |


In [11]:
# let's start with shape, to see what awaits us
X_train.shape # 88 columns to explore

(11085, 88)

In [12]:
# maybe some column is entirely null and could be removed
areAllNulls = X_train.isnull().all()
areAllNulls[areAllNulls] # no luck

Series([], dtype: bool)

In [13]:
# Let's look at the ones that contain nulls
areSomeNulls = X_train.isnull().any()
areSomeNulls[areSomeNulls] # several columns have some nulls; we will have to see how useful they are

Name                              True
Summary                           True
Space                             True
Description                       True
Neighborhood Overview             True
Notes                             True
Transit                           True
Access                            True
Interaction                       True
House Rules                       True
Thumbnail Url                     True
Medium Url                        True
Picture Url                       True
XL Picture Url                    True
Host Name                         True
Host Since                        True
Host Location                     True
Host About                        True
Host Response Time                True
Host Response Rate                True
Host Acceptance Rate              True
Host Thumbnail Url                True
Host Picture Url                  True
Host Neighbourhood                True
Host Listings Count               True
Host Total Listings Count
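
`isnull().any()` only tells whether a column has at least one null. As a complement, the share of missing values per column can be ranked with `isnull().mean()`. This is a minimal sketch on a hypothetical toy frame (the values are invented for illustration); in the notebook it would be applied to `X_train`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real notebook would use X_train instead
toy = pd.DataFrame({
    "Name": ["a", None, "c", "d"],
    "Square Feet": [np.nan, np.nan, np.nan, 50.0],
    "Accommodates": [2, 4, 1, 3],
})

# Fraction of missing values per column, highest first
null_rate = toy.isnull().mean().sort_values(ascending=False)

# Keep only the columns that actually contain nulls
null_rate = null_rate[null_rate > 0]
print(null_rate)
```

A ranking like this makes it easier to separate columns with a handful of gaps from columns that are almost empty.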

In [14]:
# Let's look at the data types
X_train.dtypes # many strings; we need to see what we can get out of them

ID                                  int64
Listing Url                        object
Scrape ID                           int64
Last Scraped                       object
Name                               object
Summary                            object
Space                              object
Description                        object
Experiences Offered                object
Neighborhood Overview              object
Notes                              object
Transit                            object
Access                             object
Interaction                        object
House Rules                        object
Thumbnail Url                      object
Medium Url                         object
Picture Url                        object
XL Picture Url                     object
Host ID                             int64
Host URL                           object
Host Name                          object
Host Since                         object
Host Location                     

In [15]:
# we also look for duplicated rows
dup = X_train.duplicated()
dup[dup]

Series([], dtype: bool)

In [16]:
#### let's detect outliers!
# We analyze the data a bit
X_train.describe(include='all').T

| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | 11085.0 | | | | 10258637.543257 | 5558271.47456 | 19864.0 | 5545662.0 | 11251956.0 | 15308709.0 | 18583609.0 |
| Listing Url | 11085.0 | 11085.0 | https://www.airbnb.com/rooms/4181571 | 1.0 | | | | | | | |
| Scrape ID | 11085.0 | | | | 20170375650379.668 | 543799242.050982 | 20160104002432.0 | 20170407214119.0 | 20170407214119.0 | 20170407214119.0 | 20170615002708.0 |
| Last Scraped | 11085.0 | 36.0 | 2017-04-08 | 10222.0 | | | | | | | |
| Name | 11084.0 | 10813.0 | Apartamento en el centro de Madrid | 11.0 | | | | | | | |
| Summary | 10638.0 | 10022.0 | Unique apartment in vibrant neighborhoods, car... | 49.0 | | | | | | | |
| Space | 8179.0 | 7705.0 | Los Apartamentos Good Stay Prado se encuentran... | 21.0 | | | | | | | |
| Description | 11079.0 | 10742.0 | Es un piso con 6 habitaciones de las que 5 ha... | 15.0 | | | | | | | |
| Experiences Offered | 11085.0 | 5.0 | none | 11073.0 | | | | | | | |
| Neighborhood Overview | 6863.0 | 6062.0 | Se trata de una de las zonas más emblemáticas ... | 26.0 | | | | | | | |
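
`describe()` reports the quartiles, which is enough to flag candidate outliers with the common 1.5 x IQR rule. The sketch below is an assumption about how that could be done, not something the notebook does at this point; the `iqr_outliers` helper and the sample prices are hypothetical.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical nightly prices with one extreme value
prices = pd.Series([40, 45, 50, 55, 60, 65, 900])
mask = iqr_outliers(prices)
print(prices[mask])
```

Values flagged this way are only candidates: whether a 900 per night listing is an error or a real luxury property still needs a look at the row itself.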


In [17]:
# NOTE: Better ways of doing some of these things already exist, but I only discovered them in class afterwards.
#       It seems I reinvented the wheel :S
#       I don't regret having done this; most likely I won't use it again, but I'm leaving it
#       because in the end it was my own effort
#
# I ended up analyzing so many things that I turned it into a function in functions.py:
from functions import analisisDF

# We try it with a small dataset:
dataTest = {
    "copy": [420, 380, 390, 411, 400, 395, 410],
    "copied": [420, 380, 390, 411, 400, 395, 410],
    "similar": [415, 380, 390, 411, 400, 395, 410],
    "proporcional": [415*1.5, 380*1.5, 390*1.5, 411*1.5, 400*1.5, 395*1.5, 410*1.5],
    "contains1": ["a", "b", "c", "d", "e", "f", "g"],
    "contains2": ["aaa", "bbbb", "cccc", "dxxxx", "exxxx", "fffff", "gccccc"],
    "allNan": [None, None, None, None, None, None, None],
    "manyNan": [None, 1, None, None, None, None, None],
    "outlier": [1, 1, 390, 1, 1, 1, -1111],
    "formatInconsistence": [1, 1, True, 1, 1, 1, 1]
}
df = pd.DataFrame(dataTest)
print(df) 

# let's see, it looks good
resultMap = analisisDF(df)
resultMap

   copy  copied  similar  proporcional contains1 contains2 allNan  manyNan  \
0   420     420      415         622.5         a       aaa   None      NaN   
1   380     380      380         570.0         b      bbbb   None      1.0   
2   390     390      390         585.0         c      cccc   None      NaN   
3   411     411      411         616.5         d     dxxxx   None      NaN   
4   400     400      400         600.0         e     exxxx   None      NaN   
5   395     395      395         592.5         f     fffff   None      NaN   
6   410     410      410         615.0         g    gccccc   None      NaN   

   outlier formatInconsistence  
0        1                   1  
1        1                   1  
2      390                True  
3        1                   1  
4        1                   1  
5        1                   1  
6    -1111                   1  


{'duplicateCols': [{'cols': 'copy', 'col2': 'copied'}],
 'similarCols': [{'cols': 'copy', 'col2': 'copied', 'rate': 1.0},
  {'cols': 'copy', 'col2': 'similar', 'rate': 0.8571428571428571},
  {'cols': 'copied', 'col2': 'similar', 'rate': 0.8571428571428571}],
 'containsCols': [{'cols': 'copy', 'col2': 'copied', 'rate': 1.0},
  {'cols': 'copy', 'col2': 'similar', 'rate': 0.8571428571428571},
  {'cols': 'copied', 'col2': 'similar', 'rate': 0.8571428571428571},
  {'cols': 'contains1', 'col2': 'contains2', 'rate': 1.0},
  {'cols': 'outlier',
   'col2': 'formatInconsistence',
   'rate': 0.8571428571428571}],
 'formatInconsitenceCols': [{'col': 'formatInconsistence',
   'types': "<class 'int'>|<class 'bool'>"}],
 'tooManyNanCols': [{'col': 'allNan', 'rate': 1.0},
  {'col': 'manyNan', 'rate': 0.8571428571428571}],
 'proportionalCols': [{'cols': 'copy', 'col2': 'copied', 'proportion': 1.0},
  {'cols': 'similar',
   'col2': 'proporcional',
   'proportion': 0.6666666666666666}],
 'outliersCols': 
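
The `formatInconsitenceCols` entry above reports columns whose cells hold more than one Python type. The implementation of `analisisDF` is not shown here, so the following is only a sketch of one way such a check could work; `mixed_type_columns` is a hypothetical helper and the toy data is invented. It also illustrates why text columns with missing values show up as `str` plus `float`: pandas represents a missing value as `NaN`, which is a `float`.

```python
import numpy as np
import pandas as pd

def mixed_type_columns(df: pd.DataFrame) -> dict:
    """Map each column holding more than one Python type to the sorted type names found."""
    result = {}
    for col in df.columns:
        type_names = sorted({type(value).__name__ for value in df[col]})
        if len(type_names) > 1:
            result[col] = type_names
    return result

# Hypothetical toy frame
toy = pd.DataFrame({
    "Name": ["flat", np.nan, "loft"],  # text column with a missing value
    "Flag": [1, True, 1],              # ints mixed with a bool
    "Beds": [1, 2, 3],                 # homogeneous ints
})
print(mixed_type_columns(toy))
```

Under this reading, a `str` plus `float` report usually points to missing values rather than to genuinely inconsistent formats.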

In [18]:
# now with the real data
#import json
#resultMap = analisisDF(X_train,0.87)
#with open('resultMap.json', 'w', encoding='utf-8') as f:
#    json.dump(resultMap,f,ensure_ascii=False, indent=2)

> The previous run, although it finished successfully, used a lot of memory, so I commented it out.
> I therefore copied the result and pasted it into a text file that can be seen [here](./resultMap.json)
>
> All very nice, but from now on I will stick to what we covered in class. For example, "similarCols" can easily be seen in the correlation matrix instead of reinventing the wheel
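
As a sketch of that idea, strongly correlated numeric column pairs can be listed from the absolute correlation matrix. The threshold, the `correlated_pairs` helper and the toy values are assumptions for illustration; the toy columns only borrow names from the dataset.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return column pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Invented values; two columns are identical on purpose
toy = pd.DataFrame({
    "Host Listings Count": [1, 2, 3, 4, 5],
    "Host Total Listings Count": [1, 2, 3, 4, 5],
    "Minimum Nights": [3, 1, 4, 1, 5],
})
print(correlated_pairs(toy))
```

Unlike the similarity rates reported by my function, this only covers numeric columns, so text columns such as `City` and `Market` would need a different comparison.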

**My function's output is a bit hard to read, but copying it into Notepad++ lets me navigate it properly, and I found the following:**
- There are no duplicate columns
  ```
  'duplicateCols': [],
  ```
- Many columns have similar values:
  ```
  'similarCols': ['cols:[Host Acceptance Rate|Square Feet],rate:[0.957954765126619]',
  'cols:[Host Acceptance Rate|Has Availability],rate:[0.997583607191185]',
  'cols:[Host Acceptance Rate|License],rate:[0.9733230233906824]',
  'cols:[Host Acceptance Rate|Jurisdiction Names],rate:[0.9844384303112313]',
  'cols:[Host Listings Count|Host Total Listings Count],rate:[1.0]',
  'cols:[City|Market],rate:[0.9438430311231394]',
  'cols:[Square Feet|Has Availability],rate:[0.9596945679489658]',
  'cols:[Square Feet|License],rate:[0.9389135897931568]',
  'cols:[Square Feet|Jurisdiction Names],rate:[0.9475159481925381]',
  'cols:[Has Availability|License],rate:[0.9750628262130292]',
  'cols:[Has Availability|Jurisdiction Names],rate:[0.9858882659965204]',
  'cols:[Review Scores Checkin|Review Scores Communication],rate:[0.8568528900057993]',
  'cols:[License|Jurisdiction Names],rate:[0.9628842064566016]']
  ```
- Many columns have values contained in other columns:
  ```
  ['cols:[ID|Listing Url],rate:[1.0]',
  'cols:[Scrape ID|Guests Included],rate:[0.9501256524260584]',
  'cols:[Last Scraped|Guests Included],rate:[0.9467427024937174]',
  'cols:[Summary|Description],rate:[0.9237386429537986]',
  'cols:[Picture Url|Accommodates],rate:[0.9022810748115213]',
  'cols:[Picture Url|Guests Included],rate:[0.9010245505509376]',
  'cols:[Picture Url|Minimum Nights],rate:[0.8912623236033249]',
  'cols:[Host ID|Host URL],rate:[1.0]',
  'cols:[Host Since|Guests Included],rate:[0.9303112313937754]',
  'cols:[Host Location|Country],rate:[0.8350086990141118]',
  'cols:[Host Acceptance Rate|Square Feet],rate:[0.957954765126619]',
  'cols:[Host Acceptance Rate|Has Availability],rate:[0.997583607191185]',
  'cols:[Host Acceptance Rate|License],rate:[0.9733230233906824]',
  'cols:[Host Acceptance Rate|Jurisdiction Names],rate:[0.9844384303112313]',
  'cols:[Host Thumbnail Url|Accommodates],rate:[0.8705780011598685]',
  'cols:[Host Thumbnail Url|Guests Included],rate:[0.8924221921515562]',
  'cols:[Host Thumbnail Url|Minimum Nights],rate:[0.862265609897545]',
  'cols:[Host Picture Url|Accommodates],rate:[0.8706746568722211]',
  'cols:[Host Picture Url|Guests Included],rate:[0.8924221921515562]',
  'cols:[Host Picture Url|Minimum Nights],rate:[0.862265609897545]',
  'cols:[Host Listings Count|Host Total Listings Count],rate:[1.0]',
  'cols:[Street|City],rate:[0.9996133771505896]',
  'cols:[Street|State],rate:[0.989367871641214]',
  'cols:[Street|Zipcode],rate:[0.9665571235260004]',
  'cols:[Street|Market],rate:[0.949255751014885]',
  'cols:[Street|Country],rate:[0.9999033442876474]',
  'cols:[City|State],rate:[0.9069205490044462]',
  'cols:[City|Market],rate:[0.9460661125072491]',
  'cols:[City|Smart Location],rate:[0.9996133771505896]',
  'cols:[State|Market],rate:[0.8988014691668278]',
  'cols:[Market|Smart Location],rate:[0.9461627682196018]',
  'cols:[Smart Location|Country],rate:[0.9887879373670984]',
  'cols:[Latitude|Accommodates],rate:[0.8080417552677364]',
  'cols:[Accommodates|Geolocation],rate:[0.9048907790450416]',
  'cols:[Square Feet|Has Availability],rate:[0.9596945679489658]',
  'cols:[Square Feet|License],rate:[0.9389135897931568]',
  'cols:[Square Feet|Jurisdiction Names],rate:[0.9475159481925381]',
  'cols:[Guests Included|Calendar last Scraped],rate:[0.945872801082544]',
  'cols:[Guests Included|Geolocation],rate:[0.8985115020297699]',
  'cols:[Minimum Nights|Geolocation],rate:[0.8877827179586314]',
  'cols:[Has Availability|License],rate:[0.9750628262130292]',
  'cols:[Has Availability|Jurisdiction Names],rate:[0.9858882659965204]',
  'cols:[Review Scores Checkin|Review Scores Communication],rate:[0.8568528900057993]',
  'cols:[License|Jurisdiction Names],rate:[0.9628842064566016]'],
  ```
- Some columns hold mixed data types within the same column; if they are useful, they will have to be normalized
  ```
  ["col:[Name],types:[<class 'str'>|<class 'float'>]",
  "col:[Summary],types:[<class 'float'>|<class 'str'>]",
  "col:[Space],types:[<class 'str'>|<class 'float'>]",
  "col:[Description],types:[<class 'str'>|<class 'float'>]",
  "col:[Neighborhood Overview],types:[<class 'float'>|<class 'str'>]",
  "col:[Notes],types:[<class 'float'>|<class 'str'>]",
  "col:[Transit],types:[<class 'float'>|<class 'str'>]",
  "col:[Access],types:[<class 'float'>|<class 'str'>]",
  "col:[Interaction],types:[<class 'float'>|<class 'str'>]",
  "col:[House Rules],types:[<class 'str'>|<class 'float'>]",
  "col:[Thumbnail Url],types:[<class 'float'>|<class 'str'>]",
  "col:[Medium Url],types:[<class 'float'>|<class 'str'>]",
  "col:[Picture Url],types:[<class 'str'>|<class 'float'>]",
  "col:[XL Picture Url],types:[<class 'float'>|<class 'str'>]",
  "col:[Host Name],types:[<class 'str'>|<class 'float'>]",
  "col:[Host Since],types:[<class 'str'>|<class 'float'>]",
  "col:[Host Location],types:[<class 'str'>|<class 'float'>]",
  "col:[Host About],types:[<class 'str'>|<class 'float'>]",
  "col:[Host Response Time],types:[<class 'str'>|<class 'float'>]",
  "col:[Host Acceptance Rate],types:[<class 'float'>|<class 'str'>]",
  "col:[Host Thumbnail Url],types:[<class 'str'>|<class 'float'>]",
  "col:[Host Picture Url],types:[<class 'str'>|<class 'float'>]",
  "col:[Host Neighbourhood],types:[<class 'str'>|<class 'float'>]",
  "col:[Host Verifications],types:[<class 'str'>|<class 'float'>]",
  "col:[Neighbourhood],types:[<class 'str'>|<class 'float'>]",
  "col:[Neighbourhood Group Cleansed],types:[<class 'str'>|<class 'float'>]",
  "col:[City],types:[<class 'str'>|<class 'float'>]",
  "col:[State],types:[<class 'str'>|<class 'float'>]",
  "col:[Zipcode],types:[<class 'str'>|<class 'float'>]",
  "col:[Market],types:[<class 'str'>|<class 'float'>]",
  "col:[Country],types:[<class 'str'>|<class 'float'>]",
  "col:[Amenities],types:[<class 'str'>|<class 'float'>]",
  "col:[Has Availability],types:[<class 'float'>|<class 'str'>]",
  "col:[First Review],types:[<class 'str'>|<class 'float'>]",
  "col:[Last Review],types:[<class 'str'>|<class 'float'>]",
  "col:[License],types:[<class 'float'>|<class 'str'>]",
  "col:[Jurisdiction Names],types:[<class 'float'>|<class 'str'>]",
  "col:[Features],types:[<class 'str'>|<class 'float'>]"]
  ```
- Some columns are mostly null. There are two options: either they are superfluous, or they can be reviewed treating null as "undefined" to see whether they provide any information
  ```
  ['col:[Host Acceptance Rate],rate:[0.997583607191185]',
  'col:[Square Feet],rate:[0.960371157935434]',
  'col:[Has Availability],rate:[0.9993234100135318]',
  'col:[License],rate:[0.9757394161994974]',
  'col:[Jurisdiction Names],rate:[0.9861782331335782]'],
  ```
- Some columns have few unique values; I will only paste the ones I found relevant:
  ```
    'col:[Experiences Offered],unique rate:[0.0004832785617630002],vals:[Experiences Offered
    none        10334
    business        6
    social          3
    family          2
    romantic        1
    Name: count, dtype: int64]',
  'col:[Host Response Time],unique rate:[0.0004832785617630002],vals:[Host Response Time
    within an hour        5544
    within a few hours    2008
    within a day          1261
    a few days or more     203
    Name: count, dtype: int64]',

    This one would be good to convert to numbers
    'col:[Host Acceptance Rate],unique rate:[0.0009665571235260004],vals:[Host Acceptance Rate
    100%    14
    0%       2
    85%      2
    74%      2
    67%      1
    96%      1
    95%      1
    88%      1
    80%      1
    "col:[Host Neighbourhood],unique rate:[0.032186352213415814],vals:[Host Neighbourhood
    Malasaña                             687
    La Latina                            675
    Embajadores                          637
    Sol                                  519
    Justicia                             486
    Cortes                               432
    Palacio                              349
    Argüelles                            217
    Aluche                               190
    Carabanchel                          172
    Trafalgar                            156
    Rios Rosas                           147
    Ciudad Lineal                        134
    Palos do Moguer                      134
    L'Antiga Esquerra de l'Eixample      134
    Goya                                 127
    .... and others

  This one can be normalized
  'col:[Host Verifications],unique rate:[0.017301372511115406],vals:[Host Verifications
    email,phone,reviews,jumio                                                                                   2598
    email,phone,reviews                                                                                         2462
    email,phone,reviews,jumio,government_id                                                                      624
    email,phone,facebook,reviews,jumio                                                                           595
    email,phone,facebook,reviews                                                                                 517
    email,phone                                                                                                  427
    email,phone,reviews,jumio,work_email                                                                         309
    email,phone,reviews,jumio,offline_government_id,government_id                                                257
    email,phone,facebook,reviews,jumio,government_id                                                             245
    email,phone,reviews,work_email                                                                               189
    email,phone,google,reviews,jumio,government_id                                                               148
    email,phone,reviews,manual_offline,jumio                                                                     130
    phone                                                                                                        119
    email,phone,facebook                                                                                         100
    email,phone,facebook,reviews,jumio,work_email                                                                 91
    phone,reviews                                                                                                 89
    email,phone,facebook,reviews,jumio,offline_government_id,government_id                                        86
    .... and others
    "col:[Neighbourhood],unique rate:[0.03150976222694761],vals:[Neighbourhood
    Malasaña                             608
    La Latina                            579
    Embajadores                          551
    Sol                                  512
    Cortes                               406
    Justicia                             396
    Palacio                              287
    Aluche                               159
    Argüelles                            157
    Trafalgar                            151
    Carabanchel                          137
    Palos do Moguer                      126
    Ciudad Lineal                        125
    Goya                                 111
    Puente de Vallecas                    87
    Guindalera                            87
    Arapiles                              83
    Recoletos                             82
    Pacifico                              70
    Almagro                               69
    Hortaleza                             63
    Gaztambide                            61
    Castellana                            58
    Lista                                 57
    Cuatro Caminos                        57
    Acacias                               52
    Fuencarral-el Pardo                   52
    Usera                                 49
    Ibiza                                 48
    San Blas                              46
    Delicias                              44
    Prosperidad                           42
    La Chopera                            42
    Rios Rosas                            41
    Barajas                               38
    Imperial                              38
    .... and others

  We discussed this one in class: rows that are not Madrid make no sense, perhaps Barcelona at most
    "col:[City],unique rate:[0.02029769959404601],vals:[City
    Madrid                                 9249
    Barcelona                               211
    London                                   99
    Paris                                    70
    马德里                                      42
    Palma                                    37
    Berlin                                   29
    Alcúdia                                  29
    Roma                                     28
    New York                                 23
    Los Angeles                              20
    Brooklyn                                 18
    Wien                                     18
    Dublin                                   18
    Amsterdam                                16
    Madrid, Comunidad de Madrid, ES          12
    Toronto                                  12
    Inca                                     12
    Rome                                     11
    Pollença                                 10
    Palma de Mallorca                         9
    Washington                                8
    Bondi Beach                               8
    Venezia                                   7
    Búger                                     7
    San Francisco                             7
    madrid                                    6
    Chicago                                   6
    Deià                                      6
    Santa Margalida                           6
    .... and others
  'col:[Room Type],unique rate:[0.00028996713705780014],vals:[Room Type
    Entire home/apt    6330
    Private room       3873
    Shared room         143
    Name: count, dtype: int64]',
  'col:[Accommodates],unique rate:[0.0015464913976416005],vals:[Accommodates
    2     3673
    4     2303
    1     1420
    3     1009
    6      816
    5      478
    8      248
    7      160
    10      91
    9       53
    12      37
    16      24
    11      16
    14      11
    15       4
    13       3
    Name: count, dtype: int64]',
  'col:[Bathrooms],unique rate:[0.0017398028223468006],vals:[Bathrooms
    1.0    7716
    2.0    1624
    1.5     384
    3.0     226
    2.5      90
    4.0      54
    5.0      50
    0.5      49
    0.0      44
    6.0      23
    3.5      20
    4.5      16
    5.5       4
    8.0       3
    7.0       3
    6.5       1
    7.5       1
    Name: count, dtype: int64]',
    'col:[Bedrooms],unique rate:[0.0010632128358786005],vals:[Bedrooms
    1.0     6812
    2.0     1833
    0.0      707
    3.0      651
    4.0      213
    5.0       68
    6.0       24
    7.0        8
    10.0       5
    8.0        5
    Name: count, dtype: int64]',
      'col:[Beds],unique rate:[0.0016431471099942006],vals:[Beds
    1.0     5123
    2.0     2697
    3.0     1151
    4.0      650
    5.0      269
    6.0      170
    7.0       77
    8.0       73
    10.0      39
    9.0       29
    16.0      10
    12.0       7
    14.0       5
    13.0       5
    11.0       4
    15.0       3
    Name: count, dtype: int64]',
      'col:[Bed Type],unique rate:[0.0004832785617630002],vals:[Bed Type
    Real Bed         10126
    Pull-out Sofa      169
    Futon               32
    Couch               14
    Airbed               5
    Name: count, dtype: int64]',
    'col:[Cancellation Policy],unique rate:[0.0007732456988208003],vals:[Cancellation Policy
    strict             4062
    flexible           3259
    moderate           2939
    strict_new           26
    super_strict_60      21
    moderate_new         17
    super_strict_30      14
    flexible_new          8
  can be normalized
    'col:[Features],unique rate:[0.008022424125265803],vals:[Features
    Host Has Profile Pic,Host Identity Verified,Is Location Exact,Requires License                                   1779
    Host Has Profile Pic,Is Location Exact,Requires License                                                           1472
    Host Has Profile Pic,Host Identity Verified,Is Location Exact,Requires License,Instant Bookable                   1233
    Host Has Profile Pic,Host Identity Verified,Requires License                                                      980
    Host Has Profile Pic,Requires License                                                                             970
    Host Has Profile Pic,Is Location Exact,Requires License,Instant Bookable                                          815
    Host Has Profile Pic,Host Identity Verified,Requires License,Instant Bookable                                     587
    Host Has Profile Pic,Requires License,Instant Bookable                                                            481
    Host Is Superhost,Host Has Profile Pic,Host Identity Verified,Is Location Exact,Requires License                  306
    Host Is Superhost,Host Has Profile Pic,Host Identity Verified,Is Location Exact,Requires License,Instant Bookable 213
    Host Has Profile Pic,Host Identity Verified,Is Location Exact                                                     164
    Host Is Superhost,Host Has Profile Pic,Host Identity Verified,Requires License                                    135
    Host Has Profile Pic,Host Identity Verified,Is Location Exact,Requires License,Require Guest Phone Verification   119
    Host Is Superhost,Host Has Profile Pic,Is Location Exact,Requires License                                         114
  .... and others
  ```
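
The notes above mark `Host Acceptance Rate` as worth converting to numbers and `Host Verifications` as normalizable. One possible approach, sketched on made-up rows rather than taken from the notebook: strip the percent sign and convert to a fraction, and expand the comma-separated list into one indicator column per token with `Series.str.get_dummies`.

```python
import numpy as np
import pandas as pd

# Made-up rows in the same format as the columns listed above
toy = pd.DataFrame({
    "Host Acceptance Rate": ["100%", "85%", np.nan],
    "Host Verifications": ["email,phone,reviews", "phone", "email,phone"],
})

# "85%" -> 0.85; missing values stay NaN
rate = pd.to_numeric(toy["Host Acceptance Rate"].str.rstrip("%")) / 100

# One 0/1 column per verification method
verifications = toy["Host Verifications"].str.get_dummies(sep=",")

print(rate.tolist())
print(verifications)
```

The same `get_dummies` call would apply to `Features`, which uses the same comma-separated layout.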

In [20]:
# Before calling it preprocessing, let's start by keeping the columns we are interested in, without preprocessing them

# For rows I will only consider the city: Madrid only. I noticed it is not normalized, so I will use contains with case=False
datosABNB = datosABNB[datosABNB['City'].str.contains('madrid', case=False, na=False)].copy()
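
A quick check of what this filter keeps, using a few of the `City` spellings listed earlier. Note that a non-Latin spelling such as `马德里` is not matched by `contains('madrid')`. Whether those rows should be kept is a judgment call, so the extended pattern below is only an assumption.

```python
import numpy as np
import pandas as pd

# A few City values as they appear in the value counts, plus a missing one
cities = pd.Series(
    ["Madrid", "madrid", "Madrid, Comunidad de Madrid, ES", "马德里", "Barcelona", np.nan]
)

# Same filter as the notebook: case-insensitive substring match, NaN treated as no match
mask = cities.str.contains("madrid", case=False, na=False)

# Hypothetical extension that also accepts the Chinese spelling
mask_extended = cities.str.contains("madrid|马德里", case=False, na=False)

print(mask.tolist())
print(mask_extended.tolist())
```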

In [21]:
# Let's continue with the easy part: IDs and URLs hold no relevant data, and neither does Geolocation, since latitude and longitude columns already exist.
# Features goes too: hard as I tried, I could not make sense of it at all
columns_to_drop = ['ID', 'Listing Url', 'Scrape ID', 'Thumbnail Url', 
                   'Medium Url', 'Picture Url', 'XL Picture Url', 
                   'Host ID', 'Host URL', 'Host Thumbnail Url', 
                   'Host Picture Url', 'Geolocation', 'Features']

# Drop the columns
datosABNB.drop(columns=columns_to_drop, axis=1, inplace=True)

datosABNB.shape

(13245, 76)

In [22]:
# 'Host Name', 'Host Location', 'Host About': I see no relevant content, so I will remove them for now
columns_to_drop = ['Host Name', 'Host Location', 'Host About']

# Drop the columns
datosABNB.drop(columns=columns_to_drop, axis=1, inplace=True)

datosABNB.shape

(13245, 73)

In [23]:
# 'Square Feet' is 96% NaN according to resultMap. I am not sure how to salvage it;
#  the following columns are in the same situation:
#"tooManyNanCols": [
#  {"col": "Host Acceptance Rate","rate": 0.997583607191185},
#  {"col": "Square Feet",         "rate": 0.960371157935434},
#  {"col": "Has Availability",    "rate": 0.9993234100135318},
#  {"col": "License",             "rate": 0.9757394161994974},
#  {"col": "Jurisdiction Names",  "rate": 0.9861782331335782}
#
# I can't think of a way to salvage them, so for now they are dropped
columns_to_drop = ['Host Acceptance Rate', 'Square Feet', 'Has Availability','License','Jurisdiction Names']

# Drop the columns
datosABNB.drop(columns=columns_to_drop, axis=1, inplace=True)

datosABNB.shape

(13245, 68)

## 3. Preprocessing

Now for preprocessing proper, cleanly separated (again)

In [25]:
# sklearn imports
from sklearn.model_selection import train_test_split

# Now we split train/test
X1 = datosABNB.loc[:,datosABNB.columns != "Price"]
y1 = datosABNB["Price"]
X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.25, shuffle=True, random_state=0)

In [26]:
# Name, Summary, Space and Description are free-text fields with no obviously exploitable structure. However,
# browsing Booking I noticed that the longer the description, summary and space texts are, the more the hosts tend to care about selling their place well.
# So I will check whether the combined word count of these texts shows any proportionality with the price.
# A lot of chatter may be an attempt to justify the price

text_cols = ['Name', 'Summary', 'Space', 'Description']
X_train[text_cols] = X_train[text_cols].fillna("")

# Add up the word counts of the four fields
X_train['total_words'] = X_train[text_cols].apply(lambda x: sum(len(text.split()) for text in x), axis=1)

# Finally, drop the original text fields
X_train.drop(text_cols, axis=1, inplace=True)

# Let's see how it looks
X_train['total_words'].head()

8221     197
6781     250
6329     231
5182     148
10856     75
Name: total_words, dtype: int64

In [27]:
# As in the previous step, 'Notes', 'Transit', 'Access', 'Interaction' and 'House Rules' might carry some signal. I will only use House Rules: could more rules mean a posher place?
text_cols = ['Notes', 'Transit', 'Access', 'Interaction', 'House Rules']
X_train['House Rules'] = X_train['House Rules'].fillna("")

# Measure the length of the rules text.
# Caveat: x is a single string here, so the generator iterates over its characters and the result is
# the number of non-whitespace characters, not words. len(x.split()) would give the word count.
X_train['house_rules_words'] = X_train['House Rules'].apply( lambda x: sum(len(text.split()) for text in x) )

# Finally, drop the original text fields
X_train.drop(text_cols, axis=1, inplace=True)

# Let's see how it turned out
X_train['house_rules_words'].head()

8221     798
6781     108
6329     238
5182       0
10856     67
Name: house_rules_words, dtype: int64
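
When `apply` runs over a single Series, the lambda receives one string at a time, and iterating over a string yields characters. The following sketch, on a hypothetical sample `rules` that is not taken from the dataset, contrasts that per-character count with a vectorised word count:

```python
import pandas as pd

# Hypothetical sample of house rules
rules = pd.Series(["No smoking", "", "Be quiet after 22:00"])

# Iterating a string walks its characters: whitespace splits to [],
# every other character to a one-element list, so this counts non-space characters
non_space_chars = rules.apply(lambda x: sum(len(ch.split()) for ch in x))

# Vectorised word count: split on whitespace, then take the list lengths
word_count = rules.str.split().str.len()

print(non_space_chars.tolist())  # [9, 0, 17]
print(word_count.tolist())       # [2, 0, 4]
```

Both quantities grow with the amount of text, so either may work as a feature, but the column name should match what is actually measured.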

In [28]:
# 'Host Response Time': I will try to convert it to hours
# Mapping dictionary
response_time_map = {
    "within an hour": 1,
    "within a few hours": 3,
    "within a day": 24,
    "a few days or more": 48
}

# Apply the mapping; missing or unmapped values get the sentinel 999
X_train['Host Response Time'] = X_train['Host Response Time'].map(response_time_map).fillna(999).astype(int)

# Check the result
X_train['Host Response Time'].head()

8221       3
6781       3
6329       1
5182      48
10856    999
Name: Host Response Time, dtype: int32

In [29]:
# Now the date fields: 'Last Scraped', 'Host Since', 'Calendar last Scraped', 'First Review', 'Last Review'
date_cols = ['Last Scraped', 'Host Since', 'Calendar last Scraped', 'First Review', 'Last Review']
X_train[date_cols] = X_train[date_cols].apply(pd.to_datetime, errors='coerce')

# I am not sure how best to use the dates. My idea is to convert each one to the number of days
# elapsed up to the latest date in the data + 1:
# - days_host_since: I doubt it is related, but experienced hosts might raise their prices
# - days_since_last_scraped: the more recent the scrape, the more up to date the prices may be
# - days_since_calendar_updated: recent calendar activity may indicate price adjustments
# - days_since_first_review: I have little faith in it, but a large value may be perceived as reputation; let's see if that holds
# - days_since_last_review: recent activity may push prices up through prestige
from datetime import timedelta

# Get maximum date + 1 day
max_date = X_train[date_cols].max().max() + timedelta(days=1)

print(max_date) # keep this value for the test set

# Convert everything to days
X_train['days_host_since'] = (max_date - X_train['Host Since']).dt.days
X_train['days_since_last_scraped'] = (max_date - X_train['Last Scraped']).dt.days
X_train['days_since_calendar_updated'] = (max_date - X_train['Calendar last Scraped']).dt.days
X_train['days_since_first_review'] = (max_date - X_train['First Review']).dt.days
X_train['days_since_last_review'] = (max_date - X_train['Last Review']).dt.days

# Fill missing with 999
time_features = ['days_host_since', 'days_since_last_scraped', 'days_since_calendar_updated', 
                 'days_since_first_review', 'days_since_last_review']
X_train[time_features] = X_train[time_features].fillna(999).astype(int)

# For 'Calendar last Scraped', my guess is that hosts update the calendar to adjust prices to the season,
# so let's try to capture that:
def get_season(month):
    if month in [1, 2, 11]:
        return 0 # "Low"
    elif month in [3, 4, 10]:
        return 1 # "Mid"
    elif month in [5, 6, 9]:
        return 2 # "High"
    elif month in [7, 8, 12]:
        return 3 # "Peak"
    return 0  # fallback, also reached when the month is missing (NaN)

X_train['season'] = X_train['Calendar last Scraped'].dt.month.apply(get_season)

# Finally, drop the original date fields
X_train.drop(date_cols, axis=1, inplace=True)

# Check the result
X_train[time_features].head().T

2017-04-09 00:00:00


| | 8221 | 6781 | 6329 | 5182 | 10856 |
|---|---|---|---|---|---|
| days_host_since | 612 | 993 | 2000 | 233 | 374 |
| days_since_last_scraped | 1 | 1 | 1 | 1 | 1 |
| days_since_calendar_updated | 2 | 1 | 2 | 1 | 1 |
| days_since_first_review | 584 | 328 | 182 | 999 | 999 |
| days_since_last_review | 182 | 13 | 159 | 999 | 999 |
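
The reference date printed above is meant to be reused on the test set. One way to make that reuse explicit is a small helper that takes the reference date as a parameter. This is a sketch only: the function name `add_days_since`, the generated column names and the toy frames are assumptions for illustration, not the notebook's own code.

```python
import pandas as pd

def add_days_since(df, date_cols, ref_date, missing_value=999):
    """Replace each date column with the number of days elapsed up to ref_date."""
    out = df.copy()
    for col in date_cols:
        parsed = pd.to_datetime(out[col], errors="coerce")
        new_col = f"days_since_{col.lower().replace(' ', '_')}"
        # NaT becomes NaN after .dt.days, then the sentinel
        out[new_col] = (ref_date - parsed).dt.days.fillna(missing_value).astype(int)
    return out.drop(columns=date_cols)

# Hypothetical toy data
train = pd.DataFrame({"Host Since": ["2017-04-01", None],
                      "Last Review": ["2017-04-08", "2017-03-09"]})
test = pd.DataFrame({"Host Since": ["2016-04-09"], "Last Review": [None]})
date_cols = ["Host Since", "Last Review"]

# The reference date is learned on train only, then reused unchanged on test
ref_date = (train[date_cols].apply(pd.to_datetime, errors="coerce").max().max()
            + pd.Timedelta(days=1))

train_feat = add_days_since(train, date_cols, ref_date)
test_feat = add_days_since(test, date_cols, ref_date)
print(train_feat)
print(test_feat)
```

Passing `ref_date` in, instead of recomputing it inside the helper, keeps information from the test set out of the features.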


In [30]:
# 'Host Verifications': from what I found online, host verifications are the ways Airbnb checks that
# hosts are authentic. I do not expect to get much out of the specific verification methods, so I will only flag
# which hosts are verified and which are not

# Host Verifications to 1 or 0
X_train['host_verified'] = X_train['Host Verifications'].notnull().astype(int)

X_train.drop('Host Verifications', axis=1, inplace=True)

X_train['host_verified'].head()

8221     1
6781     1
6329     1
5182     1
10856    1
Name: host_verified, dtype: int32

In [31]:
X_train['host_verified'].value_counts()

host_verified
1    9925
0       8
Name: count, dtype: int64

In [32]:
!pip install geopy

!pip --version
!python --version

Defaulting to user installation because normal site-packages is not writeable
pip 24.2 from C:\ProgramData\anaconda3\Lib\site-packages\pip (python 3.12)

Python 3.12.7


In [33]:
# Because of problems with my Anaconda setup I could not use geopy. I looked up how to replace it and left
# the geopy version commented out below; use whichever suits best
import math

# Reference point: Puerta del Sol (Km 0)
km_0_lat = 40.4168
km_0_lon = -3.7038

# Haversine Formula
def haversine(lat, lon):
    R = 6371  # Radius of Earth in kilometers
    dlat = math.radians(lat - km_0_lat)
    dlon = math.radians(lon - km_0_lon)
    a = math.sin(dlat/2) ** 2 + math.cos(math.radians(km_0_lat)) * math.cos(math.radians(lat)) * math.sin(dlon/2) ** 2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return R * c

# Apply the function
X_train['distance_km'] = X_train.apply(lambda row: haversine(row['Latitude'], row['Longitude']) if pd.notnull(row['Latitude']) else -1, axis=1)
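
The row-wise `apply` above calls the Python function once per listing. As a hedged alternative, the same haversine formula can be written with NumPy so it works on whole columns at once. The name `haversine_km` and the sample coordinates are illustrative assumptions; the reference point and Earth radius are the ones used above.

```python
import numpy as np

KM_0_LAT, KM_0_LON = 40.4168, -3.7038  # same reference point as above
EARTH_RADIUS_KM = 6371.0

def haversine_km(lat, lon, ref_lat=KM_0_LAT, ref_lon=KM_0_LON):
    """Great-circle distance in km from (ref_lat, ref_lon), element-wise."""
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    dlat = np.radians(lat - ref_lat)
    dlon = np.radians(lon - ref_lon)
    a = (np.sin(dlat / 2) ** 2
         + np.cos(np.radians(ref_lat)) * np.cos(np.radians(lat)) * np.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Hypothetical coordinates: the reference point, one degree further north, and a missing latitude
distances = haversine_km([40.4168, 41.4168, np.nan], [-3.7038, -3.7038, -3.7038])
print(distances)
```

Missing coordinates propagate as NaN here, so a sentinel such as -1 would have to be applied afterwards, for example with `fillna(-1)` on the resulting column.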


In [34]:
# Of the fields 'Street', 'Neighbourhood', 'Neighbourhood Cleansed', 'Neighbourhood Group Cleansed', 'City', 'State',
# 'Zipcode', 'Market', 'Smart Location', 'Country Code', 'Country', 'Latitude', 'Longitude',
# I think the best way to use location is to take 'Latitude' and 'Longitude' and compute the distance to the centre, kilometre 0.
# 'Neighbourhood Cleansed' can also be encoded with the median price of each neighbourhood

# from geopy.distance import geodesic

# # Puerta del Sol (Km 0), according to Google
# km_0 = (40.4168, -3.7038)

# # Distance calculation
# def calculate_distance(lat, lon):
#     try:
#         return geodesic((lat, lon), km_0).km
#     except:
#         return -1  # For missing coordinates

# df_madrid['distance_km'] = df_madrid.apply(lambda row: calculate_distance(row['Latitude'], row['Longitude']), axis=1)

# Encode 'Neighbourhood Cleansed' with the median price per neighbourhood
neighPrice = X_train['Neighbourhood Cleansed'].to_frame().join(y_train)
neigh_price_map_train = neighPrice.groupby('Neighbourhood Cleansed')['Price'].median()
X_train['neighbourhood_price'] = X_train['Neighbourhood Cleansed'].map(neigh_price_map_train)

# Keep the overall median as the fallback for missing values
meian_price_train = y_train.median()
X_train['neighbourhood_price'] = X_train['neighbourhood_price'].fillna(meian_price_train)

# These must be kept safe for the test set (if the columns turn out to matter).
print(neigh_price_map_train)
print(meian_price_train)

# Everything else goes; I think this covers what is needed for the address
location_cols = ['Street', 'Neighbourhood', 'Neighbourhood Cleansed', 'Neighbourhood Group Cleansed',
                 'City', 'State', 'Zipcode', 'Market', 'Smart Location', 'Country Code', 'Country', 
                 'Latitude', 'Longitude']
X_train.drop(location_cols, axis=1, inplace=True)

X_train[['neighbourhood_price','distance_km']].head()

Neighbourhood Cleansed
Abrantes                         20.0
Acacias                          35.0
Adelfas                          50.0
Aeropuerto                       39.0
Aguilas                          20.0
Alameda de Osuna                 37.0
Almagro                          60.0
Almenara                         45.0
Almendrales                      29.0
Aluche                           25.0
Ambroz                           24.0
Amposta                          15.0
Apostol Santiago                 49.0
Arapiles                         49.5
Aravaca                          40.0
Arcos                            18.0
Argüelles                        52.0
Atocha                           49.0
Bellas Vistas                    36.5
Berruguete                       45.0
Buenavista                       39.0
Butarque                         28.0
Campamento                       27.0
Canillas                         35.0
Canillejas                       27.0
Casa de Campo              

| | neighbourhood_price | distance_km |
|---|---|---|
| 8221 | 50.0 | 0.988523 |
| 6781 | 42.0 | 2.494668 |
| 6329 | 25.0 | 6.736929 |
| 5182 | 39.5 | 2.099571 |
| 10856 | 65.0 | 0.835233 |


In [35]:
# The 'Weekly Price' and 'Monthly Price' columns will probably be very important when present. The problem is
# when they are missing: imputing the mean would erase the fact that the value was never reported.
# My idea (from inexperience, hopefully a sound one) is to add a support column that is
# true when the price is reported and false when it is not, then fill the missing values with 999999 to push the model to rely only on
# the originally reported values

# 0 or 1
X_train['has_weekly_price'] = X_train['Weekly Price'].notnull().astype(int)
X_train['has_monthly_price'] = X_train['Monthly Price'].notnull().astype(int)

# Rows without a reported price get a very large value
X_train['Weekly Price'] = X_train['Weekly Price'].fillna(999999)
X_train['Monthly Price'] = X_train['Monthly Price'].fillna(999999)


# 'Security Deposit' and 'Cleaning Fee' are easier.

# For the first, I assume NaN means no deposit is required, so 0
X_train['Security Deposit'] = X_train['Security Deposit'].fillna(0)

# For Cleaning Fee, use the median of each Room Type
X_train['Cleaning Fee'] = X_train.groupby('Room Type')['Cleaning Fee'].transform(lambda x: x.fillna(x.median()))

X_train[['has_weekly_price','Weekly Price','has_monthly_price','Monthly Price','Security Deposit','Cleaning Fee']].head()

| | has_weekly_price | Weekly Price | has_monthly_price | Monthly Price | Security Deposit | Cleaning Fee |
|---|---|---|---|---|---|---|
| 8221 | 1 | 300.0 | 0 | 999999.0 | 120.0 | 25.0 |
| 6781 | 0 | 999999.0 | 0 | 999999.0 | 0.0 | 10.0 |
| 6329 | 0 | 999999.0 | 0 | 999999.0 | 100.0 | 10.0 |
| 5182 | 0 | 999999.0 | 0 | 999999.0 | 0.0 | 20.0 |
| 10856 | 0 | 999999.0 | 0 | 999999.0 | 0.0 | 10.0 |
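
The indicator-plus-fill idea can be wrapped in a helper so the same fill value is reused on the test set. The sketch below is an assumption-laden variation, not the notebook's method: it fills with the train median instead of 999999, since a very large sentinel may dominate scaling in linear models such as Lasso, while the indicator column still records that the value was missing. The helper name `flag_and_fill` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def flag_and_fill(df, col, fill_value):
    """Add a has_<col> indicator, then fill missing values of col with fill_value."""
    out = df.copy()
    flag = f"has_{col.lower().replace(' ', '_')}"
    out[flag] = out[col].notnull().astype(int)
    out[col] = out[col].fillna(fill_value)
    return out

# Hypothetical toy data
toy = pd.DataFrame({"Weekly Price": [300.0, np.nan, 450.0]})

# Fill value learned on train only; the same number would be passed for the test set
weekly_median = toy["Weekly Price"].median()
filled = flag_and_fill(toy, "Weekly Price", weekly_median)
print(filled)
```

Tree-based models are largely insensitive to the choice of fill value, so which option works better here is something to confirm by cross validation.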


In [36]:
colPrice = X_train['Property Type'].to_frame().join(y_train)
price_map_train = colPrice.groupby('Property Type')['Price'].median()
price_map_train

Property Type
Apartment              55.0
Bed & Breakfast        32.0
Boat                  100.0
Boutique hotel         55.0
Bungalow               17.0
Casa particular        30.0
Chalet                 50.0
Condominium            49.0
Dorm                   27.5
Earth House            22.0
Guest suite            15.0
Guesthouse             40.0
Hostel                 42.0
House                  38.0
Loft                   60.0
Other                  50.0
Serviced apartment     41.5
Tent                   25.0
Timeshare              20.0
Townhouse              33.5
Villa                 255.0
Name: Price, dtype: float64

In [37]:
# 'Property Type', 'Room Type' and 'Bed Type' may be very informative, so I will try to encode them

encode_cols = ['Property Type', 'Room Type', 'Bed Type']

price_map_train = []

col = 'Property Type'
colPrice = X_train[col].to_frame().join(y_train)
price_map_train_Property_Type = colPrice.groupby(col)['Price'].median()
X_train[f'{col}_price_encoded'] = X_train[col].map(price_map_train_Property_Type)
X_train[f'{col}_price_encoded'] = X_train[f'{col}_price_encoded'].fillna(meian_price_train)

col = 'Room Type'
colPrice = X_train[col].to_frame().join(y_train)
price_map_train_Room_Type = colPrice.groupby(col)['Price'].median()
X_train[f'{col}_price_encoded'] = X_train[col].map(price_map_train_Room_Type)
X_train[f'{col}_price_encoded'] = X_train[f'{col}_price_encoded'].fillna(meian_price_train)

col = 'Bed Type'
colPrice = X_train[col].to_frame().join(y_train)
price_map_train_Bed_Type = colPrice.groupby(col)['Price'].median()
X_train[f'{col}_price_encoded'] = X_train[col].map(price_map_train_Bed_Type)
X_train[f'{col}_price_encoded'] = X_train[f'{col}_price_encoded'].fillna(meian_price_train)

print(price_map_train_Property_Type)
print(price_map_train_Room_Type)
print(price_map_train_Bed_Type)

# Drop the original columns at the end
X_train.drop(encode_cols, axis=1, inplace=True)

X_train[['Property Type_price_encoded', 'Room Type_price_encoded', 'Bed Type_price_encoded']].head().T

Property Type
Apartment              55.0
Bed & Breakfast        32.0
Boat                  100.0
Boutique hotel         55.0
Bungalow               17.0
Casa particular        30.0
Chalet                 50.0
Condominium            49.0
Dorm                   27.5
Earth House            22.0
Guest suite            15.0
Guesthouse             40.0
Hostel                 42.0
House                  38.0
Loft                   60.0
Other                  50.0
Serviced apartment     41.5
Tent                   25.0
Timeshare              20.0
Townhouse              33.5
Villa                 255.0
Name: Price, dtype: float64
Room Type
Entire home/apt    72.0
Private room       29.0
Shared room        20.0
Name: Price, dtype: float64
Bed Type
Airbed           58.5
Couch            52.0
Futon            35.0
Pull-out Sofa    53.0
Real Bed         52.0
Name: Price, dtype: float64


| | 8221 | 6781 | 6329 | 5182 | 10856 |
|---|---|---|---|---|---|
| Property Type_price_encoded | 55.0 | 49.0 | 40.0 | 55.0 | 55.0 |
| Room Type_price_encoded | 72.0 | 29.0 | 29.0 | 72.0 | 29.0 |
| Bed Type_price_encoded | 52.0 | 52.0 | 52.0 | 52.0 | 52.0 |
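
The three encoding blocks above follow the same pattern: learn the median price per category on train, map it, and fall back to the overall median. A compact sketch of that pattern, with a fit step and an apply step that can be reused on the test set, could look like this. The function names `fit_median_maps` and `apply_median_maps` and the toy data are assumptions for illustration.

```python
import pandas as pd

def fit_median_maps(X, y, cols):
    """Return {column: Series of median target per category}, learned from X and y."""
    return {col: y.groupby(X[col]).median() for col in cols}

def apply_median_maps(X, maps, fallback):
    """Encode each mapped column; unseen or missing categories get the fallback."""
    out = X.copy()
    for col, mapping in maps.items():
        out[f"{col}_price_encoded"] = out[col].map(mapping).fillna(fallback)
    return out.drop(columns=list(maps))

# Hypothetical toy data
X_tr = pd.DataFrame({"Room Type": ["Private room", "Private room",
                                   "Entire home/apt", "Entire home/apt"]})
y_tr = pd.Series([30.0, 20.0, 70.0, 90.0], name="Price")
X_te = pd.DataFrame({"Room Type": ["Private room", "Shared room"]})

maps = fit_median_maps(X_tr, y_tr, ["Room Type"])
fallback = y_tr.median()

X_te_enc = apply_median_maps(X_te, maps, fallback)
print(X_te_enc)
```

In the toy test set, 'Shared room' never appears in train, so it receives the overall train median, which mirrors the `fillna` fallback used in the cells above.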


In [38]:
# 'Amenities' looks important: after all, a place may be chosen because of one of these items.
X_train['Amenities'].unique()

array(['TV,Internet,Wireless Internet,Wheelchair accessible,Kitchen,Smoking allowed,Elevator in building,Heating,Family/kid friendly,Essentials,Shampoo',
       'TV,Wireless Internet,Kitchen,Elevator in building,Heating,Washer,Essentials,Shampoo,Iron',
       'Wireless Internet,Free parking on premises,Smoking allowed,Breakfast,Pets live on this property,Dog(s),Cat(s),Buzzer/wireless intercom,Heating,Family/kid friendly,First aid kit,Essentials,Hangers,Laptop friendly workspace,translation missing: en.hosting_amenity_50',
       ...,
       'TV,Internet,Wireless Internet,Kitchen,Smoking allowed,Doorman,Elevator in building,Buzzer/wireless intercom,Heating,Family/kid friendly,Washer,Essentials,Shampoo',
       'TV,Kitchen,Buzzer/wireless intercom,Heating,Family/kid friendly,Washer',
       'Internet,Wireless Internet,Kitchen,Pets allowed,Buzzer/wireless intercom,Heating,Family/kid friendly,Washer,Essentials,Shampoo,24-hour check-in,Hangers,Hair dryer,Iron,Laptop friendly workspace'],
  

In [39]:
# They are hard to read like this. First extract every amenity, then count occurrences to see
# the most common ones
X_train['Amenities'] = X_train['Amenities'].fillna("")

# Split each row into a list of amenities
all_amenities = X_train['Amenities'].apply(lambda x: x.split(','))

# Flatten the list
flat_amenities = [item.strip() for sublist in all_amenities for item in sublist if item.strip() != '']

# Count unique amenities
amenities_count = pd.Series(flat_amenities).value_counts()

# Show top 40 amenities
print(amenities_count.head(40))

Wireless Internet                             9358
Kitchen                                       9158
Heating                                       9030
Essentials                                    8576
Washer                                        8509
TV                                            7817
Hangers                                       6657
Shampoo                                       6273
Elevator in building                          6081
Family/kid friendly                           6062
Iron                                          5710
Hair dryer                                    5591
Internet                                      5588
Air conditioning                              5469
Laptop friendly workspace                     5125
Buzzer/wireless intercom                      4591
translation missing: en.hosting_amenity_50    2982
Smoking allowed                               2615
First aid kit                                 2592
24-hour check-in               

As the output above shows, I could flag the most popular amenities with True or False.
However, according to the article AirBnb must have, in the section [Top Amenities Impact](https://www.avantio.com/blog/airbnb-must-haves/), not all amenities have the same impact. I also drew on other articles.

I will use a mix of both lists.

In [41]:
# Fill missing Amenities with empty string
X_train['Amenities'] = X_train['Amenities'].fillna("")

# Split by commas and count the non-empty entries (an empty string counts as 0 amenities)
X_train['total_amenities'] = X_train['Amenities'].apply(lambda x: len([a for a in x.split(',') if a.strip()]))

# Top amenities list
top_amenities = ['Wireless Internet', 'Internet', 'Kitchen', 'Heating', 'Essentials', 'Washer', 'TV',  # first, the most frequent ones
                 'Hot tub', 'Pool', 'Gym', 'Cable TV', 'Parking', 'Pets allowed'] # then the ones from the articles I read

# Create binary columns.
# Note: this is a case-sensitive substring test, so 'Internet' also matches 'Wireless Internet',
# 'TV' also matches 'Cable TV', and 'Parking' does not match 'Free parking on premises'.
for amenity in top_amenities:
    X_train[f'amenity_{amenity.lower().replace(" ", "_")}'] = X_train['Amenities'].apply(lambda x: int(amenity in x))

# Finally, drop the original column
X_train.drop('Amenities', axis=1, inplace=True)
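
The binary columns above rely on `amenity in x`, which is a case-sensitive substring test on the raw string. If exact matches are preferred, one option is to split each row into a set of normalised tokens first. The sketch below uses a hypothetical sample `amenities` and is only one possible approach:

```python
import pandas as pd

# Hypothetical sample rows in the same comma-separated format
amenities = pd.Series([
    "TV,Wireless Internet,Kitchen",
    "Cable TV,Internet,Free parking on premises",
    "",
])

# One set of lower-cased, stripped tokens per row; empty entries are skipped
tokens = amenities.fillna("").apply(
    lambda s: {a.strip().lower() for a in s.split(",") if a.strip()}
)

total_amenities = tokens.apply(len)

# Exact membership: 'internet' no longer matches 'wireless internet'
has_internet = tokens.apply(lambda t: int("internet" in t))
has_tv = tokens.apply(lambda t: int("tv" in t))

# For comparison, the substring test on the raw string
substring_internet = amenities.apply(lambda s: int("Internet" in s))

print(total_amenities.tolist(), has_internet.tolist(), substring_internet.tolist())
```

Whether substring or exact matching is the better feature depends on intent: grouping every kind of parking under one flag, for instance, would call for a case-insensitive substring test instead.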


In [43]:
X_train.describe(include='all').T

| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Experiences Offered | 9933.0 | 1.0 | none | 9933.0 | | | | | | | |
| Neighborhood Overview | 6189.0 | 5489.0 | Se trata de una de las zonas más emblemáticas ... | 24.0 | | | | | | | |
| Host Response Time | 9933.0 | | | | 129.29699 | 328.166926 | 1.0 | 1.0 | 1.0 | 24.0 | 999.0 |
| Host Response Rate | 8696.0 | | | | 94.850966 | 15.15917 | 0.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| Host Neighbourhood | 7488.0 | 100.0 | Malasaña | 736.0 | | | | | | | |
| Host Listings Count | 9930.0 | | | | 9.727492 | 27.607957 | 0.0 | 1.0 | 2.0 | 5.0 | 265.0 |
| Host Total Listings Count | 9930.0 | | | | 9.727492 | 27.607957 | 0.0 | 1.0 | 2.0 | 5.0 | 265.0 |
| Accommodates | 9933.0 | | | | 3.188563 | 2.001807 | 1.0 | 2.0 | 2.0 | 4.0 | 16.0 |
| Bathrooms | 9898.0 | | | | 1.250051 | 0.59647 | 0.0 | 1.0 | 1.0 | 1.0 | 8.0 |
| Bedrooms | 9918.0 | | | | 1.29149 | 0.823198 | 0.0 | 1.0 | 1.0 | 2.0 | 10.0 |


In [45]:
areSomeNulls = X_train.isnull().any()
areSomeNulls[areSomeNulls] # several columns still contain nulls; we will have to see how useful they are

Neighborhood Overview          True
Host Response Rate             True
Host Neighbourhood             True
Host Listings Count            True
Host Total Listings Count      True
Bathrooms                      True
Bedrooms                       True
Beds                           True
Review Scores Rating           True
Review Scores Accuracy         True
Review Scores Cleanliness      True
Review Scores Checkin          True
Review Scores Communication    True
Review Scores Location         True
Review Scores Value            True
Reviews per Month              True
dtype: bool

In [None]:
# 'Host Listings Count' and 'Host Total Listings Count' are directly correlated (their summary statistics above are identical), so I keep the one with fewer NaN