# machine-learning-practica
### **ML Practice** - Exercise for the Bootcamp Inteligencia Artificial Full Stack Edición III

This project is a deliverable for the practical assignment of the Master Bootcamp Inteligencia Artificial Full Stack Edición III, run by the training centre [@Keepcoding](https://github.com/KeepCoding).

---

The goal of the assignment is to predict Airbnb listing prices from the data in the file [airbnb-listings-extract.csv](./airbnb-listings-extract.csv).

## Contents

The expected steps are:
1. Data preparation: train/test split
2. Exploratory analysis, for example:
    - Head, describe, dtypes, etc.
    - Outliers
    - Correlation
3. Preprocessing:
    - Feature removal, via selection (random forest/Lasso), high correlation, a high percentage of missing values, or whatever method seems appropriate.
    - Feature generation
4. Modelling:
    - Cross validation
    - Evaluation; ideally of more than one model, so that they can be compared.
5. Conclusion: written, not numerical; a couple of lines is more than enough.

## 1. Data preparation: train/test split

In [5]:
# We start with the libraries we will use
import numpy as np
import pandas as pd

# display settings - comment out as needed
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [6]:
# Load the data and take a first look
datosABNB = pd.read_csv("./airbnb-listings-extract.csv", sep=";")
# Exploratory analysis is step 2, but splitting the data requires at least knowing which column is the target
datosABNB.columns  # it appears to be Price

Index(['ID', 'Listing Url', 'Scrape ID', 'Last Scraped', 'Name', 'Summary',
       'Space', 'Description', 'Experiences Offered', 'Neighborhood Overview',
       'Notes', 'Transit', 'Access', 'Interaction', 'House Rules',
       'Thumbnail Url', 'Medium Url', 'Picture Url', 'XL Picture Url',
       'Host ID', 'Host URL', 'Host Name', 'Host Since', 'Host Location',
       'Host About', 'Host Response Time', 'Host Response Rate',
       'Host Acceptance Rate', 'Host Thumbnail Url', 'Host Picture Url',
       'Host Neighbourhood', 'Host Listings Count',
       'Host Total Listings Count', 'Host Verifications', 'Street',
       'Neighbourhood', 'Neighbourhood Cleansed',
       'Neighbourhood Group Cleansed', 'City', 'State', 'Zipcode', 'Market',
       'Smart Location', 'Country Code', 'Country', 'Latitude', 'Longitude',
       'Property Type', 'Room Type', 'Accommodates', 'Bathrooms', 'Bedrooms',
       'Beds', 'Bed Type', 'Amenities', 'Square Feet', 'Price', 'Weekly Price',
       'Month

In [7]:
# sklearn imports
from sklearn.model_selection import train_test_split

# Now we can split into train/test
X1 = datosABNB.loc[:,datosABNB.columns != "Price"]
y1 = datosABNB["Price"]
X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.3, shuffle=True, random_state=0)

## 2. Exploratory analysis

#### NOTE:
> Before starting the exploratory analysis, I should say that I began it before the topic was covered in class and with zero experience in it. You will therefore see many things that are unnecessary or that could be done better. I even wrote a function that gathers a lot of information and on which, in hindsight, I wasted my time. Still, I enjoyed trying and wanted to show the effort. You will see that I gradually stick to what we saw in class, and that I even repeat some analyses for the reason above. I am proud of having tried it beforehand, so I am leaving it in.

In [10]:
# A quick look at everything
X_train.head(3).T

Unnamed: 0,4706,6422,4339
ID,2156319,3377153,14800635
Listing Url,https://www.airbnb.com/rooms/2156319,https://www.airbnb.com/rooms/3377153,https://www.airbnb.com/rooms/14800635
Scrape ID,20170407214119,20170407214119,20170407214119
Last Scraped,2017-04-08,2017-04-08,2017-04-08
Name,GRANT VII Plaza Mayor,Nice flat in Plaza Mayor (lift),* ROOM double Barrio Salamanca *
Summary,,"Flat is in the centre town, really close to Pl...","Private room for two persons has a double bed,..."
Space,Beautiful and charming apartment recently deco...,You cannot find a better location to stay in M...,It is a penthouse located in the salamanca dis...
Description,Beautiful and charming apartment recently deco...,"Flat is in the centre town, really close to Pl...","Private room for two persons has a double bed,..."
Experiences Offered,none,none,none
Neighborhood Overview,,"It´s really cool, with a lot of new business (...",The District of Salamanca is one of the 21 dis...


In [11]:
# let's start with shape, to see what awaits us
X_train.shape # 88 columns to explore

(10346, 88)

In [12]:
# maybe some column is entirely null and could be dropped
areAllNulls = X_train.isnull().all()
areAllNulls[areAllNulls == True] # no luck

Series([], dtype: bool)

In [13]:
# Let's see which ones contain nulls
areSomeNulls = X_train.isnull().any()
areSomeNulls[areSomeNulls == True] # several columns have some nulls; we will have to see how useful they are

Name                              True
Summary                           True
Space                             True
Description                       True
Neighborhood Overview             True
Notes                             True
Transit                           True
Access                            True
Interaction                       True
House Rules                       True
Thumbnail Url                     True
Medium Url                        True
Picture Url                       True
XL Picture Url                    True
Host Name                         True
Host Since                        True
Host Location                     True
Host About                        True
Host Response Time                True
Host Response Rate                True
Host Acceptance Rate              True
Host Thumbnail Url                True
Host Picture Url                  True
Host Neighbourhood                True
Host Listings Count               True
Host Total Listings Count
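
Knowing that a column contains *some* nulls says little about how useful it is; the fraction of missing values is more informative. A minimal sketch of that idea on a toy frame (the data and the 0.5 cut-off are assumptions for illustration, not values from the notebook):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for X_train (hypothetical data)
toy = pd.DataFrame({
    "Name": ["a", "b", None, "d"],
    "Square Feet": [np.nan, np.nan, np.nan, 50.0],
    "Accommodates": [2, 4, 1, 3],
})

# isnull().mean() gives the fraction of missing values per column
null_ratio = toy.isnull().mean().sort_values(ascending=False)
print(null_ratio)

# Columns above a threshold are candidates for removal (0.5 is an arbitrary cut-off)
mostly_null = null_ratio[null_ratio > 0.5].index.tolist()
print(mostly_null)
```

On the real data the same two lines could be run on `X_train` instead of `toy`.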

In [14]:
# Let's look at the datatypes
X_train.dtypes # many strings; we need to see what we can get out of them

ID                                  int64
Listing Url                        object
Scrape ID                           int64
Last Scraped                       object
Name                               object
Summary                            object
Space                              object
Description                        object
Experiences Offered                object
Neighborhood Overview              object
Notes                              object
Transit                            object
Access                             object
Interaction                        object
House Rules                        object
Thumbnail Url                      object
Medium Url                         object
Picture Url                        object
XL Picture Url                     object
Host ID                             int64
Host URL                           object
Host Name                          object
Host Since                         object
Host Location                     

In [15]:
# we also look for duplicated rows
dup = X_train.duplicated()
dup[dup == True]

Series([], dtype: bool)

In [16]:
#### let's detect outliers!
# Take a first look at the data
X_train.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
ID,10346.0,,,,10283536.899188,5555730.551311,19864.0,5597252.75,11305550.0,15326062.5,18583609.0
Listing Url,10346.0,10346.0,https://www.airbnb.com/rooms/2156319,1.0,,,,,,,
Scrape ID,10346.0,,,,20170376343455.32,537465754.026358,20160104002432.0,20170407214119.0,20170407214119.0,20170407214119.0,20170615002708.0
Last Scraped,10346.0,35.0,2017-04-08,9554.0,,,,,,,
Name,10345.0,10112.0,Apartamento en el centro de Madrid,9.0,,,,,,,
Summary,9922.0,9358.0,"Unique apartment in vibrant neighborhoods, car...",48.0,,,,,,,
Space,7619.0,7203.0,Los Apartamentos Good Stay Prado se encuentran...,18.0,,,,,,,
Description,10340.0,10029.0,Es un piso con 6 habitaciones de las que 5 ha...,15.0,,,,,,,
Experiences Offered,10346.0,5.0,none,10334.0,,,,,,,
Neighborhood Overview,6397.0,5673.0,Se trata de una de las zonas más emblemáticas ...,22.0,,,,,,,
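
`describe()` shows the quartiles, and a common way to turn them into an outlier rule is the interquartile range (IQR). The sketch below is one possible approach and not necessarily what `analisisDF` does; the helper name `iqr_outliers`, the factor `k=1.5` and the toy prices are assumptions:

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]

# Hypothetical nightly prices with one obviously extreme value
prices = pd.Series([30, 45, 50, 55, 60, 65, 900])
print(iqr_outliers(prices))
```

The rule only makes sense for numeric columns, so on the real data it would be applied column by column after selecting the numeric dtypes.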


In [17]:
# NOTE: Better ways of doing some of these things already exist, but I discovered them later in class.
#       It seems I reinvented the wheel :S
#       I do not regret having done this. Most likely I will not use it again, but I am leaving it in
#       because, at the end of the day, it was my own effort.
#
# I ended up analysing so many things that I turned it into a function in functions.py:
from functions import analisisDF

# We try it on a small dataset:
dataTest = {
    "copy": [420, 380, 390, 411, 400, 395, 410],
    "copied": [420, 380, 390, 411, 400, 395, 410],
    "similar": [415, 380, 390, 411, 400, 395, 410],
    "proporcional": [415*1.5, 380*1.5, 390*1.5, 411*1.5, 400*1.5, 395*1.5, 410*1.5],
    "contains1": ["a", "b", "c", "d", "e", "f", "g"],
    "contains2": ["aaa", "bbbb", "cccc", "dxxxx", "exxxx", "fffff", "gccccc"],
    "allNan": [None, None, None, None, None, None, None],
    "manyNan": [None, 1, None, None, None, None, None],
    "outlier": [1, 1, 390, 1, 1, 1, -1111],
    "formatInconsistence": [1, 1, True, 1, 1, 1, 1]
}
df = pd.DataFrame(dataTest)
print(df) 

# let's see, it looks good
resultMap = analisisDF(df)
resultMap

   copy  copied  similar  proporcional contains1 contains2 allNan  manyNan  \
0   420     420      415         622.5         a       aaa   None      NaN   
1   380     380      380         570.0         b      bbbb   None      1.0   
2   390     390      390         585.0         c      cccc   None      NaN   
3   411     411      411         616.5         d     dxxxx   None      NaN   
4   400     400      400         600.0         e     exxxx   None      NaN   
5   395     395      395         592.5         f     fffff   None      NaN   
6   410     410      410         615.0         g    gccccc   None      NaN   

   outlier formatInconsistence  
0        1                   1  
1        1                   1  
2      390                True  
3        1                   1  
4        1                   1  
5        1                   1  
6    -1111                   1  


{'duplicateCols': ['cols:[copy|copied]'],
 'similarCols': ['cols:[copy|copied],rate:[1.0]',
  'cols:[copy|similar],rate:[0.8571428571428571]',
  'cols:[copied|similar],rate:[0.8571428571428571]'],
 'containsCols': ['cols:[copy|copied],rate:[1.0]',
  'cols:[copy|similar],rate:[0.8571428571428571]',
  'cols:[copied|similar],rate:[0.8571428571428571]',
  'cols:[contains1|contains2],rate:[1.0]',
  'cols:[outlier|formatInconsistence],rate:[0.8571428571428571]'],
 'formatInconsitenceCols': ["col:[formatInconsistence],types:[<class 'int'>|<class 'bool'>]"],
 'tooManyNanCols': ['col:[allNan],rate:[1.0]',
  'col:[manyNan],rate:[0.8571428571428571]'],
 'proportionalCols': ['cols:[copy|copied],proportion:[1.0]',
  'cols:[similar|proporcional],proportion:[0.6666666666666666]'],
 'outliersCols': ['col:[outlier],outliersIndex:[2, 6]'],
 'uniqueCols': ['col:[allNan],unique rate:[0.14285714285714285],vals:[Series([], Name: count, dtype: int64)]',
  'col:[formatInconsistence],unique rate:[0.14285714285

In [18]:
# now with the real data
#resultMap = analisisDF(X_train)
#resultMap

{'duplicateCols': [],
 'similarCols': ['cols:[Host Acceptance Rate|Square Feet],rate:[0.957954765126619]',
  'cols:[Host Acceptance Rate|Has Availability],rate:[0.997583607191185]',
  'cols:[Host Acceptance Rate|License],rate:[0.9733230233906824]',
  'cols:[Host Acceptance Rate|Jurisdiction Names],rate:[0.9844384303112313]',
  'cols:[Host Listings Count|Host Total Listings Count],rate:[1.0]',
  'cols:[City|Market],rate:[0.9438430311231394]',
  'cols:[Square Feet|Has Availability],rate:[0.9596945679489658]',
  'cols:[Square Feet|License],rate:[0.9389135897931568]',
  'cols:[Square Feet|Jurisdiction Names],rate:[0.9475159481925381]',
  'cols:[Has Availability|License],rate:[0.9750628262130292]',
  'cols:[Has Availability|Jurisdiction Names],rate:[0.9858882659965204]',
  'cols:[Review Scores Checkin|Review Scores Communication],rate:[0.8568528900057993]',
  'cols:[License|Jurisdiction Names],rate:[0.9628842064566016]'],
 'containsCols': ['cols:[ID|Listing Url],rate:[1.0]',
  'cols:[Scrape

> The previous run finished successfully, but it used a lot of memory, so I have commented it out.
> I therefore copied the result into a text file, which can be seen [here](./resultMap.txt)
>
> All very nice, but from now on I will stick to what we saw in class. For example, "similarCols" is easy to see in the correlation matrix instead of reinventing the wheel.
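
As an illustration of reading similar columns off the correlation matrix, here is a minimal sketch that lists highly correlated numeric pairs. The helper `high_corr_pairs`, the 0.9 threshold and the toy data are assumptions for illustration; it covers numeric columns only, so it is not a drop-in replacement for everything "similarCols" reports:

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.9):
    """List (col_a, col_b, corr) for numeric column pairs with |corr| >= threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return [(a, b, round(v, 3)) for (a, b), v in pairs.items() if v >= threshold]

# Toy frame (hypothetical values) where two columns are identical
toy = pd.DataFrame({
    "Host Listings Count": [1, 2, 3, 4, 5],
    "Host Total Listings Count": [1, 2, 3, 4, 5],
    "Bedrooms": [1, 3, 2, 1, 2],
})
print(high_corr_pairs(toy))
```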

**My function's output is a bit hard to read, but copying it into Notepad++ lets me navigate it easily, and this is what I found:**
- There are no duplicate columns
  ```
  'duplicateCols': [],
  ```
- There are many columns with similar values:
  ```
  'similarCols': ['cols:[Host Acceptance Rate|Square Feet],rate:[0.957954765126619]',
  'cols:[Host Acceptance Rate|Has Availability],rate:[0.997583607191185]',
  'cols:[Host Acceptance Rate|License],rate:[0.9733230233906824]',
  'cols:[Host Acceptance Rate|Jurisdiction Names],rate:[0.9844384303112313]',
  'cols:[Host Listings Count|Host Total Listings Count],rate:[1.0]',
  'cols:[City|Market],rate:[0.9438430311231394]',
  'cols:[Square Feet|Has Availability],rate:[0.9596945679489658]',
  'cols:[Square Feet|License],rate:[0.9389135897931568]',
  'cols:[Square Feet|Jurisdiction Names],rate:[0.9475159481925381]',
  'cols:[Has Availability|License],rate:[0.9750628262130292]',
  'cols:[Has Availability|Jurisdiction Names],rate:[0.9858882659965204]',
  'cols:[Review Scores Checkin|Review Scores Communication],rate:[0.8568528900057993]',
  'cols:[License|Jurisdiction Names],rate:[0.9628842064566016]']
  ```
- Much of the data in some columns is contained in others:
  ```
  ['cols:[ID|Listing Url],rate:[1.0]',
  'cols:[Scrape ID|Guests Included],rate:[0.9501256524260584]',
  'cols:[Last Scraped|Guests Included],rate:[0.9467427024937174]',
  'cols:[Summary|Description],rate:[0.9237386429537986]',
  'cols:[Picture Url|Accommodates],rate:[0.9022810748115213]',
  'cols:[Picture Url|Guests Included],rate:[0.9010245505509376]',
  'cols:[Picture Url|Minimum Nights],rate:[0.8912623236033249]',
  'cols:[Host ID|Host URL],rate:[1.0]',
  'cols:[Host Since|Guests Included],rate:[0.9303112313937754]',
  'cols:[Host Location|Country],rate:[0.8350086990141118]',
  'cols:[Host Acceptance Rate|Square Feet],rate:[0.957954765126619]',
  'cols:[Host Acceptance Rate|Has Availability],rate:[0.997583607191185]',
  'cols:[Host Acceptance Rate|License],rate:[0.9733230233906824]',
  'cols:[Host Acceptance Rate|Jurisdiction Names],rate:[0.9844384303112313]',
  'cols:[Host Thumbnail Url|Accommodates],rate:[0.8705780011598685]',
  'cols:[Host Thumbnail Url|Guests Included],rate:[0.8924221921515562]',
  'cols:[Host Thumbnail Url|Minimum Nights],rate:[0.862265609897545]',
  'cols:[Host Picture Url|Accommodates],rate:[0.8706746568722211]',
  'cols:[Host Picture Url|Guests Included],rate:[0.8924221921515562]',
  'cols:[Host Picture Url|Minimum Nights],rate:[0.862265609897545]',
  'cols:[Host Listings Count|Host Total Listings Count],rate:[1.0]',
  'cols:[Street|City],rate:[0.9996133771505896]',
  'cols:[Street|State],rate:[0.989367871641214]',
  'cols:[Street|Zipcode],rate:[0.9665571235260004]',
  'cols:[Street|Market],rate:[0.949255751014885]',
  'cols:[Street|Country],rate:[0.9999033442876474]',
  'cols:[City|State],rate:[0.9069205490044462]',
  'cols:[City|Market],rate:[0.9460661125072491]',
  'cols:[City|Smart Location],rate:[0.9996133771505896]',
  'cols:[State|Market],rate:[0.8988014691668278]',
  'cols:[Market|Smart Location],rate:[0.9461627682196018]',
  'cols:[Smart Location|Country],rate:[0.9887879373670984]',
  'cols:[Latitude|Accommodates],rate:[0.8080417552677364]',
  'cols:[Accommodates|Geolocation],rate:[0.9048907790450416]',
  'cols:[Square Feet|Has Availability],rate:[0.9596945679489658]',
  'cols:[Square Feet|License],rate:[0.9389135897931568]',
  'cols:[Square Feet|Jurisdiction Names],rate:[0.9475159481925381]',
  'cols:[Guests Included|Calendar last Scraped],rate:[0.945872801082544]',
  'cols:[Guests Included|Geolocation],rate:[0.8985115020297699]',
  'cols:[Minimum Nights|Geolocation],rate:[0.8877827179586314]',
  'cols:[Has Availability|License],rate:[0.9750628262130292]',
  'cols:[Has Availability|Jurisdiction Names],rate:[0.9858882659965204]',
  'cols:[Review Scores Checkin|Review Scores Communication],rate:[0.8568528900057993]',
  'cols:[License|Jurisdiction Names],rate:[0.9628842064566016]'],
  ```
- Some columns mix different data types within the same column (the `float` entries are most likely NaN placeholders for missing values); if they are useful they will need to be normalized
  ```
  ["col:[Name],types:[<class 'str'>|<class 'float'>]",
  "col:[Summary],types:[<class 'float'>|<class 'str'>]",
  "col:[Space],types:[<class 'str'>|<class 'float'>]",
  "col:[Description],types:[<class 'str'>|<class 'float'>]",
  "col:[Neighborhood Overview],types:[<class 'float'>|<class 'str'>]",
  "col:[Notes],types:[<class 'float'>|<class 'str'>]",
  "col:[Transit],types:[<class 'float'>|<class 'str'>]",
  "col:[Access],types:[<class 'float'>|<class 'str'>]",
  "col:[Interaction],types:[<class 'float'>|<class 'str'>]",
  "col:[House Rules],types:[<class 'str'>|<class 'float'>]",
  "col:[Thumbnail Url],types:[<class 'float'>|<class 'str'>]",
  "col:[Medium Url],types:[<class 'float'>|<class 'str'>]",
  "col:[Picture Url],types:[<class 'str'>|<class 'float'>]",
  "col:[XL Picture Url],types:[<class 'float'>|<class 'str'>]",
  "col:[Host Name],types:[<class 'str'>|<class 'float'>]",
  "col:[Host Since],types:[<class 'str'>|<class 'float'>]",
  "col:[Host Location],types:[<class 'str'>|<class 'float'>]",
  "col:[Host About],types:[<class 'str'>|<class 'float'>]",
  "col:[Host Response Time],types:[<class 'str'>|<class 'float'>]",
  "col:[Host Acceptance Rate],types:[<class 'float'>|<class 'str'>]",
  "col:[Host Thumbnail Url],types:[<class 'str'>|<class 'float'>]",
  "col:[Host Picture Url],types:[<class 'str'>|<class 'float'>]",
  "col:[Host Neighbourhood],types:[<class 'str'>|<class 'float'>]",
  "col:[Host Verifications],types:[<class 'str'>|<class 'float'>]",
  "col:[Neighbourhood],types:[<class 'str'>|<class 'float'>]",
  "col:[Neighbourhood Group Cleansed],types:[<class 'str'>|<class 'float'>]",
  "col:[City],types:[<class 'str'>|<class 'float'>]",
  "col:[State],types:[<class 'str'>|<class 'float'>]",
  "col:[Zipcode],types:[<class 'str'>|<class 'float'>]",
  "col:[Market],types:[<class 'str'>|<class 'float'>]",
  "col:[Country],types:[<class 'str'>|<class 'float'>]",
  "col:[Amenities],types:[<class 'str'>|<class 'float'>]",
  "col:[Has Availability],types:[<class 'float'>|<class 'str'>]",
  "col:[First Review],types:[<class 'str'>|<class 'float'>]",
  "col:[Last Review],types:[<class 'str'>|<class 'float'>]",
  "col:[License],types:[<class 'float'>|<class 'str'>]",
  "col:[Jurisdiction Names],types:[<class 'float'>|<class 'str'>]",
  "col:[Features],types:[<class 'str'>|<class 'float'>]"]
  ```
- Some columns are mostly null. There are two options: either drop them, or treat the nulls as an "undefined" category and check whether that carries any information
  ```
  ['col:[Host Acceptance Rate],rate:[0.997583607191185]',
  'col:[Square Feet],rate:[0.960371157935434]',
  'col:[Has Availability],rate:[0.9993234100135318]',
  'col:[License],rate:[0.9757394161994974]',
  'col:[Jurisdiction Names],rate:[0.9861782331335782]'],
  ```
- Some columns have a small set of unique values; I only paste the ones I found relevant:
  ```
    'col:[Experiences Offered],unique rate:[0.0004832785617630002],vals:[Experiences Offered
    none        10334
    business        6
    social          3
    family          2
    romantic        1
    Name: count, dtype: int64]',
  'col:[Host Response Time],unique rate:[0.0004832785617630002],vals:[Host Response Time
    within an hour        5544
    within a few hours    2008
    within a day          1261
    a few days or more     203
    Name: count, dtype: int64]',

    This one would be worth converting to numbers
    'col:[Host Acceptance Rate],unique rate:[0.0009665571235260004],vals:[Host Acceptance Rate
    100%    14
    0%       2
    85%      2
    74%      2
    67%      1
    96%      1
    95%      1
    88%      1
    80%      1
    "col:[Host Neighbourhood],unique rate:[0.032186352213415814],vals:[Host Neighbourhood
    Malasaña                             687
    La Latina                            675
    Embajadores                          637
    Sol                                  519
    Justicia                             486
    Cortes                               432
    Palacio                              349
    Argüelles                            217
    Aluche                               190
    Carabanchel                          172
    Trafalgar                            156
    Rios Rosas                           147
    Ciudad Lineal                        134
    Palos do Moguer                      134
    L'Antiga Esquerra de l'Eixample      134
    Goya                                 127
    .... and others

  This one can be normalized
  'col:[Host Verifications],unique rate:[0.017301372511115406],vals:[Host Verifications
    email,phone,reviews,jumio                                                                                   2598
    email,phone,reviews                                                                                         2462
    email,phone,reviews,jumio,government_id                                                                      624
    email,phone,facebook,reviews,jumio                                                                           595
    email,phone,facebook,reviews                                                                                 517
    email,phone                                                                                                  427
    email,phone,reviews,jumio,work_email                                                                         309
    email,phone,reviews,jumio,offline_government_id,government_id                                                257
    email,phone,facebook,reviews,jumio,government_id                                                             245
    email,phone,reviews,work_email                                                                               189
    email,phone,google,reviews,jumio,government_id                                                               148
    email,phone,reviews,manual_offline,jumio                                                                     130
    phone                                                                                                        119
    email,phone,facebook                                                                                         100
    email,phone,facebook,reviews,jumio,work_email                                                                 91
    phone,reviews                                                                                                 89
    email,phone,facebook,reviews,jumio,offline_government_id,government_id                                        86
    .... and others
    "col:[Neighbourhood],unique rate:[0.03150976222694761],vals:[Neighbourhood
    Malasaña                             608
    La Latina                            579
    Embajadores                          551
    Sol                                  512
    Cortes                               406
    Justicia                             396
    Palacio                              287
    Aluche                               159
    Argüelles                            157
    Trafalgar                            151
    Carabanchel                          137
    Palos do Moguer                      126
    Ciudad Lineal                        125
    Goya                                 111
    Puente de Vallecas                    87
    Guindalera                            87
    Arapiles                              83
    Recoletos                             82
    Pacifico                              70
    Almagro                               69
    Hortaleza                             63
    Gaztambide                            61
    Castellana                            58
    Lista                                 57
    Cuatro Caminos                        57
    Acacias                               52
    Fuencarral-el Pardo                   52
    Usera                                 49
    Ibiza                                 48
    San Blas                              46
    Delicias                              44
    Prosperidad                           42
    La Chopera                            42
    Rios Rosas                            41
    Barajas                               38
    Imperial                              38
    .... and others

  We discussed this one in class: rows that are not Madrid make no sense, perhaps Barcelona at most
    "col:[City],unique rate:[0.02029769959404601],vals:[City
    Madrid                                 9249
    Barcelona                               211
    London                                   99
    Paris                                    70
    马德里                                      42
    Palma                                    37
    Berlin                                   29
    Alcúdia                                  29
    Roma                                     28
    New York                                 23
    Los Angeles                              20
    Brooklyn                                 18
    Wien                                     18
    Dublin                                   18
    Amsterdam                                16
    Madrid, Comunidad de Madrid, ES          12
    Toronto                                  12
    Inca                                     12
    Rome                                     11
    Pollença                                 10
    Palma de Mallorca                         9
    Washington                                8
    Bondi Beach                               8
    Venezia                                   7
    Búger                                     7
    San Francisco                             7
    madrid                                    6
    Chicago                                   6
    Deià                                      6
    Santa Margalida                           6
    .... and others
  'col:[Room Type],unique rate:[0.00028996713705780014],vals:[Room Type
    Entire home/apt    6330
    Private room       3873
    Shared room         143
    Name: count, dtype: int64]',
  'col:[Accommodates],unique rate:[0.0015464913976416005],vals:[Accommodates
    2     3673
    4     2303
    1     1420
    3     1009
    6      816
    5      478
    8      248
    7      160
    10      91
    9       53
    12      37
    16      24
    11      16
    14      11
    15       4
    13       3
    Name: count, dtype: int64]',
  'col:[Bathrooms],unique rate:[0.0017398028223468006],vals:[Bathrooms
    1.0    7716
    2.0    1624
    1.5     384
    3.0     226
    2.5      90
    4.0      54
    5.0      50
    0.5      49
    0.0      44
    6.0      23
    3.5      20
    4.5      16
    5.5       4
    8.0       3
    7.0       3
    6.5       1
    7.5       1
    Name: count, dtype: int64]',
    'col:[Bedrooms],unique rate:[0.0010632128358786005],vals:[Bedrooms
    1.0     6812
    2.0     1833
    0.0      707
    3.0      651
    4.0      213
    5.0       68
    6.0       24
    7.0        8
    10.0       5
    8.0        5
    Name: count, dtype: int64]',
      'col:[Beds],unique rate:[0.0016431471099942006],vals:[Beds
    1.0     5123
    2.0     2697
    3.0     1151
    4.0      650
    5.0      269
    6.0      170
    7.0       77
    8.0       73
    10.0      39
    9.0       29
    16.0      10
    12.0       7
    14.0       5
    13.0       5
    11.0       4
    15.0       3
    Name: count, dtype: int64]',
      'col:[Bed Type],unique rate:[0.0004832785617630002],vals:[Bed Type
    Real Bed         10126
    Pull-out Sofa      169
    Futon               32
    Couch               14
    Airbed               5
    Name: count, dtype: int64]',
    'col:[Cancellation Policy],unique rate:[0.0007732456988208003],vals:[Cancellation Policy
    strict             4062
    flexible           3259
    moderate           2939
    strict_new           26
    super_strict_60      21
    moderate_new         17
    super_strict_30      14
    flexible_new          8
  can be normalized
    'col:[Features],unique rate:[0.008022424125265803],vals:[Features
    Host Has Profile Pic,Host Identity Verified,Is Location Exact,Requires License                                   1779
    Host Has Profile Pic,Is Location Exact,Requires License                                                           1472
    Host Has Profile Pic,Host Identity Verified,Is Location Exact,Requires License,Instant Bookable                   1233
    Host Has Profile Pic,Host Identity Verified,Requires License                                                      980
    Host Has Profile Pic,Requires License                                                                             970
    Host Has Profile Pic,Is Location Exact,Requires License,Instant Bookable                                          815
    Host Has Profile Pic,Host Identity Verified,Requires License,Instant Bookable                                     587
    Host Has Profile Pic,Requires License,Instant Bookable                                                            481
    Host Is Superhost,Host Has Profile Pic,Host Identity Verified,Is Location Exact,Requires License                  306
    Host Is Superhost,Host Has Profile Pic,Host Identity Verified,Is Location Exact,Requires License,Instant Bookable 213
    Host Has Profile Pic,Host Identity Verified,Is Location Exact                                                     164
    Host Is Superhost,Host Has Profile Pic,Host Identity Verified,Requires License                                    135
    Host Has Profile Pic,Host Identity Verified,Is Location Exact,Requires License,Require Guest Phone Verification   119
    Host Is Superhost,Host Has Profile Pic,Is Location Exact,Requires License                                         114
  .... and others
  ```
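
Two of the observations above (percentage strings that should become numbers, and comma-separated lists that can be normalized) can be sketched as follows. This is only one possible treatment, shown on hypothetical toy rows; the `verif_` prefix is an invented naming choice:

```python
import pandas as pd

# Toy rows (hypothetical) mimicking the two column formats seen above
toy = pd.DataFrame({
    "Host Acceptance Rate": ["100%", "85%", None],
    "Host Verifications": ["email,phone,reviews", "phone", None],
})

# "85%" -> 85.0; missing values stay NaN
toy["Host Acceptance Rate"] = pd.to_numeric(
    toy["Host Acceptance Rate"].str.rstrip("%"), errors="coerce"
)

# One 0/1 column per token in the comma-separated list
flags = toy["Host Verifications"].str.get_dummies(sep=",").add_prefix("verif_")
toy = pd.concat([toy.drop(columns="Host Verifications"), flags], axis=1)
print(toy)
```

The same `str.get_dummies` pattern would apply to `Features`, which uses the same comma-separated layout.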

In [25]:
# Before calling it preprocessing, let's start by keeping the columns of interest, without preprocessing them

# For rows I will only consider the city, Madrid only. I have seen it is not normalized, so I will use contains with case=False
datosABNB = datosABNB[datosABNB['City'].str.contains('madrid', case=False, na=False)].copy()

In [27]:
datosABNB.shape

(13245, 89)
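
To make the behaviour of that filter concrete, here is a small sketch on a handful of `City` spellings taken from the value counts above. It also shows one spelling the substring test misses; the `aliases` set is a hypothetical extension, not something the notebook does:

```python
import pandas as pd

cities = pd.Series(["Madrid", "madrid", "Madrid, Comunidad de Madrid, ES",
                    "马德里", "Barcelona", None])

# Same filter as above: case-insensitive substring match, NaN treated as no match
is_madrid = cities.str.contains("madrid", case=False, na=False)
print(cities[is_madrid].tolist())

# "马德里" is Madrid written in Chinese, so the substring test does not catch it.
# An explicit alias set is one (hypothetical) way to include such spellings.
aliases = {"马德里"}
is_madrid_ext = is_madrid | cities.isin(aliases)
print(int(is_madrid.sum()), int(is_madrid_ext.sum()))
```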

In [31]:
# Let's continue with the easy part: IDs and URLs do not contain relevant data
columns_to_drop = ['ID', 'Listing Url', 'Scrape ID', 'Thumbnail Url', 
                   'Medium Url', 'Picture Url', 'XL Picture Url', 
                   'Host ID', 'Host URL', 'Host Thumbnail Url', 
                   'Host Picture Url']

# Drop the columns
datosABNB.drop(columns=columns_to_drop, inplace=True)

datosABNB.shape

(13245, 78)

## 3. Preprocessing

In [None]:
# Let's start with the easy columns:
# ID, Listing Url, Scrape ID are IDs or URLs that can easily be discarded
# Name, Summary, Space, Description are free text with no apparent exploitable structure. However, from browsing Booking, I think that the longer the description, summary and space texts are, the more the hosts tend to care about selling their place well. So I will try to see whether the sum of the lengths of those texts shows any proportionality with the price.
# 