# machine-learning-practica
### **ML Practice** - Exercise for the Bootcamp Inteligencia Artificial Full Stack Edición III

This project is a deliverable for the practical exercise of the Master Bootcamp Inteligencia Artificial Full Stack Edición III, run by the training center [@Keepcoding](https://github.com/KeepCoding).

---

The goal of the exercise is to predict the price of Airbnb listings from the data available in the file [airbnb-listings-extract.csv](./airbnb-listings-extract.csv).

## Contents

The expected steps are the following:
1. Data preparation: train/test split
2. Exploratory analysis, for example:
    - Head, describe, dtypes, etc.
    - Outliers
    - Correlation
3. Preprocessing:
    - Variable elimination, through selection (random forest/Lasso), high correlation, a high percentage of missing values, or whichever method is considered appropriate.
    - Variable generation
5. Modeling:
    - Cross validation
    - Evaluation; preferably of more than one model, so that they can be compared with each other.
6. Conclusion: written, not numeric; a couple of lines is more than enough.
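
One of the elimination criteria listed above, high correlation, could be sketched as follows. This is only an illustration on a made-up DataFrame; the helper name `drop_highly_correlated` and the 0.95 threshold are assumptions, not part of the exercise statement.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop one column from each pair whose absolute Pearson correlation exceeds threshold.

    Hypothetical helper for illustration; only numeric columns are compared.
    """
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


# Toy data (made up): "beds_x2" is an exact multiple of "beds".
toy = pd.DataFrame({
    "beds": [1, 2, 3, 4, 5],
    "beds_x2": [2, 4, 6, 8, 10],
    "bathrooms": [1, 1, 2, 1, 3],
})
reduced = drop_highly_correlated(toy)
print(list(reduced.columns))
```

Of each correlated pair, the sketch keeps the column that appears first, which is an arbitrary choice; in practice one might prefer to keep the column with fewer missing values.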

## 1. Data preparation: train/test split

In [5]:
# We start with the libraries we will use
import numpy as np
import pandas as pd

# display settings - comment/uncomment as needed
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# sklearn imports
from sklearn.model_selection import train_test_split

#import matplotlib.pyplot as plt
#plt.style.use("seaborn-v0_8")

In [6]:
# Loading the data and taking a first look
datosABNB = pd.read_csv("./airbnb-listings-extract.csv", sep=";")
# although exploratory analysis belongs to step 2, splitting the data requires at least knowing which column is the target
datosABNB.columns # it looks like it is Price

Index(['ID', 'Listing Url', 'Scrape ID', 'Last Scraped', 'Name', 'Summary',
       'Space', 'Description', 'Experiences Offered', 'Neighborhood Overview',
       'Notes', 'Transit', 'Access', 'Interaction', 'House Rules',
       'Thumbnail Url', 'Medium Url', 'Picture Url', 'XL Picture Url',
       'Host ID', 'Host URL', 'Host Name', 'Host Since', 'Host Location',
       'Host About', 'Host Response Time', 'Host Response Rate',
       'Host Acceptance Rate', 'Host Thumbnail Url', 'Host Picture Url',
       'Host Neighbourhood', 'Host Listings Count',
       'Host Total Listings Count', 'Host Verifications', 'Street',
       'Neighbourhood', 'Neighbourhood Cleansed',
       'Neighbourhood Group Cleansed', 'City', 'State', 'Zipcode', 'Market',
       'Smart Location', 'Country Code', 'Country', 'Latitude', 'Longitude',
       'Property Type', 'Room Type', 'Accommodates', 'Bathrooms', 'Bedrooms',
       'Beds', 'Bed Type', 'Amenities', 'Square Feet', 'Price', 'Weekly Price',
       'Month

In [7]:
# Now we can split into train/test
X1 = datosABNB.loc[:,datosABNB.columns != "Price"]
y1 = datosABNB["Price"]
X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.3, shuffle=True, random_state=0)
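
As a quick sanity check on a split like this one, the sizes of the resulting partitions can be compared against `test_size`. The sketch below uses a small synthetic DataFrame (values made up for illustration) rather than the real CSV; only `train_test_split` and its parameters are taken from the cell above, and the variable names are chosen so they do not clash with the notebook's.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the listings table (values are made up).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Accommodates": rng.integers(1, 8, size=100),
    "Bedrooms": rng.integers(0, 4, size=100),
    "Price": rng.uniform(20, 200, size=100).round(2),
})

X = data.drop(columns="Price")
y = data["Price"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=0
)

# With 100 rows and test_size=0.3, 30 rows go to test and 70 to train.
print(X_tr.shape, X_te.shape)
# The target must not remain among the features.
print("Price" in X_tr.columns)
```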

## 2. Exploratory analysis

In [9]:
# A quick look at everything
X_train.head(3)

Unnamed: 0,ID,Listing Url,Scrape ID,Last Scraped,Name,Summary,Space,Description,Experiences Offered,Neighborhood Overview,Notes,Transit,Access,Interaction,House Rules,Thumbnail Url,Medium Url,Picture Url,XL Picture Url,Host ID,Host URL,Host Name,Host Since,Host Location,Host About,Host Response Time,Host Response Rate,Host Acceptance Rate,Host Thumbnail Url,Host Picture Url,Host Neighbourhood,Host Listings Count,Host Total Listings Count,Host Verifications,Street,Neighbourhood,Neighbourhood Cleansed,Neighbourhood Group Cleansed,City,State,Zipcode,Market,Smart Location,Country Code,Country,Latitude,Longitude,Property Type,Room Type,Accommodates,Bathrooms,Bedrooms,Beds,Bed Type,Amenities,Square Feet,Weekly Price,Monthly Price,Security Deposit,Cleaning Fee,Guests Included,Extra People,Minimum Nights,Maximum Nights,Calendar Updated,Has Availability,Availability 30,Availability 60,Availability 90,Availability 365,Calendar last Scraped,Number of Reviews,First Review,Last Review,Review Scores Rating,Review Scores Accuracy,Review Scores Cleanliness,Review Scores Checkin,Review Scores Communication,Review Scores Location,Review Scores Value,License,Jurisdiction Names,Cancellation Policy,Calculated host listings count,Reviews per Month,Geolocation,Features
4706,2156319,https://www.airbnb.com/rooms/2156319,20170407214119,2017-04-08,GRANT VII Plaza Mayor,,Beautiful and charming apartment recently deco...,Beautiful and charming apartment recently deco...,none,,,,,,Rest hours: Monday to Friday from (phone numbe...,,,https://public.opendatasoft.com/api/v2/catalog...,,1650712,https://www.airbnb.com/users/show/1650712,Ximena,2012-01-25,"Madrid, Madrid, Spain","I love to travel, discover new places and cult...",within an hour,100.0,,https://a0.muscache.com/im/users/1650712/profi...,https://a0.muscache.com/im/users/1650712/profi...,La Latina,40.0,40.0,"email,phone,reviews,jumio","Sol, Madrid, Community of Madrid 28013, Spain",Sol,Sol,Centro,Madrid,Community of Madrid,28013,Madrid,"Madrid, Spain",ES,Spain,40.415418,-3.707123,Apartment,Entire home/apt,4,1.0,1.0,2.0,Real Bed,"TV,Internet,Wireless Internet,Air conditioning...",,535.0,1800.0,150.0,40.0,2,20,2,1125,today,,0,0,0,0,2017-04-08,9,2014-04-02,2014-10-20,93.0,10.0,9.0,9.0,9.0,10.0,9.0,,,strict,40.0,0.24,"40.4154180336,-3.70712273935","Host Has Profile Pic,Host Identity Verified,Is..."
6422,3377153,https://www.airbnb.com/rooms/3377153,20170407214119,2017-04-08,Nice flat in Plaza Mayor (lift),"Flat is in the centre town, really close to Pl...",You cannot find a better location to stay in M...,"Flat is in the centre town, really close to Pl...",none,"It´s really cool, with a lot of new business (...","Apartment has wifi, lift, wash-machine and air...","The best way is walking, since in Madrid all a...","The kitchen is available with oil, sugar, coff...",You can call me if you have any doubt or if yo...,Just enjoy Madrid! You are in your home in Sp...,https://a0.muscache.com/im/pictures/47295744/3...,https://a0.muscache.com/im/pictures/47295744/3...,https://public.opendatasoft.com/api/v2/catalog...,https://a0.muscache.com/im/pictures/47295744/3...,17037651,https://www.airbnb.com/users/show/17037651,Miguel,2014-06-20,"Madrid, Community of Madrid, Spain",I am love going out and meet new people from e...,within an hour,100.0,,https://a0.muscache.com/im/users/17037651/prof...,https://a0.muscache.com/im/users/17037651/prof...,La Latina,1.0,1.0,"email,phone,reviews,manual_offline,jumio","La Latina, Madrid, Comunidad de Madrid 28005, ...",La Latina,Embajadores,Centro,Madrid,Comunidad de Madrid,28005,Madrid,"Madrid, Spain",ES,Spain,40.411131,-3.707258,Apartment,Entire home/apt,4,1.0,1.0,2.0,Real Bed,"Internet,Wireless Internet,Air conditioning,Ki...",,299.0,1350.0,99.0,20.0,2,10,1,1125,today,,4,16,33,298,2017-04-07,182,2014-08-09,2017-03-27,92.0,9.0,9.0,10.0,10.0,10.0,9.0,,,flexible,1.0,5.61,"40.411131472,-3.7072583983","Host Has Profile Pic,Host Identity Verified,Is..."
4339,14800635,https://www.airbnb.com/rooms/14800635,20170407214119,2017-04-08,* ROOM double Barrio Salamanca *,"Private room for two persons has a double bed,...",It is a penthouse located in the salamanca dis...,"Private room for two persons has a double bed,...",none,The District of Salamanca is one of the 21 dis...,It is important to know that the floor is shar...,Metro stations: * Lista 70 metres (line 4) fr...,"The common areas are completely free to use, l...",I will always be on the lookout for what you c...,***Lo más importante pasarlo bien y que se si...,https://a0.muscache.com/im/pictures/069c59b5-c...,https://a0.muscache.com/im/pictures/069c59b5-c...,https://public.opendatasoft.com/api/v2/catalog...,https://a0.muscache.com/im/pictures/069c59b5-c...,88187861,https://www.airbnb.com/users/show/88187861,Adan,2016-08-05,"Madrid, Community of Madrid, Spain",,within an hour,100.0,,https://a0.muscache.com/im/pictures/62ec41c0-c...,https://a0.muscache.com/im/pictures/62ec41c0-c...,Goya,3.0,3.0,"email,phone,reviews,jumio,government_id","Madrid, Comunidad de Madrid 28006, Spain",,Goya,Salamanca,Madrid,Comunidad de Madrid,28006,Madrid,"Madrid, Spain",ES,Spain,40.42805,-3.676042,Apartment,Private room,2,1.0,1.0,1.0,Real Bed,"TV,Wireless Internet,Kitchen,Smoking allowed,E...",,,,,,1,0,1,1125,yesterday,,5,26,45,135,2017-04-08,38,2016-09-12,2017-03-19,97.0,10.0,9.0,10.0,10.0,10.0,10.0,,,flexible,3.0,5.45,"40.4280496724,-3.6760419089","Host Has Profile Pic,Host Identity Verified,Re..."


In [10]:
# let's start with shape, to see what we are dealing with
X_train.shape # 88 columns to explore

(10346, 88)

In [11]:
# maybe some column is entirely null and could be dropped
areAllNulls = X_train.isnull().all()
areAllNulls[areAllNulls] # no luck

Series([], dtype: bool)

In [12]:
# Let's look at the columns that contain nulls
areSomeNulls = X_train.isnull().any()
areSomeNulls[areSomeNulls] # several columns have at least one null; we will have to see how useful they are

Name                              True
Summary                           True
Space                             True
Description                       True
Neighborhood Overview             True
Notes                             True
Transit                           True
Access                            True
Interaction                       True
House Rules                       True
Thumbnail Url                     True
Medium Url                        True
Picture Url                       True
XL Picture Url                    True
Host Name                         True
Host Since                        True
Host Location                     True
Host About                        True
Host Response Time                True
Host Response Rate                True
Host Acceptance Rate              True
Host Thumbnail Url                True
Host Picture Url                  True
Host Neighbourhood                True
Host Listings Count               True
Host Total Listings Count
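
Knowing that a column contains some nulls says little about how usable it is; the share of missing values per column is usually more informative. A minimal sketch on a made-up DataFrame follows (the values are invented and the 50% threshold is an assumption; with the real data one would pass `X_train` instead):

```python
import pandas as pd

# Toy frame (made up) with different amounts of missing data per column.
toy = pd.DataFrame({
    "Name": ["a", "b", None, "d"],
    "Square Feet": [None, None, None, 540.0],
    "Bedrooms": [1, 2, 1, 3],
})

# isnull().mean() gives the fraction of missing values in each column.
missing_ratio = toy.isnull().mean().sort_values(ascending=False)
print(missing_ratio)

# Columns above an (assumed) 50% threshold are candidates for removal.
candidates = missing_ratio[missing_ratio > 0.5].index.tolist()
print(candidates)
```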

In [13]:
# Let's look at the datatypes
X_train.dtypes # lots of strings; we need to see what we can extract from them

ID                                  int64
Listing Url                        object
Scrape ID                           int64
Last Scraped                       object
Name                               object
Summary                            object
Space                              object
Description                        object
Experiences Offered                object
Neighborhood Overview              object
Notes                              object
Transit                            object
Access                             object
Interaction                        object
House Rules                        object
Thumbnail Url                      object
Medium Url                         object
Picture Url                        object
XL Picture Url                     object
Host ID                             int64
Host URL                           object
Host Name                          object
Host Since                         object
Host Location                     
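
Several of the `object` columns hold structured information that can be turned into numeric features. The sketch below is only an illustration on made-up rows: it parses `Host Since` into a date and counts the comma-separated entries in `Amenities`. The derived column names (`Host Days Active`, `Amenities Count`) are assumptions, and the reference date is taken from the `Last Scraped` value visible in the sample rows.

```python
import pandas as pd

# Made-up rows that mimic the format of two string columns.
toy = pd.DataFrame({
    "Host Since": ["2012-01-25", "2014-06-20", None],
    "Amenities": ["TV,Internet,Kitchen", "Kitchen", None],
})

# Parse the date strings; unparseable or missing values become NaT.
host_since = pd.to_datetime(toy["Host Since"], errors="coerce")

# Days between the host's sign-up date and an assumed reference date.
reference = pd.Timestamp("2017-04-08")
toy["Host Days Active"] = (reference - host_since).dt.days

# Number of amenities listed; missing entries count as zero.
toy["Amenities Count"] = (
    toy["Amenities"].fillna("").apply(lambda s: len(s.split(",")) if s else 0)
)

print(toy[["Host Days Active", "Amenities Count"]])
```

Whether such derived features actually help the price prediction would still have to be checked during modeling.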

In [14]:
# we also look for duplicate rows
dup = X_train.duplicated()
dup[dup]

Series([], dtype: bool)

In [15]:
#### let's detect outliers!
# A first summary of the data
print(X_train.describe(include='all'))

                  ID                           Listing Url     Scrape ID  \
count   1.034600e+04                                 10346  1.034600e+04   
unique           NaN                                 10346           NaN   
top              NaN  https://www.airbnb.com/rooms/2156319           NaN   
freq             NaN                                     1           NaN   
mean    1.028354e+07                                   NaN  2.017038e+13   
std     5.555731e+06                                   NaN  5.374658e+08   
min     1.986400e+04                                   NaN  2.016010e+13   
25%     5.597253e+06                                   NaN  2.017041e+13   
50%     1.130555e+07                                   NaN  2.017041e+13   
75%     1.532606e+07                                   NaN  2.017041e+13   
max     1.858361e+07                                   NaN  2.017062e+13   

       Last Scraped                                Name  \
count         10346         
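
`describe()` reports the quartiles, and a common rule of thumb built on them flags values lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to a small made-up series; the helper name `iqr_outliers` and the factor 1.5 are assumptions, and this is not necessarily the rule used elsewhere in this notebook.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, factor: float = 1.5) -> pd.Series:
    """Return the entries lying outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - factor * iqr) | (values > q3 + factor * iqr)
    return values[mask]


# Made-up nightly prices with one obviously extreme value.
prices = pd.Series([40, 45, 50, 55, 60, 65, 900])
print(iqr_outliers(prices))
```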

In [16]:
# NOTE: Better ways of doing some of these things already exist, but I only learned about them in class afterwards.
#       It seems I reinvented the wheel :S
#       I don't regret doing this; most likely I won't use it again, but I'm leaving it in
#       because, in the end, it was my own effort.
#
# I ended up checking so many things that I turned them into a function in functions.py:
from functions import analisisDF

# Let's try it on a small dataset:
dataTest = {
    "copy": [420, 380, 390, 411, 400, 395, 410],
    "copied": [420, 380, 390, 411, 400, 395, 410],
    "similar": [415, 380, 390, 411, 400, 395, 410],
    "proporcional": [415*1.5, 380*1.5, 390*1.5, 411*1.5, 400*1.5, 395*1.5, 410*1.5],
    "contains1": ["a", "b", "c", "d", "e", "f", "g"],
    "contains2": ["aaa", "bbbb", "cccc", "dxxxx", "exxxx", "fffff", "gccccc"],
    "allNan": [None, None, None, None, None, None, None],
    "manyNan": [None, 1, None, None, None, None, None],
    "outlier": [1, 1, 390, 1, 1, 1, -1111],
    "formatInconsistence": [1, 1, True, 1, 1, 1, 1]
}
df = pd.DataFrame(dataTest)
print(df) 

# let's see, it looks promising
resultMap = analisisDF(df)
resultMap

   copy  copied  similar  proporcional contains1 contains2 allNan  manyNan  \
0   420     420      415         622.5         a       aaa   None      NaN   
1   380     380      380         570.0         b      bbbb   None      1.0   
2   390     390      390         585.0         c      cccc   None      NaN   
3   411     411      411         616.5         d     dxxxx   None      NaN   
4   400     400      400         600.0         e     exxxx   None      NaN   
5   395     395      395         592.5         f     fffff   None      NaN   
6   410     410      410         615.0         g    gccccc   None      NaN   

   outlier formatInconsistence  
0        1                   1  
1        1                   1  
2      390                True  
3        1                   1  
4        1                   1  
5        1                   1  
6    -1111                   1  


{'duplicateCols': ['cols:[copy|copied]'],
 'similarCols': ['cols:[copy|copied],rate:[1.0]',
  'cols:[copy|similar],rate:[0.8571428571428571]',
  'cols:[copied|similar],rate:[0.8571428571428571]'],
 'containsCols': ['cols:[copy|copied],rate:[1.0]',
  'cols:[copy|similar],rate:[0.8571428571428571]',
  'cols:[copied|similar],rate:[0.8571428571428571]',
  'cols:[contains1|contains2],rate:[1.0]',
  'cols:[outlier|formatInconsistence],rate:[0.8571428571428571]'],
 'formatInconsitenceCols': ["col:[formatInconsistence],types:[<class 'int'>|<class 'bool'>]"],
 'tooManyNanCols': ['col:[allNan],rate:[1.0]',
  'col:[manyNan],rate:[0.8571428571428571]'],
 'proportionalCols': ['cols:[copy|copied],proportion:[1.0]',
  'cols:[similar|proporcional],proportion:[0.6666666666666666]'],
 'outliersCols': ['col:[outlier],outliersIndex:[2, 6]']}
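
The source of `analisisDF` is not shown here, so the following is only a guess at how two of its checks could work, not the author's implementation. The function names `duplicate_columns` and `too_many_nan_columns`, the 0.8 threshold, and the return formats are all assumptions.

```python
from itertools import combinations

import pandas as pd


def duplicate_columns(df: pd.DataFrame) -> list[tuple[str, str]]:
    """Return pairs of columns whose values are identical row by row."""
    return [
        (a, b)
        for a, b in combinations(df.columns, 2)
        if df[a].equals(df[b])
    ]


def too_many_nan_columns(df: pd.DataFrame, threshold: float = 0.8) -> dict[str, float]:
    """Return columns whose fraction of missing values is at least threshold."""
    ratios = df.isnull().mean()
    return {col: float(r) for col, r in ratios.items() if r >= threshold}


# A reduced version of the test dataset above.
sample = pd.DataFrame({
    "copy": [420, 380, 390],
    "copied": [420, 380, 390],
    "similar": [415, 380, 390],
    "allNan": [None, None, None],
    "manyNan": [None, 1, None],
})
print(duplicate_columns(sample))
print(too_many_nan_columns(sample))
```

Comparing every pair of columns grows quadratically with the number of columns, which is still manageable for the 88 columns of this dataset.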

In [17]:
# now with the real data
resultMap = analisisDF(X_train)
resultMap

{'duplicateCols': [],
 'similarCols': ['cols:[Host Acceptance Rate|Square Feet],rate:[0.957954765126619]',
  'cols:[Host Acceptance Rate|Has Availability],rate:[0.997583607191185]',
  'cols:[Host Acceptance Rate|License],rate:[0.9733230233906824]',
  'cols:[Host Acceptance Rate|Jurisdiction Names],rate:[0.9844384303112313]',
  'cols:[Host Listings Count|Host Total Listings Count],rate:[1.0]',
  'cols:[City|Market],rate:[0.9438430311231394]',
  'cols:[Square Feet|Has Availability],rate:[0.9596945679489658]',
  'cols:[Square Feet|License],rate:[0.9389135897931568]',
  'cols:[Square Feet|Jurisdiction Names],rate:[0.9475159481925381]',
  'cols:[Has Availability|License],rate:[0.9750628262130292]',
  'cols:[Has Availability|Jurisdiction Names],rate:[0.9858882659965204]',
  'cols:[Review Scores Checkin|Review Scores Communication],rate:[0.8568528900057993]',
  'cols:[License|Jurisdiction Names],rate:[0.9628842064566016]'],
 'containsCols': ['cols:[ID|Listing Url],rate:[1.0]',
  'cols:[Scrape