# machine-learning-practica
### **ML Practice** - Exercise for the Bootcamp Inteligencia Artificial Full Stack Edición III

This project is a deliverable for the practical assignment of the Master Bootcamp Inteligencia Artificial Full Stack Edición III, run by the training center [@Keepcoding](https://github.com/KeepCoding).

---

The goal of the assignment is to predict Airbnb listing prices from the data in the file [airbnb-listings-extract.csv](./airbnb-listings-extract.csv).

## Contents

The expected steps are:
1. Data preparation: train/test split
2. Exploratory analysis, for example:
    - Head, describe, dtypes, etc.
    - Outliers
    - Correlation
3. Preprocessing:
    - Variable elimination, through selection (random forest/Lasso), high correlation, a high percentage of missing values, or whatever method is considered appropriate.
    - Variable generation
4. Modeling:
    - Cross validation
    - Evaluation; preferably of more than one model, so that they can be compared.
5. Conclusion: written, not numerical; a couple of lines is more than enough.
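
Two of the elimination criteria named in the preprocessing step (a high percentage of missing values and high correlation) can be sketched with pandas alone. This is only an illustration under stated assumptions: the helper `drop_sparse_and_correlated`, its thresholds, and the toy column names are hypothetical and do not come from the notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse_and_correlated(frame, max_missing=0.8, max_corr=0.95):
    """Drop mostly-missing columns, then one column of each highly correlated numeric pair."""
    # Fraction of missing values per column
    missing_rate = frame.isnull().mean()
    kept = frame.loc[:, missing_rate <= max_missing]

    # Absolute pairwise correlation between the remaining numeric columns
    corr = kept.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    redundant = [c for c in upper.columns if (upper[c] > max_corr).any()]
    return kept.drop(columns=redundant)


# Hypothetical toy data, not taken from the Airbnb file
toy = pd.DataFrame({
    "beds": [1, 2, 3, 4, 5],
    "beds_copy": [1, 2, 3, 4, 5],                   # perfectly correlated with "beds"
    "bathrooms": [1, 1, 2, 1, 3],
    "square_feet": [None, None, None, None, 50.0],  # 80% missing
    "mostly_missing": [None, None, None, None, None],
})
reduced = drop_sparse_and_correlated(toy, max_missing=0.5)
print(reduced.columns.tolist())
```

Which column of a correlated pair is dropped here depends only on column order; a real pipeline might prefer to keep the one with fewer missing values.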

## 1. Data preparation: train/test split

In [40]:
# Start with the libraries we will use
import numpy as np
import pandas as pd

# display settings - comment or uncomment as needed
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# sklearn imports
from sklearn.model_selection import train_test_split

#import matplotlib.pyplot as plt
#plt.style.use("seaborn-v0_8")

In [6]:
# Load the data and take a first look
datosABNB = pd.read_csv("./airbnb-listings-extract.csv", sep=";")
# although the exploratory analysis belongs to step 2, splitting the data forces me to at least check which column is the target
datosABNB.columns # it looks like it is Price

Index(['ID', 'Listing Url', 'Scrape ID', 'Last Scraped', 'Name', 'Summary',
       'Space', 'Description', 'Experiences Offered', 'Neighborhood Overview',
       'Notes', 'Transit', 'Access', 'Interaction', 'House Rules',
       'Thumbnail Url', 'Medium Url', 'Picture Url', 'XL Picture Url',
       'Host ID', 'Host URL', 'Host Name', 'Host Since', 'Host Location',
       'Host About', 'Host Response Time', 'Host Response Rate',
       'Host Acceptance Rate', 'Host Thumbnail Url', 'Host Picture Url',
       'Host Neighbourhood', 'Host Listings Count',
       'Host Total Listings Count', 'Host Verifications', 'Street',
       'Neighbourhood', 'Neighbourhood Cleansed',
       'Neighbourhood Group Cleansed', 'City', 'State', 'Zipcode', 'Market',
       'Smart Location', 'Country Code', 'Country', 'Latitude', 'Longitude',
       'Property Type', 'Room Type', 'Accommodates', 'Bathrooms', 'Bedrooms',
       'Beds', 'Bed Type', 'Amenities', 'Square Feet', 'Price', 'Weekly Price',
       'Month

In [7]:
# Now we can split into train/test
X1 = datosABNB.loc[:,datosABNB.columns != "Price"]
y1 = datosABNB["Price"]
X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.3, shuffle=True, random_state=0)
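
One thing worth checking before a split like this is whether the target itself has missing values, since rows without a `Price` cannot be used to fit or score a model. Whether that applies to this file is not verified here; the sketch below uses made-up rows and hypothetical variable names (`toy`, `labelled`, `X_toy`, `y_toy`) to show the idea, together with `drop(columns=...)` as an equivalent way of separating the features.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names mirror the dataset
toy = pd.DataFrame({
    "Accommodates": [2, 4, 1, 6, 3],
    "Price": [50.0, np.nan, 30.0, 120.0, np.nan],
})

# Rows without a target cannot be used for fitting or scoring
labelled = toy.dropna(subset=["Price"])

# Features are every column except the target
X_toy = labelled.drop(columns="Price")
y_toy = labelled["Price"]
print(X_toy.shape, y_toy.isnull().sum())
```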

## 2. Exploratory analysis

In [9]:
# A quick look at everything
X_train.head(3)

Unnamed: 0,ID,Listing Url,Scrape ID,Last Scraped,Name,Summary,Space,Description,Experiences Offered,Neighborhood Overview,...,Review Scores Communication,Review Scores Location,Review Scores Value,License,Jurisdiction Names,Cancellation Policy,Calculated host listings count,Reviews per Month,Geolocation,Features
4706,2156319,https://www.airbnb.com/rooms/2156319,20170407214119,2017-04-08,GRANT VII Plaza Mayor,,Beautiful and charming apartment recently deco...,Beautiful and charming apartment recently deco...,none,,...,9.0,10.0,9.0,,,strict,40.0,0.24,"40.4154180336,-3.70712273935","Host Has Profile Pic,Host Identity Verified,Is..."
6422,3377153,https://www.airbnb.com/rooms/3377153,20170407214119,2017-04-08,Nice flat in Plaza Mayor (lift),"Flat is in the centre town, really close to Pl...",You cannot find a better location to stay in M...,"Flat is in the centre town, really close to Pl...",none,"It´s really cool, with a lot of new business (...",...,10.0,10.0,9.0,,,flexible,1.0,5.61,"40.411131472,-3.7072583983","Host Has Profile Pic,Host Identity Verified,Is..."
4339,14800635,https://www.airbnb.com/rooms/14800635,20170407214119,2017-04-08,* ROOM double Barrio Salamanca *,"Private room for two persons has a double bed,...",It is a penthouse located in the salamanca dis...,"Private room for two persons has a double bed,...",none,The District of Salamanca is one of the 21 dis...,...,10.0,10.0,10.0,,,flexible,3.0,5.45,"40.4280496724,-3.6760419089","Host Has Profile Pic,Host Identity Verified,Re..."


In [10]:
# start with shape, to see what we are up against
X_train.shape # 88 columns to explore

(10346, 88)

In [11]:
# a column that is entirely null could simply be dropped
areAllNulls = X_train.isnull().all()
areAllNulls[areAllNulls] # no luck

Series([], dtype: bool)

In [12]:
# Now the columns that contain at least one null
areSomeNulls = X_train.isnull().any()
areSomeNulls[areSomeNulls] # several columns have some nulls; we will have to see how useful they are

Name                              True
Summary                           True
Space                             True
Description                       True
Neighborhood Overview             True
Notes                             True
Transit                           True
Access                            True
Interaction                       True
House Rules                       True
Thumbnail Url                     True
Medium Url                        True
Picture Url                       True
XL Picture Url                    True
Host Name                         True
Host Since                        True
Host Location                     True
Host About                        True
Host Response Time                True
Host Response Rate                True
Host Acceptance Rate              True
Host Thumbnail Url                True
Host Picture Url                  True
Host Neighbourhood                True
Host Listings Count               True
Host Total Listings Count
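
A boolean "has any null" flag does not say how many values are missing. A minimal sketch of ranking columns by their missing rate follows; the toy frame and the name `missing_rate` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Made-up rows; only the column names mirror the dataset
toy = pd.DataFrame({
    "Name": ["a", None, "c", "d"],
    "Notes": [None, None, None, "x"],
    "ID": [1, 2, 3, 4],
})

# isnull().mean() gives the fraction of missing values in each column
missing_rate = toy.isnull().mean().sort_values(ascending=False)
# Show only the columns that actually have missing values
print(missing_rate[missing_rate > 0])
```

Applied to `X_train`, the same expression would separate columns with a handful of gaps from those that are nearly empty.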

In [13]:
# Check the data types
X_train.dtypes # lots of strings; we need to see what we can extract from them

ID                                  int64
Listing Url                        object
Scrape ID                           int64
Last Scraped                       object
Name                               object
                                   ...   
Cancellation Policy                object
Calculated host listings count    float64
Reviews per Month                 float64
Geolocation                        object
Features                           object
Length: 88, dtype: object
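
To decide what can be extracted from the text columns, it may help to separate them from the numeric ones and look at how many distinct values each holds. The sketch below assumes a small made-up frame; `numeric_cols`, `text_cols` and `cardinality` are hypothetical names.

```python
import pandas as pd

# Made-up rows; only the column names mirror the dataset
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["a", "b", "c"],
    "Reviews per Month": [0.24, 5.61, 5.45],
    "Cancellation Policy": ["strict", "flexible", "flexible"],
})

# Split the columns by kind; exclude="number" catches every non-numeric dtype
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Number of distinct values in each text column
cardinality = toy[text_cols].nunique()
print(numeric_cols)
print(cardinality)
```

Text columns with few distinct values are natural candidates for categorical encoding, while near-unique ones (URLs, names, free text) usually need a different treatment or can be dropped.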

In [50]:
# also look for duplicated rows
dup = X_train.duplicated()
dup[dup]

Series([], dtype: bool)

In [14]:
#### Let's detect outliers!
# Summarize the data
print(X_train.describe(include='all'))

                  ID                           Listing Url     Scrape ID  \
count   1.034600e+04                                 10346  1.034600e+04   
unique           NaN                                 10346           NaN   
top              NaN  https://www.airbnb.com/rooms/2156319           NaN   
freq             NaN                                     1           NaN   
mean    1.028354e+07                                   NaN  2.017038e+13   
std     5.555731e+06                                   NaN  5.374658e+08   
min     1.986400e+04                                   NaN  2.016010e+13   
25%     5.597253e+06                                   NaN  2.017041e+13   
50%     1.130555e+07                                   NaN  2.017041e+13   
75%     1.532606e+07                                   NaN  2.017041e+13   
max     1.858361e+07                                   NaN  2.017062e+13   

       Last Scraped                                Name  \
count         10346         
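
`describe` gives the quartiles, which is enough to apply the common interquartile-range rule for flagging outliers. The notebook does not state which rule it intends to use, so the following is only one possible sketch; `iqr_outliers`, the factor `k=1.5` and the sample values are assumptions.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Made-up prices with one obviously extreme value
prices = pd.Series([40, 45, 50, 55, 60, 65, 900])
flagged = iqr_outliers(prices)
print(flagged)
```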

In [218]:
# Toy data: each column is built to trigger one of the checks in the next cell
dataTest = {
    "copy": [420, 380, 390, 411, 400, 395, 410],
    "copied": [420, 380, 390, 411, 400, 395, 410],
    "similar": [415, 380, 390, 411, 400, 395, 410],
    "proporcional": [415*1.5, 380*1.5, 390*1.5, 411*1.5, 400*1.5, 395*1.5, 410*1.5],
    "contains1": ["a", "b", "c", "d", "e", "f", "g"],
    "contains2": ["aaa", "bbbb", "cccc", "dxxxx", "exxxx", "fffff", "gccccc"],
    "allNan": [None, None, None, None, None, None, None],
    "manyNan": [None, 1, None, None, None, None, None],
    "outlier": [1, 1, 390, 1, 1, 1, 1],
    "formatInconsistence": [1, 1, True, 1, 1, 1, 1]
}
df = pd.DataFrame(dataTest)

print(df) 

   copy  copied  similar  proporcional contains1 contains2 allNan  manyNan  \
0   420     420      415         622.5         a       aaa   None      NaN   
1   380     380      380         570.0         b      bbbb   None      1.0   
2   390     390      390         585.0         c      cccc   None      NaN   
3   411     411      411         616.5         d     dxxxx   None      NaN   
4   400     400      400         600.0         e     exxxx   None      NaN   
5   395     395      395         592.5         f     fffff   None      NaN   
6   410     410      410         615.0         g    gccccc   None      NaN   

   outlier formatInconsistence  
0        1                   1  
1        1                   1  
2      390                True  
3        1                   1  
4        1                   1  
5        1                   1  
6        1                   1  


In [220]:
# I ended up checking so many things that I turned them into a single reusable routine
model_cols = df.columns.tolist()
resultMap = {
    "duplicateCols": [],
    "similarCols": [],
    "containsCols": [],
    "formatInconsitenceCols": [],
    "tooManyNanCols": [],
    "proportionalCols": [],
    "outliersCols": []
}
for col in model_cols:
    # tooManyNanCols
    alist = df[col].tolist()
    nanCount = sum(df[col].isnull().tolist())
    if nanCount / len(alist) > 0.8:
        resultMap["tooManyNanCols"].append("col:["+col+"],rate:["+str(nanCount / len(alist))+"]")
    
    # formatInconsitenceCols
    dataType = None
    for element in alist:
        if not dataType:
            dataType = type(element)
        if type(element) != dataType:
            resultMap["formatInconsitenceCols"].append("col:["+col+"],types:["+str(dataType)+"|"+str(type(element))+"]")
            break
    # outliersCols: not implemented here, so this list stays empty

    for col2compare in model_cols[model_cols.index(col)+1:]:
        if(col != col2compare):
            list1 = df[col].tolist()
            list2 = df[col2compare].tolist()
            # duplicateCols
            if(list1 == list2):
                resultMap["duplicateCols"].append("cols:["+col+"|"+col2compare+"]")
            # similarCols
            count = 0
            count += sum(map(lambda x, y: 1 if str(x)==str(y) else 0 , list1, list2))
            if count / len(list1) > 0.8:
                resultMap["similarCols"].append("cols:["+col+"|"+col2compare+"],rate:["+str(count / len(list1))+"]")
            # containsCols
            containsCount = 0
            containsCount += sum(map(lambda x, y: 1 if str(x).find(str(y)) != -1 or str(y).find(str(x)) != -1 else 0 , list1, list2))
            if containsCount / len(list1) > 0.8:
                resultMap["containsCols"].append("cols:["+col+"|"+col2compare+"],rate:["+str(containsCount / len(list1))+"]")
            
            # proportionalCols
            proportion = None
            areAllProportion = False
            for i in range(len(list1)):
                x = list1[i]
                y = list2[i]
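                # only compare rows where both values are real numbers (bools excluded)
                if isinstance(x, (int, float, complex)) and not isinstance(x, bool) and isinstance(y, (int, float, complex)) and not isinstance(y, bool):
                    if y == 0:
                        # a zero denominator cannot belong to a constant ratio
                        areAllProportion = False
                        break
                    if proportion is None:
                        proportion = x / y
                        areAllProportion = True
                    if proportion != x / y:
                        areAllProportion = False
                        break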
            if areAllProportion:
                resultMap["proportionalCols"].append("cols:["+col+"|"+col2compare+"],proportion:["+str(proportion)+"]")
resultMap



{'duplicateCols': ['cols:[copy|copied]'],
 'similarCols': ['cols:[copy|copied],rate:[1.0]',
  'cols:[copy|similar],rate:[0.8571428571428571]',
  'cols:[copied|similar],rate:[0.8571428571428571]',
  'cols:[outlier|formatInconsistence],rate:[0.8571428571428571]'],
 'containsCols': ['cols:[copy|copied],rate:[1.0]',
  'cols:[copy|similar],rate:[0.8571428571428571]',
  'cols:[copied|similar],rate:[0.8571428571428571]',
  'cols:[contains1|contains2],rate:[1.0]',
  'cols:[outlier|formatInconsistence],rate:[0.8571428571428571]'],
 'formatInconsitenceCols': ["col:[formatInconsistence],types:[<class 'int'>|<class 'bool'>]"],
 'tooManyNanCols': ['col:[allNan],rate:[1.0]',
  'col:[manyNan],rate:[0.8571428571428571]'],
 'proportionalCols': ['cols:[copy|copied],proportion:[1.0]',
  'cols:[similar|proporcional],proportion:[0.6666666666666666]',
  'cols:[outlier|formatInconsistence],proportion:[1.0]'],
 'outliersCols': []}
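
The proportionality check above compares ratios with exact floating-point equality, which happens to work for this toy data but can fail when rounding differs between rows. A tolerant, vectorized variant could look like the sketch below. It is an alternative written for illustration, not the notebook's routine: `constant_ratio` and `rtol` are hypothetical, and it skips rows with missing values or a zero denominator.

```python
import numpy as np
import pandas as pd


def constant_ratio(a, b, rtol=1e-9):
    """Return the ratio a/b if it is constant across rows (within rtol), else None."""
    a = pd.to_numeric(pd.Series(a), errors="coerce").to_numpy(dtype=float)
    b = pd.to_numeric(pd.Series(b), errors="coerce").to_numpy(dtype=float)
    # Use only rows where both values are present and the denominator is non-zero
    valid = ~np.isnan(a) & ~np.isnan(b) & (b != 0)
    if not valid.any():
        return None
    ratios = a[valid] / b[valid]
    if np.allclose(ratios, ratios[0], rtol=rtol, atol=0.0):
        return float(ratios[0])
    return None


# A shortened version of three of the toy columns used above
toy = pd.DataFrame({
    "similar": [415, 380, 390, 411],
    "proporcional": [415 * 1.5, 380 * 1.5, 390 * 1.5, 411 * 1.5],
    "outlier": [1, 1, 390, 1],
})
ratio_prop = constant_ratio(toy["similar"], toy["proporcional"])
ratio_out = constant_ratio(toy["similar"], toy["outlier"])
print(ratio_prop, ratio_out)
```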

In [78]:
# Which score values appear in both review columns?
col = 'Review Scores Checkin'
col2 ='Review Scores Communication'
l1 = X_train[col].tolist()
l2 = X_train[col2].tolist()
set(l1).intersection(l2)

{2.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0}
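
The set intersection shows which score values both columns take, but not whether the two columns agree row by row. A minimal sketch of that row-wise comparison follows, on made-up scores; the names `both` and `agreement` are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up scores; only the column names mirror the dataset
toy = pd.DataFrame({
    "Review Scores Checkin": [10.0, 9.0, np.nan, 8.0, 10.0],
    "Review Scores Communication": [10.0, 10.0, np.nan, 8.0, 9.0],
})
a = toy["Review Scores Checkin"]
b = toy["Review Scores Communication"]

# Compare only rows where both scores are present (NaN == NaN is False)
both = a.notna() & b.notna()
agreement = (a[both] == b[both]).mean()
print(agreement)
```

A high agreement rate on the real data would suggest the two columns carry largely the same information.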