# Individual Project 02
🖥️ **Machine Learning** 🖥️ <br>
🔹Zapata, María Belén

--------------------------------------------------------------------- Unsupervised Model ---------------------------------------------------------------------

In [66]:
#General libraries:
import pandas as pd
import numpy as np

#Libraries for building the Pipeline:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline

#Libraries for the model:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

## 🟣 Pipeline

To process the file I was given, I will build a pipeline to keep the notebook better organized. 

🔹 **Creating the classes:** <br>
The classes are needed so they can be used as steps in the Pipeline. <br> 

In [67]:
#Libraries:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
    #Although the libraries are imported in the first cell of the notebook, I re-import them where the code needs them, so the step-by-step is easier to follow.

* **Pipeline**

In [68]:
#This class drops the columns of the dataset that the model does not need.
class DropColumns(TransformerMixin, BaseEstimator):
    def __init__(self, columns=["id", "url", "region", "region_url", "type", "laundry_options", "parking_options", "image_url", "description", "state", 'lat', 'long']):
        self.columns = columns

    def fit(self, X, y=None):
        #Nothing to learn: return the transformer unchanged.
        return self

    def transform(self, X):
        X = X.drop(columns=self.columns)
        return X

'''
The DropColumns class inherits from scikit-learn's TransformerMixin, which supplies fit_transform(), and from BaseEstimator, which supplies get_params() and set_params().

The class has a constructor (the __init__ method) that takes an optional 'columns' argument: a list with the names of the columns to drop.

The DropColumns class has two more methods:
    The fit() method performs no action and simply returns the current object. It is required to comply with the scikit-learn transformer interface.
    The transform() method takes a dataframe X and removes the columns given in the class constructor.
'''

#This class normalizes the data (min-max scaling).
class Normalize(TransformerMixin, BaseEstimator):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        #Store the per-column minimum and maximum for use in transform().
        self.min = X.min()
        self.max = X.max()
        return self

    def transform(self, X):
        X_norm = (X-self.min)/(self.max-self.min)
        return X_norm

'''
The Normalize class inherits from scikit-learn's TransformerMixin, which supplies fit_transform(), and from BaseEstimator, which supplies get_params() and set_params().
The class has a constructor (the __init__ method) that takes no arguments; "pass" is used so that the constructor performs no action.

The Normalize class has two more methods:
    The fit() method takes a dataframe X and computes the minimum and maximum value of each column of the dataframe.
        It stores these values as attributes of the object so that transform() can use them. It is required to comply with the scikit-learn transformer interface.
    The transform() method takes a dataframe X and normalizes each column with the formula (X-min)/(max-min),
        where min and max are the values computed in fit(). It returns the dataframe with the normalized columns.
'''

#Create the Pipeline, storing the steps in the variable "processes" to keep things tidy.
processes = [('drop_columns', DropColumns()), ('normalize', Normalize())]
pipeline = Pipeline(processes)
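
To make the behavior of the two transformers concrete, here is a minimal, self-contained sketch that runs the same kind of pipeline on a tiny made-up dataframe. The toy data, the variable names (`toy`, `toy_pipeline`, `toy_scaled`) and the `columns=None` default are assumptions for illustration only; they are not part of the project data or of the classes above.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class DropColumns(TransformerMixin, BaseEstimator):
    """Drop the given columns (same idea as the notebook's class)."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.drop(columns=self.columns or [])


class Normalize(TransformerMixin, BaseEstimator):
    """Min-max scale each column with the statistics learned in fit()."""

    def fit(self, X, y=None):
        self.min_ = X.min()
        self.max_ = X.max()
        return self

    def transform(self, X):
        return (X - self.min_) / (self.max_ - self.min_)


# Hypothetical toy data: one text column to drop, two numeric columns to scale.
toy = pd.DataFrame({
    "url": ["a", "b", "c"],
    "sqfeet": [500, 1000, 1500],
    "beds": [1, 2, 3],
})

toy_pipeline = Pipeline([
    ("drop_columns", DropColumns(columns=["url"])),
    ("normalize", Normalize()),
])

# fit_transform() fits every step in order and returns the transformed data.
toy_scaled = toy_pipeline.fit_transform(toy)
print(toy_scaled)
```

Each numeric column ends up in the range [0, 1], and the text column is gone before the scaling step ever sees it, which is why the order of the steps matters.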

# 🟣 Test.parquet

🔹 Next I work on the test file. 

In [69]:
df_test = pd.read_parquet("test.parquet")
    #Load the test dataset.
df_test
    #Check that it loaded correctly.

Unnamed: 0,id,url,region,region_url,type,sqfeet,beds,baths,cats_allowed,dogs_allowed,...,wheelchair_access,electric_vehicle_charge,comes_furnished,laundry_options,parking_options,image_url,description,lat,long,state
0,7037609789,https://annarbor.craigslist.org/apa/d/wixom-ho...,ann arbor,https://annarbor.craigslist.org,manufactured,1344,3,2.0,0,0,...,0,0,0,w/d in unit,off-street parking,https://images.craigslist.org/00M0M_iNczP1nzIL...,"OPEN HOUSE TODAY! APPLY THIS WEEK, PUT A HOLDI...",42.5333,-83.5763,mi
1,7032406876,https://vermont.craigslist.org/apa/d/randolph-...,vermont,https://vermont.craigslist.org,apartment,1050,2,1.0,0,0,...,0,0,0,w/d hookups,off-street parking,https://images.craigslist.org/00L0L_ecirmYBIzL...,"Think of it, you'll be first to get your mail....",43.9393,-72.5538,vt
2,7037022682,https://annarbor.craigslist.org/apa/d/ann-arbo...,ann arbor,https://annarbor.craigslist.org,apartment,1150,2,2.0,1,1,...,1,0,0,w/d in unit,carport,https://images.craigslist.org/00e0e_dPln2xjo9g...,One of Ann Arbor's most luxurious apartment co...,42.2492,-83.7712,mi
3,7048681802,https://fortcollins.craigslist.org/apa/d/fort-...,fort collins / north CO,https://fortcollins.craigslist.org,apartment,1280,2,2.5,1,1,...,0,0,0,w/d in unit,attached garage,https://images.craigslist.org/00L0L_jlektT5cSd...,"Specials! Move in before January 16th, 2020 an...",40.5501,-105.0350,co
4,7043597870,https://charlottesville.craigslist.org/apa/d/c...,charlottesville,https://charlottesville.craigslist.org,apartment,783,2,1.0,1,1,...,0,0,0,laundry on site,,https://images.craigslist.org/00D0D_cXa4KbZ6ox...,Barracks West Apartments & Townhomes in Charlo...,38.0936,-78.5611,va
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38493,7041556338,https://mobile.craigslist.org/apa/d/daphne-lux...,mobile,https://mobile.craigslist.org,apartment,1180,2,2.0,1,1,...,1,0,0,w/d in unit,detached garage,https://images.craigslist.org/01616_lCR9AY6Vlb...,At Belforest Villas you'll have all the conv...,30.6197,-87.8895,al
38494,7051072582,https://elpaso.craigslist.org/apa/d/el-paso-th...,el paso,https://elpaso.craigslist.org,apartment,1138,3,2.0,1,1,...,0,0,0,w/d hookups,off-street parking,https://images.craigslist.org/01010_fEVpb2QLmX...,Ready for the CrossPointe Experience show con...,31.8045,-105.9660,tx
38495,7048966175,https://tampa.craigslist.org/hil/apa/d/brandon...,tampa bay area,https://tampa.craigslist.org,apartment,743,1,1.0,1,1,...,0,0,0,w/d in unit,off-street parking,https://images.craigslist.org/00r0r_b7LZqSM75f...,To schedule a tour We now book our tour appoin...,27.8971,-82.3387,fl
38496,7044693740,https://mohave.craigslist.org/apa/d/fort-mohav...,mohave county,https://mohave.craigslist.org,house,1276,3,2.0,0,0,...,0,0,0,w/d hookups,attached garage,https://images.craigslist.org/00606_21aHFx5Gtq...,"House for Rent (1 year lease - min. ) - 3 Bed,...",35.0052,-114.5690,az


* Data analysis: 

In [70]:
df_test.shape
    #Check the size of the dataset.

(38498, 21)

In [71]:
df_test.describe()
    #Summary statistics of the numeric columns.

Unnamed: 0,id,sqfeet,beds,baths,cats_allowed,dogs_allowed,smoking_allowed,wheelchair_access,electric_vehicle_charge,comes_furnished,lat,long
count,38498.0,38498.0,38498.0,38498.0,38498.0,38498.0,38498.0,38498.0,38498.0,38498.0,38302.0,38302.0
mean,7040931000.0,1002.062964,1.924749,1.484129,0.727674,0.708426,0.732064,0.083381,0.013585,0.048002,37.225599,-92.657573
std,8783775.0,686.933541,5.665451,0.700228,0.445162,0.454493,0.44289,0.276461,0.115762,0.213774,5.502983,16.359293
min,7004010000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.25383,-159.42
25%,7035888000.0,750.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,33.4717,-99.79
50%,7043099000.0,947.0,2.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,37.61905,-87.85785
75%,7048393000.0,1150.0,2.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,41.1468,-81.242075
max,7051284000.0,95242.0,1100.0,75.0,1.0,1.0,1.0,1.0,1.0,1.0,64.881,94.1248


In [72]:
df_test.dtypes
    #Check the data type of each column.

id                           int64
url                         object
region                      object
region_url                  object
type                        object
sqfeet                       int64
beds                         int64
baths                      float64
cats_allowed                 int64
dogs_allowed                 int64
smoking_allowed              int64
wheelchair_access            int64
electric_vehicle_charge      int64
comes_furnished              int64
laundry_options             object
parking_options             object
image_url                   object
description                 object
lat                        float64
long                       float64
state                       object
dtype: object

In [73]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38498 entries, 0 to 38497
Data columns (total 21 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       38498 non-null  int64  
 1   url                      38498 non-null  object 
 2   region                   38498 non-null  object 
 3   region_url               38498 non-null  object 
 4   type                     38498 non-null  object 
 5   sqfeet                   38498 non-null  int64  
 6   beds                     38498 non-null  int64  
 7   baths                    38498 non-null  float64
 8   cats_allowed             38498 non-null  int64  
 9   dogs_allowed             38498 non-null  int64  
 10  smoking_allowed          38498 non-null  int64  
 11  wheelchair_access        38498 non-null  int64  
 12  electric_vehicle_charge  38498 non-null  int64  
 13  comes_furnished          38498 non-null  int64  
 14  laundry_options       

In [74]:
df_test.isnull().sum()
    #Check whether this dataset has null values, so they can be handled before using the model.

id                             0
url                            0
region                         0
region_url                     0
type                           0
sqfeet                         0
beds                           0
baths                          0
cats_allowed                   0
dogs_allowed                   0
smoking_allowed                0
wheelchair_access              0
electric_vehicle_charge        0
comes_furnished                0
laundry_options             7855
parking_options            14005
image_url                      0
description                    0
lat                          196
long                         196
state                          0
dtype: int64

* Applying the pipeline: 

In [75]:
pipeline.fit(df_test)
    #Fit the pipeline to the data in the dataframe.
df_test = pipeline.transform(df_test)
    #Use the fitted pipeline to transform the same dataframe, applying the pipeline's transformers in order.

In [76]:
df_test.head()
    #Check that the changes were applied.

Unnamed: 0,sqfeet,beds,baths,cats_allowed,dogs_allowed,smoking_allowed,wheelchair_access,electric_vehicle_charge,comes_furnished
0,0.014111,0.002727,0.026667,0.0,0.0,1.0,0.0,0.0,0.0
1,0.011025,0.001818,0.013333,0.0,0.0,1.0,0.0,0.0,0.0
2,0.012075,0.001818,0.026667,1.0,1.0,1.0,1.0,0.0,0.0
3,0.013439,0.001818,0.033333,1.0,1.0,0.0,0.0,0.0,0.0
4,0.008221,0.001818,0.013333,1.0,1.0,1.0,0.0,0.0,0.0


In [77]:
df_test.describe()
    #Check that the normalization was done correctly.

Unnamed: 0,sqfeet,beds,baths,cats_allowed,dogs_allowed,smoking_allowed,wheelchair_access,electric_vehicle_charge,comes_furnished
count,38498.0,38498.0,38498.0,38498.0,38498.0,38498.0,38498.0,38498.0,38498.0
mean,0.010521,0.00175,0.019788,0.727674,0.708426,0.732064,0.083381,0.013585,0.048002
std,0.007213,0.00515,0.009336,0.445162,0.454493,0.44289,0.276461,0.115762,0.213774
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.007875,0.000909,0.013333,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.009943,0.001818,0.013333,1.0,1.0,1.0,0.0,0.0,0.0
75%,0.012075,0.001818,0.026667,1.0,1.0,1.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
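
The summary above shows means of about 0.01 for `sqfeet`, `beds` and `baths` even though each column spans the full range [0, 1]. In the raw data summarized earlier, `sqfeet` has a 75th percentile of 1150 but a maximum of 95242, so a few extreme listings set the scale for everything else. The sketch below reproduces that effect with made-up values; the cap of 5000 is an arbitrary assumption used only to show the contrast, and it is not something this notebook applies.

```python
import pandas as pd

# Hypothetical square-footage values with one extreme listing.
sqfeet = pd.Series([750, 947, 1150, 95242], name="sqfeet")


def min_max(s):
    """Scale a Series to [0, 1] with (x - min) / (max - min)."""
    return (s - s.min()) / (s.max() - s.min())


# Plain min-max scaling: the outlier takes the value 1 and
# squeezes every other listing close to 0.
scaled_raw = min_max(sqfeet)

# Assumption for illustration: cap values at 5000 before scaling.
scaled_clipped = min_max(sqfeet.clip(upper=5000))

print(pd.DataFrame({"raw": scaled_raw, "clipped": scaled_clipped}))
```

With the cap in place, the ordinary listings are spread over a visibly wider part of the range, which can matter for distance-based methods such as K-means.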


In [78]:
df_test.isnull().sum()
    #Check that no null values remain.

sqfeet                     0
beds                       0
baths                      0
cats_allowed               0
dogs_allowed               0
smoking_allowed            0
wheelchair_access          0
electric_vehicle_charge    0
comes_furnished            0
dtype: int64

# 🟣 K-means

In [79]:
from sklearn.cluster import KMeans
    #Import the library needed for the model.

In [80]:
kmeans = KMeans(n_clusters=3)
    #Set the number of clusters.
kmeans.fit(df_test)
    #Fit the model to the dataset.
clusters = kmeans.predict(df_test)
    #Assign each row to a cluster.
clusters
    #Print to check that everything is correct.



array([0, 0, 1, ..., 1, 0, 1])
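
The cell above fixes `n_clusters=3` and does not set `random_state`, so the cluster numbering can change from one run to the next. As a minimal sketch of how the number of clusters could be checked, the code below compares silhouette scores for several values of k on synthetic data generated with `make_blobs`. The data, the range of k and the seed values are assumptions for illustration; this is not the rental dataset and not a step of this notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: 600 points around 3 well-separated centers.
X, _ = make_blobs(
    n_samples=600,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

scores = {}
for k in range(2, 6):
    # random_state makes the result reproducible between runs.
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = model.fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest silhouette score is the best-separated option.
best_k = max(scores, key=scores.get)
print(scores)
print("Best k:", best_k)
```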

* Insert the cluster labels into the 'pred' column: 

In [81]:
df_test['pred'] = clusters
    #Create the 'pred' column and assign it the values of the 'clusters' array.
cluster_mapping = {0: 0, 1: 1, 2: 2}
    #Mapping from cluster label to output label (here each label maps to itself).
df_test['pred'] = df_test['pred'].map(cluster_mapping)
    #Apply the mapping.
df_test
    #Verify that the column was created with the correct data.

Unnamed: 0,sqfeet,beds,baths,cats_allowed,dogs_allowed,smoking_allowed,wheelchair_access,electric_vehicle_charge,comes_furnished,pred
0,0.014111,0.002727,0.026667,0.0,0.0,1.0,0.0,0.0,0.0,0
1,0.011025,0.001818,0.013333,0.0,0.0,1.0,0.0,0.0,0.0,0
2,0.012075,0.001818,0.026667,1.0,1.0,1.0,1.0,0.0,0.0,1
3,0.013439,0.001818,0.033333,1.0,1.0,0.0,0.0,0.0,0.0,2
4,0.008221,0.001818,0.013333,1.0,1.0,1.0,0.0,0.0,0.0,1
...,...,...,...,...,...,...,...,...,...,...
38493,0.012389,0.001818,0.026667,1.0,1.0,0.0,1.0,0.0,0.0,2
38494,0.011949,0.002727,0.026667,1.0,1.0,1.0,0.0,0.0,0.0,1
38495,0.007801,0.000909,0.013333,1.0,1.0,1.0,0.0,0.0,0.0,1
38496,0.013397,0.002727,0.026667,0.0,0.0,0.0,0.0,0.0,0.0,0
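
K-means numbers its clusters arbitrarily, and the identity mapping `{0: 0, 1: 1, 2: 2}` leaves the labels unchanged. If a stable numbering were needed, one option is to build the mapping from the data, for example by ranking clusters by size. The sketch below is an assumption for illustration (the notebook does not require it), and the labels in `raw` are made up.

```python
import pandas as pd

# Hypothetical raw labels as K-means might return them.
raw = pd.Series([2, 2, 2, 0, 0, 1], name="pred")

# value_counts() sorts labels by descending frequency,
# so the largest cluster comes first.
order = raw.value_counts().index

# Largest cluster becomes 0, the next one 1, and so on.
mapping = {old: new for new, old in enumerate(order)}
relabeled = raw.map(mapping)

print(mapping)
print(relabeled.tolist())
```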


* Compute the Silhouette Score: 

In [88]:
from sklearn.metrics import silhouette_score
    #Import the required library.

In [87]:
df_test_original = df_test.drop(columns=['pred'])
    #First drop the new 'pred' column: the model was fitted without it, so predict() expects the same features.
labels = kmeans.predict(df_test_original)
    #Use the fitted kmeans model to predict which cluster each point of the dataframe belongs to.
    #The predict() method takes a dataset and returns an array with the cluster label assigned to each point.
score = silhouette_score(df_test_original, labels)
    #Compute the silhouette score.
print("Silhouette Score: {:.3f}".format(score))
    #Print the result.

Silhouette Score: 0.699
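
`silhouette_score` is built on pairwise distances between points, so it becomes expensive on a dataset of this size (38,498 rows). The function accepts a `sample_size` argument that estimates the score on a random subset instead. The sketch below compares the full and the sampled score on synthetic data; the data, the sample size of 500 and the seeds are assumptions for illustration and do not come from this notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: 2,000 points around 3 well-separated centers.
X, _ = make_blobs(
    n_samples=2000,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Score computed on every point.
full_score = silhouette_score(X, labels)

# Score estimated on a random subset of 500 points.
sampled_score = silhouette_score(X, labels, sample_size=500, random_state=0)

print("Full:    {:.3f}".format(full_score))
print("Sampled: {:.3f}".format(sampled_score))
```

The sampled value is an estimate, so it is worth fixing `random_state` when the number has to be reproducible.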


---

# 🔱 Submission 🔱

Once the prediction is done and the required column is complete, I proceed with the last step before submission. 

In [84]:
df_test[["pred"]].to_csv("BeeluRiddle.csv", index=False)
    #Export the file to submit: a single column, no index, in CSV format.

In [85]:
df_final = pd.read_csv('BeeluRiddle.csv')
    #Load the CSV.
df_final
    #Print to confirm that everything works and is correct.

Unnamed: 0,pred
0,0
1,0
2,1
3,2
4,1
...,...
38493,2
38494,1
38495,1
38496,0


In [86]:
df_final['pred'].value_counts()

1    20923
0     9964
2     7611
Name: pred, dtype: int64

---