<a href="https://colab.research.google.com/github/MarianaGrandis00/Diplomatura-Business-Analytics-UDA-Mendoza/blob/main/TP-Modulo-11-GarciaF_GrandisM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![logo](https://github.com/cristiandarioortegayubro/BA/blob/main/dba.png?raw=true)

# **Final Project - Module 11 - Machine Learning - Clustering**

![logo](https://www.python.org/static/community_logos/python-powered-w-100x40.png)

Garcia Fabian - Grandis Mariana

**Instructions**

The data set contains information about the customers of a shopping mall. The goal is to use a clustering model to group them and to decide on a commercial action for each group that improves the customer relationship and the amount of money they spend at the mall.

- Develop the steps needed to build the clustering model.
- Charts may be generated at any stage of the work.
- At the end of the Colab notebook, write a brief conclusion about the decisions to take for each cluster.

# **Loading the required modules and data**

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
import sklearn
from sklearn import cluster
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale
import sklearn.metrics as metrics
from sklearn.metrics import silhouette_score

# **Creating the DataFrame**

In [None]:
datos = "https://raw.githubusercontent.com/LucaAPiattelli/Diplomatura_Business_Analytics_UDA/main/Modulo_11_Agrupacion/Mall_Customers.csv"
df = pd.read_csv(datos)
df

Unnamed: 0,CustomerID,Gender,Age,AnnualIncome,SpendingScore
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40
...,...,...,...,...,...
195,196,Female,35,120,79
196,197,Female,45,126,28
197,198,Male,32,126,74
198,199,Male,32,137,18


- CustomerID - customer ID
- Gender - gender
- Age - age
- AnnualIncome - annual income
- SpendingScore - spending score at the mall

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   CustomerID     200 non-null    int64 
 1   Gender         200 non-null    object
 2   Age            200 non-null    int64 
 3   AnnualIncome   200 non-null    int64 
 4   SpendingScore  200 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 7.9+ KB


Dropping non-relevant variables

In [None]:
df = df.drop(columns=["CustomerID", "Gender"]) 
df

Unnamed: 0,Age,AnnualIncome,SpendingScore
0,19,15,39
1,21,15,81
2,20,16,6
3,23,16,77
4,31,17,40
...,...,...,...
195,35,120,79
196,45,126,28
197,32,126,74
198,32,137,18
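
K-means groups points by Euclidean distance, so a column with a wider range weighs more in the result. The notebook imports `scale` but fits on the raw columns. As a hedged sketch of what standardizing would look like, the snippet below applies `scale` to a few made-up rows (the `demo` frame is an assumption for illustration, not the mall data); whether this changes the preferred number of clusters is not evaluated here.

```python
import pandas as pd
from sklearn.preprocessing import scale

# Made-up rows with the notebook's column names, for illustration only.
demo = pd.DataFrame({
    "Age": [19, 21, 45, 50],
    "AnnualIncome": [15, 16, 120, 137],
    "SpendingScore": [39, 81, 28, 18],
})

# scale() centers each column to mean 0 and rescales it to unit variance.
scaled = pd.DataFrame(scale(demo), columns=demo.columns)
print(scaled.round(2))
```

The scaled frame can then be passed to `KMeans` in place of the raw columns.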


Determining the number of clusters

In [None]:
clusters = pd.DataFrame()
inertia = []

In [None]:
clusters["cluster_range"] = range(1, 10)

In [None]:
for k in clusters["cluster_range"]:
    kmeans = cluster.KMeans(n_clusters=k, random_state = 8).fit(df)
    inertia.append(kmeans.inertia_)

In [None]:
clusters["inertia"] = inertia

In [None]:
clusters.inertia = round(clusters.inertia, 4)

In [None]:
clusters.head(10)

Unnamed: 0,cluster_range,inertia
0,1,308812.78
1,2,212840.1698
2,3,143342.7516
3,4,104366.1515
4,5,75479.7643
5,6,58300.4433
6,7,51082.543
7,8,44342.3174
8,9,40792.9005


Plotting inertia against the number of clusters

In [None]:
fig = px.line(clusters,
              x = "cluster_range",
              y = "inertia",
              markers = True,
              title = "Elbow method",
              template = "gridon",
              labels = {"cluster_range":"clusters"})
fig.show()
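
Inertia is the sum of squared distances from each point to the centroid of its assigned cluster. It tends to fall as the number of clusters grows, which is why the elbow is read where the drop flattens out. A minimal sketch, on synthetic blobs from `make_blobs` (an assumption for illustration, not the mall data), recomputes the value by hand and compares it with `KMeans.inertia_`:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data for illustration only (not the mall customers data set).
X, _ = make_blobs(n_samples=200, centers=4, n_features=3, random_state=0)

km_demo = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Look up, for every point, the centroid of the cluster it was assigned to.
assigned_centroids = km_demo.cluster_centers_[km_demo.labels_]

# Inertia: sum of squared Euclidean distances to those centroids.
manual_inertia = float(((X - assigned_centroids) ** 2).sum())

print(manual_inertia, km_demo.inertia_)
```

The two printed numbers should agree up to floating-point rounding.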

# **Building the clustering model**

K-means algorithm

In [None]:
df.head()

Unnamed: 0,Age,AnnualIncome,SpendingScore
0,19,15,39
1,21,15,81
2,20,16,6
3,23,16,77
4,31,17,40


Test with K = 6 clusters

In [None]:
km = cluster.KMeans(n_clusters = 6, n_init = 20, random_state = 123)
km

KMeans(n_clusters=6, n_init=20, random_state=123)

In [None]:
km.fit(df)

KMeans(n_clusters=6, n_init=20, random_state=123)

In [None]:
centroids = km.cluster_centers_
labels = km.labels_

In [None]:
centroids

array([[56.15555556, 53.37777778, 49.08888889],
       [32.69230769, 86.53846154, 82.12820513],
       [27.        , 56.65789474, 49.13157895],
       [41.68571429, 88.22857143, 17.28571429],
       [25.27272727, 25.72727273, 79.36363636],
       [44.14285714, 25.14285714, 19.52380952]])

In [None]:
centroids = pd.DataFrame(centroids, columns=['Age', 'AnnualIncome', "SpendingScore"])
centroids

Unnamed: 0,Age,AnnualIncome,SpendingScore
0,56.155556,53.377778,49.088889
1,32.692308,86.538462,82.128205
2,27.0,56.657895,49.131579
3,41.685714,88.228571,17.285714
4,25.272727,25.727273,79.363636
5,44.142857,25.142857,19.52381


In [None]:
labels

array([5, 4, 5, 4, 5, 4, 5, 4, 5, 4, 5, 4, 5, 4, 5, 4, 5, 4, 5, 4, 5, 4,
       5, 4, 5, 4, 5, 4, 5, 4, 5, 4, 5, 4, 5, 4, 5, 4, 5, 4, 0, 4, 0, 2,
       5, 4, 0, 2, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 2,
       0, 0, 2, 2, 0, 0, 0, 0, 0, 2, 0, 2, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2,
       2, 0, 0, 2, 0, 2, 2, 2, 0, 2, 0, 2, 2, 0, 0, 2, 0, 2, 0, 0, 0, 0,
       0, 2, 2, 2, 2, 2, 0, 0, 0, 0, 2, 2, 2, 1, 2, 1, 3, 1, 3, 1, 3, 1,
       2, 1, 3, 1, 3, 1, 3, 1, 3, 1, 2, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1,
       3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1,
       3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1,
       3, 1], dtype=int32)

Metrics

In [None]:
metrics.calinski_harabasz_score(df, labels)

166.7204931788687

In [None]:
metrics.silhouette_score(df, labels)

0.4523443947724053

In [None]:
metrics.davies_bouldin_score(df, labels)

0.7469740072755284
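
The three scores read in different directions: silhouette (between -1 and 1) and Calinski-Harabasz are better when higher, while Davies-Bouldin is better when lower. The sketch below illustrates this on two synthetic data sets, one with tight and one with loose clusters. The `score_all` helper, the `centers` list and the blob parameters are hypothetical, chosen only for the demonstration.

```python
from sklearn import metrics
from sklearn.datasets import make_blobs

centers = [[0, 0], [10, 0], [0, 10]]  # hypothetical cluster centers


def score_all(X, labels):
    """Return the three internal validation scores used in this notebook."""
    return {
        "silhouette": metrics.silhouette_score(X, labels),
        "calinski_harabasz": metrics.calinski_harabasz_score(X, labels),
        "davies_bouldin": metrics.davies_bouldin_score(X, labels),
    }


# Same centers, different spread: tight blobs are well separated, loose blobs overlap.
X_tight, y_tight = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)
X_loose, y_loose = make_blobs(n_samples=300, centers=centers, cluster_std=4.0, random_state=0)

tight = score_all(X_tight, y_tight)
loose = score_all(X_loose, y_loose)
print(tight)
print(loose)
```

The tight data set should score higher on silhouette and Calinski-Harabasz and lower on Davies-Bouldin than the loose one.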

In [None]:
df['cluster'] = labels

In [None]:
df

Unnamed: 0,Age,AnnualIncome,SpendingScore,cluster
0,19,15,39,5
1,21,15,81,4
2,20,16,6,5
3,23,16,77,4
4,31,17,40,5
...,...,...,...,...
195,35,120,79,1
196,45,126,28,3
197,32,126,74,1
198,32,137,18,3
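
With the labels attached to the frame, per-cluster averages give a quick profile of each group, which is the natural starting point for deciding on an action per cluster. A small sketch on made-up rows (the `demo` frame and the `n_customers` column name are assumptions for illustration):

```python
import pandas as pd

# Made-up rows with the notebook's column names, for illustration only.
demo = pd.DataFrame({
    "Age": [20, 30, 40, 50],
    "AnnualIncome": [10, 20, 30, 40],
    "SpendingScore": [80, 60, 20, 40],
    "cluster": [0, 0, 1, 1],
})

# Mean of every feature within each cluster, plus the cluster size.
profile = demo.groupby("cluster").mean()
profile["n_customers"] = demo.groupby("cluster").size()
print(profile)
```

Applied to the notebook's `df`, the same two lines would produce one row per cluster label.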


Test with K = 5 clusters

In [None]:
km = cluster.KMeans(n_clusters = 5, n_init = 20, random_state = 123)
km

KMeans(n_clusters=5, n_init=20, random_state=123)

In [None]:
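# Note: df now includes the "cluster" column added after the K = 6 run, so this fit uses four columns.
km.fit(df)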

KMeans(n_clusters=5, n_init=20, random_state=123)

In [None]:
centroids = km.cluster_centers_
labels = km.labels_

In [None]:
centroids

array([[45.2173913 , 26.30434783, 20.91304348,  4.56521739],
       [32.69230769, 86.53846154, 82.12820513,  1.        ],
       [25.52173913, 26.30434783, 78.56521739,  3.91304348],
       [43.08860759, 55.29113924, 49.56962025,  0.92405063],
       [40.66666667, 87.75      , 17.58333333,  2.94444444]])
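
The centroids above have four coordinates because `df` already carries the `cluster` column from the K = 6 run. One way to keep earlier labels out of a later fit is to select the feature columns explicitly. This is a minimal sketch on made-up rows; the `demo` frame and the `features` list are illustrative assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up rows with the notebook's column names, for illustration only.
demo = pd.DataFrame({
    "Age": [19, 21, 45, 50, 33, 35],
    "AnnualIncome": [15, 16, 60, 62, 120, 126],
    "SpendingScore": [39, 81, 50, 45, 79, 18],
})
features = ["Age", "AnnualIncome", "SpendingScore"]

# First run: the labels are stored in the frame as an extra column.
demo["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(demo[features])

# Second run: fit on the feature columns only, so the earlier labels are not used as input.
km2 = KMeans(n_clusters=2, n_init=10, random_state=0).fit(demo[features])
print(km2.cluster_centers_.shape)
```

Each centroid then has three coordinates, one per feature, regardless of what else has been added to the frame.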

Metrics

In [None]:
metrics.calinski_harabasz_score(df, labels)

151.0344561676331

In [None]:
metrics.silhouette_score(df, labels)

0.4448729900654738

In [None]:
metrics.davies_bouldin_score(df, labels)

0.8212056271551879

# **Conclusions**

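With K = 6, the Silhouette (0.452 vs. 0.445) and Calinski-Harabasz (166.7 vs. 151.0) scores are higher and the Davies-Bouldin index is lower (0.747 vs. 0.821) than with K = 5, so six clusters fit the data better by all three metrics. One caveat: the K = 5 model was fit and scored on a `df` that already included the `cluster` column from the K = 6 run (its centroids have four coordinates), so the two sets of scores were not computed on the same input.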
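
A like-for-like comparison scores every candidate K on the same feature matrix. The sketch below shows the pattern on synthetic blobs (an assumption for illustration; the numbers it prints say nothing about the mall data), reusing the notebook's `n_init = 20` and `random_state = 123` settings.

```python
import pandas as pd
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data for illustration only (not the mall customers data set).
X, _ = make_blobs(n_samples=200, centers=5, n_features=3, random_state=1)

rows = []
for k in (5, 6):
    # Every model is fit and scored on the same matrix X.
    labels_k = KMeans(n_clusters=k, n_init=20, random_state=123).fit_predict(X)
    rows.append({
        "k": k,
        "silhouette": metrics.silhouette_score(X, labels_k),
        "calinski_harabasz": metrics.calinski_harabasz_score(X, labels_k),
        "davies_bouldin": metrics.davies_bouldin_score(X, labels_k),
    })

comparison = pd.DataFrame(rows).set_index("k")
print(comparison)
```

In the notebook, `X` would be the three feature columns of `df` without the `cluster` column.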