# 🛍️ Mall Customers – Cluster Analysis (K-Means + Hierarchical)
---
Complete notebook with a cluster analysis using **K-Means**, a **dendrogram**, the **Elbow Method**, and **ANOVA**.

# 🗂️ Data Source
The data used is the Mall Customers Dataset, publicly available on Kaggle:
🔗 https://www.kaggle.com/datasets/shwetabh123/mall-customers

## 📦 0. Importing the Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.cluster.hierarchy as sch
from sklearn.cluster import KMeans
from scipy.stats import zscore
import pingouin as pg


## 📂 1. Loading the Data & Initial EDA

In [None]:
mall = pd.read_csv('Mall_Customers.csv')
mall.info()  # info() prints its own report and returns None, so no print() is needed

# 1.1 Mean annual income by gender
mall.groupby('Genre')['Annual Income (k$)'].mean()

## 🎛️ 2. Standardizing the Numeric Variables

In [None]:
mallnumeric = mall.drop(columns=['CustomerID', 'Genre'])
mallpad = mallnumeric.apply(zscore, ddof=1)
testeanova = mallpad.copy()  # Copy kept for the ANOVA in section 6
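
As a rough illustration of what `zscore` with `ddof=1` computes, the sketch below compares it with the manual formula (subtract the column mean, divide by the sample standard deviation). The values in `toy` are made up for the example and are not taken from the dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical toy data standing in for two numeric columns of the dataset.
toy = pd.DataFrame({
    "Age": [19, 21, 35, 47, 58],
    "Annual Income (k$)": [15, 16, 60, 78, 120],
})

# Same call pattern as in the notebook: z-score each column using the sample std (ddof=1).
toy_std = toy.apply(zscore, ddof=1)

# Manual equivalent: (x - column mean) / sample standard deviation.
manual = (toy - toy.mean()) / toy.std(ddof=1)

# Compare the raw values so that index alignment plays no role.
max_abs_diff = float(np.abs(toy_std.to_numpy() - manual.to_numpy()).max())
col_means = toy_std.to_numpy().mean(axis=0)
col_stds = toy_std.to_numpy().std(axis=0, ddof=1)
print(max_abs_diff, col_means, col_stds)
```

After standardization every column should have mean 0 and sample standard deviation 1, so no variable dominates the Euclidean distances only because of its unit.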

## 🌳 3. Hierarchical Clustering – Dendrogram

In [None]:
mall_compl = sch.linkage(mallpad, metric='euclidean', method='complete')

# Largest jump between consecutive merge distances
dists = mall_compl[:, 2]
saltos = np.diff(dists)
i = np.argmax(saltos)
print(f"The largest jump is between: {dists[i]:.2f} and {dists[i+1]:.2f}")

# ⚠️ We will use 4 clusters, based on the evidence above.
# Plot. Note: color_threshold=4 and the red line at y=4 are linkage distances
# (heights on the y-axis), not a cluster count.
plt.figure(figsize=(12, 6))
sch.dendrogram(mall_compl, color_threshold=4, labels=list(mall['CustomerID']))
plt.axhline(y=4, color='red', linestyle='--')
plt.title("Dendrogram – Complete Linkage")
plt.show()
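
The "largest jump" heuristic above can be turned into an explicit cluster count and a set of labels. The following is a minimal sketch on synthetic data (four hypothetical, well-separated blobs, not the mall dataset): each row of the linkage matrix is one merge, so stopping after merge `i` leaves `n - (i + 1)` clusters, and `scipy.cluster.hierarchy.fcluster` returns the matching flat labels.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Hypothetical toy data: four tight blobs of 10 points each in 2D.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X = np.vstack([c + rng.normal(scale=0.3, size=(10, 2)) for c in centers])
n = X.shape[0]

Z = sch.linkage(X, metric="euclidean", method="complete")

# Largest jump between consecutive merge distances, as in the notebook.
dists = Z[:, 2]
i = int(np.argmax(np.diff(dists)))

# Merges 0..i have happened before the jump, so n - (i + 1) clusters remain.
k_from_jump = n - (i + 1)

# Option A: ask fcluster for that many flat clusters.
labels_k = sch.fcluster(Z, t=k_from_jump, criterion="maxclust")

# Option B: cut at a height inside the gap, which is what a horizontal
# line drawn across the dendrogram represents.
cut_height = (dists[i] + dists[i + 1]) / 2
labels_h = sch.fcluster(Z, t=cut_height, criterion="distance")

print(k_from_jump, len(np.unique(labels_k)), len(np.unique(labels_h)))
```

On the real data the same two calls could be applied to `mall_compl`; whether a cut at height 4 corresponds to 4 clusters depends on the merge distances and is worth checking with `criterion="distance"`.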

## 📉 4. Choosing the Ideal Number of Clusters – Elbow Method

In [None]:
wcss = []
ks = range(1, 11)

for k in ks:
    kmeans = KMeans(n_clusters=k, init='random', random_state=42)
    kmeans.fit(mallpad)
    wcss.append(kmeans.inertia_)

plt.figure(figsize=(12, 6))
plt.plot(ks, wcss, marker='o')
plt.xlabel("Number of Clusters (k)")
plt.ylabel("WCSS")
plt.title("Elbow Method")
plt.show()
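
The quantity plotted on the y-axis, WCSS (within-cluster sum of squares), is what scikit-learn exposes as `inertia_`. The sketch below recomputes it by hand on synthetic data (three hypothetical blobs; `n_init=10` is set explicitly here as an assumption, the notebook relies on the default) to make the definition concrete.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: three blobs of 30 points each in 2D.
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [5, 5], [0, 5]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

km = KMeans(n_clusters=3, init="random", n_init=10, random_state=42).fit(X)

# WCSS by hand: squared Euclidean distance from each point
# to the centroid of the cluster it was assigned to, summed over all points.
assigned_centroids = km.cluster_centers_[km.labels_]
wcss_manual = float(((X - assigned_centroids) ** 2).sum())

print(wcss_manual, km.inertia_)
```

Because WCSS tends to shrink as `k` grows, the elbow plot looks for the point where adding another cluster stops paying off rather than for the minimum.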

## 🎯 5. Final Model – KMeans with k = 4

In [None]:
cluster_final = KMeans(n_clusters=4, init='random', random_state=42).fit(mallpad)
cluster_KMeans = cluster_final.labels_

mall['Cluster_KMeans'] = cluster_KMeans
testeanova['Cluster_KMeans'] = cluster_KMeans
mall.head()
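
Once the labels are attached to the original DataFrame, a per-cluster profile in the original units is usually easier to interpret than the standardized values. A minimal sketch, using a made-up six-row table in place of `mall` (the column names match the notebook, the numbers are hypothetical):

```python
import pandas as pd

# Hypothetical miniature of the labelled DataFrame; values are invented.
toy = pd.DataFrame({
    "Age": [20, 22, 45, 50, 30, 33],
    "Annual Income (k$)": [15, 18, 80, 90, 55, 60],
    "Spending Score (1-100)": [80, 75, 20, 15, 50, 55],
    "Cluster_KMeans": [0, 0, 1, 1, 2, 2],
})

# Cluster size plus the mean of each variable, one row per cluster.
profile = toy.groupby("Cluster_KMeans").agg(
    n=("Age", "size"),
    mean_age=("Age", "mean"),
    mean_income=("Annual Income (k$)", "mean"),
    mean_score=("Spending Score (1-100)", "mean"),
)
print(profile)
```

The same `groupby("Cluster_KMeans").agg(...)` call should work on `mall` after the `Cluster_KMeans` column has been added.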

## 📊 6. ANOVA – Statistical Comparison Between Clusters

In [None]:
print("\nANOVA – Age")
print(pg.anova(dv='Age', between='Cluster_KMeans', data=testeanova, detailed=True).T)

print("\nANOVA – Annual Income (k$)")
print(pg.anova(dv='Annual Income (k$)', between='Cluster_KMeans', data=testeanova, detailed=True).T)

print("\nANOVA – Spending Score (1-100)")
print(pg.anova(dv='Spending Score (1-100)', between='Cluster_KMeans', data=testeanova, detailed=True).T)
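
To make the F statistic in these tables less of a black box, the sketch below computes a one-way ANOVA by hand and cross-checks it against `scipy.stats.f_oneway`. It swaps in SciPy rather than pingouin, and the nine-row table is hypothetical, not taken from the dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical labelled data: one variable, three clusters of three points.
toy = pd.DataFrame({
    "Age": [20.0, 22.0, 24.0, 45.0, 50.0, 55.0, 30.0, 33.0, 36.0],
    "Cluster_KMeans": [0, 0, 0, 1, 1, 1, 2, 2, 2],
})
groups = [g["Age"].to_numpy() for _, g in toy.groupby("Cluster_KMeans")]

# Library result.
f_scipy, p_scipy = f_oneway(*groups)

# By hand: between-group and within-group sums of squares.
grand_mean = toy["Age"].mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
df_between = len(groups) - 1
df_within = len(toy) - len(groups)
f_manual = (ss_between / df_between) / (ss_within / df_within)

print(f_manual, f_scipy, p_scipy)
```

Because the clusters were built from these same variables, the F values are probably best read as a ranking of which variables separate the groups most, rather than as formal hypothesis tests.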