# <font color='#eb3483'>K-Means exercise</font>



In this exercise we apply the k-means algorithm to the `mall_customers.csv` dataset.

Begin by importing the necessary libraries and loading the dataset.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

sns.set(rc={'figure.figsize': (6, 6)})
import warnings
warnings.simplefilter("ignore")

%matplotlib inline

In [None]:
data = pd.read_csv("data/mall_customers.csv")

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.isna().sum()

In [None]:
# rename the columns
df = data.rename(columns={"Annual Income (k$)": "AnnualIncome", "Spending Score (1-100)": "SpendingScore"})

In [None]:
df.head()

In [None]:
df.dtypes

### <font color='#eb3483'>Exercise 1</font>

Visualise `Age` against `SpendingScore`, with the points coloured by `Gender`.

In [None]:
ax = sns.relplot(x="Age", y="SpendingScore", data=df, hue="Gender")

### <font color='#eb3483'>Exercise 2</font>

Fit a k-means model on the `Age` and `SpendingScore` columns. Determine the optimal number of clusters, then assign each data point to a cluster and plot the result.

In [None]:
X = df[["Age", "SpendingScore"]].values
X[:5]

In [None]:
# fit one model per k (1 to 19) and record its inertia
inertia = []
for k in range(1, 20):
    estimator = (KMeans(n_clusters=k,
                        init='k-means++', 
                        n_init = 10,
                        max_iter=300, 
                        tol=0.0001,  
                        random_state=42, 
                        algorithm='elkan') )
    estimator.fit(X)
    inertia.append(estimator.inertia_)

In [None]:
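# inertia: sum of squared distances from each point to its nearest centroid
# (estimator is the last model from the loop, so this is the value for k=19)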
estimator.inertia_
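
To make the definition of inertia concrete, here is a minimal sketch that computes it by hand with NumPy. The `points` and `centers` arrays are made-up toy values chosen for illustration, not values from the mall dataset.

```python
import numpy as np

# Toy data (hypothetical): four points and two centroids
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centers = np.array([[0.0, 1.0], [10.0, 1.0]])

# Squared Euclidean distance from every point to every centroid, shape (4, 2)
sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

# Each point is assigned to its nearest centroid
labels = sq_dist.argmin(axis=1)

# Inertia: sum of squared distances to the assigned (nearest) centroid
inertia_by_hand = sq_dist.min(axis=1).sum()
print(labels, inertia_by_hand)
```

Every toy point sits exactly one unit from its nearest centroid, so the four squared distances add up to 4.0.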

In [None]:
# optimal number of clusters based on inertia

In [None]:
ax = sns.lineplot(x=list(range(1, 20)), y=inertia, label="inertia", color="red")
ax.set(xlabel="k", ylabel="inertia")
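
Reading the elbow off an inertia curve is a judgment call. One common cross-check, which the original exercise does not use, is the mean silhouette score from `sklearn.metrics`. The sketch below runs on synthetic blobs generated with `make_blobs` (an assumption for illustration, not the mall data) and picks the k with the highest score. With the notebook's variables, `X_demo` would be replaced by `X`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: four well-separated blobs (hypothetical)
X_demo, _ = make_blobs(
    n_samples=200,
    centers=[(-10, -10), (-10, 10), (10, -10), (10, 10)],
    cluster_std=1.0,
    random_state=42,
)

# The silhouette score needs at least two clusters, so start at k=2
scores = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    cluster_labels = model.fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, cluster_labels)

# k with the highest mean silhouette score
best_k = max(scores, key=scores.get)
print(best_k)
```

Unlike inertia, which always decreases as k grows, the silhouette score penalises splitting a compact group, so it has a usable maximum.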

In [None]:
# refit with the k read off the elbow plot (k=4 here) and keep the centroids
estimator = (KMeans(n_clusters = 4,
                    init='k-means++', 
                    n_init = 10,
                    max_iter=300, 
                    tol=0.0001,  
                    random_state= 42, 
                    algorithm='elkan') )
estimator.fit(X)
centroids = estimator.cluster_centers_

In [None]:
# assign each point to its nearest centroid
# (on the training data this gives the same result as estimator.labels_)
predictions = estimator.predict(X)
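
Where `predict` is really needed is for customers that were not part of the fit. A minimal sketch, assuming made-up (age, spending score) pairs rather than the mall data; `X_train` and `new_customers` are hypothetical arrays introduced only for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up (age, spending score) pairs forming two obvious groups
X_train = np.array([
    [22, 80], [25, 85], [27, 78], [24, 90],   # younger, high spending score
    [58, 20], [61, 15], [63, 25], [60, 18],   # older, low spending score
])

model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_train)

# On the training data, predict() reproduces the labels stored during fit
same_as_labels = np.array_equal(model.predict(X_train), model.labels_)

# Hypothetical new customers, each assigned to the nearest learned centroid
new_customers = np.array([[26, 82], [59, 22]])
new_labels = model.predict(new_customers)
print(same_as_labels, new_labels)
```

The cluster ids themselves are arbitrary, so compare new labels against the labels of known training points rather than against fixed numbers.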

In [None]:
# add the cluster labels to the DataFrame
df["cluster_id"] = estimator.labels_

In [None]:
df.head()

In [None]:
df.cluster_id.value_counts() # look at distribution of points into clusters -- seems okay

In [None]:
# plot the clusters; a qualitative palette makes seaborn treat cluster_id as categorical
ax = sns.relplot(x="Age", y="SpendingScore", data=df, hue="cluster_id", palette="deep")
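
The `centroids` array computed earlier is never drawn. One possible way to overlay cluster centres on a scatter plot is sketched below with plain matplotlib on synthetic blobs (an assumption for illustration). With the notebook's variables, `X_demo` would be `X` and `model` would be `estimator`.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (hypothetical), three blobs in two dimensions
X_demo, _ = make_blobs(n_samples=150, centers=3, random_state=42)
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_demo)
centers = model.cluster_centers_

fig, ax = plt.subplots(figsize=(6, 6))
# Points coloured by their assigned cluster
ax.scatter(X_demo[:, 0], X_demo[:, 1], c=model.labels_, cmap="viridis", s=20)
# Centroids drawn on top as large red crosses
ax.scatter(centers[:, 0], centers[:, 1], c="red", marker="X", s=200, label="centroids")
ax.set(xlabel="feature 1", ylabel="feature 2")
ax.legend()
```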