
Importing Necessary Libraries

In [39]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

import pandas as pd
import numpy as np

from matplotlib import pyplot as plt

import plotly.express as px
import plotly.graph_objects as go

In [8]:
from google.colab import drive
drive.mount('/content/drive', force_remount=False)

Mounted at /content/drive


Reading in Dataset

In [12]:
df = pd.read_csv('/content/drive/MyDrive/Data/ML_A05_Dataset/Clustering_dataset.csv')

In [13]:
df.head()

Unnamed: 0,Customer,State,Customer Lifetime Value,Response,Coverage,Education,Effective To Date,EmploymentStatus,Gender,Income,...,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy Type,Policy,Renew Offer Type,Sales Channel,Total Claim Amount,Vehicle Class,Vehicle Size
0,BU79786,Washington,2763.519279,No,Basic,Bachelor,2/24/2011,Employed,F,56274,...,5,0,1,Corporate Auto,Corporate L3,Offer1,Agent,384.811147,Two-Door Car,Medsize
1,QZ44356,Arizona,6979.535903,No,Extended,Bachelor,1/31/2011,Unemployed,F,0,...,42,0,8,Personal Auto,Personal L3,Offer3,Agent,1131.464935,Four-Door Car,Medsize
2,AI49188,Nevada,12887.43165,No,Premium,Bachelor,2/19/2011,Employed,F,48767,...,38,0,2,Personal Auto,Personal L3,Offer1,Agent,566.472247,Two-Door Car,Medsize
3,WW63253,California,7645.861827,No,Basic,Bachelor,1/20/2011,Unemployed,M,0,...,65,0,7,Corporate Auto,Corporate L2,Offer1,Call Center,529.881344,SUV,Medsize
4,HB64268,Washington,2813.692575,No,Basic,Bachelor,3/2/2011,Employed,M,43836,...,44,0,1,Personal Auto,Personal L1,Offer1,Agent,138.130879,Four-Door Car,Medsize


Note how many columns the dataset has. We only need four of these features for clustering.

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9134 entries, 0 to 9133
Data columns (total 24 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Customer                       9134 non-null   object 
 1   State                          9134 non-null   object 
 2   Customer Lifetime Value        9134 non-null   float64
 3   Response                       9134 non-null   object 
 4   Coverage                       9134 non-null   object 
 5   Education                      9134 non-null   object 
 6   Effective To Date              9134 non-null   object 
 7   EmploymentStatus               9134 non-null   object 
 8   Gender                         9134 non-null   object 
 9   Income                         9134 non-null   int64  
 10  Location Code                  9134 non-null   object 
 11  Marital Status                 9134 non-null   object 
 12  Monthly Premium Auto           9134 non-null   i

Here we create a new dataframe consisting only of the features that we need for our clustering project. 

In [15]:
df2 = df[['Customer Lifetime Value', 'Income', 'Total Claim Amount', 'Monthly Premium Auto']]

Notice that our new dataframe has been reduced to only four features (columns).

In [16]:
df2.head()

Unnamed: 0,Customer Lifetime Value,Income,Total Claim Amount,Monthly Premium Auto
0,2763.519279,56274,384.811147,69
1,6979.535903,0,1131.464935,94
2,12887.43165,48767,566.472247,108
3,7645.861827,0,529.881344,106
4,2813.692575,43836,138.130879,73


In [17]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9134 entries, 0 to 9133
Data columns (total 4 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Customer Lifetime Value  9134 non-null   float64
 1   Income                   9134 non-null   int64  
 2   Total Claim Amount       9134 non-null   float64
 3   Monthly Premium Auto     9134 non-null   int64  
dtypes: float64(2), int64(2)
memory usage: 285.6 KB


Here I'm plotting the four features as a scatter matrix, a grid of 2D scatter plots that shows the relationship between each feature and every other feature in the dataset.

In [23]:
fig = px.scatter_matrix(df2, width=1000, height=1000)
fig.show()

Multi-dimensional spaces are difficult to visualize, and here we are working with a 4D dataset. So below I've represented the dataset as follows:

*   The x-axis represents the "Monthly Premium Auto" data.
*   The y-axis represents the "Customer Lifetime Value" data.
*   The z-axis represents the "Total Claim Amount" data.
*   To represent the fourth dimension in a way we can understand, I've chosen to show "Income" as the color of the data points: the greater the income, the brighter the color; the lower the income, the duller the color.

In [37]:
fig2 = px.scatter_3d(df2, x="Monthly Premium Auto", y="Customer Lifetime Value",z="Total Claim Amount",
                     color="Income")
fig2.update_layout(title="4 Features Representation", scene_aspectratio=dict(x=10, y=10, z=10))
fig2.show()

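Here I'm computing the inertia for K = 1 through 10 (the Elbow Method) to find the best K value for the algorithm. The features are first rescaled to the range [0, 1] with MinMaxScaler so that no single feature dominates the distance calculation. This method suggests that 3 is the best K value for this dataset.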

In [43]:
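# Rescale every feature to the [0, 1] range before clustering
scaler = MinMaxScaler()
X = scaler.fit_transform(df2)
# Fit K-means for K = 1..10 and record the inertia of each fit
inertia = []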
for i in range(1,11):
    kmeans = KMeans(
        n_clusters=i, init="k-means++",
        n_init=10,
        tol=1e-04, random_state=42
    )
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)
fig = go.Figure(data=go.Scatter(x=np.arange(1, 11), y=inertia))
fig.update_layout(
    title="Inertia vs Cluster Number",
    xaxis=dict(range=[0, 11], title="Cluster Number"),
    yaxis=dict(title="Inertia"),
    annotations=[
        dict(
            x=3,
            y=inertia[2],  # inertia for K = 3
            xref="x",
            yref="y",
            text="Elbow!",
            showarrow=True,
            arrowhead=7,
            ax=20,
            ay=-40,
        )
    ],
)
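
To make the inertia value on the y-axis more concrete, here is a minimal sketch of what it measures: the sum of squared distances from each point to the centroid of its assigned cluster. The data below is synthetic and the names X_demo, km and manual_inertia are my own for illustration; this is not the insurance dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data (an assumption, not the insurance dataset):
# three tight 2D blobs of 50 points each.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(50, 2)),
    rng.normal(loc=1.0, scale=0.1, size=(50, 2)),
    rng.normal(loc=[0.0, 1.0], scale=0.1, size=(50, 2)),
])

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42).fit(X_demo)

# Look up the centroid assigned to each point, then sum the squared distances.
assigned_centroids = km.cluster_centers_[km.labels_]
manual_inertia = float(((X_demo - assigned_centroids) ** 2).sum())

# The two numbers should agree up to floating-point error.
print(manual_inertia, km.inertia_)
```

Because adding clusters can only shrink these distances, inertia keeps falling as K grows. The elbow is the point where the drop starts to level off.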

Here, in an effort to show the clusters in the data, I've elected to use the line_polar method. Each closed line is one cluster, and each axis shows that cluster's mean value for one of the scaled features. This makes it easier to see where multi-dimensional clusters overlap and how they differ from one another. The model below is fit with four clusters.

In [45]:
kmeans = KMeans(
        n_clusters=4, init="k-means++",
        n_init=10,
        tol=1e-04, random_state=1337
    )
kmeans.fit(X)
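# Put the scaled features back into a dataframe and attach each row's cluster label
clusters = pd.DataFrame(X, columns=df2.columns)
clusters['label'] = kmeans.labels_
# Mean of every feature per cluster, reshaped to long form (label, variable, value)
polar = clusters.groupby("label").mean().reset_index()
polar = pd.melt(polar, id_vars=["label"])
fig4 = px.line_polar(polar, r="value", theta="variable", color="label", line_close=True, height=800, width=1400)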
fig4.show()
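
The reshaping step before the polar plot can be hard to picture, so here is a small sketch of what pd.melt does to a table of per-cluster means. The numbers and the means / long_form names are made up for illustration; only the shape of the result matters.

```python
import pandas as pd

# Hypothetical per-cluster means for two features (made-up numbers).
means = pd.DataFrame({
    "label": [0, 1],
    "Income": [0.34, 0.75],
    "Total Claim Amount": [0.13, 0.10],
})

# Wide -> long: one row per (cluster, feature) pair.
# The feature name lands in "variable" and its mean in "value".
long_form = pd.melt(means, id_vars=["label"])
print(long_form)
```

This long form is what line_polar expects: "variable" supplies the angular axis (theta), "value" the radius (r), and "label" the line color.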

In [51]:
kmeans.cluster_centers_

array([[0.06808826, 0.3431567 , 0.12630293, 0.10329088],
       [0.07265177, 0.75175238, 0.1038925 , 0.11718048],
       [0.19359178, 0.32046102, 0.38725302, 0.59705605],
       [0.06627979, 0.02169372, 0.19003315, 0.1198488 ]])
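
These centroid coordinates are in min-max scaled units, not in dollars. As a rough sketch of how they could be mapped back to the original units, a fitted MinMaxScaler provides inverse_transform. The toy data and the names X_raw, demo_scaler and scaled_centroid below are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data with two features on very different scales (made-up numbers).
X_raw = np.array([
    [0.0, 100.0],
    [10.0, 300.0],
    [5.0, 200.0],
])
demo_scaler = MinMaxScaler().fit(X_raw)

# A hypothetical centroid expressed in scaled [0, 1] units.
scaled_centroid = np.array([[0.5, 0.25]])

# Undo the scaling: x_original = x_scaled * (max - min) + min, per feature.
original_units = demo_scaler.inverse_transform(scaled_centroid)
print(original_units)  # [[  5. 150.]]
```

In this notebook the equivalent call would presumably be scaler.inverse_transform(kmeans.cluster_centers_), using the scaler fitted before the elbow plot.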

Here I'm creating a dataframe of the centroids of each cluster. Each row of kmeans.cluster_centers_ is one centroid, and each column is one of the four scaled features, in the same order as the columns of df2.

In [58]:
centroid_array = kmeans.cluster_centers_
# Rows are centroids; columns are the scaled features
cluster_centroids_df = pd.DataFrame(centroid_array, columns=df2.columns)

In [59]:
cluster_centroids_df.head()

Unnamed: 0,Customer Lifetime Value,Income,Total Claim Amount,Monthly Premium Auto
0,0.068088,0.343157,0.126303,0.103291
1,0.072652,0.751752,0.103892,0.11718
2,0.193592,0.320461,0.387253,0.597056
3,0.06628,0.021694,0.190033,0.119849


I then take the dataframe and represent it in the same 4D format as before. Remember the earlier 3D grid of data points: these are the centroids of the clusters formed from those points, plotted on the same axes with "Income" as the color. Unlike the earlier plot, the values here are in min-max scaled units.

In [60]:
fig5 = px.scatter_3d(cluster_centroids_df, x="Monthly Premium Auto", y="Customer Lifetime Value", z="Total Claim Amount",
                     color="Income")
fig5.update_layout(title="Centroid Representation", scene_aspectratio=dict(x=10, y=10, z=10))
fig5.show()