<a href="https://colab.research.google.com/github/Hassaan-T075/AI_Semester_Project/blob/main/kmeans_hierarchical_clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**20L-1154 - Ali Hussnain**

**20L-1040 - M.Burhan Tahir**

**20L-0905 - M.Hassaan Tahir**

In [None]:
import joblib
import warnings
import matplotlib
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
import plotly.graph_objs as go
import matplotlib.pyplot as plt
from datetime import date
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from mpl_toolkits.mplot3d import Axes3D
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import davies_bouldin_score
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import calinski_harabasz_score
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import StandardScaler, normalize


# Load data from CSV file
customerData = pd.read_csv('Customers.csv')

customerData = customerData.drop(['CustomerID'], axis = 1)

# Impute missing Profession values with the most frequent value (the mode)
customerData['Profession'] = customerData['Profession'].fillna(customerData['Profession'].mode()[0])

# Use factorize() function to assign a number to each unique string
customerData['Profession'] = pd.factorize(customerData['Profession'])[0]
customerData['Gender'] = pd.factorize(customerData['Gender'])[0]
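# Preview the rows with annual income below 100000 and with age above 15.
# The results are not assigned, so customerData itself stays unfiltered.
customerData[customerData['Annual Income ($)']<100000]
customerData[customerData['Age']>15]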
customerData.head()



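|   | Gender | Age | Annual Income ($) | Spending Score (1-100) | Profession | Work Experience | Family Size |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 19 | 15000 | 39 | 0 | 1 | 4 |
| 1 | 0 | 21 | 35000 | 81 | 1 | 3 | 3 |
| 2 | 1 | 20 | 86000 | 6 | 1 | 1 | 1 |
| 3 | 1 | 23 | 59000 | 77 | 2 | 0 | 2 |
| 4 | 1 | 31 | 38000 | 40 | 3 | 2 | 6 |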


In [None]:
# Create a box plot of other features against Family Size
age_family_boxplot = px.box(customerData, x='Family Size', y='Age', color='Family Size', title='Distribution of Age by Family Size')
income_family_boxplot = px.box(customerData, x='Family Size', y='Annual Income ($)', color='Family Size', title='Distribution of Annual Income ($) by Family Size')
spending_family_boxplot = px.box(customerData, x='Family Size', y='Spending Score (1-100)', color='Family Size', title='Distribution of Spending Score (1-100) by Family Size')

# Display the plot
age_family_boxplot.show()
income_family_boxplot.show()
spending_family_boxplot.show()

**The box plots show how age, annual income, and spending score are distributed for each family size. The distributions look broadly similar across family sizes, so family size does not appear to shift any of them noticeably. This suggests that family size carries little information about the other variables and is unlikely to improve the accuracy of our model much.**

In [None]:
# Create a box plot of other features against Work Experience
age_experience_boxplot = px.box(customerData, x='Work Experience', y='Age', color='Work Experience', title='Distribution of Age by Work Experience')
income_experience_boxplot = px.box(customerData, x='Work Experience', y='Annual Income ($)', color='Work Experience', title='Distribution of Annual Income ($) by Work Experience')
spending_experience_boxplot = px.box(customerData, x='Work Experience', y='Spending Score (1-100)', color='Work Experience', title='Distribution of Spending Score (1-100) by Work Experience')

# Display the plot
age_experience_boxplot.show()
income_experience_boxplot.show()
spending_experience_boxplot.show()

**As mentioned above, box plots help us judge the significance of a feature in predicting other features. From these plots, work experience does not appear to significantly affect the distributions of age, annual income, or spending score. So, like family size, it can be ignored: it is not a significant feature for predicting the other variables and is unlikely to improve accuracy much.**

In [None]:
# Select the three clustering features and rescale each one to the range [0, 1]
cutout=customerData[['Age','Annual Income ($)','Spending Score (1-100)']]
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(cutout)

# The three scaled features are used directly as 3D coordinates (no projection is applied)
data_3D = scaled_data
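
As a rough illustration of what min-max rescaling does to each column, the sketch below applies the formula `(x - min) / (max - min)` by hand. It uses only the standard library so the arithmetic stays visible. The function name `min_max_scale` and the variable names are hypothetical, and the toy rows are the Age, Annual Income ($) and Spending Score (1-100) values from the five-row preview earlier in the notebook, not the full dataset.

```python
def min_max_scale(rows):
    """Rescale every column of `rows` (a list of equal-length lists) to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column has no range; map it to 0.0 instead of dividing by zero.
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return scaled


# Toy rows (assumption: only the five previewed customers):
# Age, Annual Income ($), Spending Score (1-100)
toy_rows = [
    [19, 15000, 39],
    [21, 35000, 81],
    [20, 86000, 6],
    [23, 59000, 77],
    [31, 38000, 40],
]

toy_scaled = min_max_scale(toy_rows)
for original, rescaled in zip(toy_rows, toy_scaled):
    print(original, "->", [round(v, 3) for v in rescaled])
```

After rescaling, the smallest value in each column becomes 0 and the largest becomes 1, so income (in the tens of thousands) no longer dominates the distance calculations that both clustering algorithms rely on.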

In [None]:
# KMeans Clustering 
kmeans = KMeans(n_clusters=8, random_state=42)

# Fit the KMeans model on the scaled data
kmeans.fit(scaled_data)

# Obtain cluster labels and centroids
kmeans_labels = kmeans.labels_
kmeans_centroids = kmeans.cluster_centers_





In [None]:
# Fit Agglomerative Clustering
agg_clustering = AgglomerativeClustering(n_clusters = 8)

# Fit the Agglomerative model on the scaled data
agg_clustering.fit(scaled_data)

# Obtain cluster labels
agg_labels = agg_clustering.labels_

**The silhouette method evaluates a clustering by how well each data point fits its assigned cluster. For every point it computes a silhouette score that compares the mean distance to the other points in its own cluster (a) with the mean distance to the points of the nearest other cluster (b): s = (b - a) / max(a, b). The score ranges from -1 to 1. Values near 1 mean the point sits well inside its cluster, values near 0 mean it lies between clusters, and negative values suggest it may belong to a different cluster.**

**The silhouette score for a clustering solution is the average of the silhouette scores of all data points in the dataset. The higher the average silhouette score, the better the clustering solution.**
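
To make the definition concrete, here is a minimal sketch that computes per-point silhouette values by hand on a toy dataset, using only the standard library and Euclidean distance. The function name `silhouette_values` and the toy points are hypothetical. It is meant to illustrate the formula, not to replace `sklearn.metrics.silhouette_score`, and it assumes at least two clusters are present.

```python
import math


def silhouette_values(points, labels):
    """Return one silhouette value per point (Euclidean distance)."""
    values = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Accumulate distances from point i to every other point, grouped by cluster.
        sums, counts = {}, {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i == j:
                continue
            sums[lab] = sums.get(lab, 0.0) + math.dist(p, q)
            counts[lab] = counts.get(lab, 0) + 1
        if own not in counts:
            # A cluster with a single point gets a silhouette of 0 by convention.
            values.append(0.0)
            continue
        a = sums[own] / counts[own]  # mean distance within the own cluster
        b = min(sums[lab] / counts[lab] for lab in counts if lab != own)  # nearest other cluster
        values.append((b - a) / max(a, b))
    return values


# Toy data (assumption): two tight groups that are far apart.
toy_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good_labels = [0, 0, 1, 1]  # groups match the geometry
bad_labels = [0, 1, 0, 1]   # each cluster mixes both groups

good_score = sum(silhouette_values(toy_points, good_labels)) / len(toy_points)
bad_score = sum(silhouette_values(toy_points, bad_labels)) / len(toy_points)
print(f"good labelling: {good_score:.3f}")
print(f"bad labelling:  {bad_score:.3f}")
```

The labelling that follows the geometry scores close to 1, while the mixed labelling scores below 0, which is the behaviour the averaged score is meant to capture.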


**The Davies-Bouldin Index is a measure of clustering quality or cluster separation in a dataset. It is used to evaluate the effectiveness of a clustering algorithm in partitioning a dataset into distinct and meaningful clusters.**

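**The index is built from the ratio of within-cluster scatter to between-cluster separation. For each pair of clusters, it adds the scatter of the two clusters (the average distance from each point to its own centroid) and divides by the distance between the two centroids. Each cluster is then scored by its worst (largest) ratio against any other cluster, and the index is the average of these worst-case ratios.**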

**A lower Davies-Bouldin Index value indicates better clustering quality or greater cluster separation, which means that the clusters are well-separated and distinct. Conversely, a higher Davies-Bouldin Index value indicates poorer clustering quality or less cluster separation, which means that the clusters are not well-separated and may contain overlapping or ambiguous data points.**
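
The calculation can be sketched in a few lines of plain Python. This is an illustrative version under stated assumptions (Euclidean distance, at least two clusters, distinct centroids). The function name `davies_bouldin` and the toy points are hypothetical, and in the notebook itself the value comes from `sklearn.metrics.davies_bouldin_score`.

```python
import math


def davies_bouldin(points, labels):
    """Average, over clusters, of the worst scatter-to-separation ratio."""
    # Group the points by cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    # Centroid of each cluster: the coordinate-wise mean of its members.
    centroids = {
        lab: [sum(coord) / len(members) for coord in zip(*members)]
        for lab, members in clusters.items()
    }
    # Scatter of each cluster: mean distance from its members to the centroid.
    scatter = {
        lab: sum(math.dist(p, centroids[lab]) for p in members) / len(members)
        for lab, members in clusters.items()
    }

    worst_ratios = []
    for i in clusters:
        worst_ratios.append(max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters if j != i
        ))
    return sum(worst_ratios) / len(worst_ratios)


# Toy data (assumption): two tight groups that are far apart.
toy_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good_db = davies_bouldin(toy_points, [0, 0, 1, 1])  # groups match the geometry
bad_db = davies_bouldin(toy_points, [0, 1, 0, 1])   # each cluster mixes both groups
print(f"good labelling: {good_db:.3f}")
print(f"bad labelling:  {bad_db:.3f}")
```

With the well-separated labelling, each cluster has a scatter of 0.5 and the centroids are 5 apart, giving an index of 0.2. The mixed labelling has a scatter of 2.5 per cluster with centroids only 1 apart, giving 5.0, so the lower value correctly marks the better clustering.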

In [None]:
# Compute silhouette scores and davies_bouldin_score
kmeans_score = silhouette_score(scaled_data, kmeans_labels)
agg_score = silhouette_score(scaled_data, agg_labels)

kmeans_db_index = davies_bouldin_score(scaled_data, kmeans_labels)
agg_db_index = davies_bouldin_score(scaled_data, agg_labels)


In [None]:
# Create the 3D scatter plot for KMeans
fig_kmeans_3d = px.scatter_3d(
    x=data_3D[:, 0], y=data_3D[:, 1], z=data_3D[:, 2], 
    color=kmeans_labels,
    size_max=5, 
    opacity=0.8,
    labels={'x':'Age (scaled)', 'y':'Annual Income (scaled)', 'z':'Spending Score (scaled)'},
    )


# Add a trace for the cluster centers for KMeans
fig_kmeans_3d.add_trace(
    go.Scatter3d(
        x=kmeans_centroids[:,0],
        y=kmeans_centroids[:,1],
        z=kmeans_centroids[:,2],
        mode='markers+text',
        text=[f'Centroid {i + 1}' for i in range(len(kmeans_centroids))],
        marker=dict(
            size=10,
            color='black',
            opacity=0.8,
            symbol='diamond'
        )
    )
)


# Update the layout for KMeans
fig_kmeans_3d.update_layout(
    coloraxis_showscale=False,
    title='KMeans Clustering Visualization (3D)'
)

# Show the plot for KMeans
fig_kmeans_3d.show()


# Create the 3D scatter plot for Agglomerative Clustering
fig_agg_3d = px.scatter_3d(
    x=data_3D[:, 0], y=data_3D[:, 1], z=data_3D[:, 2], 
    color=agg_labels,
    size_max=5, 
    opacity=0.8,
    labels={'x':'Age (scaled)', 'y':'Annual Income (scaled)', 'z':'Spending Score (scaled)'},
    )

# Update the layout for Agglomerative Clustering
fig_agg_3d.update_layout(
    coloraxis_showscale=False,
    title='Agglomerative Clustering Visualization (3D)'
)

# Show the plot for Agglomerative Clustering
fig_agg_3d.show()




In [None]:
# Print the silhouette scores and Davies-Bouldin indices
print(f"KMeans Silhouette score: {kmeans_score}")
print(f"Agglomerative Silhouette score: {agg_score}")
print("Davies-Bouldin Index for K-Means:", kmeans_db_index)
print("Davies-Bouldin Index for Agglomerative:", agg_db_index)

# Save the fitted KMeans model to disk
joblib.dump(kmeans, 'kmeans_model.pkl')

KMeans Silhouette score: 0.2723800782079217
Agglomerative Silhouette score: 0.21269830202072296
Davies-Bouldin Index for K-Means: 1.0971169414241067
Davies-Bouldin Index for Agglomerative: 1.2167088719834824


['kmeans_model.pkl']