# Hierarchical Clustering Project Based on Consumer Data

This project implements a simple hierarchical clustering model on consumer data from a sample dataset.

## Table of Contents 
1. Loading the required libraries.
2. Loading the dataset.
3. Exploring the dataset.
4. Cleaning the dataset.
5. Designing the hierarchical clustering model.
6. Interpretation.

### Loading the required libraries.

In [28]:
import pandas as pd
import numpy as np
import sklearn
import scipy 
import matplotlib.pyplot as plt
import seaborn

%matplotlib inline

### Loading the dataset

In [2]:
## loading the dataset

# raw string, so the backslashes in the Windows path are not read as escape sequences
df = pd.read_csv(r"D:\others\Personal Projects 2\Practice Files July 2024\consumer data.csv")


### Exploring the dataset.

In [3]:
## exploring the dataset

df.head()

| | customer_id | gender | age | payment_method |
|---|---|---|---|---|
| 0 | C241288 | Female | 28.0 | Credit Card |
| 1 | C111565 | Male | 21.0 | Debit Card |
| 2 | C266599 | Male | 20.0 | Cash |
| 3 | C988172 | Female | 66.0 | Credit Card |
| 4 | C189076 | Female | 53.0 | Cash |


In [4]:
## dimensions of the dataset

df.shape

(99457, 4)

In [5]:
## last 5 rows

df.tail()

| | customer_id | gender | age | payment_method |
|---|---|---|---|---|
| 99452 | C441542 | Female | 45.0 | Credit Card |
| 99453 | C569580 | Male | 27.0 | Cash |
| 99454 | C103292 | Male | 63.0 | Debit Card |
| 99455 | C800631 | Male | 56.0 | Cash |
| 99456 | C273973 | Female | 36.0 | Credit Card |


### Cleaning the dataset

In [6]:
## dropping unwanted columns 

df1 = df.drop(columns=["customer_id"])

In [7]:
## identifying any duplicates

df1.duplicated().sum()

99139

In [8]:
## dealing with duplicates

df2 = df1.drop_duplicates()

df2.shape

(318, 3)

In [9]:
## checking for any null values

df2.isna().sum()

gender            0
age               6
payment_method    0
dtype: int64

In [10]:
## dealing with the null values: fill missing ages with the mean age

# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df2 = df2.assign(age=df2["age"].fillna(df2["age"].mean()))

df2.isna().sum()

gender            0
age               0
payment_method    0
dtype: int64

In [15]:
## encoding the categorical columns as integers

# assign() with map() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df2 = df2.assign(
    gender=df2["gender"].map({"Male": 1, "Female": 2}),
    payment_method=df2["payment_method"].map({
        "Cash": 1,
        "Credit Card": 2,
        "Debit Card": 3,
    }),
)
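
Integer codes such as 1/2/3 give `payment_method` an order and a spacing that the categories do not really have, and distance-based clustering will treat that spacing as meaningful. One possible alternative is one-hot encoding. The sketch below uses `pd.get_dummies` on a small made-up frame; the `toy` rows and the `encoded` name are illustrative assumptions, not the project data.

```python
import pandas as pd

# Toy rows that mimic the columns used above (values are made up for illustration).
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "age": [21.0, 28.0, 53.0],
    "payment_method": ["Debit Card", "Credit Card", "Cash"],
})

# One indicator column per category, so no artificial ordering is introduced.
encoded = pd.get_dummies(toy, columns=["gender", "payment_method"], dtype=int)
print(encoded.columns.tolist())
```

Whether this is preferable depends on the distance measure used later; it adds columns, so scaling choices matter more.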


### Designing the hierarchical clustering model.

In [21]:
## loading the preprocessing and clustering tools

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from scipy.cluster.hierarchy import linkage 
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import fcluster

In [16]:
## performing the hierarchical clustering

data = df2 

linkage_matrix = linkage(data, method = "ward") 
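
Ward linkage works on Euclidean distances, so a column with a wide range (here `age`) can outweigh columns coded 1 to 3. `StandardScaler` is imported above but not applied; the following is a minimal sketch of how scaling could be added before `linkage`, using synthetic stand-in features rather than the project data. The names `X`, `X_scaled`, and `Z_scaled` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)

# Synthetic stand-in for the encoded features: gender code, age, payment code.
X = np.column_stack([
    rng.integers(1, 3, size=30),    # gender codes 1..2
    rng.integers(18, 70, size=30),  # ages
    rng.integers(1, 4, size=30),    # payment codes 1..3
]).astype(float)

# Rescale every column to zero mean and unit variance before clustering.
X_scaled = StandardScaler().fit_transform(X)

# Ward linkage on the scaled features; the result has n - 1 rows and 4 columns.
Z_scaled = linkage(X_scaled, method="ward")
print(Z_scaled.shape)
```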

In [30]:
# ## plotting the dendrogram

# plt.figure(figsize = (12,6), facecolor = "brown")
# # leaf labels must match the number of rows, so data.columns cannot be passed as labels
# dendrogram(linkage_matrix, no_labels=True, color_threshold=5)
# plt.title("Hierarchical Clustering Dendrogram (Consumer Data)")
# plt.xlabel("Cluster")
# plt.ylabel("Distance")
# plt.grid(True)
# plt.axhline(y=5, color = "b", linestyle = "--", label='Cluster Cut Threshold')
# plt.legend()
# plt.show()

In [23]:
## form flat clusters (specifying the number of clusters)

clusters = fcluster(linkage_matrix, t=3, criterion='maxclust')
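
`fcluster` can cut the tree either by a target number of clusters (`criterion='maxclust'`, as above) or by a height on the dendrogram (`criterion='distance'`). A small sketch on made-up one-dimensional points shows both; the threshold of 5.0 is an assumption chosen to suit this toy data only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Three tight, well-separated synthetic groups on a single axis.
points = np.array([[0.0], [0.1], [0.2],
                   [10.0], [10.1], [10.2],
                   [20.0], [20.1], [20.2]])

Z = linkage(points, method="ward")

# Cut by number of clusters: ask for at most 3 flat clusters.
labels_k = fcluster(Z, t=3, criterion="maxclust")

# Cut by height: merge everything joined below a linkage distance of 5.0.
labels_d = fcluster(Z, t=5.0, criterion="distance")

print(labels_k)
print(labels_d)
```

On data this cleanly separated, both cuts give the same three groups; on real data the two criteria can disagree, which is one reason to inspect the dendrogram before fixing `t`.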

In [24]:
# Add cluster labels to the dataset

# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
data = data.assign(cluster=clusters)


In [27]:
# Evaluate cluster distribution

cluster_distribution = data.groupby('cluster')['gender'].value_counts()
print("Cluster Distribution:\n", cluster_distribution)

Cluster Distribution:
 cluster  gender
1        1         75
         2         75
2        1         27
         2         27
3        1         57
         2         57
Name: count, dtype: int64
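
Each cluster holds the same number of rows for both gender codes, which suggests gender is not what separates the clusters. One way to see what does is to summarise the other columns per cluster. The sketch below does this for `age` on a made-up frame; the `toy` values and the `summary` name are illustrative assumptions.

```python
import pandas as pd

# Made-up ages and cluster labels, only to illustrate the summary.
toy = pd.DataFrame({
    "age": [20, 22, 40, 42, 65, 67],
    "cluster": [1, 1, 2, 2, 3, 3],
})

# Age range, mean, and size of each cluster.
summary = toy.groupby("cluster")["age"].agg(["min", "max", "mean", "count"])
print(summary)
```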


In [31]:
from sklearn.metrics import silhouette_score

# Silhouette score (input the feature matrix and the predicted cluster labels)
silhouette_avg = silhouette_score(data, clusters)
print(f"Silhouette Score: {silhouette_avg:.2f}")

Silhouette Score: 0.52
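
At this point `data` already contains the `cluster` column, so the call above scores a matrix that includes the labels themselves. A hedged sketch on synthetic blobs (not the project data) compares the score on the features alone with the score when the labels are appended as an extra column; `features`, `score_features`, and `score_leaky` are hypothetical names.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)

# Synthetic 2-D features: three noisy blobs of 20 points each.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
features = np.vstack([c + rng.normal(scale=1.0, size=(20, 2)) for c in centers])

labels = fcluster(linkage(features, method="ward"), t=3, criterion="maxclust")

# Score on the features only.
score_features = silhouette_score(features, labels)

# Score when the labels are appended as a feature column: distances between
# clusters grow while distances within a cluster do not, so the score cannot drop.
with_labels = np.column_stack([features, labels])
score_leaky = silhouette_score(with_labels, labels)

print(f"features only: {score_features:.2f}, with label column: {score_leaky:.2f}")
```

In the notebook, the equivalent of the first call would be to pass `data.drop(columns="cluster")` as the feature matrix.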


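### Interpretation

A silhouette score of 0.52, on a scale from -1 to 1, suggests that the three clusters are reasonably well separated. Two caveats apply. First, the score was computed on `data` after the `cluster` column had been added, so the labels contribute to the distances and the figure is at least as high as it would be on the features alone. Second, the features were not scaled, so `age` (the widest-ranging column) dominates the Euclidean distances; the identical gender counts within every cluster are consistent with gender playing no role in the split.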