# **Project Title: Customer Segmentation Analysis using K-Means Clustering Algorithm**

**Done BY:-**

- RISHITHA KANCHARLA -AP21110010059

### **Libraries imports**

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### **Explore the dataset**

In [6]:
df = pd.read_csv('Mall_Customers.csv')
df.head()

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.isnull().sum()

### Observation:
- The dataset contains no null values, so no missing-value handling is needed.

In [None]:
df.describe()

### **Data exploration and visualization**

In [None]:
#Plot pairwise relationships between features in a dataset.
#plt.figure(1, figsize=(16,10))
sns.pairplot(data=df, hue='Gender')
plt.show()

In [None]:
#Number of male vs female
plt.figure(1, figsize=(4,4))
sns.countplot(x='Gender', data=df)
plt.show()

In [None]:
#Distribution of Male & Female customers
gender_counts = df['Gender'].value_counts()

plt.figure(figsize=(4, 4))
plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%', startangle=90, colors=['skyblue', 'lightgreen'])
plt.title('Distribution of Male and Female Customers')

centre_circle = plt.Circle((0, 0), 0.70, fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

plt.axis('equal')
plt.show()

In [None]:
#Distribution of numerical features (Age, Annual income & Spending score)

plt.figure(1, figsize=(16,4))
for n, x in enumerate(['Age', 'Annual Income (k$)', 'Spending Score (1-100)'], start=1):
    plt.subplot(1, 3, n)
    # histplot with kde=True replaces the deprecated sns.distplot
    sns.histplot(df[x], bins=10, kde=True, stat='density')
    plt.title('Distribution of {}'.format(x))
plt.subplots_adjust(hspace=0.5, wspace=0.5)
plt.show()

In [None]:
plt.figure(figsize=(6,4))
sns.boxplot(data=df[['Annual Income (k$)', 'Spending Score (1-100)', 'Age']], orient='h', palette='Set3')

plt.title('Box Plot of Annual Income, Spending Score, and Age')
plt.xlabel('Values')
plt.ylabel('Variables')

plt.show()

In [None]:
# "Age" VS "Spending Score"
palette = sns.color_palette("dark")  # list of 10 colors, indexed in the plots below
sns.jointplot(x="Age", y="Spending Score (1-100)",data=df, kind='reg',height=5, color=palette[4],space=0)

In [None]:
# "Age" VS "Annual Income"
sns.jointplot(x=df["Age"], y=df["Annual Income (k$)"], kind='hex', color=palette[6],height=5,ratio=5,space=0)

In [None]:
# "Spending Score" VS "Annual Income"
g = sns.JointGrid(data = df, height = 5, x = "Annual Income (k$)", y = "Spending Score (1-100)", space = 0.1)
g.plot_joint(sns.kdeplot, fill = True, thresh = 0, color = palette[9])
g.plot_marginals(sns.histplot, color = palette[9], alpha = 1, bins = 20);

In [None]:
# Selecting only numerical columns
numerical_df = df.select_dtypes(include=['int64', 'float64'])

# Heatmap: visualizing the correlation between features
plt.figure(figsize=(6, 4))
heatmap = sns.heatmap(numerical_df.corr(), vmin=-1, vmax=1, annot=True, cmap='viridis')
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':18}, pad=12);


## **K-means:-**

- One of the simplest and most popular unsupervised machine learning algorithms.
- A cluster is a collection of data points grouped together because of certain similarities.
- You define a target number k, the number of centroids (and therefore clusters) you want in the dataset. A centroid is the imaginary or real location representing the center of a cluster.
- Every data point is assigned to exactly one cluster, chosen so as to reduce the within-cluster sum of squares.
- In other words, the K-means algorithm identifies k centroids and allocates every data point to the nearest one, while keeping the within-cluster sum of squared distances as small as possible.

## **Working of K-Means:-**
- The K-means algorithm starts with a first group of randomly selected centroids.
- These are used as the starting points for the clusters.
- It then performs iterative calculations to optimize the positions of the centroids: each point is assigned to its nearest centroid, and each centroid is moved to the mean of the points assigned to it.
- It stops creating and optimizing clusters when either:
    - The centroids have stabilized: their values no longer change from one iteration to the next.
    - The defined maximum number of iterations has been reached.
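
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation used later in this notebook: the function name `kmeans_sketch`, its parameters, and the toy data are all made up for the example, and the initial centroids are simply random data points.

```python
import numpy as np

def kmeans_sketch(points, k, max_iter=100, seed=0):
    """Illustrative K-means loop: assign points, then recompute centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids have stabilized.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated blobs (hypothetical values, for illustration only).
rng = np.random.default_rng(1)
toy = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(10, 0.5, (50, 2))])
labels, centroids = kmeans_sketch(toy, k=2)
print(centroids.round(1))
```

The loop ends either when the centroids stop moving or after `max_iter` passes, which mirrors the two stopping conditions listed above.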

## **Finding Optimal K-value using Elbow method:-**

- Here we plot the within-cluster sum of squares (WCSS) as a function of the number of clusters.
- The plot often looks like a bent arm, and the elbow, where the curve stops dropping sharply, suggests the optimal K.
    - WCSS stands for within-cluster sum of squares.
    - WCSS is the sum of squared distances between each point and the centroid of its cluster.
    - When we plot WCSS against K, the curve looks like an elbow.
    - As the number of clusters increases, the WCSS value decreases.
    - The WCSS value is largest when k=1.
- Here we create a loop to find the optimal value of K using the elbow method.
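
To make the WCSS definition concrete, the sketch below computes it by hand on made-up toy data and compares it with the `inertia_` attribute that scikit-learn reports. The data, the variable names, and the choice of `n_init` and `random_state` are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: three blobs along the diagonal (hypothetical values).
rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(c, 1.0, (40, 2)) for c in (0, 6, 12)])

km_toy = KMeans(n_clusters=3, n_init=10, random_state=0).fit(toy)

# WCSS: squared distance from each point to the centroid of its own cluster, summed.
own_centroids = km_toy.cluster_centers_[km_toy.labels_]
wcss_manual = ((toy - own_centroids) ** 2).sum()

# With k=1 the single centroid is the overall mean, which gives the largest WCSS.
wcss_k1 = ((toy - toy.mean(axis=0)) ** 2).sum()

print(wcss_manual, km_toy.inertia_, wcss_k1)
```

The hand-computed value should agree with `km_toy.inertia_` up to floating-point error, and both are no larger than the k=1 value.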

### **Creating a separate data frame: we extract the last two columns (annual income and spending score) for customer segmentation.**

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

In [None]:
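#Perform clustering (optimizing K with the elbow method).
#To simplify the problem, we start by keeping only the last two columns as features.
#.copy() gives an independent frame, so adding a 'label' column later does not trigger SettingWithCopyWarning.
X = df.iloc[:, -2:].copy()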

In [None]:
#Fit K-means on our data for each k from 2 to 10
#init='k-means++' (the scikit-learn default) gives smarter initialization of centroids and improves cluster quality
#km.inertia_: within-cluster sum of squares (WCSS) of the fitted model
#Note: despite its name, the list `wcss` stores silhouette scores; the WCSS values go into km_inertias

km_inertias, wcss = [], []

for k in range(2, 11):
    km = KMeans(n_clusters=k, init='k-means++').fit(X)
    km_inertias.append(km.inertia_)
    wcss.append(silhouette_score(X, km.labels_))

In [None]:
sns.lineplot(x=range(2, 11), y=km_inertias)
plt.title('The Elbow Method Graph')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('WCSS (inertia)')
plt.show()

### Observation:
- The WCSS drops sharply up to k = 3 (the plot starts at k = 2, because the loop above runs from 2 to 10).
- The decrease slows down until k = 5.
- The curve flattens from k = 6 to k = 10.
- The elbow is at k = 5.
- So, the optimal number of clusters is 5.

In [None]:
sns.lineplot(x=range(2, 11), y=wcss)
plt.title('Scores Depending on K')
plt.xlabel('Number of Clusters K')
plt.ylabel('Silhouette Score')
plt.show()

In [None]:
# Use a separate loop variable so the list `wcss` is not overwritten by a single score
for k, score in zip(range(2, 11), wcss):
    print(f"Silhouette Score for K={k}: {score}")

### **Observation:**
- The silhouette score is highest at K=5, with a value of 0.553931997444648 (about 0.554).
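
The silhouette score ranges from -1 to 1, and higher values indicate tighter, better-separated clusters, so K can be picked as the value with the highest score. The sketch below shows that selection on synthetic data generated with `make_blobs`. The four blob centers, the range of K, and the variable names are assumptions for illustration and are unrelated to the mall dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: four well-separated blobs (hypothetical, for illustration only).
toy, _ = make_blobs(
    n_samples=200,
    centers=[(0, 0), (10, 0), (0, 10), (10, 10)],
    cluster_std=0.5,
    random_state=0,
)

# Silhouette score for each candidate K.
scores = {}
for k in range(2, 8):
    toy_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(toy)
    scores[k] = silhouette_score(toy, toy_labels)

# Pick the K with the highest silhouette score.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Using the silhouette score alongside the elbow plot gives a second, independent check on the choice of K.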

## **Training a model using K-Means Algorithm**

In [None]:
km = KMeans(n_clusters=5).fit(X)

In [None]:
# K-Means visualization on pair of 2 features

plt.figure(figsize=(10, 6))

# Legend names use the same 0-based cluster numbers as km.labels_
colors = ["darkblue", "skyblue", "purple", "plum", "gray"]
for k, color in enumerate(colors):
    sns.scatterplot(data=X.loc[km.labels_ == k], x=X.columns[0], y=X.columns[1], s=80, color=color, label=f'Cluster {k}')

plt.scatter(km.cluster_centers_[:,0], km.cluster_centers_[:,1], s=100, c='black', label='Centroids')
plt.title('Clusters of Customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()

### **Customer profiles for each cluster**

In [None]:
# Profiles of customers
X['label'] = km.labels_

In [None]:
# Count of customers in each cluster
X.label.value_counts()

In [None]:
# Cluster profiles
# Rows 0, 1, 3, 7 of describe() are count, mean, min and max; :-1 drops the label column
for k in range(5):
    print(f'Cluster Number : {k}')
    print(X[X.label == k].describe().iloc[[0, 1, 3, 7], :-1])
    print('\n\n')

In [None]:
# Specific cluster profile (for cluster 1)
X[X.label == 1].describe().iloc[[0, 1, 3, 7], :-1]
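
The per-cluster summaries above can also be produced in a single table with `groupby`. The sketch below uses a tiny hand-made frame with the same column names as `X`; the values and labels are hypothetical and exist only to show the shape of the result.

```python
import pandas as pd

# Hypothetical stand-in for X after the 'label' column has been added.
demo = pd.DataFrame({
    'Annual Income (k$)': [15, 16, 78, 80, 54, 55],
    'Spending Score (1-100)': [80, 77, 15, 12, 50, 48],
    'label': [0, 0, 1, 1, 2, 2],
})

# One row per cluster, with count, mean, min and max for each feature.
profile = demo.groupby('label').agg(['count', 'mean', 'min', 'max'])
print(profile)
```

Applied to the notebook's data, the same call would be `X.groupby('label').agg(['count', 'mean', 'min', 'max'])`.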

### **The generated "Clusters of Customers" plot shows the distribution of the 5 clusters. A sensible interpretation for the mall customer segments can be:**

- **Cluster 0:**
    - Customers in this cluster demonstrate a moderate annual income and a relatively high spending score.
- **Cluster 1:**
    - This group comprises customers with a relatively high annual income but a low spending score.
- **Cluster 2:**
    - Customers in this category exhibit a low annual income and a moderate spending score.
- **Cluster 3:**
    - This cluster consists of customers with a low annual income but a high spending score.    
- **Cluster 4:**
    - Customers in this segment display a moderate annual income and a moderate spending score.

- Understanding these customer segments enables strategic decision-making. For example, identifying customers with high annual incomes but low spending scores suggests an opportunity for targeted marketing strategies to increase their spending and overall engagement. Additionally, maintaining customer satisfaction, especially among loyal customers, is essential for long-term success.

- This analysis showcases the capability of clustering algorithms to generate insightful recommendations. While this analysis uses only two variables (income and spending), incorporating additional variables can provide more precise and business-specific insights, further enhancing decision-making processes.
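
As a rough sketch of what adding a variable could look like, the example below clusters on age, income and spending score together. It runs on randomly generated data, not on `Mall_Customers.csv`, and the use of `StandardScaler` is an assumption made here: K-means relies on distances, so features measured on different scales are usually standardized first. The number of clusters is kept at 5 only for continuity with the analysis above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in data with the notebook's column names (hypothetical values).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Age': rng.integers(18, 70, 200),
    'Annual Income (k$)': rng.integers(15, 140, 200),
    'Spending Score (1-100)': rng.integers(1, 101, 200),
})

# Standardize so that no single feature dominates the distance computation.
scaled = StandardScaler().fit_transform(demo)

km3 = KMeans(n_clusters=5, n_init=10, random_state=0).fit(scaled)
demo['label'] = km3.labels_

# Mean of each original (unscaled) feature per cluster.
print(demo.groupby('label').mean().round(1))
```

With more than two features the clusters can no longer be read off a single scatter plot, so per-cluster summaries like the one printed here become the main tool for interpretation.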
