# Introduction

In this project, we're focusing on segmenting customers based on their purchase history using the K-means clustering algorithm. Customer segmentation is a key part of marketing and business strategy because it helps companies customize their products, services, and marketing efforts for different customer groups. By understanding the unique segments within a customer base, businesses can improve customer satisfaction, boost retention rates, and increase overall profitability. This notebook will walk you through the steps of loading and preprocessing the data, determining the optimal number of clusters, applying the K-means algorithm, and interpreting the results.

# Importing Libraries

In [None]:
# Suppress warnings:
import warnings

warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns



# Loading and Exploring the Data

In [None]:
df= pd.read_csv('/kaggle/input/mall-customers-csv/Mall_Customers.csv')
print(df.head())


As you can see, Gender in this dataset is a categorical variable. The k-means algorithm isn't directly applicable to categorical variables because the Euclidean distance function isn't really meaningful for discrete variables. So, let's drop this feature and run clustering.
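
To illustrate why arbitrary integer codes for a category distort Euclidean distance, here is a minimal sketch. It uses a made-up three-level category (the `red`/`green`/`blue` labels are hypothetical and not part of this dataset). The distance between two categories changes with the numbering chosen, whereas one-hot encoding keeps every pair of distinct categories equally far apart.

```python
import numpy as np

# Hypothetical categorical feature with three levels (not from the dataset).
labels = ["red", "green", "blue"]

# Two arbitrary integer encodings of the same categories.
encoding_a = {"red": 0, "green": 1, "blue": 2}
encoding_b = {"red": 0, "green": 2, "blue": 1}


def distance(encoding, first, second):
    """Euclidean distance between two categories under an integer encoding."""
    return abs(encoding[first] - encoding[second])


# The red-green distance depends entirely on the arbitrary numbering.
dist_a = distance(encoding_a, "red", "green")
dist_b = distance(encoding_b, "red", "green")
print(dist_a, dist_b)

# One-hot encoding removes the artificial ordering: every pair of
# distinct categories ends up the same distance apart.
one_hot = np.eye(len(labels))
pairwise = np.linalg.norm(one_hot[:, None, :] - one_hot[None, :, :], axis=2)
print(pairwise.round(3))
```

One-hot encoding is only one possible alternative to dropping the column; whether it helps K-means on this data is not something this notebook tests.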

# Data Preprocessing

In [None]:
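# Count fully duplicated rows, then drop duplicates and rows with missing values
print(df.duplicated().sum())
df = df.drop_duplicates()
df = df.dropna()
# Gender is categorical, so it is excluded from clustering
df = df.drop('Gender', axis=1)
df.head()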

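# Standardizing the Features
We use __StandardScaler()__ to standardize the dataset, so that each feature has a mean of 0 and a standard deviation of 1. This keeps features measured on larger scales from dominating the distance calculation.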

In [None]:
from sklearn.preprocessing import StandardScaler
X=df.values[:,:]
X=np.nan_to_num(X)
clust_ds=StandardScaler().fit_transform(X)
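
As a quick check of what standardization does, the sketch below scales a small invented table and confirms that each column ends up with mean 0 and standard deviation 1. The toy values and the `feature_cols` list are assumptions for illustration. The sketch also assumes that an ID column such as `CustomerID` is only a row identifier and leaves it out, which differs from the cell above, where every remaining column is scaled.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the mall data; the values are invented for illustration.
toy = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5],
    "Age": [19, 35, 52, 23, 64],
    "Annual Income (k$)": [15, 40, 78, 20, 120],
    "Spending Score (1-100)": [39, 81, 6, 77, 40],
})

# Assumption: CustomerID is only a row identifier, so it is left out.
feature_cols = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]
scaler = StandardScaler()
scaled = scaler.fit_transform(toy[feature_cols])

# Each column now has mean ~0 and (population) standard deviation ~1.
print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```

Keeping the fitted `scaler` object around, rather than calling `fit_transform` on a throwaway instance, makes it possible to transform new customers or map centroids back to the original units later.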


# Determining the Optimal Number of Clusters

We use the Elbow Method to choose the number of clusters (K).

Here, **WCSS** stands for Within-Cluster Sum of Squares: the sum of squared distances between each data point and the centroid of its assigned cluster. K-means seeks to minimize the WCSS, so that the points in each cluster lie as close as possible to their centroid. Because WCSS generally keeps decreasing as K grows, we look for the point where adding clusters stops paying off rather than for the minimum itself.
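
Written out, $\mathrm{WCSS} = \sum_{k} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$, where $\mu_k$ is the centroid of cluster $C_k$. The sketch below computes this by hand on a few invented points and compares it with scikit-learn's `inertia_` attribute, which is the quantity plotted in the elbow chart.

```python
import numpy as np
from sklearn.cluster import KMeans

# Small synthetic dataset: two tight groups of 2-D points (invented values).
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# WCSS by hand: squared distance from each point to its assigned centroid.
assigned_centroids = model.cluster_centers_[model.labels_]
wcss_manual = ((points - assigned_centroids) ** 2).sum()

# The two numbers should agree up to floating-point error.
print(wcss_manual, model.inertia_)
```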

In [None]:
from sklearn.cluster import KMeans

# Fit K-means for K = 1..10 and record the WCSS (inertia) of each fit
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(clust_ds)
    wcss.append(kmeans.inertia_)

plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Within-Cluster Sum of Squares')
plt.show()


The number of clusters can be chosen from the "elbow point" of the graph, where the Within-Cluster Sum of Squares (WCSS) starts to decrease at a slower rate. In the graph above, the elbow appears to be at 4 clusters, which suggests that 4 is a reasonable number of clusters for our dataset. Reading an elbow is a visual judgment, so nearby values of K may be worth comparing as well.
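
Since the elbow is read by eye, a numeric score can serve as a cross-check. The sketch below uses the silhouette score, a technique this notebook does not otherwise use, on synthetic data generated with four well-separated groups (not the mall data). The `centers` and the range of K are assumptions chosen for illustration; on the real data the same loop could be run on `clust_ds`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four well-separated groups (not the mall dataset).
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
data, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 8):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    scores[k] = silhouette_score(data, labels)

# Higher is better; pick the K with the highest average silhouette.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

A silhouette peak and an elbow do not always agree on real data, so the two are better treated as complementary hints than as a single answer.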

# Applying K-means Clustering

In [None]:
# Use the same settings as in the elbow search (including n_init) for consistency
kmean = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=10, random_state=0)
clusters=kmean.fit_predict(clust_ds)


# Adding Cluster Labels to the Data

In [None]:
df['clusters']=clusters
print(df.head())

# Visualizing the Clusters

This scatter plot visualizes customer segments based on their annual income and spending score. The different colored dots represent customers grouped into 4 distinct clusters. The scatter plot allows you to see the distribution of customers across income and spending levels, and how the clusters are differentiated.

In [None]:
sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)', hue='clusters', data=df, palette='viridis')
plt.title('Customer Segments')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.show()

# Interpreting the Results

Let's check the centroid values by averaging the features in each cluster.

In [None]:
df.groupby('clusters').mean()

From the table above, we can read the following:

Cluster 0: Younger customers with moderate annual income and spending scores.

Cluster 1: Older customers with moderate annual income and lower spending scores.

Cluster 2: Middle-aged customers with high annual income and low spending scores.

Cluster 3: Middle-aged customers with high annual income and high spending scores.
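
The per-cluster means used for this reading can also be recovered from the fitted model itself. The sketch below, on invented two-feature data, maps the K-means centroids from scaled space back to the original units with `inverse_transform` and compares them with the group means. It assumes the fitted scaler is kept in a variable (here `scaler`), which the preprocessing cell in this notebook does not do.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data with three separated groups (invented values).
raw, _ = make_blobs(n_samples=150, centers=[[20, 30], [45, 80], [60, 20]],
                    cluster_std=2.0, random_state=0)
toy = pd.DataFrame(raw, columns=["Age", "Annual Income (k$)"])

scaler = StandardScaler()
scaled = scaler.fit_transform(toy)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)

# Centroids live in scaled space; map them back to the original units.
centroids = pd.DataFrame(scaler.inverse_transform(model.cluster_centers_),
                         columns=toy.columns)

# Per-cluster feature means computed directly on the unscaled data.
group_means = toy.groupby(model.labels_).mean()

print(centroids.round(2))
print(group_means.round(2))
```

The two tables match here because a converged centroid is the mean of its assigned points and standardization is a linear rescaling of each feature.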

Let's look at the distribution of customers based on their age and income.

# 2D Scatter Plot

This chart shows the relationship between age and annual income.
The overlapping colored bubbles are customers, colored by cluster.
Each bubble's area is proportional to the square of the customer's age, so bubble size restates the x-axis; customer density shows in how heavily the bubbles overlap.

In [None]:
# Bubble area grows with the square of age (column 1 of X)
area = np.pi * (X[:, 1])**2
# x: Age (column 1), y: Annual Income (column 2), color: cluster label
plt.scatter(X[:, 1], X[:, 2], s=area, c=clusters.astype(float), alpha=0.5)
plt.xlabel('Age', fontsize=16)
plt.ylabel('Annual Income (k$)', fontsize=16)
plt.show()


We notice that:
The largest concentrations of customers appear to be in the 30-50 age range, with a mix of income levels.
Bubbles are sparser at older ages, suggesting fewer customers in those age groups, potentially due to retirement or other factors.
Because bubble size reflects age rather than customer count, concentrations are indicated by how many bubbles overlap at a given age and income, not by how large they are.
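
Concentrations like these can be counted directly instead of being read off the plot. The sketch below bins invented ages into bands with `pd.cut` and tabulates customers per band and per cluster. The toy values and the band edges are assumptions for illustration; in this notebook the inputs would be `df['Age']` and `df['clusters']`.

```python
import pandas as pd

# Invented ages and cluster labels, standing in for df['Age'] and df['clusters'].
toy = pd.DataFrame({
    "Age": [19, 25, 35, 42, 47, 65],
    "clusters": [0, 0, 1, 1, 2, 2],
})

# Bin ages into bands; the band edges are an arbitrary choice for illustration.
toy["age_band"] = pd.cut(toy["Age"], bins=[17, 30, 50, 70],
                         labels=["18-30", "31-50", "51-70"])

# Customers per age band, then per band and cluster.
band_counts = toy["age_band"].value_counts(sort=False)
band_by_cluster = pd.crosstab(toy["age_band"], toy["clusters"])

print(band_counts)
print(band_by_cluster)
```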


# 3D Scatter Plot

This 3D scatter plot displays customer data across three dimensions: age, spending score, and annual income.
Each dot is an individual customer, colored by cluster. The 3D visualization allows you to observe patterns and relationships between all three variables simultaneously.

In [None]:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d', elev=48, azim=134)

# Set axis labels
ax.set_xlabel('Spending Score (1-100)')
ax.set_ylabel('Age')
ax.set_zlabel('Annual Income (k$)')

# Create 3D scatter plot
ax.scatter(X[:, 3], X[:, 1], X[:, 2], c=clusters.astype(float), alpha=0.5)

# Show plot
plt.show()



The 3D perspective helps us spot patterns and clusters that might not be obvious in 2D views. Visually, the plot suggests a positive relationship between age, income, and spending score, with many customers in the region of older age, higher income, and higher spending; this impression depends on the viewing angle and is worth confirming against the numeric correlations. There are also customers in the opposite region (younger, lower-income, lower-spending), as well as outliers scattered elsewhere in the 3D space. The different densities of customers in various regions give us a sense of the sizes and characteristics of distinct customer segments. Rotating the plot and examining it from different angles can reveal further detail about the relationships and distributions within the data.

This 3D visualization provides a more comprehensive view of the multi-dimensional customer data, potentially revealing patterns and segments that might be hidden in 2D analyses. It can support more sophisticated customer segmentation, targeting, and product strategies.
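
Relationships that seem visible in a 3D view can be checked numerically with a correlation matrix. The sketch below runs `DataFrame.corr()` on a small invented table, so its values say nothing about the mall data. In this notebook the same call could be applied to the `Age`, `Annual Income (k$)`, and `Spending Score (1-100)` columns of `df`.

```python
import pandas as pd

# Invented values for illustration only; not taken from the mall dataset.
toy = pd.DataFrame({
    "Age": [20, 30, 40, 50, 60],
    "Annual Income (k$)": [15, 40, 60, 80, 20],
    "Spending Score (1-100)": [90, 70, 50, 30, 10],
})

# Pearson correlation between every pair of columns, in [-1, 1].
corr = toy.corr()
print(corr.round(2))
```

A coefficient near zero means there is no linear relationship, even where a particular viewing angle of the plot suggests a trend.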

# Conclusion

By using K-means clustering, we grouped customers into four segments based on their age, annual income, and spending score. The segments give a compact view of how customers differ, revealing patterns in behavior and preferences that can inform efforts to build customer loyalty. Such insights are useful when designing products, services, and campaigns around customer needs. The visualizations help a business understand its customers, identify opportunities within a diverse customer base, and adapt as needs change, which in turn supports engagement and retention over time. The analysis can be refined and validated with more data and more advanced clustering techniques to yield more precise and actionable results. In summary, customer segmentation with K-means clustering is a practical tool for informing business decisions and strengthening customer relationships.