# Use Case: Problem Statement

As a data scientist at **Amazon**, you are given a dataset that has details about different customers with features like
- 'ID',
- 'n_clicks',
- 'n_visits', etc.

You are asked to segment these customers so that **Amazon** can recommend relevant and similar items to each group of customers, which should increase overall sales.

#### What is Customer Segmentation?


Imagine you own a clothing store.

<center><img src='https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/069/167/original/kmeans5.jpg?1711541893' width = "800"></center>


**You wouldn't talk to teenagers the same way you talk to adults, right?**

You'd probably advertise **trendy jeans and graphic tees** to the **teens** and focus on **work attire or comfy clothes for adults**.


Customer segmentation is like that, but for all sorts of businesses.


**It's basically dividing your customers into groups based on things they have in common.**

- Age (teens, adults, seniors)
- Interests (sports fans, gamers, bookworms)
- Buying habits (frequent shoppers, big spenders, bargain hunters)


By dividing customers into groups and analysing them,
- businesses can target them with the right message.  

- They can advertise things those groups are likely to want, and do it in a way that appeals to them.


For example:

- Based on spending habits, customers can be segmented into 3 different groups:
    - Frequent Shoppers
    - Big spenders
    - Bargain hunters

#### Understanding the different segments

**Frequent Shoppers**

- These customers buy often, but their individual purchase amounts might be smaller.
- They might be attracted to a brand or specific product lines and enjoy the convenience of frequent purchases.

For Example:
- Customers who buy groceries weekly,
- people who refill on beauty products every month, or
- those who subscribe to recurring deliveries.

**Big Spenders**

- These customers make less frequent purchases, but their individual purchase value is high.
- They might be looking for premium products, large quantities, or bundled deals.

For Example:
- Customers who buy high-end electronics, furniture, or jewelry, or those who stock up on bulk items during sales.


**Bargain Hunters**

- These customers prioritize value and discounts.
- They might purchase less frequently but wait for sales or promotions before buying.
- They might be attracted to coupons, loyalty programs with point redemption, or clearance events.

Now that we understand what we have to do, let's look at the dataset.

In [None]:
 !wget "https://drive.google.com/uc?export=download&id=1lEccW5Y5_2z00VRtLGOAJOAU6YA9fl6W" -O E-commerce.csv

--2025-08-20 12:42:51--  https://drive.google.com/uc?export=download&id=1lEccW5Y5_2z00VRtLGOAJOAU6YA9fl6W
Resolving drive.google.com (drive.google.com)... 173.194.195.101, 173.194.195.138, 173.194.195.102, ...
Connecting to drive.google.com (drive.google.com)|173.194.195.101|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://drive.usercontent.google.com/download?id=1lEccW5Y5_2z00VRtLGOAJOAU6YA9fl6W&export=download [following]
--2025-08-20 12:42:51--  https://drive.usercontent.google.com/download?id=1lEccW5Y5_2z00VRtLGOAJOAU6YA9fl6W&export=download
Resolving drive.usercontent.google.com (drive.usercontent.google.com)... 142.251.184.132, 2607:f8b0:4001:c66::84
Connecting to drive.usercontent.google.com (drive.usercontent.google.com)|142.251.184.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 139827 (137K) [application/octet-stream]
Saving to: ‘E-commerce.csv’


2025-08-20 12:42:53 (6.03 MB/s) - ‘E-commerce.csv’ saved [

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('./E-commerce.csv')
df.head()

Unnamed: 0,ID,n_clicks,n_visits,amount_spent,amount_discount,days_since_registration,profile_information
0,1476,130,65,213.905831,31.600751,233,235
1,1535,543,46,639.223004,5.689175,228,170
2,1807,520,102,1157.402763,844.321606,247,409
3,1727,702,83,1195.903634,850.041757,148,200
4,1324,221,84,180.754616,64.2833,243,259


In [None]:
X=df.drop("ID",axis=1)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ID                       2500 non-null   int64  
 1   n_clicks                 2500 non-null   int64  
 2   n_visits                 2500 non-null   int64  
 3   amount_spent             2500 non-null   float64
 4   amount_discount          2500 non-null   float64
 5   days_since_registration  2500 non-null   int64  
 6   profile_information      2500 non-null   int64  
dtypes: float64(2), int64(5)
memory usage: 136.8 KB


### **Cleaning + Preprocessing**

**Some things to note about this data**

- `ID` is just an identifier for the customer; it is not a useful feature for clustering, so we drop it.
- There are no categorical variables. If there were, we would have dropped them too, since K-Means works on numeric distances.
- No missing values.
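
The notes above can be checked programmatically rather than by reading the `df.info()` output. The following is a minimal sketch on a small hand-made frame; `toy` and its values are hypothetical stand-ins for `E-commerce.csv`, and the same calls should apply to `df`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "ID": [1476, 1535, 1807],
    "n_clicks": [130, 543, 520],
    "amount_spent": [213.9, 639.2, 1157.4],
})

# Count missing values in each column
missing_per_column = toy.isna().sum()

# List columns that are not numeric (candidates for dropping or encoding)
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Drop the identifier so it does not influence distances
features = toy.drop(columns=["ID"])

print(missing_per_column)
print("Non-numeric columns:", non_numeric_cols)
print("Features used for clustering:", features.columns.tolist())
```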

In [None]:
X.describe()

Unnamed: 0,n_clicks,n_visits,amount_spent,amount_discount,days_since_registration,profile_information
count,2500.0,2500.0,2500.0,2500.0,2500.0,2500.0
mean,408.68,94.4756,1445.090745,388.508637,200.9736,201.0404
std,186.41409,38.866356,1167.663473,487.143968,99.136618,100.139213
min,50.0,10.0,0.0,0.0,0.0,0.0
25%,274.75,67.0,609.618538,56.298615,130.0,132.0
50%,378.0,92.0,1036.189112,137.454623,200.0,201.0
75%,522.0,119.0,1949.270949,679.540536,268.0,270.0
max,1246.0,259.0,6567.402267,2428.406527,514.0,585.0


### Feature Scaling

**Can you observe something about the ranges of the features?**

- Features are on different scales.

**Should we scale the variables for K-means or not?**

- K-Means is a distance-based algorithm. Because of that, it’s really important to perform feature scaling (normalize, standardize, or choose any other option in which the distance has some comparable meaning for all the columns).

- For our use case, we can use MinMaxScaler instead of StandardScaler. It transforms each feature to fall within a bounded interval defined by its min and max, rather than centering it at mean 0 with standard deviation 1 (StandardScaler).

- MinMaxScaler rescales each feature independently to the range [0, 1] by default (a different range can be set with `feature_range`). Note that it is sensitive to outliers: a few extreme values in a feature can compress most of the remaining values into a narrow part of the range.
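
To make the scaling argument concrete, here is a small sketch that applies the min-max formula `(x - min) / (max - min)` by hand, compares it with `MinMaxScaler`, and measures how much one large-range feature dominates the Euclidean distance before and after scaling. The array `toy` is a hypothetical three-row example loosely modelled on `n_visits` and `amount_spent`, not the real data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical rows: column 0 is visits-like, column 1 is spend-like
toy = np.array([[10.0, 0.0],
                [92.0, 1036.0],
                [259.0, 6567.0]])

# Manual min-max scaling, column by column
manual = (toy - toy.min(axis=0)) / (toy.max(axis=0) - toy.min(axis=0))
scaled = MinMaxScaler().fit_transform(toy)

def spend_share(a, b):
    """Fraction of the squared Euclidean distance contributed by column 1."""
    diff_sq = (a - b) ** 2
    return diff_sq[1] / diff_sq.sum()

# Compare the first two rows before and after scaling
raw_share_spent = spend_share(toy[0], toy[1])
scaled_share_spent = spend_share(scaled[0], scaled[1])

print("Manual formula matches MinMaxScaler:", np.allclose(manual, scaled))
print("Spend share of squared distance (raw):", round(raw_share_spent, 4))
print("Spend share of squared distance (scaled):", round(scaled_share_spent, 4))
```

On the raw values the spend-like column accounts for almost all of the distance, so K-Means would effectively ignore the other feature; after scaling both columns contribute on a comparable footing.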

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(X)
X=scaler.transform(X)


### Modeling



In [None]:
from sklearn.cluster import KMeans

k = 4 ## arbitrary value
kmeans = KMeans(n_clusters=k,random_state = 42)
y_pred = kmeans.fit_predict(X)

In [None]:
y_pred

array([3, 3, 0, ..., 3, 1, 0], dtype=int32)

In [None]:
kmeans.cluster_centers_

array([[0.45420614, 0.47552747, 0.20419432, 0.43879674, 0.55639781,
        0.38103402],
       [0.16972077, 0.2333873 , 0.49519562, 0.03295739, 0.38546526,
        0.36577318],
       [0.43951898, 0.45250413, 0.21523292, 0.46553245, 0.23500801,
        0.2938361 ],
       [0.27859127, 0.31899857, 0.09864364, 0.05688261, 0.3900433 ,
        0.33664185]])
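
The centers above are expressed in scaled units, which makes them hard to read as customer behaviour. One possible way to interpret them is to map them back to the original units with the scaler's `inverse_transform`. The sketch below uses synthetic two-feature data; `toy`, `toy_scaler` and `toy_kmeans` are hypothetical names, and the same call should work with the notebook's `scaler` and `kmeans`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic data with two obvious groups (assumed values, for illustration only)
low = rng.normal(loc=[50, 200], scale=[5, 20], size=(100, 2))
high = rng.normal(loc=[150, 2000], scale=[5, 20], size=(100, 2))
toy = pd.DataFrame(np.vstack([low, high]), columns=["n_visits", "amount_spent"])

# Scale, then cluster in the scaled space
toy_scaler = MinMaxScaler()
toy_scaled = toy_scaler.fit_transform(toy)
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(toy_scaled)

# Map the scaled centers back to the original units
centers_original = pd.DataFrame(
    toy_scaler.inverse_transform(toy_kmeans.cluster_centers_),
    columns=toy.columns,
)
print(centers_original.round(1))
```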

In [None]:
clusters_df = pd.DataFrame(X, columns=df.drop("ID",axis=1).columns)
clusters_df['label'] = kmeans.labels_
clusters_df.head(3)

Unnamed: 0,n_clicks,n_visits,amount_spent,amount_discount,days_since_registration,profile_information,label
0,0.06689,0.220884,0.032571,0.013013,0.453307,0.401709,3
1,0.412207,0.144578,0.097333,0.002343,0.44358,0.290598,3
2,0.392977,0.369478,0.176234,0.347685,0.480545,0.699145,0


In [None]:
polar = clusters_df.groupby("label").mean().reset_index()
polar = pd.melt(polar, id_vars=["label"])
polar.head(4)

Unnamed: 0,label,variable,value
0,0,n_clicks,0.454206
1,1,n_clicks,0.169721
2,2,n_clicks,0.439519
3,3,n_clicks,0.278591


In [None]:
import plotly.express as px

# polar: mean of each scaled feature per cluster, in long format (label, variable, value)
# r: mean feature value, drawn as the radius and connected by lines
# theta: feature name, so each feature gets its own angle
# color: cluster label, so each cluster gets its own line
fig = px.line_polar(polar, r="value", theta="variable", color="label", line_close=True,height=700,width=800)
fig.show()

In [None]:
from sklearn.decomposition import PCA

pca = PCA(2)

components_pca = pca.fit_transform(X)
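
The two PCA components are used here only to draw the clusters in 2D, so it helps to know how much of the original variance the projection keeps. A minimal sketch, assuming a random six-feature matrix `toy_X` in place of the scaled `X`:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical stand-in for the scaled feature matrix (200 rows, 6 features)
toy_X = rng.random((200, 6))

toy_pca = PCA(n_components=2)
toy_components = toy_pca.fit_transform(toy_X)

# Fraction of total variance kept by the two components together
retained = toy_pca.explained_variance_ratio_.sum()

print("Projected shape:", toy_components.shape)
print("Variance retained by 2 components:", round(float(retained), 3))
```

If the retained fraction is low, clusters that are well separated in the full feature space may appear to overlap in the scatter plot.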

In [None]:
def viz_clusters(clusters):
    plt.scatter(clusters['X1'], clusters['X2'], c=clusters['label'], s = 40)
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.title('Visualizing Clusters')


# Elbow Method

In [None]:
from sklearn.cluster import KMeans

In [None]:
# algorithm="lloyd" is the classic K-Means algorithm; the older name "full"
# was deprecated in scikit-learn 1.1 and later removed
kmeans_iter1 = KMeans(n_clusters=3, init="random", n_init=1,
                     algorithm="lloyd", random_state=0)
kmeans_iter2 = KMeans(n_clusters=5, init="random", n_init=1,
                     algorithm="lloyd", random_state=0)
kmeans_iter3 = KMeans(n_clusters=8, init="random", n_init=1,
                     algorithm="lloyd", random_state=0)
kmeans_iter1.fit(X)
kmeans_iter2.fit(X)
kmeans_iter3.fit(X)

In [None]:
clusters_1 = pd.DataFrame(components_pca, columns=['X1', 'X2'])
clusters_1['label'] = kmeans_iter1.labels_


clusters_2 = pd.DataFrame(components_pca, columns=['X1', 'X2'])
clusters_2['label'] = kmeans_iter2.labels_


clusters_3 = pd.DataFrame(components_pca, columns=['X1', 'X2'])
clusters_3['label'] = kmeans_iter3.labels_

In [None]:
plt.figure(figsize=(12,10))

plt.subplot(321)
viz_clusters(clusters_1)

plt.subplot(322)
viz_clusters(clusters_2)

plt.subplot(323)
viz_clusters(clusters_3)

- So yes, the end result of K-Means depends on the number of clusters.

**So, how many clusters are ideal? How to pick that?**
  - Using Inertia or WCSS

#### How to select the best model?


In [None]:
kmeans_iter1.inertia_

In [None]:
kmeans_iter3.inertia_

**As we increase the number of clusters, inertia decreases.**

**So, does that mean we should keep increasing the number of clusters for better performance?**

- No, we cannot simply take the value of **K** that minimizes the inertia, since it keeps getting lower as we increase **K**.
- The more clusters there are, the closer each instance is to its closest centroid, and therefore the lower the inertia.
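
To see what inertia measures, the sketch below recomputes it by hand as the sum of squared distances from each point to its assigned centroid and compares it with `inertia_` for a few values of K. The data `toy_X` is random and purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical 2D data, used only to illustrate the computation
toy_X = rng.random((300, 2))

toy_ks = (2, 4, 8)
manual_inertias = []
model_inertias = []
for k in toy_ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(toy_X)
    # Centroid assigned to each point, looked up through its label
    assigned_centers = model.cluster_centers_[model.labels_]
    # Inertia: sum of squared distances to the assigned centroid
    manual_inertias.append(float(((toy_X - assigned_centers) ** 2).sum()))
    model_inertias.append(float(model.inertia_))

for k, manual, reported in zip(toy_ks, manual_inertias, model_inertias):
    print(f"k={k}: manual={manual:.4f}, inertia_={reported:.4f}")
```

The manual value should agree with `inertia_`, and it shrinks as K grows, which is why inertia alone cannot be used to pick K.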

### Let's experiment with different numbers of clusters and plot their inertia



- Calculate the Within Cluster Sum of Squared Errors (WCSS) for different values of K
- Choose the K at which the decrease in WCSS starts to level off.

**The procedure can be summarized in the following steps:**
1. Perform K-Means clustering for different values of K, varying K from 1 to 9 clusters.
2. For each K, calculate the total within-cluster sum of squares (WCSS).
3. Plot the curve of WCSS vs the number of clusters K.
4. The inflection point (elbow) in the plot is generally considered to be an indicator of the appropriate number of clusters.

In [None]:
# Inertia = Within Cluster Sum of Squares
kmeans_per_k = [KMeans(n_clusters=k, random_state=42).fit(X)
                for k in range(1, 10)]

inertias = [model.inertia_ for model in kmeans_per_k]

In [None]:
plt.figure(figsize=(12, 8))
plt.plot(range(1, 10), inertias, "bo-")
plt.xlabel("$k$", fontsize=14)
plt.ylabel("Inertia", fontsize=14)
plt.annotate('Elbow',
             xy=(3, inertias[2]),
             xytext=(0.55, 0.55),
             textcoords='figure fraction',
             fontsize=16,
             arrowprops=dict(facecolor='black', shrink=0.1)
            )
plt.show()

- The inflection point is near 3 or 4, where the inertia drops sharply and then slows down.
- 3 would be a good choice: any lower value leaves the inertia much higher, while any higher value would not help much.

  **The elbow is the point after which the inertia curve becomes flat or nearly linear.**

But it's not a very precise method.

- The elbow curve still relies on human interpretation of where the slope changes.
- It gives a rough estimate only.
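
One way to reduce the reliance on eyeballing is to locate the elbow numerically. The sketch below uses a simple heuristic that is not part of the notebook: rescale both axes to [0, 1], draw a straight line from the first to the last point of the curve, and pick the K whose point lies furthest below that line. The function name `elbow_by_chord_gap` and the inertia values are hypothetical, and the result is still only a rough estimate.

```python
import numpy as np

def elbow_by_chord_gap(ks, inertias):
    """Return the k furthest below the line joining the curve's end points.

    Assumes inertias decrease from the first to the last k.
    """
    ks = np.asarray(ks, dtype=float)
    inertias = np.asarray(inertias, dtype=float)
    # Rescale both axes to [0, 1] so neither dominates the comparison
    x = (ks - ks[0]) / (ks[-1] - ks[0])
    y = (inertias - inertias[-1]) / (inertias[0] - inertias[-1])
    # After rescaling, the chord runs from (0, 1) to (1, 0), i.e. y = 1 - x
    gap = (1.0 - x) - y
    return int(ks[np.argmax(gap)])

# Hypothetical inertia values, not taken from the notebook's dataset
example_ks = list(range(1, 10))
example_inertias = [200, 50, 5, 4, 3, 2.5, 2, 1.5, 1]

suggested_k = elbow_by_chord_gap(example_ks, example_inertias)
print("Suggested k:", suggested_k)
```

With the notebook's variables, the call would be `elbow_by_chord_gap(range(1, 10), inertias)`; the suggestion should be compared against the plot rather than trusted blindly.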