<a href="https://colab.research.google.com/github/Advanced-Data-Science-TU-Berlin/Data-Science-Training-Python-Part-2/blob/main/interactive_notebooks/2_1_2_Mall_Customer_Segmentation_KMeans.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mall Customer Segmentation
![picture](https://www.aimtechnologies.co/wp-content/uploads/2023/09/customer-segmentation-social.png)

This example uses the "Mall Customer Segmentation" dataset, which is commonly used for customer segmentation analysis: dividing customers into groups that share characteristics such as income or spending behavior.

Here's more information about the columns in the dataset:

*   **CustomerID**: A unique identifier for each customer.
*   **Gender**: The gender of the customer (e.g., Male or Female).
*   **Age**: The age of the customer.
*   **Annual Income (k$)**: The annual income of the customer in thousands of dollars.
*   **Spending Score (1-100)**: A score assigned to the customer by the mall based on their spending behavior and various parameters.

In [None]:
# Yellowbrick extends the Scikit-Learn API to make model selection and hyperparameter tuning easier. Under the hood, it’s using Matplotlib.
!pip install yellowbrick

In [2]:
# Import
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
url = "https://raw.githubusercontent.com/rileypredum/mall_customer_segmentation/master/Mall_Customers.csv"

# Read the data from the url
# Hint: use pd.read_csv and pass the url
<your-code-here>

# Look at the first few rows of the data
display(df.head())

## Exploratory Data Analysis (EDA)
As the first step, let's explore the dataset.

In [None]:
# Add code for basic statistics, missing values, data types, etc.

# Display basic statistics
# Hint: use df.describe()
display(<your-code-here>)

# Display the data types
# Hint: use df.dtypes
display(<your-code-here>)

# Check for missing values
# Hint: use df.isnull().sum()
<your-code-here>

### Pairwise Correlation Plot
A pairwise correlation plot (also called a pair plot or scatterplot matrix) draws one scatterplot for every pair of numerical variables in a dataset, with each variable's own distribution on the diagonal. It is a quick way to see how features relate to each other and whether any patterns, trends, or natural groupings stand out.

**Interpretation:**
- *Positive Correlation:* Points on the scatterplot tend to follow an upward trend. When one variable increases, the other variable also tends to increase.

- *Negative Correlation:* Points on the scatterplot tend to follow a downward trend. When one variable increases, the other variable tends to decrease.

- *No Correlation:* Points on the scatterplot appear randomly distributed, indicating a lack of a clear relationship between the variables.
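
The three cases above can also be quantified with the Pearson correlation coefficient, which ranges from -1 to 1. The following is a minimal sketch on synthetic data (the arrays `x`, `y_pos`, `y_neg`, `y_none` and the helper `pearson_r` are made up for illustration and are not part of the mall dataset). On the real data, `df[numerical_cols].corr()` would give the analogous table.

```python
import numpy as np

# Synthetic data, for illustration only (not the mall dataset)
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
noise = rng.normal(scale=0.3, size=1000)

y_pos = 2.0 * x + noise         # rises with x  -> positive correlation
y_neg = -2.0 * x + noise        # falls with x  -> negative correlation
y_none = rng.normal(size=1000)  # independent of x -> no correlation

def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

r_pos = pearson_r(x, y_pos)
r_neg = pearson_r(x, y_neg)
r_none = pearson_r(x, y_none)

print(f"positive: {r_pos:.2f}, negative: {r_neg:.2f}, none: {r_none:.2f}")
```

Values close to 1 or -1 correspond to a tight upward or downward trend in the scatterplot, while values near 0 correspond to a cloud with no clear linear direction.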


In [None]:
import seaborn as sns

# Select numerical columns for correlation plot.
numerical_cols = list(df.select_dtypes(include=[np.number]).columns)

# Should we include CustomerID?
# No: CustomerID is only an identifier. It is unique for each customer and carries no meaningful information for clustering.
# Hint: use numerical_cols.remove and pass 'CustomerID' as input
<your-code-here>

# Create a pairwise correlation plot
# Hint: use sns.pairplot and pass df[numerical_cols] as input
<your-code-here>
plt.suptitle("Pairwise Correlation Plot", y=1.02)
plt.show()

### Feature Selection

- Can you think of relevant features for clustering in this data?

For clustering, we are interested in understanding customer segments based on their income and spending behavior. Therefore, we select 'Annual Income (k$)' and 'Spending Score (1-100)' as our relevant features.


In [7]:
# Select relevant features for clustering (e.g., 'Annual Income (k$)' and 'Spending Score (1-100)')
X = df[['Annual Income (k$)', 'Spending Score (1-100)']]

## Data Preprocessing
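
K-means relies on Euclidean distances, so features measured on different scales can dominate the result. Standardization rescales each feature to zero mean and unit variance. As a rough sketch of what that transformation does, here is the same computation written with plain NumPy on a small made-up array (`X_demo` and its values are hypothetical, not rows taken from the dataset).

```python
import numpy as np

# Hypothetical (income, spending score) pairs, for illustration only
X_demo = np.array([
    [15.0, 39.0],
    [40.0, 81.0],
    [70.0, 6.0],
    [137.0, 77.0],
])

# Subtract the column mean and divide by the column standard deviation.
# NumPy's default std uses the population formula (ddof=0).
col_mean = X_demo.mean(axis=0)
col_std = X_demo.std(axis=0)
X_demo_scaled = (X_demo - col_mean) / col_std

print(X_demo_scaled.mean(axis=0))  # approximately 0 for each column
print(X_demo_scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, both columns contribute comparably to the distance between two customers.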

In [8]:
from sklearn.preprocessing import StandardScaler

# Standardize the features
scaler = StandardScaler()

# Fit and transform the data using standard scaler
# Hint: use scaler.fit_transform function and pass X
X_scaled = <your-code-here>

## Determining the Optimal Number of Clusters (K)
Use the Elbow Method to find the optimal number of clusters (K).

### Elbow Method

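The Elbow Method helps us determine the optimal number of clusters by plotting WCSS against K and looking for the "elbow": the point after which adding more clusters only reduces WCSS slightly.
- WCSS (Within-Cluster Sum of Squares) measures the compactness of the clusters: it is the sum of squared distances between each point and the centroid of its cluster. Lower values mean tighter clusters.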
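
Written as a formula, with $C_k$ the set of points in cluster $k$ and $\mu_k$ its centroid, $\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$. The sketch below computes this by hand on a tiny made-up dataset (the function `wcss` and the four points are illustrative assumptions, not part of the notebook), to show why WCSS drops sharply when K matches the natural grouping.

```python
import numpy as np

def wcss(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    diffs = points - centroids[labels]
    return float(np.sum(diffs ** 2))

# Four made-up points forming two obvious groups (x near 0 and x near 10)
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# K = 1: every point belongs to a single cluster centred at the overall mean
labels_one = np.zeros(4, dtype=int)
centroids_one = points.mean(axis=0, keepdims=True)

# K = 2: one cluster per group
labels_two = np.array([0, 0, 1, 1])
centroids_two = np.array([[0.0, 1.0], [10.0, 1.0]])

wcss_one = wcss(points, labels_one, centroids_one)
wcss_two = wcss(points, labels_two, centroids_two)

print(wcss_one)  # 104.0
print(wcss_two)  # 4.0
```

This is the same quantity that a fitted scikit-learn `KMeans` model exposes as `inertia_`.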

#### Sample Elbow Curve
<div>
<img src="https://almablog-media.s3.ap-south-1.amazonaws.com/elbow_2_18740a3e28.png" width="500"/>
</div>


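**Note on `init='k-means++'`:** this option selects the initial cluster centroids by sampling based on an empirical probability distribution of the points' contribution to the overall inertia, so that starting centroids tend to be spread out. Compared with purely random initialization, this technique speeds up convergence.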
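
To make the seeding idea concrete, here is a simplified sketch of the classic k-means++ procedure: pick the first centroid uniformly at random, then pick each further centroid with probability proportional to its squared distance from the nearest centroid chosen so far. The function name `kmeans_pp_init` and the random demo data are assumptions for illustration; scikit-learn's own implementation differs in details.

```python
import numpy as np

def kmeans_pp_init(points, n_clusters, rng):
    """Simplified k-means++ seeding (illustrative, not scikit-learn's code)."""
    n_samples = points.shape[0]
    # First centroid: a uniformly random data point
    centers = [points[rng.integers(n_samples)]]
    for _ in range(1, n_clusters):
        chosen = np.array(centers)
        # Squared distance from every point to its nearest chosen centroid
        sq_dists = ((points[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2)
        nearest = sq_dists.min(axis=1)
        # Far-away points are more likely to become the next centroid
        probs = nearest / nearest.sum()
        centers.append(points[rng.choice(n_samples, p=probs)])
    return np.array(centers)

rng = np.random.default_rng(0)
demo_points = rng.normal(size=(50, 2))  # made-up 2-D data
initial_centers = kmeans_pp_init(demo_points, n_clusters=3, rng=rng)
print(initial_centers.shape)  # (3, 2)
```

Because an already chosen point has distance zero to itself, it cannot be selected again, which keeps the starting centroids distinct.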

In [None]:
from sklearn.cluster import KMeans
import ipywidgets as widgets
from ipywidgets import interact

# Define the function to plot the Elbow Method graph
def plot_elbow_method(k):
    wcss = []
    for i in range(1, k + 1):
        kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
        # Fit kmeans model
        # Hint: use kmeans.fit and pass X_scaled
        <your-code-here>
        wcss.append(kmeans.inertia_)

    # Plot the Elbow Method graph
    # marker='o' keeps the point visible even when the slider is at k = 1
    plt.plot(range(1, k + 1), wcss, marker='o')
    plt.title('Elbow Method for Optimal K')
    plt.xlabel('Number of clusters')
    plt.ylabel('WCSS')  # WCSS stands for Within-Cluster Sum of Squares
    plt.show()

# Create an interactive slider for the number of clusters (k)
interact(plot_elbow_method, k=widgets.IntSlider(min=1, max=10, step=1, value=3))

What is the best number of clusters according to the Elbow Method?

## K-Means Clustering
Based on the Elbow Method, let's choose the optimal K.

### Applying K-means Clustering
We choose the number of clusters based on the "elbow" observed in the WCSS graph, which indicates a balance between cluster compactness and simplicity.
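
For intuition about what fitting does once K is fixed, the sketch below runs a single iteration of the standard K-means loop (often called Lloyd's algorithm) with NumPy: assign every point to its nearest centroid, then move each centroid to the mean of its assigned points. The helper `lloyd_step`, the four points, and the starting centroids are made up for illustration; the library repeats such steps until the assignments stabilize or `max_iter` is reached.

```python
import numpy as np

def lloyd_step(points, centroids):
    """One K-means iteration: assignment step followed by update step."""
    # Squared distance from every point to every centroid
    sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = sq_dists.argmin(axis=1)
    # New centroid = mean of assigned points (kept unchanged if a cluster is empty)
    new_centroids = np.array([
        points[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(len(centroids))
    ])
    return labels, new_centroids

# Made-up data and deliberately imperfect starting centroids
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
start_centroids = np.array([[1.0, 1.0], [8.0, 1.0]])

labels, updated_centroids = lloyd_step(points, start_centroids)
print(labels)  # [0 0 1 1]
print(updated_centroids)
```

After this one step the centroids sit at (0, 1) and (10, 1), the centers of the two groups, and a further step would change nothing.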

In [None]:
k_optimal = 5

# Apply K-means clustering
kmeans = KMeans(n_clusters=k_optimal, init='k-means++', max_iter=300, n_init=10, random_state=0)

# Fit the KMeans model and assign cluster labels to the DataFrame
# Hint: use kmeans.fit_predict and pass X_scaled
df['Cluster'] = <your-code-here>

# Display the clustered data
df.head()

## Visualizing Clusters
Let's use a scatter plot to visualize the clusters formed by K-means.

Each point represents a customer, and the color indicates the cluster to which they belong.


In [None]:
# Visualize the clusters
# Hint: pass df['Cluster'] for the `c` parameter
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=<cluster-column>, cmap='viridis', s=50)

# Add centroids to the plot (students can customize the marker style, color, etc.)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='red', marker='X', label='Centroids')

# Set plot title and axis labels
plt.title('K-means Clustering')
plt.xlabel('Annual Income (Standardized)')
plt.ylabel('Spending Score (Standardized)')

# Add legend to the plot
plt.legend()

# Display the plot
plt.show()