# **Customer Segmentation & Churn Prediction**

# 1. K-Means (Clustering)
**Purpose**:  
K-Means is an unsupervised clustering algorithm that groups similar customers based on features such as purchase history, frequency, and monetary value. Because it is unsupervised, it does not require labeled data.

**Use Case for Customer Segmentation**:  
K-Means is often used in customer segmentation based on features like **Recency, Frequency, and Monetary** value (RFM analysis). This helps identify distinct customer groups, such as:

- **High-value customers**
- **At-risk customers**
- **Frequent purchasers**

**When to Use**:  
Use K-Means when you want to identify patterns or groups of customers from **unlabeled data**, such as clustering customers based on purchasing behavior without predefined categories.

**Example Application of K-Means for Customer Segmentation**:
- K-Means can automatically group customers into clusters based on their similarity in spending habits, frequency, and other features.
- The number of clusters (k) is determined through methods like the **Elbow Method** or **Silhouette Scores**.
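
As a rough illustration of how k might be chosen, the sketch below fits K-Means for several values of k on synthetic RFM-like data and records both the inertia (used by the Elbow Method) and the silhouette score. The data, the cluster centers, and the variable names are assumptions made up for this example, not values from the retail dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic RFM-like data (illustrative only): three well-separated groups
# with columns [recency_days, frequency, monetary]
rng = np.random.default_rng(0)
centers = np.array([[10, 50, 5000], [200, 5, 200], [60, 20, 1500]])
X = np.vstack([c + rng.normal(scale=[5, 2, 100], size=(50, 3)) for c in centers])

# Scale the features so that no single one dominates the Euclidean distance
X_scaled = StandardScaler().fit_transform(X)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = model.inertia_  # plotted against k for the Elbow Method
    silhouettes[k] = silhouette_score(X_scaled, model.labels_)

# Pick the k with the highest silhouette score
best_k = max(silhouettes, key=silhouettes.get)
print(best_k)
```

On real customer data the two criteria may disagree, so the final choice of k is usually a judgment call informed by both.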

---

# 2. K-Nearest Neighbors (KNN, Classification)
**Purpose**:  
KNN is a **classification algorithm** that assigns a label to a customer based on the majority label of its nearest neighbors. It is supervised, meaning it requires labeled data for training. KNN uses distance metrics to classify new customers into pre-defined categories.

**Use Case for Customer Segmentation**:  
KNN is useful when you have labeled customer data and want to classify new customers based on their similarity to existing customers. For example, KNN can be used for:

- Classifying new customers as **"loyal,"** **"potential churn,"** or **"occasional"** based on their similarity to previously labeled customers.

**When to Use**:  
KNN is best used when you already have **labeled customer data** and want to classify new customers or predict which segment they belong to based on their features.
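
The majority-vote idea can be sketched on a handful of made-up customers. The feature values and the two labels below are hypothetical and chosen only to make the neighborhoods obvious.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Hypothetical labeled customers: [recency_days, frequency, monetary]
X_labeled = np.array([
    [5, 40, 3000], [12, 35, 2500], [8, 50, 4000],   # labeled "loyal"
    [300, 2, 80], [250, 3, 120], [280, 1, 50],      # labeled "potential churn"
])
y_labeled = np.array(["loyal"] * 3 + ["potential churn"] * 3)

# KNN is distance-based, so the features are standardized first
scaler = StandardScaler().fit(X_labeled)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(scaler.transform(X_labeled), y_labeled)

# Each new customer receives the majority label of its 3 nearest neighbors
new_customers = np.array([[10, 45, 3500], [270, 2, 90]])
predicted = knn.predict(scaler.transform(new_customers))
print(predicted)
```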

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA

### Load the Dataset

In [None]:
file_path = '/Online_Retail/data/online_retail_II.xlsx'

In [None]:
train_data = pd.read_excel(file_path, sheet_name='Year 2009-2010')
test_data = pd.read_excel(file_path, sheet_name='Year 2010-2011')

In [None]:
print(train_data.head())

In [None]:
print(test_data.head())

In [None]:
train_data.info()

In [None]:
test_data.info()

In [None]:
train_data = train_data.dropna(inplace=False)

In [None]:
test_data = test_data.dropna(inplace=False)

In [None]:
# Check for missing values in the 'Customer ID' column
print(train_data['Customer ID'].isnull().sum())

# If there are missing values, you can either drop them or fill them based on your use case:
# train_data = train_data.dropna(subset=['Customer ID'])

In [None]:
print(test_data['Customer ID'].isnull().sum())

### Detecting Outliers

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Function to plot the normal distribution with standard deviation lines
def plot_normal_distribution_with_std(data, column_name, dataset_type):
    plt.figure(figsize=(10, 6))
    
    # Calculate mean and standard deviation
    mean = np.mean(data)
    std_dev = np.std(data)
    
    # Generate the normal distribution curve
    xmin, xmax = mean - 4*std_dev, mean + 4*std_dev  # Focus on ±4 standard deviations
    x = np.linspace(xmin, xmax, 100)
    p = norm.pdf(x, mean, std_dev)
    
    # Plot the normal distribution curve
    plt.plot(x, p, 'k', linewidth=2, label='Normal Distribution')
    
    # Add vertical lines for ±1, ±2, and ±3 standard deviations
    plt.axvline(mean, color='blue', linestyle='--', label='Mean')
    plt.axvline(mean + std_dev, color='green', linestyle='--', label='±1 Standard Deviation')
    plt.axvline(mean - std_dev, color='green', linestyle='--')
    plt.axvline(mean + 2*std_dev, color='orange', linestyle='--', label='±2 Standard Deviations')
    plt.axvline(mean - 2*std_dev, color='orange', linestyle='--')
    plt.axvline(mean + 3*std_dev, color='red', linestyle='--', label='±3 Standard Deviations')
    plt.axvline(mean - 3*std_dev, color='red', linestyle='--')

    # Fill areas under the curve for ±1, ±2, and ±3 standard deviations
    plt.fill_between(x, p, where=((x >= mean - std_dev) & (x <= mean + std_dev)), color='green', alpha=0.2)
    plt.fill_between(x, p, where=((x >= mean - 2*std_dev) & (x <= mean + 2*std_dev)), color='orange', alpha=0.2)
    plt.fill_between(x, p, where=((x >= mean - 3*std_dev) & (x <= mean + 3*std_dev)), color='red', alpha=0.2)
    
    # Add labels and title
    plt.title(f'Normal Distribution for {column_name} ({dataset_type})')
    plt.xlabel(column_name)
    plt.ylabel('Density')
    plt.legend(loc="best")
    
    # Show the plot
    plt.show()

# Plot the distribution for 'Quantity' in train data
plot_normal_distribution_with_std(train_data['Quantity'], 'Quantity', 'Train')

# Plot the distribution for 'Quantity' in test data
plot_normal_distribution_with_std(test_data['Quantity'], 'Quantity', 'Test')

# Plot the distribution for 'Price' in train data
plot_normal_distribution_with_std(train_data['Price'], 'Price', 'Train')

# Plot the distribution for 'Price' in test data
plot_normal_distribution_with_std(test_data['Price'], 'Price', 'Test')

In [None]:
def detect_outlier(data):
    outliers = []
    threshold = 3  # Z-score threshold
    mean = np.mean(data)  
    std = np.std(data)  
    
    # Identify outliers based on Z-score
    for y in data:
        z_score = (y - mean) / std
        if np.abs(z_score) > threshold:
            outliers.append(y)
    
    return outliers

# List of irrelevant columns to exclude from outlier detection
exclude_columns = ['Customer ID']

for dataset, name in zip([train_data, test_data], ['Train', 'Test']):
    for item in dataset.select_dtypes(include=[np.number]).columns:
        if item not in exclude_columns:
            mean = np.mean(dataset[item])
            print(f'Outliers in {item} ({name} data) will be replaced by the mean: {mean}')
            
            # Detect outliers
            outliers = detect_outlier(dataset[item])
            
            # Replace outliers with the mean. The result is assigned back because
            # replace(..., inplace=False) returns a new Series and leaves the column unchanged.
            if outliers:
                dataset[item] = dataset[item].replace(outliers, mean)
            
            print(f'{len(outliers)} outliers found in {item} ({name} data)')

In [None]:
# Detect outliers in the "Quantity" and "Price" columns
quantity_outliers_train_data = detect_outlier(train_data['Quantity'])
price_outliers_train_data = detect_outlier(train_data['Price'])

In [None]:
# Detect outliers in the "Quantity" and "Price" columns
quantity_outliers_test_data = detect_outlier(test_data['Quantity'])
price_outliers_test_data = detect_outlier(test_data['Price'])
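
The same Z-score rule can also be written without an explicit Python loop. The sketch below is one possible vectorized variant; the helper name `zscore_outlier_mask` and the toy series are assumptions for illustration and are not part of the notebook's pipeline.

```python
import pandas as pd

def zscore_outlier_mask(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return a boolean mask of values whose |z-score| exceeds the threshold."""
    std = series.std(ddof=0)  # population standard deviation, matching np.std
    if std == 0:
        # A constant column has no outliers under this rule
        return pd.Series(False, index=series.index)
    z = (series - series.mean()) / std
    return z.abs() > threshold

# Toy data: twenty ordinary values and one extreme value
values = pd.Series([10.0] * 20 + [1000.0])
mask = zscore_outlier_mask(values)

# Replace the flagged values with the column mean, as in the loop above
cleaned = values.mask(mask, values.mean())
print(mask.sum())
```

A boolean mask also makes it easy to inspect or drop the flagged rows instead of replacing them.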

# RFM (Recency, Frequency, Monetary)
**Purpose**:  
RFM analysis is a customer segmentation technique used to evaluate and rank customers based on their purchasing behavior. It helps identify valuable customers by analyzing three key metrics:

- **Recency**: How recently a customer made a purchase.
- **Frequency**: How often a customer makes a purchase.
- **Monetary**: How much money a customer spends.

**Use Case for Customer Segmentation**:  
RFM helps segment customers into different groups such as high-value customers, at-risk customers, and frequent purchasers. By evaluating these metrics, businesses can tailor marketing strategies to different customer segments.
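
To make the three metrics concrete, here is a minimal sketch that computes them for a few invented transactions. The column names mirror the ones used in this notebook, but the rows are hypothetical. Frequency is counted here as the number of distinct invoices, which is one possible definition; counting invoice line items is another.

```python
import pandas as pd

# Hypothetical transactions (illustrative only)
tx = pd.DataFrame({
    "Customer ID": [1, 1, 2, 2, 2, 3],
    "Invoice": ["A1", "A2", "B1", "B1", "B2", "C1"],
    "InvoiceDate": pd.to_datetime([
        "2010-01-05", "2010-03-01", "2010-02-10",
        "2010-02-10", "2010-03-10", "2009-12-20",
    ]),
    "Quantity": [2, 1, 5, 3, 1, 10],
    "Price": [10.0, 20.0, 2.0, 4.0, 50.0, 1.5],
})
tx["Amount"] = tx["Quantity"] * tx["Price"]

# Recency is measured relative to the latest invoice date in the data
snapshot = tx["InvoiceDate"].max()

rfm = tx.groupby("Customer ID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    Frequency=("Invoice", "nunique"),
    Monetary=("Amount", "sum"),
).reset_index()
print(rfm)
```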

---

# Churn Label
**Purpose**:  
The churn label indicates whether a customer has **stopped** doing business with a company. It’s a **binary label** typically represented as:

- **Churn = 1**: The customer has churned (stopped purchasing).
- **Churn = 0**: The customer has not churned (still an active customer).

**Use Case for Customer Churn Prediction**:  
By applying churn labels to customer data, businesses can use machine learning models like **KNN** to predict which customers are likely to churn. This allows companies to take proactive measures to retain these customers.
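
One simple way to derive such a label is a recency cutoff. The sketch below assumes a 180-day threshold, the same example value used in the KNN section of this notebook; the four customers are invented for illustration.

```python
import pandas as pd

# Assumed cutoff: customers inactive for more than 180 days count as churned
CHURN_THRESHOLD_DAYS = 180

rfm = pd.DataFrame({
    "Customer ID": [1, 2, 3, 4],
    "Recency": [12, 181, 180, 365],  # days since the last purchase
})

# Churn = 1 when Recency exceeds the threshold, otherwise 0
rfm["Churn"] = (rfm["Recency"] > CHURN_THRESHOLD_DAYS).astype(int)

# Share of customers labeled as churned
churn_rate = rfm["Churn"].mean()
print(rfm)
```

The right threshold depends on the business: a cutoff that suits a grocery retailer would be far too short for a furniture store.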

## RFM Calculation

In [None]:
train_data['InvoiceDate'] = pd.to_datetime(train_data['InvoiceDate'])
train_data['Diff'] = max(train_data['InvoiceDate']) - train_data['InvoiceDate']
recency_train = train_data.groupby('Customer ID')['Diff'].min()
recency_train = recency_train.dt.days
recency_train = recency_train.reset_index()

In [None]:
test_data['InvoiceDate'] = pd.to_datetime(test_data['InvoiceDate'])
test_data['Diff'] = max(test_data['InvoiceDate']) - test_data['InvoiceDate']
recency_test = test_data.groupby('Customer ID')['Diff'].min()
recency_test = recency_test.dt.days
recency_test = recency_test.reset_index()

In [None]:
train_data['Amount'] = train_data['Quantity'] * train_data['Price']
monetary_train = train_data.groupby('Customer ID')['Amount'].sum()
monetary_train = monetary_train.reset_index()
monetary_train

In [None]:
test_data['Amount'] = test_data['Quantity'] * test_data['Price']
monetary_test = test_data.groupby('Customer ID')['Amount'].sum()
monetary_test = monetary_test.reset_index()
monetary_test

In [None]:
frequency_train = train_data.groupby('Customer ID')['Invoice'].count()
frequency_train = frequency_train.reset_index()
frequency_train

In [None]:
frequency_test = test_data.groupby('Customer ID')['Invoice'].count()
frequency_test = frequency_test.reset_index()
frequency_test

In [None]:
rfm_train = pd.merge(recency_train, frequency_train, on='Customer ID')
rfm_train = pd.merge(rfm_train, monetary_train, on='Customer ID')
rfm_train.columns = ['Customer ID', 'Recency', 'Frequency', 'Monetary']
rfm_train

In [None]:
rfm_test = pd.merge(recency_test, frequency_test, on='Customer ID')
rfm_test = pd.merge(rfm_test, monetary_test, on='Customer ID')
rfm_test.columns = ['Customer ID', 'Recency', 'Frequency', 'Monetary']
rfm_test

In [None]:
# Inspect the columns only after rfm_test has been created
print(rfm_test.columns)

In [None]:
rfm_scaler = StandardScaler()
rfm_scaled_train = rfm_train[['Recency', 'Frequency', 'Monetary']]
# Fit the scaler on the training data only
rfm_scaled_train = rfm_scaler.fit_transform(rfm_scaled_train)
rfm_scaled_train = pd.DataFrame(rfm_scaled_train)
rfm_scaled_train.columns = ['Recency', 'Frequency', 'Monetary']
rfm_scaled_train

In [None]:
rfm_scaled_test = rfm_test[['Recency', 'Frequency', 'Monetary']]
# Reuse the scaler fitted on the training data so that both sets share the same scale
rfm_scaled_test = rfm_scaler.transform(rfm_scaled_test)
rfm_scaled_test = pd.DataFrame(rfm_scaled_test)
rfm_scaled_test.columns = ['Recency', 'Frequency', 'Monetary']
rfm_scaled_test

### **K means**

In [None]:
distortions = []
range_n_clusters = range(1, 10)
for num_cluster in range_n_clusters :
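    # Fix the seed and number of initializations so the elbow curve is reproducible
    kmeans = KMeans(n_clusters=num_cluster, random_state=42, n_init=10)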
    kmeans.fit(rfm_scaled_train)
    distortions.append(kmeans.inertia_)
plt.figure(figsize=(16,8))
plt.plot(range_n_clusters, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()

In [None]:
optimal_k = 3

In [None]:
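# Fix the seed so that cluster assignments are reproducible across runs
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)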
kmeans.fit(rfm_scaled_train)

In [None]:
train_clusters = kmeans.predict(rfm_scaled_train)

In [None]:
rfm_train['Cluster'] = train_clusters

In [None]:
print(rfm_train.head(15))

In [None]:
test_clusters = kmeans.predict(rfm_scaled_test)

In [None]:
rfm_test['Cluster'] = test_clusters

In [None]:
print(rfm_test.head(15))

In [None]:
print(rfm_train['Cluster'].value_counts())

In [None]:
print(rfm_test['Cluster'].value_counts())

In [None]:
# Reduce the dimensions of the scaled training data for visualization
pca = PCA(n_components=2)
rfm_train_pca = pca.fit_transform(rfm_scaled_train)

In [None]:
plt.figure(figsize=(10,6))
plt.scatter(rfm_train_pca[:, 0], rfm_train_pca[:, 1], c=train_clusters, cmap='viridis')
plt.title('Cluster Visualization (Training Data)')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.colorbar(label='Cluster')
plt.show()

In [None]:
rfm_test_pca = pca.transform(rfm_scaled_test)

In [None]:
plt.figure(figsize=(10,6))
plt.scatter(rfm_test_pca[:, 0], rfm_test_pca[:, 1], c=test_clusters, cmap='viridis')
plt.title('Cluster Visualization (Test Data)')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.colorbar(label='Cluster')
plt.show()

In [None]:
# Analyze the characteristics of each cluster in training data
cluster_analysis_train = rfm_train.groupby('Cluster').mean().drop(columns=['Customer ID'])
print(cluster_analysis_train)

In [None]:
# Analyze the characteristics of each cluster in the test data without 'Customer ID'
cluster_analysis_test = rfm_test.groupby('Cluster').mean().drop(columns=['Customer ID'])
print(cluster_analysis_test)


In [None]:
# Calculate silhouette score for the training data
silhouette_train = silhouette_score(rfm_scaled_train, train_clusters)
print(f'Silhouette Score for Training Data: {silhouette_train}')

In [None]:
# Calculate silhouette score for the test data
silhouette_test = silhouette_score(rfm_scaled_test, test_clusters)
print(f'Silhouette Score for Test Data: {silhouette_test}')

In [None]:
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(rfm_scaled_train)
    score = silhouette_score(rfm_scaled_train, kmeans.labels_)
    print(f'For n_clusters = {k}, the silhouette score is {score}')

In [None]:
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(rfm_scaled_test)
    score = silhouette_score(rfm_scaled_test, kmeans.labels_)
    print(f'For n_clusters = {k}, the silhouette score is {score}')

### **KNN**

In [None]:
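# Define churn label based on Recency (example: if Recency > 180 days, mark as churn)
# Note: Recency is also one of the KNN features used below, so this label is a direct function of an input feature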
rfm_train['Churn'] = rfm_train['Recency'].apply(lambda x: 1 if x > 180 else 0)

In [None]:
# Check the churn distribution
print(rfm_train['Churn'].value_counts())

In [None]:
# Define churn label based on Recency (example: if Recency > 180 days, mark as churn)
rfm_test['Churn'] = rfm_test['Recency'].apply(lambda x: 1 if x > 180 else 0)

In [None]:
# Check the churn distribution
print(rfm_test['Churn'].value_counts())

In [None]:
print(rfm_train.columns)

In [None]:
print(rfm_test.columns)

In [None]:
X_train = rfm_train[['Recency', 'Frequency', 'Monetary']]
y_train = rfm_train['Churn']

In [None]:
# Define the test features and labels before scaling, since the scaler is applied to X_test
X_test = rfm_test[['Recency', 'Frequency', 'Monetary']]
y_test = rfm_test['Churn']

In [None]:
# Fit the scaler on the training features only, then apply it to the test features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
In [None]:
rfm_test['Churn_Predicted'] = knn.predict(X_test_scaled)

In [None]:
accuracy = accuracy_score(y_test, rfm_test['Churn_Predicted'])
print(f"Accuracy: {accuracy}")

In [None]:
print("Classification Report:")
print(classification_report(y_test, rfm_test['Churn_Predicted']))

## **Conclusion**

In [None]:
plt.figure(figsize=(10, 6))

# Scatter plot of Frequency vs Monetary with cluster color coding
sns.scatterplot(x=rfm_train['Frequency'], y=rfm_train['Monetary'], hue=rfm_train['Cluster'], palette="deep", s=100)

plt.title('Customer Segmentation based on K-Means Clusters')
plt.xlabel('Frequency')
plt.ylabel('Monetary Value')
plt.legend(title='Cluster')
plt.show()

In [None]:
cm = confusion_matrix(y_test, rfm_test['Churn_Predicted'])

plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No Churn', 'Churn'], yticklabels=['No Churn', 'Churn'])

plt.title('Confusion Matrix for KNN Churn Prediction')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()