# K-means Clustering Analysis with Python
## Overview
This notebook walks through loading preprocessed, scaled data, applying K-means clustering, and visualizing the results in Python. Customers are clustered on their tenure and monthly charges, once using Min-Max scaled data and once using Standard scaled data.

## Step 1: Importing Required Libraries
First, import the libraries and modules used for data processing, clustering, and visualization.

In [1]:
import os
import sys
import json
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import silhouette_score


## Step 2: Set Up Paths and Ensure Utility Modules are Accessible
Here, we add the project's `utils` directory to `sys.path` so that the notebook can locate and import the custom utility modules (`data_loader`, `data_cleaner`, and `handle_missing_and_encode`).

In [3]:
# Ensure the utils module can be found
# os.path.abspath('') is the current working directory; its parent is the notebooks folder
notebook_dir = os.path.dirname(os.path.abspath(''))
project_root = os.path.abspath(os.path.join(notebook_dir, '..'))
utils_path = os.path.join(project_root, 'utils')

print(f"Notebook directory: {notebook_dir}")
print(f"Project root: {project_root}")
print(f"Utils path: {utils_path}")

if utils_path not in sys.path:
    sys.path.append(utils_path)

try:
    from data_loader import load_data
    from data_cleaner import clean_data
    from handle_missing_and_encode import handle_missing_and_encode
except ImportError as e:
    print(f"Error importing module: {e}")
    sys.exit(1)


Notebook directory: C:\Users\kusha\OneDrive\Documents\Customer-Churn-Analysis-main\notebooks
Project root: C:\Users\kusha\OneDrive\Documents\Customer-Churn-Analysis-main
Utils path: C:\Users\kusha\OneDrive\Documents\Customer-Churn-Analysis-main\utils
Executing data_loader.py
Executing handle_missing_and_encode.py


## Step 3: Load Configuration and Set Up Paths
In this step, we load configuration settings from `config.json` and convert relative paths to absolute paths to ensure that the correct data files are accessed during the analysis.

In [6]:
# Load configuration
config_path = os.path.join(os.path.dirname(os.path.abspath('')), '..', 'config.json')
print(f"Config path: {config_path}")
with open(config_path, 'r') as f:
    config = json.load(f)

# Convert relative paths to absolute paths
project_root = os.path.dirname(os.path.dirname(os.path.abspath('')))
raw_data_path = os.path.join(project_root, config['raw_data_path'])
interim_cleaned_data_path = os.path.join(project_root, config['interim_cleaned_data_path'])
preprocessed_data_path = os.path.join(project_root, config['preprocessed_data_path'])
standard_scaled_data_path = os.path.join(project_root, 'data_preparation/scaling_techniques/standard_scaled_dataset.csv')
min_max_scaled_data_path = os.path.join(project_root, 'data_preparation/scaling_techniques/min_max_scaled_dataset.csv')

print(f"Raw data path (absolute): {raw_data_path}")
print(f"Interim cleaned data path (absolute): {interim_cleaned_data_path}")
print(f"Preprocessed data path (absolute): {preprocessed_data_path}")
print(f"Standard scaled data path (absolute): {standard_scaled_data_path}")
print(f"Min-Max scaled data path (absolute): {min_max_scaled_data_path}")


Config path: C:\Users\kusha\OneDrive\Documents\Customer-Churn-Analysis-main\notebooks\..\config.json
Raw data path (absolute): C:\Users\kusha\OneDrive\Documents\Customer-Churn-Analysis-main\data/raw/Dataset (ATS)-1.csv
Interim cleaned data path (absolute): C:\Users\kusha\OneDrive\Documents\Customer-Churn-Analysis-main\data/interim/cleaned_dataset.csv
Preprocessed data path (absolute): C:\Users\kusha\OneDrive\Documents\Customer-Churn-Analysis-main\Data_Preparation/preprocessed_dataset/cleaned_dataset.csv
Standard scaled data path (absolute): C:\Users\kusha\OneDrive\Documents\Customer-Churn-Analysis-main\data_preparation/scaling_techniques/standard_scaled_dataset.csv
Min-Max scaled data path (absolute): C:\Users\kusha\OneDrive\Documents\Customer-Churn-Analysis-main\data_preparation/scaling_techniques/min_max_scaled_dataset.csv
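
The path handling above can also be sketched with `pathlib`. This is a minimal illustration, not the project's actual code: the helper name `resolve_config_paths` is hypothetical, and the example config values are inferred from the paths printed above rather than read from the real `config.json`.

```python
import json
from pathlib import Path


def resolve_config_paths(config, project_root):
    """Map every key ending in '_path' to an absolute Path under project_root."""
    root = Path(project_root).resolve()
    return {
        key: root / value
        for key, value in config.items()
        if key.endswith("_path")
    }


# Hypothetical config contents; the relative values mirror the printed output above.
example_config = json.loads("""
{
  "raw_data_path": "data/raw/Dataset (ATS)-1.csv",
  "interim_cleaned_data_path": "data/interim/cleaned_dataset.csv",
  "preprocessed_data_path": "Data_Preparation/preprocessed_dataset/cleaned_dataset.csv"
}
""")

paths = resolve_config_paths(example_config, ".")
for key, path in paths.items():
    # Reporting existence early makes a wrong project root easy to spot.
    print(f"{key}: {path} (exists: {path.exists()})")
```

Checking `path.exists()` at this point surfaces a wrong working directory before `pd.read_csv` fails later on.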


## Step 4: Load the Preprocessed Data
Load the preprocessed datasets (both Min-Max scaled and Standard scaled) to prepare them for clustering.

In [7]:
# Load the Min-Max scaled and Standard scaled datasets
df_min_max_scaled = pd.read_csv(min_max_scaled_data_path)
print(f"Min-Max scaled data loaded successfully from {min_max_scaled_data_path}")
df_standard_scaled = pd.read_csv(standard_scaled_data_path)
print(f"Standard scaled data loaded successfully from {standard_scaled_data_path}")

# Example of printing the first few rows to verify
print("Standard Scaled Data:")
print(df_standard_scaled.head())

print("Min-Max Scaled Data:")
print(df_min_max_scaled.head())


Min-Max scaled data loaded successfully from C:\Users\kusha\OneDrive\Documents\Customer-Churn-Analysis-main\data_preparation/scaling_techniques/min_max_scaled_dataset.csv
Standard scaled data loaded successfully from C:\Users\kusha\OneDrive\Documents\Customer-Churn-Analysis-main\data_preparation/scaling_techniques/standard_scaled_dataset.csv
Standard Scaled Data:
   SeniorCitizen    tenure  MonthlyCharges  gender_Female  gender_Male  \
0      -0.439916 -1.277445       -1.160323       1.009559    -1.009559   
1      -0.439916  0.066327       -0.259629      -0.990532     0.990532   
2      -0.439916 -1.236724       -0.362660      -0.990532     0.990532   
3      -0.439916  0.514251       -0.746535      -0.990532     0.990532   
4      -0.439916 -1.236724        0.197365       1.009559    -1.009559   

   Dependents_No  Dependents_Yes  PhoneService_No  PhoneService_Yes  \
0       0.654012       -0.654012         3.054010         -3.054010   
1       0.654012       -0.654012        -0.3274
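
Beyond printing the first rows, a quick numeric check can confirm that each file is scaled as expected: standardized columns should have mean close to 0 and standard deviation close to 1, and Min-Max scaled columns should span 0 to 1. The sketch below uses synthetic data and a hypothetical helper, `summarize_scaling`; in the notebook the same helper could be applied to `df_standard_scaled` and `df_min_max_scaled`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler


def summarize_scaling(df, columns):
    """Return mean, min, max, and population std for the given columns."""
    stats = df[columns].agg(["mean", "min", "max"]).T
    # StandardScaler divides by the population std, so use ddof=0 here.
    stats["std"] = df[columns].std(ddof=0)
    return stats


# Synthetic stand-in for the raw data (values are made up for illustration).
rng = np.random.default_rng(0)
cols = ["tenure", "MonthlyCharges"]
raw = pd.DataFrame({
    "tenure": rng.integers(0, 73, size=200),
    "MonthlyCharges": rng.uniform(18, 120, size=200),
})

standard = pd.DataFrame(StandardScaler().fit_transform(raw), columns=cols)
min_max = pd.DataFrame(MinMaxScaler().fit_transform(raw), columns=cols)

standard_stats = summarize_scaling(standard, cols)
min_max_stats = summarize_scaling(min_max, cols)
print(standard_stats)
print(min_max_stats)
```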

## Step 5: Apply K-means Clustering and Visualize Clusters
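Define a function that applies K-means clustering to a dataset and visualizes the clusters. The function clusters on the `tenure` and `MonthlyCharges` columns, adds the resulting labels to the DataFrame as a `Cluster` column, and saves a scatter plot to `Clustering_Analysis/visualizations` under the project root.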

In [8]:
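def apply_kmeans_and_visualize(df, scaling_label, n_clusters):
    """Cluster df on tenure and MonthlyCharges, then save a scatter plot.

    Adds a 'Cluster' column to df in place.
    """
    # Define path for saving visualizations inside the function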
    visualizations_path = os.path.join(project_root, 'Clustering_Analysis', 'visualizations')
    os.makedirs(visualizations_path, exist_ok=True)
    
    # Use only the 'tenure' and 'MonthlyCharges' columns for clustering
    features = df[['tenure', 'MonthlyCharges']]
    
    # Apply K-means clustering (n_init is set explicitly because its default
    # differs across scikit-learn versions)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    kmeans.fit(features)
    
    # Add cluster labels to the DataFrame
    df['Cluster'] = kmeans.labels_
    
    # Visualize the clusters
    plt.figure(figsize=(10, 6))
    sns.scatterplot(data=df, x='tenure', y='MonthlyCharges', hue='Cluster', palette='viridis')
    plt.title(f'Customer Segments based on Tenure and Monthly Charges ({scaling_label} - Assumed {n_clusters} Clusters)')
    plt.xlabel('Tenure')
    plt.ylabel('Monthly Charges')
    plt.legend(title='Cluster')
    
    # Save the visualization (the filename reflects the number of clusters actually used)
    visualization_filename = f'{scaling_label.lower().replace(" ", "_")}_{n_clusters}_clusters_assumed.png'
    visualization_filepath = os.path.join(visualizations_path, visualization_filename)
    plt.savefig(visualization_filepath)
    plt.close()
    print(f'Saved cluster visualization: {visualization_filepath}')
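
The clustering step inside this function can also be separated from the plotting, which makes it easier to inspect the fitted model. The following is a minimal sketch on synthetic data; `cluster_features` is a hypothetical helper, not part of the project, and unlike the function above it returns a labeled copy instead of modifying the input DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def cluster_features(df, feature_columns, n_clusters, random_state=42):
    """Fit K-means on the selected columns; return (labeled copy, fitted model)."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = model.fit_predict(df[feature_columns])
    labeled = df.copy()  # leave the caller's DataFrame untouched
    labeled["Cluster"] = labels
    return labeled, model


# Synthetic stand-in for a scaled dataset (values are made up for illustration).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "tenure": rng.uniform(0, 1, size=150),
    "MonthlyCharges": rng.uniform(0, 1, size=150),
})

labeled, model = cluster_features(demo, ["tenure", "MonthlyCharges"], n_clusters=3)
print(labeled["Cluster"].value_counts().sort_index())  # cluster sizes
print(model.cluster_centers_)                          # one row per cluster centre
```

Returning the fitted model gives access to `cluster_centers_` and `inertia_`, which the plotting function discards.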


## Step 6: Running K-means Clustering
Apply K-means clustering with an assumed number of clusters (3 here) to both the Min-Max scaled and Standard scaled datasets. Each call saves one scatter plot of the resulting clusters.

In [9]:
n_clusters = 3
apply_kmeans_and_visualize(df_min_max_scaled, 'Min-Max Scaled', n_clusters)
apply_kmeans_and_visualize(df_standard_scaled, 'Standard Scaled', n_clusters)


Saved cluster visualization: C:\Users\kusha\OneDrive\Documents\Customer-Churn-Analysis-main\Clustering_Analysis\visualizations\min-max_scaled_3_clusters_assumed.png
Saved cluster visualization: C:\Users\kusha\OneDrive\Documents\Customer-Churn-Analysis-main\Clustering_Analysis\visualizations\standard_scaled_3_clusters_assumed.png
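
Since the same customers are clustered under two scalings, it can be useful to quantify how much the two labelings agree. One option is the adjusted Rand index from scikit-learn, which is 1.0 for identical partitions and close to 0 for unrelated ones. The sketch below uses synthetic data; in the notebook, the `Cluster` columns of `df_min_max_scaled` and `df_standard_scaled` could be compared instead, assuming both files list customers in the same row order.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for raw tenure and monthly charges (made-up values).
rng = np.random.default_rng(42)
raw = np.column_stack([
    rng.integers(0, 73, size=300),
    rng.uniform(18, 120, size=300),
]).astype(float)


def kmeans_labels(X, n_clusters=3):
    """Return K-means cluster labels for X."""
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit_predict(X)


labels_min_max = kmeans_labels(MinMaxScaler().fit_transform(raw))
labels_standard = kmeans_labels(StandardScaler().fit_transform(raw))

# The index ignores how clusters are numbered, so 0/1/2 need not match up.
ari = adjusted_rand_score(labels_min_max, labels_standard)
print(f"Adjusted Rand index between the two labelings: {ari:.3f}")
```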


## Conclusion
This K-means clustering analysis segments customers based on their tenure and monthly charges. The notebook imports the required libraries, loads the configuration and the pre-scaled datasets, clusters each dataset into three assumed segments, and saves a scatter plot of the clusters for each scaling method.
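
One simple way to read the segments is to summarize each cluster by its size and average feature values. The sketch below uses a tiny hand-made DataFrame and a hypothetical helper, `profile_clusters`; applied to the scaled datasets in this notebook, the averages would be in scaled units rather than months or currency.

```python
import pandas as pd


def profile_clusters(df, feature_columns, cluster_column="Cluster"):
    """Return per-cluster feature means plus the number of rows in each cluster."""
    profile = df.groupby(cluster_column)[feature_columns].mean()
    profile["size"] = df.groupby(cluster_column).size()
    return profile


# Hand-made example data (not taken from the customer dataset).
demo = pd.DataFrame({
    "tenure": [1, 2, 3, 60, 65, 70],
    "MonthlyCharges": [20.0, 25.0, 30.0, 90.0, 100.0, 110.0],
    "Cluster": [0, 0, 0, 1, 1, 1],
})

profile = profile_clusters(demo, ["tenure", "MonthlyCharges"])
print(profile)
```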

## Next Steps
1. Optimal Clustering: Consider using methods like the Elbow Method or Silhouette Score to determine the optimal number of clusters.
2. Interpretation: Examine the characteristics of each cluster for business insights.
3. Advanced Analysis: Explore additional features and clustering techniques to refine the segmentation.
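
As a starting point for the first item, the sketch below computes the two quantities those methods rely on: inertia (within-cluster sum of squares, used by the Elbow Method) and the silhouette score, for a range of candidate cluster counts. It runs on synthetic blobs from `make_blobs`; `evaluate_k` is a hypothetical helper, and the range of `k` values is an arbitrary choice. In the notebook, `X` would be the `tenure` and `MonthlyCharges` columns of one of the scaled datasets.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic two-feature data standing in for the scaled customer features.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)


def evaluate_k(X, k_values):
    """Fit K-means for each k and record inertia and silhouette score."""
    results = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
        results.append({
            "k": k,
            "inertia": model.inertia_,                        # elbow criterion
            "silhouette": silhouette_score(X, model.labels_),  # higher is better
        })
    return results


results = evaluate_k(X, range(2, 7))
for row in results:
    print(f"k={row['k']}: inertia={row['inertia']:.1f}, silhouette={row['silhouette']:.3f}")

# Pick the k with the highest silhouette score; the elbow is read off the inertia values.
best_k = max(results, key=lambda row: row["silhouette"])["k"]
print(f"k with the highest silhouette score: {best_k}")
```

Inertia generally keeps falling as `k` grows, so it is read by looking for the point where the decrease levels off, whereas the silhouette score can be compared directly across values of `k`.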