# Project Details

**Submitted by:** Shalini Chauhan-045051  
**Submitted to:** Prof. Amarnath Mitra  
**Project Name:** Unsupervised Learning  
**Dataset:** Telecom Churn (99999 rows × 226 columns)

## About Dataset

The dataset is a churn dataset of a telecommunication company with various features including:

- `mou`: Minutes of Usage
- `arpu`: Average Revenue Per User
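- `t2t`: Same operator to same operator (on-network mobile-to-mobile calls)
- `t2m`: Same operator to another operator's mobile
- `t2f`: Same operator to its own fixed line
- `t2c`: Same operator to its own customer care (call center)
- `ic`: Incoming Calls
- `og`: Outgoing Calls
- `std`: STD (Subscriber Trunk Dialing) calls, i.e. outside the calling circle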
- `isd`: International Subscriber Dialing
- `spl`: Special
- `roam`: Roaming
- `loc`: Local
- `max`: Maximum
- `av`: Average
- `vol`: Volume
- `fb`: Facebook
- `aon`: Age on Network
- `vbc`: Volume Based Charging


# Objective:
The objective of this analysis is to gain insights into customer churn behavior in a telecom company dataset using clustering techniques. Specifically, we aim to identify distinct churn behavior patterns, understand their revenue impact, explore service utilization trends within different customer segments, and compare two clustering methods (K-Means and hierarchical agglomerative clustering).

## Missing Data Report

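- Total percentage of missing data: 15.91%
- Among columns with missing data, the share of missing values ranges from 0.60% to 74.85%
- Missing values were imputed with the most frequent value of each column, because the missing data showed no discernible pattern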
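
The percentages above can be reproduced along the following lines. This is a minimal sketch on a tiny made-up array (the name `toy` and its values are illustrative, not taken from the dataset), using NumPy's `isnan` in place of the notebook's pandas calls.

```python
import numpy as np

# Toy stand-in for the real data: 4 rows x 3 columns, NaN marks a missing cell.
toy = np.array([
    [1.0, np.nan, 3.0],
    [4.0, 5.0, np.nan],
    [7.0, np.nan, 9.0],
    [1.0, 2.0, 3.0],
])

missing_mask = np.isnan(toy)

# Share of all cells that are missing (the "total percentage" figure).
total_pct_missing = 100 * missing_mask.mean()

# Share of missing cells per column; the reported range covers only
# columns that have at least one missing value.
col_pct_missing = 100 * missing_mask.mean(axis=0)
affected = col_pct_missing[col_pct_missing > 0]

print(f"Total missing: {total_pct_missing:.2f}%")
print(f"Per-column range: {affected.min():.2f}% to {affected.max():.2f}%")
```

The same two quantities correspond to `percent_missing` and `percent_nan` in the code cells of this notebook.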

## K-Means Clustering Report

- Optimal number of clusters for the entire dataset: 7, based on the sum of squared distances (SSD) curve
- Optimal number of clusters for a subset of 25000 rows: 5 (Silhouette Score: 0.4475)

## Hierarchical Clustering Report

- Silhouette score highest for model with 3 clusters (0.2883)
- Silhouette score for 4 clusters (0.2801) and 5 clusters (0.2778)
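
For reference, the silhouette score quoted in these reports is the mean of per-sample silhouette coefficients. The standard definition is given below; the symbols $a(i)$, $b(i)$, $s(i)$, and $S$ are notation introduced here, not taken from the notebook.

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad S = \frac{1}{n} \sum_{i=1}^{n} s(i)
$$

Here $a(i)$ is the mean distance from sample $i$ to the other members of its own cluster, and $b(i)$ is the smallest mean distance from sample $i$ to the members of any other cluster. $S$ lies between $-1$ and $1$; values near $1$ indicate compact, well-separated clusters, and values near $0$ indicate overlapping clusters.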

## Comparison of Memory and Time

**K-Means:**
- Memory Usage: Peak memory of 2882.79 MiB
- Time Taken: CPU time of 24.3 seconds, Wall time of 11.5 seconds

**Hierarchical Agglomerative:**
- Memory Usage: Peak memory of 5350.38 MiB
- Time Taken: CPU time of 1 minute 33 seconds, Wall time of 1 minute 37 seconds
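
CPU time and wall time measure different things: CPU time sums processor time over all threads of the process, so it can exceed wall time when work runs in parallel, as in the K-Means figures above. A rough way to collect comparable numbers outside a notebook is sketched below with the standard library only. The helper name `profile_call` and the toy workload are hypothetical, and `tracemalloc` traces Python-level allocations only, so its peak is not directly comparable to the `%memit` figures reported here.

```python
import time
import tracemalloc


def profile_call(func, *args, **kwargs):
    """Run func once and return (result, wall_seconds, cpu_seconds, peak_bytes)."""
    tracemalloc.start()
    wall_start = time.perf_counter()   # elapsed real time
    cpu_start = time.process_time()    # CPU time of this process (all threads)
    try:
        result = func(*args, **kwargs)
        wall = time.perf_counter() - wall_start
        cpu = time.process_time() - cpu_start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, wall, cpu, peak


def toy_workload(n):
    """Stand-in for a model fit: builds a list and sums its squares."""
    values = list(range(n))
    return sum(v * v for v in values)


result, wall, cpu, peak = profile_call(toy_workload, 100_000)
print(f"wall={wall:.4f}s cpu={cpu:.4f}s peak={peak / 2**20:.2f} MiB")
```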

## Conclusion

- K-Means clustering is more memory- and time-efficient than hierarchical agglomerative clustering
- Hierarchical agglomerative clustering had scalability issues and could not be applied to the entire dataset, so it was run on the 25000-row subset only
- K-Means clustering is preferred for this dataset and subset size
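
The scalability limit can be illustrated with a back-of-the-envelope calculation. Assuming the agglomerative implementation stores a condensed pairwise-distance matrix in 64-bit floats (an assumption about the implementation, not something the notebook measures), memory grows with the square of the row count. The function name `condensed_distance_bytes` is hypothetical.

```python
def condensed_distance_bytes(n_rows, bytes_per_value=8):
    """Bytes needed for the n*(n-1)/2 pairwise distances between n_rows points."""
    n_pairs = n_rows * (n_rows - 1) // 2
    return n_pairs * bytes_per_value


for n_rows in (25_000, 99_999):
    gib = condensed_distance_bytes(n_rows) / 2**30
    print(f"{n_rows:>6} rows -> {gib:.1f} GiB for pairwise distances")
```

Under that assumption, the 25000-row subset needs about 2.3 GiB for distances alone, while the full 99999 rows would need about 37 GiB, which is consistent with hierarchical clustering being feasible only on the subset.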



## Analysis:

### K-Means Clustering:
- **Optimal Number of Clusters:**
  - K-means clustering identified 7 distinct clusters for the entire customer churn dataset.
  - A subset of 25000 rows revealed 5 optimal clusters based on Silhouette Score.
- **Cluster Characteristics:**
  - **Churn Behavior Patterns:**
    - Clusters likely represent different churn behavior patterns among telecom customers.
    - Understanding these patterns can aid in predicting and mitigating churn, thereby improving customer retention rates.
  - **Revenue Impact:**
    - Clusters may exhibit varying revenue contributions, with some segments generating higher revenue despite churn propensity.
    - Analyzing revenue patterns within clusters can inform targeted retention strategies and pricing adjustments.
  - **Service Utilization Insights:**
    - Different clusters may demonstrate preferences for specific services or usage patterns that correlate with churn likelihood.
    - Identifying service utilization trends within clusters can guide service enhancements and customer engagement efforts.
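
One way to make these cluster characteristics concrete is to profile each cluster on a revenue feature. The sketch below uses made-up labels and ARPU values (the arrays `labels` and `arpu` are illustrative, not outputs of the notebook) and reports the size and mean revenue of each cluster.

```python
import numpy as np

# Hypothetical cluster labels and average revenue per user for eight customers.
labels = np.array([0, 0, 1, 1, 1, 2, 2, 0])
arpu = np.array([120.0, 80.0, 300.0, 340.0, 320.0, 40.0, 60.0, 100.0])

profile = {}
for cluster in np.unique(labels):
    in_cluster = labels == cluster  # boolean mask selecting this cluster's customers
    profile[int(cluster)] = {
        "size": int(in_cluster.sum()),
        "mean_arpu": float(arpu[in_cluster].mean()),
    }

for cluster, stats in profile.items():
    print(f"cluster {cluster}: size={stats['size']}, mean ARPU={stats['mean_arpu']:.1f}")
```

With the notebook's data, the same idea would apply to `cluster_labels` and a revenue column such as `arpu_6`.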

### Hierarchical Clustering:
- **Silhouette Scores:**
  - The silhouette score was highest for the model with 3 clusters (0.2883). A score of this size indicates moderate rather than strong separation between churn behavior segments.
- **Interpretation:**
  - **Distinct Churn Profiles:**
    - Hierarchical clustering yields candidate churn behavior profiles that can support targeted interventions and personalized customer interactions, although the modest silhouette scores suggest that the profiles overlap.
  - **Optimal Cluster Number:**
    - While the silhouette score peaks at 3 clusters, additional clusters (4 and 5) offer nuanced insights into churn dynamics and customer segmentation.

# Findings:

## Strategic Implications:
- The identified clusters provide actionable insights into customer churn behavior and revenue implications.
- Telecom companies can leverage these insights to develop targeted retention strategies, optimize service offerings, and enhance overall customer satisfaction.

## Algorithm Selection:
- K-means clustering emerges as the preferred algorithm for customer churn analysis due to its efficiency and scalability.
- Its ability to handle large datasets efficiently makes it a valuable tool for churn prediction and customer segmentation tasks.

This analysis underscores the importance of cluster recognition in understanding customer churn dynamics and guiding strategic decision-making in the telecom industry. By leveraging clustering techniques, telecom companies can gain valuable insights into churn behavior patterns, revenue drivers, and service preferences, ultimately driving customer retention and business growth.

# Implications:

- **Tailored Retention Strategies:** Understanding distinct churn behavior patterns allows telecom companies to tailor retention strategies based on the needs and preferences of different customer segments. By addressing the specific reasons for churn within each cluster, companies can implement targeted interventions to improve customer retention rates.

- **Revenue Optimization:** Analyzing revenue patterns within clusters enables companies to identify high-value customer segments that contribute significantly to revenue despite churn propensity. By focusing on retaining these valuable customers through personalized offers and incentives, companies can optimize revenue generation and profitability.

- **Service Enhancement Opportunities:** Service utilization insights derived from cluster analysis highlight opportunities for enhancing service offerings and customer experience. By identifying service preferences and usage patterns within each cluster, companies can prioritize investments in areas that align with customer needs.

## References

- Prof. Amarnath Mitra

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder # For Encoding Categorical Data [Nominal | Ordinal]
from sklearn.preprocessing import OneHotEncoder # For Creating Dummy Variables of Categorical Data [Nominal]
from sklearn.impute import SimpleImputer, KNNImputer # For Imputation of Missing Data
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler # For Rescaling Data
from sklearn.model_selection import train_test_split # For Splitting Data into Training & Testing Sets 


In [2]:
import requests
import zipfile
import pandas as pd
from io import BytesIO

# URL of the ZIP file on GitHub
zip_url = 'https://github.com/045051Shalini/MLM1_PROJECT/raw/main/telecom_churn_data.zip'

# Download the ZIP file; raise on an HTTP error instead of silently leaving df undefined
response = requests.get(zip_url)
response.raise_for_status()

# Extract the ZIP file to a temporary folder
with zipfile.ZipFile(BytesIO(response.content), 'r') as zip_ref:
    # Assuming there's only one CSV file in the ZIP archive
    csv_file_name = zip_ref.namelist()[0]
    zip_ref.extractall('temp_folder')

# Load the CSV file using pandas
csv_file_path = f'temp_folder/{csv_file_name}'
df = pd.read_csv(csv_file_path)



In [3]:
df.shape

(99999, 226)

In [None]:
df.info()

In [None]:
df.columns

In [None]:
missing_data = df.isnull().sum()
total_cells = df.shape[0] * df.shape[1]  
total_missing = missing_data.sum() 
percent_missing = (total_missing / total_cells) * 100
percent_missing

In [None]:
df.iloc[:, -40:].head()

In [None]:
plt.figure(figsize=(50,10),dpi=80)
null_values = df.isnull().sum()
null_values = null_values.reset_index()  # Resetting index to make column names accessible
null_values.columns = ['Column', 'Missing_Count']  # Renaming columns for clarity

 # Adjust figure size if needed
sns.barplot(x='Column', y='Missing_Count', data=null_values, color='skyblue')
plt.xticks(rotation=90)  # Rotate x-axis labels for better readability
plt.xlabel('Columns')
plt.ylabel('Missing Values Count')
plt.title('Missing Values Count in Each Column')
plt.tight_layout()  # Adjust layout to prevent label cutoff
plt.show()

In [None]:
# Assuming null_values DataFrame contains the column names and their corresponding missing values counts
filtered_null_values = null_values[null_values['Missing_Count'] < 2000]
filtered_null_values
max_missing_count = filtered_null_values['Missing_Count'].max()
max_missing_count

In [None]:
filtered_null_values = null_values[(null_values['Missing_Count'] > 2000) & (null_values['Missing_Count'] < 10000)]
filtered_null_values
max_missing_count = filtered_null_values['Missing_Count'].max()
max_missing_count

In [None]:
plt.figure(figsize=(50,10),dpi=80)

def percent_missing_by_column(df):
    """Return the percentage of missing values per column, for columns with any missing data."""
    percent_nan = 100 * df.isnull().sum() / len(df)
    return percent_nan[percent_nan > 0].sort_values()

percent_nan = percent_missing_by_column(df)
sns.barplot(x=percent_nan.index, y=percent_nan)
plt.xticks(rotation=90);


In [None]:
percent_nan

In [None]:
# Convert percent_nan into a DataFrame
percent_nan_df = pd.DataFrame(percent_nan.items(), columns=['Column Name', 'Percentage Missing'])

# Sort DataFrame by percentage of missing values
percent_nan_df.sort_values(by='Percentage Missing', inplace=True)

# Group column names by their exact percentage of missing values
# (this also covers the columns with the lowest percentage)
column_names_by_percentage = (
    percent_nan_df.groupby('Percentage Missing')['Column Name'].apply(list)
)

# Print the names of columns in each group
for percentage, column_names in column_names_by_percentage.items():
    print("Percentage missing:", percentage)
    print("Number of columns:", len(column_names))
    print("Column names:", column_names)
    print()

In [None]:
percent_nan_df = pd.DataFrame(percent_nan) 
percent_nan_df.reset_index(inplace=True)
percent_nan_df.columns = ['Column Name', 'Percentage Missing']
percent_nan_df['Percentage Missing'].unique()
#unique_values_column_0 = percent_nan[0].unique()
#print(unique_values_column_0)

In [None]:
percent_nan[(percent_nan >10) &(percent_nan < 80)].count()

In [None]:
# Share of the dataset represented by a single row
print(f"1 observation is {100 / len(df):.3f}% of the rows")

In [None]:
plt.figure(figsize=(6,5))
sns.histplot(data=percent_nan_df,x='Percentage Missing')
plt.ylim(0,60)
plt.xlim(0,10)

In [None]:

# Count the number of missing values in each row
missing_values_counts = df.isnull().sum(axis=1)

# Get unique missing values counts
unique_counts = missing_values_counts.unique()

# Initialize variables to store the total number of rows and the dictionary to store column counts
total_rows = len(df)
total_matched_rows = 0
column_counts = {}

# Iterate over each unique missing values count
for count in unique_counts:
    # Skip rows with no missing values
    if count == 0:
        continue
    
    # Find rows with the same missing values count
    matching_rows = df[missing_values_counts == count]
    
    # Count the number of matching rows
    matching_rows_count = len(matching_rows)
    
    # If there are matching rows, increment the total_matched_rows count and update the column_counts dictionary
    if matching_rows_count > 1:
        total_matched_rows += matching_rows_count
        if count in column_counts:
            column_counts[count] += matching_rows_count
        else:
            column_counts[count] = matching_rows_count

print("Total number of rows:", total_rows)
print("Total number of rows with matching missing columns in other rows:", total_matched_rows)
print("Column counts for matched rows:")
for num_columns, count in column_counts.items():
    print(f"Number of columns matched: {num_columns}, Count: {count}")


In [None]:
df.head()

In [None]:
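# Drop identifier columns in place (assigning the result to df.copy would overwrite the DataFrame.copy method with None)
df.drop(columns=['mobile_number', 'circle_id'], inplace=True)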

In [None]:
print(df.columns.to_list())

In [None]:
df[(df['loc_ic_t2f_mou_9'].isnull()) & (df['loc_og_t2f_mou_9'].isnull())]

In [None]:
# Categorical variables
df_cat= df[[
    'last_date_of_month_6', 'last_date_of_month_7', 'last_date_of_month_8', 'last_date_of_month_9',
    'night_pck_user_6', 'night_pck_user_7', 'night_pck_user_8', 'night_pck_user_9',
    'fb_user_6', 'fb_user_7', 'fb_user_8', 'fb_user_9'
]]  


# Non-categorical variables
df_noncat = df[[
    'loc_og_t2o_mou', 'std_og_t2o_mou', 'loc_ic_t2o_mou', 'arpu_6', 'arpu_7', 'arpu_8', 'arpu_9',
    'onnet_mou_6', 'onnet_mou_7', 'onnet_mou_8', 'onnet_mou_9', 'offnet_mou_6', 'offnet_mou_7',
    'offnet_mou_8', 'offnet_mou_9', 'roam_ic_mou_6', 'roam_ic_mou_7', 'roam_ic_mou_8', 'roam_ic_mou_9',
    'roam_og_mou_6', 'roam_og_mou_7', 'roam_og_mou_8', 'roam_og_mou_9', 'loc_og_t2t_mou_6',
    'loc_og_t2t_mou_7', 'loc_og_t2t_mou_8', 'loc_og_t2t_mou_9', 'loc_og_t2m_mou_6', 'loc_og_t2m_mou_7',
    'loc_og_t2m_mou_8', 'loc_og_t2m_mou_9', 'loc_og_t2f_mou_6', 'loc_og_t2f_mou_7', 'loc_og_t2f_mou_8',
    'loc_og_t2f_mou_9', 'loc_og_t2c_mou_6', 'loc_og_t2c_mou_7', 'loc_og_t2c_mou_8', 'loc_og_t2c_mou_9',
    'loc_og_mou_6', 'loc_og_mou_7', 'loc_og_mou_8', 'loc_og_mou_9', 'std_og_t2t_mou_6', 'std_og_t2t_mou_7',
    'std_og_t2t_mou_8', 'std_og_t2t_mou_9', 'std_og_t2m_mou_6', 'std_og_t2m_mou_7', 'std_og_t2m_mou_8',
    'std_og_t2m_mou_9', 'std_og_t2f_mou_6', 'std_og_t2f_mou_7', 'std_og_t2f_mou_8', 'std_og_t2f_mou_9',
    'std_og_t2c_mou_6', 'std_og_t2c_mou_7', 'std_og_t2c_mou_8', 'std_og_t2c_mou_9', 'std_og_mou_6',
    'std_og_mou_7', 'std_og_mou_8', 'std_og_mou_9', 'isd_og_mou_6', 'isd_og_mou_7', 'isd_og_mou_8',
    'isd_og_mou_9', 'spl_og_mou_6', 'spl_og_mou_7', 'spl_og_mou_8', 'spl_og_mou_9', 'og_others_6',
    'og_others_7', 'og_others_8', 'og_others_9', 'total_og_mou_6', 'total_og_mou_7', 'total_og_mou_8',
    'total_og_mou_9', 'loc_ic_t2t_mou_6', 'loc_ic_t2t_mou_7', 'loc_ic_t2t_mou_8', 'loc_ic_t2t_mou_9',
    'loc_ic_t2m_mou_6', 'loc_ic_t2m_mou_7', 'loc_ic_t2m_mou_8', 'loc_ic_t2m_mou_9', 'loc_ic_t2f_mou_6',
    'loc_ic_t2f_mou_7', 'loc_ic_t2f_mou_8', 'loc_ic_t2f_mou_9', 'loc_ic_mou_6', 'loc_ic_mou_7', 'loc_ic_mou_8',
    'loc_ic_mou_9', 'std_ic_t2t_mou_6', 'std_ic_t2t_mou_7', 'std_ic_t2t_mou_8', 'std_ic_t2t_mou_9',
    'std_ic_t2m_mou_6', 'std_ic_t2m_mou_7', 'std_ic_t2m_mou_8', 'std_ic_t2m_mou_9', 'std_ic_t2f_mou_6',
    'std_ic_t2f_mou_7', 'std_ic_t2f_mou_8', 'std_ic_t2f_mou_9', 'std_ic_t2o_mou_6', 'std_ic_t2o_mou_7',
    'std_ic_t2o_mou_8', 'std_ic_t2o_mou_9', 'std_ic_mou_6', 'std_ic_mou_7', 'std_ic_mou_8', 'std_ic_mou_9',
    'total_ic_mou_6', 'total_ic_mou_7', 'total_ic_mou_8', 'total_ic_mou_9', 'spl_ic_mou_6', 'spl_ic_mou_7',
    'spl_ic_mou_8', 'spl_ic_mou_9', 'isd_ic_mou_6', 'isd_ic_mou_7'
]]


In [None]:
si_cat = SimpleImputer(missing_values=np.nan, strategy='most_frequent') # Impute categorical columns with the mode (most frequent value)

# Now you can proceed with the imputation
si_cat_fit = si_cat.fit_transform(df_cat.values)
df_cat_si = pd.DataFrame(si_cat_fit, columns=df_cat.columns)
df_cat_si.info()

In [None]:

imputer = SimpleImputer(strategy='most_frequent')

# Fit and transform the DataFrame
df_noncat_imputed = pd.DataFrame(imputer.fit_transform(df_noncat), columns=df_noncat.columns)
df_noncat_imputed.info()

In [None]:
# Convert the categorical columns to the pandas 'category' dtype
cat_cols = df_cat.columns.tolist()
df_cat_si[cat_cols] = df_cat_si[cat_cols].astype('category')

# Encode each categorical column as integer codes in a new '<name>_code' column
for col in cat_cols:
    df_cat_si[f'{col}_code'] = df_cat_si[col].cat.codes

# Display the resulting DataFrame
df_cat_si

In [None]:
df_encoded_columns= df_cat_si.filter(regex='_code$')

# Display the resulting DataFrame
df_encoded_columns.head()

In [None]:

df_dummy_cat = pd.get_dummies(df_encoded_columns,
                               columns=['last_date_of_month_6_code', 'last_date_of_month_7_code', 'last_date_of_month_8_code', 'last_date_of_month_9_code',
                                        'night_pck_user_6_code', 'night_pck_user_7_code', 'night_pck_user_8_code', 'night_pck_user_9_code',
                                        'fb_user_6_code', 'fb_user_7_code', 'fb_user_8_code', 'fb_user_9_code'])

df_dummy_cat

In [None]:
# 3.1. Standardization
ss = StandardScaler()
# df_noncat_imputed holds exactly the non-categorical columns selected earlier, so scale it as a whole
ss_fit = ss.fit_transform(df_noncat_imputed)
df_noncat_std = pd.DataFrame(ss_fit, columns=df_noncat_imputed.columns + '_std')
df_noncat_std



In [None]:
df_noncat_std=df_noncat_std.copy()
df_dummy_cat=df_dummy_cat.copy()

In [None]:
df_ppd = pd.concat([df_noncat_std, df_dummy_cat], axis=1)
df_ppd

In [None]:
df_ppd.isnull().sum()

In [None]:
df_kmeans=df_ppd.copy()

In [None]:
from sklearn.cluster import KMeans

In [None]:
model = KMeans(n_clusters=2)


In [None]:
cluster_labels=model.fit_predict(df_kmeans)

In [None]:
cluster_labels

In [None]:
df_kmeans['cluster']=cluster_labels

In [None]:
sns.heatmap(df_kmeans.corr())

In [None]:
df_kmeans.corr()['cluster'].iloc[:-1].sort_values()

In [None]:

plt.figure(figsize=(12, 6), dpi=100)
df_kmeans.corr()['cluster'].iloc[:-1].sort_values().plot(kind='bar')
plt.xticks(rotation=90, ha='right', fontsize=5)  # Use a small font so the many x-axis labels stay legible
plt.tight_layout()  # Adjust layout to prevent overlap
plt.show()


In [None]:
from sklearn.cluster import KMeans

ssd = []

# Exclude the 'cluster' label column added earlier so it is not used as a clustering feature
X_full = df_kmeans.drop(columns='cluster', errors='ignore')

for k in range(4, 10):
    model = KMeans(n_clusters=k)
    model.fit(X_full)

    # Sum of squared distances of samples to their closest cluster center
    ssd.append(model.inertia_)

In [None]:
plt.plot(range(4,10),ssd,'o--')
plt.xlabel("K Value")
plt.ylabel("Sum of Squared Distances")

In [None]:
ssd

In [None]:
sample_kmeans=df_ppd.sample(n=25000,random_state=822)


In [None]:
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

ssd_sample = []
silhouette_scores_sample = []

for k in range(4, 10):
    model_sample = KMeans(n_clusters=k, n_init=10)
    model_sample.fit(sample_kmeans)
    
    # Sum of squared distances of samples to their closest cluster center
    ssd_sample.append(model_sample.inertia_)
    
    # Silhouette score
    silhouette_avg = silhouette_score(sample_kmeans, model_sample.labels_)
    silhouette_scores_sample.append(silhouette_avg)

# Print SSD and silhouette scores for each cluster count
# (loop variable named sil_score so it does not shadow the imported silhouette_score function)
for k, ssd_score, sil_score in zip(range(4, 10), ssd_sample, silhouette_scores_sample):
    print(f"Number of clusters: {k}, SSD: {ssd_score}, Silhouette Score: {sil_score}")


In [None]:
# Hierarchical agglomerative clustering
import scipy.cluster.hierarchy as sch # For Hierarchical Clustering
from sklearn.cluster import AgglomerativeClustering as agclus, KMeans as kmclus # For Agglomerative & K-Means Clustering
from sklearn.metrics import silhouette_score as sscore, davies_bouldin_score as dbscore # For Clustering Model Evaluation


In [None]:
df_agg_cluster=df_ppd.copy()

In [None]:
sample_agg_cluster=df_agg_cluster.sample(n=25000,random_state=822)

In [None]:
# Ward linkage always uses Euclidean distance, so no distance argument is needed
# (the `affinity` parameter is no longer accepted by recent scikit-learn releases)
ah_5cluster = agclus(n_clusters=5, linkage='ward')
ah_5cluster_model = ah_5cluster.fit_predict(sample_agg_cluster); ah_5cluster_model

In [None]:
ah_6cluster = agclus(n_clusters=6, linkage='ward')
ah_6cluster_model = ah_6cluster.fit_predict(sample_agg_cluster); ah_6cluster_model

In [None]:
ah_4cluster = agclus(n_clusters=4, linkage='ward')
ah_4cluster_model = ah_4cluster.fit_predict(sample_agg_cluster); ah_4cluster_model

In [None]:
ah_3cluster = agclus(n_clusters=3, linkage='ward')
ah_3cluster_model = ah_3cluster.fit_predict(sample_agg_cluster); ah_3cluster_model

In [None]:
sscore_ah_3cluster = sscore(sample_agg_cluster, ah_3cluster_model)  # reported: 0.2883128555155335
sscore_ah_4cluster = sscore(sample_agg_cluster, ah_4cluster_model)  # reported: 0.28005227166850494
sscore_ah_5cluster = sscore(sample_agg_cluster, ah_5cluster_model)  # reported: 0.2777902860167923
sscore_ah_3cluster, sscore_ah_4cluster, sscore_ah_5cluster

In [None]:
dbscore_ah_3cluster = dbscore(sample_agg_cluster, ah_3cluster_model); dbscore_ah_3cluster

In [None]:
sscore_ah_4cluster = sscore(sample_agg_cluster, ah_4cluster_model); sscore_ah_4cluster

In [None]:
dbscore_ah_4cluster = dbscore(sample_agg_cluster, ah_4cluster_model); dbscore_ah_4cluster

In [None]:
sscore_ah_5cluster = sscore(sample_agg_cluster, ah_5cluster_model); sscore_ah_5cluster

In [None]:
dbscore_ah_5cluster = dbscore(sample_agg_cluster, ah_5cluster_model); dbscore_ah_5cluster


In [None]:
sscore_ah_6cluster = sscore(sample_agg_cluster, ah_6cluster_model); sscore_ah_6cluster


In [None]:
dbscore_ah_6cluster = dbscore(sample_agg_cluster, ah_6cluster_model); dbscore_ah_6cluster

In [None]:
%%time
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

ssd_sample = []
silhouette_scores_sample = []

for k in range(4, 10):
    model_sample = KMeans(n_clusters=k, n_init=10)
    model_sample.fit(sample_kmeans)
    
    # Sum of squared distances of samples to their closest cluster center
    ssd_sample.append(model_sample.inertia_)
    
    # Silhouette score
    silhouette_avg = silhouette_score(sample_kmeans, model_sample.labels_)
    silhouette_scores_sample.append(silhouette_avg)

# Print SSD and silhouette scores for each cluster count
# (loop variable named sil_score so it does not shadow the imported silhouette_score function)
for k, ssd_score, sil_score in zip(range(4, 10), ssd_sample, silhouette_scores_sample):
    print(f"Number of clusters: {k}, SSD: {ssd_score}, Silhouette Score: {sil_score}")


In [None]:
!pip install memory_profiler

In [None]:
%load_ext memory_profiler
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

def measure_memory_usage():


    ssd_sample = []
    silhouette_scores_sample = []

    for k in range(4, 10):
        model_sample = KMeans(n_clusters=k, n_init=10)
        model_sample.fit(sample_kmeans)
        
        # Sum of squared distances of samples to their closest cluster center
        ssd_sample.append(model_sample.inertia_)
        
        # Silhouette score
        silhouette_avg = silhouette_score(sample_kmeans, model_sample.labels_)
        silhouette_scores_sample.append(silhouette_avg)

    # Return only after every k in the loop has been fitted
    return
%memit measure_memory_usage()

In [None]:
%%time
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

def measure_memory_usage():


    ssd_sample = []
    silhouette_scores_sample = []

    for k in range(4, 10):
        model_sample = KMeans(n_clusters=k, n_init=10)
        model_sample.fit(sample_kmeans)
        
        # Sum of squared distances of samples to their closest cluster center
        ssd_sample.append(model_sample.inertia_)
        
        # Silhouette score
        silhouette_avg = silhouette_score(sample_kmeans, model_sample.labels_)
        silhouette_scores_sample.append(silhouette_avg)

    # Return only after every k in the loop has been fitted
    return
%memit measure_memory_usage()

In [None]:
%%time
ah_5cluster = agclus(n_clusters=5, linkage='ward')  # Ward linkage always uses Euclidean distance
ah_5cluster_model = ah_5cluster.fit_predict(sample_agg_cluster); ah_5cluster_model

In [None]:
%load_ext memory_profiler
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

def measure_memory_usage():

    ah_5cluster = agclus(n_clusters=5, linkage='ward')  # Ward linkage always uses Euclidean distance
    ah_5cluster_model = ah_5cluster.fit_predict(sample_agg_cluster); ah_5cluster_model
                        
    return
%memit measure_memory_usage()