

#### Step 1: Data Loading and Initial Exploration

**Data set**
- Load the Mall Customers dataset from the following URL: 'https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/Cust_Segmentation.csv'.



In [None]:


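import io

import pandas as pd
import requests

# Load the data
url = 'https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/Cust_Segmentation.csv'

response = requests.get(url)
# Stop early if the download did not succeed
response.raise_for_status()
# Storing data in mall_customers
mall_customers = pd.read_csv(io.StringIO(response.content.decode('utf-8')))

# Only the last expression of a cell is shown automatically, so the first table needs display()
display(mall_customers.head(5))

mall_customers.describe()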





#### Step 2: Data Preprocessing


In [None]:
from sklearn.preprocessing import StandardScaler

# Dropping 'Address' (categorical data), 'Customer Id' (an identifier), and 'Defaulted', as they are not suitable for applying K-means later.
# The clustering analysis proceeds with the remaining numeric features.

mall_customers_to_process = mall_customers.drop(['Address','Customer Id','Defaulted'], axis=1)


# Replacing missing (NaN) values in each column with that column's mean.
# The result is assigned back: fillna(..., inplace=True) on a column selection only changes a temporary copy.
mall_customers_to_process = mall_customers_to_process.fillna(mall_customers_to_process.mean())
# Standardizing the dataset with StandardScaler to ensure that all features are on the same scale.
scaler = StandardScaler()
mall_customers_normalized = pd.DataFrame(scaler.fit_transform(mall_customers_to_process), columns=mall_customers_to_process.columns)

mall_customers_normalized.head(5)
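
The two preprocessing steps in this cell, mean imputation followed by standardization, can be checked on a small frame. This is only a minimal sketch: the `toy` frame and its values are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({'Age': [20.0, 30.0, np.nan, 50.0],
                    'Income': [10.0, np.nan, 30.0, 80.0]})

# Assigning the result back is what makes the imputation stick
toy_filled = toy.fillna(toy.mean())

# StandardScaler centres each column and divides by its population standard deviation
toy_scaled = pd.DataFrame(StandardScaler().fit_transform(toy_filled),
                          columns=toy_filled.columns)

print(toy_filled)
print(toy_scaled.mean().round(6))
print(toy_scaled.std(ddof=0).round(6))
```

After scaling, every column should have a mean of about 0 and a population standard deviation of about 1, which is what puts `Age` and `Income` on a comparable footing for distance-based clustering.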


#### Step 3: Applying K-Means Clustering


In this part I apply K-Means clustering to the processed data. A suitable number of clusters for K-Means here is 3. To check this, you can uncomment the code at the bottom of the cell, which runs the elbow method to find the optimal number of clusters.
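
The elbow idea can also be read numerically rather than from a plot. The sketch below is an illustration under assumptions: it uses synthetic data from `make_blobs` with three well-separated groups (not the customer data), and names such as `X_demo` and `inertias` are made up for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (an assumption for illustration)
X_demo, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=1.0, random_state=42)

# Inertia (within-cluster sum of squared distances) for k = 1..6
inertias = []
for n in range(1, 7):
    model = KMeans(n_clusters=n, n_init=10, random_state=42).fit(X_demo)
    inertias.append(model.inertia_)

# Relative drop in inertia when going from k-1 to k clusters
for n in range(2, 7):
    drop = 1 - inertias[n - 1] / inertias[n - 2]
    print(f'k={n}: inertia={inertias[n - 1]:.1f}, drop={drop:.2%}')
```

On data like this the drop is large up to the true number of groups and small afterwards; the "elbow" is the last k with a large drop. On real data the bend is usually less sharp, so it is worth treating the result as a hint rather than a rule.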




In [None]:
from sklearn.cluster import KMeans

# starting with 3 clusters
k = 3
#applying KMeans (n_init is set explicitly so the result does not depend on the scikit-learn version's default)
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
#Fitting the model to our data
kmeans.fit(mall_customers_normalized)
#getting access to the cluster labels
labels = kmeans.labels_
#making a copy of the unscaled DataFrame (to preserve the original DF)
mall_customers_with_labels = mall_customers_to_process.copy()
# adding the cluster labels as a new column in the copy
mall_customers_with_labels['Cluster'] = labels
#retrieving the coordinates of the cluster centroids
centroids = kmeans.cluster_centers_ 


# Converting centroids back to original feature space from the normalized space
centroids_original = scaler.inverse_transform(centroids)  
#Converting the centroids into a new DataFrame  with the same column names as the original DataFrame
centroids_df = pd.DataFrame(centroids_original, columns=mall_customers_to_process.columns)
# Assigning cluster numbers to a new column for clarity, in range of k(three clusters)
centroids_df['Cluster'] = range(k)  
print(centroids_df)
print(mall_customers_with_labels.head())
################################### finding optimal number of clusters using elbow method ##########


# from sklearn.cluster import KMeans
# import matplotlib.pyplot as plt

# # Elbow method to determine the optimal number of clusters
# def find_optimal_clusters(data, max_k):
#     inertia = []
#     for k in range(1, max_k + 1):
#         kmeans = KMeans(n_clusters=k, random_state=42)
#         kmeans.fit(data)
#         inertia.append(kmeans.inertia_)
    
#     # Plot the elbow curve
#     plt.figure(figsize=(8, 5))
#     plt.plot(range(1, max_k + 1), inertia, marker='o', linestyle='--')
#     plt.title('Elbow Method for Optimal Clusters')
#     plt.xlabel('Number of Clusters')
#     plt.ylabel('Inertia')
#     plt.xticks(range(1, max_k + 1))
#     plt.grid(True)
#     plt.show()

# # Run the elbow method to determine the optimal number of clusters
# find_optimal_clusters(mall_customers_normalized, max_k=10)

Plotting the results using a scatter plot. I used `Age` as the x-axis and `Income` as the y-axis.


The behavior each cluster represents is based on both the age and income distribution.
1- Cluster 0 (blue) represents middle-aged customers between 35 and 55 years old who have an average income level and relatively good spending power.
2- Cluster 1 (orange) describes a younger group of customers with lower income and probably lower spending power.
3- Cluster 2 (green) corresponds to older customers with higher income who can afford more expensive goods.
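
Cluster descriptions like these are easier to back up with a per-cluster summary table. The following is a minimal sketch on invented values; the `demo` frame is hypothetical and only mimics the shape of a labelled DataFrame with `Age`, `Income` and `Cluster` columns.

```python
import pandas as pd

# Hypothetical labelled customers (values invented for illustration)
demo = pd.DataFrame({
    'Age':     [25, 28, 41, 45, 52, 58],
    'Income':  [20, 24, 45, 50, 120, 140],
    'Cluster': [1, 1, 0, 0, 2, 2],
})

# Mean, minimum and maximum of each feature per cluster, plus cluster sizes
summary = demo.groupby('Cluster')[['Age', 'Income']].agg(['mean', 'min', 'max'])
sizes = demo['Cluster'].value_counts().sort_index()

print(summary)
print(sizes)
```

Running the same `groupby` on `mall_customers_with_labels` should give the age ranges and income levels to compare against the descriptions above.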


In [None]:
import matplotlib.pyplot as plt
import numpy as np


fig_Nelliptic = plt.figure(figsize=(10, 7))
# colors for k=3
colors = ["#4EACC5", "#FF9C34", "#4E9A06"]

#iterating through clusters and plotting a scatter plot of Age vs. Income for each cluster using distinct colors.
#(the loop variable is named cluster_id so that it does not overwrite k)
for cluster_id, col in enumerate(colors):
    cluster_data = mall_customers_with_labels[mall_customers_with_labels['Cluster'] == cluster_id]
    plt.scatter(cluster_data['Age'], cluster_data['Income'], c=col, label=f'Cluster {cluster_id}', s=40, alpha=0.5)

plt.scatter(centroids_df['Age'], centroids_df['Income'], c='red', s=200, marker='X', label='Centroids')
plt.title("Scatter plot of Age vs. Income")
plt.xlabel("Age")
plt.ylabel("Income")

# Adding the legend
plt.legend()
plt.show()



#### Step 4: Applying DBSCAN Clustering



In [None]:
from sklearn.cluster import DBSCAN

#applying DBSCAN to the normalized dataset + fitting the model
#Started with `eps=0.5` and `min_samples=5`.
db_Nelliptic = DBSCAN(eps=0.5, min_samples=5).fit(mall_customers_normalized)
#predicting labels 
DBSCAN_labels_Nelliptic = db_Nelliptic.labels_
# Number of clusters in labels, ignoring noise by excluding it from the cluster count, if present.
n_clusters_Nelliptic = len(set(DBSCAN_labels_Nelliptic)) - (1 if -1 in DBSCAN_labels_Nelliptic else 0)
#counting the number of noise points 
n_noise_Nelliptic = list(DBSCAN_labels_Nelliptic).count(-1)
# Adding DBSCAN labels to mall_customers_normalized DataFrame
mall_customers_with_dbscan_labels = mall_customers_normalized.copy()
#storing the DBSCAN cluster labels 
mall_customers_with_dbscan_labels['DBSCAN_Cluster'] = DBSCAN_labels_Nelliptic


display(mall_customers_with_dbscan_labels)
print('Number of cluster labels is ',n_clusters_Nelliptic)

#########################
#Apply DBSCAN to the same dataset again to attain 3 clusters. Start with `eps=0.8` and `min_samples=8`.
#I manually changed the values of eps and min_samples to find the right values. 
db_Nelliptic_2 = DBSCAN(eps=0.8, min_samples=8).fit(mall_customers_normalized)
DBSCAN_labels_Nelliptic_2 = db_Nelliptic_2.labels_
n_clusters_Nelliptic_2 = len(set(DBSCAN_labels_Nelliptic_2)) - (1 if -1 in DBSCAN_labels_Nelliptic_2 else 0)
n_noise_Nelliptic_2 = list(DBSCAN_labels_Nelliptic_2).count(-1)
mall_customers_with_dbscan_labels_2 = mall_customers_normalized.copy()
mall_customers_with_dbscan_labels_2['DBSCAN_Cluster'] = DBSCAN_labels_Nelliptic_2

display(mall_customers_with_dbscan_labels_2)
print('Number of cluster labels is ',n_clusters_Nelliptic_2)



Plotting the results in a scatter plot to compare them with K-Means.

According to the results and plots, these methods have some obvious differences. K-Means has formed spherical clusters based on the distances between centroids and data points, and the centroids are visible in the plots. DBSCAN instead groups points based on density, and it also identifies noise explicitly (white points).
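
The difference in how the two methods treat an isolated point can be seen on a very small example. This is a sketch on toy data that is assumed for illustration (two tight grids of points and one distant outlier); the parameter values are chosen for this toy set only.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Two tight 4x4 grids of points plus one far-away outlier (toy data, assumed)
offsets = np.array([[i * 0.1, j * 0.1] for i in range(4) for j in range(4)])
X_toy = np.vstack([offsets, offsets + 5.0, [[10.0, -10.0]]])

kmeans_toy_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_toy)
dbscan_toy_labels = DBSCAN(eps=0.3, min_samples=4).fit_predict(X_toy)

# K-Means assigns every point to a centroid, including the outlier
print('K-Means label of outlier:', kmeans_toy_labels[-1])
# DBSCAN marks the isolated point as noise with the label -1
print('DBSCAN label of outlier:', dbscan_toy_labels[-1])
```

K-Means has no notion of noise, so the outlier pulls one centroid towards it, while DBSCAN leaves it out of both dense groups.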

In [None]:
fig_Nelliptic = plt.figure(figsize=(10,10))
ax_Nelliptic = fig_Nelliptic.add_subplot(1, 1, 1)
#retrieves the unique cluster labels
unique_labels_Nelliptic = set(DBSCAN_labels_Nelliptic)
#creating a mask that is True for the core samples found by DBSCAN
core_samples_mask_Nelliptic = np.zeros_like(DBSCAN_labels_Nelliptic, dtype=bool)
core_samples_mask_Nelliptic[db_Nelliptic.core_sample_indices_] = True
#generating a list of distinct colors
colors_Nelliptic = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels_Nelliptic))]

#iterates through each cluster k and assigns the corresponding color 
for k, col_Nelliptic in zip(unique_labels_Nelliptic, colors_Nelliptic):
    if k == -1:
        # white used for noise so that the other clusters are easier to see
        col_Nelliptic = [1, 1, 1, 1]
    
    #a boolean mask to identify the data points that belong to the current cluster k
    class_member_mask_Nelliptic = DBSCAN_labels_Nelliptic == k
    #selecting core points and plotting them
    xy_Nelliptic = mall_customers_normalized[class_member_mask_Nelliptic & core_samples_mask_Nelliptic]
    ax_Nelliptic.plot(
        xy_Nelliptic['Age'],
        xy_Nelliptic['Income'],
        "o",
        markerfacecolor=tuple(col_Nelliptic),
        markeredgecolor="k",
        markersize=20,
        # noise has no core points, so its (empty) core plot is kept out of the legend
        label=f'Core point (Cluster {k})' if k != -1 else '_nolegend_'
        
    )
    #selecting and plotting of border points with a smaller marker size
    xy_Nelliptic = mall_customers_normalized[class_member_mask_Nelliptic & ~core_samples_mask_Nelliptic]
    ax_Nelliptic.plot(
        xy_Nelliptic['Age'],
        xy_Nelliptic['Income'],
        "o",
        markerfacecolor=tuple(col_Nelliptic),
        markeredgecolor="k",
        markersize=6,
        label=f'Border point (Cluster {k})' if k != -1 else 'Noise'

    )
# Adding title and labels
ax_Nelliptic.set_title('DBSCAN Clustering on Mall Customers Data')
ax_Nelliptic.set_xlabel('Age')
ax_Nelliptic.set_ylabel('Income')
# Adding legends
ax_Nelliptic.legend()
plt.show()



########################################
#plotting the second DBSCAN
fig_Nelliptic_2 = plt.figure(figsize=(10,10))
ax_Nelliptic_2 = fig_Nelliptic_2.add_subplot(1, 1, 1)
unique_labels_Nelliptic_2 = set(DBSCAN_labels_Nelliptic_2)
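core_samples_mask_Nelliptic_2 = np.zeros_like(DBSCAN_labels_Nelliptic_2, dtype=bool)
# marking the core samples found by the second DBSCAN run
core_samples_mask_Nelliptic_2[db_Nelliptic_2.core_sample_indices_] = True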

colors_Nelliptic = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels_Nelliptic_2))]


for k, col_Nelliptic in zip(unique_labels_Nelliptic_2, colors_Nelliptic):
    if k == -1:
        # white used for noise so that the other clusters are easier to see
        col_Nelliptic = [1, 1, 1, 1]
    

    class_member_mask_Nelliptic_2 = DBSCAN_labels_Nelliptic_2 == k

    xy_Nelliptic_2 = mall_customers_normalized[class_member_mask_Nelliptic_2 & core_samples_mask_Nelliptic_2]
    ax_Nelliptic_2.plot(
        xy_Nelliptic_2['Age'],
        xy_Nelliptic_2['Income'],
        "o",
        markerfacecolor=tuple(col_Nelliptic),
        markeredgecolor="k",
        markersize=20,
        # noise has no core points, so its (empty) core plot is kept out of the legend
        label=f'Core point (Cluster {k})' if k != -1 else '_nolegend_'
        
    )
    xy_Nelliptic_2= mall_customers_normalized[class_member_mask_Nelliptic_2 & ~core_samples_mask_Nelliptic_2]
    ax_Nelliptic_2.plot(
        xy_Nelliptic_2['Age'],
        xy_Nelliptic_2['Income'],
        "o",
        markerfacecolor=tuple(col_Nelliptic),
        markeredgecolor="k",
        markersize=6,
        label=f'Border point (Cluster {k})' if k != -1 else 'Noise'

    )
# Adding title and labels
ax_Nelliptic_2.set_title('DBSCAN Clustering on Mall Customers Data')
ax_Nelliptic_2.set_xlabel('Age')
ax_Nelliptic_2.set_ylabel('Income')
# Adding legends
ax_Nelliptic_2.legend()
plt.show()




#### Step 5: Evaluation of the methods with silhouette scores for both K-Means and DBSCAN



In [None]:
from sklearn.metrics import silhouette_score

# Silhouette score for K-Means
kmeans_silhouette_score = silhouette_score(mall_customers_normalized, labels)
print('kmeans_silhouette_score:', kmeans_silhouette_score)

################################################
# Silhouette score for DBSCAN without excluding the noise (label -1 is treated as one more cluster)
dbscan_silhouette_score = silhouette_score(mall_customers_normalized, DBSCAN_labels_Nelliptic)
print('dbscan_silhouette_score including noise points:', dbscan_silhouette_score)
##################################################

# Exclude noise points (labeled as -1)
non_noise_mask = DBSCAN_labels_Nelliptic != -1
filtered_data = mall_customers_normalized[non_noise_mask]
filtered_labels = DBSCAN_labels_Nelliptic[non_noise_mask]

# Calculate silhouette score only for non-noise points
dbscan_silhouette_score_no_noise = silhouette_score(filtered_data, filtered_labels)
print('dbscan_silhouette_score for non-noise points:', dbscan_silhouette_score_no_noise)

Based on the silhouette score of each clustering method, I draw the following conclusions:

- The score of 0.2 for K-Means indicates that the clusters are only weakly separated and overlap each other (this is also visible in the plot).
- The negative silhouette score for DBSCAN shows poor clustering performance on the dataset. Even though I excluded the noise points the second time, the score is still very close to 0, so excluding them does not make a remarkable difference.
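
To make these numbers easier to interpret: for each point, the silhouette value is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below computes this by hand on toy data (assumed for illustration) and compares it with `silhouette_score`; the helper `manual_silhouette` is hypothetical and does not handle single-point clusters.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Toy data: two compact, well-separated groups (invented for illustration)
X_sil = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
toy_labels = np.array([0, 0, 0, 1, 1, 1])

def manual_silhouette(X, labels):
    """Mean silhouette value; assumes every cluster has at least two points."""
    # Pairwise Euclidean distances between all points
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        a = dist[i, same].mean()
        b = min(dist[i, labels == other].mean()
                for other in set(labels.tolist()) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

manual = manual_silhouette(X_sil, toy_labels)
reference = silhouette_score(X_sil, toy_labels)
print(manual, reference)
```

Values near 1 mean points sit much closer to their own cluster than to any other, values near 0 mean clusters touch or overlap, and negative values mean points are on average closer to another cluster than to their own.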

K-Means' strengths lie in forming well-separated clusters with data points grouped clearly around the centroids, as long as the clusters are elliptical or spherical. However, it is sensitive to outliers and absorbs many of them into the clusters (green cluster in exercise 4).

DBSCAN's strength lies in its ability to handle clusters of arbitrary shapes and sizes, to identify noise (white data points), and to discover the number of clusters itself instead of requiring it in advance. Its weakness is its sensitivity to the input parameters `eps` and `min_samples`, which determine the number of clusters; it took time to find the right combination. Another weakness was that the method could not handle varying densities well, which led to fragmented clusters in the plots.
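
The manual search over `eps` and `min_samples` can be made more systematic with a small sweep that records the number of clusters and noise points for each combination. This is a sketch under assumptions: it runs on synthetic, standardized `make_blobs` data as a stand-in for the customer data, and the parameter grid is arbitrary.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled customer data (an assumption for illustration)
X_raw, _ = make_blobs(n_samples=200, centers=3, n_features=2, random_state=42)
X_sweep = StandardScaler().fit_transform(X_raw)

sweep_results = []
for eps in (0.1, 0.2, 0.3, 0.5, 0.8):
    for min_samples in (5, 8):
        sweep_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_sweep)
        # Number of clusters, not counting the noise label -1
        n_found = len(set(sweep_labels)) - (1 if -1 in sweep_labels else 0)
        n_noise = int((sweep_labels == -1).sum())
        sweep_results.append((eps, min_samples, n_found, n_noise))
        print(f'eps={eps}, min_samples={min_samples}: clusters={n_found}, noise={n_noise}')
```

For a fixed `min_samples`, the noise count can only stay the same or fall as `eps` grows, so the table gives a quick view of where the clustering moves from "mostly noise" to "everything merged".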


The low scores may be caused by the high level of noise in our data. After plotting the data and applying DBSCAN, it was obvious that noise has a large impact on the clustering process and that the visualization is not clear.

In summary, the clustering results indicate that K-Means provides a better clustering solution than DBSCAN for the dataset based on the silhouette scores.