# ML - Fall 2023 - Practical Homework

## Practical Homework 6 - KMeans and PCA

Student Name: Parham Rezaei

Student Number: 400108547

# Phase 0: Introduction

**In this assignment, you will implement the K-means and PCA algorithms to perform customer segmentation. The dataset contains behavioral variables of customers, such as Balance and Purchases. Your task is to build a model that segments these customers into clusters.**

In [None]:
# essential packages
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_samples, silhouette_score
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
from sklearn.model_selection import train_test_split
from sklearn import metrics

import warnings
warnings.filterwarnings("ignore")

# add any other packages that you may need here

In [None]:
!wget -O dataset.csv "https://www.dropbox.com/scl/fi/vcejtazdshv8dnhbnfxc7/dataset.csv?rlkey=zauavuzjf5jzmdoqtorkmrkzk&dl=1"

# Phase 1: Explore

## Sec 1: Load and Explore the given dataset (P1-Sec1: 25 Points)

Load the dataset and display the first 10 rows of the dataset. **(P1-1-1: 2 points)**

In [None]:
df = pd.read_csv('dataset.csv')
df.head(10)

Print the column names and the number of data samples. **(P1-1-2: 1 point)**

In [None]:
print(f"column names: {df.columns}")
print(f"number of data samples: {len(df)}")

Identify the columns that contain NaN values. **(P1-1-3: 2 points)**

In [None]:
print(f"columns with nan values: {df.columns[df.isnull().any()].tolist()}")

Fill the NaN values with the median of each column. **(P1-1-4: 2 points)**

In [None]:
# numeric_only=True skips string columns such as CUST_ID, which have no median
df = df.fillna(df.median(numeric_only=True))
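
As a quick illustration of what median imputation does, here is a minimal sketch on a made-up toy frame. The values and the choice of `BALANCE` and `PAYMENTS` as example columns are assumptions for illustration only, not taken from `dataset.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per numeric column (illustrative values only)
toy = pd.DataFrame({
    "CUST_ID": ["C1", "C2", "C3", "C4"],
    "BALANCE": [10.0, np.nan, 30.0, 1000.0],
    "PAYMENTS": [1.0, 2.0, np.nan, 3.0],
})

# Medians are computed per numeric column and ignore NaN entries
medians = toy.median(numeric_only=True)

# fillna with a Series matches on column names, so each column gets its own median
filled = toy.fillna(medians)

print(medians)
print(filled)
```

The median is less sensitive than the mean to extreme values such as the 1000.0 above, which is one common reason to prefer it for skewed columns.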

Find the max, min, and average of each column with numerical data. **(P1-1-5: 2 points)**

In [None]:
# DataFrame.append was removed in pandas 2.0, so collect the rows in a list first
rows = []
for column in df.columns:
    if df[column].dtype in ['int64', 'float64']:
        rows.append({'column_name': column, 'min': df[column].min(), 'max': df[column].max(), 'mean': df[column].mean()})
df_stats = pd.DataFrame(rows, columns=['column_name', 'min', 'max', 'mean'])
df_stats


Plot the histogram of each numerical column. Also, show the median and average value of each column in the plot. **(P1-1-6: 6 points)**

In [None]:
for column in df.columns:
    if df[column].dtype in ['int64', 'float64']:
        plt.figure()
        plt.hist(df[column], bins=100)
        plt.axvline(df[column].median(), color='red', linestyle='dashed', linewidth=2, label='median')
        plt.axvline(df[column].mean(), color='green', linestyle='dotted', linewidth=2, label='average')
        plt.legend()
        plt.title(column)
        plt.show()

Display the box plot for each numerical column. **(P1-1-7: 5 points)**

In [None]:
for column in df.columns:
    if df[column].dtype in ['int64', 'float64']:
        plt.figure()
        plt.boxplot(df[column])
        plt.title(column)
        plt.show()

Show the correlation between columns by plotting the heatmap of correlation coefficients. **(P1-1-8: 5 points)**

In [None]:
plt.figure(figsize=(14,14))
# numeric_only=True keeps the string column CUST_ID out of the correlation matrix
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()

# Phase 2: Preprocessing

## Sec 1: Preprocess the data **(P2-Sec1: 15 Points)**

Drop the 'CUST_ID' column. **(P2-1-1: 2 points)**

In [None]:
df.drop(columns=['CUST_ID'], inplace=True)

Check for duplicated rows. If there are any duplicated rows, remove them. **(P2-1-2: 6 points)**

In [None]:
print(f"number of duplicate rows: {len(df[df.duplicated()])}")
df = df.drop_duplicates()

Normalize the values of each column. **(P2-1-3: 7 points)**

In [None]:
# my own normalizer: min-max scaling of each column to the range [0, 1]
for column in df.columns:
    if df[column].dtype in ['int64', 'float64']:
        df[column] = (df[column] - df[column].min()) / (df[column].max() - df[column].min())

In [None]:
ndf = df.copy()

In [None]:
# sklearn Normalizer: rescales each row (sample) to unit L2 norm, not each column
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
cols = df.columns.tolist()
ndf[cols] = normalizer.fit_transform(df[cols])

In [None]:
ndf.head()
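
The scalers used in this notebook act along different axes, which matters for the results later on. The following is a minimal sketch on a made-up 3x2 array (the numbers are illustrative, not from the dataset) comparing `MinMaxScaler`, `Normalizer`, and `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Two features on very different scales (illustrative values only)
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

minmax = MinMaxScaler().fit_transform(X)      # per column: mapped into [0, 1]
unit_rows = Normalizer().fit_transform(X)     # per row: L2 norm becomes 1
standard = StandardScaler().fit_transform(X)  # per column: mean 0, std 1

print(minmax)
print(np.linalg.norm(unit_rows, axis=1))      # row norms after Normalizer
print(standard.mean(axis=0), standard.std(axis=0))
```

`MinMaxScaler` here plays the role of the hand-written min-max loop above. `Normalizer` changes the geometry of the data more strongly, because every sample ends up on the unit sphere.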

# Phase 3: Modeling

## Sec 1: PCA and K-means with sklearn **(P3-Sec1: 40 Points)**

Use the `PCA` class from the `sklearn` library to reduce the dimensionality of the DataFrame. **(P3-1-1: 2 points)**

Follow [this link](https://www.youtube.com/watch?v=nEvKduLXFvk) to understand more about PCA (2 minutes).

In [None]:
pca = PCA()
pca.fit(ndf)
pca_data = pca.transform(ndf)
pca_data.shape

In [None]:
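cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
# Start the x axis at 1 so that it reads as the number of components kept
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance)
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance ratio')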

In [None]:
# I choose 5 components: more than 90% of the variance is explained there, and the curve also shows an elbow at that point
pca = PCA(5)
pca.fit(ndf)
pca_data = pca.transform(ndf)
pca_data.shape
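
The choice above was read off the plot. As a complement, here is a minimal sketch of picking the smallest number of components that reaches a cumulative explained-variance threshold. The synthetic data, the variable names, and the 0.90 threshold are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 3 strong latent directions embedded in 8 features, plus small noise
latent = rng.normal(size=(500, 3)) * np.array([5.0, 3.0, 2.0])
mixing = rng.normal(size=(3, 8))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 8))

pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

threshold = 0.90
# searchsorted returns the first index whose cumulative value is >= threshold;
# adding 1 converts that index into a component count
n_components = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative.round(3))
print(n_components)
```

scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components` and then selects the count from the explained variance itself, which should give a similar answer in most cases.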

**Elbow Method Visualization** helps to determine the optimal number of clusters by visualizing the within-cluster sum of squares (WCSS) against the number of clusters.

Use the `plot_elbow_method` function to plot the number of clusters versus WCSS for both the main DataFrame and the one reduced using `PCA`. Then discuss the choice of the number of components for PCA and clusters for K-means. **(P3-1-2: 10 points)**

In [None]:
def plot_elbow_method(X, max_clusters=10):
    wcss = []
    for i in range(1, max_clusters + 1):
        kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
        kmeans.fit(X)
        wcss.append(kmeans.inertia_)

    plt.plot(range(1, max_clusters + 1), wcss)
    plt.title('Elbow Method')
    plt.xlabel('Number of clusters')
    plt.ylabel('WCSS')
    plt.show()

In [None]:
plot_elbow_method(ndf)
plot_elbow_method(pca_data)
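
To make the WCSS quantity concrete, here is a minimal sketch that recomputes it by hand and compares it with `inertia_`. The three synthetic blobs and all variable names are assumptions for illustration; `n_init=10` is set explicitly only to keep the behavior the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Three well-separated synthetic blobs in 2-D
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

wcss_manual, wcss_sklearn = [], []
for k in range(1, 7):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42).fit(X)
    # WCSS by hand: squared distance of every point to its assigned centroid, summed
    manual = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()
    wcss_manual.append(manual)
    wcss_sklearn.append(km.inertia_)

print(np.round(wcss_manual, 1))
print(np.round(wcss_sklearn, 1))
```

On data like this, the curve drops sharply up to k = 3 and flattens afterwards, which is the "elbow" the method looks for.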

Use the `KMeans` class from the `sklearn.cluster` module to create clusters from the DataFrame that has been dimensionally reduced using `PCA`. **(P3-1-3: 3 points)**

Follow [this link](https://www.youtube.com/watch?v=R2e3Ls9H_fc) to understand more about KMeans (4 minutes).

In [None]:
kmeans = KMeans(n_clusters=3, init='k-means++', random_state=42)
kmeans.fit(pca_data)
y_kmeans = kmeans.predict(pca_data)

Complete the definition of the following class to implement PCA, which is capable of reducing the dimensionality. **(P3-1-4: 10 points)**

In [None]:
class CustomPCA:
    """
    Custom implementation of PCA.
    Attributes:
    -----------
    n_components : int
        Number of principal components.
    components : ndarray
        Principal components.
    """
    def init(self, n_components):
        self.n_components = n_components
        self.components = None

    def fit(self, X):
        """
        Fit the model with X.
        Parameters:
        -----------
        X : ndarray, shape (n_samples, n_features)
            Training data.
        """
        X = np.asarray(X, dtype=float)
        # Store the feature means so that transform can center data the same way
        self.mean = X.mean(axis=0)
        # Calculate covariance matrix (np.cov centers the data internally)
        cov = np.cov(X, rowvar=False)
        # Find eigenvalues and eigenvectors; eigh is meant for symmetric matrices and returns real values
        eigenvalues, eigenvectors = np.linalg.eigh(cov)
        # Sort eigenvectors (stored as columns) by decreasing eigenvalue
        idxs = np.argsort(eigenvalues)[::-1]
        eigenvectors = eigenvectors[:, idxs].T
        # Store first n_components eigenvectors in self.components
        self.components = eigenvectors[0:self.n_components]


    def transform(self, X):
        """
        Apply dimensionality reduction to X.
        Parameters:
        -----------
        X : ndarray, shape (n_samples, n_features)
            Data to transform.
        Returns:
        --------
        X_transformed : ndarray, shape (n_samples, n_components)
            Transformed data.
        """
        # Center with the training means, then project onto the components
        X_transformed = np.dot(np.asarray(X, dtype=float) - self.mean, self.components.T)
        return X_transformed
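
A useful sanity check for an eigendecomposition-based PCA is to compare it with scikit-learn on small data. The sketch below is self-contained and does not use the class above; the synthetic data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Four independent features with clearly different variances
X = rng.normal(size=(200, 4)) * np.array([4.0, 2.0, 1.0, 0.5])

# Eigendecomposition route: center, covariance, sort by decreasing eigenvalue
Xc = X - X.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order].T[:2]
projected = Xc @ components.T

# Reference projection from scikit-learn
reference = PCA(n_components=2).fit_transform(X)

# Eigenvectors are only defined up to sign, so compare absolute values
print(np.allclose(np.abs(projected), np.abs(reference)))
```

If the eigenvectors are not reordered together with the eigenvalues, this comparison generally fails, which makes it a cheap way to catch that kind of bug.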

Complete defining the class below to implement `KMeans`, an algorithm designed for clustering. **(P3-1-5: 15 points)**

In [None]:
class CustomKMeans:
    """
    Custom implementation of K-means clustering.
    Attributes:
    -----------
    n_clusters : int
        Number of clusters.
    centroids : ndarray
        Coordinates of cluster centers.
    """
    def init(self, n_clusters):
        self.n_clusters = n_clusters
        self.centroids = None


    def fit_predict(self, X):
        """
        Compute k-means clustering.
        Parameters:
        -----------
        X : ndarray, shape (n_samples, n_features)
            Training instances to cluster.
        Returns:
        --------
        labels : ndarray, shape (n_samples,)
            Index of the cluster each sample belongs to.
        """
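        X = np.asarray(X, dtype=float)
        # Initialize centroids with n_clusters distinct random samples
        self.centroids = X[np.random.choice(X.shape[0], self.n_clusters, replace=False)]
        # Repeat until convergence:
        while True:
            # Assign points to the nearest centroid; distances has shape (n_clusters, n_samples)
            distances = np.sqrt(((X - self.centroids[:, np.newaxis])**2).sum(axis=2))
            labels = np.argmin(distances, axis=0)
            # Recalculate the centroids; an empty cluster keeps its previous centroid,
            # so the mean of an empty slice never produces NaN (which would prevent convergence)
            new_centroids = self.centroids.copy()
            for i in range(self.n_clusters):
                if np.any(labels == i):
                    new_centroids[i] = X[labels == i].mean(axis=0)
            if np.allclose(self.centroids, new_centroids):
                break
            self.centroids = new_centroids
        return labels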
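
The assignment step relies on NumPy broadcasting, which is easy to misread. Here is a minimal sketch of a single assignment and update step on four hand-picked points; the points and starting centroids are assumptions chosen so the result can be checked by eye.

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

# centroids[:, np.newaxis] has shape (k, 1, d); subtracting it from X with
# shape (n, d) broadcasts to (k, n, d): one difference vector per centroid/sample pair
diff = X - centroids[:, np.newaxis]
distances = np.sqrt((diff ** 2).sum(axis=2))   # shape (k, n)
labels = np.argmin(distances, axis=0)          # nearest centroid for each sample

# Update step: each centroid moves to the mean of the points assigned to it
new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(len(centroids))])

print(distances.shape)
print(labels)
print(new_centroids)
```

Repeating these two steps until the centroids stop moving is the loop inside `fit_predict`.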

## Sec 2: Fitting implemented Kmeans **(P3-Sec2: 5 Points)**

Use your implemented `CustomPCA` to reduce the dimensionality of the DataFrame. **(P3-2-1: 3 points)**

In [None]:
custom_pca = CustomPCA()
custom_pca.init(5)
custom_pca.fit(ndf)
custom_pca_data = custom_pca.transform(ndf)

Apply the `CustomKMeans` implementation you created to perform clustering on the DataFrame. **(P3-2-2: 2 points)**

In [None]:
custom_kmeans = CustomKMeans()
custom_kmeans.init(3)
custom_kmeans_labels = custom_kmeans.fit_predict(custom_pca_data)

# Phase 4: Analyzing

## Sec 1: Visualizing and Comparing **(P4-Sec1: 5 Points)**

Visualize and compare the clustering results from the sklearn library with those from your custom clustering implementation. **(P4-1-1: 5 points)**

In [None]:
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
plt.scatter(pca_data[:,0], pca_data[:,1], c=y_kmeans, cmap='rainbow')
plt.title('sklearn kmeans')
plt.subplot(1,2,2)
plt.scatter(custom_pca_data[:,0], custom_pca_data[:,1], c=custom_kmeans_labels, cmap='rainbow')
plt.title('custom kmeans')
plt.show()
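
Cluster indices are arbitrary, so two runs can find the same grouping while numbering (and coloring) the clusters differently. One way to compare two labelings numerically is the adjusted Rand index. The sketch below uses made-up label arrays; applying it to the notebook's results assumes both label arrays refer to the same rows in the same order.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

labels_a = np.array([0, 0, 1, 1, 2, 2])
labels_b = np.array([2, 2, 0, 0, 1, 1])  # same partition, different numbering
labels_c = np.array([0, 1, 0, 1, 0, 1])  # a genuinely different partition

# The score is 1.0 for identical partitions and near 0 (or below) for unrelated ones
ari_same = adjusted_rand_score(labels_a, labels_b)
ari_diff = adjusted_rand_score(labels_a, labels_c)

print(ari_same)
print(ari_diff)
```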

## Sec 2: Silhouette Analysis **(P4-Sec2: 10 Points)**

**Silhouette Analysis** involves calculating and plotting the silhouette coefficients, which measure how similar each point is to its own cluster compared to other clusters. The closer these coefficients are to +1, the better the clustering.
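
For a single sample, the coefficient is `(b - a) / max(a, b)`, where `a` is the mean distance to the other points in its own cluster and `b` is the mean distance to the points of the nearest other cluster. Here is a minimal sketch on four made-up 1-D points, checked against `silhouette_samples`.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Two tight clusters on a line (illustrative values only)
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

i = 0  # sample whose coefficient is computed by hand
dists = np.abs(X[:, 0] - X[i, 0])

same = labels == labels[i]
same[i] = False                        # exclude the sample itself
a = dists[same].mean()                 # mean intra-cluster distance
b = dists[labels != labels[i]].mean()  # only one other cluster exists here
s_manual = (b - a) / max(a, b)

s_sklearn = silhouette_samples(X, labels)[i]
print(s_manual, s_sklearn)
```

With more than two clusters, `b` would be the minimum of the per-cluster mean distances over all other clusters.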

Calculate the silhouette scores for each sample in the dataset using `silhouette_samples` and the average silhouette score using `silhouette_score`. Then, visualize these scores in a plot for each cluster for both results from the sklearn library and those from your custom clustering implementation.  **(P4-2-1: 10 points)**

In [None]:
print("Sklearn")
print(f"silhouette_score: {silhouette_score(pca_data, y_kmeans)}")
print(f"silhouette_samples: {silhouette_samples(pca_data, y_kmeans)}")
print("Mine")
print(f"silhouette_score: {silhouette_score(custom_pca_data, custom_kmeans_labels)}")
print(f"silhouette_samples: {silhouette_samples(custom_pca_data, custom_kmeans_labels)}")
plt.figure(figsize=(10,5))
plt.scatter(pca_data[:,0], pca_data[:,1], c=silhouette_samples(pca_data, y_kmeans))
plt.title('sklearn kmeans')
plt.show()
plt.figure(figsize=(10,5))
plt.scatter(custom_pca_data[:,0], custom_pca_data[:,1], c=silhouette_samples(custom_pca_data, custom_kmeans_labels))
plt.title('custom kmeans')
plt.show()

In [None]:
visualizer = SilhouetteVisualizer(kmeans)
visualizer.fit(pca_data)
visualizer.show()

In [None]:
# SilhouetteVisualizer does not work with CustomKMeans because it is not a scikit-learn estimator, so the silhouette plot is drawn manually:
l=0
custom_scores = silhouette_score(custom_pca_data, custom_kmeans_labels)
custom_samples = silhouette_samples(custom_pca_data, custom_kmeans_labels)
for i in range(3):
    cluster_samples = custom_samples[custom_kmeans_labels == i]
    cluster_samples.sort()
    cluster_size = cluster_samples.shape[0]
    u = l + cluster_size
    plt.fill_betweenx(np.arange(l,u), 0, cluster_samples, alpha=0.7)
    l = u + 10

plt.axvline(x=custom_scores, color="red", linestyle="--")
plt.title("Custom Kmeans")

The first plot again, made a little prettier with a colormap :)

In [None]:
plt.figure(figsize=(10,5))
plt.scatter(pca_data[:,0], pca_data[:,1], c=silhouette_samples(pca_data, y_kmeans),cmap='rainbow')
plt.title('sklearn kmeans')
plt.show()
plt.figure(figsize=(10,5))
plt.scatter(custom_pca_data[:,0], custom_pca_data[:,1], c=silhouette_samples(custom_pca_data, custom_kmeans_labels),cmap='rainbow')
plt.title('custom kmeans')
plt.show()

# Extra
Something I noticed: using `StandardScaler` instead of `Normalizer` leads to choosing one more cluster (4 instead of 3).

In [None]:
sc = StandardScaler()
ndf = pd.DataFrame(sc.fit_transform(df),columns=df.columns)

In [None]:
pca = PCA()
pca.fit(ndf)
pca_data = pca.transform(ndf)
pca_data.shape

In [None]:
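cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
# Start the x axis at 1 so that it reads as the number of components kept
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance)
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance ratio')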

In [None]:
custom_pca = CustomPCA()
custom_pca.init(10)
custom_pca.fit(ndf)
custom_pca_data = custom_pca.transform(ndf)

In [None]:
plot_elbow_method(ndf)
plot_elbow_method(pca_data)

In [None]:
kmeans = KMeans(n_clusters=4, init='k-means++', random_state=42)
kmeans.fit(pca_data)
y_kmeans = kmeans.predict(pca_data)

In [None]:
custom_kmeans = CustomKMeans()
custom_kmeans.init(4)
custom_kmeans_labels = custom_kmeans.fit_predict(custom_pca_data)

In [None]:
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
plt.scatter(pca_data[:,0], pca_data[:,1], c=y_kmeans, cmap='rainbow')
plt.title('sklearn kmeans')
plt.subplot(1,2,2)
plt.scatter(custom_pca_data[:,0], custom_pca_data[:,1], c=custom_kmeans_labels, cmap='rainbow')
plt.title('custom kmeans')
plt.show()