# Unsupervised Machine Learning 

**Unsupervised learning** is a branch of machine learning in which there are no output (dependent, target) variables; we work only with input (independent, feature) variables. In clustering tasks in particular, the goal is to group data points into clusters, smaller subsets of similar observations, from which we can draw useful insights or patterns.

## KMeans Clustering
KMeans clustering is an example of **exclusive clustering**, meaning each observation belongs to exactly one cluster. There is no overlap: an observation cannot be part of two clusters simultaneously.

This algorithm works only with numerical variables, which means all non-numeric features should be removed before applying KMeans. Additionally, feature values should be either standardized or normalized, depending on the context. This step ensures that variables with larger scales do not dominate the clustering process, as KMeans relies on Euclidean distance, which is highly sensitive to variable scale.

Even if the dataset does not contain outliers, scaling remains essential. Without it, variables with larger numerical ranges can disproportionately influence the clustering results, leading to biased groupings.

However, if outliers are present, they should be treated before scaling. One commonly used technique is Winsorization, which reduces the impact of extreme values by capping them at specified percentiles. Once outliers are handled, it is safe to proceed with normalization or standardization.
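
As a minimal sketch of percentile-based capping, the snippet below clips a toy array at its 5th and 95th percentiles. The helper name `winsorize_percentiles`, the percentile choices, and the data are assumptions made for illustration; they are not taken from the analysis that follows.

```python
import numpy as np

def winsorize_percentiles(values, lower_pct=5, upper_pct=95):
    """Cap values outside the chosen percentiles (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    lower, upper = np.percentile(values, [lower_pct, upper_pct])
    # Values below `lower` become `lower`, values above `upper` become `upper`
    return np.clip(values, lower, upper)

# Toy data with one extreme value (illustrative numbers only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])
x_capped = winsorize_percentiles(x)

print("max before:", x.max(), "max after:", x_capped.max())
print("observations kept:", len(x_capped) == len(x))
```

Capping keeps every observation in the dataset, which is why it is often preferred to dropping rows when the sample is small.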

To determine the number of clusters K, the elbow method is most frequently used. Alternatively, a metric-based approach can be applied, using the ratio
`betweenss / totss`,
which reflects how well separated the clusters are. Higher values of this ratio indicate better-defined clustering structures, but because the ratio tends to rise as K increases, we look for the K beyond which it improves only marginally rather than simply taking the maximum.

In **R**, the `kmeans()` function is commonly used for this purpose. Its most important parameters are:
- the number of clusters to use (`centers`),
- the number of times the algorithm is run with different initial cluster positions (`nstart`),
- the maximum number of iterations per run (`iter.max`).

As noted above, the number of clusters is usually determined by the **elbow method** or by using a simple metric, most commonly the ratio `betweenss / totss`.

Unlike supervised learning, clustering does not have well-defined evaluation metrics like accuracy, precision, or recall. Instead, we focus on:
- Within-cluster sum of squares (`withinss`): for each cluster, the sum of squared distances between its observations and its centroid; the sum over all clusters is `tot.withinss`.
- Between-cluster sum of squares (`betweenss`): the sum of squared distances between each cluster’s centroid and the overall centroid (global center), weighted by the number of observations in the cluster.
- Total sum of squares (`totss`): the sum of squared distances of all observations from the global centroid.

The goal is to obtain a high ratio `betweenss / totss`. A value of 100% would mean that every observation coincides with its cluster centroid, an ideal scenario that is rarely achievable in practice with a sensible number of clusters.
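
The three quantities can be computed directly from a cluster assignment. The sketch below does so with NumPy on a four-point toy dataset; the function name `sums_of_squares` and the data are hypothetical. It also illustrates that `totss` equals `tot.withinss` plus `betweenss` when each centroid is the mean of its cluster.

```python
import numpy as np

def sums_of_squares(X, labels):
    """Return (tot_withinss, betweenss, totss) for a given cluster assignment."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    global_centroid = X.mean(axis=0)
    totss = ((X - global_centroid) ** 2).sum()

    tot_withinss = 0.0
    betweenss = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)
        # Squared distances of the cluster's members to their own centroid
        tot_withinss += ((members - centroid) ** 2).sum()
        # Each centroid is weighted by the number of points in its cluster
        betweenss += len(members) * ((centroid - global_centroid) ** 2).sum()
    return tot_withinss, betweenss, totss

# Two tight, well-separated pairs of points (illustrative data)
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels = np.array([0, 0, 1, 1])

tot_withinss, betweenss, totss = sums_of_squares(X, labels)
ratio = betweenss / totss
print(f"tot_withinss={tot_withinss}, betweenss={betweenss}, totss={totss}")
print(f"betweenss / totss = {ratio:.3f}")
```

Because the two pairs are far apart relative to their spread, almost all of the total variation is between clusters and the ratio is close to 1.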

The KMeans algorithm works by iteratively updating cluster centroids. We specify a maximum number of iterations, and the algorithm stops early if the centroids do not change significantly between iterations, meaning it has converged.
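
The iterative procedure can be sketched in a few lines of NumPy. This is a simplified illustration of the assign-then-update loop, not the implementation used by R or scikit-learn; the function name `kmeans_sketch`, its parameters, and the toy data are assumptions.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=10, tol=1e-8, seed=0):
    """Minimal KMeans loop; returns (centroids, labels, iterations_used)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Initialise centroids as k distinct observations chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for iteration in range(1, max_iter + 1):
        # Assignment step: nearest centroid by squared Euclidean distance
        distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its members
        new_centroids = centroids.copy()
        for j in range(k):
            if np.any(labels == j):  # keep the old centroid if a cluster is empty
                new_centroids[j] = X[labels == j].mean(axis=0)
        shift = np.abs(new_centroids - centroids).max()
        centroids = new_centroids
        if shift < tol:  # converged: centroids no longer move
            break
    return centroids, labels, iteration

# Two small, well-separated groups (illustrative data)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
centroids, labels, n_iter = kmeans_sketch(X, k=2)
print("iterations used:", n_iter)
print("centroids:\n", centroids)
```

On data this clean the loop stops well before `max_iter`, because the centroids stop moving after a couple of updates.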

Another important parameter is `nstart`, which determines how many random sets of initial centroids are tried. Because the initial positions are selected randomly, setting `nstart = 1` is discouraged; it is better to try multiple initializations to avoid poor local minima.

The `iter.max` parameter sets an upper bound on the number of iterations. This matters because, if the initial cluster centers are poorly chosen (due to random selection), the algorithm may be slow to converge. The parameter acts as a safeguard that stops the algorithm after a fixed number of iterations, even if convergence has not been reached.
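
For readers moving on to Python, the rough scikit-learn counterparts of `nstart` and `iter.max` are the `n_init` and `max_iter` arguments of `KMeans`. The sketch below compares a single random start with 25 starts on synthetic data; the data, the seeds, and the choice `init="random"` (used here to mimic purely random starting centers) are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data: three loose blobs (illustrative only)
X = np.vstack([rng.normal(loc, 1.0, size=(50, 2)) for loc in (0, 4, 8)])

# n_init plays the role of R's nstart, max_iter the role of iter.max
single = KMeans(n_clusters=3, init="random", n_init=1,
                max_iter=10, random_state=0).fit(X)
multi = KMeans(n_clusters=3, init="random", n_init=25,
               max_iter=10, random_state=0).fit(X)

# inertia_ is the total within-cluster sum of squares of the best run
print("single start:", round(single.inertia_, 2), "iterations:", single.n_iter_)
print("25 starts   :", round(multi.inertia_, 2), "iterations:", multi.n_iter_)
```

With several starts, scikit-learn keeps the run with the lowest inertia, so the multi-start fit is typically at least as good as a single start, at the cost of extra computation.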

### Implementation in Python

Having discussed the KMeans algorithm in the R environment, we will now demonstrate its implementation in **Python** using the `scikit-learn` library.


In [None]:
# KMeans clustering on world happiness data
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Loading dataset
data = pd.read_csv('Data/world-happiness-report-2021.csv')

# Showing first 5 rows
data.head()

In [None]:
# Using info to obtain more information about the dataset
# We can see 2 categorical variables which are not suitable for KMeans clustering
data.info()


In [None]:
# Removing variables which are not useful 
data = data.drop(columns=[
    "Country name",  # categorical variable
    "Regional indicator", # categorical variable
    "Ladder score", # aggregate metric calculated based on other columns
    "Standard error of ladder score",  # margin of error for ladder score estimation
    "upperwhisker",  # upper limit for ladder score
    "lowerwhisker",  # lower limit for ladder score
    "Ladder score in Dystopia", # no direct meaning for our model
    # these 6 would represent duplicated information
    "Explained by: Log GDP per capita",
    "Explained by: Social support",
    "Explained by: Healthy life expectancy",
    "Explained by: Freedom to make life choices",
    "Explained by: Generosity",
    "Explained by: Perceptions of corruption",
    "Dystopia + residual" # hard to interpret
])

# Checking for missing values 
missing_values = data.isna().sum()
print("Missing values per column:\n", missing_values)

In [None]:
# Plotting histograms for all features
data.hist(bins=30, figsize=(12, 10))
plt.tight_layout()
plt.show()

# Based on histogram plots most features are not normally distributed
# Therefore, IQR is a more appropriate method for outlier detection than z-score.

def calculate_iqr_bounds(series):
    # Using numpy.quantile 
    q1 = np.quantile(series, 0.25)
    q3 = np.quantile(series, 0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return lower, upper



In [None]:
# Detecting outliers in all columns
outliers_dict = {}

for col in data.columns:
    lower, upper = calculate_iqr_bounds(data[col])
    outliers = data[(data[col] < lower) | (data[col] > upper)][col]
    outliers_dict[col] = outliers

# Number of outliers per column
for col, outliers in outliers_dict.items():
    print(f"{col}: {len(outliers)} outliers")

In [None]:
# Displaying outliers for a single column
# The column "Perceptions of corruption" only has lower outliers, meaning that a few countries have
# unusually low perceived corruption values.
col = "Perceptions of corruption"  # set explicitly instead of relying on the leftover loop variable

plt.figure(figsize=(6, 4))
sns.boxplot(x=data[col])
plt.title(f"Boxplot - {col}")
plt.xlabel(col)
plt.tight_layout()
plt.show()


In [None]:
# Function to cap outliers at the IQR bounds (modifies the dataframe in place)
def clip_outliers_iqr(dataframe):
    for col in dataframe.columns:
        # Check if the column is numeric to avoid errors
        if pd.api.types.is_numeric_dtype(dataframe[col]):
            lower, upper = calculate_iqr_bounds(dataframe[col].to_numpy(dtype='float64', copy=True))
            # Cap outliers (Winsorization)
            dataframe[col] = np.clip(dataframe[col], lower, upper)
    return dataframe

In [None]:
# Applying Winsorization to the dataset
data = clip_outliers_iqr(data)


In [None]:
# Plotting to visually confirm that outliers are capped
plt.figure(figsize=(6, 4))
sns.boxplot(x=data[col])
plt.title(f"Boxplot - {col}")
plt.xlabel(col)
plt.tight_layout()
plt.show()


In [None]:
# Moving on to feature scaling using standardization
from sklearn.preprocessing import StandardScaler

# Both standardization (StandardScaler) and normalization (MinMaxScaler) were tested for feature scaling before clustering.
# While normalization showed a slightly better silhouette score for K = 3,
# standardization produced more consistent and stable results across the full range of cluster values.
# The differences were minor, so I decided to go with standardization.
scaler = StandardScaler()
data[data.columns] = scaler.fit_transform(data)
print(data.describe())

In [None]:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Define the range of clusters to test
# Inertia is equivalent to tot.withinss in R
inertia = [] 
k_range = range(2, 9)

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=1000)
    kmeans.fit(data)
    inertia.append(kmeans.inertia_)

# Plotting the Elbow graph to determine the optimal number of clusters
plt.figure(figsize=(8, 5))
plt.plot(k_range, inertia, marker='o')
plt.title('Elbow Method for Optimal k (k = 2 to 8)')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia (WSS)')
plt.grid(True)
plt.tight_layout()
plt.show()



In [None]:
from sklearn.metrics import silhouette_score
# In this section, I evaluate different values of K using the silhouette score.
# It measures how well each data point fits within its assigned cluster
# compared to other clusters. The score ranges from -1 to 1.
# Score close to 1 indicates that the point is well clustered.
# Score close to 0 means it's on the boundary between two clusters.
# Negative score means it may have been assigned to the wrong cluster.


silhouette_scores = []

# Test values of k from 2 to 8
for k in range(2, 9):
    kmeans = KMeans(n_clusters=k, max_iter=20, n_init=1000, random_state=4)
    cluster_labels = kmeans.fit_predict(data)
    score = silhouette_score(data, cluster_labels)
    silhouette_scores.append((k, score))


silhouette_df = pd.DataFrame(silhouette_scores, columns=['k', 'silhouette_score'])


print(silhouette_df)

# Plotting silhouette scores
plt.figure(figsize=(8, 5))
plt.plot(silhouette_df['k'], silhouette_df['silhouette_score'], marker='o')
plt.title('Silhouette Score vs. Number of Clusters')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette Score')
plt.grid(True)
plt.show()
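
To make the silhouette definition concrete, the sketch below computes the value for a single observation by hand as `s = (b - a) / max(a, b)`, where `a` is the mean distance to the other members of its own cluster and `b` is the smallest mean distance to any other cluster. The helper `silhouette_of_point` and the four-point dataset are hypothetical and independent of the happiness data.

```python
import numpy as np

def silhouette_of_point(X, labels, i):
    """Silhouette value s = (b - a) / max(a, b) for observation i."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(X - X[i], axis=1)  # Euclidean distances to point i
    same = labels == labels[i]
    same[i] = False  # exclude the point itself from its own cluster
    if not same.any():
        return 0.0  # common convention for single-member clusters
    a = dists[same].mean()
    # Smallest mean distance to the members of any other cluster
    b = min(dists[labels == k].mean() for k in np.unique(labels) if k != labels[i])
    return (b - a) / max(a, b)

# Two pairs of points, five units apart (illustrative data)
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
labels = np.array([0, 0, 1, 1])

s0 = silhouette_of_point(X, labels, 0)
print(f"silhouette of the first point: {s0:.3f}")
```

Averaging this value over all observations gives the overall score that `silhouette_score` reports.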


In [None]:
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Fitting KMeans with K = 2
kmeans_2 = KMeans(n_clusters=2, n_init=100, max_iter=20, random_state=42)
clusters_2 = kmeans_2.fit_predict(data)

# Fitting KMeans with K = 3
kmeans_3 = KMeans(n_clusters=3, n_init=100, max_iter=20, random_state=42)
clusters_3 = kmeans_3.fit_predict(data)

# Projecting the data onto the first two principal components for 2D plotting
pca = PCA(n_components=2)
pca_result = pca.fit_transform(data)


# Plotting side by side
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot for K = 2
axes[0].scatter(pca_result[:, 0], pca_result[:, 1], c=clusters_2, cmap='Set1', s=60)
axes[0].set_title('Clusters K=2')
axes[0].set_xlabel('PC1')
axes[0].set_ylabel('PC2')
axes[0].grid(True)

# Plot for K = 3
axes[1].scatter(pca_result[:, 0], pca_result[:, 1], c=clusters_3, cmap='Set1', s=60)
axes[1].set_title('Clusters K=3')
axes[1].set_xlabel('PC1')
axes[1].set_ylabel('PC2')
axes[1].grid(True)

plt.tight_layout()
plt.show()

# Based on the PCA plots, K = 2 appears to provide slightly more distinct cluster separation compared to K = 3.



In [None]:
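# Clustering using the chosen number of clusters (K = 2)
kmeans_final = KMeans(n_clusters=2, n_init=100, max_iter=20, random_state=42)
# Fit on the feature columns only, so that re-running this cell does not feed 'Cluster' back in as a feature
feature_cols = [c for c in data.columns if c != 'Cluster']
# fit_predict fits the KMeans model and returns the cluster label (0 or 1) for each row
data['Cluster'] = kmeans_final.fit_predict(data[feature_cols])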

cluster_summary = data.groupby('Cluster').mean()
print(cluster_summary)


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(14, 6))
sns.heatmap(cluster_summary, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Cluster Profile Heatmap')
plt.ylabel('Cluster')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()

plt.show()



### Cluster interpretation

To better understand the nature of the clusters formed via KMeans, I calculated the average z-scored values of the key happiness-related features in each cluster and visualized them using a heatmap.
A z-score measures a value's deviation from the mean, expressed in units of standard deviation.
Z-scores allow us to compare variables on the same scale even if their original units differ.
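
As a small worked example of the z-score itself, the snippet below standardizes four made-up values. It uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses; the numbers are illustrative and unrelated to the dataset.

```python
import numpy as np

values = np.array([2.0, 4.0, 6.0, 8.0])  # illustrative numbers only

# z = (value - mean) / standard deviation, with the population std (ddof=0)
z = (values - values.mean()) / values.std()

print("z-scores:", z.round(2))
print("mean of z:", round(float(z.mean()), 6), "std of z:", round(float(z.std()), 6))
```

After standardization the values have mean 0 and standard deviation 1, so a cluster average of, say, +0.5 reads as "half a standard deviation above the overall mean".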

Cluster 0 shows positive z-scores for:
- Logged GDP per capita (+0.57)
- Social support (+0.57)
- Healthy life expectancy (+0.59)
- Freedom to make life choices (+0.35)

These values indicate that countries in Cluster 0 tend to have stronger economic indicators, better health, and more freedom than average.

Cluster 1 shows negative z-scores for:
- Logged GDP per capita (-1.06)
- Social support (-1.06)
- Healthy life expectancy (-1.10)
- Freedom to make life choices (-0.65)

This suggests that Cluster 1 includes countries with weaker economic and health-related metrics, while they display greater generosity and higher trust in institutions.
