## **Exploring Global Country Data and Patterns through K-Means Clustering**

## **LIBRARIES IMPORT**

**Libraries Overview:**

- **Pandas**: Data manipulation and analysis.
- **NumPy**: Mathematical functions and array operations.
- **Seaborn**: Statistical data visualization.
- **Matplotlib**: Creating a wide range of plots.
- **Plotly Express**: Interactive visualizations.
- **StandardScaler**: Data preprocessing for scaling.
- **KMeans**: Clustering algorithm for grouping data.
- **silhouette_score**: Clustering quality evaluation.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

## **DATA IMPORT**
We load two datasets using Pandas: `2016.csv` for happiness scores and `countries of the world.csv` for country attributes. The DataFrames are named `df_happiness_score` and `df_countries`, respectively.





In [None]:
df_happiness_score = pd.read_csv("./data/2016.csv")
df_happiness_score.head()

In [None]:
df_countries = pd.read_csv("./data/countries of the world.csv")
df_countries.head()

## **DATA CLEANING**

We perform several data manipulations on the `df_happiness_score` DataFrame:
- **Drop irrelevant columns:** 'Region', 'Happiness Rank', 'Lower Confidence Interval', 'Upper Confidence Interval'.
- Sort the DataFrame by 'Country' in ascending order.
- **Rename columns for clarity**: 'Economy (GDP per Capita)' to 'Economy', 'Health (Life Expectancy)' to 'Life expectancy', 'Trust (Government Corruption)' to 'Trust', 'Happiness Score' to 'Happiness score', 'Dystopia Residual' to 'Dystopia residual'.
- Reset the DataFrame index.

In [None]:
df_happiness_score.drop(['Region', 'Happiness Rank', 'Lower Confidence Interval', 'Upper Confidence Interval' ], axis=1, inplace=True)
df_happiness_score.sort_values(by=["Country"], inplace=True)
df_happiness_score.rename(columns={'Economy (GDP per Capita)': 'Economy', 'Health (Life Expectancy)': 'Life expectancy', 'Trust (Government Corruption)': 'Trust', 'Happiness Score': 'Happiness score', 'Dystopia Residual': 'Dystopia residual'}, inplace=True)
df_happiness_score.reset_index(drop=True, inplace=True)
df_happiness_score.head()


**Data Column Renaming and Cleaning:**

We streamline column names in the `df_countries` DataFrame using a mapping dictionary. Additionally, we remove leading and trailing blanks from 'Country' and 'Region' columns.

Note: `%0` in a column name means "per 1000".


**Creating the Feature GDP:**

We engineer a new feature 'GDP' by multiplying 'GDP per capita' with 'Population', offering insights into the economic strength of each country.


**NaN Handling and Numeric Conversion:**

For numeric columns, we replace decimal commas with decimal points, convert the values to numbers, and fill missing values with the mean of their respective regions.

In [None]:
columns_to_rename = {
    'Area (sq. mi.)': 'Area',
    'Pop. Density (per sq. mi.)': 'Population density',
    'Coastline (coast/area ratio)': 'Coastline %',
    'Infant mortality (per 1000 births)': 'Infant mortality %0',
    'GDP ($ per capita)': 'GDP per capita',
    'Literacy (%)': 'Literacy %',
    'Phones (per 1000)': 'Phones %0',
    'Arable (%)': 'Arable %',
    'Crops (%)': 'Crops %',
    'Other (%)': 'Other %',
    'Agriculture': 'Agriculture %',
    'Service': 'Service %',
    'Industry': 'Industry %'
}

df_countries.rename(columns = columns_to_rename, inplace=True)

df_countries['Country'] = df_countries['Country'].str.strip()
df_countries['Region'] = df_countries['Region'].str.strip()
df_countries['GDP'] = df_countries['GDP per capita'] * df_countries['Population']

columns = df_countries.columns

for column in columns[2:]:

    df_countries[column] = pd.to_numeric(df_countries[column].replace(',', '.', regex=True))

    region_means = df_countries.groupby('Region')[column].transform('mean')

    df_countries[column] = df_countries[column].fillna(region_means)
    
df_countries.head()
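
The conversion and region-mean imputation described above can be sketched on a toy table. This is only an illustration: the `toy` DataFrame and its values are made up, and it assumes the raw columns hold strings with decimal commas.

```python
import pandas as pd

# Toy data (hypothetical values) mimicking the raw file: decimal commas and one gap
toy = pd.DataFrame({
    "Region": ["A", "A", "B", "B"],
    "Literacy %": ["90,5", None, "60,0", "70,0"],
})

# Swap the decimal comma for a point, then convert the strings to floats
toy["Literacy %"] = pd.to_numeric(
    toy["Literacy %"].str.replace(",", ".", regex=False)
)

# Mean of each region, aligned row by row with the original frame
region_means = toy.groupby("Region")["Literacy %"].transform("mean")

# Fill every missing value with the mean of its own region
toy["Literacy %"] = toy["Literacy %"].fillna(region_means)

print(toy)
```

Using `transform("mean")` rather than `mean()` keeps the result the same length as the DataFrame, which is what lets `fillna` align the replacement values with the rows that need them.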



**Merging DataFrames for Analysis:**

We begin by identifying non-matching countries between the `df_countries` and `df_happiness_score` DataFrames, ensuring data consistency. Subsequently, non-matching countries are removed from `df_countries` to align the datasets.

A left merge operation combines `df_countries` and `df_happiness_score` based on the 'Country' column, enabling comprehensive analysis of country attributes and happiness scores.

For cleaner regional categories, 'NORTHERN AMERICA' and 'OCEANIA' are combined into 'NORTHERN AMERICA & OCEANIA', and 'ASIA (EX. NEAR EAST)' is shortened to 'ASIA'. The consolidation matters because only two 'OCEANIA' countries appear in the happiness score dataset; the others are not present.

With these modifications, the resulting DataFrame provides a unified and enriched dataset for in-depth exploration and insights.


In [None]:
# Find the non-matching countries in 'df_countries'
non_matching_countries = set(df_countries['Country']) - set(df_happiness_score['Country'])

# Remove rows with non-matching countries from 'df_countries'
df_countries = df_countries[~df_countries['Country'].isin(non_matching_countries)]

# Merge the DataFrames 
df_countries = pd.merge(df_countries, df_happiness_score, on='Country', how='left')
df_countries.rename(columns={'Score': 'Happiness score'}, inplace=True)
df_countries['Region'] = df_countries['Region'].replace({'NORTHERN AMERICA': 'NORTHERN AMERICA & OCEANIA',
                                                       'OCEANIA': 'NORTHERN AMERICA & OCEANIA', 'ASIA (EX. NEAR EAST)': 'ASIA'})
df_countries.head()
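
As a complementary check, pandas can report which rows fail to match during a merge. The sketch below uses two made-up tables (`left`, `right`) and the `indicator=True` option of `pd.merge`; it is an illustration of the idea, not the notebook's own method.

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets
left = pd.DataFrame({"Country": ["France", "Japan", "Atlantis"],
                     "Population": [1, 2, 3]})
right = pd.DataFrame({"Country": ["France", "Japan", "Narnia"],
                      "Happiness score": [6.5, 5.9, 7.0]})

# An outer merge with indicator=True adds a '_merge' column telling
# whether each row came from the left table, the right table, or both
check = pd.merge(left, right, on="Country", how="outer", indicator=True)
unmatched = check.loc[check["_merge"] != "both", ["Country", "_merge"]]
print(unmatched)

# An inner merge keeps only the countries present in both tables
merged = pd.merge(left, right, on="Country", how="inner")
print(merged)
```

When country names are unique in both tables, the inner merge gives the same rows as filtering out the non-matching countries first and then doing a left merge.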


## **DATA VISUALIZATION**

**Population Distribution by Region Visualization:**

The `population_distribution_by_region()` function generates a two-part pie chart visualization showcasing the distribution of global population by region. Using Seaborn for styling, the function groups the `df_countries` DataFrame by 'Region' and calculates the sum of populations.

In the resulting pie chart:
- The outer ring illustrates the distribution of population across regions, with labels indicating each region's percentage share.
- The inner circle serves as a visual separator, providing a clean and informative presentation.

This visualization provides a clear snapshot of population distribution, aiding in the understanding of demographic patterns across different regions.


In [None]:
def population_distribution_by_region():

    sns.set(style="whitegrid")
    sns.set_palette("bright")
    region_population = df_countries.groupby('Region')['Population'].sum()

    # One axes is enough: wedgeprops=dict(width=0.3) already leaves the inner hole
    fig, ax = plt.subplots(figsize=(8, 8))

    wedges, labels, _ = ax.pie(region_population, labels=region_population.index, startangle=140,
                               autopct='%d%%', pctdistance=0.85, wedgeprops=dict(width=0.3),
                               textprops={'color': 'white', 'fontweight': 'bold'})

    # Give each region label the color of its wedge (textprops made them white)
    for label, wedge in zip(labels, wedges):
        label.set_color(wedge.get_facecolor())

    ax.axis('equal')
    ax.set_title('Population Distribution by Region', fontweight='bold')

    plt.tight_layout()
    plt.show()
    plt.clf()

population_distribution_by_region()

**Observations:**

In this dataset, 57% of the population lives in Asia, driven largely by India and China. Outside Asia, the population is spread fairly evenly across the other regions.

**Barplot for Regional Statistics:**

The `barplot_stat_per_region()` function generates bar plots to visualize selected statistics across different regions. It accepts a DataFrame, a statistical column (`stat`), and an aggregation function (`estimator_func`), such as mean or median.

For each statistic, the function creates a bar plot with the x-axis representing regions and the y-axis indicating the chosen statistic. Error bars are omitted for clarity. 

Additional visual elements include:
- Rotated x-axis labels for better readability.
- Numerical annotations above each bar displaying the corresponding statistic.

The resulting plots offer valuable insights into how specific statistics vary across different regions, aiding in the exploration of regional patterns.


In [None]:
def barplot_stat_per_region(data, stat, estimator_func):
    
    sns.set(style="whitegrid")
    sns.set_palette("bright")

    fig, ax = plt.subplots(figsize=(12, 10))
    # Pass the estimator through; otherwise seaborn always plots the mean
    sns.barplot(data=data, y=stat, x='Region', ax=ax, estimator=estimator_func, errorbar=None)
    plt.xticks(rotation=90, fontweight = 'bold')
    plt.yticks(fontweight = 'bold')

    plt.xlabel('Region', fontweight='bold')
    plt.ylabel(stat, fontweight='bold')
    plt.title(f'{estimator_func.__name__.capitalize()} {stat} per Region', fontweight='bold')

    for p in ax.patches:
        ax.annotate(format(p.get_height(), '.2f'), 
                    (p.get_x() + p.get_width() / 2., p.get_height()), 
                    ha='center', va='center', 
                    xytext=(0, 9), 
                    textcoords='offset points', fontweight = 'bold')

    plt.tight_layout()  
    plt.show()
    plt.clf()

barplot_stat_per_region(df_countries, 'GDP per capita', np.mean) 
barplot_stat_per_region(df_countries, 'Happiness score', np.mean) 
barplot_stat_per_region(df_countries, 'Literacy %', np.median)

**Observations**: 

We can see that the richest continents are Europe and North America (Oceania looks richer than it is because only Australia and New Zealand are included, while the region's other countries are absent). There is a vast gap between Western and Eastern Europe. The poorest region is Sub-Saharan Africa.

Happiness scores broadly follow GDP per capita: the two are positively correlated, with the richest regions the happiest and the poorest the least happy. Latin American and Caribbean countries score slightly higher than their GDP per capita alone would suggest, perhaps because of their landscapes and climate.

Literacy scores reveal great inequality between regions. Africa has a very low literacy rate, while other continents are much more literate. Even some regions that were not among the top performers in GDP per capita have excellent literacy scores. In Europe, America and Oceania, virtually everyone can read and write.

**Boxplots for Regional Statistics:**

The `boxplot_stat_per_region()` function generates boxplots to visualize the distribution of selected statistics across different regions. It takes a DataFrame and a statistical column (`stat`) as inputs.

For each statistic, the function creates a boxplot with the x-axis representing regions and the y-axis indicating the chosen statistic. The boxes depict the interquartile range (IQR), the whiskers extend to the most extreme values lying within 1.5 × IQR of the box (seaborn's default), and any points beyond the whiskers are drawn as outliers.

Additional visual elements include:
- Rotated x-axis labels for improved legibility.

The resulting boxplots provide a clear depiction of the variability and distribution of chosen statistics across regions, offering valuable insights into regional differences.


In [None]:
def boxplot_stat_per_region(data, stat):
    
    sns.set(style="whitegrid")
    sns.set_palette("bright")

    fig, ax = plt.subplots(figsize=(12, 10))
    sns.boxplot(data=data, y=stat, x='Region', ax=ax)
    plt.xticks(rotation=90, fontweight = 'bold')
    plt.yticks(fontweight = 'bold')

    plt.xlabel('Region', fontweight='bold')
    plt.ylabel(stat, fontweight='bold')
    plt.title(f'Boxplots {stat} per Region', fontweight='bold')

    plt.tight_layout()  
    plt.show()
    plt.clf()

boxplot_stat_per_region(df_countries, 'GDP per capita') 
boxplot_stat_per_region(df_countries, 'Happiness score') 
boxplot_stat_per_region(df_countries, 'Literacy %')
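
To make the boxplot elements concrete, here is a small sketch of how the quartiles, whiskers and outliers can be computed by hand. The `values` array is made up for illustration, and the 1.5 × IQR rule is the default convention assumed here.

```python
import numpy as np

# Hypothetical sample with one extreme value
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])

# The box spans the first to the third quartile
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences: points beyond them are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations still inside the fences
whisker_low = values[values >= lower_fence].min()
whisker_high = values[values <= upper_fence].max()
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"box: [{q1}, {q3}], whiskers: [{whisker_low}, {whisker_high}], outliers: {outliers}")
```

This is why a region can show a short box and whiskers while still having isolated points far away: those points fall outside the fences and are drawn individually.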

**Observations**

We can see that the distribution differs from region to region: in the Near East, GDP per capita is very widely spread, while it is very narrow in sub-Saharan Africa and the Baltic region. This graph gives an even poorer picture of Africa, because the vast majority of its countries have a very low GDP per capita, with no extreme high values.

Happiness scores are broadly distributed, except in the Baltic and North American regions, probably because these regions contain few countries. We see many extreme high and low values.

The distribution of literacy scores is very uneven. Western countries all have excellent literacy levels, but elsewhere the spread is extremely wide: some countries have very low literacy even though their region's overall score is much higher. For example, the average literacy level in sub-Saharan Africa is 60%, but one country has a score below 20%.

**Scatter Plots for Statistical Relationships:**

The `plot_stats()` function generates scatter plots to explore relationships between selected statistics. It accepts a DataFrame, an x-axis variable (`x`), and up to three y-axis variables (`y1`, `y2`, `y3`).

For each scatter plot:
- The x-axis represents the chosen variable `x`.
- The y-axis displays the selected statistic(s) (`y1`, `y2`, `y3`).

These scatter plots enable the investigation of potential relationships and correlations among various statistics, contributing to a deeper understanding of the data.


In [None]:
def plot_stats(data, x, y1, y2=None, y3=None):

    plt.figure(figsize=(10, 5))
    sns.set_palette("bright")

    # Plot every requested statistic on the same axes, labelled for the legend
    y_columns = [y for y in (y1, y2, y3) if y]
    for y in y_columns:
        sns.scatterplot(data=data, y=y, x=x, label=y)

    joined = ' & '.join(y_columns)
    plt.ylabel(joined, fontweight='bold')
    plt.title(f'{joined} as a function of {x}', fontweight='bold')

    plt.xticks(fontweight='bold')
    plt.yticks(fontweight='bold')
    plt.xlabel(x, fontweight='bold')
    plt.legend()

    plt.show()
    plt.clf()


plot_stats(df_countries, 'GDP per capita', 'Happiness score')
plot_stats(df_countries, 'GDP per capita','Phones %0')
plot_stats(df_countries, 'Infant mortality %0', 'GDP per capita')
plot_stats(df_countries, 'GDP per capita', 'Service %', 'Industry %', 'Agriculture %')

**Observations**

We can see that the happiness score is positively correlated with GDP per capita. The relationship between the two seems more exponential than linear: the happiness score increases very quickly as GDP per capita increases.

The relationship between the number of phones per 1000 inhabitants and GDP per capita seems linear, which is logical because phone ownership is a direct indicator of wealth.

Infant mortality falls sharply as a country's wealth grows. As with the happiness score, the link appears to be exponential rather than linear, thanks in particular to the higher quality of hospitals, doctors, etc. in richer countries.

In terms of economic sectors, the poorest countries are mainly agricultural. Middle-income countries are more industrial, while the richer countries invest more in services such as health, transport, etc.
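
One rough way to check whether a relationship is closer to exponential than linear is to compare the Pearson correlation before and after taking the logarithm of one variable. The sketch below uses synthetic data (the `gdp` and `mortality` arrays are generated, not taken from the dataset), so it only illustrates the idea.

```python
import numpy as np

# Synthetic, noise-free data: an exponential decay of mortality with GDP per capita
gdp = np.linspace(500, 40000, 50)
mortality = 120 * np.exp(-gdp / 8000)

# Pearson correlation on the raw values measures how linear the link is
r_linear = np.corrcoef(gdp, mortality)[0, 1]

# If the link is exponential, log(mortality) is a straight line in gdp
r_log = np.corrcoef(gdp, np.log(mortality))[0, 1]

print(f"raw: {r_linear:.3f}, after log transform: {r_log:.3f}")
```

A correlation that moves noticeably closer to -1 or 1 after the log transform suggests the exponential description fits better than the linear one. On real data the difference will be less clear-cut than in this idealized example.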

**Leading Sector Analysis and Scatter Plot:**

We derive a new column, 'Leading sector', in the `df_countries` DataFrame by identifying the dominant sector (Service, Industry, or Agriculture), i.e. the one with the highest percentage among 'Service %', 'Industry %', and 'Agriculture %'.

Subsequently, we generate a scatter plot of 'GDP per capita' against 'Happiness score', with color-coded points indicating the leading sector. The plot gives a visual exploration of how happiness and GDP per capita relate to the dominant sector of each country.




In [None]:
df_countries['Leading sector'] = df_countries[['Service %', 'Industry %', 'Agriculture %']].idxmax(axis=1)

plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_countries, x='Happiness score', y='GDP per capita',
                hue='Leading sector', palette={'Service %': 'blue', 'Industry %': 'orange', 'Agriculture %': 'green'})
plt.title('Happiness Score vs GDP per Capita with Leading Sector Color', fontweight = 'bold')
plt.xlabel('Happiness Score', fontweight = 'bold')
plt.ylabel('GDP per Capita', fontweight = 'bold')
plt.legend(title='Leading Sector')
plt.show()
plt.clf()
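
The row-wise `idxmax` used to derive the leading sector can be seen on a toy table. The `sectors` DataFrame and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical sector shares for three countries
sectors = pd.DataFrame(
    {
        "Service %": [0.70, 0.30, 0.25],
        "Industry %": [0.25, 0.45, 0.20],
        "Agriculture %": [0.05, 0.25, 0.55],
    },
    index=["Country A", "Country B", "Country C"],
)

# idxmax(axis=1) returns, for each row, the name of the column holding the largest value
leading = sectors.idxmax(axis=1)

# Optional: drop the ' %' suffix to get cleaner category labels
leading_clean = leading.str.replace(" %", "", regex=False)

print(leading_clean)
```

Note that the values returned are column names, which is why the palette in the scatter plot is keyed on 'Service %', 'Industry %' and 'Agriculture %'.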


**Sector Distribution for Top and Bottom GDP per Capita Countries:**

We select the top 10 and bottom 10 countries based on 'GDP per capita' from the `df_countries` DataFrame. Combining these dataframes, we create a `top_bottom_10` dataframe and set 'Country' as the index.

The code generates a stacked bar plot displaying the distribution of sectors (Agriculture, Service, Industry) for the selected countries. Sectors are color-coded, and the stacked bars allow an easy comparison of sector percentages between the top and bottom GDP per capita countries.


In [None]:
top_10 = df_countries.nlargest(10, 'GDP per capita')
bottom_10 = df_countries.nsmallest(10, 'GDP per capita')

top_bottom_10 = pd.concat([top_10, bottom_10])
top_bottom_10.set_index('Country', inplace=True)
colors = ['green', 'blue', 'orange']

# DataFrame.plot creates its own figure, so no separate plt.figure() call is needed
top_bottom_10[['Agriculture %', 'Service %', 'Industry %']].plot(kind='bar', stacked=True, color=colors)
plt.title('Percentage of Sectors for Top 10 and Bottom 10 GDP per capita Countries', fontweight = 'bold')
plt.xlabel('Country', fontweight = 'bold')
plt.ylabel('Percentage', fontweight = 'bold')
plt.legend(title='Sector')
plt.xticks(rotation=90, ha='right', fontweight = 'bold')
plt.tight_layout()
plt.show()
plt.clf()

**Observations**

The charts confirm what we saw earlier: service-dominant countries are in the majority, and rich countries are exclusively service-dominant, while all agriculture-dominant countries are among the poor. It seems that poor countries are forced to turn to agriculture, while countries with great resources naturally invest in services.

**Factors Contributing to Top and Bottom Happiness Score Countries:**

We select the top 10 and bottom 10 countries based on 'Happiness score' from the `df_countries` DataFrame. Combining these dataframes, we create a `top_bottom_10` dataframe and set 'Country' as the index.

The code generates a stacked bar plot illustrating the factors contributing to the happiness scores of the selected top and bottom countries. Factors include 'Economy', 'Family', 'Life expectancy', 'Freedom', 'Trust', 'Generosity', and 'Dystopia residual'. 

This stacked bar plot offers an insightful comparison of factors that play a role in the happiness scores of the top and bottom countries.


In [None]:
top_10 = df_countries.nlargest(10, 'Happiness score')
bottom_10 = df_countries.nsmallest(10, 'Happiness score')

top_bottom_10 = pd.concat([top_10, bottom_10])
top_bottom_10.set_index('Country', inplace=True)

colors = ['blue', 'pink', 'green', 'yellow', 'purple', 'red', 'grey']

# DataFrame.plot creates its own figure, so no separate plt.figure() call is needed
top_bottom_10[['Economy', 'Family', 'Life expectancy',
      'Freedom', 'Trust',
      'Generosity', 'Dystopia residual']].plot(kind='bar', stacked=True, color = colors)


plt.title('Top 10 and Bottom 10 Happiness Score Countries', fontweight = 'bold')
plt.xlabel('Country', fontweight = 'bold')
plt.ylabel('Score', fontweight = 'bold')
plt.legend()
plt.xticks(rotation=90, ha='right', fontweight = 'bold')
plt.tight_layout()
plt.show()
plt.clf()

**Observations**

We can see that the most differentiating factors are trust in government, as poor countries are heavily affected by corruption, and life expectancy, as the inhabitants of poor countries suffer from poor health due to poverty and sometimes conflict.

**Correlation Heatmap of Selected Metrics:**

We create a correlation heatmap to visualize the pairwise relationships between selected metrics from the `df_countries` DataFrame. The heatmap employs a color scale to represent the strength and direction of correlations.

Key features include:
- A square heatmap matrix, with metrics on both the x and y axes.
- Annotated values within each cell, reflecting the correlation coefficient rounded to two decimal places.

The resulting heatmap offers insights into potential correlations among various socio-economic metrics, aiding in the identification of patterns and relationships.


In [None]:
metrics = df_countries.columns[2:28]
plt.figure(figsize=(20,15))
sns.heatmap(data = df_countries[metrics].corr(), annot=True, fmt='.2f')
plt.show()
plt.clf()
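
With this many metrics, the heatmap can be hard to scan. A possible complement is to list the strongest pairs directly from the correlation matrix. The sketch below runs on a small made-up table (`toy`), so the numbers carry no meaning for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen only to produce strong and weak correlations
toy = pd.DataFrame({
    "GDP per capita": [1, 2, 3, 4, 5],
    "Phones %0": [2, 4, 6, 8, 10],
    "Infant mortality %0": [50, 40, 30, 20, 10],
    "Climate": [2, 1, 3, 1, 2],
})
corr = toy.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Order the pairs by absolute strength, strongest first
ranked = pairs.reindex(pairs.abs().sort_values(ascending=False).index)

print(ranked)
```

Sorting on the absolute value keeps strong negative correlations (such as infant mortality against wealth) at the top alongside the strong positive ones.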


**Observations**

The heatmap shows the positive correlations found earlier, such as GDP per capita and Happiness score, GDP and Service, or GDP and Literacy. It also shows the negative correlations found earlier, such as Infant Mortality and GDP per capita, or GDP and Agriculture.

## **DATA PREPROCESSING**

We prepare the data for analysis by removing the 'Country' column and performing one-hot encoding using `pd.get_dummies()` to convert categorical variables into binary columns.

Next, we apply the StandardScaler from scikit-learn to standardize the numerical features. This ensures that all features have a mean of 0 and a standard deviation of 1, reducing the impact of varying scales on the analysis.

The resulting preprocessed and scaled data is ready for further exploration and modeling.

In [None]:
X = df_countries.drop(["Country"], axis=1)
X = pd.get_dummies(X)
scaler = StandardScaler()
X[X.columns] = scaler.fit_transform(X[X.columns])
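
The effect of the two preprocessing steps can be verified on a toy table. This is a minimal sketch with invented values (`toy`), showing the columns created by `pd.get_dummies()` and checking that each scaled column has mean 0 and standard deviation 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one numeric and one categorical column
toy = pd.DataFrame({
    "GDP per capita": [500.0, 5000.0, 30000.0, 45000.0],
    "Region": ["AFRICA", "ASIA", "EUROPE", "EUROPE"],
})

# One-hot encoding: each category becomes its own binary column
X_toy = pd.get_dummies(toy)
print(list(X_toy.columns))

# Standardize every column, including the binary ones
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X_toy), columns=X_toy.columns)

# StandardScaler uses the population standard deviation, hence ddof=0
means = X_scaled.mean()
stds = X_scaled.std(ddof=0)
print(means.round(6), stds.round(6), sep="\n")
```

Note that pandas' `std()` defaults to `ddof=1`, so without `ddof=0` the values would come out slightly above 1 on small samples.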


## **K-MEANS CLUSTERING**

**Elbow Method for Optimal Number of Clusters:**

We employ the elbow method to determine the optimal number of clusters for K-Means clustering. For a range of values from 2 to 19 (inclusive), we fit a K-Means model to the preprocessed data and record the sum of squared distances from each point to its assigned cluster center.

The code generates a scatter plot where the x-axis represents the number of clusters (k) and the y-axis displays the sum of squared distances. The plot often exhibits an "elbow" point, which indicates the optimal number of clusters.

This technique aids in selecting an appropriate number of clusters to best capture the underlying patterns in the data.


In [None]:
sum_squared_distances = []

for k in range(2,20):
    
    model = KMeans(n_clusters=k)
      
    model.fit(X)
    
    sum_squared_distances.append(model.inertia_)

plt.figure()
plt.scatter(x=list(range(2, 20)), y=sum_squared_distances)
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Sum of Squared Distances')
plt.show()
plt.clf()
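
The elbow can also be read numerically by looking at how much the inertia drops each time a cluster is added. The sketch below uses synthetic data from `make_blobs` with four well-separated groups (an assumption made for illustration, not the country data), and fixes `n_init` and `random_state` so the run is repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 true groups (illustration only)
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# Inertia = sum of squared distances of points to their assigned center
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertias[k] = model.inertia_

# Gain obtained by going from k-1 to k clusters
drops = {k: inertias[k - 1] - inertias[k] for k in range(2, 9)}

for k, gain in drops.items():
    print(f"k={k}: inertia drops by {gain:.1f}")
```

The elbow sits at the last k for which the drop is still large. Beyond it, adding clusters only splits groups that already belong together, so the gains become small.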

**Silhouette Scores for Optimal K:**

We calculate Silhouette Scores for a range of cluster numbers (k) from 2 to 19. For each value of k, we fit a K-Means model to the preprocessed data and compute the Silhouette Score, which quantifies how well each data point fits into its assigned cluster while considering the distance to other clusters.

The code generates a line plot, where the x-axis represents the number of clusters (k) and the y-axis displays the Silhouette Score. A higher score indicates better-defined clusters, aiding in the identification of an optimal cluster count.

This approach helps determine the most suitable number of clusters that best captures the inherent structure of the data.


In [None]:
silhouette_scores = []
for k in range(2, 20):
    model = KMeans(n_clusters=k)
    kmeans_labels = model.fit_predict(X)
    score = silhouette_score(X, kmeans_labels)
    silhouette_scores.append(score)

# Plot Silhouette Scores
plt.figure(figsize=(10, 6))
plt.plot([k for k in range(2,20)], silhouette_scores, marker='o')
plt.title('Silhouette Scores for Optimal K')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.show()
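
Picking the best k from the silhouette curve can be done programmatically. This sketch again relies on synthetic `make_blobs` data and fixed seeds, both of which are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data (illustration only)
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

ks = list(range(2, 9))
scores = []
for k in ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    # Mean silhouette over all points; values lie between -1 and 1
    scores.append(silhouette_score(X_demo, labels))

# The k with the highest mean silhouette score
best_k = ks[int(np.argmax(scores))]
print(f"best k by silhouette: {best_k}")
```

The silhouette score needs at least two clusters, which is why the range starts at 2. When several values of k give similar scores, it is reasonable to cross-check with the elbow plot before settling on one.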

**We see that the optimal K is 15. It has the best silhouette score and a low sum of squared distances.**

In [None]:
model = KMeans( n_clusters = 15)
model.fit(X)
model.labels_

**Assigning Clusters and Displaying Countries:**

We assign clusters to countries using the `model.labels_` attribute from the K-Means model. Each country is assigned to a specific cluster.

For each cluster (from 0 to 14, since `n_clusters = 15`), the code iterates through and displays the countries belonging to that cluster. This provides a clear understanding of the countries grouped together within each cluster.

The output showcases the distribution of countries among clusters, aiding in the interpretation and analysis of the clustering results.


In [None]:
df_countries['Cluster'] = model.labels_
for k in range(model.n_clusters):
    print(f"Cluster number {k}")
    print(df_countries[df_countries['Cluster'] == k]['Country'])
    print("-"*20)
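
Beyond listing the members, a per-cluster summary helps interpret what each group has in common. The sketch below works on a made-up table (`toy`) with an already assigned 'Cluster' column.

```python
import pandas as pd

# Hypothetical countries with cluster labels already assigned
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "GDP per capita": [40000, 38000, 1500, 1200, 9000],
    "Happiness score": [7.2, 7.0, 4.1, 3.9, 5.5],
    "Cluster": [0, 0, 1, 1, 2],
})

# Average profile of each cluster
profile = toy.groupby("Cluster")[["GDP per capita", "Happiness score"]].mean()

# Number of countries per cluster, ordered by cluster label
sizes = toy["Cluster"].value_counts().sort_index()

# Members of each cluster as plain lists
members = toy.groupby("Cluster")["Country"].apply(list)

print(profile, sizes, members, sep="\n\n")
```

Comparing the cluster means against the overall means is usually a quicker route to a description such as "high GDP per capita, high happiness" than reading the country lists alone.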

**Choropleth Map of Clustered Countries:**

We use the `iso_codes` DataFrame to create a mapping between country names and their ISO codes. This mapping facilitates the visualization of clustered countries on a choropleth map.

The code employs Plotly Express (`px.choropleth`) to generate the choropleth map. The map displays countries based on their ISO codes, with color-coding representing the assigned cluster. Hovering over a country reveals the country name, and the color scale enhances visual differentiation of clusters.



In [None]:
iso_codes = pd.read_csv("./data/countries_continents_codes_flags_url.csv")
iso_country_mapping = iso_codes.set_index('country')['alpha-3'].to_dict()

df_countries['ISO code'] = df_countries['Country'].map(iso_country_mapping)
fig = px.choropleth(
    df_countries,
    locations="ISO code",
    color="Cluster",
    hover_name="Country",
    color_continuous_scale=  px.colors.sequential.Turbo 
)

fig.show()
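
Country names rarely match perfectly between datasets, and any country without an ISO code is silently left off the map. A possible check is sketched below. The `iso_map` and `overrides` dictionaries are hypothetical examples, not the content of the CSV file.

```python
import pandas as pd

# Hypothetical subset of a name-to-ISO mapping
iso_map = {"France": "FRA", "Japan": "JPN"}

toy = pd.DataFrame({"Country": ["France", "Japan", "Korea, South"]})
toy["ISO code"] = toy["Country"].map(iso_map)

# Names absent from the mapping come back as NaN
missing = toy.loc[toy["ISO code"].isna(), "Country"].tolist()
print("No ISO code for:", missing)

# Manual overrides for names spelled differently in the two sources
overrides = {"Korea, South": "KOR"}
toy["ISO code"] = toy["ISO code"].fillna(toy["Country"].map(overrides))

print(toy)
```

Running the check before plotting makes it clear whether blank areas on the map reflect countries missing from the data or merely a naming mismatch.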



## **Analysis of Clustered Countries:**

The K-Means clustering algorithm has grouped the countries into distinct clusters based on their socio-economic attributes. Below is an analysis of the countries within each cluster:

Cluster 0: This cluster comprises countries with relatively high GDP per capita and happiness scores, including Cyprus, Greece, Italy, Japan, Spain, and Portugal. These countries exhibit a balanced mix of economic and social indicators.

Cluster 1: Predominantly consisting of Latin American countries like Argentina, Brazil, and Mexico, this cluster includes countries with varying levels of GDP per capita and happiness scores.

Cluster 2: Comprising countries from the Eastern European region, this cluster includes Armenia, Azerbaijan, Russia, and Ukraine, among others. These countries show diverse economic and social profiles.

Cluster 3: Encompassing Central European countries such as Poland, Hungary, and Romania, this cluster consists of nations with relatively higher GDP per capita and happiness scores.

Cluster 4: This small cluster includes African countries like Angola and Nigeria, indicating lower GDP per capita and happiness scores.

Cluster 5: Encompassing countries such as Bangladesh, Haiti, and Pakistan, this cluster represents nations with lower GDP per capita and happiness scores.

Cluster 6: Dominated by African nations, this cluster includes countries with varying levels of GDP per capita and happiness scores.

Cluster 7: Consisting of Hong Kong and Singapore, this cluster represents highly developed Asian economies with high GDP per capita and happiness scores.

Cluster 8: Encompassing Western European nations like Germany, France, and the United Kingdom, this cluster represents countries with high GDP per capita and happiness scores.

Cluster 9: This cluster includes African nations like Burundi, Liberia, and Rwanda, with lower GDP per capita and happiness scores.

Cluster 10: Comprising China and India, this cluster represents two populous Asian countries with significant economic influence.

Cluster 11: Representing highly developed English-speaking nations, this cluster includes Australia, Canada, New Zealand, and the United States.

Cluster 12: Encompassing countries such as Indonesia, Iran, and the Philippines, this cluster represents nations with varying levels of GDP per capita and happiness scores.

Cluster 13: This cluster includes Middle Eastern countries like Saudi Arabia, Qatar, and the United Arab Emirates, with higher GDP per capita and happiness scores.

Cluster 14: Encompassing North African nations like Egypt and Algeria, this cluster represents countries with varying levels of GDP per capita and happiness scores.

Cluster 15: There is no such cluster. With `n_clusters = 15`, K-Means assigns labels 0 through 14, so any output printed for index 15 is necessarily empty; it does not indicate countries that failed to fit a cluster.

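Note that the cluster numbering (and possibly the cluster membership) changes every time we run the model, because K-Means starts from randomly chosen centroids and no `random_state` is set.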
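
If stable cluster labels are wanted across runs, one option is to fix the random seed. The sketch below demonstrates this on synthetic `make_blobs` data; the seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (illustration only)
X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# Two independent fits with the same seed and the same number of restarts
labels_a = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_demo)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_demo)

# With a fixed random_state the centroid initialization is identical,
# so both runs assign the same label to every point
same = np.array_equal(labels_a, labels_b)
print("identical labels:", same)
```

Fixing the seed makes the numbering repeatable, but it does not make one particular solution more correct than another. Checking that the groupings stay similar across a few different seeds remains a useful sanity test.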