<div style="background-color: #333; padding: 40px; border: 2px solid #ffd700; border-radius: 10px; color: #ffd700; text-align: center; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.2);">

<h1 style="font-size: 48px; font-weight: bold; color: #ffd700;">Country data</h1>

<img src="https://t3.ftcdn.net/jpg/06/16/14/20/360_F_616142053_gKBkSdbs1JvdeQTS4X2mK6gqbxavxMuu.jpg" alt="Movie Reel" style="width: 500px; margin: 20px auto; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.2);">
    
</div>

<div style="border-radius: 10px; border: 2px solid #ffd700; padding: 15px; background-color: #333; font-size: 180%; text-align: center; color: #ffd700; font-weight: bold;"> Table of Contents 
</div>

<ul class="list-group" id="list-tab" role="tablist">
    <li><a href="#1.-Import-Libraries">1. Import Libraries</a></li><br>
    <li><a href="#2.-Load-data">2. Load data</a></li><br>
    <li><a href="#3.-Exploratory-Data-Analysis">3. Exploratory Data Analysis</a></li><br>
    <li><a href="#4.-Clustering-model">4. Clustering model</a></li><br>
</ul>

## <div style="border-radius: 10px; border: 2px solid #ffd700; padding: 15px; background-color: #333; font-size: 120%; text-align: center; color: #ffd700; font-weight: bold;">1. Import Libraries</div>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import warnings

from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA, IncrementalPCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

## <div style="border-radius: 10px; border: 2px solid #ffd700; padding: 15px; background-color: #333; font-size: 120%; text-align: center; color: #ffd700; font-weight: bold;">2. Load data</div>

In [None]:
df_country = pd.read_csv("/kaggle/input/unsupervised-learning-on-country-data/Country-data.csv")
df_country.head()

In [None]:
df_country.info()

In [None]:
df_country.describe()

## <div style="border-radius: 10px; border: 2px solid #ffd700; padding: 15px; background-color: #333; font-size: 120%; text-align: center; color: #ffd700; font-weight: bold;">3. Exploratory Data Analysis</div>

## <div style="border-radius: 10px; border: 2px solid #333; padding: 15px; background-color: #ffd700; font-size: 120%; text-align: left; color: #333; font-weight: bold;">3.1 Data quality</div>

### I | Check duplicates

In [None]:
duplicates = df_country.duplicated().sum()
print(duplicates)

### II | Check null and missing values

In [None]:
missing_values = df_country.isnull().sum()
total_missing_values = (missing_values).sum()
total_cells = np.prod(df_country.shape)  # np.product was removed in NumPy 2.0
percent_missing_values = (total_missing_values / total_cells)*100
print("Percent of data that is missing", percent_missing_values)
print(missing_values)

### III | Check unique values in each column

In [None]:
for column in df_country.columns:
    num_distinct_values = len(df_country[column].unique())
    print(f"{column}: {num_distinct_values} distinct values")

### IV | Correlation Analysis

In [None]:
numeric_columns = df_country.select_dtypes(include=[np.number])
correlation_matrix = numeric_columns.corr()
correlation_matrix

In [None]:
fig, ax = plt.subplots() 
fig.set_size_inches(15,10)
sns.heatmap(correlation_matrix, vmax =.8, square = True, annot = True,cmap='YlGn' )
plt.title('Correlation Matrix',fontsize=15);

Most strongly correlated feature pairs:
 - income / gdpp : 0.9 -> strong positive correlation
 - child_mort / life_expec : -0.89 -> strong negative correlation
 - total_fer / child_mort : 0.85 -> strong positive correlation
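
The pairs listed above were read off the heatmap. As a minimal sketch, the same ranking can be extracted programmatically by keeping only the upper triangle of the correlation matrix and sorting by absolute value. The helper name `top_correlated_pairs` and the `toy` frame are illustrative assumptions, not part of the dataset; in the notebook the function would be called on `numeric_columns`.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, n=3):
    """Return the n feature pairs with the largest absolute correlation."""
    corr = frame.corr()
    # Keep the strict upper triangle so each pair appears once and the
    # diagonal (always 1.0) is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    # Sort by magnitude but keep the sign in the returned values.
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Made-up toy data, only to show the call
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 5.9, 8.2, 9.9],
    "c": [5.0, 3.9, 3.1, 2.0, 1.2],
    "d": [1.0, 3.0, 2.0, 5.0, 4.0],
})
top_pairs = top_correlated_pairs(toy, n=3)
print(top_pairs)
```

Sorting on the absolute value keeps strong negative relationships, such as child_mort / life_expec, next to the strong positive ones.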

### V | Normalization

In [None]:
col = list(df_country.columns)
col.remove('country')
categorical_features = ['country']
numerical_features = [*col]

In [None]:
# MinMaxScaler rescales every numerical feature to the [0, 1] range
scaled_values = MinMaxScaler().fit_transform(df_country[numerical_features])
df_scale = pd.DataFrame(scaled_values, columns=numerical_features)
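
With its default `feature_range=(0, 1)`, `MinMaxScaler` maps each column to `(x - min) / (max - min)`. The sketch below reproduces that formula with NumPy on a small made-up array, to make explicit what the scaling step does. The function name `min_max_scale` and the `toy` values are assumptions for illustration only.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column of X to the [0, 1] range."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # A constant column has zero range; divide by 1 so it maps to 0
    # instead of producing NaN.
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range

# Made-up values: two features on very different scales
toy = np.array([[10.0, 200.0],
                [20.0, 400.0],
                [40.0, 1000.0]])
scaled_toy = min_max_scale(toy)
print(scaled_toy)
```

After scaling, both features span the same range, so neither dominates the distance computations used later by PCA and K-means.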

### VI | PCA

In [None]:
pca = PCA(n_components=9).fit(df_scale)
exp = pca.explained_variance_ratio_
print(exp)

In [None]:
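# Plot against 1..n so the x-axis reads as the number of components kept
n_components = np.arange(1, len(exp) + 1)
plt.plot(n_components, np.cumsum(exp), linewidth=2, marker='o', linestyle='--')
plt.title("PCA")
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance ratio')
plt.xticks(n_components)
plt.yticks(np.arange(0.5, 1.05, 0.05))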
plt.show()
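
The cumulative curve can also be turned into a number of components by fixing a variance threshold. The sketch below is one possible way to do that; the 0.95 threshold, the helper names, and the random `toy` data are assumptions, and the notebook itself simply keeps 5 components. In the notebook, `components_needed(exp)` would be applied to the ratios computed above.

```python
import numpy as np

def explained_variance_ratio(X):
    """Share of the total variance carried by each principal component."""
    centered = X - X.mean(axis=0)
    # Squared singular values of the centered data are proportional to
    # the variances along the principal components.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

def components_needed(ratios, threshold=0.95):
    """Smallest number of components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(ratios)
    index = int(np.searchsorted(cumulative, threshold))
    return min(index + 1, len(ratios))

# Synthetic data with features of decreasing spread
rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])
ratios = explained_variance_ratio(toy)
n_keep = components_needed(ratios, threshold=0.95)
print(np.round(np.cumsum(ratios), 3), n_keep)
```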

In [None]:
final_pca = IncrementalPCA(n_components=5).fit_transform(df_scale)
pc = np.transpose(final_pca)

In [None]:
df = pd.DataFrame({
    'PC1':pc[0],
    'PC2':pc[1],
    'PC3':pc[2],
    'PC4':pc[3],
    'PC5':pc[4],
})
df

## <div style="border-radius: 10px; border: 2px solid #333; padding: 15px; background-color: #ffd700; font-size: 120%; text-align: left; color: #333; font-weight: bold;">3.2 Univariate Analysis</div>

In [None]:
warnings.filterwarnings('ignore')

colors = sns.color_palette("husl", len(numerical_features))

fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(15, 15))
# One histogram with a KDE overlay per numerical feature
# (sns.distplot is deprecated; histplot is its replacement)
for ax, feature, color in zip(axes.flat, numerical_features, colors):
    sns.histplot(df_country[feature], kde=True, stat='density', color=color, ax=ax)
    ax.set_title('Distribution : ' + feature)
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.lineplot(data=df_country[['exports', 'imports']], dashes=False, markers=['o', 'o'])

## <div style="border-radius: 10px; border: 2px solid #333; padding: 15px; background-color: #ffd700; font-size: 120%; text-align: left; color: #333; font-weight: bold;">3.3 Bivariate Analysis</div>

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(16, 6))

axes[0].set_title("Countries with highest exports")
sns.barplot(x='country', y='exports', ax=axes[0], data=df_country.sort_values(ascending = False,by = 'exports').iloc[:5], color = 'blue', edgecolor = 'black')

axes[1].set_title("Countries with highest imports")
sns.barplot(x='country', y='imports', ax=axes[1], data=df_country.sort_values(ascending = False,by = 'imports').iloc[:5], color = 'red', edgecolor = 'black')

plt.show()

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(16, 5))

axes[0].set_title("Countries with high income")
sns.barplot(x='country', y='income', ax=axes[0], data=df_country.sort_values(ascending = False,by = 'income').iloc[:5], color = 'blue', edgecolor = 'black')

axes[1].set_title("Countries with high inflation")
sns.barplot(x='country', y='inflation', ax=axes[1], data=df_country.sort_values(ascending = False,by = 'inflation').iloc[:5], color = 'red', edgecolor = 'black')

axes[2].set_title("Countries with high GDP per capita")
sns.barplot(x='country', y='gdpp', ax=axes[2], data=df_country.sort_values(ascending = False,by = 'gdpp').iloc[:5], color = 'yellow', edgecolor = 'black')

plt.tight_layout()

plt.show()

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(16, 8))

axes[0, 0].set_title("Countries with high child mortality")
sns.barplot(x='country', y='child_mort', ax=axes[0, 0], data=df_country.sort_values(ascending = False,by = 'child_mort').iloc[:5], color = 'blue', edgecolor = 'black')

axes[0, 1].set_title("Countries with high life expectancy")
sns.barplot(x='country', y='life_expec', ax=axes[0, 1], data=df_country.sort_values(ascending = False,by = 'life_expec').iloc[:5], color = 'red', edgecolor = 'black')

axes[1, 0].set_title("Countries with high health spending")
sns.barplot(x='country', y='health', ax=axes[1, 0], data=df_country.sort_values(ascending = False,by = 'health').iloc[:5], color = 'yellow', edgecolor = 'black')

axes[1, 1].set_title("Countries with high total fertility rate")
sns.barplot(x='country', y='total_fer', ax=axes[1, 1], data=df_country.sort_values(ascending = False,by = 'total_fer').iloc[:5], color = 'purple', edgecolor = 'black')

plt.tight_layout()

plt.show()

## <div style="border-radius: 10px; border: 2px solid #ffd700; padding: 15px; background-color: #333; font-size: 120%; text-align: center; color: #ffd700; font-weight: bold;">4. Clustering model</div>

k hyperparameter : It defines the number of clusters the data is divided into. To choose k, we use 2 heuristics :
1. Elbow Method : It plots the within-cluster sum of squared errors (inertia) for a range of values of k. Inertia keeps decreasing as k grows, so we look for the "elbow" of the curve: the value of k after which the decrease slows down and becomes roughly linear. That value is taken as a good trade-off between fit and number of clusters.
2. Silhouette Score Method : It evaluates how close each data point is to its own cluster compared with the nearest other cluster. For each point, the score is (b - a) / max(a, b), where a is the mean distance to the other points of its own cluster and b is the mean distance to the points of the nearest other cluster. It ranges from -1 to 1, and the k with the highest average score is selected for modeling.
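
Both quantities can be computed directly from their definitions. The sketch below does so with NumPy on six made-up points, as an illustration of what `KMeans.inertia_` and `silhouette_score` measure rather than a replacement for them. The function names and toy data are assumptions, and `inertia` here uses each cluster's mean as its centroid, which matches K-means at convergence.

```python
import numpy as np

def inertia(X, labels):
    """Sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for cluster in np.unique(labels):
        members = X[labels == cluster]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

def mean_silhouette(X, labels):
    """Average of (b - a) / max(a, b) over all points (Euclidean distance)."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # singleton cluster: score 0 by convention
        # a: mean distance to the other members of the same cluster
        a = dists[i, own].sum() / (own.sum() - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(dists[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

# Two well-separated groups of made-up points
toy = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
good_labels = np.array([0, 0, 0, 1, 1, 1])
bad_labels = np.array([0, 1, 0, 1, 0, 1])

print(inertia(toy, good_labels), mean_silhouette(toy, good_labels))
print(inertia(toy, bad_labels), mean_silhouette(toy, bad_labels))
```

A labeling that follows the real groups gives a low inertia and a silhouette close to 1, while a labeling that mixes the groups gives a high inertia and a much lower silhouette.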

## <div style="border-radius: 10px; border: 2px solid #333; padding: 15px; background-color: #ffd700; font-size: 120%; text-align: left; color: #333; font-weight: bold;">4.1 Elbow method</div>

In [None]:
inertias = []

for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(df)
    inertias.append(kmeans.inertia_)

# Plot the elbow curve
plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.show()

## <div style="border-radius: 10px; border: 2px solid #333; padding: 15px; background-color: #ffd700; font-size: 120%; text-align: left; color: #333; font-weight: bold;">4.2 Silhouette score method</div>

In [None]:
silhouette_scores = []

for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(df)
    silhouette_scores.append(silhouette_score(df, kmeans.labels_))

# Plot silhouette scores
plt.plot(range(2, 11), silhouette_scores, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score for Optimal k')
plt.show()

From the results of the 2 methods above, we select :
**k = 3 clusters**

## <div style="border-radius: 10px; border: 2px solid #333; padding: 15px; background-color: #ffd700; font-size: 120%; text-align: left; color: #333; font-weight: bold;">4.3 K-means model</div>

In [None]:
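# Same random_state as in the k-selection loops, so the clustering is reproducible
kmeans = KMeans(n_clusters=3, random_state=42).fit(df)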
df.insert(0, 'Country', df_country['country'])
df['class'] = kmeans.labels_
df['Label'] = df['class']

In [None]:
# Count the number of countries in each cluster
pd.Series(kmeans.labels_).value_counts()

In [None]:
# Cluster ids are arbitrary, so identify each cluster through a reference country
# (this assumes the three countries fall into three different clusters)
poor = int(df.loc[df.Country == 'Afghanistan', 'class'].iloc[0])
middle = int(df.loc[df.Country == 'Iran', 'class'].iloc[0])
rich = int(df.loc[df.Country == 'Canada', 'class'].iloc[0])

In [None]:
poor_label = 'Poor countries'
middle_label = 'Middle countries'
rich_label = 'Rich countries'

In [None]:
df['Label'] = df['class'].map({poor: poor_label, middle: middle_label, rich: rich_label})
df.head()

In [None]:
fig = px.choropleth(df,
                    locationmode = 'country names',
                    locations = 'Country',
                    color = 'Label',
                    color_discrete_map = {rich_label: 'Green',
                                          middle_label: 'LightBlue',
                                          poor_label: 'Red'}
                   )

fig.update_layout(
        margin = dict(
                l=0,
                r=0,
                b=0,
                t=0,
                pad=2,
            ),
    )
fig.show()
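
Naming clusters through three reference countries works only if those countries end up in different clusters. As a hedged alternative sketch, the clusters can be ordered by the mean of one feature and named in that order. The helper `label_clusters_by_mean`, the choice of feature, and the toy values below are assumptions for illustration; in the notebook one could pass, for example, `df_country['income']` together with `kmeans.labels_`.

```python
import pandas as pd

def label_clusters_by_mean(values, cluster_ids, names):
    """Name clusters in ascending order of the mean of `values`."""
    frame = pd.DataFrame({"value": list(values), "cluster": list(cluster_ids)})
    means = frame.groupby("cluster")["value"].mean().sort_values()
    if len(means) != len(names):
        raise ValueError("need exactly one name per cluster")
    # Lowest cluster mean gets the first name, highest gets the last
    mapping = dict(zip(means.index, names))
    return frame["cluster"].map(mapping), mapping

# Made-up income-like values and arbitrary cluster ids
toy_values = [500, 700, 9000, 11000, 40000, 45000]
toy_clusters = [2, 2, 0, 0, 1, 1]
toy_names = ["Poor countries", "Middle countries", "Rich countries"]

toy_labels, toy_mapping = label_clusters_by_mean(toy_values, toy_clusters, toy_names)
print(toy_mapping)
print(toy_labels.tolist())
```

Ranking by a single feature is itself a simplification, since the clusters were built from five principal components, so the resulting names should be checked against the cluster profiles.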
