## Categorising countries

### Data Source

The data used in this task was originally sourced from Help.NGO, an international non-governmental organisation that specialises in emergency response, preparedness, and risk mitigation.

### Dataset Attributes
- country: name of the country
- child_mort: deaths of children under 5 years of age per 1,000 live births
- exports: exports of goods and services per capita, given as a percentage of GDP per capita
- health: total health spending per capita, given as a percentage of GDP per capita
- imports: imports of goods and services per capita, given as a percentage of GDP per capita
- income: net income per person
- inflation: a measure of the annual growth rate of the total GDP
- life_expec: the average number of years a newborn child would live if current mortality patterns remained the same
- total_fer: the number of children that would be born to each woman if current age-fertility rates remained the same
- gdpp: GDP per capita, calculated as the total GDP divided by the total population
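
Since `exports`, `health` and `imports` are expressed as percentages of GDP per capita, an absolute per-capita amount can be recovered by multiplying by `gdpp`. The sketch below illustrates that arithmetic only; the values are invented, and the `_abs` column names are hypothetical and not part of the dataset.

```python
import pandas as pd

# Invented values for two fictional countries; only the column names follow the dataset.
sample = pd.DataFrame({
    "exports": [10.0, 50.0],   # % of GDP per capita
    "health": [8.0, 5.0],      # % of GDP per capita
    "imports": [40.0, 25.0],   # % of GDP per capita
    "gdpp": [1000, 40000],     # GDP per capita
})

# percentage / 100 * GDP per capita = absolute amount per capita
for col in ["exports", "health", "imports"]:
    sample[f"{col}_abs"] = sample[col] / 100 * sample["gdpp"]

print(sample[["exports_abs", "health_abs", "imports_abs"]])
```

Whether to cluster on the percentages or on absolute amounts is a modelling choice; the rest of this notebook works with the columns as provided.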

## Objective
Group countries using socio-economic and health factors in order to determine each country's development status.

In [None]:
# Import libraries
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning)
import os

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

In [None]:
# Random state seed
rseed = 42

## Load and explore data

In [None]:
# Import the dataset

In [None]:
# Check the shape

In [None]:
# Check datatypes & counts

In [None]:
# Get descriptive statistics

In [None]:
# Identify any missing data

## Preprocessing and Feature Selection

In [None]:
# Drop any non-numeric features (columns)

In [None]:
# Create a correlation map of features to explore relationships between features
# Hint: Explore seaborn heatmap

In [None]:
# Explore the continuous independent features against child_mort using scatter plots.

In [None]:
# Explore the continuous independent features against gdpp using scatter plots.

In [None]:
# Create a pair plot
# Hint: Explore seaborn pairplot

Note the peaks in the diagonal graphs that are distinct from each other or only overlap slightly. Looking at the scatter plot distributions may also give you some indication of features that would be good candidates for clustering the data.

### Scaling the Data
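
Min-max scaling maps every feature onto the range [0, 1] using (x - min) / (max - min), which keeps large-valued features from dominating the Euclidean distances that K-Means relies on. The sketch below checks that formula against `MinMaxScaler` on invented values; the array `raw` is hypothetical and not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented values: column 0 on a small scale, column 1 on a large scale
raw = np.array([[2.0, 500.0],
                [4.0, 20000.0],
                [10.0, 50000.0]])

scaled = MinMaxScaler().fit_transform(raw)

# Same result computed by hand, column by column
manual = (raw - raw.min(axis=0)) / (raw.max(axis=0) - raw.min(axis=0))

print(scaled)
```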

In [None]:
# Normalise the data using MinMaxScaler
# Name the normalised dataframe "df_scaled"


# df_scaled.head()

## K-Means Clustering

### Selecting K

In [None]:
# Plot elbow curve
def eval_Kmeans(x, k, r):
    kmeans = KMeans(n_clusters=k, random_state=r, max_iter=500)
    kmeans.fit(x)
    return kmeans.inertia_

def elbow_Kmeans(x, max_k=10, r=42):
    within_cluster_vars = [eval_Kmeans(x, k, r) for k in range(1, max_k+1)]
    # Use max_k for the x-axis so it always matches the number of inertia values
    plt.plot(range(1, max_k+1), within_cluster_vars, marker='o')
    plt.xlabel('K')
    plt.ylabel('Inertia')
    plt.show()

# Plot elbow curve using scaled dataset

In [None]:
# Silhouette score method
kmax = 10
sil = []
plt.plot()
for k in range(2, kmax+1):
    kmeans = KMeans(n_clusters=k, random_state=rseed, max_iter=500)
    kmeans.fit(df_scaled)
    labels = kmeans.labels_
    sil.append(silhouette_score(df_scaled, labels, metric='euclidean'))

sns.lineplot(x=range(2, kmax+1), y=sil)
plt.title('Silhouette Score Method')
plt.xlabel('k : Number of clusters')
plt.ylabel("Silhouette Score")
plt.grid(visible=True)
plt.show()

Based on the elbow curve and the silhouette scores, choose a value for K.
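
One way to turn the silhouette scores into a choice is to take the K with the highest score. The sketch below does this on synthetic, well-separated blobs rather than the country data; the variable names `X`, `scores` and `best_k` are illustrative. On real data the elbow and silhouette methods may disagree, so treat the result as a starting point rather than a rule.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (not the country dataset)
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=42)

# Silhouette score for each candidate K (the score needs at least 2 clusters)
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Pick the K with the highest silhouette score
best_k = max(scores, key=scores.get)
print(best_k)
```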

## Fitting a K-Means Model with the selected K value

In [None]:
# Remember to set the random_state to rseed

In [None]:
# Count the number of records in each cluster

In [None]:
# Check model performance with the silhouette coefficient

## Predictions

In [None]:
# Add the predicted cluster label column to the original dataframe

## Visualisation of clusters

In [None]:
# Visualisation of clusters: child mortality vs gdpp

In [None]:
# Visualisation of clusters: inflation vs gdpp

## Conclusions

Label the groups of countries in the plots you created based on child mortality, GDPP and inflation. You may use [terms](https://en.wikipedia.org/wiki/Developing_country#Terms_used_to_classify_countries) such as: least developed, developing and developed, or low, low-middle, upper-middle and high income. Alternatively, simply rank them from highest to lowest. Justify the labels you assign to each group.
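
One possible way to derive such labels is to profile each cluster by its feature means and order the clusters by `gdpp`. The sketch below uses invented values, a hypothetical `cluster` column, and assumes exactly three clusters; the label names are just one of the options listed above.

```python
import pandas as pd

# Toy stand-in for the clustered dataframe; values are invented for illustration.
toy = pd.DataFrame({
    "child_mort": [90.0, 110.0, 20.0, 25.0, 4.0, 5.0],
    "gdpp": [500, 700, 6000, 8000, 40000, 50000],
    "cluster": [2, 2, 0, 0, 1, 1],  # hypothetical K-Means labels
})

# Mean of each feature per cluster, ordered from lowest to highest gdpp
profile = toy.groupby("cluster")[["child_mort", "gdpp"]].mean().sort_values("gdpp")

# Assign names in that order (assumes exactly three clusters)
names = ["least developed", "developing", "developed"]
label_map = dict(zip(profile.index, names))
toy["group"] = toy["cluster"].map(label_map)

print(profile)
print(toy[["cluster", "group"]])
```

Inspecting `profile` alongside the plots gives the evidence for the justification: a cluster with low `gdpp` and high `child_mort` supports a "least developed" label, and so on.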


**Answer here:**

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np

# Load the data
df = pd.read_csv('Country-data.csv')
df.head()


In [None]:
# Check for missing values
df.isnull().sum()


In [None]:
# Data Preprocessing: Drop non-numeric column 'country'
df.drop('country', axis=1, inplace=True)


In [None]:
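# Feature Scaling
# K-Means uses Euclidean distances, so large-valued features such as income and gdpp
# dominate the clustering when the data is unscaled. Scaling (e.g. with MinMaxScaler)
# is therefore recommended; note that the cells below use the unscaled values.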


In [None]:
# K-Means Clustering - Choosing the number of clusters (K) using the Elbow method
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def eval_Kmeans(x, k, r):
    kmeans = KMeans(n_clusters=k, random_state=r)
    kmeans.fit(x)
    return kmeans.inertia_

def elbow_Kmeans(x, max_k=10, r=123):
    within_cluster_vars = [eval_Kmeans(x, k, r) for k in range(1, max_k + 1)]
    plt.plot(range(1, max_k + 1), within_cluster_vars, marker='o')
    plt.xlabel('K')
    plt.ylabel('Inertia')
    plt.show()

elbow_Kmeans(df.values)


In [None]:
# K-Means Clustering - Choosing the number of clusters (K) using the Silhouette Score method
import seaborn as sns
from sklearn.metrics import silhouette_score

kmax = 10
sil = []
plt.plot()
for k in range(2, kmax + 1):
    kmeans = KMeans(n_clusters=k, random_state=0)
    kmeans.fit(df.values)
    labels = kmeans.labels_
    sil.append(silhouette_score(df.values, labels, metric='euclidean'))

sns.lineplot(x=range(2, kmax + 1), y=sil)
plt.title('Silhouette Score Method')
plt.xlabel('k : Number of clusters')
plt.ylabel("Silhouette Score")
plt.grid(visible=True)
plt.show()


In [None]:
# K-Means Clustering - Fit a K-Means model and visualise the outcome
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score

def scatter_Kmeans(x, k, r=123):
    # Cluster on all columns of x, but plot only the first two columns
    kmeans = KMeans(n_clusters=k, init='k-means++', random_state=r)
    y_pred = kmeans.fit_predict(x)
    colours = 'rbgcmy'  # one colour per cluster, so k must be at most 6
    for c in range(k):
        # Points assigned to cluster c, followed by that cluster's centre
        plt.scatter(x[y_pred == c, 0], x[y_pred == c, 1], c=colours[c], label=f'Cluster {c}')
        plt.scatter(kmeans.cluster_centers_[c, 0], kmeans.cluster_centers_[c, 1], marker='x', c='black')

    # The silhouette score is computed on all features, not only the two plotted
    score = round(silhouette_score(x, kmeans.labels_, metric='euclidean'), 2)
    plt.title(f'Silhouette Score = {score}')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.legend()
    plt.show()

# Scatter plot for K=3
scatter_Kmeans(df.values, 3, r=0)
