# <a name="0">Hierarchical Clustering</a>

1. <a href="#1">Read the dataset</a>
2. <a href="#2">Data investigation</a>
3. <a href="#3">Data preprocessing</a>
4. <a href="#4">Features transformation</a>
5. <a href="#5">Training and hyperparameter tuning</a>
6. <a href="#6">Improvement ideas</a>



In [118]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
from sklearn.impute import SimpleImputer


## 1. <a name="1">Read the dataset</a>
(<a href="#0">Go to top</a>)

First download the dataset from this link https://www.kaggle.com/fernandol/countries-of-the-world
and then import it in Python.

In [151]:
#read the data
pd.options.display.float_format = '{:,}'.format

data_path = 'countries of the world.csv'  #the path where you downloaded the data
df = pd.read_csv(data_path)

print('The shape of the dataset is:', df.shape)

# Renaming columns
df.rename(columns={
    "Area (sq. mi.)": "Area",
    "Pop. Density (per sq. mi.)": "PopDensity",
    "Coastline (coast/area ratio)": "Coastline",
    "Net migration": "NetMigration",
    "Infant mortality (per 1000 births)": "InfantMortality",
    "GDP ($ per capita)": "GDP",
    "Literacy (%)": "Literacy",
    "Phones (per 1000)": "Phones",
    "Arable (%)": "Arable",
    "Crops (%)": "Crops",
    "Other (%)": "Other",
}, inplace=True)

# Drop the single row whose GDP is missing (copy to avoid chained-assignment issues later)
df = df[df["GDP"].notna()].copy()

# Getting num cols
num_cols = ['PopDensity','Coastline','NetMigration','InfantMortality','Literacy','Phones','Arable','Crops','Other','Climate','Birthrate','Deathrate','Agriculture','Industry','Service']

# Convert decimal commas to decimal points and cast to float
for col in num_cols:
    df[col] = df[col].str.replace(',', '.', regex=False).astype(float)

# Count rows by number of missing values:
# column 0 is the number of missing values in a row, column 1 is how many rows have that count
unique, counts = np.unique(df.isna().sum(axis=1), return_counts=True)
np.asarray((unique, counts)).T
# array([[  0, 179],
#        [  1,  28],
#        [  2,   1],
#        [  3,   8],
#        [  4,   7],
#        [  5,   1],
#        [  7,   2]], dtype=int64)

# Using this info, drop every row with four or more missing values
# (thresh is the minimum number of non-missing values a row must have to be kept)
df = df.dropna(thresh = df.shape[1] - 3)

# Fill the remaining missing values in the numeric columns with the column mean
fill_NaN = SimpleImputer(missing_values=np.nan, strategy='mean')
df_nums = df[num_cols]
imputed_DF = pd.DataFrame(fill_NaN.fit_transform(df_nums))
imputed_DF.columns = df_nums.columns
imputed_DF.index = df_nums.index

df[num_cols] = imputed_DF


The shape of the dataset is: (227, 20)
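
The cell above combines two steps that are easy to misread: `dropna(thresh=...)` counts non-missing values, not missing ones, and `SimpleImputer` then fills whatever is left with the column mean. The sketch below replays both steps on a small hypothetical frame (`toy` and its values are made up for illustration, not taken from the dataset).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with 4 columns; rows have 0, 1 and 3 missing values
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan],
    "b": [4.0, np.nan, np.nan],
    "c": [7.0, 8.0, np.nan],
    "d": [10.0, 11.0, 12.0],
})

# Number of missing values in each row
missing_per_row = toy.isna().sum(axis=1)

# thresh is the minimum number of NON-missing values a row needs to survive.
# With 4 columns, thresh = 4 - 1 keeps rows that have at most 1 missing value.
kept = toy.dropna(thresh=toy.shape[1] - 1)

# Mean imputation of the values that are still missing
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(kept),
                      columns=kept.columns, index=kept.index)

print(missing_per_row.tolist())
print(filled)
```

Restoring `columns` and `index` after `fit_transform` matters because the imputer returns a plain NumPy array; without them the assignment back into the original frame would not align.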


## 2. <a name="2">Data investigation</a>
(<a href="#0">Go to top</a>)

In this part you need to check the data quality and assess any issues in the data, such as:
- null values in each column
- whether each column has the proper data type
- outliers
- duplicate rows
- the distribution of each column (skewness)
<br>

**Comment on each issue you find.**
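
As a starting point, the checklist above could be turned into a handful of pandas calls. This is only a sketch on hypothetical data (`toy` is invented here), and the 1.5 × IQR rule is one common convention for flagging outliers, not the only option.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "x": [1.0, 2.0, 2.0, 3.0, 100.0, np.nan],
    "y": [5.0, 6.0, 6.0, 7.0, 8.0, 9.0],
})

null_counts = toy.isna().sum()                # null values in each column
dtypes = toy.dtypes                           # data type of each column
n_duplicates = int(toy.duplicated().sum())    # fully duplicated rows
skewness = toy.skew()                         # positive means a long right tail

# Outliers with the 1.5 * IQR rule (an assumed convention)
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
outlier_mask = (toy < q1 - 1.5 * iqr) | (toy > q3 + 1.5 * iqr)
n_outliers = outlier_mask.sum()

print(null_counts, dtypes, n_duplicates, skewness, n_outliers, sep="\n\n")
```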

In [16]:
# Let's see the data types and non-null values for each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 227 entries, 0 to 226
Data columns (total 20 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   Country                             227 non-null    object 
 1   Region                              227 non-null    object 
 2   Population                          227 non-null    int64  
 3   Area (sq. mi.)                      227 non-null    int64  
 4   Pop. Density (per sq. mi.)          227 non-null    object 
 5   Coastline (coast/area ratio)        227 non-null    object 
 6   Net migration                       224 non-null    object 
 7   Infant mortality (per 1000 births)  224 non-null    object 
 8   GDP ($ per capita)                  226 non-null    float64
 9   Literacy (%)                        209 non-null    object 
 10  Phones (per 1000)                   223 non-null    object 
 11  Arable (%)                          225 non-n

Only two columns are genuinely strings (`Country` and `Region`), but we can't see that in the data types of the `info()` output: the numeric columns written with a decimal comma are also reported as `object`.
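
The numeric columns in this file use a comma as the decimal separator, which is why the loading cell in section 1 replaces commas before casting to float. The sketch below shows the effect on a hypothetical two-row CSV (`raw` is made up) and an alternative: the `decimal` parameter of `pd.read_csv`, which parses such values while loading.

```python
import io
import pandas as pd

# Hypothetical two-row CSV imitating the dataset's decimal commas
raw = 'Country,Literacy (%)\nA,"36,0"\nB,"86,5"\n'

# Default parsing: the column is read as text, not as a number
as_text = pd.read_csv(io.StringIO(raw))
print(as_text.dtypes)

# Option 1: convert after loading, as done in section 1
converted = (as_text["Literacy (%)"]
             .str.replace(",", ".", regex=False)
             .astype(float))

# Option 2: let read_csv interpret the comma as the decimal separator
as_float = pd.read_csv(io.StringIO(raw), decimal=",")
print(as_float.dtypes)
```

Option 2 avoids the conversion loop altogether, at the cost of applying the same decimal convention to every column in the file.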

In [7]:
df.shape[0]

227

In [8]:
df.isna().sum()

Country                                0
Region                                 0
Population                             0
Area (sq. mi.)                         0
Pop. Density (per sq. mi.)             0
Coastline (coast/area ratio)           0
Net migration                          3
Infant mortality (per 1000 births)     3
GDP ($ per capita)                     1
Literacy (%)                          18
Phones (per 1000)                      4
Arable (%)                             2
Crops (%)                              2
Other (%)                              2
Climate                               22
Birthrate                              3
Deathrate                              4
Agriculture                           15
Industry                              16
Service                               15
dtype: int64

In [4]:
round(df.isnull().sum(axis=0)*100/df.shape[0],2)

Country                               0.00
Region                                0.00
Population                            0.00
Area (sq. mi.)                        0.00
Pop. Density (per sq. mi.)            0.00
Coastline (coast/area ratio)          0.00
Net migration                         1.32
Infant mortality (per 1000 births)    1.32
GDP ($ per capita)                    0.44
Literacy (%)                          7.93
Phones (per 1000)                     1.76
Arable (%)                            0.88
Crops (%)                             0.88
Other (%)                             0.88
Climate                               9.69
Birthrate                             1.32
Deathrate                             1.76
Agriculture                           6.61
Industry                              7.05
Service                               6.61
dtype: float64

In [5]:
# This will print basic statistics for numerical columns
df.describe()

         Population  Area (sq. mi.)  GDP ($ per capita)
count         227.0           227.0               226.0
mean     28740280.0        598227.0         9689.823009
std     117891300.0       1790282.0        10049.138513
min          7026.0             2.0               500.0
25%        437624.0          4647.5              1900.0
50%       4786994.0         86600.0              5550.0
75%      17497770.0        441811.0             15700.0
max    1313974000.0      17075200.0             55100.0


## 3. <a name="3">Data preprocessing</a>
(<a href="#0">Go to top</a>)


### Define below all the issues that you found in the previous part
1-           <br>
2-           <br>
3-           <br>

In [None]:
# Make a copy of the original dataset
df_copy=df.copy()

### For each issue, adopt this methodology
- start by defining the solution
- apply this solution to the data
- test the solution to make sure that you have solved the issue
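
To make the define / apply / test pattern concrete, here is a minimal sketch for one hypothetical issue (duplicate rows). The frame `toy` and the issue itself are assumptions chosen for illustration; the real issues are the ones you listed above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one duplicated row
toy = pd.DataFrame({
    "Country": ["A", "B", "B", "C"],
    "GDP": [500.0, 1900.0, 1900.0, np.nan],
})
toy_copy = toy.copy()  # work on a copy, keep the original intact

# Solution: drop fully duplicated rows and rebuild the index
fixed = toy_copy.drop_duplicates().reset_index(drop=True)

# Test: no duplicates remain, and the original frame is unchanged
assert fixed.duplicated().sum() == 0
assert len(fixed) == 3
assert len(toy) == 4
```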

**First issue**

In [None]:
#solution 


In [None]:
#test 


**Second issue**

In [None]:
#solution 


In [None]:
#test 


## 4. <a name="4">Features transformation</a>
(<a href="#0">Go to top</a>)

*Which feature scaling technique would you use, and why?* <br>
*Return to this section later, try another technique, and see how it affects your results.*<br>
For more details on the different scaling methods, check these links:
- https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
- https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing
- https://www.analyticsvidhya.com/blog/2020/07/types-of-feature-transformation-and-scaling/
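
To get a feel for how the choice of scaler matters, the sketch below applies three scikit-learn scalers to a single hypothetical feature containing one extreme value (the array `X` is invented for illustration). It is not a recommendation of any one scaler for this dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Hypothetical feature with one extreme value
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(X)  # zero mean, unit variance
minmax = MinMaxScaler().fit_transform(X)      # rescaled to [0, 1]
robust = RobustScaler().fit_transform(X)      # centred on the median, scaled by the IQR

for name, scaled in [("standard", standard), ("minmax", minmax), ("robust", robust)]:
    print(name, np.round(scaled.ravel(), 3))
```

With min-max scaling the four ordinary values are squeezed close to 0 by the outlier, while the robust scaler keeps them evenly spread. Since hierarchical clustering works on pairwise distances, this difference can change which points merge first.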

In [None]:
from sklearn import preprocessing

## 5. <a name="5">Training and hyperparameter tuning</a>
(<a href="#0">Go to top</a>)


Before we start the training process we need to specify 3 parameters:<br>
1- Linkage criterion: determines how the distance between two clusters is measured
    - Complete-linkage clustering
    - Single-linkage clustering
    - Average-linkage clustering
    - Centroid-linkage clustering
2- Distance function:
    - Euclidean distance
    - Manhattan distance
    - Mahalanobis distance
3- Number of clusters
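
The three parameters above map onto SciPy roughly as follows: `method` is the linkage criterion, `metric` is the distance function, and `fcluster` cuts the tree into a requested number of clusters. The sketch uses a hypothetical six-point array `X`. Note that SciPy calls the Manhattan distance `"cityblock"`, and that centroid linkage is only well defined with the Euclidean distance.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: two well-separated groups of three points
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

combinations = [
    ("complete", "euclidean"),
    ("single", "cityblock"),     # Manhattan distance
    ("average", "euclidean"),
    ("centroid", "euclidean"),   # centroid linkage requires Euclidean distance
]

results = {}
for method, metric in combinations:
    Z = linkage(X, method=method, metric=metric)      # merge history, shape (n - 1, 4)
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    results[(method, metric)] = labels
    print(method, metric, labels)
```

On data this clean every combination recovers the same two groups; on the real dataset the choices are likely to disagree, which is what the comparison in this section is meant to reveal.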


### *Number of clusters*
Use dendrograms to determine the optimum number of clusters
- Compare how changing the linkage criterion or the distance function affects the optimum number of clusters
- You can use silhouette_score or any other evaluation method to help you determine the optimum number of clusters
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html
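
One possible way to use `silhouette_score` for this is to fit `AgglomerativeClustering` for a range of cluster counts and keep the count with the highest score. The sketch below does that on synthetic blobs (the data, the range of `k`, and the average linkage are all assumptions for illustration).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Hypothetical data: three compact, well-separated blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [8.0, 8.0], [0.0, 8.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 7):
    model = AgglomerativeClustering(n_clusters=k, linkage="average")
    labels = model.fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

The silhouette score ranges from -1 to 1, with higher values indicating tighter, better-separated clusters, so it can be read alongside the dendrogram rather than instead of it.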

In [None]:
import scipy.cluster.hierarchy as shc

plt.figure(figsize=(10, 7))
plt.title("Countries Dendrogram")
# Fill y with your dataframe, method with the linkage criterion,
# and metric with the distance function
dend = shc.dendrogram(shc.linkage(y=... , method=..., metric=...), orientation='right')

In [None]:
#training
from sklearn.cluster import AgglomerativeClustering





## 6. <a name="6">Improvement ideas</a>
(<a href="#0">Go to top</a>)

- Try to use PCA to reduce the number of features and compare how this affects the clustering process
- Try to run your code again with a different transformation technique
- Implement the gap statistic method, use it as an evaluation metric, and compare the result with what you did before https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/#gap-statistic-method
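
For the PCA idea, a minimal sketch might look like the following. The data are synthetic (four features built from two underlying signals), and the 95% variance threshold and three clusters are arbitrary assumptions. The adjusted Rand index is used here only to quantify how similar the two clusterings are.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 4 features, the last two nearly duplicate the first two
rng = np.random.default_rng(0)
base = rng.normal(size=(60, 2))
X = np.hstack([base, 2.0 * base + rng.normal(scale=0.1, size=(60, 2))])

# Scale first, then keep enough components to explain 95% of the variance
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

# Cluster with and without PCA and compare the two labelings
labels_full = AgglomerativeClustering(n_clusters=3).fit_predict(X_scaled)
labels_reduced = AgglomerativeClustering(n_clusters=3).fit_predict(X_reduced)
agreement = adjusted_rand_score(labels_full, labels_reduced)

print("features before/after PCA:", X.shape[1], X_reduced.shape[1])
print("agreement between clusterings:", round(agreement, 3))
```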