# Unsupervised Clustering Using K-means

## 1. Importing Libraries

In [2]:
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

## 2. Loading the data

The dataset for this example is a subset of a dataset from the [FORCE 2020 competition](https://www.kaggle.com/c/force-2020-ml-contest) hosted by FORCE 2020 and XEEK for the prediction of lithology from well logging measurements. (Bormann P., Aursand P., Dilib F., Dischington P., Manral S. 2020. FORCE Machine Learning Competition)


### 2.1. The dataset

In [10]:
df = pd.read_csv("force2020_data_unsupervised_learning.csv", index_col = 'DEPTH_MD')
df

DEPTH_MD,RHOB,GR,NPHI,PEF,DTC
494.528,1.884186,80.200851,,20.915468,161.131180
494.680,1.889794,79.262886,,19.383013,160.603470
494.832,1.896523,74.821999,,22.591518,160.173615
494.984,1.891913,72.878922,,32.191910,160.149429
495.136,1.880034,71.729141,,38.495632,160.128342
...,...,...,...,...,...
3271.416,2.630211,19.418915,0.187811,,
3271.568,2.643114,21.444370,0.185574,,
3271.720,2.681300,22.646879,0.176074,,
3271.872,2.738337,22.253584,0.174617,,


As the table shows, the dataset contains NaN values in several columns. We will remove the rows that contain them.

### 2.2. Drop the missing values

In [8]:
df.dropna(inplace = True)
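
Before dropping rows, it can help to check how many values are missing in each column and how many rows survive. The sketch below uses a small made-up frame so that it runs on its own; the name `toy` and its values are illustrative assumptions, not rows from the FORCE 2020 file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the well-log frame; values are illustrative only.
toy = pd.DataFrame(
    {
        "RHOB": [1.88, 1.89, 2.63, 2.46],
        "NPHI": [np.nan, 0.77, 0.19, 0.34],
        "PEF": [20.9, 1.63, np.nan, 4.70],
    },
    index=pd.Index([494.528, 1138.704, 3271.416, 2993.256], name="DEPTH_MD"),
)

# Count missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# With default arguments, dropna() removes every row that has at least one NaN.
cleaned = toy.dropna()
print(f"{len(toy)} rows before, {len(cleaned)} rows after")
```

Returning a new frame instead of passing `inplace=True` keeps the original available for comparison, which may be useful when deciding whether dropping rows discards too much data.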

In [9]:
df

DEPTH_MD,RHOB,GR,NPHI,PEF,DTC
1138.704,1.774626,55.892757,0.765867,1.631495,147.837677
1138.856,1.800986,60.929138,0.800262,1.645080,142.382431
1139.008,1.817696,62.117264,0.765957,1.645873,138.258331
1139.160,1.829333,61.010860,0.702521,1.620216,139.198914
1139.312,1.813854,58.501236,0.639708,1.504854,144.290085
...,...,...,...,...,...
2993.256,2.468236,90.537521,0.341534,4.699200,86.474564
2993.408,2.457519,88.819122,0.351085,4.699200,86.187599
2993.560,2.429228,92.128922,0.364982,4.699200,87.797836
2993.712,2.425479,95.870255,0.367323,5.224292,88.108452


## 3. Transform the data

Standardise the data using the StandardScaler class from sklearn.

To account for variations in measurement units and scale, it's common practice in machine learning to standardise the data.

This is done by subtracting the mean of each feature from its values and dividing by the feature's standard deviation.

$ \Large z = \frac{x - \mu}{\sigma} $ 

Where $\mu$ and $\sigma$ are the mean and standard deviation of $x$ (the data) respectively.
This process can be influenced by outliers within the data, which can skew the mean and standard deviation. So it's important that outliers are identified and dealt with before this step.
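
As a minimal sketch of the formula above, the following compares a manual z-score with the output of `StandardScaler`. The array `X` holds illustrative values only (an assumption for this example, not data from the file).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values only: two features on very different scales.
X = np.array(
    [
        [1.77, 55.9],
        [1.80, 60.9],
        [2.47, 90.5],
        [2.43, 95.9],
    ]
)

# Manual z-score: subtract each column's mean, divide by its standard deviation.
# np.std defaults to the population standard deviation (ddof=0).
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# StandardScaler learns the per-column mean and standard deviation in fit(),
# then applies the same transformation in transform().
scaler = StandardScaler()
z_sklearn = scaler.fit_transform(X)

print(np.allclose(z_manual, z_sklearn))
print(z_sklearn.mean(axis=0).round(6), z_sklearn.std(axis=0).round(6))
```

Note that `StandardScaler` uses the population standard deviation (`ddof=0`), whereas pandas' `DataFrame.std()` defaults to the sample standard deviation (`ddof=1`), so a manual calculation done with pandas defaults will differ slightly.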

In [11]:
df.describe()

,RHOB,GR,NPHI,PEF,DTC
count,18270.0,18270.0,14032.0,16440.0,18189.0
mean,2.110451,63.847477,0.404547,3.463851,125.106178
std,0.297725,28.636331,0.133532,2.561239,30.618337
min,1.404576,6.191506,0.02433,1.010027,55.726753
25%,1.963399,43.86669,0.315346,2.320836,90.883087
50%,2.055079,66.777851,0.448527,2.790249,141.300461
75%,2.381963,81.542681,0.506343,4.267342,148.048355
max,2.927888,499.022583,0.800262,66.030319,175.95314
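
In the summary above, the maximum GR (499.02) and PEF (66.03) sit far above their 75th percentiles (81.54 and 4.27), which suggests outliers. One common rule of thumb for flagging such values is Tukey's 1.5 × IQR fence. The sketch below applies it to made-up readings; the rule and the 1.5 factor are assumptions chosen for illustration, not necessarily how outliers were handled for this dataset.

```python
import pandas as pd

# Illustrative gamma-ray readings with one extreme value; not taken from the dataset.
gr = pd.Series([43.9, 55.9, 60.9, 66.8, 72.9, 81.5, 499.0], name="GR")

# Tukey's rule: flag points more than 1.5 * IQR beyond the quartiles.
q1, q3 = gr.quantile(0.25), gr.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean mask marking values outside the fences.
is_outlier = (gr < lower) | (gr > upper)
print(gr[is_outlier])
```

Whether flagged values should be removed, clipped, or kept depends on whether they reflect real geology or bad measurements, so the mask is only a starting point for inspection.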


### 3.1. Create an instance of the scaler

In [None]:
We the