# Introduction

In this notebook, we will learn about unsupervised methods for clustering and dimensionality reduction.

## Table of Contents

- [0. Packages](#0)
- [1. Unsupervised Learning](#1)
- [2. Clustering](#3)
    - [2.1. Dataset](#2-1)
- [3. Dimensionality Reduction](#4)

# 0. Packages <a name="0"></a>

In this session, we will make use of the following packages:
- [pathlib](https://docs.python.org/3/library/pathlib.html) is a convenient standard-library module for working with file system paths.
- [NumPy](https://docs.scipy.org/doc/numpy/) is a popular library for scientific computing.
- [matplotlib](https://matplotlib.org/3.1.1/contents.html) is a plotting library that works well with NumPy arrays.
- [pandas](https://pandas.pydata.org/docs/) is what we'll use to manipulate our data.
- [scikit-learn](https://scikit-learn.org/stable/index.html) (imported as `sklearn`) provides the machine learning algorithms used in this session, such as clustering and dimensionality reduction.

Run the next cell to import the packages listed above. We will import further packages as they are needed later in the session.

```python
# Good practice: use short but clear aliases for the imported libraries
from pathlib import Path
import warnings

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn

# Jupyter magic: render matplotlib figures inline in the notebook
%matplotlib inline

# Hide all warnings (use action='once' instead to show each warning once)
warnings.filterwarnings('ignore')
```

# 1. Unsupervised Learning

> Unsupervised learning is a type of machine learning that looks for previously undetected patterns in a data set with no pre-existing labels and with a minimum of human supervision. In contrast to supervised learning, which usually makes use of human-labeled data, unsupervised learning, also known as self-organization, allows for modeling of probability densities over inputs. It forms one of the three main categories of machine learning, along with supervised and reinforcement learning. Semi-supervised learning, a related variant, makes use of supervised and unsupervised techniques.<br/><br/>
Two of the main methods used in unsupervised learning are **principal component** and **cluster analysis**. Cluster analysis is used in unsupervised learning to group, or segment, datasets with shared attributes in order to extrapolate algorithmic relationships. Cluster analysis is a branch of machine learning that groups the data that has not been labelled, classified or categorized. Instead of responding to feedback, cluster analysis identifies commonalities in the data and reacts based on the presence or absence of such commonalities in each new piece of data. This approach helps detect anomalous data points that do not fit into either group.<br/><br/>
A central application of unsupervised learning is in the field of density estimation in statistics, though unsupervised learning encompasses many other domains involving **summarizing** and **explaining data features**. It could be contrasted with supervised learning by saying that whereas supervised learning intends to infer a conditional probability distribution $p_X(x \mid y)$ conditioned on the label $y$ of input data, unsupervised learning intends to infer an a priori probability distribution $p_X(x)$.


Source: [Wikipedia](https://en.wikipedia.org/wiki/Unsupervised_learning)
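
To make the idea of inferring $p_X(x)$ without labels more concrete, here is a minimal sketch of density estimation. It assumes scikit-learn's `KernelDensity` estimator and a synthetic one-dimensional sample; the data, the bandwidth, and all variable names are illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Synthetic, unlabeled 1-D sample drawn from two Gaussians (illustrative only)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 300)])
X = x.reshape(-1, 1)  # scikit-learn expects shape (n_samples, n_features)

# Fit a Gaussian kernel density estimate of p(x); no labels are involved
kde = KernelDensity(kernel="gaussian", bandwidth=0.4).fit(X)

# score_samples returns the log-density, so exponentiate to get p(x)
grid = np.linspace(-5.0, 7.0, 241).reshape(-1, 1)
density = np.exp(kde.score_samples(grid))

# Compare the estimate near one mode with the estimate between the modes
p_mode_left = density[np.argmin(np.abs(grid[:, 0] + 2.0))]
p_valley = density[np.argmin(np.abs(grid[:, 0] - 0.5))]
print(f"p(-2) ~ {p_mode_left:.3f}, p(0.5) ~ {p_valley:.3f}")
```

The estimated density should be clearly higher near the modes than in the gap between them, which is the kind of structure an unsupervised method can recover from the inputs alone.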

# 2. Clustering

> Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including pattern recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning. Cluster analysis itself is not one specific algorithm, but the general task to be solved...

Source: [Wikipedia](https://en.wikipedia.org/wiki/Cluster_analysis)
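
One way to quantify "more similar to each other than to those in other groups" is the silhouette coefficient. The sketch below is only an illustration under that assumption: it uses scikit-learn's `silhouette_score` on six hand-made points, and compares a sensible grouping with a deliberately mixed one.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two tight, well-separated groups of 2-D points (toy data, illustrative only)
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

good_labels = np.array([0, 0, 0, 1, 1, 1])  # matches the visible grouping
bad_labels = np.array([0, 1, 0, 1, 0, 1])   # mixes the two groups

# Silhouette ranges from -1 to 1; higher means points sit closer to their own
# cluster than to the nearest other cluster
good = silhouette_score(points, good_labels)
bad = silhouette_score(points, bad_labels)
print(f"good grouping: {good:.2f}, mixed grouping: {bad:.2f}")
```

The grouping that respects the two visible blobs scores close to 1, while the mixed grouping scores much lower.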

 <a name="2-1"></a>
## 2.1 Dataset

In this session, we will use the GeoLink dataset that was used in previous sessions.

**Note:** Download the data from https://drive.google.com/drive/folders/1EgDN57LDuvlZAwr5-eHWB5CTJ7K9HpDP

Credit to this repository: https://github.com/LukasMosser/geolink_dataset

## Data Disclaimer

All the data serving as input to these notebooks was generously donated by GEOLINK and is licensed under CC BY-SA 4.0.

If you use this data, please reference the dataset properly to give them credit for their contribution.


## Data Preparation
The GeoLink dataset used in this notebook has already been preprocessed. The preparation steps are documented in <code>notebook/00 Data Prep/00-mc-prep_geolink_norge_dataset.ipynb</code>.

## Load Dataset

Let's load the dataset:

```python
# Folder that holds the processed GeoLink data
interim_locations = Path("../../data/processed/geolink_norge_dataset/")

# Load the processed dataset and index it by well name and depth
geolink = pd.read_parquet(
    interim_locations / "geolink_norge_well_logs_train.parquet"
).set_index(["Well", "DEPT"])
geolink
```

| Well | DEPT | LITHOLOGY_GEOLINK | CALI | RHOB | GR | DTC | RDEP | RMED |
|---|---|---|---|---|---|---|---|---|
| 15_9-12 | 2215.917725 | Shaly Silt | 14.438001 | 2.363000 | 60.285748 | 134.253601 | 0.737006 | 0.785088 |
| 15_9-12 | 2216.070068 | Shaly Silt | 14.633000 | 2.340000 | 63.250000 | 129.101868 | 0.741000 | 0.840000 |
| 15_9-12 | 2216.222412 | Shaly Silt | 14.813001 | 2.314000 | 61.405998 | 122.476944 | 0.752000 | 0.858000 |
| 15_9-12 | 2216.375000 | Shaly Silt | 14.383001 | 2.293000 | 62.561596 | 116.908607 | 0.739962 | 0.857046 |
| 15_9-12 | 2216.527344 | Shaly Silt | 14.202999 | 2.275000 | 61.691055 | 115.390953 | 0.715966 | 0.886082 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7_3-1 | 4569.153320 | Cross Bedded Sst | 8.538000 | 2.643616 | 50.886002 | 63.442799 | 6.727000 | 6.835000 |
| 7_3-1 | 4569.305664 | Cross Bedded Sst | 8.540000 | 2.631049 | 51.219002 | 63.450794 | 6.639000 | 6.690000 |
| 7_3-1 | 4569.458008 | Cross Bedded Sst | 8.548000 | 2.626054 | 51.671001 | 63.590557 | 6.551000 | 6.520000 |
| 7_3-1 | 4569.610352 | Cross Bedded Sst | 8.552000 | 2.624065 | 51.820999 | 64.036644 | 6.464000 | 6.462185 |
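
The log curves shown above live on very different scales (for example, `DTC` is in the range of roughly 60 to 135 while `RHOB` stays between 2 and 3), so distance-based clustering would be dominated by the largest-valued columns. A possible preparation step, sketched here as an assumption rather than as the notebook's actual workflow, is to standardize the numeric columns with scikit-learn's `StandardScaler`. The small `sample` frame below copies four rows from the displayed output so that the snippet runs without the parquet file.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# A few rows copied from the output above; in the notebook, `geolink` would be
# used instead of this hand-built sample
sample = pd.DataFrame(
    {
        "CALI": [14.438001, 14.633000, 8.538000, 8.540000],
        "RHOB": [2.363000, 2.340000, 2.643616, 2.631049],
        "GR": [60.285748, 63.250000, 50.886002, 51.219002],
        "DTC": [134.253601, 129.101868, 63.442799, 63.450794],
        "RDEP": [0.737006, 0.741000, 6.727000, 6.639000],
        "RMED": [0.785088, 0.840000, 6.835000, 6.690000],
    }
)

# Keep the numeric log curves only; LITHOLOGY_GEOLINK is a label, not a feature
feature_cols = ["CALI", "RHOB", "GR", "DTC", "RDEP", "RMED"]

# Rescale every curve to zero mean and unit variance so that no single log
# dominates Euclidean distances just because of its units
scaler = StandardScaler()
X_scaled = pd.DataFrame(
    scaler.fit_transform(sample[feature_cols]), columns=feature_cols
)
print(X_scaled.round(2))
```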


## 2.2 KMeans


> k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster.
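
As a rough illustration of the definition above, the following sketch runs scikit-learn's `KMeans` on synthetic 2D data generated with `make_blobs`. The blob centers, the number of samples, and the parameter values are assumptions chosen for the example, not settings taken from this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three well-separated groups (illustrative only)
centers = np.array([[-5.0, -5.0], [0.0, 5.0], [5.0, -5.0]])
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.8, random_state=0)

# n_init restarts the algorithm from several initialisations and keeps the best
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_.round(1))  # one centroid (mean) per cluster
print(np.bincount(labels))               # number of points assigned to each
```

Each row of `cluster_centers_` is the mean of the points assigned to that cluster, which is the "prototype" the definition refers to.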


In the example below, we have data in a 2D space with 3 clusters. The algorithm finds the centroids of the 3 clusters through an iterative process:

![kmeans-convergence](https://upload.wikimedia.org/wikipedia/commons/e/ea/K-means_convergence.gif)

Source: [Wikipedia](https://en.wikipedia.org/wiki/K-means_clustering), [Image](https://en.wikipedia.org/wiki/K-means_clustering#/media/File:K-means_convergence.gif) License Image: [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)
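
The iterative process shown in the animation can be sketched in plain NumPy. The function below is a simplified version of the classic Lloyd iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points); the function name `lloyd_kmeans`, the random initialisation, and the toy data are all hypothetical choices made for this illustration. Like any k-means run, it may stop at a local optimum that depends on the starting centroids.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iteration: assign points, then move centroids, until stable."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: the centroids have stopped moving
        centroids = new_centroids
    return centroids, labels

# Toy data: three Gaussian blobs in 2-D (illustrative only)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in ([0, 0], [4, 0], [2, 4])])
centroids, labels = lloyd_kmeans(X, k=3)
print(centroids.round(2))
```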

## 2.3 DBSCAN

> Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering non-parametric algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away). DBSCAN is one of the most common clustering algorithms and also most cited in scientific literature.

### Why should I use it?

> In 2014, the algorithm was awarded the test of time award (an award given to algorithms which have received substantial attention in theory and practice) at the leading data mining conference, ACM SIGKDD. As of July 2020, the follow-up paper "DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN" appears in the list of the 8 most downloaded articles of the prestigious ACM Transactions on Database Systems (TODS) journal.

***DBSCAN can find non-linearly separable clusters. This dataset cannot be adequately clustered with k-means or Gaussian Mixture EM clustering.***
![](https://upload.wikimedia.org/wikipedia/commons/thumb/0/05/DBSCAN-density-data.svg/1920px-DBSCAN-density-data.svg.png)

Source: [Wikipedia](https://en.wikipedia.org/wiki/DBSCAN), [Source Image](https://en.wikipedia.org/wiki/DBSCAN#/media/File:DBSCAN-density-data.svg) License Image: [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/)
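
The claim about non-linearly separable clusters can be checked on a small synthetic example. The sketch below assumes scikit-learn's `DBSCAN`, `KMeans`, and `make_moons`, and compares both algorithms against the known grouping with the adjusted Rand index. The values of `eps` and `min_samples` are hand-picked for this toy data and would need tuning on real well logs.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: not linearly separable (synthetic data)
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=0)

# eps = neighbourhood radius, min_samples = neighbours needed for a core point;
# both values are hand-picked for this toy data set
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping
db_ari = adjusted_rand_score(y_true, db_labels)
km_ari = adjusted_rand_score(y_true, km_labels)
print("DBSCAN :", round(db_ari, 2))
print("k-means:", round(km_ari, 2))
print("noise points flagged by DBSCAN:", int(np.sum(db_labels == -1)))
```

DBSCAN labels points it considers noise with `-1`, and unlike k-means it does not need the number of clusters in advance.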

# 3. Dimensionality Reduction


Source: [Kaggle](https://www.kaggle.com/arthurtok/principal-component-analysis-with-kmeans-visuals) License: [Apache 2.0](http://www.apache.org/licenses/LICENSE-2.0)
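
This section does not spell out a specific technique, so the following is only a hedged sketch of one common choice, principal component analysis, which is named earlier in this notebook as one of the main unsupervised methods. It assumes scikit-learn's `PCA` and `StandardScaler`; the synthetic data, the `mixing` matrix, and the choice of two components are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 200 samples, 6 correlated features driven by 2 hidden factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.8, -0.5, 0.3, 0.0, 1.2],
                   [0.0, 0.6, 1.0, -0.9, 1.1, 0.4]])
X = latent @ mixing + 0.05 * rng.normal(size=(200, 6))

# Standardise first so every feature contributes on the same scale
X_scaled = StandardScaler().fit_transform(X)

# Project onto the two directions of largest variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)
print(pca.explained_variance_ratio_.round(3))
```

Because the six features are generated from only two hidden factors, two principal components should capture almost all of the variance; real data will rarely compress this cleanly.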