In [1]:
import matplotlib.pyplot as plt
import numpy as np

## Clustering Concept:
**Concept**:

**Definition**: Clustering is a type of unsupervised learning where the goal is to group similar data points together based on certain features or characteristics.
**Objective**: Discover inherent structures or patterns in the data without prior knowledge of the groups.

**Analogy**:

**Imagine a Library:** Think of a library with a large collection of books. If you were to organize these books without any labels or categories, you might naturally group them based on topics, genres, or authors. Each group would represent a cluster of books with similar content.

### **Key Aspects:**

**Similarity Metric**:

Clustering relies on a similarity metric to measure how alike or different data points are. Common metrics include Euclidean distance and cosine similarity.

**Unsupervised Nature**:

Clustering is unsupervised because it doesn't have predefined categories; the algorithm discovers patterns on its own.
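
To make the two metrics mentioned above concrete, here is a minimal sketch with NumPy. The helper names `euclidean_distance` and `cosine_similarity` and the toy vectors are assumptions made for illustration only.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance: smaller means more similar."""
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1 means same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0])
b = np.array([2.0, 4.0])   # same direction as a, twice as long
c = np.array([2.0, 1.0])   # same length as a, different direction

print("a vs b:", euclidean_distance(a, b), cosine_similarity(a, b))
print("a vs c:", euclidean_distance(a, c), cosine_similarity(a, c))
```

The two metrics can disagree: by Euclidean distance `c` is closer to `a`, while by cosine similarity `b` is the better match. Which one is appropriate depends on whether magnitude matters for the data at hand.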

## K-Means Algorithm:
**Concept**:

- **Definition**: K-Means is a popular clustering algorithm that partitions data into K clusters, where each data point belongs to the cluster with the nearest mean (centroid).
- **Objective**: Minimize the within-cluster sum of squares, making data points within a cluster as similar as possible.
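
Written out, the within-cluster sum of squares looks like this. The symbols are notation assumed here for illustration: $C_k$ is the set of points assigned to cluster $k$, $\mu_k$ is its centroid, and $x_i$ is a data point.

$$
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
$$

In scikit-learn, the value of $J$ for a fitted model is exposed as the `inertia_` attribute of `KMeans`.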

**Analogy**:

- **Organizing Party Guests**:
Imagine you're organizing a party, and you want to group guests based on their interests. K-Means would be like placing guests into K groups so that people within each group share similar interests.

**Steps of K-Means:**

**Initialization**:

Choose K initial centroids randomly (representing the initial cluster centers).

**Assignment**:

Assign each data point to the cluster whose centroid is closest.

**Update Centroids**:

Recalculate the centroids based on the mean of data points in each cluster.


**Repeat**:

Repeat the assignment and update steps until convergence (when the centroids stabilize or don't change significantly).
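
The four steps above can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters, and the toy data are all assumptions.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment: index of the nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: each centroid moves to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Repeat: stop once the centroids barely move
        moved = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if moved < tol:
            break
    return labels, centroids

# Two well-separated synthetic blobs
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans_sketch(X_toy, k=2)
print(centroids)
```

Because the starting centroids are random, a different `seed` can lead to a different local optimum, which is why library implementations usually rerun the procedure several times.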

### **Key Aspects:**

- **K Selection:**

Choosing the right value for K is crucial. It depends on the nature of the data and the desired level of granularity.

- **Sensitivity to Initialization:**

The performance of K-Means can be sensitive to the initial choice of centroids. Multiple initializations and choosing the best result can mitigate this.

- **Limitations**:

K-Means implicitly assumes roughly spherical clusters of similar size, and it struggles with non-convex shapes or with clusters of very different sizes or densities.
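
One common heuristic for K selection (an addition here, not something the notes above prescribe) is the "elbow" approach: fit K-Means for several values of K and look for the point where the within-cluster sum of squares stops dropping sharply. The sketch below uses synthetic data from `make_blobs`, and `n_init=10` illustrates the multiple-initialization idea.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 true groups (for illustration only)
X_toy, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

inertias = []
for k in range(1, 7):
    # n_init=10 reruns K-Means from 10 random starts and keeps the best run
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy)
    inertias.append(km.inertia_)   # within-cluster sum of squares
    print(f"K={k}: inertia={km.inertia_:.1f}")
```

The inertia always shrinks as K grows, so the useful signal is where the improvement levels off, not where the value is smallest.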
**Example**:

In [2]:
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

data = load_breast_cancer()

In [3]:
data.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [4]:
# Let's see if K-Means is able to distinguish
# between malignant and benign

X = data.data
# Use 2 clusters, since we know there are 2 classes.
# n_init is set explicitly (10 is the default this run used) to avoid the FutureWarning
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans.fit(X)


In [5]:
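# How well did K-Means partition the data?
# Note: cluster ids are arbitrary, so a value below 0.5 would mean the ids are
# swapped relative to data.target (the agreement is then 1 minus this value)
(kmeans.labels_ == data.target).sum() / len(kmeans.labels_)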

0.8541300527240774

What happens if I apply PCA first?

In [6]:

pca = PCA(n_components=2).fit(X)
X_red = pca.transform(X)

# n_init set explicitly (same value as the default used in this run) to avoid the FutureWarning
kmeans2 = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_red)


In [7]:
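# Caution: this scores `kmeans` (fit on X) again, not `kmeans2` (fit on X_red),
# so it repeats the result of In [5]; scoring kmeans2.labels_ would test the PCA model
(kmeans.labels_ == data.target).sum() / len(kmeans.labels_)

0.8541300527240774

Nothing changed :/ (which is expected, since the same model was scored twice)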
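
Cell In [7] scores `kmeans.labels_`, so a fair comparison needs the labels of `kmeans2`. The sketch below reruns both models and scores each one. The helper `two_cluster_agreement` is a hypothetical name introduced here; it also handles the fact that cluster ids 0 and 1 are arbitrary. No output is recorded because this cell was not part of the original run.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

data = load_breast_cancer()
X = data.data

def two_cluster_agreement(labels, target):
    """Agreement with the true classes for 2 clusters, ignoring which id is which."""
    acc = (labels == target).mean()
    return max(acc, 1 - acc)

pca = PCA(n_components=2).fit(X)
X_red = pca.transform(X)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
kmeans2 = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_red)

print("raw features:", two_cluster_agreement(kmeans.labels_, data.target))
print("after PCA:   ", two_cluster_agreement(kmeans2.labels_, data.target))
# Share of the total variance kept by the 2 components
print("variance kept:", pca.explained_variance_ratio_.sum())
```

If the two components keep nearly all of the variance, distances between points barely change and K-Means should find almost the same partition, so a similar score would not be surprising.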

***

### Agglomerative Clustering:

**Concept:**
- **Definition:** Agglomerative Clustering is a hierarchical clustering algorithm that starts with individual data points as separate clusters and iteratively merges the closest clusters until only one cluster remains.
- **Objective:** The algorithm creates a hierarchy of clusters, revealing relationships at different levels of granularity.

**Analogy:**
- **Family Tree:**
  - Imagine constructing a family tree. At the beginning, each person is a separate cluster (like an individual data point). As you discover relationships, you merge individuals into families, and families into larger branches, forming a hierarchical structure.

**Steps of Agglomerative Clustering:**
1. **Individual Data Points:**
   - Start with each data point as a separate cluster.

2. **Merge Closest Clusters:**
   - Iteratively merge the two closest clusters based on a specified distance metric.

3. **Hierarchy Formation:**
   - Continue merging until all data points belong to a single cluster, forming a hierarchy.

**Key Aspects:**
1. **Dendrogram:**
   - The result is often visualized as a dendrogram, a tree-like diagram showing the hierarchy of clusters.

2. **Linkage Criteria:**
   - Different linkage criteria (e.g., ward, complete, average) determine how the distance between clusters is calculated.
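
As a small illustration of the hierarchy idea, the sketch below uses SciPy's `linkage` and `fcluster` on six hand-made 1-D points (the data and variable names are assumptions for illustration). The same merge tree is cut at two levels of granularity.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six 1-D toy points forming two obvious groups
points = np.array([[0.0], [0.1], [0.3], [5.0], [5.2], [5.6]])

# Each row of Z records one merge: (cluster a, cluster b, distance, new size)
Z = linkage(points, method="average")
print(Z)

# Cut the same hierarchy into at most 2 and at most 3 clusters
two = fcluster(Z, t=2, criterion="maxclust")
three = fcluster(Z, t=3, criterion="maxclust")
print("2 clusters:", two)
print("3 clusters:", three)
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` draws the corresponding tree with matplotlib.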

---

### DBSCAN (Density-Based Spatial Clustering of Applications with Noise):

**Concept:**
- **Definition:** DBSCAN is a density-based clustering algorithm that groups together data points that are close to each other and have a sufficient number of neighbors, while marking data points in less dense regions as noise.
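- **Objective:** Identify clusters of arbitrary shape and separate them from noise. Because a single radius is used for the whole data set, clusters with very different densities can still be hard to capture.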

**Analogy:**
- **People Gathering:**
  - Imagine people gathering in a park. DBSCAN would identify groups of people standing close to each other as clusters, regardless of the overall park's size. People sitting alone might be considered noise.

**Steps of DBSCAN:**
1. **Core Points:**
   - Identify core points that have a minimum number of neighbors within a specified radius.

2. **Expand Clusters:**
   - Expand clusters by adding reachable points to the core points, forming dense regions.

3. **Noise Detection:**
   - Points that are neither core points nor within the radius of a core point are considered noise.

**Key Aspects:**
1. **Flexibility with Cluster Shapes:**
   - DBSCAN is effective at identifying clusters of arbitrary shapes because it does not assume spherical clusters.

2. **Parameter Selection:**
   - Parameters include the radius (epsilon) and the minimum number of points required to form a dense region (minPts).

3. **Noise Handling:**
   - DBSCAN can handle noise and outliers effectively, labeling them as noise instead of forcing them into a cluster.

**Example:**
- **Monitoring Network Anomalies:**
  - Consider monitoring network traffic. DBSCAN could identify clusters of usual activity and mark unusual patterns as noise, potentially indicating anomalies or security threats.
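
A tiny hand-made example makes the core-point and noise ideas visible. The points and parameter values below are chosen purely for illustration: two tight groups of four points each, plus one isolated point.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups plus one isolated point (hand-made toy data)
group_a = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]])
group_b = group_a + 5.0
loner = np.array([[10.0, 0.0]])
pts = np.vstack([group_a, group_b, loner])

db = DBSCAN(eps=0.5, min_samples=3).fit(pts)

print("labels:      ", db.labels_)               # -1 marks noise
print("core indices:", db.core_sample_indices_)  # points with >= min_samples neighbours within eps
```

Every point in the two groups has at least 3 points (counting itself) within the radius, so all eight are core points and the groups get ids 0 and 1. The isolated point has no neighbours and is labeled -1.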

Applying agglomerative clustering and DBSCAN to the data set for illustrative purposes (I'm just experimenting).

I'm curious about what happens if we already know how the data is divided and apply unsupervised clustering anyway.

How accurately can these models separate the classes? ...

In [15]:
from sklearn.cluster import AgglomerativeClustering

# Different linkage criteria should give different accuracy results
linkages = ['ward', 'complete', 'average', 'single']
for linkage in linkages:
    # n_clusters=2 because we already know there are two classes
    clt = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    labels = clt.fit_predict(X)

    # Cluster ids are arbitrary: a score below 0.5 means the ids are swapped
    # relative to data.target, so the actual agreement is 1 - acc
    acc = (labels == data.target).sum() / len(labels)
    print(f'Accuracy for {linkage} linkage: {acc}')

Accuracy for ward linkage: 0.22144112478031636
Accuracy for complete linkage: 0.6625659050966608
Accuracy for average linkage: 0.6625659050966608
Accuracy for single linkage: 0.37082601054481545
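
Since cluster ids carry no meaning, a raw match rate can understate the result: the 0.22 recorded for ward linkage corresponds to an agreement of about 0.78 once the two ids are swapped. The sketch below (an addition, not part of the original run) scores each linkage in a way that ignores the naming of the clusters, using `max(acc, 1 - acc)` and scikit-learn's `adjusted_rand_score`. The `scores` dictionary is a name introduced here.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import adjusted_rand_score

data = load_breast_cancer()
X = data.data

scores = {}
for linkage in ["ward", "complete", "average", "single"]:
    labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X)
    acc = (labels == data.target).mean()
    agreement = max(acc, 1 - acc)   # valid only because there are exactly 2 clusters
    ari = adjusted_rand_score(data.target, labels)   # 1 = identical partitions, about 0 = chance level
    scores[linkage] = {"agreement": agreement, "ari": ari}
    print(f"{linkage:>8}: agreement={agreement:.3f}  ARI={ari:.3f}")
```

The adjusted Rand index also works for more than two clusters, where the `max(acc, 1 - acc)` shortcut no longer applies.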


- What if I don't specify the number of groups?

- Will DBSCAN identify malignant and benign?

In [18]:
from sklearn.cluster import DBSCAN

# Default parameters:
# eps = 0.5, min_samples = 5
dbscan = DBSCAN()
clusters = dbscan.fit_predict(X)

print(np.unique(clusters))

[-1]


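Only noise was detected :/ ...

Apparently that's because I didn't scale the data. Some raw features take values in the hundreds or thousands, so an eps of 0.5 is far too small for any point to have neighbors.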
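
One way to check this is to look at how far each point is from its 5th nearest neighbour (counting itself, which matches how `min_samples` is counted). The sketch below does that with `NearestNeighbors` for the raw and the standardized features. The helper name `kth_neighbor_distance` is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data
X_scaled = StandardScaler().fit_transform(X)

def kth_neighbor_distance(points, k=5):
    """Distance from each point to its k-th nearest neighbour, counting itself."""
    nn = NearestNeighbors(n_neighbors=k).fit(points)
    distances, _ = nn.kneighbors(points)   # first column is the point itself (distance 0)
    return distances[:, -1]

d_raw = kth_neighbor_distance(X, k=5)
d_scaled = kth_neighbor_distance(X_scaled, k=5)

for name, d in [("raw", d_raw), ("scaled", d_scaled)]:
    print(f"{name:>6}: min={d.min():.2f}  median={np.median(d):.2f}")
```

A point can only be a core point when this distance is at most `eps`, so the smallest value printed is a lower bound for any `eps` that finds at least one cluster with `min_samples=5`.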

In [30]:
from sklearn.preprocessing import StandardScaler

stdr = StandardScaler()
X_scaled = stdr.fit_transform(X)


X_scaled

array([[ 1.09706398, -2.07333501,  1.26993369, ...,  2.29607613,
         2.75062224,  1.93701461],
       [ 1.82982061, -0.35363241,  1.68595471, ...,  1.0870843 ,
        -0.24388967,  0.28118999],
       [ 1.57988811,  0.45618695,  1.56650313, ...,  1.95500035,
         1.152255  ,  0.20139121],
       ...,
       [ 0.70228425,  2.0455738 ,  0.67267578, ...,  0.41406869,
        -1.10454895, -0.31840916],
       [ 1.83834103,  2.33645719,  1.98252415, ...,  2.28998549,
         1.91908301,  2.21963528],
       [-1.80840125,  1.22179204, -1.81438851, ..., -1.74506282,
        -0.04813821, -0.75120669]])
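
For reference, `StandardScaler` applies a per-column z-score: it subtracts each feature's mean and divides by its standard deviation. The short check below (variable names are illustrative) reproduces the scaling by hand and compares it with the scaler's output.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data
X_scaled = StandardScaler().fit_transform(X)

# Manual z-score: subtract each column's mean, divide by its (population) std
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print("matches StandardScaler:", np.allclose(X_manual, X_scaled))
print("raw std range:", X.std(axis=0).min(), "to", X.std(axis=0).max())
```

After scaling, every feature has mean 0 and standard deviation 1, so no single feature dominates the distances DBSCAN relies on.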

Trying multiple combinations of parameters.

In [39]:
samples = [2, 3, 5, 7]
radius = [0.5, 1, 1.5, 2, 3]

for min_sample in samples:
    for eps in radius:
        dbscan = DBSCAN(min_samples=min_sample, eps=eps)
        clusters = dbscan.fit_predict(X_scaled)
        print(f'min_samples: {min_sample} - eps: {eps} -> Clusters: {np.unique(clusters)}')
    print('\n')

min_samples: 2 - eps: 0.5 -> Clusters: [-1]
min_samples: 2 - eps: 1 -> Clusters: [-1]
min_samples: 2 - eps: 1.5 -> Clusters: [-1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
min_samples: 2 - eps: 2 -> Clusters: [-1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
min_samples: 2 - eps: 3 -> Clusters: [-1  0  1  2  3  4  5  6]


min_samples: 3 - eps: 0.5 -> Clusters: [-1]
min_samples: 3 - eps: 1 -> Clusters: [-1]
min_samples: 3 - eps: 1.5 -> Clusters: [-1  0  1  2  3  4  5]
min_samples: 3 - eps: 2 -> Clusters: [-1  0  1  2  3  4  5  6]
min_samples: 3 - eps: 3 -> Clusters: [-1  0  1]


min_samples: 5 - eps: 0.5 -> Clusters: [-1]
min_samples: 5 - eps: 1 -> Clusters: [-1]
min_samples: 5 - eps: 1.5 -> Clusters: [-1  0]
min_samples: 5 - eps: 2 -> Clusters: [-1  0  1  2  3]
min_samples: 5 - eps: 3 -> Clusters: [-1  0]


min_samples: 7 - eps: 0.5 -> Clusters: [-1]
min_samples: 7 - eps: 1 -> Clusters: [-1]
min_samples: 7 - eps: 1.5 -> Clusters: [-1  0]
min_samples: 7 