
# Chapter 9: Unsupervised Learning Techniques
## Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow

## 1. Basic Concepts of Unsupervised Learning
Unsupervised learning is a machine learning approach in which the model learns patterns from data that has no labels or targets. There are two main techniques:

**a. Clustering**
- Groups similar data points into clusters
- Example algorithms: K-Means, DBSCAN, Hierarchical Clustering

**b. Dimensionality Reduction**
- Reduces the number of variables/features while preserving the important information
- Examples: PCA (Chapter 8), t-SNE, Autoencoders

**Fundamental Difference**
- Supervised learning: predicts a label/target (classification/regression)
- Unsupervised learning: discovers hidden patterns (clustering/dimensionality reduction)

## 2. K-Means Clustering (In-Depth Theory)
K-Means partitions the data into K clusters by minimizing the within-cluster variance.

**Mathematical Formulation**
Objective function (inertia):
$$ J = \sum_{i=1}^n \sum_{j=1}^K w_{ij} \lVert x_i - \mu_j \rVert^2 $$
where:
- $n$ is the number of data points and $K$ is the number of clusters
- $w_{ij} = 1$ if $x_i$ belongs to cluster $j$, and 0 otherwise
- $\mu_j$ is the centroid of cluster $j$
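
To make the objective concrete, the sketch below recomputes the inertia by hand and compares it with the value scikit-learn reports. The dataset and the names `X_demo`, `km`, and `manual_inertia` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative toy data (an assumption for this sketch)
X_demo, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.8, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Indexing the centroids by label plays the role of w_ij:
# each point is paired only with the centroid of its own cluster
assigned_centroids = km.cluster_centers_[km.labels_]
manual_inertia = np.sum((X_demo - assigned_centroids) ** 2)

print(manual_inertia, km.inertia_)  # the two values should agree
```

Both numbers are the sum of squared distances from each point to its assigned centroid, which is exactly $J$ above.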

**Iterative Process**:
1. Choose the number of clusters (K)
2. Initialize the centroids randomly
3. Compute the distance from each data point to every centroid (Euclidean distance)
4. Assign each data point to the cluster of its nearest centroid
5. Update each centroid to the mean of all points in its cluster
6. Repeat steps 3-5 until convergence (the centroids stop moving)
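
The steps above can be sketched in plain NumPy as follows. This is a minimal teaching version under stated assumptions (random data points as initial centroids, a fixed iteration cap, and the hypothetical helper names `nearest_centroid` and `kmeans_sketch`); it is not how scikit-learn implements `KMeans`.

```python
import numpy as np

def nearest_centroid(X, centroids):
    # Step 3: squared Euclidean distance from every point to every centroid
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Step 4: index of the closest centroid for each point
    return dists.argmin(axis=1)

def kmeans_sketch(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: use k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        labels = nearest_centroid(X, centroids)
        # Step 5: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 6: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return nearest_centroid(X, centroids), centroids

# Two tight, well-separated blobs as toy input
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels_toy, centroids_toy = kmeans_sketch(X_toy, k=2)
print(centroids_toy)
```

On this toy input the two centroids should settle near (0, 0) and (5, 5), one per blob.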

**Advantages**:
- Computationally efficient: each iteration costs roughly O(n·K·d) for n points, K clusters, and d features, so it scales linearly with n
- Easy to implement

**Limitations**:
- K must be chosen in advance
- Sensitive to centroid initialization
- Performs poorly on non-spherical clusters or clusters of very different sizes

## 3. Hands-On: K-Means Implementation
### 3.1 Basic Implementation with Scikit-Learn

In [None]:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np

# Generate sample data
X, y = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)

# Visualize raw data
plt.figure(figsize=(8,6))
plt.scatter(X[:,0], X[:,1], s=50)
plt.title('Original Data')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

# Apply K-Means
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans.fit(X)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

# Visualize clustered data
plt.figure(figsize=(8,6))
plt.scatter(X[:,0], X[:,1], c=labels, s=50, cmap='viridis')
plt.scatter(centroids[:,0], centroids[:,1], c='red', s=200, alpha=0.8, marker='X')
plt.title('K-Means Clustering Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

### 3.2 Determining the Optimal Number of Clusters (Elbow Method)

In [None]:
inertia = []
k_range = range(1,10)

for k in k_range:
    model = KMeans(n_clusters=k, random_state=42)
    model.fit(X)
    inertia.append(model.inertia_)

# Plot elbow curve
plt.figure(figsize=(8,5))
plt.plot(k_range, inertia, marker='o')
plt.title('Elbow Method for Optimal K')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.xticks(k_range)
plt.grid()
plt.show()

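## 4. DBSCAN Clustering (Complete Theory)
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) groups points by density connectivity: a cluster is a contiguous region of high density.

**Key Concepts**:
- **Core point**: has at least min_samples points within radius eps (scikit-learn counts the point itself)
- **Border point**: lies within eps of a core point but is not itself a core point
- **Noise point**: neither a core point nor a border point

**Parameters**:
- eps: the maximum distance between two samples for them to count as neighbors
- min_samples: the minimum number of samples within radius eps for a point to be a core point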
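
As a rough illustration of the three point types, the sketch below splits a fitted scikit-learn `DBSCAN` result into core, border, and noise points using its `core_sample_indices_` and `labels_` attributes. The dataset and the variable names (`X_demo`, `is_core`, `is_border`, `is_noise`) are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X_demo, _ = make_moons(n_samples=300, noise=0.07, random_state=42)
db = DBSCAN(eps=0.2, min_samples=5).fit(X_demo)

# Core points are listed explicitly by the fitted estimator
is_core = np.zeros(len(X_demo), dtype=bool)
is_core[db.core_sample_indices_] = True

# scikit-learn marks noise with the label -1
is_noise = db.labels_ == -1

# Border points belong to a cluster but are not dense enough to be core
is_border = ~is_core & ~is_noise

print(f"core={is_core.sum()}, border={is_border.sum()}, noise={is_noise.sum()}")
```

Every point falls into exactly one of the three groups, so the three counts add up to the number of samples.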

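**Algorithm**:
1. Pick an arbitrary point that has not been visited yet
2. Find all points within distance eps of it (its neighborhood)
3. If the point is a core point, start a cluster and expand it through every core point reachable from it
4. If it is not a core point and no core point reaches it, mark it as noise (an outlier)
5. Repeat until every point has been processed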
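
The procedure can be sketched from scratch as below. This is a simplified version for small datasets, assuming a brute-force pairwise distance matrix and the hypothetical function name `dbscan_sketch`; scikit-learn's `DBSCAN` uses neighbor-search structures instead and may assign border points in a different order.

```python
import numpy as np

def dbscan_sketch(X, eps, min_samples):
    """Return one cluster id per row of X, with -1 meaning noise."""
    n = len(X)
    # Brute-force pairwise distances: O(n^2) memory, small demos only
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    # A point counts itself among its neighbors (distance 0)
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])

    labels = np.full(n, -1)
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue  # already clustered, or unable to seed a cluster
        # Grow a new cluster outward from core point i
        labels[i] = cluster_id
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster_id
                    if is_core[q]:
                        stack.append(q)  # only core points keep expanding
        cluster_id += 1
    return labels  # points never reached stay at -1 (noise)

# Two dense blobs plus one isolated point
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(0.0, 0.05, (40, 2)),
    rng.normal(3.0, 0.05, (40, 2)),
    np.array([[10.0, 10.0]]),
])
labels_toy = dbscan_sketch(X_toy, eps=0.5, min_samples=5)
print(np.unique(labels_toy))
```

In this toy setup each blob should form its own cluster, while the isolated point has no neighbors within eps and stays labeled as noise.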

## 5. Hands-On: DBSCAN Implementation

In [None]:
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Generate non-linear cluster data
X_moons, _ = make_moons(n_samples=300, noise=0.07, random_state=42)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.2, min_samples=5)
clusters = dbscan.fit_predict(X_moons)

# Visualize results
plt.figure(figsize=(8,6))
plt.scatter(X_moons[:,0], X_moons[:,1], c=clusters, s=50, cmap='viridis')
plt.title('DBSCAN Clustering Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

## 6. Gaussian Mixture Models (GMM)
A probabilistic approach in which the data is modeled as a mixture of Gaussian distributions.

**Mathematical Foundations**:
$$ p(x) = \sum_{k=1}^K \pi_k \mathcal{N}(x|\mu_k, \Sigma_k) $$
where:
- $\pi_k$: mixing coefficient (the weight of the $k$-th component), with $\sum_{k=1}^K \pi_k = 1$
- $\mathcal{N}$: multivariate normal distribution
- $\mu_k, \Sigma_k$: mean and covariance matrix of the $k$-th component
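
The density formula can be evaluated directly, as in the sketch below. The two-component parameters (`weights`, `means`, `covs`) are made-up values chosen only for illustration, and `mixture_pdf` is a hypothetical helper name.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-component mixture in 2-D (illustrative values only)
weights = np.array([0.3, 0.7])  # pi_k, must sum to 1
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), np.array([[1.0, 0.5], [0.5, 2.0]])]

def mixture_pdf(x):
    """p(x) = sum over k of pi_k * N(x | mu_k, Sigma_k)."""
    return sum(
        w * multivariate_normal(mean=m, cov=c).pdf(x)
        for w, m, c in zip(weights, means, covs)
    )

p_origin = mixture_pdf(np.array([0.0, 0.0]))
print(p_origin)
```

At the origin the first component dominates, since the point sits exactly on its mean and far from the second component's mean.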

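**Parameter Estimation with the EM Algorithm**:
1. **E-step**: compute each component's responsibility for every data point, i.e. the posterior probability that the point was generated by that component
2. **M-step**: update the model parameters ($\pi_k$, $\mu_k$, $\Sigma_k$) using those responsibilities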
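
The two steps can be sketched for the simplest case, a two-component mixture in one dimension. The function name `em_gmm_1d`, the initialization at the data extremes, and the fixed iteration count are assumptions made for this example; scikit-learn's `GaussianMixture` handles the general multivariate case with more careful initialization and convergence checks.

```python
import numpy as np

def em_gmm_1d(x, n_iters=100):
    """EM for a two-component 1-D Gaussian mixture (teaching sketch)."""
    # Crude initialization: equal weights, means at the data extremes
    pi = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    for _ in range(n_iters):
        # E-step: responsibility r[i, k] of component k for point i
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        weighted = pi * dens
        r = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

# Synthetic data: 30% from N(-2, 0.5^2), 70% from N(3, 1^2)
rng = np.random.default_rng(0)
x_toy = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])
pi_hat, mu_hat, var_hat = em_gmm_1d(x_toy)
print(pi_hat, mu_hat, var_hat)
```

Because the two components are well separated, the estimates should land close to the generating weights and means.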

## 7. Real-World Application: Customer Segmentation

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load dataset (example: e-commerce customer data)
data = pd.read_csv('customer_data.csv')
features = data[['Annual_Income', 'Spending_Score', 'Age']]

# Preprocessing
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

# Clustering
kmeans = KMeans(n_clusters=5, random_state=42)
clusters = kmeans.fit_predict(scaled_features)

# Analysis results
data['Cluster'] = clusters
cluster_stats = data.groupby('Cluster').mean(numeric_only=True)  # ignore any non-numeric columns

# Visualize 3D clusters
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(111, projection='3d')
scatter = ax.scatter(data['Annual_Income'],
                    data['Spending_Score'],
                    data['Age'],
                    c=data['Cluster'],
                    cmap='viridis',
                    s=60)
ax.set_xlabel('Annual Income')
ax.set_ylabel('Spending Score')
ax.set_zlabel('Age')
plt.title('3D Customer Segmentation')
plt.colorbar(scatter)
plt.show()