<h1 align="center">An Introduction to Machine Learning - 25737</h1>
<h4 align="center">Dr. Yasaei</h4>
<h4 align="center">Sharif University of Technology, Autumn 2024</h4>

**Student Name**:

**Student ID**:

# Gaussian Mixture Models with EM

## Introduction and Purpose

In this exercise, you will:

1. Implement a **Gaussian Mixture Model (GMM)** using the Expectation-Maximization (EM) algorithm **from scratch** (using NumPy and basic Python operations).
2. Implement the **same GMM model using PyTorch**.
3. Compare and contrast the two implementations (performance, complexity, ease of coding, etc.).

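**Gaussian Mixture Models** assume that each data point is drawn from one of several Gaussian distributions, with the component chosen according to the mixture weights. The EM algorithm alternates between an E-step, which computes how responsible each component is for each point, and an M-step, which re-estimates the parameters (means, covariances, and mixture weights) from those responsibilities. Each iteration never decreases the likelihood of the observed data, so the algorithm typically converges to a local (not necessarily global) optimum.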
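
For reference, the model described above can be written compactly. The symbols used here ($K$ components, $N$ samples, mixture weights $\pi_k$, means $\boldsymbol{\mu}_k$, covariances $\boldsymbol{\Sigma}_k$) are notation chosen for this note, not something fixed by the exercise.

$$
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1
$$

$$
\log L = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)
$$

EM increases $\log L$ without ever having to differentiate through the logarithm of the sum directly.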



## Part 1: Data Loading and Exploration

**Tasks:**  
- Load the Iris dataset and store the features in `X` and labels in `y`.
- Print the shape of `X` and examine a few rows.
- **Hint:** Use `sklearn.datasets.load_iris()` to load the data.

In [None]:
# TODO: Load the Iris dataset and print shape
# from sklearn.datasets import load_iris

# iris = ...
# X = ...
# y = ...
# print("Shape of X:", ...)
# print("First 5 samples:\n", ...)


## Part 2: Data Preprocessing (Scaling)

**Tasks:**  
- Scale the data using `StandardScaler` so that each feature has zero mean and unit variance.
- **Hint:** `from sklearn.preprocessing import StandardScaler`.
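
To make the scaling step concrete, here is a minimal NumPy sketch of what standardization computes. It runs on synthetic stand-in data (`X_demo` is a hypothetical array, not the Iris features). `StandardScaler` uses the population standard deviation (equivalent to `ddof=0`), which is also NumPy's default for `std`.

```python
import numpy as np

# Synthetic stand-in data (assumption for illustration only, not Iris)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 4))

# Per-feature statistics; np.std defaults to ddof=0 (population std)
mu = X_demo.mean(axis=0)
sigma = X_demo.std(axis=0)

# Standardize: each column ends up with zero mean and unit variance
X_demo_scaled = (X_demo - mu) / sigma

print("Mean after scaling:", X_demo_scaled.mean(axis=0))
print("Std after scaling:", X_demo_scaled.std(axis=0))
```

The printed means will be zero only up to floating-point error, so values around `1e-16` are expected rather than exact zeros.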


In [None]:
# TODO: Scale the data
# from sklearn.preprocessing import StandardScaler

# scaler = ...
# X_scaled = ...
# print("Mean after scaling:", X_scaled.mean(axis=0))
# print("Std after scaling:", X_scaled.std(axis=0))


## Part 3: Implementing GMM with EM **from scratch** (NumPy-based)

We will first implement GMM using NumPy arrays and basic operations, without PyTorch.

**Tasks:**  
- Choose the number of components `K` (e.g., K=3).
- Initialize the parameters: means, covariances (diagonal), and mixture weights.
- Write functions for the E-step and M-step of the EM algorithm.
- Run the EM algorithm for a fixed number of iterations.

**Hints for Implementation:**

- Means: K x D array.
- Covariances: K x D x D in general; because they are diagonal here, a K x D array of per-feature variances is sufficient.
- Weights: K-dimensional array, summing to 1.
- To compute Gaussian densities, recall the formula for the probability density of a multivariate Gaussian.
- For the E-step, compute responsibilities using the mixture components and their densities.
- For the M-step, update means, covariances, and weights using the responsibilities.
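
As a reference for the hints above, the standard EM updates for a diagonal-covariance GMM can be written as follows. The notation is chosen for this note: $\gamma_{nk}$ is the responsibility of component $k$ for sample $n$, $\sigma_{kd}^2$ is the variance of feature $d$ in component $k$, and $\pi_k$ is a mixture weight (the plain $\pi$ inside the density is the usual constant).

$$
\mathcal{N}\big(\mathbf{x} \mid \boldsymbol{\mu}_k, \mathrm{diag}(\boldsymbol{\sigma}_k^2)\big)
= \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\sigma_{kd}^2}} \exp\!\left(-\frac{(x_d - \mu_{kd})^2}{2\sigma_{kd}^2}\right)
$$

$$
\text{E-step:} \quad
\gamma_{nk} = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}
$$

$$
\text{M-step:} \quad
N_k = \sum_{n=1}^{N} \gamma_{nk}, \qquad
\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} \, \mathbf{x}_n, \qquad
\sigma_{kd}^2 = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} \, (x_{nd} - \mu_{kd})^2, \qquad
\pi_k = \frac{N_k}{N}
$$

In practice a small constant is often added to each variance so that a component cannot collapse onto a single point.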

After implementing and running EM, extract cluster assignments by taking `argmax` of responsibilities.
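
The density and E-step hints can be illustrated with a short sketch. This is one possible approach under stated assumptions, not the required solution: it stores variances as a K x D array, works in log space to avoid underflow, and uses hypothetical helper names (`log_gaussian_diag`, `responsibilities_from_logs`) and toy data that are not part of the exercise template.

```python
import numpy as np

def log_gaussian_diag(X, means, variances):
    """Log-density of every sample under every diagonal Gaussian.

    X: (N, D), means: (K, D), variances: (K, D). Returns an (N, K) array.
    """
    D = X.shape[1]
    diff = X[:, None, :] - means[None, :, :]                      # (N, K, D)
    log_det = np.sum(np.log(variances), axis=1)                   # (K,)
    mahalanobis = np.sum(diff ** 2 / variances[None, :, :], axis=2)  # (N, K)
    return -0.5 * (D * np.log(2 * np.pi) + log_det[None, :] + mahalanobis)

def responsibilities_from_logs(log_pdf, weights):
    """E-step in log space: normalize weight * density across components."""
    log_joint = log_pdf + np.log(weights)[None, :]                # (N, K)
    log_joint -= log_joint.max(axis=1, keepdims=True)             # subtract row max for stability
    resp = np.exp(log_joint)
    return resp / resp.sum(axis=1, keepdims=True)

# Toy inputs (assumptions for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(6, 2))
means_demo = np.array([[0.0, 0.0], [3.0, 3.0]])
variances_demo = np.ones((2, 2))
weights_demo = np.array([0.5, 0.5])

log_pdf_demo = log_gaussian_diag(X_demo, means_demo, variances_demo)
resp_demo = responsibilities_from_logs(log_pdf_demo, weights_demo)
labels_demo = resp_demo.argmax(axis=1)  # hard cluster assignments
print(resp_demo.sum(axis=1))            # each row sums to 1
```

Subtracting the row-wise maximum before exponentiating leaves the normalized responsibilities unchanged but keeps `np.exp` from underflowing when a sample is far from every component.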


In [None]:
import numpy as np

# Set number of components
K = 3
N, D = X_scaled.shape

# TODO: Initialize means, covariances, and weights
# means = ...
# covariances = ...  # Diagonal covariances, so you can store just var per component and construct diag
# weights = ...

# TODO: Define a function to compute Gaussian PDF values for each component
# def gaussian_pdf(X, mean, cov):
#     # X: N x D
#     # mean: D
#     # cov: D x D (diagonal)
#     # return pdf values: N-dim array
#     pass

# TODO: E-step
# def e_step(X, means, covariances, weights):
#     # Compute responsibilities
#     pass

# TODO: M-step
# def m_step(X, responsibilities):
#     # Update means, covariances, weights
#     pass

# Run EM

# cluster_labels_numpy = ... # argmax of responsibilities


## Part 4: Implementing GMM with EM **using PyTorch**

Now, we will implement the same algorithm using PyTorch tensors. The steps are similar, but you will use `torch` operations. This might simplify certain operations and open the door to GPU acceleration.

**Tasks:**  
- Convert `X_scaled` to a PyTorch tensor.
- Initialize parameters as `torch.tensor`s.
- Implement E-step and M-step in PyTorch.
- Run EM for a fixed number of iterations.
- Extract cluster labels.

**Hints:**
- Use `torch.tensor(X_scaled, dtype=torch.float32)` to create a PyTorch tensor.
- Operations are similar but use `torch.sum`, `torch.exp`, etc.
- Watch out for broadcasting rules and ensure shapes align.
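
The broadcasting hint can be made concrete with a small sketch. It assumes diagonal covariances stored as a K x D tensor of variances and uses a hypothetical helper name (`log_gaussian_diag_torch`) and toy tensors; it is one way to arrange the shapes, not the required solution. `torch.logsumexp` handles the normalization over components in a numerically stable way.

```python
import math
import torch

def log_gaussian_diag_torch(X, means, variances):
    """X: (N, D), means: (K, D), variances: (K, D). Returns (N, K) log-densities."""
    D = X.shape[1]
    diff = X.unsqueeze(1) - means.unsqueeze(0)                     # (N, 1, D) - (1, K, D) -> (N, K, D)
    log_det = torch.log(variances).sum(dim=1)                      # (K,)
    mahalanobis = (diff ** 2 / variances.unsqueeze(0)).sum(dim=2)  # (N, K)
    return -0.5 * (D * math.log(2 * math.pi) + log_det.unsqueeze(0) + mahalanobis)

# Toy inputs (assumptions for illustration only)
torch.manual_seed(0)
X_demo = torch.randn(6, 2)
means_demo = torch.tensor([[0.0, 0.0], [3.0, 3.0]])
variances_demo = torch.ones(2, 2)
weights_demo = torch.tensor([0.5, 0.5])

# E-step in log space: log(weight) + log-density, normalized over components
log_joint = log_gaussian_diag_torch(X_demo, means_demo, variances_demo) + torch.log(weights_demo)
log_norm = torch.logsumexp(log_joint, dim=1, keepdim=True)         # (N, 1)
resp_demo = torch.exp(log_joint - log_norm)                        # (N, K), rows sum to 1
print(resp_demo.sum(dim=1))
```

Writing the shapes next to each line, as in the comments above, is a cheap way to catch silent broadcasting mistakes such as an (N,) vector being combined with an (N, K) matrix.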


In [None]:
import torch

# TODO: Convert data to torch tensor
# X_torch = ...

# TODO: Initialize means, covariances, weights as torch tensors
# means_torch = ...
# covariances_torch = ...
# weights_torch = ...

# TODO: Implement gaussian_pdf using torch operations
# def gaussian_pdf_torch(X, mean, cov):
#     pass

# TODO: E-step in torch
# def e_step_torch(X, means, covariances, weights):
#     pass

# TODO: M-step in torch
# def m_step_torch(X, responsibilities):
#     pass

# Run EM in torch

# cluster_labels_torch = ... # argmax of responsibilities


## Part 5: Evaluating and Comparing Both Implementations

**Tasks:**  
- Use `adjusted_rand_score` to compare the cluster labels from both methods against the true labels `y`.
- Print the ARI for both NumPy and PyTorch implementations.
- Visually inspect whether both implementations yield similar results.
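
One point worth keeping in mind when reading the scores: cluster indices produced by EM are arbitrary, and the ARI depends only on how samples are grouped, not on which integer names each group. The toy label lists below are made up for illustration.

```python
from sklearn.metrics import adjusted_rand_score

# Toy labels (assumptions for illustration only)
true_demo = [0, 0, 1, 1, 2, 2]
pred_same_partition = [2, 2, 0, 0, 1, 1]   # same grouping, different cluster names
pred_one_error = [2, 2, 0, 0, 1, 0]        # last sample placed in the wrong group

ari_same = adjusted_rand_score(true_demo, pred_same_partition)
ari_error = adjusted_rand_score(true_demo, pred_one_error)

print("ARI (relabelled, identical partition):", ari_same)
print("ARI (one misassigned sample):", ari_error)
```

Because of this invariance, the NumPy and PyTorch label arrays do not need to be matched up or reordered before they are scored against `y`.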

**Questions:**
- Are the ARI scores similar or different between the two implementations?
- Which code was easier to write and maintain?
- Which implementation might be easier to extend to more complex models?


In [None]:
from sklearn.metrics import adjusted_rand_score

# TODO: Compute ARI for numpy-based clustering
# ari_numpy = adjusted_rand_score(y, cluster_labels_numpy)
# print("ARI (NumPy):", ari_numpy)

# TODO: Compute ARI for torch-based clustering
# ari_torch = adjusted_rand_score(y, cluster_labels_torch)
# print("ARI (PyTorch):", ari_torch)


**Questions:**  
1. **Implementation Detail:** What are the main differences in code complexity between a plain NumPy-based implementation and a PyTorch-based one?  
answer:

2. **Performance:** Which implementation is likely to be more efficient or easier to parallelize and why?  
answer:

3. **Numerical Stability:** How might PyTorch’s built-in functions improve numerical stability compared to a manual implementation?  
answer:

4. **Extensibility:** If you wanted to add more complex features (e.g., full covariance matrices, regularization), which approach would be simpler and why?  
answer: