# **1. Principal Component Analysis (PCA)**

---

## **1. Introduction**

* **PCA** is a **dimensionality reduction technique** used to transform high-dimensional data into fewer dimensions while preserving as much **variance** (information) as possible.
* It creates **new features (principal components)** that are linear combinations of the original features.
* Useful when:

  * Data has many correlated features.
  * You want to visualize high-dimensional data in 2D/3D.
  * You want to reduce noise or improve downstream ML model performance.

📌 Example:
Reducing 100 stock market indicators to 5 key “principal components” that explain most of the variation.

---

## **2. Key Idea**

* Find new axes (**principal components**) such that:

  * The **first component** explains the maximum variance.
  * The **second component** explains the maximum remaining variance (orthogonal to first).
  * Continue until all components are extracted.

---

## **3. PCA Workflow (Step by Step)**

1. **Standardize the data**
   (important because PCA is affected by feature scale).

2. **Compute covariance matrix**

   $$
   \Sigma = \frac{1}{n} (X - \bar{X})^T (X - \bar{X})
   $$

3. **Compute eigenvalues & eigenvectors** of covariance matrix.

   * Eigenvectors → Principal Components (directions of maximum variance).
   * Eigenvalues → Variance explained by each component.

4. **Sort eigenvectors by eigenvalues** (descending order).

5. **Choose top k components** that explain most of the variance.

6. **Transform data** into new subspace.
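
The six steps above can be sketched with NumPy. This is a minimal illustration rather than a production implementation: the function name `pca` and the toy data are assumptions made for the example, and the covariance uses the same $1/n$ convention as the formula in step 2.

```python
import numpy as np

def pca(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    X = np.asarray(X, dtype=float)
    # Step 1: standardize each feature to zero mean and unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix with the 1/n convention.
    cov = (Z.T @ Z) / Z.shape[0]
    # Step 3: eigen-decomposition (eigh is intended for symmetric matrices).
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 4: eigh returns ascending eigenvalues, so reorder to descending.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Steps 5-6: keep the top-k directions and project the data onto them.
    W = eigvecs[:, :k]
    return Z @ W, eigvals, W

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
# Hypothetical toy data: two strongly correlated features plus one independent feature.
X = np.hstack([base, 2 * base + 0.1 * rng.normal(size=(200, 1)), rng.normal(size=(200, 1))])
scores, eigvals, W = pca(X, k=2)
print(scores.shape)             # (200, 2)
print(eigvals / eigvals.sum())  # explained variance ratio per component
```

Because the two correlated features carry nearly the same information, most of the variance should land on the first component in this toy setup.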

---

## **4. Mathematical Intuition**

We want to maximize the variance of the projection $z = Xw$ (with $X$ centered) along a new axis $w$:

$$
\text{Var}(z) = w^T \Sigma w
$$

Subject to constraint:

$$
||w|| = 1
$$

This is solved by eigen-decomposition → eigenvector with largest eigenvalue gives **first principal component**.
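
The step from the constrained problem to an eigenvalue problem can be made explicit with a Lagrange multiplier. The multiplier $\lambda$ is introduced here for the derivation, and the constraint is written in the equivalent form $w^T w = 1$:

$$
\mathcal{L}(w, \lambda) = w^T \Sigma w - \lambda (w^T w - 1)
$$

$$
\frac{\partial \mathcal{L}}{\partial w} = 2 \Sigma w - 2 \lambda w = 0 \quad \Rightarrow \quad \Sigma w = \lambda w
$$

$$
\text{Var}(z) = w^T \Sigma w = \lambda \, w^T w = \lambda
$$

So every stationary point is an eigenvector of $\Sigma$, and the variance it captures equals its eigenvalue, which is why the largest eigenvalue gives the first component.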

---

## **5. Explained Variance**

* Eigenvalues indicate how much variance each principal component captures.
* **Explained Variance Ratio:**

$$
\text{EVR}_i = \frac{\lambda_i}{\sum_j \lambda_j}
$$

📌 Rule of thumb: keep enough components to capture **~90–95%** of the variance; the right cutoff depends on the task.
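
As a small sketch of how the ratio and a variance cutoff fit together, the helper below picks the smallest number of components whose cumulative explained variance ratio reaches a threshold. The function name `components_for_variance` and the eigenvalues are hypothetical, chosen only for illustration.

```python
from itertools import accumulate

def components_for_variance(eigenvalues, threshold=0.95):
    """Return (k, ratios): the smallest k whose cumulative EVR reaches threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    ratios = [lam / total for lam in ordered]
    for k, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative >= threshold - 1e-12:  # small tolerance for float rounding
            return k, ratios
    return len(ratios), ratios

# Hypothetical eigenvalues, not taken from any real dataset.
k, ratios = components_for_variance([4.0, 3.0, 2.0, 0.5, 0.5], threshold=0.9)
print(k, ratios)
```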

---

## **6. Pros & Cons**

### ✅ Pros

* Reduces dimensionality → faster training.
* Removes multicollinearity.
* Visualization in 2D/3D possible.
* Can denoise data.

### ❌ Cons

* Components are **linear combinations** (less interpretable).
* Sensitive to scaling of features.
* Assumes linear relationships.

---

## **7. Real-Life Applications**

* **Finance:** Reduce correlated indicators into few principal indexes.
* **Image Compression:** Reduce pixel space while retaining key patterns.
* **Genomics:** Reduce thousands of gene expressions into few key dimensions.
* **Marketing:** Customer segmentation with reduced features.
* **Speech Recognition:** Reduce acoustic features before classification.

---

## **8. Visualization (Conceptual)**

Original space (X, Y features):

```
 ● ● ● ● ● ●
      ↘  New Axis (PC1: Max variance)
       ↘
        ↘  (PC2: Orthogonal to PC1)
```

After PCA → rotate axes → project data into fewer dimensions.

---

## **9. Example**

Dataset: Students with features [Math, Science, English Scores].

* PCA may reduce this into 2 components:

  * PC1 = overall academic strength.
  * PC2 = preference for math/science vs language.

---

## **10. Key Takeaways**

* PCA = transforms correlated features → uncorrelated principal components.
* Helps in **dimensionality reduction, noise removal, visualization**.
* Standardize data before applying PCA whenever features are on different scales.
* Decide number of PCs using **explained variance (scree plot / elbow method)**.

---
---
---

# **2. t-SNE (t-distributed Stochastic Neighbor Embedding)**

---

## **1. Introduction**

* **t-SNE** is a **non-linear dimensionality reduction technique** used primarily for **visualization** of high-dimensional data in **2D or 3D**.
* Unlike PCA (linear), t-SNE preserves **local structure**, meaning similar points in high dimensions stay close in low dimensions.
* Commonly used for:

  * Image embeddings
  * Word embeddings (NLP)
  * Clustering visualization

📌 Example:
Visualizing handwritten digits (MNIST dataset) in 2D so that digits 0–9 form separate clusters.

---

## **2. Key Idea**

1. Compute **pairwise similarities** between points in high-dimensional space (using Gaussian distribution).
2. Map points to low-dimensional space such that **similar points stay close**, and **dissimilar points stay apart**.
3. Minimize **Kullback-Leibler (KL) divergence** between high-dimensional and low-dimensional distributions.

---

## **3. How t-SNE Works (Conceptually)**

1. **Compute pairwise probabilities** $p_{ij}$ in high-dimensional space:

   $$
   p_{j|i} = \frac{\exp(-||x_i - x_j||^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-||x_i - x_k||^2 / 2\sigma_i^2)}
   $$

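   * Symmetrize: $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$, where $N$ is the number of points and each bandwidth $\sigma_i$ is chosen so that the distribution $p_{\cdot|i}$ matches the target perplexity.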

2. **Map points to low-dimensional space** ($y_i$).

3. Compute **similarity $q_{ij}$** in low-dimensional space using **Student-t distribution**:

   $$
   q_{ij} = \frac{(1 + ||y_i - y_j||^2)^{-1}}{\sum_{k \neq l} (1 + ||y_k - y_l||^2)^{-1}}
   $$

4. **Minimize KL divergence** between high-dimensional $p_{ij}$ and low-dimensional $q_{ij}$:

   $$
   KL(P||Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}
   $$
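
The quantities in steps 1 to 4 can be computed directly with NumPy. This is only a sketch of the objective, not of the optimizer: it assumes a single fixed bandwidth `sigma` for every point (real t-SNE tunes $\sigma_i$ per point from the perplexity), and the function name `tsne_quantities` and the random data are illustrative.

```python
import numpy as np

def tsne_quantities(X, Y, sigma=1.0):
    """Return (P, Q, KL) for high-dim points X and low-dim points Y."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    n = X.shape[0]
    # Squared Euclidean distances in the high-dimensional space.
    d_high = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Conditional probabilities: row i holds p_{j|i}, with p_{i|i} = 0.
    affinities = np.exp(-d_high / (2.0 * sigma ** 2))
    np.fill_diagonal(affinities, 0.0)
    P_cond = affinities / affinities.sum(axis=1, keepdims=True)
    # Symmetrize: p_ij = (p_{j|i} + p_{i|j}) / (2N).
    P = (P_cond + P_cond.T) / (2.0 * n)
    # Student-t similarities (one degree of freedom) in the low-dimensional space.
    d_low = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    kernel = 1.0 / (1.0 + d_low)
    np.fill_diagonal(kernel, 0.0)
    Q = kernel / kernel.sum()
    # KL(P || Q), summed over pairs with p_ij > 0.
    mask = P > 0
    kl = float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))
    return P, Q, kl

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))  # hypothetical high-dimensional points
Y = rng.normal(size=(6, 2))  # hypothetical, untrained 2D embedding
P, Q, kl = tsne_quantities(X, Y)
print(P.sum(), Q.sum(), kl)
```

Both `P` and `Q` sum to 1 over all pairs, and t-SNE moves the rows of `Y` by gradient descent to shrink `kl`.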

---

## **4. Key Features**

* Preserves **local neighborhood structure**.
* Can reveal **clusters** in data visually.
* **Non-linear mapping**, unlike PCA.

---

## **5. Pros & Cons**

### ✅ Pros

* Excellent for **visualizing high-dimensional data**.
* Reveals hidden **clusters or patterns**.
* Preserves local similarities better than linear methods.

### ❌ Cons

* Computationally expensive: the exact algorithm costs $O(N^2)$ time and memory in the number of points $N$.
* Scales poorly beyond tens of thousands of points, though **Barnes-Hut t-SNE** reduces the cost to roughly $O(N \log N)$.
* Non-deterministic results (may vary across runs).
* Low-dimensional distances are **not always meaningful globally**.

---

## **6. Hyperparameters**

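* **Perplexity:** Roughly the effective number of neighbors each point considers (5–50 is common).
* **Learning Rate:** Affects convergence; values in the range 10–1000 are typical, and a poor choice can leave the embedding collapsed into a ball or a dense cloud.
* **Number of Iterations:** Must be large enough for the optimization to converge; beyond that point extra iterations change little.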
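
To make "effective number of neighbors" concrete: perplexity is defined as $2^{H}$, where $H$ is the Shannon entropy (in bits) of a point's conditional distribution over its neighbors. The short sketch below uses a hypothetical helper `perplexity` and made-up distributions to show that a uniform distribution over 8 neighbors has perplexity 8, while a skewed one has fewer effective neighbors.

```python
import math

def perplexity(probabilities):
    """Perplexity 2**H of a discrete distribution, with H the entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2.0 ** entropy

# A point with 8 equally similar neighbors.
uniform = [1 / 8] * 8
# Hypothetical skewed case: one dominant neighbor lowers the effective count.
skewed = [0.9] + [0.1 / 7] * 7
print(perplexity(uniform), perplexity(skewed))
```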

---

## **7. Real-Life Applications**

* **MNIST / Image Datasets:** Visualize digits or image embeddings.
* **NLP:** Visualize word embeddings (Word2Vec, GloVe).
* **Bioinformatics:** Gene expression data visualization.
* **Clustering:** Visual validation of clusters.
* **Anomaly Detection:** Visualize outliers in high-dimensional space.

---

## **8. Visualization (Conceptual)**

```
High-dimensional points → t-SNE → 2D map

High-D Space:    ● ● ● ● ●   (similar points close)
Low-D Map:       ○ ○ ○ ○ ○   (similar points stay close, clusters appear)
```

Example: MNIST digits

```
Cluster 0  Cluster 1   Cluster 2
●●●       ●●●        ●●●
```

---

## **9. Key Takeaways**

* t-SNE = **non-linear, local-preserving dimensionality reduction**.
* Best for **visualization**, not for direct ML modeling.
* Reveals hidden clusters in high-dimensional datasets.
* Sensitive to **perplexity and initialization**, so tuning is often needed.

---
---
---

# **3. Association Rule Mining**

---

## **1. Introduction**

* Association Rule Mining finds **if-then patterns** in data.
* It’s most famous in **Market Basket Analysis** (e.g., “People who buy bread also buy butter”).
* Works on **transactional datasets** (like shopping carts, clickstreams, medical records).

📌 Example:

* Rule: {Milk, Bread} → {Butter}
* Meaning: Customers who buy milk and bread are also likely to buy butter.

---

## **2. Key Terms**

Let’s assume a dataset of transactions (shopping baskets).

* **Itemset:** A collection of items (e.g., {milk, bread}).

* **Support:** Frequency of itemset in dataset.
  $$
  \text{Support}(A) = \frac{\text{Number of transactions containing } A}{\text{Total transactions}}
  $$

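* **Confidence:** Strength of the rule (estimated probability of B given A). Here $A \cup B$ denotes the itemset containing every item of A and B, so its support counts transactions that contain both.
  $$
  \text{Confidence}(A \rightarrow B) = \frac{\text{Support}(A \cup B)}{\text{Support}(A)}
  $$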

* **Lift:** How much more likely A and B occur together compared to random chance.
  $$
  \text{Lift}(A \rightarrow B) = \frac{\text{Confidence}(A \rightarrow B)}{\text{Support}(B)}
  $$

  * Lift > 1 → Positive association.
  * Lift < 1 → Negative association.

---

## **3. Example**

Dataset:

* T1: {Milk, Bread, Butter}
* T2: {Milk, Bread}
* T3: {Bread, Butter}
* T4: {Milk, Butter}
* T5: {Milk, Bread, Butter}

Rule: **{Milk, Bread} → {Butter}**

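* Support = Support({Milk, Bread, Butter}) = 2/5 = 0.4 (T1 and T5).
* Confidence = 0.4 / Support({Milk, Bread}) = 0.4 / 0.6 ≈ 0.67.
* Lift = 0.67 / Support({Butter}) = 0.67 / 0.8 ≈ 0.83 (slightly negative association: butter appears in 80% of all baskets but in only 67% of baskets with milk and bread).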
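
The three metrics can be checked by computing them directly from the five transactions. This is a plain-Python sketch; the helper names `support`, `confidence`, and `lift` are chosen for the example and do not come from any particular library.

```python
transactions = [
    {"Milk", "Bread", "Butter"},  # T1
    {"Milk", "Bread"},            # T2
    {"Bread", "Butter"},          # T3
    {"Milk", "Butter"},           # T4
    {"Milk", "Bread", "Butter"},  # T5
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Support of the combined itemset divided by support of the antecedent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence of the rule divided by the support of the consequent."""
    return confidence(antecedent, consequent) / support(consequent)

A, B = {"Milk", "Bread"}, {"Butter"}
print(f"support    = {support(A | B):.3f}")
print(f"confidence = {confidence(A, B):.3f}")
print(f"lift       = {lift(A, B):.3f}")
```

For this dataset the script reports a support of 0.400, a confidence of 0.667, and a lift of 0.833.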

---

## **4. Algorithms for Association Rule Mining**

### 🔹 **Apriori Algorithm**

* Iteratively generates frequent itemsets using support threshold.
* Uses the **downward closure property**: if an itemset is frequent, all its subsets must also be frequent; equivalently, any superset of an infrequent itemset can be pruned without counting it.

Steps:

1. Generate candidate itemsets.
2. Prune based on minimum support.
3. Generate rules with minimum confidence.
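
The three steps can be sketched in plain Python. This is a compact teaching version under simple assumptions (small in-memory transactions, support recomputed by scanning), not an optimized implementation; the function names `apriori` and `rules` are illustrative.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset: support} for all itemsets meeting min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}
    frequent = {}
    k = 1
    while current:
        frequent.update({s: support(s) for s in current})
        k += 1
        # Step 1: join frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Downward closure: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Step 2: prune candidates below the minimum support.
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

def rules(frequent, min_confidence):
    """Step 3: yield (antecedent, consequent, confidence) for strong rules."""
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = sup / frequent[antecedent]
                if conf >= min_confidence:
                    yield antecedent, itemset - antecedent, conf

baskets = [
    {"Milk", "Bread", "Butter"}, {"Milk", "Bread"}, {"Bread", "Butter"},
    {"Milk", "Butter"}, {"Milk", "Bread", "Butter"},
]
frequent = apriori(baskets, min_support=0.6)
for antecedent, consequent, conf in rules(frequent, min_confidence=0.7):
    print(set(antecedent), "->", set(consequent), round(conf, 2))
```

With a minimum support of 0.6, the three single items and the three pairs are frequent, while the full three-item set (support 0.4) is dropped.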

---

### 🔹 **FP-Growth (Frequent Pattern Growth)**

* Faster than Apriori (avoids candidate generation).
* Builds a **prefix tree (FP-tree)** of itemsets.
* Extracts frequent itemsets directly.

---

## **5. Pros & Cons**

### ✅ Pros

* Simple and intuitive (especially Apriori).
* Helps in **recommendation systems**.
* Useful for **business insights**.

### ❌ Cons

* Can generate **too many rules** (difficult to interpret).
* Computationally expensive for very large datasets.
* Focuses on correlation, not causation.

---

## **6. Real-Life Applications**

* **Retail (Market Basket Analysis):** Find product bundles (e.g., chips + soda).
* **E-commerce:** Cross-selling & upselling recommendations.
* **Healthcare:** Discover symptom–disease associations.
* **Web Usage Mining:** Identify frequently co-visited pages.
* **Fraud Detection:** Detect unusual item combinations.

---

## **7. Visualization**

```
{Milk, Bread} → {Butter}
Support = 40%
Confidence ≈ 67%
Lift ≈ 0.83
```

Think of it as:

* **Support:** How popular the rule is.
* **Confidence:** How reliable the rule is.
* **Lift:** How strong the association is compared to chance.

---

## **8. Key Takeaways**

* Association Rule Mining finds **hidden relationships** in data.
* Metrics: **Support, Confidence, Lift**.
* Algorithms: **Apriori** (classic, slow) & **FP-Growth** (fast, scalable).
* Widely used in **retail, healthcare, fraud detection, recommendation systems**.

---
---
---

# **4. Anomaly Detection**

---

## **1. Introduction**

* **Anomalies (Outliers)** are data points that **don’t follow the normal pattern** of the dataset.
* Goal: Identify unusual observations that may indicate **errors, fraud, attacks, or rare events**.

📌 Examples:

* Unusually high credit card transaction → Fraud.
* Sudden spike in server traffic → Cyberattack.
* Abnormal medical test values → Disease indication.

---

## **2. Types of Anomalies**

1. **Point Anomaly:** A single data point is unusual.

   * Example: A \$10,000 withdrawal in a dataset where typical withdrawals are \$100–\$500.

2. **Contextual Anomaly:** An observation is unusual in a specific context.

   * Example: A high temperature of 35°C is normal in summer but abnormal in winter.

3. **Collective Anomaly:** A group of data points together is abnormal.

   * Example: Multiple failed login attempts in sequence → Intrusion.

---

## **3. Approaches to Anomaly Detection**

### 🔹 **Statistical Methods**

* Assume data follows a distribution (e.g., Gaussian).
* Flag points far from mean (using **z-scores** or thresholds).

### 🔹 **Distance-Based Methods**

* Points far from others are anomalies.
* Example: **k-Nearest Neighbors (kNN)** anomaly detection.
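
One common way to turn this idea into a score is to use each point's distance to its k-th nearest neighbor. The sketch below assumes Euclidean distance and a small dataset that fits in memory; the function name `knn_anomaly_scores` and the toy points are hypothetical.

```python
import numpy as np

def knn_anomaly_scores(X, k=3):
    """Score each point by its distance to its k-th nearest neighbor."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)          # a point is not its own neighbor
    return np.sort(dist, axis=1)[:, k - 1]  # larger score = more isolated

# Hypothetical 2D data: a tight cluster plus one distant point (last row).
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [8, 8]]
scores = knn_anomaly_scores(X, k=3)
print(scores.argmax())  # index of the most isolated point
```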

### 🔹 **Density-Based Methods**

* Points in low-density regions are anomalies.
* Example: **DBSCAN**, **Local Outlier Factor (LOF)**.

### 🔹 **Clustering-Based**

* Normal data forms clusters; anomalies fall far from every cluster or end up in very small clusters.
* Example: K-Means, flagging points that lie far from their nearest centroid.

### 🔹 **Machine Learning Methods**

* **One-Class SVM:** Learns a boundary around normal data, flags points outside.
* **Isolation Forest:** Randomly splits data; anomalies are easier to isolate.
* **Autoencoders (Deep Learning):** Reconstruct input data; anomalies have high reconstruction error.

---

## **4. Mathematical Intuition**

### Z-Score Method (Statistical)

$$
z = \frac{x - \mu}{\sigma}
$$

* If |z| > threshold (e.g., 3), point is anomaly.
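
A minimal sketch of the z-score rule using only the standard library. It assumes the population standard deviation and non-constant data (otherwise $\sigma = 0$); the function name `zscore_anomalies` and the withdrawal amounts are made up for illustration.

```python
from statistics import mean, pstdev

def zscore_anomalies(values, threshold=3.0):
    """Return (value, z) pairs whose absolute z-score exceeds threshold."""
    mu, sigma = mean(values), pstdev(values)
    return [(x, (x - mu) / sigma) for x in values if abs(x - mu) / sigma > threshold]

# Hypothetical withdrawal amounts: 19 typical values and one extreme value.
withdrawals = [100, 120, 150, 180, 200, 220, 250, 280, 300, 320,
               350, 380, 400, 420, 450, 130, 270, 330, 410, 10000]
flagged = zscore_anomalies(withdrawals)
print(flagged)
```

One caveat worth knowing: with the population standard deviation, no $|z|$ can exceed $\sqrt{n-1}$, so a threshold of 3 can only trigger when there are more than 10 data points.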

### Local Outlier Factor (LOF)

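* Ratio of the average local density of a point's neighbors to the point's own local density.
* LOF ≈ 1 → density similar to neighbors (inlier); LOF substantially > 1 → anomaly.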
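
The LOF score can be sketched with NumPy using the standard definitions of k-distance, reachability distance, and local reachability density. This version is a simplification: it takes exactly `k` neighbors (ignoring ties) and assumes there are no duplicate points; the function name `local_outlier_factor` and the grid data are illustrative.

```python
import numpy as np

def local_outlier_factor(X, k=3):
    """Return one LOF score per row of X."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)                 # exclude self-distances
    neighbors = np.argsort(dist, axis=1)[:, :k]    # indices of the k nearest points
    k_distance = np.sort(dist, axis=1)[:, k - 1]   # distance to the k-th neighbor
    # Reachability distance: max(k-distance of the neighbor, actual distance).
    d_to_neighbors = np.take_along_axis(dist, neighbors, axis=1)
    reach_dist = np.maximum(k_distance[neighbors], d_to_neighbors)
    # Local reachability density: inverse of the mean reachability distance.
    lrd = 1.0 / reach_dist.mean(axis=1)
    # LOF: mean density of the neighbors divided by the point's own density.
    return lrd[neighbors].mean(axis=1) / lrd

# Hypothetical data: a 3x3 unit grid plus one far-away point (last row).
grid = [[x, y] for x in range(3) for y in range(3)]
X = grid + [[10, 10]]
scores = local_outlier_factor(X, k=3)
print(np.round(scores, 2))
```

In this toy setup the grid points score close to 1, while the distant point scores far above 1.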

### Isolation Forest

* Build trees by repeatedly picking a random feature and a random split value.
* Outliers → isolated quickly (shorter path length).
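
The path-length intuition can be shown with a heavily simplified sketch. It is not the full Isolation Forest algorithm: it assumes 1-D data (so there is no random feature choice), uses no subsampling, follows a single point through random splits instead of building whole trees, and skips the score normalization. The names `path_length` and `average_path_length` are hypothetical.

```python
import random

def path_length(point, data, rng, depth=0, max_depth=10):
    """Number of random splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the side of the split that still contains the point.
    same_side = [x for x in data if (x < split) == (point < split)]
    return path_length(point, same_side, rng, depth + 1, max_depth)

def average_path_length(point, data, n_trees=200, seed=0):
    """Average isolation depth over many independent random split sequences."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees

# Hypothetical 1-D data: values near 10 plus one value far away.
data = [9.8, 9.9, 10.0, 10.1, 10.2, 10.3, 9.7, 10.4, 10.05, 9.95, 50.0]
outlier_depth = average_path_length(50.0, data)
inlier_depth = average_path_length(10.0, data)
print(outlier_depth, inlier_depth)
```

The far-away value is usually cut off by the very first split, so its average depth is much smaller than that of a value inside the cluster.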

---

## **5. Pros & Cons**

### ✅ Pros

* Works across finance, security, healthcare, IoT.
* Can detect both known & unknown anomalies.
* Many algorithm choices (statistical → deep learning).

### ❌ Cons

* Often **domain-specific thresholds** needed.
* Often produces many false positives (unusual but legitimate data flagged as anomalies).
* Distance- and density-based methods degrade in very high dimensions, where distances become less informative.

---

## **6. Real-Life Applications**

* **Fraud Detection:** Credit card, insurance, tax fraud.
* **Cybersecurity:** Intrusion detection, malware detection.
* **Healthcare:** Detect rare diseases via abnormal test results.
* **Manufacturing (IoT):** Machine failure prediction.
* **Finance:** Stock market irregularities.
* **Climate Science:** Detect abnormal weather events.

---

## **7. Visualization (Conceptual)**

Normal data (clustered) vs anomalies:

```
● ● ● ● ● ● ● ●   ○
● ● ● ● ● ● ● ●
● ● ● ● ●     ○
```

* ● = Normal points
* ○ = Anomalies (outliers)

---

## **8. Key Takeaways**

* Anomalies = **data points deviating from normal patterns**.
* Types: **Point, Contextual, Collective**.
* Methods: Statistical, distance, density, clustering, ML.
* Powerful applications in **fraud detection, cybersecurity, healthcare, IoT**.

---
---
---