# Unsupervised Learning 

![Image](./image/UnsupervisedLearning.png)

### Key Points:

- **Unsupervised Learning**: Unlike supervised learning, where input features $x$ are paired with target outputs $y$ to learn from, unsupervised learning deals with data without explicit labels.

- **Discovering Data Structure**: The goal is to uncover interesting structures within the data, such as grouping data points that are similar to each other into clusters.


# K-Means 


### Process: 

1. **Initialization**: Randomly pick K points as the initial guesses for the cluster centers (centroids). In this example, K = 2.

![Image](./image/K-mean(1).png)

2. **Assignment**: Assign each data point to the nearest cluster centroid. This step groups the data points based on which centroid they are closest to.

![Image](./image/K-mean(2).png)

3. **Update**: Update each cluster centroid to the average location of the data points assigned to it. This step recalculates the position of each centroid based on the current members of its cluster.

![Image](./image/K-mean(3).png)

4. **Repeat**: Alternate between the assignment and update steps until the centroids no longer change position, indicating the algorithm has converged.

![Image](./image/K-mean(4).png)

### Pseudocode:

```
// Initialization
Initialize K cluster centroids randomly (mu_1, mu_2, ..., mu_K)

Repeat {
  // Assignment step
  For every point i in the dataset {
    Assign the point to the closest centroid:
    c_i = argmin_k ||x^(i) - mu_k||^2
  }

  // Update step
  For each cluster centroid k {
    Set mu_k to be the mean of points assigned to cluster k:
    mu_k = mean(all points x^(i) where c_i = k)
  }
  
  // Check for convergence
  If no points change their cluster assignment and centroids don't move,
  then stop the loop.
}
```
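
The pseudocode above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the names `k_means`, `squared_distance`, and `mean_point`, the toy data, and the choice to leave an empty cluster's centroid unchanged are all assumptions made for this example. It uses only the standard library to stay self-contained; a vectorized NumPy version would be more typical in practice.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iters=100, seed=0):
    """Return (centroids, assignments) for a list of points."""
    rng = random.Random(seed)
    # Initialization: pick k distinct training points as starting centroids.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iters):
        # Assignment step: index of the closest centroid for every point.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Convergence check: stop once no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, c in zip(points, assignments) if c == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = mean_point(members)
    return centroids, assignments


# Hypothetical toy data: two well-separated groups in 2-D.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = k_means(data, k=2)
print(centroids)
print(assignments)
```

The cluster indices themselves are arbitrary (which group ends up as cluster 0 depends on the random initialization); only the grouping matters.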

### Cost Function
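- The cost function $J$ (also called the distortion) depends on:
    - $c_1, c_2, \dots, c_m$: The cluster assignment of each point, where $c_i$ is the index of the centroid closest to $x^{(i)}$.
    - $\mu_1, \mu_2, \dots, \mu_K$: The locations of the cluster centroids.
- It is computed as: $J = \frac{1}{m} \sum_{i=1}^{m} \|x^{(i)} - \mu_{c_i}\|^2$.
- The goal is to minimize $J$ by adjusting $c_i$ and $\mu_k$. The assignment step minimizes $J$ over the $c_i$ with the centroids fixed, and the update step minimizes $J$ over the $\mu_k$ with the assignments fixed, so $J$ never increases from one iteration to the next.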
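
As a small worked example of the formula for J above, the sketch below computes the average squared distance between each point and its assigned centroid. The function name `distortion` and the toy values are hypothetical, chosen so the result is easy to check by hand.

```python
def distortion(points, centroids, assignments):
    """Average squared distance from each point to its assigned centroid."""
    total = 0.0
    for point, c in zip(points, assignments):
        mu = centroids[c]
        total += sum((x - m) ** 2 for x, m in zip(point, mu))
    return total / len(points)


# Hypothetical data: four points, two centroids.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]

good = distortion(points, centroids, [0, 0, 1, 1])
bad = distortion(points, centroids, [1, 1, 0, 0])
print(good)  # every point is 1 unit from its centroid, so J = 1.0
print(bad)   # every point is assigned to the far centroid, so J = 101.0
```

Swapping the assignments leaves the centroids unchanged but raises J sharply, which is why the assignment step always picks the nearest centroid.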

![Image](./image/CostFunction.png)

## Optimization

### K-Means Initialization

![Image](./image/K-MeanInitialize.png)

### Choosing the Number of Clusters (K)

- **Ambiguity in Choosing K**:
  - The 'correct' number of clusters (K) is not always clear-cut; different observers might see different cluster counts in the same dataset.

- **Elbow Method**:
  - Plot the cost function (distortion) for various values of K.
  - Look for an 'elbow' where the cost function begins to decrease more slowly and choose K at that point.

![Image](./image/ElbowMethod.png)

- **Practical Approach**:
  - Determine K based on the clusters' performance for a subsequent or downstream application.
  - Evaluate how well K-means serves the intended purpose of clustering rather than just minimizing the cost function.

![Image](./image/K-Value.png)
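
The elbow method described above can be sketched as follows. This is an illustration under several assumptions: the data is a hypothetical 1-D set built from three groups, `k_means_cost` is a made-up helper that runs a simple K-means from several random starts and keeps the lowest distortion, and the number of restarts and iterations are arbitrary.

```python
import random


def k_means_cost(values, k, n_init=20, iters=50, seed=0):
    """Lowest distortion J found by 1-D K-means over several random starts."""
    rng = random.Random(seed)
    best = float("inf")
    for _ in range(n_init):
        centroids = rng.sample(values, k)
        for _ in range(iters):
            # Assignment step: group values by nearest centroid.
            clusters = [[] for _ in range(k)]
            for v in values:
                nearest = min(range(k), key=lambda i: (v - centroids[i]) ** 2)
                clusters[nearest].append(v)
            # Update step: mean of each cluster (empty clusters stay put).
            centroids = [
                sum(c) / len(c) if c else centroids[j]
                for j, c in enumerate(clusters)
            ]
        cost = sum(min((v - mu) ** 2 for mu in centroids) for v in values)
        best = min(best, cost / len(values))
    return best


# Hypothetical 1-D data built from three groups (around 1, 5, and 9).
values = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2, 8.8, 9.0, 9.2]
costs = {k: k_means_cost(values, k) for k in range(1, 6)}
for k, cost in costs.items():
    print(k, round(cost, 4))
```

For this toy data the cost should fall steeply up to K = 3 and change little afterwards, which is the kind of "elbow" the method looks for. On real data the bend is often much less pronounced.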

# Gaussian (Normal) Distribution

- **Mean ($\mu$)**: The center of the curve, where the peak is located.

- **Standard Deviation ($\sigma$)**: Determines the width of the curve. The variance is $\sigma^{2} = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)} - \mu)^{2}$.

- **Probability Density Function $p(x)$**: $p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x - \mu)^{2}}{2\sigma^{2}}}$, which gives the relative likelihood of different values of $x$ under the distribution.

![Image](./image/NormalDistribution.png)
  
## Adjusting μ and σ:
- When μ = 0 and σ = 1, it's the standard normal distribution.

- Reducing σ narrows and heightens the curve; increasing σ widens and lowers it. The area under the curve always stays equal to 1.

- Changing μ shifts the center of the distribution left or right without altering its shape.
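
The density and the effect of μ and σ described above can be checked numerically. The helper name `gaussian_pdf` is an assumption for this sketch; it uses the standard form of the Gaussian density with normalizing factor 1 / (sqrt(2π) · σ).

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma at x."""
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


standard_peak = gaussian_pdf(0.0, 0.0, 1.0)  # standard normal, about 0.3989
narrow_peak = gaussian_pdf(0.0, 0.0, 0.5)    # smaller sigma, about 0.7979
shifted_peak = gaussian_pdf(3.0, 3.0, 1.0)   # shifting mu moves the peak only

print(standard_peak, narrow_peak, shifted_peak)
```

Halving σ doubles the peak height, and moving μ from 0 to 3 leaves the peak height unchanged, matching the bullets above.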

## Anomaly Detection Application:
- Given a dataset of normal examples, estimate good values for μ and σ².
- μ is calculated as the average of the data points.
- σ² is the average of the squared differences from the mean.
- New examples with low p(x) (falling far from the center of the distribution) are flagged as anomalies.
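
A minimal single-feature sketch of these steps follows. The measurements, the threshold value, and the names `fit_gaussian` and `density` are hypothetical; in practice the threshold would be tuned rather than fixed by hand.

```python
import math


def fit_gaussian(values):
    """Estimate mu (mean) and sigma^2 (average squared deviation)."""
    m = len(values)
    mu = sum(values) / m
    sigma2 = sum((v - mu) ** 2 for v in values) / m
    return mu, sigma2


def density(x, mu, sigma2):
    """Gaussian density p(x) for mean mu and variance sigma2."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(
        2.0 * math.pi * sigma2
    )


# Hypothetical measurements of a feature under normal operation.
normal_values = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0]
mu, sigma2 = fit_gaussian(normal_values)

epsilon = 0.01  # hypothetical threshold
for x in (10.05, 12.0):
    p = density(x, mu, sigma2)
    print(x, p, "anomaly" if p < epsilon else "normal")
```

Here 10.05 lies close to the estimated mean and gets a high density, while 12.0 lies many standard deviations away and falls far below the threshold.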



# Anomaly Detection

- **Purpose**: Anomaly detection algorithms learn from an unlabeled dataset of mostly normal events and flag new events that look unusual.

- **Density Estimation**: The algorithm models the probability of feature values in the dataset, identifying high-probability (normal) and low-probability (anomalous) regions.

- **Flagging Anomalies**: If the estimated probability $p(x)$ of a new data point's features falls below a threshold $\varepsilon$ (epsilon), the point is flagged as an anomaly.

![Image](./image/AnomalyDetection.png)


## Algorithm

![Image](./image/AnomalyDetectionAlgo.png)
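
One common way to turn density estimation into a detector for several features is sketched below. It assumes each feature is modeled by its own Gaussian and that p(x) is the product of the per-feature densities; that independence assumption, the names `fit_features`, `joint_density`, and `is_anomaly`, the training rows, and the threshold are all choices made for this example.

```python
import math


def fit_features(examples):
    """Per-feature mean and variance from rows of normal examples."""
    m = len(examples)
    columns = list(zip(*examples))
    mus = [sum(col) / m for col in columns]
    variances = [
        sum((v - mu) ** 2 for v in col) / m for col, mu in zip(columns, mus)
    ]
    return mus, variances


def joint_density(x, mus, variances):
    """Product of per-feature Gaussian densities (assumes independence)."""
    prob = 1.0
    for xj, mu, var in zip(x, mus, variances):
        prob *= math.exp(-((xj - mu) ** 2) / (2.0 * var)) / math.sqrt(
            2.0 * math.pi * var
        )
    return prob


def is_anomaly(x, mus, variances, epsilon):
    """Flag x when its estimated density falls below the threshold."""
    return joint_density(x, mus, variances) < epsilon


# Hypothetical training rows with two features each.
train = [(1.0, 20.0), (1.2, 21.0), (0.9, 19.5), (1.1, 20.5), (1.0, 19.0)]
mus, variances = fit_features(train)
epsilon = 1e-3  # hypothetical threshold

typical = is_anomaly((1.05, 20.2), mus, variances, epsilon)
unusual = is_anomaly((3.0, 20.0), mus, variances, epsilon)
print(typical, unusual)
```

The second point is flagged because its first feature is far from anything seen in training, even though its second feature looks ordinary.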

### Optimize
- Split the dataset:
    - **Training Set**: A large set of normal examples; anomalies are rare or assumed absent.
    - **Cross Validation and Test Sets**: A mix of normal examples and a small number of known anomalies, used to evaluate the model and fine-tune it (for example, the threshold ε).

### Evaluation Metrics:
- Count true positives, false positives, false negatives, and true negatives on the labeled cross validation examples.
- Because anomalies are rare, the classes are highly skewed, so prefer precision, recall, and F1 score over classification accuracy.
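
The sketch below shows why accuracy is misleading on skewed data, and one possible way to use F1 on the cross validation set to pick the threshold ε. The labels, density values, candidate thresholds, and the names `precision_recall_f1` and `best_epsilon` are hypothetical; labels use 1 for an anomaly and 0 for a normal example.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive (anomaly = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def best_epsilon(p_values, y_true, candidates):
    """Pick the candidate threshold with the highest F1 score."""
    best_eps, best_f1 = None, -1.0
    for eps in candidates:
        y_pred = [1 if p < eps else 0 for p in p_values]
        f1 = precision_recall_f1(y_true, y_pred)[2]
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1


# Hypothetical cross validation set: 8 normal examples, 2 anomalies.
y_cv = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A model that never flags anything is 80% accurate but has F1 = 0.
always_normal = [0] * len(y_cv)
accuracy = sum(t == p for t, p in zip(y_cv, always_normal)) / len(y_cv)
print(accuracy, precision_recall_f1(y_cv, always_normal))

# Hypothetical densities p(x) for the same examples.
p_cv = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.05, 0.02, 0.001]
eps, f1 = best_epsilon(p_cv, y_cv, candidates=[0.01, 0.03, 0.1])
print(eps, f1)
```

With these made-up values, the threshold 0.01 misses one anomaly and 0.1 adds a false positive, so 0.03 gives the best F1.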

### Choosing Features
- Anomaly detection tends to work better when each feature is approximately Gaussian.

- If a feature does not naturally fit this distribution, a transformation can make it more Gaussian-like. Common choices include $\log(x)$ and $\sqrt{x}$; any other transformation that produces a more bell-shaped histogram can also be used.
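
A rough way to compare candidate transformations is sketched below. Using sample skewness as the yardstick is an assumption of this example (a symmetric, bell-shaped feature has skewness near 0); plotting a histogram is the more usual check. The data and the name `skewness` are hypothetical.

```python
import math


def skewness(values):
    """Sample skewness: about 0 for symmetric data, positive for a long right tail."""
    m = len(values)
    mu = sum(values) / m
    var = sum((v - mu) ** 2 for v in values) / m
    third_moment = sum((v - mu) ** 3 for v in values) / m
    return third_moment / var ** 1.5


# Hypothetical right-skewed feature (all values positive, so log is defined).
raw = [1, 2, 2, 3, 4, 6, 9, 15, 40, 120]
rooted = [math.sqrt(v) for v in raw]
logged = [math.log(v) for v in raw]

print(skewness(raw), skewness(rooted), skewness(logged))
```

For this data the square root reduces the skew somewhat and the logarithm reduces it further, so the log-transformed feature would be the better candidate here.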

# Anomaly Detection vs. Supervised Learning

| Criteria                        | Anomaly Detection                                                                 | Supervised Learning                                                               |
|---------------------------------|-----------------------------------------------------------------------------------|----------------------------------------------------------------------------------|
| **Number of Positive Examples** | Very few (0-20), with a larger set of negative examples.                           | Larger set of both positive and negative examples.                               |
| **Learning from Examples**      | Learns parameters from negative examples. Positive examples are used for evaluation. | Learns from both positive and negative examples.                                 |
| **Anomaly Diversity**           | Suited for detecting new, unseen types of anomalies.                              | Assumes future positive examples will be similar to those in the training set.   |
| **Best Use Case**               | When future anomalies might be completely different from those in the training set. | When positive examples are consistent over time and future instances resemble past ones. |
| **Applications**                | - New defect detection in manufacturing<br>- Fraud detection for new methods<br>- Security (evolving threats) | - Known defect detection in manufacturing<br>- Spam email classification<br>- Weather prediction<br>- Disease diagnosis |
