## A. Clustering
- Unsupervised learning algorithm (Unlabelled Data)
- Looks at data points and groups similar instances together


**Applications**:
1. Grouping similar news
2. Market segmentation
3. DNA Analysis
4. Astronomical data analysis (Group bodies together)
5. Image compression

### A1. K-means clustering

**K-means algorithm**:
1. Randomly initialise $K$ cluster centroids $\mu_1, \mu_2, ..., \mu_K$
2. Repeat:
    * Assign each data point to its closest centroid
    * Recompute and move each cluster centroid to the average of the data points assigned to it
3. Stop when a convergence criterion is met, e.g. the assignments no longer change

<img  src="./images/Wk8_1.png"  style=" width:80%; padding: 10px 20px ; ">
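
The K-means steps listed above can be sketched in NumPy roughly as follows. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the `max_iters` cap, the `seed` argument, and the handling of empty clusters are all assumptions made for this example.

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Minimal K-means sketch. X has shape (m, n); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialise the centroids as K distinct, randomly chosen data points
    centroids = X[rng.choice(X.shape[0], size=K, replace=False)].astype(float)
    labels = None
    for _ in range(max_iters):
        # Step 2a: assign each point to its closest centroid (squared Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 3: stop when the assignments no longer change
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 2b: move each centroid to the mean of the points assigned to it
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
                centroids[k] = members.mean(axis=0)
    return centroids, labels
```

Calling `kmeans(X, K=2)` on an `(m, n)` array returns the final centroids and one cluster index per row of `X`.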

### A2. Optimisation Objective
- K-means algorithm has an underlying cost function (Distortion function)
<img  src="./images/Wk8_2.png"  style=" width:70%; padding: 10px 20px ; ">
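
As a rough sketch, the distortion can be computed as below, assuming the usual definition $J = \frac{1}{m}\sum_{i=1}^{m}\lVert x^{(i)} - \mu_{c^{(i)}}\rVert^2$, where $c^{(i)}$ is the index of the centroid assigned to $x^{(i)}$. The function name `distortion` and the argument layout are hypothetical; the figure above remains the authoritative statement of the cost.

```python
import numpy as np

def distortion(X, centroids, labels):
    """Mean squared distance between each point and its assigned centroid."""
    # centroids[labels] has shape (m, n): row i is the centroid assigned to X[i]
    diffs = X - centroids[labels]
    return float((diffs ** 2).sum(axis=1).mean())
```

Both K-means steps (assignment and centroid update) can only keep this value the same or lower it, which is why it is useful for checking convergence.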

### A3. Initialising k-means

#### Random initialisation
1. Choose $K < m$, where $m$ is the number of training examples
2. Randomly pick $K$ training examples
3. Set $\mu_1, \mu_2, ..., \mu_K$ equal to these $K$ examples
<img  src="./images/Wk8_3.png"  style=" width:70%; padding: 10px 20px ; ">
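
The three steps above might look like this in NumPy. The name `init_centroids` and the optional `rng` argument are assumptions for this sketch.

```python
import numpy as np

def init_centroids(X, K, rng=None):
    """Pick K distinct training examples as the initial centroids."""
    m = X.shape[0]
    if not K < m:
        raise ValueError("K must be smaller than the number of examples m")
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(m, size=K, replace=False)  # K distinct row indices
    return X[idx].copy()
```

Sampling without replacement keeps two centroids from starting on the same training example.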

### A4. Choosing K

- Clustering is unsupervised, so there is no clear indicator of the right number of clusters; the choice of $K$ is often ambiguous

#### Elbow method:
- Plot the cost function as a function of $K$ and choose the number of clusters at the "elbow", where the cost stops decreasing rapidly
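
A small sketch of how the cost-versus-$K$ curve could be produced. It bundles a basic K-means run so the block is self-contained; `kmeans_cost`, `elbow_curve`, the fixed iteration count, and the seed are assumptions for this example, and a single random initialisation per $K$ can land in a poor local optimum.

```python
import numpy as np

def kmeans_cost(X, K, n_iters=50, seed=0):
    """Run a basic K-means and return the final cost (mean squared distance)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assign every point to its nearest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each non-empty cluster's centroid to the mean of its points
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(dists.min(axis=1).mean())

def elbow_curve(X, k_values):
    """Cost for each candidate K; inspect or plot the result to look for the elbow."""
    return {K: kmeans_cost(X, K) for K in k_values}
```

Plotting the returned values against $K$ gives the curve to inspect; the elbow is judged by eye, which is part of why the method is ambiguous.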

#### Evaluate K-means based on how well it performs for the later (downstream) purpose

## B. Anomaly detection
- Another unsupervised learning algorithm
- For finding/detecting unusual events

**Applications**:
1. Detecting anomalies/defects in aircraft engines
2. Fraud detection by identifying unusual users
3. Monitoring computers in data centers


**Algorithm**:
1. Plot all data points which are "normal"
2. An anomaly is a test data point that lies far away from the other data points
3. This is done through **density estimation** to decide if $x_{test}$ is anomalous
    * Model $p(x)$ from data
    * $p(x_{test}) < \epsilon$ -> Flag anomaly
    * $p(x_{test}) \geq \epsilon$ -> "ok"

### B1. Gaussian (Normal) Distribution

**Aim**: To model $p(x)$

$x \sim \mathcal{N}(\mu,\sigma^2)$

$$ p(x ; \mu,\sigma ^2) = \frac{1}{\sqrt{2 \pi \sigma ^2}}\exp\left( - \frac{(x - \mu)^2}{2 \sigma ^2} \right)$$

   where $\mu$ is the mean and $\sigma^2$ is the variance.

To estimate the mean of feature $i$, you will use:

$$\mu_i = \frac{1}{m} \sum_{j=1}^m x_i^{(j)}$$

and for the variance you will use:
$$\sigma_i^2 = \frac{1}{m} \sum_{j=1}^m (x_i^{(j)} - \mu_i)^2$$
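
The two estimates above translate almost directly to NumPy. In this sketch the function name `estimate_gaussian` is an assumption, and `X` is taken to have one example per row and one feature per column.

```python
import numpy as np

def estimate_gaussian(X):
    """Per-feature mean and variance of X with shape (m, n), dividing by m."""
    mu = X.mean(axis=0)
    # Divides by m (not m - 1), matching the formula above; same as X.var(axis=0)
    sigma2 = ((X - mu) ** 2).mean(axis=0)
    return mu, sigma2
```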

### B2. Anomaly Detection Algorithm using Density Estimation

<img  src="./images/Wk8_4.png"  style=" width:70%; padding: 10px 20px ; ">
<img  src="./images/Wk8_5.png"  style=" width:70%; padding: 10px 20px ; ">
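
A minimal sketch of the density-estimation step, assuming the features are modelled as independent so that $p(x)$ is the product of the per-feature Gaussians from B1. The names `gaussian_density` and `flag_anomalies` are hypothetical.

```python
import numpy as np

def gaussian_density(X, mu, sigma2):
    """p(x) for each row of X as a product of independent per-feature Gaussians."""
    coef = 1.0 / np.sqrt(2.0 * np.pi * sigma2)
    per_feature = coef * np.exp(-((X - mu) ** 2) / (2.0 * sigma2))
    return per_feature.prod(axis=1)  # multiply across features (assumed independent)

def flag_anomalies(X, mu, sigma2, epsilon):
    """Boolean array: True where p(x) < epsilon, i.e. the example is flagged."""
    return gaussian_density(X, mu, sigma2) < epsilon
```

With many features the product can underflow to zero; summing log-densities instead is a common workaround, though it is not part of the notes above.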

### B3. Developing and Evaluating an Anomaly Detection System
- How to choose $\epsilon$

- **Key**: Use an unlabelled training set, plus labelled cross-validation and test sets that include some anomalous data points
    1. Fit the Gaussian distribution on the training set
    2. See how many anomalous data points it correctly flags on the CV set
    3. **Tune parameters ($\epsilon$) accordingly using the CV set**
    4. Test the model using the test set
<img  src="./images/Wk8_6.png"  style=" width:70%; padding: 10px 20px ; ">

### B4. Algorithm Evaluation
1. Fit model $p(x)$ on training set $x^{(1)}, x^{(2)}, ..., x^{(m)}$
2. On a CV set, predict $y = 1$ if $p(x) < \epsilon$ (anomaly); $y = 0$ otherwise
3. Evaluation metrics:
    * TP/FP/TN/FN
    * Precision/Recall
    * F1-Score
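
One way to combine these metrics with the tuning of $\epsilon$ is to scan candidate thresholds and keep the one with the best F1-score on the CV set. The sketch below assumes `p_cv` holds $p(x)$ for each CV example and `y_cv` holds the labels (1 = anomaly); the function names and the number of candidate thresholds are assumptions.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 from true/false positives and false negatives; 0.0 if there are no true positives."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return float(2 * precision * recall / (precision + recall))

def select_epsilon(p_cv, y_cv, n_steps=1000):
    """Scan thresholds between min and max of p_cv; return (best_epsilon, best_f1)."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), n_steps):
        y_pred = (p_cv < eps).astype(int)  # y = 1 (anomaly) when p(x) < epsilon
        f1 = f1_score(y_cv, y_pred)
        if f1 > best_f1:
            best_eps, best_f1 = float(eps), f1
    return best_eps, best_f1
```

F1 is used rather than accuracy because anomalies are rare: a model that never flags anything would still score high accuracy.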

### B5. Anomaly detection vs Supervised Learning
- Since we're using labelled data, why not just use supervised learning?



| data             | % of total | Description |
|------------------|:----------:|:---------|
| training         | 60         | Data used to tune model parameters $w$ and $b$ in training or fitting |
| cross-validation | 20         | Data used to tune other model parameters like degree of polynomial, regularization or the architecture of a neural network.|
| test             | 20         | Data used to test the model after tuning to gauge performance on new data |

| Anomaly Detection | Supervised Learning |
| :------------------: | :------------------: |
| Very small number of positive examples ($y = 1$)<br>Large number of negative examples ($y = 0$) | Large number of positive and negative examples |
| Many different "types" of anomalies<br>Hard for any algorithm to learn from positive examples what the anomalies look like<br>Future anomalies may look nothing like those seen so far | Future positive examples are likely to be similar to ones in the training set |
| Compares to "good" examples and flags anything that deviates as an anomaly | Enough positive examples for the algorithm to get a sense of what positive examples are like |
| Fraud detection | Email spam classification |
| Manufacturing: finding new, **previously unseen** defects | Manufacturing: finding known, **previously seen** defects (scratches etc.) |
| Monitoring machines in a data center | Disease classification |


### B6. Choosing what features to use
- Good choice of features is crucial
- Supervised learning, by contrast, is more forgiving of extra or missing features
- Feature choice matters more here because the model is trained on unlabelled data

#### KEY: USE GAUSSIAN FEATURES

1. Non-Gaussian features
    * Plot histogram using `plt.hist`
    * If skewed, transform using $\log(x)$
<img  src="./images/Wk8_7.png"  style=" width:70%; padding: 10px 20px ; ">    
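
A small sketch of the transform on synthetic data. The skewed feature is generated from a log-normal distribution purely for illustration, and the `skewness` helper and the offset `c` are assumptions: `c` is a hypothetical constant that keeps `log` finite when a feature contains zeros.

```python
import numpy as np

def skewness(x):
    """Sample skewness: near 0 for a symmetric distribution, positive for a long right tail."""
    centred = x - x.mean()
    return float((centred ** 3).mean() / (centred ** 2).mean() ** 1.5)

def log_transform(x, c=1.0):
    """Replace x with log(x + c); c is a hypothetical offset for features containing zeros."""
    return np.log(x + c)

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=5000)  # synthetic right-skewed feature
x_log = log_transform(x, c=0.0)                    # x > 0 here, so no offset is needed
print(skewness(x), skewness(x_log))
```

Plotting `plt.hist(x, bins=50)` and `plt.hist(x_log, bins=50)` side by side (after `import matplotlib.pyplot as plt`) should show the long right tail turning into a roughly bell-shaped histogram.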

### B7. Error analysis for anomaly detection

Want $p(x) < \epsilon$ if $y = 1$ (anomaly); $p(x) \geq \epsilon$ otherwise

Most common problem:
- $p(x)$ is comparable for normal and anomalous examples
- Solution: look at the anomalous example that was not flagged, work out what makes it unusual, and find another feature that distinguishes it from normal examples
  

**Monitoring computers in a data center**

Features:
- $x_1 =$ Memory use of computer
- $x_2 =$ Number of disk accesses/sec
- $x_3 =$ CPU load
- $x_4 =$ Network traffic
- $x_5 = \frac{\text{CPU Load}}{\text{Network Traffic}}$
- $x_6 = \frac{(\text{CPU Load})^2}{\text{Network Traffic}}$

Choose features based on their effect on $p(x)$:
- $p(x)$ should be large for normal examples and become small for anomalies
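
The ratio features $x_5$ and $x_6$ above could be built as in this sketch. The function name `add_ratio_features` and the small `eps` guard against division by zero are assumptions, and the values in the usage note are illustrative, not measured.

```python
import numpy as np

def add_ratio_features(cpu_load, network_traffic, eps=1e-9):
    """Return x5 = CPU load / network traffic and x6 = (CPU load)^2 / network traffic."""
    denom = network_traffic + eps  # eps is an assumed guard against zero traffic
    x5 = cpu_load / denom
    x6 = cpu_load ** 2 / denom
    return x5, x6
```

For example, a machine with CPU load 0.9 and network traffic 1 gets $x_5 \approx 0.9$, while one with CPU load 0.2 and traffic 100 gets $x_5 \approx 0.002$. The combined feature separates the two even when each raw value on its own looks ordinary.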