## ML Algorithms
1. Supervised learning
2. Unsupervised learning
3. Recommender Systems
4. Reinforcement learning

In [9]:
%matplotlib inline
import matplotlib.pyplot as pl
import numpy as np
import time

## Supervised Learning
- Trained on labelled data (input and output)
- Tested based on ability to guess output/label given input features
- Regression, Classification

Applications:
- Spam filtering, speech recognition, machine translation, self-driving cars, ...

Eg: Regression- Housing price prediction (Continuous)
1. Input- House size
2. Output- Price

Eg: Classification- Breast cancer detection (Discrete)
1. Input- tumor size
2. Output- Malignant/benign

## Unsupervised learning
- Only given unlabelled data, find patterns or structures within data
1. Clustering- Group similar data points together
2. Anomaly detection- Find unusual data points
3. Dimensionality reduction- Compress data using fewer numbers

#### Clustering
Eg: Google News
- Cluster similar articles together
- Articles in a cluster mention similar words/ content etc
- The algorithm figures out the clusters without supervision

Eg2: DNA microarray
- Group individuals based on trait

Eg3: Grouping customers
- Market segmentation finds distinct groups of individuals (e.g. grow skills/ develop career/ stay updated)

## Linear Regression Model

- Train the model on data with the "right" answers (labels)
- Fit a straight line through data points

Training set > Learning algorithm > $f$ (model) > predict ŷ

How to represent $f$?  

$f_{w,b}(x) = wx + b$, or $f(x)$ for short (linear)
- Linear regression with one variable
- Univariate linear regression
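
The model above can be sketched in a few lines of NumPy. This is only an illustration: the function name `predict` and the sample values (house sizes with hand-picked parameters) are assumptions, not part of the notes.

```python
import numpy as np

def predict(x, w, b):
    """Univariate linear model f_{w,b}(x) = w*x + b, applied element-wise."""
    return w * x + b

# Hypothetical inputs (e.g. house sizes) and hand-picked parameters
x_train = np.array([1.0, 2.0])
y_hat = predict(x_train, w=200.0, b=100.0)
print(y_hat)
```

Because `x` is a NumPy array, the same function returns a prediction for every training example at once.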




## Cost Function
- Mean squared error
- Formula:  $$J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2 \tag{1}$$
- Parameters: $w$, $b$
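
Equation (1) can be translated directly into code. The sketch below assumes a univariate model and NumPy arrays; the name `compute_cost` and the two-point data set are made up for illustration.

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared error cost J(w,b) with the 1/(2m) factor from equation (1)."""
    m = x.shape[0]
    errors = (w * x + b) - y          # f_{w,b}(x^(i)) - y^(i) for every i
    return np.sum(errors ** 2) / (2 * m)

# Hypothetical training set that lies exactly on the line y = 200x + 100
x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])

cost_perfect = compute_cost(x_train, y_train, w=200.0, b=100.0)
cost_off = compute_cost(x_train, y_train, w=100.0, b=100.0)
print(cost_perfect, cost_off)
```

The cost is zero when the line passes through every point and grows as the parameters move away from that fit.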

## Gradient Descent

Algorithm used to train linear regression, more complex models, and neural networks.

Minimise  $J(w,b)$

1. Start with some w,b
2. Keep changing w,b to reduce cost
3. Repeat until we settle at or near a minimum (there may be more than one local minimum)

### Gradient Descent Algorithm

$$\begin{align*} \text{repeat}&\text{ until convergence:} \; \lbrace \newline
\;  w &= w -  \alpha \frac{\partial J(w,b)}{\partial w} \tag{3}  \; \newline 
 b &= b -  \alpha \frac{\partial J(w,b)}{\partial b}  \newline \rbrace
\end{align*}$$
where the parameters $w$ and $b$ are updated simultaneously and $\alpha$ is the learning rate.  
The gradient is defined as:
$$
\begin{align}
\frac{\partial J(w,b)}{\partial w}  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})x^{(i)} \tag{4}\\
  \frac{\partial J(w,b)}{\partial b}  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)}) \tag{5}\\
\end{align}
$$
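
The update rule and the gradients in equations (4) and (5) can be sketched as follows. The function names, the data set, the starting point, and the choice to run a fixed number of iterations (rather than test for convergence) are all assumptions made for this example.

```python
import numpy as np

def compute_gradient(x, y, w, b):
    """Partial derivatives of J(w,b), following equations (4) and (5)."""
    m = x.shape[0]
    errors = (w * x + b) - y
    dj_dw = np.sum(errors * x) / m
    dj_db = np.sum(errors) / m
    return dj_dw, dj_db

def gradient_descent(x, y, w_init, b_init, alpha, num_iters):
    """Run a fixed number of simultaneous updates of w and b."""
    w, b = w_init, b_init
    for _ in range(num_iters):
        # Both gradients are computed from the current (w, b) before either is changed
        dj_dw, dj_db = compute_gradient(x, y, w, b)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b

# Hypothetical training set that lies exactly on the line y = 200x + 100
x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])

w_final, b_final = gradient_descent(x_train, y_train, 0.0, 0.0, alpha=0.01, num_iters=10000)
print(w_final, b_final)
```

Computing both partial derivatives before updating either parameter is what makes the update simultaneous.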

### Learning rate

If $\alpha$ is too small, gradient descent still works but is very slow.  
If $\alpha$ is too large, the updates may overshoot the minimum, so gradient descent may fail to converge or may even diverge.  
With a suitably chosen fixed learning rate, gradient descent can still reach a local minimum: the derivative becomes smaller as we approach the minimum, so the update steps become smaller automatically.
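
The effect of $\alpha$ can be seen on a toy problem. The sketch below does not use the regression cost; it assumes the simpler cost $J(w) = w^2$ (derivative $2w$, minimum at $w = 0$), and the function name and the three learning rates are chosen only for illustration.

```python
def run_gradient_descent(alpha, w_init=1.0, num_iters=20):
    """Gradient descent on the toy cost J(w) = w**2, whose derivative is 2*w."""
    w = w_init
    for _ in range(num_iters):
        w = w - alpha * 2 * w
    return w

# Too small, reasonable, and too large learning rates
for alpha in (0.001, 0.1, 1.1):
    print(f"alpha={alpha}: w after 20 steps = {run_gradient_descent(alpha):.4f}")
```

With the smallest rate, $w$ has barely moved from its starting value after 20 steps; with the middle rate it is close to the minimum; with the largest rate each step overshoots and $|w|$ grows.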

### Vectorisation
- Makes implementations of learning algorithms shorter and faster to run
- Utilises modern linear algebra libraries
- Can utilise GPU hardware

In [7]:
# Example parameters and input features
w = np.array([1.0, 2.5, -3.3])  # weights
b = 4                           # bias
x = np.array([10, 20, 30])      # input features

Without vectorisation, a for loop is needed to sum all the products.  
With vectorisation, we can use the dot product.

In [8]:
f = np.dot(w, x) + b
f

-35.0
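
For comparison, the non-vectorised version of the same computation might look like the sketch below. It redefines `w`, `b`, and `x` so that it runs on its own; the variable name `f_loop` is an assumption.

```python
import numpy as np

w = np.array([1.0, 2.5, -3.3])
b = 4
x = np.array([10, 20, 30])

# Accumulate w[j] * x[j] one feature at a time, then add the bias
f_loop = 0.0
for j in range(w.shape[0]):
    f_loop += w[j] * x[j]
f_loop += b

# Check that the loop agrees with the vectorised dot product
print(np.isclose(f_loop, np.dot(w, x) + b))
```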

In [17]:
import time

import numpy as np

# Define vector size
vector_size = 100000

# Create random vectors
vector_a = np.random.rand(vector_size)
vector_b = np.random.rand(vector_size)

# Time the for-loop dot product

start_loop = time.time()
dot_product_loop = 0
for i in range(vector_size):
    dot_product_loop += vector_a[i] * vector_b[i]
end_loop = time.time()

# Time numpy dot product
start_numpy = time.time()
dot_product_numpy = np.dot(vector_a, vector_b)
end_numpy = time.time()

# Calculate and print execution times
time_loop = end_loop - start_loop
time_numpy = end_numpy - start_numpy

print(f"Time for loop dot product: {time_loop:.5f} seconds")    # 5.95863 seconds with vector_size = 10,000,000
print(f"Time for numpy dot product: {time_numpy:.5f} seconds")  # 0.00598 seconds in the same run (about 1000 times faster)!

Time for loop dot product: 0.08069 seconds
Time for numpy dot product: 0.00000 seconds
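
A reading of 0.00000 seconds suggests the NumPy call was too quick to register with this measurement at five decimal places. One possible alternative, sketched below, is `time.perf_counter()`, which Python provides for measuring short durations, combined with averaging the NumPy call over several repeats. The seeded generator, the repeat count, and the variable names are assumptions for this example.

```python
import time

import numpy as np

rng = np.random.default_rng(0)          # seeded so the vectors are reproducible
vector_a = rng.random(100_000)
vector_b = rng.random(100_000)

# Time the for-loop dot product once
start = time.perf_counter()
dot_loop = 0.0
for i in range(vector_a.shape[0]):
    dot_loop += vector_a[i] * vector_b[i]
time_loop = time.perf_counter() - start

# Time the NumPy dot product, averaged over several repeats
repeats = 100
start = time.perf_counter()
for _ in range(repeats):
    dot_numpy = np.dot(vector_a, vector_b)
time_numpy = (time.perf_counter() - start) / repeats

print(f"Time for loop dot product: {time_loop:.6f} seconds")
print(f"Time for numpy dot product: {time_numpy:.6f} seconds (mean of {repeats} runs)")
```

The exact timings depend on the machine, so none are quoted here.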


### How does vectorisation work?
- The hardware multiplies all corresponding pairs of elements simultaneously/ in parallel
- It then adds up all the products efficiently

- In gradient descent, we might have many input features stored as a vector
- With vectorisation, the computations for all features run in parallel, which is much faster than using for loops
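
As a sketch of what this looks like for gradient descent, the function below computes the gradient for several features at once. It assumes the model generalises to $f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$ with one weight per feature, which goes slightly beyond the univariate equations above; the function name and the data are made up for illustration.

```python
import numpy as np

def compute_gradient_vectorised(X, y, w, b):
    """Gradient of the squared error cost for X of shape (m, n), y of shape (m,), w of shape (n,)."""
    m = X.shape[0]
    errors = X @ w + b - y            # prediction errors for all m examples at once
    dj_dw = X.T @ errors / m          # one partial derivative per feature, no explicit loop
    dj_db = np.sum(errors) / m
    return dj_dw, dj_db

# Hypothetical data: 3 examples, 2 features, generated from y = 1*x1 + 2*x2 + 3
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
y = X @ np.array([1.0, 2.0]) + 3.0

dj_dw, dj_db = compute_gradient_vectorised(X, y, np.zeros(2), 0.0)
print(dj_dw, dj_db)
```

The matrix products replace both the loop over training examples and the loop over features.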