
# **📚 Theoretical Machine Learning**

# **Supervised Learning**
1. Linear Regression

## **Linear Regression In One Variable**

### Terminology
* $x_i$: $i^{th}$ input variable
* $y_i$: $i^{th}$ output variable
* $n$: number of training examples (upper limit of $i$)
* $h(x)$: Hypothesis function
* $ŷ$: Predicted output of $h(x)$
* $w$: weight
* $b$: bias



### Working Principle
### 1. **Hypothesis Function h(x)**
  $$h_{w,b}(x)= wx + b$$
  * This tries to predict a $y$ for a given $x$

### 2. **Cost Function J(w,b)**
  * Measures how far the model's predictions are from the actual outputs
  * MSE: $$J(w,b) = \frac{\sum_{i=1}^n\ (ŷ_i - y_i)^2}{2n}$$
    * Divided by $2n$ rather than $n$ so that the 2 cancels when the square is differentiated.
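
To see why the factor of 2 is convenient, differentiate $J$ with respect to $w$ and $b$ after substituting $ŷ_i = wx_i + b$. This is a short derivation sketch; the 2 produced by the power rule cancels the 2 in the denominator:

$$\frac{\partial}{\partial w}J(w,b) = \frac{1}{2n}\sum_{i=1}^n 2(wx_i + b - y_i) \cdot x_i = \frac{1}{n}\sum_{i=1}^n (wx_i + b - y_i)\, x_i\\
\frac{\partial}{\partial b}J(w,b) = \frac{1}{2n}\sum_{i=1}^n 2(wx_i + b - y_i) \cdot 1 = \frac{1}{n}\sum_{i=1}^n (wx_i + b - y_i)$$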

### 3. **Gradient Descent**
* **Working Principle**:
  1. Plot the graph of J vs w (assume b is a constant for now).
  2. At the current value of w, find the derivative of J. The negative of the derivative gives the direction to move in to reach the minimum.
  3. α is the **learning rate**. It scales the size of each step so that the model does not move so far towards the minimum that it overshoots it.
  4. Do the same for b, updating w and b SIMULTANEOUSLY: compute both derivatives from the current values before changing either.
$$w = w - α \frac{∂}{∂w}J(w,b)\\
b = b - α \frac{∂}{∂b}J(w,b)$$
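
The effect of the learning rate can be sketched on a tiny problem. The snippet below is an illustration rather than part of the original notes: it assumes the data X = [1, 2, 3], Y = [3, 5, 7], holds b fixed at 1 so that J depends on w alone, and uses the hypothetical helper names `dJ_dw` and `descend_w`. The two α values are arbitrary choices, one small and one too large.

```python
# Gradient descent on w alone, with b held constant (illustrative sketch).
X = [1, 2, 3]
Y = [3, 5, 7]   # generated by y = 2x + 1, so the best w is 2 when b = 1
B = 1.0         # b is kept fixed, as in step 1 above

def dJ_dw(w):
    # Derivative of J(w, b) with respect to w for the fixed b
    n = len(X)
    return sum((w * x + B - y) * x for x, y in zip(X, Y)) / n

def descend_w(alpha, w=0.0, steps=10):
    # Repeatedly step against the derivative and record every w visited
    path = [w]
    for _ in range(steps):
        w = w - alpha * dJ_dw(w)
        path.append(w)
    return path

small = descend_w(alpha=0.1)  # moves steadily towards w = 2
large = descend_w(alpha=0.5)  # each step overshoots w = 2 by more than the last

print("alpha = 0.1:", [round(w, 3) for w in small[:5]])
print("alpha = 0.5:", [round(w, 3) for w in large[:5]])
```

With the smaller α the distance to the minimum shrinks on every step; with the larger one the updates jump from one side of the minimum to the other and drift away from it.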

### **4. Putting this together**
Repeat until the minimum is reached (the cost stops decreasing):
  1. Find the $ŷ$ values for your training data with whatever $h(x)$ you have.
  2. Find the cost $J(w,b)$ for the present $h(x)$.
  3. Update the $w$ and $b$ values based on Gradient Descent.

### **Example**
#### Given:
1. Initial h(x) = 2x + 3 (so w = 2 and b = 3)
2. X = [1, 2, 3]
3. Y = [3, 5, 7]
4. J is MSE: $$J(w,b) = \frac{\sum_{i=1}^n\ (wx_i + b - y_i)^2}{2n}$$
5. Calculated Partial Derivatives: $$\frac{∂}{∂w}J(w, b) = \frac{\sum_{i=1}^n\ (wx_i + b - y_i) \cdot x_i}{n}\\
\frac{∂}{∂b}J(w, b) = \frac{\sum_{i=1}^n\ (wx_i + b - y_i)}{n}$$
6. α = 0.1

#### Working:
* Y predicted = [5, 7, 9], so every error $ŷ_i - y_i$ is 2
* J(2, 3) = 2
* $w = w - α(\frac{2×6}{3})$ and $b = b - α(\frac{2×3}{3})$
* Now, w = 1.6 and b = 2.8
* J(1.6, 2.8) ≈ 0.5533 (much lower than before)
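
The arithmetic of a single step can be checked with a few lines of Python. This is a minimal sketch that applies one gradient descent update to the example data and prints the cost before and after; the helper name `cost` and the variable names are illustrative.

```python
# One gradient descent step on the example above (illustrative check).
X = [1, 2, 3]
Y = [3, 5, 7]
w, b, alpha = 2.0, 3.0, 0.1
n = len(X)

def cost(w, b):
    # J(w, b) = sum of squared errors / (2n)
    return sum((w * x + b - y) ** 2 for x, y in zip(X, Y)) / (2 * n)

errors = [w * x + b - y for x, y in zip(X, Y)]   # predicted minus actual
dw = sum(e * x for e, x in zip(errors, X)) / n   # dJ/dw
db = sum(errors) / n                             # dJ/db

cost_before = cost(w, b)
# Simultaneous update: both gradients were computed from the old (w, b)
w, b = w - alpha * dw, b - alpha * db
cost_after = cost(w, b)

print("cost before:", cost_before)
print("updated w, b:", w, b)
print("cost after:", cost_after)
```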


In [None]:
import random

def hypothesis(w, b, X):
  # Predict y = w*x + b for every input x
  return [w*x + b for x in X]

def mse_modded(Y, Y_pred): # modded because of the 1/2
  # Sum of squared errors divided by 2n
  ans = 0
  for i in range(len(Y)):
    ans += (Y[i] - Y_pred[i])**2
  return ans/(2*len(Y))

def gradient_descent(X, Y, lr, w, b):
  # One simultaneous update of w and b; predictions are computed only once
  n = len(X)
  Y_pred = hypothesis(w, b, X)
  errors = [Y_pred[i] - Y[i] for i in range(n)]
  w_correction = lr*sum(errors[i]*X[i] for i in range(n))/n
  b_correction = lr*sum(errors)/n
  w -= w_correction
  b -= b_correction
  return w, b

def uni_linear_regression(X, Y, epochs=5, lr=0.01):
  # Start from random parameters in [0, 1)
  w = random.random()
  b = random.random()
  for i in range(epochs):
    Y_pred = hypothesis(w, b, X)
    cost = mse_modded(Y, Y_pred) # cost before this epoch's update
    w, b = gradient_descent(X, Y, lr, w, b)
    print("Present cost:", cost)
  return w, b # learned parameters

In [None]:
X = [1, 2, 3, 4, 5]
Y = [3, 5, 7, 9, 11]
w, b = uni_linear_regression(X, Y, epochs=10)

Present cost: 7.663793587397455
Present cost: 5.959743670973275
Present cost: 4.63504555633149
Present cost: 3.605247249968433
Present cost: 2.804697621183241
Present cost: 2.182360918361637
Present cost: 1.6985630655786814
Present cost: 1.3224623048196285
Present cost: 1.0300829286242839
Present cost: 0.8027867468973706


## **Multiple Linear Regression**
* NOT the same as multivariate regression, which predicts multiple output variables.

**Notation Update**
* $X_j$ is the $j^{th}$ feature
* $X^{(i)}$ is the $i^{th}$ sample. It is a 1-D array (vector) with one entry per feature

### Working Principle
* W is the weights matrix of shape (1, n), where n is now the number of features (not the number of samples, as in the one-variable formulas).
* X is a sample of shape (1, n)
* W and X are capitalized because they are vectors
### **Hypothesis Function h(x)**
$$h_{(W,\ b)}(X) = W_1X_1 + W_2X_2 +...+W_nX_n + b\\
⇒ h_{(W,\ b)}(X) = W \cdot X + b
$$
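
The two forms of the hypothesis can be compared numerically. The sketch below uses made-up numbers (2 samples, 3 features) and stores W as a flat NumPy array of length n rather than a (1, n) matrix, which is an assumption made here for convenience.

```python
import numpy as np

# Hypothetical numbers: 2 samples, n = 3 features each
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
W = np.array([0.5, -1.0, 2.0])   # one weight per feature
b = 0.25

# Written out feature by feature for the first sample
by_hand = W[0] * X[0, 0] + W[1] * X[0, 1] + W[2] * X[0, 2] + b

# The same prediction as a dot product
single = np.dot(W, X[0]) + b

# All samples at once: (m, n) @ (n,) -> (m,)
batch = X @ W + b

print(by_hand, single, batch)
```

The matrix product gives one prediction per row of X, so a whole dataset can be evaluated without a Python loop.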

### **Cost function J(W, b)**
* Same MSE as in the one-variable case, except that the scalar weight w is replaced by the weight vector W inside the hypothesis.

### **Gradient Descent**
$$w_j = w_j - α\frac{∂}{∂w_j}J(W, b)\ \ ∀\ \ 0 < j ≤ n\\
b = b - α\frac{∂}{∂b}J(W, b)$$
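
Writing out the partial derivatives makes the update concrete. This is a sketch under the assumption that J is the same halved MSE as before; the symbols $m$ (number of samples), $y^{(i)}$ (target of sample $i$) and $X^{(i)}_j$ (feature $j$ of sample $i$) are introduced here for illustration:

$$\frac{\partial}{\partial w_j}J(W, b) = \frac{1}{m}\sum_{i=1}^{m}\left(h_{(W,\ b)}(X^{(i)}) - y^{(i)}\right)X^{(i)}_j\\
\frac{\partial}{\partial b}J(W, b) = \frac{1}{m}\sum_{i=1}^{m}\left(h_{(W,\ b)}(X^{(i)}) - y^{(i)}\right)$$

If the samples are stacked as the rows of an $(m, n)$ matrix $X$, all $n$ weight derivatives can be computed at once as $\frac{1}{m}X^T(\hat{Y} - Y)$, where $\hat{Y}$ is the vector of predictions.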


In [None]:
import numpy as np

def h(X, W, b):
  # Predictions for all samples at once: (m, n) @ (n,) + b -> (m,)
  return X@W + b

def loss(Y, Y_pred):
  # Sum of squared errors divided by 2m
  return np.sum((Y - Y_pred)**2)/(len(Y)*2)

def gradient_descent(W, X, Y, b, lr):
  # One simultaneous update of every weight and the bias
  m = len(X) # number of samples
  Y_pred = h(X, W, b)
  W_correction = lr * (X.T@(Y_pred-Y))/m # one entry per feature
  b_correction = lr * np.sum(Y_pred - Y)/m
  W -= W_correction
  b -= b_correction
  return W, b

def linear_regression(X, Y, epochs=5, lr=0.01, decay=0):
  n = len(X[0]) # number of features
  W = np.random.rand(n)
  b = 0
  for i in range(epochs):
    # Time-based decay of the initial learning rate (no effect when decay=0)
    current_lr = lr / (1 + decay*i)
    Y_pred = h(X, W, b)
    cost = loss(Y, Y_pred)
    W, b = gradient_descent(W, X, Y, b, current_lr)
  # Report only the last epoch's cost, measured just before its update
  print(f"Cost {i}: ", cost)
  return W, b

In [None]:
X = np.random.randint(-10, 10, (100, 5))

Y = np.sum(X, axis=1) * 2 + 3
W, b = linear_regression(X, Y, epochs=5000, lr=0.01, decay=0)

print("Learned Weights:", W)
print("Learned Bias:", b) # As close to perfection as possible. All weights are 2, bias is basically 3

Cost 4999:  2.935321846043439e-28
Learned Weights: [2. 2. 2. 2. 2.]
Learned Bias: 2.999999999999975


### **Practical Tips for Linear Regression**

#### **Feature Scaling**
* Bring all features into a similar range to help the model converge faster. Common options:
1. Divide all values of a feature by the largest one.
2. Find the mean, subtract it from every value, then divide by the range (maximum - minimum) of the feature. (Mean Normalization)
3. Find the mean μ and standard deviation σ of each feature, then apply the transformation $x_i = \frac{x_i - μ_i}{σ_i}$ (Z-Score Normalization)
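
A minimal NumPy sketch of the three options, applied column by column to a made-up feature matrix. The data and variable names are hypothetical. Option 1 assumes positive values, and mean normalization is taken here to divide by the range (max - min), which is an assumption about the intended definition.

```python
import numpy as np

# Hypothetical data: 4 samples, 2 features on very different scales
X = np.array([[1000.0, 2.0],
              [1500.0, 3.0],
              [2000.0, 4.0],
              [2500.0, 5.0]])

# 1. Divide each feature (column) by its largest value (assumes positive data)
max_scaled = X / X.max(axis=0)

# 2. Mean normalization: subtract the column mean, divide by the column range
mean_normalized = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# 3. Z-score normalization: subtract the column mean, divide by the column std
z_scored = (X - X.mean(axis=0)) / X.std(axis=0)

print(max_scaled)
print(mean_normalized)
print(z_scored)
```

After any of the three, both columns occupy a comparable range, so a single learning rate suits every weight.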


#### **Checking If Gradient Descent is working**
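* The J(W, b) vs epochs curve should always be decreasing. If it increases, the learning rate is probably too large (or the gradient code has a bug).
* Test for convergence: pick an acceptably small error value (ϵ) and stop as soon as J(W, b) ≤ ϵ. If the data is noisy and the cost cannot get that close to zero, stop instead once J decreases by less than ϵ in one epoch.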
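
Both checks can be sketched in code. The function below is a hypothetical helper, not the notebook's own `linear_regression`: it records the cost at every epoch and stops once the cost is at or below ϵ. The data, the learning rate of 0.05 and the value of ϵ are illustrative assumptions.

```python
import numpy as np

def fit_with_monitoring(X, Y, lr=0.05, epsilon=1e-6, max_epochs=10000):
    # Gradient descent that records the cost each epoch and stops once
    # the cost is at or below epsilon (or max_epochs is reached).
    m, n = X.shape
    W, b = np.zeros(n), 0.0
    history = []
    for _ in range(max_epochs):
        error = X @ W + b - Y
        cost = np.sum(error ** 2) / (2 * m)
        history.append(cost)
        if cost <= epsilon:
            break
        W = W - lr * (X.T @ error) / m
        b = b - lr * np.sum(error) / m
    return W, b, history

# Hypothetical noise-free data: y = 2x + 3
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
Y = 2 * X[:, 0] + 3

W, b, history = fit_with_monitoring(X, Y)

# A healthy run never sees the cost go up from one epoch to the next
always_decreasing = all(later <= earlier
                        for earlier, later in zip(history, history[1:]))
print("epochs run:", len(history))
print("cost always decreasing:", always_decreasing)
print("final cost:", history[-1])
```

Plotting `history` against the epoch number gives the J vs epochs curve described above.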

#### **Choosing a Good Learning Rate**
* With a small enough α, the cost should decrease on every epoch. So if it starts oscillating or increasing, decrease the learning rate and try again.
* Go-to values to try: 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1
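
One way to use these values is to try each for a fixed number of epochs and compare the cost reached. The sketch below does this on made-up data; `final_cost`, the dataset and the 50-epoch budget are illustrative assumptions.

```python
import numpy as np

def final_cost(X, Y, lr, epochs=50):
    # Run plain gradient descent for a fixed number of epochs and
    # return the cost reached with this learning rate.
    m, n = X.shape
    W, b = np.zeros(n), 0.0
    for _ in range(epochs):
        error = X @ W + b - Y
        W = W - lr * (X.T @ error) / m
        b = b - lr * np.sum(error) / m
    error = X @ W + b - Y
    return np.sum(error ** 2) / (2 * m)

# Hypothetical data: y = 2x + 3
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
Y = 2 * X[:, 0] + 3

results = {lr: final_cost(X, Y, lr)
           for lr in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1]}
for lr, cost in results.items():
    print(f"lr = {lr}: final cost = {cost:.4g}")

best_lr = min(results, key=results.get)
print("lowest final cost at lr =", best_lr)
```

Rates that are too small leave the cost high after the epoch budget, while rates that are too large make it blow up; the useful value sits between the two.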

#### **Feature Engineering**
* We can combine or transform the features we have to create better ones.
* Example: if house price ∝ area but the data has only two features, length and breadth, create a new feature area = length × breadth.
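
The house example can be sketched as follows, with made-up numbers and illustrative variable names:

```python
import numpy as np

# Hypothetical house data: columns are length and breadth
houses = np.array([[10.0, 5.0],
                   [12.0, 8.0],
                   [20.0, 15.0]])

length = houses[:, 0]
breadth = houses[:, 1]

# New feature: area = length * breadth
area = length * breadth

# Append the engineered feature as a third column
houses_with_area = np.column_stack([houses, area])
print(houses_with_area)
```

A linear model on length and breadth alone cannot represent their product, so adding the area column lets it fit a price that is proportional to area.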

### **Polynomial Regression**