<a href="https://colab.research.google.com/github/ShaunakSen/Data-Science-and-Machine-Learning/blob/master/ML_Stanford_Andrew_NG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning - Andrew NG

> Notes on the course: https://www.coursera.org/learn/machine-learning

---


## Introduction and basic concepts

Some use cases of clustering:

![](https://i.imgur.com/RIpwHyS.png)

__Cocktail party problem__

Multiple people speak at the same time, which leads to cluttered audio - unsupervised learning (UL) algorithms can be used to separate out the individual audio sources

## Model and Cost function (Linear Regression)

Some terminologies:

![](https://i.imgur.com/TOcSv9S.png)

Univariate LinReg:

![](https://i.imgur.com/kVJJZym.png)

Cost function (sqd error):

![](https://i.imgur.com/rSRiur5.png)

![](https://i.imgur.com/m6Dt0f0.png)

For simplicity let us assume $\theta_0=0$, so the line passes through the origin

Now $h_{\theta}(x)$ is a function of x for fixed $\theta_1$

$J(\theta_1)$ is a function of $\theta_1$

For diff $\theta_1$ we get diff fitted lines ($h_{\theta}(x)$) and as a result diff squared errors ($J(\theta_1)$)

For one particular value of $\theta_1$, $h_{\theta}(x)$ will be the best fitted line and $J(\theta_1)$ will be minimum
In the diag below it's for $\theta_1 = 1$

![](https://i.imgur.com/f9LG869.png)
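The idea above can be sketched in a few lines. This is a minimal illustration, assuming a toy dataset of three points lying exactly on y = x and the course's squared-error cost with a 1/(2m) factor; the data and the helper name `cost` are assumptions made for the example, and the constant in front only rescales J without moving its minimum.

```python
# Toy data lying exactly on the line y = x (an illustrative assumption).
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]


def cost(theta1, xs, ys):
    """Squared-error cost J(theta1) for h(x) = theta1 * x (theta0 fixed at 0)."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Each theta1 gives a different fitted line and therefore a different cost.
costs = {theta1: cost(theta1, xs, ys) for theta1 in (0.0, 0.5, 1.0, 1.5, 2.0)}

# The theta1 with the smallest cost corresponds to the best fitted line.
best_theta1 = min(costs, key=costs.get)
```

For this data the cost is symmetric around $\theta_1 = 1$, which is where the bowl-shaped curve bottoms out.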

So lets revisit our problem formulation:

![](https://i.imgur.com/9l7xffC.png)

Now we wont ignore $\theta_0$

The __loss function landscape__ in terms of $\theta_0$ and $\theta_1$ looks like:

![](https://i.imgur.com/IaOMZ6H.png)

Here we see how diff values of the 2 parameters change the loss function

Imagine we want to fix the value of J - so we cut through the bowl horizontally

- here we fix the value of J but values of $\theta_0$ and $\theta_1$ are diff at each pt on the circle

We are looking at the cross-section which is of the shape of a circle/oval
- this curve represents diff combos of $\theta_0$ and $\theta_1$ that lead to the same J
- this is called a contour plot

![](https://i.imgur.com/BqjMRlP.png)

- Each oval represents values of the params that lead to the same J
- Each such param combo leads to a diff line through the data
- As we move closer to the center J dec

![](https://i.imgur.com/Z3yn4qZ.png)

All 3 points shown below have the same value of J but correspond to diff lines through the data

![](https://i.imgur.com/SpXkMRL.png)


#### Question - can the lowest loss be obtained by multiple lines?

A loss value that is not the global minimum corresponds to a closed contour (circle/oval) in the contour plot. Each pt on that contour has the same loss value but represents a diff line. The global minimum, however, is a single point in the loss landscape and thus has only one corresponding line - the LinReg cost is convex, so it has a single global minimum (assuming the features are not redundant)



## Gradient Descent

![](https://i.imgur.com/VQ4bHfe.png)

Say the loss landscape looks like the plot below and we start at some random pt of $\theta_0$ and $\theta_1$ (marked by + sign):

Imagine we are standing at that pt and take a small step in the direction of steepest descent. We keep repeating this until we reach a local minimum

![](https://i.imgur.com/tojsdIg.png)

An interesting property of GD is that if we had started from a diff pt we might have reached a diff local minimum

![](https://i.imgur.com/il7RlWA.png)

![](https://i.imgur.com/IGdFM6e.png)

Also note that the weights should be updated simultaneously, not one after another:

![](https://i.imgur.com/M9RYGE8.png)

In the diag above, the incorrect method updates theta0 first and then uses the updated value to compute the update for theta1 - that is not correct, theta0 and theta1 should be updated simultaneously

![](https://i.imgur.com/MhOeyRi.png)
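A small sketch of the difference between the two update orders, assuming univariate LinReg with the usual squared-error gradients. The toy data and the function names are assumptions made for illustration.

```python
def gradients(theta0, theta1, xs, ys):
    """Partial derivatives of the squared-error cost w.r.t. theta0 and theta1."""
    m = len(xs)
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    d_theta0 = sum(errors) / m
    d_theta1 = sum(e * x for e, x in zip(errors, xs)) / m
    return d_theta0, d_theta1


def simultaneous_step(theta0, theta1, xs, ys, alpha):
    # Correct: both gradients are computed from the OLD theta values.
    d0, d1 = gradients(theta0, theta1, xs, ys)
    return theta0 - alpha * d0, theta1 - alpha * d1


def sequential_step(theta0, theta1, xs, ys, alpha):
    # Incorrect: theta1's gradient is computed after theta0 has already changed.
    d0, _ = gradients(theta0, theta1, xs, ys)
    theta0 = theta0 - alpha * d0
    _, d1 = gradients(theta0, theta1, xs, ys)
    return theta0, theta1 - alpha * d1


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
correct = simultaneous_step(0.0, 0.0, xs, ys, alpha=0.1)
incorrect = sequential_step(0.0, 0.0, xs, ys, alpha=0.1)
```

Both versions agree on theta0 after one step, but they already disagree on theta1, so the sequential version is no longer following the gradient of J at the current point.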

### GD - derivative intuition

![](https://i.imgur.com/ODDxN5h.png)

### GD - learning rate intuition

![](https://i.imgur.com/AmNepnv.png)

![](https://i.imgur.com/Yldm9Mp.png)

- if theta1 is already at a local minimum, the derivative is 0, so the GD update leaves theta1 unchanged

Also, as we get closer and closer to a minimum, the gradient gets smaller (the curve becomes less steep), so the GD step size automatically becomes smaller even with a fixed learning rate

![](https://i.imgur.com/jmRl1M3.png)
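To make the learning rate intuition concrete, here is a minimal sketch that runs GD on the one-parameter function J(theta) = theta^2. The function, the starting point and the two learning rates are assumptions chosen only for illustration.

```python
def run_gd(theta, alpha, n_steps):
    """Gradient descent on J(theta) = theta**2, whose derivative is 2 * theta."""
    thetas = [theta]
    for _ in range(n_steps):
        theta = theta - alpha * 2 * theta
        thetas.append(theta)
    return thetas


# Small, fixed learning rate: the steps shrink on their own near the minimum.
small = run_gd(theta=4.0, alpha=0.1, n_steps=5)
step_sizes = [abs(b - a) for a, b in zip(small, small[1:])]

# Learning rate that is too large: each step overshoots and theta diverges.
large = run_gd(theta=4.0, alpha=1.1, n_steps=5)
```

With `alpha=0.1` every step is 0.8 times the previous one, while with `alpha=1.1` the magnitude of theta grows by a factor of 1.2 per step.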

## GD for LinReg

![](https://i.imgur.com/d1jVNjh.png)

![](https://i.imgur.com/ZtUMfbW.png)

![](https://i.imgur.com/W8lxwit.png)

For LinReg the cost function is "convex" - any local minimum is also the global minimum - so GD __always converges to the global minimum__ (given a suitably small learning rate)

The diag shows on the contour plot the values theta0 and theta1 take on successive iterations and on the left the best fit line is shown - observe how the thetas converge to values which give minm loss

![](https://i.imgur.com/mzZsYQh.png)

Here we have used __Batch GD__ : __Each step of GD uses ALL training examples__

![](https://i.imgur.com/ZWmulhN.png)
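A minimal sketch of Batch GD for univariate LinReg, assuming the course's squared-error cost with a 1/(2m) factor. The toy data (generated from y = 1 + 2x), the learning rate and the iteration count are assumptions for illustration.

```python
def batch_gradient_descent(xs, ys, alpha=0.1, n_iters=2000):
    """Fit h(x) = theta0 + theta1 * x; every step uses ALL training examples."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(n_iters):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update of both parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 1 + 2x
theta0, theta1 = batch_gradient_descent(xs, ys)
```

Because the cost is convex, the result does not depend on the starting point, only on the learning rate being small enough for the iterations to converge.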



## Gradient Descent and multivariate calculus review

> https://developers.google.com/machine-learning/crash-course/reducing-loss/gradient-descent

---

Say we are trying to learn: `y = w1.x + b`

Suppose we had the time and the computing resources to calculate the loss for all possible values of w1. For the kind of regression problems we've been examining, the resulting plot of loss vs w1 will always be convex. In other words, the plot will always be bowl-shaped, kind of like this:

![](https://developers.google.com/machine-learning/crash-course/images/convex.svg)

Convex problems have only one minimum; that is, only one place where the slope is exactly 0. That minimum is where the loss function converges.

Calculating the loss function for every conceivable value of w1 over the entire data set would be an inefficient way of finding the convergence point. Let's examine a better mechanism—very popular in machine learning—called gradient descent.

The first stage in gradient descent is to pick a starting value (a starting point) for w1. The starting point doesn't matter much; therefore, many algorithms simply set w1 to 0 or pick a random value. The following figure shows that we've picked a starting point slightly greater than 0:

![](https://developers.google.com/machine-learning/crash-course/images/GradientDescentStartingPoint.svg)

The gradient descent algorithm then calculates the gradient of the loss curve at the starting point. Here in Figure 3, the gradient of the loss is equal to the derivative (slope) of the curve, and tells you which way is "warmer" or "colder." __When there are multiple weights, the gradient is a vector of partial derivatives with respect to the weights.__

Note that a gradient is a vector, so it has both of the following characteristics:

- a direction
- a magnitude

The gradient always points in the direction of steepest increase in the loss function. The gradient descent algorithm takes a step in the direction of the negative gradient in order to reduce loss as quickly as possible.

To determine the next point along the loss function curve, the gradient descent algorithm adds some fraction of the gradient's magnitude to the starting point as shown in the following figure:

![](https://developers.google.com/machine-learning/crash-course/images/GradientDescentGradientStep.svg)

---

### Partial derivatives

![](https://i.imgur.com/1izr4JY.png)

### Gradients

![](https://i.imgur.com/aEbEaxP.png)

![](https://developers.google.com/machine-learning/crash-course/images/ThreeDimensionalPlot.svg)

The gradient of f(x,y) is a two-dimensional vector that tells you in which (x, y) direction to move for the maximum increase in height. Thus, the negative of the gradient moves you in the direction of maximum decrease in height. In other words, the negative of the gradient vector points into the valley.
In machine learning, gradients are used in gradient descent. We often have a loss function of many variables that we are trying to minimize, and we try to do this by following the negative of the gradient of the function.

![](https://i.imgur.com/Uy6x6ns.png)

![](https://i.imgur.com/9sUqkJv.png)

## Lin Algebra Review

### Design matrix basics

Say we have a feature matrix of shape m x d and wts theta0, theta1, ..., theta_d

We prepend a column of 1s to the feature matrix

Shape: X = m x (d+1)


We take a vector theta = [theta0, theta1, ..., theta_d] of shape (d+1) x 1

X.theta = [m x (d+1)] . [(d+1) x 1] -> m x 1

The diag below shows it for d = 1 and m = 4

![](https://i.imgur.com/nEALO1l.png)
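The shapes above can be checked with a short NumPy sketch for d = 1 and m = 4. The feature values and the theta values are assumptions picked for illustration, not read off the figure.

```python
import numpy as np

# m = 4 examples, d = 1 feature (illustrative values).
features = np.array([[2104.0], [1416.0], [1534.0], [852.0]])

# Prepend a column of 1s so that theta0 acts as the intercept: shape m x (d + 1).
X = np.hstack([np.ones((features.shape[0], 1)), features])

# Parameter vector of shape (d + 1) x 1 (illustrative values).
theta = np.array([[-40.0], [0.25]])

# One matrix-vector product gives the predictions for all m examples: shape m x 1.
predictions = X @ theta
```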

### Matrix - matrix mul as multiple matrix-vector mul

We consider each column of the second matrix as a vector and perform matrix-vector mul to get a resulting vector

We then stack the resulting vectors to get the o/p matrix

![](https://i.imgur.com/UO4JiYg.png)

![](https://i.imgur.com/HmG5DRj.png)

![](https://i.imgur.com/5LhHcoa.png)


### Matrix - matrix mul to test multiple hypotheses in parallel

Here we have m = 4, d = 1 (so the design matrix is 4 x 2) and 3 separate hypotheses, each with its own theta_0, theta_1

We stack each set of thetas as a column of a matrix

When we mul the design matrix with this theta matrix, we take each col as a vector - each col represents a combo of thetas - so the resultant columns represent the predictions for each theta set

![](https://i.imgur.com/lQzOvWe.png)
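A short NumPy sketch of evaluating several hypotheses with one matrix-matrix product. The design matrix and the three sets of thetas are assumed values for illustration.

```python
import numpy as np

# Design matrix: a column of 1s plus one feature, m = 4 (illustrative values).
X = np.array([
    [1.0, 2104.0],
    [1.0, 1416.0],
    [1.0, 1534.0],
    [1.0, 852.0],
])

# Each COLUMN holds (theta_0, theta_1) for one hypothesis (illustrative values).
Theta = np.array([
    [-40.0, 200.0, -150.0],
    [0.25, 0.1, 0.4],
])

# Column k of the result holds the predictions of hypothesis k for all examples.
all_predictions = X @ Theta  # shape (4, 3)
```

Each column of `all_predictions` equals the matrix-vector product of `X` with the corresponding column of `Theta`, which is exactly the stacking described above.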


### Identity Matrix

![](https://i.imgur.com/OqYYnly.png)


### Inverse and Transpose

- only sq matrices have inverses

![](https://i.imgur.com/kNu4QAF.png)

- matrices that do not have an inverse are called "singular" or "degenerate"

Transpose:

![](https://i.imgur.com/ISjGs1S.png)

1st row becomes 1st col

2nd row becomes 2nd col


#### Some questions

![](https://i.imgur.com/tAAh2kr.png)

## Multivariate LinReg


Some notations:

![](https://i.imgur.com/nM1iENG.png)

![](https://i.imgur.com/KnzC7gD.png)


Now our hypothesis function should also be able to take in multiple features:

![](https://i.imgur.com/N638jzk.png)

![](https://i.imgur.com/FMSRbSp.jpeg)

### Gradient Descent for Multiple Variables

Summary so far:

![](https://i.imgur.com/eaCw9LO.png)

Now instead of treating all these thetas as diff params we can just treat it as a vector:

![](https://i.imgur.com/st7LzBY.png)

![](https://i.imgur.com/PjuBiUm.png)


When we had n = 1 we had 2 separate update rules for theta0 and theta1:

![](https://i.imgur.com/N4BNiAU.png)

Here the first feature was all 1s

In general:

![](https://i.imgur.com/By1JfqG.png)

And this can be written as:

![](https://i.imgur.com/yCpAmdj.png)

$x_0^{(i)}$ denotes the 0th feature of the i-th training sample, which is 1 for all i = 1 -> m

> Let us refer to the eqns in the above fig as Fig 0.1

### Gradient Descent - full abstraction


First let us define the design matrix $X$ and the $θ$ vector:


![](https://i.imgur.com/J449RrK.jpeg)

The hypothesis can be computed as $\mathbf{X}\cdot \theta$

![](https://i.imgur.com/vwvqfId.jpeg)

![](https://i.imgur.com/hRQGSOs.jpeg)

Now the loss function will be a scalar and for the m training examples it is defined as: 

$J(\theta)=\frac{1}{m}\sum_{i=1}^{m}(\theta^{T}\mathbf{x}^{(i)} - y^{(i)})^2  \rightarrow \textbf{equation 1}$

We can write the same in vectorized form as:

![](https://i.imgur.com/iq9dHdI.jpeg)

As we see above the vectorized form gives same result as eqn 1

Till now we have computed:

- vectorized form of hypothesis
- vectorized form of loss 

Now we want to compute $\frac{\partial J}{\partial \theta}$ to get the gradient:

![](https://i.imgur.com/oITYEmZ.jpeg)

Couple of things to note here:

- The computed gradient expression : $\textbf{X}^T(\textbf{X}\theta - \mathbf{y})$ is a vector of shape (n+1)x1 consisting of the gradients w.r.t each theta : theta0, theta1... theta_n

- Once we expand out the term $\textbf{X}^T(\textbf{X}\theta - \mathbf{y})$ and see the individual gradients, they are same as we saw earlier in Fig 0.1

- The above is just a way to vectorize everything and get the gradients and update them without the need of any loops
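As a sanity check on the vectorized form, the sketch below compares it against an explicit double loop. It assumes the loss of equation 1 (with the 1/m factor), for which the gradient carries a factor 2/m; with the 1/(2m) convention used in the course slides the factor would be 1/m instead. The data and function names are assumptions for illustration.

```python
import numpy as np


def gradient_vectorized(theta, X, y):
    """Gradient of J = (1/m) * sum((X @ theta - y)**2), computed without loops."""
    m = len(y)
    return (2.0 / m) * (X.T @ (X @ theta - y))


def gradient_loop(theta, X, y):
    """Same gradient, one parameter and one training example at a time."""
    m, n_params = X.shape
    grad = np.zeros(n_params)
    for j in range(n_params):
        for i in range(m):
            grad[j] += (X[i] @ theta - y[i]) * X[i, j]
    return 2.0 * grad / m


X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # first column is all 1s
y = np.array([2.0, 4.0, 6.0])
theta = np.array([0.5, 1.0])

g_vec = gradient_vectorized(theta, X, y)
g_loop = gradient_loop(theta, X, y)
```

Both functions return one entry per parameter (theta0, theta1, ...), so the GD update is simply `theta - alpha * g_vec`.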

### Gradient Descent in Practice I - Feature Scaling

Imagine we have 2 features (ignoring the bias) and they are on vastly different scales

So our contour plot will take on a very skewed, elliptical shape (left one shows an even more exaggerated scenario)

So GD can take a long time to converge in such cases and it can keep oscillating

![](https://i.imgur.com/KUrwtLJ.png)

We can scale the features s.t the contours take a more circular shape and GD can converge much faster

![](https://i.imgur.com/w9HR0Dl.png)

- in general we want to scale features to a range of -1 -> +1

- But this -1 -> +1 is not some golden rule

- In general even if features are say in range 0-> 3 its fine

- But for big diff like -100 -> +100 or say -0.0001 -> 0.0001 - we should scale them appropriately

- On the RHS there are some rules of thumb mentioned - if features are in that range we dont have to worry about scaling

![](https://i.imgur.com/YiJV9sv.png)

Mean normalization:

![](https://i.imgur.com/ARR17aO.png)

![](https://i.imgur.com/vv7Ehkb.png)
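A minimal sketch of mean normalization, assuming the variant that subtracts the feature mean and divides by the feature range (max - min); dividing by the standard deviation is a common alternative. The feature values are assumptions for illustration.

```python
import numpy as np


def mean_normalize(features):
    """Scale each column to zero mean and unit range (max - min = 1)."""
    mu = features.mean(axis=0)
    value_range = features.max(axis=0) - features.min(axis=0)
    return (features - mu) / value_range, mu, value_range


# Two features on very different scales (illustrative values).
features = np.array([
    [2104.0, 5.0],
    [1416.0, 3.0],
    [1534.0, 3.0],
    [852.0, 2.0],
])
scaled, mu, value_range = mean_normalize(features)
```

The returned `mu` and `value_range` should be kept and reused to scale any new example before prediction, so that it is on the same scale as the training data.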

### Gradient Descent in Practice II - Learning Rate


![](https://i.imgur.com/Gi6Ik6p.png)

![](https://i.imgur.com/oWHzG88.png)

### Features and Polynomial Regression

We can define new features based on existing features

![](https://i.imgur.com/ve7sWHj.png)


Choosing a correct poly model:

![](https://i.imgur.com/G9x6qFE.png)

Here blue line represents a quad model and the green line a cubic model

Here as size inc eventually the quad model will dec - this does not make sense for size and house price, so probably the cubic model is better 

We simply treat these new cubic features as separate features and use Lin Reg

- the model should be linear in the wts; the features themselves can be non-linear (e.g. size^2, size^3)

- also keep note of feature scaling when we use polynomial features

![](https://i.imgur.com/JEBtekv.png)

![](https://i.imgur.com/ytVAkzc.png)

![](https://i.imgur.com/ZXeJCRT.png)
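A sketch of polynomial regression as plain LinReg on derived features. It assumes a single "size" feature and synthetic targets generated from a cubic; `np.linalg.lstsq` is used here as a stand-in solver instead of GD, purely to keep the example short.

```python
import numpy as np

size = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # illustrative feature values

# Derived features: size, size^2, size^3 are treated as three separate features.
poly = np.column_stack([size, size**2, size**3])

# The derived features span very different ranges, so scale them.
scaled = (poly - poly.mean(axis=0)) / poly.std(axis=0)

# Design matrix with the usual leading column of 1s.
X = np.column_stack([np.ones(len(size)), scaled])

# Synthetic targets from a cubic (an assumption made for the example).
y = 1.0 + 2.0 * size - 0.5 * size**2 + 0.1 * size**3

# The model is still linear in theta, so any LinReg solver applies.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ theta
```

Note how the cubic column ranges up to 125 while the linear column only reaches 5, which is why feature scaling matters more once polynomial terms are added.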

## Linear Regression assumptions

> https://christophm.github.io/interpretable-ml-book/limo.html

![](https://i.imgur.com/95jYXj8.png)

> More on homoskedasticity and why it's an important assumption: https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/homoscedasticity/

## Normal Equation

This is a method for solving for theta analytically rather than iteratively as in GD

In case the loss function has a single theta, we can just compute the derivative and set that equal to 0 and solve for theta

But in case the loss function is composed of multiple thetas, we set the partial derivatives of the loss w.r.t. each theta separately to 0 and solve for each theta

![](https://i.imgur.com/eCRs8tM.png)

### Example


![](https://i.imgur.com/VAERLnL.png)


The value of theta we get gives us the final result

Let's understand a bit deeper what we did

#### Design matrix

![](https://i.imgur.com/UGM6thD.png)

![](https://i.imgur.com/R2prbAE.png)

Note if we are using Normal eqn feature scaling is __not necessary__

![](https://i.imgur.com/Vr4UeDR.png)

Computing the inverse of a matrix is of the order O(n^3) so it gets very expensive as n inc

However for more sophisticated algos like LogReg we will see that the Normal eqn method does not work and we will have to use GD
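A minimal sketch of the Normal eqn $\theta = (X^TX)^{-1}X^Ty$ in NumPy. Instead of forming the inverse explicitly it solves the equivalent linear system, which is a design choice made here for numerical stability; the data is an assumption for illustration.

```python
import numpy as np


def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y, equivalent to theta = inv(X^T X) X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)


# Unscaled feature on purpose: no feature scaling is needed for this method.
sizes = np.array([1000.0, 2000.0, 3000.0, 4000.0])
X = np.column_stack([np.ones_like(sizes), sizes])
y = 50.0 + 0.1 * sizes  # synthetic targets (illustrative)

theta = normal_equation(X, y)
```

There is no learning rate to choose and no iteration, but the cost of the solve grows roughly cubically with the number of features.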


### Normal Equation Noninvertibility

In Normal eqns we have to compute inv(X^T.X) - what if this matrix is non-invertible?

What are the common cases in which this matrix is non-invertible?

![](https://i.imgur.com/6nDMG0A.png)
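One way to cope with a non-invertible $X^TX$ is the pseudo-inverse. The sketch below assumes a deliberately redundant feature (the third column is twice the second) and uses `np.linalg.pinv` with an explicit cutoff; the data and the cutoff value are assumptions for illustration.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])

# Redundant features: the third column is exactly 2 * the second column.
X = np.column_stack([np.ones(4), x1, 2.0 * x1])
y = 1.0 + 2.0 * x1

gram = X.T @ X

# Rank 2 < 3 columns, so gram has no ordinary inverse.
rank = np.linalg.matrix_rank(gram, tol=1e-8)

# The pseudo-inverse still yields a (minimum-norm) least-squares solution.
theta = np.linalg.pinv(gram, rcond=1e-10) @ X.T @ y
predictions = X @ theta
```

The individual thetas for the two redundant columns are not uniquely determined, but the predictions are, which is why the pseudo-inverse gives a usable model here.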

### Quiz

![](https://i.imgur.com/9UsRc6M.png)

![](https://i.imgur.com/wRHkEl1.png)

## Classification - Logistic Regression

Say we have a dataset of tumor size and whether the tumor is malignant (1) or not (0)

We can simply fit a st line through it using LinReg

and then we set a threshold on the y axis value

![](https://i.imgur.com/km28jFh.png)

But is this a good approach?

Say we have an example way out to the right, now intuitively this should not change the decision boundary, but when we fit LinReg, the line will look different and the decision boundary if we chose the same threshold of 0.5 will change

![](https://i.imgur.com/tfU8Z2O.png)

So by adding a new example the decision boundary shifted and caused us to get a worse hypothesis

More on the problems of using regression for classification tasks: https://stats.stackexchange.com/questions/22381/why-not-approach-classification-through-regression

Also in LinReg the hypothesis can o/p values less than 0 or more than 1, which is counter-intuitive for classification

---

Logistic Regression model

We simply apply the sigmoid function to the hypothesis to transform it to a range bw 0->1

![](https://i.imgur.com/H7nBYXX.png)

#### Interpretation of hypothesis o/p

![](https://i.imgur.com/kGj8FQN.png)

![](https://i.imgur.com/4B5thwA.png)

![](https://i.imgur.com/RIeVlWu.jpeg)

![](https://i.imgur.com/N5dt2xG.png)




#### Decision boundary

We want to analyze further when the hypothesis predicts a -ve or a +ve class

Say we set a threshold of 0.5, s.t whenever $h_{\theta}(x) \geq 0.5$ we predict `y = 1` and whenever $h_{\theta}(x) <  0.5$ we predict `y=0`

From the sigmoid curve we see that g(z) >= 0.5 whenever z >= 0 

Similarly, as we have assumed for the LogReg model that $z = \theta^Tx$, we can say that $h_{\theta}(x) = g(\theta^Tx) \geq 0.5$ when $\theta^Tx \geq 0$, so we predict y=1 whenever $\theta^Tx \geq 0$

![](https://i.imgur.com/mvToRcK.png)

Similarly we predict y=0 whenever z < 0, i.e when $\theta^Tx < 0$

We can use this info to better understand the decision boundary of LogReg

Say we have:

![](https://i.imgur.com/f4KffDj.jpeg)

Now we have not discussed how to find these theta values but assume for now that they take the values (-3,1,1)

![](https://i.imgur.com/DbNdzaD.png)

Expanding on this, we can draw the db:

![](https://i.imgur.com/o1mSM0F.png)

Similarly for the region `x1+x2<3` we would predict y=0

The line separating the 2 classes is called the __decision boundary__

Also note that the db is a function of the theta values - we do not need the training set to plot the db. Of course the training set determines the values of theta, but the db itself can be determined once we have the theta values

![](https://i.imgur.com/CCxqVY2.png)
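A small sketch of the sigmoid and the linear decision boundary for theta = (-3, 1, 1). The test points and helper names are assumptions for illustration.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), with values between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(theta, x1, x2, threshold=0.5):
    """Predict y = 1 when h(x) = g(theta0 + theta1*x1 + theta2*x2) >= threshold."""
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    return 1 if sigmoid(z) >= threshold else 0


theta = (-3.0, 1.0, 1.0)  # decision boundary: x1 + x2 = 3

below = predict(theta, 1.0, 1.0)        # x1 + x2 = 2 < 3
on_boundary = predict(theta, 1.5, 1.5)  # x1 + x2 = 3, so z = 0 and h(x) = 0.5
above = predict(theta, 3.0, 2.0)        # x1 + x2 = 5 > 3
```

Only theta appears in `predict`, which mirrors the point above that the db is a property of the parameters rather than of the training set.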

#### Non linear decision boundaries

We can add higher order poly terms to LogReg to build complex dbs

![](https://i.imgur.com/B9QqZuZ.png)





### Cost function

Let us first define the design matrix for LogReg - note this is the same as LinReg

But here the hypothesis consists of a sigmoid function

![](https://i.imgur.com/3e98ggY.jpeg)

If we reused the squared-error cost from LinReg here, the non-linear sigmoid in the hypo would make the loss function __non-convex - harder to optimize by GD__

![](https://i.imgur.com/TXhFbYY.jpeg)


![](https://i.imgur.com/0p08YNL.png)

### Simplified Cost Function

![](https://i.imgur.com/CtjmGqg.jpeg)

### Gradient of cost function - for a single feature

![](https://i.imgur.com/EmLBtyh.jpeg)

![](https://i.imgur.com/P6l0lC0.jpeg)

![](https://i.imgur.com/1WFsC8y.jpeg)

Here we have derived the grad of the loss function w.r.t a single parameter - $\theta_j$

Here the grad looks identical to that in LinReg but the diff is in the hypothesis

In LinReg $h_{\theta}(x) = \theta^{T}.{\textbf{x}}$

Here $h_{\theta}(x) = sigmoid(\theta^{T}.{\textbf{x}})$

### Gradient Descent - full abstraction

![](https://i.imgur.com/ZfRKLvT.jpeg)

![](https://i.imgur.com/R4qLCRq.jpeg)


Now we can use these gradients in GD to come to an optimal set of theta values using which we can compute $h_{\theta}(x)$ which will be the predicted prob of y class i.e p(y=1 | x, theta)

![](https://i.imgur.com/GhnxKmm.png)

Feature Scaling can help GD converge faster in LogReg as well 
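A minimal sketch of vectorized GD for LogReg, assuming the cross-entropy cost and the gradient $\frac{1}{m}X^T(g(X\theta) - y)$. The toy tumor-size data (deliberately not perfectly separable), the learning rate and the iteration count are assumptions for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cost(theta, X, y):
    """Cross-entropy cost J(theta) averaged over the m training examples."""
    h = sigmoid(X @ theta)
    return float(-np.mean(y * np.log(h) + (1 - y) * np.log(1 - h)))


def gradient(theta, X, y):
    """Same form as in LinReg, but with the sigmoid hypothesis."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)


def fit(X, y, alpha=0.1, n_iters=1000):
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        theta = theta - alpha * gradient(theta, X, y)
    return theta


sizes = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(sizes), sizes])

initial_cost = cost(np.zeros(2), X, y)  # all predictions are 0.5 at theta = 0
theta = fit(X, y)
final_cost = cost(theta, X, y)
probabilities = sigmoid(X @ theta)  # estimates of p(y=1 | x, theta)
```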

### Interpreting Logistic Regression coefficients

> Based on the article by Dina Jankovik: https://towardsdatascience.com/a-simple-interpretation-of-logistic-regression-coefficients-e3a40a62e8cf

---


Let’s first start from a Linear Regression model, to ensure we fully understand its coefficients. This will be a building block for interpreting Logistic Regression later.

Here’s a Linear Regression model, with 2 predictor variables and outcome Y:

> Y = a+ bX₁ + cX₂ ( Equation * )


Let’s pick a random coefficient, say, b. Let’s assume that b >0. Interpreting b is simple: a __1-unit increase in X₁ will result in an increase in Y by b units, if all other variables remain fixed (this condition is important to know)__. Note that if b < 0, then a 1-unit increase in X₁ will decrease Y by |b| units.


As an example, let’s consider the following model that predicts the house price based on 2 input variables: square footage and age.

> house_price = a + 50,000 * square_footage - 20,000 * age


If we increase the square footage by 1 square foot, the house price will increase by \$50,000. If we increase the age of the house by 1 year, the house price will decrease by \$20,000. For each additional year of age, the house price will decrease by a further \$20,000.

OK, this was fairly simple. Let’s now move on to Logistic Regression.


Here’s what a Logistic Regression model looks like:

> logit(p) = a+ bX₁ + cX₂ ( Equation ** )

logit(p) is just a shortcut for log(p/(1-p)), where p = P{Y = 1}, i.e. the probability of “success”, or the presence of an outcome. X₁ and X₂ are the predictor variables, and b and c are their corresponding coefficients, each of which determines the emphasis X₁ and X₂ have on the final outcome Y (or p). Last, a is simply the intercept.

How did we get this equation ?

![](https://i.imgur.com/waM5ipp.jpeg)

We can still use the old logic and say that a 1 unit increase in, say, X₁ will result in b increase in logit(p). I’m literally being a copycat here and applying the linear model interpretation. Why not? Equations * and ** actually have the same shape!

But now we have to dive deeper into the statement “a 1 unit increase in X₁ will result in b increase in logit(p)”. The first portion is clear, but we can’t really sense the b increase in logit(p). What does this mean at all?
To understand this, let’s first unwrap logit(p). As mentioned before, logit(p) = log(p/(1-p)), where p is the probability that Y = 1. Y can take two values, either 0 or 1. P{Y=1} is called the probability of success. Hence logit(p) = log(P{Y=1}/P{Y=0}). This is called the log-odds ratio.


#### Demystifying the log-odds ratio

We arrived at this interesting term log(P{Y=1}/P{Y=0}) a.k.a. the log-odds ratio. So now back to the coefficient interpretation: a 1 unit increase in X₁ will result in b increase in the log-odds ratio of success : failure.

> The probability of getting a 4 when throwing a fair 6-sided die is 1/6 or ~16.7%. On the other hand, the odds of getting a 4 are 1:5, or 20%. This is equal to p/(1-p) = (1/6)/(5/6) = 20%. So, the odds ratio represents the ratio of the probability of success to the probability of failure. Switching from odds to probabilities and vice versa is fairly simple.

Now, the log-odds ratio is simply the logarithm of the odds ratio. The logarithm is introduced because it maps the odds, which lie in (0, ∞), onto the whole real line (so they can be modelled by an unbounded linear function of the predictors) while shrinking extremely large values of P{Y=1}/P{Y=0}. Also, the logarithmic function is monotonically increasing, so it won’t ruin the order of the original sequence of numbers.


> That being said, an increase in X₁ will result in an increase in the log-odds ratio log(P{Y=1}/P{Y=0}) by amount b > 0, which will increase the odds ratio itself (since log is a monotonically increasing function), and this means that P{Y=1} get a bigger proportion of the 100% probability pie. In other words, if we increase X₁, the odds of Y=1 against Y=0 will increase, resulting in Y=1 being more likely than it was before the increase.

> logit(p) = 0.5 + 0.13 * study_hours + 0.97 * female


In the model above, b = 0.13, c = 0.97, and p = P{Y=1} is the probability of passing a math exam. Let’s pick study_hours and see how it impacts the chances of passing the exam. Increasing the study hours by 1 unit (1 hour) will result in a 0.13 increase in logit(p) or log(p/(1-p)). Now, if log(p/(1-p)) increases by 0.13, that means that p/(1-p) will be multiplied by exp(0.13) ≈ 1.14. __This is a 14% increase in the odds of passing the exam (assuming that the variable female remains fixed).__

Let’s also interpret the impact of being a female on passing the exam. We know that exp(0.97) = 2.64. That being said, the odds for passing the exam are 164% higher for women.
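The arithmetic above can be reproduced with a few lines of Python. The coefficients come from the example model; the helper names and the choice of "0 vs 1 study hours, female = 0" for the probability check are assumptions for illustration.

```python
import math


def odds(p):
    """Convert a probability into odds p / (1 - p)."""
    return p / (1.0 - p)


def probability_from_logit(logit):
    """Invert logit(p) = log(p / (1 - p))."""
    return 1.0 / (1.0 + math.exp(-logit))


# Die example: probability 1/6 corresponds to odds of 1:5 = 0.2.
odds_of_four = odds(1.0 / 6.0)

# A coefficient b multiplies the odds by exp(b) per 1-unit increase.
study_hours_multiplier = math.exp(0.13)  # about 1.14, i.e. +14% odds
female_multiplier = math.exp(0.97)       # about 2.64, i.e. +164% odds

# Check on probabilities: logit = 0.5 + 0.13 * study_hours with female = 0.
p_before = probability_from_logit(0.5)         # 0 study hours
p_after = probability_from_logit(0.5 + 0.13)   # 1 study hour
observed_multiplier = odds(p_after) / odds(p_before)
```

The multiplier applies to the odds, not to the probability itself: the probability changes by a different amount depending on where it starts.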




## Multi-class classification: One vs all (one vs rest)

Say we have to predict one of 3 classes (y=1 or y=2 or y=3)

What we do is we take our training dataset and turn it into 3 separate binary classification problems 

1. class1 -> +ve class (triangles) | class2 and class3 -> -ve class (circles)

Fit a classifier $h_{\theta}^1(x)$, to detect bw these 2 classes

2. class2 -> +ve class (squares) | class1 and class3 -> -ve class (circles)

Fit a classifier $h_{\theta}^2(x)$, to detect bw these 2 classes

3. class3 -> +ve class (crosses) | class1 and class2 -> -ve class (circles)

Fit a classifier $h_{\theta}^3(x)$, to detect bw these 2 classes

![](https://i.imgur.com/0Nes0qO.png)


$h_{\theta}^i(x)$ is P(y=i | x, theta) for i=1,2,3

Here we have classifiers to predict P(y=1|x, theta), P(y=2|x, theta), P(y=3|x, theta)

We can pick the i for which $h_{\theta}^i(x)$ is max and that is the predicted class

![](https://i.imgur.com/Yr7YPHG.png)
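A sketch of the prediction step of one-vs-all, assuming the three binary classifiers have already been trained. The parameter values in `Theta` and the input `x` are made-up numbers for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def predict_one_vs_all(Theta, x):
    """Theta has one row of parameters per class; x includes the leading 1."""
    probabilities = sigmoid(Theta @ x)  # h^i(x) for every class i
    predicted_class = int(np.argmax(probabilities)) + 1  # classes are 1, 2, 3
    return predicted_class, probabilities


# One row per binary classifier (hypothetical, already "trained" values).
Theta = np.array([
    [2.0, -1.0, -1.0],
    [-4.0, 1.0, 0.0],
    [-4.0, 0.0, 1.0],
])
x = np.array([1.0, 6.0, 1.0])  # leading 1 for the bias term

predicted_class, probabilities = predict_one_vs_all(Theta, x)
```

The three outputs come from independent classifiers, so they need not sum to 1; only their relative order matters for the prediction.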




## Overfitting

Underfitting - high bias

Bias - imagine you fit a simple st line through the data - this is said to have high bias as it seems to have a preconception that a st line should fit the data and, in spite of seeing evidence to the contrary in the training data, it fails to adjust this notion

On the contrary, imagine you have a very complex, overfit poly curve - low bias as it fits each tr pt - high variance as a small change in the tr data changes the model completely

![](https://i.imgur.com/1OuqwmK.png)

![](https://i.imgur.com/891rPpk.png)


### Addressing overfitting

If we had one feature, we could simply plot the target w.r.t that feature and see what complexity of model is best

But for multiple features, it's not possible

If we have a lot of features and not many tr examples, overfitting may be a prob

Solns:

![](https://i.imgur.com/xfzObsj.png)



### Regularization

![](https://i.imgur.com/5TbZi2j.png)

Here we have added some penalty terms to theta3 and theta4

The only way we can make the overall loss function less, is if we reduce theta3 and theta4

But we do not know which parameters we should regularize - so we regularize all params

![](https://i.imgur.com/rXQzrH7.png)

If we think of J(theta) now the first part tells us that we need to fit the training data well and the 2nd part tells us that we need to keep the magnitude of theta values small. Lambda (regularization param) controls the tradeoff bw the 2 goals

![](https://i.imgur.com/mn0yUvl.png)





### Regularized Linear Regression

To recap, we had added in a regularization term to the MSE cost function of linear regression like:

![](https://i.imgur.com/RrDZ7Hu.png)

In the GD step we had:

![](https://i.imgur.com/V1eSF7v.png)

This remains same for $\theta_0$ as we do not add any regularization to it

But for the other thetas, the GD update step will change

![](https://i.imgur.com/Br893BT.jpeg)

Above we have written the GD update rule for $\theta_1 ... \theta_n$

Note at every step theta is multiplied by a fraction, so it shrinks
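The shrinking effect can be seen in a short sketch of one regularized GD step, assuming the update $\theta_j := \theta_j(1 - \alpha\frac{\lambda}{m}) - \alpha\frac{1}{m}\sum_i (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$ for j >= 1. The data and hyperparameter values are assumptions for illustration.

```python
import numpy as np


def regularized_gd_step(theta, X, y, alpha, lam):
    """One GD step for regularized LinReg; theta[0] is not regularized."""
    m = len(y)
    grad = X.T @ (X @ theta - y) / m  # gradient of the unregularized cost
    shrink = np.full_like(theta, 1.0 - alpha * lam / m)
    shrink[0] = 1.0  # leave theta_0 untouched by the penalty
    return theta * shrink - alpha * grad


X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = np.array([0.5, 1.0])

shrink_factor = 1.0 - 0.1 * 1.0 / len(y)  # slightly below 1
new_theta = regularized_gd_step(theta, X, y, alpha=0.1, lam=1.0)
```

For typical values of alpha, lambda and m the factor is just under 1, so each step first shrinks theta_j a little and then applies the usual gradient update.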

### Normal eqn formula for regularization

As previously we had shown that there is a closed form soln for LinReg, we can show a similar soln for regularized lin reg as shown below

![](https://i.imgur.com/tmBvfR8.png)

Also remember that the non-reg version of the normal eqn suffered from non-invertibility issues when we have redundant features or when m<=n

To add in regularization, the equation is the same as our original, except that we add another term inside the parentheses:

$$
\begin{align*}& \theta = \left( X^TX + \lambda \cdot L \right)^{-1} X^Ty \newline& \text{where}\ \ L = \begin{bmatrix} 0 & & & & \newline & 1 & & & \newline & & 1 & & \newline & & & \ddots & \newline & & & & 1 \newline\end{bmatrix}\end{align*}
$$

![](https://i.imgur.com/1ON8kJ8.png)

If we apply this new regularized eqn with any lambda > 0, this solves the prob of non-invertibility
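A sketch of the regularized normal eqn $\theta = (X^TX + \lambda L)^{-1}X^Ty$ on a design matrix with a redundant column, where the unregularized $X^TX$ is singular. The data and the value of lambda are assumptions for illustration.

```python
import numpy as np


def regularization_matrix(n_params):
    """Identity matrix with a 0 in the top-left corner (theta_0 is not penalized)."""
    L = np.eye(n_params)
    L[0, 0] = 0.0
    return L


def regularized_normal_equation(X, y, lam):
    A = X.T @ X + lam * regularization_matrix(X.shape[1])
    return np.linalg.solve(A, X.T @ y)


x1 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones(4), x1, 2.0 * x1])  # third column is redundant
y = 1.0 + 2.0 * x1

unregularized_rank = np.linalg.matrix_rank(X.T @ X, tol=1e-8)
regularized_rank = np.linalg.matrix_rank(X.T @ X + 0.1 * regularization_matrix(3))

theta = regularized_normal_equation(X, y, lam=0.1)
predictions = X @ theta
```

Adding $\lambda L$ lifts the rank from 2 to 3, so the system has a unique solution, at the price of predictions that are slightly biased away from the unregularized fit.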



### Regularized Logistic Regression

In LogReg if we have a lot of polynomial terms our DB might get very complex

Similar to LinReg, we can add a regularization term to the cost function for $\theta_1, ... \theta_n$ as shown

![](https://i.imgur.com/9VR7qWT.png)

If we apply regularization, the DB will be much smoother

The GD step for reg LogReg:

![](https://i.imgur.com/cffrsBW.png)

Note that there is not much diff here - the update to GD we make is similar to how we derived for LinReg as the addition of the regularization terms to the cost function is identical so the derivatives are also same

![](https://i.imgur.com/hJP6Mes.png)

### Quiz

![](https://i.imgur.com/vnQ7etX.png)

![](https://i.imgur.com/Blk6905.png)

![](https://i.imgur.com/EkoUIim.png)

![](https://i.imgur.com/8OlnKzk.png)

## Neural networks introduction

![](https://i.imgur.com/kozlzWQ.png)

When we have a large no of features, the number of possible combinations of non-linear features grows rapidly and it becomes impractical to manually build such features and feed them into models






### Neural network - simple logistic regression

![](https://i.imgur.com/cPMlc2Q.jpeg)

### Neural network architecture - general form

![](https://i.imgur.com/cS3f4VO.jpeg)

![](https://i.imgur.com/EwB7zjX.jpeg)

![](https://i.imgur.com/Dfrl8vs.png)


#### Final generic form (with all symbols)

![](https://i.imgur.com/qT9ozDF.jpeg)

![](https://i.imgur.com/S0vHsYf.jpeg)


#### NN learning its own features

Let us use the same nw architecture but just focus on the 2nd layer onwards:

![](https://i.imgur.com/zc4l8CM.jpeg)

As we can see, NN can learn to derive features from its inputs - we do not have to provide all feature combinations to them


#### NN as OR and AND n/w

![](https://i.imgur.com/AVLKPOr.jpeg)

### NN for non-linear hypothesis

We have already built NN architectures for AND and OR functions

#### NN for NOT:

![](https://i.imgur.com/AiSDBha.jpeg)

> To include negations, the general idea is to put a __large negative__ weight in front of the variable that we want to negate

#### NOT(x1) AND NOT(x2)

Using the above idea we can build this

![](https://i.imgur.com/R2tRmqr.jpeg)

![](https://i.imgur.com/ptlDvi5.png)



#### x1 XNOR x2

![](https://i.imgur.com/dyybUu3.jpeg)

As we can see, we cannot build `x1 XNOR x2` using a single level, we need 2 levels - or 2 layers

- first layer should have 2 ops to compute:
    - `(NOT x1) AND (NOT x2)`
    - `x1 AND x2`

- second layer should compute the `OR` of the prev 2 ops i.e. `[(NOT x1) AND (NOT x2)] OR [x1 AND x2]`


Let us work through the architecture and calculations:

![](https://i.imgur.com/SVURwdS.jpeg)

![](https://i.imgur.com/sUgmlz0.jpeg)

![](https://i.imgur.com/BCPHAa7.jpeg)
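The two-layer construction `[(NOT x1) AND (NOT x2)] OR [x1 AND x2]` can be checked with a tiny forward pass. The weights below are the values commonly used for these gates with sigmoid units; they are assumptions here and may differ from the ones in the handwritten figures.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron(weights, inputs):
    """One logistic unit; weights[0] is the bias weight."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return sigmoid(z)


# Assumed weights (bias, w1, w2) for each gate.
AND_WEIGHTS = (-30.0, 20.0, 20.0)
NOT_BOTH_WEIGHTS = (10.0, -20.0, -20.0)  # (NOT x1) AND (NOT x2)
OR_WEIGHTS = (-10.0, 20.0, 20.0)


def two_layer_network(x1, x2):
    # Hidden layer: two units computed from the raw inputs.
    a1 = neuron(AND_WEIGHTS, (x1, x2))
    a2 = neuron(NOT_BOTH_WEIGHTS, (x1, x2))
    # Output layer: OR of the two hidden activations.
    return neuron(OR_WEIGHTS, (a1, a2))


truth_table = {
    (x1, x2): round(two_layer_network(x1, x2))
    for x1 in (0, 1)
    for x2 in (0, 1)
}
```

The network outputs 1 exactly when both inputs are equal, which is the truth table of the function built in this section.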


__Simplified diagram__:

![](https://i.imgur.com/dnIh56F.jpeg)

- we had our input layer
- then we had the first hidden layer which computed slightly more complex functions of the input
- by adding another layer we built up on the complexity of the previous layers and got a more complex function
- successive layers build on previous layers and so on...


> As we have built AND, OR and NOT for binary inputs and used them to build up the XNOR function, and as AND, OR and NOT together can express any boolean function, we can get an intuition as to why NNs are universal function approximators

### NN for Multiclass Classification

Each op node essentially acts like a LogReg classifier that is trying to recognize one of the 4 classes

![](https://i.imgur.com/OjBWm8h.jpeg)

For training data representation, for each feature set the target class label y should be one of `[1,0,0,0] or [0,1,0,0] or [0,0,1,0] or [0,0,0,1]` for each class label

![](https://i.imgur.com/PpSvTpC.png)

Layer 2 has 5 nodes + 1 bias node, and the op layer has 10 nodes, so the weight matrix between them has shape 10 x 6

10x6 = 60 weights
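The same count as a tiny helper, assuming the convention that the weight matrix between a layer with s_j units and the next layer with s_(j+1) units has shape s_(j+1) x (s_j + 1). The function name is hypothetical.

```python
def weight_matrix_shape(units_in_layer, units_in_next_layer):
    """Shape of the weight matrix between two layers; +1 accounts for the bias unit."""
    return (units_in_next_layer, units_in_layer + 1)


shape = weight_matrix_shape(units_in_layer=5, units_in_next_layer=10)
n_weights = shape[0] * shape[1]
```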

### Quiz

![](https://i.imgur.com/vS5pnLd.png)

- answer: AND

![](https://i.imgur.com/8HBy1OG.png)

![](https://i.imgur.com/eOg7gpV.jpeg)

![](https://i.imgur.com/Z8Hd7WM.png)


![](https://i.imgur.com/wDCfNhH.jpeg)

- answer: the op will stay the same