### Linear Regression


Let's start with the dataset. I downloaded the [Used Car Price Dataset](https://www.kaggle.com/datasets/rishabhkarn/used-car-dataset) from Kaggle. It contains `13` feature (independent) variables ($x_i$) and one target (dependent) variable ($y$). Of those 13 features, I will use four: `mileage(kmpl)`, `engine(cc)`, `max_power(bhp)`, and `torque(Nm)`. The target is `price(in lakhs)`. <br>


mileage(kmpl) | engine(cc) | max_power(bhp) | torque(Nm) | price(in lakhs)
--------------|------------|----------------|------------|----------------
7.81          | 2996       | 2996           | 333        | 63.75
17.4          | 999        | 999            | 9863       | 8.99
20.68         | 1995       | 1995           | 188        | 23.75
16.5          | 1353       | 1353           | 13808      | 13.56

Here the $x$'s are four-dimensional vectors in $\mathbb{R}^4$. For instance, $x_1^{(i)}$ is the `mileage(kmpl)`, $x_2^{(i)}$ is the `engine(cc)`, $x_3^{(i)}$ is the `max_power(bhp)`, and $x_4^{(i)}$ is the `torque(Nm)` of the $i$-th car in the training set.


To perform supervised learning, we must decide how to represent the function (hypothesis) $h$. As an initial choice, we approximate $y$ as a linear function of $x$:

$$h_{\theta}(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 + \theta_4x_4$$

Here, the $\theta_i$'s are the parameters (also called weights) parameterizing the space of linear functions mapping from $\mathcal{X}$ to $\mathcal{Y}$. To simplify the notation, we drop the $\theta$ subscript in $h_\theta(x)$ and introduce the convention $x_0=1$ (the intercept term), so that
$$h(x) = \sum_{i=0}^d\theta_ix_i = \theta^Tx$$

On the right-hand side, $\theta$ and $x$ are both viewed as vectors, and $d$ is the number of feature (input/independent) variables, not counting $x_0$.
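As a minimal sketch of the hypothesis $h(x) = \theta^Tx$, the snippet below evaluates it in plain Python for the first row of the table above. The function name `hypothesis` and the values in `theta` are assumptions made only for illustration; the parameters are not fitted to the data.

```python
def hypothesis(theta, x):
    """Return h(x) = theta^T x; x must already include the intercept x_0 = 1."""
    return sum(t * x_i for t, x_i in zip(theta, x))

# First row of the table, with x_0 = 1 prepended:
# [x_0, mileage(kmpl), engine(cc), max_power(bhp), torque(Nm)]
x = [1.0, 7.81, 2996.0, 2996.0, 333.0]

# Hypothetical parameter values, chosen by hand for illustration (not learned).
theta = [1.0, 0.5, 0.01, 0.0, 0.0]

prediction = hypothesis(theta, x)
print(prediction)  # theta_0 * 1 + 0.5 * mileage + 0.01 * engine
```

Prepending $x_0 = 1$ is what lets the intercept $\theta_0$ be handled by the same dot product as the other parameters.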



Now, given a training set, how do we pick, or learn, the parameters $\theta$? One reasonable method is to make $h(x)$ close to $y$, at least for the training examples we have. To formalize this, we will define a function that measures, for each value of the $\theta$'s, how close the $h(x^{(i)})$'s are to the corresponding $y^{(i)}$'s. We define the **cost function**:
$$J(\theta) = \frac{1}{2}\sum_{i=1}^n(h_{\theta}(x^{(i)})-y^{(i)})^2.$$
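The cost function can be sketched directly from its definition. The helper names `hypothesis` and `cost`, and the toy data below, are assumptions for illustration only: the data is made up so that $y = 1 + 2x$ holds exactly, which makes the expected costs easy to check by hand.

```python
def hypothesis(theta, x):
    """h(x) = theta^T x, with x_0 = 1 included in x."""
    return sum(t * x_i for t, x_i in zip(theta, x))

def cost(theta, X, y):
    """J(theta) = 1/2 * sum over examples of (h(x_i) - y_i)^2."""
    return 0.5 * sum((hypothesis(theta, x_i) - y_i) ** 2 for x_i, y_i in zip(X, y))

# Made-up toy data: one feature plus the intercept term, y = 1 + 2 * x.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]

perfect = cost([1.0, 2.0], X, y)  # the true parameters give zero cost
worse = cost([0.0, 0.0], X, y)    # 0.5 * (1^2 + 3^2 + 5^2) = 17.5
print(perfect, worse)
```

A smaller $J(\theta)$ means the predictions are closer to the targets on the training set, which is the quantity the next section tries to minimize.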



# LMS algorithm

We want to choose $\theta$ so as to minimize $J(\theta)$. To do so, let's use a search algorithm that starts with some "initial guess" for $\theta$, and that repeatedly changes $\theta$ to make $J(\theta)$ smaller, until hopefully we converge to a value of $\theta$ that minimizes $J(\theta)$. Specifically, let's consider the **gradient descent** algorithm, which starts with some initial $\theta$, and repeatedly performs the update:
$$\theta_j:=\theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta). $$
(This update is performed simultaneously for all values of $j = 0, \dots, d$.)


Here, $\alpha$ is called the **learning rate**. This is a very natural algorithm that repeatedly takes a step in the direction of steepest decrease of $J$.
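To make the role of the learning rate concrete, here is a small sketch of the update rule on a one-dimensional function. The function $J(\theta) = (\theta - 3)^2$, the name `gradient_descent_1d`, and both values of `alpha` are assumptions chosen for illustration; they do not come from the car dataset.

```python
def gradient_descent_1d(grad, theta0, alpha, num_steps):
    """Repeat theta := theta - alpha * dJ/dtheta for a fixed number of steps."""
    theta = theta0
    for _ in range(num_steps):
        theta = theta - alpha * grad(theta)
    return theta

def grad_J(theta):
    """Derivative of the illustrative cost J(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

# A small learning rate moves steadily toward the minimizer theta = 3.
converged = gradient_descent_1d(grad_J, theta0=0.0, alpha=0.1, num_steps=100)

# A learning rate that is too large overshoots further on every step.
diverged = gradient_descent_1d(grad_J, theta0=0.0, alpha=1.1, num_steps=100)

print(converged, diverged)
```

For this particular function each step multiplies the distance to the minimizer by $(1 - 2\alpha)$, so the iterates shrink toward 3 when $0 < \alpha < 1$ and grow without bound when $\alpha > 1$.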

To implement this algorithm, we have to work out the partial derivative term on the right-hand side. Let's first work it out for the case where we have only one training example $(x, y)$, so that we can neglect the sum in the definition of $J$. We have:


$$ \frac{\partial}{\partial\theta_j}J(\theta) = \frac{\partial}{\partial\theta_j}\frac{1}{2}(h_\theta(x) - y)^2$$
$$ = 2\cdot\frac{1}{2}(h_\theta(x)-y)\cdot\frac{\partial}{\partial\theta_j}(h_\theta(x)-y)$$
$$=(h_\theta(x)-y)\cdot\frac{\partial}{\partial\theta_j}\left(\sum_{i=0}^d\theta_ix_i-y\right)$$
$$=(h_\theta(x)-y)x_j$$
For a single training example, this gives the update rule:
$$\theta_j:= \theta_j +\alpha(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}.$$

This rule is called the **LMS** update rule (LMS stands for "least mean squares"), and is also known as the **Widrow-Hoff** learning rule. This rule has several properties that seem natural and intuitive. For instance, the magnitude of the update is proportional to the error term $(y^{(i)}-h_\theta(x^{(i)}))$. Thus, if we encounter a training example on which our prediction nearly matches the actual value of $y^{(i)}$, there is little need to change the parameters; in contrast, a larger change to the parameters will be made if our prediction $h_\theta(x^{(i)})$ has a large error (i.e., if it is very far from $y^{(i)}$).
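The single-example LMS update can be sketched as follows. The function name `lms_update`, the example $(x, y)$, and the learning rate are all assumptions picked so that the arithmetic is easy to follow by hand.

```python
def hypothesis(theta, x):
    """h(x) = theta^T x, with x_0 = 1 included in x."""
    return sum(t * x_i for t, x_i in zip(theta, x))

def lms_update(theta, x, y, alpha):
    """One LMS step: theta_j := theta_j + alpha * (y - h(x)) * x_j for every j."""
    error = y - hypothesis(theta, x)  # computed once, so all theta_j update together
    return [t + alpha * error * x_j for t, x_j in zip(theta, x)]

# Made-up example: x = (x_0, x_1) = (1, 2) with target y = 4.
x, y = [1.0, 2.0], 4.0
alpha = 0.1
theta = [0.0, 0.0]

new_theta = lms_update(theta, x, y, alpha)  # error is 4, so the step is 0.4 * x

error_before = abs(y - hypothesis(theta, x))
error_after = abs(y - hypothesis(new_theta, x))

# When the prediction is already exact, the error is 0 and nothing changes.
unchanged = lms_update([0.0, 2.0], x, y, alpha)
print(new_theta, error_before, error_after, unchanged)
```

Here the update moves $\theta$ from $(0, 0)$ to roughly $(0.4, 0.8)$ and reduces the absolute error on this example, while parameters that already predict $y$ exactly are left untouched.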

We derived the LMS rule for the case of a single training example. There are two ways to modify this method for a training set of more than one example. The first is to replace it with the following algorithm:

repeat until convergence $ \{ $
$$\theta_j := \theta_j + \alpha\sum_{i=1}^n(y^{(i)} - h_\theta(x^{(i)}))x_j^{(i)}, \text{(for every j)  (1.1)}$$
$ \}$



By grouping the updates of the coordinates into an update of the vector $\theta$, we can rewrite update (1.1) in a slightly more succinct way:
$$ \theta := \theta + \alpha\sum_{i=1}^n(y^{(i)} - h_\theta(x^{(i)}))x^{(i)}$$
This is simply gradient descent on the original cost function $J$. This method looks at every example in the entire training set on every step, and is called **batch gradient descent**. Note that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global optimum and no other local optima; thus gradient descent always converges to the global minimum (assuming the learning rate $\alpha$ is not too large). Indeed, $J$ is a convex quadratic function. Here is an example of gradient descent as it is run to minimize a quadratic function.

<figure>
  <img src="https://github.com/0shankart/MachineLearningNotes/blob/main/TraditionalML/images/batch_gradient_descent_global_minimum.png?raw=true" width="40%"/>
  <figcaption>Ref: https://cs229.stanford.edu/notes2021fall/cs229-notes1.pdf</figcaption>
</figure>
The ellipses shown above are the contours of a quadratic function. Also shown is the trajectory taken by gradient descent, which was initialized at (48, 30). The x marks in the figure (joined by straight lines) mark the successive values of $\theta$ that gradient descent went through.
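Batch gradient descent, as described above, can be sketched in a few lines of plain Python. The function name `batch_gradient_descent`, the toy data, the learning rate, and the iteration count are assumptions for illustration; a fixed number of iterations stands in for "repeat until convergence".

```python
def hypothesis(theta, x):
    """h(x) = theta^T x, with x_0 = 1 included in x."""
    return sum(t * x_i for t, x_i in zip(theta, x))

def batch_gradient_descent(X, y, alpha, num_iters):
    """Apply theta := theta + alpha * sum_i (y_i - h(x_i)) * x_i repeatedly."""
    theta = [0.0] * len(X[0])  # initial guess: all zeros
    for _ in range(num_iters):
        # Errors for every training example, all computed with the current theta.
        errors = [y_i - hypothesis(theta, x_i) for x_i, y_i in zip(X, y)]
        # Update every coordinate theta_j simultaneously from those errors.
        theta = [
            t + alpha * sum(e * x_i[j] for e, x_i in zip(errors, X))
            for j, t in enumerate(theta)
        ]
    return theta

# Made-up toy data generated from y = 1 + 2 * x (first column is x_0 = 1).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

theta = batch_gradient_descent(X, y, alpha=0.05, num_iters=5000)
print(theta)  # approaches the generating parameters [1, 2]
```

The learning rate here suits this small, well-scaled toy problem. Features with large raw magnitudes, such as `engine(cc)` in the thousands, would typically need a much smaller $\alpha$ or rescaled inputs for the same loop to converge.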

There is an alternative to batch gradient descent that also works very well. Consider the following algorithm: <br><br>Loop $ \{ $




In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
%cd /content/drive/MyDrive/files

In [7]:
!ls

imdb_data.csv  Used_Car_Dataset.csv  yelp.csv
