**Supervised Learning** (SL) is the category of machine learning (ML) most commonly used in real-world applications: a model learns from labeled examples, that is, inputs paired with known outputs. Here, we will focus on a **Regression** model, which predicts a number from infinitely many possible values. Throughout the course, we will apply several performance-evaluation and optimization algorithms to show how they work and how they can be applied in the future. 

The dataset we're looking at contains housing prices together with attributes such as the total area (in square feet), the number of bedrooms, and more. 

Goal: Use this dataset to create a linear regression model that predicts the price of a house from its total area.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt  # To visualize
import seaborn as sns #To Visualize
import math

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [None]:
df = pd.read_csv('/kaggle/input/housing-price-prediction/Housing.csv')
df

For simplicity, we'll begin with Univariate Linear Regression, predicting the price of a house from its area alone. Plotting Area (input) against Price (output) shows a roughly linear relationship between the two, which makes a linear regression model a reasonable choice. 

In [None]:
x = np.array(df['area']).reshape(-1, 1)
y = np.array(df['price']).reshape(-1, 1)

In [None]:
linear_regressor = LinearRegression()
linear_regressor.fit(x, y)
y_pred = linear_regressor.predict(x)
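
To make the fit above less of a black box, here is a minimal sketch of the ordinary least-squares line for a single feature, computed in closed form and cross-checked with `np.polyfit`. The arrays are made-up stand-ins for the `area` and `price` columns (an assumption, so the cell runs without `Housing.csv`), and the names `w_fit` and `b_fit` are illustrative only.

```python
import numpy as np

# Synthetic stand-in for the 'area' and 'price' columns (illustrative values only).
area = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
price = np.array([150_000.0, 210_000.0, 240_000.0, 310_000.0, 340_000.0])

# Closed-form least squares for one feature:
#   w = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   b = mean_y - w * mean_x
x_mean, y_mean = area.mean(), price.mean()
w_fit = np.sum((area - x_mean) * (price - y_mean)) / np.sum((area - x_mean) ** 2)
b_fit = y_mean - w_fit * x_mean

# np.polyfit with degree 1 solves the same least-squares problem.
w_poly, b_poly = np.polyfit(area, price, 1)

print(f"closed form: w = {w_fit:.2f}, b = {b_fit:.2f}")
print(f"np.polyfit:  w = {w_poly:.2f}, b = {b_poly:.2f}")
```

On the real data, `linear_regressor.coef_` and `linear_regressor.intercept_` should hold the corresponding slope and intercept of the fitted model.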

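The **Cost Function** for univariate linear regression is: 
 $$J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2 \tag{1}$$
where $f_{w,b}(x^{(i)}) = wx^{(i)} + b$ is the model's prediction for example $i$ and $m$ is the number of training examples.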
 
It's important to be able to visualize the cost function and to compute it yourself. Below are two exercises to help. The first lets you adjust `w` and `b` until the line in the right-hand plot matches the trend line in the left-hand plot. The second asks you to code the cost function yourself to put the concept into practice. 

Goal: Minimize the cost of our model to better fit it to the data.  

For a deeper discussion visit [Medium](https://medium.com/@lachlanmiller_52885/understanding-and-calculating-the-cost-function-for-linear-regression-39b8a3519fcb). 
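
As a reference point for equation (1), the following is one possible sketch of the cost computation, not the only way to write it. The three-point dataset is invented for illustration and lies exactly on $y = 2x$, so the cost should be zero at $w = 2$, $b = 0$; the `ravel()` calls are an assumption that inputs may arrive as column vectors like the `x` and `y` arrays built earlier.

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Return J(w, b) from equation (1) for the linear model f(x) = w * x + b."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    m = x.shape[0]
    errors = (w * x + b) - y  # f_wb(x^(i)) - y^(i) for every example
    return float(np.sum(errors ** 2) / (2 * m))

# Tiny illustrative dataset lying exactly on y = 2x.
x_demo = np.array([1.0, 2.0, 3.0])
y_demo = np.array([2.0, 4.0, 6.0])

cost_poor = compute_cost(x_demo, y_demo, w=1.0, b=0.0)  # errors are -1, -2, -3
cost_best = compute_cost(x_demo, y_demo, w=2.0, b=0.0)  # line passes through every point
print(cost_poor, cost_best)
```

By hand, the first call gives $(1 + 4 + 9) / (2 \cdot 3) = 14/6 \approx 2.33$, which is a useful check against your own implementation.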

### **Implementing Cost Function**

In [None]:
figure, axis = plt.subplots(ncols=2, figsize=(15, 8))

sns.scatterplot(x="area", y="price", data=df, ax=axis[0])
axis[0].plot(np.unique(x.flatten()), np.poly1d(np.polyfit(x.flatten(), y.flatten(), 1))(np.unique(x)), c='g')

sns.scatterplot(x="area", y="price", data=df, ax=axis[1])
#Update the w and b values to help visualize where the line of best fit lies. 
w = 1
b = 0
y_fit = w * x + b
axis[1].plot(x, y_fit, c='purple')

In [None]:
def compute_cost(x, y, w, b): 
    #Add Code Here to Compute Cost of Model. 
    raise NotImplementedError("compute_cost is left as an exercise")

In [None]:
cost = compute_cost(x, y, w, b)
cost

**Gradient Descent** is an optimization algorithm used to find a local minimum of the cost function *J(w,b)* and is described as:

$$\begin{align*} &\text{repeat until convergence: } \lbrace \\
&\quad w = w - \alpha \frac{\partial J(w,b)}{\partial w}, \quad b = b - \alpha \frac{\partial J(w,b)}{\partial b} \tag{2} \\
&\rbrace
\end{align*}$$
where the parameters $w$ and $b$ are updated simultaneously.  
The gradient is defined as:
$$
\begin{align}
\frac{\partial J(w,b)}{\partial w}  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})x^{(i)} \tag{3}\\
  \frac{\partial J(w,b)}{\partial b}  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)}) \tag{4}\\
\end{align}
$$

Here *simultaneously* means that you calculate the partial derivatives for all the parameters before updating any of the parameters.
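
The gradient in equations (3) and (4) and a single simultaneous update can be sketched as follows. This is one possible implementation under stated assumptions: the three-point dataset, the starting values, and the learning rate `alpha = 0.1` are all made up for illustration.

```python
import numpy as np

def compute_gradient(x, y, w, b):
    """Return (dJ/dw, dJ/db) from equations (3) and (4)."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    errors = (w * x + b) - y  # f_wb(x^(i)) - y^(i)
    dj_dw = float(np.mean(errors * x))
    dj_db = float(np.mean(errors))
    return dj_dw, dj_db

# Illustrative data and starting point (not taken from the housing dataset).
x_demo = np.array([1.0, 2.0, 3.0])
y_demo = np.array([2.0, 4.0, 6.0])
w, b, alpha = 1.0, 0.0, 0.1

# Simultaneous update: both partial derivatives are evaluated at the OLD (w, b) ...
dj_dw, dj_db = compute_gradient(x_demo, y_demo, w, b)
# ... and only then are the parameters overwritten.
w_new = w - alpha * dj_dw
b_new = b - alpha * dj_db
print(dj_dw, dj_db, w_new, b_new)
```

Updating `w` first and then computing `dj_db` with the new `w` would be a different (non-simultaneous) procedure, which is why both derivatives are stored before either parameter changes.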

Goal: Use gradient descent to find the values of $w$ and $b$ that minimize the cost function defined above. 

Visit [Towards Data Science](https://towardsdatascience.com/gradient-descent-algorithm-a-deep-dive-cf04e8115f21) for more information. 
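
Putting update rule (2) into a loop might look like the sketch below. The function signature mirrors the `gradient_descent(x, y, w_in, b_in, alpha, num_iters)` call used in this notebook, but the return value (final `w`, final `b`, and a list of costs) is an assumption, as are the toy data on the line $y = 2x + 1$ and the choice of `alpha = 0.1`.

```python
import numpy as np

def compute_cost(x, y, w, b):
    """J(w, b) from equation (1)."""
    errors = (w * x + b) - y
    return float(np.sum(errors ** 2) / (2 * x.shape[0]))

def compute_gradient(x, y, w, b):
    """(dJ/dw, dJ/db) from equations (3) and (4)."""
    errors = (w * x + b) - y
    return float(np.mean(errors * x)), float(np.mean(errors))

def gradient_descent(x, y, w_in, b_in, alpha, num_iters):
    """Apply update rule (2) num_iters times; return final w, b and the cost history."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    w, b = float(w_in), float(b_in)
    cost_history = []
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient(x, y, w, b)  # both gradients at the old (w, b)
        w -= alpha * dj_dw
        b -= alpha * dj_db
        cost_history.append(compute_cost(x, y, w, b))
    return w, b, cost_history

# Illustrative data lying exactly on y = 2x + 1.
x_demo = np.array([1.0, 2.0, 3.0])
y_demo = np.array([3.0, 5.0, 7.0])

w_final, b_final, history = gradient_descent(x_demo, y_demo, 0.0, 0.0, alpha=0.1, num_iters=2000)
print(w_final, b_final, history[-1])
```

Note that a learning rate that works on small inputs like these will generally be far too large for raw square-footage values in the thousands; scaling the feature first is one common remedy.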

### **Implementing Gradient Descent** 

In [None]:
def compute_gradient(x, y, w, b): 
    raise NotImplementedError("Add code here to compute dJ/dw and dJ/db, equations (3) and (4)")

In [None]:
gradient = compute_gradient(x, y, w, b)
gradient

In [None]:
def gradient_descent(x, y, w_in, b_in, alpha, num_iters): 
    raise NotImplementedError("Add code here to repeat update rule (2) for num_iters iterations")

In [None]:
gradient_descent_result = gradient_descent(x, y, 1, 0, 0.2, 24)

If the learning rate is too small, gradient descent converges very slowly; if it is too large, each step can overshoot the minimum and the cost may grow instead of shrinking. Part of our job is to consider which methods can help us diagnose these issues. In the cell below, find a way to plot the progress of gradient descent and show the effect of a larger and a smaller learning rate. 
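
One way to diagnose the learning rate is to record the cost after every iteration and compare the curves. The sketch below is a hedged illustration, not a prescribed solution: the helper names `cost_history` and `plot_cost_histories`, the toy data, and the two rates (`0.001` and `0.5`) are all assumptions chosen so that one run crawls and the other diverges.

```python
import numpy as np

def cost_history(x, y, alpha, num_iters, w=0.0, b=0.0):
    """Run gradient descent and record J(w, b) after every step."""
    m = x.shape[0]
    history = []
    for _ in range(num_iters):
        errors = (w * x + b) - y
        dj_dw = np.mean(errors * x)
        dj_db = np.mean(errors)
        w, b = w - alpha * dj_dw, b - alpha * dj_db  # simultaneous update
        history.append(float(np.sum(((w * x + b) - y) ** 2) / (2 * m)))
    return history

def plot_cost_histories(histories):
    """Draw one cost curve per learning rate on side-by-side axes."""
    import matplotlib.pyplot as plt  # imported lazily so the numeric part runs without a plotting backend
    fig, axes = plt.subplots(ncols=len(histories), figsize=(15, 5))
    for ax, (label, history) in zip(np.atleast_1d(axes), histories.items()):
        ax.plot(history)
        ax.set_yscale("log")  # log scale keeps a diverging curve readable
        ax.set_title(label)
        ax.set_xlabel("iteration")
        ax.set_ylabel("cost J(w, b)")
    return fig

# Illustrative data lying exactly on y = 2x + 1 (not the housing dataset).
x_demo = np.array([1.0, 2.0, 3.0])
y_demo = np.array([3.0, 5.0, 7.0])

histories = {
    "alpha = 0.001 (small)": cost_history(x_demo, y_demo, alpha=0.001, num_iters=50),
    "alpha = 0.5 (large)": cost_history(x_demo, y_demo, alpha=0.5, num_iters=50),
}
small, large = histories["alpha = 0.001 (small)"], histories["alpha = 0.5 (large)"]
print(small[0], small[-1], large[0], large[-1])
```

Calling `plot_cost_histories(histories)` in the notebook draws the two panels. A curve that falls slowly suggests the rate is too small, while a curve that rises suggests it is too large.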

In [None]:
#Place Two Plots Here. One when Learning Rate is small and another when Learning Rate is large. 