# **Linear Regression**

## **1 Introduction**

This notebook is my learning material for keeping track of the concepts covered in the [Supervised Machine Learning: Regression and Classification](https://www.coursera.org/learn/machine-learning?specialization=machine-learning-introduction) course from the [Machine Learning Specialization](https://www.coursera.org/specializations/machine-learning-introduction) offered by DeepLearning.AI and Stanford University.

Throughout this notebook, I use the [Housing dataset](https://www.kaggle.com/datasets/ashydv/housing-dataset) created by Ashish.

### **1.0.1 Imports**

In [None]:
# Data manipulation
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler 

# Options for pandas
pd.options.display.max_columns = 50
pd.options.display.max_rows = 30

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Options for seaborn
sns.set_style('darkgrid')
%matplotlib inline

from IPython import get_ipython
ipython = get_ipython()

# Load the autoreload extension if it is not loaded yet
if 'autoreload' not in ipython.extension_manager.loaded:
    %load_ext autoreload

### **1.1 Data**

#### **1.1.0.1 Import**

In [None]:
housing = pd.read_csv('Housing.xls')
housing

#### **1.1.1 Exploratory Data Analysis**

In [None]:
housing.info()
housing.describe()

## **2 One-variable Linear Regression**

### **2.1 Data preparation**

In [None]:
# Retrieve features
data = housing[['price', 'area']].copy()

# Mean normalization
data['price'] = (data['price'] - data['price'].mean()) / (data['price'].max() - data['price'].min())
data['area'] = (data['area'] - data['area'].mean()) / (data['area'].max() - data['area'].min())

data
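
Both columns are rescaled with the same mean-normalization formula, $(x - \mu) / (\max - \min)$, so the step can be factored into a helper. This is only a minimal sketch: the name `mean_normalize` and the toy values are assumptions made for illustration, not part of the original notebook or the Housing dataset.

```python
import pandas as pd

def mean_normalize(series: pd.Series) -> pd.Series:
    """Hypothetical helper: (x - mean) / (max - min)."""
    return (series - series.mean()) / (series.max() - series.min())

# Toy values for illustration only (not taken from the Housing dataset)
toy = pd.DataFrame({'price': [100.0, 200.0, 300.0],
                    'area': [10.0, 20.0, 40.0]})

# DataFrame.apply passes each column to the helper as a Series
normalized = toy.apply(mean_normalize)
print(normalized)
```

After this transformation each column has zero mean and a range (max minus min) of exactly 1.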

In [None]:
sns.scatterplot(data=data, x='area', y='price')

### **2.2 Analysis**

#### **2.2.1 Model**

$$
f_{w,b}(x^{(i)}) = wx^{(i)}+b \tag{1}
$$

In [None]:
def f(x, w, b):
    return w * x + b

#### **2.2.2 Cost function**

$$
J(w,b) = \frac{1}{2m} \sum_{i=0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2  \tag{2}
$$

In [None]:
def compute_cost(X, y, w, b):
    m = X.shape[0]
    c = 0
    for i in range(m):
        # Squared error between the prediction and the target y[i]
        c += (f(X[i], w, b) - y[i])**2
        
    return c / (2 * m)
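
Equation (2) can also be evaluated without an explicit Python loop by letting NumPy operate on whole arrays. The sketch below assumes `X` and `y` are 1-D NumPy arrays of equal length; the name `compute_cost_vectorized` and the toy arrays are hypothetical.

```python
import numpy as np

def compute_cost_vectorized(X, y, w, b):
    """Equation (2) computed on whole arrays (hypothetical helper)."""
    errors = w * X + b - y          # f_wb(x_i) - y_i for every sample at once
    return np.sum(errors ** 2) / (2 * X.shape[0])

# Toy data lying exactly on the line y = 2x
X_toy = np.array([1.0, 2.0, 3.0])
y_toy = np.array([2.0, 4.0, 6.0])

perfect_fit_cost = compute_cost_vectorized(X_toy, y_toy, w=2.0, b=0.0)
zero_model_cost = compute_cost_vectorized(X_toy, y_toy, w=0.0, b=0.0)
print(perfect_fit_cost, zero_model_cost)
```

The cost is zero when the line passes through every point and grows as the parameters move away from that fit.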

#### **2.2.3 Gradient**

$$
\begin{align}
\frac{\partial J(w,b)}{\partial w} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) -y^{(i)})x^{(i)} \tag{3}
\\
\frac{\partial J(w,b)}{\partial b} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)}) \tag{4}
\end{align}
$$

In [None]:
def compute_gradient(X, y, w, b):
    m = X.shape[0]

    dw = np.sum((f(X, w, b) - y) * X) / m
    db = np.sum((f(X, w, b) - y)) / m
    
    return dw, db
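
One way to gain confidence in equations (3) and (4) is to compare the analytic gradient with a central finite-difference approximation of the cost. The sketch below is self-contained; the helper names (`cost`, `gradient`, `numerical_gradient`), the step size `eps`, and the toy data are assumptions for illustration.

```python
import numpy as np

def cost(X, y, w, b):
    # Equation (2)
    return np.sum((w * X + b - y) ** 2) / (2 * X.shape[0])

def gradient(X, y, w, b):
    # Equations (3) and (4)
    err = w * X + b - y
    return np.sum(err * X) / X.shape[0], np.sum(err) / X.shape[0]

def numerical_gradient(X, y, w, b, eps=1e-6):
    # Central differences: (J(p + eps) - J(p - eps)) / (2 * eps)
    dw = (cost(X, y, w + eps, b) - cost(X, y, w - eps, b)) / (2 * eps)
    db = (cost(X, y, w, b + eps) - cost(X, y, w, b - eps)) / (2 * eps)
    return dw, db

X_toy = np.array([1.0, 2.0, 3.0])
y_toy = np.array([2.0, 4.0, 6.0])

analytic = gradient(X_toy, y_toy, w=0.5, b=0.1)
numeric = numerical_gradient(X_toy, y_toy, w=0.5, b=0.1)
print(analytic, numeric)
print(np.allclose(analytic, numeric, atol=1e-6))
```

If the two results disagree noticeably, the analytic gradient (or the cost it is derived from) most likely contains a mistake.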

#### **2.2.4 Gradient descent**

$$
\text{repeat until convergence} \left\{
    \begin{array}{ll}
        w \leftarrow w - \alpha \frac{1}{m} \sum_{i=0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)}) x^{(i)} \\
        b \leftarrow b - \alpha \frac{1}{m} \sum_{i=0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})
    \end{array} \tag{5}
\right.
$$

In [None]:
def gradient_descent(X, y, cost_function, gradient_function, alpha, epochs):
    cost_history = np.zeros(epochs)
    
    # Initial parameters
    w, b = 0, 0
    
    for i in range(epochs):
        # Use the gradient function passed as argument
        dw, db = gradient_function(X, y, w, b)
        
        # Update parameter
        w -= alpha * dw
        b -= alpha * db
        
        # Save cost
        cost_history[i] = cost_function(X, y, w, b)
        
    return w, b, cost_history

### **2.3 Results**

#### **2.3.1 Regression line**

In [None]:
X = data['area'].values
y = data['price'].values

w, b, cost_history = gradient_descent(X, y, 
                                      cost_function=compute_cost, gradient_function=compute_gradient,
                                      alpha=0.03, epochs=10000)

print(f'w, b found by gradient descent:\n {w}, {b}')
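
For a single feature, the least-squares line also has a closed-form solution, $w = \frac{\sum_i (x^{(i)} - \bar{x})(y^{(i)} - \bar{y})}{\sum_i (x^{(i)} - \bar{x})^2}$ and $b = \bar{y} - w\bar{x}$, which gives a reference value for the result of gradient descent. The sketch below runs on synthetic data rather than the Housing dataset; the slope, intercept, noise level, learning rate, and epoch count are all arbitrary assumptions.

```python
import numpy as np

# Synthetic data (assumption): noisy points around y = 0.7x + 0.1
rng = np.random.default_rng(0)
X_toy = rng.uniform(-0.5, 0.5, size=200)
y_toy = 0.7 * X_toy + 0.1 + rng.normal(0, 0.05, size=200)

# Closed-form least-squares solution for one feature
x_centered = X_toy - X_toy.mean()
w_closed = np.sum(x_centered * (y_toy - y_toy.mean())) / np.sum(x_centered ** 2)
b_closed = y_toy.mean() - w_closed * X_toy.mean()

# Plain batch gradient descent on the same data
w_gd, b_gd = 0.0, 0.0
alpha, epochs = 0.1, 5000
for _ in range(epochs):
    err = w_gd * X_toy + b_gd - y_toy
    w_gd -= alpha * np.mean(err * X_toy)
    b_gd -= alpha * np.mean(err)

print(f'closed form:      w={w_closed:.6f}, b={b_closed:.6f}')
print(f'gradient descent: w={w_gd:.6f}, b={b_gd:.6f}')
```

When the learning rate and the number of epochs are adequate, the two pairs of parameters should agree to several decimal places.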

In [None]:
sns.scatterplot(data=data, x='area', y='price')
sns.lineplot(data=data, x='area', y=f(X, w, b),
             linestyle='dashed', color='r',
             label='trend line')

#### **2.3.2 Convergence**

In [None]:
sns.lineplot(x=range(cost_history.shape[0]), y=cost_history) \
   .set(xlabel='iteration', ylabel='cost')

## **3 Linear Regression with scikit-learn**

### **3.1 Data preparation**

In [None]:
data = housing[['price', 'area', 'bedrooms', 'stories', 'bathrooms']].copy()

data

In [None]:
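# Standardize the target and the features together (zero mean, unit variance per column)
scaler = StandardScaler()
fit_data = scaler.fit_transform(data)

# The first column is the target (price); the remaining columns are the features
X, y = fit_data[:, 1:], fit_data[:, 0]

# X is already standardized, so this second pass leaves it essentially unchanged
scaler = StandardScaler()
X_norm = scaler.fit_transform(X)

X_norm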

### **3.2 Analysis**

In [None]:
sgdr = SGDRegressor(max_iter=1000)
_ = sgdr.fit(X, y)

b = sgdr.intercept_
w = sgdr.coef_
print(f'w, b found:\n {w}, {b}')

### **3.3 Results**


In [None]:
y_pred_sgd = sgdr.predict(X)
y_pred = np.dot(X, w) + b

fig, axs = plt.subplots(1, 4,
                        figsize=(12,3),
                        sharey=True)

for i in range(4):
    sns.scatterplot(x=X[:,i], y=y,
                    label='target',
                    ax=axs[i])
    
    sns.scatterplot(x=X[:,i], y=y_pred,
                    marker='s', alpha=0.3,
                    label='predict',
                    ax=axs[i])
    
    axs[i].set_xlabel(data.columns[i + 1])
    

axs[0].set_ylabel('Price');
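
The results cell computes predictions twice: once with `sgdr.predict(X)` and once by hand as `np.dot(X, w) + b`. The sketch below checks on synthetic data that the two computations agree for a fitted `SGDRegressor`. The data, the true coefficients, and `random_state=0` are assumptions chosen only to make the example reproducible.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 100 samples, 4 features, linear target plus noise
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 4))
y_toy = X_raw @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=100)

# Standardize the features before fitting, as in the notebook
X_toy = StandardScaler().fit_transform(X_raw)
model = SGDRegressor(max_iter=1000, random_state=0).fit(X_toy, y_toy)

# Prediction through the estimator and by hand from its parameters
y_pred_model = model.predict(X_toy)
y_pred_manual = np.dot(X_toy, model.coef_) + model.intercept_

print(np.allclose(y_pred_model, y_pred_manual))
```

Here `coef_` holds one weight per feature and `intercept_` is a length-1 array, which broadcasts across all samples in the manual computation.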