<a href="https://www.aus.edu/"><img src="https://i.imgur.com/pdZvnSD.png" width=200> </a>

<h1 align=center><font size = 5>Regression - Univariate and Multivariate</font></h1>
<h1 align=center><font size = 5>Prepared by Alex Aklson, Ph.D.</font></h1>
<h1 align=center><font size = 5>October 3, 2024</font></h1>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LinearRegression, SGDRegressor

# root_mean_squared_error requires scikit-learn 1.4 or newer
from sklearn.metrics import root_mean_squared_error, mean_squared_error, r2_score

from sklearn.datasets import fetch_california_housing

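Create a random dataset: 100 samples with one feature drawn uniformly from [0, 2) and a target generated as y = 4 + 3x plus Gaussian noise.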

In [None]:
np.random.seed(0)
X = 2 * np.random.rand(100, 1) 
y = 4 + 3 * X + np.random.randn(100, 1) 

In [None]:
plt.scatter(X, y)
plt.xlabel('Size (m2)')
plt.ylabel('Price ($100,000)')
plt.show()

Split the dataset into training and test sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Normal Equation (Ordinary Least Squares)

In [None]:
model = LinearRegression()

In [None]:
model.fit(X_train, y_train)

Get the parameters (weight values).

In [None]:
weight_0 = model.intercept_  # this is the bias term (w_0)
weight_1 = model.coef_  # this is the slope (w_1)

In [None]:
print("Intercept (w_0): {}".format(weight_0))
print("Coefficient (w_1): {}".format(weight_1))
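As a rough cross-check of the closed-form solution, the sketch below solves the normal equation $w = (X^T X)^{-1} X^T y$ directly with NumPy. It is an illustration only, not a description of how `LinearRegression` is implemented. The name `X_b` is introduced here for the design matrix, and the fit uses all 100 samples rather than the training split, so the values will differ slightly from those printed above.

```python
import numpy as np

# Recreate the synthetic data (same seed and formula as in the notebook).
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Prepend a column of ones so the intercept w_0 is learned like any other weight.
X_b = np.c_[np.ones((X.shape[0], 1)), X]

# Solve (X_b^T X_b) w = X_b^T y; np.linalg.solve avoids forming an explicit inverse.
w = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

w_0, w_1 = w.ravel()
print("Intercept (w_0): {}".format(w_0))
print("Coefficient (w_1): {}".format(w_1))
```

Both estimates should land near the true values of 4 and 3 used to generate the data, with the gap coming from the added noise.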

Let's predict the values of the samples in the test set.

In [None]:
y_pred = model.predict(X_test)

In [None]:
len(y_pred)

In [None]:
y_pred

Plot the data and the regression line.

In [None]:
plt.figure(figsize=(8, 5))

plt.scatter(X, y, color='blue', label='Actual data')
plt.plot(X_test, y_pred, color='red', label='Regression line')
plt.xlabel('Size (m2)')
plt.ylabel('Price ($100,000)')
plt.legend()

ax = plt.gca() 
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.show()

Calculate root mean squared error (RMSE) and R-squared.

In [None]:
rmse = root_mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

In [None]:
print("Root Mean Squared Error: {}".format(rmse))
print("R-squared: {}".format(r2))
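To make the two metrics concrete, here is a minimal sketch that computes them from their definitions: RMSE is the square root of the mean squared residual, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The helper `rmse_and_r2` and the toy values are hypothetical and exist only for this illustration.

```python
import numpy as np

def rmse_and_r2(y_true, y_pred):
    """Return (RMSE, R-squared) computed from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()

    residuals = y_true - y_pred
    rmse = np.sqrt(np.mean(residuals ** 2))

    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return rmse, r2

# Toy values, chosen only to keep the arithmetic easy to follow.
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.5, 5.5, 7.0, 9.0]

rmse_demo, r2_demo = rmse_and_r2(y_true_demo, y_pred_demo)
print("RMSE: {}".format(rmse_demo))
print("R-squared: {}".format(r2_demo))
```

Passing `y_test` and `y_pred` from the cells above to this helper should reproduce the values returned by `root_mean_squared_error` and `r2_score`.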

### Gradient Descent - StandardScaler

Visualize the distribution of the data.

In [None]:
X_train_flattened = X_train.flatten()

plt.figure(figsize=(8, 5))
sns.histplot(X_train_flattened, bins=10, kde=True, color='blue')

plt.xlabel('Size')
plt.ylabel('Frequency')

ax = plt.gca()
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.show()

Standardize the features.

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
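The following sketch shows what standardization amounts to: subtract the training mean and divide by the training standard deviation, then reuse those same training statistics on the test set. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses. The `*_demo` names and the positional 80/20 split are hypothetical and only keep the example free of extra dependencies.

```python
import numpy as np

# Recreate the feature column (same seed and formula as in the notebook).
np.random.seed(0)
X = 2 * np.random.rand(100, 1)

# Hypothetical split by position; the notebook uses train_test_split instead.
X_train_demo, X_test_demo = X[:80], X[80:]

# Statistics come from the training data only.
train_mean = X_train_demo.mean(axis=0)
train_std = X_train_demo.std(axis=0)  # population std (ddof=0)

# The same mean and std are applied to both sets.
X_train_demo_scaled = (X_train_demo - train_mean) / train_std
X_test_demo_scaled = (X_test_demo - train_mean) / train_std

print("Scaled train mean: {}".format(X_train_demo_scaled.mean(axis=0)))
print("Scaled train std: {}".format(X_train_demo_scaled.std(axis=0)))
```

The scaled training feature has mean 0 and standard deviation 1 up to floating-point error. The scaled test feature generally does not, because it is transformed with the training statistics rather than its own.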

Visualize the distribution of the scaled data.

In [None]:
X_train_scaled_flattened = X_train_scaled.flatten()

plt.figure(figsize=(8, 5))
sns.histplot(X_train_scaled_flattened, bins=10, kde=True, color='blue')

plt.xlabel('Size (standardized)')
plt.ylabel('Frequency')

ax = plt.gca()
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.show()

Create an SGDRegressor model.

In [None]:
gradient_descent_model = SGDRegressor(max_iter=10000, tol=1e-3, eta0=0.01, random_state=42)

Fit the model to the training data.

In [None]:
gradient_descent_model.fit(X_train_scaled, y_train.ravel())

Get the parameters (w values).

In [None]:
gradient_descent_w_0 = gradient_descent_model.intercept_
gradient_descent_w_1 = gradient_descent_model.coef_

In [None]:
print("SGD Intercept (w_0): {}".format(gradient_descent_w_0))
print("SGD Coefficient (w_1): {}".format(gradient_descent_w_1))

Retrieve the mean and standard deviation of the original training data.

In [None]:
X_mean = scaler.mean_  # mean of the original feature
X_std = scaler.scale_  # standard deviation of the original feature

Adjust the coefficients so that they apply to the original (unscaled) feature.

In [None]:
w_1_adjusted = gradient_descent_model.coef_ / X_std
w_0_adjusted = gradient_descent_model.intercept_ - (gradient_descent_model.coef_ * X_mean / X_std)

In [None]:
print("Adjusted Intercept (w_0): {}".format(w_0_adjusted))
print("Adjusted Coefficient (w_1): {}".format(w_1_adjusted))
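The adjustment follows from substituting $z = (x - \mu) / \sigma$ into the model fitted on the standardized feature, $y = b_0 + b_1 z$, which gives $w_1 = b_1 / \sigma$ and $w_0 = b_0 - b_1 \mu / \sigma$. The sketch below checks this numerically. It swaps in an exact least-squares fit (`np.polyfit`) for `SGDRegressor` so that the two sets of coefficients match to floating-point precision; with SGD the agreement is only approximate. The names `b_0`, `b_1`, `mu`, and `sigma` are assumptions introduced here.

```python
import numpy as np

# Recreate the synthetic data (same seed and formula as in the notebook).
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

x = X.ravel()
t = y.ravel()

# Standardize the feature.
mu, sigma = x.mean(), x.std()
z = (x - mu) / sigma

# Fit y = b_0 + b_1 * z; np.polyfit returns the highest-degree coefficient first.
b_1, b_0 = np.polyfit(z, t, deg=1)

# Map the coefficients back to the original feature scale.
w_1 = b_1 / sigma
w_0 = b_0 - b_1 * mu / sigma

# Fit directly on the unscaled feature for comparison.
w_1_direct, w_0_direct = np.polyfit(x, t, deg=1)

print("Adjusted: w_0 = {}, w_1 = {}".format(w_0, w_1))
print("Direct:   w_0 = {}, w_1 = {}".format(w_0_direct, w_1_direct))
```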

Let's predict the values of the samples in the test set.

In [None]:
y_pred_gradient_descent = gradient_descent_model.predict(X_test_scaled)

Calculate Root Mean Squared Error and R-squared for Gradient Descent.

In [None]:
rmse_gradient_descent = root_mean_squared_error(y_test, y_pred_gradient_descent)
r2_gradient_descent = r2_score(y_test, y_pred_gradient_descent)

In [None]:
print("Gradient Descent Root Mean Squared Error: {}".format(rmse_gradient_descent))
print("Gradient Descent R-squared: {}".format(r2_gradient_descent))

Plot the original data points and the best-fit line.

In [None]:
plt.figure(figsize=(8, 5))

plt.scatter(X, y, color='blue', label='Actual data')  # scatter plot of actual values
plt.plot(X_test, y_pred_gradient_descent, color='green', label='GD Best-fit line')  # best fit line from gradient descent
plt.xlabel('Size (m2)')
plt.ylabel('Price ($100,000)')

ax = plt.gca() 
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.legend()
plt.show()

Plot the original data together with both regression lines (Gradient Descent and closed-form).

In [None]:
plt.figure(figsize=(8, 5))

plt.scatter(X, y, color='blue', label='Actual data')  # scatter plot of actual values
plt.plot(X_test, y_pred_gradient_descent, color='green', label='GD Best-fit line')  # best-fit line from gradient descent
plt.plot(X_test, y_pred, color='red', label='Closed-form Best-fit line')  # Best-fit line from closed-form

plt.xlabel('Size (m2)')
plt.ylabel('Price ($100,000)')
plt.legend()

ax = plt.gca() 
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.show()

### Gradient Descent - MinMaxScaler

Define an instance of the MinMaxScaler.

In [None]:
## Add your code here



Scale the features.

In [None]:
## Add your code here



Visualize the distribution of the scaled data.

In [None]:
## Add your code here





Create an SGDRegressor model.

In [None]:
## Add your code here.



Fit the model to the training data.

In [None]:
## Add your code here.




Get the parameters (w values).

In [None]:
## Add your code here



Print the parameters.

In [None]:
## Add your code here



What do you notice? Do they look close to the ones estimated by the Normal Equation and by Gradient Descent using standardized features?

Retrieve the min and max of the original training data.

In [None]:
## Add your code here



Adjust the coefficients.

In [None]:
## Add your code here



Print the adjusted coefficients.

In [None]:
## Add your code here




Let's predict the values of the samples in the test set.

In [None]:
## Add your code here




Calculate Root Mean Squared Error and R-squared for Gradient Descent.

In [None]:
## Add your code here





Print the evaluation metrics.

In [None]:
## Add your code here




Plot the original data points and the best-fit line.

In [None]:
## Add your code here





Plot the original data together with both regression lines (Gradient Descent and closed-form).

In [None]:
## Add your code here





### Plotting the Cost Function $J(w_0, w_1)$ Against Number of Iterations

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
# max_iter only affects fit(); each partial_fit() call below runs a single pass over the data
gradient_descent_model = SGDRegressor(max_iter=10000, eta0=0.01, random_state=42)

In [None]:
costs = [] # list to store the cost function

n_iterations = 100  # limit to 100 iterations
check_interval = 5  # check cost every 5 iterations

for iteration in range(n_iterations):
    gradient_descent_model.partial_fit(X_train_scaled, y_train.ravel())
    
    if iteration % check_interval == 0:
        y_train_pred = gradient_descent_model.predict(X_train_scaled) # predict on training data
        
        cost = mean_squared_error(y_train, y_train_pred) # calculate the cost (mean squared error)
        costs.append(cost)
        print("Iteration {}: Cost = {}".format(iteration, cost))

Plot the cost function over iterations.

In [None]:
plt.figure(figsize=(10, 8))

plt.plot(range(0, n_iterations, check_interval), costs, color='blue')
plt.xlabel('Iterations')
plt.ylabel('Cost (Mean Squared Error)')

ax = plt.gca()
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.show()
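For comparison, the sketch below runs plain batch gradient descent in NumPy and records the cost at every iteration. It is a simplified stand-in for `SGDRegressor`, not its implementation: it uses the full dataset on every step, takes $J(w_0, w_1)$ to be the mean squared error (matching the axis label above), and uses a learning rate of 0.1 picked only for this illustration. The data is regenerated and standardized over all 100 samples so the block runs on its own.

```python
import numpy as np

# Recreate the synthetic data (same seed and formula as in the notebook).
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

x = X.ravel()
t = y.ravel()
z = (x - x.mean()) / x.std()  # standardized feature

w_0, w_1 = 0.0, 0.0
learning_rate = 0.1  # assumed value, chosen for this sketch
n_iterations = 100
costs = []  # cost J(w_0, w_1) before each update

for iteration in range(n_iterations):
    error = (w_0 + w_1 * z) - t
    costs.append(np.mean(error ** 2))

    # Gradients of the mean squared error with respect to w_0 and w_1.
    grad_w_0 = 2 * np.mean(error)
    grad_w_1 = 2 * np.mean(error * z)

    w_0 -= learning_rate * grad_w_0
    w_1 -= learning_rate * grad_w_1

print("Initial cost: {}".format(costs[0]))
print("Final cost: {}".format(costs[-1]))
print("w_0 = {}, w_1 = {}".format(w_0, w_1))
```

Plotting `costs` against `range(n_iterations)` with the plotting cell above gives a smoothly decreasing curve, in contrast to the noisier trajectory that stochastic updates can produce.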