# Linear Regression with Stochastic Gradient Descent

This is an experimental notebook walking through the development of a linear regression model using Stochastic Gradient Descent (SGD). The data preprocessing has been kept to a minimum and can be expanded as needed. Please feel free to fork the notebook and experiment with the values to create your own model.

Thanks to [Mathurin Aché](https://www.kaggle.com/mathurinache) for providing this dataset on automobile pricing. The dataset is fairly clean and ready to use.

In [None]:
import numpy as np 
import pandas as pd 
from matplotlib import pyplot as plt

In [None]:
data = pd.read_csv("/kaggle/input/autoprice/dataset_2193_autoPrice.csv")

In the description of the dataset, the author stated that all missing values from the dataset were eliminated. 

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.describe()

Most features seem to have a similar mean and median (the 50% mark), a small difference between the minimum and the 25% mark, and a small difference between the maximum and the 75% mark. The exceptions are the 'compression-ratio' and 'class' features, which suggests that these two contain a number of outliers. We can plot the features as individual histograms and visually analyse the spread of the data.

In [None]:
data.hist(figsize=(15, 10))
None
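
As a numeric complement to the histograms, the sketch below counts the values that fall outside the usual 1.5 × IQR fences for each column. The helper name `count_iqr_outliers`, the multiplier `k`, and the toy data frame are assumptions made for illustration; they are not part of the original notebook or dataset.

```python
import pandas as pd

def count_iqr_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count, per numeric column, the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column fences with the data frame's columns
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

# Hypothetical toy data: column 'b' holds one extreme value, column 'a' none
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8],
    "b": [9, 9, 10, 10, 9, 10, 9, 23],
})
outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

Applied to the notebook's data this would be `count_iqr_outliers(data)`, which gives a rough per-feature count to compare against the histograms.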

The 'compression-ratio' feature has a few extreme values. We remove the feature before further processing the data in order to avoid any bias. The feature 'symboling' will be treated as a continuous variable for this exercise.

In [None]:
new_data = data.drop(['compression-ratio'], axis=1)

The feature to predict is 'class', which represents the price of the vehicle. The remaining (independent) features form **X**, and the dependent feature 'class' is **y**.

In [None]:
X, y = new_data.iloc[:,:-1], new_data.iloc[:,-1]

# Create train and test splits
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.30, random_state=1551)

The values of the features belong to different ranges. To put them on a common scale, which also helps gradient descent converge, scale the values using the MinMax scaler.

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Scale the training set
X_scale = MinMaxScaler().fit(X_train)
X_train_trans = X_scale.transform(X_train) # transform the training data with the scaler fitted on it above
X_train = pd.DataFrame(X_train_trans, columns = list(X_train.columns)) # convert matrix to data frame with columns

y_scale = MinMaxScaler().fit(np.array(y_train).reshape(-1, 1))
y_train = y_scale.transform(np.array(y_train).reshape(-1, 1))

# Scale the test set using the X and y scalers
X_test_trans = X_scale.transform(X_test)
X_test = pd.DataFrame(X_test_trans, columns = list(X_test.columns))
y_test = y_scale.transform(np.array(y_test).reshape(-1, 1))
y_test = y_test.flatten()
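
For reference, min-max scaling maps each value to `(x - min) / (max - min)`, with the minimum and maximum taken from the training split only. The sketch below reproduces that arithmetic by hand on made-up numbers; the array names and values are hypothetical.

```python
import numpy as np

# Hypothetical toy feature values, not taken from the dataset
train_col = np.array([10.0, 20.0, 30.0, 50.0])
test_col = np.array([20.0, 60.0])

# The statistics come from the training split only
col_min, col_max = train_col.min(), train_col.max()

def min_max(values, lo, hi):
    """Map values to (values - lo) / (hi - lo)."""
    return (values - lo) / (hi - lo)

train_scaled = min_max(train_col, col_min, col_max)
test_scaled = min_max(test_col, col_min, col_max)  # can fall outside [0, 1]
print(train_scaled, test_scaled)
```

Because the test split reuses the training minimum and maximum, a test value beyond the training range ends up outside [0, 1]; this is expected and avoids leaking test information into the scaler.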

## Linear Regression

The formula of a linear regression model is:<br>
>              y_pred = bias + theta * X

where bias is the intercept and theta is the slope/coefficient. A multiple regression model is written as:<br>
>              y_pred = bias + theta_1 * X_1 + theta_2 * X_2 + ... + theta_n * X_n

The predicted values depend on the theta values. To find the optimum theta values, we use Stochastic Gradient Descent (SGD). The aim of SGD is to minimize the error (cost) function, which can be written as:<br>
>              J = 1/(2 * m) * sum((y_pred - y)^2)

where m is the number of training rows. To minimize this function, each theta coefficient is updated using the derivative of the cost function with respect to that coefficient. Using a small learning rate alpha, update the coefficients as:<br>
>              theta_j = theta_j - alpha/m * sum((y_pred - y) * X_j)

This is repeated for a considerably large number of iterations until a local minimum of the cost function is reached. Various values of alpha can be tested before selecting one that is large enough to converge to a minimum in a reasonable number of iterations but small enough to reduce the cost function with every iteration [1].
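
The cost function and the update rule can be written in vectorised form. The following is a minimal sketch on synthetic data, not the notebook's dataset; the helper names `cost` and `gradient_step`, the learning rate, and the generating values 0.5 and 2.0 are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): y = 0.5 + 2 * x plus a little noise
x = rng.uniform(0, 1, size=50)
y = 0.5 + 2.0 * x + rng.normal(0, 0.01, size=50)
X = np.column_stack((np.ones_like(x), x))  # the column of ones carries the bias

def cost(X, y, theta):
    """J = 1/(2m) * sum((X @ theta - y)^2)."""
    residual = X @ theta - y
    return (residual @ residual) / (2 * len(y))

def gradient_step(X, y, theta, alpha):
    """One update: theta - alpha/m * X^T (X @ theta - y)."""
    residual = X @ theta - y
    return theta - alpha / len(y) * (X.T @ residual)

theta = np.zeros(X.shape[1])
history = [cost(X, y, theta)]
for _ in range(5000):
    theta = gradient_step(X, y, theta, alpha=0.5)
    history.append(cost(X, y, theta))

print(theta)  # should land near the generating values (0.5, 2.0)
```

Each call to `gradient_step` uses every row of `X`, so the cost recorded in `history` should fall steadily as long as alpha is small enough.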

First we initialise the values needed to produce the model. To accommodate the bias term, we add a column of ones to the independent data set, so that the bias is learned as the first theta value. This column is added to both the train and test set.

In [None]:
X_train = np.column_stack(([1]*X_train.shape[0], X_train)) # add a column with ones for the bias value while converting it into a matrix
m,n = X_train.shape # rows and columns 
theta = np.array([1] * n) # initial theta
X = np.array(X_train) # convert X_train into a numpy matrix
y = y_train.flatten() # convert y into an array

alpha = 0.001 # alpha value 
iteration = 1000 # iterations
cost = [] # list to store cost values
theta_new = [] # list to store updated coefficient values

In [None]:
# Gradient descent loop

for i in range(iteration):
    pred = np.matmul(X,theta) # calculate predicted values with the current theta
    J = 1/2 * ((np.square(pred - y)).mean()) # calculate cost function: 1/(2 * m) * sum((pred - y)^2)

    # Update the theta values for all the features with the gradient of the cost function
    for t_cols in range(n):
        t = round(theta[t_cols] - alpha/m * sum((pred-y)*X[:,t_cols]),4) # calculate new theta value, rounded to 4 decimals
        theta_new.append(t) # save new theta values in a temporary list

    # Update theta once all the new values have been computed from the same predictions
    theta = theta_new # assign new values of theta
    theta_new = [] # empty the temporary list
    cost.append(J) # append cost function value to the cost list
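
The loop above uses every training row in each update. A per-sample variant, in which each update is driven by a single randomly ordered row, might look like the following sketch. The function name `sgd_linear`, its default arguments, and the synthetic noise-free data are assumptions made for illustration, not the notebook's method.

```python
import numpy as np

def sgd_linear(X, y, alpha=0.1, epochs=200, seed=0):
    """Per-sample SGD for least squares; X must already include a column of ones."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):  # visit the rows in a fresh random order
            error = X[i] @ theta - y[i]    # residual of a single sample
            theta -= alpha * error * X[i]  # gradient of 1/2 * error^2 for that sample
    return theta

# Synthetic, noise-free data (assumption): y = 1 + 3 * x
x = np.linspace(0, 1, 40)
X_demo = np.column_stack((np.ones_like(x), x))
y_demo = 1.0 + 3.0 * x

theta_sgd = sgd_linear(X_demo, y_demo)
print(theta_sgd)  # should be close to (1.0, 3.0)
```

With noisy data the per-sample updates keep jittering around the minimum, which is why practical implementations usually decay the learning rate over time.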

In [None]:
plt.figure(figsize=(10,8))
plt.plot(cost)
plt.title('Cost Function')
plt.xlabel('Iterations')
plt.ylabel('Cost Function Value')
None

In [None]:
cost[-1]

After experimenting with different values of alpha, I chose 0.001 since it converged well enough to provide stable theta values and a low cost function value. A sufficiently small value of alpha helps the descent converge to a minimum by reducing the cost function at every iteration. The Cost Function plot for alpha=0.001 above shows the error decreasing over the iterations and levelling off at around 800 iterations, beyond which the theta values stabilise. We can either run the loop again with 800 iterations and use the final theta values it produces, or use those from the 1000th iteration.
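
One way to carry out such an experiment is to run the same descent with several learning rates and compare the final cost. The sketch below does this on synthetic data; the helper `run_gd`, the candidate alphas, and the data-generating values are assumptions, and the numbers it prints say nothing about the automobile dataset.

```python
import numpy as np

def run_gd(X, y, alpha, iterations=200):
    """Full-batch gradient descent; returns the cost recorded at every iteration."""
    m, n = X.shape
    theta = np.zeros(n)
    costs = []
    for _ in range(iterations):
        residual = X @ theta - y
        costs.append((residual @ residual) / (2 * m))
        theta = theta - alpha / m * (X.T @ residual)
    return costs

# Synthetic data (assumption): one feature in [0, 1] plus a bias column
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=100)
X_demo = np.column_stack((np.ones_like(x), x))
y_demo = 0.2 + 0.6 * x + rng.normal(0, 0.05, size=100)

# Hypothetical candidate learning rates
final_costs = {alpha: run_gd(X_demo, y_demo, alpha)[-1]
               for alpha in (0.001, 0.01, 0.1, 1.0)}
for alpha, final in final_costs.items():
    print(f"alpha={alpha}: final cost {final:.5f}")
```

Plotting the full cost lists returned by `run_gd` for each alpha, in the same way as the Cost Function plot above, makes it easier to see which rates are too slow and which overshoot.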

In [None]:
print("The theta values for the model are :", theta)

Note that the first theta value (theta0) is the bias value. With the calculated theta values, we can predict the price of the vehicles on the transformed (scaled) test set.

In [None]:
X_test = np.column_stack(([1]*X_test.shape[0], X_test)) # add a column with ones for the bias value while converting it into a matrix
y_pred = np.matmul(X_test,theta)

Calculate the RMSE and R-square statistic (Coefficient of determination). 

In [None]:
import math
from sklearn.metrics import r2_score
rmse = round(math.sqrt(((y_test-y_pred)**2).mean()),3)
r2 = round(r2_score(y_test,y_pred),3)
print("The Root Mean Square error is: ",rmse)
print("The coefficient of determination is: ", r2)
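
Both metrics can also be written out by hand, which makes it clearer what they measure. This is a small sketch on made-up numbers; the function names `rmse_score` and `r2_manual` and the demo arrays are hypothetical.

```python
import numpy as np

def rmse_score(y_true, y_hat):
    """Root of the mean squared difference between targets and predictions."""
    return float(np.sqrt(np.mean((y_true - y_hat) ** 2)))

def r2_manual(y_true, y_hat):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

# Made-up targets and predictions
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_hat_demo = np.array([1.5, 2.0, 2.5, 4.0])

print(rmse_score(y_true_demo, y_hat_demo))  # sqrt(0.125), about 0.354
print(r2_manual(y_true_demo, y_hat_demo))   # 1 - 0.5 / 5 = 0.9
```

R-square compares the model against always predicting the mean of the targets, so it can be small, or even negative, while the RMSE still looks low on a narrow target range.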

Note that the RMSE value is quite low, although it is measured on the min-max scaled target rather than in price units. The small R-square value, however, means the linear model isn't a good fit for the data. We can compare these metrics with sklearn's SGD Regressor model. <br><br>
Note: Here you can choose to retain the bias variable added to the independent data sets in train and test, or you can remove them by re-running the code above to create train-test splits and scale the data before creating the model. The outcome will be the same.
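
Since the target was min-max scaled, the RMSE above is expressed in scaled units. Min-max scaling is an affine map, so multiplying the scaled RMSE by the training target's range (max minus min) gives the error in price units. The sketch below checks this on hypothetical prices; none of the numbers come from the dataset.

```python
import numpy as np

# Hypothetical training prices used to define the scaling
train_prices = np.array([5000.0, 10000.0, 20000.0, 45000.0])
p_min, p_max = train_prices.min(), train_prices.max()

def scale(prices):
    return (prices - p_min) / (p_max - p_min)

def unscale(scaled):
    return scaled * (p_max - p_min) + p_min

# Hypothetical true and predicted values in scaled units
y_true_scaled = scale(np.array([15000.0, 25000.0]))
y_pred_scaled = y_true_scaled + np.array([0.05, -0.05])

rmse_scaled = float(np.sqrt(np.mean((y_true_scaled - y_pred_scaled) ** 2)))
rmse_price = float(np.sqrt(np.mean((unscale(y_true_scaled) - unscale(y_pred_scaled)) ** 2)))

print(rmse_scaled)  # 0.05 in scaled units
print(rmse_price)   # 0.05 * (45000 - 5000) = 2000 in price units
```

In the notebook, the same conversion could be done with `y_scale.inverse_transform` on the reshaped predictions before computing the RMSE.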

In [None]:
from sklearn.linear_model import SGDRegressor
sgd_model = SGDRegressor(loss='squared_error', penalty='l2') # squared-error loss with an L2 (Ridge-style) penalty; scikit-learn versions before 1.0 name this loss 'squared_loss'
model_fit = sgd_model.fit(X_train,y_train.flatten())
y_pred = sgd_model.predict(X_test)

Calculate metrics for the sklearn model.

In [None]:
rmse = round(math.sqrt(((y_test-y_pred)**2).mean()),3)
r2 = round(r2_score(y_test, y_pred),3)
print("The average error (RMSE) is: ",rmse)
print("The coefficient of determination is: ", r2)

While the RMSE of the scikit-learn model is the same as that of the model built from scratch, the R-square value is slightly different. The R-square values show that the model created from scratch captures roughly the same amount of variance in the dependent variable ('class', a.k.a. the price of the vehicle) as the built-in model does, although the value is small. This suggests that a linear model may not be a good fit for this data: the model is underfitting, which gives a small R-square value. A polynomial regression model, or a more complex model such as a Random Forest or a Neural Network, could perhaps improve the fit.<br>

Although the R-square metric isn't great, it suggests that our model performs comparably to the built-in one. Using the same technique and process, a linear regression model with stochastic gradient descent can be developed from scratch for any dataset with a linear relationship.
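
To illustrate how a polynomial term can help when a straight line underfits, the sketch below fits degree-1 and degree-2 polynomials to synthetic curved data and compares their in-sample R-square. It uses NumPy's least-squares `polyfit` rather than gradient descent, and the data, the helper `r2_of`, and the coefficients are assumptions; it makes no claim about how much the automobile data would improve.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic curved data (assumption): a quadratic trend plus a little noise
x = np.linspace(-1, 1, 60)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(0, 0.05, size=60)

def r2_of(y_true, y_hat):
    """Coefficient of determination, computed by hand."""
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

# Least-squares fits of degree 1 (a line) and degree 2 (adds an x^2 term)
linear_fit = np.polyval(np.polyfit(x, y, deg=1), x)
quadratic_fit = np.polyval(np.polyfit(x, y, deg=2), x)

r2_linear = r2_of(y, linear_fit)
r2_quadratic = r2_of(y, quadratic_fit)
print(f"linear R^2: {r2_linear:.3f}, quadratic R^2: {r2_quadratic:.3f}")
```

On real data the comparison should be made on a held-out test set, since a higher-degree fit always scores at least as well in-sample.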

## References
[1] https://www.coursera.org/learn/machine-learning