## Multiple Linear Regression Model Analyzing the MPG of Autos in 1970-80
### Intro:

This project is an in-depth exploration of linear regression models that take multiple inputs and produce a single output "guess." The program uses a car's number of cylinders, engine displacement, horsepower, weight, acceleration, and year of manufacture to predict its MPG. Note: scikit-learn and similar libraries were not allowed.

The exploration covers extracting data from a spreadsheet, feature normalization, gradient descent, graphing cost vs. iteration, and calculating the percent error of the model.

The training data set can be found at https://archive-beta.ics.uci.edu/dataset/9/auto+mpg

Author: Kaartik Tejwani  
Version: 1

In [2]:

# data tools
import pandas as pd
import numpy as np

# visualization for models
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# data analysis tools
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor


%load_ext watermark

import time
#prevent changing global vars
import copy

In [3]:
%watermark -v -p pandas,numpy,matplotlib

Python implementation: CPython
Python version       : 3.11.3
IPython version      : 8.13.2

pandas    : 2.0.1
numpy     : 1.24.3
matplotlib: 3.7.1



### Extracting Data

pandas is used to extract the data from the Excel workbook, and the data is then stored in NumPy arrays.

The weights array and the initial constant are created later, in the main program.

In [4]:

df = pd.read_excel('C:/Users/Public/Documents/Auto Data.xlsx', engine="openpyxl", sheet_name='auto-mpg')

#target values
MPG_data = df['MPG']  # the 'MPG' column holds the actual MPG of each car
MPG_array = np.array(MPG_data)

#create an empty list to append DataFrame rows
data_rows = []


for index, row in df.iterrows():
    temp = np.array(row)
    #eliminate first and last two elements; they are not in model
    temp = temp[1:len(temp)-2] 
    data_rows.append(temp)

#stack the data rows to create a numpy array
trainData = np.vstack(data_rows)
trainDataShape = trainData.shape
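
The row-by-row loop above can also be expressed as a single slice. The following is a minimal sketch, assuming the same layout as the workbook (target in the first column, two unused trailing columns). The small DataFrame, its column names, and the names `demo_df`, `features`, and `targets_demo` are hypothetical stand-ins, not the actual spreadsheet.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the workbook: target first, six features,
# then two trailing columns that are not part of the model
demo_df = pd.DataFrame({
    "MPG": [18.0, 15.0],
    "Cylinders": [8, 8],
    "Displacement": [307.0, 350.0],
    "Horsepower": [130, 165],
    "Weight": [3504, 3693],
    "Acceleration": [12.0, 11.5],
    "Year": [70, 70],
    "Origin": [1, 1],
    "Name": ["car a", "car b"],
})

# Drop the first column and the last two, then convert to a float array
features = demo_df.iloc[:, 1:-2].to_numpy(dtype=float)
targets_demo = demo_df["MPG"].to_numpy(dtype=float)

print(features.shape)
```

Converting to a float array up front may also avoid the mixed-type (object) array that stacking individual rows can produce.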






### Normalizing data using mean

In [5]:
def mean_normalize(data):
    """
    Mean-normalizes each feature: (value - feature mean) / feature range
    Args:
      data (ndarray): Shape (m,n) array with m data points and n features
      
    Returns:
      dataCopy (ndarray): Shape (m,n) mean-normalized copy of data
    """
    dataCopy = copy.deepcopy(data)
    dataShape = dataCopy.shape
    
    # empty list to store each feature's mean
    means = []
    # empty list to store each feature's range
    rangeOfData = []
    
    # Use nested for loop to calculate the mean and range of each feature
    for col in range(dataShape[1]):
        nums = []
        for row in range(dataShape[0]):
            nums.append(data[row][col])
        avg = np.mean(np.array(nums))
        rangeOfCol = np.ptp(np.array(nums))
        means.append(avg)
        rangeOfData.append(rangeOfCol)
    
    # Use second nested for loop to normalize the dataset
    for col in range(dataShape[1]):
        for row in range(dataShape[0]):
            dataCopy[row][col] = (dataCopy[row][col] - means[col])/ rangeOfData[col]
    
    
    return dataCopy
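
The same normalization can be written without explicit loops. This is a minimal sketch using NumPy broadcasting; the function name `mean_normalize_vectorized` and the demo values are assumptions for illustration. Unlike the function above, it also returns the per-feature means and ranges so that a new datapoint can be scaled with the training statistics.

```python
import numpy as np

def mean_normalize_vectorized(data):
    """Mean-normalize each column and return the per-feature statistics."""
    data = np.asarray(data, dtype=float)  # work in floats so integer inputs are not truncated
    means = data.mean(axis=0)             # per-feature mean, shape (n,)
    ranges = np.ptp(data, axis=0)         # per-feature max - min, shape (n,)
    return (data - means) / ranges, means, ranges

# Made-up demo values (cylinders, displacement)
demo = np.array([[8, 307.0], [4, 120.0], [6, 250.0]])
normed, means, ranges = mean_normalize_vectorized(demo)

# A new point is scaled with the statistics of the data it is compared against
new_point = (np.array([6, 250.0]) - means) / ranges
print(normed)
print(new_point)
```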




### Defining Model and Functions
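
For reference, the functions in this section correspond to the formulas below. This is a summary read off the code that follows, and the notation is introduced here for convenience: $m$ is the number of datapoints, $n$ the number of features, $\mathbf{x}^{(i)}$ the $i$-th datapoint, $y^{(i)}$ its actual MPG, and $\alpha$ the learning rate.

Prediction (`predictMPG`):

$$
\hat{y}^{(i)} = \mathbf{w} \cdot \mathbf{x}^{(i)} + b
$$

Cost (`cost_function`):

$$
J(\mathbf{w}, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2
$$

Gradient (`calc_gradient`), for $j = 1, \dots, n$:

$$
\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x_j^{(i)}, \qquad
\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)
$$

Update step (`perform_grad`), repeated `iters` times:

$$
w_j \leftarrow w_j - \alpha \frac{\partial J}{\partial w_j}, \qquad
b \leftarrow b - \alpha \frac{\partial J}{\partial b}
$$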

In [14]:


def predictMPG(datapoint, weights, b): 
    """
    Predicts MPG for a single datapoint using the linear model
    Args:
      datapoint (ndarray): Shape (n,) array with multiple features
      weights (ndarray): Shape (n,) model parameters   
      b (scalar):             model parameter 
      
    Returns:
      guess (scalar):  prediction/guess
    """
    guess = np.dot(datapoint, weights) + b     
    return guess

def cost_function(data, targets, weights, b):
    """
    Squared-error cost of the current parameters
    Args:
      data (ndarray (m,n)): Full data, m number of datapoints and n different features
      targets (ndarray): Shape (m,) actual MPG values
      weights (ndarray): Shape (n,) model parameters   
      b (scalar):             model parameter 
    Returns:
      cost (scalar):  squared-error cost
    """
    NumPoints = data.shape[0]
    cost = 0.0
    
    for i in range(NumPoints):
        #predicts MPG and compares to actual (given) MPG
        cost += (predictMPG(data[i], weights, b) - targets[i])**2
    cost = cost / (2*NumPoints)
    return cost

def calc_gradient(data, targets, weights, b):
    """
    Calculates the gradient of the cost function using partial derivatives with respect to weights and b
    Args:
      data (ndarray (m,n)): Full data, m number of datapoints and n different features
      targets (ndarray): Shape (m,) actual MPG values
      weights (ndarray): Shape (n,) model parameters   
      b (scalar):             model parameter 
    Returns:
      deriv_dw (ndarray (n,)): Gradient of cost function with respect to weights
      deriv_db (scalar): Gradient of cost function with respect to b
      
    """
    dataShape = data.shape
    deriv_dw = np.zeros(weights.shape)
    deriv_db = 0

    for i in range(dataShape[0]): #392 data points
        error = predictMPG(data[i], weights, b) - targets[i]
        for w in range(dataShape[1]): #6 for 6 different features
            deriv_dw[w] += (error * data[i,w])
        deriv_db += error

    return (deriv_dw / dataShape[0]), (deriv_db / dataShape[0])

def perform_grad(data, targets, weightsIn, bIn, alpha, iters):
    """
    Performs gradient descent to update/learn weights and b. Updates weights and b by taking 
    iters gradient steps with rate alpha
    
    Args:
      data (ndarray (m,n)): Full data, m number of datapoints and n different features
      targets (ndarray): Shape (m,) actual MPG values
      weightsIn (ndarray): Shape (n,) initial model parameters   
      bIn (scalar):             initial model parameter 
      alpha (float): Learning rate
      iters (int): number of iterations to run gradient descent
      
    Returns:
      w (ndarray (n,)) : Updated values of parameters 
      b (scalar)       : Updated value of parameter 
      costHist (list): History of cost function, recorded every 5 iterations
      iterations (list): Iteration at which each cost was recorded
    """
    w = copy.deepcopy(weightsIn)
    b = bIn
    costHist = []
    iterations = []
    for i in range(iters):
        derivW, derivB = calc_gradient(data, targets, w, b)
        w = w - alpha * derivW
        b = b - alpha * derivB
        if i % 5 == 0:
            costHist.append(cost_function(data, targets, w, b))
            iterations.append(i)
    return w, b, costHist, iterations
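
The loops in `cost_function` and `calc_gradient` can be replaced with matrix operations, which tends to run much faster on a few hundred datapoints. The sketch below is one possible vectorized form, not the notebook's own implementation. The names `cost_vectorized` and `gradient_vectorized`, the learning rate of 0.5, and the synthetic data are assumptions chosen only to show that the update recovers known parameters.

```python
import numpy as np

def cost_vectorized(data, targets, weights, b):
    """Squared-error cost: sum of squared residuals divided by 2m."""
    residuals = data @ weights + b - targets       # shape (m,)
    return float(residuals @ residuals) / (2 * data.shape[0])

def gradient_vectorized(data, targets, weights, b):
    """Gradient of the cost with respect to weights and b."""
    residuals = data @ weights + b - targets       # shape (m,)
    grad_w = data.T @ residuals / data.shape[0]    # shape (n,)
    grad_b = residuals.mean()
    return grad_w, grad_b

# Synthetic check (made-up data, not the Auto MPG set): y = 2*x1 - 3*x2 + 1
rng = np.random.default_rng(0)
X = rng.uniform(-0.5, 0.5, size=(200, 2))
y = X @ np.array([2.0, -3.0]) + 1.0

w, b = np.zeros(2), 0.0
for _ in range(5000):
    grad_w, grad_b = gradient_vectorized(X, y, w, b)
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

print(w, b, cost_vectorized(X, y, w, b))
```

Because the synthetic targets are an exact linear function of the features, the learned `w` and `b` should approach the true values and the cost should approach zero.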


### Main Program and Tests

In [25]:
# initial parameter (weights)
weights = np.array([-5.1871476,  -4.29174755, -3.3185558,  -3.29637758,  0.9904946,   3.81195076])
# initial parameter (y intercept)
b = 23.44416896380101
# Learning rate (scales change in parameters per iteration)
alpha = 0.0001
#number of steps in grad desc
iters = 5000

print(f"The number of datapoints is {trainDataShape[0]} and the number of features is {trainDataShape[1]}")


print()
print("the training data is formatted as: ")
print("[pistons displacement horsepower weight acceleration year]")
print(trainData)

print()
print("performing mean normalization")
print("the normalized training data is:")
NormData = mean_normalize(trainData)
print(NormData)

print()
print("performing gradient descent")
finW, finB, costs, reps = perform_grad(NormData, MPG_array, weights, b, alpha, iters)

#Graphing cost v iteration
x = reps
y1 = costs

plt.plot(x, y1, label='Cost')
plt.xlabel('Number of Iterations')
plt.ylabel('Squared error')
plt.title('Squared Error vs Number of Iterations')
plt.legend()

plt.show()


print(f"The final weights were {finW} and b was {finB}")
print(f"The final cost is {cost_function(NormData, MPG_array, finW, finB)}")

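testArr = np.array([6, 250, 105, 3353, 14.5, 76])
# normalize the test point with the training set's per-feature mean and range,
# the same statistics mean_normalize applies to each column of the training data
trainFloat = trainData.astype(float)
testArr = (testArr - trainFloat.mean(axis=0)) / np.ptp(trainFloat, axis=0)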
print(testArr)
print("test case: [6	250	105	3353	14.5	76]")
print("correct answer: 22 MPG")
print(f"model predicts {predictMPG(testArr, finW, finB)}")
print(f"percent error is {(predictMPG(testArr, finW, finB) - 22)*100/22}")
print("5% NOT BAD :D")
#Other tests showed similar results

The number of datapoints is 392 and the number of features is 6

the training data is formatted as: 
[pistons displacement horsepower weight acceleration year]
[[8 307.0 130 3504 12.0 70]
 [8 350.0 165 3693 11.5 70]
 [8 318.0 150 3436 11.0 70]
 ...
 [4 135.0 84 2295 11.6 82]
 [4 120.0 79 2625 18.6 82]
 [4 119.0 82 2720 19.4 82]]

performing mean normalization
the normalized training data is:
[[0.5056122448979592 0.29092509096661917 0.13875332741792365
  0.14925313760321254 -0.2107932458697764 -0.49829931972789154]
 [0.5056122448979592 0.4020362020777303 0.32897071872227146
  0.20283975512518587 -0.24055515063168115 -0.49829931972789154]
 [0.5056122448979592 0.3193488635764383 0.2474489795918367
  0.12997329637837557 -0.2703170553935859 -0.49829931972789154]
 ...
 [-0.29438775510204085 -0.15351935347782525 -0.11124667258207635
  -0.1935310982913154 -0.23460276967930022 0.5017006802721085]
 [-0.29438775510204085 -0.19227904340030588 -0.13842058562555462
  -0.0999671629354889 0.1820638969


KeyboardInterrupt
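
The test above measures percent error on a single car. To summarize error over many cars at once, one option is the mean absolute percent error. The sketch below is an illustration only: the function name `mean_abs_percent_error` is hypothetical, and the demo predictions and targets are made-up numbers, not results from this model.

```python
import numpy as np

def mean_abs_percent_error(predictions, targets):
    """Average of |prediction - target| / |target|, expressed in percent."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.mean(np.abs((predictions - targets) / targets)) * 100)

# Made-up values for illustration only
demo_preds = np.array([21.0, 30.0, 16.5])
demo_targets = np.array([20.0, 30.0, 15.0])
print(mean_abs_percent_error(demo_preds, demo_targets))
```

In this notebook the predictions could come from applying `predictMPG` to each row of the normalized data with the final weights and `b`.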

