# Homework 4: Regression
### Nicholas Thomson

### Import libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import statsmodels.api as sm

Dataset taken from: https://www.kaggle.com/datasets/gagandeep16/car-sales

It contains information relating to cars and their prices. The goal is to create a multiple linear regression model that can show what makes a car more or less expensive.

In [2]:
data = pd.read_csv('Car_sales.csv')
data.head()

Unnamed: 0,Manufacturer,Model,Sales_in_thousands,__year_resale_value,Vehicle_type,Price_in_thousands,Engine_size,Horsepower,Wheelbase,Width,Length,Curb_weight,Fuel_capacity,Fuel_efficiency,Latest_Launch,Power_perf_factor
0,Acura,Integra,16.919,16.36,Passenger,21.5,1.8,140.0,101.2,67.3,172.4,2.639,13.2,28.0,2/2/2012,58.28015
1,Acura,TL,39.384,19.875,Passenger,28.4,3.2,225.0,108.1,70.3,192.9,3.517,17.2,25.0,6/3/2011,91.370778
2,Acura,CL,14.114,18.225,Passenger,,3.2,225.0,106.9,70.6,192.0,3.47,17.2,26.0,1/4/2012,
3,Acura,RL,8.588,29.725,Passenger,42.0,3.5,210.0,114.6,71.4,196.6,3.85,18.0,22.0,3/10/2011,91.389779
4,Audi,A4,20.397,22.255,Passenger,23.99,1.8,150.0,102.6,68.2,178.0,2.998,16.4,27.0,10/8/2011,62.777639


### Data Manipulation

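Drop rows with NaN values, drop the high-cardinality columns that would generate too many dummies (along with the sales and resale-value columns), and convert the remaining categorical variable to dummies.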

In [3]:
data = data.dropna()
data = data.drop(['Latest_Launch', 'Model', 'Manufacturer', 'Sales_in_thousands', '__year_resale_value'], axis = 1)
data = pd.get_dummies(data)
data.head()

Unnamed: 0,Price_in_thousands,Engine_size,Horsepower,Wheelbase,Width,Length,Curb_weight,Fuel_capacity,Fuel_efficiency,Power_perf_factor,Vehicle_type_Car,Vehicle_type_Passenger
0,21.5,1.8,140.0,101.2,67.3,172.4,2.639,13.2,28.0,58.28015,False,True
1,28.4,3.2,225.0,108.1,70.3,192.9,3.517,17.2,25.0,91.370778,False,True
3,42.0,3.5,210.0,114.6,71.4,196.6,3.85,18.0,22.0,91.389779,False,True
4,23.99,1.8,150.0,102.6,68.2,178.0,2.998,16.4,27.0,62.777639,False,True
5,33.95,2.8,200.0,108.7,76.1,192.0,3.561,18.5,22.0,84.565105,False,True
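
Because `Vehicle_type` has only two levels, `pd.get_dummies` produces two indicator columns that always sum to one, which makes them perfectly collinear with the model's intercept. A minimal sketch of one common remedy, `drop_first=True`, on a small made-up frame (the values below are illustrative and are not taken from `Car_sales.csv`):

```python
import pandas as pd

# Toy frame standing in for the cleaned data; values are illustrative only.
toy = pd.DataFrame({
    'Horsepower': [140.0, 225.0, 210.0, 150.0],
    'Vehicle_type': ['Passenger', 'Passenger', 'Car', 'Passenger'],
})

# Default behaviour: one indicator column per category level.
full = pd.get_dummies(toy)

# drop_first=True keeps k-1 indicators, so the remaining column is read
# relative to the dropped baseline level ('Car', the first level alphabetically).
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With a single indicator left, its coefficient can be read as the price difference between the two vehicle types, holding the other variables fixed.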


### Select Variables

Select the dependent and independent variables

In [4]:
y = data['Price_in_thousands']
X = data.drop(['Price_in_thousands'],axis=1) # Select all variables except for Price

### Split the dataset into Testing and Training

In [5]:
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=544)

### Create and train the model

In [6]:
# Create a linear regression model
model = LinearRegression()

# Fit the model to the training data
model.fit(X_train, Y_train)

### Make predictions

In [7]:
Y_pred = model.predict(X_test)

### Evaluate the model

In [8]:
mse = mean_squared_error(Y_test, Y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(Y_test, Y_pred)

print(f'Mean Squared Error (MSE): {mse:.2f}')
print(f'Root Mean Squared Error (RMSE): {rmse:.2f}')
print(f'R-squared (R²): {r2:.2f}')

Mean Squared Error (MSE): 0.00
Root Mean Squared Error (RMSE): 0.00
R-squared (R²): 1.00
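
For reference, the three metrics can be reproduced by hand from the residuals. The sketch below uses hypothetical true and predicted prices (not values from this notebook) to show how MSE, RMSE, and R² relate; an R² of exactly 1 with an RMSE of 0 only occurs when every residual is zero.

```python
import numpy as np

# Hypothetical true and predicted prices (in thousands), for illustration only.
y_true = np.array([21.5, 28.4, 42.0, 23.99])
y_hat = np.array([22.0, 27.9, 40.5, 25.0])

residuals = y_true - y_hat

mse = np.mean(residuals ** 2)                    # mean squared error
rmse = np.sqrt(mse)                              # same units as the target
ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot                       # share of variance explained

print(f'MSE: {mse:.4f}, RMSE: {rmse:.4f}, R²: {r2:.4f}')
```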


### Interpret the Result

In [9]:
coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': model.coef_})  
print(coefficients)  

                   Feature   Coefficient
0              Engine_size -6.754243e-01
1               Horsepower -9.168913e-01
2                Wheelbase -3.215473e-13
3                    Width -6.039167e-10
4                   Length  1.418913e-10
5              Curb_weight  1.199840e-08
6            Fuel_capacity -2.168542e-09
7          Fuel_efficiency -6.278946e-10
8        Power_perf_factor  2.592316e+00
9         Vehicle_type_Car -1.966592e-09
10  Vehicle_type_Passenger  1.966592e-09


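Certain variables, such as the power performance factor, have a strong relationship with a car's sale price: `Power_perf_factor` has the largest coefficient (about 2.59), followed by `Horsepower` (about -0.92) and `Engine_size` (about -0.68). The remaining coefficients are effectively zero (on the order of 1e-8 or smaller), so those variables contribute almost nothing to the model's predictions.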

The R² value being a perfect 1.00 was surprising, and I am unsure how the model was able to predict the sale price of a car perfectly. The RMSE is 0 for the same reason: the predictions match the test prices exactly.
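
One possible explanation, which the coefficient table is consistent with but which is not verified here, is target leakage: if `Power_perf_factor` is itself computed from price, horsepower, and engine size, then price can be recovered exactly as a linear combination of those three columns. The sketch below illustrates the effect on synthetic data only; the `leaky` feature, its weights, and the helper `holdout_r2` are made up for the demonstration and are not the dataset's actual formula.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins; NOT the Car_sales.csv data.
engine = rng.uniform(1.0, 6.0, n)
horsepower = 40.0 * engine + rng.normal(0.0, 20.0, n)
price = 5.0 + 0.12 * horsepower + rng.normal(0.0, 4.0, n)

# Hypothetical derived feature built from the target itself.
# The weights are invented for this demonstration.
leaky = 0.4 * price + 0.35 * horsepower + 0.25 * engine

X_leaky = np.column_stack([engine, horsepower, leaky])
X_clean = np.column_stack([engine, horsepower])


def holdout_r2(X, y):
    """Fit on 80% of the rows and return R² on the held-out 20%."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=544
    )
    model = LinearRegression().fit(X_tr, y_tr)
    return r2_score(y_te, model.predict(X_te))


r2_with = holdout_r2(X_leaky, price)
r2_without = holdout_r2(X_clean, price)

print(f'R² with the derived feature:    {r2_with:.4f}')
print(f'R² without the derived feature: {r2_without:.4f}')
```

Applying the same check to this notebook would mean dropping `Power_perf_factor` from `X` and refitting; if R² then falls noticeably below 1.00, the perfect score was most likely an artifact of that column rather than genuine predictive power.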