# Car Price Predictor - Linear Regression Model

##### This data comes from kaggle.com - https://www.kaggle.com/datasets/CooperUnion/cardataset

First, load the tools, libraries, and original data set into the notebook.  

In [9]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

df = pd.read_csv(r'C:\Users\v-joecamp\OneDrive - Microsoft\Desktop\Jupyter Notebook\car prices.csv')

df.head()

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916,46135
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916,40650
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916,36350
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916,29450
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916,34500


In [10]:
#check for missing data
df.isnull().sum()

Make                    0
Model                   0
Year                    0
Engine Fuel Type        3
Engine HP              69
Engine Cylinders       30
Transmission Type       0
Driven_Wheels           0
Number of Doors         6
Market Category      3742
Vehicle Size            0
Vehicle Style           0
highway MPG             0
city mpg                0
Popularity              0
MSRP                    0
dtype: int64

In [11]:
#Clean up the data a bit. This project focuses on the following variables: Year, Engine HP, Engine Cylinders,
#Number of Doors, highway MPG, city mpg, Popularity, and MSRP. Missing values in Engine HP, Engine Cylinders, and
#Number of Doors are filled with the median of each column.

#assign the result back instead of calling fillna(inplace=True) on a column selection, which newer pandas versions warn about
df['Engine HP'] = df['Engine HP'].fillna(df['Engine HP'].median())
df['Engine Cylinders'] = df['Engine Cylinders'].fillna(df['Engine Cylinders'].median())
df['Number of Doors'] = df['Number of Doors'].fillna(df['Number of Doors'].median())
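
The median fill above can also be written as a loop over the affected columns. A minimal sketch on a toy frame (the values in `toy` are made up for illustration and are not taken from the Kaggle data):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; NaN marks the missing entries.
toy = pd.DataFrame({
    'Engine HP': [300.0, np.nan, 230.0, 150.0],
    'Engine Cylinders': [6.0, 4.0, np.nan, 4.0],
    'Number of Doors': [2.0, 4.0, 4.0, np.nan],
})

numeric_cols = ['Engine HP', 'Engine Cylinders', 'Number of Doors']

# Fill each column with its own median and assign the result back,
# so the change is guaranteed to land on the frame itself.
for col in numeric_cols:
    toy[col] = toy[col].fillna(toy[col].median())

remaining_nulls = int(toy[numeric_cols].isnull().sum().sum())
print(remaining_nulls)
```

The median is less sensitive to extreme values than the mean, which matters for a column like Engine HP where a few supercars sit far above the rest.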

In [12]:
## let's confirm that the missing data has been filled in. 

df.isnull().sum()

Make                    0
Model                   0
Year                    0
Engine Fuel Type        3
Engine HP               0
Engine Cylinders        0
Transmission Type       0
Driven_Wheels           0
Number of Doors         0
Market Category      3742
Vehicle Size            0
Vehicle Style           0
highway MPG             0
city mpg                0
Popularity              0
MSRP                    0
dtype: int64

We are going to build two models. The first will be based on the entire data set. The second will be based on a filtered data set that keeps about 96.7% of the original rows. The removed 3.3% are vehicles with fairly high MSRP values, which I consider to be outliers. Both models use the same variables: Year, Engine HP, Engine Cylinders, Number of Doors, highway MPG, city mpg, and Popularity. 
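
The two-model plan described above could be wrapped in one helper so that both data sets go through identical steps. A minimal sketch, assuming a hypothetical helper `fit_and_score` and synthetic stand-in data (neither is part of the original notebook):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def fit_and_score(frame, features, target, test_size=0.2, random_state=42):
    """Fit a linear regression on `frame` and return test-set metrics."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        frame[features], frame[target],
        test_size=test_size, random_state=random_state,
    )
    reg = LinearRegression().fit(X_tr, y_tr)
    pred = reg.predict(X_te)
    return {
        'n_rows': len(frame),
        'mse': float(mean_squared_error(y_te, pred)),
        'r2': float(r2_score(y_te, pred)),
    }


# Synthetic stand-in data (not the Kaggle set): price is an exact
# linear function of hp and year, so the fit should be near perfect.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'hp': rng.uniform(100, 400, size=200),
    'year': rng.integers(1990, 2018, size=200),
})
demo['price'] = 100 * demo['hp'] + 500 * (demo['year'] - 1990) + 5000

# Same helper, once on all rows and once on a price-capped subset.
full_scores = fit_and_score(demo, ['hp', 'year'], 'price')
capped_scores = fit_and_score(demo[demo['price'] <= 40000], ['hp', 'year'], 'price')
print(full_scores['n_rows'], capped_scores['n_rows'])
```

In the notebook itself the call would take `df` or the filtered frame, the seven feature names, and `'MSRP'` as the target.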

### Linear Regression Model using entire data set

In [13]:
#Define the features (X) and target (y)
X = df[['Year', 'Engine HP', 'Engine Cylinders', 'Number of Doors', 'highway MPG', 'city mpg', 'Popularity']]
y = df['MSRP']

#split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#create the linear regression model
model = LinearRegression()

#train the model
model.fit(X_train, y_train)

#make predictions
y_pred = model.predict(X_test)

#evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
#built-in round() works whether the metric comes back as a Python float or a NumPy float
print(f'Mean Squared Error: {round(mse, 2)}')
print(f'R squared: {round(r2, 4)}')

Mean Squared Error: 1063163788.6
R squared: 0.554


Definitions for the metrics above:
Mean Squared Error (MSE) measures the average squared difference between the predicted values and the actual target values. A smaller MSE indicates that the model's predictions are closer to the actual values. Because the errors are squared, MSE is expressed in squared dollars and large misses are penalized heavily. 
R-squared, the coefficient of determination, measures how much of the variance in the target the model explains. A value of 1 is a perfect fit and 0 is no better than always predicting the mean; on held-out test data it can even be negative when the model does worse than that. A higher R-squared value indicates a better fit of the model to the data. 
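
To make the two definitions concrete, here is a small sketch that computes both metrics by hand and checks them against scikit-learn. The four price pairs are made up for illustration:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted prices, for illustration only.
y_true = np.array([20000.0, 30000.0, 40000.0, 50000.0])
y_hat = np.array([22000.0, 28000.0, 43000.0, 47000.0])

# MSE: mean of the squared residuals.
residuals = y_true - y_hat
mse_manual = np.mean(residuals ** 2)

# R squared: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```

Since MSE is in squared dollars, its square root is often easier to read: for the 1,063,163,788.6 reported above, that is roughly 32,600.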

In [16]:
##Comparison of actual versus predicted MSRP values

comparison_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred.round(2)})
comparison_df["difference"] = comparison_df['Actual'] - comparison_df['Predicted']
comparison_df["absolute difference"] = np.abs(comparison_df['Actual'] - comparison_df['Predicted'])
print(comparison_df)

       Actual  Predicted  difference  absolute difference
3995    29695   41368.87   -11673.87             11673.87
7474    30495   15523.86    14971.14             14971.14
7300    37650   39535.10    -1885.10              1885.10
3148    16170    2349.63    13820.37             13820.37
747      2000   -8293.78    10293.78             10293.78
...       ...        ...         ...                  ...
267     35550   63260.83   -27710.83             27710.83
4320    48360   55983.29    -7623.29              7623.29
5799    31750    8916.82    22833.18             22833.18
6080    20995   18037.91     2957.09              2957.09
11511   57700   40620.00    17080.00             17080.00

[2383 rows x 4 columns]


In [17]:
mean_absolute_difference = comparison_df['absolute difference'].mean()
print(f'Mean Difference: {mean_absolute_difference.round(2)}')

Mean Difference: 20475.76
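
The "Mean Difference" printed above is the mean of the absolute differences, which is usually called the mean absolute error (MAE). scikit-learn provides it directly as `mean_absolute_error`. A short sketch using the first four rows of the comparison table above shows that the two approaches agree:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error

# First four rows of the comparison table above.
actual = pd.Series([29695, 30495, 37650, 16170])
predicted = np.array([41368.87, 15523.86, 39535.10, 2349.63])

# Manual version: average of the absolute differences.
manual_mae = np.abs(actual - predicted).mean()

# Library version: same quantity in one call.
sklearn_mae = mean_absolute_error(actual, predicted)

print(round(manual_mae, 2), round(sklearn_mae, 2))
```

Unlike MSE, MAE stays in dollars, so it can be read directly as "how far off is a typical prediction".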


### The results above are not great: predictions miss the actual MSRP by about 20,476 on average, and some predicted prices are negative.  

## Next, build a linear regression model on a subset of the original data set.  

In [18]:
## As we did in the EDA notebook, let's bucket the range of MSRP values 

#Define the bucket range
MSRP_range = range(2000, 2200000, 50000)

#create buckets 
msrp_buckets = pd.cut(df['MSRP'], bins=MSRP_range, include_lowest=True)
bucket_counts = msrp_buckets.value_counts().sort_index()

print(bucket_counts)

MSRP
(1999.999, 52000.0]       10089
(52000.0, 102000.0]        1200
(102000.0, 152000.0]        226
(152000.0, 202000.0]        133
(202000.0, 252000.0]        109
(252000.0, 302000.0]         63
(302000.0, 352000.0]         33
(352000.0, 402000.0]         15
(402000.0, 452000.0]         19
(452000.0, 502000.0]         16
(502000.0, 552000.0]          4
(552000.0, 602000.0]          0
(602000.0, 652000.0]          1
(652000.0, 702000.0]          0
(702000.0, 752000.0]          0
(752000.0, 802000.0]          0
(802000.0, 852000.0]          0
(852000.0, 902000.0]          0
(902000.0, 952000.0]          0
(952000.0, 1002000.0]         0
(1002000.0, 1052000.0]        0
(1052000.0, 1102000.0]        0
(1102000.0, 1152000.0]        0
(1152000.0, 1202000.0]        0
(1202000.0, 1252000.0]        0
(1252000.0, 1302000.0]        0
(1302000.0, 1352000.0]        0
(1352000.0, 1402000.0]        2
(1402000.0, 1452000.0]        0
(1452000.0, 1502000.0]        2
(1502000.0, 1552000.0]        0
(15

### Based on the counts above, we are going to build a subset of this dataset by omitting what I consider to be outliers. To me, anything priced higher than 152,000 is an outlier. The new dataset will cover MSRP values from 2,000 to 152,000. That's 11,515 cars out of the original 11,914, or about 96.7% of the original population. 

In [19]:
#Define MSRP range
msrp_min = 2000
msrp_max = 152000

#create filtered dataframe
filtered1_df = df[(df['MSRP']>= msrp_min) & (df['MSRP'] <= msrp_max)]
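
The same range filter can be written with `Series.between`, and the share of rows kept can be checked at the same time. A minimal sketch on made-up MSRP values (the frame `toy` is illustrative only):

```python
import pandas as pd

# Made-up MSRP values for illustration.
toy = pd.DataFrame({'MSRP': [2000, 29450, 46135, 152000, 152001, 500000]})

msrp_min = 2000
msrp_max = 152000

# Series.between is inclusive on both ends by default,
# matching the >= and <= comparisons.
mask = toy['MSRP'].between(msrp_min, msrp_max)

# .copy() makes the subset an independent frame, which avoids
# chained-assignment warnings if it is modified later.
filtered_toy = toy[mask].copy()

share_kept = len(filtered_toy) / len(toy)
print(len(filtered_toy), round(share_kept, 3))
```

Printing the kept share right after filtering is a cheap way to confirm the cutoff removes only the intended tail.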

In [22]:
##Let's check the new dataframe. It looks good: it contains only cars with an MSRP from 2000 to 152000. 

#Define the bucket range
MSRP_range = range(2000, 165000, 10000)

#create buckets 
msrp_buckets = pd.cut(filtered1_df['MSRP'], bins=MSRP_range, include_lowest=True)
bucket_counts = msrp_buckets.value_counts().sort_index()

print(bucket_counts)

MSRP
(1999.999, 12000.0]     1674
(12000.0, 22000.0]      1595
(22000.0, 32000.0]      3355
(32000.0, 42000.0]      2275
(42000.0, 52000.0]      1190
(52000.0, 62000.0]       494
(62000.0, 72000.0]       306
(72000.0, 82000.0]       157
(82000.0, 92000.0]       147
(92000.0, 102000.0]       96
(102000.0, 112000.0]      60
(112000.0, 122000.0]      56
(122000.0, 132000.0]      33
(132000.0, 142000.0]      42
(142000.0, 152000.0]      35
(152000.0, 162000.0]       0
Name: count, dtype: int64
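
How `pd.cut` assigns values to the buckets shown above is easy to miss, so here is a small sketch on made-up prices. It illustrates that bins are closed on the right, what `include_lowest=True` changes, and what happens to values beyond the last edge:

```python
import pandas as pd

# Made-up prices for illustration.
prices = pd.Series([2000, 11999, 12000, 12001, 25000, 70000])

# range(2000, 35000, 10000) yields the edges 2000, 12000, 22000, 32000.
edges = list(range(2000, 35000, 10000))

# Bins are right-closed: (12000, 22000] includes 22000 but not 12000.
# include_lowest=True makes the first bin also include its left edge (2000).
buckets = pd.cut(prices, bins=edges, include_lowest=True)
counts = buckets.value_counts().sort_index()

# Values beyond the last edge (70000 here) fall in no bin and become NaN,
# so they are silently left out of the counts.
n_unbinned = int(buckets.isna().sum())
print(counts.tolist(), n_unbinned)
```

Comparing the sum of the bucket counts with the number of rows is a quick check that no value fell outside the bin edges.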


In [27]:
#time to build a new linear regression model

# Define the features (X) and target (y)
X1 = filtered1_df[['Year', 'Engine HP', 'Engine Cylinders', 'Number of Doors', 'highway MPG', 'city mpg', 'Popularity']]
y1 = filtered1_df['MSRP']

#split the data into training and test sets
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.2, random_state=42)

#create the linear regression
model = LinearRegression()

#train the model
model.fit(X1_train, y1_train)

#make predictions
y_pred1 = model.predict(X1_test)

In [30]:
#evaluate the model
mse = mean_squared_error(y1_test, y_pred1)
r2 = r2_score(y1_test, y_pred1)
#built-in round() works whether the metric comes back as a Python float or a NumPy float
print(f'Mean Squared Error: {round(mse, 2)}')
print(f'R squared: {round(r2, 4)}')

Mean Squared Error: 148305069.68
R squared: 0.7148


In [32]:
#Comparison of actual versus predicted MSRP values.

comparison_df = pd.DataFrame({'Actual': y1_test, 'Predicted': y_pred1.round(2)})
comparison_df["difference"] = comparison_df['Actual'] - comparison_df['Predicted']
comparison_df["absolute difference"] = np.abs(comparison_df['Actual'] - comparison_df['Predicted'])
print(comparison_df)

       Actual  Predicted  difference  absolute difference
5342    24845   28976.71    -4131.71              4131.71
2115    41200   34625.84     6574.16              6574.16
3062    29945   28365.67     1579.33              1579.33
10413   31260   30209.10     1050.90              1050.90
4137    67250   60091.33     7158.67              7158.67
...       ...        ...         ...                  ...
838     30330   28813.02     1516.98              1516.98
6831    13995   10957.54     3037.46              3037.46
1223    25830   28070.28    -2240.28              2240.28
1613    18995   22660.80    -3665.80              3665.80
10944   42550   57022.33   -14472.33             14472.33

[2303 rows x 4 columns]


In [33]:
mean_absolute_difference = comparison_df['absolute difference'].mean()
print(f'Mean Difference: {mean_absolute_difference.round(2)}')

Mean Difference: 8080.15


### Takeaway: as indicated in the correlation matrix in the EDA notebook, the variables correlate better with MSRP in the reduced dataset. The model using the original dataframe (df) produced a mean absolute difference (actual MSRP versus predicted MSRP) of 20,476, which is not very good. The model using the filtered dataframe (filtered1_df) produced a mean absolute difference of 8,080, which is much better. Note that the second model is also tested only on cars priced up to 152,000, so part of the improvement comes from an easier test set. Can we do better still by including variables with the object data type?  
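
One common way to feed object-typed columns into a linear regression is one-hot encoding. The sketch below uses `pd.get_dummies` on made-up rows. The column names match the dataset, but the choice of which columns to encode is an assumption and not something the notebook has settled:

```python
import pandas as pd

# Made-up rows for illustration; column names match the dataset.
toy = pd.DataFrame({
    'Engine HP': [335.0, 300.0, 150.0],
    'Transmission Type': ['MANUAL', 'AUTOMATIC', 'MANUAL'],
    'Vehicle Size': ['Compact', 'Midsize', 'Compact'],
})

# One indicator column per category. drop_first=True drops one level per
# feature to avoid perfectly collinear columns in a linear model.
encoded = pd.get_dummies(
    toy, columns=['Transmission Type', 'Vehicle Size'], drop_first=True
)
print(list(encoded.columns))
```

High-cardinality columns such as Make or Model would produce many indicator columns, so they may need grouping or a different encoding before being added.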