# Predicting Electric Vehicle Base MSRP Using Supervised Regression

## Objective
* The aim of this project is to develop a supervised regression model to predict the Base MSRP (Manufacturer’s Suggested Retail Price) of electric vehicles (EVs) based on key technical and categorical features.
* By analyzing historical registration data, we aim to uncover how factors such as electric range, vehicle type, and model year influence EV pricing.
* These insights can support better decision-making for:
  * EV manufacturers: optimizing pricing strategies
  * Policymakers: evaluating and refining EV incentive programs
  * Consumers and market analysts: understanding value, trends, and competitiveness


## 1. Data Preprocessing

In [25]:
# Importing libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
import joblib

In [29]:
# Suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

In [30]:
# Load dataset
data = pd.read_csv(r"C:\Users\aksha\Downloads\Electric_Vehicle_Population_Data (1).csv")

In [31]:
# read_csv already returns a DataFrame; work on a copy so `data` keeps the raw values
df = data.copy()

In [32]:
df.head()

,VIN (1-10),County,City,State,Postal Code,Model Year,Make,Model,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Base MSRP,Legislative District,DOL Vehicle ID,Vehicle Location,Electric Utility,2020 Census Tract
0,5YJ3E1EB6K,King,Seattle,WA,98178.0,2019,TESLA,MODEL 3,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,220.0,0.0,37.0,101250425,POINT (-122.23825 47.49461),CITY OF SEATTLE - (WA)|CITY OF TACOMA - (WA),53033010000.0
1,5YJYGAEE5M,Yakima,Selah,WA,98942.0,2021,TESLA,MODEL Y,Battery Electric Vehicle (BEV),Eligibility unknown as battery range has not b...,0.0,0.0,15.0,224162858,POINT (-120.53145 46.65405),PACIFICORP,53077000000.0
2,5YJSA1E65N,Yakima,Granger,WA,98932.0,2022,TESLA,MODEL S,Battery Electric Vehicle (BEV),Eligibility unknown as battery range has not b...,0.0,0.0,15.0,187279214,POINT (-120.1871 46.33949),PACIFICORP,53077000000.0
3,5YJ3E1EBXN,King,Bellevue,WA,98004.0,2022,TESLA,MODEL 3,Battery Electric Vehicle (BEV),Eligibility unknown as battery range has not b...,0.0,0.0,41.0,219428079,POINT (-122.1872 47.61001),PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA),53033020000.0
4,JM3KKEHA8S,Thurston,Yelm,WA,98597.0,2025,MAZDA,CX-90,Plug-in Hybrid Electric Vehicle (PHEV),Not eligible due to low battery range,26.0,0.0,2.0,277791643,POINT (-122.60735 46.94239),PUGET SOUND ENERGY INC,53067010000.0


In [33]:
print("Shape of the dataset is: ")
df.shape

Shape of the dataset is: 


(246137, 17)

In [34]:
print("Columns: ")
df.columns

Columns: 


Index(['VIN (1-10)', 'County', 'City', 'State', 'Postal Code', 'Model Year',
       'Make', 'Model', 'Electric Vehicle Type',
       'Clean Alternative Fuel Vehicle (CAFV) Eligibility', 'Electric Range',
       'Base MSRP', 'Legislative District', 'DOL Vehicle ID',
       'Vehicle Location', 'Electric Utility', '2020 Census Tract'],
      dtype='object')

In [35]:
print("Dataset Information:")
print("\t")
df.info()

Dataset Information:
	
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 246137 entries, 0 to 246136
Data columns (total 17 columns):
 #   Column                                             Non-Null Count   Dtype  
---  ------                                             --------------   -----  
 0   VIN (1-10)                                         246137 non-null  object 
 1   County                                             246133 non-null  object 
 2   City                                               246133 non-null  object 
 3   State                                              246137 non-null  object 
 4   Postal Code                                        246133 non-null  float64
 5   Model Year                                         246137 non-null  int64  
 6   Make                                               246137 non-null  object 
 7   Model                                              246137 non-null  object 
 8   Electric Vehicle Type                              

In [39]:
print("Summary Statistics: ")
print("\t")
df.describe()

Summary Statistics: 
	


| Statistic | Postal Code | Model Year | Electric Range | Base MSRP | Legislative District | DOL Vehicle ID | 2020 Census Tract |
|---|---|---|---|---|---|---|---|
| count | 246133.0 | 246137.0 | 246120.0 | 246120.0 | 245597.0 | 246137.0 | 246133.0 |
| mean | 98179.658481 | 2021.535698 | 44.872192 | 746.606188 | 28.871831 | 237432400.0 | 52976850000.0 |
| std | 2494.101983 | 2.999144 | 82.913952 | 6987.233456 | 14.895938 | 67191580.0 | 1580103000.0 |
| min | 1731.0 | 2000.0 | 0.0 | 0.0 | 1.0 | 4385.0 | 1001020000.0 |
| 25% | 98052.0 | 2020.0 | 0.0 | 0.0 | 17.0 | 208339100.0 | 53033010000.0 |
| 50% | 98126.0 | 2023.0 | 0.0 | 0.0 | 32.0 | 254846000.0 | 53033030000.0 |
| 75% | 98375.0 | 2024.0 | 37.0 | 0.0 | 42.0 | 271731900.0 | 53053070000.0 |
| max | 99577.0 | 2026.0 | 337.0 | 845000.0 | 49.0 | 479254800.0 | 56021000000.0 |
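
The summary statistics show a 75th percentile of 0.0 for `Base MSRP`, so at least three quarters of the rows carry a zero price. The following is a minimal sketch for quantifying that share before modelling. It assumes (the source does not state this) that a zero means the price was not recorded rather than a genuine price. The helper name `zero_share` and the toy frame are hypothetical and stand in for the real `df`.

```python
import pandas as pd

def zero_share(frame: pd.DataFrame, column: str) -> float:
    """Return the fraction of non-missing rows in which `column` equals 0."""
    values = frame[column].dropna()
    if values.empty:
        return float("nan")
    return float((values == 0).mean())

# Toy data standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({"Base MSRP": [0.0, 0.0, 0.0, 69900.0, None]})
share = zero_share(toy, "Base MSRP")
print(f"Share of zero Base MSRP values: {share:.2%}")
```

On the real data the call would be `zero_share(df, 'Base MSRP')`. If the share is large, restricting training to rows with a positive MSRP may be worth considering, since a regression target dominated by placeholder zeros is hard to interpret.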


In [40]:
print("Unique Values: ")
print("\t")
df.nunique()

Unique Values: 
	


VIN (1-10)                                            14503
County                                                  217
City                                                    803
State                                                    48
Postal Code                                             993
Model Year                                               21
Make                                                     46
Model                                                   174
Electric Vehicle Type                                     2
Clean Alternative Fuel Vehicle (CAFV) Eligibility         3
Electric Range                                          110
Base MSRP                                                31
Legislative District                                     49
DOL Vehicle ID                                       246137
Vehicle Location                                        992
Electric Utility                                         76
2020 Census Tract                       

In [41]:
#Finding missing values
missing_values = df.isnull().sum()
print("Missing Values:")
print("\t")
print(missing_values)

Missing Values:
	
VIN (1-10)                                             0
County                                                 4
City                                                   4
State                                                  0
Postal Code                                            4
Model Year                                             0
Make                                                   0
Model                                                  0
Electric Vehicle Type                                  0
Clean Alternative Fuel Vehicle (CAFV) Eligibility      0
Electric Range                                        17
Base MSRP                                             17
Legislative District                                 540
DOL Vehicle ID                                         0
Vehicle Location                                      11
Electric Utility                                       4
2020 Census Tract                                      4
dtype: int64
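
Raw counts are easier to judge next to the dataset size: the largest gap above, 540 rows in `Legislative District`, is roughly 0.22% of the 246,137 rows. A small sketch of a helper that reports counts and percentages together is shown here; `missing_summary` and the toy frame are hypothetical names used only for illustration.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Tabulate missing counts and percentages for columns with at least one gap."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(3),
    })
    # Keep only columns that actually have gaps, largest first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy data standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "County": ["King", None, "Yakima", "King"],
    "Model Year": [2019, 2021, 2022, 2025],
    "Legislative District": [37.0, None, None, 2.0],
})
report = missing_summary(toy)
print(report)
```

On the real data the call would be `missing_summary(df)`.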


In [42]:
# Handling missing values: fill numerical columns with the column mean.
# Assigning the result back (instead of `inplace=True` on a selected column)
# avoids chained-assignment problems in recent pandas versions.
num_cols = ['Electric Range', 'Base MSRP', 'Postal Code', '2020 Census Tract']
for col in num_cols:
    df[col] = df[col].fillna(df[col].mean())

In [43]:
# Fill missing values in categorical columns with the mode (most frequent value).
# Assigning the result back avoids chained-assignment problems in recent pandas versions.
cat_cols = ['County', 'City', 'State', 'Vehicle Location',
            'Electric Utility', 'Legislative District']
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])
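
The fill values above are computed from the full dataset. `SimpleImputer` is imported at the top of the notebook, so one possible alternative is to fit the imputers on the training split only and reuse the learned statistics on the test split, which keeps test rows from influencing the fill values. The sketch below assumes a train/test split is made before imputation (the notebook's actual order is not shown in this section) and uses a toy frame with illustrative values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy data standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "Electric Range": [220.0, np.nan, 26.0, 0.0, 150.0, 30.0],
    "County": ["King", "Yakima", np.nan, "King", "King", "Thurston"],
})

train, test = train_test_split(toy, test_size=0.33, random_state=0)
train, test = train.copy(), test.copy()

num_imputer = SimpleImputer(strategy="mean")           # numerical: column mean
cat_imputer = SimpleImputer(strategy="most_frequent")  # categorical: mode

# Learn the fill values from the training rows only...
train["Electric Range"] = num_imputer.fit_transform(train[["Electric Range"]]).ravel()
train["County"] = cat_imputer.fit_transform(train[["County"]]).ravel()

# ...and apply those same values to the test rows.
test["Electric Range"] = num_imputer.transform(test[["Electric Range"]]).ravel()
test["County"] = cat_imputer.transform(test[["County"]]).ravel()

print(train.isnull().sum().sum(), test.isnull().sum().sum())
```

With only a handful of missing values per column, as in this dataset, the practical difference from whole-dataset imputation is likely small; the pattern matters more when gaps are frequent.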

In [44]:
# Checking for the null values after imputation
missing_values = df.isnull().sum()
print("Missing Values after Imputation:")
print(missing_values)

Missing Values after Imputation:
VIN (1-10)                                           0
County                                               0
City                                                 0
State                                                0
Postal Code                                          0
Model Year                                           0
Make                                                 0
Model                                                0
Electric Vehicle Type                                0
Clean Alternative Fuel Vehicle (CAFV) Eligibility    0
Electric Range                                       0
Base MSRP                                            0
Legislative District                                 0
DOL Vehicle ID                                       0
Vehicle Location                                     0
Electric Utility                                     0
2020 Census Tract                                    0
dtype: int64


In [21]:
# Checking for duplicate rows
print("\t")
print(f"Total number of duplicate rows: {df.duplicated().sum()}")

	
Total number of duplicate rows: 0
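
A count of zero is expected here regardless of the data quality: `DOL Vehicle ID` has 246,137 unique values, one per row, so no two full rows can be identical. A hedged sketch of a check that ignores identifier columns follows; `duplicates_ignoring` and the toy frame are hypothetical, and which columns count as identifiers is an assumption to adjust.

```python
import pandas as pd

def duplicates_ignoring(frame: pd.DataFrame, id_columns: list[str]) -> int:
    """Count rows that repeat an earlier row once `id_columns` are ignored."""
    compare_cols = [c for c in frame.columns if c not in id_columns]
    return int(frame.duplicated(subset=compare_cols).sum())

# Toy data standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "DOL Vehicle ID": [101, 102, 103],
    "Make": ["TESLA", "TESLA", "MAZDA"],
    "Model": ["MODEL 3", "MODEL 3", "CX-90"],
})
full_row_dupes = int(toy.duplicated().sum())
dupes_without_id = duplicates_ignoring(toy, ["DOL Vehicle ID"])
print(full_row_dupes, dupes_without_id)
```

Rows that match on everything except the identifier are not necessarily errors, since many registered vehicles legitimately share the same make, model, and location. The count is mainly useful for deciding whether repeated feature rows should be deduplicated before training.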
