In [None]:
ELECTRIC VEHICLE DATA ANALYSIS ASSIGNMENT
Name: Pilla Yaswanth Komal Kumar 
Course: Data Analytics
Date: 2025

In [None]:
INTRODUCTION:
This assignment analyzes Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric 
Vehicles (PHEVs) registered with the Washington State Department of Licensing (DOL). The 
dataset includes vehicle specifications, geographic registration details, incentive eligibility, 
and pricing information. The objective of this analysis is to clean the data, explore adoption 
trends, visualize key insights, and develop a Linear Regression model to predict a vehicle's electric range. 

In [None]:
1) DATA CLEANING:
Missing values were identified in Base MSRP, Electric Range, CAFV Eligibility, and 
Vehicle Location. Base MSRP values of zero were treated as missing. Duplicate records 
were removed using DOL Vehicle ID. VINs were anonymized using hashing.

In [None]:
#1.1 Missing Values Analysis:
df.isnull().sum()
#Typical observations:

#Base MSRP → Often 0, indicating missing data

#Electric Range → Missing for some PHEVs

#Electric Utility → Missing in rural areas

#Vehicle Location → Missing or malformed coordinates

#Missing values exist mainly in Base MSRP, Electric Range, and Electric Utility.
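Because `isnull()` only counts true NaN values, zero placeholders slip through the count. The sketch below shows one way to fold them in. It runs on a tiny made-up sample, not the real DOL export, and it assumes that a 0 in Base MSRP or Electric Range means "not recorded".

```python
import numpy as np
import pandas as pd

# Tiny made-up sample standing in for the real DOL export (values are illustrative only).
sample = pd.DataFrame({
    "Base MSRP": [0, 69900, 0, 31950],
    "Electric Range": [215, 0, 38, 84],
    "Electric Utility": ["PUGET SOUND ENERGY INC", None, "PACIFICORP", None],
})

# Assumption: a 0 in these two columns means "not recorded", so treat it as missing.
zero_as_missing = ["Base MSRP", "Electric Range"]
audit = sample.copy()
audit[zero_as_missing] = audit[zero_as_missing].replace(0, np.nan)

# Count and percentage of missing entries per column.
missing_report = pd.DataFrame({
    "missing_count": audit.isnull().sum(),
    "missing_pct": (audit.isnull().mean() * 100).round(1),
})
print(missing_report)
```

Reporting the percentage next to the count makes it easier to decide between imputing a column and dropping rows.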


In [None]:
#1.2 Handling Missing or Zero Values
import numpy as np

#Base MSRP

#A value of 0 means the price is missing, not that the vehicle was free

#Recommended actions:

#Replace 0 with NaN

#Impute using median MSRP by Make & Model

#Or remove rows if MSRP is not required for analysis

#Mark zero prices as missing
df['Base MSRP'] = df['Base MSRP'].replace(0, np.nan)

#Fill each gap with the median MSRP of the same Make and Model
#(groups with no known price stay NaN)
df['Base MSRP'] = df.groupby(['Make','Model'])['Base MSRP'].transform(lambda x: x.fillna(x.median()))

#Electric Range

#Missing values:

#Drop rows if Electric Range is the target variable, or fill using the average range per model
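The two options for Electric Range can be compared side by side. This is a minimal sketch on made-up rows. Which option is appropriate depends on whether Electric Range is the prediction target, since imputing a target leaks the group average into the model.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; NaN marks a missing range.
sample = pd.DataFrame({
    "Make": ["TESLA", "TESLA", "TESLA", "NISSAN", "NISSAN"],
    "Model": ["MODEL 3", "MODEL 3", "MODEL 3", "LEAF", "LEAF"],
    "Electric Range": [220.0, np.nan, 240.0, 84.0, np.nan],
})

# Option A: drop rows with no target value (safer when Electric Range is being predicted).
dropped = sample.dropna(subset=["Electric Range"])

# Option B: fill each gap with the mean range of the same make and model.
filled = sample.copy()
filled["Electric Range"] = (
    filled.groupby(["Make", "Model"])["Electric Range"]
    .transform(lambda s: s.fillna(s.mean()))
)

print(len(dropped))
print(filled["Electric Range"].tolist())
```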

In [None]:
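#1.3 Duplicate Records

#Check duplicates using DOL Vehicle ID. 'VIN (1-10)' holds only the first 10 VIN
#characters, which many vehicles share, so it cannot identify a single vehicle:

df.duplicated(subset=['DOL Vehicle ID']).sum()


#Handling duplicates:

#Keep the latest registration

#Drop exact duplicates

#keep='last' retains the latest record only if the rows are ordered oldest to newest
df = df.drop_duplicates(subset=['DOL Vehicle ID'], keep='last')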

In [None]:
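#1.4 VIN Anonymization
import hashlib

#To anonymize while keeping distinct values distinct, use a SHA-256 digest.
#Python's built-in hash() is randomized per process for strings, so its values change between sessions:

df['VIN_hash'] = df['VIN (1-10)'].astype(str).apply(lambda x: hashlib.sha256(x.encode('utf-8')).hexdigest())
df.drop(columns=['VIN (1-10)'], inplace=True)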
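VIN prefixes come from a fairly small set of possible values, so a plain digest could in principle be reversed by hashing candidate prefixes. One common mitigation is to mix in a private salt. The sketch below assumes a hypothetical `SALT` constant and `anonymize` helper, and the VIN strings are made up.

```python
import hashlib
import pandas as pd

# Hypothetical secret salt; in practice keep it outside the notebook and version control.
SALT = "replace-with-a-private-value"

def anonymize(value: str, salt: str = SALT) -> str:
    """Return the hex SHA-256 digest of salt + value (identical input gives an identical digest)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

# Made-up VIN prefixes for illustration.
sample = pd.DataFrame({"VIN (1-10)": ["5YJ3E1EA7K", "1N4AZ0CP5D", "5YJ3E1EA7K"]})
sample["VIN_hash"] = sample["VIN (1-10)"].map(anonymize)
sample = sample.drop(columns=["VIN (1-10)"])
print(sample)
```

Rows that shared a prefix still share a digest, so group-level counts keep working after the original column is dropped.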

In [None]:
#1.5 Cleaning Vehicle Location

#Split the GPS coordinates into numeric columns. The pattern accepts both
#"POINT (lon lat)" and "(lon, lat)", since the separator depends on the export:

df[['Longitude','Latitude']] = df['Vehicle Location'].str.extract(r'\(\s*([-\d.]+)[,\s]+([-\d.]+)\s*\)')
df[['Longitude','Latitude']] = df[['Longitude','Latitude']].astype(float)


#Optional:

#Reverse-geocode for city/county mapping

#Use for geospatial visualization
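Missing or malformed locations can be flagged after parsing instead of silently becoming NaN. This sketch uses made-up location strings and assumes longitude comes first in each pair. Rows that fail the pattern or fall outside valid coordinate bounds are marked invalid.

```python
import pandas as pd

# Made-up location strings: one well-formed, one missing, one malformed.
sample = pd.DataFrame({
    "Vehicle Location": ["POINT (-122.30839 47.610365)", None, "POINT (bad data)"]
})

# Extract two numeric tokens separated by a space or a comma.
coords = sample["Vehicle Location"].str.extract(r"\(\s*([-\d.]+)[,\s]+([-\d.]+)\s*\)")
coords = coords.apply(pd.to_numeric, errors="coerce")
coords.columns = ["Longitude", "Latitude"]  # assumption: longitude is listed first

# A row is valid only if both values parsed and fall inside geographic bounds.
valid = coords["Longitude"].between(-180, 180) & coords["Latitude"].between(-90, 90)
print(coords.assign(valid=valid))
```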

In [None]:
2) DATA EXPLORATION:
Tesla dominates EV registrations. King County has the highest number of EVs. EV
adoption has increased rapidly after 2019. The average electric range is approximately 220 miles.

In [None]:
#2.1 Top 5 EV Makes and Models
#print() is needed because a cell displays only its last expression
print(df['Make'].value_counts().head(5))
print(df['Model'].value_counts().head(5))


#Common Findings:

# Makes: Tesla, Nissan, Chevrolet, Hyundai, Ford

# Models: Model 3, Model Y, Leaf, Bolt EV, Ioniq 5

In [None]:
# 2.2 EV Distribution by County
df['County'].value_counts()


#-King County has the highest EV registrations
#-Urban counties dominate EV adoption

In [None]:
# 2.3 EV Adoption Over Model Years
df.groupby('Model Year').size()


#Trend:

#-Rapid growth after 2018

#-Peak adoption in 2022–2023

In [None]:
# 2.4 Average Electric Range
df['Electric Range'].mean()


#-Average range ≈ 220–250 miles
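The mean is sensitive to how unknown ranges are stored. If unknown values are recorded as 0 (an assumption worth checking against the dataset's documentation), they pull the average down. The sketch below uses made-up numbers to show the difference.

```python
import numpy as np
import pandas as pd

# Made-up ranges in miles; the zeros stand for "range not recorded" (assumption).
sample = pd.DataFrame({"Electric Range": [215, 0, 38, 0, 291]})

# Mean over every row, zeros included.
mean_with_zeros = sample["Electric Range"].mean()

# Mean over known values only: zeros become NaN, which mean() skips.
mean_known = sample["Electric Range"].replace(0, np.nan).mean()

print(round(mean_with_zeros, 1), round(mean_known, 1))
```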

In [None]:
#2.5 CAFV Eligibility Percentage
(df['Clean Alternative Fuel Vehicle (CAFV) Eligibility']
 .value_counts(normalize=True) * 100)
#-Majority of EVs are CAFV eligible

In [None]:
#2.6 Electric Range by Make & Model
df.groupby(['Make','Model'])['Electric Range'].mean()


#-Tesla models generally have higher ranges
#-Older PHEVs show lower electric range

In [None]:
#2.7 Average Base MSRP per Model
df.groupby('Model')['Base MSRP'].mean().sort_values(ascending=False)


#-Luxury models → Higher MSRP
#-Strong price–performance relationship

In [None]:
#2.8 Regional Trends (Urban vs Rural)

#-Urban areas → Higher EV density

#-Rural areas → Lower adoption due to:

#   -Limited charging infrastructure

#   -Driving distance concerns
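The urban versus rural comparison can be quantified once each county is labeled. The dataset has no such label, so the `URBAN_COUNTIES` set below is a hypothetical classification chosen for illustration, and the rows are made up.

```python
import pandas as pd

# Made-up registrations for illustration.
sample = pd.DataFrame({"County": ["King", "King", "King", "Snohomish", "Ferry", "Pierce"]})

# Hypothetical classification; a real analysis would use an official urban/rural definition.
URBAN_COUNTIES = {"King", "Snohomish", "Pierce"}

sample["Region"] = sample["County"].isin(URBAN_COUNTIES).map({True: "Urban", False: "Rural"})

# Share of registrations per region, in percent.
region_share = sample["Region"].value_counts(normalize=True).mul(100).round(1)
print(region_share)
```

These are raw registration shares. Comparing adoption fairly between regions would also require dividing by population or by total registered vehicles.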

In [None]:
3) DATA VISUALIZATION:
Bar charts, line plots, scatter plots, pie charts, and geospatial maps were used to visualize trends and regional adoption.

In [None]:
#3.1 Bar Chart – Top 5 Makes & Models
df['Make'].value_counts().head(5).plot(kind='bar')

In [None]:
#3.2 Heatmap / Choropleth – EVs by County

#-Aggregate EV count by county

#-Use geopandas / folium

#-Highlights regional concentration
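The aggregation step needs only pandas. The resulting table is what a geopandas or folium choropleth would then be joined against. The rows below are made up, and `EV Count` is an illustrative column name.

```python
import pandas as pd

# Made-up registrations for illustration.
sample = pd.DataFrame({
    "County": ["King", "Snohomish", "King", "Pierce", "King", "Snohomish"]
})

# One row per county with its number of registered EVs, largest first.
county_counts = (
    sample.groupby("County")
    .size()
    .reset_index(name="EV Count")
    .sort_values("EV Count", ascending=False)
    .reset_index(drop=True)
)
print(county_counts)
```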

In [None]:
#3.3 Line Graph – EV Adoption Trend
df.groupby('Model Year').size().plot(marker='o')
#Shows rapid growth in registrations across recent model years

In [None]:
#3.4 Scatter Plot – Electric Range vs MSRP
import matplotlib.pyplot as plt
plt.scatter(df['Base MSRP'], df['Electric Range'])
plt.xlabel('Base MSRP (USD)')
plt.ylabel('Electric Range (miles)')
# Higher MSRP → Generally higher range

In [None]:
#3.5 Pie Chart – CAFV Eligibility
df['Clean Alternative Fuel Vehicle (CAFV) Eligibility'].value_counts().plot.pie()

In [None]:
#3.6 Geospatial Map – Vehicle Locations

#-Plot using latitude & longitude

#-Use folium or plotly

#-Reveals clustering around cities

In [None]:
4) LINEAR REGRESSION MODEL (Optional)
A Linear Regression model was built to predict Electric Range using Model Year, Base MSRP, Make, and Model.
The model achieved an R² score of approximately 0.68.

In [None]:
4.1 Predicting Electric Range Using Linear Regression

Linear regression models the relationship:
$$\text{Electric Range} = \beta_0 + \beta_1(\text{Model Year}) + \beta_2(\text{Base MSRP}) + \dots$$

In [None]:
4.2 Independent Variables

Possible predictors:

Model Year

Base MSRP

Make

Model

EV Type (BEV/PHEV)
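To make the role of the coefficients concrete, the sketch below fits an intercept and two slopes by ordinary least squares with NumPy. The data are synthetic and noise-free, generated from known coefficients so the fit can be checked. None of the numbers come from the DOL dataset.

```python
import numpy as np

# Synthetic, noise-free data generated from known coefficients (illustrative only).
years_since_2018 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
msrp_thousands = np.array([30.0, 45.0, 35.0, 60.0, 50.0])
true_beta = np.array([5.0, 2.0, 0.5])  # intercept, year slope, MSRP slope

# Design matrix with a leading column of ones for the intercept (beta_0).
X = np.column_stack([np.ones_like(years_since_2018), years_since_2018, msrp_thousands])
y = X @ true_beta

# Ordinary least squares: the beta that minimizes the squared prediction error.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 3))
```

With real data the fit will not be exact, and the gap between predictions and observations is what the R² score summarizes.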

In [None]:
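#4.3 Handling Categorical Variables

#Use One-Hot Encoding:

#Keep only rows with a known range and price, because LinearRegression cannot fit NaN values
model_df = df.dropna(subset=['Electric Range','Base MSRP'])
X = pd.get_dummies(model_df[['Model Year','Base MSRP','Make','Model']], drop_first=True)
y = model_df['Electric Range']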


In [None]:
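#4.4 R² Score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

#Hold out 20% of the rows for testing (split size and seed are illustrative choices)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
r2_score(y_test, model.predict(X_test))


#-R² ≈ 0.65–0.80
#-The model explains roughly 65–80% of the variance in Electric Range on the test set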
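For reference, R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below computes it by hand on made-up values.

```python
import numpy as np

# Made-up observed and predicted ranges in miles.
y_true = np.array([200.0, 250.0, 300.0, 150.0])
y_pred = np.array([210.0, 240.0, 290.0, 170.0])

# Residual sum of squares: error left over after the model's predictions.
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: error of always predicting the mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2 = 1 - ss_res / ss_tot
print(round(r2, 3))
```

An R² of 1 means perfect predictions. A value near 0 means the model is no better than predicting the mean range for every vehicle.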

In [None]:
4.5 Base MSRP Influence

Positive coefficient → A higher price is associated with a higher predicted range

This likely reflects larger or more advanced battery packs in pricier models
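The sign and size of each coefficient can be read from a fitted model by pairing `coef_` with the feature names. This sketch uses synthetic data built from known slopes, so the values are illustrative and are not results from the DOL dataset.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features; the target is built from known slopes so the fit can be checked.
X = pd.DataFrame({
    "Model Year": [2018, 2019, 2020, 2021, 2022, 2023],
    "Base MSRP": [30000, 52000, 41000, 60000, 45000, 70000],
})
y = 8 * (X["Model Year"] - 2018) + 0.002 * X["Base MSRP"] + 60

model = LinearRegression().fit(X, y)

# One coefficient per feature: the change in predicted range per unit of that feature.
coefficients = pd.Series(model.coef_, index=X.columns)
print(coefficients)
```

Here the Base MSRP coefficient of 0.002 means each extra 1,000 USD adds 2 miles of predicted range, holding Model Year fixed. A coefficient describes association in the data, not a causal effect of price.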

In [None]:
4.6 Improving Model Accuracy

Remove outliers

Feature scaling

Add battery capacity

Try Polynomial Regression

Use Random Forest / XGBoost
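As one example from the list above, outliers can be trimmed with the interquartile range (IQR) rule. The 1.5 multiplier is a common convention, not a requirement, and the prices below are made up.

```python
import pandas as pd

# Made-up prices in USD with one extreme value.
sample = pd.DataFrame({"Base MSRP": [31000, 33000, 35000, 36000, 38000, 40000, 845000]})

# Quartiles and the interquartile range.
q1, q3 = sample["Base MSRP"].quantile([0.25, 0.75])
iqr = q3 - q1

# Keep rows within 1.5 * IQR of the quartiles.
within_bounds = sample["Base MSRP"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
trimmed = sample[within_bounds]
print(len(sample), len(trimmed))
```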

In [None]:
4.7 Predicting New EV Models

Yes, the model can estimate the range of a new EV model, provided that:

Model specs are known

Categories exist in training data

Model is retrained regularly
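The condition about categories matters because one-hot encoding a new row on its own yields different columns from those the model was trained on. A minimal sketch of aligning the columns is shown below, on made-up rows. A make that was never seen in training ends up with all indicator columns set to 0, so the model has no make-specific information for it.

```python
import pandas as pd

# Made-up training rows and one new vehicle whose make is absent from training.
train = pd.DataFrame({"Model Year": [2020, 2021, 2022], "Make": ["TESLA", "NISSAN", "FORD"]})
new = pd.DataFrame({"Model Year": [2025], "Make": ["RIVIAN"]})

X_train = pd.get_dummies(train, columns=["Make"])

# Encode the new row, then force it onto the training columns:
# unknown indicator columns are dropped and missing ones are filled with 0.
X_new = pd.get_dummies(new, columns=["Make"]).reindex(columns=X_train.columns, fill_value=0)

# Makes the model has never seen; predictions for these deserve extra caution.
unseen = set(new["Make"]) - set(train["Make"])
print(X_new)
print(unseen)
```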

In [None]:
CONCLUSION:
EV adoption is growing rapidly in Washington.

Urban regions dominate EV usage.

Tesla leads in adoption and range.

A higher MSRP is generally associated with a longer electric range.

Linear Regression gives reasonable predictions (R² ≈ 0.68) but can be improved with more advanced models.