# **Nikhil Kuchana**

# **EV Adoption Forecasting**

As electric vehicle (EV) adoption surges, urban planners need to anticipate infrastructure needs—especially charging stations. Inadequate planning can lead to bottlenecks, impacting user satisfaction and hindering sustainability goals.

**Problem Statement:** Using the electric vehicle dataset (which includes information on EV populations, vehicle types, and possibly historical charging usage), create a model to forecast future EV adoption. For example, predict the number of electric vehicles in upcoming years based on the trends in the data.

**Goal:** Build a regression model that forecasts future EV adoption demand based on historical trends in EV growth, types of vehicles, and regional data.

**Dataset:** This dataset shows the number of vehicles registered by the Washington State Department of Licensing (DOL) each month. The data is separated by county for passenger vehicles and trucks.

- Date: Counts of registered vehicles are taken on this day (the end of the month). Range: 2017-01-31 to 2024-02-29.
- County: This is the geographic region of a state that a vehicle's owner is listed to reside within. Vehicles registered in Washington
- State: This is the geographic region of the country associated with the record. These addresses may be located in other
- Vehicle Primary Use: This describes the primary intended use of the vehicle (Passenger: 83%, Truck: 17%).
- Battery Electric Vehicles (BEVs): The count of vehicles that are known to be propelled solely by energy derived from an onboard electric battery.
- Plug-In Hybrid Electric Vehicles (PHEVs): The count of vehicles that are known to be propelled by energy partially sourced from an onboard electric battery.
- Electric Vehicle (EV) Total: The sum of Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs).
- Non-Electric Vehicle Total: The count of vehicles that are not electric vehicles.
- Total Vehicles: All powered vehicles registered in the county. This includes electric vehicles.
- Percent Electric Vehicles: Electric vehicles as a percentage of all vehicles (EV Total / Total Vehicles × 100); for example, 7 EVs out of 467 vehicles gives 1.5.
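
The column definitions above imply two arithmetic relationships that can be checked directly. The sketch below uses three rows copied from the `df.head()` preview in this notebook; the short column names are illustrative, and the percent formula (EV Total / Total Vehicles × 100) is inferred from those sample rows rather than stated by the dataset's publisher.

```python
import pandas as pd

# Three rows copied from the df.head() preview; short column names are illustrative.
sample = pd.DataFrame({
    "bev": [7, 1, 0],
    "phev": [0, 2, 1],
    "ev_total": [7, 3, 1],
    "total_vehicles": [467, 191, 33],
    "percent_ev": [1.50, 1.57, 3.03],
})

# Check 1: EV Total should equal BEVs + PHEVs.
sum_matches = (sample["bev"] + sample["phev"] == sample["ev_total"]).all()

# Check 2 (assumed formula): Percent EV = EV Total / Total Vehicles * 100,
# compared against the published value within two-decimal rounding tolerance.
recomputed = sample["ev_total"] / sample["total_vehicles"] * 100
percent_matches = ((recomputed - sample["percent_ev"]).abs() < 0.005).all()

print("EV Total = BEVs + PHEVs:", sum_matches)
print("Percent EV matches EV/Total*100:", percent_matches)
```

If either check fails on the full dataset, the affected rows are worth inspecting before modeling.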

**Dataset Link:** https://www.kaggle.com/datasets/sahirmaharajj/electric-vehicle-population-size-2024/data

## **Week-01**

### **Import Required Libraries**

If you're running this in Google Colab, prefix shell commands with `!`. The command below installs the required Python libraries in your Colab environment.

!pip install pandas numpy matplotlib seaborn scikit-learn


In [5]:
# joblib is used for saving and loading trained models (model persistence)
import joblib

# numpy is a library for numerical operations (arrays, math functions)
import numpy as np

# pandas is used for data manipulation and analysis (loading, cleaning, transforming tabular data)
import pandas as pd

# seaborn is a visualization library built on top of matplotlib, good for attractive statistical plots
import seaborn as sns

# matplotlib.pyplot is the core plotting library in Python for creating charts and graphs
import matplotlib.pyplot as plt

# LabelEncoder is used to convert categorical text data into numeric labels (required for some models)
from sklearn.preprocessing import LabelEncoder

# RandomForestRegressor is the machine learning model we'll use for regression (forecasting EV adoption)
from sklearn.ensemble import RandomForestRegressor

# train_test_split is used to split your dataset into training and testing sets
from sklearn.model_selection import train_test_split

# RandomizedSearchCV is used for hyperparameter tuning (finding the best model parameters automatically)
from sklearn.model_selection import RandomizedSearchCV

# These metrics are used to evaluate your model’s performance
# MAE: Mean Absolute Error | MSE: Mean Squared Error | R2: R-squared Score
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

**The cell above imports all key tools** for:

**Data handling:** pandas, numpy   

**Visualization:** seaborn, matplotlib.pyplot    

**Machine Learning:** RandomForestRegressor, LabelEncoder, train_test_split, RandomizedSearchCV

**Evaluation:** mean_absolute_error, mean_squared_error, r2_score

**Model saving:** joblib

### **Loading the Dataset**

In [8]:
# Load data
# Load the dataset from a CSV file using pandas.
# `pd.read_csv` reads the CSV file and creates a DataFrame (table) named `df`.
# Make sure the file path is correct. In Google Colab, you usually upload the file to `/content/`.

df = pd.read_csv("/content/Electric_Vehicle_Population_By_County.csv")

### **Explore and Understand the Data**

In [10]:
# Display the first 5 rows of the DataFrame.

# This helps you quickly check what your data looks like:
# - See the columns (features) available
# - Inspect a few sample values
# - Make sure the file loaded correctly

df.head()  # Shows the top 5 rows by default

Unnamed: 0,Date,County,State,Vehicle Primary Use,Battery Electric Vehicles (BEVs),Plug-In Hybrid Electric Vehicles (PHEVs),Electric Vehicle (EV) Total,Non-Electric Vehicle Total,Total Vehicles,Percent Electric Vehicles
0,September 30 2022,Riverside,CA,Passenger,7,0,7,460,467,1.5
1,December 31 2022,Prince William,VA,Passenger,1,2,3,188,191,1.57
2,January 31 2020,Dakota,MN,Passenger,0,1,1,32,33,3.03
3,June 30 2022,Ferry,WA,Truck,0,0,0,3575,3575,0.0
4,July 31 2021,Douglas,CO,Passenger,0,1,1,83,84,1.19


In [11]:
df.shape  # Returns (number of rows, number of columns)


(20819, 10)

In [12]:
# Data Types, class and memory alloc

# Check basic information about the DataFrame.
# `df.info()` shows:
# - Each column name
# - How many non-null (non-missing) values each column has
# - The data type of each column (object, int64, float64, etc.)
# - The total memory usage of the DataFrame

# This helps you:
# ✅ Find missing values
# ✅ See if columns are the correct type (e.g., dates, numbers)
# ✅ Understand the dataset’s size

df.info()  # Shows columns, non-null counts, and data types

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20819 entries, 0 to 20818
Data columns (total 10 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   Date                                      20819 non-null  object 
 1   County                                    20733 non-null  object 
 2   State                                     20733 non-null  object 
 3   Vehicle Primary Use                       20819 non-null  object 
 4   Battery Electric Vehicles (BEVs)          20819 non-null  object 
 5   Plug-In Hybrid Electric Vehicles (PHEVs)  20819 non-null  object 
 6   Electric Vehicle (EV) Total               20819 non-null  object 
 7   Non-Electric Vehicle Total                20819 non-null  object 
 8   Total Vehicles                            20819 non-null  object 
 9   Percent Electric Vehicles                 20819 non-null  float64
dtypes: float64(1), object(9)
memory us
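
The `df.info()` output shows the five count columns stored as `object` (text) rather than numbers, so they need converting before any arithmetic or modeling. The sketch below is one possible approach on a toy frame; it assumes the text values may contain thousands separators such as `"1,204"`, which this notebook does not confirm, and the toy values are made up.

```python
import pandas as pd

# Toy frame standing in for df; the comma-formatted strings are an assumption.
toy = pd.DataFrame({
    "Electric Vehicle (EV) Total": ["7", "1,204", "n/a"],
    "Total Vehicles": ["467", "35,750", "12"],
})

count_cols = ["Electric Vehicle (EV) Total", "Total Vehicles"]
for col in count_cols:
    # Strip thousands separators, then convert; unparseable text becomes NaN.
    cleaned = toy[col].astype(str).str.replace(",", "", regex=False)
    toy[col] = pd.to_numeric(cleaned, errors="coerce")

print(toy.dtypes)
print(toy)
```

With `errors="coerce"`, values that cannot be parsed become `NaN` instead of raising an error, so a follow-up `isnull().sum()` shows how many rows were affected.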

In [17]:
# Get summary statistics for the numeric columns in your DataFrame.
# `df.describe()` shows:
# - count: Number of non-null (non-missing) values
# - mean: Average value
# - std: Standard deviation (spread of the data)
# - min: Minimum value
# - 25%: 1st quartile (25% of data below this value)
# - 50%: Median (middle value)
# - 75%: 3rd quartile (75% of data below this value)
# - max: Maximum value

# This helps you understand:
# ✅ The range and spread of your numeric data
# ✅ Possible outliers
# ✅ General trends (e.g., typical EV counts)

df.describe()  # For numeric columns: count, mean, std, min, max, percentiles

Unnamed: 0,Percent Electric Vehicles
count,20819.0
mean,4.139216
std,11.05535
min,0.0
25%,0.39
50%,1.22
75%,2.995
max,100.0


In [15]:
df.isnull().sum()  # Shows missing value count for each column


Unnamed: 0,0
Date,0
County,86
State,86
Vehicle Primary Use,0
Battery Electric Vehicles (BEVs),0
Plug-In Hybrid Electric Vehicles (PHEVs),0
Electric Vehicle (EV) Total,0
Non-Electric Vehicle Total,0
Total Vehicles,0
Percent Electric Vehicles,0
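
Raw missing counts are easier to judge as a share of all rows. The sketch below shows the calculation on a toy frame (illustrative values only) and then repeats it with the figures reported in this notebook: 86 missing values out of 20,819 rows.

```python
import pandas as pd

# Toy frame standing in for df; values are illustrative only.
toy = pd.DataFrame({
    "County": ["King", None, "Ferry", "Yakima"],
    "State": ["WA", None, "WA", "WA"],
})

# isnull().mean() gives the fraction of missing values per column.
missing_pct_toy = toy.isnull().mean() * 100
print(missing_pct_toy)

# Same calculation with the counts reported above.
missing_pct_county = 86 / 20819 * 100
print(f"County missing: {missing_pct_county:.2f}%")
```

At roughly 0.41% of rows, the missing `County` and `State` values are a small share of the data.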


In [16]:
df['County'].unique()  # Example: see all unique counties



array(['Riverside', 'Prince William', 'Dakota', 'Ferry', 'Douglas',
       'Maui', 'Northampton', 'Nassau', 'DeKalb', 'Columbia', 'Orleans',
       'Ramsey', 'Manassas', 'Montgomery', 'Albemarle', 'Monroe',
       'San Diego', 'Skamania', 'Washington', 'Sarpy', 'Owyhee',
       'Clinton', 'Yakima', 'Virginia Beach', 'Sedgwick', 'Kittitas',
       'Asotin', 'San Francisco', 'Charles', 'Richmond', 'Carson City',
       'Santa Clara', 'Harris', 'King', 'Suffolk', 'Clallam', 'Clay',
       'El Paso', 'Harford', 'Franklin', 'Burlington', 'Kings',
       'Washtenaw', 'Whatcom', 'Whitman', 'Stevens', 'Benton', 'Kern',
       'Grant', 'Nueces', 'Jackson', 'Polk', 'Powhatan', 'Anne Arundel',
       'Pacific', 'San Mateo', 'Ventura', 'Klamath', 'Hamilton', 'Meade',
       'Placer', 'Larimer', 'Fairbanks North Star', 'Clark', 'Bexar',
       'Manatee', 'Williamson', 'Flathead', 'Lumpkin', 'Mason',
       'Providence', 'Hardin', 'Charleston', 'Santa Cruz', 'Hawaii',
       'Kootenai', 'Sumter', 'D

### Check/Detect **Outliers in ‘Percent Electric Vehicles’ column** Using the IQR Method

In [20]:
# Compute Q1 and Q3
# Compute the 1st quartile (Q1) and 3rd quartile (Q3) for the 'Percent Electric Vehicles' column.
# Q1: 25% of the data falls below this value.
# Q3: 75% of the data falls below this value.
Q1 = df['Percent Electric Vehicles'].quantile(0.25)
Q3 = df['Percent Electric Vehicles'].quantile(0.75)

# ------------>Interquartile Range (IQR)<------------
# Calculate the Interquartile Range (IQR).
# IQR measures the spread of the middle 50% of your data.
IQR = Q3 - Q1


# Define outlier boundaries
# Define the lower and upper bounds for detecting outliers.
# Any data point below the lower bound or above the upper bound is considered an outlier.
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Print the calculated bounds to see the threshold values.
print('Lower Bound:', lower_bound)
print('Upper Bound:', upper_bound)


# Identify outliers
# Identify rows where 'Percent Electric Vehicles' is outside the normal range.
# This gives you a subset DataFrame containing the outlier rows.
outliers = df[(df['Percent Electric Vehicles'] < lower_bound) |
              (df['Percent Electric Vehicles'] > upper_bound)]

# Print how many outliers were found in the 'Percent Electric Vehicles' column.
print("Number of outliers in 'Percent Electric Vehicles':", outliers.shape[0])


Lower Bound: -3.5174999999999996
Upper Bound: 6.9025
Number of outliers in 'Percent Electric Vehicles': 2476
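
The printed bounds can be reproduced by hand from the quartiles in the `df.describe()` output (25% = 0.39, 75% = 2.995). The sketch below is only a worked check of that arithmetic.

```python
import math

# Quartiles taken from the df.describe() output above.
q1, q3 = 0.39, 2.995

iqr = q3 - q1                      # 2.605
lower_bound = q1 - 1.5 * iqr       # 0.39 - 3.9075
upper_bound = q3 + 1.5 * iqr       # 2.995 + 3.9075

print("IQR:", round(iqr, 4))
print("Lower bound:", round(lower_bound, 4))
print("Upper bound:", round(upper_bound, 4))

# Share of rows flagged as outliers, using the counts printed above.
outlier_share = 2476 / 20819 * 100
print(f"Outlier share: {outlier_share:.1f}%")
```

Because the column's minimum is 0.0, no value can fall below the negative lower bound, so all 2,476 flagged rows lie above the upper bound of 6.9025.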


In [22]:
outliers.head()  # Look at the outlier rows

Unnamed: 0,Date,County,State,Vehicle Primary Use,Battery Electric Vehicles (BEVs),Plug-In Hybrid Electric Vehicles (PHEVs),Electric Vehicle (EV) Total,Non-Electric Vehicle Total,Total Vehicles,Percent Electric Vehicles
8,March 31 2020,DeKalb,IN,Passenger,1,0,1,1,2,50.0
13,November 30 2020,Manassas,VA,Passenger,0,1,1,5,6,16.67
16,May 31 2020,Monroe,IL,Passenger,1,0,1,3,4,25.0
22,February 29 2020,Owyhee,ID,Passenger,1,0,1,3,4,25.0
23,June 30 2022,Clinton,PA,Passenger,1,0,1,1,2,50.0
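
The outlier rows previewed above all have very small vehicle totals (2 to 6 vehicles), so a single EV produces a large percentage. The sketch below recomputes the percentage for three of those rows plus one ordinary row from `df.head()`; the `MIN_TOTAL` threshold and the short column names are hypothetical and only illustrate how such rows could be flagged.

```python
import pandas as pd

# Rows copied from the previews above; short column names are illustrative.
rows = pd.DataFrame({
    "County": ["DeKalb", "Manassas", "Monroe", "Riverside"],
    "ev_total": [1, 1, 1, 7],
    "total_vehicles": [2, 6, 4, 467],
})

# One EV among a handful of vehicles yields a very high percentage.
rows["percent_ev"] = (rows["ev_total"] / rows["total_vehicles"] * 100).round(2)

# Hypothetical threshold: treat rows with fewer than 50 vehicles as small samples.
MIN_TOTAL = 50
rows["small_sample"] = rows["total_vehicles"] < MIN_TOTAL

print(rows)
```

This suggests the high percentages reflect small denominators rather than unusually high EV adoption.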


### **Data Preprocessing**

***Basic Data Cleaning***

In [23]:
# 1. Convert the "Date" column to datetime objects
# 2. Remove rows where the "Date" conversion failed
# 3. Remove rows where the target (EV Total) is missing
# 4. Fill missing values in "County" and "State"
# 5. Confirm that no nulls remain

In [24]:
# ----------------------------------------
# ---------- Basic Data Cleaning ---------
# ----------------------------------------

# 1️⃣ Convert the "Date" column to actual datetime objects.
# If any value can't be converted, it becomes NaT (Not a Time).
# `errors='coerce'` makes invalid dates into NaT instead of throwing an error.
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

# 2️⃣ Remove rows where the "Date" is invalid (NaT).
# `notnull()` keeps only rows with valid dates.
df = df[df['Date'].notnull()]

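# 3️⃣ Remove rows where the target column 'Electric Vehicle (EV) Total' is missing.
# Since this is your target for prediction, you can't train on rows with no value.
# `.copy()` makes `df` an independent DataFrame, so the column assignments below
# do not trigger pandas' SettingWithCopyWarning.
df = df[df['Electric Vehicle (EV) Total'].notnull()].copy()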

# 4️⃣ Fill any missing values in the 'County' column with 'Unknown'.
df['County'] = df['County'].fillna('Unknown')

# 5️⃣ Fill any missing values in the 'State' column with 'Unknown'.
df['State'] = df['State'].fillna('Unknown')

# 6️⃣ Confirm there are no remaining nulls in 'County' and 'State'.
print("Missing values after filling:")
print(df[['County', 'State']].isnull().sum())

# 7️⃣ Preview the cleaned DataFrame to double-check your changes.
df.head()


Missing values after filling:
County    0
State     0
dtype: int64


Unnamed: 0,Date,County,State,Vehicle Primary Use,Battery Electric Vehicles (BEVs),Plug-In Hybrid Electric Vehicles (PHEVs),Electric Vehicle (EV) Total,Non-Electric Vehicle Total,Total Vehicles,Percent Electric Vehicles
0,2022-09-30,Riverside,CA,Passenger,7,0,7,460,467,1.5
1,2022-12-31,Prince William,VA,Passenger,1,2,3,188,191,1.57
2,2020-01-31,Dakota,MN,Passenger,0,1,1,32,33,3.03
3,2022-06-30,Ferry,WA,Truck,0,0,0,3575,3575,0.0
4,2021-07-31,Douglas,CO,Passenger,0,1,1,83,84,1.19
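
To see what `errors='coerce'` does in the date-conversion step, the sketch below parses three strings: two in the "Month day year" style shown in `df.head()` and one deliberately invalid value. The explicit `format` argument is an addition for predictability; the notebook's own call lets pandas infer the format.

```python
import pandas as pd

# Two valid dates in the dataset's style and one invalid string (made up).
raw = pd.Series(["September 30 2022", "February 29 2020", "not a date"])

# Invalid entries become NaT instead of raising an error.
parsed = pd.to_datetime(raw, format="%B %d %Y", errors="coerce")
print(parsed)

# Keep only rows where parsing succeeded, mirroring the notnull() filter.
valid = parsed[parsed.notnull()]
print(len(valid), "of", len(raw), "rows kept")
```

Comparing the row count before and after the filter is a quick way to confirm how many dates failed to parse.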


***Once again check all remaining missing values***

In [26]:
print("Any other missing values?")
print()
print(df.isnull().sum())

Any other missing values?

Date                                        0
County                                      0
State                                       0
Vehicle Primary Use                         0
Battery Electric Vehicles (BEVs)            0
Plug-In Hybrid Electric Vehicles (PHEVs)    0
Electric Vehicle (EV) Total                 0
Non-Electric Vehicle Total                  0
Total Vehicles                              0
Percent Electric Vehicles                   0
dtype: int64


**Handle Outliers: Cap the Values to the IQR Bounds**

In [27]:
# 1. Cap the outliers: this keeps all the data while reducing the skew from extreme values.
# 2. Verify that no outliers remain after capping.

In [28]:
# ---------------------------------------
# Handle Outliers: Cap to IQR Bounds
# ---------------------------------------

# This approach doesn't drop any rows — instead, it limits (caps) extreme values to the IQR limits.
# Why? To reduce the effect of extreme outliers while keeping all your data for training.

# 1️⃣ Use np.where() to apply conditions:
# - If 'Percent Electric Vehicles' is greater than the upper bound, set it to the upper bound.
# - If it's less than the lower bound, set it to the lower bound.
# - Otherwise, keep the original value.
pct = df['Percent Electric Vehicles']
df['Percent Electric Vehicles'] = np.where(
    pct > upper_bound, upper_bound,
    np.where(pct < lower_bound, lower_bound, pct)
)


# 2️⃣ Double-check: Identify if any outliers remain outside the capped range.
# After capping, there should be zero outliers.
outliers = df[(df['Percent Electric Vehicles'] < lower_bound) | (df['Percent Electric Vehicles'] > upper_bound)]
print("Number of outliers in 'Percent Electric Vehicles':", outliers.shape[0])

Number of outliers in 'Percent Electric Vehicles': 0
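
The nested `np.where` capping can also be written with pandas' `Series.clip`, which limits values to a lower and upper bound in one call. The sketch below compares the two on a small made-up series, using the bounds printed earlier in this notebook.

```python
import numpy as np
import pandas as pd

# Bounds from the IQR step; the sample percentages are illustrative.
lower_bound, upper_bound = -3.5175, 6.9025
pct = pd.Series([0.0, 1.5, 6.9025, 16.67, 50.0])

# Nested np.where, as used in the cell above.
capped_where = np.where(pct > upper_bound, upper_bound,
                        np.where(pct < lower_bound, lower_bound, pct))

# Equivalent one-liner with Series.clip.
capped_clip = pct.clip(lower=lower_bound, upper=upper_bound)

print(capped_clip.tolist())
print("Same result:", np.allclose(capped_where, capped_clip.to_numpy()))
```

Both approaches keep every row and only change values outside the bounds; `clip` is shorter and returns a Series rather than a NumPy array.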
