<a href="https://colab.research.google.com/github/RajdeepKushwaha5/EV-Adoption-Forecasting/blob/main/EV_Adotion_Forecasting_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**EV Adoption Forecasting**

**Problem Statement:** Using the electric vehicle dataset (which includes information on EV populations, vehicle types, and possibly historical charging usage), create a model to forecast future EV adoption. For example, predict the number of electric vehicles in upcoming years based on the trends in the data.

**Goal:** Build a regression model that forecasts future EV adoption demand based on historical trends in EV growth, types of vehicles, and regional data.

In [None]:
!pip install pandas numpy matplotlib seaborn scikit-learn # global env



In [None]:
import joblib
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

**Load Dataset**

In [23]:
# Load data
df = pd.read_csv("Electric_Vehicle_Population_By_Country.csv")

**Explore and Understand the Data**

In [24]:
df.head()

| | Date | County | State | Vehicle Primary Use | Battery Electric Vehicles (BEVs) | Plug-In Hybrid Electric Vehicles (PHEVs) | Electric Vehicle (EV) Total | Non-Electric Vehicle Total | Total Vehicles | Percent Electric Vehicles |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | September 30 2022 | Riverside | CA | Passenger | 7 | 0 | 7 | 460 | 467 | 1.5 |
| 1 | December 31 2022 | Prince William | VA | Passenger | 1 | 2 | 3 | 188 | 191 | 1.57 |
| 2 | January 31 2020 | Dakota | MN | Passenger | 0 | 1 | 1 | 32 | 33 | 3.03 |
| 3 | June 30 2022 | Ferry | WA | Truck | 0 | 0 | 0 | 3575 | 3575 | 0.0 |
| 4 | July 31 2021 | Douglas | CO | Passenger | 0 | 1 | 1 | 83 | 84 | 1.19 |


In [25]:
print(f"Total datapoints: {df.shape[0]}")
print(f"Total features: {df.shape[1]}")

Total datapoints: 20819
Total features: 10


In [26]:
# Data Types, class and memory alloc
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20819 entries, 0 to 20818
Data columns (total 10 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   Date                                      20819 non-null  object 
 1   County                                    20733 non-null  object 
 2   State                                     20733 non-null  object 
 3   Vehicle Primary Use                       20819 non-null  object 
 4   Battery Electric Vehicles (BEVs)          20819 non-null  object 
 5   Plug-In Hybrid Electric Vehicles (PHEVs)  20819 non-null  object 
 6   Electric Vehicle (EV) Total               20819 non-null  object 
 7   Non-Electric Vehicle Total                20819 non-null  object 
 8   Total Vehicles                            20819 non-null  object 
 9   Percent Electric Vehicles                 20819 non-null  float64
dtypes: float64(1), object(9)
memory us
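
The `df.info()` output above reports the vehicle-count columns as `object` rather than a numeric dtype, which usually means pandas read at least some values as strings. Below is a minimal sketch of one way to convert them. It assumes the strings may contain thousands separators such as `"1,204"` (an assumption, since the raw values are not shown here); the helper name `to_numeric_counts` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def to_numeric_counts(frame, columns):
    """Return a copy of `frame` with `columns` converted from strings to numbers.

    Assumes values may contain thousands separators (e.g. "1,204");
    anything that still cannot be parsed becomes NaN.
    """
    out = frame.copy()
    for col in columns:
        # Drop commas first, then let pandas parse the remaining text.
        cleaned = out[col].astype(str).str.replace(",", "", regex=False)
        out[col] = pd.to_numeric(cleaned, errors="coerce")
    return out

# Toy data standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    "Electric Vehicle (EV) Total": ["7", "1,204", "n/a"],
    "Total Vehicles": ["467", "3,575", "84"],
})
converted = to_numeric_counts(toy, ["Electric Vehicle (EV) Total", "Total Vehicles"])
print(converted.dtypes)
```

Because `errors="coerce"` turns unparseable entries into `NaN`, it is worth re-running `isnull().sum()` after a conversion like this to see how many values were lost.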

In [27]:
# Missing Values Calculation
print(df.isnull().sum())

Date                                         0
County                                      86
State                                       86
Vehicle Primary Use                          0
Battery Electric Vehicles (BEVs)             0
Plug-In Hybrid Electric Vehicles (PHEVs)     0
Electric Vehicle (EV) Total                  0
Non-Electric Vehicle Total                   0
Total Vehicles                               0
Percent Electric Vehicles                    0
dtype: int64


**Checking whether 'Percent Electric Vehicles' contains outliers**

In [28]:
# Compute Q1 and Q3
Q1 = df['Percent Electric Vehicles'].quantile(0.25)
Q3 = df['Percent Electric Vehicles'].quantile(0.75)
IQR = Q3 - Q1

# Define outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print('lower_bound:', lower_bound)
print('upper_bound:', upper_bound)

# Identify outliers
outliers = df[(df['Percent Electric Vehicles'] < lower_bound) | (df['Percent Electric Vehicles'] > upper_bound)]
print("Number of outliers in 'Percent Electric Vehicles':", outliers.shape[0])

lower_bound: -3.5174999999999996
upper_bound: 6.9025
Number of outliers in 'Percent Electric Vehicles': 2476
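
The IQR rule used above can be wrapped in a small helper so that the same check is easy to repeat on other numeric columns. This is only a sketch: the function names `iqr_bounds` and `count_outliers` and the toy series are hypothetical, and the multiplier `k=1.5` simply mirrors the value used in the cell above.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) outlier bounds using the IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def count_outliers(series, k=1.5):
    """Count values that fall strictly outside the IQR bounds."""
    lower, upper = iqr_bounds(series, k)
    return int(((series < lower) | (series > upper)).sum())

# Toy series with one obvious extreme value (hypothetical data).
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
lower, upper = iqr_bounds(toy)
print("lower_bound:", lower)
print("upper_bound:", upper)
print("outliers:", count_outliers(toy))
```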


**Data Preprocessing**

Basic Data Cleaning

In [30]:
# Convert the "Date" column to datetime objects; unparseable values become NaT
# Guard with a column check for robustness in case the column doesn't exist
if 'Date' in df.columns:
    df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
    # Removes rows where "Date" conversion failed
    df = df[df['Date'].notnull()]
    print(f"Remaining rows after handling invalid dates: {df.shape[0]}")
else:
    print("'Date' column not found. Skipping date conversion and filtering.")

# Removes rows where the target (EV Total) is missing
target_column = 'Electric Vehicle (EV) Total'
if target_column in df.columns:
    initial_rows = df.shape[0]
    df = df[df[target_column].notnull()]
    rows_removed = initial_rows - df.shape[0]
    print(f"Removed {rows_removed} rows with missing target variable ('{target_column}').")
    print(f"Remaining rows: {df.shape[0]}")
else:
    print(f"Target column '{target_column}' not found. Cannot remove rows with missing target.")


# Fill missing values for categorical features
categorical_cols_to_fill = ['County', 'State']
for col in categorical_cols_to_fill:
    if col in df.columns:
        initial_nulls = df[col].isnull().sum()
        df[col] = df[col].fillna('Unknown')
        filled_nulls = initial_nulls - df[col].isnull().sum()
        if filled_nulls > 0:
            print(f"Filled {filled_nulls} missing values in '{col}' with 'Unknown'.")
    else:
        print(f"Column '{col}' not found. Skipping fillna for this column.")

# Confirm remaining nulls for the filled columns specifically
print("\nMissing values after filling specified columns:")
print(df[categorical_cols_to_fill].isnull().sum())

# It's good practice to check for missing values across *all* columns after initial cleaning
print("\nMissing values across all columns after initial preprocessing:")
print(df.isnull().sum())

# Display the first few rows of the cleaned DataFrame
print("\nFirst 5 rows of the preprocessed DataFrame:")
print(df.head())


Remaining rows after handling invalid dates: 20819
Removed 0 rows with missing target variable ('Electric Vehicle (EV) Total').
Remaining rows: 20819
Filled 86 missing values in 'County' with 'Unknown'.
Filled 86 missing values in 'State' with 'Unknown'.

Missing values after filling specified columns:
County    0
State     0
dtype: int64

Missing values across all columns after initial preprocessing:
Date                                        0
County                                      0
State                                       0
Vehicle Primary Use                         0
Battery Electric Vehicles (BEVs)            0
Plug-In Hybrid Electric Vehicles (PHEVs)    0
Electric Vehicle (EV) Total                 0
Non-Electric Vehicle Total                  0
Total Vehicles                              0
Percent Electric Vehicles                   0
dtype: int64

First 5 rows of the preprocessed DataFrame:
        Date          County State Vehicle Primary Use  \
0 2022-09-30       
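
To illustrate what `errors='coerce'` does in the date conversion above, here is a small sketch on toy strings in the same style as the dataset's `Date` column (for example `September 30 2022`). The explicit `format="%B %d %Y"` and the derived `year` and `month` columns are assumptions added for illustration; the notebook cell itself lets pandas infer the format.

```python
import pandas as pd

# Two valid dates in the dataset's style plus one invalid entry (toy data).
raw = pd.Series(["September 30 2022", "January 31 2020", "not a date"])

# Invalid strings become NaT instead of raising an error.
parsed = pd.to_datetime(raw, format="%B %d %Y", errors="coerce")

# Keep only rows that parsed successfully, as the cell above does.
valid = parsed[parsed.notnull()]

# Calendar parts that a trend model might use as numeric features.
features = pd.DataFrame({"year": valid.dt.year, "month": valid.dt.month})
print(features)
```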

**Handle Outliers: Cap the values at the IQR bounds**

In [31]:
# Cap the outliers at the IQR bounds: this keeps every row while reducing the skew from extreme values.
df['Percent Electric Vehicles'] = df['Percent Electric Vehicles'].clip(lower=lower_bound, upper=upper_bound)

# Verify that no values remain outside the bounds
outliers = df[(df['Percent Electric Vehicles'] < lower_bound) | (df['Percent Electric Vehicles'] > upper_bound)]
print("Number of outliers in 'Percent Electric Vehicles':", outliers.shape[0])

Number of outliers in 'Percent Electric Vehicles': 0
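
The claim that capping keeps all rows while reducing skew can be checked on a small example. The sketch below uses a toy series (hypothetical values, not taken from the dataset) and pandas' `Series.clip` and `Series.skew`; it computes the bounds before capping, as the notebook does.

```python
import pandas as pd

# Toy percentages with one extreme value (hypothetical data).
values = pd.Series([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 40.0])

# IQR bounds computed on the uncapped data.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap instead of dropping: values outside the bounds are set to the bound.
capped = values.clip(lower=lower, upper=upper)

print("rows before/after:", len(values), len(capped))
print("skew before:", values.skew())
print("skew after: ", capped.skew())
```

The row count is unchanged, and the extreme value is pulled back to the upper bound, which lowers the sample skewness. One trade-off is that capped rows no longer carry their original magnitude, so counties with unusually high EV shares become indistinguishable from each other on this column.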
