# Introduction
This project analyzes the Electric Vehicle Population Data dataset and predicts the Base MSRP of electric vehicles from their characteristics. Each row of the dataset describes one registered vehicle, including its make, model, model year, electric range, and base price.

## Problem Statement
The task is to build a regression model that predicts the Base MSRP of an electric vehicle from its characteristics. EV manufacturers, policymakers, and industry analysts could use such a model to understand which factors influence EV pricing and to make informed decisions.
* Project Type: Supervised Learning (Regression)
* Target Variable: Base MSRP

# Data Preprocessing

## Importing necessary libraries

In [4]:
# to deactivate warnings
import warnings
import sys
if not sys.warnoptions:
    warnings.simplefilter("ignore")

In [129]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

The preprocessing workflow consists of the following steps:

1. Acquire the dataset
2. Import libraries
3. Import the dataset
4. Find missing data
5. Find and handle outliers
6. Encode categorical data
7. Select features (columns)
8. Split the dataset
9. Scale features

## Importing Datasets

In [7]:
data = pd.read_csv("C:\\Users\\aniru\\Downloads\\Electric_Vehicle_Population_Data (4).csv")

In [10]:
print("Shape of the dataset is: ")
data.shape

Shape of the dataset is: 


(223995, 17)

In [12]:
print("Columns: ")
data.columns

Columns: 


Index(['VIN (1-10)', 'County', 'City', 'State', 'Postal Code', 'Model Year',
       'Make', 'Model', 'Electric Vehicle Type',
       'Clean Alternative Fuel Vehicle (CAFV) Eligibility', 'Electric Range',
       'Base MSRP', 'Legislative District', 'DOL Vehicle ID',
       'Vehicle Location', 'Electric Utility', '2020 Census Tract'],
      dtype='object')

In [14]:
print("Dataset Information:")
print("\t")
data.info()

Dataset Information:
	
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 223995 entries, 0 to 223994
Data columns (total 17 columns):
 #   Column                                             Non-Null Count   Dtype  
---  ------                                             --------------   -----  
 0   VIN (1-10)                                         223995 non-null  object 
 1   County                                             223992 non-null  object 
 2   City                                               223992 non-null  object 
 3   State                                              223995 non-null  object 
 4   Postal Code                                        223992 non-null  float64
 5   Model Year                                         223995 non-null  int64  
 6   Make                                               223995 non-null  object 
 7   Model                                              223995 non-null  object 
 8   Electric Vehicle Type                              

In [16]:
print("Summary Statistics: ")
print("\t")
data.describe()

Summary Statistics: 
	


| | Postal Code | Model Year | Electric Range | Base MSRP | Legislative District | DOL Vehicle ID | 2020 Census Tract |
|---|---|---|---|---|---|---|---|
| count | 223992.0 | 223995.0 | 223977.0 | 223977.0 | 223521.0 | 223995.0 | 223992.0 |
| mean | 98176.491165 | 2021.264408 | 47.736187 | 829.894386 | 28.876361 | 232932800.0 | 52979970000.0 |
| std | 2544.240509 | 2.989676 | 84.98714 | 7372.509049 | 14.911023 | 68843290.0 | 1531491000.0 |
| min | 1731.0 | 1999.0 | 0.0 | 0.0 | 1.0 | 4385.0 | 1001020000.0 |
| 25% | 98052.0 | 2020.0 | 0.0 | 0.0 | 17.0 | 200800200.0 | 53033010000.0 |
| 50% | 98126.0 | 2022.0 | 0.0 | 0.0 | 32.0 | 248299200.0 | 53033030000.0 |
| 75% | 98374.0 | 2023.0 | 39.0 | 0.0 | 42.0 | 267397300.0 | 53053070000.0 |
| max | 99577.0 | 2025.0 | 337.0 | 845000.0 | 49.0 | 479254800.0 | 56021000000.0 |
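The 25%, 50%, and 75% quartiles of Base MSRP are all 0, so at least three quarters of the rows carry a price of 0. A value of 0 may simply stand for an unrecorded price; that reading is an assumption, not something the notebook confirms. The sketch below shows one way to measure the share of zero-priced rows, using a made-up toy frame rather than the real data.

```python
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration.
toy = pd.DataFrame({"Base MSRP": [0, 0, 0, 0, 0, 0, 69900, 31950]})

# Share of rows whose price is recorded as 0.
zero_share = (toy["Base MSRP"] == 0).mean()
print(f"Share of zero-priced rows: {zero_share:.2%}")

# Rows that carry a positive price.
priced = toy[toy["Base MSRP"] > 0]
print(priced)
```

Running the same two expressions on `data` would show how much usable price information the target column actually holds.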


In [18]:
print("Unique Values: ")
print("\t")
data.nunique()

Unique Values: 
	


VIN (1-10)                                            13175
County                                                  207
City                                                    789
State                                                    48
Postal Code                                             954
Model Year                                               21
Make                                                     46
Model                                                   164
Electric Vehicle Type                                     2
Clean Alternative Fuel Vehicle (CAFV) Eligibility         3
Electric Range                                          109
Base MSRP                                                31
Legislative District                                     49
DOL Vehicle ID                                       223995
Vehicle Location                                        952
Electric Utility                                         76
2020 Census Tract                       

### Finding missing values

In [20]:
missing_values = data.isnull().sum()
print("Missing Values:")
print("\t")
print(missing_values)

Missing Values:
	
VIN (1-10)                                             0
County                                                 3
City                                                   3
State                                                  0
Postal Code                                            3
Model Year                                             0
Make                                                   0
Model                                                  0
Electric Vehicle Type                                  0
Clean Alternative Fuel Vehicle (CAFV) Eligibility      0
Electric Range                                        18
Base MSRP                                             18
Legislative District                                 474
DOL Vehicle ID                                         0
Vehicle Location                                      10
Electric Utility                                       3
2020 Census Tract                                      3
dtype: int64
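Raw counts are easier to judge as a share of all rows. For example, 474 missing Legislative District values out of 223,995 rows is about 0.21%. A minimal sketch of that calculation follows, on a made-up toy frame.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps; the values are invented for illustration.
toy = pd.DataFrame({
    "County": ["King", None, "Pierce", "King"],
    "Electric Range": [0.0, 215.0, np.nan, 39.0],
    "Model Year": [2020, 2022, 2023, 2021],
})

# isnull().mean() gives the fraction of missing entries per column.
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```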


### Handling Missing values
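* Fill missing values with the mean (for numerical columns)

In [22]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to the original frame
data['Electric Range'] = data['Electric Range'].fillna(data['Electric Range'].mean())
data['Base MSRP'] = data['Base MSRP'].fillna(data['Base MSRP'].mean())
data['Postal Code'] = data['Postal Code'].fillna(data['Postal Code'].mean())
data['2020 Census Tract'] = data['2020 Census Tract'].fillna(data['2020 Census Tract'].mean())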

* Fill missing values with the mode (for categorical columns)

In [24]:
# mode() returns a Series; [0] selects the most frequent value
data['County'] = data['County'].fillna(data['County'].mode()[0])
data['City'] = data['City'].fillna(data['City'].mode()[0])
data['State'] = data['State'].fillna(data['State'].mode()[0])
data['Vehicle Location'] = data['Vehicle Location'].fillna(data['Vehicle Location'].mode()[0])
data['Electric Utility'] = data['Electric Utility'].fillna(data['Electric Utility'].mode()[0])
data['Legislative District'] = data['Legislative District'].fillna(data['Legislative District'].mode()[0])
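The per-column calls above can also be written once per data type. The following is a minimal sketch under the assumption that every numeric column should get its mean and every other column its mode. The toy frame and its values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and one text column, each with a gap.
toy = pd.DataFrame({
    "Electric Range": [0.0, 200.0, np.nan, 40.0],
    "County": ["King", "King", None, "Pierce"],
})

numeric = toy.select_dtypes(include="number").columns
categorical = toy.select_dtypes(exclude="number").columns

# Mean for numeric columns, most frequent value for the rest.
toy[numeric] = toy[numeric].fillna(toy[numeric].mean())
toy[categorical] = toy[categorical].fillna(toy[categorical].mode().iloc[0])
print(toy)
```

Note that a mean is not always a meaningful fill value for code-like columns such as Postal Code or 2020 Census Tract, since the result need not be a real code.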

In [35]:
# Checking for the null values after imputation
missing_values = data.isnull().sum()
print("Missing Values after Imputation:")
print(missing_values)

Missing Values after Imputation:
VIN (1-10)                                           0
County                                               0
City                                                 0
State                                                0
Postal Code                                          0
Model Year                                           0
Make                                                 0
Model                                                0
Electric Vehicle Type                                0
Clean Alternative Fuel Vehicle (CAFV) Eligibility    0
Electric Range                                       0
Base MSRP                                            0
Legislative District                                 0
DOL Vehicle ID                                       0
Vehicle Location                                     0
Electric Utility                                     0
2020 Census Tract                                    0
dtype: int64


### Checking for duplicates

In [28]:
print("\t")
print(f"Total number of duplicate rows is: {data.duplicated().sum()}")

	
Total number of duplicate rows is: 0
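`data.duplicated()` compares whole rows. Because DOL Vehicle ID has 223,995 unique values, one per row, no two rows can match completely, so a count of 0 is expected here. The sketch below, on a made-up toy frame, shows how the count changes once the identifier column is left out of the comparison.

```python
import pandas as pd

# Toy frame: unique IDs, but two vehicles share every other attribute.
toy = pd.DataFrame({
    "DOL Vehicle ID": [1, 2, 3],
    "Make": ["TESLA", "TESLA", "NISSAN"],
    "Model": ["MODEL 3", "MODEL 3", "LEAF"],
})

# Full-row comparison: the unique ID prevents any match.
full_dupes = toy.duplicated().sum()

# Comparison without the ID column: repeated attribute combinations show up.
dupes_ignoring_id = toy.drop(columns="DOL Vehicle ID").duplicated().sum()
print(full_dupes, dupes_ignoring_id)
```

Repeated attribute combinations are not necessarily errors, since many registered vehicles share the same make and model.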


### Finding and Handling outliers
* Find Outliers using the IQR method

In [48]:
# Find outliers in all numerical columns
numerical_cols = data.select_dtypes(include=['int64', 'float64']).columns
for col in numerical_cols:
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1
    outliers = data[col][(data[col] < (Q1 - 1.5 * IQR)) | (data[col] > (Q3 + 1.5 * IQR))]
    print(f'Outliers in {col}: {outliers.shape[0]}')

Outliers in Postal Code: 14600
Outliers in Model Year: 14605
Outliers in Electric Range: 37715
Outliers in Base MSRP: 3278
Outliers in Legislative District: 0
Outliers in DOL Vehicle ID: 9928
Outliers in 2020 Census Tract: 551
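The IQR rule flags values outside `[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]`. For Base MSRP, the summary statistics show Q1 = Q3 = 0, so IQR = 0 and both bounds collapse to 0, which means every non-zero price counts as an outlier. The sketch below reproduces that effect on made-up values. `iqr_bounds` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) IQR fences for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Mostly zeros with two real prices, mimicking the shape of Base MSRP.
msrp = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 69900, 31950])

lower, upper = iqr_bounds(msrp)
flagged = msrp[(msrp < lower) | (msrp > upper)]
print(lower, upper, len(flagged))
```

A column dominated by a single value therefore needs a different outlier criterion, or should be excluded from IQR-based filtering.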


In [61]:
data_cleaned = data.copy()

In [81]:
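# Remove outliers from the dataset
numerical_cols = data_cleaned.select_dtypes(include=['int64', 'float64']).columns
# Start by keeping every row, then narrow the mask column by column
keep = pd.Series(True, index=data_cleaned.index)
for col in numerical_cols:
    Q1 = data_cleaned[col].quantile(0.25)
    Q3 = data_cleaned[col].quantile(0.75)
    IQR = Q3 - Q1
    keep &= data_cleaned[col].between(Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)
# Filter data_cleaned itself; assigning the result to `data` inside the loop left it unfiltered.
# Caution: Base MSRP has Q1 = Q3 = 0 in the summary statistics, so only zero-priced rows pass.
data_cleaned = data_cleaned[keep]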

### Encoding Categorical Data

In [86]:
from sklearn.preprocessing import LabelEncoder

In [113]:
# Converting categorical variables into numerical variables 
categorical_cols = ['VIN (1-10)','County', 'City', 'State', 'Make', 'Model', 'Electric Vehicle Type', 'Clean Alternative Fuel Vehicle (CAFV) Eligibility', 
                    'Legislative District', 'Vehicle Location', 'Electric Utility']
le = LabelEncoder()
for col in categorical_cols:
    data_cleaned[col] = le.fit_transform(data_cleaned[col])
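The loop above refits a single `LabelEncoder` for each column, so only the mapping of the last column remains available afterwards, and categories that first appear in new data cannot be encoded. One possible alternative, shown here as a sketch and not as the notebook's method, is scikit-learn's `OrdinalEncoder`, which handles several feature columns at once and can map unseen categories to a sentinel value. The toy frames are made up.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy training and test frames; "RIVIAN" does not occur in training.
train = pd.DataFrame({"Make": ["TESLA", "NISSAN", "KIA"]})
test = pd.DataFrame({"Make": ["KIA", "RIVIAN"]})

# Unseen categories are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train[["Make"]])
test_codes = encoder.transform(test[["Make"]])

print(encoder.categories_)
print(train_codes.ravel(), test_codes.ravel())
```

Both encoders assign integer codes in sorted category order. That order carries no meaning for pricing, which matters for models that treat the codes as magnitudes.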

### Feature (Column) Selection

In [93]:
X = data_cleaned.drop('Base MSRP', axis=1)
y = data_cleaned['Base MSRP']

### Splitting the Dataset
* Split the data into training and testing sets 

In [96]:
from sklearn.model_selection import train_test_split

In [98]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
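With `test_size=0.2`, one fifth of the rows go to the test set, and `random_state=42` makes the shuffle repeatable. A quick shape check on made-up arrays illustrates the split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy samples with two features each.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Every sample ends up in exactly one of the two sets.
print(X_tr.shape, X_te.shape)
```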

### Feature Scaling
* Scale the data using StandardScaler

In [105]:
from sklearn.preprocessing import StandardScaler

In [125]:
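# One-hot encode remaining text columns; after label encoding these frames are likely empty
categorical_cols = X_train.select_dtypes(include=['object']).columns
X_train_cat = pd.get_dummies(X_train[categorical_cols])
# Reuse the training columns so that both frames share the same layout
X_test_cat = pd.get_dummies(X_test[categorical_cols]).reindex(columns=X_train_cat.columns, fill_value=0)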

In [127]:
# Scale the numeric columns
numeric_cols = X_train.select_dtypes(include=['int64', 'float64']).columns
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train[numeric_cols])
X_test_scaled = scaler.transform(X_test[numeric_cols])
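The scaler is fitted on the training data only and then reused for the test data, so the test set does not influence the learned mean and standard deviation. A small sketch with made-up numbers shows the effect:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data; the test value lies outside the training range.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training statistics

print(train_scaled.ravel(), test_scaled.ravel())
```

The scaled training data has mean 0 and unit standard deviation, while the test value is expressed relative to the training distribution. `fit_transform` returns a NumPy array, so the column names are lost unless the result is wrapped in a DataFrame again.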