# Electric Vehicle Specifications Analysis (2025)

## Introduction

The shift toward electric vehicles (EVs) is transforming the global automotive industry. This notebook presents an analysis of the **Electric Vehicle Specifications Dataset (2025)**, published by [Urvish Ahir](https://www.kaggle.com/urvishahir) on Kaggle.

This dataset includes technical specifications of a wide range of EVs, such as:

- Brand and Model
- Battery Capacity
- Electric Range
- Charging Speed
- Top Speed
- Acceleration (0–100 km/h)
- Price

> **Dataset Source:**  
> [Kaggle - Electric Vehicle Specifications Dataset 2025](https://www.kaggle.com/datasets/urvishahir/electric-vehicle-specifications-dataset-2025)


##  Dataset Overview

In this section, we will:
- Import necessary libraries
- Load the dataset
- View its structure and basic information


### Import necessary libraries

In [21]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt

# Show plots inline in Jupyter Notebook
%matplotlib inline

#Auto reloads notebook when changes are made
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### Load the dataset

In [22]:
df = pd.read_csv('electric_vehicles_spec_2025.csv')

### Preview Dataset

In [23]:
df.head() 

Unnamed: 0,brand,model,top_speed_kmh,battery_capacity_kWh,battery_type,number_of_cells,torque_nm,efficiency_wh_per_km,range_km,acceleration_0_100_s,...,towing_capacity_kg,cargo_volume_l,seats,drivetrain,segment,length_mm,width_mm,height_mm,car_body_type,source_url
0,Abarth,500e Convertible,155,37.8,Lithium-ion,192.0,235.0,156,225,7.0,...,0.0,185,4,FWD,B - Compact,3673,1683,1518,Hatchback,https://ev-database.org/car/1904/Abarth-500e-C...
1,Abarth,500e Hatchback,155,37.8,Lithium-ion,192.0,235.0,149,225,7.0,...,0.0,185,4,FWD,B - Compact,3673,1683,1518,Hatchback,https://ev-database.org/car/1903/Abarth-500e-H...
2,Abarth,600e Scorpionissima,200,50.8,Lithium-ion,102.0,345.0,158,280,5.9,...,0.0,360,5,FWD,JB - Compact,4187,1779,1557,SUV,https://ev-database.org/car/3057/Abarth-600e-S...
3,Abarth,600e Turismo,200,50.8,Lithium-ion,102.0,345.0,158,280,6.2,...,0.0,360,5,FWD,JB - Compact,4187,1779,1557,SUV,https://ev-database.org/car/3056/Abarth-600e-T...
4,Aiways,U5,150,60.0,Lithium-ion,,310.0,156,315,7.5,...,,496,5,FWD,JC - Medium,4680,1865,1700,SUV,https://ev-database.org/car/1678/Aiways-U5


### Dataset Dimensions

In [24]:
print(f"Dataset contains {df.shape[0]} rows and {df.shape[1]} columns.")

Dataset contains 478 rows and 22 columns.


### Dataset Column Names

In [25]:
df.columns.tolist()

['brand',
 'model',
 'top_speed_kmh',
 'battery_capacity_kWh',
 'battery_type',
 'number_of_cells',
 'torque_nm',
 'efficiency_wh_per_km',
 'range_km',
 'acceleration_0_100_s',
 'fast_charging_power_kw_dc',
 'fast_charge_port',
 'towing_capacity_kg',
 'cargo_volume_l',
 'seats',
 'drivetrain',
 'segment',
 'length_mm',
 'width_mm',
 'height_mm',
 'car_body_type',
 'source_url']

### Data Types & Missing Values

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 478 entries, 0 to 477
Data columns (total 22 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   brand                      478 non-null    object 
 1   model                      477 non-null    object 
 2   top_speed_kmh              478 non-null    int64  
 3   battery_capacity_kWh       478 non-null    float64
 4   battery_type               478 non-null    object 
 5   number_of_cells            276 non-null    float64
 6   torque_nm                  471 non-null    float64
 7   efficiency_wh_per_km       478 non-null    int64  
 8   range_km                   478 non-null    int64  
 9   acceleration_0_100_s       478 non-null    float64
 10  fast_charging_power_kw_dc  477 non-null    float64
 11  fast_charge_port           477 non-null    object 
 12  towing_capacity_kg         452 non-null    float64
 13  cargo_volume_l             477 non-null    object 

### Descriptive Statistics

In [27]:
df.describe()

Unnamed: 0,top_speed_kmh,battery_capacity_kWh,number_of_cells,torque_nm,efficiency_wh_per_km,range_km,acceleration_0_100_s,fast_charging_power_kw_dc,towing_capacity_kg,seats,length_mm,width_mm,height_mm
count,478.0,478.0,276.0,471.0,478.0,478.0,478.0,477.0,452.0,478.0,478.0,478.0,478.0
mean,185.487448,74.043724,485.293478,498.012739,162.903766,393.179916,6.882636,125.008386,1052.261062,5.263598,4678.506276,1887.359833,1601.125523
std,34.252773,20.331058,1210.819733,241.461128,34.317532,103.287335,2.730696,58.205012,737.851774,1.003961,369.210573,73.656807,130.754851
min,125.0,21.3,72.0,113.0,109.0,135.0,2.2,29.0,0.0,2.0,3620.0,1610.0,1329.0
25%,160.0,60.0,150.0,305.0,143.0,320.0,4.8,80.0,500.0,5.0,4440.0,1849.0,1514.0
50%,180.0,76.15,216.0,430.0,155.0,397.5,6.6,113.0,1000.0,5.0,4720.0,1890.0,1596.0
75%,201.0,90.6,324.0,679.0,177.75,470.0,8.2,150.0,1600.0,5.0,4961.0,1939.0,1665.0
max,325.0,118.0,7920.0,1350.0,370.0,685.0,19.1,281.0,2500.0,9.0,5908.0,2080.0,1986.0


## Data Cleaning and Preprocessing

#### 1. Handling Missing Values
If you find missing values, you have following options: 
- Drop rows with missing values
- Fill missing values with a specific value (e.g., mean, median, mode)


In [28]:
# Replacing empty strings with NaN 
df.replace(r'^\s*$', pd.NA, regex=True, inplace=True)

In [29]:
# Checking for missing values 
df.isnull().sum()

brand                          0
model                          1
top_speed_kmh                  0
battery_capacity_kWh           0
battery_type                   0
number_of_cells              202
torque_nm                      7
efficiency_wh_per_km           0
range_km                       0
acceleration_0_100_s           0
fast_charging_power_kw_dc      1
fast_charge_port               1
towing_capacity_kg            26
cargo_volume_l                 1
seats                          0
drivetrain                     0
segment                        0
length_mm                      0
width_mm                       0
height_mm                      0
car_body_type                  0
source_url                     0
dtype: int64

In [30]:
# Drop row with missing 'model'
df = df.dropna(subset=['model'])

# Fill missing 'number_of_cells' with mean
df['number_of_cells'] = df['number_of_cells'].fillna(df['number_of_cells'].mean())

# = Fill missing 'torque_nm' with mean
df['torque_nm'] = df['torque_nm'].fillna(df['torque_nm'].mean())

# Fill missing 'fast_charging_power_kw_dc' with mean 
df['fast_charging_power_kw_dc'] = df['fast_charging_power_kw_dc'].fillna(df['fast_charging_power_kw_dc'].mean())

# Fill missing 'fast_charge_port' with mode
df['fast_charge_port'] = df['fast_charge_port'].fillna(df['fast_charge_port'].mode()[0])

# Fill missing 'towing_capacity_kg' with mean
df['towing_capacity_kg'] = df['towing_capacity_kg'].fillna(df['towing_capacity_kg'].mean())

In [31]:
df['cargo_volume_l'] = df['cargo_volume_l'].astype(str)

# Extract numeric values from 'cargo_volume_l'
# \d+ matches one or more digits, 
# \.? matches an optional decimal point, 
# \d* matches zero or more digits after the decimal point
df['cargo_volume_l'] = df['cargo_volume_l'].str.extract(r'(\d+\.?\d*)')

# Convert 'cargo_volume_l' to numeric
# coercing errors to NaN 
df['cargo_volume_l'] = pd.to_numeric(df['cargo_volume_l'], errors='coerce')

# Fill missing 'cargo_volume_l' with median
df['cargo_volume_l'] = df['cargo_volume_l'].fillna(df['cargo_volume_l'].median())

#### 2. Remove Duplicates

In [32]:
# Check for duplicates
df.duplicated().sum()

np.int64(0)

Hence, there are no duplicates in the dataset.


#### 3. Fixing Data Types 

In [33]:
df.dtypes

brand                         object
model                         object
top_speed_kmh                  int64
battery_capacity_kWh         float64
battery_type                  object
number_of_cells              float64
torque_nm                    float64
efficiency_wh_per_km           int64
range_km                       int64
acceleration_0_100_s         float64
fast_charging_power_kw_dc    float64
fast_charge_port              object
towing_capacity_kg           float64
cargo_volume_l               float64
seats                          int64
drivetrain                    object
segment                       object
length_mm                      int64
width_mm                       int64
height_mm                      int64
car_body_type                 object
source_url                    object
dtype: object

In [34]:
# Follwing Columns Should be of type 'category'
category_columns = [
    'brand', 'battery_type', 'fast_charge_port', 'drivetrain', 'segment', 'car_body_type'
]

df[category_columns] = df[category_columns].astype('category')

# Converting 'string' like text 
df['model'] = df['model'].astype('string')
df['source_url'] = df['source_url'].astype('string')

# Convert count-like floats to nullable integers
df['number_of_cells'] = df['number_of_cells'].round().astype('Int64')

In [35]:
df.dtypes

brand                              category
model                        string[python]
top_speed_kmh                         int64
battery_capacity_kWh                float64
battery_type                       category
number_of_cells                       Int64
torque_nm                           float64
efficiency_wh_per_km                  int64
range_km                              int64
acceleration_0_100_s                float64
fast_charging_power_kw_dc           float64
fast_charge_port                   category
towing_capacity_kg                  float64
cargo_volume_l                      float64
seats                                 int64
drivetrain                         category
segment                            category
length_mm                             int64
width_mm                              int64
height_mm                             int64
car_body_type                      category
source_url                   string[python]
dtype: object

#### 4. Standardize Text Columns

In [36]:
text_columns = [
    'brand', 'model', 'battery_type',
    'fast_charge_port', 'drivetrain',
    'segment', 'car_body_type'
]


# Standardize string formatting 
for col in text_columns:
    previous_dtype = df[col].dtype
    df[col] = df[col].astype('string').str.strip().str.lower().replace('Nan', pd.NA)
    df[col] = df[col].astype(previous_dtype)

In [37]:
# checking if any column has inconsistent values
for col in text_columns:
    print(f"Unique values in '{col}': {df[col].unique().tolist()}")

Unique values in 'brand': [nan]
Unique values in 'model': ['500e convertible', '500e hatchback', '600e scorpionissima', '600e turismo', 'u5', 'u6', 'romeo junior elettrica 54 kwh', 'romeo junior elettrica 54 kwh veloce', 'a290 electric 180 hp', 'a290 electric 220 hp', 'a6 avant e-tron', 'a6 avant e-tron performance', 'a6 avant e-tron quattro', 'a6 sportback e-tron', 'a6 sportback e-tron performance', 'a6 sportback e-tron quattro', 'q4 sportback e-tron 40', 'q4 sportback e-tron 45', 'q4 sportback e-tron 45 quattro', 'q4 sportback e-tron 55 quattro', 'q4 e-tron 40', 'q4 e-tron 45', 'q4 e-tron 45 quattro', 'q4 e-tron 55 quattro', 'q6 e-tron', 'q6 e-tron sportback', 'q6 e-tron sportback performance', 'q6 e-tron sportback quattro', 'q6 e-tron performance', 'q6 e-tron quattro', 's6 avant e-tron', 's6 sportback e-tron', 'sq6 e-tron', 'sq6 e-tron sportback', 'e-tron gt rs', 'e-tron gt rs performance', 'e-tron gt s', 'e-tron gt quattro', 'i4 m50', 'i4 edrive35', 'i4 edrive40', 'i4 xdrive40', 'i