<img src="https://devra.ai/analyst/notebook/2797/image.jpg" style="width: 100%; height: auto;" />

<div style="text-align:center; border-radius:15px; padding:15px; color:white; margin:0; font-family: 'Orbitron', sans-serif; background: #2E0249; background: #11001C; box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.3); overflow:hidden; margin-bottom: 1em;">  <div style="font-size:150%; color:#FEE100"><b>EV Electric Vehicles Analysis</b></div>  <div>This notebook was created with the help of <a href="https://devra.ai/ref/kaggle" style="color:#6666FF">Devra AI</a></div></div>

Electric vehicles have come a long way, and this dataset offers insight into battery types, range, and pricing, among other attributes. If you find this notebook useful, please consider upvoting it.

## Table of Contents
- [Imports and Setup](#Imports-and-Setup)
- [Data Loading and Inspection](#Data-Loading-and-Inspection)
- [Data Cleaning and Preprocessing](#Data-Cleaning-and-Preprocessing)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Predictive Modeling](#Predictive-Modeling)
- [Conclusion](#Conclusion)

## Imports and Setup

In [None]:
# Import the libraries used throughout the notebook
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

# Render plots inline in the notebook
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Set a seaborn theme for the plots
sns.set(style='whitegrid')

## Data Loading and Inspection

In [None]:
# Load the CSV file from the directory
csv_file_path = '/kaggle/input/ev-electrical-vehicles-dataset-3k-records-2025/electric_vehicles_dataset.csv'
try:
    df = pd.read_csv(csv_file_path, encoding='utf-8', delimiter=',')
    print('CSV file loaded successfully.')
except Exception as e:
    print(f'An error occurred while loading the CSV file: {e}')
    raise  # Re-raise so later cells do not fail with a confusing NameError on df

# Display the first few rows of the dataset
df.head()

## Data Cleaning and Preprocessing

Before diving into the analysis, check for missing values and confirm that the data types are correct. Date-like columns often need reformatting, but here the only such column is 'Year', which is already numeric.

In [None]:
# Check info and data types
df.info()

# Check the shape of the dataset
print('Dataset shape:', df.shape)

# Let's look into missing values in each column
missing_values = df.isnull().sum()
print('Missing values in each column:')
print(missing_values)
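
The check above prints raw counts only. As a minimal sketch, the same information can be collected into one table that also shows the share of missing rows per column. The helper name `missing_summary` and the toy frame are hypothetical and stand in for the real dataset.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst column first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
        "dtype": frame.dtypes.astype(str),
    })
    return summary.sort_values("missing", ascending=False)


# Toy data standing in for the EV dataset (values are made up)
toy = pd.DataFrame({
    "Range_km": [400.0, np.nan, 520.0, 310.0],
    "Battery_Type": ["Li-ion", None, None, "LFP"],
    "Year": [2021, 2022, 2023, 2024],
})

summary = missing_summary(toy)
print(summary)
```

On the real data, `missing_summary(df)` would be the equivalent call.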

In [None]:
# Fill or drop missing values as necessary
# For simplicity, we fill numeric missing values with the median and categorical with the mode
numeric_cols = df.select_dtypes(include=[np.number]).columns
categorical_cols = df.select_dtypes(include=['object']).columns

for col in numeric_cols:
    if df[col].isnull().sum() > 0:
        median_value = df[col].median()
        # Assign the result back; calling fillna(inplace=True) on df[col] is deprecated in recent pandas
        df[col] = df[col].fillna(median_value)
        print(f'Filled missing values in numeric column {col} with median: {median_value}')

for col in categorical_cols:
    if df[col].isnull().sum() > 0:
        mode_value = df[col].mode()[0]
        df[col] = df[col].fillna(mode_value)
        print(f'Filled missing values in categorical column {col} with mode: {mode_value}')

# Verify that there are no remaining missing values
print('Remaining missing values:')
print(df.isnull().sum())
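
The median/mode strategy can also be wrapped in a function that returns a new frame and leaves its input untouched, which makes the step easier to rerun and test. This is a sketch under that assumption, not the notebook's own method; `impute_simple` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def impute_simple(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric NaNs filled by the median and other NaNs by the mode."""
    result = frame.copy()
    for col in result.columns:
        if not result[col].isnull().any():
            continue
        if pd.api.types.is_numeric_dtype(result[col]):
            fill_value = result[col].median()
        else:
            modes = result[col].mode()
            if modes.empty:  # column is entirely missing, so there is no mode to use
                continue
            fill_value = modes.iloc[0]
        result[col] = result[col].fillna(fill_value)
    return result


# Toy data standing in for the EV dataset (values are made up)
toy = pd.DataFrame({
    "Price_USD": [30000.0, np.nan, 50000.0, 70000.0],
    "Color": ["Red", "Blue", None, "Red"],
})

clean = impute_simple(toy)
print(clean)
```

Because the original frame is not modified, the imputed and raw versions can be compared side by side.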

## Exploratory Data Analysis

In [None]:
# A quick summary of the dataset statistics
display(df.describe())

# Plotting distributions for numeric features using histograms
numeric_df = df.select_dtypes(include=[np.number])
numeric_cols = numeric_df.columns

fig, axes = plt.subplots(nrows=len(numeric_cols), ncols=1, figsize=(8, 4*len(numeric_cols)))
for i, col in enumerate(numeric_cols):
    sns.histplot(df[col], ax=axes[i], kde=True, color='teal')
    axes[i].set_title(f"Distribution of {col}")
plt.tight_layout()
plt.show()

# Count plots for some categorical features
categorical_features = ['Manufacturer', 'Battery_Type', 'Charging_Type', 'Color', 'Country_of_Manufacture']
fig, axes = plt.subplots(nrows=len(categorical_features), ncols=1, figsize=(8, 4*len(categorical_features)))
for i, col in enumerate(categorical_features):
    sns.countplot(x=col, data=df, ax=axes[i], palette='viridis')
    axes[i].set_title(f"Count plot of {col}")
    axes[i].tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# If there are 4 or more numeric columns, display a correlation heatmap
if len(numeric_cols) >= 4:
    corr = numeric_df.corr()
    plt.figure(figsize=(10, 8))
    sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
    plt.title('Correlation Heatmap of Numeric Features')
    plt.show()
else:
    print('Not enough numeric columns for a correlation heatmap.')
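
A heatmap shows every pair at once, but for the modeling step it can be quicker to read a ranked list of correlations with the target. The sketch below uses synthetic data with made-up relationships; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the EV data: price and range both depend on battery size
rng = np.random.default_rng(0)
n = 200
battery = rng.uniform(30, 120, n)
toy = pd.DataFrame({
    "Battery_Capacity_kWh": battery,
    "Range_km": battery * 5 + rng.normal(0, 20, n),
    "Warranty_Years": rng.integers(3, 9, n),  # unrelated to price by construction
    "Price_USD": battery * 600 + rng.normal(0, 5000, n),
})

# Correlation of each numeric feature with the target, strongest first
corr_with_price = (
    toy.corr(numeric_only=True)["Price_USD"]
    .drop("Price_USD")
    .sort_values(key=np.abs, ascending=False)
)
print(corr_with_price)
```

Sorting by absolute value keeps strong negative correlations near the top, where a plain descending sort would push them to the bottom.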

In [None]:
# Pairplot across a subset of features to see pairwise relationships
pairplot_features = ['Battery_Capacity_kWh', 'Range_km', 'Charge_Time_hr', 'Price_USD']
sns.pairplot(df[pairplot_features], diag_kind='kde', corner=True)
plt.suptitle('Pairplot of Key Numeric Features', y=1.02)
plt.show()

## Predictive Modeling

We will now predict the Price_USD of each vehicle with a simple linear regression model, using several numeric features and one categorical feature (Manufacturer) as predictors. A production-level project would call for more thorough feature engineering and model validation, but a simple model still provides a useful baseline.

In [None]:
# Price_USD is the target variable.
# Predictors are a set of numeric features plus one one-hot encoded categorical feature.

target = 'Price_USD'

# A subset of numeric features that are likely to influence price
features = ['Year', 'Battery_Capacity_kWh', 'Range_km', 'Charge_Time_hr', 'Autonomous_Level', 'Safety_Rating', 'Units_Sold_2024', 'Warranty_Years']

# Add the categorical Manufacturer column and one-hot encode it
df_model = df[features + ['Manufacturer']].copy()
df_model = pd.get_dummies(df_model, columns=['Manufacturer'], drop_first=True)

# Define X and y (df_model never contained the target, so no column needs to be dropped)
X = df_model
y = df[target]

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate the model
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"Linear Regression Model R-squared Score: {r2:.2f}")
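
A single train/test split gives one R-squared value that can shift with the random seed. One possible way to get a steadier estimate is k-fold cross-validation with the encoding placed inside a scikit-learn pipeline, so the encoder is fitted on the training folds only. The sketch below runs on synthetic data; the price formula and the manufacturer labels are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the EV data (relationships are made up)
rng = np.random.default_rng(42)
n = 300
toy = pd.DataFrame({
    "Battery_Capacity_kWh": rng.uniform(30, 120, n),
    "Range_km": rng.uniform(150, 700, n),
    "Manufacturer": rng.choice(["A", "B", "C"], n),
})
brand_premium = toy["Manufacturer"].map({"A": 0, "B": 5000, "C": 12000})
toy["Price_USD"] = (
    toy["Battery_Capacity_kWh"] * 500
    + toy["Range_km"] * 20
    + brand_premium
    + rng.normal(0, 3000, n)
)

X = toy.drop(columns="Price_USD")
y = toy["Price_USD"]

# One-hot encode the categorical column; numeric columns pass through unchanged
preprocess = ColumnTransformer(
    [("brand", OneHotEncoder(handle_unknown="ignore"), ["Manufacturer"])],
    remainder="passthrough",
)
pipeline = Pipeline([("prep", preprocess), ("reg", LinearRegression())])

# Five shuffled folds; each fold refits the encoder and the regression
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="r2")
print(f"R-squared per fold: {np.round(scores, 3)}")
print(f"Mean R-squared: {scores.mean():.3f} (std {scores.std():.3f})")
```

The spread across folds indicates how much the single-split score above might vary by chance.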

## Conclusion

We explored an electric vehicles dataset, cleaned and preprocessed the data, produced several visualizations, and built a simple linear regression model to predict price. The approach is deliberately simple, and it leaves room for further work such as exploring non-linear relationships, evaluating different models, and more advanced feature engineering.
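
As a rough illustration of why non-linear models are worth trying, the sketch below compares linear regression with a random forest on synthetic data in which price depends on battery capacity through a U-shaped curve. The data and the relationship are invented, so the scores say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data with a deliberately non-linear price relationship
rng = np.random.default_rng(7)
n = 300
battery = rng.uniform(30, 120, n)
X = pd.DataFrame({"Battery_Capacity_kWh": battery})
y = 20000 + 3 * (battery - 75) ** 2 + rng.normal(0, 300, n)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=7),
}

# Score both models on the same folds so the comparison is like for like
cv = KFold(n_splits=5, shuffle=True, random_state=7)
results = {
    name: cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
    for name, model in models.items()
}
for name, score in results.items():
    print(f"{name}: mean R-squared = {score:.3f}")
```

A straight line cannot follow the curve, so the linear model scores poorly here while the tree-based model does not; on real data the gap may be much smaller or absent.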

If you found these insights useful, don't forget to upvote this notebook.