# PART 3: Handling Missing Data and Quality Checks

In this section, we will go through the usual data checks needed to keep the data clean and useful. The checks are:
1. Check the data types
2. Check for missing values across the whole dataset
3. Check for null values separately for EVs and hybrid models
4. Check for duplicates
 

In [1]:
import pandas as pd

df = pd.read_csv("combined_vehicle_test_logs.csv", parse_dates=['timestamp'])


In [2]:
df

Unnamed: 0,timestamp,vehicle_id,propulsion_type,test_scenario,vehicle_speed_mph,acceleration_mphps,engine_rpm,torque,coolant_temp,ambient_temp,power_kw,phase,soc,source_file
0,2025-08-11 21:02:06.665331,V001,EV,Cold Start,0.00,0.00,500.00,99.49,39.87,29.089257,5.21,Idle,99.99,V001_Cold_Start.csv
1,2025-08-11 21:02:07.665331,V001,EV,Cold Start,0.00,0.00,500.00,98.36,39.89,29.086131,5.15,Idle,100.00,V001_Cold_Start.csv
2,2025-08-11 21:02:08.665331,V001,EV,Cold Start,0.00,0.00,500.00,118.50,39.91,29.093595,6.20,Idle,100.00,V001_Cold_Start.csv
3,2025-08-11 21:02:09.665331,V001,EV,Cold Start,0.00,0.00,500.00,108.82,39.93,29.105409,5.70,Idle,100.00,V001_Cold_Start.csv
4,2025-08-11 21:02:10.665331,V001,EV,Cold Start,0.00,0.00,500.00,123.97,39.95,29.106028,6.49,Idle,99.89,V001_Cold_Start.csv
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
74695,2025-08-11 21:59:02.424916,V005,EV,Hill Climb,28.78,-2.10,1227.38,254.26,57.15,25.606847,32.68,Brake,80.54,V005_Hill_Climb.csv
74696,2025-08-11 21:59:03.424916,V005,EV,Hill Climb,31.18,2.40,1185.19,226.32,57.15,25.618882,28.09,Accelerate,80.02,V005_Hill_Climb.csv
74697,2025-08-11 21:59:04.424916,V005,EV,Hill Climb,30.48,-0.71,1202.66,218.90,57.15,25.624723,27.57,Brake,80.20,V005_Hill_Climb.csv
74698,2025-08-11 21:59:05.424916,V005,EV,Hill Climb,30.64,0.16,1157.06,262.23,57.15,25.617192,31.77,Cruise,80.48,V005_Hill_Climb.csv


In [3]:
# 1. Check the data types 

print(df.dtypes)


timestamp             datetime64[ns]
vehicle_id                    object
propulsion_type               object
test_scenario                 object
vehicle_speed_mph            float64
acceleration_mphps           float64
engine_rpm                   float64
torque                       float64
coolant_temp                 float64
ambient_temp                 float64
power_kw                     float64
phase                         object
soc                          float64
source_file                   object
dtype: object
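
The printed dtypes can also be checked programmatically, so that a column silently loaded as text is caught early. The following is a minimal sketch, not part of the original notebook: `EXPECTED_KINDS` and `check_schema` are hypothetical names, the expected kinds are read off the output above, and a small hand-made frame stands in for `df`.

```python
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical expected schema, read off the dtypes printed above (subset of columns).
EXPECTED_KINDS = {
    "timestamp": "datetime",
    "vehicle_speed_mph": "numeric",
    "soc": "numeric",
    "phase": "text",
}

def check_schema(frame, expected):
    """Return a list of (column, problem) tuples; an empty list means no problems."""
    checkers = {
        "datetime": ptypes.is_datetime64_any_dtype,
        "numeric": ptypes.is_numeric_dtype,
        "text": lambda s: ptypes.is_object_dtype(s) or ptypes.is_string_dtype(s),
    }
    problems = []
    for col, kind in expected.items():
        if col not in frame.columns:
            problems.append((col, "missing column"))
        elif not checkers[kind](frame[col]):
            problems.append((col, f"expected {kind}, got {frame[col].dtype}"))
    return problems

# Illustrative stand-in for df; soc is deliberately stored as text to trigger a finding.
sample = pd.DataFrame({
    "timestamp": pd.to_datetime(["2025-08-11 21:02:06", "2025-08-11 21:02:07"]),
    "vehicle_speed_mph": [0.0, 0.0],
    "soc": ["99.99", "100.00"],
    "phase": ["Idle", "Idle"],
})

problems = check_schema(sample, EXPECTED_KINDS)
print(problems)
```

Running `check_schema(df, EXPECTED_KINDS)` on the real frame should return an empty list if the columns match the dtypes shown above.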


In [4]:
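# pd.to_datetime converts date strings into real datetime objects for time-based analysis.
# The column was already parsed by parse_dates in read_csv (see the dtypes above), so this is only a safeguard.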

df['timestamp'] = pd.to_datetime(df['timestamp'])


In [5]:
# 2. Check for missing values 
# Calculates how many missing values each column has and prints only the columns where the count is greater than zero.

missing = df.isnull().sum()
print(missing[missing > 0])


soc    30840
dtype: int64
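
A raw count is easier to judge next to the share of rows it represents. The sketch below is an assumption-labelled addition: `missing_report` is a hypothetical helper, and the tiny frame only imitates the shape of `df`. It reports the count and the percentage of missing values for each affected column.

```python
import numpy as np
import pandas as pd

def missing_report(frame):
    """Count and percentage of missing values, limited to columns that have any."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return report[report["missing"] > 0].sort_values("missing", ascending=False)

# Illustrative stand-in for df: soc is missing on the Hybrid rows only.
sample = pd.DataFrame({
    "propulsion_type": ["EV", "EV", "Hybrid", "Hybrid"],
    "soc": [99.99, 100.0, np.nan, np.nan],
    "power_kw": [5.21, 5.15, 6.20, 5.70],
})

report = missing_report(sample)
print(report)
```

Applied to the real data with `missing_report(df)`, this would put the 30840 missing `soc` values in proportion to the total row count.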


In [6]:
# 3. Check for null values by propulsion type
# Split the DataFrame into EV and Hybrid subsets

ev_df = df[df['propulsion_type'] == 'EV']
hybrid_df = df[df['propulsion_type'] == 'Hybrid']


In [7]:
# Check whether the EV rows contain any missing values

print("EV missing values:\n", ev_df.isnull().sum())

# Expectations for EV rows:
# - fuel_rate would always be 0 and gear would be NaN for EVs (no fuel use, no gear shifts);
#   neither column is present in this dataset, so both can be ignored here
# - State of charge (soc) must never be null
ev_df['soc'].isnull().sum()         # Should be 0


EV missing values:
 timestamp             0
vehicle_id            0
propulsion_type       0
test_scenario         0
vehicle_speed_mph     0
acceleration_mphps    0
engine_rpm            0
torque                0
coolant_temp          0
ambient_temp          0
power_kw              0
phase                 0
soc                   0
source_file           0
dtype: int64


0

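Here, we can see that the EV rows have no missing values in any column. The dataset has no gear column, which would be empty for EVs anyway because they do not have gear controls.

Since the EV subset has no missing State of Charge (`soc`) values, all 30840 missing `soc` values fall in the non-EV (hybrid) rows. Note that these values are missing (NaN), not 0. We treat this as expected, as plug-in hybrids use traditional fuel with battery supply.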
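
The split by propulsion type can be cross-checked in one step by counting missing `soc` values per group, and by confirming that no rows carry an unexpected propulsion label. This is a minimal sketch on made-up rows; the allowed labels `EV` and `Hybrid` are taken from the filters above, and `sample` is a stand-in for `df`.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for df.
sample = pd.DataFrame({
    "propulsion_type": ["EV", "EV", "Hybrid", "Hybrid", "Hybrid"],
    "soc": [99.99, 100.0, np.nan, np.nan, np.nan],
})

# Number of missing soc values within each propulsion type.
missing_by_type = sample["soc"].isnull().groupby(sample["propulsion_type"]).sum()
print(missing_by_type)

# Labels other than the two used for filtering would be dropped silently by the split.
unexpected_types = set(sample["propulsion_type"].unique()) - {"EV", "Hybrid"}
print("Unexpected propulsion types:", unexpected_types)
```

On the real data, a nonzero count for `EV` or a non-empty `unexpected_types` set would mean the missing values need a closer look before they are accepted.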

In [8]:
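# Recombine the EV and Hybrid subsets into one frame (rows grouped by propulsion type).
# Note: the cells below continue to operate on df, not df_cleaned.
df_cleaned = pd.concat([ev_df, hybrid_df], ignore_index=True)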

In [9]:
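# 4. Check for duplicate rows
# Counts rows that are identical to an earlier row in every column.

duplicates = df.duplicated().sum()
print(f"Duplicate rows: {duplicates}")

# Drop exact duplicates (a no-op when the count above is 0).
df = df.drop_duplicates()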


Duplicate rows: 0


There are no duplicate rows present in the dataset.
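
`df.duplicated()` only flags rows that match in every column, so two log entries for the same vehicle and timestamp with slightly different sensor readings would pass. A stricter check compares a subset of key columns. The sketch below assumes that `vehicle_id`, `test_scenario` and `timestamp` together should identify a row; that key is an assumption, not something the dataset documents, and the rows are made up.

```python
import pandas as pd

# Assumed logical key: one reading per vehicle, scenario and timestamp.
KEY_COLUMNS = ["vehicle_id", "test_scenario", "timestamp"]

# Illustrative stand-in for df: the first two rows share a key but differ in torque.
sample = pd.DataFrame({
    "vehicle_id": ["V001", "V001", "V001"],
    "test_scenario": ["Cold Start", "Cold Start", "Cold Start"],
    "timestamp": pd.to_datetime([
        "2025-08-11 21:02:06",
        "2025-08-11 21:02:06",
        "2025-08-11 21:02:07",
    ]),
    "torque": [99.49, 98.36, 118.50],
})

# Exact duplicates: identical in every column.
exact_duplicates = int(sample.duplicated().sum())

# Key duplicates: keep=False marks every row that shares its key with another row.
key_duplicates = sample[sample.duplicated(subset=KEY_COLUMNS, keep=False)]

print(f"Exact duplicate rows: {exact_duplicates}")
print(f"Rows sharing a key: {len(key_duplicates)}")
```

If the key assumption holds for the real data, `df.duplicated(subset=KEY_COLUMNS).sum()` should also be 0.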

In [10]:
df.to_csv("cleaned_vehicle_test_logs.csv", index=False)
