# ICBC 2023 Lower Mainland vehicle dataset

## Source

In [10]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns

In [11]:
# Check if file path is valid.
file_path = './icbc/Vehicle Population - 2023 Passenger Vehicles_Full _data.csv'
if not os.path.isfile(file_path):
    fnf_err = f'{file_path} not found.'
    raise FileNotFoundError(fnf_err)

vehicle_df = pd.read_csv(file_path)

# Take a peek at the first couple of rows in the dataset.
print(vehicle_df.head(2))

dataset_columns = vehicle_df.columns.tolist()
print(f"This dataset contains the following columns {dataset_columns}")

num_rows, num_cols = vehicle_df.shape
print(f"Data set shape {num_rows} rows x {num_cols} cols")

  Veh Pop - Criteria Selector Vehicle Use Anti Theft Device Indicator  \
0              Lower Mainland    Business                          No   
1              Lower Mainland    Business                          No   

             Body Style Electric_Vehicle_Indicator Fleet Vehicle Indicator  \
0         Fourdoorsedan                         No                      No   
1  Fourdoorstationwagon                         No                      No   

  Fuel Type Hybrid Vehicle Indicator        Make                Model  \
0    Diesel                       No  VOLKSWAGEN  RABBIT OTHER MODELS   
1    Diesel                       No      TOYOTA   LAND CRUISER WAGON   

   Model Year Municipality             Owner Type          Region  \
0        1978      Langley  External organization  Lower Mainland   
1        1991      Burnaby  External organization  Lower Mainland   

   Vehicle Count  
0              1  
1              1  
This dataset contains the following columns ['Veh Pop - Crit

# Initial comments
This dataset contains redundant columns. We specifically set the criteria to Lower Mainland vehicles on ICBC's data portal, so 'Veh Pop - Criteria Selector' and 'Region' should carry the same value ('Lower Mainland') in every row. We can drop both columns.
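
Before dropping, it may be worth confirming that the two columns really hold a single value. The sketch below runs on a small stand-in frame rather than the real data; `toy_df` and `constant_columns` are hypothetical names that do not appear in the notebook, and the municipality values are illustrative only.

```python
import pandas as pd

# Toy frame standing in for vehicle_df (assumption: illustrative values only).
toy_df = pd.DataFrame({
    "Veh Pop - Criteria Selector": ["Lower Mainland"] * 3,
    "Region": ["Lower Mainland"] * 3,
    "Municipality": ["Langley", "Burnaby", "Surrey"],
})


def constant_columns(df):
    """Return the names of columns holding at most one distinct value.

    NaN is counted as a value, so a column that mixes one label with
    missing entries is not reported as constant.
    """
    return [c for c in df.columns if df[c].nunique(dropna=False) <= 1]


redundant = constant_columns(toy_df)
print(redundant)
```

Applied to the real frame, the same helper would be called as `constant_columns(vehicle_df)` before the drop.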

In [12]:
vehicle_df.drop(['Veh Pop - Criteria Selector', 'Region'], axis=1, inplace=True)

dataset_columns = vehicle_df.columns.tolist()
print(f"This dataset contains the following columns {dataset_columns}")

num_rows, num_cols = vehicle_df.shape
print(f"Data set shape {num_rows} rows x {num_cols} cols")
assert num_cols == 13, "Columns delete failed."

This dataset contains the following columns ['Vehicle Use', 'Anti Theft Device Indicator', 'Body Style', 'Electric_Vehicle_Indicator', 'Fleet Vehicle Indicator', 'Fuel Type', 'Hybrid Vehicle Indicator', 'Make', 'Model', 'Model Year', 'Municipality', 'Owner Type', 'Vehicle Count']
Data set shape 719144 rows x 13 cols


# Exploratory Data Analysis
Here we take a quick look at each column to see if there are any obvious redundancies (if all values for col C are 'xyz', then there isn't much to investigate in col C).

In [13]:
for c in dataset_columns:
    print(f"Examining column {c}")
    # Find unique values in this column.
    uniques = vehicle_df[c].unique()
    print(len(uniques))

Examining column Vehicle Use
3
Examining column Anti Theft Device Indicator
2
Examining column Body Style
24
Examining column Electric_Vehicle_Indicator
2
Examining column Fleet Vehicle Indicator
2
Examining column Fuel Type
18
Examining column Hybrid Vehicle Indicator
2
Examining column Make
427
Examining column Model
9496
Examining column Model Year
116
Examining column Municipality
54
Examining column Owner Type
2
Examining column Vehicle Count
284
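
Note that `Series.unique()` includes `NaN` as one of the returned values, so a count taken with `len(...)` can be one higher than the number of real categories. A minimal sketch of a side-by-side summary, using a hypothetical `toy_df` with made-up values rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for vehicle_df (assumption: illustrative values only).
toy_df = pd.DataFrame({
    "Fuel Type": ["Diesel", "Gasoline", np.nan, "Diesel"],
    "Model Year": [1978, 1991, 2020, 1991],
})

# One row per column: distinct values with and without NaN, plus missing count.
counts = pd.DataFrame({
    "distinct_incl_nan": toy_df.nunique(dropna=False),
    "distinct_excl_nan": toy_df.nunique(),
    "missing": toy_df.isna().sum(),
})
print(counts)
```

If the two distinct-value columns differ for a field in the real data, the extra "category" is simply missing data.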


# Comments
- There are 3 categories for vehicle use. We'll have to dig into that later.
- There are 2 categories for anti-theft device indicator. This is likely a Boolean field: a vehicle either has an anti-theft device or it does not.
- There are 24 body styles. We'll have to dig into that later.
- There are 2 categories for electric vehicle indicator, so this is presumably a Boolean field. We'll have to check whether hybrids and PHEVs (Plug-in Hybrid Electric Vehicles) are declared as electric vehicles alongside BEVs (Battery Electric Vehicles), or not.
- There are 2 categories for fleet vehicle indicator. This is definitely a Boolean field.
- There are 18 categories for fuel type, which is more than we anticipated (diesel, gasoline, electric, LPG).
- There are 427 vehicle makes, which seems too high. We'll have to dig into that: are the makes misspelled, or sometimes recorded as acronyms and sometimes in full?
- There are 9496 models. This seems like slightly too many. We'll have to take a look later.
- There are 116 model years, which is unexpected. We don't expect the years to span more than 50 to 60 years.
- There are 54 municipalities, which is expected for the Lower Mainland.
- There are 2 types of owners.
- There are 284 distinct values for vehicle count.
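
Two of the follow-ups above, how the electric and hybrid indicators overlap and how wide the model-year range is, could be checked along these lines. This is a sketch on a hypothetical `toy_df`; the `"Yes"` label is an assumption, since only `"No"` appears in the rows printed earlier.

```python
import pandas as pd

# Toy frame standing in for vehicle_df (assumption: illustrative values only;
# the "Yes" label is assumed, only "No" is visible in the notebook output).
toy_df = pd.DataFrame({
    "Electric_Vehicle_Indicator": ["No", "Yes", "Yes", "No"],
    "Hybrid Vehicle Indicator": ["No", "No", "Yes", "Yes"],
    "Model Year": [1978, 1991, 2021, 2023],
    "Vehicle Count": [1, 2, 3, 4],
})

# Cross-tabulate the two indicators, weighting each row by its vehicle count
# so the cells are numbers of vehicles rather than numbers of rows.
ev_vs_hybrid = pd.crosstab(
    toy_df["Electric_Vehicle_Indicator"],
    toy_df["Hybrid Vehicle Indicator"],
    values=toy_df["Vehicle Count"],
    aggfunc="sum",
).fillna(0).astype(int)
print(ev_vs_hybrid)

# Range of model years, to see where the unexpected span comes from.
year_min = toy_df["Model Year"].min()
year_max = toy_df["Model Year"].max()
print(f"Model years span {year_min} to {year_max}")
```

Weighting by 'Vehicle Count' matters here because each row of the dataset describes a group of vehicles, not a single one.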