# Vehicle Autonomy Estimator

This notebook develops an ML model that estimates the autonomy of a vehicle (km/L) from some of its characteristics.

In [98]:
import pandas as pd
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, MinMaxScaler
from sklearn.linear_model import LinearRegression 
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import joblib

pd.set_option("display.max_columns",None)

## Data Extraction

A cleaned dataset is already stored in this repo. The cleaning process is documented in the *data_sources/data_preparation.ipynb* notebook.

In [99]:
data = pd.read_parquet(r"../../../data_sources/vehicle_data_prepared.parquet")

In [100]:
data.query("combined_kmpl_for_fuel_type2 != 0").sort_values("combined_kmpl_for_fuel_type1", ascending=False)

Unnamed: 0,make,basemodel,model,year,vehicle_size_class,cylinders,engine_displacement_liters,t_charger,s_charger,electric_motor,transmission,drive,start_stop,fuel_type,fuel_type1,fuel_type2,phev_blended,city_electricity_consumption_kwhpkm,city_kmpl_for_fuel_type1,city_kmpl_for_fuel_type2,highway_electricity_consumption_kwhpkm,highway_kmpl_for_fuel_type1,highway_kmpl_for_fuel_type2,combined_electricity_consumption_kwhpkm,combined_kmpl_for_fuel_type1,combined_kmpl_for_fuel_type2,hours_to_charge_at_120v,hours_to_charge_at_240v,co2_tailpipe_for_fuel_type1_gpkm,co2_tailpipe_for_fuel_type2_gpkm
21117,Toyota,Prius Prime,Prius Prime,2019,Midsize Cars,4.0,1.8,False,False,22 and 53 kW AC Induction,Automatic (variable gear ratios),Front-Wheel Drive,True,Regular Gas and Electricity,Regular Gasoline,Electricity,True,14.291573,19.47033,51.330870,17.398437,18.762318,42.834726,15.534318,19.116324,47.082798,0.0,2.0,48.467073,0.000000
18085,Toyota,Prius Prime,Prius Prime,2021,Midsize Cars,4.0,1.8,False,False,22 and 53 kW AC Induction,Automatic (variable gear ratios),Front-Wheel Drive,True,Regular Gas and Electricity,Regular Gasoline,Electricity,True,14.291573,19.47033,51.330870,17.398437,18.762318,42.834726,15.534318,19.116324,47.082798,0.0,2.0,48.467073,0.000000
22929,Toyota,Prius Prime,Prius Prime,2022,Midsize Cars,4.0,1.8,False,False,22 and 53 kW AC Induction,Automatic (variable gear ratios),Front-Wheel Drive,True,Regular Gas and Electricity,Regular Gasoline,Electricity,True,14.291573,19.47033,51.330870,17.398437,18.762318,42.834726,15.534318,19.116324,47.082798,0.0,2.0,48.467073,0.000000
23400,Toyota,Prius Prime,Prius Prime,2017,Midsize Cars,4.0,1.8,False,False,16 and 37 kW AC Induction,Automatic (variable gear ratios),Front-Wheel Drive,True,Regular Gas and Electricity,Regular Gasoline,Electricity,True,14.291573,19.47033,51.330870,17.398437,18.762318,42.834726,15.534318,19.116324,47.082798,0.0,2.0,48.467073,0.000000
8901,Toyota,Prius Prime,Prius Prime,2018,Midsize Cars,4.0,1.8,False,False,16 and 37 kW AC Induction,Automatic (variable gear ratios),Front-Wheel Drive,True,Regular Gas and Electricity,Regular Gasoline,Electricity,True,14.291573,19.47033,51.330870,17.398437,18.762318,42.834726,15.534318,19.116324,47.082798,0.0,2.0,48.467073,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23343,Ford,F150,F150 Dual-fuel 2WD (CNG),2002,Standard Pickup Trucks 2WD,8.0,5.4,False,False,,Automatic 4-spd,Rear-Wheel Drive,False,Gasoline or natural gas,Regular Gasoline,Natural Gas,False,0.000000,3.54006,3.540060,0.000000,4.956084,4.602078,0.000000,3.894066,3.894066,0.0,0.0,502.012683,397.113667
9515,Ford,F150,F150 Dual-fuel 4WD (CNG),2002,Standard Pickup Trucks 4WD,8.0,5.4,False,False,,Automatic 4-spd,4-Wheel or All-Wheel Drive,False,Gasoline or natural gas,Regular Gasoline,Natural Gas,False,0.000000,3.54006,3.540060,0.000000,4.956084,4.602078,0.000000,3.894066,3.894066,0.0,0.0,502.012683,397.113667
18791,Ford,F150,F150 Dual-fuel 2WD (CNG),2001,Standard Pickup Trucks 2WD,8.0,5.4,False,False,,Automatic 4-spd,Rear-Wheel Drive,False,Gasoline or natural gas,Regular Gasoline,Natural Gas,False,0.000000,3.54006,3.540060,0.000000,4.956084,4.602078,0.000000,3.894066,3.894066,0.0,0.0,502.012683,397.113667
28035,Ford,F150,F150 Dual-fuel 4WD (CNG),2001,Standard Pickup Trucks 4WD,8.0,5.4,False,False,,Automatic 4-spd,4-Wheel or All-Wheel Drive,False,Gasoline or natural gas,Regular Gasoline,Natural Gas,False,0.000000,3.54006,3.540060,0.000000,4.956084,4.602078,0.000000,3.894066,3.894066,0.0,0.0,502.012683,397.113667


The columns below are selected for analysis, to check whether they have any relation with the target *combined_kmpl_for_fuel_type1*, which is the vehicle's km/L under combined highway and city driving.

In [101]:
cols = [
    "make",
    "basemodel",
    "year",
    "vehicle_size_class",
    "cylinders",
    "engine_displacement_liters",
    "t_charger",
    "s_charger",
    "electric_motor",
    "transmission",
    "drive",
    "start_stop",
    "fuel_type",
    "fuel_type1",
    "fuel_type2",
    "phev_blended"
]

In [102]:
X = data[cols]
Y = data.combined_kmpl_for_fuel_type1

## Feature Selection

I'm going to use mutual information to find the most relevant features for the model. Since the target is continuous, I'll use *mutual_info_regression* from sklearn. Through its *discrete_features* argument, this function also takes the indexes of the columns whose datatype is categorical and/or ordinal.

This method also requires all the columns to be numeric, so I'll encode all ordinal and categorical values with sklearn's *OrdinalEncoder*. The arbitrary integer codes are not a problem here, because mutual information is not sensitive to the actual values of the data; instead it measures how much knowing the feature value reduces the uncertainty about the target value.

**In this case, all the features are either ordinal or categorical, so all of them can go through the OrdinalEncoder, which makes this analysis easier**
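To make the "reduction in uncertainty" idea concrete, here is a minimal sketch that computes mutual information by hand for discrete variables. It is a simplification: the toy data and the helper names (`entropy`, `mutual_information`) are made up for illustration, the target is binned into two labels, and sklearn's `mutual_info_regression` uses a different, nearest-neighbour based estimator for continuous targets.

```python
from collections import Counter
from math import log


def entropy(values):
    """Shannon entropy (in nats) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for two discrete sequences."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


# Toy data (made up): a binned target and two candidate features.
target = ["low", "low", "high", "high", "low", "low", "high", "high"]
cylinders = [8, 8, 4, 4, 8, 8, 4, 4]    # fully determines the target
colour_code = [0, 1, 0, 1, 0, 1, 0, 1]  # independent of the target

mi_cyl = mutual_information(cylinders, target)
mi_col = mutual_information(colour_code, target)

# Relabelling the categories with different integers leaves the score unchanged,
# which is why arbitrary ordinal codes are acceptable for this analysis.
relabelled = [0 if c == 8 else 1 for c in cylinders]
mi_relabelled = mutual_information(relabelled, target)

print(mi_cyl, mi_col, mi_relabelled)
```

In this toy case the informative feature scores ln(2), the full entropy of the target, while the independent feature scores (numerically) zero.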

In [103]:
encoder = OrdinalEncoder()

In [104]:
X_encoded = encoder.fit_transform(X)

In [105]:
X_encoded[:2,:]

array([[ 33., 751.,  17.,   5.,   5.,  25.,   0.,   0., 391.,  19.,   4.,
          0.,  12.,   6.,   3.,   0.],
       [ 22., 390.,  17.,   1.,   5.,  30.,   0.,   0., 391.,  21.,   3.,
          0.,  12.,   6.,   3.,   0.]])
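The integer codes in the array above come from the encoder's default behaviour: for each column, the sorted unique categories are mapped to 0, 1, 2, and so on. The sketch below mimics that mapping for a single column in plain Python; `ordinal_encode` and the sample values are hypothetical and only meant to show where the numbers come from.

```python
def ordinal_encode(column):
    """Map each value to the position of its category in sorted order."""
    categories = sorted(set(column))
    code_of = {category: float(i) for i, category in enumerate(categories)}
    return [code_of[value] for value in column], categories


# Made-up sample column
makes = ["Toyota", "Ford", "Kia", "Ford"]
codes, categories = ordinal_encode(makes)

print(categories)  # ['Ford', 'Kia', 'Toyota']
print(codes)       # [2.0, 0.0, 1.0, 0.0]
```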

In [106]:
mut_info_results = mutual_info_regression(X_encoded, Y, discrete_features=list(range(len(X.columns))))

In [107]:
df_mut_info = pd.DataFrame({
    "feature": X.columns.tolist(),
    "mutual_info_score": mut_info_results,
}).sort_values("mutual_info_score", ascending=False)

In [108]:
df_mut_info

Unnamed: 0,feature,mutual_info_score
1,basemodel,1.368696
5,engine_displacement_liters,0.89783
4,cylinders,0.65483
0,make,0.436854
9,transmission,0.399305
3,vehicle_size_class,0.322091
8,electric_motor,0.296249
10,drive,0.265951
12,fuel_type,0.213465
13,fuel_type1,0.176288


0 < mutual_info_score < 0.5 -> medium relation

Based on this, the features with mutual_info_score >= 0.3 are chosen, which keeps the strongest features plus the upper part of the medium range.

In [109]:
selected_features = df_mut_info.query("mutual_info_score >= 0.3").feature.tolist()

In [110]:
selected_features

['basemodel',
 'engine_displacement_liters',
 'cylinders',
 'make',
 'transmission',
 'vehicle_size_class']

Looking for any monotonic relation (Spearman rank correlation) between the numeric features and the target

In [111]:
data[selected_features + ["combined_kmpl_for_fuel_type1"]].corr(method="spearman", numeric_only=True).combined_kmpl_for_fuel_type1

engine_displacement_liters     -0.860252
cylinders                      -0.834432
combined_kmpl_for_fuel_type1    1.000000
Name: combined_kmpl_for_fuel_type1, dtype: float64
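Spearman correlation is the Pearson correlation computed on ranks, so it detects any monotonic relation, not only a linear one. The sketch below illustrates the difference on made-up numbers where the target falls as `20 / displacement`; the helper functions are hypothetical and assume there are no tied values.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values):
    """Rank positions 1..n; assumes no ties, to keep the sketch short."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up data: strictly decreasing but non-linear relation
displacement = [1.0, 1.6, 2.0, 3.5, 5.4]
kmpl = [20 / d for d in displacement]

rho = spearman(displacement, kmpl)
r = pearson(displacement, kmpl)
print(rho, r)
```

Here the Spearman coefficient reaches -1 (up to rounding) because the relation is perfectly monotonic, while the Pearson coefficient is weaker because the relation is not a straight line.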

## Training model

In [112]:
X = data[selected_features]
Y = data.combined_kmpl_for_fuel_type1

In [113]:
# Split the columns by dtype using the public pandas API
numeric_features = X.select_dtypes(include="number").columns.tolist()
categorical_features = [x for x in X.columns if x not in numeric_features]

In [114]:
categorical_features

['basemodel', 'make', 'transmission', 'vehicle_size_class']

In [115]:
numeric_features

['engine_displacement_liters', 'cylinders']

In [116]:
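col_transformer = ColumnTransformer(transformers=[
    # One-hot encode categorical columns; unseen categories become all zeros
    ("one_hot_cat_vars", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    # Scale numeric columns to the [0, 1] range
    ("min_max_num_vars", MinMaxScaler(), numeric_features),
])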
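As a rough illustration of what the two transformers do, the plain-Python sketch below mimics min-max scaling and one-hot encoding with `handle_unknown="ignore"`. The helper names and sample values are hypothetical; the real transformers learn the minimum, maximum and category list during `fit`.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    return [(value - low) / (high - low) for value in values]


def one_hot(value, known_categories):
    """One indicator per known category; an unknown value yields all zeros."""
    return [1.0 if value == category else 0.0 for category in known_categories]


# Made-up sample values
scaled_cylinders = min_max_scale([4.0, 6.0, 8.0])
known_makes = ["Ford", "Kia", "Toyota"]
seen = one_hot("Kia", known_makes)
unseen = one_hot("Tesla", known_makes)  # category not present when fitting

print(scaled_cylinders)  # [0.0, 0.5, 1.0]
print(seen)              # [0.0, 1.0, 0.0]
print(unseen)            # [0.0, 0.0, 0.0]
```

Ignoring unknown categories means the model can still produce a prediction for a make or base model it never saw, relying only on the remaining features.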

In [117]:
X_processed = col_transformer.fit_transform(X)

In [142]:
# The scorer returns the negated RMSE (higher is better), so the sign is flipped back
rmse_scores = cross_val_score(
    estimator=LinearRegression(),
    X=X_processed,
    y=Y,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
rmse_scores *= -1

In [143]:
rmse_scores

array([1.01421876, 1.12274539, 1.14625531, 1.12453354, 1.16784415])

In [144]:
rmse_scores.mean()

np.float64(1.1151194291147533)
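Note that the cross-validation above runs on `X_processed`, which was transformed by a `ColumnTransformer` fitted on the full dataset, so every validation fold has already influenced the encoder and the scaler. One possible alternative, sketched below on synthetic stand-in data, is to pass the whole pipeline to `cross_val_score` so the preprocessing is refitted on each training split. The toy columns, coefficients and variable names are assumptions for illustration, not the notebook's data or results.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic stand-in data (made up); the notebook would use `X` and `Y` instead.
rng = np.random.default_rng(0)
n_rows = 200
toy_X = pd.DataFrame({
    "make": rng.choice(["Ford", "Kia", "Toyota"], size=n_rows),
    "cylinders": rng.choice([4.0, 6.0, 8.0], size=n_rows),
})
toy_y = 20 - 1.5 * toy_X["cylinders"] + rng.normal(0, 0.5, size=n_rows)

preprocessing = ColumnTransformer(transformers=[
    ("one_hot", OneHotEncoder(handle_unknown="ignore"), ["make"]),
    ("min_max", MinMaxScaler(), ["cylinders"]),
])
pipeline = Pipeline(steps=[
    ("data_processing", preprocessing),
    ("model", LinearRegression()),
])

# Each fold refits the encoder and scaler on its own training split only.
toy_rmse = -cross_val_score(
    pipeline, toy_X, toy_y, cv=5, scoring="neg_root_mean_squared_error"
)
print(toy_rmse.mean())
```

With one-hot encoded high-cardinality columns such as `basemodel`, this variant also shows how the model behaves when a validation fold contains categories that were absent from the training split.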

In [128]:
model = Pipeline(steps=[
    ("data_processing", col_transformer),
    ("model", LinearRegression()),
])

model.fit(X, Y)

## Saving the model

In [129]:
joblib.dump(model, "../model/autonomy_estimator.joblib")

['../model/autonomy_estimator.joblib']

## Testing the model

In [130]:
model = joblib.load("../model/autonomy_estimator.joblib")

In [131]:
x_test = X.iloc[:1,:]
x_test

Unnamed: 0,basemodel,engine_displacement_liters,cylinders,make,transmission,vehicle_size_class
0,Sedona,3.3,6.0,Kia,Automatic (S6),Minivan - 2WD


In [132]:
model.predict(x_test)

array([6.91911714])

In [136]:
data.loc[0,"combined_kmpl_for_fuel_type1"]

np.float64(7.08012)

## Saving the dataset

In [87]:
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
X = X.assign(combined_kmpl_for_fuel_type1=Y)


In [91]:
X.sample(20).to_parquet("../../unit_tests/testing_data.parquet", index = False)