# Predição de Consumo de Combustível

Criação de modelos de aprendizado supervisionado para predizer o consumo de combustível de carros.

## Imports

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import PolynomialFeatures

#from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, cross_val_score

import preprocessing as pre

## Pré-processamento

Aqui foi realizado o carregamento dos dados originais de um arquivo CSV.

In [2]:
original_df = pd.read_csv('data/car_data_original.csv')

Após isso, realizou-se a remoção das instâncias de carros elétricos pois continham muitos atributos faltantes e removeu-se os atributos de consumo em rodovia e cidade pois podem ser altamente colineares com a saída, simplificando demais o trabalho do preditor. Depois, separou-se as instâncias de treinamento/validação das instâncias de teste com o código abaixo.

In [3]:
# Removendo os atributos colineares
no_colinear_df = pre.drop_atributes(original_df, ['highway_mpg', 'city_mpg'])

# Removendo as instâncias com NaN
no_nan_df = pre.remove_instances_with_nan(no_colinear_df)

Choosing the variables to be analyzed (X) and predicted (y)

In [4]:
#df_filtered = original_df.drop(columns=["city_mpg", "highway_mpg"])
X = no_nan_df.drop(columns=["combination_mpg"])
y = no_nan_df["combination_mpg"]

Determinando o tamanho dos conjuntos de Teste e Validação.

In [5]:
val_size = 0.15

Forçando ao menos uma instância de cada categoria apareça no dataset de treino.

Forcing at least one instance of every category to appear in the fossil_fuel dataset.

As in no_electric_cars, there isnt any strategy for dealing with non represented instances during the testing phase, we force every instance to appear at least once in the training case.

In [6]:
train_indices = set()
categorical_features = X.select_dtypes(include=['object', 'bool']).columns
for feature in categorical_features:
    unique_values = X[feature].unique()
    for value in unique_values:
        index = X[X[feature] == value].index[0]
        train_indices.add(index)

train_indices = list(train_indices)
X_train_mandatory = X.loc[train_indices]
y_train_mandatory = y.loc[train_indices]

X = X.drop(train_indices)
y = y.drop(train_indices)

## Divisão de Dados e Treinamento para primeiras Avaliações

Separando o conjunto de Teste antes de começar qualquer treinamento ou avaliação de modelos.

In [7]:
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)

Treinamento com KFold e resultados - falta organizar

In [14]:
# Extra iterations to calculate average and increase reproducibility
n_repeats = 1
k = 9 # K for Knn
Ksplit = 10
plot_flag = False  

methods = [
    "knn",
    "random_forest",
    "linear_regression",
    "neural_networks",
    "svm",
]

for method in methods:
        mae_total = 0
        mse_total = 0
        all_predictions = []  # List to store predictions for each split
        for split_random_state in range(0, n_repeats):
            # Separate validation data and the remaining instances
            X_train, X_val, y_train, y_val  = train_test_split(X, y, test_size=val_size, random_state=split_random_state)
            # Split the remaining instances in Ksplit parts
            kf = KFold(n_splits=Ksplit, shuffle=True, random_state=split_random_state)
            for train_index, test_index in kf.split(X_train):
                X_train, X_test = X_train.iloc[train_index], X_train.iloc[test_index]
                y_train, y_test = y_train.iloc[train_index], y_train.iloc[test_index]
                # If no_electric_cars, join the forced test_case with the random test_case
                # Concatanate the mandatory and training sets
                X_train = pd.concat([X_train_mandatory, X_train])
                y_train = pd.concat([y_train_mandatory, y_train])

                # Preprocessing is sensitive to type
                # Separate analyzed features into numerical and categorical
                # As to apply preprocessing only to valid features
                numerical_features = X_train.select_dtypes(include=['int64', 'float64']).columns
                categorical_features = X_train.select_dtypes(include=['object', 'bool']).columns

                # ColumnTransformer applies preprocessing patterns e.g. StandardScaler() and
                # OneHotEncoder() to groups, e.g. numerical_features and categorical_features
                # Preprocessor will be applied to every dataset
                preprocessor = ColumnTransformer([
                    #("num", StandardScaler(), numerical_features),
                    #("cat", OneHotEncoder(), categorical_features)
                    #("num", PolynomialFeatures(degree=2, include_bias=False), numerical_features),
                    ("num", StandardScaler(), numerical_features),
                    ("cat", OneHotEncoder(), categorical_features)
                ])
                # Select model and configuration for use in the pipeline
                match method:
                    case "knn":
                        model = KNeighborsClassifier(n_neighbors=k)
                    case "random_forest":
                        model = RandomForestRegressor(max_depth=6, random_state=split_random_state)
                    case "linear_regression":
                        model = LinearRegression()
                    case "neural_networks":
                        model = MLPRegressor(random_state=split_random_state, max_iter=2500)
                    case "svm":
                        model = SVR(C=1.0, epsilon=0.2)
                # Pipeline applies the preprocessed dataset to the model for fitting
                pipe = Pipeline([
                    ("preprocessor", preprocessor),
                    ("regressor", model)
                ])

                # Fitting the pipeline
                pipe.fit(X_train, y_train)

                # Prediction and evaluation
                y_pred = pipe.predict(X_test)
                # all_predictions.append(y_pred)

                if plot_flag:
                    plt.figure(figsize=(10, 6))
                    plt.scatter(y_test, y_pred, color='blue', alpha=0.6, label='Average Predicted vs Actual')
                    plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2, label='Ideal Fit Line')
                    plt.xlabel('Actual Combination MPG')
                    plt.ylabel('Average Predicted Combination MPG')
                    plt.title(f'{method.upper()} Model: Average Predicted vs Actual Combination MPG')
                    plt.legend()
                    plt.grid(True)
                    plt.savefig(f'src/plots/{method}_actual_vs_predicted.png')

                mae = mean_absolute_error(y_test, y_pred)
                mse = mean_squared_error(y_test, y_pred)
                mae_total += mae
                mse_total += mse

            #print("Mean Absolute Error (MAE) for iteration {} of {} using the {} method:".format(split_random_state+1, data_case, method), mae)
            #print("Mean Squared Error (MSE) for iteration {} of {} using the {} method:".format(split_random_state+1, data_case, method), mse)
            #print("\n")
        
        #all_predictions = np.array(all_predictions)
        #print("ALL")
        #print(all_predictions)
        #average_predictions = np.mean(all_predictions, axis=0)
        #print("Avg")
        #print(average_predictions)

        mae_med = mae_total/n_repeats
        mse_med = mse_total/n_repeats
        print("Average MAE using the {} method: {}".format(method, mae_med))
        print("Average MSE using the {} method: {}".format( method, mse_med))
        print("\n")

Average MAE using the knn method: 18.384057971014492
Average MSE using the knn method: 106.03260869565219


Average MAE using the random_forest method: 12.557776749103175
Average MSE using the random_forest method: 30.94808938735047


Average MAE using the linear_regression method: 1.0629426136684348
Average MSE using the linear_regression method: 2.292842054181946


Average MAE using the neural_networks method: 1.1927060132634846
Average MSE using the neural_networks method: 2.2425802766288685


Average MAE using the svm method: 12.858357924648157
Average MSE using the svm method: 68.80033668376156


