# MTH3302 - Probabilistic and Statistical Methods for A.I.
#### Polytechnique Montréal


### Project A2024

-----

# Predicting the fuel consumption of recent cars

### Context

## TODO

### Objective

## TODO

### Data
The data used to infer fuel consumption are the following:

## TODO


In [2]:
using CSV, DataFrames, Statistics, Dates, Gadfly, Combinatorics, Plots, StatsBase, StatsPlots, Random, StatsModels, GLM, LinearAlgebra

In [101]:
full_train = CSV.read("../data/raw/train.csv", DataFrame; delim=";")
test = CSV.read("../data/raw/test.csv", DataFrame; delim=";") # does not contain the consommation variable

Random.seed!(1234) # for reproducibility

ntrain = round(Int, .8*nrow(full_train)) # 80% of the rows are used for training

train_id = sample(1:nrow(full_train), ntrain, replace=false, ordered=true) # random sample of training rows
valid_id = setdiff(1:nrow(full_train), train_id) # validation sample: the rows not drawn for training

train = full_train[train_id, :]  
valid = full_train[valid_id, :]

first(train, 5)



Row,annee,type,nombre_cylindres,cylindree,transmission,boite,consommation
Unnamed: 0_level_1,Int64,String31,Int64,String3,String15,String15,String31
1,2023,voiture_moyenne,8,"4,4",integrale,automatique,"13,8358823529412"
2,2020,VUS_petit,4,2,integrale,automatique,"9,80041666666667"
3,2021,voiture_compacte,6,"3,3",propulsion,automatique,"11,7605"
4,2023,voiture_deux_places,8,5,integrale,automatique,"13,0672222222222"
5,2022,voiture_moyenne,8,"4,4",integrale,automatique,"13,8358823529412"


# 1. Data overview

In [None]:
# Summary of the data
println(describe(train))


# 2. Data exploration

## 2.1 Helpers

In [None]:
function safe_parse_int(x)
    try
        parse(Int, x)
    catch
        missing
    end
end

In [11]:
function safe_parse_float(x)
    try
        return parse(Float64, x)
    catch
        return missing
    end
end

safe_parse_float (generic function with 1 method)
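
The helpers above rely on `try`/`catch`. As a possible alternative, Julia's `tryparse` returns `nothing` instead of throwing, so the decimal-comma replacement and the parsing can be combined in one step. The sketch below is illustrative only: `parse_decimal_comma` is a hypothetical name that does not appear elsewhere in this notebook.

```julia
# Hypothetical helper (not part of the original notebook): parse a
# decimal-comma string into a Float64, returning `missing` when the text
# is not numeric or the input is already missing.
function parse_decimal_comma(x)
    ismissing(x) && return missing
    value = tryparse(Float64, replace(String(x), "," => "."))
    return value === nothing ? missing : value
end

# Small made-up input covering the cases of interest
parsed = parse_decimal_comma.(["4,4", "2", "n/a", missing])
```

Because missing inputs are passed through, such a helper could be broadcast over a column before any call to `dropmissing`.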

## 2.2 Data analysis

In [None]:
data = deepcopy(train)
data = dropmissing(data)

In [None]:
# Replace decimal commas with periods
data.cylindree = replace.(data.cylindree, "," => ".")
data.consommation = replace.(data.consommation, "," => ".")

# Convert 'cylindree' and 'consommation' to Float64
data.cylindree = safe_parse_float.(data.cylindree)
data.consommation = safe_parse_float.(data.consommation)

# Drop rows that failed to parse, so that later statistics do not propagate `missing`
data = dropmissing(data)

In [None]:
# Summary of the data
println(describe(data))

## Correlation between variables

In [None]:
numeric_cols = [:annee, :nombre_cylindres, :cylindree, :consommation]

M = cor(Matrix(data[:, numeric_cols]))

# Display the correlation matrix
println("Correlation matrix:")
println(M)

# Heatmap of the correlation matrix
(n,m) = size(M)
labels = string.(numeric_cols) # pass the tick labels as strings
heatmap(M, fc=cgrad([:white,:dodgerblue4]), xticks=(1:m,labels), xrot=90, yticks=(1:m,labels), yflip=true)
annotate!([(j, i, text(round(M[i,j],digits=3), 8,"Computer Modern",:black)) for i in 1:n for j in 1:m])

1. The correlation between `nombre_cylindres` and `cylindree` is very high, indicating a strong positive relationship: vehicles with more cylinders tend to have a larger displacement.

2. The correlation between `cylindree` and `consommation` is also high: a larger displacement is associated with higher fuel consumption (bigger engines burn more fuel).

3. A similar correlation exists between `nombre_cylindres` and `consommation`, which is expected because the number of cylinders and the displacement are themselves linked.

4. The correlations between `annee` and the other variables are weak and negative, indicating that the number of cylinders, the displacement, and the fuel consumption have decreased slightly over time.
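
To make the meaning of the coefficients discussed above concrete, the sketch below writes the Pearson correlation out from its definition and compares it with `cor` from `Statistics`. The numbers are made up for illustration and are not taken from `train.csv`; `pearson` is a hypothetical helper name.

```julia
using Statistics

# Pearson correlation written out from its definition:
# covariance of x and y divided by the product of their spreads.
function pearson(x, y)
    dx = x .- mean(x)
    dy = y .- mean(y)
    return sum(dx .* dy) / sqrt(sum(dx .^ 2) * sum(dy .^ 2))
end

# Made-up values, for illustration only
cylinders    = [3.0, 4.0, 4.0, 6.0, 8.0]
displacement = [1.5, 2.0, 2.4, 3.5, 5.0]

r_manual  = pearson(cylinders, displacement)
r_builtin = cor(cylinders, displacement)
```

Two predictors that are this strongly correlated carry largely the same information, which is one reason a penalized fit such as the ridge regression of section 4 can be worth trying.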

## Fuel consumption by vehicle type

In [None]:
set_default_plot_size(20cm, 20cm)
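# Plot `data`, whose consommation column has been parsed to Float64 (it is still a string in `train`)
Gadfly.plot(data, x=:type, y=:consommation, Geom.boxplot)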

In [None]:
unique_categories = unique(skipmissing(data[:, :type]))
occurences = [sum(skipmissing(data[:, :type]) .== category) for category in unique_categories]
occurences = DataFrame(category = unique_categories, occurences = occurences)
occurences = occurences[occurences.occurences .> 10, :] #TODO INVESTIGATE 
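
The comprehension above scans the column once per category. A single pass is enough to build the same tally; the sketch below uses only Base Julia, with a hypothetical helper `tally` and a made-up sample. `countmap` from StatsBase, which this notebook already loads, should give an equivalent result on the real column.

```julia
# Hypothetical helper: count each non-missing value in a single pass.
function tally(values)
    counts = Dict{String,Int}()
    for v in skipmissing(values)
        key = String(v)
        counts[key] = get(counts, key, 0) + 1
    end
    return counts
end

# Made-up sample, for illustration only
types = ["VUS_petit", "voiture_moyenne", "VUS_petit", missing, "VUS_petit"]
counts = tally(types)

# Keep only the categories seen more than twice (threshold chosen for this toy sample)
frequent = filter(p -> p.second > 2, counts)
```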

Fuel consumption by year for mid-size cars (`voiture_moyenne`):

In [None]:
set_default_plot_size(20cm, 20cm)
vehicule_moyenne = filter(row -> row.type == "voiture_moyenne", data)
Gadfly.plot(vehicule_moyenne, x=:annee, y=:consommation, color=:type, Geom.point, Geom.smooth(method=:loess), Guide.xlabel("Year"), Guide.ylabel("Fuel consumption (L/100 km)"), Guide.colorkey("Type"))

Fuel consumption by year for small SUVs (`VUS_petit`):

In [None]:
set_default_plot_size(20cm, 20cm)
vehicule_VUSp = filter(row -> row.type == "VUS_petit", data)
Gadfly.plot(vehicule_VUSp, x=:annee, y=:consommation, color=:type, Geom.point, Geom.smooth(method=:loess), Guide.xlabel("Year"), Guide.ylabel("Fuel consumption (L/100 km)"), Guide.colorkey("Type"))

Fuel consumption by year for compact cars (`voiture_compacte`):

In [None]:
set_default_plot_size(20cm, 20cm)
voiture_compacte = filter(row -> row.type == "voiture_compacte", data)
Gadfly.plot(voiture_compacte, x=:annee, y=:consommation, color=:type, Geom.point, Geom.smooth(method=:loess), Guide.xlabel("Year"), Guide.ylabel("Fuel consumption (L/100 km)"), Guide.colorkey("Type"))

Fuel consumption by year for two-seaters (`voiture_deux_places`):

In [None]:
set_default_plot_size(20cm, 20cm)
voiture_deux_places = filter(row -> row.type == "voiture_deux_places", data)
Gadfly.plot(voiture_deux_places, x=:annee, y=:consommation, color=:type, Geom.point, Geom.smooth(method=:loess), Guide.xlabel("Year"), Guide.ylabel("Fuel consumption (L/100 km)"), Guide.colorkey("Type"))

Fuel consumption by year for standard pickup trucks (`camionnette_standard`):

In [None]:
set_default_plot_size(20cm, 20cm)
camionnette_standard = filter(row -> row.type == "camionnette_standard", data)
Gadfly.plot(camionnette_standard, x=:annee, y=:consommation, color=:type, Geom.point, Geom.smooth(method=:loess), Guide.xlabel("Year"), Guide.ylabel("Fuel consumption (L/100 km)"), Guide.colorkey("Type"))

Fuel consumption by year for minicompact cars (`voiture_minicompacte`):

In [None]:
set_default_plot_size(20cm, 20cm)
voiture_minicompacte = filter(row -> row.type == "voiture_minicompacte", data)
Gadfly.plot(voiture_minicompacte, x=:annee, y=:consommation, color=:type, Geom.point, Geom.smooth(method=:loess), Guide.xlabel("Year"), Guide.ylabel("Fuel consumption (L/100 km)"), Guide.colorkey("Type"))

Fuel consumption by year for standard SUVs (`VUS_standard`):

In [None]:
set_default_plot_size(20cm, 20cm)
VUS_standard = filter(row -> row.type == "VUS_standard", data)
Gadfly.plot(VUS_standard, x=:annee, y=:consommation, color=:type, Geom.point, Geom.smooth(method=:loess), Guide.xlabel("Year"), Guide.ylabel("Fuel consumption (L/100 km)"), Guide.colorkey("Type"))

Fuel consumption by year for subcompact cars (`voiture_sous_compacte`):

In [None]:
set_default_plot_size(20cm, 20cm)
voiture_sous_compacte = filter(row -> row.type == "voiture_sous_compacte", data)
Gadfly.plot(voiture_sous_compacte, x=:annee, y=:consommation, color=:type, Geom.point, Geom.smooth(method=:loess), Guide.xlabel("Year"), Guide.ylabel("Fuel consumption (L/100 km)"), Guide.colorkey("Type"))

## Fuel consumption by displacement

## Fuel consumption by number of cylinders

## Fuel consumption by year //TODO add a note explaining that no further investigation is needed

# 3. Linear regression

In [None]:
Random.seed!(1234) # for reproducibility

ntrain = round(Int, .8*nrow(full_train)) # 80% of the rows are used for training

train_id = sample(1:nrow(full_train), ntrain, replace=false, ordered=true) # random sample of training rows
valid_id = setdiff(1:nrow(full_train), train_id) # validation sample: the rows not drawn for training

train = full_train[train_id, :]  
valid = full_train[valid_id, :]

first(train, 5)

In [None]:
# Replace decimal commas with periods
train.cylindree = replace.(train.cylindree, "," => ".")
valid.cylindree = replace.(valid.cylindree, "," => ".")
train.consommation = replace.(train.consommation, "," => ".")
valid.consommation = replace.(valid.consommation, "," => ".")

# Convert 'cylindree' and 'consommation' to Float64
train.cylindree = safe_parse_float.(train.cylindree)
valid.cylindree = safe_parse_float.(valid.cylindree)
train.consommation = safe_parse_float.(train.consommation)
valid.consommation = safe_parse_float.(valid.consommation)

In [None]:
println(describe(train))

In [None]:
# Drop the categorical columns: type, transmission, boite
train = select(train, Not([:type, :transmission, :boite]))
valid = select(valid,  Not([:type, :transmission, :boite]))


In [None]:
model = GLM.lm(@formula(consommation ~ annee + cylindree + nombre_cylindres), train)
# Predict on the validation set
valid_prediction = GLM.predict(model, valid)
# Mean of the non-missing predictions (a plain mean would return `missing` if any prediction is missing)
mean_prediction = mean(skipmissing(valid_prediction))
# Replace missing predictions with that mean
valid_prediction = coalesce.(valid_prediction, mean_prediction)
# Optionally round the predictions to integers
#v = Int.(round.(valid_prediction, digits=0)) # TODO: comment on how rounding changes the RMSE
# Compute the RMSE, skipping rows whose observed consumption is missing
rmse = sqrt(mean(skipmissing((valid_prediction .- valid.consommation).^2)))
println("RMSE: ", rmse)
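
As a point of reference for what the cell above computes, the sketch below solves a tiny least-squares problem with the backslash operator and evaluates the root-mean-square error, $\sqrt{\tfrac{1}{n}\sum_i (\hat{y}_i - y_i)^2}$. The data are made up and noise-free, so the fit is exact; it is not a substitute for `GLM.lm`, only an illustration of the two quantities involved.

```julia
using LinearAlgebra, Statistics

# Root-mean-square error between predictions and observed values
rmse_value(pred, truth) = sqrt(mean((pred .- truth) .^ 2))

# Made-up data following an exact linear relation y = 4 + 2x
x = [1.5, 2.0, 2.4, 3.5, 5.0]
y = 4.0 .+ 2.0 .* x

# Design matrix: intercept column followed by the predictor
X_demo = hcat(ones(length(x)), x)

# Least-squares coefficients (intercept, slope)
beta_demo = X_demo \ y
fit_error = rmse_value(X_demo * beta_demo, y)
```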

# 4. Bayesian regression

In [102]:
Random.seed!(1234) # for reproducibility

ntrain = round(Int, .8*nrow(full_train)) # 80% of the rows are used for training

train_id = sample(1:nrow(full_train), ntrain, replace=false, ordered=true) # random sample of training rows
valid_id = setdiff(1:nrow(full_train), train_id) # validation sample: the rows not drawn for training

train = full_train[train_id, :]
valid = full_train[valid_id, :]

first(train, 5)

Row,annee,type,nombre_cylindres,cylindree,transmission,boite,consommation
Unnamed: 0_level_1,Int64,String31,Int64,String3,String15,String15,String31
1,2023,voiture_moyenne,8,"4,4",integrale,automatique,"13,8358823529412"
2,2020,VUS_petit,4,2,integrale,automatique,"9,80041666666667"
3,2021,voiture_compacte,6,"3,3",propulsion,automatique,"11,7605"
4,2023,voiture_deux_places,8,5,integrale,automatique,"13,0672222222222"
5,2022,voiture_moyenne,8,"4,4",integrale,automatique,"13,8358823529412"


In [103]:
train.cylindree = replace.(train.cylindree, "," => ".")
valid.cylindree = replace.(valid.cylindree, "," => ".")
train.consommation = replace.(train.consommation, "," => ".")
valid.consommation = replace.(valid.consommation, "," => ".")

train.cylindree = safe_parse_float.(train.cylindree)
valid.cylindree = safe_parse_float.(valid.cylindree)
train.consommation = safe_parse_float.(train.consommation)
valid.consommation = safe_parse_float.(valid.consommation)

train = dropmissing(train)

Row,annee,type,nombre_cylindres,cylindree,transmission,boite,consommation
Unnamed: 0_level_1,Int64,String31,Int64,Float64,String15,String15,Float64
1,2023,voiture_moyenne,8,4.4,integrale,automatique,13.8359
2,2020,VUS_petit,4,2.0,integrale,automatique,9.80042
3,2021,voiture_compacte,6,3.3,propulsion,automatique,11.7605
4,2023,voiture_deux_places,8,5.0,integrale,automatique,13.0672
5,2022,voiture_moyenne,8,4.4,integrale,automatique,13.8359
6,2022,voiture_moyenne,8,4.4,integrale,automatique,13.8359
7,2022,voiture_minicompacte,3,1.5,traction,automatique,7.35031
8,2024,voiture_minicompacte,3,1.5,traction,manuelle,7.58742
9,2020,VUS_standard,6,3.8,integrale,automatique,11.2005
10,2019,voiture_compacte,6,3.3,propulsion,automatique,11.7605


In [104]:
# List of categorical columns
categorical_cols = [:type, :transmission, :boite]

# One-hot encoding: one 0/1 column per level, added in place
function one_hot_encode(df, cols)
    for col in cols
        levels_col = unique(df[!, col])
        for level in levels_col
            new_col = Symbol(string(col) * "_" * string(level))
            df[!, new_col] = ifelse.(df[!, col] .== level, 1.0, 0.0)
        end
        # Remove the original column
        select!(df, Not(col))
    end
    return df
end

# Apply the encoding to the training and test data.
# Note: the levels are read from each data frame separately, so train and test
# can end up with different columns or a different column order.
train = one_hot_encode(train, categorical_cols)
test = one_hot_encode(test, categorical_cols)

Row,annee,nombre_cylindres,cylindree,type_voiture_moyenne,type_VUS_petit,type_voiture_sous_compacte,type_voiture_deux_places,type_camionnette_standard,type_VUS_standard,type_voiture_compacte,type_voiture_grande,type_voiture_minicompacte,type_monospace,type_break_petit,type_break_moyen,type_camionnette_petit,transmission_traction,transmission_4x4,transmission_propulsion,transmission_integrale,boite_manuelle,boite_automatique
Unnamed: 0_level_1,Int64,Int64,String3,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,2014,4,"2,5",1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
2,2014,4,"2,5",1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
3,2014,4,"2,5",0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
4,2014,4,2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
5,2014,8,"5,8",0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
6,2014,8,5,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
7,2014,8,5,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
8,2014,4,"2,4",0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
9,2014,6,"3,5",0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
10,2014,10,"5,2",0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
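
Since `one_hot_encode` reads the levels from whichever data frame it receives, two data sets can come out with different columns. One way around this, sketched below under the assumption that the levels are fixed once from the training data, is to encode every data set against the same ordered list of levels. `one_hot` and `boite_levels` are hypothetical names and the vectors are made up.

```julia
# Hypothetical helper: encode `values` against a fixed list of `levels`,
# so every data set gets the same columns in the same order.
# Rows are observations, columns follow the order of `levels`.
function one_hot(values, levels)
    return [v == l ? 1.0 : 0.0 for v in values, l in levels]
end

# Made-up columns, for illustration only
train_boite = ["automatique", "manuelle", "automatique"]
test_boite  = ["manuelle", "manuelle"]   # "automatique" never appears here

# Levels are decided once, from the training data
boite_levels = sort(unique(train_boite))

E_train = one_hot(train_boite, boite_levels)
E_test  = one_hot(test_boite, boite_levels)
```

With this layout a level that is absent from the test data still gets its (all-zero) column, and a level never seen in training produces an all-zero row rather than a new column.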


Preparing X and y

In [105]:
X = select(train, Not(:consommation))
# Convert to a numeric matrix first, then append a column of ones for the intercept
X = hcat(Matrix{Float64}(X), ones(nrow(X)))
y = Vector(train.consommation)

317-element Vector{Float64}:
 13.8358823529412
  9.80041666666667
 11.7605
 13.0672222222222
 13.8358823529412
 13.8358823529412
  7.3503125
  7.58741935483871
 11.2004761904762
 11.7605
  ⋮
 12.3794736842105
  9.80041666666667
  7.84033333333333
  8.71148148148148
 10.2265217391304
 12.3794736842105
 11.2004761904762
  7.84033333333333
  9.04653846153846

In [106]:
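# Column names of the predictors, in the same order as the columns of X
feature_names = names(select(train, Not(:consommation)))
# Find the indices of the numeric variables.
# `names` returns Strings, so the lookup list must hold Strings too: the recorded
# run compared them with Symbols, never matched, and returned `Int64[]`.
numeric_features = ["cylindree", "annee", "nombre_cylindres"]
numeric_indices = findall(x -> x in numeric_features, feature_names)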

Standardizing the variables

In [107]:
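means = mean(X[:, numeric_indices], dims=1)
stds = std(X[:, numeric_indices], dims=1)

# Standardize the numeric columns of X.
# The recorded output below (`317×0`) shows that `numeric_indices` was empty in that run,
# so no column was actually standardized.
X[:, numeric_indices] = (X[:, numeric_indices] .- means) ./ stds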

317×0 Matrix{Float64}
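
In the cell above the means and standard deviations are computed on all of `X`, before the split that follows. A common alternative, sketched here on made-up matrices, is to estimate them on the fitting rows only and reuse the same values for any other rows. `X_fit` and `X_new` are hypothetical names.

```julia
using Statistics

# Made-up matrices: columns play the role of annee and nombre_cylindres
X_fit = [2019.0 4.0; 2021.0 6.0; 2023.0 8.0]
X_new = [2021.0 4.0]

# Column-wise statistics estimated on the fitting rows only
mu    = mean(X_fit, dims=1)
sigma = std(X_fit, dims=1)

# The same shift and scale are applied to both sets
Z_fit = (X_fit .- mu) ./ sigma
Z_new = (X_new .- mu) ./ sigma
```

Here each column of `X_fit` has a sample standard deviation of 2, so `Z_new` comes out as `[0.0 -1.0]`.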

Partitioning the data

In [108]:
Random.seed!(1234)
n = size(X, 1)
indices = randperm(n)
n_train = Int(round(0.8 * n))
train_indices = indices[1:n_train]
valid_indices = indices[n_train+1:end]

X_train = X[train_indices, :]
y_train = y[train_indices]
X_valid = X[valid_indices, :]
y_valid = y[valid_indices]

63-element Vector{Float64}:
 10.6913636363636
  4.52326923076923
 10.6913636363636
  8.40035714285714
  8.40035714285714
 10.2265217391304
 13.8358823529412
 12.3794736842105
 12.3794736842105
  8.11068965517241
  ⋮
 12.3794736842105
  8.11068965517241
  7.58741935483871
  7.84033333333333
 13.8358823529412
 10.2265217391304
 10.2265217391304
 11.2004761904762
 11.7605

Ridge regression
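
For reference, the quantities built in the next cell feed the closed-form ridge estimator. A sketch of the formula, assuming the penalty matrix `I_reg` defined further down (the identity with a zero in the position of the intercept column) and writing it as $D$:

$$
\hat{\beta}_\lambda = \left(X^\top X + \lambda D\right)^{-1} X^\top y, \qquad D = \operatorname{diag}(1, \dots, 1, 0)
$$

This is also what links ridge to the Bayesian heading of this section: under a Gaussian likelihood $y \mid \beta \sim \mathcal{N}(X\beta, \sigma^2 I)$, independent priors $\beta_j \sim \mathcal{N}(0, \tau^2)$ on the penalized coefficients and a flat prior on the intercept, the posterior mode is $\hat{\beta}_\lambda$ with $\lambda = \sigma^2 / \tau^2$.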

In [None]:
XtX = X_train' * X_train
Xty = X_train' * y_train
n_features = size(X_train, 2)


Estimating lambda

In [110]:
lambda_values = 10 .^ range(-5, stop=5, length=100)
best_rmse = Inf
best_lambda = 0.0
best_beta = nothing

I_reg = Diagonal(vcat(ones(n_features - 1), [0.0]))

for λ in lambda_values
    beta = (XtX + λ * I_reg) \ Xty
    y_pred_valid = X_valid * beta
    rmse = sqrt(mean((y_pred_valid - y_valid).^2))
    println("Lambda: ", λ, " RMSE: ", rmse)
    if rmse < best_rmse
        best_rmse = rmse
        best_lambda = λ
        best_beta = beta
    end
end

println("Best lambda: ", best_lambda)
println("Best RMSE: ", best_rmse)
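
The loop above picks the lambda with the lowest RMSE on `X_valid`, which is then also the set used to report the final error, so that error is likely to be optimistic. The sketch below shows one possible arrangement on synthetic data, with a separate block of rows held back for the final estimate. All names (`ridge_fit`, `X_sel`, `X_eval`, ...) and the data are assumptions made for illustration.

```julia
using LinearAlgebra, Statistics, Random

# Ridge fit with an unpenalized last column (the intercept), mirroring I_reg
function ridge_fit(X, y, lambda)
    p = size(X, 2)
    D = Diagonal(vcat(ones(p - 1), 0.0))
    return (X' * X + lambda * D) \ (X' * y)
end

rmse_value(pred, truth) = sqrt(mean((pred .- truth) .^ 2))

# Synthetic data, for illustration only: three predictors plus an intercept column
rng = MersenneTwister(1234)
X_all = hcat(randn(rng, 60, 3), ones(60))
y_all = X_all * [1.0, -2.0, 0.5, 10.0] .+ 0.1 .* randn(rng, 60)

X_fit,  y_fit  = X_all[1:40, :],  y_all[1:40]    # used to estimate beta
X_sel,  y_sel  = X_all[41:50, :], y_all[41:50]   # used only to choose lambda
X_eval, y_eval = X_all[51:60, :], y_all[51:60]   # held back for the final estimate

lambdas = 10 .^ range(-5, stop=5, length=21)
scores = [rmse_value(X_sel * ridge_fit(X_fit, y_fit, l), y_sel) for l in lambdas]
chosen_lambda = lambdas[argmin(scores)]

final_rmse = rmse_value(X_eval * ridge_fit(X_fit, y_fit, chosen_lambda), y_eval)
```

Setting `lambda` to zero in `ridge_fit` gives back the ordinary least-squares solution, which is a quick sanity check on the penalty matrix.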

In [111]:
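# Final fit with the selected lambda
XtX_full = X_train' * X_train
Xty_full = X_train' * y_train
beta_final = (XtX_full + best_lambda * I_reg) \ Xty_full

# Predictions on the validation set.
# The recorded value below was computed with `y_pred`, a variable that this notebook
# never defines; the RMSE must use `y_valid_pred`.
y_valid_pred = X_valid * beta_final
rmse = sqrt(mean((y_valid_pred - y_valid).^2))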

1.690019472894159e63

# Validation by k-fold cross-validation

In [112]:
# Start from a parsed copy of the full training set. The recorded run used
# vcat(train, valid), which raised an ArgumentError because train had been
# one-hot encoded in section 4 while valid had not.
data_k_folds = dropmissing(full_train)
data_k_folds.cylindree = safe_parse_float.(replace.(data_k_folds.cylindree, "," => "."))
data_k_folds.consommation = safe_parse_float.(replace.(data_k_folds.consommation, "," => "."))
data_k_folds = dropmissing(data_k_folds)

n = nrow(data_k_folds)
k = 5
fold_size = n ÷ k # if n is not a multiple of k, the leftover rows are never held out

indices = randperm(n)

rms_scores = Float64[]

for i in 0:(k-1)
    test_indices = indices[(i*fold_size + 1):min((i+1)*fold_size, n)]
    train_indices = setdiff(indices, test_indices)

    train_data = data_k_folds[train_indices, :]
    test_data = data_k_folds[test_indices, :]

    # Fit on the training folds only, so that the held-out fold stays unseen
    model = lm(@formula(consommation ~ annee + cylindree + nombre_cylindres), train_data)

    valid_prediction = GLM.predict(model, test_data)

    # Fuel consumption cannot be negative
    v = max.(valid_prediction, 0)

    score = sqrt(mean((v .- test_data.consommation).^2))
    push!(rms_scores, score)
end

moyenne_rmse = mean(rms_scores)
println("Mean RMSE: $moyenne_rmse")
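
With `fold_size = n ÷ k`, any rows left over when `n` is not a multiple of `k` never land in a held-out fold. The sketch below shows one possible way to build folds whose sizes differ by at most one, using only the `Random` standard library. `kfold_indices` is a hypothetical helper and the sizes are made up.

```julia
using Random

# Hypothetical helper: split 1:n into k shuffled folds whose sizes differ by at most one.
function kfold_indices(n, k; rng=Random.default_rng())
    shuffled = randperm(rng, n)
    # Fold i takes every k-th shuffled index, starting at position i
    return [shuffled[i:k:n] for i in 1:k]
end

# Made-up sizes: 23 rows, 5 folds
folds = kfold_indices(23, 5; rng=MersenneTwister(1234))

# For the first iteration: fold 1 is held out, the others are used for fitting
held_out = folds[1]
fit_idx  = reduce(vcat, folds[2:end])
```

Every row index appears in exactly one fold, so each observation is held out once over the `k` iterations.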

###### TODO CONCLUSION