# Case overview

The goal of this case is to predict CO2 emissions and total energy consumption for buildings where these values are currently missing.

The available predictors are the building characteristics (size and use of the buildings, mention of recent works, construction date, etc.).

The training and test data use readings taken in 2015 and 2016.

The datasets come in two formats:
- JSON: descriptive data (metadata) documenting the raw data stored in CSV format.
- CSV: the raw data used for exploratory analysis, cleaning, and final modelling.
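
As an illustration of how the two formats fit together, the sketch below reads a small CSV table and a JSON metadata document, then checks which columns have no description. The column names, the JSON layout (`columns`, `name`, `description`) and the in-memory strings are all hypothetical stand-ins; the real files may be organised differently.

```python
import io
import json

import pandas as pd

# Hypothetical stand-ins for the real files: names and layout are assumptions.
csv_text = "BuildingID,YearBuilt,FloorArea\n1,1990,5000\n2,2005,12000\n"
json_text = """
{"columns": [
  {"name": "BuildingID", "description": "Unique building identifier"},
  {"name": "YearBuilt", "description": "Year of construction"}
]}
"""

# CSV: the raw data used for analysis and modelling
raw = pd.read_csv(io.StringIO(csv_text))

# JSON: the descriptive data documenting the CSV columns
meta = json.loads(json_text)
descriptions = {col["name"]: col["description"] for col in meta["columns"]}

# Columns present in the raw data but not described in the metadata
undocumented = [c for c in raw.columns if c not in descriptions]
print(undocumented)
```

With real files, `pd.read_csv(path)` and `json.load(open(path))` would replace the in-memory strings.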

The last part consists of building prediction models, optimising their hyperparameters, and selecting the best model.
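
One possible way to carry out the hyperparameter step is a cross-validated grid search. The sketch below uses scikit-learn's `GridSearchCV` on synthetic data; the parameter grid, the scoring choice and the data are assumptions made for illustration, not the settings used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (assumption: stands in for the building features)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Small illustrative grid; real grids are usually wider
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
    cv=3,
)
search.fit(X, y)

# The estimator refitted on all the data with the best parameters
best_model = search.best_estimator_
print(search.best_params_, -search.best_score_)
```

The same search could be run separately for the energy and GHG targets, keeping the test set aside until the final comparison.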

## Library imports

In [19]:
# Standard libraries
import pandas as pd
import numpy as np
import sys
import os
import glob
import warnings
import json
import pickle
import string
from math import sqrt
import re

# Graphic libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning libraries
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
import lightgbm as lgb

# Model evaluation libraries
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score


## Loading the data

In [45]:
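# Training set: (features, target) pairs for energy and GHG, plus the knn and mm objects saved with them
with open("train_data.pkl", "rb") as f:
    [train_energy_data_x, train_energy_data_y], [train_ghg_data_x, train_ghg_data_y], [knn, mm] = pickle.load(f)

# Test set: (features, target) pairs for energy and GHG
with open("test_data.pkl", "rb") as f:
    [test_energy_data_x, test_energy_data_y], [test_ghg_data_x, test_ghg_data_y] = pickle.load(f)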

## Multilinear regression model

In [3]:
# Fit a linear regression on the energy consumption target
lr1 = LinearRegression()
lr1.fit(train_energy_data_x, train_energy_data_y)

# RMSE and R² on the training set
y_train_pred = lr1.predict(train_energy_data_x)
print(sqrt(mean_squared_error(train_energy_data_y, y_train_pred)))
print(r2_score(train_energy_data_y, y_train_pred))

# RMSE and R² on the test set
y_test_pred = lr1.predict(test_energy_data_x)
print(sqrt(mean_squared_error(test_energy_data_y, y_test_pred)))
print(r2_score(test_energy_data_y, y_test_pred))

8620173.642278207
0.8007357385367739
170061123562195.97
-189219545788023.88
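
Every model cell in this notebook prints the same four numbers: RMSE and R² on the training set, then on the test set. A small helper could factor that pattern out. The sketch below is one way to write it; the function name `evaluate` and the synthetic data are assumptions, not part of the original notebook.

```python
from math import sqrt

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


def evaluate(model, x_train, y_train, x_test, y_test):
    """Return RMSE and R² on the training and test sets for a fitted model."""
    scores = {}
    for name, x, y in (("train", x_train, y_train), ("test", x_test, y_test)):
        pred = model.predict(x)
        scores[name] = {
            "rmse": sqrt(mean_squared_error(y, pred)),
            "r2": r2_score(y, pred),
        }
    return scores


# Synthetic data standing in for the notebook's train/test arrays
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.05, size=120)

model = LinearRegression().fit(X[:100], y[:100])
scores = evaluate(model, X[:100], y[:100], X[100:], y[100:])
print(scores)
```

With the notebook's own variables, the call would look like `evaluate(lr1, train_energy_data_x, train_energy_data_y, test_energy_data_x, test_energy_data_y)`.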


In [4]:
# Fit a linear regression on the GHG (CO2 emissions) target
lr2 = LinearRegression()
lr2.fit(train_ghg_data_x, train_ghg_data_y)

# RMSE and R² on the training set
y_train_pred = lr2.predict(train_ghg_data_x)
print(sqrt(mean_squared_error(train_ghg_data_y, y_train_pred)))
print(r2_score(train_ghg_data_y, y_train_pred))

# RMSE and R² on the test set
y_test_pred = lr2.predict(test_ghg_data_x)
print(sqrt(mean_squared_error(test_ghg_data_y, y_test_pred)))
print(r2_score(test_ghg_data_y, y_test_pred))

357.7499052258423
0.6306786809914446
289766720216.53577
-1.2223983200798595e+18


## Random forest model

In [5]:
# Fit a random forest (default hyperparameters) on the energy consumption target
rf1 = RandomForestRegressor()
rf1.fit(train_energy_data_x, train_energy_data_y)

# RMSE and R² on the training set
y_train_pred = rf1.predict(train_energy_data_x)
print(sqrt(mean_squared_error(train_energy_data_y, y_train_pred)))
print(r2_score(train_energy_data_y, y_train_pred))

# RMSE and R² on the test set
y_test_pred = rf1.predict(test_energy_data_x)
print(sqrt(mean_squared_error(test_energy_data_y, y_test_pred)))
print(r2_score(test_energy_data_y, y_test_pred))

4669682.977995387
0.9415247374265059
4172516.031371131
0.8860926009107973


In [6]:
# Fit a random forest (default hyperparameters) on the GHG (CO2 emissions) target
rf2 = RandomForestRegressor()
rf2.fit(train_ghg_data_x, train_ghg_data_y)

# RMSE and R² on the training set
y_train_pred = rf2.predict(train_ghg_data_x)
print(sqrt(mean_squared_error(train_ghg_data_y, y_train_pred)))
print(r2_score(train_ghg_data_y, y_train_pred))

# RMSE and R² on the test set
y_test_pred = rf2.predict(test_ghg_data_x)
print(sqrt(mean_squared_error(test_ghg_data_y, y_test_pred)))
print(r2_score(test_ghg_data_y, y_test_pred))

132.78121302754153
0.9491233406702717
170.12699655281875
0.5786313662792602


## Gradient boosting model

In [46]:
# Fit a LightGBM regressor (default hyperparameters) on the energy consumption target
lgb1 = lgb.LGBMRegressor()
lgb1.fit(train_energy_data_x, train_energy_data_y)

# RMSE and R² on the training set
y_train_pred = lgb1.predict(train_energy_data_x)
print(sqrt(mean_squared_error(train_energy_data_y, y_train_pred)))
print(r2_score(train_energy_data_y, y_train_pred))

# RMSE and R² on the test set
y_test_pred = lgb1.predict(test_energy_data_x)
print(sqrt(mean_squared_error(test_energy_data_y, y_test_pred)))
print(r2_score(test_energy_data_y, y_test_pred))

10415179.618361998
0.7091085786516557
7451728.574802683
0.6366961813768218


In [47]:
# Fit a LightGBM regressor (default hyperparameters) on the GHG (CO2 emissions) target
lgb2 = lgb.LGBMRegressor()
lgb2.fit(train_ghg_data_x, train_ghg_data_y)

# RMSE and R² on the training set
y_train_pred = lgb2.predict(train_ghg_data_x)
print(sqrt(mean_squared_error(train_ghg_data_y, y_train_pred)))
print(r2_score(train_ghg_data_y, y_train_pred))

# RMSE and R² on the test set
y_test_pred = lgb2.predict(test_ghg_data_x)
print(sqrt(mean_squared_error(test_ghg_data_y, y_test_pred)))
print(r2_score(test_ghg_data_y, y_test_pred))

285.8157562657128
0.7642684800642852
219.3757611916516
0.2993630851180944
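
To compare the three models side by side, the test-set R² values printed above can be gathered in a small table. The sketch below copies those values by hand (rounded to three decimals for the tree-based models) and picks the highest score per target. It is only a reading of the reported numbers: the row labels are chosen here for illustration, and a held-out validation set would normally be preferable to the test set for model selection.

```python
import pandas as pd

# Test-set R² values copied from the cell outputs above
test_r2 = pd.DataFrame(
    {
        "energy": {
            "LinearRegression": -189219545788023.88,
            "RandomForestRegressor": 0.886,
            "LGBMRegressor": 0.637,
        },
        "ghg": {
            "LinearRegression": -1.2223983200798595e18,
            "RandomForestRegressor": 0.579,
            "LGBMRegressor": 0.299,
        },
    }
)

# Model with the highest test R² for each target
best = test_r2.idxmax()
print(best)
```

On these numbers the random forest comes out ahead for both targets, while the linear regression's hugely negative test R² shows that it fails to generalise to the test set.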
