## ASHRAE Energy Predictions

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Problem statement and project goal
Investment in real estate is increasingly directed at reducing the energy consumption of buildings and improving their environmental impact.

Building owners can therefore obtain financing based on the difference between the building's actual energy consumption and the consumption it would have had without any retrofit. However, the consumption the building would have had without renovation cannot be observed.

To solve this problem, counterfactual models are developed to estimate the energy consumption of a building in the absence of renovation work.

The goal of this lab is to build these counterfactual models for four types of energy (chilled water, electricity, hot water and steam) from historical usage rates and observed weather conditions. Concretely, the task is to predict the values of the meter_reading variable for 1449 buildings.
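As a minimal illustration of the financing idea, the sketch below computes savings as the difference between a counterfactual prediction and the metered consumption. The numbers and the `estimated_savings` helper are hypothetical and are not part of the dataset.

```python
import numpy as np

def estimated_savings(counterfactual, actual):
    """Savings = consumption predicted without retrofit - metered consumption."""
    counterfactual = np.asarray(counterfactual, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return counterfactual - actual

# Hypothetical hourly values, in the same unit as meter_reading
predicted_without_retrofit = [120.0, 130.0, 125.0]
metered_after_retrofit = [100.0, 105.0, 110.0]

savings = estimated_savings(predicted_without_retrofit, metered_after_retrofit)
total_savings = savings.sum()
print(total_savings)
```

The quality of the savings estimate depends entirely on the quality of the counterfactual prediction, which is why the rest of the notebook focuses on predicting meter_reading.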

# Exploratory data analysis

## Importing the data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import scipy.stats
import gc
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
    

In [None]:
train = pd.read_csv('/content/drive/MyDrive/Kaggle/train.csv')
weather_train = pd.read_csv('/content/drive/MyDrive/Kaggle/weather_train.csv')
building_metadata = pd.read_csv('/content/drive/MyDrive/Kaggle/building_metadata.csv')

## Data description
The dataset comprises three years of hourly meter readings from more than a thousand buildings at several sites around the world. The data fall into two groups: weather data and building-level data. Weather data are shared by all buildings at the same site, whereas building-level data are specific to each building.

The weather data are stored in weather_train.csv and weather_test.csv and cover temperature, precipitation depth, wind speed and wind direction.

The building-level data are stored in building_metadata.csv and cover the primary use of the building, its gross floor area, its number of floors and the year it opened.

The train.csv file contains the readings of the four meter types (electricity, chilled water, steam and hot water), that is, the consumption of each of these energies for each building.

## Study of each data table



### Analysis of the train table

This table contains the following 4 variables:

- building_id: identifies each building
- meter: the meter type. There are 4: electricity, chilled water, hot water and steam
- timestamp: date and time of the meter reading; hourly from 2016-01-01 00:00:00 to 2016-12-31 23:00:00
- meter_reading: the recorded energy consumption

In [None]:
train.info()

To get an idea of the distribution of the variable of interest, energy consumption (meter_reading), we plot a histogram.

In [None]:
plt.hist(train['meter_reading'])

This variable contains a large number of zeros. However, the histogram does not show the distribution of the non-zero values well. As a first step, we study the zero meter readings separately from the others.

In [None]:
train['meter_reading'].describe()

The description of this variable shows that the mean is far from zero. We therefore look at the percentage of values equal to zero.

In [None]:
train[train['meter_reading']==0].shape[0] / train.shape[0]

About 9.3% of the meter readings are equal to zero. We cannot say exactly what causes this. Some of these readings may be zero because no energy was consumed, for example because chilled water is not needed in winter.

In [None]:
len(train['meter'])

In [None]:
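# Map the integer meter codes to readable names (assignment instead of an in-place call on a column)
train['meter'] = train['meter'].replace({0:"electricity",1:"chilledwater",2:"steam",3:"hotwater"})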

In [None]:
meter_dict = {}
for i in train['meter'].unique():
    percent = round(train[train['meter_reading']== 0]['meter'].value_counts()[i] / train['meter'].value_counts()[i],2)
    meter_dict[i] = percent
zero_meter = pd.Series(meter_dict)
sns.barplot(x=zero_meter.index, y= zero_meter)
plt.title("Meters percentage having zero readings")
plt.show()

The plot above shows that the hot water meter has a larger share of zeros than the other meters. However, since we are dealing with time series, this may not be the best way to look at this variable. The following sections study it in more detail.
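The share of zeros per meter can also be obtained without an explicit loop. The sketch below uses a small toy table (hypothetical values standing in for train) and takes the mean of a boolean column within each meter group. Unlike indexing into `value_counts()`, this form does not raise a `KeyError` when a meter has no zero reading at all.

```python
import pandas as pd

# Toy data standing in for train (hypothetical values)
toy = pd.DataFrame({
    "meter": ["electricity", "electricity", "steam", "steam", "hotwater", "hotwater"],
    "meter_reading": [0.0, 12.5, 3.0, 7.0, 0.0, 0.0],
})

# Share of zero readings per meter: mean of a boolean column within each group
zero_share = toy["meter_reading"].eq(0).groupby(toy["meter"]).mean()
print(zero_share)
```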

We first convert the timestamp variable to a datetime object, then derive the month of each date so that we can study the monthly trends of meter_reading.

In [None]:
train['meter'].unique()

train['timestamp2'] = pd.to_datetime(train["timestamp"])
train['month'] = train.timestamp2.dt.month

We then plot, separately for each meter, the zero readings of meter_reading by month, to see when in the year they occur.

In [None]:
fig, axs = plt.subplots(2,2, sharey=True, tight_layout=True,figsize=(10,6))

axs[0][0].hist(x ="month",data =train[(train.meter_reading == 0) & (train.meter=="electricity")],bins =12,color = "navajowhite")
axs[0][0].set_title("For electricity")

axs[0][1].hist(x ="month",data =train[(train.meter_reading == 0) & (train.meter=="chilledwater")],bins =12,color = "skyblue")
axs[0][1].set_title("For chilled water")

axs[1][0].hist(x ="month",data =train[(train.meter_reading == 0) & (train.meter=="steam")],bins =12,color = "slategrey")
axs[1][0].set_title("For steam")

axs[1][1].hist(x ="month",data =train[(train.meter_reading == 0) & (train.meter=="hotwater")],bins =12,color = "lightcoral")
axs[1][1].set_title("For hot water")

As the plots show, and as expected, the number of zero readings changes from month to month and also depends on the meter type. For the electricity meter, zero readings occur mostly from the beginning of the year until around May; from June onwards they decrease. Zero readings for steam and hot water follow roughly the same trend over the year. Zero readings for chilled water are also seasonal.
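The monthly counts behind such histograms can also be tabulated directly. The following sketch builds a month by meter table of zero readings with `pd.crosstab`; the `toy` table and its values are hypothetical.

```python
import pandas as pd

# Toy data standing in for train (hypothetical values)
toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-01-01 00:00", "2016-01-15 03:00",
        "2016-06-01 12:00", "2016-06-02 12:00",
    ]),
    "meter": ["electricity", "chilledwater", "electricity", "chilledwater"],
    "meter_reading": [0.0, 0.0, 5.0, 0.0],
})
toy["month"] = toy["timestamp"].dt.month

# Keep zero readings only, then count them per month and meter type
zeros = toy[toy["meter_reading"] == 0]
zero_counts = pd.crosstab(zeros["month"], zeros["meter"])
print(zero_counts)
```

A table like this makes it easier to compare exact counts between months than reading bar heights off four separate plots.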

If zero readings are not separated from the other readings, as in the plots below, their effect is not visible.

In [None]:
fig, axs = plt.subplots(2,2, sharey=True, tight_layout=True,figsize=(10,6))

axs[0][0].hist(x ="month",data =train[(train.meter=="electricity")],bins =12,color = "navajowhite")
axs[0][0].set_title("For electricity")

axs[0][1].hist(x ="month",data =train[(train.meter=="chilledwater")],bins =12,color = "skyblue")
axs[0][1].set_title("For chilled water")

axs[1][0].hist(x ="month",data =train[(train.meter=="steam")],bins =12,color = "slategrey")
axs[1][0].set_title("For steam")

axs[1][1].hist(x ="month",data =train[(train.meter=="hotwater")],bins =12,color = "lightcoral")
axs[1][1].set_title("For hot water")

Non-zero readings are spread fairly uniformly over the year, as shown below.

In [None]:
fig, axs = plt.subplots(2,2, sharey=True, tight_layout=True,figsize=(10,6))

axs[0][0].hist(x ="month",data =train[(train.meter_reading != 0) & (train.meter=="electricity")],bins =12,color = "navajowhite")
axs[0][0].set_title("For electricity")

axs[0][1].hist(x ="month",data =train[(train.meter_reading != 0) & (train.meter=="chilledwater")],bins =12,color = "skyblue")
axs[0][1].set_title("For chilled water")

axs[1][0].hist(x ="month",data =train[(train.meter_reading != 0) & (train.meter=="steam")],bins =12,color = "slategrey")
axs[1][0].set_title("For steam")

axs[1][1].hist(x ="month",data =train[(train.meter_reading != 0) & (train.meter=="hotwater")],bins =12,color = "lightcoral")
axs[1][1].set_title("For hot water")

In [None]:
# Density of meter_reading for each meter type
sns.kdeplot(train.loc[(train['meter']=='electricity'), 
            'meter_reading'], color='yellow', label='electricity')

sns.kdeplot(train.loc[(train['meter']=='chilledwater'), 
            'meter_reading'], color='b', label='chilledwater')

sns.kdeplot(train.loc[(train['meter']=='steam'), 
            'meter_reading'], color='gray', label='steam')

sns.kdeplot(train.loc[(train['meter']=='hotwater'), 
            'meter_reading'], color='r', label='hotwater')

plt.xlabel('meter_reading') 
plt.ylabel('Probability Density') 
plt.legend()

We apply a log transformation to meter_reading to make the data less skewed and to obtain a readable density plot.
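The transformation used here is `log(1 + x)`, which is defined at zero, and `np.expm1` is its inverse. The short sketch below checks the round trip on a few illustrative values.

```python
import numpy as np

# Illustrative readings, including zero and a very large value
readings = np.array([0.0, 18.3, 78.775, 21904700.0])

logged = np.log1p(readings)    # log(1 + x), equals 0 at x = 0
restored = np.expm1(logged)    # inverse transform: exp(y) - 1

print(logged)
print(restored)
```

Predictions made on the log scale therefore need to be passed through `np.expm1` before they can be compared with raw meter readings.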

In [None]:
train['meter_reading_log'] = np.log1p(train['meter_reading'])

In [None]:
# Density of log(1 + meter_reading) for each meter type
sns.kdeplot(train.loc[(train['meter']=='electricity'), 
            "meter_reading_log"], color='yellow', label='electricity')

sns.kdeplot(train.loc[(train['meter']=='chilledwater'), 
            "meter_reading_log"], color='b', label='chilledwater')

sns.kdeplot(train.loc[(train['meter']=='steam'), 
            "meter_reading_log"], color='gray', label='steam')

sns.kdeplot(train.loc[(train['meter']=='hotwater'), 
            "meter_reading_log"], color='r', label='hotwater')

plt.xlabel('meter_reading_log') 
plt.ylabel('Probability Density') 
plt.legend()

In [None]:
len(set(train['building_id']))

### Analysis of the building data

The building_metadata.csv file contains 6 variables:

- site_id: foreign key for the weather files
- building_id: foreign key for train.csv
- primary_use: indicator of the primary category of activities for the building, based on EnergyStar property type definitions
- square_feet: gross floor area of the building
- year_built: year the building opened
- floor_count: number of floors of the building

In [None]:
building_metadata.shape

In [None]:
building_metadata.head()

In [None]:
building_metadata.describe()

In [None]:
building_metadata.info()

The year_built and floor_count variables have many missing values (774 for year_built and 1094 for floor_count).
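Counts like these can be read off `info()`, but they can also be computed explicitly. The sketch below does so on a hypothetical `toy_meta` table with the same column names.

```python
import numpy as np
import pandas as pd

# Toy table standing in for building_metadata (hypothetical values)
toy_meta = pd.DataFrame({
    "building_id": [0, 1, 2, 3],
    "year_built": [2008.0, np.nan, 1991.0, np.nan],
    "floor_count": [np.nan, np.nan, np.nan, 5.0],
})

missing = toy_meta.isna().sum()         # number of missing values per column
missing_share = toy_meta.isna().mean()  # fraction of missing values per column
print(missing)
print(missing_share)
```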

#### Variable primary_use

In [None]:
building_metadata.primary_use.unique()

There are 16 primary activity categories, so primary_use can be treated as a categorical variable.


In [None]:
sns.countplot(y="primary_use",data=building_metadata ,color="salmon")

Most buildings are used for education, while places of religious worship are the least common.

#### Variable site_id

In [None]:
site_build = building_metadata.groupby('site_id').building_id.size()
sns.barplot(x=site_build.index , y= site_build,color="blue")
plt.ylabel("Number of building")
del site_build

Site 3 has the largest number of buildings and site 11 the smallest.

#### Variables square_feet, year_built and floor_count

In [None]:
fig, axes = plt.subplots(3,1,figsize=(10,10)) 
columns = building_metadata.drop(["primary_use","site_id","building_id"],axis=1).columns
for i,col in enumerate(list(columns)):
    plot = building_metadata.boxplot(col, by="site_id", ax=axes.flatten()[i])

plt.tight_layout() 

plt.show()

Site 0 has recent buildings, whereas site 4 has fairly old ones. As for the number of floors, buildings at site 8 have few floors, unlike those at site 7, which have more.

In [None]:
building_metadata.hist(column ="year_built",bins=40)

Many buildings date from around 1975.

In [None]:
building_metadata['year_built'].describe()

### Analysis of the weather data (weather_data)

In [None]:
weather_train.shape

The weather training table contains 139773 observations of 7 weather variables: air temperature, cloud coverage, dew temperature, precipitation depth, sea level pressure, wind direction and wind speed. Each observation (row) is the hourly weather reading for one site, so each site has 24 rows per day.
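A quick arithmetic check puts the row count in context. The sketch below assumes that the weather table covers the same hourly range as train.csv (2016-01-01 00:00 to 2016-12-31 23:00) and that there are 16 distinct sites; both are assumptions of this example.

```python
import pandas as pd

start = pd.Timestamp("2016-01-01 00:00:00")
end = pd.Timestamp("2016-12-31 23:00:00")

# Number of hourly timestamps in the range, both ends included (2016 is a leap year)
hours_in_2016 = int((end - start) / pd.Timedelta(hours=1)) + 1

n_sites = 16            # assumed number of distinct site_id values
observed_rows = 139773  # row count quoted for weather_train

expected_rows = n_sites * hours_in_2016
missing_rows = expected_rows - observed_rows
print(hours_in_2016, expected_rows, missing_rows)
```

Under these assumptions a few hundred hourly rows are absent altogether, in addition to the missing values inside the rows that are present.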

In [None]:
weather_train.info()

Not all values in this table could be collected: every measured variable has missing values, except of course the collection times. The collection times (**timestamp**) will be converted to a date format.

In [None]:
weather_train.describe()

The descriptive statistics of the weather table suggest the presence of outliers. The temperature variables have extreme values (min and max) that are far apart, with large dispersions (std = 10.62 and 9.79).
The precip_depth_1_hr variable appears to contain outliers: its minimum is -1 and its maximum is 343, with a mean of 0.98 and quartiles equal to zero.
The wind variables (wind_direction and wind_speed) show similar characteristics.
The box plots below let us check for outliers.

In [None]:
fig, axes = plt.subplots(7,1,figsize=(10,30)) 
columns = weather_train.drop(["site_id","timestamp"],axis=1).columns
for i,col in enumerate(list(columns)):

    plot = weather_train.boxplot(col, by="site_id", ax=axes.flatten()[i])

plt.tight_layout() 

plt.show()

The box plots confirm the presence of outliers for all variables in the table except wind_direction.
In general, the weather data also vary from site to site.
For example, sites 0, 8 and 9 have high air and dew temperatures on average, whereas sites 7 and 11 have low ones.
There are also very few data for precip_depth_1_hr and cloud_coverage: site 11 has no data for cloud_coverage, and sites 1, 5 and 12 have no data for precip_depth_1_hr.
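Box plots draw their whiskers at 1.5 times the interquartile range by default, so the points they flag can also be counted numerically. The sketch below is one possible implementation of that rule; the `iqr_outliers` helper and the toy values are hypothetical.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy series with one obvious extreme value
toy = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 100.0])
mask = iqr_outliers(toy)
print(toy[mask])
```

Applied per site with `groupby("site_id")`, the same rule would mirror the per-site box plots, although a flagged point is not necessarily a measurement error.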

We now study the correlation between the variables of the weather_train table.

In [None]:
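# numeric_only=True skips the non-numeric timestamp column (required by recent pandas versions)
sns.heatmap(weather_train.corr(numeric_only=True),linewidths=.5,annot=True)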

The result above is a correlation matrix displayed as a heatmap. The main observation is a strong correlation between **dew_temperature** and **air_temperature**.

# Merging the data tables

At this stage we merge all the tables into one large table containing all the information. Each table contains a key that links it to the others, so the tables can be joined.
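The two joins can be sketched on toy tables as follows. The table contents are hypothetical; `validate="many_to_one"` is an optional pandas check, added here as an assumption of this example, that raises an error if the right-hand keys are not unique.

```python
import pandas as pd

# Toy tables standing in for train, building_metadata and weather_train
readings = pd.DataFrame({
    "building_id": [0, 0, 1],
    "timestamp": ["2016-01-01 00:00:00", "2016-01-01 01:00:00", "2016-01-01 00:00:00"],
    "meter_reading": [0.0, 5.0, 7.5],
})
buildings = pd.DataFrame({"building_id": [0, 1], "site_id": [0, 1]})
weather = pd.DataFrame({
    "site_id": [0, 0],
    "timestamp": ["2016-01-01 00:00:00", "2016-01-01 01:00:00"],
    "air_temperature": [25.0, 24.4],
})

# Left joins keep every reading, even when no building or weather row matches
merged = readings.merge(buildings, on="building_id", how="left", validate="many_to_one")
merged = merged.merge(weather, on=["site_id", "timestamp"], how="left", validate="many_to_one")
print(merged)
```

Because both joins are left joins, the merged table keeps the row count of the readings table, and readings without a matching weather row get missing weather values.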

In [None]:
train = train.merge(building_metadata, on='building_id', how='left')
alltrain = train.merge(weather_train, on=['site_id','timestamp'], how='left')
del building_metadata,weather_train,train
gc.collect()

154

In [None]:
alltrain.to_csv (r'/content/drive/MyDrive/Kaggle/alltrain.csv', index = False, header=True)

In [None]:
np.shape(alltrain)

(20216100, 16)

The complete table contains 20216100 rows and 16 variables.

## Analysis of the alltrain table

In [None]:
alltrain.head()

Unnamed: 0,building_id,meter,timestamp,meter_reading,site_id,primary_use,square_feet,year_built,floor_count,air_temperature,cloud_coverage,dew_temperature,precip_depth_1_hr,sea_level_pressure,wind_direction,wind_speed
0,0,0,2016-01-01 00:00:00,0.0,0,Education,7432,2008.0,,25.0,6.0,20.0,,1019.7,0.0,0.0
1,1,0,2016-01-01 00:00:00,0.0,0,Education,2720,2004.0,,25.0,6.0,20.0,,1019.7,0.0,0.0
2,2,0,2016-01-01 00:00:00,0.0,0,Education,5376,1991.0,,25.0,6.0,20.0,,1019.7,0.0,0.0
3,3,0,2016-01-01 00:00:00,0.0,0,Education,23685,2002.0,,25.0,6.0,20.0,,1019.7,0.0,0.0
4,4,0,2016-01-01 00:00:00,0.0,0,Education,116607,1975.0,,25.0,6.0,20.0,,1019.7,0.0,0.0


In [None]:
alltrain.shape

(20216100, 16)

In [None]:
alltrain.describe()

Unnamed: 0,building_id,meter,meter_reading,site_id,square_feet,year_built,floor_count,air_temperature,cloud_coverage,dew_temperature,precip_depth_1_hr,sea_level_pressure,wind_direction,wind_speed
count,20216100.0,20216100.0,20216100.0,20216100.0,20216100.0,8088455.0,3506933.0,20119440.0,11390740.0,20115960.0,16467080.0,18984430.0,18767050.0,20072420.0
mean,799.278,0.6624412,2117.121,7.992232,107783.0,1968.277,4.184848,15.98795,1.900423,7.747429,0.7964155,1016.085,173.0151,3.377525
std,426.9133,0.9309921,153235.6,5.09906,117142.4,30.20815,4.008277,10.94729,2.402909,10.17867,7.468997,7.060539,114.0574,2.265694
min,0.0,0.0,0.0,0.0,283.0,1900.0,1.0,-28.9,0.0,-35.0,-1.0,968.2,0.0,0.0
25%,393.0,0.0,18.3,3.0,32527.0,1951.0,1.0,8.6,0.0,0.0,0.0,1011.6,70.0,2.1
50%,895.0,0.0,78.775,9.0,72709.0,1969.0,3.0,16.7,0.0,8.9,0.0,1016.0,180.0,3.1
75%,1179.0,1.0,267.984,13.0,139113.0,1993.0,6.0,24.1,4.0,16.1,0.0,1020.5,280.0,4.6
max,1448.0,3.0,21904700.0,15.0,875000.0,2017.0,26.0,47.2,9.0,26.1,343.0,1045.5,360.0,19.0


In [None]:
alltrain.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20216100 entries, 0 to 20216099
Data columns (total 16 columns):
 #   Column              Dtype  
---  ------              -----  
 0   building_id         int64  
 1   meter               int64  
 2   timestamp           object 
 3   meter_reading       float64
 4   site_id             int64  
 5   primary_use         object 
 6   square_feet         int64  
 7   year_built          float64
 8   floor_count         float64
 9   air_temperature     float64
 10  cloud_coverage      float64
 11  dew_temperature     float64
 12  precip_depth_1_hr   float64
 13  sea_level_pressure  float64
 14  wind_direction      float64
 15  wind_speed          float64
dtypes: float64(10), int64(4), object(2)
memory usage: 2.6+ GB


 
## Changing variable types

In [None]:
alltrain.timestamp=pd.to_datetime(alltrain['timestamp'])

In [None]:
varia=['building_id','site_id', 'meter']

for col in varia:
    
    alltrain[col]=alltrain[col].astype('object')
    

In [None]:
alltrain.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20216100 entries, 0 to 20216099
Data columns (total 16 columns):
 #   Column              Dtype         
---  ------              -----         
 0   building_id         object        
 1   meter               object        
 2   timestamp           datetime64[ns]
 3   meter_reading       float64       
 4   site_id             object        
 5   primary_use         object        
 6   square_feet         int64         
 7   year_built          float64       
 8   floor_count         float64       
 9   air_temperature     float64       
 10  cloud_coverage      float64       
 11  dew_temperature     float64       
 12  precip_depth_1_hr   float64       
 13  sea_level_pressure  float64       
 14  wind_direction      float64       
 15  wind_speed          float64       
dtypes: datetime64[ns](1), float64(10), int64(1), object(4)
memory usage: 2.6+ GB


In [None]:
alltrain['meter'] = pd.Categorical(alltrain['meter']).rename_categories({0: 'electricity', 
                                                                   1: 'chilledwater',
                                                                   2: 'steam', 
                                                                   3: 'hotwater'})

## Number of levels of each categorical variable

In [None]:
for col in alltrain.select_dtypes(object).columns:
    print (f'{col :-<30} {len(alltrain[col].value_counts())}')

building_id------------------- 1449
site_id----------------------- 16
primary_use------------------- 16


## Missing value analysis

In [None]:
(alltrain.isna().sum()/alltrain.shape[0]).sort_values(ascending=False)

floor_count           0.826528
year_built            0.599900
cloud_coverage        0.436551
precip_depth_1_hr     0.185447
wind_direction        0.071678
sea_level_pressure    0.060925
wind_speed            0.007107
dew_temperature       0.004953
air_temperature       0.004781
square_feet           0.000000
primary_use           0.000000
site_id               0.000000
meter_reading         0.000000
timestamp             0.000000
meter                 0.000000
building_id           0.000000
dtype: float64

In [None]:
alltrain.head()

Unnamed: 0,building_id,meter,timestamp,meter_reading,site_id,primary_use,square_feet,year_built,floor_count,air_temperature,cloud_coverage,dew_temperature,precip_depth_1_hr,sea_level_pressure,wind_direction,wind_speed
0,0,electricity,2016-01-01,0.0,0,Education,7432,2008.0,,25.0,6.0,20.0,,1019.7,0.0,0.0
1,1,electricity,2016-01-01,0.0,0,Education,2720,2004.0,,25.0,6.0,20.0,,1019.7,0.0,0.0
2,2,electricity,2016-01-01,0.0,0,Education,5376,1991.0,,25.0,6.0,20.0,,1019.7,0.0,0.0
3,3,electricity,2016-01-01,0.0,0,Education,23685,2002.0,,25.0,6.0,20.0,,1019.7,0.0,0.0
4,4,electricity,2016-01-01,0.0,0,Education,116607,1975.0,,25.0,6.0,20.0,,1019.7,0.0,0.0


# Feature extraction

In [None]:
def rec_heur(x):
    
    if x in np.arange(6, 19):
        return 'journee'
    
    if x in np.arange(19, 23):
        return 'nuit'
    
    if x in [23, 0, 1, 2, 3, 4, 5]:
        return 'tard'

In [None]:
alltrain['year_built'].describe()

count    8.088455e+06
mean     1.968277e+03
std      3.020815e+01
min      1.900000e+03
25%      1.951000e+03
50%      1.969000e+03
75%      1.993000e+03
max      2.017000e+03
Name: year_built, dtype: float64

In [None]:
def discredit_var(x):
   
    
    if x <= 1951:
        return 'yearB_q1'
    
    if  1951 < x <= 1969:
        return 'yearB_q2'
    
    if 1969 < x <= 1993:
        return 'yearB_q3'
    
    if  1993 < x:
        return 'yearB_q4'
    

In [None]:
def preProcecing_df(df_):
    
    df=df_.copy()
    
    
    saison={3: 'printent',4:'printent',5:'printent',
          6: 'ete', 7: 'ete',8: 'ete', 
          9: 'automne', 10: 'automne', 11: 'automne', 
          1: 'hiver', 12: 'hiver', 2: 'hiver'}
    
    
    
    
    df['mois'] = df.timestamp.dt.month
    df['day'] = df.timestamp.dt.day
    df['heure'] = df.timestamp.dt.hour
    
    df['heureDiscredite'] = df['heure'].apply(rec_heur)
    
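    # Weekend flag from the day of the week (Monday=0, ..., Saturday=5, Sunday=6), not the day of the month
    df['week_end'] = df.timestamp.dt.dayofweek.isin([5, 6]).astype(int)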
    df['saison'] = df['mois'].apply(lambda x: saison.get(x))
    
    
    # Impute year_built with the site median, then with the global median
    median_group = df.groupby(['site_id'])['year_built'].transform('median')
    df['year_built'] = df['year_built'].fillna(median_group)
    df['year_built'] = df['year_built'].fillna(df['year_built'].median())
    
    df['year_built'] = df['year_built'].apply(discredit_var) 
    
    df['floor_count'] = df['floor_count'].fillna(0)
    
    colonneAsNum=['air_temperature', 'dew_temperature','wind_direction']
    
    
    
    for col in colonneAsNum:
        median_group = df.groupby(['site_id', 'saison', 'week_end', 'primary_use'])[col].transform('median')
        df[col] = df[col].fillna(median_group)
                
    
    
                    
    for col in [ 'day', 'heure', 'timestamp', 
                "precip_depth_1_hr", "wind_speed", "sea_level_pressure", "cloud_coverage", "mois"]:
        del df[col]
    
    return df
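
The group-wise median imputation used in this function can be illustrated on a toy table. The sketch below uses hypothetical values and adds a global-median fallback for groups with no observed value; applying that fallback to the weather columns is a suggestion of this example rather than part of the function above.

```python
import numpy as np
import pandas as pd

# Toy data: site 1 has no observed temperature at all
toy = pd.DataFrame({
    "site_id": [0, 0, 0, 1, 1],
    "air_temperature": [20.0, np.nan, 24.0, np.nan, np.nan],
})

# Median of each site, broadcast back to the original rows
group_median = toy.groupby("site_id")["air_temperature"].transform("median")
filled = toy["air_temperature"].fillna(group_median)

# Groups without any observed value are still missing: fall back to the global median
filled = filled.fillna(toy["air_temperature"].median())
print(filled.tolist())
```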



In [None]:
X = preProcecing_df(alltrain)
# Keep non-zero readings only, then model log(1 + meter_reading)
X = X[X['meter_reading'] != 0].copy()
X['meter_reading']=np.log1p(X['meter_reading'])

In [None]:
(X.isna().sum()/X.shape[0]).sort_values(ascending=False)

saison             0.0
week_end           0.0
heureDiscredite    0.0
wind_direction     0.0
dew_temperature    0.0
air_temperature    0.0
floor_count        0.0
year_built         0.0
square_feet        0.0
primary_use        0.0
site_id            0.0
meter_reading      0.0
meter              0.0
building_id        0.0
dtype: float64

In [None]:
X.shape

(18342124, 14)

In [None]:
X.to_csv (r'/content/drive/MyDrive/Kaggle/X.csv', index = False, header=True)

In [None]:
X.head()

Unnamed: 0,building_id,meter,meter_reading,site_id,primary_use,square_feet,year_built,floor_count,air_temperature,dew_temperature,wind_direction,heureDiscredite,week_end,saison
45,46,electricity,3.993413,0,Retail,9045,yearB_q4,0.0,25.0,20.0,0.0,tard,0,hiver
72,74,electricity,3.784219,0,Parking,387638,yearB_q4,0.0,25.0,20.0,0.0,tard,0,hiver
91,93,electricity,3.978196,0,Office,33370,yearB_q3,0.0,25.0,20.0,0.0,tard,0,hiver
103,105,electricity,3.190624,1,Education,50623,yearB_q2,5.0,3.8,2.4,240.0,tard,0,hiver
104,106,electricity,0.318163,1,Education,5374,yearB_q2,4.0,3.8,2.4,240.0,tard,0,hiver


## Type conversion

In [None]:
X["building_id"] = X["building_id"].astype('category')
X["site_id"] = X["site_id"].astype('category')
X["primary_use"] = X["primary_use"].astype('category')
X["saison"] = X["saison"].astype('category')
X["heureDiscredite"] = X["heureDiscredite"].astype('category')
X["year_built"] = X["year_built"].astype('category')

# One-hot encoder

In [None]:
def encodeur(df): 
    X_Encod=pd.concat([df, pd.get_dummies(df["primary_use"], dtype=int) ], axis=1)

    X_Encod=pd.concat([X_Encod, pd.get_dummies(df["saison"], dtype=int) ], axis=1)

    X_Encod=pd.concat([X_Encod, pd.get_dummies(df["heureDiscredite"], dtype=int) ], axis=1)
    X_Encod=pd.concat([X_Encod, pd.get_dummies(df["meter"], dtype=int) ], axis=1)
    X_Encod=pd.concat([X_Encod, pd.get_dummies(df["year_built"], dtype=int) ], axis=1)

    for col in ["primary_use",'year_built', 'yearB_q4', "saison", "heureDiscredite", 
                'Office', "printent", "journee", 'meter', 'hotwater' ]:
        del X_Encod[col]


    return X_Encod
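
The encoder above creates one indicator column per level and then deletes one level of each variable, which acts as the reference category. A minimal sketch of that idea on a toy column (hypothetical rows, using the same `saison` labels as the notebook):

```python
import pandas as pd

# Toy column with the season labels used in the notebook
toy = pd.DataFrame({"saison": ["hiver", "ete", "printent", "automne", "hiver"]})

# One indicator column per level
dummies = pd.get_dummies(toy["saison"], dtype=int)

# Drop one level so that it becomes the reference category
dummies = dummies.drop(columns=["printent"])
print(dummies)
```

A row belonging to the reference level is encoded as all zeros, so the remaining columns are no longer linearly dependent on each other.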

In [None]:
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df
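
As a hedged alternative to the hand-written ranges above, pandas can pick the smallest suitable dtype itself through `pd.to_numeric(..., downcast=...)`, which never goes below float32 for floats. The toy table below is hypothetical. The last line shows why float16 deserves some care: it cannot represent every integer above 2048.

```python
import numpy as np
import pandas as pd

# Toy table with 64-bit columns (hypothetical values)
toy = pd.DataFrame({
    "site_id": np.array([0, 7, 15], dtype="int64"),
    "square_feet": np.array([7432, 116607, 875000], dtype="int64"),
    "air_temperature": np.array([25.0, 3.5, -28.5], dtype="float64"),
})

small = toy.copy()
small["site_id"] = pd.to_numeric(small["site_id"], downcast="integer")
small["square_feet"] = pd.to_numeric(small["square_feet"], downcast="integer")
small["air_temperature"] = pd.to_numeric(small["air_temperature"], downcast="float")
print(small.dtypes)

# float16 keeps about 3 significant decimal digits: 2049 is stored as 2048
rounded = float(np.float16(2049.0))
print(rounded)
```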

In [None]:
X = pd.read_csv('/content/drive/MyDrive/Kaggle/X.csv')

In [None]:
X_Encod = encodeur(X)

In [None]:
X_Encod.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18342124 entries, 0 to 18342123
Data columns (total 35 columns):
 #   Column                         Dtype  
---  ------                         -----  
 0   building_id                    int16  
 1   meter_reading                  float16
 2   site_id                        int8   
 3   square_feet                    int32  
 4   floor_count                    float16
 5   air_temperature                float16
 6   dew_temperature                float16
 7   wind_direction                 float16
 8   week_end                       int8   
 9   Education                      int8   
 10  Entertainment/public assembly  int8   
 11  Food sales and service         int8   
 12  Healthcare                     int8   
 13  Lodging/residential            int8   
 14  Manufacturing/industrial       int8   
 15  Other                          int8   
 16  Parking                        int8   
 17  Public services                int8   
 18  

In [None]:
X_Encod.to_csv (r'/content/drive/MyDrive/Kaggle/X_Encod.csv', index = False, header=True)

# Train test split

In [None]:
from sklearn.model_selection import ShuffleSplit

def trainAndTest(DF):
    df = DF.copy()

    # Split at the site level so that no site appears in both sets
    uniqueSite = np.array(pd.unique(df["site_id"]))
    rs = ShuffleSplit(n_splits=1, test_size=.3, random_state=0)
    train_index, test_index = next(rs.split(uniqueSite))

    # split() returns positions in uniqueSite, so map them back to site ids
    is_train = df["site_id"].isin(uniqueSite[train_index])

    x_train = df[is_train].drop(columns=['meter_reading'])
    y_train = df.loc[is_train, 'meter_reading']

    x_test = df[~is_train].drop(columns=['meter_reading'])
    y_test = df.loc[~is_train, 'meter_reading']

    return x_train, y_train, x_test, y_test
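
scikit-learn also provides `GroupShuffleSplit`, which performs this kind of group-level split directly. The sketch below is an alternative under the assumption that site_id is the grouping variable; the toy table is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 10 sites with 3 rows each (hypothetical values)
toy = pd.DataFrame({
    "site_id": np.repeat(np.arange(10), 3),
    "meter_reading": np.linspace(0.0, 1.0, 30),
})

# Every row of a given site goes either to train or to test, never both
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(toy, groups=toy["site_id"]))

train_sites = set(toy.iloc[train_idx]["site_id"])
test_sites = set(toy.iloc[test_idx]["site_id"])
print(sorted(train_sites), sorted(test_sites))
```

Keeping whole sites out of the training set tests whether the model generalises to sites it has never seen, which is a stricter check than a random row-level split.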


In [None]:
X_train, Y_train, X_test, Y_test = trainAndTest(X_Encod)

In [None]:
X_train=reduce_mem_usage(X_train)
X_test=reduce_mem_usage(X_test)

Memory usage after optimization is: 564.90 MB
Decreased by 0.0%
Memory usage after optimization is: 309.72 MB
Decreased by 0.0%


## MinMaxScaler

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [None]:
def minMax(DF, listColumns, scaler, fit=False):
    df=DF.copy()
    # Fit the scaler on the training set only, then reuse the same ranges for the test set
    if fit:
        scaler.fit(df[listColumns])
    df[listColumns]=scaler.transform(df[listColumns])
        
    return df

In [None]:
listColumns= [ 'wind_direction',  'dew_temperature',
              'air_temperature', 'floor_count', 'square_feet']


X_train = minMax(X_train, listColumns, scaler, fit=True)
X_test = minMax(X_test, listColumns, scaler)

# XGBoost model

## Training

In [None]:
X_train1=X_train.copy()
del X_train1['building_id']

In [None]:
X_test1=X_test.copy()
del X_test1['building_id']

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import scipy.stats
import gc
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [None]:
X_train1 = pd.read_csv('/content/drive/MyDrive/Kaggle/X_train1.csv')  # Olivier's path

In [None]:
X_test1 = pd.read_csv('/content/drive/MyDrive/Kaggle/X_test1.csv')  # Olivier's path

In [None]:
Y_train = pd.read_csv('/content/drive/MyDrive/Kaggle/Y_train.csv')  # Olivier's path

In [None]:
Y_test = pd.read_csv('/content/drive/MyDrive/Kaggle/Y_test.csv')  # Olivier's path

In [None]:
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

In [None]:
X_train1=reduce_mem_usage(X_train1)
X_test1=reduce_mem_usage(X_test1)

Memory usage after optimization is: 0.42 MB
Decreased by 85.6%
Memory usage after optimization is: 0.23 MB
Decreased by 85.6%
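
`reduce_mem_usage` picks the smallest float type whose range contains the column, which checks range but not precision. The sketch below, using a hypothetical helper `max_abs_cast_error` and made-up readings, illustrates how much rounding `float16` can introduce for values that are well inside its range. It is meant as a sanity check to run before downcasting, not as part of the original pipeline.

```python
import numpy as np


def max_abs_cast_error(values, dtype):
    """Return the largest absolute error introduced by casting `values` to `dtype`."""
    original = np.asarray(values, dtype=np.float64)
    restored = original.astype(dtype).astype(np.float64)
    return float(np.max(np.abs(original - restored)))


# Hypothetical readings, all smaller than the float16 maximum (65504).
readings = [0.5, 2049.0, 30001.0]
err16 = max_abs_cast_error(readings, np.float16)
err32 = max_abs_cast_error(readings, np.float32)
print(err16, err32)
```

If the error reported for `float16` is too large for a given column, keeping that column in `float32` is a simple compromise between memory and accuracy.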


In [None]:
Y_train=reduce_mem_usage(Y_train)
Y_test1=reduce_mem_usage(Y_test)

Memory usage after optimization is: 0.02 MB
Decreased by 75.0%
Memory usage after optimization is: 0.01 MB
Decreased by 75.0%


In [None]:
! sudo pip install xgboost



In [None]:
from xgboost import XGBRegressor as XGB

In [None]:
X_train1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11846809 entries, 0 to 11846808
Data columns (total 33 columns):
 #   Column                         Dtype  
---  ------                         -----  
 0   site_id                        int8   
 1   square_feet                    float16
 2   floor_count                    float16
 3   air_temperature                float16
 4   dew_temperature                float16
 5   wind_direction                 float16
 6   week_end                       int8   
 7   Education                      int8   
 8   Entertainment/public assembly  int8   
 9   Food sales and service         int8   
 10  Healthcare                     int8   
 11  Lodging/residential            int8   
 12  Manufacturing/industrial       int8   
 13  Other                          int8   
 14  Parking                        int8   
 15  Public services                int8   
 16  Religious worship              int8   
 17  Retail                         int8   
 18  

In [None]:
Y_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11846809 entries, 0 to 11846808
Data columns (total 1 columns):
 #   Column         Dtype  
---  ------         -----  
 0   meter_reading  float16
dtypes: float16(1)
memory usage: 22.6 MB


In [None]:
params = {
    "objective": "reg:squarederror",
    "eval_metric": "rmse",  # XGBoost names this parameter eval_metric, not metric
    "reg_lambda": 2,
    "tree_method": "approx",
    "learning_rate": 0.05
}

In [None]:
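xgb_model = XGB(**params)  # keep the XGB class alias intact instead of overwriting it

In [None]:
# NOTE: early_stopping_rounds=True behaves like a patience of 1 round;
# an explicit integer makes the intent clearer.
XGB_Reg = xgb_model.fit(X_train1, Y_train, early_stopping_rounds=True,
        eval_set=[(X_test1, Y_test)],
        eval_metric='rmse',
        verbose=True)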

[0]	validation_0-rmse:4.41714
Will train until validation_0-rmse hasn't improved in True rounds.
[1]	validation_0-rmse:4.21481
[2]	validation_0-rmse:4.02266
[3]	validation_0-rmse:3.84214
[4]	validation_0-rmse:3.67004
[5]	validation_0-rmse:3.50492
[6]	validation_0-rmse:3.3526
[7]	validation_0-rmse:3.20395
[8]	validation_0-rmse:3.07277
[9]	validation_0-rmse:2.94373
[10]	validation_0-rmse:2.82426
[11]	validation_0-rmse:2.71472
[12]	validation_0-rmse:2.60716
[13]	validation_0-rmse:2.50971
[14]	validation_0-rmse:2.41513
[15]	validation_0-rmse:2.32881
[16]	validation_0-rmse:2.24948
[17]	validation_0-rmse:2.17191
[18]	validation_0-rmse:2.10045
[19]	validation_0-rmse:2.03629
[20]	validation_0-rmse:1.97472
[21]	validation_0-rmse:1.91844
[22]	validation_0-rmse:1.86755
[23]	validation_0-rmse:1.81824
[24]	validation_0-rmse:1.77237
[25]	validation_0-rmse:1.73031
[26]	validation_0-rmse:1.69042
[27]	validation_0-rmse:1.65759
[28]	validation_0-rmse:1.62654
[29]	validation_0-rmse:1.59203
[30]	validatio
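
The log above reports the validation RMSE after each boosting round, and early stopping ends training once that score has stopped improving for a given number of rounds. The following is a minimal sketch of that patience rule in plain Python. It is a generic illustration, not XGBoost's actual implementation; the function name `early_stopping_round` and the RMSE curve are hypothetical.

```python
def early_stopping_round(scores, patience):
    """Return (best_index, stop_index) for validation scores where lower is better.

    Training stops at the first round where the best score is `patience`
    rounds old; if that never happens, stop_index is the last round.
    """
    best_index = 0
    for i, score in enumerate(scores):
        if score < scores[best_index]:
            best_index = i
        elif i - best_index >= patience:
            return best_index, i
    return best_index, len(scores) - 1


# Hypothetical validation RMSE curve that plateaus after round 3.
rmse_curve = [4.4, 4.2, 4.0, 3.9, 3.95, 3.92, 3.91, 3.93]
print(early_stopping_round(rmse_curve, patience=1))
print(early_stopping_round(rmse_curve, patience=3))
```

A patience of 1 stops at the first round that fails to improve, which is why a larger explicit integer is generally preferred when the validation curve is noisy.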

In [None]:
Y_pred=XGB_Reg.predict(X_test1)

In [None]:
print(Y_pred[300:320])
print(Y_test[300:320])

In [None]:
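from sklearn.metrics import mean_squared_error

# Square root of the MSE, i.e. the root mean squared error (RMSE)
rmse_XGB = mean_squared_error(Y_test, Y_pred)**0.5
print("The root mean squared error (RMSE) of XGB Model on test set: {:.4f}".format(rmse_XGB))

The root mean squared error (RMSE) of XGB Model on test set: 1.3659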
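
The score above is the square root of `mean_squared_error`, so it is an RMSE. When the target is stored on a log scale, an RMSE computed in log space corresponds to a root mean squared logarithmic error (RMSLE) on the raw readings. The sketch below checks that equivalence on made-up numbers. It assumes a `log1p` transform, which this section does not confirm; `rmse` and `rmsle` are hypothetical helper names.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def rmsle(y_true, y_pred):
    """RMSE computed on log1p-transformed values (assumed transform)."""
    return rmse(np.log1p(y_true), np.log1p(y_pred))


# Hypothetical raw meter readings and predictions.
raw_true = np.array([10.0, 100.0, 1000.0])
raw_pred = np.array([12.0, 90.0, 1100.0])

score_raw_scale = rmsle(raw_true, raw_pred)
score_log_scale = rmse(np.log1p(raw_true), np.log1p(raw_pred))
print(score_raw_scale, score_log_scale)
```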


In [None]:
XGB_Reg.save_model('/content/drive/MyDrive/Kaggle/XGB_Reg.txt') 

In [None]:
# The context manager closes the file once the model has been written
with open("pima.pickle.dat", "wb") as f:
    pickle.dump(XGB_Reg, f)

## Prediction and Submission

In [None]:
! mkdir -p ~/.kaggle/


In [None]:
! cp '/content/drive/MyDrive/Kaggle/kaggle.json' ~/.kaggle/

In [None]:
! chmod 600 ~/.kaggle/kaggle.json

In [None]:
! kaggle competitions download -c ashrae-energy-prediction

In [None]:
! unzip '/content/weather_test.csv.zip' -d weather_test
! unzip '/content/test.csv.zip' -d test


Archive:  /content/weather_test.csv.zip
  inflating: weather_test/weather_test.csv  
Archive:  /content/test.csv.zip
  inflating: test/test.csv           


In [None]:
test = pd.read_csv('/content/drive/MyDrive/Kaggle/test.csv')
weather_test = pd.read_csv('/content/drive/MyDrive/Kaggle/weather_test.csv')
building_metadata = pd.read_csv('/content/drive/MyDrive/Kaggle/building_metadata.csv')


In [None]:
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**3
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    # 1024**3 converts bytes to gigabytes, so the figure is reported in GB
    end_mem = df.memory_usage().sum() / 1024**3
    print('Memory usage after optimization is: {:.2f} GB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

In [None]:
test = reduce_mem_usage(test)

Memory usage after optimization is: 0.58 MB
Decreased by 53.1%


In [None]:
weather_test = reduce_mem_usage(weather_test)

Memory usage after optimization is: 0.01 MB
Decreased by 68.1%


In [None]:
building_metadata = reduce_mem_usage(building_metadata)

Memory usage after optimization is: 0.00 MB
Decreased by 60.3%


In [None]:
test =test.merge(building_metadata, on='building_id', how='left')
alltest= test.merge(weather_test, on=['site_id', 'timestamp'], how='left')
del test, weather_test,building_metadata
gc.collect()

319
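
The two left joins above attach building metadata and then weather readings to every test row. A small sketch on toy frames follows, showing how pandas' `validate` argument can guard against duplicated keys that would silently multiply test rows. The toy frames and their values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for test, building_metadata and weather_test (hypothetical values).
toy_test = pd.DataFrame({
    "row_id": [0, 1, 2],
    "building_id": [0, 1, 0],
    "timestamp": ["2017-01-01 00:00:00", "2017-01-01 00:00:00", "2017-01-01 01:00:00"],
})
toy_buildings = pd.DataFrame({"building_id": [0, 1], "site_id": [0, 1]})
toy_weather = pd.DataFrame({
    "site_id": [0, 0],
    "timestamp": ["2017-01-01 00:00:00", "2017-01-01 01:00:00"],
    "air_temperature": [17.8, 17.2],
})

# validate="many_to_one" raises MergeError if the right-hand keys are not unique.
merged = toy_test.merge(toy_buildings, on="building_id", how="left",
                        validate="many_to_one")
merged = merged.merge(toy_weather, on=["site_id", "timestamp"], how="left",
                      validate="many_to_one")

# A left join keeps every test row; rows without a weather match get NaN.
print(len(merged) == len(toy_test))
print(merged["air_temperature"].isna().sum())
```

Keeping the row count equal to that of `test` matters here because the submission file expects exactly one prediction per `row_id`.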

In [None]:
alltest = reduce_mem_usage(alltest)

Memory usage after optimization is: 2.10 MB
Decreased by 0.0%


In [None]:
alltest['timestamp'] = pd.to_datetime(alltest['timestamp'])

In [None]:
(alltest.isna().sum()/alltest.shape[0]).sort_values(ascending=False)

floor_count           0.826050
year_built            0.589916
cloud_coverage        0.468664
precip_depth_1_hr     0.187099
wind_direction        0.071435
sea_level_pressure    0.060359
wind_speed            0.007245
dew_temperature       0.006255
air_temperature       0.005322
square_feet           0.000000
primary_use           0.000000
site_id               0.000000
timestamp             0.000000
meter                 0.000000
building_id           0.000000
row_id                0.000000
dtype: float64

In [None]:
alltest.to_csv (r'/content/drive/MyDrive/Kaggle/alltest.csv', index = False, header=True)

In [None]:
data_test = preProcecing_df(alltest)

In [None]:
data_test['meter'] = pd.Categorical(data_test['meter']).rename_categories({0: 'electricity', 
                                                                   1: 'chilledwater',
                                                                   2: 'steam', 
                                                                   3: 'hotwater'})

In [None]:
reduce_mem_usage(data_test)

Memory usage after optimization is: 2.37 MB
Decreased by 10.3%


Unnamed: 0,row_id,building_id,meter,site_id,primary_use,square_feet,year_built,floor_count,air_temperature,dew_temperature,wind_direction,heureDiscredite,week_end,saison
0,0,0,electricity,0,Education,7432,yearB_q4,0.0,17.796875,11.703125,100.0,tard,0,hiver
1,1,1,electricity,0,Education,2720,yearB_q4,0.0,17.796875,11.703125,100.0,tard,0,hiver
2,2,2,electricity,0,Education,5376,yearB_q3,0.0,17.796875,11.703125,100.0,tard,0,hiver
3,3,3,electricity,0,Education,23685,yearB_q4,0.0,17.796875,11.703125,100.0,tard,0,hiver
4,4,4,electricity,0,Education,116607,yearB_q3,0.0,17.796875,11.703125,100.0,tard,0,hiver
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41697595,41697595,1444,electricity,15,Entertainment/public assembly,19619,yearB_q1,0.0,6.699219,1.700195,210.0,journee,0,printent
41697596,41697596,1445,electricity,15,Education,4298,yearB_q2,0.0,6.699219,1.700195,210.0,journee,0,printent
41697597,41697597,1446,electricity,15,Entertainment/public assembly,11265,yearB_q4,0.0,6.699219,1.700195,210.0,journee,0,printent
41697598,41697598,1447,electricity,15,Lodging/residential,29775,yearB_q4,0.0,6.699219,1.700195,210.0,journee,0,printent


In [None]:
data_test["building_id"] = data_test["building_id"].astype('category')
data_test["site_id"] = data_test["site_id"].astype('category')
data_test["primary_use"] = data_test["primary_use"].astype('category')
data_test["saison"] = data_test["saison"].astype('category')
data_test["heureDiscredite"] = data_test["heureDiscredite"].astype('category')
data_test["year_built"] = data_test["year_built"].astype('category')

In [None]:
data_test.to_csv (r'/content/drive/MyDrive/Kaggle/data_test.csv', index = False, header=True)

In [None]:
listColumns= [ 'wind_direction',  'dew_temperature',
              'air_temperature', 'floor_count', 'square_feet']

gc.collect()

386

In [None]:
def encodeur(df): 
    """One-hot encode the categorical columns and drop one reference level per variable."""
    X_Encod=pd.concat([df, pd.get_dummies(df["primary_use"], dtype=int) ], axis=1)
    reduce_mem_usage(X_Encod)
    X_Encod=pd.concat([X_Encod, pd.get_dummies(df["saison"], dtype=int) ], axis=1)
    reduce_mem_usage(X_Encod)
    X_Encod=pd.concat([X_Encod, pd.get_dummies(df["heureDiscredite"], dtype=int) ], axis=1)
    reduce_mem_usage(X_Encod)
    X_Encod=pd.concat([X_Encod, pd.get_dummies(df["meter"], dtype=int) ], axis=1)
    reduce_mem_usage(X_Encod)
    X_Encod=pd.concat([X_Encod, pd.get_dummies(df["year_built"], dtype=int) ], axis=1)
    reduce_mem_usage(X_Encod)

    # Drop the original categorical columns and one dummy per variable
    # (the reference level), which would otherwise be redundant.
    for col in ["primary_use",'year_built', 'yearB_q4', "saison", "heureDiscredite", 
                'Office', "printent", "journee", 'meter', 'hotwater' ]:
        del X_Encod[col]


    return X_Encod

In [None]:
X_final = encodeur(data_test)

Memory usage after optimization is: 1.90 MB
Decreased by 69.6%
Memory usage after optimization is: 2.06 MB
Decreased by 34.6%
Memory usage after optimization is: 2.17 MB
Decreased by 27.3%
Memory usage after optimization is: 2.33 MB
Decreased by 31.8%
Memory usage after optimization is: 2.49 MB
Decreased by 30.4%
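
One-hot encoding the test set on its own can produce a different set of columns, or a different column order, from the training frame, and a fitted model generally expects the same features it was trained on. The sketch below shows one way to align the two frames with `DataFrame.reindex`. The frames `train_raw` and `test_raw` are toy data, and the prefixed column names come from `get_dummies(columns=...)`, which differs slightly from the unprefixed names used in `encodeur`.

```python
import pandas as pd

# Toy frames (hypothetical values): 'Parking' never appears in training,
# 'Office' and 'Retail' never appear in the test set.
train_raw = pd.DataFrame({"primary_use": ["Education", "Office", "Retail"],
                          "square_feet": [1.0, 2.0, 3.0]})
test_raw = pd.DataFrame({"primary_use": ["Education", "Parking"],
                         "square_feet": [4.0, 5.0]})

train_enc = pd.get_dummies(train_raw, columns=["primary_use"], dtype=int)
test_enc = pd.get_dummies(test_raw, columns=["primary_use"], dtype=int)

# Reindex the test frame on the training columns: categories missing from the
# test set become all-zero columns, categories unseen in training are dropped.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

Applied to this notebook, the same idea would be `X_final.reindex(columns=X_train1.columns, fill_value=0)`, assuming both frames were encoded with the same naming scheme.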


In [None]:
X_final = minMax(X_final, listColumns, scaler)
del data_test

In [None]:
X_final = X_final.drop(["row_id"],axis=1)

In [None]:
X_train1.info()

In [None]:
X_final.info()

In [None]:
del X_final1['building_id']

In [None]:
X_final["site_id"] = X_final["site_id"].astype('int')

In [None]:
X_final  = pd.read_csv('/content/drive/MyDrive/Kaggle/X_final.csv')

In [None]:
reduce_mem_usage(X_final1)

In [None]:
X_final.insert(0, 'newColMean', df.mean(1))

In [None]:
# .copy() makes X_final1 an independent frame, which avoids SettingWithCopyWarning
# when reduce_mem_usage assigns new dtypes column by column.
X_final1 = X_final.iloc[1:13899200, :].copy()
reduce_mem_usage(X_final1)

Memory usage after optimization is: 0.52 MB
Decreased by 14.9%


Unnamed: 0,building_id,site_id,square_feet,floor_count,air_temperature,dew_temperature,wind_direction,week_end,Education,Entertainment/public assembly,Food sales and service,Healthcare,Lodging/residential,Manufacturing/industrial,Other,Parking,Public services,Religious worship,Retail,Services,Technology/science,Utility,Warehouse/storage,automne,ete,hiver,nuit,tard,electricity,chilledwater,steam,yearB_q1,yearB_q2,yearB_q3
1,1,0,0.002787,0.000000,0.601074,0.742676,0.277832,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,0,0,0,0,0
2,2,0,0.005821,0.000000,0.601074,0.742676,0.277832,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,0,0,0,0,1
3,3,0,0.026749,0.000000,0.601074,0.742676,0.277832,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,0,0,0,0,0
4,4,0,0.132935,0.000000,0.601074,0.742676,0.277832,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,0,0,0,0,1
5,5,0,0.008820,0.000000,0.601074,0.742676,0.277832,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13899195,611,4,0.146118,0.269043,0.506836,0.723633,0.138916,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0
13899196,612,4,0.087769,0.153809,0.506836,0.723633,0.138916,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0
13899197,613,4,0.001168,0.038452,0.506836,0.723633,0.138916,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0
13899198,614,4,0.211426,0.307617,0.506836,0.723633,0.138916,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0


In [None]:
X_final1.insert(28, 'electricity', X_final.copy())

In [None]:
Y_pred_final1 = XGB_Reg.predict(X_final1)
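
The test set has more than 41 million rows, and the notebook predicts on a slice of it rather than on the whole frame at once. A minimal sketch of predicting in fixed-size chunks and concatenating the results in order is given below. `ToyModel` stands in for any fitted regressor with a `predict` method, and `predict_in_chunks` is a hypothetical helper; neither appears in the original notebook.

```python
import numpy as np
import pandas as pd


class ToyModel:
    """Stand-in for a fitted regressor; only `predict` is needed here."""

    def predict(self, X):
        return np.asarray(X).sum(axis=1)


def predict_in_chunks(model, X, chunk_size):
    """Predict `chunk_size` rows at a time and concatenate the results in order."""
    parts = []
    # range starts at 0 so that the first row is included in the first chunk.
    for start in range(0, len(X), chunk_size):
        parts.append(model.predict(X.iloc[start:start + chunk_size]))
    return np.concatenate(parts) if parts else np.array([])


toy_X = pd.DataFrame({"a": np.arange(10.0), "b": np.ones(10)})
chunked = predict_in_chunks(ToyModel(), toy_X, chunk_size=4)
print(chunked.shape)
```

With a real model the call would look like `predict_in_chunks(XGB_Reg, X_final, chunk_size=...)`, with the chunk size chosen to fit the available memory.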

In [None]:
Y_pred_final1XGB_df = pd.DataFrame(data=Y_pred_final1)

In [None]:
Y_pred_finalXGB_df = pd.DataFrame(data=Y_pred_final)
Y_pred_finalXGB_df.to_csv (r'/content/drive/MyDrive/Kaggle/Y_pred_finalXGB_df.csv', index = False, header=True)

In [None]:
! unzip '/content/sample_submission.csv.zip' -d sample_submission

Archive:  /content/sample_submission.csv.zip
  inflating: sample_submission/sample_submission.csv  


In [None]:
submission  = pd.read_csv('/content/drive/MyDrive/Kaggle/sample_submission.csv')
submission['meter_reading'] = np.exp(Y_pred_final)
submission.loc[submission['meter_reading']<0, 'meter_reading'] = 0
submission.to_csv('/content/drive/MyDrive/Kaggle/submission.csv', index=False)
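
The submission cell maps the log-scale predictions back to raw meter readings with `np.exp`. Which transform was applied to the target is not shown in this section, so the sketch below only illustrates the general rule: the inverse must match the forward transform (`np.log` with `np.exp`, `np.log1p` with `np.expm1`). It also shows that clipping at zero is only needed with `np.expm1`, since `np.exp` never returns a negative value. All arrays are made-up examples.

```python
import numpy as np

raw = np.array([0.0, 1.5, 250.0])

# Assumption: the target was built with log1p; the matching inverse is expm1.
log_target = np.log1p(raw)
restored = np.expm1(log_target)

# Using exp instead of expm1 would leave every value too large by exactly 1.
mismatched = np.exp(log_target)

# Clipping guards against small negative readings from predictions below zero.
log_predictions = np.array([-0.01, 0.9, 5.5])  # hypothetical model output
readings = np.clip(np.expm1(log_predictions), 0, None)
print(restored, mismatched - raw, readings)
```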

In [None]:
! kaggle competitions submit -c ashrae-energy-prediction -f '/content/drive/MyDrive/Kaggle/submission.csv' -m "First submission using Hist Boost algorithm"

100% 1.05G/1.05G [00:35<00:00, 31.6MB/s]
Successfully submitted to ASHRAE - Great Energy Predictor III

# HistGradientBoosting Model

## Training

In [None]:
from sklearn import datasets, ensemble
from sklearn.inspection import permutation_importance
from sklearn.metrics import mean_squared_error


In [None]:
# The experimental import is only required for scikit-learn < 1.0
from sklearn.experimental import enable_hist_gradient_boosting 
from sklearn.ensemble import HistGradientBoostingRegressor

In [None]:
X_train1 = pd.read_csv('/content/drive/MyDrive/Kaggle/X_train1.csv')  # Olivier's path

In [None]:
X_test1 = pd.read_csv('/content/drive/MyDrive/Kaggle/X_test1.csv')

In [None]:
Y_train = pd.read_csv('/content/drive/MyDrive/Kaggle/Y_train.csv')  # Olivier's path

In [None]:
Y_test = pd.read_csv('/content/drive/MyDrive/Kaggle/Y_test.csv')  # Olivier's path

In [None]:
print(y_train[0:5])

   meter_reading_log
0           4.843242
1           6.438743
2           3.356897
3           4.278609
4           3.418710


In [None]:
HGB = HistGradientBoostingRegressor()
# Y_train is a one-column DataFrame; ravel() passes it as the 1-D array scikit-learn expects
HGB_reg = HGB.fit(X_train1, Y_train.values.ravel())


In [None]:
Y_pred = HGB_reg.predict(X_test1)

In [None]:
print(Y_pred[300:320])
print(Y_test[300:320])

In [None]:
# Square root of the MSE, i.e. the root mean squared error (RMSE)
rmse_HGB = mean_squared_error(Y_test, Y_pred)**0.5
print("The root mean squared error (RMSE) on test set: {:.4f}".format(rmse_HGB))

The root mean squared error (RMSE) on test set: 1.3927


## Prediction and Submission

In [None]:
X_final = pd.read_csv('/content/drive/MyDrive/Kaggle/X_final.csv')

In [None]:
X_final=reduce_mem_usage(X_final)

Memory usage after optimization is: 2.17 MB
Decreased by 75.9%


In [None]:
Y_pred_final = HGB_reg.predict(X_final)

In [None]:
Y_pred_final_df = pd.DataFrame(data=Y_pred_final)
Y_pred_final_df.to_csv (r'/content/drive/MyDrive/Kaggle/Y_pred_final_df.csv', index = False, header=True)

In [None]:
! unzip '/content/sample_submission.csv.zip' -d sample_submission

Archive:  /content/sample_submission.csv.zip
  inflating: sample_submission/sample_submission.csv  


In [None]:
submission  = pd.read_csv('/content/drive/MyDrive/Kaggle/sample_submission.csv')
submission['meter_reading'] = np.exp(Y_pred_final)
submission.loc[submission['meter_reading']<0, 'meter_reading'] = 0
submission.to_csv('/content/drive/MyDrive/Kaggle/submission.csv', index=False)

In [None]:
! kaggle competitions submit -c ashrae-energy-prediction -f '/content/drive/MyDrive/Kaggle/submission.csv' -m "First submission using Hist Boost algorithm"

100% 1.05G/1.05G [00:35<00:00, 31.6MB/s]
Successfully submitted to ASHRAE - Great Energy Predictor III

# LightGBM Model

## Training

In [None]:
import lightgbm as lgb
from sklearn.metrics import mean_squared_error

In [None]:
params = {
    "objective": "regression",
    "boosting": "gbdt",
    "num_leaves": 40,
    "learning_rate": 0.05,
    "feature_fraction": 0.85,
    "reg_lambda": 2,
    "metric": "rmse"
}

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import scipy.stats
import gc
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [None]:
X_train1 = pd.read_csv('/content/drive/MyDrive/Kaggle/X_train1.csv')  # Olivier's path

In [None]:
X_test1 = pd.read_csv('/content/drive/MyDrive/Kaggle/X_test1.csv')  # Olivier's path

In [None]:
Y_train = pd.read_csv('/content/drive/MyDrive/Kaggle/Y_train.csv')  # Olivier's path

In [None]:
Y_test = pd.read_csv('/content/drive/MyDrive/Kaggle/Y_test.csv')  # Olivier's path

In [None]:
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

In [None]:
X_train1=reduce_mem_usage(X_train1)
X_test1=reduce_mem_usage(X_test1)

Memory usage after optimization is: 429.32 MB
Decreased by 85.6%
Memory usage after optimization is: 235.39 MB
Decreased by 85.6%


In [None]:
Y_train=reduce_mem_usage(Y_train)
Y_test1=reduce_mem_usage(Y_test)

Memory usage after optimization is: 22.60 MB
Decreased by 75.0%
Memory usage after optimization is: 12.39 MB
Decreased by 75.0%


In [None]:
LGB = lgb.LGBMRegressor(**params)

In [None]:
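# NOTE: early_stopping_rounds=True behaves like a patience of 1 round;
# an explicit integer makes the intent clearer.
LGB_Reg = LGB.fit(X_train1, Y_train, early_stopping_rounds=True,
        eval_set=[(X_test1, Y_test)],
        eval_metric='rmse',
        verbose=True)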

[1]	valid_0's rmse: 1.80202
Training until validation scores don't improve for True rounds.
[2]	valid_0's rmse: 1.75474
[3]	valid_0's rmse: 1.70873
[4]	valid_0's rmse: 1.68901
[5]	valid_0's rmse: 1.65891
[6]	valid_0's rmse: 1.62276
[7]	valid_0's rmse: 1.59419
[8]	valid_0's rmse: 1.56949
[9]	valid_0's rmse: 1.54269
[10]	valid_0's rmse: 1.51937
[11]	valid_0's rmse: 1.49915
[12]	valid_0's rmse: 1.47931
[13]	valid_0's rmse: 1.47772
[14]	valid_0's rmse: 1.45792
[15]	valid_0's rmse: 1.45743
[16]	valid_0's rmse: 1.44662
[17]	valid_0's rmse: 1.43246
[18]	valid_0's rmse: 1.42075
[19]	valid_0's rmse: 1.41124
[20]	valid_0's rmse: 1.40062
[21]	valid_0's rmse: 1.39391
[22]	valid_0's rmse: 1.38831
[23]	valid_0's rmse: 1.38023
[24]	valid_0's rmse: 1.37424
[25]	valid_0's rmse: 1.36811
[26]	valid_0's rmse: 1.36403
[27]	valid_0's rmse: 1.35914
[28]	valid_0's rmse: 1.355
[29]	valid_0's rmse: 1.35164
[30]	valid_0's rmse: 1.34894
[31]	valid_0's rmse: 1.34651
[32]	valid_0's rmse: 1.34508
[33]	valid_0's rmse

## Prediction and Submission

In [None]:
X_final = pd.read_csv('/content/drive/MyDrive/Kaggle/X_final.csv')

In [None]:
X_final=reduce_mem_usage(X_final)

Memory usage after optimization is: 1590.64 MB
Decreased by 85.3%


In [None]:
del X_final['building_id']

In [None]:
Y_pred_finalLGB = LGB_Reg.predict(X_final)

In [None]:
Y_pred_finalLGB_df = pd.DataFrame(data=Y_pred_finalLGB)
Y_pred_finalLGB_df.to_csv (r'/content/drive/MyDrive/Kaggle/Y_pred_finalLGB_df.csv', index = False, header=True)

In [None]:
reduce_mem_usage(Y_pred_finalLGB_df)

In [None]:
submission  = pd.read_csv('/content/drive/MyDrive/Kaggle/sample_submission.csv')
submission['meter_reading'] = np.exp(Y_pred_finalLGB)
submission.loc[submission['meter_reading']<0, 'meter_reading'] = 0
submission.to_csv('/content/drive/MyDrive/Kaggle/submission.csv', index=False)

In [None]:
! kaggle competitions submit -c ashrae-energy-prediction -f '/content/drive/MyDrive/Kaggle/submission.csv' -m "First submission using Hist Boost algorithm"

100% 1.05G/1.05G [00:13<00:00, 84.1MB/s]
Successfully submitted to ASHRAE - Great Energy Predictor III