# Gradient Boosting

This notebook uses gradient boosting to make predictions, using the XGBoost library.
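
To make the idea concrete, here is a minimal NumPy sketch of gradient boosting for squared error: each round fits a one-split tree (a "stump") to the current residuals and adds a shrunken copy of it to the prediction. This is a toy illustration on made-up data, not XGBoost's actual algorithm, which also uses second-order gradient information and regularization. The helper names `fit_stump`, `predict_stump`, and `boost` are hypothetical.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the threshold on a 1-D feature that best splits the residuals."""
    best = None
    for t in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return t, left_value, right_value

def predict_stump(stump, x):
    t, left_value, right_value = stump
    return np.where(x <= t, left_value, right_value)

def boost(x, y, n_rounds=50, learning_rate=0.1):
    base = y.mean()                      # start from a constant prediction
    pred = np.full_like(y, base, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        residual = y - pred              # negative gradient of 0.5 * (y - pred)^2
        stump = fit_stump(x, residual)
        pred += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, pred

# Made-up data (assumption): a noisy sine curve on a single feature
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

baseline_mse = np.mean((y - y.mean()) ** 2)
base, stumps, pred = boost(x, y)
boosted_mse = np.mean((y - pred) ** 2)
print(f"MSE with the mean only: {baseline_mse:.3f}")
print(f"MSE after boosting:     {boosted_mse:.3f}")
```

The `learning_rate` here plays the same role as the `learning_rate` parameter passed to XGBoost later in the notebook: smaller values take smaller steps and usually need more rounds.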

In [14]:
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

In [18]:
doc = '_cleaned_w_outlier_feat.csv'
df = pd.read_csv('Data/train'+doc)
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16323 entries, 0 to 16322
Data columns (total 23 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      16323 non-null  int64  
 1   date                    16323 non-null  object 
 2   prix                    16323 non-null  int64  
 3   nb_chambres             16323 non-null  int64  
 4   nb_sdb                  16323 non-null  float64
 5   m2_interieur            16323 non-null  float64
 6   m2_jardin               16323 non-null  float64
 7   m2_etage                16323 non-null  float64
 8   m2_soussol              16323 non-null  float64
 9   nb_etages               16323 non-null  float64
 10  vue_mer                 16323 non-null  int64  
 11  vue_note                16323 non-null  int64  
 12  etat_note               16323 non-null  int64  
 13  design_note             16323 non-null  int64  
 14  annee_construction      16323 non-null

We train the model:

In [19]:
features = [
    'nb_chambres', 'm2_etage', 'nb_sdb', 'm2_interieur', 'm2_soussol',
    'm2_jardin', 'nb_etages', 'vue_mer', 'vue_note', 'etat_note',
    'design_note', 'annee_construction', 'annee_renovation',
    'm2_interieur_15voisins', 'm2_jardin_15voisins', 'zipcode',
    'lat', 'long', 'cos_month', 'day_count',
]
X = df[features]
print(len(features))
y = df['prix']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

20


In [20]:
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {
    'tree_method': 'hist',
    'booster': 'gbtree',
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse',
    'max_depth': 5,
    'subsample': 0.95,
    # 'eta' is an alias of 'learning_rate' in XGBoost, so only one of them is set
    'learning_rate': 0.01
}

model = xgb.train(params, dtrain, num_boost_round=10000)

In [21]:
# These metrics are computed on the training set, so they measure fit, not generalization
y_pred = model.predict(dtrain)
mse = mean_squared_error(y_train, y_pred)
r2 = r2_score(y_train, y_pred)
print("Mean Squared Error:", mse)
print("R-squared Score:", r2)

Mean Squared Error: 932258050.6415
R-squared Score: 0.9904846785014021
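
The scores above come from the rows the model was trained on, and the MSE is expressed in squared price units. Taking the square root gives an RMSE of roughly 30,533 in the units of `prix`, which is easier to interpret. The sketch below shows that arithmetic along with a small helper, hypothetically named `regression_report`, that could be applied to the held-out split as well; the toy arrays are made up for illustration.

```python
import math
import numpy as np

def regression_report(y_true, y_pred):
    """Return MSE, RMSE and R^2 for two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    mse = ss_res / len(y_true)
    return {'mse': mse, 'rmse': math.sqrt(mse), 'r2': 1.0 - ss_res / ss_tot}

# RMSE implied by the training MSE printed above
train_rmse = math.sqrt(932258050.6415)
print(f"Training RMSE: {train_rmse:.1f}")

# Toy check on made-up numbers
toy = regression_report([100, 200, 300, 400], [110, 190, 310, 390])
print(toy)

# In the notebook, the held-out split could be scored with:
# regression_report(y_test, model.predict(dtest))
```

Comparing the held-out numbers with the training numbers gives a rough sense of how much the model overfits.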


In [22]:
df_test = pd.read_csv('Data/test_cleaned_feat.csv')
df_test = df_test.dropna()

In [23]:
X_df_test = xgb.DMatrix(df_test[features])
y_pred = model.predict(X_df_test)
# Build the submission frame from the ids and the predicted prices.
# NumPy arrays are used so values are paired by position: after dropna(),
# df_test's index may have gaps and pd.concat(axis=1) would align on labels.
result = pd.DataFrame({'id': df_test['id'].to_numpy(), 'prix': y_pred})
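
When `dropna()` removes rows, the remaining index is no longer contiguous, and `pd.concat(..., axis=1)` pairs rows by index label rather than by position. The sketch below illustrates that pitfall on a tiny made-up frame and shows one way to pair ids and predictions by position instead. The data and the names `misaligned` and `aligned` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up frame: the middle row has a missing value and is dropped
df_test = pd.DataFrame({'id': [10, 11, 12], 'x': [1.0, np.nan, 3.0]}).dropna()
print(df_test.index.tolist())   # index labels that survive dropna()

# Predictions come back as a plain array, one per remaining row
y_pred = pd.Series([100.0, 300.0], name='prix')

# Label-based alignment: ids and prices no longer line up
misaligned = pd.concat([df_test['id'], y_pred], axis=1)
print(misaligned)

# Position-based pairing: drop the index before combining
aligned = pd.DataFrame({'id': df_test['id'].to_numpy(), 'prix': y_pred.to_numpy()})
print(aligned)
```

Calling `reset_index(drop=True)` on the frame right after `dropna()` is another way to avoid the mismatch.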

In [25]:
result.columns = ['id', 'prix']
result.to_csv('Data/result_XGB' + doc, index=False)  # doc already ends in '.csv'