<a href="https://colab.research.google.com/github/LeRoiBof/ML-Project/blob/main/Project_ML1_LGMB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Project - Machine Learning - LightGBM (Light Gradient Boosting Machine)

Import libraries

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from lightgbm import LGBMRegressor
from sklearn.model_selection import GridSearchCV

Data loading

In [11]:
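# Load the training data and separate the target column from the features
df_train = pd.read_csv('train.csv')
y_column = 'LotArea'
X_train, y_train = df_train.drop(y_column, axis=1), df_train[y_column]

# Load the test data; keep the IDs for the submission file, then drop them from the features
X_test = pd.read_csv('test.csv')
X_test_id = X_test['ID']
X_test = X_test.drop(columns='ID')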

X_train['Fireplaces']

0       1
1       0
2       1
3       0
4       0
       ..
1454    1
1455    0
1456    1
1457    2
1458    1
Name: Fireplaces, Length: 1459, dtype: int64

Preprocessing
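
The cell below fills missing numeric values with the column mean and one-hot encodes the categorical columns with handle_unknown='ignore'. As a minimal sketch of what those two steps do, the toy example here uses two of the notebook's column names with made-up values (the frames `toy_train` and `toy_test` are hypothetical and not part of the dataset). `sparse_threshold=0` is set only to force a dense array that is easy to print; the notebook keeps the default.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, for illustration only
toy_train = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0],
    "MSZoning": ["RL", "RM", "RL"],
})
toy_test = pd.DataFrame({
    "LotFrontage": [np.nan],   # missing value, to be imputed
    "MSZoning": ["FV"],        # category never seen during fit
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="mean"), ["LotFrontage"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["MSZoning"]),
    ],
    sparse_threshold=0,  # always return a dense array (sketch only)
)

toy_preprocessor.fit(toy_train)
toy_out = toy_preprocessor.transform(toy_test)

# The missing LotFrontage becomes the training mean (70.0), and the
# unseen category 'FV' is encoded as zeros in both MSZoning columns.
print(toy_out)
```

Because the imputer and encoder are fitted on the training frame only, the test frame never influences the imputed mean or the set of known categories.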

In [18]:
selected_features = ['MSSubClass','MSZoning','LotFrontage','LotShape','LotConfig','Neighborhood','BldgType','HouseStyle','RoofStyle','MasVnrArea','Foundation','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF','Heating','CentralAir','Electrical','1stFlrSF','2ndFlrSF','LowQualFinSF','GrLivArea','BsmtFullBath','BsmtHalfBath','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr','TotRmsAbvGrd','Fireplaces','GarageType','GarageCars','GarageArea','WoodDeckSF','OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch','PoolArea','MiscFeature','MiscVal']
X_train = X_train[selected_features]
X_test = X_test[selected_features]

numeric_features = ['MSSubClass', 'LotFrontage','MasVnrArea','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF','1stFlrSF','2ndFlrSF','LowQualFinSF','GrLivArea','BsmtFullBath','BsmtHalfBath','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr','TotRmsAbvGrd','Fireplaces','GarageCars','GarageArea','WoodDeckSF','OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch','PoolArea','MiscVal']
categorical_features = ['MSZoning','LotShape','LotConfig','Neighborhood','BldgType','HouseStyle','RoofStyle','Foundation','Heating','CentralAir','Electrical','GarageType','MiscFeature']
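# Numeric columns: replace missing values with the column mean
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean'))
])
# Categorical columns: one-hot encode; categories unseen during fit are encoded as all zeros
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Apply each transformer to its own group of columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])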

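# Full pipeline: preprocessing followed by a LightGBM regressor.
# The values marked below are overridden by the grid search.
model = Pipeline(steps=[('preprocessor', preprocessor), ('model',LGBMRegressor(
    boosting_type='gbdt',
    n_estimators=2000,   # overridden by the grid (1000)
    learning_rate=0.01,  # overridden by the grid
    max_depth=42,        # overridden by the grid (3, 4, 5)
    subsample=0.8,       # LightGBM only applies row subsampling when subsample_freq > 0 (default 0)
    random_state=34,
    num_leaves=31
))])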

# Define the hyperparameter grid to search
param_grid = {
    'model__n_estimators': [1000],
    'model__learning_rate': [0.01, 0.05, 0.1],
    'model__max_depth': [3, 4, 5],
    'model__subsample': [0.7, 0.8, 0.9]
}

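# Select the numeric and categorical feature columns
X_train_vn = X_train[numeric_features + categorical_features]

# 5-fold cross-validated grid search; scores are negated MSE, so values closer to 0 are better
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train_vn, y_train)

# Pipeline with the best parameter combination, refit on the full training set
best_model = grid_search.best_estimator_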

Streaming output truncated to the last 5000 lines.
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000897 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2480
[LightGBM] [Info] Number of data points in the train set: 1167, number of used features: 80
[LightGBM] [Info] Start training from score 10310.329049
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000759 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2465
[LightGBM] [Info] Number of data points in the train set: 1167, number of used features: 78
[LightGBM] [Info] Start training from score 10165.360754
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.00
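
The log above is long because every fit performed by the grid search prints its own LightGBM info block. As a rough sketch, the number of fits can be counted from the grid itself; the grid is copied from the cell above, and the variable names `candidates`, `n_candidates`, and `n_cv_fits` are introduced here for illustration only.

```python
from itertools import product

# Same grid and fold count as in the grid search cell
param_grid = {
    'model__n_estimators': [1000],
    'model__learning_rate': [0.01, 0.05, 0.1],
    'model__max_depth': [3, 4, 5],
    'model__subsample': [0.7, 0.8, 0.9],
}
cv_folds = 5

# An exhaustive grid search evaluates every combination of the listed values
candidates = [
    dict(zip(param_grid, values))
    for values in product(*param_grid.values())
]

n_candidates = len(candidates)        # 1 * 3 * 3 * 3 combinations
n_cv_fits = n_candidates * cv_folds   # one fit per combination and fold

print(n_candidates, n_cv_fits)
```

With the default refit=True, GridSearchCV adds one more fit of the best combination on the full training set. Each cross-validation fit trains on four of the five folds, which matches the 1167 data points (out of 1459) reported in the log.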

Fitting and prediction

In [19]:
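# GridSearchCV already refits best_estimator_ on the full training set (refit=True by default),
# so this explicit fit repeats that step before predicting
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)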

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000919 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2736
[LightGBM] [Info] Number of data points in the train set: 1459, number of used features: 83
[LightGBM] [Info] Start training from score 10195.233722
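
The notebook does not print an error estimate for the selected model. Assuming test.csv carries no target column, the cross-validated score from the grid search is the available estimate, and because the search uses scoring='neg_mean_squared_error', that score is a negated MSE. The helper below is a minimal sketch of converting such a score into an RMSE in the units of LotArea; the function name `rmse_from_neg_mse` and the example value are hypothetical and are not results from this notebook.

```python
import math

def rmse_from_neg_mse(neg_mse):
    """Convert a 'neg_mean_squared_error' score into a root mean squared error."""
    if neg_mse > 0:
        raise ValueError("a negated MSE score cannot be positive")
    return math.sqrt(-neg_mse)

# Made-up score, for illustration only
example_rmse = rmse_from_neg_mse(-4_000_000.0)
print(example_rmse)  # 2000.0
```

In the notebook this would be called as rmse_from_neg_mse(grid_search.best_score_), and grid_search.best_params_ shows which combination produced that score.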


In [20]:
submission = pd.DataFrame({
    'ID': X_test_id,
    'LotArea': y_pred
})
submission.to_csv('submission.csv', index=False)
submission

        ID       LotArea
0        0   7595.043389
1        1   8483.325938
2        2   9390.227902
3        3  12387.455252
4        4  11066.060143
...    ...           ...
1455  1455   8437.492956
1456  1456   7478.157326
1457  1457  10612.465618
1458  1458   8332.106401
