<a href="https://www.kaggle.com/code/neesham/xgboost-v-s-lightgbm?scriptVersionId=120271689" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# In this notebook we will compare two popular gradient boosting libraries.

![image](https://res.cloudinary.com/hire-easy/image/upload/v1676710215/decision-trees_gfekyp.png)

### XGBoost and LightGBM

# Ready? Let's go!



# Set Up

In [1]:
import os
from time import time
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error

# Pre-Processing the Data (Melbourne Housing).

In [2]:
# Read the data
X = pd.read_csv('../input/train.csv', index_col='Id')

X_test_full = pd.read_csv('../input/test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice              
X.drop(['SalePrice'], axis=1, inplace=True)

# Break off validation set from training data
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]

# Select numeric columns
numeric_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = low_cardinality_cols + numeric_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()

# One-hot encode the data (to shorten the code, we use pandas)
X_train = pd.get_dummies(X_train)
X_valid = pd.get_dummies(X_valid)
X_test = pd.get_dummies(X_test)
X_train, X_valid = X_train.align(X_valid, join='left', axis=1)
X_train, X_test = X_train.align(X_test, join='left', axis=1)
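
The `align(..., join='left', axis=1)` calls matter because one-hot encoding each split separately can produce different columns. Here is a small sketch on made-up data (the `color`/`size` frame is purely illustrative) showing what happens: the validation frame ends up with exactly the training columns, categories seen only in validation are dropped, and categories missing from validation become NaN columns.

```python
import pandas as pd

# Toy frames: "blue" appears only in train, "green" only in valid
train = pd.DataFrame({"color": ["red", "blue"], "size": [1, 2]})
valid = pd.DataFrame({"color": ["red", "green"], "size": [3, 4]})

train_d = pd.get_dummies(train)
valid_d = pd.get_dummies(valid)

# Keep the training columns (join='left') and reindex valid to match them
train_a, valid_a = train_d.align(valid_d, join="left", axis=1)

print(list(train_a.columns))
print(list(valid_a.columns))
print(valid_a["color_blue"].isna().all())  # column absent in valid is filled with NaN
```

Both XGBoost and LightGBM accept NaN as missing values, which is why the notebook can pass the aligned frames straight to the models.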

## Similarity between XGBoost and LightGBM

1. Both are open-source gradient boosting frameworks that use decision trees for supervised learning tasks.

2. Both frameworks use similar boosting algorithms that combine multiple weak learners to create a strong learner.

3. Both frameworks allow for parallel processing and can handle large datasets.

4. Both frameworks offer a wide range of hyperparameters that can be tuned to optimize performance.

5. Both frameworks have gained popularity in the machine learning community and are widely used in industry and academia.
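
To make point 2 concrete, here is a minimal pure-Python sketch of gradient boosting for squared error, using one-split "stumps" as weak learners. It is a toy illustration of the idea, not how either library is implemented; the helper names (`fit_stump`, `boost`) and the tiny dataset are made up for this example.

```python
def fit_stump(xs, residuals):
    """Find the single-feature threshold that minimises squared error."""
    best = None
    # Skip the largest value so the right side is never empty
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Each round fits a stump to the current residuals and adds a damped correction."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, preds


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 5.2, 5.1, 4.9]

base, stumps, preds = boost(xs, ys)
print("MSE of constant prediction:", mse(ys, [base] * len(ys)))
print("MSE after boosting:        ", mse(ys, preds))
```

No single stump fits this data well, but the sum of many small corrections does, which is the "weak learners into a strong learner" idea.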

## Difference between XGBoost and LightGBM

1. **Performance**: LightGBM is often faster to train than XGBoost. It bins continuous features into histograms, which reduces the number of candidate split points evaluated when building trees. XGBoost also offers a histogram-based tree method (`tree_method="hist"`), so the size of the gap depends on the library version, the settings, and the dataset.

2. **Memory usage**: LightGBM often uses less memory than XGBoost, partly because it works on binned (discretized) feature values instead of the raw values. This can be an important consideration when dealing with large datasets.

3. **Tree-building strategy**: LightGBM grows trees leaf-wise (it always splits the leaf with the largest loss reduction), while XGBoost by default grows them depth-wise (level by level). Leaf-wise growth can reach a lower loss with the same number of leaves, but it produces deeper, less balanced trees and can overfit if `num_leaves` or `max_depth` is not carefully tuned. Depth-wise growth builds more balanced trees but may need more splits to reach the same loss.

4. **Tuning parameters**: Both LightGBM and XGBoost have many tuning parameters. LightGBM's default learning rate (0.1) is lower than XGBoost's (0.3), which is a more conservative starting point, but which library performs better out of the box depends on the dataset.
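
Point 1 mentions histogram-based binning. The sketch below shows the idea on a single feature: bucket the values into a few bins, accumulate per-bin residual sums and counts, then scan the bin boundaries instead of every distinct value. It assumes squared-error loss, and the function names, bin count, and data are illustrative only; the real implementations in both libraries are considerably more elaborate.

```python
def build_histogram(values, residuals, n_bins):
    """Bucket values into equal-width bins and accumulate residual sums and counts."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for v, r in zip(values, residuals):
        b = min(int((v - lo) / width), n_bins - 1)  # clamp the max value into the last bin
        sums[b] += r
        counts[b] += 1
    edges = [lo + width * (i + 1) for i in range(n_bins)]  # upper edge of each bin
    return sums, counts, edges


def best_split(sums, counts, edges):
    """Scan bin boundaries; gain is the reduction in squared error from splitting."""
    total_sum, total_count = sum(sums), sum(counts)
    best_gain, best_edge = 0.0, None
    left_sum, left_count = 0.0, 0
    for i in range(len(sums) - 1):
        left_sum += sums[i]
        left_count += counts[i]
        right_sum = total_sum - left_sum
        right_count = total_count - left_count
        if left_count == 0 or right_count == 0:
            continue
        gain = (left_sum ** 2 / left_count
                + right_sum ** 2 / right_count
                - total_sum ** 2 / total_count)
        if gain > best_gain:
            best_gain, best_edge = gain, edges[i]
    return best_edge, best_gain


values = [0.5, 1.0, 1.5, 2.0, 6.0, 6.5, 7.0, 8.5]
residuals = [-2.0, -2.0, -2.0, -2.0, 2.0, 2.0, 2.0, 2.0]

sums, counts, edges = build_histogram(values, residuals, n_bins=4)
edge, gain = best_split(sums, counts, edges)
print("Best split: value <", edge, "with gain", gain)
```

With 4 bins only 3 boundaries are evaluated, regardless of how many rows or distinct values the feature has.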

> That's enough theory; let's move on to the coding part.

# The XGBoost

In [3]:
from xgboost import XGBRegressor

# Set the number of threads to use for training
num_threads = 3

# Define the model
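# (recent XGBoost versions expect early_stopping_rounds in the constructor, not in fit())
model1 = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=num_threads,
                      early_stopping_rounds=5)

t0 = time()

# Fit the model; training stops once the validation score has not improved for 5 rounds
model1.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)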

print("Execution Time: ", time() - t0)

# Get predictions
predictions = model1.predict(X_valid)

# Calculate MAE
mae_2 = mean_absolute_error(y_valid, predictions)

print("Mean absolute error is: ", mae_2)

# R^2 score: for regressors, score() returns the coefficient of determination, not accuracy
print("R^2 score (%): ", (model1.score(X_valid, y_valid)) * 100)



Execution Time:  2.9560747146606445
Mean absolute error is:  16802.965325342466
R^2 score (%):  84.67858042263228
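
A note on the last number: for a regressor, scikit-learn's `score` method returns the coefficient of determination R², not a classification accuracy. The hand-rolled sketch below (toy prices, made up for illustration) shows how MAE and R² are computed, and that R² can even go negative when predictions are worse than always predicting the mean.

```python
def mae_manual(y_true, y_pred):
    """Mean absolute error: average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [100_000, 150_000, 200_000, 250_000]
y_good = [110_000, 140_000, 210_000, 240_000]  # each prediction off by 10,000
y_bad = [250_000, 200_000, 150_000, 100_000]   # order reversed

print("MAE (good):", mae_manual(y_true, y_good))
print("R^2 (good):", r2_manual(y_true, y_good))
print("R^2 (bad): ", r2_manual(y_true, y_bad))
```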


# Tuning the hyperparameters

In [4]:
# Define the model
model1 = XGBRegressor()


# Define the grid of hyperparameters to search
params = {
    "n_estimators": [700, 1000, 1200],
    "max_depth": [5, 7, 9],
    "learning_rate": [0.01, 0.05, 0.001],
}

t0 = time()

# Create the grid search object
grid = GridSearchCV(model1, params, cv=2)

# Fit the model
grid.fit(X_train, y_train)

print("Execution Time: ", time() - t0)

# Print the best hyperparameters
print("Best hyperparameters: ", grid.best_params_)

Execution Time:  400.55134749412537
Best hyperparameters:  {'learning_rate': 0.05, 'max_depth': 5, 'n_estimators': 1000}
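
The long run time follows directly from the size of the search. A quick count using the same grid (this sketch only enumerates combinations, it trains nothing):

```python
from itertools import product

params = {
    "n_estimators": [700, 1000, 1200],
    "max_depth": [5, 7, 9],
    "learning_rate": [0.01, 0.05, 0.001],
}
cv_folds = 2

combinations = list(product(*params.values()))
# One fit per combination per fold, plus one final refit on the full
# training data (GridSearchCV uses refit=True by default)
n_fits = len(combinations) * cv_folds + 1

print("Combinations:", len(combinations))
print("Total model fits:", n_fits)
```

Adding one more value to any list, or raising `cv`, multiplies the work, so it pays to keep the grid small.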


# The LightGBM

In [5]:
import lightgbm as lgb 

# Set the number of threads to use for training
num_threads = 3

# Define the model
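model2 = lgb.LGBMRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=num_threads)

t0 = time()

# Fit the model (no early stopping here, so all 1000 trees are trained).
# Recent LightGBM versions no longer accept verbose in fit();
# log_evaluation(period=0) silences the per-iteration output instead
model2.fit(X_train, y_train, eval_set=[(X_valid, y_valid)],
           callbacks=[lgb.log_evaluation(period=0)])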

print("Execution Time: ", time() - t0)

# Get predictions
predictions = model2.predict(X_valid)

# Calculate MAE
mae_2 = mean_absolute_error(y_valid, predictions)

print("Mean absolute error is: ", mae_2)

# R^2 score: for regressors, score() returns the coefficient of determination, not accuracy
print("R^2 score (%): ", (model2.score(X_valid, y_valid)) * 100)



Execution Time:  7.691295385360718
Mean absolute error is:  17259.09657799767
R^2 score (%):  87.6791254297951
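
Keep in mind that the two timing cells are not configured identically: the XGBoost fit uses early stopping with a patience of 5 rounds, while the LightGBM fit trains all 1000 trees. The patience rule behind early stopping can be sketched as follows; the validation losses are made-up numbers and `best_round_with_patience` is a hypothetical helper, not a function from either library.

```python
def best_round_with_patience(val_losses, patience):
    """Return (best_round, stop_round) under a simple patience rule."""
    best_loss = float("inf")
    best_round = -1
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` consecutive rounds: stop training
            return best_round, round_idx
    # Ran out of rounds without triggering the rule
    return best_round, len(val_losses) - 1


# Validation loss per boosting round (illustrative values)
val_losses = [10.0, 8.0, 7.0, 6.5, 6.6, 6.7, 6.55, 6.8, 6.9, 7.0]

best, stopped = best_round_with_patience(val_losses, patience=5)
print("Best round:", best, "| training stopped at round:", stopped)
```

A model that stops early builds far fewer trees, so its wall-clock time says little about the speed of the library itself.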


# Tuning the hyperparameters

In [6]:
# Define the model
model2 = lgb.LGBMRegressor()

# Define the grid of hyperparameters to search
params = {
    "n_estimators": [700, 1000, 1200],
    "num_leaves": [5, 7, 9],
    "learning_rate": [0.01, 0.05, 0.001],
}

t0 = time()

# Create the grid search object
grid = GridSearchCV(model2, params, cv=2)

# Fit the model
grid.fit(X_train, y_train)

print("Execution Time: ", time() - t0)

# Print the best hyperparameters
print("Best hyperparameters: ", grid.best_params_)

Execution Time:  102.8579957485199
Best hyperparameters:  {'learning_rate': 0.01, 'n_estimators': 1200, 'num_leaves': 5}


# Playing with CV

In [7]:
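from sklearn.model_selection import cross_val_score

# model1 and model2 are the default-parameter models defined in the tuning cells.
# cross_val_score clones and refits them on each fold; for regressors the score is R^2
scores_of_XGBoost = cross_val_score(model1, X_valid, y_valid, cv = 5)

scores_of_lightGBM = cross_val_score(model2, X_valid, y_valid, cv = 5)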

scores_of_XGBoost.sort()
scores_of_lightGBM.sort()

print("Scores of XGBoost are: ", *scores_of_XGBoost)
print("Scores of lightGBM are: ", *scores_of_lightGBM)

Scores of XGBoost are:  0.2552202697364413 0.6775487414440469 0.724680272574131 0.7528381976452179 0.8291269183275427
Scores of lightGBM are:  0.6625301846164433 0.6926794510430847 0.7558711467543416 0.81287774953039 0.8576677489210873
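
These fold scores are R² values, since `cross_val_score` falls back to the estimator's own `score` method when no `scoring` is given. A short summary of the numbers printed above, using only the standard library, makes them easier to compare:

```python
from statistics import mean, stdev

# Fold scores copied from the output above
scores_of_XGBoost = [0.2552202697364413, 0.6775487414440469, 0.724680272574131,
                     0.7528381976452179, 0.8291269183275427]
scores_of_lightGBM = [0.6625301846164433, 0.6926794510430847, 0.7558711467543416,
                      0.81287774953039, 0.8576677489210873]

for name, scores in [("XGBoost", scores_of_XGBoost), ("lightGBM", scores_of_lightGBM)]:
    print(f"{name}: mean R^2 = {mean(scores):.3f}, std = {stdev(scores):.3f}, "
          f"min = {min(scores):.3f}, max = {max(scores):.3f}")
```

The spread across folds is wide, especially for XGBoost's weakest fold, so with only five folds on the small validation split the difference in means should be read with caution.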


# Conclusion

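So, both algorithms performed well and reached similar validation results: XGBoost scored an R² of about 84.7% with an MAE of about 16,803, and LightGBM an R² of about 87.7% with an MAE of about 17,259. What makes LightGBM notable is its speed: its grid search over 27 combinations finished in about 103 seconds, versus about 401 seconds for XGBoost. The single-fit timings (about 3.0 s for XGBoost and 7.7 s for LightGBM) are not directly comparable, because only the XGBoost fit used early stopping.

Thanks for reading! If you found this notebook helpful, please smash that upvote button, and leave a comment with your favorite feature of XGBoost or LightGBM.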