### LightGBM (Light Gradient Boosting Machine)
- What is LightGBM?

`LightGBM` is an open-source gradient boosting framework developed by Microsoft. It grows decision trees leaf-wise (always splitting the leaf with the largest loss reduction) rather than level-wise, which often reaches a lower loss for the same number of leaves but can overfit small datasets unless `num_leaves` or `max_depth` is constrained. It uses histogram-based split finding, which buckets continuous feature values into discrete bins to cut computation and memory, and adds Gradient-Based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) to speed up training on large datasets. LightGBM supports classification, regression, and ranking tasks.
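
To make the histogram idea concrete, here is a minimal pure-Python sketch. It is not LightGBM's implementation: it assumes squared-error loss (so the gain of a split can be computed from per-bin gradient sums and counts), equal-width bins, and no regularization. The helper names `build_histogram` and `best_split` are hypothetical.

```python
def build_histogram(values, gradients, n_bins, lo, hi):
    """Accumulate gradient sums and row counts per equal-width bin."""
    grad_sum = [0.0] * n_bins
    count = [0] * n_bins
    width = (hi - lo) / n_bins
    for v, g in zip(values, gradients):
        b = min(int((v - lo) / width), n_bins - 1)  # clamp v == hi into the last bin
        grad_sum[b] += g
        count[b] += 1
    return grad_sum, count


def best_split(grad_sum, count):
    """Scan bin boundaries once; rows with bin <= best_bin go to the left child."""
    total_g, total_n = sum(grad_sum), sum(count)
    best_bin, best_gain = None, 0.0
    left_g, left_n = 0.0, 0
    for b in range(len(count) - 1):
        left_g += grad_sum[b]
        left_n += count[b]
        right_g, right_n = total_g - left_g, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue  # a split must leave rows on both sides
        # Squared-error gain with unit hessians and no regularization (assumption).
        gain = left_g**2 / left_n + right_g**2 / right_n - total_g**2 / total_n
        if gain > best_gain:
            best_bin, best_gain = b, gain
    return best_bin, best_gain


# Toy data: the target jumps when the feature crosses 0.5.
values = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
targets = [1.0] * 5 + [3.0] * 5
mean = sum(targets) / len(targets)
gradients = [mean - y for y in targets]  # gradient of 0.5 * (pred - y)^2 at pred = mean

grad_sum, count = build_histogram(values, gradients, n_bins=4, lo=0.0, hi=1.0)
split_bin, gain = best_split(grad_sum, count)
print(split_bin, gain)
```

The point of the design is that the scan cost depends on the number of bins rather than the number of distinct feature values: once the histogram is built, every candidate threshold is evaluated from running sums.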

**Key features:**

  - Leaf-wise tree growth for better accuracy

  - Histogram-based splits for speed and memory efficiency

  - GOSS to focus training on samples with large gradients

  - EFB to reduce the effective number of features by bundling mutually exclusive ones (features that are rarely nonzero at the same time)

  - Supports GPU acceleration
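
The GOSS bullet above can be sketched as follows. This is a simplified illustration of the sampling rule, not LightGBM's code: keep the fraction `top_rate` of rows with the largest absolute gradients, draw a random fraction `other_rate` of all rows from the remainder, and up-weight those sampled rows by `(1 - top_rate) / other_rate` so the small-gradient rows are not under-represented. The function name `goss_sample` and the toy gradients are assumptions for illustration.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) chosen by a GOSS-style rule."""
    n = len(gradients)
    n_top = int(n * top_rate)
    n_other = int(n * other_rate)
    # Rank rows by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top, rest = order[:n_top], order[n_top:]
    # Randomly sample from the small-gradient remainder.
    rng = random.Random(seed)
    sampled = rng.sample(rest, n_other)
    # Compensate for down-sampling the small-gradient rows.
    amplify = (1.0 - top_rate) / other_rate
    indices = top + sampled
    weights = [1.0] * n_top + [amplify] * n_other
    return indices, weights


# Toy gradients spread between -1.0 and 0.9.
gradients = [0.1 * ((i * 7) % 20 - 10) for i in range(100)]
indices, weights = goss_sample(gradients)
print(len(indices), weights[0], weights[-1])
```

With 100 rows and the rates above, each boosting round would compute split statistics on 30 rows instead of 100, with the sampled small-gradient rows counted with a larger weight.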

Example usage (regression with Python LightGBM):

In [4]:
pip install lightgbm



In [5]:
#!pip install lightgbm
import lightgbm as lgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

In [6]:
# Create LightGBM dataset format
train_data = lgb.Dataset(X_train, label=y_train)
test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)

In [7]:
# Set parameters
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}


In [8]:
# Train the model
num_rounds = 100
model = lgb.train(params, train_data, num_rounds, valid_sets=[test_data])

In [9]:
# Predict and evaluate
y_pred = model.predict(X_test, num_iteration=model.best_iteration)
# mean_squared_error returns the MSE; take its square root to get the RMSE
mse = mean_squared_error(y_test, y_pred)
print(f'MSE: {mse}')

MSE: 0.23285525605518215
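
RMSE is the square root of MSE, so the two are easy to mix up when reading a metric printout. The sketch below uses made-up numbers to show both; `mse` and `rmse` are illustrative helpers, not scikit-learn functions.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the MSE, in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))


y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [1.0, 5.0, 4.0, 7.0]
print(mse(y_true, y_pred))   # 2.0
print(rmse(y_true, y_pred))  # 1.4142135623730951
```

Because RMSE is in the units of the target, it is usually the easier of the two to interpret against the scale of `y`.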


### CatBoost (Categorical Boosting)
- What is CatBoost?

`CatBoost` is a gradient boosting framework developed by Yandex that handles categorical features natively, so preprocessing such as one-hot encoding is usually unnecessary. It uses ordered boosting, which reduces prediction shift (a form of target leakage) and therefore overfitting, and it encodes categorical variables automatically, which makes it well suited to datasets that mix categorical and numerical features. CatBoost is competitive in speed and accuracy and supports GPU acceleration.

**Key features:**

  - Native categorical feature support

  - Ordered boosting to reduce overfitting

  - Built-in handling of missing values in numerical features

  - Sensible default hyperparameters that often need little tuning

  - Supports classification, regression, ranking, and multi-class problems
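
The idea behind native categorical support and ordered boosting can be illustrated with an ordered target statistic: each row's category is encoded using only the targets of rows that come before it in a permutation, smoothed toward a prior. The sketch below is a simplified, single-permutation illustration and not CatBoost's implementation; the function name `ordered_target_stats`, the smoothing weight `a`, and the toy data are assumptions.

```python
import random


def ordered_target_stats(categories, targets, order, prior, a=1.0):
    """Encode each category from the targets of earlier rows in `order` only."""
    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for i in order:
        c = categories[i]
        s, k = sums.get(c, 0.0), counts.get(c, 0)
        # Smoothed mean of the targets seen so far for this category.
        encoded[i] = (s + a * prior) / (k + a)
        # Only now reveal this row's target to later rows.
        sums[c] = s + targets[i]
        counts[c] = k + 1
    return encoded


categories = ['a', 'b', 'a', 'a', 'b']
targets = [1, 0, 0, 1, 1]
prior = sum(targets) / len(targets)

# In practice the order is a random permutation of the rows.
order = list(range(len(categories)))
random.Random(0).shuffle(order)
print(ordered_target_stats(categories, targets, order, prior))
```

Because a row's own target never enters its encoding, the encoded feature cannot leak the label it is later used to predict, which is the source of the prediction shift mentioned above.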

Example usage (classification with Python CatBoost):

In [10]:
!pip install catboost


In [11]:
#!pip install catboost
from catboost import CatBoostClassifier
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset
data = fetch_openml(name='adult', version=2, as_frame=True)
X = data.data
y = data.target

X.head()

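|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|-----|-----------|--------|-----------|---------------|----------------|------------|--------------|------|-----|--------------|--------------|----------------|----------------|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States |
| 4 | 18 | NaN | 103497 | Some-college | 10 | Never-married | NaN | Own-child | White | Female | 0 | 0 | 30 | United-States |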


In [12]:
y.head()

0    <=50K
1    <=50K
2     >50K
3     >50K
4    <=50K
Name: class, dtype: category
Categories (2, object): ['<=50K', '>50K']

In [13]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   age             48842 non-null  int64   
 1   workclass       46043 non-null  category
 2   fnlwgt          48842 non-null  int64   
 3   education       48842 non-null  category
 4   education-num   48842 non-null  int64   
 5   marital-status  48842 non-null  category
 6   occupation      46033 non-null  category
 7   relationship    48842 non-null  category
 8   race            48842 non-null  category
 9   sex             48842 non-null  category
 10  capital-gain    48842 non-null  int64   
 11  capital-loss    48842 non-null  int64   
 12  hours-per-week  48842 non-null  int64   
 13  native-country  47985 non-null  category
dtypes: category(8), int64(6)
memory usage: 2.6 MB


In [14]:
X.isnull().sum()

age                  0
workclass         2799
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        2809
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     857
dtype: int64

In [15]:
X.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country'],
      dtype='object')

In [16]:
# Specify categorical feature names
cat_features = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']

In [17]:
# Fill the NaN values in the categorical features.
# Work on an explicit copy so the assignments below do not trigger
# pandas' SettingWithCopyWarning.
X = X.copy()
for col in cat_features:
    if X[col].isnull().any():
        X[col] = X[col].astype('object').fillna('Unknown')


In [18]:
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize CatBoost model
model = CatBoostClassifier(iterations=1000, learning_rate=0.05, depth=6, verbose=100)

# Train model with categorical features specified
model.fit(X_train, y_train, cat_features=cat_features, eval_set=(X_test, y_test))

# Predict and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

0:	learn: 0.6447736	test: 0.6439227	best: 0.6439227 (0)	total: 422ms	remaining: 7m 1s
100:	learn: 0.2907311	test: 0.2855474	best: 0.2855474 (100)	total: 17s	remaining: 2m 31s
200:	learn: 0.2774430	test: 0.2757754	best: 0.2757754 (200)	total: 33.6s	remaining: 2m 13s
300:	learn: 0.2689627	test: 0.2705039	best: 0.2704879 (299)	total: 1m	remaining: 2m 19s
400:	learn: 0.2645073	test: 0.2687909	best: 0.2687909 (400)	total: 1m 27s	remaining: 2m 11s
500:	learn: 0.2607042	test: 0.2679836	best: 0.2679836 (500)	total: 1m 52s	remaining: 1m 52s
600:	learn: 0.2570087	test: 0.2671981	best: 0.2671981 (600)	total: 2m 19s	remaining: 1m 32s
700:	learn: 0.2538711	test: 0.2670646	best: 0.2669976 (666)	total: 2m 46s	remaining: 1m 11s
800:	learn: 0.2509253	test: 0.2665801	best: 0.2665433 (788)	total: 3m 11s	remaining: 47.6s
900:	learn: 0.2482663	test: 0.2664843	best: 0.2664843 (900)	total: 3m 37s	remaining: 23.9s
999:	learn: 0.2456063	test: 0.2661571	best: 0.2661421 (995)	total: 4m 1s	remaining: 0us

bestTes