
### LightGBM (Light Gradient Boosting Machine)
- What is LightGBM?

`LightGBM` is an open-source gradient boosting framework developed by Microsoft. It grows decision trees leaf-wise (always splitting the leaf with the largest loss reduction) rather than level-wise, which often reaches a lower loss for the same number of leaves but can overfit small datasets unless `num_leaves` or `max_depth` is constrained. It finds splits with histogram-based algorithms, which bucket continuous feature values into discrete bins, and adds two techniques, Gradient-Based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), to speed up training on large datasets. LightGBM supports classification, regression, and ranking tasks.
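To make the histogram idea concrete, here is a minimal pure-Python sketch of split finding on binned gradients. It is an illustration under simplifying assumptions, not LightGBM's implementation: the function name `best_histogram_split`, the equal-width binning, and the squared-error gain with unit hessians are all choices made for this example.

```python
def best_histogram_split(values, gradients, n_bins=4):
    """Find the best bin boundary for one feature from gradient histograms.

    Assumption: squared-error loss with unit hessians, so the gain of a
    split is G_left^2 / n_left + G_right^2 / n_right - G_total^2 / n_total.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature

    # Bucket every sample once: one pass over the data per feature.
    grad_sum = [0.0] * n_bins
    count = [0] * n_bins
    for v, g in zip(values, gradients):
        b = min(int((v - lo) / width), n_bins - 1)
        grad_sum[b] += g
        count[b] += 1

    total_g, total_n = sum(grad_sum), sum(count)
    best_gain, best_threshold = 0.0, None
    left_g, left_n = 0.0, 0
    # Only bin boundaries are candidate splits, not every distinct value.
    for b in range(n_bins - 1):
        left_g += grad_sum[b]
        left_n += count[b]
        right_g, right_n = total_g - left_g, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        gain = (left_g ** 2 / left_n + right_g ** 2 / right_n
                - total_g ** 2 / total_n)
        if gain > best_gain:
            best_gain, best_threshold = gain, lo + (b + 1) * width
    return best_threshold, best_gain


values = [1, 2, 3, 4, 5, 6, 7, 8]
gradients = [-1, -1, -1, -1, 1, 1, 1, 1]
threshold, gain = best_histogram_split(values, gradients)
print(threshold, gain)
```

With four bins, only three boundaries are evaluated instead of seven distinct thresholds, which is where the speed and memory savings come from as the dataset grows.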

**Key features:**

  - Leaf-wise tree growth for better accuracy

  - Histogram-based splits for speed and memory efficiency

  - GOSS to focus training on samples with large gradients

  - EFB to reduce the effective number of features by bundling mutually exclusive sparse features

  - Supports GPU acceleration
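The GOSS idea from the list above can be sketched in a few lines of plain Python. This is a simplified illustration, not LightGBM's code: `goss_sample` is a hypothetical helper, and its `top_rate` and `other_rate` arguments are borrowed from the names of the corresponding LightGBM parameters. Samples with the largest absolute gradients are all kept, a random subset of the rest is drawn, and that subset is up-weighted by `(1 - top_rate) / other_rate` so the gradient statistics stay approximately unbiased.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {sample_index: weight} for one GOSS-style sampling round."""
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    # Rank samples by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top, rest = order[:n_top], order[n_top:]

    # Keep every large-gradient sample; randomly sample the remainder.
    sampled = random.Random(seed).sample(rest, min(n_other, len(rest)))

    # Up-weight the sampled small-gradient rows to compensate for the drop.
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in sampled})
    return weights


gradients = [float(i) for i in range(100)]
weights = goss_sample(gradients)
print(len(weights), "of", len(gradients), "samples kept")
```

In this toy run only 30 of the 100 rows take part in the next tree, yet the 20 rows with the largest gradients are all among them.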

Example usage (regression with Python LightGBM):

In [None]:
#!pip install lightgbm
import lightgbm as lgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

In [None]:
# Create LightGBM dataset format
train_data = lgb.Dataset(X_train, label=y_train)
test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)

In [None]:
# Set parameters
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}


In [None]:
# Train the model
num_rounds = 100
model = lgb.train(params, train_data, num_rounds, valid_sets=[test_data])



In [None]:
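# Predict and evaluate
y_pred = model.predict(X_test, num_iteration=model.best_iteration)
# mean_squared_error returns the MSE; take the square root to get the RMSE
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5
print(f'MSE: {mse}')
print(f'RMSE: {rmse:.4f}')

MSE: 0.23285525605518215
RMSE: 0.4826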
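A note on the metric: `sklearn.metrics.mean_squared_error` returns the mean squared error by default, and the RMSE is its square root. The small sketch below shows the relationship on made-up numbers (the `y_true` and `y_pred` lists are illustrative, not taken from the housing data).

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the MSE, expressed in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))


y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 7.0]
print(f"MSE:  {mse(y_true, y_pred)}")
print(f"RMSE: {rmse(y_true, y_pred):.4f}")
```

Because the RMSE is in the units of the target, it is usually the easier of the two numbers to interpret.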


### CatBoost (Categorical Boosting)
- What is CatBoost?

`CatBoost` is a gradient boosting framework developed by Yandex that handles categorical features natively, without preprocessing such as one-hot encoding. It uses ordered boosting, which reduces prediction shift (a form of target leakage) and therefore overfitting, and it encodes categorical variables automatically, which makes it well suited to datasets that mix categorical and numerical features. CatBoost is competitive in speed and accuracy and supports GPU acceleration.

**Key features:**

  - Native categorical feature support

  - Ordered boosting to reduce overfitting

  - Built-in handling of missing values in numerical features

  - Good results with default hyperparameters, so less tuning is usually needed

  - Supports classification, regression, ranking, and multi-class problems
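The "ordered" idea behind the categorical handling can be illustrated with ordered target statistics: each row's category is encoded using only the targets of rows that come before it in a random permutation, so a row never sees its own label. The sketch below is a simplified stand-in, not CatBoost's internals; `ordered_target_stats`, `prior`, and `prior_weight` are names chosen for this example.

```python
import random


def ordered_target_stats(categories, targets, prior=0.5, prior_weight=1.0, seed=0):
    """Encode each category from the targets of earlier rows only."""
    n = len(categories)
    order = list(range(n))
    random.Random(seed).shuffle(order)  # random "arrival" order of the rows

    sums, counts = {}, {}
    encoded = [0.0] * n
    for i in order:
        c = categories[i]
        s, k = sums.get(c, 0.0), counts.get(c, 0)
        # Smoothed mean of the targets seen so far for this category.
        encoded[i] = (s + prior * prior_weight) / (k + prior_weight)
        # Only now reveal this row's target to later rows.
        sums[c] = s + targets[i]
        counts[c] = k + 1
    return encoded


categories = ["Private", "Local-gov", "Private", "Private"]
targets = [0, 1, 1, 0]
print(ordered_target_stats(categories, targets))
```

The first row seen for any category always receives the prior, which is why this kind of encoding avoids leaking a row's own target into its feature value.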

Example usage (classification with Python CatBoost):

In [None]:
!pip install catboost



In [None]:
#!pip install catboost
from catboost import CatBoostClassifier
from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score

# Load the dataset
data = fetch_openml(name='adult', version=2, as_frame=True)
X = data.data
y = data.target

X.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States |
| 4 | 18 | NaN | 103497 | Some-college | 10 | Never-married | NaN | Own-child | White | Female | 0 | 0 | 30 | United-States |


In [None]:
y.head()

| | class |
|---|---|
| 0 | <=50K |
| 1 | <=50K |
| 2 | >50K |
| 3 | >50K |
| 4 | <=50K |


In [None]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   age             48842 non-null  int64   
 1   workclass       46043 non-null  category
 2   fnlwgt          48842 non-null  int64   
 3   education       48842 non-null  category
 4   education-num   48842 non-null  int64   
 5   marital-status  48842 non-null  category
 6   occupation      46033 non-null  category
 7   relationship    48842 non-null  category
 8   race            48842 non-null  category
 9   sex             48842 non-null  category
 10  capital-gain    48842 non-null  int64   
 11  capital-loss    48842 non-null  int64   
 12  hours-per-week  48842 non-null  int64   
 13  native-country  47985 non-null  category
dtypes: category(8), int64(6)
memory usage: 2.6 MB


In [None]:
X.isnull().sum()

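| Column | Missing values |
|---|---|
| age | 0 |
| workclass | 2799 |
| fnlwgt | 0 |
| education | 0 |
| education-num | 0 |
| marital-status | 0 |
| occupation | 2809 |
| relationship | 0 |
| race | 0 |
| sex | 0 |
| capital-gain | 0 |
| capital-loss | 0 |
| hours-per-week | 0 |
| native-country | 857 |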


In [None]:
X.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country'],
      dtype='object')

In [None]:
# Specify categorical features by column name
cat_features = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']

In [None]:
# Fill the NaN values in the categorical features
# Work on an explicit copy so pandas does not raise SettingWithCopyWarning
X = X.copy()
for col in cat_features:
    if X[col].isnull().any():
        X[col] = X[col].astype('object').fillna('Unknown')


In [None]:
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize CatBoost model
model = CatBoostClassifier(iterations=1000, learning_rate=0.05, depth=6, verbose=100)

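# Train model with categorical features specified
# Note: the test split doubles as eval_set here, so the final accuracy is not a fully held-out estimate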
model.fit(X_train, y_train, cat_features=cat_features, eval_set=(X_test, y_test))

# Predict and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

0:	learn: 0.6447736	test: 0.6439227	best: 0.6439227 (0)	total: 202ms	remaining: 3m 22s
100:	learn: 0.2907311	test: 0.2855474	best: 0.2855474 (100)	total: 11.1s	remaining: 1m 38s
200:	learn: 0.2786445	test: 0.2766110	best: 0.2766110 (200)	total: 19.4s	remaining: 1m 17s
300:	learn: 0.2693688	test: 0.2704353	best: 0.2704353 (300)	total: 28.2s	remaining: 1m 5s
400:	learn: 0.2652878	test: 0.2690773	best: 0.2690773 (400)	total: 37.7s	remaining: 56.3s
500:	learn: 0.2614034	test: 0.2682083	best: 0.2681948 (498)	total: 45.3s	remaining: 45.2s
600:	learn: 0.2577463	test: 0.2673236	best: 0.2673236 (600)	total: 55.1s	remaining: 36.6s
700:	learn: 0.2547281	test: 0.2670082	best: 0.2670082 (700)	total: 1m 4s	remaining: 27.5s
800:	learn: 0.2516728	test: 0.2667126	best: 0.2666970 (798)	total: 1m 12s	remaining: 17.9s
900:	learn: 0.2489046	test: 0.2665288	best: 0.2665009 (874)	total: 1m 21s	remaining: 8.99s
999:	learn: 0.2464334	test: 0.2662221	best: 0.2662221 (999)	total: 1m 30s	remaining: 0us

bestTest 