
# LightGBM and CatBoost

**Introduction to LightGBM**

- What is LightGBM?

  - A gradient boosting implementation designed to handle large, high-dimensional datasets with speed and accuracy

  - Key features of LightGBM:

    - Histogram-Based Splitting

    - Leaf-Wise Tree Growth

    - Support for GPU Training

    - Handling Sparse Data

  - Advantages

    - Typically faster training than XGBoost, especially on large datasets

    - Handles large datasets effectively

    - Reduces memory usage through histogram-based splitting

  - When to use LightGBM

    - Large datasets with numerical features

    - Time-sensitive tasks requiring fast training
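
The histogram-based splitting listed above can be illustrated with a small NumPy sketch. It is not LightGBM's actual implementation: the equal-width binning, the squared-error gain formula, and the names `best_histogram_split` and `max_bin` as used here are assumptions made for illustration. The point it tries to show is that candidate thresholds are scanned per bin rather than per row.

```python
import numpy as np

def best_histogram_split(x, grad, max_bin=16):
    """Pick the bin edge that best separates gradients (illustrative only)."""
    # Bucket the continuous feature into `max_bin` equal-width bins (an assumption;
    # real libraries choose bin boundaries differently).
    edges = np.linspace(x.min(), x.max(), max_bin + 1)
    bin_idx = np.clip(np.digitize(x, edges[1:-1]), 0, max_bin - 1)

    # A single pass over the rows builds per-bin gradient sums and row counts.
    grad_sum = np.bincount(bin_idx, weights=grad, minlength=max_bin)
    count = np.bincount(bin_idx, minlength=max_bin)

    total_sum, total_count = grad_sum.sum(), count.sum()
    best_gain, best_threshold = 0.0, None
    left_sum, left_count = 0.0, 0
    # Only bin boundaries are candidate thresholds, so this loop costs O(max_bin).
    for b in range(max_bin - 1):
        left_sum += grad_sum[b]
        left_count += count[b]
        right_count = total_count - left_count
        if left_count == 0 or right_count == 0:
            continue
        right_sum = total_sum - left_sum
        # Gain for squared-error loss with unit hessians and no regularisation.
        gain = (left_sum ** 2 / left_count
                + right_sum ** 2 / right_count
                - total_sum ** 2 / total_count)
        if gain > best_gain:
            best_gain, best_threshold = gain, edges[b + 1]
    return best_threshold, best_gain

# Synthetic feature whose gradient flips sign at x = 4.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=1000)
grad = np.where(x < 4.0, -1.0, 1.0)
threshold, gain = best_histogram_split(x, grad)
print(threshold, gain)
```

The chosen threshold should land on a bin edge close to 4. Because only the per-bin sums and counts are kept after the first pass, memory and split-search cost depend on the number of bins rather than the number of rows.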

**Overview of CatBoost**

- What is CatBoost?

  - A gradient boosting library designed to handle categorical features without preprocessing such as one-hot encoding
  
  - Key Features of CatBoost:
    
    - Native Support for Categorical Data

    - Ordered Boosting

    - Robust to Overfitting

  - Advantages

    - Eliminates the need for manual encoding of categorical data

    - Reduces overfitting with robust boosting techniques

    - Easy to implement for datasets with many categorical features

  - When to use CatBoost

    - Datasets with a high proportion of categorical features

    - Applications where overfitting is a concern
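
One idea behind CatBoost's native categorical handling and ordered boosting is the ordered target statistic: a category is replaced by a smoothed average of the targets seen in earlier rows only, so a row never uses its own label. The sketch below is a simplified illustration, not CatBoost's code. The function name `ordered_target_stats`, the `prior` and `prior_weight` values, and the use of the given row order (CatBoost uses random permutations) are all assumptions.

```python
from collections import defaultdict

def ordered_target_stats(categories, targets, prior=0.5, prior_weight=1.0):
    """Encode each row using only the targets of the rows that precede it."""
    sums = defaultdict(float)   # running sum of targets per category
    counts = defaultdict(int)   # running row count per category
    encoded = []
    for cat, y in zip(categories, targets):
        # Compute the statistic BEFORE adding this row's target,
        # so the encoding never leaks the row's own label.
        encoded.append((sums[cat] + prior_weight * prior) / (counts[cat] + prior_weight))
        sums[cat] += y
        counts[cat] += 1
    return encoded

embarked = ["S", "C", "S", "S", "C"]
survived = [0, 1, 1, 1, 0]
print(ordered_target_stats(embarked, survived))  # [0.5, 0.5, 0.25, 0.5, 0.75]
```

The first occurrence of each category falls back to the prior, and later occurrences shift toward the observed survival rate of earlier rows in that category.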

**XGBoost, LightGBM, and CatBoost**

| Feature | XGBoost | LightGBM | CatBoost |
|----------|----------|----------|----------|
| Speed | Moderate | Fast | Fast |
| Handling categorical data | Typically requires encoding | Built-in support for integer-coded or `category` dtype columns | Native support, including string values |
| Memory usage | Moderate | Low | Moderate |
| Tuning complexity | Moderate | High (leaf-wise growth) | Low |
| Best use cases | General-purpose models | Large datasets | Categorical-heavy datasets |


**1. Train and compare LightGBM, CatBoost, and XGBoost models on a dataset, focusing on their ability to handle large datasets and categorical data**

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Upload the Titanic dataset (titanic.csv) from the local machine (Colab only)
from google.colab import files
uploaded = files.upload()

Saving titanic.csv to titanic (1).csv


In [3]:
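# Colab saved the upload as "titanic (1).csv" because titanic.csv already existed;
# adjust the name to match the "Saving ..." message printed above.
df = pd.read_csv("titanic (1).csv")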
print(df.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


In [6]:
!pip install catboost


Collecting catboost
  Downloading catboost-1.2.8-cp312-cp312-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.8-cp312-cp312-manylinux2014_x86_64.whl (99.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99.2/99.2 MB 6.3 MB/s eta 0:00:00
Installing collected packages: catboost
Successfully installed catboost-1.2.8


In [18]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgb
from sklearn.metrics import accuracy_score
from catboost import CatBoostClassifier
from xgboost import XGBClassifier

# Select Features and target
features = ["Pclass", "Sex", "Age", "Fare", "Embarked"]
target = "Survived"

# --- Common Data Preprocessing (Fill missing values) ---
# Fill missing values in Age, Sex, and Embarked BEFORE any encoding, so both sets of models see the same data.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Sex"] = df["Sex"].fillna(df["Sex"].mode()[0])
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# --- Data Preparation for LightGBM and XGBoost (label-encoded categorical features) ---
df_encoded = df.copy() # Make a copy for encoding

# Encode categorical Features for LightGBM and XGBoost
label_encoders = {}
for col in ["Sex", "Embarked"]:
    le = LabelEncoder()
    df_encoded[col] = le.fit_transform(df_encoded[col])
    label_encoders[col] = le  # keep the fitted encoder so codes can be mapped back later

X_encoded = df_encoded[features]
y = df_encoded[target] # Target is common for all models

X_train_encoded, X_test_encoded, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

print(f"Training Data Shape (Encoded for LGBM/XGBoost): {X_train_encoded.shape}")
print(f"Testing Data Shape (Encoded for LGBM/XGBoost): {X_test_encoded.shape}")


# --- Data Preparation for CatBoost (with original categorical features for native handling) ---
df_catboost_native = df.copy() # Make a copy to retain original categorical types

# Ensure categorical columns are explicitly of 'category' dtype for CatBoost.
# This helps CatBoost recognize them correctly when passing column names as cat_features.
for col in ["Pclass", "Sex", "Embarked"]:
    df_catboost_native[col] = df_catboost_native[col].astype('category')

X_cat_native = df_catboost_native[features]
# Same test_size and random_state as the encoded split, so the same rows land in train and test.
# The target splits are discarded because y_train and y_test already exist.
X_train_cat_native, X_test_cat_native, _, _ = train_test_split(X_cat_native, y, test_size=0.2, random_state=42)

print(f"Training Data Shape (Native for CatBoost): {X_train_cat_native.shape}")
print(f"Testing Data Shape (Native for CatBoost): {X_test_cat_native.shape}")

Training Data Shape (Encoded for LGBM/XGBoost): (712, 5)
Testing Data Shape (Encoded for LGBM/XGBoost): (179, 5)
Training Data Shape (Native for CatBoost): (712, 5)
Testing Data Shape (Native for CatBoost): (179, 5)
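
The preparation above splits the data twice and reuses `y_train` and `y_test` from the first split, which is only valid if both splits select the same rows. A minimal sketch of how that could be checked is shown below on toy data. The frames `toy` and `toy_encoded` are hypothetical stand-ins for the native and encoded feature tables, not the Titanic data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins (hypothetical data) for the native and label-encoded frames.
toy = pd.DataFrame({"Sex": ["male", "female"] * 10, "Age": range(20)})
toy_encoded = toy.assign(Sex=toy["Sex"].map({"male": 1, "female": 0}))
toy_target = pd.Series([0, 1] * 10)

# Two separate splits with identical test_size and random_state.
enc_train, enc_test, y_tr, y_te = train_test_split(
    toy_encoded, toy_target, test_size=0.2, random_state=42)
nat_train, nat_test, _, _ = train_test_split(
    toy, toy_target, test_size=0.2, random_state=42)

# The row indices must match for the shared targets to line up with both frames.
same_rows = (enc_train.index.equals(nat_train.index)
             and enc_test.index.equals(nat_test.index))
print(same_rows)  # True
```

In the notebook itself, the equivalent check would compare `X_train_encoded.index` with `X_train_cat_native.index`, and likewise for the test frames.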
