Boosting

Boosting is an ensemble technique that combines many weak models (learners that perform only slightly better than chance, such as shallow decision trees) into a single strong predictive model. The models are trained sequentially, and each new model focuses on the errors made by the ensemble built so far.

How Boosting Works:

1. Initial Model: A simple model is trained on the data.
2. Error Calculation: The errors (residuals) of the current ensemble on the training data are calculated.
3. New Model: A new model is trained to predict those errors, so that adding it to the ensemble reduces them.
4. Iteration: Steps 2-3 are repeated for a fixed number of rounds, with each new model correcting what the ensemble still gets wrong.
5. Final Model: The final prediction is a weighted combination of all the individual models.
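
The steps above can be sketched in plain Python. This is a minimal illustration, assuming squared-error regression on a single feature with one-split "stumps" as the weak models. The function names (fit_stump, boost, mse) and the toy data are hypothetical, and this is not how XGBoost itself is implemented.

```python
# Minimal gradient boosting for squared-error regression on one feature.
# All names here are hypothetical and chosen for illustration only.

def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by least squares."""
    best = None
    # Candidate thresholds: every distinct x except the largest,
    # so that both sides of the split are non-empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean


def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    base = sum(ys) / len(ys)              # step 1: initial model (the mean)
    preds = [base] * len(ys)
    stumps = []
    history = [mse(ys, preds)]            # training error after each round
    for _ in range(n_rounds):             # step 4: iterate
        residuals = [y - p for y, p in zip(ys, preds)]   # step 2: errors
        stump = fit_stump(xs, residuals)                 # step 3: new model
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        history.append(mse(ys, preds))

    def predict(x):                       # step 5: weighted combination
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, history


# Toy data (an assumption for this sketch): y = x^2 on a small grid.
xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]
predict, history = boost(xs, ys)
print(f"training MSE: {history[0]:.3f} -> {history[-1]:.3f}")
```

Here every model receives the same weight (the learning rate). Other boosting variants, such as AdaBoost, assign a different weight to each model.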

XGBoost

XGBoost (Extreme Gradient Boosting) is a popular implementation of the gradient boosting algorithm. It's designed to be highly efficient, flexible, and portable.

Key Features of XGBoost:

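1. Gradient Boosting: Each new tree is fit using the gradient (and second derivative) of the loss function with respect to the current predictions.
2. Tree-Based Models: XGBoost uses decision trees as the default base learners.
3. Regularization: The training objective includes L1 and L2 penalties on leaf weights and a penalty on the number of leaves, which help prevent overfitting.
4. Parallel Computing: XGBoost parallelizes the search for splits within each tree across CPU threads. The trees themselves are still built one after another.
5. Handling Missing Values: XGBoost handles missing values natively by learning, at each split, a default direction for rows whose value is missing.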
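
To make the regularization point more concrete, the sketch below evaluates the standard second-order formulas for a leaf's optimal weight and for the gain of a split. G and H stand for the sums of first and second derivatives of the loss over the rows in a leaf. The function names are hypothetical, reg_lambda and gamma are named after the corresponding XGBoost parameters, and the L1 penalty and other constraints are left out, so treat this as a simplified illustration rather than the library's exact procedure.

```python
# Simplified second-order formulas used in regularized tree boosting.
# Function names are hypothetical; the L1 penalty is omitted for brevity.

def leaf_weight(G, H, reg_lambda=1.0):
    """Optimal leaf value: w = -G / (H + lambda)."""
    return -G / (H + reg_lambda)


def leaf_score(G, H, reg_lambda=1.0):
    """Quality of a leaf: G^2 / (H + lambda)."""
    return G * G / (H + reg_lambda)


def split_gain(G_left, H_left, G_right, H_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting a node into two leaves, minus gamma."""
    parent = leaf_score(G_left + G_right, H_left + H_right, reg_lambda)
    children = (leaf_score(G_left, H_left, reg_lambda)
                + leaf_score(G_right, H_right, reg_lambda))
    return 0.5 * (children - parent) - gamma


# Example: 4 rows on each side, squared-error loss (second derivative 1 per row).
print(leaf_weight(-8.0, 4.0, reg_lambda=0.0))                   # no shrinkage
print(leaf_weight(-8.0, 4.0, reg_lambda=4.0))                   # L2 shrinks the leaf value
print(split_gain(-8.0, 4.0, 8.0, 4.0, reg_lambda=4.0, gamma=10.0))  # negative: not worth splitting
```

A larger reg_lambda pulls leaf values toward zero, and a larger gamma makes a split pay for itself before it is kept.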

Advantages of XGBoost:

1. High Accuracy: XGBoost often achieves strong accuracy, particularly on structured (tabular) data.
2. Handling Large Datasets: XGBoost scales to large datasets through histogram-based split finding, multithreading, and distributed training.
3. Flexibility: XGBoost supports various built-in loss functions and evaluation metrics, as well as user-defined ones.
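
The flexibility in loss functions follows from the fact that gradient boosting only needs the first and second derivatives of the loss with respect to the raw prediction. As a small sketch, here are those two quantities for squared error and for logistic loss on a single example. The function names are hypothetical, and a real custom objective would operate on whole arrays rather than single values.

```python
import math

# Per-example gradient and second derivative (Hessian) of two common losses,
# taken with respect to the raw model output. Names are hypothetical.

def squared_error_grad_hess(y_true, raw_pred):
    """Loss 0.5 * (raw_pred - y_true)^2."""
    return raw_pred - y_true, 1.0


def logistic_grad_hess(y_true, raw_pred):
    """Binary log loss with raw_pred interpreted as a log-odds score."""
    p = 1.0 / (1.0 + math.exp(-raw_pred))   # predicted probability
    return p - y_true, p * (1.0 - p)


print(squared_error_grad_hess(1.0, 3.0))
print(logistic_grad_hess(1.0, 0.0))
```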

Applications:

1. Classification: XGBoost is widely used for classification tasks on tabular or pre-extracted features, such as medical diagnosis, sentiment analysis, and image classification.
2. Regression: XGBoost is used for regression tasks, such as predicting continuous outcomes like house prices or stock prices.

XGBoost is a powerful algorithm that's widely used in machine learning applications due to its accuracy, efficiency, and flexibility.

In [40]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, mean_squared_error, r2_score,
)
from sklearn.model_selection import train_test_split




In [41]:
data = sns.load_dataset('diamonds')
data.head()

,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


In [42]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   carat    53940 non-null  float64 
 1   cut      53940 non-null  category
 2   color    53940 non-null  category
 3   clarity  53940 non-null  category
 4   depth    53940 non-null  float64 
 5   table    53940 non-null  float64 
 6   price    53940 non-null  int64   
 7   x        53940 non-null  float64 
 8   y        53940 non-null  float64 
 9   z        53940 non-null  float64 
dtypes: category(3), float64(6), int64(1)
memory usage: 3.0 MB


In [43]:
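# Encode each categorical column as integer codes. LabelEncoder assigns codes in
# alphabetical order of the labels (for cut: Fair=0, Good=1, Ideal=2, Premium=3,
# Very Good=4), so the natural quality ordering of these columns is not preserved.
encoders = {}
for col in data.columns:
    if data[col].dtype == "category":
        encoders[col] = LabelEncoder()  # one encoder per column, kept for decoding
        data[col] = encoders[col].fit_transform(data[col])

data.head()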

,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,2,1,3,61.5,55.0,326,3.95,3.98,2.43
1,0.21,3,1,2,59.8,61.0,326,3.89,3.84,2.31
2,0.23,1,1,4,56.9,65.0,327,4.05,4.07,2.31
3,0.29,3,5,5,62.4,58.0,334,4.2,4.23,2.63
4,0.31,1,6,3,63.3,58.0,335,4.34,4.35,2.75
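
The class indices 0-4 in the reports that follow are the integer codes assigned to cut. As a small sketch of where those codes come from, the snippet below fits a LabelEncoder on a toy list of cut labels (a stand-in for the real column, so the list and variable names are assumptions) and prints the label-to-code mapping.

```python
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the "cut" column; the real column has the same five labels.
cut_labels = ["Ideal", "Premium", "Good", "Very Good", "Fair", "Ideal"]

cut_encoder = LabelEncoder().fit(cut_labels)

# classes_ holds the labels in sorted order; a label's position is its code.
cut_mapping = {str(label): code for code, label in enumerate(cut_encoder.classes_)}
print(cut_mapping)

# Encoding the first three rows shown in the table above.
codes = cut_encoder.transform(["Ideal", "Premium", "Good"])
print(list(codes))
```

Keeping a separate fitted encoder for each column makes it possible to call inverse_transform later and turn predicted codes back into labels.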


In [44]:
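# Features are every column except the target, "cut".
X = data.drop(columns="cut")
y = data["cut"]
# Note: train_size=0.2 trains on only 20% of the rows (10,788) and evaluates on the
# remaining 80% (43,152, the total support in the report below).
train_X, test_X, train_Y, test_Y = train_test_split(X, y, train_size=0.2, random_state=42)
# Baseline: a single depth-limited tree, with class weights balancing the uneven cut counts.
model = DecisionTreeClassifier(max_depth=10, class_weight="balanced", random_state=42)
model.fit(train_X, train_Y)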

y_pred = model.predict(test_X)

# 📊 Evaluate performance
print("\nAccuracy Score:\n", accuracy_score(test_Y, y_pred))
print("\nClassification Report:\n", classification_report(test_Y, y_pred))


Accuracy Score:
 0.7077539859102707

Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.85      0.83      1299
           1       0.53      0.73      0.61      3950
           2       0.83      0.87      0.85     17284
           3       0.67      0.73      0.70     11037
           4       0.57      0.35      0.44      9582

    accuracy                           0.71     43152
   macro avg       0.68      0.71      0.68     43152
weighted avg       0.70      0.71      0.70     43152



In [45]:
model = RandomForestClassifier(random_state=42)
model.fit(train_X, train_Y)

y_pred = model.predict(test_X)
# 📊 Evaluate performance
print("\nAccuracy Score:\n", accuracy_score(test_Y, y_pred))
print("\nClassification Report:\n", classification_report(test_Y, y_pred))


Accuracy Score:
 0.7614479050797182

Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.85      0.87      1299
           1       0.78      0.64      0.70      3950
           2       0.82      0.91      0.86     17284
           3       0.73      0.80      0.76     11037
           4       0.64      0.49      0.56      9582

    accuracy                           0.76     43152
   macro avg       0.77      0.74      0.75     43152
weighted avg       0.75      0.76      0.75     43152



In [46]:
from xgboost import XGBClassifier


In [47]:

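params = {
    "n_estimators": 300,       # number of boosting rounds (trees)
    "max_depth": 5,            # shallow trees limit overfitting
    "learning_rate": 0.1,      # shrinkage applied to each tree's contribution
    "subsample": 0.8,          # fraction of rows sampled for each tree
    "colsample_bytree": 0.8,   # fraction of columns sampled for each tree
    "n_jobs": -1,              # use all CPU cores
    "tree_method": "hist",     # histogram-based split finding
}
model = XGBClassifier(**params)

model.fit(train_X, train_Y)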

y_pred = model.predict(test_X)

print("\nAccuracy Score:\n", accuracy_score(test_Y, y_pred))
print("\nClassification Report:\n", classification_report(test_Y, y_pred))



Accuracy Score:
 0.7806822395253986

Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.88      0.89      1299
           1       0.79      0.66      0.72      3950
           2       0.82      0.91      0.87     17284
           3       0.77      0.81      0.79     11037
           4       0.66      0.55      0.60      9582

    accuracy                           0.78     43152
   macro avg       0.79      0.76      0.77     43152
weighted avg       0.77      0.78      0.78     43152

