<a href="https://colab.research.google.com/github/Khuzamaalk/T5_BootCamp/blob/main/M_Boosting_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Boosting Exercise

In this exercise, you will learn about boosting, an ensemble method used primarily to reduce bias, and to a lesser extent variance, in supervised learning. It combines multiple weak learners into a single strong learner. The learners are trained sequentially, each one trying to correct the errors of its predecessor.
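
To make "each one corrects its predecessor" concrete, here is a minimal sketch of discrete AdaBoost, one common boosting scheme. It assumes labels encoded as -1/+1 and uses depth-1 decision trees as weak learners. The helper names `fit_discrete_adaboost` and `predict_adaboost` are hypothetical and exist only for this illustration; this is not how scikit-learn's `AdaBoostClassifier` is organized internally.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier


def fit_discrete_adaboost(X, y_pm, n_rounds=10):
    """Train n_rounds stumps sequentially; y_pm must contain only -1 and +1."""
    n_samples = len(y_pm)
    weights = np.full(n_samples, 1.0 / n_samples)  # start with uniform sample weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X, y_pm, sample_weight=weights)
        pred = stump.predict(X)
        # Weighted error of this stump, clipped to keep the logarithm finite
        err = np.clip(weights[pred != y_pm].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # vote strength of this stump
        # Misclassified samples gain weight, so the next stump focuses on them
        weights = weights * np.exp(-alpha * y_pm * pred)
        weights /= weights.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas


def predict_adaboost(stumps, alphas, X):
    """Combine the stumps by a weighted vote."""
    score = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.where(score >= 0, 1, -1)


sketch_data = load_breast_cancer()
X_sketch = sketch_data.data
y_pm = 2 * sketch_data.target - 1  # map {0, 1} to {-1, +1}

stumps, alphas = fit_discrete_adaboost(X_sketch, y_pm, n_rounds=10)
first_stump_acc = np.mean(stumps[0].predict(X_sketch) == y_pm)
ensemble_acc = np.mean(predict_adaboost(stumps, alphas, X_sketch) == y_pm)
print(f"First stump training accuracy: {first_stump_acc:.3f}")
print(f"10-stump ensemble training accuracy: {ensemble_acc:.3f}")
```

Both numbers are training accuracies, so they only illustrate how the weighted vote combines weak learners; they say nothing about generalization.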

## Dataset
We will use the Breast Cancer dataset for this exercise. This dataset contains features computed from digitized images of breast masses and is used to predict whether a mass is malignant or benign. **Feel free to use another dataset!**

## Task
Your task is to:
1. Load the dataset.
2. Preprocess the data (if necessary).
3. Implement boosting models.
4. Evaluate the models' performance.

Please fill in the following code blocks to complete the exercise.

## AdaBoost Tutorial


### Step 1: Import Required Libraries
First, import the necessary libraries for data manipulation, model training, and evaluation.

In [None]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings('ignore')

### Step 2: Load and Preprocess the Dataset
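Load the dataset and preprocess it. In general, this includes handling missing values, encoding categorical variables, and separating the features from the target variable. All features in this dataset are numeric, and the checks below show that it has no missing values or duplicate rows.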

In [None]:
# Load the dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

df.head()

|  | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |


In [None]:
# Separate features and target variable
X = df.drop('target', axis=1)
y = df['target']

In [None]:
df.isnull().sum()

mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
target                     0
dtype: int64

In [None]:
df.duplicated().sum()

0
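
Besides missing values and duplicates, it can help to check the class balance before modeling, because accuracy is harder to interpret when one class dominates. The following is a small sketch of such a check; the variable names (`balance_data`, `class_counts`) are chosen for illustration and do not overwrite anything defined above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

balance_data = load_breast_cancer()

# Count how many samples fall into each encoded class (0 and 1)
counts = np.bincount(balance_data.target)

# Map the integer codes to their human-readable names
class_counts = {
    str(name): int(count)
    for name, count in zip(balance_data.target_names, counts)
}

for name, count in class_counts.items():
    share = count / counts.sum()
    print(f"{name}: {count} samples ({share:.1%})")
```

In this dataset, target value 0 means malignant and 1 means benign, and benign cases are the majority.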

### Step 3: Split the Dataset
Split the dataset into training and testing sets to evaluate the performance of the models.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
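
The split above is purely random. As an optional variation, passing `stratify=y` to `train_test_split` keeps the class proportions nearly identical in the training and test sets. The sketch below assumes the same `test_size` and `random_state` as above and uses separate variable names so it does not replace the split used in the rest of the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X_all, y_all = load_breast_cancer(return_X_y=True)

# stratify=y_all preserves the benign/malignant ratio in both subsets
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_all, y_all, test_size=0.3, random_state=42, stratify=y_all
)

print(f"Benign share overall: {y_all.mean():.3f}")
print(f"Benign share in train: {y_tr_s.mean():.3f}")
print(f"Benign share in test:  {y_te_s.mean():.3f}")
```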


In [None]:
# Preprocess the data using StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
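
The scaler is fitted on the training set only and then applied to the test set, so no test-set statistics leak into training. The sketch below, which uses its own variable names, checks the consequence: the scaled training features have mean 0 and standard deviation 1, while the scaled test features are only approximately standardized. Note that standardization generally does not change what a decision tree learns, since its splits depend on the ordering of feature values, so this step matters more for other model families than for the tree-based boosters used here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_raw, y_raw = load_breast_cancer(return_X_y=True)
X_tr_raw, X_te_raw, _, _ = train_test_split(
    X_raw, y_raw, test_size=0.3, random_state=42
)

demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr_raw)  # learn mean/std from training data
X_te_scaled = demo_scaler.transform(X_te_raw)      # reuse the training statistics

print("Largest |mean| over scaled training features:",
      np.abs(X_tr_scaled.mean(axis=0)).max())
print("Largest |mean| over scaled test features:",
      np.abs(X_te_scaled.mean(axis=0)).max())
```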

### Step 4: Initialize and Train the AdaBoost Classifier
Initialize a Decision Tree classifier and use it as the base estimator for the AdaBoost classifier.

In [None]:
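# A depth-1 decision tree (a "stump") serves as the weak learner
dtc = DecisionTreeClassifier(max_depth=1)

# scikit-learn 1.2 renamed `base_estimator` to `estimator` (the old name was removed in 1.4);
# on versions older than 1.2, pass `base_estimator=dtc` instead
adaboost_classifier = AdaBoostClassifier(estimator=dtc, n_estimators=50, random_state=42)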

adaboost_classifier.fit(X_train, y_train)

pred = adaboost_classifier.predict(X_test)

# Evaluate the model's accuracy
accuracy = accuracy_score(y_test, pred)
print(f'AdaBoost Classifier Model Accuracy: {accuracy * 100:.2f}%')

AdaBoost Classifier Model Accuracy: 97.66%
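
Accuracy alone does not show which kind of error the model makes, and for a cancer classifier a missed malignant case is usually the more costly mistake. As a hedged sketch, the block below computes a confusion matrix and the recall for the malignant class (encoded as 0). It retrains an `AdaBoostClassifier` with its default weak learner on the unscaled data under separate variable names, so its numbers may differ slightly from the accuracy reported above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

X_cm, y_cm = load_breast_cancer(return_X_y=True)
X_cm_train, X_cm_test, y_cm_train, y_cm_test = train_test_split(
    X_cm, y_cm, test_size=0.3, random_state=42
)

cm_model = AdaBoostClassifier(n_estimators=50, random_state=42)
cm_model.fit(X_cm_train, y_cm_train)
cm_pred = cm_model.predict(X_cm_test)

# Rows are true classes, columns are predicted classes, in the order [0, 1]
cm = confusion_matrix(y_cm_test, cm_pred, labels=[0, 1])
print(cm)

# Share of truly malignant samples (label 0) that the model identified
malignant_recall = recall_score(y_cm_test, cm_pred, pos_label=0)
print(f"Recall for the malignant class: {malignant_recall:.3f}")
```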


## XGBoost Tutorial


### Step 1: Import Required Libraries
Import the XGBoost classifier from the `xgboost` package, which is installed separately from scikit-learn. The other libraries were already imported in the AdaBoost section.

In [None]:
from xgboost import XGBClassifier

### Step 2: Load and Preprocess the Dataset
Reload the dataset and separate the features from the target variable. The checks in the AdaBoost section found no missing values or duplicate rows, so no further cleaning is needed here.

In [None]:
from sklearn.datasets import load_breast_cancer

# Load the dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

X = df.drop('target', axis=1)
y = df['target']

### Step 3: Split the Dataset
Split the dataset into training and testing sets to evaluate the performance of the models.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

### Step 4: Initialize and Train the XGBoost Classifier
Initialize and train the XGBoost classifier.

In [None]:
xgb = XGBClassifier(n_estimators=50, random_state=42)
xgb.fit(X_train, y_train)

pred = xgb.predict(X_test)

# Evaluate the model's accuracy
accuracy = accuracy_score(y_test, pred)
print(f'XGBoost Classifier Model Accuracy: {accuracy * 100:.2f}%')

XGBoost Classifier Model Accuracy: 96.49%


## Gradient Boosting Tutorial


### Step 1: Import Required Libraries
Import scikit-learn's `GradientBoostingClassifier`. The other libraries were already imported in the AdaBoost section.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

### Step 2: Load and Preprocess the Dataset
As before, reload the dataset and separate the features from the target variable. No further cleaning is needed.

In [None]:
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

X = df.drop('target', axis=1)
y = df['target']

### Step 3: Split the Dataset
Split the dataset into training and testing sets to evaluate the performance of the models.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


### Step 4: Initialize and Train the Gradient Boosting Classifier
Initialize and train the Gradient Boosting classifier.

In [None]:
gbc = GradientBoostingClassifier(n_estimators=50, random_state=42)
gbc.fit(X_train, y_train)


pred = gbc.predict(X_test)

# Evaluate the model's accuracy
accuracy = accuracy_score(y_test, pred)
print(f'Gradient Boosting Classifier Model Accuracy: {accuracy * 100:.2f}%')

Gradient Boosting Classifier Model Accuracy: 95.91%
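
The three accuracies in this notebook come from a single 70/30 split of 569 samples, so differences of one or two percentage points may be within the noise of that split. A sketch of a more stable comparison uses 5-fold stratified cross-validation. The fold count and the `models` dictionary are choices made for this illustration. Only the two scikit-learn models are included so the block runs without extra dependencies, but `XGBClassifier` follows the scikit-learn estimator interface and can be added to the dictionary in the same way.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_cv, y_cv = load_breast_cancer(return_X_y=True)

# Shuffled, stratified folds so every fold has a similar class balance
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
}

cv_scores = {}
for name, model in models.items():
    # One accuracy value per fold
    scores = cross_val_score(model, X_cv, y_cv, cv=cv, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Comparing the mean together with the spread across folds gives a better sense of whether one model is really ahead of another on this dataset.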
