# Ensemble Models
Ensemble methods are machine learning techniques that combine predictions from multiple models to improve overall performance. Instead of relying on a single model, they draw on several models at once, which often leads to better accuracy, robustness, and generalization. The underlying principle is that a group of models (an ensemble) can outperform a single model when the individual strengths and weaknesses are combined effectively. Depending on the method, the main benefit is a reduction in variance, a reduction in bias, or both, which in turn improves prediction accuracy.

# Bagging
Bagging (Bootstrap Aggregating) is an ensemble method that reduces variance by training multiple versions of the same model on different random subsets of the training data, sampled with replacement. Each model is trained independently, and their predictions are combined—through averaging for regression or majority voting for classification—to produce a more accurate and robust result. Bagging works best with high-variance models like decision trees, as it mitigates overfitting while retaining predictive power. Random Forest is a popular example of bagging, adding further randomness by selecting a subset of features at each tree split.
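
The bagging procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the helper names (`fit_stump`, `predict_stump`, `bagging_fit`, `bagging_predict`) and the choice of a one-split decision stump as the base learner are assumptions made for the example, and labels are assumed to be -1 and 1.

```python
import numpy as np

def fit_stump(X, y):
    """Hypothetical base learner: the single-feature threshold with the lowest training error."""
    best = (0, 0.0, 1, np.inf)  # (feature, threshold, polarity, error)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(pol * X[:, j] < pol * t, -1, 1)
                err = np.mean(pred != y)
                if err < best[3]:
                    best = (j, t, pol, err)
    return best[:3]

def predict_stump(stump, X):
    j, t, pol = stump
    return np.where(pol * X[:, j] < pol * t, -1, 1)

def bagging_fit(X, y, n_models=25, seed=0):
    """Train each model on a bootstrap sample (drawn with replacement)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)  # indices sampled with replacement
        models.append(fit_stump(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Majority vote over the individual models (ties go to +1)."""
    votes = np.sum([predict_stump(m, X) for m in models], axis=0)
    return np.where(votes >= 0, 1, -1)
```

Using an odd number of models avoids tied votes for binary labels. For regression, the vote would be replaced by an average of the individual predictions.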

# Boosting

Boosting is an ensemble method that reduces bias by training models sequentially, where each new model focuses on correcting the errors made by its predecessor. Weak learners, often simple models like shallow decision trees, are combined iteratively to form a strong model. Boosting assigns higher importance to misclassified or poorly predicted instances, improving overall accuracy. Unlike bagging, boosting models are dependent on each other, making the process more sensitive to noise but highly effective for complex problems. Popular boosting algorithms include AdaBoost and XGBoost, known for their ability to achieve high predictive performance.


# Adaboost 

Adaboost, or Adaptive Boosting, is an ensemble learning algorithm that combines multiple weak classifiers to form a single strong classifier. The algorithm focuses on difficult-to-classify samples by iteratively adjusting the weights of training data points. This adaptive process ensures that the model progressively improves its performance on challenging examples. Adaboost is primarily used for binary classification problems but can also be extended to handle multi-class scenarios.


### Importing Libraries

In [10]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

### Implementing Adaboost
The Adaboost class below stores the number of weak classifiers at initialization, along with empty lists for their weights (alphas) and fitted models. The train method iteratively fits weak classifiers (decision stumps), computes each stump's weighted error, and adjusts the sample weights so that later stumps focus on misclassified samples. Each stump receives a weight that grows as its weighted error shrinks. The _train_stump method finds the best stump by exhaustively evaluating every feature, threshold, and polarity and keeping the combination with the lowest weighted classification error. The predict method combines the stumps' predictions using their weights and returns the sign of the weighted sum as the final ensemble prediction. Labels are expected to be -1 and 1.

In [11]:
class Adaboost:
    def __init__(self, n_estimators=50):
        self.n_estimators = n_estimators
        self.alphas = []
        self.models = []

    def train(self, X, y):
        n_samples, _ = X.shape
        weights = np.ones(n_samples) / n_samples

        for _ in range(self.n_estimators):
            stump = self._train_stump(X, y, weights)
            predictions = stump['predictions']
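            # Weighted error of this stump (the weights already sum to 1)
            error = np.sum(weights * (predictions != y))

            # Stump weight; the small constant avoids division by zero when error == 0
            alpha = 0.5 * np.log((1 - error) / (error + 1e-10))
            self.alphas.append(alpha)
            self.models.append(stump)

            # Raise the weights of misclassified samples, lower the rest, then renormalize
            weights *= np.exp(-alpha * y * predictions)
            weights /= np.sum(weights)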

    def _train_stump(self, X, y, weights):
        n_samples, n_features = X.shape
        min_error = float('inf')
        stump = {}

        for feature_idx in range(n_features):
            thresholds = np.unique(X[:, feature_idx])
            for threshold in thresholds:
                for polarity in [1, -1]:
                    predictions = np.ones(n_samples)
                    predictions[polarity * X[:, feature_idx] < polarity * threshold] = -1
                    error = np.sum(weights * (predictions != y))

                    if error < min_error:
                        min_error = error
                        stump = {
                            'feature_idx': feature_idx,
                            'threshold': threshold,
                            'polarity': polarity,
                            'predictions': predictions
                        }
        return stump

    def predict(self, X):
        n_samples = X.shape[0]
        final_predictions = np.zeros(n_samples)

        for alpha, model in zip(self.alphas, self.models):
            predictions = np.ones(n_samples)
            feature_idx = model['feature_idx']
            threshold = model['threshold']
            polarity = model['polarity']
            predictions[polarity * X[:, feature_idx] < polarity * threshold] = -1
            final_predictions += alpha * predictions

        return np.sign(final_predictions)

### Loading Data

We generate a synthetic binary classification dataset using make_classification, creating 500 samples with 5 features and 2 classes, while ensuring reproducibility with a fixed random seed. The class labels are initially 0 and 1, but they are converted to -1 and 1 using np.where to align with algorithms like Adaboost that require binary labels in this format. The dataset is then split into training and testing sets, with the first 350 samples used for training and the remaining 150 for testing. This setup provides a well-structured dataset for evaluating classification models.

In [12]:
X, y = make_classification(n_samples=500, n_features=5, n_classes=2, random_state=42)
y = np.where(y == 0, -1, 1)

train_size = 350
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]

### Training the Model

In [13]:
adaboost = Adaboost(n_estimators=50)
adaboost.train(X_train, y_train)

### Prediction and Testing
We now evaluate the Adaboost model on the test set. The model's predict method produces class labels for X_test, which are stored in y_pred. The confusion_matrix function counts true positives, true negatives, false positives, and false negatives and returns them as a 2x2 matrix laid out as [[TP, FP], [FN, TN]]. The accuracy_score function computes the fraction of samples for which y_true and y_pred agree. The precision_score function computes precision, the ratio of true positives to all predicted positives, and returns 0 when there are no predicted positives to avoid division by zero. Finally, the confusion matrix, accuracy, and precision are printed.

In [14]:
y_pred = adaboost.predict(X_test)

def confusion_matrix(y_true, y_pred):
    TP = np.sum((y_true == 1) & (y_pred == 1))
    TN = np.sum((y_true == -1) & (y_pred == -1))
    FP = np.sum((y_true == -1) & (y_pred == 1))
    FN = np.sum((y_true == 1) & (y_pred == -1))
    # Layout: rows are the predicted class (+1, -1), columns are the actual class (+1, -1)
    return np.array([[TP, FP], [FN, TN]])

def accuracy_score(y_true, y_pred):
    return np.sum(y_true == y_pred) / len(y_true)

def precision_score(y_true, y_pred):
    TP = np.sum((y_true == 1) & (y_pred == 1))
    FP = np.sum((y_true == -1) & (y_pred == 1))
    return TP / (TP + FP) if (TP + FP) > 0 else 0

cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)

print("Confusion Matrix:\n", cm, end="\n\n")
print("Accuracy:", accuracy)
print("Precision:", precision)


Confusion Matrix:
 [[65  4]
 [10 71]]

Accuracy: 0.9066666666666666
Precision: 0.9420289855072463
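
As a quick sanity check, the reported accuracy and precision can be recomputed by hand from the printed confusion matrix, reading it as [[TP, FP], [FN, TN]]. Recall is not reported above and is included here only for comparison.

```python
# Counts read from the printed matrix, laid out as [[TP, FP], [FN, TN]]
TP, FP = 65, 4
FN, TN = 10, 71

total = TP + FP + FN + TN
accuracy = (TP + TN) / total   # correct predictions over all test samples
precision = TP / (TP + FP)     # true positives over all predicted positives
recall = TP / (TP + FN)        # not reported above; shown for comparison

print("Samples:", total)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
```

The total of 150 matches the size of the test set, and the accuracy and precision agree with the printed values.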


## Theory for Adaboost Algorithm

The Adaboost algorithm begins by initializing the weights of all training samples equally. For a dataset with $N$ samples, the initial weight for each sample is given by:

$$
w_i = \frac{1}{N}, \quad \forall i \in \{1, 2, \dots, N\}
$$

A weak classifier $h_m(x)$ is then trained at iteration $m$. This classifier minimizes the weighted classification error, calculated as:

$$
\epsilon_m = \frac{\sum_{i=1}^{N} w_i \cdot \mathbb{I}(y_i \neq h_m(x_i))}{\sum_{i=1}^{N} w_i}
$$

Here, $\mathbb{I}$ is the indicator function, which is 1 if the condition is true and 0 otherwise.

The weight of the weak classifier, $\alpha_m$, is then computed as:

$$
\alpha_m = \frac{1}{2} \ln\left(\frac{1 - \epsilon_m}{\epsilon_m}\right)
$$
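
To get a feel for this formula, the short sketch below evaluates $\alpha_m$ for a few error rates. The function name `stump_weight` is a hypothetical helper introduced for illustration.

```python
import math

def stump_weight(error):
    """alpha = 0.5 * ln((1 - error) / error), defined for 0 < error < 1."""
    return 0.5 * math.log((1 - error) / error)

for error in (0.1, 0.3, 0.5, 0.7):
    print(f"error={error:.1f}  alpha={stump_weight(error):+.4f}")
```

The weight is positive when the classifier does better than chance ($\epsilon_m < 0.5$), zero at exactly chance level, and negative when it does worse, in which case its vote is effectively flipped.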

The weights of the training samples are updated to emphasize misclassified examples. The update rule is:

$$
w_i \leftarrow w_i \cdot \exp\left(-\alpha_m \cdot y_i \cdot h_m(x_i)\right)
$$

The weights are then normalized so that they sum to 1:

$$
w_i \leftarrow \frac{w_i}{\sum_{i=1}^{N} w_i}
$$
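
The steps so far (weighted error, classifier weight, weight update, normalization) can be traced on a toy example. The five labels and predictions below are made up for illustration; only the last sample is misclassified.

```python
import numpy as np

y = np.array([1, 1, -1, -1, 1])     # true labels (hypothetical)
h = np.array([1, 1, -1, -1, -1])    # weak classifier output; the last sample is wrong
w = np.full(5, 1 / 5)               # equal initial weights

eps = np.sum(w * (h != y)) / np.sum(w)   # weighted error, about 0.2
alpha = 0.5 * np.log((1 - eps) / eps)    # classifier weight, 0.5 * ln(4)

w = w * np.exp(-alpha * y * h)           # shrink correct samples, grow the misclassified one
w = w / np.sum(w)                        # normalize so the weights sum to 1

print("eps:", eps)
print("alpha:", alpha)
print("weights:", w)
```

After normalization, the single misclassified sample carries roughly half of the total weight, while the four correctly classified samples share the other half. This is what forces the next weak classifier to pay attention to it.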

Finally, after all iterations, the strong classifier is constructed as a weighted combination of the weak classifiers:

$$
H(x) = \text{sign}\left(\sum_{m=1}^{M} \alpha_m \cdot h_m(x)\right)
$$
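
A small numeric example shows how the weighted combination can differ from a plain majority vote. The classifier weights and outputs are hypothetical values chosen for illustration.

```python
import numpy as np

alphas = np.array([1.0, 0.4, 0.3])   # weights of three weak classifiers (hypothetical)

# Rows: weak classifiers; columns: two samples
H = np.array([[ 1, -1],
              [-1, -1],
              [-1,  1]])

scores = alphas @ H                  # weighted sums, about [0.3, -1.1]
final = np.sign(scores)              # strong classifier output
majority = np.sign(H.sum(axis=0))    # unweighted majority vote, for comparison

print("scores:", scores)
print("weighted vote:", final)
print("majority vote:", majority)
```

For the first sample, two of the three classifiers vote -1, but the single most reliable classifier outweighs them, so the weighted vote is +1.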



# XGBoost
XGBoost (eXtreme Gradient Boosting) is a highly efficient and scalable implementation of gradient-boosted decision trees. It is widely used for supervised learning tasks, particularly with structured or tabular data, due to its performance and robustness. XGBoost stands out because of its ability to handle missing data, its use of regularization techniques to prevent overfitting, and its speed, thanks to its parallelized tree-building process.

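In this notebook, we implement a simplified gradient-boosting procedure in the style of XGBoost using only the NumPy and Pandas libraries, manually defining each step of the algorithm. The simplified version leaves out features such as regularization and parallel tree building, but it gives a clearer view of how the boosting loop works internally.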

### Importing Libraries

In [None]:
import numpy as np
import pandas as pd

### Dataset Description

The dataset we use in this notebook contains features and a target variable for a classification task. Each row represents an observation, with columns for input features and the corresponding output label. For demonstration purposes, we assume the dataset has been preprocessed to handle missing values and categorical data. The focus here will be on implementing XGBoost rather than data cleaning.

In [15]:
data = pd.read_csv("testyyy.csv")
print(data.head())

   Feature_1  Feature_2  Target
0   2.496714   1.861736     0.0
1   2.647689   3.523030     0.0
2   1.765847   1.765863     0.0
3   3.579213   2.767435     0.0
4   1.530526   2.542560     0.0


### Data Preprocessing

Before training our model, we need to split the dataset into features and labels, followed by dividing the data into training and test sets. Features refer to the input variables used for making predictions, while the labels are the target variables. We will use 80% of the data for training and 20% for testing.

In [16]:
X = data.iloc[:, :-1].values  
y = data.iloc[:, -1].values

train_size = int(0.8 * len(X))
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]
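
The slice-based split above keeps the rows in file order. If the file happens to be sorted by class (which cannot be confirmed from the five rows shown), the test set could end up containing a single class. A minimal sketch of a shuffled alternative follows; `shuffled_split`, `train_fraction`, and `seed` are hypothetical names introduced here, not part of the original notebook.

```python
import numpy as np

def shuffled_split(X, y, train_fraction=0.8, seed=0):
    """Shuffle row indices once, then split features and labels with the same order."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    train_size = int(train_fraction * len(X))
    train_idx, test_idx = order[:train_size], order[train_size:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]
```

It would be called as `X_train, X_test, y_train, y_test = shuffled_split(X, y)`. Fixing the seed keeps the split reproducible across runs.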

### XGBoost Implementation

XGBoost uses gradient boosting to optimize an objective function. It builds trees sequentially, where each tree is trained to correct the residual errors of the previous trees. The procedure is:

- Initialize predictions with a constant value (e.g., the mean of the target).
- Iteratively fit decision trees to the negative gradient of the loss function.
- Update predictions by adding the weighted outputs of the trees.
- Use a learning rate to scale the updates, controlling the contribution of each tree.

In [17]:
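n_estimators = 100   # number of boosting rounds
learning_rate = 0.1  # shrinkage applied to each round's update
max_depth = 3        # kept for reference; the single-split trees used below ignore it

def sigmoid(x):
    """Map raw scores to the (0, 1) range."""
    return 1 / (1 + np.exp(-x))

# Start from a constant prediction: the mean of the training targets
predictions = np.full_like(y_train, np.mean(y_train), dtype=np.float64)

# Skeleton of the boosting loop: the sign of the residual stands in for a fitted tree
for estimator in range(n_estimators):
    residuals = y_train - predictions
    tree_output = np.sign(residuals)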
    predictions += learning_rate * tree_output
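
The "negative gradient" that each tree is fit to depends on the chosen loss. The sketch below shows it for two common losses, under the assumption that labels are 0 and 1 and that `raw` denotes the model's current raw score; the function names are hypothetical helpers for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def negative_gradient_squared_error(y, raw):
    """For L = 0.5 * (y - raw)^2, the negative gradient is y - raw (the plain residual)."""
    return y - raw

def negative_gradient_log_loss(y, raw):
    """For log loss with p = sigmoid(raw), the negative gradient is y - p."""
    return y - sigmoid(raw)

y = np.array([0.0, 1.0, 1.0])
raw = np.array([0.0, 0.0, 2.0])

print(negative_gradient_squared_error(y, raw))
print(negative_gradient_log_loss(y, raw))
```

With squared error, the trees are fit to residuals measured on the raw score itself. With log loss, the residuals are measured on the probability scale, which is the setting in which applying a sigmoid to the summed tree outputs yields probabilities.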

### Predictions and Evaluation

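The cell below repeats the boosting loop with an actual weak learner, simple_tree, which makes a single split at the median of the chosen feature. Training and test predictions are accumulated in the same loop by summing the contribution of each tree, scaled by the learning rate. The summed test scores are then passed through the sigmoid function to convert them into probabilities, which are thresholded at 0.5 to generate binary predictions.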

In [18]:
train_predictions = np.full_like(y_train, np.mean(y_train), dtype=np.float64)
test_predictions = np.full_like(y_test, np.mean(y_train), dtype=np.float64)

def simple_tree(X, residuals):
    n_features = X.shape[1]
    best_split = None
    min_error = float("inf")
    
    for feature in range(n_features):
        threshold = np.median(X[:, feature])
        left_idx = X[:, feature] <= threshold
        right_idx = X[:, feature] > threshold
        
        left_residuals = residuals[left_idx]
        right_residuals = residuals[right_idx]
        
        # Sum of squared errors when each side predicts its own mean residual
        error = np.sum((left_residuals - np.mean(left_residuals)) ** 2) + \
                np.sum((right_residuals - np.mean(right_residuals)) ** 2)
        
        if error < min_error:
            min_error = error
            best_split = (feature, threshold, np.mean(left_residuals), np.mean(right_residuals))
    
    return best_split

for estimator in range(n_estimators):
    residuals = y_train - train_predictions
    
    split = simple_tree(X_train, residuals)
    if split is None:
        break
    
    feature, threshold, left_val, right_val = split
    
    left_idx = X_train[:, feature] <= threshold
    right_idx = X_train[:, feature] > threshold
    train_predictions[left_idx] += learning_rate * left_val
    train_predictions[right_idx] += learning_rate * right_val
    
    left_idx_test = X_test[:, feature] <= threshold
    right_idx_test = X_test[:, feature] > threshold
    test_predictions[left_idx_test] += learning_rate * left_val
    test_predictions[right_idx_test] += learning_rate * right_val

test_predictions = sigmoid(test_predictions)
test_predictions_binary = (test_predictions > 0.5).astype(int)

accuracy = np.mean(test_predictions_binary == y_test)
print(f"Accuracy: {accuracy:.2f}")


Accuracy: 1.00


## Theory for XGBoost

XGBoost minimizes the following regularized objective:
$$L(\theta) = \sum_{i=1}^n l(y_i, \hat{y}_i) + \sum_{k=1}^K \Omega(f_k)$$

Where:
- $l$ is the loss function 
- $\Omega(f_k)$ is the regularization term to penalize complexity
- $f_k$ represents the $k$-th tree in the ensemble
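
The text above does not spell out $\Omega$. In the formulation given in the XGBoost documentation, it is $\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_j w_j^2$, where $T$ is the number of leaves and $w_j$ are the leaf values, and the best value for a leaf is $w^* = -G / (H + \lambda)$, with $G$ and $H$ the sums of first and second derivatives of the loss over the samples in that leaf. We take that form as an assumption here; the function names are hypothetical helpers, not part of any library API.

```python
import numpy as np

def tree_penalty(leaf_weights, gamma=0.0, lam=1.0):
    """Omega(f) = gamma * T + 0.5 * lam * sum_j w_j^2, with T the number of leaves."""
    leaf_weights = np.asarray(leaf_weights, dtype=float)
    return gamma * leaf_weights.size + 0.5 * lam * np.sum(leaf_weights ** 2)

def optimal_leaf_weight(grad, hess, lam=1.0):
    """w* = -G / (H + lam), where G and H sum the gradients and Hessians in the leaf."""
    return -np.sum(grad) / (np.sum(hess) + lam)

# Squared-error loss 0.5 * (y - y_hat)^2: gradient = y_hat - y, Hessian = 1
y = np.array([1.0, 1.0, 0.0])
y_hat = np.array([0.5, 0.5, 0.5])
grad, hess = y_hat - y, np.ones_like(y)

print(optimal_leaf_weight(grad, hess, lam=0.0))   # equals the mean residual
print(optimal_leaf_weight(grad, hess, lam=1.0))   # pulled toward zero by the penalty
print(tree_penalty([0.3, -0.2], gamma=1.0, lam=1.0))
```

With $\lambda = 0$ and squared-error loss, the leaf value reduces to the mean residual, which is what the simplified implementation in this notebook uses. A positive $\lambda$ shrinks leaf values toward zero.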

Each tree corrects the residuals of the previous ensemble, updating the predictions as follows:
$$g_m(x) = g_{m-1}(x) + \alpha h_m(x)$$

Here, $\alpha$ is the learning rate, and $h_m(x)$ is the output of the $m$-th tree. The process continues until the specified number of trees is trained or the residuals are sufficiently minimized.
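
The effect of the learning rate can be seen in a one-sample worked example. It assumes an idealized tree that returns the current residual exactly, so each round closes a fraction $\alpha$ of the remaining gap.

```python
target = 1.0
learning_rate = 0.1   # alpha in the update rule
prediction = 0.0      # g_0

for m in range(1, 11):
    tree_output = target - prediction           # h_m: idealized tree that returns the residual
    prediction += learning_rate * tree_output   # g_m = g_{m-1} + alpha * h_m

print(round(prediction, 4))
```

After $m$ rounds the remaining gap is $(1 - \alpha)^m$ times the initial gap, so with $\alpha = 0.1$ about 35% of the gap is still left after ten rounds. A smaller learning rate therefore needs more trees to reach the same fit, but each individual tree has less influence on the final model.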