Atalov S.

Fundamentals of Machine Learning and Artificial Intelligence

# Lab 4: Implementing a Gradient Boosting Classifier from Scratch

---

### Objective:
The goal of this lab is to develop a deeper understanding of ensemble learning methods by implementing a Gradient Boosting Classifier from scratch in Python. You will apply your implementation to predict survival on the Titanic dataset.


### Requirements:
1. **Data**: Use the Titanic dataset available from the [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic) on Kaggle. You will need to perform data preprocessing (handle missing values, convert categorical data to numeric).

2. **Implementation**:
    - **`GradientBoostingClassifier` class**: Your class should have at least three methods:
      - `fit(X, y)`: Method to train the model.
      - `predict(X)`: Method to predict the target for given input.
      - `score(X, y)`: Method to calculate the accuracy of the model.
    - The classifier should use decision trees as the weak learners. You can use an existing implementation of decision trees (like `DecisionTreeRegressor` from `sklearn`) or write your own from scratch.

3. **Evaluation**:
    - Split the Titanic dataset into training and testing sets.
    - Train your model on the training set and evaluate its performance on the test set.
    - Plot the training and testing accuracy as a function of the number of boosting rounds.

### Deliverables:
1. **Code**: A Jupyter notebook containing all the code, comments explaining your logic, and any assumptions made.
2. **Report**: A brief report explaining your findings, the performance of the model, and any challenges you faced during the implementation.

### Tips:
- Start by understanding the algorithm using resources like Chapter 10 of ["The Elements of Statistical Learning"](https://www.sas.upenn.edu/~fdiebold/NoHesitations/BookAdvanced.pdf).
- Testing your algorithm on a simpler dataset (like the Iris dataset) can help you debug.

### Submission:
Submit the Jupyter notebook and the report via the ecourse by **11 May 2024 01:00**.

---


In [3]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

In [14]:

class GradientBoostingClassifier:
    def __init__(self, n_estimators=500, learning_rate=0.1):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.models = []
        self.weights = []

    def fit(self, X, y):
        #заполнение, чтобы были ошибки
        y_pred = np.full(len(y), np.mean(y))

        for _ in range(self.n_estimators):
            residuals = y - y_pred

            tree = DecisionTreeRegressor(max_depth=2)
            tree.fit(X, residuals)


            update = self.learning_rate * tree.predict(X)
            y_pred += update

            self.models.append(tree)
            self.weights.append(self.learning_rate)

    def predict(self, X):
        
        y_pred = np.zeros(len(X))
        for i, model in enumerate(self.models):
            y_pred += self.weights[i] * model.predict(X)
        return np.round(y_pred)

    def score(self, X, y):
        return accuracy_score(y, self.predict(X))

In [15]:
titanic_train = "https://raw.githubusercontent.com/lobachevksy/teaching/main/titanic/train.csv"

In [16]:

data = pd.read_csv(titanic_train)

data.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
data['Age'].fillna(data['Age'].mean(), inplace=True)
data['Embarked'].fillna(data['Embarked'].mode()[0], inplace=True)
data = pd.get_dummies(data, columns=['Sex', 'Embarked'], drop_first=True)

X = data.copy()
y = X.pop("Survived")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
gb.fit(X_train.values, y_train.values) 

print(gb.score(X_train.values, y_train.values)) 
gb.score(X_test.values, y_test.values) 



0.7640449438202247


0.7374301675977654