<a href="https://colab.research.google.com/github/DmitriiGoro/ML_2024_3_term/blob/master/HomeTasks/assignment_bagging_and_oob.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Home assignment 05: Bagging and OOB score

Fill in the missing lines in the code below.
This is a simplified version of `BaggingRegressor` from `sklearn`. Note that the `sklearn` API is **not preserved**.

Your algorithm should train separate instances of the same model class on bootstrapped datasets and provide the [OOB score](https://en.wikipedia.org/wiki/Out-of-bag_error) for the training set.

The model should be passed as a model class, with no explicit parameters and no parentheses.

Example:
```
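import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data for illustration: 100 objects, 5 features, linear target
X = np.random.randn(100, 5)
y = X.sum(axis=1)

bagging_regressor = SimplifiedBaggingRegressor(num_bags=10, oob=True)
bagging_regressor.fit(LinearRegression, X, y)  # pass the class itself, not LinearRegression()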
```
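
If a model needs non-default hyperparameters, one option (an assumption, not something the assignment requires) is to wrap the class with `functools.partial`, which yields a callable that can still be invoked with no arguments. The sketch below uses a hypothetical `RidgeLikeRegressor` as a stand-in model so that it runs with NumPy alone; any class exposing `fit` and `predict` should behave the same way.

```python
from functools import partial

import numpy as np


class RidgeLikeRegressor:
    """Hypothetical stand-in model: closed-form ridge regression without intercept."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        n_features = X.shape[1]
        # Solve (X^T X + alpha * I) w = X^T y
        self.coef_ = np.linalg.solve(X.T @ X + self.alpha * np.eye(n_features), X.T @ y)
        return self

    def predict(self, X):
        return X @ self.coef_


# A zero-argument callable that still carries the hyperparameter
model_constructor = partial(RidgeLikeRegressor, alpha=0.1)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])

model = model_constructor()  # the same call fit() makes for every bag
preds = model.fit(X, y).predict(X)
```

Passing `model_constructor` in place of the bare class should work with `fit`, since `fit` only ever calls `model_constructor()`.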

In [3]:
import numpy as np
import math

In [9]:
class SimplifiedBaggingRegressor:
    def __init__(self, num_bags, oob=False):
        self.num_bags = num_bags
        self.oob = oob


    def _generate_splits(self, data: np.ndarray):
        '''
        Generate indices for every bag and store in self.indices_list.
        '''
        self.indices_list = []
        data_length = len(data)

        for _ in range(self.num_bags):
            # Sample `data_length` indices with replacement (one bootstrap bag)
            subset_i = np.random.choice(data_length, size=data_length, replace=True)
            self.indices_list.append(subset_i)


    def fit(self, model_constructor, data, target):
        '''
        Fit model on every bag.
        The model class itself (no parameters and no parentheses) is passed to this function.

        example:

        bagging_regressor = SimplifiedBaggingRegressor(num_bags=10, oob=True)
        bagging_regressor.fit(LinearRegression, X, y)
        '''
        self.data = None
        self.target = None
        self._generate_splits(data)

        assert len(set(list(map(len, self.indices_list)))) == 1, 'All bags should be of the same length!'
        assert list(map(len, self.indices_list))[0] == len(data), 'All bags should contain `len(data)` number of elements!'

        self.models_list = []

        for bag in range(self.num_bags):
            model = model_constructor()
            indices_bag = self.indices_list[bag]
            data_bag, target_bag = data[indices_bag], target[indices_bag]

            # Fit once on the bootstrapped bag and store the fitted model
            model.fit(data_bag, target_bag)
            self.models_list.append(model)

        if self.oob:
            self.data = data
            self.target = target

    def predict(self, data):
        '''
        Get the average prediction for every object in the passed dataset
        '''
        all_predictions = []

        for model in self.models_list:
            predictions = model.predict(data)
            all_predictions.append(predictions)

        all_predictions = np.array(all_predictions)

        avg_predictions = np.mean(all_predictions, axis=0)

        return avg_predictions

    def _get_oob_predictions_from_every_model(self):
        '''
        Generates a list of lists, where list i contains the predictions for object self.data[i]
        from all models that did not see this object during training
        '''
        list_of_predictions_lists = [[] for _ in range(len(self.data))]

        for model_index, model in enumerate(self.models_list):
            train_indices = self.indices_list[model_index]
            oob_indices = set(range(len(self.data))) - set(train_indices)

            for idx in oob_indices:
                prediction = model.predict(self.data[idx].reshape(1, -1))
                list_of_predictions_lists[idx].append(prediction[0])

        self.list_of_predictions_lists = np.array(list_of_predictions_lists, dtype=object)

    def _get_averaged_oob_predictions(self):
        '''
        Compute the average OOB prediction for every object in the training set.
        If an object was used in every bag during training, store NaN instead of a prediction.
        '''
        self._get_oob_predictions_from_every_model()

        averaged_oob_predictions = []

        for predictions in self.list_of_predictions_lists:
            if len(predictions) == 0:
                # No OOB predictions: the object appeared in every bag, so store NaN
                averaged_oob_predictions.append(np.nan)
            else:
                # Average the OOB predictions for this object
                averaged_oob_predictions.append(np.mean(predictions))

        # Store the averaged predictions
        self.oob_predictions = np.array(averaged_oob_predictions, dtype=np.float64)


    def OOB_score(self):
        '''
        Compute the mean squared error over all objects that have at least one OOB prediction
        '''
        assert self.oob, 'OOB_score requires the regressor to be created with oob=True'
        self._get_averaged_oob_predictions()

        y_true = self.target
        y_pred = self.oob_predictions

        mask = ~np.isnan(y_pred)

        mse = np.mean((y_true[mask] - y_pred[mask]) ** 2)

        return mse
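
The loop in `_get_oob_predictions_from_every_model` calls `predict` once per out-of-bag object. A possible vectorized alternative is sketched below, assuming every model accepts a 2D array in `predict`. It accumulates sums and counts with a boolean mask instead of building lists. The names `oob_mse` and `LeastSquares` are hypothetical helpers introduced for illustration and are not part of the assignment.

```python
import numpy as np


class LeastSquares:
    """Hypothetical stand-in model (ordinary least squares, no intercept)."""

    def fit(self, X, y):
        self.coef_, *_ = np.linalg.lstsq(X, y, rcond=None)
        return self

    def predict(self, X):
        return X @ self.coef_


def oob_mse(models, indices_list, X, y):
    """MSE of averaged OOB predictions; objects seen by every model are skipped."""
    n = len(X)
    pred_sum = np.zeros(n)
    pred_count = np.zeros(n)
    for model, train_indices in zip(models, indices_list):
        oob_mask = np.ones(n, dtype=bool)
        oob_mask[train_indices] = False  # objects this model never saw
        if oob_mask.any():
            pred_sum[oob_mask] += model.predict(X[oob_mask])
            pred_count[oob_mask] += 1
    has_oob = pred_count > 0
    oob_pred = pred_sum[has_oob] / pred_count[has_oob]
    return np.mean((y[has_oob] - oob_pred) ** 2)


rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X.mean(axis=1)  # exactly linear target

indices_list = [rng.integers(0, len(X), size=len(X)) for _ in range(10)]
models = [LeastSquares().fit(X[idx], y[idx]) for idx in indices_list]
score = oob_mse(models, indices_list, X, y)
```

This variant does not keep the per-object prediction lists, so it would not satisfy the local tests that inspect `list_of_predictions_lists`; it only illustrates the masking idea.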

### Local tests:

In [5]:
from sklearn.linear_model import LinearRegression
from tqdm.auto import tqdm

#### Simple tests:

In [11]:
for _ in tqdm(range(100)):
    X = np.random.randn(2000, 10)
    y = np.mean(X, axis=1)
    bagging_regressor = SimplifiedBaggingRegressor(num_bags=10, oob=True)
    bagging_regressor.fit(LinearRegression, X, y)
    predictions = bagging_regressor.predict(X)
    assert np.mean((predictions - y)**2) < 1e-6, 'Linear dependency should be fitted with almost zero error!'
    assert bagging_regressor.oob, 'OOB feature must be turned on'
    oob_score = bagging_regressor.OOB_score()
    assert oob_score < 1e-6, 'OOB error for linear dependency should be also close to zero!'
    assert abs(
        np.mean(
            list(map(len, bagging_regressor.list_of_predictions_lists))
        ) / bagging_regressor.num_bags - 1/np.exp(1)) < 0.1, 'Probability of missing a bag should be close to theoretical value!'

print('Simple tests done!')

Simple tests done!
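
The constant `1/np.exp(1)` in the last assertion comes from the bootstrap itself. Assuming each bag draws $n$ indices uniformly with replacement (as `_generate_splits` does), the probability that a fixed object is missed by one bag is

$$
P(\text{object } i \notin \text{bag}) = \left(1 - \frac{1}{n}\right)^{n} \xrightarrow[n \to \infty]{} e^{-1} \approx 0.368 .
$$

Each object is therefore out-of-bag for about $0.368 \cdot \text{num\_bags}$ models on average, which is the quantity estimated by the mean length of `list_of_predictions_lists` divided by `num_bags`.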


#### Medium tests

In [12]:
for _ in tqdm(range(10)):
    X = np.random.randn(200, 150)
    y = np.random.randn(len(X))
    bagging_regressor = SimplifiedBaggingRegressor(num_bags=20, oob=True)
    bagging_regressor.fit(LinearRegression, X, y)
    predictions = bagging_regressor.predict(X)
    average_train_error = np.mean((predictions - y)**2)
    assert bagging_regressor.oob, 'OOB feature must be turned on'
    oob_score = bagging_regressor.OOB_score()
    assert oob_score > average_train_error, 'OOB error must be higher than train error due to overfitting!'
    assert abs(
        np.mean(
            list(map(len, bagging_regressor.list_of_predictions_lists))
        ) / bagging_regressor.num_bags - 1/np.exp(1)) < 0.1, 'Probability of missing a bag should be close to theoretical value!'

print('Medium tests done!')
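
Why should the OOB error exceed the train error here? With 200 objects, a bootstrap bag contains roughly 200 * (1 - 1/e) ≈ 126 distinct rows, fewer than the 150 features, so a least-squares model can reproduce its own bag almost exactly while learning nothing useful about a pure-noise target. The sketch below illustrates this for a single bag, using `np.linalg.lstsq` as a stand-in for `LinearRegression` (an assumption: unlike `LinearRegression`, it fits no intercept).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 150))
y = rng.normal(size=200)  # target is pure noise

bag = rng.integers(0, len(X), size=len(X))  # one bootstrap bag
oob_mask = np.ones(len(X), dtype=bool)
oob_mask[bag] = False  # rows the model never sees

# Minimum-norm least-squares fit on the bag (no intercept)
coef, *_ = np.linalg.lstsq(X[bag], y[bag], rcond=None)

n_unique = len(np.unique(bag))
in_bag_mse = np.mean((X[bag] @ coef - y[bag]) ** 2)
oob_mse = np.mean((X[oob_mask] @ coef - y[oob_mask]) ** 2)

print(n_unique, in_bag_mse, oob_mse)
```

In the full ensemble the train error is not exactly zero, because every object is out-of-bag for some of the models, but the same gap between seen and unseen rows is what the assertion relies on.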

#### Complex tests:

In [13]:
for _ in tqdm(range(10)):
    X = np.random.randn(2000, 15)
    y = np.random.randn(len(X))
    bagging_regressor = SimplifiedBaggingRegressor(num_bags=100, oob=True)
    bagging_regressor.fit(LinearRegression, X, y)
    predictions = bagging_regressor.predict(X)
    oob_score = bagging_regressor.OOB_score()
    assert abs(
        np.mean(
            list(map(len, bagging_regressor.list_of_predictions_lists))
        ) / bagging_regressor.num_bags - 1/np.exp(1)) < 1e-2, 'Probability of missing a bag should be close to theoretical value!'

print('Complex tests done!')

Complex tests done!
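
The tighter `1e-2` tolerance is plausible here because the estimate averages over 2000 objects and 100 bags. A small NumPy-only simulation, independent of the class above (the seed and variable names are arbitrary choices), gives a feel for how close the empirical out-of-bag fraction gets to 1/e:

```python
import numpy as np

rng = np.random.default_rng(0)
n_objects, num_bags = 2000, 100

oob_counts = np.zeros(n_objects)
for _ in range(num_bags):
    bag = rng.integers(0, n_objects, size=n_objects)  # bootstrap indices
    in_bag = np.zeros(n_objects, dtype=bool)
    in_bag[bag] = True
    oob_counts += ~in_bag  # +1 for every object this bag missed

empirical_fraction = oob_counts.mean() / num_bags
exact_fraction = (1 - 1 / n_objects) ** n_objects  # finite-n probability
limit_fraction = 1 / np.e

print(empirical_fraction, exact_fraction, limit_fraction)
```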


In [None]:
# Deviation of the empirical OOB fraction from the theoretical value 1/e
np.mean(
    list(map(len, bagging_regressor.list_of_predictions_lists))
) / bagging_regressor.num_bags - 1/np.exp(1)

Great job! Save `SimplifiedBaggingRegressor` to `bagging.py` and submit your solution to the grading system.