## Home assignment 05: Bagging and OOB score

Fill in the missing lines in the code below.
This is a simplified version of `BaggingRegressor` from `sklearn`. Note that the `sklearn` API is **not preserved**.

Your algorithm should train separate instances of the same model class on bootstrapped datasets and provide the [OOB score](https://en.wikipedia.org/wiki/Out-of-bag_error) for the training set.

The model should be passed as a model class, with no explicit parameters and no parentheses.

Example:
```
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data, for illustration only
X = np.random.randn(100, 5)
y = X.sum(axis=1)

bagging_regressor = SimplifiedBaggingRegressor(num_bags=10, oob=True)
bagging_regressor.fit(LinearRegression, X, y)
```
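
The terms "bootstrapped dataset" and "out-of-bag" can be made concrete with a small NumPy sketch. It is separate from the assignment solution: the helper names `bootstrap_indices` and `oob_mask`, and the use of `np.random.default_rng`, are illustrative assumptions and not part of the required API.

```python
import numpy as np


def bootstrap_indices(n_samples, num_bags, seed=0):
    """Hypothetical helper: draw `num_bags` bootstrap samples of row indices.

    Each row of the result holds `n_samples` indices drawn with replacement.
    """
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_samples, size=(num_bags, n_samples))


def oob_mask(indices, n_samples):
    """Return a boolean array of shape (num_bags, n_samples).

    mask[b, i] is True when object i was NOT drawn into bag b,
    i.e. object i is out-of-bag for the model trained on bag b.
    """
    mask = np.ones((len(indices), n_samples), dtype=bool)
    for b, bag in enumerate(indices):
        mask[b, bag] = False
    return mask


indices = bootstrap_indices(n_samples=1000, num_bags=10)
mask = oob_mask(indices, n_samples=1000)
oob_fraction = mask.mean()  # share of (bag, object) pairs that are out-of-bag
```

Only the models for which an object is out-of-bag contribute to that object's OOB prediction, which is why the OOB score behaves like a validation error rather than a training error.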

In [1]:
import numpy as np

In [167]:
import random

import numpy as np

class SimplifiedBaggingRegressor:
    def __init__(self, num_bags, oob=False):
        self.num_bags = num_bags
        self.oob = oob
        
    def _generate_splits(self, data: np.ndarray):
        '''
        Generate indices for every bag and store in self.indices_list list
        '''
        self.indices_list = []
        data_length = len(data)
        for bag in range(self.num_bags):
            random_subset = random.choices(range(0, data_length), k=data_length)
            self.indices_list.append(random_subset)
        
    def fit(self, model_constructor, data, target):
        '''
        Fit model on every bag.
        Model constructor with no parameters (and with no ()) is passed to this function.
        
        example:
        
        bagging_regressor = SimplifiedBaggingRegressor(num_bags=10, oob=True)
        bagging_regressor.fit(LinearRegression, X, y)
        '''
        self.data = None
        self.target = None
        self._generate_splits(data)
        assert len(set(list(map(len, self.indices_list)))) == 1, 'All bags should be of the same length!'
        assert list(map(len, self.indices_list))[0] == len(data), 'All bags should contain `len(data)` number of elements!'
        self.models_list = []
        for bag in range(self.num_bags):
            model = model_constructor()
            bag_indices = self.indices_list[bag]
            # NumPy fancy indexing selects the bootstrapped rows (repeats included)
            data_bag, target_bag = data[bag_indices], target[bag_indices]
            self.models_list.append(model.fit(data_bag, target_bag))  # store fitted models here
        if self.oob:
            self.data = data
            self.target = target
        
    def predict(self, data):
        '''
        Get average prediction for every object from passed dataset
        '''
        results = np.array([model.predict(data) for model in self.models_list])
        return np.mean(results, axis=0)
    
    def _get_oob_predictions_from_every_model(self):
        '''
        Generates list of lists, where list i contains predictions for self.data[i] object
        from all models, which have not seen this object during training phase
        '''
        list_of_predictions_lists = [[] for _ in range(len(self.data))]

        for idx in range(len(self.data)):
            for i, indices in enumerate(self.indices_list):
                if idx not in indices:
                    list_of_predictions_lists[idx].append(
                        self.models_list[i].predict(np.array(self.data[idx]).reshape(1, -1))
                    )        
        self.list_of_predictions_lists = np.array(list_of_predictions_lists, dtype=object)
    
    def _get_averaged_oob_predictions(self):
        '''
        Compute average prediction for every object from training set.
        If object has been used in all bags on training phase, return np.nan instead of prediction
        '''
        self._get_oob_predictions_from_every_model()
        self.oob_predictions = np.array(
            [
                # an object seen by every model has no OOB predictions at all
                np.mean(np.array(predictions, dtype=float)) if len(predictions) > 0 else np.nan
                for predictions in self.list_of_predictions_lists
            ]
        )
        
        
    def OOB_score(self):
        '''
        Compute mean squared error over all objects that have at least one OOB prediction
        '''
        self._get_averaged_oob_predictions()
        indices = ~np.isnan(self.oob_predictions)
        print(np.mean((self.oob_predictions[indices] - self.target[indices])**2))
        return np.mean((self.oob_predictions[indices] - self.target[indices])**2)
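
The loop in `_get_oob_predictions_from_every_model` calls `predict` once for every (object, model) pair. A possible alternative, sketched here under the assumption that each model's predictions for the whole training set fit in memory, is to predict once per model and mask out the in-bag entries. The function `oob_mse` and its argument names are hypothetical and not part of the assignment API.

```python
import numpy as np


def oob_mse(predictions, in_bag, target):
    """Hypothetical vectorized OOB score.

    predictions: (num_bags, n_samples) array; predictions[b, i] is the
                 prediction of model b for object i.
    in_bag:      (num_bags, n_samples) boolean array; True where object i
                 was drawn into bag b.
    target:      (n_samples,) array of true values.
    """
    oob = ~in_bag
    counts = oob.sum(axis=0)                            # OOB models per object
    sums = np.where(oob, predictions, 0.0).sum(axis=0)  # sum of OOB predictions
    has_oob = counts > 0                                # skip objects seen by every model
    averaged = sums[has_oob] / counts[has_oob]
    return np.mean((averaged - target[has_oob]) ** 2)


# Toy check with two models and three objects (values chosen by hand).
predictions = np.array([[1.0, 2.0, 3.0],
                        [3.0, 2.0, 5.0]])
in_bag = np.array([[True, False, True],
                   [False, False, True]])
target = np.array([2.0, 2.0, 4.0])
score = oob_mse(predictions, in_bag, target)  # 0.5: object 2 is in every bag and is skipped
```

In the toy check, object 0 is out-of-bag only for the second model (squared error 1), object 1 is out-of-bag for both models (squared error 0), and object 2 has no OOB prediction, so the score is 0.5.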





### Local tests:

In [168]:
from sklearn.linear_model import LinearRegression
from tqdm.auto import tqdm

#### Simple tests:

In [169]:
# for _ in tqdm(range(100)):
#     X = np.random.randn(2000, 10)
#     y = np.mean(X, axis=1)
#     bagging_regressor = SimplifiedBaggingRegressor(num_bags=10, oob=True)
#     bagging_regressor.fit(LinearRegression, X, y)
#     predictions = bagging_regressor.predict(X)
#     assert np.mean((predictions - y)**2) < 1e-6, 'Linear dependency should be fitted with almost zero error!'
#     assert bagging_regressor.oob, 'OOB feature must be turned on'
#     oob_score = bagging_regressor.OOB_score()
#     assert oob_score < 1e-6, 'OOB error for linear dependency should be also close to zero!'
#     assert abs(
#         np.mean(
#             list(map(len, bagging_regressor.list_of_predictions_lists))
#         ) / bagging_regressor.num_bags - 1/np.exp(1)) < 0.1, 'Probability of missing a bag should be close to theoretical value!'
    
# print('Simple tests done!')

#### Medium tests

In [170]:
for _ in tqdm(range(10)):
    X = np.random.randn(200, 150)
    y = np.random.randn(len(X))
    bagging_regressor = SimplifiedBaggingRegressor(num_bags=20, oob=True)
    bagging_regressor.fit(LinearRegression, X, y)
    predictions = bagging_regressor.predict(X)
    average_train_error = np.mean((predictions - y)**2)
    assert bagging_regressor.oob, 'OOB feature must be turned on'
    oob_score = bagging_regressor.OOB_score()
    print(average_train_error)
    assert oob_score > average_train_error, 'OOB error must be higher than train error due to overfitting!'
    assert abs(
        np.mean(
            list(map(len, bagging_regressor.list_of_predictions_lists))
        ) / bagging_regressor.num_bags - 1/np.exp(1)) < 0.1, 'Probability of missing a bag should be close to theoretical value!'
    
print('Medium tests done!')

2.8939909832801947
0.4207051599263644
3.4931642649741432
0.5186735766239896
2.4635494325491023
0.32871746112149536
2.763158044863594
0.3894230657628871
2.5032560175926455
0.37018253680552654
2.64061945071727
0.3713010661965145
1.907283531674827
0.2669668759074573
2.01835689053195
0.2749002537362517
2.6026634886432816
0.35534558184084963
2.8925554576518486
0.4164179490715674
Medium tests done!





#### Complex tests:

In [171]:
for _ in tqdm(range(10)):
    X = np.random.randn(2000, 15)
    y = np.random.randn(len(X))
    bagging_regressor = SimplifiedBaggingRegressor(num_bags=100, oob=True)
    bagging_regressor.fit(LinearRegression, X, y)
    predictions = bagging_regressor.predict(X)
    oob_score = bagging_regressor.OOB_score()
    assert abs(
        np.mean(
            list(map(len, bagging_regressor.list_of_predictions_lists))
        ) / bagging_regressor.num_bags - 1/np.exp(1)) < 1e-2, 'Probability of missing a bag should be close to theoretical value!'
    
print('Complex tests done!')

1.0480352454061512
1.0076263107284051
1.0378847963860922
0.9965589801600826
1.022499701277144
1.0808324550697317
0.9649534511216598
1.0456766331458855
0.9872203763892968
1.0340575870753261
Complex tests done!





In [172]:
np.mean(
    list(map(len, bagging_regressor.list_of_predictions_lists))
) / bagging_regressor.num_bags - 1/np.exp(1)

-0.0006744411714423304
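
The theoretical value in the asserts comes from the probability that a fixed object is never picked in `n` draws with replacement from `n` objects: each draw misses it with probability `1 - 1/n`, so the object is absent from a bag with probability `(1 - 1/n)**n`, which tends to `1/e ≈ 0.368` as `n` grows. A quick numerical check (a sketch; the function name `prob_out_of_bag` is illustrative):

```python
import math


def prob_out_of_bag(n):
    """Probability that a fixed object is absent from one bootstrap bag of size n."""
    return (1.0 - 1.0 / n) ** n


limit = 1.0 / math.e  # limiting value as n grows
# Absolute gap between the exact probability and the limit for a few dataset sizes
gaps = {n: abs(prob_out_of_bag(n) - limit) for n in (200, 2000, 20000)}
```

The gap shrinks as `n` grows, so for the dataset sizes used in the tests the remaining deviation is dominated by sampling noise, not by the approximation itself.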

Great job! Please save `SimplifiedBaggingRegressor` to `bagging.py` and submit your solution to the grading system!