## Ensembles for Customer Satisfaction Prediction

KATE expects your code to define variables with specific names that correspond to certain things we are interested in.

KATE will run your notebook from top to bottom and check the latest value of those variables, so make sure you don't overwrite them.

* Remember to uncomment the line assigning the variable to your answer and don't change the variable or function names.
* Use copies of the original or previous DataFrames to make sure you do not overwrite them by mistake.

You will find instructions below about how to define each variable.

Once you're happy with your code, upload your notebook to KATE to check your feedback.

Businesses can improve their services by tailoring them to individual customers. One important factor is knowing when customers are dissatisfied. Based on their records, one can use machine learning tools to make predictions about which customers are more at risk of being dissatisfied than others. Such predictions allow for individualized actions that may help retain customers and will improve quality.

In this assignment, we will build a prediction model for bank account owners' satisfaction. The records include more than 300 features for each client, including variables related to their balance and to the banking operations they have performed. Many of these variables are sparse; some are numerical, some categorical.

Ensemble methods based on decision trees, such as random forests and boosting algorithms, have been very successful in modeling such heterogeneous tabular data. To learn how these models work, you will implement them step by step and see how the performance of your predictions improves.

### Load the data

Load the data in `data/train_data.csv` with `pandas`. Inspect its content with `.head()`, `.shape` and other methods of your choice.

#### Target variable

The last column, named `TARGET`, is the variable to be predicted. `TARGET=1` represents a dissatisfied customer. Inspect the target column with `.value_counts()`. 

What is the proportion of dissatisfied customers? Is the dataset balanced or imbalanced?

### Note on dataset properties

As you can see, the dataset is highly imbalanced: there are only 2.6k positive entries against 63.4k negative entries. This should be addressed in the models by setting the `class_weight` parameter where it is available (there are different ways to do this; feel free to check the `sklearn` documentation).

If the model type does not support class weights, be prepared for the output to always be the majority class. This can be addressed by tweaking the model parameters.
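As a rough illustration of what class weighting does, the sketch below computes the weights that `sklearn` documents for `class_weight='balanced'`, namely `n_samples / (n_classes * np.bincount(y))`. The counts are the rounded figures quoted above, so the exact values for the real dataset will differ slightly.

```python
import numpy as np

# Rounded class counts quoted in the text (assumption: exact counts differ slightly).
counts = np.array([63_400, 2_600])  # index 0: satisfied, index 1: dissatisfied

n_samples = counts.sum()
n_classes = len(counts)

# The 'balanced' heuristic: rarer classes receive proportionally larger weights.
weights = n_samples / (n_classes * counts)
print(dict(enumerate(weights.round(3))))
```

With these weights, both classes contribute the same total weight to the training criterion, even though one class has far fewer samples.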

Separate the data into features `X` and target `y`. Split the data into training and validation sets, with a validation set of 5000 samples, using a stratified split to keep the same level of imbalance in both.

*Hint: you may use `train_test_split()` for stratified splits.*
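A minimal sketch of such a split, using synthetic data as a stand-in for the real `X` and `y` (the array sizes and the 4% positive rate are assumptions made for illustration only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 20,000 rows, about 4% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 5))
y = (rng.uniform(size=20_000) < 0.04).astype(int)

# An integer test_size requests an absolute number of validation samples;
# stratify=y keeps the class proportions the same in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=5000, stratify=y, random_state=0
)

print(y_train.mean(), y_val.mean())
```

The two printed proportions should be nearly identical, which is the point of stratifying.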

### Basic modelling pipeline

Implement a basic modelling pipeline for a Decision Tree Classifier, fitting the training data and printing the training and validation accuracy.

Note that the prediction score is quite high, even for this very simple model. Take a moment to think about why this high score is not that significant.
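One way to put the accuracy in context is to score a predictor that ignores the features entirely. The sketch below uses the rounded class counts quoted earlier (an assumption; the exact counts differ slightly) rather than the real data.

```python
import numpy as np

# Rounded class counts quoted earlier (assumption: exact counts differ slightly).
n_negative, n_positive = 63_400, 2_600
y = np.concatenate([np.zeros(n_negative, dtype=int), np.ones(n_positive, dtype=int)])

# A "model" that ignores the features and always predicts the majority class.
always_satisfied = np.zeros_like(y)

accuracy = (always_satisfied == y).mean()
recall_dissatisfied = always_satisfied[y == 1].mean()
print(f"accuracy: {accuracy:.3f}, dissatisfied customers found: {recall_dissatisfied:.0%}")
```

Accuracy alone therefore says little about how well the rare class is detected.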

#### ROC curve metric

Change your scoring metric to `roc_auc_score`, which calculates the area under the ROC curve from your **prediction probabilities** instead of from the binary prediction decisions.

*Hint: Use the probabilities for `y = True` (not `y = False`).*
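The hint can be sketched as follows on synthetic data (an assumption; `make_classification` stands in for the bank dataset). `predict_proba` returns one column per class, so the positive-class column has to be selected before scoring.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (assumption: not the bank dataset).
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9], random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# predict_proba returns one column per class, ordered as in model.classes_;
# column 1 holds the probability of the positive class (y = 1).
proba_positive = model.predict_proba(X)[:, 1]
print(roc_auc_score(y, proba_positive))
```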

#### Baseline score for random predictions

Calculate the ROC AUC for random uniform prediction probabilities. 

Is the Decision Tree better? Based on the training and validation scores, what is the problem with the Decision Tree model?

*Hint: You can use `np.random.uniform`.*
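A minimal sketch of the random baseline, assuming synthetic labels in place of your validation target. It uses NumPy's `default_rng` generator for reproducibility; `np.random.uniform` from the hint works the same way.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic labels with about 4% positives (assumption: stands in for y_val).
y_true = (rng.uniform(size=20_000) < 0.04).astype(int)

# Scores drawn independently of the labels carry no information.
random_scores = rng.uniform(size=y_true.shape[0])

baseline_auc = roc_auc_score(y_true, random_scores)
print(f"random baseline ROC AUC: {baseline_auc:.3f}")
```

The score lands close to 0.5 regardless of the class imbalance, which makes ROC AUC easier to interpret here than accuracy.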

Create a function named `test_model(model, X_train, y_train, X_test, y_test)` that performs the basic prediction pipeline: it receives the model and the data as arguments, fits the model to the training data, and returns the training and test prediction scores. Check that it works with the Decision Tree model.

## Optimizing decision trees 

We can improve the prediction model by tuning the settings of the Decision Tree. Check the arguments available for the `DecisionTreeClassifier` class.

Which arguments do you think could improve the validation score? Optimize your model by changing the meta-parameters. Inspect the most important meta-parameter by calculating the training and validation score for different values.
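As an illustration of such an inspection, the sketch below sweeps `max_depth` (chosen here as an example; which parameter matters most for the real data is for you to find out) and records training and validation ROC AUC on synthetic stand-in data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the bank dataset).
X, y = make_classification(n_samples=3000, n_features=20, weights=[0.9], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=500, stratify=y, random_state=0)

scores = {}
for depth in [1, 2, 4, 8, 16, None]:  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (
        roc_auc_score(y_tr, tree.predict_proba(X_tr)[:, 1]),
        roc_auc_score(y_va, tree.predict_proba(X_va)[:, 1]),
    )

for depth, (train_auc, val_auc) in scores.items():
    print(f"max_depth={depth}: train={train_auc:.3f}, validation={val_auc:.3f}")
```

A training score that keeps rising while the validation score stalls or drops is the usual sign of overfitting.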

To evaluate your models, we will test them on a test set. Load the test data from `data/test_data.csv`.

In [None]:
test_data = pd.read_csv('data/test_data.csv')

Calculate the prediction probabilities for the test data with the best Decision Tree, saving them in a variable named `dtc_preds`. `dtc_preds` should be a one-dimensional NumPy array.

### Bagging and Random Forests

While individual Decision Trees are prone to overfitting, ensembles of them can be powerful predictors. Random Forests are essentially Bagging ensembles of decision trees: they average the predictions of multiple base decision trees, each trained on a different set of data samples.

You will create a Bagging model class, named `myBagging`, filling the class structure below.

The `.fit()` method should fit each base model on a bootstrap sample of the data (drawn with replacement), whose size is set by the meta-parameter `subsample` as a fraction of the full dataset. That is, if `subsample=0.5`, each base model should get half the total number of samples.

The `.predict_proba()` method should estimate and average the prediction probabilities of the base models.
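The averaging step can be sketched with plain arrays. The probabilities below are made-up values for two hypothetical base models and three samples, in the two-column layout that `predict_proba` returns for a binary classifier.

```python
import numpy as np

# Hypothetical predict_proba outputs (columns: class 0, class 1) for three samples.
proba_model_1 = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
proba_model_2 = np.array([[0.7, 0.3], [0.4, 0.6], [0.5, 0.5]])

# Stack into shape (n_models, n_samples, n_classes) and average over the models.
ensemble_proba = np.mean(np.stack([proba_model_1, proba_model_2]), axis=0)
print(ensemble_proba)
```

Averaging whole probability arrays keeps each row summing to one, so the result can be scored in the same way as a single classifier's output.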

*Hint: You can use the `resample()` function for creating bootstrap samples.*
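A small sketch of `resample()` from `sklearn.utils` on toy arrays (the data and the `subsample` value are placeholders). Passing `X` and `y` together keeps the rows and labels aligned.

```python
import numpy as np
from sklearn.utils import resample

X = np.arange(20).reshape(10, 2)   # 10 toy samples, 2 features each
y = np.arange(10)                  # toy labels, equal to the row index
subsample = 0.5

# Draw rows with replacement; X and y are resampled consistently.
X_boot, y_boot = resample(
    X, y, replace=True, n_samples=int(subsample * len(X)), random_state=0
)
print(X_boot.shape, y_boot)
```

Because sampling is done with replacement, some row indices may appear more than once in `y_boot`.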

In [None]:
class myBagging:
    def __init__(self, base_models, subsample = 1.):
        self.n_models = len(base_models)
        self.base_models = base_models
        self.subsample = subsample
        
    def fit(self, X, y):
        '''Loop over base models, generate a bootstrap sample of the data with 'resample()',
           and fit them to the data.
           
           To access the variables inside the myBagging class, use the 'self.' prefix, 
           i.e. self.base_models, self.n_models and self.subsample
        '''
        pass
    
    def predict_proba(self, X):
        '''Return the ensemble predictions, given by the average prediction probability over base models.
           It should be an array with the length of the dataset.'''
        pass
    


Run and score a Random Forest with 10 base Decision Trees, each with maximum depth 10, and with `subsample=0.5`. Use your `myBagging` class and `test_model()`.
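When building the list of base models, it helps to make sure that each entry is a separate estimator object. The sketch below contrasts a list comprehension with list multiplication; the variable names are illustrative only.

```python
from sklearn.tree import DecisionTreeClassifier

# A list comprehension creates 10 separate, unfitted estimators.
base_models = [DecisionTreeClassifier(max_depth=10) for _ in range(10)]

# By contrast, [model] * 10 would repeat one object ten times, so every
# "base model" would be refit on each bootstrap sample and end up identical.
shared = [DecisionTreeClassifier(max_depth=10)] * 10

print(len({id(m) for m in base_models}), len({id(m) for m in shared}))
```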

### Extra-Trees

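Extremely Randomized Trees (Extra-Trees) are built from decision trees in which, at each node split during training, only a random fraction of the features is considered when searching for the optimal split (e.g. the one with the highest Gini gain). This functionality is available in `sklearn` through the `max_features` parameter. Strictly speaking, Extra-Trees also draw the candidate split thresholds at random (`splitter='random'` in `sklearn`'s decision trees); in this exercise we only vary `max_features`.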

Run and score an Extra-Trees version of your Random Forest by changing the `max_features` parameter.
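As a small illustration of the parameter, the sketch below fits a single tree with `max_features=0.5` on synthetic data with 20 features (an assumption; the real dataset has many more) and reads back the fitted `max_features_` attribute.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 20 features (assumption: not the bank dataset).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# A float max_features is read as a fraction of the available features.
tree = DecisionTreeClassifier(max_depth=10, max_features=0.5, random_state=0).fit(X, y)

# max_features_ reports how many features are drawn as candidates at each split.
print(tree.max_features_)
```

Since a fresh random subset is drawn at every split, trees trained on the same data can still differ from one another, which is what the ensemble relies on.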

### Sklearn comparison

For comparison, run and score the `sklearn` implementation, `RandomForestClassifier`.
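A sketch of the comparison on synthetic stand-in data. Mapping `subsample` to `max_samples` is an assumption about how the two implementations correspond, not an exact equivalence.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the bank dataset).
X, y = make_classification(n_samples=3000, n_features=20, weights=[0.9], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=500, stratify=y, random_state=0)

# n_estimators, max_depth and max_samples mirror the 10 trees, depth 10 and
# subsample 0.5 used for myBagging above; bootstrap=True is the default.
forest = RandomForestClassifier(
    n_estimators=10, max_depth=10, max_samples=0.5, random_state=0
).fit(X_tr, y_tr)

val_auc = roc_auc_score(y_va, forest.predict_proba(X_va)[:, 1])
print(f"validation ROC AUC: {val_auc:.3f}")
```

Note that `RandomForestClassifier` also subsamples features at each split by default, so its scores need not match a plain bagging ensemble exactly.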

### Optimize your Random Forest

Optimize the meta-parameters of your Random Forest, both those of `myBagging` and those of the Decision Trees, and make your predictions for the test data, saving the predictions under `rf_preds`.

Note that including more decision trees generally improves performance but increases the computational cost of training linearly. The `max_depth` and `max_features` arguments can cut the training time considerably, by reducing the tree size and the number of features considered at each split.

## Gradient Boosting

We will now implement a more sophisticated ensemble, Gradient Boosting, in which the base models are trained sequentially. Each new base model predicts what previous base models missed. 

As gradient boosting requires a continuous gradient, it can only use regression models as base learners.

For this exercise, we will perform regression directly on the 0-1 class labels, and treat the raw outputs as probabilities. 

We will set up the base models to optimise the MSE loss function against the class labels, for which the negative gradient is simply the residual error.

When applied to probabilities, the MSE is known as the Brier score. 
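As a short check of the claim about the gradient, write the per-sample squared-error loss with a factor of 1/2 (a convention assumed here so that the constant cancels) and let `F(x_i)` denote the current ensemble output (notation introduced for this sketch). The negative gradient with respect to the prediction is then exactly the residual. The second line gives the Brier score for predicted probabilities `p_i` over `N` samples.

```latex
L\bigl(y_i, F(x_i)\bigr) = \tfrac{1}{2}\,\bigl(y_i - F(x_i)\bigr)^2
\quad\Longrightarrow\quad
-\frac{\partial L}{\partial F(x_i)} = y_i - F(x_i)

\mathrm{BS} = \frac{1}{N} \sum_{i=1}^{N} \bigl(p_i - y_i\bigr)^2
```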

Whilst performing this exercise, have a think about whether this is a robust approach. 

If not, what would you change either to your base-learners, meta-algorithm, or evaluation metrics to make this more robust?

You will have a chance to implement your suggestions tomorrow!

In the structure below, fill in the `.fit()` and `.predict_proba()` functions.

In [None]:
class myGradientBoosting:
    
    def __init__(self, base_models, learning_rate=0.5):
        self.n_models = len(base_models)
        self.models = base_models
        self.learning_rate = learning_rate
    
    def fit(self, x, y):
        ''' The `.fit()` function should loop over each base model 
         fitting it to the residual of the ensemble predictions so far, for the MSE loss:
         
         predictions = 0
         for each base model:
             residual = y - predictions   
             fit base model to the residual and predict on the training data
             predictions = predictions + learning_rate * new_prediction
        '''
        pass
       
    def predict_proba(self, x):
        ''' Generate the ensemble prediction, by looping over each base model.
            Get their predictions and sum them, scaled by the learning rate.
        
            Trick: Regressor models return only one prediction (instead of two probabilities in the Classifiers).
                   To make your class compatible with test_model(), you can repeat the predictions, e.g.:
                   predictions.reshape(-1,1).repeat(2,axis=1)'''
        pass
        


Run and score a Gradient Boosting model with 20 base decision trees, each with maximum depth 5 and `max_features` 0.5, and with learning rate 0.5. Use your `myGradientBoosting` class and `test_model()`.
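The residual-fitting loop from the docstring can be sketched outside the class as follows. This is only an illustration on toy 0-1 labels (an assumption, not the bank dataset), using `DecisionTreeRegressor` base models with the settings named above; it tracks the training MSE after each tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 0-1 labels that depend on one feature (assumption: not the bank dataset).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 3))
y = (X[:, 0] + 0.3 * rng.normal(size=1000) > 0.5).astype(float)

learning_rate = 0.5
base_models = [DecisionTreeRegressor(max_depth=5, max_features=0.5, random_state=i)
               for i in range(20)]

predictions = np.zeros_like(y)
mse_history = []
for model in base_models:
    residual = y - predictions                 # negative gradient of the MSE loss
    model.fit(X, residual)                     # each tree learns what is still missing
    predictions += learning_rate * model.predict(X)
    mse_history.append(np.mean((y - predictions) ** 2))

print(f"training MSE after 1 tree: {mse_history[0]:.4f}, after 20: {mse_history[-1]:.4f}")
print(f"prediction range: [{predictions.min():.2f}, {predictions.max():.2f}]")
```

The printed range is worth a look: nothing in this loop constrains the summed outputs to lie between 0 and 1.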

For comparison, run and score the `sklearn` implementation, `GradientBoostingClassifier`.
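A sketch of the `sklearn` counterpart with the same headline settings, again on synthetic stand-in data (an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the bank dataset).
X, y = make_classification(n_samples=3000, n_features=20, weights=[0.9], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=500, stratify=y, random_state=0)

gbc = GradientBoostingClassifier(
    n_estimators=20, max_depth=5, max_features=0.5, learning_rate=0.5, random_state=0
).fit(X_tr, y_tr)

proba = gbc.predict_proba(X_va)[:, 1]
print(f"validation ROC AUC: {roc_auc_score(y_va, proba):.3f}")
print(f"probability range: [{proba.min():.3f}, {proba.max():.3f}]")
```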

Optimize your myGradientBoosting and decision tree meta-parameters, and make your predictions for the test data, saving the predictions under `gb_preds`.

Try to think about the difference between your implementation and the GradientBoostingClassifier.

Are there any fundamental differences? If so, why?

You could try looking at the distribution of your output probabilities for each model.
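One simple way to look at such a distribution is sketched below. The scores are made-up values standing in for your model's outputs, and the bin range is chosen wide enough to reveal values outside the interval a probability should occupy.

```python
import numpy as np

# Hypothetical raw ensemble outputs (assumption: stand-ins for your model's scores).
raw_scores = np.array([-0.08, 0.02, 0.05, 0.11, 0.40, 0.97, 1.06])

outside = (raw_scores < 0) | (raw_scores > 1)
print(f"share outside [0, 1]: {outside.mean():.2f}")

# Bin counts over a range wide enough to show values beyond the valid interval.
counts, edges = np.histogram(raw_scores, bins=6, range=(-0.2, 1.2))
print(counts, edges.round(2))
```

Running the same summary on each model's outputs makes the distributions directly comparable.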