# Decision-Based Models: From Decision Tree to Random Forest

### Load the California housing Dataset and Prepare the Data
- Load the California housing dataset from sklearn (`sklearn.datasets.fetch_california_housing()`).
- Separate the features and the target variable, which is `MedHouseVal` (the median house value) in this case.
- Analyse the dataset and apply any necessary transformations to the data.
- Split the dataset into training and test sets

### Fit the Regression Tree
- Fit a decision tree to the training data.
- Try to visualize the behavior of the decision tree while changing its parameters
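
As one possible starting point, the sketch below fits regression trees of increasing depth and reports how the train and test scores evolve. It is only an illustration: the data come from `make_regression` as a synthetic stand-in for the housing data (an assumption, chosen so the example runs without downloading anything), and the helper name `depth_sweep` is hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (assumption: 8 numeric features).
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)


def depth_sweep(depths):
    """Return {max_depth: (train R^2, test R^2, number of leaves)}."""
    results = {}
    for depth in depths:
        tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
        tree.fit(X_train, y_train)
        results[depth] = (
            tree.score(X_train, y_train),
            tree.score(X_test, y_test),
            tree.get_n_leaves(),
        )
    return results


# max_depth=None lets the tree grow until every leaf is pure.
results = depth_sweep([1, 3, 5, None])
for depth, (r2_train, r2_test, n_leaves) in results.items():
    print(f"max_depth={depth}: train R^2={r2_train:.3f}, "
          f"test R^2={r2_test:.3f}, leaves={n_leaves}")
```

Plotting the two scores against the depth (or against `min_samples_leaf`) is a natural next step; the gap between them is a rough indicator of overfitting.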

### Perform Cost Complexity Pruning
- Apply cost complexity pruning to the regression tree using the appropriate library function: `cost_complexity_pruning_path()`
- Determine the optimal pruning parameter `ccp_alpha` through cross-validation
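
The two steps above could be sketched as follows. This is a minimal example under stated assumptions, not a reference solution: the data are synthetic (`make_regression` stands in for the housing data), the candidate grid is thinned to about 20 values to keep the search fast, and 5-fold cross-validation with mean squared error is one reasonable choice among others.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (assumption).
X, y = make_regression(n_samples=400, n_features=8, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Effective alphas: each value is a threshold at which a subtree of the
# fully grown tree gets pruned away.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)
# Guard against tiny negative values caused by floating-point error,
# then drop duplicates and thin the grid.
ccp_alphas = np.unique(np.clip(path.ccp_alphas, 0.0, None))
ccp_alphas = ccp_alphas[:: max(1, len(ccp_alphas) // 20)]

# Cross-validate one tree per candidate alpha.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": ccp_alphas},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)

best_tree = search.best_estimator_
test_mse = np.mean((best_tree.predict(X_test) - y_test) ** 2)
print("best ccp_alpha:", search.best_params_["ccp_alpha"])
print("leaves:", best_tree.get_n_leaves(), "test MSE:", round(test_mse, 2))
```

Because `GridSearchCV` refits on the full training set by default, `best_tree` can be used directly for the predictions and the tree plot requested in the next section.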

### Analyze the results
- Use the best estimator obtained from cross-validation to make predictions on the test set
- Utilize the `predict()` method of the best estimator to generate predictions for the test data
- Evaluate the performance of the model on the test set
- Plot the tree to see if it is interpretable

### Some questions

- When learning the tree, we choose the feature to test at each step by maximizing the expected information gain. Does this approach allow us to generate the optimal decision tree? Why or why not? Hint: when playing chess, do you consider only the immediate improvement of your position when deciding on your next move?

- Why might a Decision Tree work well on a small dataset but perform poorly on a larger one?

- How does the depth of the tree impact its performance in terms of overfitting and underfitting?

- How are categorical features handled during splitting?

- In your opinion, why is feature scaling not required when training a Decision Tree?

- When splitting a node, what happens if all observations have the same target value?

- Decision Trees are often described as "white-box models". What does this mean, and why is it beneficial in certain applications?

- If your dataset is imbalanced, how might this affect the splits chosen by the Decision Tree? What adjustments could you make to address this?

## Experiment: Bootstrapping

- Write a function that creates bootstrap samples by randomly sampling data points with replacement from the training set
- Train multiple Decision Tree models, each using a different bootstrap sample
- Generate predictions from each model for a fixed test set and store the results
- Compute the mean, standard deviation, and range of predictions for each test point across the models
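
The steps above can be sketched as follows. The example assumes NumPy arrays (with a pandas `DataFrame`, the row indexing would need `.iloc`), uses synthetic data from `make_regression` in place of the housing data, and the function name `bootstrap_sample` and the choice of 30 models are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def bootstrap_sample(X, y, rng):
    """Draw len(X) rows uniformly at random, with replacement."""
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]


# Synthetic stand-in for the housing data (assumption).
X, y = make_regression(n_samples=400, n_features=8, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

rng = np.random.default_rng(0)
n_models = 30  # arbitrary choice for the illustration

# preds[m, i] is the prediction of model m for test point i.
preds = np.empty((n_models, len(X_test)))
for m in range(n_models):
    X_boot, y_boot = bootstrap_sample(X_train, y_train, rng)
    tree = DecisionTreeRegressor(random_state=m).fit(X_boot, y_boot)
    preds[m] = tree.predict(X_test)

# Per-test-point statistics across the bootstrap models.
pred_mean = preds.mean(axis=0)
pred_std = preds.std(axis=0)
pred_range = preds.max(axis=0) - preds.min(axis=0)

print("average std across test points:", round(pred_std.mean(), 2))
print("largest prediction range:", round(pred_range.max(), 2))
```

Keeping the full `preds` matrix, rather than only the summaries, makes it easy to draw a histogram of the predictions for any single test point afterwards.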

- Create some visualizations to compare the mean predictions with the true values
- Show the distribution of predictions for a few selected test points

- How does the variance of predictions change across test samples?
- Why do some test points exhibit higher variance?
- How does this relate to the overfitting tendency of Decision Trees?
- How might combining predictions help to reduce variance?

- Try using different weak learners with the same setup and compare their prediction variance
- Increase or decrease the number of bootstrap samples. How does this affect the results?

### Bagging Implementation

- In order to improve the MSE and reduce the variance of the results, implement your own Bagging class.

- Recall that Bagging corresponds to the case where all predictors (i.e. covariates or features) are considered at each split.

- Fit your Bagging regressor to the training data. Compare it to scikit-learn's built-in `DecisionTreeRegressor` class.

- Effect of increasing the number of trees in Bagging:
    - Visualize the effect of increasing the number of weak learners.
    - Does increasing the number of trees always help?
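
A minimal sketch of such a class is given below, as one possible design rather than the expected solution. The class name `SimpleBaggingRegressor` and its parameters are hypothetical, the data are synthetic (`make_regression` stands in for the housing data), and the base learner defaults to a fully grown `DecisionTreeRegressor`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


class SimpleBaggingRegressor:
    """Hypothetical bagging regressor: average of models fit on bootstrap samples."""

    def __init__(self, base_estimator=None, n_estimators=50, random_state=None):
        self.base_estimator = base_estimator or DecisionTreeRegressor()
        self.n_estimators = n_estimators
        self.random_state = random_state

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        rng = np.random.default_rng(self.random_state)
        self.estimators_ = []
        for _ in range(self.n_estimators):
            # Bootstrap: sample row indices with replacement.
            idx = rng.integers(0, len(X), size=len(X))
            est = clone(self.base_estimator).fit(X[idx], y[idx])
            self.estimators_.append(est)
        return self

    def predict(self, X):
        # Aggregate by averaging the individual predictions.
        return np.mean([est.predict(X) for est in self.estimators_], axis=0)


# Synthetic stand-in for the housing data (assumption).
X, y = make_regression(n_samples=600, n_features=8, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
bag = SimpleBaggingRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
tree_mse = mean_squared_error(y_test, tree.predict(X_test))
bag_mse = mean_squared_error(y_test, bag.predict(X_test))
print(f"single tree MSE: {tree_mse:.1f}, bagging MSE: {bag_mse:.1f}")

# Test MSE when only the first k fitted trees are averaged.
all_preds = np.array([est.predict(X_test) for est in bag.estimators_])
mse_by_size = {
    k: mean_squared_error(y_test, all_preds[:k].mean(axis=0))
    for k in (1, 5, 20, 50)
}
print(mse_by_size)
```

Reusing the first `k` fitted trees avoids refitting the ensemble for every size, which makes it cheap to plot the error against the number of weak learners.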

### Random Forest
Random Forest introduces additional randomness by restricting the number of features that each tree can consider when splitting a node.

Usually, about $\sqrt{p}$ features are considered for classification and $p/3$ for regression, where $p$ is the total number of features.

This restriction decorrelates the trees, reducing the variance of the ensemble, and hence the risk of overfitting, even further.

- Modify your class to choose the maximum number of features to consider when searching for the best split.

- Create a plot displaying the test, train and OOB (out-of-bag) errors of random forests over a wider range of values for `max_features` and `n_estimators`
- Describe the results obtained
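
To check the numbers produced by your own class, the three errors can also be computed with scikit-learn's `RandomForestRegressor`, as in the sketch below. It is only an illustration under assumptions: the data are synthetic (`make_regression` instead of the housing data), `n_estimators` is fixed at 100, and only four values of `max_features` are tried; the plot itself is left out.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (assumption: p = 8 features).
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

errors = {}
for max_features in [1, 2, 4, 8]:  # 8 = all features, i.e. plain bagging
    forest = RandomForestRegressor(
        n_estimators=100,
        max_features=max_features,
        oob_score=True,  # needed to populate oob_prediction_
        random_state=0,
    )
    forest.fit(X_train, y_train)
    errors[max_features] = {
        "train": mean_squared_error(y_train, forest.predict(X_train)),
        # Each training point is predicted only by trees that did not see it.
        "oob": mean_squared_error(y_train, forest.oob_prediction_),
        "test": mean_squared_error(y_test, forest.predict(X_test)),
    }

for max_features, err in errors.items():
    print(max_features, {name: round(value, 1) for name, value in err.items()})
```

Wrapping the loop in a second loop over `n_estimators` gives the grid of values needed for the requested plot. With very few trees, some training points may never be out-of-bag, so the OOB error becomes unreliable.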

- Visualize the difference in variance between your Decision Tree and Random Forest (which you can replicate with different parameters).

--- 