# Random Forests (RF)
### Definition
- Base estimator: decision tree.
- Each estimator is trained on a different bootstrap sample (rows drawn with replacement) of the same size as the training set.
- RF adds further randomization when training each tree: only a random subset of features is considered at each split.

![title](https://drive.google.com/uc?export=view&id=1r5FGL17FR5IHOm8RAArXXJWl2NWGIQHF)

### Steps
1. Build a bootstrapped dataset with the same number of rows as the original dataset.
    - Draw random observations (rows) from the original dataset with replacement, so the same row can appear more than once.
2. Create a decision tree from the bootstrapped dataset, considering only a random subset of variables (columns) at each split.
3. Go back to step 1 and repeat until the forest has the desired number of trees.
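
The steps above can be sketched by hand with scikit-learn's `DecisionTreeClassifier`. This is a minimal illustration, not how `RandomForestClassifier` is implemented internally: the helper names `fit_forest` and `predict_forest`, the synthetic data, and the choice of 25 trees are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_forest(X, y, n_trees=25, seed=0):
    """Fit n_trees decision trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n_rows = X.shape[0]
    trees = []
    for _ in range(n_trees):
        # Step 1: draw row indices with replacement (same number of rows).
        idx = rng.integers(0, n_rows, size=n_rows)
        # Step 2: max_features="sqrt" makes the tree consider only a
        # random subset of columns at every split.
        tree = DecisionTreeClassifier(
            max_features="sqrt",
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        tree.fit(X[idx], y[idx])
        # Step 3: keep the tree and repeat.
        trees.append(tree)
    return trees


def predict_forest(trees, X):
    """Majority vote over the trees (assumes binary labels 0 and 1)."""
    votes = np.stack([tree.predict(X) for tree in trees])  # (n_trees, n_rows)
    return (votes.mean(axis=0) >= 0.5).astype(int)


# Synthetic binary classification data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
trees = fit_forest(X, y)
y_hat = predict_forest(trees, X)
print("Training accuracy:", (y_hat == y).mean())
```

In practice `sklearn.ensemble.RandomForestClassifier` handles all of this in a single estimator; the loop is only meant to make the three steps visible.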

### Advantages and limitations
**Advantages**:
- Random forests combine the simplicity of decision trees with the flexibility of an ensemble, which usually improves accuracy considerably over a single tree.
- The variety among the trees is what makes a random forest more effective than an individual decision tree.
- ***Bagging tends to decrease variance, not bias. It therefore targets the problem of overfitting the training data.***
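
A standard result (not stated in these notes) makes the variance claim concrete. Assume each of $B$ trees $T_b$ has prediction variance $\sigma^2$ and any two trees have pairwise correlation $\rho$; these symbols are introduced here only for illustration. The variance of the averaged prediction is then:

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

Adding trees shrinks the second term, while choosing random feature subsets lowers $\rho$ and so shrinks the first. Neither term involves the bias of the individual trees, which is why averaging leaves bias essentially unchanged.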

**Feature Importance**:
- Tree-based methods make it possible to measure the importance of each feature for prediction.
    - Importance reflects how much the tree nodes use a particular feature to reduce impurity.
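
As a small illustration, scikit-learn exposes impurity-based importances through the fitted model's `feature_importances_` attribute. The synthetic data below is an assumption made for the example: the target depends strongly on column 0, weakly on column 1, and not at all on columns 2 and 3.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: 4 features, but only the first two influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X, y)

# Impurity-based importances, normalized so that they sum to 1.
importances = rf.feature_importances_
for i, value in enumerate(importances):
    print(f"feature {i}: {value:.3f}")
```

The feature with the largest coefficient should receive most of the importance, and the two unused columns should score close to zero.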

### Clustering
**Proximity Matrix**:
- Two samples (rows) are considered similar if they end up in the same leaf node of a tree.
- For each tree, build a matrix with one row and one column per sample, holding 1 when the two samples share a leaf and 0 otherwise.
- Repeat this for every decision tree in the random forest and add the values up in the matrix.
- Then divide each proximity value by the total number of trees.
- The closer a value is to 1, the smaller the distance between the two samples.
    - A heatmap or an MDS plot can show the distances between samples.
    - It doesn't matter whether the features are numeric or categorical.

![title](https://drive.google.com/uc?export=view&id=1gJ-79NoPUkx0BAUtti_QDMbUdJGZJdBE)
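
The proximity computation can be sketched with the `apply` method of a fitted scikit-learn forest, which returns the leaf index each sample falls into for every tree. The synthetic data and the variable names (`leaves`, `proximity`, `distance`) are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic data set, used only for illustration.
X, y = make_classification(n_samples=50, n_features=6, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# leaves[i, t] is the index of the leaf that sample i reaches in tree t.
leaves = rf.apply(X)  # shape (n_samples, n_trees)

# same_leaf[i, j, t] is True when samples i and j share a leaf in tree t.
same_leaf = leaves[:, None, :] == leaves[None, :, :]

# Averaging over trees equals summing the 0/1 values and dividing by n_trees.
proximity = same_leaf.mean(axis=2)

# Values near 1 mean "close", so 1 - proximity behaves like a distance.
distance = 1.0 - proximity
print(proximity.shape)
```

The resulting `distance` matrix is what a heatmap or an MDS routine that accepts precomputed dissimilarities would take as input.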

### Missing data
**Types**:
1. In the training data set
    - The general idea is to make an initial guess that could be bad, then gradually refine it until it is (hopefully) a good guess.
        - For example, group by the target value (heart disease: 0 or 1) and fill each NaN with the most frequent value or the average of its group.
2. In the test data set
    - We don't know the target value, so the group-by-target guess above cannot be applied directly.
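
The initial guess for training data can be sketched in pandas. This covers only the first fill, not the later refinement, and the toy table with its column names (`heart_disease`, `weight`, `chest_pain`) is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical training data with missing values in both feature columns.
df = pd.DataFrame({
    "heart_disease": [0, 0, 0, 1, 1, 1],
    "weight": [60.0, 64.0, np.nan, 80.0, np.nan, 90.0],
    "chest_pain": ["no", "no", np.nan, "yes", "yes", np.nan],
})

filled = df.copy()
grouped = df.groupby("heart_disease")

# Numeric column: fill with the mean of the rows sharing the same target.
filled["weight"] = df["weight"].fillna(grouped["weight"].transform("mean"))

# Categorical column: fill with the most frequent value within the group.
filled["chest_pain"] = df["chest_pain"].fillna(
    grouped["chest_pain"].transform(lambda s: s.mode().iloc[0])
)

print(filled)
```

Each gap is filled using only rows with the same target value, which is why this approach needs the target and does not carry over to test data as-is.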

In [5]:
import pandas as pd 

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE 

# Import data
file1 = 'https://raw.githubusercontent.com/prince381/car_mpg_predict/master/cars1.csv'
file2 = 'https://raw.githubusercontent.com/prince381/car_mpg_predict/master/cars2.csv'
cars1 = pd.read_csv(file1).dropna(how='all', axis=1)
cars2 = pd.read_csv(file2)  
df = pd.concat([cars1, cars2], ignore_index=True, sort=False)

# Split data: a single feature (displacement) is used to predict the target (mpg)
seed = 1
X = df[['displacement']].to_numpy()  # double brackets already yield a 2-D array
y = df['mpg'].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=seed)

# Instantiate and train model
# A float min_samples_leaf is a fraction: each leaf must hold at least 12% of the training samples
rf = RandomForestRegressor(n_estimators=400, min_samples_leaf=0.12, random_state=seed)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

# Evaluate the model
rmse_test = MSE(y_test, y_pred) ** (1/2)
print(f"Test set RMSE: {rmse_test}")

Test set RMSE: 3.7603781331419897
