# Random Forest for Classification

---

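>Note: Random Forest is less prone to overfitting than a single decision tree, but it can still overfit, for example on small or noisy datasets or when the trees are grown very deep.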

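>Note: The model is trained on the original (NOT normalized) data. Tree-based models split on feature thresholds, so they do not require feature scaling.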
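
The note about unnormalized data can be checked empirically. The sketch below is only an illustration on synthetic data from `make_classification` (an assumption, not the dataset used in this notebook), and the `*_demo` names are hypothetical. It trains the same forest on raw and on standardized features and compares test accuracy, which should be the same or nearly the same.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Forest trained on the raw features
raw_model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
acc_raw = accuracy_score(y_te, raw_model.predict(X_te))

# Same forest trained on standardized features (scaler fitted on train only)
scaler = StandardScaler().fit(X_tr)
scaled_model = RandomForestClassifier(random_state=42).fit(scaler.transform(X_tr), y_tr)
acc_scaled = accuracy_score(y_te, scaled_model.predict(scaler.transform(X_te)))

print(f"raw: {acc_raw:.3f}  scaled: {acc_scaled:.3f}")
```

Standardization is a monotonic transformation of each feature, so the ordering of values (and therefore the candidate splits) is preserved. Small differences can still appear because of floating-point rounding.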

- **Structure**: A Random Forest is an ensemble of multiple decision trees, usually trained with a technique called bagging. Each tree is built from a random subset of the training data, and the final prediction is made by aggregating the outputs of all the individual trees (e.g., by majority vote for classification or averaging for regression).

- **How it works:**
    - **Bagging**: Randomly samples the data with replacement to train each decision tree on a slightly different subset.
    - **Feature Randomness**: At each split in a tree, only a random subset of features is considered, which makes the trees less correlated with each other and the ensemble more diverse.

- **Advantages:**
    - **Reduces Overfitting**: By averaging multiple trees, the model reduces the risk of overfitting that individual trees might have.
    - **High Accuracy**: Generally provides better performance and accuracy than a single decision tree.
    - **Robust**: Less sensitive to outliers and noise in the data.

- **Disadvantages:**
    - **Complexity**: More difficult to interpret than a single decision tree.
    - **Computationally Expensive**: Requires more computational resources and time for training and making predictions, especially with a large number of trees.
    - **Less Intuitive**: Not as straightforward to understand the decision-making process.

`Preferable for more complex problems, larger datasets, and when you need a robust and accurate model.`
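
The three ingredients described above (bagging, feature randomness, majority vote) can be sketched in plain Python. This is a minimal illustration, not how scikit-learn implements them. The helper names, the row and feature counts, and `k=3` are all hypothetical choices made for the example.

```python
import random
from collections import Counter

def bootstrap_indices(n_rows, rng):
    """Sample n_rows row indices WITH replacement (the bagging step)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def random_feature_subset(n_features, k, rng):
    """Pick k distinct candidate features to consider at one split."""
    return sorted(rng.sample(range(n_features), k))

def majority_vote(votes):
    """Aggregate the class predicted by each tree into a single label."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(42)
n_rows, n_features = 10, 8  # illustrative sizes only

sample = bootstrap_indices(n_rows, rng)                     # rows used by one tree
features = random_feature_subset(n_features, k=3, rng=rng)  # features for one split

print("bootstrap rows:", sample)
print("candidate features:", features)
print("forest prediction:", majority_vote([1, 0, 1, 1, 0]))  # three votes for class 1
```

Because rows are drawn with replacement, some appear more than once in a bootstrap sample and others not at all, which is what gives each tree a slightly different view of the data.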

![Getting Started](../data/raw/image.png)


---

Imported Libraries

In [None]:
# Data processing
# ==================================================================================
import pandas as pd
import numpy as np

# Charts
# ==================================================================================
import matplotlib.pyplot as plt

# Preprocessing and modeling
# ==================================================================================
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# from sklearn.metrics import confusion_matrix
# from sklearn.metrics import ConfusionMatrixDisplay
# from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ParameterGrid
# from sklearn.inspection import permutation_importance
from sklearn import tree


# Warnings Configuration
# ==================================================================================
import warnings

# Silence all warnings (e.g. from sklearn) without monkey-patching warnings.warn
warnings.filterwarnings("ignore")

## Step 1: Decision making: Which is the best dataset



In [14]:
# Train data frames
X_train_with_outliers_sel = pd.read_csv('../data/processed/X_train_with_outliers_sel.csv')
X_train_without_outliers_sel = pd.read_csv('../data/processed/X_train_without_outliers_sel.csv')
y_train = pd.read_csv('../data/processed/y_train.csv')

# Test data frames
X_test_with_outliers_sel = pd.read_csv('../data/processed/X_test_with_outliers_sel.csv')
X_test_without_outliers_sel = pd.read_csv('../data/processed/X_test_without_outliers_sel.csv')
y_test = pd.read_csv('../data/processed/y_test.csv')



In [15]:
# Flatten the single-column target DataFrame into the 1-D array that scikit-learn expects.
# ravel() returns a contiguous flattened array; a copy is made only if needed.

y_train = y_train.values.ravel()

In [16]:
# Same flattening for the test target: (n, 1) DataFrame -> (n,) array.

y_test = y_test.values.ravel()

In [17]:
# train_dicts (dict)
# =====================================================================================
train_dicts = {
  "X_train_with_outliers_sel": X_train_with_outliers_sel,
  "X_train_without_outliers_sel": X_train_without_outliers_sel
}

# test_dicts (dict)
# =====================================================================================
test_dicts = {
  "X_test_with_outliers_sel": X_test_with_outliers_sel,
  "X_test_without_outliers_sel": X_test_without_outliers_sel
}

# -.-.--.-.-.-.-.-.-.-.--.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.--.-.-.-.-.-.-

# train_dfs (list)
# =====================================================================================
train_dfs = [
  X_train_with_outliers_sel,
  X_train_without_outliers_sel
]

# test_dfs (list)
# =====================================================================================
test_dfs = [
  X_test_with_outliers_sel,
  X_test_without_outliers_sel
]

# -.-.--.-.-.-.-.-.-.-.--.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.--.-.-.-.-.-.-

# Print .shape
# =====================================================================================
print("|X_train|")
print("=================================================================")
print(f"X_train_with_outliers_sel shape: {X_train_with_outliers_sel.shape} ")
print(f"X_train_without_outliers_sel shape: {X_train_without_outliers_sel.shape}\n ")

print("|X_test|")
print("=================================================================")
print(f"X_test_with_outliers_sel shape: {X_test_with_outliers_sel.shape} ")
print(f"X_test_without_outliers_sel shape: {X_test_without_outliers_sel.shape}\n ")

print("|Y_train|")
print("=================================================================")
print(f"y_train shape: {y_train.shape}\n ")

print("|Y_test|")
print("=================================================================")
print(f"y_test shape: {y_test.shape} ")

|X_train|
X_train_with_outliers_sel shape: (614, 8) 
X_train_without_outliers_sel shape: (614, 8)
 
|X_test|
X_test_with_outliers_sel shape: (154, 8) 
X_test_without_outliers_sel shape: (154, 8)
 
|Y_train|
y_train shape: (614,)
 
|Y_test|
y_test shape: (154,) 


In [None]:
results = []

for df_index in range(len(train_dfs)):
  model = RandomForestClassifier(random_state = 42) # Model initialization

  train_df = train_dfs[df_index]
  model.fit(train_df, y_train) # Model training

  y_test_pred = model.predict(test_dfs[df_index]) # Model prediction

  results.append(
    {
        "index": df_index,
        "train_df": list(train_dicts.keys())[df_index],
        "Accuracy_score": accuracy_score(y_test, y_test_pred)
  })

resultados = sorted(results, key = lambda x: x["Accuracy_score"], reverse = True)
resultados

[{'index': 0,
  'train_df': 'X_train_with_outliers_sel',
  'Accuracy_score': 0.7207792207792207},
 {'index': 1,
  'train_df': 'X_train_without_outliers_sel',
  'Accuracy_score': 0.7142857142857143}]

In [19]:
print (f"The best train dataframe is |{resultados[0]['train_df']}|.\n\
=======================================================      \n\
| Accuracy score: {resultados[0]['Accuracy_score']}   |\n\
========================================")

The best train dataframe is |X_train_with_outliers_sel|.
| Accuracy score: 0.7207792207792207   |
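
The two accuracy scores above differ by about 0.0065, which on a 154-row test set corresponds to a single prediction (1/154 ≈ 0.0065). Cross-validation on the training data is one way to get a more stable comparison. The sketch below is only an illustration on synthetic data (an assumption). In this notebook, `X_demo` and `y_demo` would be replaced by one of the candidate training frames and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: NOT the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# 5-fold cross-validated accuracy for a default forest
cv_scores = cross_val_score(
    RandomForestClassifier(random_state=42),
    X_demo,
    y_demo,
    cv=5,
    scoring="accuracy",
)

print(f"mean accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
```

Comparing the mean and spread of the fold scores for each candidate dataset shows whether a gap of one test sample is meaningful or within noise.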


## Step 2: Model hyperparameters optimization

- ### 2.1 GridSearchCV