# Random Forest Algorithm  
## Training & Prediction Workflow, Implementation, Evaluation, and Hyperparameter Tuning




## 1. Why Random Forest Here?

Random Forest is one of the strongest baseline models for tabular data.

Why?
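- It captures non-linear relationships without manual feature engineering
- It is less prone to overfitting than a single decision tree, because it averages many decorrelated trees
- It requires minimal preprocessing (tree splits do not depend on feature scaling)
- It works well even when feature interactions are complex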

We will now implement it step by step.


In [18]:
# Import the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


## 2. Dataset Selection

We use the **Breast Cancer Wisconsin dataset** from sklearn.

Why this dataset?
- Clean and well-structured
- Binary classification
- Medical context encourages careful evaluation
- Non-linear patterns suit Random Forest well


In [20]:
# Load the dataset
data = load_breast_cancer()

X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")

print("Feature matrix shape:", X.shape)
print("Target shape:", y.shape)

Feature matrix shape: (569, 30)
Target shape: (569,)



### Understanding the Target Variable

- 0 → Malignant (Cancer)
- 1 → Benign (Non-cancer)

This is a **binary classification problem**.


In [22]:
y.value_counts()

target
1    357
0    212
Name: count, dtype: int64

In [26]:
X.columns

Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension'],
      dtype='object')

In [34]:
data.target_names

array(['malignant', 'benign'], dtype='<U9')


## 3. Train-Test Split

We split the data into:
- Training set (75%)
- Test set (25%)

We use **stratification** so that both sets keep the original class proportions (about 63% benign, 37% malignant).


In [36]:
# Train-test split (stratified on y)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    random_state=42,
    stratify=y
)

print("Training samples:", X_train.shape[0])
print("Test samples:", X_test.shape[0])

Training samples: 426
Test samples: 143
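
A quick way to confirm that stratification did its job is to compare the share of benign samples (label 1) in each subset. The sketch below repeats the same split; loading the data with `return_X_y=True, as_frame=True` is a convenience chosen for this example rather than something used elsewhere in the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load features and target directly as a DataFrame / Series
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# The mean of a 0/1 target is the fraction of samples with label 1 (benign)
for name, part in [("full", y), ("train", y_train), ("test", y_test)]:
    print(f"{name:>5}: {part.mean():.3f} benign")
```

With stratification the three fractions should be nearly identical; without it, they can drift apart on a dataset of this size.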



## 4. Training a Baseline Random Forest Model

We start with a simple Random Forest that keeps scikit-learn's default hyperparameters (100 trees, no depth limit).
No tuning yet.


In [40]:
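# Baseline Random Forest model
rf_baseline = RandomForestClassifier(
    n_estimators=100,
    random_state=42
)

# Fit on the training split only; the test set stays unseen until evaluation
rf_baseline.fit(X_train, y_train)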


Test-set class counts (the output of `y_test.value_counts()`):

target
1    90
0    53
Name: count, dtype: int64


## 5. Making Predictions
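
A minimal sketch of the prediction step, assuming the same split and baseline settings as above. `predict` returns hard class labels, while `predict_proba` returns one probability per class, which is useful when the decision threshold matters. The variable names `y_pred` and `y_proba` are choices made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

rf_baseline = RandomForestClassifier(n_estimators=100, random_state=42)
rf_baseline.fit(X_train, y_train)

# Hard class labels: 0 = malignant, 1 = benign
y_pred = rf_baseline.predict(X_test)

# Class probabilities; columns follow rf_baseline.classes_, i.e. [0, 1]
y_proba = rf_baseline.predict_proba(X_test)

print("First 10 predicted labels:", y_pred[:10])
print("First 5 probability rows:")
print(y_proba[:5].round(2))
```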



## 6. Model Evaluation

Accuracy alone is **not enough**, especially in healthcare problems.
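
Accuracy counts a missed malignant tumor and a false alarm as the same kind of error. The sketch below computes accuracy next to the recall of the malignant class, assuming the same split and baseline model as above. Because malignant is encoded as 0 in this dataset, `pos_label=0` has to be passed explicitly; scikit-learn would otherwise treat label 1 (benign) as the positive class.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

rf_baseline = RandomForestClassifier(n_estimators=100, random_state=42)
rf_baseline.fit(X_train, y_train)
y_pred = rf_baseline.predict(X_test)

# Share of all test samples classified correctly
acc = accuracy_score(y_test, y_pred)

# Share of truly malignant samples (label 0) that the model caught
malignant_recall = recall_score(y_test, y_pred, pos_label=0)

print(f"Accuracy:         {acc:.3f}")
print(f"Malignant recall: {malignant_recall:.3f}")
```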



### Classification Report

This shows:
- Precision
- Recall
- F1-score

Recall for the malignant class is especially important here, because a false negative (a malignant tumor predicted as benign) is the most dangerous error.
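
One way to produce the report, assuming the same split and baseline model as above. Passing `target_names` labels the rows as malignant and benign instead of 0 and 1, which makes it easier to read the recall of the malignant row.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

rf_baseline = RandomForestClassifier(n_estimators=100, random_state=42)
rf_baseline.fit(X_train, y_train)
y_pred = rf_baseline.predict(X_test)

# target_names follows label order: index 0 = malignant, index 1 = benign
report = classification_report(y_test, y_pred, target_names=data.target_names)
print(report)
```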



### Confusion Matrix
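
The confusion matrix breaks the errors down by type. The sketch below assumes the same split and baseline model as above and wraps the result in a labeled DataFrame; the row and column labels are naming choices made for this example.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

rf_baseline = RandomForestClassifier(n_estimators=100, random_state=42)
rf_baseline.fit(X_train, y_train)
y_pred = rf_baseline.predict(X_test)

# Rows are true classes, columns are predicted classes, in label order [0, 1]
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
cm_df = pd.DataFrame(
    cm,
    index=["true malignant", "true benign"],
    columns=["pred malignant", "pred benign"],
)
print(cm_df)
```

The top-right cell (true malignant, predicted benign) holds the false negatives that matter most in this setting. In a notebook, `sns.heatmap(cm_df, annot=True, fmt="d")` renders the same table as a heatmap.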



## 7. Feature Importance

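Random Forest provides **global feature importance**: for each feature, the impurity reduction it contributed, averaged over all trees.
This tells us which features were most useful overall, not why an individual prediction was made.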
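
A minimal sketch for reading the importances, assuming the same split and baseline model as above. The fitted forest exposes them through the `feature_importances_` attribute; pairing them with the column names in a pandas Series makes them easy to sort.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

rf_baseline = RandomForestClassifier(n_estimators=100, random_state=42)
rf_baseline.fit(X_train, y_train)

# Impurity-based importances, one value per feature, normalized to sum to 1
importances = pd.Series(
    rf_baseline.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(importances.head(10))
```

Many features in this dataset are strongly correlated (radius, perimeter, and area, for example), so importance tends to be shared among them. A low score for one of them does not necessarily mean it carries no signal.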



## 8. Why Hyperparameter Tuning Matters

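Problems with default settings:
- Trees are grown until their leaves are pure (`max_depth=None`), so they may be deeper than needed
- The model may overfit the training data
- Training may be unnecessarily slow

We tune the model's **structure** (tree depth, split rules, forest size), not vanity metrics.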
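
These concerns can be checked directly on the baseline model. The sketch below, which assumes the same split and baseline settings as above, prints how deep the individual trees grew and compares training accuracy with test accuracy; a large gap between the two would point to overfitting.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

rf_baseline = RandomForestClassifier(n_estimators=100, random_state=42)
rf_baseline.fit(X_train, y_train)

# Depth of every fitted tree in the forest
depths = [tree.get_depth() for tree in rf_baseline.estimators_]
print("Tree depth: min", min(depths), "max", max(depths))

# score() returns mean accuracy for classifiers
train_acc = rf_baseline.score(X_train, y_train)
test_acc = rf_baseline.score(X_test, y_test)
print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy:  {test_acc:.3f}")
```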



## 9. Key Hyperparameters

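- `n_estimators`: number of trees in the forest
- `max_depth`: maximum depth of each tree
- `min_samples_split`: minimum number of samples a node must hold before it can be split
- `max_features`: number of features considered at each split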
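
All four are plain constructor arguments of `RandomForestClassifier`. The values below are illustrative assumptions for this sketch, not recommendations derived from the data.

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,       # more trees: more stable predictions, slower training
    max_depth=8,            # cap tree depth to limit model complexity
    min_samples_split=5,    # do not split nodes holding fewer than 5 samples
    max_features="sqrt",    # consider sqrt(n_features) candidates at each split
    random_state=42,
)

# get_params() returns the full configuration, including untouched defaults
params = rf.get_params()
for key in ["n_estimators", "max_depth", "min_samples_split", "max_features"]:
    print(key, "=", params[key])
```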



## 10. GridSearchCV for Tuning
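
A hedged sketch of the search, assuming the same split as above. The grid values, the 5-fold cross-validation, and the `f1_macro` scoring are assumptions made for this example; the grid is kept small so that it runs quickly. `GridSearchCV` fits every combination on the training data only and then refits the best one on the full training set.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Illustrative grid: 2 * 2 * 2 * 2 = 16 combinations
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "min_samples_split": [2, 5],
    "max_features": ["sqrt", "log2"],
}

grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,                 # 5-fold cross-validation on the training set
    scoring="f1_macro",   # averages F1 over both classes
)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print(f"Best CV score:   {grid.best_score_:.3f}")
```

The test set is not touched during the search, so it remains a fair check for the final model.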



## 11. Evaluating the Tuned Model
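
The tuned model should be judged on the same held-out test set as the baseline. The sketch below assumes the same split as above and uses a deliberately small, illustrative grid so that it stays self-contained; `best_estimator_` is the model that `GridSearchCV` refits on the whole training set with the winning parameters.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Baseline for comparison
rf_baseline = RandomForestClassifier(n_estimators=100, random_state=42)
rf_baseline.fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, rf_baseline.predict(X_test))

# Small illustrative grid (assumed values)
param_grid = {"max_depth": [None, 5], "min_samples_split": [2, 5]}
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=42),
    param_grid,
    cv=5,
)
grid.fit(X_train, y_train)

# Evaluate the refitted best model on the untouched test set
best_rf = grid.best_estimator_
y_pred_tuned = best_rf.predict(X_test)
tuned_acc = accuracy_score(y_test, y_pred_tuned)

print("Best parameters:", grid.best_params_)
print(f"Baseline test accuracy: {baseline_acc:.3f}")
print(f"Tuned test accuracy:    {tuned_acc:.3f}")
print(classification_report(y_test, y_pred_tuned, target_names=data.target_names))
```

On a small, clean dataset like this one, tuning may change the test scores very little. A tuned model that matches the baseline with shallower trees can still be the better choice.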
