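1. Create many decision trees; each tree is trained on a random sample of the training data (a bootstrap sample, drawn with replacement).
2. At each split, consider only a random subset of all the features.
3. Each tree makes its own prediction.
4. For classification, use majority voting across the trees to get the final result.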
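
The four steps above can be sketched by hand with plain decision trees. This is only an illustration under some assumptions, not how `RandomForestClassifier` is implemented: the names `n_trees`, `rng` and `forest_predict` are made up for this sketch, 25 trees is an arbitrary (odd) number chosen to avoid tied votes, and `max_features="sqrt"` is one common choice for the random feature subset.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=100)

rng = np.random.default_rng(0)
n_trees = 25  # illustrative value; odd so a binary vote cannot tie

trees = []
for i in range(n_trees):
    # Step 1: bootstrap sample (same size as the training set, with replacement)
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: each split only looks at a random subset of the features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

def forest_predict(trees, X):
    """Steps 3 and 4: collect every tree's prediction, then take the majority."""
    votes = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
    # Labels are 0/1, so the majority class is 1 when more than half vote 1
    return (votes.mean(axis=0) > 0.5).astype(int)

manual_pred = forest_predict(trees, X_test)
manual_accuracy = (manual_pred == y_test).mean()
print(f"Hand-rolled forest accuracy: {manual_accuracy:.2f}")
```

One difference worth knowing: scikit-learn's forest averages the trees' predicted class probabilities instead of counting hard votes, so its predictions can differ slightly from this sketch.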

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import warnings
warnings.filterwarnings('ignore')

In [2]:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer(return_X_y = True, as_frame = True)

In [3]:
X = data[0]
Y = data[1] # target

# Splitting the dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=100) # set random state to ensure reproducibility

In [4]:
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

In [5]:
y_pred = rf_classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:\n", classification_rep)

sample = X_test.iloc[0:1]
prediction = rf_classifier.predict(sample)

sample_dict = sample.iloc[0].to_dict()
print(f"\nSample features: {sample_dict}")
# In load_breast_cancer, target 0 = malignant and 1 = benign
print(f"Predicted diagnosis: {'Benign' if prediction[0] == 1 else 'Malignant'}")

Accuracy: 0.96

Classification Report:
               precision    recall  f1-score   support

           0       0.96      0.93      0.95        56
           1       0.96      0.98      0.97        87

    accuracy                           0.96       143
   macro avg       0.96      0.95      0.96       143
weighted avg       0.96      0.96      0.96       143


Sample features: {'mean radius': 17.91, 'mean texture': 21.02, 'mean perimeter': 124.4, 'mean area': 994.0, 'mean smoothness': 0.123, 'mean compactness': 0.2576, 'mean concavity': 0.3189, 'mean concave points': 0.1198, 'mean symmetry': 0.2113, 'mean fractal dimension': 0.07115, 'radius error': 0.403, 'texture error': 0.7747, 'perimeter error': 3.123, 'area error': 41.51, 'smoothness error': 0.007159, 'compactness error': 0.03718, 'concavity error': 0.06165, 'concave points error': 0.01051, 'symmetry error': 0.01591, 'fractal dimension error': 0.005099, 'worst radius': 20.8, 'worst texture': 27.78, 'worst perimeter': 149.6, 

Recall and precision for class 0 (malignant) are both higher than the results from the decision tree. The random forest therefore did better than the decision tree at retrieving the class 0 samples (recall) and at being right when it predicts class 0 (precision).

Feature importance:

1. Gini importance

Derived from the random forest algorithm’s internal structure.

It measures how much each feature contributes to reducing impurity at the splits of the decision trees, averaged over all trees.

This gives a global measure: features that consistently create “good splits” across many trees get high importance.
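
To make the "across many trees" part concrete, the sketch below rebuilds the forest-level importance from the individual trees. It assumes, as I understand current scikit-learn versions, that `feature_importances_` is the mean of the per-tree importances; the names `per_tree`, `mean_imp`, `std_imp` and `matches_forest` are made up here. The standard deviation gives a rough idea of how much the trees disagree about a feature.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=100)
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# One row per tree, one column per feature; each row sums to 1
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
mean_imp = per_tree.mean(axis=0)
std_imp = per_tree.std(axis=0)

# Check the assumption that the forest importance is the per-tree average
matches_forest = np.allclose(mean_imp, forest.feature_importances_)
print("Mean of per-tree importances matches the forest:", matches_forest)

# Five features with the highest mean impurity reduction, with spread across trees
for i in np.argsort(mean_imp)[::-1][:5]:
    print(f"{X.columns[i]:<25} {mean_imp[i]:.3f} +/- {std_imp[i]:.3f}")
```

A large spread relative to the mean suggests the ranking of that feature depends heavily on which bootstrap sample and feature subsets a tree happened to see.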

In [6]:
feature_names = data[0].columns

In [7]:
importances = rf_classifier.feature_importances_
feature_imp_df = pd.DataFrame({'Feature': feature_names, 'Gini Importance': importances}).sort_values(
    'Gini Importance', ascending=False)
print(feature_imp_df)

                    Feature  Gini Importance
27     worst concave points         0.158391
7       mean concave points         0.131446
23               worst area         0.125334
2            mean perimeter         0.069705
22          worst perimeter         0.069365
6            mean concavity         0.065370
20             worst radius         0.054087
3                 mean area         0.053891
0               mean radius         0.046902
26          worst concavity         0.036139
13               area error         0.025670
10             radius error         0.016350
25        worst compactness         0.016004
21            worst texture         0.014891
1              mean texture         0.014792
12          perimeter error         0.012966
24         worst smoothness         0.011660
4           mean smoothness         0.009868
5          mean compactness         0.008184
16          concavity error         0.008169
29  worst fractal dimension         0.008053
15        

2. Permutation importance

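Measure model accuracy on a held-out set (a validation set, or the test set as in the code that follows).

Randomly shuffle one feature column (breaking its relationship to the target).

Re-evaluate model accuracy.

The drop in accuracy is the importance of that feature. Repeating the shuffle several times and averaging the drops makes the estimate more stable.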
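
The procedure above can be written out by hand in a few lines. This is a minimal sketch, not scikit-learn's implementation: `manual_permutation_importance` is a hypothetical helper, and because it uses its own random generator its numbers will not exactly match those from `sklearn.inspection.permutation_importance`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=100)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

def manual_permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling each column, one column at a time."""
    rng = np.random.default_rng(seed)
    baseline = accuracy_score(y, model.predict(X))
    importances = {}
    for col in X.columns:
        drops = []
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            # Shuffle one column only, breaking its link to the target
            X_shuffled[col] = rng.permutation(X_shuffled[col].to_numpy())
            shuffled_acc = accuracy_score(y, model.predict(X_shuffled))
            drops.append(baseline - shuffled_acc)
        importances[col] = float(np.mean(drops))
    return baseline, importances

baseline, manual_imp = manual_permutation_importance(model, X_test, y_test)
print(f"Baseline accuracy: {baseline:.3f}")
for name, value in sorted(manual_imp.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(f"{name:<25} {value:.6f}")
```

A negative value means accuracy went up after shuffling, which on a small test set is usually within the noise of a handful of samples.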

In [8]:
from sklearn.inspection import permutation_importance
result = permutation_importance(
    rf_classifier, X_test, y_test, n_repeats=10, random_state=0, n_jobs=-1)
perm_imp_df = pd.DataFrame({'Feature': feature_names, 'Permutation Importance': result.importances_mean}).sort_values(
    'Permutation Importance', ascending=False)
print(perm_imp_df)

                    Feature  Permutation Importance
16          concavity error                0.004895
3                 mean area                0.004196
1              mean texture                0.003497
20             worst radius                0.003497
24         worst smoothness                0.002098
22          worst perimeter                0.002098
26          worst concavity                0.000699
23               worst area                0.000699
0               mean radius                0.000000
28           worst symmetry                0.000000
25        worst compactness                0.000000
21            worst texture                0.000000
19  fractal dimension error                0.000000
18           symmetry error                0.000000
17     concave points error                0.000000
15        compactness error                0.000000
14         smoothness error                0.000000
12          perimeter error                0.000000
11          

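Permutation importance suggests that including 'worst concave points' and 'mean concave points' actually hurts model performance (a negative importance means accuracy went up slightly when the feature was shuffled). This is the opposite of Gini importance, where these are the two top-ranked features.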

Why is the permutation importance lower than the Gini importance?

Some possibilities:
1. Multicollinearity: when several features carry the same information, shuffling one of them costs little because the model can fall back on the others.
2. Importance comes from deep splits (Gini importance exaggerates contributions from many small splits deep in the trees).
3. Noise features.
4. Gini importance is biased toward features with many categories or continuous ranges, because they offer more possible split points.
5. The model has high variance or overfits.
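
The first possibility, multicollinearity, is the easiest to check directly. The sketch below lists the most strongly correlated feature pairs in the dataset. The 0.9 cutoff is an arbitrary assumption chosen for illustration, and the names `pairs` and `high_pairs` are made up; it says nothing about the other four possibilities.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

X, _ = load_breast_cancer(return_X_y=True, as_frame=True)

# Absolute Pearson correlation between every pair of features
corr = X.corr().abs()

# Keep each pair once by taking the upper triangle (excluding the diagonal)
cols = X.columns
i_idx, j_idx = np.triu_indices(len(cols), k=1)
pairs = pd.DataFrame({
    "feature_a": cols[i_idx],
    "feature_b": cols[j_idx],
    "abs_corr": corr.to_numpy()[i_idx, j_idx],
}).sort_values("abs_corr", ascending=False)

threshold = 0.9  # arbitrary cutoff, not a standard value
high_pairs = pairs[pairs["abs_corr"] > threshold]
print(f"{len(high_pairs)} of {len(pairs)} pairs have |corr| > {threshold}")
print(high_pairs.head(10).to_string(index=False))

# Correlation between the two features that Gini importance ranks highest
print(corr.loc["worst concave points", "mean concave points"])
```

If the two concave-points features turn out to be strongly correlated with each other or with other features, shuffling just one of them leaves most of the signal intact, which would be consistent with a high Gini importance and a near-zero permutation importance.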