<h1 style="text-align:center">RANDOM FOREST</h1>

## Ensemble learning
Ensemble learning combines the predictions of multiple models to get better performance than a single model usually achieves on its own.

### Types of Ensemble Learning:

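Bagging → Random Forest (✔️, covered here): models are trained independently on bootstrap samples and their predictions are averaged or voted on.

Boosting → AdaBoost, XGBoost: models are trained sequentially, each one focusing on the errors of the previous ones.

Stacking: a meta-model learns how to combine the predictions of several base models.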
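The three ensemble families can be tried side by side with scikit-learn's built-in estimators. This is a minimal sketch, not a tuned comparison: the base learners, the `n_estimators=50` values, and the names `ensembles` and `ensemble_scores` are illustrative assumptions, and XGBoost is left out because it lives in a separate library.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# One representative per ensemble family (choices are illustrative).
ensembles = {
    # Bagging: independent trees on bootstrap samples
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50, random_state=42
    ),
    # Boosting: weak learners fitted sequentially
    "boosting": AdaBoostClassifier(n_estimators=50, random_state=42),
    # Stacking: a logistic regression combines the base models' outputs
    "stacking": StackingClassifier(
        estimators=[
            ("tree", DecisionTreeClassifier(random_state=42)),
            ("knn", KNeighborsClassifier()),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
    ),
}

ensemble_scores = {}
for name, clf in ensembles.items():
    clf.fit(X_train, y_train)
    ensemble_scores[name] = clf.score(X_test, y_test)  # test accuracy
    print(name, ensemble_scores[name])
```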

In [2]:
from sklearn.datasets import load_wine

data = load_wine()


### Create X and y

In [3]:
X = data.data        # Features
y = data.target     # Labels (0, 1, 2)
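Before modelling, it can help to check what `X` and `y` contain. The sketch below only inspects the dataset; `class_counts` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine

data = load_wine()
X, y = data.data, data.target

print(X.shape)             # (number of samples, number of features)
print(data.feature_names)  # column names for X
print(data.target_names)   # names of the three wine classes

# Number of samples per class label (0, 1, 2)
class_counts = np.bincount(y)
print(class_counts)
```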


### Train-Test Split

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
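A plain random split does not guarantee that each class appears in the test set in the same proportion as in the full data. As an optional variation (not what the notebook uses), `train_test_split` accepts `stratify=y` to preserve class proportions. The variable names ending in `_s` are assumptions made for this sketch.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# Split used in this notebook
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Same split size, but class proportions are preserved in both parts
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print("train/test sizes:", len(y_train), len(y_test))
print("plain split test counts:     ", np.bincount(y_test))
print("stratified split test counts:", np.bincount(y_test_s))
```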


### Import RandomForestClassifier

In [5]:
from sklearn.ensemble import RandomForestClassifier


### Create Random Forest Model

The parameter n_estimators=200 specifies the number of decision trees in the forest. In Random Forest, each tree is trained on a different bootstrap sample of the training data and considers a random subset of features at each split, so setting n_estimators to 200 builds 200 individual decision trees. Increasing the number of trees generally improves performance and stability because the predictions are averaged (or voted) across more trees, which reduces variance. The gains level off once the forest is large enough, and adding trees does not by itself cause overfitting. More trees do increase computational cost and training time, so this value is a trade-off between accuracy and efficiency.
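The mechanism described above (bootstrap samples, random feature subsets, combined predictions) can be sketched by hand with plain decision trees. This is an illustrative approximation, not scikit-learn's internal implementation: it uses hard majority voting, whereas `RandomForestClassifier` averages the trees' predicted probabilities. The forest size `n_trees = 25` and the names `trees`, `votes`, and `manual_pred` are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

rng = np.random.default_rng(42)
n_trees = 25  # small illustrative forest
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw len(X_train) row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" makes each split consider a random feature subset
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Collect every tree's predictions: shape (n_trees, n_test_samples)
all_preds = np.stack([t.predict(X_test) for t in trees])

# Majority vote: count votes per class for each test sample
n_classes = len(np.unique(y))
votes = np.apply_along_axis(np.bincount, 0, all_preds, minlength=n_classes)
manual_pred = votes.argmax(axis=0)

manual_accuracy = (manual_pred == y_test).mean()
print("Manual bagging accuracy:", manual_accuracy)
```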

The parameter random_state=42 sets the random seed used by the algorithm. Random Forest involves randomness in two main places: sampling data points (bootstrap sampling) and selecting random subsets of features at each split. By fixing the random_state, the randomness becomes reproducible, meaning that every time you run the code with the same data and parameters, you will get the same model and the same results. This is especially important for debugging, comparison of models, and reproducible experiments.

In [6]:
model = RandomForestClassifier(
    n_estimators=200,    # number of trees
    random_state=42
)
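The effect of `random_state` can be checked directly: two forests built with the same seed on the same data should agree exactly. A minimal sketch, where `model_a` and `model_b` are names introduced only for this check:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Two separately constructed forests with identical settings and seed
model_a = RandomForestClassifier(n_estimators=200, random_state=42)
model_b = RandomForestClassifier(n_estimators=200, random_state=42)
model_a.fit(X_train, y_train)
model_b.fit(X_train, y_train)

same_importances = np.allclose(
    model_a.feature_importances_, model_b.feature_importances_
)
same_probabilities = np.allclose(
    model_a.predict_proba(X_test), model_b.predict_proba(X_test)
)
print("Same feature importances:", same_importances)
print("Same predicted probabilities:", same_probabilities)
```

Changing the seed in one of the two models would typically produce slightly different importances, even when the final accuracy is the same.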


### fit() and predict()

In [7]:
model.fit(X_train, y_train)


In [8]:
y_pred = model.predict(X_test)
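Besides hard labels, a fitted forest can report class probabilities through `predict_proba`, which averages the per-tree estimates. The sketch below refits the same model so that it runs on its own; `proba` and `matches` are names assumed for the example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
proba = model.predict_proba(X_test)  # shape: (n_test_samples, n_classes)

# Each row holds the averaged class probabilities and sums to 1
print(proba[:3])

# The predicted label is the class with the highest averaged probability
matches = np.array_equal(model.classes_[proba.argmax(axis=1)], y_pred)
print("predict() agrees with argmax of predict_proba():", matches)
```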


### Accuracy

In [9]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Accuracy: 1.0
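A perfect score on a single 45-sample test set can be optimistic, so it is worth checking how stable the result is across several splits. One common way is k-fold cross-validation; the sketch below assumes 5 folds, which is a choice made here and not something the notebook specifies.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

model = RandomForestClassifier(n_estimators=200, random_state=42)

# For classifiers, an integer cv uses stratified folds
cv_scores = cross_val_score(model, X, y, cv=5)

print("Fold accuracies:", cv_scores)
print("Mean accuracy:", cv_scores.mean())
print("Std deviation:", cv_scores.std())
```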


### Confusion matrix

In [10]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)


[[15  0  0]
 [ 0 18  0]
 [ 0  0 12]]
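In the matrix above, rows are the true classes and columns are the predicted classes, so correct predictions sit on the diagonal. As a small worked example, accuracy and per-class recall and precision can be recomputed from the matrix itself; the array below is copied from the printed output.

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([
    [15, 0, 0],
    [0, 18, 0],
    [0, 0, 12],
])

accuracy = np.trace(cm) / cm.sum()         # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)      # per true class (row-wise)
precision = np.diag(cm) / cm.sum(axis=0)   # per predicted class (column-wise)

print(accuracy)   # 1.0
print(recall)     # [1. 1. 1.]
print(precision)  # [1. 1. 1.]
```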


# FEATURE IMPORTANCE
Which features are most important for prediction?

### This improves:

Interpretability

Trust in model

Feature selection



Random Forest provides feature importance, helping in model interpretability.

This code analyzes a trained Random Forest model by identifying which input features matter most for its predictions. First, pandas is imported to create and manipulate tabular data. The line importances = model.feature_importances_ extracts the feature importance scores from the trained model. Each score measures how much a feature contributed to reducing impurity, averaged across all the decision trees in the forest, and together the scores sum to 1.

Next, a pandas DataFrame is built from a dictionary with two columns: Feature, which holds the feature names from data.feature_names, and Importance, which holds the corresponding scores. Each feature name aligns index-wise with its importance value. The DataFrame is then sorted in descending order with sort_values so that the most influential features appear at the top. Finally, print(importance_df) displays the sorted table, showing which features have the greatest impact on the model’s predictions, which supports interpretability and feature selection.

In [11]:
import pandas as pd

importances = model.feature_importances_

importance_df = pd.DataFrame({
    'Feature': data.feature_names,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

print(importance_df)


                         Feature  Importance
6                     flavanoids    0.187065
9                color_intensity    0.170184
12                       proline    0.156942
0                        alcohol    0.121404
11  od280/od315_of_diluted_wines    0.117764
10                           hue    0.058332
5                  total_phenols    0.047604
3              alcalinity_of_ash    0.034880
4                      magnesium    0.030259
8                proanthocyanins    0.028778
1                     malic_acid    0.026466
2                            ash    0.013385
7           nonflavanoid_phenols    0.006937
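The scores above are impurity-based and computed from the training data, so it can be useful to cross-check them against a second measure. One option is permutation importance, which shuffles one feature at a time on held-out data and records the drop in accuracy. This is a sketch of that cross-check, not part of the original analysis; `n_repeats=10` and the name `comparison_df` are assumptions.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Shuffle each feature 10 times on the test set and average the accuracy drop
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=42
)

comparison_df = pd.DataFrame({
    "Feature": data.feature_names,
    "Impurity": model.feature_importances_,   # sums to 1 across features
    "Permutation": result.importances_mean,   # mean drop in test accuracy
}).sort_values(by="Permutation", ascending=False)

print(comparison_df)
```

Features that rank highly under both measures are the safest candidates to keep during feature selection. Correlated features can share or mask each other's permutation importance, so the two rankings need not match exactly.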
