#### What is Random Forest?

##### Random Forest (RF) is an ensemble learning technique for both classification and regression. It trains many decision trees, each on a bootstrap sample of the training data, and combines their outputs: the mode of the trees' predictions for classification, or the mean for regression. It is Bagging (Bootstrap Aggregation) applied to decision trees, with one extra layer of randomness: each split considers only a random subset of the features.
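##### The "mean of the individual trees" rule can be checked directly. The sketch below is a minimal illustration on synthetic data (the dataset and all variable names are assumptions, not part of the original notebook): it averages the per-tree predictions by hand and compares them with `RandomForestRegressor.predict`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Predict with every fitted tree, then average across trees
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)

# The forest's own prediction should match the hand-computed mean
forest_pred = forest.predict(X)
print(np.allclose(manual_mean, forest_pred))
```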

#### How Random Forest Extends Bagging
| Concept | Bagging | Random Forest |
|---|---|---|
| Base Estimator | Any (commonly decision trees) | Always decision trees |
| Sampling | Bootstrap (with replacement) | Bootstrap (with replacement) |
| Feature Randomness | ❌ Uses all features | ✅ Random subset of features at each split |
| Correlation Between Trees | ❗ Higher | ✅ Lower (because of feature randomness) |
| Overfitting Tendency | Medium | Low (compared to a single decision tree or plain bagging) |
| Interpretability | Medium | Low (harder to interpret due to randomness and averaging) |
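##### The comparison above can be tried side by side. This is a rough sketch on a synthetic dataset (the data and variable names are assumptions): plain bagging of decision trees versus a random forest with the same number of trees. Which one scores higher depends on the data, so treat the printed numbers as illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification problem, used only for illustration
X, y = make_classification(n_samples=600, n_features=20, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Plain bagging: bootstrap the rows, every split may look at all 20 features
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

# Random forest: bootstrap the rows and draw a random feature subset at each split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

scores = {}
for name, model in [("bagging", bagging), ("random forest", forest)]:
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)  # test-set accuracy
    print(f"{name}: {scores[name]:.3f}")
```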

#### Why Use Random Forest Instead of Plain Bagging?
##### ✅ Advantages of Random Forest over Bagging:
##### Lower Variance: Random feature selection decorrelates the trees, so averaging them cancels more of their individual errors and the ensemble is less prone to overfitting.
##### Higher Accuracy: Often generalizes better than plain bagging.
##### Less Correlated Trees: Because each split sees a different feature subset, a few strong features cannot dominate every tree.
##### Feature Importance: RF provides built-in estimates of feature importance, useful for interpretation.
##### Out-of-Bag (OOB) Error Estimation: Each tree can be evaluated on the rows left out of its bootstrap sample, giving an accuracy estimate during training without cross-validation. (This comes from bootstrapping, so bagging can offer it too.)
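##### The last two points map onto scikit-learn attributes. A minimal sketch, assuming a synthetic dataset (not the notebook's data): `oob_score=True` enables the out-of-bag estimate, exposed as `oob_score_`, and `feature_importances_` holds the impurity-based importances.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)

# oob_score=True scores each sample using only the trees that did not train on it
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)

print(f"OOB accuracy: {forest.oob_score_:.3f}")

# Impurity-based importances, normalized to sum to 1; print from most to least important
ranking = np.argsort(forest.feature_importances_)[::-1]
for idx in ranking:
    print(f"feature {idx}: {forest.feature_importances_[idx]:.3f}")
```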

#### 🚫 When Not to Use Random Forest:
##### If interpretability is crucial.
##### When your data has very few features (restricting each split to a random subset can hide the few informative ones).
##### When performance is similar to bagging (on simple problems) and a simpler ensemble is preferred.

#### 🌟 Bagging vs Random Forest — Key Differences
| Feature / Aspect | Bagging | Random Forest |
|---|---|---|
| 🎯 Full Name | Bootstrap Aggregating | Random Forest |
| 🌳 Base Estimator | Any (commonly Decision Tree) | Always Decision Tree |
| 🔁 Bootstrap Sampling (Rows) | ✅ Yes | ✅ Yes |
| 🎲 Feature Randomness | ❌ No (uses all features at each split by default) | ✅ Yes (uses random subset of features per node) |
| 📊 Feature Selection Level | Per tree (if specified manually, same for all nodes) | Per node/split (random subset changes per node) |
| 🤝 Tree Correlation | High (since all features are used) | Low (due to feature subset per node) |
| 🔄 Variance Reduction | ✅ Yes (averaging trees reduces variance) | ✅✅ More variance reduction (due to added randomness) |
| 🧠 Bias | Slightly lower than RF (can overfit) | Slightly higher (due to random features) but generalizes better |
| 📉 Overfitting Risk | Medium | Low |
| 🛠️ Hyperparameter Example | `max_features` sets the feature subset drawn once per base estimator | `max_features` sets the subset size drawn at each node |
| 📈 Performance (usually) | Good | Better (on most real-world problems) |
| 🧮 Feature Importance | Not directly available (scikit-learn's bagging estimators expose no `feature_importances_`) | ✅ Yes (built-in feature importance metrics) |
| 💬 Interpretability | Medium (depends on base learner) | Lower (due to high randomness and number of trees) |
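##### The "per tree" versus "per node" distinction for `max_features` can be inspected on fitted models. A small sketch on synthetic data (the data and names are assumptions): with `max_features=3`, each bagged tree is restricted to the same 3 columns for all of its splits, while a forest tree redraws 3 candidate columns at every split and so ends up using many more columns overall.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10 features, used only for illustration
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=0)

# Bagging: 3 columns are drawn once per base estimator
bagging = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=5, max_features=3, random_state=0
).fit(X, y)
for cols in bagging.estimators_features_:
    print("bagging tree sees columns:", sorted(cols.tolist()))

# Random forest: 3 candidate columns are redrawn at every split
forest = RandomForestClassifier(n_estimators=5, max_features=3, random_state=0).fit(X, y)
for tree in forest.estimators_:
    split_features = tree.tree_.feature
    used = np.unique(split_features[split_features >= 0])  # negative values mark leaves
    print("forest tree splits on columns:", used.tolist())
```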


#### Example Analogy:
##### Bagging: Imagine training 100 doctors, all with access to the same medical book, but each sees different patient samples. They'll learn similarly and might give correlated diagnoses.
##### Random Forest: You still train 100 doctors, each with different patients AND given only a few random pages from the book per decision. They learn differently and provide diverse opinions — averaging their diagnoses is more robust.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, mean_absolute_error, mean_squared_error, r2_score

In [2]:
# Titanic dataset for the classification problem
# (sns.load_dataset already returns a pandas DataFrame)
df = sns.load_dataset("titanic")
df.sample(5)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
127,1,3,male,24.0,0,0,7.1417,S,Third,man,True,,Southampton,yes,True
47,1,3,female,,0,0,7.75,Q,Third,woman,False,,Queenstown,yes,True
408,0,3,male,21.0,0,0,7.775,S,Third,man,True,,Southampton,no,True
130,0,3,male,33.0,0,0,7.8958,C,Third,man,True,,Cherbourg,no,True
415,0,3,female,,0,0,8.05,S,Third,woman,False,,Southampton,no,True


In [3]:
# Move the target column `survived` to the end of the DataFrame
df['survived'] = df.pop('survived')
df.sample(5)

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,survived
688,3,male,18.0,0,0,7.7958,S,Third,man,True,,Southampton,no,True,0
353,3,male,25.0,1,0,17.8,S,Third,man,True,,Southampton,no,False,0
38,3,female,18.0,2,0,18.0,S,Third,woman,False,,Southampton,no,False,0
713,3,male,29.0,0,0,9.4833,S,Third,man,True,,Southampton,no,True,0
326,3,male,61.0,0,0,6.2375,S,Third,man,True,,Southampton,no,True,0


In [4]:
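# Drop redundant columns: `adult_male` and `who` are derived from `sex` and `age`,
# and `embarked` is just the initial of `embark_town`.
# NOTE: `alive` is a yes/no copy of the target `survived`. It is left in here, so it
# leaks the label into the features; drop it too for a realistic evaluation.
df.drop(columns=['adult_male', 'embarked', 'who'], inplace=True)
df.sample(5)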

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,class,deck,embark_town,alive,alone,survived
366,1,female,60.0,1,0,75.25,First,D,Cherbourg,yes,False,1
671,1,male,31.0,1,0,52.0,First,B,Southampton,no,False,0
167,3,female,45.0,1,4,27.9,Third,,Southampton,no,False,0
470,3,male,,0,0,7.25,Third,,Southampton,no,True,0
597,3,male,49.0,0,0,0.0,Third,,Southampton,no,True,0


In [5]:
# Separate the features (all columns but the last) from the target (last column)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

In [15]:
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# One-hot encode the categorical columns. OneHotEncoder expects 2-D input,
# so select each column with X[[col]] (a DataFrame), not X[col] (a Series).
encoders = {}
ohe_cols = ['sex', 'class', 'deck', 'embark_town', 'alive']
X_train_ohe = []
X_test_ohe = []

for col in ohe_cols:
    # Assumed encoder settings (the encoder is not defined in the cells shown):
    # dense output so the blocks can be concatenated with NumPy (use sparse=False
    # on scikit-learn < 1.2), and unseen test categories encoded as all zeros.
    ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
    X_train_ohe.append(ohe.fit_transform(X_train[[col]]))  # fit on the training split only
    X_test_ohe.append(ohe.transform(X_test[[col]]))
    encoders[col] = ohe  # keep the fitted encoder for later reuse

# Drop the original categorical columns from X_train and X_test
X_train_updated = X_train.drop(columns=ohe_cols).values
X_test_updated = X_test.drop(columns=ohe_cols).values

# Concatenate the remaining columns with the encoded blocks
X_train_transformed = np.concatenate([X_train_updated] + X_train_ohe, axis=1)
X_test_transformed = np.concatenate([X_test_updated] + X_test_ohe, axis=1)

X_train_transformed

array([[1, 45.5, 0, ..., 1.0, 0.0, 0.0],
       [2, 23.0, 0, ..., 1.0, 0.0, 0.0],
       [3, 32.0, 0, ..., 1.0, 0.0, 0.0],
       ...,
       [3, 41.0, 2, ..., 1.0, 0.0, 0.0],
       [1, 14.0, 1, ..., 1.0, 0.0, 1.0],
       [1, 21.0, 0, ..., 1.0, 0.0, 0.0]], dtype=object)
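
##### The manual encode-and-concatenate steps above can also be expressed with scikit-learn's `ColumnTransformer` and `Pipeline`, which keep the fitted encoders together with the model. The sketch below is one possible arrangement, not the notebook's method; the tiny `toy` frame, its values, and the imputation choices are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny hypothetical frame that mimics a few Titanic columns
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "female", "male"],
    "deck": ["C", np.nan, "E", np.nan, "C", "B"],
    "age": [22.0, 38.0, np.nan, 35.0, 27.0, 54.0],
    "fare": [7.25, 71.28, 7.92, 8.05, 11.13, 51.86],
    "survived": [0, 1, 1, 0, 1, 0],
})
X_toy, y_toy = toy.drop(columns="survived"), toy["survived"]

categorical = ["sex", "deck"]
numeric = ["age", "fare"]

preprocess = ColumnTransformer([
    # Categorical: fill missing values with a placeholder, then one-hot encode
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
    # Numeric: fill missing values with the column median
    ("num", SimpleImputer(strategy="median"), numeric),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])
model.fit(X_toy, y_toy)

# A deck value never seen during fit ("Z") is encoded as all zeros instead of raising
new_passenger = pd.DataFrame({"sex": ["female"], "deck": ["Z"], "age": [30.0], "fare": [10.0]})
pred = model.predict(new_passenger)
print(pred)
```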

In [17]:
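# Fit a random forest classifier with default hyperparameters.
# `age` still contains NaN: RandomForestClassifier accepts missing values
# only in scikit-learn >= 1.4, so impute first on older versions.
rf = RandomForestClassifier()
rf.fit(X_train_transformed, y_train)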

In [19]:
y_pred = rf.predict(X_test_transformed)

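# NOTE: a perfect score is a red flag. The feature matrix still contains `alive`,
# a yes/no copy of the target `survived`, so the model simply reads off the label.
print(f"Accuracy score: {accuracy_score(y_test, y_pred) * 100:.2f}")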
print(f"Confusion Matrix: {confusion_matrix(y_test,y_pred)}")
print(f"Classification Report: {classification_report(y_test,y_pred)}")

Accuracy score: 100.00
Confusion Matrix: [[105   0]
 [  0  74]]
Classification Report:               precision    recall  f1-score   support

           0       1.00      1.00      1.00       105
           1       1.00      1.00      1.00        74

    accuracy                           1.00       179
   macro avg       1.00      1.00      1.00       179
weighted avg       1.00      1.00      1.00       179
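
##### A 100% score like the one above is typical of target leakage: the `alive` column encodes the same information as `survived`. The effect can be reproduced on synthetic data. In this sketch (data and names are assumptions, not the Titanic set) the same model is evaluated with and without a feature that copies the target.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)
# Noisy binary target: only partly predictable from `signal`
y = (signal + rng.normal(scale=1.0, size=n) > 0).astype(int)

noise = rng.normal(size=(n, 2))
X_clean = np.column_stack([signal, noise])
X_leaky = np.column_stack([X_clean, y])  # last column is a copy of the target

def holdout_accuracy(X, y):
    """Fit a random forest on 80% of the rows and score it on the remaining 20%."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    return model.score(X_te, y_te)

acc_clean = holdout_accuracy(X_clean, y)
acc_leaky = holdout_accuracy(X_leaky, y)
print(f"without leaked column: {acc_clean:.3f}")
print(f"with leaked column:    {acc_leaky:.3f}")
```

##### On the Titanic data, the equivalent fix would be to drop `alive` before building the feature matrix.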



In [20]:
# California housing dataset for the regression problem
house_price = fetch_california_housing()
house_price

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n

In [23]:
X = pd.DataFrame(house_price.data,columns=house_price.feature_names)
y = pd.Series(house_price.target, name='target')

X.sample(5)

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
2880,1.375,35.0,4.050847,1.031477,1041.0,2.520581,35.38,-118.97
9491,3.5917,22.0,5.410526,1.021053,821.0,2.880702,39.21,-123.19
6481,3.5268,23.0,4.894309,1.097561,409.0,3.325203,34.09,-118.05
16122,3.7887,52.0,4.926724,1.110632,1584.0,2.275862,37.78,-122.46
14519,4.0,16.0,4.648973,0.994863,1619.0,2.77226,32.91,-117.13


In [24]:
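X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features, fitting the scaler on the training split only.
# Tree-based models are insensitive to feature scale, so this step is optional here.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)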
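
##### Whether scaling matters for a random forest can be checked empirically. The sketch below uses synthetic data (an assumption, not the California set) and fits the same forest on raw and on standardized features. Because tree splits depend only on the ordering of feature values, the two scores should agree up to numerical precision.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem, used only for illustration
X, y = make_regression(n_samples=500, n_features=6, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Forest trained on the raw features
raw = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
r2_raw = r2_score(y_te, raw.predict(X_te))

# Same forest (same seed) trained on standardized features
scaler = StandardScaler().fit(X_tr)
scaled = RandomForestRegressor(n_estimators=100, random_state=0).fit(scaler.transform(X_tr), y_tr)
r2_scaled = r2_score(y_te, scaled.predict(scaler.transform(X_te)))

print(f"R2 raw:    {r2_raw:.4f}")
print(f"R2 scaled: {r2_scaled:.4f}")
```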

In [25]:
rfr = RandomForestRegressor()
rfr.fit(X_train,y_train)

In [26]:
#predicting results
y_pred = rfr.predict(X_test)

# Report MAE, MSE and R2 in their natural units
# (the target is the median house value in units of $100,000)
print(f"MAE: {mean_absolute_error(y_test, y_pred):.4f}")
print(f"MSE: {mean_squared_error(y_test, y_pred):.4f}")
print(f"R2: {r2_score(y_test, y_pred):.4f}")

MAE: 0.3258
MSE: 0.2522
R2: 0.8075
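
##### To make the three metrics concrete, here is a small worked example that computes them by hand with NumPy. The four values are made up for illustration. MAE and RMSE are in the units of the target (for California housing, $100,000), MSE is in squared units, and R2 is unitless.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values, used only for illustration
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_hat                      # [0.5, 0.0, -2.0, -1.0]
mae = np.mean(np.abs(errors))                # 0.875
mse = np.mean(errors ** 2)                   # 1.3125
rmse = np.sqrt(mse)                          # same units as the target
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE:  {mae:.4f}")
print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R2:   {r2:.4f}")
```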
