# Random Forest

## Key Features
* Ensemble Method: Random Forest combines the predictions of many decision trees to make a final decision, leveraging the "wisdom of the crowd" principle.

* Bagging: Each tree is trained on a bootstrap sample of the training data, drawn by sampling with replacement (bootstrap aggregating, or bagging). Because every tree sees a different sample, the trees differ from one another.

* Random Feature Selection: At each split, a tree considers only a random subset of the features rather than all of them. This further reduces correlation between the trees and improves generalization.

* Robustness: Random Forest is less prone to overfitting than an individual decision tree, because averaging many decorrelated trees reduces variance while leaving bias roughly unchanged.

* Feature Importance: It provides insights into feature importance, helping to identify which variables are most influential in making predictions.
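The bagging and random feature selection steps described above can be sketched by hand. This is a minimal illustration, not how `RandomForestClassifier` is implemented internally: it assumes synthetic data from `make_classification`, uses hypothetical names (`X_demo`, `trees`, `votes`), and aggregates with a hard majority vote, whereas scikit-learn's forest averages the trees' predicted class probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption, for illustration only).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a majority vote cannot tie
trees = []
for i in range(n_trees):
    # Bagging: bootstrap sample of the training rows (with replacement).
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Random feature selection: sqrt(n_features) candidates per split.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_tr[idx], y_tr[idx])
    trees.append(tree)

# Aggregate: labels are 0/1, so the column mean is the share of votes for class 1.
votes = np.stack([t.predict(X_te) for t in trees])  # shape (n_trees, n_test)
y_vote = (votes.mean(axis=0) >= 0.5).astype(int)
print("Ensemble accuracy:", (y_vote == y_te).mean())
```

Each tree sees a different resampled training set and a different feature subset at every split, which is what keeps the individual trees from making the same mistakes.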

In [9]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [15]:
df = pd.read_csv('heart.csv')

In [17]:
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [19]:
# Features: every column except the last; target: the last column ('target')
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

In [21]:
# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [23]:
# Default hyperparameters; no random_state is set, so the score can vary between runs
rf = RandomForestClassifier()

In [25]:
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
print("Accuracy Score: ",accuracy_score(y_test,y_pred))

Accuracy Score:  0.8524590163934426


## Hyperparameter Tuning

In [44]:
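rf = RandomForestClassifier(
    n_estimators=200,   # more trees than the default of 100
    criterion='gini',   # split quality measure (the default)
    max_depth=1,        # each tree is a single-split stump
)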

In [46]:
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
print("Accuracy Score: ",accuracy_score(y_test,y_pred))

Accuracy Score:  0.8852459016393442
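A single train/test split on a small dataset gives a noisy estimate, so a gap of a few accuracy points between two configurations may come down to a handful of test rows. One way to get a steadier comparison is k-fold cross-validation. The sketch below assumes synthetic data standing in for `heart.csv`, and the names `candidates` and `scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the heart dataset (an assumption, for illustration only).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=0
)

# The two configurations compared in this notebook, with a fixed seed.
candidates = {
    "default": RandomForestClassifier(random_state=0),
    "shallow": RandomForestClassifier(n_estimators=200, max_depth=1, random_state=0),
}

scores = {}
for name, model in candidates.items():
    # Accuracy on each of 5 folds; the spread indicates how noisy the estimate is.
    cv_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    scores[name] = cv_scores
    print(f"{name}: mean={cv_scores.mean():.3f} std={cv_scores.std():.3f}")
```

On the notebook's own data, the same loop would run on `X_train` and `y_train`, leaving `X_test` untouched for a final check.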


## Feature Importance

In [75]:
rf.feature_importances_

array([0.015, 0.025, 0.175, 0.015, 0.   , 0.   , 0.   , 0.13 , 0.14 ,
       0.135, 0.07 , 0.16 , 0.135])
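The raw array is hard to read because it carries no column names. A small sketch of pairing each importance with its feature and sorting follows; it assumes synthetic data with hypothetical column names `f0` to `f5`. In this notebook the equivalent would likely be `pd.Series(rf.feature_importances_, index=X.columns)`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical column names f0..f5.
X_arr, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

rf_demo = RandomForestClassifier(n_estimators=100, random_state=0)
rf_demo.fit(X_demo, y_demo)

# Label each impurity-based importance with its column name, largest first.
importances = pd.Series(
    rf_demo.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

The importances are normalized to sum to 1, so each value can be read as a feature's share of the total impurity reduction across the forest.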

## GridSearchCV

In [50]:
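n_estimators = [20, 40, 60, 100]
criterion = ['gini', 'entropy']
max_depth = [1, 3, 5, 10, None]
max_samples = [0.3, 0.5, 1.0]
# Note: an integer min_samples_split must be >= 2. The value 1 is invalid,
# so every fit that uses it fails (one third of the grid, see the output below).
min_samples_split = [1, 2, 3]
min_samples_leaf = [1, 2, 3]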

In [52]:
params = {
    'n_estimators' : n_estimators,
    'criterion' : criterion,
    'max_depth' : max_depth,
    'max_samples' : max_samples,
    'min_samples_split' : min_samples_split,
    'min_samples_leaf' : min_samples_leaf
}

In [54]:
search = GridSearchCV(
    estimator = rf,
    param_grid = params,
    n_jobs=-1,
    cv=5,
    verbose=0,
)

In [56]:
search.fit(X_train,y_train)

1800 fits failed out of a total of 5400.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
1800 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\ASUS\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\ASUS\anaconda3\Lib\site-packages\sklearn\base.py", line 1467, in wrapper
    estimator._validate_params()
  File "C:\Users\ASUS\anaconda3\Lib\site-packages\sklearn\base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "C:\Users\ASUS\anaconda3\Lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_constraints
    raise InvalidParamet
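The failure count matches the share of the grid that uses `min_samples_split=1`: one of three values, so 1800 of 5400 fits. A minimal sketch of reproducing the rejection and then searching a grid that contains only valid values is shown below. It assumes synthetic data, and the names `failed`, `param_grid`, and `search_demo` are hypothetical. Passing `error_score="raise"` makes any invalid combination stop the search immediately instead of being scored as `nan`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data (an assumption, for illustration only).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# An integer min_samples_split below 2 is rejected when fit() validates parameters.
try:
    RandomForestClassifier(min_samples_split=1).fit(X_demo, y_demo)
    failed = False
except ValueError as exc:  # the validation error is a ValueError subclass
    failed = True
    print("Rejected:", type(exc).__name__)

# A smaller grid restricted to valid values.
param_grid = {
    "n_estimators": [20, 40],
    "max_depth": [1, 3, None],
    "min_samples_split": [2, 3, 4],
}
search_demo = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    cv=3,
    error_score="raise",  # fail loudly rather than recording nan scores
)
search_demo.fit(X_demo, y_demo)
print(search_demo.best_params_, search_demo.best_score_)
```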

In [57]:
search.best_params_

{'criterion': 'entropy',
 'max_depth': 1,
 'max_samples': 0.5,
 'min_samples_leaf': 1,
 'min_samples_split': 3,
 'n_estimators': 100}

In [58]:
search.best_score_

0.8471088435374149

## RandomizedSearchCV

In [60]:
from sklearn.model_selection import RandomizedSearchCV

In [67]:
# n_iter defaults to 10, so 10 sampled combinations x 5 folds = 50 fits
rsearch = RandomizedSearchCV(
    estimator = rf,
    param_distributions = params,
    n_jobs=-1,
    cv=5,
    verbose=0,
)

In [69]:
rsearch.fit(X_train,y_train)

25 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\ASUS\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\ASUS\anaconda3\Lib\site-packages\sklearn\base.py", line 1467, in wrapper
    estimator._validate_params()
  File "C:\Users\ASUS\anaconda3\Lib\site-packages\sklearn\base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "C:\Users\ASUS\anaconda3\Lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_constraints
    raise InvalidParameterErro

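In [70]:
# Note: this inspects `search` (the GridSearchCV object), so the output repeats
# the grid-search result. Use rsearch.best_params_ for the randomized search.
search.best_params_

{'criterion': 'entropy',
 'max_depth': 1,
 'max_samples': 0.5,
 'min_samples_leaf': 1,
 'min_samples_split': 3,
 'n_estimators': 100}

In [71]:
# Note: same as above; rsearch.best_score_ holds the randomized-search score.
search.best_score_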

0.8471088435374149
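For reference, a randomized search can also sample from distributions rather than fixed lists, and its results are read from the randomized-search object itself. The sketch below assumes synthetic data and uses hypothetical names (`param_distributions`, `rsearch_demo`). `scipy.stats.randint(low, high)` draws integers from `low` up to but not including `high`.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data (an assumption, for illustration only).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

param_distributions = {
    "n_estimators": randint(20, 101),    # integers in [20, 100]
    "max_depth": [1, 3, 5, 10, None],    # lists are sampled uniformly
    "min_samples_split": randint(2, 6),  # integers in [2, 5]; 1 is not allowed
    "min_samples_leaf": randint(1, 4),   # integers in [1, 3]
}

rsearch_demo = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,       # number of sampled combinations
    cv=3,
    random_state=0,  # makes the sampled combinations reproducible
)
rsearch_demo.fit(X_demo, y_demo)

# Read the results from the randomized-search object, not the grid-search one.
print(rsearch_demo.best_params_)
print(rsearch_demo.best_score_)
```

Because only `n_iter` combinations are tried, the cost stays fixed however large the search space grows, at the price of possibly missing the best combination.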