# Random Forest Classification in ML
![logo](https://miro.medium.com/v2/resize:fit:720/format:webp/1*jE1Cb1Dc_p9WEOPMkC95WQ.png)

Random forest is a widely used ensemble learning technique in machine learning. It is a supervised learning method: it learns from labeled data to make predictions on unseen data.

## Key Idea:

- Single decision trees are prone to overfitting and can be sensitive to small variations in the data. A random forest instead builds an ensemble of many decision trees, hence the name "forest".
- Each tree is trained on a bootstrap sample of the data points (drawn with replacement, so a data point can be selected multiple times), and each split within a tree considers only a random subset of the features.
- This makes the trees differ from one another, so their individual errors partly cancel when the trees are combined, which reduces overfitting compared with a single tree.
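
The two sources of randomness described above can be sketched with the standard library alone. This is only an illustration, not how any particular library implements it: the seed, the variable names, and the choice of the square root of the feature count as the subset size (the default used by scikit-learn's `RandomForestClassifier`) are assumptions.

```python
import math
import random

rng = random.Random(0)  # fixed seed, chosen only so the sketch is repeatable

n_samples, n_features = 150, 4  # the shape of the Iris data used later

# Bootstrap sample: draw n_samples row indices WITH replacement,
# so some rows repeat and others are left out.
bootstrap_rows = rng.choices(range(n_samples), k=n_samples)

# Rows that were never drawn are "out-of-bag" for this tree.
out_of_bag_rows = sorted(set(range(n_samples)) - set(bootstrap_rows))

# Candidate features for one split: sqrt(n_features) of them, WITHOUT replacement.
n_candidates = max(1, int(math.sqrt(n_features)))
candidate_features = rng.sample(range(n_features), k=n_candidates)

print(len(set(bootstrap_rows)), "distinct rows drawn")
print(len(out_of_bag_rows), "rows left out-of-bag")
print("candidate feature indices for this split:", candidate_features)
```

On average roughly 63% of the rows end up in a given bootstrap sample, so every tree sees a noticeably different view of the data.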

## Prediction Process:

- Individual Predictions: When presented with a new data point, each tree in the forest independently makes a prediction.
- Majority Vote: For classification tasks, the final prediction is the class label that receives the most votes from the individual trees.
- Average Prediction: For regression tasks, the final prediction is the average of the individual predictions from the trees.
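
A minimal sketch of the two aggregation rules, using made-up tree outputs for a single data point (the values below are hypothetical, not taken from a trained model):

```python
from collections import Counter
from statistics import mean

# Hypothetical outputs of five trees for one new data point.
class_votes = [2, 1, 2, 2, 0]             # classification: one class label per tree
value_preds = [3.1, 2.9, 3.4, 3.0, 3.1]   # regression: one number per tree

# Majority vote: the most common label wins.
final_class = Counter(class_votes).most_common(1)[0][0]

# Average prediction: the mean of the trees' outputs.
final_value = mean(value_preds)

print(final_class)
print(final_value)
```

Note that scikit-learn's `RandomForestClassifier` uses a soft variant of this vote: it averages the class probabilities predicted by the trees and returns the class with the highest average, which usually agrees with the hard majority vote shown here.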

In [18]:
from sklearn.datasets import load_iris

In [19]:
iris = load_iris()

In [20]:
iris

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

In [23]:
print(iris.target_names)
print(iris.feature_names)

['setosa' 'versicolor' 'virginica']
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [21]:
import pandas as pd

In [16]:
# iris.target_names: ['setosa', 'versicolor', 'virginica']
# iris.feature_names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

In [22]:
data = pd.DataFrame({'sepal length': iris.data[:, 0],
                     'sepal width': iris.data[:, 1],
                     'petal length': iris.data[:, 2],
                     'petal width': iris.data[:, 3],
                     'species': iris.target})

In [24]:
data.head(5)

| | sepal length | sepal width | petal length | petal width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [27]:
x = data[['sepal length', 'sepal width', 'petal length', 'petal width']]  # features
y = data['species']  # labels

In [28]:
x

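| | sepal length | sepal width | petal length | petal width |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |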


In [29]:
y

0      0
1      0
2      0
3      0
4      0
      ..
145    2
146    2
147    2
148    2
149    2
Name: species, Length: 150, dtype: int32

In [30]:
from sklearn.model_selection import train_test_split

In [31]:
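# Hold out 30% of the rows for testing. No random_state is set, so the split
# (and every result below) changes from run to run.
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3)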

In [32]:
len(xtrain)

105

In [33]:
len(xtest)

45
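
Because the split above is random, the outputs in this notebook will not repeat exactly on a rerun. A reproducible variant might look like the sketch below. The seed value `42`, the use of `stratify`, and the variable names are assumptions for illustration; they are not part of the original notebook.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify=y keeps the class proportions equal
# in both parts. The seed value 42 is an arbitrary choice.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(len(X_train), len(X_test))
print(Counter(y_test.tolist()))
```

With stratification, each of the three species contributes the same number of rows to the test set, which keeps the accuracy estimate from being skewed by an unlucky draw.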

In [34]:
from sklearn.ensemble import RandomForestClassifier

In [35]:
# 100 trees, with splits chosen by Gini impurity (both are scikit-learn's defaults)
clf = RandomForestClassifier(n_estimators=100, criterion='gini')

In [36]:
clf.fit(xtrain,ytrain)

In [37]:
clf.predict(xtest)

array([1, 0, 2, 2, 1, 2, 0, 2, 2, 0, 0, 2, 0, 1, 1, 0, 2, 1, 2, 0, 0, 2,
       0, 0, 2, 2, 2, 1, 0, 2, 2, 2, 0, 2, 1, 1, 2, 1, 2, 2, 1, 2, 2, 0,
       1])

In [38]:
ytest

58     1
27     0
134    2
77     1
88     1
70     1
22     0
101    2
137    2
21     0
44     0
104    2
46     0
69     1
59     1
47     0
124    2
81     1
146    2
4      0
23     0
131    2
43     0
2      0
149    2
141    2
126    2
50     1
42     0
83     1
113    2
144    2
11     0
102    2
91     1
98     1
110    2
84     1
120    2
140    2
85     1
121    2
117    2
0      0
60     1
Name: species, dtype: int32

In [39]:
clf.score(xtest,ytest)

0.9333333333333333
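
`score` reports plain accuracy, which hides which classes get confused with each other. A confusion matrix shows that breakdown. The sketch below is self-contained and retrains a model on its own split, so its numbers will differ from the notebook's; the seeds and variable names are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0  # seed is an arbitrary assumption
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
acc = accuracy_score(y_test, y_pred)

print(cm)
print(acc)
```

The diagonal of the matrix counts correct predictions, so accuracy equals the diagonal sum divided by the total number of test rows.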

In [41]:
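# Predict the species of one new flower. Passing a DataFrame with the training
# column names avoids the "X does not have valid feature names" warning that
# recent scikit-learn versions raise for a bare list.
clf.predict(pd.DataFrame([[3, 5, 4, 2]], columns=x.columns))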

array([2])
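
The result `array([2])` is an integer class code; index 2 in `iris.target_names` is `'virginica'`. The sketch below shows one way to turn the code into a name and to inspect how confident the forest is. For brevity it trains on the full dataset with an arbitrary seed, which is an assumption and differs from the notebook's train/test split.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=0)  # seed is arbitrary
model.fit(iris.data, iris.target)

sample = [[3, 5, 4, 2]]  # sepal length, sepal width, petal length, petal width (cm)

label = model.predict(sample)[0]        # integer class code
proba = model.predict_proba(sample)[0]  # averaged class probabilities over the trees

print(iris.target_names[label])
for name, p in zip(iris.target_names, proba):
    print(f"{name}: {p:.2f}")
```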

In [44]:
# Pair each importance with its feature name and sort from most to least important
imp_feature = pd.Series(clf.feature_importances_, index=iris.feature_names).sort_values(ascending=False)

In [45]:
imp_feature

petal length (cm)    0.460124
petal width (cm)     0.411550
sepal length (cm)    0.104712
sepal width (cm)     0.023614
dtype: float64
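
The values above are impurity-based importances, which are computed from the training data and normalized to sum to 1. As a cross-check, permutation importance measures how much test accuracy drops when one feature's values are shuffled. The following is a sketch under assumptions (its own split, arbitrary seeds, 10 repeats), so its numbers will not match the table above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0  # arbitrary seed
)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Impurity-based importances are normalized, so they sum to 1.
print(model.feature_importances_.sum())

# Permutation importance: mean drop in test accuracy when one feature is shuffled.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for name, score in zip(iris.feature_names, result.importances_mean):
    print(f"{name}: {score:.3f}")
```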