# Random Forest
<font size=2>

Random Forest is a popular machine learning algorithm for classification and regression; this notebook covers classification. It is an ensemble learning method that combines the predictions of multiple decision trees, which is usually more accurate than relying on any single tree.

1. Building the forest:
    
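The algorithm creates an ensemble of decision trees. The number of trees is a user-defined parameter (`n_estimators` in scikit-learn), and by default each tree is trained on a bootstrap sample of the training data, that is, rows drawn at random with replacement.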

2. Random feature selection:
    
At each node of every decision tree, only a random subset of the features is considered for splitting (in scikit-learn, the subset size is controlled by `max_features`). This decorrelates the trees, so their errors are less likely to coincide.

3. Growing decision trees:
    
Each decision tree is grown by recursively partitioning the data. At each node, the candidate features are scored with an impurity measure, typically Gini impurity or entropy, and the feature and split point that best separate the classes are chosen.

4. Voting for predictions:
    
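Once all the decision trees are constructed, each tree makes its own prediction. For classification, the class label that receives the most votes from the trees is chosen as the final prediction. scikit-learn's `RandomForestClassifier` uses a soft variant of this vote: it averages the trees' predicted class probabilities and returns the class with the highest mean probability.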
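
The four steps above can be sketched by hand with individual decision trees. This is a minimal illustration, not how scikit-learn implements `RandomForestClassifier` internally; the helper names `fit_forest` and `predict_majority` are made up for this example, and the code assumes the class labels are non-negative integers.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def fit_forest(X, y, n_trees=25, max_depth=2, seed=0):
    """Train n_trees decision trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        # Step 1: bootstrap sample, i.e. len(X) row indices drawn with replacement
        idx = rng.integers(0, len(X), size=len(X))
        tree = DecisionTreeClassifier(
            max_depth=max_depth,
            max_features="sqrt",  # Step 2: random feature subset at every split
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        # Step 3: grow the tree on the resampled data
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees


def predict_majority(trees, X):
    """Step 4: return the most frequently predicted label for each row of X."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)  # (n_trees, n_samples)
    return np.array([np.bincount(column).argmax() for column in votes.T])


# Toy data: two well-separated clusters
X = np.array([[0, 0, 0], [0, 0, 1], [0, 1, 0],
              [20, 20, 20], [19, 19, 19], [20, 21, 18]])
y = np.array([0, 0, 0, 1, 1, 1])

trees = fit_forest(X, y)
print(predict_majority(trees, np.array([[0, 0, 1], [20, 20, 20]])))
```

A hard majority vote is used here because it matches the description above most directly.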

<br>
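
The impurity measures named in step 3 are simple to compute. The sketch below uses the standard definitions (Gini impurity $1 - \sum_k p_k^2$ and entropy $-\sum_k p_k \log_2 p_k$, where $p_k$ is the fraction of samples in class $k$); the function names are chosen for this example only.

```python
import numpy as np


def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def entropy(labels):
    """Shannon entropy in bits: -sum_k p_k * log2(p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))


def weighted_impurity(left, right, measure=gini):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)


y = np.array([0, 0, 0, 1, 1, 1])
print(gini(y))                            # 0.5 (two balanced classes)
print(entropy(y))                         # 1.0 (one bit)
print(weighted_impurity(y[:3], y[3:]))    # 0.0 (both children are pure)
print(weighted_impurity(y[:4], y[4:]))    # 0.25 (left child is mixed)
```

A tree picks the split with the lowest weighted impurity, so the first split above would be preferred over the second.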
    
Random Forest has several **advantages**:

- It is less prone to overfitting than a single decision tree, because averaging many trees built with random feature selection and resampled data reduces variance.
- It can handle large datasets with a high number of features.
- It provides estimates of feature importance, allowing for insights into the relative significance of different features in the classification task.
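
As an illustration of the feature importance estimates, the sketch below fits a forest on synthetic data (generated here purely for the example) in which only the first of three features determines the label, then reads the fitted `feature_importances_` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for this example):
# feature 0 decides the class, features 1 and 2 are pure noise.
rng = np.random.default_rng(0)
n_samples = 500
informative = rng.normal(size=n_samples)
noise = rng.normal(size=(n_samples, 2))
X = np.column_stack([informative, noise])
y = (informative > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)
clf.fit(X, y)

# Impurity-based importances, normalized to sum to 1
importances = clf.feature_importances_
for i, importance in enumerate(importances):
    print(f"feature {i}: {importance:.3f}")
```

Feature 0 should receive by far the largest share. These importances are impurity-based, so they describe how the forest used the features on the training data rather than any causal effect.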
    
<br>
    
Code reference:
    
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
    
</font>

In [7]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier

## Simple example

In [17]:
X = np.array([[0,0,0], [0,0,1], [0,1,0],
              [20,20,20], [19,19,19],[20,21,18]])
y = np.array([0, 0, 0,
              1, 1, 1])
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

In [18]:
print(clf.predict([[9,9,9],[10,10,10],[13,13,13]]))
print(clf.predict_proba([[9,9,9],[10,10,10],[13,13,13]]))

[0 0 1]
[[0.98 0.02]
 [0.52 0.48]
 [0.   1.  ]]
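
The probabilities printed by `predict_proba` are the mean of the individual trees' class probabilities. The following sketch rebuilds the same toy forest and checks this by averaging over `clf.estimators_`, the list of fitted trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[0, 0, 0], [0, 0, 1], [0, 1, 0],
              [20, 20, 20], [19, 19, 19], [20, 21, 18]])
y = np.array([0, 0, 0, 1, 1, 1])
clf = RandomForestClassifier(max_depth=2, random_state=0).fit(X, y)

X_new = np.array([[9, 9, 9], [10, 10, 10], [13, 13, 13]])

# One (n_samples, n_classes) probability matrix per tree, stacked on axis 0
per_tree = np.stack([tree.predict_proba(X_new) for tree in clf.estimators_])
averaged = per_tree.mean(axis=0)

print(per_tree.shape)
print(np.allclose(averaged, clf.predict_proba(X_new)))
```

With the default of 100 trees and depth-limited trees that end in pure leaves on this data, a value such as 0.52 simply means that 52 of the trees voted for that class.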


## Sample from Captury Live

In [19]:
class RandomForest_Class():
    
    def __init__(self,Max_Depth,Random_State,Dataset_Path,Train_Len,Test_Len):
        self.random_forest = RandomForestClassifier(max_depth=Max_Depth, random_state=Random_State)
        self.get_data(dataset_path=Dataset_Path,train_len=Train_Len,test_len=Test_Len)

    def get_data(self,dataset_path,train_len=10000,test_len=200):
        # load data
        x_data_path, y_data_path = dataset_path[0], dataset_path[1]
        with open(x_data_path, 'rb') as xf:
            x_data = np.load(xf)
        with open(y_data_path, 'rb') as yf:
            y_data = np.load(yf)
        # define length for trainset and testset
        self.Train_Len = train_len
        self.Test_Len = test_len
        # randomly select train samples without replacement, so no row is drawn twice
        # (raises ValueError if train_len exceeds the number of rows)
        choices_train = np.random.choice(x_data.shape[0], size=self.Train_Len, replace=False)
        self.x_train = x_data[choices_train]
        self.y_train = y_data[choices_train]
        # remove the train samples so the test set cannot overlap with them
        new_x_data = np.delete(x_data, choices_train, axis=0)
        new_y_data = np.delete(y_data, choices_train, axis=0)
        print(f'x_train shape: {self.x_train.shape}')
        print(f'y_train shape: {self.y_train.shape}')
        # randomly select test samples, also without replacement, from the remaining rows
        choices_test = np.random.choice(new_x_data.shape[0], size=self.Test_Len, replace=False)
        self.x_test = new_x_data[choices_test]
        self.y_test = new_y_data[choices_test]
        print(f'x_test shape: {self.x_test.shape}')
        print(f'y_test shape: {self.y_test.shape}')
        
    def train(self):
        self.random_forest.fit(self.x_train, self.y_train)
        
    def test(self):
        self.P_pred = self.random_forest.predict_proba(self.x_test)
        self.T_pred = self.random_forest.predict(self.x_test)
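
The `test` method stores predictions but does not score them. One possible way to summarize them is with `accuracy_score` and `confusion_matrix` from `sklearn.metrics`. The label arrays below are made up for illustration and are not taken from the dataset; in the class they would correspond to `y_test` and `T_pred`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only (classes 1..5)
y_true = np.array([1, 3, 5, 1, 2, 4])
y_pred = np.array([1, 3, 5, 2, 2, 4])

# Fraction of samples whose predicted label matches the true label
accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, in the order of `labels`
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4, 5])

print(f"accuracy: {accuracy:.3f}")
print(cm)
```

Unlike a single all-correct check, the accuracy and the off-diagonal entries of the confusion matrix show how many predictions were wrong and which classes were confused.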

In [24]:
if __name__ == '__main__':
    random_forest_cls = RandomForest_Class(Max_Depth=2,Random_State=0,
                                           Dataset_Path=['x_data_UpperBody.npy','y_data_UpperBody.npy'],
                                           Train_Len=10000,Test_Len=100)
    random_forest_cls.train()
    random_forest_cls.test()
    print(f'predicted target: {random_forest_cls.T_pred}')
    print(f'probability of predicted target: {random_forest_cls.P_pred}')
    print(f'true target: {random_forest_cls.y_test}')
    print(f'Are all predictions correct? {np.all(random_forest_cls.T_pred == random_forest_cls.y_test)}')

x_train shape: (10000, 38)
y_train shape: (10000,)
x_test shape: (100, 38)
y_test shape: (100,)
predicted target: [1 3 5 1 1 1 2 1 1 4 3 3 5 2 1 1 3 5 2 5 2 2 2 4 2 1 2 4 5 1 4 3 4 3 3 5 5
 5 3 4 4 1 2 5 2 3 4 2 4 4 5 3 4 2 2 3 3 1 4 3 2 1 3 3 5 5 5 2 3 2 5 5 5 5
 4 2 4 2 4 4 2 5 2 4 4 3 5 1 5 3 2 1 1 1 1 5 2 3 1 3]
probability of predicted target: [[0.68410819 0.13412026 0.095968   0.04624989 0.03955366]
 [0.1009976  0.27248135 0.49552297 0.06688577 0.06411231]
 [0.04085881 0.05536596 0.06084999 0.         0.84292524]
 [0.69121922 0.13074736 0.09239423 0.04623163 0.03940757]
 [0.68410819 0.13412026 0.095968   0.04624989 0.03955366]
 [0.68410819 0.13412026 0.095968   0.04624989 0.03955366]
 [0.14415645 0.42236188 0.27982944 0.09430906 0.05934317]
 [0.68410819 0.13412026 0.095968   0.04624989 0.03955366]
 [0.68410819 0.13412026 0.095968   0.04624989 0.03955366]
 [0.04820151 0.08913764 0.06477123 0.79788962 0.        ]
 [0.1009976  0.27248135 0.49552297 0.06688577 0.06411231]
 [0.1009976