Random Forest is a versatile ensemble learning algorithm used for both classification and regression tasks in machine learning. It extends the decision tree algorithm by training many decision trees and combining their outputs: for classification it returns the class predicted by the most trees (the mode), and for regression it returns the mean of the individual trees' predictions.

Here's how Random Forest works:

Bootstrap Sampling (Bagging):

Random Forest starts by creating multiple bootstrap samples (random samples with replacement) from the original training dataset.
Each bootstrap sample has the same size as the original dataset but may contain duplicate instances and omit others.
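The sampling step can be sketched in a few lines of plain Python. This is an illustrative sketch, not the scikit-learn implementation: the helper name bootstrap_sample and the dataset of 1000 row ids are assumptions made for the example. It shows that a bootstrap sample has the same length as the dataset while containing only a subset of the distinct rows (about 63% on average for large datasets); the rows left out are often called out-of-bag.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows uniformly at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


rng = random.Random(0)        # fixed seed so the sketch is repeatable
dataset = list(range(1000))   # hypothetical dataset: row ids 0..999

sample = bootstrap_sample(dataset, rng)
unique_fraction = len(set(sample)) / len(dataset)
out_of_bag = set(dataset) - set(sample)   # rows this particular tree never sees

print('sample size      :', len(sample))
print('unique fraction  :', round(unique_fraction, 3))
print('out-of-bag rows  :', len(out_of_bag))
```

Each tree in the forest would get its own sample drawn this way, so different trees see different duplicates and different omissions.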

Random Feature Selection:

For each decision tree in the Random Forest, a random subset of features is selected at each split.
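For classification, the number of features considered at each split is typically set to the square root of the total number of features (this is the default max_features="sqrt" in scikit-learn's RandomForestClassifier). Regression forests commonly consider a larger share of the features at each split.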
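A minimal sketch of per-split feature selection, assuming the square-root rule described above. The helper features_for_split is hypothetical, and the feature names are the six columns used in the Titanic example in this document. With six features, the integer square root gives two candidate features per split.

```python
import math
import random


def features_for_split(feature_names, rng):
    """Pick floor(sqrt(p)) distinct features (at least one) as split candidates."""
    k = max(1, math.isqrt(len(feature_names)))
    return rng.sample(feature_names, k)


rng = random.Random(0)
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']

# A fresh subset is drawn for every split, not once per tree.
candidates_per_split = [features_for_split(features, rng) for _ in range(3)]
for candidates in candidates_per_split:
    print(candidates)
```

Restricting each split to a few random candidates keeps the trees from all choosing the same dominant feature, which makes their errors less correlated.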

Decision Tree Training:

Multiple decision trees are trained independently on the bootstrap samples with the randomly selected features.
Each decision tree is trained using the CART (Classification and Regression Trees) algorithm, which recursively splits the data at each node on the feature and threshold that produce the purest subsets, measured for example by Gini impurity for classification or by variance for regression.
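To make "purest subsets" concrete, here is a small sketch of how one node could choose a threshold on a single numeric feature using Gini impurity. It is a simplified illustration of the idea behind a CART split, not the optimized routine a library uses; the function names and the toy values are assumptions for this example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions (0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_impurity) for the best split 'value <= threshold'."""
    best = (None, gini(labels))   # baseline: impurity of the unsplit node
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Hypothetical toy node: one numeric feature and binary labels.
feature_values = [4, 8, 15, 30, 42, 57]
node_labels = [1, 1, 1, 0, 0, 0]

threshold, impurity = best_threshold(feature_values, node_labels)
print('best threshold    :', threshold)
print('weighted impurity :', impurity)
```

A full tree would repeat this search over the randomly chosen candidate features at every node and recurse into the left and right subsets until a stopping condition is met.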

Prediction Aggregation:

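For classification tasks, the final prediction is determined by majority voting: the class that receives the most votes from the individual trees is chosen as the predicted class. (scikit-learn's RandomForestClassifier instead averages the trees' predicted class probabilities and picks the class with the highest mean.)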
For regression tasks, the final prediction is the average (or median) prediction of all individual trees.
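The two aggregation rules can be sketched as follows. The per-tree predictions below are made-up values for illustration, and the helper names are hypothetical.

```python
from collections import Counter
from statistics import mean


def majority_vote(tree_predictions):
    """Return the class predicted by the most trees (ties go to the class seen first)."""
    return Counter(tree_predictions).most_common(1)[0][0]


def average_prediction(tree_predictions):
    """Return the mean of the trees' numeric predictions."""
    return mean(tree_predictions)


# Hypothetical outputs of a five-tree classifier and a three-tree regressor.
class_votes = [1, 0, 1, 1, 0]
regression_outputs = [2.0, 3.0, 4.0]

forest_class = majority_vote(class_votes)
forest_value = average_prediction(regression_outputs)

print('classification result :', forest_class)
print('regression result     :', forest_value)
```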

In [None]:


# importing required libraries
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


In [2]:
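# load the dataset from a local CSV file
path = 'tested.csv'
data = pd.read_csv(path)

# show the number of missing values per column
print(data.isnull().sum())

# drop the 'Cabin' column, then remove any rows that still contain missing values
data.drop('Cabin', axis=1, inplace=True)
data.dropna(inplace=True)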


In [3]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data['Sex'] = le.fit_transform(data['Sex'])
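
It helps to know which integer each category receives. LabelEncoder assigns codes in sorted order of the distinct labels, so the plain-Python sketch below mirrors that mapping. The helper fit_label_mapping and the sample values are assumptions for illustration and are not part of scikit-learn.

```python
def fit_label_mapping(values):
    """Map each distinct label to an integer code, in sorted label order."""
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}


sex_column = ['male', 'female', 'female', 'male']   # hypothetical sample values

mapping = fit_label_mapping(sex_column)
encoded = [mapping[value] for value in sex_column]

print(mapping)
print(encoded)
```

In the notebook itself, the fitted encoder exposes the same ordering through its classes_ attribute.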

In [4]:
X = data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]
y = data['Survived']

In [5]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=65)

In [6]:
model = RandomForestClassifier()

# fit the model with the training data
model.fit(x_train,y_train)

# number of trees used
print('Number of Trees used : ', model.n_estimators)

# predict the target on the train dataset
predict_train = model.predict(x_train)
print('\nTarget on train data',predict_train)

Number of Trees used :  100

Target on train data [0 0 1 1 1 0 1 0 0 1 1 0 0 0 0 1 0 0 1 1 1 0 1 0 1 0 0 0 0 1 0 0 0 1 0 1 0
 1 0 0 1 0 0 0 0 0 0 1 0 1 0 1 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0
 1 0 1 1 1 0 1 0 1 0 1 0 0 1 1 0 0 1 0 1 1 1 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0
 0 0 0 1 1 0 1 1 0 0 1 0 1 0 1 0 0 1 0 0 0 1 0 0 0 0 1 1 1 0 1 1 0 1 1 0 1
 0 0 0 0 0 1 1 0 1 1 0 0 0 1 1 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 1 0 1 0
 1 1 1 0 0 0 0 1 1 1 1 1 0 0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 0 0
 0 0 0 0 1 0 1 0 0 1 0 0 1 0 1 1 0 1 0 0 0 1 0 0 0 1 1 0 1 1 0 1 1 1 0 0 0
 1 1 1 0 1]


In [8]:
# Accuracy Score on train dataset
accuracy_train = accuracy_score(y_train,predict_train)
print('\naccuracy_score on train dataset : ', accuracy_train)

# predict the target on the test dataset
predict_test = model.predict(x_test)
print('\nTarget on test data',predict_test)

# Accuracy Score on test dataset
accuracy_test = accuracy_score(y_test,predict_test)
print('\naccuracy_score on test dataset : ', accuracy_test)


accuracy_score on train dataset :  1.0

Target on test data [0 1 1 0 1 1 1 0 0 1 0 0 0 0 1 0 0 1 1 0 1 1 0 0 0 0 1 1 0 0 0 0 1 0 1 1 0
 1 0 1 0 0 1 0 0 0 0 1 0 1 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0]

accuracy_score on test dataset :  1.0
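
A test accuracy of exactly 1.0 is rare on real data and deserves a sanity check before it is trusted. One possible explanation (an assumption, not something the output above confirms) is that a single feature fully determines the target. The sketch below uses made-up rows and a hypothetical helper, single_feature_accuracy, to measure how well each feature alone predicts the label when every feature value is mapped to its most common label.

```python
from collections import Counter, defaultdict


def single_feature_accuracy(feature_values, labels):
    """Accuracy of predicting, for each feature value, its most common label."""
    labels_by_value = defaultdict(list)
    for value, label in zip(feature_values, labels):
        labels_by_value[value].append(label)
    # For each feature value, count the rows that carry its most common label.
    correct = sum(Counter(group).most_common(1)[0][1]
                  for group in labels_by_value.values())
    return correct / len(labels)


# Made-up rows for illustration only; not taken from tested.csv.
rows = {
    'feature_a': [0, 0, 1, 1, 0, 1],
    'feature_b': [3, 1, 3, 2, 2, 1],
}
target = [1, 1, 0, 0, 1, 0]

scores = {name: single_feature_accuracy(values, target)
          for name, values in rows.items()}
for name, score in scores.items():
    print(name, round(score, 2))
```

Applied to the notebook's data, the same check could be run on a low-cardinality column such as data['Sex'] against data['Survived']. A score of 1.0 for one column would suggest the forest is only rediscovering that column. The check is not meaningful for columns with many distinct values, such as Age or Fare, because nearly every value then forms its own group.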
