# 1. Introduction to Random Forest algorithm

Random forest is a supervised learning algorithm. It has two variants: one is used for classification problems and the other for regression problems. It is one of the most flexible and easy-to-use algorithms. It builds decision trees on samples of the given data, gets a prediction from each tree and selects the final answer by means of voting. It also gives a useful indication of feature importance.


The random forest algorithm combines multiple decision-trees, resulting in a forest of trees, hence the name `Random Forest`. In the random forest classifier, a higher number of trees generally gives more stable and more accurate predictions, although the improvement levels off while training and prediction time keep increasing.

# 2. Random Forest algorithm intuition

The intuition behind the random forest algorithm can be divided into two stages.

In the first stage, we randomly select `k` features out of a total of `m` features and build the random forest. We proceed as follows:

1. Randomly select `k` features from a total of `m` features where `k < m`.
2. Among the `k` features, calculate the node `d` using the best split point.
3. Split the node into daughter nodes using the best split.
4. Repeat steps 1 to 3 until `l` nodes have been reached.
5. Build the forest by repeating steps 1 to 4 `n` times to create `n` trees.
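
Steps 1 and 2 above can be sketched in plain Python. This is a minimal illustration, not a full tree builder: it assumes Gini impurity as the split criterion (the steps above do not name one), uses a made-up toy dataset, and the helper names `gini` and `best_split` are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels, feature_indices):
    """Search only the candidate features and return
    (feature, threshold, weighted_impurity) for the best split found."""
    best = None
    n = len(rows)
    for f in feature_indices:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must send rows to both daughter nodes
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, threshold, score)
    return best

# Toy data (assumed): m = 3 features; feature 1 separates the classes perfectly.
rows = [[0, 0, 1], [1, 0, 0], [0, 1, 1], [1, 1, 0]]
labels = ["a", "a", "b", "b"]

rng = random.Random(0)
m, k = 3, 2
candidates = rng.sample(range(m), k)           # step 1: pick k of the m features
split = best_split(rows, labels, candidates)   # step 2: best split among those k
print(candidates, split)
```

Because each node only sees `k` randomly chosen features, different trees end up splitting on different features, which is what makes the trees in the forest differ from one another.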

In the second stage, we make predictions using the trained random forest algorithm. 

1. We take the test features, use the rules of each randomly created decision tree to predict the outcome, and store each predicted outcome.
2. Then, we count the votes for each predicted target.
3. Finally, we take the predicted target with the most votes as the final prediction of the random forest algorithm.
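
The voting in the second stage can be sketched as follows. The class names and the list of per-tree predictions are made up for illustration, and `majority_vote` is a hypothetical helper name.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class that the largest number of trees predicted."""
    votes = Counter(tree_predictions)
    return votes.most_common(1)[0][0]

# Assumed predictions from five trees for one test row
tree_predictions = ["acc", "unacc", "acc", "good", "acc"]
final_prediction = majority_vote(tree_predictions)
print(final_prediction)  # acc
```

In this sketch a tie is resolved in favour of the class that was seen first; a real implementation may break ties differently.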

### Random Forest algorithm intuition

![Random Forest](https://i.ytimg.com/vi/goPiwckWE9M/maxresdefault.jpg)

# 3. Advantages and disadvantages of Random Forest algorithm

The advantages of the random forest algorithm are as follows:


1. The random forest algorithm can be used to solve both classification and regression problems.
2. It is considered a very accurate and robust model because it uses a large number of decision-trees to make predictions.
3. Random forests combine the predictions made by all the decision-trees, so the errors of individual trees partly cancel out and the variance of the model is reduced. As a result, it is much less prone to overfitting than a single deep decision-tree.
4. The random forest classifier can handle missing values in some implementations. There are two ways to handle them: the first is to replace missing values of continuous variables with the median, and the second is to compute the proximity-weighted average of the missing values.
5. The random forest classifier can be used for feature selection. It means selecting the most important features out of the available features in the training dataset.
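
The first approach in point 4, median replacement for a continuous variable, can be sketched in plain Python. The column values are made up, `None` stands in for a missing value, and `impute_median` is a hypothetical helper name.

```python
from statistics import median

def impute_median(values):
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

# Assumed continuous column with two missing entries
column = [4.0, None, 10.0, 6.0, None]
print(impute_median(column))  # [4.0, 6.0, 10.0, 6.0, 6.0]
```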


The disadvantages of the random forest algorithm are listed below:


1. The biggest disadvantage of random forests is their computational cost. Random forests are slower at making predictions than a single decision-tree because a large number of decision-trees are used. Every tree in the forest has to make a prediction for the same input, and the results are then combined by voting. So, it is a time-consuming process.
2. The model is difficult to interpret compared to a single decision-tree, where we can easily trace how a prediction is made by following the path from the root to a leaf.

# 4. Feature selection with Random Forests

The random forest algorithm can be used for feature selection. It can rank the importance of variables in a regression or classification problem.

We measure variable importance in a dataset by fitting the random forest algorithm to the data. During the fitting process, the out-of-bag error for each data point is recorded and averaged over the forest.

The importance of the j-th feature is measured after training. The values of the j-th feature are permuted among the training data and the out-of-bag error is computed again on this perturbed dataset. The importance score for the j-th feature is computed by averaging the difference in out-of-bag error before and after the permutation over all trees. The score is normalized by the standard deviation of these differences.

Features which produce large values for this score are ranked as more important than features which produce small values. Based on this score, we will choose the most important features and drop the least important ones for model building. 
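
The permutation idea can be sketched in plain Python under simplifying assumptions: a fixed stand-in `predict` function replaces the trained forest, the error is measured on a small made-up dataset rather than on out-of-bag samples, and the normalization by the standard deviation is left out. The names `error_rate` and `permutation_importance` are hypothetical.

```python
import random

def error_rate(predict, rows, labels):
    """Fraction of rows that the model classifies incorrectly."""
    wrong = sum(predict(row) != y for row, y in zip(rows, labels))
    return wrong / len(rows)

def permutation_importance(predict, rows, labels, j, n_repeats=20, seed=0):
    """Average increase in error after shuffling the values of feature j."""
    rng = random.Random(seed)
    baseline = error_rate(predict, rows, labels)
    increases = []
    for _ in range(n_repeats):
        column = [row[j] for row in rows]
        rng.shuffle(column)  # break the link between feature j and the label
        permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
        increases.append(error_rate(predict, permuted, labels) - baseline)
    return sum(increases) / n_repeats

# Stand-in model (assumed): it looks only at feature 0 and ignores feature 1.
def predict(row):
    return 1 if row[0] > 0.5 else 0

rows = [[0.1, 5.0], [0.9, 3.0], [0.2, 8.0], [0.8, 1.0], [0.3, 2.0], [0.7, 9.0]]
labels = [0, 1, 0, 1, 0, 1]

importance_0 = permutation_importance(predict, rows, labels, j=0)
importance_1 = permutation_importance(predict, rows, labels, j=1)
print(importance_0, importance_1)
```

Shuffling feature 1 never changes a prediction here, so its score is zero, while shuffling feature 0 raises the error. That gap is what the ranking is based on.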


# 5. Difference between Random Forests and Decision Trees

I will compare random forests with decision-trees. Some salient features of comparison are as follows:-

1. A random forest is a set of multiple decision-trees.

2. Decision-trees are computationally faster than random forests.

3. Deep decision-trees may suffer from overfitting. Random forest reduces overfitting by building each tree on a random subset of the data and of the features.

4. A random forest is difficult to interpret, whereas a decision-tree is easily interpretable and can be converted to rules.

In [0]:

from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import col, row_number, lit
from pyspark.sql.window import Window

import mlflow
import mlflow.spark
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, OneHotEncoder
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

import warnings
warnings.filterwarnings('ignore')

In [0]:
# Initialize SparkSession
spark = SparkSession.builder.appName('RandomForest').getOrCreate()

# File path in DBFS
file_path = "dbfs:/FileStore/MLFlowRandomForestClassifierTutorial/car_evaluation.csv"

# Read the CSV file into a DataFrame
df = spark.read.csv(file_path, header=True, inferSchema=True)

# Define the new column names
col_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']

# Rename columns using toDF
df = df.toDF(*col_names)

# Index every column (features and target) with StringIndexer,
# which maps each distinct string value to a numeric index
categorical_columns = col_names
indexers = [StringIndexer(inputCol=column, outputCol=column + "_indexed") for column in categorical_columns]
pipeline = Pipeline(stages=indexers)
df = pipeline.fit(df).transform(df)

# Keep only the indexed columns
df = df.select(*[column + "_indexed" for column in col_names])

# Restore the original column names and rename the target column to "label"
df = df.toDF(*col_names)
df = df.withColumnRenamed("class", "label")

# Show the DataFrame with new column names
df.show()
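
The numeric values shown by `df.show()` come from `StringIndexer`, which by default orders the distinct values by descending frequency, so the most frequent value gets index `0.0`. A minimal plain-Python sketch of that mapping is given below. The sample values are made up, `fit_string_index` is a hypothetical helper name, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct value to an index, most frequent value first."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Assumed sample of a categorical column
safety = ["low", "med", "high", "low", "low", "med"]
mapping = fit_string_index(safety)
indexed = [mapping[v] for v in safety]
print(mapping)  # {'low': 0.0, 'med': 1.0, 'high': 2.0}
print(indexed)  # [0.0, 1.0, 2.0, 0.0, 0.0, 1.0]
```

Note that the indices reflect how often each value occurs, not any natural ordering of the categories.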

# 6. Random Forest Model

## 6.1 Prepare Data

In [0]:
# Define the feature columns and assemble them into a single feature vector
feature_columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety']
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")

# Random forest classifier; numTrees and maxDepth are tuned in section 6.3
rf = RandomForestClassifier(labelCol="label", featuresCol="features")

# Split the data into training and test sets
train_data, test_data = df.randomSplit([0.7, 0.3], seed=42)
train_data.show()

+------+-----+-----+-------+--------+------+-----+
|buying|maint|doors|persons|lug_boot|safety|label|
+------+-----+-----+-------+--------+------+-----+
|   0.0|  0.0|  0.0|    0.0|     0.0|   0.0|  1.0|
|   0.0|  0.0|  0.0|    0.0|     0.0|   1.0|  1.0|
|   0.0|  0.0|  0.0|    0.0|     1.0|   0.0|  1.0|
|   0.0|  0.0|  0.0|    0.0|     1.0|   1.0|  0.0|
|   0.0|  0.0|  0.0|    0.0|     1.0|   2.0|  0.0|
|   0.0|  0.0|  0.0|    0.0|     2.0|   1.0|  0.0|
|   0.0|  0.0|  0.0|    1.0|     0.0|   1.0|  1.0|
|   0.0|  0.0|  0.0|    1.0|     0.0|   2.0|  0.0|
|   0.0|  0.0|  0.0|    1.0|     1.0|   0.0|  1.0|
|   0.0|  0.0|  0.0|    1.0|     2.0|   1.0|  0.0|
|   0.0|  0.0|  0.0|    1.0|     2.0|   2.0|  0.0|
|   0.0|  0.0|  0.0|    2.0|     0.0|   0.0|  0.0|
|   0.0|  0.0|  0.0|    2.0|     0.0|   2.0|  0.0|
|   0.0|  0.0|  0.0|    2.0|     1.0|   1.0|  0.0|
|   0.0|  0.0|  0.0|    2.0|     2.0|   1.0|  0.0|
|   0.0|  0.0|  0.0|    2.0|     2.0|   2.0|  0.0|
|   0.0|  0.0|  1.0|    0.0|   
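
A weighted random split such as `randomSplit([0.7, 0.3], seed=42)` assigns rows to the splits at random, so the resulting sizes are only approximately 70% and 30%. The sketch below illustrates the idea in plain Python by drawing one random number per row. It is an assumption-laden illustration, not Spark's actual implementation, and `random_split` is a hypothetical helper name.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split with probability proportional to its weight."""
    total = sum(weights)
    bounds, cumulative = [], 0.0
    for w in weights:
        cumulative += w / total
        bounds.append(cumulative)  # upper bound of each split's interval
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # the last split also catches any floating-point remainder
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train, test = random_split(list(range(1000)), [0.7, 0.3], seed=42)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which matters when comparing models across runs.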

## 6.2 Create Pipeline

In [0]:
# Chain feature assembly and the classifier into a single pipeline
pipeline = Pipeline(stages=[assembler, rf])

## 6.3 Hyperparameter Tuning and Model Selection

In [0]:
# Define the hyperparameter grid
paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [10, 20, 30]) \
    .addGrid(rf.maxDepth, [5, 10, 15]) \
    .build()

# Create the cross-validator
cross_validator = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(labelCol="label", metricName="accuracy"),
                          numFolds=5, seed=42)

# Run 5-fold cross-validation over the grid and keep the best-scoring model
cv_model = cross_validator.fit(train_data)
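
To get a feel for the cost of this search, the grid can be enumerated in plain Python. The sketch below only counts model fits and does not train anything; the dictionary keys mirror the parameter names used above.

```python
from itertools import product

num_trees_options = [10, 20, 30]
max_depth_options = [5, 10, 15]
num_folds = 5

# Every combination of the two hyperparameters
param_grid = [
    {"numTrees": n, "maxDepth": d}
    for n, d in product(num_trees_options, max_depth_options)
]

# Each combination is trained once per fold
fits_during_search = len(param_grid) * num_folds
print(len(param_grid), fits_during_search)  # 9 45
```

Spark's `CrossValidator` then refits the best combination on the full training set, which adds one more fit. Widening the grid or adding folds multiplies the training time accordingly.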

## 6.4 Evaluating the model

In [0]:
# Make predictions on the test data
predictions = cv_model.transform(test_data)

evaluator = MulticlassClassificationEvaluator(labelCol="label", metricName="accuracy")

# Evaluate the model
accuracy = evaluator.evaluate(predictions)
print("Test set accuracy = {:.2f}".format(accuracy))

Test set accuracy = 0.97
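
The `accuracy` metric is the fraction of test rows whose predicted label equals the true label. A minimal plain-Python sketch is given below; the labels and predictions are made-up values, not the output of the model above.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

# Assumed true labels and predictions for ten test rows
labels      = [0.0, 1.0, 0.0, 2.0, 0.0, 1.0, 0.0, 3.0, 0.0, 1.0]
predictions = [0.0, 1.0, 0.0, 2.0, 0.0, 0.0, 0.0, 3.0, 0.0, 1.0]
print("accuracy = {:.2f}".format(accuracy(labels, predictions)))  # accuracy = 0.90
```

Accuracy alone can hide poor performance on rare classes, so for an imbalanced target it may be worth also checking a per-class metric such as F1.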
