## Setting Up PySpark Environment

This code cell prepares the Python side of the PySpark environment.

- **Importing Libraries**: `pyspark`, `os`, and `sys` are imported, along with the `SparkContext` and `SparkSession` classes.
- **Setting Python Executable**: `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` are both set to `sys.executable`, so the worker processes and the driver use the same interpreter that runs this notebook.

No `SparkContext` or `SparkSession` is created in this cell; the session is built in the next one. Pointing both variables at one interpreter avoids Python version mismatches between the driver and the workers.


In [None]:
import pyspark
import os
import sys
from pyspark import SparkContext
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

from pyspark.sql import SparkSession

## Configuring Spark Session

This code cell builds a SparkSession named 'chapter_4' with 16 GB of driver memory.

- **Configuring Driver Memory**: `.config("spark.driver.memory", "16g")` requests 16 gigabytes of memory for the driver.
- **Creating SparkSession**: `.appName('chapter_4').getOrCreate()` names the application and returns a SparkSession, creating one if none is running yet.

Driver memory is read when the driver JVM starts, so the setting only takes effect if this call is the one that starts Spark; if a session already exists, `getOrCreate()` returns it and the driver memory stays as it was. When Spark runs in local mode, as is common in a notebook, the driver process also does the executors' work, so this value bounds the memory available to the whole job. Lower it if the machine has less RAM.


In [None]:
spark = SparkSession.builder.config("spark.driver.memory", "16g").appName('chapter_4').getOrCreate()

0.0.1 Preparing the Data

## Reading CSV Data Without Header

This code cell reads the CSV file "data/covtype.data" into a DataFrame, inferring each column's type from the data and treating the first row as data rather than as a header.

- **CSV File Path**: The CSV data is read from the file path "data/covtype.data".
- **Parsing Options**:
  - `inferSchema=True`: Spark makes an extra pass over the data to infer each column's type instead of reading every column as a string.
  - `header=False`: The first row is treated as data, so the columns receive the default names `_c0`, `_c1`, and so on.
- **DataFrame Creation**: The resulting DataFrame has inferred column types and default column names, which `printSchema()` displays.

Because the file has no header row, meaningful column names have to be assigned in a separate step.


In [None]:
data_without_header = spark.read.option("inferSchema", True).option("header", False).csv("data/covtype.data")
data_without_header.printSchema()
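
The effect of the two parsing options can be illustrated without Spark. The following is a minimal sketch in plain Python, not Spark's implementation: the helper `infer_type`, the simplified type names, and the two sample rows are all made up for illustration. Only the `_c0`, `_c1`, ... naming mirrors what Spark does for a headerless file.

```python
import csv
import io

def infer_type(values):
    """Return the narrowest of int, double, string that parses every value (hypothetical helper)."""
    for caster, name in ((int, "int"), (float, "double")):
        try:
            for value in values:
                caster(value)
            return name
        except ValueError:
            continue
    return "string"

# Two made-up rows standing in for a headerless CSV file.
sample = io.StringIO("2596,51.5,a\n2590,56.0,b\n")
rows = list(csv.reader(sample))

# Transpose rows into columns, then assign default names _c0, _c1, ...
columns = list(zip(*rows))
schema = {f"_c{i}": infer_type(column) for i, column in enumerate(columns)}
print(schema)
```

Every row has to be inspected before a column's type is known, which is why schema inference costs an additional pass over the file.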

## Specifying Column Names and Data Types

This code cell assigns column names to the DataFrame `data_without_header` and converts the "Cover_Type" column to DoubleType.

- **Column Names**: The list `colnames` holds 55 names: 10 numeric features, 4 `Wilderness_Area_*` indicator columns, 40 `Soil_Type_*` indicator columns, and the target "Cover_Type".
- **Data Types**: The "Cover_Type" column is cast to DoubleType using `.cast(DoubleType())`.
- **DataFrame Transformation**: `toDF(*colnames)` renames the columns by position, and `withColumn` replaces "Cover_Type" with its cast version. The result is stored in `data`.
- **Head**: `.head()` returns the first row of the transformed DataFrame.

After this step the DataFrame `data` has meaningful column names and a target column of the numeric type used by the later modeling steps.


In [None]:
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import col

colnames = ["Elevation", "Aspect", "Slope", "Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology", "Horizontal_Distance_To_Roadways", "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm", "Horizontal_Distance_To_Fire_Points"] + [f"Wilderness_Area_{i}" for i in range(4)] + [f"Soil_Type_{i}" for i in range(40)] + ["Cover_Type"]

data = data_without_header.toDF(*colnames).withColumn("Cover_Type", col("Cover_Type").cast(DoubleType()))

data.head()
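
Since `toDF` assigns names by position, the list must contain exactly one name per column. A quick sanity check of the list, repeated here in plain Python so it runs on its own, might look like this (the variable names `numeric_cols`, `wilderness_cols`, and `soil_cols` are introduced only for this sketch):

```python
numeric_cols = [
    "Elevation", "Aspect", "Slope",
    "Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology",
    "Horizontal_Distance_To_Roadways",
    "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm",
    "Horizontal_Distance_To_Fire_Points",
]
wilderness_cols = [f"Wilderness_Area_{i}" for i in range(4)]
soil_cols = [f"Soil_Type_{i}" for i in range(40)]

colnames = numeric_cols + wilderness_cols + soil_cols + ["Cover_Type"]

# 10 numeric + 4 wilderness + 40 soil + 1 target = 55 names, all distinct.
print(len(colnames), len(set(colnames)))
```

If the count did not match the number of columns in the DataFrame, `toDF` would raise an error instead of silently misaligning names.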

0.0.2 Our First Decision Tree

## Splitting Data into Training and Testing Sets

This code cell splits the DataFrame `data` into training and testing sets.

- **Random Splitting**: `data.randomSplit([0.9, 0.1])` randomly assigns rows to two sets with weights 0.9 and 0.1. The resulting sizes are approximately, not exactly, 90% and 10% of the data, and because no seed is passed the split differs between runs.
- **Caching**: `train_data` and `test_data` are both marked for caching with `.cache()`. Caching is lazy, so each DataFrame is materialized the first time an action runs on it.

Holding out a test set makes it possible to estimate how the model performs on data it has not seen, which is how overfitting is detected.


In [None]:
(train_data, test_data) = data.randomSplit([0.9, 0.1])
train_data.cache()
test_data.cache()
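
Why the split sizes are only approximate can be seen in a small simulation. This is a conceptual sketch in plain Python, assuming each row is assigned independently by a random draw; the function `random_split` is hypothetical and is not how Spark implements `randomSplit` internally.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one part with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)

    # Cumulative boundaries in [0, 1], e.g. [0.9, 1.0] for weights [0.9, 0.1].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    parts = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            if draw < bound:
                parts[i].append(row)
                break
        else:
            # Guard against floating-point rounding in the last boundary.
            parts[-1].append(row)
    return parts

train, test = random_split(range(1000), [0.9, 0.1], seed=42)
print(len(train), len(test))
```

Each row lands in exactly one part, the part sizes add up to the input size, and passing the same seed reproduces the same split.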

## Assembling Feature Vectors

This code cell assembles feature vectors using the `VectorAssembler` from PySpark.

- **Input Columns**: The variable `input_cols` contains all column names except the last one, which is the target variable.
- **VectorAssembler**: A `VectorAssembler` named `vector_assembler` is created with input columns specified by `inputCols` and output column specified by `outputCol`.
- **Transforming Data**: `vector_assembler.transform(train_data)` applies the `VectorAssembler` to the training data, creating a new column named "featureVector" containing the assembled feature vectors.
- **Displaying Feature Vectors**: `assembled_train_data.select("featureVector").show(truncate=False)` displays the assembled feature vectors.

Assembling feature vectors is a common preprocessing step in machine learning pipelines, combining multiple features into a single vector format suitable for model training.


In [None]:
from pyspark.ml.feature import VectorAssembler

input_cols = colnames[:-1]
vector_assembler = VectorAssembler(inputCols=input_cols, outputCol="featureVector")

assembled_train_data = vector_assembler.transform(train_data)

assembled_train_data.select("featureVector").show(truncate = False)
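
The displayed vectors are mostly sparse, printed as `(size,[indices],[values])`, because most of the one-hot columns are zero. The following sketch shows the idea in plain Python; `assemble`, `to_sparse`, and the sample row are made up for illustration and only imitate what the assembler produces.

```python
def assemble(row, input_cols):
    """Collect the chosen columns of a row into one list of floats."""
    return [float(row[name]) for name in input_cols]

def to_sparse(vector):
    """Represent a dense vector as (size, indices of non-zeros, non-zero values)."""
    indices = [i for i, value in enumerate(vector) if value != 0.0]
    values = [vector[i] for i in indices]
    return (len(vector), indices, values)

# A made-up row with a handful of the real column names.
row = {
    "Elevation": 2596, "Slope": 3,
    "Wilderness_Area_0": 1, "Wilderness_Area_1": 0,
    "Soil_Type_0": 0, "Cover_Type": 5,
}
input_cols = [name for name in row if name != "Cover_Type"]

dense = assemble(row, input_cols)
sparse = to_sparse(dense)
print(dense)
print(sparse)
```

The target column is left out of the vector, matching `colnames[:-1]` in the cell above.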

## Training Decision Tree Classifier

This code cell trains a decision tree classifier using the `DecisionTreeClassifier` from PySpark MLlib.

- **Creating Classifier**: A `DecisionTreeClassifier` named `classifier` is created with parameters such as seed, label column, features column, and prediction column.
- **Model Training**: `classifier.fit(assembled_train_data)` trains the decision tree classifier on the assembled training data.
- **Printing Model Debug String**: `print(model.toDebugString)` prints the text representation of the trained tree. `toDebugString` is a property, so it is accessed without parentheses.

The debug string provides a human-readable representation of the decision tree model, showing the decision rules and splits made by the tree at each node.


In [None]:
from pyspark.ml.classification import DecisionTreeClassifier

classifier = DecisionTreeClassifier(seed = 1234, labelCol="Cover_Type", featuresCol="featureVector", predictionCol="prediction")

model = classifier.fit(assembled_train_data)
print(model.toDebugString)
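
The splits in the debug string are chosen to reduce an impurity measure; `DecisionTreeClassifier` uses Gini impurity unless another one is set. As a minimal sketch, the two measures that appear later in the tuning grid can be computed in plain Python as follows. The functions `gini` and `entropy` and the label lists are written for this illustration only.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Entropy in bits: minus the sum of p * log2(p) over class proportions."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())

pure = [1, 1, 1, 1]    # a node containing a single class
mixed = [1, 1, 2, 2]   # a node split evenly between two classes

print(gini(pure), gini(mixed))
print(entropy(mixed))
```

Both measures are zero for a node that contains a single class and grow as the classes become more evenly mixed, so a good split is one whose child nodes have lower impurity than the parent.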

## Extracting Feature Importances

This code cell extracts feature importances from the trained decision tree model and presents them in a Pandas DataFrame.

- **Importing Libraries**: The `pandas` library is imported as `pd`.
- **Converting Feature Importances**: `model.featureImportances.toArray()` converts the feature importances from the trained model to a NumPy array.
- **Creating DataFrame**: `pd.DataFrame(...)` creates a DataFrame with feature importances, using input column names as index and "importance" as the column name.
- **Sorting by Importance**: `.sort_values(by="importance", ascending=False)` sorts the DataFrame by importance values in descending order.

This DataFrame provides insights into the importance of each feature in predicting the target variable, helping to understand the significance of different features in the decision-making process of the model.


In [None]:
import pandas as pd

pd.DataFrame(model.featureImportances.toArray(), index=input_cols, columns=['importance']).sort_values(by="importance", ascending=False)

## Making Predictions

This code cell makes predictions using the trained decision tree model on the assembled training data.

- **Model Transformation**: `model.transform(assembled_train_data)` applies the trained decision tree model to the assembled training data, generating predictions.
- **Selecting Columns**: `.select("Cover_Type", "prediction", "probability")` selects the columns for target variable, predicted label, and probability distribution of each class.
- **Displaying Predictions**: `.show(10, truncate=False)` displays the first 10 rows of the DataFrame containing predictions.

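This output shows the actual target value, the predicted label, and the class probability distribution for the first 10 rows. Because these predictions are made on the same data the model was trained on, they illustrate the structure of the output rather than how well the model generalizes.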


In [None]:
predictions = model.transform(assembled_train_data)
predictions.select("Cover_Type", "prediction", "probability").show(10, truncate = False)
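
The "prediction" column is the index of the largest entry in the "probability" vector. A minimal sketch in plain Python, using a made-up probability vector: it has eight entries because positions correspond to label values 0 through 7, and position 0 is unused since the cover types are numbered from 1.

```python
# Made-up class probabilities; index i is the probability of label i.
probability = [0.0, 0.63, 0.33, 0.0, 0.0, 0.0, 0.0, 0.04]

# The predicted label is the index with the highest probability, as a float.
prediction = float(max(range(len(probability)), key=probability.__getitem__))
print(prediction)
```

For a decision tree, the probabilities reflect the class proportions of the training rows that fell into the same leaf.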

## Evaluating Model Performance

This code cell evaluates the trained decision tree model using the `MulticlassClassificationEvaluator` from PySpark MLlib.

- **Creating Evaluator**: A `MulticlassClassificationEvaluator` named `evaluator` is created with parameters specifying the label column and prediction column.
- **Accuracy Evaluation**: `.setMetricName("accuracy").evaluate(predictions)` returns the fraction of rows whose predicted label equals the actual label.
- **F1 Score Evaluation**: `.setMetricName("f1").evaluate(predictions)` returns the F1 score averaged over the classes, weighted by class frequency.

Both metrics are computed here on predictions for the training data, so they are likely to overstate how well the model performs on unseen data.


In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol="Cover_Type", predictionCol="prediction")

# Wrap each call in print(): a notebook cell only displays its last expression
print(evaluator.setMetricName("accuracy").evaluate(predictions))
print(evaluator.setMetricName("f1").evaluate(predictions))
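
To make the two metrics concrete, here is a sketch of both in plain Python. The functions `accuracy` and `weighted_f1` and the label lists are written for this illustration; the weighted F1 follows the usual definition (per-class F1 averaged with class frequencies as weights) and is not copied from Spark's source.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of positions where the prediction matches the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

def weighted_f1(actual, predicted):
    """Per-class F1, averaged with each class weighted by its share of the data."""
    support = Counter(actual)
    total = 0.0
    for label, n in support.items():
        tp = sum(a == label and p == label for a, p in zip(actual, predicted))
        fp = sum(a != label and p == label for a, p in zip(actual, predicted))
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        total += f1 * n / len(actual)
    return total

# Made-up labels: one row of class 2 is misclassified as class 1.
actual = [1, 1, 1, 1, 2, 2]
predicted = [1, 1, 1, 1, 1, 2]

print(accuracy(actual, predicted))
print(weighted_f1(actual, predicted))
```

With imbalanced classes the two numbers can differ noticeably, because F1 penalizes a class whose rows are often missed even when overall accuracy is high.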

## Computing Confusion Matrix

This code cell computes the confusion matrix based on the predictions made by the trained model.

- **Grouping and Pivot**: `predictions.groupBy("Cover_Type").pivot("prediction", range(1,8)).count()` groups the data by the actual cover type and pivots the predicted cover types, counting the occurrences for each combination.
- **Handling Missing Values**: `.na.fill(0.0)` fills missing values with zeros.
- **Ordering by Cover Type**: `.orderBy("Cover_Type")` orders the confusion matrix by the actual cover type.

In the resulting confusion matrix each row is an actual cover type and each column is a predicted cover type, so correct predictions lie on the diagonal and every off-diagonal count is a specific kind of mistake.


In [None]:
confusion_matrix = predictions.groupBy("Cover_Type").pivot("prediction", range(1,8)).count().na.fill(0.0).orderBy("Cover_Type")

confusion_matrix.show()
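
The same group, pivot, and fill logic can be sketched in plain Python on a handful of made-up labels. The function `confusion_matrix` below is written for this illustration; a `Counter` returns 0 for pairs that never occur, which plays the role of `.na.fill(0.0)`.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Rows are actual labels, columns are predicted labels."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

# Made-up labels for three classes.
actual = [1, 1, 2, 2, 3, 3]
predicted = [1, 2, 2, 2, 3, 1]

matrix = confusion_matrix(actual, predicted, labels=[1, 2, 3])
for label, row in zip([1, 2, 3], matrix):
    print(label, row)
```

Summing the diagonal and dividing by the total number of rows gives the accuracy.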

## Computing Class Probabilities

This code defines a function `class_probabilities` to compute class probabilities from a given DataFrame.

- **Function Definition**: The function takes a DataFrame `data` as input and computes the proportion of each class in the data.
- **Total Count**: The total count of data points is computed using `data.count()`.
- **Grouping and Counting**: `data.groupBy("Cover_Type").count()` groups the data by cover type and counts the occurrences of each cover type.
- **Calculating Proportions**: The count for each cover type is divided by the total count to calculate the proportion.
- **Collecting Results**: The proportions are ordered by cover type and collected to the driver as a list of `Row` objects.

These proportions are the prior probabilities of the classes: how likely each cover type is before any feature is considered. They show how imbalanced the classes are, which matters when judging a model's accuracy.


In [None]:
from pyspark.sql import DataFrame

def class_probabilities(data):
  total = data.count()
  # Count rows per class, order by class label, and convert counts to proportions
  return (data.groupBy("Cover_Type").count()
    .orderBy("Cover_Type")
    .select(col("count").cast(DoubleType()))
    .withColumn("count_proportion", col("count") / total)
    .select("count_proportion")
    .collect())

train_prior_probabilities = class_probabilities(train_data)
test_prior_probabilities = class_probabilities(test_data)
train_prior_probabilities

## Computing Weighted Average

This code cell estimates the accuracy of a random-guess baseline from the class proportions of the training and test sets.

- **Extracting Values**: `[p[0] for p in train_prior_probabilities]` and `[p[0] for p in test_prior_probabilities]` unwrap the proportion from each collected `Row`.
- **Baseline Calculation**: `sum([train_p * cv_p for train_p, cv_p in zip(train_prior_probabilities,test_prior_probabilities)])` multiplies, for each class, its proportion in the training set by its proportion in the test set and sums the products. This relies on both lists containing the same classes in the same order.

The result is the expected accuracy of a classifier that ignores the features and guesses each class at random with the frequency observed in the training set. It serves as a floor against which the decision tree's accuracy can be compared.


In [None]:
train_prior_probabilities = [p[0] for p in train_prior_probabilities]
test_prior_probabilities = [p[0] for p in test_prior_probabilities]

sum([train_p * cv_p for train_p, cv_p in zip(train_prior_probabilities,test_prior_probabilities)])
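
A small worked example shows why the sum of products is an accuracy. The proportions below are made up for three classes; a guesser that picks class i with probability `train_priors[i]` is correct on a test row of class i with that same probability, and such rows make up `test_priors[i]` of the test set.

```python
# Made-up class proportions for three classes (each list sums to 1).
train_priors = [0.5, 0.3, 0.2]
test_priors = [0.6, 0.3, 0.1]

# P(correct) = sum over classes of P(guess class i) * P(row is class i).
baseline_accuracy = sum(
    train_p * test_p for train_p, test_p in zip(train_priors, test_priors)
)
print(round(baseline_accuracy, 4))
```

Here the products are 0.30, 0.09, and 0.02, so random guessing would be right about 41% of the time on this made-up distribution.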

0.0.3 Tuning Decision Trees

## Creating ML Pipeline

This code cell creates a machine learning pipeline using the `Pipeline` class from PySpark MLlib.

- **Feature Assembler**: A `VectorAssembler` named `assembler` is created to assemble feature vectors from input columns.
- **Decision Tree Classifier**: A `DecisionTreeClassifier` named `classifier` is created with specified parameters.
- **Pipeline Stages**: Both `assembler` and `classifier` are included as stages in the pipeline.
- **Pipeline Creation**: `Pipeline(stages=[assembler, classifier])` creates a pipeline with the specified stages.

Machine learning pipelines are useful for chaining together multiple stages of data processing and model training, providing a unified interface for model development and deployment.


In [None]:
from pyspark.ml import Pipeline

assembler = VectorAssembler(inputCols=input_cols, outputCol="featureVector")
classifier = DecisionTreeClassifier(seed=1234, labelCol="Cover_Type",featuresCol="featureVector",predictionCol="prediction")

pipeline = Pipeline(stages=[assembler, classifier])

## Building Parameter Grid for Model Tuning

This code cell builds a parameter grid for tuning the decision tree classifier using the `ParamGridBuilder` from PySpark MLlib.

- **Parameter Grid Building**: `ParamGridBuilder()` initializes a parameter grid builder.
- **Adding Grids**: `.addGrid(...)` registers two candidate values each for the classifier's impurity, max depth, max bins, and min info gain.
- **Building Parameter Grid**: `.build()` returns the list of all 2 × 2 × 2 × 2 = 16 combinations of these values.
- **Evaluator**: A `MulticlassClassificationEvaluator` named `multiclassEval` is configured to score predictions by accuracy. It is used in the next step to compare the combinations.

A grid explores the hyperparameter space systematically, at the cost of fitting one model per combination.


In [None]:
from pyspark.ml.tuning import ParamGridBuilder

paramGrid = ParamGridBuilder().addGrid(classifier.impurity, ["gini", "entropy"]).addGrid(classifier.maxDepth, [1, 20]).addGrid(classifier.maxBins, [40, 300]).addGrid(classifier.minInfoGain, [0.0, 0.05]).build()

multiclassEval = MulticlassClassificationEvaluator().setLabelCol("Cover_Type").setPredictionCol("prediction").setMetricName("accuracy")
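
The size of the grid follows from taking the Cartesian product of the candidate lists. As a sketch in plain Python, with plain dictionaries standing in for Spark's parameter maps (the names `grid` and `param_maps` are used only here, and the ordering is not claimed to match Spark's):

```python
from itertools import product

# The same candidate values as in the cell above.
grid = {
    "impurity": ["gini", "entropy"],
    "maxDepth": [1, 20],
    "maxBins": [40, 300],
    "minInfoGain": [0.0, 0.05],
}

# One dictionary per combination of values.
param_maps = [dict(zip(grid, combo)) for combo in product(*grid.values())]

print(len(param_maps))
print(param_maps[0])
```

Adding a third candidate to any one hyperparameter would raise the count from 16 to 24, so grids grow quickly as values are added.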

## Training and Validation Split

This code cell performs training and validation split using the `TrainValidationSplit` class from PySpark MLlib.

- **Setting Up Validator**: A `TrainValidationSplit` named `validator` is created with the pipeline as estimator, `multiclassEval` as evaluator, the parameter grid, a seed, and `trainRatio=0.9`.
- **Model Fitting**: `validator.fit(train_data)` splits `train_data` 90:10, fits the pipeline once per parameter combination on the larger part, scores each fitted model on the smaller part, and then refits the best combination on all of `train_data`.

Scoring on a held-out validation part keeps the test set untouched during tuning, so the test set can still give an unbiased estimate of the final model's performance.


In [None]:
from pyspark.ml.tuning import TrainValidationSplit

validator = TrainValidationSplit(seed=1234,estimator=pipeline,evaluator=multiclassEval,estimatorParamMaps=paramGrid,trainRatio=0.9)

validator_model = validator.fit(train_data)

## Extracting Best Model Parameters

This code cell extracts the parameters of the best model obtained from the validation process.

- **Accessing Best Model**: `validator_model.bestModel` retrieves the best model selected during the validation process.
- **Extracting Parameter Map**: `.stages[1].extractParamMap()` extracts the parameter map of the decision tree classifier stage from the best model.

Extracting the parameters of the best model allows for understanding the configuration that yielded the best performance, aiding in model interpretation and further optimization.


In [None]:
from pprint import pprint

best_model = validator_model.bestModel
pprint(best_model.stages[1].extractParamMap())

## Re-training and Extracting Validation Metrics and Parameters

This code cell fits the validator on the training data again and extracts the validation metric for every parameter combination.

- **Re-fitting Validator**: `validator.fit(train_data)` repeats the fit from the earlier cell. This step is not strictly needed, because `validator_model` already holds the results of that fit.
- **Extracting Metrics and Parameters**: `validator_model.validationMetrics` retrieves the validation metrics, and `validator_model.getEstimatorParamMaps()` retrieves the corresponding parameter maps.

The metrics and parameter maps are paired with `zip`, and the pairs are sorted by metric in descending order so that the best-scoring combination comes first.

Comparing the metric across combinations shows how much each hyperparameter choice affects accuracy on the validation split.


In [None]:
validator_model = validator.fit(train_data)

metrics = validator_model.validationMetrics
params = validator_model.getEstimatorParamMaps()
metrics_and_params = list(zip(metrics, params))

metrics_and_params.sort(key=lambda x: x[0], reverse=True)
metrics_and_params

## Printing Highest Metric Value

This code snippet sorts the `metrics` list in descending order and prints the highest value.

- **Sorting Metrics**: `metrics.sort(reverse=True)` sorts the list in place, in descending order. After this call the list no longer lines up with `params`, so it should not be zipped with the parameter maps again.
- **Printing Highest Value**: `print(metrics[0])` prints the highest value from the sorted list.

The printed value is the best accuracy achieved on the validation split, which can be compared with the accuracy on the test set.


In [None]:
metrics.sort(reverse=True)
print(metrics[0])
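
If the pairing between metrics and parameter maps needs to be preserved, the best entry can be found without sorting in place. A minimal sketch in plain Python, with made-up metric values and simplified parameter dictionaries standing in for the real ones:

```python
# Made-up validation metrics and matching (simplified) parameter maps.
metrics = [0.63, 0.91, 0.67, 0.88]
params = [
    {"maxDepth": 1, "impurity": "gini"},
    {"maxDepth": 20, "impurity": "entropy"},
    {"maxDepth": 1, "impurity": "entropy"},
    {"maxDepth": 20, "impurity": "gini"},
]

# max() leaves both lists untouched, so index i of metrics still matches params[i].
best_metric, best_params = max(zip(metrics, params), key=lambda pair: pair[0])

print(best_metric)
print(best_params)
```

The same result is available from `max(metrics)` when only the number is needed.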

## Evaluating Model Performance on Test Data

This code snippet evaluates the performance of the best model obtained from the validation process on the test data.

- **Model Transformation**: `best_model.transform(test_data)` applies the best model to the test data, generating predictions.
- **Evaluation**: `multiclassEval.evaluate(...)` evaluates the performance of the predictions using the specified evaluator (`multiclassEval`).

Evaluating the model on the test data provides an assessment of its generalization ability and helps determine its effectiveness in making predictions on unseen data.


In [None]:
multiclassEval.evaluate(best_model.transform(test_data))

0.0.4 Categorical Features Revisited

## One-Hot Encoding Reverse Transformation

This code defines a function `unencode_one_hot` to reverse the one-hot encoding applied to categorical features.

- **Wilderness Area Columns**: Wilderness area columns are specified as `wilderness_cols`.
- **Wilderness Area Assembler**: A `VectorAssembler` is created to assemble the wilderness area columns into a single vector column named "wilderness".
- **Soil Type Columns**: Soil type columns are specified as `soil_cols`.
- **Soil Type Assembler**: A `VectorAssembler` is created to assemble the soil type columns into a single vector column named "soil".
- **Reverse Encoding UDF**: A user-defined function (UDF) converts the vector to a list and returns the position of the entry equal to 1, which reverses the one-hot encoding. The UDF is declared without a return type, so its result is cast to `IntegerType` afterwards.
- **Applying Transformations**: Each assembled vector column is replaced by the index found by the UDF, and the original indicator columns are dropped.
- **Data Transformation**: The function returns a DataFrame in which the 4 wilderness columns and 40 soil columns are replaced by two integer columns, "wilderness" and "soil".

Representing each categorical feature as a single column lets a decision tree split on the feature as a whole instead of on 44 separate binary indicators.


In [None]:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def unencode_one_hot(data):
  wilderness_cols = ['Wilderness_Area_' + str(i) for i in range(4)]
  wilderness_assembler = VectorAssembler().setInputCols(wilderness_cols).setOutputCol("wilderness")

  unhot_udf = udf(lambda v: v.toArray().tolist().index(1))

  with_wilderness = wilderness_assembler.transform(data).drop(*wilderness_cols).withColumn("wilderness", unhot_udf(col("wilderness")).cast(IntegerType()))

  soil_cols = ['Soil_Type_' + str(i) for i in range(40)]
  soil_assembler = VectorAssembler().setInputCols(soil_cols).setOutputCol("soil")
  with_soil = soil_assembler.transform(with_wilderness).drop(*soil_cols).withColumn("soil", unhot_udf(col("soil")).cast(IntegerType()))

  return with_soil
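
The core of the UDF is a lookup of the position of the 1 in a one-hot vector. A plain-Python sketch of that step follows; the function `unhot` is a hypothetical, slightly more defensive variant that also rejects vectors without exactly one hot entry, which the UDF above does not check.

```python
def unhot(vector):
    """Return the index of the single entry equal to 1 in a one-hot vector."""
    hot = [i for i, value in enumerate(vector) if value == 1.0]
    if len(hot) != 1:
        raise ValueError(f"expected exactly one hot entry, found {len(hot)}")
    return hot[0]

# Made-up one-hot vector for four wilderness areas: the third area is set.
wilderness_vector = [0.0, 0.0, 1.0, 0.0]
wilderness = unhot(wilderness_vector)
print(wilderness)
```

The indices start at 0, so the four wilderness areas map to 0 through 3 and the forty soil types to 0 through 39.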

## Applying One-Hot Encoding Reverse Transformation

This code cell applies the reverse transformation function `unencode_one_hot` to the training data.

- **Function Application**: `unencode_one_hot(train_data)` applies the reverse transformation function to the training data.
- **Printing Schema**: `.printSchema()` displays the schema of the transformed DataFrame `unenc_train_data`.

This operation reverses the one-hot encoding applied to the categorical features in the training data, allowing for analysis and interpretation of the original categorical values.


In [None]:
unenc_train_data = unencode_one_hot(train_data)
unenc_train_data.printSchema()

## Grouping by Reversed One-Hot Encoded Feature

This code cell groups the DataFrame `unenc_train_data` by the reversed one-hot encoded feature "wilderness" and counts the occurrences of each category.

- **Grouping by Feature**: `.groupBy('wilderness')` groups the data by the reversed one-hot encoded feature "wilderness".
- **Counting Occurrences**: `.count()` counts the occurrences of each category within the "wilderness" feature.
- **Displaying Results**: `.show()` displays the counts of each category.

This operation helps to understand the distribution of categories within the reversed one-hot encoded feature "wilderness" in the training data.


In [None]:
unenc_train_data.groupBy('wilderness').count().show()

## Creating ML Pipeline with Vector Indexer

This code cell constructs a machine learning pipeline including a VectorIndexer stage.

- **Input Columns**: The variable `input_cols` contains every column of `unenc_train_data` except the target variable "Cover_Type".
- **Feature Assembler**: A `VectorAssembler` is created to assemble the input columns into a single feature vector named "featureVector".
- **Vector Indexer**: A `VectorIndexer` stage with `setMaxCategories(40)` treats every feature that has at most 40 distinct values as categorical and re-indexes its values; features with more distinct values are left as continuous. The value 40 is large enough to cover the 40 soil types.
- **Decision Tree Classifier**: A `DecisionTreeClassifier` is added to the pipeline, reading its features from "indexedVector" and its labels from "Cover_Type".
- **Pipeline Creation**: The pipeline is created with three stages: the feature assembler, the vector indexer, and the classifier.

Marking "wilderness" and "soil" as categorical lets the tree split on sets of category values rather than on numeric thresholds, which can improve the accuracy of tree-based models.


In [None]:
from pyspark.ml.feature import VectorIndexer

cols = unenc_train_data.columns
input_cols = [c for c in cols if c!='Cover_Type']

assembler = VectorAssembler().setInputCols(input_cols).setOutputCol("featureVector")

indexer = VectorIndexer().setMaxCategories(40).setInputCol("featureVector").setOutputCol("indexedVector")

classifier = DecisionTreeClassifier().setLabelCol("Cover_Type").setFeaturesCol("indexedVector").setPredictionCol("prediction")

pipeline = Pipeline().setStages([assembler, indexer, classifier])
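
The rule that the indexer applies can be sketched in plain Python: count the distinct values of each feature and compare the count with the limit. The function `categorical_features` and the sample rows are made up for this illustration, and the ascending order of the category indices is an assumption of the sketch rather than a statement about Spark's internals.

```python
def categorical_features(rows, max_categories):
    """Map feature index -> {value: category index} for features with few distinct values."""
    result = {}
    for j in range(len(rows[0])):
        distinct = sorted({row[j] for row in rows})
        if len(distinct) <= max_categories:
            result[j] = {value: index for index, value in enumerate(distinct)}
    return result

# Made-up feature vectors: [elevation, soil].
rows = [
    [2596.0, 0.0],
    [2590.0, 3.0],
    [2804.0, 0.0],
    [2785.0, 1.0],
]

category_maps = categorical_features(rows, max_categories=3)
print(category_maps)
```

With a limit of 3, the elevation feature has four distinct values and stays continuous, while the soil feature has three and is treated as categorical. A genuinely numeric feature with few distinct values would be caught by the same rule, so the limit has to be chosen with the data in mind.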

0.0.5 Random Forests (Takes a Long Time to Run)

## Updating Classifier to RandomForestClassifier

This code cell rebinds the `classifier` variable to a `RandomForestClassifier`.

- **Random Forest Classifier**: The `RandomForestClassifier` from PySpark MLlib is instantiated with a seed, the label column "Cover_Type", the features column "indexedVector", and the prediction column "prediction".
- **Classifier Update**: Only the variable `classifier` changes here. The `pipeline` object built earlier still contains the `DecisionTreeClassifier`, so the pipeline has to be rebuilt with the new classifier before fitting.

Random forests are an ensemble learning method that builds multiple decision trees during training and outputs the mode of the classes (classification) or mean prediction (regression) of the individual trees. Because each tree sees a different random sample of rows and features, the combined prediction tends to be more robust than that of a single tree.


In [None]:
from pyspark.ml.classification import RandomForestClassifier

classifier = RandomForestClassifier(seed=1234, labelCol="Cover_Type",featuresCol="indexedVector",predictionCol="prediction")
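
The voting idea described above can be sketched in a few lines of plain Python. The function `majority_vote` and the list of tree outputs are made up for illustration; this shows the concept of taking the mode and is not a description of how Spark combines its trees internally.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Made-up predictions from five trees for a single row.
votes = [2.0, 1.0, 2.0, 7.0, 2.0]
forest_prediction = majority_vote(votes)
print(forest_prediction)
```

Individual trees can be wrong in different ways, and the vote cancels out errors that are not shared by most of the trees.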

In [None]:
unenc_train_data.columns

## Tuning Random Forest Classifier with TrainValidationSplit

This code cell performs hyperparameter tuning for the Random Forest Classifier using TrainValidationSplit.

- **Input Columns**: The variable `input_cols` contains all columns except the target variable "Cover_Type".
- **Feature Assembler**: A `VectorAssembler` is created to assemble the input columns into a single feature vector named "featureVector".
- **Vector Indexer**: A `VectorIndexer` stage with `setMaxCategories(40)` treats every feature that has at most 40 distinct values as categorical and re-indexes its values.
- **Random Forest Classifier**: The pipeline includes a `RandomForestClassifier` for classification tasks.
- **Parameter Grid Building**: A parameter grid is constructed with various hyperparameters for the Random Forest Classifier, such as impurity, max depth, max bins, and min info gain.
- **Multiclass Evaluator**: A `MulticlassClassificationEvaluator` is configured to evaluate the model's performance using accuracy.
- **Train Validation Split**: The `TrainValidationSplit` estimator is used for hyperparameter tuning, with the pipeline, evaluator, parameter grid, and train ratio specified.
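- **Model Training**: The validator is fit to `unenc_train_data`, and the best pipeline is stored in `best_model`. Each of the 16 parameter combinations trains an entire forest, which is why this cell takes much longer to run than the single-tree version.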


In [None]:
######### LONGER TIME ##################################

cols = unenc_train_data.columns
input_cols = [c for c in cols if c!='Cover_Type']

assembler = VectorAssembler().setInputCols(input_cols).setOutputCol("featureVector")

indexer = VectorIndexer().setMaxCategories(40).setInputCol("featureVector").setOutputCol("indexedVector")

pipeline = Pipeline().setStages([assembler, indexer, classifier])

paramGrid = ParamGridBuilder(). \
  addGrid(classifier.impurity, ["gini", "entropy"]). \
  addGrid(classifier.maxDepth, [1, 20]). \
  addGrid(classifier.maxBins, [40, 300]). \
  addGrid(classifier.minInfoGain, [0.0, 0.05]). \
  build()

multiclassEval = MulticlassClassificationEvaluator(). \
  setLabelCol("Cover_Type"). \
  setPredictionCol("prediction"). \
  setMetricName("accuracy")

validator = TrainValidationSplit(seed=1234,
  estimator=pipeline,
  evaluator=multiclassEval,
  estimatorParamMaps=paramGrid,
  trainRatio=0.9)

validator_model = validator.fit(unenc_train_data)

best_model = validator_model.bestModel

## Extracting Feature Importance from Random Forest Model

This code cell extracts the feature importance from the trained Random Forest model.

- **Accessing Forest Model**: `best_model.stages[2]` retrieves the Random Forest model from the best model obtained from the validation process.
- **Feature Importance Extraction**: `.featureImportances.toArray()` extracts the feature importances as an array.
- **Zip and Sort**: `zip(input_cols, forest_model.featureImportances.toArray())` pairs each input column name with its importance, and `.sort(key=lambda x: x[1], reverse=True)` orders the pairs by importance, highest first.
- **Printing Feature Importance**: The sorted list of feature importances is printed.

Understanding feature importance is crucial for identifying the most influential features in the model's decision-making process, aiding in feature selection and interpretation.


In [None]:
forest_model = best_model.stages[2]

feature_importance_list = list(zip(input_cols,forest_model.featureImportances.toArray()))

feature_importance_list.sort(key=lambda x: x[1], reverse=True)

pprint(feature_importance_list)

0.0.6 Making Predictions

## Making Predictions on Test Data with Best Model

This code cell makes predictions on the test data using the best model obtained from the validation process.

- **One-Hot Encoding Reverse Transformation**: The test data is first transformed using the `unencode_one_hot` function to reverse the one-hot encoding applied to categorical features.
- **Dropping Target Column**: The target column "Cover_Type" is dropped from the transformed test data.
- **Model Prediction**: `best_model.transform(...)` applies the best model to the transformed test data, generating predictions.
- **Selecting Prediction Column**: `.select("prediction")` selects the prediction column from the output.
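- **Displaying Predictions**: `.show(1)` displays the prediction for a single row of the test data.

Dropping the label before calling `transform` mimics scoring new, unlabeled data: the model needs only the feature columns to produce a prediction. A single prediction says nothing about accuracy; for that, the predictions on the full test set would need to be compared with the actual labels.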


In [None]:
unenc_test_data = unencode_one_hot(test_data)

best_model.transform(unenc_test_data.drop("Cover_Type")).select("prediction").show(1)