<h2>Decision Trees</h2>

<ul>
    <li>This code snippet sets up a PySpark environment in a Python script.</li>
    <li>It first imports the pyspark, os, and sys modules.</li>
    <li>It then points both the PySpark workers (PYSPARK_PYTHON) and the driver (PYSPARK_DRIVER_PYTHON) at the Python executable that is running the script.</li>
    <li>It also imports the SparkContext class for creating RDDs and the SparkSession class for programming Spark with the DataFrame API.</li>
    </ul>

In [None]:
import pyspark
import os
import sys
from pyspark import SparkContext
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
from pyspark.sql import SparkSession

<ul><li>This code creates a SparkSession named spark with specific configuration options. It sets the driver memory to 16 GB and names the application 'chapter_4'.</li></ul>

In [None]:
spark = SparkSession.builder.config("spark.driver.memory", "16g").appName('chapter_4').getOrCreate()

<h3>Preparing Data</h3>
<ul><li>This code reads the headerless CSV file "data/covtype.data" into a PySpark DataFrame called data_without_header, letting Spark infer the column types (inferSchema). It then prints the DataFrame's schema so the structure of the data can be inspected.</li></ul>

In [None]:
data_without_header = spark.read.option("inferSchema", True).\
 option("header", False).csv("data/covtype.data")
data_without_header.printSchema()

<ul><li>This code prepares a PySpark DataFrame named data by defining a list of column names (colnames) expected in the dataset. </li><li>These columns represent various geographical features and identifiers.</li><li> The data_without_header DataFrame is then converted to data using these column names, and the "Cover_Type" column is cast to a DoubleType to ensure it is treated as a numerical column.</li><li> Finally, it returns the first row of the DataFrame, providing a glimpse into the data's structure and content.</li></ul>

In [None]:
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import col

# 10 numeric features, 4 wilderness-area flags, 40 soil-type flags, then the label
colnames = ["Elevation", "Aspect", "Slope", \
 "Horizontal_Distance_To_Hydrology", \
 "Vertical_Distance_To_Hydrology", "Horizontal_Distance_To_Roadways", \
 "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm", \
 "Horizontal_Distance_To_Fire_Points"] + \
 [f"Wilderness_Area_{i}" for i in range(4)] + \
 [f"Soil_Type_{i}" for i in range(40)] + \
 ["Cover_Type"]
data = data_without_header.toDF(*colnames).\
 withColumn("Cover_Type",col("Cover_Type").cast(DoubleType()))
data.head()
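
<p>As a quick sanity check on the naming scheme above, the following pure-Python sketch rebuilds the list without Spark and confirms that it holds 55 names: 10 numeric features, 4 wilderness-area flags, 40 soil-type flags, and the label. The variable names numeric_cols, wilderness_cols, soil_cols, and all_cols are illustrative and do not appear in the notebook.</p>

```python
# Rebuild the column names without Spark to check the expected width.
numeric_cols = [
    "Elevation", "Aspect", "Slope",
    "Horizontal_Distance_To_Hydrology",
    "Vertical_Distance_To_Hydrology",
    "Horizontal_Distance_To_Roadways",
    "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm",
    "Horizontal_Distance_To_Fire_Points",
]
wilderness_cols = [f"Wilderness_Area_{i}" for i in range(4)]
soil_cols = [f"Soil_Type_{i}" for i in range(40)]
all_cols = numeric_cols + wilderness_cols + soil_cols + ["Cover_Type"]

print(len(all_cols))   # 55
print(all_cols[-1])    # Cover_Type
```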

<h3>Our First Decision Tree</h3>


<ul><li>This code randomly splits the data DataFrame into train_data and test_data DataFrames in roughly a 90-10 ratio, and caches both because they are reused in later steps.</li></ul>

In [None]:
(train_data, test_data) = data.randomSplit([0.9, 0.1])
train_data.cache()
test_data.cache()
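
<p>Because the split is random, the two parts are only approximately 90% and 10% of the rows, and they change from run to run unless a seed is supplied. The sketch below illustrates the idea in plain Python by assigning each row independently. It is a conceptual illustration under that assumption, not Spark's implementation, and random_split is a hypothetical helper.</p>

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row independently to train or test (sizes are approximate)."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))
train_rows, test_rows = random_split(rows, 0.9, seed=1234)

# Every row lands in exactly one part; the sizes are close to 900 and 100.
print(len(train_rows), len(test_rows))
```

<p>In the notebook itself, passing a seed to randomSplit would make the split repeatable across runs.</p>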

<b>This code uses PySpark's VectorAssembler to merge multiple feature columns from train_data into a single "featureVector" column.</b>
<ul>
    <li>It excludes the last column ("Cover_Type") from the feature set.</li>
<li>The transform method is applied to train_data, creating assembled_train_data with the new "featureVector" column.</li>
<li>The final select statement displays the "featureVector" column content for each row in assembled_train_data.</li></ul>

In [None]:
from pyspark.ml.feature import VectorAssembler
input_cols = colnames[:-1]
vector_assembler = VectorAssembler(inputCols=input_cols,outputCol="featureVector")
assembled_train_data = vector_assembler.transform(train_data)
assembled_train_data.select("featureVector").show(truncate = False)

<b>The code trains a decision tree classifier (DecisionTreeClassifier) using PySpark on the assembled_train_data DataFrame.</b>
<ul><li>Configuration includes a random seed for reproducibility (seed=1234), specifying the target variable column (labelCol="Cover_Type"), the input feature vector column (featuresCol="featureVector"), and the prediction output column (predictionCol="prediction").</li>
    <li>The fit method trains the model, storing it in the model variable.</li>
<li>print(model.toDebugString) displays a detailed representation of the trained decision tree model's structure.</li></ul>

In [None]:
from pyspark.ml.classification import DecisionTreeClassifier
classifier = DecisionTreeClassifier(seed = 1234, labelCol="Cover_Type",featuresCol="featureVector",predictionCol="prediction")
model = classifier.fit(assembled_train_data)
print(model.toDebugString)
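
<p>A decision tree chooses, at each node, the split that most reduces the impurity of the labels on either side. As a minimal pure-Python sketch (not Spark's implementation), the two impurity measures that DecisionTreeClassifier accepts, gini and entropy, can be computed from a list of labels as follows. The function names are illustrative.</p>

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy, in bits, of the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

pure = [1, 1, 1, 1]    # a single class: nothing left to separate
mixed = [1, 1, 2, 2]   # two classes in equal shares

print(gini(pure))                    # 0.0
print(gini(mixed), entropy(mixed))   # 0.5 1.0
```

<p>Both measures are zero for a node that contains one class and grow as the classes become more evenly mixed.</p>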

<b>This code snippet converts the feature importances of the trained decision tree model (model) into a Pandas DataFrame using the pd.DataFrame() function from the Pandas library.</b>
<ul><li>
    The feature importances are extracted from the model using model.featureImportances.toArray().</li>
<li>The index of the DataFrame is set to the input column names (input_cols), which represent the features used in the model.</li>
    <li>The column containing the feature importances is named "importance".</li>
<li>The DataFrame is then sorted by the "importance" column in descending order using .sort_values(by="importance", ascending=False), allowing users to identify the most influential features in the decision tree model.</li></ul>

In [None]:
import pandas as pd
pd.DataFrame(model.featureImportances.toArray(),
 index=input_cols, columns=['importance']).\
 sort_values(by="importance", ascending=False)

<b>This code applies the trained decision tree model (model) to the assembled_train_data DataFrame to make predictions.</b>
<ul><li>
The transform method is used to apply the model to the data, resulting in a new DataFrame named predictions.</li><li>
The select method is then used to display the "Cover_Type" (actual target variable), "prediction" (predicted value), and "probability" columns from the predictions DataFrame.</li></ul>

In [None]:
predictions = model.transform(assembled_train_data)
predictions.select("Cover_Type", "prediction", "probability").\
show(10, truncate = False)

<b>This code utilizes PySpark's MulticlassClassificationEvaluator to evaluate the decision tree model's accuracy on the predictions DataFrame.</b>
<ul><li>It calculates the accuracy of the model's predictions using setMetricName("accuracy").evaluate(predictions).</li>
<li>Additionally, it computes the F1 score, which combines precision and recall (for multiclass data, the per-class F1 scores are averaged, weighted by class frequency), using setMetricName("f1").evaluate(predictions).</li></ul>

In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(labelCol="Cover_Type",predictionCol="prediction")
# Print both values; a notebook cell only displays its last expression
print(evaluator.setMetricName("accuracy").evaluate(predictions))
print(evaluator.setMetricName("f1").evaluate(predictions))
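
<p>To make the two metrics concrete, the sketch below computes accuracy and a frequency-weighted F1 score in plain Python from a handful of made-up (label, prediction) pairs. It is an illustration of the definitions rather than the evaluator's code; the function names and the sample data are hypothetical.</p>

```python
from collections import Counter

def accuracy(pairs):
    """Fraction of (label, prediction) pairs that match."""
    return sum(1 for y, p in pairs if y == p) / len(pairs)

def weighted_f1(pairs):
    """Per-class F1, averaged with weights equal to each class's share of the labels."""
    label_counts = Counter(y for y, _ in pairs)
    pred_counts = Counter(p for _, p in pairs)
    true_pos = Counter(y for y, p in pairs if y == p)
    total = 0.0
    for cls, n_label in label_counts.items():
        tp = true_pos[cls]
        precision = tp / pred_counts[cls] if pred_counts[cls] else 0.0
        recall = tp / n_label
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * n_label / len(pairs)
    return total

# Made-up pairs: class 3 is never predicted, so its F1 is 0.
pairs = [(1, 1), (1, 1), (1, 1), (2, 1), (2, 2), (3, 1)]
print(accuracy(pairs))
print(weighted_f1(pairs))
```

<p>In this sample the accuracy is 4/6, while the weighted F1 is lower because the class that is never predicted contributes a score of zero.</p>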

<ul><li>This code snippet creates a confusion matrix from the predictions DataFrame, showing the counts of correct and incorrect predictions for each cover type.</li><li>It groups the predictions by the actual cover type and pivots on the predicted cover type (values 1 through 7), so each row is an actual class and each column a predicted class. Missing combinations are filled with 0, and the rows are sorted by the actual cover type.</li></ul>

In [None]:
confusion_matrix = predictions.groupBy("Cover_Type").\
 pivot("prediction", list(range(1, 8))).count().\
 na.fill(0.0).\
 orderBy("Cover_Type")
confusion_matrix.show()
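
<p>The group-and-pivot step amounts to counting how often each (actual, predicted) pair occurs. A minimal plain-Python sketch of the same idea, using made-up pairs and an illustrative helper name, looks like this:</p>

```python
from collections import Counter

def build_confusion_matrix(pairs, classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(pairs)
    # Counter returns 0 for pairs that never occur, like na.fill(0.0) above.
    return [[counts[(actual, predicted)] for predicted in classes]
            for actual in classes]

pairs = [(1, 1), (1, 1), (1, 2), (2, 2), (2, 1), (3, 3)]
classes = [1, 2, 3]
matrix = build_confusion_matrix(pairs, classes)
for actual, row in zip(classes, matrix):
    print(actual, row)
```

<p>Correct predictions sit on the diagonal, so a good model concentrates its counts there.</p>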

<b>This code defines a function class_probabilities that calculates the proportion of each class in a dataset based on the "Cover_Type" column. It takes a DataFrame data as input and returns a list of class proportions, sorted by "Cover_Type".</b>
<ul><li>
    The class_probabilities function first counts the total number of rows in the DataFrame. It then groups the data by "Cover_Type", calculates the count for each group, and orders the result by "Cover_Type".</li>
<li>Next, it converts the count to a double type, calculates the proportion of each class count to the total count, and selects the "count_proportion" column.</li><li>
    Finally, it collects and returns the list of class proportions.</li></ul>
<b>The function is applied to both the train_data and test_data DataFrames to calculate the class proportions for the training and test datasets, respectively.</b>

In [None]:
from pyspark.sql import DataFrame
def class_probabilities(data):
    total = data.count()
    return data.groupBy("Cover_Type").count().\
 orderBy("Cover_Type").\
 select(col("count").cast(DoubleType())).\
 withColumn("count_proportion", col("count")/total).\
 select("count_proportion").collect()

train_prior_probabilities = class_probabilities(train_data)
test_prior_probabilities = class_probabilities(test_data)
train_prior_probabilities

<b>This code estimates the accuracy of a baseline that guesses each class at random, in proportion to how often the class appears in the training set.</b>
<ul><li>
It extracts the class proportions from train_prior_probabilities and test_prior_probabilities using list comprehensions.</li>
    <li>The zip function pairs up the proportions of the same class from the training and test datasets.</li><li>
The sum of the products of these pairs is the probability that such a random guess matches the true label of a test row, which gives a reference point for the decision tree's accuracy.</li></ul>

In [None]:
train_prior_probabilities = [p[0] for p in train_prior_probabilities]
test_prior_probabilities = [p[0] for p in test_prior_probabilities]
sum([train_p * cv_p for train_p, cv_p in zip(train_prior_probabilities,test_prior_probabilities)])
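
<p>A small worked example may help with the arithmetic. The proportions below are invented for three classes and are not taken from the covtype data.</p>

```python
# Hypothetical class proportions for three classes (each list sums to 1).
train_props = [0.5, 0.3, 0.2]
test_props = [0.6, 0.3, 0.1]

# Multiply the proportions class by class and add up the products:
# 0.5*0.6 + 0.3*0.3 + 0.2*0.1 = 0.30 + 0.09 + 0.02
baseline = sum(t * s for t, s in zip(train_props, test_props))
print(round(baseline, 2))  # 0.41
```

<p>The pairing only makes sense when both lists contain the same classes in the same order, which is why class_probabilities sorts by "Cover_Type".</p>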

<h3>Tuning Decision Trees</h3><br>
<b>
This code sets up a PySpark pipeline for assembling features and training a decision tree classifier.</b>
<ul><li>
It creates a VectorAssembler named assembler to combine input features into a single feature vector.</li><li>
    The DecisionTreeClassifier named classifier is configured with specific parameters.</li><li>
The Pipeline named pipeline is constructed with stages for the assembler and classifier, enabling a streamlined workflow for feature assembly and model training.</li></ul>

In [None]:
from pyspark.ml import Pipeline
assembler = VectorAssembler(inputCols=input_cols, outputCol="featureVector")
classifier = DecisionTreeClassifier(seed=1234, labelCol="Cover_Type",
featuresCol="featureVector",
predictionCol="prediction")
pipeline = Pipeline(stages=[assembler, classifier])

<b>This code snippet sets up a parameter grid for tuning the hyperparameters of the decision tree classifier (classifier) using PySpark's ParamGridBuilder.</b>

<ul><li>The ParamGridBuilder is used to define a grid of hyperparameters to explore during model training.
Hyperparameters such as impurity, maxDepth, maxBins, and minInfoGain are specified with a list of values to try.
    The build() method is called to build the parameter grid.</li>
<li>Additionally, a MulticlassClassificationEvaluator (multiclassEval) is configured to evaluate the model's performance using accuracy (setMetricName("accuracy")) on the "Cover_Type" target variable (setLabelCol("Cover_Type")) and the predicted values column (setPredictionCol("prediction")).</li></ul>

In [None]:
from pyspark.ml.tuning import ParamGridBuilder

paramGrid = ParamGridBuilder(). \
addGrid(classifier.impurity, ["gini", "entropy"]). \
addGrid(classifier.maxDepth, [1, 20]). \
addGrid(classifier.maxBins, [40, 300]). \
addGrid(classifier.minInfoGain, [0.0, 0.05]). \
build()

multiclassEval = MulticlassClassificationEvaluator(). \
setLabelCol("Cover_Type"). \
setPredictionCol("prediction"). \
setMetricName("accuracy")
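
<p>The grid is the Cartesian product of the listed values, so four hyperparameters with two values each give 16 combinations, and one model is trained for each. The plain-Python sketch below reproduces that count with itertools; the dictionary is an illustrative stand-in for the ParamGridBuilder calls above.</p>

```python
from itertools import product

# Same hyperparameter values as in the grid above.
grid = {
    "impurity": ["gini", "entropy"],
    "maxDepth": [1, 20],
    "maxBins": [40, 300],
    "minInfoGain": [0.0, 0.05],
}

# One dictionary per combination of values.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combinations))   # 16
print(combinations[0])
```

<p>Each extra value or hyperparameter multiplies the number of models to train, so the grid size is worth checking before starting a long run.</p>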

<b>The code sets up a train-validation split for hyperparameter tuning using PySpark's TrainValidationSplit.</b><ul><li>
It configures the TrainValidationSplit with parameters such as a random seed (seed=1234), the pipeline (pipeline) containing the assembler and classifier, an evaluator (multiclassEval) for accuracy evaluation, a parameter grid (paramGrid) for hyperparameter tuning, and a training ratio (trainRatio=0.9).</li><li>
The fit method is used to train the validator object on the train_data DataFrame, resulting in a validator_model that contains the best model found during the tuning process.</li></ul>

In [None]:
from pyspark.ml.tuning import TrainValidationSplit
validator = TrainValidationSplit(seed=1234,
estimator=pipeline,
evaluator=multiclassEval,
estimatorParamMaps=paramGrid,
trainRatio=0.9)
validator_model = validator.fit(train_data)
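
<p>The procedure behind a train-validation split can be sketched in plain Python with a toy estimator: split the rows once, fit one model per candidate setting on the larger part, score each on the held-out part, and refit the winner on all rows. Everything below (the stump model, the n_bins setting, the data) is a made-up illustration of the workflow, not Spark's code.</p>

```python
import random

def accuracy(threshold, rows):
    """Score a stump that predicts 1 when x exceeds the threshold."""
    return sum(1 for x, y in rows if int(x > threshold) == y) / len(rows)

def fit_stump(rows, n_bins):
    """Toy estimator: try n_bins evenly spaced thresholds, keep the most accurate."""
    thresholds = [(i + 1) / (n_bins + 1) for i in range(n_bins)]
    return max(thresholds, key=lambda t: accuracy(t, rows))

rng = random.Random(1234)
# Toy data: the label is 1 exactly when x > 0.3.
rows = [(x, int(x > 0.3)) for x in [rng.random() for _ in range(200)]]

# One random split: about 90% to fit on, the rest held out for validation.
rng.shuffle(rows)
cut = int(0.9 * len(rows))
fit_rows, validation_rows = rows[:cut], rows[cut:]

# Fit once per candidate setting and score each fit on the held-out rows.
scores = {}
for n_bins in [1, 3, 9]:
    threshold = fit_stump(fit_rows, n_bins)
    scores[n_bins] = accuracy(threshold, validation_rows)

best_n_bins = max(scores, key=scores.get)
final_threshold = fit_stump(rows, best_n_bins)  # refit on all rows with the winner
print(scores, best_n_bins, final_threshold)
```

<p>Unlike k-fold cross-validation, this approach evaluates each setting on a single held-out split, which is cheaper but gives a noisier estimate.</p>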

<b>This code snippet extracts and prints the best hyperparameters found during the hyperparameter tuning process using PySpark's TrainValidationSplit.</b>
<ul><li>
    best_model contains the best model selected by the tuning process.</li><li>
    best_model.stages[1] accesses the decision tree classifier (classifier) within the pipeline.</li><li>
    extractParamMap() retrieves the hyperparameters of the decision tree classifier.</li></ul>

In [None]:
from pprint import pprint
best_model = validator_model.bestModel
pprint(best_model.stages[1].extractParamMap())

<b>The code fits the validator on the train_data DataFrame again (repeating the earlier fit) and then ranks every hyperparameter combination by its validation metric.</b><ul><li>
    It extracts the validation metrics and hyperparameters for each model evaluated during tuning.</li><li>
    The metrics and hyperparameters are combined into a list of tuples.</li><li>
The list is sorted based on the validation metrics in descending order to identify the best-performing models.</li></ul>

In [None]:
validator_model = validator.fit(train_data)
metrics = validator_model.validationMetrics
params = validator_model.getEstimatorParamMaps()
metrics_and_params = list(zip(metrics, params))
metrics_and_params.sort(key=lambda x: x[0], reverse=True)
metrics_and_params

<b>This code snippet sorts the validation metrics in descending order and prints the highest one, then evaluates the best model on the held-out test data using the MulticlassClassificationEvaluator.</b>

In [None]:
metrics.sort(reverse=True)
print(metrics[0])
multiclassEval.evaluate(best_model.transform(test_data))

<h3>Categorical Features Revisited</h3>

<b>This code defines a function called unencode_one_hot that reverses the one-hot encoding of the "Wilderness_Area" and "Soil_Type" columns in a DataFrame.</b><ul><li>It first uses a VectorAssembler to combine the four "Wilderness_Area" columns into a single vector column named "wilderness".</li><li>It defines a user-defined function (UDF) that returns the index of the element equal to 1 in such a vector, which is the original categorical value.</li><li>It drops the original one-hot columns and replaces the "wilderness" vector with the UDF's result, cast to an integer.</li><li>It then repeats the same steps for the 40 "Soil_Type" columns, producing an integer "soil" column.</li></ul>

In [None]:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
def unencode_one_hot(data):
    wilderness_cols = ['Wilderness_Area_' + str(i) for i in range(4)]
    wilderness_assembler = VectorAssembler().\
 setInputCols(wilderness_cols).\
 setOutputCol("wilderness")
    unhot_udf = udf(lambda v: v.toArray().tolist().index(1))
    with_wilderness = wilderness_assembler.transform(data).\
 drop(*wilderness_cols).\
 withColumn("wilderness", unhot_udf(col("wilderness")).cast(IntegerType()))
    soil_cols = ['Soil_Type_' + str(i) for i in range(40)]
    soil_assembler = VectorAssembler().\
 setInputCols(soil_cols).\
 setOutputCol("soil")
    with_soil = soil_assembler.\
 transform(with_wilderness).\
 drop(*soil_cols).\
 withColumn("soil", unhot_udf(col("soil")).cast(IntegerType()))
    return with_soil
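
<p>Stripped of the Spark machinery, reversing a one-hot encoding means finding the position of the single 1 among the flags. The sketch below shows that step on one made-up row; unhot is a hypothetical helper, and the check for exactly one active flag is an added safeguard that the UDF above does not perform.</p>

```python
def unhot(flags):
    """Return the position of the single 1 in a one-hot list."""
    if sum(flags) != 1:
        raise ValueError("expected exactly one active flag")
    return flags.index(1)

# One made-up row with the four wilderness-area flags.
row = {"Wilderness_Area_0": 0, "Wilderness_Area_1": 0,
       "Wilderness_Area_2": 1, "Wilderness_Area_3": 0}
wilderness = unhot([row[f"Wilderness_Area_{i}"] for i in range(4)])
print(wilderness)  # 2
```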

<p>This code applies the unencode_one_hot function to the train_data DataFrame, reversing the one-hot encoding of "Wilderness_Area" and "Soil_Type" columns.</p>

In [None]:
unenc_train_data = unencode_one_hot(train_data)
unenc_train_data.printSchema()

This code groups the unenc_train_data DataFrame by the "wilderness" column and counts the occurrences of each category.

In [None]:
unenc_train_data.groupBy('wilderness').count().show()

<b>This code sets up a PySpark pipeline for a decision tree classifier with vector indexing.</b>
<ul><li>
    A VectorAssembler combines input columns into a single feature vector column.</li><li>
A VectorIndexer indexes categorical features in the feature vector column with a maximum of 40 categories.</li><li>
A decision tree classifier is configured with the target variable column and input and output column names for features and predictions.</li><li>
The pipeline is constructed with stages for the assembler, indexer, and classifier, facilitating a streamlined workflow </li></ul>

In [None]:
from pyspark.ml.feature import VectorIndexer

cols = unenc_train_data.columns
input_cols = [c for c in cols if c!='Cover_Type']
assembler = VectorAssembler().setInputCols(input_cols).setOutputCol("featureVector")
indexer = VectorIndexer().\
 setMaxCategories(40).\
 setInputCol("featureVector").setOutputCol("indexedVector")
classifier = DecisionTreeClassifier().setLabelCol("Cover_Type").\
 setFeaturesCol("indexedVector").\
 setPredictionCol("prediction")
pipeline = Pipeline().setStages([assembler, indexer, classifier])
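
<p>VectorIndexer decides which features are categorical by counting distinct values: a feature with at most maxCategories distinct values is treated as categorical, and the rest are left as continuous. The plain-Python sketch below mimics that rule on a few made-up rows; the helper name and data are illustrative, and it is not Spark's implementation.</p>

```python
def categorical_features(rows, max_categories):
    """Map feature index -> sorted distinct values, for features with few distinct values."""
    n_features = len(rows[0])
    result = {}
    for j in range(n_features):
        distinct = sorted({row[j] for row in rows})
        if len(distinct) <= max_categories:
            result[j] = distinct
    return result

# Made-up rows: feature 0 has four distinct values, features 1 and 2 have three.
rows = [
    (2596.0, 0.0, 28.0),
    (2590.0, 2.0, 28.0),
    (2804.0, 0.0, 11.0),
    (2785.0, 3.0, 29.0),
]
print(categorical_features(rows, max_categories=3))
```

<p>With this rule, a numeric feature that happens to take few distinct values is also treated as categorical, so the threshold should be chosen with the data in mind. Here 40 matches the number of soil types.</p>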

<h3>Random forest takes too long to run</h3>

<b>This code snippet defines a random forest classifier in PySpark with specific parameters.</b>
<ul><li>
    RandomForestClassifier is used to define a random forest classifier model.</li><li>
Parameters include a random seed for reproducibility (seed=1234), the target variable column ("Cover_Type"), the input feature vector column ("indexedVector"), and the output prediction column ("prediction").</li><li>
The columns attribute of unenc_train_data is accessed to display the column names in the DataFrame.</li></ul>

In [None]:
from pyspark.ml.classification import RandomForestClassifier
classifier = RandomForestClassifier(seed=1234, labelCol="Cover_Type",
featuresCol="indexedVector",
predictionCol="prediction")
unenc_train_data.columns

<b>This code snippet sets up a PySpark pipeline for a random forest classifier with hyperparameter tuning.</b><ul><li> It starts by defining the input columns for feature assembly, excluding "Cover_Type".</li><li> The VectorAssembler combines these columns into a single feature vector.</li><li> Next, a VectorIndexer is used to index categorical features in the vector.</li><li> The pipeline is constructed with stages for the assembler, indexer, and classifier. </li><li>Hyperparameter tuning is performed using TrainValidationSplit with a specified parameter grid. </li><li>Finally, the best model is determined using the fit method on the validator object with the training data.</li></ul>

In [None]:
cols = unenc_train_data.columns
input_cols = [c for c in cols if c!='Cover_Type']
assembler = VectorAssembler().setInputCols(input_cols).setOutputCol("featureVector")
indexer = VectorIndexer().\
 setMaxCategories(40).\
 setInputCol("featureVector").setOutputCol("indexedVector")
pipeline = Pipeline().setStages([assembler, indexer, classifier])
paramGrid = ParamGridBuilder(). \
 addGrid(classifier.impurity, ["gini", "entropy"]). \
 addGrid(classifier.maxDepth, [1, 20]). \
 addGrid(classifier.maxBins, [40, 300]). \
 addGrid(classifier.minInfoGain, [0.0, 0.05]). \
 build()
multiclassEval = MulticlassClassificationEvaluator(). \
 setLabelCol("Cover_Type"). \
 setPredictionCol("prediction"). \
 setMetricName("accuracy")
validator = TrainValidationSplit(seed=1234,
 estimator=pipeline,
 evaluator=multiclassEval,
estimatorParamMaps=paramGrid,
 trainRatio=0.9)
validator_model = validator.fit(unenc_train_data)
best_model = validator_model.bestModel

<b>This code extracts and prints the feature importances from the best random forest model found during hyperparameter tuning.</b><ul><li>
forest_model = best_model.stages[2] accesses the random forest model (classifier) within the pipeline; it is at index 2 because the assembler and indexer stages come first.</li><li>
feature_importance_list is created as a list of tuples, where each tuple contains a feature column name (input_cols) and its corresponding importance score from the random forest model.</li><li>
    The list is sorted based on the importance scores in descending order.</li></ul>

In [None]:
forest_model = best_model.stages[2]
feature_importance_list = list(zip(input_cols,forest_model.featureImportances.toArray()))
feature_importance_list.sort(key=lambda x: x[1], reverse=True)
pprint(feature_importance_list)

<h3>Making Predictions</h3>

This code prepares the test data by reversing the one-hot encoding of the "Wilderness_Area" and "Soil_Type" columns. It then drops the "Cover_Type" label and applies the best model found during tuning to predict a cover type for each test row. Only the first prediction is shown, as a quick check that the pipeline runs on unseen data.

In [None]:
unenc_test_data = unencode_one_hot(test_data)
best_model.transform(unenc_test_data.drop("Cover_Type")).\
 select("prediction").show(1)