## Decision Trees

In [None]:
import pyspark
import os
import sys
from pyspark import SparkContext
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
from pyspark.sql import SparkSession

In [None]:
spark = SparkSession.builder.config("spark.driver.memory", "16g") \
    .appName('chapter_4').getOrCreate()


• Importing the necessary libraries and setting up the PySpark environment.
• Pointing `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` at the current interpreter so
the driver and the workers use the same Python.
• Creating a `SparkSession` that allocates 16 gigabytes of memory to the driver and
sets the application name to 'chapter_4'.

In [None]:
data_without_header = spark.read.option("inferSchema", True)\
.option("header", False).csv("data/covtype.data")
data_without_header.printSchema()

- **Reading CSV Data**: The code reads a CSV file without a header and infers the
schema.
- **Printing Schema**: It prints the schema of the loaded data.

In [None]:
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import col
colnames = ["Elevation", "Aspect", "Slope", \
"Horizontal_Distance_To_Hydrology", \
"Vertical_Distance_To_Hydrology", "Horizontal_Distance_To_Roadways", \
"Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm", \
"Horizontal_Distance_To_Fire_Points"] + \
[f"Wilderness_Area_{i}" for i in range(4)] + \
[f"Soil_Type_{i}" for i in range(40)] + \
["Cover_Type"]
data = data_without_header.toDF(*colnames).\
withColumn("Cover_Type",
col("Cover_Type").cast(DoubleType()))
data.head()

- Importing Libraries: The code imports necessary libraries for defining data types and
working with columns.
- Column Names: It defines a comprehensive list of column names for the dataset
including the ones for "Wilderness_Area" and "Soil_Type".
- Data Conversion: The code converts the "Cover_Type" column to `DoubleType`.
- Fetching Data: It retrieves the first row of the DataFrame.


In [None]:
(train_data, test_data) = data.randomSplit([0.9, 0.1])
train_data.cache()
test_data.cache()

- Data Splitting: The code splits the dataset into training and testing sets with a 90-10
ratio.
- Caching Data: It caches the training and testing datasets to optimize performance.
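
The split can be pictured with a plain-Python sketch. This is not Spark's implementation; it assumes each row is assigned independently by a weighted random draw, which is why the resulting sizes are only approximately 90% and 10%. The function name `random_split` and the seed value are illustrative.

```python
import random

def random_split(rows, weights, seed=1234):
    """Assign each row to one split by an independent weighted random draw."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative boundaries in [0, 1], e.g. [0.9, 1.0] for weights [0.9, 0.1]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split catches any draw not claimed earlier
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train, test = random_split(list(range(10_000)), [0.9, 0.1])
print(len(train), len(test))  # close to 9000 and 1000, but not exact
```

In Spark, `randomSplit` also accepts a `seed` argument if a reproducible split is needed.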


In [None]:
from pyspark.ml.feature import VectorAssembler
input_cols = colnames[:-1]
vector_assembler = VectorAssembler(inputCols=input_cols,
outputCol="featureVector")
assembled_train_data = vector_assembler.transform(train_data)
assembled_train_data.select("featureVector").show(truncate = False)

- The code imports the `VectorAssembler` class from `pyspark.ml.feature`.
- It defines `input_cols` by selecting all column names except the last one from
`colnames`.
- `VectorAssembler` is initialized with `inputCols` set to `input_cols` and
`outputCol` set to "featureVector".
- The `transform` method is used to transform `train_data` using the
`vector_assembler`.
- Finally, the "featureVector" column is selected and displayed using `show` method
with `truncate=False`.
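
Conceptually, the assembler packs the selected columns of each row into one numeric vector. The plain-Python sketch below illustrates the idea only; the helper `assemble` and the sample row are hypothetical, and Spark itself produces a vector column rather than a Python list.

```python
def assemble(row, input_cols):
    """Collect the chosen columns of one row into a single list of floats."""
    return [float(row[c]) for c in input_cols]

# Hypothetical row with three features and the label
row = {"Elevation": 2596, "Aspect": 51, "Slope": 3, "Cover_Type": 5.0}
feature_vector = assemble(row, ["Elevation", "Aspect", "Slope"])
print(feature_vector)  # [2596.0, 51.0, 3.0]
```

The label column is left out of the vector, just as `colnames[:-1]` leaves out "Cover_Type".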

In [None]:
from pyspark.ml.classification import DecisionTreeClassifier
classifier = DecisionTreeClassifier(seed = 1234, labelCol="Cover_Type",
featuresCol="featureVector",
predictionCol="prediction")
model = classifier.fit(assembled_train_data)
print(model.toDebugString)

- The code imports the `DecisionTreeClassifier` class from
`pyspark.ml.classification`.
- `DecisionTreeClassifier` is initialized with parameters:
 - `seed` set to 1234
 - `labelCol` set to "Cover_Type"
 - `featuresCol` set to "featureVector"
 - `predictionCol` set to "prediction"
- The `fit` method is used to train the decision tree model on `assembled_train_data`.
- The `toDebugString` method is used to print a debug string representation of the
trained decision tree model.

In [None]:
import pandas as pd
pd.DataFrame(model.featureImportances.toArray(),
index=input_cols, columns=['importance']).\
sort_values(by="importance", ascending=False)

- The code imports the `pandas` library as `pd`.
- The `featureImportances` attribute of the `model` is converted to a NumPy array
using `toArray()`.
- A pandas DataFrame is created with the feature importances, using `input_cols` as
the index and 'importance' as the column name.
- The DataFrame is sorted in descending order based on the 'importance' column using
`sort_values`.

In [None]:
predictions = model.transform(assembled_train_data)
predictions.select("Cover_Type", "prediction", "probability").\
show(10, truncate = False)

- The `transform` method is used to make predictions on `assembled_train_data`
using the trained `model`.
- The `select` method is used to display the actual "Cover_Type", predicted
"prediction", and "probability" columns from the `predictions` DataFrame.
- The `show` method displays the first 10 rows of the selected columns with
`truncate=False`.

In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(labelCol="Cover_Type",
                                              predictionCol="prediction")
# Keep both results; a notebook cell only displays its last expression.
accuracy = evaluator.setMetricName("accuracy").evaluate(predictions)
f1_score = evaluator.setMetricName("f1").evaluate(predictions)
print(accuracy, f1_score)


- The code imports the `MulticlassClassificationEvaluator` class from
`pyspark.ml.evaluation`.
- `MulticlassClassificationEvaluator` is initialized with parameters:
 - `labelCol` set to "Cover_Type"
 - `predictionCol` set to "prediction"
- The `setMetricName` method is used to set the metric to "accuracy" and then
evaluate the predictions.
- The `setMetricName` method is used again to set the metric to "f1" and then evaluate
the predictions, storing the results in `accuracy` and `f1_score` variables.
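
To make the two metrics concrete, here is a minimal plain-Python sketch. It assumes that "f1" means the per-class F1 score averaged with weights equal to each class's share of the true labels; the function names and the sample pairs are hypothetical.

```python
from collections import Counter

def accuracy(pairs):
    """Fraction of (label, prediction) pairs that agree."""
    return sum(1 for label, pred in pairs if label == pred) / len(pairs)

def weighted_f1(pairs):
    """Per-class F1, weighted by each class's share of the true labels."""
    label_counts = Counter(label for label, _ in pairs)
    pred_counts = Counter(pred for _, pred in pairs)
    correct = Counter(label for label, pred in pairs if label == pred)
    total = 0.0
    for cls, n_true in label_counts.items():
        precision = correct[cls] / pred_counts[cls] if pred_counts[cls] else 0.0
        recall = correct[cls] / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        total += f1 * n_true / len(pairs)
    return total

# Hypothetical (label, prediction) pairs
pairs = [(1.0, 1.0), (1.0, 2.0), (2.0, 2.0), (2.0, 2.0)]
print(accuracy(pairs))     # 0.75
print(weighted_f1(pairs))  # about 0.733
```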

In [None]:
confusion_matrix = predictions.groupBy("Cover_Type").\
pivot("prediction", range(1,8)).count().\
na.fill(0.0).\
orderBy("Cover_Type")
confusion_matrix.show()

- The `groupBy` method groups the `predictions` DataFrame by "Cover_Type".
- `pivot` is used to pivot the data based on "prediction" values ranging from 1 to 7
(assuming there are 7 classes).
- The `count` method counts the occurrences of each combination of "Cover_Type"
and "prediction".
- `na.fill(0.0)` is used to fill any null values with 0.0.
- `orderBy("Cover_Type")` sorts the confusion matrix by "Cover_Type".
- The `show` method displays the resulting confusion matrix.
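
The same table can be built by hand from (actual, predicted) pairs, which may help when reading the pivoted output. This is a plain-Python sketch with hypothetical data; rows are actual classes and columns are predicted classes.

```python
from collections import Counter

def confusion_matrix(pairs, classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(pairs)
    # A missing (actual, predicted) combination counts as 0, like na.fill(0.0)
    return [[counts[(actual, predicted)] for predicted in classes]
            for actual in classes]

# Hypothetical (actual, predicted) pairs for three classes
pairs = [(1, 1), (1, 2), (2, 2), (2, 2), (3, 1)]
matrix = confusion_matrix(pairs, classes=[1, 2, 3])
for actual, row in zip([1, 2, 3], matrix):
    print(actual, row)
```

Correct predictions lie on the diagonal, so a good model concentrates its counts there.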

In [None]:
from pyspark.sql import DataFrame
def class_probabilities(data):
    total = data.count()
    return data.groupBy("Cover_Type").count().\
        orderBy("Cover_Type").\
        select(col("count").cast(DoubleType())).\
        withColumn("count_proportion", col("count")/total).\
        select("count_proportion").collect()
train_prior_probabilities = class_probabilities(train_data)
test_prior_probabilities = class_probabilities(test_data)
train_prior_probabilities

- The function `class_probabilities` calculates the prior probabilities of each class in
the dataset.
- The total count of data is obtained using `data.count()`.
- The `groupBy` method groups the data by "Cover_Type" and counts the occurrences
of each class.
- The counts are ordered by "Cover_Type" and the proportion of each class is calculated
by dividing the count by the total count.
- The resulting prior probabilities for the training data are computed and stored in
`train_prior_probabilities`.


In [None]:
train_prior_probabilities = [p[0] for p in train_prior_probabilities]
test_prior_probabilities = [p[0] for p in test_prior_probabilities]
sum([train_p * cv_p for train_p, cv_p in zip(train_prior_probabilities,
test_prior_probabilities)])

- The prior probabilities for the training and test datasets are extracted from the
collected rows into plain Python lists.
- The products of the matching training and test priors are summed using a list
comprehension and the `zip` function.
- The result is the accuracy expected from a classifier that guesses each class at
random in proportion to its frequency in the training set. It is displayed as the cell
output and serves as a baseline for the decision tree's accuracy.
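
The sum computed in the cell above can be read as the expected accuracy of guessing at random: if class i is guessed with probability equal to its training prior and appears in the test set with its test prior, the guess is right with probability equal to the sum of the products. A minimal sketch with hypothetical priors for three classes (the function name is illustrative):

```python
def random_guess_accuracy(train_priors, test_priors):
    """Expected accuracy of guessing class i with probability train_priors[i]."""
    return sum(p_train * p_test
               for p_train, p_test in zip(train_priors, test_priors))

# Hypothetical priors for three classes; each list sums to 1
train_priors = [0.5, 0.3, 0.2]
test_priors = [0.6, 0.3, 0.1]
baseline = random_guess_accuracy(train_priors, test_priors)
print(baseline)  # approximately 0.41
```

Any trained model should beat this number comfortably to be worth using.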

In [None]:
from pyspark.ml import Pipeline
assembler = VectorAssembler(inputCols=input_cols, outputCol="featureVector")
classifier = DecisionTreeClassifier(seed=1234, labelCol="Cover_Type",
featuresCol="featureVector",
predictionCol="prediction")
pipeline = Pipeline(stages=[assembler, classifier])

In [None]:
from pyspark.ml.tuning import ParamGridBuilder
paramGrid = ParamGridBuilder(). \
addGrid(classifier.impurity, ["gini", "entropy"]). \
addGrid(classifier.maxDepth, [1, 20]). \
addGrid(classifier.maxBins, [40, 300]). \
addGrid(classifier.minInfoGain, [0.0, 0.05]). \
build()
multiclassEval = MulticlassClassificationEvaluator(). \
setLabelCol("Cover_Type"). \
setPredictionCol("prediction"). \
setMetricName("accuracy")


- `Pipeline` is imported from `pyspark.ml` and `ParamGridBuilder` from
`pyspark.ml.tuning`; `VectorAssembler` and `DecisionTreeClassifier` were imported in
earlier cells.
- A `Pipeline` is created with stages including the `assembler` and `classifier`.
- `ParamGridBuilder` is used to build a parameter grid for hyperparameter tuning of the
decision tree classifier with various combinations of `impurity`, `maxDepth`,
`maxBins`, and `minInfoGain`.
- `MulticlassClassificationEvaluator` is initialized to evaluate the accuracy of the
predictions on the "Cover_Type" column.
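
The grid is the Cartesian product of the listed values, so four hyperparameters with two values each give 16 combinations, and one model is trained per combination. The plain-Python sketch below enumerates them; the ordering is that of `itertools.product` and may differ from the order Spark uses.

```python
from itertools import product

# Same values as the parameter grid above
grid = {
    "impurity": ["gini", "entropy"],
    "maxDepth": [1, 20],
    "maxBins": [40, 300],
    "minInfoGain": [0.0, 0.05],
}
names = list(grid)
combinations = [dict(zip(names, values)) for values in product(*grid.values())]
print(len(combinations))  # 16
print(combinations[0])
```

Adding a third value to any one hyperparameter would raise the count to 24, so grids grow quickly.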


In [None]:
from pyspark.ml.tuning import TrainValidationSplit
validator = TrainValidationSplit(seed=1234,
estimator=pipeline,
evaluator=multiclassEval,
estimatorParamMaps=paramGrid,
trainRatio=0.9)
validator_model = validator.fit(train_data)

In [None]:
from pprint import pprint
best_model = validator_model.bestModel
pprint(best_model.stages[1].extractParamMap())

- `TrainValidationSplit` is used for hyperparameter tuning and model selection.
- The `estimator` is set to the previously defined `pipeline`.
- The `evaluator` is set to `multiclassEval`.
- `estimatorParamMaps` is set to the previously defined `paramGrid`.
- `trainRatio` is set to 0.9, indicating that 90% of the data will be used for training and
10% for validation.
- The `fit` method is called on `train_data` to train the model and find the best
hyperparameters.
- `best_model` retrieves the best model from the `validator_model`.
- `pprint` is used to print the parameters of the best decision tree model.
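
Model selection then amounts to picking the parameter combination with the highest validation metric. The sketch below shows that step in plain Python; the parameter combinations are abbreviated and the accuracy values are made up purely for illustration, not results from this dataset.

```python
# Hypothetical (parameters, validation accuracy) pairs; the numbers are made up
results = [
    ({"impurity": "gini", "maxDepth": 1}, 0.60),
    ({"impurity": "gini", "maxDepth": 20}, 0.90),
    ({"impurity": "entropy", "maxDepth": 1}, 0.61),
    ({"impurity": "entropy", "maxDepth": 20}, 0.92),
]
# Choose the pair whose metric (second element) is largest
best_params, best_metric = max(results, key=lambda item: item[1])
print(best_params, best_metric)
```

In PySpark, the corresponding values should be available as `validator_model.getEstimatorParamMaps()` and `validator_model.validationMetrics`, which can be zipped together in the same way.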

In [None]:
metrics = validator_model.validationMetrics
metrics.sort(reverse=True)
print(metrics[0])
multiclassEval.evaluate(best_model.transform(test_data))


- The validation metrics, one per parameter combination, are read from
`validator_model.validationMetrics`.
- The `metrics` list is sorted in descending order.
- The highest metric value is printed using `print(metrics[0])`.
- The `best_model` is used to transform `test_data`, and the accuracy is evaluated
using `multiclassEval`.
- The evaluated test accuracy is displayed as the cell output.

In [None]:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
def unencode_one_hot(data):
    wilderness_cols = ['Wilderness_Area_' + str(i) for i in range(4)]
    wilderness_assembler = VectorAssembler().\
        setInputCols(wilderness_cols).\
        setOutputCol("wilderness")
    # Position of the single 1 in the one-hot vector
    unhot_udf = udf(lambda v: v.toArray().tolist().index(1))
    with_wilderness = wilderness_assembler.transform(data).\
        drop(*wilderness_cols).\
        withColumn("wilderness", unhot_udf(col("wilderness")).cast(IntegerType()))
    soil_cols = ['Soil_Type_' + str(i) for i in range(40)]
    soil_assembler = VectorAssembler().\
        setInputCols(soil_cols).\
        setOutputCol("soil")
    with_soil = soil_assembler.\
        transform(with_wilderness).\
        drop(*soil_cols).\
        withColumn("soil", unhot_udf(col("soil")).cast(IntegerType()))
    return with_soil


In [None]:
unenc_train_data = unencode_one_hot(train_data)
unenc_train_data.printSchema()

- The function `unencode_one_hot` is defined to convert one-hot encoded features
back to categorical features.
- `wilderness_cols` and `soil_cols` are defined to specify the one-hot encoded
columns for wilderness and soil types, respectively.
- `VectorAssembler` is used to assemble the one-hot encoded columns into a single
vector column for wilderness and soil.
- A `udf` (User Defined Function) `unhot_udf` is defined to convert the assembled
vectors back to categorical values.
- The `transform` method is used to transform the data, drop the original one-hot
encoded columns, and add the new categorical columns "wilderness" and "soil".
- The schema of `unenc_train_data` is printed to show the updated columns.
- The data is grouped, the occurrences of each type are counted, and the counts are
displayed.
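
The core of the un-encoding step is finding the position of the single 1 in each one-hot group. A plain-Python sketch of that lookup, using a hypothetical row (the function here works on a list rather than a DataFrame):

```python
def unencode_one_hot(values):
    """Return the position of the single 1 in a one-hot encoded list."""
    return values.index(1)

# Hypothetical row: the third of four wilderness areas is set
wilderness = [0.0, 0.0, 1.0, 0.0]
category = unencode_one_hot(wilderness)
print(category)  # 2
```

`list.index` raises `ValueError` when no element equals 1, so this approach relies on every row having exactly one flag set in each group.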


In [None]:
from pyspark.ml.feature import VectorIndexer
cols = unenc_train_data.columns
input_cols = [c for c in cols if c!='Cover_Type']
assembler = VectorAssembler().setInputCols(input_cols).\
    setOutputCol("featureVector")
indexer = VectorIndexer().\
setMaxCategories(40).\
setInputCol("featureVector").setOutputCol("indexedVector")
classifier = DecisionTreeClassifier().setLabelCol("Cover_Type").\
setFeaturesCol("indexedVector").\
setPredictionCol("prediction")
pipeline = Pipeline().setStages([assembler, indexer, classifier])


- The `VectorAssembler` is initialized to assemble all input columns into a single
vector column named "featureVector".
- The `VectorIndexer` is used to automatically identify categorical features and index
them.
- `setMaxCategories(40)` specifies that features with more than 40 distinct values
should be treated as continuous.
- The `DecisionTreeClassifier` is set with the label column, features column, and
prediction column.
- A `Pipeline` is created with stages including the `assembler`, `indexer`, and
`classifier`.
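
The effect of the category threshold can be sketched in plain Python: a feature counts as categorical when its number of distinct values does not exceed the threshold. This is an illustration of the rule described above, not Spark's implementation; the helper name and the sample rows are hypothetical.

```python
def categorical_features(rows, max_categories):
    """Map feature index -> sorted distinct values, for low-cardinality features."""
    n_features = len(rows[0])
    result = {}
    for j in range(n_features):
        distinct = sorted({row[j] for row in rows})
        # Features with too many distinct values are treated as continuous
        if len(distinct) <= max_categories:
            result[j] = distinct
    return result

# Hypothetical rows: [elevation, wilderness, slope]
rows = [
    [2596.0, 0.0, 28.0],
    [2590.0, 2.0, 27.0],
    [2804.0, 0.0, 11.0],
    [2785.0, 3.0, 29.0],
]
found = categorical_features(rows, max_categories=3)
print(found)  # {1: [0.0, 2.0, 3.0]}
```

A threshold of 40 is enough here because the largest categorical feature, "soil", has 40 possible values.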


In [None]:
unenc_test_data = unencode_one_hot(test_data)
best_model.transform(unenc_test_data.drop("Cover_Type")).\
select("prediction").show(1)

- The `unencode_one_hot` function is used to preprocess `test_data` in the same way as
`train_data`.
- The `best_model` is used to transform the preprocessed `unenc_test_data` after
dropping the "Cover_Type" column. This requires a `best_model` fitted with the new
pipeline on `unenc_train_data`; the model fitted earlier expects the one-hot columns
that `unencode_one_hot` drops.
- `select("prediction")` with `show(1)` displays the prediction for the first row of
the test data.