# DS/CMPSC 410 Spring 2025
# Instructor: Professor John Yen
# TAs: Jin Peng and Jingxi Zhu

## Lab 8 Decision Tree Learning Using ML Pipeline, Visualization, and Hyperparameter Tuning

## The goals of this lab are for you to be able to
- Understand the function of the different steps/stages involved in a Spark ML pipeline
- Construct a decision tree using the Spark ML machine learning module
- Generate a visualization of decision trees
- Perform automated hyperparameter tuning for decision trees

## The data set used in this lab is a Breast Cancer diagnosis dataset.

## Submit the following items for Lab 8 (DT)
- Completed Jupyter Notebook of Lab 8 (in HTML format)
- A visualization of the decision tree generated in Part 5.
- The output file that contains the best DT hyperparameters for Part 6.
- A visualization of the best decision tree generated in Part 6.
- The output file that contains the best DT hyperparameters for Part 7.
- A visualization of the best decision tree generated in Part 7.

## Total Number of Exercises: 8
- Exercise 1: 5 points
- Exercise 2: 5 points
- Exercise 3: 10 points  
- Exercise 4: 10 points 
- Exercise 5: 20 points
- Exercise 6: 10 points
- Exercise 7: 30 points
- Exercise 8: 10 points
## Total Points: 100 points

# Due: midnight, March 23, 2025

# Load and set up the Python files for this Lab based on the instructions in "SpecialInstructionsLab 8" in Canvas (under Module Topic and Lab 8)
1. Create a "Lab8DT" directory in the work directory of your home directory.
2. Create a subdirectory under "Lab8DT" called "decision_tree_plot" (name the directory EXACTLY this way).
3. Upload the following three files in Module 8 from Canvas to the decision_tree_plot subdirectory
- decision_tree_parser.py
- decision_tree_plot.py
- tree_template.jinjia2
4. After you have completed the steps above, upload the Jupyter Notebook for Lab 8 (Lab8_DT_F24.ipynb) to the Lab8DT directory.
5. Upload the data file breast-cancer-wisconsin.data.txt from module 8 in Canvas to the "Lab8DT" directory.
6. Open the Jupyter Notebook for Lab 8 and follow the instructions to complete the lab.

# Follow the instructions below and execute the PySpark code cell by cell below. Make modifications as required.

In [None]:
import pyspark
import pandas as pd
import csv

## Notice that we use PySpark SQL module to import SparkSession because ML works with SparkSession
## Notice also the different methods imported from ML and three submodules of ML: classification, feature, and evaluation.

In [None]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, LongType, IntegerType, FloatType
import pyspark.sql.functions as F
from pyspark.sql.types import *
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler, IndexToString
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator

## The following two lines import relevant functions from the two python files you uploaded into the decision_tree_plot subdirectory.

In [None]:
from decision_tree_plot.decision_tree_parser import decision_tree_parse
from decision_tree_plot.decision_tree_plot import plot_trees

## This lab runs Spark in the local mode because the size of data is small. 
## When you need to develop a decision-tree-based predictive model for a large dataset, first debug the code in local mode on a small sample of the data. Once it runs successfully in local mode, convert it to cluster mode to run on the full dataset, as in previous labs.
### Notice we are creating a SparkSession, not a SparkContext, when we use ML pipeline.
### The "getOrCreate()" method means we can re-evaluate this without a need to "stop the current SparkSession" first (unlike SparkContext).

In [None]:
ss=SparkSession.builder.master("local").appName("lab 8 DT").getOrCreate()

In [None]:
ss.sparkContext.setLogLevel("WARN")

## Exercise 1: (5 points) Enter your name below:
- My Name: 

## As we have seen in previous labs, we can define the schema for reading the input data into a PySpark DataFrame.

## Exercise 2: (5 points) Complete the following path with the path for your home directory.  

In [None]:
bc_schema = StructType([ StructField("id", IntegerType(), False ), \
                        StructField("clump_thickness", IntegerType(), False), \
                        StructField("unif_cell_size", IntegerType(), False ), \
                        StructField("unif_cell_shape", IntegerType(), False ), \
                        StructField("marg_adhesion", IntegerType(), False), \
                        StructField("single_epith_cell_size", IntegerType(), False), \
                        StructField("bare_nuclei", IntegerType(), False),\
                        StructField("bland_chrom", IntegerType(), False), \
                        StructField("norm_nucleoli", IntegerType(), False), \
                        StructField("mitoses", IntegerType(), False), \
                        StructField("class", StringType(), False) \
                           ])

In [None]:
data = ss.read.csv("/storage/home/juy1/work/Lab8DT/breast-cancer-wisconsin.data.txt", schema=bc_schema, header=True, inferSchema=False)

# Part 1 Feature Transformation Using DataFrame

In [None]:
data.printSchema()

In [None]:
data.show(5)

In [None]:
from pyspark.sql.functions import col
class_count = data.groupBy(col("class")).count()
class_count.show()

# Detecting and Filtering Rows with missing values

In [None]:
data.filter(col("bare_nuclei").isNull()).show()

In [None]:
data2 = data.filter(col("bare_nuclei").isNotNull())

In [None]:
data2.count()

In [None]:
data.count()

In [None]:
from pyspark.sql.functions import col
class_count2 = data2.groupBy(col("class")).count()
class_count2.show()

## We will use data2 (rather than data) for the rest of the lab, because it does not contain missing/null values.

# Part 2 Feature Transformation

## StringIndexer
- Transforms a column of strings into a new column of indices (type double).
- The feature transformation involves three steps:
  - Step 1: Create a "transformer"
  - Step 2: Use the data (which contains all possible values of the string column) to create a mapping (of these strings into an integer/index)
  - Step 3: Use the mapping to generate the new column's value (i.e., transformed index) for each row.
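
The three steps above can be mimicked in plain Python to see what the mapping looks like. This is only a sketch, not Spark's implementation: it assumes StringIndexer's default ordering (the most frequent string gets index 0.0, with ties broken alphabetically). The helper names and the class strings are hypothetical and are not taken from the lab's data file.

```python
from collections import Counter

def fit_string_index(values):
    """Step 2: build a string -> index mapping, most frequent string first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    return {label: float(i) for i, label in enumerate(ordered)}

def transform_string_index(values, mapping):
    """Step 3: replace each string with its index."""
    return [mapping[v] for v in values]

# Hypothetical class column (not the lab's actual data).
classes = ["benign", "malignant", "benign", "benign", "malignant"]
mapping = fit_string_index(classes)
indexed = transform_string_index(classes, mapping)
print(mapping)   # {'benign': 0.0, 'malignant': 1.0}
print(indexed)   # [0.0, 1.0, 0.0, 0.0, 1.0]
```

In the lab, `labelIndexer.labels` plays the role of the ordered label list, which is why it can later be passed to `IndexToString` to convert predictions back to class strings.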

In [None]:
labelIndexer= StringIndexer(inputCol="class", outputCol="indexedLabel").fit(data2)

In [None]:
labelIndexer

In [None]:
transformed_data = labelIndexer.transform(data2)

In [None]:
transformed_data.show(10)

In [None]:
input_features = ['clump_thickness', 'unif_cell_size', 'unif_cell_shape', 'marg_adhesion', \
                  'single_epith_cell_size', 'bare_nuclei', 'bland_chrom', 'norm_nucleoli', 'mitoses']

In [None]:
assembler = VectorAssembler(inputCols=input_features, outputCol="features")

In [None]:
assembler

In [None]:
vectorized_data = assembler.transform(transformed_data)

## We will use `vectorized_data` for splitting labelled data into training data and testing data. This way, we have access to all the original features, which we will need for generating decision tree visualizations.

In [None]:
vectorized2_data = vectorized_data.select("features",'indexedLabel')
vectorized2_data.show(10)

# Part 3 Decision Tree Learning and Evaluation (1 hyperparameter setting)

## randomSplit is a DataFrame method that splits the rows into subsets (here two: one for training, the other for testing) according to the given weights, using a number as the seed for the random number generator.
## If you want to generate a different split, you can use a different seed (preferably a prime number).
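
A rough plain-Python picture of a seeded random split follows. It is a sketch under the assumption that each row is assigned by an independent random draw, so the split sizes only approximate the requested weights. `seeded_split` is a hypothetical helper, not Spark's algorithm.

```python
import random

def seeded_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(100))
train_a, test_a = seeded_split(rows, 0.75, seed=1237)
train_b, test_b = seeded_split(rows, 0.75, seed=1237)

print(len(train_a), len(test_a))   # close to 75/25, not necessarily exact
print(train_a == train_b)          # True: the same seed reproduces the same split
```

Fixing the seed is what makes the reported f1 repeatable from one run to the next.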

In [None]:
trainingData, testData= vectorized_data.randomSplit([0.75, 0.25], seed=1237)

In [None]:
dt=DecisionTreeClassifier(featuresCol="features", labelCol="indexedLabel", maxDepth=6, minInstancesPerNode=2)

In [None]:
dt

In [None]:
dt_model = dt.fit(trainingData)

In [None]:
dt_model

In [None]:
test_prediction = dt_model.transform(testData)

In [None]:
test_prediction.show(3)

# To compare the actual labels and predicted labels more easily, we can select the following columns.
## The `probability` column records the probability for the row to be in the "zero/benign class" and the probability to be in the "one/malignant class".
## The `prediction` column records the predicted label for each row, which is the class with the higher probability.
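
The relationship between the two columns can be checked by hand: with default settings, the prediction is the index of the largest entry in the probability vector. The probability values below are made up for illustration, and `predict_from_probability` is a hypothetical helper.

```python
# Hypothetical probability vectors: [P(class 0), P(class 1)] for three rows.
probabilities = [[0.97, 0.03], [0.10, 0.90], [0.55, 0.45]]

def predict_from_probability(prob):
    """Return the index of the most probable class as a float, like the prediction column."""
    return float(max(range(len(prob)), key=lambda i: prob[i]))

predicted = [predict_from_probability(p) for p in probabilities]
print(predicted)  # [0.0, 1.0, 0.0]
```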

In [None]:
test_prediction.select("features","class","indexedLabel", "probability", "prediction").show(5)

In [None]:
labelIndexer.labels

In [None]:
labelConverter=IndexToString(inputCol="prediction", outputCol="predictedClass", labels=labelIndexer.labels)

In [None]:
test2_prediction = labelConverter.transform(test_prediction)

In [None]:
test2_prediction.select("features","class","indexedLabel","prediction","predictedClass").show(5)

In [None]:
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="f1")

In [None]:
f1 = evaluator.evaluate(test_prediction)
print("f1 score:", f1)
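
With `metricName="f1"`, `MulticlassClassificationEvaluator` reports a weighted F1: the F1 of each class, averaged with weights equal to each class's share of the true labels. The plain-Python sketch below computes that quantity on a tiny made-up example; `weighted_f1` is a hypothetical helper, and the labels and predictions are not from the lab data.

```python
def weighted_f1(labels, predictions):
    """Per-class F1 averaged with weights equal to each class's share of the true labels."""
    total = len(labels)
    score = 0.0
    for cls in sorted(set(labels)):
        pairs = list(zip(labels, predictions))
        tp = sum(1 for y, p in pairs if y == cls and p == cls)
        fp = sum(1 for y, p in pairs if y != cls and p == cls)
        fn = sum(1 for y, p in pairs if y == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * (tp + fn) / total   # weight by the class's true frequency
    return score

example_labels      = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]
example_predictions = [0.0, 0.0, 0.0, 1.0, 1.0, 0.0]
print(round(weighted_f1(example_labels, example_predictions), 4))  # 0.6667
```

Here class 0 has F1 = 0.75 with weight 4/6 and class 1 has F1 = 0.5 with weight 2/6, giving 2/3 overall.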

# Part 4 DT Learning Using ML Pipeline

## Exercise 3: (10 points) In the code cell below, fill in a value for maxDepth (recommend: 2 to 5) and a value of minInstancesPerNode (recommend: 1 to 5). Run the entire sequence of code below to generate a decision tree (using pipeline) and compute f1 measure of the testing data.
- After you run the code successfully, record the f1 measure and your choice of max_depth and minInstancesPerNode below:

## Answer for Exercise 3: 
- The f1 measure of testing data for max_depth ??? and minInstancesPerNode ??? = ???

In [None]:
trainingData, testData= data2.randomSplit([0.8, 0.2], seed=1234)

In [None]:
labelIndexer= StringIndexer(inputCol="class", outputCol="indexedLabel").fit(data2)
assembler = VectorAssembler( inputCols=input_features, outputCol="features")
dt=DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="features", maxDepth=??, minInstancesPerNode=??)
predictionConverter = IndexToString(inputCol="prediction", outputCol="predictedClass", labels=labelIndexer.labels)
pipeline = Pipeline(stages=[labelIndexer, assembler, dt, predictionConverter])
model = pipeline.fit(trainingData)
predictions = model.transform(testData)

In [None]:
pipeline

In [None]:
model

In [None]:
predictions.select("class","indexedLabel","features","prediction","predictedClass").show(10)

In [None]:
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="f1")

In [None]:
f1 = evaluator.evaluate(predictions)
print("f1 score:", f1)

# Part 5 Decision Tree Visualization

## stages[2] of the pipeline is "dt" (DecisionTreeClassifier). 
## model is a PipelineModel (the fitted pipeline), not a DataFrame.
## model.stages[2] gives us the Decision Tree model learned.

In [None]:
DTmodel = model.stages[2]
print(DTmodel)

## Exercise 4: (10 points) 
- Complete the code below to generate a visualization of the decision tree.
- Fill in your PSU ID in the path for ``model_path`` and ``output_path``
- Save the visualization of the tree in a file that replaces ??? with your first initial and last name.  For example, I would name the file as "DTree_jyen_Part5.html".
- Download the HTML file of the tree and submit it as a part of Lab8 assignment.
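
The `column` dictionary built in the code below maps each feature's position in the assembled vector (as a string key) to its name, presumably so that the plot can show feature names instead of indices. A shortened, standalone version of that expression shows what it produces; the three-feature list is an abbreviated example.

```python
# Abbreviated feature list for illustration (the lab uses nine features).
example_features = ['clump_thickness', 'unif_cell_size', 'unif_cell_shape']

# Same mapping as in the lab code, written as a dict comprehension.
column = {str(idx): name for idx, name in enumerate(example_features)}
print(column)
# {'0': 'clump_thickness', '1': 'unif_cell_size', '2': 'unif_cell_shape'}
```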

In [None]:
model_path="/storage/home/???/work/Lab8DT/DTmodel_vis"

In [None]:
tree=decision_tree_parse(DTmodel, ss, model_path)
column = dict([(str(idx), i) for idx, i in enumerate(input_features)])
plot_trees(tree, column = column, output_path = '/storage/home/???/work/Lab8DT/DTree_??_Part5.html')

# Part 6 Automated Hyperparameter Tuning for Decision Tree

## Exercise 5: (20 points)  
- Complete the code below to perform hyper parameter tuning of Decision Tree (for two parameters: max_depth and minInstancesPerNode)
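
The tuning loop follows a standard grid-search pattern: try every combination, score it, and remember the best one. A minimal plain-Python sketch of that pattern follows. `fake_score` is a hypothetical stand-in for fitting a pipeline and evaluating its testing f1, and its values are meaningless.

```python
from itertools import product

def fake_score(max_depth, min_instances):
    """Hypothetical stand-in for 'fit a model and return its testing f1'."""
    return 0.9 + 0.01 * max_depth - 0.005 * min_instances

max_depth_list = [2, 3]
min_instances_list = [2, 3]

results = []          # one row per hyperparameter combination
best_score = 0.0      # f1 is never negative, so 0 is a safe starting point
best_params = None

for max_depth, min_instances in product(max_depth_list, min_instances_list):
    score = fake_score(max_depth, min_instances)
    results.append((max_depth, min_instances, score))
    if score > best_score:   # strict '>' keeps the first combination that reaches the top score
        best_score = score
        best_params = (max_depth, min_instances)

print(best_params, len(results))  # (3, 2) 4
```

The lab code uses two nested `for` loops instead of `itertools.product`; the two forms visit the same combinations in the same order.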

In [None]:
input_features = ['clump_thickness', 'unif_cell_size', 'unif_cell_shape', 'marg_adhesion', \
                  'single_epith_cell_size', 'bare_nuclei', 'bland_chrom', 'norm_nucleoli', 'mitoses']

In [None]:
trainingData, testingData= data2.randomSplit([0.8, 0.2], seed=12347)
model_path="/storage/home/???/work/Lab8DT/DTmodel_vis_Part6"

In [None]:
## Initialize a Pandas DataFrame to store evaluation results of all combination of hyper-parameter settings
hyperparams_eval_df = pd.DataFrame( columns = ['max_depth', 'minInstancesPerNode', 'training f1', \
                                               'testing f1', 'Best Model'] )
# Initialize the row index into hyperparams_eval_df to 0
index = 0
# Initialize the highest testing f1 seen so far
highest_testing_f1 = 0
# Set up the possible hyperparameter values to be evaluated
max_depth_list = [2, 3]
minInstancesPerNode_list = [2, 3]
labelIndexer = StringIndexer(inputCol="class", outputCol="indexedLabel").fit(data2)
assembler = VectorAssembler( inputCols=input_features, outputCol="features")
labelConverter = IndexToString(inputCol = "prediction", outputCol="predictedClass", labels=labelIndexer.labels)
trainingData.persist()
testingData.persist()
for max_depth in max_depth_list:
    for minInstPN in minInstancesPerNode_list:
        seed = 37
        # Construct a DT model using a set of hyper-parameter values and training data
        dt= DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="features", maxDepth= ???, \
                                   minInstancesPerNode= ???)
        pipeline = Pipeline(stages=[???, ???, dt, ???])
        model = pipeline.fit(trainingData)
        training_predictions = model.transform(trainingData)
        testing_predictions = model.transform(testingData)
        evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", predictionCol="prediction", \
                                                      metricName="f1")
        training_f1 = evaluator.evaluate(training_predictions)
        testing_f1 = evaluator.evaluate(testing_predictions)
        # We use 0 as default value of the 'Best Model' column in the Pandas DataFrame.
        # The best model will have a value 1000
        hyperparams_eval_df.loc[index] = [ max_depth, minInstPN, training_f1, testing_f1, 0]  
        index = index +1
        if testing_f1 > ??? :
            best_max_depth = max_depth
            best_minInstPN = minInstPN
            best_index = index -1
            best_parameters_training_f1 = training_f1
            best_DTmodel= model.stages[??]
            best_tree = decision_tree_parse(best_DTmodel, ss, model_path)
            column = dict( [ (str(idx), i) for idx, i in enumerate(input_features) ])           
            highest_testing_f1 = testing_f1
print('The best max_depth is ', best_max_depth, ', best minInstancesPerNode = ', \
      best_minInstPN, ', testing f1 = ', highest_testing_f1) 
column = dict([(str(idx), i) for idx, i in enumerate(input_features)])

In [None]:
best_model_path="/storage/home/???/work/Lab8DT/BestDTmodel_???_Part6"

In [None]:
best_tree=decision_tree_parse(best_DTmodel, ss, best_model_path)
column = dict([(str(idx), i) for idx, i in enumerate(input_features)])
plot_trees(best_tree, column = column, output_path = '/storage/home/???/work/Lab8DT/BestDTtree_???_tuned_Part6.html')

In [None]:
# Mark the best model's row in the DataFrame by setting its 'Best Model' value to 1000
hyperparams_eval_df.loc[best_index]=[best_max_depth, best_minInstPN, best_parameters_training_f1, highest_testing_f1, 1000]

## Exercise 6 (10 points)
### Complete the path below to save the result of your hyperparameter tuning in a csv file.

In [None]:
output_path = "/storage/home/???/work/Lab8DT/Lab8DT_HyperparamsTuningResult_Table.csv"
hyperparams_eval_df.to_csv(output_path)  

# Part 7 A Revised Hyper-parameter Tuning 

# Exercise 7 (30 points) Copy the code for Part 6 to the code cells below and modify the range of hyper-parameter tuning into the following:
- max_depth: 2 to 11
- minInstancesPerNode: 2 to 10
## Modify the file names of your Decision Tree visualization files (.html) so that you do not accidentally destroy the visualization generated for Exercise 5 (Part 6).
## Modify the file names of your output files so that you can compare the results of Part 7 (both decision tree and best hyper parameters) with those of Part 6.
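
When building the new value lists, keep in mind that Python's `range` excludes its stop value. The sketch below assumes both ranges in the exercise are inclusive and counts how many models the loop would then train.

```python
# range(start, stop) stops before `stop`, so add 1 to include the upper bound.
max_depth_list = list(range(2, 11 + 1))             # 2, 3, ..., 11
minInstancesPerNode_list = list(range(2, 10 + 1))   # 2, 3, ..., 10

n_combinations = len(max_depth_list) * len(minInstancesPerNode_list)
print(n_combinations)  # 90
```

With 90 combinations instead of 4, expect the tuning cell to take noticeably longer than in Part 6.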

In [None]:
#Code cell for Part 7

In [None]:
#Code cell for Part 7

In [None]:
#Code cell for Part 7

In [None]:
#Code cell for Part 7

# Exercise 8 (10 points) Compare (a) the hyper-parameters and (b) the decision trees generated by Part 6 and Part 7. Discuss the results in the Markdown cell below.

# Answer for Exercise 8: 
- (a)
- (b)

In [None]:
ss.stop()