
# 🌳 Display Decision Tree Classification Results (Spark MLlib)

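This guide shows how to **load saved prediction results**, display predictions alongside the original features and labels, compute basic summary statistics, and save the final output. The numbered steps use the Scala API; the PySpark section contains a complete Python example.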

---

## ✅ Step 1: Load Saved Prediction Results

Assuming the results were saved as CSV with a header row under:

```
output/decision_tree_predictions
```

```scala
val predDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("output/decision_tree_predictions")
```
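
CSV has no vector type, so every value in the file is text. Assuming `features` was written as a bracketed string such as `[25,50000,650]` (the shape shown in this guide's sample output), it comes back as a plain string column. The sketch below is plain Python, not part of the Spark API: `parse_features` is a hypothetical helper that turns one such string back into numbers, for example after collecting a few rows to the driver.

```python
def parse_features(text):
    """Parse a bracketed, comma-separated string like "[25,50000,650]" into floats."""
    inner = text.strip()
    if not (inner.startswith("[") and inner.endswith("]")):
        raise ValueError(f"expected a bracketed list, got: {text!r}")
    inner = inner[1:-1].strip()
    if not inner:
        # "[]" carries no values
        return []
    return [float(part) for part in inner.split(",")]


print(parse_features("[25,50000,650]"))  # [25.0, 50000.0, 650.0]
```

If the file was written in a different layout (for example one column per feature), this step is unnecessary.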

---

## ✅ Step 2: Display Predictions with Original Data

```scala
predDF.select("features", "label", "prediction").show(false)
```

---

## ✅ Step 3: Count Predicted Classes

```scala
val classCounts = predDF.groupBy("prediction").count()
classCounts.show()
```

---

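## ✅ Step 4: Save Final Classified Dataset

```scala
// Write a header row so the column names survive a later read with header=true
predDF.write
  .mode("overwrite")
  .option("header", "true")
  .csv("output/final_classified_data")
```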

---

## ✅ Step 5: Basic Summary

```scala
println("Total Records: " + predDF.count())
predDF.describe("label", "prediction").show()
```

---

## 📌 Sample Output

### **Prediction Display**

```
+----------------+-----+----------+
|features        |label|prediction|
+----------------+-----+----------+
|[25,50000,650]  |1.0  |1.0       |
|[40,80000,700]  |1.0  |1.0       |
|[35,30000,550]  |0.0  |0.0       |
|[50,90000,720]  |1.0  |1.0       |
|[28,40000,600]  |0.0  |0.0       |
+----------------+-----+----------+
```

### **Class Counts**

```
+----------+-----+
|prediction|count|
+----------+-----+
|0.0       |2    |
|1.0       |3    |
+----------+-----+
```
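
Grouping on `prediction` alone shows how many rows landed in each class, but not how many of them were right. Grouping on the `(label, prediction)` pair instead gives the cells of a confusion matrix. A minimal sketch in plain Python, using the five rows from the Prediction Display table above (the variable names are illustrative):

```python
from collections import Counter

# (label, prediction) pairs copied from the sample Prediction Display table
rows = [(1.0, 1.0), (1.0, 1.0), (0.0, 0.0), (1.0, 1.0), (0.0, 0.0)]

# Same result as grouping by prediction: 0.0 appears twice, 1.0 three times
class_counts = Counter(pred for _, pred in rows)

# Each key is one confusion-matrix cell; pairs that never occur count as 0
confusion = Counter(rows)

print(dict(class_counts))
print(dict(confusion))
```

On a Spark DataFrame the same cells come from grouping by both columns, as the PySpark section does with `predictions.groupBy("label", "prediction").count()`.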

### **Summary**

```
Total Records: 5
```
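
The introduction mentions basic metrics, while the steps above stop at counts and `describe`. As a rough sketch, accuracy plus precision and recall for one class can be computed from `(label, prediction)` pairs. The helper `binary_metrics` is hypothetical, and treating `1.0` as the positive class is an assumption.

```python
def binary_metrics(rows, positive=1.0):
    """Return accuracy, precision and recall for (label, prediction) pairs."""
    tp = sum(1 for label, pred in rows if label == positive and pred == positive)
    fp = sum(1 for label, pred in rows if label != positive and pred == positive)
    fn = sum(1 for label, pred in rows if label == positive and pred != positive)
    correct = sum(1 for label, pred in rows if label == pred)
    return {
        # Guard each ratio against an empty denominator
        "accuracy": correct / len(rows) if rows else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# The five sample rows: every prediction matches its label
sample = [(1.0, 1.0), (1.0, 1.0), (0.0, 0.0), (1.0, 1.0), (0.0, 0.0)]
print(binary_metrics(sample))
```

With only five rows, all predicted correctly, every metric comes out as 1.0; a real evaluation needs a held-out test set of meaningful size.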


## PySpark

```python
# ==========================================
# Display Classification Results
# Decision Tree - Spark MLlib
# ==========================================

import numpy as np
import pandas as pd

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline
from pyspark.sql.functions import col

# -------------------------------
# 1️⃣ Create Spark Session
# -------------------------------
spark = SparkSession.builder \
    .appName("DecisionTreeResultsDisplay") \
    .getOrCreate()

# -------------------------------
# 2️⃣ Generate Random Dataset
# -------------------------------
np.random.seed(42)

n_samples = 500
n_features = 3

X = np.random.rand(n_samples, n_features)
y = (X.sum(axis=1) > 1.5).astype(int)

columns = [f"feature_{i}" for i in range(n_features)]
pdf = pd.DataFrame(X, columns=columns)
pdf["label"] = y

df = spark.createDataFrame(pdf)

# -------------------------------
# 3️⃣ Feature Assembler
# -------------------------------
assembler = VectorAssembler(
    inputCols=columns,
    outputCol="features"
)

# -------------------------------
# 4️⃣ Train/Test Split
# -------------------------------
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

# -------------------------------
# 5️⃣ Decision Tree Model
# -------------------------------
dt = DecisionTreeClassifier(
    labelCol="label",
    featuresCol="features",
    maxDepth=4
)

pipeline = Pipeline(stages=[assembler, dt])
model = pipeline.fit(train_df)

# -------------------------------
# 6️⃣ Predictions
# -------------------------------
predictions = model.transform(test_df)

# -------------------------------
# 7️⃣ Display Classification Results
# -------------------------------

print("=== Sample Predictions ===")
predictions.select(
    "features",
    "label",
    "prediction",
    "probability"
).show(10, truncate=False)

# -------------------------------
# 8️⃣ Accuracy
# -------------------------------
evaluator = MulticlassClassificationEvaluator(
    labelCol="label",
    predictionCol="prediction",
    metricName="accuracy"
)

accuracy = evaluator.evaluate(predictions)
print(f"\nModel Accuracy: {accuracy:.4f}")

# -------------------------------
# 9️⃣ Confusion Matrix
# -------------------------------
print("\n=== Confusion Matrix ===")
conf_matrix = predictions.groupBy("label", "prediction").count()
conf_matrix.show()

# -------------------------------
# 🔟 Feature Importances
# -------------------------------
tree_model = model.stages[-1]
print("\n=== Feature Importances ===")
print(tree_model.featureImportances)

# -------------------------------
# 1️⃣1️⃣ Print Tree Structure
# -------------------------------
print("\n=== Decision Tree Structure ===")
print(tree_model.toDebugString)

# -------------------------------
# Stop Spark
# -------------------------------
spark.stop()
```

=== Sample Predictions ===
+--------------------------------------------------------------+-----+----------+----------------------------------------+
|features                                                      |label|prediction|probability                             |
+--------------------------------------------------------------+-----+----------+----------------------------------------+
|[0.016587828927856152,0.512093058299281,0.22649577519793795]  |0    |0.0       |[1.0,0.0]                               |
|[0.035942273796742086,0.46559801813246016,0.5426446347075766] |0    |0.0       |[0.6153846153846154,0.38461538461538464]|
|[0.040775141554763916,0.5908929431882418,0.6775643618422824]  |0    |0.0       |[0.6153846153846154,0.38461538461538464]|
|[0.06936130087516545,0.10077800137742665,0.018221825651549728]|0    |0.0       |[1.0,0.0]                               |
|[0.0944429607559284,0.6830067734163568,0.07118864846022899]   |0    |0.0       |[1.0,0.0]                      