
# 🌳 Decision Tree Classification Using Spark MLlib

This guide walks through **Decision Tree Classification** with Spark MLlib: a Scala example on a small loan dataset, followed by a PySpark example on synthetic data.

---

## 📂 Dataset (`loan.csv`)

```text
id,age,income,credit_score,approved
1,25,50000,650,1
2,40,80000,700,1
3,35,30000,550,0
4,50,90000,720,1
5,28,40000,600,0
```

* **Label**: `approved` (1 = approved, 0 = not approved)
* **Features**: `age`, `income`, `credit_score` (the `id` column is not used for training)

---

# ✅ Step 1: Load Dataset

```scala
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("loan.csv")
```
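
The `header` option takes column names from the first line of the file, and `inferSchema` asks Spark to scan the values and choose column types instead of reading everything as strings. As a rough illustration of those two behaviours outside Spark, the stdlib-only Python sketch below parses the same rows; the `infer` helper is a hypothetical simplification and not how Spark infers types.

```python
import csv
import io

# Same content as loan.csv, inlined so the sketch is self-contained.
raw = """id,age,income,credit_score,approved
1,25,50000,650,1
2,40,80000,700,1
3,35,30000,550,0
4,50,90000,720,1
5,28,40000,600,0
"""

def infer(value):
    """Very rough type inference: try int, then float, else keep the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value

# DictReader uses the first line as column names, like header=true.
reader = csv.DictReader(io.StringIO(raw))
rows = [{name: infer(value) for name, value in row.items()} for row in reader]

print(rows[0])
```

Without the inference step every value would stay a string, which is why `inferSchema` (or an explicit schema) matters before numeric columns are assembled into features.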

---

# ✅ Step 2: Preprocess Data (Create Feature Vector)

```scala
import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(Array("age", "income", "credit_score"))
  .setOutputCol("features")

val data = assembler.transform(df)
  .select("features", "approved")
  .withColumnRenamed("approved", "label")
```
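
To make the assembler's role concrete, here is a minimal pure-Python sketch of the same idea without Spark: each row's feature columns are packed, in the listed order, into one numeric vector, and `approved` is carried along as `label`. The `assemble` helper is hypothetical and only mimics the shape of the result; it is not Spark's implementation.

```python
# Conceptual stand-in for VectorAssembler; `assemble` is a hypothetical helper.
rows = [
    {"id": 1, "age": 25, "income": 50000, "credit_score": 650, "approved": 1},
    {"id": 2, "age": 40, "income": 80000, "credit_score": 700, "approved": 1},
    {"id": 3, "age": 35, "income": 30000, "credit_score": 550, "approved": 0},
    {"id": 4, "age": 50, "income": 90000, "credit_score": 720, "approved": 1},
    {"id": 5, "age": 28, "income": 40000, "credit_score": 600, "approved": 0},
]

FEATURE_COLS = ["age", "income", "credit_score"]

def assemble(row, input_cols, label_col):
    """Pack the chosen columns into one float vector and keep the label."""
    features = [float(row[col]) for col in input_cols]
    return {"features": features, "label": row[label_col]}

data = [assemble(row, FEATURE_COLS, "approved") for row in rows]
for record in data:
    print(record)
```

The order of `FEATURE_COLS` fixes the position of each feature in the vector, so the same order has to be used when assembling data for prediction.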

---

# ✅ Step 3: Train Decision Tree Model

```scala
import org.apache.spark.ml.classification.DecisionTreeClassifier

val dt = new DecisionTreeClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

val model = dt.fit(data)
```
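
A decision tree is grown by choosing, at each node, the split that most reduces an impurity measure; Gini impurity is the default for Spark's `DecisionTreeClassifier`. The sketch below, in plain Python rather than Spark, scores every single-feature threshold on the five loan rows by weighted Gini impurity. The helpers `gini` and `best_split` are illustrative names, and the exhaustive midpoint search is a simplification: Spark bins continuous features before searching, so its thresholds may differ.

```python
# Illustrative only: exhaustive Gini-based split search on the loan rows.
rows = [
    # (age, income, credit_score, approved)
    (25, 50000, 650, 1),
    (40, 80000, 700, 1),
    (35, 30000, 550, 0),
    (50, 90000, 720, 1),
    (28, 40000, 600, 0),
]
FEATURES = ["age", "income", "credit_score"]

def gini(labels):
    """Gini impurity of a list of 0/1 labels: 1 - sum of squared class shares."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 1.0 - (p1 ** 2 + (1.0 - p1) ** 2)

def best_split(rows):
    """Return (weighted_gini, feature_name, threshold) of the best split."""
    best = None
    n = len(rows)
    for j, name in enumerate(FEATURES):
        values = sorted({r[j] for r in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r[-1] for r in rows if r[j] <= t]
            right = [r[-1] for r in rows if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, name, t)
    return best

root_impurity = gini([r[-1] for r in rows])
score, feature, threshold = best_split(rows)
print(f"root Gini = {root_impurity:.2f}")
print(f"best split: {feature} <= {threshold} (weighted Gini = {score:.2f})")
```

On this data both `income <= 45000` and `credit_score <= 625` separate the classes perfectly; the sketch reports the first one it finds. Because a single perfect split exists, a tree fitted to these five rows can reproduce every training label, which says little about how it would perform on unseen applicants.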

---

# ✅ Step 4: Generate Predictions

```scala
val predictions = model.transform(data)
predictions.select("features", "label", "prediction").show()
```

---

# ✅ Step 5: Save Prediction Results

```scala
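// The CSV writer rejects vector columns such as `features`, so write only scalar columns.
predictions.select("label", "prediction").write.mode("overwrite").csv("output/decision_tree_predictions")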
```

---

## 📌 Sample Output

```text
+----------------+-----+----------+
|features        |label|prediction|
+----------------+-----+----------+
|[25,50000,650]  |1.0  |1.0       |
|[40,80000,700]  |1.0  |1.0       |
|[35,30000,550]  |0.0  |0.0       |
|[50,90000,720]  |1.0  |1.0       |
|[28,40000,600]  |0.0  |0.0       |
+----------------+-----+----------+
```


## PySpark

```python
# ==========================================
# Decision Tree Classification - Random Data
# NumPy + Pandas + Spark MLlib
# ==========================================

import numpy as np
import pandas as pd

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline

# -------------------------------
# 1️⃣ Create Spark Session
# -------------------------------
spark = SparkSession.builder \
    .appName("DecisionTreeRandomData") \
    .getOrCreate()

# -------------------------------
# 2️⃣ Generate Random Dataset (NumPy)
# -------------------------------

np.random.seed(42)

n_samples = 1000
n_features = 4

# Generate random feature values
X = np.random.rand(n_samples, n_features)

# Create a synthetic binary target
# Rule: if sum of features > 2 → class 1 else class 0
y = (X.sum(axis=1) > 2).astype(int)

# Convert to Pandas DataFrame
columns = [f"feature_{i}" for i in range(n_features)]
pdf = pd.DataFrame(X, columns=columns)
pdf["label"] = y

# -------------------------------
# 3️⃣ Convert Pandas → Spark DataFrame
# -------------------------------
df = spark.createDataFrame(pdf)

# -------------------------------
# 4️⃣ Feature Vector Assembler
# -------------------------------
assembler = VectorAssembler(
    inputCols=columns,
    outputCol="features"
)

# -------------------------------
# 5️⃣ Train-Test Split
# -------------------------------
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

# -------------------------------
# 6️⃣ Decision Tree Model
# -------------------------------
dt = DecisionTreeClassifier(
    labelCol="label",
    featuresCol="features",
    maxDepth=5
)

pipeline = Pipeline(stages=[assembler, dt])

# -------------------------------
# 7️⃣ Train Model
# -------------------------------
model = pipeline.fit(train_df)

# -------------------------------
# 8️⃣ Predictions
# -------------------------------
predictions = model.transform(test_df)

# -------------------------------
# 9️⃣ Evaluate Model
# -------------------------------
evaluator = MulticlassClassificationEvaluator(
    labelCol="label",
    predictionCol="prediction",
    metricName="accuracy"
)

accuracy = evaluator.evaluate(predictions)

print("Model Accuracy:", accuracy)

# -------------------------------
# 🔟 Show Sample Predictions
# -------------------------------
predictions.select("features", "label", "prediction").show(10)

# -------------------------------
# Stop Spark
# -------------------------------
spark.stop()
```

```text
Model Accuracy: 0.8846153846153846
+--------------------+-----+----------+
|            features|label|prediction|
+--------------------+-----+----------+
|[0.00638587171683...|    0|       0.0|
|[0.01212077464389...|    1|       1.0|
|[0.01439348862975...|    0|       0.0|
|[0.02535074341545...|    1|       1.0|
|[0.03353243473577...|    1|       1.0|
|[0.03934354066850...|    0|       0.0|
|[0.05168172116860...|    0|       1.0|
|[0.06026739028956...|    0|       0.0|
|[0.08091928305125...|    0|       0.0|
|[0.08104621590764...|    0|       0.0|
+--------------------+-----+----------+
only showing top 10 rows
```
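
The `accuracy` metric reported above is the fraction of rows whose `prediction` equals `label`. As a small worked example in plain Python, the snippet below applies that definition to the ten rows displayed in the table. It covers only those ten rows, not the full test set, so its result is not expected to match the reported model accuracy; the `accuracy` function here is an illustrative helper, not the Spark evaluator.

```python
# (label, prediction) pairs copied from the ten displayed rows.
shown = [
    (0, 0.0), (1, 1.0), (0, 0.0), (1, 1.0), (1, 1.0),
    (0, 0.0), (0, 1.0), (0, 0.0), (0, 0.0), (0, 0.0),
]

def accuracy(pairs):
    """Fraction of pairs where the predicted class equals the true label."""
    correct = sum(1 for label, pred in pairs if float(label) == pred)
    return correct / len(pairs)

# Nine of the ten displayed rows match, so this prints 0.9.
print(accuracy(shown))
```

The single mismatch is the seventh row, where the label is 0 and the prediction is 1.0.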
