# **Structured API**

(last update: 7/5/2025)

---

# **I. Prepare environment**

In [66]:
# start pyspark
import findspark
findspark.init()

In [67]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local")\
          .appName("Spark APIs Exercises")\
          .config("spark.some.config.option", "some-value")\
          .getOrCreate()

sc = spark.sparkContext

In [68]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import StringIndexer
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

import numpy as np
import math

# **II. Structured API Implementation (High-Level)**

### **1. Read data**

Read the raw CSV data from HDFS into Structured API DataFrames.

In [69]:
train_rawData = spark.read.csv("hdfs:///hcmus/22120262/Practical Exercises/HW3/data/train.csv", header=True, inferSchema=True)
test_rawData = spark.read.csv("hdfs:///hcmus/22120262/Practical Exercises/HW3/data/test.csv", header=True, inferSchema=True)


### **2. Train-val split**

We split `train_rawData` into `train_data` (for training the model) and `val_data` (for validating it) with an 80/20 ratio.

We split right away so that `val_data` stays realistic: data-dependent preprocessing such as outlier filtering is then applied to `train_data` only.

In [70]:
(train_data, val_data) = train_rawData.randomSplit([0.8, 0.2], seed=42)
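
As a rough illustration of why the split is only approximately 80/20, the sketch below mimics a weighted random split in plain Python: each row is assigned independently to one of the two buckets. This is a simplified model and not Spark's implementation; the helper name `random_split` and the toy rows are hypothetical.

```python
import random

def random_split(rows, train_fraction=0.8, seed=42):
    """Assign each row independently to train or validation (simplified model)."""
    rng = random.Random(seed)
    train, val = [], []
    for row in rows:
        # Each row is sampled independently, so the split sizes are approximate.
        (train if rng.random() < train_fraction else val).append(row)
    return train, val

rows = list(range(10_000))  # toy stand-in for the rows of train_rawData
train_rows, val_rows = random_split(rows)
print(len(train_rows), len(val_rows))
```

Fixing the seed makes the assignment reproducible, which is the role `seed=42` plays in the cell above.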

### **3. Pre-process data**

We know from `DataExplore.ipynb` that distance plays a major role in our data, so I will create a function that calculates the distance between the pickup and drop-off locations.

Besides, if we used the raw coordinates, it would be hard for the model to exploit the spatial pattern.

To integrate easily with Spark, I will define the haversine function manually to compute the distance.

In [71]:
def haversine(lon1, lat1, lon2, lat2):
    R = 6371  # radius of Earth in km
    lon1, lat1, lon2, lat2 = map(math.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat/2)**2 + math.cos(lat1)*math.cos(lat2)*math.sin(dlon/2)**2
    
    return R * 2 * math.asin(math.sqrt(a)) * 1000
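
A quick sanity check of the formula, as a self-contained sketch: along a meridian, one degree of latitude should come out to roughly 111.19 km for R = 6371 km. The function is restated under the hypothetical name `haversine_m` so the block runs on its own. The `min(1.0, ...)` clamp is an addition of this sketch, not part of the cell above; it guards against floating-point rounding pushing the `asin` argument slightly above 1.

```python
import math

def haversine_m(lon1, lat1, lon2, lat2, radius_km=6371.0):
    """Great-circle distance in meters between two (lon, lat) points in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    # Clamp to 1.0 so rounding error cannot take asin outside its domain.
    return radius_km * 2 * math.asin(min(1.0, math.sqrt(a))) * 1000

one_degree_lat = haversine_m(0.0, 0.0, 0.0, 1.0)        # one degree along a meridian
same_point = haversine_m(-73.98, 40.75, -73.98, 40.75)  # identical pickup and drop-off
print(f"{one_degree_lat:.0f} m, {same_point:.1f} m")
```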

Apply `haversine()` to calculate the straight-line travel distance `distance_m` (as the crow flies) on the datasets.

In [72]:
haversine_udf = F.udf(haversine, DoubleType())

def add_distance(df):
    # Add the great-circle distance in meters between pickup and drop-off
    return df.withColumn("distance_m", haversine_udf(
        df["pickup_longitude"], df["pickup_latitude"],
        df["dropoff_longitude"], df["dropoff_latitude"]
    ))

train_data = add_distance(train_data)
test_rawData = add_distance(test_rawData)
val_data = add_distance(val_data)

Based on `DataExplore.ipynb`, we know that `trip_duration` is heavily right-skewed on a log-scale plot.

So, to filter out those outliers, I will define a function that transforms the values to log scale and uses the MAD (median absolute deviation) to determine the boundaries.

In [73]:
def filter_outliers_mad_log(df, column_name, threshold=3):
    # Add log-transformed column (+1 to avoid log(0))
    df = df.withColumn(f"log_{column_name}", F.log(F.col(column_name) + 1))
    
    # Compute Median and MAD on log-scale
    median_log = df.approxQuantile(f"log_{column_name}", [0.5], 0.01)[0]
    mad_log = df.select(
        F.expr(f"percentile_approx(abs(log_{column_name} - {median_log}), 0.5)")
    ).first()[0]
    
    # Compute boundaries
    lower_bound_log = median_log - threshold * mad_log
    upper_bound_log = median_log + threshold * mad_log
    
    # Convert back to original scale
    lower_bound = np.exp(lower_bound_log) - 1
    upper_bound = np.exp(upper_bound_log) - 1
    
    # Filter data
    df_filtered = df.filter(
        (F.col(f"log_{column_name}") >= lower_bound_log) & 
        (F.col(f"log_{column_name}") <= upper_bound_log)
    ).drop(f"log_{column_name}")
    
    return df_filtered, (lower_bound, upper_bound)
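
The same rule can be sketched in plain Python on a toy list, which makes the arithmetic easier to follow. This version uses exact medians from the standard library instead of Spark's approximate quantiles, and the name `mad_log_bounds` and the sample durations are hypothetical.

```python
import math
from statistics import median

def mad_log_bounds(values, threshold=3):
    """Bounds on the original scale from median +/- threshold * MAD of log1p(values)."""
    logs = [math.log1p(v) for v in values]
    med = median(logs)
    mad = median(abs(x - med) for x in logs)
    lower_log = med - threshold * mad
    upper_log = med + threshold * mad
    # expm1 inverts log1p, mapping the bounds back to the original scale.
    return math.expm1(lower_log), math.expm1(upper_log)

durations = [100, 200, 300, 400, 500, 100_000]  # toy trip durations in seconds
low, high = mad_log_bounds(durations)
kept = [d for d in durations if low <= d <= high]
print(kept)
```

Because the median and the MAD are barely affected by the single extreme value, the bounds stay tight around the bulk of the data and only the 100,000 s trip is dropped.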

Filter outliers in `trip_duration`

In [74]:
train_cleaned, (trip_low, trip_high) = filter_outliers_mad_log(
    train_data, 
    "trip_duration", 
    threshold=3
)
print(f"trip_duration limit: {trip_low:.2f}s - {trip_high:.2f}s")



trip_duration limit: 150.51s - 2979.62s



Now we filter out the outliers in the `distance_m` column using the same function defined above.

In [75]:
train_cleaned, (dist_low, dist_high) = filter_outliers_mad_log(
    train_cleaned, 
    "distance_m", 
    threshold=3
)
print(f"distance_m limit: {dist_low:.2f}m - {dist_high:.2f}m")



distance_m limit: 422.17m - 10750.44m



Handle the `pickup_datetime` column, which has a timestamp data type

In [76]:
def add_time_features(df):
    # Extract hour, day of month, and month from the pickup timestamp
    return (df.withColumn("pickup_hour", F.hour(df["pickup_datetime"]))
              .withColumn("pickup_dayofmonth", F.dayofmonth(df["pickup_datetime"]))
              .withColumn("pickup_month", F.month(df["pickup_datetime"])))

train_cleaned = add_time_features(train_cleaned).drop("pickup_datetime", "dropoff_datetime")
test_rawData = add_time_features(test_rawData)
test_data = test_rawData.drop("pickup_datetime")
val_data = add_time_features(val_data).drop("pickup_datetime", "dropoff_datetime")

Handle the `store_and_fwd_flag` column, which has a string data type

In [77]:
indexer = StringIndexer(inputCol="store_and_fwd_flag", outputCol="store_and_fwd_flag_index")
indexer_model = indexer.fit(train_cleaned)

train_cleaned = indexer_model.transform(train_cleaned).drop("store_and_fwd_flag")
# Continue from test_data (not test_rawData) so pickup_datetime stays dropped
test_data = indexer_model.transform(test_data).drop("store_and_fwd_flag")
val_data = indexer_model.transform(val_data).drop("store_and_fwd_flag")
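
To make the encoding concrete, here is a plain-Python sketch of the default `StringIndexer` ordering (`frequencyDesc`), where the most frequent label receives index 0. The helper `fit_string_index` and the toy flags are hypothetical, and the alphabetical tie-break is an assumption of this sketch.

```python
from collections import Counter

def fit_string_index(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency; break ties alphabetically (assumed here).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

flags = ["N", "N", "Y", "N", "N"]  # toy store_and_fwd_flag values
mapping = fit_string_index(flags)
encoded = [mapping[f] for f in flags]
print(mapping, encoded)
```

Fitting the mapping once on the training data and reusing it, as the cell above does with `indexer_model`, keeps the indices consistent across the train, validation, and test sets.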


Define the input columns (excluding the raw lat/long columns $\Rightarrow$ we use `distance_m` instead)

In [78]:
exclude_cols = ["trip_duration", "pickup_latitude", "pickup_longitude", "dropoff_latitude", "dropoff_longitude", "id"]
inputCols = [col for col in train_cleaned.columns if col not in exclude_cols]
val_data = val_data.select(inputCols + ["trip_duration"])

Assemble the numeric features

In [79]:
assembler = VectorAssembler(inputCols=inputCols, outputCol="features")

Feature indexing - handle categorical features automatically

In [80]:
feature_indexer = VectorIndexer(
    inputCol="features",
    outputCol="indexedFeatures",
    maxCategories=4 
).fit(assembler.transform(train_cleaned))
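
The rule `VectorIndexer` applies can be sketched without Spark: a feature is treated as categorical when it has at most `maxCategories` distinct values, and as continuous otherwise. The helper `categorical_features` and the toy rows below are hypothetical and only illustrate that decision.

```python
def categorical_features(rows, feature_names, max_categories=4):
    """Return the names of features with at most max_categories distinct values."""
    categorical = []
    for i, name in enumerate(feature_names):
        distinct = {row[i] for row in rows}
        if len(distinct) <= max_categories:
            categorical.append(name)
    return categorical

# Toy rows: (vendor_id, passenger_count, pickup_hour)
rows = [(1, 1, 0), (2, 2, 5), (1, 3, 9), (2, 4, 13), (1, 5, 17), (2, 6, 21)]
names = ["vendor_id", "passenger_count", "pickup_hour"]
print(categorical_features(rows, names))
```

With `maxCategories=4`, a two-valued column such as `vendor_id` is indexed as categorical, while columns with many distinct values, such as `pickup_hour`, stay continuous.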


### **4. Train the Decision Tree Regressor model using MLlib**

Define the Decision Tree model with its parameters

I already ran `CrossValidator` with `ParamGridBuilder` to tune the model's hyperparameters. Because tuning takes a long time, I only show the code I used and the best hyperparameters it found.

```
dt = DecisionTreeRegressor(
    featuresCol="indexedFeatures",
    labelCol="trip_duration",
    impurity="variance"                 # Variance for regression
)

# Assumed: the pipeline wraps this dt (rmse_evaluator is defined in a later cell)
pipeline = Pipeline(stages=[assembler, feature_indexer, dt])

paramGrid = ParamGridBuilder() \
    .addGrid(dt.maxDepth, [5, 10, 15]) \
    .addGrid(dt.minInstancesPerNode, [5, 10, 20]) \
    .addGrid(dt.maxBins, [32, 64]) \
    .build()

cv = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=rmse_evaluator,
    numFolds=3,
    parallelism=2
)

cvModel = cv.fit(train_cleaned)

best_dt_model = cvModel.bestModel.stages[-1]

print("Best maxDepth:", best_dt_model.getOrDefault('maxDepth'))
print("Best minInstancesPerNode:", best_dt_model.getOrDefault('minInstancesPerNode'))
print("Best maxBins:", best_dt_model.getOrDefault('maxBins'))
```

> Best maxDepth: 10  
> Best minInstancesPerNode: 20  
> Best maxBins: 64
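
To get a feel for why tuning is slow, the size of the search can be counted in plain Python. The sketch below only enumerates the grid from the code above. It does not run any training, and the variable names are illustrative.

```python
from itertools import product

grid = {
    "maxDepth": [5, 10, 15],
    "minInstancesPerNode": [5, 10, 20],
    "maxBins": [32, 64],
}
num_folds = 3

# Every combination of the three hyperparameter lists.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
cv_fits = len(combos) * num_folds  # one fit per combination per fold
print(len(combos), cv_fits)
```

Each of those fits trains a full pipeline on two thirds of `train_cleaned`, before the best combination is refit on all of it.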

In [81]:
dt = DecisionTreeRegressor(
    featuresCol="indexedFeatures",
    labelCol="trip_duration",
    maxBins=64,
    maxDepth=10,                        
    minInstancesPerNode=20,             
    impurity="variance"                 
)

Create pipeline

In [82]:
pipeline = Pipeline(stages=[
    assembler,
    feature_indexer,
    dt
])

Create evaluators for the RMSE and $R^2$ metrics

In [83]:
rmse_evaluator = RegressionEvaluator(
    labelCol="trip_duration",
    predictionCol="prediction",
    metricName="rmse"
)

r2_evaluator = RegressionEvaluator(
    labelCol="trip_duration",
    predictionCol="prediction",
    metricName="r2"
)

Train model

In [84]:
model = pipeline.fit(train_cleaned)


Make predictions on validation data set (`val_data`)

In [85]:
val_predictions = model.transform(val_data)

### **5. Evaluate the model**

Analyze model structure and feature importance

In [86]:
tree_model = model.stages[2]                                        # DecisionTreeRegressor is the 3rd stage in pipeline

print("\nDecision Tree Model Summary:")
print("Depth:", tree_model.depth)
print("Number of Nodes:", tree_model.numNodes)
print("Feature importances:")
for col, imp in zip(inputCols, tree_model.featureImportances):
    print(f"- {col}: {imp:.2f}")


Decision Tree Model Summary:
Depth: 10
Number of Nodes: 1997
Feature importances:
- vendor_id: 0.00
- passenger_count: 0.00
- distance_m: 0.88
- pickup_hour: 0.10
- pickup_dayofmonth: 0.01
- pickup_month: 0.01
- store_and_fwd_flag_index: 0.00


Evaluate on validation set

In [87]:
val_rmse = rmse_evaluator.evaluate(val_predictions)
val_r2 = r2_evaluator.evaluate(val_predictions)

print("\nModel Evaluation Results:")
print("Validation Set:")
print("Root Mean Squared Error (RMSE) =", val_rmse)
print("R-squared (R²) =", val_r2)




Model Evaluation Results:
Validation Set:
Root Mean Squared Error (RMSE) = 4921.696555070559
R-squared (R²) = 0.010458172159521717



# **III. Conclusion**

>- RMSE = 4921.696555070559
>
>    - This is the square root of the mean squared prediction error, in the same unit as `trip_duration` (seconds).
>
>    - It shows that the model's predictions on the validation set are typically off by about 4922 seconds.
>
>- R² = 0.010458172159521717
>
>    - This means the model explains only ~1% of the variance in the target variable.
>
>    - Essentially, the model isn’t capturing much of the relationship between the features and the target.
>
>    - An R² this low suggests that the model is barely better than predicting the mean.
>
>- Note that `val_data` was not outlier-filtered, unlike `train_cleaned`, so a few extreme trips may dominate both metrics.
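
The two metrics can be written out in a few lines of plain Python, which makes the "barely better than predicting the mean" reading concrete. This is a minimal sketch using the standard definitions on hypothetical toy labels, not the evaluator's internals.

```python
import math

def rmse(y_true, y_pred):
    """Square root of the mean squared prediction error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """1 - SS_res / SS_tot; 0 means no better than predicting the mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [300.0, 600.0, 900.0, 1200.0]                 # toy durations in seconds
mean_pred = [sum(y_true) / len(y_true)] * len(y_true)  # always predict the mean
print(r2(y_true, mean_pred), r2(y_true, y_true), rmse(y_true, mean_pred))
```

A model that always predicts the mean scores an R² of exactly 0 and a perfect model scores 1, so a value near 0.01 sits almost at the mean-predictor baseline.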

# **IV. Reference**

1. [Spark Documentation - Decision Tree Regression](https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-regression)

2. [NYC Taxi EDA - Update: The fast & the curious](https://www.kaggle.com/code/headsortails/nyc-taxi-eda-update-the-fast-the-curious/report#extreme-trip-durations)