# **MLlib RDD-Based**

(last update: 9/5/2025)

---

# **I. Prepare environment**

In [56]:
# start pyspark
import findspark
findspark.init()

In [57]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local")\
          .appName("Spark MLlib RDD-Based Exercises")\
          .config("spark.some.config.option", "some-value")\
          .getOrCreate()

sc = spark.sparkContext

In [58]:
from pyspark.mllib.tree import DecisionTree
from pyspark.mllib.regression import LabeledPoint

import math
import numpy as np

# **II. MLlib RDD-Based Implementation**

### **1. Read data**

Read the raw CSV data from HDFS into RDDs, dropping the header line of each file.

In [59]:
train_lines = sc.textFile("hdfs:///hcmus/22120262/Practical Exercises/HW3/data/train.csv")
train_header = train_lines.first()
train_rawData = train_lines.filter(lambda line: line != train_header)

test_lines = sc.textFile("hdfs:///hcmus/22120262/Practical Exercises/HW3/data/test.csv")
test_header = test_lines.first()
test_data = test_lines.filter(lambda line: line != test_header)


### **2. Train-val split**

We split `train_rawData` into `train_data` (for training the model) and `val_data` (for validating it) with an 80/20 ratio.

We split right away so that `val_data` stays realistic: pre-processing such as outlier filtering is then applied to `train_data` only.

In [60]:
(train_data, val_data) = train_rawData.randomSplit([0.8, 0.2], seed=42)

### **3. Parsing and pre-process the data file**

We know from `DataExplore.ipynb` that distance plays a major role in our data, so I will create a function to calculate the distance between the pickup and drop-off locations.

Besides, if we fed the raw coordinates to the model, it would be hard for it to exploit the spatial pattern.

To work well with Spark's RDD API, I define the haversine function manually in plain Python.

In [61]:
def haversine(lat1, lon1, lat2, lon2):
    R = 6371  # Earth radius in km
    lat1, lon1, lat2, lon2 = map(math.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2)**2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    return R * c * 1000  # meters
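
As a quick sanity check of `haversine()`, one degree of latitude along a meridian should come out at roughly 111.19 km on a sphere of radius 6371 km (since $6371 \cdot \pi / 180 \approx 111.19$), and the distance from a point to itself should be zero. The sketch below repeats the function so that it runs on its own; the test coordinates are illustrative and not taken from the dataset.

```python
import math

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters (same formula as the cell above)."""
    R = 6371  # Earth radius in km
    lat1, lon1, lat2, lon2 = map(math.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2)**2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    return R * c * 1000

# One degree of latitude along the prime meridian (illustrative points)
one_degree_m = haversine(0.0, 0.0, 1.0, 0.0)

# Identical pickup and drop-off coordinates (illustrative point)
same_point_m = haversine(40.75, -73.98, 40.75, -73.98)

print(round(one_degree_m), same_point_m)
```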

Apply `haversine()` to calculate the travel distance in meters (`distance_m`, as the crow flies) on the training data.

In [62]:
distances = train_data.map(lambda row: row.split(",")) \
    .filter(lambda parts: parts[5] and parts[6] and parts[7] and parts[8]) \
    .map(lambda parts: haversine(float(parts[6]), float(parts[5]), float(parts[8]), float(parts[7]))) \
    .filter(lambda d: d >= 0) \
    .collect()

                                                                                

Now we parse each raw CSV line into a `LabeledPoint`.

In [63]:
def parse(row, is_test=False):
    try:
        parts = row.split(',')
        
        row_id = parts[0]
        pickup_dayofmonth = int(parts[2].split(' ')[0].split('-')[2])
        pickup_month = int(parts[2].split(' ')[0].split('-')[1])
        pickup_hour = int(parts[2].split(' ')[1].split(':')[0])
        pickup_lat = 0.0
        pickup_long = 0.0
        dropoff_lat = 0.0
        dropoff_long = 0.0
        vendor_id = float(parts[1]) - 1.0
        passenger_count = 0.0
        store_and_fwd_flag = 0.0
        trip_duration = 0.0

        if is_test:
            pickup_lat = float(parts[5])
            pickup_long = float(parts[4])
            dropoff_lat = float(parts[7])
            dropoff_long = float(parts[6])
            passenger_count = float(parts[3])
            store_and_fwd_flag = (1.0 if parts[8] == 'Y' else 0.0)

        else:
            pickup_lat = float(parts[6])
            pickup_long = float(parts[5])
            dropoff_lat = float(parts[8])
            dropoff_long = float(parts[7])
            passenger_count = float(parts[4])
            store_and_fwd_flag = (1.0 if parts[9] == 'Y' else 0.0)
            trip_duration = float(parts[10])

        distance_m = haversine(pickup_lat, pickup_long, dropoff_lat, dropoff_long)

        features = [
            vendor_id,
            passenger_count,
            distance_m,
            pickup_hour,
            pickup_dayofmonth,
            pickup_month,
            store_and_fwd_flag,
        ]

        # Return (id, features) for test data, a LabeledPoint otherwise
        return (row_id, features) if is_test else LabeledPoint(trip_duration, features)

    except Exception as e:
        print(f"Error at row {row[:50]}... | Details: {str(e)}")
        return None
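
The index arithmetic in `parse()` is easier to follow on a concrete line. The sketch below assumes the training CSV uses the column order `id, vendor_id, pickup_datetime, dropoff_datetime, passenger_count, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, store_and_fwd_flag, trip_duration`. This order is inferred from the indices used above, not confirmed against the file. The test CSV appears to lack `dropoff_datetime` and `trip_duration`, which would explain why its indices shift down by one. The sample row is made up for illustration.

```python
# Hypothetical training row: values are invented, column order is an assumption.
sample = "id0000001,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.98,40.76,-73.96,40.77,N,455"

parts = sample.split(",")

# pickup_datetime looks like "YYYY-MM-DD HH:MM:SS"
date_part, time_part = parts[2].split(" ")
pickup_month = int(date_part.split("-")[1])
pickup_dayofmonth = int(date_part.split("-")[2])
pickup_hour = int(time_part.split(":")[0])

# Same encodings as in parse()
vendor_id = float(parts[1]) - 1.0
store_and_fwd_flag = 1.0 if parts[9] == "Y" else 0.0
trip_duration = float(parts[10])

print(pickup_month, pickup_dayofmonth, pickup_hour, vendor_id, store_and_fwd_flag, trip_duration)
```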

From the `DataExplore.ipynb` file, we know that `trip_duration` and `distance_m` are heavily skewed, especially `trip_duration`, which is strongly right-skewed.

To filter out the outliers, I define a function that transforms the values to log scale and uses the MAD (median absolute deviation) to determine the boundaries.

In [64]:
def filter_outliers_mad_log_rdd(rdd, target='label', threshold=3, feature_index=None):    
    # Extract values from LabeledPoints
    if target == 'label':
        values = rdd.map(lambda lp: lp.label)
    elif target == 'feature':
        if feature_index is None:
            raise ValueError("feature_index must be specified when target='feature'")
        values = rdd.map(lambda lp: lp.features[feature_index])
    else:
        raise ValueError("target must be 'label' or 'feature'")
    
    # Add 1 and log transform (to avoid log(0))
    log_values = values.map(lambda x: np.log(x + 1)).cache()
    
    # Compute median and MAD
    median_log = log_values.takeOrdered(int(log_values.count() * 0.5))[-1]
    abs_deviations = log_values.map(lambda x: abs(x - median_log)).cache()
    mad_log = abs_deviations.takeOrdered(int(abs_deviations.count() * 0.5))[-1]
    
    # Compute boundaries on log scale
    lower_bound_log = median_log - threshold * mad_log
    upper_bound_log = median_log + threshold * mad_log
    
    # Convert the bounds back to the original scale (for reporting)
    lower_bound = np.exp(lower_bound_log) - 1
    upper_bound = np.exp(upper_bound_log) - 1
    
    # Keep only the points whose log-value lies within the bounds
    if target == 'label':
        filtered_rdd = rdd.filter(
            lambda lp: (np.log(lp.label + 1) >= lower_bound_log) & 
                       (np.log(lp.label + 1) <= upper_bound_log)
        )
    else:
        filtered_rdd = rdd.filter(
            lambda lp: (np.log(lp.features[feature_index] + 1) >= lower_bound_log) & 
                       (np.log(lp.features[feature_index] + 1) <= upper_bound_log)
        )
    
    # Clean up cached RDDs
    log_values.unpersist()
    abs_deviations.unpersist()
    
    return filtered_rdd, (lower_bound, upper_bound)
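
To make the log-MAD rule concrete, here is a minimal pure-Python sketch of the same computation on a small list. It mirrors the element the RDD version picks with `takeOrdered(int(n * 0.5))[-1]` and, like the function above, uses the raw MAD without a scaling constant. The helper name `log_mad_bounds` and the sample durations are hypothetical.

```python
import math

def log_mad_bounds(values, threshold=3):
    """Return (lower, upper) bounds on the original scale for a list of values."""
    logs = sorted(math.log(v + 1) for v in values)
    # Same element as takeOrdered(int(n * 0.5))[-1] in the RDD version
    median_log = logs[int(len(logs) * 0.5) - 1]
    deviations = sorted(abs(x - median_log) for x in logs)
    mad_log = deviations[int(len(deviations) * 0.5) - 1]
    lower = math.exp(median_log - threshold * mad_log) - 1
    upper = math.exp(median_log + threshold * mad_log) - 1
    return lower, upper

# Made-up trip durations in seconds, with one extreme value
durations = [300, 400, 500, 600, 700, 100000]

low, high = log_mad_bounds(durations)
kept = [d for d in durations if low <= d <= high]

print(round(low, 1), round(high, 1), kept)
```

Because the bounds are symmetric on the log scale, they are asymmetric on the original scale, which suits right-skewed quantities such as trip duration.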

First, we parse our data into `LabeledPoint` objects.

In [65]:
train_parsed = train_data.map(lambda x: parse(x, is_test=False)) \
                        .filter(lambda x: x is not None) \
                        .cache()

val_parsed = val_data.map(lambda x: parse(x, is_test=False)) \
                    .filter(lambda x: x is not None) \
                    .cache()

test_parsed = test_data.map(lambda x: parse(x, is_test=True)) \
                      .filter(lambda x: x is not None) \
                      .cache()

Then we filter out the outliers of `trip_duration` and `distance_m` from the training set.

In [66]:
# Filter by trip_duration (label)
train_cleaned_rdd, (trip_low, trip_high) = filter_outliers_mad_log_rdd(
    train_parsed, 
    target='label',
    threshold=3
)
print(f"trip_duration limit: {trip_low:.2f}s - {trip_high:.2f}s")

                                                                                

trip_duration limit: 148.20s - 2945.13s


In [67]:
train_cleaned_rdd, (dist_low, dist_high) = filter_outliers_mad_log_rdd(
    train_cleaned_rdd,
    target='feature',
    feature_index=2, # distance_m
    threshold=3
)
print(f"distance_m limit: {dist_low:.2f}m - {dist_high:.2f}m")

distance_m limit: 421.64m - 10716.25m


### **4. Train the Decision Tree Regressor model using MLlib**

We define `evaluate_metrics`, which computes $RMSE$ and $R^2$, to keep the evaluation consistent with the Structured API implementation.

In [68]:
def evaluate_metrics(labelsAndPredictions):
    metrics = labelsAndPredictions.map(
        lambda x: (1, x[0], x[1], (x[0] - x[1]) ** 2, x[0] ** 2)
    ).reduce(
        lambda a, b: (
            a[0] + b[0],                                                        # count (n)
            a[1] + b[1],                                                        # sum of labels (sum(y))
            a[2] + b[2],                                                        # sum of predictions (sum(y_hat))
            a[3] + b[3],                                                        # sum of squared errors (sum((y - y_hat)^2))
            a[4] + b[4]                                                         # sum of squared labels (sum(y^2))
        )
    )

    n = metrics[0]
    if n == 0:
        return {"RMSE": 0, "R2": 0}

    mse = metrics[3] / n
    rmse = mse ** 0.5                                                           # RMSE = sqrt(MSE)

    ss_total = metrics[4] - (metrics[1] ** 2) / n                               # sum((y - mean(y))^2)
    ss_residual = metrics[3]                                                    # sum((y - y_hat)^2)
    r2 = 1 - (ss_residual / ss_total) if ss_total != 0 else 0.0                 # R² = 1 - (SS_res / SS_total)

    return {"RMSE": rmse, "R2": r2}
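
`evaluate_metrics` needs only one pass over the data because the total sum of squares can be rewritten with running sums: $\sum (y - \bar{y})^2 = \sum y^2 - (\sum y)^2 / n$. The following pure-Python sketch applies the same formulas to a few made-up (label, prediction) pairs; the helper name `metrics_from_pairs` is hypothetical.

```python
def metrics_from_pairs(pairs):
    """Compute RMSE and R² from (label, prediction) pairs in a single pass."""
    n = 0
    sum_y = sum_sq_err = sum_y_sq = 0.0
    for y, y_hat in pairs:
        n += 1
        sum_y += y
        sum_sq_err += (y - y_hat) ** 2
        sum_y_sq += y ** 2
    if n == 0:
        return {"RMSE": 0, "R2": 0}
    ss_total = sum_y_sq - (sum_y ** 2) / n        # sum((y - mean(y))^2)
    r2 = 1 - sum_sq_err / ss_total if ss_total != 0 else 0.0
    return {"RMSE": (sum_sq_err / n) ** 0.5, "R2": r2}

# Made-up labels and predictions
model_pairs = [(1.0, 1.5), (2.0, 2.0), (3.0, 2.5), (4.0, 4.0)]
result = metrics_from_pairs(model_pairs)

# Baseline that always predicts the label mean (2.5): R² should be 0
baseline_pairs = [(y, 2.5) for y, _ in model_pairs]
baseline = metrics_from_pairs(baseline_pairs)

print(result, baseline)
```

The baseline case shows why an $R^2$ close to 0 means the model does little better than always predicting the mean label.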

Define the Decision Tree model and its parameters.

I have already tuned the hyperparameters for optimal performance.  
Because tuning takes a long time, I only show the code I used and the best hyperparameters it found.

```
maxDepths = [5, 10, 15]
maxBins_list = [32, 64]
minInstances_list = [1, 5, 10]

best_rmse = float('inf')
best_params = {}

dt_model = None

for depth in maxDepths:
    for bins in maxBins_list:
        for min_instances in minInstances_list:
            # Train a model with the current parameters
            current_model = DecisionTree.trainRegressor(
                train_cleaned_rdd,
                categoricalFeaturesInfo={},
                impurity="variance",
                maxDepth=depth,
                maxBins=bins,
                minInstancesPerNode=min_instances
            )
            
            # Evaluate the model on the validation set
            predictions = current_model.predict(val_parsed.map(lambda x: x.features))
            labelsAndPredictions = val_parsed.map(lambda lp: lp.label).zip(predictions)
            metrics = evaluate_metrics(labelsAndPredictions)
            
            # Keep this model if it beats the best RMSE so far
            if metrics["RMSE"] < best_rmse:
                best_rmse = metrics["RMSE"]
                best_params = {
                    'maxDepth': depth,
                    'maxBins': bins,
                    'minInstancesPerNode': min_instances
                }
                print(f"New best params: {best_params}")

                dt_model = current_model

print("Best maxDepth:", best_params['maxDepth'])
print("Best minInstancesPerNode:", best_params['minInstancesPerNode'])
print("Best maxBins:", best_params['maxBins'])
```

> Best maxDepth: 10  
> Best minInstancesPerNode: 20  
> Best maxBins: 64
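
The three nested loops in the tuning code enumerate every combination of the candidate values, and each combination trains one full tree. As a rough sketch, the same enumeration can be written with `itertools.product`. Here `toy_rmse` is a hypothetical stand-in for "train a tree and return its validation RMSE", so the selected tuple is purely illustrative and not a tuning result.

```python
from itertools import product

maxDepths = [5, 10, 15]
maxBins_list = [32, 64]
minInstances_list = [1, 5, 10]

def toy_rmse(depth, bins, min_instances):
    """Hypothetical scoring function; a real run would train and evaluate a model."""
    return abs(depth - 10) + abs(bins - 64) / 32 + min_instances / 100

# Every (maxDepth, maxBins, minInstancesPerNode) combination, in loop order
grid = list(product(maxDepths, maxBins_list, minInstances_list))

# Pick the combination with the lowest score
best = min(grid, key=lambda params: toy_rmse(*params))

print(len(grid), best)
```

With 3 x 2 x 3 = 18 combinations, the cost of the search grows multiplicatively with each added candidate value, which is why the full search is not re-run here.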

In [69]:
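# Map feature index -> number of categories for the categorical features
categoricalFeatures = {
    0: 2,  # vendor_id (stored as vendor_id - 1 in parse())
    6: 2   # store_and_fwd_flag (1 if 'Y', else 0)
}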
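
The keys of `categoricalFeaturesInfo` are positions in the feature vector and the values are the number of categories, with category values expected to be `0, ..., k-1`. A small sketch, assuming the feature order built in `parse()`, derives the same mapping from feature names so the indices do not have to be counted by hand. The names `feature_names` and `arity` are introduced here for illustration only.

```python
# Feature order as assembled in parse()
feature_names = [
    "vendor_id",
    "passenger_count",
    "distance_m",
    "pickup_hour",
    "pickup_dayofmonth",
    "pickup_month",
    "store_and_fwd_flag",
]

# Number of categories for each categorical feature (both are binary here)
arity = {"vendor_id": 2, "store_and_fwd_flag": 2}

# Translate names to vector positions
categorical_by_index = {feature_names.index(name): k for name, k in arity.items()}

print(categorical_by_index)
```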

In [70]:
dt_model = DecisionTree.trainRegressor(
    train_cleaned_rdd,
    categoricalFeaturesInfo=categoricalFeatures,
    impurity="variance",
    maxDepth=10,
    maxBins=64,
    minInstancesPerNode=20
)

25/05/10 17:44:13 WARN BlockManager: Task 20 already completed, not releasing lock for rdd_7_0
                                                                                

Make predictions on the validation set.

In [71]:
val_predictions = dt_model.predict(val_parsed.map(lambda x: x.features))

val_labelsAndPredictions = val_parsed.map(lambda lp: lp.label).zip(val_predictions)

### **5. Evaluate the model**

Evaluate on validation set

In [72]:
metrics = evaluate_metrics(val_labelsAndPredictions)

print("\nModel Evaluation Results:")
print("Validation Set:")
print("Root Mean Squared Error (RMSE) =", metrics["RMSE"])
print("R-squared (R²) =", metrics["R2"])




Model Evaluation Results:
Validation Set:
Root Mean Squared Error (RMSE) = 3171.4570734151303
R-squared (R²) = 0.02449659530880599


                                                                                

# **III. Conclusion**

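>- RMSE = 3171.4570734151303
>
>    - This is the square root of the mean squared prediction error, expressed in the same unit as `trip_duration` (seconds).
>
>    - It shows that the model's predictions are typically off by about 3171 seconds (roughly 53 minutes), with large errors weighted more heavily than small ones.
>
>- R² = 0.02449659530880599
>
>    - This means the model explains only ~2.4% of the variance in the target variable.
>
>    - Essentially, the model isn’t capturing much of the relationship between the features and the target.
>
>    - An R² this low suggests that the model is barely better than predicting the mean.
>
>    - Note that outliers were filtered from the training set only; `val_parsed` still contains extreme trip durations, which likely dominate both metrics.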

# **IV. Predict on test set**

In [74]:
test_predictions = dt_model.predict(test_parsed.map(lambda x: x[1]))

test_idAndPredictions = test_parsed.map(lambda x: x[0]).zip(test_predictions)

# Build the CSV lines as an RDD
results_csv = test_idAndPredictions.map(lambda x: f"{x[0]},{x[1]}")

# Write to a local directory
output_dir = "file:///home/phatle1578/BigData/Practical Exercises/HW3/MLlib RDD-Based/result/submission"

full_output = sc.parallelize(
    ["id,prediction"] + results_csv.collect()
)

full_output.saveAsTextFile(output_dir)

print("Predictions saved successfully to local directory.")

25/05/10 17:46:35 WARN TaskSetManager: Stage 39 contains a task of very large size (18467 KiB). The maximum recommended task size is 1000 KiB.


Predictions saved successfully to local directory.


In [None]:
spark.stop()

# **V. Reference**

1. [Spark Document - Decision Trees(RDD-based API)](https://spark.apache.org/docs/latest/mllib-decision-tree.html)

2. [NYC Taxi EDA - Update: The fast & the curious](https://www.kaggle.com/code/headsortails/nyc-taxi-eda-update-the-fast-the-curious/report#extreme-trip-durations)