# **How RayDP works together with Ray**

RayDP is a distributed data processing library that provides simple APIs for running Spark on Ray and for integrating Spark with distributed deep learning and machine learning frameworks. This document builds an end-to-end deep learning pipeline on a single Ray cluster: Spark handles data preprocessing, and Ray Train handles training and evaluation.

## 1. Colab environment setup

RayDP requires Ray and PySpark. PyTorch is used to build the deep learning model.

In [2]:
! pip install ray==1.9
! pip install raydp
! pip install ray[tune]
! pip install torch==1.8.1+cpu -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html


## 2. Get the data file

The dataset comes from https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset, and a copy of the file is stored in a GitHub repository. It is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row provides relevant information about one patient.

In [3]:
! wget https://raw.githubusercontent.com/KepingYan/Test/main/data/healthcare-dataset-stroke-data.csv -O healthcare-dataset-stroke-data.csv

--2022-05-17 04:38:55--  https://raw.githubusercontent.com/KepingYan/Test/main/data/healthcare-dataset-stroke-data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 316971 (310K) [text/plain]
Saving to: ‘healthcare-dataset-stroke-data.csv’


2022-05-17 04:38:55 (12.8 MB/s) - ‘healthcare-dataset-stroke-data.csv’ saved [316971/316971]



## 3. Init or connect to a ray cluster

In [4]:
import ray

ray.init(num_cpus=6)

2022-05-17 04:38:58,825	INFO services.py:1340 -- View the Ray dashboard at http://127.0.0.1:8265


{'metrics_export_port': 54905,
 'node_id': '213a4d64eb8f8723003d6bcae8998d20f33989c7930f27d4fcaf7cee',
 'node_ip_address': '172.28.0.2',
 'object_store_address': '/tmp/ray/session_2022-05-17_04-38-55_634538_8000/sockets/plasma_store',
 'raylet_ip_address': '172.28.0.2',
 'raylet_socket_name': '/tmp/ray/session_2022-05-17_04-38-55_634538_8000/sockets/raylet',
 'redis_address': '172.28.0.2:6379',
 'session_dir': '/tmp/ray/session_2022-05-17_04-38-55_634538_8000',
 'webui_url': '127.0.0.1:8265'}

## 4. Get a spark session

In [5]:
import raydp

app_name = "Stroke Prediction with RayDP"
num_executors = 1
cores_per_executor = 1
memory_per_executor = "500M"
spark = raydp.init_spark(app_name, num_executors, cores_per_executor, memory_per_executor)

## 5. Get data from .csv file via 'spark' created by **raydp**

In [6]:
data = spark.read.format("csv").option("header", "true") \
        .option("inferSchema", "true") \
        .load("/content/healthcare-dataset-stroke-data.csv")
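
With `inferSchema` enabled, Spark chooses a column type from the values it reads, so a numeric column that also holds a text placeholder ends up as a string column. The following is a minimal standard-library sketch of how one might look for such placeholders before loading the file. The helper names `is_number` and `non_numeric_values` are hypothetical, and the sample rows are reconstructed from the first rows that `data.show(5)` prints in the analysis step, so the raw file's exact formatting may differ.

```python
import csv
import io

# Header and first rows, reconstructed from the data.show(5) output in this
# notebook (assumption: the raw file may format numbers slightly differently)
SAMPLE_CSV = """id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,N/A,never smoked,1
31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
"""


def is_number(text):
    """Return True if ``text`` parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def non_numeric_values(csv_text, column):
    """Return the distinct non-numeric strings found in ``column``."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sorted({row[column] for row in reader if not is_number(row[column])})


bmi_placeholders = non_numeric_values(SAMPLE_CSV, "bmi")
age_placeholders = non_numeric_values(SAMPLE_CSV, "age")
print(bmi_placeholders, age_placeholders)  # ['N/A'] []
```

A non-empty result for a column that should be numeric is a hint that it will need cleaning after the load, as is done for `bmi` later in this notebook.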



## 6. Define the data_process function

The dataset is loaded as a `pyspark.sql.dataframe.DataFrame`. Before feeding it into the deep learning model, we can use the Spark session created by RayDP to apply some transformations to the dataset.

### 6.1 Data Analysis

Here is a part of the data analysis.

In [7]:
# Data overview
data.show(5)
# Count the 'N/A' values
# There are 201 'N/A' values in the column 'bmi';
# we can replace them with the mean of the column
data.describe().show()
data.filter(data.bmi=='N/A').count()
# Observe the distribution of the column 'gender'
# Then we should remove the outlier value 'Other'
data.rollup(data.gender).count().show()
# Observe the proportion of positive and negative samples.
data.rollup(data.stroke).count().show()

+-----+------+----+------------+-------------+------------+-------------+--------------+-----------------+----+---------------+------+
|   id|gender| age|hypertension|heart_disease|ever_married|    work_type|Residence_type|avg_glucose_level| bmi| smoking_status|stroke|
+-----+------+----+------------+-------------+------------+-------------+--------------+-----------------+----+---------------+------+
| 9046|  Male|67.0|           0|            1|         Yes|      Private|         Urban|           228.69|36.6|formerly smoked|     1|
|51676|Female|61.0|           0|            0|         Yes|Self-employed|         Rural|           202.21| N/A|   never smoked|     1|
|31112|  Male|80.0|           0|            1|         Yes|      Private|         Rural|           105.92|32.5|   never smoked|     1|
|60182|Female|49.0|           0|            0|         Yes|      Private|         Urban|           171.23|34.4|         smokes|     1|
| 1665|Female|79.0|           1|            0|         

### 6.2 Define operations

Define data processing operations based on data analysis results.

In [8]:
# Only these three functions are used by the operations below
from pyspark.sql.functions import avg, col, udf

In [9]:
# Delete the useless column 'id'
def drop_col(data):
    data = data.drop('id')
    return data

In [10]:
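# Replace the value 'N/A' in 'bmi' with the column mean
def replace_nan(data):
    # 'bmi' is a string column; with Spark's default (non-ANSI) casting,
    # avg() turns 'N/A' into null and skips it
    bmi_avg = data.agg(avg(col("bmi"))).head()[0]

    # Named differently from the outer function to avoid shadowing it
    @udf("float")
    def fill_bmi(value):
        if value == 'N/A':
            return float(bmi_avg)
        else:
            return float(value)

    data = data.withColumn('bmi', fill_bmi(col("bmi")))
    return data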
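
The imputation that `replace_nan` performs can be traced on a handful of values without Spark. The sketch below is a plain-Python illustration of the same logic, not the UDF itself. The function name `impute_with_mean` is hypothetical, and the input strings are the `bmi` values of the first four rows printed by `data.show(5)`.

```python
def impute_with_mean(values, placeholder="N/A"):
    """Replace ``placeholder`` entries with the mean of the numeric entries."""
    numeric = [float(v) for v in values if v != placeholder]
    mean = sum(numeric) / len(numeric)
    return [mean if v == placeholder else float(v) for v in values]


# 'bmi' strings of the first four rows shown by data.show(5)
bmi_raw = ["36.6", "N/A", "32.5", "34.4"]
bmi_filled = impute_with_mean(bmi_raw)
print(bmi_filled)
```

The placeholder is replaced by the mean of the three numeric entries (about 34.5 here), and every value comes back as a float, which mirrors the `float` return type declared on the UDF.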

In [11]:
# Drop the single row whose 'gender' value is 'Other'
def clean_value(data):
    data = data.filter(data.gender != 'Other')
    return data

In [12]:
# Transform the category columns
def trans_category(data):
    @udf("int")
    def trans_gender(value):
        gender = {'Female': 0,
                  'Male': 1}
        return int(gender[value])

    @udf("int")
    def trans_ever_married(value):
        ever_married = {'No': 0,
                        'Yes': 1}
        return int(ever_married[value])

    @udf("int")
    def trans_work_type(value):
        work_type = {'children': 0,
                     'Govt_job': 1,
                     'Never_worked': 2,
                     'Private': 3,
                     'Self-employed': 4}
        return int(work_type[value])

    @udf("int")
    def trans_residence_type(value):
        residence_type = {'Rural': 0,
                          'Urban': 1}
        return int(residence_type[value])

    @udf("int")
    def trans_smoking_status(value):
        smoking_status = {'formerly smoked': 0,
                          'never smoked': 1,
                          'smokes': 2,
                          'Unknown': 3}
        return int(smoking_status[value])

    data = data.withColumn('gender', trans_gender(col('gender'))) \
               .withColumn('ever_married', trans_ever_married(col('ever_married'))) \
               .withColumn('work_type', trans_work_type(col('work_type'))) \
               .withColumn('Residence_type', trans_residence_type(col('Residence_type'))) \
               .withColumn('smoking_status', trans_smoking_status(col('smoking_status')))
    return data
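
Each UDF above indexes its dictionary directly, so a category that is missing from the dictionary raises a `KeyError` inside the Spark task. This is why `clean_value` has to run before `trans_category` in the preprocessing function: 'Other' is not a key of the gender dictionary. The sketch below shows one possible way to make the lookup explicit about unexpected values. The helper `encode` and its `default` parameter are hypothetical and not part of the notebook.

```python
# Mappings copied from trans_category
GENDER = {"Female": 0, "Male": 1}
WORK_TYPE = {"children": 0, "Govt_job": 1, "Never_worked": 2,
             "Private": 3, "Self-employed": 4}


def encode(value, mapping, default=None):
    """Look up ``value`` in ``mapping``; use ``default`` for unknown values if given."""
    if value in mapping:
        return mapping[value]
    if default is None:
        # Fail with a message that names the offending category
        raise ValueError(f"unexpected category: {value!r}")
    return default


print(encode("Private", WORK_TYPE))          # 3
print(encode("Other", GENDER, default=-1))   # -1
```

Note that a fallback code such as `-1` would still need handling downstream, because the embedding layers in the model only accept the indices listed in these dictionaries.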

In [13]:
# Add a discretized version of the column 'age'
def map_age(data):
    @udf("int")
    def get_value(value):
        if value >= 18 and value < 26:
            return 0
        elif value >= 26 and value < 36:
            return 1
        elif value >= 36 and value < 46:
            return 2
        elif value >= 46 and value < 56:
            return 3
        else:
            # Covers ages of 56 and above, and also ages below 18
            return 4

    data = data.withColumn('age_dis', get_value(col('age')))
    return data
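
The bucket boundaries in `get_value` can also be expressed as a sorted list and looked up with `bisect`. The sketch below reproduces the same mapping in plain Python so that the edge cases are easy to check. The names `AGE_EDGES` and `age_bucket` are hypothetical.

```python
from bisect import bisect_right

AGE_EDGES = [18, 26, 36, 46, 56]  # bucket boundaries used by get_value


def age_bucket(age):
    """Return the same bucket index that get_value assigns to ``age``."""
    idx = bisect_right(AGE_EDGES, age)  # 0 for age < 18, 5 for age >= 56
    # Buckets 0..3 cover [18, 26), [26, 36), [36, 46), [46, 56);
    # everything else (below 18, or 56 and above) shares bucket 4.
    return idx - 1 if 1 <= idx <= 4 else 4


for age in (10, 18, 26, 49.0, 56, 67.0):
    print(age, age_bucket(age))
```

The loop makes one property of the original mapping visible: ages below 18 land in bucket 4 together with ages of 56 and above.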

In [14]:
# Preprocess the data
def data_preprocess(data):
    data = drop_col(data)
    data = replace_nan(data)
    data = clean_value(data)
    data = trans_category(data)
    data = map_age(data)
    return data

## 7. Data processing

In [15]:
import torch
from raydp.utils import random_split

# Transform the dataset
data = data_preprocess(data)
# Split data into train_dataset and test_dataset
train_df, test_df = random_split(data, [0.8, 0.2], 0)
# Reduce the class imbalance by oversampling the positive (stroke == 1) samples:
# appending them twice gives each positive row three copies in total
train_df_pos = train_df.filter(train_df.stroke == 1)
train_df = train_df.unionByName(train_df_pos)
train_df = train_df.unionByName(train_df_pos)
features = [field.name for field in list(train_df.schema) if field.name != "stroke"]
# Convert the Spark DataFrames into Ray Datasets
# Remember to align ``parallelism`` with ``num_workers`` of Ray Train
train_dataset = ray.data.from_spark(train_df, parallelism = 8)
test_dataset = ray.data.from_spark(test_df, parallelism = 8)
feature_dtype = [torch.float] * len(features)
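
Appending the positive rows to the training set twice triples their count while leaving the negative rows unchanged. The sketch below works through that arithmetic. The class counts are hypothetical and chosen only for illustration, and `positive_fraction` is not a RayDP function.

```python
def positive_fraction(n_negative, n_positive, extra_copies):
    """Fraction of positive rows after appending ``extra_copies`` duplicates of them."""
    positives = n_positive * (1 + extra_copies)
    return positives / (n_negative + positives)


# Hypothetical class counts, chosen only to illustrate the arithmetic
before = positive_fraction(950, 50, extra_copies=0)
after = positive_fraction(950, 50, extra_copies=2)
print(f"{before:.3f} -> {after:.3f}")  # 0.050 -> 0.136
```

Under these assumed counts the positive class grows from 5% to roughly 14% of the training rows, so the data is less skewed but still not evenly split.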

## 8. Define a neural network model

In [16]:
import torch.nn as nn
import torch.nn.functional as F

class NET_Model(nn.Module):
    def __init__(self, cols):
        # ``cols`` (the number of feature columns) is kept so that existing
        # callers keep working; the layer sizes below are fixed.
        super().__init__()
        self.emb_layer_gender = nn.Embedding(2, 2)           # gender
        self.emb_layer_hypertension = nn.Embedding(2, 2)     # hypertension
        self.emb_layer_heart_disease = nn.Embedding(2, 2)    # heart_disease
        self.emb_layer_ever_married = nn.Embedding(2, 2)     # ever_married
        self.emb_layer_work = nn.Embedding(5, 5)             # work_type
        self.emb_layer_residence = nn.Embedding(2, 2)        # Residence_type
        self.emb_layer_smoking_status = nn.Embedding(4, 4)   # smoking_status
        self.emb_layer_age = nn.Embedding(5, 5)              # age column after discretization

        # The eight embeddings produce 2+2+2+2+5+2+4+5 = 24 values per row
        self.fc_sparse = nn.Linear(24, 16)
        # Three dense columns: age, avg_glucose_level, bmi
        self.fc_dense = nn.Linear(3, 8)
        # 8 dense features + 16 sparse features
        self.fc = nn.Linear(24, 2)

    def forward(self, *x):
        x = torch.cat(x, dim=1)
        # Pick the dense attribute columns
        dense_columns = x[:, [1,7,8]]
        # Embedding operation on sparse attribute columns
        sparse_col_1 = self.emb_layer_gender(x[:, 0].long())
        sparse_col_2 = self.emb_layer_hypertension(x[:, 2].long())
        sparse_col_3 = self.emb_layer_heart_disease(x[:, 3].long())
        sparse_col_4 = self.emb_layer_ever_married(x[:, 4].long())
        sparse_col_5 = self.emb_layer_work(x[:, 5].long())
        sparse_col_6 = self.emb_layer_residence(x[:, 6].long())
        sparse_col_7 = self.emb_layer_smoking_status(x[:, 9].long())
        sparse_col_8 = self.emb_layer_age(x[:, 10].long())
        # Concatenate the embedded sparse columns
        sparse_columns = torch.cat([sparse_col_1, sparse_col_2, sparse_col_3, sparse_col_4, sparse_col_5, sparse_col_6, sparse_col_7, sparse_col_8], dim=1)
        # Project the dense and sparse parts separately, then classify
        dense_feat = self.fc_dense(dense_columns)
        sparse_feat = self.fc_sparse(sparse_columns)
        return self.fc(torch.cat([dense_feat, sparse_feat], dim=1))
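
The layer sizes `fc_sparse = nn.Linear(24, 16)`, `fc_dense = nn.Linear(3, 8)` and `fc = nn.Linear(24, 2)` follow from the embedding widths and from the column order of `features`. The sketch below redoes that bookkeeping in plain Python. The `FEATURES` list is an assumption derived from the preprocessing steps (the `id` column dropped, `age_dis` appended, `stroke` excluded), so it is worth comparing against the actual `features` list at runtime.

```python
# Assumed column order of ``features`` after preprocessing
FEATURES = ["gender", "age", "hypertension", "heart_disease", "ever_married",
            "work_type", "Residence_type", "avg_glucose_level", "bmi",
            "smoking_status", "age_dis"]

# (number of categories, embedding width) per sparse column, as in NET_Model
EMBEDDINGS = {
    "gender": (2, 2), "hypertension": (2, 2), "heart_disease": (2, 2),
    "ever_married": (2, 2), "work_type": (5, 5), "Residence_type": (2, 2),
    "smoking_status": (4, 4), "age_dis": (5, 5),
}

dense_names = [FEATURES[i] for i in (1, 7, 8)]        # indices used in forward()
sparse_width = sum(width for _, width in EMBEDDINGS.values())
head_input = 8 + 16                                    # fc_dense output + fc_sparse output

print(dense_names)                 # ['age', 'avg_glucose_level', 'bmi']
print(sparse_width, head_input)    # 24 24
```

If a column is added or reordered during preprocessing, the hard-coded indices in `forward` and these widths need to be updated together.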


## 9. Define the train and test functions

In [17]:
def train_epoch(dataset, model, criterion, optimizer):
    model.train()
    train_loss, correct, data_size, batch_idx = 0, 0, 0, 0
    for batch_idx, (inputs, targets) in enumerate(dataset):
        # Compute prediction error
        inputs = [inputs[:,i].unsqueeze(1) for i in range(inputs.size(1))]
        targets = targets.reshape(-1)
        outputs = model(*inputs)
        loss = criterion(outputs, targets)
        train_loss += loss.item()
        _, predicted = torch.max(outputs.data, 1)
        data_size += targets.size(0)
        correct += (predicted == targets).sum().item()
        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Calculate the train_loss and train_acc
    train_loss /= (batch_idx + 1)
    train_acc = correct/data_size
    return train_acc, train_loss

def test_epoch(dataset, model, criterion):
    model.eval()
    test_loss, correct, data_size, batch_idx = 0, 0, 0, 0
    with torch.no_grad():
        for batch_idx, (inputs, targets) in enumerate(dataset):
            # Compute prediction error
            inputs = [inputs[:,i].unsqueeze(1) for i in range(inputs.size(1))]
            targets = targets.reshape(-1)
            outputs = model(*inputs)
            test_loss += criterion(outputs, targets).item()
            _, predicted = torch.max(outputs.data, 1)
            data_size += targets.size(0)
            correct += (predicted == targets).sum().item()
    # Calculate the test_loss and test_acc
    test_loss /= (batch_idx + 1)
    test_acc = correct/data_size
    return test_acc, test_loss
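
Both functions report accuracy per sample (`correct / data_size`) but loss per batch (sum of batch losses divided by the number of batches). When the last batch is smaller than the others, the per-batch average gives that batch more influence than a per-sample average would. The sketch below illustrates the difference with hypothetical batch statistics. The helper `summarize` is not part of the notebook.

```python
def summarize(batches):
    """Aggregate (mean_batch_loss, n_correct, n_samples) tuples.

    Returns (accuracy, per-batch mean loss, per-sample mean loss).
    """
    n_samples = sum(n for _, _, n in batches)
    accuracy = sum(c for _, c, _ in batches) / n_samples
    per_batch = sum(loss for loss, _, _ in batches) / len(batches)
    per_sample = sum(loss * n for loss, _, n in batches) / n_samples
    return accuracy, per_batch, per_sample


# Hypothetical numbers: two full batches of 64 and a final batch of 16
acc, loss_per_batch, loss_per_sample = summarize(
    [(0.50, 48, 64), (0.40, 52, 64), (0.90, 8, 16)]
)
print(acc, round(loss_per_batch, 3), round(loss_per_sample, 3))  # 0.75 0.6 0.5
```

The per-batch figure matches what `train_epoch` and `test_epoch` compute; the per-sample figure is shown only for comparison.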

## 10. Define train function

In [18]:
from ray import train
from ray.train import get_dataset_shard

def train_func(config):
    num_epochs = config["num_epochs"]
    lr = config["lr"]
    batch_size = config["batch_size"]
    # Get the corresponding dataset shard for this worker,
    # then convert it to a torch dataset
    train_data_shard = get_dataset_shard("train")
    train_dataset = train_data_shard.to_torch(feature_columns=features,
                                              label_column="stroke",
                                              label_column_dtype=torch.long,
                                              feature_column_dtypes=feature_dtype,
                                              batch_size=batch_size)
    test_data_shard = get_dataset_shard("test")
    test_dataset = test_data_shard.to_torch(feature_columns=features,
                                            label_column="stroke",
                                            label_column_dtype=torch.long,
                                            feature_column_dtypes=feature_dtype,
                                            batch_size=batch_size)
    model = NET_Model(len(features))
    model = train.torch.prepare_model(model)
    criterion = nn.CrossEntropyLoss(weight=torch.tensor([0.35, 0.65]))
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_results = []
    for epoch in range(num_epochs):
        train_acc, train_loss = train_epoch(train_dataset, model, criterion, optimizer)
        test_acc, test_loss = test_epoch(test_dataset, model, criterion)
        train.report(epoch = epoch, train_acc = train_acc, train_loss = train_loss)
        train.report(epoch = epoch, test_acc=test_acc, test_loss=test_loss)
        loss_results.append(test_loss)
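
The criterion in `train_func` weights the two classes as `[0.35, 0.65]`, so an error on a stroke sample costs more than an error on a non-stroke sample. The sketch below recomputes such a loss in plain Python, assuming the weighted-mean reduction that `nn.CrossEntropyLoss` applies by default when `weight` is given (the sum of weighted per-sample losses divided by the sum of the weights of the target classes). The function name and the example logits are hypothetical.

```python
import math


def weighted_cross_entropy(logits, targets, weights):
    """Weighted mean of per-sample cross entropy with one weight per class."""
    numerator, denominator = 0.0, 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp of the logits
        peak = max(row)
        log_sum = peak + math.log(sum(math.exp(v - peak) for v in row))
        nll = log_sum - row[target]          # negative log-likelihood of the target
        numerator += weights[target] * nll
        denominator += weights[target]
    return numerator / denominator


# Hypothetical logits: both samples are scored as class 0, the second one wrongly
logits = [[2.0, 0.0], [2.0, 0.0]]
targets = [0, 1]
plain = weighted_cross_entropy(logits, targets, [1.0, 1.0])
weighted = weighted_cross_entropy(logits, targets, [0.35, 0.65])
print(round(plain, 4), round(weighted, 4))
```

With the class weights applied, the misclassified positive sample dominates the average, so the weighted loss is larger than the unweighted one for the same predictions.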

## 11. Define the callback function

In [19]:
from ray.train import TrainingCallback
from typing import List, Dict

# log the train results
class PrintingCallback(TrainingCallback):
    def handle_result(self, results: List[Dict], **info):
        print(results)
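
Because `train_func` calls `train.report` twice per epoch, once for the training metrics and once for the test metrics, the callback receives two separate results for every epoch. The sketch below shows one possible way to merge such records by epoch in plain Python. The helper `merge_reports` is hypothetical, and the sample values are illustrative numbers shaped like the dictionaries that the callback prints.

```python
def merge_reports(reports):
    """Combine report dicts that share an ``epoch`` into one dict per epoch."""
    merged = {}
    for report in reports:
        # Drop bookkeeping keys such as '_timestamp' or '_training_iteration'
        metrics = {k: v for k, v in report.items() if not k.startswith("_")}
        merged.setdefault(metrics["epoch"], {}).update(metrics)
    return [merged[epoch] for epoch in sorted(merged)]


# Illustrative values; the keys follow the train.report calls in train_func
reports = [
    {"epoch": 0, "train_acc": 0.76, "train_loss": 1.48, "_training_iteration": 1},
    {"epoch": 0, "test_acc": 0.82, "test_loss": 0.62, "_training_iteration": 2},
]
print(merge_reports(reports))
# [{'epoch': 0, 'train_acc': 0.76, 'train_loss': 1.48, 'test_acc': 0.82, 'test_loss': 0.62}]
```

A callback could collect the dictionaries it receives in a list and run a merge like this after training to get one row per epoch.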

## 12. Train the model via Ray Train

In [20]:
from ray.train import Trainer

trainer = Trainer(backend="torch", num_workers=num_executors)
trainer.start()
results = trainer.run(
    train_func, config={"num_epochs": 10, "lr": 0.001, "batch_size": 64},
    callbacks=[PrintingCallback()],
    dataset={
        "train": train_dataset,
        "test": test_dataset
    }
)
trainer.shutdown()

2022-05-17 04:40:11,390	INFO trainer.py:172 -- Trainer logs will be logged in: /root/ray_results/train_2022-05-17_04-40-11
2022-05-17 04:40:12,923	INFO trainer.py:178 -- Run results will be logged in: /root/ray_results/train_2022-05-17_04-40-11/run_001
(BaseWorkerMixin pid=8945) 2022-05-17 04:40:12,920	INFO torch.py:67 -- Setting up process group for: env:// [rank=0, world_size=1]
(BaseWorkerMixin pid=8945) 2022-05-17 04:40:12,986	INFO torch.py:239 -- Moving model to device: cpu


[{'epoch': 0, 'train_acc': 0.7635832959137854, 'train_loss': 1.4808189805064882, '_timestamp': 1652762413, '_time_this_iter_s': 0.8604536056518555, '_training_iteration': 1}]
[{'epoch': 0, 'test_acc': 0.8224121557454891, 'test_loss': 0.6221899249974419, '_timestamp': 1652762413, '_time_this_iter_s': 0.0012180805206298828, '_training_iteration': 2}]
[{'epoch': 1, 'train_acc': 0.7882801975752133, 'train_loss': 0.6724655921970095, '_timestamp': 1652762414, '_time_this_iter_s': 0.6026747226715088, '_training_iteration': 3}]
[{'epoch': 1, 'test_acc': 0.929724596391263, 'test_loss': 0.39153894606758566, '_timestamp': 1652762414, '_time_this_iter_s': 0.0008864402770996094, '_training_iteration': 4}]
[{'epoch': 2, 'train_acc': 0.8477772788504715, 'train_loss': 0.4688507060919489, '_timestamp': 1652762415, '_time_this_iter_s': 0.5889034271240234, '_training_iteration': 5}]
[{'epoch': 2, 'test_acc': 0.9278252611585945, 'test_loss': 0.32463862878434796, '_timestamp': 1652762415, '_time_this_iter_

## 13. Shut down Ray and RayDP

In [21]:
raydp.stop_spark()
ray.shutdown()