# **How RayDP works together with PyTorch**

RayDP is a distributed data processing library that provides simple APIs for running Spark on Ray and for integrating Spark with distributed deep learning and machine learning frameworks. This document builds an end-to-end deep learning pipeline on a single Ray cluster: Spark preprocesses the data, and a distributed estimator based on the RayDP API handles training and evaluation.

## 1. Colab environment setup

RayDP requires Ray and PySpark. PyTorch is used to build the deep learning model.

In [1]:
! pip install ray==1.9
! pip install raydp
! pip install ray[tune]
! pip install torch==1.8.1+cpu -f https://download.pytorch.org/whl/torch_stable.html

Collecting ray==1.9
  Downloading ray-1.9.0-cp37-cp37m-manylinux2014_x86_64.whl (57.6 MB)
Collecting redis>=3.5.0
  Downloading redis-4.3.1-py3-none-any.whl (241 kB)
Collecting deprecated>=1.2.3
  Downloading Deprecated-1.2.13-py2.py3-none-any.whl (9.6 kB)
Collecting async-timeout>=4.0.2
  Downloading async_timeout-4.0.2-py3-none-any.whl (5.8 kB)
Installing collected packages: deprecated, async-timeout, redis, ray
Successfully installed async-timeout-4.0.2 deprecated-1.2.13 ray-1.9.0 redis-4.3.1
Collecting raydp
  Downloading raydp-0.4.2-py3-none-any.whl (10.5 MB)
Collecting netifaces
  Downloading netifaces-0.11.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (32 kB)
Collecting pyspark>=3.2.0
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)

Collecting tensorboardX>=1.9
  Downloading tensorboardX-2.5-py2.py3-none-any.whl (125 kB)
Installing collected packages: tensorboardX
Successfully installed tensorboardX-2.5
Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.8.1+cpu
  Downloading https://download.pytorch.org/whl/cpu/torch-1.8.1%2Bcpu-cp37-cp37m-linux_x86_64.whl (169.1 MB)
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.11.0+cu113
    Uninstalling torch-1.11.0+cu113:
      Successfully uninstalled torch-1.11.0+cu113
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.12.0+cu113 requires torch==1.11.0, but you have torch 1.8.1+cpu which is incompatible.
torchtext 0.1

## 2. Get the data file

The dataset comes from https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset, and we store a copy of the file in a GitHub repository. It is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row provides the relevant information about one patient.

In [2]:
! wget https://raw.githubusercontent.com/KepingYan/Test/main/data/healthcare-dataset-stroke-data.csv -O healthcare-dataset-stroke-data.csv

--2022-05-17 04:43:45--  https://raw.githubusercontent.com/KepingYan/Test/main/data/healthcare-dataset-stroke-data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 316971 (310K) [text/plain]
Saving to: ‘healthcare-dataset-stroke-data.csv’


2022-05-17 04:43:47 (7.73 MB/s) - ‘healthcare-dataset-stroke-data.csv’ saved [316971/316971]



## 3. Init or connect to a Ray cluster

In [3]:
import ray

ray.init(num_cpus=6)

2022-05-17 04:43:51,684	INFO services.py:1340 -- View the Ray dashboard at http://127.0.0.1:8265


{'metrics_export_port': 54253,
 'node_id': '1e0e1bce883f672ebace868dbf36e508b0bd0a203cabd13d5e28f922',
 'node_ip_address': '172.28.0.2',
 'object_store_address': '/tmp/ray/session_2022-05-17_04-43-47_883416_61/sockets/plasma_store',
 'raylet_ip_address': '172.28.0.2',
 'raylet_socket_name': '/tmp/ray/session_2022-05-17_04-43-47_883416_61/sockets/raylet',
 'redis_address': '172.28.0.2:6379',
 'session_dir': '/tmp/ray/session_2022-05-17_04-43-47_883416_61',
 'webui_url': '127.0.0.1:8265'}

## 4. Get a Spark session

In [4]:
import raydp

app_name = "Stroke Prediction with RayDP"
num_executors = 1
cores_per_executor = 1
memory_per_executor = "500M"
spark = raydp.init_spark(app_name, num_executors, cores_per_executor, memory_per_executor)

## 5. Get data from the .csv file via the `spark` session created by **raydp**

In [5]:
data = spark.read.format("csv").option("header", "true") \
        .option("inferSchema", "true") \
        .load("/content/healthcare-dataset-stroke-data.csv")



## 6. Define the data_preprocess function

The dataset is loaded as a `pyspark.sql.dataframe.DataFrame`. Before feeding it into the deep learning model, we can use RayDP to apply some transformations to it.

### 6.1 Data Analysis

Here is a part of the data analysis.

In [6]:
# Data overview
data.show(5)
# Summary statistics and the number of 'N/A' entries in 'bmi'.
# There are 201 'N/A' values in the 'bmi' column;
# we replace them with the mean of the column.
data.describe().show()
data.filter(data.bmi=='N/A').count()
# Observe the distribution of the column 'gender'.
# The outlier value 'Other' is removed later.
data.rollup(data.gender).count().show()
# Observe the proportion of positive and negative samples.
data.rollup(data.stroke).count().show()

+-----+------+----+------------+-------------+------------+-------------+--------------+-----------------+----+---------------+------+
|   id|gender| age|hypertension|heart_disease|ever_married|    work_type|Residence_type|avg_glucose_level| bmi| smoking_status|stroke|
+-----+------+----+------------+-------------+------------+-------------+--------------+-----------------+----+---------------+------+
| 9046|  Male|67.0|           0|            1|         Yes|      Private|         Urban|           228.69|36.6|formerly smoked|     1|
|51676|Female|61.0|           0|            0|         Yes|Self-employed|         Rural|           202.21| N/A|   never smoked|     1|
|31112|  Male|80.0|           0|            1|         Yes|      Private|         Rural|           105.92|32.5|   never smoked|     1|
|60182|Female|49.0|           0|            0|         Yes|      Private|         Urban|           171.23|34.4|         smokes|     1|
| 1665|Female|79.0|           1|            0|         
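
The checks in this cell (count the `N/A` entries in `bmi`, then look at the `gender` and `stroke` distributions) can also be sketched in plain Python on the four complete rows shown above. This is only an illustration on a tiny sample and not a replacement for the Spark calls; the helper name `summarize` is hypothetical.

```python
from collections import Counter

# The four complete rows from the data.show(5) output, reduced to the
# columns inspected in the analysis cell.
sample_rows = [
    {"gender": "Male", "bmi": "36.6", "stroke": 1},
    {"gender": "Female", "bmi": "N/A", "stroke": 1},
    {"gender": "Male", "bmi": "32.5", "stroke": 1},
    {"gender": "Female", "bmi": "34.4", "stroke": 1},
]


def summarize(rows):
    """Return the 'N/A' count in 'bmi' and value counts for 'gender' and 'stroke'."""
    return {
        "bmi_na": sum(1 for r in rows if r["bmi"] == "N/A"),
        "gender": Counter(r["gender"] for r in rows),
        "stroke": Counter(r["stroke"] for r in rows),
    }


summary = summarize(sample_rows)
print(summary["bmi_na"], dict(summary["gender"]), dict(summary["stroke"]))
```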

### 6.2 Define operations

Define data processing operations based on data analysis results.

In [7]:
# Only the functions used by the preprocessing steps below are imported
from pyspark.sql.functions import avg, col, udf

In [8]:
# Delete the useless column 'id'
def drop_col(data):
    data = data.drop('id')
    return data

In [9]:
# Replace the value N/A in 'bmi'
def replace_nan(data):
    # Column mean used as the fill value
    bmi_avg = data.agg(avg(col("bmi"))).head()[0]

    @udf("float")
    def fill_bmi(value):
        # 'bmi' values arrive as strings such as '36.6' or 'N/A'
        if value == 'N/A':
            return float(bmi_avg)
        else:
            return float(value)

    # Replace the value N/A
    data = data.withColumn('bmi', fill_bmi(col("bmi")))
    return data
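
The imputation idea can be sketched without Spark: average the values that parse as numbers and substitute that average for each `N/A`. The function name `impute_with_mean` and the sample list are illustrative assumptions, not part of RayDP or PySpark.

```python
def impute_with_mean(values, missing="N/A"):
    """Replace `missing` markers with the mean of the remaining numeric values."""
    numeric = [float(v) for v in values if v != missing]
    mean = sum(numeric) / len(numeric)  # assumes at least one non-missing value
    return [mean if v == missing else float(v) for v in values]


# String values in the same form as the 'bmi' column
bmi = ["36.6", "N/A", "32.5", "34.4"]
filled = impute_with_mean(bmi)
print(filled)
```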

In [10]:
# Drop the single row whose 'gender' value is 'Other'
def clean_value(data):
    data = data.filter(data.gender != 'Other')
    return data

In [11]:
# Transform the category columns
def trans_category(data):
    @udf("int")
    def trans_gender(value):
        gender = {'Female': 0,
                  'Male': 1}
        return int(gender[value])

    @udf("int")
    def trans_ever_married(value):
        ever_married = {'No': 0,
                        'Yes': 1}
        return int(ever_married[value])

    @udf("int")
    def trans_work_type(value):
        work_type = {'children': 0,
                     'Govt_job': 1,
                     'Never_worked': 2,
                     'Private': 3,
                     'Self-employed': 4}
        return int(work_type[value])

    @udf("int")
    def trans_residence_type(value):
        residence_type = {'Rural': 0,
                          'Urban': 1}
        return int(residence_type[value])

    @udf("int")
    def trans_smoking_status(value):
        smoking_status = {'formerly smoked': 0,
                          'never smoked': 1,
                          'smokes': 2,
                          'Unknown': 3}
        return int(smoking_status[value])

    data = data.withColumn('gender', trans_gender(col('gender'))) \
               .withColumn('ever_married', trans_ever_married(col('ever_married'))) \
               .withColumn('work_type', trans_work_type(col('work_type'))) \
               .withColumn('Residence_type', trans_residence_type(col('Residence_type'))) \
               .withColumn('smoking_status', trans_smoking_status(col('smoking_status')))
    return data
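
Each UDF above looks its input up in a dictionary, so a category without an entry raises a `KeyError` inside the Spark task. This is why `clean_value` removes `'Other'` before `trans_category` runs. A small pre-check along the following lines can list unmapped categories ahead of time; the helper `unmapped_values` and the sample values are hypothetical.

```python
def unmapped_values(values, mapping):
    """Return the distinct values that have no entry in `mapping`, sorted."""
    return sorted(set(values) - set(mapping))


gender_map = {"Female": 0, "Male": 1}
observed = ["Male", "Female", "Other", "Female"]

missing = unmapped_values(observed, gender_map)
print(missing)  # ['Other']
```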

In [12]:
# Add the discretized column of 'age'
def map_age(data):
    @udf("int")
    def get_value(value):
        if 18 <= value < 26:
            return 0
        elif 26 <= value < 36:
            return 1
        elif 36 <= value < 46:
            return 2
        elif 46 <= value < 56:
            return 3
        else:
            # Ages of 56 and above, and also ages below 18
            return 4

    data = data.withColumn('age_dis', get_value(col('age')))
    return data
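
The bucket boundaries can be checked outside Spark with a short sketch based on `bisect`. It is written to reproduce the bucket ids of `get_value`, including the detail that ages below 18 fall through to the last bucket; the names `AGE_EDGES` and `age_bucket` are illustrative.

```python
from bisect import bisect_right

AGE_EDGES = [18, 26, 36, 46, 56]  # lower bounds of buckets 0..4


def age_bucket(age):
    """Reproduce the bucket ids assigned by get_value in map_age."""
    idx = bisect_right(AGE_EDGES, age) - 1
    # Ages below 18 have no bucket of their own; like the else branch of
    # the UDF, they land in bucket 4 together with ages 56 and above.
    return 4 if idx < 0 else min(idx, 4)


buckets = [age_bucket(a) for a in (10, 18, 25.9, 26, 45, 46, 56, 80)]
print(buckets)
```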

In [13]:
# Preprocess the data
def data_preprocess(data):
    data = drop_col(data)
    data = replace_nan(data)
    data = clean_value(data)
    data = trans_category(data)
    data = map_age(data)
    return data

## 7. Data processing

In [14]:
import torch
from raydp.utils import random_split

# Transform the dataset
data = data_preprocess(data)
# Split data into train_dataset and test_dataset
train_df, test_df = random_split(data, [0.8, 0.2], 0)
# Balance the positive and negative samples by adding two extra
# copies of the positive (stroke == 1) rows to the training set
train_df_pos = train_df.filter(train_df.stroke == 1)
train_df = train_df.unionByName(train_df_pos)
train_df = train_df.unionByName(train_df_pos)
features = [field.name for field in list(train_df.schema) if field.name != "stroke"]
# Convert spark dataframe into ray Dataset
# Remember to align ``parallelism`` with ``num_workers`` of ray train
train_dataset = ray.data.from_spark(train_df, parallelism = 8)
test_dataset = ray.data.from_spark(test_df, parallelism = 8)
feature_dtype = [torch.float] * len(features)
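
The two `unionByName` calls triple the number of positive rows in the training set. The effect on the class ratio can be estimated with a little arithmetic; the counts below are hypothetical and chosen only to illustrate the calculation, and `positive_share` is not a RayDP function.

```python
def positive_share(n_pos, n_neg, extra_copies=2):
    """Share of positive rows after appending `extra_copies` copies of the positives."""
    pos_after = n_pos * (1 + extra_copies)
    return pos_after / (pos_after + n_neg)


# Hypothetical counts, not taken from the dataset
before = positive_share(200, 3800, extra_copies=0)
after = positive_share(200, 3800, extra_copies=2)
print(round(before, 3), round(after, 3))
```

Oversampling by duplication raises the positive share but still leaves the classes unbalanced, so the number of copies is a tuning choice.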

## 8. Define a neural network model

In [15]:
import torch.nn as nn
import torch.nn.functional as F

class NET_Model(nn.Module):
    def __init__(self, cols):
        super().__init__()
        self.emb_layer_gender = nn.Embedding(2, 1)           # gender
        self.emb_layer_hypertension = nn.Embedding(2,1)      # hypertension
        self.emb_layer_heart_disease = nn.Embedding(2,1)     # heart_disease
        self.emb_layer_ever_married = nn.Embedding(2, 1)     # ever_married
        self.emb_layer_work = nn.Embedding(5, 1)             # work_type
        self.emb_layer_residence = nn.Embedding(2, 1)        # Residence_type
        self.emb_layer_smoking_status = nn.Embedding(4, 1)   # smoking_status
        self.emb_layer_age = nn.Embedding(5, 1)              # age column after discretization
        self.fc1 = nn.Linear(cols, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 64)
        self.fc4 = nn.Linear(64, 16)
        self.fc5 = nn.Linear(16, 2)
        self.bn1 = nn.BatchNorm1d(256)
        self.bn2 = nn.BatchNorm1d(128)
        self.bn3 = nn.BatchNorm1d(64)
        self.bn4 = nn.BatchNorm1d(16)

    def forward(self, *x):
        x = torch.cat(x, dim=1)
        # pick the dense attribute columns
        dense_columns = x[:, [1,7,8]]
        # Embedding operation on sparse attribute columns
        sparse_col_1 = self.emb_layer_gender(x[:, 0].long())
        sparse_col_2 = self.emb_layer_hypertension(x[:, 2].long())
        sparse_col_3 = self.emb_layer_heart_disease(x[:, 3].long())
        sparse_col_4 = self.emb_layer_ever_married(x[:, 4].long())
        sparse_col_5 = self.emb_layer_work(x[:, 5].long())
        sparse_col_6 = self.emb_layer_residence(x[:, 6].long())
        sparse_col_7 = self.emb_layer_smoking_status(x[:, 9].long())
        sparse_col_8 = self.emb_layer_age(x[:, 10].long())
        # Splice sparse attribute columns and dense attribute columns
        x = torch.cat([dense_columns, sparse_col_1, sparse_col_2, sparse_col_3, sparse_col_4, sparse_col_5, sparse_col_6, sparse_col_7, sparse_col_8], dim=1)

        x = F.relu(self.fc1(x))
        x = self.bn1(x)
        x = F.relu(self.fc2(x))
        x = self.bn2(x)
        x = F.relu(self.fc3(x))
        x = self.bn3(x)
        x = F.relu(self.fc4(x))
        x = self.bn4(x)
        x = self.fc5(x)
        return x
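
The hard-coded indices in `forward` depend on the column order of `features`. The sketch below spells that order out, assuming the column order shown in the `data.show` output with `id` and `stroke` removed and `age_dis` appended last, and checks that the dense and sparse indices together cover every feature exactly once.

```python
# Assumed column order after data_preprocess, with the label 'stroke' removed
features = ["gender", "age", "hypertension", "heart_disease", "ever_married",
            "work_type", "Residence_type", "avg_glucose_level", "bmi",
            "smoking_status", "age_dis"]

dense_idx = [1, 7, 8]                   # passed through unchanged
sparse_idx = [0, 2, 3, 4, 5, 6, 9, 10]  # each goes through an nn.Embedding

dense_names = [features[i] for i in dense_idx]
sparse_names = [features[i] for i in sparse_idx]

# Every embedding has output dimension 1, so the width fed to fc1 is
# len(dense) + len(sparse), which must equal the `cols` argument.
fc1_in = len(dense_idx) + len(sparse_idx) * 1
print(dense_names, fc1_in)
```

If a preprocessing step adds, drops, or reorders a column, these indices need to be updated along with it.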


## 9. Create model, criterion and optimizer

In [16]:
import torch
import torch.nn as nn

net_model = NET_Model(len(features))
criterion = nn.SmoothL1Loss()
optimizer = torch.optim.Adam(net_model.parameters(), lr=0.001)
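
To make the criterion concrete, here is a plain-Python sketch of the element-wise value that `nn.SmoothL1Loss` computes with its default `beta=1.0`: quadratic for small errors and linear for large ones. With the default reduction these element-wise values are then averaged. The function name `smooth_l1` is only for illustration.

```python
def smooth_l1(diff, beta=1.0):
    """Element-wise Smooth L1 value for a prediction error `diff`."""
    a = abs(diff)
    if a < beta:
        return 0.5 * a * a / beta  # quadratic near zero
    return a - 0.5 * beta          # linear for large errors


values = [smooth_l1(d) for d in (0.0, 0.5, 1.0, 2.0)]
print(values)
```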

## 10. Create distributed estimator and train

In [17]:
from raydp.torch import TorchEstimator

estimator = TorchEstimator(num_workers=1, model=net_model, optimizer=optimizer, loss=criterion,
                           feature_columns=features, label_column="stroke", batch_size=64,
                           num_epochs=30)
# Train the model
estimator.fit_on_spark(train_df, test_df)
estimator.shutdown()

[2m[36m(TorchRunner pid=1080)[0m   return F.smooth_l1_loss(input, target, reduction=self.reduction, beta=self.beta)


Epoch-0: {'num_samples': 4454, 'epoch': 1.0, 'batch_count': 70.0, 'train_loss': 0.07485512316855926, 'last_train_loss': 0.0520394966006279}


[2m[36m(TorchRunner pid=1080)[0m   return F.smooth_l1_loss(input, target, reduction=self.reduction, beta=self.beta)


Epoch-1: {'num_samples': 4454, 'epoch': 2.0, 'batch_count': 70.0, 'train_loss': 0.05209015807965758, 'last_train_loss': 0.06677234917879105}
Epoch-2: {'num_samples': 4454, 'epoch': 3.0, 'batch_count': 70.0, 'train_loss': 0.048088046336955535, 'last_train_loss': 0.0547766387462616}
Epoch-3: {'num_samples': 4454, 'epoch': 4.0, 'batch_count': 70.0, 'train_loss': 0.048267952291538324, 'last_train_loss': 0.07120266556739807}
Epoch-4: {'num_samples': 4454, 'epoch': 5.0, 'batch_count': 70.0, 'train_loss': 0.04843106401016159, 'last_train_loss': 0.036428745836019516}
Epoch-5: {'num_samples': 4454, 'epoch': 6.0, 'batch_count': 70.0, 'train_loss': 0.04771103299769469, 'last_train_loss': 0.026480354368686676}
Epoch-6: {'num_samples': 4454, 'epoch': 7.0, 'batch_count': 70.0, 'train_loss': 0.047705781934209934, 'last_train_loss': 0.07878496497869492}
Epoch-7: {'num_samples': 4454, 'epoch': 8.0, 'batch_count': 70.0, 'train_loss': 0.04731543994620342, 'last_train_loss': 0.035199981182813644}




Epoch-8: {'num_samples': 4454, 'epoch': 9.0, 'batch_count': 70.0, 'train_loss': 0.04675104429248635, 'last_train_loss': 0.0598708838224411}
Epoch-9: {'num_samples': 4454, 'epoch': 10.0, 'batch_count': 70.0, 'train_loss': 0.04691653900002357, 'last_train_loss': 0.028144823387265205}
Epoch-10: {'num_samples': 4454, 'epoch': 11.0, 'batch_count': 70.0, 'train_loss': 0.04727107569488325, 'last_train_loss': 0.06265528500080109}
Epoch-11: {'num_samples': 4454, 'epoch': 12.0, 'batch_count': 70.0, 'train_loss': 0.04663434016614898, 'last_train_loss': 0.03826573118567467}
Epoch-12: {'num_samples': 4454, 'epoch': 13.0, 'batch_count': 70.0, 'train_loss': 0.046925004998236215, 'last_train_loss': 0.04130573570728302}
Epoch-13: {'num_samples': 4454, 'epoch': 14.0, 'batch_count': 70.0, 'train_loss': 0.04591377142550809, 'last_train_loss': 0.0486239455640316}
Epoch-14: {'num_samples': 4454, 'epoch': 15.0, 'batch_count': 70.0, 'train_loss': 0.04579158623622949, 'last_train_loss': 0.040485624223947525}
E
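
The `batch_count` reported in each epoch follows from the sample count and the batch size. The short check below reproduces the arithmetic from the values in the log, assuming a single worker (as configured with `num_workers=1`) and that the final partial batch is kept.

```python
import math

num_samples = 4454  # 'num_samples' in the epoch log
batch_size = 64     # batch_size passed to TorchEstimator

batches_per_epoch = math.ceil(num_samples / batch_size)
last_batch_size = num_samples - (batches_per_epoch - 1) * batch_size
print(batches_per_epoch, last_batch_size)
```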

## 11. Shut down Ray and RayDP

In [18]:
raydp.stop_spark()
ray.shutdown()

[2m[36m(TorchRunner pid=1080)[0m   return F.smooth_l1_loss(input, target, reduction=self.reduction, beta=self.beta)
