# Data Processing with Ray

For this chapter you need to install the following dependencies:

In [None]:
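# The cells below use dict-style rows (x['id']) and Dataset.materialize(), which Ray 2.2.0 predates.
# A newer release is needed; 2.5 or later is assumed here.
! pip install "ray[data]>=2.5"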
! pip install "scikit-learn==1.0.2"
! pip install "dask==2022.2.0"

![Simple Ray Data](https://raw.githubusercontent.com/maxpumperla/learning_ray/main/notebooks/images/chapter_06/AIR_data.png)


![Data Pipeline 1](https://raw.githubusercontent.com/maxpumperla/learning_ray/main/notebooks/images/chapter_06/data_pipeline_1.png)

![Data Pipeline 2](https://raw.githubusercontent.com/maxpumperla/learning_ray/main/notebooks/images/chapter_06/data_pipeline_2.png)

![Data Positioning 1](https://raw.githubusercontent.com/maxpumperla/learning_ray/main/notebooks/images/chapter_06/data_positioning_1.png)

![Data Positioning 2](https://raw.githubusercontent.com/maxpumperla/learning_ray/main/notebooks/images/chapter_06/data_positioning_2.png)

![Data Architecture](https://raw.githubusercontent.com/maxpumperla/learning_ray/main/notebooks/images/chapter_06/datasets_arch.png)

![Data ML Workflow](https://raw.githubusercontent.com/maxpumperla/learning_ray/main/notebooks/images/chapter_06/ml_workflow.png)

In [1]:
import ray

# Create a dataset containing integers in the range [0, 10000).
ds = ray.data.range(10000)

# Basic operations: show the size of the dataset, get a few samples, print the schema.
print(ds.count())  # -> 10000
print(ds.take(5))  # -> [{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}]
print(ds.schema())  # -> a single column "id" of type int64

2024-05-11 03:06:34,508	INFO worker.py:1749 -- Started a local Ray instance.
2024-05-11 03:06:37,791	INFO dataset.py:2370 -- Tip: Use `take_batch()` instead of `take() / show()` to return records in pandas or numpy batch format.
2024-05-11 03:06:37,802	INFO streaming_executor.py:112 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-05-11_03-06-32_745256_1190438/logs/ray-data
2024-05-11 03:06:37,806	INFO streaming_executor.py:113 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadRange] -> LimitOperator[limit=5]


10000


[{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}]
Column  Type
------  ----
id      int64


In [3]:
# Save the dataset to a local file and load it back.
ray.data.range(10000).write_csv("local_dir")
ds = ray.data.read_csv("local_dir")
print(ds.count())  # -> 10000 if "local_dir" starts out empty; files left there by an earlier run are read as well

2024-05-11 03:08:33,110	INFO streaming_executor.py:112 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-05-11_03-06-32_745256_1190438/logs/ray-data
2024-05-11 03:08:33,112	INFO streaming_executor.py:113 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadRange->Write]


20000
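
The count of 20000 above, rather than 10000, is consistent with `local_dir` already holding CSV files from an earlier run of the same cell: reading a directory picks up every file in it. The sketch below reproduces that effect with the standard library only. It is a toy illustration, not Ray's implementation, and `write_shards` and `count_rows` are hypothetical helpers introduced here.

```python
import csv
import tempfile
import uuid
from pathlib import Path


def write_shards(rows, directory, num_shards=4):
    """Hypothetical helper: write rows as CSV shards with unique file names."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    run_id = uuid.uuid4().hex  # fresh prefix per call, so a rerun never overwrites
    for shard in range(num_shards):
        path = directory / f"{run_id}_{shard:06d}.csv"
        with path.open("w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["id"])  # header row
            writer.writerows([row] for row in rows[shard::num_shards])


def count_rows(directory):
    """Hypothetical helper: count data rows across every CSV file in a directory."""
    total = 0
    for path in Path(directory).glob("*.csv"):
        with path.open(newline="") as f:
            total += sum(1 for _ in csv.reader(f)) - 1  # subtract the header row
    return total


with tempfile.TemporaryDirectory() as tmp:
    write_shards(list(range(10000)), tmp)
    first_count = count_rows(tmp)
    write_shards(list(range(10000)), tmp)  # second run into the same directory
    second_count = count_rows(tmp)

print(first_count, second_count)  # -> 10000 20000
```

Writing to a fresh or emptied directory before each run keeps the round trip at 10000 rows.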


In [14]:
ds1 = ray.data.range(10000)
ds2 = ray.data.range(10000)
ds3 = ds1.union(ds2)
print(ds3.count())  # -> 20000

# Filter the combined dataset to only the even elements.
ds3 = ds3.filter(lambda x: x['id'] % 2 == 0)
print(ds3.count())  # -> 10000
print(ds3.take(5))  # -> [{'id': 0}, {'id': 2}, {'id': 4}, {'id': 6}, {'id': 8}]


2024-05-11 04:03:59,259	INFO streaming_executor.py:112 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-05-11_03-06-32_745256_1190438/logs/ray-data
2024-05-11 04:03:59,260	INFO streaming_executor.py:113 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadRange], InputDataBuffer[Input] -> TaskPoolMapOperator[ReadRange] -> UnionOperator[UnionOperator(ReadRange, ReadRange)]


2024-05-11 04:04:00,572	INFO streaming_executor.py:112 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-05-11_03-06-32_745256_1190438/logs/ray-data
2024-05-11 04:04:00,573	INFO streaming_executor.py:113 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadRange], InputDataBuffer[Input] -> TaskPoolMapOperator[ReadRange] -> UnionOperator[UnionOperator(ReadRange, ReadRange)] -> TaskPoolMapOperator[Filter(<lambda>)]


20000


2024-05-11 04:04:02,457	INFO streaming_executor.py:112 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-05-11_03-06-32_745256_1190438/logs/ray-data
2024-05-11 04:04:02,458	INFO streaming_executor.py:113 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadRange], InputDataBuffer[Input] -> TaskPoolMapOperator[ReadRange] -> UnionOperator[UnionOperator(ReadRange, ReadRange)] -> TaskPoolMapOperator[Filter(<lambda>)] -> LimitOperator[limit=5]


10000


[{'id': 0}, {'id': 2}, {'id': 4}, {'id': 6}, {'id': 8}]


In [18]:
ds1 = ray.data.range(10000).materialize()
print(ds1.num_blocks())  # -> 257 in the run shown below
ds2 = ray.data.range(10000).materialize()
print(ds2.num_blocks())  # -> 257 in the run shown below
ds3 = ds1.union(ds2).materialize()
print(ds3.num_blocks())  # -> 514, the sum of the two inputs

print(ds3.repartition(200).materialize().num_blocks())  # -> 200

257


2024-05-11 04:05:28,595	INFO streaming_executor.py:112 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-05-11_03-06-32_745256_1190438/logs/ray-data
2024-05-11 04:05:28,596	INFO streaming_executor.py:113 -- Execution plan of Dataset: InputDataBuffer[Input], InputDataBuffer[Input] -> UnionOperator[UnionOperator(Input, Input)]


257


2024-05-11 04:05:28,668	INFO streaming_executor.py:112 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-05-11_03-06-32_745256_1190438/logs/ray-data
2024-05-11 04:05:28,668	INFO streaming_executor.py:113 -- Execution plan of Dataset: InputDataBuffer[Input] -> AllToAllOperator[Repartition]


514


200
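
The block counts printed above (257, 257, 514, then 200) follow a simple pattern: a union keeps the blocks of both inputs, while a repartition redistributes all rows into the requested number of blocks. The sketch below models that with plain Python lists. It is a toy model under the assumption that blocks are contiguous slices of rows, not a description of Ray's internals, and `split_into_blocks`, `union`, and `repartition` are hypothetical helpers.

```python
def split_into_blocks(rows, num_blocks):
    """Hypothetical helper: split rows into num_blocks contiguous, near-equal blocks."""
    base, extra = divmod(len(rows), num_blocks)
    blocks, start = [], 0
    for i in range(num_blocks):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first blocks
        blocks.append(rows[start:start + size])
        start += size
    return blocks


def union(blocks_a, blocks_b):
    """Concatenate two block lists; no rows move between blocks."""
    return blocks_a + blocks_b


def repartition(blocks, num_blocks):
    """Flatten all blocks and split the rows again into num_blocks blocks."""
    rows = [row for block in blocks for row in block]
    return split_into_blocks(rows, num_blocks)


ds1 = split_into_blocks(list(range(10000)), 257)
ds2 = split_into_blocks(list(range(10000)), 257)
ds3 = union(ds1, ds2)
ds4 = repartition(ds3, 200)

print(len(ds1), len(ds3), len(ds4))  # -> 257 514 200
print(sum(len(block) for block in ds4))  # -> 20000
```

The row count is unchanged by either operation; only the grouping of rows into blocks differs.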


In [19]:
ds = ray.data.from_items([{"id": "abc", "value": 1}, {"id": "def", "value": 2}])
print(ds.schema())  # -> id: string, value: int64

Column  Type
------  ----
id      string
value   int64


In [20]:
pandas_df = ds.to_pandas()  # pandas_df will inherit the schema from our Dataset.

In [26]:
ds = ray.data.range(10000).map(lambda x: {'item': x['id'] ** 2})
ds.take(5)  # -> [{'item': 0}, {'item': 1}, {'item': 4}, {'item': 9}, {'item': 16}]

2024-05-11 04:08:05,539	INFO streaming_executor.py:112 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-05-11_03-06-32_745256_1190438/logs/ray-data
2024-05-11 04:08:05,539	INFO streaming_executor.py:113 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadRange->Map(<lambda>)] -> LimitOperator[limit=5]


[{'item': 0}, {'item': 1}, {'item': 4}, {'item': 9}, {'item': 16}]

In [42]:
import numpy as np

ds = ray.data.range(10000).map_batches(lambda batch: {'results': np.square(batch['id']).tolist()})
ds.take(5)  # -> [{'results': 0}, {'results': 1}, {'results': 4}, {'results': 9}, {'results': 16}]

2024-05-11 04:20:04,076	INFO streaming_executor.py:112 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-05-11_03-06-32_745256_1190438/logs/ray-data
2024-05-11 04:20:04,076	INFO streaming_executor.py:113 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadRange->MapBatches(<lambda>)] -> LimitOperator[limit=5]


[{'results': 0},
 {'results': 1},
 {'results': 4},
 {'results': 9},
 {'results': 16}]
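
The two cells above differ in what the function receives: with `map` it gets one row such as `{'id': 3}`, while with `map_batches` it gets a whole batch indexed by column name (`batch['id']`) and returns columns, which then show up as one row per element. The sketch below mimics that row/column conversion in plain Python. It is an illustration of the data shapes only, not Ray's implementation, and `rows_to_batch` and `batch_to_rows` are hypothetical helpers.

```python
def rows_to_batch(rows):
    """Hypothetical helper: turn a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def batch_to_rows(batch):
    """Hypothetical helper: turn a dict of column lists back into row dicts."""
    names = list(batch)
    return [dict(zip(names, values)) for values in zip(*batch.values())]


rows = [{"id": i} for i in range(5)]

# Row-wise: the function sees one row at a time.
row_wise = [{"item": row["id"] ** 2} for row in rows]

# Batch-wise: the function sees all values of a column at once.
batch = rows_to_batch(rows)  # {'id': [0, 1, 2, 3, 4]}
squared = {"results": [value ** 2 for value in batch["id"]]}
batch_wise = batch_to_rows(squared)

print(row_wise)    # -> [{'item': 0}, {'item': 1}, {'item': 4}, {'item': 9}, {'item': 16}]
print(batch_wise)  # -> [{'results': 0}, {'results': 1}, {'results': 4}, {'results': 9}, {'results': 16}]
```

Working on whole columns is what lets the `map_batches` cell apply `np.square` to an entire batch in one call.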