# Task 2: MapReduce with Ray Actors

This task implements MapReduce with Ray actors. For reference, see the implementation of MapReduce with Ray tasks [here](https://github.com/maxpumperla/learning_ray/blob/main/notebooks/ch_02_ray_core.ipynb) (scroll to the end of the notebook).

The goal is to use MapReduce to count the occurrences of each word in a block of text, using a `Mapper` actor and a `Reducer` actor. Implement the algorithm as follows:
  - Create as many Mappers as there are `NUM_CPUS` workers in your Ray cluster, and split the text into partitions so that the workload is distributed evenly among them.
  - Assign one partition to each Mapper and compute the word counts for that partition using `.map`.
  - Once the Mappers have computed their counts, use `.reduce` on your `Reducer` to combine the results from all partitions.
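
The three steps above can be sketched in a single process, without Ray, to make the data flow concrete. This is only an illustration under assumptions: the helper names `partition`, `map_chunk`, and `reduce_counts` and the sample sentences are hypothetical, and the actor-based solution replaces these plain function calls with remote method calls.

```python
from collections import Counter
import re


def partition(items, n):
    """Split items into n contiguous chunks whose sizes differ by at most one."""
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def map_chunk(lines):
    """Map phase: count lower-cased, whitespace-separated tokens in one chunk."""
    counts = Counter()
    for line in lines:
        counts.update(re.findall(r"\S+", line.lower()))
    return counts


def reduce_counts(partials):
    """Reduce phase: merge the per-chunk counts into one dictionary."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


# Hypothetical sample input, not taken from Essay.txt.
sentences = ["the cat sat", "the dog sat", "a cat ran"]
chunks = partition(sentences, 2)
totals = reduce_counts(map_chunk(chunk) for chunk in chunks)
```

Because the reduce step only sums counts, the result does not depend on how the sentences are partitioned, as long as every sentence lands in exactly one chunk.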


In [2]:
import ray
import re
import time
import string 
import json 

NUM_CPUS=8
ray.init()

2024-03-04 11:02:40,125	INFO util.py:154 -- Outdated packages:
  ipywidgets==7.8.1 found, needs ipywidgets>=8
Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.
2024-03-04 11:02:40,142	INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 10.47.192.23:6380...
2024-03-04 11:02:40,194	INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at http://10.47.192.23:8265


Python version: 3.8.18
Ray version: 2.9.3
Dashboard: http://10.47.192.23:8265


In [22]:
@ray.remote
class Mapper:
    def __init__(self):
        self.word_counts = {}
    
    def map(self, lines):
        for line in lines:
            words = re.findall(r'\S+', line.lower())
            for word in words:
                self.word_counts[word] = self.word_counts.get(word, 0) + 1
    
    def get_counts(self):
        return self.word_counts

@ray.remote
class Reducer:
    def __init__(self):
        self.word_counts = {}

    def reduce(self, counts):
        for counts_dict_ref in counts:
            dict_counts = ray.get(counts_dict_ref)
            for word, count in dict_counts.items():
                self.word_counts[word] = self.word_counts.get(word, 0) + count

    def get_counts(self):
        return self.word_counts


def main(text):

    lines = text.split(". ")
    cl = [line.strip() for line in lines]
    
    # Please complete the core MapReduce algorithm here
    
    # YOUR CODE HERE
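    # Split the sentences into NUM_CPUS nearly equal partitions; the first
    # `remainder` partitions take one extra sentence so that none is dropped.
    block_size, remainder = divmod(len(cl), NUM_CPUS)
    mappers = [Mapper.remote() for _ in range(NUM_CPUS)]
    reducer = Reducer.remote()

    start = 0
    for i, mapper in enumerate(mappers):
        end = start + block_size + (1 if i < remainder else 0)
        mapper.map.remote(cl[start:end])
        start = end

    # Calls to one actor run in submission order, so each get_counts call
    # executes after that mapper's map call has finished.
    count_refs = [mapper.get_counts.remote() for mapper in mappers]
    # Object refs nested in a list are passed through unresolved;
    # Reducer.reduce calls ray.get on each of them.
    ray.get(reducer.reduce.remote(count_refs))
    return ray.get(reducer.get_counts.remote())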
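
A note on tokenization: `main` splits the text on `". "` and `Mapper.map` extracts tokens with `\S+`, so punctuation that is not removed by the split stays attached to its word. The small sketch below, using a hypothetical sample string, shows which tokens that combination produces.

```python
import re

# Hypothetical sample text, not taken from Essay.txt.
text = "Ray is fast. Ray scales, and Ray is simple."

# Same sentence split and strip as in main().
sentences = [s.strip() for s in text.split(". ")]

# Same token pattern as in Mapper.map(): runs of non-whitespace characters.
tokens = [re.findall(r"\S+", s.lower()) for s in sentences]
```

Here `"scales,"` and `"simple."` keep their punctuation, so they are counted separately from `"scales"` and `"simple"`.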

# Wait, why am I using Ray for this again?

This is a toy example with a small text file. Even so, note that this code runs worker processes across all the nodes in the Ray cluster: the same code that would be written for a single machine now scales to a cluster.

In [23]:
# launch a test mapper!
mapper = Mapper.remote()
# launch a test reducer
reducer = Reducer.remote()
# let's check their status
!ray list actors -f state=ALIVE

2024-03-04 22:26:48,708	INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 10.47.192.23:6380...
2024-03-04 22:26:48,718	INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at http://10.47.192.23:8265



Stats:
------------------------------
Total: 31

Table:
------------------------------
    ACTOR_ID                          CLASS_NAME            STATE    JOB_ID    NAME                  NODE_ID                                                     PID  RAY_NAMESPACE
 0  0f3175e5f2a578448e0c96f71c000000  Info                  ALIVE    1c000000  info_dsc_204a         f8d09b7991c564bef44ff719e5377c06b41e25e35b18e6b830e1adcb  11899  f4f5ef8b-9943-4dd6-a562-f8a5f707168b
 1  177e6bc4d72a9c25eff19fff09000000  Info                  ALIVE    09000000  info_dsc_204a         b766d1ef501a60f87e2c334154fd58764a344ecfb35d8ab4cdd6179f  25161  5c4ffda4-cb41-4fcb-aa8b-65dd2a4ae4ea
 2  29739b559b470daa5cd32e4827000000  Reducer               ALIVE    27000000                        5644f0283c3734021b36913ef36a7035513f86bdf72c39b21c09188a  23513  d8a32d2b-4f8e-4b38-ac18-d6108c7022bb
 3  34076ea5ea9f8f3554e106f124000000  Worker                ALIVE    24000000                        b766d1ef501

In [24]:
# get word counts for the example text file
with open('Essay.txt', 'r') as file:
    text = file.read()
final_counts = main(text)



In [25]:
with open("task2_expected_output.txt", "r") as f:
    expected_out = json.load(f)
assert expected_out == final_counts
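
If the assertion fails, a bare `AssertionError` does not say which words disagree. A minimal sketch of a helper that lists the mismatches is shown below; `diff_counts` and the toy dictionaries are hypothetical and not part of the assignment.

```python
def diff_counts(expected, actual):
    """Return {word: (expected_count, actual_count)} for words whose counts differ."""
    words = set(expected) | set(actual)
    return {
        w: (expected.get(w, 0), actual.get(w, 0))
        for w in sorted(words)
        if expected.get(w, 0) != actual.get(w, 0)
    }


# Hypothetical toy data, not taken from Essay.txt.
expected = {"ray": 3, "actor": 2, "task": 1}
actual = {"ray": 3, "actor": 1, "tasks": 1}
mismatches = diff_counts(expected, actual)
```

In the notebook, it could be called as `diff_counts(expected_out, final_counts)`; an empty result means the two dictionaries agree.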

In [21]:
# shutdown!
ray.shutdown()