# Executors and Tasks

This notebook briefly explains how we will create generic **classes** capable of handling common Data Lake jobs, *e.g.*, transferring data between storage systems, creating tables on **AWS Athena** and **AWS Redshift**, managing **Glue Jobs**, *etc*.

## Staging Data with an optional transformation

The class we are creating below takes data from a parent directory and moves it to a dump directory. In addition, given the path to a Python script, it runs that script with its respective arguments. This is a good option for cleaning the data before it is staged.
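The executor below calls `py_exec.main(parent, dump, **py_exec_args)`, so a transformation script only needs to expose a `main` function with that shape. A minimal sketch of such a script follows. The cleaning rule (dropping blank rows from CSV files) and the `delimiter` parameter are assumptions made for illustration, not part of the original pipeline.

```python
# py_exec.py (hypothetical transformation script)
import csv
from pathlib import Path


def main(parent: str, dump: str, delimiter: str = ",") -> int:
    """Copy every CSV in `parent` to `dump`, skipping blank rows.

    Returns the number of files written.
    """
    dump_dir = Path(dump)
    dump_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    for source in sorted(Path(parent).glob("*.csv")):
        target = dump_dir / source.name
        with source.open(newline="") as src, target.open("w", newline="") as dst:
            reader = csv.reader(src, delimiter=delimiter)
            writer = csv.writer(dst, delimiter=delimiter)
            for row in reader:
                # Keep only rows with at least one non-empty cell
                if any(cell.strip() for cell in row):
                    writer.writerow(row)
        written += 1
    return written
```

With a file like this saved in a folder, passing that folder as `py_exec_path` and `{"delimiter": ";"}` as `py_exec_args` would forward the delimiter to `main`.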

In [7]:
import sys
from typing import Optional


class StagingExecutor:
    """Moves data from a parent directory to a dump directory, optionally transforming it first."""

    def __init__(
        self,
        parent_directory: str,
        dump_directory: str,
        archive_or_delete: str = "archive",
        py_exec_path: Optional[str] = None,
        py_exec_args: Optional[dict] = None,
    ) -> None:
        self.parent = parent_directory
        self.dump = dump_directory
        self.archive_or_delete = archive_or_delete  # "archive" or "delete"
        self.py_exec_path = py_exec_path  # directory containing a py_exec.py module
        self.py_exec_args = py_exec_args  # keyword arguments forwarded to py_exec.main

    def transfer(self) -> None:
        if not self.py_exec_path:
            # Placeholder: the plain move is not implemented yet
            print('No transformation required. Moving file using only parent and dump')
            return

        # Make py_exec.py importable from the given directory
        sys.path.insert(1, self.py_exec_path)
        import py_exec

        if not self.py_exec_args:
            print('Python Executor does not require arguments')
        else:
            print(f'Python Executor is running with the following parameters:\n{self.py_exec_args}')

        # Run the transformation whether or not extra arguments were supplied
        py_exec.main(self.parent, self.dump, **(self.py_exec_args or {}))

    def post_staging(self) -> None:
        # Placeholder: only reports the action that would be taken
        if self.archive_or_delete == 'archive':
            print('File from landing will be moved to archive folder')
        elif self.archive_or_delete == 'delete':
            print('File will be deleted from landing')
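
`post_staging` currently only reports what it would do. A minimal sketch of the two actions on a local filesystem, using `shutil` and `pathlib`, could look like the following. The function name `finish_landing_file` and the `archive_directory` parameter are hypothetical, and files stored on S3 would need a different client.

```python
import shutil
from pathlib import Path


def finish_landing_file(file_path: str, archive_or_delete: str = "archive",
                        archive_directory: str = "archive") -> None:
    """Archive or delete a landing file once it has been staged (local files only)."""
    source = Path(file_path)
    if archive_or_delete == "archive":
        target_dir = Path(archive_directory)
        target_dir.mkdir(parents=True, exist_ok=True)
        # Move the file into the archive folder, keeping its name
        shutil.move(str(source), str(target_dir / source.name))
    elif archive_or_delete == "delete":
        source.unlink()
    else:
        raise ValueError(f"Unknown post-staging action: {archive_or_delete!r}")
```

Raising on an unknown action is a design choice of this sketch; the class above silently ignores any value other than `'archive'` or `'delete'`.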

## Downloader with unzip functionality

This executor takes the arguments of an HTTP request (such as the URL) as its main input and downloads the response into a landing folder (local or on **S3**) using *requests*. In addition to downloading, it can be set to unzip and organize the raw data for a cleaner staging task.

In [52]:
import io
import zipfile

import requests


class DownloaderExecutor:
    def __init__(self, requests_arguments: dict, landing_directory: str, unzip: str = None) -> None:
        # Keyword arguments for requests.get, such as {"url": URL, "params": PARAMS, ...}
        self.requests_arguments = requests_arguments
        # Target folder when unzipping; full file path (with file name) otherwise
        self.landing_directory = landing_directory
        self.unzip = unzip  # "unzip" to extract the downloaded archive, None to save it as is
        self.res = None

    def download(self) -> None:
        self.res = requests.get(**self.requests_arguments)
        # Fail early on HTTP errors instead of saving an error page
        self.res.raise_for_status()

        if self.unzip == "unzip":
            print('Will have to unzip')
            # Read the archive from memory and extract it into the landing folder
            with zipfile.ZipFile(io.BytesIO(self.res.content)) as archive:
                archive.extractall(self.landing_directory)

        elif not self.unzip:
            print('Does not unzip')
            with open(self.landing_directory, 'wb') as temp:
                temp.write(self.res.content)

To test the unzip functionality, we use the following dataset: https://www.stats.govt.nz/assets/Uploads/Retail-trade-survey/Retail-trade-survey-September-2020-quarter/Download-data/retail-trade-survey-september-2020-quarter-csv.zip

In [57]:
downexec = {
    "requests_arguments": {
        "url": "https://www.stats.govt.nz/assets/Uploads/Retail-trade-survey/Retail-trade-survey-September-2020-quarter/Download-data/retail-trade-survey-september-2020-quarter-csv.zip"
    },
    "landing_directory": "../data/landing",
    "unzip": "unzip",
}
downloader = DownloaderExecutor(**downexec)
downloader.download()

Will have to unzip
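
The unzip branch can also be checked without network access. The sketch below builds a small archive in memory and extracts it the same way `download` does, through `zipfile.ZipFile(io.BytesIO(...))`. The file name, its contents, and the helper name `extract_zip_bytes` are made up for illustration.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def extract_zip_bytes(content: bytes, landing_directory: str) -> list:
    """Extract an in-memory zip archive and return the names of its members."""
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        archive.extractall(landing_directory)
        return archive.namelist()


# Build a tiny archive in memory to stand in for the response content
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("sample.csv", "period,value\n2020.09,1\n")

with tempfile.TemporaryDirectory() as landing:
    names = extract_zip_bytes(buffer.getvalue(), landing)
    extracted_text = (Path(landing) / "sample.csv").read_text()
```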


# Orchestrator

Data orchestration is a relatively new concept to describe the set of technologies that abstracts data access across storage systems, virtualizes all the data, and presents the data via standardized APIs with a global namespace to data-driven applications. There is a clear need for data orchestration because of the increasing complexity of the data ecosystem due to new frameworks, cloud adoption/migration, as well as the rise of data-driven applications. [[Data Orchestrator]](https://dzone.com/articles/data-orchestration-its-open-source-but-what-is-it)

The `Orchestrator` is a class that follows a `routine_config.json` file, which declares the executors to run and the tasks each of them should perform. The model for our `routine_config` is:

```
{
    "routine_name": <ROUTINE_NAME>,
    "executors": {
        <EXECUTOR_CLASS>:  {
            "params":<__init__ PARAMETERS>,
            "tasks": <LIST_SELECTED_TASKS_FROM_EXECUTOR>
        }
    }
}   
```
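
As an illustration, a routine that downloads the retail trade dataset and then stages it could be declared as below. The routine name and the staging directory are placeholder values chosen for this sketch; the executor names, parameters, and task names are those of the classes defined in this notebook.

```python
import json

routine_config = {
    "routine_name": "retail_trade_survey",  # placeholder name
    "executors": {
        "DownloaderExecutor": {
            "params": {
                "requests_arguments": {
                    "url": "https://www.stats.govt.nz/assets/Uploads/Retail-trade-survey/Retail-trade-survey-September-2020-quarter/Download-data/retail-trade-survey-september-2020-quarter-csv.zip"
                },
                "landing_directory": "../data/landing",
                "unzip": "unzip",
            },
            "tasks": ["download"],
        },
        "StagingExecutor": {
            "params": {
                "parent_directory": "../data/landing",
                "dump_directory": "../data/staging",  # placeholder path
            },
            "tasks": ["transfer", "post_staging"],
        },
    },
}

# The same structure serialized as it would appear in routine_config.json
print(json.dumps(routine_config, indent=4))
```

Because `executors` is a mapping keyed by class name, each executor class can appear only once per routine, and executors run in the order in which they are declared.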

In [3]:
class Orchestrator:
    
    def __init__(self, routine_config: dict) -> None:
        self.routine_config = routine_config
    
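    def run_executors(self) -> None:
        # Instantiate each configured executor and run its tasks in order
        for executor_name, executor_config in self.routine_config['executors'].items():
            self.executor_name = executor_name
            # Executor classes must be defined in the same module as the Orchestrator
            klass = globals()[executor_name]
            self.executor = klass(**executor_config['params'])
            self.run_tasks()

    def run_tasks(self) -> None:
        # Each task is the name of a method on the current executor
        for task in self.routine_config['executors'][self.executor_name]['tasks']:
            getattr(self.executor, task)()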
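
Looking executors up through `globals()` only works when every executor class is defined in the same module as the `Orchestrator`. One alternative, sketched here as a variation rather than the notebook's approach, is to pass an explicit registry that maps names to classes. `EchoExecutor`, `run_routine`, and the registry are hypothetical names used only for this example.

```python
class EchoExecutor:
    """Stand-in executor that records which tasks were called."""

    def __init__(self, label: str) -> None:
        self.label = label
        self.calls = []

    def first(self) -> None:
        self.calls.append(f"{self.label}:first")

    def second(self) -> None:
        self.calls.append(f"{self.label}:second")


def run_routine(routine_config: dict, registry: dict) -> dict:
    """Instantiate each configured executor and run its tasks in order."""
    executors = {}
    for name, spec in routine_config["executors"].items():
        if name not in registry:
            raise KeyError(f"Unknown executor: {name}")
        executor = registry[name](**spec["params"])
        for task in spec["tasks"]:
            # Each task is the name of a method on the executor
            getattr(executor, task)()
        executors[name] = executor
    return executors


config = {
    "routine_name": "demo",
    "executors": {
        "EchoExecutor": {"params": {"label": "a"}, "tasks": ["first", "second"]}
    },
}
result = run_routine(config, {"EchoExecutor": EchoExecutor})
print(result["EchoExecutor"].calls)  # ['a:first', 'a:second']
```

An explicit registry also limits which classes a configuration file is allowed to instantiate.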