# Demo: Data Uploader for UFS Datasets to Cloud Data Storage

### __Purpose:__ 

The purpose of this program is to transfer the Unified Forecast System (UFS) input and baseline datasets residing within the RDHPCS to cloud data storage by chaining API calls that communicate with cloud data storage buckets. The program will support the data required for the current UFS Weather Model (UFS WM) deployed within the RDHPCS. Once integrated into Jenkins, it will also support the NOAA development team's data management by maintaining only the datasets committed within the latest N months of their UFS development code.

According to AWS, the following conditions need to be considered when transferring data to cloud data storage:
- The largest object that can be uploaded in a single PUT is 5 GB.
- Individual Amazon S3 objects can range in size from a minimum of 0 bytes to a maximum of 5 TB.
- For objects larger than 100 MB, Amazon recommends using the Multipart Upload capability.
- The total volume of data in a cloud data storage bucket is unlimited.
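The limits above can be turned into a simple pre-upload check. The following is a minimal sketch, not part of the uploader itself: the function name `choose_upload_strategy` and the returned labels are hypothetical, and decimal units are assumed for the size constants.

```python
# Size limits quoted in the list above, expressed in bytes.
# Assumption: decimal units (1 MB = 1000**2 bytes); adjust if binary units are intended.
MB = 1000 ** 2
GB = 1000 ** 3
TB = 1000 ** 4

SINGLE_PUT_LIMIT = 5 * GB          # largest object for a single PUT
MAX_OBJECT_SIZE = 5 * TB           # largest object S3 will store
MULTIPART_RECOMMENDED = 100 * MB   # above this, multipart upload is recommended


def choose_upload_strategy(size_in_bytes: int) -> str:
    """Return a label (hypothetical names) describing how an object of this size can be uploaded."""
    if size_in_bytes < 0:
        raise ValueError("size_in_bytes must be non-negative")
    if size_in_bytes > MAX_OBJECT_SIZE:
        return "too_large"
    if size_in_bytes > SINGLE_PUT_LIMIT:
        return "multipart_required"
    if size_in_bytes > MULTIPART_RECOMMENDED:
        return "multipart_recommended"
    return "single_put"
```

A caller could pass `os.path.getsize(path)` for each dataset file to decide how it should be handled before any transfer starts.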

Tools that could be used to perform data transfer and partitioning (Multipart Upload/Download) are: 
- AWS SDK
- AWS CLI
- AWS S3 REST API

The AWS CLI and the Python AWS SDK (Boto3) are both built on botocore, and all of these tools ultimately call the Amazon S3 REST API. 

In this demonstration, the framework uses the Python AWS SDK (Boto3) to transfer the UFS datasets from the RDHPCS platform Orion to cloud data storage with low latency. 

The AWS SDK was chosen for the following reasons:
- It integrates with other Python scripts.
- It offers additional capabilities for data manipulation and transfer compared to the alternative tools listed above.

### __Capabilities:__ 

The framework will be able to perform the following actions:

- Multi-threading and partitioning of the datasets to optimize upload performance from on-prem to cloud. 
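With Boto3, multi-threaded and partitioned uploads are typically configured through `TransferConfig`. The sketch below is an assumption about how this could look, not the framework's actual implementation: the helper names, the default values, and the bucket and key arguments are hypothetical.

```python
def transfer_settings(threshold_mb=100, chunk_mb=100, max_threads=10):
    """Build keyword arguments for boto3's TransferConfig (default values are assumptions)."""
    mib = 1024 ** 2
    return {
        "multipart_threshold": threshold_mb * mib,  # files above this size are split into parts
        "multipart_chunksize": chunk_mb * mib,      # size of each uploaded part
        "max_concurrency": max_threads,             # number of threads uploading parts
        "use_threads": True,
    }


def upload_with_multipart(local_path, bucket, key):
    """Upload one file; boto3 switches to multipart once the threshold is exceeded."""
    # Imported lazily so transfer_settings() can be inspected without boto3 installed.
    import boto3
    from boto3.s3.transfer import TransferConfig

    config = TransferConfig(**transfer_settings())
    boto3.client("s3").upload_file(local_path, bucket, key, Config=config)
```

Larger chunk sizes reduce the number of requests, while higher concurrency helps most when network bandwidth, rather than disk reads, is the bottleneck.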

### __Future Capabilities:__  

The program can be used as a skeletal framework for transferring future datasets of interest (e.g. SRW data, MRW data, etc). In addition, it can be integrated with the UFS tracker bot (https://github.com/NOAA-EPIC/ufs-dev_data_timestamps) & Jenkins to automate the data transferring process as new datasets are being committed & pushed to the UFS-WM repository develop branch.


### __Sample Datasets to Transfer:__
There are two scenarios that will need to be considered when storing the UFS data in cloud:

- Datasets stored in the cloud need to support NOAA's development team. Datasets residing in the cloud, as well as on the RDHPCS, must support the latest 2 months of the development team's code. 


| UFS MODEL DEVELOPMENT VERSIONS| BASELINE DATA | INPUT DATA | WW3 INPUT DATA | BM_IC |
| :- | :- | :- | -: | :-: |
| Supports NOAA Dev Team Versions (since 03-04-22)| 20220304 | input-data-20211210  | WW3_input_data_20211113 | BM_IC-20220207 |
| Supports NOAA Dev Team Versions (since 03-16-22)| 20220316 | input-data-20211210  | WW3_input_data_20211113 | BM_IC-20220207 |
| Supports NOAA Dev Team Versions (since 03-18-22)| 20220318 | input-data-20211210  | WW3_input_data_20211113 | BM_IC-20220207 |
| Supports NOAA Dev Team Versions (since 03-21-22)| 20220321 | input-data-20211210  | WW3_input_data_20211113 | BM_IC-20220207|
| Supports NOAA Dev Team Versions (since 03-29-22)| 20220329 | input-data-20211210  | WW3_input_data_20211113 | BM_IC-20220207|

- Datasets stored need to support the UFS weather model develop branch code revision that was pulled in October 2021 by EPIC's Platform team. These datasets are:

| UFS MODEL DEVELOPMENT VERSIONS| BASELINE DATA | INPUT DATA | WW3 INPUT DATA | BM_IC |
| :- | :- | :- | -: | :-: |
| Supports UFS Model Version Deployed in CSPs| 20220207 | input-data-20211210  | WW3_input_data_20211113 | BM_IC-20210717 |

<img src="./images/DataVersionsZach&JongAreUsing.png">

### __Environment Setup:__

1. Install Miniconda on your machine. Note: Miniconda is a smaller version of Anaconda that includes only conda and a small set of necessary and useful packages. With Miniconda, you install only what you need, without the extra packages that ship with Anaconda:

Download the latest Miniconda (e.g. the Python 3.9 version):
- __wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.9.2-Linux-x86_64.sh__

Check the integrity of the downloaded file with SHA-256:
- __sha256sum Miniconda3-py39_4.9.2-Linux-x86_64.sh__

Compare the result against the reference SHA-256 hash listed at: https://docs.conda.io/en/latest/miniconda.html
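The same integrity check can be done from Python when `sha256sum` is not available. This is a minimal sketch using only the standard library; the function names are hypothetical, and the reference hash must be copied by hand from the page linked above.

```python
import hashlib


def sha256_of_file(path, block_size=1024 * 1024):
    """Compute the SHA-256 hex digest of a file, reading it in blocks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_reference(path, reference_hash):
    """Return True when the file's digest equals the published reference hash."""
    return sha256_of_file(path) == reference_hash.strip().lower()
```

For example, `matches_reference("Miniconda3-py39_4.9.2-Linux-x86_64.sh", reference)` should return `True` before the installer is run, where `reference` is the hash string taken from the Miniconda page.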

Install Miniconda on Linux:
- __bash Miniconda3-py39_4.9.2-Linux-x86_64.sh__

Next, the Miniconda installer will prompt you for an install location. Press ENTER to accept the default location, i.e. your $HOME directory. If you don't want to install in the default location, press CTRL+C to cancel the installation or specify an alternate installation directory. If you've chosen the default location, the installer will display “PREFIX=/var/home/<user>/miniconda3” and continue the installation.

For the installation to take effect, run the following command: 
- __source ~/.bashrc__

Next, you will see the prefix (base) in front of your terminal/shell prompt, indicating that conda's base environment is activated.

2. Once you have conda installed on your machine, create a conda environment using one of the following options:

To create a new environment (if a YAML file is not provided):
- __conda create -n [Name of your conda environment you wish to create]__

__(OR)__

To ensure you are running Python 3.9:
- __conda create -n myenv python=3.9__

__(OR)__

To create a new environment from an existing YAML file (if a YAML file is provided):
- __conda env create -f environment.yml__

__*Note:__ A .yml file is a text file that lists the dependencies, and the channels from which to install them, for the given conda environment. Run the command above from the directory containing the environment.yml file so that conda can find it.

4.	Activate the new environment via: 
- __conda activate [Name of your conda environment you wish to activate]__

5. Verify that the new environment was created correctly via:
- __conda info --envs__

__*Note:__
- From this point on, activate the conda environment before running any .py script(s) or Jupyter notebooks, using the following command: __conda activate [Name of your conda environment]__
- To deactivate a conda environment: 
    - __conda deactivate__

#### ___Link Home Directory to Dataset Location on RDHPCS Platform___ 

6. Unfortunately, the /work/ filesystem cannot be reached directly from within the Jupyter interface. The best workaround is to create a symbolic link in your home folder that points to the /work/ filesystem. Run the following command from a Linux terminal on Orion to create the link: 

    - __ln -s /work /home/[Your user account name]/work__

Now, when you navigate to the __/home/[Your user account name]/work__ directory in Jupyter, it will take you to the __/work__ folder, giving you access from Jupyter to any data within the __/work__ filesystem that you have permission to read. The same procedure works for any filesystem available from the root directory. 

__*Note:__ On Orion, users must create a symbolic link from their home directory to the main directory containing the datasets of interest.
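If the link needs to be created from a setup script rather than by hand, the same step can be sketched in Python. This is only an illustration: the function name `link_directory` is hypothetical, and the sketch refuses to overwrite anything that already exists at the link location.

```python
import os
from pathlib import Path


def link_directory(target, link_path):
    """Create link_path -> target unless an identical symbolic link is already in place."""
    link = Path(link_path)
    if link.is_symlink():
        # Re-running is harmless when the existing link already points at the target.
        if Path(os.readlink(link)) != Path(target):
            raise FileExistsError(f"{link} already links to a different location")
        return link
    if link.exists():
        raise FileExistsError(f"{link} exists and is not a symbolic link")
    link.symlink_to(target, target_is_directory=True)
    return link
```

On Orion this would be called as `link_directory("/work", Path.home() / "work")`, which mirrors the `ln -s` command above.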

#### ___Open & Run Data Analytics Tool on Jupyter Notebook___

7. Open OnDemand provides a built-in file explorer and file transfer application directly from its dashboard. To launch Jupyter through it:
    - Log in to https://orion-ood.hpc.msstate.edu/ 
    - In the Open OnDemand interface, select __Interactive Apps__ > __Jupyter Notebook__
    - Set the following configurations to run Jupyter:


#### ___Additional Information___

__To create a .yml file, execute the following commands:__

- Activate the environment to export: 
    - __conda activate myenv__

- Export your active environment to a new file:
    - __conda env export > [ENVIRONMENT FILENAME].yml__


### __Reference(s)__
Latest UFS Weather Model Guide:
- https://ufs-weather-model.readthedocs.io/en/latest/InputsOutputs.html


# Demo 1: Data Locality Extractor from Source

#### Test Sample
The script reads from the data tracker bot's output pickle file. Until Jenkins is connected to the RDHPCS, the test sample generated by the data tracker bot resides in **./data_from_ts_tracker/latest_rt.sh.pk**.

In this demonstration, datasets were transferred to the cloud data storage to support the following UFS-WM development versions of interest.
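Reading the tracker's pickle file and selecting the most recent retrieval can be sketched as follows. The layout assumed here (a dictionary keyed by retrieval date in `MM-DD-YYYY` format, each holding the timestamp lists) is inferred from the tracker output printed later in this demo; the function names are hypothetical and `GetTimestampData` may work differently.

```python
import pickle
from datetime import datetime


def load_tracker_log(pickle_path):
    """Load the tracker's log. Only unpickle files from a trusted source."""
    with open(pickle_path, "rb") as handle:
        return pickle.load(handle)


def latest_retrieval(log):
    """Return (date_key, entry) for the most recent retrieval date in the log."""
    # Dates are parsed rather than compared as strings, since 'MM-DD-YYYY'
    # strings do not sort chronologically across years.
    latest_key = max(log, key=lambda d: datetime.strptime(d, "%m-%d-%Y"))
    return latest_key, log[latest_key]
```

A call such as `latest_retrieval(load_tracker_log("./data_from_ts_tracker/latest_rt.sh.pk"))` would then yield the newest set of timestamps to transfer, provided the file follows the assumed layout.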

| UFS MODEL DEVELOPMENT VERSIONS| BASELINE DATA | INPUT DATA | WW3 INPUT DATA | BM_IC |
| :- | :- | :- | -: | :-: |
| Supports UFS Model Version Deployed in CSPs| 20220207 | input-data-20211210  | WW3_input_data_20211113 | BM_IC-20210717 |
| Supports NOAA Dev Team Versions (since 03-04-22)| 20220304 | input-data-20211210  | WW3_input_data_20211113 | BM_IC-20220207 |
| Supports NOAA Dev Team Versions (since 03-16-22)| 20220316 | input-data-20211210  | WW3_input_data_20211113 | BM_IC-20220207 |
| Supports NOAA Dev Team Versions (since 03-18-22)| 20220318 | input-data-20211210  | WW3_input_data_20211113 | BM_IC-20220207 |
| Supports NOAA Dev Team Versions (since 03-21-22)| 20220321 | input-data-20211210  | WW3_input_data_20211113 | BM_IC-20220207|
| Supports NOAA Dev Team Versions (since 03-29-22)| 20220329 | input-data-20211210  | WW3_input_data_20211113 | BM_IC-20220207|


### Obtain directories for the datasets tracked by the data tracker bot. 

**TODO:** For Year 2, capture the date on which the timestamps were extracted, then write a script that applies a sliding window filter to capture only the latest 2 months of development code.
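One possible shape for that sliding window filter is sketched below. It is a suggestion only: the function name is hypothetical, the log is assumed to be keyed by retrieval date in `MM-DD-YYYY` format, and "2 months" is approximated as 60 days.

```python
from datetime import datetime, timedelta


def filter_recent_retrievals(log, window_days=60, today=None):
    """Keep only the log entries whose retrieval date falls inside the window."""
    today = today or datetime.now()
    cutoff = today - timedelta(days=window_days)
    return {
        date_key: entry
        for date_key, entry in log.items()
        if datetime.strptime(date_key, "%m-%d-%Y") >= cutoff
    }
```

Passing `today` explicitly makes the filter reproducible in tests and lets a scheduled job reprocess a past date.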

In [1]:
if __name__ == '__main__': 
    
    # Module for extracting data from source.
    from get_timestamp_data import GetTimestampData
    
    # Establish locality of where the datasets are sourced.
    linked_home_dir = "/home/schin/work"
    orion_rt_data_dir = linked_home_dir + "/noaa/nems/emc.nemspara/RT/NEMSfv3gfs/"
    
    # Read data tracker's latest set of timestamps retrieved
    get_ts_wrapper = GetTimestampData(orion_rt_data_dir, None)
    latest_retrieved_ts = get_ts_wrapper.data_log_dict

    # Filter to tracker log's timestamps & extract their corresponding UFS input & baseline file directories.
    filter2tracker_ts_datasets = get_ts_wrapper.get_tracker_ts_files()



All Primary Dataset Folders & Files In Main Directory (/home/schin/work/noaa/nems/emc.nemspara/RT/NEMSfv3gfs/):
['develop-20220629', 'develop-20220613', 'develop-20220511', 'ufs-public-release-v2-20210212', 'develop-20220623', 'adjust_permissions.sh', 'ufs-public-release-v2-20210208', 'develop-20220519', 'develop-20220508', 'develop-20220517', 'BM_IC-20220207', 'develop-20220316', 'develop-20220516', 'develop-20220503', 'develop-20220707', 'develop-20220601', 'develop-20220512', 'develop-20220502', 'develop-20220425', 'develop-20220531', 'develop-20220616', 'input-data-20220414', 'develop-20220701', 'BM_IC-20210717', 'develop-20220706']

Data Tracker's Latest Set of Timestamped Datasets Retrieved was on 07-07-2022:
{'BL_DATE': ['20220706'], 'INPUTDATA_ROOT': ['20220414'], 'INPUTDATA_ROOT_WW3': ['20220624'], 'INPUTDATA_ROOT_BMIC': ['20220207']}

Data Tracker's Retrieval Dates:
dict_keys(['06-30-2022', '07-05-2022', '07-07-2022'])

Data Tracker's Log of Times

In [5]:
# Confirm: Keys of latest Retrieval Date Tracked by Data Tracker Bot
filter2tracker_ts_datasets.keys()

dict_keys(['INPUTDATA_ROOT', 'BL_DATE', 'INPUTDATA_ROOT_WW3', 'INPUTDATA_ROOT_BMIC'])

In [2]:
# Selected timestamp dataset of the latest retrieval date to transfer 
# from RDHPCS on-disk to cloud as tracked by data tracker bot
filter2tracker_ts_datasets

defaultdict(list,
            {'INPUTDATA_ROOT': ['input-data-20220414/WW3_input_data_20220624/mesh.glo_1deg.nc',
              'input-data-20220414/WW3_input_data_20220624/mod_def.pointsatmw',
              'input-data-20220414/WW3_input_data_20220624/mod_def.glo_15mxt',
              'input-data-20220414/WW3_input_data_20220624/mod_def.gsh_15m',
              'input-data-20220414/WW3_input_data_20220624/mod_def.mx100',
              'input-data-20220414/WW3_input_data_20220624/mesh.gwes_30m.nc',
              'input-data-20220414/WW3_input_data_20220624/mod_def.natl_6m',
              'input-data-20220414/WW3_input_data_20220624/mod_def.glo_1deg',
              'input-data-20220414/WW3_input_data_20220624/mod_def.mx050',
              'input-data-20220414/WW3_input_data_20220624/mod_def.aoc_9km',
              'input-data-20220414/WW3_input_data_20220624/mod_def.mx025',
              'input-data-20220414/WW3_input_data_20220624/mod_def.points',
              'input-data-20220414/WW3_

### Obtain directories for the datasets requested by the user.

In [None]:
# if __name__ == '__main__': 
    
#     # Module for extracting data from source.
#     from get_timestamp_data import GetTimestampData
    
#     # Establish locality of where the datasets are sourced.
#     linked_home_dir = "/home/schin/work"
#     orion_rt_data_dir = linked_home_dir + "/noaa/nems/emc.nemspara/RT/NEMSfv3gfs/"
    
#     # Select timestamp dataset to transfer from RDHPCS on-disk to cloud
#     input_ts, bl_ts, ww3_input_ts, bmic_ts = [], ['develop-20220706'], ['WW3_input_data_20220624'], []
#     filter2specific_ts_datasets = GetTimestampData(orion_rt_data_dir, None).get_specific_ts_files(input_ts, bl_ts, ww3_input_ts, bmic_ts)
# filter2specific_ts_datasets.keys()

In [None]:
# # Selected timestamp dataset to transfer from RDHPCS on-disk to cloud
# filter2specific_ts_datasets

# Demo 2: Multipart Upload of Extracted Data to Cloud

### Upload datasets tracked by the data tracker bot.

In [None]:
# if __name__ == '__main__': 
#     from progress_bar import ProgressPercentage
#     from upload_data import UploadData
#     linked_home_dir = "/home/schin/work"
#     orion_rt_data_dir = linked_home_dir + "/noaa/nems/emc.nemspara/RT/NEMSfv3gfs/"
#     uploader_wrapper = UploadData(orion_rt_data_dir, filter2tracker_ts_datasets, use_bucket='rt')
#     uploader_wrapper.upload_files2cloud()

### Upload datasets by timestamps as requested by the user.
- This scenario applies when transferring the data files required for the UFS-WM currently deployed in the CSPs.

In [None]:
# if __name__ == '__main__': 
#     from progress_bar import ProgressPercentage
#     from upload_data import UploadData
#     linked_home_dir = "/home/schin/work"
#     orion_rt_data_dir = linked_home_dir + "/noaa/nems/emc.nemspara/RT/NEMSfv3gfs/"
#     uploader_wrapper = UploadData(orion_rt_data_dir, filter2specific_ts_datasets, use_bucket='rt')
#     uploader_wrapper.upload_files2cloud()


# Consolidated Demo: Extract Data Localities & Upload to Cloud

### Extract & upload datasets (input data + baseline data + IC data) as requested by the user.

In [None]:
# # Obtain directories & upload to cloud for the datasets requested by the user.
# from transfer_specific_data import TransferSpecificData
# input_ts, bl_ts, ww3_input_ts, bmic_ts = [], ['develop-20220706'], ['WW3_input_data_20220624'], []
# TransferSpecificData(input_ts, bl_ts, ww3_input_ts, bmic_ts, linked_home_dir="/home/schin/work", platform="orion")

### Extract & upload datasets tracked by the data tracker bot.

In [None]:
# Obtain directories & upload to cloud for the latest retrieved set of timestamped datasets tracked by the data tracker bot.
from transfer_bot_data import TransferBotData
TransferBotData(linked_home_dir="/home/schin/work", platform="orion")

# Demo: Upload a Single Data File of Interest

In [None]:
# if __name__ == '__main__': 
#     from progress_bar import ProgressPercentage
#     from upload_data import UploadData
#     linked_home_dir = "/home/schin/work"
#     orion_rt_data_dir = linked_home_dir + "/noaa/nems/emc.nemspara/RT/NEMSfv3gfs/"
#     uploader_wrapper = UploadData(orion_rt_data_dir, file_relative_dirs=None, use_bucket='rt')
    
#     # Upload a Single Data File of Interest
#     file_dir = 'input-data-20211210/fv3_regional_c768/INPUT/grid.tile7.halo4.nc'
#     uploader_wrapper.upload_single_file(file_dir)

# Demo: Delete a File

In [None]:
# if __name__ == '__main__': 
#     from progress_bar import ProgressPercentage
#     from upload_data import UploadData
#     linked_home_dir = "/home/schin/work"
#     orion_rt_data_dir = linked_home_dir + "/noaa/nems/emc.nemspara/RT/NEMSfv3gfs/"
#     uploader_wrapper = UploadData(orion_rt_data_dir, file_relative_dirs=None, use_bucket='rt')
#     file_dir = 'input-data-20211210/fv3_regional_c768/INPUT/grid.tile7.halo4.nc'
#     key_path = file_dir
#     uploader_wrapper.purge(key_path)

# Demo: Delete Objects with Key Prefix

In [None]:
# if __name__ == '__main__': 
#     from progress_bar import ProgressPercentage
#     from upload_data import UploadData
#     linked_home_dir = "/home/schin/work"
#     orion_rt_data_dir = linked_home_dir + "/noaa/nems/emc.nemspara/RT/NEMSfv3gfs/"
#     uploader_wrapper = UploadData(orion_rt_data_dir, file_relative_dirs=None, use_bucket='rt')
#     key_prefix = 'develop-20220511'
#     uploader_wrapper.purge_by_keyprefix(key_prefix)
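For reference, deleting many keys with Boto3 is usually done in batches, because a single `delete_objects` request accepts at most 1000 keys. The sketch below illustrates that pattern under stated assumptions: the helper names are hypothetical, the client is passed in by the caller, and this is not necessarily how `purge_by_keyprefix` is implemented.

```python
def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def delete_keys(s3_client, bucket, keys, batch_size=1000):
    """Delete the given keys in batches and return how many keys were submitted."""
    submitted = 0
    for batch in chunked(list(keys), batch_size):
        s3_client.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": key} for key in batch], "Quiet": True},
        )
        submitted += len(batch)
    return submitted
```

Here `s3_client` would be created with `boto3.client("s3")`, and `keys` would be the list of object keys that share the prefix to purge.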


# Demo: Copy Objects & Delete with Key Prefix
S3 has no native “move” or rename action. Objects are copied to the target key and the originals are then deleted, which is also how the AWS CLI handles a move.
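The copy-then-delete pattern can be sketched with Boto3 as follows. The helper names are hypothetical, the client is supplied by the caller, and this is an illustration rather than the body of `rename_s3_keys`.

```python
def renamed_key(key, source_prefix, new_prefix):
    """Return the key with source_prefix swapped for new_prefix."""
    if not key.startswith(source_prefix):
        raise ValueError(f"{key!r} does not start with {source_prefix!r}")
    return new_prefix + key[len(source_prefix):]


def move_object(s3_client, bucket, source_key, new_key):
    """Emulate a move within one bucket: copy first, delete the original afterwards."""
    # The managed copy() call is used so that large objects are copied in parts.
    s3_client.copy({"Bucket": bucket, "Key": source_key}, bucket, new_key)
    # The original is removed only after the copy call has returned without raising.
    s3_client.delete_object(Bucket=bucket, Key=source_key)
```

Deleting only after a successful copy means a failure partway through leaves a duplicate rather than a missing object.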

In [None]:
# if __name__ == '__main__': 
#     from progress_bar import ProgressPercentage
#     from upload_data import UploadData
#     uploader_wrapper = UploadData(orion_rt_data_dir, file_relative_dirs=None, use_bucket='rt')
#     source_key_path = '####' 
#     new_key_path = '###'
#     uploader_wrapper.rename_s3_keys(source_key_path, new_key_path)

# Demo: Get List of All Keys in UFS RT S3 Bucket

In [None]:
# if __name__ == '__main__': 
#     from progress_bar import ProgressPercentage
#     from upload_data import UploadData
#     uploader_wrapper = UploadData(orion_rt_data_dir, file_relative_dirs=None, use_bucket='rt')
#     all_bucket_objects = uploader_wrapper.get_all_s3_keys()
# all_bucket_objects
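Listing every key in a bucket requires pagination, since one `list_objects_v2` response returns at most 1000 keys. A minimal sketch using Boto3's paginator is shown below; the function name is hypothetical and `get_all_s3_keys` may be implemented differently.

```python
def list_all_keys(s3_client, bucket, prefix=""):
    """Collect every object key in the bucket, optionally restricted to a prefix."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty bucket or prefix carry no 'Contents' entry.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys
```

With `s3_client = boto3.client("s3")`, calling `list_all_keys(s3_client, bucket, prefix="develop-20220511")` would return only the keys under that prefix.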

In [None]:
# Write list to text file and save to directory
# with open('[filename].[file_format]', 'w') as f:
#     for item in all_bucket_objects:
#         f.write("%s\n" % item)