# Demo: Data Uploader for SRW Datasets to Cloud Data Storage

### __Purpose:__ 

This program transfers the Unified Forecast System Short-Range Weather Application (UFS SRW Application) fixed and input model datasets residing on the RDHPCS to cloud data storage by chaining API calls to the cloud data storage bucket. The program supports the data required for the current UFS SRW Application.

According to Amazon AWS, the following conditions need to be considered when transferring data to cloud data storage:
- Largest object that can be uploaded in a single PUT is 5 GB.
- Individual Amazon S3 objects can range in size from a minimum of 0 bytes to a maximum of 5 TB.
- For objects larger than 100 MB, Amazon recommends using the Multipart Upload capability.
- The total volume of data in a cloud data storage bucket is unlimited.
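
The size conditions above can be turned into a simple decision rule. The following is a minimal sketch, not part of the uploader itself; the function name `choose_upload_method` and the use of binary units (1 MB = 1024² bytes) are assumptions made for illustration.

```python
# Size limits quoted in the list above (binary units are an assumption).
MB = 1024 ** 2
GB = 1024 ** 3
TB = 1024 ** 4

MULTIPART_RECOMMENDED = 100 * MB  # above this, multipart upload is recommended
SINGLE_PUT_LIMIT = 5 * GB         # largest object for a single PUT
MAX_OBJECT_SIZE = 5 * TB          # largest object S3 will store


def choose_upload_method(size_bytes: int) -> str:
    """Classify an object size against the S3 limits listed above."""
    if size_bytes < 0:
        raise ValueError("size must be non-negative")
    if size_bytes > MAX_OBJECT_SIZE:
        raise ValueError("object exceeds the 5 TB S3 object limit")
    if size_bytes > SINGLE_PUT_LIMIT:
        return "multipart_required"
    if size_bytes > MULTIPART_RECOMMENDED:
        return "multipart_recommended"
    return "single_put"
```

For example, a 7,768,111,669-byte archive falls above the single-PUT limit, so it can only be uploaded in parts.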

Tools that could be used to perform data transfer and partitioning (Multipart Upload/Download) are: 
- AWS SDK
- AWS CLI
- AWS S3 REST API

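The AWS CLI and the AWS SDK for Python (Boto3) are both built on the botocore library; the S3 REST API is the underlying service interface that they call. 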

In this demonstration, the framework uses the AWS SDK for Python to transfer the UFS SRW Application fixed and input model datasets from the RDHPCS platform, Orion, to cloud data storage with low latency. 

The AWS SDK was chosen for the following reasons:
- It integrates with other Python scripts.
- It offers additional capabilities for data manipulation and transfer compared to the alternative tools listed above.

### __Capabilities:__ 

The framework will be able to perform the following actions:

- Multi-threading and partitioning of the datasets to optimize upload performance from on-prem to cloud. 
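
As a rough illustration of the capability above, the sketch below shows one way to combine partitioning and threading with Boto3's managed transfer. The helper names `plan_transfer` and `upload`, the 64 MB starting part size, and the thread count are assumptions chosen for illustration; the uploader's own `upload_data` module may be organized differently.

```python
import os


def plan_transfer(file_size: int, max_threads: int = 10) -> dict:
    """Pick TransferConfig keyword arguments for a file of the given size."""
    mb = 1024 ** 2
    max_parts = 10_000  # S3 allows at most 10,000 parts per multipart upload
    chunk = 64 * mb     # assumed starting part size
    # Double the part size until the file fits within the part-count limit.
    while file_size > chunk * max_parts:
        chunk *= 2
    return {
        "multipart_threshold": 100 * mb,
        "multipart_chunksize": chunk,
        "max_concurrency": max_threads,
        "use_threads": max_threads > 1,
    }


def upload(path: str, bucket: str, key: str) -> None:
    """Upload one file using Boto3's multipart, multi-threaded transfer."""
    # Imported here so plan_transfer can be used without Boto3 installed.
    import boto3
    from boto3.s3.transfer import TransferConfig

    config = TransferConfig(**plan_transfer(os.path.getsize(path)))
    boto3.client("s3").upload_file(path, bucket, key, Config=config)
```

Larger parts mean fewer requests per file, while `max_concurrency` controls how many parts are in flight at once; both values are worth tuning against the network path between the RDHPCS platform and the bucket.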


### __Datasets to Transfer:__
The following conditions must be considered when storing the SRW data in cloud:

- As of 05/2022, datasets to be stored in cloud need to support the SRW cases featured within the UFS SRW release version 2.0. 
- The datasets to be stored in cloud will include the fixed and input model datasets residing on the RDHPCS platform, Orion.

| SRW Release Version | Fixed Data Location (on Orion) | Input Model Data Location (on Orion) |
| :- | :- | :-: |
| 2.0| /noaa/fv3-cam/UFS_SRW_App/develop/fix.tar | /noaa/fv3-cam/UFS_SRW_App/develop/input_model_data.tar |

### __Environment Setup:__

1. Install Miniconda on your machine. Note: Miniconda is a smaller version of Anaconda that includes only conda and a small set of necessary and useful packages. With Miniconda, you install only what you need, without the extra packages bundled with Anaconda:

Download the latest Miniconda installer (e.g., the Python 3.9 version):
- __wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.9.2-Linux-x86_64.sh__

Check the integrity of the downloaded file with SHA-256:
- __sha256sum Miniconda3-py39_4.9.2-Linux-x86_64.sh__

Compare the result against the reference SHA-256 hash listed at the following link: https://docs.conda.io/en/latest/miniconda.html
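
The same integrity check can be done from Python when `sha256sum` is not available. This is a minimal sketch using only the standard library; the function names are hypothetical, and the reference hash must still be copied by hand from the page linked above.

```python
import hashlib


def sha256_of_file(path: str, block_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_reference(path: str, reference_hash: str) -> bool:
    """Compare a file's digest with a reference hash, ignoring case and whitespace."""
    return sha256_of_file(path) == reference_hash.strip().lower()
```

Reading in blocks keeps memory use constant regardless of the installer's size.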

Install Miniconda in Linux:
- __bash Miniconda3-py39_4.9.2-Linux-x86_64.sh__

Next, the Miniconda installer will ask where you want to install Miniconda. Press ENTER to accept the default install location, i.e., your $HOME directory. If you do not want to install in the default location, press CTRL+C to cancel the installation or specify an alternate installation directory. If you chose the default location, the installer will display “PREFIX=/var/home/<user>/miniconda3” and continue the installation.

For the installation to take effect, run the following command: 
- __source ~/.bashrc__

Next, you will see the prefix (base) in front of your terminal/shell prompt, indicating that conda's base environment is activated.

2.	Once you have conda installed on your machine, perform the following to create a conda environment:

To create a new environment (if a YAML file is not provided)
- __conda create -n [Name of your conda environment you wish to create]__

__(OR)__

To ensure you are running Python 3.9:
- __conda create -n myenv python=3.9__

__(OR)__

To create a new environment from an existing YAML file (if a YAML file is provided):
- __conda env create -f environment.yml__

__*Note:__ A .yml file is a text file that lists the dependencies of the given conda environment and the channels from which to install them. To create the environment from the file, you will need to be in the directory where the environment.yml file lives.

4.	Activate the new environment via: __conda activate [Name of your conda environment you wish to activate]__

5.	Verify that the new environment was installed correctly via: __conda info --envs__

__*Note:__
- From this point on, activate the conda environment before running any .py script(s) or Jupyter notebooks, using the following command: __conda activate [Name of your conda environment]__
- To deactivate a conda environment: 
    - __conda deactivate__

#### ___Link Home Directory to Dataset Location on RDHPCS Platform___ 

6.	Unfortunately, there is no way to navigate to the /work/ filesystem from within the Jupyter interface. The best workaround is to create a symbolic link in your home folder that points to the /work/ filesystem. Run the following command from a Linux terminal on Orion to create the link: 

    - __ln -s /work /home/[Your user account name]/work__

Now, when you navigate to the __/home/[Your user account name]/work__ directory in Jupyter, it will take you to the __/work__ folder, allowing you to reach any data within the __/work__ filesystem that you have permission to access from Jupyter. The same procedure works for any filesystem available from the root directory. 

__*Note:__ On Orion, users must create a symbolic link from their home directory to the main directory containing the datasets of interest.
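
The `ln -s` step can also be performed from Python, which is convenient when the link is created as part of a setup script. This is a minimal sketch; the helper name `link_into_home` is hypothetical, and the paths in the usage comment simply mirror the command shown earlier.

```python
import os


def link_into_home(target: str, link_path: str) -> str:
    """Create a symbolic link at link_path pointing to target (like `ln -s`)."""
    link_path = os.path.expanduser(link_path)
    if os.path.islink(link_path):
        # Re-running is harmless when the link already points to the target.
        if os.readlink(link_path) == target:
            return link_path
        raise FileExistsError(f"{link_path} already links elsewhere")
    if os.path.exists(link_path):
        raise FileExistsError(f"{link_path} exists and is not a symbolic link")
    os.symlink(target, link_path)
    return link_path


# Example (equivalent to the ln -s command in step 6):
# link_into_home("/work", "~/work")
```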

#### ___Open & Run Data Analytics Tool on Jupyter Notebook___

7.	Open OnDemand has a built-in file explorer and file transfer application available directly from its dashboard via ...
    - Log in to https://orion-ood.hpc.msstate.edu/ 
    - In the Open OnDemand Interface, select __Interactive Apps__ > __Jupyter Notebook__
    - Set the following configurations to run Jupyter:


#### ___Additional Information___

__To create a .yml file, execute the following commands:__

- Activate the environment to export: 
    - __conda activate myenv__

- Export your active environment to a new file:
    - __conda env export > [ENVIRONMENT FILENAME].yml__


### __Reference(s)__
Latest UFS SRW Application Guide:
- https://ufs-srweather-app.readthedocs.io/en/latest/InputOutputFiles.html

# Demo 1: Data Locality Extractor from Source

The purpose of this demo is to read the directory content of the TAR folders in order to set each data file as an object with a unique key.
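
To make the idea concrete, the sketch below lists the regular files inside a TAR archive and maps each one to a candidate object key. It uses only the standard library; the function name `tar_members_to_keys` and the `key_prefix` parameter are hypothetical and are not taken from the `get_srw_data` module used later in this notebook.

```python
import posixpath
import tarfile


def tar_members_to_keys(tar_path: str, key_prefix: str = "") -> dict:
    """Map each regular file inside a TAR archive to a unique object key."""
    keys = {}
    with tarfile.open(tar_path, "r:*") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            # Normalise names such as "./fix/a.nc" to "fix/a.nc".
            normalized = posixpath.normpath(member.name)
            keys[member.name] = f"{key_prefix}{normalized}"
    return keys
```

Because member paths are unique within an archive, the resulting keys are unique within a given prefix.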

__Historical Log:__
- Initially, SRW cases were to be defined for UFS SRW Application release version 2.0 in order to determine which datasets that release requires. However, it was suggested during an AUS weekly meeting to transfer all data within the two TAR-formatted SRW dataset folders, regardless of whether a dataset is required for that release, because it is not known when the SRW dataset locations will be standardized across all RDHPCS platforms. 

- As of 05/04/22, these TAR-formatted SRW dataset folders feature the standardized dataset locations, according to the on-prem SRW data maintainer, Michael Kavulich. As of 05/04/22, a few PRs (UFS SRW Application Issue #231, Issue #716, Issue #724) need to be merged before these dataset location changes are implemented. 

- As a result, all data within the two TAR-formatted SRW dataset folders on Orion (regardless of whether a dataset in the TAR folders is required for that release) were transferred to cloud data storage in this demonstration to support the following developing UFS SRW release version:

| SRW Release Version | Fixed Data Location (on Orion) | Input Model Data Location (on Orion) |
| :- | :- | :-: |
| 2.0| /noaa/fv3-cam/UFS_SRW_App/develop/fix.tar | /noaa/fv3-cam/UFS_SRW_App/develop/input_model_data.tar |


- Later, AUS requested that each TAR folder be set as its own object. The migration of TAR objects was requested by AUS team member, Gillian P.

### Obtain directories for the datasets requested by the user as listed within the WE2E csv file.
The goal of this demo is to obtain only the datasets required for the SRW cases requested by a given user.
Recall that the demo transfers all data within the two TAR-formatted SRW dataset folders, regardless of whether a dataset is required for a user's UFS SRW Application release version. With the following function, a user can request only the datasets applicable to their SRW release version rather than the full contents of the SRW TAR folders. This feature is a future capability as SRW development continues within this program.


In [None]:
# if __name__ == '__main__': 
    
#     # Module for extracting data from source.
#     from read_srw_we2e_cases import TransferCaseData
    
#     # Model's Date and times requested by user.
#     fv3gfs_ts, gsmgfs_ts, hrrr_ts, rap_ts, nam_ts = TransferCaseData(srw_cases_fn = 'WE2E Cases and Locations.xlsx').read_srw_cases()
    
#     # FV3LAM pregen grids requested by user. NOTE: (04.22.22) Currently contains unique fixed files per FV3LAM pregen grids folder on Orion
#     grids_list = TransferCaseData(srw_cases_fn = 'WE2E Cases and Locations.xlsx').read_srw_grids()


### Keep extraction of resolutions if the SRW fixed files data structure remains the standard data structure; otherwise, modify appropriately.

In [None]:
# # Keep if the data structure for SRW fixed files remains otherwise modify appropriately.
# import re
# import itertools
# import numpy as np
# from collections import defaultdict

# grid_key = []
# res_list = []
# grid2res = defaultdict(list)
# for t in grids_list:
    
#     #grid_key.append(re.findall('([a-zA-Z_ ]*)\d*.*', t)[0])
    
#     if 'RRFS_CONUS_' in t:
#         grid2res['RRFS_CONUS'].append(re.findall(r'(\d+)km', t)[0])
#     elif 'SUBCONUS_' in t:
#         grid2res['SUBCONUS'].append(re.findall(r'(\d+)km', t)[0])
#     elif 'RRFS_CONUScompact' in t:
#         grid2res['RRFS_CONUScompact'].append(re.findall(r'(\d+)km', t)[0])
#     elif 'ESG' in t:
#         grid2res['ESG'].append(re.findall(r'(\d+)km', t))
#     elif 'GFDL' in t:
#         grid2res['GFDL'].append(re.findall(r'(\d+)km', t))
# grid2res        
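
The commented-out cell above groups grid resolutions by grid family. A runnable sketch of the same idea follows; the function name `group_resolutions`, the family list, and the sample grid names in the usage comment are assumptions for illustration, and unlike the cell above it stores a flat list of resolutions per family.

```python
import re
from collections import defaultdict

# Longer names come first so "RRFS_CONUScompact" is not matched as "RRFS_CONUS".
GRID_FAMILIES = ("RRFS_CONUScompact", "RRFS_CONUS", "SUBCONUS", "ESG", "GFDL")


def group_resolutions(grid_names):
    """Group the '<N>km' resolutions found in grid names by grid family."""
    grid2res = defaultdict(list)
    for name in grid_names:
        family = next((f for f in GRID_FAMILIES if f in name), None)
        if family is None:
            continue  # grid name does not belong to a known family
        grid2res[family].extend(re.findall(r"(\d+)km", name))
    return dict(grid2res)


# Example with illustrative grid names:
# group_resolutions(["RRFS_CONUS_25km", "RRFS_CONUScompact_13km"])
```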

In [None]:
# # Set resolutions to filter fixed files to.
# rrfs_conus_res = grid2res['RRFS_CONUS']
# rrfs_subconus_res = grid2res['SUBCONUS']
# rrfs_conus_compact_res = grid2res['RRFS_CONUScompact']
# esg_res = grid2res['ESG'][0]
# gfdl_res = grid2res['GFDL'][0]

### Read SRW TAR folders and its file directory content.
The original purpose of this demo was to read the directory content of the TAR folders in order to set each data file as an object with a unique key; however, AUS later requested that each TAR folder be set as its own object.

__Historical Log:__
- The migration of TAR objects was requested by AUS team member, Gillian P.


In [None]:
if __name__ == '__main__': 
    
    # Module for extracting data from source.
    from get_srw_data import GetSrwData
    
    # Source SRW data from Orion
    linked_home_dir = "/home/schin/work"
    fix_data_dir = linked_home_dir + "/noaa/fv3-cam/UFS_SRW_App/develop/fix.tar"
    input_model_data_dir = linked_home_dir + "/noaa/fv3-cam/UFS_SRW_App/develop/input_model_data.tar"
    natural_earth_dir = linked_home_dir + "/noaa/fv3-cam/UFS_SRW_App/develop/NaturalEarth"
    
    # Source data from Hera
    fc_sample_data_dir =  "/" 
    
    # Instantiate SRW uploader
    srw_uploader = GetSrwData(None, None, None, fix_data_dir, input_model_data_dir, natural_earth_dir, fc_sample_data_dir)
    
    # List all data directories from sources (filtered)
    ma_data_list = srw_uploader.ma_data_list
    fix_data_list = srw_uploader.fix_data_list
    ne_data_list = srw_uploader.ne_data_list
    fc_sample_data_list = srw_uploader.fc_sample_data_list
        
    # SRW input model analysis, fixed and natural earth data file locations (filtered)
    srw_ma_data_dirs = srw_uploader.ma_file_dirs
    srw_fix_data_dirs = srw_uploader.fix_file_dirs
    srw_natural_earth_dirs = srw_uploader.ne_dirs
    srw_fc_dirs = srw_uploader.fc_sample_dirs
    
    # Select model analysis files based on external model it was generated by (filtered)
    srw_ma_dict = srw_uploader.partition_ma_datasets 
    srw_fix_dict = srw_uploader.partition_fixed_datasets 
    
    # Natural Earth files (categorization has not been requested by AUS for the Natural Earth dataset)
    srw_ne_dict = srw_uploader.partition_ne_datasets
    
    # Forecast sample files (categorization has not been requested by AUS for the fc sample dataset)
    srw_fc_dict = srw_uploader.partition_fc_datasets
    

In [None]:
# Input model data files based on external model it was generated by.
srw_ma_dict

In [None]:
# Input fixed files.
srw_fix_dict

In [None]:
# Input Natural Earth files.
srw_ne_dict

In [None]:
# List all input model data directories from MA source (filtered)
ma_data_list

In [None]:
# List all fix data directories from fix source (filtered)
fix_data_list

In [None]:
# List all Natural Earth data directories from NE source (filtered)
ne_data_list

# Demo 2: Multipart Upload of Extracted Data to Cloud

### Upload all fixed datasets residing in SRW TAR folder

In [None]:
# if __name__ == '__main__': 
#     from progress_bar import ProgressPercentage
#     from upload_data import UploadData
    
#     # Upload input fixed data.
#     uploader_wrapper = UploadData(srw_fix_dict, use_bucket='srw')
#     uploader_wrapper.upload_files2cloud()
    

### Upload all input model datasets residing in SRW TAR folder

In [None]:
# if __name__ == '__main__': 
#     from progress_bar import ProgressPercentage
#     from upload_data import UploadData
    
#     #Upload input model data.
#     uploader_wrapper = UploadData(srw_ma_dict, use_bucket='srw')
#     uploader_wrapper.upload_files2cloud()

### Upload all Natural Earth datasets residing in NE Directory Source (Not Derived from a TAR)

In [None]:
# if __name__ == '__main__': 
#     from progress_bar import ProgressPercentage
#     from upload_data import UploadData
    
#     # Upload Natural Earth data.
#     uploader_wrapper = UploadData(srw_ne_dict, use_bucket='srw')
#     uploader_wrapper.upload_files2cloud()

# Consolidated Demo: Extract Data Localities & Upload to Cloud.

### Extract & upload all SRW's datasets (fixed data + input model data) to cloud.

In [None]:
# if __name__ == '__main__': 
#     from progress_bar import ProgressPercentage
#     from upload_data import UploadData
#     from transfer_srw_data import TransferSrwData
    
#     # Obtain directories & upload to cloud for all the fix and model input SRW datasets
#     srw_xfer = TransferSrwData(linked_home_dir="/home/schin/work", platform="orion")


In [None]:
# # List all data directories from sources (filtered)
# ma_data_list = srw_xfer.ma_data_list
# fix_data_list = srw_xfer.fix_data_list

# # SRW input model analysis & fixed data file locations (filtered)
# srw_ma_data_dirs = srw_xfer.ma_file_dirs
# srw_fix_data_dirs = srw_xfer.fix_file_dirs

# # Select model analysis files based on external model it was generated by (filtered)
# srw_ma_dict = srw_xfer.partition_ma_datasets 
# srw_fix_dict = srw_xfer.partition_fixed_datasets 


# Demo: Upload a Single Data File of Interest

__Remarks:__
- Is there an interest in transferring the README file, input_model_data/README_input_model_data.txt? No, although this may change.

In [None]:
# if __name__ == '__main__': 
#     from progress_bar import ProgressPercentage
#     from upload_data import UploadData

#     # Upload a Single Data File of Interest
#     uploader_wrapper = UploadData(file_relative_dirs=None, use_bucket='srw')
#     file_dir = '###/###/[filename].[file_format]'
#     uploader_wrapper.upload_single_file(file_dir, None)


# Demo: Upload a Single Data Folder of Interest
The purpose of this demo is to upload a TAR folder as its own object with a set key. The keys established here were requested by AUS.

__Historical Log:__
- As of 06/01/22: Request from Natalie Perlin to split the single tar fix_files_and_model_data.tar into the following two objects prior to SRW release 2.0.


In [3]:
if __name__ == '__main__': 
    from progress_bar import ProgressPercentage
    from upload_data import UploadData

    # Source SRW data folder
    """ 
    As of 11/09/22: As development continues, the developer of this tool no longer has access to Orion 
    to extract the data on Orion because the account was not re-activated by the program's PI. 
    In the meantime, the developer has created a back-up server work environment to perform the SRW data migration 
    to cloud. 
    """
    linked_home_dir = "/home/schin/work"
    folder_dir1 = linked_home_dir + "/noaa/epic-ps/nperlin/SRW_RELEASE_DATA/fix_data.tgz" # Source: Orion
    folder_dir2 = linked_home_dir + "/noaa/epic-ps/nperlin/SRW_RELEASE_DATA/gst_data.tgz" # Source: Orion
    folder_dir3 = linked_home_dir + "/noaa/epic-ps/schin/NaturalEarth.tgz" # Source: Orion
    folder_dir4 = "/scratch1/NCEPDEV/nems/Edward.Snyder/srw-v2p1-indy-sample-case-vx/sample_cases/Indy-Severe-Weather.tgz" # Source: Hera
    folder_dir5 = "/scratch1/NCEPDEV/nems/Edward.Snyder/srw-v2p1-indy-sample-case-vx/sample_cases/release-public-v2.1.0/Indy-Severe-Weather.tgz" # Source: Hera

    # Instantiate SRW uploader
    uploader_wrapper = UploadData(file_relative_dirs = None, use_bucket = 'srw')
    
    # Upload a Single Data Folder of Interest
    """
    As of 06/01/22: Request from Natalie Perlin to split the single tar file, fix_files_and_model_data.tar,
    into the following two objects prior to SRW release 2.0.
    
    - Fix Data Object 1's Key: current_srw_release_data/fix_data.tgz
    - Test Data Object 2's Key: current_srw_release_data/gst_data.tgz 
    
    Although the nomenclature deviates from the rest of the stored datasets in the SRW cloud bucket,
    there was an additional request by AUS to have the natural earth data object's key set 
    to NaturalEarth/NaturalEarth.tgz.
    
    """
    # Set object's key.
    #key_path1 = 'current_srw_release_data/fix_data.tgz' # Key for latest SRW release's fixed data
    #key_path2 = 'current_srw_release_data/gst_data.tgz' # Key for latest SRW release's model input data
    #key_path3 = 'NaturalEarth/NaturalEarth.tgz' # Key for natural earth data
    key_path4 = 'sample_cases/release-public-v2/Indy-Severe-Weather.tgz' # Key for sampled forecast data for SRW v2
    key_path5 = 'sample_cases/release-public-v2.1.0/Indy-Severe-Weather.tgz' # Key for sampled forecast data for SRW v2.1.0
   
    # Migrate object to SRW cloud bucket.
    #uploader_wrapper.upload_single_srw_folder(folder_dir1, key_path1)
    #uploader_wrapper.upload_single_srw_folder(folder_dir2, key_path2)
    #uploader_wrapper.upload_single_srw_folder(folder_dir3, key_path3)
    uploader_wrapper.upload_single_srw_folder(folder_dir4, key_path4)
    uploader_wrapper.upload_single_srw_folder(folder_dir5, key_path5)


/scratch1/NCEPDEV/nems/Edward.Snyder/srw-v2p1-indy-sample-case-vx/sample_cases/Indy-Severe-Weather.tgz  7768111669 / 7768111669.0  (100.00%)Processing Time (min): 3.0262589852015176

/scratch1/NCEPDEV/nems/Edward.Snyder/srw-v2p1-indy-sample-case-vx/sample_cases/release-public-v2.1.0/Indy-Severe-Weather.tgz  7937423878 / 7937423878.0  (100.00%)Processing Time (min): 2.0303078174591063
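
The progress lines above have the form `<file>  <bytes sent> / <total bytes>  (<percent>)`. The `progress_bar` module itself is not shown in this notebook, so the following is only a sketch of a callback that would produce output in that form; the class name `ProgressSketch` and its `size` and `stream` parameters are assumptions. Boto3's `upload_file` accepts such an object through its `Callback` argument and calls it with the number of bytes transferred in each chunk.

```python
import os
import sys
import threading


class ProgressSketch:
    """Thread-safe upload progress callback (illustrative only)."""

    def __init__(self, filename, size=None, stream=None):
        self._filename = filename
        # Total size as a float, which is why the output shows e.g. "10.0".
        self._size = float(os.path.getsize(filename) if size is None else size)
        self._seen_so_far = 0
        self._lock = threading.Lock()  # callbacks arrive from several threads
        self._stream = stream if stream is not None else sys.stdout

    def __call__(self, bytes_amount):
        with self._lock:
            self._seen_so_far += bytes_amount
            percentage = (self._seen_so_far / self._size) * 100 if self._size else 100.0
            self._stream.write(
                "\r%s  %s / %s  (%.2f%%)"
                % (self._filename, self._seen_so_far, self._size, percentage)
            )
            self._stream.flush()
```

The lock matters because a multi-threaded transfer reports progress from several worker threads at once.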



# Demo: Delete a File

In [2]:
if __name__ == '__main__': 
    from progress_bar import ProgressPercentage
    from upload_data import UploadData
    uploader_wrapper = UploadData(file_relative_dirs=None, use_bucket='srw')
    file_dir = 'sample_cases/Indy-Severe-Weather.tgz'
    key_path = file_dir
    uploader_wrapper.purge(key_path)

# Demo: Delete Objects with Key Prefix

In [None]:
# if __name__ == '__main__': 
#     from progress_bar import ProgressPercentage
#     from upload_data import UploadData
#     uploader_wrapper = UploadData(file_relative_dirs=None, use_bucket='srw')
#     key_prefix = '###'
#     uploader_wrapper.purge_by_keyprefix(key_prefix)
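
How `purge_by_keyprefix` is implemented is not shown here. As a sketch of one common approach with Boto3, the function below lists the keys under a prefix page by page and deletes them in batches of at most 1,000, the per-request limit of `delete_objects`. The function name `delete_by_prefix` and the empty-prefix guard are assumptions added for illustration.

```python
def delete_by_prefix(client, bucket, prefix, batch_size=1000):
    """Delete every object whose key starts with prefix; return the count."""
    if not prefix:
        # Guard: an empty prefix would match every object in the bucket.
        raise ValueError("refusing to delete with an empty prefix")
    deleted = 0
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        for start in range(0, len(keys), batch_size):
            batch = keys[start:start + batch_size]
            client.delete_objects(Bucket=bucket, Delete={"Objects": batch})
            deleted += len(batch)
    return deleted


# Example (requires Boto3 and credentials; bucket and prefix are placeholders):
# import boto3
# delete_by_prefix(boto3.client("s3"), "my-bucket", "some/prefix/")
```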

# Demo: Copy Objects & Delete with Key Prefix
AWS CLI copies the objects to the target folder and then removes the original file. There is no “move” action in S3.

In [None]:
# if __name__ == '__main__': 
#     from progress_bar import ProgressPercentage
#     from upload_data import UploadData
#     uploader_wrapper = UploadData(file_relative_dirs=None, use_bucket='srw')
#     source_key_path = '####' 
#     new_key_path = '###'
#     uploader_wrapper.rename_s3_keys(source_key_path, new_key_path)
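
Because S3 has no "move" action, a rename is a copy followed by a delete. The sketch below shows that sequence with a Boto3 S3 client; the function name `move_object` is hypothetical and is not the implementation of `rename_s3_keys`. It uses the client's managed `copy` method, which switches to a multipart copy for large objects such as the multi-gigabyte archives uploaded earlier.

```python
def move_object(client, bucket, source_key, new_key):
    """Rename an object within a bucket by copying it, then deleting the source."""
    if source_key == new_key:
        # Guard: copying onto itself and then deleting would lose the object.
        raise ValueError("source and destination keys are identical")
    client.copy({"Bucket": bucket, "Key": source_key}, bucket, new_key)
    # Delete only after the copy call has returned without raising.
    client.delete_object(Bucket=bucket, Key=source_key)


# Example (requires Boto3 and credentials; names are placeholders):
# import boto3
# move_object(boto3.client("s3"), "my-bucket", "old/key.tgz", "new/key.tgz")
```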

# Demo: Get List of All Keys in UFS SRW S3 Bucket

In [None]:
# if __name__ == '__main__': 
#     from progress_bar import ProgressPercentage
#     from upload_data import UploadData
#     uploader_wrapper = UploadData(file_relative_dirs=None, use_bucket='srw')
#     all_bucket_objects = uploader_wrapper.get_all_s3_keys()
# all_bucket_objects
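
A single list request to S3 returns at most 1,000 keys, so collecting every key in a bucket requires pagination. The following sketch shows one way to do that with a Boto3 S3 client; the function name `list_all_keys` is hypothetical and may differ from how `get_all_s3_keys` works internally.

```python
def list_all_keys(client, bucket, prefix=""):
    """Return every object key in the bucket, optionally limited to a prefix."""
    keys = []
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty result carry no "Contents" entry.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


# Example (requires Boto3 and credentials; the bucket name is a placeholder):
# import boto3
# all_keys = list_all_keys(boto3.client("s3"), "my-bucket")
```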

In [None]:
# Write list to text file and save to directory
# with open('[filename].[file_format]', 'w') as f:
#     for item in all_bucket_objects:
#         f.write("%s\n" % item)