# Demo: Track Dev Branch Commits
- Extract a single file daily and parse it.
- The client requests rt.sh directly from GitHub.
- rt.sh is read and preprocessed, and the timestamps of the relevant datasets that have been pushed to GitHub are extracted.
- A file containing the datasets' timestamps is generated.
- The program compares the last log file with the most recent file containing the datasets' timestamps.
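
The extract-and-compare steps above can be sketched as follows. This is a minimal illustration, not the tracker's actual implementation: the variable names come from the mapping cell later in this notebook, while the rt.sh line format, the sample text, the `/data/...` paths, and the helper names `extract_timestamps` and `changed_variables` are assumptions. Fetching rt.sh from GitHub is omitted.

```python
import re

# Variable names as used in the notebook's mapping cell.
TRACKED_VARS = ("BL_DATE", "INPUTDATA_ROOT", "INPUTDATA_ROOT_BMIC", "INPUTDATA_ROOT_WW3")


def extract_timestamps(script_text):
    """Return {variable: sorted list of 8-digit dates found in its assignments}."""
    found = {name: set() for name in TRACKED_VARS}
    for line in script_text.splitlines():
        # Assumed format: NAME=value, optionally preceded by "export".
        match = re.match(r"\s*(?:export\s+)?(\w+)=(.*)", line)
        if not match or match.group(1) not in found:
            continue
        # Collect standalone 8-digit runs (assumed YYYYMMDD) from the value.
        found[match.group(1)].update(re.findall(r"(?<!\d)\d{8}(?!\d)", match.group(2)))
    return {name: sorted(dates) for name, dates in found.items()}


def changed_variables(previous, current):
    """Return the variables whose timestamps differ from the previous log."""
    return sorted(name for name in current if current[name] != previous.get(name))


# Hypothetical rt.sh excerpt, for illustration only.
sample = """\
BL_DATE=20230112
INPUTDATA_ROOT=/data/input-data-20221101
INPUTDATA_ROOT_WW3=/data/WW3_input_data_20220624
INPUTDATA_ROOT_BMIC=/data/BM_IC-20220207
"""

current_log = extract_timestamps(sample)
previous_log = dict(current_log, BL_DATE=["20230101"])
revised = changed_variables(previous_log, current_log)
```

Here `revised` lists only the variables whose dates changed between the two logs, which is the signal a daily run would act on.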

### Setting up the environment
Once conda is installed on your machine, create a conda environment as follows:
- Create the environment from the .yml file. Note: The environment name is set within the .yml file.
    - $ conda env create -f git_env.yml

- Activate the new environment via:
   - $ conda activate data_tracker
   
- Verify that the new environment was installed correctly via:
    - $ conda info --envs

- Confirm the git_env.yml dependencies were installed via:
    - $ conda list

In [None]:
from rt_revision_tracker import *

# Restart the accumulation of timestamps.
rt_revision_tracker().reset_tracker()


In [None]:
from rt_revision_tracker import *

# Accumulation of timestamps since time of reset.
data_log_dict = rt_revision_tracker().populate()
data_log_dict

# Filter Window Featuring Latest 2 Months

In [None]:
# Define the window used to filter datasets by retrieval date.
# In this scenario, capture the past 60 days of data.
from rt_tracker_filter import RtTrackerFilter
maintenance_wrapper = RtTrackerFilter()
maintenance_wrapper.maintenance_window(60)
window_dates2store = maintenance_wrapper.maintenance_window_dict
window_dates2store

# TODO: For data maintenance and to preserve the current data management process: if the retrieval date is not
# within cloud data storage, remove the objects whose prefix dates fall outside the filter window from cloud data storage.
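
As a rough illustration of the window logic behind the TODO above, the sketch below splits retrieval dates into those inside and outside an N-day window. It does not reflect how `RtTrackerFilter` is implemented; the function name `split_by_window`, the use of `datetime.date` values as retrieval dates, and the sample dates are all assumptions.

```python
from datetime import date, timedelta


def split_by_window(retrieval_dates, days, today):
    """Split dates into (keep, drop) relative to a window ending at `today`."""
    cutoff = today - timedelta(days=days)
    keep = sorted(d for d in retrieval_dates if d >= cutoff)
    drop = sorted(d for d in retrieval_dates if d < cutoff)
    return keep, drop


# Hypothetical retrieval dates, for illustration only.
dates = [date(2022, 12, 30), date(2022, 12, 31), date(2023, 2, 15)]
keep, drop = split_by_window(dates, days=60, today=date(2023, 3, 1))
```

Dates in `drop` are the candidates whose prefixed objects would be removed from cloud data storage; the removal itself depends on the storage client and is left out.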


# Read the latest pickle from data tracker

In [None]:
import pickle

# Recall the latest updated file and populate with the retrieved file if it has a timestamp revision.
with open("./track_ts/latest_rt.sh.pk", 'rb') as handle:
    data_log_dict = pickle.load(handle)                
print('\033[94m' + '\033[1m' + f'\nTimestamps (Prior to File Retrieval):\033[0m\033[1m\n{data_log_dict}\033[0m')  

# Map dictionary keys to the established timestamp folder names used for the on-prem data folders.
data_fldrs_dict = {}
for retrieval_date, ts_dict in data_log_dict.items():
    
    # Initialize the list per retrieval day
    input_ts, bl_ts, ww3_input_ts, bmic_ts = [], [], [], []
    for ts_type, ts_day in ts_dict.items():
        for ts in ts_day:
            if ts_type == 'BL_DATE' and f'develop-{ts}' not in bl_ts:
                bl_ts.append(f'develop-{ts}')
            elif ts_type == 'INPUTDATA_ROOT' and f'input-data-{ts}' not in input_ts:
                input_ts.append(f'input-data-{ts}')
            elif ts_type == 'INPUTDATA_ROOT_BMIC' and f'BM_IC-{ts}' not in bmic_ts:
                bmic_ts.append(f'BM_IC-{ts}')
            elif ts_type == 'INPUTDATA_ROOT_WW3' and f'WW3_input_data_{ts}' not in ww3_input_ts:
                ww3_input_ts.append(f'WW3_input_data_{ts}')
    data_fldrs_dict[retrieval_date] = input_ts, bl_ts, ww3_input_ts, bmic_ts
data_fldrs_dict
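
To make the mapping easier to follow, here is a compact sketch of the same idea on a small sample. The prefixes come from the cell above; the `PREFIXES` table, the `folder_names` helper, the string retrieval-date key, and the sample timestamps are hypothetical, and the nested layout `{retrieval_date: {ts_type: [timestamps]}}` is inferred from how the loop reads `data_log_dict`.

```python
# Prefix per timestamp type, as used in the mapping cell.
PREFIXES = {
    "BL_DATE": "develop-",
    "INPUTDATA_ROOT": "input-data-",
    "INPUTDATA_ROOT_BMIC": "BM_IC-",
    "INPUTDATA_ROOT_WW3": "WW3_input_data_",
}


def folder_names(ts_dict):
    """Return {ts_type: de-duplicated folder names} for one retrieval date."""
    names = {ts_type: [] for ts_type in PREFIXES}
    for ts_type, timestamps in ts_dict.items():
        prefix = PREFIXES.get(ts_type)
        if prefix is None:
            continue  # Ignore timestamp types without a known prefix.
        for ts in timestamps:
            name = f"{prefix}{ts}"
            if name not in names[ts_type]:
                names[ts_type].append(name)
    return names


# Hypothetical log entry, for illustration only.
sample_log = {
    "2023-01-15": {
        "BL_DATE": ["20230112", "20230112"],
        "INPUTDATA_ROOT": ["20221101"],
        "INPUTDATA_ROOT_BMIC": ["20220207"],
        "INPUTDATA_ROOT_WW3": ["20220624"],
    }
}
result = {day: folder_names(ts) for day, ts in sample_log.items()}
```

Keeping the prefixes in one table means a new dataset type needs only one extra entry, at the cost of returning a dict per retrieval date rather than the tuple used above.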

In [None]:
# Latest retrieval date.
input_ts, bl_ts, ww3_input_ts, bmic_ts = data_fldrs_dict[max(data_fldrs_dict)]
input_ts, bl_ts, ww3_input_ts, bmic_ts 

#### Findings:
- BL dataset timestamps do not necessarily match the date on which the PR was approved; approval can take an unknown number of days.

- A baseline change label does not necessarily mean the date was actually changed.

- 'BM_IC': IC folder's prefix

- 'develop': Baseline folder's prefix

- 'input-data': Input folder's prefix

- 'WW3_input_data': WW3 Input folder's prefix

#### Suggestion:
- Why not name the datasets after the date on which they were approved? A script could then collect them by PR approval date, assuming the baseline GitHub labels are applied correctly.


In [None]:
'BM_IC' # IC folder's prefix
'develop' # Baseline folder's prefix
'input-data' # Input folder's prefix
'WW3_input_data' # WW3 Input folder's prefix