# AI for Earth System Science Hackathon 2020
# Challenge Notebook Template
Author (Institution), Second Author (Institution)

## Introduction
*A relevant picture*

The introduction contains the following elements:
* Scientific goal of the challenge
* Contextual background on the problem
* Short description of existing solutions (if any)
* Why the problem is important
* Impact if solved



## Software Requirements
This notebook requires Python >= 3.7. The following libraries are required:
* numpy
* scipy
* matplotlib
* xarray
* pandas
* scikit-learn
* tensorflow >= 2.1
* netcdf4
* tqdm
* dask
* s3fs


In [1]:
! pip install numpy scipy matplotlib xarray pandas netcdf4 tqdm tensorflow scikit-learn dask s3fs



In [2]:
import numpy as np
import pandas as pd
import xarray as xr
import tensorflow as tf
import matplotlib.pyplot as plt
from dask.distributed import Client, LocalCluster, progress
import s3fs
from glob import glob
import warnings
%matplotlib inline
warnings.simplefilter("ignore", category=FutureWarning)

# Set random seed
seed = 3985
np.random.seed(seed)
tf.random.set_seed(seed)

## Data
The data summary should contain the following pieces of information:
* Data generation procedure (satellite, model, etc.) 
* Link to website containing more information about dataset
* Time span of the dataset
* Geographic coverage of the dataset
* Parameter space coverage (if synthetic)

### Potential Input Variables
| Variable Name | Units | Description | Relevance |
| ------------- | :----:|:----------- | :--------:|
| ABI Band 08   | K     | Upper-level Water Vapor | ¯\\_(ツ)_/¯ |
| ABI Band 09   | K     | Mid-level Water Vapor   |           |
| ABI Band 10   | K     | Lower-level Water Vapor |           |
| ABI Band 14   | K     | Longwave Window         |           |

### Output Variables
| Variable Name | Units | Description |
| ------------- | :----:|:----------- |
| GLM Counts    | -     | Lightning strike count |


### Metadata Variables
| Variable Name | Units | Description |
| ------------- | :----:|:----------- |
| Time     | YYYY-MM-DDTHH:MM:SS  | Observation date and time |
| Lat      | degrees     | Latitude   |
| Lon      | degrees     | Longitude  |


### Training Set
Description of training set time/space/parameter coverage and size

### Validation Set
Description of validation set time/space/parameter coverage and size

### Test Set
Description of test set time/space/parameter coverage and size


In [29]:
# How to load the data from disk or cloud

def split_data_files(dir_path="ncar-aiml-data-commons/goes/ABI_patches_32/", file_prefix='abi_patches_', 
               start_date='20190306', end_date='20190901', seq_len=4, skip_len=1):
    """
    Take daily ABI patch files and split into equal training/validation/testing
    semi-contiguous partitions, skipping day(s) between chunks to isolate convective 
    cycles.
    
    Args: 
        dir_path: (str) Directory path to daily ABI files
        file_prefix: (str) File prefix up to date 
        start_date: (str) Starting date to get files in format of YYYYMMDD
        end_date: (str) Ending date to get files in format of YYYYMMDD
        seq_len: (int) Length of days per 'chunk' of data
        skip_len: (int) How many days to skip between data chunks
        
    Returns:
        train_f, val_f, test_f: list of training/validation/test files
    """
    
    all_files = fs.ls(dir_path)
    start_index = all_files.index('{}{}{}T000000.nc'.format(dir_path, file_prefix, start_date))
    end_index = all_files.index('{}{}{}T000000.nc'.format(dir_path, file_prefix, end_date))
    file_spread = all_files[start_index:end_index+1]
    
    train_files, val_files, test_files = [], [], []
    
    for i in np.arange(0, len(file_spread)+1, (seq_len+skip_len)*3):
        
        val_i = i + seq_len + skip_len
        test_i = i + (seq_len + skip_len)*2
        
        train_files.append(file_spread[i:i+seq_len])
        val_files.append(file_spread[val_i:val_i+seq_len])
        test_files.append(file_spread[test_i:test_i+seq_len])
        
    train_f = [item for sublist in train_files for item in sublist]
    val_f = [item for sublist in val_files for item in sublist]
    test_f = [item for sublist in test_files for item in sublist]
    
    return train_f, val_f, test_f

def fetch_data(file_number, file_list):
    """
    Function to be distributed across a cluster to individually load files directly from an AWS S3 bucket 
    
    Args:
        file_number: index for file from file_list
        file_list: List of files to index from
    Returns:
        ds: xarray dataset of daily file 
    """
    obj = fs.open(file_list[file_number])
    ds = xr.open_dataset(obj, chunks={})
    
    return ds

def merge_data(file_list):
    """
    Take a list of files, load them in parallel across a cluster, then gather and concatenate the results
    
    Args:
        file_list: List of files to be merged together (training, validation, or testing)
    Returns:
        merged_data: Concatenated xarray dataset of training, validation, or testing data 
    """
    futures = client.map(fetch_data, range(len(file_list)), [file_list]*len(file_list))
    results = client.gather(futures)
    merged_data = xr.concat(results, 'patch').compute()
    
    return merged_data
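
As a rough illustration of the chunking performed by `split_data_files`, the sketch below applies the same index arithmetic to a plain list of integers standing in for daily files. The names `split_sequence` and `days` are hypothetical and exist only for this example; it does not touch S3 or the real file listing.

```python
def split_sequence(items, seq_len=4, skip_len=1):
    """Interleave train/val/test chunks of seq_len items,
    leaving skip_len unused items after every chunk."""
    block = seq_len + skip_len
    train, val, test = [], [], []
    # Each pass consumes three blocks: one each for train, val, and test
    for i in range(0, len(items), block * 3):
        train.extend(items[i:i + seq_len])
        val.extend(items[i + block:i + block + seq_len])
        test.extend(items[i + 2 * block:i + 2 * block + seq_len])
    return train, val, test

days = list(range(30))  # stand-in for 30 consecutive daily files
train_days, val_days, test_days = split_sequence(days)
print(train_days)  # [0, 1, 2, 3, 15, 16, 17, 18]
print(val_days)    # [5, 6, 7, 8, 20, 21, 22, 23]
print(test_days)   # [10, 11, 12, 13, 25, 26, 27, 28]
```

With the default `seq_len=4` and `skip_len=1`, days 4, 9, 14, 19, 24, and 29 are left out of every partition, which is what keeps neighboring chunks from sharing a day.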

In [4]:
cluster = LocalCluster(processes=True, threads_per_worker=2)
client = Client(cluster)
fs = s3fs.S3FileSystem(anon=True)

In [25]:
%%time
train_files, val_files, test_files = split_data_files()
train, val, test = map(merge_data, [train_files, val_files, test_files])

CPU times: user 20.1 s, sys: 21.4 s, total: 41.4 s
Wall time: 1min 42s


In [None]:
# Separate input, output and meta data


In [None]:
# Split into training, validation, and test sets


In [None]:
# Exploratory visualizations of data


### Data Transforms
Discuss any transforms or normalizations that may be needed for this dataset. Remember to fit a scaler only to the training data and then apply that fitted scaler to the validation and test data.
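
A minimal sketch of the fit-on-training-only rule, using NumPy and synthetic data. The arrays, their shapes, and the `standardize` helper are assumptions made for illustration, not the actual ABI patches; scikit-learn's `StandardScaler` follows the same fit-then-transform pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins: rows are samples, columns are four input channels
train_x = rng.normal(250.0, 15.0, size=(100, 4))
val_x = rng.normal(250.0, 15.0, size=(20, 4))
test_x = rng.normal(250.0, 15.0, size=(20, 4))

# "Fit": compute the statistics from the training data only
train_mean = train_x.mean(axis=0)
train_std = train_x.std(axis=0)

def standardize(x, mean, std):
    """Apply z-score scaling with previously fitted statistics."""
    return (x - mean) / std

# "Transform": reuse the training statistics for every partition
train_scaled = standardize(train_x, train_mean, train_std)
val_scaled = standardize(val_x, train_mean, train_std)
test_scaled = standardize(test_x, train_mean, train_std)
```

Only `train_scaled` is guaranteed to have zero mean and unit variance per channel. The validation and test partitions will be close but not exact, which is expected because their statistics were never used.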

In [None]:
# Example of data transform procedure for dataset

In [None]:
# Visual of input variable before and after transform

## Baseline Machine Learning Model
Description of baseline ML approach should include:
* Choice of ML software
* Type of ML model
* Hyperparameter choices and justification


In [None]:
# Baseline ML model initialization code goes here


## Metrics
Description of the different metrics used to assess performance on the challenge:
* Correctness Metric: how close are the predictions to the truth (e.g., RMSE or AUC) 
* Training time
* Inference time
* Model complexity
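
The sketch below shows one way the correctness and timing metrics could be computed. The helper names `rmse` and `timed` and the toy arrays are assumptions for illustration; a real challenge would substitute its own predictions and model calls.

```python
import time
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally shaped arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Toy example: squared errors are 1, 0, 1, 4, so the mean is 1.5
y_true = np.array([0.0, 2.0, 4.0, 6.0])
y_pred = np.array([1.0, 2.0, 3.0, 8.0])
score, elapsed = timed(rmse, y_true, y_pred)
print(f"RMSE: {score:.4f}, computed in {elapsed:.6f} s")
```

The same `timed` wrapper can be placed around a model's training and prediction calls to record training time and inference time separately.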

In [None]:
# Metric functions 

## Interpretation
Description of interpretation methods for the problem

In [None]:
# Include examples of interpretation code

## Hackathon Challenges

### Monday
* Load the data
* Create an exploratory visualization of the data
* Test two different transformation and scaling methods
* Test one dimensionality reduction method
* Train a linear model
* Train a decision tree ensemble method of your choice

In [None]:
# Monday's code goes here


### Tuesday
* Train a densely connected neural network
* Train a convolutional or recurrent neural network (depends on problem)
* Experiment with different architectures

In [None]:
# Tuesday's code goes here


### Wednesday
* Calculate three relevant evaluation metrics for each ML solution and baseline
* Refine machine learning approaches and test additional hyperparameter settings

In [None]:
# Wednesday's code goes here


### Thursday 
* Evaluate two interpretation methods for your machine learning solution
* Compare interpretation of baseline with your approach
* Submit best results on project to leaderboard
* Prepare two Google Slides on your team's approach and submit them

In [4]:
# Thursday's code goes here


## Ultimate Submission Code
Please insert your full data processing and machine learning pipeline code in the cell below.