#| hide
from EnvDL.core import *

# EnvDL

> This project prototypes approaches for fulfilling the aims proposed in the grant "Environmentally Aware Deep Learning Based Genomic Selection And Management Optimization For Maize Yield".

Ultimately, this project will be installable via pip with the command shown under Install. This is not yet implemented.

## Install

```sh
# not implemented
pip install EnvDL
```

## How to use

This project is currently pre-release. Instructions on use will be added in the future.

## Project Structure


The files of primary interest in this project are notebooks, external data, processed data, models, and artifacts from modeling. Additional files are produced by `nbdev` for building documentation, testing, and other tasks. This project is intended to be applicable to multiple organisms. This long-term goal motivates encapsulating the data relevant to a species in a species subfolder, even though only *Zea mays* is considered at present.

Below is an outline to illustrate the project's target structure.

External Data (`data_ext`)
 - Subfolders for different species (Arabidopsis `ath`, Wheat `taes`, and Maize `zma` shown here)
 - Subfolders may contain data from public databases (cyverse, panzea, kegg, etc.) or data from specific studies or projects (g2fc being the Genomes to Fields 2023 competition). 
 - Study data should be named according to the citation (e.g. `buckler_et_al_2009`) rather than the repository that the data is stored in (e.g. figshare, zenodo, etc.). 

Notebooks (`nbs`)
 - Where possible, analysis will be done in Jupyter notebooks.
 - Notebooks follow a naming convention inspired by the [Johnny.Decimal](https://johnnydecimal.com/) system.
 - $\color{red}{\text{01}}\color{black}{\text{.}}\color{black}{\text{01}}\color{black}{\text{_g2fc_aggregation.ipynb}}$
     - **Categories** group notebooks that are all related to the same dataset. This could be as broad as _all zma_ but is expected to be more narrowly defined as, say, all processing of data from _Tian et al. 2011_ or a unique combination of studies.
     - **Categories** should aim to be orthogonal (to the extent that is practical). Some functions may be shared, but it would be preferable for these to end up in a notebook in a mutually required category (e.g. `00.00_core.ipynb`). 
 - $\color{black}{\text{01}}\color{black}{\text{.}}\color{red}{\text{01}}\color{black}{\text{_g2fc_aggregation.ipynb}}$
     - **Areas** are ordinal and state a safe run order. A notebook may not require every notebook with a lower area number to be run first, but it should never depend on a notebook with a higher area number.
     - Running all notebooks below an area number should be _sufficient but not necessary_ to reproduce results.
 - $\color{black}{\text{01}}\color{black}{\text{.}}\color{black}{\text{01}}\color{red}{\text{_g2fc_aggregation.ipynb}}$
     - A **description** makes this system navigable by summarizing what the notebook does. It contains some of the same information as the category (that the data scope is g2fc) but in a human-friendly format.
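
The naming convention above can be checked programmatically. The sketch below is not part of EnvDL: `parse_notebook_name` and `safe_run_order` are hypothetical helpers, and the two-digit widths for category and area are an assumption based on the example filenames. Sorting by category and then area is one plausible reading of the safe run order.

```python
import re
from typing import NamedTuple

# Assumed pattern: two-digit category, a dot, two-digit area,
# an underscore, then a free-form description.
NOTEBOOK_NAME = re.compile(
    r"^(?P<category>\d{2})\.(?P<area>\d{2})_(?P<description>.+)\.ipynb$"
)


class NotebookName(NamedTuple):
    category: int
    area: int
    description: str


def parse_notebook_name(filename: str) -> NotebookName:
    """Split a notebook filename into category, area, and description."""
    match = NOTEBOOK_NAME.match(filename)
    if match is None:
        raise ValueError(f"{filename!r} does not follow the naming convention")
    return NotebookName(
        int(match["category"]), int(match["area"]), match["description"]
    )


def safe_run_order(filenames):
    """Sort conforming notebooks by (category, area).

    Files that do not match the pattern (e.g. index.ipynb) are skipped.
    """
    named = [f for f in filenames if NOTEBOOK_NAME.match(f)]
    return sorted(named, key=lambda f: parse_notebook_name(f)[:2])


notebooks = ["index.ipynb", "01.01_g2fc_aggregation.ipynb", "01.00_g2fc_core.ipynb"]
print(parse_notebook_name("01.01_g2fc_aggregation.ipynb"))
print(safe_run_order(notebooks))
```

Converting the category and area to integers keeps the sort numeric, so the order still holds if the zero padding is ever dropped.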
     
Notebook Artifacts (`nbs_artifacts`)
 - Cleaned or otherwise transformed data from `data_ext` should be kept here.
 - Computational artifacts (e.g. pickled objects, figures, models) that are expensive to recompute should also be stored here. 
 - To make the source of the objects clear, directory names match notebook names in `nbs`.
 - Notebooks of a higher category are allowed to draw from notebook artifacts of a lower category.
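
Because artifact directories share their names with notebooks, the output location for a notebook can be derived from its filename. The following is a minimal sketch under that rule; `artifact_dir` and its `project_root` parameter are hypothetical and not part of EnvDL.

```python
from pathlib import Path


def artifact_dir(notebook: str, project_root: str = ".") -> Path:
    """Return the `nbs_artifacts` directory named after a notebook in `nbs`."""
    # Path.stem drops only the final suffix, so
    # "01.01_g2fc_aggregation.ipynb" becomes "01.01_g2fc_aggregation".
    return Path(project_root) / "nbs_artifacts" / Path(notebook).stem


out_dir = artifact_dir("nbs/01.01_g2fc_aggregation.ipynb")
print(out_dir.as_posix())
```

A notebook could call `out_dir.mkdir(parents=True, exist_ok=True)` before writing files such as `phno.csv`, so the directory exists on a fresh checkout.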


Illustrative Directory Structure:
```
.
├── data_ext
│   ├── ath
│   ├── taes
│   └── zma
│       ├── g2fc
│       └── kegg
├── nbs
│   ├── 01.00_g2fc_core.ipynb
│   ├── 01.01_g2fc_aggregation.ipynb
│   ├── ...
│   └── index.ipynb
└── nbs_artifacts
    ├── 01.01_g2fc_aggregation
    │   ├── phno.csv
    │   ├── phno_Envs_miss.csv
    │   └── ...
    └── ...    
```