# **<center>SPARC FAIR Codeathon 2022</center>**
<center>
<a href="https://sparc.science">
<img src="https://sparc.science/_nuxt/img/logo-sparc-wave-primary.8ed83a5.svg" alt="SPARC" width="150"/>
</a>
</center>
<center>
<a href="https://sparc.science/help/2022-sparc-fair-codeathon">
<img src="https://images.ctfassets.net/6bya4tyw8399/2qgsOmFnm7wYIfRrPrqbgx/ae3255858aa12bfcebb52e95c7cacffe/codeathon-graphic.png" alt="FAIR" width="75">
</a>
</center>

## <center>Tutorial 2: Resampling data for simulations</center>


## **Introduction**
Welcome to the second of the Quilted Tutorials! We will be demonstrating different features from the [**SPARC**](https://sparc.science/) project. The goal is to download some **SPARC** datasets and resample them so that they can be used for simulations. Because the data is [**FAIR**](https://www.nature.com/articles/sdata201618), we can combine three different datasets describing the spatial distribution of the vagal afferents and efferents. Here is the workflow for this tutorial:

![workflow](img/workflow.png)

## **Installing the dependencies**
This tutorial relies on several open-source Python packages. We install them first so that the rest of the tutorial can run.

In [None]:
!pip install pandas
!pip install openpyxl
!pip install ipywidgets
!pip install numpy
!pip install numpy-stl
!pip install matplotlib
!pip install ipympl
!pip install scipy
!pip install tqdm
!pip install requests

## **Imports**
Here we import all of the dependencies that we will need to run the code correctly.

In [None]:
import os
import requests
import numpy as np
import pandas as pd
import scipy.stats as st
import matplotlib.pyplot as plt

from stl import mesh as msh
from tqdm import tqdm
from zipfile import ZipFile
from mpl_toolkits import mplot3d
from ipywidgets import interact, fixed

## **Retrieving the data**
Now that all the dependencies have been installed, we will retrieve the data directly from the [**SPARC**](https://sparc.science) project website.
We will be using the following three datasets:
 * [Vagal afferents associated with the myenteric plexus of the rat stomach](https://sparc.science/datasets/10?type=dataset&datasetDetailsTab=files)
 * [Vagal afferents within the longitudinal and circular muscle layers of the rat stomach](https://sparc.science/datasets/11?type=dataset&datasetDetailsTab=files)
 * [Vagal efferents associated with the myenteric plexus of the rat stomach](https://sparc.science/datasets/12?type=dataset&datasetDetailsTab=files)
 
You can search through all of the **SPARC** datasets [here](https://sparc.science/data?type=dataset) or simply click on the links above to be directed to the datasets. 

It is possible to download the entire dataset by clicking on the purple ***Download full dataset*** button in the **Download Dataset** tab or to download specific files by selecting files and folders in the **Dataset Files** tab located at the bottom of the page. 

For this tutorial, we are only interested in the contents of the _derivative_ folder, which contains two .xlsx files: one with the data (IGLE_data.xlsx, IMA_analyzed_data.xlsx, or Efferent_data.xlsx, depending on the dataset) and a manifest (manifest.xlsx). Enter the _derivative_ folder and select the .xlsx file containing the data by ticking the box in front of it. Download the file by clicking the **Download Selected Files and Folders** button at the bottom. You will then be prompted to select the location in which to save it. Save the file from each dataset in the _SPARC-tutorial_ folder.

#### ⚠️  **SPARC Guru tip**: 
Ever hear of Pennsieve? It's the **SPARC** tool to use if you want to avoid downloading the data manually! Check out the documentation for it and try it out.

### **Helper functions**
Now that we have installed and imported the required dependencies, we are going to define some helper functions to retrieve the data.

#### _search\_dataset_
This function will search the **SPARC** datasets for a given keyword.

In [None]:
def search_dataset(query, limit=5):
    """ Searches the SPARC data portal for the given query
    
    Inputs: 
    query -- str, string to search as a keyword in the dataset
    limit -- int, integer limit for the number of results to return, default 5
        
    Outputs:
    rst -- list[dict], one dictionary per returned result, holding only the id, version, name and tags fields
        
    """
    url = "https://api.pennsieve.io/discover/search/datasets"
    # let requests build and URL-encode the query string
    params = {"limit": limit, "offset": 0, "query": query, "orderBy": "relevance", "orderDirection": "desc"}
    headers = {"Accept": "application/json"}
    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()  # stop early on an HTTP error
    rst = []
    for r in response.json()['datasets']:
        rst.append({'id': r['id'], 'version': r['version'], 'name': r['name'], 'tags': r['tags']})
    return rst
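
The search endpoint receives its arguments as URL query parameters. As a minimal sketch of how such a query string can be encoded, the snippet below builds the URL for a multi-word query with the standard library. The helper name `build_search_url` is hypothetical and only for illustration; the endpoint and parameter names are the ones used in _search\_dataset_. Whether the API treats `+` as a space has not been verified here.

```python
from urllib.parse import urlencode

# Base endpoint taken from search_dataset above.
BASE_URL = "https://api.pennsieve.io/discover/search/datasets"

def build_search_url(query, limit=5):
    """Hypothetical helper: return the search URL with encoded parameters."""
    params = {
        "limit": limit,
        "offset": 0,
        "query": query,
        "orderBy": "relevance",
        "orderDirection": "desc",
    }
    # urlencode escapes spaces and other reserved characters in the values.
    return BASE_URL + "?" + urlencode(params)

print(build_search_url("vagal afferent"))
```

No request is sent here; the sketch only shows the URL that would be requested.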

#### _print\_folder\_structure_
This function prints the structure of the dataset directory.

In [None]:
def print_folder_structure(dataId, version, max_level=3): # taken from stackoverflow
    """ Print the directory structure of a dataset to the console output. 
    This assumes that it is saved in the root directory with default filename.
    
    Inputs: 
    dataId -- integer id of the result
    version -- integer dataset version 
    max_level -- integer depth of directory structure to return, default 3
    
    Outputs: 
    None 
    """
    startpath = "Pennsieve-dataset-"+str(dataId)+"-version-"+str(version)
    for root, dirs, files in os.walk(startpath):
        level = root.replace(startpath, '').count(os.sep)
        if level >= max_level:
            dirs[:] = []  # do not descend further, but keep walking sibling folders
            continue
        indent = ' ' * 4 * (level)
        print('{}{}/'.format(indent, os.path.basename(root)))
        subindent = ' ' * 4 * (level + 1)
        for f in files:
            print('{}{}'.format(subindent, f))

#### _get\_dataset_
This function will retrieve the data from the **SPARC** portal.

In [None]:
def get_dataset(dataId, version, dest_dir="."):
    """ Save a dataset from the SPARC data portal using the Pennsieve API.
    
    Inputs: 
    dataId -- integer id of the dataset
    version -- integer dataset version 
    dest_dir -- string directory to save the dataset into. Default is the current directory.
    
    Outputs: 
    None 
    """
    url = "https://api.pennsieve.io/discover/datasets/"+str(dataId)+"/versions/"+str(version)+"/download?"
    # download dataset
    response = requests.get(url, stream=True)
    response.raise_for_status()  # stop early on an HTTP error
    file_zip = "data.zip"
    with open(file_zip, "wb") as data_file:
        for chunk in tqdm(response.iter_content(chunk_size=1024)):
            data_file.write(chunk)
    # unzip dataset into the requested directory
    with ZipFile(file_zip, 'r') as obj:
        obj.extractall(dest_dir)
    # delete temporary zip file
    os.remove(file_zip)
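
The unzip step can be tried without network access. The following is a minimal sketch, assuming a small stand-in archive created on the fly; the helper name `extract_archive` and the file names inside the archive are hypothetical and not part of any **SPARC** dataset.

```python
import os
import tempfile
from zipfile import ZipFile

def extract_archive(zip_path, dest_dir="."):
    """Hypothetical helper: unpack zip_path into dest_dir and list its members."""
    os.makedirs(dest_dir, exist_ok=True)
    with ZipFile(zip_path, "r") as archive:
        archive.extractall(dest_dir)
        return archive.namelist()

# Build a tiny stand-in archive so the sketch runs without downloading anything.
workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, "data.zip")
with ZipFile(zip_path, "w") as archive:
    archive.writestr("demo-dataset/files/derivative/example.txt", "placeholder")

members = extract_archive(zip_path, dest_dir=os.path.join(workdir, "datasets"))
print(members)
```

Extracting into an explicit destination keeps downloaded datasets out of the notebook's working directory.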

In [None]:
# We are looking for vagal datasets, so we query "vagal"; the default limit returns 5 results.
# print() is needed because a notebook only displays the last expression of a cell.
print(search_dataset('vagal'))

# The first three datasets are interesting so download them
get_dataset(dataId=10, version=3)
get_dataset(dataId=11, version=3)
get_dataset(dataId=12, version=3)

# Exploring downloaded dataset. 
# We need the derivative analysis result in derivative folder (we know this because we have inspected the dataset documentation in the manifest and README files)
print_folder_structure(dataId=10, version=3) # Print this one as an example
# print_folder_structure(dataId=11, version=3)
# print_folder_structure(dataId=12, version=3)

# move the required files to the res folder for later use
!mkdir -p res
!mv Pennsieve-dataset-10-version-3/files/derivative/IGLE_data.xlsx res
!mv Pennsieve-dataset-11-version-3/files/derivative/IMA_analyzed_data.xlsx res
!mv Pennsieve-dataset-12-version-3/files/derivative/Efferent_data.xlsx res

## **Loading the 2D data**
### **Helper functions**
Now that we have retrieved the data, we are going to define some helper functions to load it.

#### _get\_position_
This function will allow us to convert the position of the data from a percentage into a distance in mm.

In [None]:
def get_position(percent, min_val, max_val):
    """ Converts the position from percentage to distance.
    
    Inputs:
    percent -- float, percentage value.
    min_val -- float, minimum distance for conversion.
    max_val -- float, maximum distance for conversion.
    
    Outputs:
    converted_value -- float, converted value.
    
    """
    return percent / 100 * (max_val - min_val) + min_val 
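
As a quick check of the conversion, here is a small worked sketch. It repeats _get\_position_ so that it runs on its own and uses the x and y limits that this tutorial defines further down; the percentages are arbitrary illustration values, not dataset entries.

```python
def get_position(percent, min_val, max_val):
    """Linear map from a percentage (0-100) onto the interval [min_val, max_val]."""
    return percent / 100 * (max_val - min_val) + min_val

# Limits used in this tutorial (in mm).
x_lims = [0, 36.7]
y_lims = [24.6, 0]  # reversed: 0 % maps to 24.6 mm and 100 % maps to 0 mm

print(get_position(50, x_lims[0], x_lims[1]))   # midpoint along x
print(get_position(0, y_lims[0], y_lims[1]))    # 0 % along y
print(get_position(100, y_lims[0], y_lims[1]))  # 100 % along y
```

Because the y limits are given in reverse order, increasing percentages map to decreasing distances along y.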

#### _load\_data_
This function will allow us to extract the relevant columns from the data files and store them in a data frame.

In [None]:
def load_data(data_name, col_keeps, x_lims, y_lims):
    """ Loads the data from an .xlsx file.
    
    Inputs:
    data_name -- str, name of the .xlsx file to read.
    col_keeps -- dict{str:str}, dictionary mapping the names of the columns
        to keep to their new, shorter names.
    x_lims -- list[float], limits for the x direction to convert back to mm,
        first element is the minimum and second is the maximum.
    y_lims -- list[float], limits for the y direction to convert back to mm,
        first element is the minimum and second is the maximum.
    
    Outputs:
    df -- DataFrame, data frame containing the desired data.
    
    """
    df = pd.read_excel(data_name)
    # keep only the requested columns and rename them
    df = df[[col for col in df.columns if col in col_keeps]].rename(columns=col_keeps)
    # convert the percentages into distances in mm
    df['y'] = get_position(df['%y'], y_lims[0], y_lims[1])
    df['x'] = get_position(df['%x'], x_lims[0], x_lims[1])
    # complementary percentage along y
    df['-%y'] = 100 - df['%y']
    return df
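
To see the column selection and renaming without opening an .xlsx file, here is a minimal sketch on an in-memory data frame. The column headers are the ones this tutorial passes in `col_keeps`; the rows and the extra `other column` are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration only; these are not values from the datasets.
raw = pd.DataFrame({
    "%x (distance from pylorus side)": [10.0, 50.0],
    "%y (distance from bottom)": [20.0, 80.0],
    "other column": ["a", "b"],
})
col_keeps = {"%x (distance from pylorus side)": "%x",
             "%y (distance from bottom)": "%y"}

# Keep only the wanted columns, then shorten their names.
df = raw[list(col_keeps)].rename(columns=col_keeps)

# Same linear percent-to-mm mapping as get_position.
x_lims, y_lims = [0, 36.7], [24.6, 0]
df["x"] = df["%x"] / 100 * (x_lims[1] - x_lims[0]) + x_lims[0]
df["y"] = df["%y"] / 100 * (y_lims[1] - y_lims[0]) + y_lims[0]
print(df)
```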

In the 2D datasets that we are using, the positions are given as percentages: along the x-axis they are measured from the pyloric side of the stomach, and along the y-axis from the bottom (see the column names below). We are going to convert them into millimeters instead. For this, we define the limits along the x- and y-axes.

In [None]:
# Set up the x and y extents (in mm) based on the scale in the image.
x_lims = [0, 36.7]
y_lims = [24.6, 0]

col_keeps = {'%x (distance from pylorus side)':'%x', '%y (distance from bottom)':'%y'}

We can now load the locations of the nerves into data frames:

In [None]:
igle_df = load_data('res/IGLE_data.xlsx', col_keeps, x_lims, y_lims)
ima_df = load_data('res/IMA_analyzed_data.xlsx', col_keeps, x_lims, y_lims)
efferent_df = load_data('res/Efferent_data.xlsx', col_keeps, x_lims, y_lims) 

## **Preparing the 2D data**
### **Helper functions**
Now that we have loaded the data, we are going to define some helper functions to process it.

#### _prepare\_data_
This function will prepare the data to be plotted by resampling the data points and extracting the density of nerves in 2D space.

In [None]:
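def prepare_data(df):
    """ Prepares the data to be plotted by creating the probability estimates and the sampled points.
    
    Inputs:
    df -- DataFrame, data frame with the 'x' and 'y' positions in mm.
    
    Outputs:
    xx -- ndarray, x coordinates of the 100 x 100 evaluation grid.
    yy -- ndarray, y coordinates of the 100 x 100 evaluation grid.
    prob_estimate -- ndarray, kernel density estimate evaluated on the grid.
    sampled_pts -- ndarray, 1000 points resampled from the estimated density, one (x, y) pair per row.
    
    """
    # drop rows containing missing values
    data_array = df.dropna()

    # Extract x and y
    x = np.array(data_array['x'])
    y = np.array(data_array['y'])

    # Create meshgrid (uses the global x_lims and y_lims)
    xx, yy = np.mgrid[x_lims[0]:x_lims[1]:100j, y_lims[0]:y_lims[1]:100j]

    # Fit a Gaussian kernel density estimate and evaluate it on the grid
    positions = np.vstack([xx.ravel(), yy.ravel()])
    values = np.vstack([x, y])
    kernel = st.gaussian_kde(values)
    prob_estimate = np.reshape(kernel(positions).T, xx.shape)

    # Draw new points from the estimated density
    sampled_pts = kernel.resample(1000).T
    
    return xx, yy, prob_estimate, sampled_pts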

Now that we have loaded our datasets into Python, we are going to prepare the data for plotting. For this, we are going to use our _prepare\_data_ helper function. This will resample the data points using the estimated probability distribution and provide us with the density of points in space.
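
The resampling step can be sketched in isolation as follows. This is a minimal example under stated assumptions: the input points are synthetic (drawn from a normal distribution, not taken from the datasets), and the final filtering of points outside the x and y limits is an optional extra that _prepare\_data_ itself does not perform.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
# Synthetic 2D points standing in for nerve locations (not real data).
points = rng.normal(loc=[18.0, 12.0], scale=[4.0, 3.0], size=(200, 2))

# gaussian_kde expects an array of shape (n_dims, n_points).
kernel = st.gaussian_kde(points.T)

# Draw new points from the estimated density; the seed makes the draw repeatable.
samples = kernel.resample(1000, seed=1).T

# Optionally discard draws that fall outside the imaged region.
x_lims, y_lims = [0, 36.7], [24.6, 0]
inside = ((samples[:, 0] >= min(x_lims)) & (samples[:, 0] <= max(x_lims)) &
          (samples[:, 1] >= min(y_lims)) & (samples[:, 1] <= max(y_lims)))
kept = samples[inside]
print(samples.shape, kept.shape)
```

A kernel density estimate has no hard boundary, so some resampled points can land outside the measured area; filtering them is one simple way to handle this before a simulation.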

In [None]:
efferent_xx, efferent_yy, efferent_est, efferent_pts = prepare_data(efferent_df)
ima_xx, ima_yy, ima_est, ima_pts = prepare_data(ima_df)
igle_xx, igle_yy, igle_est, igle_pts = prepare_data(igle_df)

## **Visualising data**
We are now going to visualise the data in 2D. In the plot, the green points represent the resampled points and the background color represents the density of the data points. Switch between nerve types using the dropdown menu and see how the density changes.

In [None]:
# Enable interactivity in jupyterlab.
%matplotlib widget 

def plotting_fct(efferent_xx, efferent_yy, efferent_est, efferent_pts,
                 ima_xx, ima_yy, ima_est, ima_pts,
                 igle_xx, igle_yy, igle_est, igle_pts,
                 sel):
    
    fig = plt.figure(figsize=(8,8))
    ax = fig.gca()

    ax.set_xlim(x_lims[0], x_lims[1])
    ax.set_ylim(y_lims[1], y_lims[0])

    if sel == 'Efferent':
        cfset = ax.contourf(efferent_xx, efferent_yy, efferent_est, levels=1000,cmap='coolwarm')
        ax.imshow(np.rot90(efferent_est), cmap='coolwarm', extent=[x_lims[0], x_lims[1], y_lims[0], y_lims[1]])
        ax.scatter(efferent_pts[:, 0], efferent_pts[:, 1], s=5, color='g')

    if sel == 'IMA':
        cfset = ax.contourf(ima_xx, ima_yy, ima_est, levels=1000,cmap='coolwarm')
        ax.imshow(np.rot90(ima_est), cmap='coolwarm', extent=[x_lims[0], x_lims[1], y_lims[0], y_lims[1]])        
        ax.scatter(ima_pts[:, 0], ima_pts[:, 1], s=5, color='g')

    if sel == 'IGLE':
        cfset = ax.contourf(igle_xx, igle_yy, igle_est, levels=1000,cmap='coolwarm')
        ax.imshow(np.rot90(igle_est), cmap='coolwarm', extent=[x_lims[0], x_lims[1], y_lims[0], y_lims[1]])        
        ax.scatter(igle_pts[:, 0], igle_pts[:, 1], s=5, color='g')

    ax.set_xlabel('X (mm)')
    ax.set_ylabel('Y (mm)')
    plt.title('Gaussian Kernel density estimation')
    plt.show()

# Observer-style callback kept for reference; it is not used by the interact call below.
def onToggle(btn):
    plotting_fct(efferent_xx=efferent_xx, efferent_yy=efferent_yy, efferent_est=efferent_est, efferent_pts=efferent_pts, 
         ima_xx=ima_xx, ima_yy=ima_yy, ima_est=ima_est, ima_pts=ima_pts, 
         igle_xx=igle_xx, igle_yy=igle_yy, igle_est=igle_est, igle_pts=igle_pts, sel=btn.owner.value)

interact(plotting_fct, efferent_xx=fixed(efferent_xx), efferent_yy=fixed(efferent_yy), efferent_est=fixed(efferent_est), efferent_pts=fixed(efferent_pts), 
         ima_xx=fixed(ima_xx), ima_yy=fixed(ima_yy), ima_est=fixed(ima_est), ima_pts=fixed(ima_pts), 
         igle_xx=fixed(igle_xx), igle_yy=fixed(igle_yy), igle_est=fixed(igle_est), igle_pts=fixed(igle_pts), 
         sel=['Efferent', 'IGLE', 'IMA'])    


## **Congratulations**
You have successfully completed the second Quilted Tutorial and are now well on your way to becoming a **SPARC** Guru! 

We invite you to reuse this tutorial, for example by trying other **SPARC** tools or a different sampling kernel. The resampled data can then be used as input for simulations.
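
As a starting point for such experiments, here is a minimal sketch of changing the smoothing of the density estimate. SciPy's `gaussian_kde` only offers Gaussian kernels, so this example varies the bandwidth through the `bw_method` argument instead of the kernel shape; a different kernel family would need another library. The input points are synthetic, and the value 0.1 is an arbitrary choice for illustration.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
# Synthetic 2D points for illustration only (not SPARC data).
values = rng.normal(size=(2, 300))

default_kde = st.gaussian_kde(values)                # Scott's rule by default
narrow_kde = st.gaussian_kde(values, bw_method=0.1)  # a scalar sets the factor directly

# A smaller factor gives a spikier density, so resampled points stay closer to the originals.
print(default_kde.factor, narrow_kde.factor)
```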