# Hakai Profile QA/QC Development Tool
This Jupyter notebook is a flexible tool for testing and improving Hakai's QA/QC workflow for its CTD profile data.

The tool loads Hakai's CTD dataset and applies the default tests currently in use. You can modify those tests to try different thresholds, and you can add other tests too!

## Let's load all the Python packages we need
Some of the packages are available through PyPI while others aren't. We also load the main branch of hakai_qc here.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import json

# Install ioos_qc from a local clone (machine-specific path; adjust it to your own setup)
!pip install C:\Users\jessy\Documents\repositories\ioos_qc
from ioos_qc.config import QcConfig

# Get External packages
try:
    from hakai_api import Client
    from ioos_qc.config import QcConfig
    import hakai_qc
except ImportError:
    # Install Hakai API Python Client
    !pip install git+https://github.com/HakaiInstitute/hakai-api-client-python.git
    from hakai_api import Client

    # Install ioos_qc
    #!pip install git+https://github.com/HakaiInstitute/ioos_qc
    #from ioos_qc.config import QcConfig
    
    # Load local modules
    !pip install git+https://github.com/HakaiInstitute/hakai-profile-qaqc.git
    import hakai_qc


## Import data from the Hakai CTD Profile Database and Hakai List of Stations
The Hakai Station Master List is based on a CSV output of the [Hakai Oceanography Master Stations Map and Data](https://hakai.maps.arcgis.com/apps/webappviewer/index.html?id=38e1b1da8d16466bbe5d7c7a713d2678). Missing sites should be added to the master list so that all the different tests can be applied to them.

In [None]:
# Load Hakai Station List
hakai_stations = hakai_qc.get.hakai_stations()

Now, let's get some data from the Hakai CTD Processed Data Database.

In [None]:
# Download Hakai CTD data through the API
# Let's just get the data from station QU39, excluding miscasts
filterUrl = 'station=QU39&status!=MISCAST&limit=20000'
df, url = hakai_qc.get.hakai_ctd_data(filterUrl)
print(f'{len(df)} records found')

# Regroup profiles and sort them by pressure
group_variables = ['device_model','device_sn','ctd_file_pk','ctd_cast_pk','direction_flag']
df = df.sort_values(by=group_variables+['pressure'])

# Get Derived Variables
df = hakai_qc.utils.derived_ocean_variables(df)

# Just show the first few lines to have a look
df.head() # Show the top of the data frame
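
The filter string above is a plain `&`-separated list of conditions. As a minimal sketch, the helper below assembles the same string from Python dictionaries. `build_filter` is a hypothetical name that is not part of hakai_qc or the Hakai API client, and it only covers the two operators that appear in this notebook (`=` and `!=`).

```python
def build_filter(equals, not_equals=None, limit=20000):
    """Assemble a filter string such as 'station=QU39&status!=MISCAST&limit=20000'.

    Hypothetical helper for illustration only: it joins key/value pairs
    with the '=' and '!=' operators used in this notebook.
    """
    parts = [f"{key}={value}" for key, value in equals.items()]
    parts += [f"{key}!={value}" for key, value in (not_equals or {}).items()]
    parts.append(f"limit={limit}")
    return "&".join(parts)


# Rebuild the filter used in the cell above
filterUrl = build_filter({"station": "QU39"}, {"status": "MISCAST"})
print(filterUrl)
```

Changing the station then becomes a one-word edit, which is handy when the same tests need to be tried on several stations.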

## Test Configuration
We first import the tests that are currently applied to the Hakai dataset. You can also add new tests by adding a dictionary that follows the structure presented below. For more information on the different tests available, have a look at the [ioos_qc webpage](https://ioos.github.io/ioos_qc/api/ioos_qc.html).

In [None]:
# Load default test parameters used right now!
qc_config = hakai_qc.get.json_config('hakai_ctd_profile.json')

# To add or modify tests, do it here.
# Example: add a gross range test for fluorescence (flc)
qc_config['flc'] = {
    'qartod': {
        'gross_range_test': {
            'suspect_span': [0, 70],
            'fail_span': [-0.5, 100],
        },
        'aggregate': {},
    }
}
# Show the QC config in a nice looking table
hakai_qc.get.config_as_dataframe(qc_config)
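
To get a feel for what `suspect_span` and `fail_span` do before running the real tests, here is a dependency-free sketch of gross range logic. It is not the ioos_qc implementation: `gross_range_flags` is a hypothetical function. It assumes the usual QARTOD flag values (1 good, 3 suspect, 4 fail, 9 missing) and assumes that values exactly on a span boundary count as inside the span.

```python
import math

# QARTOD flag values assumed by this sketch
GOOD, SUSPECT, FAIL, MISSING = 1, 3, 4, 9


def gross_range_flags(values, suspect_span, fail_span):
    """Flag each value against a suspect span and a fail span.

    Illustrative only: boundaries are treated as inside the span.
    """
    flags = []
    for value in values:
        if value is None or math.isnan(value):
            flags.append(MISSING)
        elif value < fail_span[0] or value > fail_span[1]:
            flags.append(FAIL)
        elif value < suspect_span[0] or value > suspect_span[1]:
            flags.append(SUSPECT)
        else:
            flags.append(GOOD)
    return flags


# Try the fluorescence thresholds from the cell above on a few made-up values
example_values = [-1, -0.2, 0, 35, 85, 150, None]
example_flags = gross_range_flags(example_values, suspect_span=[0, 70], fail_span=[-0.5, 100])
print(list(zip(example_values, example_flags)))
```

Running made-up values through a sketch like this is a quick way to check that a new pair of thresholds behaves as intended before applying them to the whole dataset.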

## Run Test on Data
All the tests listed above are applied to each station and each profile, one at a time.


In [None]:
# Run all the tests on each available profile
df = hakai_qc.run.tests_on_profiles(df,hakai_stations,qc_config)

## Review Results
Let's find the profiles in which at least one value got flagged, that is, received an aggregate flag other than good (1) or missing (9).

In [None]:
# Variables to plot: the first line lists every variable that is tested;
# the second overrides it if you want to look into one or a few specific variables.
variables_to_plot = list(set(qc_config.keys())-set(['position','depth','pressure']))
variables_to_plot = ["sigma0","salinity","dissolved_oxygen_ml_l","par","flc"]

# Review Flagged data
flag_columns = [var+'_qartod_aggregate' for var in variables_to_plot]
flagged_hakai_id = df[((df.filter(items=flag_columns)>1) 
                       & (df.filter(items=flag_columns)!=9)).any(axis=1)]['hakai_id'].unique()              

# Report how many profiles were flagged
print(str(len(flagged_hakai_id))+' profiles were flagged')
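
The selection rule used above (a flag greater than 1 and different from 9 in any of the aggregate columns) can be illustrated without pandas. The sketch below applies the same rule to a handful of made-up records. `flagged_profiles` and the `profile_*` identifiers are hypothetical and only serve the example.

```python
def flagged_profiles(records, flag_columns):
    """Return the hakai_id of every profile with at least one flagged record.

    A record counts as flagged when any listed flag column holds a value
    greater than 1 and different from 9, as in the cell above.
    """
    flagged = []
    for record in records:
        is_flagged = any(
            record[column] > 1 and record[column] != 9 for column in flag_columns
        )
        if is_flagged and record["hakai_id"] not in flagged:
            flagged.append(record["hakai_id"])
    return flagged


# Made-up records: one row per measurement, two flag columns
example_records = [
    {"hakai_id": "profile_A", "salinity_qartod_aggregate": 1, "par_qartod_aggregate": 1},
    {"hakai_id": "profile_A", "salinity_qartod_aggregate": 1, "par_qartod_aggregate": 9},
    {"hakai_id": "profile_B", "salinity_qartod_aggregate": 3, "par_qartod_aggregate": 1},
    {"hakai_id": "profile_C", "salinity_qartod_aggregate": 1, "par_qartod_aggregate": 4},
    {"hakai_id": "profile_C", "salinity_qartod_aggregate": 4, "par_qartod_aggregate": 4},
]
example_flagged = flagged_profiles(
    example_records, ["salinity_qartod_aggregate", "par_qartod_aggregate"]
)
print(example_flagged)
```

Note that `profile_A` is not selected: its only non-good flag is 9 (missing), which the rule deliberately ignores.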

Let's show the results in some figures!

In [None]:
# Plot at most the first plot_n_profiles flagged profiles
plot_n_profiles = 20
temp_flagged_hakai_id = flagged_hakai_id[:plot_n_profiles]
    

# Loop through each profile and variable and create a plot
for hakai_id in temp_flagged_hakai_id:
    print(hakai_id)
    # squeeze=False keeps axs two-dimensional even with a single variable
    fig, axs = plt.subplots(1, len(variables_to_plot),
                            sharex=False, sharey=True, squeeze=False)
    fig.set_figwidth(4*len(variables_to_plot))
    fig.set_figheight(10)
    fig.suptitle('Hakai ID: '+hakai_id)

    # Depth increases downward (the y axis is shared by all panels)
    axs[0, 0].invert_yaxis()

    profile = df[df['hakai_id']==hakai_id]
    for ax, variable in zip(axs[0], variables_to_plot):
        sns.scatterplot(data=profile,
                        x=variable, y='depth',
                        hue=variable+'_qartod_aggregate', palette='tab10',
                        style='direction_flag',
                        size=variable+'_qartod_aggregate',
                        linewidth=0, ax=ax, legend='brief')
    plt.subplots_adjust(wspace=0, hspace=0)
variables_to_plot