# Hakai Profile QA/QC Development Tool
This Jupyter notebook is a flexible tool for testing and improving Hakai's QA/QC workflow for its CTD profile data.

The tool can load Hakai's CTD dataset and apply the default tests currently in use. The user can modify those tests to try different thresholds. Other tests can be added too!

## Let's load all the python packages we need
This may take some time the very first time. Some of the packages are available through PyPI while others are not. We also load the hakai_qc main branch here.

We also install the ioos_qc tool from the Hakai GitHub fork and its add-density-inversion-test branch. This will likely change in the future as new tests get integrated into the standard ioos_qc package.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import json


# Get External packages
try:
    from hakai_api import Client
    from ioos_qc.config import QcConfig
    import hakai_qc
except ImportError:
    # Install Hakai API Python Client
    !pip install git+https://github.com/HakaiInstitute/hakai-api-client-python.git
    from hakai_api import Client

    # Install ioos_qc
    !pip install git+https://github.com/HakaiInstitute/ioos_qc@colab-compatible
    from ioos_qc.config import QcConfig
    
    # Load local modules
    #!pip install git+https://github.com/HakaiInstitute/hakai-profile-qaqc.git
    import hakai_qc
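
Before running the install cell, it can help to see which packages are actually missing. The snippet below is a minimal sketch that relies only on the standard library; the helper name `missing_packages` is hypothetical and not part of hakai_qc, and the list of names simply mirrors the imports used in this notebook.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Top-level import names used by this notebook
required = ["pandas", "seaborn", "matplotlib", "hakai_api", "ioos_qc", "hakai_qc"]

missing = missing_packages(required)
if missing:
    print("Not installed yet:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Only top-level package names are checked here, since looking up a submodule would import its parent package first.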


## Import data from the Hakai CTD Profile Database and Hakai List of Stations
The Hakai Station Master List is based on a CSV output of the [Hakai Oceanography Master Stations Map and Data](https://hakai.maps.arcgis.com/apps/webappviewer/index.html?id=38e1b1da8d16466bbe5d7c7a713d2678). Missing sites should be added to the master list so that all the different tests can be applied.

In [2]:
# Load Hakai Station List
hakai_stations = hakai_qc.get.hakai_stations()

Now, let's get some data from the Hakai CTD Processed Data Database 

In [4]:
# Get Hakai CTD Data Download through the API
station = 'QU39'

variable_lists = hakai_qc.get.hakai_api_selected_variables()

# Let's just get the data from QU39
filterUrl = 'station='+station+'&status!=MISCAST&limit=-1'+'&fields='+','.join(variable_lists)
#filterUrl = 'station=QU39&status!=MISCAST&limit=-1'+fields
df, url = hakai_qc.get.hakai_ctd_data(filterUrl)
print(str(len(df))+' records found')

# Regroup profiles and sort them by pressure
group_variables = ['device_model','device_sn','ctd_file_pk','ctd_cast_pk','direction_flag']
df = df.sort_values(by=group_variables+['pressure'])

# Get Derived Variables
df = hakai_qc.utils.derived_ocean_variables(df)

# Just show the first few lines to have a look
df.head() # Show the top of the data frame

135861 records found


| | ctd_file_pk | ctd_cast_pk | hakai_id | ctd_data_pk | filename | device_model | device_sn | work_area | cruise | station | ... | sos_un | sos_un_flag | backscatter_beta | backscatter_beta_flag | cdom_ppb | cdom_ppb_flag | absolute salinity | conservative temperature | density | sigma0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 117244 | 2745 | 7913 | 080217_2017-01-05T17:32:36.333Z | 9169911 | 080217_20170105_1317 | RBRconcerto | 80217 | QUADRA | QOMB | QU39 | ... | | | | | | | 28.01928 | 6.838949 | 1021.85761 | 21.852974 |
| 117245 | 2745 | 7913 | 080217_2017-01-05T17:32:36.333Z | 9169912 | 080217_20170105_1317 | RBRconcerto | 80217 | QUADRA | QOMB | QU39 | ... | | | | | | | 28.009184 | 6.862007 | 1021.851638 | 21.842369 |
| 117246 | 2745 | 7913 | 080217_2017-01-05T17:32:36.333Z | 9169913 | 080217_20170105_1317 | RBRconcerto | 80217 | QUADRA | QOMB | QU39 | ... | | | | | | | 28.008935 | 6.854664 | 1021.856944 | 21.843039 |
| 117247 | 2745 | 7913 | 080217_2017-01-05T17:32:36.333Z | 9169914 | 080217_20170105_1317 | RBRconcerto | 80217 | QUADRA | QOMB | QU39 | ... | | | | | | | 28.009692 | 6.854777 | 1021.862157 | 21.843617 |
| 117248 | 2745 | 7913 | 080217_2017-01-05T17:32:36.333Z | 9169915 | 080217_20170105_1317 | RBRconcerto | 80217 | QUADRA | QOMB | QU39 | ... | | | | | | | 28.013262 | 6.863834 | 1021.868514 | 21.84534 |
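
The filter string passed to `hakai_qc.get.hakai_ctd_data` in the cell above is built by string concatenation. As a rough sketch, the same string can be assembled from named parts, which makes it easier to switch station or field list. The helper `build_filter_url` is hypothetical (it is not part of hakai_qc) and the field list is only an illustrative subset.

```python
def build_filter_url(station, fields, excluded_status="MISCAST", limit=-1):
    """Assemble the query string for one station (hypothetical helper, not part of hakai_qc)."""
    parts = [
        f"station={station}",
        f"status!={excluded_status}",  # skip casts marked with this status
        f"limit={limit}",              # -1 mirrors the notebook's 'limit=-1'
        "fields=" + ",".join(fields),
    ]
    return "&".join(parts)


# Illustrative subset of fields; the notebook uses hakai_api_selected_variables()
example_fields = ["hakai_id", "station", "pressure", "temperature"]
print(build_filter_url("QU39", example_fields))
```

The result has the same form as the `filterUrl` built in the notebook, so it could be passed to `hakai_qc.get.hakai_ctd_data` in the same way.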


## Test Configuration
We first import the tests that are currently applied to the Hakai dataset. You can also add new tests by adding a dictionary that follows the structure presented below. For more information on the different tests available, have a look at the [ioos_qc webpage](https://ioos.github.io/ioos_qc/api/ioos_qc.html).

In [5]:
# Load default test parameters used right now!
qc_config = hakai_qc.get.json_config('hakai_ctd_profile.json')

# If you want to add or modify some of the tests, do it here
# e.g. let's add a gross range test for fluorescence
qc_config['flc']= {'qartod': {
                        'gross_range_test': {   
                            "suspect_span": [0, 70],
                            "fail_span": [-.5, 100],
                        }
                   }}

target = {'target_range':[1000]}
qc_config['position']['qartod']['location_test'].update(target)

# Show the QC config in a nice looking table
hakai_qc.get.config_as_dataframe(qc_config)

| Variable | Module | Test | Input | Value |
|---|---|---|---|---|
| position | qartod | location_test | bbox | [-180, -90, 180, 90] |
| position | qartod | location_test | target_range | [1000] |
| pressure | qartod | gross_range_test | fail_span | [0, 12000] |
| pressure | qartod | gross_range_test | maximum_fail_depth_ratio | 1.1 |
| pressure | qartod | gross_range_test | maximum_suspect_depth_ratio | 1.05 |
| pressure | qartod | gross_range_test | suspect_span | [0, 12000] |
| depth | qartod | gross_range_test | fail_span | [0, 12000] |
| depth | qartod | gross_range_test | maximum_fail_depth_ratio | 1.1 |
| depth | qartod | gross_range_test | maximum_suspect_depth_ratio | 1.05 |
| depth | qartod | gross_range_test | suspect_span | [0, 12000] |
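
To get a feel for what `fail_span` and `suspect_span` do in a gross range test, here is a small pure-Python illustration. It is a simplified sketch of the QARTOD logic, not the ioos_qc implementation: the function name `gross_range_flags` and the sample values are made up, and missing values are represented by `None` for simplicity.

```python
GOOD, SUSPECT, FAIL, MISSING = 1, 3, 4, 9  # QARTOD flag values


def gross_range_flags(values, fail_span, suspect_span=None):
    """Flag each value against the fail and suspect spans (illustrative only)."""
    flags = []
    for value in values:
        if value is None:
            flags.append(MISSING)
        elif not fail_span[0] <= value <= fail_span[1]:
            flags.append(FAIL)      # outside the widest acceptable range
        elif suspect_span and not suspect_span[0] <= value <= suspect_span[1]:
            flags.append(SUSPECT)   # plausible, but outside the expected range
        else:
            flags.append(GOOD)
    return flags


# Made-up fluorescence values checked against the spans used for 'flc' above
fluorescence = [-1.0, 0.2, 35.0, 85.0, 120.0, None]
print(gross_range_flags(fluorescence, fail_span=[-0.5, 100], suspect_span=[0, 70]))
```

Widening or narrowing either span in `qc_config` shifts values between these flag classes, which is the kind of threshold experiment this notebook is meant for.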


## Run Test on Data
All the tests listed above are applied station by station, one profile at a time.


In [None]:
# Run all the tests on each available profile
df = hakai_qc.run.tests_on_profiles(df,hakai_stations,qc_config)

## Review Results
### Profile location versus target location (station)
This section reviews the latitude/longitude position recorded for each drop and its distance from the target station.


In [None]:
# Give me all the drops that had their position flagged because it's not within range or invalid
#  ignore rows where a depth value does not exist.
df[df['position_qartod_location_test']>1].dropna(
    axis=0,subset=['depth']).groupby(
    'hakai_id').first()[['position_qartod_location_test','station','latitude','longitude','measurement_dt']]

In [None]:
# Show me them on a map
m = hakai_qc.get.flag_result_map(df.dropna(axis=0,subset=['latitude','longitude','depth']),
                                 flag_variable='position_qartod_location_test')
m
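
The location test compares where a drop was recorded with where the station is supposed to be. The sketch below shows one common way to compute that distance, the haversine formula. It makes two assumptions: that `target_range` is expressed in metres (the configuration above does not state the unit), and that a spherical Earth is accurate enough. The function names and coordinates are made up for illustration and are not taken from hakai_qc or the station list.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def within_target_range(cast, station, target_range_m=1000):
    """True if the cast position lies within target_range_m of the station (assumed metres)."""
    return haversine_m(cast[0], cast[1], station[0], station[1]) <= target_range_m


# Made-up coordinates, for illustration only
station_position = (50.0, -125.0)
print(within_target_range((50.001, -125.0), station_position))  # roughly 110 m away
print(within_target_range((50.05, -125.0), station_position))   # roughly 5.5 km away
```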

### Profile test flags
Let's filter all the data that actually got flagged and keep only the downcast.

In [None]:
# Get the variables to plot. The first line lists every variable that is tested;
# the second (commented out) overrides it if you only want to look at one or a few variables.
variables_to_plot = set(qc_config.keys())-{'position','depth','pressure','sigma0'}                    
#variables_to_plot = ["dissolved_oxygen_ml_l","temperature"]

# Review Flagged data (let's look at only the downcast)
flag_columns = [var+'_qartod_flag' for var in variables_to_plot]
flagged_hakai_id = df.where(df['direction_flag']=='d')[((df.filter(items=flag_columns)>1) 
                       & (df.filter(items=flag_columns)!=9)).any(axis=1)]['hakai_id'].dropna().unique()              

# Tell me how many there are
print(str(len(flagged_hakai_id))+' profiles were flagged')
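
The pandas expression above packs several conditions into one line. As a reading aid, the same selection rule is written out below in plain Python on a few made-up rows: keep downcast rows only, and report a profile as soon as any of its flag columns is greater than 1 (not good) and different from 9 (missing). The function name and sample data are hypothetical.

```python
def flagged_profile_ids(rows, flag_columns, direction="d"):
    """Return the hakai_id of every profile with at least one flagged row in the given direction."""
    flagged = []
    for row in rows:
        if row["direction_flag"] != direction:
            continue  # only look at the requested cast direction
        flags = [row.get(column) for column in flag_columns]
        if any(f is not None and f > 1 and f != 9 for f in flags):
            if row["hakai_id"] not in flagged:
                flagged.append(row["hakai_id"])
    return flagged


# Made-up rows: only profile B has a flagged value on the downcast
rows = [
    {"hakai_id": "A", "direction_flag": "d", "temperature_qartod_flag": 1, "salinity_qartod_flag": 1},
    {"hakai_id": "B", "direction_flag": "d", "temperature_qartod_flag": 3, "salinity_qartod_flag": 1},
    {"hakai_id": "C", "direction_flag": "u", "temperature_qartod_flag": 4, "salinity_qartod_flag": 1},
    {"hakai_id": "D", "direction_flag": "d", "temperature_qartod_flag": 9, "salinity_qartod_flag": 1},
]
columns = ["temperature_qartod_flag", "salinity_qartod_flag"]
print(flagged_profile_ids(rows, columns))
```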

Show me the results in some figures!

In [None]:
# Now let's plot the flagged data
hakai_qc.get.flag_result_plot(df,variables_to_plot,flagged_hakai_id[:20],flag_type='_qartod_flag')

## Show me one profile at a time
We'll use plotly to do this.

In [None]:
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

#Define flag colors
qartod_color = {1:'green',2:'yellow',3:'orange',4:'red',9:'purple'}

df_hakai_id = df.groupby('hakai_id')
df_iter = iter(df_hakai_id)

A new profile is shown every time you run the cell below.

In [None]:
# Iterate one Hakai ID at a time
id,df_temp = next(df_iter)

# Sort them by direction and depth (reassign rather than sorting the group in place)
df_temp = df_temp.sort_values(['direction_flag','depth'])
vars = list(variables_to_plot)
#vars = ['salinity','temperature']

#Create Subplots
fig = make_subplots(rows=1,cols=len(vars), shared_yaxes=True)

kk=1
for var in vars:
    for direction_flag in ['d','u']:
        for flag,color in qartod_color.items():
            df_flag = df_temp[(df_temp[var+'_qartod_flag']==flag) & (df_temp['direction_flag']==direction_flag)]

            if len(df_flag):
                
                if direction_flag == 'u':
                    marker_dict = dict(color=color,line=dict(color='black',width=.5))
                else:
                    marker_dict = dict(color=color,line=dict(color='white',width=.5))
                    
                fig.add_trace(
                go.Scatter(x=df_flag[var],
                           y=df_flag['depth'],
                           mode='markers',
                           marker=marker_dict,# df_temp[var+'_qartod_flag'],
                           text=df_flag[var+'_flag_description']),
                    row=1,col=kk)
    
    # Add a new line character to x titles every two plots to make x titles more readable
    if (kk % 2)== 0:
        title_x = ' <br>'+var
    else:
        title_x = var
        
    if var in ['par']: # Make PAR x axis log
        fig.update_xaxes(type="log",row=1,col=kk)
        
    fig.update_xaxes(title=title_x, row=1, col=kk)
    kk=kk+1

# Add axis titles, frames and ticks around each subplot
fig.update_yaxes(title_text="Depth (m)",row=1,col=1)
fig.update_yaxes(autorange="reversed",linecolor='black',mirror=True,ticks='outside',showline=True)
fig.update_xaxes(mirror=True,ticks='outside',showline=True,tickangle=45,linecolor='black')
fig.update_layout(height=600, width=1000,showlegend=False)
print(id)
fig.show()
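
Stepping through profiles with `next()` raises `StopIteration` once every profile has been shown. The sketch below illustrates the same stepping pattern on plain Python records, using `next(iterator, None)` to detect the end instead of raising. The helper names and sample records are hypothetical; iterating over a pandas `groupby` object yields `(key, group)` pairs in a similar way.

```python
from itertools import groupby
from operator import itemgetter


def iter_profiles(records):
    """Yield (hakai_id, rows) pairs, one per profile, from a list of row dictionaries."""
    by_id = itemgetter("hakai_id")
    return groupby(sorted(records, key=by_id), key=by_id)


def next_profile(iterator):
    """Return (hakai_id, rows) for the next profile, or None when none are left."""
    item = next(iterator, None)  # the default avoids a StopIteration error
    if item is None:
        return None
    profile_id, rows = item
    return profile_id, list(rows)


# Made-up records: two rows for profile A, one for profile B
records = [
    {"hakai_id": "A", "depth": 1.0},
    {"hakai_id": "B", "depth": 1.0},
    {"hakai_id": "A", "depth": 2.0},
]

profiles = iter_profiles(records)
profile = next_profile(profiles)
while profile is not None:
    profile_id, rows = profile
    print(profile_id, len(rows))
    profile = next_profile(profiles)
```

In the notebook, re-running the cell that creates `df_iter` restarts the sequence from the first profile.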

In [None]:
# Look at one variable at a time
var = 'salinity'

## Get Figure
fig = px.scatter(df_temp,x=var,y='depth',color=var+'_qartod_flag',hover_name=var+'_flag_description',
                color_discrete_map=qartod_color,symbol='direction_flag')
#fig.update_xaxes(type='log')
fig.update_yaxes(autorange="reversed",linecolor='black',mirror=True,ticks='outside',showline=True,title_text="Depth (m)",)
fig.update_xaxes(mirror=True,ticks='outside',showline=True,tickangle=45,linecolor='black')
fig.update_layout(height=600, width=1000)
fig.show()