Motivation: individual gross outliers from general station distribution are a common error in obs data by random recording, reporting, formatting, or instrumentation errors

Process:
1. uses individual observation deviations derived from monthly mean climatology calculated for each hour of the day
2. climatologies calculated using winsorised data to remove initial effect of outliers
    - Winsorising: all values beyond threhsold value from mean are set to that threshold value
    - 5 and 95% for hadisd
    - number of data values in population remains the same, not trimmed
3. raw unwinsorised observations are anomalised using these climatologies
4. standardized by IQR for that month and hour
    - IQR cannot be less than 1.5degC
5. values are low-pass filtered to remove any climate change signal causing overzealous removal at ends of time series
6. gaussian is fitted to the histogram of anomalies for each month
7. threshold value, rounded outwards where crosses y=0.1 line
8. distribution beyond threhsold value is scanned for gap, equal to bin width or more
9. all values beyond gap are flagged
10. obs that fall between critical threshold value and gap or critical threshold and end of distribution are tentatively flagged
    - these may be later reinstated on comparison with good data from neighboring stations

Notes:
- when applied to SLP, frequently flags storm signals, which may be of high interest, so this test is not applied to pressure data
- hadisd only applies to temp and dewpoint temp

In [2]:
import pandas as pd
import numpy as np
import xarray as xr



In [3]:
ds = xr.open_dataset('/Users/victoriaford/Desktop/Train_Files/CAHYDRO_BROC1.nc')

df = ds.to_dataframe()
df = df.reset_index()
df['month'] = pd.to_datetime(df['time']).dt.month # sets month to new variable
df['year'] = pd.to_datetime(df['time']).dt.year # sets year to new variable

Unnamed: 0_level_0,Unnamed: 1_level_0,tas,pr,tas_qc,elevation,lat,lon
station,time,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
CAHYDRO_BROC1,2010-05-30 15:30:00,300.940,,,549.8592,33.2314,-116.4144
CAHYDRO_BROC1,2010-05-31 15:20:00,301.490,,,549.8592,33.2314,-116.4144
CAHYDRO_BROC1,2010-06-05 15:20:00,303.160,,,549.8592,33.2314,-116.4144
CAHYDRO_BROC1,2010-06-06 15:25:00,304.830,,,549.8592,33.2314,-116.4144
CAHYDRO_BROC1,2010-06-12 15:25:00,294.830,,,549.8592,33.2314,-116.4144
CAHYDRO_BROC1,...,...,...,...,...,...,...
CAHYDRO_BROC1,2022-07-31 23:50:00,311.494,,,549.8592,33.2314,-116.4144
CAHYDRO_BROC1,2022-08-06 23:20:00,315.383,,,549.8592,33.2314,-116.4144
CAHYDRO_BROC1,2022-08-08 00:40:00,312.050,,,549.8592,33.2314,-116.4144
CAHYDRO_BROC1,2022-08-21 00:20:00,310.939,,,549.8592,33.2314,-116.4144


In [8]:
df_jan = df.loc[df.time.dt.month == 1]
df_jan

Unnamed: 0,station,time,tas,pr,tas_qc,elevation,lat,lon,month,year
113,CAHYDRO_BROC1,2011-01-01 15:40:00,275.940,,,549.8592,33.2314,-116.4144,1,2011
114,CAHYDRO_BROC1,2011-01-02 15:30:00,273.720,,,549.8592,33.2314,-116.4144,1,2011
115,CAHYDRO_BROC1,2011-01-04 16:00:00,282.050,,,549.8592,33.2314,-116.4144,1,2011
116,CAHYDRO_BROC1,2011-01-06 16:00:00,284.830,,,549.8592,33.2314,-116.4144,1,2011
117,CAHYDRO_BROC1,2011-01-07 16:00:00,287.050,,,549.8592,33.2314,-116.4144,1,2011
...,...,...,...,...,...,...,...,...,...,...
2825,CAHYDRO_BROC1,2022-01-27 00:20:00,292.050,,,549.8592,33.2314,-116.4144,1,2022
2826,CAHYDRO_BROC1,2022-01-28 00:30:00,292.606,,,549.8592,33.2314,-116.4144,1,2022
2827,CAHYDRO_BROC1,2022-01-29 00:10:00,292.050,,,549.8592,33.2314,-116.4144,1,2022
2828,CAHYDRO_BROC1,2022-01-30 00:20:00,292.050,,,549.8592,33.2314,-116.4144,1,2022


In [None]:
def clim_mon_mean_hourly(df, var):
    """
    Calculate individual observation anomalies derived from 
    monthly mean climatology 
    calculated for each hour of the day
    """


In [None]:
def qaqc_climatological_outlier(df, plot=True, verbose=True):
    '''
    Flags individual gross outliers from climatological distribution.
    Only applied to air temperature and dew point temperature
    
    Input:
    ------
        df [pd.DataFrame]: station dataset converted to dataframe through QAQC pipeline
        plots [bool]: if True, produces plots of any flagged data and saved to AWS
            
    Returns:
    --------
        qaqc success:
            df [pd.DataFrame]: QAQC dataframe with flagged values (see below for flag meaning)
        qaqc failure:
            None
            
    Flag meaning:
    -------------
        25,qaqc_climatological_outlier,Value flagged as a climatological outlier
        26,qaqc_climatological_outlier,Value flagged as a tentative climatological outlier. Review in neighboring stations check.
    '''
    
    vars_to_check = ['tas', 'tdps', 'tdps_derived']
    

In [None]:
def clim_anom(df, var):
    '''raw unwinsorised observations are anomalised using these climatologies'''
    
    

In [54]:
def winsorise_data(df, var, percent=0.05):
    '''
    Winsorising: all values beyond a threshold value from the mean are set to that threshold value
        - Removes initial effect of outliers
        - HadISD uses 5% and 95%
        - Result: Population size remains the same, instead of trimming those observations from data
    '''
    
    # find observations beyond these thresholds and set to the percentile value at that point
    p_low = np.nanpercentile(df[var], percent)
    p_high = np.nanpercentile(df[var], 1-percent)
    print(p_low, p_high)
        
    df.loc[df[var] < p_low, var] = p_low
    df.loc[df[var] > p_high, var] = p_high
            
    return df

In [55]:
winsorise_data(df, 'tas')

277.11578688 281.48999999999995


Unnamed: 0_level_0,Unnamed: 1_level_0,tas,pr,tas_qc,elevation,lat,lon
station,time,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
CAHYDRO_BROC1,2010-05-30 15:30:00,281.49,,,549.8592,33.2314,-116.4144
CAHYDRO_BROC1,2010-05-31 15:20:00,281.49,,,549.8592,33.2314,-116.4144
CAHYDRO_BROC1,2010-06-05 15:20:00,281.49,,,549.8592,33.2314,-116.4144
CAHYDRO_BROC1,2010-06-06 15:25:00,281.49,,,549.8592,33.2314,-116.4144
CAHYDRO_BROC1,2010-06-12 15:25:00,281.49,,,549.8592,33.2314,-116.4144
CAHYDRO_BROC1,...,...,...,...,...,...,...
CAHYDRO_BROC1,2022-07-31 23:50:00,281.49,,,549.8592,33.2314,-116.4144
CAHYDRO_BROC1,2022-08-06 23:20:00,281.49,,,549.8592,33.2314,-116.4144
CAHYDRO_BROC1,2022-08-08 00:40:00,281.49,,,549.8592,33.2314,-116.4144
CAHYDRO_BROC1,2022-08-21 00:20:00,281.49,,,549.8592,33.2314,-116.4144
