## Outlier removal

To remove outliers, load the level CSV file (the text refers to level 0; the cell below reads a level 1 file), setting the DateTime column as the index.

The function loads all the data, plots an individual variable, and saves a CSV of any points that are marked as outliers (True). This outlier CSV can then be used to mask the corresponding values in the original dataframe. An example of this is shown in the second cell.

_Note: No data is changed in the input dataframe._
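The loading step described above can be sketched on a small in-memory CSV. This is only an illustration: the two-row `csv_text` and its column names are made up here, and it shows how `parse_dates=True` together with `index_col=0` changes the index type that pandas produces.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for a level file (illustrative values only).
csv_text = (
    "DateTime,Altitude,T\n"
    "2024-04-02 09:50:18,1.84,15.51\n"
    "2024-04-02 09:50:19,1.17,15.53\n"
)

# Without parse_dates, the index column is kept as plain strings.
df_str = pd.read_csv(io.StringIO(csv_text), index_col=0)

# With parse_dates=True, pandas attempts to parse the index column as datetimes.
df_dt = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(type(df_str.index).__name__)
print(type(df_dt.index).__name__)
```

A datetime index is what allows the flag or outlier CSV to be aligned with the data by timestamp later on.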

In [1]:
import pandas as pd
import os
import numpy as np
from helikite.processing import choose_flags

filefolder = r'C:\Users\calmer\Documents\VAERTICAL\Helikite\data\Villum'

INPUT_DATA_FILENAME = os.path.join(filefolder, "level_1", "20240402A_level_1.csv")
FLAG_FILENAME = os.path.join(os.getcwd(), "flags.csv")
df = pd.read_csv(INPUT_DATA_FILENAME, low_memory=False, parse_dates=True, index_col=0)

# Assign the value "test" in the "Flight_state" column to the selected points
choose_flags(df=df, y="Altitude", flag_file=FLAG_FILENAME, key='Flight_state', value="test")


DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`



In "Flight_state", these values:
test: 2188
step1: 1456
step2: 1336
step4: 1084
step3: 1024
step5: 572
ground: 557



## Mask the original DataFrame

After loading the CSV file with its index set to the DateTime column (`parse_dates=True` lets pandas parse the index as dates instead of plain strings), we can mask any values that are True.
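As a minimal sketch of the masking idea, assume the outlier CSV loads as a boolean table that shares the DateTime index of the data and has one column per checked variable. That layout is an assumption made for illustration, not a confirmed description of the file helikite writes. The names `data`, `outliers`, and `mask` are hypothetical.

```python
import pandas as pd

index = pd.to_datetime(
    ["2024-04-02 09:50:18", "2024-04-02 09:50:19", "2024-04-02 09:50:20"]
)

# Small stand-in for the measurement dataframe (illustrative values only).
data = pd.DataFrame(
    {"T": [15.51, 99.0, 15.535], "RH": [12.24, 12.30, 12.305]}, index=index
)

# Assumed outlier table: True marks a value that should be hidden.
outliers = pd.DataFrame({"T": [False, True, False]}, index=index)

# Align the outlier table to the data; anything not listed counts as "not an outlier".
mask = outliers.reindex(
    index=data.index, columns=data.columns, fill_value=False
).astype(bool)

# DataFrame.mask returns a new frame with NaN wherever the mask is True.
masked = data.mask(mask)
print(masked)
```

Because `DataFrame.mask` returns a new object, `data` itself is left untouched, which matches the note that the input dataframe is not changed.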

In [3]:
flags = pd.read_csv(FLAG_FILENAME, index_col=0, parse_dates=True)


flags

14933
10490


Unnamed: 0,instrument_state,Flight_state
2024-04-05 13:22:21,ground,
2024-04-05 13:22:23,ground,
2024-04-05 13:25:26,ground,
2024-04-05 13:28:30,ground,
2024-04-05 13:31:33,ground,
...,...,...
2024-04-02 10:00:40,,test
2024-04-02 10:00:41,,test
2024-04-02 10:00:42,,test
2024-04-02 10:00:43,,test


In [4]:
# Add the additional columns to the original df by their common index
#df = df.merge(flags, left_index=True, right_index=True)
df = df.join(flags,how='left')
df


DateTime,Altitude,T,RH,P,WD,WS,total_POPS_conc,flag_flight,POPS_b3,POPS_b4,...,dN_Bin_Conc55,dN_Bin_Conc56,dN_Bin_Conc57,dN_Bin_Conc58,dN_Bin_Conc59,dN_Bin_Conc60,dN_totalconc,ave3min,instrument_state,Flight_state
2024-04-02 09:50:18,1.840298,15.510,12.240,1015.04,277.0,0.0,15.070510,0,15.0,15.0,...,4.816762,2.164083,3.021214,2.818109,2.44177,2.874484,68.351245,2024-04-02 09:47:29,,ground
2024-04-02 09:50:19,1.169826,15.530,12.300,1015.13,278.0,0.0,18.084612,0,18.0,15.0,...,,,,,,,,,,ground
2024-04-02 09:50:20,1.840298,15.535,12.305,1015.04,277.0,0.0,13.396009,0,15.0,9.0,...,,,,,,,,,,ground
2024-04-02 09:50:22,1.244320,15.580,12.305,1015.12,277.0,0.0,19.424213,0,23.0,15.0,...,,,,,,,,,,ground
2024-04-02 09:50:23,1.169826,15.570,12.280,1015.13,277.0,0.0,17.414812,0,18.0,10.0,...,,,,,,,,,,ground
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-04-02 13:10:56,-0.000000,9.225,18.900,1014.90,,,0.000000,0,,,...,,,,,,,,,,ground
2024-04-02 13:10:58,-0.000000,9.300,18.640,1014.90,,,0.000000,0,,,...,,,,,,,,,,ground
2024-04-02 13:11:00,-0.000000,9.370,18.420,1014.90,,,0.000000,0,,,...,,,,,,,,,,ground
2024-04-02 13:11:01,0.670368,9.430,18.270,1014.81,,,0.000000,0,,,...,,,,,,,,,,ground
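The index-based join used in the cell above can be illustrated on toy data. In this sketch `measurements` and `labels` are hypothetical stand-ins for `df` and `flags`. It shows that a left join keeps every row of the left frame and leaves the flag column empty where no flag exists for a timestamp.

```python
import pandas as pd

index = pd.to_datetime(
    ["2024-04-02 09:50:18", "2024-04-02 09:50:19", "2024-04-02 09:50:20"]
)

# Stand-in for the measurement dataframe (illustrative values only).
measurements = pd.DataFrame({"Altitude": [1.84, 1.17, 1.84]}, index=index)

# Stand-in for the flags table: only the first two timestamps carry a flag.
labels = pd.DataFrame({"Flight_state": ["ground", "ground"]}, index=index[:2])

# Left join on the shared DateTime index: all measurement rows are kept.
joined = measurements.join(labels, how="left")
print(joined)
```

Note that `DataFrame.join` raises an error if both frames contain a column with the same name and no suffix is given, so re-running the join on a dataframe that already has the flag columns would need `lsuffix`/`rsuffix` or dropping the old columns first.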
