## Outlier removal

To remove outliers, load the level 0 CSV file and set its index to the DateTime column.

The first cell below loads all the data, then `choose_outliers` plots an individual variable and saves a CSV of any points that are chosen as outliers (marked `True`). This outlier CSV can be used to mask the values in the original DataFrame. An example of this is shown in the second cell.

_Note: No data is changed in the input dataframe._

In [None]:
import pandas as pd
import os
import numpy as np
from helikite.processing import choose_outliers

INPUT_DATA_FILENAME = os.path.join(os.getcwd(), "level0", "20240402A_level_0.csv")
OUTLIER_FILENAME = os.path.join(os.getcwd(), "outliers.csv")
df = pd.read_csv(INPUT_DATA_FILENAME, low_memory=False, parse_dates=True, index_col=0)
choose_outliers(df=df, y="FC_Pressure", outlier_file=OUTLIER_FILENAME)

### Mask the original DataFrame

Load the outlier CSV file, again setting the index to the DateTime column (`parse_dates=True` lets pandas parse the index as dates instead of leaving it as strings). Any values marked `True` can then be masked in the original DataFrame.
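
As a minimal sketch of how the masking behaves, the toy example below builds a small DataFrame and a matching boolean outlier frame by hand. The data values and the `other_var` column are made up for illustration, and the outlier frame is assumed to hold one boolean column per column of the data, which may differ from the file that `choose_outliers` writes.

```python
import numpy as np
import pandas as pd

# Toy data with a DateTime index (values are invented for illustration)
toy_index = pd.to_datetime(
    [
        "2024-04-02 10:00:00",
        "2024-04-02 10:00:01",
        "2024-04-02 10:00:02",
        "2024-04-02 10:00:03",
    ]
)
toy_df = pd.DataFrame(
    {
        "FC_Pressure": [950.0, 949.5, 12.0, 948.9],
        "other_var": [10.0, 10.1, 10.2, 10.3],  # hypothetical second column
    },
    index=toy_index,
)

# Assumed layout of the outlier frame: only the flagged rows,
# with True where a value should be masked
toy_outliers = pd.DataFrame(
    {"FC_Pressure": [True], "other_var": [False]},
    index=toy_index[[2]],
)

# Same masking pattern as in the notebook: True cells become NaN
toy_df.loc[toy_outliers.index] = toy_df.loc[toy_outliers.index].mask(toy_outliers)

n_masked = int(toy_df["FC_Pressure"].isna().sum())
```

Only the cell marked `True` is replaced. The `other_var` value in the same row is kept because its flag is `False`.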

In [None]:
outliers = pd.read_csv(OUTLIER_FILENAME, index_col=0, parse_dates=True)
outliers

In [None]:
# Set the flagged values in df to np.nan (the default replacement used by mask)
df.loc[outliers.index] = df.loc[outliers.index].mask(outliers)
# df.loc[outliers.index] = df.loc[outliers.index].mask(outliers, other=99999)  # Example: set outliers to 99999

In [None]:
# df now holds the masked values; to see them, filter by the outlier index (next cell)
df

In [None]:
# Filter the changed df by the index of the outliers to validate they changed
df.loc[outliers.index]
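
Instead of inspecting the filtered rows by eye, the check can be automated. The helper below, `flagged_cells_are_nan`, is a hypothetical function written for this example and is not part of `helikite`. It assumes the outlier frame contains only boolean values, with no missing entries, and that its columns all exist in the data.

```python
import numpy as np
import pandas as pd


def flagged_cells_are_nan(masked: pd.DataFrame, flags: pd.DataFrame) -> bool:
    """Return True if every cell marked True in `flags` is NaN in `masked`."""
    flags = flags.astype(bool)  # assumes no missing values in the flags
    selected = masked.loc[flags.index, flags.columns]
    # Keep only the flagged cells and check that all of them are NaN
    return bool(selected.isna().to_numpy()[flags.to_numpy()].all())


# Toy frames for illustration (values are invented)
toy_index = pd.to_datetime(["2024-04-02 10:00:00", "2024-04-02 10:00:01"])
toy_flags = pd.DataFrame({"FC_Pressure": [True]}, index=toy_index[[1]])

toy_masked = pd.DataFrame({"FC_Pressure": [950.0, np.nan]}, index=toy_index)
toy_unmasked = pd.DataFrame({"FC_Pressure": [950.0, 12.0]}, index=toy_index)

check_masked = flagged_cells_are_nan(toy_masked, toy_flags)
check_unmasked = flagged_cells_are_nan(toy_unmasked, toy_flags)
```

In the notebook, the equivalent call would be `flagged_cells_are_nan(df, outliers)`, run after the masking cell.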