In [None]:
import functions

# Analyze the dataframe

Get general information about the dataframe (size, elements, types, ...), as well as the main metric to check: the missing rate.

### <i>Purpose: check the missingness rate of the features before introducing the benchmark (based on missingness rate/type).</i>
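The internals of `functions.analyze_dataframe` are not shown in this notebook, so the following is only a minimal sketch of how a per-column missing rate can be computed with pandas. The helper name `missing_rate` and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def missing_rate(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, highest first."""
    return df.isna().mean().sort_values(ascending=False)


# Toy data for illustration only (not taken from the real CSV file).
toy = pd.DataFrame({
    "heart_rate": [72.0, 80.0, np.nan, 65.0],
    "respiratory_rate": [16.0, np.nan, np.nan, 18.0],
})
rates = missing_rate(toy)
print(rates)
```

Columns near the top of this ranking are natural targets for imputation, while columns near the bottom are candidates for well-populated helper features.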

In [None]:
csv_path = "../data/node1/extracted_vital_signs.csv"
original_dataframe = functions.analyze_dataframe(csv_path=csv_path)

# Proposed solution: create a clean subset first
This approach creates a completely clean dataset by removing all rows with missing values in the target feature.
This gives us a clean slate with 0% missing values. We can then apply the introduce_missingness() function to this dataset with the desired missing rate and missing type, and we keep the real values of the masked entries for evaluating and testing imputation performance.
###### <i>RQ: <u>Usually a single feature is enough to test the imputation strategies, but applying the MAR pattern requires a second clean feature.</u></i>

To choose features for the MAR missingness analysis, we follow these steps:
- Look for variables that are well populated (few missing values); see the output of the previous step.
- Evaluate relationships by checking the dependencies between candidate variables (Spearman rank correlation and Mutual Information (MI)).

### Evaluate relationships:
We start with the Spearman rank correlation, which captures monotonic (not just linear) relationships:
- +1: a perfect positive monotonic relationship.
- -1: a perfect negative monotonic relationship.
- 0: no monotonic relationship between the ranks of the two variables.
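To make the interpretation concrete, here is a small sketch that ranks numeric columns by their Spearman correlation with a target using pandas. The helper name `spearman_with_target` and the toy columns are assumptions; the notebook's own `functions.calculate_spearman_correlation` may be implemented differently.

```python
import pandas as pd


def spearman_with_target(df: pd.DataFrame, target_col: str) -> pd.Series:
    """Spearman correlation of every numeric column with target_col,
    ordered by absolute strength (strongest first)."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr(method="spearman")[target_col].drop(target_col)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy data: "cubic" is a non-linear but monotonic function of the target,
# "reversed" decreases as the target increases.
base = [12.0, 14.0, 16.0, 18.0, 20.0]
toy = pd.DataFrame({
    "respiratory_rate": base,
    "cubic": [v ** 3 for v in base],
    "reversed": [5.0, 4.0, 3.0, 2.0, 1.0],
})
spearman = spearman_with_target(toy, "respiratory_rate")
print(spearman)
```

The cubic column reaches +1 even though the relationship is not linear, which is the reason for preferring rank correlation here.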

In [None]:
# Define the target column after checking the missingness rate
target_column = "respiratory_rate"
functions.calculate_spearman_correlation(original_dataframe, target_col=target_column)

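To support the feature choice, we also apply Mutual Information (MI), which captures any statistical dependency, not just linear or monotonic ones. MI is non-negative (I(X;Y) ≥ 0), with 0 meaning the two variables are independent; what counts as a high value (e.g. above 2.0 or so) is context-dependent.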
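As an illustration of what MI measures, the sketch below estimates I(X;Y) in nats from a 2D histogram using NumPy only. This is a simple plug-in estimator chosen for readability, and it is an assumption that it resembles what `functions.calculate_mutual_information` does; the function name, the bin count and the synthetic data are all hypothetical.

```python
import numpy as np


def histogram_mutual_information(x, y, bins: int = 10) -> float:
    """Plug-in estimate of I(X;Y) in nats from a 2D histogram.
    Pairs with a NaN in either variable are ignored."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = ~(np.isnan(x) | np.isnan(y))
    counts, _, _ = np.histogram2d(x[keep], y[keep], bins=bins)
    joint = counts / counts.sum()              # p(x, y)
    px = joint.sum(axis=1, keepdims=True)      # p(x), shape (bins, 1)
    py = joint.sum(axis=0, keepdims=True)      # p(y), shape (1, bins)
    nonzero = joint > 0                        # skip empty cells to avoid log(0)
    ratio = joint[nonzero] / (px @ py)[nonzero]
    return float(np.sum(joint[nonzero] * np.log(ratio)))


rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y_dependent = x ** 2 + 0.1 * rng.normal(size=5000)   # depends on x, but not monotonically
y_independent = rng.normal(size=5000)                # unrelated to x

mi_dependent = histogram_mutual_information(x, y_dependent)
mi_independent = histogram_mutual_information(x, y_independent)
print(mi_dependent, mi_independent)
```

The quadratic case is one where a rank correlation would be close to zero while MI is clearly positive, which is why the two metrics are used together. Histogram estimates depend on the number of bins and the sample size, so the values are best compared between features rather than read as absolute thresholds.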

In [None]:
functions.calculate_mutual_information(original_dataframe,target_column)

After evaluating these metrics, we have a comprehensive picture of the relationships between variables and can select appropriate features for the MAR analysis.

In [None]:
features = [target_column, "heart_rate"]

Clean the dataset to get a clean slate with 0% missingness, on which the imputation strategies can be tested for each missing rate/type.
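A minimal sketch of this cleaning step, assuming it amounts to dropping every row that has a missing value in one of the selected features. The helper name `make_clean_subset` and the toy data are hypothetical; `functions.prepare_clean_dataset` may do more than this.

```python
import numpy as np
import pandas as pd


def make_clean_subset(df: pd.DataFrame, features: list) -> pd.DataFrame:
    """Keep only the rows where all selected features are observed."""
    return df.dropna(subset=features).reset_index(drop=True)


# Toy data for illustration only.
toy = pd.DataFrame({
    "respiratory_rate": [16.0, np.nan, 18.0, 20.0],
    "heart_rate": [72.0, 80.0, np.nan, 65.0],
    "temperature": [36.6, 37.0, 36.9, np.nan],
})
selected = ["respiratory_rate", "heart_rate"]
clean = make_clean_subset(toy, selected)
print(clean)
```

Only the selected features are guaranteed to be complete afterwards; other columns (here `temperature`) can still contain missing values.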

In [None]:
cleaned_dataframe = functions.prepare_clean_dataset(original_dataframe,features=features) 

### Check the missing rates.

In [None]:
functions.analyze_dataframe(df=cleaned_dataframe)

# Introduce missing values
This step introduces missingness into the specified feature for a given missing rate (e.g. 0.1, 0.3, 0.5) and type (MCAR, MAR, MNAR). It also returns a ground-truth DataFrame containing the original values and the missingness information for later evaluation.
###### <i>RQ: <u>A target variable is needed for the MAR (Missing At Random) pattern.</u></i>
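The three missingness types differ only in what the masking probability depends on. The sketch below is one possible way to express this, not the implementation behind `functions.introduce_missingness`: the helper `mask_feature`, its `driver` parameter and the rank-based weighting are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def mask_feature(df, target, driver=None, missing_rate=0.2, pattern="MCAR", seed=0):
    """Mask a share of `target` and return (masked_df, ground_truth).

    MCAR: every row is equally likely to be masked.
    MAR:  rows with higher values of the observed `driver` are more likely.
    MNAR: rows with higher values of `target` itself are more likely.
    """
    rng = np.random.default_rng(seed)
    n = len(df)
    n_missing = int(round(missing_rate * n))

    if pattern == "MCAR":
        weights = np.ones(n)
    elif pattern == "MAR":
        if driver is None:
            raise ValueError("MAR needs a fully observed driver column")
        weights = df[driver].rank(pct=True).to_numpy()
    elif pattern == "MNAR":
        weights = df[target].rank(pct=True).to_numpy()
    else:
        raise ValueError(f"unknown pattern: {pattern}")

    chosen = rng.choice(n, size=n_missing, replace=False, p=weights / weights.sum())
    mask = np.zeros(n, dtype=bool)
    mask[chosen] = True

    masked = df.copy()
    masked[target] = df[target].where(~mask)   # masked entries become NaN
    ground_truth = pd.DataFrame(
        {"original_value": df[target].to_numpy(), "was_masked": mask},
        index=df.index,
    )
    return masked, ground_truth


# Synthetic, fully observed data for illustration only.
data_rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "heart_rate": data_rng.normal(80, 10, 100),
    "respiratory_rate": data_rng.normal(16, 3, 100),
})
masked, truth = mask_feature(
    toy, "respiratory_rate", driver="heart_rate", missing_rate=0.2, pattern="MAR"
)
print(masked["respiratory_rate"].isna().mean())
```

Drawing an exact number of rows without replacement keeps the realized missing rate equal to the requested one, which makes results comparable across patterns.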

In [None]:
missing_rate = 0.2
pattern = 'MAR'

data_with_missingness, original_values = functions.introduce_missingness(
    df=cleaned_dataframe, feature1=features[0], feature2=features[1],
    missing_rate=missing_rate, task="regression", pattern=pattern,
)

### -> Training of the federated learning model
Load the model to impute the missing data. (The model is trained on data across nodes, using the federated learning approach.)

In [None]:

# Predictors and target for the imputation model (this redefines `features` from above)
features = ['heart_rate', 'gender']
target = "respiratory_rate"

result_dataframe = functions.return_regression_results(data_with_missingness,features,target)

In [None]:
functions.get_nan_rows(result_dataframe,target)

In [None]:
functions.benchmark_predictions(result_dataframe,original_values,target)
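Because the original values of the masked entries were kept, the imputations can be scored directly against them. The sketch below shows one common way to do this with MAE and RMSE computed on the masked positions only; the helper `score_imputation` and the numbers are hypothetical, and the metrics reported by `functions.benchmark_predictions` may differ.

```python
import numpy as np


def score_imputation(imputed, original, was_masked) -> dict:
    """MAE and RMSE between imputed and original values,
    restricted to the entries that were artificially masked."""
    imputed = np.asarray(imputed, dtype=float)
    original = np.asarray(original, dtype=float)
    mask = np.asarray(was_masked, dtype=bool)
    error = imputed[mask] - original[mask]
    return {
        "mae": float(np.mean(np.abs(error))),
        "rmse": float(np.sqrt(np.mean(error ** 2))),
    }


# Toy values for illustration only: two of the four entries were masked.
original = [10.0, 20.0, 30.0, 40.0]
imputed = [10.0, 22.0, 30.0, 36.0]
was_masked = [False, True, False, True]
scores = score_imputation(imputed, original, was_masked)
print(scores)
```

Restricting the score to masked positions avoids inflating the result with entries that were never imputed.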