# Visualize Results: Downstream Performance - Multiclass Classification Corrupted Experiments

This notebook has been adapted -> use it for tests!

This notebook should answer the question: *Does imputation lead to better downstream performance?*

The data needs to be preprocessed with another notebook. Here we only import two CSV files: the raw experiment results and information about the datasets used.

## Notebook Structure 

* Application Scenario 2 - Downstream Performance
   * Categorical Columns (Classification)
   * Numerical Columns (Regression)
   * Heterogeneous Columns (Classification and Regression Combined)

In [69]:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import os
import pandas as pd
import re
import seaborn as sns

from pathlib import Path

import plotly as py
import plotly.express as px
import plotly.graph_objects as go
import xarray as xr


%matplotlib inline

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Settings

In [70]:
sns.set(style="whitegrid")
sns.set_context('paper', font_scale=1.5)
mpl.rcParams['lines.linewidth'] = '2'

In [71]:
CLF_METRIC = "F1_macro"
REG_METRIC = "RMSE"

#CLF_METRIC = "Classification Tasks"
#REG_METRIC = "Regression Tasks"

DOWNSTREAM_RESULT_TYPE = "downstream_performance_mean"
IMPUTE_RESULT_TYPE = "impute_performance_mean"

FIGURES_PATH = Path("../paper/figures/")

## Data Preparation

In [72]:
# Read the results CSV file here!

# Pick whether you want to analyze the "Regression" experiment or the "Regression Corrupted" experiment

#results = pd.read_csv('regression_corrupted.csv')
results = pd.read_csv('../multiclass_classification_corrupted.csv')
# Preresults.head()
results

Unnamed: 0,experiment,imputer,task,missing_type,missing_fraction,strategy,column,result_type,metric,train,test,baseline,corrupted,imputed
0,corrupted_multi_experiment,KNNImputer,30,MNAR,0.50,single_single,eccen,impute_performance_std,MAE,1.824974,0.684856,,,
1,corrupted_multi_experiment,KNNImputer,30,MNAR,0.50,single_single,eccen,impute_performance_std,MSE,70.315798,67.553193,,,
2,corrupted_multi_experiment,KNNImputer,30,MNAR,0.50,single_single,eccen,impute_performance_std,RMSE,3.869505,4.132480,,,
3,corrupted_multi_experiment,KNNImputer,30,MNAR,0.30,single_single,eccen,impute_performance_std,MAE,0.802552,0.406847,,,
4,corrupted_multi_experiment,KNNImputer,30,MNAR,0.30,single_single,eccen,impute_performance_std,MSE,15.222242,1.146581,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8539,corrupted_multi_experiment,AutoKerasImputer,40685,MCAR,0.01,single_single,A2,downstream_performance_mean,F1_macro,,,0.382647,0.0,0.382647
8540,corrupted_multi_experiment,AutoKerasImputer,40685,MCAR,0.01,single_single,A2,downstream_performance_mean,F1_weighted,,,0.920582,0.0,0.920582
8541,corrupted_multi_experiment,AutoKerasImputer,40685,MCAR,0.10,single_single,A2,downstream_performance_mean,F1_micro,,,0.927931,0.0,0.927931
8542,corrupted_multi_experiment,AutoKerasImputer,40685,MCAR,0.10,single_single,A2,downstream_performance_mean,F1_macro,,,0.382871,0.0,0.382871


In [73]:
# Filter the imputation results that are relevant for the downstream analysis

na_impute_results = results[
    (results["result_type"] == IMPUTE_RESULT_TYPE) &
    (results["metric"].isin(["F1_macro", "RMSE"]))
]
# Reassign instead of dropping in place to avoid modifying a slice of `results`
na_impute_results = na_impute_results.drop(columns=["baseline", "corrupted", "imputed"])
# Keep only rows with at least one NaN, i.e. experiments where imputation failed
na_impute_results = na_impute_results[na_impute_results.isna().any(axis=1)]
na_impute_results.shape

(0, 11)

In [74]:
# Check that the strategy type is correct!
STRATEGY_TYPE = "single_single"

downstream_results = results[
    (results["result_type"] == DOWNSTREAM_RESULT_TYPE) &
    (results["metric"].isin(["F1_macro", "RMSE"])) &
    (results["strategy"] == STRATEGY_TYPE)
]

# Remove experiments where imputation failed: left-merge against the failed
# imputation rows and keep only the rows without a match ("left_only")
downstream_results = downstream_results.merge(
    na_impute_results,
    how = "left",
    validate = "one_to_one",
    indicator = True,
    suffixes=("", "_imp"),
    on = ["experiment", "imputer", "task", "missing_type", "missing_fraction", "strategy", "column"]
)
downstream_results = downstream_results[downstream_results["_merge"]=="left_only"]

assert len(results["strategy"].unique()) == 1
# Reassign instead of dropping in place to avoid modifying a filtered slice
downstream_results = downstream_results.drop(columns=["experiment", "strategy", "result_type_imp", "metric_imp", "train", "test", "train_imp", "test_imp", "_merge"])

downstream_results = downstream_results.rename(
    {
        "imputer": "Imputation_Method",
        "task": "Task",
        "missing_type": "Missing Type",
        "missing_fraction": "Missing Fraction",
        "column": "Column",
        "baseline": "Baseline",
        "imputed": "Imputed",
        "corrupted": "Corrupted"
    },
    axis = 1
)
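
The filtering step above is an anti-join: rows that have a match in the failed-imputation table are discarded. A minimal sketch of the same pattern on toy data follows. The frames `downstream` and `failed` and their values are made up for illustration and are not taken from the experiment files.

```python
import pandas as pd

# Hypothetical stand-ins for the downstream results and the failed imputations
downstream = pd.DataFrame({
    "imputer": ["KNNImputer", "VAEImputer", "GAINImputer"],
    "task": [30, 30, 30],
    "imputed": [0.70, 0.74, 0.68],
})
failed = pd.DataFrame({"imputer": ["VAEImputer"], "task": [30]})

# indicator=True adds a "_merge" column with "left_only", "right_only" or "both"
merged = downstream.merge(failed, how="left", on=["imputer", "task"], indicator=True)

# Keep only rows without a partner in `failed`, then drop the helper column
kept = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(kept)
```

Because `na_impute_results` is empty in this run (shape `(0, 11)`), the real merge removes nothing here.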

In [75]:
rename_imputer_dict = {
    "ModeImputer": "Mean/Mode",
    "KNNImputer": "KNN",
    "ForestImputer": "Random Forest",
    "AutoKerasImputer": "Discriminative DL",
    "VAEImputer": "VAE",
    "GAINImputer": "GAIN"    
}

rename_metric_dict = {
    "F1_macro": CLF_METRIC,
    "RMSE": REG_METRIC
}

downstream_results = downstream_results.replace(rename_imputer_dict)
downstream_results = downstream_results.replace(rename_metric_dict)

downstream_results

Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed
0,KNN,30,MNAR,0.50,eccen,downstream_performance_mean,F1_macro,0.728825,0.0,0.728026
1,KNN,30,MNAR,0.30,eccen,downstream_performance_mean,F1_macro,0.673140,0.0,0.673140
2,KNN,30,MNAR,0.01,eccen,downstream_performance_mean,F1_macro,0.673140,0.0,0.673140
3,KNN,30,MNAR,0.10,eccen,downstream_performance_mean,F1_macro,0.749369,0.0,0.749369
4,KNN,30,MAR,0.50,eccen,downstream_performance_mean,F1_macro,0.714564,0.0,0.715378
...,...,...,...,...,...,...,...,...,...,...
707,Discriminative DL,40685,MAR,0.10,A2,downstream_performance_mean,F1_macro,0.383420,0.0,0.383415
708,Discriminative DL,40685,MCAR,0.50,A2,downstream_performance_mean,F1_macro,0.382006,0.0,0.381955
709,Discriminative DL,40685,MCAR,0.30,A2,downstream_performance_mean,F1_macro,0.384540,0.0,0.384440
710,Discriminative DL,40685,MCAR,0.01,A2,downstream_performance_mean,F1_macro,0.382647,0.0,0.382647


### Robustness: check which imputers yielded `NaN` values

In [76]:
for col in downstream_results.columns:
    na_sum = downstream_results[col].isna().sum()
    if na_sum > 0:
        print("-----" * 10)        
        print(col, na_sum)
        print("-----" * 10)        
        na_idx = downstream_results[col].isna()
        print(downstream_results.loc[na_idx, "Imputation_Method"].value_counts(dropna=False))
        print("\n")
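
The same check can be written without an explicit loop. The sketch below counts missing values per imputation method and column in one step. The frame `toy` and its values are invented for illustration, and only the column name `Imputation_Method` is taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical results with a few missing scores
toy = pd.DataFrame({
    "Imputation_Method": ["KNN", "KNN", "VAE", "GAIN"],
    "Baseline": [0.70, 0.71, 0.74, np.nan],
    "Imputed": [0.70, np.nan, np.nan, 0.68],
})

value_columns = ["Baseline", "Imputed"]

# isna() yields booleans; summing them per method counts the NaN entries
na_counts = toy[value_columns].isna().groupby(toy["Imputation_Method"]).sum()
print(na_counts)
```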

## Adding Dataset Info, Sorting and Ranking

In [77]:
# Sort the data

#downstream_results_full_sort = pd.read_csv('downstream_results.csv')
downstream_results_full_sort = downstream_results


#impute_results_full_sort = impute_results_full_sort.sort_values(['Task'], ascending=[True])
downstream_results_full_sort = downstream_results_full_sort.sort_values(['Task', 'Missing Type', 'Missing Fraction', 'Imputed'], ascending=[True, True, True, True])
#print(downstream_results_full_sort)
downstream_results_full_sort.head()


#downstream_results_full_sort.to_csv('downstream_results_full_sort.csv')

Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed
600,Discriminative DL,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.67314,0.0,0.67314
360,Random Forest,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.674928,0.0,0.674928
126,GAIN,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.681451,0.0,0.681451
6,KNN,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.702542,0.0,0.702542
240,Mean/Mode,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.738395,0.0,0.73824


In [78]:
# add dataset information from other csv file

dataset_info = pd.read_csv('../datasets_information_overview.csv')
dataset_info = dataset_info.rename(columns={"did": "Task"})


downstream_results_full_sort = pd.merge(downstream_results_full_sort, dataset_info, on='Task')
#downstream_results_full_sort.to_csv('downstream_results_full_sort_testtesttest.csv')
downstream_results_full_sort.head()

Unnamed: 0.1,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,Unnamed: 0,name,MajorityClassSize,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses
0,Discriminative DL,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.67314,0.0,0.67314,48,page-blocks,4913.0,28.0,11.0,5473.0,10.0,1.0,5.0
1,Random Forest,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.674928,0.0,0.674928,48,page-blocks,4913.0,28.0,11.0,5473.0,10.0,1.0,5.0
2,GAIN,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.681451,0.0,0.681451,48,page-blocks,4913.0,28.0,11.0,5473.0,10.0,1.0,5.0
3,KNN,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.702542,0.0,0.702542,48,page-blocks,4913.0,28.0,11.0,5473.0,10.0,1.0,5.0
4,Mean/Mode,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.738395,0.0,0.73824,48,page-blocks,4913.0,28.0,11.0,5473.0,10.0,1.0,5.0


In [79]:
# Ranking of downstream performance per data constellation

EXPERIMENTAL_CONDITIONS = ["Task", "Missing Type", "Missing Fraction", "Column", "result_type"]

downstream_results_rank = downstream_results_full_sort

#clf_row_idx = impute_results["metric"] == CLF_METRIC
#reg_row_idx = impute_results["metric"] == REG_METRIC

# Select "Imputed" before ranking so that only this column is ranked within each group
downstream_results_rank["Downstream Performance Rank"] = downstream_results_rank.groupby(EXPERIMENTAL_CONDITIONS)["Imputed"].rank(ascending=False, na_option="bottom", method="min")
downstream_results_rank.to_csv('downstream_results_complete_overview.csv')
downstream_results_rank.head()


Unnamed: 0.1,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,Unnamed: 0,name,MajorityClassSize,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses,Downstream Performance Rank
0,Discriminative DL,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.67314,0.0,0.67314,48,page-blocks,4913.0,28.0,11.0,5473.0,10.0,1.0,5.0,6.0
1,Random Forest,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.674928,0.0,0.674928,48,page-blocks,4913.0,28.0,11.0,5473.0,10.0,1.0,5.0,5.0
2,GAIN,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.681451,0.0,0.681451,48,page-blocks,4913.0,28.0,11.0,5473.0,10.0,1.0,5.0,4.0
3,KNN,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.702542,0.0,0.702542,48,page-blocks,4913.0,28.0,11.0,5473.0,10.0,1.0,5.0,3.0
4,Mean/Mode,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.738395,0.0,0.73824,48,page-blocks,4913.0,28.0,11.0,5473.0,10.0,1.0,5.0,2.0
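
How the ranking treats ties matters for the later analysis. With `method="min"`, tied scores all receive the lowest rank of the tie, so one experimental condition can have several rank-1 methods. A small sketch on made-up scores (the frame `toy` is hypothetical):

```python
import pandas as pd

# Hypothetical scores; VAE and GAIN tie on task 30
toy = pd.DataFrame({
    "Task": [30, 30, 30, 40, 40],
    "Imputation_Method": ["KNN", "VAE", "GAIN", "KNN", "VAE"],
    "Imputed": [0.70, 0.74, 0.74, 0.38, 0.39],
})

# Rank within each task; higher scores get better (smaller) ranks
toy["Rank"] = toy.groupby("Task")["Imputed"].rank(
    ascending=False, na_option="bottom", method="min"
)
print(toy)
```

On task 30 both tied methods get rank 1 and KNN gets rank 3, so rank 2 is skipped.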


In [80]:
# Merge the two columns "Missing Type" and "Missing Fraction"

downstream_results_rank['Missing Type'] = downstream_results_rank['Missing Type'].astype(str)
downstream_results_rank['Missing Fraction'] = downstream_results_rank['Missing Fraction'].astype(str)
datatype_new = downstream_results_rank.dtypes
#print(datatype_new)

downstream_results_rank['Data_Constellation'] = downstream_results_rank['Missing Type'] + ' - ' + downstream_results_rank['Missing Fraction']
downstream_results_rank.to_csv('downstream_results_rank_temp.csv')
downstream_results_rank_heatmap2 = downstream_results_rank.copy()
downstream_results_rank.head()


Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,...,name,MajorityClassSize,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses,Downstream Performance Rank,Data_Constellation
0,Discriminative DL,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.67314,0.0,0.67314,...,page-blocks,4913.0,28.0,11.0,5473.0,10.0,1.0,5.0,6.0,MAR - 0.01
1,Random Forest,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.674928,0.0,0.674928,...,page-blocks,4913.0,28.0,11.0,5473.0,10.0,1.0,5.0,5.0,MAR - 0.01
2,GAIN,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.681451,0.0,0.681451,...,page-blocks,4913.0,28.0,11.0,5473.0,10.0,1.0,5.0,4.0,MAR - 0.01
3,KNN,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.702542,0.0,0.702542,...,page-blocks,4913.0,28.0,11.0,5473.0,10.0,1.0,5.0,3.0,MAR - 0.01
4,Mean/Mode,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.738395,0.0,0.73824,...,page-blocks,4913.0,28.0,11.0,5473.0,10.0,1.0,5.0,2.0,MAR - 0.01


## Analyzing Performance based on Rank and Improvement per Data Constellation

Calculation: best result per experimental condition minus the result of the method that is best on average.

To-dos for the remaining evaluation (mathematical part):
- Determine the best imputation method per dataset (ideally via ranking, per constellation, e.g. MAR 0.01)
- Determine the average rank of each imputation method (ranking, then per constellation, e.g. MAR 0.01)
- Compare the best imputation method per dataset with the imputation method that is best on average (one list with the best method and one list with the average-best method, then compare; each constellation appears exactly once in each list)
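
The first two items of this plan can be sketched with `groupby`. The frame `toy` and its ranks are invented for illustration, and the column names follow the ones used in this notebook.

```python
import pandas as pd

# Hypothetical ranks for three methods in two data constellations
toy = pd.DataFrame({
    "Data_Constellation": ["MAR - 0.01"] * 3 + ["MCAR - 0.5"] * 3,
    "Imputation_Method": ["KNN", "VAE", "GAIN"] * 2,
    "Downstream Performance Rank": [2.0, 1.0, 3.0, 1.0, 3.0, 2.0],
})

# Average rank of each method over all constellations (lower is better)
average_rank = (
    toy.groupby("Imputation_Method")["Downstream Performance Rank"]
    .mean()
    .sort_values()
)
best_on_average = average_rank.index[0]

# Winning method(s) per constellation
winners = toy.loc[
    toy["Downstream Performance Rank"] == 1.0,
    ["Data_Constellation", "Imputation_Method"],
]
print(average_rank)
print(best_on_average)
print(winners)
```

If two methods share the best average rank, `index[0]` picks one of them arbitrarily, so the choice should be checked by hand in that case.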


In [81]:
data = downstream_results_rank.copy()

# Count amount of different Data constellations in column "Data_Constellation"
dc_unique = data.Data_Constellation.unique().size
print(dc_unique, "Data Constellations")
print("_____________________")
# Count amount of 1.0 Ranking result in column "Downstream Performance Rank" 
rank_count = data['Downstream Performance Rank'].value_counts()
print(rank_count)
print("_____________________")
# Filter for 1.0 Ranking -> Overview -> save as csv
rank_1 = data.loc[data['Downstream Performance Rank'] == 1.0]
rank_1.to_csv('rank_1.csv')

print("_____________________")
# Count how often each Imputation Method is present -> most "wins"
rank_wins = rank_1['Imputation_Method'].value_counts()
print(rank_wins)
print("_____________________")
# Take initial overview and filter for each imputation method and calculate average rank and average improvement
methods = ['Random Forest', 'KNN', 'Mean/Mode', 'VAE', 'GAIN', 'Discriminative DL']
for i in methods:
    df_average_rank = data.loc[data['Imputation_Method'] == i]
    len_ar = len(df_average_rank)
    print(len_ar, "Amount of results available")
    rank_pos = df_average_rank['Downstream Performance Rank'].value_counts().sort_index(ascending=True)
    print(rank_pos)
    average_rank = df_average_rank["Downstream Performance Rank"].mean()
    print("Average Rank for", i, "is", average_rank)
    #average_improvement = df_average_rank["Improvement"].mean()
    #print("Average Improvement to baseline is", average_improvement)
    print("_____________________")



12 Data Constellations
_____________________
1.0    124
4.0    121
3.0    121
5.0    117
2.0    117
6.0    112
Name: Downstream Performance Rank, dtype: int64
_____________________
_____________________
GAIN                 24
KNN                  24
VAE                  23
Random Forest        20
Mean/Mode            20
Discriminative DL    13
Name: Imputation_Method, dtype: int64
_____________________
120 Amount of results available
1.0    20
2.0    22
3.0    24
4.0    23
5.0    16
6.0    15
Name: Downstream Performance Rank, dtype: int64
Average Rank for Random Forest is 3.316666666666667
_____________________
120 Amount of results available
1.0    24
2.0    18
3.0    25
4.0    15
5.0    25
6.0    13
Name: Downstream Performance Rank, dtype: int64
Average Rank for KNN is 3.316666666666667
_____________________
120 Amount of results available
1.0    20
2.0    24
3.0    22
4.0    15
5.0    19
6.0    20
Name: Downstream Performance Rank, dtype: int64
Average Rank for Mean/Mode is 3.408

In [82]:
rank_1.head()

Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,...,name,MajorityClassSize,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses,Downstream Performance Rank,Data_Constellation
5,VAE,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.738395,0.0,0.738395,...,page-blocks,4913.0,28.0,11.0,5473.0,10.0,1.0,5.0,1.0,MAR - 0.01
11,GAIN,30,MAR,0.1,eccen,downstream_performance_mean,F1_macro,0.743449,0.0,0.744342,...,page-blocks,4913.0,28.0,11.0,5473.0,10.0,1.0,5.0,1.0,MAR - 0.1
17,VAE,30,MAR,0.3,eccen,downstream_performance_mean,F1_macro,0.751008,0.0,0.751215,...,page-blocks,4913.0,28.0,11.0,5473.0,10.0,1.0,5.0,1.0,MAR - 0.3
23,Random Forest,30,MAR,0.5,eccen,downstream_performance_mean,F1_macro,0.752634,0.0,0.753104,...,page-blocks,4913.0,28.0,11.0,5473.0,10.0,1.0,5.0,1.0,MAR - 0.5
29,Mean/Mode,30,MCAR,0.01,eccen,downstream_performance_mean,F1_macro,0.75139,0.0,0.750793,...,page-blocks,4913.0,28.0,11.0,5473.0,10.0,1.0,5.0,1.0,MCAR - 0.01


In [83]:
# Calculate average improvement to baseline over all datasets per method
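
This cell is empty in the notebook. One possible way to fill it is sketched below, assuming that "improvement" means `Imputed - Baseline`. That definition, the frame `toy` and its values are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical baseline and imputed scores for two methods
toy = pd.DataFrame({
    "Imputation_Method": ["KNN", "KNN", "VAE", "VAE"],
    "Baseline": [0.70, 0.60, 0.70, 0.60],
    "Imputed": [0.72, 0.60, 0.69, 0.65],
})

# Assumed definition: positive values mean imputation beat the baseline
toy["Improvement"] = toy["Imputed"] - toy["Baseline"]

# Mean improvement per imputation method over all rows
average_improvement = toy.groupby("Imputation_Method")["Improvement"].mean()
print(average_improvement)
```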







In [84]:
# Take the initial overview, filter for the method that is best on average, and take the filtered rank-1 DataFrame
# Where Data_Constellation is identical -> rank-1 score minus score of the average-best method
# Write the difference into a separate column -> calculate the average difference

AVERAGE_BEST_IMPUTATION_METHOD = "KNN"

# Adjust the following depending on the previous results
# .copy() gives an independent frame, so the assignments below do not touch a slice of `data`
av_best = data.loc[data['Imputation_Method'] == AVERAGE_BEST_IMPUTATION_METHOD].copy()
av_best['Task'] = av_best['Task'].astype(str)
av_best['Data_Constellation'] = av_best['Data_Constellation'] + ' - ' + av_best['Task']

av_best = av_best[['Imputation_Method', 'Imputed', 'Data_Constellation', 'Downstream Performance Rank']]
av_best = av_best.rename(columns={'Imputation_Method':'Imputation_Method_average', 
                               'Imputed':'Imputed_average',
                                 'Downstream Performance Rank':'Downstream Performance Rank Average'})

#av_best.head()

rank_1 = rank_1.copy()
rank_1['Task'] = rank_1['Task'].astype(str)
rank_1['Data_Constellation'] = rank_1['Data_Constellation'] + ' - ' + rank_1['Task']
rank_1 = rank_1[['Imputation_Method', 'Imputed', 'Data_Constellation', 'Downstream Performance Rank']]
rank_1 = rank_1.rename(columns={'Imputation_Method':'Imputation_Method_best', 
                               'Imputed':'Imputed_best',
                               'Downstream Performance Rank':'Downstream Performance Rank Best'})

performance_difference = pd.merge(av_best, rank_1, on='Data_Constellation')
performance_difference.head()

Unnamed: 0,Imputation_Method_average,Imputed_average,Data_Constellation,Downstream Performance Rank Average,Imputation_Method_best,Imputed_best,Downstream Performance Rank Best
0,KNN,0.702542,MAR - 0.01 - 30,3.0,VAE,0.738395,1.0
1,KNN,0.674449,MAR - 0.1 - 30,5.0,GAIN,0.744342,1.0
2,KNN,0.643422,MAR - 0.3 - 30,6.0,VAE,0.751215,1.0
3,KNN,0.715378,MAR - 0.5 - 30,5.0,Random Forest,0.753104,1.0
4,KNN,0.711018,MCAR - 0.01 - 30,5.0,Mean/Mode,0.750793,1.0


In [85]:
#performance_difference['Imputed_best'] = performance_difference['Imputed_best'] 
#performance_difference['Imputed_average'] = performance_difference['Imputed_average'] 

performance_difference['Performance Difference Best to Average'] = performance_difference['Imputed_best'] - performance_difference['Imputed_average']
Average_Difference = performance_difference['Performance Difference Best to Average'].mean()

print("Average Difference for F1 Score", Average_Difference)


Average Difference for F1 Score 0.015016058228635979


In [86]:
# Percentage improvement

performance_difference['Performance Difference Best to Average in Percentage'] = ((performance_difference['Imputed_best'] - performance_difference['Imputed_average'])/performance_difference['Imputed_best'])*100
Average_Difference_per = performance_difference['Performance Difference Best to Average in Percentage'].mean()

print("Based on F1 Score the Average best method is worse than the best method by this percentage", Average_Difference_per)



Based on F1 Score the Average best method is worse than the best method by this percentage 4.8755820408790065
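
The percentage depends on which score is used as the denominator. This cell divides by `Imputed_best`, while a later cell divides by the score of the average-best method. A worked example with the values of the first row (`MAR - 0.01 - 30`, KNN 0.702542 against VAE 0.738395) shows the difference:

```python
best = 0.738395     # Imputed_best for "MAR - 0.01 - 30" (VAE)
average = 0.702542  # Imputed_average for the same constellation (KNN)

absolute_difference = best - average

# Relative to the best score, as in this cell
relative_to_best = absolute_difference / best * 100

# Relative to the average-best score, as in the later loop-based cell
relative_to_average = absolute_difference / average * 100

print(absolute_difference, relative_to_best, relative_to_average)
```

The two results are roughly 4.86 % and 5.10 %, which matches the 4.855474 and 0.051033 reported for this row in the tables of this notebook.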


In [87]:
performance_difference.to_csv('performance_difference.csv')

In [88]:
performance_difference

Unnamed: 0,Imputation_Method_average,Imputed_average,Data_Constellation,Downstream Performance Rank Average,Imputation_Method_best,Imputed_best,Downstream Performance Rank Best,Performance Difference Best to Average,Performance Difference Best to Average in Percentage
0,KNN,0.702542,MAR - 0.01 - 30,3.0,VAE,0.738395,1.0,0.035853,4.855474
1,KNN,0.674449,MAR - 0.1 - 30,5.0,GAIN,0.744342,1.0,0.069892,9.389816
2,KNN,0.643422,MAR - 0.3 - 30,6.0,VAE,0.751215,1.0,0.107794,14.349251
3,KNN,0.715378,MAR - 0.5 - 30,5.0,Random Forest,0.753104,1.0,0.037726,5.009445
4,KNN,0.711018,MCAR - 0.01 - 30,5.0,Mean/Mode,0.750793,1.0,0.039774,5.297653
...,...,...,...,...,...,...,...,...,...
119,KNN,0.240619,MNAR - 0.01 - 41671,1.0,Random Forest,0.240619,1.0,0.000000,0.000000
120,KNN,0.240619,MNAR - 0.01 - 41671,1.0,Discriminative DL,0.240619,1.0,0.000000,0.000000
121,KNN,0.239984,MNAR - 0.1 - 41671,3.0,Discriminative DL,0.241994,1.0,0.002009,0.830283
122,KNN,0.239941,MNAR - 0.3 - 41671,5.0,Mean/Mode,0.244943,1.0,0.005003,2.042390
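
Rows 119 and 120 share the constellation `MNAR - 0.01 - 41671`: the merge emits one row per rank-1 method, so tied constellations are counted more than once in the averages above. A hedged sketch of de-duplicating before averaging follows. The toy frames reuse the three values shown in the table, and keeping the first tied row is an assumption, not the notebook's method.

```python
import pandas as pd

# Toy frames built from the last rows of the table above
av_best = pd.DataFrame({
    "Data_Constellation": ["MNAR - 0.01 - 41671", "MNAR - 0.1 - 41671"],
    "Imputed_average": [0.240619, 0.239984],
})
rank_1 = pd.DataFrame({
    "Data_Constellation": [
        "MNAR - 0.01 - 41671", "MNAR - 0.01 - 41671", "MNAR - 0.1 - 41671",
    ],
    "Imputation_Method_best": ["Random Forest", "Discriminative DL", "Discriminative DL"],
    "Imputed_best": [0.240619, 0.240619, 0.241994],
})

merged = pd.merge(av_best, rank_1, on="Data_Constellation")

# Keep one row per constellation; tied winners have the same score anyway
deduplicated = merged.drop_duplicates(subset="Data_Constellation")
difference = deduplicated["Imputed_best"] - deduplicated["Imputed_average"]
print(len(merged), len(deduplicated), difference.mean())
```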


## Analysis and Ranking based on F1 Score

In [89]:
#downstream_results_rank.drop(['Unnamed: 0.1'], axis=1)
downstream_results_rank

Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,...,name,MajorityClassSize,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses,Downstream Performance Rank,Data_Constellation
0,Discriminative DL,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.673140,0.0,0.673140,...,page-blocks,4913.0,28.0,11.0,5473.0,10.0,1.0,5.0,6.0,MAR - 0.01
1,Random Forest,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.674928,0.0,0.674928,...,page-blocks,4913.0,28.0,11.0,5473.0,10.0,1.0,5.0,5.0,MAR - 0.01
2,GAIN,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.681451,0.0,0.681451,...,page-blocks,4913.0,28.0,11.0,5473.0,10.0,1.0,5.0,4.0,MAR - 0.01
3,KNN,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.702542,0.0,0.702542,...,page-blocks,4913.0,28.0,11.0,5473.0,10.0,1.0,5.0,3.0,MAR - 0.01
4,Mean/Mode,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.738395,0.0,0.738240,...,page-blocks,4913.0,28.0,11.0,5473.0,10.0,1.0,5.0,2.0,MAR - 0.01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
707,Discriminative DL,41671,MNAR,0.5,a9,downstream_performance_mean,F1_macro,0.240118,0.0,0.240732,...,microaggregation2,11162.0,743.0,21.0,20000.0,20.0,1.0,5.0,5.0,MNAR - 0.5
708,Mean/Mode,41671,MNAR,0.5,a9,downstream_performance_mean,F1_macro,0.239275,0.0,0.240780,...,microaggregation2,11162.0,743.0,21.0,20000.0,20.0,1.0,5.0,4.0,MNAR - 0.5
709,VAE,41671,MNAR,0.5,a9,downstream_performance_mean,F1_macro,0.241995,0.0,0.242421,...,microaggregation2,11162.0,743.0,21.0,20000.0,20.0,1.0,5.0,3.0,MNAR - 0.5
710,GAIN,41671,MNAR,0.5,a9,downstream_performance_mean,F1_macro,0.239877,0.0,0.243434,...,microaggregation2,11162.0,743.0,21.0,20000.0,20.0,1.0,5.0,2.0,MNAR - 0.5


In [90]:
# Relative difference -> best method vs. the method that is best on average

#AVERAGE_BEST_IMPUTATION_METHOD = "KNN"

data = downstream_results_rank.copy()
data['Task'] = data['Task'].astype(str)
data['Data_Constellation_full'] = data['Data_Constellation'] + ' - ' + data['Task']

# TODO: drop unnecessary columns here
dc_unique = data.Data_Constellation_full.unique()
data_constellations = dc_unique.tolist()
methods = ['Random Forest', 'KNN', 'Mean/Mode', 'VAE', 'GAIN', 'Discriminative DL']

# Collect one frame per constellation and concatenate once at the end
average_best_parts = []

for i in data_constellations:
    data_constel = data.loc[data['Data_Constellation_full'] == i]
    best_score = data_constel.loc[data_constel['Downstream Performance Rank'] == 1.0]
    # .copy() so that adding a column does not write into a slice of `data`
    average_best = data_constel.loc[data_constel['Imputation_Method'] == AVERAGE_BEST_IMPUTATION_METHOD].copy()
    best_score_int = best_score.iloc[0]['Imputed']
    average_best_int = average_best.iloc[0]['Imputed']
    # Stored as a fraction of the average-best score (0.05 means 5 %), despite the column name
    calc_result = ((best_score_int - average_best_int)/average_best_int)
    average_best['Performance Difference to Best to Average in Percent'] = calc_result
    average_best_parts.append(average_best)

# DataFrame.append is deprecated; pd.concat keeps the original row index
average_best_complete = pd.concat(average_best_parts)
average_best_complete

Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,...,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses,Downstream Performance Rank,Data_Constellation,Data_Constellation_full,Performance Difference to Best to Average in Percent
3,KNN,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.702542,0.0,0.702542,...,28.0,11.0,5473.0,10.0,1.0,5.0,3.0,MAR - 0.01,MAR - 0.01 - 30,0.051033
7,KNN,30,MAR,0.1,eccen,downstream_performance_mean,F1_macro,0.674449,0.0,0.674449,...,28.0,11.0,5473.0,10.0,1.0,5.0,5.0,MAR - 0.1,MAR - 0.1 - 30,0.103629
12,KNN,30,MAR,0.3,eccen,downstream_performance_mean,F1_macro,0.647119,0.0,0.643422,...,28.0,11.0,5473.0,10.0,1.0,5.0,6.0,MAR - 0.3,MAR - 0.3 - 30,0.167532
19,KNN,30,MAR,0.5,eccen,downstream_performance_mean,F1_macro,0.714564,0.0,0.715378,...,28.0,11.0,5473.0,10.0,1.0,5.0,5.0,MAR - 0.5,MAR - 0.5 - 30,0.052736
25,KNN,30,MCAR,0.01,eccen,downstream_performance_mean,F1_macro,0.713955,0.0,0.711018,...,28.0,11.0,5473.0,10.0,1.0,5.0,5.0,MCAR - 0.01,MCAR - 0.01 - 30,0.055940
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
684,KNN,41671,MCAR,0.5,a9,downstream_performance_mean,F1_macro,0.237525,0.0,0.236969,...,743.0,21.0,20000.0,20.0,1.0,5.0,5.0,MCAR - 0.5,MCAR - 0.5 - 41671,0.035315
691,KNN,41671,MNAR,0.01,a9,downstream_performance_mean,F1_macro,0.240619,0.0,0.240619,...,743.0,21.0,20000.0,20.0,1.0,5.0,1.0,MNAR - 0.01,MNAR - 0.01 - 41671,0.000000
697,KNN,41671,MNAR,0.1,a9,downstream_performance_mean,F1_macro,0.240084,0.0,0.239984,...,743.0,21.0,20000.0,20.0,1.0,5.0,3.0,MNAR - 0.1,MNAR - 0.1 - 41671,0.008372
701,KNN,41671,MNAR,0.3,a9,downstream_performance_mean,F1_macro,0.239976,0.0,0.239941,...,743.0,21.0,20000.0,20.0,1.0,5.0,5.0,MNAR - 0.3,MNAR - 0.3 - 41671,0.020850


In [91]:
average_difference = average_best_complete['Performance Difference to Best to Average in Percent'].mean()
print(average_difference, "average difference in Percent")

0.05903317877431598 average difference in Percent
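
The loop above can also be expressed without iterating over constellations. The sketch below assumes that the rank-1 score of a constellation equals the maximum `Imputed` value in that group, which holds when no score is `NaN`. The frame `toy` and the two helper column names are made up for illustration.

```python
import pandas as pd

AVERAGE_BEST_IMPUTATION_METHOD = "KNN"

# Hypothetical scores for three methods in two full data constellations
toy = pd.DataFrame({
    "Data_Constellation_full": ["MAR - 0.01 - 30"] * 3 + ["MAR - 0.1 - 30"] * 3,
    "Imputation_Method": ["KNN", "VAE", "GAIN"] * 2,
    "Imputed": [0.70, 0.77, 0.68, 0.80, 0.75, 0.78],
})

# Best score per constellation, broadcast back to every row of the group
best_score = toy.groupby("Data_Constellation_full")["Imputed"].transform("max")

toy["absolute_difference"] = best_score - toy["Imputed"]
# Fraction of the method's own score, as in the loop above (0.05 means 5 %)
toy["relative_difference"] = toy["absolute_difference"] / toy["Imputed"]

# Only the rows of the average-best method enter the final mean
reference = toy[toy["Imputation_Method"] == AVERAGE_BEST_IMPUTATION_METHOD]
print(reference["relative_difference"].mean())
print(reference["absolute_difference"].mean())
```

This computes the relative and the absolute difference in one pass, so the two nearly identical loop cells could share it.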


In [92]:
# Relative Difference in absolute values (F1 Score) -> Best Method to Average Best Method

#AVERAGE_BEST_IMPUTATION_METHOD = "KNN"

data = downstream_results_rank.copy()
data['Task'] = data['Task'].astype(str)
data['Data_Constellation_full'] = data['Data_Constellation'] + ' - ' + data['Task']

# TODO: drop unnecessary columns here
dc_unique = data.Data_Constellation_full.unique()
#print(dc_unique)

#data_constellations = ['MAR - 0.01', 'MAR - 0.1', 'MAR - 0.3', 'MCAR - 0.5', 'MCAR - 0.01', 'MCAR - 0.1', 'MCAR - 0.3', 'MCAR - 0.5', 'MNAR - 0.01', 'MNAR - 0.1', 'MNAR - 0.3', 'MNAR - 0.5']
data_constellations = dc_unique.tolist()
methods = ['Random Forest', 'KNN', 'Mean/Mode', 'VAE', 'GAIN', 'Discriminative DL']
#print(data_constellations)
#print(type(methods))
average_best_total = pd.DataFrame()


for i in data_constellations:
    data_constel = data.loc[data['Data_Constellation_full'] == i]
    best_score = data_constel.loc[data_constel['Downstream Performance Rank'] == 1.0]
    average_best = data_constel.loc[data_constel['Imputation_Method'] == AVERAGE_BEST_IMPUTATION_METHOD]
    best_score_int = best_score.iloc[0]['Imputed']
    #print(best_score_int)
    average_best_int = average_best.iloc[0]['Imputed']
    #print(average_best_int)
    calc_result = (best_score_int - average_best_int)
#    print(calc_result)
#    print(i)
    average_best['Performance Difference to Best to Average in absolute'] = calc_result
    average_best_total = average_best_total.append(average_best)
 
average_best_total




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.




Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,...,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses,Downstream Performance Rank,Data_Constellation,Data_Constellation_full,Performance Difference to Best to Average in absolute
3,KNN,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.702542,0.0,0.702542,...,28.0,11.0,5473.0,10.0,1.0,5.0,3.0,MAR - 0.01,MAR - 0.01 - 30,0.035853
7,KNN,30,MAR,0.1,eccen,downstream_performance_mean,F1_macro,0.674449,0.0,0.674449,...,28.0,11.0,5473.0,10.0,1.0,5.0,5.0,MAR - 0.1,MAR - 0.1 - 30,0.069892
12,KNN,30,MAR,0.3,eccen,downstream_performance_mean,F1_macro,0.647119,0.0,0.643422,...,28.0,11.0,5473.0,10.0,1.0,5.0,6.0,MAR - 0.3,MAR - 0.3 - 30,0.107794
19,KNN,30,MAR,0.5,eccen,downstream_performance_mean,F1_macro,0.714564,0.0,0.715378,...,28.0,11.0,5473.0,10.0,1.0,5.0,5.0,MAR - 0.5,MAR - 0.5 - 30,0.037726
25,KNN,30,MCAR,0.01,eccen,downstream_performance_mean,F1_macro,0.713955,0.0,0.711018,...,28.0,11.0,5473.0,10.0,1.0,5.0,5.0,MCAR - 0.01,MCAR - 0.01 - 30,0.039774
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
684,KNN,41671,MCAR,0.5,a9,downstream_performance_mean,F1_macro,0.237525,0.0,0.236969,...,743.0,21.0,20000.0,20.0,1.0,5.0,5.0,MCAR - 0.5,MCAR - 0.5 - 41671,0.008369
691,KNN,41671,MNAR,0.01,a9,downstream_performance_mean,F1_macro,0.240619,0.0,0.240619,...,743.0,21.0,20000.0,20.0,1.0,5.0,1.0,MNAR - 0.01,MNAR - 0.01 - 41671,0.000000
697,KNN,41671,MNAR,0.1,a9,downstream_performance_mean,F1_macro,0.240084,0.0,0.239984,...,743.0,21.0,20000.0,20.0,1.0,5.0,3.0,MNAR - 0.1,MNAR - 0.1 - 41671,0.002009
701,KNN,41671,MNAR,0.3,a9,downstream_performance_mean,F1_macro,0.239976,0.0,0.239941,...,743.0,21.0,20000.0,20.0,1.0,5.0,5.0,MNAR - 0.3,MNAR - 0.3 - 41671,0.005003


In [93]:
average_difference = average_best_total['Performance Difference to Best to Average in absolute'].mean()
print(average_difference, "average difference in absolute")

0.015514683711449376 average difference in absolute
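
The loop-based calculation above can also be written without an explicit loop. The following is only a sketch on made-up toy data: it assumes that rank 1 always corresponds to the highest `Imputed` score and that each method occurs at most once per constellation. The function name `difference_to_best` is hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for downstream_results_rank; the values are illustrative only.
toy = pd.DataFrame({
    "Data_Constellation_full": ["MAR - 0.01 - 30"] * 3 + ["MAR - 0.1 - 30"] * 3,
    "Imputation_Method": ["KNN", "VAE", "GAIN"] * 2,
    "Imputed": [0.70, 0.74, 0.68, 0.67, 0.66, 0.75],
})


def difference_to_best(df, reference_method):
    """Per constellation: best 'Imputed' score minus the reference method's score."""
    best = df.groupby("Data_Constellation_full")["Imputed"].max()
    reference = (
        df.loc[df["Imputation_Method"] == reference_method]
        .set_index("Data_Constellation_full")["Imputed"]
    )
    # Index alignment pairs each constellation's best score with its reference score.
    return (best - reference).rename("difference_to_best")


diff = difference_to_best(toy, "KNN")
mean_diff = diff.mean()
```

If a constellation lacks the reference method, the subtraction yields `NaN` for it, and `mean()` skips that entry by default.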


## Heatmap (needs to be adjusted)

In [94]:
#df_heat = pd.read_csv('downstream_results_rank_temp.csv')
df_heat = downstream_results_rank.copy()
df_heat.drop(["Missing Type", "Missing Fraction", "Column", "result_type", "metric", "Baseline", "Corrupted", "Unnamed: 0", "name", "NumberOfClasses", "MajorityClassSize", "MinorityClassSize"], axis=1, inplace=True)
#df_heat['Improvement'] = df_heat['Improvement'] - 1
df_heat

Unnamed: 0,Imputation_Method,Task,Imputed,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,Downstream Performance Rank,Data_Constellation
0,Discriminative DL,30,0.673140,11.0,5473.0,10.0,1.0,6.0,MAR - 0.01
1,Random Forest,30,0.674928,11.0,5473.0,10.0,1.0,5.0,MAR - 0.01
2,GAIN,30,0.681451,11.0,5473.0,10.0,1.0,4.0,MAR - 0.01
3,KNN,30,0.702542,11.0,5473.0,10.0,1.0,3.0,MAR - 0.01
4,Mean/Mode,30,0.738240,11.0,5473.0,10.0,1.0,2.0,MAR - 0.01
...,...,...,...,...,...,...,...,...,...
707,Discriminative DL,41671,0.240732,21.0,20000.0,20.0,1.0,5.0,MNAR - 0.5
708,Mean/Mode,41671,0.240780,21.0,20000.0,20.0,1.0,4.0,MNAR - 0.5
709,VAE,41671,0.242421,21.0,20000.0,20.0,1.0,3.0,MNAR - 0.5
710,GAIN,41671,0.243434,21.0,20000.0,20.0,1.0,2.0,MNAR - 0.5


In [95]:
# Get a dataframe for each "Data_Constellation"
# Work with variables here -> list of constellations

# Possibly use a for loop here, etc.


# drop unnecessary columns

#df_heat = downstream_results_rank
#df_heat.drop(["Missing Type", "Missing Fraction", "Column", "result_type", "metric", "Baseline", "Imputed", "Corrupted", "Unnamed: 0", "Unnamed: 0", "name", "NumberOfClasses", "MajorityClassSize", "MinorityClassSize"], axis=1, inplace=True)

#df_heat['Improvement'] = df_heat['Improvement']
df_heat = df_heat.astype({"Task":"string"})

#mar001.drop(["Missing Type", "Missing Fraction", "Column", "result_type", "metric", "Baseline", "Imputed", "Corrupted", "Unnamed: 0"], axis=1, inplace=True)

data_constellations = ['MAR - 0.01', 'MAR - 0.1', 'MAR - 0.3', 'MAR - 0.5', 'MCAR - 0.01', 'MCAR - 0.1', 'MCAR - 0.3', 'MCAR - 0.5', 'MNAR - 0.01', 'MNAR - 0.1', 'MNAR - 0.3', 'MNAR - 0.5']


for i in data_constellations:
    data_constel = df_heat.loc[df_heat['Data_Constellation'] == i]

    ### uncomment whatever you want to investigate

    ## sort by amount datapoints (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfInstances'])

    ## sort by amount of features (ascending)
    data_constel = data_constel.sort_values(by=['NumberOfFeatures'])

    ## sort by amount of datapoints and features (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfInstances', 'NumberOfFeatures'])

    ## sort by amount of categorical features and datapoints (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfCategoricalFeatures', 'NumberOfInstances'])

    ## sort by amount of numerical features and datapoints (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfNumericFeatures', 'NumberOfInstances'])
    
    Dataset_number = data_constel["Task"]
    Imputation_Method = data_constel["Imputation_Method"]
    F1_Score = data_constel["Imputed"]
    

    trace = go.Heatmap(
                   z=F1_Score,
                   x=Dataset_number,
                   y=Imputation_Method,
                   type = 'heatmap',
                    autocolorscale= False,
                    colorscale = 'Reds',
                    #zmid=0,
                    zmin=0,
                    #hoverinfo='text',
                    #text=hovertext
                    )
    data = [trace]
    fig = go.Figure(data=data)
    fig.update_layout(
        title=i,
        xaxis_nticks=36)
    fig.show()



In [96]:
downstream_results_rank_heatmap2

Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,...,name,MajorityClassSize,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses,Downstream Performance Rank,Data_Constellation
0,Discriminative DL,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.673140,0.0,0.673140,...,page-blocks,4913.0,28.0,11.0,5473.0,10.0,1.0,5.0,6.0,MAR - 0.01
1,Random Forest,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.674928,0.0,0.674928,...,page-blocks,4913.0,28.0,11.0,5473.0,10.0,1.0,5.0,5.0,MAR - 0.01
2,GAIN,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.681451,0.0,0.681451,...,page-blocks,4913.0,28.0,11.0,5473.0,10.0,1.0,5.0,4.0,MAR - 0.01
3,KNN,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.702542,0.0,0.702542,...,page-blocks,4913.0,28.0,11.0,5473.0,10.0,1.0,5.0,3.0,MAR - 0.01
4,Mean/Mode,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.738395,0.0,0.738240,...,page-blocks,4913.0,28.0,11.0,5473.0,10.0,1.0,5.0,2.0,MAR - 0.01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
707,Discriminative DL,41671,MNAR,0.5,a9,downstream_performance_mean,F1_macro,0.240118,0.0,0.240732,...,microaggregation2,11162.0,743.0,21.0,20000.0,20.0,1.0,5.0,5.0,MNAR - 0.5
708,Mean/Mode,41671,MNAR,0.5,a9,downstream_performance_mean,F1_macro,0.239275,0.0,0.240780,...,microaggregation2,11162.0,743.0,21.0,20000.0,20.0,1.0,5.0,4.0,MNAR - 0.5
709,VAE,41671,MNAR,0.5,a9,downstream_performance_mean,F1_macro,0.241995,0.0,0.242421,...,microaggregation2,11162.0,743.0,21.0,20000.0,20.0,1.0,5.0,3.0,MNAR - 0.5
710,GAIN,41671,MNAR,0.5,a9,downstream_performance_mean,F1_macro,0.239877,0.0,0.243434,...,microaggregation2,11162.0,743.0,21.0,20000.0,20.0,1.0,5.0,2.0,MNAR - 0.5


In [97]:
#df_heat = pd.read_csv('downstream_results_rank_temp.csv')
# Work on a copy so that the in-place drop does not alter downstream_results_rank_heatmap2
df_heat_dif = downstream_results_rank_heatmap2.copy()
df_heat_dif.drop(["Missing Type", "Missing Fraction", "Column", "result_type", "metric", "Baseline", "Corrupted", "Unnamed: 0", "name", "NumberOfClasses", "MajorityClassSize", "MinorityClassSize"], axis=1, inplace=True)
#df_heat['Improvement'] = df_heat['Improvement'] - 1
df_heat_dif


Unnamed: 0,Imputation_Method,Task,Imputed,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,Downstream Performance Rank,Data_Constellation
0,Discriminative DL,30,0.673140,11.0,5473.0,10.0,1.0,6.0,MAR - 0.01
1,Random Forest,30,0.674928,11.0,5473.0,10.0,1.0,5.0,MAR - 0.01
2,GAIN,30,0.681451,11.0,5473.0,10.0,1.0,4.0,MAR - 0.01
3,KNN,30,0.702542,11.0,5473.0,10.0,1.0,3.0,MAR - 0.01
4,Mean/Mode,30,0.738240,11.0,5473.0,10.0,1.0,2.0,MAR - 0.01
...,...,...,...,...,...,...,...,...,...
707,Discriminative DL,41671,0.240732,21.0,20000.0,20.0,1.0,5.0,MNAR - 0.5
708,Mean/Mode,41671,0.240780,21.0,20000.0,20.0,1.0,4.0,MNAR - 0.5
709,VAE,41671,0.242421,21.0,20000.0,20.0,1.0,3.0,MNAR - 0.5
710,GAIN,41671,0.243434,21.0,20000.0,20.0,1.0,2.0,MNAR - 0.5


In [98]:
# Calculate the difference of every imputation method to the on-average best imputation method per data constellation

# Absolute difference in F1 score -> each method vs. the on-average best method

#AVERAGE_BEST_IMPUTATION_METHOD = "KNN"

# Copy so that the columns added below do not modify downstream_results_rank
data = downstream_results_rank.copy()
data['Task'] = data['Task'].astype(str)
data['Data_Constellation_full'] = data['Data_Constellation'] + ' - ' + data['Task']

# TODO: drop unnecessary columns here
dc_unique = data.Data_Constellation_full.unique()
#print(dc_unique)

#data_constellations = ['MAR - 0.01', 'MAR - 0.1', 'MAR - 0.3', 'MCAR - 0.5', 'MCAR - 0.01', 'MCAR - 0.1', 'MCAR - 0.3', 'MCAR - 0.5', 'MNAR - 0.01', 'MNAR - 0.1', 'MNAR - 0.3', 'MNAR - 0.5']
data_constellations = dc_unique.tolist()

# EXCLUDE AVERAGE BEST FROM THIS LIST
#methods = ['KNN', 'Mean/Mode', 'VAE', 'GAIN', 'Discriminative DL']
methods = ['Random Forest', 'KNN', 'Mean/Mode', 'VAE', 'GAIN', 'Discriminative DL']

heatmap_data_difference = pd.DataFrame()


difference_rows = []

for constellation in data_constellations:
    data_constel = data.loc[data['Data_Constellation_full'] == constellation]
    average_best = data_constel.loc[data_constel['Imputation_Method'] == AVERAGE_BEST_IMPUTATION_METHOD]
    average_best_score = average_best.iloc[0]['Imputed']
    # Separate loop variable: the original reused `i` and shadowed the outer loop
    for method in methods:
        method_mask = data_constel['Imputation_Method'] == method
        if method_mask.any():
            # Mask built from data_constel itself; .copy() avoids writing to a slice
            current_score_row = data_constel.loc[method_mask].copy()
            current_score = current_score_row.iloc[0]['Imputed']
            current_score_row['Performance Difference to Average Best'] = current_score - average_best_score
            difference_rows.append(current_score_row)
        else:
            print("Imputation Method not here ---------------------")

# DataFrame.append is deprecated; concatenate the collected rows once
heatmap_data_difference = pd.concat(difference_rows)
heatmap_data_difference





A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.




Imputation Method not here ---------------------
Imputation Method not here ---------------------
Imputation Method not here ---------------------
Imputation Method not here ---------------------
Imputation Method not here ---------------------
Imputation Method not here ---------------------
Imputation Method not here ---------------------
Imputation Method not here ---------------------


Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,...,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses,Downstream Performance Rank,Data_Constellation,Data_Constellation_full,Performance Difference to Average Best
1,Random Forest,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.674928,0.0,0.674928,...,28.0,11.0,5473.0,10.0,1.0,5.0,5.0,MAR - 0.01,MAR - 0.01 - 30,-0.027615
3,KNN,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.702542,0.0,0.702542,...,28.0,11.0,5473.0,10.0,1.0,5.0,3.0,MAR - 0.01,MAR - 0.01 - 30,0.000000
4,Mean/Mode,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.738395,0.0,0.738240,...,28.0,11.0,5473.0,10.0,1.0,5.0,2.0,MAR - 0.01,MAR - 0.01 - 30,0.035698
5,VAE,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.738395,0.0,0.738395,...,28.0,11.0,5473.0,10.0,1.0,5.0,1.0,MAR - 0.01,MAR - 0.01 - 30,0.035853
2,GAIN,30,MAR,0.01,eccen,downstream_performance_mean,F1_macro,0.681451,0.0,0.681451,...,28.0,11.0,5473.0,10.0,1.0,5.0,4.0,MAR - 0.01,MAR - 0.01 - 30,-0.021092
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
711,KNN,41671,MNAR,0.5,a9,downstream_performance_mean,F1_macro,0.264992,0.0,0.265395,...,743.0,21.0,20000.0,20.0,1.0,5.0,1.0,MNAR - 0.5,MNAR - 0.5 - 41671,0.000000
708,Mean/Mode,41671,MNAR,0.5,a9,downstream_performance_mean,F1_macro,0.239275,0.0,0.240780,...,743.0,21.0,20000.0,20.0,1.0,5.0,4.0,MNAR - 0.5,MNAR - 0.5 - 41671,-0.024615
709,VAE,41671,MNAR,0.5,a9,downstream_performance_mean,F1_macro,0.241995,0.0,0.242421,...,743.0,21.0,20000.0,20.0,1.0,5.0,3.0,MNAR - 0.5,MNAR - 0.5 - 41671,-0.022974
710,GAIN,41671,MNAR,0.5,a9,downstream_performance_mean,F1_macro,0.239877,0.0,0.243434,...,743.0,21.0,20000.0,20.0,1.0,5.0,2.0,MNAR - 0.5,MNAR - 0.5 - 41671,-0.021961


In [99]:
# Get a dataframe for each "Data_Constellation"
# Work with variables here -> list of constellations

# Possibly use a for loop here, etc.


# drop unnecessary columns

#df_heat = downstream_results_rank
#df_heat.drop(["Missing Type", "Missing Fraction", "Column", "result_type", "metric", "Baseline", "Imputed", "Corrupted", "Unnamed: 0", "Unnamed: 0", "name", "NumberOfClasses", "MajorityClassSize", "MinorityClassSize"], axis=1, inplace=True)

#df_heat['Improvement'] = df_heat['Improvement']
heatmap_data_difference = heatmap_data_difference.astype({"Task":"string"})

#mar001.drop(["Missing Type", "Missing Fraction", "Column", "result_type", "metric", "Baseline", "Imputed", "Corrupted", "Unnamed: 0"], axis=1, inplace=True)

data_constellations = ['MAR - 0.01', 'MAR - 0.1', 'MAR - 0.3', 'MAR - 0.5', 'MCAR - 0.01', 'MCAR - 0.1', 'MCAR - 0.3', 'MCAR - 0.5', 'MNAR - 0.01', 'MNAR - 0.1', 'MNAR - 0.3', 'MNAR - 0.5']


for i in data_constellations:
    # Filter with the frame's own column (the original built the mask from df_heat)
    data_constel = heatmap_data_difference.loc[heatmap_data_difference['Data_Constellation'] == i]

    ### uncomment whatever you want to investigate

    ## sort by amount datapoints (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfInstances'])

    ## sort by amount of features (ascending)
    data_constel = data_constel.sort_values(by=['NumberOfFeatures'])

    ## sort by amount of datapoints and features (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfInstances', 'NumberOfFeatures'])

    ## sort by amount of categorical features and datapoints (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfCategoricalFeatures', 'NumberOfInstances'])

    ## sort by amount of numerical features and datapoints (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfNumericFeatures', 'NumberOfInstances'])
    
    Dataset_number = data_constel["Task"]
    Imputation_Method = data_constel["Imputation_Method"]
    Improvement = data_constel["Performance Difference to Average Best"]
    

    trace = go.Heatmap(
                   z=Improvement,
                   x=Dataset_number,
                   y=Imputation_Method,
                   type = 'heatmap',
                    autocolorscale= False,
                    colorscale = 'RdBu_r',
                    zmid=0,
                    #hoverinfo='text',
                    #text=hovertext
                    )
    data = [trace]
    fig = go.Figure(data=data)
    fig.update_layout(
        title=i,
        xaxis_nticks=36)
    fig.show()
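
The heatmap cells above pass long-format columns straight to the trace. An alternative that makes the grid explicit is to pivot to a method-by-task matrix first. This is a sketch on made-up values, assuming exactly one row per (method, task) pair within a constellation; `pivot` raises a `ValueError` if a pair occurs more than once.

```python
import pandas as pd

# Toy long-format data; column names mirror the notebook, values are made up.
long_df = pd.DataFrame({
    "Task": ["30", "30", "41671", "41671"],
    "Imputation_Method": ["KNN", "VAE", "KNN", "VAE"],
    "Performance Difference to Average Best": [0.0, 0.04, 0.0, -0.02],
})

# Rows become imputation methods, columns become tasks.
matrix = long_df.pivot(
    index="Imputation_Method",
    columns="Task",
    values="Performance Difference to Average Best",
)

# The pieces a heatmap trace needs: z as a 2-D array plus the axis labels.
z = matrix.to_numpy()
x = matrix.columns.tolist()
y = matrix.index.tolist()
```

The resulting `z`, `x`, and `y` can then be handed to `go.Heatmap` in place of the three Series; missing (method, task) combinations show up as `NaN` cells instead of being silently absent.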

## Plotly Heatmaps Tests

In [100]:
#heatmap_mar001.head()


In [101]:
#testmar001 = xr.tutorial.open_dataset('air_temperature').air.sel(lon=250.0)

'''
#plotly express test

fig = px.imshow(heatmap_mar001, text_auto = True, 
                labels=dict(x="Task", y="Imputation_Method", color="Improvement"),
                color_continuous_scale='RdBu_r', color_continuous_midpoint=0)
'''

'\n#plotly express test\n\nfig = px.imshow(heatmap_mar001, text_auto = True, \n                labels=dict(x="Task", y="Imputation_Method", color="Improvement"),\n                color_continuous_scale=\'RdBu_r\', color_continuous_midpoint=0)\n'

In [102]:
### uncomment whatever you want to investigate

## sort by amount datapoints (ascending)
#mar001 = mar001.sort_values(by=['NumberOfInstances'])

## sort by amount of features (ascending)
#mar001 = mar001.sort_values(by=['NumberOfFeatures'])

## sort by amount of datapoints and features (ascending)
#mar001 = mar001.sort_values(by=['NumberOfInstances', 'NumberOfFeatures'])

## sort by amount of categorical features and datapoints (ascending)
#mar001 = mar001.sort_values(by=['NumberOfCategoricalFeatures', 'NumberOfInstances'])

## sort by amount of numerical features and datapoints (ascending)
#mar001 = mar001.sort_values(by=['NumberOfNumericFeatures', 'NumberOfInstances'])



mar001 = mar001.astype({"Task":"string"})

Dataset_number = mar001["Task"]
Imputation_Method = mar001["Imputation_Method"]
Improvement = mar001["Improvement"]




NameError: name 'mar001' is not defined

In [None]:
trace = go.Heatmap(
                   z=Improvement,
                   x=Dataset_number,
                   y=Imputation_Method,
                   type = 'heatmap',
                    autocolorscale= False,
                    colorscale = 'RdBu_r',
                    zmid=0,
                    hoverinfo='text',
                    text=hovertext
                    )




data = [trace]
fig = go.Figure(data=data)
#iplot(fig)


fig.show()

To-dos for the visualization:
- Options for easily adjusting the sorting/display:
    - Number of data points
    - Number of features
    - Number of numerical features
    - Number of categorical features
- Set up a loop over all data constellations (do not hard-code them)
- Show the best imputation method per dataset separately in an additional heatmap
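
The first two items could be approached roughly as follows. This is a sketch only: `SORT_OPTIONS` and `iter_constellations` are hypothetical names, the toy values are made up, and the column names are taken from the tables shown earlier in this notebook.

```python
import pandas as pd

# Hypothetical mapping from a short option name to the columns to sort by.
SORT_OPTIONS = {
    "instances": ["NumberOfInstances"],
    "features": ["NumberOfFeatures"],
    "numeric": ["NumberOfNumericFeatures", "NumberOfInstances"],
    "categorical": ["NumberOfCategoricalFeatures", "NumberOfInstances"],
}


def iter_constellations(df, sort_by="features"):
    """Yield (constellation, sorted sub-frame) for every constellation present in df."""
    for constellation, group in df.groupby("Data_Constellation", sort=True):
        yield constellation, group.sort_values(by=SORT_OPTIONS[sort_by])


# Toy data; values are illustrative only.
toy = pd.DataFrame({
    "Data_Constellation": ["MAR - 0.5", "MAR - 0.01", "MAR - 0.5"],
    "Task": ["41671", "30", "30"],
    "NumberOfFeatures": [21.0, 11.0, 11.0],
    "NumberOfInstances": [20000.0, 5473.0, 5473.0],
    "NumberOfNumericFeatures": [20.0, 10.0, 10.0],
    "NumberOfCategoricalFeatures": [1.0, 1.0, 1.0],
})

ordered = {name: frame["Task"].tolist() for name, frame in iter_constellations(toy)}
```

Because the constellations come from the data itself, a typo in a hand-written list cannot cause one constellation to be skipped or plotted twice.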



To-dos for the remaining analysis (mathematical part):
- Determine the best imputation method per dataset (ideally via ranking, per constellation, e.g. MAR 0.01)
- Determine the average rank of each imputation method (ranking, then per constellation, e.g. MAR 0.01)
- Compare the best imputation method per dataset with the on-average best imputation method (list of best methods and list of on-average best methods -> compare)
(each constellation appears exactly once in each list)
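
One possible way to compute the first two items with pandas is sketched below on made-up values. It assumes that a higher `Imputed` score is better and that ties share the lowest rank (`method="min"`); how the existing `Downstream Performance Rank` column treats ties is not shown here, so this may differ from it.

```python
import pandas as pd

# Toy data; values are illustrative only.
toy = pd.DataFrame({
    "Data_Constellation": ["MAR - 0.01"] * 4,
    "Task": ["30", "30", "41671", "41671"],
    "Imputation_Method": ["KNN", "VAE", "KNN", "VAE"],
    "Imputed": [0.70, 0.74, 0.26, 0.24],
})

# Rank within each (constellation, task) group; rank 1 is the highest score.
toy["rank"] = (
    toy.groupby(["Data_Constellation", "Task"])["Imputed"]
    .rank(ascending=False, method="min")
)

# Best method per dataset and constellation.
best_per_dataset = toy.loc[
    toy["rank"] == 1.0, ["Data_Constellation", "Task", "Imputation_Method"]
]

# Average rank of each method per constellation.
average_rank = toy.groupby(["Data_Constellation", "Imputation_Method"])["rank"].mean()
```

The method with the lowest value in `average_rank` for a constellation is the on-average best one, which can then be compared with `best_per_dataset`.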



Miscellaneous (no priority)
- Options for filtering (implement if needed; no priority for now):
    - A numerical feature was imputed
    - A categorical feature was imputed

## Application Scenario 2 - Downstream Performance

### Categorical  Columns (Classification)

In [None]:
'''
draw_cat_box_plot(
    downstream_results,
    "Improvement",
    (-0.15, 0.3),
    FIGURES_PATH,
    "fully_observed_downstream_boxplot.eps",
    hue_order=list(rename_imputer_dict.values()),
    row_order=list(rename_metric_dict.values())
)
'''
# Not used at the moment -> function from other file required, check first field