# Visualize Results: Downstream Performance - Binary Classification Corrupted Experiments

This notebook should answer the question: *Does imputation lead to better downstream performance?*

The data must be preprocessed in a separate notebook. Here we only import two CSV files: the raw results of the experiment and information about the datasets used.

## Notebook Structure 

* Application Scenario 2 - Downstream Performance  
   * Categorical Columns (Classification)
   * Numerical Columns (Regression)
   * Heterogeneous Columns (Classification and Regression Combined)

In [71]:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import os
import pandas as pd
import re
import seaborn as sns

from pathlib import Path

import plotly as py
import plotly.express as px
import plotly.graph_objects as go
import xarray as xr


%matplotlib inline

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Settings

In [72]:
sns.set(style="whitegrid")
sns.set_context('paper', font_scale=1.5)
mpl.rcParams['lines.linewidth'] = '2'

In [73]:
CLF_METRIC = "Classification Tasks"
REG_METRIC = "Regression Tasks"

DOWNSTREAM_RESULT_TYPE = "downstream_performance_mean"
IMPUTE_RESULT_TYPE = "impute_performance_mean"

FIGURES_PATH = Path("../paper/figures/")

## Data Preparation

In [74]:
# Read the results CSV file here.

# Pick whether you want to analyze the "Regression" experiment or the "Regression Corrupted" experiment

#results = pd.read_csv('regression_corrupted.csv')
results = pd.read_csv('../binary_classification_corrupted.csv')
# results.head()
results

Unnamed: 0,experiment,imputer,task,missing_type,missing_fraction,strategy,column,result_type,metric,train,test,baseline,corrupted,imputed
0,corrupted_binary_experiment,KNNImputer,42192,MNAR,0.50,single_single,age,impute_performance_std,MAE,3.011733,3.797001,,,
1,corrupted_binary_experiment,KNNImputer,42192,MNAR,0.50,single_single,age,impute_performance_std,MSE,50.362390,56.332410,,,
2,corrupted_binary_experiment,KNNImputer,42192,MNAR,0.50,single_single,age,impute_performance_std,RMSE,2.920725,3.800179,,,
3,corrupted_binary_experiment,KNNImputer,42192,MNAR,0.30,single_single,age,impute_performance_std,MAE,2.738637,2.008460,,,
4,corrupted_binary_experiment,KNNImputer,42192,MNAR,0.30,single_single,age,impute_performance_std,MSE,34.927790,23.464325,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8575,corrupted_binary_experiment,AutoKerasImputer,42493,MCAR,0.01,single_single,Length,downstream_performance_mean,F1_macro,,,0.625302,0.0,0.625929
8576,corrupted_binary_experiment,AutoKerasImputer,42493,MCAR,0.01,single_single,Length,downstream_performance_mean,F1_weighted,,,0.636242,0.0,0.636842
8577,corrupted_binary_experiment,AutoKerasImputer,42493,MCAR,0.10,single_single,Length,downstream_performance_mean,F1_micro,,,0.648313,0.0,0.648189
8578,corrupted_binary_experiment,AutoKerasImputer,42493,MCAR,0.10,single_single,Length,downstream_performance_mean,F1_macro,,,0.626519,0.0,0.626459


In [75]:
# Filtering the relevant data for downstream analysis

na_impute_results = results[
    (results["result_type"] == IMPUTE_RESULT_TYPE) &
    (results["metric"].isin(["F1_macro", "RMSE"]))
]
# drop() without inplace returns a new frame, which avoids pandas' SettingWithCopyWarning
na_impute_results = na_impute_results.drop(["baseline", "corrupted", "imputed"], axis=1)
# keep only the rows that still contain a missing value
na_impute_results = na_impute_results[na_impute_results.isna().any(axis=1)]
na_impute_results.shape



(2, 11)

In [76]:
# check if strategy type is correct!
STRATEGY_TYPE = "single_single"

downstream_results = results[
    (results["result_type"] == DOWNSTREAM_RESULT_TYPE) &
    (results["metric"].isin(["F1_macro", "RMSE"])) &
    (results["strategy"] == STRATEGY_TYPE)
]

# remove experiments where imputation failed
downstream_results = downstream_results.merge(
    na_impute_results,
    how = "left",
    validate = "one_to_one",
    indicator = True,
    suffixes=("", "_imp"),
    on = ["experiment", "imputer", "task", "missing_type", "missing_fraction", "strategy", "column"]
)
downstream_results = downstream_results[downstream_results["_merge"]=="left_only"]

assert len(results["strategy"].unique()) == 1
downstream_results.drop(["experiment", "strategy", "result_type_imp", "metric_imp", "train", "test", "train_imp", "test_imp", "_merge"], axis=1, inplace=True)

downstream_results = downstream_results.rename(
    {
        "imputer": "Imputation_Method",
        "task": "Task",
        "missing_type": "Missing Type",
        "missing_fraction": "Missing Fraction",
        "column": "Column",
        "baseline": "Baseline",
        "imputed": "Imputed",
        "corrupted": "Corrupted"
    },
    axis = 1
)
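
The step that removes failed imputations is an anti-join: a left merge with `indicator=True`, followed by keeping only the `left_only` rows. Below is a minimal sketch of that pattern on toy data. The column names follow the results file, but the values and the failed run are made up for illustration.

```python
import pandas as pd

# Toy data: values are invented, only the column names mirror the notebook.
downstream = pd.DataFrame({
    "imputer": ["KNNImputer", "ModeImputer", "GAINImputer"],
    "task": [1, 1, 1],
    "imputed": [0.65, 0.64, 0.10],
})
# Hypothetical list of runs whose imputation failed.
failed = pd.DataFrame({"imputer": ["GAINImputer"], "task": [1]})

merged = downstream.merge(
    failed,
    how="left",
    on=["imputer", "task"],
    validate="one_to_one",  # raises if a key is duplicated on either side
    indicator=True,         # adds a "_merge" column describing each row's origin
)

# "left_only" marks rows without a partner in `failed`, i.e. successful runs.
kept = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(kept)
```

The `validate="one_to_one"` argument acts as a safety net here: if a key combination appeared twice, the merge would fail loudly instead of silently duplicating rows.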

In [77]:
rename_imputer_dict = {
    "ModeImputer": "Mean/Mode",
    "KNNImputer": "KNN",
    "ForestImputer": "Random Forest",
    "AutoKerasImputer": "Discriminative DL",
    "VAEImputer": "VAE",
    "GAINImputer": "GAIN"    
}

rename_metric_dict = {
    "F1_macro": CLF_METRIC,
    "RMSE": REG_METRIC
}

downstream_results = downstream_results.replace(rename_imputer_dict)
downstream_results = downstream_results.replace(rename_metric_dict)

downstream_results

Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed
0,KNN,42192,MNAR,0.50,age,downstream_performance_mean,Classification Tasks,0.645845,0.0,0.652948
1,KNN,42192,MNAR,0.30,age,downstream_performance_mean,Classification Tasks,0.645845,0.0,0.645791
2,KNN,42192,MNAR,0.01,age,downstream_performance_mean,Classification Tasks,0.663180,0.0,0.663755
3,KNN,42192,MNAR,0.10,age,downstream_performance_mean,Classification Tasks,0.657712,0.0,0.656976
4,KNN,42192,MAR,0.50,age,downstream_performance_mean,Classification Tasks,0.652208,0.0,0.649844
...,...,...,...,...,...,...,...,...,...,...
710,Discriminative DL,42493,MAR,0.10,Length,downstream_performance_mean,Classification Tasks,0.626267,0.0,0.626262
711,Discriminative DL,42493,MCAR,0.50,Length,downstream_performance_mean,Classification Tasks,0.622848,0.0,0.623964
712,Discriminative DL,42493,MCAR,0.30,Length,downstream_performance_mean,Classification Tasks,0.624424,0.0,0.624357
713,Discriminative DL,42493,MCAR,0.01,Length,downstream_performance_mean,Classification Tasks,0.625302,0.0,0.625929


### Robustness: check which imputers yielded `NaN` values

In [78]:
for col in downstream_results.columns:
    na_sum = downstream_results[col].isna().sum()
    if na_sum > 0:
        print("-----" * 10)        
        print(col, na_sum)
        print("-----" * 10)        
        na_idx = downstream_results[col].isna()
        print(downstream_results.loc[na_idx, "Imputation_Method"].value_counts(dropna=False))
        print("\n")

## Compute Downstream Performance relative to Baseline

In [79]:
clf_row_idx = downstream_results["metric"] == CLF_METRIC
reg_row_idx = downstream_results["metric"] == REG_METRIC

In [80]:
#downstream_results["Improvement"]   = (downstream_results["Imputed"] - downstream_results["Baseline"]  ) / downstream_results["Baseline"]
#downstream_results.loc[reg_row_idx, "Improvement"]   = downstream_results.loc[reg_row_idx, "Improvement"]   * -1

#mar001.drop(["Missing Type", "Missing Fraction", "Column", "result_type", "metric", "Baseline", "Imputed", "Corrupted", "Unnamed: 0"], axis=1, inplace=True)

#print(downstream_results)
#downstream_results.to_csv('downstream_results.csv')
downstream_results.head()

Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed
0,KNN,42192,MNAR,0.5,age,downstream_performance_mean,Classification Tasks,0.645845,0.0,0.652948
1,KNN,42192,MNAR,0.3,age,downstream_performance_mean,Classification Tasks,0.645845,0.0,0.645791
2,KNN,42192,MNAR,0.01,age,downstream_performance_mean,Classification Tasks,0.66318,0.0,0.663755
3,KNN,42192,MNAR,0.1,age,downstream_performance_mean,Classification Tasks,0.657712,0.0,0.656976
4,KNN,42192,MAR,0.5,age,downstream_performance_mean,Classification Tasks,0.652208,0.0,0.649844
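
The commented-out lines in the cell above describe the relative improvement over the baseline: `(Imputed - Baseline) / Baseline`, with the sign flipped for regression rows because a lower RMSE is better. A minimal sketch of that calculation follows. The helper name `add_improvement` is an assumption, the first row reuses values from the table above, and the regression row is made up.

```python
import pandas as pd

CLF_METRIC = "Classification Tasks"
REG_METRIC = "Regression Tasks"

# First row: values from the table above. Second row: invented regression example.
df = pd.DataFrame({
    "metric": [CLF_METRIC, REG_METRIC],
    "Baseline": [0.645845, 2.0],
    "Imputed": [0.652948, 1.5],
})


def add_improvement(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a relative 'Improvement' column (hypothetical helper)."""
    out = frame.copy()
    out["Improvement"] = (out["Imputed"] - out["Baseline"]) / out["Baseline"]
    # For error metrics such as RMSE, smaller is better, so flip the sign.
    reg_rows = out["metric"] == REG_METRIC
    out.loc[reg_rows, "Improvement"] *= -1
    return out


result = add_improvement(df)
print(result)
```

With this convention, a positive value always means that imputation helped, regardless of the metric.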


## Adding Dataset Info, Sorting and Ranking

In [81]:
# Sort the data

#downstream_results_full_sort = pd.read_csv('downstream_results.csv')
downstream_results_full_sort = downstream_results

#df = sns.load_dataset('impute_results_full')
#downstream_results_full_sort = downstream_results_full_sort.replace('$k$-NN','KNN')
#impute_results_full_sort.head()

#impute_results_full_sort = impute_results_full_sort.sort_values(['Task'], ascending=[True])
downstream_results_full_sort = downstream_results_full_sort.sort_values(['Task', 'Missing Type', 'Missing Fraction', 'Imputed'], ascending=[True, True, True, True])
#print(downstream_results_full_sort)
downstream_results_full_sort.head()


#downstream_results_full_sort.to_csv('downstream_results_full_sort.csv')

Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed
292,Mean/Mode,137,MAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674145,0.0,0.674116
649,Discriminative DL,137,MAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674505,0.0,0.674251
532,VAE,137,MAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674729,0.0,0.674488
54,KNN,137,MAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674729,0.0,0.67476
412,Random Forest,137,MAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674909,0.0,0.67476


In [82]:
# add dataset information from other csv file

dataset_info = pd.read_csv('../datasets_information_overview.csv')
dataset_info = dataset_info.rename(columns={"did": "Task"})


downstream_results_full_sort = pd.merge(downstream_results_full_sort, dataset_info, on='Task')
#downstream_results_full_sort.to_csv('downstream_results_full_sort_testtesttest.csv')
downstream_results_full_sort.head()

Unnamed: 0.1,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,Unnamed: 0,name,MajorityClassSize,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses
0,Mean/Mode,137,MAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674145,0.0,0.674116,34,BNG(tic-tac-toe),25702.0,13664.0,10.0,39366.0,0.0,10.0,
1,Discriminative DL,137,MAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674505,0.0,0.674251,34,BNG(tic-tac-toe),25702.0,13664.0,10.0,39366.0,0.0,10.0,
2,VAE,137,MAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674729,0.0,0.674488,34,BNG(tic-tac-toe),25702.0,13664.0,10.0,39366.0,0.0,10.0,
3,KNN,137,MAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674729,0.0,0.67476,34,BNG(tic-tac-toe),25702.0,13664.0,10.0,39366.0,0.0,10.0,
4,Random Forest,137,MAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674909,0.0,0.67476,34,BNG(tic-tac-toe),25702.0,13664.0,10.0,39366.0,0.0,10.0,


In [83]:
# Ranking of downstream performance per data constellation

EXPERIMENTAL_CONDITIONS = ["Task", "Missing Type", "Missing Fraction", "Column", "result_type"]

downstream_results_rank = downstream_results_full_sort

#clf_row_idx = impute_results["metric"] == CLF_METRIC
#reg_row_idx = impute_results["metric"] == REG_METRIC

# Select "Imputed" before ranking so that only this numeric column is ranked
downstream_results_rank["Downstream Performance Rank"] = downstream_results_rank.groupby(EXPERIMENTAL_CONDITIONS)["Imputed"].rank(ascending=False, na_option="bottom", method="min")
downstream_results_rank.to_csv('downstream_results_complete_overview.csv')
downstream_results_rank.head()


Unnamed: 0.1,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,Unnamed: 0,name,MajorityClassSize,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses,Downstream Performance Rank
0,Mean/Mode,137,MAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674145,0.0,0.674116,34,BNG(tic-tac-toe),25702.0,13664.0,10.0,39366.0,0.0,10.0,,5.0
1,Discriminative DL,137,MAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674505,0.0,0.674251,34,BNG(tic-tac-toe),25702.0,13664.0,10.0,39366.0,0.0,10.0,,4.0
2,VAE,137,MAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674729,0.0,0.674488,34,BNG(tic-tac-toe),25702.0,13664.0,10.0,39366.0,0.0,10.0,,3.0
3,KNN,137,MAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674729,0.0,0.67476,34,BNG(tic-tac-toe),25702.0,13664.0,10.0,39366.0,0.0,10.0,,2.0
4,Random Forest,137,MAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674909,0.0,0.67476,34,BNG(tic-tac-toe),25702.0,13664.0,10.0,39366.0,0.0,10.0,,1.0
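
To make the ranking options concrete, here is a small sketch on toy data (all values made up). It shows how `method="min"` treats ties and how `na_option="bottom"` places missing scores last within each group.

```python
import pandas as pd

# Toy scores: methods B and C tie in task 1, method D has no score.
scores = pd.DataFrame({
    "Task": [1, 1, 1, 1, 2, 2],
    "Imputation_Method": ["A", "B", "C", "D", "A", "B"],
    "Imputed": [0.70, 0.80, 0.80, None, 0.55, 0.60],
})

scores["Rank"] = scores.groupby("Task")["Imputed"].rank(
    ascending=False,     # higher score -> better (smaller) rank
    na_option="bottom",  # missing scores are ranked last
    method="min",        # tied scores share the smallest rank of the tie
)
print(scores)
```

In task 1, B and C both receive rank 1, A follows with rank 3 (rank 2 is skipped), and D ends up last with rank 4. Because ties share rank 1, a single experimental condition can contribute more than one "win".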


In [84]:
# Merge the two columns "Missing Type" and "Missing Fraction"

downstream_results_rank['Missing Type'] = downstream_results_rank['Missing Type'].astype(str)
downstream_results_rank['Missing Fraction'] = downstream_results_rank['Missing Fraction'].astype(str)
datatype_new = downstream_results_rank.dtypes
#print(datatype_new)


downstream_results_rank['Data_Constellation'] = downstream_results_rank['Missing Type'] + ' - ' + downstream_results_rank['Missing Fraction']
downstream_results_rank.to_csv('downstream_results_rank_temp.csv')
downstream_results_rank_heatmap2 = downstream_results_rank.copy()
downstream_results_rank.head()


Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,...,name,MajorityClassSize,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses,Downstream Performance Rank,Data_Constellation
0,Mean/Mode,137,MAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674145,0.0,0.674116,...,BNG(tic-tac-toe),25702.0,13664.0,10.0,39366.0,0.0,10.0,,5.0,MAR - 0.01
1,Discriminative DL,137,MAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674505,0.0,0.674251,...,BNG(tic-tac-toe),25702.0,13664.0,10.0,39366.0,0.0,10.0,,4.0,MAR - 0.01
2,VAE,137,MAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674729,0.0,0.674488,...,BNG(tic-tac-toe),25702.0,13664.0,10.0,39366.0,0.0,10.0,,3.0,MAR - 0.01
3,KNN,137,MAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674729,0.0,0.67476,...,BNG(tic-tac-toe),25702.0,13664.0,10.0,39366.0,0.0,10.0,,2.0,MAR - 0.01
4,Random Forest,137,MAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674909,0.0,0.67476,...,BNG(tic-tac-toe),25702.0,13664.0,10.0,39366.0,0.0,10.0,,1.0,MAR - 0.01


## Analyzing Performance based on Rank and Improvement per Data Constellation

The calculation here: best result per experimental condition minus the result of the method that is best on average.

To-dos for the remaining analysis (mathematical part):
- Determine the best imputation method per dataset (preferably via the ranking, per constellation, e.g. MAR 0.01)
- Determine the average rank of each imputation method (via the ranking, then per constellation, e.g. MAR 0.01)
- Compare the best imputation method per dataset with the imputation method that is best on average (one list with the best method and one list with the average-best method, then compare; each constellation appears exactly once in each list)
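
The three steps listed above can be sketched on a toy ranking table as follows. The ranks are made up and the column name `Rank` is shortened for illustration; this is a simplified outline, not the exact computation used in the cells below.

```python
import pandas as pd

# Toy ranks for three constellations and three methods (values invented).
ranks = pd.DataFrame({
    "Data_Constellation": [
        "MAR - 0.01", "MAR - 0.01", "MAR - 0.01",
        "MCAR - 0.5", "MCAR - 0.5", "MCAR - 0.5",
        "MNAR - 0.1", "MNAR - 0.1", "MNAR - 0.1",
    ],
    "Imputation_Method": ["KNN", "Random Forest", "Mean/Mode"] * 3,
    "Rank": [1, 2, 3, 2, 1, 3, 3, 1, 2],
})

# Step 1: best method per constellation (the rank-1 rows).
best_per_constellation = (
    ranks.loc[ranks["Rank"] == 1]
    .set_index("Data_Constellation")["Imputation_Method"]
)

# Step 2: average rank of every method across all constellations.
average_rank = ranks.groupby("Imputation_Method")["Rank"].mean()
best_on_average = average_rank.idxmin()

# Step 3: compare the per-constellation winner with the average-best method.
agreement = best_per_constellation == best_on_average

print(average_rank.sort_values())
print(best_on_average)
print(agreement)
```

Each constellation appears exactly once in `best_per_constellation` as long as there are no ties for rank 1. With ties, `set_index` would produce duplicate index entries, which a real analysis would have to handle explicitly.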


In [85]:
data = downstream_results_rank

# Count the number of distinct data constellations in column "Data_Constellation"
dc_unique = data.Data_Constellation.unique().size
print(dc_unique, "Data Constellations")
print("_____________________")
# Count how often each rank occurs in column "Downstream Performance Rank" (numbers must match)
rank_count = data['Downstream Performance Rank'].value_counts()
print(rank_count)
print("_____________________")
# Filter for 1.0 Ranking -> Overview -> save as csv
rank_1 = data.loc[data['Downstream Performance Rank'] == 1.0]
rank_1.to_csv('rank_1.csv')

print("_____________________")
# Count how often each Imputation Method is present -> most "wins"
rank_wins = rank_1['Imputation_Method'].value_counts()
print(rank_wins)
print("_____________________")
# Take initial overview and filter for each imputation method and calculate average rank
methods = ['Random Forest', 'KNN', 'Mean/Mode', 'VAE', 'GAIN', 'Discriminative DL']
for i in methods:
    df_average_rank = data.loc[data['Imputation_Method'] == i]
    len_ar = len(df_average_rank)
    print(len_ar, "Amount of results available")
    rank_pos = df_average_rank['Downstream Performance Rank'].value_counts().sort_index(ascending=True)
    print(rank_pos)
    average_rank = df_average_rank["Downstream Performance Rank"].mean()
    print("Average Rank for", i, "is", average_rank)
    #average_improvement = df_average_rank["Imputed"].mean()
    #print("Average Improvement to baseline is", average_improvement)
    print("_____________________")



12 Data Constellations
_____________________
1.0    144
2.0    123
4.0    121
5.0    111
3.0    110
6.0    104
Name: Downstream Performance Rank, dtype: int64
_____________________
_____________________
Random Forest        38
KNN                  26
VAE                  24
GAIN                 22
Discriminative DL    21
Mean/Mode            13
Name: Imputation_Method, dtype: int64
_____________________
120 Amount of results available
1.0    38
2.0    32
3.0    17
4.0    16
5.0    11
6.0     6
Name: Downstream Performance Rank, dtype: int64
Average Rank for Random Forest is 2.566666666666667
_____________________
120 Amount of results available
1.0    26
2.0    23
3.0    30
4.0    17
5.0    13
6.0    11
Name: Downstream Performance Rank, dtype: int64
Average Rank for KNN is 3.0083333333333333
_____________________
120 Amount of results available
1.0    13
2.0     9
3.0    16
4.0    30
5.0    25
6.0    27
Name: Downstream Performance Rank, dtype: int64
Average Rank for Mean/Mode is 4.05

In [86]:
rank_1.head()

Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,...,name,MajorityClassSize,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses,Downstream Performance Rank,Data_Constellation
4,Random Forest,137,MAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674909,0.0,0.67476,...,BNG(tic-tac-toe),25702.0,13664.0,10.0,39366.0,0.0,10.0,,1.0,MAR - 0.01
10,KNN,137,MAR,0.1,top-middle-square,downstream_performance_mean,Classification Tasks,0.673672,0.0,0.674265,...,BNG(tic-tac-toe),25702.0,13664.0,10.0,39366.0,0.0,10.0,,1.0,MAR - 0.1
16,Random Forest,137,MAR,0.3,top-middle-square,downstream_performance_mean,Classification Tasks,0.674084,0.0,0.674124,...,BNG(tic-tac-toe),25702.0,13664.0,10.0,39366.0,0.0,10.0,,1.0,MAR - 0.3
22,KNN,137,MAR,0.5,top-middle-square,downstream_performance_mean,Classification Tasks,0.677718,0.0,0.678583,...,BNG(tic-tac-toe),25702.0,13664.0,10.0,39366.0,0.0,10.0,,1.0,MAR - 0.5
27,VAE,137,MCAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674617,0.0,0.674663,...,BNG(tic-tac-toe),25702.0,13664.0,10.0,39366.0,0.0,10.0,,1.0,MCAR - 0.01


In [87]:
# Take the initial overview, filter for the method that is best on average, and take the filtered dataframe with the 1.0 ranking
# Where Data_Constellation is identical -> Ranking 1.0 [Improvement] - Best_Imp_Method [Improvement]
# Write the difference to a separate column -> calculate the average improvement

AVERAGE_BEST_IMPUTATION_METHOD = "Random Forest"


# Adjust the following depending on the previous results
# .copy() creates an independent frame, so the assignments below do not trigger SettingWithCopyWarning
av_best = data.loc[data['Imputation_Method'] == AVERAGE_BEST_IMPUTATION_METHOD].copy()
av_best['Task'] = av_best['Task'].astype(str)
av_best['Data_Constellation'] = av_best['Data_Constellation'] + ' - ' + av_best['Task']

#av_best = av_best[['Imputation_Method', 'Imputed', 'Data_Constellation', 'Downstream Performance Rank']]
av_best = av_best.rename(columns={'Imputation_Method':'Imputation_Method_average', 
                               'Imputed':'Imputed_average',
                                 'Downstream Performance Rank':'Downstream Performance Rank Average'})

#av_best.head()

rank_1 = rank_1.copy()
rank_1['Task'] = rank_1['Task'].astype(str)
rank_1['Data_Constellation'] = rank_1['Data_Constellation'] + ' - ' + rank_1['Task']
rank_1 = rank_1[['Imputation_Method', 'Imputed', 'Data_Constellation', 'Downstream Performance Rank']]
rank_1 = rank_1.rename(columns={'Imputation_Method':'Imputation_Method_best', 
                               'Imputed':'Imputed_best',
                               'Downstream Performance Rank':'Downstream Performance Rank Best'})

performance_difference = pd.merge(av_best, rank_1, on='Data_Constellation')
performance_difference.head()

Unnamed: 0,Imputation_Method_average,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed_average,...,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses,Downstream Performance Rank Average,Data_Constellation,Imputation_Method_best,Imputed_best,Downstream Performance Rank Best
0,Random Forest,137,MAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674909,0.0,0.67476,...,10.0,39366.0,0.0,10.0,,1.0,MAR - 0.01 - 137,Random Forest,0.67476,1.0
1,Random Forest,137,MAR,0.1,top-middle-square,downstream_performance_mean,Classification Tasks,0.673767,0.0,0.673526,...,10.0,39366.0,0.0,10.0,,4.0,MAR - 0.1 - 137,KNN,0.674265,1.0
2,Random Forest,137,MAR,0.3,top-middle-square,downstream_performance_mean,Classification Tasks,0.674084,0.0,0.674124,...,10.0,39366.0,0.0,10.0,,1.0,MAR - 0.3 - 137,Random Forest,0.674124,1.0
3,Random Forest,137,MAR,0.5,top-middle-square,downstream_performance_mean,Classification Tasks,0.665708,0.0,0.665258,...,10.0,39366.0,0.0,10.0,,4.0,MAR - 0.5 - 137,KNN,0.678583,1.0
4,Random Forest,137,MCAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674273,0.0,0.674334,...,10.0,39366.0,0.0,10.0,,2.0,MCAR - 0.01 - 137,VAE,0.674663,1.0


In [88]:
#performance_difference['Imputed_best'] = performance_difference['Imputed_best'] + 1
#performance_difference['Imputed_average'] = performance_difference['Imputed_average'] + 1

performance_difference['Performance Difference Best to Average'] = performance_difference['Imputed_best'] - performance_difference['Imputed_average']
Average_Difference = performance_difference['Performance Difference Best to Average'].mean()
print("Average Difference in Improvement from best method to average best method for F1", Average_Difference)


Average Difference in Improvement from best method to average best method for F1 0.006912684257484671


In [89]:

performance_difference.to_csv('performance_difference.csv')
performance_difference

Unnamed: 0,Imputation_Method_average,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed_average,...,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses,Downstream Performance Rank Average,Data_Constellation,Imputation_Method_best,Imputed_best,Downstream Performance Rank Best,Performance Difference Best to Average
0,Random Forest,137,MAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674909,0.0,0.674760,...,39366.0,0.0,10.0,,1.0,MAR - 0.01 - 137,Random Forest,0.674760,1.0,0.000000
1,Random Forest,137,MAR,0.1,top-middle-square,downstream_performance_mean,Classification Tasks,0.673767,0.0,0.673526,...,39366.0,0.0,10.0,,4.0,MAR - 0.1 - 137,KNN,0.674265,1.0,0.000739
2,Random Forest,137,MAR,0.3,top-middle-square,downstream_performance_mean,Classification Tasks,0.674084,0.0,0.674124,...,39366.0,0.0,10.0,,1.0,MAR - 0.3 - 137,Random Forest,0.674124,1.0,0.000000
3,Random Forest,137,MAR,0.5,top-middle-square,downstream_performance_mean,Classification Tasks,0.665708,0.0,0.665258,...,39366.0,0.0,10.0,,4.0,MAR - 0.5 - 137,KNN,0.678583,1.0,0.013326
4,Random Forest,137,MCAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674273,0.0,0.674334,...,39366.0,0.0,10.0,,2.0,MCAR - 0.01 - 137,VAE,0.674663,1.0,0.000329
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
139,Random Forest,42493,MCAR,0.5,Length,downstream_performance_mean,Classification Tasks,0.625924,0.0,0.624029,...,26969.0,2.0,6.0,,1.0,MCAR - 0.5 - 42493,Random Forest,0.624029,1.0,0.000000
140,Random Forest,42493,MNAR,0.01,Length,downstream_performance_mean,Classification Tasks,0.627164,0.0,0.626956,...,26969.0,2.0,6.0,,5.0,MNAR - 0.01 - 42493,VAE,0.627354,1.0,0.000398
141,Random Forest,42493,MNAR,0.1,Length,downstream_performance_mean,Classification Tasks,0.625641,0.0,0.625463,...,26969.0,2.0,6.0,,5.0,MNAR - 0.1 - 42493,VAE,0.626505,1.0,0.001042
142,Random Forest,42493,MNAR,0.3,Length,downstream_performance_mean,Classification Tasks,0.624441,0.0,0.623823,...,26969.0,2.0,6.0,,3.0,MNAR - 0.3 - 42493,Discriminative DL,0.624448,1.0,0.000625


## Analysis and Ranking based on F1 Score

In [90]:
# Relative difference (stored as a fraction, not multiplied by 100) -> best method vs. the method that is best on average

#AVERAGE_BEST_IMPUTATION_METHOD = "Random Forest"

data = downstream_results_rank
data['Task'] = data['Task'].astype(str)
data['Data_Constellation_full'] = data['Data_Constellation'] + ' - ' + data['Task']

# TODO: drop unnecessary columns here
dc_unique = data.Data_Constellation_full.unique()
data_constellations = dc_unique.tolist()
methods = ['Random Forest', 'KNN', 'Mean/Mode', 'VAE', 'GAIN', 'Discriminative DL']

# Collect the per-constellation frames in a list and concatenate once at the end;
# DataFrame.append is deprecated in favor of pandas.concat
average_best_parts = []

for i in data_constellations:
    data_constel = data.loc[data['Data_Constellation_full'] == i]
    best_score = data_constel.loc[data_constel['Downstream Performance Rank'] == 1.0]
    # .copy() so that adding a column below does not trigger SettingWithCopyWarning
    average_best = data_constel.loc[data_constel['Imputation_Method'] == AVERAGE_BEST_IMPUTATION_METHOD].copy()
    dataset_number = best_score.iloc[0]['Task']
    if dataset_number != '4135':
        best_score_int = best_score.iloc[0]['Imputed']
        average_best_int = average_best.iloc[0]['Imputed']
        calc_result = (best_score_int - average_best_int) / average_best_int
        average_best['Performance Difference to Best to Average in Percent'] = calc_result
        average_best_parts.append(average_best)
    else:
        print("4135 else ---------------------")

average_best_complete = pd.concat(average_best_parts)
average_best_complete

Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,...,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses,Downstream Performance Rank,Data_Constellation,Data_Constellation_full,Performance Difference to Best to Average in Percent
4,Random Forest,137,MAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674909,0.0,0.674760,...,13664.0,10.0,39366.0,0.0,10.0,,1.0,MAR - 0.01,MAR - 0.01 - 137,0.000000
7,Random Forest,137,MAR,0.1,top-middle-square,downstream_performance_mean,Classification Tasks,0.673767,0.0,0.673526,...,13664.0,10.0,39366.0,0.0,10.0,,4.0,MAR - 0.1,MAR - 0.1 - 137,0.001097
16,Random Forest,137,MAR,0.3,top-middle-square,downstream_performance_mean,Classification Tasks,0.674084,0.0,0.674124,...,13664.0,10.0,39366.0,0.0,10.0,,1.0,MAR - 0.3,MAR - 0.3 - 137,0.000000
19,Random Forest,137,MAR,0.5,top-middle-square,downstream_performance_mean,Classification Tasks,0.665708,0.0,0.665258,...,13664.0,10.0,39366.0,0.0,10.0,,4.0,MAR - 0.5,MAR - 0.5 - 137,0.020031
26,Random Forest,137,MCAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674273,0.0,0.674334,...,13664.0,10.0,39366.0,0.0,10.0,,2.0,MCAR - 0.01,MCAR - 0.01 - 137,0.000488
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
688,Random Forest,42493,MCAR,0.5,Length,downstream_performance_mean,Classification Tasks,0.625924,0.0,0.624029,...,12035.0,8.0,26969.0,2.0,6.0,,1.0,MCAR - 0.5,MCAR - 0.5 - 42493,0.000000
690,Random Forest,42493,MNAR,0.01,Length,downstream_performance_mean,Classification Tasks,0.627164,0.0,0.626956,...,12035.0,8.0,26969.0,2.0,6.0,,5.0,MNAR - 0.01,MNAR - 0.01 - 42493,0.000635
696,Random Forest,42493,MNAR,0.1,Length,downstream_performance_mean,Classification Tasks,0.625641,0.0,0.625463,...,12035.0,8.0,26969.0,2.0,6.0,,5.0,MNAR - 0.1,MNAR - 0.1 - 42493,0.001666
704,Random Forest,42493,MNAR,0.3,Length,downstream_performance_mean,Classification Tasks,0.624441,0.0,0.623823,...,12035.0,8.0,26969.0,2.0,6.0,,3.0,MNAR - 0.3,MNAR - 0.3 - 42493,0.001002


In [91]:
average_difference = average_best_complete['Performance Difference to Best to Average in Percent'].mean()
print(average_difference, "average difference in Percent")

0.012221487456965722 average difference in Percent
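
The same relative gap can also be computed without an explicit loop. The sketch below uses `groupby(...).transform("max")` to attach the best score of each constellation to every row and then averages the gap over the rows of the reference method. The scores are made up, and the names `REFERENCE_METHOD` and `Relative Gap` are assumptions for illustration. The result is a fraction; multiply by 100 to obtain percent.

```python
import pandas as pd

REFERENCE_METHOD = "Random Forest"  # the method that is best on average

# Toy scores for two constellations (values invented).
df = pd.DataFrame({
    "Data_Constellation_full": ["MAR - 0.1 - 137"] * 3 + ["MAR - 0.3 - 137"] * 3,
    "Imputation_Method": ["Random Forest", "KNN", "VAE"] * 2,
    "Imputed": [0.50, 0.55, 0.52, 0.80, 0.70, 0.60],
})

# Best score within each constellation, broadcast back to every row.
best = df.groupby("Data_Constellation_full")["Imputed"].transform("max")

# Relative gap between the best score and each row's own score (a fraction).
df["Relative Gap"] = (best - df["Imputed"]) / df["Imputed"]

# Average gap of the reference method across all constellations.
reference = df[df["Imputation_Method"] == REFERENCE_METHOD]
mean_gap = reference["Relative Gap"].mean()
print(mean_gap)
```

This variant assumes that the rank-1 row of a constellation is the row with the maximum `Imputed` value, which holds for the descending ranking used in this notebook.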


In [92]:
'''
# Relative Difference in absolute values (F1 Score) -> Best Method to Average Best Method

AVERAGE_BEST_IMPUTATION_METHOD = "Random Forest"

data = downstream_results_rank
data['Task'] = data['Task'].astype(str)
data['Data_Constellation_full'] = data['Data_Constellation'] + ' - ' + data['Task']

# TODO: drop unnecessary columns here
dc_unique = data.Data_Constellation_full.unique()
#print(dc_unique)

#data_constellations = ['MAR - 0.01', 'MAR - 0.1', 'MAR - 0.3', 'MCAR - 0.5', 'MCAR - 0.01', 'MCAR - 0.1', 'MCAR - 0.3', 'MCAR - 0.5', 'MNAR - 0.01', 'MNAR - 0.1', 'MNAR - 0.3', 'MNAR - 0.5']
data_constellations = dc_unique.tolist()
methods = ['Random Forest', 'KNN', 'Mean/Mode', 'VAE', 'GAIN', 'Discriminative DL']
#print(data_constellations)
#print(type(methods))
average_best_total = pd.DataFrame()


for i in data_constellations:
    data_constel = data.loc[data['Data_Constellation_full'] == i]
    best_score = data_constel.loc[data_constel['Downstream Performance Rank'] == 1.0]
    average_best = data_constel.loc[data_constel['Imputation_Method'] == AVERAGE_BEST_IMPUTATION_METHOD]
    dataset_number = best_score.iloc[0]['Task']
    #print(dataset_number)
    if (dataset_number != '4135'):
        best_score_int = best_score.iloc[0]['Imputed']
        average_best_int = average_best.iloc[0]['Imputed']
    #print(average_best_int)
        calc_result = (best_score_int - average_best_int)
        average_best['Performance Difference to Best to Average in absolute'] = calc_result
        average_best_total = average_best_total.append(average_best)
 
    else:
        print("4135 else ---------------------")

average_best_total
'''


In [93]:
#average_difference = average_best_total['Performance Difference to Best to Average in absolute'].mean()
#print(average_difference, "average difference in absolute values")

## Heatmap (needs to be adjusted)

In [94]:
#df_heat = pd.read_csv('downstream_results_rank_temp.csv')
df_heat = downstream_results_rank
df_heat.drop(["Missing Type", "Missing Fraction", "Column", "result_type", "metric", "Baseline", "Corrupted", "Unnamed: 0", "Unnamed: 0", "name", "NumberOfClasses", "MajorityClassSize", "MinorityClassSize"], axis=1, inplace=True)
#df_heat['Improvement'] = df_heat['Improvement'] - 1
df_heat


Unnamed: 0,Imputation_Method,Task,Imputed,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,Downstream Performance Rank,Data_Constellation,Data_Constellation_full
0,Mean/Mode,137,0.674116,10.0,39366.0,0.0,10.0,5.0,MAR - 0.01,MAR - 0.01 - 137
1,Discriminative DL,137,0.674251,10.0,39366.0,0.0,10.0,4.0,MAR - 0.01,MAR - 0.01 - 137
2,VAE,137,0.674488,10.0,39366.0,0.0,10.0,3.0,MAR - 0.01,MAR - 0.01 - 137
3,KNN,137,0.674760,10.0,39366.0,0.0,10.0,2.0,MAR - 0.01,MAR - 0.01 - 137
4,Random Forest,137,0.674760,10.0,39366.0,0.0,10.0,1.0,MAR - 0.01,MAR - 0.01 - 137
...,...,...,...,...,...,...,...,...,...,...
708,Random Forest,42493,0.622476,8.0,26969.0,2.0,6.0,5.0,MNAR - 0.5,MNAR - 0.5 - 42493
709,Discriminative DL,42493,0.626042,8.0,26969.0,2.0,6.0,4.0,MNAR - 0.5,MNAR - 0.5 - 42493
710,GAIN,42493,0.626178,8.0,26969.0,2.0,6.0,3.0,MNAR - 0.5,MNAR - 0.5 - 42493
711,VAE,42493,0.626914,8.0,26969.0,2.0,6.0,2.0,MNAR - 0.5,MNAR - 0.5 - 42493


In [95]:
# Get a dataframe for each "Data_Constellation"
# Work with variables here -> list of constellations

# Possibly use a for loop here, etc.


# drop unnecessary columns

#df_heat = downstream_results_rank
#df_heat.drop(["Missing Type", "Missing Fraction", "Column", "result_type", "metric", "Baseline", "Imputed", "Corrupted", "Unnamed: 0", "Unnamed: 0", "name", "NumberOfClasses", "MajorityClassSize", "MinorityClassSize"], axis=1, inplace=True)

#df_heat['Improvement'] = df_heat['Improvement']
df_heat = df_heat.astype({"Task":"string"})

#mar001.drop(["Missing Type", "Missing Fraction", "Column", "result_type", "metric", "Baseline", "Imputed", "Corrupted", "Unnamed: 0"], axis=1, inplace=True)

data_constellations = ['MAR - 0.01', 'MAR - 0.1', 'MAR - 0.3', 'MAR - 0.5', 'MCAR - 0.01', 'MCAR - 0.1', 'MCAR - 0.3', 'MCAR - 0.5', 'MNAR - 0.01', 'MNAR - 0.1', 'MNAR - 0.3', 'MNAR - 0.5']


for i in data_constellations:
    data_constel = df_heat.loc[df_heat['Data_Constellation'] == i]

    ### uncomment whatever you want to investigate

    ## sort by amount datapoints (ascending)
    data_constel = data_constel.sort_values(by=['NumberOfInstances'])

    ## sort by amount of features (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfFeatures'])

    ## sort by amount of datapoints and features (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfInstances', 'NumberOfFeatures'])

    ## sort by amount of categorical features and datapoints (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfCategoricalFeatures', 'NumberOfInstances'])

    ## sort by amount of numerical features and datapoints (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfNumericFeatures', 'NumberOfInstances'])
    
    Dataset_number = data_constel["Task"]
    Imputation_Method = data_constel["Imputation_Method"]
    Improvement = data_constel["Imputed"]
    

    trace = go.Heatmap(
        z=Improvement,
        x=Dataset_number,
        y=Imputation_Method,
        autocolorscale=False,
        colorscale='Reds',
        #zmid=0,
        #hoverinfo='text',
        #text=hovertext
    )
    data = [trace]
    fig = go.Figure(data=data)
    fig.update_layout(
        title=i,
        xaxis_nticks=36)
    fig.show()
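
The loop above hard-codes the constellation names and hands plotly three parallel columns. A minimal sketch of an alternative, assuming a long-form frame with the same column names as `df_heat`: the constellations are read from the data and each one is pivoted into a method-by-task matrix. The helper name `heatmap_matrices` and the toy values are made up for illustration.

```python
import pandas as pd

# Toy long-form data; column names follow df_heat, the values are made up.
toy = pd.DataFrame({
    "Data_Constellation": ["MAR - 0.01"] * 4 + ["MAR - 0.5"] * 4,
    "Task": ["137", "137", "42493", "42493"] * 2,
    "Imputation_Method": ["KNN", "VAE"] * 4,
    "Imputed": [0.67, 0.66, 0.62, 0.63, 0.60, 0.61, 0.58, 0.59],
})


def heatmap_matrices(df, value_col="Imputed"):
    """Yield (constellation, matrix) pairs (hypothetical helper).

    Each matrix has one row per imputation method and one column per task.
    """
    # Derive the constellations from the data instead of hard-coding them.
    for constellation in sorted(df["Data_Constellation"].unique()):
        subset = df.loc[df["Data_Constellation"] == constellation]
        matrix = subset.pivot_table(
            index="Imputation_Method",
            columns="Task",
            values=value_col,
            aggfunc="mean",  # averages duplicate entries, if there are any
        )
        yield constellation, matrix


matrices = dict(heatmap_matrices(toy))
for name, matrix in matrices.items():
    print(name)
    print(matrix)
```

Each matrix could then be passed to `go.Heatmap(z=matrix.values, x=matrix.columns, y=matrix.index)`. Note that `pivot_table` orders the task columns by label, so a sort by dataset size would have to be applied by reindexing the columns afterwards.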

In [96]:
downstream_results_rank_heatmap2

Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,...,name,MajorityClassSize,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses,Downstream Performance Rank,Data_Constellation
0,Mean/Mode,137,MAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674145,0.0,0.674116,...,BNG(tic-tac-toe),25702.0,13664.0,10.0,39366.0,0.0,10.0,,5.0,MAR - 0.01
1,Discriminative DL,137,MAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674505,0.0,0.674251,...,BNG(tic-tac-toe),25702.0,13664.0,10.0,39366.0,0.0,10.0,,4.0,MAR - 0.01
2,VAE,137,MAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674729,0.0,0.674488,...,BNG(tic-tac-toe),25702.0,13664.0,10.0,39366.0,0.0,10.0,,3.0,MAR - 0.01
3,KNN,137,MAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674729,0.0,0.674760,...,BNG(tic-tac-toe),25702.0,13664.0,10.0,39366.0,0.0,10.0,,2.0,MAR - 0.01
4,Random Forest,137,MAR,0.01,top-middle-square,downstream_performance_mean,Classification Tasks,0.674909,0.0,0.674760,...,BNG(tic-tac-toe),25702.0,13664.0,10.0,39366.0,0.0,10.0,,1.0,MAR - 0.01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
708,Random Forest,42493,MNAR,0.5,Length,downstream_performance_mean,Classification Tasks,0.622205,0.0,0.622476,...,airlines,14934.0,12035.0,8.0,26969.0,2.0,6.0,,5.0,MNAR - 0.5
709,Discriminative DL,42493,MNAR,0.5,Length,downstream_performance_mean,Classification Tasks,0.624567,0.0,0.626042,...,airlines,14934.0,12035.0,8.0,26969.0,2.0,6.0,,4.0,MNAR - 0.5
710,GAIN,42493,MNAR,0.5,Length,downstream_performance_mean,Classification Tasks,0.625497,0.0,0.626178,...,airlines,14934.0,12035.0,8.0,26969.0,2.0,6.0,,3.0,MNAR - 0.5
711,VAE,42493,MNAR,0.5,Length,downstream_performance_mean,Classification Tasks,0.625687,0.0,0.626914,...,airlines,14934.0,12035.0,8.0,26969.0,2.0,6.0,,2.0,MNAR - 0.5


In [97]:
#df_heat = pd.read_csv('downstream_results_rank_temp.csv')
df_heat_dif = downstream_results_rank_heatmap2
df_heat_dif.drop(["Missing Type", "Missing Fraction", "Column", "result_type", "metric", "Baseline", "Corrupted", "Unnamed: 0", "Unnamed: 0", "name", "NumberOfClasses", "MajorityClassSize", "MinorityClassSize"], axis=1, inplace=True)
#df_heat['Improvement'] = df_heat['Improvement'] - 1
df_heat_dif


Unnamed: 0,Imputation_Method,Task,Imputed,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,Downstream Performance Rank,Data_Constellation
0,Mean/Mode,137,0.674116,10.0,39366.0,0.0,10.0,5.0,MAR - 0.01
1,Discriminative DL,137,0.674251,10.0,39366.0,0.0,10.0,4.0,MAR - 0.01
2,VAE,137,0.674488,10.0,39366.0,0.0,10.0,3.0,MAR - 0.01
3,KNN,137,0.674760,10.0,39366.0,0.0,10.0,2.0,MAR - 0.01
4,Random Forest,137,0.674760,10.0,39366.0,0.0,10.0,1.0,MAR - 0.01
...,...,...,...,...,...,...,...,...,...
708,Random Forest,42493,0.622476,8.0,26969.0,2.0,6.0,5.0,MNAR - 0.5
709,Discriminative DL,42493,0.626042,8.0,26969.0,2.0,6.0,4.0,MNAR - 0.5
710,GAIN,42493,0.626178,8.0,26969.0,2.0,6.0,3.0,MNAR - 0.5
711,VAE,42493,0.626914,8.0,26969.0,2.0,6.0,2.0,MNAR - 0.5


In [98]:
# Calculate the difference of every imputation method to the average best imputation method per data constellation

# Absolute difference in downstream score (not percent) -> each method vs. the average best method

# AVERAGE_BEST_IMPUTATION_METHOD must already be defined by an earlier cell
#AVERAGE_BEST_IMPUTATION_METHOD = "Random Forest"

data = downstream_results_rank
data['Task'] = data['Task'].astype(str)
data['Data_Constellation_full'] = data['Data_Constellation'] + ' - ' + data['Task']

# TODO: drop unnecessary columns here
data_constellations = data['Data_Constellation_full'].unique().tolist()

# To exclude the average best method, use the shorter list instead
#methods = ['KNN', 'Mean/Mode', 'VAE', 'GAIN', 'Discriminative DL']
methods = ['Random Forest', 'KNN', 'Mean/Mode', 'VAE', 'GAIN', 'Discriminative DL']

difference_rows = []

for constellation in data_constellations:
    data_constel = data.loc[data['Data_Constellation_full'] == constellation]
    average_best = data_constel.loc[data_constel['Imputation_Method'] == AVERAGE_BEST_IMPUTATION_METHOD]
    for method in methods:
        if (data_constel['Imputation_Method'] == method).any():
            # .copy() avoids the SettingWithCopyWarning when the new column is added
            current_score_row = data_constel.loc[data_constel['Imputation_Method'] == method].copy()
            current_score = current_score_row.iloc[0]['Imputed']
            average_best_score = average_best.iloc[0]['Imputed']
            current_score_row['Performance Difference to Average Best'] = current_score - average_best_score
            difference_rows.append(current_score_row)
        else:
            print("Imputation Method not here ---------------------")

# pd.concat replaces the deprecated DataFrame.append
heatmap_data_difference = pd.concat(difference_rows)

heatmap_data_difference



Imputation Method not here ---------------------
Imputation Method not here ---------------------
Imputation Method not here ---------------------
Imputation Method not here ---------------------





Unnamed: 0,Imputation_Method,Task,Imputed,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,Downstream Performance Rank,Data_Constellation,Data_Constellation_full,Performance Difference to Average Best
4,Random Forest,137,0.674760,10.0,39366.0,0.0,10.0,1.0,MAR - 0.01,MAR - 0.01 - 137,0.000000e+00
3,KNN,137,0.674760,10.0,39366.0,0.0,10.0,2.0,MAR - 0.01,MAR - 0.01 - 137,-2.949553e-08
0,Mean/Mode,137,0.674116,10.0,39366.0,0.0,10.0,5.0,MAR - 0.01,MAR - 0.01 - 137,-6.439471e-04
2,VAE,137,0.674488,10.0,39366.0,0.0,10.0,3.0,MAR - 0.01,MAR - 0.01 - 137,-2.716528e-04
1,Discriminative DL,137,0.674251,10.0,39366.0,0.0,10.0,4.0,MAR - 0.01,MAR - 0.01 - 137,-5.093850e-04
...,...,...,...,...,...,...,...,...,...,...,...
707,KNN,42493,0.620028,8.0,26969.0,2.0,6.0,6.0,MNAR - 0.5,MNAR - 0.5 - 42493,-2.447951e-03
712,Mean/Mode,42493,0.627879,8.0,26969.0,2.0,6.0,1.0,MNAR - 0.5,MNAR - 0.5 - 42493,5.402509e-03
711,VAE,42493,0.626914,8.0,26969.0,2.0,6.0,2.0,MNAR - 0.5,MNAR - 0.5 - 42493,4.437433e-03
710,GAIN,42493,0.626178,8.0,26969.0,2.0,6.0,3.0,MNAR - 0.5,MNAR - 0.5 - 42493,3.701885e-03
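
The same difference column can be computed without a row-wise loop. A minimal sketch, assuming a long-form frame with the column names used above (`Imputation_Method`, `Imputed`, `Data_Constellation_full`); the helper name `difference_to_reference` and the toy values are hypothetical.

```python
import pandas as pd


def difference_to_reference(df, reference_method):
    """Add each score's difference to the reference method's score within
    the same Data_Constellation_full group (hypothetical helper)."""
    # One reference score per constellation/task combination.
    reference = (
        df.loc[df["Imputation_Method"] == reference_method]
        .drop_duplicates("Data_Constellation_full")
        .set_index("Data_Constellation_full")["Imputed"]
    )
    out = df.copy()
    # Groups without a reference row end up as NaN instead of raising.
    out["Performance Difference to Average Best"] = (
        out["Imputed"] - out["Data_Constellation_full"].map(reference)
    )
    return out


# Toy data with made-up scores.
toy = pd.DataFrame({
    "Imputation_Method": ["Random Forest", "KNN", "Random Forest", "KNN"],
    "Data_Constellation_full": ["MAR - 0.01 - 1", "MAR - 0.01 - 1",
                                "MNAR - 0.5 - 2", "MNAR - 0.5 - 2"],
    "Imputed": [0.70, 0.65, 0.60, 0.62],
})
result = difference_to_reference(toy, "Random Forest")
print(result)
```

Unlike the loop, this variant keeps constellations where the reference method is missing and marks them with `NaN`, which may or may not be the desired behaviour.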


In [99]:
# Get a dataframe for each "Data_Constellation"
# Work with variables here -> list of constellations

# Possibly use a for loop here, etc.


# drop unnecessary columns

#df_heat = downstream_results_rank
#df_heat.drop(["Missing Type", "Missing Fraction", "Column", "result_type", "metric", "Baseline", "Imputed", "Corrupted", "Unnamed: 0", "Unnamed: 0", "name", "NumberOfClasses", "MajorityClassSize", "MinorityClassSize"], axis=1, inplace=True)

#df_heat['Improvement'] = df_heat['Improvement']
heatmap_data_difference = heatmap_data_difference.astype({"Task":"string"})

#mar001.drop(["Missing Type", "Missing Fraction", "Column", "result_type", "metric", "Baseline", "Imputed", "Corrupted", "Unnamed: 0"], axis=1, inplace=True)

data_constellations = ['MAR - 0.01', 'MAR - 0.1', 'MAR - 0.3', 'MAR - 0.5', 'MCAR - 0.01', 'MCAR - 0.1', 'MCAR - 0.3', 'MCAR - 0.5', 'MNAR - 0.01', 'MNAR - 0.1', 'MNAR - 0.3', 'MNAR - 0.5']


for i in data_constellations:
    data_constel = heatmap_data_difference.loc[heatmap_data_difference['Data_Constellation'] == i]

    ### uncomment whatever you want to investigate

    ## sort by amount datapoints (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfInstances'])

    ## sort by amount of features (ascending)
    data_constel = data_constel.sort_values(by=['NumberOfFeatures'])

    ## sort by amount of datapoints and features (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfInstances', 'NumberOfFeatures'])

    ## sort by amount of categorical features and datapoints (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfCategoricalFeatures', 'NumberOfInstances'])

    ## sort by amount of numerical features and datapoints (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfNumericFeatures', 'NumberOfInstances'])
    
    Dataset_number = data_constel["Task"]
    Imputation_Method = data_constel["Imputation_Method"]
    Improvement = data_constel["Performance Difference to Average Best"]
    

    trace = go.Heatmap(
        z=Improvement,
        x=Dataset_number,
        y=Imputation_Method,
        autocolorscale=False,
        colorscale='RdBu_r',
        zmid=0,
        #hoverinfo='text',
        #text=hovertext
    )
    data = [trace]
    fig = go.Figure(data=data)
    fig.update_layout(
        title=i,
        xaxis_nticks=36)
    fig.show()

In [100]:
# drop unnecessary columns

#mar001.drop(["Missing Type", "Missing Fraction", "Column", "result_type", "metric", "Baseline", "Imputed", "Corrupted", "Unnamed: 0"], axis=1, inplace=True)
#mar001.drop(["Missing Type", "Missing Fraction", "Column", "result_type", "metric", "Baseline", "Imputed", "Corrupted", "Unnamed: 0_x", "Unnamed: 0_y", "name", "NumberOfClasses", "MajorityClassSize", "MinorityClassSize"], axis=1, inplace=True)


#mar001.head()


In [101]:
# Seaborn heatmap
'''
plt.subplots(figsize=(50,10))
sns.set()

heatmap_mar001 = mar001.pivot("Imputation_Method", "Task", "Improvement")
ax = sns.heatmap(heatmap_mar001, annot=True, vmin=-0.1, vmax=0.2)
#ax = sns.heatmap(mar001, annot=True, vmin=-0.3, vmax=0.3)
#plt.subplots(figsize=(50,50))
title = mar001.iloc[2]['Data_Constellation']
print(title)
plt.title(title)
plt.show()
'''


## Plotly Heatmaps

In [102]:
#heatmap_mar001.head()


In [103]:


mar001.head()


NameError: name 'mar001' is not defined
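
The `NameError` occurs because `mar001` is never created in this notebook. One plausible reading, and it is only an assumption based on the variable name and the disabled seaborn cell, is that it should hold the rows of the 'MAR - 0.01' constellation, with `Improvement` being the difference to the average best method. A sketch on toy data with made-up values:

```python
import pandas as pd

# Toy stand-in for heatmap_data_difference; the values are made up.
toy = pd.DataFrame({
    "Imputation_Method": ["Random Forest", "KNN", "Random Forest", "KNN"],
    "Task": ["137", "137", "137", "137"],
    "Data_Constellation": ["MAR - 0.01", "MAR - 0.01",
                           "MNAR - 0.5", "MNAR - 0.5"],
    "Performance Difference to Average Best": [0.0, -0.001, 0.0, 0.002],
})

# Assumption: "mar001" stands for the 'MAR - 0.01' constellation.
mar001 = toy.loc[toy["Data_Constellation"] == "MAR - 0.01"].copy()

# Assumption: the 'Improvement' column used in the cells below is the
# difference to the average best method.
mar001 = mar001.rename(
    columns={"Performance Difference to Average Best": "Improvement"}
)
print(mar001.head())
```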

In [None]:
#testmar001 = xr.tutorial.open_dataset('air_temperature').air.sel(lon=250.0)

'''
#plotly express test

fig = px.imshow(heatmap_mar001, text_auto = True, 
                labels=dict(x="Task", y="Imputation_Method", color="Improvement"),
                color_continuous_scale='RdBu_r', color_continuous_midpoint=0)
'''

In [None]:
### uncomment whatever you want to investigate

## sort by amount datapoints (ascending)
#mar001 = mar001.sort_values(by=['NumberOfInstances'])

## sort by amount of features (ascending)
#mar001 = mar001.sort_values(by=['NumberOfFeatures'])

## sort by amount of datapoints and features (ascending)
#mar001 = mar001.sort_values(by=['NumberOfInstances', 'NumberOfFeatures'])

## sort by amount of categorical features and datapoints (ascending)
#mar001 = mar001.sort_values(by=['NumberOfCategoricalFeatures', 'NumberOfInstances'])

## sort by amount of numerical features and datapoints (ascending)
#mar001 = mar001.sort_values(by=['NumberOfNumericFeatures', 'NumberOfInstances'])



mar001 = mar001.astype({"Task":"string"})

Dataset_number = mar001["Task"]
Imputation_Method = mar001["Imputation_Method"]
Improvement = mar001["Improvement"]




In [None]:
trace = go.Heatmap(
    z=Improvement,
    x=Dataset_number,
    y=Imputation_Method,
    autocolorscale=False,
    colorscale='RdBu_r',
    zmid=0,
    # hovertext is not defined in this notebook; enable these once it exists
    #hoverinfo='text',
    #text=hovertext
)




data = [trace]
fig = go.Figure(data=data)
#iplot(fig)


fig.show()

To-dos for the visualization:
- Options for easily adjusting the sorting/display:
    - Number of data points
    - Number of features
    - Number of numerical features
    - Number of categorical features
- Set up a loop over all data constellations (do not hard-code them)
- Show the best imputation method per dataset separately in its own heatmap
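
The sorting options from the list above could be selected by name instead of by commenting lines in and out. A minimal sketch, assuming the dataset metadata columns shown in the tables above; `SORT_OPTIONS` and `task_order` are hypothetical names, and the two example rows reuse the metadata of tasks 137 and 42493.

```python
import pandas as pd

# Hypothetical mapping from an option name to the columns to sort by.
SORT_OPTIONS = {
    "instances": ["NumberOfInstances"],
    "features": ["NumberOfFeatures"],
    "numeric": ["NumberOfNumericFeatures", "NumberOfInstances"],
    "categorical": ["NumberOfCategoricalFeatures", "NumberOfInstances"],
}


def task_order(df, option):
    """Return the task ids ordered ascending by the chosen dataset property."""
    # Metadata is repeated per method, so keep one row per task first.
    per_task = df.drop_duplicates("Task").sort_values(by=SORT_OPTIONS[option])
    return per_task["Task"].tolist()


toy = pd.DataFrame({
    "Task": ["137", "137", "42493", "42493"],
    "Imputation_Method": ["KNN", "VAE", "KNN", "VAE"],
    "NumberOfInstances": [39366.0, 39366.0, 26969.0, 26969.0],
    "NumberOfFeatures": [10.0, 10.0, 8.0, 8.0],
    "NumberOfNumericFeatures": [0.0, 0.0, 2.0, 2.0],
    "NumberOfCategoricalFeatures": [10.0, 10.0, 6.0, 6.0],
})

order_by_instances = task_order(toy, "instances")
order_by_numeric = task_order(toy, "numeric")
print(order_by_instances)
print(order_by_numeric)
```

The resulting list can be applied to a plotly figure with `fig.update_xaxes(categoryorder="array", categoryarray=order_by_instances)`, so the heatmap columns follow the chosen order.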



To-dos for the remaining evaluation (mathematical part):
- Determine the best imputation method per dataset (preferably via ranking, per constellation, e.g. MAR 0.01)
- Determine the average rank of each imputation method (ranking, then per constellation, e.g. MAR 0.01)
- Compare the best imputation method per dataset with the imputation method that is best on average (one list with the best method and one list with the average best method, then compare)
(each constellation appears exactly once in each list)
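
The three evaluation steps above can be sketched with grouped ranks. This is only an illustration on made-up scores, assuming a long-form frame with the columns `Data_Constellation`, `Task`, `Imputation_Method` and `Imputed`; the `rank` column here is computed from scratch and is not the notebook's existing `Downstream Performance Rank`.

```python
import pandas as pd

# Toy scores for one constellation and three tasks (values are made up).
toy = pd.DataFrame({
    "Data_Constellation": ["MAR - 0.01"] * 9,
    "Task": ["1"] * 3 + ["2"] * 3 + ["3"] * 3,
    "Imputation_Method": ["Random Forest", "KNN", "Mean/Mode"] * 3,
    "Imputed": [0.70, 0.65, 0.60, 0.58, 0.62, 0.55, 0.80, 0.70, 0.75],
})

# Step 1: rank the methods within each constellation/task; 1.0 is the best.
toy["rank"] = (
    toy.groupby(["Data_Constellation", "Task"])["Imputed"]
    .rank(ascending=False, method="min")
)
best_per_dataset = toy.loc[
    toy["rank"] == 1.0, ["Data_Constellation", "Task", "Imputation_Method"]
]

# Step 2: average rank of each method per constellation.
average_rank = (
    toy.groupby(["Data_Constellation", "Imputation_Method"])["rank"]
    .mean()
    .unstack("Imputation_Method")
)

# Step 3: the method with the lowest mean rank is the "average best" one,
# which can then be compared with the per-dataset winners.
average_best = average_rank.idxmin(axis=1)

print(best_per_dataset)
print(average_rank)
print(average_best)
```

With real data, ties in the mean rank are resolved by `idxmin` in favour of the first column, so a deliberate tie-breaking rule may be needed.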



Miscellaneous (no priority)
- Options for filtering (implement if needed; no priority for now)
    - Numerical feature was imputed
    - Categorical feature was imputed

## Application Scenario 2 - Downstream Performance

### Categorical  Columns (Classification)

In [None]:
'''
draw_cat_box_plot(
    downstream_results,
    "Improvement",
    (-0.15, 0.3),
    FIGURES_PATH,
    "fully_observed_downstream_boxplot.eps",
    hue_order=list(rename_imputer_dict.values()),
    row_order=list(rename_metric_dict.values())
)
'''
# Not used at the moment -> function from other file required, check first field