# Visualize Results: Downstream Performance - Regression Corrupted Experiments

This notebook addresses the question: *Does imputation lead to better downstream performance?*

The data must be preprocessed with the other notebook; here we only import two CSV files, one with the raw results of the experiment and one with information about the datasets used.

## Notebook Structure 

* Application Scenario 2 - Downstream Performance  
   * Categorical  Columns (Classification)
   * Numerical Columns (Regression)
   * Heterogeneous Columns (Classification and Regression Combined)

In [97]:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import os
import pandas as pd
import re
import seaborn as sns

from pathlib import Path

import plotly as py
import plotly.express as px
import plotly.graph_objects as go
import xarray as xr


%matplotlib inline

%load_ext autoreload
%autoreload 2



## Settings

In [98]:
sns.set(style="whitegrid")
sns.set_context('paper', font_scale=1.5)
mpl.rcParams['lines.linewidth'] = '2'

In [99]:
CLF_METRIC = "Classification Tasks"
REG_METRIC = "Regression Tasks"

DOWNSTREAM_RESULT_TYPE = "downstream_performance_mean"
IMPUTE_RESULT_TYPE = "impute_performance_mean"

FIGURES_PATH = Path("../paper/figures/")

## Data Preparation

In [100]:
# Read the results CSV file here.

# Pick whether to analyze the "Regression" experiment or the "Regression Corrupted" experiment

results = pd.read_csv('../regression_corrupted.csv')
#results = pd.read_csv('regression.csv')
# Preresults.head()

In [101]:
# Filtering the relevant data for downstream analysis

na_impute_results = results[
    (results["result_type"] == IMPUTE_RESULT_TYPE) & 
    (results["metric"].isin(["F1_macro", "RMSE"]))
]
# Reassign instead of dropping in place: the frame is a filtered slice of `results`
na_impute_results = na_impute_results.drop(["baseline", "corrupted", "imputed"], axis=1)
na_impute_results = na_impute_results[na_impute_results.isna().any(axis=1)]
na_impute_results.shape






(0, 11)

In [102]:
# Check that the strategy type is correct!
STRATEGY_TYPE = "single_single"

downstream_results = results[
    (results["result_type"] == DOWNSTREAM_RESULT_TYPE) &
    (results["metric"].isin(["F1_macro", "RMSE"])) &
    (results["strategy"] == STRATEGY_TYPE)
]

# remove experiments where imputation failed
downstream_results = downstream_results.merge(
    na_impute_results,
    how = "left",
    validate = "one_to_one",
    indicator = True,
    suffixes=("", "_imp"),
    on = ["experiment", "imputer", "task", "missing_type", "missing_fraction", "strategy", "column"]
)
downstream_results = downstream_results[downstream_results["_merge"]=="left_only"]

assert len(results["strategy"].unique()) == 1
downstream_results.drop(["experiment", "strategy", "result_type_imp", "metric_imp", "train", "test", "train_imp", "test_imp", "_merge"], axis=1, inplace=True)

downstream_results = downstream_results.rename(
    {
        "imputer": "Imputation_Method",
        "task": "Task",
        "missing_type": "Missing Type",
        "missing_fraction": "Missing Fraction",
        "column": "Column",
        "baseline": "Baseline",
        "imputed": "Imputed",
        "corrupted": "Corrupted"
    },
    axis = 1
)
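The filter for failed imputations above works as an anti-join: a left merge with `indicator=True`, followed by keeping only the `left_only` rows. The following is a minimal sketch of that pattern on made-up data; the two toy frames and their values are assumptions for illustration, only the column names mirror the notebook.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the notebook, values are made up.
downstream = pd.DataFrame({
    "imputer": ["KNNImputer", "VAEImputer", "GAINImputer"],
    "task": [287, 287, 287],
    "imputed": [0.74, 0.75, 0.76],
})
failed = pd.DataFrame({
    "imputer": ["GAINImputer"],
    "task": [287],
})

# A left merge with indicator=True adds a "_merge" column that records
# whether each row was found only on the left side or on both sides.
merged = downstream.merge(failed, how="left", on=["imputer", "task"], indicator=True)

# Keep only rows without a match in `failed` (an anti-join).
kept = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(kept)
```

In this sketch the `GAINImputer` row is dropped because it has a counterpart in `failed`; the other two rows survive.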

In [103]:
rename_imputer_dict = {
    "ModeImputer": "Mean/Mode",
    "KNNImputer": "KNN",
    "ForestImputer": "Random Forest",
    "AutoKerasImputer": "Discriminative DL",
    "VAEImputer": "VAE",
    "GAINImputer": "GAIN"    
}

rename_metric_dict = {
    "F1_macro": CLF_METRIC,
    "RMSE": REG_METRIC
}

downstream_results = downstream_results.replace(rename_imputer_dict)
downstream_results = downstream_results.replace(rename_metric_dict)

downstream_results

Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed
0,KNN,287,MNAR,0.50,sulphates,downstream_performance_mean,Regression Tasks,0.748508,0.0,0.748789
1,KNN,287,MNAR,0.30,sulphates,downstream_performance_mean,Regression Tasks,0.745449,0.0,0.744320
2,KNN,287,MNAR,0.01,sulphates,downstream_performance_mean,Regression Tasks,0.746691,0.0,0.746363
3,KNN,287,MNAR,0.10,sulphates,downstream_performance_mean,Regression Tasks,0.748270,0.0,0.747389
4,KNN,287,MAR,0.50,sulphates,downstream_performance_mean,Regression Tasks,0.748632,0.0,0.747375
...,...,...,...,...,...,...,...,...,...,...
699,Discriminative DL,42712,MAR,0.10,humidity,downstream_performance_mean,Regression Tasks,150.063185,0.0,150.081707
700,Discriminative DL,42712,MCAR,0.50,humidity,downstream_performance_mean,Regression Tasks,149.940476,0.0,150.009100
701,Discriminative DL,42712,MCAR,0.30,humidity,downstream_performance_mean,Regression Tasks,149.841198,0.0,149.764901
702,Discriminative DL,42712,MCAR,0.01,humidity,downstream_performance_mean,Regression Tasks,149.542662,0.0,149.510195


### Robustness: check which imputers yielded `NaN` values

In [104]:
for col in downstream_results.columns:
    na_sum = downstream_results[col].isna().sum()
    if na_sum > 0:
        print("-----" * 10)        
        print(col, na_sum)
        print("-----" * 10)        
        na_idx = downstream_results[col].isna()
        # The column was renamed to "Imputation_Method" (with underscore) above
        print(downstream_results.loc[na_idx, "Imputation_Method"].value_counts(dropna=False))
        print("\n")

## Compute Downstream Performance relative to Baseline

In [105]:
clf_row_idx = downstream_results["metric"] == CLF_METRIC
reg_row_idx = downstream_results["metric"] == REG_METRIC

In [106]:
#downstream_results["Improvement"]   = (downstream_results["Imputed"] - downstream_results["Baseline"]  ) / downstream_results["Baseline"]
#downstream_results.loc[reg_row_idx, "Improvement"]   = downstream_results.loc[reg_row_idx, "Improvement"]   * -1

#mar001.drop(["Missing Type", "Missing Fraction", "Column", "result_type", "metric", "Baseline", "Imputed", "Corrupted", "Unnamed: 0"], axis=1, inplace=True)

#print(downstream_results)
#downstream_results.to_csv('downstream_results.csv')
downstream_results.head()

Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed
0,KNN,287,MNAR,0.5,sulphates,downstream_performance_mean,Regression Tasks,0.748508,0.0,0.748789
1,KNN,287,MNAR,0.3,sulphates,downstream_performance_mean,Regression Tasks,0.745449,0.0,0.74432
2,KNN,287,MNAR,0.01,sulphates,downstream_performance_mean,Regression Tasks,0.746691,0.0,0.746363
3,KNN,287,MNAR,0.1,sulphates,downstream_performance_mean,Regression Tasks,0.74827,0.0,0.747389
4,KNN,287,MAR,0.5,sulphates,downstream_performance_mean,Regression Tasks,0.748632,0.0,0.747375
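The commented-out lines in the cell above describe a relative improvement over the baseline, with the sign flipped for regression rows because a lower RMSE is better. A small sketch of that arithmetic on invented numbers follows; the values and the two-row frame are assumptions, not results from the experiment.

```python
import pandas as pd

# Hypothetical values for illustration only.
df = pd.DataFrame({
    "metric": ["Regression Tasks", "Classification Tasks"],
    "Baseline": [2.0, 0.80],
    "Imputed": [1.5, 0.88],
})

# Relative change of the imputed score with respect to the baseline.
df["Improvement"] = (df["Imputed"] - df["Baseline"]) / df["Baseline"]

# For RMSE lower is better, so flip the sign for regression rows;
# afterwards a positive value always means "better than baseline".
is_regression = df["metric"] == "Regression Tasks"
df.loc[is_regression, "Improvement"] *= -1
print(df)
```

With these numbers the regression row improves by 25 % (RMSE drops from 2.0 to 1.5) and the classification row by roughly 10 %.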


## Adding Dataset Info, Sorting and Ranking

In [107]:
# Sort the data

#downstream_results_full_sort = pd.read_csv('downstream_results.csv')
downstream_results_full_sort = downstream_results

#df = sns.load_dataset('impute_results_full')
#downstream_results_full_sort = downstream_results_full_sort.replace('$k$-NN','KNN')
#impute_results_full_sort.head()

#impute_results_full_sort = impute_results_full_sort.sort_values(['Task'], ascending=[True])
downstream_results_full_sort = downstream_results_full_sort.sort_values(['Task', 'Missing Type', 'Missing Fraction', 'Imputed'], ascending=[True, True, True, True])
#print(downstream_results_full_sort)
downstream_results_full_sort.head()


#downstream_results_full_sort.to_csv('downstream_results_full_sort.csv')

Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed
650,Discriminative DL,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.00284,0.0,0.002838
66,KNN,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.002839,0.0,0.002838
534,VAE,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.002839,0.0,0.002838
414,Random Forest,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.002839,0.0,0.002838
294,Mean/Mode,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.00284,0.0,0.002839


In [108]:
# add dataset information from other csv file

dataset_info = pd.read_csv('../datasets_information_overview.csv')
dataset_info = dataset_info.rename(columns={"did": "Task"})


downstream_results_full_sort = pd.merge(downstream_results_full_sort, dataset_info, on='Task')
#downstream_results_full_sort.to_csv('downstream_results_full_sort_testtesttest.csv')
downstream_results_full_sort.head()

Unnamed: 0.1,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,Unnamed: 0,name,MajorityClassSize,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses
0,Discriminative DL,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.00284,0.0,0.002838,90,elevators,,,19.0,16599.0,19.0,0.0,
1,KNN,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.002839,0.0,0.002838,90,elevators,,,19.0,16599.0,19.0,0.0,
2,VAE,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.002839,0.0,0.002838,90,elevators,,,19.0,16599.0,19.0,0.0,
3,Random Forest,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.002839,0.0,0.002838,90,elevators,,,19.0,16599.0,19.0,0.0,
4,Mean/Mode,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.00284,0.0,0.002839,90,elevators,,,19.0,16599.0,19.0,0.0,


In [109]:
# Ranking of downstream performance per data constellation

EXPERIMENTAL_CONDITIONS = ["Task", "Missing Type", "Missing Fraction", "Column", "result_type"]

downstream_results_rank = downstream_results_full_sort

#clf_row_idx = impute_results["metric"] == CLF_METRIC
#reg_row_idx = impute_results["metric"] == REG_METRIC

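# NOTE: ascending=False gives rank 1 to the largest "Imputed" value within each group.
# For an error metric such as RMSE, where lower is better, ascending=True may be what is intended.
downstream_results_rank["Downstream Performance Rank"] = downstream_results_rank.groupby(EXPERIMENTAL_CONDITIONS)["Imputed"].rank(ascending=False, na_option="bottom", method="min")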
downstream_results_rank.to_csv('downstream_results_complete_overview.csv')
downstream_results_rank.head()


Unnamed: 0.1,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,Unnamed: 0,name,MajorityClassSize,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses,Downstream Performance Rank
0,Discriminative DL,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.00284,0.0,0.002838,90,elevators,,,19.0,16599.0,19.0,0.0,,5.0
1,KNN,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.002839,0.0,0.002838,90,elevators,,,19.0,16599.0,19.0,0.0,,4.0
2,VAE,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.002839,0.0,0.002838,90,elevators,,,19.0,16599.0,19.0,0.0,,3.0
3,Random Forest,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.002839,0.0,0.002838,90,elevators,,,19.0,16599.0,19.0,0.0,,2.0
4,Mean/Mode,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.00284,0.0,0.002839,90,elevators,,,19.0,16599.0,19.0,0.0,,1.0
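To make the effect of the `ascending` argument in the ranking cell concrete, here is a minimal sketch of a grouped rank with `method="min"` in both directions. The toy scores are assumptions; which direction is appropriate depends on whether a larger or a smaller score counts as better.

```python
import pandas as pd

# Hypothetical scores for two experimental conditions (Task 1 and Task 2).
df = pd.DataFrame({
    "Task": [1, 1, 1, 2, 2, 2],
    "Imputation_Method": ["KNN", "VAE", "GAIN"] * 2,
    "Imputed": [0.30, 0.20, 0.20, 5.0, 7.0, 6.0],
})

grouped = df.groupby("Task")["Imputed"]

# ascending=False: the largest value in each group gets rank 1.
df["rank_desc"] = grouped.rank(ascending=False, method="min")
# ascending=True: the smallest value in each group gets rank 1,
# which is the "lower is better" reading of an error metric.
df["rank_asc"] = grouped.rank(ascending=True, method="min")
print(df)
```

Ties share the lowest rank of their group under `method="min"`, so the two 0.20 scores in Task 1 both receive rank 2 in the descending case and rank 1 in the ascending case.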


In [110]:
# Merge the two columns "Missing Type" and "Missing Fraction"

downstream_results_rank['Missing Type'] = downstream_results_rank['Missing Type'].astype(str)
downstream_results_rank['Missing Fraction'] = downstream_results_rank['Missing Fraction'].astype(str)
datatype_new = downstream_results_rank.dtypes
#print(datatype_new)

downstream_results_rank['Data_Constellation'] = downstream_results_rank['Missing Type'] + ' - ' + downstream_results_rank['Missing Fraction']
downstream_results_rank.to_csv('downstream_results_rank_temp.csv')
downstream_results_rank_heatmap2 = downstream_results_rank.copy()
downstream_results_rank.head()


Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,...,name,MajorityClassSize,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses,Downstream Performance Rank,Data_Constellation
0,Discriminative DL,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.00284,0.0,0.002838,...,elevators,,,19.0,16599.0,19.0,0.0,,5.0,MAR - 0.01
1,KNN,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.002839,0.0,0.002838,...,elevators,,,19.0,16599.0,19.0,0.0,,4.0,MAR - 0.01
2,VAE,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.002839,0.0,0.002838,...,elevators,,,19.0,16599.0,19.0,0.0,,3.0,MAR - 0.01
3,Random Forest,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.002839,0.0,0.002838,...,elevators,,,19.0,16599.0,19.0,0.0,,2.0,MAR - 0.01
4,Mean/Mode,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.00284,0.0,0.002839,...,elevators,,,19.0,16599.0,19.0,0.0,,1.0,MAR - 0.01


## Analyzing Performance based on Rank and Improvement per Data Constellation

The calculation here: best result per experimental condition minus the result of the method that is best on average.

To-dos for the remaining analysis (mathematical part):
- Determine the best imputation method per dataset (ideally via ranking, per constellation, e.g. MAR 0.01)
- Determine the average rank of each imputation method (via ranking, then per constellation, e.g. MAR 0.01)
- Compare the best imputation method per dataset with the imputation method that is best on average (one list of best methods and one list of average-best methods, then compare; each constellation appears exactly once in each list)
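The three steps in the list above can be sketched compactly with pandas. This is only an illustration under stated assumptions: the frame `ranks`, its `Rank` column and all values are hypothetical, and rank 1 is simply taken as given rather than derived from the scores.

```python
import pandas as pd

# Hypothetical ranks: three constellations, three methods each.
ranks = pd.DataFrame({
    "Data_Constellation": ["MAR - 0.01"] * 3 + ["MAR - 0.1"] * 3 + ["MAR - 0.3"] * 3,
    "Imputation_Method": ["KNN", "VAE", "GAIN"] * 3,
    "Imputed": [0.50, 0.40, 0.45, 0.70, 0.62, 0.60, 0.85, 0.80, 0.90],
    "Rank": [3, 1, 2, 3, 2, 1, 2, 1, 3],
})

# Step 1: the winning method (rank 1) in each constellation.
best = ranks[ranks["Rank"] == 1].set_index("Data_Constellation")

# Step 2: the average rank of every method; the smallest mean rank wins.
mean_rank = ranks.groupby("Imputation_Method")["Rank"].mean()
best_on_average = mean_rank.idxmin()

# Step 3: per constellation, compare the winner with the average-best method.
avg = ranks[ranks["Imputation_Method"] == best_on_average].set_index("Data_Constellation")
difference = best["Imputed"] - avg["Imputed"]
print(best_on_average)
print(difference)
```

Indexing both frames by the constellation label lets pandas align the subtraction, so each constellation appears exactly once in the result.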


In [111]:
data = downstream_results_rank

# Count amount of different Data constellations in column "Data_Constellation"
dc_unique = data.Data_Constellation.unique().size
print(dc_unique, "Data Constellations")
print("_____________________")
# Count amount of 1.0 Ranking result in column "Downstream Performance Rank" (Numbers must match)
rank_count = data['Downstream Performance Rank'].value_counts()
print(rank_count)
print("_____________________")
# Filter for 1.0 Ranking -> Overview -> save as csv
rank_1 = data.loc[data['Downstream Performance Rank'] == 1.0]
rank_1.to_csv('rank_1.csv')

print("_____________________")
# Count how often each Imputation Method is present -> most "wins"
rank_wins = rank_1['Imputation_Method'].value_counts()
print(rank_wins)
print("_____________________")
# Take initial overview and filter for each imputation method and calculate average rank
methods = ['Random Forest', 'KNN', 'Mean/Mode', 'VAE', 'GAIN', 'Discriminative DL']
for i in methods:
    df_average_rank = data.loc[data['Imputation_Method'] == i]
    len_ar = len(df_average_rank)
    print(len_ar, "Amount of results available")
    rank_pos = df_average_rank['Downstream Performance Rank'].value_counts().sort_index(ascending=True)
    print(rank_pos)
    average_rank = df_average_rank["Downstream Performance Rank"].mean()
    print("Average Rank for", i, "is", average_rank)
    #average_improvement = df_average_rank["Improvement"].mean()
    #print("Average Improvement to baseline is", average_improvement)
    print("_____________________")



12 Data Constellations
_____________________
5.0    120
4.0    120
3.0    120
2.0    120
1.0    120
6.0    104
Name: Downstream Performance Rank, dtype: int64
_____________________
_____________________
GAIN                 55
VAE                  24
Mean/Mode            17
Random Forest        12
Discriminative DL     7
KNN                   5
Name: Imputation_Method, dtype: int64
_____________________
120 Amount of results available
1.0    12
2.0    13
3.0    13
4.0    20
5.0    32
6.0    30
Name: Downstream Performance Rank, dtype: int64
Average Rank for Random Forest is 4.141666666666667
_____________________
120 Amount of results available
1.0     5
2.0    17
3.0    27
4.0    36
5.0    21
6.0    14
Name: Downstream Performance Rank, dtype: int64
Average Rank for KNN is 3.775
_____________________
120 Amount of results available
1.0    17
2.0    26
3.0    31
4.0    17
5.0    14
6.0    15
Name: Downstream Performance Rank, dtype: int64
Average Rank for Mean/Mode is 3.25
____________

In [112]:
rank_1.head()

Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,...,name,MajorityClassSize,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses,Downstream Performance Rank,Data_Constellation
4,Mean/Mode,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.00284,0.0,0.002839,...,elevators,,,19.0,16599.0,19.0,0.0,,1.0,MAR - 0.01
10,GAIN,216,MAR,0.1,climbRate,downstream_performance_mean,Regression Tasks,0.002868,0.0,0.002899,...,elevators,,,19.0,16599.0,19.0,0.0,,1.0,MAR - 0.1
16,GAIN,216,MAR,0.3,climbRate,downstream_performance_mean,Regression Tasks,0.002868,0.0,0.002922,...,elevators,,,19.0,16599.0,19.0,0.0,,1.0,MAR - 0.3
22,GAIN,216,MAR,0.5,climbRate,downstream_performance_mean,Regression Tasks,0.002938,0.0,0.002937,...,elevators,,,19.0,16599.0,19.0,0.0,,1.0,MAR - 0.5
27,VAE,216,MCAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.002836,0.0,0.002838,...,elevators,,,19.0,16599.0,19.0,0.0,,1.0,MCAR - 0.01


In [113]:
# Take initial overview and filter best average imputation method and take filtered dataframe from 1.0 Ranking
# Where Data_Constellation identical -> Ranking 1.0 [Improvement] - Best_Imp_Method [Improvement]
# Write the difference in a separate column -> calculate average improvement

# Adjust this constant depending on the previous results
AVERAGE_BEST_IMPUTATION_METHOD = "VAE"

# .copy() avoids pandas' SettingWithCopyWarning when adding columns to a filtered frame
av_best = data.loc[data['Imputation_Method'] == AVERAGE_BEST_IMPUTATION_METHOD].copy()
av_best['Task'] = av_best['Task'].astype(str)
av_best['Data_Constellation'] = av_best['Data_Constellation'] + ' - ' + av_best['Task']

av_best = av_best[['Imputation_Method', 'Imputed', 'Data_Constellation', 'Downstream Performance Rank']]
av_best = av_best.rename(columns={'Imputation_Method':'Imputation_Method_average', 
                               'Imputed':'Imputed_average',
                                 'Downstream Performance Rank':'Downstream Performance Rank Average'})

#av_best.head()

rank_1 = rank_1.copy()
rank_1['Task'] = rank_1['Task'].astype(str)
rank_1['Data_Constellation'] = rank_1['Data_Constellation'] + ' - ' + rank_1['Task']
rank_1 = rank_1[['Imputation_Method', 'Imputed', 'Data_Constellation', 'Downstream Performance Rank']]
rank_1 = rank_1.rename(columns={'Imputation_Method':'Imputation_Method_best', 
                               'Imputed':'Imputed_best',
                               'Downstream Performance Rank':'Downstream Performance Rank Best'})

performance_difference = pd.merge(av_best, rank_1, on='Data_Constellation')
performance_difference.head()







Unnamed: 0,Imputation_Method_average,Imputed_average,Data_Constellation,Downstream Performance Rank Average,Imputation_Method_best,Imputed_best,Downstream Performance Rank Best
0,VAE,0.002838,MAR - 0.01 - 216,3.0,Mean/Mode,0.002839,1.0
1,VAE,0.00285,MAR - 0.1 - 216,4.0,GAIN,0.002899,1.0
2,VAE,0.002872,MAR - 0.3 - 216,4.0,GAIN,0.002922,1.0
3,VAE,0.002907,MAR - 0.5 - 216,2.0,GAIN,0.002937,1.0
4,VAE,0.002838,MCAR - 0.01 - 216,1.0,VAE,0.002838,1.0


In [114]:
#performance_difference['Imputed_best'] = performance_difference['Improvement_best'] + 1
#performance_difference['Imputed_average'] = performance_difference['Improvement_average'] + 1

performance_difference['Performance Difference Best to Average'] = performance_difference['Imputed_best'] - performance_difference['Imputed_average']
Average_Difference = performance_difference['Performance Difference Best to Average'].mean()
print("Average Difference in Improvement from best method to average best method for RMSE", Average_Difference)


Average Difference in Improvement from best method to average best method for RMSE 1.1163316524901368


In [115]:

performance_difference.to_csv('performance_difference.csv')

In [116]:
performance_difference.head()

Unnamed: 0,Imputation_Method_average,Imputed_average,Data_Constellation,Downstream Performance Rank Average,Imputation_Method_best,Imputed_best,Downstream Performance Rank Best,Performance Difference Best to Average
0,VAE,0.002838,MAR - 0.01 - 216,3.0,Mean/Mode,0.002839,1.0,3.824422e-07
1,VAE,0.00285,MAR - 0.1 - 216,4.0,GAIN,0.002899,1.0,4.845833e-05
2,VAE,0.002872,MAR - 0.3 - 216,4.0,GAIN,0.002922,1.0,4.974528e-05
3,VAE,0.002907,MAR - 0.5 - 216,2.0,GAIN,0.002937,1.0,2.971394e-05
4,VAE,0.002838,MCAR - 0.01 - 216,1.0,VAE,0.002838,1.0,0.0


## Analysis and Ranking based on RMSE

In [117]:
# Relative Difference in Percent -> Best Method to Average Best Method

#AVERAGE_BEST_IMPUTATION_METHOD = "VAE"

data = downstream_results_rank
data['Task'] = data['Task'].astype(str)
data['Data_Constellation_full'] = data['Data_Constellation'] + ' - ' + data['Task']

# TODO: drop unnecessary columns here
dc_unique = data.Data_Constellation_full.unique()
#print(dc_unique)

#data_constellations = ['MAR - 0.01', 'MAR - 0.1', 'MAR - 0.3', 'MCAR - 0.5', 'MCAR - 0.01', 'MCAR - 0.1', 'MCAR - 0.3', 'MCAR - 0.5', 'MNAR - 0.01', 'MNAR - 0.1', 'MNAR - 0.3', 'MNAR - 0.5']
data_constellations = dc_unique.tolist()
methods = ['Random Forest', 'KNN', 'Mean/Mode', 'VAE', 'GAIN', 'Discriminative DL']
#print(data_constellations)
#print(type(methods))
average_best_frames = []


for i in data_constellations:
    data_constel = data.loc[data['Data_Constellation_full'] == i]
    best_score = data_constel.loc[data_constel['Downstream Performance Rank'] == 1.0]
    # .copy() so that adding a column below does not trigger SettingWithCopyWarning
    average_best = data_constel.loc[data_constel['Imputation_Method'] == AVERAGE_BEST_IMPUTATION_METHOD].copy()
    best_score_int = best_score.iloc[0]['Imputed']
    #print(best_score_int)
    average_best_int = average_best.iloc[0]['Imputed']
    #print(average_best_int)
    # relative difference as a fraction of the best score (multiply by 100 for percent)
    calc_result = ((best_score_int - average_best_int)/best_score_int)
    calc_result = abs(calc_result)
#    print(calc_result)
#    print(i)
    average_best['Performance Difference to Best to Average in Percent'] = calc_result
    average_best_frames.append(average_best)

# DataFrame.append is deprecated; collect the pieces and concatenate once
average_best_complete = pd.concat(average_best_frames)

average_best_complete







Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,...,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses,Downstream Performance Rank,Data_Constellation,Data_Constellation_full,Performance Difference to Best to Average in Percent
2,VAE,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.002839,0.0,0.002838,...,,19.0,16599.0,19.0,0.0,,3.0,MAR - 0.01,MAR - 0.01 - 216,0.000135
7,VAE,216,MAR,0.1,climbRate,downstream_performance_mean,Regression Tasks,0.002857,0.0,0.002850,...,,19.0,16599.0,19.0,0.0,,4.0,MAR - 0.1,MAR - 0.1 - 216,0.016716
13,VAE,216,MAR,0.3,climbRate,downstream_performance_mean,Regression Tasks,0.002860,0.0,0.002872,...,,19.0,16599.0,19.0,0.0,,4.0,MAR - 0.3,MAR - 0.3 - 216,0.017024
21,VAE,216,MAR,0.5,climbRate,downstream_performance_mean,Regression Tasks,0.002911,0.0,0.002907,...,,19.0,16599.0,19.0,0.0,,2.0,MAR - 0.5,MAR - 0.5 - 216,0.010119
27,VAE,216,MCAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.002836,0.0,0.002838,...,,19.0,16599.0,19.0,0.0,,1.0,MCAR - 0.01,MCAR - 0.01 - 216,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
678,VAE,42712,MCAR,0.5,humidity,downstream_performance_mean,Regression Tasks,152.144893,0.0,152.251677,...,,13.0,17379.0,9.0,4.0,,2.0,MCAR - 0.5,MCAR - 0.5 - 42712,0.008144
684,VAE,42712,MNAR,0.01,humidity,downstream_performance_mean,Regression Tasks,149.535129,0.0,149.664127,...,,13.0,17379.0,9.0,4.0,,2.0,MNAR - 0.01,MNAR - 0.01 - 42712,0.000087
691,VAE,42712,MNAR,0.1,humidity,downstream_performance_mean,Regression Tasks,150.072266,0.0,149.894417,...,,13.0,17379.0,9.0,4.0,,1.0,MNAR - 0.1,MNAR - 0.1 - 42712,0.000000
697,VAE,42712,MNAR,0.3,humidity,downstream_performance_mean,Regression Tasks,151.276674,0.0,151.678701,...,,13.0,17379.0,9.0,4.0,,1.0,MNAR - 0.3,MNAR - 0.3 - 42712,0.000000


In [118]:
average_difference = average_best_complete['Performance Difference to Best to Average in Percent'].mean()
print(average_difference, "average difference in Percent")

0.0077032091030239825 average difference in Percent
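The per-constellation loop above computes |best − reference| / best for each constellation and then averages the result. The same quantity can be obtained without an explicit loop by aligning two Series on the constellation label. The sketch below uses invented scores, and `reference_method` is a hypothetical stand-in for `AVERAGE_BEST_IMPUTATION_METHOD`.

```python
import pandas as pd

# Hypothetical scores; rank 1 marks the best method per constellation.
df = pd.DataFrame({
    "Data_Constellation_full": ["MAR - 0.1 - 216"] * 3 + ["MCAR - 0.1 - 216"] * 3,
    "Imputation_Method": ["KNN", "VAE", "GAIN"] * 2,
    "Imputed": [4.0, 4.4, 5.0, 2.0, 2.5, 1.0],
    "Downstream Performance Rank": [3.0, 2.0, 1.0, 2.0, 1.0, 3.0],
})
reference_method = "VAE"  # assumed average-best method

# Score of the rank-1 method in each constellation.
best = (
    df[df["Downstream Performance Rank"] == 1.0]
    .groupby("Data_Constellation_full")["Imputed"]
    .first()
)
# Score of the reference method in each constellation.
ref = df[df["Imputation_Method"] == reference_method].set_index("Data_Constellation_full")["Imputed"]

# |best - reference| / best, aligned on the constellation label.
relative_gap = ((best - ref) / best).abs()
print(relative_gap)
print(relative_gap.mean())
```

Note that the result is a fraction, not a percentage: a value of 0.12 corresponds to 12 %.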


In [119]:
# Relative Difference in absolute values (F1 Score) -> Best Method to Average Best Method
'''
AVERAGE_BEST_IMPUTATION_METHOD = "Random Forest"

data = downstream_results_rank
data['Task'] = data['Task'].astype(str)
data['Data_Constellation_full'] = data['Data_Constellation'] + ' - ' + data['Task']

# TODO: drop unnecessary columns here
dc_unique = data.Data_Constellation_full.unique()
#print(dc_unique)

#data_constellations = ['MAR - 0.01', 'MAR - 0.1', 'MAR - 0.3', 'MCAR - 0.5', 'MCAR - 0.01', 'MCAR - 0.1', 'MCAR - 0.3', 'MCAR - 0.5', 'MNAR - 0.01', 'MNAR - 0.1', 'MNAR - 0.3', 'MNAR - 0.5']
data_constellations = dc_unique.tolist()
methods = ['Random Forest', 'KNN', 'Mean/Mode', 'VAE', 'GAIN', 'Discriminative DL']
#print(data_constellations)
#print(type(methods))
average_best_total = pd.DataFrame()


for i in data_constellations:
    data_constel = data.loc[data['Data_Constellation_full'] == i]
    best_score = data_constel.loc[data_constel['Downstream Performance Rank'] == 1.0]
    average_best = data_constel.loc[data_constel['Imputation_Method'] == AVERAGE_BEST_IMPUTATION_METHOD]
    best_score_int = best_score.iloc[0]['Imputed']
    #print(best_score_int)
    average_best_int = average_best.iloc[0]['Imputed']
    #print(average_best_int)
    calc_result = (average_best_int - best_score_int)
#    print(calc_result)
#    print(i)
    average_best['Performance Difference to Best to Average in absolute'] = calc_result
    average_best_total = average_best_total.append(average_best)
 
average_best_total
'''


In [120]:
#average_difference = average_best_total['Performance Difference to Best to Average in absolute'].mean()
#print(average_difference, "average difference in absolut")

## Heatmap (needs to be adjusted)

In [121]:
#df_heat = pd.read_csv('downstream_results_rank_temp.csv')
df_heat = downstream_results_rank.copy()
df_heat = df_heat.drop(["Missing Type", "Missing Fraction", "Column", "result_type", "metric", "Baseline", "Corrupted", "Unnamed: 0", "name", "NumberOfClasses", "MajorityClassSize", "MinorityClassSize"], axis=1)
#df_heat['Improvement'] = df_heat['Improvement'] - 1
df_heat

Unnamed: 0,Imputation_Method,Task,Imputed,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,Downstream Performance Rank,Data_Constellation,Data_Constellation_full
0,Discriminative DL,216,0.002838,19.0,16599.0,19.0,0.0,5.0,MAR - 0.01,MAR - 0.01 - 216
1,KNN,216,0.002838,19.0,16599.0,19.0,0.0,4.0,MAR - 0.01,MAR - 0.01 - 216
2,VAE,216,0.002838,19.0,16599.0,19.0,0.0,3.0,MAR - 0.01,MAR - 0.01 - 216
3,Random Forest,216,0.002838,19.0,16599.0,19.0,0.0,2.0,MAR - 0.01,MAR - 0.01 - 216
4,Mean/Mode,216,0.002839,19.0,16599.0,19.0,0.0,1.0,MAR - 0.01,MAR - 0.01 - 216
...,...,...,...,...,...,...,...,...,...,...
699,Random Forest,42712,147.707377,13.0,17379.0,9.0,4.0,5.0,MNAR - 0.5,MNAR - 0.5 - 42712
700,KNN,42712,148.474038,13.0,17379.0,9.0,4.0,4.0,MNAR - 0.5,MNAR - 0.5 - 42712
701,Discriminative DL,42712,148.593887,13.0,17379.0,9.0,4.0,3.0,MNAR - 0.5,MNAR - 0.5 - 42712
702,VAE,42712,149.268209,13.0,17379.0,9.0,4.0,2.0,MNAR - 0.5,MNAR - 0.5 - 42712


In [122]:
# Get a dataframe for each "Data_Constellation"
# Work with variables here -> list of constellations

# Possibly use a for loop here, etc.


# drop unnecessary columns

#df_heat = downstream_results_rank
#df_heat.drop(["Missing Type", "Missing Fraction", "Column", "result_type", "metric", "Baseline", "Imputed", "Corrupted", "Unnamed: 0", "Unnamed: 0", "name", "NumberOfClasses", "MajorityClassSize", "MinorityClassSize"], axis=1, inplace=True)

#df_heat['Improvement'] = df_heat['Improvement']
df_heat = df_heat.astype({"Task":"string"})

#mar001.drop(["Missing Type", "Missing Fraction", "Column", "result_type", "metric", "Baseline", "Imputed", "Corrupted", "Unnamed: 0"], axis=1, inplace=True)

# 'MAR - 0.5' was missing and 'MCAR - 0.5' was listed twice
data_constellations = ['MAR - 0.01', 'MAR - 0.1', 'MAR - 0.3', 'MAR - 0.5', 'MCAR - 0.01', 'MCAR - 0.1', 'MCAR - 0.3', 'MCAR - 0.5', 'MNAR - 0.01', 'MNAR - 0.1', 'MNAR - 0.3', 'MNAR - 0.5']


for i in data_constellations:
    data_constel = df_heat.loc[df_heat['Data_Constellation'] == i]

    ### uncomment whatever you want to investigate

    ## sort by amount datapoints (ascending)
    data_constel = data_constel.sort_values(by=['NumberOfInstances'])

    ## sort by amount of features (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfFeatures'])

    ## sort by amount of datapoints and features (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfInstances', 'NumberOfFeatures'])

    ## sort by amount of categorical features and datapoints (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfCategoricalFeatures', 'NumberOfInstances'])

    ## sort by amount of numerical features and datapoints (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfNumericFeatures', 'NumberOfInstances'])
    
    Dataset_number = data_constel["Task"]
    Imputation_Method = data_constel["Imputation_Method"]
    Improvement = data_constel["Imputed"]
    

    trace = go.Heatmap(
        z=Improvement,
        x=Dataset_number,
        y=Imputation_Method,
        type='heatmap',
        autocolorscale=False,
        colorscale='Reds',
        #zmid=0,
        #hoverinfo='text',
        #text=hovertext
    )
    data = [trace]
    fig = go.Figure(data=data)
    fig.update_layout(
        title=i,
        xaxis_nticks=36)
    fig.show()

In [123]:
downstream_results_rank_heatmap2

Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,...,name,MajorityClassSize,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses,Downstream Performance Rank,Data_Constellation
0,Discriminative DL,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.002840,0.0,0.002838,...,elevators,,,19.0,16599.0,19.0,0.0,,5.0,MAR - 0.01
1,KNN,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.002839,0.0,0.002838,...,elevators,,,19.0,16599.0,19.0,0.0,,4.0,MAR - 0.01
2,VAE,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.002839,0.0,0.002838,...,elevators,,,19.0,16599.0,19.0,0.0,,3.0,MAR - 0.01
3,Random Forest,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.002839,0.0,0.002838,...,elevators,,,19.0,16599.0,19.0,0.0,,2.0,MAR - 0.01
4,Mean/Mode,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.002840,0.0,0.002839,...,elevators,,,19.0,16599.0,19.0,0.0,,1.0,MAR - 0.01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
699,Random Forest,42712,MNAR,0.5,humidity,downstream_performance_mean,Regression Tasks,148.151741,0.0,147.707377,...,Bike_Sharing_Demand,,,13.0,17379.0,9.0,4.0,,5.0,MNAR - 0.5
700,KNN,42712,MNAR,0.5,humidity,downstream_performance_mean,Regression Tasks,149.553233,0.0,148.474038,...,Bike_Sharing_Demand,,,13.0,17379.0,9.0,4.0,,4.0,MNAR - 0.5
701,Discriminative DL,42712,MNAR,0.5,humidity,downstream_performance_mean,Regression Tasks,149.191945,0.0,148.593887,...,Bike_Sharing_Demand,,,13.0,17379.0,9.0,4.0,,3.0,MNAR - 0.5
702,VAE,42712,MNAR,0.5,humidity,downstream_performance_mean,Regression Tasks,150.016976,0.0,149.268209,...,Bike_Sharing_Demand,,,13.0,17379.0,9.0,4.0,,2.0,MNAR - 0.5


In [124]:
#df_heat = pd.read_csv('downstream_results_rank_temp.csv')
df_heat_dif = downstream_results_rank_heatmap2
df_heat_dif.drop(["Missing Type", "Missing Fraction", "Column", "result_type", "metric", "Baseline", "Corrupted", "Unnamed: 0", "name", "NumberOfClasses", "MajorityClassSize", "MinorityClassSize"], axis=1, inplace=True)
#df_heat['Improvement'] = df_heat['Improvement'] - 1
df_heat_dif


Unnamed: 0,Imputation_Method,Task,Imputed,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,Downstream Performance Rank,Data_Constellation
0,Discriminative DL,216,0.002838,19.0,16599.0,19.0,0.0,5.0,MAR - 0.01
1,KNN,216,0.002838,19.0,16599.0,19.0,0.0,4.0,MAR - 0.01
2,VAE,216,0.002838,19.0,16599.0,19.0,0.0,3.0,MAR - 0.01
3,Random Forest,216,0.002838,19.0,16599.0,19.0,0.0,2.0,MAR - 0.01
4,Mean/Mode,216,0.002839,19.0,16599.0,19.0,0.0,1.0,MAR - 0.01
...,...,...,...,...,...,...,...,...,...
699,Random Forest,42712,147.707377,13.0,17379.0,9.0,4.0,5.0,MNAR - 0.5
700,KNN,42712,148.474038,13.0,17379.0,9.0,4.0,4.0,MNAR - 0.5
701,Discriminative DL,42712,148.593887,13.0,17379.0,9.0,4.0,3.0,MNAR - 0.5
702,VAE,42712,149.268209,13.0,17379.0,9.0,4.0,2.0,MNAR - 0.5


In [125]:
# Calculate, for every imputation method, the difference to the on-average
# best imputation method within each data constellation and task.

# Relative difference: (average-best score - method score) / method score.
# Note: the result is stored as a fraction (multiply by 100 for percent).

#AVERAGE_BEST_IMPUTATION_METHOD = "VAE"
print(AVERAGE_BEST_IMPUTATION_METHOD)
data = downstream_results_rank.copy()
data['Task'] = data['Task'].astype(str)
data['Data_Constellation_full'] = data['Data_Constellation'] + ' - ' + data['Task']

# TODO: drop unnecessary columns here
data_constellations = data['Data_Constellation_full'].unique().tolist()

# The average-best method may stay in this list; its difference to itself is always 0.
#methods = ['KNN', 'Mean/Mode', 'VAE', 'GAIN', 'Discriminative DL']
methods = ['Random Forest', 'KNN', 'Mean/Mode', 'VAE', 'GAIN', 'Discriminative DL']

# Collect the result rows in a list and concatenate once at the end
# (DataFrame.append is deprecated in favor of pandas.concat).
difference_rows = []

for constellation in data_constellations:
    data_constel = data.loc[data['Data_Constellation_full'] == constellation]
    average_best = data_constel.loc[data_constel['Imputation_Method'] == AVERAGE_BEST_IMPUTATION_METHOD]
    for method in methods:
        if (data_constel['Imputation_Method'] == method).any():
            # .copy() avoids the SettingWithCopyWarning when the new column is added
            current_score_row = data_constel.loc[data_constel['Imputation_Method'] == method].copy()
            current_score = current_score_row.iloc[0]['Imputed']
            average_best_score = average_best.iloc[0]['Imputed']

            calc_result = (average_best_score - current_score) / current_score

            current_score_row['Performance Difference to Average Best in Percent'] = calc_result
            difference_rows.append(current_score_row)
        else:
            print("Imputation Method not here ---------------------")

heatmap_data_difference = pd.concat(difference_rows)
heatmap_data_difference





A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.




VAE
Imputation Method not here ---------------------
Imputation Method not here ---------------------
Imputation Method not here ---------------------
Imputation Method not here ---------------------





Imputation Method not here ---------------------
Imputation Method not here ---------------------
Imputation Method not here ---------------------
Imputation Method not here ---------------------
Imputation Method not here ---------------------
Imputation Method not here ---------------------
Imputation Method not here ---------------------





Imputation Method not here ---------------------





Imputation Method not here ---------------------
Imputation Method not here ---------------------
Imputation Method not here ---------------------
Imputation Method not here ---------------------





Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,...,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses,Downstream Performance Rank,Data_Constellation,Data_Constellation_full,Performance Difference to Average Best in Percent
3,Random Forest,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.002839,0.0,0.002838,...,,19.0,16599.0,19.0,0.0,,2.0,MAR - 0.01,MAR - 0.01 - 216,-0.000006
1,KNN,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.002839,0.0,0.002838,...,,19.0,16599.0,19.0,0.0,,4.0,MAR - 0.01,MAR - 0.01 - 216,0.000042
4,Mean/Mode,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.002840,0.0,0.002839,...,,19.0,16599.0,19.0,0.0,,1.0,MAR - 0.01,MAR - 0.01 - 216,-0.000135
2,VAE,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.002839,0.0,0.002838,...,,19.0,16599.0,19.0,0.0,,3.0,MAR - 0.01,MAR - 0.01 - 216,0.000000
0,Discriminative DL,216,MAR,0.01,climbRate,downstream_performance_mean,Regression Tasks,0.002840,0.0,0.002838,...,,19.0,16599.0,19.0,0.0,,5.0,MAR - 0.01,MAR - 0.01 - 216,0.000201
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
700,KNN,42712,MNAR,0.5,humidity,downstream_performance_mean,Regression Tasks,149.553233,0.0,148.474038,...,,13.0,17379.0,9.0,4.0,,4.0,MNAR - 0.5,MNAR - 0.5 - 42712,0.005349
698,Mean/Mode,42712,MNAR,0.5,humidity,downstream_performance_mean,Regression Tasks,147.017846,0.0,146.311138,...,,13.0,17379.0,9.0,4.0,,6.0,MNAR - 0.5,MNAR - 0.5 - 42712,0.020211
702,VAE,42712,MNAR,0.5,humidity,downstream_performance_mean,Regression Tasks,150.016976,0.0,149.268209,...,,13.0,17379.0,9.0,4.0,,2.0,MNAR - 0.5,MNAR - 0.5 - 42712,0.000000
703,GAIN,42712,MNAR,0.5,humidity,downstream_performance_mean,Regression Tasks,148.397969,0.0,149.936478,...,,13.0,17379.0,9.0,4.0,,1.0,MNAR - 0.5,MNAR - 0.5 - 42712,-0.004457
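
The relative difference to a reference method can also be computed without an explicit loop. The following is a minimal sketch, not the notebook's own implementation: the helper name `relative_difference_to_reference` and the output column `relative_difference` are assumptions, the toy values are made up, and it assumes exactly one row per method and per `Data_Constellation_full` group. Groups that lack the reference method receive `NaN`.

```python
import pandas as pd

# Toy data (values made up for illustration); column names follow the table above.
toy = pd.DataFrame({
    "Imputation_Method": ["VAE", "KNN", "Mean/Mode", "VAE", "KNN"],
    "Data_Constellation_full": ["MAR - 0.01 - 1"] * 3 + ["MNAR - 0.5 - 2"] * 2,
    "Imputed": [10.0, 8.0, 12.5, 4.0, 5.0],
})


def relative_difference_to_reference(df, reference_method,
                                     score_col="Imputed",
                                     group_col="Data_Constellation_full"):
    """Return a copy of df with (reference - score) / score per group."""
    # Score of the reference method, indexed by group
    # (assumes the reference method occurs at most once per group).
    reference = (
        df.loc[df["Imputation_Method"] == reference_method]
        .set_index(group_col)[score_col]
    )
    out = df.copy()
    # Look up the reference score for each row's group; NaN if it is missing.
    reference_scores = out[group_col].map(reference)
    out["relative_difference"] = (reference_scores - out[score_col]) / out[score_col]
    return out


result = relative_difference_to_reference(toy, "VAE")
print(result)
```

The reference method itself always gets a difference of 0, so it can be left in or filtered out before plotting.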


In [126]:
# Get a dataframe for each "Data_Constellation"
# Work with variables here -> list of constellations

# Possibly use a for loop here, etc.


# drop unnecessary columns

#df_heat = downstream_results_rank
#df_heat.drop(["Missing Type", "Missing Fraction", "Column", "result_type", "metric", "Baseline", "Imputed", "Corrupted", "Unnamed: 0", "Unnamed: 0", "name", "NumberOfClasses", "MajorityClassSize", "MinorityClassSize"], axis=1, inplace=True)

#df_heat['Improvement'] = df_heat['Improvement']
heatmap_data_difference = heatmap_data_difference.astype({"Task":"string"})

#mar001.drop(["Missing Type", "Missing Fraction", "Column", "result_type", "metric", "Baseline", "Imputed", "Corrupted", "Unnamed: 0"], axis=1, inplace=True)

data_constellations = ['MAR - 0.01', 'MAR - 0.1', 'MAR - 0.3', 'MAR - 0.5', 'MCAR - 0.01', 'MCAR - 0.1', 'MCAR - 0.3', 'MCAR - 0.5', 'MNAR - 0.01', 'MNAR - 0.1', 'MNAR - 0.3', 'MNAR - 0.5']


for i in data_constellations:
    data_constel = heatmap_data_difference.loc[heatmap_data_difference['Data_Constellation'] == i]

    ### uncomment whatever you want to investigate

    ## sort by amount datapoints (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfInstances'])

    ## sort by amount of features (ascending)
    data_constel = data_constel.sort_values(by=['NumberOfFeatures'])

    ## sort by amount of datapoints and features (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfInstances', 'NumberOfFeatures'])

    ## sort by amount of categorical features and datapoints (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfCategoricalFeatures', 'NumberOfInstances'])

    ## sort by amount of numerical features and datapoints (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfNumericFeatures', 'NumberOfInstances'])
    
    Dataset_number = data_constel["Task"]
    Imputation_Method = data_constel["Imputation_Method"]
    Improvement = data_constel["Performance Difference to Average Best in Percent"]
    

    trace = go.Heatmap(
        z=Improvement,
        x=Dataset_number,
        y=Imputation_Method,
        type='heatmap',
        autocolorscale=False,
        colorscale='RdBu_r',
        zmid=0,
        #hoverinfo='text',
        #text=hovertext
    )
    data = [trace]
    fig = go.Figure(data=data)
    fig.update_layout(
        title=i,
        xaxis_nticks=36)
    fig.show()

## Plotly Heatmaps

In [127]:
#heatmap_mar001.head()


In [128]:


mar001.head()


NameError: name 'mar001' is not defined

In [None]:
#testmar001 = xr.tutorial.open_dataset('air_temperature').air.sel(lon=250.0)

'''
#plotly express test

fig = px.imshow(heatmap_mar001, text_auto = True, 
                labels=dict(x="Task", y="Imputation_Method", color="Improvement"),
                color_continuous_scale='RdBu_r', color_continuous_midpoint=0)
'''

In [None]:
### uncomment whatever you want to investigate

## sort by amount datapoints (ascending)
#mar001 = mar001.sort_values(by=['NumberOfInstances'])

## sort by amount of features (ascending)
#mar001 = mar001.sort_values(by=['NumberOfFeatures'])

## sort by amount of datapoints and features (ascending)
#mar001 = mar001.sort_values(by=['NumberOfInstances', 'NumberOfFeatures'])

## sort by amount of categorical features and datapoints (ascending)
#mar001 = mar001.sort_values(by=['NumberOfCategoricalFeatures', 'NumberOfInstances'])

## sort by amount of numerical features and datapoints (ascending)
#mar001 = mar001.sort_values(by=['NumberOfNumericFeatures', 'NumberOfInstances'])



mar001 = mar001.astype({"Task":"string"})

Dataset_number = mar001["Task"]
Imputation_Method = mar001["Imputation_Method"]
Improvement = mar001["Improvement"]




In [None]:
trace = go.Heatmap(
                   z=Improvement,
                   x=Dataset_number,
                   y=Imputation_Method,
                   type = 'heatmap',
                    autocolorscale= False,
                    colorscale = 'RdBu_r',
                    zmid=0,
                    hoverinfo='text',
                    text=hovertext
                    )




data = [trace]
fig = go.Figure(data=data)
#iplot(fig)


fig.show()

To-dos for visualization:
- Options for easily adjusting the sorting/display:
    - Number of data points
    - Number of features
    - Number of numerical features
    - Number of categorical features
- Set up a loop over all data constellations (do not hard-code them)
- Show the best imputation method per dataset separately in another heatmap
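
One possible way to address the sorting options and the non-hard-coded loop is sketched below. This is only a sketch under assumptions: the generator name `heatmap_matrices` and its `sort_by` parameter are hypothetical, the score values are made up, and the column names are taken from the tables in this notebook. It derives the constellations from the data and returns one method-by-task matrix per constellation, with tasks ordered by a chosen dataset property.

```python
import pandas as pd

# Toy data (scores made up for illustration).
toy = pd.DataFrame({
    "Data_Constellation": ["MAR - 0.01"] * 4 + ["MNAR - 0.5"] * 2,
    "Task": ["216", "216", "42712", "42712", "216", "216"],
    "Imputation_Method": ["VAE", "KNN", "VAE", "KNN", "VAE", "KNN"],
    "NumberOfInstances": [16599, 16599, 17379, 17379, 16599, 16599],
    "NumberOfFeatures": [19, 19, 13, 13, 19, 19],
    "Imputed": [0.1, 0.2, 10.0, 11.0, 0.3, 0.4],
})


def heatmap_matrices(df, value_col, sort_by="NumberOfInstances"):
    """Yield (constellation, matrix) pairs; rows are methods, columns are tasks."""
    # Derive the constellations from the data instead of hard-coding them.
    for constellation in sorted(df["Data_Constellation"].unique()):
        subset = df.loc[df["Data_Constellation"] == constellation]
        # Order the tasks (heatmap columns) by the chosen dataset property.
        task_order = subset.sort_values(sort_by)["Task"].drop_duplicates().tolist()
        matrix = subset.pivot_table(
            index="Imputation_Method", columns="Task",
            values=value_col, aggfunc="mean",
        )
        yield constellation, matrix.reindex(columns=task_order)


matrices = dict(heatmap_matrices(toy, "Imputed", sort_by="NumberOfFeatures"))
for name, matrix in matrices.items():
    print(name)
    print(matrix)
```

Each matrix can then be handed to a plotting call of choice, using its values as `z`, its columns as `x`, and its index as `y`.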



To-dos for the remaining analysis (mathematical part):
- Determine the best imputation method per dataset (preferably via ranking, per constellation, e.g. MAR 0.01)
- Determine the average rank of each imputation method (via ranking, then per constellation, e.g. MAR 0.01)
- Compare the best imputation method per dataset with the on-average best imputation method (one list of best methods and one list of average-best methods, then compare; each constellation appears exactly once in each list)
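
The three analysis steps above could be sketched with `groupby` and `rank` as follows. This is an illustration under assumptions, not the notebook's method: the scores are made up, the helper column `rank` is hypothetical, and lower scores are assumed to be better (as for an error metric such as RMSE). That ranking direction may differ from the existing `Downstream Performance Rank` column, so flip `ascending` if needed.

```python
import pandas as pd

# Toy data (scores made up for illustration; lower is assumed to be better).
toy = pd.DataFrame({
    "Data_Constellation": ["MAR - 0.01"] * 9,
    "Task": ["1"] * 3 + ["2"] * 3 + ["3"] * 3,
    "Imputation_Method": ["VAE", "KNN", "Mean/Mode"] * 3,
    "Imputed": [0.9, 1.0, 1.2, 2.0, 1.5, 2.5, 1.0, 1.1, 1.3],
})

group_cols = ["Data_Constellation", "Task"]

# Step 1: rank the methods within each constellation and dataset (1 = best).
toy["rank"] = toy.groupby(group_cols)["Imputed"].rank(method="min", ascending=True)
best_per_dataset = toy.loc[toy["rank"] == 1, group_cols + ["Imputation_Method"]]

# Step 2: average rank of each method per constellation.
average_rank = (
    toy.groupby(["Data_Constellation", "Imputation_Method"])["rank"]
    .mean()
    .reset_index()
)
# Method with the lowest average rank per constellation.
average_best = average_rank.loc[
    average_rank.groupby("Data_Constellation")["rank"].idxmin()
]

# Step 3: compare the per-dataset best method with the average-best method.
comparison = best_per_dataset.merge(
    average_best[["Data_Constellation", "Imputation_Method"]],
    on="Data_Constellation",
    suffixes=("_best", "_average_best"),
)
comparison["agrees"] = (
    comparison["Imputation_Method_best"] == comparison["Imputation_Method_average_best"]
)
print(comparison)
```

With `method="min"`, tied methods share the best rank, so a dataset can have more than one best method.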



Miscellaneous (no priority)
- Filtering options (implement if needed; no priority for now)
    - Numerical feature was imputed
    - Categorical feature was imputed
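
A filter on the type of the imputed feature could look like the sketch below. It rests on an assumption: the notebook does not define a mapping from column name to feature type, so `column_types`, the column name `season`, and the helper `filter_by_imputed_type` are all hypothetical, and the values are made up.

```python
import pandas as pd

# Hypothetical mapping from imputed column name to feature type
# (an assumption; it would have to come from the dataset metadata).
column_types = {"climbRate": "numeric", "season": "categorical"}

# Toy data (values made up for illustration).
toy = pd.DataFrame({
    "Column": ["climbRate", "season", "climbRate"],
    "Imputation_Method": ["VAE", "VAE", "KNN"],
    "Imputed": [1.0, 2.0, 3.0],
})


def filter_by_imputed_type(df, column_types, feature_type):
    """Keep only the rows whose imputed column has the requested feature type."""
    types = df["Column"].map(column_types)
    return df.loc[types == feature_type]


numeric_only = filter_by_imputed_type(toy, column_types, "numeric")
categorical_only = filter_by_imputed_type(toy, column_types, "categorical")
print(numeric_only)
```

Columns that are missing from the mapping are dropped by both filters, because their looked-up type is `NaN`.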

## Application Scenario 2 - Downstream Performance

### Categorical  Columns (Classification)

In [None]:
'''
draw_cat_box_plot(
    downstream_results,
    "Improvement",
    (-0.15, 0.3),
    FIGURES_PATH,
    "fully_observed_downstream_boxplot.eps",
    hue_order=list(rename_imputer_dict.values()),
    row_order=list(rename_metric_dict.values())
)
'''
# Not used at the moment -> function from other file required, check first field