# Vergleich von Imputation Methoden

An dieser Stelle sollen verschiede Methoden zum interpolieren von fehlenden Werten betrachtet und verglichen werden.

### Vorbereitung

In [1]:
import pandas as pd
import numpy as np
import math
import time

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, KNNImputer

In der Folge werden keine Regionen oder Gruppierungen von Ländern und nur die Jahre ab 1990 betrachtet. 
Von dem verbliebenen Datensatz werden nur Indikatoren behalten, die mindesten 20% gefüllt sind und nach ihrer Relevanz manuell ausgewählt wurden (siehe 03_preparing_data).

In [2]:
def reset_base():
    base= pd.read_csv('additional_data/base.csv') 
    base.set_index(['Country Name', 'Indicator Name'], inplace=True)
    base = base.sort_index(level=['Country Name', 'Indicator Name'])
    return base

In [3]:
base = reset_base()
base.isna().sum().sum()

170307

Um die Performanz unterschiedlicher Imutation Verfahren zu vergleichen werden weitere 1000 vorhandene (nicht NaN) Einträge entfernt. Diese 1000 Einträge werden später als Test-Daten verwendet, auf ihrer Grundlage lassen sich die Fehler der analysierten Verfahren errechnen. Um sicherzustellen, dass die Ergebnisse reproduzierbar sind wird ein Random State gesetzt. Dann werden zufällig Koordinaten zu Dateneinträgen gezogen. Da an dieser Stelle nur vorhandene Einträge relevant sind, werden zunächst zu viele Koordinaten gezogen, diese dann mit dem Datensatz abgeglichen und gelöscht, falls sie zu einem NaN zeigen und dann die ersten 1000 verbliebenen (und damit relevanten) Einträge ausgewählt. Diese werden in den Trainingsdaten entfernt. Auf diese Weise bleibt eine Reproduzierbarkeit erhalten.

In [4]:
base.head(3)

Unnamed: 0_level_0,Unnamed: 1_level_0,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
Country Name,Indicator Name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
Afghanistan,Access to clean fuels and technologies for cooking (% of population),,,,,,,,,,,...,21.5,23.0,24.799999,26.700001,28.6,30.299999,32.200001,34.099998,36.0,
Afghanistan,Access to electricity (% of population),,,,,,,,,,,...,43.222019,69.1,68.982941,89.5,71.5,97.7,97.7,98.715622,97.7,
Afghanistan,"Access to electricity, rural (% of rural population)",,,,,,,,,,,...,29.572881,60.849157,61.315788,86.500512,64.573354,97.09936,97.091973,98.309603,96.90219,


### Simulation fehlender Werte und Evaluation

In [5]:
def get_cords(frac):
    n = int(base.isna().sum().sum()*frac)
    print(f'Testdaten mit {frac*100}% fehlenden Werten (absolut: {n})')
    #random state to ensure reproducibility
    rnds = np.random.RandomState(n)

    #coordinates for data entries to be removed randomly
    #5000 entries are selected
    cords = pd.DataFrame([[rnds.randint(0, len(base), size=n*4)[i], 
                  rnds.randint(0, len(base.columns), size=n*4)[i]]
                  for i in range(n*4)])

    #all coordinates pointing to NaN entries are removed and
    #first 1000 remaining entries are selected
    cords['value'] = [base.iloc[cords[0][i], cords[1][i]] for i in cords.index]
    cords = cords.dropna()[:n].reset_index(drop=True)
    
    return cords

In [6]:
cords = get_cords(0.05)

Testdaten mit 5.0% fehlenden Werten (absolut: 8515)


In [7]:
#getting train data by changing randomly chosen values to NaN
def reset_train(cords):
    train = base.copy()
    for i in cords.index:
        train.iloc[cords[0][i], cords[1][i]] = None
    return train

In [8]:
results = []

In [9]:
def evaluate(method, df, t, results):
    
        
    #scaling original data and imputed data
    #necessary ?????????????????????????????????????
    train = reset_train(cords)
    scaler = StandardScaler().fit(train) #fitting on train?
    norm_base = pd.DataFrame(scaler.transform(base))
    df = pd.DataFrame(scaler.transform(df))

    #getting imputed values for simulated NaNs and true value 
    res =pd.DataFrame({'y_true': [norm_base.iloc[cords[0][i], cords[1][i]] for i in cords.index],
                       'y_pred': [df.iloc[cords[0][i], cords[1][i]] for i in cords.index]
                      })
    res = res.dropna()

   
    #calculate evaluation metrics
    r2 = r2_score(res['y_true'], res['y_pred'])
    rmse = math.sqrt(mean_squared_error(res['y_true'], res['y_pred']))
    still_missing = df.isna().sum().sum()
    
    print(f'Mit dieser Methode bleiben {still_missing} NaNs bestehen.')
    print('')
    print(f'{len(res)} Werte wurden für die Metriken verwendet.')
    print(f'r2: {r2}, rmse: {rmse}')
    
    results.append([method, r2, rmse, still_missing, t])
    return results


###  Imputation Verfahren

Es werden diese Imputation Methoden verglichen:
- Backcasting
- Durchschnitt
- regionaler Durchschnitt
- Iterative Imputer
- KNN Imputer

#### Backfill

In [10]:
def impute_backfill(df):
    df = df.fillna(method='bfill', limit=3)
    return df

In [11]:
base = reset_base()
train = reset_train(cords)



t0 = time.time()
df= impute_backfill(train) 
t1 = time.time()

t = t1-t0

df.to_csv('additional_data/imputed_sets/backfill.csv')
results = evaluate('Backfill', df, t, results)

Mit dieser Methode bleiben 32853 NaNs bestehen.

8256 Werte wurden für die Metriken verwendet.
r2: -16.124570911516813, rmse: 1.683781763968918


#### Durchschnitt des Indikators über alle Jahre hinweg

In [12]:
def impute_overall_means(df):
    #fill NaNs with overall mean of that indicator
    values = pd.DataFrame(df.stack()).groupby('Indicator Name')[0].mean()
    df = pd.DataFrame(df.stack(dropna=False))
    
    df[0] = df[0].fillna(df.groupby('Indicator Name')[0].transform('mean'))
    df = df.unstack()
    df.columns = df.columns.droplevel(0)
    df = df.sort_index(level=['Country Name', 'Indicator Name'])
        
    return df

In [13]:
base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = impute_overall_means(train)
t1 = time.time()

t = t1-t0

df.to_csv('additional_data/imputed_sets/mean.csv')
evaluate('Overall Mean', df, t, results)

Mit dieser Methode bleiben 0 NaNs bestehen.

8515 Werte wurden für die Metriken verwendet.
r2: -2.231564589450377, rmse: 0.7202445836681456


[['Backfill',
  -16.124570911516813,
  1.683781763968918,
  32853,
  0.0156252384185791],
 ['Overall Mean',
  -2.231564589450377,
  0.7202445836681456,
  0,
  0.33020544052124023]]

#### Durchschnitt des Indikators für das jeweilige Jahr

In [14]:
def impute_yearly_means(df):
    #fill NaNs with overall mean of that indicator
    
    for i in df.columns:
        df[i] = df[i].fillna(df.groupby('Indicator Name')[i].transform('mean'))
            
    return df

In [15]:
base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = impute_yearly_means(train)
t1 = time.time()

t = t1-t0

evaluate('Yearly Mean',df, t, results)

Mit dieser Methode bleiben 52298 NaNs bestehen.

8515 Werte wurden für die Metriken verwendet.
r2: -0.09704256669343403, rmse: 0.41964815712332865


[['Backfill',
  -16.124570911516813,
  1.683781763968918,
  32853,
  0.0156252384185791],
 ['Overall Mean',
  -2.231564589450377,
  0.7202445836681456,
  0,
  0.33020544052124023],
 ['Yearly Mean',
  -0.09704256669343403,
  0.41964815712332865,
  52298,
  0.10567331314086914]]

#### Regionaler Durchschnitt des Indikators für das jeweilige Jahr

In [16]:
def impute_yearly_means_per_region(df):
    country_data = pd.read_csv('../Data/WDICountry.csv')
    country_data = country_data.loc[:,['Table Name', 'Region']]
    df = pd.merge(df.reset_index(), country_data, how='left', left_on='Country Name', right_on='Table Name').drop('Table Name', axis=1)
    df = df.set_index(['Country Name', 'Indicator Name', 'Region'])

    for i in df.columns:
        df[i] = df[i].fillna(df.groupby(['Indicator Name', 'Region'])[i].transform('mean'))

    df = df.reset_index().set_index(['Country Name', 'Indicator Name']).drop('Region', axis=1)
    return df

In [17]:
base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = impute_yearly_means_per_region(train)
t1 = time.time()

t = t1-t0

evaluate('Yearly Mean per Region', df, t, results)

Mit dieser Methode bleiben 57262 NaNs bestehen.

8458 Werte wurden für die Metriken verwendet.
r2: -0.4349777575022391, rmse: 0.4815635503840548


[['Backfill',
  -16.124570911516813,
  1.683781763968918,
  32853,
  0.0156252384185791],
 ['Overall Mean',
  -2.231564589450377,
  0.7202445836681456,
  0,
  0.33020544052124023],
 ['Yearly Mean',
  -0.09704256669343403,
  0.41964815712332865,
  52298,
  0.10567331314086914],
 ['Yearly Mean per Region',
  -0.4349777575022391,
  0.4815635503840548,
  57262,
  0.25169897079467773]]

#### Interpolation

In [18]:
def interpolate3(df):
    df = df.interpolate(limit=3)
    return df

def interpolate_all(df):
    df = df.interpolate()
    return df

In [19]:
base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = interpolate3(train)
t1 = time.time()

t = t1-t0

evaluate('Interpolation 3', df, t, results)

Mit dieser Methode bleiben 32884 NaNs bestehen.

8235 Werte wurden für die Metriken verwendet.
r2: -2.959611964171229, rmse: 0.810691330925762


[['Backfill',
  -16.124570911516813,
  1.683781763968918,
  32853,
  0.0156252384185791],
 ['Overall Mean',
  -2.231564589450377,
  0.7202445836681456,
  0,
  0.33020544052124023],
 ['Yearly Mean',
  -0.09704256669343403,
  0.41964815712332865,
  52298,
  0.10567331314086914],
 ['Yearly Mean per Region',
  -0.4349777575022391,
  0.4815635503840548,
  57262,
  0.25169897079467773],
 ['Interpolation 3',
  -2.959611964171229,
  0.810691330925762,
  32884,
  0.11022782325744629]]

In [20]:
base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = interpolate_all(train)
t1 = time.time()

t = t1-t0

df.to_csv('additional_data/imputed_sets/interpolation.csv')
results = evaluate('Interpolation all', df, t, results)

Mit dieser Methode bleiben 58 NaNs bestehen.

8515 Werte wurden für die Metriken verwendet.
r2: -2.959674983506685, rmse: 0.7972661962988941


#### Iterative Imputer

An dieser Stelle wird der Iterative Imputer von SciKitLearn angewandt. Die drei Varianten laufen auf den drei Dimensionen des Datensatzes. 

In [21]:
def iterative_imputer_1(df):
    col = df.columns
    idx = df.index
    
    iter_imp = IterativeImputer(random_state=999)
    df= iter_imp.fit_transform(df)
    df= pd.DataFrame(df, columns=col, index=idx)
    return df

def iterative_imputer_2(df):
    df = df.unstack().T
    col = df.columns
    idx = df.index

    iter_imp = IterativeImputer(random_state=999)
    df= iter_imp.fit_transform(df)

    df = pd.DataFrame(df, columns=col, index=idx)
    df = df.unstack().T
    df = df.sort_index(level=['Country Name', 'Indicator Name'])
    
    return df

def iterative_imputer_3(df):

    df = df.reset_index()
    df = df.set_index(['Indicator Name', 'Country Name'])
    df = df.unstack().T

    col = df.columns
    idx = df.index

    iter_imp = IterativeImputer(random_state=999, verbose=True)
    df= iter_imp.fit_transform(df)

    df = pd.DataFrame(df, columns=col, index=idx)
    df = df.unstack().T
    df = df.reset_index()
    df = df.set_index(['Country Name', 'Indicator Name'])
    df = df.sort_index(level=['Country Name', 'Indicator Name'])
    
    return df

In [22]:
base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = iterative_imputer_1(train)
t1 = time.time()

t = t1-t0

df.to_csv('additional_data/imputed_sets/ice.csv')
results = evaluate('Iterative Imputer 1', df, t, results)



Mit dieser Methode bleiben 0 NaNs bestehen.

8515 Werte wurden für die Metriken verwendet.
r2: 0.9609051476884861, rmse: 0.07921973204963073


In [23]:
base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = iterative_imputer_2(train)
t1 = time.time()

t = t1-t0

df.to_csv('additional_data/imputed_sets/ice2.csv')
results = evaluate('Iterative Imputer 2', df, t, results)

Mit dieser Methode bleiben 0 NaNs bestehen.

8515 Werte wurden für die Metriken verwendet.
r2: 0.9661514177508125, rmse: 0.07371295511229263


In [24]:
base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = iterative_imputer_3(train)
t1 = time.time()

t = t1-t0

df.to_csv('additional_data/imputed_sets/ice3.csv')
results = evaluate('Iterative Imputer 3', df, t, results)

[IterativeImputer] Completing matrix with shape (4898, 165)
[IterativeImputer] Change: 3.147511124721088e+16, scaled tolerance: 35084726045503.402 
[IterativeImputer] Change: 1034671076716780.0, scaled tolerance: 35084726045503.402 
[IterativeImputer] Change: 659123757395787.5, scaled tolerance: 35084726045503.402 
[IterativeImputer] Change: 616363768648911.4, scaled tolerance: 35084726045503.402 
[IterativeImputer] Change: 2396881809792168.5, scaled tolerance: 35084726045503.402 
[IterativeImputer] Change: 603334623851478.6, scaled tolerance: 35084726045503.402 
[IterativeImputer] Change: 3452555317561406.0, scaled tolerance: 35084726045503.402 
[IterativeImputer] Change: 1566687283686987.5, scaled tolerance: 35084726045503.402 
[IterativeImputer] Change: 562306276969730.8, scaled tolerance: 35084726045503.402 
[IterativeImputer] Change: 263990229283100.84, scaled tolerance: 35084726045503.402 




Mit dieser Methode bleiben 0 NaNs bestehen.

8515 Werte wurden für die Metriken verwendet.
r2: 0.662632017747099, rmse: 0.2327156083233093


#### MICE

In [25]:
def mice_imputer(df):
    n_imputations =  12
    dfs = []
    col = df.columns
    idx = df.index
    
    for i in range(n_imputations): 
        print(f'Imputation round {i}')
        iter_imp = IterativeImputer(random_state=i, sample_posterior=True, verbose=2)
        df_temp = iter_imp.fit_transform(df)
        dfs.append(df_temp)
    
    df = np.mean(np.array(dfs), axis=0)
    df = pd.DataFrame(df, columns=col, index=idx)
    return df

In [26]:
base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = mice_imputer(train)
t1 = time.time()

t = t1-t0
df.to_csv('additional_data/imputed_sets/mice.csv')

Imputation round 0
[IterativeImputer] Completing matrix with shape (26070, 31)
[IterativeImputer] Ending imputation round 1/10, elapsed time 14.92
[IterativeImputer] Ending imputation round 2/10, elapsed time 29.54
[IterativeImputer] Ending imputation round 3/10, elapsed time 44.05
[IterativeImputer] Ending imputation round 4/10, elapsed time 58.46
[IterativeImputer] Ending imputation round 5/10, elapsed time 73.05
[IterativeImputer] Ending imputation round 6/10, elapsed time 87.60
[IterativeImputer] Ending imputation round 7/10, elapsed time 102.15
[IterativeImputer] Ending imputation round 8/10, elapsed time 116.92
[IterativeImputer] Ending imputation round 9/10, elapsed time 131.50
[IterativeImputer] Ending imputation round 10/10, elapsed time 146.14
Imputation round 1
[IterativeImputer] Completing matrix with shape (26070, 31)
[IterativeImputer] Ending imputation round 1/10, elapsed time 14.69
[IterativeImputer] Ending imputation round 2/10, elapsed time 29.42
[IterativeImputer] En

[IterativeImputer] Ending imputation round 8/10, elapsed time 121.05
[IterativeImputer] Ending imputation round 9/10, elapsed time 136.21
[IterativeImputer] Ending imputation round 10/10, elapsed time 151.44
Imputation round 11
[IterativeImputer] Completing matrix with shape (26070, 31)
[IterativeImputer] Ending imputation round 1/10, elapsed time 15.54
[IterativeImputer] Ending imputation round 2/10, elapsed time 30.40
[IterativeImputer] Ending imputation round 3/10, elapsed time 44.93
[IterativeImputer] Ending imputation round 4/10, elapsed time 59.85
[IterativeImputer] Ending imputation round 5/10, elapsed time 73.99
[IterativeImputer] Ending imputation round 6/10, elapsed time 88.39
[IterativeImputer] Ending imputation round 7/10, elapsed time 102.64
[IterativeImputer] Ending imputation round 8/10, elapsed time 117.09
[IterativeImputer] Ending imputation round 9/10, elapsed time 131.28
[IterativeImputer] Ending imputation round 10/10, elapsed time 145.49


In [27]:
results = evaluate('MICE', df, t, results)

Mit dieser Methode bleiben 0 NaNs bestehen.

8515 Werte wurden für die Metriken verwendet.
r2: 0.9588865414805373, rmse: 0.08123919037202011


#### KNN Imputer

In [28]:
def knn_imputer1(df):
    col = df.columns
    idx = df.index
    
    knn_imp = KNNImputer(n_neighbors=2)
    df= knn_imp.fit_transform(df)
    df = pd.DataFrame(df, columns=col, index=idx)
    return df

def knn_imputer2(df, n):
    
    df = df.reset_index()
    df = df.set_index(['Indicator Name', 'Country Name'])
    df = df.unstack().T

    col = df.columns
    idx = df.index

    knn_imp = KNNImputer(n_neighbors=n)
    df= knn_imp.fit_transform(df)
    df = pd.DataFrame(df, columns=col, index=idx)

    df = df.unstack().T
    df = df.reset_index()
    df = df.set_index(['Country Name', 'Indicator Name'])
    df = df.sort_index(level=['Country Name', 'Indicator Name'])
    
    return df

In [29]:
base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = knn_imputer1(train)
t1 = time.time()

t = t1-t0

results = evaluate('KNN Imputer1', df, t, results)

Mit dieser Methode bleiben 0 NaNs bestehen.

8515 Werte wurden für die Metriken verwendet.
r2: 0.2944522579857045, rmse: 0.3365400724328245


In [30]:
base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = knn_imputer2(train, 4)
t1 = time.time()

t = t1-t0

df.to_csv('additional_data/imputed_sets/knn.csv')
results= evaluate('KNN Imputer2', df, t, results)

Mit dieser Methode bleiben 0 NaNs bestehen.

8515 Werte wurden für die Metriken verwendet.
r2: 0.9700605815538997, rmse: 0.06932585920533507


In [31]:
cords = get_cords(0.075)
results7 = []

Testdaten mit 7.5% fehlenden Werten (absolut: 12773)


In [32]:
base = reset_base()
train = reset_train(cords)

t0 = time.time()
df= impute_backfill(train) 
t1 = time.time()

t = t1-t0

results7 = evaluate('Backfill', df, t, results)

###mean
base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = impute_overall_means(train)
t1 = time.time()

t = t1-t0
results7 = evaluate('Overall Mean', df, t, results)


base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = impute_yearly_means(train)
t1 = time.time()

t = t1-t0

results7 = evaluate('Yearly Mean', df, t, results)


base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = impute_yearly_means_per_region(train)
t1 = time.time()

t = t1-t0

results7 = evaluate('Yearly Mean per Region', df, t, results)

base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = interpolate3(train)
t1 = time.time()

t = t1-t0

results7 = evaluate('Interpolation 3', df, t, results)

base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = interpolate_all(train)
t1 = time.time()

t = t1-t0

results7 = evaluate('Interpolation all', df, t, results)

base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = iterative_imputer_1(train)
t1 = time.time()

t = t1-t0

results7 = evaluate('Iterative Imputer 1', df, t, results)

base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = iterative_imputer_2(train)
t1 = time.time()

t = t1-t0
results7 = evaluate('Iterative Imputer 2', df, t, results)

base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = iterative_imputer_3(train)
t1 = time.time()

t = t1-t0
results7 = evaluate('Iterative Imputer 3', df, t, results)

base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = mice_imputer(train)
t1 = time.time()

t = t1-t0
results7 = evaluate('MICE', df, t, results)

base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = knn_imputer1(train)
t1 = time.time()

t = t1-t0

results7 = evaluate('KNN Imputer 1', df, t, results)

base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = knn_imputer2(train, 4)
t1 = time.time()

t = t1-t0

results7 = evaluate('KNN Imputer 2', df, t, results)

Mit dieser Methode bleiben 33291 NaNs bestehen.

12357 Werte wurden für die Metriken verwendet.
r2: -0.6130335478126085, rmse: 2.204393317611659
Mit dieser Methode bleiben 0 NaNs bestehen.

12773 Werte wurden für die Metriken verwendet.
r2: -0.07966531009333178, rmse: 1.7739835611317705
Mit dieser Methode bleiben 52456 NaNs bestehen.

12772 Werte wurden für die Metriken verwendet.
r2: 0.018306571582603737, rmse: 1.6916478144616887
Mit dieser Methode bleiben 57327 NaNs bestehen.

12675 Werte wurden für die Metriken verwendet.
r2: 0.04750946031970027, rmse: 1.6726391400547882
Mit dieser Methode bleiben 33321 NaNs bestehen.

12344 Werte wurden für die Metriken verwendet.
r2: -0.16049147170655953, rmse: 1.8708552797202576
Mit dieser Methode bleiben 58 NaNs bestehen.

12773 Werte wurden für die Metriken verwendet.
r2: -0.16047092507932859, rmse: 1.8391711645406432




Mit dieser Methode bleiben 0 NaNs bestehen.

12773 Werte wurden für die Metriken verwendet.
r2: 0.9887340145516139, rmse: 0.1812131084938613
Mit dieser Methode bleiben 0 NaNs bestehen.

12773 Werte wurden für die Metriken verwendet.
r2: 0.9869415093055817, rmse: 0.19509741234578154
[IterativeImputer] Completing matrix with shape (4898, 165)
[IterativeImputer] Change: 3.629456857808196e+16, scaled tolerance: 35084726045503.402 
[IterativeImputer] Change: 3247758564798911.0, scaled tolerance: 35084726045503.402 
[IterativeImputer] Change: 2759587364532460.5, scaled tolerance: 35084726045503.402 
[IterativeImputer] Change: 1656926274554512.0, scaled tolerance: 35084726045503.402 
[IterativeImputer] Change: 778581988678064.2, scaled tolerance: 35084726045503.402 
[IterativeImputer] Change: 339214715632727.56, scaled tolerance: 35084726045503.402 
[IterativeImputer] Change: 122409791482366.77, scaled tolerance: 35084726045503.402 
[IterativeImputer] Change: 176705508430741.7, scaled toleran



Mit dieser Methode bleiben 0 NaNs bestehen.

12773 Werte wurden für die Metriken verwendet.
r2: 0.9349444555054636, rmse: 0.43545890415417043
Imputation round 0
[IterativeImputer] Completing matrix with shape (26070, 31)
[IterativeImputer] Ending imputation round 1/10, elapsed time 14.56
[IterativeImputer] Ending imputation round 2/10, elapsed time 29.22
[IterativeImputer] Ending imputation round 3/10, elapsed time 43.93
[IterativeImputer] Ending imputation round 4/10, elapsed time 58.48
[IterativeImputer] Ending imputation round 5/10, elapsed time 73.09
[IterativeImputer] Ending imputation round 6/10, elapsed time 87.77
[IterativeImputer] Ending imputation round 7/10, elapsed time 102.25
[IterativeImputer] Ending imputation round 8/10, elapsed time 116.80
[IterativeImputer] Ending imputation round 9/10, elapsed time 131.29
[IterativeImputer] Ending imputation round 10/10, elapsed time 145.93
Imputation round 1
[IterativeImputer] Completing matrix with shape (26070, 31)
[IterativeImput

[IterativeImputer] Ending imputation round 6/10, elapsed time 87.15
[IterativeImputer] Ending imputation round 7/10, elapsed time 101.75
[IterativeImputer] Ending imputation round 8/10, elapsed time 116.45
[IterativeImputer] Ending imputation round 9/10, elapsed time 130.93
[IterativeImputer] Ending imputation round 10/10, elapsed time 145.44
Imputation round 11
[IterativeImputer] Completing matrix with shape (26070, 31)
[IterativeImputer] Ending imputation round 1/10, elapsed time 14.71
[IterativeImputer] Ending imputation round 2/10, elapsed time 29.27
[IterativeImputer] Ending imputation round 3/10, elapsed time 43.97
[IterativeImputer] Ending imputation round 4/10, elapsed time 58.60
[IterativeImputer] Ending imputation round 5/10, elapsed time 73.12
[IterativeImputer] Ending imputation round 6/10, elapsed time 87.62
[IterativeImputer] Ending imputation round 7/10, elapsed time 102.19
[IterativeImputer] Ending imputation round 8/10, elapsed time 116.65
[IterativeImputer] Ending imp

In [33]:
cords = get_cords(0.1)
results10 = []

Testdaten mit 10.0% fehlenden Werten (absolut: 17030)


In [34]:
base = reset_base()
train = reset_train(cords)

t0 = time.time()
df= impute_backfill(train) 
t1 = time.time()

t = t1-t0

results10 = evaluate('Backfill', df, t, results)

###mean
base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = impute_overall_means(train)
t1 = time.time()

t = t1-t0
results10 = evaluate('Overall Mean', df, t, results)


base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = impute_yearly_means(train)
t1 = time.time()

t = t1-t0

results10 = evaluate('Yearly Mean', df, t, results)


base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = impute_yearly_means_per_region(train)
t1 = time.time()

t = t1-t0

results10 = evaluate('Yearly Mean per Region', df, t, results)

base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = interpolate3(train)
t1 = time.time()

t = t1-t0

results10 = evaluate('Interpolation 3', df, t, results)

base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = interpolate_all(train)
t1 = time.time()

t = t1-t0

results10 = evaluate('Interpolation all', df, t, results)

base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = iterative_imputer_1(train)
t1 = time.time()

t = t1-t0

results10 = evaluate('Iterative Imputer 1', df, t, results)

base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = iterative_imputer_2(train)
t1 = time.time()

t = t1-t0

results10 = evaluate('Iterative Imputer 2', df, t, results)

base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = iterative_imputer_3(train)
t1 = time.time()

t = t1-t0

results10 = evaluate('Iterative Imputer 3', df, t, results)

base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = mice_imputer(train)
t1 = time.time()

t = t1-t0
results10 = evaluate('MICE', df, t, results)

base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = knn_imputer1(train)
t1 = time.time()

t = t1-t0

results10 = evaluate('KNN Imputer 1', df, t, results)

base = reset_base()
train = reset_train(cords)

t0 = time.time()
df = knn_imputer2(train, 4)
t1 = time.time()

t = t1-t0

results10 = evaluate('KNN Imputer 2', df, t, results)

Mit dieser Methode bleiben 33605 NaNs bestehen.

16462 Werte wurden für die Metriken verwendet.
r2: -0.043800632990352195, rmse: 2.089040474302827
Mit dieser Methode bleiben 0 NaNs bestehen.

17030 Werte wurden für die Metriken verwendet.
r2: -0.0015874687364016982, rmse: 2.0119672962935047
Mit dieser Methode bleiben 52772 NaNs bestehen.

17027 Werte wurden für die Metriken verwendet.
r2: 0.013454075014286193, rmse: 1.9969783912024721
Mit dieser Methode bleiben 57448 NaNs bestehen.

16903 Werte wurden für die Metriken verwendet.
r2: 0.05990583802688387, rmse: 1.9565294158245126
Mit dieser Methode bleiben 33638 NaNs bestehen.

16488 Werte wurden für die Metriken verwendet.
r2: -0.04342564936248783, rmse: 2.0869887856001283
Mit dieser Methode bleiben 61 NaNs bestehen.

17027 Werte wurden für die Metriken verwendet.
r2: -0.04341923745526266, rmse: 2.053733699351703




Mit dieser Methode bleiben 0 NaNs bestehen.

17030 Werte wurden für die Metriken verwendet.
r2: 0.9658526097359531, rmse: 0.37149693504215014
Mit dieser Methode bleiben 0 NaNs bestehen.

17030 Werte wurden für die Metriken verwendet.
r2: 0.9452737170229422, rmse: 0.4702994294809716
[IterativeImputer] Completing matrix with shape (4898, 165)
[IterativeImputer] Change: 2.6953462925334884e+16, scaled tolerance: 35084726045503.402 
[IterativeImputer] Change: 1852264431665211.8, scaled tolerance: 35084726045503.402 
[IterativeImputer] Change: 708567246179355.8, scaled tolerance: 35084726045503.402 
[IterativeImputer] Change: 755095640854468.6, scaled tolerance: 35084726045503.402 
[IterativeImputer] Change: 488581155875104.2, scaled tolerance: 35084726045503.402 
[IterativeImputer] Change: 345880135817053.0, scaled tolerance: 35084726045503.402 
[IterativeImputer] Change: 857073520992406.6, scaled tolerance: 35084726045503.402 
[IterativeImputer] Change: 717395896181668.9, scaled tolerance:



Mit dieser Methode bleiben 0 NaNs bestehen.

17030 Werte wurden für die Metriken verwendet.
r2: 0.5331569987069713, rmse: 1.3736051240937186
Imputation round 0
[IterativeImputer] Completing matrix with shape (26070, 31)
[IterativeImputer] Ending imputation round 1/10, elapsed time 15.17
[IterativeImputer] Ending imputation round 2/10, elapsed time 30.01
[IterativeImputer] Ending imputation round 3/10, elapsed time 44.76
[IterativeImputer] Ending imputation round 4/10, elapsed time 59.74
[IterativeImputer] Ending imputation round 5/10, elapsed time 74.58
[IterativeImputer] Ending imputation round 6/10, elapsed time 89.62
[IterativeImputer] Ending imputation round 7/10, elapsed time 104.51
[IterativeImputer] Ending imputation round 8/10, elapsed time 119.57
[IterativeImputer] Ending imputation round 9/10, elapsed time 134.50
[IterativeImputer] Ending imputation round 10/10, elapsed time 149.22
Imputation round 1
[IterativeImputer] Completing matrix with shape (26070, 31)
[IterativeImpute

[IterativeImputer] Ending imputation round 6/10, elapsed time 88.94
[IterativeImputer] Ending imputation round 7/10, elapsed time 103.86
[IterativeImputer] Ending imputation round 8/10, elapsed time 118.82
[IterativeImputer] Ending imputation round 9/10, elapsed time 133.50
[IterativeImputer] Ending imputation round 10/10, elapsed time 148.22
Imputation round 11
[IterativeImputer] Completing matrix with shape (26070, 31)
[IterativeImputer] Ending imputation round 1/10, elapsed time 14.75
[IterativeImputer] Ending imputation round 2/10, elapsed time 29.50
[IterativeImputer] Ending imputation round 3/10, elapsed time 44.36
[IterativeImputer] Ending imputation round 4/10, elapsed time 59.42
[IterativeImputer] Ending imputation round 5/10, elapsed time 74.24
[IterativeImputer] Ending imputation round 6/10, elapsed time 88.93
[IterativeImputer] Ending imputation round 7/10, elapsed time 103.65
[IterativeImputer] Ending imputation round 8/10, elapsed time 118.32
[IterativeImputer] Ending imp

#### Results

In [35]:
results = pd.DataFrame(results, columns=['Methode', 'r2', 'RSME', 'Remaining NaNs', 'Time'])
results = results.set_index('Methode')

In [36]:
results

Unnamed: 0_level_0,r2,RSME,Remaining NaNs,Time
Methode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Backfill,-16.124571,1.683782,32853,0.015625
Overall Mean,-2.231565,0.720245,0,0.330205
Yearly Mean,-0.097043,0.419648,52298,0.105673
Yearly Mean per Region,-0.434978,0.481564,57262,0.251699
Interpolation 3,-2.959612,0.810691,32884,0.110228
Interpolation all,-2.959675,0.797266,58,0.046986
Iterative Imputer 1,0.960905,0.07922,0,10.953229
Iterative Imputer 2,0.966151,0.073713,0,20.194192
Iterative Imputer 3,0.662632,0.232716,0,75.957905
MICE,0.958887,0.081239,0,1750.670281


In [37]:
results7 = pd.DataFrame(results7, columns=['Methode', 'r2', 'RSME', 'Remaining NaNs', 'Time'])
results7 = results7.set_index('Methode')
results7

Unnamed: 0_level_0,r2,RSME,Remaining NaNs,Time
Methode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Backfill,-16.124571,1.683782,32853,0.015625
Overall Mean,-2.231565,0.720245,0,0.330205
Yearly Mean,-0.097043,0.419648,52298,0.105673
Yearly Mean per Region,-0.434978,0.481564,57262,0.251699
Interpolation 3,-2.959612,0.810691,32884,0.110228
Interpolation all,-2.959675,0.797266,58,0.046986
Iterative Imputer 1,0.960905,0.07922,0,10.953229
Iterative Imputer 2,0.966151,0.073713,0,20.194192
Iterative Imputer 3,0.662632,0.232716,0,75.957905
MICE,0.958887,0.081239,0,1750.670281


In [38]:
results10 = pd.DataFrame(results10, columns=['Methode', 'r2', 'RSME', 'Remaining NaNs', 'Time'])
results10 = results10.set_index('Methode')
results10

Unnamed: 0_level_0,r2,RSME,Remaining NaNs,Time
Methode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Backfill,-16.124571,1.683782,32853,0.015625
Overall Mean,-2.231565,0.720245,0,0.330205
Yearly Mean,-0.097043,0.419648,52298,0.105673
Yearly Mean per Region,-0.434978,0.481564,57262,0.251699
Interpolation 3,-2.959612,0.810691,32884,0.110228
Interpolation all,-2.959675,0.797266,58,0.046986
Iterative Imputer 1,0.960905,0.07922,0,10.953229
Iterative Imputer 2,0.966151,0.073713,0,20.194192
Iterative Imputer 3,0.662632,0.232716,0,75.957905
MICE,0.958887,0.081239,0,1750.670281
