Ref. https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html#sphx-glr-auto-examples-impute-plot-iterative-imputer-variants-comparison-py

# 1. Load Data

In [1]:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'])

In [2]:
df.head()

Unnamed: 0,date,value
0,1991-07-01,3.526591
1,1991-08-01,3.180891
2,1991-09-01,3.252221
3,1991-10-01,3.611003
4,1991-11-01,3.565869


In [3]:
df.describe()

Unnamed: 0,value
count,204.0
mean,10.69443
std,5.956998
min,2.81452
25%,5.844095
50%,9.319345
75%,14.289964
max,29.665356


# 2. Remove Arbitrary Data

In [4]:
df_missing = df.copy()

In [5]:
n_del = 30

In [6]:
import numpy as np

np.random.seed(123)

In [7]:
del_set = np.unique(np.random.randint(low=0, high=np.shape(df_missing)[0], size=n_del))
del_set

array([  2,  17,  32,  39,  47,  49,  55,  57,  66,  68,  73,  78,  83,
        84,  96,  98, 106, 109, 111, 113, 123, 126, 153, 164, 174, 195])

In [8]:
for i in del_set:
  print(i,"-before:",df_missing.loc[i, 'value'])
  # .loc writes to df_missing itself and avoids chained assignment
  df_missing.loc[i, 'value'] = np.nan
  print(i,"-after:",df_missing.loc[i, 'value'])

2 -before: 3.252221
2 -after: nan
17 -before: 5.81054917
17 -after: nan
32 -before: 4.39407557
32 -after: nan
39 -before: 5.3016513
39 -after: nan
47 -before: 5.17078711
47 -after: nan
49 -before: 5.85527729
49 -after: nan
55 -before: 5.06979585
55 -after: nan
57 -before: 5.59712628
57 -after: nan
66 -before: 8.52447101
66 -after: nan
68 -before: 5.71430345
68 -after: nan
73 -before: 6.70491861
73 -after: nan
78 -before: 8.79851303
78 -after: nan
83 -before: 7.38338118
83 -after: nan
84 -before: 7.81349587
84 -after: nan
96 -before: 8.71742046
96 -after: nan
98 -before: 9.17711337
98 -after: nan
106 -before: 9.3868026
106 -after: nan
109 -before: 10.64375083
109 -after: nan
111 -before: 11.7100413
111 -after: nan
113 -before: 12.07913184
113 -after: nan
123 -before: 12.65213444
123 -after: nan
126 -before: 16.30026927
126 -after: nan
153 -before: 12.88264507
153 -after: nan
164 -before: 13.402392
164 -after: nan
174 -before: 23.486694
174 -after: nan
195 -before: 23.26333992
195 -after: nan



In [9]:
n_nan = np.isnan(df_missing['value']).sum()
print('The number of NaN is', n_nan)

The number of NaN is 26
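
The count is 26 rather than 30 because `np.random.randint` draws with replacement and `np.unique` then drops the repeated indices. If exactly `n_del` missing values are wanted, one option is to sample without replacement. The following is a minimal sketch, not part of the original notebook: `n_rows`, `rng`, and `del_idx` are illustrative names, and the selected indices will differ from `del_set` above.

```python
import numpy as np

n_rows = 204  # row count of the dataset, taken from df.describe() above
n_del = 30

# Generator.choice with replace=False never repeats an index,
# so exactly n_del distinct rows are selected.
rng = np.random.default_rng(123)
del_idx = np.sort(rng.choice(n_rows, size=n_del, replace=False))

print(len(del_idx))
```

Assigning `np.nan` at these indices with `df_missing.loc[del_idx, 'value'] = np.nan` would then remove all 30 values in one step.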


# 3. LOCF: Last Observation Carried Forward

In [10]:
df_locf = df_missing.copy()

In [11]:
for i in range(len(df_locf)):
  if np.isnan(df_locf.loc[i, 'value']):
    if i > 0:
      # Carry the previous observation forward
      df_locf.loc[i, 'value'] = df_locf.loc[i-1, 'value']
      print('NaN cell: ', df_locf.loc[i, 'value'], ' -- Previous Cell: ',df_locf.loc[i-1, 'value'])
    else:
      df_locf.loc[i, 'value'] = 0 # Exception rule: the first cell has no previous observation
      print('NaN cell: ', df_locf.loc[i, 'value'],'is first cell')

NaN cell:  3.180891  -- Previous Cell:  3.180891
NaN cell:  4.38653092  -- Previous Cell:  4.38653092
NaN cell:  3.84127758  -- Previous Cell:  3.84127758
NaN cell:  5.20445484  -- Previous Cell:  5.20445484
NaN cell:  5.19475419  -- Previous Cell:  5.19475419
NaN cell:  5.25674157  -- Previous Cell:  5.25674157
NaN cell:  8.32945212  -- Previous Cell:  8.32945212
NaN cell:  5.26255667  -- Previous Cell:  5.26255667
NaN cell:  8.60693721  -- Previous Cell:  8.60693721
NaN cell:  5.27791837  -- Previous Cell:  5.27791837
NaN cell:  7.05083102  -- Previous Cell:  7.05083102
NaN cell:  10.09623339  -- Previous Cell:  10.09623339
NaN cell:  7.06420058  -- Previous Cell:  7.06420058
NaN cell:  7.06420058  -- Previous Cell:  7.06420058
NaN cell:  8.16532298  -- Previous Cell:  8.16532298
NaN cell:  9.07096378  -- Previous Cell:  9.07096378
NaN cell:  8.47400037  -- Previous Cell:  8.47400037
NaN cell:  10.8342948  -- Previous Cell:  10.8342948
NaN cell:  9.90816186  -- Previous Cell:  9.90816186



In [12]:
n_nan = np.isnan(df_locf['value']).sum()
print('The number of NaN is', n_nan)

The number of NaN is 0
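
For comparison, pandas provides forward filling directly through `Series.ffill()`. The sketch below is not from the original notebook; it uses a small made-up series `s` in place of `df_missing['value']` and applies the same fallback of 0 for a missing first cell.

```python
import numpy as np
import pandas as pd

# Toy series standing in for df_missing['value'] (values are made up)
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

# ffill() carries the last valid observation forward, including across
# consecutive NaNs. A leading NaN has no previous value, so fillna(0)
# applies the same "first cell" rule as the loop above.
s_locf = s.ffill().fillna(0)

print(s_locf.tolist())  # [0.0, 1.0, 1.0, 1.0, 4.0]
```

On the notebook's data, the equivalent call would be `df_missing['value'].ffill().fillna(0)`.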


# 4. NOCB: Next Observation Carried Backward

In [13]:
df_nocb = df_missing.copy()

In [14]:
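n = len(df_nocb)
for i in range(n):
  if np.isnan(df_nocb.loc[i, 'value']):
    if i < n - 1:
      # Copy the next cell backward; if that cell is also NaN
      # (consecutive gaps), this cell stays NaN
      df_nocb.loc[i, 'value'] = df_nocb.loc[i+1, 'value']
      print('NaN cell: ', df_nocb.loc[i, 'value'], ' -- Next Cell: ',df_nocb.loc[i+1, 'value'])
    else:
      df_nocb.loc[i, 'value'] = 0 # Exception rule: the last cell has no next observation
      print('NaN cell: ', df_nocb.loc[i, 'value'],'is last cell')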

NaN cell:  3.6110029999999997  -- Next Cell:  3.6110029999999997
NaN cell:  6.19206769  -- Next Cell:  6.19206769
NaN cell:  4.07534073  -- Next Cell:  4.07534073
NaN cell:  5.77374216  -- Next Cell:  5.77374216
NaN cell:  5.25674157  -- Next Cell:  5.25674157
NaN cell:  5.49072901  -- Next Cell:  5.49072901
NaN cell:  5.26255667  -- Next Cell:  5.26255667
NaN cell:  6.110296  -- Next Cell:  6.110296
NaN cell:  5.27791837  -- Next Cell:  5.27791837
NaN cell:  6.21452908  -- Next Cell:  6.21452908
NaN cell:  7.25098761  -- Next Cell:  7.25098761
NaN cell:  5.91826076  -- Next Cell:  5.91826076
NaN cell:  nan  -- Next Cell:  nan
NaN cell:  7.43189221  -- Next Cell:  7.43189221
NaN cell:  9.07096378  -- Next Cell:  9.07096378
NaN cell:  9.25188674  -- Next Cell:  9.25188674
NaN cell:  9.56039945  -- Next Cell:  9.56039945
NaN cell:  9.90816186  -- Next Cell:  9.90816186
NaN cell:  11.34015074  -- Next Cell:  11.34015074
NaN cell:  14.49758109  -- Next Cell:  14.49758109
NaN cell:  13.6744



In [15]:
n_nan = np.isnan(df_nocb['value']).sum()
print('The number of NaN is', n_nan)

The number of NaN is 1
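
One NaN remains because rows 83 and 84 were both removed: when the loop reaches row 83, the next cell is still NaN, so a NaN is copied back. `Series.bfill()` propagates the next valid observation across the whole gap. The sketch below is not from the original notebook and uses a made-up series `s` to show the behavior, with the same fallback of 0 for a missing last cell.

```python
import numpy as np
import pandas as pd

# Toy series with two consecutive NaNs and a trailing NaN (values are made up)
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# bfill() carries the next valid observation backward across the whole gap.
# A trailing NaN has no next value, so fillna(0) applies the same
# "last cell" rule as the loop above.
s_nocb = s.bfill().fillna(0)

print(s_nocb.tolist())  # [1.0, 4.0, 4.0, 4.0, 0.0]
```

On the notebook's data, `df_missing['value'].bfill().fillna(0)` would be the equivalent call and should leave no NaN behind.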


# 5. Linear Interpolation

In [16]:
df_linear = df_missing.copy()

In [17]:
n = len(df_linear)
for i in range(n):
  if np.isnan(df_linear.loc[i, 'value']):
    has_prev = i > 0
    has_next = i < n - 1
    prev_val = df_linear.loc[i-1, 'value'] if has_prev else np.nan
    next_val = df_linear.loc[i+1, 'value'] if has_next else np.nan
    if not np.isnan(prev_val) and not np.isnan(next_val):
      # Both neighbors are observed: use their midpoint
      df_linear.loc[i, 'value'] = (prev_val + next_val) / 2
      print('NaN cell: ', df_linear.loc[i, 'value'], ' -- Previous Cell: ',prev_val, ' -- Next Cell: ', next_val)
    elif has_prev:
      # Consecutive NaNs or the last cell: fall back to the previous cell
      df_linear.loc[i, 'value'] = prev_val
      print('NaN cell: ', df_linear.loc[i, 'value'], ' -- Previous Cell: ',prev_val, ' -- No Next Cell')
    else:
      # First cell: fall back to the next cell
      df_linear.loc[i, 'value'] = next_val
      print('NaN cell: ', df_linear.loc[i, 'value'], ' -- Next Cell: ',next_val, ' -- No Previous Cell')


NaN cell:  3.3959469999999996  -- Previous Cell:  3.180891  -- Next Cell:  3.6110029999999997
NaN cell:  5.289299305  -- Previous Cell:  4.38653092  -- Next Cell:  6.19206769
NaN cell:  3.958309155  -- Previous Cell:  3.84127758  -- Next Cell:  4.07534073
NaN cell:  5.489098500000001  -- Previous Cell:  5.20445484  -- Next Cell:  5.77374216
NaN cell:  5.22574788  -- Previous Cell:  5.19475419  -- Next Cell:  5.25674157
NaN cell:  5.37373529  -- Previous Cell:  5.25674157  -- Next Cell:  5.49072901
NaN cell:  6.796004395  -- Previous Cell:  8.32945212  -- Next Cell:  5.26255667
NaN cell:  5.686426335  -- Previous Cell:  5.26255667  -- Next Cell:  6.110296
NaN cell:  6.94242779  -- Previous Cell:  8.60693721  -- Next Cell:  5.27791837
NaN cell:  5.746223725  -- Previous Cell:  5.27791837  -- Next Cell:  6.21452908
NaN cell:  7.150909315  -- Previous Cell:  7.05083102  -- Next Cell:  7.25098761
NaN cell:  8.007247075  -- Previous Cell:  10.09623339  -- Next Cell:  5.91826076
NaN cell:  7.


In [18]:
n_nan = np.isnan(df_linear['value']).sum()
print('The number of NaN is', n_nan)

The number of NaN is 0
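
The loop above takes the midpoint of the two neighbors and falls back to the previous value when NaNs are consecutive. pandas' `Series.interpolate(method='linear')` instead draws a straight line across the whole gap, so the two approaches can give different values for consecutive NaNs. The sketch below is not from the original notebook; it uses a made-up series `s`, and the explicit `ffill()`/`bfill()` calls are one assumed way to handle NaNs at either end of the series.

```python
import numpy as np
import pandas as pd

# Toy series with a leading NaN, a two-cell gap, and a trailing NaN
# (values are made up)
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

# Linear interpolation fills the interior gap evenly between 1.0 and 4.0.
# ffill()/bfill() then cover any NaN left at the ends, where no line
# can be drawn between two observed points.
s_lin = s.interpolate(method='linear').ffill().bfill()

print(s_lin.tolist())  # [1.0, 1.0, 2.0, 3.0, 4.0, 4.0]
```

For the same two-cell gap, the loop in this section would produce 1.0 and then 2.5 (the previous value, then the midpoint of 1.0 and 4.0), rather than 2.0 and 3.0.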
