<a href="https://colab.research.google.com/github/DepartmentOfStatisticsPUE/ann-for-survey-sampling/blob/main/ann_paper_simulation_1_properties.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## About

This notebook contains a simulation study that assesses the performance of predictive mean imputation based on exact and approximate nearest neighbours.

**Warning**: before running this notebook, go to `Runtime` -> `Change runtime type` and set it to `GPU`.

## Install required modules

Please note that this may take some time.

In [None]:
!apt install libomp-dev
!pip install faiss-gpu
!pip install n2
!pip install scann
!pip install annoy
!pip install pyflann-py3
!pip install pynndescent

## Import required modules

In [14]:
## standard modules
import pandas as pd
import numpy as np
import time

## linear regression
from sklearn.linear_model import LinearRegression

## ann modules
import scann
import faiss
from pyflann import *
from annoy import AnnoyIndex
from n2 import HnswIndex
from scipy.spatial import cKDTree

### setting for faiss gpu
res = faiss.StandardGpuResources()

## Helper functions

In [158]:
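def kdtree_impute(y_pred, y_pred_miss, y):
  """Mean of y after nearest-neighbour imputation on predicted means (exact k-d tree search)."""
  ## index the respondents' predicted means
  tree = cKDTree(y_pred, leafsize = 100, balanced_tree=True)
  ## exact (eps = 0) nearest donor for every nonrespondent
  dists, indx = tree.query(y_pred_miss, k = 1, eps = 0)
  ## observed values plus donors' values, averaged over all units
  y_mean = (np.sum(y) + np.sum(y[indx])) / (len(y_pred) + len(y_pred_miss))
  return y_mean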
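
As a quick sanity check of the helper above, the following minimal sketch applies it to a toy example small enough to verify by hand. The toy values are made up for illustration, and the function is repeated so that the snippet runs on its own.

```python
import numpy as np
from scipy.spatial import cKDTree

def kdtree_impute(y_pred, y_pred_miss, y):
    # same logic as the notebook helper, repeated to keep the snippet self-contained
    tree = cKDTree(y_pred, leafsize=100, balanced_tree=True)
    dists, indx = tree.query(y_pred_miss, k=1, eps=0)
    return (np.sum(y) + np.sum(y[indx])) / (len(y_pred) + len(y_pred_miss))

# three respondents: predicted means (one column) and observed values (illustrative numbers)
y_pred_resp = np.array([[0.0], [1.0], [2.0]])
y_resp = np.array([10.0, 20.0, 30.0])

# two nonrespondents: only predicted means are available
y_pred_miss = np.array([[0.1], [1.9]])

# 0.1 is matched to the first donor (y = 10) and 1.9 to the third donor (y = 30),
# so the estimate is (10 + 20 + 30 + 10 + 30) / 5
toy_estimate = kdtree_impute(y_pred_resp, y_pred_miss, y_resp)
print(toy_estimate)  # 20.0
```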

## Simulation studies outline

Here, we conduct a simulation study based on predictive mean matching. We replicate the study from *Yang, S., & Kim, J. K. (2020). Asymptotic theory and inference of predictive mean matching imputation using a superpopulation model framework. Scandinavian Journal of Statistics, 47(3), 839-861.*; however, we assume only a missing data mechanism.
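
For reference, the point estimator computed in this notebook (see `kdtree_impute`) can be written as follows. The notation ($\delta_i$ for the response indicator, $\hat{m}$ for the regression fitted on respondents, $i(j)$ for the donor of unit $j$) is an assumption made here for illustration and is not taken from the cited paper.

$$
\hat{\mu}_{\text{PMM}} = \frac{1}{N}\left(\sum_{i:\,\delta_i = 1} y_i + \sum_{j:\,\delta_j = 0} y_{i(j)}\right),
\qquad
i(j) = \arg\min_{i:\,\delta_i = 1} \left|\hat{m}(x_i) - \hat{m}(x_j)\right|
$$

The first sum runs over respondents and the second over nonrespondents, each of which receives the observed value of the respondent whose predicted mean is closest to its own.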



In [114]:
np.random.seed(123)
N = 500000
x1 = np.random.uniform(size = N)
x2 = np.random.uniform(size = N)
x3 = np.random.uniform(size = N)
x4 = np.random.normal(size = N)
x5 = np.random.normal(size = N)
x6 = np.random.normal(size = N)
epsilon = np.random.normal(size=N)

### target variables
y1 = -1 + x1 + x2 + epsilon
y2 = -1.167 + x1 + x2 + (x1 - 0.5)**2 + (x2 - 0.5)**2 + epsilon
y3 = -1.5 + x1 + x2 + x3 + x4 + x5 + x6 + epsilon

## response indicator
p1 = np.exp(0.2 + x1 + x2) / (1 + np.exp(0.2 + x1 + x2))

data = np.column_stack((x1,x2,x3,x4,x5,x6,y1,y2,y3, p1)).astype('float32')

## first three rows
data[:3]

array([[ 0.6964692 ,  0.39244577,  0.772698  ,  0.06619908,  1.181968  ,
         0.33423862,  0.454862  ,  0.33803004,  2.3099656 ,  0.7839635 ],
       [ 0.28613934,  0.46933195,  0.46711504,  0.33278647,  0.22954535,
         0.11521499,  0.9884662 ,  0.8681431 ,  1.633128  ,  0.72221416],
       [ 0.22685145,  0.62351155,  0.01994883, -1.044568  ,  0.11454313,
        -0.31842756, -2.2750313 , -2.352166  , -4.003535  ,  0.7408446 ]],
      dtype=float32)
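
The response probability `p1` above is the logistic (inverse-logit) transform of `0.2 + x1 + x2`. As a small check, and assuming SciPy is available, the sketch below computes the same quantity with `scipy.special.expit` and shows the range the probabilities can take. The smaller sample size and the `*_demo` names are illustrative only.

```python
import numpy as np
from scipy.special import expit

rng_demo = np.random.RandomState(123)
x1_demo = rng_demo.uniform(size=1000)
x2_demo = rng_demo.uniform(size=1000)

# linear predictor of the response model
eta_demo = 0.2 + x1_demo + x2_demo

# formula as written in the notebook versus the library implementation
p_manual = np.exp(eta_demo) / (1 + np.exp(eta_demo))
p_expit = expit(eta_demo)

same = np.allclose(p_manual, p_expit)
print(same)

# x1, x2 lie in [0, 1), so the probabilities lie between expit(0.2) and expit(2.2)
print(expit(0.2), expit(2.2))
```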


In [159]:
R = 100
sim1_results_ckdtree = np.zeros(shape = (R, 3))
#sim1_results_ckdtree_time

for r in range(R):
  
  if (r % 10 == 0):
    print(r)

  np.random.seed(r)
  response_flag = np.random.binomial(n=1, p = p1, size = N)
  data_resp = data[response_flag == 1]
  data_noresp = data[response_flag != 1]
  
  ## predictive mean matching
  ## y1
  m1_reg_y1 = LinearRegression().fit(data_resp[:,:2], data_resp[:, 6])
  m1_resp_y1_predict = m1_reg_y1.predict(data_resp[:,:2]).reshape(-1,1)
  m1_noresp_y1_predict = m1_reg_y1.predict(data_noresp[:,:2]).reshape(-1,1)
  ## y2
  m1_reg_y2 = LinearRegression().fit(data_resp[:,:2], data_resp[:, 7])
  m1_resp_y2_predict = m1_reg_y2.predict(data_resp[:,:2]).reshape(-1,1)
  m1_noresp_y2_predict = m1_reg_y2.predict(data_noresp[:,:2]).reshape(-1,1)
  ## y3
  m1_reg_y3 = LinearRegression().fit(data_resp[:,:6], data_resp[:, 8])
  m1_resp_y3_predict = m1_reg_y3.predict(data_resp[:,:6]).reshape(-1,1)
  m1_noresp_y3_predict = m1_reg_y3.predict(data_noresp[:,:6]).reshape(-1,1)

  ## cKDTree imputation
  sim1_results_ckdtree[r, 0] = kdtree_impute(m1_resp_y1_predict, m1_noresp_y1_predict, data_resp[:, 6]) 
  sim1_results_ckdtree[r, 1] = kdtree_impute(m1_resp_y2_predict, m1_noresp_y2_predict, data_resp[:, 7])
  sim1_results_ckdtree[r, 2] = kdtree_impute(m1_resp_y3_predict, m1_noresp_y3_predict, data_resp[:, 8])

0
10
20
30
40
50
60
70
80
90


In [160]:
(np.mean(sim1_results_ckdtree, axis=0) - np.mean(data[:,[6,7,8]], axis = 0)) / np.mean(data[:,[6,7,8]], axis = 0)*100

array([12.51039367,  1.7673738 ,  2.43293736])
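
The expression above is the relative bias in percent: the Monte Carlo mean of the estimates minus the population mean, divided by the population mean. A minimal sketch of the same calculation as a reusable function is given below. The function name `relative_bias_pct` and the toy numbers are hypothetical and serve only as an illustration.

```python
import numpy as np

def relative_bias_pct(estimates, truth):
    """Relative bias (%) per column: (mean of estimates - truth) / truth * 100."""
    estimates = np.asarray(estimates, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return (estimates.mean(axis=0) - truth) / truth * 100

# two replications (rows) of estimates for two target variables (columns)
toy_estimates = np.array([[1.1, 2.0],
                          [0.9, 2.2]])
toy_truth = np.array([1.0, 2.0])

# column means are 1.0 and 2.1, so the relative biases are about 0% and 5%
rb = relative_bias_pct(toy_estimates, toy_truth)
print(rb)
```

Because this measure divides by the population mean, it becomes unstable when that mean is close to zero. By construction, the three targets in this notebook have population means at or very near zero (for example, E[y1] = -1 + 0.5 + 0.5 = 0), so the percentages above should be read with caution.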

In [178]:
t = AnnoyIndex(1, "euclidean")  
for i in range(len(m1_resp_y2_predict)):
  t.add_item(i, m1_resp_y2_predict[i])

t.build(300)

True

In [None]:
## query the nonrespondents' predicted means; donors are respondents (as in kdtree_impute)
annoy_inds = np.array([t.get_nns_by_vector(v, 1)[0] for v in m1_noresp_y2_predict])
(np.sum(data_resp[:, 7]) + np.sum(data_resp[annoy_inds, 7])) / N