## Functions to compute variance decrease in imputed data

The idea comes from:

Yadav, M. L., & Roychoudhury, B. (2018). Handling missing values: A study of popular imputation packages in R. Knowledge-Based Systems. doi:10.1016/j.knosys.2018.06.012.


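The paper does not spell out exactly how the formula is applied, so this notebook adopts the interpretation described below.

In this implementation, we consider a matrix $X \in \mathbb{R}^{N \times G}$ with $N$ samples (rows) and $G$ variables (columns, here genes), some of whose entries are missing.
For each variable $g \in \{1, \dots, G\}$, the Variance Decrease (VD) compares the variance of its observed (non-missing) values with its variance after imputation: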

\begin{align}
VD_{g} = \left| \frac{\mathrm{var}(X_{g}) - \mathrm{var}(X^{imputed}_{g})}{\mathrm{var}(X_{g})} \right|
\end{align}
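
As a minimal sketch of the formula for a single variable, the snippet below uses the values of column 3 of the `prova` example matrix defined later in this notebook. It assumes, as the functions below do, that the original variance is computed on the observed values only (`np.nanvar`). The helper name `vd_one_feature` is hypothetical and is not part of the notebook's functions.

```python
import numpy as np

def vd_one_feature(original, imputed):
    """VD for one feature: relative change in variance after imputation."""
    var_orig = np.nanvar(original)   # variance of the observed values only
    var_imp = np.var(imputed)        # variance once the gaps are filled
    return abs((var_orig - var_imp) / var_orig)

# one gene measured on 5 samples, with 2 missing values
gene = np.array([6.4, np.nan, 7.0, np.nan, 4.0])
gene_imputed = np.array([6.4, 5.0, 7.0, 5.4, 4.0])

vd = vd_one_feature(gene, gene_imputed)
print(round(vd, 4))  # 0.339
```

Here the observed variance is 1.68 and the variance after imputation is 1.1104, so the imputation shrinks the variance of this gene by about 34%.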


The total VD score for the imputation of the whole matrix is the mean of $VD_g$ over the genes for which at least one value was imputed (complete genes are not considered), weighted by the fraction of missing values of each gene:

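\begin{align}
VD = \frac{\sum_{g \in GN}{VD_{g} \cdot w_{g}}}{\sum_{g \in GN}{w_{g}}}
\end{align}


\begin{align}
w_g = \frac{|NAN_{g}|}{N}
\end{align}

where $GN = \{g \in \{1, \dots, G\} \mid \exists \, i \ \text{s.t.} \ X^i_g = NA\}$ is the set of genes with at least one missing value, and $NAN_{g} = \{i \mid X^i_g = NA\}$ is the set of samples in which gene $g$ is missing.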
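
The three definitions above can be sketched in a few vectorized NumPy lines. This is only an illustrative cross-check, not the implementation used in the rest of the notebook: the name `vd_total` is hypothetical, genes are assumed to be on the columns, and every gene in $GN$ is assumed to have a nonzero observed variance. The data are the same as the `prova` / `prova_imp` example further down.

```python
import numpy as np

def vd_total(nan_matrix, imputed_matrix):
    """Weighted VD score, with features (genes) on the columns."""
    missing = np.isnan(nan_matrix)
    has_nan = missing.any(axis=0)                  # the set GN
    w = missing[:, has_nan].mean(axis=0)           # w_g = |NAN_g| / N
    var_orig = np.nanvar(nan_matrix[:, has_nan], axis=0)
    var_imp = np.var(imputed_matrix[:, has_nan], axis=0)
    vd = np.abs((var_orig - var_imp) / var_orig)   # VD_g
    return np.sum(vd * w) / np.sum(w)

X = np.array([[1.2, 3.5, 7.0, 6.4, np.nan],
              [6.2, 6.5, 8.0, np.nan, 5.6],
              [0.2, 3.0, 6.7, 7.0, np.nan],
              [1.2, 3.5, 7.0, np.nan, 6.4],
              [2.9, np.nan, 7.5, 4.0, np.nan]])
X_imputed = np.array([[1.2, 3.5, 7.0, 6.4, 6.0],
                      [6.2, 6.5, 8.0, 5.0, 5.6],
                      [0.2, 3.0, 6.7, 7.0, 7.0],
                      [1.2, 3.5, 7.0, 5.4, 6.4],
                      [2.9, 3.0, 7.5, 4.0, 6.0]])

score = vd_total(X, X_imputed)
print(round(score, 4))  # 0.3288
```

Only columns 1, 3 and 4 contain missing values, with weights 0.2, 0.4 and 0.6, so the gene with the most imputed values contributes the most to the score.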



In [1]:
import numpy as np


In [29]:
def Calc_VDscore(nan_matrix, imputed_matrix, axis = 1, weighted = True):
    
    """
    Compute the VD score for an imputed matrix
    __________________________________________
    
    Parameters:
    
    nan_matrix(np.array): with shape (n_samples, n_features), containing NaNs
    
    imputed_matrix(np.array) : with shape (n_samples, n_features), matrix derived from nan_matrix with NaNs imputed
    
    axis(int):  1 if one variance must be computed for each column, 0 for each row. 
    Default to 1 (this supposes genes are on the columns)
    
    weighted (bool): if a weighted average must be performed when aggregating VDs.
    If True, the average is weighted on the fraction of missing values.
    Default to True.
    
    
    Returns:
    
    float : total VD score for the imputed matrix
    
    
    """
    nan_index = Get_nanidx(nan_matrix)
    
    sub_nan, sub_imputed = (Sub_matrix(mat, nan_index, axis) for mat in [nan_matrix, imputed_matrix])    
    var_nan, var_imputed = (np.nanvar(mat, axis = [1,0][axis]) for mat in [sub_nan , sub_imputed])
    
    # VD is undefined when the original variance is 0: skip those features
    valid = var_nan != 0
    vd_vec = VD_single(var_nan[valid], var_imputed[valid])
    
    # keep the weights aligned with the features that were actually scored
    nan_weights = Get_nan_weight(nan_matrix, axis)[valid] if weighted else None
    
    return np.average(vd_vec, weights = nan_weights)

In [3]:
def Get_nan_weight(nan_matrix, axis):
    
    """
    Compute the fraction of NaNs for each feature on the given axis
    __________________________________________
    
    Parameters:
    
    nan_matrix(np.array): with shape (n_samples, n_features), containing NaNs
    
    axis(int):  1 if the NaNs must be counted for each column, 0 for each row. 

    
    Returns:
    
    np.array of floats : NaN fractions on the given axis, restricted to the
    rows/columns with at least one NaN
    
    
    """
    nans =  np.isnan(nan_matrix)
    # count along the other axis and divide by its length (the number of samples)
    other_axis = [1,0][axis]
    nan_fraction  = np.sum(nans, axis = other_axis)/nan_matrix.shape[other_axis]
    
    return nan_fraction[nan_fraction>0]

In [4]:
def VD_single(varA, varB):
    
    """
    Simple VD formula
    __________________________________________
    
    Parameters:
    
    varA(float): original variance value for one feature
    
    varB(float) : variance value for one feature on the imputed matrix
    
    
    Returns:
    
    float : VD
    
    
    """
    return np.abs((varA - varB)/varA)
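
The formula divides by the original variance, so it is undefined for a feature whose observed values are all equal; `Calc_VDscore` skips such features. A minimal sketch of that guard on whole arrays of variances (the expression also works elementwise on NumPy arrays); the variance values here are made up for illustration:

```python
import numpy as np

# hypothetical per-feature variances; the second feature is constant
var_orig = np.array([1.68, 0.0, 0.16])
var_imp = np.array([1.1104, 0.5, 0.224])

# keep only the features where the ratio is defined
valid = var_orig != 0
vd = np.abs((var_orig[valid] - var_imp[valid]) / var_orig[valid])
print(vd.shape)  # (2,)
```

Without the mask, the division by zero would produce `inf` and contaminate any average taken afterwards.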

In [18]:
def Sub_matrix(tored_matrix, idx, axis = None):
    
    """
    Subset of a matrix
    __________________________________________
    
    Parameters:
    
    tored_matrix(np.array): matrix to be reduced
    
    idx (np.array): indexes with shape (n, 2), in the form [[row1, col1], ...,[rown, coln]]
    
    axis (integer): if just one axis must be considered, 0 for rows, 1 for columns. 
    None restrict the matrix just to the elements in idx. Default to None
    
    Returns:
    
    if axis == None:
        array(flattened): elements of tored_matrix at indexes idx
    else:
        array (2D): tored_matrix restricted to rows or columns of interest (other dimension does not vary)
    
    
    """
    if axis is not None:
        
        idx = np.unique(idx[:,axis])  # sorted and without duplicates, same order as Get_nan_weight
        if axis == 0:
            red_matrix =  tored_matrix[idx, :]
        else:
            red_matrix =  tored_matrix[ : , idx]
            
    else:
        red_matrix = tored_matrix[idx[:,0], idx[:,1]]
        
    return red_matrix

In [6]:
def Get_nanidx(nan_matrix):
    
    """
    Find the indexes of the NaNs in a matrix
    __________________________________________
    
    Parameters:
    
    nan_matrix(np.array): matrix (containing nans)
    
    Returns:
    
    np.array with shape (n_nans, 2) : indexes of nans,  [[row1, col1], ..,[rown, coln]]
    
    """
    
    nans =  np.isnan(nan_matrix)
    
    return np.argwhere(nans)

In [47]:
def Negative_imputed(imputed_matrix, idx):
    
    """
    Compute the "negative" of a (imputed) matrix: a matrix with all nans except for values at given index
    __________________________________________
    
    Parameters:
    
    imputed_matrix(np.array) : [here: with shape (n_samples, n_features), matrix derived from nan_matrix with NAs imputed]

    idx (np.array): indexes with shape (n, 2), [[row1, col1], ..,[rown, coln]]
                        [here, indexes of null values in the original matrix]
    
    Returns:
    
    np.array : same shape as imputed_matrix, with the values at idx equal to those in imputed_matrix and NaN elsewhere
    
    
    """
    fake_nan = np.full(imputed_matrix.shape, np.nan)
    fake_nan[idx[:,0], idx[:,1]] = imputed_matrix[idx[:,0], idx[:,1]]
    
    return fake_nan

In [8]:
# def Calc_VDPscore_old(nan_matrix, imputed_matrix, axis = 1):
    
#     nan_index = Get_nanidx(nan_matrix)
#     negative_imputed = Negative_imputed(imputed_matrix, nan_index)
#     sub_nan, sub_imputed = (Sub_matrix(mat, nan_index, axis) for mat in [nan_matrix, negative_imputed])    
#     var_nan, var_imputed = (np.nanvar(mat, axis = axis) for mat in [sub_nan , sub_imputed])
    
#     vdp_vec = [VDP_single(va_i, var_imputed[i]) for i, va_i in enumerate(var_nan) if va_i!=0]
    
#     return np.mean(vdp_vec)

In [9]:
prova = np.array([[1.2, 3.5, 7, 6.4, np.nan], [6.2, 6.5, 8, np.nan, 5.6],
                [.2, 3, 6.7, 7, np.nan], [1.2, 3.5, 7,np.nan,  6.4],
                [2.9, np.nan, 7.5, 4, np.nan]])
prova

array([[1.2, 3.5, 7. , 6.4, nan],
       [6.2, 6.5, 8. , nan, 5.6],
       [0.2, 3. , 6.7, 7. , nan],
       [1.2, 3.5, 7. , nan, 6.4],
       [2.9, nan, 7.5, 4. , nan]])

In [10]:
prova_imp = np.array([[1.2, 3.5, 7, 6.4, 6], [6.2, 6.5, 8, 5, 5.6],
                [.2, 3, 6.7, 7, 7], [1.2, 3.5, 7,5.4,  6.4],
                [2.9, 3, 7.5, 4, 6]])
prova_imp

array([[1.2, 3.5, 7. , 6.4, 6. ],
       [6.2, 6.5, 8. , 5. , 5.6],
       [0.2, 3. , 6.7, 7. , 7. ],
       [1.2, 3.5, 7. , 5.4, 6.4],
       [2.9, 3. , 7.5, 4. , 6. ]])

In [11]:
nan_idx  =Get_nanidx(prova)
nan_idx

array([[0, 4],
       [1, 3],
       [2, 4],
       [3, 3],
       [4, 1],
       [4, 4]], dtype=int64)

In [12]:
sub1 = Sub_matrix(prova, idx = nan_idx, axis = None)
sub2 = Sub_matrix(prova_imp, idx = nan_idx, axis = None)
sub1, sub2

(array([nan, nan, nan, nan, nan, nan]), array([6. , 5. , 7. , 5.4, 3. , 6. ]))

In [19]:
sub3 = Sub_matrix(prova, idx = nan_idx, axis =0)
sub4 = Sub_matrix(prova, idx = nan_idx, axis = 1)
sub3, sub4

(array([[1.2, 3.5, 7. , 6.4, nan],
        [6.2, 6.5, 8. , nan, 5.6],
        [0.2, 3. , 6.7, 7. , nan],
        [1.2, 3.5, 7. , nan, 6.4],
        [2.9, nan, 7.5, 4. , nan]]), array([[3.5, 6.4, nan],
        [6.5, nan, 5.6],
        [3. , 7. , nan],
        [3.5, nan, 6.4],
        [nan, 4. , nan]]))

In [20]:
Get_nan_weight(prova, axis=1)

array([0.2, 0.4, 0.6])

In [30]:
Calc_VDscore(prova, prova_imp, axis = 1, weighted = True)
    

0.3287882307394493

In [31]:
Calc_VDscore(prova, prova_imp, axis = 1, weighted = False)
    

0.2778939217963602

In [48]:
Negative_imputed(prova_imp, nan_idx)

array([[nan, nan, nan, nan, 6. ],
       [nan, nan, nan, 5. , nan],
       [nan, nan, nan, nan, 7. ],
       [nan, nan, nan, 5.4, nan],
       [nan, 3. , nan, nan, 6. ]])
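
The commented-out `Calc_VDPscore_old` above appears to have passed this "negative" matrix to `np.nanvar`, which gives the variance of the imputed values alone. A minimal sketch of that step, assuming genes are on the columns and with the matrix rebuilt inline so that the snippet is self-contained. Columns without imputed values are dropped first, because `np.nanvar` on an all-NaN slice returns NaN with a warning.

```python
import numpy as np

negative = np.array([[np.nan, np.nan, np.nan, np.nan, 6.0],
                     [np.nan, np.nan, np.nan, 5.0, np.nan],
                     [np.nan, np.nan, np.nan, np.nan, 7.0],
                     [np.nan, np.nan, np.nan, 5.4, np.nan],
                     [np.nan, 3.0, np.nan, np.nan, 6.0]])

# keep only the columns that contain at least one imputed value
imputed_cols = ~np.isnan(negative).all(axis=0)

# variance of the imputed values alone, one value per kept column
var_of_imputed = np.nanvar(negative[:, imputed_cols], axis=0)
print(var_of_imputed)
```

Column 1 has a single imputed value, so its variance is 0; this is one reason a score built only on the imputed entries can be uninformative for genes with few missing values.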