# Lab 1




The goal of this lab session is to work with the various concepts discussed in Session 1.

There are three main objectives in this lab:

1. to study the impact of removing missing values on the results (Exercise 1),

2. to illustrate the concept of ignorability of the missing data mechanism (Exercise 2),

3. to learn how to generate missing values (mainly Exercises 3 and 4).

### Note on amputation

A dataset is said to be *amputed* if it contains missing values that have been artificially generated. The amputation process refers to transforming a complete dataset into an incomplete one, or an already incomplete dataset into one with a higher proportion of missing values. In other words, amputation is the act of introducing missing values into the initial dataset. It is exactly the opposite of *imputation*.

This is very useful when working with missing data: it allows testing new algorithms or comparing different methods while keeping access to a reference score and to the true values hidden by the amputation, which are required for computing certain metrics.
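As a minimal sketch of why amputation is useful, the example below amputes a toy dataset, imputes it naively with column means, and measures the error on the hidden entries. The dataset, the missingness probability of 0.2, and the choice of mean imputation are all assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

# Complete toy dataset (hypothetical: 200 rows, 3 standard Gaussian variables)
x_full = rng.normal(size=(200, 3))

# Amputation: hide each entry independently with probability 0.2
mask = rng.random(x_full.shape) < 0.2
x_miss = np.where(mask, np.nan, x_full)

# Naive imputation: replace each missing entry by the observed mean of its column
col_means = np.nanmean(x_miss, axis=0)
x_imp = np.where(np.isnan(x_miss), col_means, x_miss)

# Because the data were amputed, the hidden true values are still available
rmse = np.sqrt(np.mean((x_imp[mask] - x_full[mask]) ** 2))
print(f"RMSE of mean imputation on the amputed entries: {rmse:.3f}")
```

With a dataset that is incomplete from the start, this error could not be computed, since the true values would be unknown.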

# Importing libraries

In [None]:
### Classical libraries
import numpy as np
import pandas as pd
from scipy import optimize

### Data visualisation
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import colors

### Real datasets
from sklearn.datasets import load_breast_cancer

###  Specific libraries to handle missing values
import pyampute
import missingno

# Exercise 1: removing incomplete observations

Consider a dataset composed of $n$ i.i.d. Gaussian samples $(X_{1.}, \dots, X_{n.})$, where $X_{i.} \sim N(\mu, \Sigma)$ with $\mu \in \mathbb{R}^d$ and $\Sigma \in \mathbb{R}^{d \times d}$.

The goal is to empirically study the impact of removing incomplete observations. An observation (also called a row or sample) is considered incomplete if it contains at least one missing value.

We revisit the example from Zhu et al. (2022), presented in the Module 1 video. The dataset has $d$ variables, and each value is missing independently with probability 1%. In low dimension ($d=5$), removing incomplete samples leaves about 95% of the observations. In high dimension ($d=300$), only about 5% of the observations remain complete.
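These orders of magnitude can be checked with a short calculation. The sketch below assumes that each value is missing independently with the same probability $p$, so that a row is complete with probability $(1-p)^d$; the helper name `expected_complete_fraction` is introduced here for illustration.

```python
def expected_complete_fraction(p, d):
    """Probability that a row with d entries has no missing value,
    assuming each entry is missing independently with probability p."""
    return (1 - p) ** d

for d in (5, 300):
    frac = expected_complete_fraction(0.01, d)
    print(f"d={d}: expected fraction of complete rows = {frac:.3f}")
```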

In this exercise, you will reproduce this example and study the impact of removing incomplete observations on the bias of the empirical mean.

In [None]:
n = 1000
d = 5
Mu = np.repeat(0, d)
Sigma = 0.5 * (np.ones((d,d)) + np.eye(d))

xfull = np.random.multivariate_normal(Mu, Sigma, size=n)  # complete dataset

In [None]:
pd.DataFrame(xfull).head()

You will first generate missing values of the Missing Completely At Random (MCAR) type: the missingness does not depend on the values of the data themselves.

## Question 1: generating MCAR missing values


To generate MCAR missing values, does the following approach seem satisfactory to you? If not, suggest a better strategy.

In [None]:
p = 0.4
xmiss = np.copy(xfull)
for j in range(d):
    miss_id = np.random.choice(n, np.floor(n*p).astype(int), replace=False)
    xmiss[miss_id, j] = np.nan
M = np.isnan(xmiss)  # mask: matrix indicating where the missing values are in the data

In [None]:
print("The percentage of NA is:", np.sum(M) / (n*d))

### Solution

This method does not take into account the stochastic nature of the mask $M$, which indicates where the missing values are. It effectively treats $p$ as an exact proportion of missing values, which is why you get exactly 40% missing values (try rerunning the code to observe this!). It is better to draw each entry of the mask $M$ from a Bernoulli distribution with parameter $p$, which represents the probability of a value being missing.

In [None]:
xmiss = np.copy(xfull)
for j in range(d):
  miss_id = np.random.uniform(0, 1, size=n) < p
  xmiss[miss_id, j] = np.nan
M = np.isnan(xmiss)

In [None]:
pd.DataFrame(xmiss).head()

In [None]:
print("The total percentage of NAs is:", np.sum(M) / (n*d))
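The same Bernoulli mask can also be drawn in one shot, without a loop over the columns. The following is a sketch under assumptions: it uses NumPy's `default_rng` generator with an arbitrary seed, and the names `x_full_demo`, `M_demo` and `x_miss_demo` are stand-ins that do not overwrite the variables of the lab.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed
n, d, p = 1000, 5, 0.4

x_full_demo = rng.normal(size=(n, d))  # stand-in for `xfull`

# Draw every entry of the mask at once: M[i, j] ~ Bernoulli(p)
M_demo = rng.random((n, d)) < p
x_miss_demo = np.where(M_demo, np.nan, x_full_demo)

# The realised proportion fluctuates around p instead of being exactly p
print("Proportion of NA per column:", M_demo.mean(axis=0).round(3))
```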

## Question 2: computing the bias of the empirical mean

The dataset `xmiss` now contains missing values; it is the amputed dataset. You will compute the empirical mean of the variables after removing the incomplete observations, and compare it to the empirical mean that would have been obtained if all observations were fully observed.

Provide code to compute the biases of the empirical mean in both cases (per variable in a first step).

In [None]:
x_cc = pd.DataFrame(xmiss).dropna()
x_cc.shape

### Solution

In [None]:
empirical_mean = np.mean(xfull, axis=0)
empirical_mean_na = np.mean(x_cc, axis=0)

bias = empirical_mean - Mu
bias_na = empirical_mean_na - Mu

print("Bias without NA:", [f"{x:.3f}" for x in bias])
print("Bias with NA:", [f"{x:.3f}" for x in bias_na])

We can compute the L2 norm of the bias vector.

In [None]:
norm2_bias = np.sqrt((bias ** 2).sum())  # L2 norm = square root of the sum of squares
norm2_bias_na = np.sqrt((bias_na ** 2).sum())

print("L2 norm of the bias without NA:", f"{norm2_bias:.3f}")
print("L2 norm of the bias with NA:", f"{norm2_bias_na:.3f}")

## Question 3: comparison across multiple simulations

The goal now is to reproduce the experiment for several values of $d$ (the number of variables in the dataset) and $p$ (the probability of a value being missing). To get a sense of the order of magnitude of the bias, we want to repeat the experiment multiple times for each case.

How can we complete the `compute_bias` function so that it returns the biases over multiple simulations? The function takes as input the number of simulations `n_sim`, the probability `p` that a value in the dataset is missing, the complete dataset `xfull`, and the theoretical mean `Mu`.

In [None]:
def compute_bias(n_sim, p, xfull, Mu):
    vec_norm2_bias = []
    vec_norm2_bias_na = []

    d = xfull.shape[1]

    for it in range(n_sim):

        ### Generate missing values ###

        ### TO COMPLETE ###

        vec_norm2_bias.append(norm2_bias)
        vec_norm2_bias_na.append(norm2_bias_na)

    return(vec_norm2_bias, vec_norm2_bias_na)

We can test the function with the following arguments: `n_sim`=10, `p`=10%. Then, we apply the function to obtain the values of the bias for various numbers of variables and different missing value probabilities.

### Note: stochasticity with missing values

In the complete case, when performing multiple simulations on synthetic datasets, a common approach is to generate the dataset multiple times from the same distribution with known parameters, here $({\Sigma},{\mu})$. In our case, however, the stochasticity comes from the generation of missing values. We consider the complete dataset `xfull` as fixed, and generate missing values multiple times, which results in different amputed datasets `xmiss`.

### Solution

In [None]:
def compute_bias(n_sim, p, xfull, Mu):
    vec_norm2_bias_na = []

    n = xfull.shape[0]
    d = xfull.shape[1]

    empirical_mean = np.mean(xfull,axis=0)
    bias = empirical_mean - Mu
    norm2_bias = np.sqrt((bias ** 2).sum())

    for it in range(n_sim):

        ### Generate missing values ###
        xmiss = np.copy(xfull)
        for j in range(d):
          miss_id = np.random.uniform(0, 1, size=n) < p
          xmiss[miss_id, j] = np.nan

        x_cc = pd.DataFrame(xmiss).dropna()

        if x_cc.shape[0] == 0:
          # No complete row left: record NaN and skip to the next simulation
          vec_norm2_bias_na.append(np.nan)
          continue

        empirical_mean_na = np.mean(x_cc, axis=0)

        bias_na = empirical_mean_na - Mu

        norm2_bias_na = np.sqrt((bias_na ** 2).sum())  # same L2 norm as for the complete data

        vec_norm2_bias_na.append(norm2_bias_na)

    return(norm2_bias, vec_norm2_bias_na)

In [None]:
n_sim = 10
p = 0.1
norm2_bias, vec_norm2_bias_na = compute_bias(n_sim=n_sim, p=p, xfull=xfull, Mu=Mu)

In [None]:
print("Bias without NA:", f"{norm2_bias:.3f}")
print("Mean of the biases with NA over", f"{n_sim}", "simulations:", f"{np.mean(vec_norm2_bias_na):.3f}")

In [None]:
d_list = [5, 10, 100]
p_list = [0.01, 0.05, 0.1, 0.5]

vec_norm2_bias = np.zeros(len(d_list))
mat_norm2_bias_na = np.zeros((len(d_list), len(p_list)))

for pos_d, d in enumerate(d_list):

    ### Complete dataset
    Mu = np.repeat(0, d)
    Sigma = 0.5 * (np.ones((d,d)) + np.eye(d))
    xfull = np.random.multivariate_normal(Mu, Sigma, size=n)

    for pos_perc, p in enumerate(p_list):
        norm2_bias, vec_norm2_bias_na = compute_bias(n_sim=10, p=p, xfull=xfull, Mu=Mu)
        mat_norm2_bias_na[pos_d, pos_perc] = round(np.mean(vec_norm2_bias_na), 3)

    vec_norm2_bias[pos_d] = round(norm2_bias, 3)

In [None]:
results = pd.DataFrame(vec_norm2_bias, index=[f"d={d}" for d in d_list], columns=['Without NA'])
results_na = pd.DataFrame(mat_norm2_bias_na, index=[f"d={d}" for d in d_list], columns=[f"p={p}" for p in p_list])

results.join(results_na)

## Question 4: interpretation of the results

Interpret the results obtained in question 3.

### Solution

The bias is of the same order of magnitude as without missing values for $d=5$ variables with a missingness probability $p$ ranging from 1% to 10%, and for $d=10$ variables with $p=1\%$. In the other cases, the bias of the mean is significantly higher in the presence of missing values.

There are `NA` in the results table whenever there is at least one simulation with no complete observations. In the next code cell, we define the function `compute_number_complete_individuals` to display the number of complete observations.

In [None]:
def compute_number_complete_individuals(n_sim,p,xfull):
    vec_complete_individuals = []

    n = xfull.shape[0]
    d = xfull.shape[1]

    for it in range(n_sim):
        np.random.seed(it)

        ### Generation of missing values
        xmiss = np.copy(xfull)
        for j in range(d):
          miss_id = np.random.uniform(0, 1, size=n) < p
          xmiss[miss_id, j] = np.nan

        x_cc = pd.DataFrame(xmiss).dropna()

        number_complete_individuals = x_cc.shape[0]

        vec_complete_individuals.append(number_complete_individuals)

    return(vec_complete_individuals)

In [None]:
d_list = [5, 10, 100]
p_list = [0.01, 0.05, 0.1, 0.5]

mat_complete_individuals = np.zeros((len(d_list), len(p_list)))

for pos_d, d in enumerate(d_list):

    ### Complete dataset
    Mu = np.repeat(0, d)
    Sigma = 0.5 * (np.ones((d, d)) + np.eye(d))
    xfull = np.random.multivariate_normal(Mu, Sigma, size=n)

    for pos_perc, p in enumerate(p_list):
        vec_complete_individuals = compute_number_complete_individuals(n_sim=10, p=p, xfull=xfull)
        # Percentage of complete rows: divide by the number of rows n (not by n*d)
        mat_complete_individuals[pos_d, pos_perc] = round(np.mean(vec_complete_individuals) / n * 100, 2)

In this table, we display the percentage of complete observations in each case.

In [None]:
percentage_complete_individuals = pd.DataFrame(mat_complete_individuals, index=[f"d={d}" for d in d_list], columns=[f"p={p}" for p in p_list])

percentage_complete_individuals

# Exercise 2: ignorability of the missing-data mechanism

In this exercise, you will illustrate the concept of ignorability of the missing data mechanism.

In the Module 1 video, you saw that the missing data mechanism is ignorable if it is MCAR or MAR, and non-ignorable in the MNAR case. Recall that the mechanism is Missing At Random (MAR) if the missingness depends only on the observed data values, and Missing Not At Random (MNAR) if the missingness can depend on all data values, including the missing ones.

Let us consider bivariate Gaussian data, the same dataset as in Exercise 1 with $d=2$ variables.


In [None]:
n = 1000
d = 2
Mu = np.repeat(0, d)
Sigma = 0.5 * (np.ones((d, d)) + np.eye(d))

xfull = np.random.multivariate_normal(Mu, Sigma, size=n)

In [None]:
pd.DataFrame(xfull).head()

In [None]:
# Complete data scatter plot
sns.scatterplot(x=xfull[:, 0], y=xfull[:, 1])

To generate MCAR missing values, we use the code from Exercise 1 (Question 1).

In [None]:
p = 0.5
xmiss_mcar = np.copy(xfull)
miss_id_mcar = np.random.uniform(0, 1, size=n) < p
xmiss_mcar[miss_id_mcar, 1] = np.nan
M_mcar = np.isnan(xmiss_mcar)
print("The proportion of NA in the second variable is:", np.sum(M_mcar[:, 1]) / n)

## Question 1: generating MAR and MNAR missing values

Let us consider that only the second variable contains missing values. Propose code to generate MAR and MNAR missing values using the following function, called `logit` in the code (strictly speaking, it is the logistic function, that is, the inverse of the logit):
$$\mathrm{logit}(x)=1/(1+e^{-(ax+b)}),$$
where $a \in \mathbb{R}$ and $b \in \mathbb{R}$.

In [None]:
def logit(x,coeff,intercept):

  res = 1 / (1 + np.exp(-(coeff * x + intercept)))

  return res

We can set $a=-4$ and $b=0$. At this stage, we are not aiming to precisely control the percentage of missing values generated.

In [None]:
a = -4
b = 0

### Solution

Let $X = (X_{.0} \quad X_{.1})$ denote the dataset, where $X_{.0} = (x_{10}, \dots, x_{n0})^T \in \mathbb{R}^n$ is the first variable and $X_{.1} = (x_{11}, \dots, x_{n1})^T \in \mathbb{R}^n$ is the second variable. Similarly, the mask is $M = (M_{.0} \quad M_{.1})$. The mechanism is:


* MAR if $$\mathbb{P}(M_{.1}|X)=\mathrm{logit}(X_{.0}).$$
In this case, the missingness of the second variable depends on the first variable, which is observed.
* MNAR if $$\mathbb{P}(M_{.1}|X)=\mathrm{logit}(X_{.1}).$$
In this case, the missingness of the second variable depends on its own values.



In [None]:
###Generation of MAR values

xmiss_mar = np.copy(xfull)
proba_mar = logit(xfull[:, 0], a, b)
miss_id_mar = np.random.uniform(0, 1, size=n) < proba_mar
xmiss_mar[miss_id_mar, 1] = np.nan
M_mar = np.isnan(xmiss_mar)
print("The percentage of NA in the second variable is:", np.sum(M_mar[:, 1])/(n))

In [None]:
###Generation of MNAR values

xmiss_mnar = np.copy(xfull)
proba_mnar = logit(xfull[:, 1], a, b)
miss_id_mnar = np.random.uniform(0, 1, size=n) < proba_mnar
xmiss_mnar[miss_id_mnar, 1] = np.nan
M_mnar = np.isnan(xmiss_mnar)
print("The percentage of NA in the second variable is:", np.sum(M_mnar[:, 1])/(n))

We can also represent missing values on a scatter plot. We can clearly observe that:
* for MCAR: the missingness does not depend on the data values; the missing values are spread throughout the scatter plot.
* for MAR: the missingness depends on the abscissa, that is, on the first variable $X_{.0}$, which is fully observed.
* for MNAR: the missingness depends on the ordinate, that is, on the second variable $X_{.1}$, which is the one containing missing values.

In [None]:
ax = sns.scatterplot(x=xfull[:, 0], y=xfull[:, 1], hue=M_mcar[:, 1], palette=['#d1e5f0', '#2166ac'])
handles, labels  =  ax.get_legend_handles_labels()
ax.set_title('MCAR')
ax.set_xlabel(r'$X_{.0}$')
ax.set_ylabel(r'$X_{.1}$')
ax.legend(handles, ['Observed', 'Missing'], loc='lower right', fontsize=13);

In [None]:
ax = sns.scatterplot(x=xfull[:, 0], y=xfull[:, 1], hue=M_mar[:, 1], palette=['#d1e5f0', '#2166ac'])
handles, labels  =  ax.get_legend_handles_labels()
ax.set_title('MAR')
ax.set_xlabel(r'$X_{.0}$')
ax.set_ylabel(r'$X_{.1}$')
ax.legend(handles, ['Observed', 'Missing'], loc='lower right', fontsize=13);

In [None]:
ax = sns.scatterplot(x=xfull[:, 0], y=xfull[:, 1], hue=M_mnar[:, 1], palette=['#d1e5f0', '#2166ac'])
handles, labels  =  ax.get_legend_handles_labels()
ax.set_title('MNAR')
ax.set_xlabel(r'$X_{.0}$')
ax.set_ylabel(r'$X_{.1}$')
ax.legend(handles, ['Observed', 'Missing'], loc='lower right', fontsize=13);

## Question 2: calculation of the bias of the empirical mean

The empirical means of the second variable are calculated by removing the missing values. Interpret the following results. Is there a bias in the MCAR case?

In [None]:
empirical_mean = np.mean(xfull[:, 1], axis=0)
empirical_mean_mcar = np.nanmean(xmiss_mcar[:, 1], axis=0)
empirical_mean_mar = np.nanmean(xmiss_mar[:, 1], axis=0)
empirical_mean_mnar = np.nanmean(xmiss_mnar[:, 1], axis=0)

In [None]:
print("Empirical mean:", f"{empirical_mean:.3f}")
print("Empirical mean, MCAR:", f"{empirical_mean_mcar:.3f}")
print("Empirical mean, MAR:", f"{empirical_mean_mar:.3f}")
print("Empirical mean, MNAR:", f"{empirical_mean_mnar:.3f}")

### Solution

In the MCAR case, there is no bias. We have:
$$\mathbb{E}\left[\frac{1}{n_{\textrm{obs}}}\sum_{i=1}^n (1-M_{i1}) X_{i1}\right]=\mathbb{E}[X_{i1}],$$
where $n_{\textrm{obs}}$ is the number of observed values in $X_{.1}$. Indeed, treating $n_{\textrm{obs}}$ as fixed,
$$\mathbb{E}\left[\frac{1}{n_{\textrm{obs}}}\sum_{i=1}^n (1-M_{i1}) X_{i1}\right]=\frac{n}{n_{\textrm{obs}}}\mathbb{E}[1-M_{i1}]\,\mathbb{E}[X_{i1}],$$
because $M_{.1}$ and $X_{.1}$ are independent in the MCAR case. Finally, we have $\mathbb{E}[1-M_{i1}]=n_{\textrm{obs}}/n$, since each $M_{i1}$ is drawn from a Bernoulli distribution with parameter $p=(n-n_{\textrm{obs}})/n$.

In the MAR case, and even more so in the MNAR case, the empirical mean is biased.

If we look at the scatter plots showing where the missing values occur, this observation was expected (see the solution to Question 1).
In the MNAR case, most of the negative values of $X_{.1}$ are missing. As a result, the empirical mean is positive.
In the MAR case, even though the missingness does not depend on the value of the variable $X_{.1}$ itself but rather on $X_{.0}$, the linear relationship between the two variables also implies that many negative values of $X_{.1}$ are missing. Therefore, the empirical mean is again positive.

## Question 3: calculation of the bias of the maximum likelihood estimator

The previous question involved calculating the empirical mean based on the observed values. We will now compute the maximum likelihood estimator. We will revisit in detail how to derive its expression in the lab session for Module 3 (Exercise 1, Question 1).

This estimator also depends on the empirical mean of $X_{.0}$: instead of using only the observed values of $X_{.1}$ (as in the empirical mean computed in Question 2), it uses all available values in the dataset and thus takes advantage of the relationship between the variables. This helps to better preserve the empirical distribution of the data.

We will observe that this likelihood-based estimator yields unbiased results in the MCAR and MAR cases, but biased results in the MNAR case.
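For reference, the quantity computed by the `maximum_likelihood_estimate` function used in this question can be written as follows. The notation is introduced here for illustration and is an assumption, not taken from the course material: $\hat{\mu}_0$ is the mean of $X_{.0}$ over all $n$ rows, while $\bar{x}_0^{\mathrm{obs}}$, $\bar{x}_1^{\mathrm{obs}}$, $\hat{\sigma}_{00}$ and $\hat{\sigma}_{01}$ are the means, variance and covariance computed only on the rows where $X_{.1}$ is observed.

$$\hat{\mu}_1 = \bar{x}_1^{\mathrm{obs}} + \frac{\hat{\sigma}_{01}}{\hat{\sigma}_{00}}\left(\hat{\mu}_0 - \bar{x}_0^{\mathrm{obs}}\right)$$

The second term corrects the complete-case mean $\bar{x}_1^{\mathrm{obs}}$ by a regression on $X_{.0}$. It is close to zero when the rows with an observed $X_{.1}$ have roughly the same mean of $X_{.0}$ as the full sample, which is what happens on average under MCAR.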

Interpret the following results. Then, propose code to obtain the results over multiple simulations.

In [None]:
def maximum_likelihood_estimate(miss_id,xmiss):

  mu0 = np.mean(xmiss[:, 0])

  bar_x0 = np.mean(xmiss[~miss_id, 0])
  bar_x1 = np.mean(xmiss[~miss_id, 1])
  sig_0 = np.mean((xmiss[~miss_id, 0] - bar_x0) ** 2)
  sig_01 = np.mean((xmiss[~miss_id, 0] - bar_x0) * (xmiss[~miss_id, 1] - bar_x1))
  mu1 = np.mean(xmiss[~miss_id, 1]) + sig_01 / sig_0 * (mu0 - np.mean(xmiss[~miss_id, 0]))

  return(mu1)

In [None]:
mle_mcar = maximum_likelihood_estimate(miss_id_mcar,xmiss_mcar)
mle_mar = maximum_likelihood_estimate(miss_id_mar,xmiss_mar)
mle_mnar = maximum_likelihood_estimate(miss_id_mnar,xmiss_mnar)

print("Empirical mean without NA:", f"{empirical_mean:.3f}")
print("Maximum likelihood estimator, MCAR:", f"{mle_mcar:.3f}")
print("Maximum likelihood estimator, MAR:", f"{mle_mar:.3f}")
print("Maximum likelihood estimator, MNAR:", f"{mle_mnar:.3f}")

### Solution

In the computation of the maximum likelihood estimator, the missing data mechanism was not taken into account. This is why the results are biased in the MNAR case.

We can repeat the experiment over multiple simulations and display the boxplots of the results.

In [None]:
def compute_bias_mle(n_sim, p, a, b, xfull, Mu):

    vec_norm2_bias_mcar = []
    vec_norm2_bias_mar = []
    vec_norm2_bias_mnar = []

    n = xfull.shape[0]
    d = xfull.shape[1]

    empirical_mean = np.mean(xfull[:, 1])
    bias = empirical_mean - Mu[1]
    norm2_bias = (bias ** 2)

    for it in range(n_sim):

        ### Generation of missing values
        xmiss_mcar = np.copy(xfull)
        miss_id_mcar = np.random.uniform(0, 1, size=n) < p
        xmiss_mcar[miss_id_mcar, 1] = np.nan

        xmiss_mar = np.copy(xfull)
        proba_mar = logit(xfull[:, 0], a, b)
        miss_id_mar = np.random.uniform(0, 1, size=n) < proba_mar
        xmiss_mar[miss_id_mar, 1] = np.nan

        xmiss_mnar = np.copy(xfull)
        proba_mnar = logit(xfull[:, 1], a, b)
        miss_id_mnar = np.random.uniform(0, 1, size=n) < proba_mnar
        xmiss_mnar[miss_id_mnar, 1] = np.nan

        mle_mcar = maximum_likelihood_estimate(miss_id_mcar, xmiss_mcar)
        mle_mar = maximum_likelihood_estimate(miss_id_mar, xmiss_mar)
        mle_mnar = maximum_likelihood_estimate(miss_id_mnar, xmiss_mnar)

        # Compare with the true mean of the second variable only (Mu[1], not the whole vector Mu)
        bias_mcar = mle_mcar - Mu[1]
        bias_mar = mle_mar - Mu[1]
        bias_mnar = mle_mnar - Mu[1]

        norm2_bias_mcar = (bias_mcar ** 2).sum()
        norm2_bias_mar = (bias_mar ** 2).sum()
        norm2_bias_mnar = (bias_mnar ** 2).sum()

        vec_norm2_bias_mcar.append(norm2_bias_mcar)
        vec_norm2_bias_mar.append(norm2_bias_mar)
        vec_norm2_bias_mnar.append(norm2_bias_mnar)

    return(norm2_bias, vec_norm2_bias_mcar, vec_norm2_bias_mar, vec_norm2_bias_mnar)

In [None]:
norm2_bias, vec_norm2_bias_mcar, vec_norm2_bias_mar, vec_norm2_bias_mnar = compute_bias_mle(n_sim=10, p=0.5, a=-4, b=0, xfull=xfull, Mu=Mu)

In [None]:
res_na = pd.DataFrame({"MCAR":vec_norm2_bias_mcar, "MAR":vec_norm2_bias_mar, "MNAR":vec_norm2_bias_mnar})
ax = sns.boxplot(data=res_na)
ax.set_title("Squared bias in the mean estimation");

# Exercise 3: challenges related to generating missing values

In this exercise, you will explore the challenges that can arise when attempting to generate missing values.

Let us consider the Gaussian dataset from the previous exercises with $d = 3$ variables.

In [None]:
n = 1000
d = 3
Mu = np.repeat(0, d)
Sigma = 0.5 * (np.ones((d, d)) + np.eye(d))

xfull = np.random.multivariate_normal(Mu, Sigma, size=n) #complete dataset

In [None]:
pd.DataFrame(xfull).head()

## Question 1: percentage of missing values

To generate MCAR-type missing values, it is straightforward to obtain a specific overall percentage of missing data. We saw how to proceed in the previous exercises, by drawing the missingness mask from a Bernoulli distribution with parameter $p$ (the probability of being missing). If we want 40% missing values in total, we can choose $p = 0.4$.

When the goal is to generate MAR or MNAR-type missing values, things become more complicated.

More specifically, suppose the objective is to generate MAR-type missing values in the second variable using the logistic function, such that the mechanism is defined as:

$$\mathbb{P}(M_{.1}|X=(X_{.0},X_{.1},X_{.2}))=1/(1+e^{-(a_0X_{.0}+a_2X_{.2}+b)}),$$
with $a_0 \in \mathbb{R},\ a_2 \in \mathbb{R},\ b \in \mathbb{R}$.

To control the proportion of missing values in the variable $X_{.1}$, one approach is to randomly choose the coefficients $a_0$ and $a_2$, and then adjust the value of $b$ accordingly.


How is the choice of the intercept $b$ adjusted if the following function `choose_intercept` is used? Generate MNAR missing values using it.

In [None]:
def logit(x, coeff, intercept):

  res = 1 / (1+np.exp(-(x.dot(coeff) + intercept)))

  return res

In [None]:
def choose_intercept(xfull, coeff, idx_var, p):

    def f(x):
        return logit(xfull[:, idx_var], coeff, x).mean().item() - p

    intercepts = optimize.bisect(f, -50, 50)

    return intercepts

In [None]:
idx_var = [0, 2]
coeff = np.random.normal(size=len(idx_var))
intercept = choose_intercept(xfull, coeff, idx_var, p=0.4)

In [None]:
print("The chosen coefficients are:", coeff)
print("The chosen intercept is:", intercept)

In [None]:
###Generation of MAR values

xmiss_mar = np.copy(xfull)
proba_mar = logit(xfull[:, idx_var], coeff, intercept)
miss_id_mar = np.random.uniform(0, 1, size=n) < proba_mar
xmiss_mar[miss_id_mar, 1] = np.nan
M_mar = np.isnan(xmiss_mar)
print("The percentage of NA in the second variable is:", np.sum(M_mar[:, 1]) / n)

### Solution

We choose $b$ such that, on average, the probability of being missing for a value of the variable $X_{.1}$ equals $p$. This is therefore a root-finding problem: we seek the root $b$ of the following function, which `choose_intercept` finds by bisection on $[-50, 50]$:

$$f(b)=\frac{1}{n}\sum_{i=1}^n \frac{1}{1+e^{-(a_0X_{i0}+a_2X_{i2}+b)}}-p.$$
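The root-finding step can be checked in isolation. The sketch below is self-contained and rests on assumptions: the predictors `x_pred` are simulated stand-ins for $X_{.0}$ and $X_{.2}$, the coefficients in `coeff_demo` are hypothetical, and `mean_missing_proba` is a helper name introduced here.

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(1)  # arbitrary seed
n, p_target = 1000, 0.4

# Stand-in for the two predictors X_{.0} and X_{.2}, with hypothetical coefficients
x_pred = rng.normal(size=(n, 2))
coeff_demo = np.array([1.5, -0.7])

def mean_missing_proba(b):
    """Average missingness probability over the n rows for intercept b."""
    return np.mean(1 / (1 + np.exp(-(x_pred @ coeff_demo + b))))

# f(b) = mean_missing_proba(b) - p_target is increasing in b and changes sign on [-50, 50]
b_hat = optimize.bisect(lambda b: mean_missing_proba(b) - p_target, -50, 50)

print(f"intercept: {b_hat:.3f}, average probability: {mean_missing_proba(b_hat):.3f}")
```

Note that only the average probability is matched to the target: the realised proportion of missing values still fluctuates around it from one draw of the mask to the next.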

To generate MNAR missing values, we can consider the following mechanism:
$$\mathbb{P}(M_{.1}|X)=\mathrm{logit}(X_{.1}).$$
Here is the corresponding code:

In [None]:
### MNAR case
idx_var = [1]
coeff = np.random.normal(size=len(idx_var))
intercept = choose_intercept(xfull, coeff, idx_var, p=0.4)

xmiss_mnar = np.copy(xfull)
proba_mnar = logit(xfull[:, idx_var], coeff, intercept)
miss_id_mnar = np.random.uniform(0, 1, size=n) < proba_mnar
xmiss_mnar[miss_id_mnar, 1] = np.nan
M_mnar = np.isnan(xmiss_mnar)
print("The percentage of NA in the second variable is:", np.sum(M_mnar[:, 1]) / n)

We can quickly verify the generation of missing values using plots.

In [None]:
ax = sns.scatterplot(x=xfull[:, 0], y=xfull[:, 1], hue=M_mar[:, 1], palette=['#d1e5f0', '#2166ac'])
handles, labels  =  ax.get_legend_handles_labels()
ax.set_title('MAR')
ax.set_xlabel(r'$X_{.0}$')
ax.set_ylabel(r'$X_{.1}$')
ax.legend(handles, ['Observed', 'Missing'], loc='lower right', fontsize=13);

In [None]:
ax = sns.scatterplot(x=xfull[:, 0], y=xfull[:, 1], hue=M_mnar[:, 1], palette=['#d1e5f0', '#2166ac'])
handles, labels  =  ax.get_legend_handles_labels()
ax.set_title('MNAR')
ax.set_xlabel(r'$X_{.0}$')
ax.set_ylabel(r'$X_{.1}$')
ax.legend(handles, ['Observed', 'Missing'], loc='lower right', fontsize=13);

## Question 2: specificity of the MAR case

The specificity of the MAR case is that the missingness depends on observed values of the data.

In the MNAR case, we can generate missing values in each variable, for example by using the logistic function as follows:
$$ \forall j \in \{0,\dots,d-1\},\quad \mathbb{P}(M_{.j}|X)=1/(1+e^{-(aX_{.j}+b)}).$$

In the MAR case, should certain variables be considered fully observed?

### Solution

The vast majority of implementations for generating missing values treat one or more variables as fully observed. However, this is not necessary to simulate MAR values.

The original definition of the MAR mechanism considers vectorized quantities in $\mathbb{P}(M \mid X_{\mathrm{obs}(M)})$, meaning that $M$ is a vector of size $n \times d$ and $X_{\mathrm{obs}(M)}$ is a vector of size equal to the number of observed values in $X$.

In fact, this vectorized representation corresponds to simulating missing values by row, that is, by missingness pattern. With three variables, it is perfectly possible to have MAR missing values in all variables, for example with the following patterns and mechanisms.

Patterns:
- $i \in \mathrm{Pattern}_1$ if $M_{i.}=(1,0,0)$, that is, only the first variable is missing,
- $i \in \mathrm{Pattern}_2$ if $M_{i.}=(0,1,0)$, that is, only the second variable is missing,
- $i \in \mathrm{Pattern}_3$ if $M_{i.}=(0,0,1)$, that is, only the third variable is missing.

Mechanisms:
- $i \in \mathrm{Pattern}_1, \mathbb{P}(M_{i1}|X)=\mathrm{logit}(X_{i2},X_{i3})$,
- $i \in \mathrm{Pattern}_2, \mathbb{P}(M_{i2}|X)=\mathrm{logit}(X_{i1},X_{i3})$,
- $i \in \mathrm{Pattern}_3, \mathbb{P}(M_{i3}|X)=\mathrm{logit}(X_{i1},X_{i2})$.
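The idea of amputing by pattern can be sketched by hand with NumPy. This is only an illustration under stated assumptions, not the procedure of any particular library: rows are assigned to a candidate pattern uniformly at random, the coefficients in `weights` are hypothetical, and all names (`x_demo`, `M_demo`, `pattern_of_row`, `sigmoid`) are introduced here. Variables are indexed from 0 in the code.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
n, d = 1000, 3
x_demo = rng.normal(size=(n, d))  # stand-in for the complete dataset

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Step 1: assign each row to one candidate pattern, independently of the data
pattern_of_row = rng.integers(0, d, size=n)  # index of the only variable that may be missing

# Step 2: within each pattern, the missingness probability uses only the other two variables
weights = np.array([1.0, -1.0])  # hypothetical coefficients
M_demo = np.zeros((n, d), dtype=bool)
for j in range(d):
    rows = np.flatnonzero(pattern_of_row == j)
    others = [k for k in range(d) if k != j]  # variables that stay observed in this pattern
    proba = sigmoid(x_demo[np.ix_(rows, others)] @ weights)
    M_demo[rows, j] = rng.random(rows.size) < proba

x_miss_demo = np.where(M_demo, np.nan, x_demo)
print("Missing values per variable:", M_demo.sum(axis=0))
print("Max number of missing values in a row:", M_demo.sum(axis=1).max())
```

Every variable ends up with missing values, yet in each row the missingness only depends on values that remain observed in that row.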

## Question 3: use of the `pyampute` library

To generate missing values by pattern, you can use the `pyampute` library. Documentation is available [here](https://rianneschouten.github.io/pyampute/build/html/index.html). The `MultivariateAmputation` class allows you to generate missing values in an (initially complete) dataset. There are two main arguments:
* `prop`: the proportion of incomplete rows, that is, rows containing at least one missing value,
* `patterns`: a list of dictionaries that notably include the following entries:
  * `incomplete_vars`: indices of the variables with missing values
  * `weights`: weights on the variables that will influence the missingness
  * `mechanism`: missing data mechanism
  * `freq`: frequency of the pattern in the amputed dataset

  Each dictionary corresponds to the description of a pattern.

Use `MultivariateAmputation` to generate MAR missing values as specified in the solution to the previous question, with pattern frequencies of 10%, 50%, and 40%, respectively. Are the results consistent?

Warning: the `pyampute` library has not been maintained since 2022. The pattern visualization function (Python module `pyampute.exploration`) returns an error; the following code can be used instead.

In [None]:
def plot_patterns(res):

  #### res is a DataFrame containing all possible missing-data patterns of an incomplete dataset
  #### Example: res = pd.DataFrame(np.unique(M, axis=0)), with M the mask

  # Two-colour map: light blue for observed entries, dark blue for missing ones
  cmap = colors.ListedColormap(['#d1e5f0', '#2166ac'])

  fig, ax = plt.subplots(1)
  ax.imshow(res.astype(bool), aspect="auto", cmap=cmap)


  ax.set_yticks(np.arange(0, len(res.index), 1))
  ax.set_yticks(np.arange(-0.5, len(res.index), 1), minor=True)
  ax.set_xticks(np.arange(0, len(res.columns), 1))
  ax.set_xticks(np.arange(-0.5, len(res.columns), 1), minor=True)


  ax.set_xticklabels([k for k in res.columns])
  ax.set_yticklabels([k for k in res.index])
  ax.grid(which="minor", color="w", linewidth=1)
  plt.show()

### Solution

In [None]:
pattern1 = {"incomplete_vars": [0], "mechanism": "MAR", "freq":0.1}
pattern2 = {"incomplete_vars": [1], "mechanism": "MAR", "freq":0.5}
pattern3 = {"incomplete_vars": [2], "mechanism": "MAR", "freq":0.4}
patterns = [pattern1, pattern2, pattern3]


ma = pyampute.ampute.MultivariateAmputation(prop=0.9, patterns=patterns)
xmiss = ma.fit_transform(xfull)
M = np.isnan(xmiss)
print("The total percentage of NAs is:", np.sum(M)/(n*d))

The argument `prop` is the proportion of rows containing at least one missing value, not the overall proportion of missing values.

In [None]:
x_cc = pd.DataFrame(xmiss).dropna()
# x_cc holds the complete rows, so the incomplete ones are the remaining n - x_cc.shape[0]
print("The percentage of incomplete rows is:", (n - x_cc.shape[0]) / n * 100, "%.")
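The link between the proportion of incomplete rows and the overall proportion of missing values can be worked out by hand. The calculation below is a sketch that assumes `freq` is the frequency of each pattern among the incomplete rows; the variable names are introduced here for illustration.

```python
prop = 0.9                     # proportion of incomplete rows requested
freqs = [0.1, 0.5, 0.4]        # pattern frequencies, assumed relative to incomplete rows
n_incomplete_vars = [1, 1, 1]  # each pattern removes exactly one variable
d = 3

# Expected share of missing cells =
#   prop * (average number of missing cells per incomplete row) / d
expected_na = prop * sum(f * k for f, k in zip(freqs, n_incomplete_vars)) / d

# Expected share of missing values in each variable under the same assumption
expected_na_per_var = [prop * f for f in freqs]

print(f"Expected overall proportion of NA: {expected_na:.2f}")
print("Expected proportion of NA per variable:", [round(v, 2) for v in expected_na_per_var])
```

Under this assumption, 90% of incomplete rows with one missing value each out of three variables gives about 30% of missing values overall.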

We can verify that the frequency of the missingness patterns has been properly respected.

In [None]:
### Visualisation of the missing-data patterns

which_patterns, counts_patterns = np.unique(M, axis=0, return_counts=True)

# np.unique sorts the patterns lexicographically: (0,0,0), (0,0,1), (0,1,0), (1,0,0)
res = pd.DataFrame(which_patterns * 1, columns=["X0", "X1", "X2"], index=["Complete row", "Pattern3", "Pattern2", "Pattern1"])

In [None]:
plot_patterns(res)

In [None]:
### Frequency of the missing-data patterns

res["Percentage"] = counts_patterns / n * 100
res

# Exercise 4: pseudo-realistic mechanisms in a real dataset

In this exercise, you will consider a real dataset, Breast Cancer Wisconsin, available from the UCI repository [here](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic), which initially contains no missing values. The variables are computed from images of breast tumors. More specifically, each cell nucleus is described by ten measurements (radius, texture, perimeter, area, etc.). The 30 variables available in the dataset correspond to the mean of these measurements over the nuclei, their standard error, and their worst value (in the sense of the largest value, which is most likely to indicate a malignant tumor diagnosis).

Generally, this dataset is used for prediction purposes to classify patients according to tumor type: malignant or benign.

The goal of this exercise is to generate missing values with a *pseudo-realistic* pattern.


In [None]:
data = load_breast_cancer()
xfull = data['data']  # covariates, without missing values
diagnosis = data['target']  # target variable to predict, when the learning task is prediction
features_names = data['feature_names']

In [None]:
pd.DataFrame(xfull, columns=features_names).head()

In [None]:
features_names

In [None]:
n, d = xfull.shape

## Question 1: generation of a basic MCAR mask

Generate MCAR missing values across all variables, with a missingness probability of $p = 0.3$ for the first 10 variables (corresponding to the means of the measurements), $p = 0.6$ for the next 10 (standard errors), and $p = 0.8$ for the last 10 (worst measurements). Explain why this mechanism is truly MCAR.

This is not the most realistic missingness scenario. One can imagine that the values are manually recorded by three different people, and depending on their diligence and available time (which are independent of the data values), there are more or fewer missing values.

### Solution

The mechanism is indeed MCAR here because the probability of a value being missing does not depend on the data values. Having different probabilities of missingness across variables is not incompatible with the MCAR mechanism.

In [None]:
p = [0.3, 0.6, 0.8]  # missingness probability for each block of 10 variables
xmiss = np.copy(xfull)
for block, p_block in enumerate(p):
  # columns 0-9 (means), 10-19 (standard errors), 20-29 (worst measurements)
  for j in range(10 * block, 10 * (block + 1)):
    miss_id = np.random.uniform(0, 1, size=n) < p_block  # independent of the data values
    xmiss[miss_id, j] = np.nan

In [None]:
pd.DataFrame(xmiss,columns=features_names).head()
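
As a complement, the same MCAR mask can be drawn in one vectorized step and its empirical missing rates checked per block. This is only a sketch: it runs on a synthetic Gaussian array standing in for `xfull` (an assumption, so that the snippet is self-contained), and the names `rng`, `x_demo`, `p_col` and `block_rates` are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator, for reproducibility only

# Synthetic stand-in for `xfull` (assumption): 569 rows, 30 columns
n_demo, d_demo = 569, 30
x_demo = rng.normal(size=(n_demo, d_demo))

# One missingness probability per column: 0.3, 0.6 and 0.8 for the three blocks
p_col = np.repeat([0.3, 0.6, 0.8], 10)

# Broadcasting compares each uniform draw with the probability of its column
mask = rng.uniform(size=(n_demo, d_demo)) < p_col
x_demo_miss = np.where(mask, np.nan, x_demo)

# Empirical missing rate in each block of 10 consecutive columns
block_rates = mask.reshape(n_demo, 3, 10).mean(axis=(0, 2))
print(block_rates)
```

The printed rates should be close to 0.3, 0.6 and 0.8. The mask is drawn without ever looking at `x_demo`, which is exactly what makes the mechanism MCAR.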

## Question 2: using the `missingno` visualization library

The `missingno` library is a Python visualization library for handling missing data. Documentation is available [here](https://github.com/ResidentMario/missingno).

Use the `matrix` and `bar` functions from the `missingno` library to visualize the amputed dataset `xmiss`.

### Solution

The `matrix` function provides a visualization of the missingness patterns in `xmiss`. We observe that the first 10 variables have more observed values (black squares) than the next 10, which in turn have more observed values than the last 10. This was expected, given the missing value generation in Question 1.

In [None]:
missingno.matrix(pd.DataFrame(xmiss)) #global visualisation of missing-data patterns

The `bar` function displays, for each variable, the number of observed values (at the top) and the proportion of observed values (on the left y-axis).

The results remain consistent with the missing value generation from Question 1.

In [None]:
missingno.bar(pd.DataFrame(xmiss))
#percentage of observed values, and number of observed values per variable
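
The quantities shown by the bar chart can also be computed directly. The sketch below uses a small hand-written array (hypothetical values, not taken from the dataset) and NumPy only; `x_toy`, `n_observed` and `frac_observed` are names introduced for this illustration.

```python
import numpy as np

# Toy incomplete array (hypothetical values, for illustration only)
x_toy = np.array([
    [1.0, np.nan, 3.0],
    [np.nan, np.nan, 6.0],
    [7.0, 8.0, 9.0],
    [10.0, np.nan, 12.0],
])

observed = ~np.isnan(x_toy)              # True where a value is present
n_observed = observed.sum(axis=0)        # number of observed values per variable
frac_observed = observed.mean(axis=0)    # proportion of observed values per variable

print(n_observed)
print(frac_observed)
```

Replacing `x_toy` with `xmiss` gives the numerical counterpart of the plot for the amputed dataset.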

## Question 3: generation of a mask with dependency

Now consider that the first 10 variables are missing with probability $p = 0.3$. Furthermore, suppose that if the first variable is missing, then the 11th variable is also missing; if the second variable is missing, then the 12th is missing as well, and so on. Referring back to the example in Question 1 where the values were manually recorded, we can now assume there were only two people. For each measurement, the first person either recorded both the mean value (variables 1 to 10) and the standard error value (variables 11 to 20), or neither of them. The second person recorded all the worst values (variables 21 to 30), which therefore contain no missing values.

Generate the mask corresponding to this scenario. Use the `heatmap` function from `missingno` to visualize the influence of the presence of the first ten variables on the presence of the next ten. Does the mechanism remain MCAR?

### Solution

Using the `heatmap` function, we observe that the presence of the first 10 variables is directly linked (with a correlation of 1) to the presence of the next 10 variables, as expected from the mask generation.

The missing data mechanism remains MCAR. The probability of being missing for each value does not depend on the data values. Here, we introduced a dependency between the mask columns $M_{.,j}$ and $M_{.,j+10}$ for $j=0,\dots,9$ (using the zero-based column indices of the code), but the mask remains independent of the data values, that is, $\mathbb{P}(M|X)=\mathbb{P}(M)$.

In [None]:
p = 0.3
xmiss = np.copy(xfull)
for j in range(10):
  miss_id = np.random.uniform(0, 1, size=n) < p
  xmiss[miss_id, j] = np.nan
  xmiss[miss_id, j+10] = np.nan

In [None]:
missingno.heatmap(pd.DataFrame(xmiss))
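
The correlation of 1 can also be verified numerically by correlating the missingness indicators of the columns. The sketch below assumes a synthetic Gaussian array in place of `xfull` and restricts itself to 20 columns; `base_mask` and `nullity_corr` are names introduced here. Columns without any missing value have a constant indicator, so their correlation is undefined, which is why they are left out of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_demo = 500
x_demo = rng.normal(size=(n_demo, 20))  # synthetic stand-in (assumption)

# Same construction as in the solution: columns j and j + 10 share one mask
base_mask = rng.uniform(size=(n_demo, 10)) < 0.3
mask = np.hstack([base_mask, base_mask])
x_demo[mask] = np.nan

# Correlation matrix between the missingness indicators of the columns
nullity_corr = np.corrcoef(np.isnan(x_demo).astype(float), rowvar=False)

print(nullity_corr[0, 10])  # paired columns: identical indicators
print(nullity_corr[0, 1])   # unpaired columns: independent indicators
```

Paired columns have identical indicators, hence a correlation of 1, while unpaired columns should show a correlation close to 0.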

## Question 4: case of a mask dependent on the diagnosis

A common practical scenario occurs when an individual has more or fewer missing values depending on the group they belong to. Here, we can imagine that patients with a benign tumor have more missing values because the images are of lower quality or come from patients with another pathology, and doctors do not necessarily retake images for these patients.

Generate missing values under this missingness scenario. What type of missing data mechanism is this, depending on whether the variable `diagnosis` (indicating if the tumor is malignant with a `0` or benign with a `1`) is observed or not?

### Solution

The mechanism is MAR if the variable `diagnosis` is fully observed, and MNAR if it is completely missing (and thus latent).
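
A short derivation makes this explicit. It is a sketch under the assumption that, given the diagnosis $Y_i$ of patient $i$, each entry of row $i$ is masked independently with a probability that depends only on $Y_i$. The notation $p_b$ and $p_m$ (for `p_benign` and `p_malign`) is introduced here, with $M_{ij} = 1$ meaning that the value is missing.

$$
\mathbb{P}(M_{ij} = 1 \mid X_{i.}, Y_i) = p_b \, \mathbb{1}\{Y_i = 1\} + p_m \, \mathbb{1}\{Y_i = 0\}.
$$

If $Y$ is fully observed, the right-hand side depends only on observed data, so the mechanism is MAR. If $Y$ is not recorded, it has to be integrated out:

$$
\mathbb{P}(M_{ij} = 1 \mid X_{i.}) = p_b \, \mathbb{P}(Y_i = 1 \mid X_{i.}) + p_m \, \mathbb{P}(Y_i = 0 \mid X_{i.}).
$$

As soon as $p_b \neq p_m$ and $X_{i.}$ carries information about $Y_i$, this probability depends on the whole row $X_{i.}$, including entries that are themselves missing, so the mechanism is MNAR.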

In [None]:
p_benign = 0.7  # probability of being missing for the values of the population with a benign tumor
p_malign = 0.1  # probability of being missing for the values of the population with a malignant tumor

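xmiss = np.copy(xfull)

# Row indices of each group (computed once, outside the loops)
benign_idx = np.where(diagnosis == 1)[0]
malign_idx = np.where(diagnosis == 0)[0]

# Benign group: each value is missing with probability p_benign
for j in range(d):
  miss_id = np.random.uniform(0, 1, size=benign_idx.shape[0]) < p_benign
  xmiss[benign_idx[miss_id], j] = np.nan

# Malignant group: each value is missing with probability p_malign
for j in range(d):
  miss_id = np.random.uniform(0, 1, size=malign_idx.shape[0]) < p_malign
  xmiss[malign_idx[miss_id], j] = np.nan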

In [None]:
M = np.isnan(xmiss)
print("Proportion of missing values in the benign group:", np.mean(M[diagnosis == 1, :]))
print("Proportion of missing values in the malignant group:", np.mean(M[diagnosis == 0, :]))
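
To see why hiding the group variable matters, the following sketch compares observed and missing values within each group and overall. It relies on synthetic data (an assumption): a binary label `group` and a single variable `x` whose mean depends on the group. All names are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_demo = 2000

# Synthetic stand-ins (assumptions): group label and a variable that depends on it
group = rng.integers(0, 2, size=n_demo)              # 1 = "benign", 0 = "malignant"
x = rng.normal(loc=np.where(group == 1, 0.0, 2.0))   # group 0 has larger values

# Missingness probability driven by the group only
p_row = np.where(group == 1, 0.7, 0.1)
mask = rng.uniform(size=n_demo) < p_row

# Within a group, missing and observed values have similar means
gaps_within = {}
for g in (0, 1):
    in_g = group == g
    gaps_within[g] = x[in_g & ~mask].mean() - x[in_g & mask].mean()
print(gaps_within)

# Ignoring the group, observed values are shifted relative to missing ones
gap_marginal = x[~mask].mean() - x[mask].mean()
print(gap_marginal)
```

The within-group gaps should be close to 0, while the marginal gap should be clearly positive: without the group label, the missingness of `x` is informative about its own value.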

### Note: Amputation on an incomplete dataset

This practical session does not address the case of amputation on an incomplete dataset that already contains *native* missing values. In this case, the problem is that there is neither a reference score nor a way to directly compare imputation methods by calculating the imputation error. One solution is to introduce new missing values. It is then relevant to generate them according to the distribution of the native missing values. This is challenging because it requires estimating the distribution $p(M \mid X)$, and thus knowing whether the mechanism is MCAR, MAR, or MNAR. A first step is to respect the same patterns as those of the native missing values. The new missing values can be introduced on complete rows of the dataset if possible, or by completing already existing patterns.
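
The "first step" described above, reusing native patterns on complete rows, could look like the sketch below. It only respects the patterns: donor patterns are drawn uniformly at random, so any dependence of the native mask on the data values is ignored. The incomplete dataset is synthetic (an assumption), and `targets`, `donors`, `new_mask` and `held_out` are names introduced for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset with "native" missing values (assumption)
x_native = rng.normal(size=(200, 4))
x_native[rng.uniform(size=x_native.shape) < 0.15] = np.nan

native_mask = np.isnan(x_native)
complete_rows = np.flatnonzero(~native_mask.any(axis=1))
incomplete_rows = np.flatnonzero(native_mask.any(axis=1))

# Give some complete rows the pattern of a randomly drawn incomplete row
n_new = min(20, complete_rows.size)
targets = rng.choice(complete_rows, size=n_new, replace=False)
donors = rng.choice(incomplete_rows, size=n_new, replace=True)

new_mask = np.zeros_like(native_mask)
new_mask[targets] = native_mask[donors]

x_amputed = x_native.copy()
x_amputed[new_mask] = np.nan

# The newly hidden values remain known in `x_native`
held_out = x_native[new_mask]
print(new_mask.sum(), "values held out")
```

Because the values in `held_out` are known, an imputation error can be computed on them even though the original dataset was incomplete.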
