# Missing Data != Missing Information

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)
%matplotlib inline
import re
from itertools import product
import itertools

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef, roc_curve, auc, roc_auc_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

from scipy import stats

In this notebook, we will explore the relationship between the missingness of data fields and their values.

In [None]:
dat = pd.read_pickle("../input/gstore-revenue-data-preprocessing/train.pkl")

If we look at the counts of NA values in each attribute below, we can see that many attributes are missing for a significant portion of the rows. For many of these attributes, "missing" does not exactly mean "we know nothing about it". In fact, in many cases, "missing" can be a source of useful insight, just as valid values can. For instance, if we failed to record the number of pages viewed, does this mean the user did not view anything?
Missingness can also reveal which attributes were collected together and could potentially be dependent on each other.

In [None]:
dat.apply(lambda x: np.sum(pd.isna(x)))

### Independence Test of Column Values vs Column Missingness

As a first step, we would like to know whether the values of some columns depend on the missingness of other columns. For categorical columns, we can apply the chi-square test to determine whether they are independent of each column's missingness. If a significant number of columns appear to depend on column missingness, there is good reason to include missingness indicators in models built on these columns.
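To make the test concrete, here is a minimal hand-rolled sketch of a Pearson chi-square test on a 2x2 table of a categorical value against a missingness indicator. The counts and the function name `chi2_independence_2x2` are hypothetical and are not taken from the dataset. Note that `scipy.stats.chi2_contingency` applies Yates' continuity correction by default to 2x2 tables, so its numbers would differ slightly from this uncorrected version.

```python
import math


def chi2_independence_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


# Hypothetical counts: rows = category (e.g. desktop, mobile),
# columns = (value present, value missing) for some other attribute.
dependent = [[90, 10], [60, 40]]    # missing rate differs by category
independent = [[80, 20], [40, 10]]  # missing rate is 20% in both rows

chi2_dep, p_dep = chi2_independence_2x2(dependent)
chi2_ind, p_ind = chi2_independence_2x2(independent)
print(chi2_dep, p_dep)
print(chi2_ind, p_ind)
```

In the first table the missing rate changes with the category, so the p-value is tiny; in the second the rate is identical across categories, so the statistic is zero and the p-value is 1.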

To do so, we will first have to find all the categorical columns as well as columns with missing values, and create a new dataframe including both data columns and missingness indicator columns.

In [None]:
cat_columns = [c for c in dat.columns if str(dat[c].dtype) == 'category']

In [None]:
missing_count = dat.apply(lambda x: np.sum(pd.isna(x)))
col_w_missing = list(missing_count[missing_count > 0].index)
col_w_missing

In [None]:
missing = dat.copy()
for col in col_w_missing:
    missing['miss_' + col] = pd.isnull(dat[col])

Since in our preprocessing step, we converted missing revenue values to 0, here we add the revenue missingness column back.

In [None]:
zero_revenue = missing['totals.transactionRevenue'] == 0
missing['miss_totals.transactionRevenue'] = zero_revenue
col_w_missing.append('totals.transactionRevenue')

Now we can perform a pairwise chi2 independence test for categorical columns vs column missingness:

In [None]:
ind_miss_p = np.full((len(cat_columns), len(col_w_missing)), np.nan)
for i, j in product(
        range(len(cat_columns)), range(len(col_w_missing))):
    # Use the builtin int: the np.int alias was removed in NumPy 1.24.
    chi2, p, dof, ex = stats.chi2_contingency(
        missing.groupby([cat_columns[i], 'miss_' + col_w_missing[j]
                         ]).size().unstack().fillna(0).astype(int))
    ind_miss_p[i, j] = p
    
miss_ind_test_output = pd.DataFrame(
    ind_miss_p,
    index=cat_columns,
    columns=['miss_' + c for c in col_w_missing])

In [None]:
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(data=miss_ind_test_output, ax=ax, linewidths=0.01)
ax.set_title("p-values of chi2 independence test of categorical values vs missingness")
plt.show()

As we see above, the values of many columns appear to depend on the missingness of many other columns (i.e. the chi2 statistic is large and the p-value is sufficiently small). Considering the whole dataset, this tells us that the values are not missing completely at random. There is potential information to be extracted from the missing values, or from the relationship between existing values and missing values. A missing value might indicate a specific state of the user or session that has an effect on the existing values, or an existing value might give away clues about what a missing value would have been had it not been missing.

### Independence Test of Column Missingness

Now that we know some of the columns are dependent on column missingness, what about the relationship between the missingness of different columns? Here we perform the same chi2 test, this time between the column missingness indicators themselves:

In [None]:
ind_miss2miss_p = np.full((len(col_w_missing), len(col_w_missing)), 0.)
for i, j in product(range(len(col_w_missing)), range(len(col_w_missing))):
    if i < j:
        chi2, p, dof, ex = stats.chi2_contingency(
            missing.groupby([
                'miss_' + col_w_missing[i], 'miss_' + col_w_missing[j]
            ]).size().unstack().fillna(0).astype(int))
        ind_miss2miss_p[i, j] = p
        ind_miss2miss_p[j, i] = ind_miss2miss_p[i, j]
    elif i == j:
        ind_miss2miss_p[i, j] = 0

miss2miss_p_output = pd.DataFrame(
    ind_miss2miss_p,
    index=['miss_' + c for c in col_w_missing],
    columns=['miss_' + c for c in col_w_missing])

g = sns.clustermap(
    data=miss2miss_p_output, figsize=(12, 12), linewidths=0.01)
g.ax_col_dendrogram.set_title("pairwise p-value of column missingness independence test")
plt.show()

As we see here, there are two major clusters in the pairwise p-value heatmap. Remember that a larger p-value (brighter colour) means weaker evidence against independence for that pair. The upper left corner has four columns that are dependent on each other but mostly independent of the other columns **(device.browser, trafficSource.source, totals.pageviews, trafficSource.medium)**, and the lower right corner has a large number of columns that are all dependent on each other.

The analysis above tells us whether there are relationships between the missingness of different columns, but not how they are related to each other. Apart from independence, we would also like to know whether the missingness of different columns is "in sync" with each other, whether they are good predictors of each other, or whether they at least offer much information about each other.

We first analyse whether some of the missingness indicators are "in sync", i.e. tend to occur together. This is roughly the same as asking whether one indicator is a good predictor of another. Here we will use the pairwise Matthews correlation coefficient (MCC), a common measure for evaluating binary classification. A coefficient of +1 represents perfect prediction, 0 is no better than random prediction, and −1 indicates total disagreement between prediction and observation. We also perform heatmap clustering to identify clusters of columns that are closely related to each other.
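As a rough illustration of what the coefficient measures, the sketch below computes the MCC of two boolean missingness indicators directly from the four confusion counts. The function name `mcc` and the toy indicators are hypothetical; returning 0 when a margin is empty is a convention chosen for this sketch.

```python
import math


def mcc(a, b):
    """Matthews correlation coefficient of two equal-length boolean sequences."""
    tp = sum(1 for x, y in zip(a, b) if x and y)
    tn = sum(1 for x, y in zip(a, b) if not x and not y)
    fp = sum(1 for x, y in zip(a, b) if not x and y)
    fn = sum(1 for x, y in zip(a, b) if x and not y)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention for this sketch: undefined cases (an empty margin) score 0.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical indicators: True means "value is missing in this row".
miss_a = [True, True, True, False, False, False]
miss_b = [True, True, False, False, False, True]

always_together = mcc(miss_a, miss_a)                   # identical pattern
never_together = mcc(miss_a, [not x for x in miss_a])   # exact opposite
mostly_together = mcc(miss_a, miss_b)                   # agree on 4 of 6 rows
print(always_together, never_together, mostly_together)
```

Columns that are always missing together score +1, columns that are never missing together score −1, and partial agreement lands in between.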

In [None]:
ind_miss2miss_mcc = np.full((len(col_w_missing), len(col_w_missing)), 0.)
for i, j in product(range(len(col_w_missing)), range(len(col_w_missing))):
    if i < j:
        ind_miss2miss_mcc[i, j] = matthews_corrcoef(
            missing['miss_' + col_w_missing[i]],
            missing['miss_' + col_w_missing[j]])
        ind_miss2miss_mcc[j, i] = ind_miss2miss_mcc[i, j]
    elif i == j:
        ind_miss2miss_mcc[i, j] = 1

miss2miss_mcc_output = pd.DataFrame(
    ind_miss2miss_mcc,
    index=['miss_' + c for c in col_w_missing],
    columns=['miss_' + c for c in col_w_missing])
miss2miss_mcc_output.index.name = 'predicted'
miss2miss_mcc_output.columns.name = 'input'

g = sns.clustermap(
    data=miss2miss_mcc_output, figsize=(12, 12), linewidths=0.01)
g.ax_col_dendrogram.set_title("pairwise MCC score of column missingness")
plt.show()

Here we can clearly see four clusters of columns and eight other relatively isolated columns.

In the top left we see seven columns related to ad content. Among them, the **trafficSource.adwordsClickInfo** attributes are more closely related to each other than to the rest (they are always missing at the same time).

Then we see a separation between the **geoNetwork** attributes, with **(country, continent, subContinent)** in one cluster, always appearing together, and **(metro, city, region)** in another cluster, mostly appearing together. This is a good indicator that there might be two separate sources of data for these attributes, which would explain the potential (and in fact confirmed) conflicts between the two clusters. We also notice that **networkDomain**, despite not being in either of these clusters, appears to score higher with the first cluster than with the second, indicating that it is less closely related to the second cluster.

Then we find that **medium** and **source** are related in missingness.

This is all useful, but sometimes we are not too concerned about whether one column's missingness is a good predictor of another. We just want to know if one column can tell us some information about another, even if it is very noisy information. We need some other measures that are more about information gain or "doing better than random", such as AUC, entropy, etc.
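To see why AUC suits this "better than random" question, note that a boolean indicator used as a score yields a ROC curve with a single interior point, so its AUC reduces to (1 + TPR − FPR) / 2. The sketch below is a hand-rolled illustration under that assumption; `binary_auc` and the toy data are hypothetical.

```python
def binary_auc(target, indicator):
    """AUC when a boolean indicator is used directly as the score."""
    pos = [s for t, s in zip(target, indicator) if t]
    neg = [s for t, s in zip(target, indicator) if not t]
    tpr = sum(pos) / len(pos)  # share of positives flagged by the indicator
    fpr = sum(neg) / len(neg)  # share of negatives flagged by the indicator
    return (1 + tpr - fpr) / 2


# Hypothetical data: the indicator flags every positive row,
# but also half of the negative rows.
target = [True, True] + [False] * 8
indicator = [True, True] + [True] * 4 + [False] * 4

auc_direct = binary_auc(target, indicator)
auc_flipped = binary_auc(target, [not s for s in indicator])
# Negating the indicator mirrors the AUC around 0.5, so taking the
# larger of the two gives a direction-agnostic measure of information.
informative_auc = max(auc_direct, auc_flipped)
print(auc_direct, auc_flipped, informative_auc)
```

An indicator like this would be a poor classifier on its own, yet its AUC of 0.75 shows that it carries real information about the target.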

In [None]:
ind_miss2miss_auc = np.full((len(col_w_missing), len(col_w_missing)), 0.)
for i, j in product(range(len(col_w_missing)), range(len(col_w_missing))):
    score1 = roc_auc_score(missing['miss_' + col_w_missing[i]],
                           missing['miss_' + col_w_missing[j]])
    score2 = roc_auc_score(missing['miss_' + col_w_missing[i]],
                           ~missing['miss_' + col_w_missing[j]])
    ind_miss2miss_auc[i, j] = max(score1, score2)
        
miss2miss_auc_output = pd.DataFrame(
    ind_miss2miss_auc,
    index=['miss_' + c for c in col_w_missing],
    columns=['miss_' + c for c in col_w_missing])
miss2miss_auc_output.index.name = 'predicted'
miss2miss_auc_output.columns.name = 'input'

g = sns.clustermap(data=miss2miss_auc_output, figsize=(12, 12), linewidths=0.01)
g.ax_col_dendrogram.set_title("pairwise AUC score of column missingness")
plt.show()

This pairwise AUC heatmap tells us whether the missingness of the attributes along the bottom gives us useful information about the attributes on the right. Some observations are expected, such as the two **geoNetwork** clusters and the ad content cluster both still being present. However, there is something unexpected as well: information about the missingness of **totals.transactionRevenue**, the attribute we are most interested in, leaks into several other columns! The missingness of **totals.bounces** appears to tell us a great deal about whether revenue exists, and the missingness of **geoNetwork.metro** and **totals.newVisits** also reveals a little. Let us plot the ROC curves for predicting **transactionRevenue** missingness from these columns:

In [None]:
cur_dict = dict()
cols = [c for c in col_w_missing if c != 'totals.transactionRevenue']
for c in cols:
    fpr_p, tpr_p, _ = roc_curve(~missing['miss_totals.transactionRevenue'],
                                missing['miss_' + c])
    fpr_n, tpr_n, _ = roc_curve(~missing['miss_totals.transactionRevenue'],
                                ~missing['miss_' + c])
    auc_p, auc_n = auc(fpr_p, tpr_p), auc(fpr_n, tpr_n)
    if auc_p >= 0.55:
        cur_dict[c] = [fpr_p, tpr_p, auc_p]
    elif auc_n >= 0.55:
        cur_dict[c] = [fpr_n, tpr_n, auc_n]

plt.figure(figsize=(12, 12))
lw = 2
for c, v in cur_dict.items():
    plt.plot(v[0], v[1], lw=lw, label="{0}  AUC={1}".format(c, v[2]))
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc="lower right")
plt.show()

Using **totals.bounces** missingness to catch cases with positive revenue works surprisingly well, reaching a 100% true positive rate at a false positive rate of only about 45%. As we see below, there are no cases where **bounces** and **transactionRevenue** are both present, so it appears that **transactionRevenue** can only be positive if **bounces** is missing. This alone will not make a good predictor, though, because far too many revenue = 0 cases would still be misclassified. Nevertheless, it is information we can use with near certainty, which is definitely better than no information.

In [None]:
missing.groupby(['miss_totals.transactionRevenue', 'miss_totals.bounces']).size().unstack().fillna(0)

What if we try to use *all* the column missingness to predict the missingness of revenue? Here we go, using a random forest:

In [None]:
X = missing.loc[:, [
    c for c in missing.columns if re.match(r'miss_', c) is not None
    and c != 'miss_totals.transactionRevenue'
]]
y = ~missing['miss_totals.transactionRevenue']

In [None]:
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=7777)

In [None]:
clf = RandomForestClassifier(n_estimators=100)

In [None]:
clf.fit(train_X, train_y)

In [None]:
preds = clf.predict(test_X)
probs = clf.predict_proba(test_X)

As the report below shows, the extreme imbalance between revenue = 0 and revenue > 0 cases unfortunately prevents the classifier from learning to predict the positive class (revenue > 0) at the default threshold.

In [None]:
print(classification_report(test_y, preds))

However, the classifier does learn to find likely suspects of the positive class, if we loosen the threshold a little (read: a lot). Yes, we end up with very poor precision, but remember that this is with just column missingness without touching the actual data, and we already found a way to exclude many rows that cannot be in the positive class.
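The effect of loosening the threshold can be sketched on a handful of made-up scores. The labels, the probabilities, and the helper `precision_recall` below are hypothetical and only illustrate the trade-off; they are not outputs of the classifier above.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall of the rule `score > threshold`."""
    preds = [s > threshold for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t and p)
    fp = sum(1 for t, p in zip(y_true, preds) if not t and p)
    fn = sum(1 for t, p in zip(y_true, preds) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical imbalanced sample: two positives among ten rows, with
# predicted probabilities that never come close to 0.5.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.0, 0.0, 0.0, 0.0, 0.0, 0.02, 0.03, 0.05, 0.04, 0.20]

default_pr = precision_recall(y_true, scores, 0.5)   # nothing is flagged
loose_pr = precision_recall(y_true, scores, 0.01)    # all positives caught
print(default_pr, loose_pr)
```

At the default threshold nothing is predicted positive, whereas a very low threshold recovers every positive at the cost of precision, while still ruling out the rows whose score is exactly zero.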

In [None]:
print(classification_report(test_y, probs[:, 1] > 0.01))

In [None]:
pd.DataFrame(
    confusion_matrix(test_y, probs[:, 1] > 0.01),
    columns=['pred_miss', 'pred_exist'],
    index=['miss', 'exist'])

With the ROC curve below, we can see that with the combined might of all column missingness, we can do better than just using the most informative column **bounces**.

In [None]:
fpr, tpr, _ = roc_curve(test_y, probs[:, 1])
auc_score = auc(fpr, tpr)

In [None]:
plt.figure(figsize=(12, 12))
lw = 2
c = 'totals.bounces'
v = cur_dict[c]
plt.plot(v[0], v[1], lw=lw, label="{0}  AUC={1}".format(c, v[2]))
plt.plot(fpr, tpr, lw=lw, label="{0}  AUC={1}".format('RF classifier', auc_score))
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc="lower right")
plt.title('ROC of RF Classifier Based on Missingness')
plt.show()