# Statistical dimension reduction using Python and Statsmodels -- a case study using the NHANES data

This notebook illustrates several approaches to statistical dimension
reduction, focusing on the practical aspects of performing dimension
reduction in Python using the
[Statsmodels](http://www.statsmodels.org) library. We will also be
using the [Pandas](http://pandas.pydata.org) library for data
management, and the [Numpy](http://www.numpy.org) library for
numerical calculations.

Dimension reduction encompasses a number of techniques that have the
overarching goal of taking a collection of related variables and
converting them to a smaller number of summary variables that contain
most of their information.  Many statistical techniques can be viewed
as having this goal, including classical techniques such as Principal
Components Analysis and more modern techniques such as autoencoders.

To begin, we import several libraries that we will use below.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import statsmodels.api as sm
import numpy as np

As an initial illustration, we will use the body measures (BMX) and
blood pressure (BPX) data files from the 2015-2016 wave of NHANES.  The
data files can be downloaded in SAS xport format from [this
link](https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Examination&CycleBeginYear=2015).
Place the two data files (BPX_I.XPT and BMX_I.XPT) in your Jupyter
working directory to make them accessible to the code below.

We will start by working with six blood pressure variables -- three
repeated measures of systolic blood pressure and three repeated
measures of diastolic blood pressure.  These measures are strongly
correlated with each other, particularly the three measures of the
same blood pressure type.

In [None]:
df = pd.read_sas("BPX_I.XPT")

bpx_vars = ["BPXSY1", "BPXSY2", "BPXSY3", "BPXDI1", "BPXDI2", "BPXDI3"]
da = df.loc[:, bpx_vars + ["SEQN", "BPXPLS"]].dropna()
print(da.shape)
print(da.dtypes)

## Linear dimension reduction

The goal of linear dimension reduction is to define "variates", or
linear combinations of the input variables, that capture some
interesting property of the input variables.  A variate is defined
through its coefficients, e.g. the linear combination

$$
b_1\cdot{\rm BPXSY1} + b_2\cdot{\rm BPXSY2} + b_3\cdot{\rm BPXSY3} + b_4\cdot{\rm BPXDI1} + b_5\cdot{\rm BPXDI2} + b_6\cdot{\rm BPXDI3}
$$

is a variate, defined by coefficients (loadings) $b_1, \ldots, b_6$.

Different approaches to linear dimension reduction use different
methods for defining the variate loadings.  For example, Principal
Components Analysis (PCA) constructs variates that capture the
greatest possible fraction of the variance in the input data.

As noted above, the "loadings" are the coefficients that define how
each input variable relates to the reduced variables.  The "scores"
are the numbers that are obtained by taking the data for one unit of
analysis and combining its data using the loadings.  If there are $p$
input variables and $n$ observations, and we use a linear dimension
reduction technique to reduce to $q$ variables, then the loadings
matrix has shape $p \times q$, and the scores matrix has dimension
$n\times q$.
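
The shapes described above can be checked with a minimal sketch.  The data and the loadings here are synthetic placeholders (random numbers, not the NHANES values and not the result of any particular dimension reduction method); the names `X`, `B`, and `S` are chosen only for this illustration.

```python
import numpy as np

# Synthetic stand-in for the data: n observations of p variables
rng = np.random.default_rng(0)
n, p, q = 100, 6, 2
X = rng.normal(size=(n, p))
X -= X.mean(0)

# Hypothetical loadings: any p x q matrix defines q variates
B = rng.normal(size=(p, q))

# Each row of the scores matrix holds the q variate values for one observation
S = X @ B
print(B.shape, S.shape)
```

Each row of `S` is obtained by combining one observation's `p` values using the columns of `B`, which is what the term "scores" refers to.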

## Covariance and correlation matrices

Classical dimension reduction techniques often start by considering
the covariance matrix of the variables of interest.  This matrix
contains a lot of information about the pairwise relationships among
the variables.  Recall that if we have three variables, X, Y, and Z,
then the covariance matrix among these variables is a $3\times 3$
matrix with the following entries:

$$
\left(\begin{array}{ccc}
{\rm Var}(X)     &{\rm Cov}(X, Y) & {\rm Cov}(X, Z)\\
{\rm Cov}(Y, X)  &{\rm Var}(Y)    & {\rm Cov}(Y, Z)\\
{\rm Cov}(Z, X)  &{\rm Cov}(Z, Y) & {\rm Var}(Z)\\
\end{array}\right).
$$

If we standardize each variable to have unit variance, the covariance
matrix becomes the correlation matrix, which describes the pairwise
relationships among the variables in a unit-free way:

$$
\left(\begin{array}{ccc}
1     &{\rm Cor}(X, Y) & {\rm Cor}(X, Z)\\
{\rm Cor}(Y, X)  &1    & {\rm Cor}(Y, Z)\\
{\rm Cor}(Z, X)  &{\rm Cor}(Z, Y) & 1\\
\end{array}\right).
$$
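
The relationship between the two matrices can be verified numerically.  The following sketch uses synthetic data (three correlated variables on deliberately different scales, not the NHANES variables) and shows two equivalent routes from the covariance matrix to the correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
# Three correlated synthetic variables on very different scales
z = rng.normal(size=(500, 1))
X = np.hstack([z + rng.normal(size=(500, 1)) for _ in range(3)]) * [1.0, 10.0, 100.0]

C = np.cov(X.T)         # depends on the units of each variable
R = np.corrcoef(X.T)    # unit-free

# Route 1: standardize each column, then take the covariance
Xs = (X - X.mean(0)) / X.std(0, ddof=1)
Cs = np.cov(Xs.T)

# Route 2: rescale the covariance matrix by the standard deviations
sd = np.sqrt(np.diag(C))
R2 = C / np.outer(sd, sd)

print(np.max(np.abs(Cs - R)), np.max(np.abs(R2 - R)))
```

Both differences should be at the level of floating point rounding error.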

In the next cell, we calculate the $6\times 6$ correlation matrix of
the blood pressure measures and plot it as a heatmap.  We can see that
there are very strong correlations between any two systolic blood
pressure measures, and strong (but slightly weaker) correlations
between any two diastolic blood pressure measures.  A systolic and a
diastolic measure are also positively related, but with a weaker
correlation coefficient of around 0.5.

In [None]:
bpx_cor = np.corrcoef(da.loc[:, bpx_vars].T)
plt.imshow(bpx_cor, vmin=0, vmax=1)
_ = plt.colorbar()

## The spectral decomposition

Many dimension reduction techniques make heavy use of an important
technique from linear algebra known as the "spectral decomposition",
or "eigen-decomposition".  The basic idea is to take a symmetric,
positive semi-definite matrix $C$ (e.g. a covariance matrix), and
decompose it in the form $C = QDQ^\prime$, where $Q$ is an orthogonal
matrix with the same shape as $C$ (so $Q^\prime Q = I$), and $D$ is a
diagonal matrix with non-negative entries along the main diagonal,
conventionally arranged in decreasing order.

In the next code cell, we calculate the eigen-decomposition of the
correlation matrix among the six blood pressure measures, and verify
that it satisfies the expected mathematical properties.

In [None]:
eiv, eig = np.linalg.eig(bpx_cor)

# Verify that bpx_cor = eig * diag(eiv) * eig'
di = bpx_cor - np.dot(eig, np.dot(np.diag(eiv), eig.T))
print(np.max(np.abs(di)))

# Verify that eig is orthogonal
di = np.dot(eig, eig.T)
print(np.max(np.abs(di - np.eye(6))))

# The eigenvalues are not sorted, so we sort them
ii = np.argsort(eiv)[::-1]
eiv = eiv[ii]
eig = eig[:, ii]

# The eigenvalues of a covariance matrix must be non-negative
print(eiv)
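
As an aside, Numpy also provides `np.linalg.eigh`, which is specialized for symmetric matrices; it returns real eigenvalues in ascending order.  The sketch below applies it to a small made-up correlation matrix (not the NHANES matrix) and checks the same properties as above.  Eigenvector signs may differ between `eig` and `eigh`, so this is offered as an alternative rather than a drop-in replacement for the cell above.

```python
import numpy as np

# Small symmetric positive definite matrix used only for illustration
C = np.array([[1.0, 0.8, 0.3],
              [0.8, 1.0, 0.4],
              [0.3, 0.4, 1.0]])

# eigh returns eigenvalues in ascending order for a symmetric matrix
vals, vecs = np.linalg.eigh(C)

# Reverse to get decreasing eigenvalues, matching C = Q D Q'
vals = vals[::-1]
vecs = vecs[:, ::-1]

# Reconstruction error and departure from orthogonality
print(np.max(np.abs(C - vecs @ np.diag(vals) @ vecs.T)))
print(np.max(np.abs(vecs.T @ vecs - np.eye(3))))
```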

## Principal components analysis

The columns of $Q$ define linear combinations of the input variables
that capture the greatest possible fraction of their variation.  The
first column of $Q$ is the "dominant principal component" and defines
the single best linear combination for preserving the information in
the data.  After this component is accounted for, the second column of
$Q$ (the "second principal component") defines the most informative
variate that is uncorrelated with the dominant variate.  The
subsequent principal components are constructed similarly.

PCA can be viewed from several different perspectives.  One important
perspective is that we are "encoding" the data to a lower dimension in
such a way that the greatest possible fraction of information in the
input variables is retained in the reduced variables.  If $P$ is a
$p\times p$ projection matrix of rank $q$ that carries out the
dimension reduction, then the optimal $P$ matrix for PCA minimizes
$E[\|y - Py\|^2]$, where $y$ is the centered data vector.

As a special case, for the first variate (dominant principal
component), this optimization reduces to minimizing $E[\|y - \langle b,
y \rangle\cdot b\|^2]$ over all unit vectors $b$.
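
The reconstruction-error view can be illustrated with a short sketch on synthetic data (not the NHANES data).  It builds the rank-$q$ projection from the leading eigenvectors of the sample covariance matrix, and compares its mean squared reconstruction error to that of an arbitrary random projection of the same rank.  The error of the PCA projection should equal the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 2000, 5, 2

# Synthetic centered data with correlated columns
Y = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
Y -= Y.mean(0)

C = np.cov(Y.T)
vals, vecs = np.linalg.eigh(C)
vals, vecs = vals[::-1], vecs[:, ::-1]

# Rank-q projection onto the span of the leading q eigenvectors
Qq = vecs[:, :q]
P = Qq @ Qq.T

# Mean squared reconstruction error, using the same n-1 divisor as np.cov
err_pca = np.sum((Y - Y @ P) ** 2) / (n - 1)

# A random rank-q orthogonal projection, for comparison
R, _ = np.linalg.qr(rng.normal(size=(p, q)))
err_rand = np.sum((Y - Y @ (R @ R.T)) ** 2) / (n - 1)

print(err_pca, vals[q:].sum(), err_rand)
```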

For the NHANES blood pressure data, the dominant three principal
components are

In [None]:
print(eig[:, 0:3])

Note that principal components are only defined up to sign, so it
would be equally correct to say that the principal components are

In [None]:
print(-eig[:, 0:3])

The PCs can be interpreted as follows.  The dominant PC is
essentially an equally-weighted positive combination of all six blood
pressure measures.  The systolic measures receive slightly higher
weight in the PC than the diastolic measures, which is likely because
the diastolic measures are less correlated with each other.  The
second PC reflects the difference between the two types of blood
pressure.  The third PC reflects an individual-specific tendency for
the three repeated measures to rise or fall together, with much
greater amplitude for the trend in the diastolic values.

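The principal components are often considered in light of the variance
that each component explains.  Since the eigenvalues of the
correlation matrix are the variances of the PC scores, the fraction of
variance explained by each component is its eigenvalue divided by the
sum of all the eigenvalues, as calculated below.

In [None]:
print(eiv / np.sum(eiv))

The dominant principal component explains most of the variance in the
six measures, the next component explains most of the remainder, and
the remaining components contribute very little.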
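
To see why the eigenvalues themselves (and not some transformation of them) measure explained variance, the following sketch uses synthetic data consisting of four noisy copies of one shared signal.  It checks that the sample variance of each score column matches the corresponding eigenvalue of the correlation matrix, and that the eigenvalues sum to the number of standardized variables.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic data: four noisy copies of one shared signal
z = rng.normal(size=(1000, 1))
X = z + 0.5 * rng.normal(size=(1000, 4))

# Standardize, then decompose the correlation matrix
Xs = (X - X.mean(0)) / X.std(0, ddof=1)
vals, vecs = np.linalg.eigh(np.corrcoef(Xs.T))
vals, vecs = vals[::-1], vecs[:, ::-1]

scores = Xs @ vecs

# The variance of each score column equals the corresponding eigenvalue
print(scores.var(0, ddof=1))
print(vals)

# Share of variance explained: each eigenvalue over the sum of eigenvalues
pve = vals / vals.sum()
print(pve)
```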

### The PC scores

The PC loadings describe the variables; to describe the observations
we use the scores.  Since we carried out our PCA above using the
correlation matrix, which corresponds to standardized data, we first
standardize the blood pressure data, then calculate the scores using
the loading vectors obtained from the spectral decomposition.

In [None]:
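# Copy so that the in-place operations do not modify da
ds = da.loc[:, bpx_vars].copy()
ds -= ds.mean(0)
# Scale by the standard deviations of the six blood pressure columns only
ds /= ds.std(0)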

scores = np.dot(ds, eig)

By construction, these scores are uncorrelated and have mean zero:

In [None]:
print(scores.mean(0))
print(np.around(100*np.cov(scores.T)))

It is common to plot scores for different components against each
other in a scatterplot.  By construction, this scatterplot is centered
around the origin, and the two coordinates are uncorrelated.  However,
the scores need not follow an elliptical distribution.  For example,
here the distribution of the second PC scores is substantially skewed.

In [None]:
plt.grid(True)
plt.plot(scores[:, 0], scores[:, 1], 'o', alpha=0.4)
plt.xlabel("PC 1 scores")
_ = plt.ylabel("PC 2 scores")

One way to understand what the PC reduction means is to plot the PC
scores against an auxiliary variable that was not part of the PC
reduction.  Here we will use "BPXPLS", which is a subject's pulse
rate.

We see that there is a weak tendency for people with higher pulse
rates to have greater PC1 scores.

In [None]:
from statsmodels.nonparametric.smoothers_lowess import lowess
yx = lowess(scores[:,0], da.loc[:, "BPXPLS"], frac=0.3)
plt.grid(True)
plt.plot(da.loc[:, "BPXPLS"], scores[:, 0], 'o', alpha=0.4)
plt.plot(yx[:, 0], yx[:, 1], lw=4, color='orange')
plt.xlabel("Pulse rate")
plt.ylabel("PC1 score")

print(np.corrcoef(da.loc[:, "BPXPLS"], scores[:, 0]))

There isn't much of a relationship between PC2 and pulse rate.

In [None]:
yx = lowess(scores[:,1], da.loc[:, "BPXPLS"], frac=0.3)
plt.grid(True)
plt.plot(da.loc[:, "BPXPLS"], scores[:, 1], 'o', alpha=0.4)
plt.plot(yx[:, 0], yx[:, 1], lw=4, color='orange')
plt.xlabel("Pulse rate")
plt.ylabel("PC2 score")

print(np.corrcoef(da.loc[:, "BPXPLS"], scores[:, 1]))

Another way to understand how this two-dimensional data reduction
relates to the original data is to select points in different regions
of the score scatterplot, and then plot their data values.  Note that
the data values plotted below are centered and standardized, so they
are scaled deviations from the mean.

In [None]:
plt.grid(True)
plt.plot(scores[:, 0], scores[:, 1], 'o', alpha=0.4)
plt.xlabel("PC 1 scores")
plt.ylabel("PC 2 scores")

i0 = (scores[:, 0] < -4) & (np.abs(scores[:, 1]) < 1)
plt.plot(scores[i0, 0], scores[i0, 1], 'o', color='purple')

i1 = (scores[:, 0] > 4) & (np.abs(scores[:, 1]) < 1)
plt.plot(scores[i1, 0], scores[i1, 1], 'o', color='orange')

i2 = (scores[:, 1] > 3) & (np.abs(scores[:, 0]) < 1)
plt.plot(scores[i2, 0], scores[i2, 1], 'o', color='lime')

i3 = (scores[:, 1] < -1.5) & (np.abs(scores[:, 0]) < 1)
plt.plot(scores[i3, 0], scores[i3, 1], 'o', color='yellow')

We see that the dominant PC axis differentiates people with higher
than average systolic and diastolic blood pressure (left, purple)
from people with lower than average systolic and diastolic blood
pressure (right, orange).  The second PC distinguishes people who have
higher than average systolic blood pressure but lower than average
diastolic blood pressure (top, green) from people who have higher
than average diastolic blood pressure and lower than average systolic
blood pressure (bottom, yellow).

In [None]:
plt.clf()
plt.grid(True)

for i in np.flatnonzero(i0):
    plt.plot(ds.iloc[i, :], color='purple')

for i in np.flatnonzero(i1):
    plt.plot(ds.iloc[i, :], color='orange')

for i in np.flatnonzero(i2):
    plt.plot(ds.iloc[i, :], color='lime')

for i in np.flatnonzero(i3):
    plt.plot(ds.iloc[i, :], color='yellow')

### Role of marginal means and variances in PCA

PCA is usually conducted using the correlation matrix, as above, but
sometimes the covariance matrix is used instead.  In both cases, the
mean is removed before calculating the correlations or covariances.
While the mean and marginal variances are mostly irrelevant for
dimension reduction, to get a complete picture of the data it is a
good idea to briefly inspect them:

In [None]:
print(da.mean(0))
print(da.std(0))
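
The choice between the covariance and correlation matrix matters most when the variables have very different marginal variances.  The sketch below uses two synthetic correlated variables, one of which is rescaled by a factor of 100 (an arbitrary choice for illustration), and compares the dominant loading vector under the two choices.  The helper `dominant_loading` is defined here only for this example.

```python
import numpy as np

rng = np.random.default_rng(6)
z = rng.normal(size=(1000, 1))
# Two correlated variables; the second is then put on a scale 100 times larger
X = np.hstack([z + rng.normal(size=(1000, 1)), z + rng.normal(size=(1000, 1))])
X[:, 1] *= 100

def dominant_loading(M):
    """Unit-length dominant eigenvector of a symmetric matrix, with its
    largest-magnitude entry made positive."""
    vals, vecs = np.linalg.eigh(M)
    v = vecs[:, -1]  # eigh sorts ascending, so the last column is dominant
    return v * np.sign(v[np.abs(v).argmax()])

b_cov = dominant_loading(np.cov(X.T))
b_cor = dominant_loading(np.corrcoef(X.T))
print(b_cov)  # dominated by the high-variance variable
print(b_cor)  # equal weights on both variables
```

With the covariance matrix, the dominant component is driven almost entirely by the variable with the larger variance; with the correlation matrix, both variables are weighted equally.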

### PCA in Statsmodels

It's not difficult to conduct a PCA manually using Numpy.  But many
people will prefer to use the higher-level PCA function from
Statsmodels.  Below we illustrate how identical results to what we
found above can be obtained in this way.

In [None]:
rslt = sm.PCA(da.loc[:, bpx_vars], ncomp=3)

# coeff gives the loadings
print(rslt.coeff.iloc[0:3, :].T)

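print(rslt.eigenvals)
# rsquare is the cumulative fraction of variance explained by the first
# 0, 1, ..., ncomp components; its differences give the per-component shares
print(np.diff(rslt.rsquare))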

As noted above, the sign of a principal component is arbitrary.  More
generally, the loading vectors are only meaningful up to a scaling
constant, i.e. if $b$ is the dominant PC, then $k\cdot b$ (for real,
nonzero $k$) can also be considered to be the dominant PC.  There are
different conventions about how to scale the PC loading vectors.  If
we scale the loadings obtained from the PCA function to have unit
length, they agree (up to sign) with the results obtained above using
the eigen-decomposition.

In [None]:
c = rslt.coeff.iloc[0:3, :].T
c = c / np.sqrt((c**2).sum(0))
print(c)
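
When comparing loadings from two different implementations, it can help to fix both the scale and the sign before comparing.  The helper below is a hypothetical utility written for this illustration (it is not part of Statsmodels or Numpy); it scales each column to unit length and flips signs so that the largest-magnitude entry in each column is positive.

```python
import numpy as np

def normalize_loadings(B):
    """Scale each column to unit length and flip signs so that the
    largest-magnitude entry of each column is positive."""
    B = np.asarray(B, dtype=float)
    B = B / np.sqrt((B ** 2).sum(0))
    idx = np.abs(B).argmax(0)
    signs = np.sign(B[idx, np.arange(B.shape[1])])
    return B * signs

# Two made-up loading matrices that differ only by column scale and sign
A = np.array([[0.6, 0.8], [0.8, -0.6]])
B = A * np.array([-3.0, 2.0])

print(np.allclose(normalize_loadings(A), normalize_loadings(B)))
```

This sign convention is one of several reasonable choices; any rule that picks a consistent orientation for each column would serve the same purpose.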

Next we obtain the scores from the fitted Statsmodels PCA results
object, and confirm that they are identical (up to scaling) with the
scores that we obtained above.

In [None]:
scores_sm = rslt.scores

sc = np.hstack((scores_sm.iloc[:, 0:3], scores[:, 0:3]))
print(np.around(100*np.corrcoef(sc.T)))

plt.grid(True)
plt.plot(scores_sm.iloc[:, 0], scores_sm.iloc[:, 1], 'o', color='purple', alpha=0.4)
plt.xlabel("PC 1")
plt.ylabel("PC 2")

## PCA case study: NHANES body dimensions

Next we will set up another example in which PCA can be used
productively.  We provide code for some of the initial steps, and
leave the remainder as an exercise.  Here we consider a set of seven
"anthropometric" variables based on various body dimensions.  These
are substantially correlated, but they are all distinct measures:
unlike above, we do not have any repeated measures of the same
quantity.  Therefore these measures are somewhat less correlated than
the blood pressure values considered above.

In [None]:
da = pd.read_sas("BMX_I.XPT")

bmx_vars = ["BMXWT", "BMXHT", "BMXBMI", "BMXLEG", "BMXARML", "BMXARMC", "BMXWAIST"]

bmx = da.loc[:, bmx_vars]
print(pd.isnull(bmx).mean(0))
bmx = bmx.dropna()
print(bmx.shape)

Here is the $7\times 7$ correlation matrix of the anthropometric
variables:

In [None]:
bmx_cor = np.corrcoef(bmx.T)
plt.imshow(bmx_cor, vmin=0, vmax=1)
_ = plt.colorbar()

Even though these are measures of seven distinct quantities, they are
still quite strongly related -- most of the variance can be captured
through one component, and much of the remaining variance is captured
by the second component.

In [None]:
rslt = sm.PCA(bmx, ncomp=3)
# rsquare is the cumulative fraction of variance explained by the first
# 0, 1, ..., ncomp components; its differences give the per-component shares
print(np.diff(rslt.rsquare))

All seven of the measures here are numerically larger for bigger
people.  Based on the loadings, the dominant principal component
captures overall body size.  The second PC captures variation that is
related to "stockiness".  A person with a greater score on PC 2 tends
to have greater weight, BMI, arm circumference and waist
circumference, and tends to have lesser height, leg length, and arm
length.

In [None]:
rslt.coeff.iloc[0:2, :].T

## Factor analysis

Factor analysis is a technique for dimension reduction that uses a probability
model to aid in understanding how well the reduced variables approximate the
input variables.  Specifically, factor analysis introduces
independent random noise between the factors and the data.

If $y$ is the observed
vector of $p$ variables for one case, we can write $z$ for the $q$-factor approximation
to $y$.  The residuals $y - z$ are treated as being independent random values with
mean zero.  Each element of $y - z$ has its own variance, called the "uniqueness".
Factor analysis is usually performed using mean-centered, standardized data, so each
observed variable has unit variance and the uniquenesses fall between 0 and 1.

Factor analysis uses the terms "score" and "loading" in ways that are analogous to
their use in Principal Components Analysis.
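
The structure of the factor model can be illustrated by simulation.  The sketch below uses made-up loadings (the matrix `L` is hypothetical, not an estimate from NHANES) and sets the uniquenesses so that every simulated variable has unit variance.  It then checks that the sample correlation matrix is close to the model-implied matrix, loadings times their transpose plus the diagonal matrix of uniquenesses.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, q = 5000, 5, 2

# Hypothetical loadings (p x q); uniquenesses chosen so each variable has unit variance
L = np.array([[0.8, 0.0],
              [0.7, 0.2],
              [0.6, 0.4],
              [0.1, 0.8],
              [0.0, 0.7]])
uniq = 1 - (L ** 2).sum(1)

# Simulate y = z + e, where z = L f is the factor approximation, f are
# independent factors and e is independent noise with variances uniq
f = rng.normal(size=(n, q))
e = rng.normal(size=(n, p)) * np.sqrt(uniq)
Y = f @ L.T + e

# Model-implied correlation matrix versus the sample version
implied = L @ L.T + np.diag(uniq)
sample = np.corrcoef(Y.T)
print(uniq)
print(np.max(np.abs(implied - sample)))
```

The discrepancy printed at the end reflects sampling variation only and should shrink as `n` grows.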

In [None]:
from statsmodels.multivariate.factor import Factor
fa = Factor(bmx, n_factor=3, method='ml').fit()
print(fa.summary())

Factor analysis models may have a lot of parameters, and the parameters
may not be well-identified.  Inspecting the gradient (score vector) at
the approximate MLE can indicate how close we are to a well-defined
MLE, since the gradient should be near zero at a maximum.

In [None]:
print(fa.model.score([fa.loadings, fa.uniqueness]))

## Canonical Correlation Analysis (CCA)

In [None]:
df = pd.read_sas("BPX_I.XPT")
bpx_vars = ["BPXSY1", "BPXSY2", "BPXSY3", "BPXDI1", "BPXDI2", "BPXDI3"]
bpx = df.loc[:, bpx_vars + ["SEQN"]].dropna()

df = pd.read_sas("BMX_I.XPT")
bmx_vars = ["BMXWT", "BMXHT", "BMXBMI", "BMXLEG", "BMXARML", "BMXARMC", "BMXWAIST"]
bmx = df.loc[:, bmx_vars + ["SEQN"]].dropna()

bpmx = pd.merge(bpx, bmx, left_on="SEQN", right_on="SEQN")
bpx = bpmx.loc[:, bpx_vars]
bmx = bpmx.loc[:, bmx_vars]

from statsmodels.multivariate.cancorr import CanCorr

ca = CanCorr(bpx, bmx)

print(ca.corr_test())

print(ca.x_cancoef[:, 0:3])
print(ca.y_cancoef[:, 0:3])
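
Canonical correlation analysis looks for pairs of variates, one from each of two sets of variables, that are as strongly correlated as possible.  The following is a minimal Numpy sketch of one standard way to compute canonical correlations, using QR decompositions followed by a singular value decomposition.  It uses synthetic data, and it is not necessarily the algorithm that `CanCorr` uses internally.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
# Synthetic data: one shared latent signal links the two sets of variables
z = rng.normal(size=(n, 1))
X = np.hstack([z + rng.normal(size=(n, 1)), rng.normal(size=(n, 2))])
Y = np.hstack([z + rng.normal(size=(n, 1)), rng.normal(size=(n, 1))])

Xc = X - X.mean(0)
Yc = Y - Y.mean(0)

# Orthonormal bases for the column spaces of the two centered data sets
Qx, Rx = np.linalg.qr(Xc)
Qy, Ry = np.linalg.qr(Yc)

# Singular values of Qx'Qy are the canonical correlations
U, s, Vt = np.linalg.svd(Qx.T @ Qy)

# Coefficients for the first pair of canonical variates
a = np.linalg.solve(Rx, U[:, 0])
b = np.linalg.solve(Ry, Vt[0, :])

# The correlation between the first pair of variates matches s[0]
r = np.corrcoef(Xc @ a, Yc @ b)[0, 1]
print(s, r)
```

In this simulation only one latent signal is shared between the two sets, so the first canonical correlation should be clearly larger than the second.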
