United states data scientists spend most of our time analysing relationships and patterns in our data. Most of our exploratory tools, however, focus on one-to-one relationships. But what if we want to have a more generalised view and find commonalities and patterns between certain groups of variables?

This colab includes:

*   an introduction to Canonical Correlation Analysis that lets us identify associations among groups of variables at a time.
*   a CCA tutorial in Python on how school environment affects students’ performance.



**So what is CCA?**

Suppose we want to find out how a school’s ambience affects its students’ academic success. On one hand, we have variables about the level of support, trust and collaboration in their learning environment. On the other hand, we have students’ academic records and test results.
CCA lets us explore associations between these two sets of variables as a whole, rather than considering them on an individual basis. Loosely speaking, we come up with a collective representation (a latent variable called canonical variate) for each of these variable sets in a way that the correlation between those variates is maximised.

**First Things First, Why is CCA Useful?**

Before diving into a pile of equations, let’s see why it’s worth the effort.
With CCA we can:

*   Find out whether two sets of variables are independent or, measure the magnitude of their relationship if there is one.
*   Interpret the nature of their relationship by assessing each variable’s contribution to the canonical variates (i.e. components) and find what dimensions are common between the two set.

*   Summarise relationships into a lesser number of statistics.
*   Conduct a dimensionality reduction that takes into account the existence of certain variable groups.






**Constructing Canonical Variates**

Given two sets of variables:

![alt text](https://miro.medium.com/max/295/1*LbAIENJJq5ZBWYJR5sJ-0Q.png)

We construct the first pair of Canonical Variates as linear combinations of the variables in each group:

![alt text](https://miro.medium.com/max/364/1*hmnJCgMlRZWKGhMNSAVgSg.png)

where the weights (a1, … ap), (b1, … , bq) are chosen in a way that the correlation between the two variates is maximised.

![alt text](https://miro.medium.com/max/786/1*b44a-CV8M68cxA9G2cXz-w.png)

Having our pair of covariates:



*   The Canonical Correlation Coefficient is the correlation between the canonical variates CVX and CVY.

To compute the second pair of covariates, we conduct the same process by adding one more constraint: each new variate should be orthogonal and uncorrelated to the previous ones.

![alt text](https://miro.medium.com/max/455/1*t83HfFb00yM9ZndaXblSzA.png)

We compute min(p,q) pairs in a similar fashion and end up with min(p,q) components ready to explore. (Note that the number of variables in each set doesn’t have to be the same.)

Just as in PCA, we project our data onto min(p,q) latent dimensions. However, not all of them might be informative and important. Let’s see the makeup and interpretation of our canonical variates in the example below.



**NYC School Data**

![alt text](https://miro.medium.com/max/600/1*dYLAMj51ObODJ3aQGzK5fg.png)

We’ll be using two variable groups from NYC schools dataset 

**Group 1: Environment Metrics**




1.   Group 1: Environment Metrics%
2.   Collaborative Teachers %
3.   Effective School Leadership %
4.   Family-Community Ties %
5.   Trust %


**Group 2: Performance Metrics**


1.   Average ELA Proficiency
2.   Average Math Proficiency

As for the tool, we’ll be using the pyrcca implementation.








We start with separating each of our variable groups in a single dataframe:


In [0]:
import pandas as pd
import numpy as np
df = pd.read_csv('2016 School Explorer.csv')
# choose relevant features
df = df[['Rigorous Instruction %',
      'Collaborative Teachers %',
     'Supportive Environment %',
       'Effective School Leadership %',
   'Strong Family-Community Ties %',
    'Trust %','Average ELA Proficiency',
       'Average Math Proficiency']]
# drop missing values
df = df.dropna()
# separate X and Y groups
X = df[['Rigorous Instruction %',
      'Collaborative Teachers %',
     'Supportive Environment %',
       'Effective School Leadership %',
   'Strong Family-Community Ties %',
    'Trust %'
      ]]
Y = df[['Average ELA Proficiency',
       'Average Math Proficiency']]

Convert group X into numeric variables and standardise the data:

In [0]:
for col in X.columns:
    X[col] = X[col].str.strip('%')
    X[col] = X[col].astype('int')
# Standardise the data
from sklearn.preprocessing import StandardScaler
sc = StandardScaler(with_mean=True, with_std=True)
X_sc = sc.fit_transform(X)
Y_sc = sc.fit_transform(Y)

After relevant preprocessing, we’re ready to apply CCA. Note that we set the regularisation parameter to 0 as regularised CCA is out of the scope of this post (we will get back to this in future posts though).

In [0]:
import pyrcca
nComponents = 2 # min(p,q) components
cca = pyrcca.CCA(kernelcca = False, reg = 0., numCC = nComponents,)
# train on data
cca.train([X_sc, Y_sc])
print('Canonical Correlation Per Component Pair:',cca.cancorrs)
print('% Shared Variance:',cca.cancorrs**2)

**The meaning behind the numbers**

Cannonical Correlations

In [0]:
>> Canonical Correlation Per Component Pair: [0.46059902 0.18447786]
>> % Shared Variance: [0.21215146 0.03403208]

For our two pairs of canonical variates, we have canonical correlations of 0.46 and 0.18 respectively. So, latent representations of school ambience and students’ performance do have a positive correlation of 0.46 and share 21 percent of variance.

Squared canonical correlation represents the shared variance by the latent representations of variable sets, and not the variance inferred from the sets of variables themselves.

**Canonical Weights**

In order to access the weights assigned to our standardised variables (a1, … , ap) and (b1, … , bq) we use cca.ws:

In [0]:
cca.ws
>> [array([[-0.00375779,  0.0078263 ],
           [ 0.00061439, -0.00357358],
           [-0.02054012, -0.0083491 ],
           [-0.01252477,  0.02976148],
           [ 0.00046503, -0.00905069],
           [ 0.01415084, -0.01264106]]), 
    array([[ 0.00632283,  0.05721601],
           [-0.02606459, -0.05132531]])]

Given these weights, the canonical variates of set Y, for example, were calculated with the following formula:

![alt text](https://miro.medium.com/max/392/1*GLnnhlzHSdTGb4fhLtkuBg.png)

where the weights can be interpreted as the coefficients in a linear regression model. We should take into account that it is not highly recommended to rely on weights when interpreting individual variable contribution to covariates.

Here’s why:



*   Weights are subject to variability from one sample to another.
*   Weights can be highly affected by multicollinearity (which is quite common      for same-context variable groups).

Relying on canonical loadings instead is a more common practice.



**Canonical Loadings**

Canonical loadings are nothing more than the correlation between the original variable and the canonical variate of that set. For example, to assess the contribution of Trust in school environment representation, we calculate the correlation between the variable Trust and the resulting variate for variable set X.

Calculating loadings for group Y in the first variate:



In [0]:
print('Loading for Math Score:',np.corrcoef(cca.comps[0][:,0],Y_sc[:,0])[0,1])
print('Loading for ELA Score:',np.corrcoef(cca.comps[0][:,0],Y_sc[:,1])[0,1])
>> Loading for Math Score: -0.4106778140971078
>> Loading for ELA Score: -0.4578120954218724

**Canonical Covariates**

Finally, we might want to access the covariate values directly, be it for visualisation or any other purpose.

To do so, we need:

In [0]:
# CVX 
cca.comps[0]
# First CV for X 
cca.comps[0][:,0]
# Second CV for X
cca.comps[0][:,1]
# CVY
cca.comps[1]
# First CV for Y
cca.comps[1][:,0]
# Second CV for Y 
cca.comps[1][:,1]

**That’s about it!**

Hope you find this helpful and use CCA more in your EDA routine.