# Harmonisation strategies
Now that we have seen how different scanners can produce different images of the same individual, we need to figure out how to account for this in the data.

In [1]:
import os
import nest_asyncio
nest_asyncio.apply()
import numpy as np
import pandas as pd
import datalad.api as dl

Let's grab the dataset that has all the imaging volumes for the study and read it into memory.

In [2]:
dl_source='/Users/davecash/Data/IDEAS/sample'
sample=dl.clone(dl_source,path='/tmp/sample',description='Cloned sample dataset for import')
sample.update(merge=True)
in_file=sample.get('./GENFI_DEMON_SPREADSHEET.xlsx')

[INFO] Fetching updates for Dataset(/tmp/sample) 
[INFO] Start enumerating objects 
[INFO] Start counting objects 
[INFO] Start compressing objects 


In [3]:
df_xsec=pd.read_excel(in_file[0]['path'])
df_xsec['Sex']=pd.Categorical(df_xsec['Sex'],categories=[0,1])
df_xsec['Sex']=df_xsec['Sex'].cat.rename_categories(['Female','Male'])
df_xsec['Site'] = pd.Categorical(df_xsec['Site'],categories=np.arange(23))
df_xsec['Group']=pd.Categorical(df_xsec['Group'],categories=[0,1,2])
df_xsec['Group']=df_xsec['Group'].cat.rename_categories(['Non-carrier','Presymptomatic','Symptomatic'])

A quick cross tab to show the different kinds of scanners used at each site.

In [4]:
pd.crosstab(df_xsec['Site'],df_xsec['Scanner'])

| Site | GE 1.5T | GE 3T | Philips 3T | Siemens 1.5T | Siemens Prisma 3T | Siemens Skyra 3T | Siemens Trio 3T |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 8 | 20 |
| 1 | 0 | 0 | 45 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 3 | 0 | 0 |
| 3 | 0 | 0 | 143 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 3 | 0 | 0 | 0 | 0 |
| 5 | 0 | 25 | 0 | 0 | 6 | 0 | 0 |
| 6 | 0 | 0 | 18 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 10 | 0 | 12 |
| 8 | 0 | 0 | 0 | 33 | 0 | 36 | 0 |
| 9 | 8 | 0 | 1 | 0 | 0 | 0 | 0 |


## Analysis
Now that we have the data, we have three groups. Let's take a look at a couple of structures and see what differences there are.
### Simple regression no sites
We will compare the groups using sex, age, and TIV only as covariates.

In [5]:
df_xsec.groupby("Group").agg(
    {
        "Age": ["mean", "std", "min", "max" ],
        "EYO": ["mean", "std", "min", "max" ],
        "TIV": ["mean", "std", "min", "max" ]
    }
).style.format('{0:,.2f}')

| Group | Age mean | Age std | Age min | Age max | EYO mean | EYO std | EYO min | EYO max | TIV mean | TIV std | TIV min | TIV max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Non-carrier | 46.95 | 13.7 | 18.6 | 85.7 | -12.78 | 14.31 | -50.0 | 27.6 | 1420.48 | 142.47 | 1100.7 | 2051.7 |
| Presymptomatic | 44.78 | 11.91 | 20.1 | 75.5 | -13.81 | 11.48 | -47.4 | 15.7 | 1412.19 | 137.71 | 1074.7 | 1811.5 |
| Symptomatic | 63.01 | 8.17 | 32.9 | 78.7 | 3.43 | 6.91 | -26.1 | 21.7 | 1442.3 | 157.82 | 1129.5 | 1848.9 |


### Checking for site differences
Are there differences between sites or scanners in this data? Let's run the same analysis again, including them as a covariate. For this, we are going to use the package statsmodels.

In [6]:
import statsmodels.api as stats
import statsmodels.formula.api as statsfx

If you are familiar with R, the regression equations in statsmodels are set up in a similar fashion. The first model will take into account age, sex, and TIV, but *not* site. We will center age and TIV so that the intercept makes more sense.

In [16]:
df_xsec=df_xsec.dropna(subset=["Group","Age","Sex","TIV"])
df_xsec["Age_centered"]=df_xsec['Age']-df_xsec['Age'].mean()
df_xsec["TIV_centered"]=df_xsec['TIV']-df_xsec['TIV'].mean()
results= statsfx.ols('Insula_volume ~ Group + Age_centered + Sex + TIV_centered',data=df_xsec).fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:          Insula_volume   R-squared:                       0.711
Model:                            OLS   Adj. R-squared:                  0.709
Method:                 Least Squares   F-statistic:                     303.5
Date:                Wed, 08 Sep 2021   Prob (F-statistic):          1.72e-163
Time:                        11:27:28   Log-Likelihood:                -5109.0
No. Observations:                 622   AIC:                         1.023e+04
Df Residuals:                     616   BIC:                         1.026e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                1

### Interpretation
Let's look at the results from the regression output:
* The average insular volume for a female non-carrier of average age (~48 years in this cohort) and average TIV (1420 ml) is 10730 mm$^3$.
* Female presymptomatic carriers of a similar age and TIV have 233 mm$^3$ less volume than the non-carriers.
* Female symptomatic carriers of a similar age and TIV have 2147 mm$^3$ less. Both differences are significantly different from zero.
* While there is no significant difference between the sexes, there are clear age- and TIV-related differences: every additional 1000 mm$^3$ (or 1 ml) of TIV corresponds to an insula that is 6 mm$^3$ bigger, and every additional year of age corresponds to a loss of 41 mm$^3$.
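
The intercept can be read this way only because age and TIV were centered. The sketch below illustrates the effect on synthetic data (the numbers are made up and are not from the study): centering leaves the slope untouched and moves the intercept to the predicted value at the average age.

```python
import numpy as np

# Synthetic data for illustration only; these are not the study values
rng = np.random.default_rng(42)
age = rng.uniform(20, 80, size=300)                       # ages in years
volume = 12000 - 40 * age + rng.normal(0, 300, size=300)  # volumes in mm^3

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    design = np.column_stack([np.ones_like(x), x])
    (intercept, slope), *_ = np.linalg.lstsq(design, y, rcond=None)
    return intercept, slope

raw_intercept, raw_slope = fit_line(age, volume)
cen_intercept, cen_slope = fit_line(age - age.mean(), volume)

# Same slope in both fits; only the meaning of the intercept changes
print(f"raw:      intercept={raw_intercept:.1f} (volume at age 0), slope={raw_slope:.2f}")
print(f"centered: intercept={cen_intercept:.1f} (volume at mean age), slope={cen_slope:.2f}")
```

With an uncentered covariate the intercept describes a person aged 0, which is far outside the observed range and not useful to report.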

Before we go on, let's see whether we can show differences between individuals by site. We will plot the predicted values according to 
## Adding site as a covariate
Now we are going to add the site variable as a covariate to the equation.

In [19]:
site_results= statsfx.ols('Insula_volume ~ Group + Age_centered + Sex + TIV_centered + Site',data=df_xsec).fit()
print(site_results.summary())

                            OLS Regression Results                            
Dep. Variable:          Insula_volume   R-squared:                       0.735
Model:                            OLS   Adj. R-squared:                  0.723
Method:                 Least Squares   F-statistic:                     60.98
Date:                Wed, 08 Sep 2021   Prob (F-statistic):          7.83e-152
Time:                        13:21:53   Log-Likelihood:                -5082.4
No. Observations:                 622   AIC:                         1.022e+04
Df Residuals:                     594   BIC:                         1.034e+04
Df Model:                          27                                         
Covariance Type:            nonrobust                                         
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                1

### Interpretation
Overall the estimates change only slightly:
* The difference for presymptomatic carriers is down from 223 to 208.
* The difference for symptomatic carriers is down from 2147 to 2087.
* The age effect is slightly greater per year, while TIV is unchanged and Sex is still not significant.
* A few sites (Sites 3, 5, and 8) are significantly different from Site 0. Notice that Site 0 is all Siemens 3T, while Site 3 is Philips 3T, Site 5 is mostly GE 3T (with a few Siemens Prisma 3T scans), and Site 8 is a mix of Siemens 1.5T and Skyra 3T, so these sites used different scanners.
* Since these are nested models, their log-likelihoods can be compared directly: the value rises from -5109.0 to -5082.4, which suggests that adding site improves the model slightly. The $R^2$ is slightly higher as well (0.735 versus 0.711).
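
One way to make the nested-model comparison concrete is a likelihood ratio statistic. The sketch below is only arithmetic on the rounded log-likelihoods and model degrees of freedom printed in the two summaries, so the result is approximate.

```python
# Log-likelihoods and "Df Model" values copied from the two summaries (rounded as printed)
llf_base, df_base = -5109.0, 5    # Group + Age_centered + Sex + TIV_centered
llf_site, df_site = -5082.4, 27   # the same terms plus Site

# Likelihood ratio statistic: twice the gain in log-likelihood from the extra terms
lr_stat = 2 * (llf_site - llf_base)
extra_params = df_site - df_base

print(f"LR statistic = {lr_stat:.1f} on {extra_params} extra parameters")
# prints "LR statistic = 53.2 on 22 extra parameters"
```

Under the null hypothesis that site has no effect, this statistic is approximately chi-squared distributed with 22 degrees of freedom, whose 5% critical value is roughly 33.9, so 53.2 points to a real site effect. statsmodels also offers `site_results.compare_lr_test(results)`, which should return the statistic, its p-value, and the difference in degrees of freedom from the fitted models directly.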

## Scanner
It could be that this is a scanner difference rather than a difference in where the data were acquired, and some sites have used more than one scanner. Let's try the model again with scanner only. The primary scanner that the protocol was developed on was a Siemens Trio 3T, so let's use that as our baseline value to see how the other scanners compare.
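
The baseline is set by the order of the categories: with the default treatment coding in statsmodels formulas, the first category becomes the reference level and gets no coefficient of its own. The toy example below uses a made-up four-scan series and `pd.get_dummies(drop_first=True)` as a stand-in to show the same idea; it does not call statsmodels.

```python
import pandas as pd

# Toy scanner labels for illustration only
scanner = pd.Series(["GE 3T", "Siemens Trio 3T", "Philips 3T", "Siemens Trio 3T"])

# Listing 'Siemens Trio 3T' first makes it the reference level
scanner_cat = pd.Series(
    pd.Categorical(scanner, categories=["Siemens Trio 3T", "Philips 3T", "GE 3T"])
)

# drop_first=True drops the first category, mirroring treatment coding
dummies = pd.get_dummies(scanner_cat, drop_first=True)
print(dummies)
```

Rows scanned on the Siemens Trio 3T are all zeros, so every other scanner's coefficient is read as a difference from the Trio.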

In [20]:
df_xsec['Scanner_cat'] = pd.Categorical(df_xsec['Scanner'],
                                        categories=['Siemens Trio 3T',
                                                    'Siemens Prisma 3T',
                                                    'Siemens Skyra 3T',
                                                    'Siemens 1.5T',
                                                    'Philips 3T',
                                                    'GE 1.5T',
                                                    'GE 3T'])
scanner_results= statsfx.ols('Insula_volume ~ Group + Age_centered + Sex + TIV_centered + Scanner_cat',data=df_xsec).fit()
print(scanner_results.summary())

                            OLS Regression Results                            
Dep. Variable:          Insula_volume   R-squared:                       0.720
Model:                            OLS   Adj. R-squared:                  0.714
Method:                 Least Squares   F-statistic:                     142.3
Date:                Wed, 08 Sep 2021   Prob (F-statistic):          2.74e-160
Time:                        13:22:04   Log-Likelihood:                -5099.9
No. Observations:                 622   AIC:                         1.022e+04
Df Residuals:                     610   BIC:                         1.028e+04
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
                                       coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------------
Intercep

In [104]:
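from sklearn.linear_model import LinearRegression

# Dummy-code the categorical predictors, dropping the first level of each as the reference
siteX = df_xsec[["Group","Age","Sex","TIV","Site"]]
siteX = pd.get_dummies(data=siteX, drop_first=True)
y = df_xsec["Insula_volume"]  # assumed outcome: the same structure as in the models above
regressor_with_site = LinearRegression()  # create object for the class
regressor_with_site.fit(siteX, y)  # perform linear regression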

LinearRegression()

### Interpretation
Compared to the site model, the scanner model does not improve the fit as much (log-likelihood of -5099.9 versus -5082.4), although there are clear differences between the scanners.
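
The site and scanner models are not nested within each other, so a penalised criterion such as AIC or BIC is a fairer way to set all three models side by side. The sketch below recomputes both from the rounded log-likelihoods and `Df Model` values in the three summaries, counting the intercept as one extra parameter.

```python
import math

n_obs = 622  # "No. Observations" in each summary

# (log-likelihood, Df Model) copied from the three summaries, rounded as printed
models = {
    "no site": (-5109.0, 5),
    "site": (-5082.4, 27),
    "scanner": (-5099.9, 11),
}

def aic(llf, df_model):
    """Akaike information criterion; +1 counts the intercept."""
    return 2 * (df_model + 1) - 2 * llf

def bic(llf, df_model, n):
    """Bayesian information criterion; penalises parameters more heavily as n grows."""
    return (df_model + 1) * math.log(n) - 2 * llf

for name, (llf, df_model) in models.items():
    print(f"{name:8s} AIC={aic(llf, df_model):.1f}  BIC={bic(llf, df_model, n_obs):.1f}")
```

On these figures AIC is lowest for the site model, while BIC, with its heavier penalty on the 22 extra site parameters, is lowest for the model without site. This fits the reading that site and scanner improve the fit only modestly.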

## Try your own
Now that you have seen how to set up this analysis, try it with another structure, as scanner differences may not be universal. Some structures to try would be Frontol_lobe_volume, Total_Brain, or Right_Hippocampus. How do the numbers change?
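
To avoid retyping the formula for each structure, a small helper can build it. `build_formula` below is a hypothetical convenience function, not part of statsmodels; it only assembles the formula string used in the models above.

```python
def build_formula(outcome, extra_terms=()):
    """Return a regression formula with the covariates used in this notebook."""
    base_terms = ["Group", "Age_centered", "Sex", "TIV_centered"]
    return f"{outcome} ~ " + " + ".join(base_terms + list(extra_terms))

# Formulas with and without site for another structure
print(build_formula("Total_Brain"))
print(build_formula("Total_Brain", ["Site"]))
```

Each string can then be passed to `statsfx.ols(formula, data=df_xsec).fit()` in the same way as before.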

### Turning to ComBat
This is really helpful, but what if there are differences in standard error between sites, and how far can we trust the small sample sizes in some cells? Let's try this with ComBat instead.
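
To build some intuition first, the sketch below shows a plain location and scale adjustment on synthetic data. It is not ComBat: it handles one measure at a time and has no empirical Bayes shrinkage, which is the part that stabilises estimates for small sites. The function name `location_scale_adjust` and all the numbers are made up for illustration.

```python
import numpy as np

def location_scale_adjust(y, covariates, site):
    """Remove per-site shifts in mean and spread while keeping covariate effects."""
    y = np.asarray(y, dtype=float)
    covariates = np.asarray(covariates, dtype=float).reshape(len(y), -1)
    site = np.asarray(site)
    levels = np.unique(site)
    # One indicator column per site, so every site gets its own intercept
    site_dummies = (site[:, None] == levels[None, :]).astype(float)
    design = np.column_stack([site_dummies, covariates])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    site_means, cov_coef = coef[: len(levels)], coef[len(levels):]
    resid = y - design @ coef
    # Location: swap each site's intercept for the sample-weighted average intercept
    grand_mean = site_dummies.mean(axis=0) @ site_means
    # Scale: stretch or shrink each site's residuals to the pooled residual spread
    pooled_sd = resid.std()
    scaled = np.empty_like(resid)
    for level in levels:
        mask = site == level
        scaled[mask] = resid[mask] * pooled_sd / resid[mask].std()
    return grand_mean + covariates @ cov_coef + scaled

# Synthetic example: site 1 reads 500 mm^3 high and is twice as noisy as site 0
rng = np.random.default_rng(0)
n = 200
site = np.repeat([0, 1], n)
age_c = rng.normal(0, 10, size=2 * n)
noise = rng.normal(0, 1, size=2 * n) * np.where(site == 0, 100.0, 200.0)
y = 10000 - 40 * age_c + 500 * site + noise
y_adj = location_scale_adjust(y, age_c, site)

def site_gap_and_spread_ratio(values):
    """Refit with site and age; return the site offset and the ratio of residual spreads."""
    design = np.column_stack([site == 0, site == 1, age_c]).astype(float)
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    resid = values - design @ coef
    return coef[1] - coef[0], resid[site == 1].std() / resid[site == 0].std()

gap_raw, ratio_raw = site_gap_and_spread_ratio(y)
gap_adj, ratio_adj = site_gap_and_spread_ratio(y_adj)
print(f"before: site offset={gap_raw:.1f}, spread ratio={ratio_raw:.2f}")
print(f"after:  site offset={gap_adj:.1f}, spread ratio={ratio_adj:.2f}")
```

With only a handful of scans at a site, the per-site mean and spread estimated this way are noisy. That is the situation ComBat is designed to handle better.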