# Harmonisation strategies
Now that we have seen how different scanners can produce different images of the same individual, we need to work out how to account for these differences in the data.

In [68]:
import os
import nest_asyncio
nest_asyncio.apply()
import numpy as np
import pandas as pd
import datalad.api as dl

Let's grab the dataset that contains all the imaging volumes for the study and read it into memory.

In [69]:
dl_source='/Users/davecash/Data/IDEAS/sample'
sample=dl.clone(dl_source,path='/tmp/sample',description='Cloned sample dataset for import')
sample.update(merge=True)
in_file=sample.get('./GENFI_DEMON_SPREADSHEET.xlsx')

[INFO] Fetching updates for Dataset(/tmp/sample) 


In [97]:
df_xsec=pd.read_excel(in_file[0]['path'])
df_xsec['Sex']=pd.Categorical(df_xsec['Sex'],categories=[0,1])
df_xsec['Sex']=df_xsec['Sex'].cat.rename_categories(['Female','Male'])
df_xsec['Site'] = pd.Categorical(df_xsec['Site'],categories=np.arange(23))
df_xsec['Group']=pd.Categorical(df_xsec['Group'],categories=[0,1,2])
df_xsec['Group']=df_xsec['Group'].cat.rename_categories(['Non-carrier','Presymptomatic','Symptomatic'])

A quick cross tab to show the different kinds of scanners used at each site.

In [98]:
pd.crosstab(df_xsec['Site'],df_xsec['Scanner'])

| Site | GE 1.5T | GE 3T | Philips 3T | Siemens 1.5T | Siemens Prisma 3T | Siemens Skyra 3T | Siemens Trio 3T |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 8 | 20 |
| 1 | 0 | 0 | 45 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 3 | 0 | 0 |
| 3 | 0 | 0 | 143 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 3 | 0 | 0 | 0 | 0 |
| 5 | 0 | 25 | 0 | 0 | 6 | 0 | 0 |
| 6 | 0 | 0 | 18 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 10 | 0 | 12 |
| 8 | 0 | 0 | 0 | 33 | 0 | 36 | 0 |
| 9 | 8 | 0 | 1 | 0 | 0 | 0 | 0 |
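In the rows shown, several sites (0, 5, 7, 8 and 9) used more than one scanner, and a few (2 and 4) contributed only three scans, so site and scanner are not interchangeable labels. A minimal sketch of how such sites could be flagged from a cross tab, using a small made-up table (`toy`) and an arbitrary, hypothetical threshold of three scans:

```python
import pandas as pd

# Made-up data for illustration only; not taken from the study spreadsheet.
toy = pd.DataFrame({
    "Site": [0, 0, 0, 1, 1, 2],
    "Scanner": ["Siemens Trio 3T", "Siemens Trio 3T", "Siemens Skyra 3T",
                "Philips 3T", "Philips 3T", "Siemens Prisma 3T"],
})

counts = pd.crosstab(toy["Site"], toy["Scanner"])

# Sites with more than one scanner: count the non-zero cells in each row.
scanners_per_site = (counts > 0).sum(axis=1)
multi_scanner_sites = scanners_per_site[scanners_per_site > 1].index.tolist()

# Sites with few scans overall (threshold of 3 is an arbitrary assumption).
site_totals = counts.sum(axis=1)
small_sites = site_totals[site_totals < 3].index.tolist()

print("More than one scanner:", multi_scanner_sites)
print("Fewer than 3 scans:", small_sites)
```

The same two checks could be applied to the real cross tab by replacing `toy` with `df_xsec`.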


## Analysis
Now that we have the data, note that it contains three groups. Let's take a look at a couple of structures and see what differences there are.
### Simple regression without site
We will compare the groups using only sex, age, and TIV as covariates.

In [99]:
df_xsec.groupby("Group").agg(
    {
        "Age": ["mean", "std", "min", "max" ],
        "EYO": ["mean", "std", "min", "max" ],
        "TIV": ["mean", "std", "min", "max" ]
    }
).style.format('{0:,.2f}')

| Group | Age mean | Age std | Age min | Age max | EYO mean | EYO std | EYO min | EYO max | TIV mean | TIV std | TIV min | TIV max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Non-carrier | 46.95 | 13.7 | 18.6 | 85.7 | -12.78 | 14.31 | -50.0 | 27.6 | 1420.48 | 142.47 | 1100.7 | 2051.7 |
| Presymptomatic | 44.78 | 11.91 | 20.1 | 75.5 | -13.81 | 11.48 | -47.4 | 15.7 | 1412.19 | 137.71 | 1074.7 | 1811.5 |
| Symptomatic | 63.01 | 8.17 | 32.9 | 78.7 | 3.43 | 6.91 | -26.1 | 21.7 | 1442.3 | 157.82 | 1129.5 | 1848.9 |


### Checking for site differences
Are there differences between sites or scanners in this data? Let's fit the model first without site, then run the same analysis again with site included as a covariate. For this, we will need the Python package scikit-learn.

In [100]:
from sklearn.linear_model import LinearRegression

The `X` variable contains all of our independent covariates. The `y` variable contains our desired outcome, here the combined left and right insula volume.

In [101]:
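# Drop participants missing any covariate used in the model
df_xsec=df_xsec.dropna(subset=["Group","Age","Sex","TIV"])
X = df_xsec[["Group","Age","Sex","TIV"]]
# One-hot encode the categorical columns, dropping the first level of each as the reference
X = pd.get_dummies(data=X, drop_first=True)
# Outcome: total insula volume (left + right)
y = df_xsec["Left Insula volume"] + df_xsec["Right Insula volume"]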
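To see what `pd.get_dummies(..., drop_first=True)` does to a categorical covariate, here is a minimal sketch on a made-up four-row table (`toy` is a hypothetical name, and the values are invented for illustration). The first category becomes the reference level and gets no column of its own:

```python
import pandas as pd

# Invented example rows; the categories mirror the Group labels used above.
toy = pd.DataFrame({
    "Group": pd.Categorical(
        ["Non-carrier", "Presymptomatic", "Symptomatic", "Non-carrier"],
        categories=["Non-carrier", "Presymptomatic", "Symptomatic"],
    ),
    "Age": [40.0, 35.5, 62.1, 51.3],
})

# Numeric columns are kept as they are; each categorical column is replaced
# by one indicator column per category, except the first (the reference).
encoded = pd.get_dummies(data=toy, drop_first=True)
print(list(encoded.columns))
```

Each remaining indicator coefficient in the regression is then read as a difference from the dropped reference category, which is `Non-carrier` for `Group`.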

In [102]:
linear_regressor = LinearRegression()  # create object for the class
linear_regressor.fit(X, y)  # perform linear regression

LinearRegression()

In [106]:
print(list(X.columns))
print(linear_regressor.intercept_)
print(linear_regressor.coef_)
linear_regressor.score(X,y)

['Age', 'TIV', 'Group_Presymptomatic', 'Group_Symptomatic', 'Sex_Male']
3932.3729049353115
[  -40.61251411     6.17734497  -223.44222734 -2147.24043702
  -100.28443119]


0.7112441075723002
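The coefficients are printed in the same order as the column names, so the value of about -2147 belongs to `Group_Symptomatic`: the estimated difference in insula volume between the symptomatic group and the non-carrier reference, holding the other covariates fixed. Pairing names and values makes this easier to read. A minimal sketch on synthetic data (`toy_X`, `toy_y` and `toy_model` are hypothetical names, and the numbers are generated, not taken from the study):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic covariates and an outcome that is an exact linear function of them.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "Age": rng.uniform(20, 80, size=50),
    "TIV": rng.uniform(1100, 1800, size=50),
})
toy_y = 1000.0 - 40.0 * toy_X["Age"] + 6.0 * toy_X["TIV"]

toy_model = LinearRegression().fit(toy_X, toy_y)

# Label each coefficient with the column it belongs to.
coef_table = pd.Series(toy_model.coef_, index=toy_X.columns, name="coefficient")
print(coef_table)
```

The same pattern would apply to the fitted model above, using `linear_regressor.coef_` and `X.columns`.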

In [104]:
siteX = df_xsec[["Group","Age","Sex","TIV","Site"]]
siteX = pd.get_dummies(data=siteX, drop_first=True)
regressor_with_site = LinearRegression()  # create object for the class
regressor_with_site.fit(siteX, y)  # perform linear regression

LinearRegression()

In [107]:
print(list(siteX.columns))
print(regressor_with_site.intercept_)
print(regressor_with_site.coef_)
regressor_with_site.score(siteX,y)

['Age', 'TIV', 'Group_Presymptomatic', 'Group_Symptomatic', 'Sex_Male', 'Site_1', 'Site_2', 'Site_3', 'Site_4', 'Site_5', 'Site_6', 'Site_7', 'Site_8', 'Site_9', 'Site_10', 'Site_11', 'Site_12', 'Site_13', 'Site_14', 'Site_15', 'Site_16', 'Site_17', 'Site_18', 'Site_19', 'Site_20', 'Site_21', 'Site_22']
4150.738604939085
[  -42.37708002     5.9090495   -208.0305405  -2087.44332263
   -34.51568108    42.0873887  -1001.83582594   454.67919697
  -276.86968655   487.72012179  -101.76563507   -81.15427676
   440.95751922   118.43877925  -149.53246397    -4.94621358
   103.05184684   -77.07673731  -120.31270881   474.23190449
   323.61336057   -11.37350027   123.50818612  -153.54284841
  -414.4583679    266.39314966   685.18640196]


0.7348822949249503
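Adding site raises the score from about 0.711 to about 0.735, but R² never decreases when covariates are added, and this model has 22 extra site terms. One common way to account for that is the adjusted R², which penalises the number of predictors. A minimal sketch follows; the function name `adjusted_r2` is ours, and the sample size `n_hypothetical` is a placeholder, not the study's actual number of participants:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors fitted to n samples."""
    if n - p - 1 <= 0:
        raise ValueError("need more samples than predictors plus one")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Placeholder sample size, for illustration only.
n_hypothetical = 500

# R-squared values and predictor counts taken from the two fits above.
without_site = adjusted_r2(0.7112441075723002, n_hypothetical, 5)
with_site = adjusted_r2(0.7348822949249503, n_hypothetical, 27)
print(without_site, with_site)
```

With the real data, `n` would be `len(y)` and `p` the number of columns in `X` or `siteX`.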

### Turning to ComBat
This is really helpful, but what if there are also differences in standard error between sites, and how far can we trust the estimates from the small sample sizes in some cells? Let's try this with ComBat instead.
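The core idea behind ComBat-style harmonisation is to adjust both the location (mean) and the scale (spread) of each site's measurements. The sketch below shows only that location-and-scale step on synthetic data. It is not ComBat itself: the real method also preserves covariate effects such as group and age, and uses empirical Bayes shrinkage to stabilise the estimates for small sites. The function name `location_scale_harmonise` and all data here are hypothetical.

```python
import numpy as np

def location_scale_harmonise(values, site):
    """Shift and rescale each site's values to a common mean and spread.

    Simplified illustration: no covariates are preserved and no empirical
    Bayes shrinkage is applied, so this is not the full ComBat method.
    """
    values = np.asarray(values, dtype=float)
    site = np.asarray(site)
    labels = np.unique(site)

    # Residuals after removing each site's own mean.
    resid = np.empty_like(values)
    for s in labels:
        mask = site == s
        resid[mask] = values[mask] - values[mask].mean()

    grand_mean = values.mean()
    # Pooled within-site standard deviation.
    pooled_sd = np.sqrt((resid ** 2).sum() / (len(values) - len(labels)))

    out = np.empty_like(values)
    for s in labels:
        mask = site == s
        site_sd = resid[mask].std(ddof=1) if mask.sum() > 1 else 0.0
        # With fewer than two scans (or zero spread) only the mean is shifted.
        scale = pooled_sd / site_sd if site_sd > 0 else 1.0
        out[mask] = resid[mask] * scale + grand_mean
    return out

# Synthetic volumes for three sites with different offsets and spreads.
rng = np.random.default_rng(0)
toy_site = np.repeat([0, 1, 2], [50, 40, 30])
offsets = np.array([0.0, 300.0, -200.0])[toy_site]
scales = np.array([100.0, 150.0, 60.0])[toy_site]
toy_volume = 5000.0 + offsets + scales * rng.standard_normal(toy_site.size)

harmonised = location_scale_harmonise(toy_volume, toy_site)
for s in np.unique(toy_site):
    in_site = toy_site == s
    print(s, harmonised[in_site].mean(), harmonised[in_site].std(ddof=1))
```

After the adjustment every site shares the same mean and standard deviation. Note that this simple version would also remove genuine biological differences between sites, for example if one site recruited more symptomatic participants, which is one reason the full method models covariates explicitly.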