# Subtype analysis
This experiment aims to identify subtypes within the ABIDE dataset based on three different types of maps:

1. Scores
2. Dual Regression
3. Seed Functional Connectivity

I will then compute, for each subject, one weight per subtype that reflects the overall similarity of the individual's map to the average subtype map. These weights will then be entered into a GLM and regressed on a number of phenotype variables.

## Scientific Assumptions

1. We will identify 7 subtypes per map. This number could also be data-driven, but for now it is fixed.

## Scientific Questions
These are the questions I want to ask:

1. What are the subtype maps, and how do they differ within a map type?
2. Are there subtypes for which the individual weights are predictive of the phenotype data?
3. Is there a map type that is more useful for this investigation than the others (or can the map types be ranked)?

## Practical Questions

1. Do we really need to demean the maps? For most of the maps, the values are already scaled. They are not 0-centered, but that could very well be meaningful. I'm not sure what demeaning does then.
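
One way to probe this question is a small synthetic check. The sketch below uses made-up data (`maps` is a hypothetical stand-in for the voxel-by-subject matrix, and the "subtype" is simply the first three subjects), so it only illustrates the arithmetic. It suggests that subtracting each subject's own spatial mean should not change anything that is computed with Pearson correlations, because the correlation already centers both inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the data: 500 voxels x 6 subjects, not 0-centered
maps = rng.normal(loc=0.5, scale=0.2, size=(500, 6))

# Subtract each subject's mean across voxels (column-wise demeaning)
demeaned = maps - maps.mean(axis=0)

# Subject-by-subject correlation matrix, with and without demeaning
corr_raw = np.corrcoef(maps, rowvar=False)
corr_demeaned = np.corrcoef(demeaned, rowvar=False)
same_corr = np.allclose(corr_raw, corr_demeaned)

# Pretend the first three subjects form one subtype and compute the weights
avg_raw = maps[:, :3].mean(axis=1)
avg_demeaned = demeaned[:, :3].mean(axis=1)
w_raw = np.array([np.corrcoef(avg_raw, maps[:, i])[0, 1] for i in range(6)])
w_demeaned = np.array(
    [np.corrcoef(avg_demeaned, demeaned[:, i])[0, 1] for i in range(6)]
)
same_weights = np.allclose(w_raw, w_demeaned)

print(same_corr, same_weights)
```

Subtracting the voxel-wise mean across subjects (`axis=1`) is a different operation and would change the correlations, so the choice of axis is what matters here.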

In [29]:
# Imports
import os
import glob
import numpy as np
import pandas as pd
import nibabel as nib
import statsmodels.api as sm
from scipy import stats as st
from scipy import cluster as scl
from matplotlib import pyplot as plt
from sklearn import linear_model as slin

In [5]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [3]:
# Parameters and paths
scale = 7
subtypes = 7
network = 1
net_id = network - 1
template = '*_fmri_{:07d}_session_1_run1_stability_maps.nii.gz'
data_path = '/data1/abide/Out/Scores/sc{:02d}/time'.format(scale)
pheno_path = '/data1/abide/Pheno/pheno.csv'
mask_path = '/data1/abide/Mask/mask_data_specific.nii.gz'
map_types = ['stability_maps', 'rmap_part', 'dual_regression']

In [25]:
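# Load the mask and turn it into a boolean array
m_img = nib.load(mask_path)
# get_data() has been removed from recent nibabel releases; get_fdata() replaces it
mask = m_img.get_fdata()
mask = mask != 0
# Number of voxels inside the mask
n_vox = np.sum(mask)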

In [5]:
# Load the phenotype information
pheno = pd.read_csv(pheno_path)

In [21]:
# Loop through the subject IDs and find the corresponding
# files. If there is no file, drop the subject
drop_id = list()
path_list = list()
for index, row in pheno.iterrows():
    # Cast to int so that the zero-padded integer format in the template works
    s_id = int(row['SUB_ID'])
    s_path = glob.glob(os.path.join(data_path, map_types[0], template.format(s_id)))
    if s_path:
        path_list.append(s_path[0])
    else:
        drop_id.append(index)
# Keep only the phenotype rows of subjects that have a map
clean_pheno = pheno.drop(drop_id)

In [26]:
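n_files = len(path_list)
# Prepare the storage matrix (voxels x subjects)
net_mat = np.zeros((n_vox, n_files))
# Go through the files and keep the masked map of the selected network
for index, s_path in enumerate(path_list):
    # get_data() has been removed from recent nibabel releases; get_fdata() replaces it
    f_net = nib.load(s_path).get_fdata()[mask][..., net_id]
    net_mat[..., index] = f_net
# Subtract each subject's mean across voxels
net_mat = net_mat - np.mean(net_mat, 0)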

In [30]:
# Make a correlation matrix of the subjects
corr_sub = np.corrcoef(net_mat, rowvar=0)
link_sub = scl.hierarchy.ward(corr_sub)
part_sub = scl.hierarchy.fcluster(link_sub, subtypes, criterion='maxclust')
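
To get a feeling for what this step does, here is a toy sketch on synthetic data. Everything in it is an assumption made for illustration: two made-up spatial patterns, five noisy "subjects" per pattern, and two clusters instead of seven. Note that passing a square 2-D array to `ward` means SciPy treats each row of the correlation matrix as an observation vector rather than as a precomputed distance matrix, which is also how the call above behaves.

```python
import numpy as np
from scipy.cluster import hierarchy

rng = np.random.default_rng(1)
n_vox, n_per_group = 200, 5

# Two hypothetical spatial patterns, each shared by five noisy subjects
pattern_a = rng.normal(size=n_vox)
pattern_b = rng.normal(size=n_vox)
toy = np.column_stack(
    [pattern_a + 0.3 * rng.normal(size=n_vox) for _ in range(n_per_group)]
    + [pattern_b + 0.3 * rng.normal(size=n_vox) for _ in range(n_per_group)]
)

# Same three calls as in the notebook cell, on the toy data
toy_corr = np.corrcoef(toy, rowvar=False)
toy_link = hierarchy.ward(toy_corr)
toy_part = hierarchy.fcluster(toy_link, 2, criterion='maxclust')

print(toy_part)
```

On data this clean, the first five and the last five subjects should each end up sharing a single label.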

In [31]:
# Make the average of the subtypes
sbt_avg = np.zeros((n_vox, subtypes))
for idx in range(subtypes):
    sub_id = np.unique(part_sub)[idx]
    sbt_avg[..., idx] = np.mean(net_mat[...,part_sub==sub_id],1)

In [59]:
# Generate the individual weights
y_stp = np.zeros((n_files, subtypes))
for s_id in range(subtypes):
    type_map = sbt_avg[:, s_id]
    y_stp[:, s_id] = np.array([np.corrcoef(type_map, net_mat[:,x])[0,1] for x in range(n_files)])
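
The nested loop above calls `np.corrcoef` once per subject and subtype. A possible vectorized alternative is sketched below. `column_correlations` is a hypothetical helper, not part of NumPy, and the arrays are random stand-ins for `net_mat` and `sbt_avg`. It z-scores the columns of both matrices and takes a single matrix product, which should give the same Pearson correlations.

```python
import numpy as np


def column_correlations(a, b):
    """Pearson correlation between every column of a and every column of b."""
    # z-score each column (population standard deviation, ddof=0)
    a_z = (a - a.mean(axis=0)) / a.std(axis=0)
    b_z = (b - b.mean(axis=0)) / b.std(axis=0)
    # Mean of the products of z-scores is the Pearson correlation
    return a_z.T @ b_z / a.shape[0]


rng = np.random.default_rng(2)
toy_maps = rng.normal(size=(300, 8))      # stand-in for net_mat (voxels x subjects)
toy_subtypes = rng.normal(size=(300, 3))  # stand-in for sbt_avg (voxels x subtypes)

# Subjects x subtypes weight matrix in one step
weights = column_correlations(toy_maps, toy_subtypes)

# Reference values computed the same way as in the notebook loop
loop_weights = np.array(
    [[np.corrcoef(toy_subtypes[:, s], toy_maps[:, x])[0, 1] for s in range(3)]
     for x in range(8)]
)
print(np.allclose(weights, loop_weights))
```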

In [114]:
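# Build the design matrix from the phenotype data: one dummy column per site,
# with the CALTECH column replaced by an intercept (CALTECH becomes the reference site)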
X_diag = pd.get_dummies(clean_pheno['SITE_ID'])
X_diag = X_diag.rename(columns={'CALTECH': 'INTERCEPT'})
X_diag['INTERCEPT'] = 1
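
Renaming one dummy column and overwriting it with ones amounts to reference coding: the intercept then estimates the mean of the reference site and every other coefficient is an offset from it. A more explicit way to build the same kind of matrix is sketched below on a made-up table. The site labels in `toy_pheno` are illustrative, and `drop_first=True` only drops CALTECH because it sorts first among them.

```python
import pandas as pd

# Made-up phenotype table with three sites
toy_pheno = pd.DataFrame(
    {'SITE_ID': ['CALTECH', 'KKI', 'NYU', 'KKI', 'NYU', 'CALTECH']}
)

# One dummy per site, dropping the first (reference) level
site_dummies = pd.get_dummies(toy_pheno['SITE_ID'], drop_first=True, dtype=float)

# Add the intercept column explicitly
X_toy = site_dummies.copy()
X_toy.insert(0, 'INTERCEPT', 1.0)

print(X_toy)
```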

In [115]:
# Regress the weights of the second subtype (index 1) on the site design matrix
model = sm.OLS(y_stp[:,1], X_diag)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.160
Model:                            OLS   Adj. R-squared:                  0.145
Method:                 Least Squares   F-statistic:                     10.56
Date:                Fri, 03 Jul 2015   Prob (F-statistic):           5.66e-25
Time:                        13:48:02   Log-Likelihood:                 1027.5
No. Observations:                 901   AIC:                            -2021.
Df Residuals:                     884   BIC:                            -1939.
Df Model:                          16                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
INTERCEPT      0.6887      0.013     54.360      0.0