# Applying machine learning to a resting-state brain imaging dataset

        1. Using BidsGrabber to understand the data layout
        2. IdentityInterface module for iterating over subjects
        3. Function for reading path
        4. Creating Nodes
            1. Function
            2. IdentityInterface for iteration
            3. Machine learning module
        5. Workflow

#### setting config

    Parameter values are set here, so whenever a value needs to change we can change it in one place instead of searching for it in the pipeline or functions.

In [248]:
data_input_dir = '/data/NYU/'
subjects_info='/data/NYU/participants.tsv'
data_output_dir = '/output/'
no_of_subjects = 2  # number of subjects to consider for computation

# participants.tsv has many variables; we use only participant_id and the diagnostic group
subject_id = 'participant_id'
group = 'dx_group'  # column name as given in the file


#### BIDSLayout to easily access the structure of files and filenames

In [47]:
from bids.grabbids import BIDSLayout

In [51]:
layout = BIDSLayout(data_input_dir)

In [52]:
!tree $data_input_dir

/data/NYU/
|-- T1w.json
|-- participants.tsv
|-- phenotypic_NYU.csv
|-- sub-0050952
|   |-- anat
|   |   `-- sub-0050952_T1w.nii.gz
|   `-- func
|       `-- sub-0050952_task-rest_run-1_bold.nii.gz
|-- sub-0050953
|   |-- anat
|   |   `-- sub-0050953_T1w.nii.gz
|   `-- func
|       `-- sub-0050953_task-rest_run-1_bold.nii.gz
|-- sub-0050954
|   |-- anat
|   |   `-- sub-0050954_T1w.nii.gz
|   `-- func
|       `-- sub-0050954_task-rest_run-1_bold.nii.gz
|-- sub-0050955
|   |-- anat
|   |   `-- sub-0050955_T1w.nii.gz
|   `-- func
|       `-- sub-0050955_task-rest_run-1_bold.nii.gz
|-- sub-0050956
|   |-- anat
|   |   `-- sub-0050956_T1w.nii.gz
|   `-- func
|       `-- sub-0050956_task-rest_run-1_bold.nii.gz
|-- sub-0050957
|   |-- anat
|   |   `-- sub-0050957_T1w.nii.gz
|   `-- func
|       `-- sub-0050957_task-rest_run-1_bold.nii.gz
|-- sub-0050958
|   |-- anat
|   |   `-- sub-0050958_T1w.nii.gz
|   `-- func
|       `-- sub-0050958_task-rest_run-1_bold.nii.gz
|-- sub-0050959
|   |-- anat


In [71]:
subject_list = (layout.get_subjects())[0:(no_of_subjects)] #this gives the list of subjects in a given directory
subject_list

['0050952', '0050953']

#### function to get the nifti filenames

In [72]:
def get_nifti_filenames(subject_id, data_input_dir):
    # All necessary imports must be INSIDE the function for the Function interface to work.
    from bids.grabbids import BIDSLayout

    layout = BIDSLayout(data_input_dir)

    # Collect the paths of this subject's BOLD (functional) NIfTI files.
    func_file_path = [f.filename for f in layout.get(subject=subject_id, type='bold', extensions=['nii', 'nii.gz'])]

    # Return the first match (each subject shown above has a single resting-state run).
    return func_file_path[0]

#### Preparing a list of filenames

In [139]:
nifti_images_list = list()
for sub in subject_list:  # `sub` avoids overwriting the `subject_id` config variable
    nifti_images_list.append(get_nifti_filenames(sub, data_input_dir))
                        

#### showing all the NIfTI filenames with their paths

In [140]:
nifti_images_list

['/data/NYU/sub-0050952/func/sub-0050952_task-rest_run-1_bold.nii.gz',
 '/data/NYU/sub-0050953/func/sub-0050953_task-rest_run-1_bold.nii.gz']

####  FROM 4-DIMENSIONAL IMAGES TO 2-DIMENSIONAL ARRAY: MASKING
    Neuroimaging data are represented in 4 dimensions: 3 spatial dimensions, and one dimension to index time or trials.
   
    Machine learning algorithms, on the other hand, only accept 2-dimensional samples × features matrices.
   
    Depending on the setting, voxels and time series can be considered as features or samples.
   
    Masking reduces each 3D volume to a vector by keeping only a selected set of voxels. The selected voxels form a brain mask. Such a mask is often given along with the datasets or can be computed with software tools such as FSL or SPM.
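
The masking step described above can be sketched with plain NumPy. This is only an illustration on synthetic data, not what `NiftiMasker` does internally; the array sizes and the names `img_4d`, `brain_mask` and `X_2d` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 4D "image": 4 x 5 x 6 voxels and 10 time points (arbitrary sizes).
img_4d = rng.normal(size=(4, 5, 6, 10))

# Hypothetical brain mask: a boolean 3D array marking the voxels to keep.
brain_mask = np.zeros((4, 5, 6), dtype=bool)
brain_mask[1:3, 1:4, 2:5] = True  # 2 * 3 * 3 = 18 voxels selected

# Boolean indexing over the three spatial axes gives (n_voxels, n_timepoints);
# transposing yields the samples x features layout (time points x voxels).
X_2d = img_4d[brain_mask].T

print(X_2d.shape)  # (10, 18)
```

Each row of `X_2d` is one time point and each column is one voxel inside the mask.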
   

In [141]:
from nilearn.input_data import NiftiMasker,MultiNiftiMasker

In [269]:
test_imgs = list()
# Note: both entries point to the same file (sub-0050952), so this subject's data appears twice.
test_imgs.append('/data/NYU/sub-0050952/func/sub-0050952_task-rest_run-1_bold.nii.gz')
test_imgs.append('/data/NYU/sub-0050952/func/sub-0050952_task-rest_run-1_bold.nii.gz')

multi_nifti_masker = MultiNiftiMasker(standardize=True)
fmri_masked = multi_nifti_masker.fit_transform(test_imgs)
# fmri_masked is a list with one ndarray per image; each ndarray has shape
# (time points, voxels selected by the brain mask)

#### DATA PREPARATION: 
    For the machine learning setting, we need a data matrix, which we denote X, and optionally a target variable to predict, y.
   
    X = time points × voxels

In [393]:
import numpy as np
X = np.concatenate(fmri_masked, axis=0 )

In [394]:
X.shape

(360, 132240)

#### loading the subjects' metadata to form (subject_id, group) pairs
  With the pandas library, we load the TSV file into a data frame and then select the columns we need from it.

In [395]:
import pandas as pd

subjects_meta_data=pd.read_csv(subjects_info,sep='\t')

#### extracting subject_id and group for the subjects in subject_list
   1. Prepare a mask that selects the rows whose subject_id is in subject_list
   2. Use that mask to select participant_id and dx_group

In [396]:
mask=subjects_meta_data[subject_id].isin(subject_list) # check setting config for subject_id
subject_id_group = subjects_meta_data[mask][[subject_id,group]] # check setting config for group

In [397]:
subject_id_group

Unnamed: 0,participant_id,dx_group
0,50952,Autism
1,50953,Autism


#### there are two groups: Autism and Control

In [398]:
groups = subject_id_group['dx_group'].values
type(groups)

numpy.ndarray

Function to create a label array with one label per time point of each subject

Label `1` stands for `Autism` and `0` for `Control`

In [399]:
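def labels_set(nifti_files_path, subjects_groups):

    import nibabel as nb
    import numpy as np
    labels = []

    # Pair each image with its subject's group (both must be in the same subject order).
    for img, grp in zip(nifti_files_path, subjects_groups):
        img_load = nb.load(img)
        # shape[3] is the number of time points; repeat the subject's label that many times.
        labels += img_load.shape[3] * [(1 if grp == 'Autism' else 0)]

    return np.asarray(labels)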
   

Now we call labels_set, passing the list of image paths and the array of groups.

In [400]:
labels=labels_set(nifti_images_list,groups)
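
The idea behind `labels_set` (one label per time point) can be checked without any image files. The sketch below uses `np.repeat` on made-up inputs; `groups_demo` and `n_timepoints` are hypothetical stand-ins for the real groups and scan lengths.

```python
import numpy as np

# Hypothetical inputs: one diagnosis per subject and the number of volumes per scan.
groups_demo = np.array(["Autism", "Control"])
n_timepoints = [3, 2]

# Encode Autism as 1 and Control as 0, then repeat each label once per time point.
subject_labels = (groups_demo == "Autism").astype(int)
labels_demo = np.repeat(subject_labels, n_timepoints)

print(labels_demo)  # [1 1 1 0 0]
```

The resulting array has as many entries as there are rows in the stacked data matrix, which is what the classifier expects.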

#### Applying a classifier: train and test


   We use a Support Vector Classifier (SVC) with a linear kernel.
   1. The svc object can be fit (trained) on labeled data and then used to predict labels for unlabeled data.

In [401]:
from sklearn.svm import SVC
svc = SVC(kernel='linear')
print(svc)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
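
The fit-then-predict pattern described above looks like this on a toy problem. The data are synthetic, well-separated points rather than brain data, and all names (`X_demo`, `y_demo`, `clf`, `X_new`) are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two synthetic, well-separated classes in 5 dimensions (not brain data).
X_class0 = rng.normal(loc=-2.0, scale=0.5, size=(20, 5))
X_class1 = rng.normal(loc=2.0, scale=0.5, size=(20, 5))
X_demo = np.vstack([X_class0, X_class1])
y_demo = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear")
clf.fit(X_demo, y_demo)  # learn from labeled samples

# Predict labels for new, unlabeled points placed at each class centre.
X_new = np.array([[-2.0] * 5, [2.0] * 5])
pred = clf.predict(X_new)
print(pred)
```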


####  Measuring prediction scores using cross-validation
   1. The proper way to measure error rates or prediction accuracy is via cross-validation: leaving out some data and testing on it.

Scikit-learn has tools that make cross-validation easier.
You can speed up the computation by using n_jobs=-1, which spreads the computation across all processors (but might not work under Windows):

In [279]:
# scikit-learn >= 0.18; older releases exposed this function in sklearn.cross_validation
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(svc, X, labels, cv=5, n_jobs=-1, verbose=10)
print(cv_scores)


ValueError: Found input variables with inconsistent numbers of samples: [360, 2]
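
The error above reports that X has 360 rows while the label array has only 2 entries, which is what happens when there is one label per subject instead of one per time point. A minimal sketch of checking this before cross-validating is shown below. It uses synthetic data and `sklearn.model_selection` (scikit-learn 0.18 or later); the names `X_demo`, `per_subject` and `y_demo` are illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in: 2 "subjects" x 30 time points, 8 features each.
n_timepoints = 30
X_demo = np.vstack([
    rng.normal(loc=-1.0, size=(n_timepoints, 8)),
    rng.normal(loc=1.0, size=(n_timepoints, 8)),
])
per_subject = np.array([0, 1])                 # one label per subject: too short
y_demo = np.repeat(per_subject, n_timepoints)  # one label per row of X_demo

# cross_val_score needs exactly one label per sample.
assert X_demo.shape[0] == y_demo.shape[0]

scores = cross_val_score(SVC(kernel="linear"), X_demo, y_demo, cv=5)
print(scores.shape)  # (5,)
```

Passing `per_subject` instead of `y_demo` would reproduce the same kind of `ValueError`, with sample counts `[60, 2]`.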