# Downloading a full project from XNAT

#### Maria Yanez Lopez 2018 (maria.yanez-lopez@imperial.ac.uk)
#### ~ adapted for full project download Niall Bourke Feb 2019 (n.bourke@imperial.ac.uk)
### Documentation: 

https://github.com/pyxnat/pyxnat/blob/master/pyxnat/core/downloadutils.py

https://groups.google.com/forum/#!topic/xnat_discussion/K8h4VP4CBMg

https://gist.github.com/mattsouth/db8f2d09acf3c57ba605fa93c4e8d03e

https://ubuntuforums.org/showthread.php?t=786879

https://wiki.imperial.ac.uk/pages/viewpage.action?spaceKey=HPC&title=Jupyter

Version 2.0 ~ Niall Bourke  
Updated 24/09/2019  
~ Checks for data that has already been pulled from XNAT, to allow rolling updates of the data on the HPC
 
This script downloads DICOM data from XNAT according to the user's specifications.

Works with Python 2.7 (update the environment libraries to Python 3 for continued support).

### Import python libraries

In [1]:
import sys, os, getpass                           
from pyxnat import Interface

### Enter your XNAT login details (same as college credentials) and project ID

In [2]:
userName = raw_input('Type XNAT User Name: ')
passWord = getpass.getpass('Type XNAT Password: ')
projectID = raw_input('Type XNAT Project ID: ')
server = 'http://cif-xnat.hh.med.ic.ac.uk'

Type XNAT User Name:  nbourke
Type XNAT Password:  ·········
Type XNAT Project ID:  BIO-AX-TBI


In [3]:
print 'INPUT'
print 'Server: ', server
print 'Username: ', userName
print 'Password: ', ''.join(['*']*len(passWord))
print 'ProjectID: ', projectID 

INPUT
Server:  http://cif-xnat.hh.med.ic.ac.uk
Username:  nbourke
Password:  *********
ProjectID:  BIO-AX-TBI
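
The two cells above rely on `raw_input` and the `print` statement, which exist only in Python 2. As a rough sketch of how the same prompts and summary might be written to run under both Python 2.7 and Python 3, something like the following could be used. The helper names `read_line`, `format_input_summary` and `prompt_for_login` are hypothetical and not part of the original notebook.

```python
import getpass

# raw_input exists only in Python 2; fall back to input on Python 3.
try:
    read_line = raw_input
except NameError:
    read_line = input


def format_input_summary(server, user_name, pass_word, project_id):
    """Return the INPUT summary as one string, masking the password."""
    lines = [
        'INPUT',
        'Server:  ' + server,
        'Username:  ' + user_name,
        'Password:  ' + '*' * len(pass_word),
        'ProjectID:  ' + project_id,
    ]
    return '\n'.join(lines)


def prompt_for_login():
    """Ask for the same three values as the cell above (interactive)."""
    user_name = read_line('Type XNAT User Name: ')
    pass_word = getpass.getpass('Type XNAT Password: ')
    project_id = read_line('Type XNAT Project ID: ')
    return user_name, pass_word, project_id
```

Calling `print(format_input_summary(server, userName, passWord, projectID))` passes a single string to `print`, so it behaves the same under both versions.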


### Create PYXNAT interface

In [4]:
central = Interface(server=server, user=userName, password=passWord)

# Full project download:
subjects = central.select.project(projectID).subjects().get()

# individual subject:
#subID="CIF02_S00677"
#subjects = central.select.project(projectID).subjects(subID).get()

print(subjects)
allSessions = []
number_subjects = 0

['CIF02_S00423', 'CIF02_S00424', 'CIF_S00714', 'CIF02_S00425', 'CIF02_S00426', 'CIF02_S00427', 'CIF_S01259', 'CIF02_S00728', 'CIF02_S00606', 'CIF02_S00738', 'CIF02_S00798', 'CIF02_S00823', 'CIF02_S00833', 'CIF02_S00869', 'CIF02_S00872', 'CIF02_S00874', 'CIF02_S00877', 'CIF02_S00922', 'CIF02_S00949', 'CIF02_S01022', 'CIF02_S01023', 'CIF02_S01069', 'CIF02_S01070', 'CIF02_S01073', 'CIF02_S01074', 'CIF02_S01075', 'CIF02_S01076', 'CIF02_S01077', 'CIF02_S01078', 'CIF02_S01079', 'CIF02_S00430', 'CIF02_S00432', 'CIF_S02259', 'CIF02_S00440', 'CIF02_S00472', 'CIF02_S00476', 'CIF02_S00477', 'CIF02_S00108', 'CIF02_S00746', 'CIF02_S00747', 'CIF02_S00761', 'CIF02_S00762', 'CIF02_S00777', 'CIF02_S00778', 'CIF02_S00793', 'CIF02_S00799', 'CIF02_S00816', 'CIF02_S00819', 'CIF02_S00820', 'CIF02_S00821', 'CIF02_S00471', 'CIF02_S00822', 'CIF02_S00899', 'CIF02_S00902', 'CIF02_S00957', 'CIF02_S01024', 'CIF02_S01071', 'CIF02_S01080', 'CIF02_S01159', 'CIF02_S01199', 'CIF02_S01204', 'CIF02_S01205', 'CIF02_S01247

### Browse through project, collect subjects/sessions/scans and print subject labels

In [5]:
for i, subject in enumerate(subjects):
    label = central.select.project(projectID).subject(subject).label()
    print label, ('%i/%i' % (i+1, len(subjects)))
    sessions = central.select.project(projectID).subjects(subject).experiments().get()
    allSessions.append(sessions)

IT4381265 1/346
SI0092660 2/346
CIF0608 3/346
IT4381266 4/346
IT4381269 5/346
IT4381272 6/346
CIF1075 7/346
IT079C001 8/346
IT6516127 9/346
SI0092827 10/346
CIF2581 11/346
IT079C006 12/346
CH001C004 13/346
CIF2646 14/346
IT078C005 15/346
IT4381513 16/346
CIF2654 17/346
CIF2691 18/346
IT4381525 19/346
CH001C010 20/346
CH001C008 21/346
SI009C006 22/346
SI009C007 23/346
IT078C007 24/346
SI009C008 25/346
SI009C009 26/346
SI009C010 27/346
SI009C011 28/346
IT651C001 29/346
IT651C002 30/346
CH0010016 31/346
CH0010014 32/346
CIF1757 33/346
CIF2277 34/346
IT4381209 35/346
IT4381273 36/346
IT4381274 37/346
CH0010001 38/346
IT6516397 39/346
CIF2550 40/346
IT438C001 41/346
CIF2551 42/346
CIF2566 43/346
IT6516386 44/346
IT438C002 45/346
CIF2582 46/346
CIF2588 47/346
IT438C003 48/346
IT0781663 49/346
SI0092710 50/346
IT0794629 51/346
IT079C005 52/346
CIF2671 53/346
CIF2675 54/346
CH001C003 55/346
CH001C007 56/346
IT6516576 57/346
CH001C014 58/346
IT0791082 59/346
CH0010034 60/346
SI0092993 61/346
SI

## Modify the output directory, where the datasets from XNAT will be saved
The path is set so that it always points to the TBI group raw directory, and the data are downloaded to a folder named after the project being downloaded.


In [6]:
#dirName = os.path.join('/rds/general/project/c3nl_shared/live/', projectID)
dirName = os.path.join('/rds/general/project/c3nl_djs_imaging_data/live/data/raw/', projectID)

# Create the target directory if it doesn't exist
if not os.path.exists(dirName):
    os.mkdir(dirName)
    print("Directory " , dirName ,  " Created ")
else:    
    print("Directory " , dirName ,  " already exists")
    
Results_Dir = dirName # needs to exist or next cell will throw error


('Directory ', '/rds/general/project/c3nl_djs_imaging_data/live/data/raw/BIO-AX-TBI', ' already exists')
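
Note that `os.mkdir` only creates the last path component, so the cell above fails if a parent directory is missing. A minimal sketch of an alternative that also creates missing parents, and that works under both Python 2.7 and Python 3, is given below. `ensure_directory` is a hypothetical helper name, not part of the original notebook.

```python
import errno
import os


def ensure_directory(path):
    """Create path (and any missing parents); return True if it was created."""
    try:
        os.makedirs(path)
        return True
    except OSError as err:
        # An existing directory is fine; anything else is a real error.
        if err.errno == errno.EEXIST and os.path.isdir(path):
            return False
        raise
```

It could be called as `ensure_directory(dirName)` before setting `Results_Dir`.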


### Download datasets
This cell loops over every session in the predefined project and downloads those that are not already present in the output directory. Check the printed output to look for duplicates and incomplete datasets.

In [7]:
import glob

for s, subjectID in enumerate(subjects):
    subjectLabel = central.select.project(projectID).subject(subjectID).label()

    for experimentID in allSessions[s]:
        scans = central.select.project(projectID).subject(subjectID).experiments(experimentID).scans()
        scanIDs = scans.get()

        # Look up the session label for this experiment
        coll = central.select.project(projectID).subject(subjectID).experiments(experimentID)
        for ese in coll:
            explab = ese.attrs.get('label')

        # Check if data has already been pulled: any path ending in the session label counts
        dataCheck = glob.glob(os.path.join(Results_Dir, subjectLabel, "*" + explab))
        if dataCheck:
            print(explab + " already pulled")
        elif len(scanIDs) == 0:
            print("There are no scans to download for", explab)
        else:
            print("Downloading:", explab)
            number_subjects += 1  # count only sessions that are actually downloaded
            filenames = central.select.project(projectID).subject(subjectID).experiment(experimentID).scans()
            filenames.download(Results_Dir, type='ALL', extract=False, removeZip=True)
print "The total number of scanning sessions downloaded is = " + str(number_subjects)


IT4381265_v1 already pulled
SI0092660_v1 already pulled
MINI_COAT002 already pulled
IT4381266_v1 already pulled
IT4381269_v1 already pulled
IT4381272_v1 already pulled
('Downloading:', 'COAT_HC_076')
IT079C001_v1 already pulled
IT6516127_v1 already pulled
SI0092827_v1 already pulled
UK0010084_v1 already pulled
UK0010084_v3 already pulled
UK0010084_v2 already pulled
IT079C006_v1 already pulled
CH001C004_v1 already pulled
UK0010043_v2 already pulled
IT078C005_v1 already pulled
IT4381513_v1 already pulled
COAT_HC046 already pulled
BIOAX_UK0010083_v1 already pulled
UK0010083_v3 already pulled
UK0010083_v2 already pulled
IT4381525_v1 already pulled
CH001C010_v1 already pulled
CH001C008_v1 already pulled
SI009C006_v1 already pulled
SI009C007_v1 already pulled
IT078C007_v1 already pulled
SI009C008_v1 already pulled
SI009C009_v1 already pulled
SI009C010_v1 already pulled
SI009C011_v1 already pulled
IT651C001_v1 already pulled
IT651C002_v1 already pulled
CH0010016_v1 already pulled
CH0010016_v2
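
The "already pulled" check in the cell above can be isolated into a small function and tried out without an XNAT connection. The sketch below assumes the same layout as the notebook's glob pattern (a folder per subject label containing entries whose names end in the session label). `session_already_pulled` is a hypothetical helper name.

```python
import glob
import os


def session_already_pulled(results_dir, subject_label, exp_label):
    """True if any path under results_dir/subject_label ends with exp_label."""
    pattern = os.path.join(results_dir, subject_label, '*' + exp_label)
    return len(glob.glob(pattern)) > 0
```

Because the pattern is a suffix match, a session label that is itself the ending of another session's label for the same subject would be reported as already pulled, so it may be worth checking the printed output for such cases.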

## Sweet, now we're rolling! 
To make life easy, all our lab's notebooks are going to assume a BIDS format.
As data curation can be a pain in the derrière, let's run a nice little function to sort that for us ;)

## Dependencies

#### A CIF_config.json has been created to match MRI acquisitions and label them in the correct format. 
This may need to be updated if new sequences are being collected. 
It requires the scan-card label of each acquisition being formatted (NOTE: unhelpfully, how these are displayed on the XNAT website does not necessarily match the actual data labels!)  

#### Index files
* I have used XDC (XNAT data client) to pull metadata about scan labels from XNAT.

* The BIDS scripts are hardcoded to look for this metadata in an indexFiles directory within the working directory. This should contain two files for the project: PROJECT_experiments.csv and PROJECT_subject.csv

* This requires local setup. I have a function that runs through a list of TBI projects on XNAT and downloads the metadata. If a project is not in the imaging directory, try the XDC setup below.

* The following XDC function can be used to pull project and subject information from XNAT


### XDC setup

Install via the following instructions:
https://wiki.xnat.org/xnat-tools/xnatdataclient


Use lines like the following to pull the CSV files used for indexing and renaming the files:

In [None]:
# XDC is an alias set in .bash_profile to the function, which can be downloaded from the above link

XDC -u USERNAME -p PASSWORD -r "http://wmec-transtec1.hh.med.ic.ac.uk:/data/archive/projects/PROJECT/experiments?format=csv" -o PROJECT_experiments
    
XDC -u USERNAME -p PASSWORD -r "http://wmec-transtec1.hh.med.ic.ac.uk:/data/archive/projects/PROJECT/subjects?format=csv" -o PROJECT_subject
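
Before running the BIDS scripts, it may help to confirm that both index files are where the scripts expect them. The sketch below takes the directory and file names from the description above (`indexFiles`, `PROJECT_experiments.csv`, `PROJECT_subject.csv`). Whether XDC's `-o` option adds the `.csv` extension itself is not confirmed here, so the names may need adjusting. `missing_index_files` is a hypothetical helper name.

```python
import os


def missing_index_files(working_dir, project):
    """Return the expected index files that are absent from working_dir/indexFiles."""
    index_dir = os.path.join(working_dir, 'indexFiles')
    expected = [project + '_experiments.csv', project + '_subject.csv']
    return [name for name in expected
            if not os.path.isfile(os.path.join(index_dir, name))]
```

An empty list means both files were found; otherwise the returned names are the files still to be pulled with XDC.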

## Extracting and indexing data from xnat

The following functions are in the dependencies folder on the Imperial HPC along with the CIF_config.json file

#### 1: bids_1_preproc -i project
    Unzips & indexes files downloaded from XNAT with more meaningful labels such as participant ID and scan session.  
    This sets up the initial file structure to run the conversion to BIDS.
    
    
#### 2: bids_2_proc -i project -c config.json
    Loops over all subjects->sessions->modalities->scans and converts DICOMs to NIfTI.   
    The labels for each of the scans on the scan card are then converted to match the BIDS format and file structure.  
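
The two steps could also be chained from Python. The following is only a sketch: it assumes `bids_1_preproc` and `bids_2_proc` are available on the `PATH` and accept exactly the arguments shown in the headings above. `build_bids_commands` and `run_bids_pipeline` are hypothetical helper names.

```python
import subprocess


def build_bids_commands(project, config_path):
    """Return the argument lists for the two BIDS steps, in order."""
    return [
        ['bids_1_preproc', '-i', project],
        ['bids_2_proc', '-i', project, '-c', config_path],
    ]


def run_bids_pipeline(project, config_path):
    """Run both steps in order, stopping at the first failure."""
    for command in build_bids_commands(project, config_path):
        # check_call raises CalledProcessError if a step exits non-zero,
        # so step 2 never runs on the output of a failed step 1.
        subprocess.check_call(command)
```

For example, `run_bids_pipeline('BIO-AX-TBI', 'CIF_config.json')` would run the unzip/index step and then the conversion step.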
    
#### Sources of error
* Conversion to nii at this point should be robust and all data will be in raw under the project name
* Missing data in source directory is likely due to a new exception in how something was named on the scanner - this should be added to the config.json file. Be careful not to clash with similar names. 
* This works well for data coming off the CIF scanner (Imperial). Data from new sites have to be checked, as something in the structure may cause unexpected outcomes. 