#  Downloading data from XNAT
Niall Bourke (n.bourke@imperial.ac.uk)
  
#### Version 3.0 

Updates:   
~ 2018 Maria Yanez-Lopez  
~ 24/09/2019: Checks for data that is already pulled from xnat to allow rolling updating of data on the HPC  
~ 12/10/2021: Adapting for new broken XNAT. **Requires project access**  
~ Nov 2021: incorporation of BIDS converter scripts 
~ May 2022: Documentation & containerisation 
 
  
### Documentation: 

This script downloads DICOM data from XNAT according to the user's specifications.

Updated environment libraries for py3

https://github.com/pyxnat/pyxnat/blob/master/pyxnat/core/downloadutils.py

https://groups.google.com/forum/#!topic/xnat_discussion/K8h4VP4CBMg

https://gist.github.com/mattsouth/db8f2d09acf3c57ba605fa93c4e8d03e

https://ubuntuforums.org/showthread.php?t=786879

https://wiki.imperial.ac.uk/pages/viewpage.action?spaceKey=HPC&title=Jupyter

## Dependencies
Some requirements need to be in place. These should be set up from your home directory in a terminal. Set up a py3.5 env (with Jupyter kernel = ipykernel)  


>module load anaconda3/personal  
>anaconda-setup  
>conda create -n py35 python=3.5 ipykernel  
>source activate py35  
>pip install pyxnat  
>conda install jq  #pip install jq  
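
Before opening the notebook it can help to confirm that the environment really contains what the later cells need. The following is a minimal sketch, assuming the requirements are the pyxnat and pandas packages plus the jq and java executables; check_dependencies is a hypothetical helper name, not part of any of these tools.

```python
import importlib.util
import shutil


def check_dependencies(modules=("pyxnat", "pandas"), executables=("jq", "java")):
    """Map each requirement name to True (found) or False (missing)."""
    status = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located
        status[name] = importlib.util.find_spec(name) is not None
    for name in executables:
        # shutil.which returns None when the executable is not on PATH
        status[name] = shutil.which(name) is not None
    return status


for name, found in sorted(check_dependencies().items()):
    print("%-8s %s" % (name, "ok" if found else "MISSING"))
```

Run it from inside the py35 env so that the result reflects the kernel the notebook will use.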

### Import python libraries

In [None]:
import sys, os, getpass                           
from pyxnat import Interface
import pandas as pd

### Introduce your XNAT login details (same as college credentials) and project folder

In [None]:
userName = input('Type XNAT User Name: ')
passWord = getpass.getpass('Type XNAT Password: ')
projectID = input('Type XNAT Project ID: ')
server = 'http://cif-xnat.hh.med.ic.ac.uk'

In [None]:
print ('INPUT')
print ('Server: ', server)
print ('Username: ', userName)
print ('Password: ', ''.join(['*']*len(passWord)))
print ('ProjectID: ', projectID )

### Create PYXNAT interface
*If no files are detected check password and access to xnat project*

In [None]:
central = Interface(server=server, user=userName, password=passWord)

# Full project download:
subjects = central.select.project(projectID).subjects().get()

# individual subject:
#subID= "CIF3_S04363" 
#subjects = central.select.project(projectID).subjects(subID).get()

print(subjects)
#head(subjects)
allSessions = []
number_subjects = 0
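
An empty subject list usually points to a wrong password or missing project access rather than an empty project. A minimal sketch of a guard that fails early with a readable message, assuming subjects is the plain list returned by .get(); require_subjects is a hypothetical helper, not a pyxnat function, and the IDs below are placeholders.

```python
def require_subjects(subjects, project_id):
    """Return the subject list unchanged, or raise if it is empty."""
    if not subjects:
        raise RuntimeError(
            "No subjects found for project %r: check your password and "
            "that your account has access to this XNAT project." % project_id
        )
    return list(subjects)


# Placeholder IDs, not real XNAT identifiers
checked = require_subjects(["SUBJ_A", "SUBJ_B"], "DEMO_PROJECT")
print(len(checked), "subjects found")
```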

### Browse through project, collect subjects/sessions/scans and print subject labels

In [None]:
for i, subject in enumerate(subjects):
    label = central.select.project(projectID).subject(subject).label()
    print (label, ('%i/%i' % (i+1, len(subjects))))
    sessions = central.select.project(projectID).subjects(subject).experiments().get()
    allSessions.append(sessions)
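
The download step relies on allSessions lining up with subjects by index. Because allSessions is initialised in an earlier cell, re-running the loop above appends every session list a second time. A minimal sketch of a check that catches this, using placeholder IDs; pair_sessions is a hypothetical helper name.

```python
def pair_sessions(subjects, all_sessions):
    """Flatten two parallel lists into (subject_id, session_id) tuples."""
    if len(subjects) != len(all_sessions):
        raise ValueError(
            "Expected one session list per subject, got %i lists for %i subjects"
            % (len(all_sessions), len(subjects))
        )
    return [
        (subject_id, session_id)
        for subject_id, sessions in zip(subjects, all_sessions)
        for session_id in sessions
    ]


demo_subjects = ["SUBJ_A", "SUBJ_B"]
demo_sessions = [["EXP_1", "EXP_2"], ["EXP_3"]]
pairs = pair_sessions(demo_subjects, demo_sessions)
print(len(pairs), "sessions across", len(demo_subjects), "subjects")
```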

## Modify the output directory, where the datasets will be saved from XNAT

* The path is currently set to the raw data directory under c3nl_djs_imaging_data, and data will download to a folder named after the project being downloaded
* For curation purposes, a defined location should be set to host the raw data


In [None]:
dirName = os.path.join('/rds/general/project/c3nl_djs_imaging_data/live/data/raw/', projectID)

# Create the target directory if it doesn't exist
if not os.path.exists(dirName):
    os.mkdir(dirName)
    print("Directory", dirName, "created")
else:
    print("Directory", dirName, "already exists")
        
Results_Dir = dirName # needs to exist or next cell will throw error
idx_dir = ('/rds/general/project/c3nl_djs_imaging_data/live/data/indexFiles')
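
os.mkdir raises an error when a parent folder is missing, which can happen on a fresh project space. A minimal sketch of an alternative using os.makedirs with exist_ok=True, demonstrated in a temporary directory so nothing on the cluster is touched; ensure_dir is a hypothetical helper name.

```python
import os
import tempfile


def ensure_dir(path):
    """Create path and any missing parents; return True if it was newly created."""
    existed = os.path.isdir(path)
    os.makedirs(path, exist_ok=True)
    return not existed


with tempfile.TemporaryDirectory() as scratch:
    target = os.path.join(scratch, "raw", "DEMO_PROJECT")
    first = ensure_dir(target)   # parents are created as needed
    second = ensure_dir(target)  # already present, nothing to do
    print(first, second)
```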

### Download datasets
This script will look into the predefined project. Check the printed output to look for duplicates and incomplete datasets.

In [None]:
import glob 

for s, subjectID in enumerate(subjects):
    subjectLabel = central.select.project(projectID).subject(subjectID).label()

    for experimentID in allSessions[s]:
        scans = central.select.project(projectID).subject(subjectID).experiments(experimentID).scans()
        scanIDs = scans.get()

        # Look up the experiment (session) label for this experiment ID
        coll = central.select.project(projectID).subject(subjectID).experiments(experimentID)
        for ese in coll:
            explab = ese.attrs.get('label')

        # Check if data has already been pulled: any matching path counts as present
        dataCheck = glob.glob(os.path.join(Results_Dir, subjectLabel, "*" + explab))
        if dataCheck:
            print(explab + " already pulled")
        elif len(scanIDs) == 0:
            print("There are no scans to download for", explab)
        else:
            print("Downloading:", explab)
            number_subjects += 1  # counts sessions that are actually downloaded
            filenames = central.select.project(projectID).subject(subjectID).experiment(experimentID).scans()
            filenames.download(Results_Dir, type='ALL', extract=False)  # removeZip=True
print ("The total number of scanning sessions downloaded is = " + str(number_subjects))
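
The rolling update depends on the "already pulled" check, so it is worth testing on its own. A minimal sketch of that check as a standalone function, using the same glob pattern as the cell above (any path under the subject folder whose name ends with the session label); already_pulled and the folder names are hypothetical.

```python
import glob
import os
import tempfile


def already_pulled(results_dir, subject_label, session_label):
    """Return True if any path under the subject folder ends with session_label."""
    # glob.escape stops characters such as [ or ? in labels acting as wildcards
    pattern = os.path.join(
        glob.escape(results_dir),
        glob.escape(subject_label),
        "*" + glob.escape(session_label),
    )
    return len(glob.glob(pattern)) > 0


with tempfile.TemporaryDirectory() as results_dir:
    os.makedirs(os.path.join(results_dir, "SUBJ_A", "scan_SESSION_1"))
    hit = already_pulled(results_dir, "SUBJ_A", "SESSION_1")
    miss = already_pulled(results_dir, "SUBJ_A", "SESSION_2")
    print(hit, miss)
```

Because the match is on the end of the name, a session label that is also the tail of another session's label would count as already pulled.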


## Sweet now we're rolling! 
To make life easy, all our lab's notebooks are going to assume a BIDS format.
The following curates data in a standardised format, which will be the starting point of analysis pipelines.

## Dependencies

#### A CIF_config.json has been created to match MRI acquisitions and label them in the correct format. 
This may need to be updated if new sequences are being collected. 
Requires the labels from the scan card for each acquisition being formatted (NOTE: how these are displayed on the XNAT website unhelpfully does not necessarily match the actual data labels!)  
**The following BIDS scripts are currently hardcoded to use the config file located in the shared dependencies folder**

#### Index files
* I have used XDC (XNAT data client) to pull metadata about scan labels from XNAT.

* The BIDS scripts are hardcoded to look for this metadata in an indexFiles directory within the working dir. This should contain two files for the project: PROJECT_experiments.csv and PROJECT_subject.csv

* The following XDC function can be used to pull project and subject information from xnat
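
Since the BIDS scripts expect both index files, a quick check before submitting the job can save a failed run. A minimal sketch, assuming the file names follow the PROJECT_experiments.csv and PROJECT_subject.csv pattern described above (adjust the names if your XDC output is saved differently); missing_index_files is a hypothetical helper name.

```python
import os
import tempfile


def missing_index_files(index_dir, project_id):
    """List the expected index files that are not present in index_dir."""
    expected = [project_id + "_experiments.csv", project_id + "_subject.csv"]
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(index_dir, name))
    ]


with tempfile.TemporaryDirectory() as index_dir:
    # Create only one of the two files to show what is reported
    open(os.path.join(index_dir, "DEMO_experiments.csv"), "w").close()
    missing = missing_index_files(index_dir, "DEMO")
    print("Missing index files:", missing)
```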


### XDC setup
**This folder has been saved in the dependencies folder on the cluster (17/11/2021)** 
Installed via the following instructions:
https://wiki.xnat.org/xnat-tools/xnatdataclient

### jq setup

**UPDATE**
- Pass python variables and save output to working directory
- Check paths in bash scripts called (CIF_unzip & bids_beta)
- For BIDS, make sure jq is installed in the env that is run (i.e. load py35 in the terminal and check that jq is installed in that env)
>brew install jq

# Run BIDS curation 
- Pulls metadata from xnat and stores in csv
- Runs wrapper for dcm2niix (CIF_unzip)
- Runs Curation script (bids_proc). 

* Some projects on XNAT are prefixed 'Study__'. This has caused problems, and separate scripts have been included to handle this (CIF_unzip_Study__, bids_proc_Study__)

In [None]:
%%bash -s "$userName" "$passWord" "$projectID" 

# Set variables
username=${1} 
password=${2}
ID=${3}
path=/rds/general/project/c3nl_djs_imaging_data/live/data/indexFiles
dep=/rds/general/project/c3nl_shared/live/dependencies
cwd=$(pwd)
job=${cwd}/bidJob.txt

# Build the XNAT index URLs
input1="http://wmec-transtec1.hh.med.ic.ac.uk:/data/archive/projects/${ID}/experiments?format=csv"
input2="http://wmec-transtec1.hh.med.ic.ac.uk:/data/archive/projects/${ID}/subjects?format=csv"

# Run the XNAT data client (XDC)
## This updates the indexing information from XNAT, which the BIDS scripts rely on
## Variables are quoted so that special characters in the password or URL are not interpreted by the shell
xdc=${dep}/data-client-shadow-1.7.6/lib/XnatDataClient-1.7.6-all.jar
java -jar "${xdc}" -u "${username}" -p "${password}" -r "${input1}" -o "${path}/${ID}_experiments"
java -jar "${xdc}" -u "${username}" -p "${password}" -r "${input2}" -o "${path}/${ID}_subject"

# CIF_unzip
# bids_proc 

# Set job script
echo "source activate py35; ${cwd}/lib/CIF_unzip_Study__ -i ${ID}; ${cwd}/lib/bids_proc_Study__ -i ${ID} -c CIF_config.json" > ${job}

# Run job
${dep}/hpcSubmit ${job} 12:00:00 3 6Gb
echo ""; echo "***"; echo ""; echo "Submitted commands:"
head ${job}
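
The same XDC request can also be assembled in Python, which avoids shell quoting problems when a password contains special characters. A minimal sketch that only builds the argument list, using the -u, -p, -r and -o flags seen in the cell above; xdc_command, the host, the jar path and the credentials are all placeholders, and nothing is executed.

```python
def xdc_command(jar, username, password, host, project_id, resource, out_path):
    """Build the argument list for one XnatDataClient CSV request."""
    url = "%s/data/archive/projects/%s/%s?format=csv" % (
        host.rstrip("/"), project_id, resource
    )
    return ["java", "-jar", jar, "-u", username, "-p", password,
            "-r", url, "-o", out_path]


# Placeholder values only
cmd = xdc_command(
    jar="/path/to/XnatDataClient-1.7.6-all.jar",
    username="demo_user",
    password="demo_pass",
    host="http://xnat.example.org",
    project_id="DEMO",
    resource="experiments",
    out_path="indexFiles/DEMO_experiments",
)

# Hide the password before printing the command
masked = ["****" if i > 0 and cmd[i - 1] == "-p" else arg
          for i, arg in enumerate(cmd)]
print(" ".join(masked))
```

Passing the list to subprocess.run(cmd, check=True) would run it without going through a shell.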
 

## Extracting and indexing data from xnat

The following functions are in the dependencies folder on the Imperial HPC along with the CIF_config.json file
* New acquisitions need to be added to the CIF_config.json (this is a sort of dictionary for standardised naming)

#### 1: CIF_unzip -i project
    Unzips & indexes files downloaded from XNAT with more meaningful labels such as participant ID and scan session.  
    This sets up the initial file structure to run the conversion to BIDS.
    
#### 2: bids_proc -i project -c config.json
    Loops over all subjects->sessions->modalities->scans and converts DICOMs to NIfTI.   
    The labels for each of the scans on the scan card are then converted to match the BIDS format and file structure  
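
To illustrate the target layout, here is a minimal sketch of how a BIDS-style relative path is put together from a subject, session, datatype and suffix. It is not the bids_proc implementation; bids_label and bids_path are hypothetical helpers and the IDs are placeholders. BIDS labels may contain only letters and digits, so other characters are stripped.

```python
import posixpath
import re


def bids_label(raw):
    """Keep only letters and digits, as required for BIDS sub-/ses- labels."""
    label = re.sub(r"[^A-Za-z0-9]", "", raw)
    if not label:
        raise ValueError("Label %r has no alphanumeric characters" % raw)
    return label


def bids_path(subject, session, datatype, suffix):
    """Return a relative path such as sub-X/ses-Y/anat/sub-X_ses-Y_T1w.nii.gz."""
    sub = "sub-" + bids_label(subject)
    ses = "ses-" + bids_label(session)
    filename = "%s_%s_%s.nii.gz" % (sub, ses, suffix)
    return posixpath.join(sub, ses, datatype, filename)


example = bids_path("SUBJ_001", "visit-1", "anat", "T1w")
print(example)
```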
    
#### Sources of error
* Conversion to nii at this point should be robust and all data will be in raw under the project name
* Missing data in source directory is likely due to a **new exception** in how something was named on the scanner - this should be added to the config.json file. Be careful not to clash with similar names. 
* This works well for data coming off the CIF scanner (Imperial). Data from new sites have to be checked/validated, as something in the structure may cause unexpected outcomes. 
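
When adding a new scan name to the config, a quick scan for overlapping names can flag likely clashes. A minimal sketch, assuming (this is an assumption, not confirmed by the scripts) that scan labels are matched as case-insensitive substrings; find_label_clashes is a hypothetical helper and the labels are made up for illustration.

```python
def find_label_clashes(labels):
    """Return (short, long) pairs where one label is contained in another."""
    lowered = sorted(set(label.lower() for label in labels))
    clashes = []
    for short in lowered:
        for long in lowered:
            if short != long and short in long:
                clashes.append((short, long))
    return clashes


# Made-up scan labels, not taken from CIF_config.json
demo_labels = ["T1_MPRAGE", "T1_MPRAGE_ND", "rsfMRI"]
clashes = find_label_clashes(demo_labels)
print(clashes)
```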


##### Known bugs
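An XDC alias set in .bash_profile won't be sourced in Jupyter. This is most likely because %%bash cells run a non-interactive, non-login shell, which does not read .bash_profile or expand aliases. In notebook cells, call the jar by its full path, as the BIDS curation cell above does.

To use the alias in a terminal, add the following lines to your .bash_profile (the username in bold is the part to change):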
#XDC
alias XDC='java -jar /rds/general/user/**nbourke**/home/data-client-shadow-1.7.6/lib/XnatDataClient-1.7.6-all.jar'
