## Narrowing the data based on sequence type
I previously sorted the data into 4 cohorts (see "mri_data_sort_to_cohorts.ipynb"). <br>
Now I want to keep only the relevant MRI sequence types: MPRAGE and FSPGR. <br> 
<br>
The folders in my 4 cohorts follow 3 different naming conventions. Here is one example of each: <br>
1) 1018_NACC282203_20170908ni <br>
2) mri129ni<br>
3) NACC497363_128401136192134176253428319601354034337135ni<br> 

All folders of type 1) contain MPRAGE scans, all folders of type 2) contain FSPGR scans, and folders of type 3) contain either FSPGR or MPRAGE scans.
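
The three conventions can be told apart with regular expressions. This is a minimal sketch: the patterns are inferred only from the three example names above, and the labels `type_1`, `type_2`, `type_3` and the function `classify_folder_name` are hypothetical names that do not appear in the notebook.

```python
import re

# Patterns inferred from the three example folder names (an assumption)
CONVENTIONS = [
    ("type_1", re.compile(r'^\d{4}_NACC\d+_\d{8}ni$')),  # e.g. 1018_NACC282203_20170908ni
    ("type_2", re.compile(r'^mri\d+ni$')),               # e.g. mri129ni
    ("type_3", re.compile(r'^NACC\d+_\d+ni$')),          # e.g. NACC497363_<long number>ni
]

def classify_folder_name(name):
    """Return the label of the first convention that matches, or None."""
    for label, regex in CONVENTIONS:
        if regex.match(name):
            return label
    return None

for example in ['1018_NACC282203_20170908ni', 'mri129ni', 'something_else']:
    print(example, '->', classify_folder_name(example))
```

Folder names that return `None` would be worth inspecting by hand, since they follow none of the three conventions.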

In [6]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os 
import re
import shutil

In [55]:
ncPath = '../../NACC_data/sorted_cohorts/NC/'
mciPath = '../../NACC_data/sorted_cohorts/MCI/'
alzdPath = '../../NACC_data/sorted_cohorts/ALZD/'
transPath = '../../NACC_data/sorted_cohorts/TRANS/'

In [56]:
# Convert the relative path to an absolute path
ncPath = os.path.abspath(ncPath)
mciPath = os.path.abspath(mciPath)
alzdPath = os.path.abspath(alzdPath)
transPath = os.path.abspath(transPath)

# Modify the absolute path for long path support on Windows
if os.name == 'nt':                     # Check if the operating system is Windows
    ncPath = '\\\\?\\' + ncPath
    mciPath = '\\\\?\\' + mciPath
    alzdPath = '\\\\?\\' + alzdPath
    transPath = '\\\\?\\' + transPath

In [57]:
print(f"Modified path for Windows: {ncPath}")
print(f"Modified path for Windows: {mciPath}")
print(f"Modified path for Windows: {alzdPath}")
print(f"Modified path for Windows: {transPath}")

Modified path for Windows: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\NC
Modified path for Windows: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\MCI
Modified path for Windows: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\ALZD
Modified path for Windows: \\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\TRANS


### 1) 1018_NACC282203_20170908ni
Directory structure example: <br>
<br>
1018_NACC282203_20170908ni 
- 1018_NACC282203_20170908 
    - Mag_Images_17_1312211075219452552017090815014550782025439000
    - mIP_ImagesSW_19_1312211075219452552017090815014550780625438000
    - MPRAGE_GRAPPA2_6_1312211075219452552017090814552268696468275000
    - Pha_Images_18_1312211075219452552017090815014550782825440000
    - SWI_Images_20_1312211075219452552017090815014550784425442000
    - T2FLAIRSPACENEW_7_1312211075219452552017090814555267209169171000

I will only keep the MPRAGE folder and delete the others.

The paths were converted to the `\\?\` long-path form because Windows limits paths to 260 characters by default, which causes errors in this case.
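
The conversion could also be wrapped in a small helper that never applies the prefix twice. This is a minimal sketch: `to_long_path` and its `os_name` parameter are hypothetical names, and the parameter exists only so the behaviour can be checked on any operating system.

```python
import os

LONG_PATH_PREFIX = '\\\\?\\'   # the literal characters \\?\

def to_long_path(path, os_name=os.name):
    """Return an absolute path, with the long-path prefix added on Windows."""
    # A path that already carries the prefix is returned unchanged
    if path.startswith(LONG_PATH_PREFIX):
        return path
    abs_path = os.path.abspath(path)
    # Only Windows ('nt') understands the prefix; other systems get the plain absolute path
    return LONG_PATH_PREFIX + abs_path if os_name == 'nt' else abs_path

print(to_long_path('../../NACC_data/sorted_cohorts/NC/'))
```

Calling the helper on its own result leaves the path unchanged, which matters when a path built from an already prefixed parent is converted again.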

In [58]:
# Regular expression pattern for folder names starting with 4 digits followed by an underscore
pattern = r'^\d{4}_'

In [59]:
# List to store the matching folders
matching_folders = []

# Iterate over the items in the directory
for item in os.listdir(ncPath):
    item_path = os.path.join(ncPath, item)
    
    # Check if the item is a folder and matches the pattern
    if os.path.isdir(item_path) and re.match(pattern, item):
        matching_folders.append(item_path)

In [60]:
for folder in matching_folders:
    print(folder)

\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\TRANS\1018_NACC356689_20171019ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\TRANS\1018_NACC356689_20201102ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\TRANS\1018_NACC450406_20180615ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\TRANS\1018_NACC450406_20210128ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\TRANS\1018_NACC838157_20170510ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\TRANS\1018_NACC838157_20200820ni


In [61]:
# Prefix of the subfolder to keep (a plain string, not a regular expression)
keep_prefix = 'MPRAGE'

The deletion call is commented out, since it only needed to run once (to delete the folders).

In [None]:
# Loop through each folder in matching_folders
for folder in matching_folders:

    # Get the path to the subfolder (e.g., 1018_NACC282203_20170908)
    subfolder_path = os.path.join(folder, os.listdir(folder)[0])         # only 1 subfolder exists at this level

    # List all subfolders in the subfolder_path
    subfolders = os.listdir(subfolder_path)

    for subfolder in subfolders:
        
        subfolder_full_path = os.path.join(subfolder_path, subfolder)
        
        # Select every subfolder whose name does NOT start with 'MPRAGE'
        if os.path.isdir(subfolder_full_path) and not subfolder.startswith(keep_prefix):

            # Deletion is commented out because it already ran once; while it stays
            # commented out, the message below only reports what would be deleted
            #shutil.rmtree(subfolder_full_path)
            print(f"Deleted: {subfolder_full_path}")
        
        else:
            print(f"Kept: {subfolder_full_path}")
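
The loop above relies on commenting `shutil.rmtree` in and out. A dry-run flag is one alternative. This is a minimal sketch: `prune_subfolders` and `dry_run` are hypothetical names that do not appear in the notebook, and the function assumes it is given the folder that directly contains the sequence subfolders.

```python
import os
import shutil

def prune_subfolders(parent, keep_prefix, dry_run=True):
    """Delete subfolders of `parent` whose names do not start with `keep_prefix`.

    With dry_run=True nothing is deleted; the function only collects names.
    Returns (to_delete, kept) as two lists of entry names.
    """
    to_delete, kept = [], []
    for name in sorted(os.listdir(parent)):
        full_path = os.path.join(parent, name)
        if os.path.isdir(full_path) and not name.startswith(keep_prefix):
            to_delete.append(name)
            if not dry_run:
                shutil.rmtree(full_path)   # irreversible, so it runs only on request
        else:
            kept.append(name)
    return to_delete, kept
```

Running it first with the default `dry_run=True` shows which folders would go, and a second call with `dry_run=False` performs the deletion.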

### 2) mri129ni
Directory structure example: <br>
<br>
mri129ni
- s5_dti_DTI
- s6_bravo_T1
- s8_cubet2_T2
- s9_cubet2flair_T2_Flair

I will keep only the s6_bravo_T1 subfolder (that is, every subfolder that has T1 in its name).

IMPORTANT: not every mri folder has this substructure. Some look like this: <br>
<br>
mri1900ni
- scans
    - many subfolders

or like this: <br>
<br>
mri6615ni
- DICOM
    - many files

or like this: <br>
<br>
mri8192ni
- NACC581039
    - nacc
        - many subfolders

There are probably other structures as well, so keep this in mind.
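
Before deleting anything, the layouts can be counted to see which structures actually occur. This is a minimal sketch: the label strings and the functions `layout_label` and `survey_layouts` are hypothetical and only cover the cases listed above.

```python
import os
from collections import Counter

def layout_label(folder):
    """Give a coarse label for the top-level layout of one mri folder."""
    subdirs = sorted(e for e in os.listdir(folder)
                     if os.path.isdir(os.path.join(folder, e)))
    if any('T1' in d for d in subdirs):
        return 'has T1 subfolder'
    if len(subdirs) == 1:
        # Group all NACC<id> wrappers under one label
        name = 'NACC*' if subdirs[0].startswith('NACC') else subdirs[0]
        return f'single subfolder: {name}'
    return f'other ({len(subdirs)} subfolders)'

def survey_layouts(parent, prefix='mri'):
    """Count layout labels over the subfolders of `parent` that start with `prefix`."""
    counts = Counter()
    for name in os.listdir(parent):
        path = os.path.join(parent, name)
        if os.path.isdir(path) and name.startswith(prefix):
            counts[layout_label(path)] += 1
    return counts
```

Any label starting with `other` points to a structure that is not covered by the examples and should be checked manually.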

In [153]:
pattern = 'mri'

In [154]:
# List to store the matching folders
matching_folders = []

# Iterate over the items in the directory
for item in os.listdir(alzdPath):
    item_path = os.path.join(alzdPath, item)
    
    # Check if the item is a folder and matches the pattern
    if os.path.isdir(item_path) and re.match(pattern, item):
        matching_folders.append(item_path)

In [155]:
for folder in matching_folders:
    print(folder)

\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\ALZD\mri1901ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\ALZD\mri1925ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\ALZD\mri5006ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\ALZD\mri5008ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\ALZD\mri5038ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\ALZD\mri5039ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\ALZD\mri5040ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\ALZD\mri5041ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\ALZD\mri5043ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\ALZD\mri5044ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\ALZD\mri5045ni
\\?\c:\Users\Crt\Desktop\WIMR\asymmetryAD\NACC_data\sorted_cohorts\ALZD\mri5046ni
\\?\c:\Users\Crt

In [156]:
# Substrings and prefixes used to decide which subfolders to keep (plain strings, not regular expressions)
T1_prefix = 'T1'
nacc_prefix = 'NACC'
mprage_prefix = 'MPRAGE'
fspgr_prefix = 'FSPGR'

In [157]:
# Storing the names of deleted patients
deleted_mri_folders = []

In [159]:
# Loop through each folder in matching_folders
for folder in matching_folders:

    # List all subfolders in the folder_path
    subfolders = os.listdir(folder)

    # Folders that contain only 1 subfolder (those have different substructures)
    if len(subfolders) == 1:

        for subfolder in subfolders:

            subfolder_full_path = os.path.abspath(os.path.join(folder, subfolder))          # Absolute path
            # Add the long-path prefix only on Windows and only if `folder` did not already carry it
            if os.name == 'nt' and not subfolder_full_path.startswith('\\\\?\\'):
                subfolder_full_path = '\\\\?\\' + subfolder_full_path

            # Keep only the folders whose single subfolder starts with NACC; delete the rest
            if os.path.isdir(subfolder_full_path) and not subfolder.startswith(nacc_prefix):

                deleted_mri_folders.append(folder)
                shutil.rmtree(folder)
                print(f"Deleted: {folder}")

            # If it does start with NACC, check inside the NACC folder for MPRAGE or FSPGR
            elif os.path.isdir(subfolder_full_path) and subfolder.startswith(nacc_prefix):

                subfolders_in_nacc = os.listdir(subfolder_full_path)                         # List subfolders inside NACC

                for subs in subfolders_in_nacc:

                    subs_full_path = os.path.abspath(os.path.join(subfolder_full_path, subs))
                    # Add the long-path prefix only on Windows and only if it is not already there
                    if os.name == 'nt' and not subs_full_path.startswith('\\\\?\\'):
                        subs_full_path = '\\\\?\\' + subs_full_path

                    # Check if subfolder does NOT contain 'MPRAGE' or 'FSPGR'
                    if os.path.isdir(subs_full_path) and (mprage_prefix not in subs and fspgr_prefix not in subs):

                        shutil.rmtree(subs_full_path)
                        print(f"Deleted: {subs_full_path}")

                    else:

                        print(f"Kept: {subs_full_path}")

    else:
        # For folders that contain multiple subfolders, delete subfolders that don't have 'T1'
        for subfolder in subfolders:

            subfolder_full_path = os.path.abspath(os.path.join(folder, subfolder))            # Absolute path
            # Add the long-path prefix only on Windows and only if it is not already there
            if os.name == 'nt' and not subfolder_full_path.startswith('\\\\?\\'):
                subfolder_full_path = '\\\\?\\' + subfolder_full_path

            # Select directories whose name does not contain 'T1' anywhere
            if os.path.isdir(subfolder_full_path) and T1_prefix not in subfolder:

                # If the subfolder doesn't contain 'T1', delete it
                shutil.rmtree(subfolder_full_path)
                print(f"Deleted: {subfolder_full_path}")

            else:
                
                print(f"Kept: {subfolder_full_path}")

The deleted mri folders are saved so that I can later correct the .csv files that list which patients I have.

In [160]:
writePath = '../../NACC_data/sorted_cohorts/'

In [161]:
print(len(deleted_mri_folders))

0


In [85]:
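def save_common_naccids_to_txt(array, filepath):
    # Write one entry of `array` per line to the text file at `filepath`
    with open(filepath, 'w') as f:
        for entry in array:                 # separate name, so `filepath` is not shadowed
            f.write(f"{entry}\n")

# The save is commented out because it already ran once; uncomment to write the file again
#save_common_naccids_to_txt(deleted_mri_folders, writePath + 'alzd_deleted.txt')

print("Save to .txt complete!")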

Save to .txt complete!
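
The later .csv correction could then drop every row that refers to a deleted folder. This is a minimal sketch under stated assumptions: the column name `folder` and the functions `read_rows` and `drop_deleted_rows` are hypothetical, since the layout of the .csv files is not shown in this notebook.

```python
import csv

def read_rows(csv_path):
    """Read a .csv file into a list of dicts, one per row."""
    with open(csv_path, newline='') as f:
        return list(csv.DictReader(f))

def drop_deleted_rows(rows, deleted_paths, folder_column='folder'):
    """Return the rows whose folder name is not among the deleted folders.

    `folder_column` is an assumed column holding folder names such as 'mri129ni'.
    """
    # Reduce each saved path to its last component, accepting both '\\' and '/' separators
    deleted_names = {p.replace('\\', '/').rstrip('/').split('/')[-1]
                     for p in deleted_paths}
    return [row for row in rows if row[folder_column] not in deleted_names]
```

The paths stored in `deleted_mri_folders` can be passed in directly, because only the final folder name of each path is compared.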
