# 01 - Find Files

**Search for the Files**
_______
**Formatting the Directory-Pattern Dictionary**

The function `glob_multiple_file_paths` expects a dictionary in which each key is a root directory where the search starts and each value is the file pattern to match within that directory. The root directory may itself contain wildcards, as the `sub-*` component in the cell further down shows.

Example Dictionary Format:

>dir_pattern_dict = {
>    '/path/to/first/root_dir': '*.nii',
>
>    '/path/to/second/root_dir': '*.nii.gz',
>
>    '/another/path': '*_label.nii'
>    # Add more key-value pairs as needed
>}
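
As a rough illustration of what a directory-pattern dictionary implies, the sketch below globs each root directory with its pattern using only the standard library. It is not the implementation of `glob_multiple_file_paths`; the helper name `glob_dir_patterns` and the temporary files are assumptions made for this example.

```python
import glob
import os
import tempfile


def glob_dir_patterns(dir_pattern_dict):
    """Hypothetical stand-in: glob every root directory with its pattern."""
    paths = []
    for root_dir, pattern in dir_pattern_dict.items():
        # recursive=True lets a '**' component span nested directories
        paths.extend(glob.glob(os.path.join(root_dir, pattern), recursive=True))
    return sorted(paths)


# Build a throwaway directory so the example runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.nii", "b.nii", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    demo_paths = glob_dir_patterns({tmp: "*.nii"})
    demo_names = [os.path.basename(p) for p in demo_paths]

print(demo_names)
```

Only the two `.nii` files are collected; `notes.txt` does not match the pattern and is skipped.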

Using Wildcards:

The file patterns can include wildcards to match multiple files:
- `*`: Matches zero or more characters
  - `*.nii` will match all files ending with `.nii`
- `**`: Searches all directories recursively
- `?`: Matches any single character
  - `file?.nii` will match `file1.nii`, `file2.nii`, etc.
- `[seq]`: Matches any character in seq
  - `file[1-3].nii` will match `file1.nii`, `file2.nii`, `file3.nii`
- `[!seq]`: Matches any character NOT in seq
  - `file[!1-3].nii` will match any file that doesn't have 1, 2, or 3 in that position, like `file4.nii`, `file5.nii`, etc.

Feel free to combine these wildcards to create complex file patterns. For example, `*_??.nii` will match files like `file_01.nii`, `file_02.nii`, etc.
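
To check a pattern before running it against a real directory tree, the wildcard rules can be tried on plain file names. The sketch below uses the standard-library `fnmatch` module, which Python's `glob` relies on for `*`, `?`, and `[seq]`; the file names are invented for the example, and `**` is left out because it is handled by `glob` itself rather than by `fnmatch`.

```python
from fnmatch import fnmatchcase

# Invented file names, used only to exercise the patterns
names = ["file1.nii", "file2.nii", "file4.nii", "file_01.nii", "scan.nii.gz"]
patterns = ["*.nii", "file?.nii", "file[1-3].nii", "file[!1-3].nii", "*_??.nii"]

# Map each pattern to the names it matches (case-sensitive comparison)
results = {
    pattern: [name for name in names if fnmatchcase(name, pattern)]
    for pattern in patterns
}

for pattern, matches in results.items():
    print(f"{pattern:16} -> {matches}")
```

Note that `scan.nii.gz` matches none of these patterns, which is why the example dictionary lists `*.nii.gz` as a separate pattern.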

Where to Search

In [3]:
# Define the dictionary with root directories and file patterns
dir_pattern_dict = {
    '/Users/cu135/Partners HealthCare Dropbox/Calvin Howard/studies/atrophy_seeds_2023/shared_analysis/niftis_for_elmira/unsmoothed_atrophy_seeds/sub-*/ses-01/thresholded_tissue_segment_z_scores/': 'sub-*_cerebrospinal_fluid_generated_nifti.nii'}

## Glob the files and check to see if acceptable

In [4]:
save_files = False

In [13]:
from calvin_utils.file_utils.file_path_collector import glob_multiple_file_paths
import os
# Optional: create the output directory before saving (csv_path must be defined first)
# os.mkdir(os.path.dirname(csv_path))
# Glob the files and return them as a DataFrame (save_files is False, so no CSV is requested)
path_df = glob_multiple_file_paths(dir_pattern_dict, save=save_files, save_path=None)

# Display the DataFrame of matched paths
display(path_df)

Unnamed: 0,paths
0,/Users/cu135/Partners HealthCare Dropbox/Calvi...
1,/Users/cu135/Partners HealthCare Dropbox/Calvi...
2,/Users/cu135/Partners HealthCare Dropbox/Calvi...
3,/Users/cu135/Partners HealthCare Dropbox/Calvi...
4,/Users/cu135/Partners HealthCare Dropbox/Calvi...
5,/Users/cu135/Partners HealthCare Dropbox/Calvi...
6,/Users/cu135/Partners HealthCare Dropbox/Calvi...
7,/Users/cu135/Partners HealthCare Dropbox/Calvi...
8,/Users/cu135/Partners HealthCare Dropbox/Calvi...
9,/Users/cu135/Partners HealthCare Dropbox/Calvi...


# 02 - Extract Subject IDs

In [14]:
# Define the strings that come immediately before and after the subject ID
preceding = 'sub-'
proceeding = '/'

# Extract the substring and add it to a new column 'subject'
path_df['subject'] = path_df['paths'].str.extract(f'{preceding}(.*?){proceeding}')

# Display the updated DataFrame
display(path_df)

Unnamed: 0,paths,subject
0,/Users/cu135/Partners HealthCare Dropbox/Calvi...,21
1,/Users/cu135/Partners HealthCare Dropbox/Calvi...,26
2,/Users/cu135/Partners HealthCare Dropbox/Calvi...,19
3,/Users/cu135/Partners HealthCare Dropbox/Calvi...,10
4,/Users/cu135/Partners HealthCare Dropbox/Calvi...,17
5,/Users/cu135/Partners HealthCare Dropbox/Calvi...,28
6,/Users/cu135/Partners HealthCare Dropbox/Calvi...,43
7,/Users/cu135/Partners HealthCare Dropbox/Calvi...,44
8,/Users/cu135/Partners HealthCare Dropbox/Calvi...,16
9,/Users/cu135/Partners HealthCare Dropbox/Calvi...,29
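
The extraction above relies on a non-greedy capture group: the regular expression grabs the shortest run of characters between the first `sub-` and the next `/`. A minimal sketch on invented paths follows; the `/data/...` paths are assumptions, and `re.escape` is added here as a precaution in case either delimiter ever contains a character with special meaning in a regular expression.

```python
import re

import pandas as pd

# Invented paths that follow the sub-*/ses-01/ layout used in this notebook
demo_df = pd.DataFrame({
    "paths": [
        "/data/sub-21/ses-01/sub-21_cerebrospinal_fluid_generated_nifti.nii",
        "/data/sub-07/ses-01/sub-07_cerebrospinal_fluid_generated_nifti.nii",
    ]
})

preceding = "sub-"
proceeding = "/"

# (.*?) is non-greedy, so it stops at the first '/' after the first 'sub-'
pattern = f"{re.escape(preceding)}(.*?){re.escape(proceeding)}"

# expand=False returns a Series, which suits a single capture group
demo_df["subject"] = demo_df["paths"].str.extract(pattern, expand=False)
print(demo_df["subject"].tolist())
```

The extracted values are strings, so an ID such as `07` keeps its leading zero until it is converted to a number.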


In [15]:
path_df['subid'] = path_df['subject'].astype('int')

In [16]:
path_df

Unnamed: 0,paths,subject,subid
0,/Users/cu135/Partners HealthCare Dropbox/Calvi...,21,21
1,/Users/cu135/Partners HealthCare Dropbox/Calvi...,26,26
2,/Users/cu135/Partners HealthCare Dropbox/Calvi...,19,19
3,/Users/cu135/Partners HealthCare Dropbox/Calvi...,10,10
4,/Users/cu135/Partners HealthCare Dropbox/Calvi...,17,17
5,/Users/cu135/Partners HealthCare Dropbox/Calvi...,28,28
6,/Users/cu135/Partners HealthCare Dropbox/Calvi...,43,43
7,/Users/cu135/Partners HealthCare Dropbox/Calvi...,44,44
8,/Users/cu135/Partners HealthCare Dropbox/Calvi...,16,16
9,/Users/cu135/Partners HealthCare Dropbox/Calvi...,29,29
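
`astype('int')` raises an error as soon as one extracted value is not a whole number. If that is a concern, one possible alternative is `pd.to_numeric` with `errors="coerce"`, which turns unparseable values into missing values that can be inspected before merging. The sketch below uses invented subject strings, including a deliberately bad one.

```python
import pandas as pd

# Invented subject strings; "pilot" stands in for an ID that is not numeric
subjects = pd.Series(["21", "07", "pilot"], name="subject")

# Unparseable entries become NaN, then the nullable Int64 dtype keeps integers intact
subid = pd.to_numeric(subjects, errors="coerce").astype("Int64")

# Rows that could not be converted, for manual review
unparsed = subjects[subid.isna()]

print(subid.tolist())
print(unparsed.tolist())
```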


# 03 Option A - Import Another CSV and Add the Paths to It
**The CSV is expected to be in this format**
- ID and absolute paths to niftis are critical
```
+-----+----------------------------+--------------+--------------+--------------+
| ID  | Nifti_File_Path            | Covariate_1  | Covariate_2  | Covariate_3  |
+-----+----------------------------+--------------+--------------+--------------+
| 1   | /path/to/file1.nii.gz      | 0.5          | 1.2          | 3.4          |
| 2   | /path/to/file2.nii.gz      | 0.7          | 1.4          | 3.1          |
| 3   | /path/to/file3.nii.gz      | 0.6          | 1.5          | 3.5          |
| 4   | /path/to/file4.nii.gz      | 0.9          | 1.1          | 3.2          |
| ... | ...                        | ...          | ...          | ...          |
+-----+----------------------------+--------------+--------------+--------------+
```

In [17]:
spreadsheet_path = '/Users/cu135/Partners HealthCare Dropbox/Calvin Howard/studies/atrophy_seeds_2023/metadata/paths_and_covariates/experimental_group_master_list.csv'
sheet = None  # If using Excel, enter the sheet name as a string here

In [18]:
from calvin_utils.permutation_analysis_utils.statsmodels_palm import CalvinStatsmodelsPalm
# Instantiate the CalvinStatsmodelsPalm class
cal_palm = CalvinStatsmodelsPalm(input_csv_path=spreadsheet_path, output_dir=os.path.dirname(spreadsheet_path), sheet=sheet)
# Read the spreadsheet into a DataFrame and display it
data_df = cal_palm.read_and_display_data()
data_df

Unnamed: 0,subid,local_w6_ct_path,local_z6_csf_paths,local_z6_wm_paths,local_z6_gm_paths,local_z6_ct_paths,Coded_Disease_Status,PTID,Alzheimer,Control,MCI,blinded_id,Age,Sex,Male,Female
0,47,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,3,002_S_0816,1,0,0,47,70.838356,M,1.0,0.0
1,31,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,1,002_S_4270,0,1,0,31,,,,
2,42,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,3,005_S_0929,1,0,0,42,82.10137,M,1.0,0.0
3,18,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,3,006_S_4192,1,0,0,18,82.345205,M,1.0,0.0
4,29,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,1,007_S_0068,0,1,0,29,74.526027,F,0.0,1.0
5,35,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,1,007_S_0101,0,1,0,35,73.665753,M,1.0,0.0
6,15,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,1,013_S_4580,0,1,0,5,69.791781,F,0.0,1.0
7,24,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,1,014_S_0558,0,1,0,4,79.934247,M,1.0,0.0
8,19,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,1,018_S_4313,0,1,0,19,77.10137,F,0.0,1.0
9,15,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,1,019_S_4367,0,1,0,1,65.230137,F,0.0,1.0


In [19]:
merged_df = data_df.merge(path_df, on='subid', how='outer', suffixes=('', '_pathdf'))
display(merged_df)

Unnamed: 0,subid,local_w6_ct_path,local_z6_csf_paths,local_z6_wm_paths,local_z6_gm_paths,local_z6_ct_paths,Coded_Disease_Status,PTID,Alzheimer,Control,MCI,blinded_id,Age,Sex,Male,Female,paths,subject
0,1,,,,,,,,,,,,,,,,/Users/cu135/Partners HealthCare Dropbox/Calvi...,1
1,2,,,,,,,,,,,,,,,,/Users/cu135/Partners HealthCare Dropbox/Calvi...,2
2,3,,,,,,,,,,,,,,,,/Users/cu135/Partners HealthCare Dropbox/Calvi...,3
3,4,,,,,,,,,,,,,,,,/Users/cu135/Partners HealthCare Dropbox/Calvi...,4
4,5,,,,,,,,,,,,,,,,/Users/cu135/Partners HealthCare Dropbox/Calvi...,5
5,6,,,,,,,,,,,,,,,,/Users/cu135/Partners HealthCare Dropbox/Calvi...,6
6,7,,,,,,,,,,,,,,,,/Users/cu135/Partners HealthCare Dropbox/Calvi...,7
7,8,,,,,,,,,,,,,,,,/Users/cu135/Partners HealthCare Dropbox/Calvi...,8
8,9,,,,,,,,,,,,,,,,/Users/cu135/Partners HealthCare Dropbox/Calvi...,9
9,10,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,1.0,130_S_4343,0.0,1.0,0.0,10.0,79.712329,M,1.0,0.0,/Users/cu135/Partners HealthCare Dropbox/Calvi...,10
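
An outer merge keeps every `subid` from both tables, so rows with no partner show up filled with missing values, as in the output above. One way to see which side each row came from is the `indicator=True` option of `DataFrame.merge`. The sketch below uses small invented tables, not the study data.

```python
import pandas as pd

# Invented stand-ins for the covariate spreadsheet and the globbed paths
covariates = pd.DataFrame({"subid": [10, 16, 47], "Age": [79.7, 70.1, 70.8]})
paths = pd.DataFrame({"subid": [10, 16, 21], "paths": ["a.nii", "b.nii", "c.nii"]})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'
checked = covariates.merge(paths, on="subid", how="outer", indicator=True)

counts = checked["_merge"].value_counts()
unmatched = checked.loc[checked["_merge"] != "both", ["subid", "_merge"]]

print(counts)
print(unmatched)
```

If each `subid` is expected to appear once per table, passing `validate="one_to_one"` to `merge` makes pandas raise an error when a key is duplicated.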


Rename Paths Column

In [20]:
newname = 'csf_paths_fwhm0'
merged_df = merged_df.rename(columns={'paths': newname})
display(merged_df)

Unnamed: 0,subid,local_w6_ct_path,local_z6_csf_paths,local_z6_wm_paths,local_z6_gm_paths,local_z6_ct_paths,Coded_Disease_Status,PTID,Alzheimer,Control,MCI,blinded_id,Age,Sex,Male,Female,csf_paths_fwhm0,subject
0,1,,,,,,,,,,,,,,,,/Users/cu135/Partners HealthCare Dropbox/Calvi...,1
1,2,,,,,,,,,,,,,,,,/Users/cu135/Partners HealthCare Dropbox/Calvi...,2
2,3,,,,,,,,,,,,,,,,/Users/cu135/Partners HealthCare Dropbox/Calvi...,3
3,4,,,,,,,,,,,,,,,,/Users/cu135/Partners HealthCare Dropbox/Calvi...,4
4,5,,,,,,,,,,,,,,,,/Users/cu135/Partners HealthCare Dropbox/Calvi...,5
5,6,,,,,,,,,,,,,,,,/Users/cu135/Partners HealthCare Dropbox/Calvi...,6
6,7,,,,,,,,,,,,,,,,/Users/cu135/Partners HealthCare Dropbox/Calvi...,7
7,8,,,,,,,,,,,,,,,,/Users/cu135/Partners HealthCare Dropbox/Calvi...,8
8,9,,,,,,,,,,,,,,,,/Users/cu135/Partners HealthCare Dropbox/Calvi...,9
9,10,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,/Users/cu135/Dropbox (Partners HealthCare)/stu...,1.0,130_S_4343,0.0,1.0,0.0,10.0,79.712329,M,1.0,0.0,/Users/cu135/Partners HealthCare Dropbox/Calvi...,10


Save the merged df (this overwrites the original spreadsheet at `spreadsheet_path`)

In [21]:
merged_df.to_csv(spreadsheet_path, index=False)

# 03 Option B - Save Path DF To Its Own CSV

In [None]:
# import os
# os.makedirs(out_dir, exist_ok=True)
# path_df.to_csv(os.path.join(out_dir, filename), index=False)
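
The commented lines above leave `out_dir` and `filename` undefined. A runnable sketch of the same steps is shown here, with a temporary directory, a placeholder file name, and a small invented DataFrame standing in for `path_df`; all three are assumptions to be replaced with real values.

```python
import os
import tempfile

import pandas as pd

# Invented stand-in for path_df
demo_path_df = pd.DataFrame({"paths": ["/path/to/file1.nii", "/path/to/file2.nii"]})

# Hypothetical output location and file name; replace with your own
out_dir = os.path.join(tempfile.mkdtemp(), "paths_and_covariates")
filename = "csf_paths.csv"

# exist_ok=True avoids an error when the directory is already there
os.makedirs(out_dir, exist_ok=True)
out_csv = os.path.join(out_dir, filename)
demo_path_df.to_csv(out_csv, index=False)

# Read the file back to confirm the paths survived the round trip
round_trip = pd.read_csv(out_csv)
print(round_trip)
```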

Hope this was helpful

--Calvin