<a href="https://colab.research.google.com/github/Nick7900/permutation_test/blob/main/1_preprocessing_data_selection_no_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial: Selecting and Analyzing Data from the HCP Dataset
In this tutorial, we select and prepare data from the Human Connectome Project (HCP) dataset for use in the tutorial ```a_between_subject_testing```.

We will go through the following steps in this notebook:

1. Setup Google Colab
2. Download the HCP data
3. Prepare behavioral data
4. Prepare neuroimaging data
5. Save data


## **1: Setup Google Colab**
This script was written using Google Colab.
The tutorial assumes that Python and the necessary libraries are installed.

The **neuroimaging data** is of type **.mat** and can be loaded using the **mat73** library.

To install mat73, you can use the following command:
```
pip install mat73
```

When using **Google Colab**, we also need to install the following libraries so that we can load the data of interest:

```
pip install requests
pip install gdown
```

In [None]:
# Use -q to hide pip's output
!pip install -q mat73
!pip install -q requests
!pip install -q gdown

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.4/1.4 MB[0m [31m8.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m45.4/45.4 kB[0m [31m3.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m51.0/51.0 kB[0m [31m4.7 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
  Building wheel for ligo-segments (setup.py) ... [?25l[?25hdone


### Import Libraries

In [None]:
import os
import scipy.io
import pandas as pd
import numpy as np
import mat73


### Load Helper function
This part can be dropped from the final tutorial if the helper functions become part of the GLHMM toolbox.

In [None]:
# Download the helper functions
import requests
# Get the raw GitHub file
url = 'https://raw.githubusercontent.com/Nick7900/permutation_test/main/helper_functions/helperfunctions.py'
r = requests.get(url)
r.raise_for_status()  # Stop here if the download failed
# Save the helper functions to the working directory
with open("helperfunctions.py", "w") as f:
    f.write(r.text)
# The helper function itself is imported in section 3


We will focus on specific columns of interest to obtain a subset of the data for further analysis.


## **2: Download the HCP data**
Make sure you have been granted permission to use restricted HCP data. If not, apply for permission by following the instructions on the **[HCP](https://www.humanconnectome.org/)** website.



## **3: Prepare behavioral data**
In this section, we load and prepare the **behavioral data** from the HCP dataset that we will use in later examples.

**Behavioral data** refers to non-imaging data collected from participants, which provides information about their characteristics, traits, and behaviors.

First, we load the **headers** of the data from ```headers_with_category.mat``` to determine which variables we want to use for our analysis.

We load the HCP files with the helper function ```load_files```, which can read different file types such as **.mat** and **.txt** files.
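
To make the idea of loading by file type concrete, here is a minimal sketch of a loader that dispatches on the extension. It is not the actual ```load_files``` implementation: the function name `load_table_or_mat`, its return values, and the assumption that the text file has no header row are all hypothetical. The fallback relies on `scipy.io.loadmat` raising `NotImplementedError` for MATLAB v7.3 files, which `mat73` can read.

```python
import os

import pandas as pd
import scipy.io


def load_table_or_mat(folder, name, file_type, delimiter=" "):
    """Hypothetical stand-in for load_files: dispatch on the file extension."""
    path = os.path.join(folder, name.lstrip("/") + file_type)
    if file_type == ".txt":
        # Plain-text table, assumed here to have no header row
        return pd.read_csv(path, sep=delimiter, header=None)
    if file_type == ".mat":
        try:
            # Works for MATLAB files saved in formats older than v7.3
            return scipy.io.loadmat(path)
        except NotImplementedError:
            # scipy cannot read v7.3 (HDF5-based) files; mat73 can
            import mat73
            return mat73.loadmat(path)
    raise ValueError(f"Unsupported file type: {file_type}")
```

Importing `mat73` only inside the fallback branch keeps the sketch usable for text files and older **.mat** files even when `mat73` is not installed.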


In [None]:
from helperfunctions import load_files
# Load files using the helper function
data_folder = ""
file_name = "/headers_with_category"
file_type = ".mat"
# Load files from the specified data folder
data, var_name = load_files(data_folder, file_name, file_type)
# Convert the loaded headers to a DataFrame
df = pd.DataFrame(data[var_name])
df.head(10)

|   | 0 | 1 |
|---|---|---|
| 0 | [ID] | [Demographics] |
| 1 | [recon] | [Demographics] |
| 2 | [sex] | [Demographics] |
| 3 | [age] | [Demographics] |
| 4 | [handedness] | [Demographics] |
| 5 | [race] | [Demographics] |
| 6 | [ethnicity] | [Demographics] |
| 7 | [rfMRI_motion] | [Confound] |
| 8 | [SSAGA_Employ] | [Demographics] |
| 9 | [SSAGA_Income] | [Demographics] |


Here we can see the first headers of the behavioral data from the HCP dataset, together with their categories.

### Select Columns of Interest
We can search for specific strings in the data and retrieve their row locations.

Here, we search for columns related to **"sex"** and **"age"** in the DataFrame.

In [None]:
search_string = ['sex', 'age']
# Find the indices of the first occurrence of each search string in the DataFrame
indices = [df[df[0] == i].index[0] for i in search_string]
indices

[2, 3]

The headers **"sex"** and **"age"** are located at row positions 2 and 3.
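
The lookup above raises an `IndexError` if one of the search strings is not present. As a sketch of a more explicit variant, the example below uses a small toy header table (not the real HCP headers file) and a hypothetical function `find_header_rows` that reports which header is missing.

```python
import pandas as pd

# Toy header table with the same layout as the one above
headers = pd.DataFrame({
    0: ["ID", "recon", "sex", "age"],
    1: ["Demographics"] * 4,
})


def find_header_rows(frame, names):
    """Return the first row position of each header name (hypothetical helper)."""
    positions = {}
    for name in names:
        matches = frame.index[frame[0] == name]
        if len(matches) == 0:
            raise KeyError(f"Header not found: {name}")
        positions[name] = int(matches[0])
    return positions


positions = find_header_rows(headers, ["sex", "age"])
print(positions)  # {'sex': 2, 'age': 3}
```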

### Load behavioral Data
Now, we load the behavioral data for the HCP project.
The data is stored as a **.txt** file, ```vars.txt```.

Each row of the header table corresponds to one column of ```vars.txt```, so the row locations obtained above (**indices**) can be used to select the matching columns.

In [None]:
file_name = "/vars"
file_type = ".txt"
# Load files from the specified data folder
data = load_files(data_folder, file_name, file_type=file_type, delimiter=' ')
# Look at the raw behavioral data
data.head()


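|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 669 | 670 | 671 | 672 | 673 | 674 | 675 | 676 | 677 | 678 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100206 | 12 | 0 | 27 | 65 | 0.0 | 0.0 | 0.057224 | 2.0 | 4.0 | ... | 0.0 |  | 108.79 | 97.19 | 49.7 | 72.63 | 72.03 | 1.84 | 0.0 | 1.84 |
| 1 | 100307 | 1 | 1 | 27 | 95 | 0.0 | 0.0 | 0.065499 | 2.0 | 7.0 | ... | 1.0 | 3.6 | 101.12 | 86.45 | 38.6 | 71.69 | 71.76 | 1.76 | 0.0 | 1.76 |
| 2 | 100408 | 3 | 0 | 33 | 55 | 0.0 | 0.0 | 0.098191 | 2.0 | 7.0 | ... | 1.0 | 2.0 | 108.79 | 98.04 | 52.6 | 114.01 | 113.59 | 1.76 | 2.0 | 1.68 |
| 3 | 100610 | 12 | 0 | 27 | 85 | 0.0 | 0.0 | 0.101858 | 2.0 | 6.0 | ... | 1.0 | 2.0 | 122.25 | 110.45 | 38.6 | 84.84 | 85.31 | 1.92 | 1.0 | 1.88 |
| 4 | 101006 | 6 | 1 | 35 | 90 | 1.0 | 0.0 | 0.086306 | 2.0 | 3.0 | ... | 2.0 | 6.0 | 122.25 | 111.41 | 38.6 | 123.8 | 123.31 | 1.8 | 0.0 | 1.8 |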


### Filter behavioral data based on defined indices

In [None]:
# Filter the loaded data using the defined indices
df_filter = data[indices]
# Display the first 10 rows of the filtered DataFrame
df_filter.head(10)

|   | 2 | 3 |
|---|---|---|
| 0 | 0 | 27 |
| 1 | 1 | 27 |
| 2 | 0 | 33 |
| 3 | 0 | 27 |
| 4 | 1 | 35 |
| 5 | 0 | 22 |
| 6 | 0 | 29 |
| 7 | 1 | 35 |
| 8 | 0 | 24 |
| 9 | 0 | 27 |


To make this more intuitive, we rename the column headers to the original **headers** that we selected earlier, which are stored in the variable **search_string**.

In [None]:
# Map each column index to its header name
mapping_dict = dict(zip(indices, search_string))

# Rename columns using the mapping dictionary
data_behavioral = df_filter.rename(columns=mapping_dict)

# Display the first rows of the resulting DataFrame
data_behavioral.head()

|   | sex | age |
|---|---|---|
| 0 | 0 | 27 |
| 1 | 1 | 27 |
| 2 | 0 | 33 |
| 3 | 0 | 27 |
| 4 | 1 | 35 |
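
The select-and-rename steps can also be tried without access to the HCP files. The sketch below uses a three-row toy table whose values are copied from the first columns of the raw data shown earlier; the variable names `raw`, `names`, and `subset` are illustrative only.

```python
import pandas as pd

# Toy stand-in for the raw behavioral table (columns 0 to 3 only)
raw = pd.DataFrame([
    [100206, 12, 0, 27],
    [100307, 1, 1, 27],
    [100408, 3, 0, 33],
])

names = ["sex", "age"]
indices = [2, 3]

# Select the columns of interest, then map each column index to its header
subset = raw[indices].rename(columns=dict(zip(indices, names)))
print(subset.columns.tolist())  # ['sex', 'age']
```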


## **4: Prepare neuroimaging data**
Now that we have a subset of the **behavioral data** that only contains the subjects' **sex** and **age**, we need to get the corresponding **neuroimaging data** for each subject.

The **neuroimaging data** used in this example provide information about the brain's activity for each subject during resting state.

In [None]:
# Load the neuroimaging data from the specified data folder
data_folder = ""
file_name = "/hcp1003_REST1_LR_groupICA50"
file_type = ".mat"
data_dict = load_files(data_folder, file_name, file_type)
# Keep one array per subject
data_neuroimaging = [val[0] for val in data_dict["data"]]



For each subject, the signal of 50 brain parcels has been measured across 1200 timepoints.

The shape of the data for one subject is shown below.

In [None]:
subject_id = 0
data_neuroimaging[subject_id].shape

(1200, 50)
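
Because each subject's time series is later paired with that subject's behavioral data, it can be useful to confirm that both have the same number of subjects and that every subject has the same shape. The following is a minimal sketch on random toy data; `check_alignment` is a hypothetical helper, not part of the helper functions used above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 3 subjects, 1200 timepoints, 50 parcels
toy_neuro = [rng.standard_normal((1200, 50)) for _ in range(3)]
toy_behav = np.array([[0, 27], [1, 27], [0, 33]])


def check_alignment(neuro, behav):
    """Return the common per-subject shape, or raise if the data do not line up."""
    shapes = {subject.shape for subject in neuro}
    if len(shapes) != 1:
        raise ValueError(f"Subjects have different shapes: {shapes}")
    if len(neuro) != len(behav):
        raise ValueError(
            f"{len(neuro)} subjects in neuroimaging data, {len(behav)} in behavioral data"
        )
    return shapes.pop()


print(check_alignment(toy_neuro, toy_behav))  # (1200, 50)
```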

## **5: Save Data**
Finally, we save the neuroimaging data and the behavioral data as NumPy arrays for further analysis.

You can download the files from the folder ```data``` afterwards.

In [None]:
# Specify the folder path and name
folder_name = "data"
folder_path = os.path.join(os.getcwd(), folder_name)

# Create the directory if it does not exist yet
if not os.path.exists(folder_path):
    os.makedirs(folder_path)
    print("The new directory is created!")

# Save behavioral data
data_behavioral_file = 'data_behavioral.npy'
file_path = os.path.join(folder_path, data_behavioral_file)
np.save(file_path, data_behavioral)

# Save neuroimaging data
data_neuroimaging_file = 'data_neuroimaging.npy'
file_path = os.path.join(folder_path, data_neuroimaging_file)
np.save(file_path, data_neuroimaging)

The new directory is created!
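
To check what ends up in the **.npy** files, the sketch below saves toy data to a temporary folder and loads it back with `np.load`. The toy values and variable names are illustrative. Note that `np.save` stores only the values of a DataFrame, so the column names **sex** and **age** are not kept in the file.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy stand-ins for the behavioral and neuroimaging data
toy_behavioral = pd.DataFrame({"sex": [0, 1, 0], "age": [27, 27, 33]})
toy_neuro = [np.zeros((1200, 50)) for _ in range(3)]

with tempfile.TemporaryDirectory() as tmp:
    behav_path = os.path.join(tmp, "data_behavioral.npy")
    neuro_path = os.path.join(tmp, "data_neuroimaging.npy")
    np.save(behav_path, toy_behavioral)
    np.save(neuro_path, toy_neuro)
    # Read the arrays back before the temporary folder is removed
    loaded_behavioral = np.load(behav_path)
    loaded_neuro = np.load(neuro_path)

print(loaded_behavioral.shape)  # (3, 2)
print(loaded_neuro.shape)       # (3, 1200, 50)
```

The list of per-subject arrays is stacked into a single array of shape (subjects, timepoints, parcels), which only works as shown when every subject has the same shape.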
