<a href="https://colab.research.google.com/github/Nick7900/permutation_test/blob/main/1_preprocessing_data_selection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial: Selecting and Analyzing Data from the HCP Dataset


## Introduction
In this tutorial, we will explore how to select and analyze data from the Human Connectome Project (HCP) dataset.

We will focus on specific columns of interest and perform random sampling to obtain a subset of data for further analysis.

The tutorial assumes that Python and the necessary libraries are installed.

The **neuroimaging data** is loaded from a **.mat** file using the **mat73** library.

To install mat73, you can use the following command:
```
pip install mat73
```

The helper module `my_functions` requires the **statsmodels** library, which you can install with:
```
pip install statsmodels
```

When using **Google Colab**, we also need to install the following libraries so that we can download the data of interest:

```
!pip install requests
!pip install gdown
```

In [2]:
# The -q flag keeps the pip output quiet
!pip install -q mat73 statsmodels requests gdown

## Step 1: Import Libraries

In [3]:
import os
import scipy.io
import pandas as pd
import numpy as np
import mat73


## Step 2: Load Helper Function

In [4]:
# Download the helper module
import requests
# URL of the raw file on GitHub
url = 'https://raw.githubusercontent.com/Nick7900/permutation_test/main/helper_functions/my_functions.py'
r = requests.get(url)
# Stop early if the download failed
r.raise_for_status()
# Save the module to the working directory so it can be imported later
with open("my_functions.py", "w") as f:
    f.write(r.text)


# Behavioral data
In this section, we load the behavioral data from the HCP dataset.

Behavioral data refers to non-imaging data collected from participants, which provides information about their characteristics, traits, and behaviors.

## Step 3: Load Data
Next, we load the headers of the HCP data to decide which variables we want to work with.
We load the HCP files with the helper function `load_files`, which can read different file types such as **.mat** and **.txt** files.


In [6]:
import gdown
# Download files from Google Drive
# headers_with_category
url = "https://drive.google.com/uc?id=1i1rMeJu5lcPxGHvo-SR5Y7Mvd2iWMKjn&export=download"
gdown.download(url, quiet=False)
# vars
url = 'https://drive.google.com/uc?id=14Xht-P-RGRz8JDckwQLClka1wzEB0iwY&export=download'
gdown.download(url, quiet=False)
# Measurement data
url = 'https://drive.google.com/uc?id=1h8IRQ-iAdWP845vWkicsya_achxeZAQI&export=download'
gdown.download(url, quiet=False)


Downloading...
From: https://drive.google.com/uc?id=1i1rMeJu5lcPxGHvo-SR5Y7Mvd2iWMKjn&export=download
To: /content/headers_with_category.mat
100%|██████████| 7.83k/7.83k [00:00<00:00, 19.7MB/s]
Downloading...
From: https://drive.google.com/uc?id=14Xht-P-RGRz8JDckwQLClka1wzEB0iwY&export=download
To: /content/vars.txt
100%|██████████| 2.97M/2.97M [00:00<00:00, 56.9MB/s]
Downloading...
From: https://drive.google.com/uc?id=1h8IRQ-iAdWP845vWkicsya_achxeZAQI&export=download
To: /content/hcp1003_REST1_LR_groupICA50.mat
100%|██████████| 253M/253M [00:05<00:00, 43.8MB/s]


'hcp1003_REST1_LR_groupICA50.mat'

In [7]:
from my_functions import load_files
# Load files using the helper function
data_folder = ""
file_name = "/headers_with_category"
file_type = ".mat"
# Load files from the specified data folder
data, var_name = load_files(data_folder, file_name, file_type)
# Convert the loaded data to a DataFrame
df = pd.DataFrame(data[var_name])
df.head(10)

Unnamed: 0,0,1
0,[ID],[Demographics]
1,[recon],[Demographics]
2,[sex],[Demographics]
3,[age],[Demographics]
4,[handedness],[Demographics]
5,[race],[Demographics]
6,[ethnicity],[Demographics]
7,[rfMRI_motion],[Confound]
8,[SSAGA_Employ],[Demographics]
9,[SSAGA_Income],[Demographics]


Here we can see the first headers of the behavioral data from the HCP dataset, together with the category each one belongs to.

## Step 4: Select Columns of Interest
We can search for specific strings in the data and retrieve their row locations.

Here, we search for columns related to **"sex"** and **"age"** in the DataFrame.

In [19]:
search_string = ['sex', 'age']
# Find the indices of the first occurrence of each search string in the DataFrame
indices = [df[df[0] == i].index[0] for i in search_string]
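
The lookup above can be tried on a small table first. The following is a minimal sketch using a made-up header table (the name `headers` and its contents are assumptions for illustration, not the real HCP file). It builds a boolean mask for each search string and keeps the first matching row label.

```python
import pandas as pd

# Hypothetical header table: column 0 holds variable names, column 1 their category
headers = pd.DataFrame({
    0: ["ID", "recon", "sex", "age", "handedness"],
    1: ["Demographics"] * 5,
})

search_string = ["sex", "age"]
# For each name, build a boolean mask and take the first matching row label
indices = [headers[headers[0] == name].index[0] for name in search_string]
print(indices)
```

Note that `.index[0]` raises an `IndexError` when a search string is not found, so it can be worth checking the names against the header table before running the lookup.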

## Step 5: Load behavioral Data
Now, we load the behavioral data for the HCP project.
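The data is stored as a **.txt** file named **vars**.

We filter the data using the previously obtained row locations (**indices**): each row position in the header table corresponds to the column with the same number in the behavioral data.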

In [20]:
file_name = "/vars"
file_type = ".txt"
# Load files from the specified data folder
data = load_files(data_folder, file_name, file_type=file_type, delimiter=' ')
# Look at the first rows of the behavioral data
data.head()


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,669,670,671,672,673,674,675,676,677,678
0,100206,12,0,27,65,0.0,0.0,0.057224,2.0,4.0,...,0.0,,108.79,97.19,49.7,72.63,72.03,1.84,0.0,1.84
1,100307,1,1,27,95,0.0,0.0,0.065499,2.0,7.0,...,1.0,3.6,101.12,86.45,38.6,71.69,71.76,1.76,0.0,1.76
2,100408,3,0,33,55,0.0,0.0,0.098191,2.0,7.0,...,1.0,2.0,108.79,98.04,52.6,114.01,113.59,1.76,2.0,1.68
3,100610,12,0,27,85,0.0,0.0,0.101858,2.0,6.0,...,1.0,2.0,122.25,110.45,38.6,84.84,85.31,1.92,1.0,1.88
4,101006,6,1,35,90,1.0,0.0,0.086306,2.0,3.0,...,2.0,6.0,122.25,111.41,38.6,123.8,123.31,1.8,0.0,1.8


In [11]:
# Filter the loaded data using the defined indices
df_filter = data[indices]
# Display the first 10 rows of the filtered DataFrame
df_filter.head(10)

Unnamed: 0,2,3
0,0,27
1,1,27
2,0,33
3,0,27
4,1,35
5,0,22
6,0,29
7,1,35
8,0,24
9,0,27
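
Selecting with a list of integers works here because the columns are labeled 0, 1, 2, and so on. The sketch below assumes a header-less, space-delimited text file, which the integer column labels in the output above suggest. It reads a few made-up rows with `pd.read_csv` rather than the helper `load_files`, whose internals may differ.

```python
import io
import pandas as pd

# Synthetic stand-in for a header-less, space-delimited text file (values are made up)
raw = "100206 12 0 27\n100307 1 1 27\n100408 3 0 33\n"
table = pd.read_csv(io.StringIO(raw), sep=" ", header=None)

# With header=None, pandas labels the columns 0, 1, 2, ...
indices = [2, 3]
# Passing a list of labels returns a DataFrame with only those columns
subset = table[indices]
print(subset)
```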


To make the table easier to read, we rename the column headers to the original **headers** that we selected earlier in the variable **search_string**.

In [21]:
# Define the mapping dictionary
mapping_dict = {val: search_string[idx] for idx, val in enumerate(indices)}

# Rename columns using the mapping dictionary
data_behavioral = df_filter.rename(columns=mapping_dict)

# Print the resulting DataFrame
data_behavioral.head()

Unnamed: 0,sex,age
0,0,27
1,1,27
2,0,33
3,0,27
4,1,35
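
As a small variation, the same mapping can be built with `zip`, which pairs each integer column label with its name. This is a sketch on made-up values. Note that `rename` returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

# Made-up stand-in for the filtered behavioral data
df_filter = pd.DataFrame({2: [0, 1, 0], 3: [27, 27, 33]})
indices = [2, 3]
search_string = ["sex", "age"]

# Pair each integer column label with the corresponding variable name
mapping_dict = dict(zip(indices, search_string))
renamed = df_filter.rename(columns=mapping_dict)
print(renamed.columns.tolist())
```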


## Step 6: Load Neuroimaging Data
Now that we have a subset of the **behavioral data**, we need the corresponding **neuroimaging data** for each subject. The **neuroimaging data** used in this example describes each subject's brain activity during resting state.

In [14]:
# Load the neuroimaging data from the specified data folder
data_folder = ""
file_name = "/hcp1003_REST1_LR_groupICA50"
file_type = ".mat"
data_dict = load_files(data_folder, file_name, file_type)
# Extract one array per subject
data_neuroimaging = [val[0] for val in data_dict["data"]]



For each subject, 50 parcels have been measured across 1200 timepoints.

The shape of the data for a single subject confirms this:

In [18]:
subject_id = 0
data_neuroimaging[subject_id].shape

(1200, 50)
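
When every subject has the same shape, the per-subject list can be stacked into a single 3D array, which is often convenient for later analysis. The sketch below uses random numbers in place of the real recordings and only three hypothetical subjects. The names `data_list`, `stacked`, and `parcel_means` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in: 3 subjects, each an array of shape (timepoints, parcels)
data_list = [rng.standard_normal((1200, 50)) for _ in range(3)]

# Stack along a new first axis: (subjects, timepoints, parcels)
stacked = np.stack(data_list)
# Average over the time axis to get one value per subject and parcel
parcel_means = stacked.mean(axis=1)
print(stacked.shape, parcel_means.shape)
```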

## Step 7: Save Data
Finally, we save the neuroimaging data and the behavioral data as NumPy arrays for further analysis.

In [22]:
# Specify the folder path and name
folder_name = "data"
folder_path = os.path.join(os.getcwd(), folder_name)

# Create the directory if it does not exist yet
if not os.path.exists(folder_path):
    os.makedirs(folder_path)
    print("The new directory is created!")

# Save behavioral data
data_behavioral_file = 'data_behavioral.npy'
file_path = os.path.join(folder_path, data_behavioral_file)
np.save(file_path, data_behavioral)

# Save neuroimaging data
data_neuroimaging_file = 'data_neuroimaging.npy'
file_path = os.path.join(folder_path, data_neuroimaging_file)
np.save(file_path, data_neuroimaging)

The new directory is created!
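
When the files are loaded again, keep in mind that `np.save` converts a DataFrame to a plain array, so the column names are not stored. The following sketch shows a save and load round trip on made-up values in a temporary directory and reattaches the names afterwards.

```python
import os
import tempfile
import numpy as np
import pandas as pd

# Made-up stand-in for the behavioral data
behavioral = pd.DataFrame({"sex": [0, 1, 0], "age": [27, 27, 33]})

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "data_behavioral.npy")
    # The DataFrame is converted to a plain NumPy array on save
    np.save(file_path, behavioral)
    loaded = np.load(file_path)

# Column names are not stored in the .npy file, so reattach them
restored = pd.DataFrame(loaded, columns=["sex", "age"])
print(restored)
```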
