<a href="https://colab.research.google.com/github/Nick7900/permutation_test/blob/main/1_preprocessing_data_selection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial: Selecting and Analyzing Data from the HCP Dataset


## Introduction
In this tutorial, we will explore how to select and analyze data from the Human Connectome Project (HCP) dataset.

We will focus on specific columns of interest and perform random sampling to obtain a subset of data for further analysis.

The tutorial assumes you have Python and the necessary libraries installed.

The **Measurement data** is stored in a **.mat** file, which we load with the **mat73** library.

You can install **mat73** with the following command if it is not already installed on your machine:
```
pip install mat73
```

To use the helper functions in **my_functions**, you also need to install the **statsmodels** library:
```
pip install statsmodels
```

When using **Google Colab**, we also need to install the following libraries so that we can download the data of interest:

```
!pip install requests
!pip install gdown
```

In [None]:
!pip install mat73
!pip install statsmodels
!pip install requests
!pip install gdown

Collecting mat73
  Downloading mat73-0.60-py3-none-any.whl (19 kB)
Installing collected packages: mat73
Successfully installed mat73-0.60



## Step 1: Import Libraries

In [None]:
import os
import scipy.io
import pandas as pd
import numpy as np
import mat73


## Step 2: Load Helper Function

In [None]:
# Download the helper module so it can be imported later
import requests
# Get the raw GitHub file
url = 'https://raw.githubusercontent.com/Nick7900/permutation_test/main/helper_functions/my_functions.py'
r = requests.get(url)
# Stop early if the download failed
r.raise_for_status()
# Save the module to the current directory
with open("my_functions.py", "w") as f:
    f.write(r.text)


# Behavioral data
In this section we are going to load the Behavioral data in the HCP dataset.

Behavioral data refers to non-imaging data collected from participants, which provides information about their characteristics, traits, and behaviors.

Behavioral data helps researchers understand the relationship between brain function and behavior, cognition, personality traits, and other psychosocial factors.

## Step 3: Load Data Headers
Next, we load the headers of the HCP data to decide which variables we want to analyze.
We load the HCP files with the helper function ```load_files```, which can read different file types such as **.mat** and **.txt** files.


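The download links below are Google Drive share links converted into direct-download links that **gdown** can fetch. To convert a share link, remove the text **file/d/** from the link and replace it with **uc?id=**.

Then remove everything after the file ID, including **/view**, and put **&export=download** in its place.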
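
The link conversion described above can be sketched as a small function. This is a minimal illustration, not part of **my_functions**: the name `drive_share_to_download` is hypothetical, and the share link is assumed to have the form `https://drive.google.com/file/d/<ID>/view?usp=sharing`.

```python
import re

def drive_share_to_download(share_url: str) -> str:
    """Convert a Google Drive share link into a direct-download URL.

    Hypothetical helper for illustration only.
    """
    # The file ID sits between "/file/d/" and the next "/"
    match = re.search(r"/file/d/([^/]+)", share_url)
    if match is None:
        raise ValueError(f"No Drive file ID found in: {share_url}")
    return f"https://drive.google.com/uc?id={match.group(1)}&export=download"

# Assumed share-link form for the headers_with_category file
share_url = "https://drive.google.com/file/d/1i1rMeJu5lcPxGHvo-SR5Y7Mvd2iWMKjn/view?usp=sharing"
download_url = drive_share_to_download(share_url)
print(download_url)
```

The resulting string has the same form as the URLs passed to `gdown.download` in the next cell.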

In [None]:
import gdown
# Download files from Google Drive
# headers_with_category
url = "https://drive.google.com/uc?id=1i1rMeJu5lcPxGHvo-SR5Y7Mvd2iWMKjn&export=download"
gdown.download(url, quiet=False)
# vars
url = 'https://drive.google.com/uc?id=14Xht-P-RGRz8JDckwQLClka1wzEB0iwY&export=download'
gdown.download(url, quiet=False)
# Measurement data
url = 'https://drive.google.com/uc?id=1h8IRQ-iAdWP845vWkicsya_achxeZAQI&export=download'
gdown.download(url, quiet=False)


Downloading...
From: https://drive.google.com/uc?id=1i1rMeJu5lcPxGHvo-SR5Y7Mvd2iWMKjn&export=download
To: /content/headers_with_category.mat
100%|██████████| 7.83k/7.83k [00:00<00:00, 15.9MB/s]
Downloading...
From: https://drive.google.com/uc?id=14Xht-P-RGRz8JDckwQLClka1wzEB0iwY&export=download
To: /content/vars.txt
100%|██████████| 2.97M/2.97M [00:00<00:00, 57.1MB/s]
Downloading...
From: https://drive.google.com/uc?id=1h8IRQ-iAdWP845vWkicsya_achxeZAQI&export=download
To: /content/hcp1003_REST1_LR_groupICA50.mat
100%|██████████| 253M/253M [00:06<00:00, 42.0MB/s]


'hcp1003_REST1_LR_groupICA50.mat'

In [None]:
from my_functions import load_files
# Load files using the helper function
data_folder = ""
file_name = "/headers_with_category"
file_type = ".mat"
# Load files from the specified data folder
data, var_name = load_files(data_folder, file_name, file_type)
# Convert the loaded headers to a DataFrame
df = pd.DataFrame(data[var_name])
df.head(10)

Unnamed: 0,0,1
0,[ID],[Demographics]
1,[recon],[Demographics]
2,[sex],[Demographics]
3,[age],[Demographics]
4,[handedness],[Demographics]
5,[race],[Demographics]
6,[ethnicity],[Demographics]
7,[rfMRI_motion],[Confound]
8,[SSAGA_Employ],[Demographics]
9,[SSAGA_Income],[Demographics]


Here we can see the first ten headers of the behavioral data in the HCP dataset, each with its category.

## Step 4: Select Columns of Interest
We can search for specific strings in the data and retrieve their row locations.

Here, we search for columns related to **"sex"** and **"age"** in the DataFrame.

In [None]:
search_string = ['sex', 'age']
# Find the indices of the first occurrence of each search string in the DataFrame
indices = [df[df[0] == i].index[0] for i in search_string]
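
As a hedged illustration of this lookup, the sketch below runs the same idea on a small made-up header table (plain strings, four rows) and adds a check for names that are not found, which would otherwise raise an `IndexError`. The table contents and the variable `missing` are assumptions for this example only.

```python
import pandas as pd

# Toy header table (made-up subset mirroring the layout shown above)
headers = pd.DataFrame({0: ["ID", "recon", "sex", "age"],
                        1: ["Demographics"] * 4})

search_string = ["sex", "age"]

# Fail with a clear message if a requested header does not exist
missing = [s for s in search_string if s not in set(headers[0])]
if missing:
    raise KeyError(f"Headers not found: {missing}")

# Row position of the first occurrence of each requested header
indices = [int(headers.index[headers[0] == s][0]) for s in search_string]
print(indices)  # [2, 3]
```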

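## Step 5: Load Behavioral Data
Now we load the behavioral measurements for the HCP subjects.
The data is stored as a **.txt** file named **vars**, with one row per subject and one column per header.

Each row position in the header table corresponds to a column in this file, so we select columns using the previously obtained row locations (**indices**).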

In [None]:
file_name = "/vars"
file_type = ".txt"
# Load the text file from the specified data folder
data = load_files(data_folder, file_name, file_type=file_type, delimiter=' ')
# Select the columns of interest using indices
df_filter = data[indices]
# Display the first 10 rows of the filtered DataFrame
df_filter.head(10)


Unnamed: 0,2,3
0,0,27
1,1,27
2,0,33
3,0,27
4,1,35
5,0,22
6,0,29
7,1,35
8,0,24
9,0,27


In [None]:
data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,669,670,671,672,673,674,675,676,677,678
0,100206,12,0,27,65,0.0,0.0,0.057224,2.0,4.0,...,0.0,,108.79,97.19,49.7,72.63,72.03,1.84,0.0,1.84
1,100307,1,1,27,95,0.0,0.0,0.065499,2.0,7.0,...,1.0,3.6,101.12,86.45,38.6,71.69,71.76,1.76,0.0,1.76
2,100408,3,0,33,55,0.0,0.0,0.098191,2.0,7.0,...,1.0,2.0,108.79,98.04,52.6,114.01,113.59,1.76,2.0,1.68
3,100610,12,0,27,85,0.0,0.0,0.101858,2.0,6.0,...,1.0,2.0,122.25,110.45,38.6,84.84,85.31,1.92,1.0,1.88
4,101006,6,1,35,90,1.0,0.0,0.086306,2.0,3.0,...,2.0,6.0,122.25,111.41,38.6,123.80,123.31,1.80,0.0,1.80
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
998,992673,12,1,33,70,0.0,0.0,0.070254,0.0,3.0,...,1.0,-99.0,122.25,111.41,38.6,101.63,99.26,1.80,0.0,1.80
999,992774,2,0,35,100,0.0,0.0,0.071538,0.0,3.0,...,2.0,8.4,122.25,111.41,50.1,107.17,103.55,1.76,0.0,1.76
1000,993675,12,1,29,85,0.0,0.0,0.084067,2.0,3.0,...,2.0,0.4,122.25,110.45,38.6,84.07,84.25,1.80,1.0,1.76
1001,994273,6,0,30,60,0.0,0.0,0.083142,0.0,4.0,...,1.0,6.0,122.25,111.41,63.8,110.65,109.73,1.80,1.0,1.76


To make the result more intuitive, we rename the numeric column labels to the original header names stored in the variable **search_string**.

In [None]:
# Define the mapping dictionary
mapping_dict = {val: search_string[idx] for idx, val in enumerate(indices)}

# Rename columns using the mapping dictionary
df_filter = df_filter.rename(columns=mapping_dict)

# Display the first rows of the renamed DataFrame
df_filter.head()

Unnamed: 0,sex,age
0,0,27
1,1,27
2,0,33
3,0,27
4,1,35
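
The same renaming can also be written with `dict(zip(...))`. The sketch below uses a three-row made-up DataFrame with the numeric column labels 2 and 3 seen above; `df_named` is a name introduced only for this example.

```python
import pandas as pd

# Made-up stand-in for df_filter with numeric column labels
indices = [2, 3]
search_string = ["sex", "age"]
df_filter = pd.DataFrame({2: [0, 1, 0], 3: [27, 27, 33]})

# Pair each numeric label with its header name
mapping_dict = dict(zip(indices, search_string))
df_named = df_filter.rename(columns=mapping_dict)
print(list(df_named.columns))  # ['sex', 'age']
```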


## Step 6: Random Sampling
For this demonstration we do not need the whole dataset, so we draw a random subset of subjects for further analysis.

We set the random seed, define the number of samples (**N**), and select a header (**header**) for sampling.

In [None]:
np.random.seed(42)

N = 60 # Select 60 subjects
# Specify the header for categorical variable
header = 'sex'
# Get unique categories from the filtered DataFrame based on the header
cat_val = np.unique(df_filter[header])
# Calculate the number of samples per category
n_samples = int(round(N / len(cat_val)))

## Step 7: Perform Random Sampling
We obtain the indices for each category value and randomly choose indices for each category without replacement.

Finally, we combine the randomly chosen indices for all categories.

In [None]:
# For each category value, find the row positions in df_filter where the header column equals that value
# cat_indices is a list with one array of row positions per category
cat_indices = [np.where(df_filter[header] == val)[0] for val in cat_val]

# From each category, randomly draw n_samples row positions without replacement
# (the loop variable is named idx to avoid confusion with the earlier variable indices)
random_indices = [np.random.choice(idx, size=n_samples, replace=False) for idx in cat_indices]

# Combine the draws from all categories into one sorted 1D array
# These positions are used to extract the sampled rows from the data
sample_indices = np.sort(np.concatenate(random_indices))
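
To check that this kind of sampling really yields equal group sizes, here is a minimal self-contained sketch on made-up labels. It makes two assumptions that differ from the cell above: it uses a local `np.random.default_rng` generator instead of `np.random.seed`, and it adds a guard for categories that hold fewer than `n_samples` subjects.

```python
import numpy as np

# Local generator (assumption: swapped in for the global np.random.seed)
rng = np.random.default_rng(42)

# Made-up labels: 12 subjects in category 0 and 8 in category 1
labels = np.array([0] * 12 + [1] * 8)
N = 10
cat_val = np.unique(labels)
n_samples = int(round(N / len(cat_val)))

draws = []
for val in cat_val:
    idx = np.flatnonzero(labels == val)
    # Sampling without replacement needs at least n_samples candidates
    if len(idx) < n_samples:
        raise ValueError(f"Category {val} has only {len(idx)} subjects")
    draws.append(rng.choice(idx, size=n_samples, replace=False))
sample_indices = np.sort(np.concatenate(draws))

# Count how many sampled subjects fall into each category
counts = {int(v): int(np.sum(labels[sample_indices] == v)) for v in cat_val}
print(counts)  # {0: 5, 1: 5}
```

Because `n_samples` is rounded, the total number of sampled subjects can differ from `N` when `N` is not divisible by the number of categories.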


## Step 8: Select Sampled Data
Using the sampled indices, we select rows from the filtered DataFrame (df_filter).

In [None]:
# Locate the data of interest based on the sample_indices
data_behavioral = df_filter.iloc[sample_indices]
data_behavioral.head()

Unnamed: 0,sex,age
0,0,27
13,0,26
16,0,30
52,1,28
62,0,32


## Step 9: Load Measurement Data
Now that we have a subset of the **Behavioral** data, we need to find the corresponding **Measurement data** for each subject.
Measurement data in the HCP dataset refers to the data obtained through various imaging techniques, such as functional magnetic resonance imaging (fMRI), diffusion tensor imaging (DTI), and structural imaging.

These measurements provide information about the brain's structure, connectivity, and activity.


The measurement data is selected with the sampled indices defined earlier in the variable **sample_indices**, so that it matches the behavioral subset.

In [None]:
# Load the measurement data from the downloaded .mat file
data_folder = ""
file_name = "/hcp1003_REST1_LR_groupICA50"
file_type = ".mat"
data_dict = load_files(data_folder, file_name, file_type)
# Keep only the subjects selected in sample_indices
data_measurement = [data_dict["data"][i] for i in sample_indices]



For each subject, the data contains 50 parcels (columns) measured across 1200 timepoints (rows).

The shape for one subject is shown below.

In [None]:
subject_id = 0
data_measurement[subject_id][0].shape

(1200, 50)

To keep the dataset small for the following demonstration, we reduce each subject's data to the first 300 timepoints.

In [None]:
limit = 300
# Keep only timepoints 0 to limit (e.g. 300) for each subject
data_measurement_reduced = [val[0][:limit, :] for val in data_measurement]
data_measurement_reduced[0].shape

(300, 50)
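
The truncation can be tried without the HCP files. The sketch below uses made-up random arrays as stand-ins for three subjects; unlike the loaded data, each list element is the array itself (no extra `[0]` level), which is an assumption of this example. Stacking the truncated arrays gives one 3D array of shape (subjects, timepoints, parcels).

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-in: 3 subjects, each a (1200, 50) array of timepoints x parcels
data_measurement = [rng.standard_normal((1200, 50)) for _ in range(3)]

limit = 300
# Keep the first `limit` timepoints of every subject and stack along a new axis
data_reduced = np.stack([ts[:limit, :] for ts in data_measurement])
print(data_reduced.shape)  # (3, 300, 50)
```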

## Step 10: Save Data
Finally, we save the reduced measurement data and the behavioral data as NumPy arrays for further analysis.

In [None]:
# Specify the folder path and name
import os
folder_name = "data"
current_directory = os.getcwd()
folder_path = os.path.join(current_directory, folder_name)

# Create the directory if it does not exist yet
if not os.path.exists(folder_path):
    os.makedirs(folder_path)
    print("The new directory is created!")

# Save behavioral data
data_behavioral_file = 'data_behavioral.npy'
file_path = os.path.join(folder_path, data_behavioral_file)
np.save(file_path, data_behavioral)

# Save measurement data
data_measurement_file = 'data_measurement.npy'
file_path = os.path.join(folder_path, data_measurement_file)
np.save(file_path, data_measurement_reduced)
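
A quick way to confirm that the files can be read back is to load them with `np.load` and inspect the shapes. The sketch below is self-contained: it writes small made-up arrays to a temporary directory rather than to the **data** folder, so the array contents and the temporary path are assumptions for illustration.

```python
import os
import tempfile
import numpy as np

# Made-up stand-ins for the behavioral and reduced measurement data
behavioral = np.array([[0, 27], [1, 35]])
measurement = np.zeros((2, 300, 50))

with tempfile.TemporaryDirectory() as tmp:
    behavioral_path = os.path.join(tmp, "data_behavioral.npy")
    measurement_path = os.path.join(tmp, "data_measurement.npy")
    np.save(behavioral_path, behavioral)
    np.save(measurement_path, measurement)
    # Read the arrays back before the temporary directory is removed
    loaded_behavioral = np.load(behavioral_path)
    loaded_measurement = np.load(measurement_path)

print(loaded_behavioral.shape, loaded_measurement.shape)  # (2, 2) (2, 300, 50)
```

Note that saving a DataFrame with `np.save` stores only its values, so the column names **sex** and **age** are not kept in the file.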