In [1]:
import os
import numpy as np
import nibabel as nib
import pandas as pd
from matplotlib import pyplot as plt
from scipy import stats

data_path = "/Users/abry4213/data/HCP100/"

%load_ext rpy2.ipython

In [2]:
%%R
# Load R packages
suppressPackageStartupMessages({
    library(tidyverse)
    library(FactoMineR)
    library(factoextra)

    # Load the theft and tidyr packages
    library(theft)
    library(tidyr)
})


    an issue that caused a segfault when used with rpy2:
    https://github.com/rstudio/reticulate/pull/1188
    Make sure that you use a version of that package that includes
    the fix.
    

We will start with our resting-state fMRI data stored in a [`.feather` file](https://arrow.apache.org/docs/python/feather.html) (for easy conversion between R and Python).
Data should be organized in a long format, such that there is one row for each brain region and timepoint per participant.
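As a rough illustration of this long format, the sketch below builds a toy table and runs two quick sanity checks. The column names (`Sample_ID`, `Brain_Region`, `timepoint`, `values`) mirror those used in the feature-extraction cell later in this notebook; the participant IDs, region names, and signal values are made up for the example.

```python
import pandas as pd

# Toy long-format table: 2 participants x 2 regions x 3 timepoints.
# IDs, region names, and values are placeholders, not real HCP100 data.
rows = []
for sample_id in ["sub-01", "sub-02"]:
    for region in ["region_A", "region_B"]:
        for timepoint in range(1, 4):
            rows.append({
                "Sample_ID": sample_id,
                "Brain_Region": region,
                "timepoint": timepoint,
                "values": 0.1 * timepoint,
            })
toy_time_series = pd.DataFrame(rows)

# Each (participant, region, timepoint) combination should appear exactly once
key_columns = ["Sample_ID", "Brain_Region", "timepoint"]
n_duplicates = int(toy_time_series.duplicated(subset=key_columns).sum())

# Count timepoints per time series as a quick sanity check
series_lengths = toy_time_series.groupby(["Sample_ID", "Brain_Region"]).size()

print(toy_time_series.head())
print(f"Duplicated rows: {n_duplicates}")
print(f"Distinct series lengths: {sorted(series_lengths.unique())}")
```

The same two checks can be run on the real dataframe once it is loaded, to confirm that there is one row per participant, region, and timepoint.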

In [3]:
# Load the input time-series feather file for the HCP100 dataset
HCP100_input_time_series_data = pd.read_feather(f'{data_path}/raw_data/functional_MRI/HCP100_fMRI_TS.feather')

# Load information about the time-series features
univariate_TS_feature_info = pd.read_csv("data/feature_info/univariate_feature_info.csv")

## Extracting univariate time-series features

First, we will extract 25 univariate time-series features comprising the [`catch22`](https://doi.org/10.1007/s10618-019-00647-x) feature set, mean, standard deviation, and fractional amplitude of low-frequency fluctuations (fALFF).
The `catch22` features, mean, and SD can all be computed in R using the [`theft`](https://cran.r-project.org/web/packages/theft/vignettes/theft.html) package (collectively referred to as the `catch24` feature set), while the fALFF will be computed in Matlab.
Computing the `catch24` features will take several minutes, so feel free to hit play on the next code chunk and grab a coffee ☕️
(Alternatively, you can run this on a high-performance computing cluster if you prefer.)
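To make concrete what a per-series feature is, here is a minimal pandas sketch that computes just two of the 24 features (mean and SD) for each participant and brain region in a toy long-format table. This is not a replacement for the `theft` call below; the toy data and the output column names `feature` and `value` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy long-format data: 2 participants x 2 regions x 50 timepoints of random noise
rng = np.random.default_rng(0)
n_timepoints = 50
toy = pd.DataFrame({
    "Sample_ID": np.repeat(["sub-01", "sub-02"], 2 * n_timepoints),
    "Brain_Region": np.tile(np.repeat(["region_A", "region_B"], n_timepoints), 2),
    "timepoint": np.tile(np.arange(1, n_timepoints + 1), 4),
    "values": rng.normal(size=4 * n_timepoints),
})

# One mean and one SD per (participant, region) time series.
# pandas' std() uses ddof=1 by default, matching R's sd().
mean_sd = (
    toy.groupby(["Sample_ID", "Brain_Region"])["values"]
    .agg(mean="mean", sd="std")
    .reset_index()
)

# Reshape to long format: one row per time series and feature
# ("feature" and "value" are hypothetical column names)
mean_sd_long = mean_sd.melt(
    id_vars=["Sample_ID", "Brain_Region"],
    var_name="feature",
    value_name="value",
)
print(mean_sd_long)
```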

In [18]:
%%R -i HCP100_input_time_series_data -o HCP100_catch24_features

# We can define a helper function to compute the `catch24` time-series features using the `theft` package
catch24_all_samples <- function(full_TS_data,
                                output_column_names = c("Output"),
                                unique_columns = c("Sample_ID", "Brain_Region")) {
  
  
  # Merge the identifier columns into a single unique ID per time series
  full_TS_data <- full_TS_data %>%
    tidyr::unite("Unique_ID", dplyr::all_of(unique_columns), sep="__")
  
  # Compute the set of 24 time-series features using theft
  TS_catch24 <- theft::calculate_features(data = full_TS_data, 
                                          id_var = "Unique_ID", 
                                          time_var = "timepoint", 
                                          values_var = "values", 
                                          feature_set = "catch22",
                                          features = list("mean" = mean, "sd" = sd)) %>%
    tidyr::separate("id", c(output_column_names), sep="__")

  # Return the resulting set of 24 features computed per brain region
  return(TS_catch24)
    
}

# Compute the 24 time-series features for the HCP100 time-series data
HCP100_catch24_features <- catch24_all_samples(HCP100_input_time_series_data,
                                               output_column_names = c("Sample_ID", "Brain_Region"),
                                               unique_columns = c("Sample_ID", "Brain_Region"))
                                  

No IDs removed. All value vectors good for feature extraction.
Running computations for catch22...

Calculations completed for catch22.
Running computations for user-supplied features...

Calculations completed for user-supplied features.


In [20]:
# Save to a feather file
HCP100_catch24_features.reset_index().to_feather("data/time_series_features/HCP100/HCP100_catch24_features.feather")

We will separately compute the fractional amplitude of low-frequency fluctuations (fALFF) using the Matlab script `compute_regional_fALFF.m` as follows (note: a Matlab license is required to run this):

In [None]:
%%bash 

# First, we need to convert our time-series feather file to a Matlab .mat file to be read in properly
HCP100_time_series_file_base=/Users/abry4213/data/HCP100/raw_data/functional_MRI/HCP100_fMRI_TS

# Run the feather_to_mat.py script with the file base as the input argument, indicating that the output file should be a mat file
python3 code/feature_extraction/feather_to_mat.py ${HCP100_time_series_file_base} mat

In [5]:
%%bash 

# Run the feather_to_mat.py script with the file base as the input argument, indicating that the output file should be a mat file
python3 code/feature_extraction/feather_to_mat.py data/input_data/HCP100_fMRI_TS mat

# Use the current working directory (with any spaces stripped) as the base data path
data_path=$(echo $(pwd) | tr -d ' ')

# Move into the directory that contains compute_regional_fALFF.m
cd code/feature_extraction

# HCP100: run the compute_regional_fALFF.m script -- note that you might need to update your Matlab path here
TS_mat_file="$data_path/data/input_data/HCP100_fMRI_TS.mat"
output_mat_file="data/time_series_features/HCP100_fALFF.mat"
/Applications/MATLAB_R2023b.app/bin/matlab -nodisplay -singleCompThread -r "compute_regional_fALFF $data_path $TS_mat_file $output_mat_file; exit"

# Convert the mat file back to feather for fALFF
python3 feather_to_mat.py ${data_path}/data/time_series_features/HCP100_fALFF feather



                            < M A T L A B (R) >
                  Copyright 1984-2023 The MathWorks, Inc.
              R2023b Update 7 (23.2.0.2515942) 64-bit (maci64)
                              January 30, 2024

 
To get started, type doc.
For product information, visit www.mathworks.com.
 
Process is interrupted.


Launching updater executable


In [15]:
# Read in the fALFF feather file
HCP100_fALFF = pd.read_feather('data/time_series_features/HCP100_fALFF.feather')

# Remove whitespace from the Brain_Region column in the fALFF dataframe
HCP100_fALFF['Brain_Region'] = HCP100_fALFF['Brain_Region'].str.strip()

# Read in the sample metadata file
HCP100_metadata = pd.read_feather('data/input_data/HCP100_sample_metadata.feather')

# Read in catch24 data
HCP100_catch24_features = pd.read_feather('data/time_series_features/HCP100/HCP100_catch24_features.feather')

In [19]:
%%R -i HCP100_fALFF,HCP100_catch24_features,HCP100_metadata -o HCP100_catch25_filtered

source("code/feature_extraction/QC_functions_univariate.R")
univariate_feature_set <- "catch25"

HCP100_catch24_features$Noise_Proc <- "fmriprep"

HCP100_catch25_filtered <- run_QC_for_univariate_dataset(sample_metadata = HCP100_metadata,
                                                           univariate_feature_set = univariate_feature_set,
                                                           catch24_results = HCP100_catch24_features,
                                                           fALFF_results = HCP100_fALFF,
                                                           participants_to_drop = c())

`summarise()` has grouped output by 'Sample_ID'. You can override using the
`.groups` argument.
`mutate_all()` ignored the following grouping variables:
• Column `Sample_ID`
ℹ Use `mutate_at(df, vars(-group_cols()), myoperation)` to silence the message.


In [20]:
# Save the filtered data to feather files
HCP100_catch25_filtered.reset_index().to_feather('data/time_series_features/HCP100_catch25_filtered.feather')

# Save filtered metadata
HCP100_metadata_filtered = HCP100_metadata[HCP100_metadata['Sample_ID'].isin(HCP100_catch25_filtered['Sample_ID'])]

HCP100_metadata_filtered.reset_index().to_feather('data/input_data/HCP100_sample_metadata_filtered.feather')


## Extracting pairwise time-series features with pyspi

In [24]:
%%bash

python3 code/feature_extraction/merge_pyspi_data.py \
--data_path /Users/abry4213/data/HCP100/ \
--dataset_ID HCP100 \
--pkl_file calc.pkl \
--pairwise_feature_set pyspi14 \
--brain_region_lookup data/input_data/HCP100_Brain_Region_Lookup.feather

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sample_data["Sample_ID"] = sample

CalledProcessError: Command 'b'\npython3 code/feature_extraction/merge_pyspi_data.py \\\n--data_path /Users/abry4213/data/HCP100/ \\\n--dataset_ID HCP100 \\\n--pkl_file calc.pkl \\\n--pairwise_feature_set pyspi14 \\\n--brain_region_lookup data/input_data/HCP100_Brain_Region_Lookup.feather\n'' returned non-zero exit status 1.
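The repeated pandas message above is a `SettingWithCopyWarning` raised at the line `sample_data["Sample_ID"] = sample`. A warning like this does not normally stop a script by itself, so the non-zero exit status may well have a different cause that is not visible in the truncated output. The sketch below shows two common ways to avoid the warning when adding a column to a filtered subset; the dataframe contents and the `group` column are made up, and this is not the code in `merge_pyspi_data.py`.

```python
import pandas as pd

# Toy stand-in for a table that is filtered before a column is added
all_data = pd.DataFrame({
    "SPI": ["cov", "cov", "pearson"],
    "value": [0.1, 0.2, 0.3],
    "group": ["a", "b", "a"],
})

# Option 1: take an explicit copy of the filtered rows, then assign
sample_data = all_data[all_data["group"] == "a"].copy()
sample_data["Sample_ID"] = "sub-01"

# Option 2: use assign(), which returns a new DataFrame
sample_data_alt = all_data[all_data["group"] == "a"].assign(Sample_ID="sub-01")

print(sample_data)
```

In both cases the original table is left untouched and the new column lives only in the derived dataframe.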