<a href="https://colab.research.google.com/github/Nick7900/permutation_test/blob/main/2_Train_HMM_Gamma.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## GLHMM: train basic HMM and get Gamma

In this tutorial, we will explore how to select and analyze data from the Human Connectome Project (HCP) dataset.

This notebook goes through the basic steps to train a "classic" HMM on a single set of timeseries, such as neuroimaging or electrophysiological recordings from multiple subjects or sessions.

We will go through the following steps in this notebook:

1. Setup Google Colab
2. Download the neuroimaging data
3. Prepare data for the HMM
4. Initialise and train HMM
5. Save data


## 1: Setup Google Colab
This notebook was written for **Google Colab**. Before running it, install the required packages and import the libraries needed to load the data that we prepared in the notebook ```1_preprocessing_data_selection```.

The packages can be installed with the following commands:
```
pip install requests
pip install gdown
```

To train the HMM, install the GLHMM toolkit in your Python environment.
```
pip install --user git+https://github.com/vidaurre/glhmm
```

In **Google Colab**, we will clone the toolbox instead:

In [None]:
# -q keeps the pip output short in Google Colab
!pip install -q requests
!pip install -q gdown
# Clone the GLHMM into Google Colab
!git clone https://github.com/vidaurre/glhmm
%cd glhmm

Cloning into 'glhmm'...
remote: Enumerating objects: 863, done.
remote: Counting objects: 100% (156/156), done.
remote: Compressing objects: 100% (115/115), done.
remote: Total 863 (delta 75), reused 65 (delta 40), pack-reused 707
Receiving objects: 100% (863/863), 12.61 MiB | 21.13 MiB/s, done.
Resolving deltas: 100% (506/506), done.
/content/glhmm


### Import Libraries

In [None]:
import os
import numpy as np
from glhmm import glhmm
import requests
import gdown

### Load Helper function
We will use ```my_functions.py``` to prepare the data for training with the GLHMM.

In [None]:
# Move back to the main folder
%cd ..
# Download the helper module from the raw GitHub file
url = 'https://raw.githubusercontent.com/Nick7900/permutation_test/main/helper_functions/my_functions.py'
r = requests.get(url)
r.raise_for_status()  # stop here if the download failed
# Save the helper module to the current directory
with open("my_functions.py", "w") as f:
    f.write(r.text)

/content


## 2: Download the neuroimaging data
Now we will download the **neuroimaging data** that was prepared in the tutorial **1_preprocessing_data_selection.ipynb**.
We will train a classic HMM on ```data_neuroimaging.npy```, a subset of the HCP dataset that we exported in the previous notebook.

The file ```data_neuroimaging.npy``` contains data from 1003 subjects, with 1200 timepoints per subject and 50 parcels.

In [None]:
# Download the file from Google Drive
# data_neuroimaging.npy
url = "https://drive.google.com/uc?id=1bPhw4GOoLDqkMWvVbkRAIh_XYG6L0JQZ&export=download"
gdown.download(url, quiet=True)

'data_neuroimaging.npy'

In [None]:
# Load the data
current_directory = os.getcwd()
data_folder = ""
data_file = 'data_neuroimaging.npy'

# Load the neuroimaging data
data_file_path = os.path.join(current_directory, data_folder, data_file)
data_neuroimaging = np.load(data_file_path)

Check the shape of the dataset:

In [None]:
data_neuroimaging.shape

(1003, 1200, 50)

## 3: Prepare data for the HMM
When preparing data for training an **HMM**, the data must have a specific shape: ```[(number of subjects/sessions * number of timepoints), number of features]```. All subjects and/or sessions are stacked along the first dimension, and the second dimension holds the features, for example the parcels or channels.

To train the HMM, we therefore concatenate our dataset, ```data_neuroimaging```, along the first dimension. This transforms the dataset from its initial shape of ```[1003, 1200, 50]```, representing ```[n_subject, n_timepoints, n_features]```, to a concatenated shape of ```[1203600, 50]```, structured as ```[(n_subject * n_timepoints), n_features]```.
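
As a minimal sketch of what this concatenation does, the example below applies a plain NumPy reshape to a small synthetic array. It assumes that stacking subjects one after another is all that is required; the helper ```get_concatenate_data``` used in this notebook may be implemented differently, and the names ```toy``` and ```toy_concat``` are only illustrative.

```python
import numpy as np

# Small stand-in for data_neuroimaging: 3 subjects, 4 timepoints, 2 features
n_subjects, n_timepoints, n_features = 3, 4, 2
toy = np.arange(n_subjects * n_timepoints * n_features, dtype=float).reshape(
    n_subjects, n_timepoints, n_features
)

# Stack subjects along the first axis: (subjects * timepoints, features)
toy_concat = toy.reshape(n_subjects * n_timepoints, n_features)

# The second subject (index 1) occupies rows 4..7 of the concatenated array
subject_1_rows = toy_concat[n_timepoints:2 * n_timepoints]
print(toy_concat.shape)
```

Because the reshape keeps the row order, each subject stays in one contiguous block of rows, which is what makes start and end indices per subject meaningful.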


### Concatenate data

In [None]:
from my_functions import get_concatenate_data, get_timestamp_indices
# Getting the shape
n_subjects = len(data_neuroimaging)
n_timestamps, n_features = data_neuroimaging[0].shape
# Using a helper function to concatenate data
data = get_concatenate_data(data_neuroimaging)
data.shape

(1203600, 50)

The concatenated data has the shape ```(1203600, 50)```

### Indices of each timestep
In addition to the concatenated data, we need to specify the indices in the concatenated timeseries that correspond to the beginning and end of each subject/session, as an array of shape ```[n_subjects, 2]```.

In this case, we have timeseries of 1200 timepoints for each of the 1003 subjects.

In [None]:
# Generate the start and end indices of each subject in the concatenated data
idx_data = get_timestamp_indices(n_timestamps, n_subjects)
# Show the indices of the first 10 subjects
print(f"Show the first 10 indices:\n{idx_data[:10]}\n")
print(f"The shape of idx_data:\n{idx_data.shape}")

Show the first 10 indices:
[[    0  1200]
 [ 1200  2400]
 [ 2400  3600]
 [ 3600  4800]
 [ 4800  6000]
 [ 6000  7200]
 [ 7200  8400]
 [ 8400  9600]
 [ 9600 10800]
 [10800 12000]]

The shape of idx_data:
(1003, 2)
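
For equal-length sessions, the index layout shown above can be reproduced with a few lines of NumPy. This is only a sketch: ```make_session_indices``` is a hypothetical function written for illustration, not the implementation of ```get_timestamp_indices```.

```python
import numpy as np

def make_session_indices(n_timepoints, n_sessions):
    """Hypothetical helper: start and end row indices for equal-length sessions."""
    starts = np.arange(n_sessions) * n_timepoints
    return np.column_stack([starts, starts + n_timepoints])

idx_sketch = make_session_indices(1200, 1003)
print(idx_sketch[:3])
print(idx_sketch.shape)
```

Each row's end index equals the next row's start index, so the end appears to be exclusive: a slice such as ```data[start:end]``` would then recover the rows of one subject.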


## 4: Initialise and train HMM
We first initialise the hmm object and specify hyperparameters. In this case, since we do not model an interaction between two sets of variables in the HMM states, we set ```model_beta='no'```.

Here we estimate 8 states. If you want to model a different number of states, change ```K``` to a different value.

We model the states as Gaussian distributions with a mean and a full covariance matrix, so that each state is described by a mean amplitude and a functional connectivity pattern. To do this, we specify ```covtype='full'```. If you do not want to model the mean, add ```model_mean='no'```.
Optionally, you can print the hyperparameters to check that the model is set up the way you intended.

In [None]:
K = 8
hmm = glhmm.glhmm(model_beta='no', K=K, covtype='full')
print(hmm.hyperparameters)

{'K': 8, 'covtype': 'full', 'model_mean': 'state', 'model_beta': 'no', 'dirichlet_diag': 10, 'connectivity': None, 'Pstructure': array([[ True,  True,  True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True,  True,  True]]), 'Pistructure': array([ True,  True,  True,  True,  True,  True,  True,  True])}


Next, we train the Hidden Markov Model (HMM) using the concatenated data and the indices. In this case, we are not modeling an interaction between two sets of timeseries; instead, we are running a "classic" HMM. To do this, we set ```X``` to ```None```. ```Y``` is the concatenated timeseries (```data```) for which we want to estimate states, and the indices (```idx_data```) mark the beginning and end of each subject.

We can generate different output variables such as ```Gamma```, which represents the state probabilities at each timepoint, ```Xi```, which represents the joint probabilities of past and future states conditioned on the data, and ```FE```, which represents the free energy of each iteration.

In [None]:
Gamma,Xi,FE = hmm.train(X=None, Y=data, indices=idx_data)

Cycle 1 free energy = 336428513.51399845
Cycle 2 free energy = 335363299.15856534
Cycle 3, free energy = 333652794.5246657, relative change = 0.6162384018242506




Cycle 4, free energy = 333150665.11334014, relative change = 0.15318872319559634
Cycle 5, free energy = 332967971.07746595, relative change = 0.05279346785218141
Cycle 6, free energy = 332894673.8331534, relative change = 0.020741530723600268
Cycle 7, free energy = 332858024.17624485, relative change = 0.010264603375522244
Cycle 8, free energy = 332834989.2265579, relative change = 0.006410127730999965
Cycle 9, free energy = 332818065.4088708, relative change = 0.0046874562919374625
Cycle 10, free energy = 332804779.3035477, relative change = 0.0036664127531260493
Finished training in 925.47s : active states = 8
Init repetition 1 free energy = 332804779.3035477
Cycle 1 free energy = 336427908.04994136
Cycle 2 free energy = 335760876.37482184
Cycle 3, free energy = 333807027.0132252, relative change = 0.745493341447013
Cycle 4, free energy = 333287445.8452534, relative change = 0.16544735586888465
Cycle 5, free energy = 333090140.86944956, relative change = 0.05911286352057068
Cycle 6, 
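
The free energy values in the log above can be used for a quick convergence check. The sketch below copies the ten values from the first run and confirms that the free energy drops at every cycle. The formula for the relative change is an assumption: it was chosen because it reproduces the logged numbers, not taken from the toolbox source.

```python
import numpy as np

# Free energy values copied from cycles 1-10 of the first run in the log
fe_log = np.array([
    336428513.51399845, 335363299.15856534, 333652794.5246657,
    333150665.11334014, 332967971.07746595, 332894673.8331534,
    332858024.17624485, 332834989.2265579, 332818065.4088708,
    332804779.3035477,
])

# Change between consecutive cycles; negative means the free energy dropped
fe_steps = np.diff(fe_log)
print(np.all(fe_steps < 0))

# Assumed formula: latest drop divided by the total drop since cycle 1
rel_change = (fe_log[1:-1] - fe_log[2:]) / (fe_log[0] - fe_log[2:])
print(rel_change)
```

With the real output, the same check can be applied to the ```FE``` array returned by ```hmm.train```.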

The shape of Gamma should be ```[1203600, 8]```: one row per timepoint of the concatenated data ```[1203600, 50]``` and one column per state (```K=8```).

This means that for each timepoint we have estimated the probability of each state being active, since Gamma holds the state probabilities at every timepoint.

In [None]:
Gamma.shape
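
Since each row of Gamma holds state probabilities for one timepoint, two simple checks follow: the rows should sum to 1, and the most probable state per timepoint is the column with the largest value. The sketch below shows both on a small synthetic array; ```toy_gamma``` is made-up data, not output of the trained model.

```python
import numpy as np

# Synthetic stand-in for Gamma: 6 timepoints, 3 states (not real model output)
rng = np.random.default_rng(0)
toy_gamma = rng.random((6, 3))
toy_gamma /= toy_gamma.sum(axis=1, keepdims=True)  # make each row a distribution

# Each row holds state probabilities for one timepoint, so rows sum to 1
row_sums = toy_gamma.sum(axis=1)

# Most probable state per timepoint (pointwise argmax, not a Viterbi path)
most_likely_state = toy_gamma.argmax(axis=1)
print(row_sums)
print(most_likely_state)
```

Note that the pointwise argmax treats every timepoint independently, so it is a convenient summary rather than the single most likely state sequence.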

## 5: Save Data
Finally, we save ```Gamma``` for further analysis.

In [None]:
import os
# Specify the folder path and name
folder_name = "data"
current_directory = os.getcwd()
folder_path = os.path.join(current_directory, folder_name)
if not os.path.exists(folder_path):
    # Create the directory because it does not exist yet
    os.makedirs(folder_path)
    print("The new directory is created!")

# Save Gamma (the array returned by hmm.train, not the file name)
gamma_file = 'gamma.npy'
file_path = os.path.join(folder_path, gamma_file)
np.save(file_path, Gamma)
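
A quick way to confirm that a saved ```.npy``` file really contains the array is to load it back and compare. The sketch below does this round trip with a small synthetic array in a temporary folder; ```toy_gamma``` and the temporary path are placeholders, not the notebook's actual output.

```python
import os
import tempfile
import numpy as np

# Synthetic array standing in for Gamma
toy_gamma = np.full((4, 2), 0.5)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "gamma.npy")
    np.save(toy_path, toy_gamma)   # write the array to disk
    reloaded = np.load(toy_path)   # read it back into memory

print(reloaded.shape, reloaded.dtype)
```

If the reloaded object has the expected shape and a floating-point dtype, the file holds the state probabilities and can be used in later notebooks.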