<a href="https://colab.research.google.com/github/Nick7900/permutation_test/blob/main/2_preprocessing_Gamma.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## GLHMM: train basic HMM and get Gamma

This notebook goes through the basic steps to train a "classic" HMM on a single set of timeseries, such as neuroimaging or electrophysiological recordings from multiple subjects or sessions.


When using **Google Colab**, we first need to install the following libraries so that we can download the data of interest:

```
!pip install requests
!pip install gdown
```



### Import modules
We first import the relevant modules. If you have not done so, install the repo using:

```pip install --user git+https://github.com/vidaurre/glhmm```

In [None]:
!pip install requests
!pip install gdown
!pip install mat73

Collecting mat73
  Downloading mat73-0.60-py3-none-any.whl (19 kB)
Installing collected packages: mat73
Successfully installed mat73-0.60


In [None]:
!git clone https://github.com/vidaurre/glhmm
%cd glhmm

Cloning into 'glhmm'...
remote: Enumerating objects: 863, done.
remote: Counting objects: 100% (155/155), done.
remote: Compressing objects: 100% (114/114), done.
remote: Total 863 (delta 74), reused 65 (delta 40), pack-reused 708
Receiving objects: 100% (863/863), 12.61 MiB | 15.19 MiB/s, done.
Resolving deltas: 100% (505/505), done.
/content/glhmm


### Import packages

In [None]:
import os
import numpy as np
from glhmm import glhmm
import requests
import gdown

### Load Helper function

In [None]:
%cd ..
# Import helper function
# Get the raw github file
url = 'https://raw.githubusercontent.com/Nick7900/permutation_test/main/helper_functions/my_functions.py'
r = requests.get(url)
# Save the function to the directory
with open("my_functions.py","w") as f:
  f.write(r.text)

/content


### Load data
First, we load the data files into our Python environment. These files are the output of the tutorial **1_preprocessing_data_selection.ipynb**.
We will train a classic HMM on ```data_measurement.npy```, a subset of the HCP dataset that we exported in the previous notebook.

The file ```data_measurement.npy``` contains data from 60 subjects, each with 300 timepoints and 50 parcels.


When training an HMM, the data should have the shape ((no subjects/sessions * no timepoints), no features), meaning that all subjects and/or sessions have been concatenated along the first dimension.
The second dimension is the number of features, e.g., the number of parcels or channels.
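
As a minimal sketch of what this concatenation means, the toy example below reshapes a small 3D array into the 2D layout described above. The array sizes and the names `toy_3d` and `toy_2d` are illustrative assumptions, not part of the tutorial data.

```python
import numpy as np

# Toy stand-in for real recordings: 4 subjects, 5 timepoints, 3 features.
n_subjects, n_timepoints, n_features = 4, 5, 3
toy_3d = np.arange(n_subjects * n_timepoints * n_features, dtype=float).reshape(
    n_subjects, n_timepoints, n_features
)

# Stack the subjects along the first axis: (subjects * timepoints, features).
toy_2d = toy_3d.reshape(n_subjects * n_timepoints, n_features)

print(toy_2d.shape)  # (20, 3)

# Rows 5..9 of the concatenated array are exactly the timeseries of subject 1.
assert np.array_equal(toy_2d[n_timepoints:2 * n_timepoints], toy_3d[1])
```

Because NumPy reshapes in row-major order by default, each subject stays as one contiguous block of rows, which is what the start and end indices used later rely on.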

To download a file from Google Drive with ```gdown```, the share link first has to be converted into a direct download link. Remove the text **file/d/** from the link and replace it with **uc?id=**

Then remove everything after the file ID, starting from **/view**, and put **&export=download** in its place.
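
The link conversion can be sketched as a small string helper. The function name `drive_share_to_download` is hypothetical, and the exact form of the share link (ending in `/view?usp=sharing`) is an assumption about what Google Drive typically produces; only the file ID and the resulting download link come from this notebook.

```python
def drive_share_to_download(share_url: str) -> str:
    """Turn a Drive share link into a direct download link (hypothetical helper)."""
    marker = "file/d/"
    start = share_url.index(marker)           # raises ValueError if the marker is missing
    prefix = share_url[:start]                # e.g. "https://drive.google.com/"
    rest = share_url[start + len(marker):]    # "<file id>/view?..."
    file_id = rest.split("/")[0]              # keep only the file ID
    return f"{prefix}uc?id={file_id}&export=download"


# Assumed share-link format; the file ID is the one used in this notebook.
share = "https://drive.google.com/file/d/1dBlUk_ecvkCQILZCSCcreMJ71vOvQ54P/view?usp=sharing"
print(drive_share_to_download(share))
```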

In [None]:
# Download files from Google Drive
# data_measurement
url = "https://drive.google.com/uc?id=1dBlUk_ecvkCQILZCSCcreMJ71vOvQ54P&export=download"
gdown.download(url, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1dBlUk_ecvkCQILZCSCcreMJ71vOvQ54P&export=download
To: /content/data_measurement.npy
100%|██████████| 7.20M/7.20M [00:00<00:00, 22.9MB/s]


'data_measurement.npy'

In [None]:
# Get the current directory
current_directory = os.getcwd()
data_tmp_folder = ""
data_tmp_file = 'data_measurement.npy'

# Load the measurement data (subjects x timepoints x features)
data_tmp_file_path = os.path.join(current_directory, data_tmp_folder, data_tmp_file)
data_tmp = np.load(data_tmp_file_path)

Look at the dataset

In [None]:
data_tmp.shape

(60, 300, 50)

Before we can train the HMM, we concatenate our dataset ```data_tmp``` along the first dimension, as described above.

This takes us from a dataset of shape ```[60, 300, 50]```, i.e. ```[n_subject, n_timepoints, n_features]```, to a concatenated dataset of shape ```[18000, 50]```, i.e. ```[(n_subject * n_timepoints), n_features]```.

In [None]:
from my_functions import *
# Getting the shape
n_subjects = len(data_tmp)
n_timestamps, n_features = data_tmp[0].shape
# Using a helper function to concatenate data
data = get_concatenate_data(data_tmp)
data.shape


(18000, 50)

The timeseries has the shape ```(18000, 50)``` and the indices have the shape ```(60, 2)```.
Data should be in numpy format.

In addition to ```data_measurement.npy```, we need to specify the indices in the concatenated timeseries corresponding to the beginning and end of individual subjects/sessions in the shape ```[n_subjects, 2]```.

In this case, we have timeseries of 300 timepoints for each of the 60 subjects.

In [None]:
# Generate the start and end indices of each subject in the concatenated data
T_t_tmp = get_timestamp_indices(n_timestamps, n_subjects)
# Show the indices of the first 10 subjects
T_t_tmp[:10]

array([[   0,  300],
       [ 300,  600],
       [ 600,  900],
       [ 900, 1200],
       [1200, 1500],
       [1500, 1800],
       [1800, 2100],
       [2100, 2400],
       [2400, 2700],
       [2700, 3000]])
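
For readers without access to the helper file, an index array of this form can be built directly with NumPy. The function `make_session_indices` below is a hypothetical stand-in that reproduces the output shown above; it is an assumption about what `get_timestamp_indices` does, not a copy of its implementation.

```python
import numpy as np


def make_session_indices(n_timepoints: int, n_sessions: int) -> np.ndarray:
    """Return one [start, end) row per session (hypothetical stand-in)."""
    starts = np.arange(n_sessions) * n_timepoints
    return np.column_stack([starts, starts + n_timepoints])


idx = make_session_indices(300, 60)
print(idx.shape)  # (60, 2)
print(idx[:3])    # rows [0, 300], [300, 600], [600, 900]
```

Each row can be used as a slice, e.g. `data[start:end]`, to recover the timeseries of a single subject from the concatenated array.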

### Initialise and train HMM
We first initialise the hmm object and specify hyperparameters. In this case, since we do not model an interaction between two sets of variables in the HMM states, we set ```model_beta='no'```.

We here estimate 3 states. If you want to model a different number of states, change K to a different value.

We here model states as Gaussian distributions with mean and full covariance matrix, so that each state is described by a mean amplitude and functional connectivity pattern. To do this, specify ```covtype='full'```. If you do not want to model the mean, add ```model_mean='no'```.
Optionally, you can check the hyperparameters to make sure that they correspond to how you want the model to be set up.

In [None]:
hmm = glhmm.glhmm(model_beta='no', K=3, covtype='full')
print(hmm.hyperparameters)

{'K': 3, 'covtype': 'full', 'model_mean': 'state', 'model_beta': 'no', 'dirichlet_diag': 10, 'connectivity': None, 'Pstructure': array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]]), 'Pistructure': array([ True,  True,  True])}


We then train the HMM using the data and indices prepared above. Since we here do not model an interaction between two sets of timeseries but run a "classic" HMM instead, we set X=None. Y should be the timeseries in which we want to estimate states (here called data) and indices should be the beginning and end indices of each subject (here called T_t_tmp).

Optionally, you can also return Gamma (the state probabilities at each timepoint), Xi (the joint probabilities of past and future states conditioned on the data) and FE (the free energy of each iteration).

In [None]:
Gamma,Xi,FE = hmm.train(X=None, Y=data, indices=T_t_tmp)

Cycle 1 free energy = 4989870.453329922
Cycle 2 free energy = 4981449.500864424
Cycle 3, free energy = 4973980.010111624, relative change = 0.47006182585257045
Cycle 4, free energy = 4969390.5142995, relative change = 0.22409714234535133
Cycle 5, free energy = 4967099.221585458, relative change = 0.10062225617631541
Cycle 6, free energy = 4965936.121445439, relative change = 0.04859547137697134
Cycle 7, free energy = 4965279.007008751, relative change = 0.026721260234384498
Cycle 8, free energy = 4964705.51264927, relative change = 0.02278941829265188
Cycle 9, free energy = 4964343.14268738, relative change = 0.014195383405785387
Cycle 10, free energy = 4963967.226763115, relative change = 0.014512320436048111
Finished training in 5.1s : active states = 3
Init repetition 1 free energy = 4963967.226763115
Cycle 1 free energy = 4989508.608867989
Cycle 2 free energy = 4981851.947660873
Cycle 3, free energy = 4975271.730316416, relative change = 0.46219522914519223
Cycle 4, free energy = 4

We can see that the shape of Gamma is ```[18000, 3]```, which corresponds to the concatenated data ```[18000, 50]```: one row per timepoint and one column per state.

This basically means that for each timepoint we have estimated the probability of every state, since Gamma is the probability of each state being active at a given timepoint.

In [None]:
Gamma.shape

(18000, 3)
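
To illustrate how an array of state probabilities like this can be read, the sketch below uses a small random array in place of the real Gamma. The names `toy_gamma` and `toy_indices` and the two summary measures are illustrative assumptions and are not computed anywhere else in this notebook.

```python
import numpy as np

# Toy stand-in for Gamma: 12 timepoints, 3 states, each row sums to 1.
rng = np.random.default_rng(0)
toy_gamma = rng.dirichlet(alpha=np.ones(3), size=12)

# Two toy sessions of 6 timepoints each, in the same [start, end) format as the indices.
toy_indices = np.array([[0, 6], [6, 12]])

# Most probable state at each timepoint (a hard assignment derived from Gamma).
hard_states = toy_gamma.argmax(axis=1)

# Average state probability per session, often called fractional occupancy.
occupancy = np.array([toy_gamma[start:end].mean(axis=0) for start, end in toy_indices])

print(hard_states.shape)  # (12,)
print(occupancy.shape)    # (2, 3)
```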

For within-session continuous testing (Tutorial C) with the ```GLHMM``` package, the input data is the Viterbi path rather than Gamma.
Within-session continuous testing works with the most likely state sequence of the HMM over the course of each session. The Viterbi algorithm efficiently calculates this most probable state sequence given an observation sequence and an HMM. Unlike Gamma, which holds a probability for every state at each timepoint, the Viterbi path assigns exactly one state to each timepoint.


In [None]:
vpath = hmm.decode(X=None, Y=data, indices=T_t_tmp, viterbi=True)
vpath

array([[1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       ...,
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.]])
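
The idea behind the Viterbi algorithm can be sketched on a toy two-state HMM. This is a generic textbook version in log space, not the GLHMM implementation; the function name `viterbi` and all probabilities below are made-up assumptions for illustration.

```python
import numpy as np


def viterbi(log_pi, log_A, log_B):
    """Most probable state path of an HMM (generic sketch, not GLHMM's code).

    log_pi: (K,) log initial state probabilities
    log_A:  (K, K) log transition probabilities, A[i, j] = P(next=j | current=i)
    log_B:  (T, K) log likelihood of each observation under each state
    """
    T, K = log_B.shape
    delta = np.empty((T, K))               # best log score of any path ending in each state
    backptr = np.zeros((T, K), dtype=int)  # best previous state for each (time, state)
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A  # scores[i, j]: come from i, move to j
        backptr[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[t]
    # Trace the best path backwards from the best final state.
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = backptr[t + 1, path[t + 1]]
    return path


pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])  # "sticky" states
B = np.array([[0.9, 0.1], [0.9, 0.1], [0.4, 0.6], [0.9, 0.1], [0.9, 0.1]])

path = viterbi(np.log(pi), np.log(A), np.log(B))
print(path)             # [0 0 0 0 0]
print(np.eye(2)[path])  # one-hot encoding, one row per timepoint
```

In this toy case the third observation slightly favours state 1, but switching state twice is so unlikely under the transition matrix that the most probable path stays in state 0 throughout.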

## Save Data
Finally, we save ```Gamma``` for further analysis.

In [None]:
import os
# Specify the folder path and name
folder_name = "data"
current_directory = os.getcwd()
folder_path = os.path.join(current_directory, folder_name)
if not os.path.exists(folder_path):
   # Create a new directory because it does not exist
   os.makedirs(folder_path)
   print("The new directory is created!")


# Save Gamma (the array of state probabilities, not the file name)
gamma_file = 'gamma.npy'
file_path = os.path.join(folder_path, gamma_file)
np.save(file_path, Gamma)

## Save Viterbi path

In [None]:
# Save Viterbi path (the array itself, not the file name)
vpath_file = 'vpath.npy'
file_path = os.path.join(folder_path, vpath_file)
np.save(file_path, vpath)
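
A quick way to check that a saved file really contains an array is to load it again and compare. The sketch below does this round trip with a small toy array in a temporary folder; the names `toy` and `toy.npy` are placeholders, and the same check could be applied to the saved Gamma and Viterbi path.

```python
import os
import tempfile

import numpy as np

# Toy one-hot array standing in for Gamma or the Viterbi path: 4 timepoints, 3 states.
toy = np.eye(3)[[0, 2, 1, 1]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.npy")
    np.save(path, toy)        # the second argument is the array to store
    restored = np.load(path)  # read it back from disk

# The restored array should match the original exactly.
assert restored.shape == toy.shape
assert np.array_equal(restored, toy)
```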