<a href="https://colab.research.google.com/github/Nick7900/permutation_test/blob/main/3_preprocessing_Viterbi_path.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## GLHMM: train basic HMM and get Viterbi Path

This notebook goes through the basic steps to train a "classic" HMM on a single set of timeseries, such as neuroimaging or electrophysiological recordings from multiple subjects or sessions.


When using **Google Colab** we need to install the following packages so that we can download the data of interest

```
!pip install requests
!pip install gdown
```



### Import modules
We first import the relevant modules. If you have not done so, install the repo using:

```pip install --user git+https://github.com/vidaurre/glhmm```

In [1]:
!pip install requests
!pip install gdown
!pip install mat73

Collecting mat73
  Downloading mat73-0.60-py3-none-any.whl (19 kB)
Installing collected packages: mat73
Successfully installed mat73-0.60


In [2]:
!git clone https://github.com/vidaurre/glhmm
%cd glhmm

Cloning into 'glhmm'...
remote: Enumerating objects: 863, done.
remote: Counting objects: 100% (156/156), done.
remote: Compressing objects: 100% (115/115), done.
remote: Total 863 (delta 75), reused 65 (delta 40), pack-reused 707
Receiving objects: 100% (863/863), 12.61 MiB | 22.41 MiB/s, done.
Resolving deltas: 100% (506/506), done.
/content/glhmm


### Import libraries

In [3]:
import os
import numpy as np
from glhmm import glhmm
import requests
import gdown

In [4]:
%cd ..
# Import helper function
# Get the raw github file
url = 'https://raw.githubusercontent.com/Nick7900/permutation_test/main/helper_functions/my_functions.py'
r = requests.get(url)
# Save the function to the directory
with open("my_functions.py","w") as f:
  f.write(r.text)

/content


### Load data
For this example we analyse memory task data recorded with magnetoencephalography (MEG) across multiple sessions and trials for one subject.


We will load the data from Google Drive. To turn a Google Drive share link into a direct-download link:

Remove the text **file/d/** from the link and replace it with **uc?id=**

Then remove everything after the file ID, including **/view**, and put **&export=download** in its place
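
The two replacement steps above can be sketched as a small helper. The function name `drive_direct_url` and the example share link (built from the file ID used in the next cell, with an assumed `/view?usp=sharing` suffix) are illustrative and not part of `gdown`.

```python
import re


def drive_direct_url(share_url: str) -> str:
    """Turn a Google Drive share link into a direct-download link."""
    # The file ID is the path segment that follows "file/d/"
    match = re.search(r"/file/d/([^/?#]+)", share_url)
    if match is None:
        raise ValueError(f"No file ID found in: {share_url}")
    file_id = match.group(1)
    # Replace "file/d/" with "uc?id=" and everything after the ID with "&export=download"
    return f"https://drive.google.com/uc?id={file_id}&export=download"


# Hypothetical share link; only the file ID is taken from this notebook
share_url = "https://drive.google.com/file/d/1XhjINejfg7yPsxJ_sLjOZQ-1VNOT9ySB/view?usp=sharing"
direct_url = drive_direct_url(share_url)
print(direct_url)
```

The resulting string has the same form as the URLs passed to `gdown.download` in this notebook.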

In [5]:
# Download files from Google Drive
# Load X_data (X_memory)
url = "https://drive.google.com/uc?id=1XhjINejfg7yPsxJ_sLjOZQ-1VNOT9ySB&export=download"
gdown.download(url, quiet=False)

# Load dependent variables (y_memory)
url = "https://drive.google.com/uc?id=17QcxDcvZasvsQ-iBTL2uqsDbBgBOTLkj&export=download"
gdown.download(url, quiet=False)

# Load indices
url = "https://drive.google.com/uc?id=12aOoMd6DheYb9PfPfOFi3oytemH7enqI&export=download"
gdown.download(url, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1XhjINejfg7yPsxJ_sLjOZQ-1VNOT9ySB&export=download
To: /content/X_memory.npy
100%|██████████| 844M/844M [00:03<00:00, 262MB/s]
Downloading...
From: https://drive.google.com/uc?id=17QcxDcvZasvsQ-iBTL2uqsDbBgBOTLkj&export=download
To: /content/y_memory.npy
100%|██████████| 52.9k/52.9k [00:00<00:00, 60.8MB/s]
Downloading...
From: https://drive.google.com/uc?id=12aOoMd6DheYb9PfPfOFi3oytemH7enqI&export=download
To: /content/idx_session.npy
100%|██████████| 248/248 [00:00<00:00, 1.32MB/s]


'idx_session.npy'

In [6]:
# Load the data and show its shape

current_directory = os.getcwd()
folder_name = ""  # optional subfolder inside the working directory

# Load X data
file_path = os.path.join(current_directory, folder_name, 'X_memory.npy')
X_data = np.load(file_path)

# Load y data
file_path = os.path.join(current_directory, folder_name, 'y_memory.npy')
y_data = np.load(file_path)


# Load indices
file_path = os.path.join(current_directory, folder_name, 'idx_session.npy')
idx_data = np.load(file_path)


print(f"Data dimension of X Memory data: {X_data.shape}")
print(f"Data dimension of y Memory data: {y_data.shape}")
print(f"Data dimension of indices Memory: {idx_data.shape}")

Data dimension of X Memory data: (250, 6595, 64)
Data dimension of y Memory data: (6595,)
Data dimension of indices Memory: (15, 2)


Now we can look at the data structure.
- X_data: 3D array of shape (n_timepoints, n_trials, n_features)
- y_data : 1D array of shape (n_trials,)
- idx_data: 2D array of shape (n_sessions, 2)

```X_data``` holds the measurements taken from the subject. It is a 3D array of shape ```(250, 6595, 64)```. The first dimension is the number of time points in each trial (```250```). The second dimension, ```6595```, is the number of trials. The third dimension is the ```64``` MEG channels recorded at every time point.

```y_data``` is an array containing only 0s and 1s, with one value per trial. Each value indicates whether an image of an animate or an inanimate object was shown on the screen during that trial.

```idx_data``` has shape ```(15, 2)```, one row for each of the ```15``` sessions. The two values in each row are the start and end indices of the trials that belong to that session.
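
As a rough illustration of how ```idx_data``` maps onto ```X_data```, the sketch below slices a toy array by session. The toy arrays and the names `X_toy`, `y_toy` and `idx_toy` are made up for this example, and the end index is assumed to be inclusive, matching the `+1` used in the concatenation loop later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins with the same layout as X_data, y_data and idx_data
n_timepoints, n_trials, n_features = 5, 6, 3
X_toy = rng.standard_normal((n_timepoints, n_trials, n_features))
y_toy = np.array([0, 1, 1, 0, 1, 0])
# Two sessions: trials 0-2 and trials 3-5 (end index assumed inclusive)
idx_toy = np.array([[0, 2], [3, 5]])

session_shapes = []
for session, (start, end) in enumerate(idx_toy):
    # Trials live on the second axis of the 3D array
    X_session = X_toy[:, start:end + 1, :]
    y_session = y_toy[start:end + 1]
    session_shapes.append(X_session.shape)
    print(f"Session {session}: X {X_session.shape}, labels {y_session}")
```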



### Prepare data for the HMM

To train the HMM, the ```GLHMM``` package expects a 2D matrix of continuous measurements as **input**, together with a 2D array of indices that marks where each session starts and ends. We therefore concatenate all trials from each session along the first dimension of ```X_data``` into a new data matrix, ```X_memory_con```.

The resulting data matrix has shape ```[1648750, 64]```, that is (```n_timepoints``` * ```n_trials```, ```n_channels```), where ```n_timepoints``` is the number of time points in each trial (250) and ```n_trials``` is the total number of trials across all sessions (6595). The session indices are recomputed in ```idx_data_con``` so that they refer to rows of the concatenated matrix, and each label in ```y_data``` is repeated for every time point of its trial.




In [7]:
# Concatenate the trials of each session into one continuous data matrix
X_memory_con = []
y_memory_con = []
idx_data_con = np.zeros_like(idx_data)
for i in np.arange(len(idx_data)):
    # idx_data holds the first and last trial of session i (end inclusive)
    for j in np.arange(idx_data[i,0], idx_data[i,1]+1):
        # Append the (n_timepoints, n_features) block of trial j
        X_memory_con.extend(X_data[:,j,:])
        # Repeat the trial label for every time point of the trial
        y_memory_con.extend([y_data[j] for _ in range(X_data.shape[0])])
    # Session i ends at the current length of the concatenated data
    idx_data_con[i,1] = len(X_memory_con)
    # The next session starts where this one ends
    if i < len(idx_data)-1:
        idx_data_con[i+1,0] = idx_data_con[i,1]


X_memory_con = np.array(X_memory_con)
y_memory_con = np.array(y_memory_con)
# Show the shape of the continuous measurements
print(f"Data dimension of X Memory data: {X_memory_con.shape}")
# Show the shape of the session indices
print(f"Data dimension of indices data: {idx_data_con.shape}")
# Show the shape of the continuous responses
print(f"Data dimension of y Memory data: {y_memory_con.shape}")

Data dimension of X Memory data: (1648750, 64)
Data dimension of indices data: (15, 2)
Data dimension of y Memory data: (1648750,)


The concatenated timeseries has the shape ```(1648750, 64)``` and the indices have the shape ```(15, 2)```.
Both should be NumPy arrays.
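
The nested loop above can be slow for large arrays. A possible vectorised alternative is sketched below on toy data. It assumes that the sessions cover all trials contiguously and in order, and that the end index of each session is inclusive. The names `X_toy`, `y_toy` and `idx_toy` are hypothetical. The loop is repeated only to check that both approaches agree.

```python
import numpy as np

rng = np.random.default_rng(0)
n_timepoints, n_trials, n_features = 4, 5, 3
X_toy = rng.standard_normal((n_timepoints, n_trials, n_features))
y_toy = np.array([0, 1, 1, 0, 1])
idx_toy = np.array([[0, 1], [2, 4]])  # inclusive trial ranges (assumption)

# Put the trial axis first, then stack all trials along the rows
X_con = X_toy.transpose(1, 0, 2).reshape(-1, n_features)
# Repeat each trial label once per time point
y_con = np.repeat(y_toy, n_timepoints)

# Session boundaries in rows of X_con, as [start, end) pairs
trials_per_session = idx_toy[:, 1] - idx_toy[:, 0] + 1
ends = np.cumsum(trials_per_session) * n_timepoints
starts = ends - trials_per_session * n_timepoints
idx_con = np.column_stack([starts, ends])

# Reference: the loop-based concatenation used in this notebook
X_loop = []
for start, end in idx_toy:
    for j in range(start, end + 1):
        X_loop.extend(X_toy[:, j, :])
X_loop = np.array(X_loop)

print(np.array_equal(X_con, X_loop))
print(idx_con)
```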

### Initialise and train HMM
We first initialise the hmm object and specify hyperparameters. In this case, since we do not model an interaction between two sets of variables in the HMM states, we set ```model_beta='no'```.

We here estimate 3 states. If you want to model a different number of states, change K to a different value.

We here model states as Gaussian distributions with a mean and a full covariance matrix, so that each state is described by a mean amplitude and a functional connectivity pattern. To do this, specify ```covtype='full'```. If you do not want to model the mean, add ```model_mean='no'```.
Optionally, you can check the hyperparameters to make sure that they correspond to how you want the model to be set up.
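
To get a feeling for what a mean plus a full covariance matrix implies, the sketch below counts the free observation parameters per state. The helper `n_state_params` is purely illustrative and not part of the `glhmm` package. It only covers a full or a diagonal covariance matrix that is estimated separately for each state.

```python
def n_state_params(n_channels: int, covtype: str = "full", model_mean: bool = True) -> int:
    """Count the free Gaussian parameters of one state (illustrative helper)."""
    n_mean = n_channels if model_mean else 0
    if covtype == "full":
        # A symmetric matrix has n(n+1)/2 distinct entries
        n_cov = n_channels * (n_channels + 1) // 2
    elif covtype == "diag":
        # Only the channel variances are estimated
        n_cov = n_channels
    else:
        raise ValueError(f"Unsupported covtype in this sketch: {covtype}")
    return n_mean + n_cov


params_full = n_state_params(64, covtype="full")
params_diag = n_state_params(64, covtype="diag")
print(params_full, params_diag)
```

With 64 channels, a full covariance matrix needs far more parameters per state than a diagonal one, which is worth keeping in mind when choosing the number of states.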

In [8]:
K = 3
hmm = glhmm.glhmm(model_beta='no', K=K, covtype='full')
print(hmm.hyperparameters)

{'K': 3, 'covtype': 'full', 'model_mean': 'state', 'model_beta': 'no', 'dirichlet_diag': 10, 'connectivity': None, 'Pstructure': array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]]), 'Pistructure': array([ True,  True,  True])}


We then train the HMM using the prepared ```X_memory_con``` and ```idx_data_con```. Since we here do not model an interaction between two sets of timeseries but run a "classic" HMM instead, we set ```X=None```. ```Y``` should be the timeseries in which we want to estimate states (here ```X_memory_con```) and **indices** should be the beginning and end **indices** of each session (here ```idx_data_con```).


In [None]:
hmm.train(X=None, Y=X_memory_con, indices=idx_data_con)
#Gamma,Xi,FE =hmm.train(X=None, Y=X_memory_con, indices=idx_data_con)

Cycle 1 free energy = nan
Cycle 2 free energy = nan
Cycle 3, free energy = nan, relative change = nan
Cycle 4, free energy = nan, relative change = nan
Cycle 5, free energy = nan, relative change = nan
Cycle 6, free energy = nan, relative change = nan
Cycle 7, free energy = nan, relative change = nan
Cycle 8, free energy = nan, relative change = nan
Cycle 9, free energy = nan, relative change = nan
Cycle 10, free energy = nan, relative change = nan
Finished training in 244.3s : active states = 3
Init repetition 1 free energy = nan
Cycle 1 free energy = nan
Cycle 2 free energy = nan
Cycle 3, free energy = nan, relative change = nan
Cycle 4, free energy = nan, relative change = nan
Cycle 5, free energy = nan, relative change = nan
Cycle 6, free energy = nan, relative change = nan
Cycle 7, free energy = nan, relative change = nan
Cycle 8, free energy = nan, relative change = nan
Cycle 9, free energy = nan, relative change = nan
Cycle 10, free energy = nan, relative change = nan
Finished t

(array([[0.33333467, 0.33333745, 0.33332789],
        [0.33370806, 0.33313454, 0.3331574 ],
        [0.33370867, 0.3331342 , 0.33315712],
        ...,
        [0.33370867, 0.3331342 , 0.33315712],
        [0.33370867, 0.3331342 , 0.33315712],
        [0.33370867, 0.3331342 , 0.33315712]]),
 array([[[0.11159984, 0.11086365, 0.11087118],
         [0.11105575, 0.11141004, 0.11087166],
         [0.11105247, 0.11086085, 0.11141456]],
 
        [[0.11172485, 0.11098783, 0.11099537],
         [0.11098815, 0.11134222, 0.11080417],
         [0.11099567, 0.11080415, 0.11135758]],
 
        [[0.11172506, 0.11098804, 0.11099558],
         [0.11098804, 0.11134211, 0.11080406],
         [0.11099558, 0.11080406, 0.11135749]],
 
        ...,
 
        [[0.11172506, 0.11098804, 0.11099558],
         [0.11098804, 0.11134211, 0.11080406],
         [0.11099558, 0.11080406, 0.11135749]],
 
        [[0.11172506, 0.11098804, 0.11099558],
         [0.11098804, 0.11134211, 0.11080406],
         [0.11099558, 0.
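
The log above reports `free energy = nan` at every cycle, and the state probabilities stay close to 1/3. The cause is not established in this notebook. One thing that may be worth checking, as an assumption rather than a diagnosis, is whether the input contains non-finite values or is on an extreme numerical scale. The sketch below uses only NumPy and a hypothetical helper `check_and_standardise`. It counts non-finite entries and z-scores every channel within each session, assuming `[start, end)` row indices as in `idx_data_con`.

```python
import numpy as np


def check_and_standardise(Y, indices):
    """Count non-finite values and z-score each channel within each session."""
    Y = np.asarray(Y, dtype=float)
    n_bad = int((~np.isfinite(Y)).sum())
    Y_std = Y.copy()
    for start, end in indices:
        block = Y_std[start:end]
        mean = block.mean(axis=0)
        std = block.std(axis=0)
        std[std == 0] = 1.0  # avoid dividing by zero for flat channels
        Y_std[start:end] = (block - mean) / std
    return n_bad, Y_std


# Toy data on a very small scale (made up for this example)
rng = np.random.default_rng(0)
Y_toy = rng.standard_normal((200, 4)) * 1e-13
idx_toy = np.array([[0, 100], [100, 200]])

n_bad, Y_std = check_and_standardise(Y_toy, idx_toy)
print(n_bad, Y_std[:100].std(axis=0))
```

If the count is zero and training on the standardised data still yields `nan`, the problem lies elsewhere.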

In the within-session continuous testing tutorial (**c_within_session_continuous_testing.ipynb**), we will use the ```GLHMM``` package with the Viterbi path as input data.
The Viterbi path is the single most probable sequence of states given the observations and the trained HMM, and the Viterbi algorithm computes it efficiently. It assigns one state to every time point, which is what the within-session continuous test takes as input.
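
For readers who want to see the idea behind the Viterbi algorithm, here is a minimal NumPy sketch for a generic HMM in the log domain. It is a textbook version written for this example and not the implementation used inside `glhmm`. The inputs `log_pi`, `log_A` and `log_B` (initial, transition and per-time-point observation log-probabilities) are assumed names.

```python
import numpy as np


def viterbi(log_pi, log_A, log_B):
    """Most probable state sequence of an HMM.

    log_pi: (K,) initial log-probabilities
    log_A:  (K, K) transition log-probabilities, from state i to state j
    log_B:  (T, K) log-likelihood of each observation under each state
    """
    T, K = log_B.shape
    delta = np.empty((T, K))           # best log-score of a path ending in each state
    psi = np.zeros((T, K), dtype=int)  # best predecessor state
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A  # scores[i, j]: come from i, go to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[t]
    # Backtrack from the best final state
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path


# Toy example: two sticky states, observations favour state 0 and then state 1
log_pi = np.log([0.5, 0.5])
log_A = np.log([[0.9, 0.1], [0.1, 0.9]])
log_B = np.log([[0.9, 0.1]] * 3 + [[0.1, 0.9]] * 3)
path = viterbi(log_pi, log_A, log_B)
print(path)
```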


In [None]:
vpath = hmm.decode(X=None, Y=X_memory_con, indices=idx_data_con, viterbi=True)
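
Once a state sequence is available, a quick summary is the fraction of time spent in each state per session. The format returned by `hmm.decode(..., viterbi=True)` is not shown in this notebook, so the sketch below accepts either a one-hot array of shape `(n_timepoints, K)` or a vector of integer labels. Both formats, the helper names and the toy data are assumptions.

```python
import numpy as np


def to_labels(vpath):
    """Convert a one-hot (T, K) path or a label vector (T,) to integer labels."""
    vpath = np.asarray(vpath)
    return vpath.argmax(axis=1) if vpath.ndim == 2 else vpath.astype(int)


def occupancy_per_session(labels, indices, K):
    """Fraction of time points spent in each state, one row per session."""
    return np.array([
        np.bincount(labels[start:end], minlength=K) / (end - start)
        for start, end in indices
    ])


# Toy path with K = 3 states and two sessions of four time points each
K_toy = 3
labels_toy = np.array([0, 0, 1, 1, 2, 2, 2, 2])
vpath_toy = np.eye(K_toy)[labels_toy]  # one-hot version of the same path
idx_toy = np.array([[0, 4], [4, 8]])   # [start, end) rows, as in idx_data_con

occupancy = occupancy_per_session(to_labels(vpath_toy), idx_toy, K_toy)
print(occupancy)
```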

### Save Viterbi path

In [None]:
# Create the output folder inside the current directory if it does not exist
current_directory = os.getcwd()
folder_name = "/data_memory"
folder_path = current_directory + folder_name  # folder_name starts with "/"
if not os.path.exists(folder_path):
   os.makedirs(folder_path)
   print("The new directory is created!")


# Save Viterbi path
file_name = f'vpath_{K}_memory.npy'
file_path = os.path.join(folder_path, file_name)
# Save the decoded path itself, not the file name
np.save(file_path, vpath)

### Save continuous data

In [None]:
# Continuous X_data
file_name = 'X_memory_con.npy'
# Save the file inside the output folder
file_path = os.path.join(folder_path, file_name)
np.save(file_path, X_memory_con)

# Continuous y_data
file_name = 'y_memory_con.npy'
# Save the file inside the output folder
file_path = os.path.join(folder_path, file_name)
np.save(file_path, y_memory_con)

# Continuous idx_data_con
file_name = 'idx_data_con.npy'
# Save the file inside the output folder
file_path = os.path.join(folder_path, file_name)
np.save(file_path, idx_data_con)
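
A quick way to confirm that a file was written as intended is to load it back and compare it with the array in memory. The sketch below does this in a temporary directory with a toy array. The file name `toy.npy` is made up, and `pathlib` is used here only as one possible way to build paths.

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    # Create the output folder if it does not exist yet
    folder = Path(tmp) / "data_memory"
    folder.mkdir(parents=True, exist_ok=True)

    arr = np.arange(12).reshape(4, 3)
    path = folder / "toy.npy"
    np.save(path, arr)

    # Load the file back and compare with the original array
    restored = np.load(path)
    round_trip_ok = np.array_equal(arr, restored)

print(round_trip_ok)
```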

### Save HMM model

In [None]:
import pickle
# Save model
# Specify the file path where you want to save the model
pickle_file = 'hmm.pickle'
file_path = os.path.join(folder_path, pickle_file)

# Open the file in binary write mode
with open(file_path, 'wb') as file:
    # Use pickle.dump to save the data to the file
    pickle.dump(hmm, file)

#print("Data saved to", file_path)

### Load HMM model

In [None]:
# Load pickle file
# Open the file in binary read mode
with open(file_path, 'rb') as file:
    # Use pickle.load to load the data from the file
    loaded_data = pickle.load(file)

#print("Loaded data:", loaded_data)
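
The save and load steps can be tried out without a trained model. The sketch below pickles a plain dictionary as a hypothetical stand-in for the `hmm` object and reads it back. Keep in mind that unpickling a real model requires the `glhmm` package to be importable, and that pickle files should only be loaded from trusted sources.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for the trained model
toy_model = {"K": 3, "covtype": "full", "model_beta": "no"}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_model.pickle"
    # Write in binary mode
    with open(path, "wb") as file:
        pickle.dump(toy_model, file)
    # Read in binary mode
    with open(path, "rb") as file:
        restored_model = pickle.load(file)

print(restored_model == toy_model)
```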