# Part 2: Preprocessing intracranial EEG using MNE-Python - Epochs!

*NeuroHackademy 2023*  
[Liberty Hamilton, PhD](https://slhs.utexas.edu/research/hamilton-lab)  
Assistant Professor, Department of Speech, Language, and Hearing Sciences and  
Department of Neurology  
The University of Texas at Austin  

This is part two of the notebooks. Please first run through [`01_ieeg_preprocessing_MNE.ipynb`](01_ieeg_preprocessing_MNE.ipynb) before running this. In this portion of the tutorial, you will learn about epoching your data. Epoched data allows you to calculate averaged responses to events of interest (event-related potentials). We will do this based on the provided annotations of speech vs. music, as well as additional annotations that are available in the Berezutskaya dataset.

In [None]:
import mne
from matplotlib import pyplot as plt
from matplotlib import cm
import numpy as np
import pandas as pd
import os
from mne_bids import read_raw_bids
from mne_bids.path import get_bids_path_from_fname
from bids import BIDSLayout
from ecog_preproc_utils import transformData
import bids 

## Load BIDS iEEG dataset

Here we will load an example iEEG dataset from [Berezutskaya et al.  Open multimodal iEEG-fMRI dataset from naturalistic stimulation with a short audiovisual film](https://openneuro.org/datasets/ds003688/versions/1.0.7/metadata). For this tutorial we will use data from `sub-06`, `iemu` data only, which has been downloaded to the jupyter hub. The whole dataset is rather large (15 GB), so if you prefer to download just this session you can do that.

In [None]:
# This is the example participant's data that we will load for the tutorial,
# but there are more options.
subj = '06'
sess = 'iemu'
task = 'film'
acq = 'clinical'
run = 1

In [None]:
# Change the data directory below to where your data are located. 
parent_dir = '/home/jovyan/shared/ds003688/'  # This is on the jupyter hub
ieeg_dir = f'{parent_dir}/sub-{subj}/ses-{sess}/ieeg/'
channel_path = f'{ieeg_dir}/sub-{subj}_ses-{sess}_task-{task}_acq-{acq}_run-{run}_channels.tsv'
raw_path = f'{ieeg_dir}/sub-{subj}_ses-{sess}_task-{task}_acq-{acq}_run-{run}_ieeg.vhdr'

bids_path = get_bids_path_from_fname(raw_path)
base_name = os.path.basename(raw_path).split('.')[0]

## Load the iEEG data

The subject, session, task, acquisition, and run were chosen above. Note that if you wish to change these variables, you may need to download the data yourself.

Here we load the data through BIDS with `read_raw_bids`, which reads the recording together with the metadata stored in the accompanying BIDS files. The resulting data structure is called `raw`.

In [None]:
# Read data and extract parameters from BIDS files
raw = read_raw_bids(bids_path, verbose=True)

In [None]:
# Let's load the data into memory and print some information about it. The 
# info structure contains a lot of helpful metadata about number of channels,
# sampling rate, data types, etc. It can also contain information about the
# participant and date of acquisition, however, this dataset has been anonymized.
raw.load_data()
raw.info

# Calculate the high gamma transform of your data

Now we will take the raw, preprocessed data, and convert to high gamma analytic amplitude for further analysis. The high gamma analytic amplitude is used in many papers as a proxy for multi-unit firing (see [Ray and Maunsell, PLoS Biology 2011](https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1000610)).

This particular version of the high gamma transform uses the same procedure as used in [Hamilton et al. 2018](https://www.cell.com/current-biology/pdf/S0960-9822(18)30461-5.pdf) and [Hamilton et al. 2021](https://www.cell.com/cell/pdf/S0092-8674(21)00878-3.pdf). The basic idea is to take 8 bands within the 70-150 Hz range, calculate the Hilbert transform of each, then take the analytic amplitude of that signal and average across the 8 bands. This form of averaging results in higher SNR than using a single band between 70 and 150 Hz.
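As a rough illustration of this idea (not the `transformData` implementation, whose filter design and normalization may differ), the sketch below band-passes a signal into 8 sub-bands, takes the analytic amplitude of each, and averages across bands. The function name `high_gamma_envelope`, the linearly spaced band edges, and the FFT-based band-pass are all assumptions made for this example.

```python
import numpy as np

def high_gamma_envelope(x, fs, f_lo=70.0, f_hi=150.0, n_bands=8):
    """Average analytic amplitude across sub-bands (illustrative sketch only)."""
    n = x.shape[-1]
    freqs = np.fft.fftfreq(n, d=1.0 / fs)
    spectrum = np.fft.fft(x, axis=-1)
    # Linearly spaced band edges are an assumption made for this example
    edges = np.linspace(f_lo, f_hi, n_bands + 1)
    amplitudes = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Keep only the positive frequencies inside the band and double them.
        # The inverse FFT is then the analytic signal of the band-passed data.
        mask = np.where((freqs >= lo) & (freqs < hi), 2.0, 0.0)
        analytic = np.fft.ifft(spectrum * mask, axis=-1)
        amplitudes.append(np.abs(analytic))
    # Average the analytic amplitude across the sub-bands
    return np.mean(amplitudes, axis=0)

# Synthetic test signal: a 105 Hz sine wave sampled at 1000 Hz for 1 second
fs = 1000.0
t = np.arange(1000) / fs
x = 2.0 * np.sin(2 * np.pi * 105.0 * t)
env = high_gamma_envelope(x, fs)
print(env.shape, env.mean())
```

Because the last axis is treated as time, the same function also accepts a `(n_channels, n_times)` array.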

In [None]:
# Get only the iEEG channels for high gamma
raw_ieeg = raw.copy()
raw_ieeg.pick_types(ecog=True)
raw_ieeg.anonymize()

notch_freqs = list(np.arange(raw.info['line_freq'], raw.info['lowpass'], step=raw.info['line_freq']))
# Get the high gamma data
# Generally, do a CAR if you have widespread coverage over multiple
# areas (not just one sensory area)
# If you have limited coverage, you may choose to do no CAR or choose
# to reference to one specific channel.
hgdat = transformData(raw_ieeg, ieeg_dir, band='high_gamma', notch=True, CAR=True,
                      car_chans='average', log_transform=True, do_zscore=True,
                      hg_fs=100, notch_freqs=notch_freqs, overwrite=True,
                      ch_types='ecog')


# Plotting evoked data

Now that we have some preprocessed data, let's plot the differences between experimental conditions. To do this, we will need the events timings, which are included in the `events.tsv` file. In this case, the events correspond to blocks of music and speech.

## Loading events

Now we will load events from the .tsv file to plot evoked responses to music and speech events. First I'll show you how to do this by creating an MNE events array; next I'll show you how to derive the events from the annotations. The first method can also be used with non-BIDS datasets if you have the onset, duration, and trial information.

In [None]:
# This is a simple way of loading a tab-delimited file, and is not specific to
# MNE python. We're using the library pandas, which you may also find very
# helpful in other applications.
event_file = f'{ieeg_dir}/sub-{subj}_ses-{sess}_task-{task}_run-{run}_events.tsv'
event_df = pd.read_csv(event_file, delimiter='\t')

In [None]:
# Let's print the contents of this dataframe. 
event_df

## Convert event times to samples

Now these event times are in seconds, not samples, so we have to convert them for use with MNE python's epochs constructor. Let's do that here. 

The times here are in seconds, and sampling rate is in units of Hz (samples/sec), so to get samples, we just multiply the amount of time by the sampling rate.

\begin{eqnarray}
\mbox{number of samples} &=& \mbox{time }  \times \mbox{sampling rate}\\
\mbox{(samples)} &=& \mbox{(s) }  \times \mbox{(samples/sec)}
\end{eqnarray}

We also cast these as integers since data samples are discrete values.
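A small sketch of this conversion, using made-up event times and assuming a sampling rate of 100 Hz (the `hg_fs` value passed to `transformData` above). Note that `int()` truncates toward zero, so a product that falls a hair below a whole number because of floating-point error would end up one sample early; rounding before casting avoids this.

```python
import numpy as np

sfreq = 100.0  # Hz; assumed here for illustration
onsets_sec = np.array([0.0, 1.25, 30.5])  # made-up event times in seconds

# time (s) x sampling rate (samples/s) = samples
onsets_samp = np.round(onsets_sec * sfreq).astype(int)
print(onsets_samp.tolist())

# Rounding before casting guards against floating-point error
edge_case = int(round(0.29 * sfreq))
print(edge_case)
```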

In [None]:
# Convert onsets and durations from seconds to samples
onset_samp = [int(onset*hgdat.info['sfreq']) for onset in event_df.onset]
dur_samp = [int(dur*hgdat.info['sfreq']) for dur in event_df.duration]
# Event IDs are category codes, not times, so they are not scaled
# by the sampling rate
ev_id = [int(e) for e in event_df.value]

eve = list(zip(onset_samp, dur_samp, ev_id))
eve

## Another way...

Because these particular events are already stored as annotations, we could also have done this in a simpler way. The method above is still useful for other events that are stored in .tsv files but were never converted to annotations.

In [None]:
# We could also do this with the `raw` object
events = mne.events_from_annotations(hgdat, event_id='auto')

In [None]:
events
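`mne.events_from_annotations` returns a tuple: an integer array with one row per event, and a dictionary mapping annotation names to integer codes. The sketch below mimics that structure with made-up sample numbers and codes (the real values will differ) to show how rows for one event type can be selected.

```python
import numpy as np

# Made-up values that mimic the (events_array, event_id_dict) structure
# Columns: sample, previous value, event code
events_arr = np.array([[500, 0, 3],
                       [1000, 0, 2],
                       [4000, 0, 1],
                       [7000, 0, 2]])
event_dict = {'music': 1, 'speech': 2, 'start task': 3}  # hypothetical codes

# Select the rows whose event code matches 'speech'
speech_rows = events_arr[events_arr[:, 2] == event_dict['speech']]
speech_onsets = speech_rows[:, 0]
print(speech_onsets.tolist())
```

This is why the cells that follow index `events[0]` for the timings and `events[1]` for the name-to-code mapping.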

## Create an epochs object

Now if we want to plot our data by epoch type, we can use the mne Epochs class. This allows us to parse our data according to these events and plot evoked activity.
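Conceptually, epoching means cutting a fixed time window around each event onset and stacking the windows. The NumPy sketch below illustrates this on random data; it is not how MNE implements it, and MNE's `Epochs` class additionally handles things like baseline correction and dropping bad epochs. The helper name `epoch_data` is hypothetical.

```python
import numpy as np

def epoch_data(data, event_samples, sfreq, tmin, tmax):
    """Cut (n_channels, n_times) data into (n_events, n_channels, n_window)."""
    start = int(round(tmin * sfreq))     # negative offset: samples before the event
    stop = int(round(tmax * sfreq)) + 1  # +1 so the sample at tmax is included
    epochs = []
    for s in event_samples:
        # Skip events whose window would run off either end of the recording
        if s + start < 0 or s + stop > data.shape[1]:
            continue
        epochs.append(data[:, s + start:s + stop])
    return np.stack(epochs)

rng = np.random.default_rng(0)
data = rng.standard_normal((4, 1000))  # 4 fake channels, 10 s at 100 Hz
event_samples = [100, 300, 500, 990]   # the last event is too close to the end
eps = epoch_data(data, event_samples, sfreq=100.0, tmin=-0.2, tmax=0.5)
erp = eps.mean(axis=0)                 # average across events
print(eps.shape, erp.shape)
```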

In [None]:
tmin = -0.2  # How much time to account for before the event of interest
tmax = 0.5   # How much time to account for after the event of interest
event_id = events[1]['speech']  # This is the speech event ID

# Here we take events[0] because those are the timings, whereas events[1] has
# the information about event type. If you just have a list of timings,
# you don't need to index the events in this way.
epochs = mne.Epochs(hgdat, events=events[0], tmin=tmin, tmax=tmax, event_id=event_id) 

In [None]:
# Here we will just plot the average across all channels. This is a bit
# weird to do with iEEG because this is across a lot of different brain
# areas, but it's still possible.
epochs.plot_image(combine='mean')

In [None]:
# What about plotting a particular electrode? This is one that
# appears to be on the STG based on the image above
epochs.plot_image(picks=[hgdat.info['ch_names'][13]])

In [None]:
def plot_epochs(epochs, nchans, ch_names, color='b', label='spkr', show=True, vmin_max=None):
    '''
    Function that plots the averaged epoched data for each channel as a grid so you can 
    see all channels at once.
    
    Inputs:
        epochs [obj] : MNE epochs object
        nchans [int] : number of channels to plot
        ch_names [list] : channel names 
        color [str, hex, tuple]: color for ERP traces
        label [str] : label for the ERP (could be epoch type/annotation type) 
        show [bool] : whether to show the figure or not
        vmin_max [list] : list of ylim min and max, e.g. [-0.5, 0.5]
        
    '''
    
    # Get the data as an array
    eps = epochs.get_data()
    
    # Find the maximum across the whole dataset, helps with scaling the plots
    emax = np.abs(epochs.average().data).max()
    
    # Determine how many rows and columns we'll need in our subplots grid
    # based on the number of channels. 
    nrows = int(np.floor(np.sqrt(nchans)))
    ncols = int(np.ceil(nchans/nrows))
    
    # Loop through all electrode channels
    for ch in np.arange(nchans):
        plt.subplot(nrows, ncols, ch+1)
        
        # Get the average response across trials for this particular channel
        erp = eps[:,ch,:].mean(0)
        
        # Get the standard error across trials
        erpstderr = eps[:,ch,:].std(0)/np.sqrt(eps.shape[0])
        
        # Plot transparent shaded standard error in the [color] you choose
        ybottom = erp - erpstderr
        ytop = erp + erpstderr
        plt.fill_between(epochs.times, ybottom.ravel(), ytop.ravel(),
                         alpha=0.5, color=color)
        
        # Plot the average epoch on top in the same color
        plt.plot(epochs.times, erp, color=color, label=label)
        
        # Plot the x and y origins
        plt.axvline(0, color='k', linewidth=0.5)
        plt.axhline(0, color='k', linewidth=0.5)
        
        # If we haven't explicitly set ylimits with vmin/vmax, use 
        # the maximum of the data and 50% more so the whole thing 
        # fits nicely 
        if vmin_max is None:
            plt.gca().set_ylim([-emax*1.5, emax*1.5])
        else:
            plt.gca().set_ylim([vmin_max[0], vmin_max[1]])
            
        # Only show the ticks for the 0th plot, otherwise this gets
        # hard to see/read
        if ch != 0:
            plt.gca().set_xticks([])
            plt.gca().set_yticks([])
        else:
            plt.ylabel('Z-score')
        
        # Write the name of the channel in the plot -- you could also
        # use plt.title() but sometimes that makes everything look
        # a little squashed
        plt.text(0.5, 0.25, ch_names[ch], 
            horizontalalignment='center', verticalalignment='center',
            transform=plt.gca().transAxes, fontsize=8)
    
    # Plot ticks at meaningful times (the min, 0, and max in seconds)
    plt.gca().set_xticks([epochs.tmin, 0, epochs.tmax])
    plt.xlabel('Time (s)')
    plt.legend()
    #plt.tight_layout()
    if show:
        plt.show()

In [None]:
plt.figure(figsize=(10,10))
plot_epochs(epochs, len(hgdat.info['ch_names']), hgdat.info['ch_names'], label='speech', show=True)

## Using stimulus annotations

In addition to the "speech" vs "music" gross-level annotations, the researchers have provided information about the onset and offset of different types of information in the sound as well as the video. You can look in the `stimuli` folder to see what types of annotations are provided, but in general, these include word-level, syllable-level, sentence-level, and specific talkers as well as some other information.

In [None]:
# Get the word times
annotation = 'words'  # try other types of annotations here! 
word_times = pd.read_csv(f'{parent_dir}/stimuli/annotations/sound/sound_annotation_{annotation}.tsv', delimiter='\t')

# print them
word_times

In [None]:
# Let's get the sample that corresponds to the start of the task, since we
# will need to offset all the stimulus time labels from that
start_sample = events[0][events[0][:,2] == events[1]['start task'],0][0]

### Create the new word events epochs

In [None]:
word_events = []

# Loop through the times for each word event and convert to samples, as we did before
# This time we can't use annotations, because these word events were not included
# as annotations in the raw files, just as .tsv files.
for idx, row in word_times.iterrows():
    onset_sample = int(row['onset']*hgdat.info['sfreq'])  # convert time to samples
    offset_sample = int(row['offset']*hgdat.info['sfreq'])  # convert time to samples
    duration_sample = offset_sample - onset_sample  # Get the duration in samples
    onset_sample += start_sample  # need to shift by the actual starting time of the task
    
    # Append this event to our events list
    word_events.append([onset_sample, duration_sample, 1])


In [None]:
# Now create the epochs object again. Note that we don't need to index the `word_events`
# because it is already a list in the correct format
epochs_words = mne.Epochs(hgdat, events=word_events, tmin=-0.2, tmax=0.5)

### Plot word epochs!

In [None]:
# Plot the average across all channels
epochs_words.plot_image(combine='mean')

In [None]:
# Plot one subplot for each channel as we had done before
plot_epochs(epochs_words, len(hgdat.info['ch_names']), hgdat.info['ch_names'], label='words')

In [None]:
# Plot one example electrode that has a strong word response
epochs_words.plot_image(picks=['P18'])

### Export data to numpy array

If we want to export the data to use with our own functions, we can also do that with the `.get_data()` method.

In [None]:
# Get the data from our epochs_words object to do other things with
epochs_array = epochs_words.get_data()
print(f'{epochs_array.shape[0]} {annotation} events for '
      f'{epochs_array.shape[1]} channels and {epochs_array.shape[2]} time points')

In [None]:
# Use matplotlib to show the average across all epochs
# Scale to the -max(abs) of the data and max(abs) of the data
# with a diverging colormap so that the color for 0 is white,
# and positive values are red, and negative values are blue
plt.imshow(epochs_array.mean(0), cmap=cm.RdBu_r, 
           vmin=-np.max(np.abs(epochs_array.mean(0))),
           vmax=np.max(np.abs(epochs_array.mean(0))))  # Take the average across all trials (words)
plt.xlabel('Time (s)')
plt.ylabel('Channel')
plt.gca().set_xticks([0, 
                      -int(epochs_words.tmin*epochs_words.info['sfreq']), 
                      int((epochs_words.tmax - epochs_words.tmin)*epochs_words.info['sfreq'])])
plt.gca().set_xticklabels([epochs_words.tmin, 0, epochs_words.tmax])
plt.axvline(-int(epochs_words.tmin*epochs_words.info['sfreq']), color='k', linestyle='--')
plt.colorbar()
plt.show()

# That's it!

Some suggestions for things you can try:

* Create epochs for different types of events - speech, music, syllables, sentences, etc
* Compare amplitude of speech versus music responses in each electrode. Note that you can use `show=False` and call `plot_epochs` more than once to plot different epochs on the same axes
* Look at effects of referencing on the evoked data/epochs.
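
As a possible starting point for the speech versus music comparison, the sketch below compares the mean post-onset amplitude per channel between two conditions. It uses random arrays shaped like the output of `.get_data()` in place of real epochs; the array sizes and the simulated speech response on channel 0 are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_chans, n_times = 3, 71
times = np.linspace(-0.2, 0.5, n_times)

# Fake epochs arrays with shape (n_epochs, n_channels, n_times)
speech = rng.standard_normal((40, n_chans, n_times))
music = rng.standard_normal((30, n_chans, n_times))
speech[:, 0, times > 0] += 1.0  # pretend channel 0 responds to speech only

# Mean amplitude in the post-onset window, per channel
post = times > 0
speech_amp = speech[:, :, post].mean(axis=(0, 2))
music_amp = music[:, :, post].mean(axis=(0, 2))

# Positive values mean a larger response to speech than to music
diff = speech_amp - music_amp
print(np.round(diff, 2))
```

With real data, `speech` and `music` would come from two `Epochs` objects created with the corresponding event IDs.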