# Reformat and save data to file

Great, now we know how to access the data we need. Our next step will be to structure it in a way that will be useful to us. 

Let's think about which bits of information we actually need from the full data set. 

About the neurons, we need to know:
1. Spike times for each recorded and clustered neuron (remember we call them 'units')
2. Which brain area the neuron belongs to: is it CA1, or Prefrontal Cortex (PFC)
3. What task was the animal doing?
4. What were the x and y positions of the animal?

Unfortunately for us, all this information is in different files. All of them are nested, so organizing this may not be straightforward. 




In [1]:
import scipy.io as spio # you will need this library to import .mat data into python
import numpy as np 
import pandas as pd

import os

# we'll use glob and pathlib.Path to get the names of all files
# that end with the extension .mat in the folder (also referred to as 'directory') 
# that we're interested in 
import glob

from pathlib import Path # this allows us to import only 
                         # the class Path from the library 
                         # pathlib rather than the full library.

We could always type out the file names and import them one by one manually, but the method shown here is easier to generalize. That is, once you finish setting up this process, you can reuse it for a different animal with only minor changes, rather than having to type out all the file names again. 

In [2]:
directory = '../data/N2/' # here, you specify which folder you want to look at
                          # if you're later interested in repeating this process for 
                          # another animal, simply change the name of the folder.

In [3]:
# glob is a function that automatically gets the file paths (or addresses) of all files 
# that end with the .mat extension. The * simply means that the name could be anything, 
# and the only requirement is that the file extension is .mat. If you're looking for images, 
# you can specify *.jpg, or *.bmp, etc. 

# as an aside, it's impossible to memorize that these functions exist. There are many others that do 
# similar things. The only reason I know is because I needed this functionality and I googled: 
# 'find file names in folder python' or something along those lines. And it eventually led me to these
# functions. Google is your friend.

file_paths = glob.glob(directory + '*.mat') # remember you can put strings together using the + operator
                                            # so directory + '*.mat' = '../data/N2/*.mat'

In [4]:
file_paths # we now have addresses to every file in the folder that ends with .mat

['../data/N2\\N2cellinfo.mat',
 '../data/N2\\N2eeg.mat',
 '../data/N2\\N2rawpos.mat',
 '../data/N2\\N2spikes.mat',
 '../data/N2\\N2task.mat',
 '../data/N2\\N2tetinfo.mat']
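The output above mixes `/` and `\\` because this notebook was run on Windows, where `glob` joins the folder and file name with a backslash. As an aside, `pathlib.Path` can do the same search and hides that difference. Here is a minimal sketch of that alternative. The temporary folder and the file names in it are made up so the example runs on any computer; they are not the real data set.

```python
import tempfile
from pathlib import Path

# Build a throwaway folder with a few empty files so the example runs anywhere.
# The folder and the file names are stand-ins, not the real recordings.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    for fname in ['N2task.mat', 'N2spikes.mat', 'notes.txt']:
        (demo_dir / fname).touch()

    # Path.glob() plays the same role as glob.glob(directory + '*.mat').
    # sorted() just gives us a predictable order.
    mat_paths = sorted(demo_dir.glob('*.mat'))
    mat_names = [p.name for p in mat_paths]   # .name is the file name without the folder

print(mat_names)
```

Either approach works for the rest of this notebook; we'll keep using the list of strings from `glob.glob`.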

Now we can go through our list and get data from each file as needed. 


### **Step 1:**   
The `tetinfo` file has information about which brain area each tetrode was in on a particular day. So we can label any neuron recorded from that tetrode as belonging to the corresponding brain area. 

In [5]:
# look for the tetinfo.mat file. 
# we'll loop through the list of file names
# then we'll see if any of them have the word 'tetinfo' in them. 
# if the string method .find() finds the word 'tetinfo' it will return 
# the index of the position where it found the word. If it doesn't, it will return -1. 
# So we can say, that if the output of .find() is not -1, then that's the name we want.


for name in file_paths:
    if name.find('tetinfo') != -1:
        matfile = name

matfile # now we have the address to tetinfo.mat

'../data/N2\\N2tetinfo.mat'
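We will repeat this search for other files later, so it may be worth wrapping it in a small function. The sketch below is one way to do that, not something the rest of the notebook depends on. The name `find_file` and the example paths are made up for illustration. It uses the `in` operator, which asks the same question as `.find(...) != -1`, and it complains if the keyword matches zero files or more than one.

```python
import os

def find_file(paths, keyword):
    """Return the one path in `paths` whose file name contains `keyword`."""
    # os.path.basename() drops the folder part, so a keyword can't
    # accidentally match the name of a folder instead of the file.
    matches = [p for p in paths if keyword in os.path.basename(p)]
    if len(matches) != 1:
        raise ValueError(f'expected exactly one file matching {keyword!r}, found {len(matches)}')
    return matches[0]

# made-up paths, just to show the function in action
example_paths = ['../data/N2/N2task.mat', '../data/N2/N2tetinfo.mat']
tetinfo_path = find_file(example_paths, 'tetinfo')
print(tetinfo_path)
```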

In [6]:
# we will load this file same as before and use the dictionary key to access the data in it. 
tetinfo = spio.loadmat(matfile, squeeze_me=True)
tetinfo = tetinfo['tetinfo']

My goal here is to convert the whole nested array into a dictionary so we can easily access all elements. 
Dictionaries have `keys` and `values`. There are different ways to define or create a dictionary. 

Method 1: manually defining each key-value pair
```
dictionary_1 = {
    'key_1' : [1, 2, 3],
    'key_2' : ['a', 'b', 'c']
}
```

Method 2: using two lists and the `zip()` function to automatically match them.
```
keys = ['key_1', 'key_2']                # we define a list of keys
values = [[1, 2, 3], ['a', 'b', 'c']]    # we define a list of values
dictionary_2 = dict(zip(keys, values))   # we use the zip() function to pair them up by element. 
```

In Method 2, we use `zip()` to match two lists of the same length. `zip()` will automatically match `key_1` with the first list `[1, 2, 3]`, and `key_2` with the second list `['a', 'b', 'c']`. Both methods result in the same dictionary; the second one is just more convenient in our case. You can copy and paste both pieces of code and compare `dictionary_1` to `dictionary_2`.
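If you'd rather not copy and paste, here is the comparison written out. The last part is an extra detail worth knowing: when the two lists are not the same length, `zip()` quietly stops at the shorter one instead of raising an error. The names `same` and `short` are just for this example.

```python
# Method 1: manually defining each key-value pair
dictionary_1 = {
    'key_1': [1, 2, 3],
    'key_2': ['a', 'b', 'c'],
}

# Method 2: pairing up two lists with zip()
keys = ['key_1', 'key_2']
values = [[1, 2, 3], ['a', 'b', 'c']]
dictionary_2 = dict(zip(keys, values))

# == compares the keys and the values of both dictionaries
same = dictionary_1 == dictionary_2
print(same)

# three keys but only two values: 'key_3' is silently dropped
short = dict(zip(['key_1', 'key_2', 'key_3'], [10, 20]))
print(short)
```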

**Taking apart and storing the first level of data: days**

In [7]:
num_days = tetinfo.size # number of days = the number of elements in the first level

# next, we create lists of keys and values
day_keys = ['day' + str(num) for num in range(num_days)] # this is a list comprehension. 
                                                         # It's a shortened form of a for-loop
                                                         # It says: day_keys is a list (indicated by the []),
                                                         # each element of the list is the string 'day' plus numbers 
                                                         # in the range of 0 to the number of days. 
                    
day_values = [day for day in tetinfo] # another list comprehension that says day_values is a list and 
                                       # each element is one element in the array tetinfo

N2_tetinfo = dict(zip(day_keys, day_values)) # here we use that zip() function to pair these two lists 
                                             # and make them into a dictionary

del([day_keys, day_values]) # delete unnecessary variables from memory
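If list comprehensions are new to you, it can help to see one next to the for-loop it replaces. In this small sketch `num_days` is set to a made-up value of 3 so it runs without the data file.

```python
num_days = 3  # made-up number, only for this example

# the list comprehension from the cell above
day_keys = ['day' + str(num) for num in range(num_days)]

# the same thing written as an ordinary for-loop
day_keys_loop = []
for num in range(num_days):
    day_keys_loop.append('day' + str(num))

print(day_keys)
print(day_keys == day_keys_loop)
```

Both versions build the same list; the comprehension just does it in one line.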

In [8]:
# now, if we look at what is in N2_tetinfo
print(f'N2_tetinfo is of type : {type(N2_tetinfo)}')
print(f'with keys: {N2_tetinfo.keys()}')

# we see it is a dictionary with each day as a key. 

N2_tetinfo is of type : <class 'dict'>
with keys: dict_keys(['day0', 'day1', 'day2', 'day3', 'day4', 'day5', 'day6', 'day7', 'day8', 'day9', 'day10', 'day11'])


In [9]:
# we can access each day's data using the key for that day. 
N2_tetinfo['day0']

array([array([], dtype=float64),
       array([array([], dtype=float64),
              array(('PFC', 72), dtype=[('area', 'O'), ('depth', 'O')]),
              array(('PFC', 59), dtype=[('area', 'O'), ('depth', 'O')]),
              array(('PFC', 59), dtype=[('area', 'O'), ('depth', 'O')]),
              array(('PFC', 78), dtype=[('area', 'O'), ('depth', 'O')]),
              array(('PFC', 66), dtype=[('area', 'O'), ('depth', 'O')]),
              array(('PFC', 60), dtype=[('area', 'O'), ('depth', 'O')]),
              array(('PFC', 73), dtype=[('area', 'O'), ('depth', 'O')]),
              array(('PFC', 61), dtype=[('area', 'O'), ('depth', 'O')]),
              array(('PFC', 39), dtype=[('area', 'O'), ('depth', 'O')]),
              array(('PFC', 58), dtype=[('area', 'O'), ('depth', 'O')]),
              array(('PFC', 48), dtype=[('area', 'O'), ('depth', 'O')]),
              array(('PFC', 48), dtype=[('area', 'O'), ('depth', 'O')]),
              array(('PFC', 38), dtype=[('area', 'O

**Taking apart and storing the second level of data: epochs**  
The next step will be to make each day into a dictionary where the keys are the epochs and we can access data belonging to each epoch using the appropriate key. Because we're looking at locations of tetrodes here, we do not need to go any deeper into this file. We can pick any epoch of one day and use the location of the tetrode in that epoch for all epochs in that day. So if the tetrode was in PFC for epoch0 of day1, we can assume that it was in PFC for the rest of the epochs as well. 

In [10]:
for day_key in N2_tetinfo.keys():                        # we first cycle through each day
    day_data = N2_tetinfo[day_key]                       # we put all the data for that day in a variable called day_data
    tt_info = np.array([])                               # start with an empty array so a day without data cannot reuse the previous day's info
    for epoch in day_data:                               # then we cycle through each epoch in the day and find one that isn't empty.
        if epoch.size > 0:                               # I only thought to check for this because I noticed not all epochs had data in them
            tt_info = epoch                              # We can then use the first epoch we find that has data in it and
            break                                        # break out of the loop without checking any others
    tt_name = []                                         # We'll create two empty lists to use as the keys and values
    tt_loc = []
    for tt in range(tt_info.size):                       # cycle through each tetrode in that epoch
        if tt_info[tt].size > 0:                         # some tetrodes have no data about location. They were probably used as ground. We can skip those
            tt_name.append('tt' + str(tt))               # then append the tetrode name and tetrode location to the appropriate lists
            tt_loc.append(tt_info[tt].item()[0])
    N2_tetinfo[day_key] = dict(zip(tt_name, tt_loc))     # create a dictionary from the lists and place it in the value place for that day.
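The `.item()[0]` part of the loop can look like magic, so here is a small sketch of what it does. I rebuild one tetrode entry by hand, copying the shape we saw in the `N2_tetinfo['day0']` output (an `area` field and a `depth` field). The variable names are made up for this example.

```python
import numpy as np

# one tetrode entry, built by hand to look like the ones in the output above
tetrode = np.array(('PFC', 72), dtype=[('area', 'O'), ('depth', 'O')])

# an empty entry, like the tetrodes that have no location information
empty = np.array([])

print(tetrode.size, empty.size)          # this is what the `.size > 0` check looks at

# .item() turns the entry into a plain tuple, and [0] takes its first element
area_by_position = tetrode.item()[0]

# the fields also have names, so we can ask for them directly
area_by_name = tetrode['area'].item()
depth_by_name = tetrode['depth'].item()

print(area_by_position, area_by_name, depth_by_name)
```

Asking for a field by name is a bit longer to type, but it keeps working if the order of the fields ever changes.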


In [11]:
# so now, if we want to see the locations of the tetrodes for day0:

N2_tetinfo['day0']

{'tt1': 'PFC',
 'tt2': 'PFC',
 'tt3': 'PFC',
 'tt4': 'PFC',
 'tt5': 'PFC',
 'tt6': 'PFC',
 'tt7': 'PFC',
 'tt8': 'PFC',
 'tt9': 'PFC',
 'tt10': 'PFC',
 'tt11': 'PFC',
 'tt12': 'PFC',
 'tt13': 'PFC',
 'tt15': 'CA1',
 'tt16': 'CA1',
 'tt17': 'CA1',
 'tt18': 'CA1',
 'tt19': 'CA1',
 'tt20': 'CA1'}

In [12]:
# or if we want to specifically know where tetrode 16 was on day3:
N2_tetinfo['day3']['tt16']

'CA1'

### **Step 2:**   
The `task` file will have information about what the animal was doing during that epoch

In [13]:
# look for the task.mat file. 
# we'll loop through the list of file names
# then we'll see if any of them have the word 'task' in them. 
# if the string method .find() finds the word 'task' it will return 
# the index of the position where it found the word. If it doesn't, it will return -1. 
# So we can say, that if the output of .find() is not -1, then that's the name we want.


for name in file_paths:
    if name.find('task') != -1:
        matfile = name

matfile # now we have the address to task.mat

'../data/N2\\N2task.mat'

In [14]:
# we will load this file same as before and use the dictionary key to access the data in it. 
task = spio.loadmat(matfile, squeeze_me=True)
task = task['task']

**Taking apart and storing the first level of data: days**

In [15]:
num_days = task.size # number of days = the number of elements in the first level

In [16]:
# next, we create lists of keys and values
day_keys = ['day' + str(num) for num in range(num_days)] # this is a list comprehension. 
                                                         # It's a shortened form of a for-loop
                                                         # It says: day_keys is a list (indicated by the []),
                                                         # each element of the list is the string 'day' plus numbers 
                                                         # in the range of 0 to the number of days. 
                    
day_values = [day for day in task] # another list comprehension that says day_values is a list and 
                                   # each element is one element in the array task

In [17]:
N2_task = dict(zip(day_keys, day_values)) # here we use that zip() function to pair these two lists 
                                             # and make them into a dictionary

del([day_keys, day_values]) # delete unnecessary variables from memory

In [18]:
# now, if we look at what is in N2_task
print(f'N2_task is of type : {type(N2_task)}')
print(f'with keys: {N2_task.keys()}')

# we see it is a dictionary with each day as a key. 

N2_task is of type : <class 'dict'>
with keys: dict_keys(['day0', 'day1', 'day2', 'day3', 'day4', 'day5', 'day6', 'day7', 'day8', 'day9'])


In [19]:
# we can access each day's data using the key for that day. 
N2_task['day5']

array([array([], dtype=float64), array(('Run',), dtype=[('type', 'O')]),
       array([], dtype=float64), array(('Run',), dtype=[('type', 'O')]),
       array([], dtype=float64), array(('Run',), dtype=[('type', 'O')])],
      dtype=object)

**Taking apart and storing the second level of data: epochs**  
The next step will be to make each day into a dictionary where the keys are the epochs and the values are the task the animal was doing in that epoch. Unlike the tetrode locations, we want this for every epoch, so we keep each epoch that has data instead of picking just one per day. 

In [20]:
for day_key in N2_task.keys():                            # cycle through each day
    day_data = N2_task[day_key]
    epoch_keys = []                                       # empty lists to collect the keys and values
    epoch_values = []
    for num, epoch in enumerate(day_data):                # enumerate() gives us the position of each epoch along with its data
        if epoch.size > 0:                                # I only thought to check for this because I noticed not all epochs had data in them
            task_type = epoch[()][0]                      # pull out the task name, e.g. 'Run'. We don't call this `task` so we don't overwrite the array we loaded above
            epoch_key = 'epoch' + str(num)
            epoch_keys.append(epoch_key)
            epoch_values.append(task_type)
    N2_task[day_key] = dict(zip(epoch_keys, epoch_values))

In [21]:
N2_task['day0']['epoch1']

'Run'
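Notice that the key is `epoch1` and not `epoch0`: the empty epochs are skipped, but `enumerate()` still counts them, so every epoch keeps its original position in its name. The sketch below shows this on a hand-built day that copies the pattern we saw in `N2_task['day5']` (empty, Run, empty, Run, ...). It also writes the loop as a dictionary comprehension, which is just a shorter way of doing the same thing. The names `run`, `empty`, `day_data` and `epoch_tasks` are made up for this example.

```python
import numpy as np

# hand-built stand-ins for one day of the task file
run = np.array(('Run',), dtype=[('type', 'O')])
empty = np.array([])
day_data = [empty, run, empty, run, empty, run]

# enumerate() hands us (position, epoch) pairs; the `if` skips the empty ones
epoch_tasks = {
    'epoch' + str(num): epoch[()][0]      # [()] unwraps the entry, [0] takes its first field
    for num, epoch in enumerate(day_data)
    if epoch.size > 0
}

print(epoch_tasks)
```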

### **Step 3:**  
Now that we know where each tetrode was on each day, we can safely assume that any neuron recorded from that tetrode was from that location. So if the tetrode was in CA1, any neuron from that tetrode is from CA1. 

As before let's extract the filename of interest, get its address and load the file. 

In [22]:
for name in file_paths:
    if name.find('spikes') != -1:
        matfile = name

matfile

'../data/N2\\N2spikes.mat'

In [23]:
# we'll do the same thing as before and use spio to load the .mat file

data = spio.loadmat(matfile, squeeze_me=True)

In [24]:
data = data['spikes']

**Taking apart and storing the first level of data: days**  
Here, just like before, we change the organization of the data so that `N2_data` is a dictionary with the day number as the key and the data from that day as the value paired with that key. 

In [25]:
num_days = data.size
day_keys = ['day' + str(num) for num in range(num_days)]
day_values = [day for day in data]
N2_data = dict(zip(day_keys, day_values))
del([day_keys, day_values])

In [26]:
N2_data.keys()

dict_keys(['day0', 'day1', 'day2', 'day3', 'day4', 'day5', 'day6', 'day7', 'day8', 'day9'])
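By now we have written the same few lines three times, once each for `tetinfo`, `task` and `spikes`. When that happens it is usually a sign that the lines could live in a function. Here is a minimal sketch of what that might look like. The function name `array_to_day_dict` is made up, and the toy array only stands in for the real data.

```python
import numpy as np

def array_to_day_dict(array):
    """Turn the first level of an array into a dictionary keyed 'day0', 'day1', ..."""
    # enumerate() gives the position and the element at the same time,
    # so we don't need separate lists of keys and values
    return {'day' + str(num): day for num, day in enumerate(array)}

toy = np.array([10, 20, 30])          # stand-in for tetinfo, task or spikes
toy_dict = array_to_day_dict(toy)

print(list(toy_dict.keys()))
print(toy_dict['day1'])
```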

**Taking apart and storing the second level of data: epochs**  
Next, we will change the organization of data so that each day in `N2_data` is a dictionary with the epoch number as the key and the data from that epoch as the value paired with that key. 

In [27]:
for key in N2_data.keys():
    day_data = N2_data[key]
    num_epochs = day_data.size
    epoch_keys = [key + '_epoch' +str(num) for num in range(num_epochs)]
    epoch_values = [epoch for epoch in day_data]
    N2_data[key] = dict(zip(epoch_keys, epoch_values))
    del([epoch_keys, epoch_values])
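We now have a dictionary inside a dictionary. Note that the epoch keys built above include the day (for example `day0_epoch1`), so that is what you type as the second key. The toy example below uses made-up letters in place of the real spike data to show how to reach one value and how to walk through everything with `.items()`.

```python
# a toy nested dictionary with the same key pattern as N2_data
nested = {
    'day0': {'day0_epoch0': 'a', 'day0_epoch1': 'b'},
    'day1': {'day1_epoch0': 'c'},
}

# reach one value: first key picks the day, second key picks the epoch
one_value = nested['day0']['day0_epoch1']
print(one_value)

# .items() gives the key and the value together on every pass of the loop
visited = []
for day, epochs in nested.items():
    for epoch_key, epoch_data in epochs.items():
        visited.append((day, epoch_key, epoch_data))

print(visited)
```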

In [101]:
save_path = '../clean_data/N2/'                                            # we want to eventually save this data to a folder on our computer, 
                                                                           # so this defines where that will be
if not os.path.exists(save_path):                                          # here we first check whether that folder already exists
    os.makedirs(save_path)                                                 # if it doesn't exist, we create it. If it does, we don't bother.

        
# now we cycle through the data:

for day in N2_data.keys():                                                 # cycle through each day
    for epoch in N2_data[day].keys():                                      # cycle through each epoch of each day
        num_tetrodes = N2_data[day][epoch].size                            # the number of tetrodes is just the size of that epoch array
        cell_keys = []                                                     # create empty lists to store unit (neuron or cell) data, we make one for keys
        cell_values = []                                                   # and one for values
        
        for tt in range(num_tetrodes):                                     # cycle through each tetrode one by one
            num_cells = N2_data[day][epoch][tt].size                       # number of cells recorded on that tetrode = the size of the array
            for cell in range(num_cells):                                  # cycle through each cell to extract spike data for that cell
                cell_name = 'tt' + str(tt) + '_unit' + str(cell)           # create a name for the cell which includes which tetrode it's from
                spike_data = None                                          # reset for every cell so data from the previous cell is never reused by accident
                if num_cells == 1:                                         # if there's just one cell on that tetrode we have to use .item()[0]
                    spike_data = N2_data[day][epoch][tt].item()[0]
                elif N2_data[day][epoch][tt][cell].size > 0:               # I had to add this size condition because some cells have no spike data in them at all
                    spike_data = N2_data[day][epoch][tt][cell].item()[0]   # otherwise, we index using the cell variable and then use .item()[0]
                
                try:                                           # some tetrodes do not have corresponding location info, so trying to pull it out of the ...
                    cell_loc = N2_tetinfo[day][f'tt{tt}']      # ... tetinfo dictionary results in a KeyError. The 'try' and 'except' lines allow you to ...
                except KeyError:                               # ... tell python what to do if that error pops up. In this case I'm saying: ...
                    cell_loc = []                              # ... if the key is missing, just use an empty list.
                
                if spike_data is not None and np.ndim(spike_data) == 2:   # I noticed that some cells have no extracted spikes, so ...
                    cell_data = {                                  # ... we skip those. For the ones that are 2-dimensional ...
                        'cell_location': cell_loc,                 # ... we get each column from spike_data and save it under the appropriate ...
                        'spike_times' : list(spike_data[:,0]),     # ... key
                        'xpos' : list(spike_data[:,1]),
                        'ypos' : list(spike_data[:,2])
                    }
                    cell_keys.append(cell_name)                          # because we want to make this a dictionary, we append cell_name to cell_keys
                    cell_values.append(cell_data)                        # and cell_data to cell_values
        
       
        # now we have to save the data into a file on our computer
                                                                               
        df = pd.DataFrame(dict(zip(cell_keys, cell_values))).T           # using pandas, we create a dataframe from a dictionary, more on that below.
        if df.size > 0:                                                  # we then check whether there is actually any data in it; if there isn't, we don't save it
            df.to_csv(f'{save_path}/{epoch}.csv', index=False)           # if there is, we save it.
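One thing to be aware of when you read these files back in: a CSV file can only hold text, so each list of spike times comes back as one long string rather than a list of numbers. The sketch below shows one way to deal with that, using `ast.literal_eval` from the standard library. It is a sketch under a few assumptions, not the method used elsewhere in this project: the spike data is made up, I write to an in-memory buffer (`io.StringIO`) instead of a real file, and I use `.tolist()` instead of `list()` so the numbers are written as plain Python floats. Depending on your NumPy version, `list()` may write them in a form that `literal_eval` cannot read.

```python
import ast
import io

import numpy as np
import pandas as pd

# made-up spike data: columns are spike time, x position, y position
spike_data = np.array([[0.5, 10.0, 20.0],
                       [1.5, 11.0, 21.0]])

cells = {
    'tt1_unit0': {
        'cell_location': 'PFC',
        'spike_times': spike_data[:, 0].tolist(),   # .tolist() gives plain Python floats
        'xpos': spike_data[:, 1].tolist(),
        'ypos': spike_data[:, 2].tolist(),
    }
}
df = pd.DataFrame(cells).T

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
raw_value = loaded.loc[0, 'spike_times']            # still a string at this point

# turn the text '[0.5, 1.5]' back into a real list, column by column
for column in ['spike_times', 'xpos', 'ypos']:
    loaded[column] = loaded[column].apply(ast.literal_eval)

print(type(raw_value), loaded.loc[0, 'spike_times'])
```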
        
            
            

**Just a little about dataframes**

Dataframes are like Excel sheets. They have rows and columns; the columns have `column names` and the rows have `indexes`.
You can create a dataframe in multiple ways. I'm going to demonstrate the way we used above in our code. 

In [59]:
# first I make a dictionary

dictionary = {
    'items' : ['apple', 'banana', 'sky', 'leaves'],
    'color' : ['red', 'yellow', 'blue', 'green']
    }

In [62]:
# view the dictionary
dictionary

{'items': ['apple', 'banana', 'sky', 'leaves'],
 'color': ['red', 'yellow', 'blue', 'green']}

In [64]:
# to turn this dictionary into a dataframe, I just say:
pd.DataFrame(dictionary)

|   | items | color |
|---|---|---|
| 0 | apple | red |
| 1 | banana | yellow |
| 2 | sky | blue |
| 3 | leaves | green |


You may have noticed the `.T` at the end of the line that defined the dataframe in the big block of code that saves our data to file. This just transposes the dataframe. That is, it switches the rows and columns.

In [65]:
pd.DataFrame(dictionary).T

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| items | apple | banana | sky | leaves |
| color | red | yellow | blue | green |
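So why did we need `.T` when saving our data? In that code, the outer dictionary's keys were cell names and each value was itself a dictionary. Given a dictionary like that, pandas makes one column per cell. Transposing flips it so that each cell becomes a row, which is the layout we want in the CSV file. Here is a small sketch with made-up cells and values:

```python
import pandas as pd

# made-up cells, shaped like the cell_keys / cell_values pairs in the saving code
cells = {
    'tt1_unit0': {'cell_location': 'PFC', 'spike_times': [0.5, 1.5], 'xpos': [10.0, 11.0]},
    'tt15_unit0': {'cell_location': 'CA1', 'spike_times': [0.7], 'xpos': [12.0]},
}

wide = pd.DataFrame(cells)   # one COLUMN per cell
tall = wide.T                # one ROW per cell

print(wide.shape, tall.shape)
print(list(tall.index))
print(tall.loc['tt15_unit0', 'cell_location'])
```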
