# Chapter 4

## Flexible data loading, bulk extraction from NWB files

The DataJoint Position table we created in Chapter 1 was designed to read coordinate data from a CSV file.  Here we define a function that can extract data from either a CSV file or an NWB file. This chapter walks through bulk extraction of data from the NWB files generated in the previous chapter and passing that data to DataJoint.  

To recap what we covered so far:

- Chapter 1: Created a DataJoint pipeline and an NWB file.
- Chapter 2: Demonstrated principles of working with data from an NWB file and a DataJoint pipeline.
- Chapter 3: Integrated data from diverse source files in DataJoint, queried the pipeline and created NWB files based on query results.

This chapter closes the loop by writing data from NWB files into a modified pipeline that works for both the original source file and NWB files.


In [1]:
import numpy as np
import os
import pynwb #NWB API
import pandas as pd
import h5py #Provides methods to supplement NWB
import datajoint as dj #DataJoint API
import csv 
from datetime import datetime
from dateutil import tz
from pynwb import NWBHDF5IO
from matplotlib import pylab as plt #Needed for ERD diagrams
from ImportsAndTableDefinitions import *

Connecting marikelreimer@tutorial-db.datajoint.io:3306


## Setup

We will work with several directories in this chapter.  Update the path variables below to match your own system:

In [2]:
SharedFiles = 'C:/Users/meowm/OneDrive/DataWarehouse/SharedFiles/'
experiment_path = 'C:/Users/meowm/OneDrive/DataWarehouse/Experimenter1/'
data_directory = 'C:/Users/meowm/OneDrive/DataWarehouse/Experimenter1/Mouse_5025/Session_22'

os.chdir(data_directory)

## Part 1:  Extending DataJoint Table definitions

When we defined the Position table, we designed it to read CSV files.  When transitioning to the NWB data standard, it is helpful for the Position table to work with existing data as well as NWB files.  The code in the next section describes how to create logic that checks file type and then extracts data from it.  It requires some logic, so we begin with pseudocode.

Create a file_reader function that takes one argument, file_type:
    If the file_type is 'nwb':
        use NWB API methods to retrieve data
    Otherwise, if the file_type is 'NoseLocationLog.csv':
        use a NumPy method to retrieve data
    Otherwise, alert the user that the file is not recognized

In [3]:
def file_reader(file_type):
    #Relies on the global nwbfile object, which must be open for reading when file_type is 'nwb'
    if file_type == 'nwb':
        #See Chapter 2 for details about this method of accessing timeseries names
        timeseries_name = str(list(nwbfile.processing['behavior'].data_interfaces))
        #Clean the string before we use it: strip the leading "['" and trailing "']"
        timeseries_name = timeseries_name[2:-2]
        #Use timeseries name to retrieve data in timeseries group
        data = nwbfile.processing['behavior'][timeseries_name].data[:]
        return data

    elif file_type == 'NoseLocationLog.csv':
        #Skip the header row and read the comma-separated coordinates
        data = np.genfromtxt('NoseLocationLog.csv', delimiter=',', skip_header=1)
        return data

    else:
        print('file not recognized')

If you use the file_reader() function outside of this notebook, be sure to add it to ImportsAndTableDefinitions.py.
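
One way to avoid hard-coding the file name is to dispatch on the file extension instead. The sketch below is an illustration under stated assumptions, not part of this tutorial's pipeline: `read_by_extension`, `read_csv_coordinates`, and the `READERS` mapping are hypothetical names, and the CSV reader uses only the standard library so that it runs without NumPy. An NWB reader could be registered under `'.nwb'` in the same way.

```python
import csv
import os

def read_csv_coordinates(path):
    """Read numeric rows from a CSV file, skipping the header row."""
    with open(path, newline='') as f:
        rows = list(csv.reader(f))
    # rows[0] is the header; empty rows are ignored
    return [[float(value) for value in row] for row in rows[1:] if row]

# Hypothetical registry mapping a file extension to a reader function.
READERS = {'.csv': read_csv_coordinates}

def read_by_extension(path, readers=READERS):
    """Pick a reader from the file extension; raise if none is registered."""
    extension = os.path.splitext(path)[1].lower()
    if extension not in readers:
        raise ValueError('file not recognized: ' + path)
    return readers[extension](path)
```

Raising an exception rather than printing a message means an unrecognized file stops a bulk import instead of silently returning `None`.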

To test the function, we open our NWB file from Chapter 1.  In order to read data from the file, we must use the NWB API in read mode.  We specify our file type and the file_reader function returns our data.

In [4]:
file_type = 'nwb'

In [5]:
with NWBHDF5IO('Mouse_5025_Session_22.nwb', 'r') as io:
    nwbfile = io.read()
    #Pass the file type to the function we defined on input line 3
    data = file_reader(file_type)
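
The timeseries name lookup inside `file_reader` can be illustrated without an NWB file. In the sketch below a plain dictionary stands in for `nwbfile.processing['behavior'].data_interfaces` (an assumption made for illustration; the key `'Position'` is hypothetical). It compares slicing the string form of the key list with taking the first key directly.

```python
# A plain dict standing in for the data_interfaces container (assumption).
data_interfaces = {'Position': [[0.0, 1.0], [2.0, 3.0]]}

# The string form of the key list includes brackets and quotes.
as_text = str(list(data_interfaces))

# Slicing off the first two and last two characters recovers the name,
# but only when the container holds exactly one key.
sliced_name = as_text[2:-2]

# Taking the first key avoids string handling altogether.
first_name = list(data_interfaces)[0]

print(as_text)
print(sliced_name, first_name)
```

Both approaches give the same name for a single-entry container; indexing the list is the more robust choice if a processing module ever holds more than one data interface.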

## Part 2: Modifying DataJoint to read position from NWB File

We begin by instantiating our tables.  The position table is commented out, because we will redefine it to include the file_reader function.

In [6]:
#Instantiate tables
mouse = Mouse()
session = Session()
#position = Position()
positionStatistics = PositionStatistics()

We add our newly defined function to the _make_tuples method in the Position table. Defining the function outside the table lets us add that functionality with a single line of code, making the Position table code easier to read and maintain.

In [7]:
@schema
class Position(dj.Imported):
    definition = """
    -> Session
    ---
    coordinates:  longblob    # X,Y coordinates of mouse
    """
    
    def _make_tuples(self, key):
        #Pass the file_type variable to the file_reader function and store the results in 'data'
        data = file_reader(file_type) #New line of code.
        #Associate 'data' with the coordinates key
        key['coordinates'] = data
        #Insert into Position table
        self.insert1(key)
        print('Populated a position for {subject_id}'.format(**key))

In [8]:
#Instantiate our updated table
position = Position()
position

subject_id  Primary keys above the '---',name  Primary key for the session table,"coordinates  X,Y coordinates of mouse"
,,


## Part 3: Putting it all together

We will now loop through our shared file directory, extracting data from NWB files and passing it to our DataJoint pipeline. As each mouse can have multiple sessions, some logic is required to prevent duplication of the primary key of the Mouse table.  

We use pseudocode to describe the functionality:  

Navigate to the shared directory
Create a list of its contents
Create an empty list for mice

for file in file_list:
    read in all file data
    if the subject ID doesn't exist in the mice list:
        Add subject ID to mice list
        Insert subject_id and data along with session and position data into DataJoint and populate positionStatistics
    else:
        For the appropriate subject_id, insert session and position data into DataJoint and populate positionStatistics
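
The branching above can be sketched with in-memory records before touching the database. This is a minimal illustration, assuming each file has already been reduced to a `(subject_id, session_name)` pair; the pairs are the ones that appear in this chapter's output, while `mouse_rows` and `session_rows` are hypothetical stand-ins for the Mouse and Session tables. A set replaces the list of mice because membership tests on a set do not slow down as it grows.

```python
# (subject_id, session_name) pairs, one per NWB file.
records = [
    ('Mouse_5025', 'Session_21'),
    ('Mouse_5027', 'Session_7'),
    ('Mouse_5027', 'Session_8'),
]

seen_mice = set()    # subject IDs already inserted
mouse_rows = []      # stand-in for the Mouse table
session_rows = []    # stand-in for the Session table

for subject_id, session_name in records:
    # Insert the mouse only the first time its ID is encountered.
    if subject_id not in seen_mice:
        seen_mice.add(subject_id)
        mouse_rows.append(subject_id)
    # Every file contributes one session, whether or not the mouse is new.
    session_rows.append((subject_id, session_name))

print(mouse_rows)
print(session_rows)
```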

In [9]:
#Navigate to the shared directory:
os.chdir(SharedFiles)

#Create a list of files in the directory
file_list = os.listdir()

#An empty list that we will use to track subject_id, preventing duplication of mouse data in DataJoint.
mice = []
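
`os.listdir()` returns every entry in the directory, so a stray non-NWB file would be passed to the NWB reader. One possible guard, sketched here as an assumption rather than as part of the original notebook, is to keep only names ending in `.nwb`; `list_nwb_files` is a hypothetical helper.

```python
import os

def list_nwb_files(directory='.'):
    """Return the sorted names of regular files in `directory` ending in .nwb."""
    return sorted(
        name for name in os.listdir(directory)
        if name.lower().endswith('.nwb')
        and os.path.isfile(os.path.join(directory, name))
    )
```

Calling `list_nwb_files()` with no argument lists the current working directory, matching the behaviour of `os.listdir()` after `os.chdir(SharedFiles)`.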


In [10]:
for file in file_list:
    io = NWBHDF5IO(file, mode='r')
    nwbfile = io.read()

    #Extract Subject Data
    date_of_birth = nwbfile.subject.date_of_birth
    subject_description = nwbfile.subject.description
    genotype = nwbfile.subject.genotype
    sex = nwbfile.subject.sex
    species = nwbfile.subject.species
    subject_id = nwbfile.subject.subject_id
    weight = nwbfile.subject.weight

    #Extract session data from the root of the file
    session_description = nwbfile.session_description
    identifier = nwbfile.identifier
    session_start_time = nwbfile.session_start_time
    experimenter = nwbfile.experimenter
    experiment_description = nwbfile.experiment_description
    institution = nwbfile.institution
    lab = nwbfile.lab

    #Extract data from timeseries container
    #See Chapter 2 for details about this method of accessing timeseries names
    timeseries_name = str(list(nwbfile.processing['behavior'].data_interfaces))
    #Clean the string before we use it: strip the leading "['" and trailing "']"
    timeseries_name = timeseries_name[2:-2]

    #Retrieve the timeseries object and its name
    timeseries_data = nwbfile.processing['behavior'][timeseries_name]
    name = timeseries_data.name
    
    raw_data_link = experiment_path + subject_id +'/' + name

    starting_time = timeseries_data.starting_time
    reference_frame = timeseries_data.reference_frame
    rate = timeseries_data.rate
    comments = timeseries_data.comments
    description = timeseries_data.description 
    
    if subject_id not in mice:
        #Add subject_id to mice list
        mice.append(subject_id)

        #Insert the subject data into DataJoint
        mouse.insert1((
            subject_id,
            date_of_birth,
            genotype,
            sex,
            species,
            subject_description,
            weight
            ))

    #Pass each session's data to DataJoint, whether or not the mouse is new
    session.insert1((
        subject_id, #Primary key for mouse table
        name, #Primary key for session table
        comments,
        description,
        experiment_description,
        experimenter,
        identifier,
        institution,
        lab,
        rate,
        raw_data_link,
        reference_frame,
        session_description,
        session_start_time,
        starting_time
        ))
    #Populate while the file is still open, because file_reader reads from nwbfile
    position.populate()
    positionStatistics.populate()

    #Close the file once its data has been read into DataJoint
    io.close()



Populated a position for Mouse_5025
Populating for:  {'subject_id': 'Mouse_5025', 'name': 'Session_21'}
Populated a position for Mouse_5027
Populating for:  {'subject_id': 'Mouse_5027', 'name': 'Session_7'}
Populated a position for Mouse_5027
Populating for:  {'subject_id': 'Mouse_5027', 'name': 'Session_8'}


The data from the NWB files is now in the DataJoint pipeline, completing this tutorial.

In [11]:
mouse * session * positionStatistics

subject_id  Primary keys above the '---',name  Primary key for the session table,date_of_birth,genotype,sex,species,subject_description,weight,comments,description  Timeseries description,experiment_description,experimenter,identifier,institution,lab,rate,raw_data_link,reference_frame,session_description,session_start_time,starting_time,left_side  coordinates in left half,right_side  coordinates in the right half,preference_index  left_side - right_side/ left_side + right_side
Mouse_5025,Session_21,2017-04-03 11:00:00,B6,F,Mus musculus,fuzzy,25g,Mouse fell asleep,"The position of a mouse in an arena is transformed into X,Y coordinates",Demonstrate bulk data imports of mouse position data and metadata,Experimenter 1,Mouse_5025_Session_21,Yale University,Tan Lab,30.0,C:/Users/meowm/OneDrive/DataWarehouse/Experimenter1/Mouse_5025/Session_21,"Zero refers to the bottom left corner of the rig, when viewed from above","The position of a mouse in an arena is transformed into X,Y coordinates",2019-08-02 11:29:13,0.0,9571.0,11261.0,-0.0811252
Mouse_5027,Session_7,2017-08-10 10:00:00,B6,F,Mus musculus,silky,25g,Will need to fix camera glitch,"The position of a mouse in an arena is transformed into X,Y coordinates",Demonstrate bulk data imports of mouse position data and metadata,Experimenter 1,Mouse_5027_Session_7,Yale University,Tan Lab,30.0,C:/Users/meowm/OneDrive/DataWarehouse/Experimenter1/Mouse_5027/Session_7,"Zero refers to the bottom left corner of the rig, when viewed from above","The position of a mouse in an arena is transformed into X,Y coordinates",2019-08-02 14:21:59,0.0,4443.0,12656.0,-0.48032
Mouse_5027,Session_8,2017-08-10 10:00:00,B6,F,Mus musculus,silky,25g,Will need to fix camera glitch,"The position of a mouse in an arena is transformed into X,Y coordinates",Demonstrate bulk data imports of mouse position data and metadata,Experimenter 1,Mouse_5027_Session_8,Yale University,Tan Lab,30.0,C:/Users/meowm/OneDrive/DataWarehouse/Experimenter1/Mouse_5027/Session_8,"Zero refers to the bottom left corner of the rig, when viewed from above","The position of a mouse in an arena is transformed into X,Y coordinates",2019-08-02 14:21:22,0.0,4731.0,14657.0,-0.511966
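
The `preference_index` column can be checked by hand from the `left_side` and `right_side` counts in the table. The sketch below assumes the column comment is meant to be read with parentheses, (left_side - right_side) / (left_side + right_side), which reproduces the stored values; `preference_index` here is a hypothetical standalone function, not the PositionStatistics code.

```python
def preference_index(left_side, right_side):
    """Return (left - right) / (left + right) for two coordinate counts."""
    return (left_side - right_side) / (left_side + right_side)

# Counts copied from the query result above.
sessions = {
    'Mouse_5025/Session_21': (9571.0, 11261.0),
    'Mouse_5027/Session_7': (4443.0, 12656.0),
    'Mouse_5027/Session_8': (4731.0, 14657.0),
}

for label, (left, right) in sessions.items():
    print(label, round(preference_index(left, right), 6))
```

A negative index means more coordinates fell in the right half of the arena than in the left, which is the case for all three sessions.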
