# Introduction

This tutorial illustrates how to use *ObjTables* to revision datasets, revision schemas, and migrate datasets between revisions of their schemas. This tutorial uses an address book of CEOs as an example.

# Define a schema for an address book

First, as described in [Tutorial 1](1.%20Building%20and%20visualizing%20schemas.ipynb), use *ObjTables* to define a schema for an address book.

In [1]:
import enum
import obj_tables
import types


class Address(obj_tables.Model):
    street = obj_tables.StringAttribute(unique=True, primary=True, verbose_name='Street')
    city = obj_tables.StringAttribute(verbose_name='City')
    state = obj_tables.StringAttribute(verbose_name='State')
    zip_code = obj_tables.StringAttribute(verbose_name='Zip code')
    country = obj_tables.StringAttribute(verbose_name='Country')

    class Meta(obj_tables.Model.Meta):
        table_format = obj_tables.TableFormat.multiple_cells
        attribute_order = ('street', 'city', 'state', 'zip_code', 'country',)
        verbose_name = 'Address'
        verbose_name_plural = 'Addresses'


class Company(obj_tables.Model):
    name = obj_tables.StringAttribute(unique=True, primary=True, verbose_name='Name')
    url = obj_tables.UrlAttribute(verbose_name='URL')
    address = obj_tables.OneToOneAttribute(Address, related_name='company', verbose_name='Address')

    class Meta(obj_tables.Model.Meta):
        table_format = obj_tables.TableFormat.row
        attribute_order = ('name', 'url', 'address',)
        verbose_name = 'Company'
        verbose_name_plural = 'Companies'


class PersonType(str, enum.Enum):
    family = 'family'
    friend = 'friend'
    business = 'business'


class Person(obj_tables.Model):
    name = obj_tables.StringAttribute(unique=True, primary=True, verbose_name='Name')
    type = obj_tables.EnumAttribute(PersonType, verbose_name='Type')
    company = obj_tables.ManyToOneAttribute(Company, related_name='employees', verbose_name='Company')
    email_address = obj_tables.EmailAttribute(verbose_name='Email address')
    phone_number = obj_tables.StringAttribute(verbose_name='Phone number')
    address = obj_tables.ManyToOneAttribute(Address, related_name='people', verbose_name='Address')

    class Meta(obj_tables.Model.Meta):
        table_format = obj_tables.TableFormat.row
        attribute_order = ('name', 'type', 'company', 'email_address', 'phone_number', 'address',)
        verbose_name = 'Person'
        verbose_name_plural = 'People'


class AddressBook(obj_tables.Model):
    id = obj_tables.StringAttribute(unique=True, primary=True, verbose_name='Id')
    companies = obj_tables.OneToManyAttribute(Company, related_name='address_book')
    people = obj_tables.OneToManyAttribute(Person, related_name='address_book')

    class Meta(obj_tables.Model.Meta):
        table_format = obj_tables.TableFormat.column
        attribute_order = ('id', 'companies', 'people')
        verbose_name = 'Address book'
        verbose_name_plural = 'Address books'

# Revision an address book of the CEOs of technology companies

In many domains, such as exploratory areas of science, datasets must often be built iteratively over time. For example, we believe that whole-cell models will be built by iteratively modeling additional biochemical species, reactions, and pathways as more experimental data and knowledge are generated and additional collaborators contribute to a model. Consequently, it is often helpful to track the provenance of a dataset, including when the dataset was first created; when each revision was made; which objects and relationships were added, removed, or changed with each revision and why; and who contributed each revision.

##### Revisioning datasets with Git

We recommend using [Git](https://git-scm.com/) to track the revision provenance of a dataset as follows:
1. Create a Git repository.
2. Host the repository on a publicly accessible server such as [GitHub](https://github.com).
3. Save each revision in CSV, TSV, MULTI.CSV, or MULTI.TSV format so that Git can difference and merge the dataset.
4. Commit each revision to the dataset, noting the rationale for each revision in the commit message. In addition, [configure Git to track the author of each revision](https://git-scm.com/book/en/v2/Getting-Started-First-Time-Git-Setup).
5. Push the revisions to the public server.
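
Step 3 matters because Git differences files line by line: in a row-formatted CSV file each object occupies one line, so replacing an object appears as a small, readable change. The sketch below imitates this with Python's standard `difflib` module rather than Git itself. The two in-memory tables and their abbreviated columns are assumptions chosen for illustration.

```python
import difflib

# Two hypothetical revisions of a row-formatted table (columns abbreviated for illustration)
revision_2011 = [
    "!Name,!Company,!Email address\n",
    "Reed Hastings,Netflix,reed.hastings@netflix.com\n",
    "Steve Jobs,Apple,sjobs@apple.com\n",
]
revision_2020 = [
    "!Name,!Company,!Email address\n",
    "Reed Hastings,Netflix,reed.hastings@netflix.com\n",
    "Tim Cook,Apple,tcook@apple.com\n",
]

# Line-based difference, similar in spirit to the output of `git diff`
diff = list(difflib.unified_diff(revision_2011, revision_2020,
                                 fromfile="2011/people.csv", tofile="2020/people.csv"))
print("".join(diff), end="")

# Keep only the removed and added rows, skipping the `---`/`+++` file headers
changed = [line for line in diff
           if line[0] in "+-" and not line.startswith(("+++", "---"))]
```

Here `changed` holds one removed row and one added row, which is the kind of change a reviewer can read directly in a commit.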

##### Tracking revision metadata within datasets

Because datasets are often also shared via email; via public cloud storage systems such as Box, Dropbox, and Google Drive; and as supplementary files to journal articles, we also recommend that each revision of a dataset capture its own revision identifier (e.g., URL of the Git repository, branch of the repository, and hash of the revision) so that readers of the dataset can easily look up its provenance (e.g., Git commit log).
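
As a rough illustration of how such an identifier leads a reader back to the provenance, the sketch below stores the three fields in a small record and derives a link to the commit log. `RevisionId` and `commit_log_url` are hypothetical names, not part of *ObjTables*; the URL pattern assumes the repository is hosted on GitHub; and the example values are those of this tutorial's example repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RevisionId:
    """Hypothetical record of the fields that identify a revision of a dataset."""
    url: str       # URL of the Git repository
    branch: str    # branch of the repository
    revision: str  # hash of the commit


def commit_log_url(rev: RevisionId) -> str:
    """Build a link to the commit log ending at the revision (assumes GitHub hosting)."""
    base = rev.url[:-len(".git")] if rev.url.endswith(".git") else rev.url
    return f"{base}/commits/{rev.revision}"


rev = RevisionId(
    url="https://github.com/KarrLab/obj_tables_revisioning_tutorial_repo.git",
    branch="master",
    revision="a1815391af4d387a309cb0604312fd25a9acfe0e",
)
print(commit_log_url(rev))
```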

To make it easy for revisions of datasets to capture their own revision identifier, *ObjTables* provides a model (`obj_tables.utils.DataRepoMetadata`) for this information, as well as utilities for capturing the revision of a dataset from its Git repository.

##### Create a Git repository to track the revisions of the address book

In [2]:
from obj_tables.utils import DataRepoMetadata, set_git_repo_metadata_from_path
from wc_utils.util.git import RepoMetadataCollectionType
import git
import os
import shutil

repo_path = 'Address book'
repo_url = 'https://github.com/KarrLab/obj_tables_revisioning_tutorial_repo.git'

# create repository
if os.path.isdir(repo_path):
    shutil.rmtree(repo_path)
repo = git.Repo.clone_from(repo_url, repo_path)

##### Create an initial address book of the CEOs of several technology companies as of 2011, and save it to multiple CSV files along with metadata about the current revision of the address book

In [3]:
import obj_tables.io

# Steve Jobs of Apple
apple = Company(name='Apple',
                url='https://www.apple.com/',
                address=Address(street='10600 N Tantau Ave',
                                city='Cupertino',
                                state='CA',
                                zip_code='95014',
                                country='US'))
jobs = Person(name='Steve Jobs',
              type=PersonType.business,
              company=apple,
              email_address='sjobs@apple.com',
              phone_number='408-996-1010',
              address=apple.address)

# Reed Hastings of Netflix
netflix = Company(name='Netflix',
                  url='https://www.netflix.com/',
                  address=Address(street='100 Winchester Cir',
                                  city='Los Gatos',
                                  state='CA',
                                  zip_code='95032',
                                  country='US'))
hastings = Person(name='Reed Hastings',
                  type=PersonType.business,
                  company=netflix,
                  email_address='reed.hastings@netflix.com',
                  phone_number='408-540-3700',
                  address=netflix.address)

# Eric Schmidt of Google
google = Company(name='Google',
                 url='https://www.google.com/',
                 address=Address(street='1600 Amphitheatre Pkwy',
                                 city='Mountain View',
                                 state='CA',
                                 zip_code='94043',
                                 country='US'))
schmidt = Person(name='Eric Schmidt',
                 type=PersonType.business,
                 company=google,
                 email_address='eschmidt@google.com',
                 phone_number='650-253-0000',
                 address=google.address)

# Mark Zuckerberg of Facebook
facebook = Company(name='Facebook',
                   url='https://www.facebook.com/',
                   address=Address(street='1 Hacker Way #15',
                                   city='Menlo Park',
                                   state='CA',
                                   zip_code='94025',
                                   country='US'))
zuckerberg = Person(name='Mark Zuckerberg',
                    type=PersonType.business,
                    company=facebook,
                    email_address='zuck@fb.com',
                    phone_number='650-543-4800',
                    address=facebook.address)

# Merge the companies and CEOs into a single address book
ceos = AddressBook(
    id = 'tech',
    companies = [apple, facebook, google, netflix],
    people = [schmidt, zuckerberg, hastings, jobs],
)

# Get the current revision of the repository
revision = DataRepoMetadata()
set_git_repo_metadata_from_path(revision, RepoMetadataCollectionType.DATA_REPO, path=repo_path)

# Save the address book to multiple CSV files along with its revision metadata
address_book_filename = os.path.join(repo_path, 'ceos-*.csv')
obj_tables.io.Writer().run(address_book_filename, [revision, ceos],
                           models=[DataRepoMetadata, AddressBook, Company, Person])

In [4]:
import pandas
pandas.read_csv(os.path.join(repo_path, 'ceos-Data repo metadata.csv'), delimiter=',')

Unnamed: 0,!!ObjTables type='Data' tableFormat='column' class='DataRepoMetadata' name='Data repo metadatas' date='2020-04-28 22:25:01' objTablesVersion='0.0.9',Unnamed: 1
0,!Url,https://github.com/KarrLab/obj_tables_revision...
1,!Branch,master
2,!Revision,a1815391af4d387a309cb0604312fd25a9acfe0e
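
The output above suggests that a column-formatted file holds a `!!ObjTables` header line followed by one `!Attribute,value` row per attribute. Assuming that layout, a reader without *ObjTables* installed could still recover the revision identifier with the standard `csv` module. This is only a sketch: `read_column_table` is a hypothetical helper, the sample header line is shortened, and the URL is assumed to match the one used to clone the repository. Reading the files with *ObjTables* itself remains the more robust option.

```python
import csv
import io

# Sample text abbreviated from the output above (the header line is shortened)
sample = (
    "!!ObjTables type='Data' tableFormat='column' class='DataRepoMetadata',\n"
    "!Url,https://github.com/KarrLab/obj_tables_revisioning_tutorial_repo.git\n"
    "!Branch,master\n"
    "!Revision,a1815391af4d387a309cb0604312fd25a9acfe0e\n"
)


def read_column_table(text):
    """Hypothetical helper: map each `!Attribute` heading to the value beside it."""
    values = {}
    for row in csv.reader(io.StringIO(text)):
        # Skip blank rows and the `!!ObjTables` header line
        if not row or row[0].startswith("!!"):
            continue
        if row[0].startswith("!"):
            values[row[0][1:]] = row[1] if len(row) > 1 else ""
    return values


metadata = read_column_table(sample)
print(metadata["Revision"])
```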


In [5]:
pandas.read_csv(os.path.join(repo_path, 'ceos-Address book.csv'), delimiter=',')

Unnamed: 0,!!ObjTables type='Data' tableFormat='column' class='AddressBook' name='Address books' date='2020-04-28 22:25:01' objTablesVersion='0.0.9',Unnamed: 1
0,!Id,tech
1,!Companies,"Apple, Facebook, Google, Netflix"
2,!People,"Eric Schmidt, Mark Zuckerberg, Reed Hastings, ..."


In [6]:
pandas.read_csv(os.path.join(repo_path, 'ceos-Companies.csv'), delimiter=',')

Unnamed: 0,!!ObjTables type='Data' tableFormat='row' class='Company' name='Companies' date='2020-04-28 22:25:01' objTablesVersion='0.0.9',Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6
0,,,!Address,!Address,!Address,!Address,!Address
1,!Name,!URL,!Street,!City,!State,!Zip code,!Country
2,Apple,https://www.apple.com/,10600 N Tantau Ave,Cupertino,CA,95014,US
3,Facebook,https://www.facebook.com/,1 Hacker Way #15,Menlo Park,CA,94025,US
4,Google,https://www.google.com/,1600 Amphitheatre Pkwy,Mountain View,CA,94043,US
5,Netflix,https://www.netflix.com/,100 Winchester Cir,Los Gatos,CA,95032,US


In [7]:
pandas.read_csv(os.path.join(repo_path, 'ceos-People.csv'), delimiter=',')

Unnamed: 0,!!ObjTables type='Data' tableFormat='row' class='Person' name='People' date='2020-04-28 22:25:01' objTablesVersion='0.0.9',Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9
0,,,,,,!Address,!Address,!Address,!Address,!Address
1,!Name,!Type,!Company,!Email address,!Phone number,!Street,!City,!State,!Zip code,!Country
2,Eric Schmidt,business,Google,eschmidt@google.com,650-253-0000,1600 Amphitheatre Pkwy,Mountain View,CA,94043,US
3,Mark Zuckerberg,business,Facebook,zuck@fb.com,650-543-4800,1 Hacker Way #15,Menlo Park,CA,94025,US
4,Reed Hastings,business,Netflix,reed.hastings@netflix.com,408-540-3700,100 Winchester Cir,Los Gatos,CA,95032,US
5,Steve Jobs,business,Apple,sjobs@apple.com,408-996-1010,10600 N Tantau Ave,Cupertino,CA,95014,US


##### Commit the initial address book

In [8]:
repo.index.add([
    'ceos-Data repo metadata.csv',
    'ceos-Address book.csv',
    'ceos-Companies.csv',
    'ceos-People.csv',
])
repo.index.commit('Initial version of address book')

<git.Commit "8bcd4ed1aa1d1dcfaf3cede9339b228b88522fcf">

##### Revise the address book to reflect the current CEOs as of 2020

In [9]:
# Tim Cook is now the CEO of Apple
ceos.people.remove(jobs)
apple.employees.remove(jobs)
apple.address.people.remove(jobs)
cook = Person(name='Tim Cook',
              type=PersonType.business,
              company=apple,
              email_address='tcook@apple.com',
              phone_number='408-996-1010',
              address=apple.address)

# Sundar Pichai is now the CEO of Google
ceos.people.remove(schmidt)
google.employees.remove(schmidt)
google.address.people.remove(schmidt)
pichai = Person(name='Sundar Pichai',
                type=PersonType.business,
                company=google,
                email_address='sundar@google.com',
                phone_number='650-253-0000',
                address=google.address)

# Get the current revision of the repository
set_git_repo_metadata_from_path(revision, RepoMetadataCollectionType.DATA_REPO, path=repo_path)

# Save the address book to multiple CSV files along with its revision metadata
obj_tables.io.Writer().run(address_book_filename, [revision, ceos], 
                           models=[DataRepoMetadata, AddressBook, Company, Person])

In [10]:
pandas.read_csv(os.path.join(repo_path, 'ceos-Data repo metadata.csv'), delimiter=',')

Unnamed: 0,!!ObjTables type='Data' tableFormat='column' class='DataRepoMetadata' name='Data repo metadatas' date='2020-04-28 22:25:01' objTablesVersion='0.0.9',Unnamed: 1
0,!Url,https://github.com/KarrLab/obj_tables_revision...
1,!Branch,master
2,!Revision,8bcd4ed1aa1d1dcfaf3cede9339b228b88522fcf


In [11]:
pandas.read_csv(os.path.join(repo_path, 'ceos-People.csv'), delimiter=',')

Unnamed: 0,!!ObjTables type='Data' tableFormat='row' class='Person' name='People' date='2020-04-28 22:25:01' objTablesVersion='0.0.9',Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9
0,,,,,,!Address,!Address,!Address,!Address,!Address
1,!Name,!Type,!Company,!Email address,!Phone number,!Street,!City,!State,!Zip code,!Country
2,Mark Zuckerberg,business,Facebook,zuck@fb.com,650-543-4800,1 Hacker Way #15,Menlo Park,CA,94025,US
3,Reed Hastings,business,Netflix,reed.hastings@netflix.com,408-540-3700,100 Winchester Cir,Los Gatos,CA,95032,US
4,Sundar Pichai,business,Google,sundar@google.com,650-253-0000,1600 Amphitheatre Pkwy,Mountain View,CA,94043,US
5,Tim Cook,business,Apple,tcook@apple.com,408-996-1010,10600 N Tantau Ave,Cupertino,CA,95014,US
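
Comparing this table with the 2011 version shows which objects the revision added and removed. A minimal sketch of that comparison, keyed on the `!Name` column, follows. `names` is a hypothetical helper, and the in-memory tables keep only the first three columns of the outputs shown in this tutorial.

```python
import csv
import io

# First three columns of the 2011 and 2020 People tables
people_2011 = """!Name,!Type,!Company
Eric Schmidt,business,Google
Mark Zuckerberg,business,Facebook
Reed Hastings,business,Netflix
Steve Jobs,business,Apple
"""
people_2020 = """!Name,!Type,!Company
Mark Zuckerberg,business,Facebook
Reed Hastings,business,Netflix
Sundar Pichai,business,Google
Tim Cook,business,Apple
"""


def names(table_text):
    """Hypothetical helper: collect the primary keys (the `!Name` column) of a table."""
    return {row["!Name"] for row in csv.DictReader(io.StringIO(table_text))}


# Set differences on the primary keys give the added and removed people
added = sorted(names(people_2020) - names(people_2011))
removed = sorted(names(people_2011) - names(people_2020))
print("Added:", added)
print("Removed:", removed)
```

Keying on the primary attribute rather than on whole lines separates additions and removals from edits to an existing object, which a line-based difference reports as one removed and one added line.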


##### Commit the revised address book

In [12]:
repo.index.add([
    'ceos-Data repo metadata.csv',
    'ceos-Address book.csv',
    'ceos-Companies.csv',
    'ceos-People.csv',
])
repo.index.commit('Update address book to reflect the CEOs as of 2020')

<git.Commit "dc735cea7f6fe707a829104bcf83eb1c98050ef6">

# Revise the address book schema and migrate the address book to the revised schema

Please check back soon! In the meantime, please contact us at [info@karrlab.org](mailto:info@karrlab.org) with any questions.