Skip to content

DB Migration Procedures

Ethan Smith edited this page Jun 2, 2015 · 6 revisions

OrientDB -> Neo4j

The initial data migration associated with this project was made from the legacy MySQL relational database to a new OrientDB graph database. Thanks to a new licensing option from Neo4j there is now an option to migrate again to a Neo4j Enterprise database. No additional planning has yet occurred regarding the procedure or code changes necessary to do this.

MySQL -> OrientDB

This file documents how data from the AskNature 1.x mySQL database has been mapped to the AskNature 2.x OrientDB database.

Learn more about the relationships found in the AskNature 1.x schema via this set of diagrams.

In general, this migration followed the process described in the Import from a CSV file to a Graph chapter of the OrientDB documentation. There's also a good video tutorial at Udemy. After some early trials with CSV and JSON dumps, JSON turned out to be a much easier format to work with.

##Status

See the status of the migration, and how the legacy data schema maps to the new one, in this public Google Sheet.

##Process

Note: All ETL files that have already been created exist in this private repository.

  1. Export tables to JSON. It's simplest (easiest to comprehend) if tables without relationship dependencies are exported on their own, and if each relationship type is exported independently. Read a better description of this concept here.
  2. Delete automated header the JSON plugin adds to top of exported json file. The snipped below should be edited so it starts at the brackets [{"id": 1,...:
/**
Export to JSON plugin for PHPMyAdmin
@version 0.1
*/

/* Database 'biomimicry' */
/* biomimicry.highlighted_content */

[{"id": 1,"type": "strategy","name": "Retrofit" ...

Most of the exported tables that included fields that allowed html input have needed cleaning up.

It also has been helpful to pre-assign partial schemas for each new class that explicitly tell OrientDB to treat fields like "masterid" and "newpassword" as STRINGS rather than INTEGERS, which it sometime defaults to.

  1. Create a JSON file to tell the importer how to map the data.

The simplest version of this just translates a table to a new vertex class without defining edges, like so (note that this includes a "merge" tranformer, which makes sure the import doesn't result in duplicate entries):

{
 "config": {
   "log": "debug"
 },
 "begin": [
 ],
 "source" : {
   "file": { "path": "../products.json", "lock" : true }
 },
 "extractor" : {
   "json": {}
 },
 "transformers" : [
   { "json": { } },
   { "merge": {"joinFieldName": "masterid", "lookup": "Products.masterid"}},
   { "vertex": { "class": "Products" }}
 ],
 "loader" : {
   "orientdb": {
     "dbURL": "remote:localhost/databases/AskNature",
     "dbUser": "***",
     "dbPassword": "***",
     "dbAutoCreate": true,
     "tx": false,
     "batchCommit": 1000,
     "dbType": "graph"
   }
 }
}

In some cases a child/parent hierarchy of records is defined within the table being imported. For instance, every Linnaean species is the child of a Linnaean genus, whose unique ID is referenced in the species record. We add an index and define an edge class (note that this example uses a CSV source file, but could just as easily be JSON):

{
 "config": {
   "log": "debug"
 },
 "begin": [
 ],
 "source" : {
   "file": { "path": "../functions.csv", "lock" : true }
 },
 "extractor" : {
   "row": {}
 },
 "transformers" : [
   { "csv": { "separator": ",", "nullValue": "NULL" } },
   { "vertex": { "class": "Function" }},
   { "edge": { "class": "ChildOf",
               "joinFieldName": "parentid",
               "lookup": "Function.area_masterid"}}
 ],
 "loader" : {
   "orientdb": {
     "dbURL": "remote:localhost/databases/AskNature",
     "dbUser": "***",
     "dbPassword": "***",
     "dbAutoCreate": true,
     "tx": false,
     "batchCommit": 1000,
     "dbType": "graph",
     "indexes": [
       {"class":"Function", "fields":["area_masterid:integer"], "type":"UNIQUE" }
     ]
   }
 }
}

The source DB has a relations table that defines a number of relationships between various record types. This actually represents a pretty impressive stab at a mySQL version of "edges." There are two records for each relationship, one from A -> to -> B and another from B -> to -> A. We only need one set (i.e. A>B), but must be cognizant of which direction we want this edge to go in the graph. In the following example, the exported file incapsulated a filtered set from relations based on the following SQL query:

SELECT *
FROM `relations`
WHERE `from_type` LIKE 'product'
AND `to_type` LIKE 'aof'
{
 "config": {
   "log": "debug"
 },
 "begin": [
 ],
 "source" : {
   "file": { "path": "../relations_product-function.json", "lock" : true }
 },
 "extractor" : {
   "json": {}
 },
 "transformers" : [
   { "json": {} },
   { "merge": { "joinFieldName": "from_masterid", "lookup":"Product.masterid", "unresolvedLinkAction" : "SKIP"}},
   { "vertex": { "class": "Product" }},
   { "edge": { "class": "HasFunction",
               "joinFieldName": "to_masterid",
               "lookup": "Function.masterid"}}
 ],
 "loader" : {
   "orientdb": {
     "dbURL": "remote:localhost/databases/AskNature",
     "dbUser": "***",
     "dbPassword": "***",
     "dbAutoCreate": true,
     "tx": false,
     "batchCommit": 1000,
     "dbType": "graph",
     "indexes": [
       {"class":"Function", "fields":["masterid:integer"], "type":"UNIQUE" },
       {"class":"Product", "fields":["masterid:string"], "type":"UNIQUE" }
     ]
   }
 }
}

And what of tables that necessitate the creation of multiple edges? One method is to use the above technique multiple times with various edge definitions. OR you can add multiple edges to the transformer, like so:

{
"config": {
  "log": "debug"
},
"begin": [

],
"source" : {
  "file": { "path": "../importdata/relations_product.json", "lock" : true }
},
"extractor" : {
  "json": {}
},
"transformers" : [
  { "json": {} },
  { "merge": { "joinFieldName": "from_masterid", "lookup":"InspiredSolutions.masterid", "unresolvedLinkAction" : "NOTHING"}},
  { "vertex": { "class": "InspiredSolutions" }},
  { "edge": {
        "class": "RelatedTo",
        "joinFieldName": "to_masterid",
        "lookup": "InspiredSolutions.masterid",
        "direction": "out",
        "if": "to_type LIKE 'product'"
      }
  },
  { "edge": {
        "class": "InspiredBy",
        "joinFieldName": "to_masterid",
        "lookup": "Strategy.masterid",
        "direction": "out",
        "if": "to_type LIKE 'strategy'"
      }
  },
  { "edge": {
        "class": "HasFunction",
        "joinFieldName": "to_masterid",
        "lookup": "Function.masterid",
        "direction": "out",
        "if": "to_type LIKE 'aof'"
      }
  },
  { "edge": {
        "class": "FeaturedIn",
        "joinFieldName": "to_masterid",
        "lookup": "Sources.masterid",
        "direction": "out",
        "if": "to_type LIKE 'referenceMaterial'"
      }
  },
  { "edge": {
        "class": "StudiedBy",
        "joinFieldName": "to_masterid",
        "lookup": "Experts.masterid",
        "direction": "out",
        "if": "to_type LIKE 'expert'"
      }
  },
  { "edge": {
        "class": "Bookmarked",
        "joinFieldName": "to_masterid",
        "lookup": "Users.masterid",
        "direction": "in",
        "if": "to_type LIKE 'user'"
      }
  }
],
"loader" : {
  "orientdb": {
    "dbURL": "remote:localhost/databases/AskNature2",
    "dbUser": "admin",
    "dbPassword": "admin",
    "dbAutoCreate": true,
    "tx": false,
    "batchCommit": 1000,
    "dbType": "graph",
    "indexes": [
      {"class":"InspiredSolutions", "fields":["masterid:string"], "type":"UNIQUE" }
    ]
  }
}
}

  1. Make sure oetl.sh is installed by following these directions.

Finally, run the ETL process from the command line:

$ cd bin
$ ./oetl.sh ../ETL_Direcory/ETLfile.json

The debug console should give you a good sense about whether things are working or not.

If you're using the files that already exist, this is the rough sequence that was used:

None of these have dependencies:

  1. Strategy
  2. Products
  3. Experts
  4. Reference
  5. Functions (creates edges between parents and children)
  6. Organisms (creates edges between parents and children)
  7. Users

These all have dependencies on vertices created above:

  1. Userbans

  2. Relations-product

  3. Relations-strategy

  4. Relations-user

  5. Files

  6. Check to make sure everything worked using OrientDB Studio or ./bin/console.sh.