kyleburton / adventures-in-etl
- Source
- Commits
- Network (1)
- Issues (0)
- Downloads (0)
- Wiki (1)
- Graphs
-
Branch:
master
| name | age | message | |
|---|---|---|---|
| |
README.textile | Thu Oct 29 05:15:20 -0700 2009 | |
| |
adventures-in-etl.key | Tue Oct 20 05:39:49 -0700 2009 |
Adventures in Data – Looking up from the Mezzanine of Data Warehousing
If you’ve worked with relational databases at all, you’ve likely had to export data from them, and import data into them. We do this to back dat up, to migrate data sets from one version of our application to another, for replication, and as part of the normal operation of our systems. You are surrounded by ETL.
I was personally thrown deep into the world of ETL during my time at a Philladelphia area staart up who’s core focus was data integration. At that company had to deal with, what was at the time, a vast amount of ETL. Being a start up, we couldn’t afford commercial tools and embarked on creating our own.
This talk is about my experiences developing those tools and how we used them to process data en masse. I’ll discuss how I feel we diverged from the standard model of ETL and what we feel the trade offs were.
What is ETL?
{{{ need more references : wikipedia has a good overview of ETL }}}
ETL is an acronym for: Extract, Transform and Load. It is a blanket term for the tools, processes, and techniques used to export data from one system, cleanse and reformat it, and finally import it into a target sysetm.
The Problem: Circa 1999
The initial core product of the data integration company was a national (US) master file of demographic data on healthcare practitioners. The founder’s idea was that you could builtd a more comprehensive, accurate master data set by combining as many independent sources of data as possible – each one providing more attirbutes, or more of a confirmation of the data values from the other sources. The more sources agreed that a particular attriubte was correct (eg: first name), the higher the probability that the value was actually correct.
He had created the first data sets out of a handful of data bases that he had purchased and integrated together using a combination of table joins in Microsoft Access augmented by cleansing and parsing he did by hand with a mouse and cut&paste. It was humbling to see how quickly he could manipulate data by hand with MS Access.
After having good success creating an initial data set by hand, and realizing that he could acheive significantly better results by acquirning and integrating vastly larger data sets, he founded the data integration company with the express purpose of developing software to perform the data ingtegration.
Data sets had to be purchased, and employees be hired – this left little capital early on for buying commercial ETL and data integration tools, which were prohibitivly expensive for us as a start up (hundreds of thousands to millions of dollars).
Data Sources
Not all of the software was developed at once. At about the five year mark, we had managed to create a system that integrated several thousand databases, representing several hundred relational data models.
We had initially started with what we saw as the traditionall ETL approaches: have developers create ‘sql loader’ import configurations (we tried Oracle, SQL Server, and MySQL’s tools); have developers create custom perl or VB import programs; or have an analyst try to create import processes using MS Access. What we found was that we had very little influence over the data providers and the format they would send to us; many of the data providers we were receiving data from changed their formats much more frequently than we had anticipated (leading to high maintenence on the customized ETL processes); we were continually improving our understanding of the data we were creating and this was feeding back into how we were doing ETL on the data we were receiving – we realized that as we better understood the data, that we could improve our master-file’s comprehensiveness and accuracy by improving how we cleansed and mapped the input sources.
As we started in that first year, the realization dawned on us that having developers perform the ETL did not scale well, nor was it going to be cost effective. We wanted to re-define ETL so that people who were, though still technical, not necessarily full on software developers. At that point we began to focus on that as a core problem in addition to data integration.
Simplifying ETL
We were able to hire technical personell who could learn a bit of scripting (bash or perl) and knew enough sql to work with data once it was in a relational database, and then extrat the data to our customer’s requirements afterward. What we needed then were simpler tools for making it easy to describe and subsequently work with the raw data (represented as flat files, ms access databases and other well structured formats), then modelling that data and mapping it into a form that could be used.
For the former we developed a set of libraries and suite of tools we called the Abstract Table Utilities. For the latter we used what we now understand to be a Semantic Mapping approach, using semantic tagging to describe the input data model and then an ontological mapping to describe how to map it into a form that the subsequent data integration could be performed on.
Abstract Tables
The abtab (abstract-table) libraries centered around the idea of abstracting away the encoding of a table of data from its encoding, by defining a table as a stream of records where a record was a set of named fields. The encoding shoudl be able to vary independentl of the structure of the table itself.
Not long after we settled on that description, we came up with the idea of using a URI as an opaque representation of the table. The schema would represent the encoding (which we called a ‘driver’ : tab delimited, csv, fixed width, mysql, mdb, and so on), the path would represent the location (file path, database host / database / schema), and the query string would allow for custom options for each driver (schema).
We developed the libraries in Perl and in parallel we also developed command line utilities. The command line utilites allowed the non-programmers to write scripts and makefiles that performed moderately complex ETL and data transformations by composing these tools together.
An example is probably the most valuable way to get this idea across:
user@host ~$ abtab_cat -i mysql://localhost/pa_licensed_physicians | abtab_grep -e '$F[LICENSE_TYPE] eq "MD"' | abtab_cat -o csv://data/clients/med-dev-co1/pa-mds.csv
This set of commands first selects the data out of a table in mysql, converting it to tab-delimited format (inclusive of the header), performing a filter operation on the LICENSE_TYPE field for values of ‘MD’, then finally emitting the result (including the header) into a file in comma separated value format.
These utilities were originally implemented in Perl and used utf8 as their default character encoding. This covered the vast majority of the file encodings that we saw. In rare cases we would process data that was encoded in EBCDIC or another character set and in those rare instances we would use a separate format converter or a custom format adapter.
API and and Commands
Driver.open_resource(uri)
Static factory method. Based on the set of registered drivers, construct, initialize and reutrn a driver instance for the given uri. The driver class was determined via registration based on the URI schema.
Note that after open_resource, the driver is neither in read, nor write mode. The subsequent api calls that the user makes determined the read/write nature of the driver instance. If one of the read methods were invoked (eg: read_next_record, get_headers, etc.) then the stream was opened and internal state was initalized. If one of write methods were invoked (eg: set_headers, write_record), then the target encoding was destroyed (delted, truncated, etc) or otherwise opened in write mode. Not all drivers supported writing, in which case they were expected to throw an exception to halt further processing.
Driver.import_header_constants([prefix])
This used Perl’s meta-programming to inject constants into the calling code, which were named identically to the column names in the table. These constants mapped to the array indicies for the fields (to be used with read-next-record, as well as in constructing records).
This allowed code to be both readable (using column names instead of hard-coded index values), efficient (array indexing was superior to map lookups), and resistant against the input file format changing – as the constants would trigger a hard error at runtime if they were not created during the import call.
my $input = Abtab.get_resource($some_uri);
$input->import_header_constants('F_');
while( my $rec = $input->read_next_record ) {
printf "(Name,Id) = (%s,%s)\n", $rec->[F_NAME()], $rec->[F_ID()];
}
$input->close;
In that example, if it were to open a source that did not have both a NAME and an ID column, the code would error on the first record.
In the event that a field name was used more than one time, the implementation would emit a warning and then append an underscore and a numeric counter so that the constant would not conflict. If there were 3 NAME fields in the previous example, then import would create 3 constants: NAME, NAME_1, an NAME_2.
In practice, we observed poorly constructed files, and often had no control over having them be re-created to ‘spec’, this feature allowed those files to be analyzed and processed.
Driver.read_next_record => Array
Returns the next record from the stream, or undef (equiv to nil / null) if the stream was exhausted. Note that it did not automatically close or clean up the driver instance, the user had to do so manually. Field order was preserved perfectly.
Driver.read_next_record_map => Map
Returned the next record from the stream as a map. Field names were a string, values were a string — no type convesion was done by the library, that behavior was left up to calling code.
Driver.read_all_records => Array[Array]
Drivers were free to implement this if greater efficiencies could be gained via a bulk read operation.
Driver.get_headers => Array
Returned an array of the column names in declared order.
Driver.get_field_idx(field_name)
Returned the integer index (0 baesd) for the requested field.
Driver.write_next_record(Array)
Wrote a record to the stream.
Driver.write_next_record_map(Map)
Driver.write_records(Array[Array])
Drivers were free to implement this if greater efficiencies could be gained via a bulk write operation.
Driver.reset
This reset the internal state of the driver. If it was used for reading, it could then be used again. If it were used for writing, then this closed the write handle and re-set it for either re-writing or reading from the beginning of the stream.
Driver.set_headers
This was a destructive operation to the uri target. If called on a driver that was currently reading, it caused an exception (a reset would allow for this). If called on a driver already writing then an exception was also thrown.
If called on a newly initalized driver, the target would be truncated or deleted and re-created to fit the new structure. In the case of sql based targets, they would create VARCHAR fields, making a default assumption of text / string data. Some of the drivers did support additional parameters on the URI (in the query string) or to their constructors and instances directly to specify types, but type specificaiton was not part of the base abstract table api since many drivers would be unable to support it and it was not necessary for the system to be useful. We determined it wasn’t worth the effort of implementation in the base classes.
Command Line Utilities / Applications
All of these utilities followed the core unix philosophies of being composable via their standard input and standard output streams, emitting non-data errors to their error stream, as well as being permissive in what input they accepted and strict in the output they emitted. Following these principals enabled the tools to be used and re-used in conjunction with a wider array of other tools – including the standard suite of unix utilities.
user@host ~$ abtab_cat -i 'xls://pa_licensed_physicians.xls?sheet=Sheet1&row=5&column=2&stop_pattern=Totals' | cut -f1 | sort | uniq -c | sort -nr
This example extracts data from the first work sheet in an Excel file, starting at a specific offset and stopping when it sees teh text ‘Totals’ in the record. The first column (presumably an identifier column) is taken, with all others discarded, the ids are sorted and then duplicates are counted and finally sorted numerically in reverse order. The output produced from this ensemble is the list of duplicated IDs sorted by the amount of duplicaton. If used on another column (say state or credentials), the same set of operations would produce a histogram of values – a common initial step when analyzing a file.
App.Cat
This utility acted similarly to the standard Unix program cat. In its basic form it took a URI and emitted a tab-delimited record stream to it’s standard output. It was able to take both an input and an output URI, and thus was easily useable to perform file format or encoding conversion.
user@host ~$ abtab_cat -i mysql://localhost/pa_licensed_physicians -o xls://pa-physicians.xls user@host ~$ abtab_cat -i mysql://localhost/pa_licensed_physicians -o tab:///dev/stdout | cut -f 2- > physicians-no-ids.tab user@host ~$ abtab_cat -i mysql://localhost/pa_licensed_physicians -o 'tab:///dev/stdout?delim=,&record_separator=%0D%0A'
In the first example a mysql table is converted into an Excel spreadsheet. The second example emits tab-delimited output, passing it through the unix cut utility to strip off the first column (presumably an ID column) and the redirects the (still) tab-delimited output to capture it into a local file. The last example emits the same data, utilizing driver specific parameters to change the field delimiter and the record separator.
On Unix, the use of the ‘/dev/stdin’ and ‘/dev/stdout’ special files allowed several of the utilities to have simpler arugment processing (eg: require the input or output files to be spcified) while still allowing them to be composed with each other and other commands in a pipeline.
App.Grep
This mirroed the functionality of the unix grep utility. This utility preserved the header line, supported full perl regular expressions as well as perl expressions. As with the behavior of perl’s ‘-a’ (awk-mode) switch, the record was stored in the array ‘@F’ for use by the expression.
user@host ~$ abtab_cat -i mysql://localhost/pa_licensed_physicians | abtab_grep -e '$F[LICENSE_TYPE] eq "MD"' | abtab_cat -o csv://data/clients/med-dev-co1/pa-mds.csv
As this exampel shows, the constants mentioned in the drivers section were made available to the code being executed, which made expression based searching more readable.
App.View
View was an interactive appilcation that allowed the records to be seen in a vertially displayed manner, rather than just as a table.
tab://pa_licensed_physicians.tab [ 1] [ 0] ID: 134166134 [ 1] FNAME: John [ 2] LNAME: Smith [ 3] CRED: MD [ 4] STREET1: 123 E. Main St. ...
The utility supported an interactive mode where keys similar to Vim’s navigation keys could be used to navigate the record stream (including going backward).
App.Mod
App.Analyze
App.Diff
Filter.Grepping
Filter.Sorting
Filter.Cutting
Filter.Normalizaing
Filter.Landscaping
Filter.Modifying
Filter.Teeing
Filter.Putback
Developed for use by abtab_view, this allowed records to be placed back into the stream (which were just stored in-memory) to be (re)read again.
Drivers We Suported
Tab
CSV
Fixed Width
mysql
mdb (MS Access database)
oracle
DBI (equivalent to JDBC or ODBC)
Array
This was an in-memory driver
Driver.import(uri), Driver.import(Driver)
Used to do bulk operations when they were more efficient – in the case of Mysql, where a file based URI was provided where mysql could handle the format, this would make use of mysql’s “copy infile” command to quickly import the data (typically 10-100x faster than a series of insert statements).
Aaron’s ETL Outline
- What is ETL
ETL (Extract, Transform, Load) is the blanket term for the tools and processes used to export data from one system and import them into antoher. This typicall manifests itself as a dump of the source system’s data base, a process to make the data model and values compatible with a target system and an import process to get the data into the target system.
- ETL @ The Data Integration Company
- what type of problem was being solved?
- data volume
- data source type
- speed ETL
- frequency of ETL
- technique or tool(s) used for ETL
- describe the evolution of the current tool(s)
- what did you guys used to do?
- how did you guys arrive at the current approach?
- how long did it take to develop the initial version of the tool(s)
- who were the end users of the tool(s)
- ability to configure during runtime
- visually inspect data while it is being processed
- How do they work
- What data sources could they handle (source, and destination)
- Syntax
- How is transformation being handled?
- How does it handle the following scenarios:
- one-to-one (trivial)
- one-to-many – one record becomes multiple records
- many-to-many – one logical record from multiple tables into multiple tables
1. The era of table stores
2. Relational Databases
3. Reporting Systems Co-opted as Export tools
4. Import / Export tools
5. Relational Data Model Mapping
6. There is no Spoon : How to stop worrying and Love Ontological Mapping Techniques
Outline
What is ETL?
Extract, Transform, Load
The goal is to take a data set from one system and load it into another system where there are likely to be significant differences in how the two systems store, model and otherwise represent their data.
Extract data from a system (be they text files, a database dump, or other propreitary format), transform them into a form which can be understood (loaded) into the target system (change the encoding, data types, standardize values, morph the data model).
Most of this talk is going to focus on the movement of relationaly modeled data, tables of records with foreign key relationships, though ETL does not necessarily imply relationally modeled data. Many of the tools that will be discussed were designed specifically to deal with this type of data.
My own Experience with Data Integration
My programming career started at a 401k benefits managment company, my job was to take a flat-file export from a Unisys main-frame and load it into an Integrated Voice Response system where customers could use a phone to check their balances and change their allocations.
Little did I know at the time of what was to come…
More recently I worked for a company that did larger scale data integration. At that company we were integrating thousands of independent data sources into a single consolidated master file of data. These were acquired in a wide variety of ways: assets purchased from companies that had gone out of business, public domain sources, government agencies, other data providers and even our own customers in reciprocial agreements.
In nearly all cases we had very little influence on how we received the data, most often the sending party did not have the resources or expertise, or we did not wish to pay the preimium, to have them adapt their export processes to our input format or data model.
Early on we didn’t have a data model anyway.
We needed to load tens to hundreds of sources a day, where a source was a snapshot of an input database. Some of these were to be incorporeated into the master file, some were customer data integration jobs – to enhance their data and produce a deliverable back.
We were a start-up and could not afford to purchase expensive ETL software, nor could we afford to staff up a large number of highly skilled (and expensive) database developers to perform the data transformation and loading process.
What we ended up doing was thinking a bit differently about the different facets of ETL and creating tools that allowed us to leverage technical anlysts to perform the ETL mapping, without requiring programming skills, or significant amounts of training.
We even saw a not insignificant amount of data errors or corruption – we had optical media from defunct companies that had bit level errors. The data was valueable enought that we worked to correct and accomodate those kinds of errors in our system.
Data Volumes
We dealt with tens to hundreds of database snapshots a day. These sources would be up to several million records each. The entire system maintained billions of records of data, connected to about five million entities.
Data Formats
- Text Formats:
- Structured
- Fixed Width
- Record Separators optional
- Padding
- Often Not Self-describing (no header row)
- Packed Fields, more than 1 value in a field (delimited within fixed width)
- Mis-fielded Data
- ‘Somewhat fixed’ – data overrunning into adjacent fields
- Delimited
- Name for the ‘well formed delimted’ like tab/pipe?
- Name for the ‘complex delimited’ like CSV?
- Fixed Width
- SemiStructured
- PDF Files
- HTML Pages
- Legal Proceedings
- Structured
Start with the lowest level, move up one conceptual level at a time.
- Files: Blobs of Binary Data Sitting on Disk
- Records: Delimited or Fixed Width
- Feilds: Simple Delimited, Complex Delimited, Fixedwidth
- Model: types, validation and meaning from a text dump
- Files: Blobs of Binary Data Sitting on Disk
- Encoding: Streams of Characters
By far the most common format I had to deal with was basic ascii represented as 8-bit characters, which
is in fact, equivalent to Latin-X. The next most common was Latin-15 (ISO-8859-15), allowing for other
common western language’s extended character sets.- ASCII
- utf-8
- Latin-X
- EBCDIC
- First recognition of structure: lines or records (record delmiter / fixed record size)
- record delimiters are not consistent
- record delimiters are embedded in the data
- sometimes there are no record separators
- Second recognition of structure: fields / columns
- Headers may not be present
- this means you have to specify them as part of the descriptor for the source
- Headers may not be present
- Keys
- Surprise!: keys
- sometimes not present (!)
- sometimes documentation is ‘wrong’
- specify the key they used
- but file is a de-normalized dump of primary source
- key no longer unique
- complicates your import
- simple analyiss tricks for determining the uniqueness of a key column
- Simple analysis tricks and tools for getting a feel for the data
- cut | sort | uniq -c | wc -l
- cut | sort | uniq -c | sort -nr
- abtab_view
- abtab_historgram
- Encoding: Streams of Characters
- Developing a Taxonomy and A Descriptor Schema For Files that represent Tables
- One Column must become Many (hard)
- Many Columns must become One (easier)
- Data type must change:
- numeric: fairly straightforward
- enumerated values
- coding tables
- did the vendor supply one?
- does their data actualy match the documentation? (extra / missing / unused values?)
- strategies for recognizing what’s present
- strategies for handling the mapping
- strategies for modeling your own standardized lookup tables
- be as specific as possible, things are much easier on you if your LU tables are a super-set of all your inputs
- or if not, in addition to mapping, pass-through the original data for re-analysis later
- Computed Values, precision, accuracy
- searching to determining the min/max/avg precision
- currency
- lon/lat
- Data Models
- Each file is a table
- Join key as single column
- Join key as tuple (multiple columns)
- Join key as computed composite
- Suprise!: 1-1 table relations based on record position (ugh)
- The great tragedy : loss of data
- as you transform, be prescient of what you discard
- you can’t get it back
- if space is cheap for you, carry forward, or better yet deferr
- Consolidation
Talk about how much data we used to process
- >2k+ Different Relational Data Models
- Tens of Thousands of Snapshots
- billions of records
- all for ~5 million ‘entities’
