Home

RinkeHoekstra edited this page Nov 7, 2012 · 4 revisions
Clone this wiki locally

TabLinker documentation

Introduction

TabLinker is a conversion tool that transforms annotated Excel spreadsheets into RDF. More specifically, it deals with census tables which have been represented digitally using the Excel format. In order to convert Excel spreadsheets into RDF, specific statements regarding the structure of the data inside the spreadsheets need to be done. This document is an informal description of the conversion process being performed behind-the-scenes; hence, a general answer for the question “how does the TabLinker work”?

Please, have in mind that general behaviour depicted here may vary, as TabLinker is a work in progress tool.

Input files

Excel styles

Input files are read from the /input directory of TabLinker’s distribution. An input file shall be an annotated Excel spreadsheet.

In the scope of TabLinker, by ‘annotated’ Excel files we mean spreadsheets in which all relevant cells (i.e. all cells that shall be converted into RDF) have been applied an Excel style. This cell styling is a feature from Excel, and therefore Excel must be used to apply styles in order to let TabLinker know what cells have to be processed. However, these styles are not arbitrary: TabLinker only recognizes a special set of styles, each of them denoting one special type of cell. It is very important to mark the input files using these styles appropriately: depending on the applied style, cells will be converted into RDF in one way or another.

There are eight different styles recognized by TabLinker:

  1. TL Title :: The cell containing the title of a sheet
  2. TL Data :: A cell that contains data, e.g. a number for the population size
  3. TL ColHeader :: Used for the headers of columns
  4. TL RowHeader :: Used for row headers
  5. TL HRowHeader :: Hierarchical header, used for multi-column row headers with subsumption/taxonomic relations between the values of the columns
  6. TL RowProperty :: Typically used for the header cells directly above RowHeader or HierarchicalRowHeader cells. Cell values are the properties that relate Data cells to RowHeader and HierarchicalRowHeader cells.
  7. TL RowLabel :: Used for cells that contain a label for one of the HierarchicalRowHeader cells in the same row.
  8. TL Metadata :: Currently not used/testing purposes.

Census spreadsheet model

Typically, in the census spreadsheets one may find the following structure:

The “province” and “city” variables may differ (e.g. in an occupation census additional informational, such as the occupation name or classes, is considered).

Table title Variable V1 Variable V1 Variable V1 Variable V1
Sub-variable (partition) SV1 Sub-variable (partition) SV1 Sub-variable (partition) SV2 Sub-variable (partition) SV2
Province City M W M W
NH Amsterdam 1 2 3 4
Haarlem 5 6 7 8
ZH Amsterdam 9 10 11 12

Title cell

This is a single cell containing the title of the table. It should be styled using the TL Title style. It is used by TabLinker to generate consistent URI namespaces for the RDF exports.

Column/variable cells

This is a set of cells typically denoting variables of interest of the census. They are usually located as column headers of the tables (in the example above, those cells named ‘Variable name’, ‘Sub-variable (partition)’ and ‘M’ or ‘W’). These cells depict a partition of some variable of socio-historical interest: for example, an instance of ‘Variable name’ could be ‘Age’, ‘Sub-variable (partition) 1’ could be ‘less or equal than 70’, and ‘Sub-variable (partition) 2’ could be ‘more than 70’. Finally, there is a distinction between men (‘M’) and women (‘W’). This hierarchy tree can increase or decrease its complexity (i.e. its range goes from one single, non partitioned variable name, to several layer partitions like above) among different spreadsheets.

Following the previous exmaple, one can read the spreadsheet from a column point of view. For instance, an statement such ‘there are 5 men, less or equal than 70 years (of age)’ can be made this way, denoting the size of a partition of the total population described in the census that match the variable criteria. This statement is obviously incomplete, since the left hand side of the table is not being considered, but let’s forget about this for now.

These cells are marked with the TL ColHeader style.

Row/attribute value cells

In the left hand side of the census tables we usually find valuable attribute data, which helps describing concisely additional information about the population sizes we can find reading the variable cells. These attribute value cells may vary according to the type of the census, and may contain rows denoting names of occupations, cities, provinces, social classes, etc. In the example above, ‘NH’, ‘Amsterdam’, ‘Haarlem’, ‘ZH’ and ‘Amsterdam’ (again) are attribute value cells.

Now, a reading considering both variable cells (column headers) and attribute cells (row headers) can be made. For instance, we can make an statement like ‘there are 5 men, less or equal than 70 years (of age) with attribute values of Haarlem, NH’.

These cells are marked with the TL RowHeader or TL HRowHeader styles, depending on the existence (or not) of hierarchies or taxonomies between the rows themselves (in this case, there is a hierarchical relation between a province and the cities it contains).

Property cells

Property cells are, in some sense, column header cells as well. However, the columns they entitle do not contain sizes of puplation (i.e. numbers), but names of attributes relevant to the data (cities, occupation names, social classes, etc.). For this reason, they are treated as a separate type of cells, which we call property cells. In our example above, there are two property cells, with values ‘Province’ and ‘City’.

These property cells are very useful to fill the ‘predicate’ part of a statement about the data. For example, property cells in the table above let TabLinker know that ‘Haarlem’ is a city. Now, our previous example can be enriched with this semantical awareness: ‘there are 5 men, less or equal than 70 years (of age) in Haarlem, which is a /city/ inside the province of NH’. As we can see, property cells are basic for giving an accurate description of the data.

These cells are marked with the TL RowProperty style.

Data cells

Finally, data cells contain the counts of population, and they are located in the spreadsheets according to the variables and attributes they describe, usually in the right-bottom side, filling the vast majority of cells. In the example above, their values range from 1 to 12.

Data cells allow the very first part of our example statement: ’there are 5 men, less or equal than 70 years (of age) in Haarlem, which is a/city inside the province of NH’.

These cells are marked with the TL Data style.

Output files

Outout files from TabLinker contain the same kind of statements made in the previous sections, in RDF format (Turtle serialization). This means that these statements, instead of being written in natural language, like they are above, are written in a triple format that is very suitable for machines to process.

Triples are the data model used in RDF datsets. This is not a semantic web course, so for the moment should be enough to state that triples are ‘expressions of the form subject-predicate-object’. They are similar to sentences, in the sense that allow us to express (very little pieces of) information: for instance, the triple <Charles hasTel 0655555555> (subject is ‘Charles’, predicate is ‘hasTel’, and object is ‘0655555555’) could be read in natural language as ‘The telephone number of Charles is 0655555555’.

In the context of RDF datasets, controlled vocabularies (especially those describing predicate values) are very important, because they facilitate some kind of ‘standard’ language spoken by all datasets that use them. There is a special vocabulary for expressing statistical data, which is the case for data contained in the census, which is Data Cube vocabulary. This vocabulary provides all predicates necessary for making statemens like ‘has the value of’ or ‘has the variable name of’.

For example, a suitable simplified set of triples depicting the ‘5 men of Haarlem’ before could be this one:

<x hasPopulation 5> <x city ‘Haarlem’> <x province ‘NH’> <x hasDimension ‘M’> <x hasDimension ‘less or equal than 70’> <x hasDimension ‘Age’>

‘x’ is a genericname TabLinker gives to a cell.

TabLinker performs this export the following way: given the cell ‘x’, the first statement that can be done is its value (which equals 5). Then, it goes throught the row until the the left hand side of the table, finding the attribute value cells and the property cells; this way, the following two statements about ‘Haarlem’ and ‘NH’ can be made (note that the property cells provide the predicate names ‘city’ and ‘province’). Finally, it climbs up until the variable cells (column headers) in the table, matching all the cells that apply for the cell ‘x’.

All these statements are treated as a conjunction, which means that when a query processor reads this data, it takes into account that cell ‘x’ matches the first triple, and the second, and the third,… and so on until the last of them.

Discussion

One question that remains in the air is: “is there always a partition of the overall population?” In other words: given a person inside a count (for example, one of the ‘5 men of Haarlem’), does this person appear in any other count in the table? This is an important question to be answered, in order to envisage the inference capabilities one may require from the generated RDF.

Further work

TL Metadata

The TL Metadata style can be used in the future for exporting DANS notes inside the spreadsheets (and thus being able to reconstruct the same spreadsheet from the RDF data).

SQL translation

TabLinker will have an additional feature to export the resulting RDF model into a relational (SQL) database. To perform this export, a data model based on the concept of an observation will be considered. Likewise Data Cube vocabulary, each cell’s data is considered such an observation, and all additional metadata described above in this document is intended to give additional information on that observation.

Considering again that ‘x’ is a generic ID TabLinker gives to a cell, the resulting SQL export shall consist of a single table of observations with the following structure:

ID Population size Gender Age City Province
x 5 M less or equal than 70 Haarlem NH

Reversal transformation

TabLinker will have an additional feature for, given a previously generated RDF model, reconstruct the original Excel spreadsheet, preserving its data and its structure as well.

Glossary of terms

RDF
The RDF (Resource Description Framework) data model is similar to classic conceptual modeling approaches such as entity-relationship or class diagrams, as it is based upon the idea of making statements about resources (in particular Web resources) in the form of subject-predicate-object expressions. These expressions are known as triples in RDF terminology. The subject denotes the resource, and the predicate denotes traits or aspects of the resource and expresses a relationship between the subject and the object. For example, one way to represent the notion “The sky has the color blue” in RDF is as the triple: a subject denoting “the sky” , a predicate denoting “has the color”, and an object denoting “blue”. Therefore RDF swaps object for subject that would be used in the classical notation of an Entity–attribute–value model within Object oriented design; object (sky), attribute (color) and value (blue). RDF is an abstract model with several serialization formats (i.e., file formats), and so the particular way in which a resource or triple is encoded varies from format to format.
Spreadsheet
A spreadsheet displays multiple cells usually in a two-dimensional matrix or grid consisting of rows and columns (in other words, a table, hence “tabular”). Each cell contains alphanumeric text, numeric values, or formulas. A formula defines how the content of that cell is to be calculated from the contents of any other cell (or combination of cells) each time any cell is updated. A pseudo third dimension to the matrix is sometimes applied as another layer, or layers/sheets, of two-dimensional data. Spreadsheets share many principles and traits of databases, but spreadsheets and databases are not the same thing. A spreadsheet is essentially just one table, whereas a database is a collection of many tables with machine-readable semantic relationships between them. Spreadsheets are often imported into databases to become tables within them. While it is true that a workbook that contains three sheets is indeed a file containing multiple tables that can interact with each other, it lacks the relational structure of a database.
SQL
Structured Query Language is a language designed for managing data in relational database management systems (RDBMS). Originally based upon relational algebra and tuple relational calculus, its scope includes data insert, query, update and delete, schema creation and modification, and data access control.