New medialib backend

This page tries to summarize all the requirements for the new medialib backend and possible implementation notes.

How is the medialib used?

We have a couple of common operations in the medialib that the new backend should be optimized for:

Inserting new entries and properties, both bulk and single.
Getting all properties for one entry
Get a list of entries that contains property X and more complex queries like property X && property Y || property Z
Storage of Collections

It should also be:

portable
light dependency
self-contained
scalable, we should aim for full speed at around 70.000 - 100.000 songs in the db.

Current Implementation (SQLite)

How it's done

The schema is very simple and straightforward, each property is a row.

inserting new entries and properties is just about inserting a new row or multiple rows.
select key, value from Media where id = X. Select all key,value pair for a ID number.
when you want to get a list of entries that contains property X with ANDs and ORs it gets really complicated it means you need a lot of joins and subselects to do it in one question.
this is done in a separate table.

Pro's with this implementation

Make use of a defined standard SQL
Self-contained code in sqlite library
Easy to understand the simple operations (insert and getting all properties for one entry)
Mature

Con's with this implementation

SQLite have scalability problems with so many rows in the database, it easily get 10 rows per entry. so if you have 10000 songs you get around 100000 rows
Implementation for 3 - Getting a list of entries with complex query is very slow and not easy to overview
Inserting entries in bulk seems to slow down SQLite

Possible improvements

SQLite is designed from the ground up to support data in a relational model, but at this point it's basically used as a very, very complex hashtable. Vastly improved performance and simplicity can be achieved using an SQLite database simply by employing the relational paradigm - that is, one row corresponds to one song, and contains the necessary data in columns. Although this reduces the performance on queries for non-standard tags, they're non-standard tags. We shouldn't be sacrificing performance and scalability (and violate all kinds of good practice) to fully support them. What good is ensuring that corner cases are as simple as normal cases, when the means of doing that is to make simple cases as complex as corner cases? Srobertson 03:11, 6 February 2008 (CET)

S4 implementation

S4 is the working name for a new database backend that Anders has been thinking about. S4 stands for Short String Storage System (iirc) (I think it could be "Storage System for/4 Short Strings" aswell). The idea is to optimize the system for short string storage which SQL isn't optimized for. The idea is to store each entry as a Object and each object has properties.

It tries to address a few issues:

There is massive replication of strings. If you have 100 "Metallica" songs in your medialib the string Metallica is duplicated in 100 entries.
The structure that an album/artist/genre contains songs is implicit, attributes can only be set on songs, not on albums/artists.

Each entry is just a set of triplets. There are just two triplet-types (in the form subject-verb-object). Entry-HasProperty-Property, and Entry-Contains-Entry. This is called Entity-Attribute-Value model as employed for example by RDF.

A property is just a string-pair. Strings could(/should) be stored in a stringtable. Meaning that strings just are an integer, hence a string-pair is just an integer-pair, also known as an interer of twice bit-width.

Entry = [(HasProperty, (stringnum("url"), stringnum("file:///tmp/x.mp3"))), (HasProperty, (stringnum("title"), stringnum("Korv"))), (HasProperty, (stringnum("type"), stringnum("song"))]

Entry = [(HasProperty, (stringnum("album"), "Korving songs EP"))), (HasProperty, (stringnum("type"), stringnum("album"))), (Contains, songentry1), (Contains, songentry2), (Contains, songentry3)]

S4 should know (be told at startup) about the structure (ontology) so xforms setting properties doesn't have to know about it, they just do prop_set("album", "...and Justice for all") and S4 will take care of the rest.

A prototype implementation was realized by Anders to showcase some ideas for data structure and algorithms.