Skip to content

How To Index Data Set

Chad Trabant edited this page May 12, 2017 · 15 revisions

Guidelines for indexing

  • Files should be indexed in-place so the correct path is recorded.
  • The full, absolute path to files is stored by default regardless of how the file is specified. This behavior can be optionally modified using the -kp (keep path) option.
  • Files should be indexed whenever they are created or replaced in a data set.
  • Index entries must be deleted via SQL if an archive file is removed.
  • The exit value of mseedindex should be checked to be sure the indexing was successful.

The mseedindex program exits with a value of 0 if everything was successful and non-zero otherwise.

Building a time series index using SQLite

In the most straightforward scenario indexing miniSEED in one or more files is as simple as:

$ mseedindex -sqlite timeseries.sqlite <miniSEED files>

This would create an SQLite database file named timeseries.sqlite if needed, a table with the default name of tsindex (and associated indexes) if needed and scan all specified files.

If files are changed, just run the command again on the file and all references in the index table will be updated. See FILE VERSIONING section in the manual for a description of how to avoid race conditions while simultaneously updating data files and extracting data.

Steps for bulk indexing new data

  1. (Optional) Turn on WAL mode for for more efficiency if loading a huge volume of data, especially if scanning files in parallel. Using the SQLite command line tool issue this statement:

    sqlite timeseries.sqlite 'PRAGMA journal_mode=wal;'

  2. Index each target file with -noup1 option:

    for each file:
       mseedindex -sqlite timeseries.sqlite -noup miniSEEDfile
    
  3. (Optional) Create the summary2 table, useful for optimizing queries to large index tables. Using the SQLite command line tool issue these statements (assuming the index table name is tsindex):

    DROP TABLE IF EXISTS tsindex_summary;
    CREATE TABLE tsindex_summary AS
       SELECT network,station,location,channel,
       min(starttime) AS earliest, max(endtime) AS latest, datetime('now') as updt
       FROM tsindex
       GROUP BY 1,2,3,4;
    
  4. (Optional) Run ANALYZE on index table to optimize table index usage. This can take some time. Using the SQLite command line tool issue this statement:

    sqlite timeseries.sqlite 'ANALYZE tsindex;'

  5. (Optional) If you turned on WAL mode, you may wish to turn it back off for compatibility with older versions of SQLite. It should not effect SELECT query efficiency. Using the SQLite command line tool issue this statement:

    sqlite timeseries.sqlite 'PRAGMA journal_mode=delete;'

1. About -noup

The -noup option instructs the program to not perform updates. If the files to be scanned are known to be new to the index this is more efficient as it avoids the need to search for and replace existing entries for a given file.

2. About summary

The summary table is a useful addition to very large index tables to optimize access. This table is a summary with a row for each distinct network, station, location and channel with earliest and latest data times for each. Such a table is usually much smaller than the main index table and can be used to expand wildcards and reduce any given query to the main index table to entries known to be present. The fetchIndexInfo.py script included in the repository illustrates how such a table can be used.

Building a time series index using PostgreSQL

If using PostgreSQL, the time series index table must already exist. The table and useful indexes can be created using the statements described in the Database Schema documentation.

Once the table exists, indexing data is the same as described for SQlite except that the -pghost option must be specified to identify the database host. Other options to specific port, database name, user and password are -dbport, -dbname, -dbuser and -dbpass respectively.

Removing data from the index

If a file in a data set is removed it should be removed from the index as well. This must be done via SQL with a statement similar to:

DELETE FROM timeseries.tsindex WHERE filename='<Filename>'

Ideally a file would be removed from the index prior to being removed from the file system to avoid inconsistencies between the index and existing files.