How To Index a Data Set
- Files should be indexed in-place so the correct path is recorded.
- The full, absolute path to each file is stored by default regardless of how the file is specified. This behavior can be modified using the -kp (keep path) option.
- Files should be indexed whenever they are created or replaced in a data set.
- Index entries must be deleted via SQL if an archive file is removed.
- The exit value of mseedindex should be checked to verify that indexing was successful: the program exits with a value of 0 if everything was successful and non-zero otherwise.
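The exit-value check is easy to script. A minimal sketch in Python, assuming mseedindex is on the PATH (the function name and file names here are illustrative, not part of mseedindex):

```python
import subprocess

def index_files(db_path, files, mseedindex="mseedindex"):
    """Run mseedindex on the given files and return True only if it
    exited with 0, i.e. the indexing was successful."""
    result = subprocess.run([mseedindex, "-sqlite", db_path, *files])
    return result.returncode == 0

# Hypothetical usage:
# if not index_files("timeseries.sqlite", ["day1.mseed", "day2.mseed"]):
#     raise SystemExit("indexing failed")
```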
Building a time series index using SQLite
In the most straightforward scenario, indexing the miniSEED in one or more files is as simple as:
$ mseedindex -sqlite timeseries.sqlite <miniSEED files>
This creates an SQLite database file named timeseries.sqlite (if needed) and a table with the default name of tsindex (and associated indexes, if needed), then scans all specified files.
If a file is changed, simply run the command again on that file and all references to it in the index table will be updated. See the FILE VERSIONING section in the manual for a description of how to avoid race conditions while simultaneously updating data files and extracting data.
For loading a large volume of data, the following sequence is recommended:
- (Optional) Turn on WAL mode for more efficiency when loading a huge volume of data, especially if scanning files in parallel. Using the SQLite command line tool, issue this statement:
  sqlite3 timeseries.sqlite 'PRAGMA journal_mode=wal;'
- Index each target file with the -noup option; for each file:
  mseedindex -sqlite timeseries.sqlite -noup miniSEEDfile
- (Optional) Create the summary table, useful for optimizing queries to large index tables. Using the SQLite command line tool, issue these statements (assuming the index table name is tsindex):
  DROP TABLE IF EXISTS tsindex_summary;
  CREATE TABLE tsindex_summary AS
    SELECT network, station, location, channel,
           min(starttime) AS earliest, max(endtime) AS latest,
           datetime('now') AS updt
    FROM tsindex GROUP BY 1,2,3,4;
- (Optional) Run ANALYZE on the index table to optimize table index usage. This can take some time. Using the SQLite command line tool, issue this statement:
  sqlite3 timeseries.sqlite 'ANALYZE tsindex;'
- (Optional) If you turned on WAL mode, you may wish to turn it back off for compatibility with older versions of SQLite. It should not affect SELECT query efficiency. Using the SQLite command line tool, issue this statement:
  sqlite3 timeseries.sqlite 'PRAGMA journal_mode=delete;'
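If the SQLite command line tool is not available, the same pragmas can be issued from any SQLite client. A sketch using Python's built-in sqlite3 module (the function name is illustrative; database and table names are the ones used above):

```python
import sqlite3

def set_journal_mode(db_path, bulk_load):
    """Switch to WAL journaling for bulk loading, or back to the default
    rollback journal ('delete'), and return the active mode."""
    conn = sqlite3.connect(db_path)
    mode = conn.execute(
        "PRAGMA journal_mode=%s;" % ("wal" if bulk_load else "delete")
    ).fetchone()[0]
    conn.close()
    return mode

# After loading, ANALYZE can be issued the same way:
# sqlite3.connect("timeseries.sqlite").execute("ANALYZE tsindex;")
```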
The -noup option instructs the program not to perform updates. If the files to be scanned are known to be new to the index, this is more efficient because it avoids searching for and replacing existing entries for each file.
The summary table is a useful addition to very large index tables to optimize access. This table is a summary with a row for each distinct network, station, location, and channel, with the earliest and latest data times for each. Such a table is usually much smaller than the main index table and can be used to expand wildcards and reduce any given query to the main index table to entries known to be present. The fetchIndexInfo.py script included in the repository illustrates how such a table can be used.
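To illustrate how a summary table narrows a query before the large index table is touched, here is a self-contained sketch using Python's sqlite3 module with made-up sample rows (the real fetchIndexInfo.py does considerably more):

```python
import sqlite3

# In-memory database with a toy summary table and hypothetical entries.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tsindex_summary (
        network TEXT, station TEXT, location TEXT, channel TEXT,
        earliest TEXT, latest TEXT, updt TEXT);
    INSERT INTO tsindex_summary VALUES
        ('IU', 'ANMO', '00', 'BHZ', '2010-01-01', '2020-01-01', ''),
        ('IU', 'COLA', '00', 'BHZ', '2012-01-01', '2020-01-01', ''),
        ('XX', 'TEST', '',   'HHZ', '2015-01-01', '2016-01-01', '');
""")

def expand(network, station):
    """Expand GLOB-style wildcards against the small summary table,
    yielding only channels known to exist in the main index table."""
    return conn.execute(
        "SELECT network, station FROM tsindex_summary "
        "WHERE network GLOB ? AND station GLOB ? ORDER BY station",
        (network, station)).fetchall()

print(expand("IU", "*"))  # matches ANMO and COLA, but not XX.TEST
```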
Building a time series index using PostgreSQL
If using PostgreSQL, the time series index table must already exist. The table and useful indexes can be created using the statements described in the Database Schema documentation.
Once the table exists, indexing data is the same as described for SQLite except that the -pghost option must be specified to identify the database host. Other options to specify the port, database name, user, and password are -dbport, -dbname, -dbuser, and -dbpass respectively.
If a file in a data set is removed, it should be removed from the index as well. This must be done via SQL with a statement similar to:
DELETE FROM timeseries.tsindex WHERE filename='<Filename>'
Ideally a file would be removed from the index prior to being removed from the file system, to avoid inconsistencies between the index and the files actually present.
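For the SQLite case, the index delete and the file removal can be combined in one step, in the order recommended above. A sketch using Python's sqlite3 module (the function name is illustrative, and the table name is assumed to be the default tsindex):

```python
import os
import sqlite3

def remove_from_dataset(db_path, filename, table="tsindex"):
    """Delete a file's index entries first, then remove the file itself,
    so the index never references a file that is already gone."""
    conn = sqlite3.connect(db_path)
    with conn:
        # Table names cannot be bound parameters; 'table' must be trusted.
        conn.execute("DELETE FROM %s WHERE filename = ?" % table,
                     (filename,))
    conn.close()
    if os.path.exists(filename):
        os.remove(filename)
```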