# 4. Implementing Full-Text Search (FTS5)

## Virtual Tables in SQLite

A virtual table is an *object that is registered with an open SQLite database connection*. 

From the perspective of an SQL statement, the virtual table object looks like any other table or view. But behind the scenes, queries and updates on a virtual table invoke callback methods of the virtual table object instead of reading and writing on the database file.

The virtual table mechanism allows an application to **publish interfaces** that are accessible from SQL statements as if they were tables. 

SQL statements can do almost anything to a virtual table that they can do to a real table, with the following exceptions:

* One cannot create a trigger on a virtual table.
* One cannot create additional indices on a virtual table. (Virtual tables can have indices but that must be built into the virtual table implementation. Indices cannot be added separately using CREATE INDEX statements.)
* One cannot run ALTER TABLE ... ADD COLUMN commands against a virtual table.
* Individual virtual table implementations might impose additional constraints. For example, some virtual implementations might provide read-only tables. Or some virtual table implementations might allow INSERT or DELETE but not UPDATE. Or some virtual table implementations might limit the kinds of UPDATEs that can be made.

A virtual table might represent an in-memory data structure. Or it might represent a view of data on disk that is not in the SQLite format. Or the application might compute the content of the virtual table on demand.
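
The restrictions listed above can be checked directly. The following is a minimal sketch using Python's built-in `sqlite3` module and an in-memory database, assuming the local SQLite build includes the FTS5 module. The table name `docs` and the labels in `restricted` are made up for illustration.

```python
import sqlite3

# In-memory database; assumes the linked SQLite library was built with FTS5.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")

# Statements that work on ordinary tables but should be refused here.
restricted = {
    "trigger": "CREATE TRIGGER docs_ai AFTER INSERT ON docs BEGIN SELECT 1; END",
    "index": "CREATE INDEX docs_body_idx ON docs(body)",
    "add column": "ALTER TABLE docs ADD COLUMN extra",
}

rejected = {}
for label, sql in restricted.items():
    try:
        db.execute(sql)
        rejected[label] = False
    except sqlite3.Error as exc:
        # SQLite reports why the statement is not allowed on a virtual table.
        rejected[label] = True
        print(f"{label}: {exc}")
```

Each of the three statements should raise an error instead of modifying the virtual table.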

## FTS5

In [None]:
import json
import pandas as pd
from sqlite3 import connect

DB_PATH = '../sqlite-olt.db'

In [None]:
with connect(DB_PATH) as db:
    comments_text_df = pd.read_sql("""
                    select 
                    json_extract(data, '$.objectID') as objectID,
                    json_extract(data, '$.author') as author,
                    json_extract(data, '$.comment_text') as comment_text,
                    json_extract(data, '$._tags') as tags,
                    length(json_extract(data, '$.comment_text')) as comment_text_length
                    from hn_items_raw
                    where comment_text notnull and tags notnull
                    """,db
                    )

In [None]:
comments_text_df['comment_text'][0]

FTS5 is an SQLite virtual table module that provides full-text search functionality to database applications. In their most elementary form, full-text search engines allow the user to efficiently search a large collection of documents for the subset that contain one or more instances of a search term. The search functionality provided to World Wide Web users by Google is, among other things, a full-text search engine, as it allows users to search for all documents on the web that contain, for example, the term "fts5".

To use FTS5, the user creates an FTS5 virtual table with one or more columns. For example:

    CREATE VIRTUAL TABLE email USING fts5(sender, title, body);

It is an error to add types, constraints or PRIMARY KEY declarations to a CREATE VIRTUAL TABLE statement used to create an FTS5 table. Once created, an FTS5 table may be populated using INSERT, UPDATE or DELETE statements like any other table. Like any other table with no PRIMARY KEY declaration, an FTS5 table has an implicit INTEGER PRIMARY KEY field named rowid.
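
As a small illustration of the points above, the sketch below creates the `email` table in an in-memory database, inserts two rows, and reads back the implicit `rowid`. It then tries to declare a column type, which FTS5 is expected to reject. The sample rows and the table name `typed` are invented for this example, and the code assumes an SQLite build with FTS5 enabled.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE email USING fts5(sender, title, body)")

# Populate the table with ordinary INSERT statements (sample data is made up).
db.executemany(
    "INSERT INTO email(sender, title, body) VALUES (?, ?, ?)",
    [
        ("alice@example.com", "Lunch", "Are you free at noon?"),
        ("bob@example.com", "FTS5 notes", "The fts5 module ships with SQLite."),
    ],
)

# The implicit rowid behaves like an INTEGER PRIMARY KEY.
rowids = [row[0] for row in db.execute("SELECT rowid FROM email ORDER BY rowid")]
print(rowids)

# Declaring a type on an FTS5 column should fail.
try:
    db.execute("CREATE VIRTUAL TABLE typed USING fts5(sender TEXT)")
    typed_ok = True
except sqlite3.OperationalError as exc:
    typed_ok = False
    print("rejected:", exc)
```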

In [None]:
with connect(DB_PATH) as db:
    # Recreate the FTS5 table from scratch so this cell can be re-run safely.
    db.execute("""
    DROP TABLE IF EXISTS comments_fts;
    """)
    # FTS5 columns are declared by name only: no types or constraints.
    db.execute("""
    CREATE VIRTUAL TABLE comments_fts USING fts5(objectID, author, comment_text);
    """)

In [None]:
with connect(DB_PATH) as db:
    comments_text_df[['objectID', 'author', 'comment_text']].to_sql('comments_fts', db, if_exists='append', index=False)

Once populated, there are three ways to execute a full-text query against the contents of an FTS5 table:

* Using a MATCH operator in the WHERE clause of a SELECT statement, or
* Using an equals ("=") operator in the WHERE clause of a SELECT statement, or
* Using the table-valued function syntax.


If using the MATCH or = operators, the expression to the left of the MATCH operator is usually the **name of the FTS5 table** (the exception is when specifying a column-filter). 
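
The three query forms can be compared side by side. This sketch reuses the `email` table from the earlier example with made-up rows in an in-memory database (assuming FTS5 is available) and runs the same search term through each form.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE email USING fts5(sender, title, body)")
db.executemany(
    "INSERT INTO email(sender, title, body) VALUES (?, ?, ?)",
    [
        ("alice@example.com", "Lunch", "Are you free at noon?"),
        ("bob@example.com", "FTS5 notes", "The fts5 module ships with SQLite."),
        ("carol@example.com", "Re: FTS5 notes", "Thanks, very useful."),
    ],
)

# The same full-text query written in the three supported forms.
queries = {
    "MATCH": "SELECT rowid FROM email WHERE email MATCH 'fts5' ORDER BY rowid",
    "=": "SELECT rowid FROM email WHERE email = 'fts5' ORDER BY rowid",
    "table-valued": "SELECT rowid FROM email('fts5') ORDER BY rowid",
}
results = {
    name: [row[0] for row in db.execute(sql)] for name, sql in queries.items()
}
print(results)
```

All three forms should return the same rows; which one to use is mostly a matter of style.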

We usually search the **whole table**:

In [None]:
with connect(DB_PATH) as db:
    search_df = pd.read_sql("""
                    select *
                    from comments_fts
                    where comments_fts MATCH 'bane'
                    """,db
                    )
search_df

In [None]:
search_df.iloc[85]['comment_text']

We can, of course, restrict the search to a specific column by using the column name on the left-hand side of `MATCH`:

In [None]:
%%timeit
with connect(DB_PATH) as db:
    search_df = pd.read_sql("""
                    select *
                    from comments_fts
                    where comment_text MATCH 'SQLite'
                    """,db
                    )
search_df

For comparison, here is the same search written with `LIKE`, which has to scan every row instead of using the full-text index:

In [None]:
%%timeit
with connect(DB_PATH) as db:
    search_df = pd.read_sql("""
                    select *
                    from comments_fts
                    where comment_text LIKE '%SQLite%'
                    """,db
                    )
search_df
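
Besides speed, `MATCH` and `LIKE` also differ in what they match: `MATCH` compares whole tokens produced by the tokenizer, while `LIKE '%...%'` matches any substring. A minimal sketch of that difference, using an invented `notes` table in an in-memory database and assuming the default FTS5 tokenizer:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE notes USING fts5(body)")
db.executemany(
    "INSERT INTO notes(body) VALUES (?)",
    [
        ("SQLite is embedded",),          # rowid 1
        ("plain SQL works everywhere",),  # rowid 2
        ("NoSQL stores differ",),         # rowid 3
    ],
)

# MATCH looks for the token "sql"; "SQLite" and "NoSQL" are different tokens.
match_ids = [
    row[0]
    for row in db.execute(
        "SELECT rowid FROM notes WHERE body MATCH 'sql' ORDER BY rowid"
    )
]

# LIKE finds the substring "sql" anywhere in the text.
like_ids = [
    row[0]
    for row in db.execute(
        "SELECT rowid FROM notes WHERE body LIKE '%sql%' ORDER BY rowid"
    )
]
print(match_ids, like_ids)
```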

We can also order the results by relevance, using the hidden `rank` column:

In [None]:
with connect(DB_PATH) as db:
    search_df = pd.read_sql("""
                    select *
                    from comments_fts
                    where comment_text MATCH 'SQLite'
                    order by rank
                    """,db
                    )
search_df
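
To see what `ORDER BY rank` does, it helps to look at the `rank` values themselves. In the sketch below (an invented `notes` table in an in-memory database, assuming FTS5 with its default ranking function), better matches get smaller, more negative values, so an ascending sort puts the most relevant row first.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE notes USING fts5(body)")
db.executemany(
    "INSERT INTO notes(body) VALUES (?)",
    [
        # rowid 1: one mention in a longer text
        ("sqlite is one of several embedded databases that we evaluated last year",),
        # rowid 2: three mentions in a short text
        ("sqlite tips: tune sqlite, back up sqlite",),
        # rowids 3-5: no mention at all
        ("postgres handles concurrent writers well",),
        ("redis keeps everything in memory",),
        ("duckdb targets analytical queries",),
    ],
)

ranked = db.execute(
    "SELECT rowid, rank FROM notes WHERE notes MATCH 'sqlite' ORDER BY rank"
).fetchall()
for rowid, rank in ranked:
    print(rowid, rank)
```

The short row that mentions the term three times should come before the long row that mentions it once.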

Auxiliary functions can be used to retrieve extra information regarding the matched row. 

For example, the `highlight()` auxiliary function returns a copy of a column value for a matched row with all instances of the matched term surrounded by HTML `<b></b>` tags. Its second argument is the zero-based index of the column to highlight.

In [None]:
with connect(DB_PATH) as db:
    search_df = pd.read_sql("""
                    SELECT highlight(comments_fts, 2, '<b>', '</b>') as matches
                    FROM comments_fts
                    WHERE comment_text MATCH 'SQLite AND redis'
                    """,db
                    )
search_df

In [None]:
search_df['matches'][1]
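
The effect of `highlight()` is easier to see on a single short row. This sketch uses the `email` table layout from earlier with one made-up message in an in-memory database; column index 2 refers to `body`.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE email USING fts5(sender, title, body)")
db.execute(
    "INSERT INTO email(sender, title, body) VALUES (?, ?, ?)",
    ("bob@example.com", "Storage", "I use SQLite daily and sqlite never fails"),
)

# Column 2 is `body` (columns are numbered from 0).
marked = db.execute(
    "SELECT highlight(email, 2, '<b>', '</b>') FROM email WHERE body MATCH 'sqlite'"
).fetchone()[0]
print(marked)
```

Both occurrences are wrapped, regardless of case, because the default tokenizer folds case before matching.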

### Searching for strings

In [None]:
with connect(DB_PATH) as db:
    search_df = pd.read_sql("""
                    SELECT highlight(comments_fts, 2, '<b>', '</b>') as matches
                    FROM comments_fts
                    WHERE comment_text MATCH '"database system"'
                    """,db
                    )
search_df

In [None]:
search_df.matches[0]
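
Inside a query, a string can be written either as a bareword or enclosed in double quotes. Quoting matters as soon as the text contains characters that are not allowed in barewords, such as a dot. A minimal sketch, with an invented `posts` table in an in-memory database and the default tokenizer assumed:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE posts USING fts5(body)")
db.executemany(
    "INSERT INTO posts(body) VALUES (?)",
    [
        ("we run node.js in production",),  # rowid 1
        ("the node crashed",),              # rowid 2
    ],
)

# Unquoted: the dot is not valid in a bareword, so FTS5 reports a syntax error.
try:
    db.execute("SELECT rowid FROM posts WHERE posts MATCH 'node.js'").fetchall()
    bare_error = False
except sqlite3.OperationalError as exc:
    bare_error = True
    print("bareword query failed:", exc)

# Quoted: the string is tokenized and searched as the phrase "node js".
quoted_ids = [
    row[0]
    for row in db.execute(
        """SELECT rowid FROM posts WHERE posts MATCH '"node.js"' ORDER BY rowid"""
    )
]
print(quoted_ids)
```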

### Searching for phrases

In [None]:
with connect(DB_PATH) as db:
    search_df = pd.read_sql("""
                    SELECT highlight(comments_fts, 2, '<***>', '</***>') as matches
                    FROM comments_fts
                    WHERE comment_text MATCH 'Redis + sqlite'
                    """,db
                    )
search_df

In [None]:
search_df['matches'][10]
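
The `+` operator concatenates two phrases into one longer phrase, so `'Redis + sqlite'` only matches rows where the two tokens appear next to each other, in that order. Leaving the `+` out gives an implicit AND instead. The sketch below contrasts the two on an invented `posts` table in an in-memory database.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE posts USING fts5(body)")
db.executemany(
    "INSERT INTO posts(body) VALUES (?)",
    [
        ("redis sqlite benchmark results",),    # rowid 1: tokens are adjacent
        ("sqlite is slower than redis here",),  # rowid 2: both present, not adjacent
    ],
)

def search(expr):
    """Return the rowids matching an FTS5 query expression."""
    sql = "SELECT rowid FROM posts WHERE posts MATCH ? ORDER BY rowid"
    return [row[0] for row in db.execute(sql, (expr,))]

phrase_plus = search("redis + sqlite")      # concatenated phrase
phrase_quoted = search('"redis sqlite"')    # same phrase, written as one string
implicit_and = search("redis sqlite")       # both tokens anywhere in the row
print(phrase_plus, phrase_quoted, implicit_and)
```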

### Prefix queries

In [None]:
with connect(DB_PATH) as db:
    search_df = pd.read_sql("""
                    SELECT highlight(comments_fts, 2, '<***>', '</***>') as matches
                    FROM comments_fts
                    WHERE comment_text MATCH 'stats*'
                    """,db
                    )
search_df

In [None]:
search_df['matches'][10]
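
A trailing `*` turns the last token of a phrase into a prefix: the query matches any token that starts with it. The following sketch shows how the choice of prefix changes the result set, again with an invented `pages` table in an in-memory database.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE pages USING fts5(body)")
db.executemany(
    "INSERT INTO pages(body) VALUES (?)",
    [
        ("stats dashboard",),        # rowid 1
        ("statistics homework",),    # rowid 2
        ("static site generator",),  # rowid 3
        ("status page",),            # rowid 4
    ],
)

def search(expr):
    """Return the rowids matching an FTS5 query expression."""
    sql = "SELECT rowid FROM pages WHERE pages MATCH ? ORDER BY rowid"
    return [row[0] for row in db.execute(sql, (expr,))]

stats_prefix = search("stats*")  # only tokens starting with "stats"
stat_prefix = search("stat*")    # stats, statistics, static, status
stati_prefix = search("stati*")  # statistics, static
print(stats_prefix, stat_prefix, stati_prefix)
```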