# Building and Loading Text Search in PostgreSQL: Part 1

--- 
<a id='PG_text' ></a>

## PostgreSQL Text Storage

PostgreSQL is a powerful and flexible open-source relational database management system (RDBMS).
As you may know, it is actually an Object-Relational DBMS (ORDBMS).
Beyond these capabilities, PostgreSQL is extensible: it supports NoSQL-style features, JSON, and spatial / geospatial extensions.
There are many extensions available, and **this notebook focuses on an Information Retrieval (IR) based extension, _full text search_.**

### PostgreSQL Textual Field (column) Types

| Name                             | Description                |  
| -------------------------------- | -------------------------- |  
| character varying(n), varchar(n) | variable-length with limit |  
| character(n), char(n)            | fixed-length, blank padded |  
| text                             | variable unlimited length  |  



### From the manual

In addition, PostgreSQL provides the `text` type, which stores strings of any length. 
Although the type `text` is not in the SQL standard, several other SQL database management systems have it as well.

...


In any case, the longest possible character string that can be stored is about 1 GB. 

...

**If you desire to store long strings with no specific upper limit, use text or character varying without a length specifier, rather than making up an arbitrary length limit.**

---
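So, `text` fields have no declared size limit, per se.
In reality, a single value is capped at about 1 GB, and the underlying computer system may impose further limits.

In the details of things, large `text` values and other large field values are optimized for storage: they are compressed and/or moved out of line into a separate TOAST table, which keeps the main table rows small and accelerates relational operations on other columns.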
 * When you have spare time, [read about PostgreSQL TOASTing](https://www.postgresql.org/docs/9.5/static/storage-toast.html)

<a id='task'></a>

## Task at Hand

Building systems to access unstructured data has been a long-standing challenge in computer and information science. Luckily for this course, we have two stellar tools: **PostgreSQL** and **Python**. We have seen full text search with Python in the previous lab.


For this lab, we are going to walk through the process of creating full text search capability within PostgreSQL at the basic level. We will be using a toy dataset for this lab. 

<a id='build_it'></a>

## Building a Text Retrieval Database

<span style="color:red">
**You will need to create and load the database similarly to how you interacted with PostgreSQL in the Database and Analytics course.**
</span>

Remember a few key things:
 1. You will use your pawprint as your user name, and the password you will type in is your normal MU password.
 1. The database is: `dsa_student`
 1. The database host is: `pgsql.dsa.lan`
 1. The schema name is the same as your pawprint.

There are 3 ways to create/manipulate a database (See the Database Course): 

* Using Jupyter SQL magic function 
* Using psql console 
* Programmatic access using psycopg or SQLAlchemy


<span style="background-color:yellow">For the commands below, replace the schema name with your own pawprint.</span>

### Text Search in SQL

In the Database and Analytics course, we explored queries on text columns. For example, we used the LIKE operator.

```SQL
SELECT column_name FROM table_name WHERE column_name LIKE 'pattern';  
```

We use wildcards such as `%` (as in `LIKE 'a%'` to find values that start with "a") and `_` (as in `LIKE '_r%'` to find values that have an "r" in the second position). In PostgreSQL we can also use ILIKE to ignore case. For simple columns (e.g., name, address), this type of search may serve our purpose. But for a column that contains a full text document, pattern matching is slow because it typically has to scan the text of every row, and it knows nothing about word forms.
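
To make the wildcard semantics concrete, here is a minimal Python sketch that translates a LIKE pattern into a regular expression. The helper names `like_to_regex` and `like` are hypothetical, and the sketch ignores details such as the `ESCAPE` clause; it only illustrates how `%` and `_` behave and is not how PostgreSQL implements LIKE.

```python
import re

def like_to_regex(pattern, case_insensitive=False):
    """Translate a SQL LIKE pattern into a Python regex (toy version, no ESCAPE support)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # % matches any run of characters, including none
        elif ch == "_":
            parts.append(".")            # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is matched literally
    flags = re.DOTALL | (re.IGNORECASE if case_insensitive else 0)
    return re.compile("".join(parts), flags)

def like(value, pattern, case_insensitive=False):
    """Mimic LIKE (or ILIKE when case_insensitive=True): the whole value must match."""
    return like_to_regex(pattern, case_insensitive).fullmatch(value) is not None

print(like("apple", "a%"))                         # starts with "a"
print(like("Apple", "a%"))                         # LIKE is case-sensitive
print(like("Apple", "a%", case_insensitive=True))  # ILIKE-style match
print(like("bread", "_r%"))                        # "r" in the second position
```

Note that the pattern is applied to the raw characters of each value, so a search for "jump" would not find "jumped" unless the wildcards happen to cover it.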

A more effective way to approach this problem is by getting a semantic vector for all of the words contained in a document, that is, a language-specific representation of such words. So, when you search for a word like "jump", you will match all instances of the word and its tenses, even if you searched for "jumped" or "jumping". Additionally, you won't be searching the full document itself (which is slow), but the vector (which is fast).



### tsvector and tsquery

For facilitating full-text search, Postgres offers two data types: tsvector and tsquery.

From https://www.postgresql.org/docs/10/datatype-textsearch.html: 

> PostgreSQL provides two data types that are designed to support full text search, which is the activity of searching through a collection of natural-language documents to locate those that best match a query. The **tsvector** type represents a document in a form optimized for text search; the **tsquery** type similarly represents a text query. 


PostgreSQL has two functions that help us create these two data types. 

* `to_tsvector`: for creating a list of tokens (the tsvector data type, where ts stands for "text search");
* `to_tsquery`: for querying the vector for occurrences of certain words or phrases.


Now let's connect to the database and create a toy table that can store a text document. We will discuss these data types at the appropriate place.

### Step 0: Connect with your database.

You might remember that a database has a set of schemas and a schema has a set of tables. 

In [None]:
import getpass

# Initialize some variables
mysso="<your pawprint>"    # this is also your schema name. 
schema='<your pawprint>' 
hostname='pgsql.dsa.lan'
database='dsa_student'

mypasswd = getpass.getpass("Type Password and hit enter")
connection_string = f"postgresql://{mysso}:{mypasswd}@{hostname}/{database}"

%load_ext sql
%sql $connection_string 

# Then remove the variables that hold the password from the notebook namespace
del mypasswd, connection_string
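
If your password contains characters such as `@` or `/`, a connection URL built by plain string formatting can be misparsed. The sketch below shows one way to percent-encode the credentials with the standard library. The helper `build_connection_url` and the sample credentials are hypothetical; the `postgresql://` scheme is the one recent SQLAlchemy versions expect.

```python
from urllib.parse import quote_plus

def build_connection_url(user, password, host, database):
    """Return a SQLAlchemy-style PostgreSQL URL with user and password percent-encoded."""
    return f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}/{database}"

# Hypothetical credentials, for illustration only
print(build_connection_url("jdoe", "p@ss/word", "pgsql.dsa.lan", "dsa_student"))
```

In the notebook you would pass the value returned by `getpass.getpass()` as `password` instead of a literal.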

Let's check the connection by printing the first 5 tables in this schema. 

In [None]:
%%sql

select * 
from information_schema.tables
where table_schema = '<your pawprint>'
limit 5

### Step 1: Create a data repository (i.e., a table) within a database.


We store text documents in this database. One table is enough to store the text contents. This table has two fields: document_id and document_text. 

```SQL
DROP TABLE IF EXISTS Documents;


CREATE TABLE Documents(
    document_id SERIAL NOT NULL PRIMARY KEY,
    document_text text NOT NULL
);
```

In [None]:
%%sql

DROP TABLE IF EXISTS Documents;


CREATE TABLE Documents(
    document_id SERIAL NOT NULL PRIMARY KEY,
    document_text text NOT NULL
);


### Step 1.1: Add a column that implements the vector model, then parse the data into it.

Postgres allows us to parse text data and store it in a vector model, **tsvector**. See https://www.postgresql.org/docs/10/datatype-textsearch.html to learn about this type of field.

> A tsvector value is a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word


The data type `tsvector` is handy in a full-text search. We can add a field in the above table that processes the content and creates a text vector representation. Before creating a field of type `tsvector` in the table, let's see how it works. A tsvector value merges different variants of the same word and removes duplicates and stop words to create a sorted list of distinct words called lexemes (i.e., terms). So tsvector essentially represents the data as a term vector with their occurrence positions.

In [None]:
%%sql 

SELECT to_tsvector('pg_catalog.english', 'Never gonna give you up. Never gonna let you down');

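Here the lexeme/term `gonna` occurs at positions 2 and 7 of the text. Also, the words `you`, `up`, and `down` are removed as they are stop words, although they still count when positions are assigned. The first argument passed to `to_tsvector` is the name of a text search configuration to use. A configuration selects the dictionaries that normalize tokens, and those dictionaries may include a list of stop words that get excluded from the result. Different configurations have different stop words; the `simple` configuration, for example, keeps every word.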

In [None]:
%%sql

SELECT to_tsvector('pg_catalog.simple', 'Never gonna give you up. Never gonna let you down');
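
To see where the position numbers come from, here is a toy Python approximation of what `to_tsvector` records. It is only a sketch: the function `toy_tsvector` and its tiny stop-word list are assumptions made for illustration, and it does no stemming, whereas PostgreSQL uses a full parser and dictionaries. The point it illustrates is that a removed stop word still consumes a position.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption; PostgreSQL's English list is much longer)
STOP_WORDS = {"you", "up", "down", "the", "a", "of", "my"}

def toy_tsvector(text, stop_words=STOP_WORDS):
    """Map each kept lowercase word to the 1-based positions where it occurs."""
    positions = defaultdict(list)
    words = re.findall(r"[a-z0-9]+", text.lower())
    for pos, word in enumerate(words, start=1):
        if word not in stop_words:   # stop words are dropped but still consume a position
            positions[word].append(pos)
    return dict(sorted(positions.items()))

lyric = "Never gonna give you up. Never gonna let you down"
print(toy_tsvector(lyric))                    # roughly mirrors the 'english' behaviour
print(toy_tsvector(lyric, stop_words=set()))  # roughly mirrors the 'simple' behaviour
```

Passing an empty stop-word set keeps every word, which is the main visible difference between the two queries above.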


Now let's add another column to the Documents table that can store the vector representation of the document_text column. We could have defined this column in Step 1, but we created it separately for the sake of discussion.
```SQL
ALTER TABLE Documents 
  ADD COLUMN document_tokens tsvector;

```

In [None]:
%%sql


ALTER TABLE Documents 
  ADD COLUMN document_tokens tsvector;


### Step 2: Now add some records to the table

In [None]:
%%sql

INSERT INTO documents (document_text) VALUES  
('Pack my box with five dozen milk jugs.'),
('Jackdaws love my big sphinx of quartz.'),
('The five boxing wizards jump quickly.'),
('How vexingly quick daft zebras jump!'),
('Bright vixens jump; dozy fowl quack.'),
('Sphinx of black quartz, judge my vow.');


In [None]:
%%sql

SELECT * from Documents;

### Step 3: Update the document_tokens column 

We will take advantage of the `to_tsvector()` function to convert the document_text column and populate the document_tokens column.

In [None]:
%%sql

UPDATE documents  
SET document_tokens = to_tsvector(document_text);  


In [None]:
%%sql

SELECT * from Documents;

We could have populated both document_text and document_tokens together. 

In [None]:
%%sql

DELETE from Documents;

SELECT * from Documents;

In [None]:
%%sql

INSERT INTO documents (document_text, document_tokens) VALUES  
('Pack my box with five dozen liquor jugs.', to_tsvector('Pack my box with five dozen liquor jugs.')) ,
('Jackdaws love my big sphinx of quartz.', to_tsvector('Jackdaws love my big sphinx of quartz.')),
('The five boxing wizards jump quickly.', to_tsvector('The five boxing wizards jump quickly.')),
('How vexingly quick daft zebras jump!', to_tsvector('How vexingly quick daft zebras jump!')),
('Bright vixens jump; dozy fowl quack.', to_tsvector('Bright vixens jump; dozy fowl quack.')),
('Sphinx of black quartz, judge my vow.', to_tsvector('Sphinx of black quartz, judge my vow.'));
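
Typing every sentence twice is error-prone. When loading from Python, the same statement can be parameterized so each text is supplied once in code. The sketch below assumes a DB-API connection with psycopg2-style `%s` placeholders; the names `INSERT_SQL`, `rows_for_insert`, and `load_documents` are hypothetical, and `load_documents` is defined but not called because it needs a live connection.

```python
INSERT_SQL = (
    "INSERT INTO documents (document_text, document_tokens) "
    "VALUES (%s, to_tsvector(%s))"
)

def rows_for_insert(texts):
    """Pair each text with itself: once for document_text, once for to_tsvector()."""
    return [(text, text) for text in texts]

def load_documents(conn, texts):
    """Insert the texts through a DB-API connection (assumed: psycopg2-style placeholders)."""
    with conn.cursor() as cur:
        cur.executemany(INSERT_SQL, rows_for_insert(texts))
    conn.commit()

pangrams = [
    "Pack my box with five dozen liquor jugs.",
    "Jackdaws love my big sphinx of quartz.",
]
print(rows_for_insert(pangrams)[0])
```

Letting the driver bind the values also avoids quoting problems when a document contains apostrophes.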

In [None]:
%%sql

SELECT * from Documents;

### Step 4. Searching

The next function that we're interested in is `to_tsquery()`, which accepts a list of words that will be checked against the normalized vector we created with `to_tsvector()`.

To do this, we'll use the `@@` operator to check if tsquery matches tsvector.



In [None]:
%%sql

SELECT to_tsvector('The quick brown fox jumped over the lazy dog')  
    @@ to_tsquery('fox');


Searching with 'fox' returned true. Now with "foxes"...

In [None]:
%%sql 

SELECT to_tsvector('The quick brown fox jumped over the lazy dog')  
    @@ to_tsquery('foxes');


That also returns "true" because "foxes" is the plural form of "fox". But how about "foxit"?

In [None]:
%%sql 

SELECT to_tsvector('The quick brown fox jumped over the lazy dog')  
    @@ to_tsquery('foxit');

That's false because the search does not match on raw prefixes: "foxit" is not reduced to the lexeme "fox", so it is not treated as a form of a word in the text originally vectorized.

And finally, now with "jumping":

In [None]:
%%sql 

SELECT to_tsvector('The quick brown fox jumped over the lazy dog')  
    @@ to_tsquery('jumping');


#### Operators: 

tsquery also provides a set of operators that we would expect in any decent query facility.


#### AND operator (&)

In [None]:
%%sql 

SELECT to_tsvector('The quick brown fox jumped over the lazy dog')  
    @@ to_tsquery('fox & dog');

#### OR operator (|)

In [None]:
%%sql 

SELECT to_tsvector('The quick brown fox jumped over the lazy dog')  
    @@ to_tsquery('fox | clown');

#### NEGATION operator (!)

In [None]:
%%sql 

SELECT to_tsvector('The quick brown fox jumped over the lazy dog')  
    @@ to_tsquery('!clown');

And we can, of course, combine them all.

In [None]:
%%sql 

SELECT to_tsvector('The quick brown fox jumped over the lazy dog')  
    @@ to_tsquery('fox & (dog | clown) & !queen');
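
The boolean operators can be pictured as set membership tests on the document's lexemes. The following Python sketch evaluates a query written as nested tuples; the tuple representation and the function `matches` are hypothetical, and the lexeme set is written by hand assuming English stemming and stop-word removal. It is a mental model, not PostgreSQL's implementation.

```python
def matches(query, lexemes):
    """Evaluate a nested-tuple boolean query against a set of lexemes."""
    if isinstance(query, str):
        return query in lexemes                        # a bare term: is the lexeme present?
    op, *args = query
    if op == "and":
        return all(matches(arg, lexemes) for arg in args)
    if op == "or":
        return any(matches(arg, lexemes) for arg in args)
    if op == "not":
        return not matches(args[0], lexemes)
    raise ValueError(f"unknown operator: {op!r}")

# Hand-written lexemes for 'The quick brown fox jumped over the lazy dog'
doc = {"quick", "brown", "fox", "jump", "lazi", "dog"}

# fox & (dog | clown) & !queen
query = ("and", "fox", ("or", "dog", "clown"), ("not", "queen"))
print(matches(query, doc))
```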


Now let's perform search over the Documents table that we created earlier. 

In [None]:
%%sql

SELECT document_id, document_text, document_tokens  FROM documents  
WHERE document_tokens @@ to_tsquery('jump & quick'); 

The AND operator doesn't make any distinction in regards to the location of words in the documents. Let's try it now with the proximity operator <->. This operator facilitates phrase search.

In [None]:
%%sql

SELECT document_id, document_text, document_tokens FROM documents  
WHERE document_tokens @@ to_tsquery('jump <-> quick');  


So you can now find words next to each other, but can you find words that are a fixed distance apart even if one doesn't come immediately after the other? In fact, the dash - in the proximity operator <-> is a placeholder for the distance you're searching for: `<N>` matches when the second word occurs exactly N positions after the first, and `<->` is shorthand for `<1>`. Let's give some examples:

Let's search for "sphinx" and "quartz" next to each other (<->):

In [None]:
%%sql 

SELECT * FROM documents  
WHERE document_tokens @@ to_tsquery('sphinx <-> quartz');  


Let's increase the distance between "sphinx" and "quartz" to two positions apart (<2>):

In [None]:
%%sql 

SELECT * FROM documents  
WHERE document_tokens @@ to_tsquery('sphinx <2> quartz'); 

And three words apart (<3>):

In [None]:
%%sql 

SELECT * FROM documents  
WHERE document_tokens @@ to_tsquery('sphinx <3> quartz');  

A word of caution when performing proximity search. Unlike text-search where (jump & quick) and (quick & jump) would yield the same results, phrase search is not symmetric! That is, searching for (jump <-> quick) is not the same as searching for (quick <-> jump) as the PostgreSQL engine will consider the order in which you're placing the words.

And just so you know, `<N>` is really syntactic sugar for the tsquery_phrase() function; so `to_tsquery('sphinx <3> quartz')` is equivalent to `tsquery_phrase(to_tsquery('sphinx'), to_tsquery('quartz'), 3)`.
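
The distance rule and the asymmetry can be checked with a small Python sketch that works on position lists. The function `followed_by` is hypothetical and the position maps are written by hand for two of the sample sentences; it is meant as a mental model of phrase matching, not as PostgreSQL's algorithm.

```python
def followed_by(positions, first, second, distance=1):
    """True if `second` occurs exactly `distance` positions after `first`."""
    second_positions = set(positions.get(second, []))
    return any(p + distance in second_positions for p in positions.get(first, []))

# Hand-written position maps; stop words such as "my" and "of" are omitted
# but still count toward the positions of the words around them.
doc_a = {"jackdaw": [1], "love": [2], "big": [4], "sphinx": [5], "quartz": [7]}
doc_b = {"sphinx": [1], "black": [3], "quartz": [4], "judg": [5], "vow": [7]}

for n in (1, 2, 3):
    print(n, followed_by(doc_a, "sphinx", "quartz", n), followed_by(doc_b, "sphinx", "quartz", n))

# Order matters: swapping the terms changes the result
print(followed_by(doc_a, "quartz", "sphinx", 2))
```

Each distance matches at most one of the two documents here, because the comparison is for an exact offset rather than "within N words".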

## Performance

Full-text search is fast because of the tsvector data type, which works as a precomputed index of the document's content. That being said, the cost of the operation lies in generating this vector, which is something you would normally need to do only once (unless the document gets updated).

A good practice, therefore, is to store the vectors alongside the documents, just as we did in our phrase search example. This way, you can profit from the speedup and flexibility of the tsvector/tsquery pair, while paying the small cost of generating and storing the document tokens.

In the next lab, we will see how to create an index on this tsvector column to make searches even faster.

# Save your notebook, then `File > Close and Halt`