# Building and Loading Text Search in PostgreSQL


--- 
<a id='PG_text' ></a>

## PostgreSQL Text Storage

This notebook documents building the `BookLines` table using PostgreSQL's built-in Information Retrieval (IR) capability, _full text search_.


<a id='task'></a>

## Task at Hand

This lab walks through creating a full text search capability within PostgreSQL, indexed over the individual lines of a book (with sub-books), so that it can be integrated into other analytical processes.


### Database of Unstructured Text Files 

As in the lab, we are going to use this collection of text files.
It holds about 4.3 megabytes of text in roughly 31 thousand lines. Sounds fun!

```BASH
$ ls /dsa/data/all_datasets/book/*
book/1chron.txt    book/acts.txt      book/isaiah.txt    book/nahum.txt
book/1corinth.txt  book/amos.txt      book/james.txt     book/nehemiah.txt
book/1john.txt     book/colossia.txt  book/jeremiah.txt  book/numbers.txt
book/1kings.txt    book/daniel.txt    book/job.txt       book/obadiah.txt
book/1peter.txt    book/deut.txt      book/joel.txt      book/philemon.txt
book/1samuel.txt   book/eccl.txt      book/john.txt      book/philipp.txt
book/1thess.txt    book/ephesian.txt  book/jonah.txt     book/proverbs.txt
book/1timothy.txt  book/esther.txt    book/joshua.txt    book/psalms.txt
book/2chron.txt    book/exodus.txt    book/jude.txt      book/rev.txt
book/2corinth.txt  book/ezekiel.txt   book/judges.txt    book/romans.txt
book/2john.txt     book/ezra.txt      book/lament.txt    book/ruth.txt
book/2kings.txt    book/galatian.txt  book/levit.txt     book/song.txt
book/2peter.txt    book/genesis.txt   book/luke.txt      book/titus.txt
book/2samuel.txt   book/habakkuk.txt  book/malachi.txt   book/zech.txt
book/2thess.txt    book/haggai.txt    book/mark.txt      book/zeph.txt
book/2timothy.txt  book/hebrews.txt   book/matthew.txt
book/3john.txt     book/hosea.txt     book/micah.txt

$ du -skh /dsa/data/all_datasets/book
4.6M	/dsa/data/all_datasets/book
$ wc -l book/*  | tail -n1
  31258 total
```

### However, now we are going to index it line-by-line.

<span style="color:red">
**You will need to create and load the database, much as you interacted with PostgreSQL in the Database and Analytics course.**
</span>

Remember a few key things:
 1. You will use your pawprint as your user name, and the password you will type in is your normal MU password.
 1. The database is: `dsa_student`
 1. The database host is: `pgsql.dsa.lan`
 1. The schema name is the same as your pawprint.
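
The connection details above can be combined into a single connection URL. The sketch below is one minimal way to do that, assuming the URL form `postgresql://user:password@host/database`; the helper name `build_connection_url` and the sample password are hypothetical. It percent-encodes the user name and password so that characters such as `@` or `/` in a password do not break the URL.

```python
from urllib.parse import quote_plus


def build_connection_url(user, password, host, database):
    """Assemble a PostgreSQL URL, percent-encoding the user name and password."""
    return f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}/{database}"


# Hypothetical credentials, for illustration only
url = build_connection_url("pawprint", "p@ss/word", "pgsql.dsa.lan", "dsa_student")

# Mask the encoded password before printing anything
print(url.replace(quote_plus("p@ss/word"), "****"))
```

In the notebook itself, the password would come from `getpass.getpass()` rather than a literal string.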


<a id='build_it'></a>

## Building a Text Retrieval Database

#### Examples of all the commands are available [here](../resources/PG_Build_Lines_Search.sql). An equivalent Python implementation is [here](./Table-Setup.ipynb).

You will need to open the terminal, then connect to the database to build your schema tables.

<span style="background-color:yellow">For the commands below, replace the schema name with your own pawprint.</span>

### Step 0: Connect with your database.

In [None]:
import getpass

# Initialize some variables
mysso = '<pawprint>'    # change to your pawprint; this is also your schema name
schema = '<pawprint>'   # change to your pawprint
hostname='pgsql.dsa.lan'
database='dsa_student'

mypasswd = getpass.getpass("Type Password and hit enter")
# "postgresql://" is accepted by both old and new SQLAlchemy versions
connection_string = f"postgresql://{mysso}:{mypasswd}@{hostname}/{database}"

%load_ext sql
%sql $connection_string 

# Then remove the plain-text password variable from the notebook namespace
del mypasswd

### Step 1: Create a data repository (i.e., a table) within a database

```SQL
-------------------------
-- Basic Table 
-------------------------

DROP TABLE IF EXISTS BookLines;

CREATE TABLE BookLines(
        id SERIAL NOT NULL,
        name varchar(250) NOT NULL,
        line_no INT NOT NULL,
        line text NOT NULL
);

ALTER TABLE BookLines
ADD CONSTRAINT pk_BookLines PRIMARY KEY (id);
```

In [None]:
%%sql 



### Step 2: Add a column that implements the vector model

```SQL
-------------------------
-- Separate Ts_Vector column
-------------------------
-- TS_Vector for GIN INDEX
ALTER TABLE BookLines
  ADD COLUMN line_tsv_gin tsvector;

UPDATE BookLines
SET line_tsv_gin = to_tsvector('pg_catalog.english', line);
```

In [None]:
%%sql 



### Step 3: Another column that implements the vector model


```SQL
-- TS_Vector for GIST INDEX
ALTER TABLE BookLines
  ADD COLUMN line_tsv_gist tsvector;

UPDATE BookLines
SET line_tsv_gist = to_tsvector('pg_catalog.english', line);
```

In [None]:
%%sql

### Complete additional steps to build your IR backend (e.g. adding triggers and indexes)

**<span style='background:yellow'>[See lab](../labs/FullText_PostgreSQL-02.ipynb)</span>**

---

In [None]:
-- Add triggers

In [None]:
-- Add indexes


### Result


Finally, take a look at the resulting table definition:

```SQL
dsa_student=# \dt booklines
          List of relations
 Schema |   Name    | Type  | Owner
--------+-----------+-------+--------
 sebcq5 | booklines | table | sebcq5
(1 row)

dsa_student=# \d booklines
                                       Table "sebcq5.booklines"
    Column     |          Type          | Collation | Nullable |                Default
---------------+------------------------+-----------+----------+---------------------------------------
 id            | integer                |           | not null | nextval('booklines_id_seq'::regclass)
 name          | character varying(250) |           | not null |
 line_no       | integer                |           | not null |
 line          | text                   |           | not null |
 line_tsv_gin  | tsvector               |           |          |
 line_tsv_gist | tsvector               |           |          |
Indexes:
    "pk_booklines" PRIMARY KEY, btree (id)
    "booklines_line" gin (line gin_trgm_ops)
    "booklines_line_tsv_gin" gin (line_tsv_gin)
    "booklines_line_tsv_gist" gist (line_tsv_gist)
Triggers:
    tsv_gin_update BEFORE INSERT OR UPDATE ON booklines FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger('line_tsv_gin', 'pg_catalog.english', 'line')
    tsv_gist_update BEFORE INSERT OR UPDATE ON booklines FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger('line_tsv_gist', 'pg_catalog.english', 'line')
    
```
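
The index and trigger definitions listed in the `\d` output above can be expressed as DDL statements. The following is a hedged sketch, not the lab's reference solution: the helper names `build_index_and_trigger_ddl` and `apply_ddl` are hypothetical, and the statements are inferred from the output shown. The trigram index `booklines_line` is left out because `gin_trgm_ops` depends on the `pg_trgm` extension being installed.

```python
def build_index_and_trigger_ddl(table="BookLines"):
    """Return DDL for the tsvector indexes and their update triggers."""
    name = table.lower()
    statements = [
        f"CREATE INDEX {name}_line_tsv_gin ON {table} USING gin(line_tsv_gin);",
        f"CREATE INDEX {name}_line_tsv_gist ON {table} USING gist(line_tsv_gist);",
    ]
    # One trigger per tsvector column keeps it in sync with the "line" column
    for column, trigger in (("line_tsv_gin", "tsv_gin_update"),
                            ("line_tsv_gist", "tsv_gist_update")):
        statements.append(
            f"CREATE TRIGGER {trigger} BEFORE INSERT OR UPDATE ON {table} "
            f"FOR EACH ROW EXECUTE PROCEDURE "
            f"tsvector_update_trigger({column}, 'pg_catalog.english', line);"
        )
    return statements


def apply_ddl(conn, statements):
    """Run each statement on an open psycopg2 connection in one transaction."""
    with conn, conn.cursor() as curs:
        for statement in statements:
            curs.execute(statement)


# Review the generated statements before running them against the database
for statement in build_index_and_trigger_ddl():
    print(statement)
```

Creating the triggers before loading data means each inserted row gets its `tsvector` columns populated automatically.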

<a id='load_it'></a>

## Loading Data

To load the data, we will use a Python script that follows this basic crawling behavior:

 1. For each file/folder in the specified starting folder:
     1. If it is a folder, recurse into the folder and process its contents.
     1. If it is a file, read its contents and load them into the database, one line at a time.

In [None]:
import getpass
# This collects a masked password from the user
mypasswd = getpass.getpass()

In [None]:
mysso = '<your pawprint>'  # change to your pawprint
dbname = 'dsa_student'
schema = '<your pawprint>' #change to your pawprint

In [None]:
import os
import psycopg2

try:
    conn = psycopg2.connect(database=dbname,
                            user=mysso,
                            host='pgsql.dsa.lan',
                            password=mypasswd)
    print("I am able to connect to the database")
except psycopg2.Error as err:
    # Catch only database errors and show the reason for the failure
    print("I am unable to connect to the database:", err)

del mypasswd

In [None]:

def loadFile(filename):
    '''
    Read file contents, load into database one line at a time.
    
    Returns: The number of lines loaded
    '''
    line_no = 0
    # "with conn" commits the transaction on success and rolls back on error
    with conn, conn.cursor() as curs:
        with open(filename, 'r') as infile:
            for line in infile:
                line_no += 1
                line = line.rstrip('\n')
                ###############################
                # Review the Printout
                ###############################
                print("Loading: {},{} = {}".format(filename,line_no,line))
                ###############################
                # When you are ready
                # Fill in the SQL variable
                # and Un-comment the curs.execute()
                ###############################
                SQL = <EDIT HERE>   
                curs.execute(SQL,(filename,line_no,line))
                #row_id = curs.fetchone()[0]
    return line_no


#### Use the cell below to test your edits to the code above.

##### After testing, when you are ready
 1. Comment out the print statements
 1. Un-comment the cursor execute
 1. Reload the edited cells
 1. Load the cell that defines `processFolder`
 1. Execute `processFolder()`

In [None]:
###############################
# Dev-Testing Cell
###############################


lines_loaded = loadFile('/dsa/data/all_datasets/book/1peter.txt')
print("Lines Loaded: {}".format(lines_loaded))




###############################
# Dev-Testing Cell
# When done, change to cell type
# Raw NBConvert
###############################

In [None]:
def processFolder(folder):
    '''
    Process a folder for files and subfolders
    '''
    
    print('Processing folder: ',folder)
    
    for root, dirs, files in os.walk(folder):
        
        print("root = ", root)
        
        # Process Files
        for file in files:
            if file.endswith(".txt"):
                filename = os.path.join(root, file)
                print('Processing File:',filename)
                document_id = 0
                # Comment out this line to watch the next cell walk the tree
                lines_loaded = loadFile(filename)
                print("Lines Loaded: {}".format(lines_loaded))
                
            elif file.endswith(".html"):
                print("HTML Files Not Handled Yet")
        

In [None]:
###########################
# Launch the Parsing
###########################


processFolder('/dsa/data/all_datasets/book');

##### Example output, similar to what the cell above produces, is available [here](../resources/PG_FTS_Lines_Load.txt).

### Check the Results

```SQL
dsa_student=# select count(*),sum(length(line)) from sebcq5.booklines;
 count |   sum
-------+---------
 31365 | 4329283
(1 row)                                   
```

#### 31K lines
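
The line and character totals can also be computed directly from the source files, giving a figure to compare against the `count(*)` and `sum(length(line))` query. This is a minimal sketch that walks the folder the same way the loader does; the helper name `count_lines_and_chars` is hypothetical, and it assumes the files decode with the platform's default text encoding.

```python
import os


def count_lines_and_chars(folder):
    """Walk folder and return (line_count, char_count) over all .txt files."""
    line_count = 0
    char_count = 0
    for root, _dirs, files in os.walk(folder):
        for file in files:
            if file.endswith(".txt"):
                with open(os.path.join(root, file), "r") as infile:
                    for line in infile:
                        line_count += 1
                        # Match the loader: the trailing newline is not stored
                        char_count += len(line.rstrip("\n"))
    return line_count, char_count


book_folder = "/dsa/data/all_datasets/book"
if os.path.isdir(book_folder):
    print(count_lines_and_chars(book_folder))
```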

In [None]:
%%sql 


#### Looking at a random line that was added:

```SQL
dsa_student=# \x 
Expanded display is on.
dsa_student=# select * from BookLines where id = 9352;
-[ RECORD 1 ]-+-------------------------------------------------------------------------------------------
id            | 9352
name          | /dsa/data/all_datasets/book/ephesian.txt
line_no       | 135
line          | 6:3: That it may be well with thee, and thou mayest live long on the earth.
line_tsv_gin  | '3':2 '6':1 'earth':17 'live':13 'long':14 'may':5 'mayest':12 'thee':9 'thou':11 'well':7
line_tsv_gist | '3':2 '6':1 'earth':17 'live':13 'long':14 'may':5 'mayest':12 'thee':9 'thou':11 'well':7
```

Notice that the document vector we have built is stemmed, omits common (stop) words such as `that`, `it`, and `the`, and records the position of each remaining term within the line.
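
To make the structure of that vector concrete, here is a toy approximation in Python. It is only a sketch: `toy_tsvector` and `TOY_STOP_WORDS` are hypothetical names, the stop-word list is a tiny hand-picked subset, and no stemming is applied, so it does not reproduce PostgreSQL's `to_tsvector` in general. For the sample line above it yields the same terms and positions, because stop words are dropped but still counted when numbering positions.

```python
import re

# Hypothetical, hand-picked stop words; PostgreSQL's English list is much longer
TOY_STOP_WORDS = {"that", "it", "be", "with", "and", "on", "the"}


def toy_tsvector(line, stop_words=TOY_STOP_WORDS):
    """Map each non-stop-word term to the 1-based positions where it occurs."""
    positions = {}
    tokens = re.findall(r"[A-Za-z0-9]+", line)
    for position, token in enumerate(tokens, start=1):
        term = token.lower()
        if term not in stop_words:
            positions.setdefault(term, []).append(position)
    return positions


sample = "6:3: That it may be well with thee, and thou mayest live long on the earth."
vector = toy_tsvector(sample)

# Print in a tsvector-like layout: 'term':pos,pos ...
print(" ".join(f"'{term}':{','.join(map(str, pos))}"
               for term, pos in sorted(vector.items())))
```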



In [None]:
%%sql 


# Save your notebook, then `File > Close and Halt`