In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pyodbc
import sqlalchemy
import sqlite3
from subprocess import check_output
import os

%sql sqlite://

'Connected: @None'

In [2]:
actor = pd.read_csv('/kaggle/input/data-sakila-sql/actor.txt', sep = ';')
category = pd.read_csv('/kaggle/input/data-sakila-sql/category.txt', sep = ';')
customer = pd.read_csv('/kaggle/input/data-sakila-sql/customer.txt', sep = ';')
film = pd.read_csv('/kaggle/input/data-sakila-sql/film.txt', sep = ';')
film_cat = pd.read_csv('/kaggle/input/data-sakila-sql/film_category.txt', sep = ';')
inventory = pd.read_csv('/kaggle/input/data-sakila-sql/inventory.txt', sep = ';')
rental = pd.read_csv('/kaggle/input/data-sakila-sql/rental.txt', sep = ';')

In [3]:
from sqlalchemy import create_engine
engine = create_engine('sqlite:////sakila', echo=False)

actor.to_sql('actor', con = engine)
category.to_sql('category', con = engine)
customer.to_sql('customer', con = engine)
film.to_sql('film', con = engine)
film_cat.to_sql('film_category', con = engine)
inventory.to_sql('inventory', con = engine)
rental.to_sql('rental', con = engine)

# PRACTICEs

In this kernel/notebook, we will discuss 3 topics:

`Full-text search`, `Extending PostgreSQL` and `Improving full-text search with extensions`

## 1. Full-text search

First, recall the wildcards used with the `LIKE` operator, introduced in [section5_like_notlike](https://github.com/Nhan121/Lectures_notes-teaching-in-VN-/blob/master/SQL%20practices/Introduction%20to%20using%20SQL%20%26%20ML%20prob%20on%20jupyter%20notebook/filtering-rows.ipynb)

#### 1.1. The `LIKE` operator

    +===============+========================================+======================+=============================+
    |    type       |                  usage                 |        Example       |         Result              |
    |===============+========================================+======================+=============================|
    |(1)  _wildcard | matches exactly one character          | WHERE name LIKE '_D' |  ED                         |
    |               | (any 2-letter name ending in D)        |                      |                             |
    |---------------+----------------------------------------+----------------------+-----------------------------|
    |(2)  wildcard_ | (any 2-letter name starting with A)    | WHERE name LIKE 'A_' |  AL                         |
    |===============+========================================+======================+=============================|
    |(3)  %wildcard | matches zero or more characters        | WHERE name LIKE '%A' |  SANDRA, JULIA, JESSICA, ...|
    |               | (any name ending in A)                 |                      |                             |
    |---------------+----------------------------------------+----------------------+-----------------------------|
    |(4)  wildcard% | (any name beginning with A)            | WHERE name LIKE 'A%' |  ADAM, ALEX, ANGELA,        |
    |               |                                        |                      |  ANDREW, ...                |
    +===============+========================================+======================+=============================+
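
The four patterns in the table can be tried out without a database server. The sketch below uses Python's built-in `sqlite3` module with an in-memory table; the table name `people` and the sample names are made up for illustration and are not part of the Sakila data.

```python
import sqlite3

# Hypothetical sample names, chosen to mirror the table above
names = ["ED", "AL", "SANDRA", "JULIA", "ADAM", "ALEX", "BOB"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")
conn.executemany("INSERT INTO people VALUES (?)", [(n,) for n in names])

def like(pattern):
    """Return all names matching the LIKE pattern, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM people WHERE name LIKE ? ORDER BY name", (pattern,)
    )
    return [row[0] for row in rows]

# One entry per row of the table: _D, A_, %A, A%
like_results = {pattern: like(pattern) for pattern in ("_D", "A_", "%A", "A%")}
for pattern, matches in like_results.items():
    print(pattern, matches)
```

Note that SQLite's `LIKE` is case-insensitive for ASCII letters by default, whereas PostgreSQL's `LIKE` is case-sensitive; the sample names are all upper-case, so both behave the same here.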
            
- **Example 1.1.** Find text that begins with a given letter.

For example, find every `actor` whose `first_name` ***starts with*** the letter `A`.

In [4]:
pd.read_sql(
        """
            SELECT first_name, last_name
            FROM actor
            WHERE first_name LIKE 'A%'
        """, con = engine)

Unnamed: 0,first_name,last_name
0,ADAM,GRANT
1,ADAM,HOPPER
2,AL,GARLAND
3,ALAN,DREYFUSS
4,ALBERT,NOLTE
5,ALBERT,JOHANSSON
6,ALEC,WAYNE
7,ANGELA,WITHERSPOON
8,ANGELA,HUDSON
9,ANGELINA,ASTAIRE


- **Example 1.2.** `first_name` ***ends with*** the letter `A`.

In [5]:
pd.read_sql(
    """
        SELECT first_name, last_name
        FROM actor
        WHERE first_name LIKE '%A'
    """, con = engine)

Unnamed: 0,first_name,last_name
0,ANGELA,WITHERSPOON
1,ANGELA,HUDSON
2,ANGELINA,ASTAIRE
3,BELA,WALKEN
4,CUBA,BIRCH
5,CUBA,OLIVIER
6,CUBA,ALLEN
7,GINA,DEGENERES
8,GRETA,KEITEL
9,GRETA,MALDEN


- **Example 1.3.** `first_name` is ***exactly two characters*** long and ends with the letter `A`.

In [6]:
pd.read_sql(
    """
        SELECT first_name, last_name
        FROM actor
        WHERE first_name LIKE '_A'
        ORDER BY first_name
    """, con = engine)

Unnamed: 0,first_name,last_name


The empty result shows that **no one** in the `actor` table has a two-letter `first_name` ending in `A`. What about two-letter names ending in `D`?

In [7]:
pd.read_sql(
    """
        SELECT first_name, last_name
        FROM actor
        WHERE first_name LIKE '_D'
    """, con = engine)

Unnamed: 0,first_name,last_name
0,ED,CHASE
1,ED,MANSFIELD
2,ED,GUINESS


- **Example 1.4.** `first_name` is ***exactly two characters*** long and starts with the letter `A`.

In [8]:
pd.read_sql(
    """
        SELECT first_name, last_name
        FROM actor
        WHERE first_name LIKE 'A_'
    """, con = engine)

Unnamed: 0,first_name,last_name
0,AL,GARLAND


- `title` begins with `ELF`

In [9]:
pd.read_sql(
    """
        SELECT title
        FROM film
        WHERE title LIKE 'ELF%'
    """, con = engine)

Unnamed: 0,title
0,ELF MURDER


or `title` ends with `ELF`

In [10]:
pd.read_sql(
    """
        SELECT title
        FROM film
        WHERE title LIKE '%ELF'
    """, con = engine)

Unnamed: 0,title
0,CHARADE DUFFEL
1,ROXANNE REBEL
2,INDEPENDENCE HOTEL
3,MOB DUFFEL


#### 1.2. LIKE versus full text search
Next, we will use two functions, **`to_tsvector()`** and **`to_tsquery()`**, together with the match operator `@@`:

                    to_tsvector(text_column) @@ to_tsquery(search_term)
For example,

In [11]:
pd.read_sql(
    """
        SELECT title, description
        FROM film
        WHERE to_tsvector(title) @@ to_tsquery('elf')
    """, con = engine)

Unnamed: 0,title,description
0,GHOSTBUSTERS ELF,A Thoughtful Epistle of a Dog And a Feminist w...
1,ELF MURDER,A Action-Packed Story of a Frisbee And a Woman...
2,ENCINO ELF,A Astounding Drama of a Feminist And a Teacher...


### What is full text search?

**`FTS (full text search)`** provides a means of performing `NLQ (natural language queries)` on the `text-data` in your database by using
> **`Stemming`**
> 
> **`Spelling mistakes`**
>
> **`Ranking`**
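
To see why matching whole words differs from substring matching with `LIKE`, here is a deliberately simplified sketch in plain Python. It only splits text into lower-cased words; it does not reproduce PostgreSQL's parser, stemming or ranking, and the title `SHELF LIFE` is a hypothetical example added to show the difference.

```python
import re

# Three titles from the query result above, plus one hypothetical title
titles = ["ELF MURDER", "ENCINO ELF", "GHOSTBUSTERS ELF", "SHELF LIFE"]

def words(text):
    """Split text into a set of lower-cased words (no stemming)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

# Substring test, comparable to: title LIKE '%ELF%'
substring_hits = [t for t in titles if "ELF" in t]

# Whole-word test, comparable in spirit to a one-word full-text match
word_hits = [t for t in titles if "elf" in words(t)]

print(substring_hits)
print(word_hits)
```

The substring test also returns `SHELF LIFE`, because `ELF` occurs inside `SHELF`; the whole-word test does not.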

### EXERCISEs 
#### Exercise 1.1. A review of the LIKE operator
The `LIKE operator` allows us to filter our queries by matching patterns in text data. By using the `% wildcard` we can match zero or more characters in a string. This is useful when you want to return a result set that matches certain characteristics and can also be very helpful during exploratory data analysis or data cleansing tasks.

Let's explore how different usage of the `% wildcard` will return different results by looking at the film table of the `Sakila DVD Rental database`.

#### Instructions 
**Step 1.** Select all columns for all records that begin with the word `GOLD`.

In [12]:
pd.read_sql(
    """ 
        SELECT *
        FROM film
        WHERE title LIKE 'GOLD%';
    """, con = engine)

Unnamed: 0,film_id,title,description,release_year,language_id,original_language_id,rental_duration,rental_rate,length,replacement_cost,rating,last_update,special_features
0,365,GOLD RIVER,A Taut Documentary of a Database Administrator...,2006,1,1,4,4.99,154,21.99,R,2006-02-15 05:03:00,"Trailers,Commentaries,Deleted Scenes,Behind th..."
1,366,GOLDFINGER SENSIBILITY,A Insightful Drama of a Mad Scientist And a Hu...,2006,1,1,3,0.99,93,29.99,G,2006-02-15 05:03:00,"Trailers,Commentaries,Behind the Scenes"
2,367,GOLDMINE TYCOON,A Brilliant Epistle of a Composer And a Frisbe...,2006,1,1,6,0.99,153,20.99,R,2006-02-15 05:03:00,"Trailers,Behind the Scenes"


**Step 2.** Now select all records that end with the word `GOLD`.

In [13]:
pd.read_sql(
    """ 
        SELECT *
        FROM film
        WHERE title LIKE '%GOLD';
    """, con = engine)

Unnamed: 0,film_id,title,description,release_year,language_id,original_language_id,rental_duration,rental_rate,length,replacement_cost,rating,last_update,special_features
0,644,OSCAR GOLD,A Insightful Tale of a Database Administrator ...,2006,1,1,7,2.99,115,29.99,PG,2006-02-15 05:03:00,Behind the Scenes
1,870,SWARM GOLD,A Insightful Panorama of a Crocodile And a Boa...,2006,1,1,4,0.99,123,12.99,PG-13,2006-02-15 05:03:00,"Trailers,Commentaries"


**Step 3.** Finally, select all records that contain the word `'GOLD'`.

In [14]:
pd.read_sql(
    """ 
        SELECT *
        FROM film
        WHERE title LIKE '%GOLD%';    
    """, con = engine)

Unnamed: 0,film_id,title,description,release_year,language_id,original_language_id,rental_duration,rental_rate,length,replacement_cost,rating,last_update,special_features
0,365,GOLD RIVER,A Taut Documentary of a Database Administrator...,2006,1,1,4,4.99,154,21.99,R,2006-02-15 05:03:00,"Trailers,Commentaries,Deleted Scenes,Behind th..."
1,2,ACE GOLDFINGER,A Astounding Epistle of a Database Administrat...,2006,1,1,3,4.99,48,12.99,G,2006-02-15 05:03:00,"Trailers,Deleted Scenes"
2,95,BREAKFAST GOLDFINGER,A Beautiful Reflection of a Student And a Stud...,2006,1,1,5,4.99,123,18.99,G,2006-02-15 05:03:00,"Trailers,Commentaries,Deleted Scenes"
3,366,GOLDFINGER SENSIBILITY,A Insightful Drama of a Mad Scientist And a Hu...,2006,1,1,3,0.99,93,29.99,G,2006-02-15 05:03:00,"Trailers,Commentaries,Behind the Scenes"
4,367,GOLDMINE TYCOON,A Brilliant Epistle of a Composer And a Frisbe...,2006,1,1,6,0.99,153,20.99,R,2006-02-15 05:03:00,"Trailers,Behind the Scenes"
5,644,OSCAR GOLD,A Insightful Tale of a Database Administrator ...,2006,1,1,7,2.99,115,29.99,PG,2006-02-15 05:03:00,Behind the Scenes
6,798,SILVERADO GOLDFINGER,A Stunning Epistle of a Sumo Wrestler And a Ma...,2006,1,1,4,4.99,74,11.99,PG,2006-02-15 05:03:00,"Trailers,Commentaries"
7,870,SWARM GOLD,A Insightful Panorama of a Crocodile And a Boa...,2006,1,1,4,0.99,123,12.99,PG-13,2006-02-15 05:03:00,"Trailers,Commentaries"


#### Exercise 1.2. What is a tsvector?
You saw how to convert `strings` to `tsvector` and `tsquery` in the previous sections and, in this exercise, we are going to dive deeper into what these functions actually return after converting a string to a tsvector. 

In this example, you will convert a text column from the film table to a tsvector and inspect the first 25 results. 

**Understanding how `full-text search` works is the first step in more `advanced machine learning and data science concepts like natural language processing`**.

#### Instructions
Select the film description and convert it to a `tsvector` data type.

In [15]:
pd.read_sql(
    """ 
    SELECT to_tsvector(description)
    FROM film
    LIMIT 25
    """, con = engine)

Unnamed: 0,to_tsvector
0,'display':3 'fate':2 'georgia':19 'mad':9 'mus...
1,'ancient':18 'awe':3 'awe-inspir':2 'boy':16 '...
2,'abandon':19 'astound':2 'charact':3 'fun':20 ...
3,'berlin':17 'drama':3 'husband':9 'must':11 'o...
4,'boat':7 'charact':3 'emot':2 'explor':15 'fin...
5,"'cat':14 'emot':2 'first':17 'man':9,18 'meet'..."
6,'ancient':18 'astronaut':11 'fast':3 'fast-pac...
7,'abandon':18 'boat':6 'compos':9 'documentari'...
8,'abandon':18 'brilliant':2 'defeat':13 'epistl...
9,'awe':3 'awe-inspir':2 'baloon':19 'confront':...
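
The output format above pairs each word with the positions at which it occurs. The following toy function imitates that layout in Python so the structure is easier to read. It is only a sketch: the stop-word list is a small assumption made for illustration, the input sentence is hypothetical, and no stemming is performed (PostgreSQL would, for example, reduce `fateful` to `fate`).

```python
import re

# Assumption: a tiny stop-word list for illustration only
STOP_WORDS = {"a", "an", "and", "of", "the", "in"}

def toy_tsvector(text):
    """Format words as 'word':pos1,pos2 sorted alphabetically.

    Positions count every word, including stop words, but stop words
    themselves are left out of the result.
    """
    positions = {}
    for pos, word in enumerate(re.findall(r"[a-z0-9]+", text.lower()), start=1):
        if word not in STOP_WORDS:
            positions.setdefault(word, []).append(pos)
    return " ".join(
        f"'{word}':{','.join(map(str, positions[word]))}"
        for word in sorted(positions)
    )

# Hypothetical sentence in the style of the film descriptions
vector = toy_tsvector("A Drama of a Man and a Boat and a Man")
print(vector)
```

A word that appears twice, such as `man` here, gets a comma-separated list of positions, just like `'man':9,18` in row 5 of the real output.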


#### Exercise 1.3. Basic full-text search
Searching text will become something you do repeatedly when building applications or exploring data sets for data science. `Full-text search` is helpful when performing `exploratory data analysis` for a `NLP (natural language processing) model` or building a search feature into your application.

In this exercise, you will practice searching a text column and matching it against a string. For a whole word such as `elf`, the search returns results similar to a query that uses the **`LIKE`** operator with the `% wildcard` at the `beginning` and `end` of the string. Unlike `LIKE`, it matches words rather than arbitrary substrings, can use a full-text index to perform much better on large tables, and gives you a foundation for more advanced `full-text search queries`.

#### Instructions
Select the `title` and `description` columns from the `film table`. Perform a `full-text search` on the `title` column for the word `elf`.

In [16]:
pd.read_sql(
    """ 
    SELECT title, description
    FROM film
    WHERE to_tsvector(title) @@ to_tsquery('elf');
    """, con = engine)

Unnamed: 0,title,description
0,GHOSTBUSTERS ELF,A Thoughtful Epistle of a Dog And a Feminist w...
1,ELF MURDER,A Action-Packed Story of a Frisbee And a Woman...
2,ENCINO ELF,A Astounding Drama of a Feminist And a Teacher...


#### Exercise 1.4. Going further with full-text search
In this exercise, search the text column `title` for the word `elf`, then return the position of `ELF` within each matching `title` together with the `length of title`.

#### Instruction
Use the functions **`POSITION()`** and **`LENGTH()`** introduced in the previous [chapter:parsing&manipulating_SQL](https://github.com/Nhan121/Lectures_notes-teaching-in-VN-/blob/master/SQL%20practices/Functions%20for%20Manipulating%20Data%20in%20PostgreSQL/parsing-and-manipulating-text.ipynb)

In [17]:
pd.read_sql(
    """ 
    SELECT title, POSITION('ELF' IN title) As pos_contain_elf, LENGTH(title) AS tit_len
    FROM film
    WHERE to_tsvector(title) @@ to_tsquery('elf')
    ORDER BY pos_contain_elf
    """, con = engine)

Unnamed: 0,title,pos_contain_elf,tit_len
0,ELF MURDER,1,10
1,ENCINO ELF,8,10
2,GHOSTBUSTERS ELF,14,16
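
As a cross-check of the numbers above, the same two values can be computed in plain Python. This is only an illustration of what `POSITION()` and `LENGTH()` return; the helper name `position` is chosen here and is not a library function.

```python
titles = ["GHOSTBUSTERS ELF", "ELF MURDER", "ENCINO ELF"]

def position(substring, text):
    """1-based index of the first occurrence, or 0 if absent (like SQL POSITION)."""
    return text.find(substring) + 1

# (pos_contain_elf, title, tit_len), ordered by position as in the query
rows = sorted((position("ELF", t), t, len(t)) for t in titles)
for pos, title, length in rows:
    print(title, pos, length)
```

Python's `str.find` is 0-based and returns -1 when nothing is found, so adding 1 gives the SQL convention of 1-based positions with 0 meaning "not found".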


## 2. Extending PostgreSQL

#### 2.1 User-defined data-types

**`Enumerated` datatypes**, for example:

           CREATE TYPE dayofweek AS ENUM('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')
and we can query the system table called `pg_type`

**Getting information about `user-defined data types`**. For example, this query returns the name of the data type (`typname`) and its category (`typcategory`):

                    SELECT typname, typcategory
                    FROM pg_type
                    WHERE typname = 'dayofweek'
The query returns the name of the data type we just created, `dayofweek`, and its category `E` (enumerated).

| typname  | typcategory |
|----------|-------------|
| dayofweek|  E          |

Recall from [basic_datatype_SQL](https://github.com/Nhan121/Lectures_notes-teaching-in-VN-/blob/master/SQL%20practices/Functions%20for%20Manipulating%20Data%20in%20PostgreSQL/common-data-types.ipynb) that we can list the data type of each column by querying `INFORMATION_SCHEMA.COLUMNS`:


In [18]:
pd.read_sql(
    """
    SELECT column_name, data_type, udt_name
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE table_name = 'film';
    """, con = engine)

Unnamed: 0,column_name,data_type,udt_name
0,film_id,smallint,int2
1,title,character varying,varchar
2,description,text,text
3,release_year,integer,int4
4,language_id,smallint,int2
5,original_language_id,smallint,int2
6,rental_duration,smallint,int2
7,rental_rate,numeric,numeric
8,length,smallint,int2
9,replacement_cost,numeric,numeric


The `udt_name` column holds the name of the underlying data type. For a `user-defined data type` such as `mpaa_rating`, it contains the name that was given in the `CREATE TYPE` statement. Indeed,

                CREATE TYPE mpaa_rating AS ENUM (
                    'G',
                    'PG',
                    'PG-13',
                    'R',
                    'NC-17'
                );
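
For readers more familiar with Python, an enumerated type plays a role similar to Python's `enum.Enum`: a fixed, ordered list of allowed values. The sketch below is an analogy only; the class name `MpaaRating` and the helper `parse_rating` are names chosen here, not part of Sakila or PostgreSQL.

```python
from enum import Enum

class MpaaRating(Enum):
    """Python analogue of the mpaa_rating ENUM (same values, same order)."""
    G = "G"
    PG = "PG"
    PG_13 = "PG-13"
    R = "R"
    NC_17 = "NC-17"

def parse_rating(value):
    """Return the matching member, or None if the value is not allowed."""
    try:
        return MpaaRating(value)
    except ValueError:
        return None

valid = parse_rating("PG-13")    # an allowed value
invalid = parse_rating("X")      # not in the list, so rejected
ordered = [member.value for member in MpaaRating]
print(valid, invalid, ordered)
```

In the database the check is enforced by the column type itself: inserting a value outside the list into an `mpaa_rating` column raises an error.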

#### 2.2. User-defined function

For example, we will define the function `squared` as follows:

        CREATE FUNCTION squared(i integer) RETURNS integer AS $$ 
            BEGIN            
                RETURN i*i;
            END;
        $$ LANGUAGE plpgsql;
After that,

        SELECT squared(10);        
Result        

| squared |
|---------|
| 100     |
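
The idea of registering your own function and then calling it from SQL can also be tried locally with Python's built-in `sqlite3` module. This is a sketch of the same concept in SQLite, not PL/pgSQL: `create_function` registers a Python callable under a SQL name for the lifetime of the connection.

```python
import sqlite3

def squared(i):
    """Python implementation backing the SQL function squared()."""
    return i * i

conn = sqlite3.connect(":memory:")
# Register the callable under the SQL name "squared", taking 1 argument
conn.create_function("squared", 1, squared)

result = conn.execute("SELECT squared(10)").fetchone()[0]
print(result)
```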

#### Some `user-defined functions` given in the `SAKILA database`

**1) get_customer_balance(customer_id, effective_date)**: calculates the current outstanding balance for a given customer.

        CREATE FUNCTION get_customer_balance(p_customer_id integer, p_effective_date timestamp with time zone) 
                        RETURNS numeric LANGUAGE plpgsql AS $$
                           -- The lines below are comments describing the calculation
                           --#OK, WE NEED TO CALCULATE THE CURRENT BALANCE GIVEN A CUSTOMER_ID AND A DATE
                           --#THAT WE WANT THE BALANCE TO BE EFFECTIVE FOR. THE BALANCE IS:
                           --#   1) RENTAL FEES FOR ALL PREVIOUS RENTALS
                           --#   2) ONE DOLLAR FOR EVERY DAY THE PREVIOUS RENTALS ARE OVERDUE
                           --#   3) IF A FILM IS MORE THAN RENTAL_DURATION * 2 OVERDUE, CHARGE THE REPLACEMENT_COST
                           --#   4) SUBTRACT ALL PAYMENTS MADE BEFORE THE DATE SPECIFIED
                           
        -- Now declare and begin the function                   
        DECLARE
            v_rentfees DECIMAL(5,2); --#FEES PAID TO RENT THE VIDEOS INITIALLY
            v_overfees INTEGER;      --#LATE FEES FOR PRIOR RENTALS
            v_payments DECIMAL(5,2); --#SUM OF PAYMENTS MADE PREVIOUSLY
        BEGIN
        
            SELECT COALESCE(SUM(film.rental_rate),0) INTO v_rentfees
            FROM film, inventory, rental
            WHERE film.film_id = inventory.film_id
              AND inventory.inventory_id = rental.inventory_id
              AND rental.rental_date <= p_effective_date
              AND rental.customer_id = p_customer_id;

            SELECT COALESCE(SUM(IF((rental.return_date - rental.rental_date) > (film.rental_duration * '1 day'::interval),
                ((rental.return_date - rental.rental_date) - (film.rental_duration * '1 day'::interval)),0)),0) INTO v_overfees
            FROM rental, inventory, film
            WHERE film.film_id = inventory.film_id
              AND inventory.inventory_id = rental.inventory_id
              AND rental.rental_date <= p_effective_date
              AND rental.customer_id = p_customer_id;

            SELECT COALESCE(SUM(payment.amount),0) INTO v_payments
            FROM payment
            WHERE payment.payment_date <= p_effective_date
            AND payment.customer_id = p_customer_id;

            RETURN v_rentfees + v_overfees - v_payments;
        END
        $$;
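
The arithmetic in this function is easier to follow with concrete numbers. The Python sketch below mirrors the three sums (rental fees, one unit per overdue day, payments) on hypothetical data. It is an illustration under simplifying assumptions: overdue time is counted in whole days, rentals that have not been returned add no late fee, and, like the SQL body shown, it does not apply rule 3 (the replacement cost).

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical rentals of one customer:
# (rental_rate, rental_duration in days, rental_date, return_date)
rentals = [
    (Decimal("4.99"), 3, datetime(2005, 5, 1), datetime(2005, 5, 3)),    # on time
    (Decimal("0.99"), 3, datetime(2005, 5, 10), datetime(2005, 5, 15)),  # 2 days late
    (Decimal("2.99"), 5, datetime(2005, 6, 1), None),                    # still out
]
# Hypothetical payments: (amount, payment_date)
payments = [
    (Decimal("4.99"), datetime(2005, 5, 1)),
    (Decimal("0.99"), datetime(2005, 5, 10)),
]

def customer_balance(rentals, payments, effective_date):
    """Rental fees plus overdue days minus payments, up to effective_date."""
    in_scope = [r for r in rentals if r[2] <= effective_date]
    rent_fees = sum((r[0] for r in in_scope), Decimal("0"))
    over_fees = 0
    for _, duration, rented, returned in in_scope:
        if returned is not None:
            over_fees += max((returned - rented).days - duration, 0)
    paid = sum((p[0] for p in payments if p[1] <= effective_date), Decimal("0"))
    return rent_fees + over_fees - paid

balance = customer_balance(rentals, payments, datetime(2005, 7, 1))
print(balance)  # 8.97 in fees + 2 overdue days - 5.98 paid
```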

**2) inventory_held_by_customer(inventory_id)**: returns the `customer_id` of the customer currently renting an inventory item, or NULL if the item is currently available.

                CREATE OR REPLACE FUNCTION inventory_held_by_customer(p_inventory_id integer) RETURNS integer
                    LANGUAGE plpgsql
                    AS $$
                DECLARE
                    v_customer_id INTEGER;
                BEGIN

                  SELECT customer_id INTO v_customer_id
                  FROM rental
                  WHERE return_date IS NULL
                  AND inventory_id = p_inventory_id;

                  RETURN v_customer_id;
                END $$;

**3) inventory_in_stock(inventory_id)**: returns a boolean value of whether an inventory item is currently in stock.

            CREATE OR REPLACE FUNCTION inventory_in_stock(p_inventory_id integer) RETURNS boolean
                LANGUAGE plpgsql
                AS $$
            DECLARE
                v_rentals INTEGER;
                v_out     INTEGER;
            BEGIN
                -- AN ITEM IS IN-STOCK IF THERE ARE EITHER NO ROWS IN THE rental TABLE
                -- FOR THE ITEM OR ALL ROWS HAVE return_date POPULATED

                SELECT count(*) INTO v_rentals
                FROM rental
                WHERE inventory_id = p_inventory_id;

                IF v_rentals = 0 THEN
                  RETURN TRUE;
                END IF;

                SELECT COUNT(rental_id) INTO v_out
                FROM inventory LEFT JOIN rental USING(inventory_id)
                WHERE inventory.inventory_id = p_inventory_id
                AND rental.return_date IS NULL;

                IF v_out > 0 THEN
                  RETURN FALSE;
                ELSE
                  RETURN TRUE;
                END IF;
            END $$;
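
The logic of the last two functions can be summarised in a few lines of Python. This is a sketch on hypothetical in-memory rows, not a replacement for the SQL: a rental with `return_date` set to `None` stands for a row whose `return_date` is NULL, and the sample rows are invented for illustration.

```python
# Hypothetical rental rows; return_date None means "not returned yet"
rentals = [
    {"inventory_id": 9, "customer_id": 366, "return_date": None},
    {"inventory_id": 10, "customer_id": 12, "return_date": "2005-06-01"},
]

def inventory_held_by_customer(rentals, inventory_id):
    """customer_id of the open rental for this item, or None if there is none."""
    for rental in rentals:
        if rental["inventory_id"] == inventory_id and rental["return_date"] is None:
            return rental["customer_id"]
    return None

def inventory_in_stock(rentals, inventory_id):
    """True if the item was never rented or every rental has been returned."""
    item_rentals = [r for r in rentals if r["inventory_id"] == inventory_id]
    if not item_rentals:
        return True
    return all(r["return_date"] is not None for r in item_rentals)

print(inventory_held_by_customer(rentals, 9), inventory_in_stock(rentals, 9))
print(inventory_held_by_customer(rentals, 10), inventory_in_stock(rentals, 10))
print(inventory_held_by_customer(rentals, 11), inventory_in_stock(rentals, 11))
```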

### EXERCISEs
#### Exercise 2.1. User-defined data types
ENUM or enumerated data types are great options to use in your database when you have a column where you want to store a fixed list of values that rarely change. Examples of when it would be appropriate to use an **`ENUM`** include days of the week and states or provinces in a country.

Another example can be the directions on a `compass` (i.e., `north`, `south`, `east` and `west`.) 

In this exercise, you are going to create a new **`ENUM`** `data type`, called `compass_position`.

#### Instructions
**Step 1.** Create a new `enumerated data type` called `compass_position`, using the four positions of a compass as the values.

                -- Create an enumerated data type, compass_position
                CREATE TYPE compass_position AS ENUM (
                    'North', 
                    'South',
                    'East', 
                    'West'
                );

**query result:**

                Your query did not generate any results.

**Step 2.** Verify that the new data type has been created by looking in the `pg_type system table`.

In [19]:
pd.read_sql(
    """ 
        -- Create an enumerated data type, compass_position
        CREATE TYPE compass_position AS ENUM (
            -- Use the four cardinal directions
            'North', 
            'South',
            'East', 
            'West'
        );
        -- Confirm the new data type is in the pg_type system table
        SELECT *
        FROM pg_type
        WHERE typname='compass_position';
    """, con = engine)

Unnamed: 0,typname,typnamespace,typowner,typlen,typbyval,typtype,typcategory,typispreferred,typisdefined,typdelim,...,typalign,typstorage,typnotnull,typbasetype,typtypmod,typndims,typcollation,typdefaultbin,typdefault,typacl
0,compass_position,2200,16384,4,True,e,E,False,True,",",...,i,p,False,0,-1,0,0,,,


Now let's take a closer look at some of the sample user-defined data types that are available in the `Sakila DVD Rental database`.

#### Exercise 2.2. Getting info about user-defined data types
The Sakila database has a `user-defined` **`enum data type`** called `mpaa_rating`. The rating column in the film table is an mpaa_rating type and contains the familiar rating for that film like `PG` or `R`. This is a great example of when an enumerated data type comes in handy. Film ratings have a limited number of standard values that rarely change.

When you want to learn about a column or data type in your database the best place to start is the **`INFORMATION_SCHEMA`**. You can find information about the `rating` column that can help you learn about the type of data you can expect to find. For `enum data types`, you can also find the specific values that are valid for a particular `enum` by looking in the `pg_enum system` table. Let's dive into the exercises and learn more.

#### Instructions
**Step 1.** Select the `column_name`, `data_type` and `udt_name` columns, then filter for the `rating` column in the `film table`.

In [20]:
pd.read_sql(
    """
    -- Select the column name, data type and udt name columns
    SELECT column_name, data_type, udt_name
    FROM INFORMATION_SCHEMA.COLUMNS 
    -- Filter by the rating column in the film table
    WHERE table_name='film' AND column_name ='rating';
    """, con = engine)

Unnamed: 0,column_name,data_type,udt_name
0,rating,USER-DEFINED,mpaa_rating


**Step 2.** Select all columns from the `pg_type` table where the type name is equal to `mpaa_rating`.

In [21]:
pd.read_sql(
    """ 
    SELECT *
    FROM pg_type
    WHERE pg_type.typname='mpaa_rating'
    """, con = engine)

Unnamed: 0,typname,typnamespace,typowner,typlen,typbyval,typtype,typcategory,typispreferred,typisdefined,typdelim,...,typalign,typstorage,typnotnull,typbasetype,typtypmod,typndims,typcollation,typdefaultbin,typdefault,typacl
0,mpaa_rating,2200,16384,4,True,e,E,False,True,",",...,i,p,False,0,-1,0,0,,,


Notice that the `mpaa_rating` type has a `typcategory` of `E`, which means it is an enumerated data type.

For comparison, let's look at the first rows of `pg_type`:

In [22]:
pd.read_sql(
    """
    SELECT *
    FROM pg_type LIMIT 12
    """, con = engine)

Unnamed: 0,typname,typnamespace,typowner,typlen,typbyval,typtype,typcategory,typispreferred,typisdefined,typdelim,...,typalign,typstorage,typnotnull,typbasetype,typtypmod,typndims,typcollation,typdefaultbin,typdefault,typacl
0,bool,11,10,1,True,b,B,True,True,",",...,c,p,False,0,-1,0,0,,,
1,bytea,11,10,-1,False,b,U,False,True,",",...,i,x,False,0,-1,0,0,,,
2,char,11,10,1,True,b,S,False,True,",",...,c,p,False,0,-1,0,0,,,
3,name,11,10,64,False,b,S,False,True,",",...,c,p,False,0,-1,0,0,,,
4,int8,11,10,8,True,b,N,False,True,",",...,d,p,False,0,-1,0,0,,,
5,int2,11,10,2,True,b,N,False,True,",",...,s,p,False,0,-1,0,0,,,
6,int2vector,11,10,-1,False,b,A,False,True,",",...,i,p,False,0,-1,0,0,,,
7,int4,11,10,4,True,b,N,False,True,",",...,i,p,False,0,-1,0,0,,,
8,regproc,11,10,4,True,b,N,False,True,",",...,i,p,False,0,-1,0,0,,,
9,text,11,10,-1,False,b,S,True,True,",",...,i,x,False,0,-1,0,100,,,


#### Exercise 2.3. User-defined functions in Sakila
If you were running a `real-life DVD Rental` store, there are many questions that you may need to answer repeatedly like whether a film is in stock at a particular store or the outstanding balance for a particular customer. These types of scenarios are where user-defined functions will come in very handy. The Sakila database has several user-defined functions pre-defined. These functions are available out-of-the-box and can be used in your queries like many of the built-in functions we've learned about in this course.

In this exercise, you will build a query step-by-step that can be used to produce a report to determine which film title is currently held by which customer using the **`inventory_held_by_customer()`** function.

#### Instructions 
**Step 1.** Select the `title` and `inventory_id` columns from the `film` and `inventory` tables in the database.

In [23]:
pd.read_sql(
    """
    SELECT f.title, i.inventory_id -- Select the film title and inventory ids
    FROM film AS f 
        INNER JOIN inventory AS i 	-- Join the film table to the inventory table
            ON f.film_id = i.film_id
    LIMIT 10
    """, con = engine)

Unnamed: 0,title,inventory_id
0,ACE GOLDFINGER,9
1,ACE GOLDFINGER,10
2,ACE GOLDFINGER,11
3,ADAPTATION HOLES,12
4,ADAPTATION HOLES,13
5,ADAPTATION HOLES,14
6,ADAPTATION HOLES,15
7,AFFAIR PREJUDICE,16
8,AFFAIR PREJUDICE,17
9,AFFAIR PREJUDICE,18


**Step 2.** Add a column that uses **`inventory_held_by_customer()`** to determine whether each `inventory_id` is currently held by a customer, and alias the column as `held_by_cust`.

In [24]:
pd.read_sql(
    """
    SELECT f.title, i.inventory_id,
           inventory_held_by_customer(i.inventory_id) AS held_by_cust -- Determine whether the inventory is held by a customer
    FROM film as f 
        INNER JOIN inventory AS i ON f.film_id=i.film_id  	-- Join the film table to the inventory table
    """, con = engine)

Unnamed: 0,title,inventory_id,held_by_cust
0,ACE GOLDFINGER,9,366.0
1,ACE GOLDFINGER,10,
2,ACE GOLDFINGER,11,
3,ADAPTATION HOLES,12,
4,ADAPTATION HOLES,13,
5,ADAPTATION HOLES,14,
6,ADAPTATION HOLES,15,
7,AFFAIR PREJUDICE,16,
8,AFFAIR PREJUDICE,17,
9,AFFAIR PREJUDICE,18,


**Step 3.** Now filter your query to only return records where the **`inventory_held_by_customer()`** function returns a **`non-null value`**.

In [25]:
pd.read_sql(
    """
    SELECT f.title, i.inventory_id, -- Select the film title and inventory ids
            -- Determine whether the inventory is held by a customer
            inventory_held_by_customer(i.inventory_id) as held_by_cust 
    FROM film as f 
        INNER JOIN inventory AS i 
            ON f.film_id=i.film_id 
    
    -- Only include results where the held_by_cust is not null
    WHERE inventory_held_by_customer(i.inventory_id) IS NOT NULL 
""", con = engine)

Unnamed: 0,title,inventory_id,held_by_cust
0,ACE GOLDFINGER,9,366
1,AFFAIR PREJUDICE,21,111
2,AFRICAN EGG,25,590
3,ALONE TRIP,81,236
4,AMERICAN CIRCUS,106,44
...,...,...,...
135,WILD APOLLO,4460,274
136,WINDOW SIDE,4472,374
137,WOMEN DORADO,4496,216
138,WORLD LEATHERNECKS,4537,532


`User-defined types` and `functions` provide you with advanced capabilities for managing and querying your data in `PostgreSQL`.

## 3. Intro to PostgreSQL extensions

#### 3.1. Commonly used [PostgreSQL extensions](https://www.postgresql.org/download/products/6-postgresql-extensions/)
In this section, we focus only on the following extensions:
> **`PostGIS`** : adds support for geographic objects to the `PostgreSQL` object-relational database. In effect, `PostGIS` "spatially enables" the PostgreSQL server, allowing it to be used as a backend spatial database for `geographic information systems (GIS)`, much like ESRI's `SDE` or `Oracle's Spatial extension`. `PostGIS` follows the `OpenGIS` `"Simple Features Specification for SQL"` and has been certified as compliant with the `"Types and Functions"` profile.
> 
> **`PostPic`** : is an extension for the open source `dbms` `PostgreSQL` that enables image processing inside the database, like `PostGIS` does for `spatial data`. It adds the new 'image' type to the `SQL`, and several functions to process images and to extract their attributes.
>
> **`fuzzystrmatch`**: provides several [functions](https://www.postgresql.org/docs/9.1/fuzzystrmatch.html) to determine similarities and distance between strings.
>
> **`pg_trgm`**: provides [functions and operators](https://www.postgresql.org/docs/current/pgtrgm.html) for `determining` the *similarity of alphanumeric text* based on `trigram matching`, as well as index operator classes that support fast searching for similar strings.

#### 3.2. Querying extension meta-data
**Available extensions**. To discover which extensions are available in your `PostgreSQL` installation, query `pg_available_extensions`:

In [26]:
pd.read_sql(
    """ 
    SELECT name
    FROM pg_available_extensions
    """, con = engine)

Unnamed: 0,name
0,insert_username
1,fuzzystrmatch
2,pageinspect
3,file_fdw
4,tsm_system_time
5,lo
6,dblink
7,timetravel
8,unaccent
9,earthdistance


This lists the name of every `extension` that is available to be enabled.

**Installed extensions.** Likewise, the `pg_extension` system table shows which extensions are already installed and ready to use.

In [27]:
pd.read_sql(
    """ 
        SELECT extname
        FROM pg_extension
    """, con = engine)

Unnamed: 0,extname
0,plpgsql
1,fuzzystrmatch
2,pg_trgm


Next, let's enable the `fuzzystrmatch extension` if it is not already enabled, and confirm that it appears in `pg_extension`.

In [28]:
pd.read_sql(
    """ 
    -- Enable the fuzzystrmatch extension
    CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
    -- Confirm the fuzzystrmatch is enabled or not
    SELECT extname 
    FROM pg_extension
    """, con = engine)

Unnamed: 0,extname
0,plpgsql
1,fuzzystrmatch
2,pg_trgm


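#### 3.2. Fuzzy searching with `fuzzystrmatch`

The `levenshtein` function calculates the `Levenshtein distance` between two strings, i.e. the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other: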

In [29]:
pd.read_sql(
    """ 
        SELECT levenshtein('GUMBO', 'GAMBO') AS u_vs_a,
               levenshtein('GUMBO', 'GOMBO') AS u_vs_o,
               levenshtein('GUMBO', 'GINBO') AS um_vs_in,
               levenshtein('GUMBO', 'Biingooo') AS none_fit
    """, con = engine)

Unnamed: 0,u_vs_a,u_vs_o,um_vs_in,none_fit
0,1,1,2,8
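
To make the distance concrete, here is a textbook dynamic-programming sketch of the Levenshtein distance in Python. It is not the extension's source code, only a small reference implementation; like the query above it is case-sensitive, so `O` and `o` count as different characters.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    # prev[j] = distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute (or keep if equal)
            ))
        prev = curr
    return prev[-1]

distances = [levenshtein("GUMBO", other)
             for other in ("GAMBO", "GOMBO", "GINBO", "Biingooo")]
print(distances)
```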


#### 3.3. Compare 2 strings with `pg_trgm`
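`pg_trgm` compares two strings in a similar way, but the metric here is the `similarity` function, which returns a value between 0 (no shared trigrams) and 1 (identical sets of trigrams):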

In [30]:
pd.read_sql(
    """
        SELECT similarity('GUMBO', 'GAMBO') AS u_vs_a,
               similarity('GUMBO', 'GOMBO') AS u_vs_o,
               similarity('GUMBO', 'GINBO') AS um_vs_in,
               similarity('GUMBO', 'Biingooo') AS none_fit 
    """, con = engine)

Unnamed: 0,u_vs_a,u_vs_o,um_vs_in,none_fit
0,0.333333,0.333333,0.2,0
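
The numbers above can be reproduced by hand with a short Python sketch of trigram matching. It assumes the rules described in the `pg_trgm` documentation: text is lower-cased, each word is padded with two spaces in front and one behind, and similarity is the number of shared trigrams divided by the number of distinct trigrams in both strings. It is a simplified model, not the extension's implementation.

```python
import re

def trigrams(text):
    """Set of 3-character sequences of each padded, lower-cased word."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    grams_a, grams_b = trigrams(a), trigrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

for other in ("GAMBO", "GOMBO", "GINBO", "Biingooo"):
    print(other, round(similarity("GUMBO", other), 6))
```

For example, `GUMBO` and `GAMBO` each produce 6 trigrams and share 3 of them, which gives 3 / 9 = 0.333333.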


### EXERCISEs.

#### Exercise 3.1. Enabling extensions
Before you can use the capabilities of an extension it must be enabled. As you have previously learned, most `PostgreSQL` distributions come pre-bundled with many useful extensions to help extend the native features of your database. You will be working with `fuzzystrmatch` and `pg_trgm` in upcoming exercises, but before you can practice using these extensions you need to make sure they are enabled in your database. In this exercise you will enable the `pg_trgm` extension and confirm that the `fuzzystrmatch` extension, which was enabled earlier, is still enabled by querying the `pg_extension` system table.

#### Instructions
**Step 1.** Enable the `pg_trgm` extension

                -- Enable the pg_trgm extension
                CREATE EXTENSION IF NOT EXISTS pg_trgm;
**query result:**
    
                Your query did not generate any results.

**Step 2.** Now confirm that both `fuzzystrmatch` and `pg_trgm` are enabled by selecting all rows from the appropriate system table.

In [31]:
pd.read_sql(
    """ 
    -- Select all rows from the extensions system table
        SELECT * 
        FROM pg_extension;
    """, con = engine)

Unnamed: 0,extname,extowner,extnamespace,extrelocatable,extversion,extconfig,extcondition
0,plpgsql,10,11,False,1.0,,
1,fuzzystrmatch,16384,2200,True,1.1,,
2,pg_trgm,16384,2200,True,1.4,,


#### Exercise 3.2. Measuring similarity between two strings
Now that you have enabled the `fuzzystrmatch` and `pg_trgm` extensions, you can begin to explore their capabilities.

First, we will measure the similarity between the `title` and `description` from the `film` table of the Sakila database.

#### Instructions
Select the `film title` and `description`. Then calculate the `similarity` between the `title` and `description`.

In [32]:
pd.read_sql(
    """ 
        SELECT title, description, -- Select the title and description columns
               similarity(title, description) -- Calculate the similarity
        FROM film
        LIMIT 29
    """, con = engine)

Unnamed: 0,title,description,similarity
0,BEACH HEARTBREAKERS,A Fateful Display of a Womanizer And a Mad Sci...,0.0
1,BEAST HUNCHBACK,A Awe-Inspiring Epistle of a Student And a Squ...,0.022222
2,BEDAZZLED MARRIED,A Astounding Character Study of a Madman And a...,0.029126
3,BEHAVIOR RUNAWAY,A Unbelieveable Drama of a Student And a Husba...,0.021277
4,BETRAYED REAR,A Emotional Character Study of a Boat And a Pi...,0.011111
5,BILKO ANONYMOUS,A Emotional Reflection of a Teacher And a Man ...,0.021277
6,BIRDCAGE CASPER,A Fast-Paced Saga of a Frisbee And a Astronaut...,0.0
7,BLUES INSTINCT,A Insightful Documentary of a Boat And a Compo...,0.037037
8,BORROWERS BEDAZZLED,A Brilliant Epistle of a Teacher And a Sumo Wr...,0.019802
9,BUBBLE GROSSE,A Awe-Inspiring Panorama of a Crocodile And a ...,0.047059


The `similarity` column shows that the `title` and `description` columns are **not very similar**: most rows return a value close to 0. The counts below confirm this:

In [33]:
pd.read_sql(
    """ 
        SELECT 'low_similarity' AS count_meaning, COUNT(*) AS value
        FROM film 
        WHERE similarity(title, description) <= 0.05

        UNION

        SELECT 'total_count' , COUNT(*)
        FROM film
    """, con = engine)

Unnamed: 0,count_meaning,value
0,low_similarity,925
1,total_count,985


Next, let's see how the `levenshtein` function can account for spelling errors in the search text.

#### Exercise 3.3. Levenshtein distance examples
Now let's take a closer look at how we can use the `levenshtein` function to match strings against text data. If you recall, the levenshtein distance represents the number of edits required to convert one string to another string being compared.

In a search application or when performing data analysis on any data that contains manual user input, you will always want to account for `typos` or `incorrect spellings`. The `levenshtein function` provides a great method for performing this task. In this exercise, we will perform a query against the film table using a search string with a misspelling and use the results from `levenshtein` to determine a match. Let's check it out.

#### Instructions
Select the `film title` and `film description`. Then calculate the `levenshtein distance` for the film title with the string `JET NEIGHBOR`.

In [34]:
pd.read_sql(
    """ 
        SELECT title, description, -- Select the title and description columns
               levenshtein(title, 'JET NEIGHBOR') AS distance -- Calculate the levenshtein distance
        FROM film
        ORDER BY 3
        LIMIT 29
    """, con = engine)

Unnamed: 0,title,description,distance
0,JET NEIGHBORS,A Amazing Display of a Lumberjack And a Teache...,1
1,HILLS NEIGHBORS,A Epic Display of a Hunter And a Feminist who ...,6
2,BED HIGHBALL,A Astounding Panorama of a Lumberjack And a Do...,7
3,WEST LION,A Intrepid Drama of a Butler And a Lumberjack ...,8
4,EGG IGBY,A Beautiful Documentary of a Boat And a Sumo W...,8
5,COAST RAINBOW,A Astounding Documentary of a Mad Cow And a Pi...,9
6,WAIT CIDER,A Intrepid Epistle of a Woman And a Forensic P...,9
7,CLYDE THEORY,A Beautiful Yarn of a Astronaut And a Frisbee ...,9
8,STONE FIRE,A Intrepid Drama of a Astronaut And a Crocodil...,9
9,NEWSIES STORY,A Action-Packed Character Study of a Dog And a...,9


Because the results are sorted by the value of the `levenshtein` function, the first row is the closest match: a single edit (appending `S`) turns the search string `JET NEIGHBOR` into the title `JET NEIGHBORS`.
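
In a search feature, the distance is usually compared against a tolerance instead of being read off a sorted list. The sketch below illustrates that pattern under several assumptions: SQLite stands in for PostgreSQL, a small Python function registered as `levenshtein` emulates the extension function, only five titles copied from the result above are loaded, and `MAX_DISTANCE = 2` is an arbitrary, hypothetical tolerance.

```python
import sqlite3


def levenshtein(a, b):
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


# Five titles copied from the result table above, not the full film table.
titles = ["JET NEIGHBORS", "HILLS NEIGHBORS", "BED HIGHBALL", "WEST LION", "EGG IGBY"]

conn = sqlite3.connect(":memory:")
conn.create_function("levenshtein", 2, levenshtein)  # emulate the SQL function
conn.execute("CREATE TABLE film (title TEXT)")
conn.executemany("INSERT INTO film VALUES (?)", [(t,) for t in titles])

MAX_DISTANCE = 2  # hypothetical tolerance: accept at most two edits

matches = conn.execute(
    """
    SELECT title, levenshtein(title, :search) AS distance
    FROM film
    WHERE levenshtein(title, :search) <= :max_distance
    ORDER BY distance
    """,
    {"search": "JET NEIGHBOR", "max_distance": MAX_DISTANCE},
).fetchall()
print(matches)
```

With this sample only `JET NEIGHBORS` (distance 1) passes the filter. A suitable tolerance depends on the length of the strings involved, so the value should be tuned for the data at hand.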

Next, look at the summary statistics (`min`, `avg`, `max`) of the distance over all titles:

In [35]:
pd.read_sql("""
    SELECT MIN(levenshtein(title, 'JET NEIGHBOR')) AS min_distance,
           AVG(levenshtein(title, 'JET NEIGHBOR')) AS avg_distance,
           MAX(levenshtein(title, 'JET NEIGHBOR')) AS max_distance
    FROM film
    """, con = engine)

Unnamed: 0,min_distance,avg_distance,max_distance
0,1,12.505584,21


And what about the average distance within each partition of the data?

In [36]:
pd.read_sql(
    """ 
    WITH distances AS (SELECT film_id, levenshtein(title, 'JET NEIGHBOR') AS dist
                        FROM film)
    SELECT 'dist <= 10' AS where_condition, 
            ROUND(AVG(dist), 3) AS avg_dist, COUNT(dist) 
    FROM (SELECT dist
          FROM distances WHERE dist <= 10
          ) AS small_dist
    UNION
    SELECT 'dist >= 15' AS where_condition, 
            ROUND(AVG(dist), 3) AS avg_dist, COUNT(dist) 
    FROM (SELECT dist
          FROM distances WHERE dist >= 15 
          ) AS high_dist
    UNION
    SELECT '14 >= dist >= 11' AS where_condition, 
            ROUND(AVG(dist), 3) AS avg_dist, COUNT(dist) 
    FROM (SELECT dist
          FROM distances WHERE (dist >= 11) AND dist <= 14
          ) AS med_dist
    UNION
    SELECT 'whole_data' AS where_condition,
            ROUND(AVG(levenshtein(title, 'JET NEIGHBOR')), 3) AS avg_whole, COUNT(*)
    FROM film ORDER BY count
    """, con = engine)

Unnamed: 0,where_condition,avg_dist,count
0,dist <= 10,9.658,114
1,dist >= 15,16.124,145
2,14 >= dist >= 11,12.23,726
3,whole_data,12.506,985
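
The three range queries glued together with `UNION` can also be written as a single pass that labels each row with a `CASE` expression and then groups by that label. The sketch below shows the idea under clearly artificial assumptions: SQLite stands in for PostgreSQL, and the `distances` table holds a handful of invented toy values, not the real output of `levenshtein` on the `film` table.

```python
import sqlite3

# Toy distances invented for illustration; NOT taken from the film table.
toy_distances = [1, 6, 9, 11, 12, 14, 15, 21]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE distances (dist INTEGER)")
conn.executemany("INSERT INTO distances VALUES (?)", [(d,) for d in toy_distances])

rows = conn.execute(
    """
    SELECT CASE WHEN dist <= 10 THEN 'dist <= 10'
                WHEN dist <= 14 THEN '14 >= dist >= 11'
                ELSE 'dist >= 15'
           END AS where_condition,          -- label each row with its bucket
           ROUND(AVG(dist), 3) AS avg_dist,
           COUNT(*) AS n_rows
    FROM distances
    GROUP BY where_condition                -- one output row per bucket
    ORDER BY MIN(dist)
    """
).fetchall()

for row in rows:
    print(row)
```

In PostgreSQL, the `distances` table would be replaced by the CTE from the query above. Every row is read once, and adding another bucket only requires one more `WHEN` branch.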


#### Exercise 3.4. Putting it all together
In this exercise, we are going to use many of the techniques and concepts we learned throughout the course to generate a data set that we could use to predict whether the words and phrases used to describe a film have an impact on the number of rentals.

First, you need to create a tsvector from the description column in the film table. You will match against a tsquery to determine if the phrase `"Astounding Drama"` leads to more rentals per month. Next, create a new column using the similarity function to rank the film descriptions based on this phrase.

#### Instructions
**Step 1.** Select the `title` and `description` for all DVDs from the `film` table.
Perform a `full-text search` by converting the `description` to a `tsvector` and match it to the phrase `'Astounding & Drama'` using a `tsquery` in the `WHERE` clause.

In [37]:
pd.read_sql(
    """ 
        SELECT title, description -- Select the title and description columns
        FROM film
        WHERE to_tsvector(description) @@ to_tsquery('Astounding & Drama');  -- Match "Astounding Drama" in the description
    """, con = engine)

Unnamed: 0,title,description
0,COWBOY DOOM,A Astounding Drama of a Boy And a Lumberjack w...
1,BIKINI BORROWERS,A Astounding Drama of a Astronaut And a Cat wh...
2,CAMPUS REMEMBER,A Astounding Drama of a Crocodile And a Mad Co...
3,ENCINO ELF,A Astounding Drama of a Feminist And a Teacher...
4,GLASS DYING,A Astounding Drama of a Frisbee And a Astronau...


**Step 2.** Add a new column that calculates the `similarity` of the `description` with the phrase `'Astounding Drama'`.

Sort the results by the new `similarity` column in `descending order`.

In [38]:
pd.read_sql(
    """ 
        SELECT title, description, 
               similarity(description, 'Astounding Drama') -- Calculate the similarity
        FROM film 
        WHERE to_tsvector(description) @@ to_tsquery('Astounding & Drama') 
        ORDER BY similarity(description, 'Astounding Drama') DESC;
    """, con = engine)

Unnamed: 0,title,description,similarity
0,COWBOY DOOM,A Astounding Drama of a Boy And a Lumberjack w...,0.246377
1,GLASS DYING,A Astounding Drama of a Frisbee And a Astronau...,0.239437
2,CAMPUS REMEMBER,A Astounding Drama of a Crocodile And a Mad Co...,0.236111
3,ENCINO ELF,A Astounding Drama of a Feminist And a Teacher...,0.22973
4,BIKINI BORROWERS,A Astounding Drama of a Astronaut And a Cat wh...,0.195402


Great work! We have just scratched the surface of what you can do with full-text search and `natural language processing` with `PostgreSQL` extensions. I encourage you to keep exploring these capabilities.