<div style="position: relative;">
<img src="https://user-images.githubusercontent.com/7065401/98728503-5ab82f80-2378-11eb-9c79-adeb308fc647.png"></img>

<h1 style="color: white; position: absolute; top:30%; left:10%;">
    PostgreSQL for Python Developers
</h1>

<h3 style="color: #ef7d22; font-weight: normal; position: absolute; top:48%; left:10%;">
    David Mertz, Ph.D.
</h3>

<h3 style="color: #ef7d22; font-weight: normal; position: absolute; top:55%; left:10%;">
    Data Scientist
</h3>
</div>
<br/>

<div style="width: 100%; background-color: #ef7d22; text-align: center">
<br><br>

<h1 style="color: white; font-weight: bold;">
    PostgreSQL functions
</h1>

<br><br> 
</div>

PostgreSQL ships with a large number of built-in functions, many of which are also available as operators.  You may also define your own functions in several programming languages, including Python.  In this lesson we also look at PostgreSQL views, which work nicely with functions.

In [1]:
from collections import namedtuple
import psycopg2

cred = dict(user='ine_student', password='ine-password', database='ine', host='localhost')

conn = psycopg2.connect(**cred)
cur = conn.cursor()

## Built-in functions
![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

The hundreds of functions available as built-ins in PostgreSQL can be loosely broken out by the data type(s) they operate on.  For example, we have numeric functions like:

* `abs()`: absolute value
* `cbrt()`: cube root
* `ceil()`: nearest integer greater than or equal to the argument
* `cos()`: cosine (radians)
* `cosd()`: cosine (degrees)
* `degrees()`: radians to degrees
* `exp(x)`: $e^x$ 
* `factorial()`: factorial of a non-negative integer
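
As a rough illustration, the sketch below puts several of these functions into one query and pairs each with a Python `math` counterpart, which can help when checking results on the client side.  The helper name `numeric_builtins` and the sample arguments are assumptions made for this example; `cur` is any open psycopg2 cursor, such as the one created above.

```python
import math

# One query exercising several numeric built-ins (sample arguments are arbitrary)
NUMERIC_SQL = """
SELECT abs(-3), ceil(4.2), cos(0), degrees(pi()), exp(0), factorial(5);
"""

def numeric_builtins(cur):
    """Run the query on an open cursor and return the single result row."""
    cur.execute(NUMERIC_SQL)
    return cur.fetchone()

# Python counterparts of the same calls, in the same order
local = (abs(-3), math.ceil(4.2), math.cos(0), math.degrees(math.pi),
         math.exp(0), math.factorial(5))
print(local)
```

The database may return some of these values as `Decimal` rather than `int` or `float`, so compare them numerically rather than by type.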

Other functions deal with string manipulation, or bit strings, or datetimes, or inet addresses, enums, regular expression matching, geometric functions, and others.  Many of these functions are exposed as, or indeed only available as, operators.

Another special kind of function is an aggregation that takes many inputs—typically the many values in a query column—and combines them into a single value.  Particularly notable among those are `count()`, `avg()`, `min()`, `max()`, and `sum()`.  But more specialized ones like `covar_pop()` (population covariance) or `percent_rank()` (percentile of specified value) are also available.
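
To make "many values in, one value out" concrete, here is a plain-Python sketch of the arithmetic behind `covar_pop(Y, X)`.  The function name `covar_pop_py` is hypothetical, and the sample values are simply the two rows of columns `a` and `b` from the `numbers` table that appear in a later cell of this lesson.  The sketch does not reproduce how PostgreSQL skips NULL inputs.

```python
def covar_pop_py(ys, xs):
    """Population covariance: mean of (x - mean_x) * (y - mean_y) over all pairs."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("need two non-empty sequences of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

a = [4, 7]
b = [3, 9]
print(covar_pop_py(b, a))                                # 4.5
# The simpler aggregates: count, avg, min, max, sum
print(len(a), sum(a) / len(a), min(a), max(a), sum(a))   # 2 5.5 4 7 11
```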

## Operators
![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

Many functions are normally used as operators.  Often these are familiar symbols like `+`, `/`, and `*`.  But an operator name can be (almost) any combination of the characters `+-*/<>=~!@#%^&|?`, up to 63 characters long.  There are a few restrictions, the most notable being one that is not technically imposed: "don't use more than 3-4 symbols in one operator."  We saw a few of these compound operators in earlier examples of text search and geometric types.
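
The "almost" above can be sketched as a small checker.  This is only an approximation of the lexical rules as the PostgreSQL documentation describes them (the documented character set also includes the backtick); the function name `is_valid_operator_name` is hypothetical, and the server itself remains the final authority on what it accepts.

```python
OPERATOR_CHARS = set("+-*/<>=~!@#%^&|`?")
SPECIAL_CHARS = set("~!@#%^&|`?")   # their presence permits a trailing + or -
MAX_LENGTH = 63                      # assumes a default build (NAMEDATALEN - 1)

def is_valid_operator_name(name):
    """Approximate check of PostgreSQL's operator naming rules."""
    if not 1 <= len(name) <= MAX_LENGTH:
        return False
    if not set(name) <= OPERATOR_CHARS:
        return False
    if "--" in name or "/*" in name:      # these would start a comment
        return False
    if name == "=>":                      # reserved by the SQL grammar
        return False
    # A multi-character name may end in + or - only if it also
    # contains at least one of the "special" characters.
    if len(name) > 1 and name[-1] in "+-" and not (set(name) & SPECIAL_CHARS):
        return False
    return True

for candidate in ["<|", "|>", "#", "@-", "*-", "--", "=>"]:
    print(repr(candidate), is_valid_operator_name(candidate))
```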

We can define our own operators in PostgreSQL.  Let us start with a somewhat contrived example.  One string function is `left(str text, n int)`, which returns the first `n` characters of a TEXT or CHAR value.  For example:

In [2]:
cur.execute('SELECT para_num, left(raw_text, para_num) FROM books LIMIT 10;')
cur.fetchall()

[(0, ''),
 (1, 'T'),
 (2, 'Ti'),
 (3, 'Aut'),
 (4, 'Rele'),
 (5, 'Langu'),
 (6, 'Charac'),
 (7, '*** STA'),
 (8, ''),
 (9, '\nProduced')]

Suppose this is a common operation that we would like to abbreviate.  Let's create such shortcuts for both `left()` and `right()`.

In [3]:
left_op = "CREATE OPERATOR <| (LEFTARG = text, RIGHTARG = int, PROCEDURE = left);"
right_op = "CREATE OPERATOR |> (LEFTARG = text, RIGHTARG = int, PROCEDURE = right);"
cur.execute(left_op)
cur.execute(right_op)

In [4]:
# The last 5 of the initial para_num characters
cur.execute('SELECT para_num, (raw_text <| para_num) |> 5 FROM books LIMIT 10;')
cur.fetchall()

[(0, ''),
 (1, 'T'),
 (2, 'Ti'),
 (3, 'Aut'),
 (4, 'Rele'),
 (5, 'Langu'),
 (6, 'harac'),
 (7, '* STA'),
 (8, ''),
 (9, 'duced')]
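
For readers who think in Python slices, the following sketch mimics what `left()` and `right()` do, including their handling of negative counts.  The names `pg_left` and `pg_right` and the sample string are hypothetical; only the slicing logic is the point.

```python
def pg_left(s, n):
    """First n characters; a negative n drops the last |n| characters."""
    return s[:n]

def pg_right(s, n):
    """Last n characters; a negative n drops the first |n| characters."""
    if n >= 0:
        # s[-n:] would be wrong for n == 0 and for n > len(s)
        return s[max(len(s) - n, 0):]
    return s[-n:]

text = "Character set"                    # hypothetical paragraph text
# Equivalent of (raw_text <| 6) |> 5
print(pg_right(pg_left(text, 6), 5))      # harac
```

Chaining the two calls mirrors the chained operators in the query above: the inner `left` runs first, then `right` trims its result.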

## User-defined functions
![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

You may create your own functions (and optionally operator synonyms) using SQL or PL/pgSQL.  For Python programmers, it is especially powerful to write functions in Python.  Let us first create a simple SQL function.  This one pulls every Nth paragraph from a given book and shows the first L characters of each, using the `<|` operator defined above.  The result is returned as a table.

In [5]:
sql_nth_para_left = """
CREATE FUNCTION nth_para_left (N int, L int, id text) 
    RETURNS TABLE (num int, starts text)
AS $$
  SELECT para_num, raw_text <| L 
  FROM books
  WHERE para_num % N = 0
  AND book_id = id;
$$ LANGUAGE SQL;
"""
cur.execute(sql_nth_para_left)

In [6]:
cur.execute("SELECT * FROM nth_para_left(11, 50, '58650-0.txt');")

Para = namedtuple("Para", [c.name for c in cur.description])
for row in cur.fetchmany(10):
    print(Para(*row))

Para(num=0, starts='\ufeffThe Project Gutenberg EBook of Introduction to th')
Para(num=11, starts='')
Para(num=22, starts='  AND\n  BENJAMIN IDE WHEELER\n  PROFESSOR OF GREEK ')
Para(num=66, starts='  XX. THE DIVISION OF THE PARTS OF SPEECH      343')
Para(num=33, starts='Moreover, Professor Paul very frequently follows t')
Para(num=44, starts='')
Para(num=55, starts='  IX. ORIGINAL CREATION      157')
Para(num=77, starts='\nIt is the province of the Science of Language to ')
Para(num=88, starts='We now proceed to ask what are the causes of chang')
Para(num=99, starts='The comparison implied by such use of these terms ')
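
The rows above come back in no particular order because the query has no `ORDER BY`.  A small wrapper, sketched here under the assumption that the SQL function already exists in the current transaction, sorts the result and passes the arguments as bound parameters instead of embedding them in the SQL string.  The Python function name and its `limit` parameter are choices made for this example.

```python
from collections import namedtuple

def nth_para_left(cur, n, length, book_id, limit=10):
    """Call the SQL function with bound parameters and return named rows."""
    cur.execute(
        "SELECT * FROM nth_para_left(%s, %s, %s) ORDER BY num LIMIT %s;",
        (n, length, book_id, limit),
    )
    # Build the row type from the column names the server reports
    Para = namedtuple("Para", [c.name for c in cur.description])
    return [Para(*row) for row in cur.fetchall()]

# Usage, with the cursor from the first cell:
#   for para in nth_para_left(cur, 11, 50, '58650-0.txt'):
#       print(para)
```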


In [7]:
# Do not save operators after example
conn.rollback()
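
The rollback works here because PostgreSQL treats `CREATE OPERATOR` and `CREATE FUNCTION` as transactional, and psycopg2 keeps a transaction open until `commit()` or `rollback()` is called (unless autocommit has been enabled).  One way to package that pattern is a context manager that always rolls back; the name `scratch` is hypothetical.

```python
from contextlib import contextmanager

@contextmanager
def scratch(conn):
    """Yield a cursor and always roll back, so nothing created inside persists."""
    cur = conn.cursor()
    try:
        yield cur
    finally:
        cur.close()
        conn.rollback()

# Usage, with the connection from the first cell:
#   with scratch(conn) as cur:
#       cur.execute("CREATE OPERATOR <| (LEFTARG = text, RIGHTARG = int, PROCEDURE = left);")
#       cur.execute("SELECT 'PostgreSQL' <| 4;")
#       print(cur.fetchone())
```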

## Python functions
![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

Creating functions in SQL can be useful, but Python is the language of choice for students of this course.  Before we can write functions in Python, some additional components have to be installed and activated.  Doing this requires administrator privileges for both the underlying operating system and PostgreSQL.

For the extension itself, a system level installer can add it (on the machine where PostgreSQL actually runs).  For example, on a Debian/Ubuntu type machine, you might run:

```bash
# The package name carries the PostgreSQL major version (12 in this lesson)
sudo apt-get install postgresql-plpython3-12
```

Other operating systems will have varying installation tools.

Once the software is installed, you need to enable it for a particular database.  The extension you want is named `plpython3u`.  There are a few things going on in that name.  "pl" stands for "procedural language"; "python" is relatively straightforward; the "3" selects Python 3 rather than the out-of-maintenance version 2.  The "u" at the end indicates that it is an "untrusted language" in the sense that there is no mechanism to restrict code to a secure "sandbox."

A user could write a PL/Python function that performs any arbitrary harmful action, so add such functions with caution.  The danger is not so much that someone will write careless code as that they will write malicious code.  Therefore, only a superuser can activate the extension or add such functions.  For example:

```bash
% sudo -u postgres psql
psql (12.5 (Ubuntu 12.5-0ubuntu0.20.04.1))
Type "help" for help.
postgres=# \c ine
You are now connected to database "ine" as user "postgres".
```
```sql
ine=# CREATE EXTENSION plpython3u;
CREATE EXTENSION
```
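
From the Python side, we can confirm that the extension is enabled in the current database by looking it up in the `pg_extension` system catalog.  The helper name `has_extension` is an assumption for this sketch; `cur` is any open psycopg2 cursor.

```python
def has_extension(cur, name="plpython3u"):
    """Return True if the named extension is enabled in the current database."""
    cur.execute("SELECT 1 FROM pg_extension WHERE extname = %s;", (name,))
    return cur.fetchone() is not None

# Usage, with the cursor from the first cell:
#   if not has_extension(cur):
#       print("Ask a superuser to run: CREATE EXTENSION plpython3u;")
```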


Let us start with a toy function:

```sql
ine=# CREATE FUNCTION pymax (a integer, b integer)
  RETURNS integer
AS $$
  if a > b:
    return a
  return b
$$ LANGUAGE plpython3u;
CREATE FUNCTION
```


In [8]:
cur.execute("SELECT a, b, c, pymax(a, b) max_ab FROM numbers;")
Row = namedtuple("Row", [c.name for c in cur.description])
for row in cur:
    print(Row(*row))

Row(a=4, b=3, c=5, max_ab=4)
Row(a=7, b=9, c=6, max_ab=9)
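
Because the body of a PL/Python function is ordinary Python, with the SQL arguments available as variables, it can be tried out locally before it is installed.  The sketch below does that and also shows one thing to watch for: a SQL NULL arrives as `None`, and comparing `None` with an integer raises `TypeError` in Python 3.  The null-tolerant variant `pymax_nullsafe` is hypothetical; declaring the SQL function `STRICT` is another way to have NULL inputs yield NULL.

```python
def pymax(a, b):
    # Same body as the PL/Python function above
    if a > b:
        return a
    return b

def pymax_nullsafe(a, b):
    # Hypothetical variant: None stands in for SQL NULL and propagates
    if a is None or b is None:
        return None
    return a if a > b else b

print(pymax(4, 3), pymax(7, 9))     # 4 9
print(pymax_nullsafe(None, 3))      # None
```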


The function `pymax()` doesn't do anything we cannot already do with the SQL `greatest()` function.  Let us create one that might actually be useful.  For example, PostgreSQL has a built-in `md5()` function for a cryptographic hash, but MD5 is no longer considered secure.  Python's `hashlib` supports `sha1()` and other options (SHA1 is itself questionable now as a cryptographic primitive, but we could equally use SHA256 or BLAKE2b, for example).

```sql
ine=# CREATE FUNCTION sha1 (t text)
  RETURNS CHAR(40)
AS $$
  import hashlib
  h = hashlib.sha1(t.encode('utf-8'))
  return h.hexdigest()
$$ LANGUAGE plpython3u;
```

Having defined the function as a superuser, we can utilize it anywhere in the database.

In [9]:
cur.execute("SELECT para_num, sha1(raw_text) FROM books LIMIT 5;")
cur.fetchall()

[(0, '9d0224324ae88aa0ef36ccf9bb7a453bda363590'),
 (1, 'e1711a9b2f93f321ca211c3929e7901405f8d129'),
 (2, '97a60b5e79735073acbeae0994c7d3eb04b4f356'),
 (3, '84880e2a31f683013b5bb268e48fed1d9d97fd8d'),
 (4, 'ca2b701084f7631052932adbb010812a350f80f6')]
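
The same computation can be run on the client to spot-check what the database returns.  This sketch repeats the function body in plain Python and adds a SHA256 variant; the names `sha1_hex` and `sha256_hex` are hypothetical.

```python
import hashlib

def sha1_hex(t):
    """Same computation as the PL/Python sha1() body."""
    return hashlib.sha1(t.encode("utf-8")).hexdigest()

def sha256_hex(t):
    """Variant using SHA256 instead."""
    return hashlib.sha256(t.encode("utf-8")).hexdigest()

print(sha1_hex("abc"))                                # a9993e364706816aba3e25717850c26c9cd0d89d
print(len(sha1_hex("abc")), len(sha256_hex("abc")))   # 40 64
```

A SHA256 hex digest is 64 characters long, so a SQL function built on it would need `RETURNS CHAR(64)` rather than `CHAR(40)`.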

We might even create an operator version of this.

In [10]:
sha1_op = "CREATE OPERATOR # (RIGHTARG = text, PROCEDURE = sha1);"
cur.execute(sha1_op)

In [11]:
cur.execute("SELECT para_num, #raw_text FROM books LIMIT 5;")
cur.fetchall()

[(0, '9d0224324ae88aa0ef36ccf9bb7a453bda363590'),
 (1, 'e1711a9b2f93f321ca211c3929e7901405f8d129'),
 (2, '97a60b5e79735073acbeae0994c7d3eb04b4f356'),
 (3, '84880e2a31f683013b5bb268e48fed1d9d97fd8d'),
 (4, 'ca2b701084f7631052932adbb010812a350f80f6')]

## Views
![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

A view is a virtual table whose rows are computed only when it is queried.  Among other benefits, this lets us include function calls in queries without a user needing to think about those functions.  Views are also commonly useful when they are the result of JOINs, GROUP BYs, subqueries, and other more complex constructions.  The user of the virtual table does not need to think about how it is constructed, and can just use it as if it were a simple table.

In [12]:
sql_hashes = """
CREATE OR REPLACE VIEW book_hashes (book_id, para_num, excerpt, sha1) AS 
SELECT book_id, para_num, left(raw_text, 40), sha1(raw_text)
FROM books;
"""
cur.execute(sql_hashes)

In [13]:
sql = """
SELECT para_num, excerpt, sha1 
FROM book_hashes 
ORDER BY para_num
LIMIT 5 
OFFSET 1000;
"""
cur.execute(sql)
cur.fetchall()

[(1000,
  'The defects of written speech which have',
  'cf84b2c5c56386d085ca15ea43483343627631ce'),
 (1001,
  'The advantages of a fixed orthography ar',
  'a1f30479d3dc089d8715f38fb55d8039afb1e5a6'),
 (1002,
  'On the whole, it is true that the natura',
  '9bdeefca3a7a7840f4a889327580e46107bdfa0a'),
 (1003,
  'If we should institute a comparison betw',
  '2cb0e4c5e0b73fe08c16d4865e209b895a827d48'),
 (1004,
  'One of the most obvious difficulties tha',
  '78c45a50080537d1db61d5f7458788abbdb07c73')]
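
Since the view behaves like a table, paging through it needs nothing special.  The sketch below wraps the query above with bound `LIMIT` and `OFFSET` values and adds a light sanity check on the shape of each row; both function names are hypothetical, and the check only encodes what the view definition implies (an excerpt of at most 40 characters and a 40-character hex digest).

```python
HEX_DIGITS = set("0123456789abcdef")

def fetch_hash_page(cur, offset, limit=5):
    """Fetch one page of the book_hashes view, ordered by paragraph number."""
    cur.execute(
        "SELECT para_num, excerpt, sha1 FROM book_hashes "
        "ORDER BY para_num LIMIT %s OFFSET %s;",
        (limit, offset),
    )
    return cur.fetchall()

def looks_like_hash_row(row):
    """Check the shape the view definition implies for one result row."""
    para_num, excerpt, sha1 = row
    return (isinstance(para_num, int)
            and len(excerpt) <= 40
            and len(sha1) == 40
            and set(sha1) <= HEX_DIGITS)

# Usage, with the cursor from the first cell:
#   page = fetch_hash_page(cur, offset=1000)
#   print(all(looks_like_hash_row(r) for r in page))
```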

In [14]:
# Don't save operator or view
conn.rollback()

## Summary
![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)


Custom functions and operators can be a powerful enhancement to the already rich collection of functions PostgreSQL provides.  Combining these with views can provide a simple face to quite complex underlying queries and synthesis of data.

<div style="position: relative;">
<img src="https://user-images.githubusercontent.com/7065401/98729912-57be3e80-237a-11eb-80e4-233ac344b391.png"></img>
</div>