# Connecting to Postgres from a Jupyter notebook

We use Jupyter "magic" commands for this. A line magic starts with `%` and applies to a single line, while a cell magic starts with `%%` and applies to the whole code cell.

We first run `%load_ext sql` to load the SQL extension:

In [1]:
%load_ext sql

Then we establish a connection with `%sql postgresql://<username>:<password>@localhost[/<dbname>]`, where the part in square brackets is optional:

In [5]:
CONN_STRING = "postgresql://postgres:password1@localhost/discogs"

%sql $CONN_STRING

'Connected: postgres@discogs'
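
The connection string can also be assembled from its parts, which helps when the password contains characters such as `@` or `/` that have a special meaning in a URL. The following is a minimal sketch; the helper name `build_conn_string` is hypothetical and not part of the SQL extension.

```python
from urllib.parse import quote_plus


def build_conn_string(user, password, host="localhost", dbname=None):
    """Assemble a postgresql:// URL, percent-encoding the credentials."""
    # quote_plus escapes characters such as '@', ':' and '/' that would
    # otherwise be mistaken for URL separators.
    url = f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}"
    if dbname:
        # The database name is optional, as in the pattern above.
        url += f"/{dbname}"
    return url


# Same values as in the notebook cell above.
CONN_STRING = build_conn_string("postgres", "password1", dbname="discogs")
print(CONN_STRING)  # postgresql://postgres:password1@localhost/discogs
```

The resulting `CONN_STRING` can be passed to `%sql $CONN_STRING` exactly as before.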

We can create a new database to isolate everything we do, which also makes it easy to drop it later and start again.

We create a database with the `CREATE DATABASE <dbname>` command. We drop a database (if it exists) by running `DROP DATABASE IF EXISTS <dbname>`.
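
As a rough illustration, the two statements can be generated for any database name. This is only a sketch: `quote_ident` and `recreate_database_sql` are hypothetical helpers that build the SQL text and do not talk to the server.

```python
def quote_ident(name):
    """Quote an SQL identifier by wrapping it in double quotes."""
    # Embedded double quotes are escaped by doubling them.
    return '"' + name.replace('"', '""') + '"'


def recreate_database_sql(dbname):
    """Return the statements that drop and then recreate a database."""
    ident = quote_ident(dbname)
    return [
        f"DROP DATABASE IF EXISTS {ident};",
        f"CREATE DATABASE {ident};",
    ]


for statement in recreate_database_sql("discogs"):
    print(statement)
```

Note that Postgres refuses to run `CREATE DATABASE` and `DROP DATABASE` inside a transaction block, and it cannot drop the database the current session is connected to, so these statements may need a separate connection in autocommit mode.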

# Configuring MADlib

To set up MADlib, go to http://madlib.apache.org/download.html and get the appropriate package (for Ubuntu pick the 4th option). After installing the package, we install MADlib into our Postgres database with the `madpack` utility:

```
/usr/local/madlib/bin/madpack -s madlib -p postgres -c postgres@localhost/discogs install
```

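Here `-s` names the schema that will hold the MADlib objects, `-p` selects the database platform, and `-c` gives the connection in the form `<user>@<host>/<dbname>`. Then, to check that everything went OK, we run the `SELECT madlib.version()` query.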

In [7]:
%sql SELECT madlib.version();

 * postgresql://postgres:***@localhost/discogs
1 rows affected.


version
"MADlib version: 1.15.1, git revision: unknown, cmake configuration time: Wed Oct 10 09:12:58 UTC 2018, build type: Release, build system: Linux-4.9.93-linuxkit-aufs, C compiler: gcc 5.4.0, C++ compiler: g++ 5.4.0"

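If a later step depends on a minimum MADlib version, the version number can be pulled out of this string. The sketch below assumes the `major.minor.patch` format seen in the output above; `parse_madlib_version` is a hypothetical helper, not a MADlib function.

```python
import re

# The string returned by SELECT madlib.version() above, shortened.
VERSION_OUTPUT = (
    "MADlib version: 1.15.1, git revision: unknown, "
    "build type: Release"
)


def parse_madlib_version(text):
    """Extract (major, minor, patch) from the madlib.version() string."""
    match = re.search(r"MADlib version: (\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no MADlib version found in: " + text)
    return tuple(int(part) for part in match.groups())


version = parse_madlib_version(VERSION_OUTPUT)
print(version)  # (1, 15, 1)
```

Because the result is a tuple of integers, it can be compared directly, for example `version >= (1, 15, 0)`.
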

# Defining the Database Schema

We start off by running `DROP TABLE IF EXISTS <table1>[, <table2>, ...]` to delete any of the tables that already exist. This makes the code block _idempotent_, so we can run it multiple times and always end up with the same empty schema.

Then we define the schema for all tables. The relational model is the following:

```
artists : (artist_id : int, name : varchar(256)?, realname : text?, profile : text?, url : text?)
    key: artist_id

releases : (release_id : int, released : date, title : text, country : varchar(256)?, genre : varchar(256))
    key: release_id

released_by : (release_id : int, artist_id : int)
    key: release_id, artist_id
    foreign key: release_id : releases(release_id), artist_id : artists(artist_id)

tracks : (release_id : int, position : varchar(128), title : text?, duration : int?)
    key: release_id, position
    foreign key: release_id : releases(release_id)

```

We use the `?` sign to denote attributes that are nullable.

In [11]:
%%sql

DROP TABLE IF EXISTS artists, releases, released_by, tracks CASCADE;

CREATE TABLE artists (
    artist_id int NOT NULL,
    name varchar(256) NULL,
    realname text NULL,
    profile text NULL,
    url text NULL,
    PRIMARY KEY (artist_id)
);

CREATE TABLE releases (
    release_id int NOT NULL,
    released date NOT NULL,
    title text NOT NULL,
    country varchar(256) NULL,
    genre varchar(256) NOT NULL,
    PRIMARY KEY (release_id)
);

CREATE INDEX IF NOT EXISTS idx_releases_genre ON releases(genre);

CREATE TABLE released_by (
    release_id int REFERENCES releases(release_id) ON DELETE CASCADE,
    artist_id int REFERENCES artists(artist_id) ON DELETE CASCADE,
    PRIMARY KEY (release_id, artist_id)
);

CREATE TABLE tracks (
    release_id int REFERENCES releases(release_id) ON DELETE CASCADE,
    position varchar(128) NOT NULL,
    title text NULL,
    duration int NULL,
    PRIMARY KEY (release_id, position)
);

 * postgresql://postgres:***@localhost/discogs
Done.
Done.
Done.
Done.
Done.
Done.


[]
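
The idempotency of the drop-then-create pattern can be tried out without a Postgres server. The sketch below swaps in Python's built-in `sqlite3` module and a reduced version of two of the tables; this is an assumption made to keep the example self-contained, and the SQL dialect differs slightly from the cell above.

```python
import sqlite3

# SQLite accepts only one table per DROP statement and has no CASCADE
# keyword, so the single Postgres DROP is split into two statements here.
SCHEMA = """
DROP TABLE IF EXISTS tracks;
DROP TABLE IF EXISTS releases;

CREATE TABLE releases (
    release_id int NOT NULL,
    title text NOT NULL,
    PRIMARY KEY (release_id)
);

CREATE TABLE tracks (
    release_id int REFERENCES releases(release_id) ON DELETE CASCADE,
    position varchar(128) NOT NULL,
    title text,
    PRIMARY KEY (release_id, position)
);
"""

conn = sqlite3.connect(":memory:")

# Running the script twice succeeds because the DROP statements remove
# whatever the previous run created.
for _ in range(2):
    conn.executescript(SCHEMA)

tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(tables)  # ['releases', 'tracks']
```

Without the two `DROP TABLE IF EXISTS` lines, the second run would fail because the tables already exist.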

# Visualizing the ER diagram of the database

We can use the eralchemy tool to display the ER diagram of the database schema we have just created. We need to import the `render_er` function from the `eralchemy` module. We also use the `IPython.display.Image` class to display the PNG that eralchemy outputs.

In [9]:
from eralchemy import render_er
from IPython.display import Image

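IMG_PATH = 'erd_discogs.png'

# Read the schema through the connection string and write the diagram as a PNG.
render_er(CONN_STRING, IMG_PATH)

# Load the PNG from the local file and embed it in the notebook.
Image(filename=IMG_PATH)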

***

_Note:_ To install eralchemy, you first need to install Graphviz and its development headers:

`sudo apt-get install graphviz libgraphviz-dev`

and then install the Python package with:

`pip install eralchemy`

***
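
Before calling `render_er`, it can be useful to check that both pieces from the note above are in place. This is a small sketch using only the standard library; `eralchemy_prerequisites` is a hypothetical helper, and it assumes that the Graphviz package puts its `dot` program on the `PATH`.

```python
import importlib.util
import shutil


def eralchemy_prerequisites():
    """Report whether the pieces needed for render_er appear to be present."""
    return {
        # Graphviz ships the `dot` layout program.
        "graphviz_dot": shutil.which("dot") is not None,
        # find_spec returns None when a top-level module is not installed.
        "eralchemy": importlib.util.find_spec("eralchemy") is not None,
    }


print(eralchemy_prerequisites())
```

A `False` entry points to the installation step that still needs to be run.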

# Loading the data

We load each CSV file into its table with the Postgres `COPY ... FROM` command, which reads the file on the database server. The `Q` variable holds a single quote so that the file path ends up quoted after the `$` variables are substituted. The `artists` and `releases` tables are loaded first because the foreign keys in `released_by` and `tracks` reference them.

In [10]:
Q = "'"  # a single quote, used to wrap the file path in each COPY statement
DATAPATH = "/home/bojan/projects/phd/teaching/IntroDB-HandsOn/dmdb19_handson2"

%sql COPY artists FROM $Q$DATAPATH/artists.csv$Q DELIMITER ',' CSV HEADER;
%sql COPY releases FROM $Q$DATAPATH/releases.csv$Q DELIMITER ',' CSV HEADER;
%sql COPY released_by FROM $Q$DATAPATH/released_by.csv$Q DELIMITER ',' CSV HEADER;
%sql COPY tracks FROM $Q$DATAPATH/tracks.csv$Q DELIMITER ',' CSV HEADER;

 * postgresql://postgres:***@localhost/discogs
6034590 rows affected.
 * postgresql://postgres:***@localhost/discogs
538640 rows affected.
 * postgresql://postgres:***@localhost/discogs
547652 rows affected.
 * postgresql://postgres:***@localhost/discogs
3502190 rows affected.


[]
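
The four statements follow one pattern, so they can also be generated in a loop. The sketch below only builds the SQL text; `sql_literal` and `copy_statements` are hypothetical helpers, and `/path/to/data` is a placeholder rather than the directory used above.

```python
# Parent tables come first so that the foreign keys in released_by and
# tracks already have rows to point at.
TABLES = ["artists", "releases", "released_by", "tracks"]


def sql_literal(value):
    """Wrap a string in single quotes, doubling any embedded quote."""
    return "'" + value.replace("'", "''") + "'"


def copy_statements(datapath, tables=TABLES):
    """Build one COPY ... FROM statement per table, in loading order."""
    statements = []
    for table in tables:
        path = f"{datapath}/{table}.csv"
        statements.append(
            f"COPY {table} FROM {sql_literal(path)} DELIMITER ',' CSV HEADER;"
        )
    return statements


for statement in copy_statements("/path/to/data"):
    print(statement)
```

Quoting the path inside the helper plays the same role as the `Q` variable in the notebook cell.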