# SSC Data Science and Analytics Workshop 2022

### Intro to Databases in Industry: Data Cleaning, Querying, and Modeling at Scale
-----------------

# 1. Introduction

### 1.1 Workshop structure

The workshop will be split into two parts.

- In the first part, Arman and I will discuss
    - what a Database Management System (DBMS) is and why to use one;
    - the basics of SQL queries;
    - accessing a database from R;
    
    
    
- In the second part, Diego will:
   - discuss the concept of a data warehouse;
   - introduce the ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) processes;
   - present dimensional modelling as a methodology to guide the design of data warehouses at scale.
   
The workshop will be focused on the user aspect of databases, not on design and implementation. 

### 1.2 Why use a Database management system (DBMS)?

To analyze data, we first need to collect and store it. In fact, depending on the scale of the dataset, we need much more than that. Curating a large dataset is not just about storing it; we also need to:

- keep track of the relationship between the data entities;
- avoid redundancies; 
- manage security issues to protect private and sensitive information;
- have an error recovery system in place;
- be able to retrieve data in an efficient manner;
- have well controlled concurrent access;
- keep the data in a consistent state;
- enforce data integrity!

### 1.3 Database vs DBMS 

- A **database is a collection of data about different but related objects** (also called entities in database jargon), each entity with a common set of attributes. For example, a taxi company stores information about:
  - Drivers (name, licence number, licence expiration);
  - Cars (model, year);
  - Clients (name, address, phone, e-mail);


- Note that entities also interact among themselves (e.g., a driver drives a car).

- A **DBMS is software** responsible for assisting with the maintenance and use of databases (yes, plural!)
  - There are many different vendors of **relational** DBMS out there (e.g., PostgreSQL, MS Access, MS SQL Server, MySQL, etc…)
  <img src="img/lecture1/flavours_sql.png" width="700"/>

---

**Remember:**
    
database $\ne$ database management system

---

### 1.4 PostgreSQL

- In this workshop we will be using PostgreSQL as our RDBMS;


- PostgreSQL is an open-source, multi-platform (Windows, Linux, macOS) RDBMS with a high level of compliance with ANSI SQL (the language that practically all RDBMSs use to create, modify and query databases).

### 1.5 The client-server model

Similar to most other DBMSs, Postgres works based on a **client-server** model. In this model:

- The DBMS, along with its databases and data, is stored on a host computer where the database server resides. This is typically a powerful machine with high processing power and large storage.
- Client hosts are usually personal computers with GUIs that can connect to a database server to access the data.

In this model, the clients and the server are connected over a network. The heavy-lifting of processing, managing and storing large amounts of data is done by the server host, and clients only retrieve the data that they need.

<img src="img/lecture1/client-server.png" width="500"/>

> Although sometimes used interchangeably, there is a difference between a **client/server** and a **client/server host**. A host is a device, whereas a client/server is a piece of software. For example, you can simultaneously have multiple client programs connected to a remote database. Similarly, a remote host is a device (i.e. a computer) that might have several server programs running concurrently.

The idea of client-server models for databases has become the standard of computation and storage today, known as **cloud computing**:

- Today we rarely store movie or music files on our computers. This is why most of us have laptops with only 256/512 GB of space: most of what would take up space is either provided as a cloud service (e.g. Netflix, Spotify, YouTube) or stored on cloud storage (e.g. OneDrive, Dropbox, Google Drive).
- We rarely run production-stage, computation-intensive jobs on our own computers. Such computations are done on cloud-computing services (e.g. Google Cloud Platform, Amazon Web Services, Microsoft Azure). I personally haven't run a single simulation on my own computer, nor have I ever stored any raw data locally. I use my computer mainly as an interface to access the services that I want.

> Note that there are certain situations where one might want to **locally** benefit from the advantages of storing data in a database. A relational database engine that works only with local databases is SQLite. If you're curious to find out the use cases for **SQLite**, take a look [here](https://www.sqlite.org/whentouse.html).

Whenever we use Postgres (or any other client-server DBMS), the first step before anything else is to **connect** to the database server. This is why we will talk about _host address_, _port_, _username_, and _password_ when we try to use a database.
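To make these four pieces concrete, here is a minimal Python sketch of how they are commonly combined into a single connection URL. The helper name `build_postgres_url` and all the values are hypothetical; the only assumption about Postgres itself is that 5432 is its default port.

```python
from urllib.parse import quote


def build_postgres_url(user, password, host, port, dbname):
    """Assemble a postgresql:// connection URL from its parts."""
    # Percent-encode the user name and password so that characters
    # such as '@' or ':' cannot be mistaken for URL separators.
    user = quote(user, safe="")
    password = quote(password, safe="")
    return f"postgresql://{user}:{password}@{host}:{port}/{dbname}"


# Made-up values, for illustration only.
url = build_postgres_url("student", "p@ss:word", "db.example.com", 5432, "world")
print(url)
```

The encoding step matters only when the password contains reserved characters, but it is cheap insurance against a confusing connection error.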

### 1.6 What is SQL?

Well, it's finally time to learn about SQL!

- SQL stands for Structured Query Language ([or... does it?](https://en.wikipedia.org/wiki/SQL#History)).

- It is a language that we use to talk to a relational DBMS.

- It was originally developed by IBM in the 1970s to manipulate and retrieve data stored in their DBMS, System R.

- Keep in mind that SQL is not set in stone and the SQL standard is updated from time to time (e.g., 1986, 1992, 1999, 2011, 2016);

- There’s no DBMS that fully complies with the standard;

## 2. SQL

Suppose that we have the following table in our database, and 

> we want to retrieve the names, ages, and GPAs of students older than 25.

|  sid  | name      | login      | age | gpa |
|-------|-----------|------------|-----|-----|
| 23792 | Arman     | arman@mds  | 28  | 2.5 |
| 82347 | Varada    | varada@mds | 29  | 2.9 |
| 11238 | Tiffany   | tiff@mds   | 23  | 2.8 |
| 87263 | Mike      | mike@mds   | 19  | 3.8 |
| 13298 | Joel      | joel@mds   | 25  | 3.2 |
| 91287 | Florencia | flor@mds   | 20  | 3.3 |

We can write this as the following SQL query:

```sql
SELECT
    name, age, gpa
FROM
    Students
WHERE
    age > 25;
```

Running the above query should return this relation:

| name   | age | gpa |
|--------|-----|-----|
| Arman  | 28  | 2.5 |
| Varada | 29  | 2.9 |

Let's dissect the different parts of our SQL query here.

### 2.1 The `SELECT` statement

```sql
-- A basic SELECT statement
-- ===========================
SELECT
    name, age, gpa  -- column names
FROM
    Students        -- table name
WHERE
    age > 25;       -- condition
```

- A SQL statement usually starts with a verb that describes what the statement is doing, e.g., `SELECT`, `UPDATE`, `INSERT`, `CREATE`, …

- SQL keywords are NOT case sensitive: “SELECT”, “select”, and “SeLeCt” are all the same (string values inside quotes, however, are compared case-sensitively);

- A SQL command can use multiple lines;

- Don't forget that every SQL statement needs to be terminated with a `;`.

- SQL keywords are traditionally written in upper case letters, but that is not a requirement.
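The case rules above can be checked with a small experiment. This is a sketch that uses Python's built-in `sqlite3` module as a local stand-in for PostgreSQL; the two-row table is a cut-down, hypothetical version of `Students`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO Students VALUES (?, ?)", [("Arman", 28), ("Mike", 19)])

# The same statement written with three different keyword casings.
variants = [
    "SELECT name FROM Students WHERE age > 25;",
    "select name from Students where age > 25;",
    "SeLeCt name FrOm Students WhErE age > 25;",
]
results = [conn.execute(query).fetchall() for query in variants]

# String values, unlike keywords, are compared case-sensitively.
exact = conn.execute("SELECT name FROM Students WHERE name = 'Arman';").fetchall()
wrong_case = conn.execute("SELECT name FROM Students WHERE name = 'arman';").fetchall()
conn.close()

print(results)     # three identical result sets
print(exact)       # [('Arman',)]
print(wrong_case)  # []
```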

### 2.1 How to run SQL in Postgres?

Well, we have a variety of options to run our SQL statements in PostgreSQL:

- pgAdmin is the official web-based GUI for interacting with PostgreSQL databases
- `psql` is PostgreSQL's interactive command-line interface
- `%sql` and `%%sql` magic commands in Jupyter notebooks, which are provided by the `ipython-sql` package
- `DBI` and `RPostgres` packages in R;

---------------------------
### 2.1.1 `psql`

This is PostgreSQL's command-line tool that allows us to interactively run SQL statements as well as "meta" commands. I introduce a few useful `psql` meta commands here, but you can find all the others in the PostgreSQL documentation [here](https://www.postgresql.org/docs/current/app-psql.html) or a shorter version in this [cheatsheet](http://www.postgresonline.com/downloads/special_feature/postgresql83_psql_cheatsheet.pdf).

| Command | Usage                                         |
|---------|-----------------------------------------------|
| `\l`    | list all databases                            |
| `\c`    | connect to a database                         |
| `\cd`   | change directory                              |
| `\!`    | execute shell commands                        |
| `\i`    | execute commands from file                    |
| `\d`    | list tables and views                         |
| `\d+`   | list tables and views with additional info    |
| `\dt`   | list tables                                   |
| `\dt+`  | list tables with additional info              |
| `\h`    | view help on SQL commands                     |
| `\?`    | view help on psql meta commands               |
| `\q`    | quit interactive shell                        |

> Note that you don't need to terminate meta commands with `;`.

### 2.1.2 `ipython-sql` (`%sql` and `%%sql`)

`ipython-sql` is a package that enables us to run SQL statements right from a Jupyter notebook. In order to use it, we should load it first:

In [None]:
%load_ext sql
%config SqlMagic.displaylimit = 30

Now we need the host address of where the database is stored, along with a username and a password.

It is always a bad idea to store login information directly in a notebook or code file, for security reasons. For example, you don't want to commit your sensitive login information to a Git repo.

In order to avoid that, we store that kind of information in a separate file, like `credentials.json` here, and read the username and password into our IPython session:

In [None]:
import json

with open('credentials.json') as f:
    credentials = json.load(f)
    
username = credentials['user']
password = credentials['password']
host = credentials['host']
port = credentials['port']

And also make sure to add your file name (e.g. `credentials.json`) to your `.gitignore` file, so you don't accidentally commit it.
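A slightly more defensive variant of the loading step is sketched below. It assumes the same four keys that the cell above reads (`user`, `password`, `host`, `port`) and fails early with a clear message if one is missing. The helper name `load_credentials` and the demo values are hypothetical, and the demo writes a throwaway file so that it runs without your real credentials.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("user", "password", "host", "port")


def load_credentials(path):
    """Read a JSON credentials file and check that no required key is missing."""
    with open(path) as f:
        credentials = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in credentials]
    if missing:
        raise KeyError(f"missing keys in {path}: {missing}")
    return credentials


# Demonstration with a temporary file and made-up values.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "credentials.json")
    with open(path, "w") as f:
        json.dump(
            {"user": "student", "password": "secret", "host": "localhost", "port": 5432},
            f,
        )
    credentials = load_credentials(path)

print(sorted(credentials))  # ['host', 'password', 'port', 'user']
```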

Now we can establish the connection to the `world` database using the following code:

In [None]:
%sql postgresql://{username}:{password}@{host}:{port}/world

Note that we have used the `%sql` line magic, which treats the rest of the line as its argument (here, the connection URL).

We can also use `%%sql` cell magic to apply the magic to an entire notebook cell.

A limited number of `psql` meta commands (e.g. `\l`, `\d`) can also be executed here. This is made possible through the `pgspecial` package. For example, let's list all databases that exist on our PostgreSQL server:

In [None]:
#%sql \l

Or list the relations (i.e. tables) in the current database:

In [None]:
#%sql \d

---------------------------------------
#### Wait, isn't this a workshop? Let's get to work then!!
To learn you need to practice! 

**Exercise 1:**

Retrieve the `name` and `population` columns from the `country` table:

In [None]:
%%sql 


**Exercise 2: `SELECT *`?**

What happens if you replace your column list with * in the SELECT statement?

In [None]:
%%sql 


Use `*` with care, since it may place an unnecessary burden on your network. 
Besides, it makes your SQL code less readable. If someone is not highly familiar with the database, they might not know what is being returned. 


**Exercise 3: `LIMIT` keyword**

You can use the `LIMIT` keyword to limit the number of rows returned by the `SELECT` statement. 
Fill in the SQL query below to retrieve the `name` and `lifeexpectancy` columns from the `country` table, while limiting the number of rows to 10. 

In [None]:
%%sql


... name, ... 
   ... country
   LIMIT ...;


**Exercise 4: Aliases**

As you can see in the previous exercise, `lifeexpectancy` is hard to read. 
We can rename the columns in the SELECT statement's output by using the `AS` keyword. 
This will not change the `country` table, only the output of the SELECT statement. 

Fill in the SQL query below to retrieve the `name` and `lifeexpectancy` columns from the `country` table. Rename `lifeexpectancy` to `life_expectancy`. As before, limit the output, this time to 15 rows.

In [None]:
%%sql

...
    name,
    ... ... life_expectancy
  FROM ...
  ...;



**Exercise 5: `DISTINCT`**

What are the continents present in the `country` table? We can add the `DISTINCT` modifier to the `SELECT` clause to return only distinct results.  

> Note: `DISTINCT` is applied to **all columns** that we list after `SELECT`, and returns all distinct combinations of values stored in those columns.

In [None]:
%%sql

... DISTINCT ...
  FROM ...;
  


**Exercise 6: Filtering with `WHERE`**

- The `WHERE` clause allows us to filter the rows we want;
- We can specify a condition (a Boolean expression) in the `WHERE` clause to retrieve only the rows for which the condition is true;
- Remember the filter verb in dplyr? It is the same idea!!

Retrieve the countries with a population higher than 150 million people. 


In [None]:
%%sql

... ...
  FROM ...
  ... population > 150000000;



- Some comparison operators to use in the `WHERE` clause:

| Operator | Condition |
|----------|-----------|
| `<`, `<=`, `=`, `>=`, `>` | ordering and equality comparisons |
| `LIKE`, `ILIKE`, `SIMILAR TO` | pattern matching |
| `BETWEEN` | range filtering |
| `IN` | check membership (similar to `%in%` in R) |
| `IS NULL` | check whether a value is NULL |
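A few of these operators are sketched below on the small `Students` table from Section 2, using Python's built-in `sqlite3` module as a local stand-in for PostgreSQL. The helper `names` is hypothetical and exists only to keep the example short. `LIKE` is left out on purpose because its case handling differs between SQLite and PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (name TEXT, age INTEGER, gpa REAL)")
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?)",
    [
        ("Arman", 28, 2.5), ("Varada", 29, 2.9), ("Tiffany", 23, 2.8),
        ("Mike", 19, 3.8), ("Joel", 25, 3.2), ("Florencia", 20, 3.3),
    ],
)


def names(condition):
    """Return the sorted names of the students that satisfy a WHERE condition."""
    # Building SQL from strings is fine for fixed demo conditions like these,
    # but should never be done with untrusted user input.
    query = f"SELECT name FROM Students WHERE {condition} ORDER BY name;"
    return [row[0] for row in conn.execute(query)]


high_gpa = names("gpa >= 3.2")             # comparison
in_range = names("age BETWEEN 20 AND 25")  # range, both ends included
listed = names("name IN ('Mike', 'Joel')") # membership
conn.close()

print(high_gpa)  # ['Florencia', 'Joel', 'Mike']
print(in_range)  # ['Florencia', 'Joel', 'Tiffany']
print(listed)    # ['Joel', 'Mike']
```

Note that `BETWEEN` includes both endpoints, which is why students aged exactly 20 and 25 are returned.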

**Exercise 7: derived columns**

SQL can perform operations on columns, such as arithmetic. 

Retrieve the `name` column and `gnp_per_capita` by dividing the `gnp` column by the `population` column. 
Make sure to only calculate this for rows with population higher than 0. Remember to rename the column as `gnp_per_capita`. 

In [None]:
%%sql


...
    name,
    ... ... gnp_per_capita
  FROM ...
  ... ...
  LIMIT 15;


> Remember:  the `SELECT` statement is powerful, but not dangerous. Derived columns returned by Postgres are not saved anywhere, nor do they change existing columns.

**Exercise 8: Arranging with `ORDER BY`**

What are the countries with the highest `gnp_per_capita`? 
You can use the `ORDER BY` clause to rearrange the rows that are returned. By default, it arranges the rows in ascending order. 
If you want descending order, add the `DESC` keyword. 

Retrieve the `name` column and `gnp_per_capita` by dividing the `gnp` column by the `population` column. 
1. Make sure to only calculate this for rows with population higher than 0. 
2. Remember to rename the column as `gnp_per_capita`. 
3. Order the rows in descending order. 

In [None]:
%%sql


...
    name,
    ... ... gnp_per_capita
  FROM ...
  ... population > 0
  ORDER BY ... DESC
  LIMIT 15;


> Note: if there’s a tie in `ORDER BY`, the tied rows are presented in arbitrary order.


---  
**Remember:**

The order of SQL keywords does matter: `SELECT`, `FROM`, `WHERE`, `ORDER BY`, `LIMIT`
    
---
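The clause order can be seen in a complete query. The sketch below runs one against the small `Students` table from Section 2, again with Python's built-in `sqlite3` module as a local stand-in for PostgreSQL; the alias `grade_point_average` is made up for the example. It also shows that putting `LIMIT` before `WHERE` is rejected as a syntax error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (name TEXT, age INTEGER, gpa REAL)")
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?)",
    [
        ("Arman", 28, 2.5), ("Varada", 29, 2.9), ("Tiffany", 23, 2.8),
        ("Mike", 19, 3.8), ("Joel", 25, 3.2), ("Florencia", 20, 3.3),
    ],
)

query = """
    SELECT name, gpa AS grade_point_average  -- which columns
      FROM Students                           -- which table
     WHERE age > 19                           -- which rows
     ORDER BY gpa DESC                        -- in what order
     LIMIT 3;
"""
rows = conn.execute(query).fetchall()

# The same clauses in the wrong order are rejected by the parser.
try:
    conn.execute("SELECT name FROM Students LIMIT 3 WHERE age > 19;")
    misordered_failed = False
except sqlite3.OperationalError:
    misordered_failed = True
conn.close()

print(rows)               # [('Florencia', 3.3), ('Joel', 3.2), ('Varada', 2.9)]
print(misordered_failed)  # True
```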

## Data Types

We will not cover data types in detail. You can find all data types supported by PostgreSQL [here](https://www.postgresql.org/docs/current/datatype.html).

One thing that is worth noting is the `CAST` function, to convert from one data type to another. 
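A target type such as `NUMERIC(5,2)` means at most 5 digits in total, 2 of them after the decimal point. As a rough illustration of what such a cast does to a value, here is a Python sketch built on the standard `decimal` module. The function `cast_numeric` is hypothetical and only mimics the rounding and overflow behaviour; it is not how PostgreSQL implements `CAST`.

```python
from decimal import Decimal, ROUND_HALF_UP


def cast_numeric(text, precision, scale):
    """Mimic CAST(text AS NUMERIC(precision, scale)) with Python decimals."""
    step = Decimal(1).scaleb(-scale)  # scale=2 gives Decimal('0.01')
    # Round to `scale` decimal places, with ties going away from zero.
    value = Decimal(text).quantize(step, rounding=ROUND_HALF_UP)
    # `precision` counts all digits, so only (precision - scale) of them
    # may appear before the decimal point.
    if abs(value) >= Decimal(10) ** (precision - scale):
        raise OverflowError(f"{text} does not fit in NUMERIC({precision},{scale})")
    return value


rounded = cast_numeric("1.2587465416874", 5, 2)
print(rounded)  # 1.26

try:
    cast_numeric("1234.5", 5, 2)  # needs 4 digits before the decimal point
    overflowed = False
except OverflowError:
    overflowed = True
print(overflowed)  # True
```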

**Example**

In [None]:
%%sql 

SELECT CAST('1.2587465416874' AS NUMERIC(5,2));

### Nulls

- A null is a marker to indicate that the value for a column is unknown, or not entered yet. A null is not equal to 0, or an empty string. In fact, **a null is not even equal to another null**!

- In `ORDER BY`, `NULL` values are treated as either the highest or the lowest possible value, depending on the DBMS. PostgreSQL treats `NULL` as the highest possible value.

- Use `IS NULL` to match `NULL`. Do not try using equality, because a comparison with `NULL` is never true.

- Although `NULL`s are not equal to each other, `DISTINCT` treats them as if they were!

- How different environments show nulls:
  - `ipython-sql` -> `None`
  - psql -> blank space
  - pgAdmin -> `[null]`
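These rules can be checked with a small experiment. The sketch below uses Python's built-in `sqlite3` module as a local stand-in for PostgreSQL, and the one-column table `t` is hypothetical. The last query also illustrates the "depending on the DBMS" point: SQLite sorts `NULL` first in ascending order, whereas PostgreSQL sorts it last.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (None,), (None,), (2,)])

# Comparing NULL with NULL yields NULL (shown as None in Python), not true.
null_equals_null = conn.execute("SELECT NULL = NULL;").fetchone()[0]

# Equality never matches a NULL; IS NULL does.
matched_by_equality = conn.execute("SELECT COUNT(*) FROM t WHERE x = NULL;").fetchone()[0]
matched_by_is_null = conn.execute("SELECT COUNT(*) FROM t WHERE x IS NULL;").fetchone()[0]

# DISTINCT collapses the two NULLs into a single row.
distinct_values = conn.execute("SELECT DISTINCT x FROM t;").fetchall()

# SQLite puts NULLs first when sorting in ascending order.
ascending = [row[0] for row in conn.execute("SELECT x FROM t ORDER BY x;")]
conn.close()

print(null_equals_null)      # None
print(matched_by_equality)   # 0
print(matched_by_is_null)    # 2
print(len(distinct_values))  # 3
print(ascending)             # [None, None, 1, 2]
```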

------------------------

## `imdb` database 
Let's continue exploring the `WHERE` clause, but to keep things interesting, let's use another database. 

In [None]:
%config SqlMagic.displaylimit = 20

In [None]:
%sql postgresql://{username}:{password}@{host}:{port}/imdb

In [None]:
%%sql

/* Let's take a look at the movies table */

SELECT *
  FROM movies;

### Pattern matching

We often want to find rows for which the values of one or more columns match a particular pattern. In SQL, this can be done either with `LIKE` or with regular expressions. The syntax is as follows:

```sql
SELECT
    column1, column2
FROM
    table1
WHERE
    column1 [NOT] LIKE '<pattern>'
;
```

Postgres provides us with two wild-cards that we can use with `LIKE`:
- `%` matches any string of characters
- `_` matches a single character.

Pattern matching with `LIKE` is case sensitive; however, Postgres also provides the `ILIKE` keyword that has the same functionality as `LIKE` but is case-insensitive (not part of standard SQL).

> **Note:** With `LIKE` or `ILIKE`, the entire string should match the pattern.
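To make the semantics of the two wildcards concrete, here is a minimal Python sketch that translates a `LIKE` pattern into a regular expression. The functions `like_to_regex` and `like` are hypothetical helpers written for this illustration, not part of PostgreSQL or any library; the sketch ignores escape characters, and the sample strings are made up.

```python
import re


def like_to_regex(pattern, case_sensitive=True):
    """Translate a LIKE pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # any string of characters, even empty
        elif ch == "_":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else matches literally
    flags = re.DOTALL
    if not case_sensitive:
        flags |= re.IGNORECASE           # ILIKE-style matching
    return re.compile("".join(parts), flags)


def like(value, pattern, case_sensitive=True):
    """Return True if the ENTIRE value matches the pattern."""
    return like_to_regex(pattern, case_sensitive).fullmatch(value) is not None


title = "The Red Violin"
print(like(title, "%Violin%"))                        # True
print(like(title, "%violin%"))                        # False: LIKE is case sensitive
print(like(title, "%violin%", case_sensitive=False))  # True: ILIKE behaviour
print(like(title, "Violin"))                          # False: whole string must match
print(like("cat", "c_t"), like("cart", "c_t"))        # True False
```

The use of `fullmatch` is what encodes the note above: without a leading or trailing `%`, the pattern has to account for every character of the string.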

---

**Example:** Retrieve those movies from the `movies` table whose title contains the word `'violin'` (note that `LIKE` is picky about letter case in strings, so the query uses `ILIKE`).

---

In [None]:
%%sql

SELECT
    *
FROM
    movies
WHERE
    title ILIKE '%Violin%'
;


**Exercise 9:** 

Retrieve those movies from the `movies` table whose title starts with the word `'Zero'`.

In [None]:
%%sql


**Exercise 10:** 

Retrieve those movies from the `movies` table whose title is 4 letters long and ends with the letter `'e'`.

In [None]:
%%sql


## The relational model

SQL databases are based on the so-called relational model, which is grounded in set theory and was introduced by Edgar Codd (IBM) in 1970 ([more details here](https://en.wikipedia.org/wiki/Relational_model)). 
Its foundation in **set theory** is the reason you will hear words like "tuples", "domain", "union", "cross product", etc.

We don't have time to cover the relational model in detail. But there are three things that you should keep in mind:


- **All operations** you perform in SQL result in a table (it could be an empty table).


- Primary key: one or more columns that uniquely identify each row (i.e., their values are different for all rows); 
  - Examples: id, driver's license, SIN, license plate


- Foreign key: one or more columns in one table that refer to the primary key of another table, and therefore uniquely identify a row of that other table. For example:
  - `id` in the `movies` table uniquely identifies each movie;
  - `id` in the `names` table uniquely identifies each person;
  - the `acting_roles` table relates people to movies using these primary keys, which guarantees that no person works on a movie that doesn't exist and no movie has people that don't exist (referential integrity);
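The sketch below shows both kinds of key being enforced, using Python's built-in `sqlite3` module as a local stand-in for PostgreSQL. The table names come from the `imdb` database, but the column names (`title`, `name`, `movie_id`, `person_id`) and the rows are assumptions made for this illustration and may differ from the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when asked to (PostgreSQL always does).
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE names  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE acting_roles (
        movie_id  INTEGER REFERENCES movies (id),
        person_id INTEGER REFERENCES names (id)
    );
    INSERT INTO movies VALUES (1, 'Some Movie');
    INSERT INTO names  VALUES (10, 'Some Actor');
""")

# A role that points at an existing movie and an existing person is accepted.
conn.execute("INSERT INTO acting_roles VALUES (1, 10);")

# A role in a movie that does not exist violates referential integrity.
try:
    conn.execute("INSERT INTO acting_roles VALUES (99, 10);")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# A second movie with an already used id violates the primary key.
try:
    conn.execute("INSERT INTO movies VALUES (1, 'Duplicate id');")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

role_count = conn.execute("SELECT COUNT(*) FROM acting_roles;").fetchone()[0]
conn.close()

print(orphan_rejected, duplicate_rejected, role_count)  # True True 1
```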

In [None]:
%sql select * from names LIMIT 5;

In [None]:
%sql select * from movies LIMIT 5;

In [None]:
%sql select * from acting_roles LIMIT 5;

There are also other useful, less structured models, known as `NoSQL` databases, but they are outside the scope of this workshop.