# SSC Data Science and Analytics Workshop 2022

### Intro to Databases in Industry: Data Cleaning, Querying, and Modeling at Scale
---------------

SQL is powerful, fast, and reliable. But unfortunately, queries can quickly become complex, even for routine data wrangling. 

Languages like R and Python have powerful packages, such as `tidyverse` and `pandas`, that are designed to facilitate wrangling and cleaning data. The disadvantage is that they work on data loaded into memory, so they are generally not as fast as a database engine on large tables. Luckily for us, we can connect R directly to the database. Not only that, but we can use the usual `tidyverse` verbs, and `dbplyr` will generate the SQL queries for us, so we get the best of both worlds!

In this part of the workshop, we will explore the R$\leftrightarrow$SQL interface. 




## 1. Connecting `R` to a database 

We will use the `DBI` package to connect R to a database. There are many database management system (DBMS) vendors out there (e.g., Oracle, Microsoft, Postgres, MySQL). Although these DBMSs are broadly similar, they differ in the details. For this reason, we need to tell the `DBI` package which kind of database we want to connect to. In our case, we are using the `PostgreSQL` DBMS, so we need to install the PostgreSQL backend for DBI, which is the package `RPostgres`.

Finally, the package [dbplyr](https://dbplyr.tidyverse.org/) creates the interface with the database and converts the `dplyr` verbs into SQL queries. How does that work? From the user's point of view, it works much as if you had loaded the tables into R as data frames.
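To get a feel for this translation before connecting to anything, here is a minimal sketch. It assumes the `dplyr` and `dbplyr` packages are installed and uses `lazy_frame()` with `simulate_postgres()`, which only simulate a PostgreSQL table, so no database is needed. The table `movies_sim` and its columns are hypothetical stand-ins, not the workshop data.

```r
library(dplyr)
library(dbplyr)

# Hypothetical stand-in for a database table: it has columns and types,
# but no connection is opened and no query is ever executed.
movies_sim <- lazy_frame(
  title   = "a",
  runtime = 100L,
  rating  = 7.5,
  con     = simulate_postgres()
)

# Ordinary dplyr verbs; dbplyr records them instead of running them.
query <- movies_sim %>%
  filter(rating >= 8) %>%
  select(title, runtime)

# Print the SQL that dbplyr would send to a PostgreSQL server.
show_query(query)
```

The same verbs work unchanged on a real table once a connection exists; only the object passed in at the top of the pipeline changes.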


Let's start by creating the connection.

In [1]:
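library(tidyverse) # dbplyr is installed with the tidyverse; dplyr loads it when needed
library(DBI)       # provides dbConnect(), dbListTables(), dbListFields()
library(RPostgres) # PostgreSQL backend for DBI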

-- Attaching packages ----------------------------------------------------------------------- tidyverse 1.3.1 --

v ggplot2 3.3.5     v purrr   0.3.4
v tibble  3.1.6     v dplyr   1.0.7
v tidyr   1.1.4     v stringr 1.4.0
v readr   2.1.1     v forcats 0.5.1

-- Conflicts -------------------------------------------------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

"package 'RPostgres' was built under R version 4.1.3"


**Exercise 1.1** 

Connect R to the imdb database located at [...]. 

In [2]:
# connection <- dbConnect(
#     drv = ..., 
#     user = ..., 
#     password = ..., 
#     port = 5432, # this is the default port for postgres 
#     dbname = ..., 
#     host = ...)

### BEGIN SOLUTION
connection <- dbConnect(
    drv = Postgres(), 
    user = "postgres", 
    password = "1645", 
    port = 5432, # this is the default port for postgres 
    dbname = "imdb", 
    host = "localhost")
### END SOLUTION

Congratulations!!! R is now connected to the database. 
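Outside a workshop setting, credentials are usually kept out of the notebook. The sketch below is one possible way to do that, not part of the original exercise: the helper name `connect_imdb` and the environment variable names `IMDB_DB_USER`, `IMDB_DB_PASSWORD`, and `IMDB_DB_HOST` are hypothetical, and the fallback values simply mirror the solution above.

```r
library(DBI)

# Hypothetical helper: reads credentials from environment variables
# (names are assumptions) instead of hard-coding them in the notebook.
connect_imdb <- function() {
  dbConnect(
    drv      = RPostgres::Postgres(),
    user     = Sys.getenv("IMDB_DB_USER", unset = "postgres"),
    password = Sys.getenv("IMDB_DB_PASSWORD"),
    host     = Sys.getenv("IMDB_DB_HOST", unset = "localhost"),
    port     = 5432,  # default PostgreSQL port
    dbname   = "imdb"
  )
}

# Usage (requires a running PostgreSQL server):
# connection <- connect_imdb()
# dbIsValid(connection)     # TRUE while the connection is open
# dbDisconnect(connection)  # release the connection at the end of the session
```

Calling `dbDisconnect()` when you are completely done frees the connection on the server; keep it open while working through the rest of this notebook.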

## 2. Retrieving data from a database with R

Now that we have the connection ready to go, we can pull data from the database. But before we start pulling data from tables, it is useful to get some information about the database itself (e.g., which tables it contains and which fields each table has): 

- `dbListTables(connection)`: list all tables in the database accessed through `connection`;
- `dbListFields(connection, table_name)`: list all columns of `table_name` in the database.

**Exercise 2.1**

List all tables of the `imdb` database. 


In [3]:
# Your code goes here. 

### BEGIN SOLUTION
dbListTables(connection)
### END SOLUTION

**Exercise 2.2**

List all columns of the `movies` relation in the `imdb` database. (Note: relation is just another name for table in the database literature.) 


In [4]:
# Your code goes here. 

### BEGIN SOLUTION
dbListFields(connection, 'movies')
### END SOLUTION

### 2.1 Wrangling data with `dbplyr`

With `dbplyr`, we can work with a database table as if it were loaded into memory (but it isn't!). 

To "read" a table from a database we can use the [dplyr::tbl](https://dplyr.tidyverse.org/reference/tbl.html) function. 

**Example**

Read the `movies` table from the `imdb` database.

In [5]:
(movies <- tbl(connection, 'movies'))

# Source:   table<movies> [?? x 8]
# Database: postgres [postgres@localhost:5432/imdb]
         id title         orig_title   start_year end_year runtime rating nvotes
      <int> <chr>         <chr>             <int>    <int>   <int>  <dbl>  <int>
 1 10035423 Kate & Leopo~ NA                 2001       NA     118    6.4  74982
 2 10042742 Mister 880    NA                 1950       NA      90    7.1   1171
 3 10041181 Black Hand    NA                 1950       NA      92    6.4    666
 4 10041387 Francis       NA                 1950      

Now we can treat the `movies` variable like a regular tibble that was loaded into memory (although, again, it isn't) and use all the usual `tidyverse` verbs to wrangle and explore the data. 

**Exercise 2.1.1**

What are the top-rated movies produced after 2000 with more than 500 votes? Remove the `id`, `orig_title`, and `end_year` columns. 

In [6]:
# Your code goes here. 
#top_rated_movies <- ...

### BEGIN SOLUTION
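start_time <- Sys.time()
top_rated_movies <- 
    movies %>%
    filter(start_year > 2000 & nvotes > 500) %>%
    select(-id, -orig_title, -end_year) %>%
    # NOTE: head(10) runs before arrange(), so this sorts the first 10 matching
    # rows; swap the two steps to get the 10 highest-rated movies overall.
    head(10) %>%
    arrange(desc(rating)) 
end_time <- Sys.time()
end_time - start_time  # times only the construction of the lazy query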
### END SOLUTION

Time difference of 0.008001089 secs

All evaluations are lazy when using `dbplyr` as the backend of `dplyr` (i.e., the data is not retrieved until requested). The command above therefore only builds the SQL query, which is why it finished in milliseconds; nothing has been fetched from the database yet. 

We can check the generated SQL code using the `show_query` function. 

**Example**

In [7]:
top_rated_movies %>% 
    show_query()

<SQL>
SELECT *
FROM (SELECT "title", "start_year", "runtime", "rating", "nvotes"
FROM "movies"
WHERE ("start_year" > 2000.0 AND "nvotes" > 500.0)
LIMIT 10) "q01"
ORDER BY "rating" DESC
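Note that in the SQL above the `LIMIT 10` sits inside the subquery and the `ORDER BY` outside it: the database keeps 10 matching rows first and only then sorts them. The sketch below compares the two orderings of `head()` and `arrange()`. It is an illustration under assumptions: `movies_sim` is a hypothetical simulated table built with `lazy_frame()` and `simulate_postgres()`, so no database is queried and the exact SQL text may vary with the `dbplyr` version.

```r
library(dplyr)
library(dbplyr)

# Hypothetical simulated table with the columns used in the exercise.
movies_sim <- lazy_frame(
  title = "x", start_year = 2001L, rating = 7.0, nvotes = 1000L,
  con = simulate_postgres()
)

# Limit first, then sort: sorts whichever 10 rows the database returns.
limit_then_sort <- movies_sim %>%
  filter(start_year > 2000 & nvotes > 500) %>%
  head(10) %>%
  arrange(desc(rating))

# Sort first, then limit: returns the 10 highest-rated rows.
sort_then_limit <- movies_sim %>%
  filter(start_year > 2000 & nvotes > 500) %>%
  arrange(desc(rating)) %>%
  head(10)

show_query(limit_then_sort)
show_query(sort_then_limit)
```

If the goal is a true top 10, the second form is the one to use; the verbs are translated in the order they are written, just as they would be applied to an in-memory tibble.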


We can always call the `collect` function to execute the query immediately and bring the result into R as a regular tibble. 

**Example**

In [84]:
top_rated_movies %>% 
    collect()

| title | start_year | runtime | rating | nvotes |
|---|---|---|---|---|
| `<chr>` | `<int>` | `<int>` | `<dbl>` | `<int>` |
| The Lord of the Rings: The Fellowship of the Ring | 2001 | 178 | 8.8 | 1537080 |
| Star Wars: Episode III - Revenge of the Sith | 2005 | 140 | 7.5 | 650834 |
| Frida | 2002 | 123 | 7.4 | 75612 |
| Corpse Bride | 2005 | 77 | 7.3 | 226501 |
| From Hell | 2001 | 122 | 6.8 | 140669 |
| The Shipping News | 2001 | 111 | 6.7 | 31012 |
| Star Wars: Episode II - Attack of the Clones | 2002 | 142 | 6.6 | 584616 |
| Kate & Leopold | 2001 | 118 | 6.4 | 74982 |
| Men in Black II | 2002 | 88 | 6.2 | 320765 |
| Heartbreakers | 2001 | 123 | 6.2 | 49154 |
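The difference between a lazy table and a collected one can be checked on a toy example. This sketch assumes the `RSQLite` package is installed, because `memdb_frame()` stores its data in a temporary in-memory SQLite database that stands in for the PostgreSQL server here; `toy_movies` and its values are made up for illustration.

```r
library(dplyr)
library(dbplyr)

# Made-up data copied into a temporary in-memory SQLite database.
toy_movies <- memdb_frame(
  title  = c("A", "B", "C"),
  rating = c(6.5, 8.1, 7.2),
  nvotes = c(100L, 900L, 700L)
)

# Still lazy: only a description of the query exists at this point.
lazy_result <- toy_movies %>%
  filter(nvotes > 500) %>%
  arrange(desc(rating))

class(lazy_result)

# collect() runs the query and returns an ordinary in-memory tibble.
local_result <- collect(lazy_result)
class(local_result)
```

After `collect()`, `local_result` behaves like any other tibble, so functions that have no SQL translation can be applied to it.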


In [10]:
(principals <- tbl(connection, 'principals'))

# Source:   table<principals> [?? x 3]
# Database: postgres [postgres@localhost:5432/imdb]
   movie_id ordering  name_id
      <int>    <int>    <int>
 1 10035423        1 20000212
 2 10035423        2 20413168
 3 10035423        3 20000630
 4 10035423        4 20005227
 5 10035423        5 20003506
 6 10035423        6 20737216
 7 10035423        7 20465298
 8 10035423        8 20448843
 9 100

**Exercise 2.1.2**

What are the median running time and the average rating of the movies in each genre listed in the `movie_genres` table? Check the SQL code generated by `dbplyr`, and collect the data. 

In [27]:
# Your code goes here
# genres <- ...(..., 'movie_genres')
# ...

### BEGIN SOLUTION
genres <- tbl(connection, 'movie_genres')
summary_genres <- 
    movies %>% 
    left_join(genres, by = c("id" = "movie_id")) %>%
    group_by(genre) %>%
    summarise(median_runtime = median(runtime, na.rm = TRUE),
              avg_rating = mean(rating, na.rm = TRUE))

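# Inspect the generated SQL, then run the query and fetch the result
summary_genres %>% show_query()
summary_genres %>% collect()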
### END SOLUTION
    

### 2.2 Plotting with data from a database

## 3. Writing to a database from R