# Part 2, Lesson 1: Databases, SQL & R
**Author:** Your Name  
**Date:** Block Lecture (4 hours)

# Part 2 – Lesson 1: Advanced Databases & SQL in R

Welcome to the **second part** of our course! In this first **4-hour** block, we’ll dive deeper into **databases**, **SQL**, and how to integrate them seamlessly into your R workflow. By the end of this lesson, you should understand:

- When and why to store data in a **database** vs. flat files.
- How to connect R to a local or remote **SQL database** (using `DBI` + `RSQLite` and notes on MySQL/PostgreSQL).
- Best practices for **querying**, **inserting**, and **updating** data in a relational database.
- Basic database design concepts (e.g., primary keys, foreign keys, indexing).
- Strategies for **ETL** (Extract, Transform, Load) workflows in journalistic or research contexts.


## Outline

1. **Why Databases?**
2. **Setting Up R for Database Connections**
3. **Creating & Managing a Local SQLite Database**
4. **SQL Essentials Refresher**
5. **Advanced Query Examples & Joins**
6. **Importing / Exporting Large Data**
7. **Remote Databases & Practical Considerations**
8. **Mini ETL Workflow**


---

## 1. Why Databases?

In previous lessons, we mainly worked with **CSV** or Excel files. Flat files are fine for small datasets but can become unwieldy when:

- Data is **large** or frequently updated.
- Multiple datasets need to be **linked** via common keys.
- We require **complex queries** or want to optimize data retrieval.

A **relational database** (e.g., SQLite, MySQL, PostgreSQL) stores data in structured tables, is optimized for queries, and supports ACID transactions (atomicity, consistency, isolation, durability). For journalists, this can be crucial when handling large public datasets (e.g., repeated FOIA requests or election data).

> **Key concepts**:

- **Tables** (like sheets, but with strict schemas).
- **Primary keys** (unique row identifier).
- **Foreign keys** (relation to another table’s primary key).
- **Indexes** (speed up queries on certain columns).
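
To make these concepts concrete, here is a minimal sketch using a throwaway in-memory SQLite database. The `agencies` and `requests` tables and their columns are hypothetical names chosen for illustration; they are not part of the lesson's datasets. Note that SQLite only enforces foreign keys once `PRAGMA foreign_keys` is switched on for the connection.

```r
library(DBI)
library(RSQLite)

# Throwaway in-memory database, used only for this illustration
demo_con <- dbConnect(SQLite(), ":memory:")

# SQLite enforces foreign keys only when this pragma is switched on
dbExecute(demo_con, "PRAGMA foreign_keys = ON;")

# Parent table: one row per agency, identified by a primary key
dbExecute(demo_con, "CREATE TABLE agencies (
                       agency_id   INTEGER PRIMARY KEY,
                       agency_name TEXT NOT NULL
                     );")

# Child table: each request points at an agency via a foreign key
dbExecute(demo_con, "CREATE TABLE requests (
                       request_id INTEGER PRIMARY KEY,
                       agency_id  INTEGER NOT NULL REFERENCES agencies (agency_id),
                       filed_on   TEXT
                     );")

# Index on the column we expect to filter and join on
dbExecute(demo_con, "CREATE INDEX idx_requests_agency ON requests (agency_id);")

dbExecute(demo_con, "INSERT INTO agencies (agency_id, agency_name) VALUES (1, 'Transport');")
dbExecute(demo_con, "INSERT INTO requests (agency_id, filed_on) VALUES (1, '2024-01-15');")

# A request for a non-existent agency violates the foreign key and is rejected;
# tryCatch() turns the database error into a message string
orphan_insert <- tryCatch(
  dbExecute(demo_con, "INSERT INTO requests (agency_id, filed_on) VALUES (99, '2024-02-01');"),
  error = function(e) conditionMessage(e)
)
orphan_insert

# Only the valid request was stored
n_requests <- dbGetQuery(demo_con, "SELECT COUNT(*) AS n FROM requests;")$n
n_requests

dbDisconnect(demo_con)
```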


---

## 2. Setting Up R for Database Connections

R provides two main avenues:

1. **`DBI` + `RSQLite`** (or `RMariaDB` for MySQL, `RPostgres` for PostgreSQL, etc.): a general interface for working with databases.
2. **`sqldf`**: simpler but primarily for querying data frames as if they were tables.

In a professional setting, `DBI` is more flexible and robust. Let’s install/load the packages we need:


```r
# If needed:
# install.packages("DBI")
# install.packages("RSQLite")
# install.packages("dplyr")  # For demonstration

library(DBI)
library(RSQLite)
library(dplyr) # We'll combine SQL & dplyr if needed
```


### Potential Packages for Other SQL Databases

- **MySQL**: `RMariaDB` or `RMySQL`.
- **PostgreSQL**: `RPostgres`.
- **Microsoft SQL Server**: Various ODBC-based connectors or `odbc` package.

> **Instructor Note**: For the sake of this lesson, we’ll use **SQLite** since it’s file-based and easy for demonstration. The same commands generally apply to MySQL/Postgres with minor changes in the connection details.


---

## 3. Creating & Managing a Local SQLite Database

We’ll create an **in-memory** database first, then we’ll demonstrate creating a **file-based** SQLite database.

### 3.1 In-Memory Example


```r
# Create an in-memory SQLite database
con <- dbConnect(SQLite(), ":memory:")

# We'll create a sample table from a built-in dataset (mtcars)
df_mtcars <- mtcars %>%
  mutate(car_name = row.names(mtcars)) %>%
  select(car_name, everything())

# Write the data frame to the DB as a new table 'cars'
dbWriteTable(con, "cars", df_mtcars)

# Let's list the tables in our in-memory DB
dbListTables(con)
```


### 3.2 Querying the Table

We can use `dbGetQuery()` to run standard SQL commands:


```r
# SELECT only cars with 6 cylinders, sorted by mpg descending
query_result <- dbGetQuery(con, "SELECT car_name, cyl, mpg
                           FROM cars
                           WHERE cyl = 6
                           ORDER BY mpg DESC;")
query_result
```
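
When the filter value comes from outside the script (user input, a loop variable, a config file), it is usually safer to pass it as a query parameter than to paste it into the SQL string. The sketch below assumes a fresh in-memory database and a hypothetical variable `wanted_cyl`; the `params` argument of `dbGetQuery()` is part of DBI, and `?` is the placeholder syntax RSQLite accepts.

```r
library(DBI)
library(RSQLite)

# Separate in-memory database so the sketch is self-contained
param_con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(param_con, "cars",
             data.frame(car_name = rownames(mtcars), mtcars, row.names = NULL))

# Hypothetical value arriving from outside the SQL text
wanted_cyl <- 6

# '?' is a placeholder; the value is sent separately from the query text
six_cyl <- dbGetQuery(
  param_con,
  "SELECT car_name, cyl, mpg FROM cars WHERE cyl = ? ORDER BY mpg DESC;",
  params = list(wanted_cyl)
)
six_cyl

dbDisconnect(param_con)
```

Because the value never becomes part of the SQL text, quoting problems and SQL injection are avoided.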


> **Note**: The in-memory database will disappear once we close the connection or end the session.

### 3.3 File-Based SQLite Database

To persist data:

```r
con_file <- dbConnect(SQLite(), "my_database.sqlite")
# Now we can do the same dbWriteTable, dbGetQuery, etc.
dbWriteTable(con_file, "cars", df_mtcars, overwrite = TRUE)  # overwrite so re-running does not fail
dbListTables(con_file)
dbDisconnect(con_file)  # Always disconnect when done!
```

This will create a **my_database.sqlite** file in your working directory.
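
To check that the data really survives a closed connection, the following sketch writes a table, disconnects, and reconnects to the same file. It uses a temporary path (`db_path`, a name chosen here for illustration) so that it does not touch your working directory.

```r
library(DBI)
library(RSQLite)

# Temporary file path, so nothing is left behind in the working directory
db_path <- tempfile(fileext = ".sqlite")

# First session: write a table, then close the connection
con_write <- dbConnect(SQLite(), db_path)
dbWriteTable(con_write, "cars", mtcars, overwrite = TRUE)
dbDisconnect(con_write)

# Second session: reconnect to the same file and read the table back
con_read <- dbConnect(SQLite(), db_path)
tables_found <- dbListTables(con_read)
row_count <- dbGetQuery(con_read, "SELECT COUNT(*) AS n FROM cars;")$n
dbDisconnect(con_read)

tables_found
row_count

unlink(db_path)  # remove the temporary file
```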


---

## 4. SQL Essentials Refresher

Even if we covered SQL in previous lessons, let’s recap key statements:

- **`SELECT ... FROM ...`**: basic query.
- **`WHERE ...`**: filtering.
- **`JOIN`** clauses: combine related tables.
- **`INSERT`, `UPDATE`, `DELETE`**: modifying data.
- **`CREATE TABLE`, `DROP TABLE`, `ALTER TABLE`**: schema changes.

### 4.1 Creating a Table via SQL

Instead of `dbWriteTable`, we can run raw SQL to create an empty table:

```sql
CREATE TABLE employees (
  emp_id INTEGER PRIMARY KEY,
  name TEXT NOT NULL,
  department TEXT,
  salary REAL
);
```

We then **INSERT** rows. Let’s demonstrate.


```r
# We'll reuse our in-memory connection 'con'
# 1) Create an 'employees' table
dbExecute(con, "DROP TABLE IF EXISTS employees;") # Clean slate
dbExecute(con, "CREATE TABLE employees (
                 emp_id INTEGER PRIMARY KEY,
                 name TEXT NOT NULL,
                 department TEXT,
                 salary REAL
               );")

# 2) Insert rows
dbExecute(con, "INSERT INTO employees (name, department, salary)
             VALUES ('Alice', 'Sales', 50000),
                    ('Bob', 'Marketing', 60000),
                    ('Carla', 'IT', 70000);")

# 3) Check results
dbGetQuery(con, "SELECT * FROM employees;")
```


### 4.2 Updating & Deleting

```r
# Update Bob's salary
dbExecute(con, "UPDATE employees SET salary = 65000 WHERE name = 'Bob';")
# Remove Carla
dbExecute(con, "DELETE FROM employees WHERE name = 'Carla';")

# Check again
dbGetQuery(con, "SELECT * FROM employees;")
```

These statements are powerful for large-scale data manipulation, but an `UPDATE` or `DELETE` without a `WHERE` clause changes every row. On production databases, wrap multi-step changes in a **transaction** and keep **backups**.
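
As a rough illustration of transaction safety, the sketch below uses DBI's `dbWithTransaction()`, which commits if the code block finishes and rolls back if it raises an error. The `employees` table here is a small stand-in created only for this example, and the failure is simulated with `stop()`.

```r
library(DBI)
library(RSQLite)

tx_con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(tx_con, "employees",
             data.frame(name = c("Alice", "Bob"), salary = c(50000, 60000)))

# Both updates should succeed together or not at all
tx_result <- tryCatch(
  dbWithTransaction(tx_con, {
    dbExecute(tx_con, "UPDATE employees SET salary = salary - 5000 WHERE name = 'Alice';")
    stop("simulated failure before the second update")
    # Never reached in this sketch; shown to indicate the intended second step
    dbExecute(tx_con, "UPDATE employees SET salary = salary + 5000 WHERE name = 'Bob';")
  }),
  error = function(e) "rolled back"
)
tx_result

# The first update was undone, so the salaries are unchanged
salaries <- dbGetQuery(tx_con, "SELECT name, salary FROM employees ORDER BY name;")
salaries

dbDisconnect(tx_con)
```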


---

## 5. Advanced Query Examples & Joins

### 5.1 JOIN Basics

**Joins** let us combine multiple tables based on **key columns**. Let’s create a second table, `departments`, to store department info:


```r
# 1) Create 'departments' table
dbExecute(con, "DROP TABLE IF EXISTS departments;")
dbExecute(con, "CREATE TABLE departments (
                 dept_name TEXT PRIMARY KEY,
                 location  TEXT
               );")

# 2) Insert rows
dbExecute(con, "INSERT INTO departments (dept_name, location)
             VALUES ('Sales', 'Building A'),
                    ('Marketing', 'Building B'),
                    ('IT', 'Annex 1');")

# 3) Let's do a LEFT JOIN on employees -> departments
joined_query <- dbGetQuery(
  con,
  "SELECT e.name, e.department, e.salary, d.location
   FROM employees e
   LEFT JOIN departments d
   ON e.department = d.dept_name;"
)

joined_query
```


> **Tip**: We can also do **`INNER JOIN`** (only matching records), **`RIGHT JOIN`**, or **`FULL OUTER JOIN`**. SQLite supports the latter two only from version 3.39.0 onward; on older versions they have to be simulated with `LEFT JOIN` and `UNION`.
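
One common way to simulate a full outer join on a SQLite build that lacks it is to stack a left join on top of the unmatched rows from the other side. The sketch below assumes two tiny stand-in tables (with a hypothetical employee "Dan" in a department that has no entry, and an "IT" department with no employees) so that both kinds of unmatched rows appear.

```r
library(DBI)
library(RSQLite)

join_con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(join_con, "employees",
             data.frame(name = c("Alice", "Dan"),
                        department = c("Sales", "Legal")))
dbWriteTable(join_con, "departments",
             data.frame(dept_name = c("Sales", "IT"),
                        location = c("Building A", "Annex 1")))

# Part 1: every employee, with department info where it exists.
# Part 2: departments that matched no employee (anti-join on the join key).
full_join_sql <- "
  SELECT e.name, e.department, d.dept_name, d.location
  FROM employees e
  LEFT JOIN departments d ON e.department = d.dept_name
  UNION ALL
  SELECT e.name, e.department, d.dept_name, d.location
  FROM departments d
  LEFT JOIN employees e ON e.department = d.dept_name
  WHERE e.department IS NULL;"

full_join <- dbGetQuery(join_con, full_join_sql)
full_join

dbDisconnect(join_con)
```

The result has one matched row (Alice), one employee without a department record (Dan), and one department without employees (IT).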

### 5.2 Aggregation & Group By

Let’s compute average salary per department:

```sql
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;
```


```r
dbGetQuery(con, "SELECT department, AVG(salary) AS avg_salary
             FROM employees
             GROUP BY department;")
```


> **Exercise**: Combine knowledge from previous parts.

1. Insert more employees with different salaries.
2. Do a grouped query by `department` with `COUNT(*)` and `AVG(salary)`.


---

## 6. Importing / Exporting Large Data

Databases shine with **large** datasets. In R:

1. You can **bulk insert** large CSV files using `dbWriteTable` or custom scripts.
2. Export from DB to CSV with `dbGetQuery` and `write.csv`.
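
For exports that are too large to hold in memory at once, one option is to fetch the result in pieces with `dbSendQuery()` and `dbFetch()` and append each piece to the CSV. This is a sketch under simple assumptions: `mtcars` stands in for a large table, and `out_path` and `chunk_size` are illustrative names and values.

```r
library(DBI)
library(RSQLite)

export_con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(export_con, "cars", mtcars)

out_path <- tempfile(fileext = ".csv")  # illustrative output location
chunk_size <- 10                        # rows per fetch; tune for real data

res <- dbSendQuery(export_con, "SELECT * FROM cars;")
first_chunk <- TRUE
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = chunk_size)
  if (nrow(chunk) == 0) break
  # Write the header only once, then append the remaining chunks
  write.table(chunk, out_path, sep = ",", row.names = FALSE,
              col.names = first_chunk, append = !first_chunk)
  first_chunk <- FALSE
}
dbClearResult(res)   # always release the result set
dbDisconnect(export_con)

# Read the file back to confirm all rows arrived
exported <- read.csv(out_path)
nrow(exported)
```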

### Example: Writing a Large Data Frame

```r
dbWriteTable(con, "big_table", big_df)
```

If you have memory constraints, consider chunked approaches or use the DB’s **import** utilities (like `sqlite3` command-line or MySQL’s `LOAD DATA INFILE`).
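
A chunked load can be sketched in base R by reading a fixed number of rows at a time from an open file connection and appending each chunk to the table. The example writes `mtcars` to a temporary CSV as a stand-in for a large file; `csv_path`, `chunk_size`, and the table name are illustrative. With real data you would likely also pass `colClasses` to `read.csv()` so that every chunk is parsed with the same column types.

```r
library(DBI)
library(RSQLite)

# Stand-in for a large CSV file
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

load_con <- dbConnect(SQLite(), ":memory:")
chunk_size <- 10  # rows per chunk; tune for real data

input <- file(csv_path, open = "r")

# The first read consumes the header line plus the first chunk
chunk <- read.csv(input, nrows = chunk_size)
col_names <- names(chunk)

repeat {
  # append = TRUE creates the table on the first pass and adds rows afterwards
  dbWriteTable(load_con, "big_table", chunk, append = TRUE)

  # Later reads continue from the current position and have no header
  chunk <- tryCatch(
    read.csv(input, nrows = chunk_size, header = FALSE, col.names = col_names),
    error = function(e) NULL  # read.csv() raises an error once no lines remain
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
}
close(input)

loaded_rows <- dbGetQuery(load_con, "SELECT COUNT(*) AS n FROM big_table;")$n
loaded_rows

dbDisconnect(load_con)
```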


### Combining dplyr & Databases

The **`dplyr`** package integrates with `DBI` to let you write code like:

```r
library(dplyr)
tbl_con <- tbl(con, "employees")  # reference a remote table
tbl_con %>%
  filter(salary > 55000) %>%
  arrange(desc(salary)) %>%
  collect()  # pulls data back into R
```

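This translation is done by the **`dbplyr`** package (the `dplyr` backend for databases), which must be installed alongside `dplyr`. It converts your R code into **SQL** behind the scenes and only pulls the full result into R when you call `collect()`.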
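
To see the generated SQL for yourself, `show_query()` prints it without fetching the data. The sketch below assumes `dbplyr` is installed and uses a small stand-in `employees` table in its own in-memory database.

```r
library(DBI)
library(RSQLite)
library(dplyr)
library(dbplyr)

lazy_con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(lazy_con, "employees",
             data.frame(name = c("Alice", "Bob", "Carla"),
                        salary = c(50000, 60000, 70000)))

# Build the query lazily; no rows are pulled into R at this point
high_earners <- tbl(lazy_con, "employees") %>%
  filter(salary > 55000) %>%
  arrange(desc(salary))

# Print the SQL that dbplyr generated for this pipeline
show_query(high_earners)

# collect() runs the query and returns a regular tibble
high_earners_df <- collect(high_earners)
high_earners_df

dbDisconnect(lazy_con)
```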


---

## 7. Remote Databases & Practical Considerations

1. **Authentication**: You’ll typically need a **username** and **password** to connect to MySQL/Postgres. For example:
   ```r
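   # Example for MySQL
   library(DBI)
   library(RMariaDB)
   con_mysql <- dbConnect(MariaDB(),
                         user = "myuser",
                         # Read the password from an environment variable instead of
                         # hard-coding it (the name MYSQL_PASSWORD is only an example)
                         password = Sys.getenv("MYSQL_PASSWORD"),
                         host = "dbserver.example.com",
                         dbname = "mydb")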
   ```
2. **Security**: For journalism projects on sensitive data, ensure your connections are **encrypted** (SSL/TLS). Avoid storing passwords in scripts.
3. **Schema Management**: Larger projects may require designing multiple normalized tables, or using migrations for schema changes.
4. **Performance**: Index frequently queried columns. For instance,
   ```sql
   CREATE INDEX idx_salary ON employees (salary);
   ```
   This speeds up queries filtering by `salary`.
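
One way to check whether an index is actually used is SQLite's `EXPLAIN QUERY PLAN`. The sketch below builds a synthetic `employees` table (the data are made up purely for illustration) and compares the plan before and after creating the index. The exact wording of the plan text varies between SQLite versions, so the code only inspects whether the index name appears.

```r
library(DBI)
library(RSQLite)

idx_con <- dbConnect(SQLite(), ":memory:")

# Synthetic data: 1000 employees with evenly spaced salaries
dbWriteTable(idx_con, "employees",
             data.frame(emp_id = 1:1000,
                        salary = seq(30000, by = 50, length.out = 1000)))

plan_sql <- "EXPLAIN QUERY PLAN SELECT * FROM employees WHERE salary = 45000;"

# Without an index, SQLite has to scan the whole table
plan_before <- dbGetQuery(idx_con, plan_sql)$detail

dbExecute(idx_con, "CREATE INDEX idx_salary ON employees (salary);")

# With the index in place, the plan should mention idx_salary
plan_after <- dbGetQuery(idx_con, plan_sql)$detail

plan_before
plan_after

dbDisconnect(idx_con)
```

Keep in mind that every index also has to be maintained on `INSERT`, `UPDATE`, and `DELETE`, so indexing every column is rarely a good idea.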


---

## 8. Mini ETL Workflow

Let’s outline a short end-to-end scenario:

1. **Extract**: We have a raw CSV with employee data (`raw_employees.csv`).
2. **Transform**: Clean column names, fix missing salaries, unify department naming.
3. **Load**: Insert into a new table in our SQLite database.
4. **Query**: Summaries or merges with other tables.

### Example Code

```r
library(DBI)
library(dplyr)
library(readr)    # read_csv()
library(stringr)  # str_to_title()

# 1) Extract
# Assumes raw_employees.csv exists and has 'department' and 'salary' columns
raw_df <- read_csv("raw_employees.csv")

# 2) Transform
clean_df <- raw_df %>%
  janitor::clean_names() %>%                # optional: consistent col naming (needs 'janitor')
  mutate(
    department = str_to_title(department),  # unify naming
    salary = ifelse(is.na(salary), 40000, salary) # fill missing with 40k
  )

# 3) Load into DB ('con' is the SQLite connection opened earlier)
dbWriteTable(con, "clean_employees", clean_df, overwrite = TRUE)

# 4) Query
res <- dbGetQuery(
  con,
  "SELECT department, AVG(salary) AS avg_salary
   FROM clean_employees
   GROUP BY department;"
)

res
```

> **Practical**: For very large datasets, consider chunked reads/writes or direct DB import tools.


---

## Wrap-Up & Next Steps

In this 4-hour lesson, you’ve learned:

- Why and when to use **databases** (scalability, complex queries, multi-table linking).
- Connecting R to **SQLite** (and references for MySQL/Postgres).
- Core SQL statements (SELECT, JOIN, GROUP BY, INSERT, UPDATE, etc.).
- Basic **ETL** patterns for real-world data.

**Next Lesson** (Part 2, Lesson 2) might cover:

- More advanced **SQL** concepts (window functions, subqueries) or
- Database design (normalization, advanced indexing, foreign key constraints) or
- Combining R-based analyses with dashboards (e.g., R Shiny or other front-ends) for data storytelling.

### Additional Resources

- **R for Data Science** (Hadley Wickham et al.) – for `dplyr/dbplyr` usage.
- **Database Design** courses or tutorials for deeper schema architecture.
- **SQLBolt** or **W3Schools** – quick references for SQL syntax.
- **Databases for Journalists** guides (e.g., from NICAR, IRE) for practical insights.

Feel free to explore, experiment, and build small demonstration databases as you become more comfortable with SQL in R!

# End of Part 2, Lesson 1
