# DATA 304 – Module 6, Session 1

## Relational Databases with SQLite: Import, Query, Join, Aggregate
**Goal:** Load data into SQLite, then practice SQL queries, joins, and aggregation from Python.

**What you'll learn:**
- Create and connect to a SQLite database
- Import CSV data into tables
- Run SELECT, WHERE, ORDER BY, GROUP BY, HAVING
- Perform INNER and LEFT JOINs
- Create indexes and use parameterized queries
- Export results back to CSV

**Prereqs:** Python 3 (the `sqlite3` module ships with the standard library), `pandas`

## 1) Setup

In [1]:
import sqlite3
import pandas as pd
print(pd.__version__)

2.2.2


## 2) Create SQLite DB and tables

In [2]:
db_path = './data/module6_session1.db'
conn = sqlite3.connect(db_path)
cur = conn.cursor()

cur.execute('DROP TABLE IF EXISTS employees;')
cur.execute('DROP TABLE IF EXISTS departments;')

cur.execute('''CREATE TABLE departments (
    dept_id INTEGER PRIMARY KEY,
    dept_name TEXT NOT NULL
);''')

cur.execute('''CREATE TABLE employees (
    emp_id INTEGER PRIMARY KEY,
    name   TEXT NOT NULL,
    dept_id INTEGER,
    salary REAL,
    hire_date TEXT,
    FOREIGN KEY(dept_id) REFERENCES departments(dept_id)
);''')
conn.commit()
print('Tables created.')

Tables created.
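
One caveat about the `FOREIGN KEY` clause above: SQLite parses it, but by default it is typically not enforced unless `PRAGMA foreign_keys = ON` is issued on each connection. The sketch below uses a throwaway in-memory database and a hypothetical helper, `orphan_insert_succeeds`, to show the difference. The `dept_id` value 99 is only an illustrative nonexistent department.

```python
import sqlite3

def orphan_insert_succeeds(enforce: bool) -> bool:
    """Return True if an employee row with an unknown dept_id can be inserted."""
    conn = sqlite3.connect(":memory:")  # throwaway database, nothing touches disk
    # The pragma is per-connection and must be set before a transaction starts.
    conn.execute(f"PRAGMA foreign_keys = {'ON' if enforce else 'OFF'};")
    conn.execute(
        "CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, dept_name TEXT NOT NULL);"
    )
    conn.execute("""CREATE TABLE employees (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        dept_id INTEGER,
        FOREIGN KEY(dept_id) REFERENCES departments(dept_id)
    );""")
    conn.execute("INSERT INTO departments VALUES (10, 'Engineering');")
    try:
        # dept_id 99 does not exist in departments (illustrative value)
        conn.execute("INSERT INTO employees VALUES (1, 'Alice', 99);")
        return True
    except sqlite3.IntegrityError:
        return False
    finally:
        conn.close()

print(orphan_insert_succeeds(enforce=False))  # True: the orphan row is accepted
print(orphan_insert_succeeds(enforce=True))   # False: the insert is rejected
```

If referential integrity matters for the session database, the same pragma could be run on `conn` right after `sqlite3.connect(db_path)`.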


## 3) Load CSVs into tables

In [3]:
df_emp = pd.read_csv('data/employees.csv')
df_dep = pd.read_csv('data/departments.csv')

# if_exists options: 'fail' (default; raise an error if the table exists),
# 'replace' (drop and recreate the table), 'append' (insert rows into the existing table).
df_dep.to_sql('departments', conn, if_exists='append', index=False)
df_emp.to_sql('employees', conn, if_exists='append', index=False)

pd.read_sql('SELECT * FROM employees LIMIT 5;', conn)

Unnamed: 0,emp_id,name,dept_id,salary,hire_date
0,1,Alice,10,90000.0,2021-03-15
1,2,Bob,20,75000.0,2020-07-01
2,3,Carla,10,105000.0,2019-11-20
3,4,Dan,30,68000.0,2022-04-10
4,5,Eve,20,82000.0,2021-12-01
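
For comparison, the same kind of load can be done without pandas. This is a minimal sketch using the standard-library `csv` module and `executemany`, not the method used in this session. The CSV text is copied from the five rows shown above and held in memory as a stand-in for `data/employees.csv`.

```python
import csv
import io
import sqlite3

# Stand-in for data/employees.csv: the five rows displayed above.
csv_text = """emp_id,name,dept_id,salary,hire_date
1,Alice,10,90000.0,2021-03-15
2,Bob,20,75000.0,2020-07-01
3,Carla,10,105000.0,2019-11-20
4,Dan,30,68000.0,2022-04-10
5,Eve,20,82000.0,2021-12-01
"""

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employees (
    emp_id    INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    dept_id   INTEGER,
    salary    REAL,
    hire_date TEXT
);""")

# csv yields every field as a string, so convert the numeric columns explicitly.
rows = [
    (int(r["emp_id"]), r["name"], int(r["dept_id"]), float(r["salary"]), r["hire_date"])
    for r in csv.DictReader(io.StringIO(csv_text))
]

with conn:  # commits on success, rolls back if any insert fails
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?, ?);", rows)

n_rows = conn.execute("SELECT COUNT(*) FROM employees;").fetchone()[0]
first_row = conn.execute("SELECT * FROM employees ORDER BY emp_id LIMIT 1;").fetchone()
print(n_rows)     # 5
print(first_row)  # (1, 'Alice', 10, 90000.0, '2021-03-15')
conn.close()
```

`to_sql` is shorter and infers types for you; the manual route gives explicit control over conversions and keeps the whole load in one transaction.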


## 4) Basic SELECT, WHERE, ORDER BY

In [4]:
pd.read_sql(''' 
                SELECT emp_id, name, salary
                FROM employees
                ORDER BY salary DESC;
            ''', conn)

Unnamed: 0,emp_id,name,salary
0,3,Carla,105000.0
1,1,Alice,90000.0
2,5,Eve,82000.0
3,2,Bob,75000.0
4,6,Fred,72000.0
5,4,Dan,68000.0


In [5]:
pd.read_sql(''' 
                SELECT name, salary 
                FROM employees 
                WHERE salary >= 80000 
                ORDER BY salary DESC;
            ''', conn)

Unnamed: 0,name,salary
0,Carla,105000.0
1,Alice,90000.0
2,Eve,82000.0


## 5) Aggregation with GROUP BY and HAVING

In [6]:
pd.read_sql(''' 
                SELECT dept_id, COUNT(*) AS n_emp, AVG(salary) AS avg_salary
                FROM employees
                GROUP BY dept_id
                ORDER BY avg_salary DESC;
            ''', conn)

Unnamed: 0,dept_id,n_emp,avg_salary
0,10,2,97500.0
1,20,2,78500.0
2,30,2,70000.0


In [7]:
pd.read_sql(''' 
                SELECT dept_id, COUNT(*) AS n_emp, AVG(salary) AS avg_salary
                FROM employees
                GROUP BY dept_id
                HAVING AVG(salary) >= 80000;
            ''', conn)

Unnamed: 0,dept_id,n_emp,avg_salary
0,10,2,97500.0


In [8]:
pd.read_sql(''' 
                SELECT dept_id, COUNT(*) AS n_emp, AVG(salary) AS avg_salary
                FROM employees
                WHERE salary >= 80000
                GROUP BY dept_id;
            ''', conn)

Unnamed: 0,dept_id,n_emp,avg_salary
0,10,2,97500.0
1,20,1,82000.0


In [9]:
pd.read_sql(''' 
                SELECT dept_id, COUNT(*) AS n_emp, AVG(salary) AS avg_salary
                FROM employees
                WHERE salary >= 80000
                GROUP BY dept_id
                HAVING AVG(salary) >= 80000;
            ''', conn)

Unnamed: 0,dept_id,n_emp,avg_salary
0,10,2,97500.0
1,20,1,82000.0
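
The four queries above differ only in where the filter is applied: `WHERE` removes individual rows before grouping, while `HAVING` removes whole groups after the aggregates are computed. The following self-contained sketch reproduces that contrast on an in-memory copy of the six employees shown earlier, using plain `sqlite3` so it runs without the CSV files.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER, salary REAL);"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?);",
    [(1, "Alice", 10, 90000), (2, "Bob", 20, 75000), (3, "Carla", 10, 105000),
     (4, "Dan", 30, 68000), (5, "Eve", 20, 82000), (6, "Fred", 30, 72000)],
)

# HAVING: every employee is counted, then departments are filtered on the average.
having_rows = conn.execute("""
    SELECT dept_id, COUNT(*), AVG(salary)
    FROM employees
    GROUP BY dept_id
    HAVING AVG(salary) >= 80000
    ORDER BY dept_id;
""").fetchall()

# WHERE: employees under 80000 are discarded first, so Bob no longer pulls
# department 20's average down and the department survives with one row.
where_rows = conn.execute("""
    SELECT dept_id, COUNT(*), AVG(salary)
    FROM employees
    WHERE salary >= 80000
    GROUP BY dept_id
    ORDER BY dept_id;
""").fetchall()

print(having_rows)  # [(10, 2, 97500.0)]
print(where_rows)   # [(10, 2, 97500.0), (20, 1, 82000.0)]
conn.close()
```

This also explains why cell 9 returns the same rows as cell 8: once `WHERE` has kept only salaries of at least 80000, every remaining group average is necessarily at least 80000, so the `HAVING` clause filters nothing.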


## 6) Joins

In [10]:
pd.read_sql('''
                SELECT e.emp_id, e.name, d.dept_name, e.salary
                FROM employees e
                INNER JOIN departments d ON e.dept_id = d.dept_id
                ORDER BY e.emp_id;
            ''', conn)

Unnamed: 0,emp_id,name,dept_name,salary
0,1,Alice,Engineering,90000.0
1,2,Bob,Marketing,75000.0
2,3,Carla,Engineering,105000.0
3,4,Dan,Finance,68000.0
4,5,Eve,Marketing,82000.0
5,6,Fred,Finance,72000.0


In [11]:
pd.read_sql('''
                SELECT d.dept_id, d.dept_name, e.emp_id, e.name
                FROM departments d
                LEFT JOIN employees e ON e.dept_id = d.dept_id
                ORDER BY d.dept_id;
            ''', conn)

Unnamed: 0,dept_id,dept_name,emp_id,name
0,10,Engineering,1.0,Alice
1,10,Engineering,3.0,Carla
2,20,Marketing,2.0,Bob
3,20,Marketing,5.0,Eve
4,30,Finance,4.0,Dan
5,30,Finance,6.0,Fred
6,40,HR,,
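
In the LEFT JOIN output above, HR has no matching employee, so its `emp_id` and `name` are NULL. pandas represents the missing integer as NaN, which is why the whole `emp_id` column is displayed as floats (`1.0`, `3.0`, ...). The sketch below repeats both joins with plain `sqlite3` on an in-memory copy of the data, where the unmatched value comes back as `None`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, dept_name TEXT NOT NULL);")
conn.execute("CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT NOT NULL, dept_id INTEGER);")
conn.executemany(
    "INSERT INTO departments VALUES (?, ?);",
    [(10, "Engineering"), (20, "Marketing"), (30, "Finance"), (40, "HR")],
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?);",
    [(1, "Alice", 10), (2, "Bob", 20), (3, "Carla", 10),
     (4, "Dan", 30), (5, "Eve", 20), (6, "Fred", 30)],
)

# INNER JOIN keeps only departments that have at least one employee.
inner_rows = conn.execute("""
    SELECT d.dept_name, e.emp_id
    FROM departments d
    INNER JOIN employees e ON e.dept_id = d.dept_id
    ORDER BY d.dept_id, e.emp_id;
""").fetchall()

# LEFT JOIN keeps every department; unmatched ones get NULL employee columns.
left_rows = conn.execute("""
    SELECT d.dept_name, e.emp_id
    FROM departments d
    LEFT JOIN employees e ON e.dept_id = d.dept_id
    ORDER BY d.dept_id, e.emp_id;
""").fetchall()

print(len(inner_rows), len(left_rows))  # 6 7
print(left_rows[-1])                    # ('HR', None)
conn.close()
```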


## 7) Parameterized queries

In [12]:
min_salary = 80000
sql = 'SELECT name, salary FROM employees WHERE salary >= ? ORDER BY salary DESC;'
pd.read_sql(sql, conn, params=(min_salary,))

Unnamed: 0,name,salary
0,Carla,105000.0
1,Alice,90000.0
2,Eve,82000.0


In [13]:
min_salary = 80000
max_salary = 100000
sql = 'SELECT name, salary FROM employees WHERE salary >= ? AND salary < ? ORDER BY salary DESC;'
pd.read_sql(sql, conn, params=(min_salary, max_salary))

Unnamed: 0,name,salary
0,Alice,90000.0
1,Eve,82000.0
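
Besides positional `?` placeholders, `sqlite3` also accepts named placeholders (`:name`) bound from a dict, which can be easier to read once a query has several parameters. The sketch below shows that style on an in-memory table, and also why binding matters: a bound value is treated as data, never as SQL text. The `hostile` string is an invented example input. pandas passes `params` through to the driver, so a dict should work with `pd.read_sql` as well, though that is not tested here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary REAL);")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?);",
    [("Alice", 90000), ("Bob", 75000), ("Carla", 105000),
     ("Dan", 68000), ("Eve", 82000), ("Fred", 72000)],
)

sql = """
    SELECT name, salary
    FROM employees
    WHERE salary >= :min_salary AND salary < :max_salary
    ORDER BY salary DESC;
"""
# Keys of the dict match the :named placeholders in the SQL text.
rows = conn.execute(sql, {"min_salary": 80000, "max_salary": 100000}).fetchall()
print(rows)  # [('Alice', 90000.0), ('Eve', 82000.0)]

# Invented hostile input: with string concatenation this would match every row.
hostile = "Alice' OR '1'='1"
safe = conn.execute("SELECT name FROM employees WHERE name = ?;", (hostile,)).fetchall()
print(safe)  # []: no employee has that literal name
conn.close()
```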


## 8) Indexes and EXPLAIN QUERY PLAN

In [14]:
pd.read_sql('EXPLAIN QUERY PLAN SELECT name FROM employees WHERE salary >= 80000;', conn)

Unnamed: 0,id,parent,notused,detail
0,2,0,0,SCAN employees


In [15]:
cur.execute('CREATE INDEX IF NOT EXISTS idx_emp_salary ON employees(salary);')
conn.commit()
pd.read_sql('EXPLAIN QUERY PLAN SELECT name FROM employees WHERE salary >= 80000;', conn)

Unnamed: 0,id,parent,notused,detail
0,3,0,0,SEARCH employees USING INDEX idx_emp_salary (s...
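
A possible extension of the idea above is a covering index: if the index holds both the filter column and the selected column, SQLite can usually answer the query from the index alone without visiting the table. The sketch below is an illustration on an empty in-memory table; the index name `idx_emp_salary_name` and the helper `plan` are hypothetical, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT, salary REAL);")

def plan(sql: str) -> str:
    """Join the 'detail' column (the last one) of each EXPLAIN QUERY PLAN row."""
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT name FROM employees WHERE salary >= 80000;"
before = plan(query)  # no index yet, so a full table scan is expected

# Hypothetical index covering both the WHERE column and the selected column.
conn.execute("CREATE INDEX idx_emp_salary_name ON employees(salary, name);")
after = plan(query)   # expected to mention a COVERING INDEX

print(before)
print(after)
conn.close()
```

The trade-off is the usual one for indexes: faster reads for this query shape in exchange for extra storage and slightly slower writes.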


## 9) Export results

In [16]:
res = pd.read_sql('SELECT name, salary FROM employees WHERE salary >= 80000;', conn)
res.to_csv('data/high_paid.csv', index=False)
res.head()

Unnamed: 0,name,salary
0,Eve,82000.0
1,Alice,90000.0
2,Carla,105000.0
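
Note that the query above has no `ORDER BY`, so the row order in the exported file is whatever the query plan happens to produce. A minimal sketch of a deterministic export without pandas follows, using the standard-library `csv` module and the cursor's `description` attribute for the header. An in-memory buffer stands in for `data/high_paid.csv` so the example touches no files.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary REAL);")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?);",
    [("Alice", 90000), ("Bob", 75000), ("Carla", 105000),
     ("Dan", 68000), ("Eve", 82000), ("Fred", 72000)],
)

# ORDER BY makes the exported row order reproducible.
cur = conn.execute(
    "SELECT name, salary FROM employees WHERE salary >= 80000 ORDER BY salary DESC;"
)
header = [col[0] for col in cur.description]  # column names of the result set

buffer = io.StringIO()  # stand-in for open('data/high_paid.csv', 'w', newline='')
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(cur)   # a cursor is an iterable of row tuples

csv_out = buffer.getvalue()
print(csv_out)
conn.close()
```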


## 10) Mini-exercises

1. List employees hired in or after 2021

In [17]:
pd.read_sql('''
                SELECT emp_id, name, hire_date 
                FROM employees
                WHERE hire_date >= '2021-01-01'
                ORDER BY hire_date;
            ''', conn)

Unnamed: 0,emp_id,name,hire_date
0,1,Alice,2021-03-15
1,5,Eve,2021-12-01
2,4,Dan,2022-04-10
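
The comparison above works on a `TEXT` column because ISO 8601 dates (`YYYY-MM-DD`) sort lexicographically in the same order as chronologically. SQLite's `strftime` function can also pull parts out of such strings. The sketch below counts hires per year on an in-memory table; it includes only the five employees whose hire dates appear in the outputs above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, hire_date TEXT);")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?);",
    [("Alice", "2021-03-15"), ("Bob", "2020-07-01"), ("Carla", "2019-11-20"),
     ("Dan", "2022-04-10"), ("Eve", "2021-12-01")],
)

# String comparison is enough because the dates are zero-padded ISO 8601 text.
recent = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM employees WHERE hire_date >= '2021-01-01' ORDER BY hire_date;"
    )
]

# strftime('%Y', ...) returns the year as text, e.g. '2021'.
per_year = conn.execute("""
    SELECT strftime('%Y', hire_date) AS hire_year, COUNT(*)
    FROM employees
    GROUP BY hire_year
    ORDER BY hire_year;
""").fetchall()

print(recent)    # ['Alice', 'Eve', 'Dan']
print(per_year)  # [('2019', 1), ('2020', 1), ('2021', 2), ('2022', 1)]
conn.close()
```

Dates stored in other formats, such as `03/15/2021`, would not compare correctly as text, so it is worth normalizing them before loading.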


2. Total payroll per department

In [18]:
pd.read_sql('''
                SELECT d.dept_name, SUM(e.salary) AS total_salary
                FROM employees e
                JOIN departments d ON e.dept_id = d.dept_id
                GROUP BY d.dept_name
                ORDER BY total_salary DESC;
            ''', conn)

Unnamed: 0,dept_name,total_salary
0,Engineering,195000.0
1,Marketing,157000.0
2,Finance,140000.0


3. Departments with zero employees

In [19]:
pd.read_sql('''
                SELECT d.dept_id, d.dept_name
                FROM departments d
                LEFT JOIN employees e ON e.dept_id = d.dept_id
                WHERE e.emp_id IS NULL;
            ''', conn)

Unnamed: 0,dept_id,dept_name
0,40,HR


In [20]:
pd.read_sql('''
                SELECT d.dept_id, d.dept_name, COUNT(e.emp_id) AS n_emp
                FROM departments d
                LEFT JOIN employees e ON e.dept_id = d.dept_id
                GROUP BY d.dept_id
                HAVING COUNT(e.emp_id) = 0;
            ''', conn)

Unnamed: 0,dept_id,dept_name,n_emp
0,40,HR,0
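
A third way to phrase exercise 3, offered here as an alternative sketch rather than the intended solution, is a correlated `NOT EXISTS` subquery. It states the condition directly ("no employee row points at this department") and needs neither a join nor grouping. The example runs on an in-memory copy of the two tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, dept_name TEXT NOT NULL);")
conn.execute("CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT NOT NULL, dept_id INTEGER);")
conn.executemany(
    "INSERT INTO departments VALUES (?, ?);",
    [(10, "Engineering"), (20, "Marketing"), (30, "Finance"), (40, "HR")],
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?);",
    [(1, "Alice", 10), (2, "Bob", 20), (3, "Carla", 10),
     (4, "Dan", 30), (5, "Eve", 20), (6, "Fred", 30)],
)

# The subquery is evaluated per department; NOT EXISTS is true when it finds no row.
empty_departments = conn.execute("""
    SELECT d.dept_id, d.dept_name
    FROM departments d
    WHERE NOT EXISTS (
        SELECT 1 FROM employees e WHERE e.dept_id = d.dept_id
    )
    ORDER BY d.dept_id;
""").fetchall()

print(empty_departments)  # [(40, 'HR')]
conn.close()
```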


## 11) Close connection

In [21]:
conn.close()
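
In scripts, as opposed to notebooks, it is easy to skip `conn.close()` when an earlier line raises. One common pattern, sketched here with an in-memory database and a hypothetical table `t`, is to wrap the connection in `contextlib.closing`. Note that `with conn:` on its own only manages the transaction (commit or rollback); it does not close the connection.

```python
import sqlite3
from contextlib import closing

# closing() guarantees conn.close() runs even if a query raises.
with closing(sqlite3.connect(":memory:")) as conn:
    with conn:  # transaction scope: commit on success, rollback on error
        conn.execute("CREATE TABLE t (x INTEGER);")
        conn.execute("INSERT INTO t VALUES (1);")
    total = conn.execute("SELECT SUM(x) FROM t;").fetchone()[0]

# After the outer block, the connection is closed and further use fails.
try:
    conn.execute("SELECT 1;")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(total, still_open)  # 1 False
```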