# Transactions with `DuckDB`

In this notebook, we showcase how to connect to a [DuckDB](https://duckdb.org/) database, execute queries, and run transactions using the [`DuckDB Python API`](https://duckdb.org/docs/api/python/overview). 


The notebook is based on our analogous notebook [Transactions](https://github.com/BigDataAnalyticsGroup/bigdataengineering/blob/master/Transactions.ipynb), which shows transactions in the context of a [PostgreSQL](https://www.postgresql.org/) database using the [`psycopg`](https://www.psycopg.org/psycopg3/docs/index.html) package for Python.

Copyright Joris Nix & Jens Dittrich, [Big Data Analytics Group](https://bigdata.uni-saarland.de/), [CC-BY-SA](https://creativecommons.org/licenses/by-sa/4.0/legalcode)

# Setup

The following cell serves as setup. We will explain the syntax in more detail below. Here, we simply connect to the database, create a new table `accounts` with attributes `id` and `balance`, and add some toy data.

In [1]:
import duckdb

def reset_database():
    # establish connection to 'accounts' database
    conn = duckdb.connect(database='accounts.duckdb')

    # drop table if it exists
    conn.execute("""DROP TABLE IF EXISTS accounts;""")

    # create accounts table
    conn.execute("""
    CREATE TABLE accounts (
        id int PRIMARY KEY,
        balance float(2)
    );""")

    # insert sample data into accounts table
    conn.execute("""
    INSERT INTO accounts VALUES
        (1, 2000.0),
        (2, 520.0),
        (3, 470.0),
        (4, 1700.0),
        (5, 2400.0);
    """)
    
    # close connection
    conn.close()

reset_database()

# Basics

### Connection
In order to send queries to the database, we first need to establish a `connection`. The `connect()` method lets you choose between an **in-memory database** and a **persistent database** stored in a DuckDB file by specifying the `database` parameter.
```python
conn = duckdb.connect(database=':memory:') # creates an in-memory database
conn = duckdb.connect(database='my-db.duckdb') # creates a persistent database file called 'my-db.duckdb'
```

In our setup above, we create a persistent database. If the `database` parameter is omitted, DuckDB creates an in-memory database by default.

### Querying
We can send queries to the database using the connection object `conn`. In contrast to the PostgreSQL database adapter for Python, we do not need a _cursor_ that has to be opened from an established connection. The connection directly allows us to send queries (`execute()`) and retrieve results (`fetchone()`, `fetchall()`). Results are always tuples, even if they consist of a single integer. We have to consider this when parsing the results. In addition, DuckDB provides further methods that convert query results into well-established formats, e.g., `fetchdf()` fetches the data as a Pandas DataFrame. When we are done, we close the connection using the `close()` method (this happens implicitly if the connection goes out of scope).

The following example shows how to query the database for an entire table.

In [2]:
# establish connection to 'accounts' database
conn = duckdb.connect(database='accounts.duckdb')

# define a SQL query
q_accounts = """SELECT * FROM accounts;"""

# execute the query using the connection
conn.execute(q_accounts)

# retrieve the tuples
accounts = conn.fetchall()

# print sorted results
print(f"The query returned the following tuples:\n{sorted(accounts)}")

The query returned the following tuples:
[(1, 2000.0), (2, 520.0), (3, 470.0), (4, 1700.0), (5, 2400.0)]


In [3]:
# query the database again and retrieve the tuples as a Pandas DataFrame
conn.execute(q_accounts)

accounts_df = conn.fetchdf()
display(accounts_df)

| | id | balance |
|---|---|---|
| 0 | 1 | 2000.0 |
| 1 | 2 | 520.0 |
| 2 | 3 | 470.0 |
| 3 | 4 | 1700.0 |
| 4 | 5 | 2400.0 |


In [4]:
# close the current connection
conn.close()

# Read-Only Connection

DuckDB allows us to set a connection to **read-only** mode by specifying the `read_only` parameter. If a connection is in read-only mode, write operations are not executed and raise an `InvalidInputException` instead. However, this only works for **persistent databases**.

The following example demonstrates this behavior.

In [5]:
# establish a read-only connection
conn = duckdb.connect(database='accounts.duckdb', read_only=True)

try:
    # try to insert a new tuple into the table
    conn.execute("INSERT INTO accounts VALUES (6, 100000.0);")
    
    # if successful, retrieve newly added tuple
    conn.execute("SELECT * FROM accounts WHERE id=6;")
    print(conn.fetchone())
except duckdb.InvalidInputException:
    print("ERROR: The query failed due to the connection being read-only.")
    
conn.close()

ERROR: The query failed due to the connection being read-only.


# Transactions

By default, every `execute()` statement sent to the database has an immediate effect, i.e., each statement is an individual transaction that is implicitly committed upon successful completion. However, DuckDB also provides functionality to bundle multiple statements into one transaction.

In particular, the DuckDB Python adapter provides the following functions for transactional processing:
* `begin()`: Start a new transaction.
* `commit()`: Commit changes performed within a transaction.
* `rollback()`: Roll back changes performed within a transaction.

## Immediate Commit
The following example shows that without starting a transaction, each modification is immediately visible to other connections to the database.

In [6]:
reset_database()

# establish two connections
conn1 = duckdb.connect(database='accounts.duckdb')
conn2 = duckdb.connect(database='accounts.duckdb')

# insert a new value using the first connection
conn1.execute("INSERT INTO accounts VALUES (6, 237.0);")
# retrieve this new tuple using the second connection
conn2.execute("SELECT * FROM accounts WHERE id=6;")

print(f'Due to immediate commit, the tuple with id=6 is already visible to the other connection: {conn2.fetchall()}')

conn1.close()
conn2.close()

Due to immediate commit, the tuple with id=6 is already visible to the other connection: [(6, 237.0)]


## Begin & Commit Transactions
The example below transfers money from one account to another inside a transaction. It is equivalent to running the following SQL statements:

```SQL
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id=3;
UPDATE accounts SET balance = balance + 100 WHERE id=1;
COMMIT;
```
The example also shows that as long as the transaction is not committed, changes are not visible to other connections.

In [7]:
reset_database()

# establish two connections
conn1 = duckdb.connect(database='accounts.duckdb')
conn2 = duckdb.connect(database='accounts.duckdb')

# start a new transaction
conn1.begin()
# update balance of account 3
conn1.execute("""UPDATE accounts SET balance = balance - 100 WHERE id=3;""") 
# update balance of account 1
conn1.execute("""UPDATE accounts SET balance = balance + 100 WHERE id=1;""")
# compare states visible to both connections (transactions)
q_acc = """SELECT * FROM accounts WHERE id=1 OR id=3;"""
conn1.execute(q_acc)
conn2.execute(q_acc)
print(f"Account balances observed by each connection before COMMIT:\n"\
      f"Transaction 1: {conn1.fetchall()}\n"\
      f"Transaction 2: {conn2.fetchall()}\n"\
      f"Changes not yet visible to connection 2."\
     )

# explicitly commit the changes performed by the first connection
conn1.commit()
print("--Transaction 1 committed--")

# compare states visible to both connections again
conn1.execute(q_acc)
conn2.execute(q_acc)
print(f"Account balances observed by each connection after COMMIT:\n"\
      f"Transaction 1: {conn1.fetchall()}\n"\
      f"Transaction 2: {conn2.fetchall()}\n"\
      f"Changes visible to connection 2."\
     )
# close both connections
conn1.close()
conn2.close()

Account balances observed by each connection before COMMIT:
Transaction 1: [(1, 2100.0), (3, 370.0)]
Transaction 2: [(1, 2000.0), (3, 470.0)]
Changes not yet visible to connection 2.
--Transaction 1 committed--
Account balances observed by each connection after COMMIT:
Transaction 1: [(1, 2100.0), (3, 370.0)]
Transaction 2: [(1, 2100.0), (3, 370.0)]
Changes visible to connection 2.


## Rollback Transactions

The next example shows a transaction similar to the one above. The only difference is that instead of making the changes persistent, we decide to `ABORT` the transaction by calling `rollback()` on the connection. It is equivalent to running the following SQL statements:
```SQL
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id=3;
UPDATE accounts SET balance = balance + 100 WHERE id=1;
ABORT;
```
None of the changes performed by the aborted transaction may become durable in the database. Note that if we `close()` a connection with an open transaction, `rollback()` is performed implicitly.

In [8]:
reset_database()

# establish two connections
conn1 = duckdb.connect(database='accounts.duckdb')
conn2 = duckdb.connect(database='accounts.duckdb')

# start a new transaction
conn1.begin()

# update balance of account 3
conn1.execute("""UPDATE accounts SET balance = balance - 100 WHERE id=3;""") 
# update balance of account 1
conn1.execute("""UPDATE accounts SET balance = balance + 100 WHERE id=1;""")
# compare states visible to both connections (transactions)
q_acc = """SELECT * FROM accounts WHERE id=1 OR id=3;"""
conn1.execute(q_acc)
conn2.execute(q_acc)
print(f"Account balances observed by each connection before ROLLBACK:\n"\
      f"Transaction 1: {conn1.fetchall()}\n"\
      f"Transaction 2: {conn2.fetchall()}\n"\
      f"Changes not yet visible to connection 2."\
     )

# explicitly roll back the changes performed by the first connection
conn1.rollback()
print("--Transaction 1 aborted--")

# compare states visible to both connections again
conn1.execute(q_acc)
conn2.execute(q_acc)
print(f"Account balances observed by each connection after ROLLBACK:\n"\
      f"Transaction 1: {conn1.fetchall()}\n"\
      f"Transaction 2: {conn2.fetchall()}\n"\
      f"Changes of connection 1 undone."\
     )
# close both connections
conn1.close()
conn2.close()

Account balances observed by each connection before ROLLBACK:
Transaction 1: [(1, 2100.0), (3, 370.0)]
Transaction 2: [(1, 2000.0), (3, 470.0)]
Changes not yet visible to connection 2.
--Transaction 1 aborted--
Account balances observed by each connection after ROLLBACK:
Transaction 1: [(1, 2000.0), (3, 470.0)]
Transaction 2: [(1, 2000.0), (3, 470.0)]
Changes of connection 1 undone.
