[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Bartcardi/azure_ml_training/blob/duckdb_talk/notebooks/duckdb_talk.ipynb)

In [None]:
import pandas
import duckdb

In [None]:
!kaggle datasets download -d zanjibar/100-million-data-csv

In [None]:
!unzip 100-million-data-csv.zip

# "DuckDB: Your New Favorite Analytical Tool"

* Bart Joosten / ilionx
* 12-02-2025

# The Story of DuckDB (Origin)

* DuckDB was born out of academic research at CWI (Centrum Wiskunde & Informatica) in the Netherlands.
* Originally developed by Mark Raasveldt and Hannes Mühleisen.
* The initial goal was to create an in-process database system optimized for analytical queries on embedded devices.
* The project quickly evolved into a powerful and versatile analytical DBMS suitable for a wide range of applications.
* It's open-source (MIT License) and has a vibrant community contributing to its development.


# What is DuckDB?

* DuckDB is an embedded (in-process) analytical database management system (DBMS).
* It's designed to be fast, portable, and easy to use, especially for analytical queries.
* Written in C++ with zero external dependencies (only a working C++11 compiler is required).
* Think of it as SQLite for analytics.
* **Key Feature:** Optimized for analytical workloads (OLAP).

# Why Use DuckDB (Especially with Python)?

* **Speed:** For many analytical operations, especially on larger datasets, DuckDB is often *significantly* faster than Pandas or other Python libraries. The computation runs inside the database engine, which is highly optimized for this kind of work.
* **Ease of Use:**  It's embedded! No setting up a separate database server. Just install the `duckdb` Python package and you're ready to go.
* **SQL Power:** Leverage the full power of SQL for complex queries, aggregations, and data transformations directly within your Python code. This can be more concise and efficient than equivalent Pandas code.
* **Seamless Integration:** The `duckdb` Python library provides a smooth interface for interacting with DuckDB. You can easily load data from Pandas DataFrames, execute SQL queries, and retrieve results back into Pandas.
* **Portability and Reproducibility:** DuckDB is entirely self-contained and needs no server, which makes your analyses portable and reproducible. You can share your code and data without worrying about database configuration.
* **Parquet and other formats:** Read and write Parquet, CSV, and other data science-friendly formats directly.


# OLAP vs. OLTP - The Core Difference

* **OLTP (Online Transaction Processing):**
    * Designed for transactional workloads. Think of your online banking system or e-commerce checkout.
    * Focus: High volume of small transactions, data consistency, and speed of individual transactions.
    * Examples: Inserting a new customer, updating an order status.
* **OLAP (Online Analytical Processing):**
    * Designed for complex analytical queries. Think of business intelligence dashboards or data science exploration.
    * Focus: Analyzing large datasets, complex aggregations, and query performance.
    * Examples: Calculating sales trends over time, identifying customer segments.

# OLAP vs. OLTP - Comparison Table

| Feature          | OLTP                                | OLAP                                       |
|------------------|-------------------------------------|--------------------------------------------|
| Workload         | Transactions (inserts, updates)     | Analytical queries (SELECTs, aggregations) |
| Data Volume      | Few rows per transaction            | Large datasets                             |
| Query Complexity | Simple, fast queries                | Complex queries, aggregations              |
| Performance Goal | Transaction speed, data consistency | Query speed, data analysis                 |
| Data Changes     | Frequent, small updates             | Infrequent, bulk updates                   |
| Data Focus       | Current, operational data           | Historical, analytical data                |


# Why DuckDB is OLAP-focused

* DuckDB's architecture and optimizations are specifically geared towards OLAP workloads.
* **Columnar storage:** Data is stored column-wise, which is much more efficient for analytical queries that often only access a subset of columns. *(Reduced I/O, better compression)*
* **Vectorized query execution:** DuckDB processes data in batches (vectors) rather than one row at a time, which reduces per-row overhead and makes better use of CPU caches.
* **Optimized query planner:** DuckDB's query planner is designed to find the most efficient execution plan for complex analytical queries.
* Together, these features typically make DuckDB much faster than a traditional row-oriented database (like SQLite) for analytical tasks.
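
To make the columnar idea concrete, here is a toy comparison in plain Python. It is not DuckDB's actual storage format, only an illustration of why summing one column is cheaper when each column is stored together. The data is made up.

```python
# Row-oriented layout: the fields of each record are stored together.
rows = [
    {"order_id": 1, "customer": "alice", "amount": 30.0},
    {"order_id": 2, "customer": "bob", "amount": 45.0},
    {"order_id": 3, "customer": "alice", "amount": 25.0},
]

# Column-oriented layout: the values of each column are stored together.
columns = {
    "order_id": [1, 2, 3],
    "customer": ["alice", "bob", "alice"],
    "amount": [30.0, 45.0, 25.0],
}

# Row layout: every record is visited even though only "amount" is needed.
total_from_rows = sum(row["amount"] for row in rows)

# Column layout: only the "amount" list is read, as one batch of values.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
```

Both totals are the same. What differs is how much unrelated data has to be touched to compute them.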

# DuckDB Use Cases for Data Scientists

* **Local data analysis:** Analyze large datasets on your laptop without setting up a complex database server.
* **Data exploration and prototyping:** Quickly test out different analytical queries and transformations.
* **Reproducible research:** Embed DuckDB directly into your analysis scripts to ensure reproducibility.
* **Integration with data science tools:** Seamlessly use DuckDB with Python (via the `duckdb` library), R, and other languages.
* **Parquet and CSV support:** Easily import and export data in common data science formats.

# Demo Time!

* Let's see DuckDB in action! 