This project is designed for educational purposes. Redistribution, modification, or copying of this project is strictly prohibited.
This demonstrates how to set up a DuckDB database and interact with it using Python for data ingestion. It is similar to what we did in [https://github.com/Rasagna29/ingest_data_with_python](ingest_data_with_python), but the main difference lies in the method of ingesting data into the DuckDB database. In the previous approach, we ingested data by running SQL queries from Python scripts. In this version, the assumption is that data is made available as CSV files in a directory. The script monitors the directory in real time to consume new files, process them, and ingest the data into the database.
We attempted to search the directory in two ways. The first method uses a custom Python implementation called watch_directory in /scripts/watch_directory_custom_script.py. When executed, it checks if the directory (/files/watch_directory_uploads) contains new files. If a new file is found and the data passes the validation check, it is processed and ingested into the database. The script then sleeps for 30 seconds before rerunning the check to see if any new files are available. To test this, upload the order_data_202412110923.csv file only to /files/watch_directory_uploads and allow the watch_directory_custom_script.py script to identify and process it. The script will print "Search is over, the next search is in 30 seconds." Then start uploading other files within the next run(i.e 30 seconds), the script will identify and process it once the next cycle starts.
In contrast, the /scripts/watchdog_python_library.py method works using an observer pattern. As soon as a new file is uploaded to /files/watchdog_uploads, it is immediately picked up for processing, and the data is ingested without waiting for a time interval as in the first method.
Clone the repository to your local machine: git clone https://github.com/Rasagna29/ingest_data_from_file_with_python.git
Follow these steps to set up the environment and tools:
Homebrew is a package manager for macOS. To install it, open the terminal and run: /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Install it using Homebrew:
brew install python
Note: A virtual environment isolates project dependencies, preventing conflicts and ensuring consistent setups across different projects. pip install virtualenv
mkdir ingest_data_from_file_with_python/
cd ingest_data_from_file_with_python/
To keep the dependencies isolated, create and activate a Python virtual environment:
To create Python virtual environment (this is done for each project to ensure the dependencies of this project are isolated from other projects)
python3 -m venv venv
source venv/bin/activate
source venv/bin/activate.fish
Once your virtual environment is activated, install the necessary libraries for DuckDB, Pandas, Apache Airflow and Watchdog
Run below to see version of duckdb python -c "import duckdb; print(duckdb.version)"
Run below to see version of pandas python -c "import pandas; print(pandas.version)"
Exit Virtual Environment deactivate
Exit Bash exit
Create scripts directory in ingest_data_from_file_with_python/ mkdir scripts/ cd scripts nano create_tables.py (the file is already available in scripts directory in main branch)
Press Ctrl + X, type Y, and hit Enter
Return to the project root: cd ..
Create a data directory for the database file: mkdir data/
Activate the virtual environment: source venv/bin/activate
Run the scripts to create a DuckDB database named dynamic_state_ingestion_database.duckdb. The below will create orders, order_states table, and also ingest data to these tables. 1. python /Users/rasagna/ingest_data_from_file_with_python/scripts/create_tables.py --> new database (dynamic_state_ingestion_database.duckdb) will be created in /data directory 2. python /Users/rasagna/ingest_data_from_file_with_python/scripts/watch_directory_custom_script.py --> searches the /files/watch_directory_uploads once in every 30 seconds for new file 3. python /Users/rasagna/ingest_data_from_file_with_python/scripts/watchdog_python_library.py --> works on observer pattern to keep an eye on every new upload in the /files/watchdog_uploads. This is unlike above which waits for next run to happen even if file was available
To connect to the database using a client:
- Download and install DBeaver.
- Open DBeaver and create a new database connection.
- Select DuckDB as the database type.
- Browse to the database file located at: /Users/rasagna/ingest_data_from_file_with_python/data/dynamic_state_ingestion_database.duckdb
- You can now view the newly created tables and its columns as defined in the create_tables.py script and also data in these tables
When DBeaver (or any other client) is open and you try to run Python scripts that interact with a DuckDB database, you might encounter a conflict due to DuckDB's file-based locking mechanism. DuckDB ensures that only one process can write to the database at a time. If DBeaver (or another process) has the database open, other scripts cannot access or modify the database concurrently. To resolve this:
- Close or disconnect the database in DBeaver (or the other client) before running the script.
- Alternatively, you can run DBeaver in read-only mode to prevent write conflicts.
Congratulations! You have successfully searched directories for new file uploads with two different approaches, processed the files as per the seach and process implementations respectively and persit data in order and order_states table of DuckDB database