The goal of the project was to create a fragment of an insurance company's database and a data warehouse based on it. The database contains information about the insurance company's customers, employees and branches, as well as the insurance policies between these entities. The database and warehouse designed this way allow for easy visualization and development of summaries of the work of all branches and employees, as well as the policies taken out by clients. The assumption is that the company can offer different types of policies - e.g., home insurance, vehicle insurance - and each customer can take out an unlimited number of policies. The project was developed entirely in the environment offered by Oracle, using SQL Developer and SQL Developer Data Modeler, as well as the SQL*Loader tool. The scripts generating the data with which the database was fed were written in Python using the Faker library.
This project was part of my coursework during my studies at the Kielce University of Technology. It aimed to create a simplified version of a database for an insurance company (under 20 tables) and later to build a data warehouse on top of it (using a star schema).
Prior to starting work on this project, I did research to determine if anyone had created something similar and made it available to the public. Unfortunately, the results of my search were less than satisfactory. I had to attempt to create the database model, the relationships, the data generator, the transformation of the data into a data warehouse, and all the queries myself. However, this was my first approach to creating a database and data warehouse, so the model or queries may contain errors or fall short of good practices. Nevertheless, below you will find the results of my work, which may be useful to those who are tasked with creating a similar project.
If you would like to know more about the creation process, you can jump to Roadmap. If you want to see the project in action, you will have to follow the steps described in Getting Started.
To start working with this project you need to install Oracle Database Express Edition (XE). You can find instructions on how to perform this process here: Oracle Database XE Quick Start
More experienced users can use the Docker image instead: Oracle Container Registry. Look for the appropriate image under Database -> Express.
The database model was created in Oracle SQL Developer Data Modeler. This tool makes it easier to visualize and build the structure of the database.
To operate on the database I used SQL Developer.
With all the tools in place, we can proceed to launch the project. You can use the project structure below and the Roadmap to navigate the project. Refer to the Usage section for quick tips on how to work with it.
- Be sure that you have installed all the prerequisites from the Getting Started section
- Open SQL Developer and run 01_database/create.sql to create the database
- Open 01_database/dataGenerator/main.py and change the USERNAME, PASSWORD and CONNECTION_STRING values to fit your database connection
- Run 01_database/dataGenerator/main.py
- OR invoke load_files.bat, providing the username, password and connection string via the command line:
.\load_files.bat yourUserName yourPassword yourDatabaseConnectionString
- Query the database using the 03_queries/ scripts of your choice
- You can create a separate database instance and use 02_data_warehouse/create.sql and 02_data_warehouse/insert.sql to create tables with data exported from the database. If you want to export your own set of data, you can follow the instructions provided on the Oracle website
- Query the created data warehouse with queries from 03_queries/
├── 01_database                  // Database related files
│   ├── create.sql               // SQL script that creates database tables - DDL generated from model
│   ├── dataGenerator            // Files related to data generator script AND generated by script
│   │   ├── generatedData        // Folder containing .csv files with generated data from Python script
│   │   │   ├── branch.csv
│   │   │   ├── city.csv
│   │   │   ├── claim.csv
│   │   │   ├── claimStatus.csv
│   │   │   ├── client.csv
│   │   │   ├── clientType.csv
│   │   │   ├── employee.csv
│   │   │   ├── houseNr.csv
│   │   │   ├── insurance.csv
│   │   │   ├── insuranceType.csv
│   │   │   ├── payment.csv
│   │   │   ├── phone.csv
│   │   │   ├── phoneType.csv
│   │   │   ├── region.csv
│   │   │   └── street.csv
│   │   └── main.py              // Generates data to populate the database. Also contains functions that create files and scripts to load data
│   ├── load_files.bat           // Batch file that invokes sqlldr
│   ├── model_database           // Folder containing model files
│   │   └── model.dmd
│   └── sqlldr                   // Everything related to SQL*Loader. All files inside are generated by the main.py script
│       ├── bads                 // If loading goes wrong, info will be available here
│       ├── branch.ctl
│       ├── city.ctl
│       ├── claim.ctl
│       ├── claimStatus.ctl
│       ├── client.ctl
│       ├── clientType.ctl
│       ├── employee.ctl
│       ├── houseNr.ctl
│       ├── insurance.ctl
│       ├── insuranceType.ctl
│       ├── payment.ctl
│       ├── phone.ctl
│       ├── phoneType.ctl
│       ├── region.ctl
│       ├── street.ctl
│       └── logs                 // SQL*Loader log files
│           ├── branch.log
│           ├── city.log
│           ├── claim.log
│           ├── claimStatus.log
│           ├── client.log
│           ├── clientType.log
│           ├── employee.log
│           ├── houseNr.log
│           ├── insurance.log
│           ├── insuranceType.log
│           ├── payment.log
│           ├── phone.log
│           ├── phoneType.log
│           ├── region.log
│           └── street.log
├── 02_data_warehouse            // Files related to data warehouse. Create and insert scripts along with model.
│   ├── create.sql
│   ├── insert.sql
│   └── model_data_warehouse
│       └── model.dmd
├── 03_queries                   // All database and data warehouse queries divided by type
│   ├── 01_rollup
│   │   ├── database.sql
│   │   └── warehouse.sql
│   ├── 02_cube
│   │   ├── database.sql
│   │   └── warehouse.sql
│   ├── 03_partition
│   │   ├── database.sql
│   │   └── warehouse.sql
│   ├── 04_window
│   │   ├── database.sql
│   │   └── warehouse.sql
│   └── 05_rank
│       ├── database.sql
│       └── warehouse.sql
├── images                       // README images
│   ├── data_warehouse.png
│   ├── example_query.png
│   ├── logo.png
│   ├── original_design.png
│   └── screenshot.png
├── LICENSE.txt
└── README.md
My original idea was to create a more real-life database design following all normalization principles. It looked something like this:
You can check it out here.
I later had to transform it into the design that you can see in the first screenshot in the About section, because creating a data warehouse and queries based on my original design would have become too complicated. Here is the finished data warehouse model:
To create the models I used Oracle SQL Developer Data Modeler. It's a great tool that can be used to create logical models, convert them to physical models, and finally generate the DDL.
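For readers who just want the gist of the warehouse layout, the star schema boils down to one central fact table referencing a handful of denormalized dimension tables. The sketch below is purely illustrative - the table and column names are my placeholders, not the ones in the repository; the real model lives in 02_data_warehouse/model_data_warehouse/model.dmd and its DDL in 02_data_warehouse/create.sql.

-- Illustrative star-schema sketch only; names are hypothetical placeholders.
-- The actual tables are defined in 02_data_warehouse/create.sql.
CREATE TABLE dim_branch (
    branch_id    NUMBER PRIMARY KEY,
    branch_name  VARCHAR2(100)
);

CREATE TABLE dim_insurance_type (
    insurance_type_id  NUMBER PRIMARY KEY,
    insurance_type     VARCHAR2(50)
);

-- Fact table: one row per policy sold, with foreign keys
-- pointing at the dimensions.
CREATE TABLE fact_insurance (
    insurance_id       NUMBER PRIMARY KEY,
    branch_id          NUMBER REFERENCES dim_branch (branch_id),
    insurance_type_id  NUMBER REFERENCES dim_insurance_type (insurance_type_id),
    begin_date         DATE
);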
With the database and data warehouse models created, I had to generate the necessary data (one of the project requirements was that tables should be loaded with at least 1000 or 10000 rows of data). I could have taken the short path and used one of the data generation tools available on the web, but I decided they wouldn't make the cut and instead created a Python script using the Faker library to generate the data. It proved to be a sound decision, because I used the same script to generate control files for SQL*Loader and to write lines to a batch file that could then be run to load the data into the database. SQL*Loader is a tool developed by Oracle that can be invoked from the command line to load files (.csv files in our case) into the database. The syntax to load data looks like this:
sqlldr %USERNAME%/%PASSWORD%%CONNECTION_STRING% control='sqlldr/region.ctl' log='sqlldr/logs/region.log' bad='sqlldr/bads/region.bad'
Invoking such a command for each of the 15 tables would be tedious and inefficient.
My idea to simplify the process was to:
- Generate the data for the tables using separate Python functions (the last line invokes a function that generates the control file)
# Context from main.py: module-level imports and the Faker instance the function relies on
import csv

from faker import Faker

faker = Faker()


def generate_branches(records):
    headers = ["branch_id", "branch_name", "region_id", "city_id", "street_id", "houseNr_id"]
    with open("./generatedData/branch.csv", 'wt', newline='') as csvFile:
        writer = csv.DictWriter(csvFile, fieldnames=headers)
        writer.writeheader()
        for i in range(records):
            writer.writerow({
                headers[0]: i,                # primary key
                headers[1]: faker.company(),  # random company name
                headers[2]: i % 16,           # foreign keys cycle through 16 parent rows
                headers[3]: i % 16,
                headers[4]: i % 16,
                headers[5]: i,
            })
    print(f'Successfully generated {records} branches')
    create_ctl_file("branch", headers)
- Invoke a function that creates a .ctl file controlling how SQL*Loader loads the data into the database
import re  # used to detect date columns by name


def create_ctl_file(filename, headers):
    # Write the SQL*Loader control file for the given table
    with open(f'../sqlldr/{filename}.ctl', "w+") as file:
        file.write("LOAD DATA\n")
        file.write(f"INFILE 'dataGenerator/generatedData/{filename}.csv'\n")
        file.write("REPLACE\n")
        file.write(f"INTO TABLE {filename}\n")
        file.write("FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'\n")
        file.write("TRAILING NULLCOLS\n")
        file.write("(\n")
        for i in range(len(headers)):
            # Date columns need an explicit format mask
            if re.search(r'date', headers[i], re.IGNORECASE):
                file.write(f'{headers[i]} DATE "yyyy-mm-dd"')
            else:
                file.write(headers[i])
            if i != (len(headers) - 1):
                file.write(",")
        file.write("\n)")
    print(f"Successfully generated {filename}.ctl file")

    # Append the matching sqlldr invocation to the batch file
    with open('../load_files.bat', "a+") as bat_file:
        bat_file.write(f"sqlldr %USERNAME%/%PASSWORD%%CONNECTION_STRING% control='sqlldr/{filename}.ctl' log='sqlldr/logs/{filename}.log' bad='sqlldr/bads/{filename}.bad'\n")
    print(f"Successfully appended {filename} data to load_files.bat")
- Append the appropriate line to the batch file and save it. The finished batch file can then be conveniently run to load all the data using SQL*Loader.
load_files.bat
@echo off
rem Script that loads the previously generated data into the database using sqlldr
rem To change the default values, call the script and pass them sequentially as command-line arguments
rem Run the script from a command line, e.g. cmd:
rem move to the folder where the script is located
rem and run it by entering the command .\load_files.bat
rem That command runs it with the default user data
rem To change it, pass other values, e.g. .\load_files.bat username password @//my_base:1521/XEPDB1
rem Set default values for the connection
set USERNAME=kielx
set PASSWORD=d11
set CONNECTION_STRING=@//localhost:1521/XEPDB1
rem Check if values exist
if not "%1"=="" (
set USERNAME=%1
)
if not "%2"=="" (
set PASSWORD=%2
)
if not "%3"=="" (
set CONNECTION_STRING=%3
)
sqlldr %USERNAME%/%PASSWORD%%CONNECTION_STRING% control='sqlldr/region.ctl' log='sqlldr/logs/region.log' bad='sqlldr/bads/region.bad'
sqlldr %USERNAME%/%PASSWORD%%CONNECTION_STRING% control='sqlldr/city.ctl' log='sqlldr/logs/city.log' bad='sqlldr/bads/city.bad'
sqlldr %USERNAME%/%PASSWORD%%CONNECTION_STRING% control='sqlldr/street.ctl' log='sqlldr/logs/street.log' bad='sqlldr/bads/street.bad'
sqlldr %USERNAME%/%PASSWORD%%CONNECTION_STRING% control='sqlldr/houseNr.ctl' log='sqlldr/logs/houseNr.log' bad='sqlldr/bads/houseNr.bad'
sqlldr %USERNAME%/%PASSWORD%%CONNECTION_STRING% control='sqlldr/clientType.ctl' log='sqlldr/logs/clientType.log' bad='sqlldr/bads/clientType.bad'
sqlldr %USERNAME%/%PASSWORD%%CONNECTION_STRING% control='sqlldr/client.ctl' log='sqlldr/logs/client.log' bad='sqlldr/bads/client.bad'
sqlldr %USERNAME%/%PASSWORD%%CONNECTION_STRING% control='sqlldr/employee.ctl' log='sqlldr/logs/employee.log' bad='sqlldr/bads/employee.bad'
sqlldr %USERNAME%/%PASSWORD%%CONNECTION_STRING% control='sqlldr/branch.ctl' log='sqlldr/logs/branch.log' bad='sqlldr/bads/branch.bad'
sqlldr %USERNAME%/%PASSWORD%%CONNECTION_STRING% control='sqlldr/phoneType.ctl' log='sqlldr/logs/phoneType.log' bad='sqlldr/bads/phoneType.bad'
sqlldr %USERNAME%/%PASSWORD%%CONNECTION_STRING% control='sqlldr/phone.ctl' log='sqlldr/logs/phone.log' bad='sqlldr/bads/phone.bad'
sqlldr %USERNAME%/%PASSWORD%%CONNECTION_STRING% control='sqlldr/payment.ctl' log='sqlldr/logs/payment.log' bad='sqlldr/bads/payment.bad'
sqlldr %USERNAME%/%PASSWORD%%CONNECTION_STRING% control='sqlldr/insuranceType.ctl' log='sqlldr/logs/insuranceType.log' bad='sqlldr/bads/insuranceType.bad'
sqlldr %USERNAME%/%PASSWORD%%CONNECTION_STRING% control='sqlldr/insurance.ctl' log='sqlldr/logs/insurance.log' bad='sqlldr/bads/insurance.bad'
sqlldr %USERNAME%/%PASSWORD%%CONNECTION_STRING% control='sqlldr/claimStatus.ctl' log='sqlldr/logs/claimStatus.log' bad='sqlldr/bads/claimStatus.bad'
sqlldr %USERNAME%/%PASSWORD%%CONNECTION_STRING% control='sqlldr/claim.ctl' log='sqlldr/logs/claim.log' bad='sqlldr/bads/claim.bad'
All this code is held in 01_database/dataGenerator, with the generated data in the 01_database/dataGenerator/generatedData subfolder and the control files that handle loading in 01_database/sqlldr. The .bat file that loads the data is generated by the Python script and can be run with command-line arguments that specify a username, password, and database connection string.
To convert the database tables and data into a data warehouse I used the Oracle SQL Developer Export Wizard, which proved to be an easy way to achieve my goal. I also tried the Oracle Data Pump but couldn't get it to work. All related files are held in the 02_data_warehouse folder.
Queries to the database are stored in the 03_queries folder, with each type of query in a separate subfolder (one each for ROLLUP, CUBE, partition, window and rank functions). The queries differ depending on whether they are executed against the database or the data warehouse. One of the project requirements was to store them that way, with queries to the database in a file named database.sql and queries to the data warehouse in warehouse.sql.
An example ROLLUP query producing an annual summary of the number of insurance policies sold, broken down by branch and policy type:
SELECT EXTRACT(YEAR FROM insurance.begin_date) AS Year,
NVL2(branch.branch_name, branch.branch_name, 'All Branches') AS Branch_Name,
NVL2(insurancetype.insurance_type, insurancetype.insurance_type, 'All Policies') AS Policy_Type,
COUNT(insurance.insuranceType_id) AS Number_of_Policies
FROM insurance
INNER JOIN insuranceType
ON insurance.insurancetype_id = insurancetype.insurancetype_id
INNER JOIN branch
ON insurance.branch_id = branch.branch_id
GROUP BY ROLLUP (EXTRACT(YEAR FROM insurance.begin_date), branch.branch_name, insurancetype.insurance_type )
ORDER BY EXTRACT(YEAR FROM insurance.begin_date),
branch.branch_name,
COUNT(insurance.insuranceType_id) DESC;
This query returns a summary of the number of insurance policies sold each year, grouped by branch name and policy type. The ROLLUP operator is used to generate subtotals for each year, branch name, and policy type. The NVL2 function is used to replace NULL values in the branch name and policy type columns with 'All Branches' and 'All Policies', respectively. The result is ordered by year, branch name, and number of policies sold in descending order.
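The other query families follow the same pattern on this schema. As a flavour of the 05_rank category, here is a minimal ranking sketch of my own (an illustration, not a query copied from 03_queries) that orders branches by the number of policies sold:

-- Illustrative only: rank branches by policies sold.
-- RANK() leaves gaps after ties (1, 1, 3, ...); DENSE_RANK() would not.
SELECT branch.branch_name,
       COUNT(*) AS Number_of_Policies,
       RANK() OVER (ORDER BY COUNT(*) DESC) AS Sales_Rank
FROM insurance
INNER JOIN branch
ON insurance.branch_id = branch.branch_id
GROUP BY branch.branch_name
ORDER BY Sales_Rank;

Because analytic functions are evaluated after grouping, RANK() can order directly on the aggregated COUNT(*) without a subquery.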
Distributed under the MIT License. See LICENSE.txt for more information.
Krzysztof Pantak - kielx.dev@gmail.com