This repository contains a complete end-to-end SQL project built on a large financial transactions dataset. The project demonstrates advanced SQL database engineering concepts such as partitioning, indexing, stored procedures, triggers, materialized views, and business-focused reporting.
- Financial_Database_Management_SQL
- Project Overview
- Project Goals
- Aim of Project
- Who This Project is For
- Tools Used
- Data Source and Processing
- Database Design
- Database Optimization
- Stored Procedures, Views and Triggers (Automation and Integrity)
- Reporting & Analytics
- How to Run
The main objective of this project was to take a raw financial dataset and transform it into a high-performing, analytics-ready database. By implementing advanced SQL techniques, the project showcases how to design, optimize, and query a large dataset efficiently.
Key highlights include:
- Building normalized tables for users, cards, merchants, transactions, and fraud labels.
- Handling 13 million transaction rows with indexing, partitioning, and materialized views.
- Automating checks with views, stored procedures and triggers for fraud detection and credit limits.
- Answering real-world business questions through SQL-based reporting and analytics.
- Performance: Optimize queries over millions of rows using indexes and partitioning.
- Data Integrity: Ensure data consistency and enforce business rules with triggers.
- Automation: Use stored procedures for reusable analytics.
- Analytics: Provide insights into fraud detection, customer spending, and merchant activity.
- Documentation: Build a project that is well-structured, transparent, and reproducible.
The aim was not just to practice SQL queries, but to simulate a real-world financial analytics database where:
- Data is large and needs optimization.
- Business queries need to run fast.
- Analysts can quickly retrieve insights through pre-defined views and procedures.
- Fintech companies and payment processors (e.g., PayPal, Stripe, Paystack, Flutterwave, Interswitch, Moniepoint, FairMoney) looking to enforce business rules like fraud detection and credit-limit monitoring at the database level.
- Banks and financial institutions that need to process millions of daily transactions securely and efficiently.
- Students and learners who want to see how SQL can be applied to large-scale financial data.
- Data analysts and data engineers looking for templates on reporting queries, partitioning, and materialized views.
- Database administrators (DBAs) interested in practical optimization techniques on large datasets.
The following tools were used for the project:
- PostgreSQL
- Python
- Jupyter Notebook
The dataset for this project was obtained from Kaggle. It arrived in a messy state, with mixed datatypes, inconsistent characters, and missing values.
Using Python, I:
- Loaded the dirty dataset into Jupyter Notebook.
- Converted JSON label files into structured DataFrames.
- Changed datatypes (e.g., dates → datetime, transaction amounts → numeric).
- Removed unwanted strings and reformatted categorical variables.
- Handled nulls gracefully by dropping non-informative columns and imputing where necessary.
- Ensured column naming consistency to match the PostgreSQL schema.
The cleaned dataset was then exported to PostgreSQL for SQL operations. Attached here is the cleaned dataset.
The full database schema can be seen here.
The database schema includes five main tables:
- users – information about customers.
- cards – card details of customers.
- merchants – merchant profiles.
- transactions – ~13M rows of financial transactions.
- fraud_labels – indicates fraudulent vs. non-fraudulent activity.
For the full table creation scripts, see here.
The full scripts can be seen here.
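For orientation, a minimal sketch of two of the five tables is shown below. Column names and types are assumptions inferred from the queries later in this README; the authoritative definitions live in the linked scripts.

-- Minimal sketch (assumed columns), not the full creation scripts
create table financial_system.users (
    id            int primary key,  -- customer identifier (referenced as client_id in transactions)
    address       text,
    yearly_income numeric           -- assumed column, used for income-based segmentation
);

create table financial_system.fraud_labels (
    transaction_id bigint,
    fraud_label    text check (fraud_label in ('Yes', 'No'))
);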
Given that the transactions table has 13M+ rows, naive queries would be slow, so I engineered the database for scale:
- Added indexes to transaction_date, card_id, client_id, merchant_id and other frequently queried columns.
- Ensured query filters and joins hit indexed columns.
- Result: queries that previously scanned millions of rows dropped to millisecond responses.
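For reference, a minimal sketch of the index DDL; the index names are assumed, and the exact statements may differ in the repository scripts.

-- Indexes on the most frequently filtered and joined columns (names assumed)
create index if not exists idx_transactions_transaction_date on financial_system.transactions (transaction_date);
create index if not exists idx_transactions_card_id on financial_system.transactions (card_id);
create index if not exists idx_transactions_client_id on financial_system.transactions (client_id);
create index if not exists idx_transactions_merchant_id on financial_system.transactions (merchant_id);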
Ran the following query to check the execution time:
explain analyze
select * from transactions where card_id = 1807;
Explain Analyze before Indexing
Explain Analyze after Indexing
- The transactions table was range-partitioned by year on transaction_date and indexed.
- Advantage: queries restricted to specific years only scan relevant partitions, reducing I/O dramatically.
- Added a default “future” partition to capture incoming data.
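A minimal sketch of the partitioning DDL, assuming one partition per year; the partition names and year bounds are illustrative, not the project's exact script.

-- Parent table declared as partitioned (existing data would be migrated into it)
create table financial_system.transactions (
    transaction_id   bigint,
    transaction_date date not null,
    client_id        int,
    card_id          int,
    merchant_id      int,
    amount           numeric(12,2),
    use_chip         text,
    merchant_city    text,
    merchant_state   text
) partition by range (transaction_date);

-- One partition per year (bounds are illustrative)
create table financial_system.transactions_2014
    partition of financial_system.transactions
    for values from ('2014-01-01') to ('2015-01-01');

create table financial_system.transactions_2015
    partition of financial_system.transactions
    for values from ('2015-01-01') to ('2016-01-01');

-- Default partition captures rows outside the defined ranges (future data)
create table financial_system.transactions_future
    partition of financial_system.transactions default;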
Validated that partitioning works by inserting a new row:
insert into financial_system.transactions
(transaction_id, transaction_date, client_id, card_id, amount, use_chip, merchant_city, merchant_state)
values (999999, '2015-06-01', 123, 456, 200.0, 'Y', 'Lagos', 'NG');
Checking which partition the new row falls into:
select tableoid::regclass as partition_name,*
from financial_system.transactions
where transaction_id = 999999;
Result:
Indexed each partition of the partitioned table:
do $$
declare
    part record;  -- holds each partition name returned by the loop query
begin
    -- loop over every partition of financial_system.transactions
    for part in
        select inhrelid::regclass as partition_name
        from pg_inherits
        where inhparent = 'financial_system.transactions'::regclass
    loop
        -- create the same set of indexes on each partition
        execute format('
            create index if not exists %I_transaction_date_idx on %s (transaction_date);
            create index if not exists %I_client_id_idx on %s (client_id);
            create index if not exists %I_card_id_idx on %s (card_id);
            create index if not exists %I_merchant_city_idx on %s (merchant_city);
            create index if not exists %I_merchant_state_idx on %s (merchant_state);
        ',
            part.partition_name, part.partition_name,
            part.partition_name, part.partition_name,
            part.partition_name, part.partition_name,
            part.partition_name, part.partition_name,
            part.partition_name, part.partition_name
        );
    end loop;
end$$;
Validating the performance of the partitioned transactions table with the query below:
explain analyze
select * from transactions
where transaction_date between '2014-01-01' and '2014-07-20';
Results show that PostgreSQL scanned only the 2014 partition rather than the entire table, confirming that partition pruning works as intended.
- Built client_yearly_spending as a materialized view for quick reporting.
- Aggregates millions of rows into precomputed yearly totals and averages per client.
- Refreshable with a single command:
refresh materialized view client_yearly_spending;
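For reference, a minimal sketch of the underlying materialized view definition; the aggregate column names are assumptions, and the exact definition lives in the repository scripts.

create materialized view client_yearly_spending as
select
    t.client_id,
    extract(year from t.transaction_date) as year,
    sum(t.amount)   as total_spent,            -- yearly total per client
    avg(t.amount)   as avg_transaction_amount, -- yearly average per client
    count(*)        as transaction_count
from transactions t
group by t.client_id, extract(year from t.transaction_date);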
All SQL scripts for this live here.
Designed for reusability and automation:
- Monthly Spend per User: quickly retrieves a breakdown of spending per user across months.
- Profile Summary: compiles user details, card limits, and aggregated transactions into a single report.
- Card Usage Statistics: shows transaction counts and amounts per card, useful for both analytics and fraud detection.
Example of a stored procedure created:
Monthly spending summary per user:
create or replace procedure get_monthly_summary(p_user_id int)
language plpgsql
as $$
declare
    rec record;  -- holds each month/total row from the loop query
begin
    for rec in
        select date_trunc('month', t.transaction_date) as month,
               sum(t.amount) as total_spent
        from transactions t
        where t.client_id = p_user_id
        group by 1
        order by 1
    loop
        raise notice 'Month: %, Total spent: %', rec.month, rec.total_spent;
    end loop;
end;
$$;
Calling the stored procedure:
call get_monthly_summary(825);
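A hedged sketch of the Card Usage Statistics procedure described above; the procedure name, parameter, and output format are assumptions rather than the repository's exact code.

create or replace procedure get_card_usage_stats(p_card_id int)
language plpgsql
as $$
declare
    v_count bigint;   -- number of transactions on the card
    v_total numeric;  -- total amount spent on the card
begin
    select count(*), coalesce(sum(t.amount), 0)
    into v_count, v_total
    from transactions t
    where t.card_id = p_card_id;

    raise notice 'Card %: % transactions, total amount %', p_card_id, v_count, v_total;
end;
$$;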
Triggers are used to perform an action automatically whenever an event happens:
- Fraud Detection Trigger: whenever a new fraud label is inserted with 'Yes', the system automatically logs it in fraud_alert, ensuring real-time fraud flagging.
- Credit Limit Trigger: blocks any new transaction that would exceed a card's limit.
Enforces business rules at the database level, ensuring no inconsistent or invalid financial activity is recorded.
Example trigger function created:
create or replace function raise_fraud_alert()
returns trigger as $$
begin
    -- log an alert whenever a new fraud label is marked 'Yes'
    if new.fraud_label = 'Yes' then
        insert into fraud_alert(transaction_id, alert_message)
        values (new.transaction_id, 'Fraudulent Transaction Detected');
    end if;
    return new;
end;
$$ language plpgsql;
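The function above only fires once it is bound to the table with a trigger; a minimal sketch of that binding, with the trigger name assumed:

-- Fire the fraud-alert function after every new fraud label is inserted (trigger name assumed)
create trigger trg_raise_fraud_alert
after insert on fraud_labels
for each row
execute function raise_fraud_alert();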
Validating that the trigger works by inserting a new fraud label:
insert into fraud_labels(transaction_id,fraud_label)
values(419,'Yes');
Result can be seen below
Views make it easier for analysts to get summaries without repeating SQL:
- Fast access to all fraud-flagged events with key join context.
- Surfaces high-value activity instantly for finance, ops, and risk.
- Quick look at top customer segments by income for premium services or product strategy.
Example of a view created:
create or replace view fraud_transactions as
select f.transaction_id, t.transaction_date, t.amount, t.card_id, u.id, u.address
from fraud_labels f
join transactions t on f.transaction_id = t.transaction_id
join users u on t.client_id = u.id
where f.fraud_label = 'Yes';
Calling the view
select * from fraud_transactions;
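A hedged sketch of the high-value activity view mentioned above; the view name and the amount threshold are assumptions for illustration.

create or replace view high_value_transactions as
select t.transaction_id, t.transaction_date, t.amount, t.client_id, t.merchant_city, t.merchant_state
from transactions t
where t.amount >= 10000  -- threshold assumed for illustration
order by t.amount desc;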
To make the dataset actionable, several queries and views were written:
- Monthly transaction volume (trend analysis).
select extract(month from transaction_date) as month_number,
       to_char(transaction_date, 'month') as month_name,
       sum(amount) as total_amount
from transactions t
group by month_number, month_name
order by month_number;
- Fraudulent vs non-fraudulent transaction rates.
select fraudulent, non_fraudulent,
       (fraudulent::decimal / nullif(non_fraudulent, 0)) * 100 as percentage
from (
    select
        count(case when fraud_label = 'Yes' then 1 end) as fraudulent,
        count(case when fraud_label = 'No' then 1 end) as non_fraudulent
    from fraud_labels
) as counts;
- Top fraud-prone merchant locations.
select t.merchant_city,t.merchant_state,count(*) as Fraudulent_transactions
from transactions t join fraud_labels f
on t.transaction_id = f.transaction_id
where f.fraud_label = 'Yes'
group by t.merchant_city,t.merchant_state
order by Fraudulent_transactions desc;
- Clone the repository:
git clone https://github.com/NStanley0524/financial-sql-project.git
cd financial-sql-project
- Ensure PostgreSQL 12+ is installed and running.
- Create a database schema in pgAdmin (e.g., financial_system).
- Run the scripts in this order:
  - 01_create_tables.sql
  - 02_optimization.sql
  - 03_procedures_triggers_views.sql
  - 04_reporting_analytics.sql
- Load the cleaned dataset into the base tables (see the sketch after this list).
- Refresh materialized views when needed:
refresh materialized view client_yearly_spending;
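A minimal sketch of loading the cleaned data with COPY; the file path and column list are assumptions and should be adjusted to wherever the cleaned exports live.

-- Load the cleaned transactions export; use \copy from psql if the file sits on the client machine
copy financial_system.transactions (transaction_id, transaction_date, client_id, card_id, amount, use_chip, merchant_city, merchant_state)
from '/path/to/cleaned_transactions.csv'
with (format csv, header true);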