I have worked on analyzing a dataset of over 20,000 sales records from an Amazon-like e-commerce platform. This project involves extensive querying of customer behavior, product performance, and sales trends using PostgreSQL. Through this project, I have tackled various SQL problems, including revenue analysis, customer segmentation, and inventory management.
The project also focuses on data cleaning, handling null values, and solving real-world business problems using structured queries.
An ERD diagram is included to visually represent the database schema and relationships between tables.
CREATE TABLE category
(
category_id INT PRIMARY KEY,
category_name VARCHAR(20)
);
-- customers TABLE
CREATE TABLE customers
(
customer_id INT PRIMARY KEY,
first_name VARCHAR(20),
last_name VARCHAR(20),
state VARCHAR(20),
address VARCHAR(5) DEFAULT ('xxxx')
);
-- sellers TABLE
CREATE TABLE sellers
(
seller_id INT PRIMARY KEY,
seller_name VARCHAR(25),
origin VARCHAR(15)
);
-- products table
CREATE TABLE products
(
product_id INT PRIMARY KEY,
product_name VARCHAR(50),
price FLOAT,
cogs FLOAT,
category_id INT, -- FK
CONSTRAINT product_fk_category FOREIGN KEY(category_id) REFERENCES category(category_id)
);
-- orders
CREATE TABLE orders
(
order_id INT PRIMARY KEY,
order_date DATE,
customer_id INT, -- FK
seller_id INT, -- FK
order_status VARCHAR(15),
CONSTRAINT orders_fk_customers FOREIGN KEY (customer_id) REFERENCES customers(customer_id),
CONSTRAINT orders_fk_sellers FOREIGN KEY (seller_id) REFERENCES sellers(seller_id)
);
CREATE TABLE order_items
(
order_item_id INT PRIMARY KEY,
order_id INT, -- FK
product_id INT, -- FK
quantity INT,
price_per_unit FLOAT,
CONSTRAINT order_items_fk_orders FOREIGN KEY (order_id) REFERENCES orders(order_id),
CONSTRAINT order_items_fk_products FOREIGN KEY (product_id) REFERENCES products(product_id)
);
-- payment TABLE
CREATE TABLE payments
(
payment_id
INT PRIMARY KEY,
order_id INT, -- FK
payment_date DATE,
payment_status VARCHAR(20),
CONSTRAINT payments_fk_orders FOREIGN KEY (order_id) REFERENCES orders(order_id)
);
CREATE TABLE shippings
(
shipping_id INT PRIMARY KEY,
order_id INT, -- FK
shipping_date DATE,
return_date DATE,
shipping_providers VARCHAR(15),
delivery_status VARCHAR(15),
CONSTRAINT shippings_fk_orders FOREIGN KEY (order_id) REFERENCES orders(order_id)
);
CREATE TABLE inventory
(
inventory_id INT PRIMARY KEY,
product_id INT, -- FK
stock INT,
warehouse_id INT,
last_stock_date DATE,
CONSTRAINT inventory_fk_products FOREIGN KEY (product_id) REFERENCES products(product_id)
);I cleaned the dataset by:
- Removing duplicates: Duplicates in the customer and order tables were identified and removed.
- Handling missing values: Null values in critical fields (e.g., customer address, payment status) were either filled with default values or handled using appropriate methods.
Null values were handled based on their context:
- Customer addresses: Missing addresses were assigned default placeholder values.
- Payment statuses: Orders with null payment statuses were categorized as “Pending.”
- Shipping information: Null return dates were left as is, as not all shipments are returned.
The primary objective of this project is to showcase SQL proficiency through complex queries that address real-world e-commerce business challenges. The analysis covers various aspects of e-commerce operations, including:
- Customer behavior
- Sales trends
- Inventory management
- Payment and shipping analysis
- Forecasting and product performance
Key business problems identified:
- Low product availability due to inconsistent restocking.
- High return rates for specific product categories.
- Significant delays in shipments and inconsistencies in delivery times.
- High customer acquisition costs with a low customer retention rate.
- Top Selling Products Query the top 10 products by total sales value. Challenge: Include product name, total quantity sold, and total sales value.
SELECT
oi.product_id,
p.product_name,
SUM(oi.total_sale) as total_sale,
COUNT(o.order_id) as total_orders
FROM orders as o
JOIN
order_items as oi
ON oi.order_id = o.order_id
JOIN
products as p
ON p.product_id = oi.product_id
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 10- Revenue by Category Calculate total revenue generated by each product category. Challenge: Include the percentage contribution of each category to total revenue.
SELECT
p.category_id,
c.category_name,
SUM(oi.total_sale) as total_sale,
SUM(oi.total_sale)/
(SELECT SUM(total_sale) FROM order_items)
* 100
as contribution
FROM order_items as oi
JOIN
products as p
ON p.product_id = oi.product_id
LEFT JOIN category as c
ON c.category_id = p.category_id
GROUP BY 1, 2
ORDER BY 3 DESC- Average Order Value (AOV) Compute the average order value for each customer. Challenge: Include only customers with more than 5 orders.
SELECT
c.customer_id,
CONCAT(c.first_name, ' ', c.last_name) as full_name,
SUM(total_sale)/COUNT(o.order_id) as AOV,
COUNT(o.order_id) as total_orders --- filter
FROM orders as o
JOIN
customers as c
ON c.customer_id = o.customer_id
JOIN
order_items as oi
ON oi.order_id = o.order_id
GROUP BY 1, 2
HAVING COUNT(o.order_id) > 5- Monthly Sales Trend Query monthly total sales over the past year. Challenge: Display the sales trend, grouping by month, return current_month sale, last month sale!
SELECT
year,
month,
total_sale as current_month_sale,
LAG(total_sale, 1) OVER(ORDER BY year, month) as last_month_sale
FROM ---
(
SELECT
EXTRACT(MONTH FROM o.order_date) as month,
EXTRACT(YEAR FROM o.order_date) as year,
ROUND(
SUM(oi.total_sale::numeric)
,2) as total_sale
FROM orders as o
JOIN
order_items as oi
ON oi.order_id = o.order_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '1 year'
GROUP BY 1, 2
ORDER BY year, month
) as t1- Customers with No Purchases Find customers who have registered but never placed an order. Challenge: List customer details and the time since their registration.
Approach 1
SELECT *
-- reg_date - CURRENT_DATE
FROM customers
WHERE customer_id NOT IN (SELECT
DISTINCT customer_id
FROM orders
);-- Approach 2
SELECT *
FROM customers as c
LEFT JOIN
orders as o
ON o.customer_id = c.customer_id
WHERE o.customer_id IS NULL- Least-Selling Categories by State Identify the least-selling product category for each state. Challenge: Include the total sales for that category within each state.
WITH ranking_table
AS
(
SELECT
c.state,
cat.category_name,
SUM(oi.total_sale) as total_sale,
RANK() OVER(PARTITION BY c.state ORDER BY SUM(oi.total_sale) ASC) as rank
FROM orders as o
JOIN
customers as c
ON o.customer_id = c.customer_id
JOIN
order_items as oi
ON o.order_id = oi. order_id
JOIN
products as p
ON oi.product_id = p.product_id
JOIN
category as cat
ON cat.category_id = p.category_id
GROUP BY 1, 2
)
SELECT
*
FROM ranking_table
WHERE rank = 1- Customer Lifetime Value (CLTV) Calculate the total value of orders placed by each customer over their lifetime. Challenge: Rank customers based on their CLTV.
SELECT
c.customer_id,
CONCAT(c.first_name, ' ', c.last_name) as full_name,
SUM(total_sale) as CLTV,
DENSE_RANK() OVER( ORDER BY SUM(total_sale) DESC) as cx_ranking
FROM orders as o
JOIN
customers as c
ON c.customer_id = o.customer_id
JOIN
order_items as oi
ON oi.order_id = o.order_id
GROUP BY 1, 2- Inventory Stock Alerts Query products with stock levels below a certain threshold (e.g., less than 10 units). Challenge: Include last restock date and warehouse information.
SELECT
i.inventory_id,
p.product_name,
i.stock as current_stock_left,
i.last_stock_date,
i.warehouse_id
FROM inventory as i
join
products as p
ON p.product_id = i.product_id
WHERE stock < 10- Shipping Delays Identify orders where the shipping date is later than 3 days after the order date. Challenge: Include customer, order details, and delivery provider.
SELECT
c.*,
o.*,
s.shipping_providers,
s.shipping_date - o.order_date as days_took_to_ship
FROM orders as o
JOIN
customers as c
ON c.customer_id = o.customer_id
JOIN
shippings as s
ON o.order_id = s.order_id
WHERE s.shipping_date - o.order_date > 3- Payment Success Rate Calculate the percentage of successful payments across all orders. Challenge: Include breakdowns by payment status (e.g., failed, pending).
SELECT
p.payment_status,
COUNT(*) as total_cnt,
COUNT(*)::numeric/(SELECT COUNT(*) FROM payments)::numeric * 100
FROM orders as o
JOIN
payments as p
ON o.order_id = p.order_id
GROUP BY 1- Top Performing Sellers Find the top 5 sellers based on total sales value. Challenge: Include both successful and failed orders, and display their percentage of successful orders.
WITH top_sellers
AS
(SELECT
s.seller_id,
s.seller_name,
SUM(oi.total_sale) as total_sale
FROM orders as o
JOIN
sellers as s
ON o.seller_id = s.seller_id
JOIN
order_items as oi
ON oi.order_id = o.order_id
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 5
),
sellers_reports
AS
(SELECT
o.seller_id,
ts.seller_name,
o.order_status,
COUNT(*) as total_orders
FROM orders as o
JOIN
top_sellers as ts
ON ts.seller_id = o.seller_id
WHERE
o.order_status NOT IN ('Inprogress', 'Returned')
GROUP BY 1, 2, 3
)
SELECT
seller_id,
seller_name,
SUM(CASE WHEN order_status = 'Completed' THEN total_orders ELSE 0 END) as Completed_orders,
SUM(CASE WHEN order_status = 'Cancelled' THEN total_orders ELSE 0 END) as Cancelled_orders,
SUM(total_orders) as total_orders,
SUM(CASE WHEN order_status = 'Completed' THEN total_orders ELSE 0 END)::numeric/
SUM(total_orders)::numeric * 100 as successful_orders_percentage
FROM sellers_reports
GROUP BY 1, 2- Product Profit Margin Calculate the profit margin for each product (difference between price and cost of goods sold). Challenge: Rank products by their profit margin, showing highest to lowest. */
SELECT
product_id,
product_name,
profit_margin,
DENSE_RANK() OVER( ORDER BY profit_margin DESC) as product_ranking
FROM
(SELECT
p.product_id,
p.product_name,
-- SUM(total_sale - (p.cogs * oi.quantity)) as profit,
SUM(total_sale - (p.cogs * oi.quantity))/sum(total_sale) * 100 as profit_margin
FROM order_items as oi
JOIN
products as p
ON oi.product_id = p.product_id
GROUP BY 1, 2
) as t1- Most Returned Products Query the top 10 products by the number of returns. Challenge: Display the return rate as a percentage of total units sold for each product.
SELECT
p.product_id,
p.product_name,
COUNT(*) as total_unit_sold,
SUM(CASE WHEN o.order_status = 'Returned' THEN 1 ELSE 0 END) as total_returned,
SUM(CASE WHEN o.order_status = 'Returned' THEN 1 ELSE 0 END)::numeric/COUNT(*)::numeric * 100 as return_percentage
FROM order_items as oi
JOIN
products as p
ON oi.product_id = p.product_id
JOIN orders as o
ON o.order_id = oi.order_id
GROUP BY 1, 2
ORDER BY 5 DESC- Inactive Sellers Identify sellers who haven’t made any sales in the last 6 months. Challenge: Show the last sale date and total sales from those sellers.
WITH cte1 -- as these sellers has not done any sale in last 6 month
AS
(SELECT * FROM sellers
WHERE seller_id NOT IN (SELECT seller_id FROM orders WHERE order_date >= CURRENT_DATE - INTERVAL '6 month')
)
SELECT
o.seller_id,
MAX(o.order_date) as last_sale_date,
MAX(oi.total_sale) as last_sale_amount
FROM orders as o
JOIN
cte1
ON cte1.seller_id = o.seller_id
JOIN order_items as oi
ON o.order_id = oi.order_id
GROUP BY 1- IDENTITY customers into returning or new if the customer has done more than 5 return categorize them as returning otherwise new Challenge: List customers id, name, total orders, total returns
SELECT
c_full_name as customers,
total_orders,
total_return,
CASE
WHEN total_return > 5 THEN 'Returning_customers' ELSE 'New'
END as cx_category
FROM
(SELECT
CONCAT(c.first_name, ' ', c.last_name) as c_full_name,
COUNT(o.order_id) as total_orders,
SUM(CASE WHEN o.order_status = 'Returned' THEN 1 ELSE 0 END) as total_return
FROM orders as o
JOIN
customers as c
ON c.customer_id = o.customer_id
JOIN
order_items as oi
ON oi.order_id = o.order_id
GROUP BY 1)- Top 5 Customers by Orders in Each State Identify the top 5 customers with the highest number of orders for each state. Challenge: Include the number of orders and total sales for each customer.
SELECT * FROM
(SELECT
c.state,
CONCAT(c.first_name, ' ', c.last_name) as customers,
COUNT(o.order_id) as total_orders,
SUM(total_sale) as total_sale,
DENSE_RANK() OVER(PARTITION BY c.state ORDER BY COUNT(o.order_id) DESC) as rank
FROM orders as o
JOIN
order_items as oi
ON oi.order_id = o.order_id
JOIN
customers as c
ON
c.customer_id = o.customer_id
GROUP BY 1, 2
) as t1
WHERE rank <=5- Revenue by Shipping Provider Calculate the total revenue handled by each shipping provider. Challenge: Include the total number of orders handled and the average delivery time for each provider.
SELECT
s.shipping_providers,
COUNT(o.order_id) as order_handled,
SUM(oi.total_sale) as total_sale,
COALESCE(AVG(s.return_date - s.shipping_date), 0) as average_days
FROM orders as o
JOIN
order_items as oi
ON oi.order_id = o.order_id
JOIN
shippings as s
ON
s.order_id = o.order_id
GROUP BY 1- Top 10 product with highest decreasing revenue ratio compare to last year(2022) and current_year(2023) Challenge: Return product_id, product_name, category_name, 2022 revenue and 2023 revenue decrease ratio at end Round the result Note: Decrease ratio = cr-ls/ls* 100 (cs = current_year ls=last_year)
WITH last_year_sale
as
(
SELECT
p.product_id,
p.product_name,
SUM(oi.total_sale) as revenue
FROM orders as o
JOIN
order_items as oi
ON oi.order_id = o.order_id
JOIN
products as p
ON
p.product_id = oi.product_id
WHERE EXTRACT(YEAR FROM o.order_date) = 2022
GROUP BY 1, 2
),
current_year_sale
AS
(
SELECT
p.product_id,
p.product_name,
SUM(oi.total_sale) as revenue
FROM orders as o
JOIN
order_items as oi
ON oi.order_id = o.order_id
JOIN
products as p
ON
p.product_id = oi.product_id
WHERE EXTRACT(YEAR FROM o.order_date) = 2023
GROUP BY 1, 2
)
SELECT
cs.product_id,
ls.revenue as last_year_revenue,
cs.revenue as current_year_revenue,
ls.revenue - cs.revenue as rev_diff,
ROUND((cs.revenue - ls.revenue)::numeric/ls.revenue::numeric * 100, 2) as reveneue_dec_ratio
FROM last_year_sale as ls
JOIN
current_year_sale as cs
ON ls.product_id = cs.product_id
WHERE
ls.revenue > cs.revenue
ORDER BY 5 DESC
LIMIT 10This project enabled me to:
- Design and implement a normalized database schema.
- Clean and preprocess real-world datasets for analysis.
- Use advanced SQL techniques, including window functions, subqueries, and joins.
- Conduct in-depth business analysis using SQL.
- Optimize query performance and handle large datasets efficiently.
This advanced SQL project successfully demonstrates my ability to solve real-world e-commerce problems using structured queries. From improving customer retention to optimizing inventory and logistics, the project provides valuable insights into operational challenges and solutions.
By completing this project, I have gained a deeper understanding of how SQL can be used to tackle complex data problems and drive business decision-making.

