This project is an end-to-end data analysis solution designed to extract critical business insights from Walmart sales data. We utilize Python for data processing and analysis, SQL for advanced querying, and structured problem-solving techniques to solve key business questions. The project is ideal for data analysts looking to develop skills in data manipulation, SQL querying, and data pipeline creation.
- Tools Used: Visual Studio Code (VS Code), Python, SQL (MySQL and PostgreSQL)
- Goal: Create a structured workspace within VS Code and organize project folders for smooth development and data handling.
- API Setup: Obtain your Kaggle API token from Kaggle by navigating to your profile settings and downloading the JSON file.
- Configure Kaggle:
- Place the downloaded
kaggle.json
file in your local.kaggle
folder. - Use the command
kaggle datasets download -d <dataset-path>
to pull datasets directly into your project.
- Place the downloaded
- Data Source: Use the Kaggle API to download the Walmart sales datasets from Kaggle.
- Dataset Link: Walmart Sales Dataset
- Storage: Save the data in the
data/
folder for easy reference and access.
- Libraries: Install necessary Python libraries using:
pip install pandas numpy sqlalchemy mysql-connector-python psycopg2
- Loading Data: Read the data into a Pandas DataFrame for initial analysis and transformations.
- Goal: Conduct an initial data exploration to understand data distribution, check column names, types, and identify potential issues.
- Analysis: Use functions like
.info()
,.describe()
, and.head()
to get a quick overview of the data structure and statistics.
- Remove Duplicates: Identify and remove duplicate entries to avoid skewed results.
- Handle Missing Values: Drop rows or columns with missing values if they are insignificant; fill values where essential.
- Fix Data Types: Ensure all columns have consistent data types (e.g., dates as
datetime
, prices asfloat
). - Currency Formatting: Use
.replace()
to handle and format currency values for analysis. - Validation: Check for any remaining inconsistencies and verify the cleaned data.
- Create New Columns: Calculate the
Total Amount
for each transaction by multiplyingunit_price
byquantity
and adding this as a new column. - Enhance Dataset: Adding this calculated field will streamline further SQL analysis and aggregation tasks.
- Set Up Connections: Connect to MySQL using
sqlalchemy
and load the cleaned data into each database. - Table Creation: Set up tables in both MySQL and PostgreSQL using Python SQLAlchemy to automate table creation and data insertion.
- Verification: Run initial SQL queries to confirm that the data has been loaded accurately.
- Business Problem-Solving: Write and execute complex SQL queries to answer critical business questions, such as:
- Revenue trends across branches and categories.
- Identifying best-selling product categories.
- Sales performance by time, city, and payment method.
- Analyzing peak sales periods and customer buying patterns.
- Profit margin analysis by branch and category.
Task 2: Create a New Book Record
INSERT INTO Books
VALUES ('978-1-60129-456-2', 'To Kill a Mockingbird', 'Classic', 6.0, 'yes', 'Harper Lee', 'J.B. Lippincott & Co.');
Count total records
SELECT COUNT(*) FROM walmart;
Show payment methods and number of transactions by payment method
SELECT
payment_method,
COUNT(*) AS no_payments
FROM walmart
GROUP BY payment_method;
Count distinct branches
SELECT COUNT(DISTINCT branch)
FROM walmart;
Find the minimum quantity sold
SELECT category,MIN(quantity)
FROM walmart
GROUP BY category;
Business Problem Q1: Find different payment methods, number of transactions, and quantity sold by payment method
SELECT
payment_method,
COUNT(*) AS no_payments,
SUM(quantity) AS no_qty_sold
FROM walmart
GROUP BY payment_method;
Project Question #2: Identify the highest-rated category in each branch Display the branch, category, and avg rating
SELECT branch, category, avg_rating
FROM (
SELECT
branch,
category,
AVG(rating) AS avg_rating,
RANK() OVER(PARTITION BY branch ORDER BY AVG(rating) DESC) as ran_k
FROM walmart
GROUP BY branch, category
) AS ranked
WHERE ran_k = 1;
Q3: Identify the busiest day for each branch based on the number of transactions
SELECT branch, day_name, no_transactions
FROM (
SELECT
branch,
DAYNAME(STR_TO_DATE(date, '%d/%m/%Y')) AS day_name,
COUNT(*) AS no_transactions,
RANK() OVER(PARTITION BY branch ORDER BY COUNT(*) DESC) AS ran_k
FROM walmart
GROUP BY branch, day_name
) AS ranked
WHERE ran_k = 1;
Q4: Calculate the total quantity of items sold per payment method
SELECT
payment_method,
SUM(quantity) AS no_qty_sold
FROM walmart
GROUP BY payment_method;
**Q5: Determine the average, minimum, and maximum rating of categories for each city**
```sql
SELECT
city,
category,
MIN(rating) AS min_rating,
MAX(rating) AS max_rating,
AVG(rating) AS avg_rating
FROM walmart
GROUP BY city, category;
Q6: Calculate the total profit for each category
SELECT
category,
SUM(unit_price * quantity * profit_margin) AS total_profit
FROM walmart
GROUP BY category
ORDER BY total_profit DESC;
Q7: Determine the most common payment method for each branch
WITH cte AS (
SELECT
branch,
payment_method,
COUNT(*) AS total_trans,
RANK() OVER(PARTITION BY branch ORDER BY COUNT(*) DESC) AS ran_k
FROM walmart
GROUP BY branch, payment_method
)
SELECT branch, payment_method AS preferred_payment_method
FROM cte
WHERE ran_k = 1;
second approach ----------
select branch, payment_method
from(select branch, payment_method,
count(payment_method),
rank() over( partition by branch order by count(payment_method) desc) as ran
from walmart
group by branch, payment_method)as ranked
where ran =1;
Q8: Profitability Analysis Which product category has the highest average profit margin across branches?
SELECT branch, category, AVG(profit_margin)
FROM(
SELECT branch, category, AVG(profit_margin),
RANK() OVER(PARTITION BY category ORDER BY AVG(profit_margin) DESC) AS RAN
FROM walmart
GROUP BY category) AS RANKED
WHERE RAN =1;
WITH category_avg AS (
SELECT category, AVG(profit_margin) AS avg_profit_margin
FROM walmart
GROUP BY category
)
SELECT w.branch, w.category, w.profit_margin
FROM walmart w
JOIN category_avg c
ON w.category = c.category
WHERE c.avg_profit_margin = (SELECT MAX(avg_profit_margin) FROM category_avg);
Q9: Categorize sales into Morning, Afternoon, and Evening shifts
SELECT
branch,
CASE
WHEN HOUR(TIME(time)) < 12 THEN 'Morning'
WHEN HOUR(TIME(time)) BETWEEN 12 AND 17 THEN 'Afternoon'
ELSE 'Evening'
END AS shift,
COUNT(*) AS num_invoices
FROM walmart
GROUP BY branch, shift
ORDER BY branch, num_invoices DESC;
Q10: Branch-City InsightWhich city consistently brings in the highest profit margin for the company?
SELECT city, AVG(profit_margin) AS avg_profit_margin
FROM walmart
GROUP BY city
ORDER BY avg_profit_margin DESC
LIMIT 1;
Q11: Branch Optimization If management wants to shut down the least profitable branch, which one should it be?
SELECT branch, SUM(profit_margin) AS total_profit
FROM walmart
GROUP BY branch
ORDER BY total_profit ASC
LIMIT 1;
Q12: Identify the 5 branches with the highest revenue decrease ratio from last year to current year (e.g., 2022 to 2023)
WITH revenue_2022 AS (
SELECT
branch,
SUM(total) AS revenue
FROM walmart
WHERE YEAR(STR_TO_DATE(date, '%d/%m/%Y')) = 2022
GROUP BY branch
),
revenue_2023 AS (
SELECT
branch,
SUM(total) AS revenue
FROM walmart
WHERE YEAR(STR_TO_DATE(date, '%d/%m/%Y')) = 2023
GROUP BY branch
)
SELECT
r2022.branch,
r2022.revenue AS last_year_revenue,
r2023.revenue AS current_year_revenue,
ROUND(((r2022.revenue - r2023.revenue) / r2022.revenue) * 100, 2) AS revenue_decrease_ratio
FROM revenue_2022 AS r2022
JOIN revenue_2023 AS r2023 ON r2022.branch = r2023.branch
WHERE r2022.revenue > r2023.revenue
ORDER BY revenue_decrease_ratio DESC
LIMIT 5;
- Python 3.8+
- SQL Databases: MySQL
- Python Libraries:
pandas
,numpy
,sqlalchemy
,mysql-connector-python
- Kaggle API Key (for data downloading)
- Clone the repository:
git clone <repo-url>
- Install Python libraries:
pip install -r requirements.txt
- Set up your Kaggle API, download the data, and follow the steps to load and analyze.
|-- data/ # Raw data and transformed data
|-- sql_queries/ # SQL scripts for analysis and queries
|-- notebooks/ # Jupyter notebooks for Python analysis
|-- README.md # Project documentation
|-- requirements.txt # List of required Python libraries
|-- main.py # Main script for loading, cleaning, and processing data
This section will include your analysis findings:
- Sales Insights: Key categories, branches with highest sales, and preferred payment methods.
- Profitability: Insights into the most profitable product categories and locations.
- Customer Behavior: Trends in ratings, payment preferences, and peak shopping hours.
This project is licensed under the MIT License.
- Data Source: Kaggle’s Walmart Sales Dataset
- Inspiration: Walmart’s business case studies on sales and supply chain optimization.