Skip to content

Performed a comprehensive analysis of Walmart sales and operational data to uncover key business insights, identify trends, and support data-driven decision-making. This project involved data extraction, transformation, analysis, visualization, and predictive modeling using SQL and Python

Notifications You must be signed in to change notification settings

Ankitkumar-x/Walmart_SQL_Python_Project

Repository files navigation

Walmart Data Analysis: End-to-End SQL + Python Project P-9

Project Overview

Project Pipeline

This project is an end-to-end data analysis solution designed to extract critical business insights from Walmart sales data. We utilize Python for data processing and analysis, SQL for advanced querying, and structured problem-solving techniques to solve key business questions. The project is ideal for data analysts looking to develop skills in data manipulation, SQL querying, and data pipeline creation.


Project Steps

1. Set Up the Environment

  • Tools Used: Visual Studio Code (VS Code), Python, SQL (MySQL and PostgreSQL)
  • Goal: Create a structured workspace within VS Code and organize project folders for smooth development and data handling.

2. Set Up Kaggle API

  • API Setup: Obtain your Kaggle API token from Kaggle by navigating to your profile settings and downloading the JSON file.
  • Configure Kaggle:
    • Place the downloaded kaggle.json file in your local .kaggle folder.
    • Use the command kaggle datasets download -d <dataset-path> to pull datasets directly into your project.

3. Download Walmart Sales Data

  • Data Source: Use the Kaggle API to download the Walmart sales datasets from Kaggle.
  • Dataset Link: Walmart Sales Dataset
  • Storage: Save the data in the data/ folder for easy reference and access.

4. Install Required Libraries and Load Data

  • Libraries: Install necessary Python libraries using:
    pip install pandas numpy sqlalchemy mysql-connector-python psycopg2
  • Loading Data: Read the data into a Pandas DataFrame for initial analysis and transformations.

5. Explore the Data

  • Goal: Conduct an initial data exploration to understand data distribution, check column names, types, and identify potential issues.
  • Analysis: Use functions like .info(), .describe(), and .head() to get a quick overview of the data structure and statistics.

6. Data Cleaning

  • Remove Duplicates: Identify and remove duplicate entries to avoid skewed results.
  • Handle Missing Values: Drop rows or columns with missing values if they are insignificant; fill values where essential.
  • Fix Data Types: Ensure all columns have consistent data types (e.g., dates as datetime, prices as float).
  • Currency Formatting: Use .replace() to handle and format currency values for analysis.
  • Validation: Check for any remaining inconsistencies and verify the cleaned data.

7. Feature Engineering

  • Create New Columns: Calculate the Total Amount for each transaction by multiplying unit_price by quantity and adding this as a new column.
  • Enhance Dataset: Adding this calculated field will streamline further SQL analysis and aggregation tasks.

8. Load Data into MySQL

  • Set Up Connections: Connect to MySQL using sqlalchemy and load the cleaned data into each database.
  • Table Creation: Set up tables in both MySQL and PostgreSQL using Python SQLAlchemy to automate table creation and data insertion.
  • Verification: Run initial SQL queries to confirm that the data has been loaded accurately.

9. SQL Analysis: Complex Queries and Business Problem Solving

  • Business Problem-Solving: Write and execute complex SQL queries to answer critical business questions, such as:
    • Revenue trends across branches and categories.
    • Identifying best-selling product categories.
    • Sales performance by time, city, and payment method.
    • Analyzing peak sales periods and customer buying patterns.
    • Profit margin analysis by branch and category.

Important Questions......

Task 2: Create a New Book Record

INSERT INTO Books
VALUES ('978-1-60129-456-2', 'To Kill a Mockingbird', 'Classic', 6.0, 'yes', 'Harper Lee', 'J.B. Lippincott & Co.');

Count total records

SELECT COUNT(*) FROM walmart;

Show payment methods and number of transactions by payment method

SELECT 
    payment_method,
    COUNT(*) AS no_payments
FROM walmart
GROUP BY payment_method;

Count distinct branches

SELECT COUNT(DISTINCT branch) 
FROM walmart;

Find the minimum quantity sold

SELECT category,MIN(quantity) 
FROM walmart
GROUP BY category;

Business Problem Q1: Find different payment methods, number of transactions, and quantity sold by payment method

SELECT 
    payment_method,
    COUNT(*) AS no_payments, 
    SUM(quantity) AS no_qty_sold
FROM walmart
GROUP BY payment_method;

Project Question #2: Identify the highest-rated category in each branch Display the branch, category, and avg rating

SELECT branch, category, avg_rating
FROM (
    SELECT 
        branch,
        category,
        AVG(rating) AS avg_rating,
        RANK() OVER(PARTITION BY branch ORDER BY AVG(rating) DESC) as ran_k
    FROM walmart
    GROUP BY branch, category
) AS ranked
WHERE ran_k = 1;

Q3: Identify the busiest day for each branch based on the number of transactions

SELECT branch, day_name, no_transactions
FROM (
    SELECT 
        branch,
        DAYNAME(STR_TO_DATE(date, '%d/%m/%Y')) AS day_name,
        COUNT(*) AS no_transactions,
        RANK() OVER(PARTITION BY branch ORDER BY COUNT(*) DESC) AS ran_k
    FROM walmart
    GROUP BY branch, day_name
) AS ranked
WHERE ran_k = 1;

Q4: Calculate the total quantity of items sold per payment method

SELECT 
    payment_method,
    SUM(quantity) AS no_qty_sold
FROM walmart
GROUP BY payment_method;

**Q5: Determine the average, minimum, and maximum rating of categories for each city**
```sql
SELECT 
    city,
    category,
    MIN(rating) AS min_rating,
    MAX(rating) AS max_rating,
    AVG(rating) AS avg_rating
FROM walmart
GROUP BY city, category;

Q6: Calculate the total profit for each category

SELECT 
    category,
    SUM(unit_price * quantity * profit_margin) AS total_profit
FROM walmart
GROUP BY category
ORDER BY total_profit DESC;

Q7: Determine the most common payment method for each branch

WITH cte AS (
    SELECT 
        branch,
        payment_method,
        COUNT(*) AS total_trans,
        RANK() OVER(PARTITION BY branch ORDER BY COUNT(*) DESC) AS ran_k
    FROM walmart
    GROUP BY branch, payment_method
)
SELECT branch, payment_method AS preferred_payment_method
FROM cte
WHERE ran_k = 1;

second approach ----------

select branch, payment_method
from(select branch, payment_method, 
count(payment_method),
rank() over( partition by branch order by count(payment_method) desc) as ran
from walmart
group by branch, payment_method)as ranked
where ran =1;

Advanced Level (Business Insights) ----

Q8: Profitability Analysis Which product category has the highest average profit margin across branches?

SELECT branch, category, AVG(profit_margin)
FROM(
	SELECT branch, category, AVG(profit_margin),
    RANK() OVER(PARTITION BY category ORDER BY AVG(profit_margin) DESC) AS RAN
	FROM walmart
	GROUP BY category) AS RANKED
WHERE RAN =1;    
WITH category_avg AS (
    SELECT category, AVG(profit_margin) AS avg_profit_margin
    FROM walmart
    GROUP BY category
)
SELECT w.branch, w.category, w.profit_margin
FROM walmart w
JOIN category_avg c
ON w.category = c.category
WHERE c.avg_profit_margin = (SELECT MAX(avg_profit_margin) FROM category_avg);

Q9: Categorize sales into Morning, Afternoon, and Evening shifts

SELECT
    branch,
    CASE 
        WHEN HOUR(TIME(time)) < 12 THEN 'Morning'
        WHEN HOUR(TIME(time)) BETWEEN 12 AND 17 THEN 'Afternoon'
        ELSE 'Evening'
    END AS shift,
    COUNT(*) AS num_invoices
FROM walmart
GROUP BY branch, shift
ORDER BY branch, num_invoices DESC;

Q10: Branch-City InsightWhich city consistently brings in the highest profit margin for the company?

SELECT city, AVG(profit_margin) AS avg_profit_margin
FROM walmart
GROUP BY city
ORDER BY avg_profit_margin DESC
LIMIT 1;

Q11: Branch Optimization If management wants to shut down the least profitable branch, which one should it be?

SELECT branch, SUM(profit_margin) AS total_profit
FROM walmart
GROUP BY branch
ORDER BY total_profit ASC
LIMIT 1;

Q12: Identify the 5 branches with the highest revenue decrease ratio from last year to current year (e.g., 2022 to 2023)

WITH revenue_2022 AS (
    SELECT 
        branch,
        SUM(total) AS revenue
    FROM walmart
    WHERE YEAR(STR_TO_DATE(date, '%d/%m/%Y')) = 2022
    GROUP BY branch
),
revenue_2023 AS (
    SELECT 
        branch,
        SUM(total) AS revenue
    FROM walmart
    WHERE YEAR(STR_TO_DATE(date, '%d/%m/%Y')) = 2023
    GROUP BY branch
)
SELECT 
    r2022.branch,
    r2022.revenue AS last_year_revenue,
    r2023.revenue AS current_year_revenue,
    ROUND(((r2022.revenue - r2023.revenue) / r2022.revenue) * 100, 2) AS revenue_decrease_ratio
FROM revenue_2022 AS r2022
JOIN revenue_2023 AS r2023 ON r2022.branch = r2023.branch
WHERE r2022.revenue > r2023.revenue
ORDER BY revenue_decrease_ratio DESC
LIMIT 5;

Requirements

  • Python 3.8+
  • SQL Databases: MySQL
  • Python Libraries:
    • pandas, numpy, sqlalchemy, mysql-connector-python
  • Kaggle API Key (for data downloading)

Getting Started

  1. Clone the repository:
    git clone <repo-url>
  2. Install Python libraries:
    pip install -r requirements.txt
  3. Set up your Kaggle API, download the data, and follow the steps to load and analyze.

Project Structure

|-- data/                     # Raw data and transformed data
|-- sql_queries/              # SQL scripts for analysis and queries
|-- notebooks/                # Jupyter notebooks for Python analysis
|-- README.md                 # Project documentation
|-- requirements.txt          # List of required Python libraries
|-- main.py                   # Main script for loading, cleaning, and processing data

Results and Insights

This section will include your analysis findings:

  • Sales Insights: Key categories, branches with highest sales, and preferred payment methods.
  • Profitability: Insights into the most profitable product categories and locations.
  • Customer Behavior: Trends in ratings, payment preferences, and peak shopping hours.

License

This project is licensed under the MIT License.


Acknowledgments

  • Data Source: Kaggle’s Walmart Sales Dataset
  • Inspiration: Walmart’s business case studies on sales and supply chain optimization.

About

Performed a comprehensive analysis of Walmart sales and operational data to uncover key business insights, identify trends, and support data-driven decision-making. This project involved data extraction, transformation, analysis, visualization, and predictive modeling using SQL and Python

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published