
PRANAV7389/Retail-Order-data-analysis-using-Python-Sql


Retail Order Data Analysis Using Python & SQL

This project implements a Python data pipeline that downloads a retail orders dataset via the Kaggle API and cleans it with pandas. The prepared data is then loaded into a SQL database, where SQL queries answer five specific business questions about the data.


Data Pipeline to Download, Clean, and Load Data into SQL Database

This Python script demonstrates a data pipeline that downloads a dataset from Kaggle, performs data cleaning, and loads the prepared data into a SQL database for further analysis. Below are the detailed steps and the corresponding code.

1. Install Kaggle API

# Install the Kaggle package to interact with the Kaggle API
!pip install kaggle
Requirement already satisfied: kaggle in c:\users\dell\anaconda3\lib\site-packages (1.6.14)

2. Download the Dataset from Kaggle

import kaggle
# Download the 'orders.csv' file from the specified Kaggle dataset
!kaggle datasets download ankitbansal06/retail-orders -f orders.csv
Dataset URL: https://www.kaggle.com/datasets/ankitbansal06/retail-orders
License(s): CC0-1.0
orders.csv.zip: Skipping, found more recently modified local copy (use --force to force download)

3. Extract the Downloaded ZIP File

import zipfile
# Extract the downloaded zip archive containing 'orders.csv';
# the context manager closes the file automatically
with zipfile.ZipFile('orders.csv.zip') as zip_ref:
    zip_ref.extractall()

4. Load Data into a DataFrame

import pandas as pd
# Read the CSV file into a pandas DataFrame
df = pd.read_csv('orders.csv')
# Display the first 20 rows of the DataFrame
df.head(20)
Order Id Order Date Ship Mode Segment Country City State Postal Code Region Category Sub Category Product Id cost price List Price Quantity Discount Percent
0 1 2023-03-01 Second Class Consumer United States Henderson Kentucky 42420 South Furniture Bookcases FUR-BO-10001798 240 260 2 2
1 2 2023-08-15 Second Class Consumer United States Henderson Kentucky 42420 South Furniture Chairs FUR-CH-10000454 600 730 3 3
2 3 2023-01-10 Second Class Corporate United States Los Angeles California 90036 West Office Supplies Labels OFF-LA-10000240 10 10 2 5
3 4 2022-06-18 Standard Class Consumer United States Fort Lauderdale Florida 33311 South Furniture Tables FUR-TA-10000577 780 960 5 2
4 5 2022-07-13 Standard Class Consumer United States Fort Lauderdale Florida 33311 South Office Supplies Storage OFF-ST-10000760 20 20 2 5
5 6 2022-03-13 Not Available Consumer United States Los Angeles California 90032 West Furniture Furnishings FUR-FU-10001487 50 50 7 3
6 7 2022-12-28 Standard Class Consumer United States Los Angeles California 90032 West Office Supplies Art OFF-AR-10002833 10 10 4 3
7 8 2022-01-25 Standard Class Consumer United States Los Angeles California 90032 West Technology Phones TEC-PH-10002275 860 910 6 5
8 9 2023-03-23 Not Available Consumer United States Los Angeles California 90032 West Office Supplies Binders OFF-BI-10003910 20 20 3 2
9 10 2023-05-16 Standard Class Consumer United States Los Angeles California 90032 West Office Supplies Appliances OFF-AP-10002892 90 110 5 3
10 11 2023-03-31 Not Available Consumer United States Los Angeles California 90032 West Furniture Tables FUR-TA-10001539 1470 1710 9 3
11 12 2023-12-25 Not Available Consumer United States Los Angeles California 90032 West Technology Phones TEC-PH-10002033 750 910 4 3
12 13 2022-02-11 Standard Class Consumer United States Concord North Carolina 28027 South Office Supplies Paper OFF-PA-10002365 20 20 3 3
13 14 2023-07-18 Standard Class Consumer United States Seattle Washington 98103 West Office Supplies Binders OFF-BI-10003656 360 410 3 2
14 15 2023-11-09 unknown Home Office United States Fort Worth Texas 76106 Central Office Supplies Appliances OFF-AP-10002311 60 70 5 5
15 16 2022-06-18 Standard Class Home Office United States Fort Worth Texas 76106 Central Office Supplies Binders OFF-BI-10000756 0 0 3 5
16 17 2022-02-04 Standard Class Consumer United States Madison Wisconsin 53711 Central Office Supplies Storage OFF-ST-10004186 610 670 6 3
17 18 2023-08-04 Second Class Consumer United States West Jordan Utah 84084 West Office Supplies Storage OFF-ST-10000107 60 60 2 4
18 19 2022-01-23 Second Class Consumer United States San Francisco California 94109 West Office Supplies Art OFF-AR-10003056 10 10 2 4
19 20 2022-01-11 Second Class Consumer United States San Francisco California 94109 West Technology Phones TEC-PH-10001949 170 210 3 3

5. Data Inspection

# Display information about the DataFrame (column types, non-null values, etc.)
df.info()
# Check for missing values in each column
df.isnull().sum()
# Display unique values in the 'ship_mode' column
# (note the parentheses: unique() is a method, not an attribute)
df['ship_mode'].unique()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   order_id      9994 non-null   int64         
 1   order_date    9994 non-null   datetime64[ns]
 2   ship_mode     9988 non-null   object        
 3   segment       9994 non-null   object        
 4   country       9994 non-null   object        
 5   city          9994 non-null   object        
 6   state         9994 non-null   object        
 7   postal_code   9994 non-null   int64         
 8   region        9994 non-null   object        
 9   category      9994 non-null   object        
 10  sub_category  9994 non-null   object        
 11  product_id    9994 non-null   object        
 12  quantity      9994 non-null   int64         
 13  discount      9994 non-null   float64       
 14  sale_price    9994 non-null   float64       
 15  profit        9994 non-null   float64       
dtypes: datetime64[ns](1), float64(3), int64(3), object(9)
memory usage: 1.2+ MB






6. Data Cleaning

# Re-read the CSV file, treating 'Not Available' and 'unknown' as NaN
df = pd.read_csv('orders.csv', na_values=['Not Available', 'unknown'])
# Normalize column names: lowercase, then replace spaces with underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')
df.head(5)
order_id order_date ship_mode segment country city state postal_code region category sub_category product_id cost_price list_price quantity discount_percent
0 1 2023-03-01 Second Class Consumer United States Henderson Kentucky 42420 South Furniture Bookcases FUR-BO-10001798 240 260 2 2
1 2 2023-08-15 Second Class Consumer United States Henderson Kentucky 42420 South Furniture Chairs FUR-CH-10000454 600 730 3 3
2 3 2023-01-10 Second Class Corporate United States Los Angeles California 90036 West Office Supplies Labels OFF-LA-10000240 10 10 2 5
3 4 2022-06-18 Standard Class Consumer United States Fort Lauderdale Florida 33311 South Furniture Tables FUR-TA-10000577 780 960 5 2
4 5 2022-07-13 Standard Class Consumer United States Fort Lauderdale Florida 33311 South Office Supplies Storage OFF-ST-10000760 20 20 2 5
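The column-name normalization above can be sketched on a toy frame (the column names here are illustrative, not the full dataset):

```python
import pandas as pd

# Toy frame with raw Kaggle-style headers (illustrative subset)
raw = pd.DataFrame({"Order Id": [1], "Ship Mode": ["Second Class"]})

# Lowercase, then swap spaces for underscores, as in the pipeline above
raw.columns = raw.columns.str.lower().str.replace(" ", "_")
print(list(raw.columns))  # ['order_id', 'ship_mode']
```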

7. Feature Engineering

# Calculate discount amount
df['discount'] = df['list_price'] * df['discount_percent'] * 0.01
# Calculate sale price after discount
df['sale_price'] = df['list_price'] - df['discount']
df
# Calculate profit
df['profit'] = df['sale_price'] - df['cost_price']
# Check data types
df.dtypes
# Convert 'order_date' column to datetime format
df['order_date'] = pd.to_datetime(df['order_date'], format="%Y-%m-%d")
df.dtypes
# Drop unnecessary columns
df.drop(columns=['list_price', 'cost_price', 'discount_percent'], inplace=True)
df
order_id order_date ship_mode segment country city state postal_code region category sub_category product_id quantity discount sale_price profit
0 1 2023-03-01 Second Class Consumer United States Henderson Kentucky 42420 South Furniture Bookcases FUR-BO-10001798 2 5.2 254.8 14.8
1 2 2023-08-15 Second Class Consumer United States Henderson Kentucky 42420 South Furniture Chairs FUR-CH-10000454 3 21.9 708.1 108.1
2 3 2023-01-10 Second Class Corporate United States Los Angeles California 90036 West Office Supplies Labels OFF-LA-10000240 2 0.5 9.5 -0.5
3 4 2022-06-18 Standard Class Consumer United States Fort Lauderdale Florida 33311 South Furniture Tables FUR-TA-10000577 5 19.2 940.8 160.8
4 5 2022-07-13 Standard Class Consumer United States Fort Lauderdale Florida 33311 South Office Supplies Storage OFF-ST-10000760 2 1.0 19.0 -1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9989 9990 2023-02-18 Second Class Consumer United States Miami Florida 33180 South Furniture Furnishings FUR-FU-10001889 3 1.2 28.8 -1.2
9990 9991 2023-03-17 Standard Class Consumer United States Costa Mesa California 92627 West Furniture Furnishings FUR-FU-10000747 2 3.6 86.4 16.4
9991 9992 2022-08-07 Standard Class Consumer United States Costa Mesa California 92627 West Technology Phones TEC-PH-10003645 2 5.2 254.8 34.8
9992 9993 2022-11-19 Standard Class Consumer United States Costa Mesa California 92627 West Office Supplies Paper OFF-PA-10004041 4 0.9 29.1 -0.9
9993 9994 2022-07-17 Second Class Consumer United States Westminster California 92683 West Office Supplies Appliances OFF-AP-10002684 2 7.2 232.8 22.8

9994 rows × 16 columns
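The derived columns follow directly from the arithmetic above; a minimal check on the first sample row (values taken from the table in step 4) confirms the formulas:

```python
import pandas as pd

# One row from the sample data above (order_id 1)
row = pd.DataFrame({"cost_price": [240], "list_price": [260], "discount_percent": [2]})

row["discount"] = row["list_price"] * row["discount_percent"] * 0.01   # 260 * 0.02
row["sale_price"] = row["list_price"] - row["discount"]                # 260 - 5.2
row["profit"] = row["sale_price"] - row["cost_price"]                  # 254.8 - 240
print(row[["discount", "sale_price", "profit"]].round(2).iloc[0].tolist())  # [5.2, 254.8, 14.8]
```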

8. Load Data into SQL Database

from sqlalchemy import create_engine

# Database connection details
# (in practice, read credentials from environment variables rather than hardcoding them)
user = 'pranav_user'
password = 'kshama1234'
host = 'localhost'
database = 'pranav'

# Create a SQLAlchemy engine for MySQL
engine = create_engine(f'mysql+mysqlconnector://{user}:{password}@{host}/{database}')
conn = engine.connect()

# Load the DataFrame into the SQL database table 'df_orders'
df.to_sql('df_orders', con=conn, index=False, if_exists='append')
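The same `to_sql` round trip can be exercised without a MySQL server; a minimal sketch using an in-memory SQLite engine as a stand-in (purely for illustration):

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite stand-in for the MySQL connection above
engine = create_engine("sqlite://")

# Tiny illustrative frame in place of the full orders DataFrame
df = pd.DataFrame({"order_id": [1, 2], "sale_price": [254.8, 708.1]})
df.to_sql("df_orders", con=engine, index=False, if_exists="append")

# Read it back to confirm the load
out = pd.read_sql("SELECT COUNT(*) AS n FROM df_orders", con=engine)
print(int(out["n"].iloc[0]))  # 2
```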

SQL Queries to Analyze the Data:

-- Select all records from the df_orders table
SELECT * FROM pranav.df_orders;

-- Find top 10 highest revenue-generating products
-- 1. Group the records by product_id
-- 2. Calculate the sum of sale_price for each product_id
-- 3. Order the results by sales in descending order
-- 4. Limit the results to the top 10 entries
SELECT product_id, sum(sale_price) as sales
FROM df_orders
GROUP BY product_id
ORDER BY sales DESC
LIMIT 10;
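The same aggregation can be reproduced in pandas; a sketch on toy data (not the real table):

```python
import pandas as pd

# Toy orders: product A appears twice
orders = pd.DataFrame({
    "product_id": ["A", "B", "A", "C"],
    "sale_price": [100.0, 50.0, 30.0, 20.0],
})

# GROUP BY product_id, SUM(sale_price), ORDER BY sales DESC, LIMIT 10
top = (orders.groupby("product_id")["sale_price"].sum()
             .sort_values(ascending=False)
             .head(10))
print(top.index.tolist())  # ['A', 'B', 'C']
```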

-- Find top 5 highest selling products in each region
-- 1. Create a common table expression (CTE) to calculate sales per product_id for each region
-- 2. Use ROW_NUMBER() to rank the products within each region by sales in descending order
-- 3. Select the top 5 products (rn <= 5) for each region
WITH cte AS (
  SELECT region, product_id, sum(sale_price) as sales
  FROM df_orders
  GROUP BY region, product_id
)
SELECT * FROM (
  SELECT *,
    ROW_NUMBER() OVER(PARTITION BY region ORDER BY sales DESC) as rn
  FROM cte
) A
WHERE rn <= 5;
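The ROW_NUMBER-per-partition pattern has a direct pandas equivalent: sort by sales, then take the head of each group. A sketch on toy data:

```python
import pandas as pd

# Toy per-region product sales
region_sales = pd.DataFrame({
    "region": ["West", "West", "West", "South"],
    "product_id": ["A", "B", "C", "D"],
    "sales": [300.0, 200.0, 100.0, 50.0],
})

# ROW_NUMBER() OVER (PARTITION BY region ORDER BY sales DESC) ... WHERE rn <= 5
top5 = (region_sales.sort_values("sales", ascending=False)
                    .groupby("region")
                    .head(5))
print(sorted(top5["product_id"]))  # ['A', 'B', 'C', 'D']
```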

-- Find month-over-month growth comparison for 2022 and 2023 sales
-- 1. Create a CTE to calculate the total sales per month for each year
-- 2. Use CASE statements to segregate sales data for 2022 and 2023
-- 3. Group the results by month and order by month
WITH cte AS (
  SELECT YEAR(order_date) as year, MONTH(order_date) as month, sum(sale_price) as sales
  FROM df_orders
  GROUP BY YEAR(order_date), MONTH(order_date)
)
SELECT month,
  sum(CASE WHEN year = 2022 THEN sales ELSE 0 END) as sales_2022,
  sum(CASE WHEN year = 2023 THEN sales ELSE 0 END) as sales_2023
FROM cte
GROUP BY month
ORDER BY month;
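The CASE-based pivot above maps to a pandas `pivot_table` with months as rows and years as columns; a sketch on illustrative dates:

```python
import pandas as pd

# Toy sales spanning both years
sales = pd.DataFrame({
    "order_date": pd.to_datetime(["2022-01-05", "2023-01-07", "2022-02-01"]),
    "sale_price": [10.0, 15.0, 20.0],
})
sales["year"] = sales["order_date"].dt.year
sales["month"] = sales["order_date"].dt.month

# Equivalent of the CASE pivot: sum per (month, year), missing cells become 0
pivot = sales.pivot_table(index="month", columns="year",
                          values="sale_price", aggfunc="sum", fill_value=0)
print(pivot.loc[1, 2022], pivot.loc[1, 2023])  # 10.0 15.0
```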

-- For each category, find the month with the highest sales
-- 1. Create a CTE to calculate sales per category for each year-month
-- 2. Use ROW_NUMBER() to rank the months within each category by sales in descending order
-- 3. Select the month with the highest sales (rn = 1) for each category
WITH cte AS (
  SELECT category, DATE_FORMAT(order_date, '%Y%m') AS order_year_month, sum(sale_price) as sales
  FROM df_orders
  GROUP BY category, DATE_FORMAT(order_date, '%Y%m')
)
SELECT * FROM (
  SELECT *,
    ROW_NUMBER() OVER(PARTITION BY category ORDER BY sales DESC) as rn
  FROM cte
) a
WHERE rn = 1;

-- Identify which sub-category had the highest sales growth in 2023 compared to 2022
-- 1. Create a CTE to calculate sales per sub_category for each year
-- 2. Create another CTE to pivot yearly sales into 2022 and 2023 columns per sub_category
-- 3. Compute the growth percentage, order descending, and limit to the top 1 entry
WITH cte AS (
  SELECT sub_category, YEAR(order_date) as year, sum(sale_price) as sales
  FROM df_orders
  GROUP BY sub_category, YEAR(order_date)
), cte2 AS (
  SELECT sub_category,
    sum(CASE WHEN year = 2022 THEN sales ELSE 0 END) as sales_2022,
    sum(CASE WHEN year = 2023 THEN sales ELSE 0 END) as sales_2023
  FROM cte
  GROUP BY sub_category
)
SELECT *,
  (sales_2023 - sales_2022) * 100 / sales_2022 as growth_percentage
FROM cte2
ORDER BY growth_percentage DESC
LIMIT 1;
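The growth formula in the final query is a plain percentage change; checked on illustrative numbers:

```python
# Growth formula from the query above: (sales_2023 - sales_2022) * 100 / sales_2022
sales_2022, sales_2023 = 200.0, 260.0
growth_percentage = (sales_2023 - sales_2022) * 100 / sales_2022
print(growth_percentage)  # 30.0
```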
