This project builds a Python data pipeline that downloads a retail orders dataset from the Kaggle API, cleans it with pandas, and loads the prepared data into a SQL database. SQL queries then answer five specific questions about the data. The detailed steps and corresponding code follow.
# Install the Kaggle package to interact with the Kaggle API
!pip install kaggle
import kaggle
# Download the 'orders.csv' file from the specified Kaggle dataset
!kaggle datasets download ankitbansal06/retail-orders -f orders.csv
Dataset URL: https://www.kaggle.com/datasets/ankitbansal06/retail-orders
License(s): CC0-1.0
orders.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
import zipfile
# Extract the downloaded zip archive containing 'orders.csv'
with zipfile.ZipFile('orders.csv.zip') as zip_ref:
    zip_ref.extractall()
import pandas as pd
# Read the CSV file into a pandas DataFrame
df = pd.read_csv('orders.csv')
# Display the first 20 rows of the DataFrame
df.head(20)
 | Order Id | Order Date | Ship Mode | Segment | Country | City | State | Postal Code | Region | Category | Sub Category | Product Id | cost price | List Price | Quantity | Discount Percent |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2023-03-01 | Second Class | Consumer | United States | Henderson | Kentucky | 42420 | South | Furniture | Bookcases | FUR-BO-10001798 | 240 | 260 | 2 | 2 |
1 | 2 | 2023-08-15 | Second Class | Consumer | United States | Henderson | Kentucky | 42420 | South | Furniture | Chairs | FUR-CH-10000454 | 600 | 730 | 3 | 3 |
2 | 3 | 2023-01-10 | Second Class | Corporate | United States | Los Angeles | California | 90036 | West | Office Supplies | Labels | OFF-LA-10000240 | 10 | 10 | 2 | 5 |
3 | 4 | 2022-06-18 | Standard Class | Consumer | United States | Fort Lauderdale | Florida | 33311 | South | Furniture | Tables | FUR-TA-10000577 | 780 | 960 | 5 | 2 |
4 | 5 | 2022-07-13 | Standard Class | Consumer | United States | Fort Lauderdale | Florida | 33311 | South | Office Supplies | Storage | OFF-ST-10000760 | 20 | 20 | 2 | 5 |
5 | 6 | 2022-03-13 | Not Available | Consumer | United States | Los Angeles | California | 90032 | West | Furniture | Furnishings | FUR-FU-10001487 | 50 | 50 | 7 | 3 |
6 | 7 | 2022-12-28 | Standard Class | Consumer | United States | Los Angeles | California | 90032 | West | Office Supplies | Art | OFF-AR-10002833 | 10 | 10 | 4 | 3 |
7 | 8 | 2022-01-25 | Standard Class | Consumer | United States | Los Angeles | California | 90032 | West | Technology | Phones | TEC-PH-10002275 | 860 | 910 | 6 | 5 |
8 | 9 | 2023-03-23 | Not Available | Consumer | United States | Los Angeles | California | 90032 | West | Office Supplies | Binders | OFF-BI-10003910 | 20 | 20 | 3 | 2 |
9 | 10 | 2023-05-16 | Standard Class | Consumer | United States | Los Angeles | California | 90032 | West | Office Supplies | Appliances | OFF-AP-10002892 | 90 | 110 | 5 | 3 |
10 | 11 | 2023-03-31 | Not Available | Consumer | United States | Los Angeles | California | 90032 | West | Furniture | Tables | FUR-TA-10001539 | 1470 | 1710 | 9 | 3 |
11 | 12 | 2023-12-25 | Not Available | Consumer | United States | Los Angeles | California | 90032 | West | Technology | Phones | TEC-PH-10002033 | 750 | 910 | 4 | 3 |
12 | 13 | 2022-02-11 | Standard Class | Consumer | United States | Concord | North Carolina | 28027 | South | Office Supplies | Paper | OFF-PA-10002365 | 20 | 20 | 3 | 3 |
13 | 14 | 2023-07-18 | Standard Class | Consumer | United States | Seattle | Washington | 98103 | West | Office Supplies | Binders | OFF-BI-10003656 | 360 | 410 | 3 | 2 |
14 | 15 | 2023-11-09 | unknown | Home Office | United States | Fort Worth | Texas | 76106 | Central | Office Supplies | Appliances | OFF-AP-10002311 | 60 | 70 | 5 | 5 |
15 | 16 | 2022-06-18 | Standard Class | Home Office | United States | Fort Worth | Texas | 76106 | Central | Office Supplies | Binders | OFF-BI-10000756 | 0 | 0 | 3 | 5 |
16 | 17 | 2022-02-04 | Standard Class | Consumer | United States | Madison | Wisconsin | 53711 | Central | Office Supplies | Storage | OFF-ST-10004186 | 610 | 670 | 6 | 3 |
17 | 18 | 2023-08-04 | Second Class | Consumer | United States | West Jordan | Utah | 84084 | West | Office Supplies | Storage | OFF-ST-10000107 | 60 | 60 | 2 | 4 |
18 | 19 | 2022-01-23 | Second Class | Consumer | United States | San Francisco | California | 94109 | West | Office Supplies | Art | OFF-AR-10003056 | 10 | 10 | 2 | 4 |
19 | 20 | 2022-01-11 | Second Class | Consumer | United States | San Francisco | California | 94109 | West | Technology | Phones | TEC-PH-10001949 | 170 | 210 | 3 | 3 |
# Display information about the DataFrame (column types, non-null counts, etc.)
df.info()
# Count missing values in each column
df.isnull().sum()
# List the unique values in the 'ship_mode' column
# (the parentheses matter: without them, this returns the bound method rather than the values)
df['ship_mode'].unique()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 order_id 9994 non-null int64
1 order_date 9994 non-null datetime64[ns]
2 ship_mode 9988 non-null object
3 segment 9994 non-null object
4 country 9994 non-null object
5 city 9994 non-null object
6 state 9994 non-null object
7 postal_code 9994 non-null int64
8 region 9994 non-null object
9 category 9994 non-null object
10 sub_category 9994 non-null object
11 product_id 9994 non-null object
12 quantity 9994 non-null int64
13 discount 9994 non-null float64
14 sale_price 9994 non-null float64
15 profit 9994 non-null float64
dtypes: datetime64[ns](1), float64(3), int64(3), object(9)
memory usage: 1.2+ MB
# Re-read the CSV file, treating 'Not Available' and 'unknown' as NaN
df = pd.read_csv('orders.csv', na_values=['Not Available', 'unknown'])
df.head(20)
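The effect of `na_values` can be seen on a small in-memory sample containing the same two placeholder strings found in the real dataset (a minimal sketch, not the actual file):

```python
from io import StringIO
import pandas as pd

# Toy CSV with the placeholder strings 'Not Available' and 'unknown'
sample = StringIO("Ship Mode\nSecond Class\nNot Available\nunknown\n")
clean = pd.read_csv(sample, na_values=['Not Available', 'unknown'])
# Both placeholder strings are now NaN
print(clean['Ship Mode'].isna().sum())
```

Without `na_values`, these placeholders would survive as ordinary strings and skew any group-by on ship mode.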
# Convert column names to lowercase
df.columns = df.columns.str.lower()
df.columns
# Replace spaces in column names with underscores
df.columns = df.columns.str.replace(' ', '_')
df.columns
df.head(5)
 | order_id | order_date | ship_mode | segment | country | city | state | postal_code | region | category | sub_category | product_id | cost_price | list_price | quantity | discount_percent |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2023-03-01 | Second Class | Consumer | United States | Henderson | Kentucky | 42420 | South | Furniture | Bookcases | FUR-BO-10001798 | 240 | 260 | 2 | 2 |
1 | 2 | 2023-08-15 | Second Class | Consumer | United States | Henderson | Kentucky | 42420 | South | Furniture | Chairs | FUR-CH-10000454 | 600 | 730 | 3 | 3 |
2 | 3 | 2023-01-10 | Second Class | Corporate | United States | Los Angeles | California | 90036 | West | Office Supplies | Labels | OFF-LA-10000240 | 10 | 10 | 2 | 5 |
3 | 4 | 2022-06-18 | Standard Class | Consumer | United States | Fort Lauderdale | Florida | 33311 | South | Furniture | Tables | FUR-TA-10000577 | 780 | 960 | 5 | 2 |
4 | 5 | 2022-07-13 | Standard Class | Consumer | United States | Fort Lauderdale | Florida | 33311 | South | Office Supplies | Storage | OFF-ST-10000760 | 20 | 20 | 2 | 5 |
# Calculate discount amount
df['discount'] = df['list_price'] * df['discount_percent'] * 0.01
# Calculate sale price after discount
df['sale_price'] = df['list_price'] - df['discount']
df
# Calculate profit
df['profit'] = df['sale_price'] - df['cost_price']
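Checking these three formulas against the first row of the table above (list price 260, 2% discount, cost price 240) gives discount 5.2, sale price 254.8, and profit 14.8:

```python
# Values taken from row 0 of the dataset shown earlier
list_price, discount_percent, cost_price = 260.0, 2, 240.0

discount = round(list_price * discount_percent * 0.01, 2)  # discount amount
sale_price = round(list_price - discount, 2)               # price after discount
profit = round(sale_price - cost_price, 2)                 # margin over cost
print(discount, sale_price, profit)
```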
# Check data types
df.dtypes
# Convert 'order_date' column to datetime format
df['order_date'] = pd.to_datetime(df['order_date'], format="%Y-%m-%d")
df.dtypes
# Drop unnecessary columns
df.drop(columns=['list_price', 'cost_price', 'discount_percent'], inplace=True)
df
 | order_id | order_date | ship_mode | segment | country | city | state | postal_code | region | category | sub_category | product_id | quantity | discount | sale_price | profit |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2023-03-01 | Second Class | Consumer | United States | Henderson | Kentucky | 42420 | South | Furniture | Bookcases | FUR-BO-10001798 | 2 | 5.2 | 254.8 | 14.8 |
1 | 2 | 2023-08-15 | Second Class | Consumer | United States | Henderson | Kentucky | 42420 | South | Furniture | Chairs | FUR-CH-10000454 | 3 | 21.9 | 708.1 | 108.1 |
2 | 3 | 2023-01-10 | Second Class | Corporate | United States | Los Angeles | California | 90036 | West | Office Supplies | Labels | OFF-LA-10000240 | 2 | 0.5 | 9.5 | -0.5 |
3 | 4 | 2022-06-18 | Standard Class | Consumer | United States | Fort Lauderdale | Florida | 33311 | South | Furniture | Tables | FUR-TA-10000577 | 5 | 19.2 | 940.8 | 160.8 |
4 | 5 | 2022-07-13 | Standard Class | Consumer | United States | Fort Lauderdale | Florida | 33311 | South | Office Supplies | Storage | OFF-ST-10000760 | 2 | 1.0 | 19.0 | -1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9989 | 9990 | 2023-02-18 | Second Class | Consumer | United States | Miami | Florida | 33180 | South | Furniture | Furnishings | FUR-FU-10001889 | 3 | 1.2 | 28.8 | -1.2 |
9990 | 9991 | 2023-03-17 | Standard Class | Consumer | United States | Costa Mesa | California | 92627 | West | Furniture | Furnishings | FUR-FU-10000747 | 2 | 3.6 | 86.4 | 16.4 |
9991 | 9992 | 2022-08-07 | Standard Class | Consumer | United States | Costa Mesa | California | 92627 | West | Technology | Phones | TEC-PH-10003645 | 2 | 5.2 | 254.8 | 34.8 |
9992 | 9993 | 2022-11-19 | Standard Class | Consumer | United States | Costa Mesa | California | 92627 | West | Office Supplies | Paper | OFF-PA-10004041 | 4 | 0.9 | 29.1 | -0.9 |
9993 | 9994 | 2022-07-17 | Second Class | Consumer | United States | Westminster | California | 92683 | West | Office Supplies | Appliances | OFF-AP-10002684 | 2 | 7.2 | 232.8 | 22.8 |
9994 rows × 16 columns
from sqlalchemy import create_engine
# Database connection details
user = 'pranav_user'
password = 'kshama1234'
host = 'localhost'
database = 'pranav'
# Create a SQLAlchemy engine for MySQL
engine = create_engine(f'mysql+mysqlconnector://{user}:{password}@{host}/{database}')
conn = engine.connect()
# Load the DataFrame into the SQL table 'df_orders'
# ('append' adds rows on every rerun; use if_exists='replace' to overwrite instead)
df.to_sql('df_orders', con=conn, index=False, if_exists='append')
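The load step can be smoke-tested without a running MySQL server by pointing `to_sql` at an in-memory SQLite connection instead (a minimal sketch with toy rows, not the project database):

```python
import sqlite3
import pandas as pd

# In-memory database standing in for the MySQL instance
conn = sqlite3.connect(':memory:')
toy = pd.DataFrame({'order_id': [1, 2], 'sale_price': [254.8, 708.1]})
toy.to_sql('df_orders', conn, index=False, if_exists='append')

# Round-trip: read the rows back and aggregate
back = pd.read_sql('SELECT COUNT(*) AS n, SUM(sale_price) AS total FROM df_orders', conn)
print(back)
conn.close()
```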
-- Select all records from the df_orders table
SELECT * FROM pranav.df_orders;
-- Find top 10 highest revenue-generating products
-- 1. Group the records by product_id
-- 2. Calculate the sum of sale_price for each product_id
-- 3. Order the results by sales in descending order
-- 4. Limit the results to the top 10 entries
SELECT product_id, sum(sale_price) as sales
FROM df_orders
GROUP BY product_id
ORDER BY sales DESC
LIMIT 10;
-- Find top 5 highest selling products in each region
-- 1. Create a common table expression (CTE) to calculate sales per product_id for each region
-- 2. Use ROW_NUMBER() to rank the products within each region by sales in descending order
-- 3. Select the top 5 products (rn <= 5) for each region
WITH cte AS (
SELECT region, product_id, sum(sale_price) as sales
FROM df_orders
GROUP BY region, product_id
)
SELECT * FROM (
SELECT *,
ROW_NUMBER() OVER(PARTITION BY region ORDER BY sales DESC) as rn
FROM cte
) A
WHERE rn <= 5;
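The same window-function logic can be cross-checked in pandas, where `rank(method='first')` within each region plays the role of `ROW_NUMBER()` (a hedged sketch on toy data, not the full dataset):

```python
import pandas as pd

orders = pd.DataFrame({
    'region':     ['West', 'West', 'West', 'South', 'South'],
    'product_id': ['A',    'B',    'A',    'C',     'D'],
    'sale_price': [100.0,  50.0,   30.0,   20.0,    80.0],
})
# Sales per product within each region (the CTE)
sales = orders.groupby(['region', 'product_id'], as_index=False)['sale_price'].sum()
# ROW_NUMBER() OVER (PARTITION BY region ORDER BY sales DESC)
sales['rn'] = (sales.groupby('region')['sale_price']
                    .rank(method='first', ascending=False).astype(int))
top = sales[sales['rn'] <= 2].sort_values(['region', 'rn'])
print(top)
```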
-- Find month-over-month growth comparison for 2022 and 2023 sales
-- 1. Create a CTE to calculate the total sales per month for each year
-- 2. Use CASE statements to segregate sales data for 2022 and 2023
-- 3. Group the results by month and order by month
WITH cte AS (
SELECT YEAR(order_date) as year, MONTH(order_date) as month, sum(sale_price) as sales
FROM df_orders
GROUP BY YEAR(order_date), MONTH(order_date)
)
SELECT month,
sum(CASE WHEN year = 2022 THEN sales ELSE 0 END) as sales_2022,
sum(CASE WHEN year = 2023 THEN sales ELSE 0 END) as sales_2023
FROM cte
GROUP BY month
ORDER BY month;
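The conditional-aggregation pattern above (`CASE WHEN year = ... THEN sales ELSE 0 END`) maps directly to a pandas pivot, where `fill_value=0` plays the role of the `ELSE 0` (a sketch on toy rows):

```python
import pandas as pd

# Stand-in for the CTE: total sales per (year, month)
cte = pd.DataFrame({
    'year':  [2022, 2023, 2022, 2023],
    'month': [1, 1, 2, 2],
    'sales': [100.0, 150.0, 200.0, 180.0],
})
# One row per month, one column per year, zeros where a year has no sales
by_month = cte.pivot_table(index='month', columns='year', values='sales',
                           aggfunc='sum', fill_value=0)
print(by_month)
```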
-- For each category, find the month with the highest sales
-- 1. Create a CTE to calculate sales per category for each year-month
-- 2. Use ROW_NUMBER() to rank the months within each category by sales in descending order
-- 3. Select the month with the highest sales (rn = 1) for each category
WITH cte AS (
SELECT category, DATE_FORMAT(order_date, '%Y%m') AS order_year_month, sum(sale_price) as sales
FROM df_orders
GROUP BY category, DATE_FORMAT(order_date, '%Y%m')
)
SELECT * FROM (
SELECT *,
ROW_NUMBER() OVER(PARTITION BY category ORDER BY sales DESC) as rn
FROM cte
) a
WHERE rn = 1;
-- Identify which sub-category had the highest growth by profit in 2023 compared to 2022
-- 1. Create a CTE to calculate sales per sub_category for each year
-- 2. Create another CTE to calculate the growth by profit between 2023 and 2022 for each sub_category
-- 3. Order the results by growth in descending order and limit to the top 1 entry
WITH cte AS (
SELECT sub_category, YEAR(order_date) as year, sum(sale_price) as sales
FROM df_orders
GROUP BY sub_category, YEAR(order_date)
), cte2 AS (
SELECT sub_category,
sum(CASE WHEN year = 2022 THEN sales ELSE 0 END) as sales_2022,
sum(CASE WHEN year = 2023 THEN sales ELSE 0 END) as sales_2023
FROM cte
GROUP BY sub_category
)
SELECT *,
(sales_2023 - sales_2022) * 100 / sales_2022 as growth_percentage
FROM cte2
ORDER BY growth_percentage DESC
LIMIT 1;
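The growth metric in the final SELECT is a plain percentage change; for example, sales of 100 in 2022 and 150 in 2023 give 50% growth:

```python
def growth_percentage(sales_2022: float, sales_2023: float) -> float:
    # Same formula as the query: (sales_2023 - sales_2022) * 100 / sales_2022
    return (sales_2023 - sales_2022) * 100 / sales_2022

print(growth_percentage(100.0, 150.0))  # 50.0
```

Note that, like the SQL version, this divides by `sales_2022`, so a sub-category with zero 2022 sales would need special handling.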