This project builds a Python data pipeline that downloads a retail orders dataset from the Kaggle API, cleans it with pandas, and loads the prepared data into a SQL database. SQL queries then answer five specific questions about the data. The detailed steps and corresponding code follow.
# Install the Kaggle package to interact with the Kaggle API
!pip install kaggle
import kaggle
# Download the 'orders.csv' file from the specified Kaggle dataset
!kaggle datasets download ankitbansal06/retail-orders -f orders.csv
Dataset URL: https://www.kaggle.com/datasets/ankitbansal06/retail-orders
License(s): CC0-1.0
orders.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
import zipfile
# Extract the downloaded zip archive containing 'orders.csv'
with zipfile.ZipFile('orders.csv.zip') as zip_ref:
    zip_ref.extractall()
import pandas as pd
# Read the CSV file into a pandas DataFrame
df = pd.read_csv('orders.csv')
# Display the first 20 rows of the DataFrame
df.head(20)
 | Order Id | Order Date | Ship Mode | Segment | Country | City | State | Postal Code | Region | Category | Sub Category | Product Id | cost price | List Price | Quantity | Discount Percent |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2023-03-01 | Second Class | Consumer | United States | Henderson | Kentucky | 42420 | South | Furniture | Bookcases | FUR-BO-10001798 | 240 | 260 | 2 | 2 |
1 | 2 | 2023-08-15 | Second Class | Consumer | United States | Henderson | Kentucky | 42420 | South | Furniture | Chairs | FUR-CH-10000454 | 600 | 730 | 3 | 3 |
2 | 3 | 2023-01-10 | Second Class | Corporate | United States | Los Angeles | California | 90036 | West | Office Supplies | Labels | OFF-LA-10000240 | 10 | 10 | 2 | 5 |
3 | 4 | 2022-06-18 | Standard Class | Consumer | United States | Fort Lauderdale | Florida | 33311 | South | Furniture | Tables | FUR-TA-10000577 | 780 | 960 | 5 | 2 |
4 | 5 | 2022-07-13 | Standard Class | Consumer | United States | Fort Lauderdale | Florida | 33311 | South | Office Supplies | Storage | OFF-ST-10000760 | 20 | 20 | 2 | 5 |
5 | 6 | 2022-03-13 | Not Available | Consumer | United States | Los Angeles | California | 90032 | West | Furniture | Furnishings | FUR-FU-10001487 | 50 | 50 | 7 | 3 |
6 | 7 | 2022-12-28 | Standard Class | Consumer | United States | Los Angeles | California | 90032 | West | Office Supplies | Art | OFF-AR-10002833 | 10 | 10 | 4 | 3 |
7 | 8 | 2022-01-25 | Standard Class | Consumer | United States | Los Angeles | California | 90032 | West | Technology | Phones | TEC-PH-10002275 | 860 | 910 | 6 | 5 |
8 | 9 | 2023-03-23 | Not Available | Consumer | United States | Los Angeles | California | 90032 | West | Office Supplies | Binders | OFF-BI-10003910 | 20 | 20 | 3 | 2 |
9 | 10 | 2023-05-16 | Standard Class | Consumer | United States | Los Angeles | California | 90032 | West | Office Supplies | Appliances | OFF-AP-10002892 | 90 | 110 | 5 | 3 |
10 | 11 | 2023-03-31 | Not Available | Consumer | United States | Los Angeles | California | 90032 | West | Furniture | Tables | FUR-TA-10001539 | 1470 | 1710 | 9 | 3 |
11 | 12 | 2023-12-25 | Not Available | Consumer | United States | Los Angeles | California | 90032 | West | Technology | Phones | TEC-PH-10002033 | 750 | 910 | 4 | 3 |
12 | 13 | 2022-02-11 | Standard Class | Consumer | United States | Concord | North Carolina | 28027 | South | Office Supplies | Paper | OFF-PA-10002365 | 20 | 20 | 3 | 3 |
13 | 14 | 2023-07-18 | Standard Class | Consumer | United States | Seattle | Washington | 98103 | West | Office Supplies | Binders | OFF-BI-10003656 | 360 | 410 | 3 | 2 |
14 | 15 | 2023-11-09 | unknown | Home Office | United States | Fort Worth | Texas | 76106 | Central | Office Supplies | Appliances | OFF-AP-10002311 | 60 | 70 | 5 | 5 |
15 | 16 | 2022-06-18 | Standard Class | Home Office | United States | Fort Worth | Texas | 76106 | Central | Office Supplies | Binders | OFF-BI-10000756 | 0 | 0 | 3 | 5 |
16 | 17 | 2022-02-04 | Standard Class | Consumer | United States | Madison | Wisconsin | 53711 | Central | Office Supplies | Storage | OFF-ST-10004186 | 610 | 670 | 6 | 3 |
17 | 18 | 2023-08-04 | Second Class | Consumer | United States | West Jordan | Utah | 84084 | West | Office Supplies | Storage | OFF-ST-10000107 | 60 | 60 | 2 | 4 |
18 | 19 | 2022-01-23 | Second Class | Consumer | United States | San Francisco | California | 94109 | West | Office Supplies | Art | OFF-AR-10003056 | 10 | 10 | 2 | 4 |
19 | 20 | 2022-01-11 | Second Class | Consumer | United States | San Francisco | California | 94109 | West | Technology | Phones | TEC-PH-10001949 | 170 | 210 | 3 | 3 |
# Display information about the DataFrame (column types, non-null counts, etc.)
df.info()
# Count missing values in each column
df.isnull().sum()
# List the unique values in the 'ship_mode' column
# (the parentheses matter: without them, this returns the bound method rather than the values)
df['ship_mode'].unique()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 order_id 9994 non-null int64
1 order_date 9994 non-null datetime64[ns]
2 ship_mode 9988 non-null object
3 segment 9994 non-null object
4 country 9994 non-null object
5 city 9994 non-null object
6 state 9994 non-null object
7 postal_code 9994 non-null int64
8 region 9994 non-null object
9 category 9994 non-null object
10 sub_category 9994 non-null object
11 product_id 9994 non-null object
12 quantity 9994 non-null int64
13 discount 9994 non-null float64
14 sale_price 9994 non-null float64
15 profit 9994 non-null float64
dtypes: datetime64[ns](1), float64(3), int64(3), object(9)
memory usage: 1.2+ MB
# Re-read the CSV file, treating 'Not Available' and 'unknown' as NaN
df = pd.read_csv('orders.csv', na_values=['Not Available', 'unknown'])
df.head(20)
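The effect of `na_values` can be seen on a small in-memory sample containing the same two placeholder strings found in the real dataset (a minimal sketch, not the actual file):

```python
from io import StringIO
import pandas as pd

# Toy CSV with the placeholder strings 'Not Available' and 'unknown'
sample = StringIO("Ship Mode\nSecond Class\nNot Available\nunknown\n")
clean = pd.read_csv(sample, na_values=['Not Available', 'unknown'])
# Both placeholder strings are now NaN
print(clean['Ship Mode'].isna().sum())
```

Without `na_values`, these placeholders would survive as ordinary strings and skew any group-by on ship mode.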
# Convert column names to lowercase
df.columns = df.columns.str.lower()
df.columns
# Replace spaces in column names with underscores
df.columns = df.columns.str.replace(' ', '_')
df.columns
df.head(5)
 | order_id | order_date | ship_mode | segment | country | city | state | postal_code | region | category | sub_category | product_id | cost_price | list_price | quantity | discount_percent |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2023-03-01 | Second Class | Consumer | United States | Henderson | Kentucky | 42420 | South | Furniture | Bookcases | FUR-BO-10001798 | 240 | 260 | 2 | 2 |
1 | 2 | 2023-08-15 | Second Class | Consumer | United States | Henderson | Kentucky | 42420 | South | Furniture | Chairs | FUR-CH-10000454 | 600 | 730 | 3 | 3 |
2 | 3 | 2023-01-10 | Second Class | Corporate | United States | Los Angeles | California | 90036 | West | Office Supplies | Labels | OFF-LA-10000240 | 10 | 10 | 2 | 5 |
3 | 4 | 2022-06-18 | Standard Class | Consumer | United States | Fort Lauderdale | Florida | 33311 | South | Furniture | Tables | FUR-TA-10000577 | 780 | 960 | 5 | 2 |
4 | 5 | 2022-07-13 | Standard Class | Consumer | United States | Fort Lauderdale | Florida | 33311 | South | Office Supplies | Storage | OFF-ST-10000760 | 20 | 20 | 2 | 5 |
# Calculate discount amount
df['discount'] = df['list_price'] * df['discount_percent'] * 0.01
# Calculate sale price after discount
df['sale_price'] = df['list_price'] - df['discount']
df
# Calculate profit
df['profit'] = df['sale_price'] - df['cost_price']
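Checking these three formulas against the first row of the table above (list price 260, 2% discount, cost price 240) gives discount 5.2, sale price 254.8, and profit 14.8:

```python
# Values taken from row 0 of the dataset shown earlier
list_price, discount_percent, cost_price = 260.0, 2, 240.0

discount = round(list_price * discount_percent * 0.01, 2)  # discount amount
sale_price = round(list_price - discount, 2)               # price after discount
profit = round(sale_price - cost_price, 2)                 # margin over cost
print(discount, sale_price, profit)
```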
# Check data types
df.dtypes
# Convert 'order_date' column to datetime format
df['order_date'] = pd.to_datetime(df['order_date'], format="%Y-%m-%d")
df.dtypes
# Drop unnecessary columns
df.drop(columns=['list_price', 'cost_price', 'discount_percent'], inplace=True)
df
 | order_id | order_date | ship_mode | segment | country | city | state | postal_code | region | category | sub_category | product_id | quantity | discount | sale_price | profit |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2023-03-01 | Second Class | Consumer | United States | Henderson | Kentucky | 42420 | South | Furniture | Bookcases | FUR-BO-10001798 | 2 | 5.2 | 254.8 | 14.8 |
1 | 2 | 2023-08-15 | Second Class | Consumer | United States | Henderson | Kentucky | 42420 | South | Furniture | Chairs | FUR-CH-10000454 | 3 | 21.9 | 708.1 | 108.1 |
2 | 3 | 2023-01-10 | Second Class | Corporate | United States | Los Angeles | California | 90036 | West | Office Supplies | Labels | OFF-LA-10000240 | 2 | 0.5 | 9.5 | -0.5 |
3 | 4 | 2022-06-18 | Standard Class | Consumer | United States | Fort Lauderdale | Florida | 33311 | South | Furniture | Tables | FUR-TA-10000577 | 5 | 19.2 | 940.8 | 160.8 |
4 | 5 | 2022-07-13 | Standard Class | Consumer | United States | Fort Lauderdale | Florida | 33311 | South | Office Supplies | Storage | OFF-ST-10000760 | 2 | 1.0 | 19.0 | -1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9989 | 9990 | 2023-02-18 | Second Class | Consumer | United States | Miami | Florida | 33180 | South | Furniture | Furnishings | FUR-FU-10001889 | 3 | 1.2 | 28.8 | -1.2 |
9990 | 9991 | 2023-03-17 | Standard Class | Consumer | United States | Costa Mesa | California | 92627 | West | Furniture | Furnishings | FUR-FU-10000747 | 2 | 3.6 | 86.4 | 16.4 |
9991 | 9992 | 2022-08-07 | Standard Class | Consumer | United States | Costa Mesa | California | 92627 | West | Technology | Phones | TEC-PH-10003645 | 2 | 5.2 | 254.8 | 34.8 |
9992 | 9993 | 2022-11-19 | Standard Class | Consumer | United States | Costa Mesa | California | 92627 | West | Office Supplies | Paper | OFF-PA-10004041 | 4 | 0.9 | 29.1 | -0.9 |
9993 | 9994 | 2022-07-17 | Second Class | Consumer | United States | Westminster | California | 92683 | West | Office Supplies | Appliances | OFF-AP-10002684 | 2 | 7.2 | 232.8 | 22.8 |
9994 rows × 16 columns
from sqlalchemy import create_engine
# Database connection details
user = 'pranav_user'
password = 'kshama1234'
host = 'localhost'
database = 'pranav'
# Create a SQLAlchemy engine for MySQL
engine = create_engine(f'mysql+mysqlconnector://{user}:{password}@{host}/{database}')
conn = engine.connect()
# Load the DataFrame into the SQL table 'df_orders'
# ('append' adds rows on every rerun; use if_exists='replace' to overwrite instead)
df.to_sql('df_orders', con=conn, index=False, if_exists='append')
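The load step can be smoke-tested without a running MySQL server by pointing `to_sql` at an in-memory SQLite connection instead (a minimal sketch with toy rows, not the project database):

```python
import sqlite3
import pandas as pd

# In-memory database standing in for the MySQL instance
conn = sqlite3.connect(':memory:')
toy = pd.DataFrame({'order_id': [1, 2], 'sale_price': [254.8, 708.1]})
toy.to_sql('df_orders', conn, index=False, if_exists='append')

# Round-trip: read the rows back and aggregate
back = pd.read_sql('SELECT COUNT(*) AS n, SUM(sale_price) AS total FROM df_orders', conn)
print(back)
conn.close()
```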
-- Select all records from the df_orders table
SELECT * FROM pranav.df_orders;
-- Find top 10 highest revenue-generating products
-- 1. Group the records by product_id
-- 2. Calculate the sum of sale_price for each product_id
-- 3. Order the results by sales in descending order
-- 4. Limit the results to the top 10 entries
SELECT product_id, sum(sale_price) as sales
FROM df_orders
GROUP BY product_id
ORDER BY sales DESC
LIMIT 10;
-- Find top 5 highest selling products in each region
-- 1. Create a common table expression (CTE) to calculate sales per product_id for each region
-- 2. Use ROW_NUMBER() to rank the products within each region by sales in descending order
-- 3. Select the top 5 products (rn <= 5) for each region
WITH cte AS (
SELECT region, product_id, sum(sale_price) as sales
FROM df_orders
GROUP BY region, product_id
)
SELECT * FROM (
SELECT *,
ROW_NUMBER() OVER(PARTITION BY region ORDER BY sales DESC) as rn
FROM cte
) A
WHERE rn <= 5;
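The same window-function logic can be cross-checked in pandas, where `rank(method='first')` within each region plays the role of `ROW_NUMBER()` (a hedged sketch on toy data, not the full dataset):

```python
import pandas as pd

orders = pd.DataFrame({
    'region':     ['West', 'West', 'West', 'South', 'South'],
    'product_id': ['A',    'B',    'A',    'C',     'D'],
    'sale_price': [100.0,  50.0,   30.0,   20.0,    80.0],
})
# Sales per product within each region (the CTE)
sales = orders.groupby(['region', 'product_id'], as_index=False)['sale_price'].sum()
# ROW_NUMBER() OVER (PARTITION BY region ORDER BY sales DESC)
sales['rn'] = (sales.groupby('region')['sale_price']
                    .rank(method='first', ascending=False).astype(int))
top = sales[sales['rn'] <= 2].sort_values(['region', 'rn'])
print(top)
```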
-- Find month-over-month growth comparison for 2022 and 2023 sales
-- 1. Create a CTE to calculate the total sales per month for each year
-- 2. Use CASE statements to segregate sales data for 2022 and 2023
-- 3. Group the results by month and order by month
WITH cte AS (
SELECT YEAR(order_date) as year, MONTH(order_date) as month, sum(sale_price) as sales
FROM df_orders
GROUP BY YEAR(order_date), MONTH(order_date)
)
SELECT month,
sum(CASE WHEN year = 2022 THEN sales ELSE 0 END) as sales_2022,
sum(CASE WHEN year = 2023 THEN sales ELSE 0 END) as sales_2023
FROM cte
GROUP BY month
ORDER BY month;
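The conditional-aggregation pattern above (`CASE WHEN year = ... THEN sales ELSE 0 END`) maps directly to a pandas pivot, where `fill_value=0` plays the role of the `ELSE 0` (a sketch on toy rows):

```python
import pandas as pd

# Stand-in for the CTE: total sales per (year, month)
cte = pd.DataFrame({
    'year':  [2022, 2023, 2022, 2023],
    'month': [1, 1, 2, 2],
    'sales': [100.0, 150.0, 200.0, 180.0],
})
# One row per month, one column per year, zeros where a year has no sales
by_month = cte.pivot_table(index='month', columns='year', values='sales',
                           aggfunc='sum', fill_value=0)
print(by_month)
```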
-- For each category, find the month with the highest sales
-- 1. Create a CTE to calculate sales per category for each year-month
-- 2. Use ROW_NUMBER() to rank the months within each category by sales in descending order
-- 3. Select the month with the highest sales (rn = 1) for each category
WITH cte AS (
SELECT category, DATE_FORMAT(order_date, '%Y%m') AS order_year_month, sum(sale_price) as sales
FROM df_orders
GROUP BY category, DATE_FORMAT(order_date, '%Y%m')
)
SELECT * FROM (
SELECT *,
ROW_NUMBER() OVER(PARTITION BY category ORDER BY sales DESC) as rn
FROM cte
) a
WHERE rn = 1;
-- Identify which sub-category had the highest growth by profit in 2023 compared to 2022
-- 1. Create a CTE to calculate sales per sub_category for each year
-- 2. Create another CTE to calculate the growth by profit between 2023 and 2022 for each sub_category
-- 3. Order the results by growth in descending order and limit to the top 1 entry
WITH cte AS (
SELECT sub_category, YEAR(order_date) as year, sum(sale_price) as sales
FROM df_orders
GROUP BY sub_category, YEAR(order_date)
), cte2 AS (
SELECT sub_category,
sum(CASE WHEN year = 2022 THEN sales ELSE 0 END) as sales_2022,
sum(CASE WHEN year = 2023 THEN sales ELSE 0 END) as sales_2023
FROM cte
GROUP BY sub_category
)
SELECT *,
(sales_2023 - sales_2022) * 100 / sales_2022 as growth_percentage
FROM cte2
ORDER BY growth_percentage DESC
LIMIT 1;
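The growth metric in the final SELECT is a plain percentage change; for example, sales of 100 in 2022 and 150 in 2023 give 50% growth:

```python
def growth_percentage(sales_2022: float, sales_2023: float) -> float:
    # Same formula as the query: (sales_2023 - sales_2022) * 100 / sales_2022
    return (sales_2023 - sales_2022) * 100 / sales_2022

print(growth_percentage(100.0, 150.0))  # 50.0
```

Note that, like the SQL version, this divides by `sales_2022`, so a sub-category with zero 2022 sales would need special handling.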