# SQLite Query Practice

SQLite practice using the Olist datasets from [Kaggle Olist sample](https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce).

This practice uses the pandas SQL reader (`pd.read_sql_query`) to load query results into a pandas DataFrame. The aim is to minimize the use of pandas and do as much of the data processing and analysis as possible in SQL.

SQLiteStudio (3.4.17) is also used to test and troubleshoot queries before they are passed to pandas.

References for ideas and (only as a last resort) solutions:
- [SQL Challenge: E-commerce data analysis](https://www.kaggle.com/code/terencicp/sql-challenge-e-commerce-data-analysis)

In [2]:
# Import required modules
import pandas as pd
import sqlite3
import os

# Connection setup

In [3]:
# Path
base:str = os.getcwd()
sqlite_path:str = os.path.join(base, "Dataset", "olist.sqlite")

# connector set up
conn = sqlite3.connect(sqlite_path)

## Testing

Test a query built with a CTE. A previous attempt at this kind of query resulted in an error, so it is tested at the beginning just in case.

In [None]:
# First layer - base query that the next cells wrap as a CTE
CTE_Query = """
Select *
From products p
Join product_category_name_translation pcnt
    On pcnt.product_category_name = p.product_category_name"""

output = pd.read_sql_query(CTE_Query, conn)
output
# print(CTE_Query)

In [None]:
# Second Layer - Query combined with CTE
Query_2 = f"""WITH New_product_Cat AS ({CTE_Query})
Select
    npc.product_id,
    npc.product_category_name_english,
    -- Volume in cm^3 (length * height * width)
    (npc.product_length_cm 
        * npc.product_height_cm 
        * npc.product_width_cm) As vol_weight
From
    New_product_Cat npc    
    """

output = pd.read_sql_query(Query_2, conn)
output

In [None]:
# Third layer - wraps the second-layer query as a CTE

Query_3 = f"""WITH new_product_cat AS ({Query_2})
-- DISTINCT applies to the whole selected row, not only to order_id
Select Distinct
    o.order_id,
    npc.product_category_name_english,
    npc.vol_weight,
    o.order_delivered_customer_date
From
    new_product_cat npc
Left Join order_items oi 
    USING (product_id)
Left Join orders o 
    USING (order_id)"""

print(Query_3)

output = pd.read_sql_query(Query_3, conn)
output
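
The layered f-strings above nest one `WITH` inside another. SQLite also accepts several comma-separated CTEs in a single `WITH` clause, which keeps the final query flat and readable in one place. Here is a sketch on a toy in-memory database. The table contents and the names `product_volume` and `order_volume` are made up for illustration and are not taken from the Olist data.

```python
import sqlite3

# Toy data standing in for the products and order_items tables.
demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE products (product_id TEXT, length_cm REAL, height_cm REAL, width_cm REAL);
INSERT INTO products VALUES ('p1', 10, 2, 5), ('p2', 3, 3, 3);
CREATE TABLE order_items (order_id TEXT, product_id TEXT);
INSERT INTO order_items VALUES ('o1', 'p1'), ('o2', 'p1'), ('o3', 'p2');
""")

# Two CTEs in one WITH clause; the second reads from the first.
chained_query = """
WITH product_volume AS (
    SELECT product_id, length_cm * height_cm * width_cm AS volume_cm3
    FROM products
),
order_volume AS (
    SELECT oi.order_id, pv.volume_cm3
    FROM order_items oi
    JOIN product_volume pv USING (product_id)
)
SELECT order_id, volume_cm3
FROM order_volume
ORDER BY order_id
"""

chained_rows = demo.execute(chained_query).fetchall()
print(chained_rows)
```

The same query string can be passed to `pd.read_sql_query` together with a connection, as in the cells above.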

In [None]:
# Fourth layer - Third layer CTE query

# Query_4 = 


In [None]:
# Sample query from: https://www.kaggle.com/code/terencicp/sql-challenge-e-commerce-data-analysis

ranked_categories = """
SELECT
    product_category_name_english AS category,
    SUM(price) AS sales,
    RANK() OVER (ORDER BY SUM(price) DESC) AS rank
FROM order_items
    JOIN orders USING (order_id)
    JOIN products USING (product_id)
    JOIN product_category_name_translation USING (product_category_name)
WHERE order_status = 'delivered'
GROUP BY product_category_name_english
"""

category_sales_summary = f"""
WITH RankedCategories AS (
    {ranked_categories}
)
-- Top 18 categories by sales
SELECT
    category,
    sales
FROM RankedCategories
WHERE rank <= 18
-- Other categories, aggregated
UNION ALL
SELECT
    'Other categories' AS category,
    SUM(sales) AS sales
FROM RankedCategories
WHERE rank > 18
"""

df = pd.read_sql_query(category_sales_summary, conn)
df
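
The cut-off of 18 appears twice in the query above. As a sketch of the same "top N plus remainder" pattern, the example below passes the cut-off once as a bound parameter. It runs on toy data: the `items` table, its values, and the cut-off of 2 are assumptions for illustration. Window functions such as `RANK()` need SQLite 3.25 or newer.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE items (category TEXT, price REAL);
INSERT INTO items VALUES
    ('toys', 50), ('toys', 30),
    ('books', 40),
    ('garden', 10),
    ('pets', 5);
""")

top_n = 2  # hypothetical cut-off; the notebook query uses 18

top_n_query = """
WITH ranked AS (
    SELECT category,
           SUM(price) AS sales,
           RANK() OVER (ORDER BY SUM(price) DESC) AS sales_rank
    FROM items
    GROUP BY category
)
-- Categories inside the cut-off, one row each
SELECT category, sales FROM ranked WHERE sales_rank <= :n
UNION ALL
-- Everything else, collapsed into a single row
SELECT 'Other categories', SUM(sales) FROM ranked WHERE sales_rank > :n
"""

# The named parameter :n is bound once and used in both branches.
top_n_rows = demo.execute(top_n_query, {"n": top_n}).fetchall()
print(top_n_rows)
```

`pd.read_sql_query` accepts the same dictionary through its `params` argument, so the cut-off can be changed without editing the SQL text.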

## 2. Simple Query

This section contains simple queries without any joins or cleaning operations.

### 2.1 Total order count
Total order count, without any cleaning.

In [None]:
total_order_query = """
Select
    COUNT(order_id) AS vol
From orders"""

output = pd.read_sql_query(total_order_query, conn)
output

### 2.2 Total order count (completed vs incomplete)

Order count split into completed and incomplete orders. An order is treated as completed when `order_delivered_customer_date` is not null.

In [None]:
total_order_query_completion = """
Select
    -- Single quotes mark string literals; double quotes are for identifiers
    IIF(order_delivered_customer_date IS NULL,
            'N',
            'Y')
            as Order_completion,
    COUNT(order_id) AS vol
From orders
Group By
    Order_completion"""

output = pd.read_sql_query(total_order_query_completion, conn)
output
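
The same split can also be returned as a single row with one column per status. The sketch below runs on toy data and relies on `COUNT(column)` skipping NULL values while `COUNT(*)` counts every row. The three sample orders are made up for illustration.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE orders (order_id TEXT, order_delivered_customer_date TEXT);
INSERT INTO orders VALUES
    ('o1', '2018-01-05'),
    ('o2', NULL),
    ('o3', '2018-02-11');
""")

completion_query = """
SELECT
    -- COUNT(column) ignores NULLs, so this counts delivered orders only
    COUNT(order_delivered_customer_date) AS delivered,
    -- All rows minus delivered rows leaves the undelivered ones
    COUNT(*) - COUNT(order_delivered_customer_date) AS not_delivered
FROM orders
"""

delivered, not_delivered = demo.execute(completion_query).fetchone()
print(delivered, not_delivered)
```

This form avoids `IIF`, which is only available from SQLite 3.32 onward.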

## 3. Intermediate Querying

### 3.3 Order vol by category

Count orders grouped by product category.

In [None]:
category_order = """
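Select
    -- Group By already yields one row per category, so Distinct is not needed
    pcnt.product_category_name_english as product_list,
    -- Counts one row per joined order item, not one per order
    Count(o.order_id) as vol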
From orders o
Left Join order_items USING (order_id)
Left Join products p USING (product_id)
Left Join product_category_name_translation pcnt USING (product_category_name)
Group By
    product_list"""

output = pd.read_sql_query(category_order, conn)
output
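
If `order_items` holds one row per item, `COUNT(o.order_id)` after the join counts item rows rather than orders. The sketch below shows the difference between that and `COUNT(DISTINCT o.order_id)` on toy data. The two orders and the single `toys` category are assumptions for illustration.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE orders (order_id TEXT);
INSERT INTO orders VALUES ('o1'), ('o2');
CREATE TABLE order_items (order_id TEXT, category TEXT);
-- Order o1 contains two items of the same category
INSERT INTO order_items VALUES ('o1', 'toys'), ('o1', 'toys'), ('o2', 'toys');
""")

count_query = """
SELECT
    oi.category,
    COUNT(o.order_id) AS item_rows,                -- one per joined item row
    COUNT(DISTINCT o.order_id) AS distinct_orders  -- one per order
FROM orders o
LEFT JOIN order_items oi USING (order_id)
GROUP BY oi.category
"""

category_counts = demo.execute(count_query).fetchall()
print(category_counts)
```

Which of the two counts is the right "order volume" depends on whether the analysis is about items sold or about orders placed.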

### 3.4 Order by city

In [None]:
geo_cte = """
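Select
    -- Group By already returns one row per prefix; the remaining columns
    -- are taken from an arbitrary row of each group
    gl.geolocation_zip_code_prefix as Distinct_zip_prefix,
    *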
From 
    geolocation gl
Join customers c 
    On gl.geolocation_zip_code_prefix = c.customer_zip_code_prefix
Where
    customer_city is not null
Group By
    Distinct_zip_prefix
"""

output = pd.read_sql_query(geo_cte, conn)
output

In [6]:
city_order = """
Select
    gl.geolocation_state as geo_state,
    Count(o.order_id) as vol
From orders o
Left Join customers c USING (customer_id)
Left Join geolocation gl On c.customer_zip_code_prefix = gl.geolocation_zip_code_prefix
Where
    geo_state is not null
Group By
    geo_state
Having
    vol >= 200000"""

output = pd.read_sql_query(city_order, conn)
output

| | geo_state | vol |
|---|---|---|
| 0 | BA | 365875 |
| 1 | ES | 316654 |
| 2 | MG | 2878728 |
| 3 | PR | 626021 |
| 4 | RJ | 3015690 |
| 5 | RS | 805370 |
| 6 | SC | 538638 |
| 7 | SP | 5620430 |
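
Counts in the millions suggest join fan-out: if `geolocation` holds several rows per zip-code prefix, every order is repeated once per matching row. The sketch below runs on toy data and collapses the lookup table to one row per prefix before joining. It assumes each prefix maps to a single state, and the CTE name `zip_state` is introduced here for illustration.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE orders (order_id TEXT, customer_id TEXT);
INSERT INTO orders VALUES ('o1', 'c1'), ('o2', 'c2');
CREATE TABLE customers (customer_id TEXT, customer_zip_code_prefix TEXT);
INSERT INTO customers VALUES ('c1', '01000'), ('c2', '01000');
CREATE TABLE geolocation (geolocation_zip_code_prefix TEXT, geolocation_state TEXT);
-- Three coordinate rows share one zip-code prefix
INSERT INTO geolocation VALUES ('01000', 'SP'), ('01000', 'SP'), ('01000', 'SP');
""")

# Direct join: each order matches all three geolocation rows.
fan_out_query = """
SELECT gl.geolocation_state AS geo_state, COUNT(o.order_id) AS vol
FROM orders o
JOIN customers c USING (customer_id)
JOIN geolocation gl
    ON c.customer_zip_code_prefix = gl.geolocation_zip_code_prefix
GROUP BY geo_state
"""

# De-duplicated join: one row per prefix, so each order matches once.
deduplicated_query = """
WITH zip_state AS (
    SELECT geolocation_zip_code_prefix AS zip_prefix,
           MIN(geolocation_state) AS geo_state
    FROM geolocation
    GROUP BY geolocation_zip_code_prefix
)
SELECT zs.geo_state, COUNT(o.order_id) AS vol
FROM orders o
JOIN customers c USING (customer_id)
JOIN zip_state zs ON c.customer_zip_code_prefix = zs.zip_prefix
GROUP BY zs.geo_state
"""

inflated = demo.execute(fan_out_query).fetchall()
corrected = demo.execute(deduplicated_query).fetchall()
print(inflated, corrected)
```

With two orders in the toy data, the direct join reports six and the de-duplicated join reports two.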


### 3.5 Rank of product category by segment

In [19]:
product_cat_processed = """
Select
    pcnt.product_category_name_english as product_cat,
    lc.business_segment,
    Count(o.order_id) as order_vol,
    Sum(oi.price) as revenue
From
    orders o
Join order_items oi USING (order_id)
Join products p On oi.product_id = p.product_id
Join product_category_name_translation pcnt USING (product_category_name)
Join sellers s USING (seller_id)
Join leads_closed lc USING (seller_id)
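-- Grouped by category only: business_segment is a bare column, so SQLite
-- takes its value from an arbitrary row of each category group
Group By
    product_cat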
"""

# reference:
# https://five.co/blog/rank-over-partition-by-in-sql/
# https://stackoverflow.com/questions/2051162/sql-multiple-column-ordering

prod_cat_segment_rank = f"""
With product_cat_done AS ({product_cat_processed})
Select
    pcd.product_cat,
    pcd.business_segment,
    -- Default sort is ascending: rank 1 is the lowest value in each segment
    Rank() Over(PARTITION BY pcd.business_segment 
                ORDER BY pcd.order_vol) as segment_vol_rank,
    Rank() Over(PARTITION BY pcd.business_segment
                ORDER BY pcd.revenue) as segment_rev_rank
From
    product_cat_done pcd
ORDER BY
    business_segment Asc,
    segment_vol_rank Asc
"""

# # Test read product_cat_processed; disable when not testing
# output = pd.read_sql_query(product_cat_processed, conn)
# output.describe()
# output.to_csv("test.csv")

# # Read rank table
output = pd.read_sql_query(prod_cat_segment_rank, conn)
output

| | product_cat | business_segment | segment_vol_rank | segment_rev_rank |
|---|---|---|---|---|
| 0 | computers | audio_video_electronics | 1 | 1 |
| 1 | musical_instruments | audio_video_electronics | 2 | 3 |
| 2 | construction_tools_safety | audio_video_electronics | 3 | 4 |
| 3 | cine_photo | audio_video_electronics | 4 | 2 |
| 4 | telephony | audio_video_electronics | 5 | 6 |
| 5 | construction_tools_construction | audio_video_electronics | 6 | 5 |
| 6 | stationery | baby | 1 | 1 |
| 7 | fashion_bags_accessories | bags_backpacks | 1 | 1 |
| 8 | bed_bath_table | bed_bath_table | 1 | 1 |
| 9 | dvds_blu_ray | books | 1 | 1 |
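
As a variation on the ranking above, the sketch below groups by both segment and category and ranks in descending order, so rank 1 is the largest value within each segment. It runs on toy data: the `segment_sales` table and its rows are assumptions for illustration and do not reproduce the Olist joins. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE segment_sales (business_segment TEXT, product_cat TEXT, price REAL);
INSERT INTO segment_sales VALUES
    ('baby', 'toys', 20), ('baby', 'toys', 25), ('baby', 'stationery', 60),
    ('books', 'dvds_blu_ray', 15);
""")

segment_rank_query = """
WITH per_segment AS (
    -- One row per (segment, category) pair
    SELECT business_segment,
           product_cat,
           COUNT(*) AS order_vol,
           SUM(price) AS revenue
    FROM segment_sales
    GROUP BY business_segment, product_cat
)
SELECT business_segment,
       product_cat,
       RANK() OVER (PARTITION BY business_segment
                    ORDER BY order_vol DESC) AS segment_vol_rank,
       RANK() OVER (PARTITION BY business_segment
                    ORDER BY revenue DESC) AS segment_rev_rank
FROM per_segment
ORDER BY business_segment, segment_vol_rank
"""

segment_ranks = demo.execute(segment_rank_query).fetchall()
for row in segment_ranks:
    print(row)
```

In the toy data, `toys` leads the `baby` segment by volume while `stationery` leads it by revenue, which shows the two ranks diverging within one partition.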
