# Red30 Tech Datasets

This notebook focuses on the course dataset: Red30 Tech US online retail sales.

Let's start by importing the pandas and duckdb libraries:

In [1]:
import pandas as pd
import duckdb as db

Next, we will load the dataset from a CSV file in the `data` folder:

In [2]:
retail_sales = pd.read_csv("../data/Red30 Tech US online retail sales.csv")

Let's take a look at the first 10 rows of the table:

In [3]:
retail_sales.head(10)

Unnamed: 0,OrderNum,OrderDate,OrderType,CustomerType,CustName,CustState,ProdCategory,ProdNumber,ProdName,Quantity,Price,Discount,OrderTotal
0,1100934,9/1/2017,Wholesale,Business,Gusikowski Group,North Carolina,Blueprints,BP102,Bsquare Robot Blueprint,10,8.99,1.8,88.1
1,1100935,9/1/2017,Retail,Individual,Spencer Educators,Delaware,Drone Kits,DK204,BYOD-300,2,89.0,0.0,178.0
2,1100936,9/1/2017,Wholesale,Business,Schinner Inc.,Florida,Training Videos,TV801,Aerial Security,10,36.99,7.4,362.5
3,1100937,9/1/2017,Retail,Individual,Saxon Laviss,Virginia,Robot Kits,RK602,BYOR-1000,1,189.0,0.0,189.0
4,1100938,9/1/2017,Retail,Business,Wilderman Technologies,Texas,eBooks,EB502,Building Your First Robot,4,24.95,0.0,99.8
5,1100939,9/2/2017,Wholesale,Business,Turcotte Corp,New York,Drones,DS302,DC-304 Drone,79,395.0,79.0,31126.0
6,1100940,9/2/2017,Retail,Individual,Kovacek Bernhard,Texas,Training Videos,TV803,Cloud Computing,1,29.99,0.0,29.99
7,1100941,9/2/2017,Retail,Individual,Antonina Noton,Missouri,Drones,DS306,DX-145 Drone,2,250.0,0.0,500.0
8,1100942,9/2/2017,Retail,Individual,Frederik Pantecost,Louisiana,eBooks,EB507,Drone Building Essentials,1,13.99,0.0,13.99
9,1100943,9/3/2017,Wholesale,Business,Hettinger and Sons,West Virginia,Blueprints,BP104,Cat Robot Blueprint,6,4.99,1.0,28.94
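The `OrderDate` values above are written as month/day/year text, and `read_csv` does not parse dates unless asked to. If date arithmetic is needed later, one option is to convert the column with `pd.to_datetime`. The sketch below is not part of the original notebook; it uses a small hypothetical DataFrame named `sample`, built from three rows of the output above, instead of the full CSV:

```python
import pandas as pd

# Hypothetical stand-in for retail_sales: three rows copied from the output above
sample = pd.DataFrame(
    {
        "OrderNum": [1100934, 1100939, 1100943],
        "OrderDate": ["9/1/2017", "9/2/2017", "9/3/2017"],
    }
)

# Convert the month/day/year strings into pandas datetime values
sample["OrderDate"] = pd.to_datetime(sample["OrderDate"], format="%m/%d/%Y")

print(sample.dtypes)
```

Passing an explicit `format` avoids any ambiguity between month-first and day-first dates.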


## Querying the Data with SQL

We will use DuckDB's `sql` function to query the `retail_sales` DataFrame directly with SQL. For example, the following statement lists the table's columns and their types:

In [4]:
db.sql("DESCRIBE TABLE retail_sales")

┌──────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name  │ column_type │  null   │   key   │ default │  extra  │
│   varchar    │   varchar   │ varchar │ varchar │ varchar │ varchar │
├──────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ OrderNum     │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ OrderDate    │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ OrderType    │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ CustomerType │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ CustName     │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ CustState    │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ ProdCategory │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ ProdNumber   │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ ProdName     │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ Quantity     │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ Pric

As another example, the following query calculates the top 10 states by total sales:

In [5]:
query = """
SELECT 
  CustState, SUM(OrderTotal) AS Total
FROM 
  retail_sales
GROUP BY 
  CustState
ORDER BY
  Total DESC
LIMIT 10;
"""

In [6]:
db.sql(query)

┌────────────────┬────────────────────┐
│   CustState    │       Total        │
│    varchar     │       double       │
├────────────────┼────────────────────┤
│ New York       │  616925.8400000001 │
│ California     │  540285.5199999997 │
│ Florida        │  394483.7399999998 │
│ Texas          │ 349925.48000000004 │
│ North Carolina │  345118.2600000001 │
│ Pennsylvania   │ 290120.09000000014 │
│ Minnesota      │ 267308.95000000007 │
│ Washington     │ 257140.79999999993 │
│ Virginia       │ 248797.89000000004 │
│ Georgia        │ 237607.26000000004 │
├────────────────┴────────────────────┤
│ 10 rows                   2 columns │
└─────────────────────────────────────┘