<a href="https://colab.research.google.com/github/PrintTrd/elgo_data_pipeline/blob/main/..%5Cscripts%5CPreprocess_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sales Dataset Processing

The Business Sales Dataset, obtained from an acquaintance, consists of 14 tables from a database that is still in flux. The database was not well designed, contains a lot of redundancy, and combines data from several sales channels/platforms. After a first look at the data, I talked to the data owner to understand it better.

## Data integration
Because of the heavy redundancy, the data needs to be normalized. The first step was to help redesign the ER diagram, in order to plan the target structure that the data should be transformed into.

After the redesign, parts that looked unrelated to sales were dropped for this project, for example the inventory_input table, which deals with stock management. Some of the data is also not normalized down to the smallest unit. The resulting target ER diagram is shown below.

![](https://drive.google.com/uc?export=view&id=15IVQGpYuX6pkeUbAQVbwUNkN5HxD7I5E)
In the diagram, columns for which no data has been collected yet are skipped for now.

## Data Integration Plan Steps

1. Extract and rename tables

2. Transform - Schema/Value Integration <br>
- Rename columns
- Change format and add day
- Merge data
- Split data
- Generate fake values
- Summarize the data
- Check Value

3.   Load - Insert data into MySQL Database
     - Read Colab Secrets (stored in class Config)
     - Connect to MySQL with SQLAlchemy
     - Create the schema (database)
     - Insert data into tables

4.   Query the tables and save the output
     - Query data with Pandas
     - Write the output as Parquet files

5.   Read the Parquet files back to verify them



# Step 1) Download and extract

## Rename tables
- Geocoding -> address
- item_inventory -> inventory
- product_code -> product
- color_code -> color
- material_code -> material
- size_code -> size

Upload the CSV data files to the notebook, then move and rename them.

In [1]:
!mv 'elgo_source_database - sale_order.csv' sale_order.csv
!mv 'elgo_source_database - cn_order.csv' canceled_order.csv
!mv 'elgo_source_database - inventory_output.csv' order_item.csv
!mv 'elgo_source_database - cn_item.csv' canceled_item.csv

!mv 'elgo_source_database - awb_info.csv' waybill.csv
!mv 'elgo_source_database - Geocoding.csv' address.csv
!mv 'elgo_source_database - customer_code.csv' customer_code.csv

!mv 'elgo_source_database - item_inventory.csv' inventory.csv
!mv 'elgo_source_database - product_code.csv' product_code.csv
!mv 'elgo_source_database - color_code.csv' color_code.csv
!mv 'elgo_source_database - material_code.csv' material_code.csv
!mv 'elgo_source_database - size_code.csv' size_code.csv
!mv 'elgo_source_database - Price.csv' price.csv
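
The `mv` commands above can also be expressed as a single mapping, which keeps every source-to-target name in one place. The following is a minimal sketch, assuming the exported files sit in one directory and share the `elgo_source_database - ` prefix seen above; the names `TABLE_RENAMES` and `rename_exports` are hypothetical.

```python
from pathlib import Path

# Exported table name -> target file stem, copied from the mv commands above
TABLE_RENAMES = {
    "sale_order": "sale_order",
    "cn_order": "canceled_order",
    "inventory_output": "order_item",
    "cn_item": "canceled_item",
    "awb_info": "waybill",
    "Geocoding": "address",
    "customer_code": "customer_code",
    "item_inventory": "inventory",
    "product_code": "product_code",
    "color_code": "color_code",
    "material_code": "material_code",
    "size_code": "size_code",
    "Price": "price",
}

PREFIX = "elgo_source_database - "


def rename_exports(directory: str) -> list[str]:
    """Rename every exported CSV found in `directory`; return the new file names."""
    renamed = []
    for source, target in TABLE_RENAMES.items():
        src = Path(directory) / f"{PREFIX}{source}.csv"
        if src.exists():  # skip files that were not uploaded
            dst = src.with_name(f"{target}.csv")
            src.rename(dst)
            renamed.append(dst.name)
    return renamed
```

Calling `rename_exports(".")` in the notebook's working directory would then replace the individual shell commands.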

## Read CSV

Create the DataFrames.

In [6]:
import polars as pl

# One Polars DataFrame per renamed CSV file, keyed by table name
dataframes = {}
csv_list = ["sale_order", "canceled_order", "order_item", "canceled_item", "waybill", "address", "customer_code", "inventory", "product_code", "color_code", "material_code", "size_code", "price"]
for file_name in csv_list:
  # Read the placeholder string "COMPUTED_VALUE" as null
  dataframes[file_name] = pl.read_csv(f'{file_name}.csv', has_header=True, infer_schema_length=10000, null_values=["COMPUTED_VALUE"])


In [19]:
dataframes

{'sale_order': shape: (1_468, 18)
 ┌──────────┬────────────┬────────────┬────────────┬───┬───────────┬───────────┬───────────┬────────┐
 │ order_id ┆ updated_at ┆ created_at ┆ invoice_nu ┆ … ┆ customer  ┆ customer  ┆ shiping_s ┆ status │
 │ ---      ┆ ---        ┆ ---        ┆ mber       ┆   ┆ tel       ┆ email     ┆ tatus     ┆ ---    │
 │ str      ┆ str        ┆ str        ┆ ---        ┆   ┆ ---       ┆ ---       ┆ ---       ┆ str    │
 │          ┆            ┆            ┆ str        ┆   ┆ str       ┆ str       ┆ str       ┆        │
 ╞══════════╪════════════╪════════════╪════════════╪═══╪═══════════╪═══════════╪═══════════╪════════╡
 │ 65272231 ┆ 6/17/2023  ┆ 6/17/2023  ┆ IV2306A001 ┆ … ┆ null      ┆ null      ┆ done      ┆ null   │
 │          ┆ 15:36:55   ┆ 15:36:55   ┆ 6          ┆   ┆           ┆           ┆           ┆        │
 │ 574db9e2 ┆ 6/17/2023  ┆ 6/17/2023  ┆ IV2306A001 ┆ … ┆ null      ┆ null      ┆ done      ┆ null   │
 │          ┆ 16:31:53   ┆ 16:31:53   ┆ 7       

## Drop and rename columns

### sale_order
  - id -> order_id
  - record date ->  updated_at
  - sell out date -> created_at
  - invoice no2 -> invoice_number
  - order number -> order_number
  - sales channel -> sales_channel
  - require vat -> require_vat
  - shipping charges -> shipping_charge
  - discount bath -> discount_baht
  - customer code -> customer_category_id
  - customer name -> name
  - customer tax id -> tax_id
  - customer address -> billing_address
  - shipping address -> shipping_address
  - customer tel -> phone
  - customer email -> email
  - Status -> status

In [None]:
dataframes["sale_order"].columns

In [13]:
# Column names are passed positionally, which works across Polars versions
dataframes["sale_order"] = dataframes["sale_order"].drop(['invoice no', 'invoice folder', 'invoice page', 'note', 'pdf', "shiping_status"])

In [14]:
dataframes["sale_order"] = dataframes["sale_order"].rename({
    "id": "order_id",
    "record date": "updated_at",
    "sell out date": "created_at",
    "invoice no2": "invoice_number",
    "order number": "order_number",
    "sales channel": "sales_channel",
    "require vat": "require_vat",
    "shipping charges": "shipping_charge",
    "discount bath": "discount_baht",
    "customer code": "customer_category_id",
    "customer name": "name",
    "customer tax id": "tax_id",
    "customer address": "billing_address",
    "shipping address": "shipping_address",
    "customer tel": "phone",
    "customer email": "email",
    "Status":"status"
})
dataframes["sale_order"].columns

['order_id',
 'updated_at',
 'created_at',
 'invoice_number',
 'order_number',
 'sales_channel',
 'require_vat',
 'shipping_charge',
 'discount_baht',
 'customer code',
 'customer name',
 'customer tax id',
 'customer address',
 'shipping address',
 'customer tel',
 'customer email',
 'shiping_status',
 'status']
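
Most of the renames in this section are mechanical (lower-case, spaces to underscores), with a few exceptions such as `id -> order_id`. The sketch below shows one way to derive the mechanical part and supply the exceptions explicitly. The helper names `to_snake_case` and `build_rename_map` are hypothetical, and the hand-written dictionaries in the cells remain the source of truth.

```python
import re


def to_snake_case(name: str) -> str:
    """Lower-case a column name and replace runs of non-alphanumerics with '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def build_rename_map(columns, overrides=None):
    """Return {old: new} for every column whose name changes.

    `overrides` holds the renames that are not purely mechanical,
    e.g. "id" -> "order_id".
    """
    overrides = overrides or {}
    mapping = {}
    for col in columns:
        new = overrides.get(col, to_snake_case(col))
        if new != col:  # leave already-clean names out of the mapping
            mapping[col] = new
    return mapping


rename_map = build_rename_map(
    ["id", "sales channel", "require vat", "Status", "quantity (pack)"],
    overrides={"id": "order_id"},
)
print(rename_map)
```

The resulting dictionary has the same shape as the ones passed to `rename` in the cells of this section.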

### canceled_order
  - id -> cancel_id: kept for now to join with cn_item
  - creditnote no -> updated_at
  - issue date -> created_at
  - invoice id -> invoice_number (the trailing -CN on invoice id must be stripped and recorded in status instead)
  - note -> cancel_reason
  - ...the rest follows the changes made to sale_order...

In [None]:
dataframes["canceled_order"].columns

In [16]:
dataframes["canceled_order"] = dataframes["canceled_order"].drop(['invoice folder', 'invoice page', 'customer tax id','customer email','pdf', 'shiping_status', 'image'])

In [17]:
dataframes["canceled_order"] = dataframes["canceled_order"].rename({
    "id": "cancel_id",
    "creditnote no": "updated_at",
    "issue date": "created_at",
    "invoice id": "invoice_number",
    "order number": "order_number",
    "sales channel": "sales_channel",
    "require vat": "require_vat",
    "shipping charges": "shipping_charge",
    "discount bath": "discount_baht",
    "note":"cancel_reason",
    "Status": "status"
})
dataframes["canceled_order"].columns

['cancel_id',
 'updated_at',
 'created_at',
 'order_id',
 'invoice_number',
 'order_number',
 'sales_channel',
 'require_vat',
 'shipping_charge',
 'discount_baht',
 'customer code',
 'customer name',
 'customer address',
 'shipping address',
 'customer tel',
 'cancel_reason',
 'status']

### order_item
  - id -> order_item_id
  - sell out date -> created_at: lets the price at that time be looked up quickly
  - order id -> order_id
  - quantity (pack) -> quantity_pack
  - price per pack -> unit_price
  - price per pack (ex vat) -> unit_price_ex_vat
  - total amount -> total_price

In [20]:
dataframes["order_item"].columns

['id',
 'sell out date',
 'order id',
 'master product code',
 'master product name',
 'readable name',
 'product code',
 'color code',
 'size code',
 'price per pack',
 'pcs per pack',
 'quantity (pack)',
 'quantity',
 'price per pack (ex vat)',
 'total amount']

In [21]:
dataframes["order_item"] = dataframes["order_item"].drop(['master product name', 'readable name', 'product code','color code','size code', 'pcs per pack', 'quantity'])

In [22]:
dataframes["order_item"] = dataframes["order_item"].rename({
    "id": "order_item_id",
    "sell out date": "created_at",
    "order id": "order_id",
    "master product code": "master_product_code",
    "price per pack": "unit_price",
    "price per pack (ex vat)": "unit_price_ex_vat",
    "quantity (pack)": "quantity_pack",
    "total amount": "total_price"
})

['order_item_id',
 'created_at',
 'order_id',
 'master_product_code',
 'unit_price',
 'quantity_pack',
 'unit_price_ex_vat',
 'total_price']

In [29]:
dataframes["order_item"].columns

['order_item_id',
 'created_at',
 'order_id',
 'master_product_code',
 'unit_price',
 'quantity_pack',
 'unit_price_ex_vat',
 'total_price']

### canceled_item
  - id -> order_item_id
  - cn id -> order_id
  - ...the rest is similar to inventory_output, but created_at has to be filled in separately...

In [23]:
dataframes["canceled_item"].columns

['id',
 'cn id',
 'readable name',
 'master product code',
 'product code',
 'size code',
 'color code',
 'price per pack',
 'pcs per pack',
 'quantity (pack)',
 'total amount']

In [26]:
dataframes["canceled_item"] = dataframes["canceled_item"].drop(['readable name', 'product code', 'color code', 'size code', 'pcs per pack'])

['id',
 'cn id',
 'master product code',
 'price per pack',
 'quantity (pack)',
 'total amount']

In [28]:
dataframes["canceled_item"] = dataframes["canceled_item"].rename({
    "id": "order_item_id",
    "cn id": "order_id",
    "master product code": "master_product_code",
    "price per pack": "unit_price",
    "quantity (pack)": "quantity_pack",
    "total amount": "total_price"
})
dataframes["canceled_item"].columns

['order_item_id',
 'order_id',
 'master_product_code',
 'unit_price',
 'quantity_pack',
 'total_price']

### waybill
  - ordernumber -> order_number
  - address -> shipping_address

In [30]:
dataframes["waybill"].columns

['ordernumber',
 'file',
 'image',
 'platform',
 'name',
 'address',
 'billing_address',
 'phone',
 'price',
 'postcode',
 'ship_date',
 'tracking_number',
 'invoice_number',
 'awb_format',
 'sales_channel_id']

In [31]:
dataframes["waybill"] = dataframes["waybill"].drop(['file', 'image', 'platform', 'price', 'sales_channel_id'])
dataframes["waybill"].columns

In [33]:
dataframes["waybill"] = dataframes["waybill"].rename({
    "ordernumber": "order_number",
    "address": "shipping_address"
})
dataframes["waybill"].columns

['order_number',
 'name',
 'shipping_address',
 'billing_address',
 'phone',
 'postcode',
 'ship_date',
 'tracking_number',
 'invoice_number',
 'awb_format']

### customer_code
  - code -> customer_category_id
  - customer address -> shipping_address
  - billing address -> billing_address
  - tax id -> tax_id
  - tel -> phone
  - description -> note
  - channel catagory -> channel_catagory

In [34]:
dataframes["customer_code"].columns

['code',
 'category',
 'number',
 'name',
 'customer address',
 'billing address',
 'branch',
 'tax id',
 'tel',
 'email',
 'contact person',
 'description',
 'channel catagory',
 'require tax']

In [35]:
customer_code_df = dataframes["customer_code"].drop(['category', 'number', 'contact person', 'require tax'])
customer_code_df.columns

['code',
 'name',
 'customer address',
 'billing address',
 'branch',
 'tax id',
 'tel',
 'email',
 'description',
 'channel catagory']

In [36]:
customer_code_df = customer_code_df.rename({
    "code": "customer_category_id",
    "customer address": "shipping_address",
    "billing address": "billing_address",
    "tax id": "tax_id",
    "tel": "phone",
    "description": "note",
    "channel catagory": "channel_catagory"
})
customer_code_df.columns

['customer_category_id',
 'name',
 'shipping_address',
 'billing_address',
 'branch',
 'tax_id',
 'phone',
 'email',
 'note',
 'channel_catagory']

# Step 2) Transform - Schema/Value Integration

Process the data with Polars to merge and split the datasets
- Fill in values and move them to the correct place
- Change format of day
- Merge data
  - sale_order + cn_order(canceled order) -> sale_order
  - inventory_output + cn_item(canceled item) + Price -> order_item
- Split data
  - awb_info -> waybill + customer_address(1)
  > Drop platform because sale_order already has sales_channel
  - customer_code -> customer + customer_category + customer_address(2)
  >customer code/address/... -> customer_address_id
- Generate fake values
- Summarize the data
- Check Value

## Fill gaps and move data to the correct place

In [None]:
# Remove the trailing "-CN" from the invoice_number column
canceled_order_df = dataframes["canceled_order"].with_columns(pl.col("invoice_number").str.replace(r"-CN$", ""))
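
The rule behind this step (strip a trailing `-CN` from the invoice id and record the cancellation in `status`) can be sketched per value in plain Python. The function name `split_credit_note` is hypothetical, and the invoice ids used with it are illustrative. Anchoring the pattern to the end of the string leaves a `-CN` elsewhere in the id untouched.

```python
import re

# Matches a "-CN" marker only at the very end of the invoice id
CN_SUFFIX = re.compile(r"-CN$")


def split_credit_note(invoice_id, status=None):
    """Return (invoice_number, status) for one canceled-order row.

    A trailing "-CN" is removed from the invoice id, and an empty status
    is filled with "Canceled", mirroring the two notebook cells.
    """
    cleaned = None if invoice_id is None else CN_SUFFIX.sub("", invoice_id)
    if status is None:
        status = "Canceled"
    return cleaned, status
```

An existing status is kept as it is, which matches the `fill_null` behaviour used in the notebook.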

In [None]:
# add "Canceled" to status instead
canceled_order_df = canceled_order_df.with_columns(
    pl.col("status").fill_null(
        "Canceled"
    )
)
canceled_order_df

cancel_id,updated_at,created_at,order_id,invoice_number,order_number,sales_channel,require_vat,shipping_charge,discount_baht,customer code,customer name,customer address,shipping address,customer tel,cancel_reason,status
str,str,str,str,str,str,str,bool,i64,i64,str,str,str,str,i64,str,str
,,,,,,,,,,,,,,,,"""Canceled"""
,,,,,,,,,,,,,,,,"""Canceled"""
"""93339d0c""","""01/03/2024 00:…","""01/03/2024""","""ee066971""","""87/88""",,"""nursing home""",True,0.0,125.0,,"""EH0072""","""65 Thanon Witt…",,923750555.0,"""ออกบิลใหม่เนื่…","""Canceled"""
"""77f53cb8""","""29/02/2024 00:…","""03/03/2024""","""094636c1""","""93/02""","""240218V0RUWE5H…","""shopee""",True,0.0,0.0,,"""Prem Singh""",,"""11 kasem samra…",66967755727.0,"""เบอร์ผิด""","""Canceled"""
"""5a03509a""","""29/02/2024 00:…","""03/03/2024""","""1f870eca""","""93/19""","""2402215U02C6XB…","""shopee""",True,0.0,0.0,,"""Prem Singh""",,"""11 kasem samra…",66967755727.0,"""เบอร์ผิด""","""Canceled"""
"""9a0e9066""","""03/03/2024 00:…","""03/03/2024""","""2b79b368""","""93/55""","""82547889311676…","""lazada""",True,0.0,0.0,,"""สคุ นธา ชยั ฤก…","""สคุ นธา ชยั ฤก…","""26/1 หม ู่13 ·…",660896015539.0,"""ลูกค้าขอยกเลิก…","""Canceled"""
"""ae6876dc""","""10/03/2024 00:…","""10/03/2024""","""2b79b368""","""93/55""","""82547889311676…","""lazada""",True,0.0,0.0,,"""สคุ นธา ชยั ฤก…","""สคุ นธา ชยั ฤก…","""26/1 หม ู่13 ·…",660896015539.0,"""ลูกค้าต้องการย…","""Canceled"""
"""6cee400d""","""13/03/2024 00:…","""13/03/2024""","""deb9a927""","""""","""2403034CWQ6837…","""shopee""",True,0.0,0.0,,"""ชัชวาลย์ นพรัก…",,"""หมู่บ้านเดอะเก…",66943456449.0,"""ไม่มีคนรับสาย""","""Canceled"""
"""01d7aa97""","""20/03/2024 00:…","""20/03/2024""","""4219b2ad""","""IV2403A0059""","""24031557TVKF2R…","""shopee""",True,0.0,0.0,,"""ชนากานต์ ด้วงก…",,"""91 หมู่ 12, ตํ…",66984146969.0,"""ไม่มีผู้รับ""","""Canceled"""
"""618c4e45""","""31/03/2024 00:…","""31/03/2024""","""776699ce""","""IV2403A0091""","""24032505MHJC6S…","""shopee""",True,0.0,0.0,,"""จิราพร""",,"""514 ถ. มหาจักร…",66957641503.0,"""พัสดุตีกลับ""","""Canceled"""


## Change format of day

In [None]:
import pandas as pd

# Placeholder: `df` is not defined yet in this notebook, so this cell raises NameError until it is
df['Date'] = pd.to_datetime(df['Date'])

In [None]:
df

NameError: name 'df' is not defined
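
The previews earlier in the notebook show two date layouts: month-first with a time in sale_order (`6/17/2023 15:36:55`) and day-first in canceled_order (`31/03/2024`). A value such as `01/03/2024` is ambiguous without an explicit format, so the sketch below parses each layout with its own format string. It uses only the standard library; the constant names, and the assumption that each table uses one layout throughout, are mine.

```python
from datetime import datetime

# Formats read off the previews above; one layout per table is an assumption
SALE_ORDER_FORMAT = "%m/%d/%Y %H:%M:%S"   # e.g. 6/17/2023 15:36:55
CANCELED_ORDER_FORMAT = "%d/%m/%Y"        # e.g. 31/03/2024


def to_iso(value, fmt):
    """Parse `value` with an explicit format; return an ISO-style string, or None for blanks."""
    if value is None or not value.strip():
        return None
    return datetime.strptime(value.strip(), fmt).isoformat(sep=" ")


print(to_iso("6/17/2023 15:36:55", SALE_ORDER_FORMAT))  # 2023-06-17 15:36:55
print(to_iso("31/03/2024", CANCELED_ORDER_FORMAT))      # 2024-03-31 00:00:00
```

The same format strings should carry over to a DataFrame-level conversion once the target frame is defined.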

## Merge data
Create new tables

## Split data
Create new tables

In [None]:
# Placeholder: deduplicate each split table here (Polars: .unique(); pandas: .drop_duplicates())
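
The plan for this step moves addresses out of customer_code into a separate customer_address table referenced by `customer_address_id`. The sketch below illustrates that idea on plain dictionaries. The field names follow the renamed customer_code columns, while the function `split_addresses`, the `*_id` column names, and the sequential id scheme are assumptions rather than the notebook's final design.

```python
def split_addresses(rows, address_fields=("shipping_address", "billing_address")):
    """Move address strings into their own table and reference them by id.

    Returns (customers, addresses), where `addresses` is a list of
    {"customer_address_id", "address"} dicts without duplicates.
    """
    address_ids = {}  # address text -> surrogate id
    customers = []
    for row in rows:
        customer = {k: v for k, v in row.items() if k not in address_fields}
        for field in address_fields:
            text = row.get(field)
            if text:
                # Reuse the id when the same address was already seen
                address_id = address_ids.setdefault(text, len(address_ids) + 1)
            else:
                address_id = None
            customer[f"{field}_id"] = address_id
        customers.append(customer)
    addresses = [
        {"customer_address_id": i, "address": text} for text, i in address_ids.items()
    ]
    return customers, addresses
```

Identical address strings share one id, which is where the deduplication happens.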


## Use Faker library to generate fake names

In [None]:
!pip install faker

In [None]:
from faker import Faker
fake = Faker()
customer_master['Name'] = customer_master['CustomerNo'].apply(lambda x: fake.name())

In [None]:
customer_master

Unnamed: 0,CustomerNo,Country,Name
0,17490.0,United Kingdom,Joseph Collins
1,13069.0,United Kingdom,Leslie Reynolds
20,12433.0,Norway,Jeffrey Bond
84,13426.0,United Kingdom,James Evans
94,17364.0,United Kingdom,Cheryl Pena
...,...,...,...
535301,16274.0,United Kingdom,Brittany Smith
535474,14142.0,United Kingdom,Kelli Holmes
535507,13065.0,United Kingdom,Kelsey Flores
536103,18011.0,United Kingdom,Jeffrey Nguyen


## Summary

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 536350 entries, 0 to 536349
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   TransactionNo  536350 non-null  object        
 1   Date           536350 non-null  datetime64[ns]
 2   ProductNo      536350 non-null  object        
 3   Price          536350 non-null  float64       
 4   Quantity       536350 non-null  int64         
 5   CustomerNo     536295 non-null  float64       
dtypes: datetime64[ns](1), float64(2), int64(1), object(2)
memory usage: 24.6+ MB


In [None]:
df.describe()

Unnamed: 0,Date,Price,Quantity,CustomerNo
count,536350,536350.0,536350.0,536295.0
mean,2023-12-04 02:52:31.891116032,12.662182,9.919347,15227.893178
min,2023-05-03 00:00:00,5.13,-80995.0,12004.0
25%,2023-08-28 00:00:00,10.99,1.0,13807.0
50%,2023-12-20 00:00:00,11.94,3.0,15152.0
75%,2024-03-20 00:00:00,14.09,10.0,16729.0
max,2024-05-10 00:00:00,660.62,80995.0,18287.0
std,,8.49045,216.6623,1716.582932


# Step 3) Insert into Database

Insert the data into the MySQL database
- Read Colab Secrets (stored in class Config)
- Connect to MySQL with SQLAlchemy
- Create the schema (database)
- Insert data into tables

## Getting Secrets

In [None]:
from google.colab import userdata

class Config:
  MYSQL_HOST = userdata.get("MYSQL_HOST")
  MYSQL_PORT = userdata.get("MYSQL_PORT")
  MYSQL_USER = userdata.get("MYSQL_USER")
  MYSQL_PASSWORD = userdata.get("MYSQL_PASSWORD")
  MYSQL_DB = ''
  MYSQL_CHARSET = 'utf8mb4'
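
A missing secret would otherwise only surface later as a connection error. As a hedged sketch, the settings can be checked as soon as they are read. The function `load_mysql_settings` is hypothetical and takes any mapping, so it works with environment variables outside Colab; it does not call the Colab `userdata` API.

```python
import os

REQUIRED_KEYS = ("MYSQL_HOST", "MYSQL_PORT", "MYSQL_USER", "MYSQL_PASSWORD")


def load_mysql_settings(source=os.environ):
    """Read the MySQL settings from a mapping and fail early if any is missing."""
    missing = [key for key in REQUIRED_KEYS if not source.get(key)]
    if missing:
        raise KeyError(f"Missing MySQL settings: {', '.join(missing)}")
    settings = {key: source[key] for key in REQUIRED_KEYS}
    settings["MYSQL_PORT"] = int(settings["MYSQL_PORT"])  # ports arrive as strings
    return settings
```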


## Connect to MySQL

In [None]:
!pip install pymysql

Collecting pymysql
  Downloading PyMySQL-1.1.1-py3-none-any.whl (44 kB)
Installing collected packages: pymysql
Successfully installed pymysql-1.1.1


In [None]:
import sqlalchemy

engine = sqlalchemy.create_engine(
    "mysql+pymysql://{user}:{password}@{host}:{port}/{db}".format(
        user=Config.MYSQL_USER,
        password=Config.MYSQL_PASSWORD,
        host=Config.MYSQL_HOST,
        port=Config.MYSQL_PORT,
        db=Config.MYSQL_DB,
    )
)
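
Two details are worth noting about the connection string: a password containing characters such as `@` or `/` has to be percent-encoded before it is placed in a URL, and `Config.MYSQL_CHARSET` is defined but not used by the engine above. The sketch below builds the URL with both handled. The function `build_mysql_url` and the sample credentials are hypothetical.

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, port, db="", charset="utf8mb4"):
    """Build a SQLAlchemy-style MySQL URL with the credentials percent-encoded."""
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{db}?charset={charset}"
    )


print(build_mysql_url("app", "p@ss/word", "db.example.com", 3306, "elgo"))
# mysql+pymysql://app:p%40ss%2Fword@db.example.com:3306/elgo?charset=utf8mb4
```

The returned string can be passed to `sqlalchemy.create_engine` in place of the formatted string above.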

In [None]:
df

Unnamed: 0,TransactionNo,Date,ProductNo,Price,Quantity,CustomerNo
0,581482,2024-05-10,22485,21.47,12,17490.0
1,581475,2024-05-10,22596,10.65,36,13069.0
2,581475,2024-05-10,23235,11.53,12,13069.0
3,581475,2024-05-10,23272,10.65,12,13069.0
4,581475,2024-05-10,23239,11.94,6,13069.0
...,...,...,...,...,...,...
536345,C536548,2023-05-03,22168,18.96,-2,12472.0
536346,C536548,2023-05-03,21218,14.09,-3,12472.0
536347,C536548,2023-05-03,20957,11.74,-1,12472.0
536348,C536548,2023-05-03,22580,16.35,-4,12472.0


**Tables:**
- transaction
- customer
- product

## Create Schema

In [None]:
schema_name = "elgo"

with engine.connect() as connection:
    connection.execute(sqlalchemy.text(f"CREATE DATABASE IF NOT EXISTS {schema_name};"))


OperationalError: (pymysql.err.OperationalError) (1044, "Access denied for user 'r2de3'@'%' to database 'r2de3'")
[SQL: CREATE DATABASE IF NOT EXISTS r2de3;]
(Background on this error at: https://sqlalche.me/e/20/e3q8)
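
Because a database name is an identifier, it cannot be passed as a bound parameter, which is why the cell above interpolates it into the SQL text. A small check before interpolation keeps that safe. The sketch below uses a deliberately conservative rule that is stricter than what MySQL itself accepts; the function `create_database_sql` is hypothetical.

```python
import re

# Conservative rule: letters, digits and underscores only, not starting with a digit
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def create_database_sql(schema_name: str) -> str:
    """Return a CREATE DATABASE statement after validating the schema name."""
    if not _IDENTIFIER.fullmatch(schema_name):
        raise ValueError(f"Unsafe schema name: {schema_name!r}")
    return f"CREATE DATABASE IF NOT EXISTS `{schema_name}`;"
```

The statement still needs an account with the privilege to create databases, which is what the access-denied error above points to.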

## Insert to tables

In [None]:
df.to_sql(
    '',        # Name of the table to be created
    con=engine,           # SQLAlchemy engine connection
    schema=schema_name,   # Database name
    if_exists='replace',  # 'fail', 'replace', or 'append'
    index=False,          # Whether to include the DataFrame's index as a column
    chunksize=1000,       # Number of rows to insert in each chunk (for large DataFrames)
)

536350

# Step 4) Query and save output files

Query the tables and save the output files
- Query data with Pandas
- Write the output as Parquet files

## Query

In [None]:
sql = "SELECT * FROM "
 = pd.read_sql_query(sql, engine)

Unnamed: 0,CustomerNo,Country,Name
0,17490.0,United Kingdom,Sara Griffin
1,13069.0,United Kingdom,Michael Holt
2,12433.0,Norway,Kelli Sandoval
3,13426.0,United Kingdom,Dalton Graves
4,17364.0,United Kingdom,Michelle James
...,...,...,...
4734,16274.0,United Kingdom,Megan Young
4735,14142.0,United Kingdom,Luke Williams
4736,13065.0,United Kingdom,Lisa Jones
4737,18011.0,United Kingdom,Kelly Jenkins


## Save to Parquet files

In [None]:
df.to_parquet(".parquet", index=False)
