# Basics 1: Extract, transform, and load a single CSV file (from bucket to database)

In this lesson we load sales rows from a CSV file located in a storage bucket and save them to the `fact_sales` database table in PostgreSQL.

## Step 1: Add a file to the storage bucket

1. Execute `taito open bucket` on the command line to open the storage bucket in your web browser.
2. Log in with access key `minio` and secret key `secret1234`.
3. Create a folder named `sales` and upload a file named `sales-2021-04.csv` to the folder with the following content:

```csv
Order,Date,Product,Quantity,Price
00000000003,2021-04-15,1-1,2,2129.00
00000000003,2021-04-15,1-2,1,2659.00
00000000004,2021-04-16,1-1,1,2659.00
00000000005,2021-04-16,1-1,1,2129.00
```

## Step 2: Execute the code

In [None]:
# Imports
import pandas as pd
import os

%run ../../common/jupyter.ipynb
import src_common_database as db
import src_common_storage as st
import src_common_util as util

In [None]:
# Read the CSV file from the storage bucket
bucket = st.create_storage_bucket_client(os.environ['STORAGE_BUCKET'])
sales_csv = bucket.get_object_contents("/sales/sales-2021-04.csv")

# Parse the CSV data into a Pandas dataframe
df = pd.read_csv(sales_csv)

# DEBUG: Show the contents
df.style

In [None]:
# Change dataframe schema to match the database table
db_df = df.rename(
    columns = {
        'Date': 'date_key',
        'Product': 'product_key',
        'Order': 'order_number',
        'Quantity': 'quantity',
        'Price': 'price',
    },
    inplace = False
)

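# Generate unique key by concatenating order number and product SKU.
# NOTE: astype(str) is needed because read_csv parses the Order column as integers by default,
#       which also drops the leading zeros (the first key becomes "3.1-1").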
db_df["key"] = db_df["order_number"].astype(str) + "." + db_df["product_key"]

# DEBUG: Show the renamed schema
db_df.style

In [None]:
# Insert data to the fact_sales database table
# NOTE: You will get "duplicate key value violates unique constraint" or "null value in column store_key violates not-null constraint" errors
#       if you have already executed some of the later lessons. In that case, remove all your changes from "database/" and execute
#       `taito init --clean` to clear old data from your database.
database = db.create_engine()
db_df.to_sql('fact_sales', con=database, if_exists='append', index=False)

# DEBUG: Show the data stored in database
pd.read_sql('fact_sales', con=database).style
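
TIP: Assuming `key` is the unique column of `fact_sales` (as the error message suggests), the same "duplicate key value" error would also occur if the CSV file itself contained two rows with the same order number and product. The following is a minimal sketch, not part of the lesson code, that uses only the Python standard library to list keys that occur more than once. The names `SAMPLE_CSV` and `duplicate_keys` are hypothetical and exist only for this illustration.

```python
import csv
import io
from collections import Counter

# Same content as the sales-2021-04.csv file created in Step 1
SAMPLE_CSV = """Order,Date,Product,Quantity,Price
00000000003,2021-04-15,1-1,2,2129.00
00000000003,2021-04-15,1-2,1,2659.00
00000000004,2021-04-16,1-1,1,2659.00
00000000005,2021-04-16,1-1,1,2129.00
"""


def duplicate_keys(csv_text):
    """Return the keys (order number + "." + product) that occur more than once."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # The csv module keeps every field as text, so leading zeros are preserved here
    counts = Counter(f"{row['Order']}.{row['Product']}" for row in reader)
    return sorted(key for key, count in counts.items() if count > 1)


# The sample data has no duplicates, so this prints an empty list
print(duplicate_keys(SAMPLE_CSV))
```

If the function returns any keys, inserting the file into `fact_sales` would be expected to fail regardless of what is already stored in the database.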

In [None]:
# TIP: In a real-world example you would probably list all CSV files in a folder
# and execute the operation on each of them, for example:
filenames = bucket.list_objects("/sales/")
for filename in filenames:
    print("Executing operation for " + filename)
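
A folder may also contain files that you do not want to load. The following is a minimal sketch of how the listed names could be filtered and ordered before processing. It assumes that `list_objects` returns object names as plain strings, which this lesson does not confirm, and that files follow the `sales-YYYY-MM.csv` naming of the example file. The helper `select_sales_files` is a hypothetical name.

```python
import re

# Assumed naming convention, taken from the example file: sales/sales-YYYY-MM.csv
SALES_FILE_PATTERN = re.compile(r"^/?sales/sales-(\d{4})-(\d{2})\.csv$")


def select_sales_files(object_names):
    """Return monthly sales CSV names in chronological order, skipping anything else."""
    matches = []
    for name in object_names:
        match = SALES_FILE_PATTERN.match(name)
        if match:
            year, month = int(match.group(1)), int(match.group(2))
            matches.append(((year, month), name))
    # Sorting by (year, month) loads older months before newer ones
    return [name for _, name in sorted(matches)]


# Hypothetical listing standing in for bucket.list_objects("/sales/")
names = ["/sales/sales-2021-04.csv", "/sales/readme.txt", "/sales/sales-2021-03.csv"]
for filename in select_sales_files(names):
    print("Executing operation for " + filename)
```

Loading files in chronological order matters once the load updates existing rows, because the most recently loaded file then determines the stored values.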

## Step 3: Connect to the database with Taito CLI

- Execute `taito db connect` on the command line to connect to the local database.
- Show all sales rows with `select * from fact_sales;`.

## Step 4: Change the implementation to update existing data and insert new data

Currently our implementation only inserts new rows into the database table and fails if a row with the same unique key already exists. Unfortunately, Pandas does not support PostgreSQL upsert (insert or update) out of the box. There are multiple ways to work around this, for example:

- Write data to a separate loading view that has a trigger that executes upsert for the target table on insert.
- Write data to a separate loading table that has a trigger that executes upsert for the target table on insert.
- Write data to a temporary table and then merge the data into the target table with a custom SQL statement.
- Just overwrite all data in the target table, preferably with truncate mode to keep the table schema intact.
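
To make the upsert behavior concrete, here is a minimal sketch of the third option (a staging table merged with a custom SQL statement). It is an illustration only: it uses an in-memory SQLite database from the Python standard library as a stand-in for PostgreSQL, a reduced `fact_sales` table with three columns, and the hypothetical names `staging_sales` and `load_batch`. The `ON CONFLICT ... DO UPDATE` syntax is similar in both databases, but SQLite requires the extra `WHERE 1` shown below.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (key TEXT PRIMARY KEY, quantity INTEGER NOT NULL, price REAL NOT NULL);
    CREATE TABLE staging_sales (key TEXT, quantity INTEGER, price REAL);
""")


def load_batch(conn, rows):
    """Stage the rows, then merge them into fact_sales with a single upsert statement."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM staging_sales")
        conn.executemany("INSERT INTO staging_sales VALUES (?, ?, ?)", rows)
        # "WHERE 1" is a SQLite-specific requirement: it keeps the parser from
        # reading ON CONFLICT as a join constraint. PostgreSQL does not need it.
        conn.execute("""
            INSERT INTO fact_sales (key, quantity, price)
            SELECT key, quantity, price FROM staging_sales WHERE 1
            ON CONFLICT (key) DO UPDATE SET
                quantity = excluded.quantity,
                price = excluded.price
        """)


# The first load inserts two rows
load_batch(conn, [("3.1-1", 2, 2129.0), ("3.1-2", 1, 2659.0)])
# The second load updates "3.1-1" and inserts "4.1-1"
load_batch(conn, [("3.1-1", 5, 2129.0), ("4.1-1", 1, 2659.0)])

rows = conn.execute("SELECT key, quantity FROM fact_sales ORDER BY key").fetchall()
print(rows)  # [('3.1-1', 5), ('3.1-2', 1), ('4.1-1', 1)]
```

Running the same load twice no longer fails. Existing keys are updated and new keys are inserted.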

This is how you can implement the first option (loading view). Normally we would add a new database migration for this, but since our database tables are not yet in production, we can just modify the existing migrations and redeploy them.

1. Copy and paste the following contents into the existing files `database/deploy/fact_sales.sql`, `database/revert/fact_sales.sql`, and `database/verify/fact_sales.sql`, respectively.

```sql
-- Deploy fact_sales to pg

BEGIN;

CREATE TABLE fact_sales (
  key text PRIMARY KEY,
  date_key text NOT NULL REFERENCES dim_dates (key),
  product_key text NOT NULL REFERENCES dim_products (key),
  order_number text NOT NULL,
  quantity integer NOT NULL,
  price numeric(12,2) NOT NULL
);

CREATE VIEW load_sales AS SELECT * FROM fact_sales;

CREATE OR REPLACE FUNCTION load_sales() RETURNS TRIGGER AS $$
BEGIN
  INSERT INTO fact_sales VALUES (NEW.*)
  ON CONFLICT (key) DO
    UPDATE SET
      date_key = EXCLUDED.date_key,
      product_key = EXCLUDED.product_key,
      order_number = EXCLUDED.order_number,
      quantity = EXCLUDED.quantity,
      price = EXCLUDED.price;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER load_sales
INSTEAD OF INSERT ON load_sales
FOR EACH ROW EXECUTE PROCEDURE load_sales();

COMMIT;
```

```sql
-- Revert fact_sales from pg

BEGIN;

DROP TRIGGER load_sales ON load_sales;
DROP FUNCTION load_sales;
DROP VIEW load_sales;
DROP TABLE fact_sales;

COMMIT;
```

```sql
-- Verify fact_sales on pg

BEGIN;

SELECT key FROM load_sales LIMIT 1;
SELECT key FROM fact_sales LIMIT 1;

ROLLBACK;
```
    
2. Redeploy database migrations and example data to the local database with `taito init --clean`.
3. Execute the following code to load the CSV data into the database again:

In [None]:
# Write the data to the "load_sales" view instead of "fact_sales" table
db_df.to_sql('load_sales', con=database, if_exists='append', index=False)

# DEBUG: Show the data stored in fact_sales. Your manual data changes should have been overwritten.
pd.read_sql('fact_sales', con=database).style

## Next lesson: [Basics 2 - Listen storage bucket for uploads](02.ipynb)