### **Data Load Process**:
- Objective: This notebook shows how the clean data produced in the previous phase of the project is loaded into a database. A new database is used because of the limitations of the free [Render database](https://github.com/DCajiao/workshop001_candidates_analysis/blob/main/docs/database/how_to_deploy_databases_on_render.md) instances.
- **Important note**: The documentation in this and the following notebooks emulates each pipeline task that will run in Airflow, so *we use a sample of the data for testing purposes*, while the pipeline itself uses all the data we have.
---

#### **First Step**: Load the clean, previously processed and transformed data from a CSV file.

In [1]:
import pandas as pd
import sys
import os

# Add the 'src' folder to sys.path
sys.path.append(os.path.abspath(os.path.join('..', 'src')))

In [2]:
df = pd.read_csv('../data/credit_card_transactions_cleaned.csv')

In [3]:
print(f'The dataset has {df.shape[0]} rows and {df.shape[1]} columns')
print(f'The columns are: {df.columns.tolist()}')

The dataset has 1052352 rows and 21 columns
The columns are: ['id', 'trans_date_trans_time', 'cc_num', 'merchant', 'category', 'amt', 'first', 'last', 'gender', 'street', 'city', 'state', 'zip', 'lat', 'long', 'job', 'dob', 'trans_num', 'is_fraud', 'merch_zipcode', 'age']


In [4]:
# How many unique values does each column have?
df.nunique()

id                       1052352
trans_date_trans_time    1038039
cc_num                       946
merchant                     693
category                      14
amt                        48789
first                        348
last                         478
gender                         2
street                       946
city                         864
state                         49
zip                          935
lat                          933
long                         934
job                          488
dob                          931
trans_num                1052352
is_fraud                       2
merch_zipcode              28307
age                           75
dtype: int64

---

#### **Second Step**: Upload data to database

Tasks:

- Import the `DB` class to use the connector.
- Establish a connection and execute the queries that create the schema and send the data.
- Validate that the table has been created and that all records have been loaded.

In [5]:
from utils.pysqlschema import SQLSchemaGenerator

generator = SQLSchemaGenerator(table_name='raw_table')
generator.generate_schema(df, '../sql/raw_table_schema.sql')
generator.generate_seed_data(df, '../sql/raw_table_seed_data.sql')

INFO:root:Generating schema for raw_table
INFO:root:Infering SQL type for int64
INFO:root:Infering SQL type for object
INFO:root:Infering SQL type for int64
INFO:root:Infering SQL type for object
INFO:root:Infering SQL type for object
INFO:root:Infering SQL type for float64
INFO:root:Infering SQL type for object
INFO:root:Infering SQL type for object
INFO:root:Infering SQL type for object
INFO:root:Infering SQL type for object
INFO:root:Infering SQL type for object
INFO:root:Infering SQL type for object
INFO:root:Infering SQL type for int64
INFO:root:Infering SQL type for float64
INFO:root:Infering SQL type for float64
INFO:root:Infering SQL type for object
INFO:root:Infering SQL type for object
INFO:root:Infering SQL type for object
INFO:root:Infering SQL type for bool
INFO:root:Infering SQL type for float64
INFO:root:Infering SQL type for int64
INFO:root:Query written to ../sql/raw_table_schema.sql
INFO:root:Generating seed data for raw_table
INFO:root:Query written to ../sql/raw_tab
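
The log above shows one type inference per column. The internals of `SQLSchemaGenerator` are not shown in this notebook, so the following is only a minimal sketch of how a pandas dtype name could be mapped to a SQL type and assembled into a `CREATE TABLE` statement. The mapping, the function names `infer_sql_type` and `build_create_table`, and the PostgreSQL-style type names are assumptions made for illustration.

```python
# Hypothetical mapping from pandas dtype names to PostgreSQL-style types.
# This is an assumption, not the actual logic of SQLSchemaGenerator.
DTYPE_TO_SQL = {
    "int64": "BIGINT",
    "float64": "DOUBLE PRECISION",
    "bool": "BOOLEAN",
    "object": "TEXT",
}


def infer_sql_type(dtype_name: str) -> str:
    """Return a SQL type for a pandas dtype name, falling back to TEXT."""
    return DTYPE_TO_SQL.get(dtype_name, "TEXT")


def build_create_table(table_name: str, columns: dict[str, str]) -> str:
    """Build a CREATE TABLE statement from {column name: dtype name}."""
    column_defs = ",\n".join(
        f'    "{name}" {infer_sql_type(dtype)}' for name, dtype in columns.items()
    )
    return f"CREATE TABLE IF NOT EXISTS {table_name} (\n{column_defs}\n);"


# With a DataFrame, the dict could be built as
# {col: str(dtype) for col, dtype in df.dtypes.items()}.
example_columns = {
    "id": "int64",
    "merchant": "object",
    "amt": "float64",
    "is_fraud": "bool",
}
schema_sql = build_create_table("raw_table", example_columns)
print(schema_sql)
```

Column names are double-quoted in this sketch because some of the dataset's columns, such as `first` and `last`, can clash with SQL keywords.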

In [9]:
from connections.db import DB
db = DB()

In [10]:
# Remove the table if it already exists
db.execute("../sql/queries/002_drop_tables.sql", fetch_results=False)

INFO:root:✔ Connected to database
INFO:root:✔ Query executed
INFO:root:✔ Cursor closed
INFO:root:✔ Connection closed


In [11]:
# Create schema
db.execute("../sql/raw_table_schema.sql", fetch_results=False)

INFO:root:✔ Connected to database
INFO:root:✔ Query executed
INFO:root:✔ Cursor closed
INFO:root:✔ Connection closed


In [12]:
# Seed data by executing the seed data script in batches
db.execute_in_batches("../sql/raw_table_seed_data.sql", batch_size=20000)

INFO:root:✔ Connected to database
INFO:root:✔ Executed a batch of 20000 records
INFO:root:✔ Executed a batch of 20000 records
INFO:root:✔ Executed a batch of 20000 records
INFO:root:✔ Executed a batch of 20000 records
INFO:root:✔ Executed a batch of 20000 records
INFO:root:✔ Executed a batch of 20000 records
INFO:root:✔ Executed a batch of 20000 records
INFO:root:✔ Executed a batch of 20000 records
INFO:root:✔ Executed a batch of 20000 records
INFO:root:✔ Executed a batch of 20000 records
INFO:root:✔ Executed a batch of 20000 records
INFO:root:✔ Executed a batch of 20000 records
INFO:root:✔ Executed a batch of 20000 records
INFO:root:✔ Executed a batch of 20000 records
INFO:root:✔ Executed a batch of 20000 records
INFO:root:✔ Executed a batch of 20000 records
INFO:root:✔ Executed a batch of 20000 records
INFO:root:✔ Executed a batch of 20000 records
INFO:root:✔ Executed a batch of 20000 records
INFO:root:✔ Executed a batch of 20000 records
INFO:root:✔ Executed a batch of 20000 records
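
How `execute_in_batches` works internally is not shown here. As a rough sketch, assuming the seed file holds one `INSERT` statement per record, batching can be done by chunking the statements and committing after each chunk. The helper names and the in-memory SQLite database below are stand-ins for illustration, not the project's `DB` class.

```python
import sqlite3
from itertools import islice
from typing import Iterable, Iterator


def chunked(items: Iterable, size: int) -> Iterator[list]:
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def execute_in_batches(
    conn: sqlite3.Connection, statements: Iterable[str], batch_size: int
) -> int:
    """Run the statements batch by batch and return how many were executed."""
    total = 0
    for batch in chunked(statements, batch_size):
        cursor = conn.cursor()
        for statement in batch:
            cursor.execute(statement)
        conn.commit()  # one commit per batch keeps each transaction small
        cursor.close()
        total += len(batch)
        print(f"Executed a batch of {len(batch)} records")
    return total


# Stand-in database and seed statements (assumed shape, not the real seed file).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_table (id INTEGER, amt REAL)")
statements = [
    f"INSERT INTO raw_table (id, amt) VALUES ({i}, {i * 1.5})" for i in range(10)
]
inserted = execute_in_batches(conn, statements, batch_size=4)
row_count = conn.execute("SELECT COUNT(*) FROM raw_table").fetchone()[0]
```

Committing per batch means a failure part-way through leaves the earlier batches in place, which is one reason to check the final row count after loading.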


In [13]:
# Query the tables to verify that the data has been inserted
db.execute("../sql/queries/001_view_tables_sizes.sql", fetch_results=True)

INFO:root:✔ Connected to database
INFO:root:✔ Query executed
INFO:root:✔ Cursor closed
INFO:root:✔ Connection closed


[('public.raw_table', 1034393)]
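
The third task is to validate that all records were loaded. A minimal sketch of such a check, using an in-memory SQLite table as a stand-in for the real database, compares the number of source rows with `COUNT(*)` on the target table. The function name `check_row_count` is hypothetical and is not part of the project's code.

```python
import sqlite3


def check_row_count(
    conn: sqlite3.Connection, table_name: str, expected_rows: int
) -> int:
    """Return expected_rows minus the rows found in table_name (0 means a full load)."""
    # table_name is interpolated directly, so it must come from trusted code.
    found = conn.execute(f"SELECT COUNT(*) FROM {table_name}").fetchone()[0]
    return expected_rows - found


# Stand-in table with five rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_table (id INTEGER)")
conn.executemany("INSERT INTO raw_table (id) VALUES (?)", [(i,) for i in range(5)])

missing = check_row_count(conn, "raw_table", expected_rows=5)
print("All records loaded" if missing == 0 else f"{missing} records missing")
```

In the notebook, `expected_rows` would be `len(df)` and the count would come from the result of the table-size query.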

---

#### **Results**:

1. We extracted a sample representing 45% of our cleaned and processed data.
2. We created a raw table from that sample.
3. We uploaded the sample to the database so that the tests in the following notebooks can run against it.

---