### **Data Load Process**:
- Objective: This notebook loads the clean data produced in the previous phase of the project into a database. A new database is used because of the limitations of the free [Render database](https://github.com/DCajiao/workshop001_candidates_analysis/blob/main/docs/database/how_to_deploy_databases_on_render.md) instances.
- **Important note**: The documentation in this and the following notebooks emulates each pipeline task that will run in Airflow, so *we use a sample of the data for testing purposes*, while the pipeline itself uses all the available data.
---

#### **First Step**: Load the clean, previously processed and transformed data from a CSV file.

In [2]:
import pandas as pd
import sys
import os

# Add the 'src' folder to sys.path
sys.path.append(os.path.abspath(os.path.join('..', 'src')))

In [2]:
df = pd.read_csv('../data/credit_card_transactions_cleaned.csv')

In [3]:
print(f'The dataset has {df.shape[0]} rows and {df.shape[1]} columns')
print(f'The columns are: {df.columns.tolist()}')

The dataset has 1052352 rows and 21 columns
The columns are: ['id', 'trans_date_trans_time', 'cc_num', 'merchant', 'category', 'amt', 'first', 'last', 'gender', 'street', 'city', 'state', 'zip', 'lat', 'long', 'job', 'dob', 'trans_num', 'is_fraud', 'merch_zipcode', 'age']


In [4]:
# How many unique values does each column have?
df.nunique()

id                       1052352
trans_date_trans_time    1038039
cc_num                       946
merchant                     693
category                      14
amt                        48789
first                        348
last                         478
gender                         2
street                       946
city                         864
state                         49
zip                          935
lat                          933
long                         934
job                          488
dob                          931
trans_num                1052352
is_fraud                       2
merch_zipcode              28307
age                           75
dtype: int64

In [5]:
# Obtain a sample of the data (45% of the rows)
df_test = df.sample(int(df.shape[0]*0.45))

In [6]:
print(f'The dataset has {df_test.shape[0]} rows and {df_test.shape[1]} columns')
print(f'The columns are: {df_test.columns.tolist()}')

The dataset has 473558 rows and 21 columns
The columns are: ['id', 'trans_date_trans_time', 'cc_num', 'merchant', 'category', 'amt', 'first', 'last', 'gender', 'street', 'city', 'state', 'zip', 'lat', 'long', 'job', 'dob', 'trans_num', 'is_fraud', 'merch_zipcode', 'age']
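
Because `df.sample` draws a different random subset on every run, the sampled rows and their unique-value counts can change between executions. The following is a minimal sketch of a reproducible variant, assuming a fixed seed is acceptable for testing. The `random_state=42` value and the toy DataFrame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the cleaned dataset (hypothetical data, 100 rows).
df_demo = pd.DataFrame({"id": range(100), "amt": [i * 1.5 for i in range(100)]})

# frac=0.45 asks pandas for 45% of the rows; random_state fixes the seed
# so that repeated calls return the same rows.
sample_a = df_demo.sample(frac=0.45, random_state=42)
sample_b = df_demo.sample(frac=0.45, random_state=42)

print(len(sample_a))              # number of sampled rows
print(sample_a.equals(sample_b))  # True when both draws are identical
```

With a fixed seed, the sample written to disk and the sample inspected in the notebook stay the same even if the cell is executed more than once.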


In [12]:
df_test.nunique()

id                       473558
trans_date_trans_time    470686
cc_num                      943
merchant                    693
category                     14
amt                       35955
first                       347
last                        477
gender                        2
street                      943
city                        862
state                        49
zip                         932
lat                         930
long                        931
job                         488
dob                         928
trans_num                473558
is_fraud                      2
merch_zipcode             27679
age                          75
dtype: int64

In [13]:
# Save the sample to a new CSV file
df_test.to_csv('../data/credit_card_transactions_sample.csv', index=False)

---

#### **Second Step**: Upload the data to the database

Tasks:

- Import the `DB` class to use the database connector.
- Establish a connection and execute the queries that create the schema and load the data.
- Validate that the table has been created and that all records have been loaded.

In [5]:
from utils.pysqlschema import SQLSchemaGenerator

generator = SQLSchemaGenerator(table_name='dataraw_testing_table')
generator.generate_schema(df_test, '../sql/sample/schema_clean.sql')
generator.generate_seed_data(df_test, '../sql/sample/seed_data_clean.sql')

2024-09-26 00:08:09,945 - Generating schema for dataraw_testing_table
2024-09-26 00:08:09,946 - Infering SQL type for int64
2024-09-26 00:08:09,948 - Infering SQL type for object
2024-09-26 00:08:09,950 - Infering SQL type for int64
2024-09-26 00:08:09,951 - Infering SQL type for object
2024-09-26 00:08:09,952 - Infering SQL type for object
2024-09-26 00:08:09,953 - Infering SQL type for float64
2024-09-26 00:08:09,954 - Infering SQL type for object
2024-09-26 00:08:09,955 - Infering SQL type for object
2024-09-26 00:08:09,956 - Infering SQL type for object
2024-09-26 00:08:09,956 - Infering SQL type for object
2024-09-26 00:08:09,957 - Infering SQL type for object
2024-09-26 00:08:09,958 - Infering SQL type for object
2024-09-26 00:08:09,959 - Infering SQL type for int64
2024-09-26 00:08:09,960 - Infering SQL type for float64
2024-09-26 00:08:09,961 - Infering SQL type for float64
2024-09-26 00:08:09,962 - Infering SQL type for object
2024-09-26 00:08:09,963 - Infering SQL type for ob
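
The log above shows one type-inference step per column dtype. The internals of `SQLSchemaGenerator` are not shown in this notebook, so the sketch below is only an illustration of the general idea, assuming a PostgreSQL target. The `DTYPE_TO_SQL` mapping and the `build_create_table` helper are hypothetical names introduced here.

```python
# Hypothetical mapping from pandas dtype names to PostgreSQL column types.
DTYPE_TO_SQL = {
    "int64": "BIGINT",
    "float64": "DOUBLE PRECISION",
    "object": "TEXT",
    "bool": "BOOLEAN",
}


def build_create_table(table_name, column_dtypes):
    """Return a CREATE TABLE statement for a {column: dtype name} mapping."""
    columns = []
    for column, dtype in column_dtypes.items():
        sql_type = DTYPE_TO_SQL.get(dtype, "TEXT")  # unknown dtypes fall back to TEXT
        columns.append(f'    "{column}" {sql_type}')
    body = ",\n".join(columns)
    return f"CREATE TABLE IF NOT EXISTS {table_name} (\n{body}\n);"


# Three columns taken from the dataset, with the dtypes reported in the log.
example_dtypes = {"id": "int64", "merchant": "object", "amt": "float64"}
ddl = build_create_table("dataraw_testing_table", example_dtypes)
print(ddl)
```

With a real DataFrame, the input mapping could be built with `{col: str(dtype) for col, dtype in df_test.dtypes.items()}`.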

In [6]:
from connections.db import DB
db = DB()

In [8]:
# Create schema
db.execute("../sql/sample/schema_clean.sql", fetch_results=False)

2024-09-26 00:09:24,952 - ✔ Connected to database
2024-09-26 00:09:25,121 - ✔ Query executed
2024-09-26 00:09:25,122 - ✔ Cursor closed
2024-09-26 00:09:25,122 - ✔ Connection closed


In [9]:
# Seed data by executing the seed data script in batches
db.execute_in_batches("../sql/sample/seed_data_clean.sql", batch_size=20000)

2024-09-26 00:09:28,177 - ✔ Connected to database
2024-09-26 00:09:41,245 - ✔ Executed a batch of 20000 records
2024-09-26 00:09:55,942 - ✔ Executed a batch of 20000 records
2024-09-26 00:10:07,761 - ✔ Executed a batch of 20000 records
2024-09-26 00:10:20,018 - ✔ Executed a batch of 20000 records
2024-09-26 00:10:32,053 - ✔ Executed a batch of 20000 records
2024-09-26 00:10:45,684 - ✔ Executed a batch of 20000 records
2024-09-26 00:11:04,935 - ✔ Executed a batch of 20000 records
2024-09-26 00:11:17,581 - ✔ Executed a batch of 20000 records
2024-09-26 00:11:29,545 - ✔ Executed a batch of 20000 records
2024-09-26 00:11:41,551 - ✔ Executed a batch of 20000 records
2024-09-26 00:12:03,535 - ✔ Executed a batch of 20000 records
2024-09-26 00:12:15,463 - ✔ Executed a batch of 20000 records
2024-09-26 00:12:27,944 - ✔ Executed a batch of 20000 records
2024-09-26 00:12:39,993 - ✔ Executed a batch of 20000 records
2024-09-26 00:12:59,700 - ✔ Executed a batch of 20000 records
2024-09-26 00:13:14,
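
Batching keeps each transaction bounded instead of sending every insert in a single request. The implementation of `db.execute_in_batches` is not shown here, so the following is a sketch of the technique only, using an in-memory SQLite database as a stand-in. The `run_in_batches` helper, the `demo` table, and the assumption of one record per statement are all hypothetical.

```python
import sqlite3


def run_in_batches(conn, statements, batch_size):
    """Execute statements in groups of batch_size, committing after each group."""
    sizes = []
    for start in range(0, len(statements), batch_size):
        batch = statements[start:start + batch_size]
        cursor = conn.cursor()
        for statement in batch:
            cursor.execute(statement)
        conn.commit()  # one commit per batch
        cursor.close()
        sizes.append(len(batch))
    return sizes


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER, amt REAL)")

# 95 single-row inserts split into batches of 20: the last batch is partial.
statements = [f"INSERT INTO demo (id, amt) VALUES ({i}, {i * 1.5})" for i in range(95)]
batch_sizes = run_in_batches(conn, statements, batch_size=20)
loaded = conn.execute("SELECT COUNT(*) FROM demo").fetchone()[0]

print(batch_sizes)
print(loaded)
```

Under the same one-record-per-statement assumption, a sample of 473558 rows with `batch_size=20000` would need 23 full batches plus a final batch of 13558 records.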

In [11]:
# Query the tables to verify that the data has been inserted
db.execute("../sql/queries/01_view_tables_sizes.sql", fetch_results=True)

2024-09-26 00:19:22,464 - ✔ Connected to database
2024-09-26 00:19:22,618 - ✔ Query executed
2024-09-26 00:19:22,619 - ✔ Cursor closed
2024-09-26 00:19:22,620 - ✔ Connection closed


[('public.dataraw_testing_table', 440000)]
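
The third task calls for validating that all records were loaded. In the run above, the query reports 440000 rows, while the sample printed earlier had 473558 rows. Whether the reported figure is an exact count or an estimate depends on `01_view_tables_sizes.sql`, which is not shown here. A minimal sketch of an explicit check, assuming an exact `COUNT(*)` is affordable; the `check_row_count` helper and the SQLite stand-in are hypothetical.

```python
import sqlite3


def check_row_count(conn, table_name, expected_rows):
    """Return (matches, actual_rows) using an exact COUNT(*) on the table."""
    # table_name is interpolated directly, so only pass trusted identifiers.
    actual_rows = conn.execute(f"SELECT COUNT(*) FROM {table_name}").fetchone()[0]
    return actual_rows == expected_rows, actual_rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER)")
conn.executemany("INSERT INTO demo (id) VALUES (?)", [(i,) for i in range(3)])
conn.commit()

ok, actual = check_row_count(conn, "demo", expected_rows=3)
print(ok, actual)

# A mismatch is reported instead of being silently accepted.
ok_mismatch, actual_mismatch = check_row_count(conn, "demo", expected_rows=5)
print(ok_mismatch, actual_mismatch)
```

In the notebook, `expected_rows` would be `len(df_test)`, and the connection would come from the project's own connector.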

---

#### **Results**:

1. We extracted a sample representing 45% of our cleaned and processed data.
2. We created a raw table from that sample.
3. We uploaded the sample to the database to support the tests performed in the following notebooks.

---