#### **First Step**: Querying data from the database

Task:

- Establish connection to the database
- Load the data into a DataFrame (`df`) for cleansing

In [48]:
import pandas as pd
import sys
import os

# Add the 'src' folder to sys.path
sys.path.append(os.path.abspath(os.path.join('..', 'src')))

In [49]:
from connections.db import DB

db = DB()

In [3]:
# Fetch the data from the database as a dataframe
df = db.fetch_as_dataframe('../sql/queries/004_get_raw_data.sql')

2024-08-26 22:02:33,747 - ✔ Connected to database
2024-08-26 22:03:39,892 - ✔ Data loaded into DataFrame
2024-08-26 22:03:40,060 - ✔ Cursor closed
2024-08-26 22:03:40,061 - ✔ Connection closed


---

#### **Second Step**: Transformation Process

Task:

- Column `Unnamed: 0` should be renamed to `id`.
- The `trans_date_trans_time` column should be loaded as `datetime`.
- The `dob` column should be loaded as `datetime`.
- Column `cc_num` should be loaded as string.
- Remove the `unix_time`, `city_pop`, `lat`, `long`, `merch_lat`, and `merch_long` columns.
- Remove records with null values.
- Convert `is_fraud` to boolean.
- Calculate the `age` of the customers and store it in a new column.
- Remove records whose `age` is 21 or less.

In [56]:
# Rename the column 'Unnamed: 0' to 'id'
df = df.rename(columns={'Unnamed: 0': 'id'})

In [57]:
# Convert the 'trans_date_trans_time' and 'dob' columns to datetime
df['trans_date_trans_time'] = pd.to_datetime(df['trans_date_trans_time'])
df['dob'] = pd.to_datetime(df['dob'])
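
When parsing dates it can help to check how many values fail to convert. The sketch below uses a small made-up Series and assumes an ISO `YYYY-MM-DD` format (the real column's format is not shown here); `errors="coerce"` turns unparseable values into `NaT` so they can be counted.

```python
import pandas as pd

# Made-up sample; the second value is deliberately invalid
sample = pd.DataFrame({"dob": ["1988-03-09", "not a date", "2001-12-31"]})

# Unparseable values become NaT instead of raising an exception
parsed = pd.to_datetime(sample["dob"], format="%Y-%m-%d", errors="coerce")

# Number of values that could not be parsed
n_failed = int(parsed.isna().sum())
print(f"{n_failed} value(s) could not be parsed")
```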

In [58]:
# Convert column 'cc_num' to string
df['cc_num'] = df['cc_num'].astype(str)

In [59]:
# Remove unused columns 'unix_time', 'city_pop', 'lat', 'long', 'merch_lat', 'merch_long'
df = df.drop(columns=['unix_time', 'city_pop', 'lat', 'long', 'merch_lat', 'merch_long'])

In [60]:
# Remove the null values
df = df.dropna()
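
Before dropping rows, it may be worth looking at where the nulls are and how many rows will be lost. A minimal sketch on a made-up frame (the column names are only illustrative):

```python
import pandas as pd

# Made-up sample with one null in each column
sample = pd.DataFrame({"merchant": ["a", None, "c"], "amt": [1.0, 2.0, None]})

# Null count per column
null_counts = sample.isna().sum()

# Rows removed by dropna(): any row with at least one null
rows_before = len(sample)
cleaned = sample.dropna()
rows_dropped = rows_before - len(cleaned)
print(null_counts)
print(f"Dropped {rows_dropped} of {rows_before} rows")
```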

In [61]:
# Convert 'is_fraud' to boolean type
df['is_fraud'] = df['is_fraud'].astype(bool)
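
`astype(bool)` maps every non-zero value to `True`, so an unexpected code such as `2` would silently become `True`. A small sketch of a guard, assuming the flag is expected to contain only `0` and `1`:

```python
import pandas as pd

# Made-up sample of the integer flag
flags = pd.Series([0, 1, 1, 0], name="is_fraud")

# Fail early if anything other than 0 or 1 is present
if not flags.isin([0, 1]).all():
    raise ValueError("is_fraud contains values other than 0 and 1")

as_bool = flags.astype(bool)
```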

In [62]:
# Create the customer age column (difference in calendar years)
df['age'] = df['trans_date_trans_time'].dt.year - df['dob'].dt.year
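
Subtracting calendar years ignores whether the birthday has already occurred in the transaction year, so the result can be one year too high. If exact ages matter, one possible refinement is sketched below on made-up dates; `age_exact` is a hypothetical column name, not one used in this notebook.

```python
import pandas as pd

demo = pd.DataFrame({
    "trans_date_trans_time": pd.to_datetime(["2019-06-01 10:00:00", "2019-06-01 10:00:00"]),
    "dob": pd.to_datetime(["1998-03-15", "1998-09-15"]),
})

trans = demo["trans_date_trans_time"]
dob = demo["dob"]

# Plain difference in calendar years (what the cell above computes)
year_diff = trans.dt.year - dob.dt.year

# True where the birthday has not yet occurred in the transaction year
before_birthday = (trans.dt.month < dob.dt.month) | (
    (trans.dt.month == dob.dt.month) & (trans.dt.day < dob.dt.day)
)

# Subtract one year for customers whose birthday is still ahead
demo["age_exact"] = year_diff - before_birthday.astype(int)
```

In this sample both rows have a year difference of 21, but the second customer is still 20 on the transaction date.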

In [63]:
# Keep only the records of customers older than 21
df = df[df['age'] > 21]

#### **Third Step**: Upload data to database

Task:

- Import the `DB` class to use the connector.
- Establish connection and execute the queries to create the schema and send the data.
- Validate that the table has been created and that all records have been loaded.

In [64]:
from utils.pysqlschema import SQLSchemaGenerator

generator = SQLSchemaGenerator(table_name='credit_card_transactions_clean')
generator.generate_schema(df, '../sql/schema_clean.sql')
generator.generate_seed_data(df, '../sql/seed_data_clean.sql')

2024-08-26 22:56:51,836 - Generating schema for credit_card_transactions_clean
2024-08-26 22:56:51,836 - Infering SQL type for int64
2024-08-26 22:56:51,840 - Infering SQL type for datetime64[ns]
2024-08-26 22:56:51,840 - Infering SQL type for object
2024-08-26 22:56:51,840 - Infering SQL type for object
2024-08-26 22:56:51,840 - Infering SQL type for object
2024-08-26 22:56:51,840 - Infering SQL type for float64
2024-08-26 22:56:51,845 - Infering SQL type for object
2024-08-26 22:56:51,845 - Infering SQL type for object
2024-08-26 22:56:51,845 - Infering SQL type for object
2024-08-26 22:56:51,847 - Infering SQL type for object
2024-08-26 22:56:51,847 - Infering SQL type for object
2024-08-26 22:56:51,849 - Infering SQL type for object
2024-08-26 22:56:51,849 - Infering SQL type for int64
2024-08-26 22:56:51,852 - Infering SQL type for object
2024-08-26 22:56:51,853 - Infering SQL type for datetime64[ns]
2024-08-26 22:56:51,854 - Infering SQL type for object
2024-08-26 22:56:51,854 - 
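
The log above shows one SQL type being inferred per column dtype. The implementation of `SQLSchemaGenerator` is not shown in this notebook; the sketch below is only an assumption of how such a mapping could look, with hypothetical helpers `infer_sql_type` and `build_create_table` and PostgreSQL-style type names.

```python
import pandas as pd
from pandas.api import types as ptypes


def infer_sql_type(dtype):
    """Hypothetical mapping from a pandas dtype to a PostgreSQL-style type name."""
    if ptypes.is_bool_dtype(dtype):
        return "BOOLEAN"
    if ptypes.is_integer_dtype(dtype):
        return "BIGINT"
    if ptypes.is_float_dtype(dtype):
        return "DOUBLE PRECISION"
    if ptypes.is_datetime64_any_dtype(dtype):
        return "TIMESTAMP"
    return "TEXT"  # fallback for object/string columns


def build_create_table(df, table_name):
    """Build a CREATE TABLE statement from the DataFrame's column dtypes."""
    columns = ",\n".join(
        f'    "{name}" {infer_sql_type(dtype)}' for name, dtype in df.dtypes.items()
    )
    return f"CREATE TABLE {table_name} (\n{columns}\n);"


demo = pd.DataFrame({
    "id": [1, 2],
    "trans_date_trans_time": pd.to_datetime(["2019-01-01", "2019-01-02"]),
    "merchant": ["a", "b"],
    "amt": [1.5, 2.5],
    "is_fraud": [True, False],
})
ddl = build_create_table(demo, "credit_card_transactions_clean")
print(ddl)
```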

In [65]:
from connections.db import DB
db = DB()

In [66]:
# Create schema
db.execute("../sql/schema_clean.sql", fetch_results=False)

2024-08-26 23:01:44,663 - ✔ Connected to database
2024-08-26 23:01:44,819 - ✔ Query executed
2024-08-26 23:01:44,819 - ✔ Cursor closed
2024-08-26 23:01:44,823 - ✔ Connection closed


In [67]:
# Seed data by executing the seed data script in batches
db.execute_in_batches("../sql/seed_data_clean.sql", batch_size=20000)

2024-08-26 23:01:59,927 - ✔ Connected to database
2024-08-26 23:02:12,116 - ✔ Executed a batch of 20000 records
2024-08-26 23:02:23,299 - ✔ Executed a batch of 20000 records
2024-08-26 23:02:34,703 - ✔ Executed a batch of 20000 records
2024-08-26 23:02:45,818 - ✔ Executed a batch of 20000 records
2024-08-26 23:02:57,664 - ✔ Executed a batch of 20000 records
2024-08-26 23:03:17,063 - ✔ Executed a batch of 20000 records
2024-08-26 23:03:28,393 - ✔ Executed a batch of 20000 records
2024-08-26 23:03:39,520 - ✔ Executed a batch of 20000 records
2024-08-26 23:03:50,597 - ✔ Executed a batch of 20000 records
2024-08-26 23:04:04,328 - ✔ Executed a batch of 20000 records
2024-08-26 23:04:20,537 - ✔ Executed a batch of 20000 records
2024-08-26 23:04:32,327 - ✔ Executed a batch of 20000 records
2024-08-26 23:04:43,518 - ✔ Executed a batch of 20000 records
2024-08-26 23:04:55,127 - ✔ Executed a batch of 20000 records
2024-08-26 23:05:12,623 - ✔ Executed a batch of 20000 records
2024-08-26 23:05:25,
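
Loading in batches keeps each transaction small and makes progress visible in the log. `DB.execute_in_batches` reads a seed script, and its implementation is not shown here; the sketch below illustrates the same general idea with a hypothetical `insert_in_batches` helper that takes a list of rows and an in-memory SQLite table.

```python
import sqlite3


def insert_in_batches(conn, rows, batch_size):
    """Hypothetical helper: insert rows in fixed-size batches, committing after each."""
    sql = "INSERT INTO demo (id, amt) VALUES (?, ?)"
    batches = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        conn.executemany(sql, batch)
        conn.commit()
        batches += 1
    return batches


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER, amt REAL)")

# Ten made-up rows loaded in batches of four: 4 + 4 + 2
rows = [(i, float(i)) for i in range(10)]
n_batches = insert_in_batches(conn, rows, batch_size=4)
n_rows = conn.execute("SELECT COUNT(*) FROM demo").fetchone()[0]
conn.close()
```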

In [68]:
# Query the tables to verify that the data has been inserted
db.execute("../sql/queries/003_view_tables_sizes.sql", fetch_results=True)

2024-08-26 23:16:25,941 - ✔ Connected to database
2024-08-26 23:16:26,210 - ✔ Query executed
2024-08-26 23:16:26,210 - ✔ Cursor closed
2024-08-26 23:16:26,210 - ✔ Connection closed


[('public.credit_card_transactions_clean', 1050668)]
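
To validate that all records were loaded, the row count reported by the database can be compared with `len(df)`. A minimal sketch of that check, assuming an in-memory SQLite database and a made-up frame in place of the real connection and data:

```python
import sqlite3

import pandas as pd

# Made-up frame standing in for the cleaned DataFrame
df_demo = pd.DataFrame({"id": [1, 2, 3], "amt": [1.0, 2.0, 3.0]})

conn = sqlite3.connect(":memory:")
df_demo.to_sql("credit_card_transactions_clean", conn, index=False)

# Compare the table's row count with the DataFrame's length
loaded = conn.execute(
    "SELECT COUNT(*) FROM credit_card_transactions_clean"
).fetchone()[0]
conn.close()

if loaded != len(df_demo):
    raise RuntimeError(f"Expected {len(df_demo)} rows, found {loaded}")
```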

---

#### **Results**:

1. Result 1: The column `Unnamed: 0` was renamed to `id` for readability and a clearer structure.
2. Result 2: The columns `dob` and `trans_date_trans_time` were converted to datetime because they had been loaded in the wrong format.
3. Result 3: The column `cc_num` was converted to string type, since it is an identifier and no statistical analysis will be performed on it.
4. Result 4: The columns `unix_time`, `city_pop`, `lat`, `long`, `merch_lat`, and `merch_long` were removed because they do not add value or contain erroneous information.
5. Result 5: The column `is_fraud` was converted to boolean; it was previously stored as an integer.
6. Result 6: The column `age` was added. The purpose is to filter out transactions made by younger customers, using 21 years as the age threshold adopted in this analysis for adulthood in the U.S. context.

---