# EXTRACTION-TRANSFORMATION-LOADING (ETL) 
* ETL is the process of obtaining data from various sources, transforming it, and loading it into a destination designed to support analysis.

The first step in the ETL process is to extract data from different sources, which can include databases, flat files, APIs, and more. The goal is to gather all relevant data needed for analysis.
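As a rough illustration of pulling from more than one kind of source, the sketch below reads a flat file and a database table into one list of records. It uses only the standard library; the in-memory CSV, the SQLite table `events`, and the helper names `extract_csv` and `extract_sqlite` are assumptions made for this example, not part of the banking dataset built below.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file source: an in-memory CSV standing in for a real file.
csv_source = io.StringIO(
    "user_id,event\n"
    "user_1,deposit\n"
    "user_2,login\n"
)

# Hypothetical database source: an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, event TEXT)")
conn.execute("INSERT INTO events VALUES ('user_3', 'withdraw')")
conn.commit()

def extract_csv(handle):
    """Read every row of a CSV source into a list of dicts."""
    return [dict(row) for row in csv.DictReader(handle)]

def extract_sqlite(connection, query):
    """Run a query against SQLite and return the rows as a list of dicts."""
    connection.row_factory = sqlite3.Row
    return [dict(row) for row in connection.execute(query)]

# Combine both sources and tag each record with where it came from.
extracted = (
    [{**row, "source": "csv"} for row in extract_csv(csv_source)]
    + [{**row, "source": "sqlite"}
       for row in extract_sqlite(conn, "SELECT user_id, event FROM events")]
)
print(f"Extracted {len(extracted)} records from 2 sources.")
```

Tagging each record with its source is a design choice that makes later debugging easier; an API source could be added in the same way with one more extractor function.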

*Simulating Banking Data*

In [16]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import random

# simulating banking data
events = ['deposit', 'withdraw', 'loan_application', 'atm_withdrawal', 'login']
users = ['user_'+str(i) for i in range(1, 50)]  # user_1 through user_49
data = []

# start date for the simulation
start_date = datetime(2025, 5, 1)
end_date = datetime(2025, 5, 31)
delta = (end_date - start_date).days + 1  # number of days, including the end date

for i in range(delta):
    date = start_date + timedelta(days=i)
    for _ in range(random.randint(3, 6)):  # 3–6 events per day
        # each event gets a random user, event type, and time of day
        data.append({
            'user_id': random.choice(users),
            'event': random.choice(events),
            'timestamp': (date + timedelta(hours=random.randint(0, 23), minutes=random.randint(0, 59))).isoformat()
        })
banking_data = pd.DataFrame(data)
banking_data.to_csv('banking_data_large.csv', index=False)
banking_data.tail()

|     | user_id | event | timestamp |
| --- | --- | --- | --- |
| 128 | user_14 | loan_application | 2025-05-30T17:34:00 |
| 129 | user_14 | atm_withdrawal | 2025-05-31T07:25:00 |
| 130 | user_2 | loan_application | 2025-05-31T17:15:00 |
| 131 | user_11 | deposit | 2025-05-31T06:41:00 |
| 132 | user_21 | login | 2025-05-31T23:32:00 |


## Full Extraction
A Full Extraction means that every time you run the ETL process, you extract all records from the data source, regardless of whether they have changed or not.

In [14]:
# loading the banking data
banking_data = pd.read_csv("banking_data_large.csv", parse_dates=["timestamp"])
print(f"Pulled {len(banking_data)} rows via full extraction.")
banking_data.head()

Pulled 140 rows via full extraction.


|     | user_id | event | timestamp |
| --- | --- | --- | --- |
| 0 | user_48 | atm_withdrawal | 2025-05-01 19:48:00 |
| 1 | user_28 | withdraw | 2025-05-01 18:52:00 |
| 2 | user_18 | withdraw | 2025-05-01 04:16:00 |
| 3 | user_19 | deposit | 2025-05-01 16:28:00 |
| 4 | user_46 | login | 2025-05-01 02:23:00 |
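To make the cost of a full extraction concrete, here is a minimal sketch that runs one twice against a small source. The in-memory SQLite table, its schema, and the helper `full_extract` are assumptions for illustration; the point is only that the second run re-reads rows that did not change.

```python
import sqlite3

# Hypothetical source table; the schema is an assumption for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, event TEXT, timestamp TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("user_1", "deposit", "2025-05-01T09:00:00"),
        ("user_2", "login", "2025-05-01T10:30:00"),
    ],
)

def full_extract(connection):
    """Return every row in the source, with no filter on change time."""
    query = "SELECT user_id, event, timestamp FROM events"
    return connection.execute(query).fetchall()

first_run = full_extract(conn)

# One new event arrives between runs.
conn.execute(
    "INSERT INTO events VALUES ('user_3', 'withdraw', '2025-05-02T08:15:00')"
)

second_run = full_extract(conn)

# Rows present in both runs were extracted twice even though nothing changed.
reread = len(set(first_run) & set(second_run))
print(f"Run 1: {len(first_run)} rows, run 2: {len(second_run)} rows, "
      f"{reread} re-read unchanged.")
```

This is simple and always yields a complete copy, which suits small sources; the repeated work grows with the size of the source.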


## Incremental Extraction
An Incremental Extraction means that you only extract records that have changed since the last extraction. This is often done by using timestamps or change data capture techniques to identify new or modified records.

In [15]:
import os

# Create the tracking file only on the first run, so later runs keep their checkpoint
if not os.path.exists("last_extraction.txt"):
    with open("last_extraction.txt", "w") as f:
        f.write("2025-05-15 12:00:00")  # Initial checkpoint

# Read the last extraction timestamp
with open("last_extraction.txt", "r") as f:
    last_extraction = f.read().strip()

# Load the banking dataset
df = pd.read_csv("banking_data_large.csv", parse_dates=["timestamp"])

# Keep only records with a timestamp after the checkpoint
last_extraction_time = pd.to_datetime(last_extraction)
df_incremental = df[df['timestamp'] > last_extraction_time]

# Display results
print(f"Extracted {len(df_incremental)} rows incrementally since {last_extraction}.")
df_incremental.head()

# Advance the checkpoint only when new rows were extracted
if not df_incremental.empty:
    new_checkpoint = df_incremental['timestamp'].max().isoformat()
    with open("last_extraction.txt", "w") as f:
        f.write(new_checkpoint)
    print(f"Updated last extraction timestamp to {new_checkpoint}.")


Extracted 77 rows incrementally since 2025-05-15 12:00:00.
Updated last extraction timestamp to 2025-05-31T21:58:00.
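A timestamp filter finds new rows, but it can miss rows that were modified in place if the timestamp records when the event happened rather than when the row was last updated. One simple alternative is to compare row fingerprints between runs. The sketch below is a snapshot comparison using hashes, not log-based change data capture as production tools usually implement it. The `event_id` and `amount` fields and the helpers `row_hash` and `detect_changes` are hypothetical and do not appear in the dataset above.

```python
import hashlib

def row_hash(row):
    """Hash a row's values so any change to the row changes its fingerprint."""
    payload = "|".join(f"{key}={row[key]}" for key in sorted(row))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def detect_changes(previous_hashes, current_rows, key="event_id"):
    """Return rows that are new or modified, plus the updated hash snapshot."""
    changed = []
    snapshot = {}
    for row in current_rows:
        digest = row_hash(row)
        snapshot[row[key]] = digest
        # A missing key means a new row; a different digest means a modified row.
        if previous_hashes.get(row[key]) != digest:
            changed.append(row)
    return changed, snapshot

# Hypothetical data: event_id and amount are assumptions for this example.
run_1 = [
    {"event_id": 1, "user_id": "user_1", "event": "deposit", "amount": 100},
    {"event_id": 2, "user_id": "user_2", "event": "withdraw", "amount": 40},
]
run_2 = [
    {"event_id": 1, "user_id": "user_1", "event": "deposit", "amount": 100},  # unchanged
    {"event_id": 2, "user_id": "user_2", "event": "withdraw", "amount": 60},  # modified
    {"event_id": 3, "user_id": "user_3", "event": "login", "amount": 0},      # new
]

_, snapshot = detect_changes({}, run_1)
changed, snapshot = detect_changes(snapshot, run_2)
print([row["event_id"] for row in changed])  # → [2, 3]
```

This approach still has to read every source row to compute the hashes, and as written it does not report deleted rows, so it trades extraction cost for simplicity.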
