## Full Extraction
Full extraction pulls **all** data from a source system every time the extraction process runs, regardless of whether the data has changed.

**Characteristics:**
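- Simpler to implement, since no change tracking is required
- Keeps the target consistent with the source as of each run, provided the target is fully replaced (updates and deletions are both picked up)
- Resource-intensive: every row is read and processed on every run
- Slower, especially with large datasets
- May require more storage space, for example if each full snapshot is retained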

**Use cases:** Small datasets, when source systems don't support change tracking, initial loads

Let's fully extract all the records found in our source.

In [174]:
import pandas as pd

# Load full dataset
df = pd.read_csv("google_5yr_one.csv")

# Display basic stats
print(f"Extracted {len(df)} rows fully.")
print("Shape:", df.shape)

Extracted 1256 rows fully.
Shape: (1256, 6)
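
The consistency benefit of full extraction depends on the target being replaced wholesale on each run. The sketch below illustrates that idea with an in-memory SQLite database. The `prices` table, its two columns, and the sample values are hypothetical stand-ins, not the actual layout of `google_5yr_one.csv`.

```python
import sqlite3

def full_refresh(conn, rows):
    """Replace the entire target table with the freshly extracted rows."""
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS prices (date TEXT PRIMARY KEY, close REAL)"
        )
        conn.execute("DELETE FROM prices")  # drop everything loaded previously
        conn.executemany("INSERT INTO prices (date, close) VALUES (?, ?)", rows)

# Hypothetical target database and sample rows
conn = sqlite3.connect(":memory:")
full_refresh(conn, [("2025-01-02", 190.0), ("2025-01-03", 192.5)])

# Second run: the source lost 2025-01-02 and gained 2025-01-06
full_refresh(conn, [("2025-01-03", 192.5), ("2025-01-06", 195.1)])

target_dates = [row[0] for row in conn.execute("SELECT date FROM prices ORDER BY date")]
print(target_dates)  # ['2025-01-03', '2025-01-06']
```

Because the table is emptied and reloaded inside one transaction, the row deleted at the source also disappears from the target without any change tracking.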


## Incremental Extraction
Incremental extraction only retrieves data that has **changed** since the last extraction.

**Characteristics:**
- More complex to implement (requires change tracking)
- More efficient (processes only changed data)
- Faster execution
- Less resource-intensive
- Requires reliable change tracking mechanisms

**Types of incremental extraction:**
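1. **Date/time-based:** Selects rows whose timestamp column is later than the last extraction time
2. **Version number-based:** Selects rows whose version or sequence number is higher than the last one processed
3. **Log-based:** Reads changes from the database transaction log
4. **Trigger-based:** Uses database triggers to record changes as they happen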

**Use cases:** Large datasets, frequent updates, when source systems support change tracking
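
The notebook cells that follow demonstrate the date/time-based approach. For comparison, here is a minimal sketch of the version number-based variant, assuming the source exposes a monotonically increasing `row_version` column. The `orders` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical source table with a version column that increases on every change
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, row_version INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "shipped", 101), (2, "pending", 104), (3, "new", 105)],
)

def extract_since(conn, last_version):
    """Return rows changed after last_version, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, status, row_version FROM orders "
        "WHERE row_version > ? ORDER BY row_version",
        (last_version,),
    ).fetchall()
    # Keep the old watermark when nothing changed
    new_watermark = rows[-1][2] if rows else last_version
    return rows, new_watermark

changed, watermark = extract_since(conn, 102)
print(changed)    # [(2, 'pending', 104), (3, 'new', 105)]
print(watermark)  # 105
```

The stored watermark plays the same role as the timestamp in `last_extraction.txt`, but it does not depend on clocks being in sync between systems.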

First, let's create a text file to track the date of our last extraction. Let's say the last extraction ran on January 1st, 2025.

In [176]:
from datetime import datetime

# Save the last extraction timestamp to last_extraction.txt
last_extraction = datetime(2025, 1, 1, 0, 0, 0).strftime("%Y-%m-%d %H:%M:%S")
with open("last_extraction.txt", "w") as f:
    f.write(last_extraction)
print("Last extraction time saved as 'last_extraction.txt'")

Last extraction time saved as 'last_extraction.txt'


**Now we can extract the data recorded after our last extraction, which was on January 1st, 2025.**

In [180]:
from datetime import datetime

# Read last extraction timestamp
with open("last_extraction.txt", "r") as f:
    last_extraction_str = f.read().strip()

last_extraction = datetime.strptime(last_extraction_str, "%Y-%m-%d %H:%M:%S")

# Convert 'Date' column to datetime
df["Date"] = pd.to_datetime(df["Date"])

# Filter for rows after last extraction
new_data = df[df["Date"] > last_extraction]

print(f"Extracted {len(new_data)} rows incrementally since '{last_extraction}'.")
print("Shape:", new_data.shape)

Extracted 103 rows incrementally since '2025-01-01 00:00:00'.
Shape: (103, 6)


**Update the last extraction time so the next run starts from this point.**

In [188]:
# Save current timestamp to last_extraction.txt
now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
with open("last_extraction.txt", "w") as f:
    f.write(now)

print(f"Last extraction date updated to '{now}'.")

Last extraction date updated to '2025-06-10 16:30:17'.
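
Saving the current wall-clock time works for this demo, but it can skip rows that carry a date earlier than `now` and only reach the source afterwards. A common alternative is to advance the watermark to the latest date actually extracted. The sketch below assumes that approach. `next_watermark` and the sample dates are hypothetical and not part of the notebook above.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def next_watermark(extracted_dates, previous):
    """Return the latest extracted date as a string, or keep the previous watermark."""
    if len(extracted_dates) == 0:
        return previous  # nothing new was extracted, so do not move the watermark
    return max(extracted_dates).strftime(FMT)

# Hypothetical dates standing in for the "Date" values of an extracted batch
batch_dates = [datetime(2025, 1, 2), datetime(2025, 6, 9), datetime(2025, 3, 14)]

watermark = next_watermark(batch_dates, "2025-01-01 00:00:00")
print(watermark)  # 2025-06-09 00:00:00
```

The function relies only on `len()` and `max()`, so passing a pandas datetime column such as `new_data["Date"]` should work as well, although that is untested here.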
