## *LAB 3: Practicing Extraction in ETL*

### *Historical Description of the Car Sales Dataset*
- *This dataset contains simulated used car sales data covering two months (April–May 2025). It includes purchases by major automotive dealers, rental companies, and vehicle auction services such as AutoNation, CarMax, Enterprise Holdings, and Manheim.*

- *The dataset features popular car brands like Toyota, Honda, Ford, Hyundai, and Chevrolet, with model years ranging from 2015 to 2023. It records key details such as vehicle age, mileage, price, and payment types, offering a realistic view of market diversity and buyer preferences.*


#### *Key Attributes*

*`Date of Sale:` The date on which the car transaction occurred, showing when the vehicle was purchased.*

*`Dealer:` Represents the company or automotive group that sold or distributed the vehicle.*

*`Car Make and Model:` Describes the manufacturer (e.g., Toyota, Ford) and specific model of the vehicle, indicating brand preference and market variety.*

*`Manufacture Year:` Indicates the year the vehicle was made, providing context for its age and potential value depreciation.*

*`Odometer Reading:` Shows the total distance the vehicle had traveled before the sale, an important factor influencing its condition and price.*

*`Sale Price:` The amount (in USD) for which the vehicle was sold, indicating market demand and valuation.*

*`Payment Method:` Specifies how the transaction was completed (cash, credit, or loan), reflecting diverse financing choices.*

*`Last Updated:` The timestamp of the most recent update to the record, helping track edits or corrections.*
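
*The attribute names above are descriptive, while the generated CSV uses short column names. The mapping below is an assumption inferred from the column names in the dataset (for example, that `Dealer` corresponds to the `customer` column); `ATTRIBUTE_TO_COLUMNS` and `missing_columns` are hypothetical helper names used only for illustration.*

```python
# Assumed mapping from the descriptive attribute names to CSV column names.
ATTRIBUTE_TO_COLUMNS = {
    "Date of Sale": ["date"],
    "Dealer": ["customer"],
    "Car Make and Model": ["car_make", "car_model"],
    "Manufacture Year": ["year"],
    "Odometer Reading": ["mileage"],
    "Sale Price": ["price"],
    "Payment Method": ["payment_type"],
    "Last Updated": ["last_updated"],
}


def missing_columns(record):
    """Return the expected column names that are absent from one record."""
    expected = [col for cols in ATTRIBUTE_TO_COLUMNS.values() for col in cols]
    return [col for col in expected if col not in record]


# One record copied from the sample output of the dataset.
sample = {
    "id": 58160,
    "customer": "Copart",
    "date": "2025-04-02",
    "car_make": "Toyota",
    "car_model": "RAV4",
    "year": 2015,
    "mileage": 118857,
    "price": 8918,
    "payment_type": "Cash",
    "last_updated": "2025-04-02T23:57:00",
}

print("Missing columns:", missing_columns(sample))
```

*A check like this can catch a renamed or dropped column before any extraction logic runs.*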

#### Import libraries

In [30]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import random

### `STEP 1: Generate synthetic data`

In [31]:
# Generate Car Sales Dataset
dealers = ['AutoNation', 'CarMax', 'Penske Automotive', 'Lithia Motors', 'Sonic Automotive', 
           'Enterprise Holdings', 'Hertz Global Holdings', 'Avis Budget Group', 'Manheim', 'Copart']

car_inventory = {
    'Toyota': ['Corolla', 'Camry', 'RAV4'],
    'Honda': ['Civic', 'Accord', 'CR-V'],
    'Ford': ['Focus', 'Fusion', 'Escape'],
    'Hyundai': ['Elantra', 'Tucson', 'Santa Fe'],
    'Chevrolet': ['Malibu', 'Cruze', 'Equinox']
}

payment_methods = ['Cash', 'Credit', 'Loan']

sales_records = []
start_date = datetime(2025, 4, 1)
for day_offset in range(1, 61):
    current_date = start_date + timedelta(days=day_offset)
    for _ in range(random.randint(3, 6)):
        make = random.choice(list(car_inventory.keys()))
        model = random.choice(car_inventory[make])
        manufacture_year = random.randint(2015, 2023)
        odometer_reading = random.randint(10000, 120000)
        sale_price = random.randint(5000, 25000)
        sales_records.append({
            'id': random.randint(10000, 99999),
            'customer': random.choice(dealers),
            'date': current_date.date().isoformat(),
            'car_make': make,
            'car_model': model,
            'year': manufacture_year,
            'mileage': odometer_reading,
            'price': sale_price,
            'payment_type': random.choice(payment_methods),
            'last_updated': (current_date + timedelta(hours=random.randint(0, 23),
                                                      minutes=random.randint(0, 59))).isoformat()
        })

####  Create DataFrame

In [32]:
# Create DataFrame
df = pd.DataFrame(sales_records)

# Save to CSV
df.to_csv('car_sales_data_may_2025.csv', index=False)

# Display the first 10 records
df.head(10)


Unnamed: 0,id,customer,date,car_make,car_model,year,mileage,price,payment_type,last_updated
0,58160,Copart,2025-04-02,Toyota,RAV4,2015,118857,8918,Cash,2025-04-02T23:57:00
1,95195,Penske Automotive,2025-04-02,Ford,Escape,2019,32204,9082,Credit,2025-04-02T22:25:00
2,45498,Sonic Automotive,2025-04-02,Toyota,Camry,2022,15750,18686,Loan,2025-04-02T08:42:00
3,17218,Copart,2025-04-03,Ford,Fusion,2021,112832,8614,Cash,2025-04-03T21:15:00
4,72832,Sonic Automotive,2025-04-03,Chevrolet,Equinox,2017,28635,17484,Credit,2025-04-03T16:48:00
5,13931,CarMax,2025-04-03,Hyundai,Elantra,2022,22931,13564,Cash,2025-04-03T09:33:00
6,63815,Copart,2025-04-03,Chevrolet,Cruze,2021,114482,19676,Credit,2025-04-03T07:53:00
7,58243,Avis Budget Group,2025-04-03,Toyota,Camry,2023,41100,23754,Loan,2025-04-03T01:19:00
8,23412,Manheim,2025-04-04,Hyundai,Tucson,2020,64140,20911,Loan,2025-04-04T08:47:00
9,98906,Copart,2025-04-04,Honda,Civic,2016,37805,5136,Loan,2025-04-04T23:33:00


In [33]:
# Count number of unique days
unique_days = df['date'].nunique()
print(f"Number of unique days with sales records: {unique_days}")

Number of unique days with sales records: 60
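
*The count of 60 follows from the generation loop: `range(1, 61)` adds 1 to 60 days to 1 April 2025, so the sale dates run from 2 April through 31 May, and every day receives at least three records. A small standard-library check of that date arithmetic:*

```python
from datetime import datetime, timedelta

start_date = datetime(2025, 4, 1)

# Same offsets as the generation loop: 1, 2, ..., 60 days after the start date.
sale_dates = [(start_date + timedelta(days=offset)).date() for offset in range(1, 61)]

first_day = sale_dates[0]
last_day = sale_dates[-1]
print(f"{len(set(sale_dates))} unique days, from {first_day} to {last_day}")
```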


### `Section 1: Full Extraction`

**Full Extraction** *means retrieving the entire dataset from the data source every time the extraction process runs, without considering whether any data has changed since the last extraction. This approach ensures that you always have a complete and up-to-date copy of the dataset.*

####  `What is done:`
- *The entire dataset is read from the source (e.g., a CSV file, database).* 
- *All records, regardless of whether they are new, updated, or unchanged, are loaded into memory.* 
- *Basic information about the dataset (such as the number of rows and columns) and a sample of the data are optionally displayed to help verify the extraction and understand the data structure.*
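
*The steps above can be sketched without pandas, using only the standard library. This is a minimal illustration, not the lab's implementation: `full_extract` is a hypothetical helper name, and `SAMPLE_CSV` is an in-memory stand-in for the source file built from two rows of the sample output.*

```python
import csv
import io

# In-memory stand-in for the source CSV (assumption: same header as the lab file).
SAMPLE_CSV = """id,customer,date,car_make,car_model,year,mileage,price,payment_type,last_updated
58160,Copart,2025-04-02,Toyota,RAV4,2015,118857,8918,Cash,2025-04-02T23:57:00
95195,Penske Automotive,2025-04-02,Ford,Escape,2019,32204,9082,Credit,2025-04-02T22:25:00
"""


def full_extract(source):
    """Read every row from the source with no filtering at all."""
    reader = csv.DictReader(source)
    rows = list(reader)  # all records are loaded into memory
    return reader.fieldnames, rows


columns, rows = full_extract(io.StringIO(SAMPLE_CSV))
print(f"Pulled {len(rows)} rows and {len(columns)} columns via full extraction.")
print("First row:", rows[0])
```

*Because nothing is filtered, the cost of each run grows with the size of the source, which is the main trade-off against incremental extraction.*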


In [34]:
# FULL EXTRACTION: Load entire dataset
df_full = pd.read_csv("car_sales_data_may_2025.csv", parse_dates=["last_updated"])
print(f"Pulled {len(df_full)} rows via full extraction.")

# Show basic dataset info 
print(f"Dataset shape: {df_full.shape}") 

Pulled 259 rows via full extraction.
Dataset shape: (259, 10)


In [35]:
print("Columns:", df_full.columns.tolist())
df_full.head(5)

Columns: ['id', 'customer', 'date', 'car_make', 'car_model', 'year', 'mileage', 'price', 'payment_type', 'last_updated']


Unnamed: 0,id,customer,date,car_make,car_model,year,mileage,price,payment_type,last_updated
0,58160,Copart,2025-04-02,Toyota,RAV4,2015,118857,8918,Cash,2025-04-02 23:57:00
1,95195,Penske Automotive,2025-04-02,Ford,Escape,2019,32204,9082,Credit,2025-04-02 22:25:00
2,45498,Sonic Automotive,2025-04-02,Toyota,Camry,2022,15750,18686,Loan,2025-04-02 08:42:00
3,17218,Copart,2025-04-03,Ford,Fusion,2021,112832,8614,Cash,2025-04-03 21:15:00
4,72832,Sonic Automotive,2025-04-03,Chevrolet,Equinox,2017,28635,17484,Credit,2025-04-03 16:48:00


In [43]:
# Save the full extraction results to CSV
df_full.to_csv('full_extraction_output.csv', index=False)
print("Full extraction output saved as 'full_extraction_output.csv'.")


Full extraction output saved as 'full_extraction_output.csv'.


### `Section 2: Incremental Extraction`

*Extract only new or updated records since the last extraction, based on a timestamp.*

#### *`What is done:`*

- *Read the last extraction timestamp from a tracking file (`last_extraction.txt`).*

- *Load the entire dataset.*

- *Filter the dataset to include only records with a `last_updated` timestamp newer than the last extraction time.*

- *Show how many new/updated rows were extracted.*
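
*The filtering step can be illustrated on a few in-memory records. This is a sketch under stated assumptions: `incremental_extract` is a hypothetical helper name, the three records are made up for the example, and the comparison is strictly "newer than", so a record stamped exactly at the checkpoint is not extracted again.*

```python
from datetime import datetime

# Made-up records for illustration; only the timestamp matters here.
records = [
    {"id": 1, "last_updated": "2025-04-19T09:30:00"},  # before the checkpoint
    {"id": 2, "last_updated": "2025-04-20T12:00:00"},  # exactly at the checkpoint
    {"id": 3, "last_updated": "2025-04-21T07:13:00"},  # after the checkpoint
]


def incremental_extract(rows, last_extraction):
    """Keep only rows whose last_updated is strictly later than the checkpoint."""
    cutoff = datetime.fromisoformat(last_extraction)
    return [
        row for row in rows
        if datetime.fromisoformat(row["last_updated"]) > cutoff
    ]


new_rows = incremental_extract(records, "2025-04-20 12:00:00")
print(f"Pulled {len(new_rows)} new/updated rows:", [row["id"] for row in new_rows])
```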


#### `1. Create the tracking file: last_extraction.txt`

- *This file stores the timestamp of the last extraction, so the incremental process knows where to pick up.*



In [37]:
with open("last_extraction.txt", "w") as f:
    f.write("2025-04-20 12:00:00")

#### `2. Perform Incremental Extraction`
- Read the last extraction timestamp from the tracking file.

In [38]:
# INCREMENTAL EXTRACTION
with open("last_extraction.txt", "r") as f:
    last_extraction = f.read().strip()

#### *Load the full dataset with date parsing and convert the checkpoint to datetime*

In [39]:
df = pd.read_csv("car_sales_data_may_2025.csv", parse_dates=["last_updated"])
last_extraction_time = pd.to_datetime(last_extraction)

####  *Filter new or updated records since last extraction*

In [40]:
df_incremental = df[df['last_updated'] > last_extraction_time]
print(f"Pulled {len(df_incremental)} new/updated rows since {last_extraction}.")
df_incremental.head()

Pulled 183 new/updated rows since 2025-04-20 12:00:00.


Unnamed: 0,id,customer,date,car_make,car_model,year,mileage,price,payment_type,last_updated
76,29028,Penske Automotive,2025-04-20,Chevrolet,Cruze,2020,76406,9328,Credit,2025-04-20 17:15:00
77,46425,Sonic Automotive,2025-04-21,Honda,Civic,2018,17326,10482,Cash,2025-04-21 07:13:00
78,19893,Lithia Motors,2025-04-21,Chevrolet,Equinox,2020,14532,14172,Credit,2025-04-21 01:48:00
79,28699,Penske Automotive,2025-04-21,Ford,Fusion,2017,98981,22094,Cash,2025-04-21 23:36:00
80,17534,AutoNation,2025-04-21,Chevrolet,Cruze,2015,115453,5713,Cash,2025-04-21 00:26:00


In [None]:
# Save incremental extraction results to CSV
df_incremental.to_csv('incremental_extraction_output.csv', index=False)
print("Incremental extraction output saved as 'incremental_extraction_output.csv'.")

Incremental extraction output saved as 'incremental_extraction_output.csv'.


### `Update the last_extraction.txt`

### *Incremental Extraction*

- *First, we record when data was last extracted by saving a date and time in a file called `last_extraction.txt`.*
- *Next, we load the full dataset but keep only the rows whose `last_updated` value is later than that saved date and time.*
- *We then print how many new or updated records were found since the last extraction.*

### *Save New Timestamp*

- *After getting the new data, we update the saved date and time in `last_extraction.txt` to mark the most recent check.*
- *This helps us know where to start from next time so we only get new changes, not old data again.*
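
*The read-then-update cycle for the tracking file can be sketched as two small helpers. The names `read_checkpoint` and `write_checkpoint` are hypothetical, and two choices here are assumptions rather than part of the lab: a missing file falls back to an early default timestamp, and an empty extraction leaves the previous checkpoint unchanged. The example writes to a temporary directory so it does not overwrite the lab's own `last_extraction.txt`.*

```python
import os
import tempfile
from datetime import datetime


def read_checkpoint(path, default="1970-01-01T00:00:00"):
    """Return the saved checkpoint, or a default if no file exists yet."""
    if not os.path.exists(path):
        return datetime.fromisoformat(default)
    with open(path) as f:
        return datetime.fromisoformat(f.read().strip())


def write_checkpoint(path, timestamps, previous):
    """Save the latest extracted timestamp; keep the old one if nothing was extracted."""
    new_checkpoint = max(timestamps, default=previous)
    with open(path, "w") as f:
        f.write(new_checkpoint.isoformat())
    return new_checkpoint


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "last_extraction.txt")

    first = read_checkpoint(path)  # no file yet, so the default is used
    extracted = [datetime(2025, 5, 30, 8, 0), datetime(2025, 5, 31, 23, 57)]
    saved = write_checkpoint(path, extracted, first)

    reloaded = read_checkpoint(path)  # the next run starts from here
    unchanged = write_checkpoint(path, [], reloaded)  # empty run keeps the checkpoint

print(f"Checkpoint moved from {first} to {reloaded}")
```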


In [None]:
# Use the most recent last_updated value as the new checkpoint
new_checkpoint = df['last_updated'].max()
# Save it so the next run starts from this point

with open("last_extraction.txt", "w") as f:
    f.write(new_checkpoint.isoformat())
print(f"Updated last_extraction.txt to {new_checkpoint}")

Updated last_extraction.txt to 2025-05-31 23:57:00
