## *LAB 3: Practicing Extraction in ETL*

### *Historical Description of the Car Sales Dataset*
- *This dataset contains simulated used car sales data covering two months (April–May 2025). It includes purchases by major automotive dealers, rental companies, and vehicle auction services such as AutoNation, CarMax, Enterprise Holdings, and Manheim.*

- *The dataset features popular car brands like Toyota, Honda, Ford, Hyundai, and Chevrolet, with model years ranging from 2015 to 2023. It records key details such as vehicle age, mileage, price, and payment types, offering a realistic view of market diversity and buyer preferences.*


#### *Key Attributes*

*`Date of Sale:` The date on which the car transaction occurred, showing when the vehicle was purchased.*

*`Dealer:` Represents the company or automotive group that sold or distributed the vehicle.*

*`Car Make and Model:` Describes the manufacturer (e.g., Toyota, Ford) and specific model of the vehicle, indicating brand preference and market variety.*

*`Manufacture Year:` Indicates the year the vehicle was made, providing context for its age and potential value depreciation.*

*`Odometer Reading:` Shows the total distance the vehicle had traveled before the sale, an important factor influencing its condition and price.*

*`Sale Price:` The amount (in USD) for which the vehicle was sold, indicating market demand and valuation.*

*`Payment Method:` Specifies how the transaction was completed—via cash, credit, or loan—reflecting diverse financing choices.*

*`Last Updated:` The timestamp of the most recent update to the record, helping track edits or corrections.*

#### Import libraries

In [16]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import random

### `STEP 1: Generate synthetic data`

In [17]:
# Generate Car Sales Dataset
dealers = ['AutoNation', 'CarMax', 'Penske Automotive', 'Lithia Motors', 'Sonic Automotive', 
           'Enterprise Holdings', 'Hertz Global Holdings', 'Avis Budget Group', 'Manheim', 'Copart']

car_inventory = {
    'Toyota': ['Corolla', 'Camry', 'RAV4'],
    'Honda': ['Civic', 'Accord', 'CR-V'],
    'Ford': ['Focus', 'Fusion', 'Escape'],
    'Hyundai': ['Elantra', 'Tucson', 'Santa Fe'],
    'Chevrolet': ['Malibu', 'Cruze', 'Equinox']
}

payment_methods = ['Cash', 'Credit', 'Loan']

sales_records = []
start_date = datetime(2025, 4, 1)
# Offsets 1..60 yield sale dates from 2025-04-02 through 2025-05-31 (60 days)
for day_offset in range(1, 61):
    current_date = start_date + timedelta(days=day_offset)
    for _ in range(random.randint(3, 6)):
        make = random.choice(list(car_inventory.keys()))
        model = random.choice(car_inventory[make])
        manufacture_year = random.randint(2015, 2023)
        odometer_reading = random.randint(10000, 120000)
        sale_price = random.randint(5000, 25000)

        # Introduce missing values with 10% chance each
        customer = random.choice(dealers)
        if random.random() < 0.10:
            customer = None

        mileage = odometer_reading
        if random.random() < 0.10:
            mileage = None

        payment_type = random.choice(payment_methods)
        if random.random() < 0.10:
            payment_type = None

        sales_records.append({
            'id': random.randint(10000, 99999),
            'customer': customer,
            'date': current_date.date().isoformat(),
            'car_make': make,
            'car_model': model,
            'year': manufacture_year,
            'mileage': mileage,
            'price': sale_price,
            'payment_type': payment_type,
            'last_updated': (current_date + timedelta(hours=random.randint(0, 23),
                                                      minutes=random.randint(0, 59))).isoformat()
        })


#### Create DataFrame

In [18]:
# Create DataFrame
df = pd.DataFrame(sales_records)

# Save to CSV
df.to_csv('car_sales_data_may_2025.csv', index=False)

# Display the first 10 records
df.head(10)


| | id | customer | date | car_make | car_model | year | mileage | price | payment_type | last_updated |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 82510 | CarMax | 2025-04-02 | Honda | Civic | 2021 | 106136.0 | 5386 | Credit | 2025-04-02T05:06:00 |
| 1 | 38041 | Lithia Motors | 2025-04-02 | Ford | Escape | 2019 | 16572.0 | 18737 | Loan | 2025-04-02T06:33:00 |
| 2 | 46557 | CarMax | 2025-04-02 | Toyota | Corolla | 2021 | 45535.0 | 10101 | Loan | 2025-04-02T17:50:00 |
| 3 | 14586 | Penske Automotive | 2025-04-02 | Chevrolet | Cruze | 2019 | 47288.0 | 9671 | Cash | 2025-04-02T04:59:00 |
| 4 | 46220 | Manheim | 2025-04-02 | Honda | CR-V | 2021 | 107364.0 | 24396 | Cash | 2025-04-02T16:16:00 |
| 5 | 52477 | CarMax | 2025-04-03 | Ford | Fusion | 2021 | 69134.0 | 12677 | Loan | 2025-04-03T13:27:00 |
| 6 | 15671 | Enterprise Holdings | 2025-04-03 | Ford | Fusion | 2023 | 70052.0 | 5279 | Loan | 2025-04-03T12:53:00 |
| 7 | 62431 | AutoNation | 2025-04-03 | Toyota | Camry | 2019 | NaN | 9987 | Loan | 2025-04-03T12:14:00 |
| 8 | 18261 | Avis Budget Group | 2025-04-03 | Chevrolet | Malibu | 2017 | 24974.0 | 23508 | Loan | 2025-04-03T04:35:00 |
| 9 | 68348 | Avis Budget Group | 2025-04-03 | Hyundai | Tucson | 2018 | 62735.0 | 22131 | Loan | 2025-04-03T10:46:00 |


In [19]:
# Count number of unique days
unique_days = df['date'].nunique()
print(f"Number of unique days with sales records: {unique_days}")

Number of unique days with sales records: 60
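
*Step 1 blanks out `customer`, `mileage`, and `payment_type` with a 10% chance each, so it can be worth confirming how many gaps actually ended up in the data before extracting it. The sketch below is only an illustration: it uses a tiny in-memory stand-in (the hypothetical `sample_csv`, with made-up rows) rather than the real `car_sales_data_may_2025.csv`, and the name `nullable_cols` is an assumption introduced here.*

```python
import io
import pandas as pd

# Tiny illustrative stand-in for the generated CSV (rows are made up)
sample_csv = io.StringIO(
    "id,customer,mileage,payment_type\n"
    "1,CarMax,106136,Credit\n"
    "2,,16572,Loan\n"
    "3,Manheim,,Cash\n"
    "4,Copart,47288,\n"
)
sample = pd.read_csv(sample_csv)

# Columns that Step 1 may leave empty
nullable_cols = ["customer", "mileage", "payment_type"]

# Count and share of missing values per column
missing_counts = sample[nullable_cols].isna().sum()
missing_share = sample[nullable_cols].isna().mean()

print(missing_counts)
print(missing_share.round(2))
```

*Applied to the lab's `df`, the same two lines (`df[nullable_cols].isna().sum()` and `.mean()`) should report shares somewhere near 0.10, though the exact values vary from run to run because the data is random.*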


### `Section 1: Full Extraction`

**Full Extraction** *means retrieving the entire dataset from the data source every time the extraction process runs, without considering whether any data has changed since the last extraction. This approach ensures that you always have a complete and up-to-date copy of the dataset.*

#### `In the following code:`
- *The entire dataset is read from the source `car_sales_data_may_2025.csv`.*
- *All records, regardless of whether they are new, updated, or unchanged, are loaded into memory.*
- *Basic information about the dataset (such as the number of rows and columns) and a sample of the data are displayed to help verify the extraction and understand the data structure.*


In [20]:
# FULL EXTRACTION: Load entire dataset
df_full = pd.read_csv("car_sales_data_may_2025.csv", parse_dates=["last_updated"])
print(f"Pulled {len(df_full)} rows via full extraction.")

# Show basic dataset info
print(f"Dataset shape: {df_full.shape}")

Pulled 253 rows via full extraction.
Dataset shape: (253, 10)


### `Sample Data`

In [21]:
print("Columns:", df_full.columns.tolist())

Columns: ['id', 'customer', 'date', 'car_make', 'car_model', 'year', 'mileage', 'price', 'payment_type', 'last_updated']



- *A sample of the first few rows is printed to verify the data content and structure.*

In [22]:
df_full.head(5)

| | id | customer | date | car_make | car_model | year | mileage | price | payment_type | last_updated |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 82510 | CarMax | 2025-04-02 | Honda | Civic | 2021 | 106136.0 | 5386 | Credit | 2025-04-02 05:06:00 |
| 1 | 38041 | Lithia Motors | 2025-04-02 | Ford | Escape | 2019 | 16572.0 | 18737 | Loan | 2025-04-02 06:33:00 |
| 2 | 46557 | CarMax | 2025-04-02 | Toyota | Corolla | 2021 | 45535.0 | 10101 | Loan | 2025-04-02 17:50:00 |
| 3 | 14586 | Penske Automotive | 2025-04-02 | Chevrolet | Cruze | 2019 | 47288.0 | 9671 | Cash | 2025-04-02 04:59:00 |
| 4 | 46220 | Manheim | 2025-04-02 | Honda | CR-V | 2021 | 107364.0 | 24396 | Cash | 2025-04-02 16:16:00 |


In [23]:
# Save the full extraction results to CSV
df_full.to_csv('full_extraction_output.csv', index=False)
print("Full extraction output saved as 'full_extraction_output.csv'.")


Full extraction output saved as 'full_extraction_output.csv'.


### `Section 2: Incremental Extraction`

*`The steps are as follows:`*

- *The last extraction timestamp is read from the file `last_extraction.txt`.*
- *The dataset is loaded from the file `car_sales_data_may_2025.csv`. The `last_updated` column is parsed as a datetime object to facilitate time-based comparisons.*
- *The timestamp obtained from the text file is converted into a `pandas` datetime object, which allows accurate filtering of records based on their update time.*
- *The dataset is filtered to include only the rows where the `last_updated` value is later than the last extraction timestamp. This step simulates **incremental extraction** by retrieving only the new or modified records.*
- *A sample of these new or updated records is shown to verify the extraction process.*
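
*The steps above can be condensed into one small function. This is a minimal sketch under stated assumptions, not the lab's exact code: the helper name `extract_incremental` is hypothetical, and an in-memory CSV with made-up rows stands in for `car_sales_data_may_2025.csv` so the example runs on its own.*

```python
import io
import pandas as pd

def extract_incremental(source, checkpoint_text):
    """Return rows whose last_updated is strictly later than the checkpoint.

    `source` is anything pd.read_csv accepts (path or file-like object);
    `checkpoint_text` is the timestamp string read from the tracking file.
    """
    df = pd.read_csv(source, parse_dates=["last_updated"])
    checkpoint = pd.to_datetime(checkpoint_text)
    return df[df["last_updated"] > checkpoint]

# Made-up rows around the checkpoint 2025-04-20 12:00:00
source = io.StringIO(
    "id,last_updated\n"
    "1,2025-04-20T11:59:00\n"
    "2,2025-04-20T12:00:00\n"
    "3,2025-04-20T17:10:00\n"
    "4,2025-04-21T05:46:00\n"
)

new_rows = extract_incremental(source, "2025-04-20 12:00:00")
print(new_rows["id"].tolist())  # [3, 4]
```

*Note that the comparison is strict (`>`), so a row updated exactly at the checkpoint (id 2 here) is treated as already extracted.*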


#### `1. Create the tracking file: last_extraction.txt`

- *This file stores the timestamp of the last extraction, so the incremental process knows where to pick up.*



In [24]:
with open("last_extraction.txt", "w") as f:
    f.write("2025-04-20 12:00:00")

#### `2. Perform Incremental Extraction`
- *Read the last extraction timestamp from the tracking file.*

In [25]:
# INCREMENTAL EXTRACTION
with open("last_extraction.txt", "r") as f:
    last_extraction = f.read().strip()

#### *Load the full dataset with date parsing and convert the checkpoint to datetime*

In [26]:
df = pd.read_csv("car_sales_data_may_2025.csv", parse_dates=["last_updated"])
last_extraction_time = pd.to_datetime(last_extraction)

#### *Filter new or updated records since last extraction*

In [27]:
df_incremental = df[df['last_updated'] > last_extraction_time]
print(f"Pulled {len(df_incremental)} new/updated rows since {last_extraction}.")
df_incremental.head()

Pulled 166 new/updated rows since 2025-04-20 12:00:00.


| | id | customer | date | car_make | car_model | year | mileage | price | payment_type | last_updated |
|---|---|---|---|---|---|---|---|---|---|---|
| 85 | 64031 | AutoNation | 2025-04-20 | Honda | CR-V | 2019 | 49545.0 | 7755 | Loan | 2025-04-20 17:10:00 |
| 87 | 60963 | Copart | 2025-04-20 | Chevrolet | Cruze | 2017 | 117942.0 | 18396 | Credit | 2025-04-20 23:35:00 |
| 88 | 28738 | Sonic Automotive | 2025-04-20 | Hyundai | Elantra | 2021 | 41692.0 | 24521 | Credit | 2025-04-20 20:26:00 |
| 90 | 41071 | Penske Automotive | 2025-04-21 | Toyota | Corolla | 2018 | 32259.0 | 22977 | Credit | 2025-04-21 11:44:00 |
| 91 | 29456 | Avis Budget Group | 2025-04-21 | Honda | Civic | 2018 | 17055.0 | 11949 | Credit | 2025-04-21 05:46:00 |


In [28]:
# Save incremental extraction results to CSV
df_incremental.to_csv('incremental_extraction_output.csv', index=False)
print("Incremental extraction output saved as 'incremental_extraction_output.csv'.")

Incremental extraction output saved as 'incremental_extraction_output.csv'.


### `Update the last_extraction.txt`

#### *Incremental Extraction*

- *First, we record the last time we checked for new data by saving a date and time in a file called `last_extraction.txt`.*
- *Next, we scan the full dataset but keep only the entries that were added or updated after that saved date and time.*
- *Then we print a message reporting how many new or updated records were found since the last check.*

### *Save New Timestamp*

- *After getting the new data, we update the saved date and time in `last_extraction.txt` to mark the most recent check.*
- *This helps us know where to start from next time so we only get new changes, not old data again.*


In [29]:
# Get the most recent update timestamp in the dataset
new_checkpoint = df['last_updated'].max()

# Save it as the new checkpoint
with open("last_extraction.txt", "w") as f:
    f.write(new_checkpoint.isoformat())
print(f"Updated last_extraction.txt to {new_checkpoint}")

Updated last_extraction.txt to 2025-05-31 19:30:00
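
*One edge case is worth sketching: if a run finds no new rows, taking `.max()` of an empty batch gives `NaT`, which should not be written to the tracking file. The example below is a hedged illustration only. The helper `next_checkpoint` is a hypothetical name, the rows are made up, and a temporary file stands in for `last_extraction.txt` so nothing in the lab directory is overwritten.*

```python
import os
import tempfile
import pandas as pd

def next_checkpoint(batch, previous):
    """Return the newest last_updated in the batch,
    or the previous checkpoint when the batch is empty."""
    if batch.empty:
        return previous
    return batch["last_updated"].max()

previous = pd.Timestamp("2025-04-20 12:00:00")

# Made-up incremental batch and an empty batch with the same columns
batch = pd.DataFrame({
    "last_updated": pd.to_datetime(["2025-04-20 17:10:00", "2025-04-21 05:46:00"])
})
empty_batch = batch.iloc[0:0]

advanced = next_checkpoint(batch, previous)         # moves forward
unchanged = next_checkpoint(empty_batch, previous)  # stays where it was

# Round-trip the checkpoint through a throwaway tracking file
path = os.path.join(tempfile.mkdtemp(), "last_extraction.txt")
with open(path, "w") as f:
    f.write(advanced.isoformat())
with open(path) as f:
    restored = pd.to_datetime(f.read().strip())

print(advanced, unchanged, restored)
```

*Deriving the checkpoint from the extracted batch rather than from the clock keeps it tied to rows that were actually read, which is the same idea the cell above applies to the full `df`.*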


## *LAB 4: Transform in ETL*