## *LAB 3: Practicing Extraction in ETL*

### *Historical Description of the Car Sales Dataset*
- *This dataset represents simulated used car sales data over two months (April–May 2025). It includes purchases by major automotive dealers, rental companies, and vehicle auction services such as AutoNation, CarMax, Enterprise Holdings, and Manheim.*

- *The dataset features popular car brands like Toyota, Honda, Ford, Hyundai, and Chevrolet, with model years ranging from 2015 to 2023. It records key details such as vehicle age, mileage, price, and payment types, offering a realistic view of market diversity and buyer preferences.*


#### *Key Attributes*

*`Date of Sale:` The date on which the car transaction occurred, showing when the vehicle was purchased.*

*`Dealer:` Represents the company or automotive group that sold or distributed the vehicle.*

*`Car Make and Model:` Describes the manufacturer (e.g., Toyota, Ford) and specific model of the vehicle, indicating brand preference and market variety.*

*`Manufacture Year:` Indicates the year the vehicle was made, providing context for its age and potential value depreciation.*

*`Odometer Reading:` Shows the total distance the vehicle had traveled before the sale, an important factor influencing its condition and price.*

*`Sale Price:` The amount (in USD) for which the vehicle was sold, indicating market demand and valuation.*

*`Payment Method:` Specifies how the transaction was completed—via cash, credit, or loan—reflecting diverse financing choices.*

*`Last Updated:` The timestamp of the most recent update to the record, helping track edits or corrections.*
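
The attribute names above are descriptive labels, while the generated DataFrame uses shorter column names. The mapping below is read off the data-generation code in this lab (note that the `Dealer` attribute is stored in the `customer` column). The dictionary name `ATTRIBUTE_TO_COLUMN` is a hypothetical helper introduced only for illustration.

```python
# Hypothetical lookup table: descriptive attribute -> column name(s) in the DataFrame.
ATTRIBUTE_TO_COLUMN = {
    "Date of Sale": ["date"],
    "Dealer": ["customer"],
    "Car Make and Model": ["car_make", "car_model"],
    "Manufacture Year": ["year"],
    "Odometer Reading": ["mileage"],
    "Sale Price": ["price"],
    "Payment Method": ["payment_type"],
    "Last Updated": ["last_updated"],
}

# Flatten to the list of described columns.
# The `id` column is a record key and is not one of the described attributes.
described_columns = [col for cols in ATTRIBUTE_TO_COLUMN.values() for col in cols]
print(described_columns)
```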

#### Import libraries

In [30]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import random

### `STEP 1: Generate synthetic data`

In [31]:
# Generate Car Sales Dataset
dealers = ['AutoNation', 'CarMax', 'Penske Automotive', 'Lithia Motors', 'Sonic Automotive', 
           'Enterprise Holdings', 'Hertz Global Holdings', 'Avis Budget Group', 'Manheim', 'Copart']

car_inventory = {
    'Toyota': ['Corolla', 'Camry', 'RAV4'],
    'Honda': ['Civic', 'Accord', 'CR-V'],
    'Ford': ['Focus', 'Fusion', 'Escape'],
    'Hyundai': ['Elantra', 'Tucson', 'Santa Fe'],
    'Chevrolet': ['Malibu', 'Cruze', 'Equinox']
}

payment_methods = ['Cash', 'Credit', 'Loan']

sales_records = []
start_date = datetime(2025, 4, 1)
# Offsets 1..60 yield sale dates from 2025-04-02 through 2025-05-31 (60 days).
for day_offset in range(1, 61):
    current_date = start_date + timedelta(days=day_offset)
    for _ in range(random.randint(3, 6)):
        make = random.choice(list(car_inventory.keys()))
        model = random.choice(car_inventory[make])
        manufacture_year = random.randint(2015, 2023)
        odometer_reading = random.randint(10000, 120000)
        sale_price = random.randint(5000, 25000)

        # Introduce missing values with 10% chance each
        customer = random.choice(dealers)
        if random.random() < 0.10:
            customer = None

        mileage = odometer_reading
        if random.random() < 0.10:
            mileage = None

        payment_type = random.choice(payment_methods)
        if random.random() < 0.10:
            payment_type = None

        sales_records.append({
            'id': random.randint(10000, 99999),
            'customer': customer,
            'date': current_date.date().isoformat(),
            'car_make': make,
            'car_model': model,
            'year': manufacture_year,
            'mileage': mileage,
            'price': sale_price,
            'payment_type': payment_type,
            'last_updated': (current_date + timedelta(hours=random.randint(0, 23),
                                                      minutes=random.randint(0, 59))).isoformat()
        })


####  Create DataFrame

In [32]:
# Create DataFrame
df = pd.DataFrame(sales_records)

# Save to CSV
df.to_csv('car_sales_data_may_2025.csv', index=False)

# Display the first 10 records
df.head(10)


| | id | customer | date | car_make | car_model | year | mileage | price | payment_type | last_updated |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 96170 | Sonic Automotive | 2025-04-02 | Chevrolet | Malibu | 2021 | 76680.0 | 22689 | Loan | 2025-04-02T15:07:00 |
| 1 | 42890 | Lithia Motors | 2025-04-02 | Honda | Civic | 2019 | 97159.0 | 23623 | | 2025-04-02T01:18:00 |
| 2 | 51710 | CarMax | 2025-04-02 | Ford | Escape | 2021 | 32141.0 | 19238 | Credit | 2025-04-02T03:15:00 |
| 3 | 58253 | Lithia Motors | 2025-04-02 | Chevrolet | Equinox | 2019 | 46321.0 | 8995 | Cash | 2025-04-02T09:37:00 |
| 4 | 33556 | AutoNation | 2025-04-02 | Hyundai | Elantra | 2021 | 100419.0 | 16999 | Credit | 2025-04-02T23:28:00 |
| 5 | 76562 | AutoNation | 2025-04-03 | Toyota | Corolla | 2021 | 117190.0 | 23539 | Loan | 2025-04-03T14:52:00 |
| 6 | 57843 | Sonic Automotive | 2025-04-03 | Hyundai | Tucson | 2018 | 112070.0 | 14813 | Cash | 2025-04-03T11:33:00 |
| 7 | 85898 | CarMax | 2025-04-03 | Honda | Civic | 2016 | 98896.0 | 21343 | Cash | 2025-04-03T05:00:00 |
| 8 | 72760 | Copart | 2025-04-03 | Chevrolet | Equinox | 2017 | 101749.0 | 19720 | Credit | 2025-04-03T01:33:00 |
| 9 | 59115 | CarMax | 2025-04-04 | Honda | Civic | 2022 | 65489.0 | 10875 | Cash | 2025-04-04T17:00:00 |


In [33]:
# Count number of unique days
unique_days = df['date'].nunique()
print(f"Number of unique days with sales records: {unique_days}")

Number of unique days with sales records: 60


### `Section 1: Full Extraction`

**Full Extraction** *means retrieving the entire dataset from the data source every time the extraction process runs, without considering whether any data has changed since the last extraction. This approach ensures that you always have a complete and up-to-date copy of the dataset, at the cost of re-reading records that have not changed.*

#### `In the following code:`
- *The entire dataset is read from the source file `car_sales_data_may_2025.csv`.*
- *All records, regardless of whether they are new, updated, or unchanged, are loaded into memory.*
- *Basic information about the dataset (such as the number of rows and columns) and a sample of the data are displayed to help verify the extraction and understand the data structure.*


In [34]:
# FULL EXTRACTION: load the entire dataset
df_full = pd.read_csv("car_sales_data_may_2025.csv", parse_dates=["last_updated"])
print(f"Pulled {len(df_full)} rows via full extraction.")

# Show basic dataset info 
print(f"Dataset shape: {df_full.shape}") 

Pulled 270 rows via full extraction.
Dataset shape: (270, 10)
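
Because a full extraction reloads everything, it is a convenient place to confirm that the source still has the expected schema. The sketch below is one possible way to wrap the read in a function with a column check. The names `full_extract` and `EXPECTED_COLUMNS` are assumptions made for illustration, and the two CSV rows are copied from the first rows displayed earlier so that the example runs without the data file.

```python
import io
import pandas as pd

# Column names as printed by the lab's DataFrame.
EXPECTED_COLUMNS = ["id", "customer", "date", "car_make", "car_model",
                    "year", "mileage", "price", "payment_type", "last_updated"]

def full_extract(source):
    """Read every row from `source` (a path or file-like object) and check the schema."""
    frame = pd.read_csv(source, parse_dates=["last_updated"])
    missing = set(EXPECTED_COLUMNS) - set(frame.columns)
    if missing:
        raise ValueError(f"Source is missing columns: {sorted(missing)}")
    return frame

# In-memory stand-in for the CSV file (two rows from the sample output).
sample_csv = io.StringIO(
    "id,customer,date,car_make,car_model,year,mileage,price,payment_type,last_updated\n"
    "96170,Sonic Automotive,2025-04-02,Chevrolet,Malibu,2021,76680.0,22689,Loan,2025-04-02T15:07:00\n"
    "42890,Lithia Motors,2025-04-02,Honda,Civic,2019,97159.0,23623,,2025-04-02T01:18:00\n"
)
df_sample = full_extract(sample_csv)
print(df_sample.shape)
```

With the real file, the same function could be called as `full_extract("car_sales_data_may_2025.csv")`.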


### `Sample_Data`

In [35]:
print("Columns:", df_full.columns.tolist())

Columns: ['id', 'customer', 'date', 'car_make', 'car_model', 'year', 'mileage', 'price', 'payment_type', 'last_updated']



- *A sample of the first few rows is printed to verify the data content and structure.*

In [36]:
df_full.head(5)

| | id | customer | date | car_make | car_model | year | mileage | price | payment_type | last_updated |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 96170 | Sonic Automotive | 2025-04-02 | Chevrolet | Malibu | 2021 | 76680.0 | 22689 | Loan | 2025-04-02 15:07:00 |
| 1 | 42890 | Lithia Motors | 2025-04-02 | Honda | Civic | 2019 | 97159.0 | 23623 | | 2025-04-02 01:18:00 |
| 2 | 51710 | CarMax | 2025-04-02 | Ford | Escape | 2021 | 32141.0 | 19238 | Credit | 2025-04-02 03:15:00 |
| 3 | 58253 | Lithia Motors | 2025-04-02 | Chevrolet | Equinox | 2019 | 46321.0 | 8995 | Cash | 2025-04-02 09:37:00 |
| 4 | 33556 | AutoNation | 2025-04-02 | Hyundai | Elantra | 2021 | 100419.0 | 16999 | Credit | 2025-04-02 23:28:00 |


In [37]:
# Save the full extraction results to a CSV file
df_full.to_csv('full_extraction_output.csv', index=False)

print("Full extraction output saved as 'full_extraction_output.csv'.")


Full extraction output saved as 'full_extraction_output.csv'.


### `Section 2: Incremental Extraction`

*`The steps are as follows:`*

- *The last extraction timestamp is read from the file `last_extraction.txt`.*
- *The dataset is loaded from the file `car_sales_data_may_2025.csv`, with the `last_updated` column parsed as a datetime to support time-based comparisons.*
- *The timestamp read from the text file is converted into a `pandas` datetime object so that records can be filtered accurately by update time.*
- *The dataset is filtered to keep only the rows whose `last_updated` value is later than the last extraction timestamp. This simulates **incremental extraction** by retrieving only the new or modified records.*
- *A sample of these new or updated records is displayed to verify the extraction.*
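
The steps above can be sketched as a single function. This is a minimal illustration on toy data rather than the lab's own cells: the name `incremental_extract` and the three sample rows are assumptions, and an in-memory CSV stands in for the real file.

```python
import io
import pandas as pd

def incremental_extract(source, last_extraction):
    """Return only the rows whose `last_updated` is strictly later than the checkpoint."""
    frame = pd.read_csv(source, parse_dates=["last_updated"])
    checkpoint = pd.to_datetime(last_extraction)
    return frame[frame["last_updated"] > checkpoint]

# Toy source: one row before, one exactly at, and one after the checkpoint.
sample_csv = io.StringIO(
    "id,last_updated\n"
    "1,2025-04-19 08:00:00\n"
    "2,2025-04-20 12:00:00\n"
    "3,2025-04-20 13:28:00\n"
)
new_rows = incremental_extract(sample_csv, "2025-04-20 12:00:00")
print(new_rows["id"].tolist())  # [3]
```

Because the comparison is strict (`>`), a row stamped exactly at the checkpoint is treated as already extracted.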


#### `1. Create the tracking file: last_extraction.txt`

- *This file stores the timestamp of the last extraction, so the incremental process knows where to pick up.*



In [38]:
with open("last_extraction.txt", "w") as f:
    f.write("2025-04-20 12:00:00")
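
On a first run the tracking file may not exist yet. The sketch below shows one way to read the checkpoint defensively, falling back to a default timestamp so that everything is extracted. The function name `read_checkpoint` and the 1970 default are assumptions for illustration, and a temporary directory is used so the example does not touch the lab's files.

```python
from datetime import datetime
from pathlib import Path
import tempfile

def read_checkpoint(path, default=datetime(1970, 1, 1)):
    """Return the stored timestamp, or `default` if the tracking file is absent or empty."""
    file = Path(path)
    if not file.exists():
        return default
    text = file.read_text().strip()
    return datetime.fromisoformat(text) if text else default

with tempfile.TemporaryDirectory() as tmp:
    tracking_file = Path(tmp) / "last_extraction.txt"
    first_run = read_checkpoint(tracking_file)       # file does not exist yet
    tracking_file.write_text("2025-04-20 12:00:00")  # same format as the lab's file
    later_run = read_checkpoint(tracking_file)

print(first_run, later_run)
```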

#### `2. Perform Incremental Extraction`
- *Read the last extraction timestamp from the tracking file.*

In [39]:
# INCREMENTAL EXTRACTION
with open("last_extraction.txt", "r") as f:
    last_extraction = f.read().strip()

#### *Load the full dataset with date parsing and convert the checkpoint to datetime*

In [40]:
df = pd.read_csv("car_sales_data_may_2025.csv", parse_dates=["last_updated"])
last_extraction_time = pd.to_datetime(last_extraction)

####  *Filter new or updated records since last extraction*

In [41]:
df_incremental = df[df['last_updated'] > last_extraction_time]
print(f"Pulled {len(df_incremental)} new/updated rows since {last_extraction}.")
df_incremental.head()

Pulled 190 new/updated rows since 2025-04-20 12:00:00.


| | id | customer | date | car_make | car_model | year | mileage | price | payment_type | last_updated |
|---|---|---|---|---|---|---|---|---|---|---|
| 80 | 56904 | Sonic Automotive | 2025-04-20 | Ford | Focus | 2017 | 79471.0 | 21132 | Cash | 2025-04-20 19:46:00 |
| 81 | 85386 | Sonic Automotive | 2025-04-20 | Ford | Focus | 2022 | | 8381 | Credit | 2025-04-20 13:28:00 |
| 82 | 26349 | Lithia Motors | 2025-04-20 | Toyota | Camry | 2018 | 38996.0 | 10794 | Loan | 2025-04-20 19:12:00 |
| 83 | 33437 | Avis Budget Group | 2025-04-20 | Honda | Civic | 2023 | 112401.0 | 19419 | Loan | 2025-04-20 22:39:00 |
| 84 | 74370 | | 2025-04-20 | Honda | Accord | 2022 | 58209.0 | 18186 | | 2025-04-20 16:07:00 |


In [42]:
# Save incremental extraction results to CSV
df_incremental.to_csv('incremental_extraction_output.csv', index=False)
print("Incremental extraction output saved as 'incremental_extraction_output.csv'.")

Incremental extraction output saved as 'incremental_extraction_output.csv'.


### `Update the last_extraction.txt`

#### *Incremental Extraction*

- *First, record the time of the last check for new data by saving a timestamp in a file called `last_extraction.txt`.*
- *Next, scan the full dataset and select only the entries that were added or updated after that timestamp.*
- *Then print a message reporting how many new or updated records were found since the last check.*

### *Save New Timestamp*

- *After extracting the new data, update the timestamp in `last_extraction.txt` to mark the most recent check.*
- *The next run then starts from this point and retrieves only new changes rather than re-reading old data.*


In [None]:
# Get the most recent update
new_checkpoint = df['last_updated'].max()
# Save it
with open("last_extraction.txt", "w") as f:
    f.write(new_checkpoint.isoformat())
print(f"Updated last_extraction.txt to {new_checkpoint}")

Updated last_extraction.txt to 2025-05-31 16:01:00
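
One edge case worth guarding against: if a run finds no rows, taking `.max()` of an empty `last_updated` column yields `NaT`, and writing that to the tracking file would lose the checkpoint. The sketch below keeps the previous checkpoint in that situation. The function name `next_checkpoint` and the toy timestamps are assumptions for illustration.

```python
import pandas as pd

def next_checkpoint(extracted, previous):
    """Return the newest `last_updated` in `extracted`, or `previous` if nothing was extracted."""
    if extracted.empty:
        return previous
    return extracted["last_updated"].max()

previous = pd.Timestamp("2025-04-20 12:00:00")

# Toy batches: one with no rows, one with two rows.
empty_batch = pd.DataFrame({"last_updated": pd.to_datetime([])})
new_batch = pd.DataFrame(
    {"last_updated": pd.to_datetime(["2025-05-30 09:15:00", "2025-05-31 16:01:00"])}
)

print(next_checkpoint(empty_batch, previous))  # checkpoint is unchanged
print(next_checkpoint(new_batch, previous))    # checkpoint advances to the newest row
```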


## *LAB 4: Transform in ETL*

### *Transform Full Data*
*Full Transformation applies all cleaning, enrichment, and formatting operations to the entire dataset, regardless of whether the records are new, updated, or unchanged.*

*Use Cases: Ideal for initial data loads, schema updates, and ensuring consistency across the entire dataset.*

*Pros: It ensures complete uniformity, is easier to debug, and supports accurate analytics and reporting.*

*Cons: It is slower, uses more resources, and can be redundant if no changes exist in the data.*


#### *Data Cleaning*

In [44]:
df_full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 270 entries, 0 to 269
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   id            270 non-null    int64         
 1   customer      247 non-null    object        
 2   date          270 non-null    object        
 3   car_make      270 non-null    object        
 4   car_model     270 non-null    object        
 5   year          270 non-null    int64         
 6   mileage       250 non-null    float64       
 7   price         270 non-null    int64         
 8   payment_type  240 non-null    object        
 9   last_updated  270 non-null    datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(3), object(5)
memory usage: 21.2+ KB


In [45]:
df_full.duplicated().sum()

0

####  *Handle missing values*

In [46]:
df_full.isnull().sum()

id               0
customer        23
date             0
car_make         0
car_model        0
year             0
mileage         20
price            0
payment_type    30
last_updated     0
dtype: int64

In [49]:
# fill string-based NaNs with a placeholder
df_full['customer'] = df_full['customer'].fillna('Unknown')

#### *Group-wise Imputation for mileage*

In [None]:
# Fill missing mileage with group mean (based on car_make and car_model)
df_full['mileage'] = df_full.groupby(['car_make', 'car_model'])['mileage']\
                            .transform(lambda x: x.fillna(x.mean()))
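
To see what the group-wise fill does, here is a small worked example on toy data (the five rows are invented for illustration). Each missing mileage is replaced by the mean of the known mileages for the same make and model.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "car_make":  ["Honda", "Honda", "Honda", "Ford", "Ford"],
    "car_model": ["Civic", "Civic", "Civic", "Focus", "Focus"],
    "mileage":   [40000.0, 60000.0, np.nan, 80000.0, np.nan],
})

# Same pattern as the lab cell: fill each group's gaps with that group's mean.
toy["mileage"] = (
    toy.groupby(["car_make", "car_model"])["mileage"]
       .transform(lambda x: x.fillna(x.mean()))
)
print(toy["mileage"].tolist())  # [40000.0, 60000.0, 50000.0, 80000.0, 80000.0]
```

If every mileage in a group is missing, the group mean is itself `NaN`, so those rows stay unfilled.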

#### *Categorize mileage*

In [50]:
# First, bin mileage into categories.
df_full['mileage_bin'] = pd.cut(df_full['mileage'], bins=[0, 30000, 60000, 90000, 120000], 
                                labels=['Low', 'Medium', 'High', 'Very High'])
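
The bin edges behave in a way that is easy to overlook: by default `pd.cut` uses right-inclusive intervals, so a value equal to an edge falls into the lower bin, and values outside the outermost edges become `NaN`. The toy series below (values chosen for illustration) shows this with the same bins and labels.

```python
import pandas as pd

mileage = pd.Series([10000, 30000, 30001, 120000, 125000])
bins = pd.cut(mileage, bins=[0, 30000, 60000, 90000, 120000],
              labels=["Low", "Medium", "High", "Very High"])

# 30000 lands in "Low" (interval (0, 30000]); 125000 is outside all bins.
print(bins.tolist())  # ['Low', 'Low', 'Medium', 'Very High', nan]
```

The generated mileages range from 10,000 to 120,000, so every non-missing value in this lab's data falls inside one of the four bins.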


#### *Impute Missing price Values Based on Groups*


In [None]:
df_full['price'] = df_full.groupby(['car_make', 'car_model', 'mileage_bin'], observed=True)['price']\
                          .transform(lambda x: x.fillna(x.mean()))

### 🧹 Data Cleaning Summary

#### 1. *String-Based Missing Value Handling*  
Missing string values in the `customer` column were filled with the placeholder `"Unknown"` to ensure consistency and prevent issues in downstream processing. The missing `payment_type` values are not filled by the steps shown here.

#### 2. *Group-wise Imputation for mileage*  
Missing values in the `mileage` column were imputed using the mean mileage grouped by `car_make` and `car_model`, offering a more accurate estimate than a single overall average.

#### 3. *Categorize mileage*  
The `mileage` column was binned into defined ranges (`Low`, `Medium`, `High`, `Very High`) to simplify analysis and assist with grouped imputations.

#### 4. *Impute Missing price Values Based on Groups*  
Missing `price` values are filled using the mean price calculated within groups defined by `car_make`, `car_model`, and `mileage_bin`, ensuring context-aware imputation. In this dataset `price` has no missing values, so the step acts only as a safeguard.
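
The four steps can also be collected into one function so the same logic is applied every time. The sketch below is one possible arrangement, not the lab's own code: the name `clean_sales` and the three toy rows are assumptions, and the function works on a copy so the input is left untouched.

```python
import numpy as np
import pandas as pd

def clean_sales(frame):
    """Apply the four cleaning steps to a copy of `frame` and return it."""
    out = frame.copy()
    # 1. Placeholder for missing customers.
    out["customer"] = out["customer"].fillna("Unknown")
    # 2. Group-wise mean imputation for mileage.
    keys = ["car_make", "car_model"]
    out["mileage"] = out.groupby(keys)["mileage"].transform(lambda x: x.fillna(x.mean()))
    # 3. Bin mileage into categories.
    out["mileage_bin"] = pd.cut(out["mileage"], bins=[0, 30000, 60000, 90000, 120000],
                                labels=["Low", "Medium", "High", "Very High"])
    # 4. Group-wise mean imputation for price, using the mileage bin as an extra key.
    out["price"] = (out.groupby(keys + ["mileage_bin"], observed=True)["price"]
                       .transform(lambda x: x.fillna(x.mean())))
    return out

toy = pd.DataFrame({
    "customer":  ["CarMax", None, "Copart"],
    "car_make":  ["Honda", "Honda", "Honda"],
    "car_model": ["Civic", "Civic", "Civic"],
    "mileage":   [40000.0, np.nan, 50000.0],
    "price":     [12000.0, 15000.0, np.nan],
})
cleaned = clean_sales(toy)
print(cleaned[["customer", "mileage", "mileage_bin", "price"]])
```

In the toy data all three rows end up in the `Medium` bin, so the missing price is filled with the mean of the other two prices.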


### *Transform Incremental Data*

*Incremental Transformation targets only newly added or changed data (often identified by timestamps or IDs) and applies the transformation logic selectively.*

*Use Cases: Best for frequent updates in production, such as daily loads or streaming data.*

*Pros: It is faster, more efficient, and reduces computation by only processing changed data.*

*Cons: It requires change tracking, adds complexity, and may lead to inconsistencies if historical transformation logic changes.*
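
As a rough sketch of the idea, the example below transforms only the rows newer than a checkpoint and appends them to data that was transformed in an earlier run. Everything here is illustrative: the `transform` function is a simplified stand-in for the cleaning logic, and the rows and timestamps are toy values.

```python
import pandas as pd

def transform(frame):
    """Stand-in transformation: fill missing customers with a placeholder."""
    out = frame.copy()
    out["customer"] = out["customer"].fillna("Unknown")
    return out

# Rows that were already transformed in an earlier run (toy data).
already_transformed = pd.DataFrame({
    "id": [1, 2],
    "customer": ["CarMax", "Unknown"],
    "last_updated": pd.to_datetime(["2025-04-10 09:00:00", "2025-04-18 14:30:00"]),
})

# Current state of the source; ids 3 and 4 arrived after the checkpoint.
source = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "customer": ["CarMax", None, None, "Copart"],
    "last_updated": pd.to_datetime(["2025-04-10 09:00:00", "2025-04-18 14:30:00",
                                    "2025-04-21 08:00:00", "2025-04-22 17:45:00"]),
})
checkpoint = pd.Timestamp("2025-04-20 12:00:00")

# Transform only the increment, then append it to the existing result.
increment = source[source["last_updated"] > checkpoint]
combined = pd.concat([already_transformed, transform(increment)], ignore_index=True)
print(combined)
```

This sketch only appends. If existing records can be updated in place, the combined data would also need de-duplicating on `id` (keeping the latest version), which is part of the added complexity noted above.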
