## Import Libraries and Setup Folder

In [5]:
# Importing pandas for data manipulation
import pandas as pd

# Importing os to manage folders
import os

# Ensure the 'data' folder exists to save extracted files
# If it already exists, 'exist_ok=True' prevents errors
os.makedirs("data", exist_ok=True)


## Load Both CSV Files

In [7]:
# Load the full raw dataset from CSV
# Assumes raw_data.csv is inside the 'data/' folder
df_raw = pd.read_csv("data/raw_data.csv")

# Load the smaller incremental dataset
df_incremental = pd.read_csv("data/incremental_data.csv")

# Display the first 5 rows of each dataset to get an overview
print("Preview of Raw Full Dataset:")
display(df_raw.head())  # Show first few rows of the raw full dataset

print("\nPreview of Incremental Dataset:")
display(df_incremental.head())  # Show first few rows of the incremental dataset



Preview of Raw Full Dataset:


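|   | order_id | customer_name | product | quantity | unit_price | order_date | region |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Diana | Tablet | NaN | 500.0 | 2024-01-20 | South |
| 1 | 2 | Eve | Laptop | NaN | NaN | 2024-04-29 | North |
| 2 | 3 | Charlie | Laptop | 2.0 | 250.0 | 2024-01-08 | NaN |
| 3 | 4 | Eve | Laptop | 2.0 | 750.0 | 2024-01-07 | West |
| 4 | 5 | Eve | Tablet | 3.0 | NaN | 2024-03-07 | South |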



Preview of Incremental Dataset:


|   | order_id | customer_name | product | quantity | unit_price | order_date | region |
|---|---|---|---|---|---|---|---|
| 0 | 101 | Alice | Laptop | NaN | 900.0 | 2024-05-09 | Central |
| 1 | 102 | NaN | Laptop | 1.0 | 300.0 | 2024-05-07 | Central |
| 2 | 103 | NaN | Laptop | 1.0 | 600.0 | 2024-05-04 | Central |
| 3 | 104 | NaN | Tablet | NaN | 300.0 | 2024-05-26 | Central |
| 4 | 105 | Heidi | Tablet | 2.0 | 600.0 | 2024-05-21 | North |


## Data Summary: .info()

In [8]:
# View structure, data types, and missing values for raw dataset
print("Full Raw Dataset Info:")
df_raw.info()

# View structure, data types, and missing values for incremental dataset
print("\nIncremental Dataset Info:")
df_incremental.info()


Full Raw Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   order_id       100 non-null    int64  
 1   customer_name  99 non-null     object 
 2   product        100 non-null    object 
 3   quantity       74 non-null     float64
 4   unit_price     65 non-null     float64
 5   order_date     99 non-null     object 
 6   region         75 non-null     object 
dtypes: float64(2), int64(1), object(4)
memory usage: 5.6+ KB

Incremental Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   order_id       10 non-null     int64  
 1   customer_name  4 non-null      object 
 2   product        10 non-null     object 
 3   quantity       6 non-null      float64
 4   unit_price     10 non-null     float64

## Save Extracted Data into /data

In [None]:
# Write the full raw data to the data/ folder without the index column
# Note: this is the same path the file was loaded from, so the original file is overwritten
df_raw.to_csv("data/raw_data.csv", index=False)

# Write the incremental data to the data/ folder in the same way
df_incremental.to_csv("data/incremental_data.csv", index=False)

# These saved files will be used in the Transform phase
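
To check that saved files reload cleanly, a small round-trip test can help. The sketch below is not part of the original notebook: it uses a tiny hypothetical stand-in for `df_raw` and a temporary directory, so it does not touch the real `data/` folder. It confirms that writing with `index=False` and reading back preserves the column names and shape.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical stand-in for df_raw (illustrative values only)
sample = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_name": ["Diana", "Eve", None],
    "quantity": [np.nan, 2.0, 3.0],
})

# Write to a temporary folder and read the file back
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.csv"
    sample.to_csv(path, index=False)  # index=False avoids writing an extra index column
    reloaded = pd.read_csv(path)

# Compare the structure before and after the round trip
same_columns = list(reloaded.columns) == list(sample.columns)
same_shape = reloaded.shape == sample.shape
print("Columns preserved:", same_columns)
print("Shape preserved:", same_shape)
```

The same comparison could be run against `df_raw` and `df_incremental` after the save cell, assuming the files are written to the paths used above.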


### Observations from Extracted Datasets

#### Full Raw Dataset (`raw_data.csv`)
- Total Records: **100 rows**, **7 columns**
- Missing Values Observed In:
  - `customer_name`: 1 missing
  - `quantity`: 26 missing
  - `unit_price`: 35 missing
  - `order_date`: 1 missing
  - `region`: 25 missing
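- Data Types:
  - Numerical columns: `order_id` (`int64`), `quantity` and `unit_price` (`float64`)
  - Text (`object`) columns: `customer_name`, `product`, `order_date`, `region`. Note that `order_date` is loaded as text, not as a datetime.
- **Observation**: About one-third of the rows are missing `unit_price` (35%) and about one-quarter are missing `quantity` (26%) or `region` (25%), which may affect analysis or require imputation.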
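
The missing-value counts above were read off the `.info()` output by subtracting non-null counts from the row total. They can also be computed directly. The following is a minimal sketch, assuming a hypothetical helper named `missing_summary` and a small stand-in frame with made-up values; in the notebook it would be called on `df_raw` or `df_incremental`.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, largest first."""
    missing = df.isna().sum()
    summary = pd.DataFrame({
        "missing": missing,
        "percent": (missing / len(df) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


# Hypothetical stand-in for df_raw (illustrative values only)
sample = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "quantity": [np.nan, np.nan, 2.0, 2.0],
    "unit_price": [500.0, np.nan, 250.0, 750.0],
})

print(missing_summary(sample))
```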

---

#### Incremental Dataset (`incremental_data.csv`)
- Total Records: **10 rows**, **7 columns**
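- Missing Values Observed In:
  - `customer_name`: 6 missing
  - `quantity`: 4 missing
  - `region`: 2 missing
- Data Types match those of the full dataset.
- **Observation**: This subset may represent recent or partial data (the previewed rows are dated May 2024, with `order_id` starting at 101) and has a far higher share of missing names: 6 of 10 rows, versus 1 of 100 in the full dataset.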
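
The claim that the two datasets share the same data types can be checked programmatically rather than by eye. The sketch below is one way to do it, assuming a hypothetical helper named `schema_mismatches` and two small stand-in frames with made-up values. It reports every column that is absent from one frame or has a different dtype.

```python
import numpy as np
import pandas as pd


def schema_mismatches(left: pd.DataFrame, right: pd.DataFrame) -> dict:
    """Map each differing column to its (left dtype, right dtype) pair.

    A dtype of None means the column is absent from that frame.
    """
    mismatches = {}
    for col in sorted(set(left.columns) | set(right.columns)):
        left_dtype = str(left[col].dtype) if col in left.columns else None
        right_dtype = str(right[col].dtype) if col in right.columns else None
        if left_dtype != right_dtype:
            mismatches[col] = (left_dtype, right_dtype)
    return mismatches


# Hypothetical stand-ins for df_raw and df_incremental (illustrative values only)
full = pd.DataFrame({
    "order_id": [1, 2],
    "quantity": [2.0, np.nan],
    "region": ["West", None],
})
incremental = pd.DataFrame({
    "order_id": [101, 102],
    "quantity": [1, 2],
    "region": ["Central", "North"],
})

print(schema_mismatches(full, incremental))
```

In this stand-in data, `quantity` has no missing values in the second frame and is therefore an integer column, so it is reported as a mismatch. An empty dictionary means the schemas agree.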

---

Next Step: These datasets are now saved to the `data/` folder and ready for transformation.

