## *EXTRACT*

### *Import Libraries*

In [10]:
import pandas as pd

### *Step 1: Load and Preview `raw_data.csv` and `incremental_data.csv`*

### *Load the Datasets*

In [11]:
# Load raw data
orders_full = pd.read_csv('data/raw_data.csv')

# Load incremental data
orders_incremental = pd.read_csv('data/incremental_data.csv')

### *Preview the Data Using `head()`*

In [12]:
# Preview first few rows of raw_data
print("Preview of orders_full:")
orders_full.head()

Preview of orders_full:


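|   | order_id | customer_name | product | quantity | unit_price | order_date | region |
|---|----------|---------------|---------|----------|------------|------------|--------|
| 0 | 1 | Diana | Tablet | NaN | 500.0 | 2024-01-20 | South |
| 1 | 2 | Eve | Laptop | NaN | NaN | 2024-04-29 | North |
| 2 | 3 | Charlie | Laptop | 2.0 | 250.0 | 2024-01-08 | NaN |
| 3 | 4 | Eve | Laptop | 2.0 | 750.0 | 2024-01-07 | West |
| 4 | 5 | Eve | Tablet | 3.0 | NaN | 2024-03-07 | South |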


In [13]:
# Preview first few rows of incremental_data
print("Preview of orders_incremental:")
orders_incremental.head()

Preview of orders_incremental:


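|   | order_id | customer_name | product | quantity | unit_price | order_date | region |
|---|----------|---------------|---------|----------|------------|------------|--------|
| 0 | 101 | Alice | Laptop | NaN | 900.0 | 2024-05-09 | Central |
| 1 | 102 | NaN | Laptop | 1.0 | 300.0 | 2024-05-07 | Central |
| 2 | 103 | NaN | Laptop | 1.0 | 600.0 | 2024-05-04 | Central |
| 3 | 104 | NaN | Tablet | NaN | 300.0 | 2024-05-26 | Central |
| 4 | 105 | Heidi | Tablet | 2.0 | 600.0 | 2024-05-21 | North |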


### *Inspect Data Types and Nulls*

In [14]:
# Check structure of raw data
print("Info for orders_full:")
orders_full.info()

Info for orders_full:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   order_id       100 non-null    int64  
 1   customer_name  99 non-null     object 
 2   product        100 non-null    object 
 3   quantity       74 non-null     float64
 4   unit_price     65 non-null     float64
 5   order_date     99 non-null     object 
 6   region         75 non-null     object 
dtypes: float64(2), int64(1), object(4)
memory usage: 5.6+ KB



#### *Observations from orders_full (raw_data.csv)*

*Number of Records:*  
*- The dataset contains a total of **100 entries** and **7 columns**.*

*The notes below cover the columns in `orders_full` whose data types or missing values matter for the transformation phase:*

1. *Data Types:*  
   *- `product`: object; categorical/text data, appropriate as-is.*  
   *- `quantity`: float64; most likely read as float because the column contains missing values, which a default integer column cannot hold. It can be converted to an integer type once the missing values are handled, provided all remaining values are whole numbers.*  
   *- `order_date`: object; should be converted to `datetime64` for time-based analysis.*  
   *- `region`: object; categorical, suitable as-is.*

2. *Missing Values:*  
   *The following columns contain missing values: **`customer_name`** (1 missing), **`quantity`** (26 missing), **`unit_price`** (35 missing), **`order_date`** (1 missing), and **`region`** (25 missing).*

3. *Initial Observations:*  
   *- Missing values are concentrated in `quantity`, `unit_price`, and `region`, each missing in roughly 25% to 35% of rows; these will need to be handled during the transformation phase.*  
   *- `customer_name` and `order_date` are each missing in only one row, so they are a minor concern by comparison.*
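
*The missing-value counts above can also be computed directly instead of being read off `info()`. The sketch below is illustrative only: it uses the five preview rows of `orders_full` shown earlier rather than the full 100-row file, and the names `sample` and `missing_counts` are assumptions introduced here, not part of the original notebook.*

```python
import io

import pandas as pd

# Five rows copied from the orders_full preview; the real file has 100 rows.
sample_csv = io.StringIO(
    "order_id,customer_name,product,quantity,unit_price,order_date,region\n"
    "1,Diana,Tablet,,500.0,2024-01-20,South\n"
    "2,Eve,Laptop,,,2024-04-29,North\n"
    "3,Charlie,Laptop,2.0,250.0,2024-01-08,\n"
    "4,Eve,Laptop,2.0,750.0,2024-01-07,West\n"
    "5,Eve,Tablet,3.0,,2024-03-07,South\n"
)
sample = pd.read_csv(sample_csv)

# Count missing values per column, then keep only columns that have any.
missing_counts = sample.isna().sum()
missing_counts = missing_counts[missing_counts > 0]

print(missing_counts)
```

*Running the same two lines on `orders_full` itself should reproduce the per-column counts listed under "Missing Values".*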



In [16]:
# Check structure of incremental data
print("Info for orders_incremental:")
orders_incremental.info()

Info for orders_incremental:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   order_id       10 non-null     int64  
 1   customer_name  4 non-null      object 
 2   product        10 non-null     object 
 3   quantity       6 non-null      float64
 4   unit_price     10 non-null     float64
 5   order_date     10 non-null     object 
 6   region         8 non-null      object 
dtypes: float64(2), int64(1), object(4)
memory usage: 692.0+ bytes


#### *Observations from orders_incremental (incremental_data.csv)*

*Number of Records:*  
*- The dataset contains a total of **10 entries** and **7 columns**.*

*The notes below cover the columns in `orders_incremental` whose data types or missing values matter for the transformation phase:*

1. *Data Types:*  
   *- `product`: object; categorical/text data, appropriate as-is.*  
   *- `quantity`: float64; most likely read as float because the column contains missing values. It can be converted to an integer type once the missing values are handled, provided all remaining values are whole numbers.*  
   *- `order_date`: object; should be converted to `datetime64` for time-based analysis.*  
   *- `region`: object; categorical, suitable as-is.*

2. *Missing Values:*  
   *The following columns contain missing values: **`customer_name`** (6 missing), **`quantity`** (4 missing), and **`region`** (2 missing). Unlike `orders_full`, `unit_price` and `order_date` are fully populated.*

3. *Initial Observations:*  
   *- Although `orders_incremental` has only 10 records, `customer_name` is missing in 6 of them and `quantity` in 4, so missing values will still need to be handled during transformation.*  
   *- The schema (column names and data types) matches `orders_full`, so the same conversions apply to both datasets.*
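
*The type conversions noted above could look roughly like the sketch below. It is one possible approach, not necessarily the one used in the transformation phase: it uses the five preview rows of `orders_incremental`, the name `converted` is an assumption, and it picks pandas' nullable `Int64` dtype so that `quantity` can become an integer column while its missing values are still present.*

```python
import io

import pandas as pd

# Five rows copied from the orders_incremental preview; the real file has 10 rows.
sample_csv = io.StringIO(
    "order_id,customer_name,product,quantity,unit_price,order_date,region\n"
    "101,Alice,Laptop,,900.0,2024-05-09,Central\n"
    "102,,Laptop,1.0,300.0,2024-05-07,Central\n"
    "103,,Laptop,1.0,600.0,2024-05-04,Central\n"
    "104,,Tablet,,300.0,2024-05-26,Central\n"
    "105,Heidi,Tablet,2.0,600.0,2024-05-21,North\n"
)
sample = pd.read_csv(sample_csv)

converted = sample.copy()

# Parse dates; errors="coerce" turns unparseable or missing values into NaT.
converted["order_date"] = pd.to_datetime(converted["order_date"], errors="coerce")

# Nullable integer dtype: whole-number floats become integers, NaN becomes <NA>.
converted["quantity"] = converted["quantity"].astype("Int64")

print(converted.dtypes)
```

*The cast to `Int64` raises an error if any value has a fractional part, which doubles as a check on the "all values are whole numbers" condition.*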

### *Save Raw Copies*

In [18]:
# Save raw copies. These are the same paths loaded above, so the files are
# overwritten with the unmodified data; index=False keeps the row index out of the CSV.
orders_full.to_csv('data/raw_data.csv', index=False)
orders_incremental.to_csv('data/incremental_data.csv', index=False)

In [19]:
print("Raw datasets saved to the data/ directory.")

Raw datasets saved to the data/ directory.
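
*The `index=False` argument in the save step matters when the files are reloaded later. The sketch below illustrates the difference on a two-row stand-in taken from the `orders_full` preview. The directory `raw_snapshots` and both file names are hypothetical, and the files are written to a temporary directory so that nothing under `data/` is touched.*

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two rows taken from the orders_full preview; a stand-in for the full dataset.
orders_sample = pd.DataFrame(
    {
        "order_id": [1, 3],
        "product": ["Tablet", "Laptop"],
        "quantity": [None, 2.0],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical snapshot folder, kept separate from the source files.
    snapshot_dir = Path(tmp) / "raw_snapshots"
    snapshot_dir.mkdir(parents=True, exist_ok=True)

    # Saved without the index: the reloaded columns match the original.
    clean_path = snapshot_dir / "raw_data_copy.csv"
    orders_sample.to_csv(clean_path, index=False)
    cols_without_index = list(pd.read_csv(clean_path).columns)

    # Saved with the default index=True: an extra unnamed column appears on reload.
    indexed_path = snapshot_dir / "raw_data_with_index.csv"
    orders_sample.to_csv(indexed_path)
    cols_with_index = list(pd.read_csv(indexed_path).columns)

print(cols_without_index)
print(cols_with_index)
```

*Writing copies to a separate location, as sketched here, is also one way to avoid overwriting the original source files.*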
