## **Lab2 - Data Collection and Pre-Processing**

**Name:** Mandeep Singh Brar  
**ID Number:** 8989367  
**Course Name:** Machine Learning Programming  
**Course Code:** PROG8245

### **Step 1:** Hello, Data!

In [598]:
# Importing necessary libraries

import pandas as pd     # pandas is used for data loading, cleaning, and analysis
import numpy as np      # numpy is used for efficient numerical operations (e.g., for missing values)
import matplotlib.pyplot as plt  # matplotlib is used for creating data visualizations like plots and charts
import seaborn as sns   # seaborn builds on matplotlib and provides higher-level statistical plots

# Load the dataset
df = pd.read_csv('c:/Users/singh/Desktop/500 Sales Records.csv')

# Display the first 3 rows
df.head(3)

| Unnamed: 0 | Order ID | Order Date | Customer ID | Product ID | Product Name | Category | Sales | Quantity | City |
|---|---|---|---|---|---|---|---|---|---|
| 0 | CA-2016-152156 | 11/8/2016 | CG-12520 | FUR-BO-10001798 | Bush Somerset Collection Bookcase | Furniture | 261.96 | 2.0 | Henderson |
| 1 | CA-2016-152156 | 11/8/2016 | CG-12520 | FUR-CH-10000454 | Hon Deluxe Fabric Upholstered Stacking Chairs,... | Furniture | 731.94 | 3.0 | Henderson |
| 2 | CA-2016-138688 | 6/12/2016 | DV-13045 | OFF-LA-10000240 | Self-Adhesive Address Labels for Typewriters b... | Office Supplies | 14.62 | 2.0 | Los Angeles |


### **Step 2:** Pick the Right Container

A **dict** stores each sales row flexibly, but nothing enforces a consistent set of keys, so it can get messy. A **namedtuple** is cleaner when the columns are fixed. A **class** is the best fit when we want to attach methods, such as a profit calculation.

**Conclusion:**
For the dataset above, a custom class is the most useful choice if we plan to add data validation or extra logic. For simply storing and accessing rows, a namedtuple is clean and efficient.
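To make the comparison concrete, here is a minimal sketch of the same row held in each of the three containers. The field names, the values, and the `Sale` class are hypothetical and only serve as illustration; they are not the lab's actual schema.

```python
from collections import namedtuple

# Hypothetical row values, for illustration only
row = {"order_id": "A-1", "sales": 100.0, "quantity": 2}

# 1) dict: flexible, but a mistyped key only fails at lookup time
as_dict = dict(row)

# 2) namedtuple: fixed fields, immutable, attribute access
SaleRow = namedtuple("SaleRow", ["order_id", "sales", "quantity"])
as_tuple = SaleRow(**row)

# 3) class: room for validation and derived values
class Sale:
    def __init__(self, order_id, sales, quantity):
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        self.order_id = order_id
        self.sales = sales
        self.quantity = quantity

    def sales_per_unit(self):
        """Derived value that a plain dict or namedtuple would not carry."""
        return self.sales / self.quantity

as_object = Sale(**row)
print(as_dict["sales"], as_tuple.sales, as_object.sales_per_unit())
```

The class costs a few more lines, but it is the only one of the three that can reject a bad row at construction time.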

### **Step 3:** Transaction Class and OO Data Structure

In [599]:
from datetime import datetime # For converting date strings into Python date objects
from typing import Optional # For type hinting optional fields in the Transaction class

class Transaction:
    def __init__(
        self,
        region: str,
        country: str,
        item_type: str,
        sales_channel: str,
        order_priority: str,
        order_date: str,
        order_id: float,
        ship_date: str,
        units_sold: float,
        unit_price: float,
        unit_cost: float,
        total_revenue: float,
        total_cost: float,
        total_profit: float,
    ):
        # Assign attributes to the instance
        self.region = region
        self.country = country
        self.item_type = item_type
        self.sales_channel = sales_channel
        self.order_priority = order_priority

        # Convert dates from string to datetime for easier processing
        self.order_date = self._parse_date(order_date)    # String to datetime (None if unparseable)
        self.order_id = order_id
        self.ship_date = self._parse_date(ship_date)      # String to datetime (None if unparseable)
        self.units_sold = units_sold
        self.unit_price = unit_price
        self.unit_cost = unit_cost
        self.total_revenue = total_revenue
        self.total_cost = total_cost
        self.total_profit = total_profit

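    def _parse_date(self, date_str) -> Optional[datetime]:
        """Parse an MM/DD/YYYY string; return None if the value is missing or malformed."""
        try:
            return datetime.strptime(date_str, "%m/%d/%Y")
        except (TypeError, ValueError):  # Non-string input (None, NaN) or wrong format
            return None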

    def is_valid(self):
        """Return True if the essential fields are filled and have reasonable values.

        Extend this method with more checks as needed.
        """
        return (
            self.units_sold is not None and self.units_sold >= 0 and
            self.unit_price is not None and self.unit_price >= 0 and
            self.order_date is not None and
            self.total_revenue is not None and
            self.total_cost is not None and
            self.total_profit is not None
        )

    def profit_margin(self):
        """Return the profit margin as a float, or None if revenue is zero or unavailable."""
        if self.total_revenue:
            return self.total_profit / self.total_revenue
        return None

    def days_to_ship(self):
        """Return the number of days between order and ship date, or None if unavailable."""
        if self.order_date and self.ship_date:
            return (self.ship_date - self.order_date).days
        return None

    def __repr__(self):
        """Show the object's main fields for easy display."""
        return (f"Transaction({self.region}, {self.country}, {self.item_type}, {self.sales_channel}, "
                f"{self.order_priority}, {self.order_date.strftime('%Y-%m-%d') if self.order_date else 'NA'}, "
                f"ID: {self.order_id}, Units: {self.units_sold}, Revenue: {self.total_revenue})")
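The date handling in the class can be tried on its own. The sketch below mirrors the `"%m/%d/%Y"` parsing as a standalone function; `parse_date` is a hypothetical helper name, the order date is taken from the Step 1 preview, and the ship date is invented for illustration.

```python
from datetime import datetime

def parse_date(date_str, fmt="%m/%d/%Y"):
    """Return a datetime, or None when the input is not a string in the given format."""
    try:
        return datetime.strptime(date_str, fmt)
    except (TypeError, ValueError):
        return None

order = parse_date("11/8/2016")        # Order date as shown in the Step 1 preview
ship = parse_date("11/11/2016")        # Hypothetical ship date
wrong_format = parse_date("2016-11-08")  # ISO format does not match the pattern
not_a_string = parse_date(float("nan"))  # pandas represents missing cells as NaN

print(order, (ship - order).days, wrong_format, not_a_string)
```

Returning `None` instead of raising keeps the bulk load running, at the price of having to check for `None` wherever the dates are used later.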

### **Step 4:** Bulk Loader

In [600]:
from typing import List

def load_transactions(df) -> List[Transaction]:
    transactions = []  # Empty list to store Transaction objects

    # Loop through each row in the DataFrame
    for idx, row in df.iterrows():
        # Create a Transaction object from row data
        transaction = Transaction(
            row['Region'],             # Region string
            row['Country'],            # Country string
            row['Item Type'],          # Item type string
            row['Sales Channel'],      # Sales channel string
            row['Order Priority'],     # Order priority string
            row['Order Date'],         # Order date string
            row['Order ID'],           # Order ID (float)
            row['Ship Date'],          # Ship date string
            row['Units Sold'],         # Units sold (float)
            row['Unit Price'],         # Unit price (float)
            row['Unit Cost'],          # Unit cost (float)
            row['Total Revenue'],      # Total revenue (float)
            row['Total Cost'],         # Total cost (float)
            row['Total Profit']        # Total profit (float)
        )
        transactions.append(transaction)  # Add the object to the list
    return transactions  # Return the complete list of Transaction objects

transactions = load_transactions(df)


KeyError: 'Region'
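The `KeyError` means the loaded DataFrame has no `Region` column. One way to catch this before looping is to compare the available columns against the ones `load_transactions` reads. This is a minimal sketch: `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names, and the sample column list is copied from the Step 1 preview.

```python
# Columns that load_transactions reads from each row
REQUIRED_COLUMNS = [
    "Region", "Country", "Item Type", "Sales Channel", "Order Priority",
    "Order Date", "Order ID", "Ship Date", "Units Sold", "Unit Price",
    "Unit Cost", "Total Revenue", "Total Cost", "Total Profit",
]

def missing_columns(available, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `available`."""
    present = set(available)
    return [name for name in required if name not in present]

# Column names as shown in the Step 1 preview
preview_columns = [
    "Order ID", "Order Date", "Customer ID", "Product ID",
    "Product Name", "Category", "Sales", "Quantity", "City",
]

missing = missing_columns(preview_columns)
print(f"{len(missing)} required columns are missing, starting with {missing[0]!r}")
```

With a real DataFrame, the same check would be `missing_columns(df.columns)`, run before calling `load_transactions(df)`.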

### **Step 5:** Quick Profiling

In [None]:
import math

unit_prices = [
    t.unit_price for t in transactions 
    if t.unit_price is not None and not math.isnan(t.unit_price)
]

# Calculate minimum, mean, and maximum unit price
if unit_prices:  # Make sure the list is not empty to avoid errors
    min_price = min(unit_prices)
    mean_price = sum(unit_prices) / len(unit_prices)
    max_price = max(unit_prices)
else:
    min_price = mean_price = max_price = None

# Get all unique regions ('City' is not in this dataset, so region and country are used instead)
unique_region = set(t.region for t in transactions if t.region)

# Get all unique countries
unique_country = set(t.country for t in transactions if t.country)

# Print results
print(f"Min Unit Price: {min_price}")
print(f"Mean Unit Price: {mean_price}")
print(f"Max Unit Price: {max_price}")
print(f"Unique region: {unique_region}")
print(f"Number of unique regions: {len(unique_region)}")
print(f"Unique country: {unique_country}")
print(f"Number of unique countries: {len(unique_country)}")

Min Unit Price: 9.33
Mean Unit Price: 274.29506
Max Unit Price: 668.27
Unique region: {'Australia and Oceania', 'North America', 'Central America and the Caribbean', 'Europe', nan, 'Sub-Saharan Africa', 'Middle East and North Africa', 'Asia'}
Number of unique regions: 8
Unique country: {'Lebanon', 'Chad', 'Trinidad and Tobago', 'Mauritius ', 'Indonesia', 'Tanzania', 'Kazakhstan', 'Monaco', 'Pakistan', 'Kenya', 'South Africa', 'New Zealand', 'Taiwan', 'Georgia', 'China', 'Syria', 'Belgium', 'Philippines', 'Papua New Guinea', 'Tonga', 'Brunei', 'The Bahamas', 'Poland', 'Nigeria', 'Malta', 'Cuba', 'Bangladesh', 'Tunisia ', 'Tajikistan', 'Austria', 'Qatar', 'Jordan', 'Saint Kitts and Nevis ', 'Greenland', 'Democratic Republic of the Congo', 'Honduras', 'Djibouti', 'South Sudan', 'Benin', 'Senegal', 'Malawi', 'Cyprus', 'Portugal', 'Belize', 'Armenia', 'Guinea-Bissau', 'Uganda', 'Azerbaijan', 'Fiji', 'Sudan', 'Japan', 'Gabon', 'Albania', 'Singapore', 'Eritrea', 'Saint Lucia', 'Croatia', 'Moz
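The output above contains `nan` among the regions and country names with trailing spaces such as `'Mauritius '`. The reason `nan` gets through is that NaN is truthy in Python, so a plain `if t.region` filter keeps it. A possible fix is sketched below; `is_missing` is a hypothetical helper and the sample values are invented for illustration.

```python
import math

def is_missing(value):
    """True for None, NaN, or an empty/whitespace-only string."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""

# Hypothetical sample values, not taken from the dataset
regions = ["Europe", float("nan"), "Asia ", None, "Europe"]

naive = {r for r in regions if r}                             # NaN is truthy, so it is kept
cleaned = {r.strip() for r in regions if not is_missing(r)}   # Drop missing, trim spaces

print(len(naive), sorted(cleaned))
```

Stripping whitespace before building the set also prevents `'Asia '` and `'Asia'` from being counted as two different values.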

### **Step 6:** Spot the Grime

In [None]:
missing_units = [t for t in transactions if t.units_sold is None]
negative_prices = [t for t in transactions if t.unit_price is not None and t.unit_price < 0]
missing_country = [t for t in transactions if not t.country]

print(f"Transactions with missing Units Sold: {len(missing_units)}")
print(f"Transactions with negative Unit Price: {len(negative_prices)}")
print(f"Transactions with missing Country: {len(missing_country)}")

Transactions with missing Units Sold: 0
Transactions with negative Unit Price: 0
Transactions with missing Country: 0


### **Step 7:** Cleaning Rules

In [None]:
# Define "dirty" as: missing/NaN/negative unit_price or units_sold
def dirty_data(t):
    # Checks for None, NaN, negative
    unit_price_data = t.unit_price
    unit_sold_data = t.units_sold
    return (
        unit_price_data is None or (isinstance(unit_price_data, float) and math.isnan(unit_price_data)) or unit_price_data < 0 or
        unit_sold_data is None or (isinstance(unit_sold_data, float) and math.isnan(unit_sold_data)) or unit_sold_data < 0
    )

# 1. Count dirty data before cleaning
dirty_records_before = [t for t in transactions if dirty_data(t)]
print("Dirty records before:", len(dirty_records_before))

# 2. Create a new cleaned list after removing dirty ones
total_cleaned_transactions = [t for t in transactions if not dirty_data(t)]

# 3. Count dirty data after cleaning (should be zero!)
dirty_records_after = [t for t in total_cleaned_transactions if dirty_data(t)]
print("Dirty records after:", len(dirty_records_after))
print("Cleaned data count:", len(total_cleaned_transactions))


Dirty records before: 500
Dirty records after: 0
Cleaned data count: 500
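A useful sanity check for any cleaning rule is that the clean and dirty counts add up to the total number of records. The sketch below splits records in a single pass and asserts that invariant. `is_bad_number` and `partition` are hypothetical names, and the `SimpleNamespace` records are stand-ins for `Transaction` objects.

```python
import math
from types import SimpleNamespace

def is_bad_number(value):
    """True if value is None, NaN, or negative."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return value < 0

def partition(records):
    """Split records into (clean, dirty) lists in one pass."""
    clean, dirty = [], []
    for r in records:
        bad = is_bad_number(r.unit_price) or is_bad_number(r.units_sold)
        (dirty if bad else clean).append(r)
    return clean, dirty

# Hypothetical records standing in for Transaction objects
records = [
    SimpleNamespace(unit_price=9.33, units_sold=10.0),          # clean
    SimpleNamespace(unit_price=float("nan"), units_sold=5.0),   # NaN price
    SimpleNamespace(unit_price=20.0, units_sold=-1.0),          # negative units
    SimpleNamespace(unit_price=None, units_sold=3.0),           # missing price
]

clean, dirty = partition(records)
assert len(clean) + len(dirty) == len(records)  # Nothing lost, nothing counted twice
print(f"clean={len(clean)} dirty={len(dirty)}")
```

If the printed counts ever fail to sum to the input size, the rule or the list being filtered is not the one intended.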


### **Step 8:** Transformations

Convert the order priority letter to a numeric code (H = 3, M = 2, L = 1, C = 0). Any value not in the map is coded as -1.

In [None]:
priority_map = {'H': 3, 'M': 2, 'L': 1, 'C': 0}
for t in total_cleaned_transactions:
    t.priority_code = priority_map.get(t.order_priority, -1)
    print(f"Transaction ID: {t.order_id}, Priority Code: {t.priority_code}")

Transaction ID: 686800706.0, Priority Code: 2
Transaction ID: 185941302.0, Priority Code: 2
Transaction ID: 246222341.0, Priority Code: 0
Transaction ID: 161442649.0, Priority Code: 0
Transaction ID: 645713555.0, Priority Code: 3
Transaction ID: 683458888.0, Priority Code: 3
Transaction ID: 679414975.0, Priority Code: 3
Transaction ID: 208630645.0, Priority Code: 2
Transaction ID: 266467225.0, Priority Code: 3
Transaction ID: 118598544.0, Priority Code: 3
Transaction ID: 451010930.0, Priority Code: 2
Transaction ID: 220003211.0, Priority Code: 2
Transaction ID: 702186715.0, Priority Code: 0
Transaction ID: 544485270.0, Priority Code: 0
Transaction ID: 714135205.0, Priority Code: 1
Transaction ID: 448685348.0, Priority Code: 3
Transaction ID: 405997025.0, Priority Code: 3
Transaction ID: 414244067.0, Priority Code: 2
Transaction ID: 821912801.0, Priority Code: 0
Transaction ID: 247802054.0, Priority Code: 1
Transaction ID: 531023156.0, Priority Code: 0
Transaction ID: 880999934.0, Prior
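Printing every row makes it hard to see whether any value fell through to the `-1` default. A summary count per code is easier to scan. The sketch below assumes the same `priority_map`; `encode_priority` is a hypothetical helper, the whitespace and case normalization is an added assumption rather than part of the original step, and the sample inputs are invented.

```python
from collections import Counter

priority_map = {'H': 3, 'M': 2, 'L': 1, 'C': 0}

def encode_priority(value, mapping=priority_map, default=-1):
    """Map a priority letter to its code; unknown or non-string values get `default`."""
    if isinstance(value, str):
        # Assumption: tolerate stray spaces and lowercase letters
        return mapping.get(value.strip().upper(), default)
    return default

# Hypothetical inputs, including an unknown letter and a missing value
samples = ["H", "m", " C ", "X", None]
codes = [encode_priority(s) for s in samples]
counts = Counter(codes)

print(codes)
print(f"unmapped values: {counts[-1]}")
```

A non-zero count for `-1` is a signal to inspect the raw priority values before using the code as a feature.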

### **Step 9:** Feature Engineering

In [None]:
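now = datetime.now()  # Evaluate once so every row uses the same reference time
for t in total_cleaned_transactions:
    try:
        # order_date is normally a datetime already; parse it only if it is still a string
        order_date = datetime.strptime(t.order_date, "%m/%d/%Y") if isinstance(t.order_date, str) else t.order_date
        t.days_since_order = (now - order_date).days if order_date else None
    except (TypeError, ValueError):  # Unparseable string or unexpected type
        t.days_since_order = None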
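Because `datetime.now()` changes on every run, the resulting feature is not reproducible. A possible alternative is to pass an explicit reference date. In the sketch below, `days_since` is a hypothetical helper and both dates are invented for illustration.

```python
from datetime import datetime

def days_since(order_date, reference):
    """Whole days between order_date and reference, or None if the date is missing."""
    if order_date is None:
        return None
    return (reference - order_date).days

# Hypothetical fixed reference date; any agreed snapshot date would do
reference = datetime(2025, 1, 1)

print(days_since(datetime(2024, 12, 25), reference))
print(days_since(None, reference))
```

Storing the reference date alongside the data makes it possible to recompute or audit the feature later.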

### **Step 10:** Mini-Aggregation

In [None]:
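# Sum total revenue per country; a missing (None) revenue counts as 0
revenue_based_on_per_country = {}
for t in total_cleaned_transactions:
    revenue_based_on_per_country[t.country] = revenue_based_on_per_country.get(t.country, 0) + (t.total_revenue or 0)

# Show the first five countries in insertion order (not sorted by revenue)
print({k: revenue_based_on_per_country[k] for k in list(revenue_based_on_per_country)[:5]})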

{'Libya': 5884216.16, 'Canada': 1226103.2999999998, 'Japan': 6766832.07, 'Chad': 11828683.05, 'Armenia': 7103563.75}
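The five entries printed above are simply the first five countries encountered, not the five largest. To rank by revenue, the totals can be sorted before slicing. This is a minimal sketch using `collections.defaultdict`; the function names and the sample rows are hypothetical.

```python
from collections import defaultdict

def revenue_by_country(rows):
    """Sum revenue per country from (country, revenue) pairs; None counts as 0."""
    totals = defaultdict(float)
    for country, revenue in rows:
        totals[country] += revenue or 0
    return dict(totals)

def top_n(totals, n):
    """Return the n (country, revenue) pairs with the highest revenue."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Hypothetical (country, revenue) pairs, not taken from the dataset
rows = [("A", 100.0), ("B", 50.0), ("A", 25.0), ("C", None)]

totals = revenue_by_country(rows)
print(top_n(totals, 2))
```

Note that `revenue or 0` only guards against `None` and zero; a NaN revenue would still propagate into the sum, so NaN rows need to be removed beforehand.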


### **Step 11:** Serialization Checkpoint

In [None]:
!pip install pyarrow







In [None]:
# Convert the cleaned objects back to a DataFrame for saving
# (each object's attribute dict becomes one row)
df_cleaned = pd.DataFrame([t.__dict__ for t in total_cleaned_transactions])

# JSON Lines: one record per line
df_cleaned.to_json("final_cleaned_transactions_data.json", orient="records", lines=True)

# Parquet: relies on the pyarrow package installed above
df_cleaned.to_parquet("final_cleaned_transactions_data.parquet")
print("Serialized to JSON and Parquet.")

Serialized to JSON and Parquet.
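A checkpoint is only useful if it can be read back. The sketch below shows a round trip for the JSON Lines layout (one JSON object per line) using only the standard library. The records and the temporary file path are hypothetical; it does not read the lab's actual output file.

```python
import json
import os
import tempfile

# Hypothetical records standing in for rows of the cleaned DataFrame
records = [
    {"country": "Chad", "total_revenue": 100.5},
    {"country": "Japan", "total_revenue": 20.0},
]

# Write one JSON object per line to a temporary file
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read the file back, skipping any blank lines
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), loaded == records)
```

For the files written in this step, `pd.read_json(..., lines=True)` and `pd.read_parquet(...)` are the pandas counterparts. Date columns deserve a second look after reloading, since `to_json` writes datetimes as epoch timestamps by default.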
