## **Lab2 - Data Collection and Pre-Processing**

**Name:** Mandeep Singh Brar  
**ID Number:** 8989367  
**Course Name:** Machine Learning Programming  
**Course Code:** PROG8245

### **Step 1:** Hello, Data!

In [189]:
# Importing necessary libraries

import pandas as pd     # pandas is used for data loading, cleaning, and analysis
import numpy as np      # numpy is used for efficient numerical operations (e.g., for missing values)
import matplotlib.pyplot as plt  # matplotlib is used for creating data visualizations like plots and charts
import seaborn as sns   # seaborn is a statistical data visualization library built on matplotlib

# Load the dataset
df = pd.read_csv('c:/Users/singh/Desktop/500 Sales Records.csv')

# Display the first 3 rows
df.head(3)

Unnamed: 0,Region,Country,Item Type,Sales Channel,Order Priority,Order Date,Order ID,Ship Date,Units Sold,Unit Price,Unit Cost,Total Revenue,Total Cost,Total Profit
0,Middle East and North Africa,Libya,Cosmetics,Offline,M,10/18/2014,686800706.0,10/31/2014,8446.0,437.2,263.33,3692591.2,2224085.18,1468506.02
1,North America,Canada,Vegetables,Online,M,11/7/2011,185941302.0,12/8/2011,3018.0,154.06,90.93,464953.08,274426.74,190526.34
2,Middle East and North Africa,Libya,Baby Food,Offline,C,10/31/2016,246222341.0,12/9/2016,1517.0,255.28,159.42,387259.76,241840.14,145419.62


### **Step 2:** Pick the Right Container

A **dict** stores each sales row flexibly, but its keys are unchecked, so a misspelled or missing field goes unnoticed. A **namedtuple** is cleaner when the columns are fixed, and its rows are immutable. A **class** is the best fit when we want to add methods such as a profit calculation.

**Conclusion:**
For this dataset, a custom class is the most useful option if we plan to add data validation or extra logic. For simply storing and accessing rows, a namedtuple is clean and efficient.
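
To make the comparison concrete, here is a minimal sketch of one row held in each container. The three-field subset and the names `SaleRow` and `Sale` are assumptions made for illustration and are not part of the lab's code; the values come from the first row shown in Step 1.

```python
from collections import namedtuple

# 1) dict: flexible, but nothing stops a misspelled key or a missing field
row_dict = {"country": "Libya", "total_revenue": 3692591.2, "total_profit": 1468506.02}

# 2) namedtuple: fixed fields, attribute access, immutable
SaleRow = namedtuple("SaleRow", ["country", "total_revenue", "total_profit"])
row_tuple = SaleRow("Libya", 3692591.2, 1468506.02)

# 3) class: the same data plus behaviour, such as a derived metric
class Sale:  # hypothetical, trimmed-down example class
    def __init__(self, country, total_revenue, total_profit):
        self.country = country
        self.total_revenue = total_revenue
        self.total_profit = total_profit

    def profit_margin(self):
        """Profit as a fraction of revenue, or None when revenue is zero or None."""
        if not self.total_revenue:
            return None
        return self.total_profit / self.total_revenue

row_obj = Sale("Libya", 3692591.2, 1468506.02)

print(row_dict["country"], row_tuple.country, row_obj.country)
print(round(row_obj.profit_margin(), 4))
```

Only the class can carry logic such as `profit_margin()`, which is the reason it is preferred once validation or derived values are needed.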

### **Step 3:** Transaction Class and OO Data Structure

In [190]:
from datetime import datetime  # For converting date strings into Python datetime objects
import pandas as pd            # pd.isna / pd.notna treat both None and NaN as missing

class Transaction:
    def __init__(
        self,
        region: str,
        country: str,
        item_type: str,
        sales_channel: str,
        order_priority: str,
        order_date: str,
        order_id: float,
        ship_date: str,
        units_sold: float,
        unit_price: float,
        unit_cost: float,
        total_revenue: float,
        total_cost: float,
        total_profit: float,
    ):
        # Assign attributes to the instance
        self.region = region
        self.country = country
        self.item_type = item_type
        self.sales_channel = sales_channel
        self.order_priority = order_priority

        # Convert dates from string to datetime for easier processing
        self.order_date = self._parse_date(order_date)
        self.order_id = order_id
        self.ship_date = self._parse_date(ship_date)
        self.units_sold = units_sold
        self.unit_price = unit_price
        self.unit_cost = unit_cost
        self.total_revenue = total_revenue
        self.total_cost = total_cost
        self.total_profit = total_profit

    def _parse_date(self, date_str):
        """Parse an MM/DD/YYYY string; return None if it is missing or malformed."""
        try:
            return datetime.strptime(date_str, "%m/%d/%Y")
        except (TypeError, ValueError):
            return None

    def is_valid(self):
        """Check that essential fields are present and have reasonable values.

        pd.notna() rejects both None and NaN. Extend this method with more checks as needed.
        """
        return bool(
            pd.notna(self.units_sold) and self.units_sold >= 0
            and pd.notna(self.unit_price) and self.unit_price >= 0
            and self.order_date is not None
            and pd.notna(self.total_revenue)
            and pd.notna(self.total_cost)
            and pd.notna(self.total_profit)
        )

    def profit_margin(self):
        """Return profit / revenue as a float, or None if revenue is zero or a value is missing."""
        if pd.isna(self.total_revenue) or pd.isna(self.total_profit) or self.total_revenue == 0:
            return None
        return self.total_profit / self.total_revenue

    def days_to_ship(self):
        """Return the number of days between order and ship date, or None if unavailable."""
        if self.order_date and self.ship_date:
            return (self.ship_date - self.order_date).days
        return None

    def __repr__(self):
        """Show the object's main fields for easy display."""
        return (f"Transaction({self.region}, {self.country}, {self.item_type}, {self.sales_channel}, "
                f"{self.order_priority}, {self.order_date.strftime('%Y-%m-%d') if self.order_date else 'NA'}, "
                f"ID: {self.order_id}, Units: {self.units_sold}, Revenue: {self.total_revenue})")
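
As a quick check of the date handling, the sketch below reproduces the parsing and shipping-time logic outside the class. The standalone function name `parse_date` is an assumption for illustration; the dates are those of the first row in Step 1.

```python
from datetime import datetime

def parse_date(date_str):
    """Parse an MM/DD/YYYY string; return None for missing or malformed input."""
    try:
        return datetime.strptime(date_str, "%m/%d/%Y")
    except (TypeError, ValueError):
        return None

order = parse_date("10/18/2014")
ship = parse_date("10/31/2014")
print((ship - order).days)        # 13

print(parse_date(float("nan")))   # None: a float is not a string (TypeError)
print(parse_date("2014-10-18"))   # None: wrong format (ValueError)
```

Returning `None` instead of raising keeps a single bad date from stopping a bulk load, at the cost of having to check for `None` later.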

### **Step 4:** Bulk Loader

In [191]:
from typing import List

def load_transactions(df) -> List[Transaction]:
    transactions = []  # Empty list to store Transaction objects

    # Loop through each row in the DataFrame
    for idx, row in df.iterrows():
        # Create a Transaction object from row data
        transaction = Transaction(
            row['Region'],             # Region string
            row['Country'],            # Country string
            row['Item Type'],          # Item type string
            row['Sales Channel'],      # Sales channel string
            row['Order Priority'],     # Order priority string
            row['Order Date'],         # Order date string
            row['Order ID'],           # Order ID (float)
            row['Ship Date'],          # Ship date string
            row['Units Sold'],         # Units sold (float)
            row['Unit Price'],         # Unit price (float)
            row['Unit Cost'],          # Unit cost (float)
            row['Total Revenue'],      # Total revenue (float)
            row['Total Cost'],         # Total cost (float)
            row['Total Profit']        # Total profit (float)
        )
        transactions.append(transaction)  # Add the object to the list
    return transactions  # Return the complete list of Transaction objects

transactions = load_transactions(df)
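
Passing fourteen positional arguments works, but it depends on the argument order never changing. One possible alternative, sketched here under assumptions, is to match fields by name through a column map. `MiniTransaction`, `COLUMN_MAP`, and `load_rows` are hypothetical names, and the class is a three-field stand-in rather than the full `Transaction`.

```python
from dataclasses import dataclass

# Hypothetical, trimmed-down stand-in for Transaction (three fields only)
@dataclass
class MiniTransaction:
    country: str
    units_sold: float
    unit_price: float

# CSV header -> constructor argument name
COLUMN_MAP = {
    "Country": "country",
    "Units Sold": "units_sold",
    "Unit Price": "unit_price",
}

def load_rows(rows):
    """Build one MiniTransaction per row dict, matching fields by name."""
    return [
        MiniTransaction(**{arg: row[col] for col, arg in COLUMN_MAP.items()})
        for row in rows
    ]

# Row dicts in the shape produced by df.to_dict("records")
rows = [
    {"Country": "Libya", "Units Sold": 8446.0, "Unit Price": 437.2},
    {"Country": "Canada", "Units Sold": 3018.0, "Unit Price": 154.06},
]
loaded = load_rows(rows)
print(loaded[1])
```

With keyword arguments, reordering the constructor's parameters or the CSV's columns does not silently put a value in the wrong field.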


### **Step 5:** Quick Profiling

In [192]:
import math

unit_prices = [
    t.unit_price for t in transactions 
    if t.unit_price is not None and not math.isnan(t.unit_price)
]

# Calculate minimum, mean, and maximum unit price
if unit_prices:  # Make sure the list is not empty to avoid errors
    min_price = min(unit_prices)
    mean_price = sum(unit_prices) / len(unit_prices)
    max_price = max(unit_prices)
else:
    min_price = mean_price = max_price = None

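# Get all unique regions (the dataset has no 'City' column, so region and country are used instead).
# Note: missing values arrive as NaN, which is truthy, so `if t.region` does not filter them out.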
unique_region = set(t.region for t in transactions if t.region)

# Get all unique Countries
unique_country = set(t.country for t in transactions if t.country)
# Print results
print(f"Min Unit Price: {min_price}")
print(f"Mean Unit Price: {mean_price}")
print(f"Max Unit Price: {max_price}")
print(f"Unique region: {unique_region}")
print(f"Number of unique regions: {len(unique_region)}")
print(f"Unique country: {unique_country}")
print(f"Number of unique countries: {len(unique_country)}")

Min Unit Price: 9.33
Mean Unit Price: 274.29506
Max Unit Price: 668.27
Unique region: {'Australia and Oceania', 'North America', 'Central America and the Caribbean', 'Europe', nan, 'Sub-Saharan Africa', 'Middle East and North Africa', 'Asia'}
Number of unique regions: 8
Unique country: {'Lebanon', 'Chad', 'Trinidad and Tobago', 'Mauritius ', 'Indonesia', 'Tanzania', 'Kazakhstan', 'Monaco', 'Pakistan', 'Kenya', 'South Africa', 'New Zealand', 'Taiwan', 'Georgia', 'China', 'Syria', 'Belgium', 'Philippines', 'Papua New Guinea', 'Tonga', 'Brunei', 'The Bahamas', 'Poland', 'Nigeria', 'Malta', 'Cuba', 'Bangladesh', 'Tunisia ', 'Tajikistan', 'Austria', 'Qatar', 'Jordan', 'Saint Kitts and Nevis ', 'Greenland', 'Democratic Republic of the Congo', 'Honduras', 'Djibouti', 'South Sudan', 'Benin', 'Senegal', 'Malawi', 'Cyprus', 'Portugal', 'Belize', 'Armenia', 'Guinea-Bissau', 'Uganda', 'Azerbaijan', 'Fiji', 'Sudan', 'Japan', 'Gabon', 'Albania', 'Singapore', 'Eritrea', 'Saint Lucia', 'Croatia', 'Moz
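
The printed sets hint at two problems: `nan` is counted as a region, and some country names carry trailing spaces (for example `'Mauritius '`). A minimal sketch of one way to normalise labels before counting them follows. The helper name `clean_labels` and the sample list are assumptions for illustration, not values taken from the dataset.

```python
def clean_labels(values):
    """Return the set of stripped, non-empty string labels; skip NaN and None."""
    cleaned = set()
    for value in values:
        if isinstance(value, str):      # NaN is a float, so it is skipped here
            label = value.strip()
            if label:
                cleaned.add(label)
    return cleaned

# Made-up sample mixing padded names, a NaN, a None, and an empty string
raw = ["Mauritius ", "Mauritius", "Tunisia ", float("nan"), None, ""]
print(sorted(clean_labels(raw)))  # ['Mauritius', 'Tunisia']
```

Applied to `t.region` and `t.country`, a helper like this would likely lower the reported counts, since `nan` would no longer be counted as a value.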

### **Step 6:** Spot the Grime

In [193]:
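# Caveat: pandas loads missing cells as NaN, not None, and NaN is truthy.
# The `is None` and `not t.country` checks below therefore cannot see NaN values,
# so a count of 0 does not prove that a column is complete.
missing_units = [t for t in transactions if t.units_sold is None]
negative_prices = [t for t in transactions if t.unit_price is not None and t.unit_price < 0]
missing_country = [t for t in transactions if not t.country]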

print(f"Transactions with missing Units Sold: {len(missing_units)}")
print(f"Transactions with negative Unit Price: {len(negative_prices)}")
print(f"Transactions with missing Country: {len(missing_country)}")

Transactions with missing Units Sold: 0
Transactions with negative Unit Price: 0
Transactions with missing Country: 0
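
These zero counts should be read with care, because a value loaded by pandas is usually NaN rather than `None` when the cell is empty. The following sketch shows a NaN-aware check using only the standard library. The helper name `is_missing` and the sample list are assumptions for illustration.

```python
import math

def is_missing(value):
    """True for None, NaN, or an empty/whitespace-only string."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and not value.strip():
        return True
    return False

# Made-up sample: one real value, one NaN, one None, one legitimate zero
units = [8446.0, float("nan"), None, 0.0]
print(sum(is_missing(u) for u in units))   # 2
print(sum(u is None for u in units))       # 1
```

The second count misses the NaN entry, which is the same blind spot an `is None` test has on data loaded from a CSV.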
