## **Lab2 - Data Collection and Pre-Processing**

**Name:** Mandeep Singh Brar  
**ID Number:** 8989367  
**Course Name:** Machine Learning Programming  
**Course Code:** PROG8245

### **Step 1:** Hello, Data!

In [808]:
# Importing necessary libraries

import pandas as pd     # pandas is used for data loading, cleaning, and analysis
import numpy as np      # numpy is used for efficient numerical operations (e.g., for missing values)
import matplotlib.pyplot as plt  # matplotlib is used for creating data visualizations like plots and charts
import seaborn as sns   # seaborn builds on matplotlib for statistical data visualization

# Load the dataset
df = pd.read_csv('c:/Users/singh/Desktop/500 Sales Records.csv')

# Display the first 3 rows
df.head(3)

| Unnamed: 0 | Order ID | Order Date | Customer ID | Product ID | Product Name | Category | Sales | Quantity | City |
|---|---|---|---|---|---|---|---|---|---|
| 0 | CA-2016-152156 | 11/8/2016 | CG-12520 | FUR-BO-10001798 | Bush Somerset Collection Bookcase | Furniture | 261.96 | 2.0 | Henderson |
| 1 | CA-2016-152156 | 11/8/2016 | CG-12520 | FUR-CH-10000454 | Hon Deluxe Fabric Upholstered Stacking Chairs,... | Furniture | 731.94 | 3.0 | Henderson |
| 2 | CA-2016-138688 | 6/12/2016 | DV-13045 | OFF-LA-10000240 | Self-Adhesive Address Labels for Typewriters b... | Office Supplies | 14.62 | 2.0 | Los Angeles |


### **Step 2:** Pick the Right Container

A **dict** stores each sales row flexibly, but nothing enforces a consistent set of keys, so it can get messy. A **namedtuple** is cleaner when the columns are fixed. A **class** is the best fit when we want to attach methods such as a profit calculation.

**Conclusion:**
For this dataset, a custom class is the most useful choice if we plan to add data validation or extra logic. For simply storing and accessing rows, a namedtuple is clean and efficient.
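
The trade-off can be seen by storing the same row three ways. The following is a minimal sketch that uses values from the first row shown in Step 1. The names `row_dict`, `SaleRow`, `SaleRecord`, and `unit_price` are hypothetical and are not used elsewhere in this lab.

```python
from collections import namedtuple

# 1. dict: flexible, but a mistyped key only fails at lookup time
row_dict = {"order_id": "CA-2016-152156", "sales": 261.96, "quantity": 2}

# 2. namedtuple: fixed fields, attribute access, immutable
SaleRow = namedtuple("SaleRow", ["order_id", "sales", "quantity"])
row_tuple = SaleRow("CA-2016-152156", 261.96, 2)

# 3. class: fixed fields plus room for validation and derived values
class SaleRecord:
    def __init__(self, order_id, sales, quantity):
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        self.order_id = order_id
        self.sales = float(sales)
        self.quantity = int(quantity)

    def unit_price(self):
        # Assumption: `sales` is the line total for all units in the row
        return self.sales / self.quantity

row_obj = SaleRecord("CA-2016-152156", 261.96, 2)
print(row_dict["sales"], row_tuple.sales, row_obj.unit_price())
```

Only the class version can reject a bad row at construction time, which is the main reason to prefer it once cleaning rules are involved.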

### **Step 3:** Transaction Class and OO Data Structure

In [809]:
from datetime import datetime  # For converting date strings into Python datetime objects

class Transaction:
    """One row of the sales dataset, with typed fields."""

    def __init__(self,
                 order_id,
                 order_date,
                 customer_id,
                 product_id,
                 product_name,
                 category,
                 sales,
                 quantity,
                 city):
        self.order_id = order_id
        # Dates arrive as month/day/year strings, e.g. '11/8/2016'
        self.order_date = datetime.strptime(order_date, '%m/%d/%Y')
        self.customer_id = customer_id
        self.product_id = product_id
        self.product_name = product_name
        self.category = category
        self.sales = float(sales)
        self.quantity = int(quantity)
        self.city = city

    def __repr__(self):
        # Compact representation used by print() and the notebook display
        return f"Transaction({self.order_id}, {self.sales}, {self.city})"

print("Transaction class defined.")


Transaction class defined.


### **Step 4:** Bulk Loader

In [810]:
from typing import List

# Function to load transactions from a DataFrame
def load_transactions(df) -> List[Transaction]:
    transactions = [] # Empty list to store Transaction objects

    # Loop through each row in the DataFrame
    for _, row in df.iterrows():

        # Create a Transaction object from row data
        transactions.append(Transaction(
            row['Order ID'],
            row['Order Date'],
            row['Customer ID'],
            row['Product ID'],
            row['Product Name'],
            row['Category'],
            row['Sales'],
            row['Quantity'],
            row['City']
        ))
    return transactions

transactions = load_transactions(df)
print(f"Loaded {len(transactions)} transactions.")

Loaded 500 transactions.


### **Step 5:** Quick Profiling

In [811]:
# Collect all sales values and unique cities from the transactions list.
sales_values = [t.sales for t in transactions]
unique_cities = {t.city for t in transactions}

# Calculate min, mean, max sales.
min_sales = min(sales_values)
mean_sales = sum(sales_values) / len(sales_values)
max_sales = max(sales_values)

# Print results
print("------ QUICK PROFILING ------")
print(f"Minimum Sales: {min_sales:.2f}")
print(f"Mean Sales: {mean_sales:.2f}")
print(f"Maximum Sales: {max_sales:.2f}")
print(f"Number of Unique Cities: {len(unique_cities)}")
print(f"Sample Cities: {list(unique_cities)[:5]}")

------ QUICK PROFILING ------
Minimum Sales: 1.25
Mean Sales: 258.85
Maximum Sales: 8159.95
Number of Unique Cities: 117
Sample Cities: ['Tampa', 'West Jordan', 'Columbia', 'Henderson', 'Amarillo']
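
A mean on its own can hide skew when a few large orders dominate, so reporting the median alongside it is a common addition. The sketch below uses the standard-library `statistics` module on a handful of sales values that appear in the outputs above, not the full column, so its numbers are illustrative only. `sample_sales` and `profile` are hypothetical names.

```python
import statistics

# A few sales values seen in the outputs above (not the full dataset)
sample_sales = [1.25, 14.62, 261.96, 731.94, 8159.95]

profile = {
    "min": min(sample_sales),
    "mean": statistics.mean(sample_sales),
    "median": statistics.median(sample_sales),
    "max": max(sample_sales),
}

for name, value in profile.items():
    print(f"{name:>6}: {value:.2f}")
```

When the mean is much larger than the median, as in this sample, a small number of high-value rows is pulling the average up.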


### **Step 6:** Spot the Grime

In [812]:
# Purpose:
# Identify at least three types of dirty data in the sales dataset:
# 1. Negative sales values (sales cannot be negative)
# 2. Non-positive quantity (quantity should be at least 1)
# 3. Missing or invalid city names (must be a non-empty string)

# List to store details of all dirty data cases found
dirty_cases = []

# Loop through each transaction object in the transactions list
for t in transactions:
    # Check for negative sales
    if t.sales < 0:
        dirty_cases.append({
            'issue': 'Negative Sales',
            'order_id': t.order_id,
            'sales': t.sales,
            'city': t.city
        })
        
    # Check for zero or negative quantity
    if t.quantity <= 0:
        dirty_cases.append({
            'issue': 'Non-positive Quantity',
            'order_id': t.order_id,
            'quantity': t.quantity,
            'city': t.city
        })
    
    # Check for missing, empty, or invalid city names
    if not isinstance(t.city, str) or not t.city.strip():
        dirty_cases.append({
            'issue': 'Missing or Invalid City',
            'order_id': t.order_id,
            'city': t.city,
            'sales': t.sales
        })

# Print a summary of findings
print("----- DIRTY DATA CASES FOUND -----")
print(f"Total dirty data cases: {len(dirty_cases)}\n")

# Show up to 5 sample dirty data cases for quick review
for i, case in enumerate(dirty_cases[:5]):
    print(f"Case {i+1}: {case}")

----- DIRTY DATA CASES FOUND -----
Total dirty data cases: 0
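
A count of zero can mean either that the data is clean or that the checks never fire. One way to tell the difference is to run the same three rules against deliberately dirty records. The sketch below does this with a hypothetical `find_issues` helper and made-up records. It mirrors the rules above but is not part of the original notebook.

```python
from types import SimpleNamespace

def find_issues(t):
    """Return the names of the rules that a transaction-like object violates."""
    issues = []
    if t.sales < 0:
        issues.append("Negative Sales")
    if t.quantity <= 0:
        issues.append("Non-positive Quantity")
    if not isinstance(t.city, str) or not t.city.strip():
        issues.append("Missing or Invalid City")
    return issues

# Made-up records: one clean, three dirty in different ways
samples = [
    SimpleNamespace(order_id="A-1", sales=10.0, quantity=1, city="Tampa"),
    SimpleNamespace(order_id="A-2", sales=-5.0, quantity=1, city="Tampa"),
    SimpleNamespace(order_id="A-3", sales=10.0, quantity=0, city="   "),
    SimpleNamespace(order_id="A-4", sales=10.0, quantity=2, city=float("nan")),
]

for s in samples:
    print(s.order_id, find_issues(s))
```

If every dirty sample is flagged and the clean one is not, the zero reported for the real dataset is more believable.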



### **Step 7:** Cleaning Rules

In [813]:
def clean(transactions):
    """
    Cleans the transaction list by removing:
    - Negative sales
    - Non-positive (zero or negative) quantity
    - Missing, empty, or invalid city names
    Prints before/after counts.
    """
    before_count = len(transactions)

    cleaned_transactions = []
    for t in transactions:
        if t.sales < 0:
            continue
        if t.quantity <= 0:
            continue
        if not isinstance(t.city, str) or not t.city.strip():
            continue
        cleaned_transactions.append(t)
    
    after_count = len(cleaned_transactions)
    print("\n----- DATA CLEANING REPORT -----")
    print(f"Records before cleaning: {before_count}")
    print(f"Records after cleaning:  {after_count}")
    print(f"Records removed:         {before_count - after_count}")

    return cleaned_transactions

# Usage:
transactions_clean = clean(transactions)


----- DATA CLEANING REPORT -----
Records before cleaning: 500
Records after cleaning:  500
Records removed:         0


### **Step 8:** Transformations

`parse_discount` extracts a numeric discount from `product_name` when the name contains a token of the form `10%OFF`.
It returns the integer discount (e.g., 10), or 0 if no such token is found.

In [814]:
def parse_discount(product_name):
    # Check each whitespace-separated token for a '<number>%OFF' suffix
    for word in product_name.split():
        if word.endswith('%OFF'):
            try:
                return int(word.replace('%OFF', ''))
            except ValueError:
                continue
    return 0

for t in transactions_clean:
    t.discount = parse_discount(t.product_name)
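
The token-splitting approach above only matches a token that ends in exactly `%OFF`. A regular expression is one alternative that also tolerates lowercase text, a space before `OFF`, or trailing punctuation. This is a sketch of that alternative, not the method used in this lab. `parse_discount_re`, `DISCOUNT_PATTERN`, and the sample product names are hypothetical.

```python
import re

# One or more digits, '%', optional whitespace, then 'OFF' in any letter case
DISCOUNT_PATTERN = re.compile(r"(\d+)\s*%\s*OFF", re.IGNORECASE)

def parse_discount_re(product_name):
    """Return the first integer discount found in product_name, or 0."""
    if not isinstance(product_name, str):
        return 0  # e.g. NaN from a missing product name
    match = DISCOUNT_PATTERN.search(product_name)
    return int(match.group(1)) if match else 0

# Made-up names, plus one real name from the dataset that has no discount
for name in ["Desk Lamp 10%OFF", "Stapler (15% off)", "Bush Somerset Collection Bookcase"]:
    print(name, "->", parse_discount_re(name))
```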

### **Step 9:** Feature Engineering

In [815]:
from datetime import datetime

today = datetime.today()
for t in transactions_clean:
    # Difference in days between now and order_date
    t.days_since_purchase = (today - t.order_date).days

# Show sample
print("\n------ DAYS SINCE PURCHASE ------")
for t in transactions_clean[:5]:
    print(f"Order Date: {t.order_date.date()}, Days Since Purchase: {t.days_since_purchase}")


------ DAYS SINCE PURCHASE ------
Order Date: 2016-11-08, Days Since Purchase: 3122
Order Date: 2016-11-08, Days Since Purchase: 3122
Order Date: 2016-06-12, Days Since Purchase: 3271
Order Date: 2015-10-11, Days Since Purchase: 3516
Order Date: 2015-10-11, Days Since Purchase: 3516
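
Because `datetime.today()` changes on every run, the values above will differ whenever the notebook is re-executed. Pinning a reference date makes the feature reproducible. In this sketch the date 2025-05-27 is an assumption, chosen because it reproduces the day counts printed above. `REFERENCE_DATE` and `days_since` are hypothetical names.

```python
from datetime import datetime

# Assumed reference date (not stated in the notebook)
REFERENCE_DATE = datetime(2025, 5, 27)

def days_since(order_date, reference=REFERENCE_DATE):
    """Whole days between order_date and the reference date."""
    return (reference - order_date).days

for text in ["11/8/2016", "6/12/2016", "10/11/2015"]:
    order_date = datetime.strptime(text, "%m/%d/%Y")
    print(order_date.date(), days_since(order_date))
```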


### **Step 10:** Mini-Aggregation

In [816]:
from collections import defaultdict

# Calculate total revenue per city
city_revenue = defaultdict(float)
for t in transactions_clean:
    city_revenue[t.city] += t.sales

# Sort cities by revenue in descending order and take the top 10
top_10_cities = sorted(city_revenue.items(), key=lambda x: x[1], reverse=True)[:10]

print("\n------ Top 10 Cities by Total Revenue ------")
for city, revenue in top_10_cities:
    print(f"{city}: ${revenue:.2f}")


------ Top 10 Cities by Total Revenue ------
New York City: $15037.80
Houston: $11566.00
San Antonio: $10546.57
Los Angeles: $8991.52
San Francisco: $8308.86
Chicago: $6762.27
Philadelphia: $6264.84
Franklin: $4181.40
Lakeville: $3745.63
San Diego: $3364.07
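
The same sum-then-rank pattern can also be written with `collections.Counter`, whose `most_common` method handles the sorting and slicing. The sketch below uses made-up `(city, sales)` pairs in place of `transactions_clean`. `sample_rows`, `city_totals`, and `top_2` are hypothetical names.

```python
from collections import Counter

# Made-up (city, sales) pairs standing in for the cleaned transactions
sample_rows = [
    ("Houston", 100.0),
    ("Tampa", 40.0),
    ("Houston", 25.5),
    ("Chicago", 60.0),
]

city_totals = Counter()
for city, sales in sample_rows:
    city_totals[city] += sales  # missing keys start at 0

# most_common(n) returns the n largest totals, highest first
top_2 = city_totals.most_common(2)
for city, revenue in top_2:
    print(f"{city}: ${revenue:.2f}")
```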


### **Step 11:** Serialization Checkpoint

In [817]:
!pip install pyarrow







In [818]:
# Convert the Transaction objects back to a DataFrame for saving.
# DataFrame.to_parquet needs a Parquet engine such as pyarrow to be installed.
df_cleaned = pd.DataFrame([vars(t) for t in transactions_clean])
df_cleaned.to_json("final_cleaned_transactions_data.json", orient="records", lines=True)
df_cleaned.to_parquet("final_cleaned_transactions_data.parquet")
print("Serialized to JSON and Parquet.")

Serialized to JSON and Parquet.
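
A checkpoint is only useful if it can be read back unchanged. The sketch below illustrates a round-trip check with the standard library rather than pandas: records are written as one JSON object per line (the layout requested by `orient="records", lines=True`) and then read back and compared. The records are made up, an in-memory buffer stands in for the file, and storing dates as ISO 8601 strings is an assumption of this sketch.

```python
import io
import json
from datetime import datetime

# Made-up records standing in for the cleaned transactions
records = [
    {"order_id": "A-1", "order_date": datetime(2016, 11, 8), "sales": 261.96},
    {"order_id": "A-2", "order_date": datetime(2016, 6, 12), "sales": 14.62},
]

# Write one JSON object per line; datetimes are stored as ISO 8601 strings
buffer = io.StringIO()
for rec in records:
    row = {**rec, "order_date": rec["order_date"].isoformat()}
    buffer.write(json.dumps(row) + "\n")

# Read the lines back and restore the datetime field
buffer.seek(0)
restored = []
for line in buffer:
    row = json.loads(line)
    row["order_date"] = datetime.fromisoformat(row["order_date"])
    restored.append(row)

print(f"Round trip preserved all records: {restored == records}")
```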
