#### Lab 2 — Data Collection and Pre-Processing Lab Assignment
Albright Maduka  

PROG8245 

##### Overview
You will execute the 12-step Data Engineering road-map practiced in class, this time end-to-end on a realistic e-commerce dataset.
Your deliverable is a well-commented Jupyter Notebook that loads raw data, cleans and enriches it, and finishes with a concise analytical insight. All code, data, and documentation must live in a GitHub repository you control.

##### Connecting to Neon DB and saving my synthetic data

In [17]:

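import os
import pandas as pd
import psycopg2
from pathlib import Path
# Read the Neon connection string from an environment variable rather than hardcoding credentials.
# NEON_DATABASE_URL is an assumed variable name; set it to your own connection string, e.g.
# postgresql://<user>:<password>@<host>/neondb?sslmode=require&channel_binding=require
conn_str = os.environ["NEON_DATABASE_URL"]  # raises KeyError if the variable is not set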

# Ensure "data" folder exists
data_dir = Path("data")
data_dir.mkdir(parents=True, exist_ok=True)

# Connect to the database
conn = psycopg2.connect(conn_str)
# Query the table and load into Pandas
df = pd.read_sql_query("SELECT * FROM customers;", conn)
# Show the DataFrame
print(df.head(3))

# Save to CSV as synthetic_customers.csv inside data/
csv_path = data_dir / "synthetic_customers.csv"
df.to_csv(csv_path, index=False)

print(f"Data saved to {csv_path}")
conn.close()

   customer_id coupon_code shipping_city
0          343    FREESHIP      San Jose
1          377   HOLIDAY15  Philadelphia
2          431   HOLIDAY15      San Jose
Data saved to data\synthetic_customers.csv
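
The query-then-export pattern in the cell above can be tried without Neon credentials. The sketch below swaps in an in-memory `sqlite3` database as a stand-in for the PostgreSQL connection (an assumption made purely for illustration) and wraps the connection in `contextlib.closing`, so it is closed even if the query raises. In the original cell, `conn.close()` would be skipped in that case. The two sample rows are copied from the output above.

```python
import csv
import io
import sqlite3
from contextlib import closing

# sqlite3 in-memory database used as a stand-in for the Neon/PostgreSQL connection
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute(
        "CREATE TABLE customers (customer_id INTEGER, coupon_code TEXT, shipping_city TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [(343, "FREESHIP", "San Jose"), (377, "HOLIDAY15", "Philadelphia")],
    )
    cur = conn.execute("SELECT * FROM customers ORDER BY customer_id")
    header = [col[0] for col in cur.description]  # column names from the cursor
    rows = cur.fetchall()

# Write to an in-memory buffer; a real run would open a file under data/ instead
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(header)
writer.writerows(rows)
print(buf.getvalue())
```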



##### 1) Hello, Data! — load raw CSV and show first 3 rows

In [21]:
import pandas as pd

# Load the sales dataset
sales_df = pd.read_csv("./data/1000 Sales Records.csv")
print("Sales Records shape:", sales_df.shape)
print(sales_df.head(3))   # show first 3 rows

# Load the synthetic customers dataset
customers_df = pd.read_csv("./data/synthetic_customers.csv")
print("\nSynthetic Customers shape:", customers_df.shape)
print(customers_df.head(3))   # show first 3 rows


Sales Records shape: (1000, 14)
                         Region Country   Item Type Sales Channel  \
0  Middle East and North Africa   Libya   Cosmetics       Offline   
1                 North America  Canada  Vegetables        Online   
2  Middle East and North Africa   Libya   Baby Food       Offline   

  Order Priority  Order Date   Order ID   Ship Date  Units Sold  Unit Price  \
0              M  10/18/2014  686800706  10/31/2014        8446      437.20   
1              M   11/7/2011  185941302   12/8/2011        3018      154.06   
2              C  10/31/2016  246222341   12/9/2016        1517      255.28   

   Unit Cost  Total Revenue  Total Cost  Total Profit  
0     263.33     3692591.20  2224085.18    1468506.02  
1      90.93      464953.08   274426.74     190526.34  
2     159.42      387259.76   241840.14     145419.62  

Synthetic Customers shape: (1000, 3)
   customer_id coupon_code shipping_city
0          343    FREESHIP      San Jose
1          377   HOLIDAY15  Philadelphia
2          431   HOLIDAY15      San Jose

##### 2) Pick the Right Container
- **Dictionary (`dict`)**: Easy to build from a CSV row; offers flexible key-value pairs (e.g., `{"date":..., "price":...}`).
- **Namedtuple / Dataclass / Custom Class**: More structured; supports methods (e.g., cleaning or derived fields) and attribute access (`row.price`).
- **Set**: Stores only unique values and does not preserve column meaning, so it cannot represent a row.

The `Transaction` dataclass in step 3 follows the second option.
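
The trade-offs listed above can be seen side by side in a small sketch. `CustomerRow` and `has_coupon` are illustrative names that do not appear in the lab code; the row values come from the customers output shown earlier.

```python
from dataclasses import dataclass

# One row, using the column names of this lab's customers table
row_dict = {"customer_id": "343", "coupon_code": "FREESHIP", "shipping_city": "San Jose"}


@dataclass
class CustomerRow:  # illustrative name, not part of the lab code
    customer_id: str
    coupon_code: str
    shipping_city: str

    def has_coupon(self) -> bool:
        """A derived field is easy to attach as a method."""
        return bool(self.coupon_code)


row_obj = CustomerRow(**row_dict)

# A set keeps the values but loses which column each value came from
row_set = set(row_dict.values())

print(row_dict["shipping_city"])  # key lookup
print(row_obj.shipping_city)      # attribute access
print(row_obj.has_coupon())       # method on the row
print("San Jose" in row_set)      # membership only; no column meaning
```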


##### 3) Implement Functions & Data Structure

In [39]:
from __future__ import annotations
from dataclasses import dataclass
from datetime import date
from typing import Optional
import re, math

COUPON_RE = re.compile(r"(\d{1,2})")  # e.g., SAVE15 → "15"

@dataclass
class Transaction:
    date: date
    customer_id: str
    product: str
    price: float
    quantity: int
    coupon_code: Optional[str]
    shipping_city: str

    def clean(self, price_median: float) -> None:
        """Fix bad values in price, quantity, and coupon_code."""
        # --- price cleaning ---
        try:
            p = float(self.price)
            if math.isnan(p) or p <= 0:
                p = price_median
        except Exception:
            p = price_median
        self.price = float(p)

        # --- quantity cleaning ---
        try:
            q = int(self.quantity)
            self.quantity = q if q > 0 else 1
        except Exception:
            self.quantity = 1

        # --- coupon_code cleaning (strip, uppercase, remove invalids) ---
        if isinstance(self.coupon_code, str):
            cc = self.coupon_code.strip().upper()
            if cc in {"N/A", "NONE", ""}:
                self.coupon_code = None
            else:
                self.coupon_code = cc
        else:
            self.coupon_code = None

    def discount_pct(self) -> float:
        """Extract numeric discount from coupon_code (SAVE15 → 0.15)."""
        if not self.coupon_code or self.coupon_code == "FREESHIP":
            return 0.0
        m = COUPON_RE.search(self.coupon_code)
        return (int(m.group(1)) / 100.0) if m else 0.0

    def net_price(self) -> float:
        """Price after discount."""
        return self.price * (1 - self.discount_pct())

    def revenue(self) -> float:
        """Revenue = net_price × quantity."""
        return self.net_price() * self.quantity

    def days_since_purchase(self) -> int:
        """Days passed since the order date."""
        return (date.today() - self.date).days
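
To see how the coupon pattern behaves on its own, the discount logic can be pulled into a standalone function. `discount_from_coupon` is a hypothetical helper written for this sketch; it mirrors `discount_pct` but is not part of the class. One caveat worth knowing: `\d{1,2}` reads at most two digits, so a code such as `SAVE100` would be treated as 10%.

```python
import re
from typing import Optional

COUPON_RE = re.compile(r"(\d{1,2})")  # same pattern as in the Transaction cell


def discount_from_coupon(code: Optional[str]) -> float:
    """Standalone copy of the discount_pct logic, for quick experiments."""
    if not code or code == "FREESHIP":
        return 0.0
    m = COUPON_RE.search(code)
    return int(m.group(1)) / 100.0 if m else 0.0


for code in ["HOLIDAY15", "SAVE10", "FREESHIP", None, "SAVE100"]:
    print(code, "->", discount_from_coupon(code))
```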


##### 4) Bulk Loader

In [49]:
from typing import List
import pandas as pd

# --- Load and clean headers ---
sales_df = pd.read_csv("data/1000 Sales Records.csv")
customers_df = pd.read_csv("data/synthetic_customers.csv")

# Standardize column names (strip spaces, lowercase, underscores)
sales_df.columns = sales_df.columns.str.strip().str.replace(" ", "_").str.lower()
customers_df.columns = customers_df.columns.str.strip().str.replace(" ", "_").str.lower()

print("Sales columns:", sales_df.columns.tolist())
print("Customers columns:", customers_df.columns.tolist())


# --- Bulk Loader Function ---
def load_transactions(
    sales_df: pd.DataFrame,
    customers_df: pd.DataFrame
) -> List[Transaction]:
    """Build list[Transaction] from cleaned sales & customers DataFrames."""

    n = len(sales_df)

    # Ensure customers length matches sales length
    if len(customers_df) < n:
        reps = int((n / len(customers_df)) + 1)
        customers_df = pd.concat([customers_df] * reps, ignore_index=True)
    customers_df = customers_df.iloc[:n].reset_index(drop=True)

    # Build Transaction objects (list comprehension)
    transactions: List[Transaction] = [
        Transaction(
            date=pd.to_datetime(sales_df.loc[i, "order_date"], errors="coerce").date(),
            customer_id=str(customers_df.loc[i, "customer_id"]),
            product=str(sales_df.loc[i, "item_type"]),
            price=float(sales_df.loc[i, "unit_price"]),
            quantity=int(sales_df.loc[i, "units_sold"]),
            coupon_code=str(customers_df.loc[i, "coupon_code"]),
            shipping_city=str(customers_df.loc[i, "shipping_city"]),
        )
        for i in range(n)
    ]
    return transactions


# --- Usage ---
tx = load_transactions(sales_df, customers_df)
print(f"Loaded {len(tx)} transactions")
tx[:3]  # quick peek

Sales columns: ['region', 'country', 'item_type', 'sales_channel', 'order_priority', 'order_date', 'order_id', 'ship_date', 'units_sold', 'unit_price', 'unit_cost', 'total_revenue', 'total_cost', 'total_profit']
Customers columns: ['customer_id', 'coupon_code', 'shipping_city']
Loaded 1000 transactions


[Transaction(date=datetime.date(2014, 10, 18), customer_id='343', product='Cosmetics', price=437.2, quantity=8446, coupon_code='FREESHIP', shipping_city='San Jose'),
 Transaction(date=datetime.date(2011, 11, 7), customer_id='377', product='Vegetables', price=154.06, quantity=3018, coupon_code='HOLIDAY15', shipping_city='Philadelphia'),
 Transaction(date=datetime.date(2016, 10, 31), customer_id='431', product='Baby Food', price=255.28, quantity=1517, coupon_code='HOLIDAY15', shipping_city='San Jose')]
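
The loader pairs each sale with a customer by position and repeats the customers table when it is shorter than the sales table. A minimal standard-library sketch of the same idea follows, assuming small in-memory CSV stand-ins (sale values copied from the output above; only two customers are kept so that the wrap-around is visible). `itertools.cycle` plays the role of the `concat`/`iloc` step.

```python
import csv
import io
from itertools import cycle

# Tiny in-memory stand-ins for the two CSV files
sales_csv = io.StringIO(
    "item_type,unit_price,units_sold\n"
    "Cosmetics,437.20,8446\n"
    "Vegetables,154.06,3018\n"
    "Baby Food,255.28,1517\n"
)
customers_csv = io.StringIO(
    "customer_id,coupon_code,shipping_city\n"
    "343,FREESHIP,San Jose\n"
    "377,HOLIDAY15,Philadelphia\n"
)

sales_rows = list(csv.DictReader(sales_csv))
customer_rows = list(csv.DictReader(customers_csv))

# zip() stops at the end of sales_rows; cycle() repeats the shorter customers list
records = [
    {
        "product": s["item_type"],
        "price": float(s["unit_price"]),
        "quantity": int(s["units_sold"]),
        "customer_id": c["customer_id"],
        "coupon_code": c["coupon_code"],
        "shipping_city": c["shipping_city"],
    }
    for s, c in zip(sales_rows, cycle(customer_rows))
]

print(len(records))
print(records[2]["customer_id"])  # third sale wraps around to the first customer
```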

##### 5) Quick Profiling — min/mean/max price, unique city count (set)

In [50]:
# Generator expressions for price stats
prices = (t.price for t in tx if isinstance(t.price, (int, float)))

# Convert generator to list so we can reuse
prices = list(prices)

print("Min Price :", min(prices))
print("Mean Price:", sum(prices) / len(prices))
print("Max Price :", max(prices))

# Use a set to count unique shipping cities
unique_cities = {t.shipping_city for t in tx}
print("\nUnique Shipping City Count:", len(unique_cities))
print("Cities:", unique_cities)


Min Price : 9.33
Mean Price: 262.10684
Max Price : 668.27

Unique Shipping City Count: 10
Cities: {'Dallas', 'Philadelphia', 'Phoenix', 'San Diego', 'Houston', 'San Antonio', 'San Jose', 'New York', 'Chicago', 'Los Angeles'}
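
The reason the generator is converted to a list before profiling is that a generator can be consumed only once. The sketch below illustrates this with a few hypothetical `(price, city)` pairs (the first three prices come from the rows shown earlier; the fourth pair is made up) and uses `statistics.mean` from the standard library.

```python
from statistics import mean

# Hypothetical (price, city) pairs for illustration
rows = [(437.20, "San Jose"), (154.06, "Philadelphia"), (255.28, "San Jose"), (9.33, "Dallas")]

gen = (p for p, _ in rows)
first_pass = sum(gen)
second_pass = sum(gen)  # the generator is already exhausted, so this sums nothing

prices = [p for p, _ in rows]        # a list can be iterated as often as needed
cities = {city for _, city in rows}  # set comprehension drops duplicates

print(first_pass, second_pass)
print(min(prices), mean(prices), max(prices))
print(len(cities))
```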


##### 6) Spotting the Grime — inject a few deliberate errors

In [51]:
import math

# --- Inject grime (only if we have enough rows) ---
if len(tx) >= 5:
    tx[1].price = -9.0                 # negative price
    tx[2].price = float("nan")         # NaN price
    tx[3].coupon_code = "n/a"          # bad coupon text
    tx[4].coupon_code = "  Save10  "   # messy spacing/case

# --- Find grime with comprehensions ---
bad_price_idx = [
    i for i, t in enumerate(tx)
    if (not isinstance(t.price, (int, float)))
       or (isinstance(t.price, float) and (math.isnan(t.price) or t.price <= 0))
]

dirty_coupon_idx = [
    i for i, t in enumerate(tx)
    if isinstance(t.coupon_code, str)
       and t.coupon_code.strip().lower() in {"n/a", "", "none"}
]

print(f"Bad price rows: {len(bad_price_idx)} -> indices {bad_price_idx[:10]}")
print(f"Dirty coupon rows: {len(dirty_coupon_idx)} -> indices {dirty_coupon_idx[:10]}")

# quick peek at the first few offending rows
for i in (bad_price_idx[:2] + dirty_coupon_idx[:2]):
    if 0 <= i < len(tx):
        t = tx[i]
        print(f"\nRow {i}: price={t.price}, coupon_code={t.coupon_code}, product={t.product}, city={t.shipping_city}")


Bad price rows: 2 -> indices [1, 2]
Dirty coupon rows: 1 -> indices [3]

Row 1: price=-9.0, coupon_code=HOLIDAY15, product=Vegetables, city=Philadelphia

Row 2: price=nan, coupon_code=HOLIDAY15, product=Baby Food, city=San Jose

Row 3: price=205.7, coupon_code=n/a, product=Cereal, city=San Jose
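
One detail of the price check above: the `<= 0` test is applied only when the value is a `float`, so a non-positive `int` would slip through. A possible tightening is sketched below. `is_bad_price` is a hypothetical helper, and treating `bool` values as bad is an assumption of this sketch rather than something the lab specifies.

```python
import math


def is_bad_price(value) -> bool:
    """True for non-numeric, NaN, zero, or negative prices (ints included)."""
    # bool is a subclass of int, so it is rejected explicitly (assumption of this sketch)
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return True
    return math.isnan(value) or value <= 0


samples = [437.2, -9.0, float("nan"), 0, -3, "12.5", None]
bad_idx = [i for i, v in enumerate(samples) if is_bad_price(v)]
print(bad_idx)
```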


##### 7) Cleaning

In [52]:
# Locate transactions with bad prices
import math
import pandas as pd

# 7a) FIND grime with boolean-style list comprehensions
bad_prices = [t for t in tx if (not isinstance(t.price, (int, float)))
              or (isinstance(t.price, float) and (math.isnan(t.price) or t.price <= 0))]
bad_coupons = [t for t in tx if (t.coupon_code is None) or
               (isinstance(t.coupon_code, str) and t.coupon_code.strip().lower() in {"n/a", "", "none"})]

print(f"Before clean → bad price rows: {len(bad_prices)}, dirty coupon rows: {len(bad_coupons)}")


Before clean → bad price rows: 2, dirty coupon rows: 1


In [53]:
# Compute a robust median from valid positive prices only
valid_prices = [t.price for t in tx
                if isinstance(t.price, (int, float)) and not math.isnan(t.price) and t.price > 0]
median_price = float(pd.Series(valid_prices).median()) if valid_prices else 10.0
print("Median price used for fixes:", median_price)

# CLEAN — call t.clean(...) on every Transaction (now also fixes N/A coupons)
for t in tx:
    t.clean(price_median=median_price)

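# VERIFY — re-check after cleaning
# Note: the coupon check below still counts None, which is the value clean() assigns to
# "n/a"-style codes, so the injected row is reported again even though it was cleaned.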
bad_prices_after = [t for t in tx if (not isinstance(t.price, (int, float)))
                    or (isinstance(t.price, float) and (math.isnan(t.price) or t.price <= 0))]
bad_coupons_after = [t for t in tx if (t.coupon_code is None) or
                     (isinstance(t.coupon_code, str) and t.coupon_code.strip().lower() in {"n/a", "", "none"})]

print(f"After clean  → bad price rows: {len(bad_prices_after)}, dirty coupon rows: {len(bad_coupons_after)}")

# Optional: peek at 1 cleaned record
sample = tx[0]
print({"price": sample.price,
       "coupon_code": sample.coupon_code,
       "discount_pct()": sample.discount_pct(),
       "net_price()": sample.net_price(),
       "revenue()": sample.revenue()})

Median price used for fixes: 154.06
After clean  → bad price rows: 0, dirty coupon rows: 1
{'price': 437.2, 'coupon_code': 'FREESHIP', 'discount_pct()': 0.0, 'net_price()': 437.2, 'revenue()': 3692591.1999999997}


In [54]:
# Median from valid positive prices only
import math, pandas as pd
valid_prices = [t.price for t in tx if isinstance(t.price, (int,float)) and not math.isnan(t.price) and t.price > 0]
median_price = float(pd.Series(valid_prices).median()) if valid_prices else 10.0
print("Median price used:", median_price)

# Clean each transaction in-place
for t in tx:
    t.clean(price_median=median_price)

# Verify after cleaning
bad_prices_after = [t for t in tx if (not isinstance(t.price,(int,float))) or (isinstance(t.price,float) and (math.isnan(t.price) or t.price <= 0))]
bad_coupons_after = [t for t in tx if isinstance(t.coupon_code,str) and t.coupon_code.strip().lower() in {"n/a","","none"}]
print(f"After clean → bad prices: {len(bad_prices_after)}, bad coupons: {len(bad_coupons_after)}")


Median price used: 154.06
After clean → bad prices: 0, bad coupons: 0
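
The cleaning step is a median imputation: the median is computed from valid prices only, and then used to replace the invalid ones. A minimal sketch with `statistics.median` follows, assuming a short hypothetical price list that contains the same two kinds of injected errors.

```python
import math
from statistics import median

prices = [437.2, -9.0, float("nan"), 205.7, 154.06]  # includes a negative and a NaN


def is_valid(p: float) -> bool:
    """A price is usable if it is a real, positive number."""
    return not math.isnan(p) and p > 0


valid = [p for p in prices if is_valid(p)]
fallback = median(valid) if valid else 10.0  # 10.0 mirrors the default used above

cleaned = [p if is_valid(p) else fallback for p in prices]
print(fallback)
print(cleaned)
```

Computing the median before any replacement keeps the bad values from shifting it.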


##### 8) Transformation

In [57]:
# Transformations: discount, net_price, revenue, transaction_key → DataFrame

import pandas as pd

# Build a tidy analysis DataFrame from the list[Transaction]
df = pd.DataFrame([{
    "date": t.date,
    "customer_id": t.customer_id,
    "product": t.product,
    "price": t.price,
    "quantity": t.quantity,
    "coupon_code": (t.coupon_code or "").strip().upper(),  # normalize for display
    "shipping_city": t.shipping_city,
    "discount_pct": t.discount_pct(),          # method (e.g., SAVE15 -> 0.15; FREESHIP/invalid -> 0.0)
    "net_price": t.net_price(),                # price after discount
    "revenue": t.revenue(),                    # net_price * quantity
    "days_since_purchase": t.days_since_purchase(),  # recency feature
    # Whitespace is removed from the product name (e.g., "Baby Food" -> "BabyFood")
    "transaction_key": f"{t.date:%Y%m%d}_{t.customer_id}_{''.join(t.product.split())}"
} for t in tx])

print("Shape:", df.shape)
df.head(3)


Shape: (1000, 12)


Unnamed: 0,date,customer_id,product,price,quantity,coupon_code,shipping_city,discount_pct,net_price,revenue,days_since_purchase,transaction_key
0,2014-10-18,343,Cosmetics,437.2,8446,FREESHIP,San Jose,0.0,437.2,3692591.2,3999,20141018_343_Cosmetics
1,2011-11-07,377,Vegetables,154.06,3018,HOLIDAY15,Philadelphia,0.15,130.951,395210.118,5075,20111107_377_Vegetables
2,2016-10-31,431,Baby Food,154.06,1517,HOLIDAY15,San Jose,0.15,130.951,198652.667,3255,20161031_431_BabyFood


##### 9) Feature Engineering

In [None]:
import pandas as pd

# Add features: days_since_purchase, order_year, order_month, product_initial
df["days_since_purchase"] = [t.days_since_purchase() for t in tx]
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df["order_year"]  = df["date"].dt.year
df["order_month"] = df["date"].dt.month
df["product_initial"] = df["product"].str[0]   # first letter of the product name

df.head(3)


Unnamed: 0,date,customer_id,product,price,quantity,coupon_code,shipping_city,discount_pct,net_price,revenue,days_since_purchase,transaction_key,order_year,order_month,product_initial
0,2014-10-18,343,Cosmetics,437.2,8446,FREESHIP,San Jose,0.0,437.2,3692591.2,3999,20141018_343_Cosmetics,2014,10,C
1,2011-11-07,377,Vegetables,154.06,3018,HOLIDAY15,Philadelphia,0.15,130.951,395210.118,5075,20111107_377_Vegetables,2011,11,V
2,2016-10-31,431,Baby Food,154.06,1517,HOLIDAY15,San Jose,0.15,130.951,198652.667,3255,20161031_431_BabyFood,2016,10,B
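
Because `days_since_purchase` is measured against `date.today()`, its values change every time the notebook is re-run. One way to make the date features reproducible is to pass the reference date explicitly, as in the sketch below. `date_features` and its `as_of` parameter are hypothetical names introduced only for this example.

```python
from datetime import date


def date_features(order_date: date, as_of: date) -> dict:
    """Derive year, month, and recency from an order date."""
    return {
        "order_year": order_date.year,
        "order_month": order_date.month,
        "days_since_purchase": (as_of - order_date).days,
    }


# as_of is pinned (an illustrative choice) so the result does not depend on the run date
feats = date_features(date(2014, 10, 18), as_of=date(2015, 10, 18))
print(feats)
```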


##### 10) Mini-aggregation (by city)

In [62]:

from collections import defaultdict

# ---defaultdict(float) ---
city_rev = defaultdict(float)
for t in tx:
    city_rev[t.shipping_city] += t.revenue()   # uses your Transaction method

top10_py = sorted(city_rev.items(), key=lambda kv: kv[1], reverse=True)[:10]
print("Top 10 cities (pure Python):")
for city, rev in top10_py:
    print(f"{city:16} → {rev:,.2f}")

# --- Pandas: groupby ---
top10_pd = (df.groupby("shipping_city", dropna=False)["revenue"]
              .sum()
              .sort_values(ascending=False)
              .head(10))

print("\nTop 10 cities (pandas):")
print(top10_pd)


Top 10 cities (pure Python):
Los Angeles      → 149,823,346.24
San Jose         → 134,329,286.07
Chicago          → 131,133,374.93
Philadelphia     → 125,940,206.28
Dallas           → 118,441,596.02
San Antonio      → 117,734,279.08
New York         → 115,062,941.59
Phoenix          → 113,157,925.54
Houston          → 102,180,473.97
San Diego        → 92,775,818.31

Top 10 cities (pandas):
shipping_city
Los Angeles     1.498233e+08
San Jose        1.343293e+08
Chicago         1.311334e+08
Philadelphia    1.259402e+08
Dallas          1.184416e+08
San Antonio     1.177343e+08
New York        1.150629e+08
Phoenix         1.131579e+08
Houston         1.021805e+08
San Diego       9.277582e+07
Name: revenue, dtype: float64
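
The pure-Python half of the aggregation can be checked on a small input where the totals are easy to verify by hand. The sketch below uses hypothetical `(city, revenue)` pairs that are not taken from the lab data, and adds `collections.Counter` as an alternative accumulator whose `most_common()` method handles the ranking.

```python
from collections import Counter, defaultdict

# Hypothetical (city, revenue) pairs, not taken from the lab data
sales = [("Chicago", 120.0), ("Dallas", 80.0), ("Chicago", 30.5), ("Houston", 99.5)]

# defaultdict(float) starts every new city at 0.0
city_rev = defaultdict(float)
for city, rev in sales:
    city_rev[city] += rev

# Counter accumulates the same way and can rank the totals directly
city_counter = Counter()
for city, rev in sales:
    city_counter[city] += rev

top2 = sorted(city_rev.items(), key=lambda kv: kv[1], reverse=True)[:2]
print(top2)
print(city_counter.most_common(2))
```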


##### 11) Serialization Checkpoint

In [64]:
# Save cleaned DataFrame to JSON file

from pathlib import Path

OUT = Path("data/out")
OUT.mkdir(parents=True, exist_ok=True)

out_path = OUT / "transactions_clean.json"
df.to_json(out_path, orient="records", indent=2)  # writes an array of records

print(f"Saved JSON to: {out_path.as_posix()}")

Saved JSON to: data/out/transactions_clean.json
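
A checkpoint is only useful if it can be read back faithfully, and dates are the usual sticking point: depending on the pandas version and the `date_format` argument, `to_json` may write datetimes as epoch numbers rather than readable strings. The sketch below shows a round trip with the standard `json` module, storing the date as an ISO string. The single record is a hypothetical example shaped like one cleaned transaction.

```python
import json
from datetime import date

records = [
    {"date": date(2014, 10, 18), "product": "Cosmetics", "price": 437.2, "coupon_code": None},
]

# date objects are not JSON-serializable by default; store them as ISO strings
payload = json.dumps(
    [{**r, "date": r["date"].isoformat()} for r in records],
    indent=2,
)

# Read the JSON back and rebuild the date object
restored = json.loads(payload)
restored[0]["date"] = date.fromisoformat(restored[0]["date"])
print(restored[0] == records[0])
```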


##### 12) Soft Interview Reflection: how functions have helped (< 120 words)

Using functions helped me write clean, readable code. Instead of putting everything in one place, I split the work into small, manageable pieces. Each function did one job, so I could reuse it and fix problems faster. It also made my code easier to read and share.
