#### PROG8245 Machine Learning Programming
#### Reham Abuarqoub 9062922

## Step 1: Hello, Data!
I downloaded the dataset from https://archive.ics.uci.edu/dataset/352/online%2Bretail
The full dataset contains 541,909 records. I took the first 500 records and saved them as online_retail_500.csv.

In [1]:
import pandas as pd

# Load the CSV data file
raw_data = pd.read_csv("data/online_retail_500.csv")  # forward slash avoids the invalid "\o" escape

# Show the first 3 rows
raw_data.head(3)


## Step 2: Pick the Right Container
We considered three options: `dict`, `namedtuple`, and a custom `class` called `Transaction`. The class offers the best flexibility and encapsulation for our transactions: it lets us attach methods such as `.clean()` and `.total()` and extend the functionality later.
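
To make the trade-off concrete, here is a minimal sketch of the three containers holding the same record. The sample values and the names `TransactionTuple` and `MiniTransaction` are illustrative assumptions, not part of the notebook's own code.

```python
from collections import namedtuple

# Illustrative sample record (assumed values, not read from the CSV)
row = {"product": "WHITE METAL LANTERN", "price": 3.39, "quantity": 6}

# Option 1: dict. Flexible, but any logic has to live outside the record.
as_dict = dict(row)
dict_total = round(as_dict["price"] * as_dict["quantity"], 2)

# Option 2: namedtuple. Attribute access, but fields are read-only.
TransactionTuple = namedtuple("TransactionTuple", ["product", "price", "quantity"])
as_tuple = TransactionTuple(**row)
tuple_total = round(as_tuple.price * as_tuple.quantity, 2)

# Option 3: class. Data and behaviour live together and fields stay mutable.
class MiniTransaction:
    def __init__(self, product, price, quantity):
        self.product = product
        self.price = float(price)
        self.quantity = int(quantity)

    def total(self):
        return round(self.price * self.quantity, 2)

as_object = MiniTransaction(**row)
as_object.quantity = 8  # allowed: instance attributes can be updated in place
print(dict_total, tuple_total, as_object.total())
```

The namedtuple would raise `AttributeError` on the same assignment, which is one reason a class suits records that still need cleaning.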

## Step 3: Transaction Class and OO data structure
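In this step, I defined the `Transaction` class. Its `__init__` method is called whenever a new `Transaction` object is created.
The `clean()` method normalizes the shipping city (trimmed, title case) and the coupon code (trimmed, upper case, `NONE` when missing), and `apply_discount()` looks up the discount percentage for the coupon code.
The `total()` method calculates the revenue of the transaction after the discount.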

In [2]:
class Transaction:
    def __init__(self, date, customer_id, product, price, quantity, coupon_code, shipping_city):
        self.date = date
        self.customer_id = customer_id
        self.product = product
        self.price = float(price)
        self.quantity = int(quantity)
        self.coupon_code = coupon_code
        self.shipping_city = shipping_city
        self.discount_percentage = 0  # Filled in later via metadata

    def clean(self):
        self.shipping_city = self.shipping_city.strip().title()
        self.coupon_code = self.coupon_code.strip().upper() if pd.notna(self.coupon_code) else "NONE"

    def apply_discount(self, coupon_lookup):
        self.discount_percentage = coupon_lookup.get(self.coupon_code, 0)

    def total(self):
        discount_multiplier = 1 - (self.discount_percentage / 100)
        return round(self.price * self.quantity * discount_multiplier, 2)

## Step 4: Bulk Loader
In this step, the function `load_transactions` was created to return a list of `Transaction` objects.
The list `transactions = []` is initialized to collect the processed objects, and a `for` loop visits each row of the DataFrame.
Because there is no coupon code in my dataset, I generated a synthetic coupon code for each row from its invoice number and created a synthetic dataset, named metadata, that holds the discount for each code.

In [3]:
from typing import List

def load_transactions(df: pd.DataFrame) -> List[Transaction]:
    transactions = []
    for _, row in df.iterrows():
        # Synthetic coupon code: if last digit of InvoiceNo is even, use 'DISCOUNT10', else 'NONE'
        invoice_no = str(row['InvoiceNo'])
        coupon_code = 'DISCOUNT10' if invoice_no[-1].isdigit() and int(invoice_no[-1]) % 2 == 0 else 'NONE'
        t = Transaction(
            row['InvoiceDate'], row['CustomerID'], row['Description'], row['UnitPrice'], row['Quantity'],
            coupon_code, row['Country']
        )
        transactions.append(t)
    return transactions

transactions = load_transactions(raw_data)
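
The synthetic coupon rule used inside the loop can also be sketched as a small standalone function, which makes it easy to check on its own. The function name `synthetic_coupon` and the sample invoice numbers are assumptions for illustration.

```python
def synthetic_coupon(invoice_no) -> str:
    """Return 'DISCOUNT10' when the last character is an even digit, else 'NONE'."""
    last = str(invoice_no)[-1:]  # slicing keeps this safe for an empty string
    if last.isdigit() and int(last) % 2 == 0:
        return "DISCOUNT10"
    return "NONE"

# Illustrative invoice numbers (assumed values)
for sample in [536365, 536366, "C536379", ""]:
    print(repr(sample), "->", synthetic_coupon(sample))
```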

## Step 5: Quick Profiling
Calculate the min/mean/max of the price.
Count the unique `shipping_city` values.

In [4]:
prices = [t.price for t in transactions]
cities = set(t.shipping_city for t in transactions)

print("Min price:", min(prices))
print("Max price:", max(prices))
print("Average price:", sum(prices)/len(prices))
print("Unique cities:", len(cities))

Min price: 0.1
Max price: 165.0
Average price: 3.60888
Unique cities: 4


## Step 6: Spot the Grime

 - Extra spaces in city names
 - Mixed case in coupon codes
 - Some missing coupon codes
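
A minimal sketch of how these three kinds of grime could be counted before cleaning. The helper names `is_missing` and `profile_grime` and the sample codes are hypothetical. The notebook itself does not run this check.

```python
import math

def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def profile_grime(values):
    """Count missing values, values with surrounding spaces, and non-upper-case values."""
    report = {"missing": 0, "extra_spaces": 0, "not_upper": 0}
    for value in values:
        if is_missing(value):
            report["missing"] += 1
            continue
        text = str(value)
        if text != text.strip():
            report["extra_spaces"] += 1
        if text.strip() != text.strip().upper():
            report["not_upper"] += 1
    return report

# Hypothetical coupon codes showing each problem once or twice
sample_codes = ["discount10", " DISCOUNT10 ", None, "NONE", float("nan")]
report = profile_grime(sample_codes)
print(report)
```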

## Step 7: Cleaning Rules
Comparing the first five city names before and after `clean()` shows that they were already correctly formatted.

In [5]:
before_clean = [t.shipping_city for t in transactions[:5]]

for t in transactions:
    t.clean()

after_clean = [t.shipping_city for t in transactions[:5]]
print("Before:", before_clean)
print("After:", after_clean)


Before: ['United Kingdom', 'United Kingdom', 'United Kingdom', 'United Kingdom', 'United Kingdom']
After: ['United Kingdom', 'United Kingdom', 'United Kingdom', 'United Kingdom', 'United Kingdom']


## Step 8: Transformations
Use `coupon_code` to apply the discount to the price:
- Create a dictionary from the coupon metadata DataFrame that maps each coupon code to its discount percentage.
- Iterate through each `Transaction` object in the `transactions` list and call `apply_discount()` on it.

`apply_discount()` uses the `coupon_lookup` dictionary to check whether the `coupon_code` of the transaction exists in the dictionary.
If it does, the matching discount percentage is stored in the `discount_percentage` attribute; otherwise the discount stays at 0.

In [None]:
# Load coupon metadata to apply discounts from coupon codes
coupon_df = pd.read_csv("data/coupon_metadata.csv")  # forward slash avoids the invalid "\c" escape
coupon_lookup = dict(zip(coupon_df['coupon_code'], coupon_df['discount_percentage']))

for t in transactions:
    t.apply_discount(coupon_lookup)
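
The lookup and the discount arithmetic can be sketched without pandas as follows. The metadata rows, the unit prices, and the helper `discounted_total` are assumptions for illustration; the formula mirrors `Transaction.total()`, that is `price * quantity * (1 - discount_percentage / 100)`.

```python
# Hypothetical rows standing in for the contents of coupon_metadata.csv
coupon_rows = [
    {"coupon_code": "DISCOUNT10", "discount_percentage": 10},
    {"coupon_code": "NONE", "discount_percentage": 0},
]

# Map each coupon code to its discount percentage
coupon_lookup = {r["coupon_code"]: r["discount_percentage"] for r in coupon_rows}

def discounted_total(price, quantity, coupon_code, lookup):
    """Apply the percentage discount for coupon_code; unknown codes get 0%."""
    pct = lookup.get(coupon_code, 0)
    return round(price * quantity * (1 - pct / 100), 2)

print(discounted_total(1.85, 6, "DISCOUNT10", coupon_lookup))
print(discounted_total(2.55, 6, "NONE", coupon_lookup))
```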



## Step 9: Feature Engineering
In this step, I added the feature `days_since_purchase`.
This is a feature engineering step: each transaction gets a new calculated field holding the number of days since the purchase happened.
Knowing how many days ago each transaction occurred can be useful for analysis.

In [7]:
from datetime import datetime

for t in transactions:
    t.date = pd.to_datetime(t.date)
    t.days_since_purchase = (pd.Timestamp.today() - t.date).days
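
Because `pd.Timestamp.today()` changes on every run, the feature value is not reproducible. A possible variant, sketched here with the standard library, pins a reference date instead. The helper `days_since`, the reference date, and the timestamp format are all assumptions; the format in the actual CSV may differ.

```python
from datetime import datetime

def days_since(purchase: str, reference: datetime) -> int:
    """Whole days between a purchase timestamp and a fixed reference date."""
    # Assumed timestamp format; adjust to match the real InvoiceDate column
    purchased = datetime.strptime(purchase, "%Y-%m-%d %H:%M:%S")
    return (reference - purchased).days

# Hypothetical fixed reference date, chosen only for this example
reference = datetime(2011, 12, 1, 8, 26, 0)
print(days_since("2010-12-01 08:26:00", reference))
```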

## Step 10: Mini-Aggregation


> The small table below shows the total revenue per shipping city after discounts.

In [8]:
data = [{
    'shipping_city': t.shipping_city,
    'total': t.total()
} for t in transactions]

agg_df = pd.DataFrame(data)
revenue_by_city = agg_df.groupby('shipping_city').sum().sort_values(by='total', ascending=False)
revenue_by_city.head()

shipping_city,total
United Kingdom,14440.67
France,770.27
Australia,358.25
Netherlands,192.6
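
For comparison, the same group-and-sum can be sketched in plain Python with a `defaultdict`. The rows below are illustrative values, not the real 500 records.

```python
from collections import defaultdict

# Illustrative (city, total) pairs (assumed values)
rows = [
    ("United Kingdom", 15.30),
    ("France", 10.00),
    ("United Kingdom", 20.34),
]

# Sum the totals per city
totals = defaultdict(float)
for city, total in rows:
    totals[city] += total

# Sort cities by revenue, highest first, rounding to cents
ranked = sorted(
    ((city, round(value, 2)) for city, value in totals.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```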


I wrote these two cells to check the price after applying the discount coupon.

In [9]:
for t in transactions:
    t.apply_discount(coupon_lookup)


In [10]:
for t in transactions[:10]:
    original = t.price * t.quantity
    print(f"Code: {t.coupon_code:10s}  Discount: {t.discount_percentage:>2d}%  "
          f"Original: ${original:.2f}  After: ${t.total():.2f}")


Code: NONE        Discount:  0%  Original: $15.30  After: $15.30
Code: NONE        Discount:  0%  Original: $20.34  After: $20.34
Code: NONE        Discount:  0%  Original: $22.00  After: $22.00
Code: NONE        Discount:  0%  Original: $20.34  After: $20.34
Code: NONE        Discount:  0%  Original: $20.34  After: $20.34
Code: NONE        Discount:  0%  Original: $15.30  After: $15.30
Code: NONE        Discount:  0%  Original: $25.50  After: $25.50
Code: DISCOUNT10  Discount: 10%  Original: $11.10  After: $9.99
Code: DISCOUNT10  Discount: 10%  Original: $11.10  After: $9.99
Code: NONE        Discount:  0%  Original: $54.08  After: $54.08


## Step 11: Serialization Checkpoint
In this step, the cleaned data is saved as .json and .parquet files in the data folder.

In [11]:
# Convert all transactions to dict format
cleaned_data = [{
    'date': t.date.strftime('%Y-%m-%d'),
    'customer_id': t.customer_id,
    'product': t.product,
    'price': t.price,
    'quantity': t.quantity,
    'coupon_code': t.coupon_code,
    'shipping_city': t.shipping_city,
    'discount_percentage': t.discount_percentage,
    'total': t.total()
} for t in transactions]

cleaned_df = pd.DataFrame(cleaned_data)

# Forward slashes avoid the invalid "\c" escape sequence in the paths
cleaned_df.to_json("data/cleaned_transactions.json", orient='records', lines=True)
cleaned_df.to_parquet("data/cleaned_transactions.parquet")
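
With `orient='records', lines=True`, pandas writes JSON Lines: one JSON object per line. A minimal round-trip sketch with the standard library shows the idea. It uses an in-memory buffer instead of a file, and the two records are a reduced, illustrative subset of the fields.

```python
import io
import json

# Two records with a reduced set of fields (illustrative subset)
records = [
    {"product": "WHITE METAL LANTERN", "price": 3.39, "quantity": 6, "total": 20.34},
    {"product": "CREAM CUPID HEARTS COAT HANGER", "price": 2.75, "quantity": 8, "total": 22.0},
]

# Write one JSON object per line, as the JSON Lines format expects
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record) + "\n")

# Read the lines back and parse each one independently
buffer.seek(0)
restored = [json.loads(line) for line in buffer if line.strip()]
print(restored == records)
```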



## Step 12: Soft Interview Reflection

Object-oriented programming enabled logical grouping of data and methods. By encapsulating cleaning and transformation logic in the `Transaction` class, we made the process reusable, testable, and modular. It simplified bulk processing by making transactions uniform and easy to work with.


## Data Dictionary

| Field               | Type    | Description                | Source                                   |
|---------------------|---------|----------------------------|------------------------------------------|
| date                | Date    | Date of purchase           | online_retail_500.csv                    |
| customer_id         | String  | Unique customer identifier | online_retail_500.csv                    |
| product             | String  | Product name               | online_retail_500.csv                    |
| price               | Float   | Price per unit             | online_retail_500.csv                    |
| quantity            | Integer | Number of items purchased  | online_retail_500.csv                    |
| coupon_code         | String  | Code for discount          | derived (synthetic, from InvoiceNo)      |
| shipping_city       | String  | Delivery destination       | online_retail_500.csv (Country column)   |
| discount_percentage | Integer | Discount based on coupon   | coupon_metadata.csv                      |
| total               | Float   | Final price after discount | derived                                  |


I read the .json and .parquet files back to see how the cleaned data looks.

In [12]:
pd.read_json("data/cleaned_transactions.json", lines=True).head()
pd.read_parquet("data/cleaned_transactions.parquet").head()


Unnamed: 0,date,customer_id,product,price,quantity,coupon_code,shipping_city,discount_percentage,total
0,2010-12-01,17850,WHITE HANGING HEART T-LIGHT HOLDER,2.55,6,NONE,United Kingdom,0,15.3
1,2010-12-01,17850,WHITE METAL LANTERN,3.39,6,NONE,United Kingdom,0,20.34
2,2010-12-01,17850,CREAM CUPID HEARTS COAT HANGER,2.75,8,NONE,United Kingdom,0,22.0
3,2010-12-01,17850,KNITTED UNION FLAG HOT WATER BOTTLE,3.39,6,NONE,United Kingdom,0,20.34
4,2010-12-01,17850,RED WOOLLY HOTTIE WHITE HEART.,3.39,6,NONE,United Kingdom,0,20.34
