## Step 1: Hello, Data!

In [73]:
import pandas as pd

primary_df = pd.read_csv("data/Primary.csv")
primary_df.head(3)


| Unnamed: 0 | Customer_ID | Age | Gender | Income_Level | Marital_Status | Education_Level | Occupation | Location | Purchase_Category | Purchase_Amount | ... | Customer_Satisfaction | Engagement_with_Ads | Device_Used_for_Shopping | Payment_Method | Time_of_Purchase | Discount_Used | Customer_Loyalty_Program_Member | Purchase_Intent | Shipping_Preference | Time_to_Decision |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 37-611-6911 | 22 | Female | Middle | Married | Bachelor's | Middle | Évry | Gardening & Outdoors | $333.80 | ... | 7 |  | Tablet | Credit Card | 3/1/2024 | True | False | Need-based | No Preference | 2 |
| 1 | 29-392-9296 | 49 | Male | High | Married | High School | High | Huocheng | Food & Beverages | $222.22 | ... | 5 | High | Tablet | PayPal | 4/16/2024 | True | False | Wants-based | Standard | 6 |
| 2 | 84-649-5117 | 24 | Female | Middle | Single | Master's | High | Huzhen | Office Supplies | $426.22 | ... | 7 | Low | Smartphone | Debit Card | 3/15/2024 | True | True | Impulsive | No Preference | 3 |


## Step 2: Pick the Right Container
I chose a class because it groups all transaction-related fields and behavior in one place, which makes later steps such as cleaning and computing totals easier.



## Step 3: Transaction Class
I created a Transaction class that holds the fields of one purchase and adds two helper methods, .total() and .clean(), for later steps.

In [74]:
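class Transaction:
    """One purchase record built from a row of Primary.csv."""

    def __init__(self, Customer_ID, Age, Gender, Income_Level, Marital_Status, Education_Level,
                 Occupation, Location, Purchase_Category, Purchase_Amount, **kwargs):
        # **kwargs absorbs the remaining CSV columns, which are not stored.
        self.customer_id = Customer_ID
        # csv.DictReader yields strings; convert so age matches the data dictionary (int).
        self.age = int(Age)
        self.gender = Gender
        self.income = Income_Level
        self.marital = Marital_Status
        self.education = Education_Level
        self.occupation = Occupation
        # Normalize whitespace and casing of the location.
        self.city = Location.strip().title()
        self.category = Purchase_Category
        # Strip the dollar sign; a blank amount becomes 0.0.
        self.amount = float(Purchase_Amount.replace('$', '').strip()) if Purchase_Amount.strip() else 0.0

    def total(self):
        """Return the purchase amount for this transaction."""
        return self.amount

    def clean(self):
        """Re-apply the city normalization already done in __init__."""
        self.city = self.city.strip().title()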
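
The amount parsing in `__init__` handles a leading dollar sign and blank strings. As a minimal sketch, assuming amounts could also contain thousands separators (none appear in this dataset, where the maximum is 498.33), the same logic could live in a standalone helper. The name `parse_amount` is hypothetical and not part of the original class.

```python
def parse_amount(raw: str) -> float:
    """Convert a currency string such as "$333.80" to a float.

    Hypothetical helper: blank input maps to 0.0, mirroring the
    behaviour of Transaction.__init__.
    """
    # Drop the dollar sign and (assumed) thousands separators, then trim.
    cleaned = raw.replace("$", "").replace(",", "").strip()
    return float(cleaned) if cleaned else 0.0


print(parse_amount("$333.80"))
print(parse_amount(" $1,234.50 "))  # made-up value to show the comma case
print(parse_amount("   "))
```

Keeping the parsing in one function makes it easy to test separately from the class.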


## Step 4: Bulk Loader
Here, I wrote a loader function to go through the CSV file and turn each row into a Transaction object. This will make it easier to work with the data going forward.



In [75]:
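import csv
from typing import List

def load_transactions(filepath: str) -> List[Transaction]:
    """Read the CSV at filepath and return one Transaction per valid row."""
    transactions = []
    with open(filepath, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        for row in reader:
            try:
                # Column names are passed as keyword arguments to Transaction.
                t = Transaction(**row)
                transactions.append(t)
            except Exception as e:
                # Rows that fail to parse are reported and skipped.
                print("Skipped row due to error:", e)
    return transactions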

transactions = load_transactions("data/Primary.csv")
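
The loader prints a message for each skipped row but does not keep a record of them. As a rough sketch, a variant could return the failures together with their line numbers so they can be counted or inspected later. The names `load_rows` and `make_record` are hypothetical, `make_record` is only a tiny stand-in for the Transaction class, and the second sample row is made up to trigger an error.

```python
import csv
import io
from typing import Callable, List, Tuple


def load_rows(f, factory: Callable[..., object]) -> Tuple[List[object], List[Tuple[int, str]]]:
    """Build one object per CSV row; collect (line_number, error) for failures."""
    items, errors = [], []
    reader = csv.DictReader(f)
    for row in reader:
        try:
            items.append(factory(**row))
        except Exception as exc:  # broad on purpose, as in the original loader
            errors.append((reader.line_num, str(exc)))
    return items, errors


def make_record(Customer_ID, Purchase_Amount, **kwargs):
    """Hypothetical stand-in for Transaction, kept small for the demo."""
    return {"customer_id": Customer_ID,
            "amount": float(Purchase_Amount.replace("$", ""))}


# In-memory CSV so the sketch runs without data/Primary.csv.
sample = io.StringIO(
    "Customer_ID,Purchase_Amount\n"
    "37-611-6911,$333.80\n"
    "00-000-0000,not-a-number\n"  # made-up bad row
)
records, skipped = load_rows(sample, make_record)
print(len(records), "loaded;", len(skipped), "skipped")
```

Returning the errors instead of printing them leaves the decision about reporting to the caller.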


## Step 5: Quick Profiling

In [76]:
prices = [t.amount for t in transactions]
cities = {t.city for t in transactions}

print("Min Price:", min(prices))
print("Mean Price:", sum(prices) / len(prices))
print("Max Price:", max(prices))
print("Unique Cities:", len(cities))


Min Price: 50.71
Mean Price: 273.54764
Max Price: 498.33
Unique Cities: 489


## Step 6: Spot the Grime

- Missing purchase amounts (Purchase_Amount is empty)
- Inconsistent casing and whitespace in Location
- Some rows may have invalid types that break float conversion



In [77]:
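# Check for missing values in important fields.
# A blank Purchase_Amount was stored as 0.0 in __init__, so this also counts true zeros.
missing_amount = sum(1 for t in transactions if t.amount == 0.0)
print("Transactions with missing/zero Purchase Amount:", missing_amount)

# Check for inconsistent city names.
# Note: t.city was already stripped and title-cased in __init__, so these are not truly raw values.
raw_cities = [t.city for t in transactions]
unique_raw_cities = set(raw_cities)
print("Sample raw city names (unprocessed):", list(unique_raw_cities)[:5])

# Check for invalid types.
# t.amount is already a float here; rows whose amount could not be parsed were skipped by the loader.
invalid_prices = []
for t in transactions:
    try:
        float(t.amount)
    except (TypeError, ValueError):
        invalid_prices.append(t.amount)
print("Invalid amount entries:", invalid_prices)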


Transactions with missing/zero Purchase Amount: 0
Sample raw city names (unprocessed): ['Győr', 'Camp Ithier', 'Loket', 'Taocheng', 'Dagup']
Invalid amount entries: []
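
Because Transaction.__init__ already normalizes Location and converts Purchase_Amount, the checks above run on processed values. One way to look at the raw strings instead is to profile the CSV directly, before any object is built. This is only a sketch: it assumes the column names shown in Step 1, the function name `profile_raw` is hypothetical, and the sample rows are invented for illustration.

```python
import csv
import io


def profile_raw(f):
    """Count grime in the raw CSV strings, before any normalization."""
    counts = {"blank_amount": 0, "bad_amount": 0, "messy_location": 0}
    for row in csv.DictReader(f):
        amount = (row.get("Purchase_Amount") or "").strip()
        if not amount:
            counts["blank_amount"] += 1
        else:
            try:
                float(amount.replace("$", ""))
            except ValueError:
                counts["bad_amount"] += 1
        # Same normalization rule the class uses: strip, then title-case.
        location = row.get("Location") or ""
        if location != location.strip().title():
            counts["messy_location"] += 1
    return counts


# Invented rows for illustration only; they are not from Primary.csv.
sample = io.StringIO(
    "Location,Purchase_Amount\n"
    "Huzhen,$426.22\n"
    "  huocheng ,\n"
    "Loket,abc\n"
)
print(profile_raw(sample))
```

Running the same function on the real file would show how many rows the class silently fixed or the loader skipped.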


## Step 7: Cleaning Rules
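The .clean() method re-applies the city normalization (strip whitespace, title case). Empty prices are handled earlier, in __init__, where a blank Purchase_Amount becomes 0.0, so the amount printed below is the same before and after cleaning.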

In [78]:
print("Before cleaning:", transactions[18].amount)
transactions[18].clean()
print("After cleaning:", transactions[18].amount)




Before cleaning: 454.39
After cleaning: 454.39


## Step 8: Transformations
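As a stand-in for coupon codes, I assigned a dummy numeric discount to each transaction: 10 if the purchase category starts with "e", otherwise 0. This gives a numeric field to use in analysis later on.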



In [79]:

for t in transactions:
    t.discount = 10 if t.category.lower().startswith("e") else 0


## Step 9: Feature Engineering

In [80]:
from datetime import datetime
import random

for t in transactions:
    t.date = datetime(2023, 12, random.randint(1, 28))  # fake date for demo
    t.days_since = (datetime.now() - t.date).days
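
The cell above uses a random placeholder date, so days_since changes on every run. Step 1 shows a Time_of_Purchase column with values such as 3/1/2024, so a real date could be parsed instead. The sketch below assumes a month/day/year format (consistent with 4/16/2024 and 3/15/2024 in the preview) and a fixed reference date chosen arbitrarily for reproducibility. The function name `days_since_purchase` is hypothetical.

```python
from datetime import datetime


def days_since_purchase(raw_date: str, reference: datetime) -> int:
    """Days between a month/day/year purchase date and a reference date."""
    purchased = datetime.strptime(raw_date.strip(), "%m/%d/%Y")
    return (reference - purchased).days


# Fixed reference date (an arbitrary choice) so results do not depend on today's date.
reference = datetime(2024, 12, 31)
for raw in ["3/1/2024", "4/16/2024", "3/15/2024"]:
    print(raw, days_since_purchase(raw, reference))
```

Passing the reference date as a parameter, rather than calling datetime.now() inside the function, keeps the feature stable between runs.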


## Step 10: Mini Aggregation
Here's a quick summary of total revenue by city. The output shows only the first five cities in alphabetical order; sorting the totals in descending order would show where most of the business is coming from.



In [81]:
df = pd.DataFrame([t.__dict__ for t in transactions])
df['revenue'] = df['amount']
revenue_by_city = df.groupby('city')['revenue'].sum()
revenue_by_city.head()


city
Abaetetuba     255.86
Abiko          139.83
Acheng         430.75
Acobambilla    494.18
Adani           68.59
Name: revenue, dtype: float64
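
Calling head() on the grouped result returns the first five cities in index order, not the largest ones. A small sketch of ranking cities by revenue, using a tiny invented frame in place of the notebook's df (the extra Abiko row is made up to show the summing):

```python
import pandas as pd

# Tiny invented frame; in the notebook this would be the df built above.
demo = pd.DataFrame({
    "city": ["Abiko", "Acheng", "Abiko", "Adani"],
    "revenue": [139.83, 430.75, 100.00, 68.59],
})

# Sum revenue per city, then order from highest to lowest.
top_cities = (
    demo.groupby("city")["revenue"]
    .sum()
    .sort_values(ascending=False)
)
print(top_cities.head(3))
```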

## Step 11: Serialization

In [82]:
df.to_json("cleaned_data.json", orient="records", lines=True)
df.to_parquet("cleaned_data.parquet")
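
With orient="records" and lines=True, to_json writes JSON Lines: one JSON object per row, one row per line. A quick way to sanity-check such a file without pandas is to parse it line by line. The sketch below uses an in-memory sample with two invented records rather than cleaned_data.json, and the helper name `read_json_lines` is hypothetical. Note also that to_parquet needs a Parquet engine such as pyarrow or fastparquet to be installed.

```python
import io
import json


def read_json_lines(f):
    """Parse a JSON Lines stream: one JSON object per non-empty line."""
    return [json.loads(line) for line in f if line.strip()]


# Two invented records in the shape to_json(orient="records", lines=True) writes.
sample = io.StringIO(
    '{"city":"Abiko","amount":139.83}\n'
    '{"city":"Acheng","amount":430.75}\n'
)
records = read_json_lines(sample)
print(len(records), records[0]["city"])
```

In pandas, the same file can be read back with pd.read_json(path, lines=True).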


## Step 12: Reflection 
Using object-oriented programming really helped organize my code better. By creating a Transaction class, I was able to group related data and behavior together, like cleaning and calculating totals. Instead of writing separate functions and handling raw data everywhere, I just called methods like .clean() or .total() directly on each transaction. It made the code easier to read, debug, and scale. Overall, OOP gave my project a cleaner structure and helped me focus more on the logic instead of worrying about how to manage data.

## Data Dictionary

| Field             | Type   | Description                        | Source     |
|------------------|--------|------------------------------------|------------|
| customer_id       | str    | Unique customer identifier         | Primary    |
| age               | int    | Customer age                       | Primary    |
| gender            | str    | Gender                             | Primary    |
| income            | str    | Income level                       | Primary    |
| marital           | str    | Marital status                     | Primary    |
| education         | str    | Education level                    | Primary    |
| occupation        | str    | Job title                          | Primary    |
| city              | str    | Shipping location                  | Primary    |
| category          | str    | Product category                   | Primary    |
| amount            | float  | Purchase amount                    | Primary    |
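| discount          | int    | Dummy discount value               | Engineered |
| date              | datetime | Placeholder purchase date (demo) | Engineered |
| days_since        | int    | Days since the placeholder date    | Engineered |
| revenue           | float  | Copy of amount                     | Engineered |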
