# 01 — Transactions Data Cleaning

## 1.0 Business Understanding
Dormancy in customer accounts leads to reduced activity, low digital adoption, and lost business opportunities.  
To address this, we start by preparing a **clean transactions dataset** covering Dec 2024 – Aug 2025.  

This notebook focuses on:
- Merging raw monthly transaction reports.  
- Excluding invalid statuses (D, J, R).  
- Deriving reporting months from transaction dates.  
- Classifying transactions into channel groups.  
- Saving a clean file for later aggregation.

---

## 2.0 Data Sources
- **Transactions Data (Dec 2024 – Aug 2025)**  
  - Location: `data/raw/transactions/`  
  - Format: monthly `.csv` / `.xlsx` files.  
- **Exclusions**  
  - Status codes: D = Declined, J = Reversed, R = Rejected.  
- **Confidentiality**  
  - Raw files are excluded from GitHub (`.gitignore`).


In [10]:
# 3.0 Import Libraries
import pandas as pd
import numpy as np
from pathlib import Path
import os

pd.set_option("display.max_columns", None)

## 4.0 Locate Files
We check both relative paths (`data/raw/transactions` and `../data/raw/transactions`)  
so the notebook works whether run from the project root or the `notebooks/` folder.

In [11]:
# Check current working directory
print("cwd:", os.getcwd())

# Try possible paths
p1 = Path("data/raw/transactions")
p2 = Path("../data/raw/transactions")

if p1.exists():
    raw_dir = p1
elif p2.exists():
    raw_dir = p2
else:
    raise FileNotFoundError("❌ Could not find transactions folder. Check project structure.")

# List files
files = list(raw_dir.glob("*.csv")) + list(raw_dir.glob("*.xlsx")) + list(raw_dir.glob("*.xls"))
print("Looking in:", raw_dir.resolve())
print("Files found:", len(files))
for f in files:
    print(" -", f.name)

if not files:
    raise FileNotFoundError(f"No transaction files found in {raw_dir}")

cwd: C:\Users\USER\Documents\Personal Projects\Dormancy
Looking in: C:\Users\USER\Documents\Personal Projects\Dormancy\data\raw\transactions
Files found: 11
 - Branch Activity - 01.08.25 - 20.08.25.csv
 - Branch Activity - 21.08.25 - 31.08.25.csv
 - Branch Activity - 30.06.25.csv
 - Branch Activity - June 2025.csv
 - Branch Activity - March 2025.csv
 - Branch Activity -28.02.25.csv
 - Branch Activity 30.04.25.csv
 - Branch Activity 31.01.25.csv
 - Branch Activity 31.05.25.csv
 - Branch_Activity - July 2025.csv
 - Branch Activity - 31.12.24.xlsx
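
The two-path check above covers the project root and a single `notebooks/` level. As a minimal sketch of a more general lookup, the hypothetical helper `find_raw_dir` below (not part of the notebook) walks upward from a starting folder until it finds `data/raw/transactions`. The temporary folder layout exists only for the demonstration.

```python
import tempfile
from pathlib import Path


def find_raw_dir(start, relative="data/raw/transactions"):
    """Walk upward from `start` until `relative` exists beneath some ancestor."""
    start = Path(start).resolve()
    for base in [start, *start.parents]:
        candidate = base / relative
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError(f"No '{relative}' folder found at or above {start}")


# Demo on a throwaway layout: project/data/raw/transactions and project/notebooks/eda
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"
    (project / "data" / "raw" / "transactions").mkdir(parents=True)
    nested = project / "notebooks" / "eda"
    nested.mkdir(parents=True)

    found = find_raw_dir(nested)  # start two levels below the project root
    expected = (project / "data" / "raw" / "transactions").resolve()
    print("Found expected folder:", found == expected)
```

In the notebook this would be called as `find_raw_dir(Path.cwd())`, assuming the working directory sits somewhere inside the project.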


## 5.0 Load and Merge Monthly Transactions
All files are merged into one DataFrame.  

In [None]:
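def read_file(f):
    """Read one monthly report (Excel or CSV) and strip whitespace from its column names."""
    if f.suffix.lower() in [".xlsx", ".xls"]:
        df = pd.read_excel(f)
    else:
        df = pd.read_csv(f)
    # Normalise headers per file so the same column lines up across files in concat
    df.columns = [str(c).strip() for c in df.columns]
    return df

tx = pd.concat([read_file(f) for f in files], ignore_index=True)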
print("Merged shape:", tx.shape)
tx.head()
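
When many files are merged, it can help to record which file each row came from, so that per-file row counts can be checked against the source reports. The sketch below is one way to do that under stated assumptions: `read_with_source` and the `source_file` column are hypothetical names, and the two tiny CSV files are made up for the demonstration. One of them has stray spaces in its header to show why column names are stripped per file.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_with_source(path: Path) -> pd.DataFrame:
    """Read one report, clean its header, and tag rows with the file name."""
    if path.suffix.lower() in {".xlsx", ".xls"}:
        df = pd.read_excel(path)
    else:
        df = pd.read_csv(path)
    df.columns = [str(c).strip() for c in df.columns]
    df["source_file"] = path.name  # hypothetical provenance column
    return df


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    # Made-up sample files; the first header contains stray spaces
    (tmp_dir / "jan.csv").write_text("status , amount\nS,100\nD,50\n")
    (tmp_dir / "feb.csv").write_text("status,amount\nS,70\n")

    demo_files = sorted(tmp_dir.glob("*.csv"))
    demo = pd.concat([read_with_source(f) for f in demo_files], ignore_index=True)

rows_per_file = demo["source_file"].value_counts().to_dict()
print(rows_per_file)
print(list(demo.columns))
```

Without the per-file header cleanup, `"status "` and `"status"` would become two separate, half-empty columns after `pd.concat`.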

## 6.0 Clean Data
Steps:
1. Exclude invalid statuses (D, J, R).  
2. Convert `created_date` into datetime.  
3. Create `Reporting_Month` = month-end date.

In [None]:
# 1) Exclude invalid statuses
tx['status'] = tx['status'].astype(str).str.upper().str.strip()
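# .copy() gives an independent frame, so later column assignments do not trigger SettingWithCopyWarning
tx = tx[~tx['status'].isin({'D','J','R'})].copy()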

# 2) Convert created_date
tx['created_date'] = pd.to_datetime(tx['created_date'], errors='coerce', dayfirst=True)

# 3) Reporting Month
tx['Reporting_Month'] = tx['created_date'].dt.to_period('M').dt.to_timestamp('M')

tx[['created_date','Reporting_Month']].head()
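
Because `errors='coerce'` silently turns unparseable values into `NaT`, it may be worth counting how many dates were lost before relying on `Reporting_Month`. The following is a minimal sketch of such a check. `date_parse_report` is a hypothetical helper and the sample values are invented for illustration.

```python
import pandas as pd


def date_parse_report(raw_dates: pd.Series) -> dict:
    """Count values that were present in the raw data but failed to parse as dates."""
    parsed = pd.to_datetime(raw_dates, errors="coerce", dayfirst=True)
    failed = parsed.isna() & raw_dates.notna()  # ignore values that were already missing
    return {
        "total": int(len(raw_dates)),
        "unparsed": int(failed.sum()),
        "examples": raw_dates[failed].head(5).tolist(),
    }


# Invented sample: two valid day-first dates, one bad string, one missing value
sample = pd.Series(["31/01/2025", "15/02/2025", "not a date", None])
report = date_parse_report(sample)
print(report)
```

To use it in the notebook, the helper would need to run on the raw `created_date` column before step 2 overwrites it with parsed values.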

## 7.0 Classify Channels
Rules (evaluated in order; the first match wins):
- **Mobile** → tran_type in Mobile Banking (withdrawal, transfer, charges, airtime, utility, paybill, pesalink).  
- **Biz2Bank** → RTS PAYBILL DEPOSITS, RTS MPESA DEPOSIT.  
- **GABCollect** → GABCollect.  
- **Internet** → if created_by = IBKTLR.  
- **ATM** → if channels = ATM.  
- **Other** → everything else.

In [None]:
MOBILE_TYPES = {
    'MOBILE BANKING MPESA WITHDRAWAL',
    'MOBILE BANKING TRANSFER',
    'MB CHARGES',
    'MOBILE BANKING AIRTIME PURCHASE',
    'MB UTILITY PAYMENT',
    'MB PAYBILL',
    'PESALINK TRANSFER',
    'PESALINK DEPOSIT'
}
BIZ2BANK_TYPES = {'RTS PAYBILL DEPOSITS', 'RTS MPESA DEPOSIT'}

def classify_channel(row):
    tt = str(row.get('tran_type','')).upper().strip()
    created_by = str(row.get('created_by','')).upper().strip()
    ch = str(row.get('channels','')).upper().strip()
    
    if tt in MOBILE_TYPES or tt.startswith("MOBILE BANKING") or "PESALINK" in tt:
        return "Mobile"
    if tt in BIZ2BANK_TYPES:
        return "Biz2Bank"
    if tt == "GABCOLLECT":
        return "GABCollect"
    if created_by == "IBKTLR":
        return "Internet"
    if ch == "ATM":
        return "ATM"
    return "Other"

tx['Channel_Group'] = tx.apply(classify_channel, axis=1)

# Flags
tx['IsDigital'] = tx['Channel_Group'].isin(['Mobile','Internet','Biz2Bank']).astype(int)
tx['IsPhysical'] = tx['Channel_Group'].eq('ATM').astype(int)

tx['Channel_Group'].value_counts()
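
Row-wise `apply` can be slow on several months of transactions. As a sketch of an alternative, the same rules can be expressed with `numpy.select`, which checks the conditions in order and takes the first match. This is not the notebook's method: `classify_channels_vectorized` is a hypothetical name, the function assumes all three columns exist (the row-wise version tolerates missing ones), and the sample rows, including the non-digital transaction types, are invented.

```python
import numpy as np
import pandas as pd

MOBILE_TYPES = {
    "MOBILE BANKING MPESA WITHDRAWAL", "MOBILE BANKING TRANSFER", "MB CHARGES",
    "MOBILE BANKING AIRTIME PURCHASE", "MB UTILITY PAYMENT", "MB PAYBILL",
    "PESALINK TRANSFER", "PESALINK DEPOSIT",
}
BIZ2BANK_TYPES = {"RTS PAYBILL DEPOSITS", "RTS MPESA DEPOSIT"}


def classify_channels_vectorized(df: pd.DataFrame) -> pd.Series:
    """Apply the channel rules column-wise; earlier conditions take priority."""
    tt = df["tran_type"].astype(str).str.upper().str.strip()
    created_by = df["created_by"].astype(str).str.upper().str.strip()
    ch = df["channels"].astype(str).str.upper().str.strip()

    conditions = [
        tt.isin(MOBILE_TYPES)
        | tt.str.startswith("MOBILE BANKING")
        | tt.str.contains("PESALINK", regex=False),
        tt.isin(BIZ2BANK_TYPES),
        tt.eq("GABCOLLECT"),
        created_by.eq("IBKTLR"),
        ch.eq("ATM"),
    ]
    labels = ["Mobile", "Biz2Bank", "GABCollect", "Internet", "ATM"]
    result = np.select([c.to_numpy() for c in conditions], labels, default="Other")
    return pd.Series(result, index=df.index, name="Channel_Group")


# Invented rows, one per rule, plus a mixed-case row that matches several rules
demo = pd.DataFrame({
    "tran_type": ["MB PAYBILL", "RTS MPESA DEPOSIT", "GABCollect", "CASH DEPOSIT",
                  "CASH WITHDRAWAL", "CHEQUE DEPOSIT", "pesalink transfer"],
    "created_by": ["SYS", "SYS", "SYS", "IBKTLR", "TELLER1", "TELLER2", "IBKTLR"],
    "channels": ["", "", "", "", "ATM", "BRANCH", "ATM"],
})
groups = classify_channels_vectorized(demo).tolist()
print(groups)
```

The last row matches the Mobile, Internet, and ATM conditions at once. It is labelled Mobile because that rule comes first, which mirrors the `if` order in `classify_channel`.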

## 8.0 Save Clean Transactions
The cleaned dataset is written to `data/processed/transactions_clean.csv`, which is ignored by Git.

In [None]:
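# Resolve relative to raw_dir so this works from the project root or notebooks/
out_path = raw_dir.parent.parent / "processed" / "transactions_clean.csv"
out_path.parent.mkdir(parents=True, exist_ok=True)  # create data/processed if missing
tx.to_csv(out_path, index=False)
print("✅ Clean data saved to:", out_path.resolve())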
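
CSV files do not store column types, so date columns come back as plain strings unless they are parsed again on load. The sketch below shows a simple round-trip check on a made-up three-column frame written to a temporary folder. It uses the notebook's column names, but the data and path are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for the cleaned transactions
demo = pd.DataFrame({
    "created_date": pd.to_datetime(["2025-01-15", "2025-02-03"]),
    "Reporting_Month": pd.to_datetime(["2025-01-31", "2025-02-28"]),
    "Channel_Group": ["Mobile", "ATM"],
})

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "processed" / "transactions_clean.csv"
    out_file.parent.mkdir(parents=True, exist_ok=True)
    demo.to_csv(out_file, index=False)

    # Re-parse the date columns explicitly when reading the file back
    reloaded = pd.read_csv(out_file, parse_dates=["created_date", "Reporting_Month"])

same_shape = reloaded.shape == demo.shape
dates_ok = pd.api.types.is_datetime64_any_dtype(reloaded["Reporting_Month"])
months = reloaded["Reporting_Month"].dt.strftime("%Y-%m-%d").tolist()
print(same_shape, dates_ok, months)
```

Downstream notebooks that read the clean file would need the same `parse_dates` argument, or an equivalent `pd.to_datetime` call, to get datetime columns back.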