<a href="https://colab.research.google.com/github/Suraalani79/msc-dissertation-2025/blob/main/framework.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MSc Dissertation — Adaptive, Modular, Multi-Agent AI Framework for SMEs
**Notebook:** End-to-end pipeline (datasets → preprocessing → modelling → evaluation → adaptivity)

> Keep this notebook in sync with Chapter 3 (Methodology). Each section here maps to a subsection of Chapter 3 (3.x).

---

## 1. Dataset Loading
- Online Retail II (UCI): customer transactions (segmentation/demand)
- German Credit (Statlog, UCI): credit risk classification
- Bank Marketing (UCI): campaign response prediction

*Goal:* Load the raw files (Excel, whitespace-delimited text, zipped CSV) from Google Drive or direct URLs and show basic shapes/columns.
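
A minimal sketch of the "basic shapes/columns" check, assuming each dataset ends up as a pandas DataFrame (or `None` when loading failed). The helper name `summarize` and the toy frames are hypothetical and only illustrate the idea.

```python
import pandas as pd

def summarize(frames):
    """Return one row per dataset: row count, column count, missing cells."""
    rows = []
    for name, df in frames.items():
        if df is None:  # download or parsing failed upstream
            rows.append({"dataset": name, "rows": 0, "cols": 0,
                         "missing": 0, "loaded": False})
            continue
        rows.append({
            "dataset": name,
            "rows": df.shape[0],
            "cols": df.shape[1],
            "missing": int(df.isna().sum().sum()),
            "loaded": True,
        })
    return pd.DataFrame(rows).set_index("dataset")

# Toy stand-ins for the real frames (hypothetical values)
demo = {
    "Retail": pd.DataFrame({"Invoice": [1, 2], "Customer ID": [13085.0, None]}),
    "Credit": None,
}
summary = summarize(demo)
print(summary)
```

Reporting the missing-cell count next to the shape gives an early hint of how much cleaning Section 2 will need.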

---

## 2. Preprocessing & Feature Engineering
- Handle missing values, duplicates, outliers
- Encode categoricals, scale numerics
- Task-specific features (e.g., RFM for retail)
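
As an illustration of the RFM bullet above, here is a small sketch that derives recency, frequency, and monetary value per customer. It assumes the Online Retail II column names (`Invoice`, `Quantity`, `InvoiceDate`, `Price`, `Customer ID`); the function name `rfm_features`, the snapshot date, and the toy transactions are hypothetical. Cancellations and negative quantities are not handled here and would need a decision of their own.

```python
import pandas as pd

def rfm_features(tx, snapshot_date):
    """Aggregate transactions into one RFM row per customer."""
    tx = tx.dropna(subset=["Customer ID"]).copy()  # rows without a customer cannot be attributed
    tx["Revenue"] = tx["Quantity"] * tx["Price"]
    grouped = tx.groupby("Customer ID")
    return pd.DataFrame({
        # Days between the snapshot and the customer's most recent purchase
        "recency_days": (snapshot_date - grouped["InvoiceDate"].max()).dt.days,
        # Number of distinct invoices
        "frequency": grouped["Invoice"].nunique(),
        # Total spend
        "monetary": grouped["Revenue"].sum(),
    })

# Toy transactions (hypothetical values)
toy = pd.DataFrame({
    "Invoice": ["A1", "A1", "A2", "B1"],
    "Customer ID": [1.0, 1.0, 1.0, 2.0],
    "Quantity": [2, 1, 3, 5],
    "Price": [10.0, 5.0, 1.0, 2.0],
    "InvoiceDate": pd.to_datetime(
        ["2010-01-01", "2010-01-01", "2010-01-05", "2010-01-03"]),
})
rfm = rfm_features(toy, snapshot_date=pd.Timestamp("2010-01-10"))
print(rfm)
```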

*Goal:* Produce clean `X_train`, `X_test`, `y_train`, `y_test` for the supervised tasks; the clustering task has no target and needs only a feature matrix.
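
One possible way to reach that goal with the scikit-learn pieces this notebook already imports is sketched below. The helper `build_preprocessor`, the median/most-frequent imputation choices, and the toy frame are assumptions, not the dissertation's fixed design. The preprocessor is fitted on the training split only, so no test-set statistics leak into training.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

RANDOM_STATE = 42

def build_preprocessor(X):
    """Impute + scale numeric columns, impute + one-hot encode the rest."""
    num_cols = X.select_dtypes(include="number").columns.tolist()
    cat_cols = X.select_dtypes(exclude="number").columns.tolist()
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer([
        ("num", numeric, num_cols),
        ("cat", categorical, cat_cols),
    ])

# Toy credit-like frame (hypothetical values); target 2 = bad credit
toy = pd.DataFrame({
    "duration": [6, 48, 12, 42, 24, 36, 18, 30],
    "amount": [1169, 5951, 2096, 7882, 4870, 9055, 2835, np.nan],
    "purpose": ["A43", "A43", "A46", "A42", "A40", "A46", "A42", "A43"],
    "target": [1, 2, 1, 1, 2, 1, 2, 2],
}).drop_duplicates()

X = toy.drop(columns="target")
y = (toy["target"] == 2).astype(int)  # recode to 1 = bad, 0 = good

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=RANDOM_STATE)

preprocessor = build_preprocessor(X_train)
X_train_t = preprocessor.fit_transform(X_train)  # fit on training data only
X_test_t = preprocessor.transform(X_test)
print(X_train_t.shape, X_test_t.shape)
```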

---

## 3. Modelling (per “agent” / task)
- **Risk (classification):** Logistic Regression (baseline) → Random Forest / XGBoost
- **Segmentation (clustering):** k-Means, assessed with the silhouette score
- *(Optional)* Forecasting: add later if time permits
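
For the segmentation bullet, the sketch below scans a range of `k` values and records the silhouette score for each. Using the scores to pick `k` is an assumption about how they will be applied; the function name `silhouette_by_k` and the synthetic blobs are hypothetical stand-ins for the scaled RFM features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def silhouette_by_k(X, k_values, random_state=42):
    """Fit k-Means for each k and return {k: silhouette score}."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=random_state).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores

# Three well-separated synthetic blobs (hypothetical data)
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
X = StandardScaler().fit_transform(X)

scores = silhouette_by_k(X, range(2, 6))
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

On real customer data the silhouette curve is usually much flatter than on these blobs, so the chosen `k` should also make sense as a set of business segments.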

*Goal:* Train models with cross-validation and save metrics.
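
The cross-validation step could look roughly like this. It is a sketch on synthetic data from `make_classification`, standing in for the preprocessed credit or bank features; the model settings and the choice of F1 and ROC-AUC as scoring metrics are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

RANDOM_STATE = 42

# Synthetic, mildly imbalanced binary problem (hypothetical stand-in)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           weights=[0.7, 0.3], random_state=RANDOM_STATE)

models = {
    # Scaling lives inside the pipeline so it is refit on each training fold
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=100,
                                            random_state=RANDOM_STATE),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
rows = []
for name, model in models.items():
    res = cross_validate(model, X, y, cv=cv, scoring=["f1", "roc_auc"])
    rows.append({
        "model": name,
        "f1_mean": res["test_f1"].mean(),
        "roc_auc_mean": res["test_roc_auc"].mean(),
    })

cv_metrics = pd.DataFrame(rows).set_index("model")
print(cv_metrics)
```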

---

## 4. Evaluation & Reporting
- Classification: Accuracy, Precision, Recall, F1, ROC-AUC
- Clustering: Silhouette
- Compare **static** vs **retrainable** runs (later)

*Goal:* Store results in `results/` as CSV/PNG.
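
A small sketch of how the classification metrics listed above might be collected into one row per model and written to the results directory. The helper `metrics_row`, the file name `classification_metrics.csv`, and the toy labels and scores are hypothetical.

```python
import pathlib
import pandas as pd
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

RESULTS_DIR = pathlib.Path("results")
RESULTS_DIR.mkdir(exist_ok=True)

def metrics_row(name, y_true, y_pred, y_score):
    """Collect the five classification metrics for one model."""
    return {
        "model": name,
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        # ROC-AUC needs scores or probabilities, not hard labels
        "roc_auc": roc_auc_score(y_true, y_score),
    }

# Toy labels and predicted probabilities (hypothetical values)
y_true = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.6, 0.8, 0.4, 0.9, 0.2]
y_pred = [int(s >= 0.5) for s in y_score]

row = metrics_row("toy_model", y_true, y_pred, y_score)
out_path = RESULTS_DIR / "classification_metrics.csv"
pd.DataFrame([row]).to_csv(out_path, index=False)
print(row)
```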

---

## 5. Adaptivity (Continual Learning / Retraining)
- Simple retraining strategy (scheduled or drift-triggered)
- Document when/why a model is retrained and how performance changes
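
To make the drift-triggered option concrete, here is one simple variant: retrain whenever accuracy on a newly labelled batch falls more than a tolerance below a reference level, and log each decision. This is a performance-based trigger rather than a statistical drift test, and it assumes labels for new batches become available. The tolerance, the helper `make_batch`, and the simulated drift are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

TOLERANCE = 0.10  # hypothetical: retrain when accuracy drops this far below the reference

def make_batch(rng, n, shift):
    """Toy batch; `shift` moves the class boundary to simulate concept drift."""
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + X[:, 1] > shift).astype(int)
    return X, y

rng = np.random.default_rng(42)

# Initial fit, plus a reference accuracy measured on a separate validation batch
X0, y0 = make_batch(rng, 500, shift=0.0)
Xv, yv = make_batch(rng, 500, shift=0.0)
model = LogisticRegression().fit(X0, y0)
reference_acc = accuracy_score(yv, model.predict(Xv))

retrain_log = []
for batch_id, shift in enumerate([0.0, 0.0, 1.5, 1.5]):
    Xb, yb = make_batch(rng, 500, shift)
    acc = accuracy_score(yb, model.predict(Xb))
    retrain = bool(acc < reference_acc - TOLERANCE)
    if retrain:
        # Simplest strategy: refit on the newest labelled batch only
        model = LogisticRegression().fit(Xb, yb)
    retrain_log.append({"batch": batch_id,
                        "accuracy_before": round(float(acc), 3),
                        "retrained": retrain})

for entry in retrain_log:
    print(entry)
```

The log records when a retrain happened and the accuracy that triggered it; the accuracy on the following batch shows how performance changed afterwards.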

---

## 6. Repro Notes
- Library versions, random seeds, data sources
- Ethical/Legal note: public, non-personal datasets only (UCI)
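
The version and seed bookkeeping could be captured in a few lines, as in the sketch below. The helper names `set_seeds` and `environment_record` are hypothetical; the seed value matches `RANDOM_STATE` used elsewhere in this notebook.

```python
import json
import platform
import random

import numpy as np
import pandas as pd
import sklearn

RANDOM_STATE = 42

def set_seeds(seed):
    """Seed the Python and NumPy global random number generators."""
    random.seed(seed)
    np.random.seed(seed)

def environment_record(seed):
    """Collect interpreter and library versions plus the seed in one dict."""
    return {
        "python": platform.python_version(),
        "numpy": np.__version__,
        "pandas": pd.__version__,
        "scikit-learn": sklearn.__version__,
        "random_state": seed,
    }

set_seeds(RANDOM_STATE)
record = environment_record(RANDOM_STATE)
print(json.dumps(record, indent=2))
```

Writing this record next to each results file makes it easier to tell later which environment produced which numbers.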

---


In [2]:
# Core
import os, sys, pathlib, warnings
warnings.filterwarnings("ignore")  # NOTE: silences all warnings, including scikit-learn convergence warnings

# Data
import numpy as np
import pandas as pd

# Prep & models
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import silhouette_score

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
# XGBoost optional: pip install xgboost then uncomment
# from xgboost import XGBClassifier

from sklearn.cluster import KMeans

# Utils
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Results directory (matches your repo structure)
RESULTS_DIR = pathlib.Path("results")
RESULTS_DIR.mkdir(exist_ok=True)


In [3]:
# === Direct UCI downloads: Online Retail, German Credit, Bank Marketing ===
import os, zipfile, urllib.request
import pandas as pd

DATA_DIR = "/content/data"
os.makedirs(DATA_DIR, exist_ok=True)

def fetch(url, out_path):
    """Download `url` to `out_path`; return True on success, False on any error."""
    try:
        urllib.request.urlretrieve(url, out_path)
        print(f"✓ Downloaded: {out_path}")
        return True
    except Exception as e:
        print(f"⚠️ Could not download {url}\n{e}")
        return False

# ---------------------------
# 1) Online Retail (UCI)
# ---------------------------
# NOTE: The Online Retail II download can fail (e.g., access restrictions on UCI).
# Try Online Retail II first; if that fails, fall back to the classic Online Retail.
# The two files use different column names (e.g., Invoice/Price/Customer ID vs.
# InvoiceNo/UnitPrice/CustomerID), so check retail_df.columns before preprocessing.
retail_xlsx_path = os.path.join(DATA_DIR, "OnlineRetail.xlsx")

retail_urls = [
    # Preferred: Online Retail II
    "https://archive.ics.uci.edu/ml/machine-learning-databases/00502/online_retail_II.xlsx",
    # Fallback: classic Online Retail
    "https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx",
]

retail_ok = False
for u in retail_urls:
    if fetch(u, retail_xlsx_path):
        retail_ok = True
        break

retail_df = None
if retail_ok:
    try:
        # sheet_name=None always returns a dict of {sheet name: DataFrame}
        sheets = pd.read_excel(retail_xlsx_path, sheet_name=None)
        # Concatenate all sheets (a single-sheet workbook yields a one-entry dict)
        retail_df = pd.concat(sheets.values(), ignore_index=True)
        print("✓ Online Retail loaded:", retail_df.shape)
    except Exception as e:
        print("⚠️ Could not read Online Retail Excel:", e)

# ---------------------------
# 2) German Credit (Statlog) – UCI
# ---------------------------
# Raw file (no header). Use names from UCI documentation or create generic ones.
german_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data"
german_path = os.path.join(DATA_DIR, "german.data")
if fetch(german_url, german_path):
    # 20 features + 1 target (per UCI description), named C1..C20 and 'target' (1 = good, 2 = bad credit)
    col_names = [f"C{i}" for i in range(1, 21)] + ["target"]
    credit_df = pd.read_csv(german_path, sep=r"\s+", header=None, names=col_names, engine="python")
    print("✓ German Credit loaded:", credit_df.shape)
else:
    credit_df = None

# ---------------------------
# 3) Bank Marketing – UCI (use “bank-additional-full.csv”)
# ---------------------------
bank_zip_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip"
bank_zip_path = os.path.join(DATA_DIR, "bank-additional.zip")
bank_df = None

if fetch(bank_zip_url, bank_zip_path):
    try:
        with zipfile.ZipFile(bank_zip_path, "r") as z:
            # prefer the full dataset inside the zip
            candidates = [
                "bank-additional/bank-additional-full.csv",
                "bank-additional/bank-additional.csv",        # smaller version
                "bank/bank-full.csv"                           # older folder name in some mirrors
            ]
            member = None
            for c in candidates:
                if c in z.namelist():
                    member = c
                    break
            if member is None:
                raise FileNotFoundError(f"None of expected CSVs found. Available: {z.namelist()[:5]} ...")

            with z.open(member) as f:
                bank_df = pd.read_csv(f, sep=';')
        print("✓ Bank Marketing loaded:", bank_df.shape)
    except Exception as e:
        print("⚠️ Could not extract/read Bank Marketing:", e)

# ---------------------------
# Quick peek (if loaded)
# ---------------------------
for name, df in [("Retail", retail_df), ("Credit", credit_df), ("Bank", bank_df)]:
    if df is not None:
        print(f"\n{name} head():")
        display(df.head())


✓ Downloaded: /content/data/OnlineRetail.xlsx
✓ Online Retail loaded: (1067371, 8)
✓ Downloaded: /content/data/german.data
✓ German Credit loaded: (1000, 21)
✓ Downloaded: /content/data/bank-additional.zip
✓ Bank Marketing loaded: (41188, 21)

Retail head():


|   | Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085.0 | United Kingdom |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.1 | 13085.0 | United Kingdom |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085.0 | United Kingdom |



Credit head():


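|   | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | ... | C12 | C13 | C14 | C15 | C16 | C17 | C18 | C19 | C20 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A11 | 6 | A34 | A43 | 1169 | A65 | A75 | 4 | A93 | A101 | ... | A121 | 67 | A143 | A152 | 2 | A173 | 1 | A192 | A201 | 1 |
| 1 | A12 | 48 | A32 | A43 | 5951 | A61 | A73 | 2 | A92 | A101 | ... | A121 | 22 | A143 | A152 | 1 | A173 | 1 | A191 | A201 | 2 |
| 2 | A14 | 12 | A34 | A46 | 2096 | A61 | A74 | 2 | A93 | A101 | ... | A121 | 49 | A143 | A152 | 1 | A172 | 2 | A191 | A201 | 1 |
| 3 | A11 | 42 | A32 | A42 | 7882 | A61 | A74 | 2 | A93 | A103 | ... | A122 | 45 | A143 | A153 | 1 | A173 | 2 | A191 | A201 | 1 |
| 4 | A11 | 24 | A33 | A40 | 4870 | A61 | A73 | 3 | A93 | A101 | ... | A124 | 53 | A143 | A153 | 2 | A173 | 2 | A191 | A201 | 2 |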



Bank head():


|   | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
