![Banner](https://github.com/LittleHouse75/flatiron-resources/raw/main/NevitsBanner.png)

----

# Ethereum Scam Address Modeling — Overview

----

## 1. Business problem and goal

Public blockchains like Ethereum are **full of fraud**, but there is no common “master list” of scam wallets.  
Wallet providers, exchanges, and analytics vendors each maintain their own internal lists built from:

- Proprietary scam labels  
- Ad-hoc rules and heuristics  
- Manual investigations  

These lists are expensive, incomplete, and often lag behind new scam patterns.

For a **wallet provider**, this means:

- Users may unknowingly send funds to **known scam addresses**  
- The provider takes on **reputational and compliance risk**  
- Fraud teams burn time on **manual triage** instead of deeper investigations  

**Business goal (wallet-provider framing):**

> Learn behavioral patterns from on-chain data so we can:
> - Flag **high-risk destinations before a transaction is sent**
> - Prioritize **which addresses fraud teams should review first**
> - Maintain an **internal, adaptive scam list** without sharing proprietary labels

## 2. Why machine learning?

Ethereum activity is **high-dimensional and time-dependent**:

- Wallet histories of sends/receives, gas usage, timing bursts, fan-in / fan-out  
- Scam patterns that change quickly and are **non-linear**  

Static rules and blacklists fall behind.

Machine learning lets us:

- Turn raw transactions into **address-level behavioral features**  
- Learn **non-linear scam patterns** from labeled history  
- **Retrain** as behavior drifts  
- Score **new or unseen addresses** in near real time  

This project tests how well that works using public data, and what breaks once we respect **time** and **dataset shift**.

## 3. What we did: datasets and experiment views

### Datasets

- **Benchmark Ethereum dataset (training ground)**  
  Public “Labeled transactions-based dataset on the Ethereum network” (~70k labeled transactions).  
  Used to build an **address-level table** and a binary `Scam` label (1 if the address ever appears as a labeled scam).

- **DFPI external evaluation dataset (reality check)**  
  Scam wallets reported by a regulator (the California Department of Financial Protection and Innovation, DFPI), plus their on-chain history from Etherscan, mixed with background traffic.  
  Used only as an **external hold-out test set**.

This two-dataset setup lets us check both:

- Best-case behavior on a **controlled benchmark**  
- Whether learned patterns **transfer to regulator data**
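
As a rough illustration of the "1 if the address ever appears as a labeled scam" rule, the sketch below derives a per-address `Scam` flag from the `from_scam` / `to_scam` columns listed in Appendix A. The helper name `build_address_labels` and the toy rows are assumptions made for this example, and the category-text checks mentioned in Appendix B are left out.

```python
import pandas as pd

def build_address_labels(tx: pd.DataFrame) -> pd.Series:
    """Return a 0/1 Series indexed by address: 1 if the address is ever flagged."""
    # Stack sender-side and recipient-side flags into one long table.
    sender = tx[["from_address", "from_scam"]].rename(
        columns={"from_address": "Address", "from_scam": "flag"})
    recipient = tx[["to_address", "to_scam"]].rename(
        columns={"to_address": "Address", "to_scam": "flag"})
    long_form = pd.concat([sender, recipient], ignore_index=True)
    # An address counts as a scam if any of its appearances carries the flag.
    return long_form.groupby("Address")["flag"].max().astype(int).rename("Scam")

# Toy transactions (hypothetical addresses), using the Appendix A column names.
tx = pd.DataFrame({
    "from_address": ["0xa", "0xb", "0xa"],
    "to_address":   ["0xb", "0xc", "0xc"],
    "from_scam":    [0, 0, 0],
    "to_scam":      [0, 1, 1],
})
labels = build_address_labels(tx)
print(labels.to_dict())
```

Taking the maximum over all appearances means a single labeled transaction is enough to mark the address, which matches the "ever appears" wording above.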

### Experiment views

Across notebooks we evaluate three scenarios:

1. **Random address split (i.i.d.)**  
   Train / validate / test on addresses drawn from the **same global time span**, with no address shared across splits.

2. **Time-based split (past → future)**  
   Train / validate on addresses built from **early transactions**, test on addresses built from **later transactions**, using the same feature recipe in each window.

3. **External DFPI evaluation**  
   Take the tuned model from the random-split setting and apply it, **without retraining**, to the DFPI scam-wallet dataset.
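
The first view depends on no address being shared across splits. A minimal sketch of that idea follows. The 70/15/15 fractions, the seed, and the function name are assumptions, and the split is not stratified by label, which a real run on imbalanced scam data would probably want.

```python
import random

def random_address_split(addresses, fractions=(0.70, 0.15, 0.15), seed=42):
    """Shuffle unique addresses and cut them into disjoint train/val/test sets."""
    unique = sorted(set(addresses))        # sort first so the seed is reproducible
    random.Random(seed).shuffle(unique)
    n_train = round(len(unique) * fractions[0])
    n_val = round(len(unique) * fractions[1])
    train = set(unique[:n_train])
    val = set(unique[n_train:n_train + n_val])
    test = set(unique[n_train + n_val:])   # remainder goes to test
    return train, val, test

# Hypothetical addresses, for illustration only.
addresses = [f"0x{i:040x}" for i in range(100)]
train, val, test = random_address_split(addresses)
print(len(train), len(val), len(test))
```

Splitting on the set of addresses, rather than on rows of the transaction table, is what keeps one wallet's history from leaking between train and test.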

## 4. What this means for a wallet provider (BLUF)

### Random address split — feasibility on historical data

When train and test come from the same distribution:

- A tuned XGBoost model **separates scam vs. non-scam extremely well**  
  - ROC AUC ≈ **0.998**, Average Precision (AP) ≈ **0.79**  
- We can define a **small high-risk segment** where ~**80% of alerts are true scams**

**Implication:**  
Given decent labeled history and relatively stable patterns, a wallet provider can build an internal model that:

- Surfaces a **short, high-yield list** of suspect addresses  
- Supports **analyst triage** and deeper investigations  
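
One way to carve out a high-precision alert segment is to scan candidate score cutoffs and keep the lowest one that still meets a target precision. The sketch below is an illustration under that assumption, not the notebooks' exact thresholding code. The function name and the toy scores are hypothetical.

```python
def threshold_for_precision(y_true, scores, target_precision=0.8):
    """Return (threshold, precision, recall) for the lowest cutoff whose alerts
    (score >= threshold) meet the target precision, or None if none does."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    total_pos = sum(y_true)
    best = None
    tp = 0
    for k, (score, label) in enumerate(ranked, start=1):
        tp += label
        # Skip cutoffs that would split a group of tied scores.
        if k < len(ranked) and ranked[k][0] == score:
            continue
        precision = tp / k
        if precision >= target_precision:
            recall = tp / total_pos if total_pos else 0.0
            best = (score, precision, recall)  # later hits are deeper, so recall is higher
    return best

# Toy labels and model scores (hypothetical).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.95, 0.90, 0.80, 0.70, 0.30, 0.10]
result = threshold_for_precision(y_true, scores, target_precision=0.8)
print(result)
```

Raising `target_precision` shrinks the alert list and lowers recall, which is the trade-off an analyst team would tune to its review capacity.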

---

### Time-based split — reality of deployment over time

When we enforce a **past → future** setup:

- Performance **drops** compared to the random split  
- At a high-precision operating point:
  - Precision of **0.9 or higher**
  - Recall ≈ **0.25**, AP ≈ **0.54**

**Implication:**  

- You still get a **very clean alert list**, but you **miss many new scams**  
- Scam behavior **drifts over time**; “train once and forget it” is not realistic  
- In production you’d need:
  - **Drift and performance monitoring**
  - **Periodic retraining** on newer labeled data
  - Careful choice of **precision vs. recall** based on team capacity and risk appetite  

---

### DFPI evaluation — transfer to regulator data

When we apply the random-split model to the **DFPI scam-wallet dataset** (no retraining):

- ROC AUC ≈ **0.97**, AP ≈ **0.90**  
- Most DFPI-listed scam addresses rank **near the top**, with few benign wallets mixed in  

**Implication:**  

- The model is not just memorizing the benchmark dataset  
- Behavior-based features have **real transfer value** to external, regulator-reported scams  
- A similar feature set and modeling approach can likely extract **useful signals from your own data**, even with different label sources and time periods  

## 5. How a stakeholder could use this work

Even without adopting this exact codebase, the analysis maps to a concrete path for a wallet provider or exchange.

### A. Data you need

Most providers already have:

- Raw or enriched **transaction tables** with:
  - from / to addresses  
  - value, gas, and fees  
  - timestamps or block numbers  
- Some internal notion of **“risky” or “confirmed scam”** addresses

This project (plus the appendices) shows a **minimum viable schema** and field dictionary to build:

- Address-level **behavioral features**  
- Time-aware **train / validation / test** splits  

---

### B. How to reuse this pipeline on your data

A practical adaptation plan:

1. Implement an **address-level feature pipeline** similar to the one used here.  
2. Run the same three evaluation views on your own labels:
   - Random address split → **best-case ceiling**  
   - Time-based split → **deployment realism**  
   - External / cross-dataset test → **transferability** (if you have another label source)  
3. Present curves and confusion matrices so fraud leadership sees the **precision vs. recall trade-offs** clearly.

---

### C. How this would look in production

The experiments suggest a production setup would involve:

- **Choosing thresholds** to:
  - Produce a **small, ultra-clean analyst queue**, or  
  - Cast a **wider net** for automated friction (warnings, extra checks)  
- **Scheduled retraining** (e.g., monthly / quarterly) as new scams appear  
- Simple **drift monitoring** on key features and score distributions  
- A small **held-out recent slice** for testing new models before promotion  
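
The drift-monitoring bullet does not name a specific statistic. One common choice, shown here purely as an assumption, is the Population Stability Index (PSI) between a reference score distribution and a recent one. The bin count, the `eps` floor, and the function name are illustrative.

```python
import math
from bisect import bisect_right
from statistics import quantiles

def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between two samples, using quantile bins taken from the reference."""
    cuts = quantiles(reference, n=n_bins)      # n_bins - 1 interior cut points

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect_right(cuts, v)] += 1
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    ref = bin_fractions(reference)
    cur = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

# Synthetic scores: an unchanged distribution versus one skewed toward high values.
reference = [i / 1000 for i in range(1000)]
shifted = [math.sqrt(x) for x in reference]
psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, shifted)
print(psi_same, psi_shifted)
```

The same function can be run per feature as well as on model scores. What PSI level should trigger a retrain is a policy decision this project does not prescribe.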

## 6. Notebook roadmap — for data and ML teams

Each notebook can be run independently. Together they tell the story end-to-end.

### [01_EDA.ipynb — Enhanced EDA: Ethereum scam dataset](01_EDA.ipynb)

**Purpose**

- Load and sanity-check the raw transaction dataset  
- Normalize timestamps and create `block_timestamp_dt`  
- Explore:
  - Time coverage  
  - Basic distributions (value, gas, etc.)  
  - Scam vs. non-scam label balance  
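
The timestamp step might look roughly like the sketch below. The string format of `block_timestamp` shown here is an assumption, and `errors="coerce"` is one possible way to surface unparseable rows rather than the notebook's confirmed behavior.

```python
import pandas as pd

# Toy rows; the real column comes from the raw transaction dataset.
tx = pd.DataFrame({
    "block_timestamp": [
        "2018-01-01 00:00:05+00:00",
        "2018-01-02 00:00:05+00:00",
        "not-a-date",
    ],
})

# Parse to timezone-aware UTC datetimes; unparseable values become NaT.
tx["block_timestamp_dt"] = pd.to_datetime(
    tx["block_timestamp"], utc=True, errors="coerce")

n_bad = int(tx["block_timestamp_dt"].isna().sum())
coverage = tx["block_timestamp_dt"].max() - tx["block_timestamp_dt"].min()
print(n_bad, coverage)
```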

---

### [02_RandomSplitAnalysis.ipynb — Random address split (i.i.d.)](02_RandomSplitAnalysis.ipynb)

**Purpose**

- Aggregate raw transactions into **address-level features**  
- Perform a **random address-level split** (train / validation / test)  
- Train and tune models (focus on XGBoost) with:
  - Metric tables  
  - ROC / PR curves  
  - Threshold analysis for high-precision alerting  
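
To make the aggregation step concrete, here is a small sketch that builds a few of the degree and amount features from Appendix B with pandas. It covers only a subset of the feature dictionary, and the snake_case column names and the function name are illustrative rather than the notebook's own.

```python
import pandas as pd

def degree_and_amount_features(tx: pd.DataFrame) -> pd.DataFrame:
    """Aggregate raw transactions into one row per address."""
    incoming = tx.groupby("to_address").agg(
        in_degree=("from_address", "size"),
        unique_in_degree=("from_address", "nunique"),
        total_amount_incoming=("value", "sum"),
        avg_amount_incoming=("value", "mean"),
    )
    outgoing = tx.groupby("from_address").agg(
        out_degree=("to_address", "size"),
        unique_out_degree=("to_address", "nunique"),
        total_amount_outgoing=("value", "sum"),
        avg_amount_outgoing=("value", "mean"),
    )
    # Outer join keeps addresses that only send or only receive; fill gaps with 0.
    features = incoming.join(outgoing, how="outer").fillna(0)
    count_cols = ["in_degree", "unique_in_degree", "out_degree", "unique_out_degree"]
    features[count_cols] = features[count_cols].astype(int)
    features.index.name = "Address"
    return features

# Toy transactions (hypothetical addresses and values).
tx = pd.DataFrame({
    "from_address": ["0xa", "0xb", "0xa"],
    "to_address":   ["0xb", "0xc", "0xc"],
    "value":        [10.0, 5.0, 7.0],
})
features = degree_and_amount_features(tx)
print(features)
```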

---

### [03_TimeSplitAnalysis.ipynb — Train on the past, test on the future](03_TimeSplitAnalysis.ipynb)

**Purpose**

- Enforce a **past → future** scenario with a time cutoff  
- Rebuild address-level features separately for past (train / val) and future (test)  
- Re-run model training and thresholding under this more realistic setup  
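
A minimal sketch of the cutoff logic, assuming a parsed `block_timestamp_dt` column in UTC. The cutoff date and the stand-in feature function are hypothetical. The sketch also leaves an address that is active on both sides of the cutoff in both windows; how the notebook treats that case is not covered here.

```python
import pandas as pd

def time_split(tx: pd.DataFrame, cutoff: str):
    """Split transactions into past (< cutoff) and future (>= cutoff) windows."""
    cutoff_ts = pd.Timestamp(cutoff, tz="UTC")
    is_past = tx["block_timestamp_dt"] < cutoff_ts
    return tx[is_past], tx[~is_past]

def tx_count_per_address(window: pd.DataFrame) -> pd.Series:
    """Stand-in feature recipe: total txs (in + out) per address in one window."""
    both_sides = pd.concat([window["from_address"], window["to_address"]])
    return both_sides.value_counts().rename("Tx count")

# Toy transactions (hypothetical addresses and dates).
tx = pd.DataFrame({
    "from_address": ["0xa", "0xb", "0xa", "0xd"],
    "to_address":   ["0xb", "0xc", "0xc", "0xa"],
    "block_timestamp_dt": pd.to_datetime(
        ["2018-01-01", "2018-02-01", "2018-03-01", "2018-04-01"], utc=True),
})
past, future = time_split(tx, "2018-02-15")

# Features are rebuilt independently per window, so the future never leaks into training.
train_features = tx_count_per_address(past)
test_features = tx_count_per_address(future)
print(train_features.to_dict(), test_features.to_dict())
```

Rows with a missing timestamp would fall into the future window in this sketch, so dropping them beforehand is advisable.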

---

### [04_DFPI_ExternalEval.ipynb — External evaluation on DFPI scam wallets](04_DFPI_ExternalEval.ipynb)

**Purpose**

- Build an evaluation set from **regulator-reported scam wallets** (DFPI)  
- Engineer the same behavioral features from Etherscan history  
- Apply the tuned model from the random-split experiment **without retraining**  
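
Applying a trained model to a new dataset without retraining only works if the feature columns arrive with the same names and in the same order as at training time. The helper below sketches that check. `align_features` and the column lists are assumptions for this example, not part of the notebook.

```python
import pandas as pd

def align_features(features: pd.DataFrame, training_columns) -> pd.DataFrame:
    """Return features restricted to, and ordered like, the training columns."""
    missing = [c for c in training_columns if c not in features.columns]
    if missing:
        # Fail loudly instead of silently filling columns the model relies on.
        raise ValueError(f"Missing feature columns: {missing}")
    return features.loc[:, list(training_columns)]

# Hypothetical training-time column order and an external feature table.
training_columns = ["in_degree", "out_degree", "Tx count"]
external = pd.DataFrame({
    "Tx count": [3, 1],
    "extra_column": [0.5, 0.7],
    "out_degree": [1, 0],
    "in_degree": [2, 1],
}, index=["0xa", "0xb"])
aligned = align_features(external, training_columns)
print(list(aligned.columns))
```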

## 7. Limitations and future directions

**Key limitations**

- **Label coverage:** many addresses are unlabeled; “non-scam” often just means “not tagged as scam”  
- **Time coverage:** the benchmark dataset covers a specific historical window; newer scams may look different  
- **Feature scope:** features are **transactional, address-level** only:
  - No contract bytecode or ABI decoding  
  - No graph-structure features (communities, motifs, etc.)  
  - No off-chain signals (KYC, IPs, devices, etc.)  

**Potential next steps**

- Add **graph-based signals** (e.g., PageRank, communities, subgraph patterns)  
- Incorporate **contract-level metadata** and DeFi protocol interactions  
- Experiment with **rolling-window** or online training for drift  
- Wrap scoring into an **analyst dashboard** for triage and investigations  

## Appendix A — Raw Ethereum transaction schema

| Field | Type | Meaning | Use | Notes |
|---|---|---|---|---|
| hash | string | Unique transaction hash | context | Not modeled directly; can be used as row ID |
| nonce | int | Per-sender transaction count at time of tx | dropped | Not used in current feature set |
| transaction_index | int | Position of tx within block | dropped | Block-local ordering only |
| from_address | string | Sender address | analysis | Used as key to build per-address features |
| to_address | string | Recipient address | analysis | Used as key to build per-address features |
| value | float | Transferred ETH amount in wei | analysis | Aggregated into incoming/outgoing amount features |
| gas | int | Gas limit specified for tx | analysis | Used for avg gas limit per address |
| gas_price | float | Gas price offered (wei per gas unit) | analysis | Used for avg gas price per address |
| input | string | Hex calldata / payload | dropped | Not parsed in this project |
| receipt_cumulative_gas_used | int | Total gas used in block up to and including this tx | dropped | Not used in current features |
| receipt_gas_used | int | Gas used by this tx alone | dropped | Redundant with other gas behavior for this analysis |
| block_timestamp | string → datetime | Block time for tx | analysis | Parsed to UTC; basis for all time/sequence features |
| block_number | int | Block height containing tx | dropped | Highly collinear with timestamp; not modeled directly |
| block_hash | string | Hash of containing block | dropped | Not used in current features |
| from_scam | int (0/1) | Source is labeled scam address | analysis | Used to construct per-address Scam label |
| to_scam | int (0/1) | Destination is labeled scam address | analysis | Used to construct per-address Scam label |
| from_category | string | Labeled category for sender | analysis | Used to flag scam/fraud/phish categories |
| to_category | string | Labeled category for recipient | analysis | Used to flag scam/fraud/phish categories |

## Appendix B — Engineered address-level feature dictionary

Index: each row corresponds to a unique Ethereum `Address` (string), aggregated over all transactions.

| Field | Type | Meaning | Use | Notes |
|---|---|---|---|---|
| in_degree | int | Count of incoming txs to address | analysis | Number of rows where address is `to_address` |
| out_degree | int | Count of outgoing txs from address | analysis | Number of rows where address is `from_address` |
| unique in_degree | int | Number of distinct senders to this address | analysis | Unique `from_address` values seen as incoming |
| unique out_degree | int | Number of distinct recipients from this address | analysis | Unique `to_address` values seen as outgoing |
| Avg amount incoming | float | Mean incoming transfer value (wei) | analysis | Averaged over all txs where address is recipient |
| Total amount incoming | float | Sum of incoming transfer value (wei) | analysis | Total ETH in wei received |
| Max amount incoming | float | Maximum single incoming value (wei) | analysis | Largest inbound transfer |
| Min amount incoming | float | Minimum single incoming value (wei) | analysis | Smallest inbound transfer (0 if none) |
| Avg amount outgoing | float | Mean outgoing transfer value (wei) | analysis | Averaged over all txs sent by address |
| Total amount outgoing | float | Sum of outgoing transfer value (wei) | analysis | Total ETH in wei sent |
| Max amount outgoing | float | Maximum single outgoing value (wei) | analysis | Largest outbound transfer |
| Min amount outgoing | float | Minimum single outgoing value (wei) | analysis | Smallest outbound transfer (0 if none) |
| Avg time incoming | float | Mean timestamp of incoming txs (seconds) | analysis | Seconds since earliest block in dataset |
| Avg time outgoing | float | Mean timestamp of outgoing txs (seconds) | analysis | Seconds since earliest block in dataset |
| Total Tx Time | float | Sum of the time gaps between consecutive transactions (seconds) | analysis | For addresses with ≥2 txs, the sum of all inter-transaction intervals (as defined, this telescopes to the same value as Active Duration); 0 if ≤1 tx |
| Active Duration | float | Lifespan between first and last tx (s) | analysis | 0 if only a single tx |
| Mean time interval | float | Mean gap between consecutive txs (s) | analysis | 0 if ≤1 tx |
| Max time interval | float | Largest gap between consecutive txs (s) | analysis | 0 if ≤1 tx |
| Min time interval | float | Smallest gap between consecutive txs (s) | analysis | 0 if ≤1 tx |
| Burstiness | float | max_gap / median_gap of tx times | analysis | 0 for ≤2 txs; higher = more bursty activity |
| Tx count | int | Total number of txs (in + out) for this address | analysis | Equals in_degree + out_degree over the dataset window |
| Activity Density | float | Tx count per second of Active Duration | analysis | `Tx count / (Active Duration + 1)` to avoid division by zero |
| Incoming count | int | Number of incoming txs | analysis | Count of records where address is recipient |
| Outgoing count | int | Number of outgoing txs | analysis | Count of records where address is sender |
| In/Out Ratio | float | (Incoming count + 1) divided by (Outgoing count + 1) | analysis | Higher values = sink-like behavior; +1 terms avoid divide-by-zero |
| Hour mean | float | Mean hour of day of activity (0–23) | analysis | Computed from UTC timestamps across all txs |
| Hour entropy | float | Entropy of hourly activity distribution (bits) | analysis | 0 = all txs at one hour; higher = spread across hours |
| Last seen | float | Timestamp of most recent tx (s) | analysis | Seconds since earliest block in dataset |
| Recency | float | How long before dataset end address was last active (s) | analysis | `dataset_end_ts_seconds − Last seen` |
| Avg gas price | float | Mean gas price used by address (wei per gas) | analysis | Aggregated across all in/out txs |
| Avg gas limit | float | Mean gas limit on txs involving address | analysis | Aggregated across all in/out txs |
| Scam | int (0/1) | Address labeled as scam-related | target | Derived from from_scam/to_scam and *_category text |
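
The timing-related rows above can be read as the following sketch for a single address, using only the formulas stated in the table. It assumes timestamps are integer Unix seconds in UTC and at least one transaction. Returning 0 when the median gap is 0 is an assumption, since the table does not cover that case.

```python
import math
from collections import Counter
from statistics import median

def temporal_features(timestamps, n_in, n_out):
    """Timing features for one address; timestamps are Unix seconds (UTC)."""
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    active_duration = ts[-1] - ts[0] if len(ts) > 1 else 0
    med = median(gaps) if gaps else 0
    # Table rule: 0 for <= 2 txs. Guarding med == 0 is an added assumption.
    burstiness = max(gaps) / med if len(ts) > 2 and med > 0 else 0.0
    # Hour-of-day distribution and its Shannon entropy in bits.
    hours = Counter((t // 3600) % 24 for t in ts)
    probs = [count / len(ts) for count in hours.values()]
    hour_entropy = sum(p * math.log2(1 / p) for p in probs)
    return {
        "Active Duration": active_duration,
        "Mean time interval": sum(gaps) / len(gaps) if gaps else 0,
        "Max time interval": max(gaps) if gaps else 0,
        "Min time interval": min(gaps) if gaps else 0,
        "Burstiness": burstiness,
        "Activity Density": len(ts) / (active_duration + 1),
        "In/Out Ratio": (n_in + 1) / (n_out + 1),
        "Hour entropy": hour_entropy,
    }

# Toy address: four txs at hours 0, 1, 2 and 10 of the same day; 3 incoming, 1 outgoing.
feats = temporal_features([0, 3600, 7200, 36000], n_in=3, n_out=1)
print(feats)
```

In this toy case the largest gap is eight times the median gap, so `Burstiness` is 8, and four transactions in four different hours give an hour entropy of 2 bits.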