In [1]:
# Cell 1: Import libraries and set up path to custom utilities
import pandas as pd
import numpy as np
import warnings
import sys
import os

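# Add the 'scripts' directory to the Python path to find our utility module.
# The path is resolved relative to the current working directory, so run the
# notebook from the project root (the folder that contains 'scripts').
sys.path.append(os.path.abspath('scripts'))

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')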

# Import our custom functions from the risk_utils module
try:
    from risk_utils import (
        load_and_preprocess_data,
        create_advanced_wallet_features,
        calculate_advanced_risk_score,
        apply_ml_refinements,
        calculate_final_risk_score
    )
except ImportError:
    print("FATAL ERROR: Could not import from 'risk_utils.py'.")
    print("Please ensure 'scripts/risk_utils.py' exists in your project directory.")
    # Nothing below can run without these functions, so stop with a non-zero exit code
    sys.exit(1)

<hr>

### **Markdown: Feature Engineering Logic**

The `create_advanced_wallet_features` function aggregates transaction data to the wallet level. Below are the formulas for some of the key engineered features that serve as risk indicators. In the first two features, the denominator is floored at 1 to avoid division by zero.

- **Send/Receive Ratio (`send_receive_ratio`)**: Measures the ratio of outgoing to incoming transactions. A highly imbalanced ratio can be a risk indicator.  
  $$
  \text{Send/Receive Ratio}
  =
  \frac{\text{Total Sent Transactions}}
       {\max(\text{Total Received Transactions}, 1)}
  $$

- **Transaction Frequency (`transaction_frequency`)**: Calculates the average number of transactions per day over the wallet’s active lifespan. High frequency can indicate bot activity.  
  $$
  \text{Transaction Frequency}
  =
  \frac{\text{Total Transactions}}
       {\max(\text{Activity Span in Days}, 1)}
  $$

- **Contract Complexity (`contract_complexity`)**: Assesses the diversity of a wallet’s interactions. A low value (near 0) means the wallet repeatedly calls the same function; a high value (near 1) indicates interaction with many different functions.

  First compute the Dominant Function Ratio:
  $$
  \text{Dominant Function Ratio}
  =
  \frac{\text{Count of Most Common Function}}
       {\text{Total Transactions}}
  $$

  Then:
  $$
  \text{Contract Complexity}
  = 
  1 \;-\; \text{Dominant Function Ratio}
  $$
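
The three formulas above can be sketched on a toy transaction log as follows. This is a minimal illustration, not the code inside `risk_utils`: the column names (`wallet`, `direction`, `function`, `timestamp`) and the helper `wallet_features` are assumptions made for the example.

```python
import pandas as pd

# Toy transaction log. The column names are illustrative assumptions,
# not the schema used by risk_utils.
tx = pd.DataFrame({
    "wallet": ["A", "A", "A", "A", "B", "B"],
    "direction": ["sent", "sent", "sent", "received", "received", "received"],
    "function": ["mint", "mint", "borrow", "redeem", "mint", "mint"],
    "timestamp": pd.to_datetime([
        "2023-01-01", "2023-01-02", "2023-01-03", "2023-01-05",
        "2023-02-01", "2023-02-01",
    ]),
})

def wallet_features(group):
    """Compute the three example features for one wallet's transactions (hypothetical helper)."""
    total = len(group)
    sent = int((group["direction"] == "sent").sum())
    received = int((group["direction"] == "received").sum())
    span_days = (group["timestamp"].max() - group["timestamp"].min()).days
    # Share of transactions that call the wallet's most common function
    dominant_ratio = group["function"].value_counts().iloc[0] / total
    return {
        "send_receive_ratio": sent / max(received, 1),
        "transaction_frequency": total / max(span_days, 1),
        "contract_complexity": 1 - dominant_ratio,
    }

# One row of features per wallet
features = pd.DataFrame.from_dict(
    {wallet: wallet_features(group) for wallet, group in tx.groupby("wallet")},
    orient="index",
)
print(features)
```

Wallet `B` shows why the denominators are floored: its two transactions fall on the same day (a span of 0 days) and it has no sent transactions, yet both ratios remain well defined.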

<hr>

In [3]:
# Cell 2: Run the Full Risk Scoring Pipeline
def run_pipeline(filepath='data/compound_v2_v3_transactions.csv'):
    """Executes the entire risk scoring process from loading data to saving the output."""
    
    # Step 1: Load and preprocess data
    df_processed = load_and_preprocess_data(filepath)
    if df_processed.empty:
        print("Stopping execution: no transaction data was loaded.")
        return None

    # Step 2: Create wallet-level features
    wallet_df = create_advanced_wallet_features(df_processed)
    wallet_df = wallet_df.fillna(0) # Fill NaNs from std dev etc.

    # Step 3: Calculate the base risk score
    wallet_df['base_risk_score'] = calculate_advanced_risk_score(wallet_df)

    # Step 4: Apply ML-based refinements (anomaly detection & clustering)
    wallet_df = apply_ml_refinements(wallet_df)

    # Step 5: Calculate the final, adjusted risk score
    wallet_df['final_risk_score'] = calculate_final_risk_score(wallet_df)

    # Step 6: Categorize risk for reporting
    def categorize_risk(score):
        if score < 200:
            return 'Very Low'
        elif score < 400:
            return 'Low'
        elif score < 600:
            return 'Medium'
        elif score < 800:
            return 'High'
        else:
            return 'Very High'
    wallet_df['risk_category'] = wallet_df['final_risk_score'].apply(categorize_risk)
    
    return wallet_df

# Execute the pipeline
wallet_risk_data = run_pipeline()


Successfully loaded data/compound_v2_v3_transactions.csv
Processing features for 80 wallets using 'wallet_address' as identifier...
...feature engineering complete.
Calculating advanced risk scores...
...base risk score calculation complete.
Applying ML refinements (Anomaly Detection and Clustering)...
...identified 4 anomalous wallets.
...clustered wallets into 5 groups.


<hr>

### **Markdown: Base Risk Score Calculation**

The `calculate_advanced_risk_score` function uses a weighted component model.

1.  **Normalization**: First, all feature values are normalized to a common scale of [0, 1] using Min-Max Scaling. This prevents features with large ranges from dominating the score.
    $$
    X_{\text{norm}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}
    $$

2.  **Component Score**: Features are grouped into logical risk components (e.g., `volume_risk`, `behavioral_risk`). The score for each component is the average of its normalized feature values.
    $$
    C_i = \frac{1}{N} \sum_{j=1}^{N} F_{j, \text{norm}}
    $$
    where $C_i$ is the score for component $i$, and $F_{j, \text{norm}}$ is the normalized value of the $j$-th of the $N$ features within that component.

3.  **Weighted Sum**: The base score is the sum of all component scores, each multiplied by its predefined weight ($w_i$).
    $$
    \text{Score}_{\text{base}} = \sum_{i=1}^{n} (w_i \cdot C_i)
    $$

4.  **Penalty Adjustments**: Finally, penalties are applied to wallets exhibiting specific high-risk behaviors (e.g., high error rate), multiplying their score by a penalty factor.
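
The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of `calculate_advanced_risk_score`: the feature names, the component groupings, the weights (chosen to sum to 1000), and the penalty rule (a factor of 1.2 when `error_rate` exceeds 0.3) are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical wallet-level features (names and values are made up)
wallets = pd.DataFrame({
    "total_volume": [100.0, 500.0, 900.0],
    "max_tx_size": [10.0, 20.0, 50.0],
    "transaction_frequency": [1.0, 3.0, 5.0],
    "error_rate": [0.0, 0.5, 0.1],
}, index=["w1", "w2", "w3"])

# Hypothetical grouping of features into components, with assumed weights
components = {
    "volume_risk": ["total_volume", "max_tx_size"],
    "behavioral_risk": ["transaction_frequency", "error_rate"],
}
weights = {"volume_risk": 600, "behavioral_risk": 400}

def min_max(col):
    """Step 1: rescale a column to [0, 1]; a constant column maps to 0."""
    span = col.max() - col.min()
    if span == 0:
        return col * 0.0
    return (col - col.min()) / span

normalized = wallets.apply(min_max)

# Step 2: each component score is the mean of its normalized features
component_scores = pd.DataFrame({
    name: normalized[cols].mean(axis=1) for name, cols in components.items()
})

# Step 3: weighted sum of the component scores
base = sum(weights[name] * component_scores[name] for name in components)

# Step 4: multiplicative penalty for an assumed high-risk behavior
penalty = np.where(wallets["error_rate"] > 0.3, 1.2, 1.0)
base_penalized = base * penalty
print(base_penalized)
```

The guard in `min_max` covers the case $X_{\max} = X_{\min}$, where the normalization formula would otherwise divide by zero.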

<hr>

### **Markdown: ML Refinements & Final Score**

The `apply_ml_refinements` and `calculate_final_risk_score` functions adjust the base score using unsupervised learning.

1.  **Anomaly Detection (Isolation Forest)**: This model identifies wallets that are statistical outliers based on their feature combinations. These are flagged as `is_anomaly = True`.

2.  **Clustering (K-Means)**: This model groups wallets into behavioral clusters. Each cluster then receives a risk adjustment factor equal to its mean risk score divided by the global mean, so clusters that are riskier than average get a factor above 1.
    $$
    \text{Adjustment}_{\text{cluster}} = \frac{\text{Mean Score}_{\text{cluster}}}{\text{Mean Score}_{\text{global}}}
    $$

3.  **Final Score Calculation**: The final score combines the base score with the ML refinements. The score of an anomalous wallet is first multiplied by a boost factor $B$, and the cluster adjustment is then applied.
    $$
    \text{Score}_{\text{final}} = \text{Score}_{\text{base}} \times B \times \text{Adjustment}_{\text{cluster}},
    \qquad
    B =
    \begin{cases}
    \text{Anomaly Boost} & \text{if the wallet is an anomaly} \\
    1 & \text{otherwise}
    \end{cases}
    $$
    The final score is then clipped to the range [0, 1000].
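
The final-score arithmetic can be sketched as follows, taking the anomaly flags and cluster labels as given rather than fitting the Isolation Forest and K-Means models. This is an illustration under stated assumptions: the boost value of 1.5, the toy scores, and the choice to compute cluster means from the base score are all hypothetical and may differ from what `calculate_final_risk_score` does.

```python
import numpy as np
import pandas as pd

ANOMALY_BOOST = 1.5  # hypothetical value; the real factor lives in risk_utils

# Toy output of the earlier steps: base scores plus ML labels (all made up)
scored = pd.DataFrame({
    "base_risk_score": [100.0, 300.0, 200.0, 800.0],
    "cluster": [0, 0, 1, 1],
    "is_anomaly": [False, False, False, True],
}, index=["w1", "w2", "w3", "w4"])

# Cluster adjustment: mean score of the wallet's cluster over the global mean
global_mean = scored["base_risk_score"].mean()
cluster_mean = scored.groupby("cluster")["base_risk_score"].transform("mean")
adjustment = cluster_mean / global_mean

# Boost factor: ANOMALY_BOOST for flagged wallets, 1 otherwise
boost = np.where(scored["is_anomaly"], ANOMALY_BOOST, 1.0)

# Combine the factors and clip to the reporting range [0, 1000]
final = (scored["base_risk_score"] * boost * adjustment).clip(lower=0, upper=1000)
print(final.round(1))
```

In this toy data, wallet `w4` is both anomalous and in the riskier cluster, so its unclipped score exceeds 1000 and the clip step sets it to exactly 1000.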

<hr>


In [4]:
# Cell 3: Final Output and Summary
if wallet_risk_data is not None:
    # Create and save the final output file as required
    output_df = wallet_risk_data[['wallet_id', 'final_risk_score']].copy()
    output_df.columns = ['wallet_id', 'score']
    output_df = output_df.sort_values('score', ascending=False)
    output_df.to_csv('wallet_risk_scores.csv', index=False)

    print("\n=== FINAL RESULTS ===")
    print(f"Total wallets analyzed: {len(output_df)}")
    print(f"Average risk score: {output_df['score'].mean():.0f}")
    print("\nTop 5 highest risk wallets:")
    print(output_df.head(5))
    print("\nRisk category distribution:")
    print(wallet_risk_data['risk_category'].value_counts())
    print("\n✓ wallet_risk_scores.csv saved successfully.")
else:
    print("\nPipeline did not complete. No output file was generated.")



=== FINAL RESULTS ===
Total wallets analyzed: 80
Average risk score: 128

Top 5 highest risk wallets:
                                     wallet_id  score
0   0x0039f22efb07a647557c7c5d17854cfd6d489ef3   1000
39  0x70d8e4ab175dfe0eab4e9a7f33e0a2d19f44001e   1000
23  0x4814be124d7fe3b240eb46061f7ddfab468fe122   1000
22  0x427f2ac5fdf4245e027d767e7c3ac272a1f40a65   1000
57  0xa7f3c74f0255796fd5d3ddcf88db769f7a6bf46a    793

Risk category distribution:
risk_category
Very Low     71
Very High     4
High          4
Medium        1
Name: count, dtype: int64

✓ wallet_risk_scores.csv saved successfully.
