# TagAssist

## Problem Definition

We aim to optimize the handling of pending fraud tag-change requests using
structured case features and investigator annotations.

Goal:
Predict the probability that a request is approved and apply a risk-based threshold
to reduce manual review workload while controlling financial exposure.

## Data Design Strategy

- Fraud-heavy queue
- Moderate class imbalance (approx 65% approved)
- Annotation text influences approval probability
- Financial threshold logic simulated (disputes under $500 are treated as safer to automate)
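
The risk-based threshold described above could be sketched as a small routing rule. This is only an illustration: the function name `route_request` and both probability thresholds are hypothetical and are not values used elsewhere in this notebook; only the $500 cutoff comes from the design notes.

```python
def route_request(approval_prob, dispute_amount,
                  low_risk_threshold=0.80,   # assumed value, for illustration only
                  high_risk_threshold=0.95,  # assumed stricter bar for larger amounts
                  amount_cutoff=500.0):      # the <$500 cutoff from the design notes
    """Return "auto_approve" or "manual_review" for one pending request."""
    # Larger disputes carry more financial exposure, so they need a higher
    # predicted approval probability before they are automated.
    if dispute_amount < amount_cutoff:
        threshold = low_risk_threshold
    else:
        threshold = high_risk_threshold
    return "auto_approve" if approval_prob >= threshold else "manual_review"


print(route_request(0.85, 120.0))   # small amount, clears the lower bar
print(route_request(0.85, 900.0))   # same score, larger amount, goes to review
```

Using two thresholds keeps the rule easy to audit: workload reduction comes from the low-amount bucket, while exposure on larger disputes is limited by the stricter bar.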





In [8]:
import numpy as np
import pandas as pd
import random

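In [4]:
columns = ["fraud_score", "prior_disputes", "total_dispute_amount", "account_age_days", "linked_accounts_count", "investigator_tenure_months", "annotation_text", "approved"]
df = pd.DataFrame(columns=columns)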

In [5]:
df

| | fraud_score | prior_disputes | total_dispute_amount | account_age_days | linked_accounts_count | investigator_tenure_months | annotation_text | approved |
|---|---|---|---|---|---|---|---|---|


In [33]:
import numpy as np
import pandas as pd
import random

def generate_synthetic_data(n_samples=5000, fraud_heavy=False):
    """Generate synthetic tag-change requests; fraud_heavy=True skews fraud_score high."""

    fraud_templates = [
        "Multiple linked accounts with coordinated dispute behavior.",
        "High dispute velocity within short time window.",
        "Pattern matches prior confirmed abuse cluster.",
        "Chargeback ratio exceeds acceptable threshold.",
        "Recurring fraud indicators aligned with historical abuse patterns.",
        "Order Vel",
        "Card Vel",
        "Risky Order Pattern",
        "bad gsi",
        "multiple pending orders",
        "address with multiple unwanted characters",
    ]

    weak_templates = [
        "Single dispute reported by customer.",
        "No prior history of abuse.",
        "Long-standing account with stable behavior.",
        "Insufficient evidence for coordinated fraud.",
        "Customer claims non-delivery without supporting signals."
    ]

    data = []

    for _ in range(n_samples):

        # ---- Fraud Score Distribution ----
        if fraud_heavy:
            fraud_score = np.random.beta(5, 2)   # Skewed high
        else:
            fraud_score = np.random.beta(2, 5)   # Mostly low-mid

        # ---- Structured Features ----
        prior_disputes = np.random.poisson(2)
        total_dispute_amount = np.random.gamma(2, 150)
        account_age_days = np.random.exponential(365)
        linked_accounts_count = np.random.poisson(1)
        investigator_tenure_months = np.random.exponential(24)

        # ---- Annotation Selection ----
        if fraud_score > 0.7:
            annotation_text = random.choice(fraud_templates)
        else:
            annotation_text = random.choice(weak_templates)

        # ---- Approval Probability Logic ----
        # Earlier variant: base = 0.4 + (fraud_score * 0.4)
        base = 0.3 + (fraud_score * 0.6)

        if prior_disputes > 3:
            base += 0.1

        if linked_accounts_count > 1:
            base += 0.1

        if total_dispute_amount > 500:
            base -= 0.05

        if investigator_tenure_months > 18:
            base += 0.05

        # Clamp probability between 0 and 1
        approval_prob = min(max(base, 0), 1)

        approved = np.random.binomial(1, approval_prob)

        data.append([
            fraud_score,
            prior_disputes,
            total_dispute_amount,
            account_age_days,
            linked_accounts_count,
            investigator_tenure_months,
            annotation_text,
            approved
        ])

    columns = [
        "fraud_score",
        "prior_disputes",
        "total_dispute_amount",
        "account_age_days",
        "linked_accounts_count",
        "investigator_tenure_months",
        "annotation_text",
        "approved"
    ]

    return pd.DataFrame(data, columns=columns)

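Distribution choices:

- fraud_score → Beta distribution: Beta(2, 5) by default (mostly low to mid, some high); Beta(5, 2) when `fraud_heavy=True`, which flips the probability mass toward high fraud scores
- prior_disputes → Poisson distribution (skewed toward small counts)
- account_age_days → Exponential (many young accounts, some very old)
- total_dispute_amount → Gamma (right-skewed, as financial amounts usually are)

Swapping the two Beta parameters is the only change needed to produce the fraud-heavy queue; no special-case logic is required.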
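
As a quick sanity check on the `fraud_heavy` switch, the two Beta parameterizations can be compared against their analytic means. This is a minimal sketch: the helper `beta_mean` and the seeded generator are additions for illustration (the notebook itself draws from the unseeded global `np.random`).

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so this sketch is repeatable


def beta_mean(a, b):
    """Analytic mean of a Beta(a, b) distribution: a / (a + b)."""
    return a / (a + b)


for label, (a, b) in {"mixed": (2, 5), "fraud_heavy": (5, 2)}.items():
    samples = rng.beta(a, b, size=100_000)
    print(f"{label}: analytic mean={beta_mean(a, b):.3f}, "
          f"sample mean={samples.mean():.3f}")
```

The analytic means are 2/7 ≈ 0.286 and 5/7 ≈ 0.714, in line with the `describe()` means of 0.287 and 0.711 reported later in the notebook.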

In [26]:
df_mixed = generate_synthetic_data(5000, fraud_heavy=False)
df_fraud = generate_synthetic_data(5000, fraud_heavy=True)

print("Mixed Approval Rate:", df_mixed["approved"].mean())
print("Fraud-heavy Approval Rate:", df_fraud["approved"].mean())


Mixed Approval Rate: 0.5798
Fraud-heavy Approval Rate: 0.7418


In [27]:
df_mixed.head()

| | fraud_score | prior_disputes | total_dispute_amount | account_age_days | linked_accounts_count | investigator_tenure_months | annotation_text | approved |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.314385 | 1 | 844.448092 | 290.684046 | 2 | 14.022401 | No prior history of abuse. | 0 |
| 1 | 0.399918 | 4 | 304.312267 | 774.363448 | 1 | 5.019962 | No prior history of abuse. | 1 |
| 2 | 0.268837 | 0 | 473.549494 | 461.937552 | 2 | 1.812907 | Long-standing account with stable behavior. | 1 |
| 3 | 0.429174 | 0 | 45.865258 | 47.849157 | 2 | 8.838448 | Insufficient evidence for coordinated fraud. | 1 |
| 4 | 0.178211 | 2 | 186.333103 | 196.282061 | 0 | 5.249741 | Customer claims non-delivery without supportin... | 1 |


In [29]:
df_mixed["fraud_score"].describe()


count    5000.000000
mean        0.287315
std         0.158950
min         0.003416
25%         0.163014
50%         0.264687
75%         0.391899
max         0.840239
Name: fraud_score, dtype: float64

In [30]:
df_fraud["fraud_score"].describe()

count    5000.000000
mean        0.711391
std         0.160255
min         0.085301
25%         0.605944
50%         0.730039
75%         0.837853
max         0.998727
Name: fraud_score, dtype: float64

In [31]:
df_mixed.corr(numeric_only=True)["approved"].sort_values(ascending=False)


approved                      1.000000
fraud_score                   0.116370
linked_accounts_count         0.076688
prior_disputes                0.075367
investigator_tenure_months    0.036250
account_age_days              0.017168
total_dispute_amount         -0.042429
Name: approved, dtype: float64

In [32]:
# Only the last expression in a cell is displayed, so the output below is for df_fraud.
df_mixed["approved"].value_counts(normalize=True)
df_fraud["approved"].value_counts(normalize=True)


approved
1    0.7418
0    0.2582
Name: proportion, dtype: float64

In [34]:
df_mixed = generate_synthetic_data(5000, fraud_heavy=False)
print(df_mixed["approved"].mean())
print(df_mixed.corr(numeric_only=True)["approved"].sort_values(ascending=False))


0.5332
approved                      1.000000
fraud_score                   0.185408
linked_accounts_count         0.060754
investigator_tenure_months    0.056841
prior_disputes                0.055889
account_age_days             -0.004705
total_dispute_amount         -0.037180
Name: approved, dtype: float64
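
The observed approval rate can be cross-checked against the generator's logic. The sketch below approximates the expected approval probability from the distribution parameters in `generate_synthetic_data`, ignoring the clamp to [0, 1]. The helper names and the `intercept`/`slope` parameters are introduced here for illustration only.

```python
import math


def poisson_tail(lam, k):
    """P(X > k) for X ~ Poisson(lam)."""
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))
    return 1.0 - cdf


def expected_approval_rate(mean_fraud_score, intercept=0.3, slope=0.6):
    """Approximate E[approval_prob], ignoring the clamp to [0, 1]."""
    rate = intercept + slope * mean_fraud_score
    rate += 0.10 * poisson_tail(2, 3)          # prior_disputes > 3
    rate += 0.10 * poisson_tail(1, 1)          # linked_accounts_count > 1
    # Gamma(shape=2, scale=150) survival function: exp(-x/s) * (1 + x/s)
    x = 500 / 150
    rate -= 0.05 * math.exp(-x) * (1 + x)      # total_dispute_amount > 500
    rate += 0.05 * math.exp(-18 / 24)          # investigator_tenure_months > 18
    return rate


# Beta(2, 5) has mean 2/7 (fraud_heavy=False).
print(round(expected_approval_rate(2 / 7), 3))
# Same check with the commented-out variant base = 0.4 + fraud_score * 0.4.
print(round(expected_approval_rate(2 / 7, intercept=0.4, slope=0.4), 3))
```

The first value comes out at about 0.528, close to the 0.5332 observed above. The second is about 0.571, which is near the 0.5798 printed by the earlier `In [26]` cell; one plausible reading, which the notebook does not confirm, is that the earlier cell ran against the commented-out formula.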


In [36]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Select structured features only (no annotation yet)
X = df_mixed.drop(columns=["annotation_text", "approved"])
y = df_mixed["approved"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [37]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

model = LogisticRegression()
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:, 1]

print("ROC-AUC:", roc_auc_score(y_test, y_prob))
print(classification_report(y_test, y_pred))


ROC-AUC: 0.6175570411800713
              precision    recall  f1-score   support

           0       0.60      0.48      0.54       493
           1       0.58      0.69      0.63       507

    accuracy                           0.59      1000
   macro avg       0.59      0.59      0.58      1000
weighted avg       0.59      0.59      0.58      1000
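
A ROC-AUC near 0.62 may look weak, but the labels are Bernoulli draws from `approval_prob`, so even a scorer that knows the true probability cannot separate the classes perfectly. The sketch below estimates that ceiling by re-creating the generator's approval logic in vectorized form. The seeded generator, the sample size, and the `rank_auc` helper are assumptions added for illustration, not part of the original pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Vectorized re-creation of the approval logic for fraud_heavy=False.
fraud_score = rng.beta(2, 5, size=n)
base = 0.3 + 0.6 * fraud_score
base += 0.10 * (rng.poisson(2, size=n) > 3)         # prior_disputes
base += 0.10 * (rng.poisson(1, size=n) > 1)         # linked_accounts_count
base -= 0.05 * (rng.gamma(2, 150, size=n) > 500)    # total_dispute_amount
base += 0.05 * (rng.exponential(24, size=n) > 18)   # investigator_tenure_months
approval_prob = np.clip(base, 0, 1)
approved = rng.binomial(1, approval_prob)


def rank_auc(scores, labels):
    """ROC-AUC via the Mann-Whitney rank statistic (ties are not averaged)."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Score each row with its true approval probability (an "oracle" model).
oracle_auc = rank_auc(approval_prob, approved)
print(f"Oracle ROC-AUC: {oracle_auc:.3f}")
```

If the oracle score lands only slightly above the logistic regression result, that would suggest the baseline is already close to what this synthetic data allows, and that raising AUC further means strengthening the signal in the generator rather than changing the model.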



In [39]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack

X_struct = df_mixed.drop(columns=["annotation_text", "approved"])
X_text_raw = df_mixed["annotation_text"]
y = df_mixed["approved"]

X_train_s, X_test_s, X_train_text_raw, X_test_text_raw, y_train, y_test = train_test_split(
    X_struct, X_text_raw, y, test_size=0.2, random_state=42
)

# Scale structured features
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train_s)
X_test_s = scaler.transform(X_test_s)

# Vectorize text
vectorizer = TfidfVectorizer(max_features=50)
X_train_text = vectorizer.fit_transform(X_train_text_raw)
X_test_text = vectorizer.transform(X_test_text_raw)

# Combine
X_train_combined = hstack([X_train_s, X_train_text])
X_test_combined = hstack([X_test_s, X_test_text])


In [40]:
model_text = LogisticRegression(max_iter=1000)
model_text.fit(X_train_combined, y_train)

y_prob_text = model_text.predict_proba(X_test_combined)[:, 1]
y_pred_text = model_text.predict(X_test_combined)

print("ROC-AUC with Text:", roc_auc_score(y_test, y_prob_text))
print(classification_report(y_test, y_pred_text))


ROC-AUC with Text: 0.6169089141471729
              precision    recall  f1-score   support

           0       0.60      0.48      0.53       493
           1       0.58      0.68      0.62       507

    accuracy                           0.58      1000
   macro avg       0.59      0.58      0.58      1000
weighted avg       0.59      0.58      0.58      1000
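
Adding TF-IDF features leaves ROC-AUC essentially unchanged. One likely reason, given how the generator works: the annotation is chosen purely from `fraud_score > 0.7`, so it carries no information beyond a feature the model already has, and under Beta(2, 5) that condition is rarely met. The sketch below estimates how often a fraud template would be selected; the seeded generator and sample size are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# In the generator, a fraud template is chosen only when fraud_score > 0.7.
scores = rng.beta(2, 5, size=200_000)   # fraud_heavy=False distribution
share_fraud_templates = (scores > 0.7).mean()

# Closed form for integer parameters:
# P(Beta(2, 5) > x) = P(Binomial(6, x) <= 1)
x = 0.7
analytic = (1 - x) ** 6 + 6 * x * (1 - x) ** 5

print(f"simulated share: {share_fraud_templates:.4f}, analytic: {analytic:.4f}")
```

Roughly 1% of mixed-queue rows receive a fraud template, so the text columns are nearly constant in this dataset. To test whether annotations help, the generator would need to let the annotation affect approval independently of `fraud_score`, or the comparison could be repeated on the fraud-heavy data.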

