Robust semi-supervised spam detection using Rust native NLP pipelines.

Philipp-Sc/llm-fraud-detection

llm-fraud-detection

Robust semi-supervised fraud detection using Rust native NLP pipelines.

About

llm-fraud-detection uses llama.cpp to generate an embedding for a given text and predicts the fraud likelihood from that embedding.

The training data is generated from a diverse collection of commonly used spam/ham datasets:

  • Ling Spam,
  • Enron Spam,
  • Spam Assassin Dataset,
  • SMS Spam Collection,
  • Youtube Spam,
  • Crypto Governance Proposals.

llm-fraud-detection achieves state-of-the-art performance without fine-tuning the LLMs directly. Instead, a model is trained on the outputs of the LLMs (embeddings) and used for the spam/ham classification task.

Use

Clone the repository and train the required models. Edit docker-compose.yml first.

docker-compose up -d --build

(The training data is provided in this repository; the models are not, due to size limitations.)

Add to your Cargo.toml manifest:

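[dependencies]
rust_fraud_detection_tools = { git="https://github.com/Philipp-Sc/llm-fraud-detection.git" }
# The example below returns anyhow::Result from main, so anyhow must be a direct dependency.
anyhow = "1"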

Predict fraud/ham:

// Example inputs: spam and ham messages in alternating order.
pub const SENTENCES: [&str; 6] = [
    "Lose up to 19% weight. Special promotion on our new weightloss.",
    "Hi Bob, can you send me your machine learning homework?",
    "Don't forget our special promotion: -30% on men shoes, only today!",
    "Hi Bob, don't forget our meeting today at 4pm.",
    "⚠️ FINAL: LAST TERRA PHOENIX AIRDROP 🌎 ✅ CLAIM NOW All participants in this vote will receive a reward..",
    "Social KYC oracle (TYC)  PFC is asking for 20k Luna to build a social KYC protocol..",
];

fn main() -> anyhow::Result<()> {
    // One fraud probability is returned per input sentence.
    let fraud_probabilities: Vec<f32> =
        rust_fraud_detection_tools::fraud_probabilities(&SENTENCES)?;
    println!("Predictions:\n{:?}", fraud_probabilities);
    // Expected labels for comparison (1.0 = fraud, 0.0 = ham).
    println!("Labels:\n[1.0, 0.0, 1.0, 0.0, 1.0, 0.0]");
    Ok(())
}

Output:

Predictions:
[1.0, 0.0, 0.6536912, 0.0, 0.99759775, 0.17467633]
Labels:
[1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
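
To turn the probabilities into a spam/ham decision, a cutoff has to be chosen. The following is a minimal sketch, assuming a threshold of 0.5; `flag_fraud` is a hypothetical helper and not part of rust_fraud_detection_tools. The predictions are copied from the output above.

```rust
/// Hypothetical helper (not part of the crate): flags every probability
/// that is greater than or equal to `threshold` as fraud.
fn flag_fraud(probabilities: &[f32], threshold: f32) -> Vec<bool> {
    probabilities.iter().map(|&p| p >= threshold).collect()
}

fn main() {
    // Predictions copied from the example output above.
    let predictions = [1.0_f32, 0.0, 0.6536912, 0.0, 0.99759775, 0.17467633];
    // Assumed cutoff; pick it based on the precision/recall trade-off you need.
    let flags = flag_fraud(&predictions, 0.5);
    println!("{:?}", flags);
    // prints [true, false, true, false, true, false]
}
```

With this cutoff the flags match the labels of all six example sentences.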

Architecture

Features

KNN Regressor on Text Embeddings (llama.cpp) ⭐

Single fraud likelihood prediction based on text embeddings. Model: uae-large-v1_fp32.gguf
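
The idea of a KNN regressor on embeddings can be sketched as follows. This is an illustration only: the distance metric (squared Euclidean), the unweighted mean over neighbours, the value of `k`, and the names `knn_predict` and `euclidean_sq` are all assumptions, and the toy 2-D vectors stand in for real llama.cpp embeddings. The repository's actual implementation may differ.

```rust
/// Squared Euclidean distance between two embeddings of equal length.
fn euclidean_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Hypothetical KNN regressor: the predicted fraud likelihood is the mean
/// label (1.0 = spam, 0.0 = ham) of the `k` nearest training embeddings.
fn knn_predict(train: &[(Vec<f32>, f32)], query: &[f32], k: usize) -> f32 {
    // Pair every training label with its distance to the query.
    let mut dists: Vec<(f32, f32)> = train
        .iter()
        .map(|(emb, label)| (euclidean_sq(emb, query), *label))
        .collect();
    // Sort ascending by distance; total_cmp gives a total order on f32.
    dists.sort_by(|a, b| a.0.total_cmp(&b.0));
    // Clamp k to the number of samples, and to at least 1 to avoid dividing by zero.
    let k = k.min(dists.len()).max(1);
    let sum: f32 = dists.iter().take(k).map(|(_, label)| *label).sum();
    sum / k as f32
}

fn main() {
    // Toy 2-D "embeddings" with labels; real embeddings have far more dimensions.
    let train = vec![
        (vec![0.0, 0.0], 0.0),
        (vec![0.1, 0.0], 0.0),
        (vec![1.0, 1.0], 1.0),
        (vec![0.9, 1.0], 1.0),
        (vec![1.0, 0.9], 1.0),
    ];
    // All three nearest neighbours of this query are labelled spam.
    println!("{}", knn_predict(&train, &[0.95, 0.95], 3));
    // Two ham neighbours and one spam neighbour: the mean is about 0.33.
    println!("{}", knn_predict(&train, &[0.05, 0.0], 3));
}
```

Because the output is an average of neighbour labels, it behaves like a likelihood between 0 and 1 that can be thresholded.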

Full evaluation (F-Score = 0.94)
Performance on the training data (80%)
Threshold >= 0.1: True Positive = 5978, False Positive = 732, Precision = 0.891, Recall = 1.000, F-Score = 0.942
Threshold >= 0.2: True Positive = 5978, False Positive = 110, Precision = 0.982, Recall = 1.000, F-Score = 0.991
Threshold >= 0.3: True Positive = 5977, False Positive = 72, Precision = 0.988, Recall = 0.999, F-Score = 0.994
Threshold >= 0.4: True Positive = 5975, False Positive = 42, Precision = 0.993, Recall = 0.999, F-Score = 0.996
Threshold >= 0.5: True Positive = 5975, False Positive = 0, Precision = 1.000, Recall = 0.999, F-Score = 1.000
Threshold >= 0.6: True Positive = 5936, False Positive = 0, Precision = 1.000, Recall = 0.993, F-Score = 0.996
Threshold >= 0.7: True Positive = 5900, False Positive = 0, Precision = 1.000, Recall = 0.987, F-Score = 0.993
Threshold >= 0.8: True Positive = 5894, False Positive = 0, Precision = 1.000, Recall = 0.986, F-Score = 0.993
Threshold >= 0.9: True Positive = 5886, False Positive = 0, Precision = 1.000, Recall = 0.984, F-Score = 0.992
Performance on the test data (20%)
Threshold >= 0.1: True Positive = 1506, False Positive = 445, Precision = 0.772, Recall = 0.980, F-Score = 0.864
Threshold >= 0.2: True Positive = 1497, False Positive = 298, Precision = 0.834, Recall = 0.974, F-Score = 0.899
Threshold >= 0.3: True Positive = 1490, False Positive = 256, Precision = 0.853, Recall = 0.969, F-Score = 0.908
Threshold >= 0.4: True Positive = 1456, False Positive = 97, Precision = 0.938, Recall = 0.947, F-Score = 0.942
Threshold >= 0.5: True Positive = 1450, False Positive = 88, Precision = 0.943, Recall = 0.943, F-Score = 0.943
Threshold >= 0.6: True Positive = 1448, False Positive = 85, Precision = 0.945, Recall = 0.942, F-Score = 0.943
Threshold >= 0.7: True Positive = 1387, False Positive = 40, Precision = 0.972, Recall = 0.902, F-Score = 0.936
Threshold >= 0.8: True Positive = 1379, False Positive = 38, Precision = 0.973, Recall = 0.897, F-Score = 0.934
Threshold >= 0.9: True Positive = 1371, False Positive = 38, Precision = 0.973, Recall = 0.892, F-Score = 0.931
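
The columns above follow the usual definitions: Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F-Score = 2 · Precision · Recall / (Precision + Recall). For example, the test row at threshold 0.5 gives 1450 / (1450 + 88) ≈ 0.943 for precision. The sketch below computes these values for a given threshold; the `Metrics` struct and `evaluate` function are hypothetical names and not the repository's evaluation code. The inputs are the six predictions and labels from the usage example.

```rust
/// Hypothetical container for the per-threshold metrics.
#[derive(Debug)]
struct Metrics {
    true_positive: usize,
    false_positive: usize,
    precision: f32,
    recall: f32,
    f_score: f32,
}

/// Returns num / den, or 0.0 when the denominator is zero.
fn ratio(num: f32, den: f32) -> f32 {
    if den == 0.0 { 0.0 } else { num / den }
}

/// Counts a prediction as positive when it is >= `threshold`;
/// a label counts as spam when it is >= 0.5.
fn evaluate(predictions: &[f32], labels: &[f32], threshold: f32) -> Metrics {
    let (mut tp, mut fp, mut fn_) = (0usize, 0usize, 0usize);
    for (&p, &l) in predictions.iter().zip(labels) {
        match (p >= threshold, l >= 0.5) {
            (true, true) => tp += 1,
            (true, false) => fp += 1,
            (false, true) => fn_ += 1,
            (false, false) => {} // true negatives are not needed here
        }
    }
    let precision = ratio(tp as f32, (tp + fp) as f32);
    let recall = ratio(tp as f32, (tp + fn_) as f32);
    let f_score = ratio(2.0 * precision * recall, precision + recall);
    Metrics { true_positive: tp, false_positive: fp, precision, recall, f_score }
}

fn main() {
    // Predictions and labels from the usage example in this README.
    let predictions = [1.0_f32, 0.0, 0.6536912, 0.0, 0.99759775, 0.17467633];
    let labels = [1.0_f32, 0.0, 1.0, 0.0, 1.0, 0.0];
    for threshold in [0.1_f32, 0.5, 0.7] {
        println!("Threshold >= {}: {:?}", threshold, evaluate(&predictions, &labels, threshold));
    }
}
```

On these six samples, a threshold of 0.1 admits one false positive (0.17467633), while 0.7 misses one true positive (0.6536912). This is the same trade-off the tables show at scale.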

Evaluation

Training Data

Trained and tested with the following datasets:

  • enronSpamSubset.csv
  • lingSpam.csv
  • completeSpamAssassin.csv
  • youtubeSpamCollection.csv
  • smsspamcollection.csv
  • governance_proposal_spam_likelihood.csv
total: 27982
---------------
count spam: 9012
count ham: 18970
---------------

llama.cpp could not generate embeddings for every entry. Embeddings were successfully generated for the following entries:

Number of Spam entries: 7237
Number of Ham entries: 13706
Total entries: 23310

Full dataset breakdown
enronSpamSubset.csv
---------------
count spam: 5000
count ham: 5000

lingSpam.csv
---------------
count spam: 433
count ham: 2172

completeSpamAssassin.csv
---------------
count spam: 1560
count ham: 3952

youtubeSpamCollection.csv
---------------
count spam: 1005
count ham: 951
 
smsspamcollection.csv
---------------
count spam: 747
count ham: 4825

governance_proposal_spam_likelihood.csv
--------------- 
count spam: ?
count ham: ?

This project is part of CosmosRustBot, which provides Governance Proposal Notifications for Cosmos Blockchains. The goal is to automatically detect fraudulent and deceitful proposals and so prevent users from falling for crypto scams. The current model is very effective at detecting fake governance proposals.
