<a href="https://colab.research.google.com/github/Otsebolu/Gen_AI_projects/blob/main/financial_fraud_detection_llm_rag_flan_t5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fraud Detection using LLM and RAG
This project uses Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) to identify and flag potential fraud in financial data.

### Large Language Models (LLM):
LLMs are trained on vast amounts of textual data and can understand and generate human-like text. In fraud detection, LLMs can analyze financial statements, detect anomalies, and recognize patterns indicative of fraudulent behavior.

### Retrieval-Augmented Generation (RAG):
RAG combines the capabilities of LLMs with a retrieval mechanism to enhance the generation process. It retrieves relevant documents or pieces of information from a large corpus and uses them to provide more accurate and contextually relevant responses. In this context, RAG can pull relevant financial records, reports, and contextual data to assist in the detection and explanation of potential fraud.

### Application:

**Input:** Financial statements and related documents.

**Process:** The system uses RAG to retrieve pertinent information from a database and employs an LLM to analyze and interpret the data.

**Output:** A concise report indicating whether the financial statement exhibits fraudulent behavior, with an explanation based on the retrieved context.

This combination of an LLM and RAG is intended to improve the accuracy and reliability of fraud detection in financial filings, which makes it a potentially useful tool for auditors, regulators, and financial institutions.
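
The input, process, and output steps above can be sketched in a few lines of plain Python. This is a toy illustration only: word-overlap (Jaccard) scoring stands in for the embedding search, the three records are made-up examples, and `retrieve` and `build_prompt` are hypothetical helper names rather than part of any library.

```python
# Toy retrieve-then-prompt flow. Jaccard word overlap is a stand-in
# (assumption) for the embedding similarity used by a real vector store.

# Made-up example records: (text, label). Labels here are illustrative only.
CORPUS = [
    ("revenue recognized prematurely actual sales occurred", "fraud"),
    ("financial records accurately reflect expenses liabilities", "non-fraud"),
    ("company inflated value assets", "fraud"),
]


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def retrieve(query: str, k: int = 1):
    """Return the k corpus records most similar to the query."""
    ranked = sorted(CORPUS, key=lambda rec: jaccard(query, rec[0]), reverse=True)
    return ranked[:k]


def build_prompt(query: str, records) -> str:
    """Combine the query and the retrieved records into one LLM prompt."""
    context = "\n".join(f"{text} (label: {label})" for text, label in records)
    return f"Question: {query}\nContext: {context}\nAnswer:"


query = "Revenue was recognized prematurely before the actual sales occurred"
top = retrieve(query, k=1)
print(build_prompt(query, top))
```

In the notebook itself, the retrieval step is handled by a Chroma vector store and the prompt is passed to a FLAN-T5 model; the sketch only mirrors the shape of that flow.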







In [4]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os



In [5]:
import pandas as pd
df1 = pd.read_csv('/content/fraud_nonfraud.csv', encoding='latin-1')
#df1.head()
print(df1)


      fraud_status                                               text
0                1  Life and disability insurers sometimes set hig...
1                1  What you need to know for Friday and the weeke...
2                1  Laws that restrict where convicted sex offende...
3                1  For years, day jobs helped Carolyn Coleman pur...
4                1  Banks that charge customers to use debit cards...
...            ...                                                ...
5035             0  The financial statement shows fabricated sales...
5036             0  There was intentional misstatement of cash flo...
5037             0  The company inflated the value of its assets t...
5038             0  Revenue from future periods was reported in th...
5039             0  The company engaged in channel stuffing to inf...

[5040 rows x 2 columns]


In [6]:
import random

# Convert the DataFrame to a list of rows and shuffle a copy,
# leaving the original order in df1 untouched
df1_list = df1.values.tolist()
new_list = df1_list.copy()
random.shuffle(new_list)

# Create a new DataFrame from the shuffled list
df = pd.DataFrame(new_list, columns=df1.columns)
df.head(10)




Unnamed: 0,fraud_status,text
0,0,Dozens of top recipients of government aid hav...
1,1,How much confidentiality are the members of a ...
2,1,You don’t need to change everything about yo...
3,1,"In “Race Against Time,” the Mississippi jo..."
4,1,John Smoltz is recovering from shoulder surger...
5,0,The New York City Campaign Finance Board yeste...
6,1,The Reisinger Knockout is the premier team eve...
7,0,"When Sarbanes-Oxley was passed in 2002, it was..."
8,1,And when will we be able to say something like...
9,1,"After a two-week verbal standoff, Indonesia's ..."
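
The shuffle above produces a different row order on every run. Below is a minimal sketch of a reproducible variant, assuming a fixed seed is acceptable. The rows and the seed value 42 are placeholders, not taken from the dataset, and `shuffled_copy` is a hypothetical helper.

```python
import random

# Placeholder rows in the same [fraud_status, text] layout as the DataFrame.
rows = [[1, "text a"], [0, "text b"], [1, "text c"], [0, "text d"]]


def shuffled_copy(items, seed):
    """Return a shuffled copy of items; the same seed gives the same order."""
    rng = random.Random(seed)  # private generator, global state untouched
    copy = list(items)
    rng.shuffle(copy)
    return copy


first = shuffled_copy(rows, seed=42)
second = shuffled_copy(rows, seed=42)
print(first == second)                # same seed, same order
print(sorted(first) == sorted(rows))  # same rows, only the order may differ
```

With pandas, `df1.sample(frac=1, random_state=42)` shuffles the rows reproducibly in one call.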


In [7]:
!pip install -q langchain sentence-transformers faiss-cpu langchain-community langchain-core transformers chromadb

In [8]:
%pip install --upgrade --quiet  langchain sentence_transformers

In [9]:
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

# Ensure NLTK resources are downloaded
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Function to clean text
def clean_text(text):
    # Remove non-ASCII characters
    text = text.encode('ascii', 'ignore').decode()

    # Remove punctuation and numbers
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)

    # Convert to lowercase
    text = text.lower()

    # Tokenize text
    tokens = word_tokenize(text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # Join tokens back into text
    cleaned_text = ' '.join(tokens)

    return cleaned_text

# Clean the 'text' column
df['Clean_Text'] = df['text'].apply(clean_text)

# Drop the original 'text' column now that the cleaned version exists
df.drop(columns=['text'], inplace=True)

# Save cleaned data back to CSV if desired
#df.to_csv('cleaned_financial_statements.csv', index=False)

# Show what the cleaned data looks like
print(df.head())

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


   fraud_status                                         Clean_Text
0             0  dozens top recipients government aid laid furl...
1             1   much confidentiality members coop board entitled
2             1  dont need change everything job see major bene...
3             1  race time mississippi journalist jerry mitchel...
4             1  john smoltz recovering shoulder surgery hopes ...


In [10]:
!pip install -U langchain-community



In [12]:
from langchain.vectorstores import Chroma
from langchain.docstore.document import Document

documents = []

# Iterate over rows with DataFrame.iterrows(), which yields (index, row) pairs
for i, row in df.iterrows():
    # "\\" keeps the literal backslash separator without an invalid-escape warning
    document = f"id:{i}\\Fillings: {row['Clean_Text']}\\Fraud_Status: {row['fraud_status']}"
    documents.append(Document(page_content=document))


In [13]:
documents[0]

Document(page_content='id:0\\Fillings: dozens top recipients government aid laid furloughed cut pay tens thousands employees\\Fraud_Status: 0')

In [14]:
documents[1]

Document(page_content='id:1\\Fillings: much confidentiality members coop board entitled\\Fraud_Status: 1')

In [15]:
documents[3]

Document(page_content='id:3\\Fillings: race time mississippi journalist jerry mitchell chronicles four key cases racist violence role unearthing damning new evidence\\Fraud_Status: 1')
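
Because each document packs the id, text, and label into one backslash-separated string, downstream code has to split it apart to recover the label. Below is a small sketch of such a parser, assuming the exact `id:...\Fillings: ...\Fraud_Status: ...` layout shown above; `parse_document` is a hypothetical helper.

```python
def parse_document(page_content: str) -> dict:
    """Split 'id:<n>\\Fillings: <text>\\Fraud_Status: <label>' into fields."""
    id_part, text_part, status_part = page_content.split("\\")
    return {
        "id": int(id_part.split(":", 1)[1]),
        "text": text_part.split(":", 1)[1].strip(),
        "fraud_status": int(status_part.split(":", 1)[1]),
    }


sample = "id:1\\Fillings: much confidentiality members coop board entitled\\Fraud_Status: 1"
record = parse_document(sample)
print(record["id"], record["fraud_status"], record["text"])
```

Storing the label in the `metadata` argument of `Document` instead would avoid the parsing step and keep the label out of the embedded text.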

In [16]:
from langchain_community.embeddings import HuggingFaceEmbeddings
hg_embeddings = HuggingFaceEmbeddings()


In [21]:
!pip install --upgrade chromadb



In [23]:
from langchain.vectorstores import Chroma
persist_directory = 'docs/chroma_rag/'
langchain_chroma = Chroma.from_documents(
    documents=documents,
    collection_name="finance_data_new",
    embedding=hg_embeddings,
    persist_directory=persist_directory
)
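
Conceptually, the vector store embeds each document and, at query time, returns the documents whose vectors are closest to the query vector. The sketch below illustrates that idea with cosine similarity on hand-written 3-dimensional vectors. The vectors, ids, and the `top_k` helper are invented for illustration, and Chroma's actual index and distance metric may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=1):
    """Return the ids of the k stored vectors most similar to query_vec."""
    ranked = sorted(store, key=lambda doc_id: cosine(query_vec, store[doc_id]), reverse=True)
    return ranked[:k]


# Invented toy "embeddings" keyed by document id.
store = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.7, 0.7, 0.0],
}

print(top_k([0.9, 0.1, 0.0], store, k=2))
```

With `k=1`, as in the retriever used later, only the single nearest document is returned, so the answer depends entirely on that one neighbor.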

In [40]:
from huggingface_hub import notebook_login
notebook_login(write_permission=True)


In [25]:
from torch import cuda, bfloat16
import torch
import transformers
from transformers import AutoTokenizer
from time import time
#import chromadb
#from chromadb.config import Settings
from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma

In [26]:
!pip install bitsandbytes


Collecting bitsandbytes
  Downloading bitsandbytes-0.43.3-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Downloading bitsandbytes-0.43.3-py3-none-manylinux_2_24_x86_64.whl (137.5 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.43.3


In [27]:
model_id = 'google/flan-t5-large'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
#bnb_config = transformers.BitsAndBytesConfig(
#    load_in_4bit=True,
#    bnb_4bit_quant_type='nf4',
#    bnb_4bit_use_double_quant=True,
#    bnb_4bit_compute_dtype=bfloat16
#)

print(device)

cpu


In [28]:
!pip install accelerate
!pip install -i https://pypi.org/simple/ bitsandbytes

Looking in indexes: https://pypi.org/simple/


In [30]:
import os
import numpy as np


from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# FLAN-T5 is an encoder-decoder model, so it is loaded with the Seq2SeqLM class
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")


#os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

#model_config = transformers.AutoConfig.from_pretrained(
#   model_id,
 #   trust_remote_code=True,
  #  max_new_tokens=1024
#)



In [51]:
# Initialize the query pipeline with increased max_length
query_pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    max_length=6000,  # Increase max_length
    max_new_tokens=500,  # Control the number of new tokens generated
    device_map="auto",
)

The model 'T5ForConditionalGeneration' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CohereForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'DbrxForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FuyuForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'JambaForCausalLM', 'JetMoeForCausalLM', 'LlamaForCausalLM', 'MambaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'MptForCau
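
The warning above arises because FLAN-T5 is an encoder-decoder (sequence-to-sequence) model, while the `text-generation` task expects a decoder-only causal model. One way to run such a model is to call `generate` directly, sketched below under the assumption that `transformers` and `torch` are installed and the checkpoint can be downloaded. `answer_with_flan_t5` is a hypothetical helper and the token limit is an arbitrary choice.

```python
DEFAULT_MODEL = "google/flan-t5-small"  # same checkpoint the notebook loads


def answer_with_flan_t5(prompt: str, model_name: str = DEFAULT_MODEL,
                        max_new_tokens: int = 50) -> str:
    """Generate a short answer with a seq2seq model via model.generate."""
    # Imported lazily so that defining this function does not need the libraries.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (downloads the model on first use):
# print(answer_with_flan_t5("Is this statement fraudulent? Revenue was inflated."))
```

In the transformers 4.x releases that were current for this notebook, the corresponding pipeline task for encoder-decoder models was named `text2text-generation`.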

In [52]:
from IPython.display import display, Markdown
def colorize_text(text):
    for word, color in zip(["Reasoning", "Question", "Answer", "Total time"], ["blue", "red", "green", "magenta"]):
        text = text.replace(f"{word}:", f"\n\n**<font color='{color}'>{word}:</font>**")
    return text

In [69]:
llm = HuggingFacePipeline(pipeline=query_pipeline)

#question = "Please explain what EU AI Act is."
#response = llm(prompt=question)

# `response` is only defined when the two lines above are uncommented
#full_response = f"Question: {question}\nAnswer: {response}"
#display(Markdown(colorize_text(full_response)))

In [55]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_community.llms import HuggingFaceHub
from IPython.display import display, Markdown
import os
import warnings
warnings.filterwarnings('ignore')

# llm=HuggingFaceHub(repo_id="google/flan-t5-large")
# Define the prompt template
template = """
You are an Fraud Detection Expert in Financial Text Data, Analyse them and Predict is the Given Statement is Fraud or not?. If you don't know the answer, just say "Sorry, I Don't Know."
Question: {question}
Context: {context}
Answer:
"""
PROMPT = PromptTemplate(input_variables=["context", "question"], template=template)

# Ensure llm and langchain_chroma are properly initialized
retriever = langchain_chroma.as_retriever(search_kwargs={"k": 1})

qa_chain = RetrievalQA.from_chain_type(
    llm, retriever=retriever, chain_type_kwargs={"prompt": PROMPT}
)

# Define your question
# question = "The company reported inflated revenues by including sales that never occurred."
question = "Financial records accurately reflect all expenses and liabilities."
# question = "Revenue was recognized prematurely before the actual sales occurred."
# question = "The balance sheet provides a true and fair view of the company’s financial position."

# Run the QA chain
try:
    result = qa_chain({"query": question})
    display(result)
except RuntimeError as e:
    print(f"RuntimeError encountered: {e}")

Both `max_new_tokens` (=500) and `max_length`(=6000) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


{'query': 'Financial records accurately reflect all expenses and liabilities.',
 'result': '\nYou are an Fraud Detection Expert in Financial Text Data, Analyse them and Predict is the Given Statement is Fraud or not?. If you don\'t know the answer, just say "Sorry, I Don\'t Know."\nQuestion: Financial records accurately reflect all expenses and liabilities.\nContext: id:1250\\Fillings: financial records accurately reflect expenses liabilities\\Fraud_Status: 1\nAnswer:\n'}
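
A detail worth checking in a prompt like this is that the placeholders in the template match the declared input variables. The sketch below extracts placeholder names with the standard library's `string.Formatter`. `placeholders` is a hypothetical helper, the template is a shortened copy of the one used here, and `declared` is an example list chosen to show a mismatch.

```python
from string import Formatter


def placeholders(template: str) -> set:
    """Return the set of {name} fields used in a format-style template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


template = "Question: {question}\nContext: {context}\nAnswer:\n"
declared = ["context", "query"]  # example declaration with one wrong name

used = placeholders(template)
print(sorted(used))
print(sorted(used - set(declared)))  # used in the template but not declared
print(sorted(set(declared) - used))  # declared but never used
```

A check like this can run before the chain is built, so a naming mismatch surfaces early rather than at query time.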

In [59]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_community.llms import HuggingFaceHub
from IPython.display import display, Markdown
import os
import warnings
warnings.filterwarnings('ignore')

# llm=HuggingFaceHub(repo_id="google/flan-t5-large")
# Define the prompt template
template = """
You are an Fraud Detection Expert in Financial Text Data, Analyse them and Predict is the Given Statement is Fraud or not?. If you don't know the answer, just say "Sorry, I Don't Know."
Question: {question}
Context: {context}
Answer:
"""
PROMPT = PromptTemplate(input_variables=["context", "question"], template=template)

# Ensure llm and langchain_chroma are properly initialized
retriever = langchain_chroma.as_retriever(search_kwargs={"k": 1})

qa_chain = RetrievalQA.from_chain_type(
    llm, retriever=retriever, chain_type_kwargs={"prompt": PROMPT}
)

# Define your question
#question = "The company reported inflated revenues by including sales that never occurred."
#question = "Financial records accurately reflect all expenses and liabilities."
question = "Revenue was recognized prematurely before the actual sales occurred."
# question = "The balance sheet provides a true and fair view of the company’s financial position."

# Run the QA chain
try:
    result = qa_chain({"query": question})
    display(result)
except RuntimeError as e:
    print(f"RuntimeError encountered: {e}")



{'query': 'Revenue was recognized prematurely before the actual sales occurred.',
 'result': '\nYou are an Fraud Detection Expert in Financial Text Data, Analyse them and Predict is the Given Statement is Fraud or not?. If you don\'t know the answer, just say "Sorry, I Don\'t Know."\nQuestion: Revenue was recognized prematurely before the actual sales occurred.\nContext: id:3401\\Fillings: revenue recognized prematurely actual sales occurred\\Fraud_Status: 0\nAnswer:\n'}

In [60]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_community.llms import HuggingFaceHub
from IPython.display import display, Markdown
import os
import warnings
warnings.filterwarnings('ignore')

# llm=HuggingFaceHub(repo_id="google/flan-t5-large")
# Define the prompt template
template = """
You are an Fraud Detection Expert in Financial Text Data, Analyse them and Predict is the Given Statement is Fraud or not?. If you don't know the answer, just say "Sorry, I Don't Know."
Question: {question}
Context: {context}
Answer:
"""
PROMPT = PromptTemplate(input_variables=["context", "question"], template=template)

# Ensure llm and langchain_chroma are properly initialized
retriever = langchain_chroma.as_retriever(search_kwargs={"k": 1})

qa_chain = RetrievalQA.from_chain_type(
    llm, retriever=retriever, chain_type_kwargs={"prompt": PROMPT}
)

# Define your question
#question = "The company reported inflated revenues by including sales that never occurred."
#question = "Financial records accurately reflect all expenses and liabilities."
#question = "Revenue was recognized prematurely before the actual sales occurred."
question = "The balance sheet provides a true and fair view of the company’s financial position."

# Run the QA chain
try:
    result = qa_chain({"query": question})
    display(result)
except RuntimeError as e:
    print(f"RuntimeError encountered: {e}")



{'query': 'The balance sheet provides a true and fair view of the company’s financial position.',
 'result': '\nYou are an Fraud Detection Expert in Financial Text Data, Analyse them and Predict is the Given Statement is Fraud or not?. If you don\'t know the answer, just say "Sorry, I Don\'t Know."\nQuestion: The balance sheet provides a true and fair view of the company’s financial position.\nContext: id:4945\\Fillings: balance sheet provides true fair view companys financial position\\Fraud_Status: 1\nAnswer:\n'}

In [66]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_community.llms import HuggingFaceHub
from IPython.display import display, Markdown
import os
import warnings
warnings.filterwarnings('ignore')

# llm=HuggingFaceHub(repo_id="google/flan-t5-large")
# Define the prompt template
template = """
You are an Fraud Detection Expert in Financial Text Data, Analyse them and Predict is the Given Statement is Fraud or not?. If you don't know the answer, just say "Sorry, I Don't Know."
Question: {question}
Context: {context}
Answer:
"""
PROMPT = PromptTemplate(input_variables=["context", "question"], template=template)

# Ensure llm and langchain_chroma are properly initialized
retriever = langchain_chroma.as_retriever(search_kwargs={"k": 1})

qa_chain = RetrievalQA.from_chain_type(
    llm, retriever=retriever, chain_type_kwargs={"prompt": PROMPT}
)

# Define your question
question = "I lost my credit card to a buglar who came to my house last night"
#question = "My crypto account was hacked"
#question = "Financial records accurately reflect all expenses and liabilities."
#question = "Revenue was recognized prematurely before the actual sales occurred."
# question = "The balance sheet provides a true and fair view of the company’s financial position."

# Run the QA chain
try:
    result = qa_chain({"query": question})
    display(result)
except RuntimeError as e:
    print(f"RuntimeError encountered: {e}")



{'query': 'I lost my credit card to a buglar who came to my house last night',
 'result': '\nYou are an Fraud Detection Expert in Financial Text Data, Analyse them and Predict is the Given Statement is Fraud or not?. If you don\'t know the answer, just say "Sorry, I Don\'t Know."\nQuestion: I lost my credit card to a buglar who came to my house last night\nContext: id:791\\Fillings: first american bank chicago got publicity little satisfaction identified source debit card theft tried stop\\Fraud_Status: 0\nAnswer:\n'}

In [71]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_community.llms import HuggingFaceHub
from IPython.display import display, Markdown
import os
import warnings
warnings.filterwarnings('ignore')

# llm=HuggingFaceHub(repo_id="google/flan-t5-large")
# Define the prompt template
template = """
You are an Fraud Detection Expert in Financial Text Data, Analyse them and Predict is the Given Statement is Fraud or not?. If you don't know the answer, just say "Sorry, I Don't Know."
Question: {question}
Context: {context}
Answer:
"""
PROMPT = PromptTemplate(input_variables=["context", "question"], template=template)

# Ensure llm and langchain_chroma are properly initialized
retriever = langchain_chroma.as_retriever(search_kwargs={"k": 1})

qa_chain = RetrievalQA.from_chain_type(
    llm, retriever=retriever, chain_type_kwargs={"prompt": PROMPT}
)

# Define your question
question = "My Jones was picked up by the cop for his involvement in the missing car"
#question = "My crypto account was hacked"
#question = "Financial records accurately reflect all expenses and liabilities."
#question = "Revenue was recognized prematurely before the actual sales occurred."
# question = "The balance sheet provides a true and fair view of the company’s financial position."

# Run the QA chain
try:
    result = qa_chain({"query": question})
    display(result)
except RuntimeError as e:
    print(f"RuntimeError encountered: {e}")



{'query': 'My Jones was picked up by the cop for his involvement in the missing car',
 'result': '\nYou are an Fraud Detection Expert in Financial Text Data, Analyse them and Predict is the Given Statement is Fraud or not?. If you don\'t know the answer, just say "Sorry, I Don\'t Know."\nQuestion: My Jones was picked up by the cop for his involvement in the missing car\nContext: id:2624\\Fillings: maurice knight queens charged impersonating fire lieutenant stealing eight people\\Fraud_Status: 0\nAnswer:\n'}

In [72]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_community.llms import HuggingFaceHub
from IPython.display import display, Markdown
import os
import warnings
warnings.filterwarnings('ignore')

# llm=HuggingFaceHub(repo_id="google/flan-t5-large")
# Define the prompt template
template = """
You are an Fraud Detection Expert in Financial Text Data, Analyse them and Predict is the Given Statement is Fraud or not?. If you don't know the answer, just say "Sorry, I Don't Know."
Question: {question}
Context: {context}
Answer:
"""
PROMPT = PromptTemplate(input_variables=["context", "question"], template=template)

# Ensure llm and langchain_chroma are properly initialized
retriever = langchain_chroma.as_retriever(search_kwargs={"k": 1})

qa_chain = RetrievalQA.from_chain_type(
    llm, retriever=retriever, chain_type_kwargs={"prompt": PROMPT}
)

# Define your question
question = "Mrs. Lagbaja was detained in order to interrogate her of the missing jewelry over Mrs. Adetoun"
#question = "My crypto account was hacked"
#question = "Financial records accurately reflect all expenses and liabilities."
#question = "Revenue was recognized prematurely before the actual sales occurred."
# question = "The balance sheet provides a true and fair view of the company’s financial position."

# Run the QA chain
try:
    result = qa_chain({"query": question})
    display(result)
except RuntimeError as e:
    print(f"RuntimeError encountered: {e}")



{'query': 'Mrs. Lagbaja was detained in order to interrogate her of the missing jewelry over Mrs. Adetoun',
 'result': '\nYou are an Fraud Detection Expert in Financial Text Data, Analyse them and Predict is the Given Statement is Fraud or not?. If you don\'t know the answer, just say "Sorry, I Don\'t Know."\nQuestion: Mrs. Lagbaja was detained in order to interrogate her of the missing jewelry over Mrs. Adetoun\nContext: id:4414\\Fillings: people swindled member nigerian royalty associate promised fake jobs prosecutors said\\Fraud_Status: 0\nAnswer:\n'}

**In this project, RAG pairs an LLM with a retrieval mechanism: for each input statement, it retrieves the most similar labeled record from the vector store and supplies it as context, so that the model can judge whether the statement is fraudulent.**