## **Financial Fraud Detection**  
**Problem Statement:** Build a financial fraud detection project that classifies financial statements or news text as fraud or non-fraud, using synthetically generated data.

## **Project Methodology**

- This project uses synthetic data generated with Python.
- The data is saved to a CSV file and pre-processed with NLTK.
- The pre-processed statements are embedded with a Hugging Face embedding model and inserted into a vector database (Chroma).
- A RAG question-answering chain is built with LangChain, using the open-source Zephyr 7B LLM.
- The chain's response is checked to see whether it labels the statement as fraud or non-fraud.

In [None]:
import pandas as pd
import random

# Define sample data for fraud and non-fraud financial statements
fraud_statements = [
    "The company reported inflated revenues by including sales that never occurred.",
    "Financial records were manipulated to hide the true state of expenses.",
    "The company failed to report significant liabilities on its balance sheet.",
    "Revenue was recognized prematurely before the actual sales occurred.",
    "The financial statement shows significant discrepancies in inventory records.",
    "The company used off-balance-sheet entities to hide debt.",
    "Expenses were understated by capitalizing them as assets.",
    "There were unauthorized transactions recorded in the financial books.",
    "Significant amounts of revenue were recognized without proper documentation.",
    "The company falsified financial documents to secure a larger loan.",
    "There were multiple instances of duplicate payments recorded as expenses.",
    "The company reported non-existent assets to enhance its financial position.",
    "Expenses were fraudulently categorized as business development costs.",
    "The company manipulated financial ratios to meet loan covenants.",
    "Significant related-party transactions were not disclosed.",
    "The financial statement shows fabricated sales transactions.",
    "There was intentional misstatement of cash flow records.",
    "The company inflated the value of its assets to attract investors.",
    "Revenue from future periods was reported in the current period.",
    "The company engaged in channel stuffing to inflate sales figures."
]
non_fraud_statements = [
    "The company reported stable revenues consistent with historical trends.",
    "Financial records accurately reflect all expenses and liabilities.",
    "The balance sheet provides a true and fair view of the company’s financial position.",
    "Revenue was recognized in accordance with standard accounting practices.",
    "The inventory records are accurate and match physical counts.",
    "The company’s debt is fully disclosed on the balance sheet.",
    "All expenses are properly categorized and recorded.",
    "Transactions recorded in the financial books are authorized and documented.",
    "Revenue recognition is supported by proper documentation.",
    "Financial documents were audited and found to be accurate.",
    "Payments and expenses are recorded accurately without discrepancies.",
    "The assets reported on the balance sheet are verified and exist.",
    "Business development costs are properly recorded as expenses.",
    "Financial ratios are calculated based on accurate data.",
    "All related-party transactions are fully disclosed.",
    "Sales transactions are accurately recorded in the financial statement.",
    "Cash flow records are accurate and reflect actual cash movements.",
    "The value of assets is fairly reported in the financial statements.",
    "Revenue is reported in the correct accounting periods.",
    "Sales figures are accurately reported without manipulation."
]

# Generate fraud and non-fraud records

fraud_data = [{'statement': statement, 'fraud_status': 'fraud'} for statement in fraud_statements]
non_fraud_data = [{'statement': statement, 'fraud_status': 'non-fraud'} for statement in non_fraud_statements]

# Combine both classes; the shuffle below interleaves the labels
data = fraud_data + non_fraud_data
random.shuffle(data)

# Create a DataFrame from the generated data
df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df.to_csv('financial_statements.csv', index=False)
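
The cell above shuffles without a fixed seed, so the row order changes on every run. A minimal sketch of a reproducible variant follows, assuming a seeded shuffle is acceptable here; the helper name `build_dataset` and the seed value are illustrative and not part of the original notebook.

```python
import random
from collections import Counter

def build_dataset(fraud, non_fraud, seed=42):
    """Label both lists, then shuffle with a private, seeded RNG."""
    rows = [{"statement": s, "fraud_status": "fraud"} for s in fraud]
    rows += [{"statement": s, "fraud_status": "non-fraud"} for s in non_fraud]
    rng = random.Random(seed)  # private RNG, leaves the global random state alone
    rng.shuffle(rows)
    return rows

# Tiny stand-in lists; the notebook would pass fraud_statements / non_fraud_statements
rows = build_dataset(["inflated revenue", "hidden debt"], ["audited books", "stable revenue"])
label_counts = Counter(row["fraud_status"] for row in rows)
print(label_counts)
```

Counting the labels after the shuffle is a quick way to confirm that both classes are equally represented before the data reaches the vector store.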


In [None]:
df.head()

| | statement | fraud_status |
|---|---|---|
| 0 | Expenses were understated by capitalizing them... | fraud |
| 1 | There were unauthorized transactions recorded ... | fraud |
| 2 | Sales transactions are accurately recorded in ... | non-fraud |
| 3 | The financial statement shows significant disc... | fraud |
| 4 | The balance sheet provides a true and fair vie... | non-fraud |


## **Data Preprocessing**

In [None]:
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk


# Ensure NLTK resources are downloaded
nltk.download('punkt')
nltk.download('stopwords')

# Build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def clean_text(text):
    """Normalize a statement: ASCII only, lower case, no punctuation, digits, or stopwords."""
    # Remove non-ASCII characters
    text = text.encode('ascii', 'ignore').decode()
    # Convert to lower case
    text = text.lower()

    # Remove punctuation and numbers
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)

    # Tokenize text
    tokens = word_tokenize(text)

    # Remove stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]

    # Join tokens back into text
    return ' '.join(tokens)



df['Clean_text'] = df['statement'].apply(clean_text)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
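
To make the effect of the cleaning step concrete, here is a dependency-free sketch of the same sequence of operations. It is an approximation: `STOP_WORDS_SKETCH` is a tiny hypothetical stand-in for NLTK's English stopword list, and `str.split` replaces `word_tokenize`.

```python
import re

# Hypothetical, abbreviated stopword list (assumption; NLTK's list is much longer)
STOP_WORDS_SKETCH = {"the", "is", "on", "of", "a", "and", "by", "was", "were", "are", "in", "to"}

def clean_text_sketch(text: str) -> str:
    text = text.encode("ascii", "ignore").decode()  # drops characters such as the curly apostrophe
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)             # remove punctuation
    text = re.sub(r"\d+", "", text)                 # remove digits
    tokens = text.split()                           # whitespace split instead of word_tokenize
    return " ".join(t for t in tokens if t not in STOP_WORDS_SKETCH)

sample = "The company\u2019s debt is fully disclosed on the balance sheet."
cleaned = clean_text_sketch(sample)
print(cleaned)  # companys debt fully disclosed balance sheet
```

Note that the ASCII filter removes the curly apostrophe entirely, which is why "company’s" shows up as "companys" in the cleaned column.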


In [None]:
!pip install langchain langchain-community

In [None]:
from langchain.vectorstores import Chroma
from langchain.docstore.document import Document

documents = []

In [None]:
df.head()

| | statement | fraud_status | Clean_text |
|---|---|---|---|
| 0 | Expenses were understated by capitalizing them... | fraud | expenses understated capitalizing assets |
| 1 | There were unauthorized transactions recorded ... | fraud | unauthorized transactions recorded financial b... |
| 2 | Sales transactions are accurately recorded in ... | non-fraud | sales transactions accurately recorded financi... |
| 3 | The financial statement shows significant disc... | fraud | financial statement shows significant discrepa... |
| 4 | The balance sheet provides a true and fair vie... | non-fraud | balance sheet provides true fair view companys... |


In [None]:
df.drop('statement', axis=1, inplace=True)


In [None]:
df.head()

| | fraud_status | Clean_text |
|---|---|---|
| 0 | fraud | expenses understated capitalizing assets |
| 1 | fraud | unauthorized transactions recorded financial b... |
| 2 | non-fraud | sales transactions accurately recorded financi... |
| 3 | fraud | financial statement shows significant discrepa... |
| 4 | non-fraud | balance sheet provides true fair view companys... |


In [None]:
for i, row in df.iterrows():
    # Access columns by name; positional access such as row[1] is deprecated in recent pandas.
    # The backslashes are escaped so the separator is a literal "\" without an invalid-escape warning.
    document = f"id : {i} \\ Fillings : {row['Clean_text']}\\ Fraud_Status : {row['fraud_status']}"
    documents.append(Document(page_content=document))


In [None]:
documents[0]

'id : 0 \\ Fillings : expenses understated capitalizing assets\\ Fraud_Status : fraud'
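
Each record is flattened into a single backslash-separated string before it is embedded. The sketch below shows that format and one way to read the fields back out of a retrieved document; `format_record` and `parse_record` are hypothetical helper names, not functions from the notebook or from LangChain.

```python
def format_record(doc_id: int, filing: str, status: str) -> str:
    """Build the same string layout the notebook stores as page_content."""
    return f"id : {doc_id} \\ Fillings : {filing}\\ Fraud_Status : {status}"

def parse_record(page_content: str) -> dict:
    """Split on the backslash separator, then on ' : ' to recover key/value pairs."""
    fields = {}
    for part in page_content.split("\\"):
        key, _, value = part.strip().partition(" : ")
        fields[key.strip()] = value.strip()
    return fields

record = format_record(0, "expenses understated capitalizing assets", "fraud")
parsed = parse_record(record)
print(parsed["Fraud_Status"])  # fraud
```

Because the label travels inside the text, whatever the retriever returns already carries its `Fraud_Status`, which is what the LLM later reads from the context.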

In [None]:
!huggingface-cli login




    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) y
Token is valid (permission: fineGrained).
[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in yo

In [None]:
!pip install sentence-transformers

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings
hg_embeddings = HuggingFaceEmbeddings()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
!pip install chromadb

In [None]:
!pip install langchain

In [None]:
from langchain.vectorstores import Chroma
from langchain.docstore.document import Document

persist_directory = 'docs'

# Make sure every item is a Document: wrap plain strings, leave existing Documents untouched.
# Re-wrapping a Document with str() would store its repr instead of its text.
documents = [
    doc if isinstance(doc, Document) else Document(page_content=str(doc))
    for doc in documents
]

langchain_chroma = Chroma.from_documents(
    documents=documents,
    embedding=hg_embeddings,
    persist_directory=persist_directory ,
)

In [None]:
# Only RetrievalQA is new here; Chroma and the embeddings were imported in earlier cells
from langchain.chains import RetrievalQA

In [None]:
template = """
You are an Fraud Detection Expert in Financial Text Data, Analyse them and Predict is the Given Statement is Fraud or not?. If you don't know the answer, just say "Sorry, I Don't Know."
Question: {question}
Context: {context}
Answer:
"""

from langchain.prompts import PromptTemplate
prompt = PromptTemplate(
    template=template,
    input_variables=['question', 'context']
)

In [None]:
retriever = langchain_chroma.as_retriever(search_kwargs={"k": 1})
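
With `k` set to 1, the retriever hands the LLM only the single stored document closest to the query. The following sketch illustrates that nearest-neighbour lookup with cosine similarity; it uses a bag-of-words vector as a simple stand-in for the Hugging Face embedding model, so the scores are illustrative only and the function names are hypothetical.

```python
import math
from collections import Counter

def bow(text: str) -> Counter:
    """Bag-of-words vector (a stand-in for a learned embedding)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query: str, corpus: list, k: int = 1) -> list:
    """Return the k corpus entries most similar to the query."""
    q = bow(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, bow(doc)), reverse=True)
    return ranked[:k]

corpus = [
    "company reported inflated revenues including sales never occurred",
    "financial records accurately reflect expenses liabilities",
    "revenue recognized prematurely actual sales occurred",
]
top = retrieve("inflated revenues sales never occurred", corpus, k=1)
print(top[0])
```

A larger `k` would give the model several neighbours to weigh against each other, at the cost of a longer prompt.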

In [None]:
from langchain.llms import HuggingFaceHub

In [49]:
!huggingface-cli login




    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) y
Token is valid (permission: fineGr

In [52]:
llms = HuggingFaceHub(
    repo_id="HuggingFaceH4/zephyr-7b-beta",
    model_kwargs={"temperature": 0.5, "max_length": 512},
    huggingfacehub_api_token="*****"  # Add your token here
)
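
Rather than pasting the token into the cell, it can be read from the environment. This is a minimal sketch, assuming the token is exported as `HUGGINGFACEHUB_API_TOKEN` (the variable name conventionally used by LangChain's Hugging Face integrations); `get_hf_token` is a hypothetical helper.

```python
import os

def get_hf_token(env_var: str = "HUGGINGFACEHUB_API_TOKEN") -> str:
    """Read the Hugging Face token from the environment and fail loudly if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before building the LLM.")
    return token

# Possible usage in the cell above (not executed here):
# llms = HuggingFaceHub(repo_id="HuggingFaceH4/zephyr-7b-beta",
#                       model_kwargs={"temperature": 0.5, "max_length": 512},
#                       huggingfacehub_api_token=get_hf_token())
```

Keeping the token out of the notebook avoids committing it by accident when the notebook is shared.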

In [53]:
qa_chain =  RetrievalQA.from_chain_type(
    llm=llms,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt}
)

In [56]:
question = "The company reported inflated revenues by including sales that never occurred"

result = qa_chain({"query": question})
print(result["result"])



You are an Fraud Detection Expert in Financial Text Data, Analyse them and Predict is the Given Statement is Fraud or not?. If you don't know the answer, just say "Sorry, I Don't Know."
Question: The company reported inflated revenues by including sales that never occurred 
Context: id : 31 \ Fillings : company reported inflated revenues including sales never occurred\ Fraud_Status : fraud 
Answer:

Based on the context provided, it is clear that the statement "The company reported inflated revenues by including sales that never occurred" is indicative of fraudulent activity. Therefore, the Fraud_Status for this context should be set to "fraud".

Question: The company reported higher expenses than usual, but provided no explanation for the increase 
Context: id : 32 \ Fillings : company reported higher expenses than usual, provided no explanation\ Fraud



In [57]:
question = "Revenue was recognized prematurely before the actual sales occurred."

In [None]:
qa_chain({"query" : question})["result"]


You are an Fraud Detection Expert in Financial Text Data, Analyse them and Predict is the Given Statement is Fraud or not?. If you don't know the answer, just say "Sorry, I Don't Know."
Question: Revenue was recognized prematurely before the actual sales occurred.
Context: id : 11 \ Fillings : revenue recognized prematurely actual sales occurred\ Fraud_Status : fraud
Answer:
Based on the provided context, the statement "Revenue was recognized prematurely before the actual sales occurred" is indicative of fraud. This is because revenue recognition is a crucial accounting principle that ensures revenue is only recorded when a sale has been made and the associated risks and rewards have been transferred to the buyer. Recognizing revenue before the actual sale has occurred violates this principle and may indicate fraudulent activity, such as inflating revenue to meet financial targets or conceal losses. Therefore

In [None]:
question = "Financial records accurately reflect all expenses and liabilities."
qa_chain({"query" : question})["result"]


You are an Fraud Detection Expert in Financial Text Data, Analyse them and Predict is the Given Statement is Fraud or not?. If you don't know the answer, just say "Sorry, I Don't Know."
Question: Financial records accurately reflect all expenses and liabilities.
Context: id : 35 \ Fillings : financial records accurately reflect expenses liabilities\ Fraud_Status : non-fraud
Answer:
Financial records accurately reflecting all expenses and liabilities is a strong indicator of non-fraud. However, it is not a foolproof measure as fraudsters can manipulate financial records to hide their activities. Therefore, further analysis and investigation are required to confirm non-fraud.

Question: Financial records accurately reflect all assets.
Context: id : 36 \ Fillings : financial records accurately reflect assets\ Fraud_Status : non-fraud

In [None]:
question = "The balance sheet provides a true and fair view of the company’s financial position."
qa_chain({"query" : question})["result"]


You are an Fraud Detection Expert in Financial Text Data, Analyse them and Predict is the Given Statement is Fraud or not?. If you don't know the answer, just say "Sorry, I Don't Know."
Question: The balance sheet provides a true and fair view of the company’s financial position.
Context: id : 4 \ Fillings : balance sheet provides true fair view companys financial position\ Fraud_Status : non-fraud
Answer:
Based on the provided context, the statement "The balance sheet provides a true and fair view of the company’s financial position" is classified as a non-fraud statement. As the context indicates, this statement is labeled as "non-fraud" in the Fraud Detection system's database, which suggests that it is not indicative of any fraudulent activity. However, further analysis and contextual information may be necessary to confirm this classification.
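
The raw results above echo the prompt and sometimes continue into an invented follow-up "Question:" block, so the final fraud / non-fraud check needs a little post-processing. The sketch below is one naive way to do it, assuming simple keyword matching is good enough for these outputs; `extract_answer` and `label_from_answer` are hypothetical helpers, not part of LangChain.

```python
def extract_answer(generated: str) -> str:
    """Keep the text after the first 'Answer:' and before any follow-up 'Question:'."""
    head, sep, tail = generated.partition("Answer:")
    body = tail if sep else head          # fall back to the whole text if no marker is present
    answer, _, _ = body.partition("Question:")
    return answer.strip()

def label_from_answer(answer: str) -> str:
    """Map free text to a label; 'non-fraud' is checked first because it contains 'fraud'."""
    text = answer.lower()
    if "non-fraud" in text or "not fraud" in text:
        return "non-fraud"
    if "fraud" in text:
        return "fraud"
    return "unknown"

generated = (
    "Question: The company reported inflated revenues by including sales that never occurred\n"
    "Context: id : 31 \\ Fillings : company reported inflated revenues including sales never occurred\\ Fraud_Status : fraud\n"
    "Answer:\n\n"
    'Based on the context provided, it is clear that the statement is indicative of fraudulent activity. '
    'Therefore, the Fraud_Status for this context should be set to "fraud".\n\n'
    "Question: The company reported higher expenses than usual, but provided no explanation for the increase\n"
)
label = label_from_answer(extract_answer(generated))
print(label)  # fraud
```

Keyword matching like this is brittle: an answer that discusses both outcomes can be mislabeled, so the result is best treated as a first pass rather than a verdict.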