# **Financial Data Extraction Using Open-Source LLMs**

This notebook works in three steps:

1. Extract text from the PDF.
2. Perform Named Entity Recognition (NER) and regex matching to pull out the financial fields.
3. Save the extracted data as a JSON file.

### Installing Dependencies

In [1]:
!pip install pymupdf

Collecting pymupdf
  Downloading pymupdf-1.25.3-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.25.3-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (20.0 MB)
Installing collected packages: pymupdf
Successfully installed pymupdf-1.25.3


### Importing Libraries

In [2]:
import fitz
import json
import re
import os
from transformers import pipeline

In [3]:
# Load the NER model (BERT-large fine-tuned on CoNLL-03; labels: PER, ORG, LOC, MISC)
nlp = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Device set to use cpu


### Reading and Extracting Text from a PDF

In [4]:
def extract_text_from_pdf(pdf_path):
    # Open one PDF, join the plain text of every page, and close the file afterwards
    with fitz.open(pdf_path) as doc:
        text = "\n".join(page.get_text("text") for page in doc)
    return text

In [None]:
text = extract_text_from_pdf("pdf_documents/1_FinancialResults_05022025142214.pdf")

### Performing NER on Extracted Text

In [5]:
def extract_financial_entities(text):
    extracted_data = {
        "Company Name": "",
        "Report Date": "",
        "Profit Before Tax": "",
        "Revenue": "",
        "Net Profit After Tax": ""
    }

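    # Extract company name and date from the NER output.
    # Note: the pipeline returns one entry per word piece, and each match
    # overwrites the previous one, so only the last matching piece is kept.
    entities = nlp(text)
    for entity in entities:
        word = entity['word']
        label = entity['entity']

        if "ORG" in label:  # company name
            extracted_data["Company Name"] = word
        # The CoNLL-03 label set (PER, ORG, LOC, MISC) has no DATE tag,
        # so in practice this branch stores the last MISC piece.
        elif "MISC" in label or "DATE" in label:  # date
            extracted_data["Report Date"] = word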

    # Use regex for extracting financial values
    patterns = {
        "Profit Before Tax": r"Profit Before Tax[:\s]+([\d,\.]+)",
        "Revenue": r"Revenue from operations[:\s]+([\d,\.]+)",
        "Net Profit After Tax": r"Net Profit after tax[:\s]+([\d,\.]+)"
    }

    for key, pattern in patterns.items():
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            extracted_data[key] = match.group(1)

    return extracted_data
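
To see what the regex patterns above do and do not capture, the short sketch below runs the "Profit Before Tax" pattern against three sample lines. The sample lines and the helper name `first_value` are invented for illustration and do not come from the PDFs used in this notebook.

```python
import re

# Same pattern as in extract_financial_entities above.
PBT_PATTERN = r"Profit Before Tax[:\s]+([\d,\.]+)"

# Hypothetical lines, invented for illustration only.
samples = [
    "Profit before tax: 1,234.56",       # label, colon, number
    "Profit Before Tax\n317.07",         # number on the next line
    "Profit before tax (A-B) 1,234.56",  # extra text between label and number
]

def first_value(pattern, text):
    """Return the first captured value, or "" when the pattern does not match."""
    match = re.search(pattern, text, re.IGNORECASE)
    return match.group(1) if match else ""

results = [first_value(PBT_PATTERN, s) for s in samples]
print(results)
```

The third line yields an empty string because the pattern only allows colons and whitespace between the label and the number. An empty field in the output may therefore simply mean that the statement words or lays out that row differently.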

In [None]:
extract_financial_entities(text)

{'Company Name': 'Bose',
 'Report Date': 'Regulations',
 'Profit Before Tax': '',
 'Revenue': '333.29',
 'Net Profit After Tax': ''}
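
The `'Regulations'` value in "Report Date" is consistent with the model's label set: CoNLL-03 models tag PER, ORG, LOC and MISC, and have no DATE label. One possible alternative, sketched below, is to look for date-like strings with regular expressions. The function name `find_report_date` and the three date formats are assumptions made for this sketch; real filings may use other formats.

```python
import re

MONTHS = (r"(?:January|February|March|April|May|June|July|August|"
          r"September|October|November|December)")

# Assumed date formats; extend as needed for the documents at hand.
DATE_PATTERNS = [
    rf"\b\d{{1,2}}(?:st|nd|rd|th)?\s+{MONTHS},?\s+\d{{4}}\b",  # 31st December 2024
    rf"\b{MONTHS}\s+\d{{1,2}},?\s+\d{{4}}\b",                 # December 31, 2024
    r"\b\d{1,2}[./-]\d{1,2}[./-]\d{4}\b",                     # 31.12.2024
]

def find_report_date(text):
    """Return the earliest date-like string in text, or "" if none is found."""
    best = None
    for pattern in DATE_PATTERNS:
        match = re.search(pattern, text, re.IGNORECASE)
        if match and (best is None or match.start() < best.start()):
            best = match
    return best.group(0) if best else ""
```

The first date in a document is not necessarily the reporting date, so the result is best treated as a candidate to be checked.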

In [None]:
text2 = extract_text_from_pdf("pdf_documents/Amaar raja Earnings Summary.pdf")

In [None]:
extract_financial_entities(text2)

{'Company Name': '##BI',
 'Report Date': 'Regulation',
 'Profit Before Tax': '317.07',
 'Revenue': '3,250.73',
 'Net Profit After Tax': '226.32'}
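
A value such as `'##BI'` is a WordPiece fragment: without aggregation, the pipeline returns one prediction per sub-word token. The sketch below shows one way to stitch adjacent tokens back into whole entity strings. The function name `merge_word_pieces` and the sample tokens are hypothetical. The `transformers` pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that groups tokens for you, which may be the simpler route.

```python
def merge_word_pieces(tokens):
    """Group adjacent token-level predictions into whole entity strings.

    tokens: list of dicts with 'entity' (e.g. 'I-ORG'), 'word' and 'index'
    (token position), as returned by a token-classification pipeline
    without aggregation.
    Returns a list of (entity_type, text) tuples in document order.
    """
    merged = []
    prev_index = None
    for tok in tokens:
        etype = tok["entity"].split("-")[-1]  # 'I-ORG' -> 'ORG'
        word = tok["word"]
        adjacent = prev_index is not None and tok["index"] == prev_index + 1
        continues = word.startswith("##") or not tok["entity"].startswith("B-")
        if merged and adjacent and merged[-1][0] == etype and continues:
            if word.startswith("##"):
                merged[-1][1] += word[2:]    # continuation of the same word
            else:
                merged[-1][1] += " " + word  # next word of the same entity
        else:
            # Start a new entity; drop a leading '##' left by an untagged first piece
            merged.append([etype, word.lstrip("#")])
        prev_index = tok["index"]
    return [(etype, text) for etype, text in merged]

# Hypothetical token-level output, invented for illustration.
sample = [
    {"entity": "I-ORG", "word": "Ac", "index": 1},
    {"entity": "I-ORG", "word": "##me", "index": 2},
    {"entity": "I-ORG", "word": "Corp", "index": 3},
    {"entity": "I-MISC", "word": "Regulation", "index": 9},
]
entities = merge_word_pieces(sample)
# Take the first ORG entity rather than the last matching token
company = next((text for etype, text in entities if etype == "ORG"), "")
```

Choosing the first ORG entity is itself a heuristic; the company name is not guaranteed to be the first organisation mentioned in a filing.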

In [None]:
# Process multiple PDFs and extract financial data
def process_multiple_pdfs(pdf_files):
    # Map each PDF path to its dictionary of extracted fields
    extracted_results = {}

    for pdf_file in pdf_files:
        text = extract_text_from_pdf(pdf_file)
        financial_data = extract_financial_entities(text)
        extracted_results[pdf_file] = financial_data

    return extracted_results

### Saving Extracted Data as JSON

In [None]:
def save_to_json(data, output_file):
    # Write the results as indented JSON
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)

# list of PDF files to process (sorted so the output order is reproducible)
pdf_files = sorted(os.path.join("pdf_documents", f) for f in os.listdir("pdf_documents") if f.lower().endswith(".pdf"))

# process the PDFs
data = process_multiple_pdfs(pdf_files)

# saving the extracted data
save_to_json(data, "financial_data.json")

print("Extraction complete! Check financial_data.json")

Extraction complete! Check financial_data.json
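
The saved values are strings such as `'3,250.73'`, so any later calculation needs them converted to numbers first. The sketch below reads the JSON file back and converts the three financial fields. The helper names `to_number`, `numeric_fields` and `load_results` are hypothetical and not part of the notebook above.

```python
import json

def to_number(value):
    """Convert a string such as '3,250.73' to a float; return None if empty or malformed."""
    cleaned = value.replace(",", "").strip()
    try:
        return float(cleaned) if cleaned else None
    except ValueError:
        return None

def numeric_fields(record, fields=("Profit Before Tax", "Revenue", "Net Profit After Tax")):
    """Return the chosen fields of one extracted record as floats (or None)."""
    return {name: to_number(record.get(name, "")) for name in fields}

def load_results(path="financial_data.json"):
    """Read the JSON written by save_to_json and convert each record's figures."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return {pdf: numeric_fields(record) for pdf, record in data.items()}
```

Calling `load_results()` after the extraction step would return one dictionary of floats per PDF, with `None` marking fields the regex did not find.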


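### **Conclusion**

This notebook extracts text from PDF documents, uses NER and regular expressions to pull out the company name, report date, and key financial figures, and stores the results in a JSON file. In the sample runs above, the company name and report date came back as token fragments (`##BI`, `Regulation`), so those two fields need further work.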