<a href="https://colab.research.google.com/github/Jyothiraditya135/Virsoftech_Submission/blob/main/Final_Submission.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -q PyPDF2
!pip install -q spacy gliner-spacy
!pip install -q transformers accelerate
!pip install -q datasets huggingface_hub

In [None]:
import spacy
import random
import pandas as pd
from PyPDF2 import PdfReader
from transformers import pipeline
from datasets import load_dataset

In [None]:
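def read_pdf(file_path):
    """Reads the text of every page of a PDF and returns one string per page."""
    reader = PdfReader(file_path)
    # Fall back to "" when a page yields no text (e.g. image-only pages),
    # so the result can always be joined into one string.
    return [page.extract_text() or "" for page in reader.pages]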

Given the full text of the document, we now need ML techniques to extract the relevant key-value pairs from it. Named Entity Recognition (NER) is a natural first idea.

GLiNER (Generalist and Lightweight Model for Named Entity Recognition) looks very useful here: through the gliner-spacy wrapper it plugs into spaCy, accepts custom labels, and works zero-shot on domain-agnostic data.

In [None]:
def initialize_ner_model(keys, gliner_model="urchade/gliner_multi"):
    """Initializes a custom spaCy NER model with specified keys."""
    custom_spacy_config = {
        "gliner_model": gliner_model,
        "chunk_size": 250,
        "labels": keys,
        "style": "ent"
    }
    nlp = spacy.blank("en")
    nlp.add_pipe("gliner_spacy", config=custom_spacy_config)
    return nlp

There are more ways to perform NER:
- Using custom NER models fine-tuned for a specific domain:
 - https://huggingface.co/gouravsinha/finance-NER
 - https://github.com/Legal-NLP-EkStep/legal_NER
 - https://huggingface.co/saattrupdan/employment-contract-ner-da

 We can switch between these NER models if we know the domain of the document beforehand, but this defeats the domain-agnostic nature of the task.

- LLM-based NER: using an LLM call to perform NER by giving the LLM the document and a short description of each label we want it to extract.

The idea seemed effective; the only drawback is that the document would have to be passed to third-party APIs. This can be avoided by running a quantized model locally on CPU, from Hugging Face or Ollama, and writing an appropriate prompt.
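A minimal sketch of how such a prompt could be assembled from label descriptions. The function name `build_ner_prompt`, the wording of the instructions, and the example labels are all assumptions made for illustration, not a tested prompt.

```python
def build_ner_prompt(document, label_descriptions):
    """Build an instruction prompt listing each label with its description."""
    label_lines = "\n".join(
        f"- {label}: {description}"
        for label, description in label_descriptions.items()
    )
    return (
        "Extract the following entities from the document.\n"
        f"{label_lines}\n\n"
        "Answer with one 'label: value' line per entity, "
        "and write 'label: None' when an entity is absent.\n\n"
        f"Document:\n{document}\n"
    )


# Hypothetical labels and descriptions, for illustration only.
labels = {
    "petitioner": "the party who filed the case",
    "judge": "the judge who heard the case",
}
prompt = build_ner_prompt("<document text goes here>", labels)
print(prompt)
```

Keeping the descriptions in a dictionary means the same function can serve any domain by swapping the labels, which fits the domain-agnostic goal.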

In [None]:
def extract_entities(text, nlp_model, keys):
    """Extracts entities from text using the NER model."""
    doc = nlp_model(text)
    key_value_pairs = {key: None for key in keys}

    for ent in doc.ents:
        if ent.label_ in keys and not key_value_pairs[ent.label_]:
            key_value_pairs[ent.label_] = ent.text

    return key_value_pairs

In [None]:
def save_to_csv(data_dict, output_csv):
    """Saves a dictionary of extracted data to a CSV file."""
    # A one-element list of dicts gives a single row with one column per key.
    df = pd.DataFrame([data_dict])
    df.to_csv(output_csv, index=False)

For summarization, the Hugging Face transformers library offers plenty of resources: pretrained and quantized models, plus the accelerate library. Since many of the models require a Hugging Face CLI login and an access token, I used an openly available base model here. The commented-out code further down uses the Hugging Face CLI and LangChain to build a pipeline around stronger models for the summarization task.

In [None]:
def summarize_text(text):
    """Summarizes the given text using a pre-trained summarization model."""
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    summary = summarizer(text, truncation=True, max_length=130)
    return summary[0]['summary_text']

The block below ran inference on an Indian legal NER dataset.

In [None]:
# def load_legal_ner_dataset():
#     """Loads the legal NER dataset from Hugging Face."""
#     dataset = load_dataset("opennyaiorg/InLegalNER")
#     return dataset["train"], dataset["dev"], dataset["test"]

# def process_ner_dataset_sample(nlp_model, sample):
#     """Processes a single sample from the dataset using the NER model."""
#     text = sample["data"]["text"]
#     annotations = sample["annotations"]
#     doc = nlp_model(text)
#     entities = [(ent.text, ent.label_) for ent in doc.ents]
#     return {"text": text, "annotations": annotations, "entities": entities}

In [None]:
if __name__ == "__main__":

    # An example I took for legal documents.
    keys = ["petitioner", "defendant", "respondent", "judge", "lawyer", "date", "organisation", "address"]

    pdf_text = read_pdf("test.pdf")
    combined_text = " ".join(pdf_text)

    nlp_model = initialize_ner_model(keys)

    extracted_data = extract_entities(combined_text, nlp_model, keys)

    save_to_csv(extracted_data, "output.csv")

    summary = summarize_text(combined_text)
    print("Summary:", summary)

In [None]:
# !pip install -q transformers langchain accelerate

In [None]:
# huggingface-cli login

In [None]:
# from huggingface_hub import notebook_login

# notebook_login()

In [None]:
# import torch
# import transformers
# from transformers import AutoTokenizer
# from langchain import LLMChain, HuggingFacePipeline, PromptTemplate

In [None]:
# model = "meta-llama/Llama-2-7b-chat-hf"
# tokenizer = AutoTokenizer.from_pretrained(model)

In [None]:
# pipeline = transformers.pipeline(
#     "text-generation",
#     model=model,
#     tokenizer=tokenizer,
#     torch_dtype=torch.bfloat16,
#     trust_remote_code=True,
#     device_map="auto",
#     max_length=3000,
#     do_sample=True,
#     top_k=10,
#     num_return_sequences=1,
#     eos_token_id=tokenizer.eos_token_id
# )

The temperature is set to 0 because, given the domain of these documents, accurate summaries without hallucinations matter more than variety. Raising the temperature gives the LLM more creative freedom.
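To make the effect of temperature concrete, here is a small worked sketch of temperature-scaled softmax in plain Python. The logits are made-up numbers, and `softmax_with_temperature` is an illustrative helper, not part of transformers. A temperature of exactly 0 is the limiting case in which all probability goes to the top-scoring token; in Hugging Face generation that behaviour is normally obtained with greedy decoding (`do_sample=False`), whereas the commented pipeline above sets `do_sample=True`, so the two settings may need to be reconciled.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return next-token probabilities after dividing logits by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use greedy decoding for 0")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for temperature in (1.5, 1.0, 0.1):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature drops, the probability of the highest-scoring token approaches 1, which is why low temperatures give more repeatable output.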

In [None]:
# llm = HuggingFacePipeline(pipeline=pipeline, model_kwargs={'temperature':0})

In [None]:
# template = """
#     Write a summary of the following text delimited by triple backticks.
#     Return your response which covers the key points of the text.
#     ```{text}```
#     SUMMARY:
# """
# prompt = PromptTemplate(template=template, input_variables=["text"])
# llm_chain = LLMChain(prompt=prompt, llm=llm)
# print(llm_chain.run(text))

However, this classic way of using an NER model does not yield the required results. Precision and accuracy are low, and chunking the text removes the context needed to differentiate precedents (in the legal sense). To make use of the complete context, attention seems to be the only suitable alternative.
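The context loss from chunking can be illustrated with a toy word-based splitter. This is only a sketch: `chunk_words` is a hypothetical helper, the sentence and case name are invented, and the chunking that gliner-spacy actually performs with `chunk_size` may work differently.

```python
def chunk_words(text, chunk_size):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


# Made-up sentence; a tiny chunk size exaggerates the effect.
sentence = "The petitioner relies on Sharma v. State where the respondent was acquitted"
for chunk in chunk_words(sentence, chunk_size=6):
    print(chunk)
```

The second chunk mentions a respondent but no longer shows that this respondent belongs to the cited case rather than the current one, which is the kind of ambiguity a model processing chunks in isolation cannot resolve.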

In [None]:
# Example using a BERT-base NER model (dslim/bert-base-NER)

# from transformers import AutoTokenizer, AutoModelForTokenClassification
# from transformers import pipeline

# tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
# model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

# nlp = pipeline("ner", model=model, tokenizer=tokenizer)
# example = "My name is Wolfgang and I live in Berlin"

# ner_results = nlp(example)
# print(ner_results)

Careful prompt engineering that names the domain and the required keys works much better at extracting the required values.

In [None]:
import re
import pandas as pd
from transformers import AutoTokenizer, AutoModelForCausalLM

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-1.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B")


In [None]:
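def extract_entities(text, keys):
    """Prompts the causal LM for the given keys and parses 'Key: Value' pairs."""
    prompt = (
        f"Extract the following entities: {', '.join(keys)}, from the given text below"
        f"'{text}'. Provide only the extracted outputs in the format 'Key: Value'."
    )

    print(prompt)

    inputs = tokenizer(prompt, return_tensors="pt", max_length=4096, truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=200)
    # outputs[0] holds the prompt tokens followed by the generated tokens,
    # so the decoded text starts with the prompt itself.
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    print(generated_text)

    pattern = r"(\w+):\s(.+)"  # Regex to match "Key: Value"
    # Because the prompt is echoed, the first match comes from the prompt
    # ("entities: ..."), not from the model's answer.
    matches = re.findall(pattern, generated_text)

    if not matches:
        print("No entities found in the generated text.")
        return []  # empty list so callers can still slice or iterate

    data = [{"entity": match[0], "value": match[1]} for match in matches]
    return data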

In [None]:
text = "Barack Obama was born in Hawaii and worked at the White House."
keys = ["Person", "Location", "Organization"]

In [None]:
data = extract_entities(text, keys)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Extract the following entities: Person, Location, Organization, from the given text below'Barack Obama was born in Hawaii and worked at the White House.'. Provide only the extracted outputs in the format 'Key: Value'.
Extract the following entities: Person, Location, Organization, from the given text below'Barack Obama was born in Hawaii and worked at the White House.'. Provide only the extracted outputs in the format 'Key: Value'. Sure, here are the extracted entities from the given text:

- Person: Barack Obama
- Location: Hawaii
- Organization: White House


In [None]:
data[1:]  # skip the first match, which comes from the echoed prompt

[{'entity': 'Person', 'value': 'Barack Obama'},
 {'entity': 'Location', 'value': 'Hawaii'},
 {'entity': 'Organization', 'value': 'White House'}]
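The `data[1:]` slice is needed because the decoded text includes the prompt, and the regex also matches the prompt's own "entities: ..." fragment. A possibly more robust alternative, sketched below under stated assumptions, is to strip the echoed prompt and keep only lines whose key is one of the requested keys. `parse_key_values` and its regex are illustrative and were not part of the original experiment.

```python
import re


def parse_key_values(generated_text, keys, prompt=""):
    """Return {key: value} for requested keys found as 'Key: Value' lines."""
    # Drop the echoed prompt when the decoded text starts with it.
    if prompt and generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]

    wanted = {key.lower(): key for key in keys}
    results = {}
    for line in generated_text.splitlines():
        # Optional bullet, then a key, a colon, and a non-empty value.
        match = re.match(r"^\s*[-*]?\s*([\w ]+?)\s*:\s*(.+?)\s*$", line)
        if not match:
            continue
        key = wanted.get(match.group(1).lower())
        if key and key not in results:  # keep the first value per key
            results[key] = match.group(2)
    return results


prompt = "Extract the following entities: Person, Location. Use 'Key: Value'."
generated = prompt + " Sure:\n\n- Person: Barack Obama\n- Location: Hawaii\n"
print(parse_key_values(generated, ["Person", "Location"], prompt))
```

Filtering on the requested keys also discards any unrelated "word: text" lines the model may add around its answer.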

However, the prompt-engineered LLM solution has some potential drawbacks:
- Inference is slow.
- The model may hallucinate and return wrong values.
- The input context window depends on the model, so there is a trade-off between context length and model size.