<a href="https://colab.research.google.com/github/Siddharth5723/AI-Powered-Regulatory-Compliance-Checker-for-Contracts/blob/main/Model_research.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

MODEL RESEARCH:

MODULE-1:

Clause Identification and Risk Analysis Engine

Tasks it needs to perform:
1. Clause extraction (NER + segmentation)

2. Clause classification

3. Risk scoring

4. Missing clause detection
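
Task 4 can be read as a set difference between the clause types a contract is expected to contain and the types the classifier actually found. The sketch below assumes that reading; `REQUIRED_CLAUSES` and `find_missing_clauses` are hypothetical names, and the checklist entries are illustrative labels rather than a vetted legal checklist.

```python
# Hypothetical checklist of clause types a contract template must contain.
REQUIRED_CLAUSES = {"Governing Laws", "Terminations", "Indemnifications", "Confidentiality"}


def find_missing_clauses(predicted_labels, required=REQUIRED_CLAUSES):
    """Return the required clause types that no extracted clause was classified as."""
    return sorted(set(required) - set(predicted_labels))


# Labels the classifier might assign to the clauses of one contract (made-up input).
predicted = ["Governing Laws", "Terminations", "Notices"]
missing = find_missing_clauses(predicted)
print(missing)
```

A real checklist would likely depend on contract type and jurisdiction, so the required set is best passed in per contract rather than hard-coded.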

Suitable models:

1. Legal-BERT (`nlpaueb/legal-bert-base-uncased`, hosted on Hugging Face)

2. BERT (Google AI) / RoBERTa (Facebook AI)



In [None]:
# !pip install transformers datasets torch scikit-learn evaluate

In [None]:
# !pip install --upgrade transformers

In [None]:
# Hugging Face Legal-BERT
import torch
import numpy as np
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)

from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

In [None]:
dataset = load_dataset("lex_glue", "ledgar")
dataset

In [None]:
model_name = "nlpaueb/legal-bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(dataset["train"].features["label"].names)
)

In [None]:
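def tokenize_function(example):
    # Pad and truncate every clause to BERT's 512-token limit so all inputs share one shape.
    return tokenizer(
        example["text"],
        padding="max_length",
        truncation=True,
        max_length=512
    )

# Tokenize all splits in batches, then return PyTorch tensors when indexing.
encoded_dataset = dataset.map(tokenize_function, batched=True)
encoded_dataset.set_format("torch")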

In [None]:
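def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit.
    predictions = np.argmax(logits, axis=-1)

    # Macro averaging weights every clause type equally, regardless of frequency.
    # zero_division=0 silences warnings for classes that are never predicted.
    precision = precision_score(labels, predictions, average="macro", zero_division=0)
    recall = recall_score(labels, predictions, average="macro", zero_division=0)
    f1 = f1_score(labels, predictions, average="macro")
    acc = accuracy_score(labels, predictions)

    return {
        "accuracy": acc,
        "precision": precision,
        "recall": recall,
        "f1": f1
    }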

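In [None]:
# Pinning transformers==4.30.0 and then upgrading cancels the pin, so only the upgrade is kept.
# Restart the runtime after upgrading so the new version is the one imported.
!pip install --upgrade transformers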

In [None]:
training_args = TrainingArguments(
    output_dir="./legalbert_results",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()


In [None]:
results = trainer.evaluate(encoded_dataset["test"])
print(results)
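
The evaluation above covers clause classification only. One way to get from classifier logits to the risk scoring task is sketched below, assuming the logits for a clause are available (for example from the `predictions` field returned by `trainer.predict`). The `RISK_WEIGHTS` table, the default weight, and the `score_clause` helper are hypothetical and would need input from legal reviewers; the formula (risk weight times model confidence) is one simple choice, not an established method.

```python
import math

# Hypothetical risk weights per clause type (0 = benign, 1 = high risk).
RISK_WEIGHTS = {"Indemnifications": 0.9, "Terminations": 0.6, "Notices": 0.1}


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_clause(logits, label_names, risk_weights=RISK_WEIGHTS, default_weight=0.5):
    """Return (label, confidence, risk) for a single clause."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = label_names[best]
    confidence = probs[best]
    # Scale the clause type's risk weight by how confident the model is.
    risk = risk_weights.get(label, default_weight) * confidence
    return label, confidence, risk


# Toy logits for one clause over three made-up classes.
label_names = ["Indemnifications", "Terminations", "Notices"]
label, confidence, risk = score_clause([2.0, 0.5, -1.0], label_names)
print(label, round(confidence, 3), round(risk, 3))
```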


MODULE-2:

Regulatory Update Tracking & Integration System

Tasks:

1. Scraping legal databases

2. Detecting regulatory changes

3. Semantic comparison with existing contracts
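
Before any embedding model is involved, task 2 can start as a plain line-level diff between two scraped versions of the same regulation. The sketch below uses Python's standard `difflib` for that; `detect_changes` is a hypothetical helper and the sample regulation text is invented for illustration.

```python
import difflib


def detect_changes(old_text, new_text):
    """Return (added, removed) lines between two versions of a regulation."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)

    added, removed = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(old_lines[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new_lines[j1:j2])
    return added, removed


# Invented sample text, not a real regulation.
old_version = (
    "Data must be deleted within 90 days.\n"
    "Breaches must be reported within 72 hours."
)
new_version = (
    "Data must be deleted within 30 days.\n"
    "Breaches must be reported within 72 hours."
)

added, removed = detect_changes(old_version, new_version)
print("Added:", added)
print("Removed:", removed)
```

Only the changed lines then need to be embedded and compared against contract clauses, which keeps the semantic comparison step small.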

Suitable Models:
A. Retrieval + Embeddings

1. OpenAI Embeddings

2. **Sentence-BERT (SBERT)**

3. Instructor-XL embeddings

In [9]:
# Sentence-BERT (SBERT)
# !pip install sentence-transformers torch scikit-learn

In [4]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

# Alternative test set (data-protection wording):
# sentences = [
#     "The processor shall delete personal data upon request.",
#     "The company must erase user information when asked.",
#     "Payment must be made within 30 days of invoice receipt."
# ]

# Sentences 1 and 2 paraphrase each other; sentence 3 is unrelated.
sentences = [
    "The indemnifying party shall hold harmless the other party.",
    "The party agrees to compensate and protect the other party from liability.",
    "This agreement shall terminate after five years."
]

embeddings = model.encode(sentences)
similarity_matrix = cosine_similarity(embeddings)

print("Cosine Similarity Matrix:\n")
print(similarity_matrix)

print("\nSimilarity between sentence 1 and 2:",
      similarity_matrix[0][1])

print("Similarity between sentence 1 and 3:",
      similarity_matrix[0][2])


Cosine Similarity Matrix:

[[1.0000001  0.50195354 0.26221165]
 [0.50195354 1.0000001  0.28437954]
 [0.26221165 0.28437954 0.99999994]]

Similarity between sentence 1 and 2: 0.50195354
Similarity between sentence 1 and 3: 0.26221165
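
In the output above the paraphrased pair scores about 0.50 and the unrelated pair about 0.26, so a cutoff between the two would separate them in this one example. The sketch below shows how such a cutoff could flag contract clauses related to a regulatory update. It is dependency-free and uses toy 3-dimensional vectors in place of `model.encode(...)` output; `flag_related_clauses` and the `0.4` threshold are assumptions that would need calibration on labelled pairs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def flag_related_clauses(update_vec, clause_vecs, threshold=0.4):
    """Return (clause_index, similarity) pairs at or above the threshold, best first."""
    scored = [(i, cosine(update_vec, vec)) for i, vec in enumerate(clause_vecs)]
    hits = [(i, s) for i, s in scored if s >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for sentence embeddings.
update = [1.0, 0.0, 0.0]
clauses = [
    [0.9, 0.1, 0.0],  # nearly parallel to the update
    [0.0, 1.0, 0.0],  # orthogonal, so unrelated
    [0.5, 0.5, 0.0],  # partially related
]

hits = flag_related_clauses(update, clauses)
for index, similarity in hits:
    print(f"clause {index}: {similarity:.3f}")
```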


MODULE-3:

Contract Modification & Deployment


Tasks:

1. Suggest clause edits

2. Rewrite non-compliant text

3. Version control
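
For task 3, a minimal in-memory sketch of clause version control is given below, assuming each revision is identified by a short content hash. `ClauseHistory` and its methods are hypothetical; a real deployment would more likely persist revisions in a database or a Git repository. The revised clause wording is an invented example, not legal advice.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ClauseHistory:
    """Append-only revision list for a single clause (hypothetical helper)."""
    revisions: list = field(default_factory=list)

    def commit(self, text, note):
        """Store a new revision unless the text matches the latest one."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        if self.revisions and self.revisions[-1]["id"] == digest:
            return digest  # nothing changed, so no new revision
        self.revisions.append({"id": digest, "text": text, "note": note})
        return digest

    def latest(self):
        """Return the text of the most recent revision, or None if empty."""
        return self.revisions[-1]["text"] if self.revisions else None


history = ClauseHistory()
first_id = history.commit(
    "The company may retain personal data indefinitely.",
    note="Original clause",
)
second_id = history.commit(
    "The company shall retain personal data only as long as necessary "
    "for the stated purpose.",
    note="Rewritten after compliance review",
)
print(len(history.revisions), first_id != second_id)
```

Hashing the text gives each revision a stable identifier and lets `commit` skip no-op edits.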

Suitable Models:
A. Instruction-tuned LLMs
  1. GPT-4
  2. **LLaMA 3 Instruct**

In [8]:
# import ollama

# response = ollama.chat(
#     model="llama3:8b-instruct-q4_0",
#     messages=[
#         {
#             "role": "system",
#             "content": "You are a legal compliance expert."
#         },
#         {
#             "role": "user",
#             "content": "Does this clause violate GDPR? The company may retain personal data indefinitely."
#         }
#     ]
# )

# print("\nModel Response:\n")
# print(response["message"]["content"])