Qn.2: Write a function in Python that takes a list of legal documents (as strings) and returns the most frequently occurring legal terms (e.g., "liability," "indemnification"). Optimize your solution for large datasets.

In [6]:
!pip install spacy


In [8]:
import spacy
from collections import Counter

In [13]:
!python -m spacy download en_core_web_sm



Here is a Python function that extracts the most frequently occurring legal terms from a list of legal documents, first in a straightforward form and then in a version tuned for large datasets:

Approach:

Efficient Tokenization: Use spaCy's NLP pipeline for tokenization and lemmatization.

Term Filtering: Keep only tokens whose lemma appears in a predefined set of legal terms; stopwords and other non-legal words are dropped as a side effect.

Frequency Calculation: Use Counter from collections for efficient word counting.

Optimization: Process documents one at a time, so only one parsed document is held in memory at once. The function accepts any iterable, including a generator.
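
The counting and streaming steps above can be tried without spaCy. The following is a minimal sketch, assuming a crude regex tokenizer and exact lowercase matching in place of lemmatization; `count_terms_baseline` and `WORD_RE` are illustrative names, not part of the solution below.

```python
import re
from collections import Counter

# Same term set as used in the notebook; everything else is illustrative.
LEGAL_TERMS = {
    "liability", "indemnification", "warranty", "termination",
    "confidentiality", "damages", "jurisdiction", "dispute", "breach",
}

# Assumption: a crude tokenizer that keeps runs of ASCII letters only.
WORD_RE = re.compile(r"[a-z]+")


def count_terms_baseline(documents, top_n=10):
    """Count exact lowercase matches of LEGAL_TERMS over an iterable of strings."""
    counts = Counter()
    for text in documents:  # one document at a time, so generators work too
        words = WORD_RE.findall(text.lower())
        counts.update(w for w in words if w in LEGAL_TERMS)
    return counts.most_common(top_n)


# A generator expression stands in for a large corpus that is read lazily.
docs = (text for text in [
    "The liability clause limits liability for damages.",
    "Termination shall occur in case of breach.",
])
print(count_terms_baseline(docs))
# [('liability', 2), ('damages', 1), ('termination', 1), ('breach', 1)]
```

Because this baseline does no lemmatization, an inflected form such as "liabilities" is not counted; the spaCy pipeline is what handles that case.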

## Code Implementation:

In [20]:


# Load spaCy's English NLP model
nlp = spacy.load("en_core_web_sm")


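# Define some common legal terms we want to track.
# Tokens are matched by lemma, so entries should be lemmas: a plural entry
# such as "damages" is missed if the lemmatizer maps the word to "damage".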
LEGAL_TERMS = {
    "liability", "indemnification", "warranty", "termination",
    "confidentiality", "damages", "jurisdiction", "dispute", "breach"
}


def extract_frequent_legal_terms(documents, top_n=10):
    """
    Finds the most frequently occurring legal terms in a list of legal documents.
    
    :param documents: List of documents (each as a string).
    :param top_n: Number of top terms to return.
    :return: List of tuples (term, frequency).
    """
    term_counts = Counter()

    
    # Process each document one at a time (memory efficient)
    for doc in documents:
        spacy_doc = nlp(doc)

        # Count tokens whose lemma matches one of our legal terms
        for token in spacy_doc:
            lemma = token.lemma_.lower()
            if lemma in LEGAL_TERMS:
                term_counts[lemma] += 1

    
    # Get the most common legal terms as a list of (term, frequency) tuples
    return term_counts.most_common(top_n)


# Example usage
if __name__ == "__main__":
    docs = [
        "The liability clause states that the company is not responsible for damages.",
        "Indemnification shall be provided in case of a legal dispute.",
        "Termination of the contract shall occur in case of breach.",
    ]

    print(extract_frequent_legal_terms(docs))


[('liability', 1), ('indemnification', 1), ('dispute', 1), ('termination', 1), ('breach', 1)]
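
The output above has no count for "damages", although the first document contains the word. A likely cause, not verified here, is that the lemmatizer reduces "damages" to "damage", which is not in LEGAL_TERMS. One possible guard is to look up both the surface form and the lemma in an alias table. The sketch below is only an illustration: `TERM_ALIASES` and `count_with_aliases` are hypothetical names, and the hand-written pairs stand in for `(token.lower_, token.lemma_.lower())` values from spaCy.

```python
from collections import Counter

# Hypothetical alias table: maps a normalized form to the label to report.
# Both the assumed lemma "damage" and the surface form "damages" are covered.
TERM_ALIASES = {
    "liability": "liability",
    "indemnification": "indemnification",
    "damage": "damages",
    "damages": "damages",
    "breach": "breach",
}


def count_with_aliases(token_pairs):
    """Count terms from (lowercased text, lowercased lemma) pairs.

    A token is counted once if either its lemma or its text is a known alias.
    """
    counts = Counter()
    for text, lemma in token_pairs:
        label = TERM_ALIASES.get(lemma) or TERM_ALIASES.get(text)
        if label is not None:
            counts[label] += 1
    return counts


# Hand-written pairs; with spaCy these would come from each token.
pairs = [("liability", "liability"), ("damages", "damage"), ("is", "be")]
print(count_with_aliases(pairs))
```

A simpler alternative is to store the lemma itself ("damage") in LEGAL_TERMS; the alias table is mainly useful when the reported label should differ from the lemma.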


## Why the optimized version is suited to large datasets:


- Streaming Processing: Uses nlp.pipe(), which yields parsed documents lazily, so the parsed results never have to be held in memory all at once.

- Batch Processing: nlp.pipe() groups texts into batches (batch_size=50 here), which is faster than calling nlp() on each document separately.

- Memory Efficient: Accepts any iterable, including a generator, so the full corpus does not have to be loaded into a list.

- Disabling Unnecessary Components: Loads spaCy without the dependency parser and named-entity recognizer, which lemmatization does not need.
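
The point about generators can be illustrated with a small helper that reads one file at a time. This is a sketch under assumptions: `iter_documents` is a hypothetical helper, and the one-document-per-`.txt`-file layout is assumed for illustration only.

```python
import tempfile
from pathlib import Path


def iter_documents(directory, pattern="*.txt", encoding="utf-8"):
    """Yield the text of each matching file, reading one file at a time."""
    for path in sorted(Path(directory).glob(pattern)):
        yield path.read_text(encoding=encoding)


# Small demonstration with two temporary files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("Liability is limited.", encoding="utf-8")
    Path(tmp, "b.txt").write_text("Breach leads to termination.", encoding="utf-8")

    for text in iter_documents(tmp):
        print(text)

    # With spaCy installed, the generator can be passed directly:
    # extract_frequent_legal_terms(iter_documents(tmp))
```

Only one file's text is read per iteration, so the size of the corpus on disk does not determine how much raw text is in memory at once.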

## Optimized Implementation:

In [27]:

# Load the spaCy model without the parser and NER, which lemmatization does not need
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])


# Define key legal terms to track (matched against token lemmas)
LEGAL_TERMS = {
    "liability", "indemnification", "warranty", "termination",
    "confidentiality", "damages", "jurisdiction", "dispute", "breach"
}



def extract_frequent_legal_terms(documents, top_n=10):
    """
    Extracts the most frequent legal terms from a list of legal documents.
    Optimized for large datasets using batch processing.

    :param documents: Iterable of legal documents (each as a string).
    :param top_n: Number of most common terms to return.
    :return: List of (term, frequency) tuples.
    """
    term_counts = Counter()

    
    # Process documents in batches; nlp.pipe() yields parsed docs lazily
    for doc in nlp.pipe(documents, batch_size=50):
        for token in doc:
            lemma = token.lemma_.lower()
            if lemma in LEGAL_TERMS:
                term_counts[lemma] += 1

    return term_counts.most_common(top_n)


# Example usage
if __name__ == "__main__":
    documents = [
        "The liability clause states that the company is not responsible for damages.",
        "Indemnification shall be provided in case of a legal dispute.",
        "Termination of the contract shall occur in case of breach."
    ]

    print(extract_frequent_legal_terms(documents))


[('liability', 1), ('indemnification', 1), ('dispute', 1), ('termination', 1), ('breach', 1)]
