You are building a simple keyword extraction feature for an LLM-powered document analysis system. Given a list of processed document chunks (strings) and a target keyword, count how many chunks contain the keyword (case-insensitive).

Additionally, return a mapping of each unique chunk to the number of times it appears in the list, but only for chunks that contain the target keyword.

In [3]:
def count_keyword_chunks(chunks: list[str], keyword: str) -> dict:
    """Count the chunks that contain a keyword and tally each matching chunk.

    Args:
        chunks: List of document chunk strings
        keyword: Target keyword to search for (case-insensitive)

    Returns:
        Dictionary with:
        - 'total_matching_chunks': int - number of chunks containing keyword
        - 'chunk_frequency': dict - mapping of matching chunks to their frequencies
    """
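    # Lowercase the keyword once instead of on every iteration.
    keyword_lower = keyword.lower()
    total_matching_chunks = 0
    chunk_frequency = {}
    for chunk in chunks:
        # Case-insensitive substring check.
        if keyword_lower in chunk.lower():
            total_matching_chunks += 1
            # Count how often this exact chunk text occurs.
            chunk_frequency[chunk] = chunk_frequency.get(chunk, 0) + 1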
    return {
        'total_matching_chunks': total_matching_chunks,
        'chunk_frequency': chunk_frequency
    }

In [4]:
chunks = [
    "The AI model generates responses",
    "Machine learning requires data",
    "The AI model is trained on data",
    "The AI model generates responses",
    "Deep learning is a subset of AI"
]
keyword = "ai"

result = count_keyword_chunks(chunks, keyword)
print(result)

{'total_matching_chunks': 4, 'chunk_frequency': {'The AI model generates responses': 2, 'The AI model is trained on data': 1, 'Deep learning is a subset of AI': 1}}
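
Note that the check is a substring match, so the keyword "ai" is also found inside words such as "trained". If whole-word matching were wanted instead, one possible sketch is shown here. The helper name `contains_whole_word` is hypothetical and not part of the original function; it assumes "word" means a run of characters not adjacent to other letters, digits, or underscores.

```python
import re

def contains_whole_word(chunk: str, keyword: str) -> bool:
    # Hypothetical helper: True only if keyword occurs as a standalone token.
    # re.escape neutralizes regex metacharacters in the keyword (e.g. "c++").
    # Lookarounds are used instead of \b so keywords ending in symbols still match.
    pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
    return re.search(pattern, chunk, flags=re.IGNORECASE) is not None

print(contains_whole_word("The AI model is trained on data", "ai"))  # True
print(contains_whole_word("Models are trained on data", "ai"))       # False
```

Swapping this check in for the `in` test would change which chunks count as matches, so it is only appropriate if the stricter definition is what the feature needs.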


In [9]:
chunks = ["hello world", "goodbye world", "hello there"]
keyword = "world"

result = count_keyword_chunks(chunks, keyword)
print(result)

{'total_matching_chunks': 2, 'chunk_frequency': {'hello world': 1, 'goodbye world': 1}}


In [10]:
chunks = ["python programming", "java programming", "c++ coding"]
keyword = "rust"

result = count_keyword_chunks(chunks, keyword)
print(result)

{'total_matching_chunks': 0, 'chunk_frequency': {}}


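In the worst case, this function runs in O(n × m × k) time, assuming a naive substring search, where:

n = number of chunks in the list
m = average length of each chunk (in characters)
k = length of the keyword

In practice, Python's built-in substring search is usually faster than this naive bound, so the running time is often closer to O(n × m).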
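
When the list contains many repeated chunks, the number of substring searches can be cut from n to the number of unique chunks by counting duplicates first. The following is a minimal sketch of that idea, not the original solution: the name `count_keyword_chunks_dedup` is hypothetical, and hashing every chunk still costs roughly O(n × m) on its own.

```python
from collections import Counter

def count_keyword_chunks_dedup(chunks: list[str], keyword: str) -> dict:
    # Hypothetical variant: tally duplicates first, then search each unique chunk once.
    keyword_lower = keyword.lower()
    counts = Counter(chunks)  # preserves first-seen order of chunks
    chunk_frequency = {
        chunk: freq
        for chunk, freq in counts.items()
        if keyword_lower in chunk.lower()
    }
    return {
        # Every occurrence of a matching chunk counts toward the total.
        'total_matching_chunks': sum(chunk_frequency.values()),
        'chunk_frequency': chunk_frequency,
    }

chunks = ["hello world", "goodbye world", "hello there", "hello world"]
print(count_keyword_chunks_dedup(chunks, "WORLD"))
```

For inputs with few duplicates this does about the same work as the single loop, so the simpler version is a reasonable default.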