# Topic Extraction

This notebook extracts topics/keywords from the document sections (chunks) so that sections addressing the same topic or keywords can be linked. An OpenAI chat model (gpt-3.5-turbo) extracts the keywords; an OpenAI embedding model (text-embedding-3-small) then embeds them so that similar keywords can be found.

In [None]:
from neo4j import GraphDatabase
from dotenv import load_dotenv
import pandas as pd
import os
import numpy as np
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import PromptTemplate
from IPython.display import clear_output

## Setup Credentials

In [None]:
if os.path.exists('credentials.env'):
    load_dotenv('credentials.env', override=True)

    # Neo4j
    uri = os.getenv('NEO4J_URI')
    username = os.getenv('NEO4J_USERNAME')
    password = os.getenv('NEO4J_PASSWORD')
    database = os.getenv('NEO4J_DATABASE')

    # OpenAI: load_dotenv(..., override=True) has already exported the key to os.environ
    OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
else:
    print("File 'credentials.env' not found.")
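
A variable that is missing from `credentials.env` would otherwise only surface later as a connection or authentication error. A minimal sketch of an up-front check, assuming the variable names used above; `missing_variables` is a hypothetical helper and the URI in the example mapping is a placeholder:

```python
import os

# NEO4J_DATABASE is left out on purpose: the notebook treats it as optional
REQUIRED_VARS = ["NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD", "OPENAI_API_KEY"]

def missing_variables(names, env=None):
    """Return the names that are unset or empty in the given mapping (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Example with an explicit mapping instead of the real environment
example_env = {"NEO4J_URI": "neo4j://localhost:7687", "NEO4J_USERNAME": "neo4j"}
print(missing_variables(REQUIRED_VARS, example_env))
# → ['NEO4J_PASSWORD', 'OPENAI_API_KEY']
```

Calling `missing_variables(REQUIRED_VARS)` without a mapping checks the real environment after `load_dotenv` has run.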

In [None]:
LLM_model = 'gpt-3.5-turbo'

## Connect to Neo4j

In [None]:
class App:
    """Thin wrapper around the Neo4j driver used throughout this notebook."""

    def __init__(self, uri, user, password, database=None):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))
        self.database = database

    def close(self):
        self.driver.close()

    def query(self, query):
        # Pass the target database on every call; execute_query takes it as database_
        return self.driver.execute_query(query, database_=self.database)

    def query_params(self, query, parameters):
        return self.driver.execute_query(query, parameters_=parameters, database_=self.database)

    def count_nodes_in_db(self):
        query = "MATCH (n) RETURN COUNT(n)"
        result = self.query(query)
        (key, value) = result.records[0].items()[0]
        return value

    def count_nodes_with_label_in_db(self, label):
        query = f"MATCH (n:{label}) RETURN COUNT(n)"
        result = self.query(query)
        (key, value) = result.records[0].items()[0]
        return value

    def remove_nodes_relationships(self):
        query ="""
            CALL apoc.periodic.iterate(
                "MATCH (c) RETURN c",
                "WITH c DETACH DELETE c",
                {batchSize: 1000}
            )
        """
        result = self.query(query)

    def remove_all_constraints(self):
        query ="""
            CALL apoc.schema.assert({}, {})
        """
        result = self.query(query)

In [None]:
app = App(uri, username, password, database)

In [None]:
app.count_nodes_in_db()

7772

## Extract Topics/Keywords

Get the list of sections (chunks) with their ids and text

In [None]:
query = """
    MATCH (c:Chunk)
    RETURN c.id as chunk_id, c.chunk AS chunk_text
"""

In [None]:
results = app.query(query)
chunk_list = results.records

Setup the LLM

In [None]:
llm = ChatOpenAI(
    model=LLM_model,  # defined in the setup cell above
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2
)

Create the prompt for extraction

In [None]:
def generate_prompt(chunk_text, num_keywords=5):
    # Plain (non-f) string: PromptTemplate fills both placeholders in a single pass,
    # so the chunk text is inserted exactly once and braces inside it need no stripping.
    prompt_template = """
        You are an insurance expert on insurance policies.
        You are given a piece of text from a policy of multiple insurances.
        Extract the keywords and topics of the piece of text you are given.
        Mainly focus on the type of costs addressed in the text.
        Focus even more on what kind of treatment or sickness is discussed. This is important to find specific types of health costs.
        These keywords/topics should be based on the text. Be quite specific while generating the keywords from each particular text.
        Don't put generic words such as treatment or reimbursement in the list. This will result in too many generic topics. Be specific depending on the given text.
        These keywords/topics should be one or two words that are addressed in the text.
        If you cannot extract any, then just return an empty list. Don't repeat the title or text fully itself.
        Don't generate any other information than given in the text. Be very strict and close to the text.
        These topics must be searchable for internal use so keep them in lower-case and consistent.
        You put these topics in a list of up to {num_keywords} one-to-two word phrases.
        You can provide fewer than {num_keywords} phrases.
        Return the phrases as a pipe separated list.
        Return only the list without a heading.

        Text: {chunk_text}
    """

    prompt = PromptTemplate.from_template(prompt_template)
    theprompt = prompt.format_prompt(chunk_text=chunk_text, num_keywords=num_keywords)

    return theprompt

Run an example for one document

In [None]:
chunk_text = """
for dietetics as recovery care after COVID-19 (coronavirus); and
●Reimbursement of 7 hours of treatment, maximum, during a maximum of 6 months, until 1 January 2025
for extension of dietetics as recovery care after COVID-19 (coronavirus).
We use a variety of rates. See the attached General terms and conditions, section Rates.
Terms and conditions for dietetics as recovery care after COVID-19 (coronavirus) (clause B.22.)
`Nationale-Nederlanden Zorg Vrij' (`Combinatie health insurance policy) valid from 01-01-2024 to
31-12-2024 (inclusive)
Conditional healthcarePage 5 of 205
"""
theprompt = generate_prompt(chunk_text)
llm.invoke(theprompt.to_messages()).pretty_print()


dietetics|recovery care|COVID-19|treatment|reimbursement
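
The response is a single pipe-separated string, so it has to be split and cleaned before the keywords are stored. A minimal sketch of that step, assuming the response format shown above; `parse_keywords` is a hypothetical helper, and lower-casing and de-duplicating are design choices of this sketch rather than something the notebook does:

```python
def parse_keywords(response, max_keywords=5):
    """Split a pipe-separated LLM response into clean, unique, lower-case keywords."""
    keywords = []
    for part in response.split("|"):
        word = part.strip().lower()
        # Skip empty entries and the literal "empty list" placeholder
        if not word or word == "empty list":
            continue
        if word not in keywords:
            keywords.append(word)
    return keywords[:max_keywords]

print(parse_keywords("dietetics| recovery care |COVID-19||Dietetics"))
# → ['dietetics', 'recovery care', 'covid-19']
```

Stripping before filtering matters: an entry that consists only of whitespace would otherwise pass the emptiness check and be stored as a keyword.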


Extract the topics/keywords from all documents.

In [None]:
app.query("CREATE CONSTRAINT unique_keyword IF NOT EXISTS FOR (k:KeyWord) REQUIRE k.word IS UNIQUE")

EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0x17516db50>, keys=[])

In [None]:
len(chunk_list)

2880

In [None]:
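# Parameterised Cypher: keyword text is never interpolated into the query string,
# so quotes or other special characters in a keyword cannot break the query.
keyword_query = """
    MERGE (k:KeyWord {word: $word})
    WITH k
    MATCH (c:Chunk {id: $chunk_id})
    MERGE (c)-[:HAS_KEYWORD]->(k)
"""

for index, chunk in enumerate(chunk_list):
    clear_output(wait=True)
    chunk_id = chunk['chunk_id']
    text = chunk['chunk_text'][:1000]  # only the first 1000 characters are sent to the LLM
    prompt = generate_prompt(text)
    response = llm.invoke(prompt.to_messages()).content
    for word in response.split('|'):
        word = word.strip()  # strip first so whitespace-only entries are dropped
        if word and word != 'empty list':
            app.query_params(keyword_query, {'chunk_id': chunk_id, 'word': word})
    print("Progress: ", np.round((index + 1) / len(chunk_list) * 100, 2), "%")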

Progress:  100.0 %
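
The loop above sends one query per keyword. A hedged alternative is to send all keywords of a chunk in a single parameterised query with `UNWIND`. The sketch below only builds the parameters; `build_rows` and `write_keywords` are hypothetical helpers, the chunk id `42` is made up, and `write_keywords` needs a live Neo4j driver, so it is defined but not called here:

```python
KEYWORD_QUERY = """
    UNWIND $rows AS row
    MATCH (c:Chunk {id: row.chunk_id})
    MERGE (k:KeyWord {word: row.word})
    MERGE (c)-[:HAS_KEYWORD]->(k)
"""

def build_rows(chunk_id, keywords):
    """Build one parameter row per (chunk, keyword) pair (hypothetical helper)."""
    return [{"chunk_id": chunk_id, "word": word} for word in keywords]

def write_keywords(driver, rows, database=None):
    """Send all rows in one parameterised query; requires a live neo4j driver."""
    return driver.execute_query(
        KEYWORD_QUERY, parameters_={"rows": rows}, database_=database
    )

rows = build_rows(42, ["dietetics", "recovery care"])
print(rows)
# → [{'chunk_id': 42, 'word': 'dietetics'}, {'chunk_id': 42, 'word': 'recovery care'}]
```

Because the keywords travel as parameters rather than as part of the query text, a keyword containing a double quote does not need escaping.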


### Inspect the most frequently extracted keywords

In [None]:
query = """
    MATCH (k:KeyWord)<-[x:HAS_KEYWORD]-(c:Chunk)
    WITH k, COUNT(x) as no_chunks
    RETURN k.word as word, no_chunks
    ORDER BY no_chunks DESC
"""

In [None]:
result = app.query(query)
data = [dict(record) for record in result.records]
df = pd.DataFrame.from_dict(data)
df

index,word,no_chunks
0,reimbursement,277
1,vergoeding,222
2,eigen risico,189
3,zorgverlener,128
4,akkoordverklaring,113
...,...,...
4833,scientific treatment,1
4834,custody healthcare,1
4835,nuclear reactions,1
4836,full coverage,1


### Create embeddings for Keywords/Topics

In [None]:
model = 'text-embedding-3-small'

In [None]:
embeddings_model = OpenAIEmbeddings(
    model = model,
    openai_api_key = OPENAI_API_KEY
)

In [None]:
df['embedding'] = df['word'].apply(lambda x: embeddings_model.embed_query(x))

In [None]:
df.head()

index,word,no_chunks,embedding
0,reimbursement,277,"[0.01502708392308506, -0.015148053401063314, 0..."
1,vergoeding,222,"[-0.0393074150783278, 0.019744070042352777, 0...."
2,eigen risico,189,"[0.018740978576694545, 0.01236120177740436, 0...."
3,zorgverlener,128,"[-0.027650795667909022, 0.0016075391282090446,..."
4,akkoordverklaring,113,"[-0.05345012671337905, 0.03983106969785649, -0..."


Write the ids, chunk counts and embeddings to the keyword nodes

In [None]:
query = """
    MATCH(w:KeyWord {word: $word})
    SET
        w.id = $id,
        w.no_chunks = $no_chunks,
        w.embedding = $embedding
    RETURN w
"""

In [None]:
for index, row in df.iterrows():
    d = {
        'word': row['word'],
        'no_chunks': row['no_chunks'],
        'embedding': row['embedding'],
        'id': index,
    }
    app.query_params(query, d)

### Create vector index

In [None]:
batch_size = 100
nr_batches = int(app.count_nodes_with_label_in_db('KeyWord') / batch_size) + 1
print(f'Running {nr_batches} batches with size {batch_size}')

Running 49 batches with size 100
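
Because the keyword ids come from the 0-based DataFrame index, the batch boundaries have to start at 0 and cover every id exactly once. A minimal sketch of how such ranges can be derived; `batch_ranges` is a hypothetical helper, and the total of 4837 is taken from the row count of the keyword table above:

```python
def batch_ranges(total, batch_size):
    """Yield half-open (start, end) id ranges that cover 0 .. total-1 exactly once."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)

ranges = list(batch_ranges(4837, 100))
print(len(ranges), ranges[0], ranges[-1])
# → 49 (0, 100) (4800, 4837)
```

Each pair maps to a filter of the form `w.id >= start AND w.id < end`, which avoids both skipping id 0 and running an empty trailing batch when the total is an exact multiple of the batch size.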


In [None]:
for batch in range(nr_batches):
    # Keyword ids are the 0-based DataFrame index, so batch b covers
    # the ids from b*batch_size up to (but not including) (b+1)*batch_size.
    query = f"""
        MATCH (w:KeyWord)
        WHERE w.id >= {batch * batch_size} AND w.id < {(batch + 1) * batch_size}
        CALL db.create.setNodeVectorProperty(w, "embedding", w.embedding)
        RETURN count(w) AS propertySetCount
    """
    app.query(query)
    if batch % 10 == 0 and batch != 0:
        print(f"Finished: {batch}/{nr_batches} batches ({round(batch/nr_batches*100,2)}%)")

Finished: 10/49 batches (20.41%)
Finished: 20/49 batches (40.82%)
Finished: 30/49 batches (61.22%)
Finished: 40/49 batches (81.63%)


In [None]:
query = """
    CREATE VECTOR INDEX `keyword-embeddings` IF NOT EXISTS
    FOR (k:KeyWord) ON (k.embedding)
    OPTIONS {
        indexConfig: {
            `vector.dimensions`: 1536,
            `vector.similarity_function`: 'cosine'
        }
    }
"""

In [None]:
app.query(query)

EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0x133fb5e50>, keys=[])

## Observe similar keywords

In [None]:
query= """
    MATCH (k:KeyWord)
    WITH k LIMIT 10
    CALL db.index.vector.queryNodes("keyword-embeddings", 10, k.embedding) YIELD node, score
    RETURN k.word, node.word as similar_words, score ORDER BY score DESC
"""

In [None]:
result = app.query(query)

In [None]:
data = [dict(record) for record in result.records]
df = pd.DataFrame.from_dict(data)
df

index,k.word,similar_words,score
0,zorgkosten,zorgkosten,1.000000
1,ziektekosten,ziektekosten,1.000000
2,medicijnen,medicijnen,1.000000
3,vergoedingen aanvullende module(s),vergoedingen aanvullende module(s),1.000000
4,vergoedingen,vergoedingen,1.000000
...,...,...,...
95,zorgnota's declareren,kosten-declaratie,0.814468
96,zorgnota's declareren,kosten declaratie,0.811856
97,zorgnota's declareren,zorgaanvraag,0.811047
98,zorgnota's declareren,indicatie zorg,0.807821
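
Each keyword's nearest neighbour is itself, which is why the first rows all show a score of 1.0. A minimal sketch of plain cosine similarity on toy vectors (not the actual embeddings) illustrates this; `cosine_similarity` is a hypothetical helper, and the score reported by the Neo4j index may be a rescaled version of this raw value, so the sketch is meant to illustrate the ordering rather than the exact numbers:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v = [0.2, -0.1, 0.4]            # toy vector, made up for illustration
same = cosine_similarity(v, v)  # a vector compared with itself
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(round(same, 6), orthogonal)
```

To list only genuinely different keywords, the self-match can be filtered out of the query result, for example by dropping rows where `k.word` equals `similar_words`.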
