# Installation Requirements

*   [Neo4j](https://neo4j.com/) is used as the underlying graph vector database.
*   LangChain is used to orchestrate the database and the LLM that produces the final output.
*   This notebook aims to parse all blogs and construct a knowledge graph of the text data with embeddings.



In [None]:
!pip install neo4j
!pip install langchain
!pip install langchain_community
!pip install transformers torch
!pip install emoji
!pip install python-dotenv

Collecting neo4j
  Downloading neo4j-5.25.0-py3-none-any.whl.metadata (5.7 kB)
Downloading neo4j-5.25.0-py3-none-any.whl (296 kB)
Installing collected packages: neo4j
Successfully installed neo4j-5.25.0
Collecting langchain
  Downloading langchain-0.3.4-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-core<0.4.0,>=0.3.12 (from langchain)
  Downloading langchain_core-0.3.12-py3-none-any.whl.metadata (6.3 kB)
Collecting langchain-text-splitters<0.4.0,>=0.3.0 (from langchain)
  Downloading langchain_text_splitters-0.3.0-py3-none-any.whl.metadata (2.3 kB)
Collecting langsmith<0.2.0,>=0.1.17 (from langchain)
  Downloading langsmith-0.1.137-py3-none-any.whl.metadata (13 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain-core<0.4.0,>=0.3.12->langchai

# Text Preprocessing

In [None]:
WORDLIMIT = 50 # text blocks with 50 words or fewer are ignored

In [None]:
import emoji

def remove_emoji(text):
    return emoji.replace_emoji(text, replace='')  # Replace emoji with an empty string

#basic function to drop short text blocks and remove emoji and non-breaking spaces
def text_cleaning(text_list):
    new_list = []
    for text in text_list:
        if len(text.split()) > WORDLIMIT:
            text = remove_emoji(text)
            text = text.replace('\xa0', '')
            new_list.append(text)
    return new_list
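
A quick check of the filtering rule, as a minimal sketch: it uses only the standard library, so the emoji step is left out, and the names `clean_blocks`, `word_limit`, and `nbsp_replacement` are illustrative rather than part of the notebook. It also shows a side effect worth knowing about: deleting `\xa0` outright fuses the two words around it, whereas replacing it with a regular space keeps them apart.

```python
def clean_blocks(blocks, word_limit=50, nbsp_replacement=""):
    """Keep blocks with more than `word_limit` words and replace non-breaking spaces."""
    kept = []
    for block in blocks:
        # str.split() with no arguments treats '\xa0' as whitespace,
        # so the word count is taken before the replacement.
        if len(block.split()) > word_limit:
            kept.append(block.replace("\xa0", nbsp_replacement))
    return kept


short_block = "for kort"                       # 2 words, dropped
long_block = "ansvarlig\xa0KI " + "ord " * 60  # 62 words, kept

joined = clean_blocks([short_block, long_block])
spaced = clean_blocks([short_block, long_block], nbsp_replacement=" ")

print(len(joined))              # only the long block survives
print(repr(joined[0][:12]))     # words fused: 'ansvarligKI '
print(repr(spaced[0][:12]))     # words kept apart: 'ansvarlig KI'
```

Whether fused words matter depends on the tokenizer; the sketch only makes the difference visible.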

# Create KG

## Setting up vector database

In [None]:
from langchain_community.graphs import Neo4jGraph
from langchain_community.vectorstores import Neo4jVector
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQAWithSourcesChain
from neo4j import GraphDatabase
from dotenv import load_dotenv
import os

In [None]:
#Parameters to connect Neo4j graph database
load_dotenv('.env', override=True)
NEO4J_URI = os.getenv('NEO4J_URI')
NEO4J_USERNAME = os.getenv('NEO4J_USERNAME')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')
NEO4J_DATABASE = os.getenv('NEO4J_DATABASE')
AUTH = (NEO4J_USERNAME, NEO4J_PASSWORD)  # the driver authenticates with username and password, not the database name



with GraphDatabase.driver(NEO4J_URI, auth=AUTH) as driver:
    driver.verify_connectivity()

In [None]:
kg = Neo4jGraph(
    url=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD, database=NEO4J_DATABASE
)
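
If a variable is missing from `.env`, `os.getenv` returns `None` and the problem only surfaces later as a driver error. A small guard can fail earlier with a clearer message. The sketch below uses a hypothetical helper, `missing_settings`, which takes the environment as an argument so it can be tried without a real `.env` file; the URI in the example is a placeholder value.

```python
import os

REQUIRED_SETTINGS = ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD", "NEO4J_DATABASE")


def missing_settings(env=None, required=REQUIRED_SETTINGS):
    """Return the names of required settings that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an in-memory mapping instead of the real environment.
example_env = {"NEO4J_URI": "neo4j://localhost:7687", "NEO4J_USERNAME": "neo4j"}
missing = missing_settings(example_env)
print(missing)  # ['NEO4J_PASSWORD', 'NEO4J_DATABASE']
```

In the notebook, the same check could run right after `load_dotenv` and raise an error if the returned list is not empty.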

## Embeddings

In [None]:
import torch
from transformers import BertTokenizer, BertModel

# Load the NB-BERT model and tokenizer
model_name = "NbAiLab/nb-bert-base"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

# Ensure the model is in evaluation mode
model.eval()

def get_embeddings(texts):
    # Tokenize the input texts
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    # Move inputs to the appropriate device (GPU if available)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    inputs = {key: val.to(device) for key, val in inputs.items()}
    model.to(device)

    # Generate embeddings
    with torch.no_grad():
        outputs = model(**inputs)

    # Get the embeddings from the last hidden state
    embeddings = outputs.last_hidden_state

    # Return the mean of the embeddings for each sentence
    return torch.mean(embeddings, dim=1).cpu().numpy()
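
`torch.mean(embeddings, dim=1)` averages over every token position. That is fine for a single text, but when several texts are padded to a common length the padding positions are averaged in as well. A common alternative weights each position by the tokenizer's `attention_mask`. The sketch below illustrates only the arithmetic, in plain Python on nested lists; `masked_mean` is a hypothetical helper, not part of the notebook, and the padding vector is set to zeros purely for readability.

```python
def masked_mean(token_vectors, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            kept += 1
            for i, value in enumerate(vector):
                totals[i] += value
    if kept == 0:
        raise ValueError("attention mask selects no tokens")
    return [total / kept for total in totals]


# Two real tokens followed by one padding position.
tokens = [[1.0, 3.0], [3.0, 5.0], [0.0, 0.0]]
mask = [1, 1, 0]

plain_mean = [sum(column) / len(tokens) for column in zip(*tokens)]
pooled = masked_mean(tokens, mask)
print(plain_mean)  # the padding position pulls the average toward zero
print(pooled)      # [2.0, 4.0]
```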



## Create nodes (Blog, Chunk) and relationships (NEXT, PART_OF)


Rules to construct the knowledge graph:
*   A Blog node is created for each blog.
*   Blog nodes have the properties title, time, label, source, fileId, and intro.
*   Each blog is split into several chunks, and a Chunk node is created for each chunk.
*   Chunk nodes have the properties chunkId, chunkSeqId, title, text, and textEmbedding (a vector property), plus formId, which holds the fileId of the blog the chunk belongs to.
*   A PART_OF relationship connects each Chunk node to its Blog node.
*   Consecutive chunks of the same blog are connected with a NEXT relationship.
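
The rules above can be sketched in memory before touching the database. The example below builds the `PART_OF` and `NEXT` edges as plain tuples from a handful of made-up chunk records; `build_edges` is a hypothetical helper, and the record keys mirror the `fileId`, `chunkId`, and `chunkSeqId` fields used in the notebook's code.

```python
from collections import defaultdict


def build_edges(chunks):
    """Return (part_of, next_edges) for chunk records grouped by blog."""
    by_blog = defaultdict(list)
    for chunk in chunks:
        by_blog[chunk["fileId"]].append(chunk)

    part_of, next_edges = [], []
    for file_id, blog_chunks in by_blog.items():
        ordered = sorted(blog_chunks, key=lambda c: c["chunkSeqId"])
        # Every chunk points at the blog it was cut from.
        part_of.extend((c["chunkId"], "PART_OF", file_id) for c in ordered)
        # Consecutive chunks of the same blog are chained together.
        next_edges.extend(
            (a["chunkId"], "NEXT", b["chunkId"]) for a, b in zip(ordered, ordered[1:])
        )
    return part_of, next_edges


chunks = [
    {"fileId": "0", "chunkId": "0-chunk0001", "chunkSeqId": 1},
    {"fileId": "0", "chunkId": "0-chunk0000", "chunkSeqId": 0},
    {"fileId": "1", "chunkId": "1-chunk0000", "chunkSeqId": 0},
]
part_of, next_edges = build_edges(chunks)
print(len(part_of))  # 3
print(next_edges)    # [('0-chunk0000', 'NEXT', '0-chunk0001')]
```

A blog with n chunks therefore contributes n `PART_OF` edges and n - 1 `NEXT` edges.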



Each JSON file looks like this:

    {
        "label": "Prosess og rådgivning",
        "source": "https://www.kantega.no/blogg/komplekst-prosjekt-hva-na",
        "title": "Komplekst prosjekt, hva nå?",
        "time": "2019-10-25 03:00",
        "intro": "Vi har testet ut et samtaleverktøy for å snakke om kompleksitetene i prosjekter... ",
        "text": [
            "Vi har testet ut et samtaleverktøy for å snakke om kompleksitetene i prosjekter...",
            "Introduksjon og bakgrunnDet var en tidlig mai-morgen og ..."
        ]
    }




In [None]:
import json

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 512,
    chunk_overlap  = 50,
    length_function = len,
    is_separator_regex = False,
)

def chunk_data_from_file(filepath, file_id):
    blog_with_metadata = {}
    chunks_with_metadata = [] # use this to accumulate chunk records
    doc_title = os.path.basename(filepath)
    # Open the JSON file (UTF-8, since the blogs contain Norwegian characters)
    with open(filepath, 'r', encoding='utf-8') as file:
        data = json.load(file)  # Load the JSON data into a Python object
        blog_with_metadata['label'] = data['label']
        blog_with_metadata['title'] = data['title']
        blog_with_metadata['source'] = data['source']
        blog_with_metadata['time'] = data['time']
        blog_with_metadata['intro'] = data['intro']
        blog_with_metadata['fileId'] = file_id

        text_list = text_cleaning(data['text'])
        chunk_seq_id = 0
        for text in text_list:
            doc_chunks = text_splitter.split_text(text)
            for chunk in doc_chunks:
                chunks_with_metadata.append({
                    'text': chunk,
                    'chunkSeqId': chunk_seq_id,
                    # constructed metadata...
                    'fileId': f'{file_id}', # pulled from the filename
                    'chunkId': f'{file_id}-chunk{chunk_seq_id:04d}',
                    # metadata from file...
                    'title': doc_title.replace(".json",""),
                    'textEmbedding': get_embeddings(chunk),
                })
                chunk_seq_id += 1
            print(f'\tSplit into {chunk_seq_id} chunks')
    return blog_with_metadata, chunks_with_metadata
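
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-width splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers to break on separators such as paragraph breaks and spaces); it only shows how overlapping character windows and the `chunkId` naming scheme fit together. `split_fixed` and `make_records` are illustrative names.

```python
def split_fixed(text, chunk_size=512, chunk_overlap=50):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


def make_records(file_id, text, **splitter_kwargs):
    """Attach the notebook's ID scheme: '<fileId>-chunk<4-digit sequence>'."""
    return [
        {"chunkId": f"{file_id}-chunk{seq:04d}", "chunkSeqId": seq, "text": chunk}
        for seq, chunk in enumerate(split_fixed(text, **splitter_kwargs))
    ]


records = make_records(7, "abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for record in records:
    print(record["chunkId"], record["text"])
```

With a window of 10 and an overlap of 3, each chunk starts 7 characters after the previous one, so the last three characters of one chunk reappear at the start of the next.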

In [None]:
import os
import glob


filepath = '/content/drive/MyDrive/Projects/rag/json_data2/'
doc_files = glob.glob(os.path.join(filepath, "*.json"))

file_id = 0
all_blogs = []
all_chunks = []
for doc in doc_files:
    blog, doc_chunks = chunk_data_from_file(doc, file_id)
    all_blogs.append(blog)
    all_chunks.extend(doc_chunks)
    file_id += 1


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


	Split into 1 chunks
	Split into 20 chunks
	Split into 8 chunks
	Split into 1 chunks
	Split into 18 chunks
	Split into 11 chunks
	Split into 1 chunks
	Split into 22 chunks
	Split into 33 chunks
	Split into 25 chunks
	Split into 1 chunks
	Split into 16 chunks
	Split into 14 chunks
	Split into 20 chunks
	Split into 14 chunks
	Split into 1 chunks
	Split into 27 chunks
	Split into 28 chunks
	Split into 14 chunks
	Split into 5 chunks
	Split into 14 chunks
	Split into 1 chunks
	Split into 41 chunks
	Split into 1 chunks
	Split into 25 chunks
	Split into 16 chunks
	Split into 15 chunks
	Split into 24 chunks
	Split into 1 chunks
	Split into 9 chunks
	Split into 1 chunks
	Split into 35 chunks
	Split into 25 chunks
	Split into 5 chunks
	Split into 17 chunks
	Split into 1 chunks
	Split into 23 chunks
	Split into 9 chunks
	Split into 1 chunks
	Split into 11 chunks
	Split into 16 chunks
	Split into 9 chunks
	Split into 30 chunks
	Split into 22 chunks
	Split into 1 chunks
	Split into 17 chunks
	Split

In [None]:
len(all_chunks)

4020

In [None]:
#Cypher statement to create blog nodes with properties
#fileId is stored as a string, so the MERGE key is converted as well;
#matching on the integer would miss existing nodes and create duplicates on a re-run
merge_blog_node_query = """
MERGE(mergedBlog:Blog {fileId: toString($blogParam.fileId)})
    ON CREATE SET
        mergedBlog.title = $blogParam.title,
        mergedBlog.label = $blogParam.label,
        mergedBlog.source = $blogParam.source,
        mergedBlog.time = $blogParam.time,
        mergedBlog.intro = $blogParam.intro
RETURN mergedBlog
"""

In [None]:
all_blogs[0]

{'label': 'NAN',
 'title': 'Kredittorakel basert på språkmodeller (2024, Trondheim)',
 'source': 'https://www.kantega.no/blogg/kredittorakel-basert-pa-sprakmodeller-2024-trondheim',
 'time': '2024-08-30 10:00',
 'intro': 'Hva må man ha i bakhodet når man designer løsninger med AI-menneske-interaksjon? Hvordan kan man bruke RAG for å forbedre løsninger som benytter språkmodeller? Hvordan sperrer man et mastercard? Og viktigst av alt, hva er egentlig forskjellen på en terrasse, balkong og altan? Det er disse spørsmålene som har preget sommeren vår som sommerstudenter hos Kantega Trondheim, og vi skal nå dele noe av det vi har funnet ut av.',
 'fileId': 0}

In [None]:
#create a blog node for each blog
blog_count = 0
for blog in all_blogs:
    print(f"Creating `:Blog` node for blog ID {blog['fileId']}")
    kg.query(merge_blog_node_query,
            params={
                'blogParam': blog
            })
    blog_count += 1
print(f"Created {blog_count} nodes")

Creating `:Blog` node for blog ID 0
Creating `:Blog` node for blog ID 1
Creating `:Blog` node for blog ID 2
Creating `:Blog` node for blog ID 3
Creating `:Blog` node for blog ID 4
Creating `:Blog` node for blog ID 5
Creating `:Blog` node for blog ID 6
Creating `:Blog` node for blog ID 7
Creating `:Blog` node for blog ID 8
Creating `:Blog` node for blog ID 9
Creating `:Blog` node for blog ID 10
Creating `:Blog` node for blog ID 11
Creating `:Blog` node for blog ID 12
Creating `:Blog` node for blog ID 13
Creating `:Blog` node for blog ID 14
Creating `:Blog` node for blog ID 15
Creating `:Blog` node for blog ID 16
Creating `:Blog` node for blog ID 17
Creating `:Blog` node for blog ID 18
Creating `:Blog` node for blog ID 19
Creating `:Blog` node for blog ID 20
Creating `:Blog` node for blog ID 21
Creating `:Blog` node for blog ID 22
Creating `:Blog` node for blog ID 23
Creating `:Blog` node for blog ID 24
Creating `:Blog` node for blog ID 25
Creating `:Blog` node for blog ID 26
Creating `:

In [None]:
#Cypher statement to create chunk nodes with properties
merge_chunk_node_query = """
MERGE(mergedChunk:Chunk {chunkId: $chunkParam.chunkId})
    ON CREATE SET
        mergedChunk.title = $chunkParam.title,
        mergedChunk.formId = $chunkParam.fileId,
        mergedChunk.chunkId = $chunkParam.chunkId,
        mergedChunk.chunkSeqId = $chunkParam.chunkSeqId,
        mergedChunk.text = $chunkParam.text
RETURN mergedChunk
"""

In [None]:
#create a chunk node for each chunk
node_count = 0
for chunk in all_chunks:
    print(f"Creating `:Chunk` node for chunk ID {chunk['chunkId']}")
    kg.query(merge_chunk_node_query,
            params={
                'chunkParam': chunk
            })
    node_count += 1
print(f"Created {node_count} nodes")

Creating `:Chunk` node for chunk ID 0-chunk0000
Creating `:Chunk` node for chunk ID 0-chunk0001
Creating `:Chunk` node for chunk ID 0-chunk0002
Creating `:Chunk` node for chunk ID 0-chunk0003
Creating `:Chunk` node for chunk ID 0-chunk0004
Creating `:Chunk` node for chunk ID 0-chunk0005
Creating `:Chunk` node for chunk ID 0-chunk0006
Creating `:Chunk` node for chunk ID 0-chunk0007
Creating `:Chunk` node for chunk ID 0-chunk0008
Creating `:Chunk` node for chunk ID 0-chunk0009
Creating `:Chunk` node for chunk ID 0-chunk0010
Creating `:Chunk` node for chunk ID 0-chunk0011
Creating `:Chunk` node for chunk ID 0-chunk0012
Creating `:Chunk` node for chunk ID 0-chunk0013
Creating `:Chunk` node for chunk ID 0-chunk0014
Creating `:Chunk` node for chunk ID 0-chunk0015
Creating `:Chunk` node for chunk ID 0-chunk0016
Creating `:Chunk` node for chunk ID 0-chunk0017
Creating `:Chunk` node for chunk ID 0-chunk0018
Creating `:Chunk` node for chunk ID 0-chunk0019
Creating `:Chunk` node for chunk ID 1-ch
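
The loop above issues one query per chunk, which means several thousand round trips for this data set. A common way to cut that down is to send the records in batches and let Cypher's `UNWIND` iterate over each batch. The sketch below shows only the batching side and the shape such a query could take; `batched` and `BATCH_MERGE_QUERY` are illustrative, the query has not been run against the notebook's database, and the commented-out call assumes the `kg` object defined earlier.

```python
from itertools import islice

# Hypothetical batched variant of the per-chunk MERGE statement.
BATCH_MERGE_QUERY = """
UNWIND $rows AS row
MERGE (c:Chunk {chunkId: row.chunkId})
    ON CREATE SET
        c.title = row.title,
        c.formId = row.fileId,
        c.chunkSeqId = row.chunkSeqId,
        c.text = row.text
"""


def batched(items, size):
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


rows = [{"chunkId": f"0-chunk{i:04d}"} for i in range(5)]
for batch in batched(rows, 2):
    print([row["chunkId"] for row in batch])
    # kg.query(BATCH_MERGE_QUERY, params={"rows": batch})  # assumes `kg` from the notebook
```

The batch size is a tuning knob: larger batches mean fewer round trips but bigger transactions.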

In [None]:
#nodes to be queried need to have a source property
#note: this only matches once the PART_OF relationships (created in a later cell) exist
kg.query("""MATCH (chunk:Chunk)-[:PART_OF]->(blog:Blog)
            SET
            chunk.source = blog.source""")

[]

In [None]:
cypher = """
  MATCH (anyBlog:Blog)
  RETURN anyBlog.fileId AS fileId
  """
file_list = kg.query(cypher)


In [None]:
#Create PART_OF relationship between blog node and chunk nodes belonging to it
for file in file_list:
    cypher = """
      MATCH (from_same_file:Chunk), (blog:Blog{fileId: $formIdParam})
      WHERE from_same_file.formId = $formIdParam
      MERGE (from_same_file)-[:PART_OF]->(blog)
      WITH collect(from_same_file) as file_chunk_list
      RETURN size(file_chunk_list)
    """
    result = kg.query(cypher, params={'formIdParam': file['fileId']})  # named `result` to avoid shadowing the built-in len()
    print(f"Linked {result} chunks from file {file}")
kg.refresh_schema()
print(kg.schema)

Linked [{'size(file_chunk_list)': 20}] chunks from file {'fileId': '0'}
Linked [{'size(file_chunk_list)': 8}] chunks from file {'fileId': '1'}
Linked [{'size(file_chunk_list)': 18}] chunks from file {'fileId': '2'}
Linked [{'size(file_chunk_list)': 11}] chunks from file {'fileId': '3'}
Linked [{'size(file_chunk_list)': 22}] chunks from file {'fileId': '4'}
Linked [{'size(file_chunk_list)': 33}] chunks from file {'fileId': '5'}
Linked [{'size(file_chunk_list)': 25}] chunks from file {'fileId': '6'}
Linked [{'size(file_chunk_list)': 16}] chunks from file {'fileId': '7'}
Linked [{'size(file_chunk_list)': 14}] chunks from file {'fileId': '8'}
Linked [{'size(file_chunk_list)': 20}] chunks from file {'fileId': '9'}
Linked [{'size(file_chunk_list)': 14}] chunks from file {'fileId': '10'}
Linked [{'size(file_chunk_list)': 27}] chunks from file {'fileId': '11'}
Linked [{'size(file_chunk_list)': 28}] chunks from file {'fileId': '12'}
Linked [{'size(file_chunk_list)': 14}] chunks from file {'file

In [None]:
#create NEXT relationship between consecutive chunks
for file in file_list:
    cypher = """
      MATCH (from_same_file:Chunk)
      WHERE from_same_file.formId = $formIdParam
      WITH from_same_file
        ORDER BY from_same_file.chunkSeqId ASC
      WITH collect(from_same_file) as file_chunk_list
        CALL apoc.nodes.link(
            file_chunk_list,
            "NEXT",
            {avoidDuplicates: true}
        )
      RETURN size(file_chunk_list)
    """
    result = kg.query(cypher, params={'formIdParam': file['fileId']})  # named `result` to avoid shadowing the built-in len()
    print(f"Linked {result} chunks from file {file}")
kg.refresh_schema()
print(kg.schema)

Linked [{'size(file_chunk_list)': 20}] chunks from file {'fileId': '0'}
Linked [{'size(file_chunk_list)': 8}] chunks from file {'fileId': '1'}
Linked [{'size(file_chunk_list)': 18}] chunks from file {'fileId': '2'}
Linked [{'size(file_chunk_list)': 11}] chunks from file {'fileId': '3'}
Linked [{'size(file_chunk_list)': 22}] chunks from file {'fileId': '4'}
Linked [{'size(file_chunk_list)': 33}] chunks from file {'fileId': '5'}
Linked [{'size(file_chunk_list)': 25}] chunks from file {'fileId': '6'}
Linked [{'size(file_chunk_list)': 16}] chunks from file {'fileId': '7'}
Linked [{'size(file_chunk_list)': 14}] chunks from file {'fileId': '8'}
Linked [{'size(file_chunk_list)': 20}] chunks from file {'fileId': '9'}
Linked [{'size(file_chunk_list)': 14}] chunks from file {'fileId': '10'}
Linked [{'size(file_chunk_list)': 27}] chunks from file {'fileId': '11'}
Linked [{'size(file_chunk_list)': 28}] chunks from file {'fileId': '12'}
Linked [{'size(file_chunk_list)': 14}] chunks from file {'file

In [None]:
# create vector index
kg.query("""
         CREATE VECTOR INDEX `vector_chunks` IF NOT EXISTS
          FOR (c:Chunk) ON (c.textEmbedding)
          OPTIONS { indexConfig: {
            `vector.dimensions`: 768,
            `vector.similarity_function`: 'cosine'
         }}
""")

[]

In [None]:
kg.query("SHOW INDEXES")

[{'id': 0,
  'name': 'index_343aff4e',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'LOOKUP',
  'entityType': 'NODE',
  'labelsOrTypes': None,
  'properties': None,
  'indexProvider': 'token-lookup-1.0',
  'owningConstraint': None,
  'lastRead': neo4j.time.DateTime(2024, 10, 20, 22, 31, 2, 551000000, tzinfo=<UTC>),
  'readCount': 22811},
 {'id': 1,
  'name': 'index_f7700477',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'LOOKUP',
  'entityType': 'RELATIONSHIP',
  'labelsOrTypes': None,
  'properties': None,
  'indexProvider': 'token-lookup-1.0',
  'owningConstraint': None,
  'lastRead': None,
  'readCount': 0},
 {'id': 2,
  'name': 'vector_chunks',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'VECTOR',
  'entityType': 'NODE',
  'labelsOrTypes': ['Chunk'],
  'properties': ['textEmbedding'],
  'indexProvider': 'vector-2.0',
  'owningConstraint': None,
  'lastRead': neo4j.time.DateTime(2024, 10, 19, 13, 28, 3, 190000000, tzinfo=<UTC>),
  're

In [None]:
#set vector property for text embeddings
node_count = 0
for chunk in all_chunks:
    print(f"Creating `:Chunk` node for chunk ID {chunk['chunkId']}")
    kg.query("""
        MATCH (chunk:Chunk) WHERE chunk.chunkId = $chunkParam.chunkId AND chunk.textEmbedding IS NULL
        CALL db.create.setNodeVectorProperty(chunk, "textEmbedding", $chunkParam.textEmbedding[0])
        """, params={'chunkParam': chunk})
    node_count += 1
print(f"Created {node_count} nodes")

Creating `:Chunk` node for chunk ID 0-chunk0000
Creating `:Chunk` node for chunk ID 0-chunk0001
Creating `:Chunk` node for chunk ID 0-chunk0002
Creating `:Chunk` node for chunk ID 0-chunk0003
Creating `:Chunk` node for chunk ID 0-chunk0004
Creating `:Chunk` node for chunk ID 0-chunk0005
Creating `:Chunk` node for chunk ID 0-chunk0006
Creating `:Chunk` node for chunk ID 0-chunk0007
Creating `:Chunk` node for chunk ID 0-chunk0008
Creating `:Chunk` node for chunk ID 0-chunk0009
Creating `:Chunk` node for chunk ID 0-chunk0010
Creating `:Chunk` node for chunk ID 0-chunk0011
Creating `:Chunk` node for chunk ID 0-chunk0012
Creating `:Chunk` node for chunk ID 0-chunk0013
Creating `:Chunk` node for chunk ID 0-chunk0014
Creating `:Chunk` node for chunk ID 0-chunk0015
Creating `:Chunk` node for chunk ID 0-chunk0016
Creating `:Chunk` node for chunk ID 0-chunk0017
Creating `:Chunk` node for chunk ID 0-chunk0018
Creating `:Chunk` node for chunk ID 0-chunk0019
Creating `:Chunk` node for chunk ID 1-ch

In [None]:
emd = kg.query("""
              MATCH (chunk:Chunk)
              RETURN chunk.chunkId AS id, chunk.textEmbedding AS emb""")


In [None]:
emd

# Vector Similarity Search

With the embeddings stored and indexed in Neo4j, we can search for the chunks whose embeddings are closest to the embedding of a question.

In [None]:
VECTOR_INDEX_NAME = 'vector_chunks'
def neo4j_vector_search(question):
  """Search for similar nodes using the Neo4j vector index"""
  # embed the question with the same model used for the chunks;
  # convert the NumPy vector to a plain list of floats for the query parameters
  question_embedding = get_embeddings(question)[0].tolist()
  vector_search_query = """
    CALL db.index.vector.queryNodes($index_name, $top_k, $question_embedding) yield node, score
    RETURN score, node.title AS title, node.text AS text
  """
  similar = kg.query(vector_search_query,
                     params={
                      'question_embedding': question_embedding,
                      'index_name': VECTOR_INDEX_NAME,
                      'top_k': 3})
  return similar
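
As a mental model for what the vector index returns, the search can be written as a brute-force scan: score every stored embedding against the question embedding with cosine similarity and keep the best `top_k`. The index itself relies on an approximate nearest-neighbour structure and may report scores on a different scale, so the numbers here are not expected to match Neo4j's. `cosine` and `top_k_similar` are hypothetical helpers, and the three-dimensional vectors are toy data.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def top_k_similar(question_embedding, chunks, top_k=3):
    """Return the top_k chunks by cosine similarity, best first."""
    scored = [
        {"score": cosine(question_embedding, chunk["embedding"]), "text": chunk["text"]}
        for chunk in chunks
    ]
    return sorted(scored, key=lambda item: item["score"], reverse=True)[:top_k]


chunks = [
    {"text": "same direction", "embedding": [2.0, 0.0, 0.0]},
    {"text": "orthogonal", "embedding": [0.0, 1.0, 0.0]},
    {"text": "opposite", "embedding": [-1.0, 0.0, 0.0]},
]
hits = top_k_similar([1.0, 0.0, 0.0], chunks, top_k=2)
for hit in hits:
    print(hit["text"], hit["score"])
```

Cosine similarity ignores vector length, which is why the vector `[2.0, 0.0, 0.0]` scores the same as an exact copy of the question embedding would.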

In [None]:
search_results = neo4j_vector_search(
    'Ansvarlig KI er hele organisasjonen sitt ansvar'
)
search_results

[{'score': 0.7677822113037109,
  'title': 'Ansvarlig KI er hele organisasjonen sitt ansvar!.json',
  'text': 'lenger. I Kantega har vi sett at det ofte er andre aspekter rundt utvikling av KI som smerter mest. Ansvarlig KIVi som har KI som vårt fagfelt i Kantega har de siste årene jobbet mye med temaet ansvarlig KI. Vi ønsker at oppdragene og prosjektene vi jobber i skal utføres på en måte som gir en trygghet og tillit i resten av organisasjonen. Det skal være åpenhet og transparens i det vi gjør.I denne artikkelen ønsker vi derfor å trekke fram sentrale elementer for å sikre at vi lager ansvarlig KI.Figuren over'},
 {'score': 0.7663803100585938,
  'title': 'Ansvarlig KI er hele organisasjonen sitt ansvar!.json',
  'text': 'for å sikre at vi lager ansvarlig KI.Figuren over viser en typisk syklus for et data science-prosjekt. Hvis vi ser på de ulike boksene, forstår vi at dette ikke er noe som kun berører data scientister, men derimot hele organisasjonen.På samme måte som at et data sci