# Text generation comparison
This notebook compares text generation across different augmentation approaches: no retrieved context, RAG at two chunk sizes, and GraphRAG.
- Outputs results to `../data/exp_outputs/` for examination.
- Examines two prompt formats for generating individualized learning plans.

## TODO for this notebook:
- Intermediate test case doesn't generate.

## Initialize

In [1]:
import os
import sys
import pickle

root_path = "c:\\Users\\jonathan.kasprisin\\gitlab\\DNoK_GraphRAG"
os.chdir(root_path)
sys.path.append(root_path)

output_path = "./data/exp_outputs/llm_reasoning_output_v2.txt"
context_review_path = "./data/exp_outputs/llm_reasoning_context_review_v2.txt"

output_pickle_path = "llm_reason_output_data.pkl"

# Initialize an empty list to collect rows for the DataFrame
data_records = []


In [2]:
from langchain_huggingface import HuggingFaceEndpoint

#Initialize the model endpoint
HOST_URL_INF = ":8080"
MAX_NEW_TOKENS = 1800

TEMPERATURE = 0.2
TIMEOUT = 180


llm = HuggingFaceEndpoint(
    endpoint_url=HOST_URL_INF,
    task="text-generation",
    max_new_tokens=MAX_NEW_TOKENS,
    do_sample=True,
    temperature = TEMPERATURE,
    timeout=TIMEOUT,
)
print(llm.invoke("In 10 words, what is HuggingFace?"))

 HuggingFace is a company that provides AI models and tools.


In [3]:
#load prompt templates and testing student profiles
from utils.prompt_templates import chatbot_prompt_template
from utils.test_case_data   import student_1, student_2, student_3

students = [student_1, student_2, student_3]

#test prompt template
prompt_test = chatbot_prompt_template.format(
        # student_name=student_1["name"],
        profile=student_1["profile"],
        context="",
        request=student_1["requests"][0],
    )

print(prompt_test)


You are an expert tutor in mathematics and linear algebra, specializing in personalized and context-aware explanations. 
Your goal is to provide clear, relevant, and engaging responses tailored to the student’s profile, retrieved context, and their request. 
Use the information provided to adapt your explanation to their background, strengths, weaknesses, and preferences.

### Student Profile:

        Background: Recent college graduate with a degree in Business Administration.
        Strengths: Strong organizational and project management skills.
        Weaknesses: Limited mathematical background; no prior programming experience.
        Preferences: Prefers real-world applications, interactive learning, and visualizations.
        Prior Course History: 
        - Introduction to Business Mathematics
        - Basic Statistics for Managers
    

### Retrieved Context:


### Student Request:
Help me understand how eigenvalues relate to matrix transformations. Provide content that v

## Vanilla Test
Baseline generation with Mistral-NeMo and no retrieved context.

In [4]:
#test
response = llm.invoke(prompt_test)
print(response)

**1. Summary:**
Eigenvalues and eigenvectors are special features of a matrix that describe how it transforms a vector. They are crucial in data analysis as they help identify patterns and reduce dimensionality. Visualizing these concepts with real-world examples will make them more accessible.

**2. Detailed Explanation:**

Imagine you're playing a game where you're given a matrix (a set of rules) that transforms vectors (your starting point). Some starting points (eigenvectors) remain unchanged or scaled (eigenvalues) when you apply the matrix rules. These are the 'eigen' (self) features of the matrix.

Let's consider a simple 2x2 matrix A:
```
A = [3 1;
     2 2]
```
If you start at point (1, 0), applying A moves you to (5, 2). But if you start at (1, 1), applying A keeps you at (1, 1) - this is an eigenvector with eigenvalue 1.

In data analysis, eigenvalues and eigenvectors help in Principal Component Analysis (PCA), a technique used to reduce dimensionality while retaining as muc

In [5]:

for student_index, student in enumerate(students):
    for request_index, request in enumerate(student["requests"]):
        # Generate test_case ID using student index and request index
        test_case_id = f"s{student_index}r{request_index}"

        # Fill in the template (no retrieved context for the baseline)
        prompt = chatbot_prompt_template.format(
            profile=student["profile"],
            context="",
            request=request,
        )

        # Generate the response
        response = llm.invoke(prompt)

        # Append data to the list
        data_records.append({
            "index": len(data_records),  # Auto-incrementing index
            "type": "Basic",
            "test_case": test_case_id,
            "student": student["profile"],
            "request": request,
            "context": "",
            "response": response
        })

try:
    with open(output_pickle_path, "wb") as f:
        pickle.dump(data_records, f)
except Exception as e:
    print(f"Error saving output pickle file: {e}")
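
Because every run appends to the same `data_records` list and re-pickles it, the saved file can later be reloaded and split by the `type` field. The following is a minimal sketch, assuming the record layout used above; the `group_by_type` helper and the sample records are hypothetical and not part of the notebook.

```python
import pickle
from collections import defaultdict


def group_by_type(records):
    """Group record dicts by their "type" field, preserving insertion order."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["type"]].append(record)
    return dict(grouped)


# Hypothetical records shaped like the ones appended in the loop above.
sample_records = [
    {"index": 0, "type": "Basic", "test_case": "s0r0", "response": "..."},
    {"index": 1, "type": "Basic", "test_case": "s0r1", "response": "..."},
    {"index": 2, "type": "RAG_150chunk", "test_case": "s0r0", "response": "..."},
]

# Round-trip through pickle in memory, mirroring how the file is written and read.
restored = pickle.loads(pickle.dumps(sample_records))
by_type = group_by_type(restored)
for run_type, rows in by_type.items():
    print(run_type, [row["test_case"] for row in rows])
```

Grouping on `type` keeps the baseline and RAG responses for the same `test_case` easy to compare side by side.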

    


## RAG

In [6]:
from langchain_huggingface import HuggingFaceEmbeddings

model_name = "dunzhang/stella_en_1.5B_v5"  # alternative: "BAAI/bge-small-en-v1.5"
model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": True}
embd = HuggingFaceEmbeddings(
    model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
)

In [7]:
from langchain_chroma import Chroma
import os

vector_store_path = "./data/storage/chroma_db_stella_1.5B_chunk150" #"../data/storage/chroma_db_stella_1.5B"

                               
vector_store = Chroma(
    embedding_function=embd,
    persist_directory=vector_store_path,
    collection_name= "full_vstore_stella1.5B_chunk150", #"full_vstore_stella1.5B",
)


In [8]:
# Get the number of documents in the vector store
num_documents = vector_store._collection.count()

# Print the number of documents
print(f"Number of documents in the vector store: {num_documents}")

# Preview one of the documents (assuming the documents are stored in a collection)
if num_documents > 0:
    # Retrieve the first document (or any document by its ID)
    document = vector_store._collection.peek(limit=1)
    
    # Print a preview of the document
    print("Preview of the first document:")
    print(document)
else:
    print("No documents found in the vector store.")

Number of documents in the vector store: 124353
Preview of the first document:
{'ids': ['10665c28-7898-4d71-8e79-fe332f7a89eb'], 'embeddings': array([], dtype=float64), 'documents': ['Linear Algebra and Its Applications\nFourth Edition\nGilbert Strang\ny\nx y z \x1e \x0c \nz\nAx b\x1e\nb\n0\nAy b\x1e\n0Az \x1e\n0'], 'uris': None, 'data': None, 'metadatas': [{'page': 1, 'source': 'C:\\Users\\jonathan.kasprisin\\github\\Learning\\KG_ilp\\data\\pdfs\\Gilbert_Strang_Linear_Algebra_and_Its_Applicatio_230928_225121.pdf', 'start_index': 0}], 'included': [<IncludeEnum.embeddings: 'embeddings'>, <IncludeEnum.documents: 'documents'>, <IncludeEnum.metadatas: 'metadatas'>]}


In [9]:
from langgraph.graph import START, StateGraph
from langchain_core.documents import Document
from typing_extensions import List, TypedDict
from langchain.prompts import PromptTemplate

# Define the state structure
class LearningPlanState(TypedDict):
    profile: str
    request: str  # The overarching question or request
    context: List[Document]
    answer: str

def retrieve(state: LearningPlanState, k: int = 20):
    # Combine the student's profile and request to form the query
    query = f"Student Profile: {state['profile']}\nRequest: {state['request']}"
    retrieved_docs = vector_store.similarity_search(query, k=k)
    return {"context": retrieved_docs}

def generate(state:LearningPlanState):
    #get the relevant documents and source metadata
    docs_content = "\n\n".join(
        f"Content source: {doc.metadata.get('source', 'Unknown')}\n Content: {doc.page_content}"
        for doc in state["context"]
    )
    formatted_prompt =  chatbot_prompt_template.format(
        profile=state['profile'],
        context=docs_content,
        request=state['request']
    )
    #print(formatted_prompt)
    response = llm.invoke(formatted_prompt)
    return {"answer": response}


# Build the workflow #fix
workflow_builder = StateGraph(LearningPlanState).add_sequence([retrieve, generate])
workflow_builder.add_edge(START, "retrieve")
workflow = workflow_builder.compile()
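
The workflow above runs `retrieve` and then `generate`, with each step returning only the state keys it changes. The sketch below illustrates that retrieve-then-generate flow in plain Python, without LangGraph. The retriever and LLM are stubs, and `run_sequence`, `fake_retrieve`, and `fake_generate` are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List

State = Dict[str, object]


def fake_retrieve(state: State) -> State:
    # Stand-in for vector_store.similarity_search: returns one canned passage.
    query = f"Student Profile: {state['profile']}\nRequest: {state['request']}"
    return {"context": [f"passage matching: {query[:30]}"]}


def fake_generate(state: State) -> State:
    # Stand-in for llm.invoke: reports how much context it received.
    return {"answer": f"answer built from {len(state['context'])} passage(s)"}


def run_sequence(state: State, steps: List[Callable[[State], State]]) -> State:
    """Run steps in order, merging each partial update into a copy of the state."""
    current = dict(state)
    for step in steps:
        current.update(step(current))
    return current


initial = {
    "profile": "Business graduate",
    "request": "Explain eigenvalues",
    "context": [],
    "answer": "",
}
final = run_sequence(initial, [fake_retrieve, fake_generate])
print(final["answer"])
```

The result carries both `context` and `answer`, which is why the later loops can read `response['context']` alongside `response["answer"]`.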

In [10]:
#TEST
state = LearningPlanState(
        profile=student_1["profile"],
        request=student_1["requests"][0],
        context=[],
        answer=""
    )

response = workflow.invoke(state)
print(response["answer"][:50])

**1. Summary:**
Eigenvalues and eigenvectors are s


In [11]:
for student_index, student in enumerate(students):
    for request_index, request in enumerate(student["requests"]):
        # Generate test_case ID using student index and request index
        test_case_id = f"s{student_index}r{request_index}"

        # Initialize the state
        state = LearningPlanState(
            profile=student["profile"],
            request=request,
            context=[],
            answer=""
        )
        
        # Execute the workflow
        response = workflow.invoke(state)

        # Rebuild the retrieved context in the format the LLM sees
        llm_context = "\n\n".join(
            f"Content source: {doc.metadata.get('source', 'Unknown')}\n Content: {doc.page_content}"
            for doc in response['context']
        )

        # Append data to the list
        data_records.append({
            "index": len(data_records),  # Auto-incrementing index
            "type": "RAG_150chunk",
            "test_case": test_case_id,
            "student": student["profile"],
            "request": request,
            "context": llm_context,
            "response": response["answer"]
        })
    
try:
    with open(output_pickle_path, "wb") as f:
        pickle.dump(data_records, f)
except Exception as e:
    print(f"Error saving output pickle file: {e}")

## RAG w/ larger chunk size
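
The run labels `RAG_150chunk` and `RAG_1500chunk` suggest chunk sizes of roughly 150 and 1500, though the splitter settings are not shown in this notebook, so that reading is an assumption. The sketch below uses a simple fixed-size character splitter (the hypothetical `chunk_text`, not the splitter that built the stores) to show how chunk size drives the number of stored documents.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


document = "x" * 3000  # placeholder text, 3000 characters long

small = chunk_text(document, chunk_size=150)
large = chunk_text(document, chunk_size=1500)
print(len(small), len(large))
```

Fewer, larger chunks mean each retrieved document carries more text, which is consistent with lowering `k` for this store.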

In [12]:
from langchain_chroma import Chroma
import os

vector_store_path = "./data/storage/chroma_db_stella_1.5B"

                               
vector_store = Chroma(
    embedding_function=embd,
    persist_directory=vector_store_path,
    collection_name= "full_vstore_stella1.5B",
)

# Get the number of documents in the vector store
num_documents = vector_store._collection.count()

# Print the number of documents
print(f"Number of documents in the vector store: {num_documents}")

# Preview one of the documents (assuming the documents are stored in a collection)
if num_documents > 0:
    # Retrieve the first document (or any document by its ID)
    document = vector_store._collection.peek(limit=1)
    
    # Print a preview of the document
    print("Preview of the first document:")
    print(document)
else:
    print("No documents found in the vector store.")

Number of documents in the vector store: 11564
Preview of the first document:
{'ids': ['62950633-20ca-4022-ab4e-db4789aeee34'], 'embeddings': array([[-0.0374994 , -0.00979656,  0.05323608, ..., -0.02966512,
        -0.00455778,  0.03899739]]), 'documents': ['Linear Algebra and Its Applications\nFourth Edition\nGilbert Strang\ny\nx y z \x1e \x0c \nz\nAx b\x1e\nb\n0\nAy b\x1e\n0Az \x1e\n0'], 'uris': None, 'data': None, 'metadatas': [{'page': 1, 'source': 'C:\\Users\\jonathan.kasprisin\\github\\Learning\\KG_ilp\\data\\pdfs\\Gilbert_Strang_Linear_Algebra_and_Its_Applicatio_230928_225121.pdf', 'start_index': 0}], 'included': [<IncludeEnum.embeddings: 'embeddings'>, <IncludeEnum.documents: 'documents'>, <IncludeEnum.metadatas: 'metadatas'>]}


In [13]:
# Build the workflow
workflow_builder = StateGraph(LearningPlanState)
workflow_builder.add_node("retrieve", lambda state: retrieve(state, k=5))
workflow_builder.add_node("generate", generate)
workflow_builder.add_edge(START, "retrieve")
workflow_builder.add_edge("retrieve", "generate")
workflow = workflow_builder.compile()
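
The lambda above fixes `k=5` for the larger chunks. `functools.partial` binds the argument the same way; the sketch below compares the two using a stub `retrieve` (a stand-in, not the notebook's function). Whether LangGraph treats the two forms identically is an assumption, although the explicit node name passed to `add_node` should make either usable.

```python
from functools import partial


def retrieve(state: dict, k: int = 20) -> dict:
    # Stub retriever: returns the first k items of a fixed corpus.
    corpus = [f"doc-{i}" for i in range(50)]
    return {"context": corpus[:k]}


# Two ways to bind k=5 while keeping a single-argument callable.
retrieve_top5_lambda = lambda state: retrieve(state, k=5)
retrieve_top5_partial = partial(retrieve, k=5)

state = {"profile": "", "request": ""}
default_count = len(retrieve(state)["context"])
bound_count = len(retrieve_top5_partial(state)["context"])
print(default_count, bound_count)
print(retrieve_top5_lambda(state) == retrieve_top5_partial(state))
```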

In [14]:
#TEST
state = LearningPlanState(
        profile=student_1["profile"],
        request=student_1["requests"][0],
        context=[],
        answer=""
    )

response = workflow.invoke(state)
print(response["answer"][:50])

**1. Summary:**
Eigenvalues and eigenvectors help 


In [15]:
for student_index, student in enumerate(students):
    for request_index, request in enumerate(student["requests"]):
        # Generate test_case ID using student index and request index
        test_case_id = f"s{student_index}r{request_index}"

        request = student["requests"][request_index]

        # Initialize the state
        state = LearningPlanState(
            profile=student["profile"],
            request=request,
            context=[],
            answer=""
        )
        
        # Execute the workflow
        response = workflow.invoke(state)

        #Write context to review file in format the llm sees
        llm_context = "\n\n".join(
            f"Content source: {doc.metadata.get('source', 'Unknown')}\n Content: {doc.page_content}"
            for doc in response['context']
        )

        # Append data to the list
        data_records.append({
            "index": len(data_records),  # Auto-incrementing index
            "type": "RAG_1500chunk",
            "test_case": test_case_id,
            "student": student["profile"],
            "request": request,
            "context": llm_context,
            "response": response["answer"]
        })

try:
    with open(output_pickle_path, "wb") as f:
        pickle.dump(data_records, f)
except Exception as e:
    print(f"Error saving output pickle file: {e}")

## GraphRAG

#### Load different KGs

In [16]:

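# NOTE: the Initialize cell sets the working directory to root_path, so these "../data" paths
# resolve one level above the project root, unlike the "./data" paths used elsewhere.
GR_kg_no_refine_graph_path = "../data/generated_graphs/GR_no_refine/final_augmented_graph.graphml"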
GR_kg_w_refine_graph_path = "../data/generated_graphs/GR_w_refine/final_augmented_graph.graphml"
langchain_kg_graph_path = "../data/generated_graphs/langchain_KG/langchain_full_kg.graphml"
sme_kg_graph_path = "../data/generated_graphs/SME_graph/DNoKv3.graphml"

#dictionary of graphs
graphs_paths = {
    "GR_kg_no_refine": GR_kg_no_refine_graph_path,
    "GR_kg_w_refine": GR_kg_w_refine_graph_path,
    "langchain_kg": langchain_kg_graph_path,
    "sme_kg": sme_kg_graph_path
}

In [17]:
import networkx as nx
import traceback

graph_dict = {}

for graph_name, graph_path in graphs_paths.items():

    try:
        graph = nx.read_graphml(graph_path)
        graph_dict[graph_name] = graph

        # Print info about the graph
        print(f"Loaded Graph: {graph_name}")
        print(f"-->Number of nodes: {graph.number_of_nodes()}, Number of edges: {graph.number_of_edges()}")
        print(f"-->example node data: {list(graph.nodes(data=True))[:1]}")
        print(f"-->example edge data: {list(graph.edges(data=True))[:1]}\n")
    except Exception as e:
        print(f"Failed to load graph {graph_name} from {graph_path}: {e}\n")
        #traceback.print_exc()
        



Failed to load graph GR_kg_no_refine from ../data/generated_graphs/GR_no_refine/final_augmented_graph.graphml: [Errno 2] No such file or directory: '../data/generated_graphs/GR_no_refine/final_augmented_graph.graphml'

Failed to load graph GR_kg_w_refine from ../data/generated_graphs/GR_w_refine/final_augmented_graph.graphml: [Errno 2] No such file or directory: '../data/generated_graphs/GR_w_refine/final_augmented_graph.graphml'

Failed to load graph langchain_kg from ../data/generated_graphs/langchain_KG/langchain_full_kg.graphml: [Errno 2] No such file or directory: '../data/generated_graphs/langchain_KG/langchain_full_kg.graphml'

Failed to load graph sme_kg from ../data/generated_graphs/SME_graph/DNoKv3.graphml: [Errno 2] No such file or directory: '../data/generated_graphs/SME_graph/DNoKv3.graphml'
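
All four loads fail with "No such file or directory". Relative paths resolve against the current working directory, which the Initialize cell sets to `root_path`. The sketch below resolves each path and checks that it exists before any load is attempted. The `report_paths` helper is hypothetical, and the two entries are only illustrative spellings of one location.

```python
from pathlib import Path


def report_paths(paths: dict, base: Path) -> dict:
    """Resolve each relative path against `base` and record whether it exists."""
    report = {}
    for name, raw in paths.items():
        resolved = (base / raw).resolve()
        report[name] = (resolved, resolved.exists())
    return report


# Illustrative only: the same file addressed relative to the parent and to the root.
paths = {
    "parent_relative": "../data/generated_graphs/SME_graph/DNoKv3.graphml",
    "root_relative": "./data/generated_graphs/SME_graph/DNoKv3.graphml",
}
result = report_paths(paths, Path.cwd())
for name, (resolved, exists) in result.items():
    print(f"{name}: {resolved} (exists: {exists})")
```

Printing the resolved absolute path makes it clear which directory a failing relative path actually points to.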



In [18]:
from langchain_community.graphs.networkx_graph import NetworkxEntityGraph
import networkx as nx
import pickle

#load langchain graph directly as a NetworkxEntityGraph

def make_nxe_graph(graph_documents) -> NetworkxEntityGraph:
    print(f"Making nx graph from {len(graph_documents)} graph documents")
    graph_nxe = NetworkxEntityGraph()
    for doc in graph_documents:
        try:
            for node in doc.nodes:
                graph_nxe.add_node(node.id)
            for edge in doc.relationships:
                graph_nxe._graph.add_edge(edge.source.id, edge.target.id, relation=edge.type)
        except Exception as e:
            print(f"Error adding document to nx graph: {doc.source.metadata}, {e}")
    print(f"nx graph built with {graph_nxe.get_number_of_nodes()} nodes.") 
    return graph_nxe

with open("../data/generated_graphs/langchain_KG/full_graph_documents.pkl", "rb") as file:
    graph_documents = pickle.load(file)

save_dir = "../data/generated_graphs/langchain_KG/" #save dropped docs

filtered_graph_documents = []
for doc in graph_documents:
    valid_nodes = [node for node in doc.nodes if node.type]
    if valid_nodes:
        doc.nodes = valid_nodes
        filtered_graph_documents.append(doc)
    else:
        with open(save_dir+"dropped_docs.txt", "a") as f:
            f.write(f" 'Dropped doc.metadata': '{doc.source.metadata}'\n")

langchain_nxe_graph = make_nxe_graph(filtered_graph_documents)

# Print info about the graph
# Try to get node data (adjust method name as needed)
try:
    triples = langchain_nxe_graph.get_triples()
    print(f"Example triple: {triples[0]}")
    entity= triples[0][0]
    print(f"Example node: {entity}")
    knowledge = langchain_nxe_graph.get_entity_knowledge(entity, 3)
    print(f"Example node knowledge: {knowledge}")

except AttributeError:
    print("Unable to access node data. Check the class documentation for the correct method.")

FileNotFoundError: [Errno 2] No such file or directory: '../data/generated_graphs/langchain_KG/full_graph_documents.pkl '

In [33]:
#save networkentity graph as graphml
nx.write_graphml(langchain_nxe_graph._graph, "../data/generated_graphs/langchain_KG/langchain_nxe_graph.graphml")

### GraphRAG - Langchain
Doesn't work for the other KGs.
_Note: GraphQA with the langchain library does not find any context._

#### Restructure KG to work with langchain NetworkxEntityGraph()

In [None]:
from langchain_community.graphs.networkx_graph import KnowledgeTriple
# Initialize a list to store knowledge triples
knowledge_triples = []

# Iterate over the edges to extract triples
# (`graph` is the last graph assigned in the loading loop above)
for source, target, data in graph.edges(data=True):
    # Check if the 'title' key exists in the edge data
    if 'title' in data:
        relationship = data['title']
        triple = KnowledgeTriple(source, relationship, target)
        knowledge_triples.append(triple)

# Display the extracted knowledge triples
print(f"Extracted {len(knowledge_triples)} knowledge triples:")
print(f"example knowledge triple: {knowledge_triples[0]}")
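
The cell above reads the relationship from the `title` edge attribute, whereas `make_nxe_graph` stores it under `relation`. The sketch below accepts either key. It is written against plain `(source, target, data)` tuples, the shape yielded by `graph.edges(data=True)`, so it runs without networkx. The `edges_to_triples` helper and the sample edges are hypothetical.

```python
from typing import Iterable, List, Tuple

Edge = Tuple[str, str, dict]
Triple = Tuple[str, str, str]


def edges_to_triples(edges: Iterable[Edge], keys=("title", "relation")) -> List[Triple]:
    """Return (subject, predicate, object) triples, skipping edges with no known key."""
    triples = []
    for source, target, data in edges:
        # Use the first attribute key that is present on this edge, if any.
        predicate = next((data[key] for key in keys if key in data), None)
        if predicate is not None:
            triples.append((source, predicate, target))
    return triples


sample_edges = [
    ("matrix", "eigenvalue", {"title": "has"}),
    ("eigenvalue", "eigenvector", {"relation": "pairs_with"}),
    ("vector", "scalar", {}),  # no relationship attribute: skipped
]
triples = edges_to_triples(sample_edges)
print(triples)
```

Each tuple is in the argument order that the cell above passes to `KnowledgeTriple`.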


In [None]:
# from langchain_community.graphs.networkx_graph import NetworkxEntityGraph
# langchain_graphgraph = NetworkxEntityGraph()

# #make langchain KnowledgeTriple from extracted triples
# for triple in knowledge_triples:
#     langchain_graphgraph.add_triple(triple)
    


#### GraphRAG using langchain package
_Note: Only works with a langchain-generated graph._

In [35]:
# langchain_graph = langchain_nxe_graph

In [None]:

# print("Successfully initialized NetworkxEntityGraph.")
# #get number of nodes and edges
# print(f"Number of nodes: {langchain_graph.get_number_of_nodes()}")

# #get test information about an entity
# #get an entity from the graph
# list_triples = langchain_graph.get_triples()
# print(f"example triple: {list_triples[-1:]}")


In [37]:
# prompt_template_mod = """
# Student Profile:
# - Name: {student_name}
# - Learning Objectives: {learning_objectives}
# - Profile Details: {student_profile}

# Tasks:
# 1. Based on the provided profile and learning objectives, determine the optimal learning path(s) (including order) to achieve all objectives.
# 2. Identify and list specific content that aligns with and supports the optimal path(s). Use additional context if provided.
# 3. Suggest alternative or backup content that can replace the primary content identified in case of availability issues or better alignment with the student's preferences.
# 4. Respond to the request with text output.

# Request: 
# {learning_request}
# """


In [None]:
# from langchain.chains import GraphQAChain
# #create graphQAchain
# chain = GraphQAChain.from_llm(
#     llm=llm, 
#     graph=langchain_graph, 
#     verbose=True
# )

# #test
# formatted_prompt = prompt_template_mod.format(
#         student_name=student_1["name"],
#         learning_objectives=student_1["learning_objectives"],
#         student_profile=student_1["profile"],
#         learning_request="Generate an individualized learning plan tailored to the student's needs.",
#     )


# chain.invoke(formatted_prompt)

In [None]:
# from langchain.chains import GraphQAChain

# with open(output_path, "a") as file:
#     file.write("\n-------------Generated with Langchain GraphRAG (mod template 1)-------------\n")

# with open(context_review_path, "a") as file:
#     file.write("\n-------------Generated with Langchain GraphRAG (mod template 1)-------------\n")

# for student in students:
    
#     #create graphQAchain
#     chain = GraphQAChain.from_llm(
#         llm=llm, 
#         graph=langchain_graph, 
#         verbose=True
#     )

#     formatted_prompt = prompt_template_mod.format(
#             student_name=student["name"],
#             learning_objectives=student["learning_objectives"],
#             student_profile=student["profile"],
#             learning_request="Generate an individualized learning plan tailored to the student's needs.",
#         )

#     chain.invoke(formatted_prompt)

#     with open(output_path, "a") as file:
#         file.write("\n------ "+student["test_case"]+"\n"+ response['answer'] + "\n\n")

#     with open(context_review_path, "a") as file:
#         file.write("\n------ "+student["test_case"]+"\n"+ response['context'] + "\n\n")

### GraphRAG custom with community summaries (TODO?)

In [40]:
# # from https://github.com/stephenc222/example-graphrag/
# # 5. Graph Communities → Community Summaries
# def detect_communities(graph):
#     communities = []
#     index = 0
#     for component in nx.connected_components(graph):
#         print(
#             f"Component index {index} of {len(list(nx.connected_components(graph)))}:")
#         subgraph = graph.subgraph(component)
#         if len(subgraph.nodes) > 1:  # Leiden algorithm requires at least 2 nodes
#             try:
#                 sub_communities = algorithms.leiden(subgraph)
#                 for community in sub_communities.communities:
#                     communities.append(list(community))
#             except Exception as e:
#                 print(f"Error processing community {index}: {e}")
#         else:
#             communities.append(list(subgraph.nodes))
#         index += 1
#     print("Communities from detect_communities:", communities)
#     return communities

# def summarize_communities(communities, graph):
#     community_summaries = []
#     for index, community in enumerate(communities):
#         print(f"Summarize Community index {index} of {len(communities)}:")
#         subgraph = graph.subgraph(community)
#         nodes = list(subgraph.nodes)
#         edges = list(subgraph.edges(data=True))
#         description = "Entities: " + ", ".join(nodes) + "\nRelationships: "
#         relationships = []
#         for edge in edges:
#             relationships.append(
#                 f"{edge[0]} -> {edge[2]['label']} -> {edge[1]}")
#         description += ", ".join(relationships)

#         response = client.chat.completions.create(
#             model="gpt-4",
#             messages=[
#                 {"role": "system", "content": "Summarize the following community of entities and relationships."},
#                 {"role": "user", "content": description}
#             ]
#         )
#         summary = response.choices[0].message.content.strip()
#         community_summaries.append(summary)
#     return community_summaries


# # 6. Community Summaries → Community Answers → Global Answer
# def generate_answers_from_communities(community_summaries, query):
#     intermediate_answers = []
#     for index, summary in enumerate(community_summaries):
#         print(f"Summary index {index} of {len(community_summaries)}:")
#         response = client.chat.completions.create(
#             model="gpt-4o",
#             messages=[
#                 {"role": "system", "content": "Answer the following query based on the provided summary."},
#                 {"role": "user", "content": f"Query: {query} Summary: {summary}"}
#             ]
#         )
#         print("Intermediate answer:", response.choices[0].message.content)
#         intermediate_answers.append(
#             response.choices[0].message.content)

#     final_response = client.chat.completions.create(
#         model="gpt-4o",
#         messages=[
#             {"role": "system",
#                 "content": "Combine these answers into a final, concise response."},
#             {"role": "user", "content": f"Intermediate answers: {intermediate_answers}"}
#         ]
#     )
#     final_answer = final_response.choices[0].message.content
#     return final_answer
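
The commented-out code above first splits the graph into connected components and then runs Leiden inside each one. The sketch below covers only the first step, using a breadth-first search over a plain adjacency dict so it runs without networkx. Leiden itself is not implemented here, and the function and sample graph are hypothetical.

```python
from collections import deque


def connected_components(adjacency: dict) -> list:
    """Return the connected components of an undirected graph as sorted node lists."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from each unvisited node collects one component.
        queue = deque([start])
        seen.add(start)
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in adjacency.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(component))
    return components


graph = {
    "matrix": ["eigenvalue"],
    "eigenvalue": ["matrix", "eigenvector"],
    "eigenvector": ["eigenvalue"],
    "probability": [],
}
components = connected_components(graph)
print(components)
```

Single-node components such as `probability` are kept as their own community, matching the `else` branch in `detect_communities`.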