# Text generation with Graph Retrieval
This notebook implements graph context retrieval and augmentation.

## Notes
- Issues with retrieving relevant nodes: many queries come back as "linear algebra" when finding the most similar node.
- The generated KG has irrelevant nodes, e.g. "a". Tried cleaning the graph of common words, with mediocre results.
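
One way to look into the first note is to inspect the top-k matches and their scores rather than only the single best node. The block below is a minimal sketch with made-up 3-dimensional vectors; `cosine_similarity`, `top_k_nodes`, and `toy_embeddings` are hypothetical names for illustration and are not part of the notebook or of GraphReasoning.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_nodes(query_vec, node_embeddings, k=3):
    """Return the k most similar nodes as (node, similarity), best first."""
    scored = (
        (cosine_similarity(query_vec, vec), node)
        for node, vec in node_embeddings.items()
    )
    return [(node, sim) for sim, node in heapq.nlargest(k, scored)]


# Made-up vectors: a broad "hub" concept sits fairly close to every query.
toy_embeddings = {
    "linear algebra": [0.6, 0.6, 0.5],
    "eigenvalue": [0.9, 0.1, 0.1],
    "matrix transformation": [0.2, 0.9, 0.1],
}
query = [0.8, 0.3, 0.1]

ranked = top_k_nodes(query, toy_embeddings, k=3)
for node, sim in ranked:
    print(f"{node}: {sim:.3f}")
```

If a generic node keeps appearing near the top with a small margin over more specific nodes, that suggests the ranking is dominated by broad concepts rather than by a bug in the lookup.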

In [None]:
from langchain_huggingface import HuggingFaceEndpoint
from langchain_huggingface import HuggingFaceEmbeddings

#Initialize the model endpoint
HOST_URL_INF = ":8080"
MAX_NEW_TOKENS = 2200

TEMPERATURE = 0.2
TIMEOUT = 300


llm = HuggingFaceEndpoint(
    endpoint_url=HOST_URL_INF,
    task="text-generation",
    max_new_tokens=MAX_NEW_TOKENS,
    do_sample=True,
    temperature = TEMPERATURE,
    timeout=TIMEOUT,
)

model_name = "dunzhang/stella_en_1.5B_v5" #"BAAI/bge-small-en-v1.5" #dunzhang/stella_en_1.5B_v5
model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": True}
embd = HuggingFaceEmbeddings(
    model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
)

In [2]:
print(llm.invoke("In 10 words, huggingface is"))

 a company that specializes in machine learning.


## GraphRAG - Graph reasoning using GraphReasoning by MIT

### Load

In [3]:
import sys
import os
root_path = "c:\\Users\\jonathan.kasprisin\\gitlab\\DNoK_GraphRAG"
os.chdir(root_path)
sys.path.append(root_path)

In [4]:
from GraphReasoning_Mod.graph_tools import *
from GraphReasoning_Mod.utils import *
from GraphReasoning_Mod.graph_generation import *
from GraphReasoning_Mod.graph_analysis import *

In [5]:
#load prompt templates and testing student profiles
from utils.prompt_templates import chatbot_prompt_template
from utils.test_case_data   import student_1, student_2, student_3

students = [student_1, student_2, student_3]

#test prompt template
prompt_test = chatbot_prompt_template.format(
        # student_name=student_1["name"],
        profile=student_1["profile"],
        context="",
        request=student_1["requests"][0],
    )

print(prompt_test)


You are an expert tutor in mathematics and linear algebra, specializing in personalized and context-aware explanations. 
Your goal is to provide clear, relevant, and engaging responses tailored to the student’s profile, retrieved context, and their request. 
Use the information provided to adapt your explanation to their background, strengths, weaknesses, and preferences.

### Student Profile:

        Background: Recent college graduate with a degree in Business Administration.
        Strengths: Strong organizational and project management skills.
        Weaknesses: Limited mathematical background; no prior programming experience.
        Preferences: Prefers real-world applications, interactive learning, and visualizations.
        Prior Course History: 
        - Introduction to Business Mathematics
        - Basic Statistics for Managers
    

### Retrieved Context:


### Student Request:
Help me understand how eigenvalues relate to matrix transformations. Provide content that v

In [6]:
import pickle
import os
import networkx as nx

graph_file_names = {
    "GR_kg_no_refine_final": "final_augmented_graph.graphml",
    # "GR_kg_w_refine_final": "final_augmented_graph.graphml",
    # "langchain_kg": "langchain_full_kg.graphml",
    # "sme_kg": "DNoKv3_forGR.graphml",
    "GR_kg_no_refine3_1.0": "final_augmented_graph.graphml",
    "GR_kg_no_refine3_0.95": "0.95threshold_graphML_simplified.graphml",
    "GR_kg_no_refine3_0.85" : "0.85threshold_graphML_simplified.graphml",
    "GR_kg_no_refine3_0.75" : "0.75threshold_graphML_simplified.graphml",
}


graph_dirs = {
    "GR_kg_no_refine_final": "./data/generated_graphs/GR_no_refine/",
    # "GR_kg_w_refine_final": "./data/generated_graphs/GR_w_refine/",
    # "langchain_kg": "./data/generated_graphs/langchain_KG/",
    # "sme_kg": "./data/generated_graphs/SME_graph/" ,
    "GR_kg_no_refine3_1.0": "./data/generated_graphs/GR_no_refine3/",
    "GR_kg_no_refine3_0.95": "./data/generated_graphs/GR_no_refine3/",
    "GR_kg_no_refine3_0.85" : "./data/generated_graphs/GR_no_refine3/",
    "GR_kg_no_refine3_0.75" : "./data/generated_graphs/GR_no_refine3/",
}

embd_file_names = {
    "GR_kg_no_refine_final": "embeddings.pkl",
    # "GR_kg_w_refine_final": "embeddings.pkl",
    # "langchain_kg": "embeddings.pkl",
    # "sme_kg": "embeddings.pkl",
    "GR_kg_no_refine3_1.0": "embeddings.pkl",
    "GR_kg_no_refine3_0.95": "GR_kg_no_refine3_0.95_embeddings.pkl", #"0.95threshold_embeddings.pkl",
    "GR_kg_no_refine3_0.85" : "GR_kg_no_refine3_0.85_embeddings.pkl", #"0.85threshold_embeddings.pkl",
    "GR_kg_no_refine3_0.75" : "GR_kg_no_refine3_0.75_embeddings.pkl", #"0.75threshold_embeddings.pkl",
}


#dictionary of graphs
graphs_paths = {}
embeddings_paths = {}

for graph_name, graph_dir in graph_dirs.items():
    graph_path = os.path.join(graph_dir, graph_file_names[graph_name])
    embeddings_path = os.path.join(graph_dir, embd_file_names[graph_name])
    graphs_paths[graph_name] = graph_path
    embeddings_paths[graph_name] = embeddings_path


graph_dict = {}
embds_dict = {}

for graph_name, graph_path in graphs_paths.items():
    embeddings_path = embeddings_paths[graph_name]
    try:
        graph= nx.read_graphml(graph_path)
        graph_dict[graph_name] = graph

        # Print info about the graph
        print(f"Loaded Graph: {graph_name}")
        print(f"-->Number of nodes: {graph.number_of_nodes()}, Number of edges: {graph.number_of_edges()}")
        print(f"-->example node data: {list(graph.nodes(data=True))[:1]}")
        print(f"-->example edge data: {list(graph.edges(data=True))[:1]}\n")

        if os.path.exists(embeddings_path):
            with open(embeddings_path, 'rb') as f:
                existing_node_embeddings = pickle.load(f)
        else:
            existing_node_embeddings = generate_node_embeddings(graph, embd)
            with open(embeddings_path, 'wb') as f:
                pickle.dump(existing_node_embeddings, f)
        embds_dict[graph_name]= existing_node_embeddings
        print(f"-->number of embeddings loaded: {len(existing_node_embeddings)}\n")
    except Exception as e:
        print(f"Failed to load graph {graph_name} from {graph_path}: {e}\n")

Loaded Graph: GR_kg_no_refine_final
-->Number of nodes: 3484, Number of edges: 9841
-->example node data: [('C:\\Users\\jonathan.kasprisin\\github\\Learning\\KG_ilp\\data\\pdfs\\Gilbert_Strang_Linear_Algebra_and_Its_Applicatio_230928_225121.pdf', {'group': 1, 'color': '#57dbc2', 'size': 2497})]
-->example edge data: [('C:\\Users\\jonathan.kasprisin\\github\\Learning\\KG_ilp\\data\\pdfs\\Gilbert_Strang_Linear_Algebra_and_Its_Applicatio_230928_225121.pdf', 'a', {'title': 'is source document of', 'weight': 1.0})]

-->number of embeddings loaded: 3484

Loaded Graph: GR_kg_no_refine3_1.0
-->Number of nodes: 77039, Number of edges: 62284
-->example node data: [('cartesian product of vector spaces', {'group': 1, 'color': '#57db89', 'size': 1})]
-->example edge data: [('cartesian product of vector spaces', 'ordered pairs', {'title': 'consists of', 'metadata': "{'source': 'Gilbert_Strang_Linear_Algebra_and_Its_Applicatio_230928_225121.pdf - page: 3', 'source_type': 'Textbook_PDF', 'title': 'Lin

In [7]:
#test changing edge data
import ast 

def reduce_metadata(G):
    # Iterate over all edges and strip the 'start_index' and 'author' keys from 'metadata'
    for u, v, data in tqdm(G.edges(data=True), desc="Processing edges", total= G.number_of_edges()):
        if 'metadata' in data:
            try:
                # Convert string to dictionary
                metadata_dict = ast.literal_eval(data['metadata'])
                # Remove the 'start_index' and 'author' keys if they exist
                metadata_dict.pop('start_index', None)
                metadata_dict.pop('author', None)
                # Update the metadata string
                data['metadata'] = str(metadata_dict)
            except (ValueError, SyntaxError):
                print(f"Failed to parse metadata for edge ({u}, {v}): {data['metadata']}")

# # Load your graph
# reduce_metadata(graph_dict["GR_kg_no_refine3_0.85"])
# print(f"-->example node data: {list(graph_dict["GR_kg_no_refine3_0.85"].edges(data=True))[:1]}")

In [8]:
for g in graph_dict:
    reduce_metadata(graph_dict[g])

Processing edges:   0%|          | 0/9841 [00:00<?, ?it/s]

Processing edges:   0%|          | 0/62284 [00:00<?, ?it/s]

Processing edges:   0%|          | 0/58504 [00:00<?, ?it/s]

Processing edges:   0%|          | 0/33153 [00:00<?, ?it/s]

Processing edges:   0%|          | 0/24305 [00:00<?, ?it/s]

### Helpers
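
The `clean_graph` helper in this section removes common-word nodes from the graph it is given, so the caller's graph object changes. As a point of comparison, here is a minimal sketch of the same idea that returns cleaned copies instead. It assumes a plain adjacency dictionary in place of a NetworkX graph, and `drop_common_word_nodes` is a hypothetical name used only for illustration.

```python
def drop_common_word_nodes(adjacency, embeddings, common_words):
    """Return cleaned copies of the adjacency map and embeddings.

    The inputs are left unchanged. Matching is case-insensitive, as in
    the notebook's helper.
    """
    removed = {node for node in adjacency if str(node).lower() in common_words}
    cleaned_adjacency = {
        node: [nbr for nbr in neighbors if nbr not in removed]
        for node, neighbors in adjacency.items()
        if node not in removed
    }
    cleaned_embeddings = {
        node: vec for node, vec in embeddings.items() if node not in removed
    }
    return cleaned_adjacency, cleaned_embeddings, removed


# Toy undirected graph with two stop-word nodes ("a" and "The").
toy_adjacency = {
    "strang.pdf": ["a", "linear algebra"],
    "a": ["strang.pdf"],
    "linear algebra": ["strang.pdf", "The"],
    "The": ["linear algebra"],
}
toy_embeddings = {node: [0.0] for node in toy_adjacency}

cleaned_adj, cleaned_emb, removed = drop_common_word_nodes(
    toy_adjacency, toy_embeddings, common_words={"a", "an", "the"}
)
print(sorted(removed))
print(cleaned_adj)
```

Keeping the original untouched makes it easier to compare retrieval quality before and after cleaning, at the cost of extra memory on large graphs.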

In [9]:
def generate_llm(system_prompt='You are a helpful assistant.', 
                         prompt="Hello.",temperature=0.3,
                         max_tokens=1024, timeout=180 
                         ):
    
    llm = HuggingFaceEndpoint(
        endpoint_url=HOST_URL_INF,
        task="text-generation",
        max_new_tokens=max_tokens,
        do_sample=True,
        temperature = temperature,
        timeout=timeout,
    )

    # Separate the system prompt from the user prompt so the two do not run together
    prompt = f"{system_prompt}\n\n{prompt}"
    
    response = llm.invoke(prompt)

    return response

#usage
#print(generate_llm(prompt="What is spider silk?",temperature=0.3, max_tokens=1024, timeout=180))

In [10]:
def print_graph_info(graph):
    if graph.is_directed():
        print("directed graph ")
    else:
        print("undirected graph")
    print(f"-->Number of nodes: {graph.number_of_nodes()}, Number of edges: {graph.number_of_edges()}")
    print(f"-->example node data: {list(graph.nodes(data=True))[:1]}")
    print(f"-->example edge data: {list(graph.edges(data=True))[:1]}\n")

In [11]:
def clean_graph(graph, existing_node_embeddings):
    """ 
    Removes common word nodes from graph and corresponding embeddings
    """
    # Define the list of common words to be removed
    common_words = {
        # Articles
        'a', 'an', 'the',

        # Pronouns
        'i', 'me', 'you', 'he', 'she', 'it', 'we', 'us', 'they', 'them',
        'my', 'your', 'his', 'her', 'its', 'our', 'their', 'mine', 'yours',
        'hers', 'theirs',

        # Conjunctions
        'and', 'but', 'or', 'nor', 'for', 'yet', 'so',

        # Prepositions
        'about', 'above', 'across', 'after', 'against', 'along', 'among',
        'around', 'at', 'before', 'behind', 'below', 'beneath', 'beside',
        'between', 'beyond', 'by', 'down', 'during', 'except', 'for', 'from',
        'in', 'inside', 'into', 'near', 'of', 'off', 'on', 'out', 'outside',
        'over', 'through', 'to', 'toward', 'under', 'until', 'up', 'upon',
        'with', 'within', 'without',

        # Auxiliary (Helping) Verbs
        'am', 'is', 'are', 'was', 'were', 'be', 'being', 'been', 'do', 'does',
        'did', 'have', 'has', 'had', 'will', 'would', 'shall', 'should', 'can',
        'could', 'may', 'might', 'must',

        # Adverbs
        'not', 'no', 'very', 'too', 'just', 'only',

        # Other Common Words
        'all', 'any', 'both', 'each', 'every', 'few', 'many', 'more', 'most',
        'other', 'some', 'such', 'that', 'this', 'these', 'those', 'which',
        'what', 'who', 'whom', 'whose', 'how', 'when', 'where', 'why'
    }

    # Note: no copy is made, so the graph passed in is modified in place
    graph_clean = graph

    # Print initial number of nodes and edge data for verification
    print(f"Initial number of nodes: {graph_clean.number_of_nodes()}")
    print(f"Initial number of edges: {graph_clean.number_of_edges()}")

    # Identify nodes to remove based on common words
    nodes_to_remove = [node for node in graph_clean.nodes if str(node).lower() in common_words]

    # Remove identified nodes from the graph
    graph_clean.remove_nodes_from(nodes_to_remove)

    # # Save the cleaned graph back to a new GraphML file
    # cleaned_graphml_path = "cleaned_graph.graphml"
    # nx.write_graphml(graph_clean, cleaned_graphml_path)

    # Print final node and edge data for verification
    print(f"Cleaned number of nodes: {graph_clean.number_of_nodes()}")
    print(f"Cleaned number of edges: {graph_clean.number_of_edges()}")
    print(f"Final node data: {list(graph_clean.nodes(data=True))[:1]}")
    print(f"Final edge data: {list(graph_clean.edges(data=True))[:1]}")
    # print(f"Cleaned graph saved to: {cleaned_graphml_path}")

    # Drop the embeddings of removed nodes (set lookup avoids a list scan per node)
    removed_nodes = set(nodes_to_remove)
    existing_node_embeddings = {node: embedding for node, embedding in existing_node_embeddings.items() if node not in removed_nodes}
    print(f"Number of node embeddings: {len(existing_node_embeddings)}\n")

    return graph_clean, existing_node_embeddings

## Override module functions
TODO: update and re-integrate into GraphReasoning_Mod
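
The overrides in this section find a shortest path between two nodes and then widen it with the neighbors of each path node, optionally adding a second hop. The block below is a minimal sketch of that idea using only the standard library. It assumes an undirected graph stored as an adjacency dictionary rather than a NetworkX graph, and `shortest_path` and `nodes_near_path` are hypothetical names for illustration.

```python
from collections import deque


def shortest_path(adjacency, source, target):
    """Breadth-first search; returns the node list, or None if unreachable."""
    if source not in adjacency or target not in adjacency:
        return None
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in adjacency[node]:
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


def nodes_near_path(adjacency, path, second_hop=False):
    """Path nodes plus their neighbors; with second_hop, neighbors of neighbors too."""
    nearby = set(path)
    for node in path:
        for neighbor in adjacency.get(node, ()):
            nearby.add(neighbor)
            if second_hop:
                nearby.update(adjacency.get(neighbor, ()))
    return nearby


toy_graph = {
    "eigenvalue": ["matrix", "characteristic polynomial"],
    "matrix": ["eigenvalue", "linear transformation", "determinant"],
    "linear transformation": ["matrix", "basis"],
    "characteristic polynomial": ["eigenvalue"],
    "determinant": ["matrix"],
    "basis": ["linear transformation", "vector space"],
    "vector space": ["basis"],
    "isolated": [],
}

path = shortest_path(toy_graph, "eigenvalue", "linear transformation")
one_hop = nodes_near_path(toy_graph, path)
two_hop = nodes_near_path(toy_graph, path, second_hop=True)
print(path)
print(len(one_hop), len(two_hop))
```

On a dense knowledge graph the second hop can pull in a large share of the nodes, which is one reason to cap the retrieved context per keyword pair.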

In [12]:
#from GraphReasoning_Mod...?
def find_path( G, node_embeddings,  embedding_object, keyword_1 = "", keyword_2 = "", 
            verbatim=True, second_hop=False,data_dir='./output_files/', similarity_fit_ID_node_1=0,
            similarity_fit_ID_node_2=0,save_files=False):
    """
    Finds the shortest path between two keywords in a graph, together with the surrounding subgraph.
    It first finds the best-fitting node for each keyword; if both keywords map to the same node, it takes the next-best node for the second keyword.
    It then finds the shortest path between these nodes and expands it with neighboring nodes (two hops when second_hop=True).

    """
    
    best_node_1, best_similarity_1= find_best_fitting_node_list(
        keyword_1, node_embeddings, embedding_object, max (5, similarity_fit_ID_node_1+1)
    )[similarity_fit_ID_node_1]
    
    if verbatim:
        print(f"{similarity_fit_ID_node_1}nth best fitting node for '{keyword_1}': '{best_node_1}' with similarity: {best_similarity_1}")
    
    best_node_2, best_similarity_2 = find_best_fitting_node_list(
        keyword_2, node_embeddings, embedding_object,  max (5, similarity_fit_ID_node_2+1)
    )[similarity_fit_ID_node_2]

    if verbatim:
        print(f"{similarity_fit_ID_node_2}nth best fitting node for '{keyword_2}': '{best_node_2}' with similarity: {best_similarity_2}")

    if best_node_1 == best_node_2:
        if verbatim: 
            print(f"Warning: both keywords map to the same node '{best_node_1}'; using the next-best node for '{keyword_2}'")
        # Request one extra candidate so that index similarity_fit_ID_node_2+1 is always available
        best_node_2, best_similarity_2 = find_best_fitting_node_list(
            keyword_2, node_embeddings, embedding_object,  max (5, similarity_fit_ID_node_2+2)
        )[similarity_fit_ID_node_2+1]
    
    path, path_graph , shortest_path_length = find_shortest_path_with2hops (
        G,source=best_node_1, target=best_node_2, second_hop=second_hop, verbatim=verbatim, data_dir=data_dir,save_files=save_files,
    )
    
    return (best_node_1, best_similarity_1, best_node_2, best_similarity_2), path, path_graph, shortest_path_length #, fname, graph_GraphML

#from GraphReasoning_Mod.graph_analysis
def find_shortest_path_with2hops (G, source='graphene', target='complexity',
                                 second_hop=True,#otherwise just neighbors
                                  verbatim=True,data_dir='./output_files/', save_files=True,
                                 ):
    """ 
    Returns the shortest path between two nodes in a graph, along with a subgraph containing all nodes within 2 hops.
    Note: Returns None for all outputs if no path is found between the source and target nodes.
    """
    try:
        # Find the shortest path between two nodes
        path = nx.shortest_path(G, source=source, target=target)
    except (nx.NetworkXNoPath, nx.NodeNotFound):
        # No path exists, or one of the nodes is not in the graph
        if verbatim:
            print(f"No path found between '{source}' and '{target}'")
        return None, None, None
    
    # Initialize a set to keep track of all nodes within 2 hops
    nodes_within_2_hops = set(path)
    
    # Expand the set to include all neighbors within 2 hops of the path nodes
    for node in path:
        for neighbor in G.neighbors(node):
            nodes_within_2_hops.add(neighbor)
            # Include the neighbors of the neighbor (2 hops)

            if second_hop:
                for second_neighbor in G.neighbors(neighbor):
                    nodes_within_2_hops.add(second_neighbor)
    
    # Create a subgraph for the nodes within 2 hops
    path_graph = G.subgraph(nodes_within_2_hops)

    if save_files:
        if not os.path.exists(data_dir):
            os.makedirs(data_dir)
        
        # nt = Network('500px', '1000px', notebook=True)
        
        # # Add nodes and edges from the subgraph to the Pyvis network
        # nt.from_nx(path_graph)
        
        fname=f'shortest_path_2hops_{str(source)[:10]}_{str(target)[:10]}'
        #remove all characters that will cause issues with the file name, including path separators
        fname = ''.join(e for e in fname if e.isalnum() or e in ['_', '.'])

        # nt.show(f"{data_dir}/{fname}.html")
        # if verbatim:
        #     print(f"HTML visualization: {fname}")

        graph_GraphML = f'{data_dir}/{fname}.graphml'
        nx.write_graphml(path_graph, graph_GraphML)
        if verbatim:
            print(f"GraphML file: {graph_GraphML}")  
    else:
        fname=None
        graph_GraphML=None
        
    shortest_path_length = len(path) - 1  # As path length is number of edges
    
    return path, path_graph, shortest_path_length  #, fname, graph_GraphML

#from GraphReasoning_Mod.graph_tools
def find_best_fitting_node_list(keyword, embeddings, embedding_object, N_samples=5):
    """
    Find the top N_samples nodes with the highest similarity to the keyword.

    Parameters:
    - keyword: str, the input keyword to find similar nodes for.
    - embeddings: dict, a dictionary where keys are nodes and values are their embeddings.
    - embedding_object: HuggingFaceEmbeddings.
    - N_samples: int, number of top similar nodes to return.

    Returns:
    - List of tuples [(node, similarity), ...] in descending order of similarity.
    """

    # Generate embedding for the keyword using the embedding endpoint
    keyword_embedding = embedding_object.embed_query(keyword)
    
    # Initialize a min-heap
    min_heap = []
    heapq.heapify(min_heap)
    
    for node, embedding in embeddings.items():
        embedding = np.array(embedding)  # Ensure embedding is a numpy array
        if embedding.ndim > 1:
            embedding = embedding.flatten()  # Flatten only if not already 1-D
        similarity = 1 - cosine(keyword_embedding, embedding)  # Cosine similarity
        
        # If the heap is smaller than N_samples, just add the current node and similarity
        if len(min_heap) < N_samples:
            heapq.heappush(min_heap, (similarity, node))
        else:
            # If the current similarity is greater than the smallest similarity in the heap
            if similarity > min_heap[0][0]:
                heapq.heappop(min_heap)  # Remove the smallest
                heapq.heappush(min_heap, (similarity, node))  # Add the current node and similarity
                
    # Convert the min-heap to a sorted list in descending order of similarity
    best_nodes = sorted(min_heap, key=lambda x: -x[0])

    assert len(best_nodes) >0 , f"Error in graph_tools.find_best_fitting_node_list(): No nodes found for keyword {keyword}."
    
    # Return a list of tuples (node, similarity)
    return [(node, similarity) for similarity, node in best_nodes]

#From GraphReasoning_Mod.graph_analysis. Add metadata if available
def print_node_pairs_edge_title_metadata(G):
    pairs_and_titles = []
    for node1, node2, data in G.edges(data=True):
        assert isinstance(data, dict), f"print_node_pairs_edge_title_metadata(): Expected data to be a dictionary, but got {type(data)}"
        assert 'title' in data, f"print_node_pairs_edge_title_metadata(): Expected 'title' key in data, but got {data.keys()}"

        metadata = data.get('metadata', 'No metadata')  # Default to 'No metadata' if not present #TODO: check the .get method

        # Assuming 'title' is the edge attribute you want to print
        title = data.get('title', 'No title')  # Default to 'No title' if not present
        pairs_and_titles.append(f"{node1}, {title}, {node2} | {metadata}")
    #print ("Format: node_1, relationship, node_2 | metadata")
    return pairs_and_titles

#From GraphReasoning_Mod.graph_analysis
def print_path_with_edges_as_list(G, path, keywords_separator=' --> '):
    path_elements = []

    for i in range(len(path) - 1):
        node1 = path[i]
        node2 = path[i + 1]

        # Retrieve edge data between node1 and node2
        edge_data = G.get_edge_data(node1, node2)

        # Access the 'title' directly from the edge_data
        if edge_data:
            edge_title = edge_data.get('title', 'No title')
        else:
            edge_title = 'No title'

        # Construct the path elements, inserting the edge title between node pairs
        if i == 0:
            path_elements.append(node1)  # Add the first node
        path_elements.append(edge_title)  # Add the edge title
        path_elements.append(node2)  # Add the second node

    # Convert the list of path elements into a string with the specified separator
    as_string = keywords_separator.join(path_elements)

    return path_elements, as_string


In [13]:
from typing_extensions import List, TypedDict

# Define the structure of one retrieved-context entry (one per keyword pair)
class GraphRAGData(TypedDict):
    keyword_1: str
    keyword_2: str
    graph_path_str: str
    node_list: List[str]


## New GraphRAG functions
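
The functions in this section ask the LLM for a bracketed, semicolon-separated list of key phrases, form pairs from them (capped at 10), and split a total context budget across the pairs. The sketch below isolates those three steps so they can be tried without a model. `parse_keyphrases`, `build_pairs`, and `per_pair_budget` are hypothetical names for illustration, and the sample response is made up.

```python
from itertools import combinations, islice


def parse_keyphrases(response, expected):
    """Extract `expected` phrases from text shaped like '[ a ; b ; c ]'."""
    start = response.find("[")
    end = response.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no bracketed list found in the model response")
    phrases = [part.strip() for part in response[start + 1:end].split(";")]
    phrases = [phrase for phrase in phrases if phrase][:expected]
    if len(phrases) != expected:
        raise ValueError(f"expected {expected} key phrases, got {len(phrases)}")
    return phrases


def build_pairs(phrases, max_pairs=10):
    """Unordered pairs in input order, capped because the count grows as O(n^2)."""
    return list(islice(combinations(phrases, 2), max_pairs))


def per_pair_budget(total_limit, n_pairs):
    """Even split of the budget; None means 'no limit' when the share rounds to 0."""
    if n_pairs == 0:
        return None
    share = total_limit // n_pairs
    return share if share > 0 else None


sample_response = (
    "Output:\n[ Business graduate with limited math background ; "
    "Eigenvalues and matrix transformations ; Prefers visualizations ]"
)
phrases = parse_keyphrases(sample_response, expected=3)
pairs = build_pairs(phrases)
budget = per_pair_budget(300, len(pairs))
print(phrases)
print(len(pairs), budget)
```

Raising `ValueError` instead of asserting is a design choice in this sketch: malformed model output is an expected runtime condition, and assertions are skipped when Python runs with `-O`.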

In [14]:
from langchain.prompts import PromptTemplate

# Define the state structure
class State_kg(TypedDict):
    student_name: str
    profile: str
    request: str  # The overarching question or request
    context: List[GraphRAGData]
    answer: str

custom_prompt = PromptTemplate(
    template =chatbot_prompt_template,
    input_variables= ["profile", "context", "request"],
)

In [15]:
def determine_keyphase(llm, input_text, num_return=2):
    """
    Asks the LLM to extract the most relevant key phrases from the input text.

    Parameters:
    llm: The language model used for extraction; must provide invoke(prompt) returning a string.
    input_text (str): The input text from which to extract key phrases.
    num_return (int): The number of key phrases to return. Default is 2.

    Returns:
    list: The top num_return key phrases extracted from the text, ordered by importance as determined by the LLM.
    """

    num_keyphrase_to_generate = num_return

    #TODO: consider student current state and desired state as desired output
    prompt_keyphrase = f"""
You are an expert in creating concise, high-quality summaries optimized for embedding and retrieval. Your task is to extract key phrases or sentences from a provided text block (delimited by ```) that encapsulate the most important content. These summaries should focus on essential details and exclude any formatting instructions or examples unrelated to the content.

### Tasks:
1. Ignore all formatting instructions or irrelevant text in the provided text block.
2. Generate {num_keyphrase_to_generate} unique and meaningful sentences that align with the request and the associated student profile.
3. Ensure the extracted sentences are ordered by their importance to the content of the text block.
4. Output the key phrases in a plain list format, with no additional text or commentary.

### Example:
#### Number of sentences to generate: 3  
#### Input Text Block:
    ```
    You are an expert tutor in mathematics and linear algebra, specializing in personalized and context-aware explanations. Your goal is to provide clear, relevant, and engaging responses tailored to the student’s profile, retrieved context, and their request. Use the information provided to adapt your explanation to their background, strengths, weaknesses, and preferences.

### Student Profile:

        Background: Graduate student pursuing an Industrial Engineering degree with exposure to optimization techniques.
        Strengths: Comfortable with mathematical modeling and programming in Python.
        Weaknesses: Lacks practical experience with stochastic and simulation models.
        Preferences: Prefers structured lessons with hands-on coding exercises and case studies.
        Prior Course History: 
        - Linear Algebra for Engineers
        - Optimization Techniques
        - Applied Probability and Statistics

### Retrieved Context:


### Student Request:
Help me understand how SVD is used for dimensionality reduction in machine learning. Provide a Python-based example and resources to deepen my understanding.

### Instructions for Response:
1. Start with a concise summary addressing the student’s request directly.
2. Provide a detailed explanation that aligns with their preferences (e.g., visual aids, practical examples, or technical depth).
3. Relate your explanation to the student’s strengths while addressing their weaknesses constructively.
4. If applicable, suggest additional resources (e.g., textbooks, videos, or Python libraries) to help the student further explore the topic.
5. End with an encouraging note to motivate the student.

### Example Structure for Your Response:
**1. Summary:** A short, focused answer to the request.  
**2. Detailed Explanation:** An in-depth response tailored to the student’s background and the retrieved context.  
**3. Additional Resources (if applicable):** Suggestions for further exploration or practice.  
**4. Encouragement:** A motivating closing statement.  

Remember to always prioritize clarity and ensure the response is aligned with the student’s knowledge level and preferences.
    ```

    #### Example Output:
    [ 
        Graduate student specializing in Industrial Engineering with a background in optimization techniques and prior coursework in Linear Algebra, Optimization, and Applied Probability. ;
        Explanation of SVD for dimensionality reduction in machine learning, including a Python-based example and additional resources for further study. ;
        Limited experience with practical applications for simulation models.
    ]

    ### Instructions:
    Using the format and approach outlined above, extract key phrases from the text block provided below:

    Number of sentences to generate: {num_keyphrase_to_generate}
    Text Block:
    ```
    {input_text}
    ```
    Output:
    """
    response = llm.invoke(prompt_keyphrase)

    # Keep only the text between the first '[' and the last ']'
    start_index = response.find('[')
    end_index = response.rfind(']')
    assert start_index != -1 and end_index > start_index, f"determine_keyphase() error. No bracketed list found in response: {response!r}"
    string_keyphrase= response[start_index+1 :end_index]
    keyphrase = string_keyphrase.split(";")
    keyphrase = [phrase.strip() for phrase in keyphrase] # Remove leading/trailing whitespace

    keyphrase = keyphrase[:num_return]
    assert len(keyphrase) == num_return, f"determine_keyphase() error. Expected {num_return} keyphrase, but got {len(keyphrase)} keyphrase."

    return keyphrase

# # #usage
# keyphrase = determine_keyphase(llm, prompt_test, num_return=4)
# print(keyphrase)

In [16]:
def graph_retrieve(g, node_embeddings, embedding_object, input_text, num_keywords=2, total_N_limit=300, out_dir = './outputs/')->List[GraphRAGData]:
    """ 
    This function will retrieve content from a graph based on the input text.

    Retrieved context includes:
    - Path between keywords (shortest path between their best-fitting nodes)
    - Nodes and edges neighboring the shortest path (find_path is called with its default second_hop=False, so only direct neighbors are included)

    Note: Based on GraphReasoning.graph_analysis.find_path_and_reason()
    """

    ###Design decisions for awareness and future adjustment
    #keyword pairs are built from combinations of the key phrases
    #the LLM determines the most important key phrases
    #the number of keyword pairs is limited to 10 because of O(n^2) complexity


    #determine keyphrase
    keyphrase = determine_keyphase(llm, input_text, num_return=num_keywords)

    #make set of keyword pairs. Limit to 10 due to O(n^2) complexity
    keyphrase_pairs = set()
    for i in range(len(keyphrase)):
        for j in range(i+1, len(keyphrase)):
            keyphrase_pairs.add((keyphrase[i], keyphrase[j]))
            if len(keyphrase_pairs) >= 10:
                break
        if len(keyphrase_pairs) >= 10:
            break

    # print(f"keyphrase: {keyphrase}")#DEBUG
    # print(f"Keyword pairs: {keyword_pairs}")#DEBUG

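    #fewer than two key phrases yields no pairs, so there is nothing to retrieve
    if not keyphrase_pairs:
        return []

    #limit the number of nodes for each keyword pair to sum to total_N_limit
    N_limit = total_N_limit//len(keyphrase_pairs) if total_N_limit//len(keyphrase_pairs) > 0 else None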

    #get path and node info from KG for each keyword pair
    retrieved_data_list = []
    unique_nodes = set() # Set to keep track of unique nodes
    for key_1, key_2 in keyphrase_pairs:
        #TODO: address the fact that the best fitting node for the keywords is often the same node so it gets repeated and no new information is added. 
        #TODO: address semantic search to get nodes that are best related to the keywords. e.g. diversify by source, how to add keyword context, etc.

        similarity_fit_n1=0
        similarity_fit_n2=0 #which fit to use: 0 = best only, 1 = second best only, etc.
        graph_path_str = ""
        node_list = []

        (best_node_1, best_similarity_1, best_node_2, best_similarity_2), path_of_nodes, path_graph, _=find_path(
            g, node_embeddings, embedding_object, keyword_1 =key_1,  keyword_2= key_2,verbatim= False,
            similarity_fit_ID_node_1=similarity_fit_n1,similarity_fit_ID_node_2=similarity_fit_n2,
            save_files=True, data_dir=out_dir,
        )

        # print(f"keyword: {key_1}, best node: {best_node_1}, Node similarity: {best_similarity_1}")#DEBUG
        # print(f"keyword: {key_2}, best node: {best_node_2}, Node similarity: {best_similarity_2}")#DEBUG

        #if no path between nodes then return string
        if path_of_nodes is None or len(path_of_nodes) < 1:
            path_list_string = f"No path found between (keyphrase: {key_1}, best node: {best_node_1}) and (keyphrase: {key_2}, best node: {best_node_2})."
            #node_list= [""] * N_limit

        else:
            #node and relationship pairs context
            node_list = print_node_pairs_edge_title_metadata(path_graph)
            if N_limit is not None:
                node_list=node_list[:N_limit]
                path_of_nodes=path_of_nodes[:N_limit]

            #path with relationships (built on every branch so path_list_string is always defined)
            _, path_list_string= print_path_with_edges_as_list(g, path_of_nodes)

            # print("path of nodes: ", path_of_nodes) #debug
            # print("path graph: ", path_graph)#debug

        # # Prevent duplicates for entries in the node lists between keyword pair lists 
        # node_list = [node for node in node_list if node not in unique_nodes] #Remove duplicates from node_list
        # unique_nodes.update(node_list)

        # print(f"Node list to add: {node_list}") #DEBUG

        #TODO prioritize "is source document of" and "is prerequisite for" relationships

        # print(f"Path list string to add: {path_list_string}")#DEBUG

        graph_retrieval_data = {
            "keyword_1": key_1,
            "keyword_2": key_2,
            "graph_path_str": path_list_string,
            "node_list": node_list,
        }
        retrieved_data_list.append(graph_retrieval_data)
    
    return retrieved_data_list

# # #usage
# retrieved_context_obj= graph_retrieve(graph, existing_node_embeddings, embd, prompt_test, num_keywords=3, total_N_limit=10)
# for context_obj in retrieved_context_obj:
#     print(f"Retrieved data list: {context_obj}")

In [17]:
def graph_response_generate(state:State_kg):
    """
    This function generates a response based on the retrieved context from the graph.
    The response includes the path between keywords and the nodes and relationships for each node in the path.

    Parameters:
    state (State_kg): A dictionary containing the student profile, the request, and the retrieved context for each keyword pair.

    Returns:
    State_kg: The same state with the generated response stored under "answer".
    """

    # Helper function for formatting
    join_strings_newline = lambda strings: '\n'.join(strings)

    # Create context string
    graph_context = "### Knowledge Graph Context:\n\n"
    
    # Loop through each retrieved context object
    for idx, context_obj in enumerate(state["context"], start=1):
        keyword_1 = context_obj["keyword_1"]
        keyword_2 = context_obj["keyword_2"]
        graph_path_str = context_obj["graph_path_str"]
        node_list = context_obj["node_list"]

        # Add context for each keyword pair
        graph_context += f"**{idx}. Keywords:** `{keyword_1}` and `{keyword_2}`\n"
        graph_context += f"- **Path:** {graph_path_str}\n"
        graph_context += "- **Subgraph Relationships:**\n"
        graph_context += f"  {join_strings_newline(node_list)}\n\n"

    #format the prompt
    formatted_prompt = custom_prompt.format(
        profile=state["profile"],
        request=state["request"],
        context=graph_context
    )

    # print(formatted_prompt)
    response = llm.invoke(formatted_prompt)
    state["answer"] = response
    return state
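
The context formatting can be checked without calling the LLM by pulling it into a pure function. This is a minimal sketch under that assumption; `format_graph_context` and the sample context are hypothetical. It differs from `graph_response_generate` in one respect: it indents every relationship line, whereas the version above indents only the first.

```python
def format_graph_context(context_objs):
    """Build the markdown context string from retrieved keyword-pair objects."""
    lines = ["### Knowledge Graph Context:", ""]
    for idx, obj in enumerate(context_objs, start=1):
        lines.append(f"**{idx}. Keywords:** `{obj['keyword_1']}` and `{obj['keyword_2']}`")
        lines.append(f"- **Path:** {obj['graph_path_str']}")
        lines.append("- **Subgraph Relationships:**")
        # Indent each relationship so it stays nested under the bullet
        lines.extend(f"  {rel}" for rel in obj["node_list"])
        lines.append("")
    return "\n".join(lines)


# Hypothetical context object shaped like one entry from graph_retrieve
sample_context = [{
    "keyword_1": "eigenvalue",
    "keyword_2": "matrix",
    "graph_path_str": "eigenvalue --> matrix",
    "node_list": ["eigenvalue is property of matrix", "matrix has determinant"],
}]
context_text = format_graph_context(sample_context)
print(context_text)
```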


In [18]:
# #Usage and Testing

# state = State_kg(
#         profile=student_1["profile"],
#         request= student_1["requests"][0],
#         context=retrieved_context_obj, 
#         answer=""
#     )

# custom_prompt = PromptTemplate(
#     template =chatbot_prompt_template,
#     input_variables= ["profile", "request", "context"],
# )

# response = graph_response_generate(state)
# print(response['answer'])

## Run test cases

In [19]:
for graph_name in graph_dict.keys():
    
    graph=graph_dict[graph_name]
    embeddings = embds_dict[graph_name]

    graph_dict[graph_name], embds_dict[graph_name] = clean_graph(graph,embeddings)


Initial number of nodes: 3484
Initial number of edges: 9841
Cleaned number of nodes: 3483
Cleaned number of edges: 8924
Final node data: [('C:\\Users\\jonathan.kasprisin\\github\\Learning\\KG_ilp\\data\\pdfs\\Gilbert_Strang_Linear_Algebra_and_Its_Applicatio_230928_225121.pdf', {'group': 1, 'color': '#57dbc2', 'size': 2497})]
Final edge data: [('C:\\Users\\jonathan.kasprisin\\github\\Learning\\KG_ilp\\data\\pdfs\\Gilbert_Strang_Linear_Algebra_and_Its_Applicatio_230928_225121.pdf', 'linear algebra', {'title': 'is source document of', 'weight': 1.0})]
Number of node embeddings: 3483

Initial number of nodes: 77039
Initial number of edges: 62284
Cleaned number of nodes: 77032
Cleaned number of edges: 62257
Final node data: [('cartesian product of vector spaces', {'group': 1, 'color': '#57db89', 'size': 1})]
Final edge data: [('cartesian product of vector spaces', 'ordered pairs', {'title': 'consists of', 'metadata': "{'source': 'Gilbert_Strang_Linear_Algebra_and_Its_Applicatio_230928_22512

In [20]:
# output_pickle_path = "output_data_records_graph.pkl"
# data_records_graph = []

# custom_prompt = PromptTemplate(
#     template =chatbot_prompt_template,
#     input_variables= ["profile", "request", "context"],
# )

# for graph_name in tqdm(graph_dict.keys(), desc="Generating from Graphs", total=len(graph_dict)):

#     graph=graph_dict[graph_name]
#     existing_node_embeddings = embds_dict[graph_name]

#     for student_index, student in enumerate(students):
#         for request_index, request in enumerate(student["requests"]):

#             test_case_id = f"s{student_index}r{request_index}"

#             request = student["requests"][request_index]

#             empty_context = {
#                 "keyword_1": "n/a",
#                 "keyword_2": "n/a",
#                 "graph_path_str": "",
#                 "node_list": [""],
#             }
#             # Initialize the state
#             state = State_kg(
#                 profile=student["profile"],
#                 request= request,
#                 context= [empty_context], 
#                 answer=""
#             )

#             #format the prompt
#             formatted_prompt = custom_prompt.format(
#                 profile=state["profile"],
#                 request=state["request"],
#                 context=state["context"]
#             )

#             try:
#                 out_dir = f"./outputs/{graph_name}/"
#                 retrieved_context_obj= graph_retrieve(graph, existing_node_embeddings, embd, formatted_prompt, num_keywords=3, total_N_limit=100, out_dir=out_dir)

#                 # print(f"Retrieved context for {student['test_case']}") #DEBUG

#                 # update the state with the retrieved context
#                 state["context"] = retrieved_context_obj

#                 try:
#                     # Execute the workflow
#                     state = graph_response_generate(state)
#                 except Exception as e:
#                     state["answer"] = f"Error in graph response generation for {test_case_id}: {e}"
#                     print(f"Error in graph response generation for {test_case_id}: {e}")
#                     break

            
#             except Exception as e:
#                 state["answer"] = f"Error in graph retrieval for {test_case_id} and graph {graph_name}: {e}"
            
#             # Format LLM context as a **string**
#             llm_context = "\n\n".join(
#                 f"Keywords: {context_obj.get('keyword_1', 'n/a')} and {context_obj.get('keyword_2', 'n/a')}\n"
#                 f"Path: {context_obj.get('graph_path_str', '')}\n"
#                 f"Node List: {', '.join(context_obj.get('node_list', ['']))}"
#                 for context_obj in state["context"]
#             )

#             # Append data to the list
#             data_records_graph.append({
#                 "index": len(data_records_graph),  # Auto-incrementing index
#                 "type": graph_name,
#                 "test_case": test_case_id,
#                 "student": student["profile"],
#                 "request": request,
#                 "context": llm_context,
#                 "response": state['answer']
#             })

#     try:
#         with open(output_pickle_path, "wb") as f:
#             pickle.dump(data_records_graph, f)
#     except Exception as e:
#         print(f"Error saving output pickle file: {e}")

In [21]:
#regenerate a specific test case

output_pickle_path = "output_data_records_graph.pkl"
data_records_graph = []

for graph_name in ['GR_kg_no_refine3_0.85', 'GR_kg_no_refine3_0.75']:
    graph=graph_dict[graph_name]
    existing_node_embeddings = embds_dict[graph_name]

    custom_prompt = PromptTemplate(
        template =chatbot_prompt_template,
        input_variables= ["profile", "request", "context"],
    )
    student_index = 1
    request_index = 0


    test_case_id = f"s{student_index}r{request_index}"

    student = students[student_index]
    request = student["requests"][request_index]

    empty_context = {
        "keyword_1": "n/a",
        "keyword_2": "n/a",
        "graph_path_str": "",
        "node_list": [""],
    }
    # Initialize the state
    state = State_kg(
        profile=student["profile"],
        request= request,
        context= [empty_context], 
        answer=""
    )

    #format the prompt
    formatted_prompt = custom_prompt.format(
        profile=state["profile"],
        request=state["request"],
        context=state["context"]
    )

    try:
        out_dir = f"./outputs/{graph_name}/"
        retrieved_context_obj= graph_retrieve(graph, existing_node_embeddings, embd, formatted_prompt, num_keywords=3, total_N_limit=100, out_dir=out_dir)

        # print(f"Retrieved context for {student['test_case']}") #DEBUG

        # update the state with the retrieved context
        state["context"] = retrieved_context_obj

        try:
            # Execute the workflow
            state = graph_response_generate(state)
        except Exception as e:
            state["answer"] = f"Error in graph response generation for {test_case_id}: {e}"
            print(f"Error in graph response generation for {test_case_id}: {e}")

    except Exception as e:
        state["answer"] = f"Error in graph retrieval for {test_case_id} and graph {graph_name}: {e}"

    # Format LLM context as a **string**
    llm_context = "\n\n".join(
        f"Keywords: {context_obj.get('keyword_1', 'n/a')} and {context_obj.get('keyword_2', 'n/a')}\n"
        f"Path: {context_obj.get('graph_path_str', '')}\n"
        f"Node List: {', '.join(context_obj.get('node_list', ['']))}"
        for context_obj in state["context"]
    )

    # Append data to the list
    data_records_graph.append({
        "index": len(data_records_graph),  # Auto-incrementing index
        "type": graph_name,
        "test_case": test_case_id,
        "student": student["profile"],
        "request": request,
        "context": llm_context,
        "response": state['answer']
    })

try:
    with open(output_pickle_path, "wb") as f:
        pickle.dump(data_records_graph, f)
except Exception as e:
    print(f"Error saving output pickle file: {e}")
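
To confirm that the saved records can be read back, a round trip through `pickle` can be sketched as follows. The record below is a hypothetical stand-in with a subset of the fields used above, and a temporary directory is used so the sketch does not overwrite the real output file.

```python
import os
import pickle
import tempfile

# Hypothetical record with a subset of the fields stored in data_records_graph
records = [{
    "index": 0,
    "type": "example_graph",
    "test_case": "s1r0",
    "response": "example response",
}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "output_data_records_graph.pkl")

    # Write the records the same way the cell above does
    with open(path, "wb") as f:
        pickle.dump(records, f)

    # Read them back in binary mode
    with open(path, "rb") as f:
        loaded = pickle.load(f)

print(f"Loaded {len(loaded)} record(s); first test case: {loaded[0]['test_case']}")
```

Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code from untrusted input.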

In [22]:
for record in data_records_graph:
    print(f"---->response: \n{record['response']}")

---->response: 
---

**1. Summary:**
Eigenvalues of a positive definite matrix are all positive, which is a key characteristic that helps define positive definiteness. We'll explore this relationship using a Python-based example to illustrate the concept.

**2. Detailed Explanation:**

Positive definite matrices play a crucial role in various fields, including optimization, machine learning, and statistics. To understand how eigenvalues relate to positive definite matrices, let's first recall the definition of a positive definite matrix:

A symmetric matrix A is positive definite if, for any non-zero vector v, the inequality v^T * A * v > 0 holds true.

Now, let's consider the relationship between eigenvalues and positive definite matrices. The eigenvalues of a positive definite matrix A are all real and positive. This is because the quadratic form v^T * A * v can be expressed as a sum of squares of the eigenvalues, each multiplied by the corresponding eigenvector component. Since the 