In [11]:
text = """ 
```json
{
  "title": "Sci-Q-Graph: Knowledge Graph-Guided Scientific Question Answering and Explanation Generation",
  "proposed_method": "Sci-Q-Graph proposes a novel approach to scientific question answering and explanation generation by leveraging a dynamic knowledge graph constructed from scientific literature. Instead of relying on a multi-agent LLM framework or direct document retrieval, this method focuses on building and querying a knowledge graph (KG) to synthesize answers. The core idea is to represent scientific knowledge as interconnected entities and relationships, enabling more structured and explainable reasoning. \n\nHere's how it works:\n\n1.  **Knowledge Graph Construction:**\n    *   **Literature Ingestion:** A large-scale corpus of scientific articles is ingested, parsed, and preprocessed. This involves extracting text, identifying sections (e.g., abstract, introduction, methods, results, discussion), and handling different document formats (e.g., PDF, XML).\n    *   **Entity and Relation Extraction:** Employing a combination of rule-based systems, statistical NLP techniques, and LLMs (fine-tuned for this task), the system extracts entities (e.g., scientific concepts, methods, experimental variables, organisms, chemicals, genes) and relations between them (e.g., 'is_a', 'used_for', 'affects', 'results_in'). The extracted entities and relations will be incorporated to create the nodes and edges of the KG.\n    *   **Event Extraction:** Extract causal or temporal events from the text. For example, 'Drug X *inhibits* Enzyme Y' or 'Treatment A *increased* survival rate *after* 6 months'.\n    *   **Knowledge Graph Integration:** Integrating extracted information to build a comprehensive, interconnected knowledge graph. This step will involve both automatic and human-in-the-loop processes to ensure data quality and accuracy. The KG is designed to be dynamic, constantly updated with new publications.\n\n2.  
**Query Processing and KG Traversal:**\n    *   **Query Analysis:** The input scientific query is analyzed using NLP techniques, and the query is then transformed into a formal query language (e.g., SPARQL or a custom query language designed for the KG). The query analysis will identify the entities and relationships mentioned in the question and their desired properties.\n    *   **KG Traversal:** The KG is queried using the formal query. The system identifies relevant nodes and edges based on the query's constraints. This step involves efficient graph traversal algorithms (e.g., shortest path search, semantic similarity-based search) to find the most relevant information in the KG.\n    *   **Evidence Retrieval:** Retrieve the original text snippets from the source documents that support the relationships and entities identified during KG traversal. This step is essential for providing evidence and citations.\n\n3.  **Answer Synthesis and Explanation Generation:**\n    *   **Answer Generation:** The system aggregates the information retrieved from the KG to generate an answer to the scientific query. This step might involve combining and summarizing information from multiple nodes and paths in the graph. The answer is constructed to be factually accurate and comprehensive, addressing the requirements of the original query.\n    *   **Explanation Generation:** A key feature of Sci-Q-Graph is generating detailed, traceable explanations. The system will generate a step-by-step explanation tracing the reasoning path used to derive the answer from the KG. It will include the entities, relationships, and the relevant evidence snippets from the original documents. This is implemented by tracing the nodes and edges used during KG traversal and retrieving the corresponding text snippets. The explanation will cite the specific source documents and locations (e.g., page numbers, section headings) that support each claim. 
The explanation will highlight the relevant parts of the supporting evidence and present the citations in a clear, easy-to-understand format.\n    *   **Uncertainty Quantification:** Develop a method to quantify the uncertainty associated with the answer. The uncertainty quantification is based on factors such as the number of supporting sources, the consistency of the evidence, and the confidence scores of the entity and relation extraction. The uncertainty score is presented alongside the answer and the explanation.",
  "experiment_plan": "*   **Datasets:**\n    *   **Scientific Question Answering Datasets:** Utilize existing QA datasets such as SciQA, BioASQ, and others, to evaluate the performance of Sci-Q-Graph. Supplement with a new dataset that has a focus on questions that require reasoning across multiple articles and complex relationships.\n    *   **Scientific Literature Corpus:** Use a comprehensive corpus of scientific publications, e.g., PubMed Central, arXiv, and OpenAlex, to build the knowledge graph. This should contain full-text articles, abstracts, and supplementary materials.\n    *   **Evaluation Datasets for KG:** Create a specific dataset to evaluate the performance of the KG. This should include entity and relation annotations, and queries to test the graph traversal and reasoning capabilities.\n    *   **Gold-Standard Answers and Explanations:** For both QA and KG evaluation datasets, provide gold-standard answers and explanations with supporting citations for rigorous evaluation.\n*   **Models and Tools:**\n    *   **Knowledge Graph Construction Tools:** Use tools such as SpaCy, AllenNLP, and SciBERT to build and curate the KG. Develop custom modules and fine-tune LLMs for entity and relation extraction from scientific text.\n    *   **Graph Database:** Utilize a graph database such as Neo4j or Amazon Neptune to store and manage the knowledge graph.\n    *   **Embedding Models:** Use pre-trained embedding models (e.g., SciBERT, BioBERT) for entity and relation embeddings to enable efficient KG traversal, semantic similarity search and knowledge discovery.\n    *   **Query Language:** Develop a custom query language tailored to scientific queries and the structure of the knowledge graph. The system can also support SPARQL for querying.\n    *   **Answer Generation and Explanation Generation:** Use LLMs (e.g., GPT-3.5, GPT-4, or open-source LLMs like Llama 2) to generate human-readable answers and explanations based on the KG query results. 
These LLMs will be fine-tuned on scientific text and explanation generation tasks.\n*   **Metrics:**\n    *   **Answer Accuracy:** Evaluate answer accuracy against gold-standard answers using metrics like Exact Match, F1-score, ROUGE, and BLEU.\n    *   **Explanation Quality:** Evaluate the quality of explanations through both automated metrics and human evaluation. Automated metrics include:\n        *   **Completeness:** Does the explanation cover all relevant aspects of the answer?\n        *   **Coherence and Fluency:** Is the explanation well-written and easy to understand?\n        *   **Evidential Support:** Are the claims in the explanation supported by evidence from the KG and original documents? Does it cite all relevant documents?\n        *   **Faithfulness:** Does the explanation align with the facts in the knowledge graph and source documents?\n    *   **KG Evaluation Metrics:** Evaluate the KG's performance using:\n        *   **Precision and Recall for Entity and Relation Extraction:** Measure the accuracy of entity and relation extraction from scientific articles.\n        *   **KG Coverage:** Evaluate the completeness of the KG by measuring the number of entities, relationships, and concepts covered.\n        *   **Query Response Time:** Measure the time taken to answer scientific questions using the KG.\n    *   **Uncertainty Quantification:** Evaluate the effectiveness of uncertainty scores by comparing them to human-labeled confidence levels.\n*   **Experiment Setup:**\n    1.  
**Baseline Systems:** Compare Sci-Q-Graph to the following baselines:\n        *   **RAG Systems:** Traditional RAG systems that retrieve relevant documents using keywords and generate answers using LLMs.\n        *   **Direct LLM Prompting:** Directly feeding the query to LLMs like GPT-4 and evaluating the output.\n        *   **Knowledge Graph Construction with Simple Querying:** Create a simplified KG with basic extraction techniques, and utilize simple query strategies, such as pattern matching.\n    2.  **KG Construction and Refinement:** Evaluate different entity and relation extraction models. Experiment with rule-based, statistical, and LLM-based methods. Use human annotators for KG quality control and refinement.\n    3.  **Prompt Engineering:** Develop specific prompts for query analysis, answer generation, and explanation generation. Experiment with different prompt formats, instructions, and constraints. Ensure prompts are tailored to the scientific domain and KG structure.\n    4.  **Graph Traversal Strategies:** Test different graph traversal algorithms (e.g., breadth-first search, depth-first search, shortest path) to identify optimal methods for answering scientific queries. Implement semantic similarity-based search to increase the flexibility of the system.\n    5.  **Answer and Explanation Generation:** Test different LLMs for answer and explanation generation. Experiment with various prompting techniques and fine-tuning strategies.\n    6.  **Ablation Studies:** Conduct ablation studies to evaluate the contribution of different components of the system. For instance, remove the explanation generation module or evaluate the impact of different entity and relation extraction techniques.\n    7.  **Human Evaluation:** Conduct human evaluation to assess the quality of the answers and the explanations. Recruit domain experts to evaluate the responses. Assess the quality of the citations by checking for relevance and accuracy. 
Evaluate the readability and understandability of the explanations.\n    8.  **Error Analysis:** Analyze the errors made by Sci-Q-Graph and the baselines to identify areas for improvement. Analyze failure cases to better understand the limitations of the model.\n    9.  **Statistical Analysis:** Apply statistical methods to analyze the experimental results, including the comparison of different methods (e.g., t-tests, ANOVA), correlation analysis, and significance tests. Conduct statistical analysis to determine the significance of improvements in accuracy, completeness, and other metrics.\n*   **Iterative Refinement:** The framework will be developed iteratively. Based on the evaluation results, the knowledge graph construction process, query analysis, graph traversal algorithms, and the answer/explanation generation processes will be refined to improve overall performance.\n*   **Test Case Examples:**\n    *   **Test Case 1: Baseline Failure and Proposed Method Success**\n        *   **Query:** *\"What are the main side effects of using drug X?\"*\n        *   **Baseline (RAG):** The RAG system retrieves several documents mentioning drug X. However, it fails to identify the specific side effects and struggles to synthesize a cohesive answer, often providing a list of fragmented sentences from different articles with no clear context. It may not provide explicit citations.\n            *   **Input (RAG):** Question + Retrieved documents mentioning Drug X.\n            *   **Expected Output (RAG):** Fragmented sentences mentioning Drug X, without clear synthesis of the side effects, no citations.\n        *   **Proposed Method (Sci-Q-Graph):** The system performs the following steps:\n            1.  Analyzes the query and identifies \"Drug X\" and \"side effects\".\n            2.  Traverses the KG to find entities and relationships related to drug X and its side effects.\n            3.  
The KG will contain an edge for \"Drug X\" -\> \"side effects\" along with a collection of supporting evidence. In this case, the KG will contain the side effects found in the scientific publications, which are connected to the original literature through citations.\n            4.  Generates an answer by synthesizing the information from different nodes on the graph. The system generates a coherent answer by collecting the relevant entities, and edges from the KG. \n            5.  Generates a step-by-step explanation explaining how it found the side effects and it will cite the relevant articles.\n            *   **Input (Sci-Q-Graph):** Question + Constructed KG.\n            *   **Expected Output (Sci-Q-Graph):** *\"Drug X is associated with the following side effects (Citations): Side effect A (citation), Side effect B (citation), Side effect C (citation). The KG traversal started with the node for Drug X and followed the 'causes' or 'results_in' relations to the 'side effects' node. The following articles (list citations) provide the specific evidence.\"\n            *   **Explanation:** The Sci-Q-Graph is superior since the KG allows the system to synthesize evidence and provide detailed explanations with supporting citations. 
The KG provides structure allowing the system to navigate complex relationships across articles and generate a more accurate and useful response.\n    *   **Test Case 2: Complex Question Involving Multiple Concepts**\n        *   **Query:** *\"Compare the efficacy of treatment A versus treatment B for patients with condition C, considering the impact of variable D on the outcomes.\"*\n        *   **Baseline (Direct LLM Prompting):** The LLM might provide a general overview of the treatments but struggle to perform a detailed comparison, incorporate the impact of variable D, and often struggles to generate accurate citations.\n            *   **Input (LLM):** Question to the LLM.\n            *   **Expected Output (LLM):** A general summary of treatment A and treatment B without in-depth comparison, may not provide sufficient citations.\n        *   **Proposed Method (Sci-Q-Graph):** The system performs the following steps:\n            1.  Analyzes the query identifying treatment A, B, condition C, variable D, and \"efficacy\".\n            2.  The system constructs a KG with the nodes and relations for those concepts. The KG includes literature references. This step involves finding and extracting relevant nodes and relationships in the KG (e.g., treatment A and condition C, treatment B and condition C, variable D affecting the outcomes).\n            3.  The system retrieves supporting text snippets from the literature related to the relationships (e.g., Treatment A is effective for condition C, Treatment B is effective for condition C, and the effect of variable D). The KG will also incorporate details about the effect of variable D on outcomes.\n            4.  The system synthesizes an answer by comparing the efficacy of treatment A and B, considering variable D. 
It will provide citations from multiple sources and provide an explanation based on the KG.\n            *   **Input (Sci-Q-Graph):** Question + Constructed KG.\n            *   **Expected Output (Sci-Q-Graph):** *\"Treatment A and B are both used for condition C. Treatment A has an efficacy of X (citation), while Treatment B has an efficacy of Y (citation). The effect of variable D is Z (citation). Based on the KG, treatment A is generally more effective for patients without X, and treatment B is more effective in other scenarios (citations). The KG traversal started with the nodes corresponding to treatments, then followed the \"treats\" and \"impacts\" relationships, and finally extracted relevant evidence. The following articles (cite relevant papers) provide the supporting evidence. \"*\n            *   **Explanation:** The Sci-Q-Graph is superior here because it can handle complex questions that require the comparison of multiple treatments, consider multiple variables, and integrate information from various sources. The KG's structure will allow a step-by-step explanation of its reasoning, supported by citations."
}
```
"""

In [12]:
structured_data = {}
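
Before falling back to manual string search, it may be worth attempting a strict parse of the response. The sketch below strips an optional Markdown code fence and calls `json.loads` with `strict=False`, which tolerates raw newlines inside string values. `try_strict_parse` is a hypothetical helper, not part of this notebook. The `text` above would probably still fail this parse: pasting it into a non-raw triple-quoted string turns `\"` into bare quotes inside the JSON values. That is the case the manual search is meant to cover.

```python
import json
import re


def try_strict_parse(raw: str):
    """Return a dict if `raw` holds valid JSON (optionally fenced), else None."""
    # Drop a leading ```json (or bare ```) fence and a trailing ``` fence, if present.
    body = re.sub(r"^\s*```(?:json)?\s*|\s*```\s*$", "", raw)
    try:
        # strict=False accepts control characters such as raw newlines inside strings.
        return json.loads(body, strict=False)
    except json.JSONDecodeError:
        return None  # the caller can fall back to manual string search


# Illustrative input only: a fenced response with a raw newline inside a value.
sample = '```json\n{"title": "Demo", "proposed_method": "line one\nline two"}\n```'
parsed = try_strict_parse(sample)
print(parsed)
```

If this returns a dictionary, the manual extraction can be skipped; a `None` result signals that the fallback is needed.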

In [13]:
text[1298:1298+10]

" 'is_a', '"

In [None]:
if '"title"' in text:
    title_start = text.find('"title"') + len('"title"')
    # Find the colon after "title"
    title_start = text.find(':', title_start) + 1
    # Find the opening quote after the colon
    title_start = text.find('"', title_start) + 1
    # Find the closing quote
    title_end = text.find('"', title_start)
    if title_end > title_start:
        structured_data["title"] = text[title_start:title_end]
        # print(f"Found title using direct string search: {structured_data['title']}")

# Extract proposed_method: the text after "proposed_method": " up to the next field
if '"proposed_method"' in text:
    method_start = text.find('"proposed_method"') + len('"proposed_method"')
    # Find the colon after "proposed_method"
    method_start = text.find(':', method_start) + 1
    # Find the opening quote after the colon
    method_start = text.find('"', method_start) + 1
    
    # For the end, look for the next field or the closing brace
    if '"experiment_plan"' in text[method_start:]:
        # The value ends at the last quote before the next key. Slicing up to the
        # key itself would keep the closing quote, comma, and indentation.
        next_key = text.find('"experiment_plan"', method_start)
        pattern_end = text.rfind('"', method_start, next_key)
        if pattern_end > method_start:
            structured_data["proposed_method"] = text[method_start:pattern_end]
            # print(f"Found proposed_method using direct string search: {structured_data['proposed_method']}")
    else:
        # If no next field, look for closing quote + comma or closing quote + }
        for end_pattern in ['",', '"}']:
            pattern_end = text.find(end_pattern, method_start)
            if pattern_end > method_start:
                structured_data["proposed_method"] = text[method_start:pattern_end]
                break

# Extract experiment_plan
if '"experiment_plan"' in text:
    plan_start = text.find('"experiment_plan"') + len('"experiment_plan"')
    # Find the colon after "experiment_plan"
    plan_start = text.find(':', plan_start) + 1
    # Find the opening quote after the colon
    plan_start = text.find('"', plan_start) + 1
    
    # For the end, look for the next field or closing brace
    for end_pattern in ['"}', '"\n}']:
        pattern_end = text.find(end_pattern, plan_start)
        if pattern_end > plan_start:
            structured_data["experiment_plan"] = text[plan_start:pattern_end]
            break

# Unescape JSON-style sequences in a single left-to-right pass. Chained
# str.replace calls are order-sensitive: an escaped backslash followed by "n"
# would be misread as an escaped newline.
import re

_escapes = {'"': '"', 'n': '\n', '\\': '\\', '/': '/'}
for key in structured_data:
    if structured_data[key]:
        structured_data[key] = re.sub(
            r'\\(["n\\/])',
            lambda m: _escapes[m.group(1)],
            structured_data[key],
        )

# Report what was found, then display the result as the cell's value
# (a bare `return` is a SyntaxError outside a function).
if structured_data:
    print(f"Extracted fields using direct string search: {list(structured_data.keys())}")
structured_data

Extracted fields using direct string search: ['title', 'proposed_method', 'experiment_plan']
{'title': 'Sci-Q-Graph: Knowledge Graph-Guided Scientific Question Answering and Explanation Generation', 'proposed_method': 'Sci-Q-Graph proposes a novel approach to scientific question answering and explanation generation by leveraging a dynamic knowledge graph constructed from scientific literature. Instead of relying on a multi-agent LLM framework or direct document retrieval, this method focuses on building and querying a knowledge graph (KG) to synthesize answers. The core idea is to represent scientific knowledge as interconnected entities and relationships, enabling more structured and explainable reasoning. \n\nHere\'s how it works:\n\n1.  **Knowledge Graph Construction:**\n    *   **Literature Ingestion:** A large-scale corpus of scientific articles is ingested, parsed, and preprocessed. This involves extracting text, identifying sections (e.g., abstract, introduction, methods, result
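
The three extraction blocks in the cell above repeat the same steps: find the key, find the colon, find the opening quote, then find the end of the value. As a rough sketch, they could be folded into one helper. `extract_string_field` and its `next_key` parameter are hypothetical names, not part of this notebook. The end-of-value rule (the last quote before the next key, or before the final closing brace) is an assumption that holds only when the field order is known in advance.

```python
def extract_string_field(text, key, next_key=None):
    """Return the raw (still escaped) string value of `key`, or None if not found."""
    key_pos = text.find(f'"{key}"')
    if key_pos == -1:
        return None
    # Skip past the quoted key, then locate the colon and the opening quote.
    colon = text.find(':', key_pos + len(key) + 2)
    if colon == -1:
        return None
    open_quote = text.find('"', colon)
    if open_quote == -1:
        return None
    start = open_quote + 1
    # Assumption: the value ends at the last quote before the next key,
    # or before the final closing brace when this is the last field.
    limit = text.find(f'"{next_key}"', start) if next_key else text.rfind('}')
    if limit == -1:
        return None
    end = text.rfind('"', start, limit)
    return text[start:end] if end != -1 else None


# Illustrative input only: escaped quotes and an escaped newline inside values.
sample = '{\n  "title": "Demo: a \\"quoted\\" word",\n  "plan": "step 1\\nstep 2"\n}'
fields = {
    "title": extract_string_field(sample, "title", next_key="plan"),
    "plan": extract_string_field(sample, "plan"),
}
print(fields)
```

Searching backwards from the next key means escaped quotes inside a value do not end it early, which a forward search for the first quote would do. The returned values are still escaped, so any unescaping step would run afterwards.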