This notebook is module 2 of NEKO.

It performs the following steps:
1. Searches the knowledge graph for your keyword of interest and outputs a filtered knowledge graph.
2. Uses an LLM to summarize the filtered graph and generate a report.

To use this notebook, you need to:

- Input your OpenAI API key.
- Input your search keyword.
- Modify the prompt according to your needs.
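
As a rough picture of step 1, the sketch below runs the same `(entity A, entity B)` pattern used later in this notebook on two made-up answer strings, then collects every node whose name contains the keyword together with its neighbors. It uses only the standard library instead of NetworkX; the helper names `build_adjacency` and `keyword_neighborhood` and the toy answers are assumptions for illustration and are not part of NEKO.

```python
import re
from collections import defaultdict

# Same pair pattern the notebook uses: "(entity A, entity B)"
PAIR_PATTERN = r'\(([^,]+), ([^\)]+)\)'

def build_adjacency(answers):
    """Collect undirected edges from every "(A, B)" pair in the answer strings."""
    adjacency = defaultdict(set)
    for text in answers:
        for entity_a, entity_b in re.findall(PAIR_PATTERN, text):
            adjacency[entity_a].add(entity_b)
            adjacency[entity_b].add(entity_a)
    return adjacency

def keyword_neighborhood(adjacency, keyword, depth=1):
    """Nodes whose name contains the keyword, plus neighbors up to `depth` hops away."""
    selected = {node for node in adjacency if keyword.lower() in node.lower()}
    for _ in range(depth):
        for node in list(selected):
            selected.update(adjacency[node])
    return selected

# Toy answers, made up for illustration (not taken from the Excel file)
toy_answers = [
    "(GGS1, beta-carotene production), (glycerol, lipid accumulation)",
    "(lipid accumulation, beta-carotene production)",
]
adjacency = build_adjacency(toy_answers)
one_hop = keyword_neighborhood(adjacency, "carotene", depth=1)
two_hop = keyword_neighborhood(adjacency, "carotene", depth=2)
print(sorted(one_hop))  # ['GGS1', 'beta-carotene production', 'lipid accumulation']
print(sorted(two_hop))  # ['GGS1', 'beta-carotene production', 'glycerol', 'lipid accumulation']
```

With `depth=1`, only direct neighbors of the matching node are kept, so `glycerol` (two hops away) appears only when `depth=2`.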

In [1]:
from openai import OpenAI
# Set your OpenAI API key
client = OpenAI(api_key='your API key here') #******You can obtain your key from OpenAI
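
If you would rather not paste the key into the notebook, one option is to read it from an environment variable. The sketch below is a minimal example under that assumption; `load_api_key` is a hypothetical helper, and `OPENAI_API_KEY` is assumed to be the variable you have set in your shell.

```python
import os

def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable, or raise if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running this notebook.")
    return key

# Possible usage in the cell above (requires the variable to be set):
# client = OpenAI(api_key=load_api_key())
```

This keeps the key out of the saved notebook file, which matters if the notebook is shared or committed to version control.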

In [2]:
from pyvis.network import Network
import pandas as pd
import re
import networkx as nx

# Load the Excel file
filepath = 'modified_updated(Qwen1.5 14b)_Yarrowia_pdf_causal.xlsx'
df = pd.read_excel(filepath, engine='openpyxl')

# Perform search
keyword = "carotene"  # *******Input your keyword


# Initialize NetworkX Graph
G = nx.Graph()

# Nodes to exclude
# words_to_exclude = ['A', 'B', 'C']
words_to_exclude = []

# Regular expression to match the pattern (entity A, entity B)
pattern = r'\(([^,]+), ([^\)]+)\)'

# Iterate over the DataFrame rows to extract entity pairs and their sources
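for _, row in df.iterrows():
    # value = row['Response to New Question']
    value = row['Answer to Question 2']
    source = row['Title']  # Extract source for each pair

    # Skip empty cells: pandas reads them as NaN (a float), which re.findall cannot process
    if not isinstance(value, str):
        continue

    matches = re.findall(pattern, value)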
    for entity_a, entity_b in matches:
        # Check if any word to exclude is part of the entity names
        if not any(word in entity_a for word in words_to_exclude) and not any(word in entity_b for word in words_to_exclude):
            G.add_node(entity_a, label=entity_a)
            G.add_node(entity_b, label=entity_b)
            G.add_edge(entity_a, entity_b, title=source)

def search_network(graph, keyword, depth=1):
    nodes_of_interest = {n for n, attr in graph.nodes(data=True) if keyword.lower() in attr['label'].lower()}
    for _ in range(depth):
        for node in list(nodes_of_interest):
            nodes_of_interest.update(set(nx.neighbors(graph, node)))
    return graph.subgraph(nodes_of_interest)

filtered_graph = search_network(G, keyword)

# Extract node names from the filtered graph
node_names = list(filtered_graph.nodes())

# Prepare a simple text summary of node names
node_names_text = ", ".join(node_names)

# Now, `node_names_text` contains a clean, comma-separated list of node names, ready for summarization
print(node_names_text)

# Initialize Pyvis network with the filtered graph
net = Network(height="2160px", width="100%", bgcolor="#222222", font_color="white")
net.from_nx(filtered_graph)

# Configure the physics layout and node font options
net.set_options("""
{
  "physics": {
    "barnesHut": {
      "gravitationalConstant": -80000,
      "centralGravity": 0.5,
      "springLength": 75,
      "springConstant": 0.05,
      "damping": 0.09,
      "avoidOverlap": 0.5
    },
    "maxVelocity": 100,
    "minVelocity": 0.1,
    "solver": "barnesHut",
    "timestep": 0.3,
    "stabilization": {
        "enabled": true,
        "iterations": 500,
        "updateInterval": 10,
        "onlyDynamicEdges": false,
        "fit": true
    }
  },
  "nodes": {
    "font": {
      "size": 30,
      "color": "white"
    }
  }
}
""")





# Save the network as an HTML file
net.write_html('filtered_entity_' + keyword + '_network.html')

hydroxylase, 'glycerol', 49, YPD medium, 'β-carotene synthesis pathway', SaGGPPS, 'GGS1', '6.5 g/L β-carotene', transform Yarrowia lipolytica for beta-carotene production, stress response, expression of tHmgR and GGS1, overexpression of DID2 gene, Yarrowia lipolytica, fermentation condition optimization, Yarrowia lipolytica, synthetic biology, peroxisome, central metabolic modules, 'beta-carotene production', IDI, 'beta-carotene production, astaxanthin, (leucine auxotrophy, SeACSL641P, 'erg8', 'CIBTS2111', β-carotene biosynthetic pathway construction, 'high-cell-density fermentation', SsCarS, 'erg12', b-Carotene, 'Deleting ΔMHY1', β-carotene biosynthetic genes, β-Carotene biosynthesis, Yarrowia lipolytica production, '17a, Escherichia coli, AaBFS, 'MVA pathway', change in cell wall structure, Yarrowia lipolytica, multiple-copy pathway genes, beta-carotene production, Xanthophyllomyces dendrorhous, 'YNBgly60', accumulation, 'genetic engineering tools', Recombinant E. coli, 111.8 mg/L, 8
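
The printed list above contains entries such as `'glycerol'` (with quotes), `(leucine auxotrophy` (with a stray parenthesis), and `Yarrowia lipolytica` several times, which suggests that quote characters, unbalanced parentheses, or whitespace differences in the source answers end up inside node names. A minimal clean-up sketch follows; `clean_entity` is a hypothetical helper, not part of the notebook, and the rules it applies are assumptions about what counts as noise in your data.

```python
import re

def clean_entity(name):
    """Normalize an extracted entity name (hypothetical helper, not part of the notebook)."""
    name = name.strip()
    name = name.strip("'\"")  # drop leftover quote characters at either end
    # Drop a leading "(" only when it is unbalanced, so "acetyl-CoA (cytosolic)" is untouched
    if name.startswith("(") and ")" not in name:
        name = name[1:]
    name = re.sub(r"\s+", " ", name)  # collapse repeated whitespace into single spaces
    return name.strip()

# Examples modeled on entries in the printed node list
print(clean_entity("'glycerol'"))
print(clean_entity("(leucine auxotrophy"))
print(clean_entity("Yarrowia  lipolytica "))
```

To try it, you could call `clean_entity` on `entity_a` and `entity_b` before `G.add_node` and `G.add_edge`, so that variants of the same term merge into one node.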

In [3]:
from IPython.display import Markdown, display

def trim_text(text, max_length):
    if len(text) > max_length:
        return text[:max_length].rsplit(' ', 1)[0] + "..."  # Trim to max_length, avoid cutting words in half
    else:
        return text
    
# Apply the trimming function to node_names_text
cut_off_chunk_size = 30000
trimmed_node_names_text = trim_text(node_names_text, cut_off_chunk_size)

# Construct the prompt with the potentially trimmed node_names_text
# (spaces added so the file name, keyword, and term list do not run together)
prompt = "These are the terms related to " + filepath + " " + keyword + ", categorize them and write a summary report. " + trimmed_node_names_text

#****** You can modify the system prompt for your specific needs ******
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]

response = client.chat.completions.create(
    model="gpt-4-turbo",  # Use the appropriate model name
    messages=messages,
    max_tokens=2048,      # Maximum output tokens are 4096
)
completion_text = response.choices[0].message.content
display(Markdown(completion_text))

### Categorization and Summary Report on Beta-Carotene Production in Yarrowia lipolytica

**Objective:**
This report consolidates the various terms related to beta-carotene production in the yeast Yarrowia lipolytica, highlighting the engineering, metabolic pathways, and key components involved. These terms have been categorized into specific themes for clarity.

**Categories:**

1. **Organisms and Strains:**
   - Yarrowia lipolytica (various strains such as ATCC 201249, XK19, Po1f)
   - Escherichia coli and Saccharomyces cerevisiae (used for gene sourcing)
   - Xanthophyllomyces dendrorhous, Blakeslea trispora (related research)
  
2. **Genetic Components and Pathway Engineering:**
   - Genes: ERG (multiple), GGS1, CarB, CarRP, crtI, crtYB, crtW, crtZ, IDI, tHmgR
   - Genetic tools: nonhomologous end-joining, homologous recombination, promoter optimization, multiple-copy pathway genes, knockout strategies (e.g., beta-oxidation pathways)
   - Overexpression and Deletion: Overexpression of genes like DID2, HMG; Deletion of certain genes like MHY1 or ΔCLA4 for enhanced production.
   
3. **Metabolic Pathways:**
   - MVA Pathway, MEP Pathway: Central to the synthesis of precursors required for carotenoid synthesis.
   - Specific pathway elements such as β-carotene synthesis pathway, β-carotene biosynthetic pathway.
  
4. **Processes and Condition Optimization:**
   - Media and Culture Conditions: YPD medium, nitrogen-limited medium, high-cell-density fermentation, glucose and glycerol usage.
   - Fermentation Conditions: Optimization for higher yield in different media, utilization of glycerol as a carbon source.
   - Stress Responses: Impact on metabolic performance and β-carotene yield.
  
5. **Products and Yield Data:**
   - Specific Production Numbers: 6.5 g/L, 49 mg/L, and record high yields such as 797.1 mg/L.
   - Comparison of Production in Various Strains and Engineering Approaches.
  
6. **Technological and Methodological Approaches:**
   - Synthetic Biology: Implementations in the metabolic engineering of Y. lipolytica for optimized beta-carotene production.
   - Chromosomal Integrations: Techniques for stable gene integration and expression.
   - Advanced Genetic Engineering: Iterative integration, decentralized assembly strategy and enhancer sequences for improved expression rates.

7. **Biochemical and Molecular Aspects:**
   - Enzymes and Co-factors: Importance of hydroxylases, ketolases, synthases, reductases in the carotene synthesis pathway.
   - Lipid Accumulation and Utilization: Related to the substrate channeling towards β-carotene synthesis.
  
8. **Applications and Industrial Usage:**
   - High-Level β-Carotene Production for commercial use, like in nutraceuticals.
   - Optimization for higher β-carotene yield for cost-effectiveness in industrial applications.

**Summary:**
Yarrowia lipolytica has emerged as a versatile yeast host for metabolic engineering aimed at production of β-carotene. Through extensive genetic manipulation including gene overexpression, knockouts, and optimized promoter usage, significant improvements in β-carotene yield have been achieved. Research indicates varied methods ranging from basic fermentation condition tweaks to sophisticated synthetic biology approaches to propel β-carotene synthesis. Advancements in genetic tools such as in vivo homologous recombination facilitate iterative pathway optimizations, revealing a push towards more integrated systems that can stabilize high yields of desired products. The combination of molecular biology techniques with classical strain improvement strategies like stress response management and media optimization underscores an integrated approach in metabolic engineering of Y. lipolytica for enhanced β-carotene production. 

Given the health benefits and commercial value of β-carotene, these developments hold potential for more efficient, cost-effective production processes adaptable to industrial scales, marking significant contributions in the fields of biotechnology and metabolic engineering.
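
A note on the trimming step in cell 3: `trim_text` cuts at the last space before `cut_off_chunk_size` characters, so a multi-word term such as `Yarrowia lipolytica` can be split in half at the cut. If that matters for your report, one alternative is to trim at term boundaries instead. The sketch below assumes you pass the list of node names rather than the joined string; `trim_terms` is a hypothetical helper, not part of the notebook.

```python
def trim_terms(terms, max_chars, separator=", "):
    """Join terms until adding another would exceed max_chars (hypothetical helper)."""
    kept = []
    length = 0
    for term in terms:
        # The separator is only counted once there is a previous term to join with
        extra = len(term) + (len(separator) if kept else 0)
        if length + extra > max_chars:
            break
        kept.append(term)
        length += extra
    return separator.join(kept)

# Example with a small character budget
example_terms = ["Yarrowia lipolytica", "GGS1", "MVA pathway"]
print(trim_terms(example_terms, 25))  # Yarrowia lipolytica, GGS1
```

In the notebook this could be called as `trim_terms(node_names, cut_off_chunk_size)` in place of `trim_text(node_names_text, cut_off_chunk_size)`.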