This notebook is module 2 of NEKO.

It performs the following steps:
1. Search the knowledge graph for your keyword of interest and output a filtered knowledge graph.
2. Use an LLM to summarize the filtered graph and generate a report.

To use this notebook, you need to:

- Input the path to the Qwen model (from Hugging Face or from local storage).
- Input your search keyword.
- Modify the prompt according to your needs.
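
One way to keep the three inputs listed above in a single place is a small settings object. This is only a sketch: the `NotebookSettings` class, its field names, the placeholder model path, and the prompt template wording are hypothetical and not part of NEKO.

```python
from dataclasses import dataclass


@dataclass
class NotebookSettings:
    """Hypothetical container for the three user inputs of this notebook."""
    model_path: str  # Hugging Face model ID or local directory of the Qwen model
    keyword: str     # search keyword for the knowledge graph
    prompt_template: str = (
        "These are the terms related to {keyword}, "
        "categorize them and write a summary report.\n{terms}"
    )

    def build_prompt(self, terms: str) -> str:
        # Fill the template with the keyword and the comma-separated term list.
        return self.prompt_template.format(keyword=self.keyword, terms=terms)


# Placeholder values for illustration only.
settings = NotebookSettings(model_path="path/to/Qwen-model", keyword="carotene")
print(settings.build_prompt("b-carotene, MVA pathway"))
```

Changing the prompt then only means passing a different `prompt_template`, as long as it keeps the `{keyword}` and `{terms}` placeholders.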

In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda" # the device to load the model onto

model_path = "/storage1/fs1/yinjie.tang/Active/Shawn_Xiao/Qwen1.5-72B-Chat"  # Hugging Face model ID or local path

model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Other checkpoint names noted by the author:
# Qwen1.5-72B-Chat-GPTQ-Int4
# Qwen1.5-14B-Chat


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [2]:
from pyvis.network import Network
import pandas as pd
import re
import networkx as nx

# Load the Excel file
filepath = 'modified_updated(Qwen1.5 14b)_Yarrowia_pdf_causal.xlsx'
df = pd.read_excel(filepath, engine='openpyxl')

# Initialize NetworkX Graph
G = nx.Graph()

# Nodes to exclude
# words_to_exclude = ['Synechocystis', 'Cyanobacteria', 'cyanobacteria']
words_to_exclude = []

# Regular expression to match the pattern (entity A, entity B)
pattern = r'\(([^,]+), ([^\)]+)\)'

# Iterate over the DataFrame rows to extract entity pairs and their sources
for _, row in df.iterrows():
    # value = row['Response to New Question']
    value = row['Answer to Question 2']
    source = row['Title']  # Extract source for each pair

    # Cast to str so that empty (NaN) cells do not raise a TypeError in re.findall
    matches = re.findall(pattern, str(value))
    for entity_a, entity_b in matches:
        # Check if any word to exclude is part of the entity names
        if not any(word in entity_a for word in words_to_exclude) and not any(word in entity_b for word in words_to_exclude):
            G.add_node(entity_a, label=entity_a)
            G.add_node(entity_b, label=entity_b)
            G.add_edge(entity_a, entity_b, title=source)

def search_network(graph, keyword, depth=1):
    nodes_of_interest = {n for n, attr in graph.nodes(data=True) if keyword.lower() in attr['label'].lower()}
    for _ in range(depth):
        for node in list(nodes_of_interest):
            nodes_of_interest.update(set(nx.neighbors(graph, node)))
    return graph.subgraph(nodes_of_interest)

# Perform search
keyword = "carotene"  # Replace with your keyword
filtered_graph = search_network(G, keyword)

# Extract node names from the filtered graph
node_names = list(filtered_graph.nodes())

# Prepare a simple text summary of node names
node_names_text = ", ".join(node_names)

# Now, `node_names_text` contains a clean, comma-separated list of node names, ready for summarization
print(node_names_text)

# Initialize Pyvis network with the filtered graph
net = Network(height="2160px", width="100%", bgcolor="#222222", font_color="white")
net.from_nx(filtered_graph)

# Set the physics and node font options for the visualization
net.set_options("""
{
  "physics": {
    "barnesHut": {
      "gravitationalConstant": -80000,
      "centralGravity": 0.5,
      "springLength": 75,
      "springConstant": 0.05,
      "damping": 0.09,
      "avoidOverlap": 0.5
    },
    "maxVelocity": 100,
    "minVelocity": 0.1,
    "solver": "barnesHut",
    "timestep": 0.3,
    "stabilization": {
        "enabled": true,
        "iterations": 500,
        "updateInterval": 10,
        "onlyDynamicEdges": false,
        "fit": true
    }
  },
  "nodes": {
    "font": {
      "size": 30,
      "color": "white"
    }
  }
}
""")





# Save the network as an HTML file
net.write_html('filterd_entity_'+keyword+'_network.html')

B-carotene in E. coli, SQS1, 'expression levels', canthaxanthin, b-carotene, Saccharomyces cerevisiae, improved β-carotene production, W29-derived strain, '17a, absence in CIBTS1630 extracts, 'MVA pathway', (DCW, β-Carotene biosynthesis, 'HMG-CoA reductase', Yarrowia lipolytica, beta-carotene production, metabolic engineering, S. et al., 2017', Yarrowia lipolytica, β-carotene production, 'carotene synthase', ERG9 promoter, β-carotene ketolase, Squalene synthase, (β-carotene biosynthetic genes, integration, correct locus, average frequency (53%, 'beta-carotene biosynthetic pathway', (carRP gene, synthetic medium, astaxanthin biosynthesis, (beta-carotene production, 'Complementation with LEU2', GGPS1/crtE, flux to downstream bottleneck, Ku70-deficient Y. lipolytica, 'ERG13', 'lipid content', glucose concentration, SeACSL641P, b-carotene concentration, (enhanced β-carotene synthesis, 'lipid accumulation', morphological and metabolic engineering, Liu, M. et al., HMG, β-carotene hydroxylase
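
The extraction and search logic of the cell above can be tried on a toy input before pointing it at the spreadsheet. The sketch below reuses the same regular expression and the same neighbor-expansion idea, but stores the graph in a plain dictionary so that it runs without `networkx`. The sample answer strings and the helpers `clean_entity`, `build_adjacency`, and `search_entities` are made up for illustration and are not part of the notebook. `clean_entity` is one possible way to strip the stray quotes and parentheses that appear in the node list printed above.

```python
import re
from collections import defaultdict

pattern = r'\(([^,]+), ([^\)]+)\)'  # same pattern as in the cell above


def clean_entity(name):
    # Hypothetical helper: strip whitespace, then stray quotes and parentheses.
    return name.strip().strip("'\"()").strip()


def build_adjacency(answers):
    # Map each entity to the set of entities it is paired with.
    adjacency = defaultdict(set)
    for answer in answers:
        for entity_a, entity_b in re.findall(pattern, answer):
            a, b = clean_entity(entity_a), clean_entity(entity_b)
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def search_entities(adjacency, keyword, depth=1):
    # Seed with entities containing the keyword, then expand `depth` hops outward.
    found = {n for n in adjacency if keyword.lower() in n.lower()}
    for _ in range(depth):
        for node in list(found):
            found.update(adjacency[node])
    return found


# Made-up answers in the "(entity A, entity B)" format.
answers = [
    "('tHMG1', β-carotene production), (tHMG1, MVA pathway)",
    "(glucose, lipid content)",
]
adjacency = build_adjacency(answers)
print(sorted(search_entities(adjacency, "carotene", depth=1)))
print(sorted(search_entities(adjacency, "carotene", depth=2)))
```

With `depth=1` only direct neighbors of the matching nodes are returned. Raising `depth` pulls in entities that are connected through intermediate nodes, at the cost of a larger and noisier term list for the summarization step.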

In [3]:
from IPython.display import Markdown, display

def trim_text(text, max_length):
    """
    Trims the text to the specified maximum length, appending "..." if the text was trimmed.
    
    Args:
    - text (str): The text to trim.
    - max_length (int): The maximum allowed length of the text.

    Returns:
    - str: The trimmed text.
    """
    if len(text) > max_length:
        return text[:max_length].rsplit(' ', 1)[0] + "..."  # Trim to max_length, avoid cutting words in half
    else:
        return text
    
# Apply the trimming function to node_names_text
cut_off_chunk_size = 30000
trimmed_node_names_text = trim_text(node_names_text, cut_off_chunk_size)

# Construct the prompt with the potentially trimmed node_names_text
# Separate the file name, the keyword, and the term list so they do not run together
prompt = "These are the terms related to " + filepath + " " + keyword + ", categorize them and write a summary report. " + trimmed_node_names_text

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=5000
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response1 = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
display(Markdown(response1))

Summary Report:

The terms provided relate to the research and optimization of β-carotene production in various microorganisms, primarily focusing on E. coli, Saccharomyces cerevisiae, and Yarrowia lipolytica. β-carotene is an essential nutrient and a precursor to vitamin A. The following key aspects are highlighted in the context of these organisms:

1. **Metabolic Engineering**: Techniques such as genetic engineering, metabolic stress, and metabolic engineering are employed to enhance β-carotene synthesis in microorganisms. This involves overexpression or modification of genes like HMG-CoA reductase (ERG13), MVA pathway genes, carotene synthase, and β-carotene ketolase.

2. **Strain Optimization**: Genetic modifications, including deletion of specific genes (e.g., CLA4, MHY1, and GGS1), integration of β-carotene biosynthetic genes at the correct locus, and the use of auxotrophic strains, are used to improve β-carotene production.

3. **Promoters and Gene Expression**: The strength of promoters (e.g., ERG9, PGPD) influences β-carotene biosynthesis. The expression levels of genes like HMG1, tHMG1, and GGS1 are manipulated to boost β-carotene synthesis.

4. **Pathway Engineering**: The MVA pathway, which is crucial for β-carotene biosynthesis, is often targeted for optimization. This includes overexpression of ERG10, manipulation of the β-carotene biosynthetic genes (carRP, crtYB, crtI, crtZ), and introduction of heterologous multifunctional carotene synthases.

5. **Fermentation Conditions**: Factors such as glucose concentration, carbon-to-nitrogen ratio, temperature regulation, and the use of different media (e.g., YPD, synthetic medium) affect β-carotene production. Optimization of fermentation parameters (high-cell-density fermentation, fed-batch fermentation) and stress response management also contribute to enhanced yields.

6. **Yeast Strains**: Yarrowia lipolytica, in particular, is a focus for β-carotene overproduction. Genetic tools, including DNA assembler, nonhomologous end-joining, and in vivo homologous recombination, are used to engineer Y. lipolytica strains for higher β-carotene titers.

7. **Recombinant Production**: Recombinant E. coli and yeast strains are developed for β-carotene synthesis, often involving the expression of genes from other organisms, such as Mucor circinelloides or cyanobacteria.

8. **Lipid Accumulation and β-carotene Sequestration**: Enhanced lipid accumulation and changes in cell wall structure can lead to increased β-carotene storage, as seen in lipid-overproducing strains.

9. **Measurement and Analysis**: Techniques like photometric measurement, 2,3-oxidosqualene analysis, and β-carotene extraction are employed to quantify and study β-carotene production and its intermediates.

10. **Synthetic Biology Approaches**: Synthetic biology strategies, including promoter shuffling, sequential gene integration, and decentralized assembly, are utilized to construct β-carotene biosynthesis pathways efficiently.

Overall, the research aims to develop microbial platforms for economically viable and sustainable β-carotene production through various genetic and metabolic manipulations, with a focus on Yarrowia lipolytica due to its potential for industrial-scale bioproduction.
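
The `trim_text` helper in cell [3] cuts at the last space before the limit, so a multi-word term at the end of the list can be split in half. A minimal alternative sketch, assuming the terms are joined with `", "` as in `node_names_text`, keeps only whole terms. The function name `trim_terms` and the sample terms are hypothetical.

```python
def trim_terms(terms, max_length, sep=", "):
    """Join as many whole terms as fit within max_length characters."""
    kept = []
    length = 0
    for term in terms:
        # Characters this term would add, including a separator if it is not the first.
        added = len(term) + (len(sep) if kept else 0)
        if length + added > max_length:
            break
        kept.append(term)
        length += added
    return sep.join(kept)


# Made-up terms for illustration.
terms = ["beta-carotene production", "MVA pathway", "lipid content"]
print(trim_terms(terms, 40))
```

In the notebook this would be called with `node_names` (the list) instead of `node_names_text` (the joined string). Note that the limit is still counted in characters, not in model tokens.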
