## This notebook demonstrates different configurations and executions of LLM-Analyst

### Environment Variables

* For proper execution of LLM-Analyst, one or more of the environment variables below may be required.
* For these examples, we require only `OPENAI_API_KEY`.

```bash
export OPENAI_API_KEY=""
export TAVILY_API_KEY=""
export SERPER_API_KEY=""
export SERP_API_KEY=""
export HUGGINGFACEHUB_API_TOKEN=""
export LANGCHAIN_API_KEY=""
export GROQ_API_KEY=""
export GOOGLE_CX_KEY=""
export GOOGLE_API_KEY=""
export BING_API_KEY=""
export NCBI_API_KEY=""
export ORCID_ACCESS_TOKEN=""
export ORCID_REFRESH_TOKEN=""
export PYPI_API_TOKEN=""
export DOCKERHUB_API_TOKEN=""
```
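Before running the cells below, it can help to confirm which keys are actually set. The following is a minimal sketch; the helper name `missing_env_vars` is hypothetical and is not part of LLM-Analyst.

```python
import os

def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]

# Only OPENAI_API_KEY is required for the examples in this notebook.
missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```

If you switch search or LLM providers, pass the corresponding names from the list above instead.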


### Prerequisites  
* You will need a local Python environment with all the required Python packages installed.
* If you are reading this, you have most likely already cloned the repo. If you have not:
    * Execute:
        ```bash
        cd my_local_development_dir
        git clone https://github.com/DanHUMassMed/llm_analyst.git
        cd llm_analyst
        ```
* Create a Python environment (we use Conda-Forge Miniforge3)
    * Execute:
        ```bash
        conda create -n llm-analyst python=3.11 ipykernel
        conda activate llm-analyst
        pip install -r requirements.txt
        ```

-----

NOTE: You could also just execute `pip install research-task`

However, we expect you want to play with the code, not just use the package.

In [3]:
# System level imports
import sys
import os

# ##### SET SYS PATH TO WHERE THE CODE IS. #####
# my_local_development_dir/llm_analyst
# Note: Putting our code first in the sys path will make sure it gets picked up
llm_analyst_base_dir='/Users/dan/Code/LLM/llm_analyst'
sys.path.insert(0, llm_analyst_base_dir)


# Setting the USER_AGENT to fix warning with langchain_community code
# WARNING:langchain_community.utils.user_agent:USER_AGENT
user_agent = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36 Edg/119.0.0.0")
os.environ['USER_AGENT'] = user_agent

# Setting the OPENAI_API_KEY
#os.environ['OPENAI_API_KEY'] = "*****************************88"
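
Rather than pasting the key into the notebook, one option is to prompt for it only when it is missing. This is a sketch under stated assumptions: `ensure_api_key` is a hypothetical helper, not something provided by LLM-Analyst.

```python
import getpass
import os

def ensure_api_key(name="OPENAI_API_KEY", prompt=getpass.getpass, environ=None):
    """Prompt for the key only when it is not already set, then return its value."""
    env = os.environ if environ is None else environ
    if not env.get(name):
        # getpass hides the typed value, so the key never appears in the notebook output
        env[name] = prompt(f"Enter {name}: ")
    return env[name]
```

Calling `ensure_api_key()` in a cell prompts once per session and leaves an already exported key untouched.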

In [4]:
# Let's import llm_analyst content in one cell to make the rest of the code a little cleaner.
from llm_analyst.core.config import Config, DataSource
from llm_analyst.core.research_analyst import LLMAnalyst
from llm_analyst.core.research_editor import LLMEditor
from llm_analyst.core.research_publisher import LLMPublisher


### Demonstrate running LLM-Analyst on Local Data

In [5]:
## Now let's run a simple research report against a set of local documents (published papers).
## Three things are required:
## 1. An active research topic
## 2. A defined path to the local data to research against
## 3. The data source to use (LOCAL_STORE, WEB, SELECT_URLS)

# Requirement 1 (research topic).
research_topic = "How do chromatin-modifying factors influence lifespan in C. elegans, and what are the underlying molecular mechanisms driving these effects?"

# Requirement 2 (local data to research).
walker_lab_papers = "/Users/dan/Code/LLM/research_data/Walker_Lab_Slack/papers_lifespan_1cc"

# Requirement 3 (data source) is passed to the LLMAnalyst in the next cell.

# We add a few additional config_params just to be explicit.
# NOTE: The defaults would also work fine.
# NOTE: The defaults would also work fine. 
config_params = {
    "internet_search" :"ddg_search",
    "llm_provider"    :"openai",
    "llm_model"       :"gpt-4o-2024-05-13",
    "local_store_dir" :f"{walker_lab_papers}",
    "report_out_dir"  :f"{llm_analyst_base_dir}/notebooks_local/data",
    "cache_dir"       :f"{llm_analyst_base_dir}/notebooks_local/data/cache"
}
config = Config()
config.set_values_for_config(config_params)
print(config)

# Take note of the config values

internet_search=<function ddg_search at 0x105e78f40>
embedding_provider=client=<openai.resources.embeddings.Embeddings object at 0x1329895d0> async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x132994990> model='text-embedding-ada-002' dimensions=None deployment='text-embedding-ada-002' openai_api_version='' openai_api_base=None openai_api_type='' openai_proxy='' embedding_ctx_length=8191 openai_api_key=SecretStr('**********') openai_organization=None allowed_special=None disallowed_special=None chunk_size=1000 max_retries=2 request_timeout=None headers=None tiktoken_enabled=True tiktoken_model_name=None show_progress_bar=False model_kwargs={} skip_empty=False default_headers=None default_query=None retry_min_seconds=4 retry_max_seconds=20 http_client=None http_async_client=None check_embedding_ctx_length=True
llm_provider=<class 'llm_analyst.chat_models.openai.OPENAI_Model'>
llm_model=gpt-4o-2024-05-13
llm_token_limit=4000
llm_temperature=0.25
browse_chunk_max_length
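
The three directory settings above are all derived from two locations, so they can be built in one place. The sketch below assumes a hypothetical helper `build_config_params`; the keys and values mirror the `config_params` dict in the cell above, and creating the directories up front is our assumption rather than a documented requirement of the library.

```python
from pathlib import Path

def build_config_params(base_dir, papers_dir):
    """Build the config_params dict from the repo directory and the papers directory."""
    data_dir = Path(base_dir) / "notebooks_local" / "data"
    cache_dir = data_dir / "cache"
    # Assumption: we create the output and cache directories ourselves;
    # the library may already handle this.
    cache_dir.mkdir(parents=True, exist_ok=True)
    return {
        "internet_search": "ddg_search",
        "llm_provider": "openai",
        "llm_model": "gpt-4o-2024-05-13",
        "local_store_dir": str(papers_dir),
        "report_out_dir": str(data_dir),
        "cache_dir": str(cache_dir),
    }
```

The result can be passed to the same call used above: `config.set_values_for_config(build_config_params(llm_analyst_base_dir, walker_lab_papers))`.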

In [6]:
# Now that we have set things up, let's get down to conducting the research!
# To execute preliminary research, we use the LLMAnalyst object.
# We ask the analyst to conduct research and then write a report.

llm_analyst = LLMAnalyst(active_research_topic = research_topic, 
                         data_source = DataSource.LOCAL_STORE, 
                         config = config)

await llm_analyst.conduct_research()
research_state = await llm_analyst.write_report()


# Once the report is written, we can ask the LLMPublisher to make a pdf
llm_publisher = LLMPublisher(**research_state.dump(), config = config)
published_research_path = await llm_publisher.publish_to_pdf_file()
published_research_path

DEBUG:root:PROMPT choose_agent response = {
    "agentType": "🔬 Molecular Biology Agent",
    "agentRole": "You are a highly knowledgeable AI molecular biology researcher. Your primary goal is to generate detailed, insightful, unbiased, and systematically structured research reports on molecular biology topics, focusing on the influence of chromatin-modifying factors on lifespan in C. elegans and the underlying molecular mechanisms."
}
  warn_deprecated(
  from tqdm.autonotebook import tqdm, trange
INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: mps
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: all-MiniLM-L6-v2
DEBUG:root:PROMPT get_sub_queries response = ["chromatin-modifying factors influence lifespan in C. elegans July 24, 2024", "underlying molecular mechanisms chromatin modification lifespan C. elegans", "chromatin modification C. elegans lifespan research 2024"]
huggingface/tokenizers: The current process just got fork

Report written to /Users/dan/Code/LLM/llm_analyst/notebooks_local/data/Research-2024-07-24-1028595546.pdf


'/Users/dan/Code/LLM/llm_analyst/notebooks_local/data/Research-2024-07-24-1028595546.pdf'
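
The bare `await` calls in the cell above work because Jupyter runs cells inside an event loop. In a plain Python script, the same steps need to be wrapped in a coroutine and started with `asyncio.run`. The following is a minimal sketch: `run_preliminary_report` and `main` are hypothetical names, and the LLM-Analyst calls are the same ones used in this notebook.

```python
import asyncio

async def run_preliminary_report(analyst, make_publisher):
    """Run the awaited steps from the cell above, in the same order."""
    await analyst.conduct_research()
    research_state = await analyst.write_report()
    publisher = make_publisher(research_state)
    return await publisher.publish_to_pdf_file()

def main(research_topic):
    # Imported here so the helper above can be reused without the package installed.
    from llm_analyst.core.config import Config, DataSource
    from llm_analyst.core.research_analyst import LLMAnalyst
    from llm_analyst.core.research_publisher import LLMPublisher

    config = Config()
    analyst = LLMAnalyst(active_research_topic=research_topic,
                         data_source=DataSource.LOCAL_STORE,
                         config=config)
    # asyncio.run starts an event loop, which a script (unlike a notebook) does not have.
    return asyncio.run(run_preliminary_report(
        analyst,
        lambda state: LLMPublisher(**state.dump(), config=config),
    ))
```

From a script, call `main(research_topic)` and print the returned path. Do not call `asyncio.run` from inside a notebook cell, where a loop is already running.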

In [7]:
## Very cool, we just created a preliminary research report!
## Now, based on the collected data, we have decided that we want to see a "detailed report".
##
## A "detailed report" requires oversight; therefore, we will use an LLMEditor instead of an LLMAnalyst.
## The key difference between the Editor and the Analyst is that
## the Editor coordinates the efforts of multiple Analysts and
## uses a specialized Report Writer to pull the final report together.

## The inputs are the same as for the Research Analyst report above.

llm_editor = LLMEditor(active_research_topic = research_topic, 
                       data_source = DataSource.LOCAL_STORE,
                       config = config)

research_state = await llm_editor.create_detailed_report()

llm_publisher = LLMPublisher(**research_state.dump(), config = config)
published_research_path = await llm_publisher.publish_to_pdf_file()

# NOTE: As the code runs, you should see the logging line below:
# INFO:root:*** Using Cached Repo. ***
# This indicates that we are not recreating the embeddings.
# Instead, we are using the cached data in the vector DB.


DEBUG:root:PROMPT choose_agent response = {
    "agentType": "🧬 Molecular Biology Agent",
    "agentRole": "You are a highly knowledgeable AI molecular biology researcher. Your primary goal is to generate detailed, insightful, impartial, and systematically organized research reports on molecular biology topics, focusing on genetic, epigenetic, and biochemical mechanisms."
}
INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: mps
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: all-MiniLM-L6-v2
INFO:root:*** Using Cached Repo. ***
DEBUG:root:PROMPT get_sub_queries response = ["chromatin-modifying factors influence lifespan C. elegans molecular mechanisms 2024", "epigenetic regulation lifespan C. elegans chromatin modification July 24, 2024", "chromatin modifiers lifespan extension C. elegans underlying molecular pathways 2024"]
DEBUG:root:active_research_topic   = How do chromatin-modifying factors influence lifespan in C. elegans, 

Researching Histone Modifications and Lifespan Regulation


INFO:root:*** Using Cached Repo. ***
DEBUG:root:PROMPT get_sub_queries response = ["chromatin-modifying factors influence on C. elegans lifespan July 24, 2024", "histone modifications and lifespan regulation in C. elegans molecular mechanisms", "underlying molecular mechanisms of chromatin modifications affecting lifespan in C. elegans"]
DEBUG:root:PROMPT write_report response = ## Histone Modifications and Lifespan Regulation

### Histone Methylation and Lifespan

Histone methylation, particularly at lysine residues, plays a critical role in regulating lifespan in *Caenorhabditis elegans* (C. elegans). The trimethylation of histone H3 at lysine 4 (H3K4me3) has been extensively studied for its impact on longevity. The ASH-2 trithorax complex, which includes ASH-2, WDR-5, and the H3K4 methyltransferase SET-2, is responsible for the trimethylation of H3K4. Knockdown of ASH-2, SET-2, and WDR-5 has been shown to extend the lifespan of fertile worms by 23.1-30.9% ([Greer et al., 2010](https

Researching Epigenetic Mechanisms in Aging


INFO:root:*** Using Cached Repo. ***
DEBUG:root:PROMPT get_sub_queries response = ["chromatin-modifying factors influence lifespan C. elegans molecular mechanisms", "epigenetic mechanisms aging C. elegans chromatin modification July 24, 2024", "chromatin modification lifespan regulation C. elegans underlying molecular mechanisms"]
DEBUG:root:PROMPT write_report response = ## Epigenetic Mechanisms in Aging: Chromatin-Modifying Factors and Lifespan in C. elegans

### Histone Methylation and Longevity

Histone methylation plays a crucial role in regulating lifespan in *Caenorhabditis elegans* (C. elegans). The trimethylation of histone H3 at lysine 4 (H3K4me3) and lysine 27 (H3K27me3) are particularly significant. The ASH-2 trithorax complex, which trimethylates H3K4, has been shown to regulate lifespan in a germline-dependent manner. Knockdown of ASH-2, SET-2, and SET-4, all components of the H3K4me3 complex, significantly extends the lifespan of fertile worms ([Greer et al., 2010](https

Researching Role of Chromatin State in Lifespan Extension


INFO:root:*** Using Cached Repo. ***
DEBUG:root:PROMPT get_sub_queries response = ["chromatin-modifying factors influence lifespan C. elegans molecular mechanisms", "chromatin state lifespan extension C. elegans July 24, 2024", "role of chromatin modifications in aging C. elegans underlying mechanisms"]
DEBUG:root:PROMPT write_report response = ## Role of Chromatin State in Lifespan Extension

### Histone Methylation and Lifespan Regulation

Histone methylation is a critical epigenetic modification influencing chromatin structure and gene expression. In *Caenorhabditis elegans* (C. elegans), various histone methylation marks have been implicated in lifespan regulation. For instance, the trimethylation of histone H3 at lysine 4 (H3K4me3) is associated with active transcription and has been shown to influence longevity. Knockdown of components of the H3K4me3 methyltransferase complex, such as ASH-2, SET-2, and WDR-5, extends the lifespan of *C. elegans* ([Greer et al., 2010](https://www.

Researching How do chromatin-modifying factors influence lifespan in C. elegans, and what are the underlying molecular mechanisms driving these effects?


INFO:root:*** Using Cached Repo. ***
DEBUG:root:PROMPT get_sub_queries response = ["chromatin-modifying factors influence lifespan C. elegans molecular mechanisms July 24, 2024", "epigenetic regulation lifespan C. elegans chromatin modification July 24, 2024", "chromatin modifiers aging C. elegans underlying molecular mechanisms July 24, 2024"]
DEBUG:root:PROMPT write_report response = ## Chromatin-Modifying Factors and Lifespan in C. elegans

### Histone Methylation and Lifespan Regulation

Histone methylation plays a pivotal role in the regulation of lifespan in *Caenorhabditis elegans* (C. elegans). The trimethylation of histone H3 at lysine 4 (H3K4me3) has been extensively studied for its impact on longevity. Research has demonstrated that the demethylation of H3K4me3 in the germline can significantly extend the lifespan of C. elegans ([source](https://www.sciencedirect.com/science/article/pii/S0047637418300836)). Specifically, the reduction of H3K4me3 levels through the action of 

Report written to /Users/dan/Code/LLM/llm_analyst/notebooks_local/data/Research-2024-07-24-1035500454.pdf
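
The editor pattern described above (one analyst per subtopic, then a writer that assembles the sections) can be illustrated with stand-in coroutines. This is a conceptual sketch only, not the library's implementation: `research_subtopic` and `assemble_detailed_report` are hypothetical, and running the analysts concurrently with `asyncio.gather` is our assumption (the log above shows the subtopics being handled one after another).

```python
import asyncio

async def research_subtopic(subtopic):
    """Stand-in for one analyst researching and writing one section."""
    await asyncio.sleep(0)  # placeholder for the real research work
    return f"## {subtopic}\n\n(section text for {subtopic})"

async def assemble_detailed_report(topic, subtopics):
    """Stand-in for the editor: one analyst per subtopic, then a final assembly step."""
    # gather returns results in the order the subtopics were given
    sections = await asyncio.gather(*(research_subtopic(s) for s in subtopics))
    return "\n\n".join([f"# {topic}", *sections])

report = asyncio.run(assemble_detailed_report(
    "Chromatin-modifying factors and lifespan",
    ["Histone Modifications and Lifespan Regulation",
     "Epigenetic Mechanisms in Aging"],
))
print(report)
```

As written, this runs as a standalone script; inside a notebook, replace `asyncio.run(...)` with `await assemble_detailed_report(...)`.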


### Demonstrate Running LLM-Analyst on Web Scraped Data

In [None]:
## Let's run a simple research report against the internet.
## All that is required is an active research topic.

# Requirement 1 (research topic).
research_topic = "How does DAF-19 regulate transcription of regeneration associated genes?"

# Create a Config with the default values.
config = Config()
print(config)

# Take note of the default config values
# Pay particular attention to the report_out_dir

In [None]:
# To execute some preliminary research, we use the LLMAnalyst 

# Note: we are using the default data_source and config, so we do not need to provide them.
llm_analyst = LLMAnalyst(active_research_topic = research_topic)

await llm_analyst.conduct_research()
research_state = await llm_analyst.write_report()

# Once the report is written, we can ask the LLMPublisher to make a pdf
llm_publisher = LLMPublisher(**research_state.dump(), config = config)
published_research_path = await llm_publisher.publish_to_pdf_file()


In [None]:
## Let's build on the first internet research project and create a "detailed report".

## Inputs are the same as the Research Report above

llm_editor = LLMEditor(active_research_topic = research_topic)

research_state = await llm_editor.create_detailed_report()

llm_publisher = LLMPublisher(**research_state.dump(), config = config)
published_research_path = await llm_publisher.publish_to_pdf_file()