## This notebook demonstrates different configurations and executions of LLM-Analyst

### Environment Variables

* For proper execution of LLM-Analyst, one or more of the environment variables below may be required.
* For these examples, we require only `OPENAI_API_KEY`.

```bash
export OPENAI_API_KEY=""
export TAVILY_API_KEY=""
export SERPER_API_KEY=""
export SERP_API_KEY=""
export HUGGINGFACEHUB_API_TOKEN=""
export LANGCHAIN_API_KEY=""
export GROQ_API_KEY=""
export GOOGLE_CX_KEY=""
export GOOGLE_API_KEY=""
export BING_API_KEY=""
export NCBI_API_KEY=""
export ORCID_ACCESS_TOKEN=""
export ORCID_REFRESH_TOKEN=""
export PYPI_API_TOKEN=""
export DOCKERHUB_API_TOKEN=""
```
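
Before running the notebook, it can help to confirm which of these variables are actually set. The sketch below is one way to do that. The helper name `missing_env_vars` and the split into required and optional names are illustrative assumptions, not part of llm_analyst.

```python
import os

# Assumption for this notebook: only OPENAI_API_KEY is strictly required.
REQUIRED_VARS = ["OPENAI_API_KEY"]
# A few of the optional variables listed above, shown for illustration.
OPTIONAL_VARS = ["TAVILY_API_KEY", "SERPER_API_KEY", "GROQ_API_KEY"]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(REQUIRED_VARS)
    if missing:
        print("Missing required variables:", ", ".join(missing))
    else:
        print("All required variables are set.")
    print("Unset optional variables:", missing_env_vars(OPTIONAL_VARS))
```

An empty string counts as missing here, which matches the `export NAME=""` placeholders shown above.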


### Prerequisites  
* You will need a local Python environment with all the required Python packages installed.
* If you are reading this, you have most likely already cloned the repo. If you have not:
    * Execute:
        ```bash
        cd my_local_development_dir
        git clone https://github.com/DanHUMassMed/llm_analyst.git
        cd llm_analyst
        ```
* Create a Python environment (we use conda-forge Miniforge3).
    * Execute:
        ```bash
        conda create -n llm-analyst python=3.11 ipykernel
        conda activate llm-analyst
        pip install -r requirements.txt
        ```

-----

NOTE: You could also just execute `pip install research-task`

However, we expect you want to play with the code, not just use the package.

In [1]:
# System level imports
import sys
import os

# ##### SET SYS PATH TO WHERE THE CODE IS. #####
# my_local_development_dir/llm_analyst
# Note: Putting our code first in the sys path will make sure it gets picked up
llm_analyst_base_dir='/Users/dan/Code/LLM/llm_analyst'
sys.path.insert(0, llm_analyst_base_dir)


# Setting the USER_AGENT to fix warning with langchain_community code
# WARNING:langchain_community.utils.user_agent:USER_AGENT
user_agent = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36 Edg/119.0.0.0")
os.environ['USER_AGENT'] = user_agent

# Optionally set OPENAI_API_KEY here if it is not already exported in your shell
# os.environ['OPENAI_API_KEY'] = "*****************************"
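
The hard-coded `llm_analyst_base_dir` above is specific to one machine. One alternative is to walk up from the current directory until a marker file is found. This is a minimal sketch, assuming the notebook is run from somewhere inside the cloned repo and that `requirements.txt` sits at the repo root. The helper `find_repo_root` is hypothetical and not part of llm_analyst.

```python
import sys
from pathlib import Path


def find_repo_root(start, marker="requirements.txt"):
    """Walk upward from `start` and return the first directory containing `marker`."""
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_file():
            return candidate
    raise FileNotFoundError(f"No directory containing {marker!r} found above {start}")


if __name__ == "__main__":
    try:
        base_dir = find_repo_root(Path.cwd())
    except FileNotFoundError as err:
        print(err)
    else:
        # Put the repo first on sys.path so the local code is picked up
        sys.path.insert(0, str(base_dir))
        print(f"Using repo root: {base_dir}")
```

If the marker assumption does not hold for your checkout, setting the path explicitly as in the cell above remains the simplest option.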

In [2]:
# Let's import llm_analyst content in one cell to make the rest of the code a little cleaner.
from llm_analyst.core.config import Config, DataSource
from llm_analyst.core.research_analyst import LLMAnalyst
from llm_analyst.core.research_editor import LLMEditor
from llm_analyst.core.research_publisher import LLMPublisher


DEBUG:pyvirtualdisplay:version=3.0


### Demonstrate running LLM-Analyst on Local Data

In [None]:
# Let's get some realistic data to run our reports against
# We will download a set of papers from the Walker Lab

walker_lab_papers = f"{llm_analyst_base_dir}/reports_data/Walker_Lab_Papers"
 
!git clone https://github.com/DanHUMassMed/Walker_Lab_Papers.git {walker_lab_papers}

In [None]:
## Now let's run a simple research report against a set of local documents (published papers).
## Three things are required:
## 1. An active research topic
## 2. A path to the local data to research against
## 3. The data source (LOCAL_STORE, WEB, or SELECT_URLS)

# Requirement 1 (research topic).
# Oh cool researching metabolism!
research_topic = "I would like to better understand how the metabolism of S-adenosylmethionine is linked to lipid metabolism and stress-responsive gene expression."

# Requirement 2 (local data to research).
# We add a few additional config_params just to be explicit.
# NOTE: The defaults would also work fine.
config_params = {
    "internet_search" :"ddg_search",
    "llm_provider"    :"openai",
    "llm_model"       :"gpt-4o-2024-05-13",
    "local_store_dir" :f"{walker_lab_papers}",
    "report_out_dir"  :f"{llm_analyst_base_dir}/notebooks/data",
    "cache_dir"       :f"{llm_analyst_base_dir}/notebooks/data/cache"
}
config = Config()
config.set_values_for_config(config_params)
print(config)

# Take note of the config values
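
Before starting a run against local data, it can help to confirm that the `local_store_dir` exists and actually contains documents. The sketch below is a hypothetical pre-flight check, not part of llm_analyst. The `.pdf` suffix is an assumption about how the papers are stored, so adjust it to match your files.

```python
from pathlib import Path


def list_local_documents(directory, suffixes=(".pdf",)):
    """Return sorted paths of files under `directory` whose suffix is in `suffixes`."""
    root = Path(directory)
    if not root.is_dir():
        raise FileNotFoundError(f"Local store directory not found: {root}")
    wanted = {suffix.lower() for suffix in suffixes}
    # Search recursively and compare suffixes case-insensitively
    return sorted(
        path for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in wanted
    )
```

In this notebook it could be called as `list_local_documents(walker_lab_papers)`. An empty result would suggest that the clone step did not complete or that the suffix assumption is wrong.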

In [None]:
# Now that we have set things up, let's get down to conducting the research!
# To execute preliminary research, we use the LLMAnalyst object
# We ask the analyst to conduct research and then write a report

llm_analyst = LLMAnalyst(active_research_topic = research_topic, 
                         data_source = DataSource.LOCAL_STORE, 
                         config = config)

await llm_analyst.conduct_research()
research_state = await llm_analyst.write_report()


# Once the report is written, we can ask the LLMPublisher to make a pdf
llm_publisher = LLMPublisher(**research_state.dump(), config = config)
published_research_path = await llm_publisher.publish_to_pdf_file()
published_research_path
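
The top-level `await` calls above work because Jupyter runs cells inside an event loop. In a plain Python script, the same sequence would need to be wrapped in a coroutine and started with `asyncio.run`. The sketch below shows that pattern with stand-in coroutines. `fake_conduct_research` and `fake_write_report` are placeholders for illustration, not llm_analyst functions.

```python
import asyncio


async def run_steps(*steps):
    """Await each zero-argument coroutine function in order; return the last result."""
    result = None
    for step in steps:
        result = await step()
    return result


# Stand-ins for illustration only; they do not call llm_analyst.
async def fake_conduct_research():
    await asyncio.sleep(0)
    return None


async def fake_write_report():
    await asyncio.sleep(0)
    return {"report": "draft"}


if __name__ == "__main__":
    # asyncio.run creates the event loop that Jupyter would otherwise provide
    state = asyncio.run(run_steps(fake_conduct_research, fake_write_report))
    print(state)
```

Note that `asyncio.run` cannot be called from inside a notebook cell, where a loop is already running. There, the plain `await` form used above is the right one.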

In [None]:
## Very cool, we just created a preliminary research report!
## Now suppose that, based on the collected data, we want to see a "detailed report".
##
## A "detailed report" requires oversight; therefore, we will use an LLMEditor rather than an LLMAnalyst.
## The key difference between the Editor and the Analyst is that
## the Editor coordinates the efforts of multiple Analysts and
## uses a specialized Report Writer to pull the final report together.

## The inputs are the same as the Research Analyst Report above

llm_editor = LLMEditor(active_research_topic = research_topic, 
                       data_source = DataSource.LOCAL_STORE,
                       config = config)

research_state = await llm_editor.create_detailed_report()

llm_publisher = LLMPublisher(**research_state.dump(), config = config)
published_research_path = await llm_publisher.publish_to_pdf_file()

# NOTE: As the code runs, you should see the logging line below:
# INFO:root:*** Using Cached Repo. ***
# This indicates that we are not recreating the embeddings.
# Instead, we are using the cached data in the vector DB.
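
The idea behind the cached-repo message can be illustrated with a generic sketch: derive a key from the source documents and rebuild the expensive artifact only when that key has not been seen before. This is not how llm_analyst implements its cache (the notebook does not show that). `fingerprint_directory` and `get_or_build` are hypothetical helpers.

```python
import hashlib
from pathlib import Path


def fingerprint_directory(directory):
    """Return a hex digest that changes when files are added, removed, or resized."""
    digest = hashlib.sha256()
    root = Path(directory)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(str(path.relative_to(root)).encode("utf-8"))
        digest.update(str(path.stat().st_size).encode("utf-8"))
    return digest.hexdigest()


def get_or_build(cache, key, build):
    """Return cache[key], calling build() only on a cache miss."""
    if key not in cache:
        cache[key] = build()
    return cache[key]
```

Hashing names and sizes is a cheap approximation. An edit that keeps a file the same size would go unnoticed, so hashing file contents would be the stricter choice.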


### Demonstrate Running LLM-Analyst on Web Scraped Data

In [None]:
## Let's run a simple research report against the internet
## All that is required is an active research topic

# Requirement 1.
research_topic = "How does DAF-19 regulate transcription of regeneration associated genes?"

# Use the default configuration
config = Config()
print(config)

# Take note of the default config values
# Pay particular attention to the report_out_dir

In [None]:
# To execute some preliminary research, we use the LLMAnalyst 

# Note: we are using the default data_source and config, so we do not need to provide them
llm_analyst = LLMAnalyst(active_research_topic = research_topic)

await llm_analyst.conduct_research()
research_state = await llm_analyst.write_report()

# Once the report is written, we can ask the LLMPublisher to make a pdf
llm_publisher = LLMPublisher(**research_state.dump(), config = config)
published_research_path = await llm_publisher.publish_to_pdf_file()


In [None]:
## Let's build on the first internet research project and now build a "detailed report"

## Inputs are the same as the Research Report above

llm_editor = LLMEditor(active_research_topic = research_topic)

research_state = await llm_editor.create_detailed_report()

llm_publisher = LLMPublisher(**research_state.dump(), config = config)
published_research_path = await llm_publisher.publish_to_pdf_file()