# **Install and Import Libraries**

> ##### **Add the OpenAI API key to the config/secrets.env file as follows:**

> ###### **OPENAI_API_KEY = "<api_key>"**


In [1]:
%load_ext autoreload
%autoreload 2

In [5]:
import os
import sys

from dotenv import load_dotenv

# load non-secret configuration
load_dotenv("../config/config.env")

# load secrets (e.g. OPENAI_API_KEY)
load_dotenv("../config/secrets.env")

from data_pipeline import *

# add ../src to the module search path so llm_kg_retrieval can be imported
sys.path.append('../src')
import llm_kg_retrieval
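
Because every later step depends on values loaded from the two `.env` files, it can help to check them right after loading. The following is a minimal sketch, not part of the pipeline: `missing_env_vars` and `REQUIRED_VARS` are hypothetical names, and the list of variables is simply the set of names that appear in this notebook.

```python
import os

# Variable names referenced in this notebook; adjust if your config uses others.
REQUIRED_VARS = ["OPENAI_API_KEY", "NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD"]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    # Report early instead of failing later inside a long-running step
    print("Missing environment variables:", ", ".join(missing))
```

Running this before the scraping and download steps surfaces a missing key in seconds rather than hours into the pipeline.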

# **1. Scrape Website**
> Takes approximately 12 minutes to run.

> Asynchronous functions could speed up this process.

In [None]:
scrape_website()
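
As an illustration of the asynchronous approach suggested in the note for this step, here is a minimal sketch using only the standard library. It does not reflect how `scrape_website()` is implemented: `fetch_page` and `scrape_all` are hypothetical names, and `fetch_page` only simulates network latency instead of issuing a real HTTP request.

```python
import asyncio


async def fetch_page(url):
    # Hypothetical stand-in for a real HTTP request; it only simulates latency.
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def scrape_all(urls, max_concurrent=5):
    """Fetch all URLs concurrently, with at most max_concurrent requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded_fetch(url):
        # The semaphore keeps the scraper from overwhelming the target site
        async with semaphore:
            return await fetch_page(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))
```

Inside a notebook cell this would be called as `pages = await scrape_all(urls)`; in a plain script, `asyncio.run(scrape_all(urls))` does the same.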

# **2. Download all meeting documents from the scraped links**

> Takes 3+ hours to run.

> Asynchronous functions could speed up this process.

In [None]:
download_documents()
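
If the per-file download is a blocking function, one possible way to make it concurrent without rewriting it is `asyncio.to_thread` (Python 3.9+). The sketch below is an assumption-laden illustration, not the implementation of `download_documents()`: `download_one` and `download_all` are hypothetical names, and the download itself is simulated with a short sleep.

```python
import asyncio
import time


def download_one(link):
    # Hypothetical blocking download; a real version would write the file to disk.
    time.sleep(0.01)
    return f"downloaded:{link}"


async def download_all(links, download_fn=download_one, max_concurrent=8):
    """Run a blocking download function for every link in worker threads."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(link):
        async with semaphore:
            # to_thread moves the blocking call off the event loop
            return await asyncio.to_thread(download_fn, link)

    # return_exceptions=True keeps one failed download from aborting the whole run;
    # failures appear in the result list as exception objects
    return await asyncio.gather(*(worker(l) for l in links), return_exceptions=True)
```

For a run that takes hours, collecting failures instead of raising on the first one makes it possible to retry only the links whose result is an exception.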

# **3. Extract HTML from PDFs**

In [None]:
# only PDF and DOCX files are converted, so the number of converted files may be smaller than the number of downloaded files
convert_files()

# **4. Extract Meeting Metadata from documents with LLM**

In [None]:
# Get the dataframe of meeting metadata documents. It can be filtered to extract metadata for specific documents only.
# The dataframe includes the extra columns is_manual_metadata_extracted and is_llm_metadata_extracted,
# which show whether the data has already been extracted manually or with the LLM.
doc_type = "metadata"  # avoid shadowing the built-in `type`
metadata_df = get_documents_dataframe(type=doc_type)
metadata_df

In [None]:
# asynchronously extract meeting metadata (respecting the OpenAI rate limit defined in the config file)
await extract_meeting_data(df=metadata_df, type=doc_type)
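
To illustrate what extracting asynchronously under a rate limit can look like, here is a simplified sketch that processes documents in fixed-size batches with a pause between them. This is not how `extract_meeting_data` works internally, which the notebook does not show: `extract_one`, `extract_in_batches`, `batch_size`, and `pause_seconds` are all hypothetical, and the LLM call is simulated.

```python
import asyncio


async def extract_one(document):
    # Hypothetical stand-in for a single LLM request.
    await asyncio.sleep(0.01)
    return {"document": document, "status": "done"}


async def extract_in_batches(documents, batch_size=3, pause_seconds=1.0):
    """Run extractions batch by batch, pausing between batches to stay under a request limit."""
    results = []
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        # Requests within one batch run concurrently
        results.extend(await asyncio.gather(*(extract_one(d) for d in batch)))
        # Pause only when more batches remain
        if start + batch_size < len(documents):
            await asyncio.sleep(pause_seconds)
    return results
```

Batching is the simplest scheme to reason about; a token-bucket limiter would use the quota more evenly but needs more bookkeeping.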

# **5. Extract Agenda from documents with LLM**

## LLM output for agenda extraction is still unreliable

In [None]:
# Get the dataframe of meeting agenda documents. It can be filtered to extract the agenda for specific documents only.
# The dataframe includes the extra columns is_manual_agenda_extracted and is_llm_agenda_extracted,
# which show whether the data has already been extracted manually or with the LLM.
doc_type = "agenda"  # avoid shadowing the built-in `type`
agenda_df = get_documents_dataframe(type=doc_type)
# restrict the extraction to a single body
agenda_df = agenda_df[agenda_df["body"] == "Äldrerådet"]
agenda_df

In [None]:
# asynchronously extract meeting agendas (respecting the OpenAI rate limit defined in the config file)
await extract_meeting_data(df=agenda_df, type=doc_type)

# **6. Export JSON**

In [None]:
construct_aggregate_json(construct_from="llm", validate_json=True) # construct_from = "llm" or "manual"

# **7. Create a Knowledge Graph from JSON**

In [None]:
create_knowledge_graph(construct_from = "llm") # construct_from = "llm" or "manual"

By default, the knowledge graph is constructed from LLM-extracted data. To construct it from manually created JSON data instead, add the data as follows:

1. Manually create JSON files with the extracted data inside the respective folders under `data/protocols`, and name each file `manual_meeting_metadata.json` or `manual_meeting_agenda.json` depending on the document type. The folder structure is `<body>`/`<meeting_date>`/`<document>`. Put the JSON file inside the `<document>` folder.

2. Execute `construct_aggregate_json(construct_from="manual")`. This fails if the created JSON does not follow the schema defined in `data/schema/schema.json`.

3. Execute `create_knowledge_graph(construct_from="manual")`.
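
Before running the manual steps above, it can be useful to list which manual JSON files are actually in place. The sketch below relies only on the folder structure and file names described in step 1; `find_manual_json` is a hypothetical helper, not part of the pipeline.

```python
from pathlib import Path

# File names from step 1 of the manual workflow
MANUAL_FILENAMES = ("manual_meeting_metadata.json", "manual_meeting_agenda.json")


def find_manual_json(protocols_dir):
    """Return sorted paths of manual JSON files under <body>/<meeting_date>/<document>/."""
    root = Path(protocols_dir)
    found = []
    for name in MANUAL_FILENAMES:
        # Three wildcard levels: body, meeting date, document folder
        found.extend(root.glob(f"*/*/*/{name}"))
    return sorted(found)
```

Calling `find_manual_json("../data/protocols")` from the notebook would list the files; the relative path is an assumption based on how the config files are referenced here. Files placed at any other folder depth are ignored.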

# **8. Test data retrieval from Knowledge Graph with LLM**

In [38]:
prompt = "Is there anything related to health?"

In [39]:
# instantiate the LLM query processor with the Neo4j connection settings
processor = llm_kg_retrieval.KnowledgeGraphRAG(
    url=os.getenv("NEO4J_URI"),
    username=os.getenv("NEO4J_USERNAME"),
    password=os.getenv("NEO4J_PASSWORD"),
)


In [None]:
# get the response from the LLM; the other two return values are not used here
response, _, _ = processor.process_prompt(prompt)
print("Response:", response)