## LlamaIndex and Llama 2 tutorial on Lonestar 6

Llama 2 is Meta's open-source large language model (LLM). LlamaIndex is a Python library that connects your data to LLMs such as Llama 2, so you can quickly use unstructured data as the basis for chats or other outputs.


In [1]:
import os
import logging
import sys
from IPython.display import Markdown, display

from llama_index.llms import HuggingFaceLLM
from llama_index.prompts import PromptTemplate, PromptType
from pathlib import Path
from llama_index import download_loader, KnowledgeGraphIndex
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

import re

In [2]:
import torch

## Set your working directory
Change your working directory to your Scratch location. This improves performance and ensures you have access to the model you rsynced earlier.

In [3]:
scratch = ! echo $SCRATCH
os.chdir(scratch[0])
! pwd


/scratch/06659/wmobley
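The cell above relies on IPython's `!` shell syntax. Outside a notebook, the same step could be sketched in plain Python as below. The helper name `resolve_workdir` and the fallback to the current directory are assumptions made for illustration, not part of the original tutorial.

```python
import os
from pathlib import Path


def resolve_workdir(env_var="SCRATCH"):
    """Return the directory named by env_var, or the current directory if it is unset or missing."""
    value = os.environ.get(env_var)
    if value and Path(value).is_dir():
        return Path(value)
    # Assumption: falling back to the current directory is acceptable off-cluster.
    return Path.cwd()


workdir = resolve_workdir()
os.chdir(workdir)
print(workdir)
```

Reading the variable with `os.environ.get` avoids an exception when `$SCRATCH` is not defined, for example when testing on a laptop.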


## Access the model
Next we'll access the models. Four models are available: the base and chat variants of the 7-billion and 13-billion parameter models. The folder also contains the 70B parameter models; however, we have not tested their performance on the LS6 dev machines.



In [4]:
# Model names (make sure you have access on HF)
LLAMA2_7B = f"{scratch[0]}/HF/noRef/Llama7b"
LLAMA2_7B_CHAT = f"{scratch[0]}/HF/noRef/Llama7bchat"
LLAMA2_13B = "meta-llama/Llama-2-13b-hf"
LLAMA2_13B_CHAT = "meta-llama/Llama-2-13b-chat-hf"

## Select Model
For this script we will choose the Llama 2 7B parameter chat model.

In [5]:
selected_model = LLAMA2_7B_CHAT
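Because the 7B entries are local paths, it can help to confirm the directory looks like a Hugging Face checkpoint before loading it. The sketch below is a minimal check under the assumption that a usable checkpoint folder contains a `config.json` file; the helper name `looks_like_hf_checkpoint` is hypothetical.

```python
from pathlib import Path


def looks_like_hf_checkpoint(model_ref):
    """Return True if model_ref is a local directory that contains a config.json file."""
    path = Path(model_ref)
    return path.is_dir() and (path / "config.json").is_file()


# A Hub ID such as "meta-llama/Llama-2-13b-hf" is normally not a local directory,
# so this check returns False for it and the model would be fetched from the Hub instead.
print(looks_like_hf_checkpoint("meta-llama/Llama-2-13b-hf"))
```

In the notebook you could call `looks_like_hf_checkpoint(selected_model)` before the load step to catch a missing or incomplete rsync early.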

## System Prompt
We are going to have Llama 2 classify the topics of each PDF, using only a fixed list of topics.

We set up the system prompt below. 


In [6]:
SYSTEM_PROMPT="""You are social scientist researcher. You have been tasked with identifying the topics of research papers. 

Classify the topics of the documents using only the following topics:
            
## Topics
Federal, State, Regional, Local, Operators-in-training, System Operators, Community Members, Not Stage Specific, Source Water, Potable Water Treatment, Potable Water Distribution, Wastewater Collection, Wastewater Treatment, Water Storage, End-Users Storage System, Administrative Processes, Compliance, Strategic Planning, Funding, Worker Safety, Testing, Laboratory, Field, At-home, Effluent at the conclusion of the treatment process, Biological, Chemical, Physical properties, Water received by the end-user, Biological, Chemical, Physical properties, Treatment Process, Equipment Installation, Equipment Operations, Equipment Maintenance, System Components, System Monitoring, Climate and Environment, Cybersecurity, System Breakdown, Hazardous Materials, Background, Application Process, Benefits of Certification, Accessibility of Material, In-Person Training, Library, Online, Cost Considerations, Continuing Education Requirements, Experience Requirements, Previous Education Considerations, Study Tactics, Consequences of Poor Management, Collaboration, Water Governance , Federal policies, regulations , State policies, regulations , Tribal governance , Community Outreach, Water System Stakeholders, Environmental Attorneys, End-Users, Relevance , To environment, To operator, Positive implications, Negative implications, To end user, COVID-19
            
Here are some rules you always follow:
- Do not include numbers before the topics. 
- Generate only the requested output, don't include any other language before or after the requested output.
- Never say thank you, that you are happy to help, that you are an AI agent, etc. Just answer directly.
- Generate professional language typically used in business documents in North America.
- Never generate offensive or foul language.
"""

query_wrapper_prompt = PromptTemplate(
    "[INST]<<SYS>>\n" + SYSTEM_PROMPT + "<</SYS>>\n\n{query_str}[/INST] "
)
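To see what the wrapper produces, the following sketch renders the same `[INST]<<SYS>>` template with plain string formatting. It uses a short stand-in system prompt (`SYSTEM_PROMPT_DEMO`, a hypothetical name) instead of the full prompt above, and it only approximates what LlamaIndex does internally when it fills in `{query_str}`.

```python
# Hypothetical short stand-in for the full SYSTEM_PROMPT defined above.
SYSTEM_PROMPT_DEMO = "You classify documents by topic."

# Same layout as the template passed to PromptTemplate in the notebook.
TEMPLATE = "[INST]<<SYS>>\n" + SYSTEM_PROMPT_DEMO + "<</SYS>>\n\n{query_str}[/INST] "


def render(query_str):
    """Fill the query into the Llama 2 chat layout."""
    return TEMPLATE.format(query_str=query_str)


rendered = render("Provide a list of Topics")
print(rendered)
```

The system prompt sits between the `<<SYS>>` markers and the query is placed just before `[/INST]`, which is the layout the Llama 2 chat models were trained on.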

## Load the Model
Next we'll load the model. If the model is not found locally, Hugging Face will attempt to download it.

In [7]:

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=2048,
    generate_kwargs={"temperature": 0.0, "do_sample": False},
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name=selected_model,
    model_name=selected_model,
    device_map="auto",
    # change these settings below depending on your GPU
    model_kwargs={"torch_dtype": torch.float16, "load_in_8bit": True, "cache_dir":f"{scratch[0]}/HF/noRef"},
)
print(selected_model)

INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmpnwe30ojv
Created a temporary directory at /tmp/tmpnwe30ojv
INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmpnwe30ojv/_remote_module_non_scriptable.py
Writing /tmp/tmpnwe30ojv/_remote_module_non_scriptable.py


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]


/scratch/06659/wmobley/HF/noRef/Llama7bchat
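The `load_in_8bit` setting matters because of how much GPU memory the weights need. The arithmetic below is a rough sketch that uses nominal parameter counts (7 and 13 billion) and counts the weights only; activations, the attention cache, and library overhead are not included, so real usage will be higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Nominal sizes; the true parameter counts differ slightly.
for label, n_params in [("7B", 7e9), ("13B", 13e9)]:
    fp16 = weight_memory_gb(n_params, 16)
    int8 = weight_memory_gb(n_params, 8)
    print(f"{label}: ~{fp16:.0f} GB in float16, ~{int8:.0f} GB in 8-bit")
```

Halving the bits per parameter halves the weight footprint, which is why 8-bit loading is a common choice when GPU memory is tight.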


## Load the PDF documents 

In [8]:


PDFReader = download_loader("PDFReader")

loader = PDFReader()


In [9]:
from llama_index import VectorStoreIndex, ServiceContext, set_global_service_context


# topics maps each PDF filename to the list of topics returned by the model.
topics = {}

In [None]:

# Build the service context once: it pairs the Llama 2 model with a small local embedding model.
service_context = ServiceContext.from_defaults(
    llm=llm, embed_model="local:BAAI/bge-small-en"
)
set_global_service_context(service_context)

for root, dirs, files in os.walk(f'{scratch[0]}/HF/GrayLit', topdown=False):
    for i, name in enumerate(files):
        if name.endswith('.pdf'):
            # Index one PDF at a time so each query only sees that document.
            documents = loader.load_data(file=Path(os.path.join(root, name)))
            index = VectorStoreIndex.from_documents(documents)
            query_engine = index.as_query_engine()
            response = query_engine.query("Provide a list of Topics")
            if response.response == 'Empty Response':
                print(name)  # report PDFs that produced no answer
                continue
            # Keep the text between the first and second colon, drop newlines,
            # digits and "()", then split the remainder on periods.
            raw = response.response.split(":")[1].replace('\n', "")
            topics[name] = re.sub(r'\d', "", raw).replace('()', "").split(".")
            print(i)



1
2
3
4
5
6
7
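The parsing step in the loop can be tried on its own without loading the model. The sketch below is a slightly more defensive variant, not the notebook's exact logic: it keeps everything after the first colon, tolerates responses with no colon, and drops empty entries. The sample response format (a heading followed by a numbered list) is an assumption about what the model returns, and `parse_topics` is a hypothetical helper name.

```python
import re


def parse_topics(text):
    """Turn a numbered-list style response into a clean list of topic strings."""
    # Keep everything after the first colon; if there is no colon, keep the whole text.
    body = text.split(":", 1)[-1]
    # Remove newlines, digits and empty parentheses, as the loop above does.
    body = re.sub(r"\d", "", body.replace("\n", "")).replace("()", "")
    # Split on periods, trim whitespace and drop empty pieces.
    return [part.strip() for part in body.split(".") if part.strip()]


sample = "Topics:\n1. Funding\n2. Compliance"  # assumed response shape
print(parse_topics(sample))
```

Stripping whitespace and empty strings here avoids carrying entries such as `""` or `" Funding"` into `Topics.json`.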


In [None]:
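import json

print(response.response)  # answer for the last PDF processed in the loop
# Save the filename-to-topics mapping so it can be analyzed later.
with open("Topics.json", "w") as outfile: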
    json.dump(topics, outfile)
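Once the mapping is saved, a natural next step is to count how often each topic appears across documents. The sketch below uses a small made-up dictionary (`sample_topics`, with hypothetical filenames) and an in-memory JSON round trip so that it does not overwrite `Topics.json`.

```python
import json
from collections import Counter

# Made-up data standing in for the real filename-to-topics mapping.
sample_topics = {
    "report_a.pdf": ["Funding", "Compliance"],
    "report_b.pdf": ["Funding", "Testing"],
}

# Round-trip through JSON to mimic saving and reloading the file.
restored = json.loads(json.dumps(sample_topics))

# Count how many documents mention each topic.
counts = Counter(topic for doc_topics in restored.values() for topic in doc_topics)
for topic, n in counts.most_common():
    print(topic, n)
```

With the real file you would replace the round trip with `json.load(open("Topics.json"))` and run the same counting step.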

In [None]:
response = query_engine.query("What are the topics found in this article?")
display(Markdown(f"<b>{response}</b>"))

In [None]:
## Document By Document Topic Analysis
# Unsupervised
## Semi-Supervised