## LlamaIndex and Llama 2 tutorial on Lonestar 6

Llama 2 is Meta's open source large language model (LLM). LlamaIndex is a Python library that connects your data to LLMs such as Llama 2, so you can quickly use unstructured data as the basis for chats or other outputs.


In [1]:
import os
import logging
import sys
from IPython.display import Markdown, display

from llama_index.llms import HuggingFaceLLM
from llama_index.prompts import PromptTemplate, PromptType
from pathlib import Path
from llama_index import download_loader, Document, KnowledgeGraphIndex
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

import re
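The setup cell configures logging twice: `logging.basicConfig` attaches one stdout handler to the root logger and `addHandler` attaches a second, which is likely why each INFO message in the outputs further down appears twice (once formatted, once plain). The sketch below reproduces the effect on a throwaway logger; the logger name `handler_demo` and the in-memory buffer are illustrative assumptions, not part of the notebook.

```python
import io
import logging

# An in-memory stream stands in for sys.stdout so the effect can be inspected.
buffer = io.StringIO()

demo = logging.getLogger("handler_demo")  # hypothetical logger name
demo.setLevel(logging.INFO)
demo.propagate = False  # keep the demo isolated from the root logger

# Two handlers on the same logger: every record is emitted once per handler.
demo.addHandler(logging.StreamHandler(buffer))
demo.addHandler(logging.StreamHandler(buffer))

demo.info("loading model")
print(buffer.getvalue())
```

If the duplicated lines are distracting, keeping only one of the two logging calls in the setup cell should be enough.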

In [2]:
import torch

## Set your working directory
Change your working directory to your `$SCRATCH` location. This improves performance and ensures you have access to the model you rsynced earlier.

In [3]:
scratch = ! echo $SCRATCH
os.chdir(scratch[0])
! pwd


/scratch/06659/wmobley
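The cell above relies on IPython's `!` shell syntax, which only works inside a notebook. As a minimal sketch, assuming the `SCRATCH` environment variable is set on the system, the same step can be written in plain Python; `resolve_scratch` is a hypothetical helper name, and falling back to the current directory is an assumption made so the snippet also runs elsewhere.

```python
import os
from pathlib import Path


def resolve_scratch(default: str = ".") -> Path:
    """Return the directory named by $SCRATCH, or `default` if it is unset."""
    return Path(os.environ.get("SCRATCH", default)).expanduser().resolve()


scratch_dir = resolve_scratch()
os.chdir(scratch_dir)
print(Path.cwd())
```

Note that the notebook's `scratch` variable is a list of output lines, which is why later cells index it as `scratch[0]`; `scratch_dir` here is a single `Path`.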


## Access the model
Next we'll access the models. Four models are available: the base and chat versions of the 7 billion and 13 billion parameter models. The folder also contains the 70B parameter models; however, we have not tested their performance on the LS6 dev machines.



In [4]:
# Model names (make sure you have access on HF)
LLAMA2_7B = f"{scratch[0]}/HF/noRef/Llama7b"
LLAMA2_7B_CHAT = f"{scratch[0]}/HF/noRef/Llama7bchat"
LLAMA2_13B = f"{scratch[0]}/HF/noRef/Llama-2-13b-hf"
LLAMA2_13B_CHAT = f"{scratch[0]}/HF/noRef/Llama-2-13b-chat-hf"
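Before loading, it can help to confirm that each model folder is actually present. The sketch below is a hypothetical check, assuming the folders are in Hugging Face format and therefore contain a `config.json` file; `find_missing_models` is an illustrative name, and the temporary directories only stand in for the real paths.

```python
import tempfile
from pathlib import Path


def find_missing_models(model_dirs: dict[str, str]) -> list[str]:
    """Return the names whose directory lacks a config.json file."""
    return [
        name
        for name, path in model_dirs.items()
        if not (Path(path) / "config.json").is_file()
    ]


# Demonstration with throwaway directories standing in for the model folders.
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "Llama-2-13b-chat-hf"
    present.mkdir()
    (present / "config.json").write_text("{}")
    demo_dirs = {
        "LLAMA2_13B_CHAT": str(present),
        "LLAMA2_13B": str(Path(tmp) / "Llama-2-13b-hf"),  # never created
    }
    missing = find_missing_models(demo_dirs)

print(missing)
```

In the notebook, the dictionary would instead map the four names above to their `scratch[0]` paths.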

## Select Model
For this script we will choose the Llama 2 13B parameter chat model.

In [5]:
selected_model = LLAMA2_13B_CHAT

## System Prompt
We are going to have Llama 2 answer a fixed set of coding questions about each set of documents.

We set up the system prompt below. 


In [6]:
SYSTEM_PROMPT = """You are a social science researcher. You have been tasked with identifying the answering aspects of research papers. 

           
Here are some rules you always follow:
- If applicable, answer with a yes or no at the beginning of the response. 
- Answer the question succinctly. 
- Do not include numbers before the topics. 
- Generate only the requested output, don't include any other language before or after the requested output.
- Never say thank you, that you are happy to help, that you are an AI agent, etc. Just answer directly.
- Generate professional language typically used in business documents in North America.
- Never generate offensive or foul language.
"""

query_wrapper_prompt = PromptTemplate(
    "[INST]<<SYS>>\n" + SYSTEM_PROMPT + "<</SYS>>\n\n{query_str}[/INST] "
)
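To see what the model actually receives, the wrapper can be mimicked with plain string formatting. This is only a sketch of the layout used above (system prompt between `<<SYS>>` markers, the question before `[/INST]`); the shortened system prompt and the `wrap_query` helper are illustrative assumptions, and the real `PromptTemplate` is filled in by LlamaIndex at query time.

```python
# A shortened stand-in for the notebook's SYSTEM_PROMPT.
DEMO_SYSTEM_PROMPT = "Answer with a yes or no at the beginning of the response.\n"

# Same layout as the query_wrapper_prompt above.
DEMO_TEMPLATE = (
    "[INST]<<SYS>>\n" + DEMO_SYSTEM_PROMPT + "<</SYS>>\n\n{query_str}[/INST] "
)


def wrap_query(query_str: str) -> str:
    """Insert a question into the Llama 2 chat layout."""
    return DEMO_TEMPLATE.format(query_str=query_str)


print(wrap_query("Are these documents written by a federal agency?"))
```

One caveat of this plain-formatting version: a system prompt that itself contains curly braces would need them doubled before calling `format`.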

## Load the Model
Next we'll load the model. If it can't find the model it will download it. 

In [7]:

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=2048,
    generate_kwargs={"temperature": 0.0, "do_sample": False},
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name=selected_model,
    model_name=selected_model,
    device_map="balanced",
    # change these settings below depending on your GPU
    model_kwargs={"torch_dtype": torch.float16, "load_in_8bit": False, "cache_dir":f"{scratch[0]}/HF/noRef"},
)
print(selected_model)

INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmpfyhs868w
Created a temporary directory at /tmp/tmpfyhs868w
INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmpfyhs868w/_remote_module_non_scriptable.py
Writing /tmp/tmpfyhs868w/_remote_module_non_scriptable.py


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

/scratch/06659/wmobley/HF/noRef/Llama-2-13b-chat-hf
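The loading cell notes that `torch_dtype` and `load_in_8bit` depend on your GPU. A rough way to judge what fits is to multiply the parameter count by the bytes used per parameter. The sketch below does that arithmetic, assuming round parameter counts of 7, 13, and 70 billion and counting only the weights (activations and the attention cache need additional memory).

```python
# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for name, n_params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    sizes = {dtype: weight_memory_gb(n_params, dtype) for dtype in BYTES_PER_PARAM}
    print(name, sizes)
```

By this estimate the 13B model in `float16` needs about 26 GB for weights alone, and 8-bit loading would roughly halve that.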


## Load the PDF documents 

In [8]:


PDFReader = download_loader("PDFReader")

loader = PDFReader()


In [9]:
from llama_index import VectorStoreIndex, ServiceContext, set_global_service_context
from llama_index.query_engine import CitationQueryEngine


# Better Explain Each of these steps. 
topics = {}

In [10]:

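def query(text):
    """Send `text` to the global `query_engine` and return its response."""
    response = query_engine.query(text)
    # Uncomment the lines below to display the answer and its first source node.
#     display(Markdown(f"<b>{response}</b>"))
#     print(len(response.source_nodes))
#     display(response.source_nodes[0].node.metadata)
#     display(Markdown(response.source_nodes[0].node.get_text()))
    return response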

The model struggles when asked true-or-false questions, so the queries below are phrased as yes/no questions.
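Because the system prompt asks for a yes or no at the beginning of each response, one option is to code an answer from its first word only. The coding loop later in this notebook instead checks whether "yes" appears anywhere in the response, which is simpler but can also match a "yes" that occurs mid-sentence. The sketch below shows the stricter variant; `parse_yes_no` is a hypothetical helper, not part of LlamaIndex.

```python
import re
from typing import Optional


def parse_yes_no(answer: str) -> Optional[bool]:
    """Return True/False if the answer starts with yes/no, otherwise None."""
    match = re.match(r"\s*(yes|no)\b", answer, flags=re.IGNORECASE)
    if match is None:
        return None  # the model did not lead with a yes or no
    return match.group(1).lower() == "yes"


for text in [
    "Yes, the documents focus on funding.",
    "No. The documents do not mention laboratories.",
    "The documents are unclear; yes could be argued.",
]:
    print(parse_yes_no(text), "<-", text)
```

Returning `None` for answers that do not lead with yes or no makes it possible to review those cases by hand rather than silently coding them as false.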

In [11]:
queries = {
#     "Author": "Who is the author?",
#           "Federal":"are these documents written by a federal agency",
#           "State":"are  these documents written by a state agency",
#           "Local":"are  these documents written by a local agency",
           "All":"What type of agency likely wrote the documents: federal, state, local, or regional agency?",
         "Audience": "Of the following audience types: Operators-in-training; Operators; Governmental or Tribal Officials; or End-Users; What type of Audience are the documents written for? ",
             "Stage in Water Process":"Of the following stages:  Small Systems; Not Stage Specific; Source Water;\
           Potable Water Distribution; Potable Water Treatment; Wastewater Collection; Wastewater Treatment;\
           Water Storage; End-Users Storage System. What Stage of the water treatment process do the documents address",
    "Administrative process":"Do the documents focus on actions and processes related to administering water systems?",
    "Administrative Considerations":"Do the documents focus on actions and procedures related to record-keeping. Examples: billing, human resources, time keeping, overhead tracking, company management, etc.",
"Compliance":"Do the documents focus on Tasks related to water quality regulations and permitting requirements from the water operator’s side. Examples: compliance paperwork, filing compliance forms, tracking deliverables, interacting with inspectors.",
"Strategic Planning":"Do the documents focus on Practices pertaining to future planning, including best management practices, gold standards, and strategic planning. Examples: budgeting for future expansions, hiring specialists.",
"Funding":"Do the documents focus on Discussion of financing, spending, and other cost considerations. Examples: funding from government entities for system operations. ",
"Worker Safety":"Do the documents focus on Procedures to ensure the safety of workers and the security of the water system. Examples: PPE requirements, staffing requirements, worker health",
"Health care":"Do the documents provide information on health care considerations?",
    "Water Quality ": "Do the documents focus on Considerations of water contaminants and cleanliness",
"Testing": 	"Do the documents focus on Processes for testing potable water quality. ",
"Laboratory": 	"Do the documents focus on Formal water quality testing in a laboratory, chemical analysis. Conducted by a specialist. ",
"Field":	"Do the documents focus on testing Conducted during routine checks and maintenance, more back of the envelope. Conducted by an operator. ",
"At-home" : "Do the documents focus on testing Conducted by end-user, not mandatory. ",
    "Source Water Testing": 	"Do the documents focus on Processes for testing source water quality. ",
"Biological" :	"Do the documents focus on biological testing of the source water E.g.: Viruses, bacteria, zoological contaminants, organic compounds. ",
"Chemical": 	"Do the documents focus on chemical testing of the source water E.g.: Nitrous oxide, rust. ",
"Physical contaminents":"Do the documents focus on testing for physical properties of the source water E.g.: Turbidity, pH, temperature. ",

"Possible Contamination Sources":"Do the documents identify possible water system contamination sources?",
    "Effluent at End of Treatment Process":	"Do the documents focus on Considerations of the quality of treated effluent water leaving the water treatment plant, entering the distribution portion of the system.  ",
"Biological.1" :	"Do the documents focus on biological testing during the treatment process E.g.: Viruses, bacteria, zoological contaminants, organic compounds. ",
"Chemical.1": 	"Do the documents focus on chemical testing during the treatment process E.g.: Nitrous oxide, rust. ",
"Physical properties.1":"Do the documents focus on testing for physical properties during the treatment process E.g.: Turbidity, pH, temperature. ",

    "Water received by the end-user":	"Do the documents focus on water received by the end user including Considerations of the quality of water received at the tap for the end-users. ",
"Biological.2":	"Do the documents focus on the biological conditions of water received by the end user? E.g.: Viruses, bacteria, zoological contaminants, organic compounds. ",
"Chemical.2": "Do the documents focus on the chemical conditions of the water received by the end user?	E.g.: Nitrous oxide, rust. ",
"Physical Properties.1": "Do the documents focus on the physical properties of the water received by the end user? E.g.: Turbidity, pH, temperature. ",

    "Treatment Process": "Do the documents focus on	The process of water treatment, including flocculation, tanks, equipment?",
"Operating System": "Do the documents focus on	Considerations of the water system equipment ", 
"Equipment Installation":	"Do the documents focus on Instructions for creating new water systems ",
"Equipment Operations":	"Do the documents focus on Procedures for operating the water treatment and distribution infrastructure system.",
"Equipment Maintenance":	"Do the documents focus on the Procedures for maintaining the integrity of the water treatment system, including routine maintenance and repairs. ",
"System Components": "Do the documents address the  breakdown of each material or piece of a system?",
"System Monitoring": "Do the documents focus on monitoring that tells us if maintenance is needed (about the integrity)? Examples: measuring processes occurring in the water utility (uses devices), humidity sensors, water level measurements.",
"Design Considerations":"Do the documents focus on the design process for  water treatment system, including routine maintenance and repairs.",
    
    "Threats and Security":"Do the documents focus on Anything that threatens the daily operations of the water system, both physically and operationally? ",
"Climate and Environment": "Do the documents focus on Flooding, landslides, freezing, extreme weather events, erosion, change in landscape, changes in water supply patterns. ",
"Cybersecurity": "Do the documents focus on	Cyber hacks, ransom ware, hostage data?",
"System Breakdown":	"Do the documents focus on Rusting, lead-based paints, mold, corrosion?",
"Hazardous Materials":	"Do the documents focus on Chemical spills, asbestos?", 

"Technical Considerations":"Do the documents provide Information which helps inform, train or educate the user on the technical aspects of water systems.",
    "Technical Background":"Do the documents provide Information which helps inform, train or educate the user on  what the technical/engineering-based/physics reason is ?",

    "Social Background": "Do the documents provide	Information which helps inform, train or educate the user on why a process is designed like it is",

    "Social Considerations": "Do the documents address social considerations?", 	
"Certification": "Do the documents focus on the process of earning certification to become a certified water operator? ",
"Statistics on Certified People": "Do the documents provide statistics on the number of people certified for water system management? ",
"Design of Certification Process":"Do the documents provide information on the certification process?",
    "Content Covered":"Do the documents provide information on the content covered during the certification process?",
    "Application Process": "Do the documents Provide details on how to apply for exams and certification?",
"Benefits of Certification": "Do the documents Explain why pursuing certification is of value to the individual and the community?",
"Accessibility of Material": "Do the documents address Where the material is stored or available?",
"Request hard copy":"Can the user request a hard copy of the documents?",
    "In-Person":	"Are the documents Provided at an in-person training? ",
"Library": "Are the documents likely Found in a publicly available library. ",
"Online": "Are the documents found online?",

    "Cost Considerations": "Do the documents focus on The cost incurred by the individual pursuing the certification and the affordability of this cost?",
"Continuing Education Requirements or Renewal": "Do the documents identify the	Details the requirements for maintaining certification?",
"Experience Requirements": "Do the documents Detail the experiences required for achieving certification.",
"Previous Education Considerations": "Do the documents identify the	Education  requirements and education paths to get certified?",

    "Study Tactics": "Do the documents provide Advice for how to prepare and study for the test itself, strategies?",

    "Consequences of Poor Management": "Do the documents identify the Detrimental impacts of mismanaged water utilities on population and the environment?",
"Collaboration": "Do the documents identify The importance of cross-systems collaboration and disciplines?",
"Water Governance": "Do the documents identify How different levels of government interact and deal with policy and regulation related issues?",
"Federal Policies and Regulations": "Do the documents identify federal policies such as Acts, laws, policies?",
"State Policies and Regulations": 	"Do the documents address How these policies and regulations are implemented at the state level?",
"Tribal Governance": "Do the documents focus on How tribal policies interact with federal policies?",

    "Community Outreach": "Do the documents focus on considerations of how water systems can interface with and interact with the communities they are serving?",
"Water System Stakeholders" : "Do the documents focus on Discussion of who will be impacted by the water system?",
"Environmental Attorneys Lawyers": "Do the documents identify who are responsible for defending the water systems?",
"End-Users.1": "Do the documents focus on People who are receiving water from the water systems?",
"Developers":"Do the documents focus on People who are developing the water systems?",
    "Relevance" : "Do the documents focus on How is the content covered in the document made relevant to the context it is addressing or the individual using the materials?",
"Relevance To environment": "Do the documents provide Materials that are better suited for the Alaska environment?",
"To Operator" : "Do the documents provide relevant Materials that are tailored for operators in Alaska?",
"Positive implications":	"Do the documents provide Positive assumptions about the backgrounds, interests of, skills of operators that are getting certified?",
"Negative implications":	"Do the documents provide Negative assumptions about the backgrounds, interests of, skills of operators that are getting certified, condescending, tone sounds like being talked down to?",
"To End-User":	"Do the documents provide relevant Information directed towards household management of water and interacting with the water system?",
"To Endvirnment":	"Do the documents provide relevant information directed towards environmental management of water and interacting with the water system?",

    "COVID-19":	"Do the documents focus on Any aspects relevant to the COVID-19 Pandemic"

          }

In [12]:
service_context = ServiceContext.from_defaults(
                llm=llm, embed_model="local:BAAI/bge-large-en-v1.5"
            )
set_global_service_context(service_context)

In [13]:
import pandas as pd
documents = pd.read_csv(f'{scratch[0]}/HF/Survey/all_public_interviews.csv')
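The coding loop further down groups this table by `Interview` and `Set` and reads the passages from a `Text` column, so the CSV is expected to have at least those three columns. The sketch below shows the grouping on a toy table; the rows are invented for illustration and the real file may contain more columns.

```python
import pandas as pd

# Invented rows with the three columns the notebook relies on.
toy = pd.DataFrame(
    {
        "Interview": ["Interview 1 ", "Interview 1 ", "Interview 2 "],
        "Set": ["April 22", "April 22", "April 22"],
        "Text": ["first passage", "second passage", "third passage"],
    }
)

# Grouping by two columns yields (Interview, Set) tuples as keys.
texts_by_group = {
    key: group["Text"].tolist() for key, group in toy.groupby(["Interview", "Set"])
}

for key, texts in texts_by_group.items():
    print(key, texts)
```

Each group's passages become the documents for one index, so every interview is queried separately.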

    
   

In [14]:
graylit_codes = {}
graylit_outputs = {}
# Build one vector index per (Interview, Set) group and run every query against it.
for group_key, group in documents.groupby(["Interview", "Set"]):
    print(group_key)
    codes = {}
    prompts = {}
    doc_chunks = [Document(text=text) for text in group.Text.tolist()]
    index = VectorStoreIndex.from_documents(doc_chunks)
    query_engine = index.as_query_engine()
    for key, question in queries.items():
        response = query_engine.query(question)

        if response.response == 'Empty Response':
            print(group_key)
            continue
        # Keep the raw answer so it can be inspected later
        # (the original cell never filled this dictionary).
        prompts[key] = response.response
        answer = response.response.lower()
        if key == 'All':
            # Record the first agency type mentioned in the answer.
            for agency in ['federal', 'state', 'local', 'regional agency']:
                if agency in answer:
                    codes[key] = agency
                    break
        elif key == "Audience":
            # 'Operators-in-training' must be checked before 'Operators',
            # because the shorter label is a substring of the longer one.
            for audience in ['Operators-in-training', 'Operators', 'Governmental or Tribal Officials', 'End-Users']:
                if audience.lower() in answer:
                    codes[key] = audience
                    break
        else:
            # Yes/no questions: True when "yes" appears anywhere in the answer.
            codes[key] = "yes" in answer
    graylit_codes[group_key] = codes
    graylit_outputs[group_key] = prompts

('INTERVIEW 1 ', '21-22 Combined')




('INTERVIEW 10 ', '21-22 Combined')
('INTERVIEW 11 ', '21-22 Combined')
('INTERVIEW 12 ', '21-22 Combined')
('INTERVIEW 13 ', '21-22 Combined')
('INTERVIEW 14 ', '21-22 Combined')
('INTERVIEW 15 ', '21-22 Combined')
('INTERVIEW 16 ', '21-22 Combined')
('INTERVIEW 17 ', '21-22 Combined')
('INTERVIEW 2 ', '21-22 Combined')
('INTERVIEW 3 ', '21-22 Combined')
('INTERVIEW 4 ', '21-22 Combined')
('INTERVIEW 5 ', '21-22 Combined')
('INTERVIEW 6 ', '21-22 Combined')
('INTERVIEW 7 ', '21-22 Combined')
('INTERVIEW 8 ', '21-22 Combined')
('INTERVIEW 9 ', '21-22 Combined')
('Interview 1 ', 'April 22')
('Interview 10 ', 'April 22')
('Interview 11 ', 'April 22')
('Interview 12 ', 'April 22')
('Interview 13 ', 'April 22')
('Interview 14 ', 'April 22')
('Interview 15 ', 'April 22')
('Interview 16 ', 'April 22')
('Interview 17 ', 'April 22')
('Interview 18 ', 'April 22')
('Interview 19 ', 'April 22')
('Interview 2 ', 'April 22')
('Interview 20 ', 'April 22')
('Interview 21 ', 'April 22')
('Interview 22
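For the two multiple-choice questions (`All` and `Audience`), the loop takes the first label from a fixed list that appears in the answer. The order of that list matters because "Operators" is a substring of "Operators-in-training". The sketch below isolates that matching step; `first_matching_label` is a hypothetical helper name and the sample answer is invented.

```python
from typing import Optional, Sequence


def first_matching_label(answer: str, labels: Sequence[str]) -> Optional[str]:
    """Return the first label found in the answer (case-insensitive), else None."""
    lowered = answer.lower()
    for label in labels:
        if label.lower() in lowered:
            return label
    return None


AUDIENCES = [
    "Operators-in-training",
    "Operators",
    "Governmental or Tribal Officials",
    "End-Users",
]

sample = "The documents are written for operators-in-training."
print(first_matching_label(sample, AUDIENCES))
# With the shorter label first, the same answer is coded differently.
print(first_matching_label(sample, ["Operators", "Operators-in-training"]))
```

Listing more specific labels before the labels they contain keeps the coding consistent.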

In [27]:
graylit_codes

{'INTERVIEW 1 , 21-22 Combined': {'All': 'local',
  'Audience': 'Operators-in-training',
  'Stage in Water Process': False,
  'Administrative process': True,
  'Administrative Considerations': True,
  'Compliance': True,
  'Strategic Planning': True,
  'Funding': True,
  'Worker Safety': True,
  'Health care': False,
  'Water Quality ': True,
  'Testing': True,
  'Laboratory': False,
  'Field': True,
  'At-home': True,
  'Source Water Testing': True,
  'Biological': False,
  'Chemical': False,
  'Physical contaminents': False,
  'Possible Contamination Sources': False,
  'Effluent at End of Treatment Process': True,
  'Biological.1': True,
  'Chemical.1': True,
  'Physical properties.1': True,
  'Water received by the end-user': True,
  'Biological.2': False,
  'Chemical.2': False,
  'Physical Properties.1': False,
  'Treatment Process': False,
  'Operating System': True,
  'Equipment Installation': False,
  'Equipment Operations': True,
  'Equipment Maintenance': True,
  'System Compo

In [16]:
graylit_codes = {f"{a}, {b}": codes for (a, b), codes in graylit_codes.items()}
graylit_outputs = {f"{a}, {b}": outputs for (a, b), outputs in graylit_outputs.items()}
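The keys are joined into strings here because the standard `json` module cannot serialize dictionaries with tuple keys, and the next cell writes both dictionaries to JSON. A minimal sketch with an invented entry:

```python
import json

# One invented entry keyed by an (Interview, Set) tuple.
codes_by_tuple = {("Interview 1 ", "April 22"): {"Funding": True}}

# json.dumps raises TypeError for tuple keys.
try:
    json.dumps(codes_by_tuple)
    tuple_keys_ok = True
except TypeError:
    tuple_keys_ok = False

# Joining the tuple into one string makes the dictionary serializable.
codes_by_string = {f"{a}, {b}": codes for (a, b), codes in codes_by_tuple.items()}
round_trip = json.loads(json.dumps(codes_by_string))

print(tuple_keys_ok)
print(round_trip)
```

Because the conversion unpacks each key into two parts, running that cell a second time on the already converted dictionaries would fail.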

In [17]:
print(response.response)
import json 
with open("survey_codes.json", "w") as outfile: 
    json.dump(graylit_codes, outfile)
with open("survey_outputs.json", "w") as outfile: 
    json.dump(graylit_outputs, outfile)

Yes, the documents focus on aspects relevant to the COVID-19 pandemic.


In [18]:
import pandas as pd
df = pd.DataFrame( graylit_codes)

In [19]:
df = df.transpose()

In [20]:
# Add one indicator column per agency type found in 'All'.
for agency in df['All'].dropna().unique():
    df[agency] = df['All'] == agency

In [21]:
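# Add one indicator column per audience type found in 'Audience'.
# (Iterating over graylit_codes would loop over interview names instead.)
for audience in df['Audience'].dropna().unique():
    df[audience] = df['Audience'] == audience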

In [22]:
df = df.replace(False, 0)
df = df.replace(True, 1)
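The cells above turn a categorical column into 0/1 indicator columns by hand. pandas offers `get_dummies` for the same purpose; the sketch below applies it to an invented three-row table as an alternative, not as the notebook's method.

```python
import pandas as pd

# Invented rows: one categorical column and one boolean code.
toy = pd.DataFrame(
    {"All": ["local", "federal", "state"], "Funding": [True, False, True]},
    index=["A", "B", "C"],
)

# One indicator column per category, cast to 0/1 integers.
agency_flags = pd.get_dummies(toy["All"]).astype(int)

coded = pd.concat([toy, agency_flags], axis=1)
coded["Funding"] = coded["Funding"].astype(int)
print(coded)
```

The explicit `astype(int)` keeps the result the same across pandas versions, since the default dtype returned by `get_dummies` has changed over time.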

In [23]:
home = ! echo $HOME

In [34]:
df.to_csv(f"{home[0]}/survey_coded.csv")

In [25]:
!ls

'1urBYxybPjtjGM7EVY1_ZiSfGcc9w5OHl?usp=share_link'
'2019_Supporting DMDU.pdf'
 cache2
 CodedSurveyTopics.json
 geospatial_latest.sif
 git-lfs-1.2.0
 gray_lit_coded.csv
 graylit_codes.json
 graylit_outputs.json
 HF
 index.html
 Jupyter
 Llama-2-13b-chat-hf
 Llama-2-13b-hf
 Llama-2-70b-chat-hf
 Llama-2-70b-hf
 Llama-2-7b
 Llama-2-7b-chat-hf
 Llama-2-7b-hf
 LLAMAIndexenvironment.yml
 llama-main
 Miniconda3-py311_23.5.2-0-Linux-x86_64.sh
 NonCodedSurveyTopics.json
 r-gdal-container-tacc_0.1.sif
 r-gdal-container-tacc_dev.sif
 survey_codes.json
 survey_outputs.json
 SurveyTopicsNoPrompt.json
 SurveyTopicsWithPrompt.json
 TIFF
 Topics.json


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
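The warning above is triggered when a shell command such as `!ls` forks the process after the tokenizer has already run in parallel. As the message suggests, setting the environment variable explicitly avoids it. A minimal sketch; placing it at the top of the notebook, before the model is loaded, is an assumption about where it is most effective.

```python
import os

# Set before the tokenizer is first used so the library
# does not have to disable parallelism on its own after a fork.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```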


In [33]:
df[['All', 'Audience', 'Stage in Water Process',
       'Administrative process', 'Administrative Considerations',
       'Compliance', 'Strategic Planning', 'Funding', 'Worker Safety',
       'Health care', 'Water Quality ', 'Testing', 'Laboratory', 'Field',
       'At-home', 'Source Water Testing', 'Biological', 'Chemical',
       'Physical contaminents', 'Possible Contamination Sources',
       'Effluent at End of Treatment Process', 'Biological.1',
       'Chemical.1', 'Physical properties.1',
       'Water received by the end-user', 'Biological.2', 'Chemical.2',
       'Physical Properties.1', 'Treatment Process', 'Operating System',
       'Equipment Installation', 'Equipment Operations',
       'Equipment Maintenance', 'System Components', 'System Monitoring',
       'Design Considerations', 'Threats and Security',
       'Climate and Environment', 'Cybersecurity', 'System Breakdown',
       'Hazardous Materials', 'Technical Considerations',
       'Technical Background', 'Social Background',
       'Social Considerations', 'Certification',
       'Statistics on Certified People',
       'Design of Certification Process', 'Content Covered',
       'Application Process', 'Benefits of Certification',
       'Accessibility of Material', 'Request hard copy', 'In-Person',
       'Library', 'Online', 'Cost Considerations',
       'Continuing Education Requirements or Renewal',
       'Experience Requirements', 'Previous Education Considerations',
       'Study Tactics', 'Consequences of Poor Management',
       'Collaboration', 'Water Governance',
       'Federal Policies and Regulations',
       'State Policies and Regulations', 'Tribal Governance',
       'Community Outreach', 'Water System Stakeholders',
       'Environmental Attorneys Lawyers', 'End-Users.1', 'Developers',
       'Relevance', 'Relevance To environment', 'To Operator',
       'Positive implications', 'Negative implications', 'To End-User',
       'To Endvirnment', 'COVID-19', 'local', 'federal', 'state', ]]

Unnamed: 0,All,Audience,Stage in Water Process,Administrative process,Administrative Considerations,Compliance,Strategic Planning,Funding,Worker Safety,Health care,...,Relevance To environment,To Operator,Positive implications,Negative implications,To End-User,To Endvirnment,COVID-19,local,federal,state
"INTERVIEW 1 , 21-22 Combined",local,Operators-in-training,0,1,1,1,1,1,1,0,...,1,1,0,0,1,1,1,1,0,0
"INTERVIEW 10 , 21-22 Combined",local,Operators-in-training,1,1,0,1,1,1,1,0,...,1,1,0,0,0,1,0,1,0,0
"INTERVIEW 11 , 21-22 Combined",federal,Governmental or Tribal Officials,0,1,1,1,1,1,0,0,...,1,1,0,0,0,1,0,0,1,0
"INTERVIEW 12 , 21-22 Combined",federal,Operators-in-training,0,1,1,1,1,1,1,0,...,1,0,1,0,0,1,0,0,1,0
"INTERVIEW 13 , 21-22 Combined",local,Operators-in-training,0,1,0,0,1,1,1,0,...,1,1,0,0,1,1,1,1,0,0
"INTERVIEW 14 , 21-22 Combined",state,Operators-in-training,0,1,1,1,1,1,1,0,...,0,0,0,0,0,1,1,0,0,1
"INTERVIEW 15 , 21-22 Combined",federal,Operators-in-training,1,1,1,1,1,1,1,0,...,1,1,0,0,1,1,1,0,1,0
"INTERVIEW 16 , 21-22 Combined",state,End-Users,1,1,1,1,1,1,1,1,...,1,1,1,0,1,1,1,0,0,1
"INTERVIEW 17 , 21-22 Combined",state,Operators-in-training,1,1,1,1,1,1,1,1,...,1,1,0,0,1,1,1,0,0,1
"INTERVIEW 2 , 21-22 Combined",federal,Governmental or Tribal Officials,0,1,1,1,1,1,1,1,...,1,1,0,0,1,1,1,0,1,0
