## LlamaIndex and Llama 2 tutorial on Lonestar 6

Llama 2 is Meta's open-source large language model (LLM). LlamaIndex is a Python library that connects your data to LLMs such as Llama 2, so you can quickly use unstructured data, such as a folder of PDFs, as the basis for chats and other outputs.


In [1]:
import os
import logging
import sys
from IPython.display import Markdown, display

from llama_index.llms import HuggingFaceLLM
from llama_index.prompts import PromptTemplate, PromptType
from pathlib import Path
from llama_index import download_loader, KnowledgeGraphIndex
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

import re

In [2]:
import torch

## Set your working directory
Change your working directory to your Scratch location. This improves performance and ensures you have access to the model you rsynced earlier.

In [3]:
scratch = ! echo $SCRATCH
os.chdir(scratch[0])
! pwd


/scratch/06659/wmobley
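The cell above relies on IPython shell magics (`!`), which only work inside a notebook. As a minimal sketch, assuming the `SCRATCH` environment variable is set as it is on the system above, the same location can be resolved in plain Python. The helper name `resolve_scratch` and its fallback to the current directory are illustrative assumptions, not part of the original tutorial.

```python
import os
from pathlib import Path


def resolve_scratch(default: str = ".") -> Path:
    """Return $SCRATCH as an absolute Path, or `default` if it is unset.

    Hypothetical helper: the fallback exists only so the sketch also runs
    on machines that do not define SCRATCH.
    """
    return Path(os.environ.get("SCRATCH", default)).expanduser().resolve()


scratch_dir = resolve_scratch()
print(scratch_dir)
# In the notebook you would then call: os.chdir(scratch_dir)
```

This version also works in a batch script, where `!` commands are not available.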


## Access the model
Next we'll access the models. Four models are available: the base and chat variants of the 7-billion and 13-billion parameter models. The folder also contains the 70B parameter models; however, we have not tested their performance on the LS6 dev machines.



In [4]:
# Model names (make sure you have access on HF)
LLAMA2_7B = f"{scratch[0]}/HF/noRef/Llama7b"
LLAMA2_7B_CHAT = f"{scratch[0]}/HF/noRef/Llama7bchat"
LLAMA2_13B = "meta-llama/Llama-2-13b-hf"
LLAMA2_13B_CHAT = "meta-llama/Llama-2-13b-chat-hf"
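Two of the names above are local paths and two are Hugging Face Hub identifiers. The following is a rough sketch of how one might tell them apart before loading, assuming that a local model folder contains a `config.json` file (as Hugging Face model folders normally do). The helper `is_local_model_dir` and the temporary folder are hypothetical and used only for illustration.

```python
import tempfile
from pathlib import Path


def is_local_model_dir(path) -> bool:
    """True if `path` is an existing directory that contains config.json."""
    p = Path(path)
    return p.is_dir() and (p / "config.json").is_file()


# Demonstration with a throwaway folder standing in for a real model directory.
with tempfile.TemporaryDirectory() as tmp:
    fake_model = Path(tmp) / "Llama7bchat"
    fake_model.mkdir()
    (fake_model / "config.json").write_text("{}")
    local_ok = is_local_model_dir(fake_model)
    # A Hub identifier is not a directory on disk, so this is False
    # unless a folder with that relative path happens to exist.
    hub_id_is_local = is_local_model_dir("meta-llama/Llama-2-13b-chat-hf")

print(local_ok, hub_id_is_local)
```

A check like this can catch a mistyped Scratch path before the loader falls back to downloading the model.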

## Select Model
For this script we will choose the Llama 2 7B parameter chat model.

In [5]:
selected_model = LLAMA2_7B_CHAT

## System Prompt
We are going to have Llama 2 identify the topics of each PDF. The system prompt below sets that task and the rules the output must follow.


In [6]:
SYSTEM_PROMPT="""You are social scientist researcher. You have been tasked with identifying the topics of research papers. 

Classify the topics of the documents using only the following topics:
            
    
Here are some rules you always follow:
- Do not include numbers before the topics. 
- Generate only the requested output, don't include any other language before or after the requested output.
- Never say thank you, that you are happy to help, that you are an AI agent, etc. Just answer directly.
- Generate professional language typically used in business documents in North America.
- Never generate offensive or foul language.
"""

query_wrapper_prompt = PromptTemplate(
    "[INST]<<SYS>>\n" + SYSTEM_PROMPT + "<</SYS>>\n\n{query_str}[/INST] "
)
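To make the wrapper format easier to inspect, here is a minimal sketch that fills the same `[INST]<<SYS>> ... <</SYS>> ... [/INST]` layout using plain Python string formatting, with no LlamaIndex dependency. The short `SYSTEM` text and the `build_prompt` helper are hypothetical stand-ins. Only the surrounding markers and the `{query_str}` placeholder come from the cell above.

```python
# Hypothetical, shortened system prompt used only for this illustration.
SYSTEM = "You identify the topics of research papers.\n"

# Same layout as the wrapper above, with the system text as a second field.
LLAMA2_CHAT_TEMPLATE = "[INST]<<SYS>>\n{system}<</SYS>>\n\n{query_str}[/INST] "


def build_prompt(system: str, query_str: str) -> str:
    """Fill the Llama 2 chat wrapper with a system prompt and a user query."""
    return LLAMA2_CHAT_TEMPLATE.format(system=system, query_str=query_str)


prompt = build_prompt(SYSTEM, "Provide a list of Topics")
print(prompt)
```

One design point to keep in mind: the notebook concatenates `SYSTEM_PROMPT` directly into the template string. If the template is filled with Python-style `{}` formatting, as the `{query_str}` placeholder suggests, any literal braces in the system prompt would likely be read as placeholders, so they are best avoided.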

## Load the Model
Next we'll load the model. If the model is not found locally, it will be downloaded.

In [7]:

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=2048,
    generate_kwargs={"temperature": 0.0, "do_sample": False},
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name=selected_model,
    model_name=selected_model,
    device_map="auto",
    # change these settings below depending on your GPU
    model_kwargs={"torch_dtype": torch.float16, "load_in_8bit": True, "cache_dir":f"{scratch[0]}/HF/noRef"},
)
print(selected_model)

INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmpzuux3vun
Created a temporary directory at /tmp/tmpzuux3vun
INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmpzuux3vun/_remote_module_non_scriptable.py
Writing /tmp/tmpzuux3vun/_remote_module_non_scriptable.py


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

/scratch/06659/wmobley/HF/noRef/Llama7bchat
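The comment "change these settings below depending on your GPU" refers mainly to how much memory the weights need. As a back-of-the-envelope sketch, the weight footprint is roughly the parameter count times the bytes per parameter. The figures below assume nominal parameter counts (7e9 and 13e9) and count weights only. Activations, the KV cache, and quantization overhead are ignored, so real usage will be higher.

```python
def weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


# Nominal parameter counts (assumption); 2 bytes for float16, 1 byte for 8-bit.
for label, n_params in [("7B", 7e9), ("13B", 13e9)]:
    fp16 = weight_memory_gib(n_params, 2)
    int8 = weight_memory_gib(n_params, 1)
    print(f"{label}: float16 ~{fp16:.1f} GiB, 8-bit ~{int8:.1f} GiB")
```

This is why 8-bit loading is attractive on a shared development node: it roughly halves the weight footprint relative to float16.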


## Load the PDF documents 

In [8]:


PDFReader = download_loader("PDFReader")

loader = PDFReader()
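Before handing files to the loader, it can help to list exactly which PDFs will be processed. The sketch below uses only the standard library and matches the extension case-insensitively, so a file ending in `.PDF` is included as well. The `find_pdfs` helper and the temporary demo folder are hypothetical and not part of the original notebook.

```python
import tempfile
from pathlib import Path


def find_pdfs(root) -> list:
    """Return all PDF files under `root`, matching the suffix case-insensitively."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Demonstration on a throwaway folder with a mix of file types.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "sub").mkdir()
    (base / "a.pdf").write_text("x")
    (base / "sub" / "b.PDF").write_text("x")
    (base / "notes.txt").write_text("x")
    (base / ".amlignore").write_text("x")
    found_names = [p.name for p in find_pdfs(base)]

print(found_names)
```

Each returned `Path` could then be passed to `loader.load_data(file=...)`, as done later in the notebook.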


In [9]:
from llama_index import VectorStoreIndex, ServiceContext, set_global_service_context


# Maps each PDF file name to the list of topics extracted from it.
topics = {}

In [10]:

# Walk the GrayLit folder and ask the model for the topics of each PDF.
for root, dirs, files in os.walk(f'{scratch[0]}/HF/GrayLit', topdown=False):
    for i, name in enumerate(files):
        print(i, name)
        if name.endswith('pdf'):
            # Pair the Llama 2 LLM with a local embedding model for retrieval.
            service_context = ServiceContext.from_defaults(
                llm=llm, embed_model="local:BAAI/bge-small-en"
            )
            set_global_service_context(service_context)
            # Load the PDF, index it, and query the index.
            documents = loader.load_data(file=Path(os.path.join(root, name)))
            index = VectorStoreIndex.from_documents(documents)
            query_engine = index.as_query_engine()
            response = query_engine.query("Provide a list of Topics")
            if response.response == 'Empty Response':
                print(name)
                continue
            # Keep the text between the first and second colon, then strip digits,
            # newlines, and empty parentheses before splitting the topics on periods.
            response_dict = {name: re.sub(r'\d', "", response.response.split(":")[1].replace('\n', "")).replace('()', "").split(".")}
            topics.update(response_dict)

            print(response.response)

0 .amlignore
1 YRITWC_nd_Aquatic Buffer Ordinance.pdf


Downloading (…)lve/main/config.json:   0%|          | 0.00/684 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/133M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/366 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

[nltk_data] Downloading package punkt to /tmp/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Based on the provided context, the following are the topics of the research papers:
1. Landscape planting
2. Fencing
3. Signs
4. Water quality best management practices
5. Conflict with other regulations
6. Timber harvesting
7. Land disturbance activities
8. Environmental protective measures
9. Buffer management and maintenance
10. Native vegetation planting

Please note that some of these topics may overlap or be related to each other, but they are distinct enough to be listed separately.
2 Bethel City Council_2020_City's Water & Sewer Services Expenditures Compared to Budget from July 1, 2020 to April 30, 2021.pdf
Based on the provided documents, the following are the topics that can be identified:
1. Vehicle maintenance and repair
2. Gasoline, diesel, and oil purchases
3. Electricity expenses
4. Telephone expenses
5. Heating fuel purchases
6. Water, sewer, and garbage expenses
7. Staff cellular phone expenses
8. Lab tests
9. Property maintenance expenses
10. Other purchased services

Based on the provided context information, the following are the topics of the research papers:
1. Capacity Development
2. Drinking Water Systems
3. Small Drinking Water Systems
4. Financial and Operational Challenges
5. Providing Safe Drinking Water
6. EPA's Role in Assisting Small Systems
7. State and Tribal Partnerships
8. Technical and Financial Resources
9. Sustainable Drinking Water Provision
10. History of Capacity Development Program

Please note that these topics are based on the information provided in the context information and may not be exhaustive or accurate if there is additional information available.
17 AWWA_nd_WSO-WPP-Get the Job.pdf
Based on the provided context information, the following are the topics of the research papers:
1. Ideal crop marks
2. GET THE JOB
3. Putting Your Best Foot Forward on Paper and in Person
4. TAP INTO YOUR FUTURE
5. IT’S UP TO YOU: YOU ARE THE ONLY ONE WHO KNOWS IF
6. SENDING A FOLLOW-UP EMAIL OR HAND-WRITTEN THANK YOU CARD

Note: These a

Based on the provided context information, the following are the topics of the research papers:
1. Operator Certification Program
2. State Fiscal Year 2021

Please note that these are the topics that can be identified based on the provided information, and there may be other topics discussed in the research papers that are not listed here.
26 ABC_2018_Guide WasteTreat ClassI.pdf
Certainly! Based on the provided context information, the following are the topics of the research papers:
1. Wastewater Treatment Operator Class I
2. Need-to-Know Criteria
3. Association of Boards of Certification
4. ProfessionalOperator.org
5. ABC Wastewater Treatment Operator Class I Certification Exam

I hope this helps! Let me know if you have any further questions.
27 DCRA_2018_RUBA Training Courses.pdf
DCRA_2018_RUBA Training Courses.pdf
28 AFE_2022_Environmental Technician Apprenticeship Information.pdf
Certainly! Based on the provided context information, the following are the topics of the research pa

Certainly! Based on the provided context information, the following are the topics of the research papers:
1. Workforce Development
2. Job Skills Training
3. Environmental Worker Training
4. Apprenticeship Training
5. Rural Alaska Landfill Operator Training
6. Request a Training

I hope this helps! Let me know if you have any further questions.
43 AK DEC_2022_How to Get Certified as a Water Operator.pdf
Certainly! Based on the provided documents, the following are the topics that can be identified:
1. Water Operator Certification
2. Exams and Certification
3. Eligibility Requirements for Certification
4. Application Process for Certification
5. Water Distribution
6. Water Treatment
7. Wastewater Treatment
8. Wastewater Collection

These are the topics that can be identified from the provided documents. Please let me know if you need further assistance.
44 AK DEC_2002_Operator's Guide for Small Treated Public Water Systems in Alaska.pdf
Based on the provided document, the following are 

Based on the provided context information, the following are the topics of the research papers:
1. Wastewater Collection Operator Class I
2. Need-to-Know Criteria
3. Emergency Plans
4. Safety Audits
5. Employment Policies and Procedures
6. Replacement Equipment Purchase
7. Employee Reports regarding Customer Backup in Home
8. Training of New Operators
9. Review of Easement and Right-of-Way Issues/Problems.
56 USDA_2018_LOC-Avenues Project.PDF
57 EPA_2017_Water Systems Partnerships Meeting Summary.pdf
Based on the provided document, the following are the topics that can be identified:
1. Water systems partnerships
2. Collaborative approaches to address drinking water issues
3. Building trust and relationships
4. Looking for low-hanging fruit
5. Mechanisms to trigger meaningful consideration
6. Facilitation and planning
7. Training
8. Leadership and incentives
9. Enabling environment
10. Governance and funding
111. Operator and engineering communities
12. Outreach and cultural trust.
58 

Certainly! Based on the provided context information, the following are the topics of the research papers:
1. Wastewater Collection Operator Class IV
2. Need-to-Know Criteria
3. ABC Wastewater Collection Operator Class IV Certification Exam 2017
4. Association of Boards of Certification
5. ProfessionalOperator.org

I hope this helps! Let me know if you have any further questions.
3 AK DEC_2021_RMW Report-Yukon-Kuskokwin Health Corporation.pdf
Based on the provided context information, the following are the topics of the research papers:
1. Pilot Station Wastewater Collection System - Class 1
2. Pilot Station Wastewater Treatment System - Class SP
3. Pilot Station Water Distribution System - Class 2
4. Pilot Station Water Treatment System - Class 1
5. Operator Certification
6. Certificate Type
7. Expiration Date
8. DW Core
9. CEUs
10. CEUs
111. Noncore CEUs
122. Total CEUs

Note: The numbers in brackets represent the number of times each topic appears in the provided context information

Based on the provided context information, the following are the topics of the research papers:
1. Asset Management
2. Policy Statements
3. Asset Management
4. Committees on Teams
5. Advertise with Us
6. Jobs
7. Follow Us
8. Work for AWWA
9. Contact AWWA

These are the topics that can be identified from the provided context information.
17 AK DEC_2021_Certified Water-Wastewater Operator Database.pdf
Based on the provided context, the following are the topics of the research papers listed in the document:
1. Water treatment
2. Wastewater management
3. Operator certification
4. Water quality management
5. Water supply and distribution

6. Wastewater disposal and treatment

7. Water resources management

8. Environmental impact of water and wastewater management

9. Public health and safety

10. Water and wastewater infrastructure


Based on the provided context, the following are the topics that can be identified in the documents:
1. Environmental compliance and enforcement
2. COVID-19 pandemic and its impact on environmental compliance
3. Drinking water safety and regulations
4. Public water systems and their responsibilities
5. Laboratory testing and analysis for water systems
6. Enforcement discretion and prioritization of compliance monitoring
7. Critical infrastructure and short-term No Action Assurance
8. Routine compliance monitoring and reporting
9. Environmental regulations and statutes
10. Training and certification requirements
111. Reporting and documentation of noncompliance

Please note that these topics are based on the information provided in the context and may not be exhaustive or accurate without additional information.
28 AK DEC_2020_Small Untreated-Treated Water Systems Exam Registration Form.pdf
Based on the provided context, the following are the topics that can be identified in the researc

Based on the provided context information, the following are the topics of the research papers:
1. Alkalinity
2. Area of Circle
3. Area of Cone
4. Area of Cylinder
5. Area of Rectangle
6. Average (arithmetic mean)
7. Average (geometric mean)
8. Biochemical Oxygen Demand (seeded)
9. Biochemical Oxygen Demand (unseeded)
10. Chemical Feed Pump Setting
111. Chemical Feed Pump Setting
12. Flow Maximum Flow Desired
13. Chemical, Density, Chemical (Feedgal/MG)
14. Chemical, Density, Chemical (Feed mg/L)
15. Chemical, Density, Chemical (Active g/cm)
16. Chemical, Density, Chemical (Feed mg/L)
17. Chemical, Density, Chemical (Dosem (Flow, 40 mg/mL) (1,4 Density, Chemical (Feedgal/MG) 00,000 L/gal) (1,0 5 mg/L) (3.78 MGD) (Dose, (Flow, 40 mg/mL) (1,4 Density, Chemical (Feedgal/MG) 00,000 L/gal) (1,0 5 mg/L) (3.78 MGD) (Dose, (Flow, 40 mg/mL) (1,4 Density, Chemical (Feedgal/MG) 00,000 L/gal) (1,0 5 mg/L) (3.78 MGD) (Dose, (Flow, 40 mg/mL) (1,4 Density, Chemical (Feedgal/MG) 00,000 L/gal) (1,0 5 m

Based on the provided context and rules, the topics of the research papers can be classified as follows:
Topic 1: Education
Topic 2: Mind Training
Topic 3: Learning vs. Facts

Please note that these are the only topics that can be identified based on the provided information, and there may be other topics present in the documents as well.
48 ABC_2017_Guide Collect ClassIII.pdf
Based on the provided context information, the following are the topics that can be identified in the two documents:
Topic 1: Wastewater Collection Operator Class III
Topic 2: Need-to-Know Criteria for the ABC Wastewater Collection Operator Class III Certification Exam 2017
Topic 3: Pre-Test Questions
Topic 4: Exam Preparation Resources
Topic 5: Copyright Notice

Note: These topics are based on the information provided in the documents and do not include any additional information that may be present but not relevant to the query.
49 ABC_2019_Certification through Exam.pdf
Based on the provided document, the foll

Based on the provided documents, the following are the topics that can be identified:
1. Exam Registration
2. Water Operator Certification
3. Certification Exams
4. myAlaska Account
5. Credit Card Payment
6. Exam Fee
7. PDF Download
8. Internet Access
9. Download Issues

These are the topics that can be identified based on the provided documents.
63 AWWA_2016_National Survey of Commercial-Industrial-Institutional Water Efficiency Programs.pdf
Based on the information provided, the following are the topics related to the research paper:
1. Water efficiency programs
2. Commercial-industrial-institutional (CII) water efficiency programs
3. Information sources for improving CII programs
4. Utility profiles
5. Services provided by responding utilities
6. Online or paper submission forms for CII rebate/incentive applications
7. Coordination with local energy utilities on CII programs and incentives
8. Consideration of conserved water as one of water supply alternatives

These are the main to

Based on the provided context information, the following are the topics of the research papers:
1. Water Treatment
2. Water Distribution
3. Certification Exam Preparation
4. Practice Questions and Answers
5. Wastewater Certification Exam
6. Wastewater Operator Certification Resource

Note: These topics are based on the information provided in the context and do not include any additional information that may be present in the documents.
76 ABC_2018_Guide WasteTreat ClassII.pdf
Certainly! Based on the provided context information, the following are the topics of the research papers:
1. Wastewater Treatment Operator Class II
2. Need-to-Know Criteria
3. Association of Boards of Certification
4. Phone Number
5. Website URLs
6. Email Addresses

I hope this helps! Let me know if you have any further queries.
77 EPA_2012_EUM and Lean Resource Guide.pdf
Certainly! Based on the provided context information, the following are the topics of the research papers:
1. Resource Guide to Effective Util

Based on the provided context information, the following are the topics of the research papers:
1. Water and Wastewater Works Advisory Board Policy Statement 2009-08
2. Construction industry experience and certification for water distribution and wastewater collection systems in Alaska.
3. Recognition of construction experience towards water distribution and wastewater collection level 1 certification.
4. Limitations on substituting construction experience for other types of experience under 18 AAC 74.050(b)(2) and (c).
5. Background and discussion on the recognition of construction experience in Alaska's water and wastewater industry.
6. Policy statement on construction industry experience and certification for water distribution and wastewater collection systems in Alaska.
92 ABC_2018_Guide Treat ClassIII.pdf
Based on the provided context information, the following are the topics of the research papers:
1. Water Treatment Operator Class III
2. Need-to-Know Criteria
3. Association of 

Based on the provided documents, the following are the topics that can be identified:
1. Water treatment systems
2. Point rating system for water treatment classifications
3. Components of water treatment systems and their classification points
4. Water supply source
5. Pretreatment systems
6. Adjustment and corrosion control
7. Activated carbon systems
8. Oxidation systems
9. Filtration systems
10. Disinfection systems
111. Storage systems
12. Operator certification and training regulations
13. Water and wastewater operator certification and training
14. Exam and application status
15. Operator search and system search
16. Library and fee payment
17. Resident and business in Alaska

Please note that some of these topics may overlap or be related to each other, but they can be identified as distinct topics based on the information provided in the documents.
102 AK DEC_nd_Bethel Wastewater Collection System.pdf
Based on the provided context, the following are the topics that can be iden

Certainly! Based on the provided context information, the following are the topics of the research papers:
1. Wastewater Treatment Operator Class III
2. Need-to-Know Criteria
3. Association of Boards of Certification
4. Phone Number
5. Website URLs
6. Email Addresses

I hope this helps! Let me know if you have any further queries.
115 EPA_2021_Water Utility Response Mobile Application.pdf
Based on the provided context information, the following are the topics of the research papers:
1. Water Utility Response
2. Mobile Application
3. US EP A
4. White House
5. Contact EPA
6. EPA Disclaimers
7. Hotlines
8. FOIA Requests
9. Frequent Questions

Note: These are the topics that can be identified based on the provided information, without any prior knowledge or assumptions.
116 AK DEC_2022_Remote Maintenance Worker Directory.pdf
Based on the provided documents, the following are the topics that can be identified:
1. Remote Maintenance Worker Directory
2. Nome, AK 99762
3. Fairbanks, AK 99701
4

Based on the provided context information, the following are the topics of the research papers:
1. Protecting the Water Sector's Critical Infrastructure Information
2. Water Sector's Critical Infrastructure
126 AK DEC_2022_COVID Caused Non-Compliance Concerns Assurance Memorandum.pdf
Based on the provided documents, the following are the topics that can be identified:
1. Water quality requirements
2. Permit compliance
3. Non-compliance reporting
4. COVID-19 pandemic and its impact on environmental compliance
5. Enforcement discretion and exemptions
6. Guidance and criteria for enforcement flexibility
7. Communication and coordination with the Environmental Protection Agency (EPA)
8. Compliance and enforcement in the context of COVID-19
9. No action assurance memorandum and its purpose
10. Documentation and reporting of non-compliance events.
127 AK DEC_2022_Training Coalition Calendar.pdf
Based on the provided documents, the following are the topics that can be identified:
1. Water ope
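The responses above are numbered lists wrapped in introductory and closing sentences, and some use a `Topic N:` prefix instead of `N.`. The parsing in the loop keeps only the text between the first two colons, removes every digit, and splits on periods, so a topic such as "COVID-19 pandemic" loses its number and "ProfessionalOperator.org" is split in two. As an alternative sketch, not the notebook's method, each line can be matched against a list-item pattern. The `parse_topics` helper and the sample text, which is assembled from lines in the output above, are for illustration only.

```python
import re

# Matches "1. Topic", "111. Topic", "3) Topic", or "Topic 2: Topic".
_ITEM = re.compile(r"^\s*(?:Topic\s+)?\d+[.:)]\s*(.+?)\s*$")


def parse_topics(text: str) -> list:
    """Extract list items from a numbered-list response, one per line."""
    topics = []
    for line in text.splitlines():
        match = _ITEM.match(line)
        if match:
            # Drop a trailing period but keep digits and dots inside the topic.
            topics.append(match.group(1).rstrip("."))
    return topics


sample = (
    "Based on the provided context, the following are the topics:\n"
    "1. COVID-19 pandemic and its impact on environmental compliance\n"
    "2. ProfessionalOperator.org\n"
    "111. Storage systems\n"
    "Topic 1: Education\n"
    "12. Outreach and cultural trust.\n"
    "\n"
    "Please note that these topics may overlap.\n"
)
parsed = parse_topics(sample)
print(parsed)
```

Because the pattern only looks at lines that begin with a number, the introductory sentence and the closing note are skipped without any special handling.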

In [14]:
print(response.response)
import json 
with open("Topics.json", "w") as outfile: 
    json.dump(topics, outfile)

Based on the provided context information, the topics found in the article are:
1. Ideal crop marks
2. Win the day
3. Making the most of your workday
4. American Water Works Association (AWWA)
5. WaterWisePro Training
6. Water system operations (WSO)
7. Certification resources
8. Exam prep courses
9. Distribution and treatment
10. Future of water
11. Operators
12. Water needs
13. @americanwaterworksassociation
14. @waterwisepro
15. #WSOWWP
16. #NOOPERATORNOWATER


In [17]:
len(topics)

195
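With one topic list per document, a natural next step is to see which topics recur across files. The sketch below counts topics with `collections.Counter` after normalizing whitespace and case. The `topic_counts` helper and the two-file example dictionary are hypothetical. In practice the dictionary would presumably be loaded from `Topics.json` with `json.load`.

```python
from collections import Counter


def topic_counts(topics_by_file: dict) -> Counter:
    """Count how often each topic appears across all files (case-insensitive)."""
    counter = Counter()
    for topic_list in topics_by_file.values():
        # Skip empty strings left over from splitting the response text.
        counter.update(t.strip().lower() for t in topic_list if t.strip())
    return counter


# Hypothetical stand-in for the `topics` dictionary built above.
example = {
    "a.pdf": ["Need-to-Know Criteria", " Water Treatment "],
    "b.pdf": ["need-to-know criteria", ""],
}
counts = topic_counts(example)
print(counts.most_common(3))
```

Lowercasing is a deliberately crude normalization. Near-duplicates such as "Water Treatment" and "Water treatment systems" would still be counted separately.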

In [12]:
response = query_engine.query("What are the topics found in this article?")
display(Markdown(f"<b>{response}</b>"))

<b>Based on the provided context information, the topics found in the article are:
1. Ideal crop marks
2. Win the day
3. Making the most of your workday
4. American Water Works Association (AWWA)
5. WaterWisePro Training
6. Water system operations (WSO)
7. Certification resources
8. Exam prep courses
9. Distribution and treatment
10. Future of water
11. Operators
12. Water needs
13. @americanwaterworksassociation
14. @waterwisepro
15. #WSOWWP
16. #NOOPERATORNOWATER</b>

In [13]:
## Document By Document Topic Analysis
# Unsupervised
## Semi-Supervised