## LlamaIndex and Llama 2 tutorial on Lonestar 6

Llama 2 is Meta's open source large language model (LLM). LlamaIndex is a Python library that connects your data to LLMs such as Llama 2. This lets you quickly use unstructured data as the basis for chats and other outputs.


In [1]:
import os
import logging
import sys
from IPython.display import Markdown, display

from llama_index.llms import HuggingFaceLLM
from llama_index.prompts import PromptTemplate, PromptType
from pathlib import Path
from llama_index import download_loader, KnowledgeGraphIndex,  Document, SummaryIndex
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

import re

In [2]:
import torch

## Set your working directory
Change your working directory to your Scratch location. This will improve performance and ensure you have access to the model you rsynced earlier.

In [3]:
scratch = ! echo $SCRATCH
os.chdir(scratch[0])
! pwd


/scratch/06659/wmobley
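The `! echo $SCRATCH` shell magic only works inside IPython or Jupyter. As a minimal sketch of the same step in plain Python, assuming the `SCRATCH` environment variable is set on the login and compute nodes, the path can be read from the environment instead. The helper name `resolve_scratch` and the fallback to the current directory are illustrative assumptions, not part of the original notebook.

```python
import os
from pathlib import Path


def resolve_scratch(env=os.environ, fallback="."):
    """Return the scratch directory as a Path.

    Reads the SCRATCH variable from `env` (hypothetical helper; the
    fallback value is an assumption for machines without SCRATCH set).
    """
    return Path(env.get("SCRATCH", fallback)).expanduser()


scratch_dir = resolve_scratch()
print(scratch_dir)
# To mirror the notebook cell, you could then call: os.chdir(scratch_dir)
```

If you adopt this variant, later cells that use `scratch[0]` would use `str(scratch_dir)` instead.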


## Access the model
Next we'll access the models. Four models are available: the base and chat variants at 7 billion and 13 billion parameters. The folder also contains the 70B-parameter models; however, we have not tested their performance on the LS6 dev machines.



In [4]:
# Model names (make sure you have access on HF)
LLAMA2_7B = f"{scratch[0]}/HF/noRef/Llama7b"
LLAMA2_7B_CHAT = f"{scratch[0]}/HF/noRef/Llama7bchat"
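# NOTE: this path is identical to LLAMA2_13B_CHAT; point it at the base 13B directory if you need the non-chat model.
LLAMA2_13B = f"{scratch[0]}/HF/noRef/Llama-2-13b-chat-hf"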
LLAMA2_13B_CHAT = f"{scratch[0]}/HF/noRef/Llama-2-13b-chat-hf"
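Before loading anything, it can help to confirm that each model directory actually exists. The following is a minimal sketch, assuming each directory is a Hugging Face checkpoint that contains a `config.json` file. The helper name `missing_model_dirs` and the temporary demo directories are hypothetical and stand in for the real paths above.

```python
import tempfile
from pathlib import Path


def missing_model_dirs(model_paths):
    """Return the names whose directory is absent or has no config.json."""
    missing = []
    for name, path in model_paths.items():
        if not (Path(path) / "config.json").is_file():
            missing.append(name)
    return missing


# Demo with throwaway directories (hypothetical stand-ins for the real models).
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good_model"
    good.mkdir()
    (good / "config.json").write_text("{}")
    demo_missing = missing_model_dirs(
        {"good": good, "absent": Path(tmp) / "absent_model"}
    )

print(demo_missing)
```

In the notebook, you could pass a dictionary such as `{"7b-chat": LLAMA2_7B_CHAT, "13b-chat": LLAMA2_13B_CHAT}` to the same function.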

## Select Model
For this script we will choose the Llama 2 7B chat model.

In [5]:
selected_model = LLAMA2_7B_CHAT

## System Prompt
We are going to have Llama 2 classify the topics discussed in each document, using only a fixed list of topics.

We set up the system prompt below.


In [6]:
SYSTEM_PROMPT="""You are social scientist researcher. You have been tasked with identifying the topics of research papers. 

Classify the topics of the documents using only the following topics:
            
## Topics
Federal, State, Regional, Local, Operators-in-training, System Operators, Community Members, Not Stage Specific, Source Water, Potable Water Treatment, Potable Water Distribution, Wastewater Collection, Wastewater Treatment, Water Storage, End-Users Storage System, Administrative Processes, Compliance, Strategic Planning, Funding, Worker Safety, Testing, Laboratory, Field, At-home, Effluent at the conclusion of the treatment process, Biological, Chemical, Physical properties, Water received by the end-user, Biological, Chemical, Physical properties, Treatment Process, Equipment Installation, Equipment Operations, Equipment Maintenance, System Components, System Monitoring, Climate and Environment, Cybersecurity, System Breakdown, Hazardous Materials, Background, Application Process, Benefits of Certification, Accessibility of Material, In-Person Training, Library, Online, Cost Considerations, Continuing Education Requirements, Experience Requirements, Previous Education Considerations, Study Tactics, Consequences of Poor Management, Collaboration, Water Governance , Federal policies, regulations , State policies, regulations , Tribal governance , Community Outreach, Water System Stakeholders, Environmental Attorneys, End-Users, Relevance , To environment, To operator, Positive implications, Negative implications, To end user, COVID-19
            
Here are some rules you always follow:
- Do not include numbers before the topics. 
- Generate only the requested output, don't include any other language before or after the requested output.
- Never say thank you, that you are happy to help, that you are an AI agent, etc. Just answer directly.
- Generate professional language typically used in business documents in North America.
- Never generate offensive or foul language.
"""

query_wrapper_prompt = PromptTemplate(
    "[INST]<<SYS>>\n" + SYSTEM_PROMPT + "<</SYS>>\n\n{query_str}[/INST] "
)
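To see what the model actually receives, the wrapper can be reproduced with plain string formatting. This is a sketch that mirrors the template string used in this notebook; the function name `build_llama2_prompt` and the sample strings are illustrative assumptions. Note that the spacing here follows the notebook's template, which may differ slightly from other published Llama 2 chat templates.

```python
def build_llama2_prompt(system_prompt: str, user_message: str) -> str:
    """Wrap a system prompt and a user message in the notebook's template."""
    return (
        "[INST]<<SYS>>\n"
        + system_prompt
        + "<</SYS>>\n\n"
        + user_message
        + "[/INST] "
    )


# Hypothetical short system prompt, used only for illustration.
demo_prompt = build_llama2_prompt(
    "Classify the topics of the documents.\n", "Provide a list of Topics"
)
print(demo_prompt)
```

Printing the assembled prompt is a quick way to check that the system block is closed with `<</SYS>>` before the user query begins.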

## Load the Model
Next we'll load the model. If it can't find the model it will download it. 

In [7]:

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=2048,
    generate_kwargs={"temperature": 0.0, "do_sample": False},
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name=selected_model,
    model_name=selected_model,
    device_map="auto",
    # change these settings below depending on your GPU
    model_kwargs={"torch_dtype": torch.float16, "load_in_8bit": True, "cache_dir":f"{scratch[0]}/HF/noRef"},
)
print(selected_model)

INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmpwkdh0eb6
Created a temporary directory at /tmp/tmpwkdh0eb6
INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmpwkdh0eb6/_remote_module_non_scriptable.py
Writing /tmp/tmpwkdh0eb6/_remote_module_non_scriptable.py


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

/scratch/06659/wmobley/HF/noRef/Llama7bchat
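The `torch_dtype` and `load_in_8bit` settings mainly determine how much GPU memory the weights need. The sketch below is a back-of-the-envelope estimate only: it counts weights alone, ignores activations, the KV cache, and any quantization overhead, and assumes nominal parameter counts of 7e9 and 13e9. The function name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights only, in GiB."""
    bytes_total = n_params * bits_per_param / 8
    return bytes_total / 1024**3


# Nominal parameter counts (assumption; actual counts differ slightly).
for name, n_params in [("7B", 7e9), ("13B", 13e9)]:
    fp16 = weight_memory_gib(n_params, 16)
    int8 = weight_memory_gib(n_params, 8)
    print(f"{name}: ~{fp16:.1f} GiB in float16, ~{int8:.1f} GiB in 8-bit")
```

Comparing these figures with the memory of your GPU gives a first indication of whether 8-bit loading is needed for a given model size.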


## Load the interview documents

In [8]:


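# Download the LlamaIndex CSV reader. Note: the cells below read the CSV
# directly with pandas, so this loader is not used later in the notebook.
PandasCSVReader = download_loader("PandasCSVReader")

loader = PandasCSVReader()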


In [9]:
from llama_index import VectorStoreIndex, ServiceContext, set_global_service_context


# Maps each interview key (interview name + set) to the list of topics returned by the model.
topics = {}

In [10]:
import pandas as pd
documents = pd.read_csv(f'{scratch[0]}/HF/Survey/all_public_interviews.csv')

In [11]:

documents

Unnamed: 0.1,Unnamed: 0,Set,Interview,Page,Block,Speaker,Text,Time_Mark
0,0,21-22 Combined,INTERVIEW 1,1,2,Interviewee 1,"A sewer project called the [REDACTED], I forge...",00:00
1,2,21-22 Combined,INTERVIEW 1,1,2,Interviewer 1,So like with the new subdivisions coming online?,02:56
2,1,21-22 Combined,INTERVIEW 1,1,2,Interviewee 1,"That's what I'm worried about. Yeah, so we're ...",02:59
3,3,21-22 Combined,INTERVIEW 1,1,2,Interviewer 1,I think the scope of what we're trying to look...,04:13
4,4,21-22 Combined,INTERVIEW 1,2,1,Interviewee 1,One challenge on the hauled side of it for us....,04:47
...,...,...,...,...,...,...,...,...
8968,4791,April 22,Interview 9,63,1,Interviewee,"are they in fact, well calling through for som...",134
8969,4792,April 22,Interview 9,63,1,Interviewee,off kinda thing.,135
8970,4794,April 22,Interview 9,63,1,Interviewer,Yeah. Okay.,136
8971,4793,April 22,Interview 9,63,1,Interviewee,"So, moving forward the next few months, that i...",137


In [12]:

# Build the service context once; it does not change between interviews.
service_context = ServiceContext.from_defaults(
    llm=llm, embed_model="local:BAAI/bge-small-en"
)
set_global_service_context(service_context)

# Build one index and run one query per (Interview, Set) group.
for key, group in documents.groupby(["Interview", "Set"]):
    # Wrap each utterance in a Document so it can be indexed.
    doc_chunks = [Document(text=text) for text in group.Text.tolist()]

    index = VectorStoreIndex.from_documents(doc_chunks)
    query_engine = index.as_query_engine()
    response = query_engine.query("Provide a list of Topics")

    # Keep the text after the first colon, drop digits, newlines and "()",
    # then split on "." to get one entry per numbered item.
    body = response.response.split(":")[1].replace("\n", "")
    topics["".join(key)] = re.sub(r"\d", "", body).replace("()", "").split(".")
    print(key)

Downloading model.safetensors:   0%|          | 0.00/133M [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

[nltk_data] Downloading package punkt to /tmp/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.


('INTERVIEW 1 ', '21-22 Combined')
('INTERVIEW 10 ', '21-22 Combined')
('INTERVIEW 11 ', '21-22 Combined')
('INTERVIEW 12 ', '21-22 Combined')
('INTERVIEW 13 ', '21-22 Combined')
('INTERVIEW 14 ', '21-22 Combined')
('INTERVIEW 15 ', '21-22 Combined')
('INTERVIEW 16 ', '21-22 Combined')
('INTERVIEW 17 ', '21-22 Combined')
('INTERVIEW 2 ', '21-22 Combined')
('INTERVIEW 3 ', '21-22 Combined')
('INTERVIEW 4 ', '21-22 Combined')
('INTERVIEW 5 ', '21-22 Combined')
('INTERVIEW 6 ', '21-22 Combined')
('INTERVIEW 7 ', '21-22 Combined')
('INTERVIEW 8 ', '21-22 Combined')
('INTERVIEW 9 ', '21-22 Combined')
('Interview 1 ', 'April 22')
('Interview 10 ', 'April 22')
('Interview 11 ', 'April 22')
('Interview 12 ', 'April 22')
('Interview 13 ', 'April 22')
('Interview 14 ', 'April 22')
('Interview 15 ', 'April 22')
('Interview 16 ', 'April 22')
('Interview 17 ', 'April 22')
('Interview 18 ', 'April 22')
('Interview 19 ', 'April 22')
('Interview 2 ', 'April 22')
('Interview 20 ', 'April 22')
('Intervi
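The group keys printed above carry a trailing space in the interview name (for example `'INTERVIEW 1 '`), which then ends up in the dictionary keys. As a dependency-free sketch of the grouping step, the function below collects the text per (Interview, Set) pair and trims the stray whitespace. The helper name `group_texts` and the sample rows are hypothetical; only the column names come from the notebook.

```python
from collections import defaultdict


def group_texts(rows):
    """Group utterance text by (interview, set), trimming whitespace in keys."""
    grouped = defaultdict(list)
    for row in rows:
        key = (row["Interview"].strip(), row["Set"].strip())
        grouped[key].append(row["Text"])
    return dict(grouped)


# Hypothetical rows that imitate the structure of the interview CSV.
rows = [
    {"Interview": "INTERVIEW 1 ", "Set": "21-22 Combined", "Text": "first"},
    {"Interview": "INTERVIEW 1 ", "Set": "21-22 Combined", "Text": "second"},
    {"Interview": "Interview 1 ", "Set": "April 22", "Text": "third"},
]
grouped = group_texts(rows)
for key, texts in grouped.items():
    print(key, len(texts))
```

With the real data, the rows could come from `documents.to_dict("records")`, assuming every `Interview`, `Set`, and `Text` value is a string.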

In [13]:
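import json

# Show the raw response for the last interview processed.
print(response.response)

# Save the topics for every interview to disk.
with open("CodedSurveyTopics.json", "w") as outfile:
    json.dump(topics, outfile)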

Based on the provided context information, the following are the topics that can be identified in the research papers:
1. Federal policies
2. State policies
3. Tribal governance
4. Community outreach
5. Water system stakeholders
6. Environmental attorneys
7. End-users
8. Positive implications
9. Negative implications
10. Operators-in-training
111. System operators
12. Regional
13. Local
14. Compliance
15. Strategic planning
16. Funding
17. Worker safety
18. Testing
19. Laboratory
20. Field
21. At-home
22. Effluent at the conclusion of the treatment process
23. Biological properties
24. Chemical properties
25. Physical properties
26. Water received by the end-user
27. Treatment process
28. Equipment installation
29. Equipment operations
30. Equipment maintenance
31. System components
32. System monitoring
33. Climate and environment
34. Cybersecurity
35. System breakdown
36. Hazardous materials
37. Background
38. Application process
39. Benefits of certification
40. Accessibility of mat
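Because the response is a numbered list, splitting on `:` and `.` is fragile: a colon or period inside a topic would break it. One possible alternative, sketched below under the assumption that each topic sits on its own line as `N. Topic`, is to parse line by line and then keep only entries that appear in the allowed topic list. The function names, the sample response, and the short allowed list are illustrative, not the notebook's own code.

```python
import re


def parse_numbered_topics(response_text):
    """Extract the text of each 'N. topic' line from a model response."""
    topics = []
    for line in response_text.splitlines():
        match = re.match(r"^\s*\d+[.)]\s*(.+?)\s*$", line)
        if match:
            topics.append(match.group(1))
    return topics


def keep_allowed(topics, allowed):
    """Keep only topics that match the allowed list, ignoring case."""
    allowed_lower = {topic.lower() for topic in allowed}
    return [topic for topic in topics if topic.lower() in allowed_lower]


# Hypothetical response text in the same shape as the output above.
sample = (
    "The following topics can be identified:\n"
    "1. Funding\n"
    "2. Worker safety\n"
    "3. Water management in general\n"
)
parsed = parse_numbered_topics(sample)
filtered = keep_allowed(parsed, ["Funding", "Worker Safety", "Compliance"])
print(parsed)
print(filtered)
```

Filtering against the allowed list is a simple guard for cases where the model rephrases or extends a topic instead of returning it verbatim.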

In [14]:
topics

{'INTERVIEW 1 21-22 Combined': ['',
  ' Federal policies and regulations',
  ' State policies and regulations',
  ' Tribal governance and water management',
  ' Community outreach and engagement',
  ' Water system stakeholders and their roles',
  ' Environmental attorneys and their involvement in water management',
  ' Water governance and its impact on water management',
  ' Operators-in-training and their role in water management',
  ' System operators and their responsibilities',
  ' Local and regional water management practices',
  ' Potable water treatment and distribution',
  ' Wastewater collection and treatment',
  ' Water storage and distribution',
  ' End-users and their storage systems',
  ' Administrative processes and their impact on water management',
  ' Compliance with regulations and standards',
  ' Strategic planning and its role in water management',
  ' Funding and financing for water management projects',
  ' Worker safety and its importance in water management',
 