## LlamaIndex and Llama 2 tutorial on Lonestar 6

Llama 2 is Meta's open source large language model (LLM). LlamaIndex is a Python library that connects your data to LLMs such as Llama 2, so you can quickly use unstructured data as the basis for chats and other outputs.


In [1]:
import os
import logging
import sys
from IPython.display import Markdown, display

from llama_index.llms import HuggingFaceLLM
from llama_index.prompts import PromptTemplate, PromptType
from pathlib import Path
from llama_index import download_loader, KnowledgeGraphIndex
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

import re

In [2]:
import torch

## Set your working directory
Change your working directory to your Scratch location. This improves performance and ensures you have access to the model you rsynced earlier.

In [3]:
scratch = ! echo $SCRATCH
os.chdir(scratch[0])
! pwd


/scratch/06659/wmobley
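
The `! echo $SCRATCH` shell magic only works inside IPython. As a minimal sketch, the same lookup in a plain Python script could read the environment variable directly. The helper name `resolve_scratch` and the fallback to the current directory are assumptions for illustration, not part of the original notebook.

```python
import os


def resolve_scratch(env=None, fallback="."):
    """Return the scratch directory from $SCRATCH, or `fallback` if it is unset or empty."""
    env = os.environ if env is None else env
    return env.get("SCRATCH") or fallback


scratch_dir = resolve_scratch()
print(scratch_dir)
```

Calling `os.chdir(scratch_dir)` afterwards would mirror what the cell above does.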


## Access the model
Next we'll access the models. Four models are available: the 7-billion and 13-billion parameter models, each in a base and a chat version. The folder also contains the 70B parameter models; however, we have not tested their performance on the LS6 dev machines.



In [4]:
# Model names (make sure you have access on HF)
LLAMA2_7B = f"{scratch[0]}/HF/noRef/Llama7b"
LLAMA2_7B_CHAT = f"{scratch[0]}/HF/noRef/Llama7bchat"
LLAMA2_13B = f"{scratch[0]}/HF/noRef/Llama-2-13b-hf"
LLAMA2_13B_CHAT = f"{scratch[0]}/HF/noRef/Llama-2-13b-chat-hf"
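
Before loading anything, it can help to confirm that each model folder is actually present on Scratch. The sketch below assumes each folder is a Hugging Face checkpoint directory containing a `config.json` file; the helper `missing_models` and the temporary demo folders are hypothetical and only stand in for the real paths above.

```python
import tempfile
from pathlib import Path


def missing_models(model_dirs):
    """Return the names whose directory does not contain a config.json file."""
    return [
        name
        for name, path in model_dirs.items()
        if not (Path(path) / "config.json").is_file()
    ]


# Demo with throwaway folders; replace `demo` with the real paths in practice.
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "Llama-2-13b-chat-hf"
    present.mkdir()
    (present / "config.json").write_text("{}")
    demo = {
        "LLAMA2_13B_CHAT": str(present),
        "LLAMA2_13B": str(Path(tmp) / "Llama-2-13b-hf"),  # never created
    }
    missing = missing_models(demo)

print(missing)
```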

## Select Model
For this script we will choose the Llama 2 13B parameter chat model.

In [5]:
selected_model = LLAMA2_13B_CHAT

## System Prompt
We are going to have Llama 2 create triplets used in a knowledge graph based on a PDF.

We set up the system prompt below. 


In [6]:
SYSTEM_PROMPT="""You are a social scientist researcher. You have been tasked with answering questions about aspects of research papers.

           
Here are some rules you always follow:
- If applicable, answer with a yes or no at the beginning of the response. 
- Answer the question succinctly.
- Do not include numbers before the topics. 
- Generate only the requested output, don't include any other language before or after the requested output.
- Never say thank you, that you are happy to help, that you are an AI agent, etc. Just answer directly.
- Generate professional language typically used in business documents in North America.
- Never generate offensive or foul language.
"""

query_wrapper_prompt = PromptTemplate(
    "[INST]<<SYS>>\n" + SYSTEM_PROMPT + "<</SYS>>\n\n{query_str}[/INST] "
)
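
To see what the model actually receives, the sketch below rebuilds the same string with plain Python concatenation. It mirrors the template text above but is not LlamaIndex's internal code; `build_llama2_prompt` and the short demo system prompt are assumptions for illustration.

```python
def build_llama2_prompt(system_prompt, query_str):
    """Wrap a system prompt and a user query in the Llama 2 chat markers used above."""
    return "[INST]<<SYS>>\n" + system_prompt + "<</SYS>>\n\n" + query_str + "[/INST] "


demo_prompt = build_llama2_prompt(
    "You answer with yes or no first.\n",  # stand-in for SYSTEM_PROMPT
    "Do the documents address a utility's storage system?",
)
print(demo_prompt)
```

Concatenation is used instead of `str.format` so that braces inside a system prompt cannot be mistaken for placeholders.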

## Load the Model
Next we'll load the model. If the model cannot be found, it will be downloaded.

In [7]:

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=2048,
    generate_kwargs={"temperature": 0.0, "do_sample": False},
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name=selected_model,
    model_name=selected_model,
    device_map="balanced",
    # change these settings below depending on your GPU
    model_kwargs={"torch_dtype": torch.float16, "load_in_8bit": False, "cache_dir":f"{scratch[0]}/HF/noRef"},
)
print(selected_model)

INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmp0sdq04kn
Created a temporary directory at /tmp/tmp0sdq04kn
INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmp0sdq04kn/_remote_module_non_scriptable.py
Writing /tmp/tmp0sdq04kn/_remote_module_non_scriptable.py
INFO:accelerate.utils.modeling:We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

/scratch/06659/wmobley/HF/noRef/Llama-2-13b-chat-hf
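
The `torch_dtype` and `load_in_8bit` entries in `model_kwargs` decide how many bytes each weight occupies, which is the main factor in whether a model fits on a GPU. The back-of-the-envelope sketch below uses the nominal parameter counts (7, 13 and 70 billion) and counts weights only; activations and the key/value cache need additional memory, so treat the numbers as lower bounds.

```python
def weight_memory_gib(n_params, bytes_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


for label, n_params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    fp16 = weight_memory_gib(n_params, 2)  # torch.float16: 2 bytes per weight
    int8 = weight_memory_gib(n_params, 1)  # 8-bit loading: 1 byte per weight
    print(f"{label}: float16 ~{fp16:.0f} GiB, 8-bit ~{int8:.0f} GiB")
```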


## Load the PDF documents 

In [8]:


PDFReader = download_loader("PDFReader")

loader = PDFReader()
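
The processing loop later in this notebook picks files with `name.endswith('pdf')`, which skips files whose extension is written as `.PDF`. As a sketch of one alternative, the helper below collects PDFs with `pathlib` and compares the suffix case-insensitively. `find_pdfs` and the temporary demo files are hypothetical.

```python
import tempfile
from pathlib import Path


def find_pdfs(folder):
    """Return every file under `folder` whose suffix is .pdf, ignoring case."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Demo on throwaway files; point find_pdfs at the real document folder in practice.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for rel in ["a.pdf", "sub/B.PDF", "notes.txt"]:
        (root / rel).write_text("placeholder")
    names = [p.name for p in find_pdfs(root)]

print(names)
```

Each returned path could then be passed to `loader.load_data(...)` in the same way the loop does.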


In [9]:
from llama_index import VectorStoreIndex, ServiceContext, set_global_service_context
from llama_index.query_engine import CitationQueryEngine


# Better Explain Each of these steps. 
topics = {}

In [10]:

def query(text):
    
    response = query_engine.query(text)
#     display(Markdown(f"<b>{response}</b>"))
#     print(len(response.source_nodes))
#     display(response.source_nodes[0].node.metadata)
#     display(Markdown(response.source_nodes[0].node.get_text()))
    return response

The model struggles when it is asked true-or-false questions, so the queries below are phrased as yes/no questions.

In [11]:
queries = {
    # "Author": "Who is the author?",
    # "Federal": "are these documents written by a federal agency",
    # "State": "are these documents written by a state agency",
    # "Local": "are these documents written by a local agency",
    "All": "What type of agency likely wrote the documents: federal, state, local, or regional?",
    "Audience": "Of the following audience types: Operators-in-training; Operators; Governmental or Tribal Officials; or End-Users; what type of audience are the documents written for?",
    "Technical Considerations": "Do the documents provide information which helps inform, train or educate the user on the technical aspects of water systems?",
    "Technical Background": "Do the documents provide information which helps inform, train or educate the user on what the technical/engineering-based/physics reason is?",
    "Stage in Water Process": "Do the documents provide information on the stage of water treatment?",
    "Not Stage Specific": "Does the document identify or make specific which stage of water treatment the document is referring to?",
    "Source Water": "Do the documents address where the water is initially coming from?",
    "Potable Water Treatment": "Do the documents address the process of treating potable water and making it ready for consumption?",
    "Potable Water Distribution": "Do the documents address the process of getting potable water from the utility to the end-user?",
    "Wastewater Collection": "Do the documents address the process of aggregating grey and black water?",
    "Wastewater Treatment": "Do the documents address the process of treating water and making it ready for release?",
    "Water Storage": "Do the documents address a utility’s storage system?",
    "Small Systems": "Do the documents address small systems that serve fewer than 10,000 people?",
    "End-Users Storage System": "Do the documents address an end user’s personal storage system?",

    "Administrative Considerations": "Do the documents focus on actions and procedures related to record-keeping? Examples: billing, human resources, time keeping, overhead tracking, company management, etc.",
    "Administrative Processes": "Do the documents focus on actions and processes related to administering water systems?",
    "Compliance": "Do the documents focus on tasks related to water quality regulations and permitting requirements from the water operator’s side? Examples: compliance paperwork, filing compliance forms, tracking deliverables, interacting with inspectors.",
    "Strategic Planning": "Do the documents focus on practices pertaining to future planning, including best management practices, gold standards, and strategic planning? Examples: budgeting for future expansions, hiring specialists.",
    "Funding": "Do the documents focus on discussion of financing, spending, and other cost considerations? Examples: funding from government entities for system operations.",
    "Worker Safety": "Do the documents focus on procedures to ensure the safety of workers and the security of the water system? Examples: PPE requirements, staffing requirements, worker health.",

    "Health care": "Do the documents provide information on health care considerations?",

    "Water Quality ": "Do the documents focus on considerations of water contaminants and cleanliness?",
    "Testing": "Do the documents focus on processes for testing potable water quality?",
    "Laboratory": "Do the documents focus on formal water quality testing in a laboratory, i.e. chemical analysis conducted by a specialist?",
    "Field": "Do the documents focus on testing conducted during routine checks and maintenance, more back of the envelope, conducted by an operator?",
    "At-home": "Do the documents focus on testing conducted by the end-user, not mandatory?",
    "Possible Contaminants": "Do the documents identify possible water system contamination sources?",
    "Source Water Testing": "Do the documents focus on processes for testing source water quality?",
    "Biological": "Do the documents focus on biological testing of the source water? E.g.: viruses, bacteria, zoological contaminants, organic compounds.",
    "Chemical": "Do the documents focus on chemical testing of the source water? E.g.: nitrous oxide, rust.",
    "Physical contaminents": "Do the documents focus on testing for physical properties of the source water? E.g.: turbidity, pH, temperature.",

    "Effluent at End of Treatment Process": "Do the documents focus on considerations of the quality of treated effluent water leaving the water treatment plant and entering the distribution portion of the system?",
    "Biological.1": "Do the documents focus on biological testing during the treatment process? E.g.: viruses, bacteria, zoological contaminants, organic compounds.",
    "Chemical.1": "Do the documents focus on chemical testing during the treatment process? E.g.: nitrous oxide, rust.",
    "Physical properties.1": "Do the documents focus on testing for physical properties during the treatment process? E.g.: turbidity, pH, temperature.",

    "Water received by the end-user": "Do the documents focus on water received by the end user, including considerations of the quality of water received at the tap for the end-users?",
    "Biological.2": "Do the documents focus on the biological conditions of water received by the end user? E.g.: viruses, bacteria, zoological contaminants, organic compounds.",
    "Chemical.2": "Do the documents focus on the chemical conditions of the water received by the end user? E.g.: nitrous oxide, rust.",
    "Physical Properties.1": "Do the documents focus on the physical properties of the water received by the end user? E.g.: turbidity, pH, temperature.",

    "Treatment Process": "Do the documents focus on the process of water treatment, including flocculation, tanks, equipment?",
    "Operating System": "Do the documents focus on considerations of the water system equipment?",
    "Equipment Installation": "Do the documents focus on instructions for creating new water systems?",
    "Equipment Operations": "Do the documents focus on procedures for operating the water treatment and distribution infrastructure system?",
    "Equipment Maintenance": "Do the documents focus on the procedures for maintaining the integrity of the water treatment system, including routine maintenance and repairs?",
    "Design Considerations": "Do the documents focus on the design process for a water treatment system, including routine maintenance and repairs?",
    "System Components": "Do the documents address the breakdown of each material or piece of a system?",
    "System Monitoring": "Do the documents tell us whether maintenance is needed, i.e. address system integrity? Examples: measuring processes occurring in the water utility (using devices), humidity sensors, water level measurements.",

    "Threats and Security": "Do the documents focus on anything that threatens the daily operations of the water system, both physically and operationally?",
    "Climate and Environment": "Do the documents focus on flooding, landslides, freezing, extreme weather events, erosion, change in landscape, changes in water supply patterns?",
    "Cybersecurity": "Do the documents focus on cyber hacks, ransomware, hostage data?",
    "System Breakdown": "Do the documents focus on rusting, lead-based paints, mold, corrosion?",
    "Hazardous Materials": "Do the documents focus on chemical spills, asbestos?",

    "Social Background": "Do the documents provide information which helps inform, train or educate the user on why a process is designed like it is?",

    "Social Considerations": "Do the documents address social considerations?",
    "Certification": "Do the documents focus on the process of earning certification to become a certified water operator?",
    "Application Process": "Do the documents provide details on how to apply for exams and certification?",
    "Benefits of Certification": "Do the documents explain why pursuing certification is of value to the individual and the community?",
    "Accessibility of Material": "Do the documents address where the material is stored or available?",
    "Request hard copy": "Can the user request a hard copy of the documents?",
    "In-Person": "Are the documents provided at an in-person training?",
    "Library": "Are the documents likely found in a publicly available library?",
    "Online": "Are the documents found online?",
    "Design of Certification Process": "Do the documents provide information on the certification process?",
    "Statistics on Certified People": "Do the documents report the passing rate of individuals taking the licensure test?",
    "Content Covered": "Do the documents provide information on the content covered during the certification process?",
    "Cost Considerations": "Do the documents focus on the cost incurred by the individual pursuing the certification and the affordability of this cost?",
    "Continuing Education Requirements or Renewal": "Do the documents detail the requirements for maintaining certification?",
    "Experience Requirements": "Do the documents detail the experiences required for achieving certification?",
    "Previous Education Considerations": "Do the documents identify the education requirements and education paths to get certified?",
    "Study Tactics": "Do the documents provide advice on how to prepare and study for the test itself, including strategies?",

    "Consequences of Poor Management": "Do the documents identify the detrimental impacts of mismanaged water utilities on population and the environment?",
    "Collaboration": "Do the documents identify the importance of cross-systems collaboration and disciplines?",
    "Water Governance": "Do the documents identify how different levels of government interact and deal with policy and regulation related issues?",
    "Federal Policies and Regulations": "Do the documents identify federal policies such as acts, laws, policies?",
    "State Policies and Regulations": "Do the documents address how these policies and regulations are implemented at the state level?",
    "Tribal Governance": "Do the documents focus on how tribal policies interact with federal policies?",
    "Community Outreach": "Do the documents focus on considerations of how water systems can interface with and interact with the communities they are serving?",
    "Water System Stakeholders": "Do the documents focus on discussion of who will be impacted by the water system?",
    "Environmental Attorneys": "Do the documents identify who is responsible for defending the water systems?",
    "End-Users.1": "Do the documents focus on people who are receiving water from the water systems?",
    "Developers": "Do the documents focus on people who are developing the water systems?",
    "Relevance": "Do the documents focus on how the content covered in the document is made relevant to the context it is addressing or the individual using the materials?",
    "To Environment": "Do the documents provide materials that are better suited for the Alaska environment?",
    "To Operator": "Do the documents provide relevant materials that are tailored for operators in Alaska?",
    "Positive implications": "Do the documents provide positive assumptions about the backgrounds, interests, or skills of operators that are getting certified?",
    "Negative implications": "Do the documents provide negative assumptions about the backgrounds, interests, or skills of operators that are getting certified, such as a condescending tone that sounds like being talked down to?",
    "To End-User": "Do the documents provide relevant information directed towards household management of water and interacting with the water system?",
    "COVID-19": "Do the documents focus on any aspects relevant to the COVID-19 pandemic?",
    "Healthcare": "Do the documents focus on the relevance to public health implications?"
}
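
Each key in `queries` later becomes a column name, and two of the keys above differ only in capitalization (`"Physical properties.1"` and `"Physical Properties.1"`), which is easy to mix up downstream. The sketch below flags such near-duplicates; `case_collisions` is a hypothetical helper, and the short key list is just a sample taken from the dictionary.

```python
from collections import defaultdict


def case_collisions(keys):
    """Group keys that are identical after stripping whitespace and lowercasing."""
    groups = defaultdict(list)
    for key in keys:
        groups[key.strip().lower()].append(key)
    return [names for names in groups.values() if len(names) > 1]


sample_keys = ["Physical properties.1", "Physical Properties.1", "Health care", "Healthcare"]
collisions = case_collisions(sample_keys)
print(collisions)
```

Running `case_collisions(queries.keys())` on the full dictionary would apply the same check to every topic.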

In [12]:
service_context = ServiceContext.from_defaults(
                llm=llm, embed_model="local:BAAI/bge-large-en-v1.5"
            )
set_global_service_context(service_context)

In [13]:
import gc

graylit_codes = {}    # per-document coded answers
graylit_outputs = {}  # per-document raw prompts and responses
for root, dirs, files in os.walk(f'{scratch[0]}/HF/GrayLit', topdown=False):
    print(len(files))
    for i, name in enumerate(files):
        print(i)
        codes = {}
        prompts = {}
        if name.endswith('pdf'):
            gc.collect()
            # Build a fresh vector index and query engine for each PDF.
            documents = loader.load_data(Path(os.path.join(root, name)))
            index = VectorStoreIndex.from_documents(documents, service_context=service_context)
            query_engine = index.as_query_engine()
            # Alternative: a citation-aware engine (citation_chunk_size defaults to 512).
            # query_engine = CitationQueryEngine.from_args(
            #     index, similarity_top_k=3, citation_chunk_size=512)
            for key, question in queries.items():
                response = query_engine.query(question)
                if response.response == 'Empty Response':
                    # The engine returned nothing for this question; report the file and move on.
                    print(name)
                    continue
                # Keep the raw answer so graylit_outputs is not saved empty.
                prompts[key] = {"prompt": question, "response": response.response}
                answer = response.response.lower()
                if key == 'All':
                    # Record the first agency type named in the answer.
                    for agency in ['federal', 'state', 'local', 'regional agency']:
                        if agency in answer:
                            codes[key] = agency
                            break
                elif key == "Audience":
                    # 'Operators-in-training' must be tested before 'Operators'.
                    for audience in ['Operators-in-training', 'Operators', 'Governmental or Tribal Officials', 'End-Users']:
                        if audience.lower() in answer:
                            print(audience)
                            codes[key] = audience
                            break
                else:
                    # Every other question is coded True when "yes" appears in the answer.
                    codes[key] = "yes" in answer
            graylit_codes[name] = codes
            graylit_outputs[name] = prompts

70
0
1
Governmental or Tribal Officials
2
Governmental or Tribal Officials
3
Operators-in-training
4
Operators-in-training
5
Operators-in-training
6
Operators-in-training
7
Governmental or Tribal Officials
8
9
Governmental or Tribal Officials
10
Operators-in-training
11
End-Users
12
Operators-in-training
13
End-Users
14
Operators-in-training
15
Operators-in-training
16
Governmental or Tribal Officials
17
Operators-in-training
18
Governmental or Tribal Officials
19
20
Operators-in-training
21
Governmental or Trib

Bethel City Council_2022_Bid Tabulations.pdf
Bethel City Council_2022_Bid Tabulations.pdf
Bethel City Council_2022_Bid Tabulations.pdf
Bethel City Council_2022_Bid Tabulations.pdf
Bethel City Council_2022_Bid Tabulations.pdf
Bethel City Council_2022_Bid Tabulations.pdf
Bethel City Council_2022_Bid Tabulations.pdf
Bethel City Council_2022_Bid Tabulations.pdf
Bethel City Council_2022_Bid Tabulations.pdf
Bethel City Council_2022_Bid Tabulations.pdf
Bethel City Council_2022_Bid Tabulations.pdf
Bethel City Council_2022_Bid Tabulations.pdf
Bethel City Council_2022_Bid Tabulations.pdf
Bethel City Council_2022_Bid Tabulations.pdf
Bethel City Council_2022_Bid Tabulations.pdf
Bethel City Council_2022_Bid Tabulations.pdf
Bethel City Council_2022_Bid Tabulations.pdf
Bethel City Council_2022_Bid Tabulations.pdf
Bethel City Council_2022_Bid Tabulations.pdf
Bethel City Council_2022_Bid Tabulations.pdf
Bethel City Council_2022_Bid Tabulations.pdf
Bethel City Council_2022_Bid Tabulations.pdf
Bethel Cit

53
Governmental or Tribal Officials
54
End-Users
55
Operators-in-training
56
57
Governmental or Tribal Officials
58
Governmental or Tribal Officials
59
Governmental or Tribal Officials
60
Governmental or Tribal Officials
61
Governmental or Tribal Officials
62
Governmental or Tribal Officials
63
Operators-in-training
64
Operators
65
End-Users
66
Operators-in-training
67
Operators-in-training
68
Governmental or Tribal Officials
69
Governmental or Tribal Officials
1
0
136
0
Governmental or Tribal Officials

AK DEC_2018_Let-DEC about BEST Practices Score.pdf
AK DEC_2018_Let-DEC about BEST Practices Score.pdf
AK DEC_2018_Let-DEC about BEST Practices Score.pdf
AK DEC_2018_Let-DEC about BEST Practices Score.pdf
AK DEC_2018_Let-DEC about BEST Practices Score.pdf
AK DEC_2018_Let-DEC about BEST Practices Score.pdf
AK DEC_2018_Let-DEC about BEST Practices Score.pdf
AK DEC_2018_Let-DEC about BEST Practices Score.pdf
AK DEC_2018_Let-DEC about BEST Practices Score.pdf
AK DEC_2018_Let-DEC about BEST Practices Score.pdf
AK DEC_2018_Let-DEC about BEST Practices Score.pdf
AK DEC_2018_Let-DEC about BEST Practices Score.pdf
AK DEC_2018_Let-DEC about BEST Practices Score.pdf
AK DEC_2018_Let-DEC about BEST Practices Score.pdf
AK DEC_2018_Let-DEC about BEST Practices Score.pdf
AK DEC_2018_Let-DEC about BEST Practices Score.pdf
AK DEC_2018_Let-DEC about BEST Practices Score.pdf
AK DEC_2018_Let-DEC about BEST Practices Score.pdf
AK DEC_2018_Let-DEC about BEST Practices Score.pdf
AK DEC_2018_Let-DEC about BEST 

127
Operators-in-training
128
Governmental or Tribal Officials
129
Governmental or Tribal Officials
130
Governmental or Tribal Officials
131
Operators-in-training
132
Operators-in-training
133
134
Operators-in-training
135
Operators-in-training
2
0
1
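
The loop above codes an answer as `True` whenever the substring "yes" appears anywhere in it, which also matches words such as "eyes" or a "yes" that appears late in a "No, ..." answer. Since the system prompt asks the model to put yes or no at the beginning of the response, a stricter parser is possible. The sketch below is one option, not the method used to produce the results shown here; `parse_yes_no` is a hypothetical helper.

```python
import re


def parse_yes_no(answer):
    """Return True/False when the answer opens with yes/no, otherwise None."""
    match = re.match(r"\s*(yes|no)\b", answer, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "yes"


for text in ["Yes, the documents address storage.", "No.", "The eyes of the operator", " no, but yes later"]:
    print(repr(text), "->", parse_yes_no(text))
```

Returning `None` for answers that open with neither word keeps unclear responses separate from a real "no".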


In [14]:
graylit_codes

{'YRITWC_nd_Aquatic Buffer Ordinance.pdf': {'All': 'local',
  'Audience': 'Governmental or Tribal Officials',
  'Technical Considerations': True,
  'Technical Background': True,
  'Stage in Water Process': True,
  'Not Stage Specific': True,
  'Source Water': True,
  'Potable Water Treatment': False,
  'Potable Water Distribution': False,
  'Wastewater Collection': False,
  'Wastewater Treatment': True,
  'Water Storage': False,
  'Small Systems': True,
  'End-Users Storage System': False,
  'Administrative Considerations': False,
  'Administrative Processes': True,
  'Compliance': True,
  'Strategic Planning': False,
  'Funding': False,
  'Worker Safety': True,
  'Health care': False,
  'Water Quality ': True,
  'Testing': False,
  'Laboratory': False,
  'Field': False,
  'At-home': True,
  'Possible Contaminants': True,
  'Source Water Testing': False,
  'Biological': False,
  'Chemical': False,
  'Physical contaminents': True,
  'Effluent at End of Treatment Process': False,
  'Biol

In [15]:
import json

# Show the last response from the loop above, then save both dictionaries as JSON.
print(response.response)
with open("graylit_codes.json", "w") as outfile:
    json.dump(graylit_codes, outfile)
with open("graylit_outputs.json", "w") as outfile:
    json.dump(graylit_outputs, outfile)

No.
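
Because the codes are plain strings and booleans, they survive a JSON round trip unchanged, so the saved file can be summarized later without rerunning the model. The sketch below tallies how many documents were coded `True` for each topic. The two-document `sample` dictionary is made up for illustration, and `count_yes` is a hypothetical helper.

```python
import json
from collections import Counter


def count_yes(codes):
    """Count, per topic, how many documents were coded True."""
    totals = Counter()
    for doc_codes in codes.values():
        for topic, value in doc_codes.items():
            if value is True:  # skip False and the string-valued 'All'/'Audience' codes
                totals[topic] += 1
    return totals


sample = {
    "a.pdf": {"All": "local", "Funding": True, "Testing": False},
    "b.pdf": {"All": "state", "Funding": True, "Testing": True},
}
restored = json.loads(json.dumps(sample))  # same shape as reading graylit_codes.json
totals = count_yes(restored)
print(totals)
```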


In [16]:
import pandas as pd
# df = pd.read_json( "graylit_codes.json")
df = pd.DataFrame(graylit_codes)

In [17]:
df = df.transpose()

In [18]:
for agency in df['All'].unique():
    df[agency]=df['All']==agency

In [19]:
for audience in df['Audience'].unique():
    df[audience]=df['Audience']==audience

In [20]:
df = df.replace(False, 0)
df = df.replace(True, 1)
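
The two loops above build one indicator column per label by hand. As an alternative sketch, `pandas.get_dummies` produces the same kind of 0/1 columns in one call per source column and skips missing values, so no `NaN` column is created. The three-row `df_demo` frame below is made up for illustration.

```python
import pandas as pd

# Toy stand-in for the coded DataFrame; one document has no agency code.
df_demo = pd.DataFrame(
    {
        "All": ["local", "federal", None],
        "Audience": ["Operators", "End-Users", "Operators"],
    },
    index=["a.pdf", "b.pdf", "c.pdf"],
)

# One indicator column per distinct label, stored as integers 0/1.
agency = pd.get_dummies(df_demo["All"]).astype(int)
audience = pd.get_dummies(df_demo["Audience"]).astype(int)
coded = pd.concat([df_demo, agency, audience], axis=1)
print(coded)
```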

In [21]:
home = ! echo $HOME

In [22]:
df

Unnamed: 0,All,Audience,Technical Considerations,Technical Background,Stage in Water Process,Not Stage Specific,Source Water,Potable Water Treatment,Potable Water Distribution,Wastewater Collection,...,Healthcare,local,federal,state,NaN,regional agency,Governmental or Tribal Officials,Operators-in-training,End-Users,Operators
YRITWC_nd_Aquatic Buffer Ordinance.pdf,local,Governmental or Tribal Officials,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,...,1.0,1,0,0,0,0,1,0,0,0
"Bethel City Council_2020_City's Water & Sewer Services Expenditures Compared to Budget from July 1, 2020 to April 30, 2021.pdf",local,Governmental or Tribal Officials,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,1,0,0,0,0,1,0,0,0
AK DEC_2009_Introduction to Small Water Systems-Ch9.pdf,federal,Operators-in-training,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,...,1.0,0,1,0,0,0,0,1,0,0
AWWA_nd_WSO-WPP-Study Tactics.pdf,state,Operators-in-training,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,...,0.0,0,0,1,0,0,0,1,0,0
HUD_2021_Phase 1 Priorities.pdf,federal,Operators-in-training,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,...,1.0,0,1,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AK DEC_nd_Bethel Native Corp Offices Water Treatment System.pdf,state,Governmental or Tribal Officials,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,...,1.0,0,0,1,0,0,1,0,0,0
ABC_2018_Guide Distr ClassI.pdf,local,Operators-in-training,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,...,1.0,1,0,0,0,0,0,1,0,0
EPA_1999_Community and Nontransient Noncommunity Operator Certification.pdf,regional agency,Operators-in-training,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,...,1.0,0,0,0,0,1,0,1,0,0
ABC_nd_WastewaterTreatmentOperatorCertificationApplication031716.pdf,state,Operators-in-training,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,...,1.0,0,0,1,0,0,0,1,0,0


In [26]:
df.to_csv(f"{home[0]}/CodedOutputs/gray_lit_coded.csv")

In [25]:
!ls

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
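
The warning above appears because `!ls` forks the notebook process after the tokenizer has already used parallelism. As the message itself suggests, setting the environment variable explicitly silences it. A minimal sketch, to be run near the top of the notebook before the model is loaded:

```python
import os

# Set this before any tokenizer is used and before any `!` shell command forks the process.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
print(os.environ["TOKENIZERS_PARALLELISM"])
```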


'1urBYxybPjtjGM7EVY1_ZiSfGcc9w5OHl?usp=share_link'
'2019_Supporting DMDU.pdf'
 cache2
 CodedSurveyTopics.json
 geospatial_latest.sif
 git-lfs-1.2.0
 gray_lit_coded.csv
 graylit_codes.json
 graylit_outputs.json
 HF
 index.html
 Jupyter
 Llama-2-13b-chat-hf
 Llama-2-13b-hf
 Llama-2-70b-chat-hf
 Llama-2-70b-hf
 Llama-2-7b
 Llama-2-7b-chat-hf
 Llama-2-7b-hf
 LLAMAIndexenvironment.yml
 llama-main
 Miniconda3-py311_23.5.2-0-Linux-x86_64.sh
 NonCodedSurveyTopics.json
 r-gdal-container-tacc_0.1.sif
 r-gdal-container-tacc_dev.sif
 survey_codes.json
 survey_outputs.json
 SurveyTopicsNoPrompt.json
 SurveyTopicsWithPrompt.json
 TIFF
 Topics.json

In [None]:
df[['All', 'Audience', 'Stage in Water Process',
       'Administrative Processes', 'Administrative Considerations',
       'Compliance', 'Strategic Planning', 'Funding', 'Worker Safety',
       'Health care', 'Water Quality ', 'Testing', 'Laboratory', 'Field',
       'At-home', 'Source Water Testing', 'Biological', 'Chemical',
       'Physical contaminents', 'Possible Contaminants',
       'Effluent at End of Treatment Process', 'Biological.1',
       'Chemical.1', 'Physical properties.1',
       'Water received by the end-user', 'Biological.2', 'Chemical.2',
       'Physical Properties.1', 'Treatment Process', 'Operating System',
       'Equipment Installation', 'Equipment Operations',
       'Equipment Maintenance', 'System Components', 'System Monitoring',
       'Design Considerations', 'Threats and Security',
       'Climate and Environment', 'Cybersecurity', 'System Breakdown',
       'Hazardous Materials', 'Technical Considerations',
       'Technical Background', 'Social Background',
       'Social Considerations', 'Certification',
       'Statistics on Certified People',
       'Design of Certification Process', 'Content Covered',
       'Application Process', 'Benefits of Certification',
       'Accessibility of Material', 'Request hard copy', 'In-Person',
       'Library', 'Online', 'Cost Considerations',
       'Continuing Education Requirements or Renewal',
       'Experience Requirements', 'Previous Education Considerations',
       'Study Tactics', 'Consequences of Poor Management',
       'Collaboration', 'Water Governance',
       'Federal Policies and Regulations',
       'State Policies and Regulations', 'Tribal Governance',
       'Community Outreach', 'Water System Stakeholders',
       'Environmental Attorneys', 'End-Users.1', 'Developers',
       'Relevance', 'To Environment', 'To Operator',
       'Positive implications', 'Negative implications', 'To End-User',
       'COVID-19', 'local', 'federal', 'state',
       'regional agency']]
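
Indexing a DataFrame with a list that contains an unknown label raises a `KeyError`, so it can be worth checking a long column list against `df.columns` first. The sketch below splits a wanted list into labels that exist and labels that do not; `split_columns` is a hypothetical helper and the short lists are only a sample.

```python
def split_columns(wanted, available):
    """Split `wanted` into (present, missing) relative to `available`, keeping order."""
    available = set(available)
    present = [col for col in wanted if col in available]
    missing = [col for col in wanted if col not in available]
    return present, missing


wanted = ["All", "Audience", "Administrative process", "Funding"]
available = ["All", "Audience", "Administrative Processes", "Funding"]
present, missing = split_columns(wanted, available)
print("present:", present)
print("missing:", missing)
```

With the real data, `split_columns(wanted, df.columns)` followed by `df[present]` selects only the columns that exist and reports the rest.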