# GLANSIS INVASIVE SPECIES ENVIRONMENTAL IMPACT PDF CHATBOT INTERFACE

This PDF Chatbot Interface was created to test the potential application of Large Language Models (LLMs) to literature review. The project focused on the impact assessments performed by GLANSIS because they use clear criteria to determine impact categories.

To run this script, users need to select a PDF and enter the species name of the invasive species under review. Seven queries are asked of the text related to impact type, location, and description.

## Import libraries

In [5]:
# Import libraries
from personal import secrets
import os
import pandas as pd
import openai
from tkinter import Tk, filedialog
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.chains.question_answering import load_qa_chain


## User Input Information
Users will need to add specific information before using the interface.

#### Enter User API Key
Users will need to enter their personal OpenAI API key to run the chatbot interface.

In [7]:
# Open the API key for openai
api_key = secrets.get('OPENAI API KEY')
os.environ["OPENAI_API_KEY"] = str(api_key)
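For users who do not have a local `personal.secrets` module, one possible alternative is to read the key from an environment variable and fall back to an interactive prompt. The sketch below is only an illustration under that assumption; `load_api_key` and its parameters are hypothetical names that do not appear in the original notebook.

```python
import os
from getpass import getpass


def load_api_key(env_var="OPENAI_API_KEY", prompt=getpass):
    """Return an API key from the environment, or ask the user for one.

    `env_var` and `prompt` are illustrative parameters (assumptions), so the
    lookup name and the input method can be swapped out.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fall back to a hidden prompt so the key is not echoed on screen
        key = prompt(f"Enter {env_var}: ").strip()
    if not key:
        raise ValueError(f"No API key provided for {env_var}")
    return key
```

With this helper, the cell above could instead set `os.environ["OPENAI_API_KEY"] = load_api_key()`.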


#### Enter Species Name
Users will need to enter the name of the invasive species they want to find impacts for.

In [3]:
species = ""


#### Select a PDF to use ChatBot Interface
A file dialog box will open so users can select the desired PDF. The text will then be extracted from the PDF. To prepare the PDF for the chatbot, the text will be split into chunks and converted to a vector database for queries.

In [None]:
# Open a file dialog to select a PDF file
root = Tk()                                         
root.attributes("-topmost", True)                   
root.withdraw()
file_path = filedialog.askopenfilename(filetypes=[("PDF files", "*.pdf")])

# Import PDF
reader = PdfReader(file_path)

# Read in text from PDF 
raw_text = ''
for page in reader.pages:
    text = page.extract_text()
    if text:
        raw_text += text


# Split the text into smaller chunks so information retrieval doesn't hit token limit size
text_splitter = CharacterTextSplitter(
    separator = '\n',
    chunk_size = 1000,
    chunk_overlap = 200,
    length_function = len,
)

text_chunk = text_splitter.split_text(raw_text)

# Set up the OpenAI embeddings model used to vectorize each text chunk
embeddings = OpenAIEmbeddings()

# Create vector database
docsearch = FAISS.from_texts(text_chunk, embeddings)

# Sets openai model - currently set to default
llm = ChatOpenAI(model_name="gpt-3.5-turbo")
chain = load_qa_chain(llm, chain_type = 'stuff')
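To make the roles of `chunk_size` and `chunk_overlap` more concrete, here is a simplified, standard-library sketch of splitting on a separator and packing the pieces into overlapping chunks. It is an illustration of the general idea only and is not LangChain's actual implementation; `split_with_overlap` is a hypothetical name.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200, separator="\n"):
    """Pack separator-delimited pieces into chunks of at most `chunk_size`
    characters, repeating a tail of up to `chunk_overlap` characters at the
    start of the next chunk. A single piece longer than `chunk_size` is
    kept as one oversized chunk."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks = []
    current = []  # pieces that make up the chunk being built

    for piece in pieces:
        if current and len(separator.join(current + [piece])) > chunk_size:
            chunks.append(separator.join(current))
            # Keep only a tail of the finished chunk: drop leading pieces until
            # the tail fits the overlap budget and leaves room for the new piece
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.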


## Establish GLANSIS Query Criteria
GLANSIS has specific definitions for impacts, study types, and study locations. These will be used as part of the prompts for later queries. 

In [12]:
# Description of environmental impacts
env_impact = f"""
Disease/Parasite/Toxicity: The species poses some hazard or threat to the health of native species (e.g., it magnifies toxin levels; is poisonous; is a pathogen, parasite, or a vector of either); the species has introduced a novel or rare disease or parasite to another organism in the area that was unafflicted with said disease or parasite before its introduction, including moving a native parasite outside of its typical range; toxicity includes both envenomation and poisoning. The species poses some hazard or threat to human health (e.g., it magnifies toxin levels, is poisonous, a virus, bacteria, parasite, or a vector of one).

Predation/Herbivory: The species consumes or is consumed by another species, or it alters predator-prey relationships.

Food Web: The species changes second order or higher nutrient/feeding cascades.

Competition: The species out-competes native species for available resources (e.g., habitat, food, nutrients, light). The species shares a niche with another species where introduced, such that they compete for resources (such as food and habitat).

Genetic: The species has affected native populations genetically (e.g., through hybridization, selective pressure, introgression). The species hybridizes with another organism as a result of its introduction, with the resulting offspring viability being irrelevant.

Water Quality: The species creates measurable changes in water chemistry/quality/parameters as compared to pre-introduction. The species negatively affects water quality (e.g., increased turbidity or clarity, altered nutrient, oxygen, or other chemical levels/cycles).

Habitat Alteration: Introduction of the species modifies the environment to which it was introduced, such as zebra mussels that attach to surfaces, changing the substrate of a waterbody. The species alters physical components of the ecosystem in some way (e.g., facilitated erosion/siltation, altered hydrology, altered macrophyte/phytoplankton communities, physical or chemical changes to substrate).
"""

# Description of study types
study_type = f"""
Experimental: a study/reference with a claim that was supported experimentally, i.e. at least one variable in the study was manipulated.

Observational: a study/reference with a claim that was founded on observation, i.e. nothing in the study or report was a result of manipulating any variables.

Anecdotal: a study/reference with a claim that is not founded on direct research, but is supported by theory or correlation, therefore anecdotal.
"""

# Description of study locations
study_location = f"""
Field: The study/impact occurred in the field.

Laboratory: The study/impact occurred in the laboratory.

N/A: Study/impact was not in a lab or field setting and falls in neither of the previous categories
 """


## Environmental Impact Queries
GLANSIS looks at seven different environmental impact types. For each impact found, data on the study type, study location, and impacted species are collected. The following queries help create a dataframe with data concerning each impact.

In [14]:
# Categorizing different impacts
query = f"""
What are the documented categorical ecological impacts of invasive ```{species}``` in invaded 
regions? Generate a list separated by commas containing only “Disease/Parasite/Toxicity”, “Predation/Herbivory”, 
“Food Web”, “Competition”, “Genetic”, “Water Quality”, or “Habitat Alteration” using
```{env_impact}``` as guidance.
If there are no impacts, report "NA".
"""

docs = docsearch.similarity_search(query)
impact_types = chain.run(input_documents = docs, question = query).split(", ")
print(impact_types)


['Predation/Herbivory', 'Competition', 'Food Web', 'Habitat Alteration']
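Splitting the response on `", "` works when the model follows the requested format exactly, but it can let through stray punctuation, quotation marks, or an "NA" entry. One possible safeguard is to check each item against the seven allowed categories. The sketch below is an assumption about how such a check could look; `ALLOWED_IMPACTS` and `parse_impact_types` are hypothetical names.

```python
# The seven impact categories named in the query above
ALLOWED_IMPACTS = [
    "Disease/Parasite/Toxicity",
    "Predation/Herbivory",
    "Food Web",
    "Competition",
    "Genetic",
    "Water Quality",
    "Habitat Alteration",
]


def parse_impact_types(response):
    """Return the allowed impact categories found in a comma-separated
    response, in order of appearance and without duplicates."""
    lookup = {name.lower(): name for name in ALLOWED_IMPACTS}
    found = []
    for item in response.replace("\n", ",").split(","):
        # Remove surrounding whitespace, quotes, brackets, and periods
        key = item.strip().strip("\"'“”[]. ").lower()
        if key in lookup and lookup[key] not in found:
            found.append(lookup[key])
    return found
```

An empty list then signals that no recognized impact was reported, so the later per-impact loops simply do nothing.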


In [18]:
# Study type
# `study_type` currently holds the GLANSIS definitions; keep them under their own
# name before the variable is reused for the list of results
if isinstance(study_type, str):
    study_type_guidance = study_type
study_type = []

for impact in impact_types:
    
    query = f"""
    How was the impact '''{impact}''' documented? Use ```{study_type_guidance}``` as guidance for determining study type. 
    Possible responses are: “Experimental”, “Observational”, or “Anecdotal.” Do not add any additional content, 
    formatting, or punctuation.
    """
    docs = docsearch.similarity_search(query)
    study_type.append(chain.run(input_documents = docs, question = query))
    
print(study_type)


['[Experimental]', 'Experimental', 'Experimental', 'Experimental']


In [19]:
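# Study location
# `study_location` currently holds the GLANSIS definitions; keep them under their
# own name before the variable is reused for the list of results
if isinstance(study_location, str):
    study_location_guidance = study_location
study_location = []

for impact in impact_types:
    
    query = f"""
    Where did the documented impact '''{impact}''' occur? Use ```{study_location_guidance}``` as guidance for study location. 
    Possible responses are: “Field”, “Laboratory”, or “N/A.” Do not add any additional content or formatting.
    """
    
    docs = docsearch.similarity_search(query)
    study_location.append(chain.run(input_documents = docs, question = query))
    
print(study_location)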


['[Field]', 'Field', 'Field', 'Field']
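The outputs above show that the model sometimes wraps a label in brackets (`'[Experimental]'`, `'[Field]'`) despite the formatting instruction. A small normalization step could map each response back onto the allowed labels and flag anything unexpected for manual review. This is a sketch under that assumption; `normalize_label`, `STUDY_TYPES`, and `STUDY_LOCATIONS` are hypothetical names.

```python
# Allowed labels taken from the two queries above
STUDY_TYPES = ["Experimental", "Observational", "Anecdotal"]
STUDY_LOCATIONS = ["Field", "Laboratory", "N/A"]


def normalize_label(response, allowed):
    """Return the matching label from `allowed`, or None if the response
    does not correspond to exactly one allowed label."""
    # Remove surrounding whitespace, brackets, quotes, and periods
    cleaned = response.strip().strip("[]\"'“”. ").lower()
    for label in allowed:
        if cleaned == label.lower():
            return label
    return None
```

For example, `[normalize_label(r, STUDY_TYPES) for r in study_type]` would give one consistent spelling per row, with `None` marking responses that need a second look.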


In [20]:
# Describe the impact

impact_description = []

for impact in impact_types:
    
    query = f"""
    Write one to three sentence descriptions of the '''{impact}''' of ```{species}``` on the
    environment. This should include the scientific name of the species, as well as the geographic
    location of the impact (Country/state, waterbody). Descriptions must be fewer than 500 characters.
    """
    
    docs = docsearch.similarity_search(query)
    impact_description.append(chain.run(input_documents = docs, question = query))
    
print(impact_description)


["The predation/herbivory of silver carp (Hypophthalmichthys molitrix) in the Illinois River, Illinois, USA, has led to significant effects on zooplankton communities, impacting native fishes and altering ecosystem dynamics. Silver carp's feeding behavior has caused changes in zooplankton populations and community structures, contributing to concerns about its invasive impact on the aquatic environment.", 'The invasive silver carp (Hypophthalmichthys molitrix) has been shown to impact native fish species in the Illinois River, USA, leading to reduced condition factors and potential competition due to shared resources.', 'Silver carp (Hypophthalmichthys molitrix) impact zooplankton communities in the Missouri River, USA, affecting trophic dynamics. In the Illinois River, Illinois, silver carp alter zooplankton composition, influencing native fish diets and ecosystem balance.', 'The invasive silver carp (Hypophthalmichthys molitrix) has been observed to alter habitats by impacting zoopla

In [21]:
# Experiment location

geo_loc = []

for impact in impact_types:
    
    query = f"""
    What is the geographic location of the impact '''{impact}'''? Report the answer in the format 
    “waterbody, State/province, Country” or “waterbody, Country.” Do not add any additional content 
    or formatting. Do not use abbreviations for location, waterbody, or country.
    """
    
    docs = docsearch.similarity_search(query)
    geo_loc.append(chain.run(input_documents = docs, question = query))
    
print(geo_loc)


['Mississippi River, Iowa, USA.', 'Mississippi River, Iowa, USA.', 'Upper Mississippi River, Illinois, USA', 'UMR, Iowa, United States']
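The responses above do not all follow the requested format: some end with a period and some use abbreviations. A light parsing step could at least split each answer into its parts so that inconsistencies are easier to spot. The sketch below assumes the two formats named in the query; `parse_location` is a hypothetical name, and it deliberately does not try to expand abbreviations.

```python
def parse_location(response):
    """Split a 'waterbody, State/province, Country' or 'waterbody, Country'
    answer into a dictionary. Returns None for any other shape."""
    # Drop surrounding whitespace and a trailing period before splitting
    parts = [p.strip() for p in response.strip().rstrip(".").split(",") if p.strip()]
    if len(parts) == 3:
        return {"waterbody": parts[0], "state_province": parts[1], "country": parts[2]}
    if len(parts) == 2:
        return {"waterbody": parts[0], "state_province": None, "country": parts[1]}
    return None
```

Applied to the output above, the entries `'USA'` and `'United States'` would land in the same `country` field, which makes the mismatch visible for manual correction.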


In [22]:
# Great Lakes - experiment location

gl_loc = []

for impact in impact_types:
    
    query = f"""
    Did the impact '''{impact}''' happen within the Great Lakes Basin? Possible responses are: “yes” or “no.”
    Do not add any additional content, formatting, or punctuation.
    """
    
    docs = docsearch.similarity_search(query)
    gl_loc.append(chain.run(input_documents = docs, question = query))

print(gl_loc)


['no', 'no', 'no', 'no']


In [23]:
# Impacted species

impacted_sp = []

for impact in impact_types:
    
    query = f"""
    If applicable, which native species were impacted by '''{species}''' from '''{impact}'''? Report as list of common 
    or scientific names. If none, report "NA." Do not add any additional content, formatting, or punctuation.
    """
    
    docs = docsearch.similarity_search(query)
    impacted_sp.append(chain.run(input_documents = docs, question = query))

print(impacted_sp)


['NA', 'NA', 'NA', 'Bluegill, emerald shiner, yellow perch']


## Create Data Table
Collate all responses from queries into a dataframe. Users can view the table.

In [24]:
data = {
    "Impact Type": impact_types,
    "Study Type": study_type,
    "Study Location": study_location,
    "Impact Description": impact_description,
    "Geographic Location": geo_loc,
    "Great Lakes Region": gl_loc,
    "Impacted Species": impacted_sp   
}

df = pd.DataFrame(data)
df['File Name'] = os.path.basename(file_path)
df


| | Impact Type | Study Type | Study Location | Impact Description | Geographic Location | Great Lakes Region | Impacted Species | File Name |
|---|---|---|---|---|---|---|---|---|
| 0 | Predation/Herbivory | [Experimental] | [Field] | The predation/herbivory of silver carp (Hypoph... | Mississippi River, Iowa, USA. | no | | testPDF.pdf |
| 1 | Competition | Experimental | Field | The invasive silver carp (Hypophthalmichthys m... | Mississippi River, Iowa, USA. | no | | testPDF.pdf |
| 2 | Food Web | Experimental | Field | Silver carp (Hypophthalmichthys molitrix) impa... | Upper Mississippi River, Illinois, USA | no | | testPDF.pdf |
| 3 | Habitat Alteration | Experimental | Field | The invasive silver carp (Hypophthalmichthys m... | UMR, Iowa, United States | no | Bluegill, emerald shiner, yellow perch | testPDF.pdf |


## Export
The data table can be exported as an Excel file for later analysis.

In [25]:
# Save Excel file
root = Tk()
root.attributes("-topmost", True)
root.withdraw()
save_path = filedialog.asksaveasfilename(defaultextension=".xlsx", filetypes=[("Excel files", "*.xlsx"), ("All files", "*.*")])

# Only write the file if the user did not cancel the dialog
if save_path:
    df.to_excel(save_path, index=False)
