<a href="https://colab.research.google.com/github/TimothyChenAllen/fema-experiments/blob/main/FEMA_Strategic_Plan_ChatBot_2023_11_21.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Query the FEMA Strategic Plan with a Pretrained Question-Answering Model
tim.allen@fema.dhs.gov 2023-11-21

Much of this code is adapted from the model card here: https://huggingface.co/distilbert-base-uncased-distilled-squad

## Configuration
Upload the fema_2022-2026-strategic-plan.pdf file to the runtime, note the directory it was saved to, and set `filename` and `corpus_path` to match:

In [1]:
filename = "fema_2022-2026-strategic-plan.pdf"
corpus_path = "/content/"
textversion = "fema_2022-2026-strategic-plan.txt"

In [2]:
import os
strat_plan_pdf = os.path.join(corpus_path, filename)
strat_plan_txt = os.path.join(corpus_path, textversion)

In [3]:
# Install necessary libraries:

%pip install "accelerate>=0.16.0,<1" "transformers[torch]>=4.28.1,<5" "torch>=1.13.1,<2" "PyPDF2"

In [4]:
# Can I load the libraries I need? Yes I can!
import torch
from transformers import pipeline

## Import text from the Strategic Plan PDF

In [5]:
import PyPDF2
# Open the PDF file
pdf_reader = PyPDF2.PdfReader(strat_plan_pdf)
pages_object = pdf_reader.pages
# print(pages_object[3].extract_text())

## Clean up the text
Here we join together text and remove headers and footers.

In [6]:
import re
# Knit all the text on a page into a single line, then all pages together into a single text variable
text = ""
# Skip the cover and table of contents. Probably a cleverer way to do this
for p in range(2, len(pages_object)):
    page_text = pages_object[p].extract_text()
    lines = page_text.split("\n")
    single_line = " ".join(lines)
    # Every page has headers and footers on it. I don't want that
    single_line = re.sub(r"Building the FEMA our Nation Needs and Deserves\d+", "", single_line)
    single_line = re.sub(r"\d+ +2022-2026 FEMA Strategic Plan", "", single_line)
    # print(f"Page {p} has {len(single_line)} characters.")
    if len(single_line) > 80:
        text += single_line + "\n"
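
The two `re.sub` patterns above can be checked in isolation before running them over the whole PDF. The following is a minimal sketch, assuming a made-up page string rather than real extracted text; `strip_header_footer` and `sample` are hypothetical names introduced only for this check.

```python
import re

# Header and footer patterns copied from the cleanup cell.
HEADER = r"Building the FEMA our Nation Needs and Deserves\d+"
FOOTER = r"\d+ +2022-2026 FEMA Strategic Plan"

def strip_header_footer(line: str) -> str:
    """Remove the running header and footer from one page's joined text."""
    line = re.sub(HEADER, "", line)
    return re.sub(FOOTER, "", line)

# Synthetic page text (not taken from the PDF), used only to exercise the patterns.
sample = (
    "12 2022-2026 FEMA Strategic Plan Some body text. "
    "Building the FEMA our Nation Needs and Deserves13"
)
cleaned = strip_header_footer(sample).strip()
print(cleaned)
```

If a page comes back with its header or footer still attached, the extracted text probably differs from these patterns in spacing or capitalization, and printing `pages_object[p].extract_text()` for that page should show the difference.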

## Save a local text copy

In [7]:
# Let's save a text file copy (UTF-8, since the plan contains curly quotes)
with open(strat_plan_txt, 'w', encoding='utf-8') as f:
    f.write(text)

## Create the model and associate the strat plan with it

In [8]:
model_name = "distilbert-base-uncased-distilled-squad"
question_answerer = pipeline("question-answering", model=model_name)
context = text
result = question_answerer(question="How should we treat the disaster survivor?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")

Answer: 'measure the culture of FEMA’s workforce', score: 0.2755, start: 62207, end: 62246
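
The `start` and `end` values in the result are character offsets into `context`, so `context[start:end]` is the answer string itself (here 62246 - 62207 = 39 characters, the length of the answer). With a score of only 0.2755, it helps to read the text around the span before trusting it. A minimal sketch, assuming a hypothetical helper `show_span` and a synthetic sentence in place of the real plan text:

```python
def show_span(context: str, start: int, end: int, window: int = 60) -> str:
    """Return the answer span in brackets with up to `window` characters on each side."""
    left = max(0, start - window)
    right = min(len(context), end + window)
    return context[left:start] + "[" + context[start:end] + "]" + context[end:right]

# Synthetic context (not a quote from the plan), used only to show the slicing.
demo_context = "FEMA will measure the culture of its workforce every year."
demo_start = demo_context.index("measure")
demo_end = demo_start + len("measure the culture")
print(show_span(demo_context, demo_start, demo_end, window=10))
```

In the notebook the equivalent call would be `show_span(context, result['start'], result['end'])`.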


## The Main Event
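Here's how you call this model interactively. Type a question at the prompt, or enter `q`, `quit`, or `stop` to exit. Every call passes the full plan text as `context`.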

In [9]:
import time
from transformers import set_seed
set_seed(42)

while True:
  question_text = input("Enter your question:")
  if question_text.lower() in ['q','quit','stop']:
    break
  start_time = time.time()
  res = question_answerer(question=question_text, context=context)
  end_time = time.time()
  time_taken = end_time - start_time
  answer = res['answer']
  print(f"{answer}\nTime taken: {time_taken:.2f} seconds")

Enter your question:What is FEMA?
Federal Emergency Management Agency
Time taken: 46.47 seconds
Enter your question:What is DHS?
U.S. Department of Homeland Security
Time taken: 55.80 seconds
Enter your question:q
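
Each answer above took roughly 46 to 56 seconds, with the entire plan passed as `context` on every call. Because the cleanup cell stores one page per line in `text`, one possible shortcut is to pre-select the pages that share the most words with the question and pass only those to `question_answerer`. This is a minimal sketch under that assumption, not part of the original notebook or of the transformers API; `rank_pages` and `top_k` are hypothetical names, and the word-overlap score is a deliberately crude stand-in for a real retriever.

```python
import re

def rank_pages(text: str, question: str, top_k: int = 3) -> list[str]:
    """Return the top_k non-empty lines of `text` sharing the most distinct words with `question`."""
    question_words = set(re.findall(r"[a-z0-9]+", question.lower()))
    pages = [p for p in text.split("\n") if p.strip()]
    scored = []
    for page in pages:
        page_words = set(re.findall(r"[a-z0-9]+", page.lower()))
        scored.append((len(question_words & page_words), page))
    # Python's sort is stable, so pages with equal scores keep document order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [page for _, page in scored[:top_k]]

# Synthetic three-page text (not from the plan), used only to show the ranking.
demo_text = (
    "Page about equity and training.\n"
    "Page about climate resilience.\n"
    "Page about workforce readiness."
)
best = rank_pages(demo_text, "What about climate resilience?", top_k=1)
print(best)
```

Inside the loop, this would replace `context=context` with something like `context="\n".join(rank_pages(text, question_text))`. The trade-off is that a page phrased differently from the question may be filtered out, so the answer can change or get worse.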
