# Quick checklist creator

Summarizer
Goal: Create a "checklist" (like requirements matrix) from an RFP

There are a few ways to address this:

Full RFP
- Input
  - FULL RFP
  - Prompt asks for the checklist directly
- Output
  - Checklist
  - Approximate token count (estimated at roughly 3.5 characters per token)

Page-wise summary
- Input
  - Per page summary
  - Prompt asks to consolidate into checklist
- Output
  - As above
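
The page-wise approach can be sketched as a two-step flow: summarize each page, then consolidate the summaries into one checklist. The snippet below is a minimal illustration, not the `Summarizer` implementation used later in this notebook. `summarize_pages`, `checklist_from_summaries`, and both prompt strings are hypothetical, and `llm` is assumed to be any callable that maps a prompt string to a response string.

```python
from typing import Callable, List

# Hypothetical prompt templates; the real ones live in the project's prompt modules.
PAGE_PROMPT = "Summarize the requirements on this page:\n{page}"
CHECKLIST_PROMPT = "Consolidate these page summaries into a checklist:\n{document}"


def summarize_pages(pages: List[str], llm: Callable[[str], str]) -> List[str]:
    """Step 1: one LLM call per page, producing a short summary for each."""
    return [llm(PAGE_PROMPT.format(page=page)) for page in pages]


def checklist_from_summaries(summaries: List[str], llm: Callable[[str], str]) -> str:
    """Step 2: join the summaries and ask for a single consolidated checklist."""
    joined = "\n".join(summaries)
    return llm(CHECKLIST_PROMPT.format(document=joined))


# Stand-in "model" that only reports the prompt length, so the flow runs offline.
def fake_llm(prompt: str) -> str:
    return f"<response to {len(prompt)} chars>"


pages = ["Page 1 text ...", "Page 2 text ..."]
summaries = summarize_pages(pages, fake_llm)
print(checklist_from_summaries(summaries, fake_llm))
```

The trade-off is one extra model call per page in exchange for a much shorter final prompt.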

RAG-driven
- Input
  - Vector DB
  - Query for relevant information about sections
  - Top X pages provided to prompt
  - Prompt using that information
- Output
  - As above


In [33]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [34]:
import sys
sys.path.append('../rfpgo/')
from credentials import *
from process.prompts import *
from checklist.prompts import *
from summarize.summarizer import Summarizer
from utils import *
import os
import pandas as pd
from pathlib import Path
os.environ["ANTHROPIC_API_KEY"] = ANTHROPIC_KEY

In [35]:
from langchain.llms import Ollama
from langchain_anthropic import ChatAnthropic

gemma = Ollama(model="gemma2")
anth_haiku = ChatAnthropic(model='claude-3-haiku-20240307')
anth_opus = ChatAnthropic(model='claude-3-opus-20240229')

## Checklistizer


In [44]:
class Checklist(object):

    # full RFP checklist
    full_rfp_prompt = full_rfp

    # checklist_from_page_summaries
    checklist_from_page_summaries = checklist_from_page_summaries

    # formatting prompt
    format_sections = format_sections

    def _count_tokens(self, text):
        # rough heuristic: ~3.5 characters per token
        return round(len(text)/3.5)

    def __init__(self, llm, fn):
        self.llm = llm
        self.llm_name = llm.dict()['model']
        self.summarizer = Summarizer(llm, fn)
        self.split_doc = self.summarizer.split_doc

    def checklist(self):

        # join all pages once and reuse the result in the full-RFP prompt
        full_rfp_text = '\n'.join(self.split_doc)
        self.c_full = self.full_rfp_prompt.format(document=full_rfp_text)
        self.c_full_tokens = self._count_tokens(self.c_full)
        self.c_full_response = call_llm(self.c_full, self.llm)

        if not hasattr(self.summarizer, 'page_summaries'):
            print('Running summarizer...')
            self.summarizer.summarize()

        self.c_page = self.checklist_from_page_summaries.format(
            document=self.summarizer.joined_p)
        self.c_page_tokens = self._count_tokens(self.c_page)
        self.c_page_response = call_llm(self.c_page, self.llm)
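
Because the full-RFP prompt can be large, it may help to compare the estimated token count against the model's context window before calling the model. The sketch below reuses the ~3.5 characters per token heuristic from `_count_tokens`. `fits_context`, `reserve_for_output`, and the `context_limit` values are illustrative assumptions; the real limit depends on the model and how it is served.

```python
def estimate_tokens(text: str, chars_per_token: float = 3.5) -> int:
    """Same heuristic as Checklist._count_tokens: ~3.5 characters per token."""
    return round(len(text) / chars_per_token)


def fits_context(prompt: str, context_limit: int, reserve_for_output: int = 1000) -> bool:
    """True if the estimated prompt size leaves room for the response.

    context_limit and reserve_for_output are assumptions supplied by the caller;
    they are not read from the model.
    """
    return estimate_tokens(prompt) + reserve_for_output <= context_limit


prompt = "x" * 35_000  # about 10,000 estimated tokens
for limit in (8_000, 200_000):  # illustrative limits only
    status = "ok" if fits_context(prompt, limit) else "too long, consider page summaries or RAG"
    print(limit, status)
```

A check like this could run inside `checklist()` before the full-RFP call, falling back to the page-summary prompt when the estimate is too high.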


In [45]:
# example rfp
fn = '../data/labels/drafter_09262024/RFP_Study to evaluate methods to calculate area median income.pdf'

In [76]:
# create the cache once; re-running this cell keeps earlier results
if 'checklist_dict' not in globals():
    checklist_dict = {}

for model in [gemma, anth_haiku, anth_opus]:
    model_name = model.dict()['model']
    if model_name not in checklist_dict:
        checklist_dict[model_name] = Checklist(model, fn)
        checklist_dict[model_name].checklist()
    else:
        print('Skipping', model_name)

Skipping gemma2
Skipping claude-3-haiku-20240307
Running summarizer...


In [59]:
results = []

for model in checklist_dict:
    results.append(
        [model, 
        checklist_dict[model].c_full_response, 
        checklist_dict[model].c_page_response])

df = pd.DataFrame(results, 
    columns=['model', 'full_response', 'page_response'])

df.to_csv('../data/output/checklist_10212024/checklists.csv')

In [55]:
# token counts depend only on the prompt text, so they are the same for every model
print(checklist_dict['gemma2'].c_full_tokens, checklist_dict['gemma2'].c_page_tokens)

39071 4671
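
To put the two counts above in perspective, the short calculation below shows how much smaller the page-summary prompt is than the full-RFP prompt. The numbers are the ones printed above; the variable names are only for illustration.

```python
full_tokens = 39071   # estimated tokens in the full-RFP prompt (printed above)
page_tokens = 4671    # estimated tokens in the page-summary prompt (printed above)

ratio = full_tokens / page_tokens          # how many times larger the full prompt is
reduction = 1 - page_tokens / full_tokens  # fraction of tokens saved

print(f"full prompt is {ratio:.1f}x larger")
print(f"page-summary prompt is {reduction:.0%} smaller")
```

Both figures are estimates from the character-based heuristic, not counts from a model tokenizer.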


In [78]:
for model in checklist_dict:
    print(model)
    print(checklist_dict[model].c_full_response)
    print('----')

gemma2
This document appears to be a portion of a government contract between COMMERCE and a Contractor. Here's a breakdown of the outlined sections:

**Section 37: Liability of the Authorized Representative**

* Defines the extent of liability for the Authorized Representative (likely a government official) acting on behalf of COMMERCE.
* Specifies that disagreements about liability determination can lead to disputes handled according to a separate "Disputes" clause in the contract.
* Allows COMMERCE to withhold funds from the Contractor if necessary to protect against potential losses or liabilities.

**Section 38: Treatment of Assets**

* Outlines ownership rights for various types of property involved in the contract.
*  COMMERCE owns any property it supplies to the Contractor.
* The Contractor initially owns property they purchase with funds from COMMERCE, but title transfers to COMMERCE upon delivery or use according to specific conditions. 
* Establishes responsibilities for mai

In [79]:
for model in checklist_dict:
    print(model)
    print(checklist_dict[model].c_page_response)
    print('----')

gemma2
This contract document between the Washington Department of Commerce (COMMERCE) and a Contractor outlines the terms and conditions governing their work together. 

**Key Sections:**

* **Pages 31-32:** Introductions - Table of Contents, Contract Face Sheet outlining parties, amount, dates, purpose, and incorporated documents.
* **Pages 33-34:** Financial Terms - Compensation, reimbursement, billing procedures, payment timelines, financial reporting requirements via Access Equity.
* **Pages 35-36:** Insurance -  Mandatory coverage types (general liability, cyber, automobile, professional, fidelity) with specific minimum limits; Fidelity insurance details for both primary contractor and subcontractors.
* **Page 37:** General Terms & Conditions - Key definitions ("Authorized Representative," "Personal Information"), data access provisions, advance payments, amendments, and the principle that the written contract is all-encompassing.
* **Pages 38-39:** Legal & Ethical Requirements -

In [77]:
# # display results
# print(gemma_summary.summary_short)
# print('--')
# print(gemma_summary.summary)


In [112]:
# step in between - need to process the checklist text into sections and descriptions
checklist_source = checklist_dict['claude-3-haiku-20240307'].c_full_response
section_dict = {}
cur_section = None
for l in checklist_source.split('\n'):
    # identify whether line is a section
    # if so, add to dict
    # if not, add to current section
    if len(l)>0:
        if l[0].isnumeric():
            cur_section = ' '.join(l.split()[1:])
            section_dict[cur_section] = []
        else:
            if cur_section is not None:
                section_dict[cur_section].append(l.strip())
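
The parsing rule above (a line starting with a digit opens a new section, and every other non-empty line is attached to the current section) can be checked on a small input. The function below wraps the same logic so it can run without a model response; `parse_sections` and the sample text are made up for illustration.

```python
from typing import Dict, List, Optional


def parse_sections(text: str) -> Dict[str, List[str]]:
    """Group lines under the most recent line that starts with a digit."""
    sections: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in text.split("\n"):
        if not line:
            continue
        if line[0].isnumeric():
            # Drop the leading token (e.g. "1.") and keep the rest as the section title.
            current = " ".join(line.split()[1:])
            sections[current] = []
        elif current is not None:
            sections[current].append(line.strip())
    return sections


# Made-up checklist text, used only to exercise the parser.
sample = """Checklist:
1. Letter of Submittal
- Signed by an authorized representative
2. Cost Proposal
- Itemized budget
- Hourly rates"""

for title, items in parse_sections(sample).items():
    print(title, items)
```

Lines that appear before the first numbered heading are dropped, and two sections with the same title would overwrite each other, so the approach assumes the model returns a cleanly numbered list.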

## RAG workflow

In [60]:
from langchain.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFacePipeline
# this will need to be downloaded from the HF hub
emb_model_name = "sentence-transformers/all-MiniLM-L6-v2"
emb = HuggingFaceEmbeddings(model_name=emb_model_name)



In [71]:
class RAG(object):
    def __init__(self, llm, fn, section_dict):
        self.llm = llm
        self.llm_name = llm.dict()['model']
        self.summarizer = Summarizer(llm, fn)
        self.split_doc = self.summarizer.split_doc
        self.section_dict = section_dict
        self._create_db()
    
    def _create_db(self):
        self.document_db = FAISS.from_texts(self.split_doc, emb)

    def _query_db(self, prompt, k=3):
        relevant_docs = self.document_db.similarity_search(prompt, k=k)
        return relevant_docs

    def _store_req_resp(self, req, resp):
        d = {
            'documents': req,
            'response': resp
            }
        return d
    
    def retrieve_sections(self, k=3):
        # get sections
        section_docs = self._query_db(rag_sections, k=k)
        joined_docs = '\n'.join([d.page_content for d in section_docs])
        section_prompt = get_sections.format(
            document=joined_docs
            )
        section_response = call_llm(section_prompt, self.llm)
        # store for later
        self.sections = self._store_req_resp(joined_docs, section_response)

    


In [72]:
# RAG.__init__ also requires the section_dict built in the parsing step above
r = RAG(gemma, fn, section_dict)
r.retrieve_sections(k=5)

In [73]:
for d in r.sections['documents'].split('\n'):
    print(d)

 
Page | 13 of 48 
 3. PROPOSAL CONTENTS 
 
ELECTRONIC PROPOSALS: 
To be responsive, Proposals must contain all eight items below , written in English, and submitted 
electronically to the RFP Coordinator in the following order: 
1. Letter of Submittal 
2. Certifications and Assurances (Exhibit A to this RFP) 
3. Technical Proposal 
4. Management Proposal 
5. Cost Proposal 
6. Diverse Business Inclusion Plan (Exhibit B to this RFP) 
7. Workers’ Rights Certification (Exhibit C to this RFP) 
8. Small or Veteran-Owned Business Certification (Exhibit D to this RFP) 
 
Proposals must provide information in the same order as presented in this document with the same 
headings. This will not only be helpful to the evaluators of the Proposal, but should also assist the 
Proposer in preparing a thorough response. 
 
Items marked “mandatory” must be included as part of the Proposal to be considered 
responsive, however, these items are not scored. Items marked “scored” are those that are 
awarded