
# Customizing Large Language Models with Additional Input

## Table of Contents

1. [Customizing Large Language Models](#introduction)
2. [Question-Answering LLMs](#qa)
3. [Setting up the Environment](#setup)
4. [PaperQA](#paper)
5. [Demo](#demo)

---
        


## 1. Customizing Large Language Models <a name="introduction"></a>

Customizing Large Language Models (LLMs) with additional data is a powerful way
to tailor their capabilities to specific tasks or domains. One approach,
"fine-tuning," continues training the model on a new dataset related to the
task at hand. The new data adjusts the model's internal parameters so that its
language generation better matches the desired task. For instance, you might
fine-tune a general-purpose language model on medical literature to create a
model that excels at answering medical questions, or on customer support
transcripts to create a chatbot that understands the language and issues of a
particular product or service. A second approach leaves the model's parameters
unchanged and instead supplies relevant documents as additional input at query
time, so that the model answers from the provided text. Both approaches
leverage LLMs that have been trained on vast amounts of data while still
producing results specialized to a domain. This notebook demonstrates the
second approach.
 
## 2. Question-Answering LLMs <a name="qa"></a>

Question-Answering (QA) Large Language Models are LLMs applied, and often fine-tuned, to answer questions based either on provided context or on broad knowledge learned during training. These models can interpret a wide range of questions, which makes them useful in applications such as chatbots, virtual assistants, and customer service automation.

Some QA models generate answers from a specific piece of text or a set of documents. Others, known as "open-domain" QA models, answer questions on virtually any topic by drawing on the information they were trained on. General-purpose models such as GPT-3 by OpenAI and T5 by Google are commonly used for open-domain QA. These models have significantly advanced natural language understanding and opened up new possibilities for AI-powered question answering.
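In practice, the difference between answering from provided text and answering from general knowledge comes down to what is placed in the prompt. The following is a minimal sketch, assuming a hypothetical helper `build_qa_prompt` (not part of any library) and an illustrative prompt wording:

```python
def build_qa_prompt(question, context=None):
    """Return a prompt for open-domain or context-grounded QA.

    Hypothetical helper for illustration; the wording is an assumption.
    """
    if context is None:
        # Open-domain: the model relies on what it learned during training.
        return f"Question: {question}\nAnswer:"
    # Context-grounded: the model is told to answer only from the supplied text.
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


open_prompt = build_qa_prompt("What does the report recommend?")
grounded_prompt = build_qa_prompt(
    "What does the report recommend?",
    context="(text of a relevant document would go here)",
)
print(grounded_prompt)
```

The resulting string would then be sent to whichever LLM is in use; that call is omitted here because it depends on the provider.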


## 3. Setting up the Environment <a name="setup"></a>

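Before we start coding, we need to install the necessary library and confirm which version is installed. This can be done by running the following cells in your Jupyter notebook. The examples below were run with PaperQA 3.3.3; other versions may expose a different API.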
        

In [None]:

!pip install paperqa        

In [1]:

import paperqa
print('PaperQA version:', paperqa.__version__)
        

PaperQA version: 3.3.3



## 4. PaperQA <a name="paper"></a>

PaperQA is a minimal package for question answering over PDFs, HTML, or raw text files. It aims to give accurate answers and to avoid hallucinations by grounding responses in in-text citations.

By default, it uses [OpenAI Embeddings](https://platform.openai.com/docs/guides/embeddings) with a vector DB called [FAISS](https://github.com/facebookresearch/faiss) to embed and search documents. However, via [langchain](https://github.com/hwchase17/langchain) you can use open-source models or embeddings.

PaperQA uses the process shown below:

1. embed docs into vectors
2. embed query into vector
3. search for top k passages in docs
4. create summary of each passage relevant to query
5. put summaries into prompt
6. generate answer with prompt
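
The steps above can be sketched in a few lines of plain Python. This is a toy illustration, not PaperQA's implementation: it assumes a bag-of-words count vector in place of a learned embedding, an exhaustive cosine-similarity scan in place of a FAISS index, and made-up passages. The function names (`embed`, `cosine`, `top_k`, `build_prompt`) are hypothetical, and the two LLM calls (steps 4 and 6) are left out.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts (stand-in for a learned embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, passages, k=2):
    """Steps 1-3: embed passages and query, return the k most similar passages."""
    query_vec = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(query_vec, embed(p)), reverse=True)
    return ranked[:k]


def build_prompt(query, summaries):
    """Step 5: place the retrieved summaries into the prompt for the LLM."""
    context = "\n".join(f"- {s}" for s in summaries)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


# Made-up passages for illustration only.
passages = [
    "Justice40 directs 40 percent of the benefits of certain federal investments to disadvantaged communities.",
    "FAISS is a library for efficient similarity search over dense vectors.",
    "The interim guidance covers clean energy, clean transportation, and clean water infrastructure.",
]
query = "What share of federal investments benefits does Justice40 direct?"

retrieved = top_k(query, passages, k=2)
# Step 4 (summarizing each passage with an LLM) is skipped; passages are used as-is.
prompt = build_prompt(query, retrieved)
# Step 6 would send `prompt` to an LLM; this sketch only prints it.
print(prompt)
```

Because only the retrieved passages reach the prompt, the quality of the final answer depends heavily on the retrieval step, which is why a real system uses learned embeddings rather than word counts.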

## 5. Demo <a name="demo"></a>
        

### Before adding the document

In [4]:
from paperqa import Docs

# Required for Jupyter Notebook
import nest_asyncio
nest_asyncio.apply()

docs = Docs(llm='gpt-4')
answer = docs.query("What is Justice40 initiative?")
print(answer.formatted_answer)


Question: What is Justice40 initiative?

I cannot answer this question due to insufficient information.



### Add a document describing the Justice40 initiative

In [5]:
import time
start = time.time()
docs.add('/home/keceli/data/papers/M-21-28.pdf')
end = time.time()
print(f'Adding the document took {(end-start):.2f} seconds')
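
The start/end timing pattern in this cell recurs in the demo, so it could be wrapped in a small context manager. This is an optional sketch, not part of PaperQA; `timed` is a hypothetical helper name, and `time.sleep` stands in for the real workload.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print how long the enclosed block took, prefixed with `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label} took {elapsed:.2f} seconds")


# Placeholder workload; in the notebook the body would be the
# docs.add(...) or docs.query(...) call.
with timed("Sleeping"):
    time.sleep(0.1)
```

`time.perf_counter()` is used instead of `time.time()` because it is a monotonic clock intended for measuring short durations.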


### After adding the document


In [6]:
start = time.time()
answer = docs.query("What is Justice40 initiative?")
end = time.time()
print(answer.formatted_answer)
print(f'Query took {(end-start):.2f} seconds')


Question: What is Justice40 initiative?

The Justice40 Initiative is a U.S. federal program aimed at securing environmental justice and economic opportunity for disadvantaged communities historically burdened by pollution and underinvestment. The initiative, outlined in Executive Order 14008, seeks to ensure that 40% of the overall benefits of certain federal investments flow to these communities (Interim2021 pages 1-2). It involves various agencies and programs, including the EPA, HHS, HUD, and USDA, among others, which are directed to align their policies, practices, and procedures with the goals of the initiative (Interim2021 pages 12-13). The initiative focuses on seven areas: climate change, clean energy and energy efficiency, clean transportation, affordable and sustainable housing, training and workforce development, remediation and reduction of legacy pollution, and critical clean water and waste infrastructure (Interim2021 pages 3-4).

References

1. (Interim2021 pages 12-13):