
# Customizing Large Language Models with Additional Input

## Table of Contents

1. [Customizing Large Language Models](#introduction)
2. [Question-Answering LLMs](#qa)
3. [Setting up the Environment](#setup)
4. [Paper-QA](#paper)
5. [Demo](#demo)

---
        


## 1. Customizing Large Language Models <a name="introduction"></a>

Customizing Large Language Models (LLMs) with additional data tailors their
capabilities to specific tasks or domains. One common approach, "fine-tuning,"
continues training the model on a new dataset related to the task at hand, so
that the model adjusts its internal parameters and aligns its language
generation with that task. For instance, you might fine-tune a general-purpose
language model on medical literature to create a model that excels at answering
medical questions, or on customer support transcripts to create a chatbot that
understands the language and issues of a particular product or service.
Fine-tuning lets us build on LLMs that were trained on vast amounts of data
while still producing models specialized for a domain or task. A lighter-weight
alternative, which this tutorial demonstrates with Paper QA, leaves the model's
parameters unchanged and instead supplies relevant documents as additional
input at query time.
 
## 2. Question-Answering LLMs <a name="qa"></a>

Question-Answering (QA) Large Language Models are an application of LLMs in which the model answers questions based on provided context or on broad knowledge learned during training. Because they can interpret a wide range of questions, they are useful in applications like chatbots, virtual assistants, and customer service automation.

Some QA models generate answers from a specific piece of text or a set of documents, while others answer from general knowledge. The latter, known as "open-domain" QA models, can address questions on virtually any topic by drawing on the information they were trained on. General-purpose models such as GPT-3 by OpenAI and T5 by Google have been used for open-domain QA and have opened up new possibilities for AI-powered question answering.


## 3. Setting up the Environment <a name="setup"></a>

Before we start coding, we need to install the necessary libraries. This can be done by running the following commands in your Jupyter notebook:
        

In [1]:
%pip install paper-qa        

Note: you may need to restart the kernel to use updated packages.


In [2]:
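import os
from getpass import getpass

# Prompt for the key without echoing it, so it never appears in the notebook output
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    api_key = getpass("OPENAI_API_KEY is not defined, please enter it: ")
    os.environ["OPENAI_API_KEY"] = api_key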

OPENAI_API_KEY is not defined, please enter it:  <redacted>


In [3]:

import paperqa
print('PaperQA version:', paperqa.__version__)

# Required for Jupyter Notebook
import nest_asyncio
nest_asyncio.apply()    

PaperQA version: 3.5.0



## 4. Paper QA <a name="paper"></a>

Paper QA is a minimal package for question answering over PDFs, HTML, or raw
text files. It aims to give very good answers, with no hallucinations, by grounding responses with in-text citations.

By default, it uses [OpenAI Embeddings](https://platform.openai.com/docs/guides/embeddings) with a vector DB called [FAISS](https://github.com/facebookresearch/faiss) to embed and search documents. However, via [langchain](https://github.com/hwchase17/langchain) you can use open-source models or embeddings (see details below).

PaperQA uses the process shown below:

1. embed docs into vectors
2. embed query into vector
3. search for top k passages in docs
4. create summary of each passage relevant to query
5. put summaries into prompt
6. generate answer with prompt
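
The steps above can be illustrated with a small, self-contained sketch. It is not Paper QA's implementation: the bag-of-words `embed` function stands in for a learned embedding model, `summarize` stands in for an LLM call, and all function names and example passages are hypothetical. It only shows how embedding, top-k search, summarization, and prompt assembly fit together.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_passages(query, passages, k=2):
    """Steps 1-3: embed the passages and the query, return the k most similar passages."""
    query_vec = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(query_vec, embed(p)), reverse=True)
    return ranked[:k]


def summarize(passage, query):
    """Step 4 placeholder: a real system would ask an LLM to summarize the passage
    with respect to the query; here the first sentence is returned instead."""
    return passage.split(". ")[0].rstrip(".") + "."


def build_prompt(query, summaries):
    """Step 5: place the summaries in the prompt as numbered context."""
    context = "\n".join(f"[{i}] {s}" for i, s in enumerate(summaries, start=1))
    return (
        "Answer using only the context below and cite sources as [n].\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


# Hypothetical example passages, made up for this sketch
passages = [
    "Solar panels convert sunlight into electricity. They are often installed on rooftops.",
    "Wind turbines generate electricity from moving air. Offshore farms are growing quickly.",
    "Sourdough bread needs a starter culture. Baking takes several hours.",
]
query = "How do solar panels make electricity?"
selected = top_k_passages(query, passages, k=2)
prompt = build_prompt(query, [summarize(p, query) for p in selected])
print(prompt)  # Step 6 would send this prompt to an LLM
```

Because the answer is generated only from the retrieved context, a question about material that was never added has nothing to ground on, which is why the demo's first query returns "insufficient information."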

## 5. Demo <a name="demo"></a>
        

### Before adding a document

In [3]:
from paperqa import Docs

docs = Docs(llm='gpt-4')
answer = docs.query("What is Justice40 initiative?")
print(answer.formatted_answer)


Question: What is Justice40 initiative?

I cannot answer this question due to insufficient information.



### Add a document describing the Justice40 initiative

In [5]:
import time
start = time.time()
docs.add_url('https://www.whitehouse.gov/environmentaljustice/justice40/')
end = time.time()
print(f'Adding the document took {(end-start):.2f} seconds')


Adding the document took 17.89 seconds
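
The `start`/`end` timing pattern recurs in several cells of this notebook. As an optional convenience, it could be wrapped in a small context manager. The helper name `timed` is hypothetical and is not part of Paper QA; the `time.sleep` call only stands in for a slow operation such as adding a document.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Measure the wall-clock time of the enclosed block and print it."""
    result = {"label": label, "seconds": None}
    start = time.perf_counter()
    try:
        yield result
    finally:
        # Record the elapsed time even if the block raised an exception
        result["seconds"] = time.perf_counter() - start
        print(f"{label} took {result['seconds']:.2f} seconds")


# Usage: wrap any slow call; time.sleep is a placeholder here
with timed("Sleeping") as t:
    time.sleep(0.05)
```

The measured duration stays available afterwards as `t["seconds"]`, which helps when comparing several runs.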


### After adding the document


In [None]:
start = time.time()
answer = docs.query("What is Justice40 initiative?")
end = time.time()
print(answer.formatted_answer)
print(f'Query took {(end-start):.2f} seconds')


### Running a local model

In [4]:
%pip install llama-cpp-python

Note: you may need to restart the kernel to use updated packages.


In [5]:
from paperqa import Docs
from langchain.llms import LlamaCpp
from langchain.embeddings import LlamaCppEmbeddings
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# Candidate models; uncomment the one to use.
# Make sure the model path is correct for your system!
# modelpath = "/global/cfs/cdirs/m4388/Project1-AI4EJ/Tutorials/models/alpaca-native-7B-ggml/ggml-model-q8_0.bin"
# modelpath = "/global/cfs/cdirs/m4388/Project1-AI4EJ/Tutorials/models/Llama-2-7B-Chat-GGML/llama-2-7b-chat.ggmlv3.q8_0.bin"
modelpath = "/global/cfs/cdirs/m4388/Project1-AI4EJ/Tutorials/models/Llama-2-7B-Chat-GGML/llama-2-7b-chat.ggmlv3.q4_1.bin"

# Stream generated tokens to stdout; n_ctx sets the context window size in tokens
llm = LlamaCpp(
    model_path=modelpath, callbacks=[StreamingStdOutCallbackHandler()], n_ctx=4096
)
# Use the same local model to compute embeddings
embeddings = LlamaCppEmbeddings(model_path=modelpath)
docs = Docs(llm=llm, embeddings=embeddings)
# No documents have been added yet, so no grounded answer is expected here
answer = docs.query("What is Justice40 initiative?")
print(answer.formatted_answer)

llama.cpp: loading model from /global/cfs/cdirs/m4388/Project1-AI4EJ/Tutorials/models/Llama-2-7B-Chat-GGML/llama-2-7b-chat.ggmlv3.q4_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 4096
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: mem required  = 4543.35 MB (+ 2048.00 MB per 

Question: What is Justice40 initiative?

I cannot answer this question due to insufficient information.



llama_new_context_with_model: kv self size  =  512.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 


In [6]:
import time
start = time.time()
docs.add_url('https://www.whitehouse.gov/environmentaljustice/justice40/')
end = time.time()
print(f'Adding the document took {(end-start):.2f} seconds')

answer = docs.query("What is Justice40 initiative?")
print(answer)


Please provide the citation for the HTML code in MLA format for the text provided above. 


llama_print_timings:        load time =   573.88 ms
llama_print_timings:      sample time =    15.59 ms /    22 runs   (    0.71 ms per token,  1411.52 tokens per second)
llama_print_timings: prompt eval time = 294232.23 ms /  3183 tokens (   92.44 ms per token,    10.82 tokens per second)
llama_print_timings:        eval time = 10999.29 ms /    21 runs   (  523.78 ms per token,     1.91 tokens per second)
llama_print_timings:       total time = 306135.76 ms
llama_tokenize_with_model: too many tokens


ValueError: could not broadcast input array from shape (8,) into shape (0,)