"""
This script demonstrates a complete workflow for setting up a Question-Answering (QA) system using
Hugging Face's models and LangChain's tools. The system extracts text from NASA PDFs, embeds the text
into a vector space, and utilizes a language model for generating responses to user queries.

The script includes the following steps:
1. GPU availability check and setup
2. Installing necessary Python packages
3. Downloading and loading NASA PDF documents
4. Preprocessing the documents (text extraction and chunking)
5. Embedding the text using the instructor-xl model
6. Creating a vector database for efficient retrieval
7. Setting up a language model (Dolly-v2-3b) for generating responses
8. Creating a retrieval QA chain
9. A function to interact with the QA system and display results

Performance Considerations:
- The script checks for GPU availability to leverage faster computations.
- Text embeddings and model loading are configured to utilize GPU resources if available.
- The language model is loaded in 8-bit precision with automatic device placement and
  `low_cpu_mem_usage=True` to reduce memory usage.
"""

In [1]:
# Check GPU availability and install required packages
!nvidia-smi
!pip install --quiet accelerate bitsandbytes chromadb langchain InstructorEmbedding
!pip install -q pypdf sentencepiece tiktoken transformers Xformers
!pip install -U sentence-transformers
!pip install -U langchain-community

Sat Jun 22 08:38:32 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   58C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [2]:
# Standard library imports
import textwrap

# Third-party imports
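from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Community integrations are imported from langchain_community (installed above);
# the old langchain.* import paths for these classes are deprecated.
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFacePipeline
from langchain_community.vectorstores import Chroma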
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

In [3]:
# Step 1: Check available GPUs for computation
if torch.cuda.is_available():
    num_gpus = torch.cuda.device_count()
    for i in range(num_gpus):
        gpu_props = torch.cuda.get_device_properties(i)
        print(f"Device details for GPU {i+1}:")
        print(f"* Name: {gpu_props.name}")
        print(f"* Memory size: {round(gpu_props.total_memory / 1024**3, 2)} GB")
        if i < num_gpus - 1:
            print("-" * 79)
    active_gpu = torch.cuda.current_device()
    active_gpu_props = torch.cuda.get_device_properties(active_gpu)
    print("=" * 79)
    print(f"Currently active GPU device: {active_gpu_props.name}")
    print(f"Memory size: {round(active_gpu_props.total_memory / 1024**3, 2)} GB")
    print("=" * 79)
else:
    print("No GPU devices found.")

Device details for GPU 1:
* Name: Tesla T4
* Memory size: 14.75 GB
Currently active GPU device: Tesla T4
Memory size: 14.75 GB


In [4]:
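# Step 2: Create a directory for NASA PDFs and download them
# Note: nasa.gov has moved some of these files. In the run logged below, the first URL
# returned 404 and the second was redirected to a new location (wget follows redirects).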
!mkdir -p nasa
!wget -P nasa/ https://www.nasa.gov/sites/default/files/atoms/files/naca_to_nasa_to_now_tagged.pdf
!wget -P nasa/ https://www.nasa.gov/sites/default/files/atoms/files/nasa_-_planetary_defense_strategy_-_final-508.pdf
!wget -P nasa/ https://www.nasa.gov/sites/default/files/atoms/files/advancing_nasas_climate_strategy_2023.pdf
!wget -P nasa/ https://www.nasa.gov/sites/default/files/atoms/files/iss_benefits_for_humanity_3rded-508.pdf


--2024-06-22 08:42:15--  https://www.nasa.gov/sites/default/files/atoms/files/naca_to_nasa_to_now_tagged.pdf
Resolving www.nasa.gov (www.nasa.gov)... 192.0.66.108, 2a04:fa87:fffd::c000:426c
Connecting to www.nasa.gov (www.nasa.gov)|192.0.66.108|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2024-06-22 08:42:16 ERROR 404: Not Found.

--2024-06-22 08:42:16--  https://www.nasa.gov/sites/default/files/atoms/files/nasa_-_planetary_defense_strategy_-_final-508.pdf
Resolving www.nasa.gov (www.nasa.gov)... 192.0.66.108, 2a04:fa87:fffd::c000:426c
Connecting to www.nasa.gov (www.nasa.gov)|192.0.66.108|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.nasa.gov/wp-content/uploads/2023/06/nasa_-_planetary_defense_strategy_-_final-508.pdf?emrc=37bb97 [following]
--2024-06-22 08:42:16--  https://www.nasa.gov/wp-content/uploads/2023/06/nasa_-_planetary_defense_strategy_-_final-508.pdf?emrc=37bb97
Reusing existing connection
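
Because the download log above shows at least one request ending in a 404, it can help to confirm which PDFs actually reached the `nasa/` directory before loading them. The following is a minimal sketch using only the standard library. The helper names `expected_filename` and `find_missing_pdfs` are hypothetical, and the expected filenames are derived from the URLs in the `wget` commands.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse

# URLs copied from the wget commands in this notebook.
PDF_URLS = [
    "https://www.nasa.gov/sites/default/files/atoms/files/naca_to_nasa_to_now_tagged.pdf",
    "https://www.nasa.gov/sites/default/files/atoms/files/nasa_-_planetary_defense_strategy_-_final-508.pdf",
    "https://www.nasa.gov/sites/default/files/atoms/files/advancing_nasas_climate_strategy_2023.pdf",
    "https://www.nasa.gov/sites/default/files/atoms/files/iss_benefits_for_humanity_3rded-508.pdf",
]


def expected_filename(url: str) -> str:
    """Return the last path component of a URL, ignoring any query string."""
    return PurePosixPath(urlparse(url).path).name


def find_missing_pdfs(directory: str, urls: list[str]) -> list[str]:
    """Hypothetical helper: list expected PDF names that are absent or empty."""
    folder = Path(directory)
    missing = []
    for url in urls:
        target = folder / expected_filename(url)
        if not target.is_file() or target.stat().st_size == 0:
            missing.append(target.name)
    return missing


if __name__ == "__main__":
    for name in find_missing_pdfs("nasa", PDF_URLS):
        print(f"Missing or empty: {name}")
```

Any name this reports would need to be downloaded again from its current location before running the loading step.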

In [5]:
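# Step 3: Load and process PDF documents
# PyPDFLoader returns one Document per PDF page, so the count printed below
# is the total number of pages, not the number of PDF files.
loader = DirectoryLoader("./nasa/", glob="*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()
print(f"Loaded {len(documents)} documents.")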


Loaded 282 documents.


In [6]:
# Step 4: Split extracted text into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)
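
To make `chunk_size` and `chunk_overlap` concrete, the sketch below splits a string with a plain fixed-size sliding window. It is a simplification and not the `RecursiveCharacterTextSplitter` algorithm, which prefers to break on separators such as paragraph and line boundaries. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks where neighbours share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


if __name__ == "__main__":
    print(sliding_window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4))
    # ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

The overlap means a sentence that straddles a chunk boundary is likely to appear whole in at least one chunk, at the cost of storing some text twice.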

In [7]:
# Step 5: Check if CUDA is available and set the device accordingly
device = "cuda" if torch.cuda.is_available() else "cpu"

In [8]:
# Step 6: Load the instructor-xl model to embed the corpus into vector space
instructor_embeddings = HuggingFaceEmbeddings(
    model_name="hkunlp/instructor-xl",
    model_kwargs={"device": device}
)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [9]:
# Step 7: Create a Chroma vector database from corpus embeddings
vectordb = Chroma.from_documents(documents=texts, embedding=instructor_embeddings, persist_directory="db")
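
As a rough illustration of what retrieval from a vector database involves, the sketch below ranks a few toy vectors by cosine similarity to a query vector and keeps the top `k`. It is a brute-force stand-in: Chroma's own index and distance metric may differ, and the three-dimensional vectors are made up for the example rather than produced by instructor-xl. The names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], corpus: list[tuple[str, list[float]]], k: int = 3):
    """Return the k (score, text) pairs most similar to query_vec, best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy embeddings, invented for illustration only.
    corpus = [
        ("ISS benefits", [0.9, 0.1, 0.0]),
        ("Planetary defense", [0.0, 1.0, 0.0]),
        ("Climate strategy", [0.5, 0.5, 0.0]),
    ]
    for score, text in top_k([1.0, 0.0, 0.0], corpus, k=2):
        print(f"{score:.3f}  {text}")
```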


In [10]:
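# Step 8: Load model, tokenizer, and text generation pipeline
from transformers import BitsAndBytesConfig  # replaces the deprecated load_in_8bit argument

model_name = "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights via bitsandbytes
    device_map="auto",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True
)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=1024,  # counts prompt tokens plus generated tokens
    do_sample=False,  # greedy decoding; temperature and top_p only apply when sampling
    repetition_penalty=1.15
)
local_llm = HuggingFacePipeline(pipeline=pipe)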



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
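
The arithmetic behind loading the model in 8-bit on a single T4 can be sketched as follows. This is a back-of-the-envelope estimate of weight storage only: it ignores activations, the generation cache, and any quantization overhead. The parameter count of roughly 2.8 billion for Dolly-v2-3b is an approximation assumed here, and `approx_weight_gib` is a hypothetical helper.

```python
def approx_weight_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate size of the model weights in GiB for a given precision."""
    return n_params * bits_per_param / 8 / 1024**3


# Assumption: approximate parameter count for databricks/dolly-v2-3b.
DOLLY_V2_3B_PARAMS = 2.8e9

for label, bits in [("float32", 32), ("float16", 16), ("int8", 8)]:
    print(f"{label}: ~{approx_weight_gib(DOLLY_V2_3B_PARAMS, bits):.1f} GiB")
```

Under these assumptions, halving the bits per parameter halves the weight footprint, which leaves more of the GPU's memory for the embedding model and for generation.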




In [11]:
# Step 9: Setup Q&A retrieval chain
retriever = vectordb.as_retriever(search_kwargs={"k": 3})
qa_chain = RetrievalQA.from_chain_type(
    llm=local_llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)
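
With `chain_type="stuff"`, the retrieved chunks are concatenated and placed into a single prompt ahead of the question. The sketch below shows that idea in plain Python. The instruction sentence matches the prompt text visible in this notebook's captured outputs; the `Question:` / `Helpful Answer:` tail and the blank-line separator between chunks are assumptions about LangChain's default template, and `build_stuff_prompt` is a hypothetical name.

```python
STUFF_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"  # assumed tail of the default prompt
)


def build_stuff_prompt(chunks: list[str], question: str) -> str:
    """Join the retrieved chunks and insert them, with the question, into one prompt."""
    context = "\n\n".join(chunks)
    return STUFF_TEMPLATE.format(context=context, question=question)


if __name__ == "__main__":
    retrieved = [
        "Around 1,000 NEOs greater than one kilometer in size ...",
        "Around 25,000 objects larger than 140 meters in size ...",
    ]
    print(build_stuff_prompt(retrieved, "What size of meteors could threaten humans?"))
```

Because every retrieved chunk goes into one prompt, `k` and `chunk_size` together determine how much of the model's `max_length` budget is consumed before any answer is generated.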

In [12]:
# Function to query the QA chain and print a wrapped, readable answer
def InsightfulQuery(width: int = 100) -> None:
    """
    Prompt the user for a question, pass it through the Q&A retrieval chain
    (retriever plus local LLM), and print the answer wrapped to `width`
    columns, followed by the source file and page of each retrieved chunk.

    Args:
        width (`int`, optional): Maximum line width for the formatted response.
            Defaults to 100.
    """
    query = input("Question: ")
    print("=" * width)
    # invoke() replaces calling the chain object directly, which is deprecated
    llm_response = qa_chain.invoke({"query": query})
    # Wrap each line separately so the line breaks in the answer are preserved
    response = llm_response["result"].split("\n")
    response = [textwrap.fill(line, width=width) for line in response]
    wrapped_response = "\n".join(response)
    print(wrapped_response)
    print("-" * width)
    for source in llm_response["source_documents"]:
        print(f"{source.metadata['source']} - page {source.metadata['page']}")
    print("-" * width)
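
The captured outputs in this notebook begin with the prompt itself rather than the answer, which suggests the pipeline returns the prompt together with the generated text. One possible post-processing step, sketched below, keeps only what follows the answer marker. The marker string `"Helpful Answer:"` is an assumption about the prompt template (the captured outputs are truncated before any such marker), and `strip_prompt_echo` is a hypothetical helper.

```python
def strip_prompt_echo(generated: str, marker: str = "Helpful Answer:") -> str:
    """Return the text after the first occurrence of marker, or the whole text if absent."""
    _, found, tail = generated.partition(marker)
    return tail.strip() if found else generated.strip()


if __name__ == "__main__":
    echoed = (
        "Use the following pieces of context to answer the question at the end.\n\n"
        "Some retrieved context.\n\n"
        "Question: What size of meteors could threaten humans?\n"
        "Helpful Answer: Objects larger than 140 meters could cause regional devastation."
    )
    print(strip_prompt_echo(echoed))
```

Applied to `llm_response["result"]` before wrapping, a helper like this would leave only the model's answer in the printed output, provided the marker really is present.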



In [14]:
InsightfulQuery()

Question: What are the benefits of international space?




Use the following pieces of context to answer the question at the end. If you don't know the answer,
just say that you don't know, don't try to make up an answer.

ixExecutive Summary
The third edition of the International Space Station Benefits for Humanity is a compilation of
benefits being
realized from International Space Station (ISS) activities in the areas of human health, Earth
observations and
disaster response, innovative technology, global education, and economic development of space. This
revision
also includes new assessments of economic value and scientific value in more detail than the second
edition. The third edition contains updated statistics on the impacts of the benefits as well as new
benefits that have developed since the previous publication. International Space Station Benefits
for Humanity is a product
of the ISS Program Science Forum (PSF), which consists of senior science representatives across the
ISS
international partnership.
With respect to economic valu

In [15]:
InsightfulQuery()

Question: What size of meteors could threaten humans?




Use the following pieces of context to answer the question at the end. If you don't know the answer,
just say that you don't know, don't try to make up an answer.

NASA Planetary Defense Strategy And Action Plan | 2The threat exists because our planet orbits the
Sun amidst millions of objects that cross our orbit –
asteroids and comets. Even a rare interstellar asteroid or comet from outside our solar system can
enter
Earth’s neighborhood.
Characteristics of the estimated NEO population:
• Around 1,000 NEOs greater than one kilometer in size that are potentially capable of causing
global impact effects. Approximately 95 percent of these bodies have been found and none are a
current threat.
• Around 25,000 objects larger than 140 meters in size, capable of causing regional devastation,
are believed to exist. Less than 50 percent have been detected and tracked to date.
• An estimated 230,000 or more objects exist that are equal to or larger than 50 meters in size and
could destroy a conc