# **Deep Learning Assignment 3 - BDS 2nd semester**

Made by: Mikkel Ørts Nielsen and Annika i Jakupsstovu

# **Information about the assignment**

## **Task Description**

Your task is to create a system that uses RAG to extract information from a document or a set of documents, which can be scientific papers or reports. This involves integrating a database to store vectors of document information and designing customized prompts to effectively use GPT models for generation. Here are some project ideas:

*  Build a QA system that retrieves information from a given set of documents (or a document) to answer complex queries.
*  Develop a tool for summarizing research papers, where the system extracts key points from a database of paper vectors.
*  Create a recommendation engine that suggests content based on user queries and retrieved document data.
*  Explore other innovative applications of RAG, such as automated content generation, data analysis, or any other creative use case you can envision.

## **Key Components**

*  Database Integration: Set up a database to store and retrieve vectors representing document information.
*  Customized Prompts: Design and implement prompts that effectively utilize GPT models for generation based on retrieved data.
*  RAG Implementation: Use LangChain to integrate retrieval-augmented generation in your system.

## **Data**

*  Utilize open-source datasets or create your own corpus of documents for retrieval.
*  Ensure the chosen datasets are suitable for demonstrating the capabilities of your RAG system.

## **Delivery**

*  Create a dedicated GitHub repository for this assignment.
*  Store all relevant materials, including the Colab notebook, in the repository.
*  Provide a README.md file with a concise description of the assignment and its components.
*  You may work individually or in groups of up to three members.
*  Submit your work by emailing a link to the repository to Hamid (hamidb@business.aau.dk).

# **Our Assignment**

We've chosen to work with the scientific paper "Using sequences of life-events to predict human lives" by Savcisens et al. The paper is available for download in the GitHub repository.

We've used the paper to build a Q&A system that uses RAG to retrieve information from the paper and answer any questions you may have.

## **Loading data**

We'll use PyPDFLoader to load the paper. First, we install accelerate:

In [None]:
!pip install accelerate --q


Then we restart the kernel and install the remaining dependencies:

In [None]:
!pip install pypdf --q
!pip install -qqq chromadb==0.4.10 --progress-bar off
!pip install -Uqqq pip --progress-bar off
# !pip install -qqq torch==2.0.1 --progress-bar off
!pip install -qqq langchain==0.0.299 --progress-bar off
!pip install -qqq xformers==0.0.21 --progress-bar off
!pip install -qqq sentence_transformers==2.2.2 --progress-bar off
!pip install -qqq tokenizers==0.14.0 --progress-bar off
!pip install -qqq optimum==1.13.1 --progress-bar off
!pip install -qqq auto-gptq==0.4.2 --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ --progress-bar off
!pip install -qqq unstructured==0.10.16 --progress-bar off


Next, we'll import the packages and load the paper:

In [None]:
from langchain.document_loaders import UnstructuredMarkdownLoader
from langchain.document_loaders import PyPDFLoader
from langchain.llms import HuggingFaceHub

loader = PyPDFLoader("/content/life2vec.pdf")  # Upload the paper to Colab before running this cell

docs = loader.load()

# Check how many pages have been loaded
len(docs)

17

## **Preprocessing**

Before implementing RAG, we need to do some preprocessing.

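First we'll divide the paper into smaller chunks using RecursiveCharacterTextSplitter. We need to do this because an LLM can only process a limited number of tokens at once. Each chunk holds at most 1024 characters, and neighbouring chunks overlap by up to 64 characters so that text cut at a boundary still appears with some context.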

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=64)
texts = text_splitter.split_documents(docs)
len(texts)

104
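
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a minimal sketch of overlapping chunking. It is not how RecursiveCharacterTextSplitter works internally (that splitter also tries to break on separators such as paragraphs and sentences); the function `split_with_overlap` is a hypothetical helper that simply slides a fixed character window over the text.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is the role the 64-character overlap plays in our splitter.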

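Then we create embeddings for the chunks using HuggingFaceEmbeddings with the thenlper/gte-large model. As a quick check, we embed the first chunk and print the length of the resulting vector: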

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="thenlper/gte-large",
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True},
)

query_result = embeddings.embed_query(texts[0].page_content)
print(len(query_result))


1024
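
We set `normalize_embeddings` to True, which scales every vector to unit length. The sketch below shows why that is convenient: for unit-length vectors, the plain dot product equals the cosine similarity. The two-dimensional vectors `a` and `b` are made-up toy values, not real embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector so that its Euclidean length is 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed on the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]  # toy vectors, assumed for illustration only
b = [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

print(cosine(a, b))         # cosine similarity of the raw vectors
print(dot(a_unit, b_unit))  # dot product after normalization
```

Both printed values agree up to floating-point rounding (24/25 for these toy vectors).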


## **Creating database**

We also need a database that stores the embeddings and lets us search them by similarity. For this we'll use Chroma.

In [None]:
from langchain.vectorstores import Chroma

db = Chroma.from_documents(texts, embeddings, persist_directory="db")
results = db.similarity_search("Transformer models", k=2)
print(results[0].page_content)

Nature Computational Science | Volume 4 | January 2024 | 43–56 44
Article https://doi.org/10.1038/s43588-023-00573-5discussed in the following, we filter the dataset, focusing on the period 
2008–2016 and an age-limited subset of individuals.
The raw stream of temporal data has traditionally posed substantial 
methodological challenges, such as irregular sampling rates, sparsity, 
complex interactions between features, and a large number of dimen-
sions28. Classical methods for time-series analysis29,30 become cum -
bersome because they are challenging to scale, inflexible, and require 
considerable preprocessing. Transformer methods allow us to avoid 
hand-crafted features and instead encode the data in a way that exploits 
the similarity to language15,18. Further, transformers are well-suited for 
representing life-sequences due to their ability to compress contextual 
information13,31 and take into account temporal and positional informa -
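
The call `similarity_search(..., k=2)` returns the two stored chunks whose embeddings are closest to the query embedding. Below is a minimal sketch of that idea as a brute-force search over an in-memory list. It is not Chroma's implementation: the function `top_k`, the three-dimensional vectors, and the use of a dot-product score are all assumptions made for illustration.

```python
import heapq


def top_k(query_vec, store, k=2):
    """Return the k (score, text) pairs with the highest dot-product score."""
    scored = (
        (sum(q * d for q, d in zip(query_vec, vec)), text)
        for text, vec in store
    )
    return heapq.nlargest(k, scored)


# Hypothetical store: (chunk text, hand-made toy vector), not real embeddings
store = [
    ("transformers encode life sequences", [0.9, 0.1, 0.0]),
    ("dataset covers 2008-2016", [0.1, 0.9, 0.0]),
    ("mortality prediction results", [0.2, 0.2, 0.9]),
]
query = [1.0, 0.0, 0.0]  # toy query vector pointing along the first axis

hits = top_k(query, store, k=2)
for score, text in hits:
    print(score, text)
```

A vector database does the same job at scale and typically uses an index, so it does not have to compare the query against every stored vector.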


## **Loading the model and setting configurations**

In [None]:
import torch
from langchain import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline

MODEL_NAME = "TheBloke/Llama-2-7B-Chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, trust_remote_code=True, device_map="auto"
)

# Create a configuration for text generation based on the specified model name
generation_config = GenerationConfig.from_pretrained(MODEL_NAME)

# Set the maximum number of new tokens in the generated text to 1024.
# This limits the length of the generated output to 1024 tokens.
generation_config.max_new_tokens = 1024

# Set the temperature for text generation. Lower values (e.g., 0.0001) make output more deterministic, following likely predictions.
# Higher values make the output more random.
generation_config.temperature = 0.0001

# Set the top-p sampling value. A value of 0.95 means focusing on the most likely words that make up 95% of the probability distribution.
generation_config.top_p = 0.95

# Enable text sampling. When set to True, the model randomly selects words based on their probabilities, introducing randomness.
generation_config.do_sample = True

# Set the repetition penalty. A value of 1.15 discourages the model from repeating the same words or phrases too frequently in the output.
generation_config.repetition_penalty = 1.15


# Create a text generation pipeline using the initialized model, tokenizer, and generation configuration
text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=generation_config,
)

# Create a LangChain pipeline that wraps the text generation pipeline and set a specific temperature for generation
llm = HuggingFacePipeline(pipeline=text_pipeline, model_kwargs={"temperature": 0})
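
To make the `temperature` and `top_p` settings more concrete, here is a minimal sketch of how they reshape a next-token distribution. It is a simplified illustration, not the transformers implementation: the functions `softmax_with_temperature` and `top_p_filter` are hypothetical helpers, and the four logits are made-up values.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a low temperature sharpens the distribution."""
    scaled = [logit / temperature for logit in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their total mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for four candidate tokens

warm = softmax_with_temperature(logits, 1.0)
cold = softmax_with_temperature(logits, 0.0001)
nucleus = top_p_filter(warm, 0.95)

print(warm)
print(cold)
print(nucleus)
```

With a temperature as low as 0.0001, practically all probability mass lands on the most likely token, so sampling behaves almost like always picking the top candidate even though `do_sample` is enabled.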


## **Customizing prompts**

In [None]:
from langchain.chains import RetrievalQA
from langchain import PromptTemplate

template = """
<s>[INST] <<SYS>>
Act as a ML expert. Use the following information to answer the question at the end.
<</SYS>>

{context}

{question} [/INST]
"""

prompt = PromptTemplate(template=template, input_variables=["context", "question"])


qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(search_kwargs={"k": 2}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt},
)

result = qa_chain(
    "how does life2vec predict the time of death? Explain like I am five."
)
print(result["result"].strip())

Hi there! So, you want to know how life2vec predicts when someone will die? Well, let me tell you! 😊
Life2vec is like a super smart computer program that can look at lots of things about a person, like their age, gender, where they live, and more. It then uses all those things to try to figure out if that person will die soon or not. But don't worry, it doesn't just say "Oh no, Person X will definitely die tomorrow!" Instead, it gives a number from 0 to 1, where 0 means the person will definitely not die soon, and 1 means they will probably die soon. 🤔
So, how does it do this magic trick? Life2vec looks at lots of little pieces of information, called "tokens," about the person. These tokens might be things like their age, gender, job, or even what kind of phone they have! 📱 Then, it puts all those tokens together into a big list, called a "sequence." Think of it like a long sentence about the person, like "Person X is 35 years old, lives in New York City, and works as an accountant." 🗨
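
The "stuff" chain type places all retrieved chunks into the `{context}` slot of our template and sends the whole prompt to the model in a single call. The sketch below reproduces that step with plain string formatting. The function `build_stuff_prompt` is a hypothetical helper, the two chunks are made-up placeholders, and joining chunks with a blank line is an assumption about the separator.

```python
TEMPLATE = """
<s>[INST] <<SYS>>
Act as a ML expert. Use the following information to answer the question at the end.
<</SYS>>

{context}

{question} [/INST]
"""


def build_stuff_prompt(chunks, question):
    """Join all retrieved chunks into one context string and fill the template."""
    context = "\n\n".join(chunks)  # assumed separator between chunks
    return TEMPLATE.format(context=context, question=question)


# Placeholder chunks standing in for the k=2 documents the retriever returns
retrieved = [
    "First retrieved chunk of the paper.",
    "Second retrieved chunk of the paper.",
]
prompt_text = build_stuff_prompt(retrieved, "How does life2vec represent a person?")
print(prompt_text)
```

Because every chunk ends up in one prompt, the number of retrieved chunks and the chunk size together have to fit within the model's context window.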

## **Testing another prompt**

In [None]:
from textwrap import fill

result = qa_chain(
    "Summarize the practical use of life2vec in 2-3 sentences."
)
print(fill(result["result"].strip(), width=80))

Life2vec is a machine learning model that generates a compact summary of a
person's entire sequence of life events, such as medical history, employment
status, and living situation, into a single vector representation. This vector
representation captures the essential aspects of an individual's life and can be
used to predict various outcomes, such as mortality likelihood, with great
accuracy. By exploring the structure of the space of person-summaries, we can
gain insights into which factors drive a certain prediction, revealing how
life2vec uses information from the concept space to make predictions.
