## Personalized Learning/Research Assistant
> A RAG-based, NLP-powered tool that delivers tailored answers, summaries, and recommendations from user-uploaded study materials or research papers.

In [3]:
!pip install transformers accelerate huggingface_hub==0.25.0
!pip install torch bitsandbytes
!pip install einops==0.6.1 langchain==0.0.300 xformers==0.0.21 \
 sentence_transformers==2.2.2 chromadb==0.4.12

Collecting huggingface_hub==0.25.0
  Downloading huggingface_hub-0.25.0-py3-none-any.whl.metadata (13 kB)
Downloading huggingface_hub-0.25.0-py3-none-any.whl (436 kB)
Installing collected packages: huggingface_hub
  Attempting uninstall: huggingface_hub
    Found existing installation: huggingface-hub 0.29.0
    Uninstalling huggingface-hub-0.29.0:
      Successfully uninstalled huggingface-hub-0.29.0
Successfully installed huggingface_hub-0.25.0
Collecting bitsandbytes
  Downloading bitsandbytes-0.45.3-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Downloading bitsandbytes-0.45.3-py3-none-manylinux_2_24_x86_64.whl (76.1 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsan

**Chroma**
> **Module:** langchain.vectorstores

> **Purpose:** Chroma is a vector store that allows you to store and retrieve high-dimensional vectors. It is used to store embeddings of documents or text and retrieve them based on similarity searches.

> **Usage:** Typically used in applications that require efficient similarity searches, such as document retrieval or question-answering systems.
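
The similarity search behind a vector store can be sketched in a few lines. This is only an illustration of the idea: the `store` list, the three-dimensional vectors, and the helper names are made up here. Cosine similarity is used as one common metric, and Chroma's own indexing and default distance function are not reproduced.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the k (text, score) pairs most similar to query_vec.

    `store` is a hypothetical list of (text, vector) pairs standing in
    for a vector store; it is not Chroma's actual data structure.
    """
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 3-dimensional "embeddings" (made up for illustration).
store = [
    ("attention mechanism", [0.9, 0.1, 0.0]),
    ("recurrent networks", [0.1, 0.9, 0.0]),
    ("training schedule", [0.0, 0.1, 0.9]),
]
results = top_k([1.0, 0.2, 0.0], store, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

A real store holds vectors with hundreds of dimensions and typically uses an approximate index rather than scoring every entry.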

**HuggingFaceEmbeddings**
> **Module:** langchain.embeddings

> **Purpose:** HuggingFaceEmbeddings provides a way to generate embeddings using models from the Sentence Transformers library. These embeddings are numerical representations of text that capture semantic meaning.

> **Usage:** Used to convert text into embeddings that can be stored in a vector store like Chroma for similarity searches.

**RetrievalQA**
> **Module:** langchain.chains

> **Purpose:** RetrievalQA is a chain that enables question-answering over a document database by retrieving relevant documents and passing them to an LLM for answering queries.

> **Usage:** Used in applications where answering user queries based on a set of stored documents is required, such as knowledge-based chatbots and document search systems.
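
The retrieve-then-answer flow can be sketched without any library. Everything below is a stand-in chosen for illustration: the word-overlap retriever, the prompt wording, and `fake_llm` are assumptions, not LangChain's implementation or its default prompt. The `build_stuff_prompt` step mirrors the idea of the "stuff" chain type, which places all retrieved chunks into a single prompt.

```python
def retrieve(query, documents, k=2):
    """Rank documents by how many query words they share (toy retriever)."""
    query_words = set(query.lower().split())

    def overlap(doc):
        return len(query_words & set(doc.lower().split()))

    return sorted(documents, key=overlap, reverse=True)[:k]

def build_stuff_prompt(query, context_docs):
    """Concatenate ("stuff") all retrieved chunks into one prompt."""
    context = "\n\n".join(context_docs)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

def answer(query, documents, llm, k=2):
    """Retrieve, build the prompt, and hand it to the language model."""
    prompt = build_stuff_prompt(query, retrieve(query, documents, k))
    return llm(prompt)

def fake_llm(prompt):
    """Stand-in for a real model: only reports the prompt size."""
    return f"(model received {len(prompt)} characters)"

docs = [
    "Self-attention relates different positions of a single sequence.",
    "The model was trained on eight GPUs.",
    "Attention weights are computed with a softmax.",
]
print(answer("how are attention weights computed", docs, fake_llm))
```

Because every retrieved chunk goes into one prompt, the number and size of chunks must fit within the model's context window.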

**PyPDFLoader**
> **Module:** langchain.document_loaders

> **Purpose:** PyPDFLoader is a document loader that reads PDF files and prepares them for further processing, such as embedding generation and retrieval.

> **Usage:** Typically used to load text from PDFs before processing it for NLP applications such as retrieval-augmented generation (RAG).

**RecursiveCharacterTextSplitter**
> **Module:** langchain.text_splitter

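> **Purpose:** RecursiveCharacterTextSplitter is a text splitter that divides long documents into smaller chunks. It tries a list of separators in order (by default paragraph breaks, then line breaks, then spaces, then individual characters), so chunks tend to end at paragraph or word boundaries where the size limit allows, although a sentence can still be split.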

> **Usage:** Used in scenarios where long documents need to be split into manageable parts for embedding and retrieval, ensuring effective information retrieval in vector-based search.
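
The recursive strategy can be sketched as follows. This is a simplified illustration, not LangChain's code: the function name `recursive_split` is hypothetical, chunk overlap is left out, and separators are dropped wherever the text is cut.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters.

    Simplified sketch: no chunk overlap, and separators are dropped
    where the text is cut.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, rest)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # piece still fits: keep merging
            continue
        if current:
            chunks.append(current)       # flush what has been merged so far
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks

sample = (
    "First paragraph about attention.\n\n"
    "Second paragraph, which is a little longer than the first one.\n\n"
    "Third."
)
for chunk in recursive_split(sample, chunk_size=40):
    print(repr(chunk))
```

The short paragraphs survive intact, while the long one is broken at a space, which shows that the size limit takes priority over sentence boundaries.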

**HuggingFacePipeline**
> **Module:** langchain.llms

> **Purpose:** HuggingFacePipeline is an interface that allows integration with Hugging Face models using pipelines. It enables the use of transformer-based models for various NLP tasks, such as text generation and question answering.

> **Usage:** Typically used to run Hugging Face models as LLMs (large language models) in LangChain applications, leveraging pre-trained models for text-based tasks.


In [4]:
import torch
from torch import cuda, bfloat16
import transformers
from transformers import AutoTokenizer
# Import the relevant LangChain components
from langchain.llms import HuggingFacePipeline
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader, PyPDFLoader

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings


In [5]:
model_id = '/kaggle/input/llama-2/pytorch/7b-chat-hf/1'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)
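
As a rough sanity check on why 4-bit loading matters on a single GPU, the weight memory can be estimated with simple arithmetic. The 7 billion figure is the nominal size implied by `7b` in the model path, and the helper name is hypothetical. The estimate covers weights only and ignores quantization constants, activations, the KV cache, and any layers kept in higher precision.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

params = 7e9  # nominal parameter count for a "7B" model (assumption)
print(f"fp16 : {weight_memory_gib(params, 16):.1f} GiB")
print(f"4-bit: {weight_memory_gib(params, 4):.1f} GiB")
```

Under these assumptions the 4-bit weights need about a quarter of the memory of the fp16 weights.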

In [6]:
model_config = transformers.AutoConfig.from_pretrained(
    model_id
)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    low_cpu_mem_usage=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id)






In [7]:
%%time
# The model was already loaded (4-bit, device_map='auto') in the previous cell,
# so the pipeline only needs the model and tokenizer objects.
query_pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

Device set to use cuda:0


CPU times: user 196 ms, sys: 15.6 ms, total: 211 ms
Wall time: 249 ms


In [8]:
%%time
def test_model(tokenizer, pipeline, prompt_to_test):
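    """
    Run a single prompt through the pipeline and print the generated text.

    Args:
        tokenizer: tokenizer that supplies the end-of-sequence token id
        pipeline: Hugging Face text-generation pipeline
        prompt_to_test: prompt string to send to the model
    Returns:
        None
    """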
    # adapted from https://huggingface.co/blog/llama2#using-transformers
    sequences = pipeline(
        prompt_to_test,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=200)
    for seq in sequences:
        print(f"Result: {seq['generated_text']}")
        
# Test the model directly through the transformers pipeline
test_model(tokenizer,
           query_pipeline,
           "Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words.")

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Result: Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words.
The State of the Union address is an annual speech given by the President of the United States to a joint session of Congress, in which the President reports on the current state of the union and outlines their legislative agenda for the upcoming year.
CPU times: user 3.74 s, sys: 319 ms, total: 4.06 s
Wall time: 5.75 s
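
The `do_sample=True` and `top_k=10` settings in this cell restrict sampling to the ten highest-scoring next tokens. A minimal sketch of that idea follows. It is not the transformers implementation, and the function name, the toy logits, and the seeded generator are assumptions for illustration.

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Sample a token index from the k highest-scoring logits."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits only (shifted by the max for stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    # Draw one index in proportion to those weights.
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded so the run is repeatable
logits = [2.0, 0.5, 1.5, -1.0, 0.1]
samples = [top_k_sample(logits, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))  # only indices 0 and 2 can appear with k=2
```

Smaller values of `k` make the output more predictable, while larger values allow more varied text.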


In [9]:
%%time
# Testing The Model using HuggingFacePipeline
llm = HuggingFacePipeline(pipeline=query_pipeline)
llm(prompt="Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words.")

CPU times: user 3.46 s, sys: 96.4 ms, total: 3.55 s
Wall time: 3.55 s


'\n\nThe State of the Union address is an annual speech given by the President of the United States to a joint session of Congress, in which the President reports on the current state of the union and outlines their legislative agenda for the upcoming year.'

In [10]:
# Load the PDF the user wants to query, then split it into chunks
loader = PyPDFLoader("/kaggle/input/pdf-temp/1706.03762")
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
all_splits = text_splitter.split_documents(documents)

In [11]:
## Loading the MiniLM-L6 sentence transformer to generate vector embeddings
model_name = "sentence-transformers/all-MiniLM-L6-v2"
# Fall back to the CPU when no GPU is available
model_kwargs = {"device": "cuda" if cuda.is_available() else "cpu"}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)




In [12]:
vectordb = Chroma.from_documents(documents=all_splits, embedding=embeddings, persist_directory="chroma_db")


In [13]:
retriever = vectordb.as_retriever()

qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)

In [14]:
def test_rag(qa, query):
    print(f"Query: {query}\n")
    result = qa.run(query)
    print("\nResult: ", result)

In [15]:
query = "What is the attention? Summarize. Keep it under 200 words."
test_rag(qa, query)

Query: What is the attention? Summarize. Keep it under 200 words.

> Entering new RetrievalQA chain...

> Finished chain.

Result:   The attention is a mechanism in the Transformer model that allows it to focus on specific parts of the input sequence when generating the output. It is a self-attention mechanism, meaning that it computes a representation of the input sequence based on its own internal state, rather than relying on a sequence-aligned RNN or convolution. This allows the model to capture long-distance dependencies in the input sequence and generate more accurate and informative output.
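
The self-attention described in this result corresponds to scaled dot-product attention, softmax(QKᵀ / √d_k) V, the formula given in the paper loaded here (the file name matches arXiv 1706.03762). A minimal pure-Python sketch is shown below. It omits the learned projection matrices, multiple heads, and masking, and the function names and toy input are chosen for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    output, all_weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K
        ]
        weights = softmax(scores)
        all_weights.append(weights)
        # Output row is the attention-weighted sum of the value vectors.
        output.append([
            sum(w * v[j] for w, v in zip(weights, V))
            for j in range(len(V[0]))
        ])
    return output, all_weights

# Self-attention: queries, keys, and values all come from the same sequence.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = scaled_dot_product_attention(X, X, X)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and states how strongly one position attends to every position in the sequence, including itself.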


In [16]:
query = "Explain attention in detail. Explain the main components. Keep it under 200 words."
test_rag(qa, query)

Query: Explain attention in detail. Explain the main components. Keep it under 200 words.

> Entering new RetrievalQA chain...

> Finished chain.

Result:   Attention is a mechanism that allows neural networks to focus on specific parts of the input when processing sequential data. In the context of natural language processing, attention allows the model to selectively focus on certain words or phrases in a sentence and weigh their importance when generating the output.

The main components of attention are:

1. Query: This is the input that the attention mechanism uses to compute the attention weights.
2. Key: This is the input that the attention mechanism uses to compute the attention weights.
3. Value: This is the input that the attention mechanism uses to compute the attention weights.
4. Attention weights: These are the weights computed by the attention mechanism to determine how much each input should be "attended" to.
5. Output: This is the output of the attention mechanism, which represents the weighted sum of the input values.

In detail, the attention mechanism first computes the attention we