# The Use Case

## For The Business
For this proof of concept (POC), we wanted to use an actual use case from our HR department. In fact, the exact ask was "Imagine if we flipped our HR inbox/etc. into something similar like a bot…trained it with our handbook, IT policies, benefits, engagement protocol, etc…it would cut down on so many of the asks!"

## On The Tech Side
To keep the data (in this case, our employee handbook) within our Azure and Databricks environments, we used an open-source foundation LLM. We wanted to test running a POC on a ridiculously small Databricks cluster...so yeah, 15GB, CPU-based, with the 13.1 ML Runtime. #jarvisTakeTheWheel

## Final Note
While this is functional, it is a proof of concept...**raw and uncut**. It's meant to showcase a basic LLM solution approach within Databricks and accelerate your LLM experimentation efforts. Despite its minuscule size (don't judge a notebook by its size ;)), a fair amount of exploration and experimentation went into this POC.

Let's get into it.

# Proof of Concept, Iteration 1: Start

In [0]:
%pip install azure-storage-blob langchain transformers unstructured lancedb bitsandbytes einops safetensors

In [0]:
dbutils.library.restartPython()

## Load & Chunk

In [0]:
# Get the connection string from the Azure storage account under Security + networking -> Access keys
# Using LangChain to load our PDF
from langchain.document_loaders import AzureBlobStorageFileLoader
from langchain.text_splitter import TokenTextSplitter

loader = AzureBlobStorageFileLoader(
    conn_str="[your connection string here]",
    container="[name of blob container here]",
    blob_name="[name of the handbook (in pdf format) uploaded to the blob container above]",
)

handbook = loader.load()
text_splitter = TokenTextSplitter(chunk_size=200, chunk_overlap=0)
chunks = text_splitter.split_documents(handbook)

# Result is LangChain document objects with a token size around what you defined in chunk_size
print("Your handbook has been chunked into", len(chunks), "documents.")
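
To make `chunk_size` and `chunk_overlap` a bit more concrete, here is a minimal sketch of fixed-size token chunking. It is not how `TokenTextSplitter` is implemented; `chunk_tokens` is a hypothetical helper, and plain strings stand in for real tokens (the actual splitter counts tokens with a real tokenizer).

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap=0):
    """Split a token list into windows of at most chunk_size tokens.

    Hypothetical helper for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window slides each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


# Ten placeholder "tokens" (an assumption; real tokens come from a tokenizer)
tokens = [f"w{i}" for i in range(10)]

no_overlap = chunk_tokens(tokens, chunk_size=4, chunk_overlap=0)
with_overlap = chunk_tokens(tokens, chunk_size=4, chunk_overlap=2)

print(len(no_overlap), "chunks without overlap")
print(len(with_overlap), "chunks with an overlap of 2")
```

With `chunk_overlap=0`, as in this POC, a sentence that straddles a chunk boundary gets cut in two. A small overlap is one knob worth trying if answers seem to miss context.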

## Time For Embedding

In [0]:
# Need this for a local pipeline wrapper around the model rather than using a hosted model on Hugging Face Hub
from langchain.embeddings import HuggingFaceEmbeddings

# Download the model from Hugging Face
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
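
Why embed at all? Once text is a vector, "similar meaning" becomes "nearby vector". A minimal sketch of one common comparison, cosine similarity, is below. The three-dimensional vectors are made up for illustration (the real model produces much longer vectors), and the metric your vector store actually uses depends on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors (assumptions, not real model output)
pto_question = [0.9, 0.1, 0.0]
pto_policy = [0.8, 0.2, 0.1]
dress_code = [0.0, 0.2, 0.9]

print("question vs PTO policy:", round(cosine_similarity(pto_question, pto_policy), 3))
print("question vs dress code:", round(cosine_similarity(pto_question, dress_code), 3))
```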

## Let's do some vectorization. (It's a Lance...hello)

In [0]:
from langchain.vectorstores import LanceDB

import lancedb

db = lancedb.connect("/tmp/lancedb")
table = db.create_table(
    "handbook_table",
    data=[
        {
            "vector": embeddings.embed_query("HR"),
            "text": "Hello HR",
            "id": "1",
        }
    ],
    mode="overwrite",
)

docsearch = LanceDB.from_documents(chunks, embeddings, connection=table)

def get_similar_docs(question, similar_doc_count):
    docs = docsearch.similarity_search(question, k=similar_doc_count)
    return docs

get_similar_docs("How much PTO do I get in my first year?", 1)
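
The search returns LangChain document objects, which are not the friendliest thing to eyeball. Here is a small sketch for printing them as a ranked list. `FakeDocument` and `summarize_docs` are hypothetical names, the sample text is invented, and the `source` metadata key is an assumption about what your loader populates.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a LangChain Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_docs(docs, max_chars=80):
    """Return one numbered line per document, with whitespace collapsed."""
    lines = []
    for rank, doc in enumerate(docs, start=1):
        text = " ".join(doc.page_content.split())  # collapse newlines and runs of spaces
        if len(text) > max_chars:
            text = text[:max_chars].rstrip() + "..."
        source = doc.metadata.get("source", "unknown")  # "source" key is an assumption
        lines.append(f"{rank}. [{source}] {text}")
    return "\n".join(lines)


# Invented placeholder text, not taken from any real handbook
docs = [
    FakeDocument(
        "Full-time employees accrue PTO\n each pay period.",
        {"source": "handbook.pdf"},
    )
]
print(summarize_docs(docs))
```

In the notebook, you would pass the result of `get_similar_docs(...)` to `summarize_docs` instead of the fake list.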

# Proof of Concept, Iteration 1: Done
OK, so what we have demonstrated is that we can take an employee handbook, in PDF format, and ask questions that would be answered by that handbook. At this point, we could consider the first iteration of a proof of concept (POC) complete. It's definitely rough around the edges, but the whole point was to prove out the initial concept.

## The Learnings
The runtime selected in Databricks matters.

In a number of cases, using an LLM isn't absolutely necessary.

# Proof of Concept, Iteration 2: Start

After some initial feedback and review of the first iteration of the POC, the request was made for another iteration to make the responses a bit more "human understandable". Now comes the language model.

## Load a language model

In [0]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and the PyTorch weights for distilgpt2 (a small GPT-2 variant)
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model2 = AutoModelForCausalLM.from_pretrained("distilgpt2")

## Configure our pipeline
Here we configure our transformers pipeline.

In [0]:
from transformers import pipeline

pipeline2 = pipeline(
    task="text-generation",  # stated explicitly rather than inferred from the model
    model="distilgpt2",
    use_cache=True,
    device_map="auto",
    max_new_tokens=500,
)
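
One thing worth sanity-checking here is `max_new_tokens=500` against the model's context window. GPT-2 family models, distilgpt2 included, have a 1024-token window, and the prompt plus the generated tokens have to share it. The sketch below is back-of-the-envelope arithmetic only: the chunk count of 4 and the 50-token prompt overhead are assumptions, and `fits_context` is a hypothetical helper.

```python
GPT2_CONTEXT_WINDOW = 1024  # maximum positions for GPT-2 family models


def prompt_token_budget(context_window, max_new_tokens):
    """Tokens left for the prompt after reserving room for generation."""
    return context_window - max_new_tokens


def fits_context(n_chunks, chunk_tokens, prompt_overhead, max_new_tokens, context_window):
    """Check whether retrieved chunks + prompt text + generated tokens fit the window."""
    prompt_tokens = n_chunks * chunk_tokens + prompt_overhead
    return prompt_tokens + max_new_tokens <= context_window


budget = prompt_token_budget(GPT2_CONTEXT_WINDOW, max_new_tokens=500)
print("Tokens available for the prompt:", budget)

# Assumed numbers: 200-token chunks (our chunk_size) and ~50 tokens of prompt text
for n_chunks in (4, 2):
    ok = fits_context(n_chunks, 200, 50, 500, GPT2_CONTEXT_WINDOW)
    print(n_chunks, "chunks fit:", ok)
```

If the numbers don't fit, the usual levers are fewer retrieved chunks, smaller chunks, or a lower `max_new_tokens`.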

## Create and run our chain

In [0]:
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline

query = "How much PTO do I get in my first year?"

# Create a chain that retrieves relevant chunks and passes them to the LLM
qa = RetrievalQA.from_chain_type(
    llm=HuggingFacePipeline(pipeline=pipeline2),
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
)

result = qa.run(query)
print("The question was", query, "The answer is", result)
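
For anyone wondering what `chain_type="stuff"` means: the retrieved chunks are simply "stuffed" into a single prompt along with the question. A rough sketch of that idea follows. `build_stuff_prompt` is a hypothetical helper, the template wording is illustrative rather than LangChain's exact prompt, and the sample chunks are invented.

```python
def build_stuff_prompt(question, context_chunks):
    """Concatenate all retrieved chunks into one prompt ahead of the question."""
    context = "\n\n".join(chunk.strip() for chunk in context_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Invented placeholder chunks, not taken from any real handbook
chunks_text = [
    "PTO accrues each pay period. ",
    " New hires are eligible after their start date.",
]
prompt = build_stuff_prompt("How much PTO do I get in my first year?", chunks_text)
print(prompt)
```

The simplicity is the appeal, and also the catch: every retrieved chunk counts against the model's context window.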

# Proof of Concept, Iteration 2: Done

## The Learnings

### Model selection matters. 
We tested a fair number of models, including but not limited to:
- dolly (I mean, we had to, right?)
- tinyroberta-squad2 (we needed a hyperparameter with a max_seq_len of 2000+)
- mpt-7b
- falcon-7b-instruct (a lot of Python kernel timeouts with this one)
- gpt2

Remember this: task selection is an important consideration when choosing the model.

**Performance**

According to the creators of the model we used ([GPT-2](https://github.com/openai/gpt-2/blob/master/model_card.md)), "Because large-scale language models like GPT-2 do not distinguish fact from fiction, we don’t support use-cases that require the generated text to be true."

In this POC, you will see that the model does return results...but it also peppers its response with a lot of...let's call it mumbo jumbo. Others may call it hallucinations.

Below we commented out one approach we tried, which was quantizing a model to 4 bits. We achieved moderate results, but we found ourselves distracted by twisting a lot of knobs and dials. For the sake of speed to value from the POC, we pivoted. But it's an approach we will cover in another notebook.

In [0]:
#CONFIGURE THE QUANTIZATION OF MODEL AND LOAD IT
#Keeping this block in here to showcase a bit of the experimentation during this iteration. One of the attempts made to run the POC on low-end resources was to quantize a model to 4 bits.


#import torch
#from transformers import BitsAndBytesConfig
#from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

#quantization_config = BitsAndBytesConfig(
#    load_in_4bit=True,
#    bnb_4bit_compute_dtype=torch.float16,
#    bnb_4bit_quant_type="nf4",
#    bnb_4bit_use_double_quant=True,
#)

#model_id = "ComCom/gpt2-small"

#tokenizer = AutoTokenizer.from_pretrained(model_id)
#model_4bit = AutoModelForCausalLM.from_pretrained(
#        model_id, 
#        device_map="auto",
#        offload_folder="offload",
#        quantization_config=quantization_config,
#        trust_remote_code=True)
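
Why bother with 4 bits on a 15GB cluster? A rough estimate of weight memory makes the motivation clear. This sketch counts model weights only (no activations, KV cache, or framework overhead), and the 7-billion parameter count is a round-number assumption for a "7b"-class model. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


SEVEN_B = 7e9  # round-number assumption for a "7b"-class model

fp16_gb = weight_memory_gb(SEVEN_B, 16)
int4_gb = weight_memory_gb(SEVEN_B, 4)

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f" 4-bit weights: ~{int4_gb:.1f} GB")
```

On paper, 4-bit weights leave far more headroom, though real usage will be higher than this estimate.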

