In [1]:
#Traditional NLP Techniques for Document Retrieval

Traditional Natural Language Processing (NLP) techniques have been foundational in building systems for document retrieval and information extraction. These methods primarily involve converting text into numerical representations (vectors) and then using similarity measures to retrieve relevant information based on user queries. Key techniques include Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), Word2Vec, and GloVe.

![image.png](attachment:image.png)

**#Vector Representation in BoW**

BoW Vector: A sparse vector where each dimension corresponds to a unique word in the vocabulary. The value represents the frequency of the word in the document.
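A minimal sketch of the BoW idea in plain Python, assuming simple whitespace tokenization and lowercasing. The helper names `build_vocabulary` and `bow_vector` and the two sample sentences are made up for illustration; a library vectorizer would normally be used instead.

```python
from collections import Counter

def build_vocabulary(docs):
    """Return a sorted list of the unique lowercase tokens across all documents."""
    return sorted({token for doc in docs for token in doc.lower().split()})

def bow_vector(doc, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocabulary]

# Illustrative sample documents (not from any real dataset)
docs = ["the cat sat on the mat", "the dog sat"]
vocabulary = build_vocabulary(docs)
vectors = [bow_vector(doc, vocabulary) for doc in docs]

print(vocabulary)  # one dimension per unique word
print(vectors)     # one count vector per document
```

Most entries are zero once the vocabulary grows, which is why BoW vectors are described as sparse.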

**#Vector Representation in TF-IDF**

TF-IDF Vector: A weighted vector where each dimension corresponds to a unique word in the vocabulary. The value is the product of term frequency (TF) and inverse document frequency (IDF), emphasizing important words in the context of the entire corpus.

Example: For the word "lmu" appearing frequently in one document but infrequently across all documents, its TF-IDF score would be high in that document.
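The "lmu" example can be sketched with a hand-rolled TF-IDF, assuming the plain textbook formula tf × log(N / df) and whitespace tokenization. Library implementations typically use smoothed variants, so their numbers will differ. The function name `tf_idf` and the three sample documents are illustrative only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: score} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

# Illustrative corpus: "lmu" is frequent in the first document and absent elsewhere
docs = [
    "lmu campus lmu library lmu",
    "city library opening hours",
    "city campus map",
]
scores = tf_idf(docs)
print(scores[0])  # "lmu" gets the highest score in document 0
```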

**#Applications**

Search Engines: Enhances keyword-based search by weighting terms based on their importance.

Document Clustering: Groups similar documents based on their TF-IDF vectors.

Recommendation Systems: Suggests relevant content by comparing TF-IDF similarities.
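All three applications rest on comparing TF-IDF vectors. A minimal sketch of the search case, assuming cosine similarity as the similarity measure and an IDF of log(N / df) + 1; the function names and sample documents are hypothetical.

```python
import math
from collections import Counter

def build_idf(docs_tokens):
    """IDF per word; the +1 keeps words that occur in every document from vanishing."""
    n = len(docs_tokens)
    df = Counter(w for tokens in docs_tokens for w in set(tokens))
    return {w: math.log(n / df[w]) + 1.0 for w in df}

def vectorize(tokens, idf):
    """Sparse TF-IDF vector as a dict; words outside the corpus vocabulary are ignored."""
    return {w: c * idf[w] for w, c in Counter(tokens).items() if w in idf}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query, docs):
    """Return (score, document) pairs sorted from most to least similar."""
    docs_tokens = [d.lower().split() for d in docs]
    idf = build_idf(docs_tokens)
    q = vectorize(query.lower().split(), idf)
    scored = [(cosine(q, vectorize(t, idf)), d) for t, d in zip(docs_tokens, docs)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Illustrative documents and query
docs = [
    "bert question answering model",
    "tf-idf keyword search engine",
    "cooking pasta at home",
]
results = search("keyword search", docs)
print(results[0])
```

Clustering and recommendation reuse the same `cosine` comparison, only between document vectors instead of between a query and a document.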

**#Vector Representation in Word2Vec and GloVe**

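Word Embeddings: Both Word2Vec and GloVe generate dense, fixed-size vectors for words, capturing semantic relationships. Unlike BoW and TF-IDF, these embeddings encode semantic information learned from the contexts in which words appear. Each word still gets a single static vector, regardless of the sentence it is used in.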

Word2Vec Example: The words "king" and "queen" might have vectors that reflect their semantic similarity and relationships.

GloVe Example: Embeddings capture global word-word co-occurrence statistics, allowing for more nuanced vector representations.

![image-2.png](attachment:image-2.png)

In [2]:
#Limitations of Traditional Document Retrieval

**Limited Context:** Traditional methods like TF-IDF primarily focus on keyword matching and might not capture the broader context or meaning of a sentence.

**Limited Understanding of Nuance:** These techniques struggle to understand synonyms, sarcasm, or other subtleties in language.

**Single Sentence Answers:** The system retrieves only one sentence as an answer, which might not be sufficient for complex queries.


In [4]:
#Transfer learning

**Transfer Learning:**

Transfer learning involves taking a pre-trained model (in this case, BERT fine-tuned on SQuAD) and adapting it to a different but related task. 

In [5]:
#BERT

**#BERT (Bidirectional Encoder Representations from Transformers):**

**Architecture:** BERT is based on the Transformer architecture, which relies on attention mechanisms to learn contextual relationships between words in a sentence. Unlike traditional models that process text either left-to-right or right-to-left, BERT processes text bidirectionally, capturing context from both directions.

**Pre-training:** BERT is pre-trained on large corpora with two self-supervised tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). MLM involves predicting masked words in a sentence, while NSP involves predicting whether one sentence follows another. Together these tasks give BERT general-purpose language representations that can then be fine-tuned for downstream tasks.

**Transformers Pipeline:**

Pipeline: The pipeline function from the Hugging Face transformers library simplifies the use of complex models like BERT for specific tasks like QA. It abstracts the process of loading models, tokenizing input, running inference, and decoding the output.

**QA Task:** In the code, the QA pipeline is created using the pre-trained bert-large-uncased-whole-word-masking-squad2 model, which is fine-tuned on the SQuAD2.0 dataset. This model is particularly good at reading a given context (in this case, text from the PDF) and answering questions based on that context.
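A minimal sketch of how such a QA pipeline can be set up. The Hub ID `deepset/bert-large-uncased-whole-word-masking-squad2` is an assumption about where the model named above is hosted, and the helper names `build_qa_pipeline` and `answer_question` are hypothetical. The `transformers` import sits inside the function so that the helpers can be defined without the library installed.

```python
def build_qa_pipeline(model_name="deepset/bert-large-uncased-whole-word-masking-squad2"):
    """Create a question-answering pipeline (downloads the model on first use)."""
    # Assumption: the model is published on the Hugging Face Hub under this ID.
    from transformers import pipeline
    return pipeline("question-answering", model=model_name)

def answer_question(qa, question, context):
    """Return the extracted answer span and the model's confidence score."""
    result = qa(question=question, context=context)
    return result["answer"], result["score"]
```

Typical use would be `qa = build_qa_pipeline()` followed by `answer_question(qa, "What is BERT pre-trained on?", pdf_text)`, where `pdf_text` is the text extracted from the PDF. Because the model extracts a span from the given context, it can only answer what that context actually contains.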

![image.png](attachment:image.png)

In [3]:
#Working with open source LLM models

![image.png](attachment:image.png)

In [11]:
#Few models

![image.png](attachment:image.png)

In [6]:
#Basic LLM

![image.png](attachment:image.png)

In [None]:
#Vanilla RAG

![image.png](attachment:image.png)

![image.png](attachment:image.png)

In [15]:
#Working with various databases

![image.png](attachment:image.png)

In [9]:
#Not all cases require a generative approach

![image.png](attachment:image.png)

#Worked with various approaches --> metadata, NER, adaptive RAG retrieval, fusion retrieval techniques

![image.png](attachment:image.png)

![image.png](attachment:image.png)

In [7]:
#URL Question answering using local models

![image.png](attachment:image.png)

**Vanilla RAG (Retrieval-Augmented Generation) limitations:**

Vanilla RAG bridges the gap between unstructured documents and the context an LLM needs, but it has limitations.

Challenges of Basic RAG:

**Data Quality Reliance:** Output quality depends on the quality of the retrieved documents and of the model's training data. Poor-quality inputs lead to poor outputs.

**Limited Augmentation:** A basic augmentation step may not handle noisy retrieved text or missing context.

**New & Company-Specific Content:** LLMs struggle with unseen information, especially company-specific terms or jargon.
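For reference, the basic retrieve, augment, generate loop that these limitations apply to can be sketched as follows. This is a sketch under stated assumptions: word overlap stands in for vector search, and `generate` is a hypothetical callable that wraps whatever LLM is in use.

```python
def retrieve(query, chunks, top_k=2):
    """Rank chunks by shared words with the query (a stand-in for vector search)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda chunk: len(query_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(query, context_chunks):
    """Augment the question with the retrieved context."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

def vanilla_rag(query, chunks, generate, top_k=2):
    """Retrieve, augment, generate; `generate` is any prompt -> text callable."""
    prompt = build_prompt(query, retrieve(query, chunks, top_k))
    return generate(prompt)
```

Each limitation maps to one step: poor chunks hurt `retrieve`, a naive `build_prompt` passes noise straight through, and `generate` cannot know company-specific terms that never reach the prompt.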

#A simple approach does not work for business-specific use cases; working across many use cases reveals different limitations

**#How to solve business use cases with LLMs?**

**#How to overcome these limitations?**


![image.png](attachment:image.png)

In [14]:
#Prompt engineering

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

In [10]:
#Agentic workflow

![image-2.png](attachment:image-2.png)

#Sample use case

**Vanilla approach**

**MISRA C Mismatch:** Traditional LLMs struggle to grasp specific coding rules like MISRA C.

**Black Box Decisions:** Unclear reasoning behind LLM outputs makes them unreliable for safety-critical tasks.

**Focus Issue:** LLMs consider irrelevant information, hindering rule application during code analysis.

**Complexity Overload:** Traditional LLMs struggle with multi-step tasks, making MISRA C checking overwhelming.


**Solution**

Breaking Down Complexity: By splitting the MISRA C compliance checking process into subtasks handled by dedicated agents, the overall task becomes more manageable. The Rule Learner Agent doesn't need to directly analyze code, and the Code Analyzer Agent doesn't need to understand the intricacies of every rule's meaning.
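The split between the two agents can be sketched as below. This is an illustration only, not the actual implementation: the class layout, prompts, and the `llm` callable (any prompt-to-text function) are assumptions, and the rule ID and rule text used in the usage note are placeholders rather than real MISRA C wording.

```python
from dataclasses import dataclass

@dataclass
class RuleCheck:
    rule_id: str
    instruction: str  # what the analyzer should look for

class RuleLearnerAgent:
    """Turns raw rule text into one concrete check. Never sees any code."""
    def __init__(self, llm):
        self.llm = llm

    def learn(self, rule_id, rule_text):
        prompt = f"Summarize this coding rule as one concrete check:\n{rule_text}"
        return RuleCheck(rule_id, self.llm(prompt))

class CodeAnalyzerAgent:
    """Applies one prepared check to code. Never sees the raw rule text."""
    def __init__(self, llm):
        self.llm = llm

    def analyze(self, check, code):
        prompt = (
            f"Check: {check.instruction}\nCode:\n{code}\n"
            "Reply with VIOLATION or OK and one sentence of reasoning."
        )
        return {"rule_id": check.rule_id, "verdict": self.llm(prompt)}

def check_compliance(rules, code, llm):
    """Run every rule through the learner, then hand each check to the analyzer."""
    learner, analyzer = RuleLearnerAgent(llm), CodeAnalyzerAgent(llm)
    return [
        analyzer.analyze(learner.learn(rule_id, text), code)
        for rule_id, text in rules.items()
    ]
```

A call such as `check_compliance({"EXAMPLE-1": "placeholder rule text"}, source_code, llm)` returns one verdict per rule. Asking for a sentence of reasoning with each verdict is one way to address the black-box concern above.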


![image.png](attachment:image.png)

![image.png](attachment:image.png)

**Improved Efficiency and Automation:**

Distributed Workflows: Break down complex processes into smaller tasks handled by individual agents. This can automate repetitive tasks and streamline workflows, freeing up human employees for higher-level activities.

Task Delegation: Agents can handle specialized tasks based on their capabilities. Imagine a multi-agent framework (MAF) for order fulfillment, where one agent handles inventory management, another routes orders, and a third interacts with shipping carriers.

Decentralized Decision-Making: Empower agents to make decisions within predefined parameters, reducing the need for constant human oversight. This can be helpful for tasks requiring real-time responses or geographically dispersed operations.

**#Garbage in Garbage out**

For LLMs, the key inputs are the query-relevant text and the prompt; together they shape the response.

**#Input -- chunking**

![image.png](attachment:image.png)

**Choose chunk size**: test with your own dataset

Evaluate on a sample benchmark dataset for response time, faithfulness, and relevance
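A minimal sketch of fixed-size chunking, assuming chunk size is measured in whitespace-separated words (token-based sizes are also common). The function name `chunk_words` and its default values are illustrative, not recommendations.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words so sentences cut at a
    boundary still appear whole in one of the two chunks.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Running the same evaluation set over several `chunk_size` values (for example 128, 256, 512) and comparing response time, faithfulness, and relevance is one way to make the choice empirical.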

**Semantic chunking**

LangChain semantic chunking or LlamaIndex semantic chunking

1. Percentile: split where the distance between consecutive sentences is greater than this percentile

2. Standard deviation

3. Interquartile
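The percentile option can be sketched by hand as follows. This is a simplified illustration and not the LangChain or LlamaIndex implementation: sentence embeddings are assumed to be supplied by the caller, cosine distance is assumed as the distance measure, and all function names are hypothetical.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; treated as maximal (1.0) if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

def percentile(values, pct):
    """Linear-interpolation percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)

def semantic_chunks(sentences, embeddings, pct=95):
    """Start a new chunk wherever the jump to the next sentence exceeds the threshold."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(sentences) - 1)
    ]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

The standard deviation and interquartile options differ only in how `threshold` is derived from `distances`.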

**#Users may phrase a question in any way, but the model should still answer it correctly**

**HyDE (Hypothetical Document Embedding)**

Traditional retrieval struggles with very short and very long queries

Expand the query into a full hypothetical document

Improves retrieval relevancy
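A minimal sketch of the HyDE flow, assuming `generate` (prompt to text) and `embed` (text to vector) are callables supplied by the caller; both names, the prompt wording, and `hyde_retrieve` itself are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hyde_retrieve(query, chunks, chunk_vectors, generate, embed, top_k=3):
    """Embed an LLM-written hypothetical answer instead of the raw query."""
    hypothetical_doc = generate(
        f"Write a short passage that answers the question:\n{query}"
    )
    query_vector = embed(hypothetical_doc)
    ranked = sorted(
        zip(chunks, chunk_vectors),
        key=lambda pair: cosine(query_vector, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in ranked[:top_k]]
```

The hypothetical passage does not need to be factually right. It only needs to look like the documents being searched, so that its embedding lands near the relevant chunks.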

**Query rewriting**: makes the query more specific and detailed using a prompt template

**Step-back prompting**: generates more general step-back queries

**Sub-query decomposition**: breaks down complex queries

**#Retrieval**

**Adaptive RAG**: different queries require different retrieval strategies

Factual: uses the original query

Analytical: multiple sub-queries

Opinion: different viewpoints on the topic

Contextual: incorporates user-specific context into queries
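The routing described above can be sketched as a classifier plus one query plan per category. This is an assumption-laden illustration: `llm` is any prompt-to-text callable, and the prompts, function names, and the fallback to the factual strategy are choices made for this example.

```python
CATEGORIES = ("factual", "analytical", "opinion", "contextual")

def classify(query, llm):
    """Ask the LLM for a category label; fall back to 'factual' on anything unexpected."""
    label = llm(
        f"Classify the query as one of {', '.join(CATEGORIES)}. "
        f"Reply with the label only.\nQuery: {query}"
    ).strip().lower()
    return label if label in CATEGORIES else "factual"

def split_lines(text):
    """Turn a multi-line LLM reply into a clean list of queries."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def plan_queries(query, category, llm, user_context=""):
    """Return the list of queries to send to the retriever for this category."""
    if category == "factual":
        return [query]
    if category == "analytical":
        return split_lines(llm(f"Break this question into 3 sub-questions, one per line:\n{query}"))
    if category == "opinion":
        return split_lines(llm(f"List 3 different viewpoints on this topic, one per line:\n{query}"))
    if category == "contextual":
        return [f"{query} (user context: {user_context})"]
    raise ValueError(f"unknown category: {category}")
```

The retriever then runs once per planned query and the results are merged before generation.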

**Context enrichment around chunk**

Retrieve each chunk together with its overlapping surrounding context

Avoids returning isolated text fragments
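One simple way to sketch this, assuming chunks are stored in document order and the retriever returns the index of the matching chunk. The name `with_neighbors` and the `window` parameter are hypothetical.

```python
def with_neighbors(chunks, hit_index, window=1):
    """Return the retrieved chunk joined with up to `window` chunks on each side."""
    start = max(0, hit_index - window)
    end = min(len(chunks), hit_index + window + 1)
    return " ".join(chunks[start:end])
```

If the chunks were created with overlap, the joined text repeats the shared words, so the overlap would need to be trimmed before the text goes into the prompt.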

**Contextual compression in document retrieval**

Compress the retrieved text down to the context relevant to the query

Traditional retrieval returns the entire chunk, which might contain irrelevant information

**Fusion retrieval**

Semantic understanding (vector-based) + keyword matching (BM25)
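A minimal sketch of score fusion, assuming the vector similarity scores come from elsewhere (for example an embedding index) and are combined with BM25 scores through min-max normalization and a weight `alpha`. The BM25 variant, the parameter defaults `k1=1.5` and `b=0.75`, and the function names are choices made for this example; reciprocal rank fusion is a common alternative.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a standard BM25 formula."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    df = Counter(w for t in tokenized for w in set(t))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in counts:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            tf = counts[term]
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

def min_max(scores):
    """Rescale scores to 0..1 so the two score types become comparable."""
    low, high = min(scores), max(scores)
    if high == low:
        return [0.0] * len(scores)
    return [(s - low) / (high - low) for s in scores]

def fuse(vector_scores, keyword_scores, alpha=0.5):
    """Weighted sum: alpha for semantic scores, (1 - alpha) for keyword scores."""
    v, k = min_max(vector_scores), min_max(keyword_scores)
    return [alpha * a + (1 - alpha) * b for a, b in zip(v, k)]
```

Keyword matching helps with exact identifiers such as error codes or part numbers, which embeddings often blur, while the vector score covers paraphrases that share no words with the query.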

**Hierarchical index**

Two-tier search system

Search summary chunks first, then the detailed relevant chunks

**Note: one size won't fit all. Experiment with and research various techniques, apply them to your problem statement, and evaluate.**

**#Fine tuning techniques**

Taking a pretrained model and adapting it to a specific task

Full fine-tuning: updating all parameters of the pretrained LLM

Feature-based fine-tuning: the pretrained model is frozen and only the final (task-specific) layers are trained

Adapter-based fine-tuning

Prompt-based fine-tuning

Parameter-efficient fine-tuning: LoRA, QLoRA
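To see why LoRA counts as parameter-efficient, the sketch below compares trainable parameter counts for a single weight matrix. It assumes the usual LoRA setup, where the frozen weight W (d_out × d_in) is adapted by a low-rank product B·A with B of shape d_out × r and A of shape r × d_in. The function names and the example sizes are illustrative.

```python
def full_finetune_params(d_out, d_in):
    """Full fine-tuning updates every entry of the weight matrix."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    """LoRA trains only B (d_out x rank) and A (rank x d_in); W stays frozen."""
    return rank * (d_out + d_in)

# Illustrative sizes: one 768 x 768 projection matrix with rank 8
full = full_finetune_params(768, 768)
lora = lora_params(768, 768, rank=8)
print(full, lora, f"{lora / full:.2%}")
```

QLoRA applies the same low-rank update on top of a quantized frozen base model, which reduces memory further.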