# <font color = 'green'>Language models

1. A language model (LM) is a probabilistic model that assigns probabilities to sequences of words or tokens in a given language.
2. It captures the structure and patterns of the language to predict the likelihood of a particular sequence of words.
3. The model is trained to assign higher probabilities to more likely sequences based on the context, which helps it generate text or predict missing words.
4. The probability of a sequence is calculated by multiplying the conditional probability of each word given the words that precede it in the sequence (the chain rule of probability):

$$p(w_1, w_2, \dots, w_n) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1, w_2)\, p(w_4 \mid w_1, w_2, w_3) \cdots p(w_n \mid w_1, w_2, \dots, w_{n-1})$$

5. The LM assigns a probability to each word using what it learned from its training data.

For example, suppose we have a vocabulary 𝑉 of words.
The LM assigns a probability 𝑝 to every possible sequence (ordering) of words drawn from 𝑉.
The probability is written p(𝑤1,𝑤2,…,𝑤𝑛):
- `p(chase, the, cat, the, mouse)` = 0.0001
- `p(the, chase, cat, the, mouse)` = 0.003
- `p(chase, the, mouse, the, cat)` = 0.0021
- `p(the, cat, chase, the, mouse)` = 0.02
- `p(the, mouse, chase, the, cat)` = 0.01
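
The chain-rule product can be sketched in a few lines of Python. This is only an illustration: the conditional probabilities in `COND_PROBS` are invented numbers, not values from a trained model, and `sequence_probability` is a hypothetical helper name.

```python
from math import prod

# Hypothetical conditional probabilities p(word | preceding words).
# Keys are (context_tuple, word); the numbers are invented for illustration.
COND_PROBS = {
    ((), "the"): 0.5,
    (("the",), "cat"): 0.2,
    (("the", "cat"), "chase"): 0.1,
    (("the", "cat", "chase"), "the"): 0.4,
    (("the", "cat", "chase", "the"), "mouse"): 0.3,
}

def sequence_probability(words):
    """Multiply p(w_i | w_1, ..., w_{i-1}) over the whole sequence."""
    factors = []
    for i, word in enumerate(words):
        context = tuple(words[:i])
        # Pairs missing from this toy table are treated as probability 0.
        factors.append(COND_PROBS.get((context, word), 0.0))
    return prod(factors)

p = sequence_probability(["the", "cat", "chase", "the", "mouse"])
print(p)  # product of the five factors: 0.5 * 0.2 * 0.1 * 0.4 * 0.3
```

A real model would not store a table like this; it would compute each conditional probability from learned parameters.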

Modern language models, such as Transformer-based models like GPT-3, use deep learning techniques to capture intricate patterns and dependencies in the data.

# <font color = 'green'> N-gram language model

An n-gram model is also a type of probabilistic language model.
It is based on the idea that the probability of a word depends only on the previous 𝑛−1 words in the sequence.
The term “n-gram” refers to a consecutive sequence of 𝑛 items.

For example, consider the following sentence: I love language models.

- **Unigram (1-gram)**: “I,” “love,” “language,” “models”

- **Bigram (2-gram)**: “I love,” “love language,” “language models”

- **Trigram (3-gram)**: “I love language,” “love language models”

- **4-gram**: “I love language models”
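
The n-grams listed above can be reproduced with a short sliding-window function. This is a minimal sketch that assumes whitespace tokenization and drops the final period; `ngrams` is a hypothetical helper name.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love language models".split()
for n in range(1, 5):
    print(n, [" ".join(gram) for gram in ngrams(tokens, n)])
```

A sentence of 4 tokens yields 4 unigrams, 3 bigrams, 2 trigrams, and 1 4-gram, matching the list above.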

N-gram models are simple and computationally efficient, making them suitable for various natural language processing tasks. However, their limitations include the inability to capture long-range dependencies in language and the sparsity problem when dealing with higher-order n-grams.
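
Both the simplicity and the sparsity problem show up in a small bigram model estimated by counting. The sketch below assumes a tiny invented corpus and plain maximum-likelihood estimates without smoothing; `bigram_prob` and the other names are hypothetical.

```python
from collections import Counter

# Tiny invented corpus; a real model would be estimated from far more text.
corpus = [
    "the cat chases the mouse",
    "the mouse sees the cat",
]

context_counts = Counter()   # how often each word appears with a successor
bigram_counts = Counter()    # how often each (previous word, word) pair appears
for sentence in corpus:
    tokens = sentence.split()
    context_counts.update(tokens[:-1])
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """Maximum-likelihood estimate: count(prev, word) / count(prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

print(bigram_prob("the", "cat"))    # bigram seen in the corpus
print(bigram_prob("cat", "sees"))   # plausible but unseen bigram gets 0
```

The second call illustrates sparsity: a perfectly reasonable word pair gets probability zero simply because it never occurred in the training text, and the problem grows with larger 𝑛.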

# <font color = 'green'>Large Language Models

**Large Language Models (LLMs)** are advanced NLP models trained on massive amounts of textual data.
 
These models are designed to understand and generate human-like text based on the input they receive.

**LLMs**:
- have billions of parameters.
- are trained on vast and diverse datasets from the internet.
- are highly versatile, excelling across various NLP tasks.
- demand significant computational power and specialized hardware.
- are used for complex language understanding, translation, summarization, and creative writing.
- are built on the Transformer architecture.
    
A core strength of Transformer models is their ability to process text in parallel, which increases efficiency for language tasks.

# <font color = 'green'>Types of LLMs

- **Language representation models** focus on understanding bidirectional context, capturing word meanings by considering both the left and right context in a sentence.

- **Zero-shot learning models**, like the GPT series, can perform tasks without task-specific training by leveraging their pretraining on diverse datasets, so they can be applied to general tasks without fine-tuning.

![zero-shot-2.png](attachment:zero-shot-2.png)

- **Multi-shot (few-shot) learning models** adapt to a task from a few examples, performing well with minimal training thanks to their strong context-awareness and pretraining.

![multi-shot-2.png](attachment:multi-shot-2.png)

- **Fine-tuned or domain-specific models** undergo additional training for specific tasks or domains, improving performance in targeted fields such as biomedicine.
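
In practice, the difference between zero-shot and multi-shot use shows up mainly in how the prompt is built. The sketch below assumes a sentiment-classification task; the function names, the prompt wording, and the example reviews are all hypothetical, and no model is actually called.

```python
def zero_shot_prompt(task, text):
    """Task description plus the input, with no worked examples."""
    return f"{task}\n\nText: {text}\nLabel:"

def few_shot_prompt(task, examples, text):
    """Task description, a few labeled examples, then the input."""
    shots = "\n\n".join(f"Text: {t}\nLabel: {label}" for t, label in examples)
    return f"{task}\n\n{shots}\n\nText: {text}\nLabel:"

task = "Classify the sentiment of the text as positive or negative."
examples = [
    ("I loved this film.", "positive"),
    ("The plot was dull.", "negative"),
]

print(zero_shot_prompt(task, "The acting was superb."))
print("-" * 40)
print(few_shot_prompt(task, examples, "The acting was superb."))
```

In both cases the model's weights stay unchanged; only the prompt differs. Fine-tuned models, by contrast, are changed by additional training.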

### <font color = 'green'>LLM Challenges

**Data Challenges**:

- **`Data Bias`**: LLMs can learn and reproduce biases present in their training data.
- **`Limited World Knowledge and Hallucination`**: LLMs may lack a comprehensive understanding of real-world events and can hallucinate, producing plausible-sounding but incorrect information.
- **`Dependency on Training Data Quality`**: LLM performance is heavily influenced by the quality and representativeness of the training data.

**Technical Challenges**:

- **`Computational Resources`**: Training and running LLMs demands significant computational power and specialized hardware.

- **`Evaluation`**: Evaluation is a notable challenge because existing methods are not well designed for assessing models across diverse tasks and domains, particularly when the output is freely generated content.

- **`Fine-tuning Challenges`**: Difficulties in adapting pre-trained models to specific tasks or domains.

- **`Contextual Understanding`**: LLMs may face challenges in maintaining coherent context over longer passages or conversations.

General AI models like ChatGPT are good at text generation but may lack the detailed understanding needed for specialized fields. They are more prone to errors or hallucinations in domains like healthcare, where terms such as "electronic health record interoperability" require deeper knowledge. Task-specific and domain-specific LLMs, trained on industry-specific data, are essential for accurately interpreting specialized concepts and keeping content relevant.

### <font color = 'green'>LLM Use-Cases

![llm%20use%20cases.png](attachment:llm%20use%20cases.png)

**Benefits of domain-specific LLMs:**

- **Depth and Precision**: These models are designed to accurately interpret industry-specific terminology.

- **Overcoming Limitations**: They excel in domains like finance or medicine where precise terminology is crucial.

- **Enhanced User Experiences**: They provide tailored, personalized responses, improving applications like customer service.

- **Improved Efficiency and Productivity**: By automating tasks and aligning with industry terms, they boost business productivity.

- **Addressing Privacy Concerns**: They can help protect data and privacy, particularly in sensitive industries like healthcare.

Previously we saw multiple ways to use LLMs in specific use cases, namely:

- Zero-shot learning
- Few-shot/multi-shot learning
- Domain-specific models

### <font color = 'green'>Types of Domain-Specific Methods</font>

**1. Domain-Specific Pre-Training**: 

**Training Duration**: Days, weeks, or months

Domain-specific pre-training involves training large language models on specialized datasets to enhance performance within a particular field. Examples include BloombergGPT for finance, ESMFold and ProGen2 for protein sequences, and Galactica for science. In their respective domains, these models can outperform generalist models.

BloombergGPT, a 50-billion-parameter model, is tailored for finance and excels in tasks like financial sentiment analysis, named entity recognition, news classification, question answering, and conversational systems related to finance. It is designed to meet the specific needs of financial applications while remaining competitive on general tasks.

**2. Domain-Specific Fine-Tuning**:

**Training Duration**: Minutes to hours

Domain-specific fine-tuning involves refining a pre-trained language model for a specific task or domain, enhancing its performance within that context. Unlike domain-specific pre-training, which starts from scratch on domain-exclusive data, fine-tuning builds on a general pre-trained model by adapting it to domain-specific data.

Key advantages include:

- Specialization in a particular domain, improving task performance.

- Saving time and computational resources by utilizing existing pre-trained knowledge.

- Adapting the model to the unique needs and nuances of the target domain for better accuracy.
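
The saving in time and compute can be illustrated with a deliberately tiny analogy. The sketch below is not an LLM: it fits a straight line with gradient descent, and the "general" and "domain" tasks are invented linear functions. It only shows that a short training run started from pre-trained weights can reach a lower domain loss than the same short run started from zeros. All names and numbers are assumptions made for this illustration.

```python
def mse(w, b, xs, ys):
    """Mean squared error of the line y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_steps(w, b, xs, ys, steps, lr=0.1):
    """Run a fixed number of gradient-descent steps on the MSE loss."""
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        w -= lr * 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        b -= lr * 2.0 / n * sum(errors)
    return w, b

xs = [-1.0 + 2.0 * i / 49 for i in range(50)]   # 50 evenly spaced inputs in [-1, 1]
y_general = [2.0 * x + 1.0 for x in xs]          # stand-in for the broad "general" task
y_domain = [2.5 * x + 0.5 for x in xs]           # stand-in for a related "domain" task

# "Pre-training": a long run on the general task, starting from zeros.
w_pre, b_pre = gradient_steps(0.0, 0.0, xs, y_general, steps=500)

# "Fine-tuning": a short run on domain data, starting from the pre-trained weights.
w_ft, b_ft = gradient_steps(w_pre, b_pre, xs, y_domain, steps=20)

# Baseline: the same short run on domain data, starting from zeros.
w_scratch, b_scratch = gradient_steps(0.0, 0.0, xs, y_domain, steps=20)

print("pre-trained, before fine-tuning:", mse(w_pre, b_pre, xs, y_domain))
print("fine-tuned (20 steps):          ", mse(w_ft, b_ft, xs, y_domain))
print("from scratch (20 steps):        ", mse(w_scratch, b_scratch, xs, y_domain))
```

The benefit depends on the general and domain tasks being related, as they are in this toy setup; with unrelated tasks, the pre-trained starting point would help much less.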

**3. Retrieval Augmented Generation (RAG)**

Retrieval Augmented Generation (RAG) is an AI framework that enhances the quality of responses generated by LLMs by incorporating up-to-date and contextually relevant information from external sources during the generation process. 

RAG involves two phases: `retrieval`, where relevant information is searched and retrieved, and `content generation`, where the LLM synthesizes an answer based on the retrieved information and its internal training data. This approach improves accuracy, allows source verification, and reduces the need for continuous model retraining.
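
The two phases can be sketched with a toy retriever. This is a minimal illustration under stated assumptions: the knowledge base is three sentences taken from these notes, similarity is a simple bag-of-words cosine score rather than a learned embedding, the function names are hypothetical, and the generation phase is left as a prompt that would be sent to an LLM.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base; in practice this would be an external document store.
DOCUMENTS = [
    "BloombergGPT is a 50 billion parameter model tailored for finance.",
    "Galactica is a language model trained for science.",
    "ProGen2 is a language model for protein sequences.",
]

def bag_of_words(text):
    """Lowercased word counts; a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=1):
    """Retrieval phase: return the k documents most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(documents, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, passages):
    """Input to the generation phase: retrieved passages placed before the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"

query = "Which model is tailored for finance?"
passages = retrieve(query, DOCUMENTS)
print(build_prompt(query, passages))  # this prompt would then be sent to an LLM
```

Because the retrieved passage appears in the prompt, the answer can be checked against its source, and updating the knowledge base does not require retraining the model.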

### Choosing Between RAG, Domain-Specific Fine-Tuning, and Domain-Specific Pre-Training

#### When to Use Domain-Specific Pre-Training:
**Exclusive Domain Focus:** The model is trained from scratch on domain-exclusive data to serve a single field.

**Customizing Model Architecture:** Enables tailoring the model’s architecture, size, and tokenizer to meet domain needs.

**Extensive Training Data:** Requires large amounts of domain-specific data for effective training.

#### When to Use Domain-Specific Fine-Tuning:
**Specialization Needed:** Adapt an already pre-trained LLM for domain-specific tasks.

**Task Optimization:** Adjust parameters for optimal performance in the domain.

**Time and Resource Efficiency:** Reuse existing pre-trained knowledge to save training time and computational resources.

#### When to Use RAG:
**Information Freshness:** Access up-to-date data from external sources.

**Reducing Hallucination:** Ground models with verifiable facts from a knowledge base.

**Cost-Efficiency:** Avoid retraining by using external data directly at generation time.

### Fine-Tuning LLMs

Fine-tuning adjusts a pre-trained model on a specific dataset to enhance its performance for particular tasks, transforming a general model into a specialized one. This process improves the model’s ability to handle specific jobs, such as answering medical questions or drafting legal documents.

#### Need for Fine-Tuning

Fine-tuning is essential to align general-purpose models with user needs. For example, GPT-3, initially trained for text completion, often failed to follow specific instructions accurately. Researchers at OpenAI addressed this by fine-tuning GPT-3 on prompt-based data, creating the InstructGPT models, which adhere better to user prompts and generate more relevant outputs.