### 4. **Language Modeling**

Language modeling is a key task in natural language processing (NLP) that involves predicting the likelihood of a sequence of words or predicting the next word in a sentence. It is an essential component in many NLP tasks such as text generation, machine translation, and speech recognition. There are several methods used for language modeling:

---

### 1. **n-gram Models**

An **n-gram** model is a simple and commonly used language model. It predicts the probability of a word given the previous n-1 words in the sequence.
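
Written out, the probability of a whole sequence factorizes by the chain rule, and an n-gram model approximates each factor by truncating the history to the previous $n-1$ words (the sequence length $T$ is notation introduced here for illustration):

$$
P(w_1, \dots, w_T) = \prod_{i=1}^{T} P(w_i \mid w_1, \dots, w_{i-1}) \approx \prod_{i=1}^{T} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
$$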

* **Unigram**: Predicts each word independently of the words before it.
  $P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i)$
* **Bigram**: Predicts the next word based on the previous word.
  $P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}$
* **Trigram**: Predicts the next word based on the previous two words.
  $P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})}$

Here $C(\cdot)$ denotes the number of times the given word sequence occurs in the training corpus.
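
In practice, n-gram probabilities are usually estimated from counts: how often a word follows its context, divided by how often that context occurs. The sketch below does this for bigrams on a tiny made-up corpus. The corpus, the `</s>` end-of-sentence marker, and the helper name `bigram_prob` are assumptions chosen for illustration.

```python
from collections import Counter

# Toy corpus (hypothetical); tokens are split on whitespace
corpus = [
    "i love python",
    "i love language models",
    "python is fun",
]

context_counts = Counter()  # how often each word appears as a context
bigram_counts = Counter()   # how often each (previous word, word) pair appears
for sentence in corpus:
    # An end-of-sentence marker lets the model assign probability to stopping
    words = sentence.split() + ["</s>"]
    context_counts.update(words[:-1])
    bigram_counts.update(zip(words, words[1:]))

def bigram_prob(prev_word, word):
    """Count-based estimate of P(word | prev_word); 0.0 for an unseen context."""
    if context_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / context_counts[prev_word]

print(bigram_prob("i", "love"))       # "love" follows "i" in both occurrences
print(bigram_prob("love", "python"))  # "python" follows "love" in one of two
```

Unseen word pairs get probability zero under this estimate, which is why practical n-gram models add some form of smoothing.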

### Pros of n-gram models:

* Simple to implement.
* Works well for small datasets.

### Cons:

* Requires large memory to store probabilities for larger n-grams (e.g., 4-gram, 5-gram).
* Doesn't capture long-range dependencies because it only considers a fixed window of previous words.
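
To make the memory cost concrete: with a vocabulary of $V$ words there are up to $V^n$ distinct n-grams, so the number of possible table entries grows exponentially with $n$. A small sketch, assuming a hypothetical vocabulary of 10,000 words (real tables store only the n-grams actually observed, which is far fewer than this upper bound):

```python
def max_ngram_entries(vocab_size: int, n: int) -> int:
    """Upper bound on the number of distinct n-grams over a vocabulary."""
    return vocab_size ** n

vocab_size = 10_000  # hypothetical vocabulary size
for n in range(1, 6):
    print(f"{n}-gram: up to {max_ngram_entries(vocab_size, n):,} entries")
```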

---

### 2. **Recurrent Neural Networks (RNNs) and LSTMs**

**Recurrent Neural Networks (RNNs)** are a type of neural network designed to process sequences of data, such as text. They maintain a hidden state that is updated at every step, which lets them carry information from earlier inputs forward when predicting later ones. However, basic RNNs suffer from the **vanishing gradient problem**: gradients shrink as they are propagated back through many time steps, so the network struggles to learn long-range dependencies.
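
The vanishing gradient effect can be seen in a toy scalar RNN, $h_t = \tanh(w \cdot h_{t-1} + x_t)$. The gradient of the final state with respect to the initial state is a product of one factor per time step, and when each factor is smaller than 1 the product shrinks exponentially. The weight `w = 0.5` and the constant inputs below are arbitrary values chosen for illustration.

```python
import math

def gradient_through_time(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - tanh^2(...)) = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
    return grad

for steps in (1, 10, 50):
    g = gradient_through_time(w=0.5, inputs=[0.1] * steps)
    print(f"{steps:>2} steps: gradient = {g:.3e}")
```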

To address this, **Long Short-Term Memory (LSTM)** networks were introduced, which are a type of RNN designed to remember longer sequences more effectively.

### Key Concepts in LSTMs:

* **Cell state**: Carries information across time steps.
* **Forget gate**: Decides what information to throw away from the cell state.
* **Input gate**: Decides which values will be updated in the cell state.
* **Output gate**: Decides what the next hidden state will be.
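
The gates listed above can be sketched as a single LSTM time step in NumPy. This is a minimal illustration, not a framework implementation: the stacked weight layout (forget, input, candidate, output), the sizes, and the random weights are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step.

    W has shape (4 * hidden, input + hidden) and b has shape (4 * hidden,).
    Rows are stacked in the order: forget, input, candidate, output.
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    f = sigmoid(z[0:hidden])               # forget gate: what to drop from the cell state
    i = sigmoid(z[hidden:2 * hidden])      # input gate: which candidate values to write
    g = np.tanh(z[2 * hidden:3 * hidden])  # candidate values
    o = sigmoid(z[3 * hidden:4 * hidden])  # output gate: what to expose as hidden state
    c = f * c_prev + i * g                 # new cell state
    h = o * np.tanh(c)                     # new hidden state
    return h, c

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W = rng.normal(scale=0.1, size=(4 * hidden_size, input_size + hidden_size))
b = np.zeros(4 * hidden_size)

# Run the cell over a short random sequence of 5 input vectors
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):
    h, c = lstm_step(x, h, c, W, b)
print(h.shape, c.shape)
```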

### Pros of RNNs/LSTMs:

* Can learn long-range dependencies in sequential data.
* More powerful than n-grams in handling complex sequences.

### Cons:

* Can be computationally expensive.
* Training can be slow and requires large datasets.

---

### 3. **Transformers**

**Transformers** are the state-of-the-art architecture for many NLP tasks today. Unlike RNNs, transformers do not rely on sequential processing. Instead, they use a mechanism called **self-attention** that allows the model to look at all parts of the input at once and weigh their importance.
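
A minimal sketch of single-head scaled dot-product self-attention in NumPy shows the idea: every position computes a weight for every other position, then takes a weighted average of their value vectors. The dimensions and random projection matrices are placeholders chosen for illustration; real models learn these weights and use multiple heads.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row maximum for numerical stability
    scores = scores - scores.max(axis=axis, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a sequence X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # similarity of every pair of positions
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))      # one embedding vector per token
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape)           # one output vector per input position
print(weights.sum(axis=-1))   # attention weights per position sum to 1
```

Because the score matrix is computed for all positions in one matrix product, nothing in this computation has to wait for an earlier time step.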

The transformer architecture is based on two main components:

* **Encoder**: Processes input sequences.
* **Decoder**: Generates the output sequences.

Not every model uses both components. Transformers underpin models such as **BERT** (encoder only), **GPT-2** and **GPT-3** (decoder only), and **T5** (encoder and decoder).

### Advantages of Transformers:

* **Parallel processing**: Unlike RNNs, transformers can process all words at once, allowing for faster training.
* **Contextual understanding**: The self-attention mechanism allows the model to learn long-range dependencies more effectively.
* **Scalability**: Transformers scale well with large datasets and model sizes.

---

### Example of Language Models:

#### 1. **n-gram Example**:

For simplicity, we can implement a **Bigram** model using Python to predict the next word in a sentence:

```python
from collections import defaultdict
import random

# Sample text (corpus)
corpus = [
    "I love programming in Python",
    "Python is great for natural language processing",
    "Natural language processing is fun"
]

# Preprocess corpus to generate bigrams
bigrams = defaultdict(list)
for sentence in corpus:
    words = sentence.split()
    for i in range(len(words)-1):
        bigrams[words[i]].append(words[i+1])

# Function to generate a sentence by repeatedly sampling the next word from the bigram model
def generate_sentence(start_word, bigrams, length=5):
    current_word = start_word
    sentence = [current_word]
    for _ in range(length-1):
        candidates = bigrams.get(current_word)
        if not candidates:  # Stop early: this word never had a successor in the corpus
            break
        next_word = random.choice(candidates)
        sentence.append(next_word)
        current_word = next_word
    return " ".join(sentence)

# Test the bigram model
start_word = "Python"
print(generate_sentence(start_word, bigrams))
```

This is a simple example where we generate text based on bigrams. The output varies between runs; with the default `length=5`, one possible result is:

```
Python is great for natural
```

#### 2. **RNN Example**:

For RNN or LSTM-based models, you can use frameworks like **TensorFlow** or **PyTorch** to build a more advanced language model. Here's a simple example of using an LSTM for text generation:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding
from tensorflow.keras.optimizers import Adam

# Sample corpus
text = "I love programming in Python. Python is great for natural language processing."

# Minimal preprocessing: whitespace tokenization and an integer vocabulary
# (index 0 is left unused so it can serve as a padding ID later)
words = text.lower().replace(".", "").split()
vocab = {word: idx + 1 for idx, word in enumerate(sorted(set(words)))}
ids = [vocab[word] for word in words]
vocab_size = len(vocab) + 1
seq_len = 3

# Each training example maps seq_len word IDs to the ID of the word that follows
X_train = np.array([ids[i:i + seq_len] for i in range(len(ids) - seq_len)])
y_train = np.array([ids[i + seq_len] for i in range(len(ids) - seq_len)])

# Build a simple LSTM-based next-word model
model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=64),
    LSTM(128, return_sequences=True),
    LSTM(128),
    Dense(64, activation='relu'),
    Dense(vocab_size, activation='softmax')  # Probability distribution over the next word
])

# Compile the model (integer targets, so use the sparse categorical loss)
model.compile(optimizer=Adam(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=5)
```

Note: This is a very basic example. In a real scenario, you would need much more data and detailed preprocessing, including tokenization, padding, and handling large text corpora.

#### 3. **Transformer Example**:

Transformers are typically used with libraries like **Hugging Face's Transformers**. Here's an example of using a pre-trained GPT model for text generation:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Encode input text
input_text = "Once upon a time"
inputs = tokenizer.encode(input_text, return_tensors="pt")

# Generate output from the model
outputs = model.generate(inputs, max_length=50, num_return_sequences=1)

# Decode and print the generated text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

This will generate text from the model, continuing the prompt `"Once upon a time"`.

---

### **Summary**:

* **n-grams**: Simpler but less powerful; good for small datasets.
* **RNNs/LSTMs**: Powerful for capturing sequential data but require more resources and training time.
* **Transformers**: Currently the most powerful for large-scale tasks, capable of learning complex language dependencies, and are the backbone of models like BERT and GPT.

If you're working with large datasets and need state-of-the-art performance, **transformers** (like **GPT** or **BERT**) are the way to go. For smaller datasets or simpler tasks, **n-grams** or **RNNs**/LSTMs can still work effectively.


The following is a full example of using a **Transformer** model (**GPT-2**) for text generation with **Hugging Face's Transformers** library. It covers:

1. Setting up the environment.
2. Loading a pre-trained transformer model (GPT-2).
3. Using the model for text generation.
4. Fine-tuning the model (optional, for specific tasks).

### Steps:

### **1. Set Up Environment**

First, you need to install the required libraries. Run this in your terminal or notebook:

```bash
pip install transformers torch
```

### **2. Import Necessary Libraries**

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
```

### **3. Load Pre-trained GPT-2 Model and Tokenizer**

In this step, we load the pre-trained GPT-2 model and its tokenizer. Tokenizers convert the text into tokens that the model can understand.

```python
# Load pre-trained GPT-2 model and tokenizer from Hugging Face
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
```

### **4. Prepare Text for Input**

Now, you need to input the text that will serve as the prompt for the model. You can pass any text, and the model will generate the continuation.

```python
# Define the prompt text (the starting point of the text generation)
input_text = "Once upon a time, in a land far, far away,"

# Encode the input text (convert it to tokens the model understands)
inputs = tokenizer.encode(input_text, return_tensors="pt")
```

### **5. Generate Text**

Now that the text is tokenized, you can use the `generate()` method to produce new text from the prompt. The `max_length` argument caps the total length in tokens, including the prompt itself.

```python
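# Generate the continuation of the text
# do_sample=True enables sampling; without it, top_p and temperature are ignored
outputs = model.generate(
    inputs,
    max_length=100,
    num_return_sequences=1,
    no_repeat_ngram_size=2,
    do_sample=True,
    top_p=0.92,
    temperature=0.7,
)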

# Decode the generated text (convert tokens back into human-readable text)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

### **6. Fine-Tuning the Model (Optional)**

If you want to fine-tune the GPT-2 model on your own dataset (e.g., specific text or language), you can use the **Trainer API** from Hugging Face. However, this step requires more computational resources and time.

Here’s a simple overview of the fine-tuning process:

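```python
from transformers import (
    Trainer, TrainingArguments, GPT2Tokenizer, GPT2LMHeadModel,
    DataCollatorForLanguageModeling,
)
from datasets import load_dataset

# Load a dataset (for example, the "wikitext" dataset; substitute your own corpus here)
dataset = load_dataset("wikitext", "wikitext-103-raw-v1")

# GPT-2 has no padding token, so reuse the end-of-text token for padding
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Tokenize the dataset; padding is applied per batch by the data collator below
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
# Drop empty lines, which tokenize to zero-length examples
tokenized_datasets = tokenized_datasets.filter(lambda example: len(example["input_ids"]) > 0)

# Pads each batch and copies input_ids into labels (mlm=False selects causal language modeling)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Initialize the model
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Define the training arguments
training_args = TrainingArguments(
    output_dir="./results",  # Output directory
    eval_strategy="epoch",  # Called `evaluation_strategy` in older transformers releases
    learning_rate=2e-5,  # Learning rate
    per_device_train_batch_size=2,  # Batch size for training
    per_device_eval_batch_size=2,  # Batch size for evaluation
    num_train_epochs=1,  # Number of training epochs
    weight_decay=0.01,  # Weight decay to prevent overfitting
)

# Initialize the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
)

# Train the model
trainer.train()
```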

This step allows you to train the GPT-2 model on your custom text corpus, but it's much more resource-intensive, requiring a GPU for practical use.

### **7. Advanced Features**

The `generate()` method supports several options for controlling the output. Note that `top_p` and `temperature` apply only when sampling is enabled with `do_sample=True`:

* **Top-p sampling** (nucleus sampling): Samples the next token only from the smallest set of most likely tokens whose cumulative probability reaches `top_p` (as used above with `top_p=0.92`).
* **Temperature**: Controls the randomness of predictions (lower values make the model more deterministic, higher values make it more random).
* **No-repeat n-gram size**: Prevents the model from repeating any n-gram of the given size (as used above with `no_repeat_ngram_size=2`).
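
As a rough sketch of what temperature and top-p do to the next-token distribution, the pure-Python functions below rescale a list of logits and then keep only the nucleus of most likely tokens. This is an illustration under simplifying assumptions, not the Transformers implementation; the function names and the example logits are hypothetical.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda idx: probs[idx], reverse=True)
    kept, mass = [], 0.0
    for idx in order:
        kept.append(idx)
        mass += probs[idx]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for idx in kept:
        filtered[idx] = probs[idx] / mass
    return filtered

def sample_next(logits, temperature=0.7, top_p=0.92, rng=random):
    """Sample one token index after temperature scaling and top-p filtering."""
    probs = top_p_filter(apply_temperature(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
print(apply_temperature(logits, 1.0))   # baseline distribution
print(apply_temperature(logits, 0.7))   # lower temperature sharpens it
print(top_p_filter(apply_temperature(logits, 0.7), 0.92))
print(sample_next(logits))
```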

### **Complete Example Code**

Here’s a complete example:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

# Load pre-trained GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Input text (prompt)
input_text = "Once upon a time, in a land far, far away,"

# Encode the input text (convert to tokens)
inputs = tokenizer.encode(input_text, return_tensors="pt")

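# Generate the continuation of the text
# do_sample=True enables sampling; without it, top_p and temperature are ignored
outputs = model.generate(
    inputs,
    max_length=100,
    num_return_sequences=1,
    no_repeat_ngram_size=2,
    do_sample=True,
    top_p=0.92,
    temperature=0.7,
)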

# Decode the generated text (convert tokens back to text)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Print the generated text
print(generated_text)
```

### **Output Example** (Generated by GPT-2; your output will likely differ)

```
Once upon a time, in a land far, far away, there was a great kingdom ruled by a wise and just king. His people loved him, and he was adored by his people. The king had three children, and each of them was trained to be a leader in their own way. The youngest was known for his intelligence, while the eldest was known for his strength. The middle child was known for her kindness and compassion.
```

---

### **Summary of Key Concepts**:

1. **Language Models**: Predict the next word in a sequence based on the previous words. GPT-2 is a widely used Transformer-based example.
2. **Tokenization**: Text is broken down into smaller chunks (tokens) that the model can process.
3. **Text Generation**: Given a prompt, the model generates a sequence of words that follow.
4. **Fine-tuning**: Pre-trained models like GPT-2 can be fine-tuned on your custom dataset to adapt them for specific tasks or domains.

---

This example shows you how to use a pre-trained transformer model (GPT-2) for text generation. You can fine-tune it further for specific tasks or use it for generating creative content or even applying it in dialogue systems.


The same library also covers sequence-to-sequence tasks. The following notebook cell summarizes a passage with a pre-trained BART model:

```python
from transformers import BartTokenizer, BartForConditionalGeneration

# 1. Load pre-trained BART model and tokenizer
model_name = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

# 2. Input long text to summarize
text = """
Daffodil International University (DIU) is a private university located in Dhaka, Bangladesh.
The university was established in 2002 and has rapidly grown in terms of its academic offerings, research contributions, and student population.
DIU is known for its modern approach to higher education, incorporating technology and innovation into its curriculum.
It offers undergraduate and graduate programs in various fields including computer science, business, engineering, public health, and more.
DIU also emphasizes entrepreneurship and encourages students to engage in startups and innovation.
The university has a diverse student body and is actively involved in global academic collaborations.
"""

# 3. Tokenize input text
inputs = tokenizer.encode(
    text,
    return_tensors="pt",
    max_length=1024,
    truncation=True
)

# 4. Generate summary
summary_ids = model.generate(
    inputs,
    max_length=100,
    min_length=30,
    length_penalty=2.0,
    num_beams=4,
    early_stopping=True
)

# 5. Decode and print summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("\n🔹 Summarized Text:\n", summary)
```




Output:

```
🔹 Summarized Text:
 Daffodil International University (DIU) is a private university located in Dhaka, Bangladesh. The university was established in 2002 and has rapidly grown in terms of its academic offerings, research contributions, and student population.
```
