# History of Language Models: Evolution, Frameworks, Deployment


## Naive Language Model (N-gram)

**Concept:**
- **N-gram Models**: An N-gram model predicts the next word from the previous N-1 words, using counts of N-word sequences seen in the training text. A bigram model considers pairs of words, while a trigram model considers triplets.
- **Bigram**: Uses pairs of words, e.g., "natural language", "language processing".
- **Trigram**: Uses triplets of words, e.g., "natural language processing".
- **Limitations**: Context is limited to a fixed window of N-1 preceding words, so the model cannot capture long-range dependencies or semantic meaning.

**Application**:
- Predicts the next word based on the previous words in the sequence. For instance, after "language", a bigram model predicts "processing" if that pair was seen frequently; a trigram model conditions on the two preceding words, "natural language".

In [1]:
from collections import defaultdict, Counter
import random

# Example text
text = "I love natural language processing because natural language processing is fun"

# Tokenize the text
words = text.split()

# Generate bigrams and trigrams
bigrams = [(words[i], words[i+1]) for i in range(len(words)-1)]
trigrams = [(words[i], words[i+1], words[i+2]) for i in range(len(words)-2)]

# Count bigram and trigram frequencies
bigram_freq = defaultdict(Counter)
trigram_freq = defaultdict(Counter)

for bigram in bigrams:
    bigram_freq[bigram[0]][bigram[1]] += 1

for trigram in trigrams:
    trigram_freq[(trigram[0], trigram[1])][trigram[2]] += 1

# Function to predict the next word using bigrams
def predict_next_bigram(current_word):
    if current_word in bigram_freq:
        # random.choices selects one word from the possible next words,
        # with the probability of each word proportional to its frequency,
        # so more frequent continuations are more likely to be selected.
        next_word = random.choices(list(bigram_freq[current_word].keys()), list(bigram_freq[current_word].values()))[0]
        return next_word
    else:
        return None

# Function to predict the next word using trigrams
def predict_next_trigram(word_pair):
    if word_pair in trigram_freq:
        next_word = random.choices(list(trigram_freq[word_pair].keys()), list(trigram_freq[word_pair].values()))[0]
        return next_word
    else:
        return None

# Example predictions
current_word = "natural"
next_word_bigram = predict_next_bigram(current_word)
print(f"Bigram Prediction for '{current_word}':", next_word_bigram)

Bigram Prediction for 'natural': language


In [3]:
word_pair = ("natural", "language")
next_word_trigram = predict_next_trigram(word_pair)
print(f"trigram Prediction for '{word_pair}':", next_word_trigram)

trigram Prediction for '('natural', 'language')': processing


## Statistical Language Models (HMM and MLE)

A first-of-its-kind approach that aimed to find the implicit structure in textual data.

**Hidden Markov Models (HMM)**:
- **Concept**: HMMs are used for sequence prediction where the system being modeled is assumed to follow a Markov process with hidden states.
- **Components**: States (e.g., part-of-speech tags), Observations (words), Transition Probabilities (state to state), Emission Probabilities (state to word), and Initial Probabilities (start state).
- **Application**: Used in part-of-speech tagging, where the hidden states are tags, and the observations are words.
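
The components listed above can be made concrete with a small decoding sketch. The block below runs the Viterbi algorithm, the standard way to recover the most likely tag sequence from an HMM, over a toy model. The tag set, vocabulary, and every probability value are made-up illustrative assumptions, not estimates from any corpus.

```python
# Toy HMM for part-of-speech tagging. All numbers are illustrative assumptions.
states = ["DT", "NN", "VBZ"]
initial = {"DT": 0.6, "NN": 0.3, "VBZ": 0.1}
transition = {
    "DT": {"DT": 0.05, "NN": 0.9, "VBZ": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VBZ": 0.6},
    "VBZ": {"DT": 0.6, "NN": 0.3, "VBZ": 0.1},
}
emission = {
    "DT": {"the": 0.7, "a": 0.3},
    "NN": {"dog": 0.5, "language": 0.4, "barks": 0.1},
    "VBZ": {"barks": 0.7, "is": 0.3},
}


def viterbi(words):
    """Return the most likely tag sequence for a list of words."""
    # best[t][s] = (probability of the best path ending in state s at step t,
    #               previous state on that path)
    best = [{s: (initial[s] * emission[s].get(words[0], 0.0), None) for s in states}]
    for word in words[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * transition[p][s] * emission[s].get(word, 0.0), p)
                for p in states
            )
        best.append(column)

    # Backtrack from the most probable final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))


tags = viterbi(["the", "dog", "barks"])
print(tags)
```

A production tagger would work with log probabilities to avoid numerical underflow on long sentences; plain products are used here only to keep the sketch short.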

**Maximum Likelihood Estimation (MLE)**:
- **Concept**: MLE is a statistical method for estimating the parameters of a model that maximize the likelihood of the observed data.
- **Application**: In language models, MLE estimates the probabilities of sequences of words (e.g., bigram/trigram probabilities) by maximizing the likelihood of the observed word sequences in the training data.
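
For a bigram model, the maximum likelihood estimate has a closed form: $P(w_i \mid w_{i-1}) = C(w_{i-1}, w_i) / C(w_{i-1})$, where $C$ counts occurrences in the training text. The following minimal sketch computes it on the example sentence used earlier; the function name `mle_bigram_prob` is a hypothetical helper introduced for illustration.

```python
from collections import Counter

tokens = "I love natural language processing because natural language processing is fun".split()

# Count each word as the first element of a bigram (the last token starts no bigram).
context_counts = Counter(tokens[:-1])
bigram_counts = Counter(zip(tokens, tokens[1:]))


def mle_bigram_prob(prev_word, word):
    """MLE estimate of P(word | prev_word) = C(prev_word, word) / C(prev_word)."""
    if context_counts[prev_word] == 0:
        return 0.0  # unseen context: plain MLE has no estimate, so fall back to 0
    return bigram_counts[(prev_word, word)] / context_counts[prev_word]


p_language = mle_bigram_prob("natural", "language")
p_because = mle_bigram_prob("processing", "because")
print(p_language, p_because)
```

Plain MLE assigns zero probability to any bigram that never appeared in training, which is why practical n-gram models add smoothing.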

  ![hmm1](hmm1.png)

  ![hmm2](hmm2.png)

  

In [4]:
from nltk.tag import hmm


# Prepare training data (simplified for illustration purposes)
train_data = [
    [('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('subfield', 'NN'), ('of', 'IN'), ('artificial', 'JJ'), ('intelligence', 'NN'), ('.', '.')],
    [('The', 'DT'), ('ultimate', 'JJ'), ('goal', 'NN'), ('of', 'IN'), ('NLP', 'NNP'), ('is', 'VBZ'), ('to', 'TO'), ('enable', 'VB'), ('computers', 'NNS'), ('to', 'TO'), ('understand', 'VB'), (',', ','), ('interpret', 'VB'), (',', ','), ('and', 'CC'), ('generate', 'VB'), ('human', 'JJ'), ('language', 'NN'), ('.', '.')],
    [('Over', 'IN'), ('the', 'DT'), ('years', 'NNS'), (',', ','), ('NLP', 'NNP'), ('has', 'VBZ'), ('seen', 'VBN'), ('significant', 'JJ'), ('advancements', 'NNS'), (',', ','), ('driven', 'VBN'), ('by', 'IN'), ('the', 'DT'), ('development', 'NN'), ('of', 'IN'), ('sophisticated', 'JJ'), ('algorithms', 'NNS'), ('and', 'CC'), ('the', 'DT'), ('availability', 'NN'), ('of', 'IN'), ('large', 'JJ'), ('datasets', 'NNS'), ('.', '.')],
    [('Techniques', 'NNS'), ('such', 'JJ'), ('as', 'IN'), ('machine', 'NN'), ('learning', 'NN'), (',', ','), ('deep', 'JJ'), ('learning', 'NN'), (',', ','), ('and', 'CC'), ('neural', 'JJ'), ('networks', 'NNS'), ('have', 'VBP'), ('transformed', 'VBN'), ('NLP', 'NNP'), (',', ','), ('enabling', 'VBG'), ('applications', 'NNS'), ('like', 'IN'), ('speech', 'NN'), ('recognition', 'NN'), (',', ','), ('machine', 'NN'), ('translation', 'NN'), (',', ','), ('sentiment', 'NN'), ('analysis', 'NN'), (',', ','), ('and', 'CC'), ('conversational', 'JJ'), ('agents', 'NNS'), ('.', '.')],
    [('These', 'DT'), ('technologies', 'NNS'), ('have', 'VBP'), ('not', 'RB'), ('only', 'RB'), ('improved', 'VBN'), ('the', 'DT'), ('accuracy', 'NN'), ('and', 'CC'), ('efficiency', 'NN'), ('of', 'IN'), ('language', 'NN'), ('processing', 'NN'), ('tasks', 'NNS'), ('but', 'CC'), ('have', 'VBP'), ('also', 'RB'), ('expanded', 'VBN'), ('the', 'DT'), ('range', 'NN'), ('of', 'IN'), ('possible', 'JJ'), ('applications', 'NNS'), (',', ','), ('making', 'VBG'), ('human-computer', 'JJ'), ('interaction', 'NN'), ('more', 'RBR'), ('natural', 'JJ'), ('and', 'CC'), ('intuitive', 'JJ'), ('.', '.')]
]

# Train HMM
trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train(train_data)



In [5]:
# Function to generate the next word using HMM with consideration to avoid repetition
def generate_next_word(tagger, context, prev_word=None):
    tagged_context = tagger.tag(context.split())
    last_word = tagged_context[-1][0]
    last_tag = tagged_context[-1][1]

    # Get the next state (tag)
    next_tag = max(tagger._transitions[last_tag].samples(), key=lambda tag: tagger._transitions[last_tag].prob(tag))

    # Get the next word from the emission probabilities, avoiding repetition
    word_probs = {word: tagger._outputs[next_tag].prob(word) for word in tagger._outputs[next_tag].samples()}
    # Drop the previous word only if another candidate remains, so max() never sees an empty dict
    if prev_word in word_probs and len(word_probs) > 1:
        del word_probs[prev_word]

    next_word = max(word_probs, key=word_probs.get)

    return next_word

# Function to generate a sentence of n words
def generate_sentence(tagger, n):
    context = 'Natural'
    sentence = [context]

    for _ in range(n - 1):
        next_word = generate_next_word(tagger, context, sentence[-1])
        sentence.append(next_word)
        context += ' ' + next_word

    return ' '.join(sentence)

# Example usage
sentence = generate_sentence(tagger, 10)
print(sentence)

Natural language processing language processing language processing language processing language


## The Rise of Machine Learning

**Concept**:
- **Shift from Rule-based to Data-driven Approaches**: Early NLP systems used hand-crafted rules, which were rigid and limited. The rise of machine learning introduced data-driven approaches that could learn patterns from large datasets.
- **Naive Bayes**: A probabilistic classifier that applies Bayes' theorem with strong (naive) independence assumptions between features. Commonly used for text classification tasks like spam detection.
- **Logistic Regression**: A statistical model used for binary classification tasks. It models the probability that a given input belongs to a particular category.
- **Example**: Automatically adding labels to customer issues, or assigning T-shirt-size effort estimates to developer issues.
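
To show what logistic regression does under the hood, here is a minimal from-scratch sketch trained with stochastic gradient descent on bag-of-words counts. The learning rate, epoch count, and helper names (`featurize`, `train`, `predict_proba`) are illustrative assumptions; in practice a library implementation such as scikit-learn's would normally be used instead.

```python
import math

texts = ["I love natural language processing",
         "Natural language processing is fun",
         "I dislike processing errors"]
labels = [1, 1, 0]  # 1: positive, 0: negative

vocab = sorted({word for text in texts for word in text.lower().split()})


def featurize(text):
    """Bag-of-words count vector over the training vocabulary."""
    words = text.lower().split()
    return [words.count(term) for term in vocab]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(features, targets, learning_rate=0.5, epochs=200):
    """Fit weights and bias by SGD on the log loss."""
    weights, bias = [0.0] * len(vocab), 0.0
    for _ in range(epochs):
        for x, y in zip(features, targets):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = p - y  # gradient of the log loss with respect to the logit
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


weights, bias = train([featurize(t) for t in texts], labels)


def predict_proba(text):
    """Probability that the text belongs to the positive class."""
    x = featurize(text)
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


predictions = [int(predict_proba(t) >= 0.5) for t in texts]
print(predictions)
```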

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Example data
texts = ["I love natural language processing", "Natural language processing is fun", "I dislike processing errors"]
labels = [1, 1, 0]  # 1: positive, 0: negative

# Vectorize text
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Train Naive Bayes
model = MultinomialNB()
model.fit(X, labels)

# Predict example
example = vectorizer.transform(["I love natural language processing"])
prediction = model.predict(example)
print("Prediction:", prediction)

# Sanity check on the negative training example
model.predict(vectorizer.transform([texts[2]]))


Prediction: [1]


## Neural Networks in NLP

**Feedforward Neural Networks**:
- **Concept**: Basic neural network architecture where information moves in one direction from input to output.
- **Limitation**: Takes a fixed-size input and has no built-in notion of word order, so it cannot capture sequential information in text data.
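
One way to see this limitation is to look at what the network receives. The sketch below assumes the text is fed in as bag-of-words counts, which is one common choice for feedforward models but is an assumption here rather than something the notes specify. Two sentences with opposite meanings then produce identical inputs.

```python
from collections import Counter


def bag_of_words(text):
    """Word counts with all ordering information discarded."""
    return Counter(text.lower().split())


a = bag_of_words("the dog chased the cat")
b = bag_of_words("the cat chased the dog")

# Identical inputs mean any feedforward model on top must give identical outputs.
same_input = a == b
print(same_input)
```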

## Word Embeddings

![word2vec](w2v.png)

**Concept**:
- **Word Embeddings**: These are dense vector representations of words that capture semantic meanings. Words with similar meanings have similar vector representations.
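- **Word2Vec**: Learns embeddings with one of two objectives: Skip-gram predicts the surrounding context words from a target word, while CBOW (continuous bag of words) predicts a target word from its surrounding context. Both capture semantic relationships between words.
- **GloVe**: Global Vectors for Word Representation, which learns embeddings from global word-word co-occurrence statistics computed over the whole corpus.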
- **FastText**: Extends Word2Vec by considering subword information, which helps in handling out-of-vocabulary words.

**Application**:
- Top-k similar items (Spotify, Amazon, YouTube recommender systems)
- Information retrieval
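
Top-k retrieval over embeddings usually comes down to ranking by cosine similarity. The sketch below uses hand-written 3-dimensional vectors as stand-ins for learned embeddings; the item names, vector values, and the `top_k_similar` helper are all hypothetical.

```python
import math

# Hypothetical embeddings; real ones would come from a trained model.
embeddings = {
    "guitar": [0.9, 0.1, 0.0],
    "piano": [0.8, 0.2, 0.1],
    "drums": [0.7, 0.0, 0.3],
    "laptop": [0.0, 0.9, 0.4],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_similar(query, k):
    """Names of the k items most similar to the query item, best first."""
    scores = [
        (cosine_similarity(embeddings[query], vector), name)
        for name, vector in embeddings.items()
        if name != query
    ]
    return [name for _, name in sorted(scores, reverse=True)[:k]]


nearest = top_k_similar("guitar", 2)
print(nearest)
```

With millions of items, the exhaustive scan above is typically replaced by an approximate nearest-neighbor index.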



In [13]:
from gensim.models import Word2Vec

# Example data
sentences = [["I", "love", "natural", "language", "processing"],
             ["natural", "language", "processing", "is", "fun"]]

# Train Word2Vec
model = Word2Vec(sentences, vector_size=50, min_count=1)

# Get embeddings
vector = model.wv['natural']
print("Word2Vec Embedding for 'natural':", vector)


Word2Vec Embedding for 'natural': [-0.01723938  0.00733148  0.01037977  0.01148388  0.01493384 -0.01233535
  0.00221123  0.01209456 -0.0056801  -0.01234705 -0.00082045 -0.0167379
 -0.01120002  0.01420908  0.00670508  0.01445134  0.01360049  0.01506148
 -0.00757831 -0.00112361  0.00469675 -0.00903806  0.01677746 -0.01971633
  0.01352928  0.00582883 -0.00986566  0.00879638 -0.00347915  0.01342277
  0.0199297  -0.00872489 -0.00119868 -0.01139127  0.00770164  0.00557325
  0.01378215  0.01220219  0.01907699  0.01854683  0.01579614 -0.01397901
 -0.01831173 -0.00071151 -0.00619968  0.01578863  0.01187715 -0.00309133
  0.00302193  0.00358008]


**Recurrent Neural Networks (RNNs)**:

![](rnn.png)

- **Concept**: Designed to handle sequential data by maintaining a hidden state that captures information about previous elements in the sequence.
- **Limitation**: Struggles with long-term dependencies due to vanishing gradient problems.
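
The hidden-state update can be sketched in a few lines. The block below implements the standard simple (Elman) RNN recurrence, `h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)`, in plain Python. The weight values are arbitrary illustrative numbers rather than trained parameters.

```python
import math


def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    """One recurrence step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    h_new = []
    for i in range(len(h_prev)):
        total = b_h[i]
        total += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new


# Arbitrary illustrative weights (2 input features, 2 hidden units).
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [-0.4, 0.6]]
b_h = [0.0, 0.0]


def run_rnn(sequence):
    """Feed a sequence through the RNN and return the final hidden state."""
    h = [0.0, 0.0]
    for x in sequence:
        h = rnn_step(x, h, W_xh, W_hh, b_h)
    return h


sequence = [[1.0, 0.0], [0.0, 1.0]]
h_forward = run_rnn(sequence)
h_reversed = run_rnn(sequence[::-1])
print(h_forward, h_reversed)
```

Because each state depends on the previous one, feeding the same inputs in reverse order yields a different final state, which is exactly the order sensitivity a feedforward network lacks.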

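**Long Short-Term Memory (LSTM)**:

![](lstm.png)

- **Concept**: An improved version of RNNs designed to capture long-term dependencies. Input, forget, and output gates control what is written to, kept in, and read from a separate cell state. Because the cell state is updated additively rather than being repeatedly squashed through a nonlinearity, gradients can flow across many time steps without vanishing as quickly.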
- **Application**: Used in tasks like language modeling, machine translation, and speech recognition.
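
For reference, the gates can be written out in the standard textbook formulation. The notation is an assumption of this sketch: $x_t$ is the input, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid, $\odot$ element-wise multiplication, and $[h_{t-1}, x_t]$ the concatenation of the previous hidden state and the current input.

$$
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The additive form of the $c_t$ update is what lets information and gradients persist across many time steps when the forget gate stays close to 1.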

In [14]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Example data
texts = ["I love natural language processing", "natural language processing is fun"]
labels = [1, 1]  # Example binary labels (both positive, so this only demonstrates the training mechanics)

# Tokenize text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index

# Pad sequences
data = pad_sequences(sequences)
labels = torch.tensor(labels, dtype=torch.float32)

# Create TensorDataset
data_tensor = torch.tensor(data, dtype=torch.long)
dataset = TensorDataset(data_tensor, labels)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

# Define LSTM model
class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, output_size):
        super(LSTMModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.embedding(x)
        x, (hn, cn) = self.lstm(x)
        x = self.fc(x[:, -1, :])
        x = self.sigmoid(x)
        return x

# Model parameters
vocab_size = len(word_index) + 1
embed_size = 50
hidden_size = 50
output_size = 1

# Initialize model, loss function, and optimizer
model = LSTMModel(vocab_size, embed_size, hidden_size, output_size)
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop (for illustration, normally would train for more epochs)
num_epochs = 10
for epoch in range(num_epochs):
    for inputs, targets in dataloader:
        outputs = model(inputs)
        loss = criterion(outputs.squeeze(), targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {loss.item()}')

# Example prediction
with torch.no_grad():
    example_sequence = torch.tensor(pad_sequences(tokenizer.texts_to_sequences(["I love natural language processing"]), maxlen=data.shape[1]), dtype=torch.long)
    prediction = model(example_sequence)
    print("LSTM Prediction:", prediction.item())


Epoch 1/10, Loss: 0.7109162211418152
Epoch 2/10, Loss: 0.6827574968338013
Epoch 3/10, Loss: 0.6552144885063171
Epoch 4/10, Loss: 0.6282566785812378
Epoch 5/10, Loss: 0.6018567681312561
Epoch 6/10, Loss: 0.5759904384613037
Epoch 7/10, Loss: 0.550626277923584
Epoch 8/10, Loss: 0.5257291793823242
Epoch 9/10, Loss: 0.501266598701477
Epoch 10/10, Loss: 0.47721171379089355
LSTM Prediction: 0.6247726082801819


## The Transformer Revolution

![t](tf.png)

**Concept**:
- **Transformer Model**: Introduced by Vaswani et al. (2017) in "Attention Is All You Need", it relies on a self-attention mechanism to process all positions of a sequence in parallel, unlike RNNs, which process tokens one at a time.
- **Self-Attention Mechanism**: Allows the model to weigh the importance of different words in a sequence, capturing long-range dependencies more effectively.
- **Impact**: Revolutionized NLP by enabling the training of much larger models on larger datasets, leading to significant improvements in performance on various NLP tasks.
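
The core of self-attention is scaled dot-product attention: each position scores every other position, turns the scores into weights with a softmax, and takes a weighted average of the values. The plain-Python sketch below deliberately leaves out the learned query, key, and value projections and the multiple heads of a real transformer, so the token vectors themselves serve as queries, keys, and values. The vectors are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, attention weights)."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        output = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy token vectors; no learned projections are applied in this sketch.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Every row of weights is computed independently of the others, which is why the whole sequence can be processed in parallel.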



In [27]:
# Note: Transformers require a more complex setup and often large datasets.
# Here we use a pre-trained model from Hugging Face for simplicity.

from transformers import pipeline

# Load pre-trained model
generator = pipeline('text-generation', model='gpt2')

# Generate text
result = generator("I love natural language processing", max_length=100)
print("Generated Text:", result[0]['generated_text'])



Generated Text: I love natural language processing, and that sounds like an interesting topic. As another good example, I


## BERT and Pre-trained Models

**Concept**:
- **BERT (Bidirectional Encoder Representations from Transformers)**: A transformer encoder pre-trained on large corpora and fine-tuned for specific tasks. It is trained with masked language modeling, so each token's representation draws on context from both the left and the right.
- **Pre-training and Fine-tuning**: Pre-training on a large dataset and then fine-tuning on a smaller, task-specific dataset.
- **Applications**: Question answering, sentiment analysis, named entity recognition, and more.
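
BERT's main pre-training objective is masked language modeling: some input tokens are hidden and the model must recover them from the surrounding words on both sides. The sketch below shows only the data preparation step in simplified form. The `mask_tokens` helper is hypothetical, and real BERT pre-training additionally leaves some selected tokens unchanged or swaps them for random tokens instead of always inserting `[MASK]`.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK].

    Returns the masked sequence and a dict mapping each masked position
    to the original token the model would be trained to predict.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[position] = token
        else:
            masked.append(token)
    return masked, targets


sentence = "natural language processing is a subfield of artificial intelligence".split()
# A higher mask probability than BERT's usual 15% makes the effect visible on a short sentence.
masked, targets = mask_tokens(sentence, mask_prob=0.3)
print(masked)
print(targets)
```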

## GPT Series

![](gpt.png)

**Concept**:
- **GPT (Generative Pre-trained Transformer)**: A series of models (GPT, GPT-2, GPT-3, GPT-4) that generate human-like text based on a given prompt.
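- **Architecture**: Uses a decoder-only transformer trained to predict the next token, so text is generated autoregressively, one token at a time.
- **Evolution**: Model size grew sharply across the series, from roughly 117 million parameters (GPT) to 1.5 billion (GPT-2) and 175 billion (GPT-3), alongside larger training sets; OpenAI has not published GPT-4's size. Larger models have generally produced more coherent text.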
- **Applications**: Text completion, content creation, conversational agents, and more.
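
Generation from a GPT-style model is a loop: predict a distribution over the next token, pick one, append it, and repeat. The sketch below shows greedy decoding with a hand-written lookup table standing in for the model. This is a loud simplification: the table and its probabilities are hypothetical, and it conditions only on the last token, whereas a real GPT model conditions on the entire preceding sequence.

```python
# Hypothetical next-token probabilities standing in for a trained model.
next_token_probs = {
    "<s>": {"I": 0.9, "language": 0.1},
    "I": {"love": 0.8, "like": 0.2},
    "love": {"language": 0.7, "models": 0.3},
    "language": {"models": 0.6, "</s>": 0.4},
    "models": {"</s>": 1.0},
}


def generate(prompt, max_new_tokens=10):
    """Greedy decoding: always append the most probable next token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        candidates = next_token_probs.get(tokens[-1])
        if not candidates:
            break  # the toy model has no prediction for this token
        next_token = max(candidates, key=candidates.get)
        if next_token == "</s>":
            break  # end-of-sequence token stops generation
        tokens.append(next_token)
    return tokens


generated = generate(["<s>", "I"])
print(" ".join(generated[1:]))
```

Sampling from the distribution instead of taking the maximum trades some coherence for more varied output.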

### Summary:
From early n-gram models to advanced transformer-based models, the evolution of language models in NLP has dramatically improved the ability to understand and generate human language. Each advancement brought better handling of context, dependencies, and semantic meaning, leading to more accurate and versatile applications in various NLP tasks.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load an open LLaMA-style causal language model in half precision
model_name = "openlm-research/open_llama_3b_v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, low_cpu_mem_usage=True)

# Encode the input text
input_text = "I love natural language processing because"
inputs = tokenizer(input_text, return_tensors='pt')

# Generate text
outputs = model.generate(inputs['input_ids'], max_length=50, num_return_sequences=1)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Generated Text:", generated_text)


In [2]:
print("Generated Text:", generated_text)

Generated Text: I love natural language processing because it is a great way to get a sense of what people are thinking. I have been working on a project to build a natural language processing system that can understand the meaning of a sentence.
I have been working
