# Transformers (Encoder-Decoder Arch.)

<img src="https://machinelearningmastery.com/wp-content/uploads/2021/10/transformer_1.png" width=500px height=700px/>

## **Attention Is All You Need** (2017 paper)

`ANN` - better for tabular data <br>
`CNN` - better for image data <br>
`RNN` - better for sequential data <br>
`Transformers` - better for sequence-to-sequence tasks (text)

# Impact of Transformers in NLP:


### **Evolution of NLP**
# **Heuristic Methods**

Here are some common heuristic methods used in NLP:

## 1.1. Rule-Based Approaches
**Regular Expressions (Regex):** A powerful tool for pattern matching in text, often used for tasks like text extraction, data validation, or string manipulation.

**Dictionary/Keyword Matching:** Involves using predefined lists of words or phrases to identify specific entities or concepts in the text (e.g., identifying names, locations, or dates).

**Grammar-Based Parsing:** Uses syntactic rules to parse sentences and extract structured information, useful in tasks like Information Extraction (IE).
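
The regex and keyword-matching ideas above can be sketched in a few lines. This is a minimal illustration, not a production extractor: the date pattern, the `KNOWN_LOCATIONS` list, and the function names are all assumptions made up for the example.

```python
import re

# Hypothetical pattern and gazetteer, chosen only for illustration.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
KNOWN_LOCATIONS = {"paris", "london", "tokyo"}


def extract_dates(text):
    """Return all ISO-style dates (YYYY-MM-DD) matched by the regex."""
    return DATE_PATTERN.findall(text)


def find_locations(text):
    """Return the words that appear in the predefined location list."""
    words = re.findall(r"[A-Za-z]+", text)
    return [w for w in words if w.lower() in KNOWN_LOCATIONS]


sentence = "The meeting in Paris is on 2024-03-15, then London on 2024-04-01."
print(extract_dates(sentence))   # ['2024-03-15', '2024-04-01']
print(find_locations(sentence))  # ['Paris', 'London']
```

Both functions only see surface patterns: a location missing from the list, or a date written in another format, is silently skipped. That is the typical weakness of rule-based approaches.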

## 1.2. Heuristics for Text Preprocessing
**Stopword Removal:** Heuristically identifying and removing common words (like "the," "is," "in") that usually carry less meaning in text.

**Stemming/Lemmatization:** Reducing words to their base or root form, based on rules or dictionaries, to standardize text (e.g., "running" → "run").

**Tokenization:** Splitting text into words or phrases using simple rules like splitting by spaces or punctuation.
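
A rough sketch of these three preprocessing heuristics chained together. The stopword list and suffix rules are tiny, made-up stand-ins for the larger resources real toolkits ship, and `crude_stem` is a deliberately naive suffix stripper rather than a real stemming algorithm such as Porter's.

```python
import re

# Tiny illustrative resources; real systems use much larger lists and rules.
STOPWORDS = {"the", "is", "in", "a", "an", "and", "of"}
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


tokens = tokenize("The cats jumped in the garden.")
filtered = remove_stopwords(tokens)
stems = [crude_stem(t) for t in filtered]
print(stems)  # ['cat', 'jump', 'garden']
```

Suffix stripping this simple turns "running" into "runn" rather than "run", which is why practical stemmers add rules for doubled consonants and lemmatizers fall back on dictionaries.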

## 1.3. Heuristics in Named Entity Recognition (NER)
**Capitalization Rules:** Identifying proper nouns (e.g., names of people, places) based on capitalization patterns.

**Contextual Rules:** Using surrounding words to infer entity types (e.g., "Dr." before a word likely indicates a person's name).

## 1.4. Heuristics in Sentiment Analysis
**Sentiment Lexicons:** Using predefined lists of positive and negative words to estimate the sentiment of a sentence or document.

**Polarity Shifting:** Applying simple rules to handle negations or modifiers (e.g., "not good" → negative sentiment).
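
As a minimal sketch of a sentiment lexicon with polarity shifting: the word lists below are invented for the example, and the rule that a negation flips only the next word is a simplifying assumption.

```python
# Made-up lexicons for illustration only.
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "poor", "terrible"}
NEGATIONS = {"not", "never", "no"}


def lexicon_sentiment(text):
    """Sum word polarities, flipping the sign of a word that follows a negation."""
    score = 0
    negate = False
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word in NEGATIONS:
            negate = True
            continue
        # +1 for a positive word, -1 for a negative word, 0 otherwise.
        polarity = (word in POSITIVE) - (word in NEGATIVE)
        score += -polarity if negate else polarity
        negate = False
    return score


print(lexicon_sentiment("The food was not good."))  # -1
print(lexicon_sentiment("Great service!"))          # 1
```

A score above zero reads as positive and below zero as negative. Sarcasm, longer-range negation ("not at all good"), and words missing from the lexicon all defeat this kind of rule.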

## 1.5. Heuristics in Machine Translation
**Phrase-Based Heuristics:** Translating common phrases directly based on bilingual dictionaries, handling exceptions with simple rules.

**Contextual Heuristics:** Adjusting translations based on the context (e.g., different meanings of the same word in different sentences).

## 1.6. Heuristics in Topic Modeling
**Keyword-Based Topic Identification:** Identifying topics in a document by matching keywords to predefined topic lists.

**Frequency-Based Heuristics:** Assuming high-frequency terms in a document likely represent its main topics.

## 1.7. Heuristics in Search and Information Retrieval
**TF-IDF (Term Frequency-Inverse Document Frequency):** Scoring each term by how often it appears in a document, discounted by how many documents in the collection contain it, and ranking documents by the scores of the query terms.

**Proximity Heuristics:** Ranking search results based on the closeness of query terms in documents.
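
To make the TF-IDF idea concrete, here is a small sketch using one common variant of the formula (term count divided by document length, times the log of total documents over document frequency). Libraries differ in smoothing and normalization, so treat the exact numbers as specific to this variant; the function name and sample documents are assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Uses tf = count / document length and idf = log(N / document frequency),
    which is one common variant among several.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["the cat sat", "the dog barked", "the cat purred"]
scores = tf_idf(docs)
print(scores[0])
```

In this toy corpus "the" appears in every document, so its weight is zero, while "sat" (one document) outweighs "cat" (two documents).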


# **Statistical Machine Learning Models for NLP**

## 1. Naive Bayes Classifier
- **Description:** A probabilistic model based on Bayes' theorem, which assumes that features (e.g., words in a document) are independent given the class label. Despite its "naive" independence assumption, it performs surprisingly well in many NLP tasks.
- **Use Cases:** Text classification (e.g., spam detection, sentiment analysis), document categorization.
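
A compact sketch of a multinomial Naive Bayes text classifier, written from scratch to show the independence assumption at work: the log-probabilities of the individual words are simply added. Add-one (Laplace) smoothing and the four-sentence training set are choices made for this illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (text, label). Returns the counts needed for prediction."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def predict(text, model):
    """Pick the label with the highest log posterior, using add-one smoothing."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Each word contributes independently, given the label.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting schedule for monday", "ham"),
    ("lunch with the team monday", "ham"),
]
model = train_naive_bayes(training)
print(predict("free money", model))           # spam
print(predict("team meeting monday", model))  # ham
```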

## 2. Logistic Regression
- **Description:** A linear model that estimates the probability of a binary outcome (e.g., positive vs. negative sentiment) by fitting a logistic function to the input features. It can be extended to multiclass classification using techniques like one-vs-all.
- **Use Cases:** Sentiment analysis, text classification, named entity recognition (NER).

## 3. Support Vector Machines (SVM)
- **Description:** A powerful model that finds the hyperplane in a high-dimensional space that best separates different classes. It is particularly effective in high-dimensional spaces like those encountered in text data.

- **Use Cases:** Text classification, sentiment analysis, topic categorization.

## 4. Hidden Markov Models (HMM)

- **Description:** A statistical model that represents sequences (e.g., words in a sentence) where the system is assumed to be a Markov process with hidden states. HMMs are often used for tasks where the goal is to predict sequences or uncover hidden states.

- **Use Cases:** Part-of-speech tagging, named entity recognition, speech recognition.

## 5. Conditional Random Fields (CRF)

- **Description:** A discriminative model used for structured prediction, which directly models the conditional probability of the output sequence given the input sequence. It relaxes the strong independence assumptions that HMMs place on the observations, so features can draw on arbitrary parts of the input.

- **Use Cases:** Named entity recognition, part-of-speech tagging, chunking.

## 6. Latent Dirichlet Allocation (LDA)

- **Description:** A generative probabilistic model used for topic modeling, which assumes that documents are mixtures of topics and topics are mixtures of words. It uses statistical inference to discover the underlying topics in a corpus.

- **Use Cases:** Topic modeling, document classification, content recommendation.

## 7. K-Nearest Neighbors (KNN)

- **Description:** A simple, non-parametric model that classifies a data point based on the majority label of its nearest neighbors in the feature space. It relies on distance metrics to find the closest neighbors.

- **Use Cases:** Text classification, document similarity, recommendation systems.

## 8. Decision Trees and Random Forests

- **Description:** Decision trees are hierarchical models that split data based on feature values to make predictions. Random forests are an ensemble of decision trees, combining their outputs to improve robustness and accuracy.

- **Use Cases:** Text classification, sentiment analysis, feature selection.

## 9. Gaussian Mixture Models (GMM)
- **Description:** A probabilistic model that represents a mixture of multiple Gaussian distributions. GMMs are used for clustering tasks where the data is assumed to be generated from a mixture of several Gaussian distributions.

- **Use Cases:** Document clustering, topic modeling, anomaly detection.

## 10. Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM)
- **Description:** RNNs are a type of neural network designed for sequential data, where connections between nodes form a directed cycle. LSTMs are a variant of RNNs that use gated memory cells to mitigate the vanishing-gradient problem, making them more effective at capturing long-range dependencies.

- **Use Cases:** Text generation, language modeling, machine translation, speech recognition.

## **Embedding Models**

# Problems That Embeddings Solve in NLP

## 1. High Dimensionality
- **Problem:** Traditional representations of text, like the Bag of Words (BoW) or one-hot encoding, result in extremely high-dimensional vectors, especially for large vocabularies. This makes the models computationally expensive and difficult to generalize.
- **Solution by Embeddings:** Embeddings reduce the dimensionality by representing words, sentences, or documents as dense vectors in a lower-dimensional space, making the models more efficient and scalable.

## 2. Sparsity
- **Problem:** One-hot vectors and BoW models are sparse, meaning most of the vector elements are zeros. This sparsity can make learning patterns from data difficult and inefficient.
- **Solution by Embeddings:** Embeddings provide dense representations, where each vector has non-zero values, capturing more meaningful information in fewer dimensions.

## 3. Lack of Semantic Similarity
- **Problem:** Traditional methods like one-hot encoding do not capture the semantic similarity between words. For example, "king" and "queen" are treated as completely unrelated, even though they have similar meanings.
- **Solution by Embeddings:** Embeddings capture semantic relationships by placing similar words closer together in the vector space. For instance, "king" and "queen" will have vectors that are close in the embedding space.
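
"Closer together in the vector space" is usually measured with cosine similarity. The sketch below contrasts one-hot vectors with dense vectors. The dense values are made up by hand for the example and do not come from any trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# One-hot vectors over a three-word vocabulary: every pair is orthogonal.
one_hot = {"king": [1, 0, 0], "queen": [0, 1, 0], "apple": [0, 0, 1]}

# Hand-written dense vectors, not taken from any trained model.
dense = {
    "king": [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.0, 0.9],
}

print(cosine_similarity(one_hot["king"], one_hot["queen"]))  # 0.0
print(cosine_similarity(dense["king"], dense["queen"]))      # close to 1
print(cosine_similarity(dense["king"], dense["apple"]))      # much smaller
```

With one-hot encoding every pair of distinct words has similarity 0, so no notion of "related" can be expressed. Dense vectors leave room for graded similarity.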

## 4. Context Ignorance
- **Problem:** Static word representations (e.g., one-hot, BoW, or even early embeddings like Word2Vec) do not take the context of words into account. A word like "bank" can have multiple meanings depending on the context, but traditional models treat it as a single entity.
- **Solution by Contextualized Embeddings:** Models like BERT and ELMo generate different vectors for the same word depending on its context, effectively capturing its meaning in each specific instance.

## 5. Handling Out-of-Vocabulary (OOV) Words
- **Problem:** Traditional models struggle with words that were not seen during training (OOV words), as they cannot generate vectors for these unseen words.
- **Solution by Subword Embeddings (e.g., FastText):** Subword embeddings break down words into smaller units like character n-grams, allowing the model to generate embeddings for OOV words by composing them from their subwords.
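
A small sketch of the character n-gram idea. It only extracts the n-grams. In a FastText-style model each n-gram would also have a learned vector, and a word's embedding would be composed from them. The `<` and `>` boundary markers follow the convention described for FastText, and the helper name and word choices are assumptions for illustration.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of a word wrapped in boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(char_ngrams("where"))  # ['<wh', 'whe', 'her', 'ere', 're>']

# Pretend the training vocabulary contained only these two words.
seen = set(char_ngrams("playing")) | set(char_ngrams("replay"))

# An out-of-vocabulary word still overlaps with n-grams seen in training.
unseen_word = "replaying"
shared = [g for g in char_ngrams(unseen_word) if g in seen]
print(shared)
```

Because every trigram of "replaying" already occurs in "playing" or "replay", a subword model could build a vector for it even though the full word was never seen.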

## 6. Capturing Long-Range Dependencies
- **Problem:** Traditional models, especially those relying on local context (e.g., n-grams), struggle to capture dependencies between words that are far apart in a sentence or document.
- **Solution by Contextualized Embeddings and Transformers:** Embeddings from models like BERT and GPT capture long-range dependencies by considering the entire sequence when generating embeddings, thus preserving the relationships between distant words.

## 7. Improving Generalization
- **Problem:** High-dimensional, sparse representations tend to overfit, especially on small datasets, leading to poor generalization to new, unseen data.
- **Solution by Embeddings:** Dense embeddings help in better generalization by learning compact representations that can be effectively used across different tasks and domains.

## 8. Difficulty in Transfer Learning
- **Problem:** Traditional models often require training from scratch for each new task, which is time-consuming and resource-intensive.
- **Solution by Pre-trained Embeddings:** Pre-trained embeddings like those from Word2Vec, GloVe, BERT, etc., can be transferred to new tasks, providing a strong starting point and reducing the amount of labeled data and computation needed.

## 9. Capturing Hierarchical Relationships
- **Problem:** Simple models may struggle to capture hierarchical or relational information, such as the relationship between "country" and "capital."
- **Solution by Embeddings:** Advanced embeddings can capture such relationships by learning patterns in the data that reflect these hierarchies, enabling tasks like analogy reasoning (e.g., "king" is to "queen" as "man" is to "woman").

## 10. Language and Domain Adaptation
- **Problem:** Traditional models trained on one domain or language may not perform well on another due to differences in vocabulary, syntax, or semantics.
- **Solution by Embeddings:** Embeddings can be fine-tuned for specific domains or languages, allowing for better adaptation and performance across different contexts.





# **Transformers Democratise the AI Field**

### - One architecture for many problems
### - Models like ***BERT*** and ***GPT*** build on it: they are pre-trained on large datasets and then adapted to specific tasks, bringing broad knowledge with them

### - Transformers are flexible enough to handle different forms of data (text, speech, images) and a wide range of multimodal challenges



# **Acceleration of GenAI**

### - Models like GANs existed but were not industry grade
### - Text generation took off once Transformers arrived
### - So did image generation and AI image editing

# **Unification of Deep Learning**

### - In the past, different architectures and models were used for different problems
### - Now Transformers solve a wide range of problems
### - Transfer learning is also easy with this architecture, thanks to extensive library support


# Origin of the Transformer Research Paper

### Paper 1: Sequence to Sequence Learning with Neural Networks

- For sequence-to-sequence tasks like machine translation, we use an **Encoder-Decoder Architecture**

- **LSTMs** were used in both the encoder and the decoder

- => First, the encoder reads the input one token at a time <br> => It maintains a hidden state as it goes <br> => At the end, it returns a context vector (a representation of the input sentence)
- e.g. "I am a student" <br>[0.25, -0.11, 0.43, 0.67, -0.28, ..., 0.59]

- => The decoder then processes the context vector and returns the output sequence
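
The encoder steps above can be sketched with a toy recurrent network. To keep it short, this uses a plain tanh RNN cell instead of the LSTM cell the paper uses, with random untrained weights and made-up sizes, so the numbers are meaningless. The point is only the data flow: tokens go in one at a time, and a fixed-size context vector comes out.

```python
import math
import random


def rnn_step(x, h, w_xh, w_hh):
    """One recurrent update: h_new = tanh(W_xh x + W_hh h)."""
    size = len(h)
    return [
        math.tanh(
            sum(w_xh[i][j] * x[j] for j in range(len(x)))
            + sum(w_hh[i][j] * h[j] for j in range(size))
        )
        for i in range(size)
    ]


def encode(token_vectors, w_xh, w_hh, hidden_size):
    """Read the tokens one at a time and return the final hidden state."""
    h = [0.0] * hidden_size
    for x in token_vectors:
        h = rnn_step(x, h, w_xh, w_hh)
    return h


# Random, untrained parameters; sizes are arbitrary choices for the sketch.
random.seed(0)
EMBED, HIDDEN = 3, 4
vocab = ["i", "am", "a", "student"]
embeddings = {w: [random.uniform(-1, 1) for _ in range(EMBED)] for w in vocab}
w_xh = [[random.uniform(-1, 1) for _ in range(EMBED)] for _ in range(HIDDEN)]
w_hh = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(HIDDEN)]

sentence = "I am a student".lower().split()
context = encode([embeddings[w] for w in sentence], w_xh, w_hh, HIDDEN)
print(len(context))  # 4: the same size regardless of sentence length
```

The loop in `encode` is inherently sequential: step t cannot start until step t-1 has produced its hidden state.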

### Paper 2: Neural Machine Translation by Jointly Learning to Align and Translate

- Here the **Attention Mechanism** was introduced
- The encoder's processing stays the same.
- At each time step, the decoder gets access to the **hidden states** of all encoder units and assigns each input word a weight (***attention weight***), building its context vector from the most relevant words, which improves performance

**Note**: **Translation performance improved after the attention mechanism was introduced to the encoder-decoder architecture, even for sentences longer than 30 words**
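
A minimal sketch of how attention weights turn encoder hidden states into a context vector for one decoder step. The paper scores each encoder state with a small feed-forward network (additive attention). Here a plain dot product is swapped in to keep the example short, and the vectors are made up.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return (attention weights, context vector) for one decoder step."""
    # Dot-product scoring is a simplification of the paper's additive scoring.
    scores = [
        sum(d * e for d, e in zip(decoder_state, state))
        for state in encoder_states
    ]
    weights = softmax(scores)
    # Context vector: attention-weighted average of the encoder states.
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(len(decoder_state))
    ]
    return weights, context


# Made-up hidden states: one per input word.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_states)
print(weights)  # the first input word gets the largest weight
print(context)
```

The decoder recomputes these weights at every output step, so each translated word can focus on a different part of the input instead of relying on a single fixed context vector.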

## Limitations:
- It is LSTM-based, so processing happens word by word and training is slow

- It is therefore very difficult to train on large datasets

- So transfer learning can't happen, as we would not be able to get a big base model

- We end up training a model from scratch for each task, which is often not feasible (data collection cost / time / effort)

## Finally: Attention Is All You Need (2017)

- No LSTMs or RNNs are used

- It relies on a special form of attention (**Self-Attention**)

- Its components can be computed in parallel (**trains faster**)
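
A bare-bones sketch of scaled dot-product self-attention, softmax(QKᵀ / √d_k)·V, in plain Python. It shows a single head with no masking, positional encoding, or learned weights. The projection matrices `w_q`, `w_k`, `w_v` are set to the identity and the token vectors are invented, purely to keep the example readable.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over all tokens at once."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = matmul(q, k_t)               # every token scored against every token
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Three made-up token vectors and identity projections (assumptions).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = self_attention(x, identity, identity, identity)
print(len(output), len(output[0]))  # 3 2: one output vector per token
```

Unlike the recurrent encoder, nothing here depends on a previous time step: the score matrix for all token pairs is computed in one shot, which is what makes the architecture easy to parallelize.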