# NLPnLLM Series: A gentle introduction

NLP is the foundation for working with natural language, and it plays a crucial role in the development of Large Language Models (LLMs). Without understanding how NLP works and how early NLP research developed, it is difficult to understand LLMs in detail. So, in this NLPnLLM series, I will cover the basics of NLP, its techniques and models, all the way up to LLMs, and try to convey the years of research that define the present state of AI. I will also explain in detail how the processes involved (embeddings, tokenization, etc.) differ between classical NLP and LLMs, and what changed along the way to make current technology far more robust, with language understanding that resembles a human's across many languages.

Here is how I will be uploading the blogs:

**Part I - Very basic**

Our objective in this module is to understand the very basic research in NLP that made it possible to parse and understand natural language. We will see how a text corpus (a collection of texts) is broken down into simple units, how those units are represented as numbers, and how we build on top of them. We will also write lots of code and build some mini-projects.

Here are the things we will work on:
- Tokenization (sentence-level, word-level, character-level and subword-level tokenization, byte-pair encoding, the unigram LM tokenizer, Hugging Face open-source tokenizers, etc.)
- Embeddings (one-hot encoding, bag-of-words, TF-IDF, word2vec, attention-based embeddings and positional embeddings)
- Building these from scratch in Python and with libraries like NLTK, gensim and PyTorch, using two different languages and a few mini-projects

**Part II - Intermediate NLP**
- Language models (understanding and building n-gram models, RNN and LSTM language models, etc. We will also go through different sampling techniques that help decide the language model's next word)
- Working with sequence models like RNNs, LSTMs, GRUs and bidirectional LSTMs. We will build them from scratch and with PyTorch, on time-series and text data.
- Building semantic search, misspelled-word detection, sentence completion, NER, and other similar projects

**Part III - Advanced NLP**
- An in-depth understanding of the encoder-decoder architecture, Bahdanau attention and other pre-transformer models.
- Understanding attention, self-attention, masked self-attention and transformers in detail, and building a Transformer from scratch using Python and PyTorch.
- Exploring Word2Vec, Transformers, Bag-of-trees, ULMFiT, sentence embeddings and other papers.
- Understanding and building GPT and BERT models.

*Later, after NLP, we will explore LLMs. Along the way, we will keep looking at how these processes are done in modern LLM applications, and how they changed or were replaced. We will go through the advantages and disadvantages of all these processes in detail.*

*All the code can be found in this GitHub repository: **https://github.com/Harsh-Agarwals/LLMs/tree/main/Codes***

# Steps in an NLP workflow:

- **Data Cleaning**: For any text dataset, the very first step is cleaning. Data gathered from different sources (online articles, encyclopedias, news articles, blogs, etc.) contains stray whitespace, newline characters, HTML tags, spam, duplicated content, and so on, all of which need to be removed. Duplicates matter because they can lead to overfitting. Typical pre-processing steps at this stage include removing stop-words and punctuation and lowercasing the text.

- **Tokenization**: Once we have cleaned data, we need to convert it into a numerical format, since computers only work with numbers. Turning text into numerical vectors is called embedding, but before we can embed anything we have to decide which units to embed. So we first split the corpus into sentences, words, subwords or characters. This splitting is called tokenization, and each resulting unit is a token. There are different types of tokenization: sentence-level, word-level, subword (for example byte-pair encoding, which is used primarily in LLMs and is language independent) and character-level. Pre-processing steps involved here include stemming and lemmatization.

- **Embedding**: After tokenization, we convert tokens/subwords to vectors using embedding techniques. Older techniques include one-hot encoding, bag-of-words and TF-IDF, and better ones are word2vec and GloVe. Modern LLM applications use contextual embeddings produced by the attention mechanism, together with positional embeddings.

- **Modelling**: Once our data is ready, we train a model on it.

- **Output**: We generate outputs from the trained model using various sampling techniques.
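
To make the first three steps concrete, here is a minimal sketch of cleaning, tokenizing and embedding a tiny corpus using only the Python standard library. The sample documents, the helper names (`clean_text`, `deduplicate`, `tokenize`, `bag_of_words`) and the English-only token pattern are assumptions made for illustration, and the modelling and output steps are left out.

```python
import html
import re
from collections import Counter

def clean_text(raw):
    """Data cleaning: drop HTML tags, decode entities, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(docs):
    """Keep the first copy of each document, ignoring case."""
    seen, unique = set(), []
    for doc in docs:
        key = doc.lower()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

def tokenize(text):
    """Tokenization: lowercase, then keep runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bag_of_words(tokens, vocab):
    """Embedding: count how often each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

# Made-up sample documents; the first two are duplicates once cleaned.
raw_docs = [
    "<p>The cat sat   on the mat.</p>\n",
    "The cat sat on the mat.",
    "<b>Dogs</b> &amp; cats sat together!",
]

docs = deduplicate([clean_text(d) for d in raw_docs])
tokenized = [tokenize(d) for d in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})
vectors = [bag_of_words(toks, vocab) for toks in tokenized]

print(docs)
print(vocab)
print(vectors)
```

Each document ends up as a vector with one count per vocabulary word. The `[a-z0-9]+` pattern would need replacing for any language that is not written in ASCII letters.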

## Basic Tokenization

The first thing we will look at is tokenization. Tokenization is an early step in the NLP workflow. Once we have cleaned data, we need to convert it into a numerical format, since computers only work with numbers. That conversion is called embedding, but how do we get from raw text to numerical vectors? This is where tokenization comes in!

### What is Tokenization and why is it used?

A simple way to prepare text for embedding is to split the corpus into sentences, words, subwords or characters, and then convert those units to vectors. That splitting is tokenization. There are different types of tokenization: sentence-level, word-level, subword (for example byte-pair encoding, which is used primarily in LLMs and is language independent) and character-level. Pre-processing steps involved here include stemming and lemmatization. In this notebook we will understand tokenization and the different methods for tokenizing the text of a corpus.
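
As a quick illustration of these granularities, the sketch below splits one made-up string at sentence, word and character level using only regular expressions. The patterns are simplifying assumptions (for instance, the sentence splitter would break on abbreviations like "Dr."), not what a library tokenizer does.

```python
import re

text = "Tokenization is fun. It splits text into units!"

# Sentence level: split on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Word level: runs of word characters, with punctuation kept as separate tokens
words = re.findall(r"\w+|[^\w\s]", text)

# Character level: every character of a word becomes a token
chars = list("fun")

print(sentences)
print(words)
print(chars)
```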

### How do we tokenize in NLP?

**Steps**:

***Word level tokenizer***
- Removing punctuation
- Lowercasing
- Separating words by whitespace and collecting them into a list
- Removing stopwords
- Lemmatizing or stemming every remaining word

One question must be coming to mind: why do we remove punctuation, lowercase letters, remove stopwords, and so on, even though they play an important role in the language? That is exactly where these methods fall short, but each step has a practical reason.

We **lowercase because** Dog, dog and DOG all mean the same thing, and if we keep capitalization, every variant becomes a separate vocabulary entry and the vocabulary grows quickly.

Next, **why do we remove punctuation?** Because it doesn't help embedding methods like TF-IDF and bag-of-words, which focus on words and rely mainly on word frequencies. Punctuation can inflate the vocabulary of n-gram models too. That much is clear, **but why do we remove stopwords?** How does that help? The reason is that stopwords occur very frequently throughout the text and therefore provide very little signal to ML models.
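
A small experiment shows both effects. On the made-up sentence below, lowercasing shrinks the vocabulary, and the most frequent token turns out to be a stopword. The sample text is an assumption chosen only to make the counts easy to check by hand.

```python
from collections import Counter

text = "The dog chased the cat. The DOG and the Cat ran to the park."

# Drop the full stops and split on whitespace
tokens = text.replace(".", "").split()

raw_vocab = set(tokens)                      # case-sensitive vocabulary
lower_tokens = [t.lower() for t in tokens]
lower_vocab = set(lower_tokens)              # vocabulary after lowercasing

most_common_word, count = Counter(lower_tokens).most_common(1)[0]

print(len(raw_vocab), len(lower_vocab))      # 11 8
print(most_common_word, count)               # the 5
```

Out of 14 tokens, "the" alone accounts for 5, yet it says nothing about what the sentence is about.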

### Now let's understand Lemmatization and Stemming

**Lemmatization**: Reducing a word to its dictionary base form (its lemma), for example *running* → *run* or *better* → *good*.

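**Stemming**: Stripping grammatical suffixes from a word using simple rules, without checking that the result is a real word, for example *studies* → *studi*.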
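
The difference is easier to see side by side. The sketch below uses a toy suffix-stripping stemmer and a toy lookup-table lemmatizer. Both are hypothetical stand-ins written for this example, far simpler than what libraries such as NLTK provide, and the suffix list and lemma table are made up.

```python
# Toy rules: suffixes to strip, checked in this order (illustrative only)
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Toy dictionary mapping inflected forms to their lemma (illustrative only)
TOY_LEMMAS = {"running": "run", "ran": "run", "better": "good", "studies": "study"}

def toy_lemmatize(word):
    """Look the word up; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)

for word in ["running", "ran", "studies", "better"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

The stemmer produces fragments such as `runn` and `studi` and cannot connect `ran` to `run`, while the lemmatizer returns real words but only for forms it has in its dictionary.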

#### Why are these useful?
- They keep our vocabulary size limited

**Why are lemmatization and stemming not ideal?**

What do lemmatization and stemming do? They reduce a word to a simpler form (either by converting it to its base word or by stripping grammatical suffixes). Is it a good idea to drop grammar at all? Grammar is the core of the language, and dropping it loses information. We do it anyway because keeping every inflected form leads to an exploding vocabulary, which in turn leads to very sparse encodings. An inflected form that appears only a few times gives the model too few examples to learn its importance, so the model underfits on those words.

This looks okayish for languages like English, which have relatively few suffixes. What about Hindi or other languages with dense suffixes and grammar? There the vocabulary can explode, because a single word can take multiple suffixes, each forming a new word, and the result is a badly underfitting model. And once the suffixes are stripped, the model can no longer see how suffixes and grammar relate the words in a sentence. This is why lemmatization and stemming are a flawed approach that needs to be dropped.

One question that comes naturally: if lemmatization and stemming are not good, yet keeping every word form explodes our vocabulary, how do LLMs handle grammar and suffixes? Again, subword tokenization solves this issue.

*Now let's code all of this!*

```python
import numpy as np
```
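
The word-level tokenizer steps listed earlier can be sketched as a single function. This is a minimal version under stated assumptions: the stopword set is a tiny hand-written sample, and `toy_stem` is a hypothetical suffix stripper standing in for a real stemmer or lemmatizer from a library such as NLTK.

```python
import string

# Tiny hand-written stopword list (illustrative, not a standard list)
STOPWORDS = {"the", "a", "an", "is", "are", "was", "and", "to", "of", "in", "on"}
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def word_tokenize(text):
    """Run the five word-level steps in order and return the final tokens."""
    # 1. Remove punctuation
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    # 2. Lowercase, 3. split on whitespace into a list
    words = no_punct.lower().split()
    # 4. Remove stopwords
    content = [w for w in words if w not in STOPWORDS]
    # 5. Stem every remaining word
    return [toy_stem(w) for w in content]

tokens = word_tokenize("The dogs are chasing the cats, and the cats jumped!")
print(tokens)  # ['dog', 'chas', 'cat', 'cat', 'jump']
```

The output shows the trade-off discussed above: the vocabulary is small, but the punctuation, the stopwords and the tense of each verb are gone.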

### Tokenization in NLP v/s LLM

Now that it is clear how these pre-processing methods work, one thing is certain and easy to recognize: *this is not the best way, because a lot of relevant information is lost!* Since that information matters, we need a better way to tokenize. This is where byte-pair encoding (BPE), a subword tokenization method, comes in: it doesn't remove anything from the text, and it works very well. **And most importantly, it works for any language, no matter how complicated the language is!** Isn't that awesome? We'll study it in detail in another tutorial. For now, keep in mind that it is one of the best tokenization methods and is used in LLMs as well!
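
As a small preview, here is a simplified sketch of how BPE learns its merges: start from characters and repeatedly fuse the most frequent adjacent pair. The word frequencies are made up, and the sketch skips details that real tokenizers add, such as working on bytes or marking word boundaries.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one fused symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a {word: frequency} dict."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

# Made-up toy corpus: word -> how often it appears
merges, words = learn_bpe({"low": 5, "lower": 2, "lowest": 3}, num_merges=2)
print(merges)
print(words)
```

After two merges, `low` has become a single subword, while `lower` and `lowest` are `low` plus leftover characters. Nothing is thrown away, and the suffixes stay visible to the model.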