# NLPnLLM Series: A gentle introduction

NLP is the foundation for working with natural language, and it plays a crucial role in the development of Large Language Models (LLMs). Without knowing how NLP works and how the early research unfolded, it is difficult to understand LLMs in detail. In this NLPnLLM series, I will cover the basics of NLP, its techniques and models, all the way up to LLMs, and try to convey the years of research that shaped the present state of AI. I will also explain how individual steps (such as tokenization and embeddings) differ between classical NLP and LLMs, and what changed along the way to make current systems far more robust, with language understanding that approaches human level across many languages.

Here is how I will be uploading the blogs:

**Part I - Very basic**

Our objective in this module is to understand the foundational research in NLP that made it possible to parse and understand natural language. We will see how a text corpus (a collection of texts) is broken down into simple units, how those units are represented as numbers, and how we build on top of them. We will also write lots of code and build a few mini-projects.
Here are the things we will work on:
- Tokenization (sentence-level, word-level, character-level and subword-level tokenization, byte-pair encoding, the unigram LM tokenizer, Hugging Face open-source tokenizers, etc.)
- Embeddings (one-hot encoding, bag-of-words, TF-IDF, word2vec, attention-based contextual embeddings and positional embeddings)
- Building these from scratch in Python and with libraries like NLTK, gensim and PyTorch, using 2 different languages and a few mini-projects

**Part II - Intermediate NLP**
- Language models (understanding and building n-gram models, RNN and LSTM language models, etc. We will also go through the sampling techniques used to choose the model's next word)
- Working with recurrent architectures such as RNNs, GRUs, LSTMs and bidirectional LSTMs: building them from scratch, and using PyTorch on time-series and text data.
- Building semantic search, misspelled-word detection, sentence completion, NER, and other similar projects

**Part III - Advanced NLP**
- An in-depth look at the encoder-decoder architecture, Bahdanau attention and other pre-transformer models.
- Attention, self-attention, masked self-attention and transformers in detail, and building a transformer from scratch using Python and PyTorch.
- Exploring Word2Vec, Transformers, Bag-of-trees, ULMFiT, sentence embeddings and other papers.
- Understanding and building GPT and BERT models.

*Later, after NLP, we will explore LLMs. Along the way, we will keep looking at how each of these steps is done in modern LLM applications, and how it has changed or been replaced. We will discuss the advantages and disadvantages of each approach in detail.*

*All the code can be found in this GitHub repository: **https://github.com/Harsh-Agarwals/LLMs/tree/main/Codes***

# Steps in an NLP workflow:

- **Data Cleaning**: For any text dataset, the very first step is cleaning. Data gathered from different sources (online articles, encyclopedias, news articles, blogs, etc.) contains stray whitespace, newline characters, HTML tags, spam and duplicated content, all of which need to be removed. Duplicates matter because a model can overfit to repeated text. Common pre-processing steps at this stage include removing stop-words and punctuation, and lowercasing the text.

- **Tokenization**: Once we have cleaned data, we need to convert it into a numerical format, since computers only work with numbers. Turning text into numerical vectors is called embedding, but before we can embed anything we have to decide what units to embed. So we first split the corpus into sentences, words, subwords or characters; this step is called tokenization. The common types are sentence-level, word-level, character-level and subword tokenization (for example byte pair encoding, which is learned from data rather than from language-specific rules and is the approach used primarily in LLMs). Pre-processing steps often applied alongside it include stemming and lemmatization.

- **Embedding**: After tokenization, we convert tokens/subwords to vectors using embedding techniques. Older techniques include one-hot encoding, bag-of-words and TF-IDF; better ones are word2vec and GloVe. Modern LLMs use contextual embeddings produced by the attention mechanism, combined with positional embeddings.

- **Modelling**: Once the data is ready, we train a model on it.

- **Output**: Using various sampling techniques, we generate outputs.
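To make the first three steps concrete, here is a minimal sketch of cleaning, tokenizing and embedding on a two-sentence toy corpus. The corpus, the helper names `clean` and `tokenize`, and the regex rules are all assumptions made for illustration; the embedding is the simplest bag-of-words count vector, not what a real model would use.

```python
import re
from collections import Counter

# Toy corpus, invented for illustration only.
raw_docs = [
    "<p>The  cat sat on the mat.</p>\n",
    "The dog sat on the log!",
]

def clean(doc):
    """Data cleaning: drop HTML tags, collapse whitespace, lowercase."""
    doc = re.sub(r"<[^>]+>", " ", doc)      # strip HTML tags
    doc = re.sub(r"\s+", " ", doc).strip()  # collapse spaces and newlines
    return doc.lower()

def tokenize(doc):
    """Tokenization: keep runs of letters/digits as word tokens."""
    return re.findall(r"[a-z0-9]+", doc)

tokens_per_doc = [tokenize(clean(d)) for d in raw_docs]

# Embedding (simplest form): one bag-of-words count vector per document.
vocab = sorted({tok for toks in tokens_per_doc for tok in toks})
vectors = [[Counter(toks)[word] for word in vocab] for toks in tokens_per_doc]

print(vocab)
for vec in vectors:
    print(vec)
```

Each vector has one slot per vocabulary word, so the two documents become directly comparable lists of numbers.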

## Basic Tokenization

Tokenization is a fundamental step in any NLP pipeline: we break text down into smaller units such as words, subwords, or characters. Once we have cleaned our data, we need to convert the text into a numerical format because computers can only process numbers. That numerical representation step is called embedding, but before we can embed, we need to tokenize.

### What is Tokenization and why is it used?

Tokenization involves splitting a corpus of text into smaller units: sentences, words, or subwords. These tokens are then used as the basis for vectorization techniques like TF-IDF, bag-of-words, or more advanced embeddings such as Word2Vec or transformers.

There are different types of tokenization, like:
- sentence-level tokenization,
- word-level tokenization,
- subword tokenization (for example byte pair encoding; used primarily in LLMs and not tied to one language's rules)
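As a rough illustration of how the granularity changes, the sketch below splits the same text at sentence, word and character level using only regular expressions. The splitting rules are simplifying assumptions (they would break on abbreviations like "Dr."), and subword tokenization is left out because it needs a vocabulary learned from data.

```python
import re

text = "Hello, how are you? I am fine. Thank you."

# Sentence level: split after ., ! or ? when followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word level: runs of word characters, punctuation dropped.
words = re.findall(r"\w+", text)

# Character level: every character of the first sentence is a token.
chars = list(sentences[0])

print(sentences)
print(words)
print(chars)
```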

Tokenization is often paired with preprocessing steps like stemming and lemmatization to reduce vocabulary size and improve learning efficiency.

In this notebook, we'll understand different methods to tokenize text in a corpus.

### How do we tokenize in NLP?

**Steps**:

***Word level tokenizer***
- Remove punctuation
- Convert text to lowercase
- Split the text on whitespace into a list of words
- Remove stopwords
- Lemmatize or stem every remaining word
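The first four steps can be sketched in a few lines of plain Python. The function name `word_tokenize_basic` and the tiny `STOPWORDS` set are assumptions for this sketch (a real stopword list is much longer), and the last step, stemming or lemmatization, is left out here because it needs a library or a dictionary.

```python
import string

# Tiny hand-picked stopword list, an assumption for this sketch only;
# a real list (for example NLTK's) is much longer.
STOPWORDS = {"the", "is", "a", "an", "are", "on", "and", "of", "to"}

def word_tokenize_basic(text):
    # 1. remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 2. lowercase
    text = text.lower()
    # 3. split on whitespace into a list of words
    words = text.split()
    # 4. remove stopwords
    return [w for w in words if w not in STOPWORDS]

tokens = word_tokenize_basic("The Dogs are running on the beach, and the cat is asleep!")
print(tokens)
```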

A question is probably forming in your mind: why do we remove punctuation, lowercase everything and drop stopwords, even though all of these play an important role in the language? Discarding them is exactly where these methods fall short, but there are practical reasons for each step.

We **lowercase because** Dog, dog and DOG are the same word, and if we keep capitalization each variant becomes a separate vocabulary entry, so the vocabulary grows for no benefit.

Next, **why do we remove punctuation?** Because it doesn't help embedding methods like TF-IDF and bag-of-words, which focus on words and rely mainly on word frequencies. Punctuation also inflates the vocabulary and the number of n-gram combinations. That much is easy to accept, **but why do we remove stopwords?** Stopwords occur very frequently throughout almost every text, so they provide very little signal to ML models. Removing them reduces noise and model complexity.
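Both effects are easy to see on toy data. In the sketch below (the strings are invented for illustration), lowercasing collapses three spellings into one vocabulary entry, and a simple frequency count shows that the most common word in a sentence is a stopword.

```python
from collections import Counter

# Lowercasing: three spellings collapse into one vocabulary entry.
raw = "Dog dog DOG"
vocab_cased = set(raw.split())
vocab_lower = set(raw.lower().split())
print(len(vocab_cased), len(vocab_lower))

# Stopwords: the most frequent word carries the least meaning.
sentence = "the dog chased the cat and the cat chased the mouse"
counts = Counter(sentence.split())
print(counts.most_common(1))
```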


### Now let's understand Lemmatization and Stemming

**Lemmatization**: Reducing a word to its dictionary base form (lemma), which is always a valid word. *Eg. Running -> Run*

**Stemming**: Stripping affixes from a word with simple rules; the result may not be a valid word. *Eg. Activity -> Activ*
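The difference is easiest to see with toy versions of both. The sketch below is deliberately naive: `toy_lemmatize` is a lookup in a three-word dictionary and `naive_stem` strips one suffix by rule. Both names and both rule sets are assumptions for illustration, not the WordNet lemmatizer or the Porter algorithm.

```python
# Toy lemma dictionary, an assumption for this sketch; real lemmatizers
# use a full lexicon plus part-of-speech information.
LEMMAS = {"running": "run", "studies": "study", "mice": "mouse"}

def toy_lemmatize(word):
    """Dictionary lookup: always returns a valid word."""
    return LEMMAS.get(word, word)

def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix; a toy rule, not the Porter algorithm."""
    for suffix in suffixes:
        # keep at least 3 characters so short words like "sing" survive
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["running", "studies", "mice", "sing"]:
    print(w, "-> lemma:", toy_lemmatize(w), "| stem:", naive_stem(w))
```

The rule-based stem of "running" is "runn", which is not a word, while the dictionary lookup returns "run". On the other hand, the lookup only works for words it already knows.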

#### Why are these useful?
- They keep the vocabulary size limited
- They let the model generalize across the different forms of a word
- They simplify training for statistical models

**Are there any downsides to lemmatization and stemming?**

What do lemmatization and stemming do? They reduce a word to a simpler form, either by converting it to its base word or by stripping its grammatical affixes. Is it a good idea to drop grammar at all? Grammar carries the core structure and syntax of a language, so throwing it away is a real loss. We do it anyway because keeping every inflected form as its own word makes the vocabulary explode, which in turn leads to very sparse encodings. An inflected form that appears only a few times gives the model too little evidence to learn what it means, which is essentially underfitting.

This trade-off is tolerable for languages like English, which have relatively few inflections. What about Hindi or other languages with rich suffixes and grammar? There, a single word can carry multiple attributes and take many forms, so keeping the inflections explodes the vocabulary and results in a badly underfitting model, while stripping them throws away even more information: the model can no longer learn how suffixes and grammar relate the words in a sentence. This is why lemmatization and stemming are a flawed compromise that we would eventually like to drop.

One question that comes up naturally: if lemmatization and stemming lose information, and keeping every inflected form explodes the vocabulary, how do LLMs handle grammar and suffixes? This is the issue that subword tokenization addresses.

*Now let's code all of this!*

## Importing libraries

We build most of the preprocessing ourselves, with help from a few libraries:
- NumPy: numerical computations
- pandas: data manipulation
- NLTK: stopword lists and ready-made tokenizers, stemmers and lemmatizers
- kagglehub: downloading the dataset

In [4]:
import numpy as np
import pandas as pd
import nltk
import kagglehub

import warnings
warnings.filterwarnings("ignore")



In [5]:
path = kagglehub.dataset_download("selimkhaled50/new-data")
path

'/Users/harshagarwal/.cache/kagglehub/datasets/selimkhaled50/new-data/versions/1'

In [6]:
import os

os.listdir(path)

['New-data.json']

In [7]:
text = pd.read_json(os.path.join(path, os.listdir(path)[0]))
text

Unnamed: 0,intents
0,"{'tag': 'greeting', 'patterns': ['Is anyone th..."
1,"{'tag': 'morning', 'patterns': ['it was great ..."
2,"{'tag': 'afternoon', 'patterns': ['it is a nic..."
3,"{'tag': 'evening', 'patterns': ['it is a nice ..."
4,"{'tag': 'night', 'patterns': ['good evening.',..."
...,...
75,"{'tag': 'fact-28', 'patterns': ['if i am conce..."
76,"{'tag': 'fact-29', 'patterns': ['i am not sure..."
77,"{'tag': 'fact-30', 'patterns': ['how can i kee..."
78,"{'tag': 'fact-31', 'patterns': ['there is a di..."


In [8]:
text.loc[0, 'intents'], text.loc[0, 'intents'].keys()

({'tag': 'greeting',
  'patterns': ['Is anyone there?',
   'Ola',
   'Hey there',
   'Hi there',
   'Howdy',
   'Konnichiwa',
   'Guten tag',
   'Hi',
   'Hola',
   'Hey',
   'Bonjour',
   'Hello'],
  'responses': ['Hello there. Tell me how are you feeling today?',
   'Hi there. What brings you here today?',
   'Hi there. How are you feeling today?',
   'Great to see you. How do you feel currently?',
   "Hello there. Glad to see you're back. What's going on in your world right now?"]},
 dict_keys(['tag', 'patterns', 'responses']))

In [9]:
text['tag'] = text.intents.apply(lambda x: x['tag'])
text['patterns'] = text.intents.apply(lambda x: ' '.join(x['patterns']))
text['responses'] = text.intents.apply(lambda x: ' '.join(x['responses']))

text.drop(columns=['intents'], inplace=True)

text

Unnamed: 0,tag,patterns,responses
0,greeting,Is anyone there? Ola Hey there Hi there Howdy ...,Hello there. Tell me how are you feeling today...
1,morning,it was great to wake up. a good start to the d...,Good morning. I hope you had a good night's sl...
2,afternoon,it is a nice day. a good start to the day. goo...,Good afternoon. How is your day going?
3,evening,it is a nice day. it is a nice morning. it was...,Good evening. How has your day been?
4,night,good evening. Good night it was nice. it was a...,Good night. Get some proper sleep Good night. ...
...,...,...,...
75,fact-28,"if i am concerned about my mental health, what...",The most important thing is to talk to someone...
76,fact-29,i am not sure if i'm well. if i'm not feeling ...,"If your beliefs , thoughts , feelings or behav..."
77,fact-30,how can i keep in touch with people? can i sta...,"A lot of people are alone right now, but we do..."
78,fact-31,there is a difference between stress and anxie...,Stress and anxiety are often used interchangea...


In [10]:
text.tag.unique()

array(['greeting', 'morning', 'afternoon', 'evening', 'night', 'goodbye',
       'thanks', 'no-response', 'neutral-response', 'about', 'skill',
       'creation', 'name', 'help', 'sad', 'stressed', 'worthless',
       'depressed', 'happy', 'casual', 'anxious', 'not-talking', 'sleep',
       'scared', 'death', 'understand', 'done', 'suicide', 'hate-you',
       'hate-me', 'default', 'jokes', 'repeat', 'wrong', 'stupid',
       'location', 'something-else', 'friends', 'ask', 'problem',
       'no-approach', 'learn-more', 'user-agree', 'meditation',
       'user-meditation', 'pandora-useful', 'user-advice',
       'learn-mental-health', 'mental-health-fact', 'fact-1', 'fact-2',
       'fact-3', 'fact-5', 'fact-6', 'fact-7', 'fact-8', 'fact-9',
       'fact-10', 'fact-11', 'fact-12', 'fact-13', 'fact-14', 'fact-15',
       'fact-16', 'fact-17', 'fact-18', 'fact-19', 'fact-20', 'fact-21',
       'fact-22', 'fact-23', 'fact-24', 'fact-25', 'fact-26', 'fact-27',
       'fact-28', 'fact-29', '

## Text preprocessing

**Steps are:**
- Removing punctuation
- Lowercasing
- Removing stopwords
- Stemming/lemmatization

### Removing punctuation

In [11]:
import string
print(string.punctuation)

def remove_punctuations(text):
    return ''.join(i for i in list(text) if i not in string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
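As an aside, the same result can be obtained with `str.translate`, which avoids the per-character Python loop. The sketch below shows that variant plus a second one that replaces punctuation with a space instead of deleting it, so that a tag like `fact-28` becomes `fact 28` rather than `fact28`. Both function names are made up for this sketch; which behaviour is better depends on the data.

```python
import string

# Translation table that deletes every punctuation character.
_DELETE_PUNCT = str.maketrans("", "", string.punctuation)

# Translation table that maps every punctuation character to a space.
_SPACE_PUNCT = str.maketrans({ch: " " for ch in string.punctuation})

def remove_punctuations_translate(text):
    """Same output as the join-based version, done in one pass."""
    return text.translate(_DELETE_PUNCT)

def replace_punctuations_with_space(text):
    """Keeps word boundaries: 'well-being' -> 'well being'."""
    spaced = text.translate(_SPACE_PUNCT)
    return " ".join(spaced.split())  # collapse the extra spaces

print(remove_punctuations_translate("i'm fine, thanks!"))
print(replace_punctuations_with_space("fact-28: well-being"))
```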


In [12]:
texts = text.copy()
texts['patterns'] = texts.patterns.apply(lambda x: remove_punctuations(x))
texts['responses'] = texts.responses.apply(lambda x: remove_punctuations(x))
texts['tag'] = texts.tag.apply(lambda x: remove_punctuations(x))

texts

Unnamed: 0,tag,patterns,responses
0,greeting,Is anyone there Ola Hey there Hi there Howdy K...,Hello there Tell me how are you feeling today ...
1,morning,it was great to wake up a good start to the da...,Good morning I hope you had a good nights slee...
2,afternoon,it is a nice day a good start to the day good ...,Good afternoon How is your day going
3,evening,it is a nice day it is a nice morning it was a...,Good evening How has your day been
4,night,good evening Good night it was nice it was a n...,Good night Get some proper sleep Good night Sw...
...,...,...,...
75,fact28,if i am concerned about my mental health what ...,The most important thing is to talk to someone...
76,fact29,i am not sure if im well if im not feeling wel...,If your beliefs thoughts feelings or behavio...
77,fact30,how can i keep in touch with people can i stay...,A lot of people are alone right now but we don...
78,fact31,there is a difference between stress and anxie...,Stress and anxiety are often used interchangea...


### Lowercasing text

In [13]:
def convert_to_lowercase(text):
    return text.lower()

texts['patterns'] = texts.patterns.apply(lambda x: convert_to_lowercase(x))
texts['responses'] = texts.responses.apply(lambda x: convert_to_lowercase(x))
texts['tag'] = texts.tag.apply(lambda x: convert_to_lowercase(x))

texts


Unnamed: 0,tag,patterns,responses
0,greeting,is anyone there ola hey there hi there howdy k...,hello there tell me how are you feeling today ...
1,morning,it was great to wake up a good start to the da...,good morning i hope you had a good nights slee...
2,afternoon,it is a nice day a good start to the day good ...,good afternoon how is your day going
3,evening,it is a nice day it is a nice morning it was a...,good evening how has your day been
4,night,good evening good night it was nice it was a n...,good night get some proper sleep good night sw...
...,...,...,...
75,fact28,if i am concerned about my mental health what ...,the most important thing is to talk to someone...
76,fact29,i am not sure if im well if im not feeling wel...,if your beliefs thoughts feelings or behavio...
77,fact30,how can i keep in touch with people can i stay...,a lot of people are alone right now but we don...
78,fact31,there is a difference between stress and anxie...,stress and anxiety are often used interchangea...


### Removing stopwords

In [14]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/harshagarwal/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [15]:
set(stopwords.words('english'))

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 "he's",
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 "i'll",
 "i'm",
 "i've",
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [22]:
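# Use a separate name so the imported `stopwords` module is not overwritten
# (otherwise re-running this cell fails).
stop_words = set(stopwords.words('english'))
def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word not in stop_words])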

In [24]:
texts['patterns'] = texts.patterns.apply(lambda x: remove_stopwords(x))
texts['responses'] = texts.responses.apply(lambda x: remove_stopwords(x))

texts

Unnamed: 0,tag,patterns,responses
0,greeting,anyone ola hey hi howdy konnichiwa guten tag h...,hello tell feeling today hi brings today hi fe...
1,morning,great wake good start day great start day good...,good morning hope good nights sleep feeling today
2,afternoon,nice day good start day good day good afternoo...,good afternoon day going
3,evening,nice day nice morning good night good morning ...,good evening day
4,night,good evening good night nice nice day good nig...,good night get proper sleep good night sweet d...
...,...,...,...
75,fact28,concerned mental health im worried mental heal...,important thing talk someone trust might frien...
76,fact29,sure im well im feeling well know im well know...,beliefs thoughts feelings behaviours significa...
77,fact30,keep touch people stay touch friends done main...,lot people alone right dont lonely together th...
78,fact31,difference stress anxiety stress anxiety diffe...,stress anxiety often used interchangeably over...


Now we have pre-processed data.
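One detail worth noticing in the output above: words like `im` and `dont` survived stopword removal. That happens because punctuation was stripped first, so `i'm` became `im`, which no longer matches the `i'm` entry in the stopword list. The sketch below reproduces the effect with a handful of entries copied from the NLTK list; the helper names are assumptions for illustration.

```python
import string

# A few entries taken from NLTK's English stopword list.
STOPWORDS = {"i", "i'm", "don't", "am", "not", "if"}

def strip_punct(text):
    return text.translate(str.maketrans("", "", string.punctuation))

def drop_stopwords(text):
    return " ".join(w for w in text.split() if w not in STOPWORDS)

sentence = "i am not sure if i'm well"

# Order used in this notebook: punctuation first, then stopwords.
punct_first = drop_stopwords(strip_punct(sentence))

# Alternative order: stopwords first, then punctuation.
stop_first = strip_punct(drop_stopwords(sentence))

print(punct_first)
print(stop_first)
```

Neither order is wrong, but it is good to know which one you are using when the leftover tokens look odd.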


### Sentence tokenizer and word tokenizer

In [34]:
from nltk.tokenize import  sent_tokenize, word_tokenize
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/harshagarwal/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [35]:
sent_tokenize("Hello, how are you? I am fine. Thank you.")

['Hello, how are you?', 'I am fine.', 'Thank you.']

In [36]:
word_tokenize("Hello, how are you? I am fine. Thank you.")

['Hello',
 ',',
 'how',
 'are',
 'you',
 '?',
 'I',
 'am',
 'fine',
 '.',
 'Thank',
 'you',
 '.']
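To get a feel for what a word tokenizer does with punctuation, here is a rough regex approximation. It is only a sketch under simple assumptions (a token is either a run of word characters or a single punctuation mark); NLTK's `word_tokenize` applies many more rules, for example for contractions, so the two will differ on harder inputs.

```python
import re

def simple_word_tokenize(text):
    # a run of word characters, or one character that is neither
    # a word character nor whitespace (i.e. a punctuation mark)
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_word_tokenize("Hello, how are you? I am fine. Thank you.")
print(tokens)
```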

## Stemming and Lemmatization

We will apply
- Lemmatization on the patterns column, and
- Stemming on the responses column

### Lemmatization

In [28]:
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/harshagarwal/nltk_data...


True

In [31]:
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("programming", "v"), lemmatizer.lemmatize("programming", "n")

('program', 'programming')

In [40]:
def lematize_text_n(text):
    return ' '.join(lemmatizer.lemmatize(word, pos="n") for word in text.split())

def lematize_text_v(text):
    return ' '.join(lemmatizer.lemmatize(word, pos="v") for word in text.split())

lematize_text_n("programming is a fun activity"), lematize_text_v("programming is a fun activity")

('programming is a fun activity', 'program be a fun activity')

In [42]:
texts['patterns'] = texts.patterns.apply(lambda x: lematize_text_v(x))
texts

Unnamed: 0,tag,patterns,responses
0,greeting,anyone ola hey hi howdy konnichiwa guten tag h...,hello tell feeling today hi brings today hi fe...
1,morning,great wake good start day great start day good...,good morning hope good nights sleep feeling today
2,afternoon,nice day good start day good day good afternoo...,good afternoon day going
3,evening,nice day nice morning good night good morning ...,good evening day
4,night,good even good night nice nice day good night ...,good night get proper sleep good night sweet d...
...,...,...,...
75,fact28,concern mental health im worry mental health i...,important thing talk someone trust might frien...
76,fact29,sure im well im feel well know im well know po...,beliefs thoughts feelings behaviours significa...
77,fact30,keep touch people stay touch friend do maintai...,lot people alone right dont lonely together th...
78,fact31,difference stress anxiety stress anxiety diffe...,stress anxiety often used interchangeably over...


### Stemming

In [43]:
from nltk.stem import PorterStemmer

In [44]:
stemmer = PorterStemmer()
stemmer.stem("programming"), stemmer.stem("program")

('program', 'program')

In [45]:
def stem_text(text):
    return " ".join(stemmer.stem(word) for word in text.split())

stem_text("programming is a fun activity")

'program is a fun activ'

In [46]:
texts['responses'] = texts.responses.apply(lambda x: stem_text(x))
texts

Unnamed: 0,tag,patterns,responses
0,greeting,anyone ola hey hi howdy konnichiwa guten tag h...,hello tell feel today hi bring today hi feel t...
1,morning,great wake good start day great start day good...,good morn hope good night sleep feel today
2,afternoon,nice day good start day good day good afternoo...,good afternoon day go
3,evening,nice day nice morning good night good morning ...,good even day
4,night,good even good night nice nice day good night ...,good night get proper sleep good night sweet d...
...,...,...,...
75,fact28,concern mental health im worry mental health i...,import thing talk someon trust might friend co...
76,fact29,sure im well im feel well know im well know po...,belief thought feel behaviour signific impact ...
77,fact30,keep touch people stay touch friend do maintai...,lot peopl alon right dont lone togeth think di...
78,fact31,difference stress anxiety stress anxiety diffe...,stress anxieti often use interchang overlap st...


With this, we complete basic tokenization.

All the text in the dataframe above is now cleaned and normalized; splitting each row on whitespace gives its tokens, ready to be embedded.
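Before any embedding can be computed, each token needs an integer id. A minimal sketch of that step follows, using two of the processed responses from the table above. The helper names `build_vocab` and `encode` and the `<unk>` placeholder for unseen words are assumptions for this sketch, not part of the notebook.

```python
def build_vocab(rows):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {"<unk>": 0}  # id 0 is reserved for words not seen while building
    for row in rows:
        for tok in row.split():
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab

def encode(text, vocab):
    """Map a whitespace-tokenized string to a list of token ids."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in text.split()]

rows = ["good afternoon day go", "good even day"]
vocab = build_vocab(rows)

print(vocab)
print(encode("good even night", vocab))
```

The word `night` was never seen while building the vocabulary, so it falls back to the `<unk>` id.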

### Tokenization in NLP v/s LLM

Now that it is clear how these pre-processing methods work, one thing is easy to see: *this is not the best way, because a lot of relevant information is thrown away!* Since that information matters, we need a better way to tokenize. This is where byte pair encoding, a subword tokenization method, comes in: it doesn't remove anything from the text and works very well in practice. **Most importantly, it is learned from the data rather than from language-specific rules, so it can be applied to any language, no matter how complicated!** Isn't that awesome? We'll study it in detail in another tutorial. For now, keep in mind that it is one of the most effective tokenization methods and is used in LLMs as well!
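As a small preview, here is a minimal sketch of the core idea behind byte pair encoding: start from characters and repeatedly merge the most frequent adjacent pair. The toy corpus, the `</w>` end-of-word marker and the function names are assumptions for this sketch; production tokenizers add many details (byte-level handling, pre-tokenization, special tokens) that are not shown.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Learn `num_merges` merge rules from a whitespace-separated corpus."""
    # each word starts as a tuple of characters plus an end-of-word marker
    words = dict(Counter(tuple(w) + ("</w>",) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair (first one on ties)
        merges.append(best)
        words = merge_pair(best, words)
    return merges, words

merges, words = learn_bpe("low low low lower lowest", num_merges=3)
print(merges)
print(words)
```

After three merges the frequent word "low" has become a single symbol, while "lower" and "lowest" are still represented as "low" plus smaller pieces. Nothing was removed from the text, and suffixes stay visible to the model.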