# Complete NLP Overview

### **NLP Overview**

Text Preprocessing 1: Goal: cleaning the input: Tokenization, Lemmatization, Stemming

Text Preprocessing 2: Converting input text to vectors: Bag of Words, TF-IDF, Unigrams, Bigrams

Text Preprocessing 3: Converting input text to vectors: Word2Vec, AvgWord2Vec

Neural Networks: RNN, LSTM, GRU

Word Embedding: Convert input text to vectors

Transformers

BERT

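### **Basic Terminologies**:
- Corpus - the whole text being processed (here, a paragraph)
- Documents - the individual units of the corpus (here, sentences)
- Vocabulary - the set of unique words in the corpus
- Words - all the words in the corpus, repeats included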
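
A minimal sketch of how these four terms relate, using only the Python standard library. The regex-based splitting is a simplification assumed for illustration (the NLTK tokenizers used later are more robust), and `toy_corpus` is a made-up example.

```python
import re

# Hypothetical toy corpus (a paragraph), made up for illustration
toy_corpus = "NLP is fun. NLP is useful!"

# Documents: split the corpus into sentences after '.', '!' or '?'
documents = [s for s in re.split(r"(?<=[.!?])\s+", toy_corpus) if s]

# Words: every alphabetic token in the corpus, repeats included
words = re.findall(r"[A-Za-z]+", toy_corpus.lower())

# Vocabulary: the unique words
vocabulary = sorted(set(words))

print(documents)
print(len(words), vocabulary)
```

Here the corpus has two documents, six words, and a vocabulary of four words.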


### **Tokenization:**

Paragraphs >> Sentences: Creates sentences from paragraphs

Sentences >> Words: Converts the sentences to words

Vocabulary >> Count of unique words

Using NLTK Library

Tokenization is the process of converting paragraphs or sentences into tokens.

In [1]:
corpus="""Hello Welcome,to Krish Naik's NLP Tutorials.
Please do watch the entire course! to become expert in NLP.
"""

In [2]:
print(corpus)

Hello Welcome,to Krish Naik's NLP Tutorials.
Please do watch the entire course! to become expert in NLP.



In [5]:
# Tokenization: Paragraphs >> Sentences
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')  # newer NLTK releases may need nltk.download('punkt_tab') instead
documents = sent_tokenize(corpus)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


In [6]:
for sentence in documents:
    print(sentence)

Hello Welcome,to Krish Naik's NLP Tutorials.
Please do watch the entire course!
to become expert in NLP.


In [7]:
#Tokenization: Paragraphs >> Words
from nltk.tokenize import word_tokenize
word_tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'to',
 'Krish',
 'Naik',
 "'s",
 'NLP',
 'Tutorials',
 '.',
 'Please',
 'do',
 'watch',
 'the',
 'entire',
 'course',
 '!',
 'to',
 'become',
 'expert',
 'in',
 'NLP',
 '.']

In [8]:
#Tokenization: Sentences >> Words
for sentence in documents:
    print(word_tokenize(sentence))

['Hello', 'Welcome', ',', 'to', 'Krish', 'Naik', "'s", 'NLP', 'Tutorials', '.']
['Please', 'do', 'watch', 'the', 'entire', 'course', '!']
['to', 'become', 'expert', 'in', 'NLP', '.']


In [9]:
# wordpunct_tokenize splits on every punctuation character, so 's becomes ' and s
from nltk.tokenize import wordpunct_tokenize
wordpunct_tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'to',
 'Krish',
 'Naik',
 "'",
 's',
 'NLP',
 'Tutorials',
 '.',
 'Please',
 'do',
 'watch',
 'the',
 'entire',
 'course',
 '!',
 'to',
 'become',
 'expert',
 'in',
 'NLP',
 '.']

In [10]:
# Handling full stops: only the full stop at the very end of the input is split off,
# so 'Tutorials.' keeps its period
from nltk.tokenize import TreebankWordTokenizer
tokenizer=TreebankWordTokenizer()
tokenizer.tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'to',
 'Krish',
 'Naik',
 "'s",
 'NLP',
 'Tutorials.',
 'Please',
 'do',
 'watch',
 'the',
 'entire',
 'course',
 '!',
 'to',
 'become',
 'expert',
 'in',
 'NLP',
 '.']
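
The output above keeps `'Tutorials.'` as one token because `TreebankWordTokenizer` assumes its input has already been split into sentences. A minimal sketch of tokenizing sentence by sentence instead; the `sentences` list is copied by hand from the `sent_tokenize` output shown earlier, so this block needs no downloaded data.

```python
from nltk.tokenize import TreebankWordTokenizer

# Sentences copied from the sent_tokenize output above
sentences = [
    "Hello Welcome,to Krish Naik's NLP Tutorials.",
    "Please do watch the entire course!",
    "to become expert in NLP.",
]

tokenizer = TreebankWordTokenizer()

# Tokenize one sentence at a time so each final full stop is split off
tokens_per_sentence = [tokenizer.tokenize(s) for s in sentences]
for tokens in tokens_per_sentence:
    print(tokens)
```

With one sentence per call, the period after `Tutorials` becomes its own token.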

### **Stemming**

Stemming is the process of reducing a word to its word stem by stripping affixes (suffixes and prefixes). Unlike the lemma produced by lemmatization, the stem is not guaranteed to be a valid word. Stemming is widely used in natural language understanding (NLU) and natural language processing (NLP).
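
To make the idea concrete, here is a toy suffix stripper in plain Python. It is only a sketch of the general idea, not the Porter algorithm; the function name `toy_stem`, the suffix list, and the minimum stem length of 3 are all assumptions made for illustration.

```python
def toy_stem(word, suffixes=("ing", "es", "ed", "s")):
    """Strip the first matching suffix, keeping a stem of at least 3 characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    # No suffix matched (or the stem would be too short): return unchanged
    return word


for w in ["eating", "eats", "programs", "history"]:
    print(w, "->", toy_stem(w))
```

Real stemmers apply many more rules in several passes, but the principle is the same: rule-based trimming with no dictionary lookup.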


In [11]:
## Classification problem
## Is a product comment a positive review or a negative review?
## Reviews ----> [eating, eats, eaten] ---> eat, [going, gone, goes] ---> go

words=["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalized"]

**PorterStemmer**

In [12]:
from nltk.stem import PorterStemmer
stemming = PorterStemmer()

In [13]:
for word in words:
    print(word + "----->"+stemming.stem(word))

eating----->eat
eats----->eat
eaten----->eaten
writing----->write
writes----->write
programming----->program
programs----->program
history----->histori
finally----->final
finalized----->final


When using stemming, the original meaning of the word might change, or the stem might not be a real word, e.g. history > histori.

### RegexpStemmer class
NLTK provides the RegexpStemmer class for building regular-expression stemmers. It takes a single regular expression and removes any prefix or suffix that matches it. Words shorter than the `min` argument are returned unchanged. Let us see an example:

In [14]:
from nltk.stem import RegexpStemmer
reg_stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)

In [15]:
reg_stemmer.stem('eating')

'eat'

In [16]:
reg_stemmer.stem('enable')

'en'

In [17]:
reg_stemmer.stem('ingeating')

'ingeat'
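
The behaviour in the cells above can be approximated with `re.sub` and a length guard. This is a sketch of the idea rather than NLTK's own code; `regexp_stem` and `min_len` are names chosen here for illustration.

```python
import re


def regexp_stem(word, pattern=r"ing$|s$|e$|able$", min_len=4):
    """Remove every match of pattern, unless the word is shorter than min_len."""
    if len(word) < min_len:
        return word
    return re.sub(pattern, "", word)


for w in ["eating", "enable", "ingeating", "is"]:
    print(w, "->", regexp_stem(w))
```

Because every alternative in the pattern ends with `$`, only suffixes are removed, which is why the leading `ing` in `ingeating` survives.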

### Snowball Stemmer
The Snowball stemmer, also known as the Porter2 stemming algorithm, is an improved version of the Porter stemmer that fixes some of its issues.

In [18]:
from nltk.stem import SnowballStemmer
snowballsstemmer=SnowballStemmer('english')

In [20]:
for word in words:
    print(word + '----->' + snowballsstemmer.stem(word))

eating----->eat
eats----->eat
eaten----->eaten
writing----->write
writes----->write
programming----->program
programs----->program
history----->histori
finally----->final
finalized----->final


In [21]:
stemming.stem("fairly"),stemming.stem("sportingly")

('fairli', 'sportingli')

In [22]:
snowballsstemmer.stem("fairly"),snowballsstemmer.stem("sportingly")

('fair', 'sport')

In [23]:
stemming.stem('goes'),snowballsstemmer.stem('goes')

('goe', 'goe')

### Lemmatization

Lemmatization is similar to stemming, but its output, called a 'lemma', is a root word rather than a root stem. After lemmatization, we get a valid word that means the same thing.

NLTK provides the WordNetLemmatizer class, a thin wrapper around the WordNet corpus. This class uses the morphy() function of the WordNet CorpusReader class to find a lemma. Let us understand it with an example:

In [26]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...


In [27]:
lemmatizer.lemmatize('going')

'going'

In [28]:
'''
POS tags:
noun - n (default)
verb - v
adjective - a
adverb - r
'''
lemmatizer.lemmatize("going",pos='v')

'go'
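
The effect of the `pos` argument can be pictured as a lookup keyed by both the word and its part of speech. The sketch below is a toy model only: `TOY_LEMMAS` is a hypothetical hand-written table, whereas `WordNetLemmatizer` consults WordNet.

```python
# Hypothetical hand-written lookup table, keyed by (word, pos)
TOY_LEMMAS = {
    ("going", "v"): "go",
    ("goes", "v"): "go",
    ("eaten", "v"): "eat",
    ("better", "a"): "good",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word itself if it is unknown."""
    return TOY_LEMMAS.get((word, pos), word)


print(toy_lemmatize("going"))           # treated as a noun, left unchanged
print(toy_lemmatize("going", pos="v"))  # treated as a verb
```

As in the cells above, the same word gives different results depending on the part of speech supplied.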

In [29]:
words=["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalized"]

In [30]:
for word in words:
    print(word+"---->"+lemmatizer.lemmatize(word,pos='v'))

eating---->eat
eats---->eat
eaten---->eat
writing---->write
writes---->write
programming---->program
programs---->program
history---->history
finally---->finally
finalized---->finalize


In [31]:
lemmatizer.lemmatize("goes",pos='v')

'go'

In [32]:
lemmatizer.lemmatize("fairly",pos='v'),lemmatizer.lemmatize("sportingly")

('fairly', 'sportingly')

Compared to stemming, lemmatization takes longer to find the root words, because it looks each word up in WordNet instead of applying suffix rules.

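**Stop Words**

Stop words are very common words (such as "the", "is", "to") that carry little meaning on their own. They are often removed during preprocessing.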
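
A minimal sketch of stop word removal in plain Python. The `STOP_WORDS` set is a small hand-picked list assumed for illustration; NLTK ships a fuller list via `nltk.corpus.stopwords.words('english')` after `nltk.download('stopwords')`.

```python
# Small hand-picked stop word set, assumed for illustration only
STOP_WORDS = {"the", "to", "do", "in", "is", "a", "an", "and", "of"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = ["Please", "do", "watch", "the", "entire", "course"]
filtered = remove_stop_words(tokens)
print(filtered)
```

Lowercasing before the comparison means capitalised stop words at the start of a sentence are removed as well.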

**Named Entity Recognition**

Named entity tags: e.g. Person, Place, Date, Time


**One Hot Encoding**

- Need to revise from
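
As a placeholder until this section is revised, here is a minimal sketch of one-hot encoding in plain Python. The helper name `one_hot_encode` and the choice to sort the vocabulary alphabetically are assumptions made for illustration.

```python
def one_hot_encode(tokens):
    """Map each token to a vector with a single 1 at its vocabulary index."""
    vocabulary = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for token in tokens:
        vector = [0] * len(vocabulary)
        vector[index[token]] = 1
        vectors.append(vector)
    return vocabulary, vectors


vocabulary, vectors = one_hot_encode(["nlp", "is", "fun", "nlp"])
print(vocabulary)
for vector in vectors:
    print(vector)
```

Each vector is as long as the vocabulary, so this representation grows quickly with the number of unique words.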







