#NLP -- Natural Language Processing

###Procedure of NLP
- Import general libraries, NLTK, and spaCy.
- Load the dataset
- Text pre-processing: removing HTML tags, punctuation, and stop words, and expanding contractions.
- Apply Tokenization
- Apply Stemming
- Apply POS Tagging
- Apply lemmatization
- Apply label encoding
- Feature extraction
- Text to numerical vector conversion using BOW (count vectorizer), TF-IDF vectorizer, Word2Vec, and GloVe.
- Data preprocessing
- Model building

##Text Preprocessing steps
Text preprocessing is an important step in NLP. It consists of cleaning the raw text and converting it into a consistent format that can be analyzed and used for modeling in our task.

###Basic techniques:
- Lowering case
- Removing punctuation
- Removing special characters and numbers
- Removing HTML tags
- Removing URLs
- Removing extra spaces
- Expanding contractions
- Text correction

###Advanced techniques:
- Tokenization
- Stop word removal
- Stemming
- Lemmatization

###More Advanced techniques:
- POS (part-of-speech) tagging
- NER (named entity recognition)


####Let's now code these techniques and get some hands-on experience.

###1. Lowering case

Why is it essential? Because the same word written once in upper case and once in lower case is treated as two different words when building a Bag of Words.

Bag-of-Words and TF-IDF vectorizers work on word frequencies, so case variants split the count of what is really one word. (scikit-learn's CountVectorizer and TfidfVectorizer lowercase text by default.)

Lowercasing shrinks the vocabulary and hence reduces the dimensionality.
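
To make the vocabulary reduction concrete, here is a minimal sketch using only the standard library. The sentence is made up for illustration; it simply counts distinct tokens before and after lowercasing.

```python
# Count distinct tokens before and after lowercasing.
text = "The cat saw the Cat and THE dog"  # made-up example sentence
tokens = text.split()

vocab_raw = set(tokens)                               # case-sensitive vocabulary
vocab_lower = set(token.lower() for token in tokens)  # case-insensitive vocabulary

print(len(vocab_raw), len(vocab_lower))  # 8 5
```

"The", "the", and "THE" collapse into one entry, as do "cat" and "Cat", so the vocabulary drops from 8 words to 5.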

In [None]:
Sent="What are you doing? I am doing great!"
sent_lower=str(Sent).lower()
sent_lower

'what are you doing? i am doing great!'

###2. Removing punctuations

In [None]:
import string
punc=string.punctuation
punc

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [None]:
Sent="What are you doing? I am doing great!"
#Delete every punctuation character; comparing whole words against punc would leave "doing?" and "great!" unchanged
sent_punc=Sent.translate(str.maketrans("", "", punc))
sent_punc

'What are you doing I am doing great'

###3. Removing special characters and numbers

Special characters and numbers such as "!, @, #, %, ^, &, $, +, *, 0 to 9" usually carry little meaning for tasks such as sentence classification.

There is also a scenario where a special character attached to a word makes it count as a different word from one already present in the sentence. For example, "Shocked" and "Shocked!" are treated as different words even though they have the same meaning. Hence it is better to remove special characters, which also reduces dimensionality.

In [None]:
import re
sentence="Find the remainder when [math]23^{24}[/math] is divided by 24,23?"
sent_clean=re.sub("[^a-zA-Z]"," ",sentence)
sent_clean

'Find the remainder when  math          math  is divided by       '

###4. Removal of HTML tags

When we scrape data from a website, the dataset often contains HTML tags. These tags add noise to the text, so it is preferable to remove them.

In [None]:
sent='''<h3 style="color:red; font-family:Arial Black">Hello Guys How Are You</h3>'''
sent_html=re.sub("<.*?>", "", sent)
sent_html

'Hello Guys How Are You'

###5. Removing URLs



In [None]:
sen="visited https://github.com/surajh8596/NLP-Sentiment-Analysis-/tree/main/Senti"
clean_sent=re.sub(r"(http|https|www)\S+", "", sen)  #raw string so that \S is read as a regex class, not a string escape
clean_sent

'visited '

###6. Removing Extra Spaces

There are scenarios where users insert extra spaces at the start, at the end,
or anywhere in the sentence. We need to remove all of these extra spaces.


In [None]:
senten="Hi   team   how  are    you ??"
cleaned_sen=re.sub(" +", " ", senten).strip()  #collapse runs of spaces, then trim the start and end
cleaned_sen

'Hi team how are you ??'

###7. Expanding contraction

Contractions are words or combinations of words that are shortened by
dropping letters and replacing them with an apostrophe. Now that so much
communication happens online, through text messages and posts on social media
such as Facebook, Instagram, WhatsApp, Twitter, and LinkedIn, we rely heavily
on abbreviations and shortened forms of words. Expanding contractions (for
example "we'll" to "we will") brings these variants back to a standard form.
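
Conceptually, expanding contractions is a dictionary lookup. The sketch below illustrates the idea with the standard library only. The mapping `CONTRACTION_MAP` and the helper `expand_contractions` are hypothetical names invented for this example, and the mapping is deliberately tiny; it is not how any particular library implements the feature.

```python
import re

# Hypothetical, deliberately tiny mapping; real lists contain many more entries.
CONTRACTION_MAP = {
    "we'll": "we will",
    "can't": "cannot",
    "i'm": "i am",
    "don't": "do not",
}

# One alternation pattern that matches any key as a whole word, ignoring case.
_pattern = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    def _replace(match):
        found = match.group(0)
        expanded = CONTRACTION_MAP[found.lower()]
        # Keep a leading capital letter, e.g. "We'll" -> "We will".
        return expanded.capitalize() if found[0].isupper() else expanded
    return _pattern.sub(_replace, text)

expanded_sent = expand_contractions("We'll send the offer letter soon, don't worry.")
print(expanded_sent)  # We will send the offer letter soon, do not worry.
```

A hand-written map like this misses most contractions, which is why a maintained package is the practical choice.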


In [None]:
!pip install contractions



In [None]:
import contractions
sent="we have reached final step of our data science internship. We'll send offer letter soon."
sent_cont=contractions.fix(sent)
sent_cont

'we have reached final step of our data science internship. We will send offer letter soon.'

###8. Text Correction

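To correct the text we use TextBlob, a separate library built on top of NLTK. Automatic spelling correction is imperfect: in the output below, "youu" is fixed but "We'll" is wrongly changed to "He'll".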

In [None]:
!pip install TextBlob



In [None]:
from textblob import TextBlob
sentenced="We'll meet youu soon"
text=TextBlob(sentenced)
correct_sen=text.correct()
correct_sen

TextBlob("He'll meet you soon")

###Advanced Techniques

####1. Apply Tokenization

- Tokenization is the process of breaking text down into smaller units called tokens. Tokens can be sentences, words, characters, or subwords.

####Types of tokenization:
- (a) Sentence tokenization
- (b) Word tokenization
- (c) N-gram tokenization (n-grams of words or, for subwords, of characters)


The string split method handles simple word tokenization only. For sentence
and n-gram tokenization we use NLTK's built-in functions.

In [None]:
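!pip install nltk
import nltk
nltk.download('punkt')
#NLTK 3.9 and later look for 'punkt_tab' instead; download that resource if sent_tokenize raises a LookupError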



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
#1. Sentence Tokenization

from nltk.tokenize import sent_tokenize
sent="I am Nisharg Nargund. Founder of OpenRAG - A GenerativeAI Startup"
tokens=sent_tokenize(sent)
tokens


#2. Word Tokenization
from nltk.tokenize import word_tokenize
sentence="I am Nisharg Nargund. Founder of OpenRAG - A GenerativeAI Startup"
tokens=word_tokenize(sentence)  #token=sentence.split(" ")
tokens

#3. N-gram Tokenization (word trigrams)

from nltk import ngrams
sen="I am Nisharg Nargund. Founder of OpenRAG - A GenerativeAI Startup"
n_gram=list(ngrams(sen.split(" "), n=3))  #sliding window over 3 consecutive words
n_gram


[('I', 'am', 'Nisharg'),
 ('am', 'Nisharg', 'Nargund.'),
 ('Nisharg', 'Nargund.', 'Founder'),
 ('Nargund.', 'Founder', 'of'),
 ('Founder', 'of', 'OpenRAG'),
 ('of', 'OpenRAG', '-'),
 ('OpenRAG', '-', 'A'),
 ('-', 'A', 'GenerativeAI'),
 ('A', 'GenerativeAI', 'Startup')]
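
The output above contains n-grams of whole words. For subword tokenization, the same sliding-window idea is applied to the characters of a single word. A minimal sketch, where `char_ngrams` is a hypothetical helper written for this example and not an NLTK function:

```python
def char_ngrams(word, n=3):
    """Return all overlapping character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

subword_tokens = char_ngrams("Startup", n=3)
print(subword_tokens)  # ['Sta', 'tar', 'art', 'rtu', 'tup']
```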

####2. Remove stopwords

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from nltk.corpus import stopwords
stopwords_eng=stopwords.words("english")
print(len(stopwords_eng)) #NLTK's English stop word list had 179 entries when this was run.

179


In [None]:
sentence="Our Team name is Team Data Dynamos and we have selected Quora question similar"
#The stop word list is all lowercase, so compare in lowercase; otherwise "Our" would be kept
sentence_non_stopword=[word for word in sentence.split(" ") if word.lower() not in stopwords_eng]
print("Sentence with StopWords:", sentence)
print("Sentence without StopWords:", " ".join(sentence_non_stopword))

Sentence with StopWords: Our Team name is Team Data Dynamos and we have selected Quora question similar
Sentence without StopWords: Team name Team Data Dynamos selected Quora question similar


####3. Apply Stemming

Types of Stemmer in NLP:

a. Porter Stemmer

b. Snowball Stemmer

c. Lancaster Stemmer

d. Regexp Stemmer

####a. Porter Stemmer

- May produce stems that are not valid dictionary words.

In [None]:
from nltk.stem import PorterStemmer
porter=PorterStemmer()
sentence="Hello guys! I am Nisharg Nargund"
porter_stem=[porter.stem(word) for word in sentence.split(" ")]
porter_stem

['hello', 'guys!', 'i', 'am', 'nisharg', 'nargund']
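
The sentence above contains no inflected words, so apart from lowercasing nothing visibly changes. The short sketch below uses a few inflected words, chosen for this example, to show both the suffix stripping and the non-dictionary stems mentioned earlier.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()
words = ["running", "studies", "happy"]  # example words chosen for illustration

# Stem each word individually; no NLTK data download is needed for this stemmer.
stems = [porter.stem(word) for word in words]
print(stems)  # ['run', 'studi', 'happi']
```

"run" is a real word, while "studi" and "happi" are not, which is acceptable when stems are only used as grouping keys.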

####b. Snowball Stemmer

- A revised version of the Porter stemmer (also known as Porter2), generally considered more consistent than the original.

In [None]:
from nltk.stem import SnowballStemmer
snowball=SnowballStemmer("english")
sentence="Hello all! I am Nisharg Nargund"
snow_stem=[snowball.stem(word) for word in sentence.split(" ")]
snow_stem

['hello', 'all!', 'i', 'am', 'nisharg', 'nargund']

####c. Lancaster stemmer

- More aggressive than Porter and Snowball. It can over-stem, especially short words, which makes the results harder to interpret.

In [None]:
from nltk.stem import LancasterStemmer
lancaster=LancasterStemmer()
Sent="Hello all! I am nisharg nargund"
lans_sent=[lancaster.stem(word) for word in Sent.split(" ")]
lans_sent

['hello', 'all!', 'i', 'am', 'nisharg', 'nargund']

####d. Regexp Stemmer

- Identifies morphological affixes using regular expressions. Substrings matching the regular expressions will be discarded.

In [None]:
from nltk.stem import RegexpStemmer
regex=RegexpStemmer(regexp="ing$|s$|e$", min=0)
sent="Hello all! I am founder of OpenRAG and my teams working hard for unicorn"
regex_stem=[regex.stem(word) for word in sent.split(" ")]
regex_stem

['Hello',
 'all!',
 'I',
 'am',
 'founder',
 'of',
 'OpenRAG',
 'and',
 'my',
 'team',
 'work',
 'hard',
 'for',
 'unicorn']

####4. Apply Lemmatization

####Types of lemmatization in NLP:

a. WordNet Lemmatizer

b. TextBlob Lemmatizer


####a. WordNet Lemmatizer



In [None]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
from nltk.stem import WordNetLemmatizer
lemma=WordNetLemmatizer()
sent="Hello all! how you all doing Keep working harder"
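#lemmatize() treats every word as a noun unless pos is given, so verb forms such as "doing" and "working" stay unchanged
sentence_lem=[lemma.lemmatize(word) for word in sent.split(" ")]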
sentence_lem

['Hello', 'all!', 'how', 'you', 'all', 'doing', 'Keep', 'working', 'harder']

####b. Textblob lemmatizer

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
from textblob import TextBlob, Word
sent_text="The bats are hanging on their feet in upright positions"
sent=TextBlob(sent_text)
textblob_lemma=[w.lemmatize() for w in sent.words]
textblob_lemma

['The',
 'bat',
 'are',
 'hanging',
 'on',
 'their',
 'foot',
 'in',
 'upright',
 'position']

###More Advanced Techniques

####1. POS Tagging

- POS tagging assigns a part-of-speech tag to every word in the corpus. Stop words should not be removed before POS tagging, since the tagger relies on them as context. The tags expose grammatical relationships between words, which can later help the performance of an ML model.

POS tagging can be performed using two libraries:
- (1) NLTK
- (2) spaCy

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
#POS using NLTK

from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
doc=word_tokenize("What is the capital of India")
for word, tag in pos_tag(doc):  #tag the sentence once, then loop over the (word, tag) pairs
  print("Word:", word, "POS tag:", tag)

Word: What POS tag: WP
Word: is POS tag: VBZ
Word: the POS tag: DT
Word: capital POS tag: NN
Word: of POS tag: IN
Word: India POS tag: NNP


In [None]:
#POS using Spacy

import spacy
nlp=spacy.load("en_core_web_sm")
doc=nlp("What is the capital of India")
for word in doc:
  print(word.pos_)

PRON
AUX
DET
NOUN
ADP
PROPN


####2. Named Entity Recognition (NER)

- NER is an NLP method that extracts information from text. It involves detecting important pieces of information, known as named entities, and assigning them to categories such as person, organization, or location.

NER can be performed using two libraries:
- (1) NLTK
- (2) spaCy

In [None]:
import nltk
from nltk.corpus import stopwords
stopwords_en=stopwords.words("english")

In [None]:
import nltk
nltk.download('maxent_ne_chunker')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.


True

In [None]:
import nltk
nltk.download('words')

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [None]:
sent_ner="Openrag in india is founded by Nisharg Nargund. Started back in April, 2024"
words=[word for word in sent_ner.split(" ") if word not in stopwords_en]
tagged=nltk.pos_tag(words)
entities=nltk.ne_chunk(tagged)
for entity in entities:
  print(entity)

(GPE Openrag/NNP)
('india', 'NN')
('founded', 'VBD')
(PERSON Nisharg/NNP Nargund./NNP)
('Started', 'VBD')
('back', 'RB')
('April,', 'NNP')
('2024', 'CD')


In [None]:
#NER Using spacy

nlp = spacy.load("en_core_web_sm")
sent="Openrag is a genAI startup which builds domain and market specific chatbots"
doc = nlp(sent)
for entity in doc.ents:
  print(entity.text, entity.label_)

genAI GPE


###Text to Numerical Vector Conversion

####Major techniques are:
(1) Bag of Words: Count Vectorizer

(2) TF-IDF (Term Frequency-Inverse Document Frequency)

(3) Word2Vec (Word to Vector)

(4) GloVe (Global Vectors)

(5) BERT (Bidirectional Encoder Representations from Transformers)


####1. Bag of Words (Count Vectorizer)

Advantages:

a. Simple procedure and easy to implement.

b. Easy to understand.

Disadvantages:

a. Does not consider the semantic meaning of words.

b. The vectors are as long as the vocabulary, so computation time is high.

c. Count Vectorizer generates a sparse matrix.

d. Out-of-vocabulary words are not captured.
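
To illustrate what a count vectorizer computes, here is a minimal pure-Python sketch. The two-sentence corpus and the helper `bow_vector` are made up for this example; a library implementation adds tokenization rules and sparse storage on top of the same idea.

```python
from collections import Counter

docs = ["the cat sat", "the cat saw the dog"]  # toy corpus, made up for illustration

# Vocabulary: every distinct word in the corpus, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

bow_matrix = [bow_vector(doc) for doc in docs]
print(vocab)       # ['cat', 'dog', 'sat', 'saw', 'the']
print(bow_matrix)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]

# "bird" is not in the vocabulary, so it is silently dropped.
print(bow_vector("the bird"))  # [0, 0, 0, 0, 1]
```

The last line shows the out-of-vocabulary limitation: an unseen word leaves no trace in the vector, and word order is lost entirely.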

####2. TF-IDF

- Measures how important a term is within a document relative to a collection of documents (the corpus). A word receives a high weight if it occurs often in that document but rarely in the rest of the corpus.
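
One common textbook formulation is $\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \log\frac{N}{\text{df}(t)}$, where $\text{tf}(t, d)$ is the share of tokens in document $d$ equal to $t$, $N$ is the number of documents, and $\text{df}(t)$ is the number of documents containing $t$. The sketch below applies this unsmoothed variant to a made-up corpus; library implementations such as scikit-learn's use a smoothed idf, so their numbers will differ.

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]  # toy corpus, made up for illustration
tokenized = [doc.split() for doc in docs]

def tf(term, tokens):
    """Term frequency: share of the document's tokens equal to term."""
    return tokens.count(term) / len(tokens)

def idf(term):
    """Inverse document frequency: log(N / number of documents containing term).

    Assumes the term occurs in at least one document of the corpus.
    """
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log(len(tokenized) / df)

def tfidf(term, tokens):
    return tf(term, tokens) * idf(term)

score_the = tfidf("the", tokenized[0])  # "the" occurs in every document
score_mat = tfidf("mat", tokenized[0])  # "mat" occurs only in the first document
print(score_the, score_mat)
```

"the" appears in all three documents, so its idf is log(1) = 0 and its score is 0, while "mat" scores log(3)/6, roughly 0.18, despite occurring only once.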


####3. Word2Vec

- Word2Vec is a word embedding model; pre-trained versions are widely available. It maps each word to a dense numerical vector (a distributed representation) learned from the contexts in which the word appears in the training corpus.

The vectors capture semantic associations between words, such as their meaning and proximity to other words. For example, the relationship between Italy and Rome is similar to the relationship between France and Paris, so Italy – Rome + Paris ≈ France.
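
The analogy arithmetic can be sketched with toy vectors. The two-dimensional vectors below are hypothetical and hand-made purely for illustration (real Word2Vec vectors are learned and typically have hundreds of dimensions); the point is only to show the vector arithmetic and the nearest-neighbour lookup.

```python
import math

# Hypothetical 2-D vectors, hand-made for illustration (not learned by Word2Vec).
# Axis 0 encodes "which country", axis 1 encodes "country (1) vs. capital (0)".
vectors = {
    "Italy":  [1.0, 1.0],
    "Rome":   [1.0, 0.0],
    "France": [2.0, 1.0],
    "Paris":  [2.0, 0.0],
    "Berlin": [3.0, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Italy - Rome + Paris, computed component by component.
query = [i - r + p for i, r, p in
         zip(vectors["Italy"], vectors["Rome"], vectors["Paris"])]

# Nearest remaining word by cosine similarity, excluding the three query words.
candidates = {w: v for w, v in vectors.items() if w not in {"Italy", "Rome", "Paris"}}
best = max(candidates, key=lambda w: cosine(query, candidates[w]))
print(best)  # France
```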

Advantages:

a. Word embeddings help establish the association of a word with other words of similar meaning through the learned vectors.

b. Captures semantic meaning.

c. Low-dimensional vectors (compared with Bag of Words), hence lower computational time.

d. Dense vectors.

Disadvantages:

a. Contextual meaning is captured only within the window size; in other words, it has a local context scope.

b. Cannot generate vectors for words not seen during training.

####4. GloVe (Global Vectors)

GloVe is another word embedding technique with widely available pre-trained
vectors, designed to overcome the local-context drawback of Word2Vec.

Advantages:

a. Contextual meaning is captured for both local and global scope.

b. It uses a co-occurrence matrix that records how often two words occur together.

Disadvantages:

a. Uses a lot of memory and takes time to load.
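
The co-occurrence matrix mentioned above can be sketched in a few lines. This is only the counting step, on a made-up corpus with an assumed window of one word; GloVe itself then fits word vectors to the logarithm of such counts, which is not shown here.

```python
from collections import Counter

corpus = ["the cat sat", "the cat ran", "the dog sat"]  # toy corpus, made up
window = 1  # assumed window size: count only directly adjacent words

cooc = Counter()
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        # Look at neighbours to the right within the window; sorting the pair
        # makes the count symmetric, so (cat, the) and (the, cat) are one key.
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            pair = tuple(sorted((word, tokens[j])))
            cooc[pair] += 1

print(cooc[("cat", "the")], cooc[("dog", "the")], cooc[("cat", "sat")])  # 2 1 1
```

Because the counts are accumulated over the whole corpus, they carry the global statistics that a purely window-based model sees only one window at a time.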


####5. BERT

BERT is a pre-trained bidirectional Transformer for language understanding. It
was trained on about 2,500M words from Wikipedia and 800M words from books
(BooksCorpus). BERT is also used by the Google search engine. It uses the
encoder part of the Transformer, since its goal is to produce language
representations that can be applied to a number of different NLP tasks.

Advantages:

a. Contextual meaning is captured for both local and global scope.

b. Captures semantic meaning.

c. Generally more powerful than the earlier word embedding techniques listed here.

Disadvantages:

a. Uses a lot of memory and takes time to load and train.
