# Natural Language Processing
Natural Language Processing (NLP) is a set of methods for working with text data. Typical text preprocessing and representation steps include:

1. Tokenization
2. Stemming
3. Lemmatization
4. Stopword Removal
5. POS (Part of Speech) Tagging
6. NER (Named Entity Recognition)
7. Word Embedding

## Tokenization
Tokenization is the process of splitting a sentence or paragraph into smaller units (tokens), such as words, subwords, or sentences.

In [9]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt')
nltk.download('punkt_tab')   # <-- required in newer NLTK versions

text = "Tokenization is the process of splitting sentence. or paragraph into smaller tokens (word). such as words, subwords, sentences."
word_tokens= word_tokenize(text)
sent_tokens=sent_tokenize(text)
print("Word Tokenizer:",len(word_tokens), word_tokens)
print("Sent Tokenizer",len(sent_tokens),sent_tokens)


Word Tokenizer: 25 ['Tokenization', 'is', 'the', 'process', 'of', 'splitting', 'sentence', '.', 'or', 'paragraph', 'into', 'smaller', 'tokens', '(', 'word', ')', '.', 'such', 'as', 'words', ',', 'subwords', ',', 'sentences', '.']
Sent Tokenizer 3 ['Tokenization is the process of splitting sentence.', 'or paragraph into smaller tokens (word).', 'such as words, subwords, sentences.']


[nltk_data] Downloading package punkt to /home/hp/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/hp/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
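
To show what a tokenizer does internally, here is a minimal sketch built on regular expressions. The function names `simple_word_tokenize` and `simple_sent_tokenize` are made up for this illustration. NLTK's tokenizers handle many more cases (abbreviations, contractions, and so on), so treat this only as an approximation of the idea.

```python
import re

def simple_word_tokenize(text):
    # A word is a run of letters/digits/underscores;
    # any other non-space character becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Split after '.', '!' or '?' when it is followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "Tokenization splits text. It yields words, subwords, or sentences."
words = simple_word_tokenize(text)
sentences = simple_sent_tokenize(text)

print(len(words), words)
print(len(sentences), sentences)
```

Punctuation marks come out as separate tokens, just as in the `word_tokenize` output above.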


## Stemming
Stemming reduces words to their root form by chopping off suffixes. The result may not be a valid word.

In [22]:
# PorterStemmer
from nltk.stem import PorterStemmer

stemmer=PorterStemmer()

words=['running', 'files','easily','eaten', 'writing']

stem_word = [stemmer.stem(word) for word in words]
print(stem_word)

['run', 'file', 'easili', 'eaten', 'write']


In [None]:
#Snowball Stemmer
from nltk.stem import SnowballStemmer

snow_stemmer=SnowballStemmer('english')

words=['running', 'files','easily','eaten', 'writing']

snow_stem_word = [snow_stemmer.stem(word) for word in words]

print(snow_stem_word)

['run', 'file', 'easili', 'eaten', 'write']


In [26]:
# Regex Stemmer
from nltk.stem import RegexpStemmer

# RegexpStemmer takes a regular expression describing the suffixes to strip
reg_stemmer=RegexpStemmer('ing$|s$|en$', min=3)

words=['running', 'files','easily','eaten', 'writing']

reg_stem_word = [reg_stemmer.stem(word) for word in words]
print(reg_stem_word)

['runn', 'file', 'easily', 'eat', 'writ']
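
One common reason to stem is to group different forms of a word under one key. The sketch below uses a hypothetical `toy_stem` function (not part of NLTK) with a made-up suffix list and a simple rule for undoing doubled consonants. It is far cruder than the Porter or Snowball algorithms and is only meant to show the grouping idea.

```python
from collections import defaultdict

def toy_stem(word, suffixes=("ing", "ed", "s"), min_stem=3):
    # Strip the first matching suffix, provided at least min_stem characters remain.
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

# Group word forms by their stem.
groups = defaultdict(list)
for w in ["running", "runs", "run", "watched", "watching", "files"]:
    groups[toy_stem(w)].append(w)

print(dict(groups))
# {'run': ['running', 'runs', 'run'], 'watch': ['watched', 'watching'], 'file': ['files']}
```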


## Lemmatization

Lemmatization reduces words to their base or dictionary form (lemma), considering grammar and vocabulary.

In [31]:
import nltk

nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

words=['running', 'files','easily','eaten', 'writing','watched']

lemmatize_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatize_words)

['running', 'file', 'easily', 'eaten', 'writing', 'watched']


[nltk_data] Downloading package wordnet to /home/hp/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
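
In the output above, 'running', 'eaten', and 'watched' come back unchanged. This is because `WordNetLemmatizer.lemmatize` treats each word as a noun unless a `pos` argument (for example `pos='v'` for verbs) is passed. The sketch below illustrates why the part of speech matters, using a small hypothetical lookup table in place of the real WordNet dictionary. The table `LEMMAS` and the function `lemmatize` are invented for this example.

```python
# Hypothetical mini-dictionary: (word, pos) -> lemma.
# A real lemmatizer consults a full lexicon such as WordNet.
LEMMAS = {
    ("running", "v"): "run",
    ("eaten", "v"): "eat",
    ("watched", "v"): "watch",
    ("files", "n"): "file",
}

def lemmatize(word, pos="n"):
    # Unknown (word, pos) pairs are returned unchanged.
    return LEMMAS.get((word, pos), word)

as_noun = [lemmatize(w) for w in ["running", "files", "eaten", "watched"]]
as_verb = [lemmatize(w, pos="v") for w in ["running", "eaten", "watched"]]

print(as_noun)  # ['running', 'file', 'eaten', 'watched']
print(as_verb)  # ['run', 'eat', 'watch']
```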


## Stopword Removal
Removing stopwords is an important text preprocessing step because some words (such as "is", "the", and "for") contribute little meaning to the data. Stopword removal filters these low-information words out of the text dataset.

In [8]:
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text_data="This is the NLP method which usefull for the text preprocessing."

text_tokens=word_tokenize(text_data)
print("Tokens:", text_tokens)
print("Token count:", len(text_tokens))

stop_words = set(stopwords.words('english'))

filter_word = [word for word in text_tokens if word.lower() not in stop_words]
print("Filtered tokens:", filter_word)
print("Filtered token count:", len(filter_word))

Tokens: ['This', 'is', 'the', 'NLP', 'method', 'which', 'usefull', 'for', 'the', 'text', 'preprocessing', '.']
Token count: 12
Filtered tokens: ['NLP', 'method', 'usefull', 'text', 'preprocessing', '.']
Filtered token count: 6


[nltk_data] Downloading package stopwords to /home/hp/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
words = word_tokenize("This is an example showing off stopword filtration.")
filtered = [w for w in words if w.lower() not in stop_words]

print("Original:", words)
print("After Stopword Removal:", filtered)

Original: ['This', 'is', 'an', 'example', 'showing', 'off', 'stopword', 'filtration', '.']
After Stopword Removal: ['example', 'showing', 'stopword', 'filtration', '.']


[nltk_data] Downloading package stopwords to /home/hp/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
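
Both outputs above still contain the '.' token, because punctuation is not part of the stopword list. A minimal sketch of filtering stopwords and punctuation together follows. `STOP_WORDS` is a small made-up set for illustration (NLTK's English list is much longer), and `clean_tokens` is a hypothetical helper name.

```python
import string

# Small hypothetical stopword set; NLTK's English list is much longer.
STOP_WORDS = {"this", "is", "the", "which", "for", "an", "off"}
PUNCTUATION = set(string.punctuation)

def clean_tokens(tokens):
    # Drop stopwords (case-insensitive) and tokens made up only of punctuation.
    return [
        t for t in tokens
        if t.lower() not in STOP_WORDS
        and not all(ch in PUNCTUATION for ch in t)
    ]

tokens = ['This', 'is', 'the', 'NLP', 'method', 'which', 'usefull',
          'for', 'the', 'text', 'preprocessing', '.']
cleaned = clean_tokens(tokens)
print(cleaned)  # ['NLP', 'method', 'usefull', 'text', 'preprocessing']
```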


## POS (Part of Speech) Tagging
POS tagging assigns each token a grammatical category, such as noun, verb, or determiner.

In [16]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger_eng')

text_data="This is the NLP method which usefull for the text preprocessing."
token=word_tokenize(text_data)

print(pos_tag(token))

[('This', 'DT'), ('is', 'VBZ'), ('the', 'DT'), ('NLP', 'NNP'), ('method', 'NN'), ('which', 'WDT'), ('usefull', 'NN'), ('for', 'IN'), ('the', 'DT'), ('text', 'NN'), ('preprocessing', 'NN'), ('.', '.')]


[nltk_data] Downloading package punkt to /home/hp/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/hp/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


#### 1. DT: Determiner

Words: This, the

- A determiner introduces a noun.
- It gives information about which one, how many, or whose.

Examples: This book, The method, A car, My laptop

#### 2. VBZ: Verb (3rd person singular present)

Word: is

- Verb in present tense.
- Used with he/she/it.

Examples: He runs, She writes, It works, This is correct

#### 3. NNP: Proper Noun (Singular)

Word: NLP

- Name of a specific person, place, organization, or concept.
- Usually capitalized in English.

Examples: John, India, Python, NLP

#### 4. NN: Noun (Singular or Mass)

Words: method, usefull, text, preprocessing

- Common noun (not capitalized).
- Represents a thing, idea, concept, or object.

Examples: method, book, computer, preprocessing

⚠️ Important: usefull is tagged as NN because:

- It is misspelled (correct spelling: useful).
- The tagger does not recognize the word, so it guesses a noun from the word's form and its neighbors.
- If spelled correctly, it would likely be tagged as JJ (adjective).

#### 5. WDT: Wh-determiner

Word: which

- Used to introduce a relative clause.
- Refers back to something mentioned earlier.

Examples: The book which I bought, The method which works best

#### 6. IN: Preposition or Subordinating Conjunction

Word: for

- Shows the relationship between words.
- Often indicates direction, purpose, location, or time.

Examples: for the exam, in the room, on the table, because of rain

#### 7. "." : Punctuation

Token: .

- End-of-sentence marker.

In [15]:
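import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger_eng')  # tagger resource name used by newer NLTK versions
nltk.download('averaged_perceptron_tagger')      # tagger resource name used by older NLTK versions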

text = word_tokenize("John is learning NLP using Python.")
print(nltk.pos_tag(text))

[nltk_data] Downloading package punkt to /home/hp/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/hp/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/hp/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('John', 'NNP'), ('is', 'VBZ'), ('learning', 'VBG'), ('NLP', 'NNP'), ('using', 'VBG'), ('Python', 'NNP'), ('.', '.')]
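
Once tokens are tagged, a common next step is to pick out words by grammatical category. The sketch below works directly on the tagged list printed above, so it needs no NLTK download. `words_with_tag_prefix` is a hypothetical helper name. Matching on a tag prefix groups related tags together, for example NN and NNP under "NN".

```python
# Tagged tokens copied from the pos_tag output above.
tagged = [('John', 'NNP'), ('is', 'VBZ'), ('learning', 'VBG'), ('NLP', 'NNP'),
          ('using', 'VBG'), ('Python', 'NNP'), ('.', '.')]

def words_with_tag_prefix(tagged_tokens, prefix):
    # Keep the words whose tag starts with the given prefix.
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]

nouns = words_with_tag_prefix(tagged, "NN")
verbs = words_with_tag_prefix(tagged, "VB")

print("Nouns:", nouns)  # Nouns: ['John', 'NLP', 'Python']
print("Verbs:", verbs)  # Verbs: ['is', 'learning', 'using']
```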


## Word Embedding
An embedding represents words or text as vectors of real numbers in order to capture the semantic meaning of words.

### Word Embedding Methods:
1. TF-IDF (Term Frequency - Inverse Document Frequency)
2. Word2Vec (CBOW and Skip-Gram)
3. One Hot Encoding
4. Bag of Words

### TF-IDF
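For a word $t$ in a sentence $d$:

$$\mathrm{TF}(t, d) = \frac{\text{number of times } t \text{ occurs in } d}{\text{number of words in } d}$$

$$\mathrm{IDF}(t) = \log_e\left(\frac{\text{number of sentences}}{\text{number of sentences containing } t}\right)$$

$$\text{TF-IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$$

Example: in the 18-word sentence "Embedding represent text data or words as vectors of real numeric value, To capture semantic meaning of words.", TF(embedding) = 1/18 and TF(of) = 2/18.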
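
The TF and IDF definitions can be checked by hand with a few lines of plain Python. This is a minimal sketch: the helper names `tokenize`, `tf`, `idf`, and `tf_idf` are made up here, and the tokenizer simply lowercases the text and keeps alphanumeric runs. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its values will not match these raw numbers.

```python
import math
import re

corpus = [
    "This is the NLP method which usefull for the text preprocessing.",
    "Embedding represent text data or words as vectors of real numeric value, To capture semantic meaning of words.",
    "John is learning NLP using Python.",
]

def tokenize(sentence):
    # Lowercase and keep runs of letters/digits only.
    return re.findall(r"[a-z0-9]+", sentence.lower())

docs = [tokenize(s) for s in corpus]

def tf(word, doc):
    # Occurrences of the word divided by the number of words in the sentence.
    return doc.count(word) / len(doc)

def idf(word, docs):
    # Natural log of (number of sentences / sentences containing the word).
    # Assumes the word occurs in at least one sentence.
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

second = docs[1]
print(len(second))                                # 18
print(tf("embedding", second), tf("of", second))  # 1/18 and 2/18
print(idf("embedding", docs), idf("nlp", docs))   # ln(3/1) and ln(3/2)
print(tf_idf("embedding", second, docs))
```

A word that appears in every sentence gets an IDF of ln(1) = 0, so its TF-IDF score is 0 no matter how often it occurs.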

### Word2Vec
Unlike TF-IDF, Word2Vec creates dense vectors that capture semantic meaning.

#### A) CBOW (Continuous Bag of Words)

Predicts the target word using surrounding context words.

Example:

Context: "I love ___ learning"

Target: machine

1. Faster
2. Works well for frequent words

#### B) Skip-Gram

Predicts surrounding context words using the target word.

Example:

Input: machine

Output: love, learning

1. Works well for rare words
2. Better semantic representation
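
To make the difference between the two training setups concrete, the sketch below generates the training examples each one would see for a single sentence. `cbow_pairs` and `skipgram_pairs` are hypothetical helper names, and a window of 1 is assumed. This only shows how the (input, output) pairs are formed, not how the network is trained.

```python
def cbow_pairs(tokens, window=1):
    # CBOW: (context words) -> target word
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window]
        pairs.append((context, target))
    return pairs

def skipgram_pairs(tokens, window=1):
    # Skip-Gram: target word -> one context word per pair
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["I", "love", "machine", "learning"]

cbow = cbow_pairs(tokens)
skipgram = skipgram_pairs(tokens)

print(cbow[2])  # (['love', 'learning'], 'machine')
print([p for p in skipgram if p[0] == "machine"])
# [('machine', 'love'), ('machine', 'learning')]
```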

### One Hot Encoding:
Each word is represented as a vector of 0s with a single 1 at its index position.

Example:

"Bag of word"

Bag  - 1 0 0

of   - 0 1 0

word - 0 0 1
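
The same table can be produced with a short plain-Python sketch. The `one_hot` function name is made up for this illustration, and each word's index is simply its position in the vocabulary list.

```python
def one_hot(vocabulary):
    # Map each word to a vector of 0s with a single 1 at the word's index.
    size = len(vocabulary)
    return {
        word: [1 if i == index else 0 for i in range(size)]
        for index, word in enumerate(vocabulary)
    }

vectors = one_hot(["Bag", "of", "word"])
for word, vector in vectors.items():
    print(word, vector)
# Bag [1, 0, 0]
# of [0, 1, 0]
# word [0, 0, 1]
```

The vector length equals the vocabulary size, which is why one-hot encoding becomes very sparse for large vocabularies.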

### Bag of Words (BoW)


Represents text as word frequency counts (ignores grammar and order).

Example:

Text:

"I love machine learning"
"I love coding"

Vocabulary:

["I", "love", "machine", "learning", "coding"]

Vectors:

[1,1,1,1,0]
[1,1,0,0,1]
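
The vectors above can be reproduced with a few lines of plain Python. This is a minimal sketch that splits on whitespace and builds the vocabulary in order of first appearance. `bow_vector` is a hypothetical helper name, and `CountVectorizer` behaves differently (it lowercases, sorts the vocabulary, and by default drops one-character tokens such as "I").

```python
from collections import Counter

texts = ["I love machine learning", "I love coding"]

# Build the vocabulary in order of first appearance.
vocabulary = []
for text in texts:
    for word in text.split():
        if word not in vocabulary:
            vocabulary.append(word)

def bow_vector(text, vocabulary):
    # Count each vocabulary word in the text; word order is ignored.
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

vectors = [bow_vector(text, vocabulary) for text in texts]

print(vocabulary)  # ['I', 'love', 'machine', 'learning', 'coding']
print(vectors)     # [[1, 1, 1, 1, 0], [1, 1, 0, 0, 1]]
```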

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus=[
    "This is the NLP method which usefull for the text preprocessing.",
    "Embedding represent text data or words as vectors of real numeric value, To capture semantic meaning of words.",
    "John is learning NLP using Python."
]

vectorizer=TfidfVectorizer()

X =vectorizer.fit_transform(corpus)
print(X)
print("Vocabulary:", vectorizer.get_feature_names_out())
print("TF-IDF:\n", X.toarray())  # toarray() must be called to get the dense matrix

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 32 stored elements and shape (3, 29)>
  Coords	Values
  (0, 21)	0.2919139048919681
  (0, 5)	0.22200804888354075
  (0, 20)	0.5838278097839362
  (0, 10)	0.22200804888354075
  (0, 9)	0.2919139048919681
  (0, 27)	0.2919139048919681
  (0, 23)	0.2919139048919681
  (0, 4)	0.2919139048919681
  (0, 19)	0.22200804888354075
  (0, 14)	0.2919139048919681
  (1, 19)	0.16372097540083433
  (1, 3)	0.2152734077990568
  (1, 17)	0.2152734077990568
  (1, 2)	0.2152734077990568
  (1, 13)	0.2152734077990568
  (1, 28)	0.4305468155981136
  (1, 0)	0.2152734077990568
  (1, 26)	0.2152734077990568
  (1, 12)	0.4305468155981136
  (1, 16)	0.2152734077990568
  (1, 11)	0.2152734077990568
  (1, 25)	0.2152734077990568
  (1, 22)	0.2152734077990568
  (1, 1)	0.2152734077990568
  (1, 18)	0.2152734077990568
  (1, 8)	0.2152734077990568
  (2, 5)	0.3349067026613031
  (2, 10)	0.3349067026613031
  (2, 6)	0.4403620672313486
  (2, 7)	0.4403620672313486
  (2, 24)	0.440362067

Task:

1. Understand the definition of each concept, then practice hands-on: Tokenization, Stemming, Lemmatization, Stopword Removal, and POS Tagging.


In [19]:
## Word2vec
from gensim.models import Word2Vec
sentence=[
  ["I", "Like","Machine"],
  ["I","am","Learning"],
  ["Machine","Coding"]
 ]

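# Skip-Gram model (sg=1); min_count=1 keeps every word of this tiny corpus
skipgram_model=Word2Vec(sentences=sentence,vector_size=50,window=2,min_count=1,sg=1)

# CBOW model (sg=0, the gensim default)
cbow_model=Word2Vec(sentences=sentence,vector_size=50,window=2,min_count=1,sg=0)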

vectors=skipgram_model.wv["Machine"]
print("SkipGram Model Output:","\n",vectors)

vectors_cbow=cbow_model.wv["Learning"]
print("\n","Cbow Model Output:","\n",vectors_cbow)

SkipGram Model Output: 
 [-1.0724545e-03  4.7286271e-04  1.0206699e-02  1.8018546e-02
 -1.8605899e-02 -1.4233618e-02  1.2917745e-02  1.7945977e-02
 -1.0030856e-02 -7.5267432e-03  1.4761009e-02 -3.0669428e-03
 -9.0732267e-03  1.3108104e-02 -9.7203208e-03 -3.6320353e-03
  5.7531595e-03  1.9837476e-03 -1.6570430e-02 -1.8897636e-02
  1.4623532e-02  1.0140524e-02  1.3515387e-02  1.5257311e-03
  1.2701781e-02 -6.8107317e-03 -1.8928028e-03  1.1537147e-02
 -1.5043275e-02 -7.8722071e-03 -1.5023164e-02 -1.8600845e-03
  1.9076237e-02 -1.4638334e-02 -4.6675373e-03 -3.8754821e-03
  1.6154874e-02 -1.1861792e-02  9.0324880e-05 -9.5074680e-03
 -1.9207101e-02  1.0014586e-02 -1.7519170e-02 -8.7836506e-03
 -7.0199967e-05 -5.9236289e-04 -1.5322480e-02  1.9229487e-02
  9.9641159e-03  1.8466286e-02]

 Cbow Model Output: 
 [ 1.56351421e-02 -1.90203730e-02 -4.11062239e-04  6.93839323e-03
 -1.87794445e-03  1.67635437e-02  1.80215668e-02  1.30730132e-02
 -1.42324204e-03  1.54208085e-02 -1.70686692e-02  6.414213

In [27]:
### One Hot Encoding
from sklearn.preprocessing import OneHotEncoder
import numpy as np

words = np.array([["I"],["Like"],["Machine"],["Learning"]])

ohe_encoder=OneHotEncoder(sparse_output=False)
encoded_word=ohe_encoder.fit_transform(words)
print(encoded_word)

[[1. 0. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]
 [0. 1. 0. 0.]]


In [28]:
### Bag Of Word

from sklearn.feature_extraction.text import CountVectorizer

bow_corpus=[
    "This is the NLP method which usefull for the text preprocessing.",
    "Embedding represent text data or words as vectors of real numeric value, To capture semantic meaning of words.",
    "John is learning NLP using Python."
]

bow_vectorizer=CountVectorizer()
X=bow_vectorizer.fit_transform(bow_corpus)

print("Vocabulary:", bow_vectorizer.get_feature_names_out())
print("Bow Matrix: \n", X.toarray())

Vocabulary: ['as' 'capture' 'data' 'embedding' 'for' 'is' 'john' 'learning' 'meaning'
 'method' 'nlp' 'numeric' 'of' 'or' 'preprocessing' 'python' 'real'
 'represent' 'semantic' 'text' 'the' 'this' 'to' 'usefull' 'using' 'value'
 'vectors' 'which' 'words']
Bow Matrix: 
 [[0 0 0 0 1 1 0 0 0 1 1 0 0 0 1 0 0 0 0 1 2 1 0 1 0 0 0 1 0]
 [1 1 1 1 0 0 0 0 1 0 0 1 2 1 0 0 1 1 1 1 0 0 1 0 0 1 1 0 2]
 [0 0 0 0 0 1 1 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0]]
