**Stemming in NLP**

In [97]:
!pip install nltk #natural language toolkit



In [98]:
import nltk
nltk.download('punkt')  # Download the required resource (tokenizer models)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

**Word tokens**

In [99]:
word = ['change','changing','changes','changed']

In [100]:
word

['change', 'changing', 'changes', 'changed']

In [101]:
from nltk.stem import PorterStemmer

In [102]:
p = PorterStemmer()

In [103]:
for w in word:  # stem each word in the list
    print(p.stem(w))

chang
chang
chang
chang


In [104]:
for w in word:
    print(w, p.stem(w))  # original word, then its stem

change chang
changing chang
changes chang
changed chang
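
All four forms collapse to the same stem because a stemmer works by stripping suffixes. The following is a deliberately naive sketch of that idea, not the Porter algorithm itself (which applies several ordered rule groups with conditions on what remains of the word). The function name naive_stem and its suffix list are hypothetical and chosen only for this illustration.

```python
def naive_stem(word, suffixes=("ing", "ed", "es", "s", "e")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word  # nothing to strip


for w in ["change", "changing", "changes", "changed"]:
    print(w, naive_stem(w))
```

A rule list this short breaks quickly on real text, which is why the notebook relies on PorterStemmer rather than hand-written rules.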


**Sentence tokens**

In [105]:
sen = 'I want to change the world if world changed my career by changing abcd'

In [106]:
from nltk.tokenize import word_tokenize

In [107]:
toke = word_tokenize(sen)  # a sentence has to be split into word tokens before stemming

# The earlier list already held one word per item, so it needed no tokenization.

In [108]:
toke

['I',
 'want',
 'to',
 'change',
 'the',
 'world',
 'if',
 'world',
 'changed',
 'my',
 'career',
 'by',
 'changing',
 'abcd']

In [109]:
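# sen.split()  # alternative: whitespace splitting gives the same tokens here because the sentence has no punctuation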

In [110]:
for w in toke:
    print(w , p.stem(w))

I i
want want
to to
change chang
the the
world world
if if
world world
changed chang
my my
career career
by by
changing chang
abcd abcd


**Character tokens**

In [111]:
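word_tokens  # created by word_tokenize(sentence) in the "Tokenization in NLP" section below; run that cell first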

['I',
 "'m",
 'from',
 'aiQuest',
 'Intelligence',
 '.',
 'I',
 'am',
 'learning',
 'NLP',
 '.',
 'It',
 'is',
 'fascinating',
 '!']

In [112]:
# Character tokens
char_tokens = [list(word) for word in word_tokens]
print(char_tokens)

[['I'], ["'", 'm'], ['f', 'r', 'o', 'm'], ['a', 'i', 'Q', 'u', 'e', 's', 't'], ['I', 'n', 't', 'e', 'l', 'l', 'i', 'g', 'e', 'n', 'c', 'e'], ['.'], ['I'], ['a', 'm'], ['l', 'e', 'a', 'r', 'n', 'i', 'n', 'g'], ['N', 'L', 'P'], ['.'], ['I', 't'], ['i', 's'], ['f', 'a', 's', 'c', 'i', 'n', 'a', 't', 'i', 'n', 'g'], ['!']]


**Part-of-speech (POS) tags**

In [113]:
# Part-of-speech (POS) tagging
pos_tags = nltk.pos_tag(word_tokens)
print(pos_tags)

[('I', 'PRP'), ("'m", 'VBP'), ('from', 'IN'), ('aiQuest', 'JJ'), ('Intelligence', 'NN'), ('.', '.'), ('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('NLP', 'NNP'), ('.', '.'), ('It', 'PRP'), ('is', 'VBZ'), ('fascinating', 'VBG'), ('!', '.')]


**Sub-word tokens**

In [114]:
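# Sub-word tokens, approximated here by Porter stems (stems are not true sub-word units such as the WordPiece tokens in the Transformers section)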
subword_tokens = [p.stem(w) for w in word_tokens]
print(subword_tokens)

['i', "'m", 'from', 'aiquest', 'intellig', '.', 'i', 'am', 'learn', 'nlp', '.', 'it', 'is', 'fascin', '!']


**Lemmatization in NLP**

In [115]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [116]:
from nltk.stem import WordNetLemmatizer

In [117]:
le = WordNetLemmatizer()

In [118]:
toke

['I',
 'want',
 'to',
 'change',
 'the',
 'world',
 'if',
 'world',
 'changed',
 'my',
 'career',
 'by',
 'changing',
 'abcd']

In [119]:
for w in toke:
    print(w , le.lemmatize(w))

I I
want want
to to
change change
the the
world world
if if
world world
changed changed
my my
career career
by by
changing changing
abcd abcd


In [120]:
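# The default pos='n' treats every word as a noun; pass pos='v' to reduce 'changed' or 'changing' to 'change'.
le.lemmatize('changes')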

'change'

**Tokenization in NLP**

In Python, there are several libraries and tools available for performing tokenization and other NLP tasks. Here are a few examples using popular libraries.

**NLTK**

NLTK (Natural Language Toolkit) is a widely used library for NLP tasks. Install it with pip install nltk and download the punkt resource, which its tokenizers rely on. The example below splits a short passage into word tokens and sentence tokens.

In [121]:
from nltk.tokenize import word_tokenize, sent_tokenize

sentence = "I'm from aiQuest Intelligence. I am learning NLP. It is fascinating!"
word_tokens = word_tokenize(sentence)
sentence_tokens = sent_tokenize(sentence)

print(word_tokens)
print(sentence_tokens)


['I', "'m", 'from', 'aiQuest', 'Intelligence', '.', 'I', 'am', 'learning', 'NLP', '.', 'It', 'is', 'fascinating', '!']
["I'm from aiQuest Intelligence.", 'I am learning NLP.', 'It is fascinating!']
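
For contrast, here is what plain string methods give on the same passage. This is only a rough sketch using the standard library; the regular expression is an assumption chosen for this example and is not what NLTK's Punkt model does internally.

```python
import re

sentence = "I'm from aiQuest Intelligence. I am learning NLP. It is fascinating!"

# Whitespace splitting keeps punctuation glued to words and leaves "I'm" whole.
naive_words = sentence.split()

# Split after '.', '!' or '?' when the mark is followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", sentence)

print(naive_words)
print(naive_sentences)
```

The sentence split happens to agree with sent_tokenize on this input, but a rule this simple would also break after abbreviations such as "Dr.". The word split shows why word_tokenize is preferred: it separates "Intelligence" from the full stop and splits "I'm" into two tokens.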


**spaCy**

spaCy is another powerful library for NLP. Install it with pip install spacy, then download a language model such as en_core_web_sm. The example below tokenizes the same passage with spaCy.

In [122]:
!pip install spacy
# Then download the English model (in a notebook, prefix the command with !):
# python -m spacy download en_core_web_sm



In [123]:
import spacy

nlp = spacy.load('en_core_web_sm')  # Load the English language model

sentence = "I'm from aiQuest Intelligence. I am learning NLP. It is fascinating!"
doc = nlp(sentence)

word_tokens = [token.text for token in doc]

print(word_tokens)


['I', "'m", 'from', 'aiQuest', 'Intelligence', '.', 'I', 'am', 'learning', 'NLP', '.', 'It', 'is', 'fascinating', '!']


**Transformers**

Transformers is a library built by Hugging Face that provides state-of-the-art pre-trained models for NLP. Each model comes with a matching tokenizer that produces the sub-word tokens the neural network expects as input. To install Transformers, run pip install transformers. Here's an example of tokenization using the tokenizer for bert-base-uncased.

In [124]:
!pip install transformers




In [125]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

sentence = "I'm from aiQuest Intelligence. I am learning NLP. It is fascinating!"
tokens = tokenizer.tokenize(sentence)

print(tokens)


['i', "'", 'm', 'from', 'ai', '##quest', 'intelligence', '.', 'i', 'am', 'learning', 'nl', '##p', '.', 'it', 'is', 'fascinating', '!']
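
The ## prefix in the output marks a piece that continues the previous one, and everything is lowercase because the model is the uncased variant. BERT's WordPiece tokenizer finds these pieces by greedy longest-match-first lookup in a fixed vocabulary. A minimal sketch of that lookup follows. The toy_vocab set is hypothetical (the real bert-base-uncased vocabulary has roughly 30,000 entries), and lowercasing and punctuation splitting are assumed to have happened already.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, just large enough for this example.
toy_vocab = {"ai", "##quest", "nl", "##p", "learning", "intelligence"}

for w in ["aiquest", "nlp", "learning"]:
    print(w, wordpiece(w, toy_vocab))
```

Words that are in the vocabulary stay whole, while rarer words such as "aiquest" are assembled from smaller known pieces, so the model never needs a separate entry for every possible word.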


**Named Entity Tokenization (NET) using NLTK**

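NLTK does not ship a separate named entity tokenizer. Named entity tokens are obtained by running its named entity recognition (NER) chunker, ne_chunk, on part-of-speech-tagged tokens and keeping only the chunks that carry an entity label. This needs the averaged_perceptron_tagger, maxent_ne_chunker, and words resources. Here's an example of how to extract named entity tokens from a sentence using NLTK.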

In [126]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [128]:
import nltk
nltk.download('maxent_ne_chunker')  # Download the required resource (NER models)
nltk.download('words')  # Download the required resource (word corpus)

from nltk import word_tokenize, pos_tag, ne_chunk

sentence = "I'm from aiQuest Intelligence. I am learning NLP. It is fascinating!"

# Tokenize the sentence into words
tokens = word_tokenize(sentence)

# Perform part-of-speech tagging
pos_tags = pos_tag(tokens)

# Perform named entity recognition
ner_tags = ne_chunk(pos_tags)  # ne_chunk returns a tree; each named entity is a labeled subtree (chunk)

# Extract named entity tokens
named_entity_tokens = []

for chunk in ner_tags:
    # Ordinary tokens are (word, tag) tuples; only named-entity subtrees have a label attribute
    if hasattr(chunk, 'label'):
        named_entity_tokens.append(' '.join(c[0] for c in chunk))  # join the words of the entity

print(named_entity_tokens)


['aiQuest Intelligence', 'NLP']


[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


**Text Vectorizer** (Count Vectorizer, TF-IDF Vectorizer, Word2Vec)

In [129]:
import pandas as pd
df = pd.read_csv('data.csv')

In [130]:
df

,test,class
0,I love Bangladesh,1
1,Could you give me an iphone?,0
2,Hello how are you?,1
3,I want to talk you.,1


**CountVectorizer**

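Each unique word in the corpus becomes one column, and each cell holds how many times that word occurs in the row's text. With the default settings, the text is lowercased, punctuation is dropped, and tokens shorter than two characters are ignored. Ex: I, A, a, etc.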

In [131]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [132]:
cv = CountVectorizer()

In [133]:
cv_x = cv.fit_transform(df['test'])
cv_x

<4x14 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>

In [134]:
cv_x.toarray()

array([[0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]])

In [135]:
cv.get_feature_names_out()

array(['an', 'are', 'bangladesh', 'could', 'give', 'hello', 'how',
       'iphone', 'love', 'me', 'talk', 'to', 'want', 'you'], dtype=object)
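
The vocabulary and count matrix above can be reproduced by hand, which makes the default behaviour easier to see. The sketch below uses only the standard library and assumes CountVectorizer's defaults: lowercase=True and the token pattern (?u)\b\w\w+\b, which keeps runs of two or more word characters. The docs list simply repeats the four sentences from data.csv.

```python
import re
from collections import Counter

docs = [
    "I love Bangladesh",
    "Could you give me an iphone?",
    "Hello how are you?",
    "I want to talk you.",
]

# Same pattern as CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())


# Sorted unique tokens become the column names.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# One row per document, one count per vocabulary term.
counts = []
for doc in docs:
    tally = Counter(tokenize(doc))
    counts.append([tally[term] for term in vocabulary])

print(vocabulary)
for row in counts:
    print(row)
```

The single-letter token "I" never reaches the vocabulary because the pattern requires at least two characters, and "?" and "." are dropped because they are not word characters.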

In [136]:
cv_df = pd.DataFrame(cv_x.toarray(), columns=cv.get_feature_names_out(), index=df['test'])

In [137]:
cv_df

test,an,are,bangladesh,could,give,hello,how,iphone,love,me,talk,to,want,you
I love Bangladesh,0,0,1,0,0,0,0,0,1,0,0,0,0,0
Could you give me an iphone?,1,0,0,1,1,0,0,1,0,1,0,0,0,1
Hello how are you?,0,1,0,0,0,1,1,0,0,0,0,0,0,1
I want to talk you.,0,0,0,0,0,0,0,0,0,0,1,1,1,1


In [138]:
cv_df = pd.DataFrame(cv_x.toarray(), columns=cv.get_feature_names_out())

In [139]:
cv_df  # no index was passed this time (earlier: index=df['test']), so pandas uses the default integer index starting at 0

,an,are,bangladesh,could,give,hello,how,iphone,love,me,talk,to,want,you
0,0,0,1,0,0,0,0,0,1,0,0,0,0,0
1,1,0,0,1,1,0,0,1,0,1,0,0,0,1
2,0,1,0,0,0,1,1,0,0,0,0,0,0,1
3,0,0,0,0,0,0,0,0,0,0,1,1,1,1


**TfidfVectorizer**

In [140]:
tf = TfidfVectorizer()

In [141]:
tf_z = tf.fit_transform(df['test'])

In [142]:
tf_z

<4x14 sparse matrix of type '<class 'numpy.float64'>'
	with 16 stored elements in Compressed Sparse Row format>

In [143]:
tf_df = pd.DataFrame(tf_z.toarray(), columns=tf.get_feature_names_out(), index=df['test'])  # separate name so the count table cv_df is not overwritten

In [144]:
tf_df

test,an,are,bangladesh,could,give,hello,how,iphone,love,me,talk,to,want,you
I love Bangladesh,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0
Could you give me an iphone?,0.430037,0.0,0.0,0.430037,0.430037,0.0,0.0,0.430037,0.0,0.430037,0.0,0.0,0.0,0.274487
Hello how are you?,0.0,0.541736,0.0,0.0,0.0,0.541736,0.541736,0.0,0.0,0.0,0.0,0.0,0.0,0.345783
I want to talk you.,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.541736,0.541736,0.541736,0.345783
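
The weights in this table follow from a short formula. With TfidfVectorizer's defaults (smooth_idf=True, norm='l2'), each term gets idf = ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term; the raw count is multiplied by idf, and every row is then scaled to unit length. The sketch below recomputes the values with the standard library only. The tokenizer is an assumption that mirrors the vectorizer defaults, and docs repeats the four sentences from data.csv.

```python
import math
import re
from collections import Counter

docs = [
    "I love Bangladesh",
    "Could you give me an iphone?",
    "Hello how are you?",
    "I want to talk you.",
]


def tokenize(text):
    # Assumed to mirror the vectorizer defaults: lowercase, tokens of 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


tokenized = [tokenize(doc) for doc in docs]
vocabulary = sorted({tok for toks in tokenized for tok in toks})
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(tok for toks in tokenized for tok in set(toks))

# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}


def tfidf_row(tokens):
    """Weight raw counts by idf, then L2-normalize the row."""
    term_freq = Counter(tokens)
    weights = [term_freq[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]


tfidf = [tfidf_row(toks) for toks in tokenized]

for doc, row in zip(docs, tfidf):
    print(doc, [round(w, 6) for w in row])
```

The word "you" appears in three of the four sentences, so its idf is lower and it ends up with a smaller weight than words that occur in only one sentence.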


**Word2Vec**

In [145]:
!pip install gensim



In [146]:
from gensim.models import Word2Vec, KeyedVectors

In [147]:
text_vector = [nltk.word_tokenize(test) for test in df['test']]  # tokenize each sentence; Word2Vec expects a list of token lists
text_vector

[['I', 'love', 'Bangladesh'],
 ['Could', 'you', 'give', 'me', 'an', 'iphone', '?'],
 ['Hello', 'how', 'are', 'you', '?'],
 ['I', 'want', 'to', 'talk', 'you', '.']]

In [148]:
model = Word2Vec(text_vector, min_count=1)  # train Word2Vec on the tokenized sentences
# min_count -> ignores all words with total frequency lower than this.

# Tip: in Jupyter, press Shift+Tab inside the parentheses to see the full parameter list.

In [149]:
model.wv.most_similar('want')  # words whose vectors are closest to 'want', ranked by cosine similarity

[('an', 0.17826786637306213),
 ('I', 0.16072483360767365),
 ('give', 0.10560770332813263),
 ('how', 0.09215974807739258),
 ('iphone', 0.048910051584243774),
 ('are', 0.02700837142765522),
 ('Could', 0.007729300297796726),
 ('you', -0.03771638125181198),
 ('.', -0.04552280902862549),
 ('talk', -0.0464920699596405)]
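
The scores returned by most_similar are cosine similarities between word vectors. With only four short sentences of training data, the vectors most likely stay close to their random starting values, so the ranking above should not be read as meaningful. The sketch below shows how the score itself is computed, using the standard library and hypothetical 3-dimensional vectors (the model above uses gensim's default of 100 dimensions); the words and numbers in toy_vectors are made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical vectors, not taken from the trained model.
toy_vectors = {
    "want": [1.0, 0.5, 0.0],
    "need": [0.9, 0.6, 0.1],
    "iphone": [-0.2, 0.1, 1.0],
}

for word in ("need", "iphone"):
    print(word, cosine_similarity(toy_vectors["want"], toy_vectors[word]))
```

Values near 1 mean the vectors point in almost the same direction, values near 0 mean they are unrelated, and negative values mean they point in opposing directions.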