## Normalization in NLP

Stemming

In [2]:
## !pip install nltk

In [3]:
import nltk
import warnings

warnings.filterwarnings('ignore')

In [4]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [5]:
words=['change','changing','changes','changed']

In [6]:
from nltk.stem import PorterStemmer

In [7]:
p=PorterStemmer()

In [8]:
p.stem('changing')

'chang'

In [9]:
for word in words:
    print(word,"-->", p.stem(word) )

change --> chang
changing --> chang
changes --> chang
changed --> chang
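The output above shows that a stem does not have to be a dictionary word. A minimal sketch of the underlying idea, rule-based suffix stripping, is given below. It is not the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem; `SUFFIXES` and `naive_stem` are hypothetical names made up for this illustration.

```python
# Toy suffix stripper: NOT the Porter algorithm, only the core idea behind it.
SUFFIXES = ("ing", "es", "ed", "e", "s")  # hypothetical rule list, checked in order

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

words = ["change", "changing", "changes", "changed"]
stems = [naive_stem(w) for w in words]
for w, s in zip(words, stems):
    print(w, "-->", s)
```

On these four words the toy rules happen to agree with `PorterStemmer`; on a wider vocabulary they would diverge quickly.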


In [10]:
sen = 'The constant flux of life necessitates embracing change, whether its adapting to the changes around us or actively changing ourselves to meet new challenges.'

In [11]:
sen

'The constant flux of life necessitates embracing change, whether its adapting to the changes around us or actively changing ourselves to meet new challenges.'

In [12]:
## from nltk.tokenize import word_tokenize  ## this is creating problems (recent NLTK releases may also need nltk.download('punkt_tab'))

In [13]:
## tokenss= word_tokenize(sen)   ## this is creating problems

In [14]:
from nltk.tokenize import TreebankWordTokenizer

In [15]:
word_tokenizer=TreebankWordTokenizer()

In [16]:
token=word_tokenizer.tokenize(sen)

In [17]:
token

['The',
 'constant',
 'flux',
 'of',
 'life',
 'necessitates',
 'embracing',
 'change',
 ',',
 'whether',
 'its',
 'adapting',
 'to',
 'the',
 'changes',
 'around',
 'us',
 'or',
 'actively',
 'changing',
 'ourselves',
 'to',
 'meet',
 'new',
 'challenges',
 '.']

In [18]:
for word in token:
    print(p.stem(word))

the
constant
flux
of
life
necessit
embrac
chang
,
whether
it
adapt
to
the
chang
around
us
or
activ
chang
ourselv
to
meet
new
challeng
.


## Lemmatization

In [19]:
# import nltk

# # Download WordNet explicitly to your NLTK data folder
# nltk.download('wordnet', download_dir=r"D:\AiQuest\Class_13_Text Data Processing\Text Data Processing and Vectorizer\nltk_data")
# nltk.download('omw-1.4', download_dir=r"D:\AiQuest\Class_13_Text Data Processing\Text Data Processing and Vectorizer\nltk_data")

# # Tell NLTK to use that folder
# nltk.data.path.append(r"D:\AiQuest\Class_13_Text Data Processing\Text Data Processing and Vectorizer\nltk_data")

In [20]:
from nltk.stem import WordNetLemmatizer

In [21]:
le= WordNetLemmatizer()

In [22]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [23]:
for w in token:
    print(le.lemmatize(w))

The
constant
flux
of
life
necessitates
embracing
change
,
whether
it
adapting
to
the
change
around
u
or
actively
changing
ourselves
to
meet
new
challenge
.


In [24]:
for w in words:
    print(w,"-->",le.lemmatize(w))

change --> change
changing --> changing
changes --> change
changed --> changed
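`changing` and `changed` come back unchanged because `WordNetLemmatizer.lemmatize` takes a `pos` argument that defaults to noun (`'n'`), so verb forms are not reduced unless the part of speech is supplied. The toy sketch below mimics that behaviour with a hand-written table keyed by `(word, pos)`; `TOY_LEMMAS` and `toy_lemmatize` are hypothetical stand-ins for illustration, not WordNet.

```python
# Toy lemma table keyed by (word, part of speech); a stand-in for WordNet.
TOY_LEMMAS = {
    ("changes", "n"): "change",   # plural noun -> singular
    ("changes", "v"): "change",
    ("changing", "v"): "change",  # only resolvable when treated as a verb
    ("changed", "v"): "change",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word itself if it is unknown."""
    return TOY_LEMMAS.get((word, pos), word)

words = ["change", "changing", "changes", "changed"]
as_noun = [toy_lemmatize(w) for w in words]           # default pos, like the cell above
as_verb = [toy_lemmatize(w, pos="v") for w in words]  # verb reading
print(as_noun)
print(as_verb)
```

With NLTK itself, the corresponding call is `le.lemmatize(w, pos='v')`.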


## Tokenization in NLP

NLTK

In [25]:
## from nltk.tokenize import word_tokenize,sent_tokenize  ## not working

In [26]:
from nltk.tokenize import TreebankWordTokenizer,PunktSentenceTokenizer

In [27]:
word_tokenizer=TreebankWordTokenizer()
sen_tokenizer=PunktSentenceTokenizer()

In [28]:
sentence = "I'm from aiQuest Intelligence. I am learning NLP. It is fascinating!"

word_tkn= word_tokenizer.tokenize(sentence)
sent_tkn= sen_tokenizer.tokenize(sentence)

print(word_tkn)
print(sent_tkn)

['I', "'m", 'from', 'aiQuest', 'Intelligence.', 'I', 'am', 'learning', 'NLP.', 'It', 'is', 'fascinating', '!']
["I'm from aiQuest Intelligence.", 'I am learning NLP.', 'It is fascinating!']


## spaCy

In [29]:
## !pip install spacy

In [30]:
import spacy

In [31]:
spc = spacy.load('en_core_web_sm')

text = "I'm from aiQuest Intelligence. I am learning NLP. It is fascinating!"
doc= spc(text)
doc

I'm from aiQuest Intelligence. I am learning NLP. It is fascinating!

In [32]:
type(doc)

spacy.tokens.doc.Doc

In [33]:
word_tokens=[token.text for token in doc ]
print(word_tokens)

['I', "'m", 'from', 'aiQuest', 'Intelligence', '.', 'I', 'am', 'learning', 'NLP', '.', 'It', 'is', 'fascinating', '!']


## Transformers

In [34]:
## !pip install transformers


In [35]:
from transformers import AutoTokenizer

In [36]:
tokenizer= AutoTokenizer.from_pretrained('bert-base-uncased')

In [37]:
sentence = "I'm from aiQuest Intelligence. I am learning NLP. It is fascinating!"
tokens=tokenizer.tokenize(sentence)

In [38]:
tokens

['i',
 "'",
 'm',
 'from',
 'ai',
 '##quest',
 'intelligence',
 '.',
 'i',
 'am',
 'learning',
 'nl',
 '##p',
 '.',
 'it',
 'is',
 'fascinating',
 '!']
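The `##` pieces above come from WordPiece tokenization: a word missing from the vocabulary is split greedily into the longest known sub-words, and every piece after the first is marked with `##`. The sketch below shows that greedy longest-match-first step on a toy vocabulary. `wordpiece` and `toy_vocab` are hypothetical names; the real `bert-base-uncased` vocabulary has roughly 30,000 entries, and the real tokenizer also lowercases and splits on punctuation before this step.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercased word into sub-words."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1  # shrink the candidate by one character and retry
        if match is None:
            return [unk]  # no piece fits: the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, chosen so the examples mirror the output above.
toy_vocab = {"ai", "##quest", "nl", "##p", "learning"}

for w in ["aiquest", "nlp", "learning", "xyz"]:
    print(w, "->", wordpiece(w, toy_vocab))
```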

## Named Entity Tokenization using NLTK

In [39]:
## nltk.download('wordnet')
## nltk.download('maxent_ne_chunker') 
## nltk.download('words') 
## nltk.download('averaged_perceptron_tagger')

In [40]:
from nltk import TreebankWordTokenizer,pos_tag,ne_chunk

In [41]:
sentence = "I'm from aiQuest Intelligence. I am learning NLP. It is fascinating!, Hasan Khan, my name is Joe"

tokens=TreebankWordTokenizer()
tokenize=tokens.tokenize(sentence)


In [42]:
tokenize

['I',
 "'m",
 'from',
 'aiQuest',
 'Intelligence.',
 'I',
 'am',
 'learning',
 'NLP.',
 'It',
 'is',
 'fascinating',
 '!',
 ',',
 'Hasan',
 'Khan',
 ',',
 'my',
 'name',
 'is',
 'Joe']

POS tagging using spaCy

In [43]:
spc = spacy.load('en_core_web_sm')

## text = "I'm from aiQuest Intelligence. I am learning NLP. It is fascinating!"
text = "Shakil Khan lives in Germany Europe"
tokenize2=spc(text)

In [44]:
tokenize2

Shakil Khan lives in Germany Europe

In [45]:
word_tokens=[(i.text,i.pos_) for i in tokenize2]

In [46]:
word_tokens

[('Shakil', 'PROPN'),
 ('Khan', 'PROPN'),
 ('lives', 'VERB'),
 ('in', 'ADP'),
 ('Germany', 'PROPN'),
 ('Europe', 'PROPN')]

doc.ents

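When text is passed through a spaCy pipeline (here `tokenize2 = spc(text)`), the pipeline's named entity recognizer tags entities automatically.

The `ents` attribute of the resulting `Doc` (written `doc.ents` here) is a tuple of `Span` objects, one per detected entity; each span exposes its text as `.text` and its entity type as `.label_`.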

Entities are things like:

PERSON (people)

ORG (organizations)

GPE (cities, countries)

DATE (dates)

MONEY

etc.

In [47]:
for ent in tokenize2.ents:
    print(ent.text,ent.label_)

Shakil Khan PERSON
Germany GPE
Europe LOC


## Text Vectorizer

In [48]:
## !pip install openpyxl

In [49]:
import pandas as pd

In [50]:
df=pd.read_excel('data.xlsx')

In [51]:
df.head()

Unnamed: 0,text,class
0,"Hey, I love Bangladesh;",1
1,"Good afternoon, I am happy!",1
2,I live in Germany,1
3,Nice to meet you man-,1
4,You won an iPhone,0


Data processing

In [52]:
from nltk.corpus import stopwords

In [53]:
## import nltk
## nltk.download('stopwords')

In [54]:
en_words=set(stopwords.words('english'))

In [55]:
en_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 "he's",
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 "i'll",
 "i'm",
 "i've",
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [56]:
stopwords.fileids()

['albanian',
 'arabic',
 'azerbaijani',
 'basque',
 'belarusian',
 'bengali',
 'catalan',
 'chinese',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'greek',
 'hebrew',
 'hinglish',
 'hungarian',
 'indonesian',
 'italian',
 'kazakh',
 'nepali',
 'norwegian',
 'portuguese',
 'romanian',
 'russian',
 'slovene',
 'spanish',
 'swedish',
 'tajik',
 'tamil',
 'turkish']

In [57]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [58]:
len(string.punctuation)

32

In [59]:
def preprocess_text(text):
    # Remove punctuation character by character, then rejoin into one string
    remove_punc = [char for char in text if char not in string.punctuation]
    clean_words = ''.join(remove_punc)
    split_words = clean_words.split()

    # Remove stopwords (case-insensitive match against NLTK's English list)
    return [word for word in split_words if word.lower() not in en_words]

In [60]:
df['text'] = df['text'].apply(preprocess_text) 

In [61]:
df['text']

0     [Hey, love, Bangladesh]
1    [Good, afternoon, happy]
2             [live, Germany]
3           [Nice, meet, man]
4                    [iPhone]
Name: text, dtype: object
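The same cleaning step can also be sketched with `str.translate`, which deletes all punctuation in one pass instead of filtering character by character. To keep the example runnable without NLTK data it uses a tiny hard-coded stop-word set; `TOY_STOPWORDS` and `preprocess_text_translate` are hypothetical names, and the notebook's `en_words` would take the place of the toy set in practice.

```python
import string

# Hypothetical mini stop-word set; the notebook uses NLTK's full English list.
TOY_STOPWORDS = {"i", "am", "in", "to", "you", "an"}

# Translation table that deletes every character in string.punctuation.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess_text_translate(text):
    """Strip punctuation, split on whitespace, and drop stop words."""
    words = text.translate(PUNCT_TABLE).split()
    return [w for w in words if w.lower() not in TOY_STOPWORDS]

first = preprocess_text_translate("Hey, I love Bangladesh;")
second = preprocess_text_translate("Nice to meet you man-")
print(first)
print(second)
```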

In [62]:
lemmatized=WordNetLemmatizer()

In [63]:
def lemmatize_text(text):
    # Lemmatize each token, then join the tokens back into one space-separated string
    return ' '.join(lemmatized.lemmatize(word) for word in text)

In [64]:
df['text']=df['text'].apply(lemmatize_text)

In [65]:
df.head()

Unnamed: 0,text,class
0,Hey love Bangladesh,1
1,Good afternoon happy,1
2,live Germany,1
3,Nice meet man,1
4,iPhone,0


## CountVectorizer

In [66]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer

In [67]:
cv=CountVectorizer()

In [68]:
cv_x = cv.fit_transform(df['text'])

In [69]:
cv_x

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 12 stored elements and shape (5, 12)>

In [70]:
column_names=cv.get_feature_names_out()

In [71]:
column_names

array(['afternoon', 'bangladesh', 'germany', 'good', 'happy', 'hey',
       'iphone', 'live', 'love', 'man', 'meet', 'nice'], dtype=object)

In [72]:
cv_df=pd.DataFrame(cv_x.toarray(),index=df['text'],columns=cv.get_feature_names_out())

In [73]:
cv_df

text,afternoon,bangladesh,germany,good,happy,hey,iphone,live,love,man,meet,nice
Hey love Bangladesh,0,1,0,0,0,1,0,0,1,0,0,0
Good afternoon happy,1,0,0,1,1,0,0,0,0,0,0,0
live Germany,0,0,1,0,0,0,0,1,0,0,0,0
Nice meet man,0,0,0,0,0,0,0,0,0,1,1,1
iPhone,0,0,0,0,0,0,1,0,0,0,0,0
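To make the table above less of a black box, here is a rough sketch of what `CountVectorizer` does with its default settings: lowercase the text, keep tokens of two or more word characters, sort the vocabulary, and count. The documents are copied from the cleaned `text` column; the helper name `tokenize` is an assumption made for this illustration.

```python
import re
from collections import Counter

docs = ["Hey love Bangladesh", "Good afternoon happy", "live Germany",
        "Nice meet man", "iPhone"]

# CountVectorizer's default token pattern: words of two or more characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")

def tokenize(doc):
    """Lowercase the document and extract tokens with the default pattern."""
    return TOKEN_RE.findall(doc.lower())

# Vocabulary = sorted set of all tokens; one column per vocabulary entry.
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

# One row per document: how often each vocabulary term occurs in it.
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[term] for term in vocab])

print(vocab)
for row in matrix:
    print(row)
```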


## TfidfVectorizer

In [74]:
tfv=TfidfVectorizer()

In [75]:
tfv_x=tfv.fit_transform(df['text'])

In [76]:
tfv_x

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 12 stored elements and shape (5, 12)>

In [77]:
tfv_df=pd.DataFrame(tfv_x.toarray(),index=df['text'],columns=tfv.get_feature_names_out())

In [78]:
tfv_df

text,afternoon,bangladesh,germany,good,happy,hey,iphone,live,love,man,meet,nice
Hey love Bangladesh,0.0,0.57735,0.0,0.0,0.0,0.57735,0.0,0.0,0.57735,0.0,0.0,0.0
Good afternoon happy,0.57735,0.0,0.0,0.57735,0.57735,0.0,0.0,0.0,0.0,0.0,0.0,0.0
live Germany,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.0,0.0
Nice meet man,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.57735,0.57735,0.57735
iPhone,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
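The values 0.57735, 0.707107 and 1.0 follow from `TfidfVectorizer`'s defaults (`smooth_idf=True`, `norm='l2'`). Every term here occurs once, in exactly one document, so all terms share the same idf and each row reduces to 1/sqrt(k) for a document with k distinct terms. The sketch below recomputes this by hand; `idf` and `tfidf_row` are hypothetical helper names, and the token lists are the lowercased cleaned texts.

```python
import math

docs = [["hey", "love", "bangladesh"], ["good", "afternoon", "happy"],
        ["live", "germany"], ["nice", "meet", "man"], ["iphone"]]
n_docs = len(docs)

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    """Raw term count times idf per term, then scale the row to unit L2 length."""
    weights = {term: doc.count(term) * idf(term) for term in set(doc)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

for doc in docs:
    print({term: round(w, 6) for term, w in tfidf_row(doc).items()})
```

On a corpus where terms repeat across documents, the idf values would differ per term and the rows would no longer be uniform.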


## Transformers

In [79]:
from transformers import pipeline

In [80]:
## !pip install torch


In [81]:
classifier=pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Device set to use cpu


In [82]:
classifier('The course from aiQuest is amazing!')

[{'label': 'POSITIVE', 'score': 0.9998749494552612}]

In [84]:
classifier.save_pretrained('sentiment_model')

## Word2Vec

In [86]:
## !pip install gensim

In [92]:
from gensim.models import Word2Vec

In [89]:
word_tokenizer = TreebankWordTokenizer()  # build the tokenizer once and reuse it
text_vector = [word_tokenizer.tokenize(sentence) for sentence in df['text']]

In [90]:
text_vector

[['Hey', 'love', 'Bangladesh'],
 ['Good', 'afternoon', 'happy'],
 ['live', 'Germany'],
 ['Nice', 'meet', 'man'],
 ['iPhone']]

In [93]:
model=Word2Vec(text_vector,min_count=1)

In [94]:
model

<gensim.models.word2vec.Word2Vec at 0x20eead15390>

In [96]:
model.wv.most_similar('iPhone')

[('Bangladesh', 0.21617142856121063),
 ('afternoon', 0.09291722625494003),
 ('Hey', 0.07963486760854721),
 ('love', 0.06285078823566437),
 ('Good', 0.0270574688911438),
 ('happy', 0.016134677454829216),
 ('man', -0.01083916611969471),
 ('Germany', -0.027750369161367416),
 ('meet', -0.05234673246741295),
 ('live', -0.059876296669244766)]
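`most_similar` ranks words by the cosine similarity between their vectors. With only five very short sentences the model has almost nothing to learn from, so scores this close to zero are probably best read as noise from the random initialization rather than as meaningful relations. The sketch below shows the similarity measure itself in plain Python; the three vectors are made up for illustration and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors, not taken from the trained model.
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a
vec_c = [3.0, 0.0, -1.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # 0: unrelated directions
```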