#TOKENIZER

Tokenization is the process of dividing a text into smaller units known as tokens. In natural language processing, tokens are typically words or sub-words. Tokenization is a critical step in many NLP tasks, including text processing, language modelling, and machine translation. The process splits a string of text into a list of tokens. Tokens can be thought of as parts of a larger unit: a word is a token in a sentence, and a sentence is a token in a paragraph.
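
To make the idea concrete before turning to a library, here is a minimal sketch that uses only the standard-library `re` module. The helper names `simple_word_tokenize` and `simple_sent_tokenize` are illustrative assumptions, not NLTK functions, and the rules are deliberately naive (for example, an abbreviation such as "Dr." would be treated as a sentence end).

```python
import re

def simple_word_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters/digits/underscores; [^\w\s] matches one
    # punctuation character, so "string," becomes "string" and ",".
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sample = "A word is a token in a sentence. A sentence is a token in a paragraph."
print(simple_word_tokenize(sample))
print(simple_sent_tokenize(sample))
```

Trained tokenizers handle many cases this sketch does not, which is why a library is usually preferred in practice.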

In [1]:
corpus="""
  Tokenization is the process of tokenizing or splitting a string, text into a list of tokens. One can think of token as parts like a word is a token in a sentence, and a sentence is a token in a paragraph.
"""

In [2]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True


##Types of Tokenization
1.Word tokenization:- Word tokenization divides the text into individual words. Many NLP tasks use this approach, treating words as the basic units of meaning.

In [4]:
from nltk.tokenize import word_tokenize
words=word_tokenize(corpus)


In [5]:
print(words)


['Tokenization', 'is', 'the', 'process', 'of', 'tokenizing', 'or', 'splitting', 'a', 'string', ',', 'text', 'into', 'a', 'list', 'of', 'tokens', '.', 'One', 'can', 'think', 'of', 'token', 'as', 'parts', 'like', 'a', 'word', 'is', 'a', 'token', 'in', 'a', 'sentence', ',', 'and', 'a', 'sentence', 'is', 'a', 'token', 'in', 'a', 'paragraph', '.']


2.Sentence Tokenization:- Sentence tokenization segments the text into sentences. This is useful for tasks that analyse or process each sentence individually.

Example:

In [6]:
from nltk.tokenize import sent_tokenize
sent_tokenize(corpus)

['\n  Tokenization is the process of tokenizing or splitting a string, text into a list of tokens.',
 'One can think of token as parts like a word is a token in a sentence, and a sentence is a token in a paragraph.']

##Stemming
Stemming reduces an inflected word to a base form by removing its affixes. A stemmer applies a set of pre-defined rules that govern which affixes are dropped. Note that the result is not always a semantically meaningful word: a stemmer may return a truncated form such as "sentenc" for "sentence".
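
The rule-based idea can be sketched with a toy suffix stripper. The rule list `SUFFIX_RULES` and the function `toy_stem` are hypothetical and exist only for illustration; real stemmers such as Porter's use far more rules and extra conditions on the remaining stem.

```python
# Hypothetical rule list for illustration only; longer suffixes are tried first.
SUFFIX_RULES = ["ization", "izing", "ing", "es", "s"]

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["Tokenization", "tokenizing", "tokens", "splitting", "is"]:
    print(w + "--->" + toy_stem(w))
```

Even this small sketch shows the typical weakness of stemming: "splitting" becomes "splitt", which is not a dictionary word.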

#PORTER STEMMING

In [7]:
from nltk.stem import PorterStemmer
ps=PorterStemmer()

In [8]:
for i in words:
  print(i+"--->"+ps.stem(i ))

Tokenization--->token
is--->is
the--->the
process--->process
of--->of
tokenizing--->token
or--->or
splitting--->split
a--->a
string--->string
,--->,
text--->text
into--->into
a--->a
list--->list
of--->of
tokens--->token
.--->.
One--->one
can--->can
think--->think
of--->of
token--->token
as--->as
parts--->part
like--->like
a--->a
word--->word
is--->is
a--->a
token--->token
in--->in
a--->a
sentence--->sentenc
,--->,
and--->and
a--->a
sentence--->sentenc
is--->is
a--->a
token--->token
in--->in
a--->a
paragraph--->paragraph
.--->.


#LANCASTER STEMMING

In [9]:
from nltk.stem import LancasterStemmer
ls=LancasterStemmer()

In [10]:
for i in words:
  print(i+"--->"+ls.stem(i ))

Tokenization--->tok
is--->is
the--->the
process--->process
of--->of
tokenizing--->tok
or--->or
splitting--->splitting
a--->a
string--->string
,--->,
text--->text
into--->into
a--->a
list--->list
of--->of
tokens--->tok
.--->.
One--->on
can--->can
think--->think
of--->of
token--->tok
as--->as
parts--->part
like--->lik
a--->a
word--->word
is--->is
a--->a
token--->tok
in--->in
a--->a
sentence--->sent
,--->,
and--->and
a--->a
sentence--->sent
is--->is
a--->a
token--->tok
in--->in
a--->a
paragraph--->paragraph
.--->.


In [11]:
from nltk.stem import RegexpStemmer
# Strip a trailing "ing" or "able"; "$" anchors each pattern to the end of the word
rs=RegexpStemmer("ing$|able$")

In [12]:
for i in words:
  print(i+"--->"+rs.stem(i ))

Tokenization--->Tokenization
is--->is
the--->the
process--->process
of--->of
tokenizing--->tokeniz
or--->or
splitting--->splitt
a--->a
string--->str
,--->,
text--->text
into--->into
a--->a
list--->list
of--->of
tokens--->tokens
.--->.
One--->One
can--->can
think--->think
of--->of
token--->token
as--->as
parts--->parts
like--->like
a--->a
word--->word
is--->is
a--->a
token--->token
in--->in
a--->a
sentence--->sentence
,--->,
and--->and
a--->a
sentence--->sentence
is--->is
a--->a
token--->token
in--->in
a--->a
paragraph--->paragraph
.--->.


In [15]:
from nltk.stem import SnowballStemmer
nltk.download('stopwords')
nltk.download('wordnet')
ss=SnowballStemmer("english",ignore_stopwords=True)
for i in words:
  print(i+"--->"+ss.stem(i ))

Tokenization--->token
is--->is
the--->the
process--->process
of--->of
tokenizing--->token
or--->or
splitting--->split
a--->a
string--->string
,--->,
text--->text
into--->into
a--->a
list--->list
of--->of
tokens--->token
.--->.
One--->one
can--->can
think--->think
of--->of
token--->token
as--->as
parts--->part
like--->like
a--->a
word--->word
is--->is
a--->a
token--->token
in--->in
a--->a
sentence--->sentenc
,--->,
and--->and
a--->a
sentence--->sentenc
is--->is
a--->a
token--->token
in--->in
a--->a
paragraph--->paragraph
.--->.


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


##Lemmatization
Lemmatization groups together the inflected forms of the same word and maps them to a base form that is itself a meaningful word. This base form is called the lemma.
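
As a rough sketch of the difference from stemming, lemmatization can be pictured as a dictionary lookup keyed by both the word and its part of speech. The table `LEMMA_TABLE` and the function `toy_lemmatize` below are hand-written assumptions for illustration; in NLTK, WordNet plays the role of the dictionary.

```python
# Hypothetical lookup table: (word, part of speech) -> lemma.
LEMMA_TABLE = {
    ("is", "v"): "be",
    ("was", "v"): "be",
    ("splitting", "v"): "split",
    ("parts", "n"): "part",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if it is unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("is", pos="v"))   # looked up as a verb
print(toy_lemmatize("is", pos="n"))   # no noun entry, so returned unchanged
```

The part-of-speech argument matters: the same surface form can map to different lemmas, or to none, depending on how it is used.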

###WORDNET LEMMATIZATION

In [16]:
from nltk.stem import WordNetLemmatizer
wl=WordNetLemmatizer()
for i in words:
  print(i+"--->"+wl.lemmatize(i,pos='v'))

Tokenization--->Tokenization
is--->be
the--->the
process--->process
of--->of
tokenizing--->tokenizing
or--->or
splitting--->split
a--->a
string--->string
,--->,
text--->text
into--->into
a--->a
list--->list
of--->of
tokens--->tokens
.--->.
One--->One
can--->can
think--->think
of--->of
token--->token
as--->as
parts--->part
like--->like
a--->a
word--->word
is--->be
a--->a
token--->token
in--->in
a--->a
sentence--->sentence
,--->,
and--->and
a--->a
sentence--->sentence
is--->be
a--->a
token--->token
in--->in
a--->a
paragraph--->paragraph
.--->.


In [17]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [18]:
print(wl.lemmatize("tokenizing",pos='s'))

tokenizing


#ONE HOT ENCODING

One-hot encoding is a process used in machine learning and data processing to convert categorical data into a binary matrix. This technique is particularly useful for algorithms that cannot work directly with categorical data and require numerical input. Here's how it works:

Steps for One-Hot Encoding:

1.Identify Categorical Variables: Determine which variables in your dataset are categorical.

2.Create Binary Columns: For each unique category within a categorical variable, create a new binary column.

3.Assign Binary Values: Assign a 1 or 0 to these columns to indicate the presence or absence of the category for each observation.
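
The three steps above can be sketched in plain Python. The function name `one_hot_encode` is an illustrative assumption; sorting the categories is a design choice made here so that the column order is deterministic.

```python
def one_hot_encode(values):
    """Return (categories, rows), where each row is a one-hot list for one value."""
    # Steps 1-2: find the unique categories and give each one a column index.
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    # Step 3: put a 1 in the column of the observed category and 0 elsewhere.
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows

categories, encoded = one_hot_encode(["apple", "orange", "banana", "apple"])
print(categories)
for row in encoded:
    print(row)
```

Each row contains exactly one 1, and repeated values (the two "apple" entries) produce identical rows.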

In [19]:
import pandas as pd

# Sample categorical data

data = {'fruit': ['apple', 'orange', 'banana', 'apple']}

df = pd.DataFrame(data)
print(df)

# One Hot Encoding using Pandas get_dummies

encoded_df = pd.get_dummies(df, columns=['fruit'])
encoded_df

    fruit
0   apple
1  orange
2  banana
3   apple


   fruit_apple  fruit_banana  fruit_orange
0         True         False         False
1        False         False          True
2        False          True         False
3         True         False         False


In [20]:
data = [['apple'], ['orange'], ['banana'], ['apple']]

from sklearn.preprocessing import OneHotEncoder
encoder=OneHotEncoder(sparse_output=False)
encoder.fit(data)
encoded_data=encoder.transform(data)
print(encoded_data)

[[1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]


##ONE HOT ENCODING ON SENTENCE

In [21]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

In [22]:
sentence = "One hot encoding is a process used to convert categorical data into a binary matrix."

In [23]:
# Tokenize the sentence into words
words = sentence.split()
words

['One',
 'hot',
 'encoding',
 'is',
 'a',
 'process',
 'used',
 'to',
 'convert',
 'categorical',
 'data',
 'into',
 'a',
 'binary',
 'matrix.']

In [24]:
# Create the vocabulary
vocab = list(set(words))

In [25]:
vocab

['used',
 'matrix.',
 'to',
 'convert',
 'binary',
 'process',
 'into',
 'is',
 'data',
 'categorical',
 'a',
 'encoding',
 'hot',
 'One']

In [26]:
# Initialize the OneHotEncoder (sparse_output replaces the older sparse argument)
onehot_encoder = OneHotEncoder(sparse_output=False)

In [27]:
# Fit the encoder and transform the words
onehot_encoded = onehot_encoder.fit_transform(np.array(words).reshape(-1, 1))



In [28]:
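# Display the result
# Note: the columns of the encoded matrix follow onehot_encoder.categories_[0]
# (sorted order), not the arbitrary order of the set-based vocab list
print("Vocabulary:", vocab)
print("One-Hot Encoded Vectors:\n", onehot_encoded)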

Vocabulary: ['used', 'matrix.', 'to', 'convert', 'binary', 'process', 'into', 'is', 'data', 'categorical', 'a', 'encoding', 'hot', 'One']
One-Hot Encoded Vectors:
 [[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]]


###TF_IDF VECTORIZATION
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (or corpus). It is often used in information retrieval and text mining.

**Term Frequency (TF)**: Measures how frequently a term appears in a document. It can be calculated as:

$$\mathrm{TF}(t,d)=\frac{\text{number of times term } t \text{ appears in document } d}{\text{total number of terms in document } d}$$


**Inverse Document Frequency (IDF)**: Measures how rare, and therefore how informative, a term is across the entire corpus. It is calculated as:

$$\mathrm{IDF}(t)=\log\left(\frac{\text{total number of documents}}{\text{number of documents containing term } t}\right)$$

The TF-IDF score for a term in a document is the product of its TF and IDF scores:

$$\mathrm{TF\text{-}IDF}(t,d)=\mathrm{TF}(t,d)\times\mathrm{IDF}(t)$$
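
The formulas above can be checked by hand with a short sketch. The function names `tf`, `idf` and `tf_idf` and the two toy documents are assumptions made for illustration, and the natural logarithm is assumed for the log.

```python
import math

def tf(term, doc_tokens):
    """Term frequency: occurrences of term divided by the document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    """Inverse document frequency; assumes the term occurs in at least one document."""
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / n_containing)

def tf_idf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

toy_corpus = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
]
print(tf_idf("the", toy_corpus[0], toy_corpus))  # "the" is in every document
print(tf_idf("cat", toy_corpus[0], toy_corpus))  # "cat" is in only one document
```

A term that appears in every document gets an IDF of log(1) = 0, so its TF-IDF score is 0 regardless of how often it occurs. Note that scikit-learn's `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each row, so its values differ from these textbook formulas.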

In [29]:
d1="Tokenization is the process of tokenizing or splitting a string, text into a list of tokens."
d2="One can think of token as parts like a word is a token in a sentence, and a sentence is a token in a paragraph."

In [30]:
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

In [31]:
doc_corpus=[d1,d2]
print(doc_corpus)

['Tokenization is the process of tokenizing or splitting a string, text into a list of tokens.', 'One can think of token as parts like a word is a token in a sentence, and a sentence is a token in a paragraph.']


In [32]:
vec=TfidfVectorizer(stop_words='english')
matrix=vec.fit_transform(doc_corpus)
print("Feature Names \n",vec.get_feature_names_out())


Feature Names 
 ['like' 'list' 'paragraph' 'parts' 'process' 'sentence' 'splitting'
 'string' 'text' 'think' 'token' 'tokenization' 'tokenizing' 'tokens'
 'word']


In [33]:
print("Sparse Matrix \n",matrix.shape,"\n",matrix.toarray())

Sparse Matrix 
 (2, 15) 
 [[0.         0.35355339 0.         0.         0.35355339 0.
  0.35355339 0.35355339 0.35355339 0.         0.         0.35355339
  0.35355339 0.35355339 0.        ]
 [0.23570226 0.         0.23570226 0.23570226 0.         0.47140452
  0.         0.         0.         0.23570226 0.70710678 0.
  0.         0.         0.23570226]]


#BAG OF WORDS (BOW)

In [34]:
import nltk

paragraph =  """I have three visions for India. In 3000 years of our history, people from all over
               the world have come and invaded us, captured our lands, conquered our minds.
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours.
               Yet we have not done this to any other nation. We have not conquered anyone.
               We have not grabbed their land, their culture,
               their history and tried to enforce our way of life on them.
               Why? Because we respect the freedom of others.That is why my
               first vision is that of freedom. I believe that India got its first vision of
               this in 1857, when we started the War of Independence. It is this freedom that
               we must protect and nurture and build on. If we are not free, no one will respect us.
               My second vision for India’s development. For fifty years we have been a developing nation.
               It is time we see ourselves as a developed nation. We are among the top 5 nations of the world
               in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
               Our achievements are being globally recognised today. Yet we lack the self-confidence to
               see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
               I have a third vision. India must stand up to the world. Because I believe that unless India
               stands up to the world, no one will respect us. Only strength respects strength. We must be
               strong not only as a military power but also as an economic power. Both must go hand-in-hand.
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of
               space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life.
               I see four milestones in my career"""





In [35]:
# Cleaning the texts
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

In [36]:
ps = PorterStemmer()
wordnet=WordNetLemmatizer()
sentences = nltk.sent_tokenize(paragraph)
corpus = []

In [37]:
# Build the stop-word set once instead of rebuilding it for every word
stop_words = set(stopwords.words('english'))
for i in range(len(sentences)):
    review = re.sub('[^a-zA-Z]', ' ', sentences[i])  # keep letters only
    review = review.lower()
    review = review.split()
    review = [wordnet.lemmatize(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

In [38]:
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
print(X)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 1 1 0]
 [0 1 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
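
Because the printed matrix is truncated, a smaller sketch may make the structure easier to see. The function `bag_of_words` and the three toy documents are assumptions for illustration; the sketch counts words with `collections.Counter` rather than `CountVectorizer`, and splits on whitespace only.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, matrix): one row per document, one column per word."""
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    matrix = []
    for doc in documents:
        counts = Counter(doc.split())
        # A Counter returns 0 for words that do not occur in this document.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

toy_docs = ["three vision india", "vision freedom", "india respect freedom freedom"]
vocabulary, bow_matrix = bag_of_words(toy_docs)
print(vocabulary)
for row in bow_matrix:
    print(row)
```

Each column corresponds to one vocabulary word and each cell holds how many times that word occurs in the document; word order within a document is discarded.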


#WORD2VEC
Word2Vec represents each word as a dense numerical vector (a distributed representation) learned from the contexts in which the word appears in the corpus. Words that occur in similar contexts end up with similar vectors, so the resulting word embeddings can be used to find words with related meanings.

In [39]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from gensim.models import Word2Vec
from nltk.corpus import stopwords

import re

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [40]:
paragraph = """I have three visions for India. In 3000 years of our history, people from all over
               the world have come and invaded us, captured our lands, conquered our minds.
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours.
               Yet we have not done this to any other nation. We have not conquered anyone.
               We have not grabbed their land, their culture,
               their history and tried to enforce our way of life on them."""


In [41]:
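# Preprocessing the data
text = re.sub(r'\[[0-9]*\]',' ',paragraph)  # drop bracketed numbers such as [12]
text = re.sub(r'\s+',' ',text)              # collapse runs of whitespace
text = text.lower()
text = re.sub(r'\d',' ',text)               # replace digits with spaces
text = re.sub(r'\s+',' ',text)              # collapse whitespace again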

In [42]:
# Preparing the dataset
sentences = nltk.sent_tokenize(text)

sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

In [43]:
stop_words = set(stopwords.words('english'))  # build the set once for fast lookups
for i in range(len(sentences)):
    sentences[i] = [word for word in sentences[i] if word not in stop_words]

In [44]:

# Training the Word2Vec model
model = Word2Vec(sentences, min_count=1)

In [45]:
words = model.wv.key_to_index
for i in words:
  print(words[i],"------->",i)

0 -------> ,
1 -------> .
2 -------> conquered
3 -------> history
4 -------> us
5 -------> life
6 -------> invaded
7 -------> alexander
8 -------> minds
9 -------> lands
10 -------> captured
11 -------> world
12 -------> come
13 -------> greeks
14 -------> people
15 -------> years
16 -------> india
17 -------> visions
18 -------> onwards
19 -------> turks
20 -------> way
21 -------> done
22 -------> enforce
23 -------> tried
24 -------> culture
25 -------> land
26 -------> grabbed
27 -------> anyone
28 -------> nation
29 -------> yet
30 -------> moguls
31 -------> took
32 -------> looted
33 -------> came
34 -------> dutch
35 -------> french
36 -------> british
37 -------> portuguese
38 -------> three


In [46]:
# Finding Word Vectors
vector = model.wv['grabbed']
print(vector)

[-6.9625224e-03 -2.4570711e-03 -8.0231009e-03  7.4998569e-03
  6.1294287e-03  5.2598049e-03  8.3753048e-03 -6.9806911e-04
 -9.3106739e-03  9.1157416e-03 -4.9290746e-03  7.8482348e-03
  5.5324221e-03 -1.0805655e-03 -7.6641268e-03 -1.4598919e-03
  6.2540262e-03 -6.9685564e-03  1.4447495e-03 -7.9497863e-03
  8.7225642e-03 -2.8563982e-03  9.4370665e-03 -5.7080411e-03
 -9.7175669e-03 -8.6270161e-03 -4.0752478e-03  4.7115544e-03
 -2.4033277e-04  9.2256609e-03  3.1082083e-03  3.7466774e-03
  2.9941848e-03  8.1482371e-03 -2.3958571e-03  7.4063162e-03
 -9.5373588e-03  2.9241510e-03 -6.8165950e-04  4.5327315e-04
  6.8427618e-03 -2.8411502e-03 -2.3576040e-03 -9.9652709e-05
 -5.0032663e-04 -3.5738496e-03  6.2434361e-03 -6.5583410e-03
  7.8907264e-03 -9.2743649e-05  2.6102422e-03  3.2235659e-03
 -2.8280876e-04  1.7051279e-03 -3.1412125e-03  4.7563668e-03
  2.4294248e-04 -3.2808175e-03 -8.7157963e-03 -9.9969180e-03
  3.0956228e-04 -5.7464228e-03 -1.1101884e-03 -4.2061582e-03
 -8.6368443e-03  1.06075

In [47]:
# Most similar words
similar = model.wv.most_similar('nation')
for i in similar:
  print(i)

('invaded', 0.30389490723609924)
(',', 0.19518142938613892)
('turks', 0.1888754814863205)
('grabbed', 0.16689462959766388)
('india', 0.14204353094100952)
('lands', 0.1263914704322815)
('moguls', 0.11590827256441116)
('three', 0.09515701234340668)
('british', 0.08005909621715546)
('dutch', 0.07988414168357849)
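
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that measure in plain Python follows; the function name `cosine_similarity` and the short example vectors are assumptions for illustration, not gensim code.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))    # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction
```

The value ranges from -1 to 1 and depends only on direction, not on vector length. With a training corpus as small as the paragraph used here, the similarity scores are unlikely to reflect real semantic relationships.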


In [50]:
from sklearn.linear_model import LinearRegression
reg=LinearRegression()

In [54]:
from sklearn.model_selection import train_test_split
# With a single input array, train_test_split returns two pieces (train and test), not four
X_train,X_test=train_test_split(vector,test_size=0.2,random_state=2)

