# Bag of Words

Bag of Words is a text representation technique in natural language processing in which each document is represented as a vector of word counts over a fixed vocabulary, ignoring word order. It involves three steps: tokenization, vocabulary creation, and vectorization. This makes it a simple yet effective method for text analysis.
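
The three steps (tokenization, vocabulary creation, vectorization) can be sketched without any library. This is only an illustration under simplifying assumptions: the helper name `bag_of_words` and the whitespace tokenizer are made up for the example and are much cruder than the NLTK and scikit-learn tools used in this notebook.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count_vectors) for a list of strings."""
    # Tokenization: lowercase and split on whitespace (a deliberately crude tokenizer).
    tokenized = [doc.lower().split() for doc in docs]
    # Vocabulary creation: every distinct token, sorted, mapped to a column index.
    distinct = sorted({token for tokens in tokenized for token in tokens})
    vocab = {term: idx for idx, term in enumerate(distinct)}
    # Vectorization: one count vector per document; word order is discarded.
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts.get(term, 0) for term in vocab])
    return vocab, vectors

docs = ["the octopus hides", "the octopus sees the crab"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # term -> column index
print(vectors)  # one row of counts per document
```

Both documents end up as vectors of the same length, and the second row records that "the" occurs twice, which is all the information the representation keeps.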

In [65]:
!pip install nltk
import nltk
nltk.download('punkt') ##for tokenization



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [66]:
paragraph = """The octopus is a fascinating marine creature known for its intelligence, unique appearance, and remarkable abilities. Belonging to the class Cephalopoda, which also includes squids and cuttlefish, octopuses are highly adaptable and can be found in various ocean environments, from shallow coastal waters to deep-sea regions.
Key characteristics of octopuses include their soft, bulbous bodies, large heads, and distinctive arms, usually eight in number. Each arm is lined with suckers that are highly sensitive to touch and taste. Octopuses are known for their exceptional problem-solving abilities, which are attributed to their well-developed nervous system and complex brain.
One of the most intriguing features of octopuses is their ability to change color and texture rapidly, allowing them to blend into their surroundings for camouflage or communicate with other octopuses. This camouflaging ability is achieved through specialized pigment cells called chromatophores in their skin.
Octopuses are also known for their intelligence, exhibiting advanced problem-solving skills, memory, and the ability to learn through observation. They have been observed using tools, escaping from predators, and even opening jars to access food. Some species of octopuses are also known for their unique behaviors, such as mimicry, where they imitate other marine animals to avoid predators.
Reproduction in octopuses is a fascinating process. Males typically use a specialized arm called a hectocotylus to transfer sperm to the female's mantle during mating. After laying a large number of eggs, the female guards and cares for them until they hatch. Interestingly, octopuses are semelparous, meaning they reproduce only once in their lifetime, and females often die shortly after their eggs hatch.
Despite their intriguing characteristics, octopuses have relatively short lifespans, ranging from a few months to a couple of years, depending on the species. Their adaptability, intelligence, and unique features make octopuses subjects of great interest in marine biology and have inspired curiosity and awe among scientists and enthusiasts alike.
"""

**Tokenization**

In [67]:
sentences = nltk.sent_tokenize(paragraph)

In [68]:
sentences

['The octopus is a fascinating marine creature known for its intelligence, unique appearance, and remarkable abilities.',
 'Belonging to the class Cephalopoda, which also includes squids and cuttlefish, octopuses are highly adaptable and can be found in various ocean environments, from shallow coastal waters to deep-sea regions.',
 'Key characteristics of octopuses include their soft, bulbous bodies, large heads, and distinctive arms, usually eight in number.',
 'Each arm is lined with suckers that are highly sensitive to touch and taste.',
 'Octopuses are known for their exceptional problem-solving abilities, which are attributed to their well-developed nervous system and complex brain.',
 'One of the most intriguing features of octopuses is their ability to change color and texture rapidly, allowing them to blend into their surroundings for camouflage or communicate with other octopuses.',
 'This camouflaging ability is achieved through specialized pigment cells called chromatophores

**Stopword removal, then lemmatization**

In [69]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('stopwords') ##for stopwords
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [70]:
lemma = WordNetLemmatizer()

In [71]:
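stop_words = set(stopwords.words('english'))  # build the set once, not once per token
for i in range(len(sentences)):
  words = nltk.word_tokenize(sentences[i])
  # The check uses the original casing, so capitalized stopwords such as "The" are kept;
  # pos='v' lemmatizes every token as if it were a verb
  words = [lemma.lemmatize(word.lower(), pos='v') for word in words if word not in stop_words]
  sentences[i] = ' '.join(words)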
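
The order of lowercasing and stopword filtering matters here. NLTK's English stopword list is lowercase, and the loop above tests each token before lowercasing it, which is why sentence-initial words such as "The" show up as 'the' in the processed sentences. The sketch below compares the two orderings; `STOP_WORDS` is a hypothetical miniature set standing in for `stopwords.words('english')`, so the example runs without any downloads.

```python
# Hypothetical miniature stopword set; the notebook uses stopwords.words('english').
STOP_WORDS = {"the", "is", "a", "for", "its", "and"}

tokens = ["The", "octopus", "is", "a", "fascinating", "marine", "creature"]

# Filter first, lowercase afterwards: "The" is not in the lowercase set, so it survives.
filter_then_lower = [t.lower() for t in tokens if t not in STOP_WORDS]

# Lowercase first, then filter: "the" is matched and removed.
lower_then_filter = [t.lower() for t in tokens if t.lower() not in STOP_WORDS]

print(filter_then_lower)
print(lower_then_filter)
```

Whether to keep or drop those sentence-initial stopwords is a design choice; switching to the second ordering would change the vocabulary and every column index derived from it.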

In [72]:
sentences

['the octopus fascinate marine creature know intelligence , unique appearance , remarkable abilities .',
 'belong class cephalopoda , also include squids cuttlefish , octopuses highly adaptable find various ocean environments , shallow coastal water deep-sea regions .',
 'key characteristics octopuses include soft , bulbous body , large head , distinctive arm , usually eight number .',
 'each arm line suckers highly sensitive touch taste .',
 'octopuses know exceptional problem-solving abilities , attribute well-developed nervous system complex brain .',
 'one intrigue feature octopuses ability change color texture rapidly , allow blend surround camouflage communicate octopuses .',
 'this camouflage ability achieve specialize pigment cells call chromatophores skin .',
 'octopuses also know intelligence , exhibit advance problem-solving skills , memory , ability learn observation .',
 'they observe use tool , escape predators , even open jar access food .',
 'some species octopuses also

In [73]:
# The sentences were stopword-filtered and lemmatized above; CountVectorizer now counts the remaining words
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [74]:
# Vectorization: learn the vocabulary and build the sentence-by-term count matrix
X = cv.fit_transform(sentences)
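
With default settings, `CountVectorizer` lowercases its input and extracts tokens with the regular expression `(?u)\b\w\w+\b`, meaning runs of two or more word characters. That explains two things in the vocabulary: the standalone commas and periods left in the sentences never become features, and hyphenated words such as 'problem-solving' are split into 'problem' and 'solving'. The sketch below reproduces only that tokenization step with the standard `re` module; it does not call scikit-learn, and the constant name `TOKEN_PATTERN` is chosen for the example.

```python
import re

# Same pattern as CountVectorizer's default token_pattern:
# runs of two or more word characters, so punctuation acts as a separator.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

sentence = ("octopuses know exceptional problem-solving abilities , "
            "attribute well-developed nervous system complex brain .")

tokens = TOKEN_PATTERN.findall(sentence.lower())
print(tokens)
```

If hyphenated terms should stay whole, the vectorizer accepts a custom `token_pattern` or `tokenizer` argument, at the cost of a different vocabulary.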

In [75]:
print(cv.get_feature_names_out())

['abilities' 'ability' 'access' 'achieve' 'adaptability' 'adaptable'
 'advance' 'after' 'alike' 'allow' 'also' 'among' 'animals' 'appearance'
 'arm' 'attribute' 'avoid' 'awe' 'behaviors' 'belong' 'biology' 'blend'
 'body' 'brain' 'bulbous' 'call' 'camouflage' 'care' 'cells' 'cephalopoda'
 'change' 'characteristics' 'chromatophores' 'class' 'coastal' 'color'
 'communicate' 'complex' 'couple' 'creature' 'curiosity' 'cuttlefish'
 'deep' 'depend' 'despite' 'developed' 'die' 'distinctive' 'each' 'egg'
 'eight' 'enthusiasts' 'environments' 'escape' 'even' 'exceptional'
 'exhibit' 'fascinate' 'feature' 'female' 'females' 'find' 'food' 'great'
 'guard' 'hatch' 'head' 'hectocotylus' 'highly' 'imitate' 'include'
 'inspire' 'intelligence' 'interest' 'interestingly' 'intrigue' 'jar'
 'key' 'know' 'large' 'lay' 'learn' 'lifespans' 'lifetime' 'line' 'make'
 'males' 'mantle' 'marine' 'mat' 'mean' 'memory' 'mimicry' 'months'
 'nervous' 'number' 'observation' 'observe' 'ocean' 'octopus' 'octopuses'
 'o

In [76]:
cv.vocabulary_

{'the': 137,
 'octopus': 99,
 'fascinate': 57,
 'marine': 88,
 'creature': 39,
 'know': 78,
 'intelligence': 72,
 'unique': 145,
 'appearance': 13,
 'remarkable': 112,
 'abilities': 0,
 'belong': 19,
 'class': 33,
 'cephalopoda': 29,
 'also': 10,
 'include': 70,
 'squids': 130,
 'cuttlefish': 41,
 'octopuses': 100,
 'highly': 68,
 'adaptable': 5,
 'find': 61,
 'various': 148,
 'ocean': 98,
 'environments': 52,
 'shallow': 119,
 'coastal': 34,
 'water': 149,
 'deep': 42,
 'sea': 116,
 'regions': 110,
 'key': 77,
 'characteristics': 31,
 'soft': 124,
 'bulbous': 24,
 'body': 22,
 'large': 79,
 'head': 66,
 'distinctive': 47,
 'arm': 14,
 'usually': 147,
 'eight': 50,
 'number': 95,
 'each': 48,
 'line': 84,
 'suckers': 132,
 'sensitive': 118,
 'touch': 142,
 'taste': 135,
 'exceptional': 55,
 'problem': 106,
 'solving': 125,
 'attribute': 15,
 'well': 150,
 'developed': 45,
 'nervous': 94,
 'system': 134,
 'complex': 37,
 'brain': 23,
 'one': 102,
 'intrigue': 75,
 'feature': 58,
 'abili

In [79]:
print(sentences[8])

they observe use tool , escape predators , even open jar access food .


In [80]:
X[8].toarray()

array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]])
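
Each position in such a row is a column index from `cv.vocabulary_`, so a row can be read back by inverting that mapping. The sketch below shows the idea on a toy vocabulary whose terms and indices are invented for the example; it is not the vocabulary fitted above.

```python
# Toy vocabulary in the same term -> column-index form as cv.vocabulary_ (values are made up).
vocabulary = {"access": 0, "escape": 1, "food": 2, "jar": 3, "octopus": 4, "open": 5}

# A count row for one document, aligned with the column indices above.
row = [1, 0, 1, 1, 0, 1]

# Invert the mapping so each column index points back to its term.
index_to_term = {idx: term for term, idx in vocabulary.items()}

# Keep only the columns with a non-zero count.
present = {index_to_term[i]: count for i, count in enumerate(row) if count > 0}
print(present)
```

On the fitted vectorizer, `cv.inverse_transform(X[8])` should give the terms behind the non-zero columns directly, although it returns only the terms and not their counts.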