# Bag of Words

Bag of Words is a text representation technique in natural language processing in which each document is represented as a vector of word frequencies, ignoring word order. It involves three steps: tokenization, vocabulary creation, and vectorization. It is a simple yet effective method for text analysis.
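
The three steps named above can be sketched in plain Python before any library is involved. This is a minimal illustration, not the approach used in the rest of the notebook: the two-document corpus, the whitespace tokenizer, and the helper name `vectorize` are assumptions made only for this example.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the octopus changes color",
    "the squid changes shape and color",
]

# 1. Tokenization: lowercase and split on whitespace.
tokenized = [doc.lower().split() for doc in docs]

# 2. Vocabulary creation: every distinct token, in sorted order.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# 3. Vectorization: one count per vocabulary term, per document.
def vectorize(tokens, vocabulary):
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

vectors = [vectorize(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each vector has one position per vocabulary term, so word order within a document is lost while the counts are kept.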

In [1]:
!pip install nltk
import nltk
nltk.download('punkt')  # Punkt models used by sent_tokenize and word_tokenize



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [2]:
paragraph = """The octopus is a fascinating marine creature known for its intelligence, unique appearance, and remarkable abilities. Belonging to the class Cephalopoda, which also includes squids and cuttlefish, octopuses are highly adaptable and can be found in various ocean environments, from shallow coastal waters to deep-sea regions.
Key characteristics of octopuses include their soft, bulbous bodies, large heads, and distinctive arms, usually eight in number. Each arm is lined with suckers that are highly sensitive to touch and taste. Octopuses are known for their exceptional problem-solving abilities, which are attributed to their well-developed nervous system and complex brain.
One of the most intriguing features of octopuses is their ability to change color and texture rapidly, allowing them to blend into their surroundings for camouflage or communicate with other octopuses. This camouflaging ability is achieved through specialized pigment cells called chromatophores in their skin.
Octopuses are also known for their intelligence, exhibiting advanced problem-solving skills, memory, and the ability to learn through observation. They have been observed using tools, escaping from predators, and even opening jars to access food. Some species of octopuses are also known for their unique behaviors, such as mimicry, where they imitate other marine animals to avoid predators.
Reproduction in octopuses is a fascinating process. Males typically use a specialized arm called a hectocotylus to transfer sperm to the female's mantle during mating. After laying a large number of eggs, the female guards and cares for them until they hatch. Interestingly, octopuses are semelparous, meaning they reproduce only once in their lifetime, and females often die shortly after their eggs hatch.
Despite their intriguing characteristics, octopuses have relatively short lifespans, ranging from a few months to a couple of years, depending on the species. Their adaptability, intelligence, and unique features make octopuses subjects of great interest in marine biology and have inspired curiosity and awe among scientists and enthusiasts alike.
"""

**Tokenization**

In [3]:
sentences = nltk.sent_tokenize(paragraph)

In [4]:
sentences

['The octopus is a fascinating marine creature known for its intelligence, unique appearance, and remarkable abilities.',
 'Belonging to the class Cephalopoda, which also includes squids and cuttlefish, octopuses are highly adaptable and can be found in various ocean environments, from shallow coastal waters to deep-sea regions.',
 'Key characteristics of octopuses include their soft, bulbous bodies, large heads, and distinctive arms, usually eight in number.',
 'Each arm is lined with suckers that are highly sensitive to touch and taste.',
 'Octopuses are known for their exceptional problem-solving abilities, which are attributed to their well-developed nervous system and complex brain.',
 'One of the most intriguing features of octopuses is their ability to change color and texture rapidly, allowing them to blend into their surroundings for camouflage or communicate with other octopuses.',
 'This camouflaging ability is achieved through specialized pigment cells called chromatophores

**Stop-word Removal and Lemmatization**

In [5]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('stopwords')  # stop-word lists
nltk.download('wordnet')  # lexical database used by WordNetLemmatizer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [6]:
lemma = WordNetLemmatizer()

In [7]:
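stop_words = set(stopwords.words('english'))  # build the set once, not once per word
for i in range(len(sentences)):
  words = nltk.word_tokenize(sentences[i])
  # Drop stop words, then lowercase each remaining word and lemmatize it as a verb (pos='v').
  # Note: the stop-word check runs before lowercasing, so capitalized words such as
  # "The" or "Each" are not filtered out; they appear in the output below.
  words = [lemma.lemmatize(word.lower(), pos='v') for word in words if word not in stop_words]
  sentences[i] = ' '.join(words)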
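
The effect of checking stop words before or after lowercasing can be seen in a small stand-alone sketch. It assumes a tiny hand-written stop-word set in place of the NLTK list and splits on whitespace instead of calling `nltk.word_tokenize`; `STOP_WORDS` and `remove_stop_words` are hypothetical names used only here.

```python
# Hypothetical stand-in for stopwords.words('english'); the NLTK English
# list is longer, and its entries are lowercase.
STOP_WORDS = {"the", "is", "a", "for", "its", "and", "each", "with", "that", "are", "to"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Lowercase each token, then drop it if it is a stop word."""
    lowered = (token.lower() for token in tokens)
    return [token for token in lowered if token not in stop_words]

tokens = "The octopus is a fascinating marine creature".split()

# Case-sensitive check: "The" is not in the set, so it survives.
case_sensitive = [t.lower() for t in tokens if t not in STOP_WORDS]
# Lowercase first: "the" is matched and removed.
case_insensitive = remove_stop_words(tokens)

print(case_sensitive)
print(case_insensitive)
```

Lowercasing first is usually the safer order when the stop-word list is lowercase, because sentence-initial words are then filtered like any other.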

In [8]:
sentences

['the octopus fascinate marine creature know intelligence , unique appearance , remarkable abilities .',
 'belong class cephalopoda , also include squids cuttlefish , octopuses highly adaptable find various ocean environments , shallow coastal water deep-sea regions .',
 'key characteristics octopuses include soft , bulbous body , large head , distinctive arm , usually eight number .',
 'each arm line suckers highly sensitive touch taste .',
 'octopuses know exceptional problem-solving abilities , attribute well-developed nervous system complex brain .',
 'one intrigue feature octopuses ability change color texture rapidly , allow blend surround camouflage communicate octopuses .',
 'this camouflage ability achieve specialize pigment cells call chromatophores skin .',
 'octopuses also know intelligence , exhibit advance problem-solving skills , memory , ability learn observation .',
 'they observe use tool , escape predators , even open jar access food .',
 'some species octopuses also

In [15]:
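# Lemmatization (done above) is an optional preprocessing step before vectorization.
from sklearn.feature_extraction.text import CountVectorizer
# binary=True records each n-gram as 1 if present and 0 if absent; binary=False stores the raw count.
# ngram_range=(2,3) builds the vocabulary from bigrams and trigrams only (no single words).
cv = CountVectorizer(binary=True, ngram_range=(2,3))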
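
To make the two parameters concrete, here is a rough pure-Python approximation of what `binary` and `ngram_range=(2,3)` mean. It is a sketch, not scikit-learn's implementation: the helpers `word_ngrams` and `ngram_features` and the sample sentence are assumptions for illustration. The regular expression is the same as `CountVectorizer`'s default `token_pattern`, which keeps runs of two or more word characters.

```python
import re
from collections import Counter

# Same pattern as CountVectorizer's default token_pattern: runs of two or
# more word characters, so punctuation and hyphens act as separators.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def word_ngrams(text, min_n=2, max_n=3):
    """Return all word n-grams of length min_n..max_n as space-joined strings."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

def ngram_features(text, binary=True):
    """Map each n-gram to 1 (binary) or to its number of occurrences."""
    counts = Counter(word_ngrams(text))
    if binary:
        return {gram: 1 for gram in counts}
    return dict(counts)

sentence = "escape predators , escape predators , open jar"
print(ngram_features(sentence, binary=True))
print(ngram_features(sentence, binary=False))
```

With `binary=False` the repeated bigram `escape predators` is counted twice; with `binary=True` it is recorded as 1. The token pattern also explains why `problem-solving` shows up in the vocabulary as `problem solving`.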

In [16]:
# Vectorization: learn the n-gram vocabulary and build the sparse document-term matrix
X = cv.fit_transform(sentences)

In [18]:
print(cv.get_feature_names_out())

['abilities attribute' 'abilities attribute well' 'ability achieve'
 'ability achieve specialize' 'ability change' 'ability change color'
 'ability learn' 'ability learn observation' 'access food'
 'achieve specialize' 'achieve specialize pigment'
 'adaptability intelligence' 'adaptability intelligence unique'
 'adaptable find' 'adaptable find various' 'advance problem'
 'advance problem solving' 'after lay' 'after lay large' 'allow blend'
 'allow blend surround' 'also include' 'also include squids' 'also know'
 'also know intelligence' 'also know unique' 'among scientists'
 'among scientists enthusiasts' 'animals avoid' 'animals avoid predators'
 'appearance remarkable' 'appearance remarkable abilities' 'arm call'
 'arm call hectocotylus' 'arm line' 'arm line suckers' 'arm usually'
 'arm usually eight' 'attribute well' 'attribute well developed'
 'avoid predators' 'awe among' 'awe among scientists' 'behaviors mimicry'
 'behaviors mimicry imitate' 'belong class' 'belong class cephalopo

In [19]:
cv.vocabulary_

{'the octopus': 304,
 'octopus fascinate': 217,
 'fascinate marine': 118,
 'marine creature': 199,
 'creature know': 84,
 'know intelligence': 172,
 'intelligence unique': 155,
 'unique appearance': 319,
 'appearance remarkable': 30,
 'remarkable abilities': 256,
 'the octopus fascinate': 305,
 'octopus fascinate marine': 218,
 'fascinate marine creature': 119,
 'marine creature know': 200,
 'creature know intelligence': 85,
 'know intelligence unique': 174,
 'intelligence unique appearance': 156,
 'unique appearance remarkable': 320,
 'appearance remarkable abilities': 31,
 'belong class': 45,
 'class cephalopoda': 74,
 'cephalopoda also': 66,
 'also include': 21,
 'include squids': 149,
 'squids cuttlefish': 292,
 'cuttlefish octopuses': 88,
 'octopuses highly': 225,
 'highly adaptable': 141,
 'adaptable find': 13,
 'find various': 131,
 'various ocean': 331,
 'ocean environments': 215,
 'environments shallow': 108,
 'shallow coastal': 268,
 'coastal water': 76,
 'water deep': 333,
 

In [20]:
print(sentences[8])

they observe use tool , escape predators , even open jar access food .


In [21]:
X[8].toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
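
A row like the one above is easier to read once the non-zero columns are mapped back to their n-grams. The sketch below shows the idea on a made-up five-entry vocabulary with the same shape as `cv.vocabulary_` (n-gram to column index); the dictionary contents, the sample row, and the helper `active_features` are hypothetical.

```python
# Hypothetical miniature vocabulary in the same shape as cv.vocabulary_:
# each n-gram maps to its column index.
vocabulary = {
    "access food": 0,
    "escape predators": 1,
    "jar access": 2,
    "open jar": 3,
    "use tool": 4,
}

def active_features(row, vocabulary):
    """Return the n-grams whose column in `row` is non-zero, in column order."""
    index_to_term = {index: term for term, index in vocabulary.items()}
    return [index_to_term[i] for i, value in enumerate(row) if value != 0]

row = [1, 0, 1, 1, 0]  # one document's binary bag-of-n-grams vector
print(active_features(row, vocabulary))
```

With the fitted vectorizer from this notebook, the same lookup should be possible by indexing `cv.get_feature_names_out()` with the column indices returned by `X[8].nonzero()[1]`.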