Bag of words

Steps:

Data ----> Clean data ----> Tokenize ----> Build Vocabulary ----> Generate Vectors

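Bag of words (BoW) represents each document as a vector of word counts over a fixed vocabulary, ignoring grammar and word order. It is commonly used in:

1. NLP
2. Information retrieval from documents
3. Document classification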
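
As a minimal sketch of the idea (the two example sentences and the helper name `bag` are illustrative assumptions, not part of the notebook), the counts below show that a bag of words keeps word frequencies but discards word order:

```python
from collections import Counter

def bag(sentence):
  # Lowercase, split on whitespace, and count each word.
  return Counter(sentence.lower().split())

first = bag("the bus was late")
second = bag("late was the bus")

# Same words in a different order: the two bags compare equal.
print(first == second)
print(first["bus"])
```

Because only counts survive, any model built on these vectors cannot tell the two sentences apart.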

Importing the libraries

In [1]:
import numpy
import re

Word extraction, tokenization, and bag of words

In [2]:
def word_extraction(sentence):
  # Stop words to drop. The check runs on the lowercased token so that
  # "The" and "the" are both removed.
  ignore = {"a", "the", "is"}
  # Replace every non-word character with a space, then split on whitespace.
  words = re.sub(r"[^\w]", " ", sentence).split()
  cleaned_text = [w.lower() for w in words if w.lower() not in ignore]
  return cleaned_text

def tokenize(sentences):
  # Gather the tokens of all sentences, then deduplicate and sort them
  # to get a stable vocabulary.
  words = []
  for sentence in sentences:
    words.extend(word_extraction(sentence))
  return sorted(set(words))

def generate_bow(allsentences):
  vocab = tokenize(allsentences)
  print("Word list \n{0}\n".format(vocab))

  for sentence in allsentences:
    words = word_extraction(sentence)
    bag_vector = numpy.zeros(len(vocab))

    # Count how often each vocabulary word occurs in the sentence.
    for w in words:
      for i, word in enumerate(vocab):
        if word == w:
          bag_vector[i] += 1

    print("{0}\n{1}\n".format(sentence, bag_vector))

allsentences = [
  "Joe waited for the train",
  "The train was late",
  "Mary and Samantha took the bus",
  "I looked for Samantha and Mary at the bus station",
  "Mary and Samantha arrived at the bus station early but waited until noon for the bus",
]

generate_bow(allsentences)


Word list 
['and', 'arrived', 'at', 'bus', 'but', 'early', 'for', 'i', 'joe', 'late', 'looked', 'mary', 'noon', 'samantha', 'station', 'took', 'train', 'until', 'waited', 'was']

Joe waited for the train
[0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0.]

The train was late
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1.]

Mary and Samantha took the bus
[1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0.]

I looked for Samantha and Mary at the bus station
[1. 0. 1. 1. 0. 0. 1. 1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0. 0.]

Mary and Samantha arrived at the bus station early but waited until noon for the bus
[1. 1. 1. 2. 1. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 1. 1. 0.]
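
The nested loop above compares every token with every vocabulary entry. A common alternative, sketched here under the assumption that the vocabulary is a list of unique lowercase words (the names `bow_vector` and `index` are hypothetical), looks up each token's column in a dictionary instead:

```python
def bow_vector(tokens, vocab):
  # Map each vocabulary word to its column index once.
  index = {word: i for i, word in enumerate(vocab)}
  vector = [0] * len(vocab)
  for token in tokens:
    # Tokens that are not in the vocabulary are skipped.
    if token in index:
      vector[index[token]] += 1
  return vector

vocab = ["bus", "late", "train", "was"]
print(bow_vector(["train", "was", "late"], vocab))
print(bow_vector(["bus", "bus", "taxi"], vocab))
```

This produces the same counts as the nested loop while doing one dictionary lookup per token, which matters once the vocabulary grows beyond a toy example.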



Vectorization with scikit-learn

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(allsentences)
print("All sentences:\n\n")
for sentence in allsentences:
  print(sentence,"\n\n")
print("Matrix of features for each sentence : \n\n",X.toarray())

All sentences:


Joe waited for the train 


The train was late 


Mary and Samantha took the bus 


I looked for Samantha and Mary at the bus station 


Mary and Samantha arrived at the bus station early but waited until noon for the bus 


Matrix of features for each sentence : 

 [[0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 1 0 1 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 1]
 [1 0 0 1 0 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0]
 [1 0 1 1 0 0 1 0 0 1 1 0 1 1 1 0 0 0 0 0]
 [1 1 1 2 1 1 1 0 0 0 1 1 1 1 2 0 0 1 1 0]]
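
The matrix has 20 columns, with no column for "I" but one for "the". That follows from CountVectorizer's defaults: text is lowercased, tokens are matched with the pattern `(?u)\b\w\w+\b` (two or more word characters), and no stop words are removed unless `stop_words` is set. The sketch below approximates that tokenization with `re` alone to recover the column order. It is an illustration only, and the helper name `default_tokens` is an assumption; in recent scikit-learn versions `vectorizer.get_feature_names_out()` reads the column names directly.

```python
import re

allsentences = [
  "Joe waited for the train",
  "The train was late",
  "Mary and Samantha took the bus",
  "I looked for Samantha and Mary at the bus station",
  "Mary and Samantha arrived at the bus station early but waited until noon for the bus",
]

def default_tokens(text):
  # Approximates CountVectorizer's default analyzer: lowercase the text,
  # then keep runs of two or more word characters.
  return re.findall(r"(?u)\b\w\w+\b", text.lower())

# Columns are the sorted unique tokens across all sentences.
columns = sorted({tok for s in allsentences for tok in default_tokens(s)})
for i, name in enumerate(columns):
  print(i, name)
```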


Bigrams, trigrams, and n-grams

Importing the libraries

In [13]:
import nltk
nltk.download('stopwords')
nltk.download('gutenberg')

from nltk.corpus import stopwords
from collections import Counter

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


Bigrams containing the word "sun"

In [20]:
stops = set(stopwords.words('english'))

# Collect every token from every Gutenberg text into one list.
word_list = []
for f in nltk.corpus.gutenberg.fileids():
  word_list.extend(nltk.corpus.gutenberg.words(f))

# Keep alphanumeric tokens only and lowercase them.
cleaned_words = [w.lower() for w in word_list if w.isalnum()]

# Keep bigrams that contain "sun" and whose other word is not a stop word.
sun_bigrams = [b for b in nltk.bigrams(cleaned_words)
               if (b[0] == 'sun' or b[1] == 'sun')
               and b[0] not in stops and b[1] not in stops]
print("Sets :\n",set(sun_bigrams))
print("\n\nNumber of distinct bigrams containing 'sun':\n",len(set(sun_bigrams)))


Sets :
 {('sun', 'soaked'), ('sun', 'ariseth'), ('ye', 'sun'), ('sun', '12'), ('spangling', 'sun'), ('sun', 'making'), ('sun', '1'), ('blazing', 'sun'), ('sun', 'riseth'), ('lucifer', 'sun'), ('sun', '2'), ('blinding', 'sun'), ('sun', 'shining'), ('sun', 'move'), ('sun', 'moon'), ('sun', 'threw'), ('sun', 'impearls'), ('runaway', 'sun'), ('silent', 'sun'), ('sun', 'toasted'), ('sun', 'shone'), ('sun', 'lit'), ('sun', 'excludes'), ('forenoon', 'sun'), ('great', 'sun'), ('day', 'sun'), ('sun', 'carefully'), ('sun', 'gilds'), ('sun', 'upon'), ('sultry', 'sun'), ('coined', 'sun'), ('haired', 'sun'), ('sheeny', 'sun'), ('rising', 'sun'), ('sun', 'stand'), ('sun', 'gained'), ('sun', 'sinking'), ('sun', 'ahab'), ('sun', 'keep'), ('sun', 'aye'), ('sun', 'go'), ('sun', 'dial'), ('sun', 'rise'), ('sun', 'unto'), ('sun', 'seeking'), ('sun', 'dalroy'), ('sun', 'burnt'), ('sun', 'entering'), ('abated', 'sun'), ('gayety', 'sun'), ('flashing', 'sun'), ('sun', 'measuring'), ('sinking', 'sun'), ('sun',
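
To make the filtering logic explicit without downloading a corpus, here is a small standard-library sketch. The token list and the two-word stop set are made-up assumptions; `zip(tokens, tokens[1:])` yields the same adjacent pairs that `nltk.bigrams` produces.

```python
tokens = ["the", "rising", "sun", "shone", "on", "the", "sun", "dial"]
stops = {"the", "on"}  # tiny stand-in for stopwords.words('english')

# Pair each token with its right-hand neighbour.
pairs = list(zip(tokens, tokens[1:]))

target = "sun"
# Keep pairs that contain the target and no stop word.
target_bigrams = {
  (a, b) for a, b in pairs
  if target in (a, b) and a not in stops and b not in stops
}
print(sorted(target_bigrams))
```

The pair ("the", "sun") is dropped because "the" is a stop word, which mirrors how the cell above discards pairs such as ("the", "sun") from the Gutenberg text.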

N-grams

In [22]:
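# word_tokenize needs the Punkt tokenizer models; newer NLTK releases may
# ask for the 'punkt_tab' resource instead.
nltk.download('punkt')

from nltk.tokenize import word_tokenize
from nltk.util import ngrams

sentences = [
  "To Sherlock Holmes she is always the woman.",
  "I have seldom heard him mention her under any other name.",
]
bigrams = []

# Build bigrams sentence by sentence so no pair spans a sentence boundary.
for sentence in sentences:
  sequence = word_tokenize(sentence)
  bigrams.extend(list(ngrams(sequence,2)))

# FreqDist counts each bigram; MLEProbDist turns the counts into
# relative frequencies (count / total).
freq_dist = nltk.FreqDist(bigrams)
prob_dist = nltk.MLEProbDist(freq_dist)
number_of_bigrams = freq_dist.N()

print("\n\nAll the bigrams:\n\n ",bigrams)
print("\n\n Number of bigrams:\n\n",number_of_bigrams)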

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


All the bigrams:

  [('To', 'Sherlock'), ('Sherlock', 'Holmes'), ('Holmes', 'she'), ('she', 'is'), ('is', 'always'), ('always', 'the'), ('the', 'woman'), ('woman', '.'), ('I', 'have'), ('have', 'seldom'), ('seldom', 'heard'), ('heard', 'him'), ('him', 'mention'), ('mention', 'her'), ('her', 'under'), ('under', 'any'), ('any', 'other'), ('other', 'name'), ('name', '.')]


 Number of bigrams:

 19
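
The cell builds `prob_dist` but never queries it. Under maximum-likelihood estimation the probability of a bigram is its count divided by the total number of bigrams, which is what `prob_dist.prob(bigram)` returns. The sketch below shows the same arithmetic with `collections.Counter` on a made-up token list (an assumption, chosen so that one bigram repeats); `mle_prob` is a hypothetical helper name.

```python
from collections import Counter

tokens = ["the", "bus", "and", "the", "bus", "station"]
bigram_counts = Counter(zip(tokens, tokens[1:]))
total = sum(bigram_counts.values())

def mle_prob(bigram):
  # Relative frequency: count of this bigram over all bigram occurrences.
  return bigram_counts[bigram] / total

print(total)
print(mle_prob(("the", "bus")))
print(mle_prob(("bus", "stop")))
```

Unseen bigrams get probability zero under this estimate, which is why smoothed estimators are usually preferred for language modelling.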


Trigrams and four-grams

In [35]:
from nltk.util import ngrams

text = "I am aware that nltk only offers bigrams and trigrams, but is there a way to split to 4 grams"

# Named `tokens` rather than `tokenize` so the tokenize() function defined
# earlier in the notebook is not overwritten.
tokens = nltk.word_tokenize(text)
print("\n\nNormal Tokenization :\n",tokens)
print("\nTotal number of tokens :\n",len(tokens))

# ngrams() returns a lazy iterator, so each result can be consumed only once.
trigrams = ngrams(tokens,3)
fourgrams = ngrams(tokens,4)

def get_ngrams(n_grams):
  # Join each n-gram tuple into a single space-separated string.
  return [" ".join(grams) for grams in n_grams]

trigram = get_ngrams(trigrams)
print("\n\n Trigrams :\n",trigram)
print("\n Number of trigrams :\n",len(trigram))

fourgram = get_ngrams(fourgrams)
print("\n\n Fourgrams :\n",fourgram)
print("\n Number of fourgrams :\n",len(fourgram))



Normal Tokenization :
 ['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'to', '4', 'grams']

Total number of tokens :
 21


 Trigrams :
 ['I am aware', 'am aware that', 'aware that nltk', 'that nltk only', 'nltk only offers', 'only offers bigrams', 'offers bigrams and', 'bigrams and trigrams', 'and trigrams ,', 'trigrams , but', ', but is', 'but is there', 'is there a', 'there a way', 'a way to', 'way to split', 'to split to', 'split to 4', 'to 4 grams']

 Number of trigrams :
 19


 Fourgrams :
 ['I am aware that', 'am aware that nltk', 'aware that nltk only', 'that nltk only offers', 'nltk only offers bigrams', 'only offers bigrams and', 'offers bigrams and trigrams', 'bigrams and trigrams ,', 'and trigrams , but', 'trigrams , but is', ', but is there', 'but is there a', 'is there a way', 'there a way to', 'a way to split', 'way to split to', 'to split to 4', 'split to 4 grams']

 Number of fourgrams :
 18
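
The counts follow from a sliding window: a sequence of T tokens contains T - n + 1 n-grams (for T >= n), so 21 tokens give 19 trigrams and 18 four-grams. A minimal standard-library sketch of that window, with the hypothetical helper name `sliding_ngrams` and a shortened example sentence:

```python
def sliding_ngrams(tokens, n):
  # One n-gram starts at every position that still has n tokens ahead of it.
  return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "is there a way to split to 4 grams".split()
print(len(tokens))
print(sliding_ngrams(tokens, 4)[:2])
print(len(sliding_ngrams(tokens, 3)), len(sliding_ngrams(tokens, 4)))
```

When n exceeds the number of tokens the range is empty and the function returns an empty list.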