
In this notebook we'll see how scikit-learn's CountVectorizer converts human-readable English text into a numeric representation that a machine learning model can work with.

Let’s begin with a very simple text:

In [1]:
text=["Vishal's family includes 3 kids a dog and 2 cats", 
      "The cats are friendly.The dog is beautiful",
      "The cat is 11 years old",
      "Vishal lives in the United States of America"]


How can we make a machine understand this?

#### **Tokenization**

The first step is to break the text into individual words (tokens). We will use scikit-learn for this.

Import the CountVectorizer class from sklearn.feature_extraction.text, create an instance of it, and fit the instance on the text.

In [7]:
import sklearn
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(text)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

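Notice that lowercase is True by default.

That means CountVectorizer converts all text to lowercase before creating tokens.

Call the **get_feature_names** method to get the list of tokens. (This method was deprecated in scikit-learn 1.0 and removed in 1.2. On newer versions, call get_feature_names_out instead, which returns a NumPy array rather than a list.)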

In [8]:
print(vectorizer.get_feature_names())

['11', 'america', 'and', 'are', 'beautiful', 'cat', 'cats', 'dog', 'family', 'friendly', 'in', 'includes', 'is', 'kids', 'lives', 'of', 'old', 'states', 'the', 'united', 'vishal', 'years']


Note a few things here:

a. As mentioned, all the tokens are lowercase.

b. Single-character tokens such as the numbers 3 and 2 and the word ‘a’ are missing from the list of tokens/features. This happens because the default token_pattern only matches tokens with at least two word characters. If you want to include single-character tokens, you need to change the token_pattern option.

c. Instead of Vishal’s, the token is vishal. Punctuation is treated as a separator between tokens, and the leftover ‘s’ is dropped because it is a single character.

d. United States of America is tokenized as four different words. What if you want the machine to check whether a document mentions United States of America as a country? We will handle this scenario later in this notebook.
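Points a to c can be reproduced with a small regular-expression sketch. It only approximates what CountVectorizer does internally (lowercase the text, then extract tokens with token_pattern); the helper name simple_tokens is made up for this illustration and is not part of scikit-learn.

```python
import re

# Default CountVectorizer pattern: tokens of two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Relaxed pattern: tokens of one or more word characters.
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"

sentence = "Vishal's family includes 3 kids a dog and 2 cats"


def simple_tokens(doc, pattern):
    """Hypothetical helper: lowercase the document, then extract tokens."""
    return re.findall(pattern, doc.lower())


default_tokens = simple_tokens(sentence, DEFAULT_PATTERN)
relaxed_tokens = simple_tokens(sentence, SINGLE_CHAR_PATTERN)

print(default_tokens)
# ['vishal', 'family', 'includes', 'kids', 'dog', 'and', 'cats']
print(relaxed_tokens)
# ['vishal', 's', 'family', 'includes', '3', 'kids', 'a', 'dog', 'and', '2', 'cats']
```

With the relaxed pattern, the single-character tokens ‘s’, ‘3’, ‘a’ and ‘2’ show up, which is the change made in the next cell.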

In [11]:
vectorizer=CountVectorizer(token_pattern=u'(?u)\\b\\w+\\b')
vectorizer.fit(text)
print(vectorizer.get_feature_names())

['11', '2', '3', 'a', 'america', 'and', 'are', 'beautiful', 'cat', 'cats', 'dog', 'family', 'friendly', 'in', 'includes', 'is', 'kids', 'lives', 'of', 'old', 's', 'states', 'the', 'united', 'vishal', 'years']


In a real-world project, where your model has to read several thousand documents to categorize them, or a huge number of emails to filter out spam, common words like ‘is’, ‘the’, ‘a’, ‘an’, ‘and’ and ‘are’ add little value. It is better to leave them out of the vocabulary.

To exclude them, use the stop_words option as shown below.

In [12]:
vectorizer=CountVectorizer(token_pattern=u'(?u)\\b\\w+\\b',stop_words=['is','are','and','in','the'])
vectorizer.fit(text)
print(vectorizer.get_feature_names())

['11', '2', '3', 'a', 'america', 'beautiful', 'cat', 'cats', 'dog', 'family', 'friendly', 'includes', 'kids', 'lives', 'of', 'old', 's', 'states', 'united', 'vishal', 'years']


So far the tokens are still human-readable text, which a model cannot use directly. Each token has to be encoded as a number. CountVectorizer does this encoding while fitting the text. You can see the encoded values in the vocabulary_ attribute.

In [13]:
print(vectorizer.vocabulary_)

{'vishal': 19, 's': 16, 'family': 9, 'includes': 11, '3': 2, 'kids': 12, 'a': 3, 'dog': 8, '2': 1, 'cat': 6, 'cats': 7, 'friendly': 10, 'beautiful': 5, '11': 0, 'years': 20, 'old': 15, 'lives': 13, 'united': 18, 'states': 17, 'of': 14, 'america': 4}


There are twenty-one tokens, encoded with numbers from 0 to 20. The numbers follow the sorted (alphabetical) order of the tokens, not the order in which the words appear in the text.
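The encoding can be imitated in plain Python. The snippet below is a rough sketch, not scikit-learn's actual implementation: it collects the unique tokens, drops the stop words, sorts what is left and numbers the tokens in that order. The names tokens and vocabulary are chosen here for illustration.

```python
import re

text = ["Vishal's family includes 3 kids a dog and 2 cats",
        "The cats are friendly.The dog is beautiful",
        "The cat is 11 years old",
        "Vishal lives in the United States of America"]
stop_words = {"is", "are", "and", "in", "the"}

# Collect every unique token that is not a stop word.
tokens = set()
for doc in text:
    for tok in re.findall(r"(?u)\b\w+\b", doc.lower()):
        if tok not in stop_words:
            tokens.add(tok)

# Number the tokens in sorted order; digits sort before letters.
vocabulary = {tok: idx for idx, tok in enumerate(sorted(tokens))}

print(len(vocabulary))                              # 21
print(vocabulary["11"], vocabulary["vishal"], vocabulary["years"])  # 0 19 20
```

The indices agree with the vocabulary_ output above, for example ‘11’ is 0 and ‘years’ is 20.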

#### **Frequency**

After tokenizing and encoding comes counting how many times each token appears in each document.

Call the transform method of the CountVectorizer instance.

In [14]:
vector = vectorizer.transform(text)
vector

<4x21 sparse matrix of type '<class 'numpy.int64'>'
	with 24 stored elements in Compressed Sparse Row format>

In [15]:
# Check the shape of the vector: (number of documents, number of tokens)
print(vector.shape)

(4, 21)


The text has been encoded into a sparse matrix with one row per document and one column per token. Convert this sparse matrix into a NumPy array by calling the toarray method.


In [16]:
vector.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0]])

There are twenty-four 1s spread across the four rows, one for each token that was counted. Each 1 sits in the column given by the token's encoded value.

To understand this array better, convert it into a pandas DataFrame.


In [17]:
import pandas as pd
pd.DataFrame(vector.toarray(),columns=vectorizer.get_feature_names())

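|   | 11 | 2 | 3 | a | america | beautiful | cat | cats | dog | family | friendly | includes | kids | lives | of | old | s | states | united | vishal | years |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 |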


Now compare the token numbers with the values to see how many times each token appears in each document.

For example, token 0 is ‘11’. It appears once in the third document, so column 0 shows 1 in row 2 of the DataFrame above. In every other row, column 0 contains 0.
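The counting step for a single document can be sketched with collections.Counter. This is only an illustration of the idea, not how scikit-learn builds its sparse matrix; feature_names is copied from the token list printed earlier, and row is a name introduced here.

```python
import re
from collections import Counter

# Token list copied from the get_feature_names() output above.
feature_names = ['11', '2', '3', 'a', 'america', 'beautiful', 'cat', 'cats',
                 'dog', 'family', 'friendly', 'includes', 'kids', 'lives',
                 'of', 'old', 's', 'states', 'united', 'vishal', 'years']

doc = "The cat is 11 years old"  # third document (row 2)

# Count every token in the document. Stop words such as 'the' and 'is' are
# not in feature_names, so they never reach the output row.
counts = Counter(re.findall(r"(?u)\b\w+\b", doc.lower()))

# One entry per feature, in column order; missing tokens count as 0.
row = [counts[name] for name in feature_names]
print(row)
# [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
```

The result matches row 2 of the array: the 1s sit in the columns for ‘11’, ‘cat’, ‘old’ and ‘years’.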

#### **Vocabulary**

Now back to the scenario from point d of the Tokenization section. Suppose you are only interested in whether or not the text contains United States of America.

You can achieve this by using the ngram_range and vocabulary options.

ngram_range sets how many consecutive words are combined into a single token. The default value is (1, 1), which is why every word became a separate token.

Run the code below to see how ngram_range works.


In [20]:
vectorizer=CountVectorizer(token_pattern=u'(?u)\\b\\w+\\b',stop_words=['is','are','and','in','the'],ngram_range=(1,4))  # tuple: newer scikit-learn versions reject a list here
vectorizer.fit(text)
print(vectorizer.get_feature_names())

['11', '11 years', '11 years old', '2', '2 cat', '3', '3 kids', '3 kids a', '3 kids a dog', 'a', 'a dog', 'a dog 2', 'a dog 2 cat', 'america', 'beautiful', 'cat', 'cat 11', 'cat 11 years', 'cat 11 years old', 'cats', 'cats friendly', 'cats friendly dog', 'cats friendly dog beautiful', 'dog', 'dog 2', 'dog 2 cat', 'dog beautiful', 'family', 'family includes', 'family includes 3', 'family includes 3 kids', 'friendly', 'friendly dog', 'friendly dog beautiful', 'includes', 'includes 3', 'includes 3 kids', 'includes 3 kids a', 'kids', 'kids a', 'kids a dog', 'kids a dog 2', 'lives', 'lives united', 'lives united states', 'lives united states of', 'of', 'of america', 'old', 's', 's family', 's family includes', 's family includes 3', 'states', 'states of', 'states of america', 'united', 'united states', 'united states of', 'united states of america', 'vishal', 'vishal lives', 'vishal lives united', 'vishal lives united states', 'vishal s', 'vishal s family', 'vishal s family includes', 'year

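ngram_range is set to 1 to 4, so CountVectorizer treats every sequence of one to four consecutive words as a separate token. The stop words are removed before the n-grams are built, which is why ‘lives united’ appears as a token.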
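How the word n-grams are formed can be sketched in a few lines. The function word_ngrams below is a hypothetical helper written for this notebook, not a scikit-learn API; it takes the tokens of the fourth document after stop-word removal and joins every run of one to four consecutive tokens.

```python
def word_ngrams(tokens, min_n, max_n):
    """Hypothetical helper: join every run of min_n..max_n consecutive tokens."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


# Fourth document after lowercasing and removing the stop words 'in' and 'the'.
tokens = ["vishal", "lives", "united", "states", "of", "america"]
grams = word_ngrams(tokens, 1, 4)

print(len(grams))                           # 18
print("united states of america" in grams)  # True
```

Six tokens give 6 + 5 + 4 + 3 = 18 n-grams, and ‘united states of america’ is one of them.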

If you now add the vocabulary option, CountVectorizer keeps only the tokens you list, which meets the requirement.


In [21]:
vectorizer=CountVectorizer(token_pattern=u'(?u)\\b\\w+\\b',stop_words=['is','are','and','in','the'],ngram_range=(1,4),vocabulary=['united states of america'])  # tuple: newer scikit-learn versions reject a list here
vectorizer.fit(text)
print(vectorizer.get_feature_names())

['united states of america']


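It returns only one token because a fixed vocabulary restricts CountVectorizer to exactly the tokens you supply; fit does not learn anything new from the text. The feature list is therefore the same whether or not the text contains united states of america. To find out which documents contain the phrase, call transform and look at the counts: a document that contains it gets a count of 1 or more, and every other document gets 0. This way you can filter documents based on the content you are looking for.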
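A minimal sketch of checking which documents contain the phrase is shown below. It assumes scikit-learn is installed and passes ngram_range as a tuple; the variable name phrase_counts is introduced here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["Vishal's family includes 3 kids a dog and 2 cats",
        "The cats are friendly.The dog is beautiful",
        "The cat is 11 years old",
        "Vishal lives in the United States of America"]

vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b",
                             stop_words=["is", "are", "and", "in", "the"],
                             ngram_range=(1, 4),
                             vocabulary=["united states of america"])

# One row per document, one column for the single vocabulary entry.
phrase_counts = vectorizer.fit_transform(text).toarray()
print(phrase_counts.ravel().tolist())
# [0, 0, 0, 1]
```

Only the fourth document has a non-zero count, so it is the only one that mentions united states of america.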