# Bag of Words and TF-IDF
Below, we'll look at three useful methods of vectorizing text.
- `CountVectorizer` - Bag of Words
- `TfidfTransformer` - TF-IDF values
- `TfidfVectorizer` - Bag of Words AND TF-IDF values

Let's first use an example from earlier and apply the text processing steps we saw in this lesson.

In [1]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize


nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ziaeeamir\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ziaeeamir\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ziaeeamir\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
corpus = ["The first time you see The Second Renaissance it may look boring.",
        "Look at it at least twice and definitely watch part 2.",
        "It will change your view of the matrix.",
        "Are the human people the ones who started the war?",
        "Is AI a bad thing ?"]

In [3]:
stop_words = stopwords.words('english')
lemmatizer = WordNetLemmatizer()

The function `tokenize` takes in a string of text and applies the following:

- case normalization (convert to all lowercase)
- punctuation removal
- tokenization, lemmatization, and stop word removal using `nltk`



In [4]:
def tokenize(text):
    # normalize case and remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    
    # tokenize text
    tokens = word_tokenize(text)
    
    # lemmatize and remove stop words
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    return tokens

# `CountVectorizer` (Bag of Words)

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

# initialize count vectorizer object
vect = CountVectorizer(tokenizer = tokenize)


In [7]:
# get counts of each token (word) in text data
X = vect.fit_transform(corpus)



In [8]:
# convert sparse matrix to numpy array to view
print(X.toarray())
print(X.toarray().shape)

[[0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 1 1 1 0 0 1 0 0 0 0]
 [1 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1]
 [0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 1 0]
 [0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]]
(5, 25)


In [9]:
# view token vocabulary and counts
vect.vocabulary_

{'first': 6,
 'time': 20,
 'see': 17,
 'second': 16,
 'renaissance': 15,
 'may': 11,
 'look': 9,
 'boring': 3,
 'least': 8,
 'twice': 21,
 'definitely': 5,
 'watch': 24,
 'part': 13,
 '2': 0,
 'change': 4,
 'view': 22,
 'matrix': 10,
 'human': 7,
 'people': 14,
 'one': 12,
 'started': 18,
 'war': 23,
 'ai': 1,
 'bad': 2,
 'thing': 19}
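
To make the link between the count matrix and `vocabulary_` more concrete, here is a minimal pure-Python sketch of the same idea. It assumes the token lists that `tokenize` would produce for this corpus (hard-coded below, inferred from the vocabulary shown above); `tokenized`, `vocabulary`, and `count_row` are illustrative names, not part of scikit-learn.

```python
from collections import Counter

# Assumed output of tokenize() for each document in the corpus
tokenized = [
    ["first", "time", "see", "second", "renaissance", "may", "look", "boring"],
    ["look", "least", "twice", "definitely", "watch", "part", "2"],
    ["change", "view", "matrix"],
    ["human", "people", "one", "started", "war"],
    ["ai", "bad", "thing"],
]

# Map each unique token to a column index, in sorted order
all_terms = sorted({term for doc in tokenized for term in doc})
vocabulary = {term: idx for idx, term in enumerate(all_terms)}


def count_row(tokens, vocabulary):
    """Return one bag-of-words row: the count of each vocabulary term."""
    row = [0] * len(vocabulary)
    for term, n in Counter(tokens).items():
        row[vocabulary[term]] = n
    return row


bow = [count_row(doc, vocabulary) for doc in tokenized]
print(len(bow), len(vocabulary))  # 5 25
print(bow[0])
```

Each row has one slot per vocabulary term, so the column index stored in `vocabulary_` tells you which word a given count belongs to.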

# `TfidfTransformer`

In [10]:
from sklearn.feature_extraction.text import TfidfTransformer

# initialize tf-idf transformer object
transformer = TfidfTransformer(smooth_idf=False)

In [11]:
transformer.fit(X)

TfidfTransformer(norm='l2', smooth_idf=False, sublinear_tf=False,
         use_idf=True)

To get a glimpse of the IDF values, we place them in a pandas DataFrame and sort them in ascending order.

In [12]:
import pandas as pd
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df_idf = pd.DataFrame(transformer.idf_, index=vect.get_feature_names_out(), columns=['idf_weights'])
df_idf.sort_values(by=['idf_weights'])

| | idf_weights |
|---|---|
| look | 1.916291 |
| 2 | 2.609438 |
| view | 2.609438 |
| twice | 2.609438 |
| time | 2.609438 |
| thing | 2.609438 |
| started | 2.609438 |
| see | 2.609438 |
| second | 2.609438 |
| renaissance | 2.609438 |


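The lower the IDF value of a word, the less unique it is to any single document. Here, `look` appears in two of the five documents, so it has the lowest IDF; every other word appears in only one document.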
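
The two distinct values in the table can be reproduced by hand. The sketch below assumes the formula scikit-learn documents for `smooth_idf=False`, namely `idf(t) = ln(n / df(t)) + 1`, where `n` is the number of documents and `df(t)` is the number of documents containing the term; `idf_unsmoothed` is a hypothetical helper name.

```python
import math

n_docs = 5  # documents in the corpus


def idf_unsmoothed(doc_freq, n=n_docs):
    """IDF without smoothing: ln(n / df) + 1."""
    return math.log(n / doc_freq) + 1.0


idf_look = idf_unsmoothed(2)  # "look" occurs in 2 documents
idf_rare = idf_unsmoothed(1)  # every other term occurs in 1 document

print(round(idf_look, 6))  # 1.916291
print(round(idf_rare, 6))  # 2.609438
```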

# Compute the TF-IDF scores for the corpus

In [13]:
# use counts from count vectorizer results to compute tf-idf values
tfidf = transformer.transform(X)

In [14]:
# convert sparse matrix to numpy array to view
tfidf.toarray()

array([[0.        , 0.        , 0.        , 0.36419547, 0.        ,
        0.        , 0.36419547, 0.        , 0.        , 0.26745392,
        0.        , 0.36419547, 0.        , 0.        , 0.        ,
        0.36419547, 0.36419547, 0.36419547, 0.        , 0.        ,
        0.36419547, 0.        , 0.        , 0.        , 0.        ],
       [0.39105193, 0.        , 0.        , 0.        , 0.        ,
        0.39105193, 0.        , 0.        , 0.39105193, 0.28717648,
        0.        , 0.        , 0.        , 0.39105193, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.39105193, 0.        , 0.        , 0.39105193],
       [0.        , 0.        , 0.        , 0.        , 0.57735027,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.57735027, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.57735027, 0.

In [15]:
print(tfidf.toarray().shape)

(5, 25)


Now, let's print the tf-idf values of the first document to see if they make sense. Below, we place the tf-idf scores from the first document into a pandas DataFrame and sort them in descending order.

In [16]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df = pd.DataFrame(tfidf[0].T.todense(), index=vect.get_feature_names_out(), columns=["tfidf"])
df.sort_values(by=["tfidf"],ascending=False)

| | tfidf |
|---|---|
| renaissance | 0.364195 |
| boring | 0.364195 |
| first | 0.364195 |
| time | 0.364195 |
| may | 0.364195 |
| see | 0.364195 |
| second | 0.364195 |
| look | 0.267454 |
| 2 | 0.0 |
| war | 0.0 |


Notice that only certain words have scores. Our first document is "The first time you see The Second Renaissance it may look boring.", so the words from this document have a tf-idf score and every other word in the vocabulary shows up as zero.

Notice that the word "it" is missing from this list. This is because `tokenize` removes stop words, and "it" (like "the" and "you") is in NLTK's English stop word list.

The scores above make sense. The more common a word is across documents, the lower its score; the more unique a word is to our first document, the higher its score. `look` also appears in the second document, so it scores lower than the other words.
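
The scores for the first document can be checked by hand. This is a minimal sketch, assuming the unsmoothed IDF formula `ln(n / df) + 1` and the L2 normalization shown in the transformer's settings (`norm='l2'`); the variable names are illustrative.

```python
import math

# Terms of the first document after tokenize(); each occurs once (tf = 1)
doc_terms = ["first", "time", "see", "second", "renaissance", "may", "look", "boring"]

# Document frequencies: "look" is in 2 of 5 documents, the rest in 1
doc_freq = {term: 1 for term in doc_terms}
doc_freq["look"] = 2

n_docs = 5
raw = {t: 1 * (math.log(n_docs / doc_freq[t]) + 1) for t in doc_terms}

# L2-normalize so the document vector has unit length
norm = math.sqrt(sum(v * v for v in raw.values()))
scores = {t: v / norm for t, v in raw.items()}

print(round(scores["boring"], 6))  # 0.364195
print(round(scores["look"], 6))    # 0.267454
```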

# `TfidfVectorizer`
`TfidfVectorizer` = `CountVectorizer` + `TfidfTransformer`

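Now, we are going to use the same 5 documents from above to do what we did with `TfidfTransformer`, which is to get the tf-idf scores of a set of documents. Notice how much shorter this is. Note also that `TfidfVectorizer()` is created with default settings here (scikit-learn's built-in tokenization and `smooth_idf=True`, with no custom `tokenize` function), so the vocabulary and values below differ from the `TfidfTransformer` results above.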

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

# initialize tf-idf vectorizer object
vectorizer = TfidfVectorizer()

In [18]:
# compute bag of word counts and tf-idf values
X = vectorizer.fit_transform(corpus)

In [19]:
# convert sparse matrix to numpy array to view
X.toarray()[0]

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.30298183, 0.        , 0.        , 0.30298183, 0.        ,
       0.        , 0.20291046, 0.        , 0.24444384, 0.        ,
       0.30298183, 0.        , 0.        , 0.        , 0.        ,
       0.30298183, 0.30298183, 0.30298183, 0.        , 0.40582093,
       0.        , 0.30298183, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.30298183, 0.        ])
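
This row has 35 entries rather than 25. The sketch below tries to reproduce it in pure Python under two assumptions about the defaults: tokens are lowercased and matched with scikit-learn's default `token_pattern` (two or more word characters, no stop word removal), and IDF is smoothed as `ln((1 + n) / (1 + df)) + 1`. The variable names are illustrative.

```python
import math
import re
from collections import Counter

corpus = ["The first time you see The Second Renaissance it may look boring.",
          "Look at it at least twice and definitely watch part 2.",
          "It will change your view of the matrix.",
          "Are the human people the ones who started the war?",
          "Is AI a bad thing ?"]

# Default token pattern: keep tokens of two or more word characters
token_pattern = re.compile(r"(?u)\b\w\w+\b")
docs = [token_pattern.findall(text.lower()) for text in corpus]

vocab = sorted({tok for doc in docs for tok in doc})
n_docs = len(docs)

# Document frequency: number of documents containing each token
df = Counter(tok for doc in docs for tok in set(doc))
idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

# tf-idf for the first document, then L2 normalization
tf = Counter(docs[0])
raw = [tf[tok] * idf[tok] for tok in vocab]
norm = math.sqrt(sum(v * v for v in raw))
first_row = [v / norm for v in raw]

print(len(vocab))  # 35
print(first_row[vocab.index("the")], first_row[vocab.index("look")])
```

Because stop words are kept, "the" (which occurs twice in the first document) ends up with the largest weight in this row.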

## `TfidfTransformer` vs. `TfidfVectorizer`

In summary, the main differences between the two classes are as follows:

* With `TfidfTransformer`, you first compute word counts using `CountVectorizer`, then compute the Inverse Document Frequency (IDF) values, and only then compute the tf-idf scores.

* With `TfidfVectorizer`, by contrast, you do all three steps at once. Under the hood, it computes the word counts, IDF values, and tf-idf scores all using the same dataset.
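
One way to check this equivalence is to run both paths with the same settings and compare the results. This is a sketch that assumes scikit-learn and NumPy are installed; it uses the default settings on both sides rather than the custom `tokenize` function, and `two_step` and `one_step` are illustrative names.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

corpus = ["The first time you see The Second Renaissance it may look boring.",
          "Look at it at least twice and definitely watch part 2.",
          "It will change your view of the matrix.",
          "Are the human people the ones who started the war?",
          "Is AI a bad thing ?"]

# Two-step path: counts first, then tf-idf weighting
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# One-step path: counts and tf-idf weighting together
one_step = TfidfVectorizer().fit_transform(corpus)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)  # True
```

The two-step path is handy when you also need the raw counts; the one-step path is more convenient when only the tf-idf matrix matters.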