###### TERM FREQUENCY–INVERSE DOCUMENT FREQUENCY (tf-idf)
# Introduction

It’s a dark night in the middle of winter as you make your way through another of Emily Dickinson’s poems. As you grapple with questions of immortality and death, you notice the word choice in each poem you read. With each passing poem, you discover for yourself which words are common throughout her work, and which indicate more unique meaning in individual poems.

You might not even realize it, but you are building a language model in your head similar to term frequency-inverse document frequency, commonly known as tf-idf. Tf-idf is another powerful tool in your NLP toolkit, with a variety of use cases including:

- ranking results in a search engine
- text summarization
- building smarter chatbots

See also the [Scikit-Learn site](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#from-occurrences-to-frequencies).

# What is Tf-idf?

Term frequency-inverse document frequency is a numerical statistic used to indicate how important a word is to each document in a collection of documents, or a corpus.

When applying tf-idf to a corpus, each word is given a tf-idf score for each document, representing the relevance of that word to the particular document. A higher tf-idf score indicates a term is more important to the corresponding document.

Tf-idf has many similarities with the bag-of-words language model, which, if you recall, is concerned with word count: how many times each word appears in a document.

While tf-idf can be used in any situation where bag-of-words can be used, there is a key difference in how it is calculated.

Tf-idf relies on two different metrics in order to come up with an overall score:

- term frequency, or how often a word appears in a document. This is the same as bag-of-words’ word count.
- inverse document frequency, which is a measure of how rare a word is across the overall corpus: the more documents a word appears in, the lower its value. By penalizing the score of words that appear throughout a corpus, tf-idf can give better insight into how important a word is to a particular document of a corpus.

In [8]:
from modules.preprocessing import preprocess_text
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# sample documents
document_1 = "All you need is love"
document_2 = "Love is all you need"
document_3 = "I love jelly and ice cream"

# corpus of documents
corpus = [document_1, document_2, document_3]

# preprocess documents
processed_corpus = [preprocess_text(doc) for doc in corpus]

# initialize and fit TfidfVectorizer
vectorizer = TfidfVectorizer(norm=None)
tf_idf_scores = vectorizer.fit_transform(processed_corpus)

# get vocabulary of terms
feature_names = vectorizer.get_feature_names_out()
corpus_index = [n for n in processed_corpus]

# create pandas DataFrame with tf-idf scores
df_tf_idf = pd.DataFrame(tf_idf_scores.T.todense(), index=feature_names, columns=corpus_index)
print(df_tf_idf)

       all you need be love  love be all you need  i love jelly and ice cream
all                1.287682              1.287682                    0.000000
and                0.000000              0.000000                    1.693147
be                 1.287682              1.287682                    0.000000
cream              0.000000              0.000000                    1.693147
ice                0.000000              0.000000                    1.693147
jelly              0.000000              0.000000                    1.693147
love               1.000000              1.000000                    1.000000
need               1.287682              1.287682                    0.000000
you                1.287682              1.287682                    0.000000


# Breaking It Down Part I: Term Frequency

The first component of tf-idf is term frequency, or how often a word appears in a document within the corpus.

The value for the term frequency is the same as if applying the bag-of-words language model to a document. If you have previously studied bag-of-words, this will all be familiar! If not, have no fear.

Term frequency indicates how often each word appears in the document. The intuition for including term frequency in the tf-idf calculation is that the more frequently a word appears in a single document, the more important that term is to the document.

Consider the stanza from Emily Dickinson’s poem I’m Nobody! Who are you? below:

``stanza = '''I'm nobody! Who are you?
Are you nobody, too?
Then there's a pair of us — don't tell!
They'd banish us, you know.'''``

The term frequency for “you” is 3, “nobody” is 2, “are” is 2, “us” is 2, and the rest of the terms have a frequency of 1. We can get a general sense of what this stanza is about by the most frequently used words.
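As a rough check on those counts, the sketch below tallies the stanza's words using only Python's standard library. The regular expression is an assumption chosen to mimic a simple word tokenizer (lowercased runs of two or more word characters), so contractions such as "I'm" are split and their single-letter pieces dropped; a different tokenizer would give slightly different counts.

```python
import re
from collections import Counter

stanza = '''I'm nobody! Who are you?
Are you nobody, too?
Then there's a pair of us - don't tell!
They'd banish us, you know.'''

# Assumed tokenizer: lowercase the text and keep runs of 2+ word characters.
tokens = re.findall(r"\b\w\w+\b", stanza.lower())

# Term frequency is simply the number of times each token occurs.
term_frequency = Counter(tokens)

# Show the four most frequent terms with their counts.
for term, count in term_frequency.most_common(4):
    print(term, count)
```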

Term frequency can be calculated in Python using scikit-learn’s CountVectorizer, as shown below:

``vectorizer = CountVectorizer()``
 
``term_frequencies = vectorizer.fit_transform([stanza])``

- A CountVectorizer object is initialized
- The CountVectorizer object is fit (trained) and transformed (applied) on the corpus of data, returning the term frequencies for each term-document pair

In [10]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from modules.preprocessing import preprocess_text

poem = '''
Success is counted sweetest
By those who ne'er succeed.
To comprehend a nectar
Requires sorest need.

Not one of all the purple host
Who took the flag to-day
Can tell the definition,
So clear, of victory,

As he, defeated, dying,
On whose forbidden ear
The distant strains of triumph
Break, agonized and clear!'''

In [6]:
# preprocess text
processed_poem = preprocess_text(poem)

# initialize and fit CountVectorizer
vectorizer = CountVectorizer()
term_frequencies = vectorizer.fit_transform([processed_poem])

# get vocabulary of terms
feature_names = vectorizer.get_feature_names_out()

# create pandas DataFrame with term frequencies
df_term_frequencies = pd.DataFrame(term_frequencies.T.todense(), index=feature_names, columns=['Term Frequency'])
print(df_term_frequencies)

            Term Frequency
agonize                  1
all                      1
and                      1
be                       1
break                    1
by                       1
can                      1
clear                    2
comprehend               1
count                    1
day                      1
defeat                   1
definition               1
die                      1
distant                  1
ear                      1
er                       1
flag                     1
forbid                   1
he                       1
host                     1
ne                       1
nectar                   1
need                     1
not                      1
of                       3
on                       1
one                      1
purple                   1
require                  1
so                       1
sorest                   1
strain                   1
succeed                  1
success                  1
sweet                    1
...

# Breaking It Down Part II: Inverse Document Frequency

The inverse document frequency component of the tf-idf score penalizes terms that appear more frequently across a corpus. The intuition is that words that appear more frequently in the corpus give less insight into the topic or meaning of an individual document, and should thus be deprioritized.

For example, terms like “the” or “go” are used all over the place, so in a bag-of-words model, they would be given priority even though they don’t provide much meaning; tf-idf would deprioritize these sorts of common words.

We can calculate the inverse document frequency for some term t across a corpus using the below equation.

$$\log\left(\frac{\textrm{Total number of documents}}{\textrm{Number of documents with term } t}\right)$$

The important takeaway from the equation is that as the number of documents with the term t increases, the inverse document frequency decreases (due to the nature of the log function). The more frequently a term appears across the corpus, the less important it becomes to an individual document.
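To make the equation concrete, here is a small sketch that evaluates it for a six-document corpus, assuming the natural logarithm. The function names are illustrative and not from any library. The second function is worth flagging: scikit-learn's default setting (``smooth_idf=True``) uses a smoothed variant, ln((1 + N) / (1 + df)) + 1, so the values it reports differ from the plain equation above.

```python
import math

def idf_plain(n_documents, n_documents_with_term):
    # The equation above: log(N / df), using the natural log.
    return math.log(n_documents / n_documents_with_term)

def idf_smoothed(n_documents, n_documents_with_term):
    # Smoothed variant (scikit-learn's default): ln((1 + N) / (1 + df)) + 1.
    return math.log((1 + n_documents) / (1 + n_documents_with_term)) + 1

# A term found in 1, 3, or all 6 documents of a six-document corpus.
for df in (1, 3, 6):
    print(df, round(idf_plain(6, df), 6), round(idf_smoothed(6, df), 6))
```

Both versions shrink as a term appears in more documents. With the plain equation, a term present in every document scores 0; the smoothed version bottoms out at 1, so such terms are down-weighted rather than erased.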

Inverse document frequency can be calculated on a group of documents using scikit-learn’s TfidfTransformer:

``transformer = TfidfTransformer(norm=None)
transformer.fit(term_frequencies)
inverse_doc_frequency = transformer.idf_``

- a ``TfidfTransformer`` object is initialized. Don’t worry about the ``norm=None`` keyword argument for now; we will dig into it in the next exercise
- the ``TfidfTransformer`` is fit (trained) on a term-document matrix of term frequencies
- the ``.idf_`` attribute of the ``TfidfTransformer`` stores the inverse document frequencies of the terms as a NumPy array

In [12]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from modules.term_frequency import term_frequencies, feature_names, df_term_frequencies

In [13]:
# display term-document matrix of term frequencies
print(df_term_frequencies)

         Poem 1  Poem 2  Poem 3  Poem 4  Poem 5  Poem 6
abash         0       0       0       0       1       0
across        0       0       0       1       0       0
admire        0       0       1       0       0       0
again         0       0       0       1       0       0
agonize       1       0       0       0       0       0
...         ...     ...     ...     ...     ...     ...
word          0       0       0       0       1       0
wreck         0       0       0       1       0       0
yet           0       0       0       0       1       0
you           0       0       3       0       0       0
your          0       0       1       0       0       0

[173 rows x 6 columns]


In [18]:
# initialize and fit TfidfTransformer
transformer = TfidfTransformer(norm=None)
transformer.fit(term_frequencies)
idf_values = transformer.idf_

In [19]:
# create pandas DataFrame with inverse document frequencies
df_idf = pd.DataFrame(idf_values, index=feature_names, columns=['Inverse Document Frequency'])
print(df_idf)

         Inverse Document Frequency
abash                      2.252763
across                     2.252763
admire                     2.252763
again                      2.252763
agonize                    2.252763
...                             ...
word                       2.252763
wreck                      2.252763
yet                        2.252763
you                        2.252763
your                       2.252763

[173 rows x 1 columns]


# Putting It All Together: Tf-idf

Now that we understand how term frequency and inverse document frequency are calculated, let’s put it all together to calculate tf-idf!

Tf-idf scores are calculated on a term-document basis. That means there is a tf-idf score for each word, for each document. The tf-idf score for some term t in a document d in some corpus is calculated as follows:

$$\texttt{tfidf}(t,d)=\texttt{tf}(t,d) \times \texttt{idf}(t,\textrm{corpus})$$

- ``tf(t,d)`` is the term frequency of term ``t`` in document ``d``
- ``idf(t,corpus)`` is the inverse document frequency of a term ``t`` across ``corpus``
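As a sketch of the formula, the snippet below computes tf-idf by hand for the three preprocessed song documents from the introduction. It assumes the smoothed inverse document frequency that scikit-learn uses by default, ln((1 + N) / (1 + df)) + 1, and a simple tokenizer that keeps runs of two or more word characters; the helper name ``tf_idf`` is made up for illustration.

```python
import math
import re
from collections import Counter

corpus = [
    "all you need be love",
    "love be all you need",
    "i love jelly and ice cream",
]

# Assumed tokenizer: runs of 2+ word characters, so the lone "i" is dropped.
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]

# Document frequency: in how many documents does each term occur?
document_frequency = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(term, tokens):
    # tf(t, d): raw count of the term in this document.
    tf = tokens.count(term)
    # idf(t, corpus): smoothed variant, ln((1 + N) / (1 + df)) + 1.
    idf = math.log((1 + len(corpus)) / (1 + document_frequency[term])) + 1
    return tf * idf

# Print one row of scores (one value per document) for a few terms.
for term in ("love", "all", "jelly"):
    print(term, [round(tf_idf(term, tokens), 6) for tokens in tokenized])
```

Because "love" occurs in every document, its score stays at the minimum for a term that appears once, while "jelly", which occurs in only one document, scores highest there.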

We can easily calculate the tf-idf values for each term-document pair in our corpus using scikit-learn’s ``TfidfVectorizer``:

``vectorizer = TfidfVectorizer(norm=None)
tfidf_scores = vectorizer.fit_transform(corpus)``

- a ``TfidfVectorizer`` object is initialized. The ``norm=None`` keyword argument turns off the normalization that scikit-learn applies by default (``norm='l2'``), so each score is exactly the product of term frequency and inverse document frequency
- the ``TfidfVectorizer`` object is fit and transformed on the corpus of data, returning the tf-idf scores for each term-document pair
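For context on ``norm=None``: by default ``TfidfVectorizer`` uses ``norm='l2'``, which rescales each document's vector of scores to unit Euclidean length. The sketch below applies that rescaling by hand to one document's raw scores; the values and the helper ``l2_normalize`` are illustrative and not part of scikit-learn.

```python
import math

# Raw tf-idf scores for one document (illustrative values).
raw_scores = {"love": 1.0, "jelly": 1.693147, "ice": 1.693147}

def l2_normalize(scores):
    # Divide every score by the Euclidean length of the whole vector.
    length = math.sqrt(sum(value ** 2 for value in scores.values()))
    return {term: value / length for term, value in scores.items()}

normalized = l2_normalize(raw_scores)
print(normalized)
```

After normalization the relative ordering of terms within a document is unchanged, but the scores no longer equal term frequency times inverse document frequency, which is why the examples here switch normalization off.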

In [23]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from modules.preprocessing import preprocess_text
from modules.poems import poems

In [29]:
# preprocess documents
processed_poems = [preprocess_text(poem) for poem in poems]

# initialize and fit TfidfVectorizer
vectorizer = TfidfVectorizer(norm=None)
tfidf_scores = vectorizer.fit_transform(processed_poems)

# get vocabulary of terms
feature_names = vectorizer.get_feature_names_out()

# get corpus index
corpus_index = [f"Poem {i+1}" for i in range(len(poems))]

In [30]:
# create pandas DataFrame with tf-idf scores
df_tf_idf = pd.DataFrame(tfidf_scores.T.todense(), index=feature_names, columns=corpus_index)
print(df_tf_idf)

           Poem 1  Poem 2    Poem 3    Poem 4    Poem 5  Poem 6
abash    0.000000     0.0  0.000000  0.000000  2.252763     0.0
across   0.000000     0.0  0.000000  2.252763  0.000000     0.0
admire   0.000000     0.0  2.252763  0.000000  0.000000     0.0
again    0.000000     0.0  0.000000  2.252763  0.000000     0.0
agonize  2.252763     0.0  0.000000  0.000000  0.000000     0.0
...           ...     ...       ...       ...       ...     ...
word     0.000000     0.0  0.000000  0.000000  2.252763     0.0
wreck    0.000000     0.0  0.000000  2.252763  0.000000     0.0
yet      0.000000     0.0  0.000000  0.000000  2.252763     0.0
you      0.000000     0.0  6.758289  0.000000  0.000000     0.0
your     0.000000     0.0  2.252763  0.000000  0.000000     0.0

[173 rows x 6 columns]


# Converting Bag-of-Words to Tf-idf

In addition to directly calculating the tf-idf scores for a set of terms across a corpus, you can also convert a bag-of-words model you have already created into tf-idf scores.

Scikit-learn’s TfidfTransformer is up to the task of converting your bag-of-words model to tf-idf. You begin by initializing a TfidfTransformer object.

``tfidf_transformer = TfidfTransformer(norm=None)``

Given a bag-of-words matrix ``count_matrix``, you can now multiply the term frequencies by their inverse document frequency to get the tf-idf scores as follows:

``tf_idf_scores = tfidf_transformer.fit_transform(count_matrix)``

This is very similar to how we calculated inverse document frequency, except this time we call ``fit_transform`` on the term frequencies (the bag-of-words vectors) rather than only ``fit``: the transformer both learns the inverse document frequencies and applies them.
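Conceptually, the conversion scales each column of the count matrix by that term's inverse document frequency. Below is a minimal pure-Python sketch of that idea, assuming the smoothed idf that scikit-learn applies by default and no normalization. The ``count_matrix`` here is a small hand-written example (rows are documents, columns are terms), not the output of a real vectorizer.

```python
import math

terms = ["all", "jelly", "love"]
# Rows are documents, columns follow the order of `terms`.
count_matrix = [
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 1],
]

n_documents = len(count_matrix)

# Document frequency per column: number of rows with a nonzero count.
document_frequency = [
    sum(1 for row in count_matrix if row[j] > 0) for j in range(len(terms))
]

# Smoothed idf per column: ln((1 + N) / (1 + df)) + 1.
idf = [math.log((1 + n_documents) / (1 + df)) + 1 for df in document_frequency]

# Scale every count by its column's idf to get tf-idf scores.
tf_idf_scores = [
    [count * weight for count, weight in zip(row, idf)] for row in count_matrix
]

for row in tf_idf_scores:
    print([round(score, 6) for score in row])
```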

# Review

Let’s recount all you have learned:

- Term frequency-inverse document frequency, known as tf-idf, is a numerical statistic used to indicate how important a word is to each document in a collection of documents
- tf-idf consists of two components, term frequency and inverse document frequency
- term frequency is how often a word appears in a document. This is the same as bag-of-words’ word count
- inverse document frequency is a measure of how rare a word is across the documents of a corpus; it decreases as a word appears in more documents
- tf-idf is calculated as the term frequency multiplied by the inverse document frequency
- term frequency, inverse document frequency, and tf-idf can be calculated in scikit-learn using the CountVectorizer, TfidfTransformer, and TfidfVectorizer objects, respectively

# Raven example

In [36]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from modules.raven import the_raven_stanzas
from modules.preprocessing import preprocess_text

In [37]:
# view first stanza
print(the_raven_stanzas[0])


Once upon a midnight dreary, while I pondered, weak and weary,
 Over many a quaint and curious volume of forgotten lore,
 While I nodded, nearly napping, suddenly there came a tapping,
 As of some one gently rapping, rapping at my chamber door


In [38]:
# preprocess documents
processed_stanzas = [preprocess_text(stanza) for stanza in the_raven_stanzas]

# initialize and fit TfidfVectorizer
vectorizer = TfidfVectorizer(norm=None)
tfidf_scores = vectorizer.fit_transform(processed_stanzas)


# get vocabulary of terms
feature_names = vectorizer.get_feature_names_out()

# get stanza index
stanza_index = [f"Stanza {i+1}" for i in range(len(the_raven_stanzas))]

# create pandas DataFrame with tf-idf scores
df_tf_idf = pd.DataFrame(tfidf_scores.T.todense(), index=feature_names, columns=stanza_index)
print(df_tf_idf)

        Stanza 1  Stanza 2  Stanza 3  Stanza 4  Stanza 5   Stanza 6  Stanza 7  \
above        0.0       0.0  0.000000       0.0       0.0   0.000000       0.0   
adore        0.0       0.0  0.000000       0.0       0.0   0.000000       0.0   
again        0.0       0.0  0.000000       0.0       0.0   0.000000       0.0   
agree        0.0       0.0  0.000000       0.0       0.0   0.000000       0.0   
ah           0.0       0.0  3.079442       0.0       0.0   0.000000       0.0   
...          ...       ...       ...       ...       ...        ...       ...   
wretch       0.0       0.0  0.000000       0.0       0.0   0.000000       0.0   
yet          0.0       0.0  0.000000       0.0       0.0   0.000000       0.0   
yore         0.0       0.0  0.000000       0.0       0.0   0.000000       0.0   
you          0.0       0.0  0.000000       0.0       0.0  10.454720       0.0   
your         0.0       0.0  0.000000       0.0       0.0   3.484907       0.0   

        Stanza 8  Stanza 9 