# Representing texts with TF-IDF

We can go one step further and use the TF-IDF algorithm to weight words and n-grams in
incoming documents. TF-IDF stands for term frequency-inverse document frequency.
It gives more weight to words that are characteristic of a particular document than to
words that appear frequently across most documents.

In this recipe, we will use a different type of vectorizer that can apply the TF-IDF
algorithm to the input text. Like the CountVectorizer class, it has an analyzer that we
will use to show the representations of new sentences.

# Getting ready
We will be using the TfidfVectorizer class from the sklearn package.

# How to do it…
The TfidfVectorizer class allows for all the functionality of CountVectorizer,
except that it uses the TF-IDF algorithm to weight the words instead of using raw counts.
The other features of the class should be familiar.

IMPORT LIBRARIES

In [1]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import string

In [2]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

READ IN THE BOOK TEXT (corpus)

In [4]:
nltk.download("punkt_tab")

filename = "/content/002_Sign_of_Four.txt"
# Use a context manager so the file is closed after reading
with open(filename, "r", encoding="utf-8") as file:
    text = file.read()

text = text.replace("\n", " ")

# Divide the text into sentences. sent_tokenize uses the Punkt sentence
# tokenizer, which reads the punkt_tab resource in recent NLTK releases.
sentences = nltk.sent_tokenize(text)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


INITIALIZE THE Snowball Stemmer (english)

In [5]:
stemmer = SnowballStemmer("english")

GET THE SET OF ENGLISH stopwords

In [6]:
english_stopwords = set(stopwords.words("english"))

STEMMING AND TOKENIZATION

In [7]:
def stemmer_tokenizer(text):
  text = text.translate(str.maketrans("", "", string.punctuation)).lower() # removes punctuation and converts to lowercase

  words = re.findall(r"\b\w+\b", text) # tokenization

  stemmed_words = [
      stemmer.stem(word)
      for word in words if word not in english_stopwords
  ]
  return stemmed_words
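
One side effect of deleting punctuation outright is worth noting: words joined only by a dash or a similar character are fused into a single token, which is likely where features such as `youwould` and `1878near` in the output further down come from. The sketch below, using only the standard library, compares deletion with a possible alternative that replaces each punctuation character with a space. The helper names `delete_punctuation` and `space_punctuation` and the sample sentence are illustrative assumptions, not part of the recipe, and switching approaches would change the feature count reported later.

```python
import re
import string

def delete_punctuation(text):
    # Same approach as stemmer_tokenizer: punctuation characters are removed outright.
    return text.translate(str.maketrans("", "", string.punctuation)).lower()

def space_punctuation(text):
    # Alternative: map every punctuation character to a space so neighbours stay separate.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table).lower()

sample = "I wish you--would tell me."  # made-up sentence for illustration
deleted_tokens = re.findall(r"\b\w+\b", delete_punctuation(sample))
spaced_tokens = re.findall(r"\b\w+\b", space_punctuation(sample))

print(deleted_tokens)  # "you" and "would" end up fused
print(spaced_tokens)   # "you" and "would" stay separate
```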

INITIALIZE THE VECTORIZER

In [8]:
tfidf_vectorizer = TfidfVectorizer(
    tokenizer=stemmer_tokenizer,
    analyzer="word",
    token_pattern=None,  # unused when a custom tokenizer is supplied; None avoids a warning
)

FIT_TRANSFORM THE corpus

In [9]:
x_tfidf_sparse = tfidf_vectorizer.fit_transform(sentences)
x_tfidf_dense = x_tfidf_sparse.toarray()
features = tfidf_vectorizer.get_feature_names_out()

print(f"Total Features: {len(features)}")
print(f"Features: {features}")



Total Features: 4168
Features: ['10the' '11the' '12the' ... 'youwould' 'zigzag' 'zum']


RESULTS

In [10]:
data_tfidf = pd.DataFrame(x_tfidf_dense,
                          columns=features,
                          index=[f"Document {i+1}" for i in range(len(sentences))])

print("TF_IDF Weight Matrix")
print(data_tfidf)

TF_IDF Weight Matrix
               10the  11the  12the  1857  1871  1878  1878near  1882  1882an  \
Document 1       0.0    0.0    0.0   0.0   0.0   0.0       0.0   0.0     0.0   
Document 2       0.0    0.0    0.0   0.0   0.0   0.0       0.0   0.0     0.0   
Document 3       0.0    0.0    0.0   0.0   0.0   0.0       0.0   0.0     0.0   
Document 4       0.0    0.0    0.0   0.0   0.0   0.0       0.0   0.0     0.0   
Document 5       0.0    0.0    0.0   0.0   0.0   0.0       0.0   0.0     0.0   
...              ...    ...    ...   ...   ...   ...       ...   ...     ...   
Document 2921    0.0    0.0    0.0   0.0   0.0   0.0       0.0   0.0     0.0   
Document 2922    0.0    0.0    0.0   0.0   0.0   0.0       0.0   0.0     0.0   
Document 2923    0.0    0.0    0.0   0.0   0.0   0.0       0.0   0.0     0.0   
Document 2924    0.0    0.0    0.0   0.0   0.0   0.0       0.0   0.0     0.0   
Document 2925    0.0    0.0    0.0   0.0   0.0   0.0       0.0   0.0     0.0   

                  

# How it works…
The TfidfVectorizer class works almost exactly like the CountVectorizer class,
differing only in the way the word weights are calculated, so most of the steps should
be familiar here. For each word, the overall weight is the product of the term frequency
and the inverse document frequency. Term frequency is the number of times the word occurs
in the document. Inverse document frequency is based on the total number of documents
divided by the number of documents where the word occurs. This ratio is usually
logarithmically scaled; with its default settings, scikit-learn also smooths the ratio by
adding one to the numerator and the denominator, adds one to the resulting logarithm, and
normalizes each document vector to unit length.
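
To make the calculation concrete, here is a minimal sketch that computes the weights by hand using only the standard library. It assumes the smoothed variant that scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document. The toy documents and the helper names `smoothed_idf` and `tfidf_row` are illustrative and not part of the recipe.

```python
import math
from collections import Counter

# Toy documents, already tokenized; stand-ins for the book sentences.
docs = [
    ["holmes", "lit", "pipe"],
    ["watson", "watched", "holmes"],
    ["pipe", "old", "pipe"],
]

n_docs = len(docs)
# Document frequency: number of documents that contain each term at least once.
doc_freq = Counter(term for doc in docs for term in set(doc))

def smoothed_idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_row(doc):
    counts = Counter(doc)  # raw term frequency within this document
    raw = {term: count * smoothed_idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))  # L2 norm of the row
    return {term: w / norm for term, w in raw.items()}

first_row = tfidf_row(docs[0])
for term, weight in sorted(first_row.items()):
    print(f"{term}: {weight:.3f}")
```

In the first document, "lit" occurs in no other document and therefore receives a higher weight than "holmes" or "pipe", each of which appears in two of the three documents.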