### Description

This notebook shows a method of word representation for NLP-related problems and data analysis called **TF-IDF**, which is short for term frequency–inverse document frequency.

It improves on the **Bag of Words** model, which treats every word equally. **TF-IDF** is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.

**TF-IDF** equation:

$$ tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} $$ 

$$ idf_i = \log \frac{N}{df_i} $$

$$ w_{i,j} = tf_{i,j} \times \log \frac{N}{df_i} $$

where:
- $tf$ - term frequency
- $idf$ - inverse document frequency
- $w$ - TF-IDF weight

- $n_{i,j}$ - number of occurrences of term $i$ in document $j$
- $i$ - index of term
- $j$ - index of document
- $k$ - index running over all terms in document $j$
- $N$ - corpus length (number of documents)
- $df_i$ - number of documents containing term $i$
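
To make the formulas concrete, here is a minimal sketch in plain Python that evaluates them on a hypothetical three-document corpus of already-tokenized words. The toy corpus and the helper names `term_frequency`, `inverse_document_frequency` and `tf_idf` are illustrative assumptions and are not part of this notebook. The sketch uses the natural logarithm and assumes every queried term occurs in at least one document.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized toy corpus (not the notebook's corpus).
toy_corpus = [
    ["dog", "bark", "morn"],
    ["dog", "bark", "car"],
    ["cat", "sleep", "cat"],
]

def term_frequency(term, document):
    """tf_{i,j}: occurrences of the term divided by the number of tokens in the document."""
    return Counter(document)[term] / len(document)

def inverse_document_frequency(term, corpus):
    """idf_i = log(N / df_i); assumes the term occurs in at least one document."""
    df = sum(1 for document in corpus if term in document)
    return math.log(len(corpus) / df)

def tf_idf(term, document, corpus):
    """w_{i,j} = tf_{i,j} * idf_i."""
    return term_frequency(term, document) * inverse_document_frequency(term, corpus)

# Compare a term shared by two documents with a term unique to one document.
for term in ["dog", "morn"]:
    print(term.rjust(10), " : ", tf_idf(term, toy_corpus[0], toy_corpus))
```

In the first toy document both terms occur once, but `morn` appears in only one document while `dog` appears in two, so `morn` receives the larger weight.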

### 1. Data

In [16]:
corpus = [
    "The dog barks in the morning.",
    "Over the sofa lies sleeping dog.",
    "My dog name is Farell, it is very energetic.",
    "The dog barks at the cars.",
    "Cat dislikes vegetables.",
    "Cats sleep during day and hunt during night.",
    "Cats and dogs are not getting along. I prefer cats.",
    "Cats, dogs and elephants are animals.",
    "Dogs can run quickly.",
    "My favourite animals are dogs.",
    "There are many different animals in the world.",
    "When I buy a house I will also adopt two cats.",
    "On cat is black and the other cat is white."
]

### 2. Cleaning corpus

In [17]:
import re

import nltk
for package in ["punkt", "wordnet", "stopwords"]:
    nltk.download(package)

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer

porter_stemmer = PorterStemmer()
wordnet_lemmatizer = WordNetLemmatizer()
# Build the stopword set once instead of reloading the list for every token.
english_stopwords = set(stopwords.words("english"))

def normalize_document(document, stemmer=porter_stemmer, lemmatizer=wordnet_lemmatizer):
    """Normalizes a document by performing the following steps:
        1. Changing the text to lowercase.
        2. Replacing special characters and punctuation with spaces.
        3. Dividing text into tokens.
        4. Removing English stopwords.
        5. Stemming words.
        6. Lemmatizing words.
    """
    
    temp = document.lower()
    temp = re.sub(r"[^a-zA-Z0-9]", " ", temp)
    temp = word_tokenize(temp)
    temp = [t for t in temp if t not in english_stopwords]
    # Use the stemmer passed as an argument rather than the global instance.
    temp = [stemmer.stem(token) for token in temp]
    temp = [lemmatizer.lemmatize(token) for token in temp]
        
    return temp

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/kamilkrzyk/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/kamilkrzyk/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kamilkrzyk/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Previewing results.

In [3]:
offset = max(map(len, corpus))
for document in corpus:
    print(document.rjust(offset), " -> ", normalize_document(document))

                      The dog barks in the morning.  ->  ['dog', 'bark', 'morn']
                   Over the sofa lies sleeping dog.  ->  ['sofa', 'lie', 'sleep', 'dog']
       My dog name is Farell, it is very energetic.  ->  ['dog', 'name', 'farel', 'energet']
                         The dog barks at the cars.  ->  ['dog', 'bark', 'car']
                           Cat dislikes vegetables.  ->  ['cat', 'dislik', 'veget']
       Cats sleep during day and hunt during night.  ->  ['cat', 'sleep', 'day', 'hunt', 'night']
Cats and dogs are not getting along. I prefer cats.  ->  ['cat', 'dog', 'get', 'along', 'prefer', 'cat']
              Cats, dogs and elephants are animals.  ->  ['cat', 'dog', 'eleph', 'anim']
                              Dogs can run quickly.  ->  ['dog', 'run', 'quickli']
                     My favourite animals are dogs.  ->  ['favourit', 'anim', 'dog']
     There are many different animals in the world.  ->  ['mani', 'differ', 'anim', 'world']
     When I buy a ho

The output shows which tokens remain from each sentence after normalization.
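
To see what the first four normalization steps contribute before stemming and lemmatization, here is a rough sketch that needs no NLTK data. The function name `basic_normalize` and the tiny `TOY_STOPWORDS` set are hypothetical stand-ins for illustration; the notebook itself uses NLTK's tokenizer and its full English stopword list, which may behave differently on other inputs.

```python
import re

# Hypothetical, tiny stopword set for illustration only;
# the notebook uses NLTK's English stopword list instead.
TOY_STOPWORDS = {"the", "in", "at", "is", "it", "my", "very", "and", "are"}

def basic_normalize(document, stopwords=TOY_STOPWORDS):
    """Lowercase, replace non-alphanumeric characters with spaces,
    split on whitespace and drop stopwords (no stemming or lemmatization)."""
    lowered = document.lower()
    cleaned = re.sub(r"[^a-z0-9]", " ", lowered)
    tokens = cleaned.split()
    return [token for token in tokens if token not in stopwords]

print(basic_normalize("The dog barks in the morning."))
```

Without the stemming step the tokens stay as `barks` and `morning`, whereas the preview above shows them reduced to `bark` and `morn`.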

### 3. Creating Bag of Words

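Initializing the CountVectorizer model with `normalize_document` as its tokenizer, so that stopword removal and the other normalization steps are applied to every document.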

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

bow = CountVectorizer(tokenizer=normalize_document)

Building the Bag of Words vocabulary from the corpus.

In [5]:
bow.fit(corpus)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=<function normalize_document at 0x107080b70>,
        vocabulary=None)

Previewing tokens in the bag.

In [6]:
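# Note: on scikit-learn >= 1.0 use bow.get_feature_names_out() instead;
# get_feature_names() was deprecated in 1.0 and removed in 1.2.
print(bow.get_feature_names())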

['adopt', 'along', 'also', 'anim', 'bark', 'black', 'buy', 'car', 'cat', 'day', 'differ', 'dislik', 'dog', 'eleph', 'energet', 'farel', 'favourit', 'get', 'hous', 'hunt', 'lie', 'mani', 'morn', 'name', 'night', 'prefer', 'quickli', 'run', 'sleep', 'sofa', 'two', 'veget', 'white', 'world']


The bag contains 34 tokens, so each sentence will be represented by a vector of length 34:

In [7]:
corpus_vectorized = bow.transform(corpus)

In [8]:
offset = max(map(len, corpus))
for document, document_vector in zip(corpus, corpus_vectorized.toarray()):
    print(document.rjust(offset), " -> ", document_vector)

                      The dog barks in the morning.  ->  [0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
                   Over the sofa lies sleeping dog.  ->  [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0]
       My dog name is Farell, it is very energetic.  ->  [0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
                         The dog barks at the cars.  ->  [0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
                           Cat dislikes vegetables.  ->  [0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
       Cats sleep during day and hunt during night.  ->  [0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0]
Cats and dogs are not getting along. I prefer cats.  ->  [0 1 0 0 0 0 0 0 2 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
              Cats, dogs and elephants are animals.  ->  [0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0

These vectors now represent the sentences in the corpus.
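
The counting itself can be sketched without scikit-learn. The snippet below is a simplified illustration, not how `CountVectorizer` is implemented internally: it builds a sorted vocabulary from two hypothetical token lists (stand-ins for `normalize_document` output) and counts how often each vocabulary term occurs in each document. The names `tokenized_docs` and `count_vector` are assumptions made for this example.

```python
from collections import Counter

# Hypothetical pre-tokenized documents (stand-ins for normalize_document output).
tokenized_docs = [
    ["dog", "bark", "morn"],
    ["cat", "dog", "get", "along", "prefer", "cat"],
]

# Vocabulary: the sorted set of unique tokens across all documents.
vocabulary = sorted({token for doc in tokenized_docs for token in doc})

def count_vector(tokens, vocabulary):
    """Return one count per vocabulary term, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

vectors = [count_vector(doc, vocabulary) for doc in tokenized_docs]
print(vocabulary)
for vector in vectors:
    print(vector)
```

A term that occurs twice in a document, such as `cat` in the second token list, gets the value 2 at its vocabulary position, and terms absent from a document get 0.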

### Creating TF-IDF values

Initializing the TF-IDF transformer.

In [9]:
from sklearn.feature_extraction.text import TfidfTransformer

tf_idf_transformer = TfidfTransformer()

Fitting the transformer, which computes the inverse document frequency (IDF) of each term.

In [10]:
tf_idf_transformer.fit(corpus_vectorized)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

Inspecting the IDF value of each term.

In [11]:
tf_idf_transformer.idf_

array([2.94591015, 2.94591015, 2.94591015, 2.25276297, 2.54044504,
       2.94591015, 2.94591015, 2.94591015, 1.69314718, 2.94591015,
       2.94591015, 2.94591015, 1.44183275, 2.94591015, 2.94591015,
       2.94591015, 2.94591015, 2.94591015, 2.94591015, 2.94591015,
       2.94591015, 2.94591015, 2.94591015, 2.94591015, 2.94591015,
       2.94591015, 2.94591015, 2.94591015, 2.54044504, 2.94591015,
       2.94591015, 2.94591015, 2.94591015, 2.94591015])
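
These values are larger than the plain $\log \frac{N}{df_i}$ from the Description would give. With `smooth_idf=True`, as shown in the transformer's repr, scikit-learn computes $\ln \frac{1 + N}{1 + df_i} + 1$ instead. The sketch below re-derives a few of the printed values in plain Python. It assumes $N = 13$ documents and document frequencies of 8 for `dog`, 6 for `cat`, 2 for `bark` and 1 for `morn`; these counts are inferred from the printed outputs and are not stated by the notebook. The helper name `smoothed_idf` is hypothetical.

```python
import math

def smoothed_idf(n_documents, document_frequency):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1, the formula scikit-learn
    documents for TfidfTransformer with smooth_idf=True."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

N = 13  # assumed corpus size, inferred from the notebook's outputs
for term, df in [("dog", 8), ("cat", 6), ("bark", 2), ("morn", 1)]:
    print(term.rjust(10), " : ", smoothed_idf(N, df))
```

The added 1 keeps the weight of a term that occurs in every document from dropping to zero, and the smoothing inside the logarithm avoids division by zero.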

In [12]:
for term, freq in zip(bow.get_feature_names(), tf_idf_transformer.idf_):
    print(term.rjust(10), " : ", freq)

     adopt  :  2.9459101490553135
     along  :  2.9459101490553135
      also  :  2.9459101490553135
      anim  :  2.252762968495368
      bark  :  2.540445040947149
     black  :  2.9459101490553135
       buy  :  2.9459101490553135
       car  :  2.9459101490553135
       cat  :  1.6931471805599454
       day  :  2.9459101490553135
    differ  :  2.9459101490553135
    dislik  :  2.9459101490553135
       dog  :  1.4418327522790393
     eleph  :  2.9459101490553135
   energet  :  2.9459101490553135
     farel  :  2.9459101490553135
  favourit  :  2.9459101490553135
       get  :  2.9459101490553135
      hous  :  2.9459101490553135
      hunt  :  2.9459101490553135
       lie  :  2.9459101490553135
      mani  :  2.9459101490553135
      morn  :  2.9459101490553135
      name  :  2.9459101490553135
     night  :  2.9459101490553135
    prefer  :  2.9459101490553135
   quickli  :  2.9459101490553135
       run  :  2.9459101490553135
     sleep  :  2.540445040947149
      sofa  :  2.

Computing the TF-IDF vector of each document.

In [13]:
tfidf_docs = tf_idf_transformer.transform(corpus_vectorized)

In [14]:
print(tfidf_docs.toarray().shape)
print(tfidf_docs.toarray())

(13, 34)
[[0.         0.         0.         0.         0.61235761 0.
  0.         0.         0.         0.         0.         0.
  0.34754433 0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.71009232 0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.28336938 0.         0.         0.         0.         0.
  0.         0.         0.57897196 0.         0.         0.
  0.         0.         0.         0.         0.49928422 0.57897196
  0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.27192757 0.         0.5555944  0.5555944  0.         0.
  0.         0.         0.         0.         0.         0.5555944
  0.         0.         0.         0.     
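
The entries are not simply count times IDF: the transformer's repr shows `norm='l2'`, so each document vector is scaled to unit Euclidean length. The sketch below reproduces the three nonzero entries of the first row from the IDF values printed earlier, assuming each of the three terms occurs exactly once in that document so that the raw weight equals the IDF. The variable names are illustrative.

```python
import math

# IDF values copied from the printed idf_ output; each term is assumed to occur
# once in document 0, so count * idf equals the IDF value itself.
raw_weights = {
    "bark": 2.540445040947149,
    "dog": 1.4418327522790393,
    "morn": 2.9459101490553135,
}

# L2 normalization: divide every weight by the Euclidean length of the vector.
l2_norm = math.sqrt(sum(weight * weight for weight in raw_weights.values()))
normalized = {term: weight / l2_norm for term, weight in raw_weights.items()}

for term, weight in normalized.items():
    print(term.rjust(10), " : ", round(weight, 8))
```

After normalization the squared entries of a document vector sum to 1, which makes vectors of documents with different lengths directly comparable.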

In [15]:
for doc_id in range(len(corpus)):
    print("Document id.{}: {}".format(doc_id, corpus[doc_id]))
    print("Tokens: {}".format(normalize_document(corpus[doc_id])))
    print("\n -- TF IDF Values for words in dictionary:")
    for term, freq in zip(bow.get_feature_names(), tfidf_docs[doc_id].T.toarray()):
        print(term.rjust(10), " : ", freq)
    print("\n ------------------")

Document id.0: The dog barks in the morning.
Tokens: ['dog', 'bark', 'morn']

 -- TF IDF Values for words in dictionary:
     adopt  :  [0.]
     along  :  [0.]
      also  :  [0.]
      anim  :  [0.]
      bark  :  [0.61235761]
     black  :  [0.]
       buy  :  [0.]
       car  :  [0.]
       cat  :  [0.]
       day  :  [0.]
    differ  :  [0.]
    dislik  :  [0.]
       dog  :  [0.34754433]
     eleph  :  [0.]
   energet  :  [0.]
     farel  :  [0.]
  favourit  :  [0.]
       get  :  [0.]
      hous  :  [0.]
      hunt  :  [0.]
       lie  :  [0.]
      mani  :  [0.]
      morn  :  [0.71009232]
      name  :  [0.]
     night  :  [0.]
    prefer  :  [0.]
   quickli  :  [0.]
       run  :  [0.]
     sleep  :  [0.]
      sofa  :  [0.]
       two  :  [0.]
     veget  :  [0.]
     white  :  [0.]
     world  :  [0.]

 ------------------
Document id.1: Over the sofa lies sleeping dog.
Tokens: ['sofa', 'lie', 'sleep', 'dog']

 -- TF IDF Values for words in dictionary:
     adopt  :  [0.]
  

      mani  :  [0.52816399]
      morn  :  [0.]
      name  :  [0.]
     night  :  [0.]
    prefer  :  [0.]
   quickli  :  [0.]
       run  :  [0.]
     sleep  :  [0.]
      sofa  :  [0.]
       two  :  [0.]
     veget  :  [0.]
     white  :  [0.]
     world  :  [0.52816399]

 ------------------
Document id.11: When I buy a house I will also adopt two cats.
Tokens: ['buy', 'hous', 'also', 'adopt', 'two', 'cat']

 -- TF IDF Values for words in dictionary:
     adopt  :  [0.4331346]
     along  :  [0.]
      also  :  [0.4331346]
      anim  :  [0.]
      bark  :  [0.]
     black  :  [0.]
       buy  :  [0.4331346]
       car  :  [0.]
       cat  :  [0.24894195]
       day  :  [0.]
    differ  :  [0.]
    dislik  :  [0.]
       dog  :  [0.]
     eleph  :  [0.]
   energet  :  [0.]
     farel  :  [0.]
  favourit  :  [0.]
       get  :  [0.]
      hous  :  [0.4331346]
      hunt  :  [0.]
       lie  :  [0.]
      mani  :  [0.]
      morn  :  [0.]
      name  :  [0.]
     night  :  [0.]
    p