<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Bag-of-words" data-toc-modified-id="Bag-of-words-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Bag of words</a></span></li><li><span><a href="#TF---IDF-(Term-Frequency---Inverse-Document-Frequency)" data-toc-modified-id="TF---IDF-(Term-Frequency---Inverse-Document-Frequency)-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>TF - IDF (Term Frequency - Inverse Document Frequency)</a></span></li></ul></div>

## Bag of words

Also known as count vectorisation, because it counts how often each unique word occurs and represents each document as a vector of those counts.<br/>
Note: Remember to remove stopwords from the documents, as they usually add little value.
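
As a minimal sketch of stopword removal, the snippet below filters a small stopword set out of the example documents. The `STOPWORDS` set and the `remove_stopwords` helper are illustrative assumptions, not part of any library; in practice a curated stopword list would typically be used instead.

```python
# Hypothetical, hand-picked stopword set for illustration only.
STOPWORDS = {'a', 'in', 'the', 'is'}

def remove_stopwords(doc, stopwords=STOPWORDS):
    """Return the document with stopwords removed (whitespace tokenisation)."""
    return ' '.join(w for w in doc.lower().split() if w not in stopwords)

docs = [
    'a dog live in home',
    'a dog live in the hut',
    'hut is dog home',
]

cleaned = [remove_stopwords(d) for d in docs]
print(cleaned)  # ['dog live home', 'dog live hut', 'hut dog home']
```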

In [2]:
!pip install keras

Collecting keras
  Downloading Keras-2.4.3-py2.py3-none-any.whl (36 kB)
Installing collected packages: keras
Successfully installed keras-2.4.3


In [1]:
from keras.preprocessing.text import Tokenizer

docs = [
        'a dog live in home',
        'a dog live in the hut',
        'hut is dog home',
]

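t = Tokenizer()
t.fit_on_texts(docs)  # build the vocabulary, most frequent words first

print(f'Vocabulary:{list(t.word_index.keys())}')

# One row per document; column i holds the count of the word with index i.
# Word indices start at 1, so column 0 is always 0.
vector = t.texts_to_matrix(docs, mode='count')
print(vector)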

Vocabulary:['dog', 'a', 'live', 'in', 'home', 'hut', 'the', 'is']
[[0. 1. 1. 1. 1. 1. 0. 0. 0.]
 [0. 1. 1. 1. 1. 0. 1. 1. 0.]
 [0. 1. 0. 0. 0. 1. 1. 0. 1.]]


Drawbacks: <br/>
This model only records which known words occur in a document and how often; their position and order are discarded.
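
To make this drawback concrete, here is a small sketch using only the standard library. The two sentences and the `bag_of_words` helper are made up for illustration; they show that documents with the same words in a different order get identical count vectors.

```python
from collections import Counter

def bag_of_words(doc, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocabulary]

# Two made-up sentences with the same words but different meanings.
doc_a = 'dog bites man'
doc_b = 'man bites dog'
vocabulary = sorted(set(doc_a.split()) | set(doc_b.split()))

vec_a = bag_of_words(doc_a, vocabulary)
vec_b = bag_of_words(doc_b, vocabulary)
print(vocabulary)      # ['bites', 'dog', 'man']
print(vec_a, vec_b)    # [1, 1, 1] [1, 1, 1]
print(vec_a == vec_b)  # True: the word order is lost
```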

## TF - IDF (Term Frequency - Inverse Document Frequency)  

TF-IDF provides a way to give rarer words greater weight.

Term Freq: tf(t,d) <br/>
This summarizes how often a given word appears within a document.
Two common methods are:
1. Term frequency adjusted for document length: <br/>tf(t,d) = (number of times term t appears in document d) / (number of words in d)
2. Logarithmically scaled frequency: <br/> tf(t,d) = log(1 + number of times term t appears in document d)

doc1 = 'a dog live in home'

1. tf(dog,doc1) = 1/5
2. tf(dog, doc1) = log(1 + 1) = log(2) ≈ 0.693 (using the natural log)
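
The two term-frequency definitions can be sketched in a few lines of Python. The helper names `tf_length_normalised` and `tf_log_scaled` are hypothetical, and the natural logarithm is an assumption, since the base is not specified here.

```python
import math

def tf_length_normalised(term, doc):
    """Occurrences of the term divided by the number of words in the document."""
    words = doc.split()
    return words.count(term) / len(words)

def tf_log_scaled(term, doc):
    """log(1 + occurrences of the term), using the natural log."""
    return math.log(1 + doc.split().count(term))

doc1 = 'a dog live in home'
print(tf_length_normalised('dog', doc1))  # 0.2
print(tf_log_scaled('dog', doc1))         # log(2), roughly 0.693
```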

Inverse Document Freq: idf(t,D) <br/>
IDF is a measure of term importance. It is the logarithmically scaled ratio of the total number of documents to the number of documents that contain term t.



D = [
        'a dog live in home',
        'a dog live in the hut',
        'hut is dog home',
]
D is the corpus

idf(dog, D) = log(total no. of documents / no. of documents containing the term 'dog')

= log(3/3) = log(1) = 0

A term that appears in every document therefore gets zero weight.
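
The same calculation can be sketched in Python for other terms in the corpus. The `idf` helper is hypothetical and uses the unsmoothed formula with the natural log; it assumes the term occurs in at least one document, otherwise the division fails.

```python
import math

def idf(term, corpus):
    """Unsmoothed IDF: log(N / n_t), where n_t documents contain the term."""
    n_docs = len(corpus)
    n_with_term = sum(1 for doc in corpus if term in doc.split())
    return math.log(n_docs / n_with_term)  # raises ZeroDivisionError if n_with_term == 0

D = [
    'a dog live in home',
    'a dog live in the hut',
    'hut is dog home',
]

for term in ['dog', 'hut', 'the']:
    print(term, round(idf(term, D), 4))
# dog 0.0
# hut 0.4055
# the 1.0986
```

The rarer the term, the larger its IDF: 'the' occurs in one document and gets the highest weight of the three.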

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
        'a dog live in home',
        'a dog live in the hut',
        'hut is dog home',
]

vectorizer = TfidfVectorizer()
vectorizer.fit(docs)
# vocabulary_ maps each term to its column index; the default tokenizer drops
# single-character tokens such as 'a'
print('Vocabulary:', list(vectorizer.vocabulary_.keys()), '\n')
# idf_ holds one smoothed IDF weight per column, ordered by column index
print('IDF weights: ', vectorizer.idf_, '\n')
print('Term -> column index: ', vectorizer.vocabulary_, '\n')
# TF-IDF vector of the first document
vector = vectorizer.transform([docs[0]])
print(vector.toarray())

Vocabulary: ['dog', 'live', 'in', 'home', 'the', 'hut', 'is'] 

IDF weights:  [1.         1.28768207 1.28768207 1.28768207 1.69314718 1.28768207
 1.69314718] 

Term -> column index:  {'dog': 0, 'live': 5, 'in': 3, 'home': 1, 'the': 6, 'hut': 2, 'is': 4} 

[[0.40912286 0.52682017 0.         0.52682017 0.         0.52682017
  0.        ]]
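
The printed IDF value for 'dog' is 1 rather than 0 because scikit-learn does not use the plain log(N/n) formula by default. The sketch below tries to reproduce the vector above in plain Python, assuming the `TfidfVectorizer` defaults: tokens of at least two word characters, smoothed IDF `ln((1 + N) / (1 + n)) + 1`, raw term counts, and L2 normalisation. The helper `smoothed_idf` is hypothetical.

```python
import math
import re

docs = [
    'a dog live in home',
    'a dog live in the hut',
    'hut is dog home',
]

# Assumed default tokenisation: lowercase, keep tokens of 2+ word characters.
tokenised = [re.findall(r'\b\w\w+\b', d.lower()) for d in docs]
vocab = sorted({tok for toks in tokenised for tok in toks})
n_docs = len(docs)

def smoothed_idf(term):
    """Smoothed IDF: ln((1 + N) / (1 + n_t)) + 1."""
    df = sum(1 for toks in tokenised if term in toks)
    return math.log((1 + n_docs) / (1 + df)) + 1

# Raw TF-IDF for the first document: term count times smoothed IDF.
raw = [tokenised[0].count(term) * smoothed_idf(term) for term in vocab]

# L2-normalise so the vector has unit length.
norm = math.sqrt(sum(x * x for x in raw))
tfidf = [x / norm for x in raw]

print(vocab)
print([round(x, 4) for x in tfidf])
# ['dog', 'home', 'hut', 'in', 'is', 'live', 'the']
# [0.4091, 0.5268, 0.0, 0.5268, 0.0, 0.5268, 0.0]
```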


Drawbacks: <br/>
TF-IDF makes feature extraction more robust than the raw term counts used in the Bag-of-words model. However, it does not solve the major drawback of BoW: the order and structure of words in the document are still discarded.