## CountVectorizer
CountVectorizer converts a collection of text documents into count vectors so that we can use them with models. It simply counts the number of times each word has occurred in each document.

In [1]:
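from sklearn.feature_extraction.text import CountVectorizer

docs = ["Ivan is a nice boy.", "Ivan rock! wohooo!", "My name is Ivan, and I am a Pythonista!"]
cv = CountVectorizer()
# Learn the vocabulary and build the sparse document-term count matrix
X = cv.fit_transform(docs)
print(X.todense())
# Mapping from each word to its column index in the matrix
print(cv.vocabulary_)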

[[0 0 1 1 1 0 0 1 0 0 0]
 [0 0 0 0 1 0 0 0 0 1 1]
 [1 1 0 1 1 1 1 0 1 0 0]]
{'ivan': 4, 'is': 3, 'nice': 7, 'boy': 2, 'rock': 9, 'wohooo': 10, 'my': 5, 'name': 6, 'and': 1, 'am': 0, 'pythonista': 8}
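
To make the counting step concrete, here is a minimal plain-Python sketch that reproduces the matrix above. It assumes the default tokenization (lowercase the text, keep tokens of two or more word characters) and an alphabetically sorted vocabulary. The helper names `tokenize` and `count_matrix` are made up for this illustration and are not part of sklearn.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: mimic the default behaviour, i.e. lowercase the text and
    # keep tokens of at least two word characters ("a" and "I" are dropped).
    return re.findall(r"\b\w\w+\b", text.lower())

def count_matrix(docs):
    # Hypothetical helper: returns (vocabulary, matrix of raw counts).
    tokenized = [tokenize(doc) for doc in docs]
    words = sorted({tok for toks in tokenized for tok in toks})
    vocabulary = {word: idx for idx, word in enumerate(words)}
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)  # missing words count as 0
        matrix.append([counts[word] for word in words])
    return vocabulary, matrix

docs = ["Ivan is a nice boy.", "Ivan rock! wohooo!", "My name is Ivan, and I am a Pythonista!"]
vocabulary, matrix = count_matrix(docs)
for row in matrix:
    print(row)
print(vocabulary)
```

Each column corresponds to one word of the vocabulary, and each row to one document, which is why the word "ivan" (column 4) is set in all three rows.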


## DictVectorizer
DictVectorizer converts a list of mappings (feature name to value) into vectors, with one column per feature name.

In [2]:
from sklearn.feature_extraction import DictVectorizer

docs = [{"Ivan": 1, "is": 1, "awesome": 2}, {"No": 1, "I": 1, "don't": 2, "wanna": 3, "fall": 1, "in": 2, "love": 3}]
dv = DictVectorizer()
# Each distinct key becomes a column; keys missing from a mapping get 0
X = dv.fit_transform(docs)
print(X.todense())
# Mapping from each key to its column index in the matrix
print(dv.vocabulary_)

[[0. 1. 0. 2. 0. 0. 0. 1. 0. 0.]
 [1. 0. 1. 0. 2. 1. 2. 0. 3. 3.]]
{'Ivan': 1, 'is': 7, 'awesome': 3, 'No': 2, 'I': 0, "don't": 4, 'wanna': 9, 'fall': 5, 'in': 6, 'love': 8}
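
The same result can be sketched in plain Python, which shows where the column order comes from. This is only an illustration for numeric values, assuming columns are ordered by sorted key (uppercase letters sort before lowercase ones). The helper `dict_vectorize` is a made-up name, not a sklearn function.

```python
def dict_vectorize(rows):
    # Hypothetical helper: one column per distinct key, columns sorted by key.
    feature_names = sorted({key for row in rows for key in row})
    index = {name: i for i, name in enumerate(feature_names)}
    # Keys that are absent from a mapping contribute 0.0 to that row.
    matrix = [[float(row.get(name, 0)) for name in feature_names] for row in rows]
    return index, matrix

rows = [
    {"Ivan": 1, "is": 1, "awesome": 2},
    {"No": 1, "I": 1, "don't": 2, "wanna": 3, "fall": 1, "in": 2, "love": 3},
]
index, matrix = dict_vectorize(rows)
for row in matrix:
    print(row)
print(index)
```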


## TfidfVectorizer
In many text analytics applications, we need to convert text into vectors before we can use it with Machine Learning algorithms. This is known as the Vector Space Model. While CountVectorizer could be a solution, common words like "the", "in", "a", etc. appear in all kinds of documents, so raw counts put the most emphasis on words that carry little information. You could work around this problem by using `stop_words = "english"`, which filters out common English words. But let's say your documents have their own vocabulary: a conversation between two Finance students would mention "Balance Sheet", "Interest Rate" and "Profit/Loss" very often, and you'd have to add such stop words manually every time, for every problem you solve.

Thus in such scenarios, it is recommended to use `TfidfVectorizer`, which down-weights such words automatically. Every word in a document is given a weight according to the following formula.

$$ \text{tfidf}(\text{word}, \text{document}_i) = \text{tf}(\text{word}, \text{document}_i) \cdot \text{idf}(\text{word})$$

Where,
1. tf(word, document_i) = Term Frequency of a word in the specific document i.
2. idf(word) = Inverse Document Frequency of the word

Inverse Document Frequency is defined as the log of the ratio of the total number of documents to the number of documents which contain that particular word.

$$ \text{idf}\left(w\right)=\log\left(\frac{n_d}{df\left(w\right)}\right)$$

Where,
1. n_d = The total number of documents
2. df(w) = The Document Frequency of the word w, i.e. the number of documents in which w appears
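
The two formulas above can be tried out on a tiny corpus. The sketch below assumes that term frequency is the raw count of the word in the document and that documents are already tokenized. The corpus and the function names are only for illustration.

```python
import math

# Illustrative, already tokenized corpus (an assumption for this sketch).
docs = [
    ["ivan", "is", "chess", "player"],
    ["ivan", "is", "skateboarder"],
    ["ivan", "is", "also", "programmer"],
]

def tf(word, doc):
    # Term frequency: raw count of the word in one document.
    return doc.count(word)

def idf(word, docs):
    # Inverse document frequency: log(n_d / df(word)).
    # Assumes the word occurs in at least one document, so df > 0.
    df = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / df)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

print(tfidf("ivan", docs[0], docs))   # word present in every document
print(tfidf("chess", docs[0], docs))  # word present in one document only
```

With this plain definition, a word that appears in every document gets a weight of exactly 0 because log(3/3) = 0, while "chess" gets log(3).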

Intuitively, if a word has occurred in many other documents as well (common words like "the", "is"), it is given a lower weight than a word that occurs often in a single document but rarely in the others. This basically means that if a particular word occurs many times in a single document only, then it might be an important feature.

Note that in practice 1 is added to the denominator to avoid division by zero, e.g. when the document frequency is 0.
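
The effect of adding 1 can be sketched in a few lines. The smoothed variant below is an assumption based on how the sklearn documentation describes its default (`smooth_idf=True`): 1 is added to both the numerator and the denominator, and 1 is added to the result so that words occurring in every document are not zeroed out entirely. The function names are made up for this example.

```python
import math

def plain_idf(n_docs, df):
    # Textbook form log(n_d / df); raises ZeroDivisionError when df == 0.
    return math.log(n_docs / df)

def smoothed_idf(n_docs, df):
    # Assumed to match sklearn's documented default smoothing:
    # ln((1 + n_d) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df)) + 1

print(smoothed_idf(3, 3))  # word in all 3 documents
print(smoothed_idf(3, 1))  # word in 1 of 3 documents
print(smoothed_idf(3, 0))  # unseen word: still well defined
```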

Sklearn additionally normalizes each tf-idf vector to have a Euclidean (L2) norm of 1. This is important since we are only interested in similarities, which means vectors like (1, 1) and (3, 3) are really the same (they point in the same direction, they just have different lengths). This is achieved by dividing each element in the vector by the length of the vector.

$$ \hat{v} = \frac{v}{\|v\|_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + v_3^2 + \dots + v_n^2}}$$
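
A quick way to convince yourself of this is to normalize both example vectors by hand. The helper `l2_normalize` below is a made-up name for this sketch and assumes a non-zero vector.

```python
import math

def l2_normalize(vector):
    # Divide every component by the Euclidean length of the vector.
    # Assumes the vector is not all zeros.
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]

a = l2_normalize([1, 1])
b = l2_normalize([3, 3])
print(a)
print(b)
```

Both vectors end up at roughly (0.707, 0.707), so after normalization they are indistinguishable.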

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

tfidf_vectorizer = TfidfVectorizer()
cv_vectorizer = CountVectorizer()
docs = ["Ivan is a chess player", "Ivan is a skateboarder", "Ivan is also a programmer"]
# Fit both vectorizers on the same documents to compare tf-idf weights with raw counts
X_idf = tfidf_vectorizer.fit_transform(docs)
X_cv = cv_vectorizer.fit_transform(docs)
print(X_idf.todense())  # tf-idf weights, each row scaled to unit length
print(tfidf_vectorizer.vocabulary_)  # word -> column index
print(X_cv.todense())  # raw counts

[[0.         0.6088451  0.35959372 0.35959372 0.6088451  0.
  0.        ]
 [0.         0.         0.45329466 0.45329466 0.         0.
  0.76749457]
 [0.6088451  0.         0.35959372 0.35959372 0.         0.6088451
  0.        ]]
{'ivan': 3, 'is': 2, 'chess': 1, 'player': 4, 'skateboarder': 6, 'also': 0, 'programmer': 5}
[[0 1 1 1 1 0 0]
 [0 0 1 1 0 0 1]
 [1 0 1 1 0 1 0]]
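
To see where numbers like 0.35959372 and 0.6088451 come from, the tf-idf rows can be recomputed by hand. This is a sketch under stated assumptions, not sklearn's implementation: it assumes raw counts as term frequency, the smoothed idf `ln((1 + n_d) / (1 + df)) + 1`, and L2 normalization of each row. The documents are written as already tokenized lists (the single-character token "a" is dropped), and all names are made up for this example.

```python
import math

# Tokenized versions of the three documents above (assumption: "a" is dropped).
docs = [
    ["ivan", "is", "chess", "player"],
    ["ivan", "is", "skateboarder"],
    ["ivan", "is", "also", "programmer"],
]
vocab = sorted({word for doc in docs for word in doc})

def smoothed_idf(word):
    # Assumed smoothing: ln((1 + n_d) / (1 + df)) + 1
    df = sum(1 for doc in docs if word in doc)
    return math.log((1 + len(docs)) / (1 + df)) + 1

def tfidf_row(doc):
    # Raw count times idf for every vocabulary word, then scale to unit length.
    raw = [doc.count(word) * smoothed_idf(word) for word in vocab]
    length = math.sqrt(sum(x * x for x in raw))
    return [x / length for x in raw]

rows = [tfidf_row(doc) for doc in docs]
for row in rows:
    print([round(x, 8) for x in row])
```

"ivan" and "is" appear in every document, so their idf is 1 and they end up with the smallest non-zero weight in each row, while words unique to one document get the largest weight.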
