## CountVectorizer

The CountVectorizer provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

# list of text documents
cv_text = ["The quick brown fox jumped over the lazy dog."]
# create the transform
cv_vectorizer = CountVectorizer()
# tokenize and build vocab
cv_vectorizer.fit(cv_text)
# summarize (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
cv_tokens = cv_vectorizer.get_feature_names_out().tolist()
print("Vocabulary------>>>", cv_vectorizer.vocabulary_)
print("Tokens------>>>", cv_tokens)
# encode document
cv_vector = cv_vectorizer.transform(cv_text)
# summarize encoded vector
print("Vector shape------>>>", cv_vector.shape)
print("Generated Vector------->>", cv_vector.toarray())
# pair each token with its count in the first (and only) document
cv_vector_lst = cv_vector.toarray()[0].tolist()
print("Word Dictionary ------>>>", dict(zip(cv_tokens, cv_vector_lst)))

Vocabulary------>>> {'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
Tokens------>>> ['brown', 'dog', 'fox', 'jumped', 'lazy', 'over', 'quick', 'the']
Vector shape------>>> (1, 8)
Generated Vector------->> [[1 1 1 1 1 1 1 2]]
Word Dictionary ------>>> {'brown': 1, 'dog': 1, 'fox': 1, 'jumped': 1, 'lazy': 1, 'over': 1, 'quick': 1, 'the': 2}
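
To make the tokenize, build-vocabulary, and encode steps concrete, here is a rough standard-library sketch of the same idea. It is not scikit-learn's implementation: the helper names (`tokenize`, `build_vocabulary`, `encode`) are hypothetical, and the tokenizer assumes lowercasing plus tokens of two or more word characters.

```python
import re
from collections import Counter

# Assumed tokenizer: lowercase, then keep runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(document):
    """Split one document into lowercase tokens."""
    return TOKEN_PATTERN.findall(document.lower())


def build_vocabulary(documents):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({token for doc in documents for token in tokenize(doc)})
    return {token: index for index, token in enumerate(tokens)}


def encode(document, vocabulary):
    """Count vocabulary tokens in one document; unknown tokens are skipped."""
    vector = [0] * len(vocabulary)
    for token, count in Counter(tokenize(document)).items():
        if token in vocabulary:
            vector[vocabulary[token]] = count
    return vector


corpus = ["The quick brown fox jumped over the lazy dog."]
vocabulary = build_vocabulary(corpus)
training_vector = encode(corpus[0], vocabulary)
# A new document: "cat" and "and" are not in the vocabulary, so they are ignored.
new_vector = encode("The lazy cat and the lazy dog.", vocabulary)

print(vocabulary)
print(training_vector)  # [1, 1, 1, 1, 1, 1, 1, 2]
print(new_vector)       # [0, 1, 0, 0, 2, 0, 0, 2]
```

The second vector shows what "encoding a new document using that vocabulary" means in practice: the vector always has one column per known word, and words never seen during fitting contribute nothing.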


## TfidfVectorizer

The TfidfVectorizer will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents.

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

# list of text documents
text = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The fox"]
# create the transform
tf_vectorizer = TfidfVectorizer()
# tokenize and build vocab
tf_vectorizer.fit(text)
# summarize (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
tf_tokens = tf_vectorizer.get_feature_names_out().tolist()
print("Vocabulary------>>>", tf_vectorizer.vocabulary_)
print("Tokens------>>>", tf_tokens)
# encode the first document
tf_vector = tf_vectorizer.transform([text[0]])
# summarize encoded vector
print("Vector shape------>>>", tf_vector.shape)
print("Generated Vector------->>", tf_vector.toarray())
# pair each token with its IDF weight and with its TF-IDF score in the first document
tf_vector_lst = tf_vector.toarray()[0].tolist()
print("IDF Word Dictionary ------>>>", dict(zip(tf_tokens, tf_vectorizer.idf_.tolist())))
print("TF-IDF Score Word Dictionary ------>>>", dict(zip(tf_tokens, tf_vector_lst)))

Vocabulary------>>> {'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
Tokens------>>> ['brown', 'dog', 'fox', 'jumped', 'lazy', 'over', 'quick', 'the']
Vector shape------>>> (1, 8)
Generated Vector------->> [[0.36388646 0.27674503 0.27674503 0.36388646 0.36388646 0.36388646
  0.36388646 0.42983441]]
IDF Word Dictionary ------>>> {'brown': 1.6931471805599454, 'dog': 1.2876820724517808, 'fox': 1.2876820724517808, 'jumped': 1.6931471805599454, 'lazy': 1.6931471805599454, 'over': 1.6931471805599454, 'quick': 1.6931471805599454, 'the': 1.0}
TF-IDF Score Word Dictionary ------>>> {'brown': 0.3638864554802418, 'dog': 0.27674502873103346, 'fox': 0.27674502873103346, 'jumped': 0.3638864554802418, 'lazy': 0.3638864554802418, 'over': 0.3638864554802418, 'quick': 0.3638864554802418, 'the': 0.4298344050159891}


1. The inverse document frequencies are calculated for each word in the vocabulary, assigning the lowest score of 1.0 to the most frequently observed word: “the” at index 7, which appears in all three documents.
2. The generated vector is normalized to unit (L2) length, so for these non-negative counts every score falls between 0 and 1.
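
The IDF values and normalized scores printed above can be reproduced by hand. The sketch below assumes the default settings (`smooth_idf=True`, `norm='l2'`), under which the IDF of a term is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. The dictionaries are typed in from the three-document example rather than computed from text.

```python
import math

# Number of documents, and how many documents contain each term.
n_documents = 3
document_frequency = {"brown": 1, "dog": 2, "fox": 2, "jumped": 1,
                      "lazy": 1, "over": 1, "quick": 1, "the": 3}
# Raw term counts in the first document.
first_document_counts = {"brown": 1, "dog": 1, "fox": 1, "jumped": 1,
                         "lazy": 1, "over": 1, "quick": 1, "the": 2}

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_documents) / (1 + df)) + 1
       for term, df in document_frequency.items()}

# Unnormalized TF-IDF is count * idf; dividing by the Euclidean norm
# rescales the document vector to unit length.
raw_scores = {term: first_document_counts[term] * idf[term] for term in idf}
norm = math.sqrt(sum(score * score for score in raw_scores.values()))
tfidf = {term: score / norm for term, score in raw_scores.items()}

print(idf)
print(tfidf)
```

Note that “the” has the lowest IDF (1.0) but still the highest TF-IDF score in this document, because it occurs twice: 2 × 1.0 is larger than 1 × 1.693.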



## TfidfTransformer

1. Alternatively, if you already have a fitted CountVectorizer, you can pair it with a TfidfTransformer to calculate just the inverse document frequencies and start encoding documents.
2. With TfidfTransformer the work happens in explicit stages: you first compute word counts with CountVectorizer, then compute the Inverse Document Frequency (IDF) values, and only then compute the TF-IDF scores.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

tf_text = ["The quick brown fox jumped over the lazy dog.",
           "The dog.",
           "The fox"]
# step 1: compute the word counts
cv_vectorizer = CountVectorizer()
cv_word_count = cv_vectorizer.fit_transform(tf_text)
# step 2: learn the IDF weights from the counts
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(cv_word_count)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
cv_tokens = cv_vectorizer.get_feature_names_out().tolist()
print("IDF Word Dictionary ------>>>", dict(zip(cv_tokens, tfidf_transformer.idf_.tolist())))

# step 3: compute the TF-IDF scores
tf_idf_vector = tfidf_transformer.transform(cv_word_count)
document_vector = tf_idf_vector[0]  # scores for the first document
tf_doc_vector_lst = document_vector.toarray()[0].tolist()
print("TF-IDF Score Word Dictionary ------>>>", dict(zip(cv_tokens, tf_doc_vector_lst)))

IDF Word Dictionary ------>>> {'brown': 1.6931471805599454, 'dog': 1.2876820724517808, 'fox': 1.2876820724517808, 'jumped': 1.6931471805599454, 'lazy': 1.6931471805599454, 'over': 1.6931471805599454, 'quick': 1.6931471805599454, 'the': 1.0}
TF-IDF Score Word Dictionary ------>>> {'brown': 0.3638864554802418, 'dog': 0.27674502873103346, 'fox': 0.27674502873103346, 'jumped': 0.3638864554802418, 'lazy': 0.3638864554802418, 'over': 0.3638864554802418, 'quick': 0.3638864554802418, 'the': 0.4298344050159891}


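3. The scores above make sense. The more common a word is across documents, the lower its IDF weight; the more specific a word is to the first document, the higher its weight. “the” still receives the largest TF-IDF score here only because it occurs twice in that document.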

### Summary:
1. With TfidfVectorizer, by contrast, you do all three steps at once. Under the hood, it computes the word counts, IDF values, and TF-IDF scores from the same dataset.
2. If you need the term frequency (term count) vectors for other tasks, use TfidfTransformer on top of a CountVectorizer, since the counts remain available.
3. If you need to compute TF-IDF scores on documents within your “training” dataset, use TfidfVectorizer.
4. If you need to compute TF-IDF scores on documents outside your “training” dataset, either one will work, as long as you call transform on the new documents with the already fitted objects.

## HashingVectorizer

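1. The count- and frequency-based vectorizers above must store a vocabulary, which can become very large. The resulting long vectors place heavy demands on memory and slow down downstream algorithms.
2. The HashingVectorizer class works around this by hashing each word to an integer column index in a fixed-length vector. No vocabulary is stored, so there is no fit step and documents can be tokenized and encoded as needed. The trade-off is that the hash is one-way, so a column cannot be mapped back to a word.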

In [34]:
from sklearn.feature_extraction.text import HashingVectorizer

# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]
# create the transform with a fixed output width of 20 columns
hv_vectorizer = HashingVectorizer(n_features=20)
# encode document (no fit step is needed, because no vocabulary is learned)
hv_vector = hv_vectorizer.transform(text)
# summarize encoded vector
print("Vector shape------>>>", hv_vector.shape)
print("Generated Vector------->>", hv_vector.toarray())

Vector shape------>>> (1, 20)
Generated Vector------->> [[ 0.          0.          0.          0.          0.          0.33333333
   0.         -0.33333333  0.33333333  0.          0.          0.33333333
   0.          0.          0.         -0.33333333  0.          0.
  -0.66666667  0.        ]]
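
The idea behind the hashed encoding can be sketched with the standard library alone. This is an illustration under stated assumptions, not scikit-learn's implementation: the hypothetical helper `hashed_vector` uses MD5 as a stand-in hash function, so its column indices and signs will not match the output above. What it does share with the output above is the shape of the result: a fixed-length vector, signed entries, and unit L2 length.

```python
import hashlib
import math
import re


def hashed_vector(document, n_features=20):
    """Encode one document into a fixed-length, L2-normalized hashed vector."""
    vector = [0.0] * n_features
    for token in re.findall(r"\b\w\w+\b", document.lower()):
        # MD5 is an assumption made for this sketch; any stable hash would do.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # The first four bytes choose the column.
        index = int.from_bytes(digest[:4], "big") % n_features
        # One further byte chooses the sign, so colliding words can partly cancel.
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vector[index] += sign
    norm = math.sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector] if norm else vector


vector = hashed_vector("The quick brown fox jumped over the lazy dog.")
print(len(vector))
print(vector)
```

Because the column for a word depends only on the word itself, the same document always produces the same vector, and new documents can be encoded at any time without refitting. Different words can collide in the same column, which is the price of the fixed width; a larger `n_features` makes collisions rarer.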
