<b>TfidfTransformer</b> :

* Transforms a count matrix into a normalized tf or tf-idf representation.
* Tf means term frequency, while tf-idf means term frequency times inverse document frequency. This is a common term-weighting scheme in information retrieval that has also proved useful in document classification.

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer

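* The <b>Term Frequency</b> is a count of how many times a word occurs in a given document (the same counts a bag-of-words model uses).
* The <b>Inverse Document Frequency</b> down-weights words that appear in many documents of the corpus. It is based on the inverse of the number of documents that contain the word, so rare words receive higher weights than common ones.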
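The two definitions above can be made concrete with a small hand computation. The sketch below uses a tiny, hand-tokenized corpus (the token lists are an illustrative assumption, not the output of any tokenizer) and the smoothed formula that the scikit-learn documentation gives for its default settings, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` is the number of documents containing term `t`.

```python
import math
from collections import Counter

# Hypothetical, hand-tokenized documents used only for illustration.
docs = [
    ["the", "ball", "is", "in", "your", "court"],
    ["there", "are", "other", "fish", "in", "the", "sea"],
    ["doogie", "howser"],
]
n_docs = len(docs)

# Term frequency: raw count of each term within each document.
tf = [Counter(doc) for doc in docs]

# Document frequency: number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency (scikit-learn's documented default).
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in df}

for term in ("the", "doogie"):
    print(term, "df =", df[term], "idf =", round(idf[term], 4))
```

A term such as "the", which occurs in two of the three documents, ends up with a lower idf than "doogie", which occurs in only one.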

The first step is to create a training document set and compute its term-frequency (count) matrix.

https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting

### Creating a count vector



In [None]:
from sklearn.feature_extraction.text import CountVectorizer

train_text = ["A bird in hand is worth two in the bush.",
              "Good things come to those who wait.",
              "These watches cost $1500! ",
              "There are other fish in the sea.",
              "The ball is in your court.",
              "Mr. Smith Goes to Washington ",
              "Doogie Howser M.D."]

In [None]:
count_vectorizer = CountVectorizer()

frequency_term_matrix = count_vectorizer.fit_transform(train_text)

In [None]:
len(count_vectorizer.vocabulary_)

In [None]:
frequency_term_matrix.shape

In [None]:
frequency_term_matrix.toarray()

### Building the tf-idf matrix



In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()

In [None]:
tfidf_vector1 = tfidf_transformer.fit_transform(frequency_term_matrix)

tfidf_vector1.shape

In [None]:
tfidf_vector1.toarray()

In [None]:
print(tfidf_vector1.toarray())
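To see where these numbers come from, the following sketch recomputes the tf-idf matrix by hand and compares it with `TfidfTransformer`. It assumes the transformer's default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`) and uses a small sample corpus. All variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

sample_text = ["A bird in hand is worth two in the bush.",
               "There are other fish in the sea.",
               "The ball is in your court."]

counts_sparse = CountVectorizer().fit_transform(sample_text)
counts = counts_sparse.toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents does each term occur?
doc_freq = (counts > 0).sum(axis=0)

# Smoothed idf, as documented for smooth_idf=True.
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the raw counts by idf, then scale each row to unit L2 norm.
weighted = counts * manual_idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Reference result from scikit-learn.
reference = TfidfTransformer().fit_transform(counts_sparse).toarray()
print(np.allclose(manual_tfidf, reference))
```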

## TfidfVectorizer = CountVectorizer + TfidfTransformer

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()

In [None]:
tfidf_vector2 = tfidf_vectorizer.fit_transform(train_text)

tfidf_vectorizer.vocabulary_

In [None]:
tfidf_vector2.shape

In [None]:
tfidf_vectorizer.idf_

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
dict(zip(tfidf_vectorizer.get_feature_names_out(), tfidf_vectorizer.idf_))

### Final tf-idf score of each vocabulary word in each document
* Each row (document) is L2-normalized by default, so every score lies between 0 and 1

In [None]:
tfidf_vector2.toarray()
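The normalization can be checked directly. The sketch below, which assumes the default `norm='l2'` setting and uses an illustrative sample corpus, confirms that every document row has unit Euclidean length and that no score falls outside the range 0 to 1.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

sample_text = ["Good things come to those who wait.",
               "There are other fish in the sea.",
               "The ball is in your court."]

scores = TfidfVectorizer().fit_transform(sample_text).toarray()

# Euclidean (L2) length of each document row.
row_norms = np.linalg.norm(scores, axis=1)

print(row_norms)
print(scores.min(), scores.max())
```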

In [None]:
print(tfidf_vector2)

In [None]:
print(tfidf_vector1)
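Comparing the two printouts by eye is error-prone. A numerical check is more reliable. The sketch below builds both pipelines on a small sample corpus with default settings and compares the resulting matrices. The names `two_step` and `one_step` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

sample_text = ["These watches cost $1500! ",
               "Mr. Smith Goes to Washington ",
               "Doogie Howser M.D."]

# Two-step pipeline: counts first, then tf-idf weighting.
sample_counts = CountVectorizer().fit_transform(sample_text)
two_step = TfidfTransformer().fit_transform(sample_counts)

# One-step pipeline: TfidfVectorizer does both at once.
one_step = TfidfVectorizer().fit_transform(sample_text)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```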

# Done !