In [None]:
%pip install scikit-learn numpy

## TF-IDF Word Embedding
The `TfidfVectorizer` combines the `CountVectorizer` and `TfidfTransformer` into one class for ease of use.

### TF-IDF Vectorizer
Basic `TfidfVectorizer` process:
* Tokenizes each document (each string in the `corpus` list below) and counts the occurrences of each word; these counts are the term frequencies, the same output `CountVectorizer` gives
* Computes the document frequency of each term: the number of documents in the corpus (4 in the case below) that contain the term
* Computes the inverse document frequency of each term from its document frequency and the corpus size
* Multiplies term frequency by inverse document frequency to get the TF-IDF value, then normalizes each document vector (L2 norm by default)

The `TfidfTransformer` computes the TF-IDF values from a matrix of word counts alone. It does not tokenize or otherwise analyze the raw text, so it needs the output of a `CountVectorizer` as its input.
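To make the steps above concrete, here is a minimal sketch that reproduces the TF-IDF values by hand with NumPy. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`. The variable names (`counts`, `doc_freq`, `manual_tfidf`) are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'this is the first document',
    'this is the second document',
    'and the third one',
    'is this the first document?'
]

# Term frequency: raw word counts per document (rows) and term (columns)
counts = CountVectorizer().fit_transform(corpus).toarray()
n_docs = counts.shape[0]

# Document frequency: number of documents that contain each term
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (assumes the default smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# TF-IDF, followed by L2 normalization of each document vector
manual_tfidf = counts * idf
manual_tfidf = manual_tfidf / np.linalg.norm(manual_tfidf, axis=1, keepdims=True)

# Compare against the library result
reference = TfidfVectorizer().fit_transform(corpus).toarray()
matches = np.allclose(manual_tfidf, reference)
print("Manual computation matches TfidfVectorizer:", matches)
```

With non-default parameters (for example `smooth_idf=False` or `norm=None`) the formula changes, so the comparison would need to be adjusted accordingly.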

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

corpus = [
    'this is the first document',
    'this is the second document',
    'and the third one',
    'is this the first document?'
]

vectorizer = TfidfVectorizer()
corpus_vectors = vectorizer.fit_transform(corpus)

print("TF-IDF Vectors:\n", corpus_vectors.toarray())
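The printed array is easier to read once each column is matched to its term. One possible way to do this, sketched below, uses the fitted vectorizer's `vocabulary_` attribute, which maps each term to its column index. The helper variable `terms` is illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'this is the first document',
    'this is the second document',
    'and the third one',
    'is this the first document?'
]

vectorizer = TfidfVectorizer()
corpus_vectors = vectorizer.fit_transform(corpus)

# vocabulary_ maps term -> column index; sort the terms by that index
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print("Columns:", terms)

# Show the non-zero TF-IDF weights of the first document by term
first_row = corpus_vectors.toarray()[0]
for term, weight in zip(terms, first_row):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```

Terms that appear in fewer documents (such as `first`) should receive a higher weight than terms that appear in every document (such as `the`).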

### TF-IDF Transformer
This section chains `CountVectorizer` and `TfidfTransformer` by hand (it should give the same results as the code above).

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

corpus = [
    'this is the first document',
    'this is the second document',
    'and the third one',
    'is this the first document?'
]

count_vectorizer = CountVectorizer()
count_vectors = count_vectorizer.fit_transform(corpus)

print("Count Vectors:\n", count_vectors.toarray())

In [None]:
tfidf_transformer = TfidfTransformer()
vectors = tfidf_transformer.fit_transform(count_vectors)

print("TF-IDF Vectors:\n", vectors.toarray())
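The claim that both routes give the same result can be checked numerically rather than by eye. The following is a small sketch assuming default parameters for all three classes; the names `one_step` and `two_step` are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

corpus = [
    'this is the first document',
    'this is the second document',
    'and the third one',
    'is this the first document?'
]

# Route 1: TfidfVectorizer does counting and weighting in one step
one_step = TfidfVectorizer().fit_transform(corpus)

# Route 2: count first, then apply the TF-IDF weighting
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# Compare the dense arrays element by element (within float tolerance)
same = np.allclose(one_step.toarray(), two_step.toarray())
print("Results match:", same)
```

The two-step route is mainly useful when the raw counts are needed for something else as well; otherwise `TfidfVectorizer` alone is the shorter option.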