# Text Vectorization Methods

**<center>Overview</center>**

Vectorization Method | Function | Good For | Considerations
---|---|---|---
Frequency | Counts term frequencies | Bayesian models | The most frequent words are not always the most informative
One-Hot Encoding | Binarizes term occurrence (0, 1) | Neural networks | All words are equidistant, so normalization is especially important
TF–IDF | Normalizes term frequencies across documents | General purpose | Moderately frequent terms may not be representative of document topics
Distributed Representations | Context-based, continuous term similarity encoding | Modeling more complex relationships | Performance intensive; difficult to scale without additional tools (e.g., TensorFlow)
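
The first three rows of the table can be sketched in plain Python on the same three sentences used in the demonstration below. This is only an illustration under stated assumptions: the `tokenize` helper (lowercasing, punctuation stripping) and the unsmoothed `log(N / df)` weighting are choices made for this sketch, not the behavior of any particular library.

```python
import math
import string
from collections import Counter

# The same three sentences used in the Doc2Vec demonstration.
corpus = [
    "The elephant sneezed at the sight of potatoes.",
    "Bats can see via echolocation. See the bat sight sneeze!",
    "Wondering, she opened the door to the studio.",
]

def tokenize(text):
    # Assumption: lowercase and strip punctuation; no stemming or stopword removal.
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

docs = [tokenize(doc) for doc in corpus]
vocab = sorted({token for doc in docs for token in doc})

def frequency_vector(doc):
    # Frequency: how many times each vocabulary term occurs in the document.
    counts = Counter(doc)
    return [counts[term] for term in vocab]

def one_hot_vector(doc):
    # One-hot: 1 if the term occurs at all, 0 otherwise.
    present = set(doc)
    return [1 if term in present else 0 for term in vocab]

def tfidf_vector(doc):
    # TF-IDF: raw count scaled by log(N / df), where df is the number of
    # documents containing the term (assumed unsmoothed textbook variant).
    counts = Counter(doc)
    n_docs = len(docs)
    vector = []
    for term in vocab:
        df = sum(1 for d in docs if term in d)
        vector.append(counts[term] * math.log(n_docs / df))
    return vector

freq = [frequency_vector(doc) for doc in docs]
onehot = [one_hot_vector(doc) for doc in docs]
tfidf = [tfidf_vector(doc) for doc in docs]

the = vocab.index("the")
print(freq[0][the], onehot[0][the], tfidf[0][the])
```

In this sketch "the" occurs in every sentence, so its frequency count is the highest in the first document while its TF–IDF weight drops to zero, which is the trade-off the Considerations column points at.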

**Doc2Vec Demonstration**

In [1]:
corpus = [
    "The elephant sneezed at the sight of potatoes.",
    "Bats can see via echolocation. See the bat sight sneeze!",
    "Wondering, she opened the door to the studio.",
]

from gensim.models.doc2vec import TaggedDocument, Doc2Vec

# Tokenize on whitespace only; case and punctuation are preserved.
corpus = [doc.split() for doc in corpus]
# Tag each document with a unique label: 'd0', 'd1', 'd2'.
corpus = [
    TaggedDocument(words, ['d{}'.format(idx)]) for idx, words in enumerate(corpus)
]

# Five-dimensional vectors; min_count=0 keeps every token in the vocabulary.
model = Doc2Vec(vector_size=5, min_count=0)

In [2]:
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

In [3]:
print(model.dv[0])  # gensim >= 4.0; older releases expose this as model.docvecs

[-0.03661469 -0.06084158 -0.08878306  0.01230712 -0.03080011]


In [4]:
model.infer_vector(['The', 'Bats', 'sneezed', 'at', 'the', 'door'])

array([ 0.06216872, -0.04823388,  0.01324501,  0.05525212, -0.02168426],
      dtype=float32)
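
One common way to compare a stored document vector with an inferred one is cosine similarity. The sketch below is a minimal, dependency-free illustration: the two vectors are copied from the outputs above, and `cosine_similarity` is a helper written for this example rather than a gensim function. With a three-sentence corpus and five dimensions the resulting number should not be read as meaningful.

```python
import math

# Values copied from the outputs above: the stored vector for document 'd0'
# and the vector inferred for the new token list.
doc_vec = [-0.03661469, -0.06084158, -0.08878306, 0.01230712, -0.03080011]
inferred_vec = [0.06216872, -0.04823388, 0.01324501, 0.05525212, -0.02168426]

def cosine_similarity(u, v):
    # Dot product divided by the product of the Euclidean norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

similarity = cosine_similarity(doc_vec, inferred_vec)
print(similarity)
```

Values near 1 indicate vectors pointing in nearly the same direction, values near 0 indicate little relationship, and negative values indicate opposing directions.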