# Doc2Vec: Distributed Memory (DM) Model

This notebook demonstrates how to train a Doc2Vec model with the Distributed Memory (DM) architecture on a small sample corpus using Gensim. You will learn how to tag documents, build the vocabulary, train the model, retrieve document vectors, and find similar documents.

**Outline:**
- Introduction to Doc2Vec DM
- Tagging documents
- Building vocabulary
- Training the DM model
- Extracting document vectors
- Finding most similar documents

**References:**
- [Doc2Vec Paper: Le and Mikolov, "Distributed Representations of Sentences and Documents"](https://arxiv.org/pdf/1405.4053.pdf)
- [Gensim Doc2Vec Documentation](https://radimrehurek.com/gensim/models/doc2vec.html)


## What is Doc2Vec DM?

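Doc2Vec DM (Distributed Memory) is an extension of Word2Vec for learning vector representations of documents. The DM model predicts a target word from its surrounding context words combined with a vector for the document those words come from. Because the same document vector is shared by every context window in that document, it acts as a memory of what the local context is missing, such as the overall topic.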
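
The prediction step described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not Gensim's implementation: the vocabulary, the vector size, and the names `dm_predict`, `word_vectors`, `output_vectors`, and `doc_vector` are all made up for the example, the vectors are random rather than learned, and the inputs are averaged and scored with a full softmax (Gensim by default trains with negative sampling instead).

```python
import math
import random

random.seed(0)  # fixed seed so the toy numbers are repeatable

vocab = ["the", "quick", "brown", "fox", "jumps"]  # toy vocabulary (assumed)
dim = 4                                            # toy vector size (assumed)


def random_vector(n):
    """Return a small random vector standing in for a learned embedding."""
    return [random.uniform(-0.5, 0.5) for _ in range(n)]


word_vectors = {w: random_vector(dim) for w in vocab}    # input word vectors
output_vectors = {w: random_vector(dim) for w in vocab}  # output-layer weights
doc_vector = random_vector(dim)                          # one vector per document tag


def dm_predict(doc_vec, context_words):
    """Average the document vector with the context word vectors,
    then turn the scores for every vocabulary word into probabilities."""
    inputs = [doc_vec] + [word_vectors[w] for w in context_words]
    hidden = [sum(column) / len(inputs) for column in zip(*inputs)]
    scores = {
        w: sum(h * o for h, o in zip(hidden, output_vectors[w])) for w in vocab
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


# Context for the target word "brown" in "the quick brown fox jumps" (window of 2)
probs = dm_predict(doc_vector, ["the", "quick", "fox", "jumps"])
print(probs)
```

During training, the error on the true target word would be propagated back into both the context word vectors and the document vector, which is how the document vector comes to encode information about its document.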

**Workflow:**
1. Tag each document with a unique id.
2. Tokenize the documents.
3. Build the vocabulary.
4. Train the DM model.
5. Extract document vectors and find similar documents.


In [None]:
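# Import required libraries
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import nltk

# Download the tokenizer data used by nltk.word_tokenize
# (recent NLTK releases load 'punkt_tab' instead of 'punkt')
nltk.download('punkt')
nltk.download('punkt_tab')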

# Sample corpus
corpus = [
    'The quick brown fox jumps over the lazy dog',
    'Never jump over the lazy dog quickly',
    'A fox is quick and brown',
    'Dogs and foxes are animals',
    'Foxes are clever and quick',
    'Dogs are loyal and friendly'
]


## Tagging and Tokenizing Documents

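Each document is lowercased, split into word tokens with `nltk.word_tokenize`, and wrapped in a `TaggedDocument` whose tag is the document's index in the corpus, stored as a string. The resulting list of tagged documents is the training input for the Doc2Vec model.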
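
To make the shape of the training data concrete, the sketch below builds the same kind of words/tags pairs using only the standard library. `TaggedDoc` and `simple_tokenize` are hypothetical stand-ins introduced for illustration; they are not Gensim's `TaggedDocument` or NLTK's tokenizer, and the regex tokenizer is much cruder than `nltk.word_tokenize`.

```python
import re
from collections import namedtuple

# Hypothetical stand-in with the same two fields used in this notebook
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])


def simple_tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


sample = ["A fox is quick and brown", "Dogs and foxes are animals"]

# One tagged document per input string; the tag is the index as a string
tagged = [
    TaggedDoc(words=simple_tokenize(doc), tags=[str(i)])
    for i, doc in enumerate(sample)
]

for doc in tagged:
    print(doc.tags, doc.words)
```

Note that `tags` is a list: a document can carry more than one tag, although this notebook gives each document exactly one.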


In [None]:
# Tag and tokenize documents: TaggedDocument(words, [tag])
# The tag is the document's index in the corpus, stored as a string
tagged_data = [
    TaggedDocument(words=nltk.word_tokenize(doc.lower()), tags=[str(i)])
    for i, doc in enumerate(corpus)
]
print('Tagged and tokenized documents:', tagged_data[:2])  # Show first two for illustration


## Building Vocabulary and Initializing DM Model

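Initialize the Doc2Vec DM model with the chosen parameters, then build its vocabulary from the tagged data. Setting `dm=1` selects the Distributed Memory architecture, and `min_count=1` keeps every word, which suits a corpus this small. Using `workers=1` with a fixed `seed` makes runs easier to reproduce.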
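
As a rough picture of what vocabulary building involves, the sketch below counts tokens and drops those that occur fewer than `min_count` times. The function `build_vocab_sketch` and the sample token lists are assumptions made for illustration; Gensim's `build_vocab` does considerably more, including registering the document tags.

```python
from collections import Counter


def build_vocab_sketch(token_lists, min_count=1):
    """Count tokens across all documents and keep those seen at least
    min_count times, indexed from most to least frequent."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    kept = [word for word, count in counts.most_common() if count >= min_count]
    return {word: index for index, word in enumerate(kept)}, counts


docs = [
    ["a", "fox", "is", "quick", "and", "brown"],
    ["foxes", "are", "clever", "and", "quick"],
    ["dogs", "are", "loyal", "and", "friendly"],
]

vocab_all, counts = build_vocab_sketch(docs, min_count=1)
vocab_frequent, _ = build_vocab_sketch(docs, min_count=2)

print("min_count=1 keeps", len(vocab_all), "words")
print("min_count=2 keeps", sorted(vocab_frequent))
```

Raising `min_count` on a tiny corpus quickly removes most of the vocabulary, which is why the notebook keeps it at 1.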


In [None]:
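# Initialize the DM model, then build its vocabulary
vector_size = 50  # dimensionality of document and word vectors
window = 2        # maximum distance between a target word and its context words
min_count = 1     # keep every word; the corpus is tiny
workers = 1       # single worker thread, for more reproducible runs
epochs = 40       # number of passes over the corpus
seed = 42
dm = 1  # Distributed Memory

doc2vec_dm = Doc2Vec(
    vector_size=vector_size,
    window=window,
    min_count=min_count,
    workers=workers,
    epochs=epochs,
    seed=seed,
    dm=dm,
)
doc2vec_dm.build_vocab(tagged_data)
print('Vocabulary size:', len(doc2vec_dm.wv))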


## Training the Doc2Vec DM Model

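Train the Doc2Vec DM model on the tagged and tokenized corpus. This step learns the document and word vectors together. The call to `train` needs the number of documents and the number of epochs passed explicitly; both are read from the model here.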


In [None]:
# Train the DM model
doc2vec_dm.train(tagged_data, total_examples=doc2vec_dm.corpus_count, epochs=doc2vec_dm.epochs)
print('Model trained.')


## Extracting Document Vectors and Finding Similar Documents

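After training, you can look up the learned vector for any training document by its tag through `doc2vec_dm.dv`. Calling `most_similar` ranks the other documents by cosine similarity to that vector and returns `(tag, similarity)` pairs. With only six short documents, treat the scores as an illustration of the API rather than a meaningful ranking.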
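
The ranking by cosine similarity can be sketched without any library. The functions `cosine_similarity` and `rank_by_similarity` and the two-dimensional `toy_vectors` are hypothetical, chosen so the result is easy to check by hand; they stand in for the learned document vectors, not for Gensim's own implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_tag, vectors):
    """Return (tag, similarity) pairs for every other tag, best match first."""
    query = vectors[query_tag]
    scored = [
        (tag, cosine_similarity(query, vec))
        for tag, vec in vectors.items()
        if tag != query_tag
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy document vectors keyed by tag (made up for illustration)
toy_vectors = {
    "0": [1.0, 0.0],
    "1": [0.9, 0.1],
    "2": [0.0, 1.0],
    "3": [-1.0, 0.0],
}

ranking = rank_by_similarity("0", toy_vectors)
print(ranking)
```

Cosine similarity depends only on the direction of the vectors, so a document vector pointing the same way as the query scores near 1, an orthogonal one scores 0, and an opposite one scores -1.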


In [None]:
# Extract vector for the first document and find most similar documents
first_doc_tag = '0'
vector = doc2vec_dm.dv[first_doc_tag]
print('Vector for first document:', vector)

similar_docs = doc2vec_dm.dv.most_similar(first_doc_tag)
print('Most similar documents to first document:', similar_docs)
