# Metadata

```yaml
Course:   DS 5001
Module:   10 Lab
Topic:    Doc2Vec
Author:   R.C. Alvarado
Date:     02 April 2023 (revised)
```

**Purpose:** Demonstrate use of Gensim's doc2vec implementation.

See https://www.tutorialspoint.com/gensim/gensim_doc2vec_model.htm#

> Doc2Vec model, as opposite to Word2Vec model, is used to create a vectorised representation of a group of words taken collectively as a single unit. It doesn’t only give the simple average of the words in the sentence.
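
For contrast, the "simple average" baseline that the quotation mentions can be sketched in a few lines. This is only an illustration, not part of the lab: the three-dimensional word vectors and the helper name `average_vector` are hypothetical, and a real baseline would take its word vectors from a trained model.

```python
# Hypothetical 3-dimensional word vectors (made up for illustration).
word_vectors = {
    "call": [0.1, 0.3, -0.2],
    "me": [0.0, -0.1, 0.4],
    "ishmael": [0.5, 0.2, 0.1],
}

def average_vector(words, vectors):
    """Average the vectors of the known words, dimension by dimension."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        raise ValueError("no known words to average")
    # zip(*known) regroups the values by dimension.
    return [sum(dim) / len(known) for dim in zip(*known)]

doc_vector = average_vector(["call", "me", "ishmael"], word_vectors)
print(doc_vector)
```

An average like this ignores word order and is computed after the fact, whereas Doc2Vec learns each document's vector during training alongside the word vectors.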



# Set Up

In [1]:
data_path = "../data"
corpus_prefix = 'austen-melville'
OHCO = ['book_id','chap_id','para_num','sent_num','token_num']
BAG = OHCO[:1] # BOOKS

In [2]:
import pandas as pd
import numpy as np
import gensim
import plotly.express as px

# Get Data

In [3]:
LIB = pd.read_csv(f"{data_path}/output/{corpus_prefix}-LIB.csv").set_index(['book_id'])
LIB['author_id'] = LIB.author.str.split(', ').str[0]
LIB['book_label'] = LIB.author_id + ' ' + LIB.index.astype('str') + ': ' + LIB.title.str[:20]

In [4]:
CORPUS = pd.read_csv(f"{data_path}/output/{corpus_prefix}-CORPUS.csv").set_index(OHCO)[['pos','term_str']]

In [5]:
DOCS = CORPUS.groupby(BAG)

In [6]:
DOCIDX = DOCS.term_str.count().index

# Convert to Gensim

We follow the Gensim recipe for converting our data from a DataFrame into a list of `TaggedDocument` objects, one per book.

In [7]:
data = DOCS.term_str.apply(lambda x: list(x)).to_list()

In [8]:
def tagged_document(list_of_list_of_words):
    # Yield one TaggedDocument per word list, tagged with its position in the list
    for i, list_of_words in enumerate(list_of_list_of_words):
        yield gensim.models.doc2vec.TaggedDocument([str(w) for w in list_of_words], [i])

In [9]:
data_for_training = list(tagged_document(data))

In [10]:
# data_for_training[:1]
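
The shape of the conversion above can be sketched without pandas or Gensim. This is a minimal illustration under stated assumptions: `TaggedDoc` is a stand-in namedtuple with the same two fields (`words`, `tags`) as Gensim's `TaggedDocument`, and the token rows and book ids are hypothetical.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument (words, tags),
# used only so this sketch runs without gensim installed.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

# Hypothetical token rows: (book_id, term_str), already in reading order.
tokens = [
    (2701, "call"), (2701, "me"), (2701, "ishmael"),
    (105, "sir"), (105, "walter"), (105, "elliot"),
]

# Group tokens into one word list per book, preserving token order.
bags = {}
for book_id, term in tokens:
    bags.setdefault(book_id, []).append(term)

# Sort by book_id to mirror the default ordering of a pandas groupby.
doc_ids = sorted(bags)

# Tag each document with its integer position, as the enumerate() call does.
tagged = [TaggedDoc(bags[b], [i]) for i, b in enumerate(doc_ids)]

for doc in tagged:
    # doc_ids maps the integer tag back to the original book_id.
    print(doc.tags[0], doc_ids[doc.tags[0]], doc.words)
```

Because the tags are positions rather than book ids, a list such as `doc_ids` (the role `DOCIDX` plays in this notebook) is what links each learned vector back to its book.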

# Generate Model

In [11]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=40, min_count=2, epochs=30)

In [12]:
model.build_vocab(data_for_training)

In [13]:
model.train(data_for_training, total_examples=model.corpus_count, epochs=model.epochs)

# Document Embedding Matrix

In [14]:
# model.docvecs.vectors_docs

In [15]:
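# Rows follow the order of DOCIDX, so look up the labels in that same order.
# Note: in Gensim 4+, the equivalent attribute is model.dv.vectors.
X = pd.DataFrame(model.docvecs.vectors_docs, index=LIB.loc[DOCIDX].book_label)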

In [16]:
import sys
sys.path.append("../lib")
from hac2 import HAC

In [17]:
dv_tree = HAC(X)
dv_tree.color_thresh = 1
dv_tree.plot()

# Try Out

In [None]:
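# Lowercase the input on the assumption that term_str in the training corpus is lowercased;
# otherwise capitalized words such as "Pacific" would fall outside the vocabulary.
r1 = model.infer_vector("We went sailing on the Pacific".lower().split())
r2 = model.infer_vector("I so enjoyed the visit to Bath".lower().split())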

In [None]:
R = pd.DataFrame(dict(r1=r1, r2=r2))

In [None]:
R.style.background_gradient(cmap='YlGnBu', axis=None)

In [None]:
R['w'] = X.sum().abs()

In [None]:
R

In [None]:
# X.sum().to_list()

In [None]:
px.scatter(R.reset_index(), 'r1', 'r2', height=600, width=700, text='index', size='w')

In [None]:
(R.r1 - R.r2).sort_values().plot.barh(figsize=(5,10));
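
The dimension-by-dimension comparison above can also be summarized as a single number. One common choice, sketched here as an assumption rather than as part of the original lab, is cosine similarity. The helper name `cosine_similarity` and the short example vectors are hypothetical stand-ins for `r1` and `r2`.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical 4-dimensional stand-ins for the inferred vectors r1 and r2.
v1 = [0.2, -0.1, 0.4, 0.0]
v2 = [0.1, -0.3, 0.5, 0.2]
print(round(cosine_similarity(v1, v2), 3))
```

In the notebook this would be called as `cosine_similarity(r1, r2)`. Values near 1 indicate vectors pointing in similar directions. Since `infer_vector` involves random initialization, the result may vary somewhat between runs.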

In [None]:
mode