# <center> <font size = 24 color = 'steelblue'> <b> Doc2Vec

## Overview: 

The goal is to implement paragraph vector models for text representation by preparing the data and leveraging two key approaches: Distributed Bag of Words (DBoW) and Distributed Memory (PV-DM). These methods capture semantic information to create meaningful paragraph embeddings for downstream NLP tasks.

<div class="alert alert-block alert-info">
    
<font size = 4>

- Demonstration of Doc2Vec using a custom corpus

# <a id= 'dv0'>
<font size = 4>
    
**Table of contents:**<br>
[1. Install and import the requirements](#dv1)<br>
[2. Preparing the data](#dv2)<br>
[3. Distributed bag of words version of paragraph vector (DBoW)](#dv3)<br>
[4. Distributed memory version of paragraph vector (PV-DM)](#dv4)<br>

##### <a id = 'dv1'>
<font size = 10 color = 'midnightblue'> <b>Install and import the requirements

In [1]:
!pip install gensim==4.2.0
!pip install nltk==3.8.1
!pip install spacy==3.5.1



<font size = 5 color = seagreen> <b>Import necessary packages

In [1]:
import nltk
from nltk.tokenize import word_tokenize
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [2]:
# To suppress warning messages
import warnings
warnings.filterwarnings('ignore')

<font size = 5 color = seagreen><b> Download the necessary corpora.

In [3]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /voc/work/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

[top](#dv0)

##### <a id = 'dv2'>
<font size = 10 color = 'midnightblue'> <b>Preparing the data

<font size = 5 color = pwdrblue><b> Define the documents

In [4]:
documents = ["The analyst reviews the dataset to identify trends and patterns."
             "Data analysis helps businesses make informed decisions based on facts and figures.",
             "In a research project the team gathers data for subsequent analysis.",
             "Charts and graphs are used to visually represent the results of data analysis.",
             "Analyzing customer feedback data provides valuable insights for product improvement."]

<font size = 5 color = pwdrblue><b>Create tagged documents:
<div class="alert alert-block alert-success">
    
<font size = 4>
    
- The TaggedDocument function represents document along with a tag.
- This generates data in the acceptable input format for the Doc2Vec function.

In [5]:
tagged_data = [TaggedDocument(words=word_tokenize(word.lower()), tags=[str(i)]) for i, word in enumerate(documents)]

In [6]:
print(tagged_data[1])

TaggedDocument<['in', 'a', 'research', 'project', 'the', 'team', 'gathers', 'data', 'for', 'subsequent', 'analysis', '.'], ['1']>


[top](#dv0)

##### <a id = 'dv3'>
<font size = 10 color = 'midnightblue'> <b> Distributed bag of words version of paragraph vector (DBoW)

<div class="alert alert-block alert-success">
    
<font size = 4>
    
- The model is trained to predict words randomly sampled from the paragraph (document) it is processing, without using the word order information.

<font size = 5 color = pwdrblue><b>  Create the model object with tagged data

In [7]:
dbow_model = Doc2Vec(tagged_data,vector_size=20, min_count=1, epochs=2,dm=0)

<font size = 5 color = pwdrblue><b>  Get feature vector for : "***Data analysis identifies trends and patterns.***"

In [8]:
print(dbow_model.infer_vector(["Data", "analysis", "identifies", "trends", "and", "patterns"]))

[-0.02037955 -0.01731416 -0.01322459 -0.01250527  0.01428972 -0.01562179
 -0.00683926 -0.0102631   0.00712917  0.01492114  0.01172474 -0.00732228
 -0.02340268  0.01833704 -0.00616148  0.011989    0.01559411  0.02362764
 -0.00038035  0.0149787 ]


<font size = 5 color = pwdrblue><b>  Get top 5 most simlar words.

In [9]:
dbow_model.wv.most_similar("analysis", topn=5)

[('subsequent', 0.5829257965087891),
 ('reviews', 0.38686245679855347),
 ('dataset', 0.3179265260696411),
 ('insights', 0.26625779271125793),
 ('research', 0.19855976104736328)]

<font size = 5 color = pwdrblue><b>  Get the cosine similarity between the two sets of documents.

In [10]:
dbow_model.wv.n_similarity(["data", "analysis"],["insights"])

0.20002559

[top](#dv0)

##### <a id = 'dv4'>
<font size = 10 color = 'midnightblue'> <b> Distributed memory version of paragraph vector (PV-DM)

The Distributed Memory Model of Paragraph Vectors (PV-DM) is one of the two approaches introduced by Le and Mikolov in their 2014 paper on Paragraph Vector (Doc2Vec) models. PV-DM is a powerful method for generating dense, fixed-length vector representations of texts (like sentences, paragraphs, or entire documents) by leveraging the surrounding context of words, making it ideal for downstream NLP tasks like classification, clustering, or semantic search.

## Understanding PV-DM
The PV-DM model is conceptually similar to Word2Vec's Continuous Bag of Words (CBOW) approach, with a key enhancement: it takes into account the document's unique identifier (a "paragraph vector") alongside the context words when predicting a target word. This helps the model capture both local word semantics and global document-level information.

## Key Features of PV-DM
- Context-Aware: By leveraging both word context and paragraph vectors, PV-DM captures both local (word-level) and global (document-level) semantics.
- Flexible Length: Unlike fixed-length features (e.g., Bag-of-Words), PV-DM can generate embeddings for texts of any length.
- Document-Level Memory: The paragraph vector provides a "memory" of the document, which helps retain information even if certain words are missing in the context.

## Example Use Cases
- Document Classification: PV-DM embeddings can be used to classify documents based on their content, such as spam detection or sentiment analysis.
- Semantic Similarity: Finding similar documents by comparing paragraph vectors can be useful in search engines or recommendation systems.
- Text Clustering: Grouping similar documents together for organizing large text corpora.

<font size = 5 color = pwdrblue><b>  Create model object

In [11]:
dm_model = Doc2Vec(tagged_data, min_count=1, vector_size=20, epochs=2,dm=1)

<font size = 5 color = pwdrblue><b>  Get feature vector for : "***Data analysis identifies trends and patterns.***"

In [12]:
print(dm_model.infer_vector(["Data", "analysis", "identifies", "trends", "and", "patterns"]))


[-0.02037944 -0.01731404 -0.01322444 -0.01250534  0.0142896  -0.01562171
 -0.00683909 -0.01026298  0.00712932  0.01492122  0.01172474 -0.00732237
 -0.02340247  0.01833722 -0.00616135  0.01198908  0.01559412  0.02362749
 -0.00038012  0.01497868]


<font size = 5 color = pwdrblue><b>  Get top5 most similar keys to given word

In [13]:
dm_model.wv.most_similar("analysis",topn=5)


[('subsequent', 0.5829257965087891),
 ('reviews', 0.386778861284256),
 ('dataset', 0.31784698367118835),
 ('insights', 0.2662583887577057),
 ('research', 0.19845950603485107)]

In [14]:
dm_model.wv.n_similarity(["data", "analysis"],["insights"])

0.20011897

<div class="alert alert-block alert-success">
    
<font size = 4>

<center> <b> What happens when we compare between words which are not in the vocabulary?

In [15]:
dm_model.wv.n_similarity(['covid'],['data'])

0.0

<div class="alert alert-block alert-success">
    
<font size = 4>
    
<center> <b>If the word is not in vocabulary the similarity score with other words will be zero.


[top](#dv0)