![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Open_Source/16.01.Doc2Vec.ipynb)

# **Doc2Vec**

This notebook will cover the different parameters and usages of `Doc2Vec`.

**📖 Learning Objectives:**

1. Doc2Vec is an unsupervised algorithm for learning a vector representation of a document. It is commonly used for tasks such as text classification, text similarity, and clustering. 
2. Doc2Vec uses a shallow neural network to generate a fixed-length feature vector for each document, which can be used as input to a classifier or other machine learning algorithms.
3. Spark NLP's `Doc2VecApproach` annotator uses Spark ML's Word2Vec behind the scenes to train the model; this notebook uses the pretrained `Doc2VecModel`.

**🔗 Helpful Links:** 


- doc2vec_gigaword_wiki_300 : [doc2vec_gigaword_wiki_300](https://nlp.johnsnowlabs.com/2021/11/21/doc2vec_gigaword_wiki_300_en.html)  
- doc2vec_gigaword_300 : [doc2vec_gigaword_300](https://nlp.johnsnowlabs.com/2021/11/21/doc2vec_gigaword_300_en.html)  

## **📜 Background**

`Doc2VecApproach` trains a Word2Vec model that creates vector representations of words in a text corpus.

The algorithm first constructs a vocabulary from the corpus and then learns vector representations of the words in that vocabulary. These vector representations can be used as features in natural language processing and machine learning algorithms.

We use the Word2Vec implementation in Spark ML. It uses the skip-gram model and the hierarchical softmax method to train the model. The variable names in the implementation match the original C implementation.

For instantiated/pretrained models, see Doc2VecModel.
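
To make the idea of a fixed-length document vector concrete, here is a minimal sketch in plain Python that averages word vectors, which is how Spark ML's `Word2VecModel` turns a document into a vector. The tiny `word_vectors` table and the `document_vector` helper are hypothetical and exist only for illustration; skipping unknown words and returning zeros for an empty result are assumptions of this sketch, not a description of Spark NLP internals.

```python
# Hypothetical 3-dimensional word vectors; a real model learns these from a corpus.
word_vectors = {
    "holmes": [0.2, 0.4, -0.1],
    "crime": [0.6, 0.0, 0.3],
    "study": [0.1, 0.2, 0.1],
}

def document_vector(tokens, vectors, size=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * size
    # zip(*known) groups the values of each dimension together.
    return [sum(dim) / len(known) for dim in zip(*known)]

doc_vec = document_vector(["holmes", "study", "crime", "unknown"], word_vectors)
print(doc_vec)
```

Whatever the number of tokens, the result always has the same length as the word vectors, which is what makes it usable as a feature vector.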

## **🎬 Colab Setup**

In [None]:
!pip install -q pyspark==3.1.2  spark-nlp==4.2.4

In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.sql import functions as F
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline  # used explicitly when building the pipelines below

spark = sparknlp.start()
spark

## **🔎 Parameters**


- `vectorSize`: (Int) The dimension of the codes after transforming from words (> 0).
- `dimension`: (Int) The number of embedding dimensions.


In [None]:
text = """I have seldom heard him mention her under any other name. In his eyes she eclipses and predominates the whole of her sex. 
It was not that he felt any emotion akin to love for Irene Adler. All emotions, and that one particularly, were abhorrent to his cold, 
precise but admirably balanced mind. He was, I take it, the most perfect reasoning and observing machine that the world has seen, 
but as a lover he would have placed himself in a false position. He never spoke of the softer passions, save with a gibe and a sneer. 
They were admirable things for the observer--excellent for drawing the veil from men's motives and actions. But for the trained reasoner 
to admit such intrusions into his own delicate and finely adjusted temperament was to introduce a distracting factor which might throw a 
doubt upon all his mental results. Grit in a sensitive instrument, or a crack in one of his own high-power lenses, would not be more 
disturbing than a strong emotion in a nature such as his. And yet there was but one woman to him, and that woman was the late Irene Adler, 
of dubious and questionable memory. I had seen little of Holmes lately. My marriage had drifted us away from each other. My own complete happiness, 
and the home-centred interests which rise up around the man who first finds himself master of his own establishment, were sufficient to absorb 
all my attention, while Holmes, who loathed every form of society with his whole Bohemian soul, remained in our lodgings in Baker Street, 
buried among his old books, and alternating from week to week between cocaine and ambition, the drowsiness of the drug, and the fierce energy of his own 
keen nature. He was still, as ever, deeply attracted by the study of crime, and occupied his immense faculties and extraordinary powers of observation 
in following out those clues, and clearing up those mysteries which had been abandoned as hopeless by the official police. From time to time 
I heard some vague account of his doings: of his summons to Odessa in the case of the Trepoff murder, of his clearing up of the singular 
tragedy of the Atkinson brothers at Trincomalee, and finally of the mission which he had accomplished so delicately and successfully for the reigning family of Holland."""

## Define pipeline stages 

In [None]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = SentenceDetector() \
    .setInputCols(["document"])\
    .setOutputCol("sentence")

token = Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")

norm = Normalizer()\
    .setInputCols(["token"])\
    .setOutputCol("normalized")\
    .setLowercase(True)

stops = StopWordsCleaner.pretrained()\
    .setInputCols("normalized")\
    .setOutputCol("cleanedToken")

doc2Vec = Doc2VecModel.pretrained("doc2vec_gigaword_wiki_300", "en")\
    .setInputCols("cleanedToken")\
    .setOutputCol("sentence_embeddings")


pipeline = Pipeline() \
    .setStages([
      document,
      sentencer,
      token,
      norm,
      stops,
      doc2Vec
    ])

In [None]:
data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          normalized|        cleanedToken| sentence_embeddings|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|I have seldom hea...|[{document, 0, 22...|[{document, 0, 56...|[{token, 0, 0, I,...|[{token, 0, 0, i,...|[{token, 7, 12, s...|[{sentence_embedd...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [None]:
result_df = result.select(F.explode(F.arrays_zip(result.sentence.result,
                                                 result.sentence_embeddings.embeddings)).alias("cols")) \
                  .select(F.expr("cols['0']").alias("sentence"),                    
                          F.expr("cols['1']").alias("sentence_embeddings"))

result_df.show(30, truncate=30)

+------------------------------+------------------------------+
|                      sentence|           sentence_embeddings|
+------------------------------+------------------------------+
|I have seldom heard him men...|[-2.1733281E-4, 0.04999, -0...|
|In his eyes she eclipses an...|[0.01971875, 0.0031599998, ...|
|It was not that he felt any...|[-0.0044934996, 0.050951, 0...|
|All emotions, and that one ...|[0.054003004, 0.028325144, ...|
|He was, I take it, the most...|[-0.0075171245, 0.02983325,...|
|He never spoke of the softe...|[0.011385165, 0.016601002, ...|
|They were admirable things ...|[-0.006381749, 0.029029626,...|
|But for the trained reasone...|[0.016324531, 0.037999865, ...|
|Grit in a sensitive instrum...|[0.039239295, 0.0283199, -0...|
|And yet there was but one w...|[0.004118876, 0.033116125, ...|
|I had seen little of Holmes...|[0.0, 0.0, 0.0, 0.0, 0.0, 0...|
|My marriage had drifted us ...|[0.0412555, 0.038276, 0.010...|
|My own complete happiness, ...|[0.03671
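
The `arrays_zip` plus `explode` step in the query above pairs each sentence with its embedding and emits one row per pair. A plain-Python analogy, using invented values rather than real model output, may help in reading that query:

```python
# One DataFrame row holds parallel arrays: sentences and their embeddings.
# The values below are invented for illustration only.
row = {
    "sentences": ["First sentence.", "Second sentence."],
    "embeddings": [[0.1, 0.2], [0.3, 0.4]],
}

# arrays_zip: combine the parallel arrays element by element.
zipped = list(zip(row["sentences"], row["embeddings"]))

# explode: turn each zipped element into its own output row.
exploded = [{"sentence": s, "sentence_embeddings": e} for s, e in zipped]

for r in exploded:
    print(r)
```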

### `getVectorSize`:

- `getVectorSize()` returns the length of the vector that represents each document, that is, the number of elements in the numerical vector the model produces. For the pretrained `doc2vec_gigaword_wiki_300` model loaded above, the vector size is 300.

In [None]:
doc2Vec.getVectorSize()

300

### `getDimension`:

- `getDimension()` returns the number of embedding dimensions of the model. For this model it matches the vector size, so every document vector has 300 elements.

In [None]:
doc2Vec.getDimension()

300

# **EmbeddingsFinisher**

Extracts embeddings from Annotations into a more easily usable form.

**📖 Learning Objectives:**

1. With `EmbeddingsFinisher` you can easily transform your embeddings into arrays of floats or vectors that are compatible with Spark ML functions such as LDA, K-means, a Random Forest classifier, or any other function that requires a `featuresCol`.

2. It works on the word, sentence, and document embeddings produced by upstream annotators.

3. The finished embeddings can be used as features for text classification, sentiment analysis, and other natural language processing tasks.

**🔗 Helpful Links:**

- Documentation : [Spark NLP Workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.1_Text_classification_examples_in_SparkML_SparkNLP.ipynb).

- WordEmbeddings : [WordEmbeddings](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/embeddings/WordEmbeddings.html). 

- BertEmbeddings : [BertEmbeddings](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/embeddings/BertEmbeddings.html).

- SentenceEmbeddings : [SentenceEmbeddings](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/embeddings/SentenceEmbeddings.html).

- ChunkEmbeddings : [ChunkEmbeddings](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/embeddings/ChunkEmbeddings.html).


## **🖨️ Input/Output Annotation Types**

- Input: `EMBEDDINGS`

- Output: `NONE`
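
Conceptually, the finisher takes the `embeddings` field out of each annotation and drops the rest of the annotation structure. The sketch below imitates that with plain dictionaries; the `finish_embeddings` helper and the sample annotations are hypothetical and are not Spark NLP's implementation.

```python
# Simplified stand-ins for Spark NLP annotations (all values are invented).
annotations = [
    {"annotatorType": "sentence_embeddings", "begin": 0, "end": 14,
     "result": "Oil prices rose", "metadata": {}, "embeddings": [0.5, -0.25, 0.125]},
    {"annotatorType": "sentence_embeddings", "begin": 16, "end": 30,
     "result": "Markets reacted", "metadata": {}, "embeddings": [0.0, 1.0, -1.0]},
]

def finish_embeddings(annots):
    """Keep only the numeric vectors, one list of floats per annotation."""
    return [list(a["embeddings"]) for a in annots]

finished = finish_embeddings(annotations)
print(finished)
```

The result is a plain list of float lists, one per annotation, with the positions, text, and metadata gone. That is the shape downstream Spark ML stages expect from a feature column.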

In [None]:
text = [""" Soaring crude prices plus worries about the economy and the outlook for earnings are expected to hang over the stock market next week during the depth of the summer doldrums.""",
        """ Authorities have halted oil export flows from the main pipeline in southern Iraq after intelligence showed a rebel militia could strike infrastructure, an oil official said on Saturday.""",
        """ Tearaway world oil prices, toppling records and straining wallets, present a new economic menace barely three months before the US presidential elections."""
        ]

In [None]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = SentenceDetector() \
    .setInputCols(["document"])\
    .setOutputCol("sentence")

token = Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")

norm = Normalizer()\
    .setInputCols(["token"])\
    .setOutputCol("normalized")\
    .setLowercase(True)

stops = StopWordsCleaner.pretrained()\
    .setInputCols("normalized")\
    .setOutputCol("cleanedToken")

doc2Vec = Doc2VecModel.pretrained("doc2vec_gigaword_wiki_300", "en")\
    .setInputCols("cleanedToken")\
    .setOutputCol("sentence_embeddings")

embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols("sentence_embeddings") \
    .setOutputCols("finished_sentence_embeddings") \
    .setOutputAsVector(True) \
    .setCleanAnnotations(False)


pipeline = Pipeline() \
    .setStages([
      document,
      sentencer,
      token,
      norm,
      stops,
      doc2Vec,
      embeddingsFinisher
    ])

In [None]:
from pyspark.sql.types import StringType

data = spark.createDataFrame(text,StringType()).toDF('text')

result = pipeline.fit(data).transform(data)

result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------------+
|                text|            document|            sentence|               token|          normalized|        cleanedToken| sentence_embeddings|finished_sentence_embeddings|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------------+
| Soaring crude pr...|[{document, 0, 17...|[{document, 1, 17...|[{token, 1, 7, So...|[{token, 1, 7, so...|[{token, 1, 7, so...|[{sentence_embedd...|        [[0.0492293350398...|
| Authorities have...|[{document, 0, 18...|[{document, 1, 18...|[{token, 1, 11, A...|[{token, 1, 11, a...|[{token, 1, 11, a...|[{sentence_embedd...|        [[0.0393050014972...|
| Tearaway world o...|[{document, 0, 15...|[{document, 1, 15...|[{token, 1, 8, Te...|[{token, 1, 8, te...|[{to

In [None]:
result.select("finished_sentence_embeddings").show(truncate=100)

+----------------------------------------------------------------------------------------------------+
|                                                                        finished_sentence_embeddings|
+----------------------------------------------------------------------------------------------------+
|[[0.04922933503985405,0.02283420041203499,0.019734332337975502,-0.029602468013763428,0.0288929343...|
|[[0.03930500149726868,0.006954222917556763,0.007209000643342733,-0.02824166789650917,0.0047421110...|
|[[0.014834600500762463,0.03857426345348358,-0.007008933927863836,-0.033520203083753586,0.01753439...|
+----------------------------------------------------------------------------------------------------+



In [None]:
result.select('finished_sentence_embeddings').take(1)

[Row(finished_sentence_embeddings=[DenseVector([0.0492, 0.0228, 0.0197, -0.0296, 0.0289, -0.0032, 0.0201, -0.0259, -0.0186, 0.0135, 0.026, -0.0076, 0.0111, -0.0159, 0.0111, 0.021, 0.0012, 0.0142, 0.0106, 0.0399, -0.0291, 0.0014, -0.0139, 0.0404, -0.0167, -0.0098, -0.0097, -0.0462, 0.029, -0.0037, 0.0202, 0.0006, -0.0655, -0.0178, -0.011, -0.0101, -0.0196, -0.0369, 0.0077, 0.0011, 0.0177, 0.0236, 0.0122, -0.0252, -0.0242, -0.0136, -0.0331, 0.0121, 0.0557, 0.0009, 0.0166, -0.0253, 0.0383, -0.0302, 0.0086, -0.0279, 0.0116, 0.0061, 0.007, 0.019, -0.0666, -0.0202, 0.0017, 0.0549, -0.0082, 0.0044, 0.0608, -0.0409, -0.0051, 0.0247, -0.0125, 0.0094, -0.0246, -0.0099, 0.0467, 0.021, 0.0051, -0.0508, 0.0194, 0.0118, -0.1017, 0.012, -0.0081, -0.0233, 0.0286, -0.0132, -0.0434, -0.0074, 0.004, -0.0176, -0.0017, 0.0123, -0.0042, 0.0079, -0.0346, -0.0167, -0.0743, -0.0041, 0.002, 0.0137, 0.0047, 0.054, -0.0575, -0.0245, -0.0131, -0.0078, -0.0153, 0.0192, -0.0265, 0.016, 0.0251, 0.0012, -0.0052, 0.030

In [None]:
result.select("text", "finished_sentence_embeddings").show()

+--------------------+----------------------------+
|                text|finished_sentence_embeddings|
+--------------------+----------------------------+
| Soaring crude pr...|        [[0.0492293350398...|
| Authorities have...|        [[0.0393050014972...|
| Tearaway world o...|        [[0.0148346005007...|
+--------------------+----------------------------+



# **Sample Use Case: Text Similarity Using Doc2Vec**

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial import distance

import numpy as np 

In [None]:
# Collect the three rows once instead of running three separate Spark jobs
rows = result.select("finished_sentence_embeddings").take(3)
sent1, sent2, sent3 = (row[0] for row in rows)

In [None]:
x = np.stack([sent1, sent2, sent3])

In [None]:
x = x.reshape(3, 300)

In [None]:
sk_sim = cosine_similarity(x,x)
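
`cosine_similarity(x, x)` returns a 3×3 matrix whose entry (i, j) is the cosine of the angle between document vectors i and j, so the diagonal is 1. Here is a minimal pure-Python sketch of the same formula, using short invented vectors in place of the 300-dimensional embeddings:

```python
import math

def cosine(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented 2-dimensional vectors standing in for document embeddings.
docs = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]

# Pairwise similarity matrix, analogous to cosine_similarity(x, x).
sim = [[cosine(u, v) for v in docs] for u in docs]
for row in sim:
    print([round(value, 3) for value in row])
```

This sketch does not guard against all-zero vectors, which would cause a division by zero; real data may need that check.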

In [None]:
result = result.toPandas()

In [None]:
import plotly.express as px
documents = result.text.values
print(f"Document 1: {documents[0]}")
print(f"Document 2: {documents[1]}")
print(f"Document 3: {documents[2]}")
labels = ["Document 1", "Document 2", "Document 3"]
fig = px.imshow(sk_sim, x=labels, y=labels, title="Cosine Similarity Matrix calculated via numpy engine and sklearn - Documentwise")
fig.show()

Document 1:  Soaring crude prices plus worries about the economy and the outlook for earnings are expected to hang over the stock market next week during the depth of the summer doldrums.
Document 2:  Authorities have halted oil export flows from the main pipeline in southern Iraq after intelligence showed a rebel militia could strike infrastructure, an oil official said on Saturday.
Document 3:  Tearaway world oil prices, toppling records and straining wallets, present a new economic menace barely three months before the US presidential elections.
