![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Open_Source/15.01.Word2Vec.ipynb)

# **Word2Vec**

This notebook covers the parameters and usage of the `Word2Vec` annotator. The annotator comes in two versions: `Word2VecApproach`, which trains a model that creates vector representations of the words in a text corpus, and `Word2VecModel`, the resulting trained model.

**📖 Learning Objectives:**

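1. Understand how `Word2Vec` learns vector representations of words.

2. Understand the difference between `Word2VecApproach` and `Word2VecModel`.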

3. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [Word2Vec](https://nlp.johnsnowlabs.com/docs/en/annotators#word2vec)

- Python Docs : [Word2VecApproach](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/embeddings/word2vec/index.html#sparknlp.annotator.embeddings.word2vec.Word2VecApproach) and [Word2VecModel](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/embeddings/word2vec/index.html#sparknlp.annotator.embeddings.word2vec.Word2VecModel)

- Scala Docs : [Word2VecApproach](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/embeddings/Word2VecApproach) and [Word2VecModel](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/embeddings/Word2VecModel)

- Original C Implementation: [Word2Vec](https://code.google.com/archive/p/word2vec/)

- Research Papers: [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781) and [Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/pdf/1310.4546v1.pdf)


## **📜 Background**


The algorithm first constructs a vocabulary from the corpus and then learns vector representations of the words in that vocabulary. These vector representations can be used as features in natural language processing and machine learning algorithms.
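
As a rough illustration of using word vectors as features, the sketch below averages the vectors of the known tokens in a sentence into one fixed-size feature vector. The vectors in `toy_vectors` are made up, and `sentence_features` is a hypothetical helper written for this example; neither comes from Spark NLP.

```python
# Toy 3-dimensional word vectors; the values are invented for illustration.
toy_vectors = {
    "king": [0.5, 0.1, 0.3],
    "queen": [0.4, 0.2, 0.3],
    "apple": [0.0, 0.9, 0.1],
}


def sentence_features(tokens, vectors, dim=3):
    """Average the vectors of known tokens into one fixed-size feature vector."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # No token is in the vocabulary: fall back to an all-zero vector.
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# "unknown" is not in the toy vocabulary, so only "king" and "queen" are averaged.
features = sentence_features(["king", "queen", "unknown"], toy_vectors)
print(features)
```

Averaging is only one simple way to turn token vectors into a feature vector that a downstream classifier could consume.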

The annotator uses the Word2Vec implementation from Spark ML. The Spark NLP implementation uses the skip-gram model and trains it with hierarchical softmax. The variable names in the implementation match those of the original C implementation.

`Word2VecApproach` can be used to train your own model. `Word2VecModel` is the trained model that `Word2VecApproach` produces.

Pretrained models can be loaded with `Word2VecModel.pretrained()`. If no name is provided, the default model is `word2vec_gigaword_300`.

Pretrained models for several languages and in various dimensions are listed on the [Models Hub](https://nlp.johnsnowlabs.com/models?q=Word2Vec).

## **🎬 Colab Setup**

In [None]:
!pip install -q pyspark==3.1.2 spark-nlp==4.2.4



In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

spark = sparknlp.start()

## **🖨️ Input/Output Annotation Types**

- Input: `TOKEN`

- Output: `WORD_EMBEDDINGS`

## **🔎 Parameters Word2VecApproach**


- `enableCaching`: (BooleanParam) --> Whether to enable caching DataFrames or RDDs during the training.

- `maxIter`: (IntParam) --> Maximum number of training iterations (>= 0) (Default: 1).

- `maxSentenceLength`: (IntParam) --> Sets the maximum length (in words) of each sentence in the input data (Default: 1000).

- `minCount`: (IntParam) --> The minimum number of times a token must appear to be included in the word2vec model's vocabulary (Default: 5).

- `numPartitions`: (IntParam) --> Number of partitions for sentences of words (Default: 1).

- `seed`: (IntParam) --> Random seed for shuffling the dataset (Default: 44).

- `stepSize`: (DoubleParam) --> Step size to be used for each iteration of optimization (> 0) (Default: 0.025).

- `storageRef`: (Param[String]) --> Unique identifier for storage (Default: this.uid).

- `vectorSize`: (IntParam) --> Dimension of the vector that each word is transformed into (Default: 100).

- `windowSize`: (IntParam) --> Window size (context words from [-window, window]) (Default: 5).
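
Each parameter listed above is set through a setter whose name follows the pattern `set` + the capitalized parameter name, for example `setVectorSize()`. The sketch below uses that naming pattern to apply a dictionary of values to an annotator. `apply_params` is a hypothetical helper written for this example, not part of Spark NLP, and the values are illustrative rather than recommended.

```python
def apply_params(annotator, params):
    """Call set<ParamName>(value) on the annotator for each entry in params."""
    for name, value in params.items():
        setter_name = "set" + name[0].upper() + name[1:]
        getattr(annotator, setter_name)(value)
    return annotator


# Illustrative (hypothetical) values for a small training run.
training_params = {
    "vectorSize": 50,
    "windowSize": 3,
    "minCount": 2,
    "maxIter": 5,
    "seed": 42,
}
```

With Spark NLP installed, `apply_params(Word2VecApproach(), training_params)` should have the same effect as chaining `.setVectorSize(50).setWindowSize(3).setMinCount(2).setMaxIter(5).setSeed(42)` by hand.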

## **Word2VecApproach Example Pipeline**

In [None]:
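# Turn raw text into `document` annotations
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Split each document into tokens
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Trainable Word2Vec annotator, using the default parameters
embeddings = Word2VecApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline() \
    .setStages([
      documentAssembler,
      tokenizer,
      embeddings
    ])

# Training corpus, one line of text per row. This path is relative to the
# Spark NLP repository; point it at a text file available in your environment.
path = "src/test/resources/spell/sherlockholmes.txt"
dataset = spark.read.text(path).toDF("text")

# Fitting the pipeline trains the Word2Vec model
pipelineModel = pipeline.fit(dataset)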

### `setVectorSize()`

### `setWindowSize()`
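
The window size controls which neighbouring words count as context for a given word: those at offsets in `[-window, window]`. The sketch below is a simplified, plain-Python illustration of how skip-gram training pairs could be generated for a given window. `skipgram_pairs` is a hypothetical helper, and the reference implementation also shrinks the window randomly at each position, which is omitted here.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for context offsets in [-window, window]."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = ["the", "quick", "brown", "fox"]
narrow = skipgram_pairs(tokens, window=1)
wide = skipgram_pairs(tokens, window=2)
print(len(narrow), len(wide))
```

A larger window yields more pairs per word, which tends to capture broader topical context at the cost of more training work.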

### `setStepSize()`

### `setNumPartitions()`

### `setMaxIter()`

### `setMinCount()`
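
`minCount` is the minimum number of times a token must appear to be included in the vocabulary. The sketch below mimics that filtering step in plain Python on a toy corpus. `build_vocabulary` is a hypothetical helper for illustration and is not how Spark NLP implements it internally.

```python
from collections import Counter


def build_vocabulary(sentences, min_count):
    """Keep only tokens that occur at least min_count times in the corpus."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token: n for token, n in counts.items() if n >= min_count}


# Toy corpus: "a" occurs 3 times, "b" twice, "c" once.
sentences = [["a", "b", "a"], ["a", "c", "b"]]
vocab = build_vocabulary(sentences, min_count=2)
print(vocab)
```

On a small corpus, the default of 5 can remove most of the vocabulary, so a lower value may be worth trying.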

### `setMaxSentenceLength()`
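
In the underlying Spark ML implementation, a sentence longer than `maxSentenceLength` is divided into chunks of up to that many words. The sketch below shows that chunking in plain Python. `split_sentence` is a hypothetical helper written for this illustration.

```python
def split_sentence(tokens, max_sentence_length):
    """Divide a token list into consecutive chunks of at most max_sentence_length."""
    return [
        tokens[i:i + max_sentence_length]
        for i in range(0, len(tokens), max_sentence_length)
    ]


chunks = split_sentence(["w1", "w2", "w3", "w4", "w5"], max_sentence_length=2)
print(chunks)
```

Because context windows do not cross chunk boundaries in this sketch, a very small value would cut off context for words near the edges of each chunk.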

## **🔎 Parameters Word2VecModel**

- `dimension`: (IntParam) --> Number of embedding dimensions (Default depends on model).

- `storageRef`: (Param[String]) --> Unique identifier for storage (Default: this.uid).

- `vectorSize`: (IntParam) --> Dimension of the vector that each word is transformed into (> 0) (Default: 100).

- `wordVectors`: (MapFeature[String, Array[Float]]) --> Dictionary of words with their vectors.
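
Conceptually, `wordVectors` is a lookup table from a word to its vector. The plain-Python sketch below shows such a lookup together with a cosine similarity between two vectors. The vectors are invented, `lookup` and `cosine_similarity` are hypothetical helpers, and returning an all-zero vector for out-of-vocabulary words is an assumption made for this example.

```python
import math

# Toy 2-dimensional vectors; the values are invented for illustration.
toy_word_vectors = {"cat": [1.0, 0.0], "dog": [0.8, 0.6]}


def lookup(word, word_vectors, dim):
    """Return the vector for a word, or zeros if it is out of vocabulary (assumption)."""
    return word_vectors.get(word, [0.0] * dim)


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


similarity = cosine_similarity(
    lookup("cat", toy_word_vectors, 2),
    lookup("dog", toy_word_vectors, 2),
)
print(similarity)
```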


## **Word2VecModel Example Pipeline**

In [None]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = Word2VecModel.pretrained() \
    .setInputCols(["token"]) \
    .setOutputCol("embeddings")

embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols(["embeddings"]) \
    .setOutputCols("finished_embeddings") \
    .setOutputAsVector(True)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    embeddings,
    embeddingsFinisher
])
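
# Single-row DataFrame with the text to embed
data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
result = pipeline.fit(data).transform(data)

# Show the first token's embedding, truncating the column to 80 characters
result.selectExpr("explode(finished_embeddings) as result").show(1, 80)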