#### Definition:

Word2Vec is a neural-network-based technique for learning distributed representations of words in a continuous vector space. Words that appear in similar contexts end up with similar vectors, so distances between vectors capture semantic relationships between words.

#### Types:

1. Continuous Bag of Words (CBOW): Predicts the target word (center word) based on the context words (surrounding words).
2. Skip-gram: Predicts the context words based on the target word (center word).
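
To make the difference between the two architectures concrete, here is a minimal sketch of the training examples each one would see for a single sentence. The helper `training_examples` and its `window` argument are hypothetical names used only for illustration; real implementations add details such as subsampling and negative sampling, which are omitted here.

```python
def training_examples(tokens, window=2):
    """Return (cbow, skipgram) training examples for one tokenized sentence."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        # Context = up to `window` words on each side of the center word.
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if not context:
            continue
        # CBOW: all context words together predict the one center word.
        cbow.append((context, target))
        # Skip-gram: the center word predicts each context word separately.
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram


tokens = ["i", "love", "natural", "language", "processing"]
cbow, skipgram = training_examples(tokens, window=1)

print(cbow[1])       # (['i', 'natural'], 'love')
print(skipgram[:3])  # [('i', 'love'), ('love', 'i'), ('love', 'natural')]
```

CBOW produces one example per center word, while skip-gram produces one example per (center, context) pair, so skip-gram sees more training examples from the same text.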

#### Use Cases:

1. Text Similarity: Finding similar words or documents.
2. Machine Translation: Providing word representations that help map words or phrases between languages.
3. Sentiment Analysis: Enhancing the feature space for sentiment classification.
4. Named Entity Recognition (NER): Improving the identification of entities in text.
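
As an illustration of the text similarity use case, the sketch below compares vectors with cosine similarity and builds a simple document vector by averaging word vectors. The three-dimensional `toy_vectors` are made-up values chosen for readability, not output from a trained model, and averaging is only one common baseline for representing a document.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def average_vector(tokens, vectors):
    """Average the vectors of the known tokens to get one document vector."""
    dim = len(next(iter(vectors.values())))
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


# Hypothetical toy vectors; a trained model would supply these instead.
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.3],
}

sim_cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
sim_cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])
print(sim_cat_dog > sim_cat_car)  # True

doc_a = average_vector(["cat", "dog"], toy_vectors)
doc_b = average_vector(["car"], toy_vectors)
print(cosine_similarity(doc_a, doc_b))
```

With a trained gensim model, the same comparison would use the vectors in `model.wv` instead of the toy dictionary.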

#### Implementation in Python:
We'll use the gensim library to implement Word2Vec.

#### Installation:
Ensure you have gensim installed:

In [2]:
pip install gensim


Note: you may need to restart the kernel to use updated packages.




In [1]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk

# Download the NLTK tokenizer models
nltk.download('punkt')

# Sample documents
documents = [
    "I love natural language processing",
    "Natural language processing is a fascinating field",
    "I am learning word embeddings using Word2Vec"
]

# Tokenize the documents
tokenized_documents = [word_tokenize(doc.lower()) for doc in documents]

# Initialize and train the Word2Vec model
model = Word2Vec(sentences=tokenized_documents, vector_size=100, window=5, min_count=1, workers=4)

# Print the vector for a word
word = "natural"
print(f"Vector for '{word}':\n{model.wv[word]}")

# Find most similar words
similar_words = model.wv.most_similar(word, topn=5)
print(f"Words most similar to '{word}':\n{similar_words}")




Vector for 'natural':
[ 9.4563962e-05  3.0773198e-03 -6.8126451e-03 -1.3754654e-03
  7.6685809e-03  7.3464094e-03 -3.6732971e-03  2.6427018e-03
 -8.3171297e-03  6.2054861e-03 -4.6373224e-03 -3.1641065e-03
  9.3113566e-03  8.7338570e-04  7.4907029e-03 -6.0740625e-03
  5.1605068e-03  9.9228229e-03 -8.4573915e-03 -5.1356913e-03
 -7.0648370e-03 -4.8626517e-03 -3.7785638e-03 -8.5361991e-03
  7.9556061e-03 -4.8439382e-03  8.4236134e-03  5.2625705e-03
 -6.5500261e-03  3.9578713e-03  5.4701497e-03 -7.4265362e-03
 -7.4057197e-03 -2.4752307e-03 -8.6257253e-03 -1.5815723e-03
 -4.0343284e-04  3.2996845e-03  1.4418805e-03 -8.8142155e-04
 -5.5940580e-03  1.7303658e-03 -8.9737179e-04  6.7936908e-03
  3.9735902e-03  4.5294715e-03  1.4343059e-03 -2.6998555e-03
 -4.3668128e-03 -1.0320747e-03  1.4370275e-03 -2.6460087e-03
 -7.0737829e-03 -7.8053069e-03 -9.1217868e-03 -5.9351693e-03
 -1.8474245e-03 -4.3238713e-03 -6.4606704e-03 -3.7173224e-03
  4.2891586e-03 -3.7390434e-03  8.3781751e-03  1.5339935e-03
 -



#### Explanation:
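1. Word2Vec: Creates the model. Because sentences is passed to the constructor, it also builds the vocabulary and trains the model immediately. CBOW is the default (sg=0); pass sg=1 to train Skip-gram instead.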
2. sentences: List of tokenized sentences.
3. vector_size: Dimensionality of the word vectors.
4. window: Maximum distance between the current and predicted word within a sentence.
5. min_count: Ignores all words with total frequency lower than this.
6. workers: Number of worker threads to train the model.
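
To show what min_count does to the vocabulary, here is a minimal sketch that applies the same frequency threshold by hand. The helper `build_vocab` is a hypothetical name, and the token lists are the three sample documents lowercased and split by hand rather than produced by `word_tokenize`.

```python
from collections import Counter

# The three sample documents, lowercased and tokenized by hand.
tokenized_documents = [
    ["i", "love", "natural", "language", "processing"],
    ["natural", "language", "processing", "is", "a", "fascinating", "field"],
    ["i", "am", "learning", "word", "embeddings", "using", "word2vec"],
]


def build_vocab(sentences, min_count):
    """Keep only the words whose total frequency is at least min_count."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word for word, count in counts.items() if count >= min_count}


vocab_all = build_vocab(tokenized_documents, min_count=1)
vocab_frequent = build_vocab(tokenized_documents, min_count=2)

print(len(vocab_all))          # 15
print(sorted(vocab_frequent))  # ['i', 'language', 'natural', 'processing']
```

With min_count=1, as in the example above, every word is kept. On a corpus this small, raising the threshold to 2 would drop most of the vocabulary, including "word2vec" itself.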