# Introduction to Word Embeddings and Neural Networks

This notebook demonstrates the fundamentals of natural language processing using Word2Vec and a simple neural network. We'll walk through:

1. Basic text preprocessing
2. Word embedding with Word2Vec
3. Building a simple neural network with PyTorch
4. Making predictions with the model


In [None]:
import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Word2Vec is required in section 3. If this import fails with
# "ImportError: cannot import name 'triu' from 'scipy.linalg'", the installed
# gensim is likely too old for the installed SciPy; upgrading gensim or
# pinning SciPy below 1.13 is the usual fix.
from gensim.models import Word2Vec

## 1. Starting with a Simple Sentence

We'll begin with a basic Polish sentence, "kot goni psa" ("the cat chases the dog"), to demonstrate the NLP pipeline.


In [None]:
# Simple example sentence
sentence = ["kot goni psa"]
print(f"Our sample sentence: '{sentence[0]}'")

## 2. Tokenization

Tokenization splits text into units (here, whitespace-separated words) and maps each unit to a numerical token that a model can process. We create a simple vocabulary that maps each word to a unique integer.


In [None]:
# Create a vocabulary dictionary mapping words to unique integers
vocab = {"kot": 0, "goni": 1, "psa": 2}

# Convert each word in our sentence to its numerical token
tokens = [vocab[word] for word in sentence[0].split()]
print(f"Words: {sentence[0].split()}")
print(f"Tokens: {tokens}")
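
The vocabulary above is written by hand. As a rough sketch of how it could be derived from text instead, the hypothetical helper `build_vocab` below assigns IDs in order of first appearance. The names `build_vocab`, `vocab_auto`, and `token_ids` are illustrative and do not appear elsewhere in this notebook.

```python
def build_vocab(tokenized_sentences):
    """Assign each distinct word an integer ID in order of first appearance."""
    vocab = {}
    for sent in tokenized_sentences:
        for word in sent:
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


corpus = ["kot goni psa", "pies goni kota"]
tokenized = [s.split() for s in corpus]  # whitespace tokenization

vocab_auto = build_vocab(tokenized)
print(vocab_auto)
# {'kot': 0, 'goni': 1, 'psa': 2, 'pies': 3, 'kota': 4}

# Map every sentence to its list of token IDs
token_ids = [[vocab_auto[w] for w in sent] for sent in tokenized]
print(token_ids)
# [[0, 1, 2], [3, 1, 4]]
```

Note that "kot" and "kota" get separate IDs: whitespace tokenization treats each inflected form as a distinct word.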

## 3. Word Embeddings with Word2Vec

Word embeddings represent words as dense vectors in a continuous vector space where semantically similar words are mapped close to each other. Word2Vec is a popular method for generating these embeddings.
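
"Close to each other" is commonly measured with cosine similarity. The sketch below uses made-up vectors (not the output of any trained model) purely to show the computation; `cosine_similarity` and `toy_vectors` are illustrative names.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1 = same direction, 0 = orthogonal."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up vectors for illustration only
toy_vectors = {
    "kot": [0.9, 0.1, 0.0],
    "pies": [0.8, 0.2, 0.1],
    "ryba": [0.0, 0.1, 0.9],
}

sim_close = cosine_similarity(toy_vectors["kot"], toy_vectors["pies"])
sim_far = cosine_similarity(toy_vectors["kot"], toy_vectors["ryba"])
print(f"kot vs pies: {sim_close:.3f}")  # near 1: vectors point the same way
print(f"kot vs ryba: {sim_far:.3f}")    # near 0: vectors are almost orthogonal
```

gensim exposes the same measure on a trained model as `wv.similarity(word1, word2)`. On a corpus as small as the one used in this notebook, the resulting values are essentially noise.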

Here we train a Word2Vec model on a small corpus of sentences:


In [None]:
# Our training corpus (collection of sentences)
sentences = [["kot", "goni", "psa"], ["pies", "goni", "kota"], ["ryba", "pływa", "w", "wodzie"]]

# Train Word2Vec model
# Parameters:
# - vector_size: dimension of the word vectors
# - window: context window size (words before and after)
# - min_count: ignore words with fewer occurrences
# - sg=0: use CBOW architecture (sg=1 would use Skip-gram)
word2vec_model = Word2Vec(sentences, vector_size=3, window=2, min_count=1, sg=0)

# Extract vectors for words in our vocabulary
word_vectors = {word: word2vec_model.wv[word] for word in vocab.keys()}

# Display the vector for each word
for word, vector in word_vectors.items():
    print(f"{word}: {vector}")
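
To make the `window` and `sg` parameters more concrete, here is a simplified sketch of the (context, target) pairs that a CBOW-style model learns from. The hypothetical helper `cbow_pairs` is illustrative only. gensim's actual training adds details this sketch omits, such as negative sampling and randomized window sizes.

```python
def cbow_pairs(sentence, window=2):
    """Return (context_words, target_word) for every position in the sentence."""
    pairs = []
    for i, target in enumerate(sentence):
        left = sentence[max(0, i - window):i]     # up to `window` words before
        right = sentence[i + 1:i + 1 + window]    # up to `window` words after
        pairs.append((left + right, target))
    return pairs


for context, target in cbow_pairs(["ryba", "pływa", "w", "wodzie"], window=2):
    print(f"{context} -> {target}")
```

CBOW predicts the target from its surrounding context. Skip-gram (`sg=1`) reverses the direction and predicts each context word from the target.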

### Visualizing Word Embeddings

Let's visualize our word embeddings to better understand how words are positioned in the vector space:


In [None]:
# Get all words from the trained model
all_words = list(word2vec_model.wv.key_to_index.keys())
all_vectors = [word2vec_model.wv[word] for word in all_words]

# The vectors have 3 dimensions (vector_size=3); PCA projects them to 2D for plotting
pca = PCA(n_components=2)
result = pca.fit_transform(all_vectors)

# Create a scatter plot
plt.figure(figsize=(10, 8))
plt.scatter(result[:, 0], result[:, 1], marker="o")

# Add labels for each word
for i, word in enumerate(all_words):
    plt.annotate(
        word,
        xy=(result[i, 0], result[i, 1]),
        xytext=(5, 2),
        textcoords="offset points",
        ha="left",
        va="bottom",
    )

plt.title("Word Embeddings Visualization using PCA")
plt.grid(True)
plt.show()
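
For intuition about what the PCA step does, the following sketch reproduces the projection with a plain SVD in NumPy. The input is random stand-in data and `pca_project` is a hypothetical helper. scikit-learn's `PCA` may flip the sign of individual components, so plots can appear mirrored relative to this version.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project rows of X onto their top principal components via SVD."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                       # center each dimension
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    projected = Xc @ Vt[:n_components].T          # coordinates in the new basis
    explained = (S ** 2) / np.sum(S ** 2)         # share of variance per component
    return projected, explained[:n_components]


rng = np.random.default_rng(0)
toy = rng.normal(size=(8, 3))   # random stand-in for 8 word vectors of size 3

points_2d, ratio = pca_project(toy)
print(points_2d.shape)  # (8, 2)
print(f"variance kept by 2 components: {ratio.sum():.2%}")
```

The fitted scikit-learn object reports the same quantity as `pca.explained_variance_ratio_`, which is a quick way to check how much information the 2D plot discards.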

## 4. Converting Tokens to Vectors

Now we'll transform our tokenized sentence into a sequence of vectors that can be processed by a neural network:


In [None]:
# Convert each word in our original sentence to its vector representation
input_vectors = np.array([word_vectors[word] for word in sentence[0].split()])
print(f"Input vectors shape: {input_vectors.shape}")
print("Input vectors:")
for i, word in enumerate(sentence[0].split()):
    print(f"{word}: {input_vectors[i]}")
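
The dictionary lookup above is equivalent to indexing rows of an embedding matrix by token ID. The sketch below shows this with a made-up matrix; `toy_vocab` and `embedding_matrix` are illustrative names, and the values are not trained vectors.

```python
import numpy as np

toy_vocab = {"kot": 0, "goni": 1, "psa": 2}

# Made-up embedding matrix: row i holds the vector for token ID i
embedding_matrix = np.array([
    [0.1, 0.2, 0.3],  # kot
    [0.4, 0.5, 0.6],  # goni
    [0.7, 0.8, 0.9],  # psa
])

ids = [toy_vocab[w] for w in "psa goni kot".split()]  # [2, 1, 0]
looked_up = embedding_matrix[ids]                     # select rows 2, 1, 0
print(looked_up)

# The same lookup expressed as one-hot vectors times the matrix
one_hot = np.eye(len(toy_vocab))[ids]
print(np.allclose(one_hot @ embedding_matrix, looked_up))  # True
```

In PyTorch, `nn.Embedding` plays the same role: it stores such a matrix and returns its rows for the given token IDs.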

## 5. Building a Simple Neural Network

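We'll create a simple neural network that takes a word vector as input and outputs one score per vocabulary word, which we interpret as a prediction of the next word. The network is not trained in this notebook, so its weights remain at their random initial values.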


In [None]:
class SimpleNN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        # A single linear (fully connected) layer
        self.fc = nn.Linear(input_dim, output_dim)

        # You could add more complexity with additional layers:
        # self.hidden = nn.Linear(input_dim, 64)
        # self.relu = nn.ReLU()
        # self.output = nn.Linear(64, output_dim)

    def forward(self, x):
        # Simple forward pass with just one layer
        return self.fc(x)

        # With more layers, you'd do:
        # x = self.hidden(x)
        # x = self.relu(x)
        # return self.output(x)


# Initialize the network
input_dim = word2vec_model.vector_size  # Dimensionality of our word vectors (3 here)
output_dim = len(vocab)  # Number of possible output words
model = SimpleNN(input_dim, output_dim)

# Print model architecture
print(model)
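
A single `nn.Linear` layer computes `x @ W.T + b`. The sketch below checks this on random stand-in inputs and counts the parameters; the variable names are illustrative and independent of the `model` defined above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(3, 3)   # same shape as the layer in SimpleNN: 3 inputs, 3 outputs
x = torch.randn(3, 3)     # random stand-ins for three word vectors of size 3

# Recompute the layer's output by hand from its weight and bias
manual = x @ layer.weight.T + layer.bias
matches = torch.allclose(layer(x), manual, atol=1e-6)
print(matches)  # True

# 3 x 3 weights plus 3 biases
n_params = sum(p.numel() for p in layer.parameters())
print(n_params)  # 12
```

With no non-linearity, the model can only draw linear decision boundaries in the embedding space. The commented-out hidden layer and ReLU in the class above are what would lift that restriction.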

## 6. Forward Pass and Prediction

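Now we'll run our word vectors through the neural network and generate predictions. Because the model has not been trained, the outputs reflect its random initial weights and will differ between runs: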


In [None]:
# Convert numpy arrays to PyTorch tensors
input_tensor = torch.tensor(input_vectors, dtype=torch.float32)

# Run the input through the model
output = model(input_tensor)
print("Raw neural network output:")
print(output)

# Apply softmax to get probabilities
softmax = nn.Softmax(dim=1)
probabilities = softmax(output)
print("\nProbabilities for each word in vocabulary:")
print(probabilities)

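# Find the highest-probability word for each input word.
# The model is untrained, so these "predictions" are arbitrary.
id_to_word = {idx: word for word, idx in vocab.items()}
predicted_tokens = torch.argmax(probabilities, dim=1)
predicted_words = [id_to_word[token.item()] for token in predicted_tokens]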

print("\nPredicted next word for each input word:")
for i, word in enumerate(sentence[0].split()):
    print(f"After '{word}': '{predicted_words[i]}'")
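
To show what the softmax step does to the raw scores, here is a minimal NumPy version with hand-picked example logits. The `softmax` function below is a sketch for illustration, not PyTorch's implementation.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Hand-picked example scores: one row per input word, one column per vocabulary word
scores = np.array([
    [2.0, 1.0, 0.1],
    [0.0, 0.0, 0.0],   # equal scores give a uniform distribution
])
probs = softmax(scores)
print(probs)
print(probs.sum(axis=1))  # each row sums to 1
```

Softmax preserves the ordering of the scores, so `argmax` over the raw outputs selects the same word as `argmax` over the probabilities. The conversion matters when the probabilities themselves are needed, as in the bar chart in the next section.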

## 7. Visualizing the Model's Decision Process

Let's create a bar chart to better visualize the prediction probabilities:


In [None]:
# Get the last word's prediction probabilities
last_word_probs = probabilities[-1].detach().numpy()

# Plot as a bar chart
plt.figure(figsize=(10, 6))
plt.bar(list(vocab.keys()), last_word_probs, color="skyblue")
plt.title(f"Prediction Probabilities After the Word '{sentence[0].split()[-1]}'")
plt.xlabel("Possible Next Words")
plt.ylabel("Probability")
plt.ylim(0, 1)
plt.show()

## Summary

In this notebook, we've covered the fundamental steps of natural language processing:

1. **Tokenization**: Converting words to numerical tokens
2. **Word Embeddings**: Representing words as vectors using Word2Vec
3. **Neural Network**: Building a simple predictive model with PyTorch
4. **Prediction**: Generating next-word probabilities (here from an untrained model)

Note that in a real-world scenario:

- We would use much larger training datasets
- Our vocabulary would be much more extensive
- Word vectors would have higher dimensions (typically 100-300)
- The neural network would be more complex (e.g., an LSTM or Transformer)
- We would properly train the model with a loss function and optimizer
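
The last point can be sketched as a minimal training loop. Everything in it is an assumption made for illustration: the inputs are random stand-ins for the Word2Vec vectors, the targets are a hypothetical "next word" labelling of "kot goni psa" that wraps around at the end, and the learning rate and epoch count are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Random stand-ins for the vectors of kot, goni, psa (real code would use Word2Vec output)
inputs = torch.randn(3, 3)
# Hypothetical targets: ID of the following word, wrapping around (psa -> kot)
targets = torch.tensor([1, 2, 0])

net = nn.Linear(3, 3)
loss_fn = nn.CrossEntropyLoss()  # takes raw scores; applies log-softmax internally
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

losses = []
for epoch in range(500):
    optimizer.zero_grad()                  # clear gradients from the previous step
    loss = loss_fn(net(inputs), targets)   # compare scores with the target IDs
    loss.backward()                        # backpropagate
    optimizer.step()                       # update weights
    losses.append(loss.item())

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
print("predicted IDs:", net(inputs).argmax(dim=1).tolist())
```

Because `nn.CrossEntropyLoss` expects raw scores, the model should not apply softmax before the loss. Softmax is only needed afterwards, when reading the outputs as probabilities.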

This notebook serves as a simplified introduction to the concepts of word embeddings and neural networks for NLP.
