# L06: Embeddings & Reinforcement Learning
## Text Representations and Sequential Decision Making

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Digital-AI-Finance/Methods_and_Algorithms/blob/main/notebooks/L06_embeddings_rl.ipynb)

**Methods and Algorithms -- MSc Data Science**

---

### Learning Objectives

By the end of this notebook, you will be able to:

1. Explain word embeddings and their applications
2. Apply pre-trained embeddings for text analysis
3. Understand the reinforcement learning framework
4. Implement basic Q-learning for decision problems

### Finance Applications: Sentiment Analysis, Algorithmic Trading

In [None]:
# Install required packages (if needed)
# !pip install gensim gymnasium

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import defaultdict
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Plotting style
plt.rcParams.update({'font.size': 12, 'figure.figsize': (10, 6)})

print("Libraries imported successfully!")

## Part 1: Theory Recap

### Word Embeddings

Word embeddings are dense vector representations of words, learned so that words used in similar contexts have similar vectors.

**Cosine Similarity:**
$$\text{sim}(u, v) = \frac{u \cdot v}{||u|| \cdot ||v||}$$
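
As a quick check of the formula, the sketch below evaluates it on two small hand-picked vectors. The values of `u` and `v` are made up for illustration and do not come from any trained model.

```python
import numpy as np

# Two illustrative 3-dimensional vectors (made up for this example)
u = np.array([1.0, 2.0, 2.0])
v = np.array([2.0, 1.0, 2.0])

dot = np.dot(u, v)                                    # 1*2 + 2*1 + 2*2 = 8
norm_product = np.linalg.norm(u) * np.linalg.norm(v)  # 3 * 3 = 9
sim_uv = dot / norm_product                           # 8 / 9

print(f"sim(u, v)  = {sim_uv:.3f}")

# Cosine similarity ignores vector length: scaling u leaves it unchanged
sim_scaled = np.dot(5 * u, v) / (np.linalg.norm(5 * u) * np.linalg.norm(v))
print(f"sim(5u, v) = {sim_scaled:.3f}")
```

Because only the angle between the vectors matters, cosine similarity is a common choice for comparing embeddings whose norms differ.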

### Reinforcement Learning

Learning from interaction with an environment through trial and error.

**Q-Learning Update:**
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Where:
- $\alpha$: learning rate
- $\gamma$: discount factor
- $r$: reward
- $s, s'$: current and next state
- $a, a'$: the action taken in $s$, and a candidate action in $s'$ (the $\max$ selects the best one)
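
To make the update rule concrete, here is a single update worked through by hand. All numbers are illustrative assumptions chosen to keep the arithmetic simple. The names `td_target` and `td_error` follow the usual terminology for the bracketed term and are not defined elsewhere in this notebook.

```python
# Illustrative values, chosen only to make the arithmetic easy to follow
alpha = 0.1                 # learning rate
gamma = 0.95                # discount factor
q_sa = 0.5                  # current estimate Q(s, a)
reward = 1.0                # reward r observed after taking a in s
q_next = [0.2, 2.0, 1.1]    # Q(s', a') for each candidate action a'

td_target = reward + gamma * max(q_next)   # 1.0 + 0.95 * 2.0 = 2.9
td_error = td_target - q_sa                # 2.9 - 0.5 = 2.4
q_sa_new = q_sa + alpha * td_error         # 0.5 + 0.1 * 2.4 = 0.74

print(f"TD target: {td_target:.2f}, TD error: {td_error:.2f}, new Q(s, a): {q_sa_new:.2f}")
```

The estimate moves only a fraction $\alpha$ of the way toward the target, which is why many visits to the same state-action pair are needed before the values settle.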

## Part 2: Word Embeddings from Scratch

In [None]:
# Simulate pre-trained word embeddings
# In practice, you would load Word2Vec, GloVe, or BERT embeddings

def create_simulated_embeddings(dim=50):
    """
    Create simulated word embeddings with semantic structure.
    Similar words will have similar vectors.
    """
    np.random.seed(42)
    
    # Base vectors for semantic categories
    base_finance = np.random.randn(dim) * 0.5
    base_positive = np.random.randn(dim) * 0.5
    base_negative = np.random.randn(dim) * 0.5
    base_action = np.random.randn(dim) * 0.5
    
    embeddings = {}
    
    # Financial terms (similar to each other)
    finance_words = ['stock', 'equity', 'share', 'bond', 'dividend', 'portfolio', 'market']
    for word in finance_words:
        embeddings[word] = base_finance + np.random.randn(dim) * 0.2
    
    # Positive sentiment words
    positive_words = ['bullish', 'profit', 'gain', 'surge', 'growth', 'rally']
    for word in positive_words:
        embeddings[word] = base_positive + np.random.randn(dim) * 0.2
    
    # Negative sentiment words
    negative_words = ['bearish', 'loss', 'decline', 'crash', 'risk', 'selloff']
    for word in negative_words:
        embeddings[word] = base_negative + np.random.randn(dim) * 0.2
    
    # Action words
    action_words = ['buy', 'sell', 'hold', 'trade', 'invest']
    for word in action_words:
        embeddings[word] = base_action + np.random.randn(dim) * 0.2
    
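    # Override two vectors by hand:
    # 'sell' is derived directly from 'buy' (related concepts)
    embeddings['sell'] = embeddings['buy'] + np.random.randn(dim) * 0.3
    # 'bullish' points roughly opposite to 'bearish' (antonyms), which also
    # moves it away from the other positive words
    embeddings['bullish'] = -embeddings['bearish'] + np.random.randn(dim) * 0.2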
    
    return embeddings

# Create embeddings
embeddings = create_simulated_embeddings(dim=50)
print(f"Created embeddings for {len(embeddings)} words")
print(f"Embedding dimension: {len(list(embeddings.values())[0])}")
print(f"\nVocabulary: {list(embeddings.keys())}")

In [None]:
def cosine_similarity(v1, v2):
    """Calculate cosine similarity between two vectors."""
    dot_product = np.dot(v1, v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)
    return dot_product / (norm_v1 * norm_v2)

def find_similar_words(word, embeddings, top_n=5):
    """Find the most similar words to a given word."""
    if word not in embeddings:
        return []
    
    target_vec = embeddings[word]
    similarities = []
    
    for other_word, other_vec in embeddings.items():
        if other_word != word:
            sim = cosine_similarity(target_vec, other_vec)
            similarities.append((other_word, sim))
    
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:top_n]

# Test similarity
print("Most similar words to 'stock':")
for word, sim in find_similar_words('stock', embeddings):
    print(f"  {word}: {sim:.3f}")

print("\nMost similar words to 'bullish':")
for word, sim in find_similar_words('bullish', embeddings):
    print(f"  {word}: {sim:.3f}")

In [None]:
# Visualize embeddings using PCA
from sklearn.decomposition import PCA

# Get all words and vectors
words = list(embeddings.keys())
vectors = np.array([embeddings[w] for w in words])

# Reduce to 2D
pca = PCA(n_components=2)
vectors_2d = pca.fit_transform(vectors)

# Plot
fig, ax = plt.subplots(figsize=(12, 8))

# Color by category
colors = {'finance': 'blue', 'positive': 'green', 'negative': 'red', 'action': 'orange'}
word_categories = {
    'stock': 'finance', 'equity': 'finance', 'share': 'finance', 'bond': 'finance',
    'dividend': 'finance', 'portfolio': 'finance', 'market': 'finance',
    'bullish': 'positive', 'profit': 'positive', 'gain': 'positive',
    'surge': 'positive', 'growth': 'positive', 'rally': 'positive',
    'bearish': 'negative', 'loss': 'negative', 'decline': 'negative',
    'crash': 'negative', 'risk': 'negative', 'selloff': 'negative',
    'buy': 'action', 'sell': 'action', 'hold': 'action', 'trade': 'action', 'invest': 'action'
}

for i, word in enumerate(words):
    cat = word_categories.get(word, 'other')
    color = colors.get(cat, 'gray')
    ax.scatter(vectors_2d[i, 0], vectors_2d[i, 1], c=color, s=100, alpha=0.7)
    ax.annotate(word, (vectors_2d[i, 0], vectors_2d[i, 1]), fontsize=10,
                xytext=(5, 5), textcoords='offset points')

# Legend
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor=c, label=cat.capitalize()) for cat, c in colors.items()]
ax.legend(handles=legend_elements, loc='upper right')

ax.set_xlabel('PCA Component 1')
ax.set_ylabel('PCA Component 2')
ax.set_title('Word Embeddings Visualization (2D Projection)')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nObservation: words from the same category tend to cluster together. "
      "'bullish' is a deliberate exception (it was set opposite to 'bearish').")

## Part 3: Sentence Embeddings

In [None]:
def get_sentence_embedding(sentence, embeddings):
    """
    Get sentence embedding by averaging word vectors.
    (Simple baseline - more advanced: Doc2Vec, BERT)
    """
    words = sentence.lower().split()
    word_vectors = []
    
    for word in words:
        # Remove punctuation
        word = word.strip('.,!?')
        if word in embeddings:
            word_vectors.append(embeddings[word])
    
    if len(word_vectors) == 0:
        return None
    
    return np.mean(word_vectors, axis=0)

# Example sentences
sentences = [
    "stock market rally gain profit",
    "equity portfolio growth bullish",
    "market crash loss bearish selloff",
    "decline risk loss",
    "buy hold invest"
]

# Get embeddings and compute similarities
sentence_embeddings = {s: get_sentence_embedding(s, embeddings) for s in sentences}

print("Sentence Similarities:")
print("-" * 50)
for i, s1 in enumerate(sentences):
    for j, s2 in enumerate(sentences):
        if i < j:
            sim = cosine_similarity(sentence_embeddings[s1], sentence_embeddings[s2])
            print(f"Sim({i+1}, {j+1}): {sim:.3f}")
            print(f"  S{i+1}: {s1}")
            print(f"  S{j+1}: {s2}")
            print()

## Part 4: Simple Sentiment Classification

In [None]:
# Synthetic financial news, defined inline
# (in practice you would load this from a file)
news_data = [
    {"text": "stock surge profit gain bullish", "sentiment": "positive"},
    {"text": "equity growth rally market", "sentiment": "positive"},
    {"text": "market crash loss bearish decline", "sentiment": "negative"},
    {"text": "selloff risk loss decline", "sentiment": "negative"},
    {"text": "stock market trade hold", "sentiment": "neutral"},
    {"text": "portfolio dividend bond invest", "sentiment": "neutral"}
]

# Create feature vectors using embeddings
X = []
y = []
texts = []

for item in news_data:
    emb = get_sentence_embedding(item['text'], embeddings)
    if emb is not None:
        X.append(emb)
        y.append(item['sentiment'])
        texts.append(item['text'])

X = np.array(X)
y = np.array(y)

print(f"Feature matrix shape: {X.shape}")
print(f"Labels: {np.unique(y, return_counts=True)}")

In [None]:
# Simple classifier using embedding similarity to sentiment centroids
def simple_sentiment_classifier(sentence, embeddings):
    """
    Classify sentiment by comparing to positive/negative word centroids.
    """
    positive_words = ['bullish', 'profit', 'gain', 'surge', 'growth', 'rally']
    negative_words = ['bearish', 'loss', 'decline', 'crash', 'risk', 'selloff']
    
    # Compute centroids
    pos_centroid = np.mean([embeddings[w] for w in positive_words if w in embeddings], axis=0)
    neg_centroid = np.mean([embeddings[w] for w in negative_words if w in embeddings], axis=0)
    
    # Get sentence embedding
    sent_emb = get_sentence_embedding(sentence, embeddings)
    if sent_emb is None:
        return 'unknown', 0.0
    
    # Compute similarities
    pos_sim = cosine_similarity(sent_emb, pos_centroid)
    neg_sim = cosine_similarity(sent_emb, neg_centroid)
    
    if pos_sim > neg_sim + 0.1:
        return 'positive', pos_sim - neg_sim
    elif neg_sim > pos_sim + 0.1:
        return 'negative', neg_sim - pos_sim
    else:
        return 'neutral', abs(pos_sim - neg_sim)

# Test classifier
test_sentences = [
    "stock market rally profit",
    "market crash decline loss",
    "buy stock hold portfolio"
]

print("Sentiment Classification Results:")
print("-" * 50)
for sentence in test_sentences:
    sentiment, confidence = simple_sentiment_classifier(sentence, embeddings)
    print(f"Text: '{sentence}'")
    print(f"Sentiment: {sentiment} (confidence: {confidence:.3f})")
    print()

## Part 5: Reinforcement Learning - Q-Learning from Scratch

In [None]:
class SimpleTradingEnv:
    """
    Simple trading environment for Q-learning demonstration.
    
    State: (position, price_trend)
        - position: -1 (short), 0 (neutral), 1 (long)
        - price_trend: 0 (down), 1 (neutral), 2 (up)
    
    Actions: 0 (sell), 1 (hold), 2 (buy)
    
    Rewards: Profit/loss from position * price change
    """
    
    def __init__(self):
        self.position = 0  # -1, 0, 1
        self.price_trend = 1  # 0, 1, 2
        self.step_count = 0
        self.max_steps = 20
        
    def reset(self):
        self.position = 0
        self.price_trend = np.random.choice([0, 1, 2])
        self.step_count = 0
        return self._get_state()
    
    def _get_state(self):
        return (self.position + 1, self.price_trend)  # Shift position to 0, 1, 2
    
    def step(self, action):
        """
        Execute action and return (next_state, reward, done).
        action: 0=sell, 1=hold, 2=buy
        """
        self.step_count += 1
        
        # Update position based on action
        if action == 0:  # Sell
            self.position = max(-1, self.position - 1)
        elif action == 2:  # Buy
            self.position = min(1, self.position + 1)
        # Hold: position stays same
        
        # Simulate price change
        price_change = self.price_trend - 1  # -1, 0, 1
        
        # Reward = position * price change (+ small penalty for trading)
        reward = self.position * price_change
        if action != 1:  # Trading cost
            reward -= 0.1
        
        # Update price trend: keep the current trend with probability 0.7,
        # otherwise resample uniformly (the resample may pick the same trend again)
        if np.random.random() < 0.3:
            self.price_trend = np.random.choice([0, 1, 2])
        
        done = self.step_count >= self.max_steps
        
        return self._get_state(), reward, done

# Test environment
env = SimpleTradingEnv()
state = env.reset()
print(f"Initial state: {state}")
print("State space: position (0-2) x trend (0-2) = 9 states")
print("Action space: sell (0), hold (1), buy (2)")

In [None]:
class QLearningAgent:
    """
    Q-Learning agent with epsilon-greedy exploration.
    """
    
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.n_states = n_states
        self.n_actions = n_actions
        self.alpha = alpha  # Learning rate
        self.gamma = gamma  # Discount factor
        self.epsilon = epsilon  # Exploration rate
        
        # Q-table indexed directly as [position, trend, action] (no flattening).
        # The 3 x 3 state grid is hard-coded; n_states is stored for reference only.
        self.q_table = np.zeros((3, 3, n_actions))  # 3 positions x 3 trends x n_actions
    
    def get_action(self, state, training=True):
        """Epsilon-greedy action selection."""
        if training and np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        else:
            return np.argmax(self.q_table[state[0], state[1]])
    
    def update(self, state, action, reward, next_state):
        """Q-learning update."""
        current_q = self.q_table[state[0], state[1], action]
        max_next_q = np.max(self.q_table[next_state[0], next_state[1]])
        
        # Q-learning update rule
        new_q = current_q + self.alpha * (reward + self.gamma * max_next_q - current_q)
        self.q_table[state[0], state[1], action] = new_q

# Create agent
agent = QLearningAgent(n_states=9, n_actions=3, alpha=0.1, gamma=0.95, epsilon=0.2)
print("Q-Learning agent created")
print(f"Q-table shape: {agent.q_table.shape}")

In [None]:
# Training loop
def train_agent(env, agent, n_episodes=500):
    """Train Q-learning agent."""
    rewards_history = []
    
    for episode in range(n_episodes):
        state = env.reset()
        total_reward = 0
        done = False
        
        while not done:
            # Select action
            action = agent.get_action(state, training=True)
            
            # Execute action
            next_state, reward, done = env.step(action)
            
            # Update Q-table
            agent.update(state, action, reward, next_state)
            
            state = next_state
            total_reward += reward
        
        rewards_history.append(total_reward)
        
        # Decay epsilon
        if episode > 0 and episode % 100 == 0:
            agent.epsilon = max(0.01, agent.epsilon * 0.9)
    
    return rewards_history

# Train the agent
env = SimpleTradingEnv()
agent = QLearningAgent(n_states=9, n_actions=3, alpha=0.1, gamma=0.95, epsilon=0.3)

print("Training Q-learning agent...")
rewards = train_agent(env, agent, n_episodes=500)
print("Training complete!")

In [None]:
# Plot learning curve
fig, ax = plt.subplots(figsize=(10, 6))

# Smooth rewards with moving average
window = 20
smoothed_rewards = np.convolve(rewards, np.ones(window)/window, mode='valid')

ax.plot(rewards, alpha=0.3, color='blue', label='Episode reward')
ax.plot(range(window-1, len(rewards)), smoothed_rewards, color='blue', 
        linewidth=2, label=f'Moving average ({window} episodes)')

ax.axhline(y=0, color='black', linestyle='--', alpha=0.5)
ax.set_xlabel('Episode')
ax.set_ylabel('Total Reward')
ax.set_title('Q-Learning Training: Cumulative Reward per Episode')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nFinal average reward (last 50 episodes): {np.mean(rewards[-50:]):.2f}")

In [None]:
# Visualize learned Q-values
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

action_names = ['Sell', 'Hold', 'Buy']
position_names = ['Short', 'Neutral', 'Long']
trend_names = ['Down', 'Neutral', 'Up']

for action_idx, (ax, action_name) in enumerate(zip(axes, action_names)):
    q_values = agent.q_table[:, :, action_idx]
    
    im = ax.imshow(q_values, cmap='RdYlGn', aspect='auto')
    ax.set_xticks(range(3))
    ax.set_yticks(range(3))
    ax.set_xticklabels(trend_names)
    ax.set_yticklabels(position_names)
    ax.set_xlabel('Price Trend')
    ax.set_ylabel('Current Position')
    ax.set_title(f'Q-values for Action: {action_name}')
    
    # Add text annotations
    for i in range(3):
        for j in range(3):
            ax.text(j, i, f'{q_values[i, j]:.2f}',
                    ha='center', va='center', fontsize=12)
    
    plt.colorbar(im, ax=ax, shrink=0.8)

plt.suptitle('Learned Q-Values', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# Extract and display learned policy
print("Learned Policy:")
print("=" * 50)
print(f"{'Position':<12} {'Trend':<12} {'Best Action':<15}")
print("-" * 50)

for pos_idx, pos_name in enumerate(position_names):
    for trend_idx, trend_name in enumerate(trend_names):
        best_action = np.argmax(agent.q_table[pos_idx, trend_idx])
        action_name = action_names[best_action]
        print(f"{pos_name:<12} {trend_name:<12} {action_name:<15}")

print("\nExpected pattern (compare with the table above; a short training run may deviate):")
print("- When trend is UP and not already LONG: BUY")
print("- When trend is DOWN and not already SHORT: SELL")
print("- Otherwise: HOLD current position")

## Exercises

### Exercise 1: Word Analogies
Implement a word analogy: find the word that best completes `stock - equity + debt ≈ ?`

In [None]:
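# Your code here
# Note: 'debt' is not in the simulated vocabulary, so embeddings['debt'] raises a
# KeyError. Add a vector for it first, or substitute an in-vocabulary word such as 'bond'.
# Hint: result = embeddings['stock'] - embeddings['equity'] + embeddings['debt']
# Then find the word with the highest cosine similarity to result,
# excluding the three input words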

### Exercise 2: Epsilon Decay
Modify the Q-learning agent to use exponential epsilon decay. Compare learning curves.
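
Before changing the agent, it can help to look at the schedule on its own. The sketch below only tabulates epsilon per episode; `initial_epsilon`, `decay_rate`, and `min_epsilon` are hypothetical settings, and the floor of 0.01 mirrors the one used in the training loop above. Wiring the schedule into training and comparing the learning curves is left as the exercise.

```python
import numpy as np

# Hypothetical settings for illustration; tune them for your own experiments
initial_epsilon = 0.3
decay_rate = 0.99
min_epsilon = 0.01   # floor so the agent never stops exploring entirely
n_episodes = 500

episodes = np.arange(n_episodes)
# Exponential decay, clipped from below at min_epsilon
epsilon_schedule = np.maximum(min_epsilon, initial_epsilon * decay_rate ** episodes)

for ep in (0, 100, 250, 499):
    print(f"episode {ep:3d}: epsilon = {epsilon_schedule[ep]:.4f}")
```

A decay rate closer to 1 keeps exploration going for longer, while a smaller one reaches the floor sooner.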

In [None]:
# Your code here
# Hint: epsilon = initial_epsilon * (decay_rate ** episode)

### Exercise 3: More Complex State
Add "momentum" to the state (whether trend is accelerating or decelerating).
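
One thing to settle first is how large the extended state space becomes. The sketch below is one possible layout, not the only one: it assumes momentum is defined as `current_trend - previous_trend` with trends in {0, 1, 2}, and the helper `momentum_index` is a hypothetical name introduced here. The environment changes are left as the exercise.

```python
import numpy as np

n_positions, n_trends, n_actions = 3, 3, 3

# momentum = current_trend - previous_trend, with trends in {0, 1, 2}
momentum_values = sorted({cur - prev
                          for cur in range(n_trends)
                          for prev in range(n_trends)})
n_momentum = len(momentum_values)   # five values: -2, -1, 0, 1, 2

def momentum_index(current_trend, previous_trend):
    """Shift momentum from [-2, 2] to a non-negative index in [0, 4]."""
    return (current_trend - previous_trend) + (n_trends - 1)

# One possible Q-table layout: [position, trend, momentum, action]
q_table = np.zeros((n_positions, n_trends, n_momentum, n_actions))
print(momentum_values, q_table.shape)
```

The state count grows from 9 to 45, so expect to need more training episodes before every state has been visited often enough.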

In [None]:
# Your code here
# Hint: Track previous trend and compute momentum = current_trend - previous_trend

## Summary

### Key Takeaways

1. **Word Embeddings**:
   - Map words to dense vectors
   - Similar words have similar vectors
   - Use pre-trained models (Word2Vec, GloVe, BERT)
   - Applications: sentiment analysis, document similarity

2. **Reinforcement Learning**:
   - Agent learns from environment interaction
   - Q-learning: value-based, tabular method
   - Balance exploration vs exploitation
   - Applications: trading, portfolio optimization

3. **Best Practices**:
   - Embeddings: start with pre-trained, fine-tune if needed
   - RL: start simple (tabular), scale up (DQN)
   - Both: validate thoroughly before deployment

### Course Complete!
Apply these methods in your capstone project.