# **Question #1 - Single Neuron**

## Next Character Prediction in a Word

### Example-1: HELLO



**One-Hot Encoding**

In [None]:
import numpy as np

# Define the data
word = "hello"
char_to_int = {ch: i for i, ch in enumerate(sorted(set(word)))}
int_to_char = {i: ch for ch, i in char_to_int.items()}
encoded_word = [char_to_int[ch] for ch in word]

# Print the encoded word (integer indices; the one-hot vectors are built inside the loops below)
print("Encoded Word using One-Hot Encoding:", encoded_word, "\n")

# Hyperparameters
hidden_size = 1  # Single neuron
input_size = len(char_to_int)
output_size = len(char_to_int)
learning_rate = 0.1

# Model parameters
Wx = np.random.randn(hidden_size, input_size) * 0.01  # Input to hidden weights
Wh = np.random.randn(hidden_size, hidden_size) * 0.01  # Hidden to hidden weights
Wy = np.random.randn(output_size, hidden_size) * 0.01  # Hidden to output weights
bh = np.zeros((hidden_size, 1))  # Hidden bias
by = np.zeros((output_size, 1))  # Output bias

# RNN function
def rnn_step_forward(x, h_prev):
    h_next = np.tanh(np.dot(Wx, x) + np.dot(Wh, h_prev) + bh)  # Single hidden unit
    y = np.dot(Wy, h_next) + by
    return h_next, y

# Training loop
for epoch in range(2000):  # Training for 2000 epochs
    loss = 0
    h_prev = np.zeros((hidden_size, 1))  # Single hidden unit

    for t in range(len(encoded_word) - 1):
        x_t = np.zeros((input_size, 1))
        x_t[encoded_word[t]] = 1  # One-hot encoding
        y_true = encoded_word[t + 1]

        # Forward pass
        h_in = h_prev  # State entering this step; needed for the Wh gradient
        h_prev, y_pred = rnn_step_forward(x_t, h_in)
        y_pred_softmax = np.exp(y_pred) / np.sum(np.exp(y_pred))  # Softmax

        loss += -np.log(y_pred_softmax[y_true])  # Cross-entropy loss

        # Backward pass (single step; no backpropagation through time)
        dy = y_pred_softmax
        dy[y_true] -= 1  # Gradient of softmax + loss

        dWy = np.dot(dy, h_prev.T)
        dby = dy
        dh = np.dot(Wy.T, dy) * (1 - h_prev**2)  # Backprop through tanh activation
        dWx = np.dot(dh, x_t.T)
        dWh = np.dot(dh, h_in.T)  # Wh multiplies the incoming state, not the new one
        dbh = dh

        # Update parameters
        Wy -= learning_rate * dWy
        by -= learning_rate * dby
        Wx -= learning_rate * dWx
        Wh -= learning_rate * dWh
        bh -= learning_rate * dbh

    if epoch % 200 == 0:
        print(f"Epoch {epoch}, Loss: {loss}")

# Predict next characters
h_prev = np.zeros((hidden_size, 1))  # Single hidden unit
print("\nPredictions:")
for t in range(len(encoded_word) - 1):
    x_t = np.zeros((input_size, 1))
    x_t[encoded_word[t]] = 1  # One-hot encoding
    h_prev, y_pred = rnn_step_forward(x_t, h_prev)
    next_char = int_to_char[np.argmax(y_pred)]
    print(f"Input: {int_to_char[encoded_word[t]]}, Predicted: {next_char}")

Encoded Word using One-Hot Encoding: [1, 0, 2, 2, 3] 

Epoch 0, Loss: [5.5985341]
Epoch 200, Loss: [3.73582889]
Epoch 400, Loss: [0.73169815]
Epoch 600, Loss: [0.60448336]
Epoch 800, Loss: [0.46434777]
Epoch 1000, Loss: [0.54941086]
Epoch 1200, Loss: [0.34806325]
Epoch 1400, Loss: [0.33578572]
Epoch 1600, Loss: [0.35104384]
Epoch 1800, Loss: [0.40441667]

Predictions:
Input: h, Predicted: e
Input: e, Predicted: l
Input: l, Predicted: o
Input: l, Predicted: o
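
The backward pass above relies on the identity that the gradient of softmax followed by cross-entropy, taken with respect to the raw scores, is the probability vector minus the one-hot target. The following is a minimal sketch that checks this identity with central finite differences. The helper names `softmax` and `cross_entropy`, the seed, and the chosen target index are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a column vector of scores."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def cross_entropy(z, target):
    """Cross-entropy of softmax(z) against an integer class index."""
    return -np.log(softmax(z)[target, 0])

rng = np.random.default_rng(0)  # fixed seed, chosen only for illustration
z = rng.normal(size=(4, 1))     # one score per character in the 4-letter vocabulary
target = 2                      # index of 'l' in the "hello" vocabulary

# Analytic gradient: probabilities minus the one-hot target
analytic = softmax(z)
analytic[target] -= 1

# Central finite differences as an independent check
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.shape[0]):
    bump = np.zeros_like(z)
    bump[i] = eps
    numeric[i] = (cross_entropy(z + bump, target)
                  - cross_entropy(z - bump, target)) / (2 * eps)

max_abs_diff = float(np.max(np.abs(analytic - numeric)))
print("max |analytic - numeric| =", max_abs_diff)
```

Subtracting `np.max(z)` before exponentiating does not change the result of the softmax; it only guards against overflow for large scores.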


**Word Embedding**

In [None]:
import numpy as np

# Define the data
word = "hello"
char_to_int = {ch: i for i, ch in enumerate(sorted(set(word)))}
int_to_char = {i: ch for ch, i in char_to_int.items()}

# Embedding: one random vector per character (size 2 for simplicity); the vectors stay fixed during training
embedding_size = 2
embedding_matrix = np.random.randn(len(char_to_int), embedding_size)
encoded_word = [embedding_matrix[char_to_int[ch]] for ch in word]

print("Encoded Word using Word Embedding:\n", encoded_word, "\n")

# Hyperparameters
hidden_size = 1  # Single neuron
input_size = embedding_size  # The RNN input is the embedding vector, not a one-hot vector
output_size = len(char_to_int)
learning_rate = 0.1

# Model parameters
Wx = np.random.randn(hidden_size, embedding_size) * 0.01  # Adjusted to match embedding size
Wh = np.random.randn(hidden_size, hidden_size) * 0.01  # Hidden to hidden weights
Wy = np.random.randn(output_size, hidden_size) * 0.01  # Hidden to output weights
bh = np.zeros((hidden_size, 1))  # Hidden bias
by = np.zeros((output_size, 1))  # Output bias


# RNN function
def rnn_step_forward(x, h_prev):
    h_next = np.tanh(np.dot(Wx, x) + np.dot(Wh, h_prev) + bh)  # Single hidden unit
    y = np.dot(Wy, h_next) + by
    return h_next, y

# Training loop
for epoch in range(2000):  # Training for 2000 epochs
    loss = 0
    h_prev = np.zeros((hidden_size, 1))  # Single hidden unit

    for t in range(len(encoded_word) - 1):
        x_t = np.array(encoded_word[t]).reshape(-1, 1)  # Convert embedding to column vector
        y_true = char_to_int[word[t + 1]]  # Use index from the original word

        # Forward pass
        h_in = h_prev  # State entering this step; needed for the Wh gradient
        h_prev, y_pred = rnn_step_forward(x_t, h_in)
        y_pred_softmax = np.exp(y_pred) / np.sum(np.exp(y_pred))  # Softmax

        loss += -np.log(y_pred_softmax[y_true])  # Cross-entropy loss

        # Backward pass (single step; no backpropagation through time)
        dy = y_pred_softmax
        dy[y_true] -= 1  # Gradient of softmax + loss

        dWy = np.dot(dy, h_prev.T)
        dby = dy
        dh = np.dot(Wy.T, dy) * (1 - h_prev**2)  # Backprop through tanh activation
        dWx = np.dot(dh, x_t.T)
        dWh = np.dot(dh, h_in.T)  # Wh multiplies the incoming state, not the new one
        dbh = dh

        # Update parameters
        Wy -= learning_rate * dWy
        by -= learning_rate * dby
        Wx -= learning_rate * dWx
        Wh -= learning_rate * dWh
        bh -= learning_rate * dbh

    if epoch % 200 == 0:
        print(f"Epoch {epoch}, Loss: {loss}")

# Predict next characters
h_prev = np.zeros((hidden_size, 1))  # Single hidden unit
print("\nPredictions:")
for t in range(len(encoded_word) - 1):
    x_t = np.array(encoded_word[t]).reshape(-1, 1)  # Convert embedding to column vector
    h_prev, y_pred = rnn_step_forward(x_t, h_prev)
    next_char = int_to_char[np.argmax(y_pred)]
    print(f"Input: {word[t]}, Predicted: {next_char}")

Encoded Word using Word Embedding:
 [array([ 0.55979795, -1.04837189]), array([ 0.86685017, -1.48646659]), array([1.50208173, 0.48843163]), array([1.50208173, 0.48843163]), array([ 0.97422545, -0.29840624])] 

Epoch 0, Loss: [5.59830852]
Epoch 200, Loss: [1.93608268]
Epoch 400, Loss: [1.68127556]
Epoch 600, Loss: [1.61010648]
Epoch 800, Loss: [1.57747913]
Epoch 1000, Loss: [1.5589571]
Epoch 1200, Loss: [1.54708198]
Epoch 1400, Loss: [1.5388437]
Epoch 1600, Loss: [1.53280378]
Epoch 1800, Loss: [1.52819099]

Predictions:
Input: h, Predicted: l
Input: e, Predicted: l
Input: l, Predicted: l
Input: l, Predicted: o
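
An embedding lookup is the same operation as multiplying a one-hot row vector by the embedding matrix, which is why the embedding cell can reuse the RNN code with only the shape of `Wx` changed. The sketch below checks that equivalence for each character of "hello". The `one_hot` helper and the fixed seed are assumptions added for illustration.

```python
import numpy as np

word = "hello"
char_to_int = {ch: i for i, ch in enumerate(sorted(set(word)))}
vocab_size = len(char_to_int)
embedding_size = 2

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
embedding_matrix = rng.normal(size=(vocab_size, embedding_size))

def one_hot(index, size):
    """Return a 1-D one-hot vector with a single 1.0 at `index`."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

for ch in word:
    idx = char_to_int[ch]
    looked_up = embedding_matrix[idx]                        # direct row lookup
    projected = one_hot(idx, vocab_size) @ embedding_matrix  # one-hot times matrix
    assert np.allclose(looked_up, projected)
    print(ch, looked_up)
```

Both occurrences of `l` map to the same row, so the network can only tell them apart through the hidden state.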


**Bag of Words**

In [None]:
import numpy as np

# Define the data
word = "hello"
char_to_int = {ch: i for i, ch in enumerate(sorted(set(word)))}
int_to_char = {i: ch for ch, i in char_to_int.items()}

# Bag of Words Representation
bag_of_words = np.zeros((len(char_to_int),), dtype=int)
for ch in word:
    bag_of_words[char_to_int[ch]] += 1

print("Encoded Word using Bag of Words:\n", bag_of_words, "\n")

# Define Hyperparameters and Model Parameters
hidden_size = 1  # Single neuron
input_size = len(char_to_int)
output_size = len(char_to_int)
learning_rate = 0.1

Wx = np.random.randn(hidden_size, input_size) * 0.01
Wh = np.random.randn(hidden_size, hidden_size) * 0.01
Wy = np.random.randn(output_size, hidden_size) * 0.01
bh = np.zeros((hidden_size, 1))
by = np.zeros((output_size, 1))

# RNN function
def rnn_step_forward(x, h_prev):
    h_next = np.tanh(np.dot(Wx, x) + np.dot(Wh, h_prev) + bh)
    y = np.dot(Wy, h_next) + by
    return h_next, y

# Example: Input to RNN
x_input = bag_of_words.reshape(-1, 1)  # The Bag of Words vector as input
h_prev = np.zeros((hidden_size, 1))
h_next, y_pred = rnn_step_forward(x_input, h_prev)

# No training is performed for Bag of Words: the whole word collapses into a single
# count vector, so there is no sequence of (input, next character) pairs to learn from.
# The scores printed below come from the randomly initialized weights.
print("RNN Output using Bag of Words Encoding:\n", y_pred)

# Prediction
predicted_index = np.argmax(y_pred)  # Index of the highest raw score (untrained, so effectively arbitrary)
predicted_char = int_to_char[predicted_index]
print(f"Predicted next character based on Bag of Words encoding: {predicted_char}")

Encoded Word using Bag of Words:
 [1 1 2 1] 

RNN Output using Bag of Words Encoding:
 [[ 6.43432841e-05]
 [-1.52126512e-04]
 [ 4.66170685e-05]
 [-1.86400995e-04]]
Predicted next character based on Bag of Words encoding: e
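
A Bag of Words vector only counts characters, so any reordering of the same letters produces the same encoding. The sketch below makes this concrete by encoding "hello" and its reversal. The function name `bag_of_chars` is a hypothetical helper introduced here, not something defined in the notebook.

```python
import numpy as np

def bag_of_chars(text, char_to_int):
    """Count how often each vocabulary character occurs in `text`."""
    counts = np.zeros(len(char_to_int), dtype=int)
    for ch in text:
        counts[char_to_int[ch]] += 1
    return counts

char_to_int = {ch: i for i, ch in enumerate(sorted(set("hello")))}
original = bag_of_chars("hello", char_to_int)
reordered = bag_of_chars("olleh", char_to_int)

print("hello ->", original)
print("olleh ->", reordered)
print("identical encodings:", np.array_equal(original, reordered))
```

Because the order is lost, this encoding on its own gives a next-character model nothing to condition on except letter frequencies.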


**Bag of Words - Using Sliding Window**

In [None]:
import numpy as np

# Define the data
word = "hello"
char_to_int = {ch: i for i, ch in enumerate(sorted(set(word)))}
int_to_char = {i: ch for ch, i in char_to_int.items()}

# Bag of Words with Sliding Windows
window_size = 2  # Define the size of the sliding window
encoded_windows = []

for i in range(len(word) - window_size + 1):
    window = word[i : i + window_size]  # Extract a window
    bag_of_words = np.zeros((len(char_to_int),), dtype=int)
    for ch in window:
        bag_of_words[char_to_int[ch]] += 1
    encoded_windows.append(bag_of_words)

print("Sliding Window Encoded Words (Bag of Words):")
for i, bow in enumerate(encoded_windows):
    print(f"Window {i + 1}: {bow}")

# Feed these sliding windows to an untrained RNN (random weights, so the predictions are arbitrary)
hidden_size = 1
Wx = np.random.randn(hidden_size, len(char_to_int)) * 0.01
Wh = np.random.randn(hidden_size, hidden_size) * 0.01
Wy = np.random.randn(len(char_to_int), hidden_size) * 0.01
bh = np.zeros((hidden_size, 1))
by = np.zeros((len(char_to_int), 1))

def rnn_step_forward(x, h_prev):
    h_next = np.tanh(np.dot(Wx, x) + np.dot(Wh, h_prev) + bh)
    y = np.dot(Wy, h_next) + by
    return h_next, y

# Sequentially process the sliding windows
h_prev = np.zeros((hidden_size, 1))
for t, bow in enumerate(encoded_windows):
    x_t = bow.reshape(-1, 1)  # Reshape for RNN input
    h_prev, y_pred = rnn_step_forward(x_t, h_prev)
    next_char = int_to_char[np.argmax(y_pred)]
    print(f"Window {t + 1}: Predicted next character: {next_char}")

Sliding Window Encoded Words (Bag of Words):
Window 1: [1 1 0 0]
Window 2: [1 0 1 0]
Window 3: [0 0 2 0]
Window 4: [0 0 1 1]
Window 1: Predicted next character: l
Window 2: Predicted next character: o
Window 3: Predicted next character: o
Window 4: Predicted next character: o
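
The cell above runs a forward pass for all four windows, but the last window ("lo") has no following character to predict. If the windows were to be used for training, one possible arrangement is to pair each window with the character that follows it and drop the final window. This is a sketch under that assumption; `window_training_pairs` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def window_training_pairs(word, window_size):
    """Return (window, count_vector, next_char) for every window that has a successor."""
    char_to_int = {ch: i for i, ch in enumerate(sorted(set(word)))}
    pairs = []
    # Stop one window early: the final window has no next character
    for i in range(len(word) - window_size):
        window = word[i : i + window_size]
        counts = np.zeros(len(char_to_int), dtype=int)
        for ch in window:
            counts[char_to_int[ch]] += 1
        pairs.append((window, counts, word[i + window_size]))
    return pairs

pairs = window_training_pairs("hello", 2)
for window, counts, target in pairs:
    print(f"{window} {counts} -> {target}")
```

With a window size of 2, "hello" yields three usable pairs instead of four windows.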


**Hashing Encoding**

In [None]:
import numpy as np
import hashlib

# Define the data
word = "hello"

# Define Hashing Function
def hash_function(value, num_buckets):
    hashed = int(hashlib.md5(value.encode()).hexdigest(), 16)
    return hashed % num_buckets

# Hashing Encoding (character-by-character for RNN compatibility)
num_buckets = 5  # Define the number of buckets for hashing
hash_vectors = []

for ch in word:
    hash_vector = np.zeros((num_buckets,), dtype=int)
    bucket = hash_function(ch, num_buckets)
    hash_vector[bucket] += 1
    hash_vectors.append(hash_vector)

print("Encoded Word using Hashing Encoding (character-by-character):")
for i, vec in enumerate(hash_vectors):
    print(f"Character '{word[i]}': {vec}")
print("\n")

# Define Hyperparameters and Model Parameters
hidden_size = 1  # Single neuron
input_size = num_buckets
output_size = num_buckets
learning_rate = 0.1

Wx = np.random.randn(hidden_size, input_size) * 0.01
Wh = np.random.randn(hidden_size, hidden_size) * 0.01
Wy = np.random.randn(output_size, hidden_size) * 0.01
bh = np.zeros((hidden_size, 1))
by = np.zeros((output_size, 1))

# RNN function
def rnn_step_forward(x, h_prev):
    h_next = np.tanh(np.dot(Wx, x) + np.dot(Wh, h_prev) + bh)
    y = np.dot(Wy, h_next) + by
    return h_next, y

# Predict next characters using the untrained RNN (random weights, so the predictions are arbitrary)
h_prev = np.zeros((hidden_size, 1))  # Initialize hidden state
print("Predictions:")

for t in range(len(hash_vectors) - 1):  # Predict for all but the last character
    x_t = hash_vectors[t].reshape(-1, 1)
    h_prev, y_pred = rnn_step_forward(x_t, h_prev)

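    # Decode prediction (find the bucket with the highest score)
    predicted_bucket = np.argmax(y_pred)
    # Map the bucket back to the first character of the word that hashes to it;
    # predicted_char stays None when no character of the word falls in that bucket
    predicted_char = None
    for ch in word:  # Characters that collide in one bucket cannot be told apart here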
        if hash_function(ch, num_buckets) == predicted_bucket:
            predicted_char = ch
            break

    print(f"Input: '{word[t]}', Predicted next character: '{predicted_char}'")

Encoded Word using Hashing Encoding (character-by-character):
Character 'h': [0 0 0 1 0]
Character 'e': [0 1 0 0 0]
Character 'l': [0 0 1 0 0]
Character 'l': [0 0 1 0 0]
Character 'o': [0 0 0 1 0]


Predictions:
Input: 'h', Predicted next character: 'h'
Input: 'e', Predicted next character: 'h'
Input: 'l', Predicted next character: 'None'
Input: 'l', Predicted next character: 'None'
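
Two effects are visible in the output above: 'h' and 'o' share the same vector (a collision), and some predictions decode to `None` because the winning bucket holds no character of the word. A small sketch that tabulates which characters land in which bucket makes both effects easy to inspect. It reuses the notebook's `hash_function`; `bucket_table` is a hypothetical helper added for illustration.

```python
import hashlib
from collections import defaultdict

def hash_function(value, num_buckets):
    """Same MD5-based bucketing as in the cell above."""
    hashed = int(hashlib.md5(value.encode()).hexdigest(), 16)
    return hashed % num_buckets

def bucket_table(text, num_buckets):
    """Group the distinct characters of `text` by the bucket they hash to."""
    table = defaultdict(list)
    for ch in sorted(set(text)):
        table[hash_function(ch, num_buckets)].append(ch)
    return dict(table)

num_buckets = 5
table = bucket_table("hello", num_buckets)
empty_buckets = [b for b in range(num_buckets) if b not in table]
collisions = {b: chars for b, chars in table.items() if len(chars) > 1}

print("bucket -> characters:", table)
print("buckets with collisions:", collisions)
print("empty buckets (decode to None):", empty_buckets)
```

Increasing `num_buckets` reduces the chance of collisions but leaves more empty buckets that the output layer can still select.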


### Example-2: HAPPY



**One-Hot Encoding**

In [None]:
import numpy as np

# Define the data
word = "happy"
char_to_int = {ch: i for i, ch in enumerate(sorted(set(word)))}
int_to_char = {i: ch for ch, i in char_to_int.items()}
encoded_word = [char_to_int[ch] for ch in word]

# Print the encoded word (integer indices; the one-hot vectors are built inside the loops below)
print("Encoded Word using One-Hot Encoding:", encoded_word, "\n")

# Hyperparameters
hidden_size = 1  # Single neuron
input_size = len(char_to_int)
output_size = len(char_to_int)
learning_rate = 0.1

# Model parameters
Wx = np.random.randn(hidden_size, input_size) * 0.01  # Input to hidden weights
Wh = np.random.randn(hidden_size, hidden_size) * 0.01  # Hidden to hidden weights
Wy = np.random.randn(output_size, hidden_size) * 0.01  # Hidden to output weights
bh = np.zeros((hidden_size, 1))  # Hidden bias
by = np.zeros((output_size, 1))  # Output bias

# RNN function
def rnn_step_forward(x, h_prev):
    h_next = np.tanh(np.dot(Wx, x) + np.dot(Wh, h_prev) + bh)  # Single hidden unit
    y = np.dot(Wy, h_next) + by
    return h_next, y

# Training loop
for epoch in range(2000):  # Training for 2000 epochs
    loss = 0
    h_prev = np.zeros((hidden_size, 1))  # Single hidden unit

    for t in range(len(encoded_word) - 1):
        x_t = np.zeros((input_size, 1))
        x_t[encoded_word[t]] = 1  # One-hot encoding
        y_true = encoded_word[t + 1]

        # Forward pass
        h_in = h_prev  # State entering this step; needed for the Wh gradient
        h_prev, y_pred = rnn_step_forward(x_t, h_in)
        y_pred_softmax = np.exp(y_pred) / np.sum(np.exp(y_pred))  # Softmax

        loss += -np.log(y_pred_softmax[y_true])  # Cross-entropy loss

        # Backward pass (single step; no backpropagation through time)
        dy = y_pred_softmax
        dy[y_true] -= 1  # Gradient of softmax + loss

        dWy = np.dot(dy, h_prev.T)
        dby = dy
        dh = np.dot(Wy.T, dy) * (1 - h_prev**2)  # Backprop through tanh activation
        dWx = np.dot(dh, x_t.T)
        dWh = np.dot(dh, h_in.T)  # Wh multiplies the incoming state, not the new one
        dbh = dh

        # Update parameters
        Wy -= learning_rate * dWy
        by -= learning_rate * dby
        Wx -= learning_rate * dWx
        Wh -= learning_rate * dWh
        bh -= learning_rate * dbh

    if epoch % 200 == 0:
        print(f"Epoch {epoch}, Loss: {loss}")

# Predict next characters
h_prev = np.zeros((hidden_size, 1))  # Single hidden unit
print("\nPredictions:")
for t in range(len(encoded_word) - 1):
    x_t = np.zeros((input_size, 1))
    x_t[encoded_word[t]] = 1  # One-hot encoding
    h_prev, y_pred = rnn_step_forward(x_t, h_prev)
    next_char = int_to_char[np.argmax(y_pred)]
    print(f"Input: {int_to_char[encoded_word[t]]}, Predicted: {next_char}")

Encoded Word using One-Hot Encoding: [1, 0, 2, 2, 3] 

Epoch 0, Loss: [5.59826591]
Epoch 200, Loss: [1.11149072]
Epoch 400, Loss: [0.64693971]
Epoch 600, Loss: [1.24950606]
Epoch 800, Loss: [1.7811296]
Epoch 1000, Loss: [0.7302327]
Epoch 1200, Loss: [0.37019844]
Epoch 1400, Loss: [0.33125695]
Epoch 1600, Loss: [0.28336462]
Epoch 1800, Loss: [0.31947961]

Predictions:
Input: h, Predicted: a
Input: a, Predicted: p
Input: p, Predicted: p
Input: p, Predicted: y
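
In a single RNN step, `Wh` multiplies the hidden state that enters the step, so its one-step gradient is `dh` times that incoming state. The sketch below checks this with central finite differences for a single hidden unit. The seed, the incoming state of 0.5, the input index, and the target index are arbitrary illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, chosen only for illustration
input_size, hidden_size, output_size = 4, 1, 4
Wx = rng.normal(size=(hidden_size, input_size))
Wh = rng.normal(size=(hidden_size, hidden_size))
Wy = rng.normal(size=(output_size, hidden_size))
bh = np.zeros((hidden_size, 1))
by = np.zeros((output_size, 1))

x = np.zeros((input_size, 1))
x[2] = 1                     # one-hot input (index 2)
h_in = np.array([[0.5]])     # nonzero incoming state so the Wh gradient is visible
target = 3

def step_loss(Wh_value):
    """Run one forward step with the given Wh and return (loss, new state, probabilities)."""
    h = np.tanh(Wx @ x + Wh_value @ h_in + bh)
    scores = Wy @ h + by
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return -np.log(probs[target, 0]), h, probs

loss, h_out, probs = step_loss(Wh)

# Analytic one-step gradient
dy = probs.copy()
dy[target] -= 1
dh = (Wy.T @ dy) * (1 - h_out ** 2)
dWh_analytic = dh @ h_in.T   # uses the state that entered the step

# Central finite difference on the single entry of Wh
eps = 1e-6
dWh_numeric = (step_loss(Wh + eps)[0] - step_loss(Wh - eps)[0]) / (2 * eps)

gap = abs(float(dWh_analytic[0, 0]) - float(dWh_numeric))
print("analytic:", float(dWh_analytic[0, 0]), "numeric:", float(dWh_numeric), "gap:", gap)
```

This covers only the gradient within one time step; full backpropagation through time would also carry `dh` back into earlier steps.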


**Word Embedding**

In [None]:
import numpy as np

# Define the data
word = "happy"
char_to_int = {ch: i for i, ch in enumerate(sorted(set(word)))}
int_to_char = {i: ch for ch, i in char_to_int.items()}

# Embedding: one random vector per character (size 2 for simplicity); the vectors stay fixed during training
embedding_size = 2
embedding_matrix = np.random.randn(len(char_to_int), embedding_size)
encoded_word = [embedding_matrix[char_to_int[ch]] for ch in word]

print("Encoded Word using Word Embedding:\n", encoded_word, "\n")

# Hyperparameters
hidden_size = 1  # Single neuron
input_size = embedding_size  # The RNN input is the embedding vector, not a one-hot vector
output_size = len(char_to_int)
learning_rate = 0.1

# Model parameters
Wx = np.random.randn(hidden_size, embedding_size) * 0.01  # Adjusted to match embedding size
Wh = np.random.randn(hidden_size, hidden_size) * 0.01  # Hidden to hidden weights
Wy = np.random.randn(output_size, hidden_size) * 0.01  # Hidden to output weights
bh = np.zeros((hidden_size, 1))  # Hidden bias
by = np.zeros((output_size, 1))  # Output bias


# RNN function
def rnn_step_forward(x, h_prev):
    h_next = np.tanh(np.dot(Wx, x) + np.dot(Wh, h_prev) + bh)  # Single hidden unit
    y = np.dot(Wy, h_next) + by
    return h_next, y

# Training loop
for epoch in range(2000):  # Training for 2000 epochs
    loss = 0
    h_prev = np.zeros((hidden_size, 1))  # Single hidden unit

    for t in range(len(encoded_word) - 1):
        x_t = np.array(encoded_word[t]).reshape(-1, 1)  # Convert embedding to column vector
        y_true = char_to_int[word[t + 1]]  # Use index from the original word

        # Forward pass
        h_in = h_prev  # State entering this step; needed for the Wh gradient
        h_prev, y_pred = rnn_step_forward(x_t, h_in)
        y_pred_softmax = np.exp(y_pred) / np.sum(np.exp(y_pred))  # Softmax

        loss += -np.log(y_pred_softmax[y_true])  # Cross-entropy loss

        # Backward pass (single step; no backpropagation through time)
        dy = y_pred_softmax
        dy[y_true] -= 1  # Gradient of softmax + loss

        dWy = np.dot(dy, h_prev.T)
        dby = dy
        dh = np.dot(Wy.T, dy) * (1 - h_prev**2)  # Backprop through tanh activation
        dWx = np.dot(dh, x_t.T)
        dWh = np.dot(dh, h_in.T)  # Wh multiplies the incoming state, not the new one
        dbh = dh

        # Update parameters
        Wy -= learning_rate * dWy
        by -= learning_rate * dby
        Wx -= learning_rate * dWx
        Wh -= learning_rate * dWh
        bh -= learning_rate * dbh

    if epoch % 200 == 0:
        print(f"Epoch {epoch}, Loss: {loss}")

# Predict next characters
h_prev = np.zeros((hidden_size, 1))  # Single hidden unit
print("\nPredictions:")
for t in range(len(encoded_word) - 1):
    x_t = np.array(encoded_word[t]).reshape(-1, 1)  # Convert embedding to column vector
    h_prev, y_pred = rnn_step_forward(x_t, h_prev)
    next_char = int_to_char[np.argmax(y_pred)]
    print(f"Input: {word[t]}, Predicted: {next_char}")

Encoded Word using Word Embedding:
 [array([0.0016641 , 1.47165403]), array([0.3374475 , 0.42373716]), array([0.09433769, 1.38656867]), array([0.09433769, 1.38656867]), array([-1.18585439,  0.08112651])] 

Epoch 0, Loss: [5.59840657]
Epoch 200, Loss: [3.77824674]
Epoch 400, Loss: [2.16512331]
Epoch 600, Loss: [5.06827428]
Epoch 800, Loss: [2.95859308]
Epoch 1000, Loss: [1.98271182]
Epoch 1200, Loss: [2.14352732]
Epoch 1400, Loss: [2.03112976]
Epoch 1600, Loss: [2.2042331]
Epoch 1800, Loss: [2.16471457]

Predictions:
Input: h, Predicted: a
Input: a, Predicted: p
Input: p, Predicted: p
Input: p, Predicted: p


**Bag of Words**

In [None]:
import numpy as np

# Define the data
word = "happy"
char_to_int = {ch: i for i, ch in enumerate(sorted(set(word)))}
int_to_char = {i: ch for ch, i in char_to_int.items()}

# Bag of Words Representation
bag_of_words = np.zeros((len(char_to_int),), dtype=int)
for ch in word:
    bag_of_words[char_to_int[ch]] += 1

print("Encoded Word using Bag of Words:\n", bag_of_words, "\n")

# Define Hyperparameters and Model Parameters
hidden_size = 1  # Single neuron
input_size = len(char_to_int)
output_size = len(char_to_int)
learning_rate = 0.1

Wx = np.random.randn(hidden_size, input_size) * 0.01
Wh = np.random.randn(hidden_size, hidden_size) * 0.01
Wy = np.random.randn(output_size, hidden_size) * 0.01
bh = np.zeros((hidden_size, 1))
by = np.zeros((output_size, 1))

# RNN function
def rnn_step_forward(x, h_prev):
    h_next = np.tanh(np.dot(Wx, x) + np.dot(Wh, h_prev) + bh)
    y = np.dot(Wy, h_next) + by
    return h_next, y

# Example: Input to RNN
x_input = bag_of_words.reshape(-1, 1)  # The Bag of Words vector as input
h_prev = np.zeros((hidden_size, 1))
h_next, y_pred = rnn_step_forward(x_input, h_prev)

# No training is performed for Bag of Words: the whole word collapses into a single
# count vector, so there is no sequence of (input, next character) pairs to learn from.
# The scores printed below come from the randomly initialized weights.
print("RNN Output using Bag of Words Encoding:\n", y_pred)

# Prediction
predicted_index = np.argmax(y_pred)  # Index of the highest raw score (untrained, so effectively arbitrary)
predicted_char = int_to_char[predicted_index]
print(f"Predicted next character based on Bag of Words encoding: {predicted_char}")

Encoded Word using Bag of Words:
 [1 1 2 1] 

RNN Output using Bag of Words Encoding:
 [[ 2.35353406e-04]
 [ 7.45161395e-05]
 [-8.00167329e-05]
 [-1.27538369e-04]]
Predicted next character based on Bag of Words encoding: a


**Bag of Words - Using Sliding Window**

In [None]:
import numpy as np

# Define the data
word = "happy"
char_to_int = {ch: i for i, ch in enumerate(sorted(set(word)))}
int_to_char = {i: ch for ch, i in char_to_int.items()}

# Bag of Words with Sliding Windows
window_size = 2  # Define the size of the sliding window
encoded_windows = []

for i in range(len(word) - window_size + 1):
    window = word[i : i + window_size]  # Extract a window
    bag_of_words = np.zeros((len(char_to_int),), dtype=int)
    for ch in window:
        bag_of_words[char_to_int[ch]] += 1
    encoded_windows.append(bag_of_words)

print("Sliding Window Encoded Words (Bag of Words):")
for i, bow in enumerate(encoded_windows):
    print(f"Window {i + 1}: {bow}")

# Feed these sliding windows to an untrained RNN (random weights, so the predictions are arbitrary)
hidden_size = 1
Wx = np.random.randn(hidden_size, len(char_to_int)) * 0.01
Wh = np.random.randn(hidden_size, hidden_size) * 0.01
Wy = np.random.randn(len(char_to_int), hidden_size) * 0.01
bh = np.zeros((hidden_size, 1))
by = np.zeros((len(char_to_int), 1))

def rnn_step_forward(x, h_prev):
    h_next = np.tanh(np.dot(Wx, x) + np.dot(Wh, h_prev) + bh)
    y = np.dot(Wy, h_next) + by
    return h_next, y

# Sequentially process the sliding windows
h_prev = np.zeros((hidden_size, 1))
for t, bow in enumerate(encoded_windows):
    x_t = bow.reshape(-1, 1)  # Reshape for RNN input
    h_prev, y_pred = rnn_step_forward(x_t, h_prev)
    next_char = int_to_char[np.argmax(y_pred)]
    print(f"Window {t + 1}: Predicted next character: {next_char}")

Sliding Window Encoded Words (Bag of Words):
Window 1: [1 1 0 0]
Window 2: [1 0 1 0]
Window 3: [0 0 2 0]
Window 4: [0 0 1 1]
Window 1: Predicted next character: p
Window 2: Predicted next character: p
Window 3: Predicted next character: p
Window 4: Predicted next character: p


**Hashing Encoding**

In [None]:
import numpy as np
import hashlib

# Define the data
word = "happy"

# Define Hashing Function
def hash_function(value, num_buckets):
    hashed = int(hashlib.md5(value.encode()).hexdigest(), 16)
    return hashed % num_buckets

# Hashing Encoding (character-by-character for RNN compatibility)
num_buckets = 5  # Define the number of buckets for hashing
hash_vectors = []

for ch in word:
    hash_vector = np.zeros((num_buckets,), dtype=int)
    bucket = hash_function(ch, num_buckets)
    hash_vector[bucket] += 1
    hash_vectors.append(hash_vector)

print("Encoded Word using Hashing Encoding (character-by-character):")
for i, vec in enumerate(hash_vectors):
    print(f"Character '{word[i]}': {vec}")
print("\n")

# Define Hyperparameters and Model Parameters
hidden_size = 1  # Single neuron
input_size = num_buckets
output_size = num_buckets
learning_rate = 0.1

Wx = np.random.randn(hidden_size, input_size) * 0.01
Wh = np.random.randn(hidden_size, hidden_size) * 0.01
Wy = np.random.randn(output_size, hidden_size) * 0.01
bh = np.zeros((hidden_size, 1))
by = np.zeros((output_size, 1))

# RNN function
def rnn_step_forward(x, h_prev):
    h_next = np.tanh(np.dot(Wx, x) + np.dot(Wh, h_prev) + bh)
    y = np.dot(Wy, h_next) + by
    return h_next, y

# Predict next characters using the untrained RNN (random weights, so the predictions are arbitrary)
h_prev = np.zeros((hidden_size, 1))  # Initialize hidden state
print("Predictions:")

for t in range(len(hash_vectors) - 1):  # Predict for all but the last character
    x_t = hash_vectors[t].reshape(-1, 1)
    h_prev, y_pred = rnn_step_forward(x_t, h_prev)

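    # Decode prediction (find the bucket with the highest score)
    predicted_bucket = np.argmax(y_pred)
    # Map the bucket back to the first character of the word that hashes to it;
    # predicted_char stays None when no character of the word falls in that bucket
    predicted_char = None
    for ch in word:  # Characters that collide in one bucket cannot be told apart here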
        if hash_function(ch, num_buckets) == predicted_bucket:
            predicted_char = ch
            break

    print(f"Input: '{word[t]}', Predicted next character: '{predicted_char}'")

Encoded Word using Hashing Encoding (character-by-character):
Character 'h': [0 0 0 1 0]
Character 'a': [0 0 1 0 0]
Character 'p': [0 1 0 0 0]
Character 'p': [0 1 0 0 0]
Character 'y': [0 0 0 0 1]


Predictions:
Input: 'h', Predicted next character: 'a'
Input: 'a', Predicted next character: 'h'
Input: 'p', Predicted next character: 'h'
Input: 'p', Predicted next character: 'h'


## Next Word Prediction in a Sentence

### Example-1: I Love Coding

**One-Hot Encoding**

In [None]:
import numpy as np

# Define the data
sentence = "I love coding"
words = sentence.split()
word_to_int = {word: i for i, word in enumerate(sorted(set(words)))}
int_to_word = {i: word for word, i in word_to_int.items()}
encoded_sentence = [word_to_int[word] for word in words]

# Display tokens, vocabulary, and encodings
print("### One-Hot Encoding ###")
print("Sentence:", sentence)
print("Tokens (Words):", words)
print("Vocabulary (Word to Index):", word_to_int)
print("Index to Word Mapping:", int_to_word)
print("Encoded Sentence:", encoded_sentence , "\n")


# Model parameters
input_size = len(word_to_int)  # Number of unique words
output_size = len(word_to_int)
hidden_size = 1  # Single neuron
learning_rate = 0.1

# Initialize weights and biases
Wx = np.random.randn(hidden_size, input_size) * 0.01  # Input to hidden weights
Wh = np.random.randn(hidden_size, hidden_size) * 0.01  # Hidden to hidden weights
Wy = np.random.randn(output_size, hidden_size) * 0.01  # Hidden to output weights
bh = np.zeros((hidden_size, 1))  # Hidden bias
by = np.zeros((output_size, 1))  # Output bias

# RNN forward step
def rnn_step_forward(x, h_prev):
    h_next = np.tanh(np.dot(Wx, x) + np.dot(Wh, h_prev) + bh)  # Single hidden neuron
    y = np.dot(Wy, h_next) + by
    return h_next, y

# Training loop
for epoch in range(2000):  # Training for 2000 epochs
    loss = 0
    h_prev = np.zeros((hidden_size, 1))  # Initialize hidden state

    for t in range(len(encoded_sentence) - 1):
        x_t = np.zeros((input_size, 1))
        x_t[encoded_sentence[t]] = 1  # One-hot encoding
        y_true = encoded_sentence[t + 1]

        # Forward pass
        h_in = h_prev  # State entering this step; needed for the Wh gradient
        h_prev, y_pred = rnn_step_forward(x_t, h_in)
        y_pred_softmax = np.exp(y_pred) / np.sum(np.exp(y_pred))  # Softmax activation

        loss += -np.log(y_pred_softmax[y_true])  # Cross-entropy loss

        # Backward pass (single-step gradient calculation; no backpropagation through time)
        dy = y_pred_softmax
        dy[y_true] -= 1  # Gradient of softmax + loss

        dWy = np.dot(dy, h_prev.T)
        dby = dy
        dh = np.dot(Wy.T, dy) * (1 - h_prev**2)  # Backprop through tanh activation
        dWx = np.dot(dh, x_t.T)
        dWh = np.dot(dh, h_in.T)  # Wh multiplies the incoming state, not the new one
        dbh = dh

        # Update parameters
        Wy -= learning_rate * dWy
        by -= learning_rate * dby
        Wx -= learning_rate * dWx
        Wh -= learning_rate * dWh
        bh -= learning_rate * dbh

    if epoch % 200 == 0:
        print(f"Epoch {epoch}, Loss: {loss}")

# Predict next words
h_prev = np.zeros((hidden_size, 1))  # Initialize hidden state
print("\nPredictions:")
for t in range(len(encoded_sentence) - 1):
    x_t = np.zeros((input_size, 1))
    x_t[encoded_sentence[t]] = 1  # One-hot encoding
    h_prev, y_pred = rnn_step_forward(x_t, h_prev)
    next_word = int_to_word[np.argmax(y_pred)]
    print(f"Input: {int_to_word[encoded_sentence[t]]}, Predicted: {next_word}")

### One-Hot Encoding ###
Sentence: I love coding
Tokens (Words): ['I', 'love', 'coding']
Vocabulary (Word to Index): {'I': 0, 'coding': 1, 'love': 2}
Index to Word Mapping: {0: 'I', 1: 'coding', 2: 'love'}
Encoded Sentence: [0, 2, 1] 

Epoch 0, Loss: [2.23168791]
Epoch 200, Loss: [0.05924359]
Epoch 400, Loss: [0.02322638]
Epoch 600, Loss: [0.01433453]
Epoch 800, Loss: [0.0103422]
Epoch 1000, Loss: [0.00808059]
Epoch 1200, Loss: [0.00662659]
Epoch 1400, Loss: [0.00561388]
Epoch 1600, Loss: [0.00486837]
Epoch 1800, Loss: [0.00429683]

Predictions:
Input: I, Predicted: love
Input: love, Predicted: coding
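
The vocabulary printed above places 'I' before 'coding' because Python's default string ordering sorts uppercase letters ahead of lowercase ones. The sketch below reproduces that mapping and shows how it changes if the tokens are lowercased first. The `build_vocab` helper and its `lowercase` flag are hypothetical additions for illustration; the notebook itself does not lowercase.

```python
def build_vocab(sentence, lowercase=False):
    """Split on whitespace and map each distinct token to a sorted index."""
    tokens = sentence.lower().split() if lowercase else sentence.split()
    word_to_int = {word: i for i, word in enumerate(sorted(set(tokens)))}
    return tokens, word_to_int

tokens, vocab = build_vocab("I love coding")
tokens_lc, vocab_lc = build_vocab("I love coding", lowercase=True)

print("case-sensitive:", vocab)
print("lowercased:    ", vocab_lc)
print("encoded (case-sensitive):", [vocab[w] for w in tokens])
print("encoded (lowercased):    ", [vocab_lc[w] for w in tokens_lc])
```

The index assignment has no effect on what the model can learn; it only changes which row of each weight matrix belongs to which word.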


**Word Embedding**

In [None]:
import numpy as np

# Define the data
sentence = "I love coding"
words = sentence.split()
word_to_int = {word: i for i, word in enumerate(sorted(set(words)))}
int_to_word = {i: word for word, i in word_to_int.items()}
encoded_sentence = [word_to_int[word] for word in words]

# Display tokens, vocabulary, and encodings
print("### Word Embedding ###")
print("Sentence:", sentence)
print("Tokens (Words):", words)
print("Vocabulary (Word to Index):", word_to_int)
print("Index to Word Mapping:", int_to_word)
print("Encoded Sentence:", encoded_sentence, "\n")

# Model parameters
vocab_size = len(word_to_int)  # Number of unique words
embedding_dim = 3  # Size of word embeddings
hidden_size = 1  # Single neuron
output_size = vocab_size
learning_rate = 0.1

# Initialize weights and biases
embedding_matrix = np.random.randn(vocab_size, embedding_dim) * 0.01  # Word embeddings (fixed; not updated during training)
Wx = np.random.randn(hidden_size, embedding_dim) * 0.01  # Embedding to hidden weights
Wh = np.random.randn(hidden_size, hidden_size) * 0.01  # Hidden to hidden weights
Wy = np.random.randn(output_size, hidden_size) * 0.01  # Hidden to output weights
bh = np.zeros((hidden_size, 1))  # Hidden bias
by = np.zeros((output_size, 1))  # Output bias

# RNN forward step
def rnn_step_forward(x, h_prev):
    h_next = np.tanh(np.dot(Wx, x) + np.dot(Wh, h_prev) + bh)  # Single hidden neuron
    y = np.dot(Wy, h_next) + by
    return h_next, y

# Training loop
for epoch in range(2000):  # Training for 2000 epochs
    loss = 0
    h_prev = np.zeros((hidden_size, 1))  # Initialize hidden state

    for t in range(len(encoded_sentence) - 1):
        word_idx = encoded_sentence[t]
        x_t = embedding_matrix[word_idx].reshape(-1, 1)  # Word embedding vector
        y_true = encoded_sentence[t + 1]

        # Forward pass
        h_in = h_prev  # State entering this step; needed for the Wh gradient
        h_prev, y_pred = rnn_step_forward(x_t, h_in)
        y_pred_softmax = np.exp(y_pred) / np.sum(np.exp(y_pred))  # Softmax activation

        loss += -np.log(y_pred_softmax[y_true])  # Cross-entropy loss

        # Backward pass (single-step gradient calculation; no backpropagation through time)
        dy = y_pred_softmax
        dy[y_true] -= 1  # Gradient of softmax + loss

        dWy = np.dot(dy, h_prev.T)
        dby = dy
        dh = np.dot(Wy.T, dy) * (1 - h_prev**2)  # Backprop through tanh activation
        dWx = np.dot(dh, x_t.T)
        dWh = np.dot(dh, h_in.T)  # Wh multiplies the incoming state, not the new one
        dbh = dh

        # Update parameters
        Wy -= learning_rate * dWy
        by -= learning_rate * dby
        Wx -= learning_rate * dWx
        Wh -= learning_rate * dWh
        bh -= learning_rate * dbh

    if epoch % 200 == 0:
        print(f"Epoch {epoch}, Loss: {loss}")

# Predict next words
h_prev = np.zeros((hidden_size, 1))  # Initialize hidden state
print("\nPredictions:")
for t in range(len(encoded_sentence) - 1):
    word_idx = encoded_sentence[t]
    x_t = embedding_matrix[word_idx].reshape(-1, 1)  # Word embedding vector
    h_prev, y_pred = rnn_step_forward(x_t, h_prev)
    next_word = int_to_word[np.argmax(y_pred)]
    print(f"Input: {int_to_word[encoded_sentence[t]]}, Predicted: {next_word}")


### Word Embedding ###
Sentence: I love coding
Tokens (Words): ['I', 'love', 'coding']
Vocabulary (Word to Index): {'I': 0, 'coding': 1, 'love': 2}
Index to Word Mapping: {0: 'I', 1: 'coding', 2: 'love'}
Encoded Sentence: [0, 2, 1] 

Epoch 0, Loss: [2.23167933]
Epoch 200, Loss: [1.47191492]
Epoch 400, Loss: [1.45512751]
Epoch 600, Loss: [1.44949047]
Epoch 800, Loss: [1.44666605]
Epoch 1000, Loss: [1.44497151]
Epoch 1200, Loss: [1.4438428]
Epoch 1400, Loss: [1.44303742]
Epoch 1600, Loss: [1.44243398]
Epoch 1800, Loss: [1.44196505]

Predictions:
Input: I, Predicted: coding
Input: love, Predicted: coding


**Bag of Words**

In [None]:
import numpy as np

# Define the data
sentence = "I love coding"
words = sentence.split()
vocab = sorted(set(words))
word_to_int = {word: i for i, word in enumerate(vocab)}
int_to_word = {i: word for word, i in word_to_int.items()}

# Bag of Words Encoding
bow_vector = np.zeros(len(vocab), dtype=int)
for word in words:
    bow_vector[word_to_int[word]] += 1

# Display tokens, vocabulary, and encoding
print("### Bag of Words (BoW) ###")
print("Sentence:", sentence)
print("Tokens (Words):", words)
print("Vocabulary (Word to Index):", word_to_int)
print("Index to Word Mapping:", int_to_word)
print("BoW Vector (Encoded Sentence):", bow_vector)

# Example: Use BoW encoding as input to a model
input_vector = bow_vector.reshape(-1, 1)  # Reshape for compatibility
hidden_size = 2  # Number of neurons
output_size = len(vocab)
learning_rate = 0.01

# Initialize model parameters
Wx = np.random.randn(hidden_size, len(vocab)) * 0.01
Wy = np.random.randn(output_size, hidden_size) * 0.01
bh = np.zeros((hidden_size, 1))
by = np.zeros((output_size, 1))

# Single forward pass with randomly initialized weights (the model is not trained here)
hidden_layer = np.tanh(np.dot(Wx, input_vector) + bh)
output_layer = np.dot(Wy, hidden_layer) + by

# Apply Softmax to output_layer for predictions
output_probs = np.exp(output_layer) / np.sum(np.exp(output_layer))  # Softmax
predicted_word_idx = np.argmax(output_probs)  # Get the index of the predicted word
predicted_word = int_to_word[predicted_word_idx]

print("\nModel Output (Raw Scores):\n", output_layer)
print("\nSoftmax Probabilities (Predictions):", output_probs.flatten())
print("Predicted Word:", predicted_word)

### Bag of Words (BoW) ###
Sentence: I love coding
Tokens (Words): ['I', 'love', 'coding']
Vocabulary (Word to Index): {'I': 0, 'coding': 1, 'love': 2}
Index to Word Mapping: {0: 'I', 1: 'coding', 2: 'love'}
BoW Vector (Encoded Sentence): [1 1 1]

Model Output (Raw Scores):
 [[-2.20668581e-05]
 [ 8.27444754e-05]
 [-2.39934914e-04]]

Softmax Probabilities (Predictions): [0.33334589 0.33338083 0.33327327]
Predicted Word: coding


**Hashing Encoding**

In [None]:
import numpy as np
import hashlib

# Define the data
sentence = "I love coding"
words = sentence.split()

# Hashing function
def hash_function(value, num_buckets):
    hashed = int(hashlib.md5(value.encode()).hexdigest(), 16)
    return hashed % num_buckets

# Hashing Encoding
num_buckets = 3 # Define the number of buckets
hash_vector = np.zeros(num_buckets, dtype=int)
for word in words:
    bucket = hash_function(word, num_buckets)
    hash_vector[bucket] += 1

# Display tokens and hashed encoding
print("\n### Hashing Encoding ###")
print("Sentence:", sentence)
print("Tokens (Words):", words)
print("Number of Buckets:", num_buckets)
print("Hash Vector (Encoded Sentence):", hash_vector)

# Example: Use Hashing encoding as input to a model
input_vector = hash_vector.reshape(-1, 1)  # Reshape for compatibility
hidden_size = 2  # Number of neurons
output_size = num_buckets
learning_rate = 0.01

# Initialize model parameters
Wx = np.random.randn(hidden_size, num_buckets) * 0.01
Wy = np.random.randn(output_size, hidden_size) * 0.01
bh = np.zeros((hidden_size, 1))
by = np.zeros((output_size, 1))

# Single forward pass with randomly initialized weights (the model is not trained here)
hidden_layer = np.tanh(np.dot(Wx, input_vector) + bh)
output_layer = np.dot(Wy, hidden_layer) + by

# Apply Softmax to output_layer for predictions
output_probs = np.exp(output_layer) / np.sum(np.exp(output_layer))  # Softmax
predicted_bucket = np.argmax(output_probs)  # Get the predicted bucket index

print("\nModel Output (Raw Scores):\n", output_layer)
print("\nSoftmax Probabilities (Predictions):", output_probs.flatten())
print("Predicted Bucket:", predicted_bucket)


### Hashing Encoding ###
Sentence: I love coding
Tokens (Words): ['I', 'love', 'coding']
Number of Buckets: 3
Hash Vector (Encoded Sentence): [2 1 0]

Model Output (Raw Scores):
 [[ 0.00012358]
 [-0.00075279]
 [ 0.00033973]]

Softmax Probabilities (Predictions): [0.33340666 0.3331146  0.33347873]
Predicted Bucket: 2
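
The hash vector `[2 1 0]` above shows that two of the three words landed in the same bucket. The following is a minimal sketch, reusing the same MD5-based `hash_function`, that lists which words share a bucket and how the collision count changes with the bucket count. `bucket_report` is a hypothetical helper introduced only for this illustration, and the bucket counts 8 and 32 are arbitrary choices.

```python
import hashlib
from collections import defaultdict

def hash_function(value, num_buckets):
    # Same scheme as the cell above: MD5 digest reduced modulo the bucket count
    hashed = int(hashlib.md5(value.encode()).hexdigest(), 16)
    return hashed % num_buckets

def bucket_report(words, num_buckets):
    """Hypothetical helper: map each bucket index to the words that land in it."""
    buckets = defaultdict(list)
    for word in words:
        buckets[hash_function(word, num_buckets)].append(word)
    return dict(buckets)

words = "I love coding".split()
for num_buckets in (3, 8, 32):
    report = bucket_report(words, num_buckets)
    # Every word beyond the first in a bucket counts as one collision
    collisions = sum(len(ws) - 1 for ws in report.values())
    print(f"{num_buckets} buckets -> {report} (collisions: {collisions})")
```

Words that collide become indistinguishable to any model that consumes the hash vector, so the bucket count trades vector size against lost information.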


### Example-2: Our national language is Urdu

**One-Hot Encoding**

In [None]:
import numpy as np

# Define the data
sentence = "Our national language is Urdu"
words = sentence.split()
word_to_int = {word: i for i, word in enumerate(sorted(set(words)))}
int_to_word = {i: word for word, i in word_to_int.items()}
encoded_sentence = [word_to_int[word] for word in words]

# Display tokens, vocabulary, and encodings
print("### One-Hot Encoding ###")
print("Sentence:", sentence)
print("Tokens (Words):", words)
print("Vocabulary (Word to Index):", word_to_int)
print("Index to Word Mapping:", int_to_word)
print("Encoded Sentence:", encoded_sentence , "\n")


# Model parameters
input_size = len(word_to_int)  # Number of unique words
output_size = len(word_to_int)
hidden_size = 1  # Single neuron
learning_rate = 0.1

# Initialize weights and biases
Wx = np.random.randn(hidden_size, input_size) * 0.01  # Input to hidden weights
Wh = np.random.randn(hidden_size, hidden_size) * 0.01  # Hidden to hidden weights
Wy = np.random.randn(output_size, hidden_size) * 0.01  # Hidden to output weights
bh = np.zeros((hidden_size, 1))  # Hidden bias
by = np.zeros((output_size, 1))  # Output bias

# RNN forward step
def rnn_step_forward(x, h_prev):
    h_next = np.tanh(np.dot(Wx, x) + np.dot(Wh, h_prev) + bh)  # Single hidden neuron
    y = np.dot(Wy, h_next) + by
    return h_next, y

# Training loop
for epoch in range(2000):  # Training for 2000 epochs
    loss = 0
    h_prev = np.zeros((hidden_size, 1))  # Initialize hidden state

    for t in range(len(encoded_sentence) - 1):
        x_t = np.zeros((input_size, 1))
        x_t[encoded_sentence[t]] = 1  # One-hot encoding
        y_true = encoded_sentence[t + 1]

        # Forward pass
        h_next, y_pred = rnn_step_forward(x_t, h_prev)
        y_pred_softmax = np.exp(y_pred) / np.sum(np.exp(y_pred))  # Softmax activation

        loss += -np.log(y_pred_softmax[y_true])  # Cross-entropy loss

        # Backward pass (one-step gradients only; no backpropagation through time)
        dy = y_pred_softmax
        dy[y_true] -= 1  # Gradient of softmax + loss

        dWy = np.dot(dy, h_next.T)
        dby = dy
        dh = np.dot(Wy.T, dy) * (1 - h_next**2)  # Backprop through tanh activation
        dWx = np.dot(dh, x_t.T)
        dWh = np.dot(dh, h_prev.T)  # Pairs with the hidden state that entered this step
        dbh = dh
        h_prev = h_next  # Carry the hidden state to the next time step

        # Update parameters
        Wy -= learning_rate * dWy
        by -= learning_rate * dby
        Wx -= learning_rate * dWx
        Wh -= learning_rate * dWh
        bh -= learning_rate * dbh

    if epoch % 200 == 0:
        print(f"Epoch {epoch}, Loss: {loss}")

# Predict next words
h_prev = np.zeros((hidden_size, 1))  # Initialize hidden state
print("\nPredictions:")
for t in range(len(encoded_sentence) - 1):
    x_t = np.zeros((input_size, 1))
    x_t[encoded_sentence[t]] = 1  # One-hot encoding
    h_prev, y_pred = rnn_step_forward(x_t, h_prev)
    next_word = int_to_word[np.argmax(y_pred)]
    print(f"Input: {int_to_word[encoded_sentence[t]]}, Predicted: {next_word}")

### One-Hot Encoding ###
Sentence: Our national language is Urdu
Tokens (Words): ['Our', 'national', 'language', 'is', 'Urdu']
Vocabulary (Word to Index): {'Our': 0, 'Urdu': 1, 'is': 2, 'language': 3, 'national': 4}
Index to Word Mapping: {0: 'Our', 1: 'Urdu', 2: 'is', 3: 'language', 4: 'national'}
Encoded Sentence: [0, 4, 3, 2, 1] 

Epoch 0, Loss: [6.55916618]
Epoch 200, Loss: [2.25840477]
Epoch 400, Loss: [1.85519033]
Epoch 600, Loss: [1.72324284]
Epoch 800, Loss: [1.65770307]
Epoch 1000, Loss: [1.60609088]
Epoch 1200, Loss: [1.05597886]
Epoch 1400, Loss: [0.64920434]
Epoch 1600, Loss: [0.51583237]
Epoch 1800, Loss: [0.43260939]

Predictions:
Input: Our, Predicted: national
Input: national, Predicted: language
Input: language, Predicted: is
Input: is, Predicted: Urdu
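
For a single step, the chain rule gives the recurrent-weight gradient as `dh` multiplied by the hidden state that entered the step, not the one the step produced. The sketch below checks that formula against a central finite difference. It is an illustration under assumed values: the seed, the weight scale of 0.5, the chosen input index, the target index, and the nonzero `h_prev` are all arbitrary and do not come from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 5, 1, 5

# Assumed toy parameters (larger scale than the notebook so gradients are visible)
Wx = rng.normal(size=(hidden_size, input_size)) * 0.5
Wh = rng.normal(size=(hidden_size, hidden_size)) * 0.5
Wy = rng.normal(size=(output_size, hidden_size)) * 0.5
bh = np.zeros((hidden_size, 1))
by = np.zeros((output_size, 1))

x_t = np.zeros((input_size, 1))
x_t[2] = 1                    # one-hot input
h_prev = np.array([[0.3]])    # nonzero so that Wh influences the step
y_true = 4

def step_loss(Wh_value):
    """Cross-entropy loss of one RNN step as a function of the recurrent weight."""
    h_next = np.tanh(Wx @ x_t + Wh_value @ h_prev + bh)
    y = Wy @ h_next + by
    p = np.exp(y - y.max())
    p /= p.sum()
    return -np.log(p[y_true, 0]), h_next, p

# Analytic one-step gradient
_, h_next, p = step_loss(Wh)
dy = p.copy()
dy[y_true] -= 1
dh = (Wy.T @ dy) * (1 - h_next**2)
dWh_analytic = (dh @ h_prev.T).item()

# Central finite difference on the single entry of Wh
eps = 1e-6
bump = np.full_like(Wh, eps)
dWh_numeric = (step_loss(Wh + bump)[0] - step_loss(Wh - bump)[0]) / (2 * eps)

print("analytic:", dWh_analytic)
print("numeric: ", dWh_numeric)
```

The two printed values should agree to several decimal places; a mismatch in a check like this usually points at the wrong hidden state being used in the `dWh` product.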


**Word Embedding**

In [None]:
import numpy as np

# Define the data
sentence = "Our national language is Urdu"
words = sentence.split()
word_to_int = {word: i for i, word in enumerate(sorted(set(words)))}
int_to_word = {i: word for word, i in word_to_int.items()}
encoded_sentence = [word_to_int[word] for word in words]

# Display tokens, vocabulary, and encodings
print("### Word Embedding ###")
print("Sentence:", sentence)
print("Tokens (Words):", words)
print("Vocabulary (Word to Index):", word_to_int)
print("Index to Word Mapping:", int_to_word)
print("Encoded Sentence:", encoded_sentence, "\n")

# Model parameters
vocab_size = len(word_to_int)  # Number of unique words
embedding_dim = 3  # Size of word embeddings
hidden_size = 1  # Single neuron
output_size = vocab_size
learning_rate = 0.1

# Initialize weights and biases
embedding_matrix = np.random.randn(vocab_size, embedding_dim) * 0.01  # Word embeddings
Wx = np.random.randn(hidden_size, embedding_dim) * 0.01  # Embedding to hidden weights
Wh = np.random.randn(hidden_size, hidden_size) * 0.01  # Hidden to hidden weights
Wy = np.random.randn(output_size, hidden_size) * 0.01  # Hidden to output weights
bh = np.zeros((hidden_size, 1))  # Hidden bias
by = np.zeros((output_size, 1))  # Output bias

# RNN forward step
def rnn_step_forward(x, h_prev):
    h_next = np.tanh(np.dot(Wx, x) + np.dot(Wh, h_prev) + bh)  # Single hidden neuron
    y = np.dot(Wy, h_next) + by
    return h_next, y

# Training loop
for epoch in range(2000):  # Training for 2000 epochs
    loss = 0
    h_prev = np.zeros((hidden_size, 1))  # Initialize hidden state

    for t in range(len(encoded_sentence) - 1):
        word_idx = encoded_sentence[t]
        x_t = embedding_matrix[word_idx].reshape(-1, 1)  # Word embedding vector
        y_true = encoded_sentence[t + 1]

        # Forward pass
        h_next, y_pred = rnn_step_forward(x_t, h_prev)
        y_pred_softmax = np.exp(y_pred) / np.sum(np.exp(y_pred))  # Softmax activation

        loss += -np.log(y_pred_softmax[y_true])  # Cross-entropy loss

        # Backward pass (one-step gradients only; no backpropagation through time)
        dy = y_pred_softmax
        dy[y_true] -= 1  # Gradient of softmax + loss

        dWy = np.dot(dy, h_next.T)
        dby = dy
        dh = np.dot(Wy.T, dy) * (1 - h_next**2)  # Backprop through tanh activation
        dWx = np.dot(dh, x_t.T)
        dWh = np.dot(dh, h_prev.T)  # Pairs with the hidden state that entered this step
        dbh = dh
        h_prev = h_next  # Carry the hidden state to the next time step

        # Update parameters
        Wy -= learning_rate * dWy
        by -= learning_rate * dby
        Wx -= learning_rate * dWx
        Wh -= learning_rate * dWh
        bh -= learning_rate * dbh

    if epoch % 200 == 0:
        print(f"Epoch {epoch}, Loss: {loss}")

# Predict next words
h_prev = np.zeros((hidden_size, 1))  # Initialize hidden state
print("\nPredictions:")
for t in range(len(encoded_sentence) - 1):
    word_idx = encoded_sentence[t]
    x_t = embedding_matrix[word_idx].reshape(-1, 1)  # Word embedding vector
    h_prev, y_pred = rnn_step_forward(x_t, h_prev)
    next_word = int_to_word[np.argmax(y_pred)]
    print(f"Input: {int_to_word[encoded_sentence[t]]}, Predicted: {next_word}")


### Word Embedding ###
Sentence: Our national language is Urdu
Tokens (Words): ['Our', 'national', 'language', 'is', 'Urdu']
Vocabulary (Word to Index): {'Our': 0, 'Urdu': 1, 'is': 2, 'language': 3, 'national': 4}
Index to Word Mapping: {0: 'Our', 1: 'Urdu', 2: 'is', 3: 'language', 4: 'national'}
Encoded Sentence: [0, 4, 3, 2, 1] 

Epoch 0, Loss: [6.55928099]
Epoch 200, Loss: [5.73676323]
Epoch 400, Loss: [5.71727751]
Epoch 600, Loss: [5.7106698]
Epoch 800, Loss: [5.70734927]
Epoch 1000, Loss: [5.70535234]
Epoch 1200, Loss: [5.70401925]
Epoch 1400, Loss: [5.70306602]
Epoch 1600, Loss: [5.70235022]
Epoch 1800, Loss: [5.70179248]

Predictions:
Input: Our, Predicted: Urdu
Input: national, Predicted: Urdu
Input: language, Predicted: Urdu
Input: is, Predicted: Urdu
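
The training loop in the cell above updates `Wx`, `Wh`, `Wy`, `bh` and `by`, but never `embedding_matrix`, so the embeddings keep their random initial values. The sketch below shows one way the gradient could be passed back to the embedding row that was looked up. It is a minimal illustration under assumptions, not the notebook's method: the seed, the 0.5 weight scale, and the helper names `forward` and `embedding_row_grad` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim, hidden_size = 5, 3, 1

# Assumed toy parameters
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim)) * 0.5
Wx = rng.normal(size=(hidden_size, embedding_dim)) * 0.5
Wh = rng.normal(size=(hidden_size, hidden_size)) * 0.5
Wy = rng.normal(size=(vocab_size, hidden_size)) * 0.5
bh = np.zeros((hidden_size, 1))
by = np.zeros((vocab_size, 1))

def forward(E, word_idx, h_prev):
    """One RNN step: embedding lookup, tanh hidden unit, softmax output."""
    x = E[word_idx].reshape(-1, 1)
    h = np.tanh(Wx @ x + Wh @ h_prev + bh)
    y = Wy @ h + by
    p = np.exp(y - y.max())
    return h, p / p.sum()

def embedding_row_grad(E, word_idx, y_true, h_prev):
    """Gradient of the step's cross-entropy loss w.r.t. the looked-up embedding row."""
    h, p = forward(E, word_idx, h_prev)
    dy = p.copy()
    dy[y_true] -= 1                      # softmax + cross-entropy gradient
    dh = (Wy.T @ dy) * (1 - h**2)        # back through tanh
    return (Wx.T @ dh).ravel()           # back through Wx to the input vector

word_idx, y_true = 0, 4
h0 = np.zeros((hidden_size, 1))
learning_rate = 0.1

grad = embedding_row_grad(embedding_matrix, word_idx, y_true, h0)
before = embedding_matrix.copy()
embedding_matrix[word_idx] -= learning_rate * grad   # only the used row changes
print("Gradient for embedding row:", grad)
```

Only the row for the current input word receives an update at each step, which is why the embedding gradient is often described as sparse.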


**Bag of Words**

In [None]:
import numpy as np

# Define the data
sentence = "Our national language is Urdu"
words = sentence.split()
vocab = sorted(set(words))
word_to_int = {word: i for i, word in enumerate(vocab)}
int_to_word = {i: word for word, i in word_to_int.items()}

# Bag of Words Encoding
bow_vector = np.zeros(len(vocab), dtype=int)
for word in words:
    bow_vector[word_to_int[word]] += 1

# Display tokens, vocabulary, and encoding
print("### Bag of Words (BoW) ###")
print("Sentence:", sentence)
print("Tokens (Words):", words)
print("Vocabulary (Word to Index):", word_to_int)
print("Index to Word Mapping:", int_to_word)
print("BoW Vector (Encoded Sentence):", bow_vector)

# Example: Use BoW encoding as input to a model
input_vector = bow_vector.reshape(-1, 1)  # Reshape for compatibility
hidden_size = 2  # Number of neurons
output_size = len(vocab)
learning_rate = 0.01

# Initialize model parameters
Wx = np.random.randn(hidden_size, len(vocab)) * 0.01
Wy = np.random.randn(output_size, hidden_size) * 0.01
bh = np.zeros((hidden_size, 1))
by = np.zeros((output_size, 1))

# Single forward pass with randomly initialized weights (the model is not trained here)
hidden_layer = np.tanh(np.dot(Wx, input_vector) + bh)
output_layer = np.dot(Wy, hidden_layer) + by

# Apply Softmax to output_layer for predictions
output_probs = np.exp(output_layer) / np.sum(np.exp(output_layer))  # Softmax
predicted_word_idx = np.argmax(output_probs)  # Get the index of the predicted word
predicted_word = int_to_word[predicted_word_idx]

print("\nModel Output (Raw Scores):\n", output_layer)
print("\nSoftmax Probabilities (Predictions):", output_probs.flatten())
print("Predicted Word:", predicted_word)

### Bag of Words (BoW) ###
Sentence: Our national language is Urdu
Tokens (Words): ['Our', 'national', 'language', 'is', 'Urdu']
Vocabulary (Word to Index): {'Our': 0, 'Urdu': 1, 'is': 2, 'language': 3, 'national': 4}
Index to Word Mapping: {0: 'Our', 1: 'Urdu', 2: 'is', 3: 'language', 4: 'national'}
BoW Vector (Encoded Sentence): [1 1 1 1 1]

Model Output (Raw Scores):
 [[-0.00016566]
 [ 0.00015024]
 [ 0.00017704]
 [ 0.00039723]
 [ 0.00045202]]

Softmax Probabilities (Predictions): [0.19992644 0.19998961 0.19999497 0.20003901 0.20004997]
Predicted Word: national


**Hashing Encoding**

In [None]:
import numpy as np
import hashlib

# Define the data
sentence = "Our national language is Urdu"
words = sentence.split()

# Hashing function
def hash_function(value, num_buckets):
    hashed = int(hashlib.md5(value.encode()).hexdigest(), 16)
    return hashed % num_buckets

# Hashing Encoding
num_buckets = 5 # Define the number of buckets
hash_vector = np.zeros(num_buckets, dtype=int)
for word in words:
    bucket = hash_function(word, num_buckets)
    hash_vector[bucket] += 1

# Display tokens and hashed encoding
print("\n### Hashing Encoding ###")
print("Sentence:", sentence)
print("Tokens (Words):", words)
print("Number of Buckets:", num_buckets)
print("Hash Vector (Encoded Sentence):", hash_vector)

# Example: Use Hashing encoding as input to a model
input_vector = hash_vector.reshape(-1, 1)  # Reshape for compatibility
hidden_size = 2  # Number of neurons
output_size = num_buckets
learning_rate = 0.01

# Initialize model parameters
Wx = np.random.randn(hidden_size, num_buckets) * 0.01
Wy = np.random.randn(output_size, hidden_size) * 0.01
bh = np.zeros((hidden_size, 1))
by = np.zeros((output_size, 1))

# Single forward pass with randomly initialized weights (the model is not trained here)
hidden_layer = np.tanh(np.dot(Wx, input_vector) + bh)
output_layer = np.dot(Wy, hidden_layer) + by

# Apply Softmax to output_layer for predictions
output_probs = np.exp(output_layer) / np.sum(np.exp(output_layer))  # Softmax
predicted_bucket = np.argmax(output_probs)  # Get the predicted bucket index

print("\nModel Output (Raw Scores):\n", output_layer)
print("\nSoftmax Probabilities (Predictions):", output_probs.flatten())
print("Predicted Bucket:", predicted_bucket)


### Hashing Encoding ###
Sentence: Our national language is Urdu
Tokens (Words): ['Our', 'national', 'language', 'is', 'Urdu']
Number of Buckets: 5
Hash Vector (Encoded Sentence): [0 1 2 0 2]

Model Output (Raw Scores):
 [[-0.00040377]
 [ 0.00029077]
 [ 0.00037115]
 [-0.00023484]
 [ 0.0002637 ]]

Softmax Probabilities (Predictions): [0.19990778 0.20004667 0.20006275 0.19994155 0.20004125]
Predicted Bucket: 2


# **Question#2 - Multiple Neurons**

## Next Character Prediction in a Word



### Example-1: TELEPHONE

**One-Hot Encoding**

In [None]:
import numpy as np

# Define the data
word = "telephone"
char_to_int = {ch: i for i, ch in enumerate(sorted(set(word)))}
int_to_char = {i: ch for ch, i in char_to_int.items()}
encoded_word = [char_to_int[ch] for ch in word]

print("Encoded Word using One-Hot Encoding:", encoded_word, "\n")

# Hyperparameters
hidden_size = 8
input_size = len(char_to_int)
output_size = len(char_to_int)
learning_rate = 0.1
epochs = 2000

# Model parameters
Wx = np.random.randn(hidden_size, input_size) * 0.01
Wh = np.random.randn(hidden_size, hidden_size) * 0.01
Wy = np.random.randn(output_size, hidden_size) * 0.01
bh = np.zeros((hidden_size, 1))
by = np.zeros((output_size, 1))

# Helper functions
def softmax(x):
    exp_x = np.exp(x - np.max(x))  # Stability improvement
    return exp_x / np.sum(exp_x)

def rnn_step_forward(x, h_prev):
    h_next = np.tanh(np.dot(Wx, x) + np.dot(Wh, h_prev) + bh)
    y = np.dot(Wy, h_next) + by
    return h_next, softmax(y)

# Training loop
for epoch in range(epochs):
    h_prev = np.zeros((hidden_size, 1))
    loss = 0
    dWx, dWh, dWy = np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(Wy)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)

    for t in range(len(encoded_word) - 1):
        # Prepare input and target
        x_t = np.zeros((input_size, 1))
        x_t[encoded_word[t]] = 1
        y_true = encoded_word[t + 1]

        # Forward pass
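        h_next, y_pred = rnn_step_forward(x_t, h_prev)
        loss += -np.log(y_pred[y_true])

        # Backpropagation (one-step gradients, accumulated over the sequence)
        dy = y_pred
        dy[y_true] -= 1  # Cross-entropy gradient

        dWy += np.dot(dy, h_next.T)
        dby += dy
        dh = np.dot(Wy.T, dy) * (1 - h_next**2)
        dWx += np.dot(dh, x_t.T)
        dWh += np.dot(dh, h_prev.T)  # Pairs with the hidden state that entered this step
        dbh += dh
        h_prev = h_next  # Carry the hidden state to the next time step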

    # Update parameters (with gradient clipping)
    for param, dparam in zip([Wx, Wh, Wy, bh, by], [dWx, dWh, dWy, dbh, dby]):
        np.clip(dparam, -5, 5, out=dparam)  # Clip gradients
        param -= learning_rate * dparam

    if epoch % 200 == 0:
        print(f"Epoch {epoch}, Loss: {loss / len(encoded_word)}")

# Predict next characters
h_prev = np.zeros((hidden_size, 1))
print("\nPredictions:")
for t in range(len(encoded_word) - 1):
    x_t = np.zeros((input_size, 1))
    x_t[encoded_word[t]] = 1
    h_prev, y_pred = rnn_step_forward(x_t, h_prev)
    next_char = int_to_char[np.argmax(y_pred)]
    print(f"Input: {int_to_char[encoded_word[t]]}, Predicted: {next_char}")

Encoded Word using One-Hot Encoding: [6, 0, 2, 0, 5, 1, 4, 3, 0] 

Epoch 0, Loss: [1.72974144]
Epoch 200, Loss: [0.01802542]
Epoch 400, Loss: [0.00700745]
Epoch 600, Loss: [0.00446998]
Epoch 800, Loss: [0.00544615]
Epoch 1000, Loss: [0.00342847]
Epoch 1200, Loss: [0.00260297]
Epoch 1400, Loss: [0.00211741]
Epoch 1600, Loss: [0.00178581]
Epoch 1800, Loss: [0.00154262]

Predictions:
Input: t, Predicted: e
Input: e, Predicted: l
Input: l, Predicted: e
Input: e, Predicted: p
Input: p, Predicted: h
Input: h, Predicted: o
Input: o, Predicted: n
Input: n, Predicted: e
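
The `softmax` helper in the cell above subtracts the maximum logit before exponentiating, whereas the Question#1 cells call `np.exp` on the raw scores. The sketch below compares the two forms on small and large logits to show why the subtraction matters. The function names `softmax_naive` and `softmax_stable` and the test vectors are chosen for illustration only.

```python
import numpy as np

def softmax_naive(x):
    # Direct form, as used in the single-neuron cells
    return np.exp(x) / np.sum(np.exp(x))

def softmax_stable(x):
    # Shifted form, as used in the multi-neuron cells
    exp_x = np.exp(x - np.max(x))
    return exp_x / np.sum(exp_x)

small = np.array([[1.0], [2.0], [3.0]])
large = np.array([[1000.0], [1001.0], [1002.0]])  # same gaps, shifted by 999

# np.exp overflows to inf for the large logits, and inf / inf yields nan
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = softmax_naive(large)

print("Agree on small logits:", np.allclose(softmax_naive(small), softmax_stable(small)))
print("Naive on large logits:", naive_large.ravel())
print("Stable on large logits:", softmax_stable(large).ravel())
```

Because softmax is unchanged by adding a constant to every logit, the shifted form returns the same probabilities while keeping every exponent at or below zero.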


**Word Embedding**

In [None]:
import numpy as np

# Define the data
word = "telephone"
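char_to_int = {ch: i for i, ch in enumerate(sorted(set(word)))}
int_to_char = {i: ch for ch, i in char_to_int.items()}
encoded_word = [char_to_int[ch] for ch in word]  # Integer indices used by the training loop

# Embedding parameters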
embedding_dim = 4  # Dimensionality of the embeddings
vocab_size = len(char_to_int)

# Randomly initialized embedding matrix (printed for illustration only; the RNN below still takes one-hot inputs)
embedding_matrix = np.random.randn(vocab_size, embedding_dim)

# Encode the word using embeddings
embedded_word = [embedding_matrix[char_to_int[ch]] for ch in word]
print("Encoded Word using Word Embedding:")
for ch, emb in zip(word, embedded_word):
    print(f"Character: {ch}, Embedding: {emb}")
print("\n")

# Hyperparameters
hidden_size = 8
input_size = len(char_to_int)
output_size = len(char_to_int)
learning_rate = 0.1
epochs = 2000

# Model parameters
Wx = np.random.randn(hidden_size, input_size) * 0.01
Wh = np.random.randn(hidden_size, hidden_size) * 0.01
Wy = np.random.randn(output_size, hidden_size) * 0.01
bh = np.zeros((hidden_size, 1))
by = np.zeros((output_size, 1))

# Helper functions
def softmax(x):
    exp_x = np.exp(x - np.max(x))  # Stability improvement
    return exp_x / np.sum(exp_x)

def rnn_step_forward(x, h_prev):
    h_next = np.tanh(np.dot(Wx, x) + np.dot(Wh, h_prev) + bh)
    y = np.dot(Wy, h_next) + by
    return h_next, softmax(y)

# Training loop
for epoch in range(epochs):
    h_prev = np.zeros((hidden_size, 1))
    loss = 0
    dWx, dWh, dWy = np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(Wy)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)

    for t in range(len(encoded_word) - 1):
        # Prepare input and target
        x_t = np.zeros((input_size, 1))
        x_t[encoded_word[t]] = 1
        y_true = encoded_word[t + 1]

        # Forward pass
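        h_next, y_pred = rnn_step_forward(x_t, h_prev)
        loss += -np.log(y_pred[y_true])

        # Backpropagation (one-step gradients, accumulated over the sequence)
        dy = y_pred
        dy[y_true] -= 1  # Cross-entropy gradient

        dWy += np.dot(dy, h_next.T)
        dby += dy
        dh = np.dot(Wy.T, dy) * (1 - h_next**2)
        dWx += np.dot(dh, x_t.T)
        dWh += np.dot(dh, h_prev.T)  # Pairs with the hidden state that entered this step
        dbh += dh
        h_prev = h_next  # Carry the hidden state to the next time step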

    # Update parameters (with gradient clipping)
    for param, dparam in zip([Wx, Wh, Wy, bh, by], [dWx, dWh, dWy, dbh, dby]):
        np.clip(dparam, -5, 5, out=dparam)  # Clip gradients
        param -= learning_rate * dparam

    if epoch % 200 == 0:
        print(f"Epoch {epoch}, Loss: {loss / len(encoded_word)}")

# Predict next characters
h_prev = np.zeros((hidden_size, 1))
print("\nPredictions:")
for t in range(len(encoded_word) - 1):
    x_t = np.zeros((input_size, 1))
    x_t[encoded_word[t]] = 1
    h_prev, y_pred = rnn_step_forward(x_t, h_prev)
    next_char = int_to_char[np.argmax(y_pred)]
    print(f"Input: {int_to_char[encoded_word[t]]}, Predicted: {next_char}")

Encoded Word using Word Embedding:
Character: t, Embedding: [-1.2454758  -0.85674954 -0.68525055  1.0762732 ]
Character: e, Embedding: [-0.76680977 -0.38920485 -0.10529827 -0.23005368]
Character: l, Embedding: [-0.30054043 -0.19943389  0.07753008 -2.25822463]
Character: e, Embedding: [-0.76680977 -0.38920485 -0.10529827 -0.23005368]
Character: p, Embedding: [-0.89982608 -0.48267382 -1.76788283  1.14888282]
Character: h, Embedding: [-0.00956843 -0.21126562 -0.21673139 -0.1999813 ]
Character: o, Embedding: [0.53656774 0.59248285 0.94343379 0.55614605]
Character: n, Embedding: [ 1.69375973  0.57413375 -1.74953246  0.25221324]
Character: e, Embedding: [-0.76680977 -0.38920485 -0.10529827 -0.23005368]


Epoch 0, Loss: [1.72964785]
Epoch 200, Loss: [0.01956246]
Epoch 400, Loss: [0.00809698]
Epoch 600, Loss: [0.00497693]
Epoch 800, Loss: [0.00356651]
Epoch 1000, Loss: [0.00319341]
Epoch 1200, Loss: [0.02740825]
Epoch 1400, Loss: [0.00843575]
Epoch 1600, Loss: [0.0050867]
Epoch 1800, Loss: [0.

**Bag of Words**

In [None]:
import numpy as np
from collections import Counter

# Define the data
word = "telephone"
char_to_int = {ch: i for i, ch in enumerate(sorted(set(word)))}
int_to_char = {i: ch for ch, i in char_to_int.items()}  # Needed by the prediction loop

# Count character occurrences
char_counts = Counter(word)

# Encode word using BoW
bow_vector = np.zeros(len(char_to_int))  # One count slot per unique character
for ch, count in char_counts.items():
    bow_vector[char_to_int[ch]] = count

print("\nEncoded Word using Bag of Words (BoW):")
print(f"Characters: {sorted(set(word))}")
print(f"BoW Vector: {bow_vector}")

# Hyperparameters
hidden_size = 8
input_size = len(char_to_int)
output_size = len(char_to_int)
learning_rate = 0.1
epochs = 2000

# Model parameters
Wx = np.random.randn(hidden_size, input_size) * 0.01
Wh = np.random.randn(hidden_size, hidden_size) * 0.01
Wy = np.random.randn(output_size, hidden_size) * 0.01
bh = np.zeros((hidden_size, 1))
by = np.zeros((output_size, 1))

# Helper functions
def softmax(x):
    exp_x = np.exp(x - np.max(x))  # Stability improvement
    return exp_x / np.sum(exp_x)

def rnn_step_forward(x, h_prev):
    h_next = np.tanh(np.dot(Wx, x) + np.dot(Wh, h_prev) + bh)
    y = np.dot(Wy, h_next) + by
    return h_next, softmax(y)

# BoW discards character order, so it provides no next-character targets.
# The counts are cast to integers and reused below as if they were a sequence
# of character indices, i.e. the RNN trains on [3, 1, 1, 1, 1, 1, 1].
encoded_word = bow_vector.astype(int)

# Training loop
for epoch in range(epochs):
    h_prev = np.zeros((hidden_size, 1))
    loss = 0
    dWx, dWh, dWy = np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(Wy)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)

    for t in range(len(encoded_word) - 1):
        # Prepare input and target
        x_t = np.zeros((input_size, 1))
        x_t[encoded_word[t]] = 1
        y_true = encoded_word[t + 1]

        # Forward pass
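        h_next, y_pred = rnn_step_forward(x_t, h_prev)
        loss += -np.log(y_pred[y_true])

        # Backpropagation (one-step gradients, accumulated over the sequence)
        dy = y_pred
        dy[y_true] -= 1  # Cross-entropy gradient

        dWy += np.dot(dy, h_next.T)
        dby += dy
        dh = np.dot(Wy.T, dy) * (1 - h_next**2)
        dWx += np.dot(dh, x_t.T)
        dWh += np.dot(dh, h_prev.T)  # Pairs with the hidden state that entered this step
        dbh += dh
        h_prev = h_next  # Carry the hidden state to the next time step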

    # Update parameters (with gradient clipping)
    for param, dparam in zip([Wx, Wh, Wy, bh, by], [dWx, dWh, dWy, dbh, dby]):
        np.clip(dparam, -5, 5, out=dparam)  # Clip gradients
        param -= learning_rate * dparam

    if epoch % 200 == 0:
        print(f"Epoch {epoch}, Loss: {loss / len(encoded_word)}")

# Predict next characters
h_prev = np.zeros((hidden_size, 1))
print("\nPredictions:")
for t in range(len(encoded_word) - 1):
    x_t = np.zeros((input_size, 1))
    x_t[encoded_word[t]] = 1
    h_prev, y_pred = rnn_step_forward(x_t, h_prev)
    next_char = int_to_char[np.argmax(y_pred)]
    print(f"Input: {int_to_char[encoded_word[t]]}, Predicted: {next_char}")


Encoded Word using Bag of Words (BoW):
Characters: ['e', 'h', 'l', 'n', 'o', 'p', 't']
BoW Vector: [3. 1. 1. 1. 1. 1. 1.]
Epoch 0, Loss: [1.66810609]
Epoch 200, Loss: [0.00115437]
Epoch 400, Loss: [0.00051743]
Epoch 600, Loss: [0.00032838]
Epoch 800, Loss: [0.00023892]
Epoch 1000, Loss: [0.00018708]
Epoch 1200, Loss: [0.00015338]
Epoch 1400, Loss: [0.00012977]
Epoch 1600, Loss: [0.00011234]
Epoch 1800, Loss: [9.89572919e-05]

Predictions:
Input: n, Predicted: h
Input: h, Predicted: h
Input: h, Predicted: h
Input: h, Predicted: h
Input: h, Predicted: h
Input: h, Predicted: h


**Hashing Embedding**

In [None]:
import numpy as np
import hashlib

# Define the data
word = "telephone"

# Character-to-integer mapping
char_to_int = {ch: i for i, ch in enumerate(sorted(set(word)))}
int_to_char = {i: ch for ch, i in char_to_int.items()}

# Hashing parameters
hash_size = 8  # Size of the hash space (embedding dimension)

# Hash function
def hash_function(ch, hash_size):
    return int(hashlib.md5(ch.encode()).hexdigest(), 16) % hash_size

# Encode each character using hashing
hashed_vectors = np.zeros((len(word), hash_size))
for i, ch in enumerate(word):
    hash_idx = hash_function(ch, hash_size)
    hashed_vectors[i, hash_idx] = 1  # One-hot vector in the hashed space

print("\nEncoded Word using Hashing Embedding:")
for i, ch in enumerate(word):
    print(f"Character: {ch}, Hashed Vector: {hashed_vectors[i]}")

# Hyperparameters
hidden_size = 8
input_size = hash_size  # Input size now matches hash embedding size
output_size = len(char_to_int)  # Output matches the unique character count
learning_rate = 0.1
epochs = 2000

# Model parameters
Wx = np.random.randn(hidden_size, input_size) * 0.01
Wh = np.random.randn(hidden_size, hidden_size) * 0.01
Wy = np.random.randn(output_size, hidden_size) * 0.01
bh = np.zeros((hidden_size, 1))
by = np.zeros((output_size, 1))

# Helper functions
def softmax(x):
    exp_x = np.exp(x - np.max(x))  # Stability improvement
    return exp_x / np.sum(exp_x)

def rnn_step_forward(x, h_prev):
    h_next = np.tanh(np.dot(Wx, x) + np.dot(Wh, h_prev) + bh)
    y = np.dot(Wy, h_next) + by
    return h_next, softmax(y)

# Training loop
for epoch in range(epochs):
    h_prev = np.zeros((hidden_size, 1))
    loss = 0
    dWx, dWh, dWy = np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(Wy)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)

    for t in range(len(hashed_vectors) - 1):
        # Prepare input and target
        x_t = hashed_vectors[t].reshape(-1, 1)
        y_true = char_to_int[word[t + 1]]  # Use char_to_int for consistent indexing

        # Forward pass
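        h_next, y_pred = rnn_step_forward(x_t, h_prev)
        loss += -np.log(y_pred[y_true])

        # Backpropagation (one-step gradients, accumulated over the sequence)
        dy = y_pred
        dy[y_true] -= 1  # Cross-entropy gradient

        dWy += np.dot(dy, h_next.T)
        dby += dy
        dh = np.dot(Wy.T, dy) * (1 - h_next**2)
        dWx += np.dot(dh, x_t.T)
        dWh += np.dot(dh, h_prev.T)  # Pairs with the hidden state that entered this step
        dbh += dh
        h_prev = h_next  # Carry the hidden state to the next time step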

    # Update parameters (with gradient clipping)
    for param, dparam in zip([Wx, Wh, Wy, bh, by], [dWx, dWh, dWy, dbh, dby]):
        np.clip(dparam, -5, 5, out=dparam)  # Clip gradients
        param -= learning_rate * dparam

    if epoch % 200 == 0:
        print(f"Epoch {epoch}, Loss: {loss / len(word)}")

# Predict next characters
h_prev = np.zeros((hidden_size, 1))
print("\nPredictions:")
for t in range(len(hashed_vectors) - 1):
    x_t = hashed_vectors[t].reshape(-1, 1)
    h_prev, y_pred = rnn_step_forward(x_t, h_prev)
    next_char = int_to_char[np.argmax(y_pred)]
    print(f"Input: {word[t]}, Predicted: {next_char}")


Encoded Word using Hashing Embedding:
Character: t, Hashed Vector: [0. 0. 0. 0. 0. 0. 1. 0.]
Character: e, Hashed Vector: [0. 0. 1. 0. 0. 0. 0. 0.]
Character: l, Hashed Vector: [0. 0. 0. 1. 0. 0. 0. 0.]
Character: e, Hashed Vector: [0. 0. 1. 0. 0. 0. 0. 0.]
Character: p, Hashed Vector: [0. 0. 1. 0. 0. 0. 0. 0.]
Character: h, Hashed Vector: [0. 1. 0. 0. 0. 0. 0. 0.]
Character: o, Hashed Vector: [0. 0. 0. 0. 1. 0. 0. 0.]
Character: n, Hashed Vector: [0. 1. 0. 0. 0. 0. 0. 0.]
Character: e, Hashed Vector: [0. 0. 1. 0. 0. 0. 0. 0.]
Epoch 0, Loss: [1.72977452]
Epoch 200, Loss: [0.01671459]
Epoch 400, Loss: [0.00694615]
Epoch 600, Loss: [0.00446653]
Epoch 800, Loss: [0.00331141]
Epoch 1000, Loss: [0.00264625]
Epoch 1200, Loss: [0.00221407]
Epoch 1400, Loss: [0.00190886]
Epoch 1600, Loss: [0.00168028]
Epoch 1800, Loss: [0.00150203]

Predictions:
Input: t, Predicted: e
Input: e, Predicted: l
Input: l, Predicted: e
Input: e, Predicted: p
Input: p, Predicted: h
Input: h, Predicted: o
Input: o, P

### Example-2: OPERATION

**One-Hot Encoding**

In [None]:
import numpy as np

# Define the data
word = "operation"
char_to_int = {ch: i for i, ch in enumerate(sorted(set(word)))}
int_to_char = {i: ch for ch, i in char_to_int.items()}
encoded_word = [char_to_int[ch] for ch in word]

print("Encoded Word using One-Hot Encoding:", encoded_word, "\n")

# Hyperparameters
hidden_size = 8
input_size = len(char_to_int)
output_size = len(char_to_int)
learning_rate = 0.1
epochs = 2000

# Model parameters
Wx = np.random.randn(hidden_size, input_size) * 0.01
Wh = np.random.randn(hidden_size, hidden_size) * 0.01
Wy = np.random.randn(output_size, hidden_size) * 0.01
bh = np.zeros((hidden_size, 1))
by = np.zeros((output_size, 1))

# Helper functions
def softmax(x):
    exp_x = np.exp(x - np.max(x))  # Stability improvement
    return exp_x / np.sum(exp_x)

def rnn_step_forward(x, h_prev):
    h_next = np.tanh(np.dot(Wx, x) + np.dot(Wh, h_prev) + bh)
    y = np.dot(Wy, h_next) + by
    return h_next, softmax(y)

# Training loop
for epoch in range(epochs):
    h_prev = np.zeros((hidden_size, 1))
    loss = 0
    dWx, dWh, dWy = np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(Wy)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)

    for t in range(len(encoded_word) - 1):
        # Prepare input and target
        x_t = np.zeros((input_size, 1))
        x_t[encoded_word[t]] = 1
        y_true = encoded_word[t + 1]

        # Forward pass
        h_prev, y_pred = rnn_step_forward(x_t, h_prev)
        loss += -np.log(y_pred[y_true])

        # Backpropagation
        dy = y_pred
        dy[y_true] -= 1  # Cross-entropy gradient

        dWy += np.dot(dy, h_prev.T)
        dby += dy
        dh = np.dot(Wy.T, dy) * (1 - h_prev**2)
        dWx += np.dot(dh, x_t.T)
        dWh += np.dot(dh, h_prev.T)
        dbh += dh

    # Update parameters (with gradient clipping)
    for param, dparam in zip([Wx, Wh, Wy, bh, by], [dWx, dWh, dWy, dbh, dby]):
        np.clip(dparam, -5, 5, out=dparam)  # Clip gradients
        param -= learning_rate * dparam

    if epoch % 200 == 0:
        print(f"Epoch {epoch}, Loss: {loss / len(encoded_word)}")

# Predict next characters
h_prev = np.zeros((hidden_size, 1))
print("\nPredictions:")
for t in range(len(encoded_word) - 1):
    x_t = np.zeros((input_size, 1))
    x_t[encoded_word[t]] = 1
    h_prev, y_pred = rnn_step_forward(x_t, h_prev)
    next_char = int_to_char[np.argmax(y_pred)]
    print(f"Input: {int_to_char[encoded_word[t]]}, Predicted: {next_char}")

Encoded Word using One-Hot Encoding: [4, 5, 1, 6, 0, 7, 2, 4, 3] 

Epoch 0, Loss: [1.84822721]
Epoch 200, Loss: [0.03247513]
Epoch 400, Loss: [0.00822547]
Epoch 600, Loss: [0.00507107]
Epoch 800, Loss: [0.00351097]
Epoch 1000, Loss: [0.0034943]
Epoch 1200, Loss: [0.00512022]
Epoch 1400, Loss: [0.00256907]
Epoch 1600, Loss: [0.00250509]
Epoch 1800, Loss: [0.01118589]

Predictions:
Input: o, Predicted: p
Input: p, Predicted: e
Input: e, Predicted: r
Input: r, Predicted: a
Input: a, Predicted: t
Input: t, Predicted: i
Input: i, Predicted: o
Input: o, Predicted: n


**Word Embedding**

In [None]:
import numpy as np

# Define the data
word = "operation"
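char_to_int = {ch: i for i, ch in enumerate(sorted(set(word)))}
int_to_char = {i: ch for ch, i in char_to_int.items()}
encoded_word = [char_to_int[ch] for ch in word]  # Integer indices used by the training loop

# Embedding parameters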
embedding_dim = 4  # Dimensionality of the embeddings
vocab_size = len(char_to_int)

# Randomly initialized embedding matrix
embedding_matrix = np.random.randn(vocab_size, embedding_dim)

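# Encode the word using embeddings (printed for inspection only; the training
# loop below still builds one-hot inputs from encoded_word)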
embedded_word = [embedding_matrix[char_to_int[ch]] for ch in word]
print("Encoded Word using Word Embedding:")
for ch, emb in zip(word, embedded_word):
    print(f"Character: {ch}, Embedding: {emb}")
print("\n")

# Hyperparameters
hidden_size = 8
input_size = len(char_to_int)
output_size = len(char_to_int)
learning_rate = 0.1
epochs = 2000

# Model parameters
Wx = np.random.randn(hidden_size, input_size) * 0.01
Wh = np.random.randn(hidden_size, hidden_size) * 0.01
Wy = np.random.randn(output_size, hidden_size) * 0.01
bh = np.zeros((hidden_size, 1))
by = np.zeros((output_size, 1))

# Helper functions
def softmax(x):
    exp_x = np.exp(x - np.max(x))  # Stability improvement
    return exp_x / np.sum(exp_x)

def rnn_step_forward(x, h_prev):
    h_next = np.tanh(np.dot(Wx, x) + np.dot(Wh, h_prev) + bh)
    y = np.dot(Wy, h_next) + by
    return h_next, softmax(y)

# Training loop
for epoch in range(epochs):
    h_prev = np.zeros((hidden_size, 1))
    loss = 0
    dWx, dWh, dWy = np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(Wy)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)

    for t in range(len(encoded_word) - 1):
        # Prepare input and target
        x_t = np.zeros((input_size, 1))
        x_t[encoded_word[t]] = 1
        y_true = encoded_word[t + 1]

        # Forward pass
        h_prev, y_pred = rnn_step_forward(x_t, h_prev)
        loss += -np.log(y_pred[y_true])

        # Backpropagation
        dy = y_pred
        dy[y_true] -= 1  # Cross-entropy gradient

        dWy += np.dot(dy, h_prev.T)
        dby += dy
        dh = np.dot(Wy.T, dy) * (1 - h_prev**2)
        dWx += np.dot(dh, x_t.T)
        dWh += np.dot(dh, h_prev.T)
        dbh += dh

    # Update parameters (with gradient clipping)
    for param, dparam in zip([Wx, Wh, Wy, bh, by], [dWx, dWh, dWy, dbh, dby]):
        np.clip(dparam, -5, 5, out=dparam)  # Clip gradients
        param -= learning_rate * dparam

    if epoch % 200 == 0:
        print(f"Epoch {epoch}, Loss: {loss / len(encoded_word)}")

# Predict next characters
h_prev = np.zeros((hidden_size, 1))
print("\nPredictions:")
for t in range(len(encoded_word) - 1):
    x_t = np.zeros((input_size, 1))
    x_t[encoded_word[t]] = 1
    h_prev, y_pred = rnn_step_forward(x_t, h_prev)
    next_char = int_to_char[np.argmax(y_pred)]
    print(f"Input: {int_to_char[encoded_word[t]]}, Predicted: {next_char}")

Encoded Word using Word Embedding:
Character: o, Embedding: [ 0.11617933  0.05182357  0.4108245  -0.41417887]
Character: p, Embedding: [ 0.73905628 -1.76463     0.8566777  -1.08343584]
Character: e, Embedding: [ 0.13711358  1.02871666 -0.67525888  0.22507392]
Character: r, Embedding: [-0.14797962 -3.59627136  0.34262165  0.13214909]
Character: a, Embedding: [-0.10480013  0.64304829  0.82731614 -0.20202049]
Character: t, Embedding: [ 0.18836633 -0.62056445  0.60702552  0.97698812]
Character: i, Embedding: [-1.33047587 -1.61224687 -0.88722738  0.81691221]
Character: o, Embedding: [ 0.11617933  0.05182357  0.4108245  -0.41417887]
Character: n, Embedding: [-0.07361069  1.05132346  2.10553455  0.01170187]


Epoch 0, Loss: [1.8484461]
Epoch 200, Loss: [0.0179796]
Epoch 400, Loss: [0.00482728]
Epoch 600, Loss: [0.00285725]
Epoch 800, Loss: [0.00208428]
Epoch 1000, Loss: [0.00176208]
Epoch 1200, Loss: [0.00146921]
Epoch 1400, Loss: [0.00113212]
Epoch 1600, Loss: [0.00093785]
Epoch 1800, Loss: 
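
The cell above prints an embedding for each character, but its training loop still feeds one-hot vectors to the network. The following is a minimal sketch of how an embedding lookup relates to a one-hot input, assuming a fixed seed and reusing the cell's `embedding_dim = 4` and `hidden_size = 8`; the helper `one_hot` is hypothetical and not part of the original notebook. It shows that indexing a row of the embedding matrix gives the same vector as multiplying the one-hot vector by the matrix, and that `Wx` would need `embedding_dim` columns if the embeddings were actually used as inputs.

```python
import numpy as np

word = "operation"
char_to_int = {ch: i for i, ch in enumerate(sorted(set(word)))}
vocab_size = len(char_to_int)
embedding_dim = 4  # same illustrative size as the cell above
hidden_size = 8

rng = np.random.default_rng(0)  # assumed seed, only for repeatability
embedding_matrix = rng.standard_normal((vocab_size, embedding_dim))

def one_hot(index, size):
    """Hypothetical helper: a 1-D one-hot vector of the given size."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

idx = char_to_int["p"]
looked_up = embedding_matrix[idx]                         # row lookup
projected = one_hot(idx, vocab_size) @ embedding_matrix   # one-hot times matrix

print("Lookup equals projection:", np.allclose(looked_up, projected))

# If embeddings were the RNN input, Wx would map embedding_dim -> hidden_size.
Wx = rng.standard_normal((hidden_size, embedding_dim)) * 0.01
x_t = looked_up.reshape(-1, 1)
pre_activation = Wx @ x_t
print("Input-to-hidden product shape:", pre_activation.shape)
```

In a trainable setup the embedding rows would also receive gradients; this sketch only covers the forward lookup.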

**Bag of Words**

In [None]:
import numpy as np
from collections import Counter

# Define the data
word = "operation"
char_to_int = {ch: i for i, ch in enumerate(sorted(set(word)))}
int_to_char = {i: ch for ch, i in char_to_int.items()}  # Needed by the prediction loop

# Count character occurrences
char_counts = Counter(word)

# Encode word using BoW
bow_vector = np.zeros(len(char_to_int))  # One count slot per unique character
for ch, count in char_counts.items():
    bow_vector[char_to_int[ch]] = count

print("\nEncoded Word using Bag of Words (BoW):")
print(f"Characters: {sorted(set(word))}")
print(f"BoW Vector: {bow_vector}")

# Hyperparameters
hidden_size = 8
input_size = len(char_to_int)
output_size = len(char_to_int)
learning_rate = 0.1
epochs = 2000

# Model parameters
Wx = np.random.randn(hidden_size, input_size) * 0.01
Wh = np.random.randn(hidden_size, hidden_size) * 0.01
Wy = np.random.randn(output_size, hidden_size) * 0.01
bh = np.zeros((hidden_size, 1))
by = np.zeros((output_size, 1))

# Helper functions
def softmax(x):
    exp_x = np.exp(x - np.max(x))  # Stability improvement
    return exp_x / np.sum(exp_x)

def rnn_step_forward(x, h_prev):
    h_next = np.tanh(np.dot(Wx, x) + np.dot(Wh, h_prev) + bh)
    y = np.dot(Wy, h_next) + by
    return h_next, softmax(y)

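# Reinterpret the BoW counts as a sequence of indices for the loop below.
# Note: each count (1 or 2) is treated as a character index, so the model sees
# the sequence e, e, e, e, i, e, e, e rather than the letters of the word.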
encoded_word = bow_vector.astype(int)

# Training loop
for epoch in range(epochs):
    h_prev = np.zeros((hidden_size, 1))
    loss = 0
    dWx, dWh, dWy = np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(Wy)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)

    for t in range(len(encoded_word) - 1):
        # Prepare input and target
        x_t = np.zeros((input_size, 1))
        x_t[encoded_word[t]] = 1
        y_true = encoded_word[t + 1]

        # Forward pass
        h_prev, y_pred = rnn_step_forward(x_t, h_prev)
        loss += -np.log(y_pred[y_true])

        # Backpropagation
        dy = y_pred
        dy[y_true] -= 1  # Cross-entropy gradient

        dWy += np.dot(dy, h_prev.T)
        dby += dy
        dh = np.dot(Wy.T, dy) * (1 - h_prev**2)
        dWx += np.dot(dh, x_t.T)
        dWh += np.dot(dh, h_prev.T)
        dbh += dh

    # Update parameters (with gradient clipping)
    for param, dparam in zip([Wx, Wh, Wy, bh, by], [dWx, dWh, dWy, dbh, dby]):
        np.clip(dparam, -5, 5, out=dparam)  # Clip gradients
        param -= learning_rate * dparam

    if epoch % 200 == 0:
        print(f"Epoch {epoch}, Loss: {loss / len(encoded_word)}")

# Predict next characters
h_prev = np.zeros((hidden_size, 1))
print("\nPredictions:")
for t in range(len(encoded_word) - 1):
    x_t = np.zeros((input_size, 1))
    x_t[encoded_word[t]] = 1
    h_prev, y_pred = rnn_step_forward(x_t, h_prev)
    next_char = int_to_char[np.argmax(y_pred)]
    print(f"Input: {int_to_char[encoded_word[t]]}, Predicted: {next_char}")


Encoded Word using Bag of Words (BoW):
Characters: ['a', 'e', 'i', 'n', 'o', 'p', 'r', 't']
BoW Vector: [1. 1. 1. 1. 2. 1. 1. 1.]
Epoch 0, Loss: [1.81937094]
Epoch 200, Loss: [0.31825305]
Epoch 400, Loss: [0.32262148]
Epoch 600, Loss: [0.26164613]
Epoch 800, Loss: [0.31116175]
Epoch 1000, Loss: [0.18394353]
Epoch 1200, Loss: [0.01760416]
Epoch 1400, Loss: [0.27928677]
Epoch 1600, Loss: [0.2826133]
Epoch 1800, Loss: [0.2820995]

Predictions:
Input: e, Predicted: e
Input: e, Predicted: e
Input: e, Predicted: e
Input: e, Predicted: e
Input: i, Predicted: e
Input: e, Predicted: e
Input: e, Predicted: e
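
The predictions above are all `e` because a bag-of-words vector records only how often each character occurs, not where. The sketch below illustrates both points under stated assumptions: the helper `bow_vector` and the anagram pair `listen`/`silent` are hypothetical examples, not part of the original notebook. It first shows that two different words with the same letters get identical BoW vectors, then reproduces the index sequence that the cell above derives when it casts the counts for `operation` to integers.

```python
from collections import Counter
import numpy as np

def bow_vector(word, alphabet):
    """Hypothetical helper: count of each alphabet character in the word."""
    counts = Counter(word)
    return np.array([counts[ch] for ch in alphabet])

# Two anagrams share one BoW vector, so order cannot be recovered from it.
alphabet = sorted(set("listen"))
bow_listen = bow_vector("listen", alphabet)
bow_silent = bow_vector("silent", alphabet)
print("Same BoW vector:", np.array_equal(bow_listen, bow_silent))

# What the training loop above actually iterates over for "operation":
chars = sorted(set("operation"))
counts_as_indices = bow_vector("operation", chars)
seen_by_model = [chars[i] for i in counts_as_indices]
print("Sequence seen by the model:", seen_by_model)
```

Because the input sequence is almost constant, a low loss here does not indicate that the model has learned the word.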


**Hashing Embedding**

In [None]:
import numpy as np
import hashlib

# Define the data
word = "operation"


# Character-to-integer mapping
char_to_int = {ch: i for i, ch in enumerate(sorted(set(word)))}
int_to_char = {i: ch for ch, i in char_to_int.items()}

# Hashing parameters
hash_size = 8  # Size of the hash space (embedding dimension)

# Hash function
def hash_function(ch, hash_size):
    return int(hashlib.md5(ch.encode()).hexdigest(), 16) % hash_size

# Encode each character using hashing
hashed_vectors = np.zeros((len(word), hash_size))
for i, ch in enumerate(word):
    hash_idx = hash_function(ch, hash_size)
    hashed_vectors[i, hash_idx] = 1  # One-hot vector in the hashed space

print("\nEncoded Word using Hashing Embedding:")
for i, ch in enumerate(word):
    print(f"Character: {ch}, Hashed Vector: {hashed_vectors[i]}")

# Hyperparameters
hidden_size = 8
input_size = hash_size  # Input size now matches hash embedding size
output_size = len(char_to_int)  # Output matches the unique character count
learning_rate = 0.1
epochs = 2000

# Model parameters
Wx = np.random.randn(hidden_size, input_size) * 0.01
Wh = np.random.randn(hidden_size, hidden_size) * 0.01
Wy = np.random.randn(output_size, hidden_size) * 0.01
bh = np.zeros((hidden_size, 1))
by = np.zeros((output_size, 1))

# Helper functions
def softmax(x):
    exp_x = np.exp(x - np.max(x))  # Stability improvement
    return exp_x / np.sum(exp_x)

def rnn_step_forward(x, h_prev):
    h_next = np.tanh(np.dot(Wx, x) + np.dot(Wh, h_prev) + bh)
    y = np.dot(Wy, h_next) + by
    return h_next, softmax(y)

# Training loop
for epoch in range(epochs):
    h_prev = np.zeros((hidden_size, 1))
    loss = 0
    dWx, dWh, dWy = np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(Wy)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)

    for t in range(len(hashed_vectors) - 1):
        # Prepare input and target
        x_t = hashed_vectors[t].reshape(-1, 1)
        y_true = char_to_int[word[t + 1]]  # Use char_to_int for consistent indexing

        # Forward pass
        h_prev, y_pred = rnn_step_forward(x_t, h_prev)
        loss += -np.log(y_pred[y_true])

        # Backpropagation
        dy = y_pred
        dy[y_true] -= 1  # Cross-entropy gradient

        dWy += np.dot(dy, h_prev.T)
        dby += dy
        dh = np.dot(Wy.T, dy) * (1 - h_prev**2)
        dWx += np.dot(dh, x_t.T)
        dWh += np.dot(dh, h_prev.T)
        dbh += dh

    # Update parameters (with gradient clipping)
    for param, dparam in zip([Wx, Wh, Wy, bh, by], [dWx, dWh, dWy, dbh, dby]):
        np.clip(dparam, -5, 5, out=dparam)  # Clip gradients
        param -= learning_rate * dparam

    if epoch % 200 == 0:
        print(f"Epoch {epoch}, Loss: {loss / len(word)}")

# Predict next characters
h_prev = np.zeros((hidden_size, 1))
print("\nPredictions:")
for t in range(len(hashed_vectors) - 1):
    x_t = hashed_vectors[t].reshape(-1, 1)
    h_prev, y_pred = rnn_step_forward(x_t, h_prev)
    next_char = int_to_char[np.argmax(y_pred)]
    print(f"Input: {word[t]}, Predicted: {next_char}")


Encoded Word using Hashing Embedding:
Character: o, Hashed Vector: [0. 0. 0. 0. 1. 0. 0. 0.]
Character: p, Hashed Vector: [0. 0. 1. 0. 0. 0. 0. 0.]
Character: e, Hashed Vector: [0. 0. 1. 0. 0. 0. 0. 0.]
Character: r, Hashed Vector: [0. 1. 0. 0. 0. 0. 0. 0.]
Character: a, Hashed Vector: [0. 1. 0. 0. 0. 0. 0. 0.]
Character: t, Hashed Vector: [0. 0. 0. 0. 0. 0. 1. 0.]
Character: i, Hashed Vector: [0. 1. 0. 0. 0. 0. 0. 0.]
Character: o, Hashed Vector: [0. 0. 0. 0. 1. 0. 0. 0.]
Character: n, Hashed Vector: [0. 1. 0. 0. 0. 0. 0. 0.]
Epoch 0, Loss: [1.84831175]
Epoch 200, Loss: [0.02249502]
Epoch 400, Loss: [0.00909796]
Epoch 600, Loss: [0.00580962]
Epoch 800, Loss: [0.00434665]
Epoch 1000, Loss: [0.00351166]
Epoch 1200, Loss: [0.00295531]
Epoch 1400, Loss: [0.00254436]
Epoch 1600, Loss: [0.00222167]
Epoch 1800, Loss: [0.00196016]

Predictions:
Input: o, Predicted: p
Input: p, Predicted: e
Input: e, Predicted: r
Input: r, Predicted: a
Input: a, Predicted: t
Input: t, Predicted: i
Input: i, P
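
The hashed vectors printed above show collisions: `p` and `e` share index 2, and `r`, `a`, `i`, `n` share index 1, so the network has to separate them using its hidden state. The following sketch groups characters by bucket to make such collisions visible. It reuses the cell's `hash_function`; the helper `buckets_for` and the larger `hash_size` of 64 are assumptions added for illustration.

```python
import hashlib
from collections import defaultdict

def hash_function(ch, hash_size):
    # Same hashing scheme as the cell above: MD5 digest modulo the hash size.
    return int(hashlib.md5(ch.encode()).hexdigest(), 16) % hash_size

def buckets_for(word, hash_size):
    """Hypothetical helper: map each bucket index to the characters hashed into it."""
    buckets = defaultdict(list)
    for ch in sorted(set(word)):
        buckets[hash_function(ch, hash_size)].append(ch)
    return dict(buckets)

word = "operation"
for size in (8, 64):
    buckets = buckets_for(word, size)
    collisions = {idx: chars for idx, chars in buckets.items() if len(chars) > 1}
    print(f"hash_size={size}: {len(buckets)} buckets used, collisions: {collisions}")
```

A larger hash space usually reduces collisions at the cost of a wider input layer, though it does not guarantee their absence.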

## Next Word Prediction in a Sentence

### Example-1 : I am CS Student

In [None]:
import numpy as np

# Define the data
sentence = "I am CS student"
words = sentence.split()
word_to_int = {word: i for i, word in enumerate(sorted(set(words)))}
int_to_word = {i: word for word, i in word_to_int.items()}
encoded_sentence = [word_to_int[word] for word in words]

# Model parameters
input_size = len(word_to_int)  # Number of unique words
output_size = len(word_to_int)
hidden_size = 4  # Arbitrary hidden state size
learning_rate = 0.01

# Initialize weights and biases
Wx = np.random.randn(hidden_size, input_size) * 0.01  # Input to hidden weights
Wh = np.random.randn(hidden_size, hidden_size) * 0.01  # Hidden to hidden weights
Wy = np.random.randn(output_size, hidden_size) * 0.01  # Hidden to output weights
bh = np.zeros((hidden_size, 1))  # Hidden bias
by = np.zeros((output_size, 1))  # Output bias

# RNN forward step
def rnn_step_forward(x, h_prev):
    h_next = np.tanh(np.dot(Wx, x) + np.dot(Wh, h_prev) + bh)
    y = np.dot(Wy, h_next) + by
    return h_next, y

# Training loop
for epoch in range(1000):  # Training for 1000 epochs
    loss = 0
    h_prev = np.zeros((hidden_size, 1))  # Initialize hidden state

    for t in range(len(encoded_sentence) - 1):
        x_t = np.zeros((input_size, 1))
        x_t[encoded_sentence[t]] = 1  # One-hot encoding
        y_true = encoded_sentence[t + 1]

        # Forward pass
        h_prev, y_pred = rnn_step_forward(x_t, h_prev)
        y_pred_softmax = np.exp(y_pred) / np.sum(np.exp(y_pred))  # Softmax activation

        loss += -np.log(y_pred_softmax[y_true])  # Cross-entropy loss

        # Backward pass (gradient calculation and parameter update)
        dy = y_pred_softmax
        dy[y_true] -= 1  # Gradient of softmax + loss

        dWy = np.dot(dy, h_prev.T)
        dby = dy
        dh = np.dot(Wy.T, dy) * (1 - h_prev**2)  # Backprop through tanh activation
        dWx = np.dot(dh, x_t.T)
        dWh = np.dot(dh, h_prev.T)
        dbh = dh

        # Update parameters
        Wy -= learning_rate * dWy
        by -= learning_rate * dby
        Wx -= learning_rate * dWx
        Wh -= learning_rate * dWh
        bh -= learning_rate * dbh

    if epoch % 100 == 0:
        print(f"Epoch {epoch}, Loss: {loss}")

# Predict next words
h_prev = np.zeros((hidden_size, 1))  # Initialize hidden state
print("\nPredictions:")
for t in range(len(encoded_sentence) - 1):
    x_t = np.zeros((input_size, 1))
    x_t[encoded_sentence[t]] = 1  # One-hot encoding
    h_prev, y_pred = rnn_step_forward(x_t, h_prev)
    next_word = int_to_word[np.argmax(y_pred)]
    print(f"Input: {int_to_word[encoded_sentence[t]]}, Predicted: {next_word}")

Epoch 0, Loss: [4.16598495]
Epoch 100, Loss: [3.74484097]
Epoch 200, Loss: [3.56650005]
Epoch 300, Loss: [3.32936517]
Epoch 400, Loss: [2.66141231]
Epoch 500, Loss: [1.60595847]
Epoch 600, Loss: [0.8719616]
Epoch 700, Loss: [0.52982219]
Epoch 800, Loss: [0.36752919]
Epoch 900, Loss: [0.27982389]

Predictions:
Input: I, Predicted: am
Input: am, Predicted: CS
Input: CS, Predicted: student
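
A quick sanity check on the loss values above: with weights initialised near zero, the softmax is close to uniform, so each prediction costs about `ln(V)` for a vocabulary of `V` words. The sketch below computes that baseline for the sentence used in this cell; the function name `uniform_baseline_loss` is hypothetical.

```python
import math

def uniform_baseline_loss(sentence):
    """Summed cross-entropy of a model that predicts every word with equal probability."""
    words = sentence.split()
    vocab_size = len(set(words))
    num_predictions = len(words) - 1  # one prediction per adjacent word pair
    return num_predictions * math.log(vocab_size)

baseline = uniform_baseline_loss("I am CS student")
print(round(baseline, 4))  # 4.1589
```

This is close to the reported epoch-0 loss of about 4.17, which suggests the model starts from an essentially uniform guess. The same formula can be applied to any other sentence.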


### Example-2: France is where i Grew up, but I now live in Boston. I speak fluent French

In [None]:
import numpy as np

# Define the data
sentence = "France is where i Grew up, but I now live in Boston. I speak fluent French"
words = sentence.split()
word_to_int = {word: i for i, word in enumerate(sorted(set(words)))}
int_to_word = {i: word for word, i in word_to_int.items()}
encoded_sentence = [word_to_int[word] for word in words]

# Model parameters
input_size = len(word_to_int)  # Number of unique words
output_size = len(word_to_int)
hidden_size = 4  # Arbitrary hidden state size
learning_rate = 0.01

# Initialize weights and biases
Wx = np.random.randn(hidden_size, input_size) * 0.01  # Input to hidden weights
Wh = np.random.randn(hidden_size, hidden_size) * 0.01  # Hidden to hidden weights
Wy = np.random.randn(output_size, hidden_size) * 0.01  # Hidden to output weights
bh = np.zeros((hidden_size, 1))  # Hidden bias
by = np.zeros((output_size, 1))  # Output bias

# RNN forward step
def rnn_step_forward(x, h_prev):
    h_next = np.tanh(np.dot(Wx, x) + np.dot(Wh, h_prev) + bh)
    y = np.dot(Wy, h_next) + by
    return h_next, y

# Training loop
for epoch in range(1000):  # Training for 1000 epochs
    loss = 0
    h_prev = np.zeros((hidden_size, 1))  # Initialize hidden state

    for t in range(len(encoded_sentence) - 1):
        x_t = np.zeros((input_size, 1))
        x_t[encoded_sentence[t]] = 1  # One-hot encoding
        y_true = encoded_sentence[t + 1]

        # Forward pass
        h_prev, y_pred = rnn_step_forward(x_t, h_prev)
        y_pred_softmax = np.exp(y_pred) / np.sum(np.exp(y_pred))  # Softmax activation

        loss += -np.log(y_pred_softmax[y_true])  # Cross-entropy loss

        # Backward pass (gradient calculation and parameter update)
        dy = y_pred_softmax
        dy[y_true] -= 1  # Gradient of softmax + loss

        dWy = np.dot(dy, h_prev.T)
        dby = dy
        dh = np.dot(Wy.T, dy) * (1 - h_prev**2)  # Backprop through tanh activation
        dWx = np.dot(dh, x_t.T)
        dWh = np.dot(dh, h_prev.T)
        dbh = dh

        # Update parameters
        Wy -= learning_rate * dWy
        by -= learning_rate * dby
        Wx -= learning_rate * dWx
        Wh -= learning_rate * dWh
        bh -= learning_rate * dbh

    if epoch % 100 == 0:
        print(f"Epoch {epoch}, Loss: {loss}")

# Predict next words
h_prev = np.zeros((hidden_size, 1))  # Initialize hidden state
print("\nPredictions:")
for t in range(len(encoded_sentence) - 1):
    x_t = np.zeros((input_size, 1))
    x_t[encoded_sentence[t]] = 1  # One-hot encoding
    h_prev, y_pred = rnn_step_forward(x_t, h_prev)
    next_word = int_to_word[np.argmax(y_pred)]
    print(f"Input: {int_to_word[encoded_sentence[t]]}, Predicted: {next_word}")

Epoch 0, Loss: [40.68110442]
Epoch 100, Loss: [39.78991664]
Epoch 200, Loss: [39.38713047]
Epoch 300, Loss: [37.40914653]
Epoch 400, Loss: [32.61490015]
Epoch 500, Loss: [23.62491818]
Epoch 600, Loss: [20.74614278]
Epoch 700, Loss: [15.62007448]
Epoch 800, Loss: [15.00181898]
Epoch 900, Loss: [16.89805145]

Predictions:
Input: France, Predicted: is
Input: is, Predicted: where
Input: where, Predicted: i
Input: i, Predicted: Grew
Input: Grew, Predicted: up,
Input: up,, Predicted: but
Input: but, Predicted: I
Input: I, Predicted: now
Input: now, Predicted: live
Input: live, Predicted: in
Input: in, Predicted: Boston.
Input: Boston., Predicted: I
Input: I, Predicted: Boston.
Input: speak, Predicted: fluent
Input: fluent, Predicted: French
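
In the predictions above, the first `I` is correctly followed by `now`, while the second `I` is mapped to `Boston.` instead of `speak`. One plausible reading is that `I` is the only token with more than one successor, so the current word alone cannot determine the answer and the model has to rely on its hidden state. The sketch below lists every token whose successor is ambiguous; it assumes the same whitespace tokenisation as the cell, so `i` and `I` are different tokens and punctuation stays attached.

```python
from collections import defaultdict

sentence = "France is where i Grew up, but I now live in Boston. I speak fluent French"
words = sentence.split()

# Collect the set of words that follow each token.
successors = defaultdict(set)
for current, following in zip(words, words[1:]):
    successors[current].add(following)

ambiguous = {w: sorted(s) for w, s in successors.items() if len(s) > 1}
print(ambiguous)  # {'I': ['now', 'speak']}
```

With only four hidden units, the state may not carry enough context to separate the two occurrences, which is consistent with the loss plateauing around 15 to 17.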


# **Question#3 - Back Propagation**

## Single Neuron - Single Character



In [None]:
import numpy as np

# Define the data
word = "photosynthesis"
char_to_int = {ch: i for i, ch in enumerate(sorted(set(word)))}
int_to_char = {i: ch for ch, i in char_to_int.items()}
encoded_word = [char_to_int[ch] for ch in word]

print("Encoded Word using One-Hot Encoding:", encoded_word, "\n")

# Hyperparameters
hidden_size = 1  # Single neuron
input_size = len(char_to_int)
output_size = len(char_to_int)
learning_rate = 0.1
epochs = 10  # Fewer epochs for demonstration

# Model parameters
Wx = np.random.randn(hidden_size, input_size) * 0.01  # Input to hidden weights
Wh = np.random.randn(hidden_size, hidden_size) * 0.01  # Hidden to hidden weights
Wy = np.random.randn(output_size, hidden_size) * 0.01  # Hidden to output weights
bh = np.zeros((hidden_size, 1))  # Hidden bias
by = np.zeros((output_size, 1))  # Output bias

# RNN step function
def rnn_step_forward(x, h_prev):
    h_next = np.tanh(np.dot(Wx, x) + np.dot(Wh, h_prev) + bh)  # Single hidden unit
    y = np.dot(Wy, h_next) + by
    return h_next, y

# Training loop
for epoch in range(epochs):
    print(f"Epoch {epoch + 1}/{epochs}")
    loss = 0
    h_prev = np.zeros((hidden_size, 1))  # Single hidden unit

    for t in range(len(encoded_word) - 1):
        print(f"\nTime Step {t + 1}/{len(encoded_word) - 1}")

        # Input preparation
        x_t = np.zeros((input_size, 1))
        x_t[encoded_word[t]] = 1  # One-hot encoding
        y_true = encoded_word[t + 1]  # Target character index

        # Forward pass
        h_prev, y_pred = rnn_step_forward(x_t, h_prev)
        y_pred_softmax = np.exp(y_pred) / np.sum(np.exp(y_pred))  # Softmax activation

        # Print forward propagation details
        print("Forward Propagation:")
        print(f"  Input (x_t): {x_t.T}")
        print(f"  Hidden State (h_prev): {h_prev.T}")
        print(f"  Raw Prediction (y_pred): {y_pred.T}")
        print(f"  Softmax Prediction (y_pred_softmax): {y_pred_softmax.T}")

        # Loss calculation
        loss += -np.log(y_pred_softmax[y_true])  # Cross-entropy loss

        # Backward pass
        dy = y_pred_softmax
        dy[y_true] -= 1  # Gradient of softmax + loss

        dWy = np.dot(dy, h_prev.T)
        dby = dy
        dh = np.dot(Wy.T, dy) * (1 - h_prev**2)  # Backprop through tanh activation
        dWx = np.dot(dh, x_t.T)
        dWh = np.dot(dh, h_prev.T)
        dbh = dh

        # Print backward propagation details
        print("Backward Propagation:")
        print(f"  Gradient of Loss wrt Output (dy): {dy.T}")
        print(f"  Gradient Wy (dWy): {dWy}")
        print(f"  Gradient Wh (dWh): {dWh}")
        print(f"  Gradient Wx (dWx): {dWx}")
        print(f"  Gradient bh (dbh): {dbh.T}")
        print(f"  Gradient by (dby): {dby.T}")

        # Update parameters
        Wy -= learning_rate * dWy
        by -= learning_rate * dby
        Wx -= learning_rate * dWx
        Wh -= learning_rate * dWh
        bh -= learning_rate * dbh

    print(f"\nEpoch {epoch + 1} Loss: {loss}")

# Testing (prediction after training)
print("\nTesting the RNN Model:")
h_prev = np.zeros((hidden_size, 1))  # Single hidden unit
for t in range(len(encoded_word) - 1):
    x_t = np.zeros((input_size, 1))
    x_t[encoded_word[t]] = 1  # One-hot encoding
    h_prev, y_pred = rnn_step_forward(x_t, h_prev)
    next_char = int_to_char[np.argmax(y_pred)]
    print(f"Input: {int_to_char[encoded_word[t]]}, Predicted: {next_char}")

Encoded Word using One-Hot Encoding: [5, 1, 4, 7, 4, 6, 8, 3, 7, 1, 0, 6, 2, 6] 

Epoch 1/10

Time Step 1/13
Forward Propagation:
  Input (x_t): [[0. 0. 0. 0. 0. 1. 0. 0. 0.]]
  Hidden State (h_prev): [[0.00940775]]
  Raw Prediction (y_pred): [[-2.60556853e-04  3.26682597e-05 -1.29606134e-04  4.16751678e-06
   9.34522460e-06  5.60869129e-05 -7.82832643e-05 -9.11400438e-06
   2.17966570e-05]]
  Softmax Prediction (y_pred_softmax): [[0.11108653 0.1111191  0.11110107 0.11111594 0.11111651 0.11112171
  0.11110678 0.11111446 0.1111179 ]]
Backward Propagation:
  Gradient of Loss wrt Output (dy): [[ 0.11108653 -0.8888809   0.11110107  0.11111594  0.11111651  0.11112171
   0.11110678  0.11111446  0.1111179 ]]
  Gradient Wy (dWy): [[ 0.00104507]
 [-0.00836237]
 [ 0.00104521]
 [ 0.00104535]
 [ 0.00104536]
 [ 0.00104541]
 [ 0.00104527]
 [ 0.00104534]
 [ 0.00104537]]
  Gradient Wh (dWh): [[-7.19301021e-05]]
  Gradient Wx (dWx): [[ 0.          0.          0.          0.          0.         -0.00764
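
One way to validate hand-written gradients like the ones printed above is a finite-difference check. The sketch below does this for `Wh` on a single time step. It is an illustration under stated assumptions, not the notebook's training code: the sizes, seed, and variable names `h_old` and `h_new` are hypothetical, and it keeps the previous and new hidden states in separate variables, whereas the cell above overwrites `h_prev` before computing `dWh`.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed for repeatability
input_size, hidden_size, output_size = 5, 3, 5

Wx = rng.standard_normal((hidden_size, input_size)) * 0.5
Wh = rng.standard_normal((hidden_size, hidden_size)) * 0.5
Wy = rng.standard_normal((output_size, hidden_size)) * 0.5
bh = np.zeros((hidden_size, 1))
by = np.zeros((output_size, 1))

x = np.zeros((input_size, 1))
x[2] = 1.0                                   # one-hot input
h_old = rng.standard_normal((hidden_size, 1)) * 0.5  # previous hidden state
y_true = 4

def loss_for(Wh_candidate):
    """Cross-entropy loss of one step as a function of the recurrent weights."""
    h = np.tanh(Wx @ x + Wh_candidate @ h_old + bh)
    logits = Wy @ h + by
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[y_true, 0]

# Analytic gradient: the outer product uses the PREVIOUS state h_old.
h_new = np.tanh(Wx @ x + Wh @ h_old + bh)
logits = Wy @ h_new + by
probs = np.exp(logits - logits.max())
probs /= probs.sum()
dy = probs.copy()
dy[y_true] -= 1.0
dh = (Wy.T @ dy) * (1.0 - h_new**2)
dWh_analytic = dh @ h_old.T

# Numerical gradient by central differences.
eps = 1e-6
dWh_numeric = np.zeros_like(Wh)
for i in range(hidden_size):
    for j in range(hidden_size):
        bump = np.zeros_like(Wh)
        bump[i, j] = eps
        dWh_numeric[i, j] = (loss_for(Wh + bump) - loss_for(Wh - bump)) / (2 * eps)

max_error = np.abs(dWh_analytic - dWh_numeric).max()
print("Max abs difference (using h_old):", max_error)
print("Max abs difference (using h_new):", np.abs(dh @ h_new.T - dWh_numeric).max())
```

The check covers a single step only; full backpropagation through time would also carry the gradient from later steps back through `Wh`.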

## Multiple Neurons - A sentence

In [None]:
import numpy as np

# Define the data
sentence = "I am CS student"
words = sentence.split()
word_to_int = {word: i for i, word in enumerate(sorted(set(words)))}
int_to_word = {i: word for word, i in word_to_int.items()}
encoded_sentence = [word_to_int[word] for word in words]

print("Encoded Sentence using One-Hot Encoding:", encoded_sentence, "\n")

# Hyperparameters
input_size = len(word_to_int)  # Number of unique words
output_size = len(word_to_int)
hidden_size = 4  # Number of hidden neurons
learning_rate = 0.01
epochs = 10  # Fewer epochs for demonstration

# Initialize weights and biases
Wx = np.random.randn(hidden_size, input_size) * 0.01  # Input to hidden weights
Wh = np.random.randn(hidden_size, hidden_size) * 0.01  # Hidden to hidden weights
Wy = np.random.randn(output_size, hidden_size) * 0.01  # Hidden to output weights
bh = np.zeros((hidden_size, 1))  # Hidden bias
by = np.zeros((output_size, 1))  # Output bias

# RNN step function
def rnn_step_forward(x, h_prev):
    h_next = np.tanh(np.dot(Wx, x) + np.dot(Wh, h_prev) + bh)
    y = np.dot(Wy, h_next) + by
    return h_next, y

# Training loop
for epoch in range(epochs):
    print(f"\nEpoch {epoch + 1}/{epochs}")
    loss = 0
    h_prev = np.zeros((hidden_size, 1))  # Initialize hidden state

    for t in range(len(encoded_sentence) - 1):
        print(f"\nTime Step {t + 1}/{len(encoded_sentence) - 1}")

        # Input preparation
        x_t = np.zeros((input_size, 1))
        x_t[encoded_sentence[t]] = 1  # One-hot encoding
        y_true = encoded_sentence[t + 1]  # Target word index

        # Forward pass
        h_prev, y_pred = rnn_step_forward(x_t, h_prev)
        y_pred_softmax = np.exp(y_pred) / np.sum(np.exp(y_pred))  # Softmax activation

        # Print forward propagation details
        print("Forward Propagation:")
        print(f"  Input (x_t): {x_t.T}")
        print(f"  Hidden State (h_prev): {h_prev.T}")
        print(f"  Raw Prediction (y_pred): {y_pred.T}")
        print(f"  Softmax Prediction (y_pred_softmax): {y_pred_softmax.T}")

        # Loss calculation
        loss += -np.log(y_pred_softmax[y_true])  # Cross-entropy loss

        # Backward pass
        dy = y_pred_softmax
        dy[y_true] -= 1  # Gradient of softmax + loss

        dWy = np.dot(dy, h_prev.T)
        dby = dy
        dh = np.dot(Wy.T, dy) * (1 - h_prev**2)  # Backprop


Encoded Sentence using One-Hot Encoding: [1, 2, 0, 3] 


Epoch 1/10

Time Step 1/3
Forward Propagation:
  Input (x_t): [[0. 1. 0. 0.]]
  Hidden State (h_prev): [[-4.12703871e-05  1.81937510e-03 -5.66316777e-03  1.05266405e-02]]
  Raw Prediction (y_pred): [[-5.72544663e-05 -6.21467669e-05 -1.93818621e-05  9.38987306e-05]]
  Softmax Prediction (y_pred_softmax): [[0.24998849 0.24998727 0.24999796 0.25002628]]

Time Step 2/3
Forward Propagation:
  Input (x_t): [[0. 0. 1. 0.]]
  Hidden State (h_prev): [[ 0.00895886  0.0095831  -0.00256969 -0.00811158]]
  Raw Prediction (y_pred): [[-1.56293039e-04  1.44680775e-04  2.75905553e-04 -4.93226762e-05]]
  Softmax Prediction (y_pred_softmax): [[0.24994749 0.25002273 0.25005554 0.24997423]]

Time Step 3/3
Forward Propagation:
  Input (x_t): [[1. 0. 0. 0.]]
  Hidden State (h_prev): [[ 0.00677048 -0.01419261  0.00273955 -0.01016803]]
  Raw Prediction (y_pred): [[ 0.00024798  0.00023421 -0.00016731 -0.00022872]]
  Softmax Prediction (y_pred_softmax): [[

# **Question#4 - Self Attention**

In [None]:
import numpy as np

# Define a user example: sequence of words
sentence = "The quick brown fox jumps over the lazy dog"
words = sentence.split()
vocab = sorted(set(words))
word_to_int = {word: i for i, word in enumerate(vocab)}
int_to_word = {i: word for word, i in word_to_int.items()}
encoded_sentence = [word_to_int[word] for word in words]

print("Input Sentence:", sentence)
print("Encoded Sentence:", encoded_sentence, "\n")

# Hyperparameters
embedding_size = 8  # Size of embedding vectors
sequence_length = len(words)  # Number of tokens
d_k = embedding_size  # Dimension of the key vectors (commonly the same as embedding size)

# Random embeddings for demonstration
np.random.seed(42)  # Seed for reproducibility
word_embeddings = np.random.randn(len(vocab), embedding_size)  # Embedding matrix
sequence_embeddings = np.array([word_embeddings[i] for i in encoded_sentence])  # Input embeddings

# Initialize Query, Key, and Value weights
Wq = np.random.randn(embedding_size, d_k)
Wk = np.random.randn(embedding_size, d_k)
Wv = np.random.randn(embedding_size, embedding_size)

# Compute Query, Key, and Value matrices
queries = np.dot(sequence_embeddings, Wq)  # (sequence_length x d_k)
keys = np.dot(sequence_embeddings, Wk)     # (sequence_length x d_k)
values = np.dot(sequence_embeddings, Wv)   # (sequence_length x embedding_size)

# Scaled Dot-Product Attention
def scaled_dot_product_attention(Q, K, V):
    scores = np.dot(Q, K.T) / np.sqrt(d_k)  # Scale by sqrt(d_k)
    attention_weights = np.exp(scores) / np.sum(np.exp(scores), axis=1, keepdims=True)  # Softmax
    output = np.dot(attention_weights, V)
    return output, attention_weights

# Calculate self-attention
attention_output, attention_weights = scaled_dot_product_attention(queries, keys, values)

# Print results
print("Self-Attention Mechanism:\n")
print("Sequence Embeddings (Input):")
print(sequence_embeddings)
print("\nQuery Matrix:")
print(queries)
print("\nKey Matrix:")
print(keys)
print("\nValue Matrix:")
print(values)
print("\nAttention Weights (Softmax of Scores):")
print(attention_weights)
print("\nSelf-Attention Output:")
print(attention_output)

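# Reconstruct the sequence using the attention output.
# Note: each row of attention_output has embedding_size entries, so np.argmax(row)
# is a feature index (0..7), not a vocabulary index; mapping it through
# int_to_word is illustrative only and does not decode the original words.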
reconstructed_sequence = [int_to_word[np.argmax(row)] for row in attention_output]
print("\nReconstructed Sequence (Using Attention Outputs):", " ".join(reconstructed_sequence))

Input Sentence: The quick brown fox jumps over the lazy dog
Encoded Sentence: [0, 7, 1, 3, 4, 6, 8, 5, 2] 

Self-Attention Mechanism:

Sequence Embeddings (Input):
[[ 0.49671415 -0.1382643   0.64768854  1.52302986 -0.23415337 -0.23413696
   1.57921282  0.76743473]
 [-0.83921752 -0.30921238  0.33126343  0.97554513 -0.47917424 -0.18565898
  -1.10633497 -1.19620662]
 [-0.46947439  0.54256004 -0.46341769 -0.46572975  0.24196227 -1.91328024
  -1.72491783 -0.56228753]
 [-0.54438272  0.11092259 -1.15099358  0.37569802 -0.60063869 -0.29169375
  -0.60170661  1.85227818]
 [-0.01349722 -1.05771093  0.82254491 -1.22084365  0.2088636  -1.95967012
  -1.32818605  0.19686124]
 [ 0.34361829 -1.76304016  0.32408397 -0.38508228 -0.676922    0.61167629
   1.03099952  0.93128012]
 [ 0.81252582  1.35624003 -0.07201012  1.0035329   0.36163603 -0.64511975
   0.36139561  1.53803657]
 [ 0.73846658  0.17136828 -0.11564828 -0.3011037  -1.47852199 -0.71984421
  -0.46063877  1.05712223]
 [-1.01283112  0.31424733 -0
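
Two properties of the attention computation above are easy to verify in code: each row of the attention weights sums to 1, and the output has one row per token with the same width as the value vectors. The sketch below is a standalone variant under stated assumptions: it uses random inputs with an assumed seed, reads `d_k` from the key matrix instead of a global, and subtracts the row maximum before exponentiating (a standard numerical-stability step that the cell above omits). The helper `softmax_rows` is hypothetical.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    weights = softmax_rows(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

rng = np.random.default_rng(0)  # assumed seed
seq_len, d_model = 9, 8         # nine tokens, eight-dimensional embeddings
X = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))

output, weights = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)

print("Weights shape:", weights.shape)
print("Output shape:", output.shape)
print("Rows sum to 1:", np.allclose(weights.sum(axis=1), 1.0))

# Large scores would overflow a naive exp; the shifted version stays finite.
print("Stable on large scores:", np.isfinite(softmax_rows(np.array([[1000.0, 1001.0]]))).all())
```

Since each output row is a weighted average of value vectors, it lives in the embedding space rather than the vocabulary space.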

# **Question#5 - Encoding Techniques**

## Label Encoding
Label encoding is a preprocessing technique used in machine learning to convert categorical data into numerical form by assigning each distinct category an integer code.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Sample data
categories = ['Apple', 'Banana', 'Apple', 'Cherry']

# Initialize LabelEncoder
encoder = LabelEncoder()

# Fit and transform the data
labels = encoder.fit_transform(categories)

print("Encoded labels:", labels)
print("Original categories:", encoder.inverse_transform(labels))


Encoded labels: [0 1 0 2]
Original categories: ['Apple' 'Banana' 'Apple' 'Cherry']


Here are the **disadvantages of Label Encoding**:

### 1. **Introduces Ordinal Relationships**
   - Label encoding assigns integer values to categories, which can create a false sense of order (e.g., `0 < 1 < 2`). This can mislead machine learning models into treating categories as ordinal when they are not.
   - Example: For `['Red', 'Green', 'Blue']`, assigning `[0, 1, 2]` may cause the model to think `Red < Green < Blue`.

### 2. **Model Bias**
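   - Models that use feature magnitudes directly, such as linear regression, treat the numeric labels as quantities with meaning or weight, leading to biased predictions or poor performance. Tree-based models are usually less affected, but they still split on the arbitrary integer order.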
   - Example: If `['Dog', 'Cat', 'Elephant']` is encoded as `[0, 1, 2]`, a model might mistakenly assume `Elephant` is "twice" as important as `Cat`.

### 3. **Not Suitable for High Cardinality**
   - If the categorical variable has many unique values (e.g., a column of user IDs or product codes), label encoding produces a wide range of integers whose magnitudes carry no meaning, which can make training difficult and lead to poor generalization.

### 4. **Loss of Interpretability**
   - Once categories are converted to integers, it can be challenging to interpret or explain the relationships inferred by the model between encoded values.

### 5. **Inconsistent Behavior Across Datasets**
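   - If the encoder is fitted separately on the training and testing datasets, the same category can receive different integers in each. If it is fitted on the training set only, a category that first appears in the test set cannot be encoded; scikit-learn's `LabelEncoder.transform`, for example, raises a `ValueError` for labels not seen during `fit`.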

### 6. **Risk of Data Leakage**
   - If labels are based on the order in which data is encountered, the encoding may inadvertently include information about the dataset's structure or sequence, leading to data leakage and unreliable results.

---

### When to Avoid Label Encoding:
- When categories do **not have an ordinal relationship** (use one-hot encoding instead).
- When there are **many unique categories** (e.g., thousands of product IDs).
- When the encoded values can mislead models due to the implied ordering.

### Mitigation Strategies:
- Use **One-Hot Encoding** for non-ordinal data.
- For ordinal data, use **explicit ordinal encoding** that reflects the true order (e.g., `{'Low': 0, 'Medium': 1, 'High': 2}`).
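
A minimal sketch of explicit ordinal encoding in plain Python, using the `Low`/`Medium`/`High` mapping from the bullet above. The helper name `encode_ordinal` and the choice of `-1` as a placeholder for unseen categories are assumptions made for illustration, not part of any library.

```python
# Explicit order for an ordinal feature (the mapping is chosen by hand)
SIZE_ORDER = {"Low": 0, "Medium": 1, "High": 2}

def encode_ordinal(values, mapping, unknown=-1):
    """Map each category to its rank; categories missing from the mapping get `unknown`."""
    return [mapping.get(v, unknown) for v in values]

sizes = ["Medium", "Low", "High", "Low", "Extreme"]  # "Extreme" is not in the mapping
encoded_sizes = encode_ordinal(sizes, SIZE_ORDER)
print(encoded_sizes)  # [1, 0, 2, 0, -1]
```

Reserving a sentinel value for unknown categories is one way to avoid the unseen-category failure described in disadvantage 5; whether `-1` is a sensible value depends on the model.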

## One Hot Encoding
One-Hot Encoding is a method used to convert categorical data into a binary format that can be fed into machine learning models. It creates new binary columns (features), each representing one unique category.



In [None]:
import pandas as pd

# Sample data
data = {'Category': ['Apple', 'Banana', 'Apple', 'Cherry']}
df = pd.DataFrame(data)

# One-hot encoding
encoded_df = pd.get_dummies(df, columns=['Category']).astype(int)
print(encoded_df)


   Category_Apple  Category_Banana  Category_Cherry
0               1                0                0
1               0                1                0
2               1                0                0
3               0                0                1


Here are the main **disadvantages of One-Hot Encoding**:

---

### **1. High Dimensionality**
   - **Description**: For categorical features with many unique values (high cardinality), one-hot encoding creates a separate column for each category, significantly increasing the number of features.
   - **Impact**: This increases memory usage, computational cost, and can make training models slow.
   - **Example**: A feature with 10,000 unique categories results in 10,000 new columns.

---

### **2. Sparse Representation**
   - **Description**: The resulting matrix from one-hot encoding contains mostly `0`s, especially when the number of categories is large. This is inefficient for computation and storage.
   - **Impact**: Models may struggle with sparse data, and additional resources may be required for processing.

---

### **3. Overfitting Risk**
   - **Description**: When there are many categories but limited data for each, one-hot encoding can lead to overfitting. Models might learn noise or irrelevant patterns associated with specific categories.
   - **Impact**: Poor generalization to unseen data.

---

### **4. Not Suitable for Ordinal Data**
   - **Description**: One-hot encoding ignores any natural order among categories, treating them as independent. This is unsuitable for ordinal data (e.g., `Low`, `Medium`, `High`), where the relationship matters.
   - **Impact**: Loss of valuable information about the inherent order.

---

### **5. Inefficiency in High-Cardinality Features**
   - **Description**: For features like postal codes or user IDs, one-hot encoding produces excessive columns, most of which contribute little useful information to the model.
   - **Impact**: Increased computational complexity without significant performance gains.

---

### **6. Hard to Interpret**
   - **Description**: One-hot encoding transforms a single feature into multiple binary columns, making the dataset harder to interpret for humans.
   - **Impact**: Reduced clarity when analyzing or visualizing data.

---

### **When to Avoid One-Hot Encoding:**
- **High Cardinality**: For features with thousands of unique values.
- **Ordinal Features**: Where the order of categories is meaningful.
- **Limited Data**: When the dataset is small compared to the number of categories.

---

### **Alternatives to One-Hot Encoding:**
1. **Label (Ordinal) Encoding**: Suitable for ordinal features, provided the integer mapping reflects the true order of the categories.
2. **Binary Encoding**: Encodes categories as binary numbers to reduce dimensionality.
3. **Embedding Layers**: Common in deep learning for representing categories in dense, lower-dimensional spaces.
4. **Hash Encoding**: Reduces dimensionality by mapping categories to a fixed number of hash buckets.
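
To illustrate the hash encoding idea from the list above, here is a rough sketch in plain Python. The function name `hash_encode`, the bucket count of 8, and the use of MD5 as the hash are all illustrative assumptions; real implementations may differ in the hash function and in how collisions are handled.

```python
import hashlib

def hash_encode(category, n_buckets=8):
    """Return a vector of length n_buckets with a single 1 at the hashed position."""
    # A stable hash (unlike Python's built-in hash()) gives the same bucket on every run
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % n_buckets
    vector = [0] * n_buckets
    vector[bucket] = 1
    return vector

for fruit in ["Apple", "Banana", "Cherry", "Apple"]:
    print(fruit, hash_encode(fruit))
```

The number of columns stays fixed at `n_buckets` no matter how many categories appear, at the cost that two different categories can land in the same bucket.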

In [None]:
pip install category-encoders


## Binary Encoding
Binary Encoding is a data preprocessing technique that encodes categorical data into binary format. Each category is first assigned a unique integer value, which is then converted into its binary representation. The binary digits are then split into separate columns.
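
The three steps in that description (integer assignment, binary conversion, one column per bit) can be sketched by hand as follows. This is a simplified illustration, not a library implementation: the function name `binary_encode` is hypothetical, and starting the integer codes at 1 in order of first appearance is an assumption.

```python
def binary_encode(categories):
    """Assign integer codes from 1 in order of first appearance, then split each code into bits."""
    codes = {}
    for c in categories:
        codes.setdefault(c, len(codes) + 1)
    # Number of bit columns needed for the largest code
    n_bits = max(codes.values()).bit_length()
    rows = []
    for c in categories:
        bits = format(codes[c], f"0{n_bits}b")  # zero-padded binary string
        rows.append([int(b) for b in bits])
    return rows

binary_rows = binary_encode(["Apple", "Banana", "Cherry"])
print(binary_rows)  # [[0, 1], [1, 0], [1, 1]]

# Ten distinct categories need only four bit columns
ten_rows = binary_encode([f"cat_{i}" for i in range(10)])
print(len(ten_rows[0]))  # 4
```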


In [None]:
import pandas as pd
from category_encoders import BinaryEncoder

# Sample data
data = {'Category': ['Apple', 'Banana', 'Cherry']}
df = pd.DataFrame(data)

# Binary Encoding
encoder = BinaryEncoder(cols=['Category'])
binary_encoded = encoder.fit_transform(df)

print(binary_encoded)


   Category_0  Category_1
0           0           1
1           1           0
2           1           1


### **Advantages of Binary Encoding**
1. **Reduced Dimensionality**: Compared to one-hot encoding, binary encoding requires fewer columns, making it more efficient for features with high cardinality.
   - Example: One-hot encoding for 10 categories creates 10 columns, but binary encoding creates only 4 columns (`log2(10) ≈ 3.32`, rounded up to 4 bits).
   
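2. **Retains Some Ordinal Information**: Unlike one-hot encoding, binary encoding derives its bit patterns from the underlying integer codes, so some of their numeric order carries over. For nominal data those codes are arbitrary, so this order should not be read as meaningful.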

3. **Efficient for Large Data**: Handles high-cardinality features well without creating sparse matrices.

---

### **Disadvantages of Binary Encoding**
1. **Not Fully Interpretable**: The binary representation may not be as intuitive to humans compared to one-hot encoding.
2. **Loss of Ordinality**: While some order information is retained, it is not explicit or complete.
3. **Risk of Overfitting**: If categories are highly specific and unique (e.g., user IDs), binary encoding can lead to overfitting in some models.

---

### **When to Use Binary Encoding**
- **High-Cardinality Features**: For categorical variables with many unique values.
- **Efficiency Needs**: When one-hot encoding leads to too many columns, causing memory or computational inefficiencies.

---

### **Alternatives to Binary Encoding**
1. **One-Hot Encoding**: For low cardinality and non-ordinal data.
2. **Hash Encoding**: Uses a hash function to map categories to a fixed number of columns.
3. **Embedding Layers**: Common in deep learning for dense, low-dimensional representations.

## Count Vectorizer & BoW

CountVectorizer is a tool provided by the scikit-learn library to convert a collection of text documents into a matrix of token counts. It is used to implement the Bag of Words (BoW) model, where each document is represented as a vector of word counts.

The Bag of Words (BoW) model is a simple and commonly used technique in Natural Language Processing (NLP) to represent text data in numerical form. It disregards grammar, word order, and context, focusing only on word occurrences.

### **How CountVectorizer Works**
1. **Tokenization**: Splits the text into individual words (tokens).
2. **Vocabulary Creation**: Builds a vocabulary of unique words from the corpus.
3. **Vectorization**:
   - Counts the occurrences of each word in the vocabulary for each document.
   - Represents documents as vectors based on these counts.

### **How Bag of Words Works**

1. **Text Preprocessing**:
   - Tokenize the text into words (e.g., split sentences into words).
   - Convert all words to lowercase to ensure consistency.
   - Optionally, remove stop words (e.g., "the," "is," "and") and perform stemming or lemmatization.

2. **Vocabulary Creation**:
   - Create a vocabulary of unique words from the corpus (all text data).
   - Assign an index to each unique word.

3. **Encoding**:
   - Represent each document (text) as a vector, where:
     - Each element corresponds to a word in the vocabulary.
     - The value is the frequency (or presence/absence) of the word in the document.

---

### **Example**
#### Input:
```text
Document 1: "I like apples"
Document 2: "I like bananas"
Document 3: "I eat apples and bananas"
```

#### Step 1: Create Vocabulary
Vocabulary: `['i', 'like', 'apples', 'bananas', 'eat', 'and']`

#### Step 2: Encode Documents as Vectors
Each document is converted into a vector based on word counts:

| Document   | `i` | `like` | `apples` | `bananas` | `eat` | `and` |
|------------|-----|--------|----------|-----------|-------|-------|
| Document 1 | 1   | 1      | 1        | 0         | 0     | 0     |
| Document 2 | 1   | 1      | 0        | 1         | 0     | 0     |
| Document 3 | 1   | 0      | 1        | 1         | 1     | 1     |
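
The table above can be reproduced with a few lines of plain Python. This is a minimal sketch that assumes whitespace tokenization and a vocabulary ordered by first appearance; the variable names are illustrative.

```python
docs = ["I like apples", "I like bananas", "I eat apples and bananas"]

# Step 1: lowercase and split on whitespace
tokenized = [d.lower().split() for d in docs]

# Step 2: build the vocabulary in order of first appearance
vocab = []
for tokens in tokenized:
    for tok in tokens:
        if tok not in vocab:
            vocab.append(tok)

# Step 3: count each vocabulary word in each document
bow = [[tokens.count(word) for word in vocab] for tokens in tokenized]

print(vocab)  # ['i', 'like', 'apples', 'bananas', 'eat', 'and']
for row in bow:
    print(row)
# [1, 1, 1, 0, 0, 0]
# [1, 1, 0, 1, 0, 0]
# [1, 0, 1, 1, 1, 1]
```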


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample data
documents = [
    "I like apples",
    "I like bananas",
    "I eat apples and bananas"
]

# Initialize CountVectorizer
# Note: it lowercases text and, by default, keeps only tokens of 2+ characters, so "I" is dropped
vectorizer = CountVectorizer()

# Fit and transform the data
bow_matrix = vectorizer.fit_transform(documents)

# Display results
print("Vocabulary:", vectorizer.get_feature_names_out())
print("Bag of Words Matrix:\n", bow_matrix.toarray())

Vocabulary: ['and' 'apples' 'bananas' 'eat' 'like']
Bag of Words Matrix:
 [[0 1 0 0 1]
 [0 0 1 0 1]
 [1 1 1 1 0]]


In [None]:
vectorizer = CountVectorizer(stop_words='english')  # Remove common stop words
bow_matrix = vectorizer.fit_transform(documents)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("Bag of Words Matrix:\n", bow_matrix.toarray())

Vocabulary: ['apples' 'bananas' 'eat' 'like']
Bag of Words Matrix:
 [[1 0 0 1]
 [0 1 0 1]
 [1 1 1 0]]


### **Advantages of CountVectorizer**
1. **Automatic Text Processing**: Automates tokenization, vocabulary creation, and counting.
2. **Flexible**: Allows customization with parameters like stop word removal and n-grams.
3. **Integration**: Easy to integrate with machine learning models in scikit-learn.

---

### **Disadvantages of CountVectorizer**
1. **No Context**: Ignores word order and semantic meaning.
2. **High Dimensionality**: For large vocabularies, results in sparse and high-dimensional matrices.
3. **Sensitive to Rare Words**: Rare words may disproportionately affect the representation.

### **Advantages of Bag of Words**
1. **Simple and Intuitive**: Easy to implement and understand.
2. **Works Well for Small Datasets**: Effective for small-scale text classification or clustering tasks.
3. **Foundation for Other Models**: Basis for more advanced techniques like TF-IDF and word embeddings.

---

### **Disadvantages of Bag of Words**
1. **High Dimensionality**:
   - For a large vocabulary, the resulting vectors are high-dimensional and sparse.
   - This increases memory and computational requirements.

2. **No Context or Order Information**:
   - Ignores the sequence of words.
   - Loses the semantic meaning of words and phrases.

3. **Ignores Synonyms and Polysemy**:
   - Treats synonyms as separate words (e.g., "happy" and "joyful").
   - The same word used in different contexts has the same representation (e.g., "bank" in "river bank" vs. "money bank").

4. **Sparse Representation**:
   - Vectors are filled mostly with zeros, making them inefficient to process.

## TF-IDF (Term Frequency - Inverse Document Frequency)

**TF-IDF** is a statistical method used in Natural Language Processing (NLP) to evaluate the importance of a word in a document relative to a collection of documents (corpus). Unlike **Bag of Words (BoW)**, which counts word occurrences, TF-IDF assigns weights to words based on their frequency in a single document and their rarity across all documents, helping to identify key terms.

---

### **Key Components**
1. **Term Frequency (TF)**:
   - Measures how often a word occurs in a document.
   - Formula:
     \[
     \text{TF}(t, d) = \frac{\text{Number of occurrences of term } t \text{ in document } d}{\text{Total number of words in document } d}
     \]

2. **Inverse Document Frequency (IDF)**:
   - Measures how unique or rare a word is across all documents.
   - Formula:
     \[
     \text{IDF}(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t}\right)
     \]
   - A word appearing in many documents has a low IDF value, while a rare word has a high IDF value.

3. **TF-IDF Score**:
   - Combines TF and IDF to calculate the importance of a word in a document.
   - Formula:
     \[
     \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)
     \]

---

### **Example**
#### Documents:
- Document 1: "I love apples"
- Document 2: "I love bananas"
- Document 3: "Apples and bananas are great"

#### Vocabulary:
`['i', 'love', 'apples', 'bananas', 'and', 'are', 'great']`

#### Calculating TF-IDF:
For the word **"apples"**:
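- **TF (Document 1)**: 1/3 ≈ 0.33
- **IDF**: log10(3/2) ≈ 0.18 (base-10 logarithm; "apples" appears in 2 of the 3 documents)
- **TF-IDF (Document 1)**: 0.33 × 0.18 ≈ 0.06

These hand-computed values differ from the `TfidfVectorizer` output in the next cell, because scikit-learn by default uses a smoothed natural-log IDF and L2-normalizes each row.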
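
The hand calculation for "apples" can be checked with a short script. This sketch assumes whitespace tokenization after lowercasing and a base-10 logarithm, matching the textbook formulas in the Key Components section; the helper names `tf` and `idf` are illustrative.

```python
import math

docs = ["I love apples", "I love bananas", "Apples and bananas are great"]
tokenized = [d.lower().split() for d in docs]

def tf(term, tokens):
    """Occurrences of the term divided by the total number of words in the document."""
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    """Base-10 log of (number of documents / number of documents containing the term)."""
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log10(len(corpus) / df)

tf_apples = tf("apples", tokenized[0])
idf_apples = idf("apples", tokenized)
tfidf_apples = tf_apples * idf_apples

print(round(tf_apples, 2), round(idf_apples, 2), round(tfidf_apples, 2))  # 0.33 0.18 0.06
```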

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample data
documents = [
    "I love apples",
    "I love bananas",
    "Apples and bananas are great"
]

# Initialize TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the data
tfidf_matrix = vectorizer.fit_transform(documents)

# Convert to array for viewing
print("TF-IDF Matrix:\n", tfidf_matrix.toarray())
print("Vocabulary:", vectorizer.get_feature_names_out())

TF-IDF Matrix:
 [[0.         0.70710678 0.         0.         0.         0.70710678]
 [0.         0.         0.         0.70710678 0.         0.70710678]
 [0.49047908 0.37302199 0.49047908 0.37302199 0.49047908 0.        ]]
Vocabulary: ['and' 'apples' 'are' 'bananas' 'great' 'love']
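
The matrix above can be reproduced without scikit-learn, which helps show where numbers such as `0.70710678` come from. The sketch below assumes the documented `TfidfVectorizer` defaults: tokens shorter than two characters are dropped, term frequency is the raw count, IDF is `ln((1 + N) / (1 + df)) + 1`, and each row is L2-normalized. The whitespace tokenizer used here is a simplification that happens to be sufficient for these three sentences.

```python
import math

docs = ["I love apples", "I love bananas", "Apples and bananas are great"]

# Lowercase, split on whitespace, and keep only tokens of two or more characters (drops "i")
tokenized = [[t for t in d.lower().split() if len(t) >= 2] for d in docs]
vocab = sorted({t for tokens in tokenized for t in tokens})
n_docs = len(tokenized)

def smoothed_idf(term):
    """Smoothed inverse document frequency with a natural logarithm."""
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log((1 + n_docs) / (1 + df)) + 1

rows = []
for tokens in tokenized:
    raw = [tokens.count(term) * smoothed_idf(term) for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))  # L2 norm of the row
    rows.append([v / norm for v in raw])

print(vocab)
for row in rows:
    print([round(v, 4) for v in row])
```

In the first two documents both remaining words have the same IDF, so after normalization each weight is `1/sqrt(2) ≈ 0.7071`.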


### **Advantages of TF-IDF**
1. **Distinguishes Important Words**:
   - Identifies words that are significant in a document but uncommon in the corpus.
2. **Simple and Effective**:
   - Works well for small to medium-sized datasets.
3. **Reduces Noise**:
   - Reduces the weight of common but uninformative words.

---

### **Disadvantages of TF-IDF**
1. **No Contextual Understanding**:
   - Fails to capture word meanings and relationships.
2. **Sensitive to Data Sparsity**:
   - High-dimensional representations can still be sparse for large corpora.
3. **Static Weights**:
   - Weights are fixed after computation and don't adapt to new data.


## CBOW (Continuous Bag of Words) and Skip-Gram Models

CBOW and Skip-Gram are the two training architectures of **Word2Vec**, a method for learning **word embeddings** developed at Google. Both learn from which words appear near each other in a text corpus and encode semantic meaning into vector representations.

---

### **1. Continuous Bag of Words (CBOW)**

CBOW predicts a **target word** (center word) based on the surrounding **context words**.

#### **Key Characteristics:**
- **Input**: The context words (surrounding words).
- **Output**: The target word (center word).
- **Objective**: Maximize the probability of predicting the correct target word given the context.

#### **How It Works:**
1. A window of size `n` is defined around the target word.
2. The context words within this window are used as input.
3. The algorithm learns to predict the center word from the context.

#### **Example**:
For the sentence: **"I love playing football."**
- If the window size is 2 and the target word is **"playing"**, the context words are **["I", "love", "football"]**.
- CBOW predicts **"playing"** based on these words.
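
A small sketch of how CBOW training examples could be built for this sentence. It only generates the (context, target) pairs; the neural network that learns from them is not shown. The function name `cbow_pairs` and the pre-tokenized input (punctuation removed) are assumptions for illustration.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs, using up to `window` words on each side."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs

sentence = ["I", "love", "playing", "football"]
pairs = cbow_pairs(sentence)
print(pairs[2])  # (['I', 'love', 'football'], 'playing')
```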

#### **Advantages of CBOW**:
- Faster to train compared to Skip-Gram.
- Performs well on frequent words.

---

### **2. Skip-Gram**

Skip-Gram does the opposite of CBOW. It predicts the **context words** based on the **target word** (center word).

#### **Key Characteristics:**
- **Input**: The target word (center word).
- **Output**: The context words (surrounding words).
- **Objective**: Maximize the probability of predicting the correct context words given the center word.

#### **How It Works:**
1. A window of size `n` is defined around the target word.
2. The center word is used as input.
3. The algorithm learns to predict the context words.

#### **Example**:
For the sentence: **"I love playing football."**
- If the window size is 2 and the target word is **"playing"**, the algorithm tries to predict **["I", "love", "football"]**.
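
The corresponding Skip-Gram training examples can be sketched the same way: each center word is paired with every word inside its window, one pair at a time. As with any such sketch, only the pair generation is shown, and the function name `skipgram_pairs` and the pre-tokenized input are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center_word, context_word) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["I", "love", "playing", "football"]
playing_pairs = [p for p in skipgram_pairs(sentence) if p[0] == "playing"]
print(playing_pairs)  # [('playing', 'I'), ('playing', 'love'), ('playing', 'football')]
```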

#### **Advantages of Skip-Gram**:
- Handles infrequent words better than CBOW.
- Produces more accurate word embeddings for rare words.


## GloVe, FastText, and BERT
These are popular word embedding techniques, each with unique methodologies and applications. Here's an overview of their differences, functionalities, and strengths:

---

### **1. GloVe (Global Vectors for Word Representation)**  
**Approach**:  
- **Count-based**: GloVe uses a co-occurrence matrix of word counts to learn embeddings by capturing statistical relationships between words in a corpus.
- **Global Context**: It focuses on aggregating global word-to-word co-occurrence statistics.
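
To make the co-occurrence statistics concrete, the following sketch counts how often each pair of words appears within a fixed window across a tiny corpus. It covers only the counting step, not the embedding training, and it omits refinements such as weighting pairs by distance. The function name `cooccurrence_counts`, the window size, and the two-sentence corpus are assumptions for illustration.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count (word, neighbor) pairs that occur within `window` positions of each other."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

corpus = [["i", "love", "apples"], ["i", "love", "bananas"]]
counts = cooccurrence_counts(corpus)
print(counts[("i", "love")])       # 2
print(counts[("love", "apples")])  # 1
```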

**Key Features**:  
- Words with similar contexts have embeddings close in the vector space.
- Computationally efficient after building the co-occurrence matrix.
- Pre-trained embeddings are available (e.g., 50D, 100D, 300D trained on large corpora).

**Strengths**:  
- Efficient for capturing **semantic relationships** like analogies:
  - Example: `king - man + woman ≈ queen`
- Works well with static embeddings (one vector per word).

**Limitations**:  
- Cannot handle **out-of-vocabulary (OOV)** words effectively.
- The embeddings are **static**, meaning polysemy (words with multiple meanings) cannot be captured.

**Example Use Case**: Sentiment analysis, document classification.

---

### **2. FastText**  
**Approach**:  
- **Subword-based**: FastText represents each word as a bag of character **n-grams** (e.g., "playing" includes "pla", "lay", "ayi", etc.).
- Word embedding is the sum of the embeddings of its n-grams.
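
A brief sketch of the character n-gram extraction described above. It assumes the word is wrapped in `<` and `>` boundary markers and uses a single n-gram length of 3 for readability; the function name `char_ngrams` is hypothetical, and a real model typically combines several n-gram lengths.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

playing_ngrams = char_ngrams("playing")
print(playing_ngrams)  # ['<pl', 'pla', 'lay', 'ayi', 'yin', 'ing', 'ng>']

# Related word forms share n-grams, which is what lets unseen words borrow information
shared = sorted(set(char_ngrams("playing")) & set(char_ngrams("played")))
print(shared)  # ['<pl', 'lay', 'pla']
```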

**Key Features**:  
- Captures **morphological features** (prefixes, suffixes, etc.) of words.
- Handles **OOV words** by composing embeddings from subwords.

**Strengths**:  
- Works well for morphologically rich languages (e.g., Turkish, Finnish).
- Handles **rare words** and unseen words (e.g., typos or compound words).
- Better at capturing word similarities based on structure.

**Limitations**:  
- Still produces **static embeddings** (similar to GloVe).
- Performance depends on the quality of the corpus.

**Example Use Case**: Spell correction, machine translation, word similarity tasks.

---

### **3. BERT (Bidirectional Encoder Representations from Transformers)**  
**Approach**:  
- **Contextualized embeddings**: BERT generates different embeddings for the same word depending on its context in a sentence.
- Based on the **Transformer architecture**, which uses self-attention mechanisms to understand relationships between words in both left-to-right and right-to-left directions.

**Key Features**:  
- Pre-trained on large corpora (e.g., BookCorpus, Wikipedia) using **masked language modeling (MLM)** and **next sentence prediction (NSP)** tasks.
- Embeddings are dynamic and **context-sensitive**.

**Strengths**:  
- Handles **polysemy** by producing unique embeddings for the same word in different contexts.
  - Example: "bank" in "river bank" vs. "money bank."
- Excels in downstream NLP tasks (e.g., question answering, text classification) via fine-tuning.
- Pre-trained models like **BERT-base**, **BERT-large** can be fine-tuned for domain-specific tasks.

**Limitations**:  
- Computationally expensive and resource-intensive.
- Requires fine-tuning for specific tasks.

**Example Use Case**: Question answering (e.g., SQuAD), sentiment analysis, and, typically as the encoder within a larger system, language translation and text summarization.
