<a href="https://colab.research.google.com/github/ChiaoYunTing/Text-Analytics/blob/main/Language_Modeling_with_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple Language Model
We will explore creating a simple language model based on word (token) embeddings. We will use the Jane Austen novel 'Sense and Sensibility' to train the language model.

In [1]:
import nltk
nltk.download('gutenberg')  # Make sure the Gutenberg corpus is downloaded
from nltk.corpus import gutenberg

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


In [2]:
# Load "Sense and Sensibility" text
sas = gutenberg.raw('austen-sense.txt')

# Print the first 500 characters of "Sense and Sensibility"
print(sas[:500])


[Sense and Sensibility by Jane Austen 1811]

CHAPTER 1


The family of Dashwood had long been settled in Sussex.
Their estate was large, and their residence was at Norland Park,
in the centre of their property, where, for many generations,
they had lived in so respectable a manner as to engage
the general good opinion of their surrounding acquaintance.
The late owner of this estate was a single man, who lived
to a very advanced age, and who for many years of his life,
had a constant companion an


# Embeddings
We will use the `gensim` library, which provides straightforward implementations of `word2vec`, to create word embeddings using the CBOW model. To keep it simple, we will create embeddings of size 5.

In [3]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

# Tokenize
tokens = word_tokenize(sas)

# Word2Vec needs a list of token lists. We split the token stream into
# fixed 100-token chunks; these stand in for sentences but are not true sentences.
sentences = [tokens[i:i+100] for i in range(0, len(tokens), 100)]


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
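
The slicing in the cell above does not look for sentence boundaries; it cuts the token stream into consecutive blocks of 100 tokens, and the last block may be shorter. A minimal sketch of the same slicing on a toy list, where the helper name `chunk_tokens` and the chunk size of 4 are chosen purely for illustration:

```python
def chunk_tokens(tokens, chunk_size):
    """Split a flat token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

# Toy input: 11 whitespace-separated tokens (illustrative, not the real tokenizer output)
toy_tokens = "The family of Dashwood had long been settled in Sussex .".split()
chunks = chunk_tokens(toy_tokens, chunk_size=4)
for chunk in chunks:
    print(chunk)
```

One consequence of this design is that context never carries over from the end of one chunk to the start of the next.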


In [4]:
# Train the CBOW model
model = Word2Vec(sentences, vector_size=5, window=5, min_count=1, sg=0)  # sg=0 specifies CBOW
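
With `sg=0`, Word2Vec trains in CBOW mode: a word is predicted from the words around it, up to `window` positions on each side. The sketch below lists the (context, target) pairs this implies for a toy sentence. The helper `cbow_pairs` is hypothetical and leaves out details of real implementations, such as randomly reduced windows and negative sampling.

```python
def cbow_pairs(tokens, window):
    """Return (context_words, target_word) for every position in tokens."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after
        pairs.append((left + right, target))
    return pairs

toy = ["their", "estate", "was", "large", "and"]
pairs = cbow_pairs(toy, window=2)
for context, target in pairs:
    print(context, "->", target)
```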


# Tokenizer
Once we get the embeddings, we will store the words (tokens), token ids, and embeddings in a dataframe.

Note that the number of distinct words (tokens) is 7111. This is the size of our vocabulary.

In [5]:
import pandas as pd
# Create a DataFrame to store word, token_id, and embedding
data = {
    'word': [],
    'token_id': [],
    'embedding': []
}

for idx, word in enumerate(model.wv.index_to_key):
    data['word'].append(word)
    data['token_id'].append(idx)
    data['embedding'].append(model.wv[word].tolist())  # convert numpy array to list for easier handling in DataFrame

df = pd.DataFrame(data)
print(df)

              word  token_id  \
0                ,         0   
1               to         1   
2                .         2   
3              the         3   
4               of         4   
...            ...       ...   
7106        other.      7106   
7107   incessantly      7107   
7108  exclamations      7108   
7109     hartshorn      7109   
7110       utility      7110   

                                              embedding  
0     [2.3235557079315186, 1.948472499847412, 3.8707...  
1     [1.384544849395752, 2.6455442905426025, 3.9733...  
2     [6.1547956466674805, 2.1530051231384277, 3.250...  
3     [-0.13863931596279144, 2.2208776473999023, 4.8...  
4     [1.0740772485733032, 1.8242425918579102, 5.650...  
...                                                 ...  
7106  [0.1288207471370697, 0.20290175080299377, 0.16...  
7107  [-0.1576346755027771, -0.055189501494169235, -...  
7108  [-0.05515243485569954, 0.060973454266786575, -...  
7109  [0.03785617649555206, -0.1461
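
The token ids above are positions in `model.wv.index_to_key`, which by default orders the vocabulary from most to least frequent. A sketch of the same idea using only the standard library follows. `build_vocab` is a hypothetical helper, and ties between equally frequent words are broken here by first appearance, which may differ from gensim's ordering.

```python
from collections import Counter

def build_vocab(tokens):
    """Map words to ids, with id 0 given to the most frequent word."""
    counts = Counter(tokens)
    # most_common() sorts by descending count; equal counts keep first-seen order
    index_to_key = [word for word, _ in counts.most_common()]
    key_to_index = {word: idx for idx, word in enumerate(index_to_key)}
    return index_to_key, key_to_index

toy = "to be , or not to be , that is the question".split()
index_to_key, key_to_index = build_vocab(toy)
print(index_to_key[:3])
print(key_to_index["question"])
```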

# Training Data
Our main objective is to predict the next word (token) based on the previous 5 words (tokens). Thus, our context length is 5.

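We will prepare the training data so that each input is 5 consecutive words (tokens) and the output to be predicted is the 6th word (token). If an input has fewer than 5 words (tokens), we pad it with `<pad>`. With the windowing used below, every training window already contains 5 tokens, so padding only matters for short prompts at prediction time.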
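
The windowing described above can be sketched on raw tokens, before any embeddings are involved. `make_windows` and `pad_context` are hypothetical helpers; padding on the right is an assumption made here for illustration.

```python
PAD = "<pad>"

def make_windows(tokens, context_len=5):
    """Return (context, next_token) pairs for every full window in tokens."""
    pairs = []
    for i in range(len(tokens) - context_len):
        pairs.append((tokens[i:i + context_len], tokens[i + context_len]))
    return pairs

def pad_context(context, context_len=5):
    """Right-pad a short context with PAD so it has context_len entries."""
    return context + [PAD] * (context_len - len(context))

tokens = "[ Sense and Sensibility by Jane Austen 1811 ]".split()
windows = make_windows(tokens)
for context, next_token in windows:
    print(context, "->", next_token)

padded = pad_context(["who", "for", "many", "years"])
print(padded)
```

A list of 9 tokens yields 9 - 5 = 4 training pairs, and a list with 5 or fewer tokens yields none.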

In [6]:
import numpy as np
import pandas as pd

def generate_training_data(sentences, model_wv, window_size=5):
    X, y = [], []
    sequence_texts = []  # For storing the actual sequences of words
    next_words = []  # For storing the actual next word
    for sentence in sentences:
        # Embed words using the Word2Vec model
        embedded_sentence = [model_wv[word] for word in sentence if word in model_wv]
        word_sentence = [word for word in sentence if word in model_wv]  # Keep the actual words for viewing
        # Create sequences
        for i in range(len(embedded_sentence)):
            end_ix = i + window_size
            if end_ix >= len(embedded_sentence):
                break
            seq_x, seq_y = embedded_sentence[i:end_ix], embedded_sentence[end_ix]
            seq_text, next_word = word_sentence[i:end_ix], word_sentence[end_ix]
            # Pad sequence if necessary
            seq_x += [np.zeros(model_wv.vector_size)] * (window_size - len(seq_x))
            seq_text += ['<pad>'] * (window_size - len(seq_text))  # Use <pad> for padding text
            X.append(np.concatenate(seq_x))
            y.append(seq_y)
            sequence_texts.append(' '.join(seq_text))
            next_words.append(next_word)
    return np.array(X), np.array(y), sequence_texts, next_words

# Assume 'sentences' and 'model.wv' have been defined
X_train, y_train, train_sequences, train_next_words = generate_training_data(sentences, model.wv)

# Create DataFrame
train_df = pd.DataFrame({
    'Sequence': train_sequences,
    'Next Word': train_next_words,
    'X_train (Flattened Embeddings)': list(X_train),
    'y_train (Embedding)': list(y_train)
})



In [7]:
train_df.head()

| | Sequence | Next Word | X_train (Flattened Embeddings) | y_train (Embedding) |
|---|---|---|---|---|
| 0 | [ Sense and Sensibility by | Jane | [-0.08243611, 0.1148764, 0.016487636, -0.20564... | [-0.16630827, -0.14444734, 0.22338493, -0.1569... |
| 1 | Sense and Sensibility by Jane | Austen | [0.19308993, 0.21353799, 0.17963225, -0.174526... | [0.22983347, -0.038708005, -0.07965764, -0.224... |
| 2 | and Sensibility by Jane Austen | 1811 | [0.3780069, 2.2601147, 4.064688, -3.5333724, -... | [0.1298982, -0.079790495, 0.19601513, -0.23673... |
| 3 | Sensibility by Jane Austen 1811 | ] | [0.15656252, -0.017120201, 0.2167294, 0.088603... | [0.26361594, 0.3978434, 0.30091414, -0.3552599... |
| 4 | by Jane Austen 1811 ] | CHAPTER | [0.9099719, 2.47948, 4.682765, -2.498548, -2.1... | [1.4351755, 1.4981786, 2.7212527, -1.8203795, ... |


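Our training dataset has 134,306 data points. Each data point consists of 6 words (tokens): 5 inputs and 1 target. Counting every window separately, the training data contains 134,306 $\times$ 6 = 805,836 words (tokens). Because consecutive windows overlap, most tokens are counted more than once in this total.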
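
The number of data points follows from the chunking: windows never cross a chunk boundary, so a chunk of n tokens yields max(n - 5, 0) examples. The sketch below reproduces the count. The token total of 141,376 is an inference, not a figure stated in this notebook: it is the value that gives 134,306 examples if the corpus is split into 100-token chunks and no token is dropped.

```python
def count_windows(total_tokens, chunk_size=100, context_len=5):
    """Count (context, next token) examples when windows stay inside chunks."""
    full_chunks, remainder = divmod(total_tokens, chunk_size)
    per_full_chunk = max(chunk_size - context_len, 0)
    last_chunk = max(remainder - context_len, 0)
    return full_chunks * per_full_chunk + last_chunk

# Assumed token total, back-solved from the reported 134,306 data points
assumed_total_tokens = 141_376
n_examples = count_windows(assumed_total_tokens)
print(n_examples, n_examples * 6)
```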

In [8]:
train_df.shape
# train_df.head()

(134306, 4)

# Neural Network Design
The neural network we will train has the following structure.

1. Input layer: 25 nodes (5 words (tokens), each with an embedding of size 5).
2. Hidden layer: 10 nodes.
3. Output layer: 5 nodes (for the predicted next word (token)).

The total number of parameters to estimate is:

25 $\times$ 10 (edges from input to hidden layer) +

10 (bias terms in the hidden layer) +

10 $\times$ 5 (edges from hidden layer to output layer) +

5 (bias terms in the output layer),

which gives a total of 250 + 10 + 50 + 5 = 315 parameters.
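
The same count can be checked in a few lines: a fully connected layer with `n_in` inputs and `n_out` outputs has `n_in * n_out` weights plus `n_out` biases. The helper name `dense_params` is made up for this sketch.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

layer_sizes = [25, 10, 5]  # input, hidden, output
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)
print(per_layer, total_params)
```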

![picture](https://drive.google.com/uc?export=view&id=1lAj53mvleR-XRGLuJZu21E86EjXKMbSO)


In [9]:
from keras.models import Sequential
from keras.layers import Dense

def build_model(input_dim, hidden_neurons, output_dim):
    model = Sequential([
        Dense(hidden_neurons, input_dim=input_dim, activation='relu'),
        Dense(output_dim, activation='linear')  # Assuming you want the raw embedding as output
    ])
    model.compile(optimizer='adam', loss='mse')
    return model
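
To make the computation inside this network concrete, here is a dependency-free sketch of one forward pass through a 25-10-5 network with a ReLU hidden layer, a linear output layer, and a mean squared error loss. The weights are random placeholders rather than trained values, and the helper names are hypothetical.

```python
import random

def dense(inputs, weights, biases, activation=None):
    """One fully connected layer; weights holds one row per output node."""
    outputs = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, inputs)) + b
        outputs.append(max(z, 0.0) if activation == "relu" else z)
    return outputs

def random_layer(n_in, n_out, rng):
    """Small random weights and zero biases, standing in for trained values."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def mse(pred, target):
    """Mean squared error over the output dimensions."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

rng = random.Random(0)
w1, b1 = random_layer(25, 10, rng)
w2, b2 = random_layer(10, 5, rng)

x = [rng.uniform(-1, 1) for _ in range(25)]     # 5 embeddings of size 5, flattened
hidden = dense(x, w1, b1, activation="relu")    # 10 hidden activations
y_hat = dense(hidden, w2, b2)                   # predicted 5-dimensional embedding
loss = mse(y_hat, [0.0] * 5)                    # target here is a dummy zero vector
print(len(hidden), len(y_hat), loss)
```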


# Build and train the model
**Takes a long time - about an hour on my machine**

**I have saved the trained model (`path_to_my_model.h5`). To use it, you only need to load the saved model; there is no need to retrain.**

In [10]:
# Build model
nn_model = build_model(25, 10, 5)

# Train the model
nn_model.fit(X_train, y_train, epochs=10, batch_size=1)

Epoch 1/10

KeyboardInterrupt: 
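
With `batch_size=1`, each epoch performs one weight update per training example, which is one likely reason training is slow, although the actual time depends on the hardware. The sketch below only counts updates per epoch for a few batch sizes. Using a larger batch is a suggestion, not something this notebook tests, and it can change the trained result.

```python
import math

def steps_per_epoch(n_examples, batch_size):
    """Number of weight updates needed to pass over the data once."""
    return math.ceil(n_examples / batch_size)

n_examples = 134_306  # size of the training set reported above
for batch_size in (1, 32, 128):
    print(batch_size, steps_per_epoch(n_examples, batch_size))
```

For reference, 32 is the default `batch_size` in Keras when none is given.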

Save the model for later use

In [11]:
from keras.models import load_model

# Assume 'nn_model' is your trained model
nn_model.save('path_to_my_model.h5')  # Saves the model to your hard drive



# Load the trained model

In [12]:
from keras.models import load_model

# Load the model from the disk
loaded_model = load_model('path_to_my_model.h5')

# Next Word Prediction
Once the model is trained, we can use it to predict the next word (token). The network outputs an embedding, so we return the 5 vocabulary words (tokens) whose embeddings have the highest cosine similarity to that output.

In [13]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def predict_next_words(model, input_sequence, word_vectors, top_n=5):
    # Predict the embedding
    prediction = model.predict(np.array([input_sequence]))[0]

    # Calculate cosine similarity with all words
    all_similarities = cosine_similarity([prediction], word_vectors.vectors)[0]

    # Find the top_n words with the highest similarity
    top_indices = np.argsort(-all_similarities)[:top_n]  # Negative for descending order
    closest_words = [(word_vectors.index_to_key[i], all_similarities[i]) for i in top_indices]

    return closest_words
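
The lookup above ranks every vocabulary word by the cosine similarity between its embedding and the predicted embedding. A plain-Python sketch of that ranking on made-up 2-dimensional vectors follows; `cosine`, `nearest_words`, and the toy vectors are illustrative only.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_words(query, vocab_vectors, top_n=2):
    """Return the top_n (word, similarity) pairs, most similar first."""
    scored = [(word, cosine(query, vec)) for word, vec in vocab_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

toy_vectors = {
    "estate": [1.0, 0.0],
    "park": [0.8, 0.6],
    "heart": [0.0, 1.0],
}
result = nearest_words([2.0, 0.1], toy_vectors)
print(result)
```

Cosine similarity compares directions only, so the length of the predicted vector does not affect the ranking.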


Run the prediction

In [14]:
# Test the loaded model
test_sequence = "who for many years" # @param {type:"string"}
test_tokens = word_tokenize(test_sequence)
test_embedded = [model.wv[word] for word in test_tokens if word in model.wv]
context_len = 5  # the model expects 5 words (tokens) in the prompt
vector_size = 5  # dimension of each word embedding

# Ensure there are exactly 5 embeddings; pad at the end if there are fewer
if len(test_embedded) < context_len:
    # Pad with zero-filled vectors
    test_embedded += [np.zeros(vector_size) for _ in range(context_len - len(test_embedded))]

# Flatten the list of embeddings to match input shape, and ensure it's truncated to exactly 5 words
test_input = np.concatenate(test_embedded[:5])

predicted_words = predict_next_words(loaded_model, test_input, model.wv)

print("Predicted next words:")
for word, similarity in predicted_words:
    print(f"{word}: {similarity:.4f}")  # .4f formats the similarity to 4 decimal places

Predicted next words:
out: 0.9996
heart: 0.9996
They: 0.9994
inclination: 0.9993
CHAPTER: 0.9992


# Text Generation with Randomness
To make the generated text more interesting, we will inject some randomness. Instead of always taking the most similar word, we sample the next word from a probability distribution over the 10 most likely next words (tokens).

Text generation function

In [15]:
def predict_next_words_with_probabilities(model, input_sequence, word_vectors, top_n=10):
    # Predict the embedding
    prediction = model.predict(np.array([input_sequence]))[0]

    # Calculate cosine similarity with all words
    all_similarities = cosine_similarity([prediction], word_vectors.vectors)[0]

    # Get the top_n indices and scores
    top_indices = np.argsort(-all_similarities)[:top_n]
    top_scores = all_similarities[top_indices]

    # Convert scores to probabilities using softmax
    top_probabilities = np.exp(top_scores) / np.sum(np.exp(top_scores))

    # Ensure the probabilities sum to 1
    top_probabilities /= top_probabilities.sum()

    return [(word_vectors.index_to_key[i], top_probabilities[j]) for j, i in enumerate(top_indices)]
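
Because cosine similarities lie between -1 and 1 and the top scores are often nearly equal, a softmax over them gives almost uniform probabilities. The sketch below uses the five similarity scores printed in the prediction cell earlier in this notebook. The `temperature` parameter is an assumption added for illustration and is not part of the notebook's function.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax; smaller temperature sharpens the distribution."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.9996, 0.9996, 0.9994, 0.9993, 0.9992]
flat = softmax(scores)                        # as in the function above
sharp = softmax(scores, temperature=0.0001)   # hypothetical sharpening
print([round(p, 4) for p in flat])
print([round(p, 4) for p in sharp])
```

With the plain softmax, sampling among the top candidates is close to picking one uniformly at random.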



In [16]:
def generate_text(model, initial_text, word_vectors, num_words, vector_size=5):
    tokens = word_tokenize(initial_text)
    current_embeddings = [word_vectors[word] for word in tokens if word in word_vectors]

    generated_words = tokens.copy()

    for _ in range(num_words):
        if len(current_embeddings) < 5:
            padded_embeddings = current_embeddings + [np.zeros(vector_size) for _ in range(5 - len(current_embeddings))]
        else:
            padded_embeddings = current_embeddings[-5:]

        input_sequence = np.concatenate(padded_embeddings)

        next_word_options = predict_next_words_with_probabilities(model, input_sequence, word_vectors)

        words, probabilities = zip(*next_word_options)

        # Normalize probabilities to ensure they sum to 1
        probabilities = np.array(probabilities)
        probabilities /= probabilities.sum()

        next_word = np.random.choice(words, p=probabilities)

        generated_words.append(next_word)
        current_embeddings.append(word_vectors[next_word])

    return ' '.join(generated_words)
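
The loop above is autoregressive: each sampled word is appended to the text and becomes part of the context for the next prediction. The sketch below shows that loop with the neural model replaced by a stand-in. `generate` and `toy_propose` are hypothetical, and the sketch slides over words rather than embeddings to stay dependency-free.

```python
import random

def generate(seed_tokens, propose, num_words, context_len=5, rng=None):
    """Repeatedly sample a next word from propose(context) and append it."""
    rng = rng or random.Random()
    generated = list(seed_tokens)
    for _ in range(num_words):
        context = generated[-context_len:]          # most recent words only
        words, probs = zip(*propose(context))
        next_word = rng.choices(words, weights=probs, k=1)[0]
        generated.append(next_word)
    return generated

def toy_propose(context):
    """Stand-in for the model: down-weights repeating the last word."""
    candidates = ["upon", "settled", "heart"]
    weights = [0.1 if w == context[-1] else 1.0 for w in candidates]
    total = sum(weights)
    return [(w, wt / total) for w, wt in zip(candidates, weights)]

seed = "who for many years have".split()
text = generate(seed, toy_propose, num_words=8, rng=random.Random(42))
print(" ".join(text))
```

Passing a seeded `random.Random` makes the sampled text reproducible, which helps when comparing runs.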



In [17]:
# Assuming 'loaded_model' is your trained and loaded Keras model
# 'model.wv' should be your Word2Vec model's word vectors

initial_text = "who for many years have" # @param {type:"string"}
num_words_to_generate = 40 # @param {type:"integer"}

generated_text = generate_text(loaded_model, initial_text, model.wv, num_words_to_generate)





In [18]:
generated_text

"who for many years have connection question out upon ' since upon entirely natural inclination upon settled upon because upon entirely out settled opened out because inclination inquiry since natural out girl heart upon opened looked natural heart settled natural since settled upon where really"

Given that this is a 'toy' example, the outputs generated are not very impressive. But this represents the right direction of travel in terms of building a language model.