# Training a Word2Vec Embedding Model

> **Note:** This Jupyter notebook is provided as part of the book (chapter 2):
> "Quickstart Your GPT Business: Deploying Ideas Swiftly Using Django Templates"
> By Nabil MABROUK - 2023
> Published on GitHub: https://github.com/Nabil-Mabrouk/gpt-django-quickstart

## Introduction
In this notebook, we walk step by step through training a Word2Vec embedding model on text extracted from PDF files stored in a folder called `pdfs`.

## Step 1: Data Preprocessing

Before training the Word2Vec model, we need to preprocess the PDF files. This involves extracting text from the PDFs, removing any irrelevant characters, and converting the text into a suitable format for training the model.


In [1]:
import os
import glob
import PyPDF2
import re
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from collections import Counter
from tqdm import tqdm  # Progress bars for the long-running loops

In [4]:
# Estimated execution time: around 16 min, depending on your PC configuration

# Step 1: List PDF files in the 'pdfs' folder
pdf_folder = 'pdfs'
pdf_files = glob.glob(os.path.join(pdf_folder, '*.pdf'))
print("Number of pdf files : ", len(pdf_files))

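# Step 2: Extract text from PDF files sentence by sentence
# Note: sent_tokenize needs NLTK's Punkt tokenizer data, e.g. nltk.download('punkt')
# (newer NLTK releases use 'punkt_tab' instead)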
sentences = []
with tqdm(total=len(pdf_files), desc="Extracting Text") as pbar:  # Create progress bar
    for pdf_file in pdf_files:
        with open(pdf_file, 'rb') as file:
            reader = PyPDF2.PdfReader(file)
            for page in reader.pages:
                page_text = page.extract_text()
                page_sentences = sent_tokenize(page_text)
                sentences.extend(page_sentences)
        pbar.update(1)  # Update progress bar
print("Number of sentences : ", len(sentences))

# Step 3: Preprocess the text
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Remove special characters, numbers, and symbols
    # (raw string so that \s is not treated as an invalid escape sequence)
    text = re.sub(r'[^A-Za-z\s]+', '', text)
    # Convert text to lowercase
    text = text.lower()
    # Tokenize text into words
    words = word_tokenize(text)
    # Remove stop words
    words = [word for word in words if word not in stop_words]
    # Join the words back into a sentence
    processed_text = ' '.join(words)
    return processed_text

preprocessed_sentences = []
with tqdm(total=len(sentences), desc="Preprocessing Text") as pbar:  # Create progress bar
    for sentence in sentences:
        preprocessed_sentence = preprocess_text(sentence)
        preprocessed_sentences.append(preprocessed_sentence)
        pbar.update(1)  # Update progress bar
print("Number of sentences : ", len(preprocessed_sentences))

# Step 4: Gather statistics about the text
num_files = len(pdf_files)
num_sentences = len(sentences)
# Tokenize and flatten once, then reuse the list for every statistic
all_words = [word for sentence in preprocessed_sentences for word in word_tokenize(sentence)]
num_words = len(all_words)
unique_words = set(all_words)
word_frequencies = Counter(all_words)

# Print statistics
print("Number of PDF files:", num_files)
print("Number of sentences:", num_sentences)
print("Number of words:", num_words)
print("Number of unique words:", len(unique_words))
print("Word occurrence frequencies:", word_frequencies)

# Save processed text to a TXT file
output_file = 'processed_text.txt'
with open(output_file, 'w') as file:
    file.write('\n'.join(preprocessed_sentences))

print("Processed text saved to", output_file)


Number of pdf files :  74


Extracting Text: 100%|█████████████████████████████████████████████████████████████████| 74/74 [17:52<00:00, 14.50s/it]
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Number of sentences :  223574


Preprocessing Text: 100%|████████████████████████████████████████████████████| 223574/223574 [00:22<00:00, 9932.88it/s]


Number of sentences :  223574
Number of PDF files: 74
Number of sentences: 223574
Number of words: 1333434
Number of unique words: 30680
Processed text saved to processed_text.txt
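
To make the effect of `preprocess_text` concrete, here is a minimal, dependency-free sketch of the same steps applied to a single sentence. It is an approximation under stated assumptions: `SAMPLE_STOP_WORDS` is a tiny hand-picked stand-in for NLTK's English stop-word list, `str.split()` stands in for `word_tokenize`, and `preprocess_sketch` is a hypothetical name, so results on real text may differ slightly from the cell above.

```python
import re

# Assumption: a tiny hand-picked subset standing in for NLTK's English stop-word list.
SAMPLE_STOP_WORDS = {"the", "of", "in", "was", "a", "and", "is", "to"}

def preprocess_sketch(text):
    """Approximate the notebook's preprocessing without NLTK."""
    # Keep only ASCII letters and whitespace (drops digits and punctuation).
    text = re.sub(r"[^A-Za-z\s]+", "", text)
    # Lowercase, then split on whitespace (stand-in for word_tokenize).
    words = text.lower().split()
    # Drop stop words and rejoin into a single space-separated string.
    return " ".join(word for word in words if word not in SAMPLE_STOP_WORDS)

sample = "The price of gold rose 12% in 2020, and silver followed."
print(preprocess_sketch(sample))  # price gold rose silver followed
```

Because punctuation is deleted rather than replaced by a space, hyphenated words are merged: `large-scale` becomes `largescale`. That is probably why the token `largescale` shows up among the custom model's neighbors in Step 3.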


## Step 2: Training the Word2Vec Model

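Once the data is preprocessed, we train the Word2Vec model on the processed text. Word2Vec learns word embeddings by training a shallow neural network to predict the context of words within a given corpus, so words that appear in similar contexts end up with similar vectors. Here we use 50-dimensional vectors (`vector_size=50`), a context window of 5 words (`window=5`), and keep every word regardless of its frequency (`min_count=1`); these are the main parameters to adjust when tuning the model.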
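
To illustrate what "predicting the context" means, the sketch below lists the (center word, context word) pairs that a window of a given size produces from one tokenized sentence. It is only an illustration: `context_pairs` is a hypothetical helper, not part of gensim, and real training adds refinements such as negative sampling that are omitted here. Note also that gensim's `Word2Vec` defaults to the CBOW variant (`sg=0`), which predicts the center word from its context; passing `sg=1` selects skip-gram, which predicts the context from the center word. Both variants are built on the same word/context windows.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clamp the window to the sentence boundaries.
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # A word is not its own context.
                pairs.append((center, tokens[j]))
    return pairs

tokens = "gold silver coins found".split()
for pair in context_pairs(tokens, window=1):
    print(pair)
```

With `window=1` each word is paired only with its immediate neighbors; the notebook's `window=5` looks up to five words to either side.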


In [47]:
from gensim.models import Word2Vec

with open(output_file, 'r') as file:
    corpus = file.read().splitlines()

corpus = [sentence.split() for sentence in corpus]

model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, workers=4)

# Save the trained model
model_file = 'word2vec_custom_model.bin'
model.save(model_file)
print("Trained model saved to", model_file)

# Load the saved model
loaded_model = Word2Vec.load(model_file)

# Test the loaded model
test_word = "gold"
similar_words = loaded_model.wv.most_similar(test_word)

print(f"Similar words to '{test_word}' (loaded model):")
for word, similarity in similar_words:
    print(f"{word}: {similarity:.4f}")

Trained model saved to word2vec_custom_model.bin
Similar words to 'gold' (loaded model):
silver: 0.7592
coins: 0.7351
cavity: 0.7296
mesh: 0.7273
nuggets: 0.7183
ivory: 0.7182
rectangle: 0.7143
bones: 0.7118
longboat: 0.7116
rainbow: 0.7098


## Step 3: Comparing with Pretrained Embedding Model

To assess the quality of our custom Word2Vec model, we compare it with a pretrained embedding model. We use `glove-wiki-gigaword-50`, a widely used set of pretrained GloVe vectors with 50 dimensions, the same dimensionality as our custom model. Comparing the nearest neighbors returned by the two models shows how much a larger training corpus matters when building your own embedding model.


In [12]:
import gensim.downloader
from gensim.models import Word2Vec  # Needed below to load the custom model

# Step 1: Download and load the pretrained model
pretrained_model_name = 'glove-wiki-gigaword-50'  # Pretrained GloVe vectors (50 dimensions)

# load the pretrained model 
pretrained_model = gensim.downloader.load(pretrained_model_name)

# Step 2: Test the pretrained model's similarity
pretrained_similar_words = pretrained_model.most_similar('gold')

print(f"Similar words to 'gold' (pretrained model):")
for word, similarity in pretrained_similar_words:
    print(f"{word}: {similarity:.4f}")

# Step 3: Test the custom Word2Vec model's similarity

# Load the saved model
model_file = 'word2vec_custom_model.bin'
model = Word2Vec.load(model_file)
custom_similar_words = model.wv.most_similar('gold')

print('------------------------------------------')
print(f"Similar words to 'gold' (custom model):")
for word, similarity in custom_similar_words:
    print(f"{word}: {similarity:.4f}")


Similar words to 'gold' (pretrained model):
silver: 0.9498
bronze: 0.8349
diamond: 0.7715
medal: 0.7672
medals: 0.7655
golds: 0.7161
medalist: 0.7153
olympic: 0.7142
golden: 0.7052
platinum: 0.6959
------------------------------------------
Similar words to 'gold' (custom model):
mesh: 0.7580
coins: 0.7568
cups: 0.7360
chalk: 0.7305
ivory: 0.7283
largescale: 0.7264
furniture: 0.7251
silver: 0.7169
paint: 0.7160
rainbow: 0.7155


Unsurprisingly, the pretrained model returns more relevant neighbors than our custom model. The pretrained GloVe vectors were trained on Wikipedia 2014 and Gigaword 5, a corpus of about 6 billion tokens. In contrast, our custom model relies on a much smaller corpus of only 0.17 gigabytes, which yields about 1.3 million words after preprocessing. This large difference in the amount of training data accounts for much of the disparity in quality.
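
One rough way to put a number on the difference is to measure how much the two top-10 lists overlap. The sketch below copies the neighbor lists from the Step 3 output and computes their Jaccard index. The choice of measure and the helper name `jaccard` are assumptions made for illustration, and overlap with a pretrained model is only a crude proxy for embedding quality.

```python
# Neighbor lists copied from the Step 3 output above.
pretrained_neighbors = ["silver", "bronze", "diamond", "medal", "medals",
                        "golds", "medalist", "olympic", "golden", "platinum"]
custom_neighbors = ["mesh", "coins", "cups", "chalk", "ivory",
                    "largescale", "furniture", "silver", "paint", "rainbow"]

def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

shared = sorted(set(pretrained_neighbors) & set(custom_neighbors))
print("Shared neighbors:", shared)
print(f"Jaccard overlap: {jaccard(pretrained_neighbors, custom_neighbors):.3f}")
```

Only `silver` appears in both lists, so the overlap is 1 out of 19 distinct words.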

>**Note**: The pretrained model is loaded as a `KeyedVectors` object, which maps each unique word in the pretrained vocabulary to its embedding vector. Because it stores only the vectors and not the full training state, it is memory-efficient and well suited to querying, which is why we call `pretrained_model.most_similar(...)` directly instead of going through `.wv`. For more details on `KeyedVectors` and its functionality, see the official documentation at https://radimrehurek.com/gensim/models/keyedvectors.html.

## Example Application of Custom Embedding Models

If you have specific knowledge that is not included in the default GPT-4 model (such as information about cities on Mars), you can make it available through a custom embedding model. First, embed all the relevant facts about Mars. When a question arrives, embed it the same way and compute its similarity to every embedded fact. Then extract the most similar facts from the database and build a prompt from them, truncating it to fit within the limited size of the prompt window. Finally, ask GPT-4 to answer the question using the facts in the prompt. This approach is likely the most effective way to extract specific knowledge. Fine-tuning may not yield results as specific to your facts, because your data has to compete with everything already encoded in GPT-4's weights, which were trained on internet data and may not incorporate your specific knowledge.


In [None]:
import numpy as np
from gensim.models import Word2Vec

# Step 1: Load the custom Word2Vec embedding model trained on Google News after 2021
custom_embedding_model_path = 'custom_word2vec_model.bin'  # Path to custom Word2Vec model

custom_model = Word2Vec.load(custom_embedding_model_path)

# Step 2: Process and embed the incoming question
incoming_question = "Who won the FIFA world cup in Qatar 2022"  # Example incoming question

# Preprocess the question with the same steps used for the training corpus
# (preprocess_text is defined in Step 1; replace with your own preprocessing method)
preprocessed_question = preprocess_text(incoming_question)
# Word2Vec embeds single words, so average the vectors of the in-vocabulary words
question_words = [word for word in preprocessed_question.split() if word in custom_model.wv]
if not question_words:
    raise ValueError("None of the question's words are in the model vocabulary")
embedded_question = np.mean([custom_model.wv[word] for word in question_words], axis=0)

# Step 3: Retrieve and correlate facts from the custom model based on the embedded question
correlated_facts = []
top_facts_threshold = 0.8  # Example threshold for top correlations

# Iterate through the custom model's vocabulary (each "fact" here is a single word)
for fact in custom_model.wv.index_to_key:
    # Embed the fact
    embedded_fact = custom_model.wv[fact]
    # Cosine similarity between the embedded question and the embedded fact
    correlation = np.dot(embedded_question, embedded_fact) / (np.linalg.norm(embedded_question) * np.linalg.norm(embedded_fact))
    # Check if the correlation is above the threshold
    if correlation > top_facts_threshold:
        correlated_facts.append(fact)

# Step 4: Formulate a prompt using the top correlated facts and truncate if necessary
prompt = "FIFA world cup 2022: " + ", ".join(correlated_facts)
prompt = prompt[:512]  # Example truncation to fit within the prompt window

# Step 5: Ask GPT-4 to answer the question based on the prompt
# gpt4_prompt is a placeholder: replace it with your own function that sends the prompt to GPT-4
answer = gpt4_prompt(prompt)

# Use the answer as desired
print("Answer:", answer)
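
The cell above scores individual vocabulary words, whereas the facts you want to retrieve are usually whole sentences. The sketch below shows one common way to handle that, under loudly stated assumptions: `toy_vectors` is a made-up 3-dimensional stand-in for a trained model's word vectors, a sentence is embedded as the average of its known word vectors, and the helper names (`embed`, `cosine`, `top_facts`) are hypothetical. It uses only the standard library so that it runs without a trained model.

```python
import math

# Assumption: toy 3-dimensional word vectors standing in for a trained model's `wv`.
toy_vectors = {
    "argentina": [0.9, 0.1, 0.0],
    "won":       [0.7, 0.3, 0.1],
    "world":     [0.8, 0.2, 0.1],
    "cup":       [0.8, 0.1, 0.2],
    "paris":     [0.1, 0.9, 0.1],
    "capital":   [0.0, 0.8, 0.3],
    "france":    [0.2, 0.9, 0.0],
}

def embed(text):
    """Average the vectors of the known words in `text`; None if no word is known."""
    known = [toy_vectors[w] for w in text.lower().split() if w in toy_vectors]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_facts(question, facts, k=1):
    """Return the k facts whose embeddings are most similar to the question."""
    q = embed(question)
    scored = []
    for fact in facts:
        f = embed(fact)
        if q is not None and f is not None:
            scored.append((cosine(q, f), fact))
    scored.sort(reverse=True)  # Highest similarity first.
    return [fact for _, fact in scored[:k]]

facts = ["Argentina won the world cup", "Paris is the capital of France"]
question = "Who won the world cup"
context = top_facts(question, facts, k=1)
prompt = "Context: " + " ".join(context) + "\nQuestion: " + question
print(prompt)
```

Selecting the top `k` facts instead of applying a fixed similarity threshold guarantees that the prompt always receives some context, at the cost of occasionally including a weak match.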


You can find more details in the OpenAI Cookbook on GitHub: https://github.com/openai/openai-cookbook/blob/main/examples/Recommendation_using_embeddings.ipynb