This notebook loads, cleans, and tokenizes the data to produce a vocabulary, a word-to-id mapping, and an id-to-word mapping.
* It also includes a code block that calculates the percentage overlap between two vocabularies.
* Here we compare the MS MARCO v1.1 vocabulary with the vocabulary of the pretrained GloVe vectors (converted to word2vec format).
* At the end of the notebook, we generate a training dataset for the word2vec CBOW model.

### LOAD THE DATA 

In [1]:
# Prepare the data
from datasets import load_dataset
import pandas as pd


# Download the dataset from Hugging Face
# (without streaming=True, load_dataset fetches and caches every split locally)
ds_marco = load_dataset("microsoft/ms_marco", "v1.1")

# Check that the dataset is loaded correctly
ds_marco.keys()
print("train dataset: ", ds_marco['train'])
print("validation dataset: ", ds_marco['validation'])
print("test dataset: ", ds_marco['test'])

train dataset:  Dataset({
    features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
    num_rows: 82326
})
validation dataset:  Dataset({
    features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
    num_rows: 10047
})
test dataset:  Dataset({
    features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
    num_rows: 9650
})


In [8]:
# Turn the dataset into a pandas dataframe
train = pd.DataFrame(ds_marco['train'])
test = pd.DataFrame(ds_marco['test'])
validation = pd.DataFrame(ds_marco['validation'])

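# Combine the train, validation and test datasets
# (ignore_index=True gives the combined frame one continuous index instead of repeating each split's labels)
ds_marco_all = pd.concat([train, validation, test], ignore_index=True)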

# Save as a parquet file
ds_marco_all.to_parquet("../data/marco_all.parquet")

In [10]:
ds_marco_all.head()

Unnamed: 0,answers,passages,query,query_id,query_type,wellFormedAnswers
0,[Results-Based Accountability is a disciplined...,"{'is_selected': [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]...",what is rba,19699,description,[]
1,[Yes],"{'is_selected': [0, 1, 0, 0, 0, 0, 0], 'passag...",was ronald reagan a democrat,19700,description,[]
2,[20-25 minutes],"{'is_selected': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]...",how long do you need for sydney and surroundin...,19701,numeric,[]
3,[$11 to $22 per square foot],"{'is_selected': [0, 0, 0, 0, 0, 0, 0, 0, 1], '...",price to install tile in shower,19702,numeric,[]
4,[Due to symptoms in the body],"{'is_selected': [0, 0, 1, 0, 0, 0, 0, 0], 'pas...",why conversion observed in body,19703,description,[]


In [25]:
# Join all queries into one large string
query_list = ds_marco_all['query']
all_queries = " ".join(query_list)

# Join the passages into one large string
all_passages = []
passages_list = ds_marco_all['passages']

# NOTE: only the first 5 rows are processed here (debug preview);
# remove the [:5] slice and the print to include every passage
for passage in passages_list[:5]:
    print(passage['passage_text'])
    passage_list = ' '.join(passage['passage_text'])
    all_passages.append(passage_list)

all_passages = " ".join(all_passages)


all_text_dsmarco = all_queries + " " + all_passages

all_text_dsmarco[-10000:]
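
The loop in the cell above joins the `passage_text` lists of the first five rows only. As a minimal sketch of how the same flattening could be applied to every row, here is a small helper run on toy rows shaped like the `passages` column. The name `flatten_passages` and the sample data are illustrative assumptions, not part of the original notebook.

```python
def flatten_passages(passages_column):
    """Join every passage_text list in the column into a single string."""
    # One string per row first, then one string for the whole column
    per_row = (" ".join(row["passage_text"]) for row in passages_column)
    return " ".join(per_row)


# Toy rows with the same structure as ds_marco_all['passages'] (made-up text)
toy_passages = [
    {"is_selected": [0, 1], "passage_text": ["first passage", "second passage"]},
    {"is_selected": [1], "passage_text": ["third passage"]},
]

toy_text = flatten_passages(toy_passages)
```

On the real data this would be called as `flatten_passages(ds_marco_all['passages'])`, which produces a much larger string than the five-row preview.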


["Since 2007, the RBA's outstanding reputation has been affected by the 'Securency' or NPA scandal. These RBA subsidiaries were involved in bribing overseas officials so that Australia might win lucrative note-printing contracts. The assets of the bank include the gold and foreign exchange reserves of Australia, which is estimated to have a net worth of A$101 billion. Nearly 94% of the RBA's employees work at its headquarters in Sydney, New South Wales and at the Business Resumption Site.", "The Reserve Bank of Australia (RBA) came into being on 14 January 1960 as Australia 's central bank and banknote issuing authority, when the Reserve Bank Act 1959 removed the central banking functions from the Commonwealth Bank. The assets of the bank include the gold and foreign exchange reserves of Australia, which is estimated to have a net worth of A$101 billion. Nearly 94% of the RBA's employees work at its headquarters in Sydney, New South Wales and at the Business Resumption Site.", 'RBA Rec

"Plateau to the south. The Sydney Statistical Division, used for census data, is the unofficial metropolitan area and covers 12,145 km² (4,689 mi²). This area includes the Central Coast and Blue Mountains as well as broad swathes of national park and other non-urban land. This itinerary will have you crossing the country to take in the Great Barrier Reef, Australia’s iconic reef in Queensland, before heading to Western Australia to see the breath taking Ningaloo Reef, one of Australia’s best kept secrets. View more information. 10 day-Sydney, rock and reef. It’s easy to see why Hamilton Island is one of the most popular spots for a getaway on the Great Barrier Reef. With palm-fringed beaches, top restaurants and stylish resorts, there’s plenty to do on land, while those keen to explore the clear waters of the Whitsundays will be richly rewarded. The Sydney central business district, Sydney harbour and outer suburbs from the West. North Sydney 's commercial district. The extensive area 

In [40]:
# In this section we take the vocab from the pretrained GloVe vectors (converted to word2vec format)

import os
import requests
import zipfile
import io
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors


In [42]:
# Create directory for downloaded files
os.makedirs('downloaded_model', exist_ok=True)

# Download glove model
url = 'https://nlp.stanford.edu/data/glove.6B.zip'
response = requests.get(url)
if response.status_code == 200:
    z = zipfile.ZipFile(io.BytesIO(response.content))
    z.extractall("downloaded_model")
else:
    print(f"Failed to download glove model. Status code: {response.status_code}")



In [43]:
# Convert to word2vec format
glove_input_file = 'downloaded_model/glove.6B.100d.txt'
word2vec_output_file = 'downloaded_model/glove.6B.100d.word2vec.txt'

print("Converting GloVe format to Word2Vec format...")
glove2word2vec(glove_input_file, word2vec_output_file)


Converting GloVe format to Word2Vec format...


  glove2word2vec(glove_input_file, word2vec_output_file)


(400000, 100)
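
The two text formats differ only in their first line: a word2vec text file starts with a `vocab_size vector_size` header, while a GloVe file goes straight to the vectors. The `(400000, 100)` returned above is that header pair. The sketch below shows the idea in plain Python on a tiny made-up file. `add_word2vec_header` is a hypothetical helper written for illustration, not gensim's implementation.

```python
import os
import tempfile


def add_word2vec_header(glove_path, output_path):
    """Copy a GloVe text file, prepending the 'vocab_size vector_size' header."""
    with open(glove_path, encoding="utf-8") as f:
        lines = f.readlines()
    num_vectors = len(lines)
    # Each line is "word v1 v2 ... vd", so the dimension is the field count minus one
    dim = len(lines[0].rstrip().split(" ")) - 1
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(f"{num_vectors} {dim}\n")
        f.writelines(lines)
    return num_vectors, dim


# Tiny made-up GloVe-style file with two 3-dimensional vectors
tmp_dir = tempfile.mkdtemp()
toy_glove = os.path.join(tmp_dir, "toy_glove.txt")
toy_w2v = os.path.join(tmp_dir, "toy_w2v.txt")
with open(toy_glove, "w", encoding="utf-8") as f:
    f.write("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")

shape = add_word2vec_header(toy_glove, toy_w2v)
with open(toy_w2v, encoding="utf-8") as f:
    header = f.readline().strip()
```

This sketch reads the whole file into memory, which is fine for a toy file but wasteful for the full GloVe download. Recent gensim releases (4.x) also appear to accept GloVe files directly through `KeyedVectors.load_word2vec_format(path, binary=False, no_header=True)`, which would make the conversion step unnecessary; check the installed version before relying on it.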

In [46]:
# Load the model and report its vocabulary size
print("Loading the model...")
model = KeyedVectors.load_word2vec_format(word2vec_output_file)
print(f"Model loaded successfully with {len(model.key_to_index)} words")


Loading the model...
Model loaded successfully with 400000 words


In [47]:
# Extract the vocab from the model
glove_vocab = set(model.key_to_index.keys())

# Print the length of the vocab
print(f"Vocab length: {len(glove_vocab)}")

# Save the vocab to a file
with open('glove_vocab.txt', 'w') as f:
    for word in glove_vocab:
        f.write(word + '\n')


Vocab length: 400000


UNCOMMENT THIS CODE BLOCK IF YOU WANT TO COMBINE WITH TEXT 8 ⬇️

### THIS SECTION RUNS THE TOKENIZER
* Create a new vocabulary, corpus, word to id and id to word mapping from this combined dataset

In [87]:
import re
from collections import Counter

def tokenize_text(text):
    # remove punctuation and digits, then lowercase and split on single spaces
    remove_punctuation = re.sub(r'[^\w\s]', '', text)
    remove_numbers = re.sub(r'\d+', '', remove_punctuation)
    lower_case_words = remove_numbers.lower()
    words = lower_case_words.split(' ')

    # print the total number of tokens after splitting
    print(f"Total number of words in the text: {len(words)}")

    # print number of unique words before filtering
    print(f"The vocabulary before filtering: {len(set(words))}")

    word_counts = Counter(words)
    #print("Check first 10 words in word_counts: ", dict(list(word_counts.items())[:10]))
    
    # Check how often a word like "rba" appears
    print("Check how often a word like 'rba' appears: ", word_counts["rba"])

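    # Optional min-count filter (disabled): to drop rare words, keep only
    # the words that appear at least 5 times, e.g.
    #   vocab = [word for word, count in word_counts.items() if count >= 5]
    # vocab is the list of unique words, in order of first appearance
    vocab = list(word_counts.keys())
    word_to_id = {word: i for i, word in enumerate(vocab)}
    id_to_word = {i: word for i, word in enumerate(vocab)}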
    word_to_id = {word: i for i, word in enumerate(vocab)}
    id_to_word = {i: word for i, word in enumerate(vocab)}

    # Sum the counts of all words in the text before filtering
    total_count_of_words = sum(word_counts.values())
    #print(f"Total number of words in the text before filtering: {total_count_of_words}")

    # this line checks if your counter is working properly
    print(f"The vocabulary has {len(vocab)} words")

    # Get the sum of counts for the words that appear in the vocab
    # (a set makes each membership test O(1) instead of scanning the list)
    vocab_set = set(vocab)
    sum_of_words_in_vocab = sum(count for word, count in word_counts.items() if word in vocab_set)
    print(f"The sum of words that appear in the vocab: {sum_of_words_in_vocab}")

    
    # Optional: Show what percentage of words the vocab represents
    #percentage = (sum_of_words_in_vocab / total_count_of_words) * 100
    #print(f"This represents {percentage:.2f}% of all words in the corpus")

    return word_to_id, id_to_word, vocab, word_counts

def generate_corpus(text, vocab):
    # remove punctuation and digits, then lowercase and split on single spaces
    remove_punctuation = re.sub(r'[^\w\s]', '', text)
    remove_numbers = re.sub(r'\d+', '', remove_punctuation)
    lower_case_words = remove_numbers.lower()
    words = lower_case_words.split(' ')

    # keep only the words that are in the vocabulary
    # (a set makes each membership test O(1) instead of scanning the list)
    vocab_set = set(vocab)
    corpus = [word for word in words if word in vocab_set]
    print("corpus length:", len(corpus))

    return corpus
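
To make the cleaning steps concrete, here is a minimal sketch of the same pipeline on a toy sentence, with the min-count filter switched on. `build_vocab`, its `min_count` parameter, and the sample text are assumptions made for illustration. Unlike the functions above, it splits with `split()` rather than `split(' ')`, so runs of whitespace do not produce empty tokens.

```python
import re
from collections import Counter


def build_vocab(text, min_count=1):
    """Strip punctuation and digits, lowercase, then keep words seen at least min_count times."""
    cleaned = re.sub(r"[^\w\s]", "", text)
    cleaned = re.sub(r"\d+", "", cleaned).lower()
    words = cleaned.split()  # splits on any whitespace and drops empty tokens
    counts = Counter(words)
    vocab = [word for word, count in counts.items() if count >= min_count]
    word_to_id = {word: i for i, word in enumerate(vocab)}
    id_to_word = {i: word for word, i in word_to_id.items()}
    return word_to_id, id_to_word, vocab, counts


toy_text = "What is RBA? RBA is the Reserve Bank, founded in 1960."
toy_word_to_id, toy_id_to_word, toy_vocab, toy_counts = build_vocab(toy_text, min_count=2)
print(toy_vocab)  # ['is', 'rba']
```

Only `is` and `rba` occur twice in the toy sentence, so they are the only words that survive `min_count=2`.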

Usage

In [88]:
# First get the vocabulary and mappings ms marco
word_to_id, id_to_word, vocab, word_counts = tokenize_text(all_text_dsmarco)

# Then generate the corpus using the vocabulary
# (uncomment before running the save and CBOW cells below, which need `corpus`)
#corpus = generate_corpus(all_text_dsmarco, vocab)

Total number of words in the text: 619952
The vocabulary before filtering: 40639
Check how often a word like 'rba' appears:  27
The vocabulary has 40639 words
The sum of words that appear in the vocab: 619952


In [86]:
import csv

# Collect (word, count) pairs for every word in the vocab
vocab_set = set(vocab)
vocab_counter = [(word, count) for word, count in word_counts.items() if word in vocab_set]

# sort the vocab by count, most frequent first
vocab_counter.sort(key=lambda x: x[1], reverse=True)

# save the (word, count) pairs as csv; csv.writer quotes any token
# that contains a comma, quote, or newline
with open("../data/vocab.csv", "w", newline="") as f:
    csv.writer(f, lineterminator="\n").writerows(vocab_counter)



In [73]:
# print the first 10 words in the vocab
print(vocab[:10])

# Print number of unique words in the vocab
print(f"Number of unique words in the vocab: {len(vocab)}")


['what', 'is', 'rba', 'was', 'ronald', 'reagan', 'a', 'democrat', 'how', 'long']
Number of unique words in the vocab: 7686


Get overlap in vocab

In [91]:
# Compare similarity between the glove vocab and the ms marco vocab
# Get the intersection of the vocab
intersection = set(glove_vocab) & set(vocab)
print("Number of overlapping words: ", len(intersection))

# What percentage of the ms marco vocab is in the glove vocab
percentage_overlap = round(len(intersection) / len(vocab) * 100, 2)
print(f"Percentage of ms marco vocab in glove vocab: {percentage_overlap}%")


Number of overlapping words:  34843
Percentage of ms marco vocab in glove vocab: 85.74%
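
The 85.74% figure is 34843 / 40639, so its denominator is the MS MARCO vocabulary. Dividing the same intersection by the 400000 GloVe words instead gives roughly 8.71%, so the direction of the comparison matters. A minimal sketch that reports both directions on toy vocabularies (`vocab_overlap` and the word lists are illustrative assumptions):

```python
def vocab_overlap(vocab_a, vocab_b):
    """Return the shared word count and the coverage in both directions."""
    set_a, set_b = set(vocab_a), set(vocab_b)
    shared = set_a & set_b
    return {
        "shared": len(shared),
        "pct_of_a_in_b": round(len(shared) / len(set_a) * 100, 2),
        "pct_of_b_in_a": round(len(shared) / len(set_b) * 100, 2),
    }


# Toy vocabularies: a small corpus vocab and a larger pretrained vocab
toy_a = ["what", "is", "rba", "reagan"]
toy_b = ["what", "is", "the", "of", "and", "reagan", "bank", "a"]

stats = vocab_overlap(toy_a, toy_b)
print(stats)  # {'shared': 3, 'pct_of_a_in_b': 75.0, 'pct_of_b_in_a': 37.5}
```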


In [10]:
# save word_to_id, id_to_word, vocab, and corpus as .pt files
import torch
import json

# save word_to_id
torch.save(word_to_id, "./data/word_to_id.pt")



# save id_to_word
torch.save(id_to_word, "./data/id_to_word.pt")

# also save id_to_word as a json file
# (note: JSON stores the integer ids as string keys)
with open("./data/id_to_word.json", "w") as f:
    json.dump(id_to_word, f)

# save vocab
torch.save(vocab, "./data/vocab.pt")

# save corpus
torch.save(corpus, "./data/corpus.pt")
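
One thing to watch when reloading `id_to_word.json`: JSON object keys are always strings, so the integer ids come back as `"0"`, `"1"`, and so on. A minimal sketch of the round trip on a toy mapping (the variable names are illustrative):

```python
import json

toy_id_to_word = {0: "what", 1: "is", 2: "rba"}

# Serialising turns the integer keys into strings
encoded = json.dumps(toy_id_to_word)
decoded = json.loads(encoded)

# Convert the keys back to int to recover the original mapping
restored = {int(key): word for key, word in decoded.items()}
```

Here `decoded` has the keys `"0"`, `"1"`, `"2"`, while `restored` is equal to the original dictionary. The `.pt` files saved with `torch.save` keep the integer keys unchanged, so this conversion is only needed for the JSON copy.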



In this section we generate the CBOW training data

In [11]:
# generate the CBOW training data 
def generate_training_data(corpus):
    data = []

    # start from index 2 and end 2 positions before the last word
    # this ensures we always have 2 words before and after the target word
    # for a 5-len sliding window
    for i in range(2, len(corpus) - 2):
        # Get the context words
        # 'i' is the index of the target word
        # [i-2:i] gets the two words before the target word
        # [i+1:i+3] gets the two words after the target word
        context_words = corpus[i-2:i] + corpus[i+1:i+3]

        # Get the target word
        target_word = corpus[i]

        # Append the tuple to the data list
        data.append((context_words, target_word))

    return data
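
To see what the sliding window produces, here is a minimal sketch on a six-word toy corpus. It generalises the fixed window of 2 to a `window` parameter and also maps each pair to ids, which is the form a model would typically consume. `cbow_pairs` and the toy data are assumptions made for illustration.

```python
def cbow_pairs(corpus, window=2):
    """Return (context_words, target_word) pairs with `window` words on each side."""
    pairs = []
    for i in range(window, len(corpus) - window):
        context = corpus[i - window:i] + corpus[i + 1:i + window + 1]
        pairs.append((context, corpus[i]))
    return pairs


toy_corpus = ["what", "is", "rba", "was", "ronald", "reagan"]
toy_pairs = cbow_pairs(toy_corpus)
# [(['what', 'is', 'was', 'ronald'], 'rba'),
#  (['is', 'rba', 'ronald', 'reagan'], 'was')]

# Map words to ids, as the word_to_id dictionary would
toy_word_to_id = {word: i for i, word in enumerate(toy_corpus)}
toy_id_pairs = [
    ([toy_word_to_id[w] for w in context], toy_word_to_id[target])
    for context, target in toy_pairs
]
# [([0, 1, 3, 4], 2), ([1, 2, 4, 5], 3)]
```

A corpus of n words yields n - 2 * window pairs, so the first and last `window` words never appear as targets.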


In [12]:
training_data = generate_training_data(corpus)

# save training data as .pt file
torch.save(training_data, "./data/training_data.pt")



In [14]:
# save training data as jsonl file

import json 

with open("./data/training_data.jsonl", "w") as f:
    for context, target in training_data:
        json.dump([context, target], f)
        f.write('\n')
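
Each line of the JSONL file is an independent JSON array of the form `[context, target]`, so it can be read back one line at a time. A minimal sketch of the round trip using a temporary file and made-up pairs (the path and data are illustrative):

```python
import json
import os
import tempfile

toy_training_data = [
    (["what", "is", "was", "ronald"], "rba"),
    (["is", "rba", "ronald", "reagan"], "was"),
]

toy_path = os.path.join(tempfile.mkdtemp(), "toy_training_data.jsonl")

# Write one [context, target] pair per line
with open(toy_path, "w") as f:
    for context, target in toy_training_data:
        f.write(json.dumps([context, target]) + "\n")

# Read the pairs back line by line
loaded = []
with open(toy_path) as f:
    for line in f:
        context, target = json.loads(line)
        loaded.append((context, target))
```

Reading line by line keeps memory use flat, which helps once the training data is built from the full corpus rather than a toy list.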
