## MOSAIC PS-1 (MASKED LANGUAGE MODEL DESIGNING)

### NLP pipeline followed

#### 1. Text preprocessing of the training and test datasets:
- Lowercasing
- Removal of punctuation marks
- Tokenization into words

#### 2. Use of Google News (300D) Word2Vec embeddings:
- Chosen for better semantic understanding, wide coverage and enhanced generalization
- Loaded with gensim

#### 3. Masking different parts of speech in the training data:
The training set of about 50,000 sentences is split into five groups of 10,000, and one part of speech is masked per group: adjectives in the first 10,000 sentences, verbs in the next 10,000, then adverbs, then nouns, and determiners in the last 10,000.

- Fallback parts of speech (masked when a sentence lacks its designated one): prepositions, auxiliaries, pronouns and conjunctions

#### 4. Training a bidirectional LSTM model with a combined loss:
A bidirectional LSTM is used so that the context on both the left and the right of the masked word is taken into account.
The model is trained and evaluated with a combined loss that accounts for both the distance between the predicted embedding and the original word's embedding in the embedding space (mean squared error) and their cosine similarity, which captures semantic closeness. The contribution of each term to the net loss is set by the hyperparameters alpha and beta, tuned according to the model's performance on the training and cross-validation sets.

#### 5. Final compilation of the predicted words into a CSV file.

### Text pre-processing
All sentences in the data frame are lowercased.
Punctuation is then removed from all sentences in the data frame.

In [1]:
import pandas as pd
import string

"""Reading the csv dataset into pandas dataframe and applying lowercasing"""

train_df=pd.read_csv("train_set_f.csv") #Loading training data csv file as pandas dataframe
train_df['SENTENCES']=train_df['SENTENCES'].str.lower() #Converting all sentences in the training data frame to lower case

"""Removing punctuation from the sentences in the dataframe"""

exclude=string.punctuation #Flexible list of punctuation in python as defined in the in-built string module
def remove_punc(text): #Function to remove punctuation symbols(as mentioned in exclude) from the text passed as parameter
    return text.translate(str.maketrans("","",exclude))
train_df['SENTENCES']=train_df['SENTENCES'].apply(remove_punc) #The function remove_punc is applied to every value in the SENTENCES series in the dataframe

train_df.head()

| Unnamed: 0 | IDS | SENTENCES |
|---|---|---|
| 0 | 1 | the capitals inaugural season was dreadful eve... |
| 1 | 2 | there have been a few unsubstantiated reports ... |
| 2 | 3 | after the war he served in various positions i... |
| 3 | 4 | she was a dancer singer and actress long befor... |
| 4 | 5 | his son christopher o |


***Tokenization using spacy***

In [3]:
import spacy
"""Loading the English pipeline of spaCy and generating the spaCy Doc object representing
   the processed text after it has passed through the nlp pipeline. Thereafter extracting all the tokens separately
   and storing them in a new Series in the pandas dataframe"""


nlp=spacy.load("en_core_web_sm")
def token_generate(text):
    spacy_doc=nlp(text) #Spacy doc object that represents the text which has been processed after undergoing the nlp pipeline
    return [token.text for token in spacy_doc] #extracting each token separately

train_df['TOKENS']=train_df['SENTENCES'].apply(token_generate) #creating a new series in the original dataframe to store the list of tokens corresponding to every sentence

train_df.head()

| Unnamed: 0 | IDS | SENTENCES | TOKENS |
|---|---|---|---|
| 0 | 1 | the capitals inaugural season was dreadful eve... | [the, capitals, inaugural, season, was, dreadf... |
| 1 | 2 | there have been a few unsubstantiated reports ... | [there, have, been, a, few, unsubstantiated, r... |
| 2 | 3 | after the war he served in various positions i... | [after, the, war, he, served, in, various, pos... |
| 3 | 4 | she was a dancer singer and actress long befor... | [she, was, a, dancer, singer, and, actress, lo... |
| 4 | 5 | his son christopher o | [his, son, christopher, o] |


#### Word Embeddings:
The pre-processed text must now be vectorized, giving each word a meaningful numerical representation that retains the semantics and context of the sentence. Since deep learning models operate on numbers, this vectorization is indispensable. To ensure wide coverage, better semantic understanding and enhanced generalization, pre-trained ***Google News (300D) Word2Vec embeddings*** are loaded and used with the help of the ***gensim*** library.
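Because words missing from the embedding vocabulary are later replaced by zero vectors, it can be useful to check how much of the tokenized text is actually covered. The following is a minimal, dependency-free sketch of such a check; `coverage`, `toy_vocabulary` and `sentences` are hypothetical names used only for illustration. With the real model, the loaded `KeyedVectors` object could be passed in place of the toy set, since it supports the same `word in model` membership test used elsewhere in this notebook.

```python
def coverage(token_lists, vocabulary):
    """Return the fraction of tokens that have an entry in `vocabulary`."""
    total = 0
    known = 0
    for tokens in token_lists:
        for token in tokens:
            total += 1
            if token in vocabulary:
                known += 1
    return known / total if total else 0.0


# Hypothetical stand-ins for the Word2Vec vocabulary and the TOKENS column
toy_vocabulary = {"the", "season", "was", "dreadful"}
sentences = [["the", "capitals", "season"], ["was", "dreadful"]]

print(coverage(sentences, toy_vocabulary))  # 4 of 5 tokens are known -> 0.8
```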

In [5]:
from gensim.models import KeyedVectors
"""
Loading the pre-trained Google News word2vec model
The pre-trained model file (GoogleNews-vectors-negative300.bin) has been downloaded in the same directory
"""

google_news_model_path = "GoogleNews-vectors-negative300.bin"  # Path to the pre-trained Google News embeddings file
w2v_model = KeyedVectors.load_word2vec_format(google_news_model_path, binary=True)

# Checking similarity between two words, by example, 'king' and 'queen'
similarity = w2v_model.similarity('king', 'queen')
print(f"Similarity between 'king' and 'queen': {similarity}")

#Getting the most similar words to a given specific word
similar_words = w2v_model.most_similar('king', topn=5)
print(f"Words similar to 'king': {similar_words}")


Similarity between 'king' and 'queen': 0.6510956883430481
Words similar to 'king': [('kings', 0.7138045430183411), ('queen', 0.6510957479476929), ('monarch', 0.6413194537162781), ('crown_prince', 0.6204220056533813), ('prince', 0.6159993410110474)]


#### Masking different Parts of Speech in the Training dataset
The primary approach to training the BiLSTM model is to ***mimic how a human fills in a blank in a sentence: the model is trained on the positions of individual parts of speech (POS) so that, for an unseen sentence with a masked word, it can judge which part of speech, in which form, fits best***. To this end the dataset is divided into 5 groups of 10,000 sentences, and one primary POS (adjective, verb, adverb, noun or determiner) is masked in each group, in the order given in the list. (The order was chosen by trial and error, interchanging the positions of the POS categories to make maximum use of the training data.) If a sentence does not contain its designated POS, a fallback POS is masked instead, chosen from a list in fixed order of priority, so that as few sentences as possible are discarded. Exactly one word is masked per sentence. The masked sentences are finally stored as embedding sequences in a NumPy array x_train, and the embeddings of the masked words in y_train.
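The selection logic described above can be sketched on pre-tagged tokens, without spaCy or the embedding file. This is a simplified illustration under stated assumptions, not the notebook's implementation: `pos_for_sentence` and `choose_mask_index` are hypothetical helpers, the toy sentence is tagged by hand, and the check that the word exists in Word2Vec is left out.

```python
MAIN_POS = ["ADJ", "VERB", "ADV", "NOUN", "DET"]
FALLBACK_POS = ["ADP", "AUX", "PRON", "CCONJ", "SCONJ"]


def pos_for_sentence(index, total, categories=MAIN_POS):
    """Return the POS category assigned to the sentence at position `index`."""
    group_size = max(total // len(categories), 1)
    # Clamp so leftover sentences (total not divisible by 5) use the last category
    return categories[min(index // group_size, len(categories) - 1)]


def choose_mask_index(tagged_tokens, primary_pos, fallback_order=FALLBACK_POS):
    """Return the index of the token to mask, or None if nothing qualifies."""
    # First occurrence of the designated POS wins
    for i, (_, pos) in enumerate(tagged_tokens):
        if pos == primary_pos:
            return i
    # Otherwise try the fallback categories in priority order
    for fallback in fallback_order:
        for i, (_, pos) in enumerate(tagged_tokens):
            if pos == fallback:
                return i
    return None


# Hand-tagged toy sentence (assumed tags, for illustration only)
sentence = [("she", "PRON"), ("was", "AUX"), ("a", "DET"), ("dancer", "NOUN")]

print(pos_for_sentence(0, 50000))           # ADJ
print(choose_mask_index(sentence, "NOUN"))  # 3
print(choose_mask_index(sentence, "ADJ"))   # 1 (no adjective, so AUX is the first fallback found)
```

A sentence for which the function returns None corresponds to a skipped sentence in the cell below.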

In [7]:
import numpy as np
import spacy
from gensim.models import KeyedVectors
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Loading model from spacy
nlp = spacy.load("en_core_web_sm")
word2vec = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# Manual assignment of a specialized embedding indicating the presence of a masked token
mask_embedding = np.full((300,), 9.99)

main_pos_categories = ["ADJ", "VERB", "ADV", "NOUN", "DET"] #Primary POS categories to be masked
fallback_categories = ["ADP","AUX", "PRON", "CCONJ", "SCONJ"] #Fallback POS categories, masked in case a sentence does not contain the main pos category, for better utilization of training data

num_sentences = len(train_df) #Total number of Sentences
num_sentences_per_group = num_sentences // len(main_pos_categories) # 10,000 sentences for each category(50,000 sentences in total in the training dataset)

# Function to mask a sentence
def mask_sentence(tokens, pos_to_mask):
    doc = nlp(" ".join(tokens))
    sentence_embedding = []
    target_embedding = None
    masked = False #Ensuring exactly one word is masked per sentence

    fallback_candidates = {pos: None for pos in fallback_categories}  # Store fallback word position

    for idx, token in enumerate(doc):
        word, pos = token.text, token.pos_

        if not masked and pos == pos_to_mask and word in word2vec:
            sentence_embedding.append(mask_embedding)
            target_embedding = word2vec[word]
            masked = True
        else:
            sentence_embedding.append(word2vec[word] if word in word2vec else np.zeros((300,)))

        # Store fallback words if needed
        if pos in fallback_candidates and word in word2vec and fallback_candidates[pos] is None:
            fallback_candidates[pos] = (idx, word)

    # Use fallback if no primary POS was found
    if not masked:
        for fallback_pos in fallback_categories:
            if fallback_candidates[fallback_pos]:  
                idx, fallback_word = fallback_candidates[fallback_pos]
                sentence_embedding[idx] = mask_embedding
                target_embedding = word2vec[fallback_word]
                masked = True
                break  

    return (sentence_embedding, target_embedding) if masked else (None, None)


# Process dataset
x_train, y_train = [], []
num_skipped = 0

for i, tokens in enumerate(train_df["TOKENS"]):
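    # Assign POS category; the index is clamped so that leftover sentences (dataset size not a multiple of 5) fall into the last group
    pos_to_mask = main_pos_categories[min(i // num_sentences_per_group, len(main_pos_categories) - 1)]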
    sentence_emb, target_emb = mask_sentence(tokens, pos_to_mask)

    if sentence_emb is not None and target_emb is not None:
        x_train.append(sentence_emb)
        y_train.append(target_emb)
    else:
        num_skipped += 1

print(f"Total skipped sentences: {num_skipped}")

# Padding sequences
max_length = max(len(seq) for seq in x_train)
x_train = pad_sequences(x_train, maxlen=max_length, dtype="float32", padding="post")
y_train = np.array(y_train)

# Final shape
print(f"x_train shape: {x_train.shape}")  
print(f"y_train shape: {y_train.shape}") 

Total skipped sentences: 1461
x_train shape: (48539, 40, 300)
y_train shape: (48539, 300)


In [9]:
for i in range(5):  # Check first 5 examples
    print("X_train Sample:", x_train[i])
    print("Y_train Sample:", y_train[i])
    print("-" * 50)


X_train Sample: [[ 8.00781250e-02  1.04980469e-01  4.98046875e-02 ...  3.66210938e-03
   4.76074219e-02 -6.88476562e-02]
 [ 1.63085938e-01 -6.39648438e-02  8.64257812e-02 ...  2.56347656e-02
   1.00097656e-01  6.20117188e-02]
 [ 9.98999977e+00  9.98999977e+00  9.98999977e+00 ...  9.98999977e+00
   9.98999977e+00  9.98999977e+00]
 ...
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
   0.00000000e+00  0.00000000e+00]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
   0.00000000e+00  0.00000000e+00]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
   0.00000000e+00  0.00000000e+00]]
Y_train Sample: [ 2.18505859e-02 -7.72094727e-03 -2.74658203e-02 -7.37304688e-02
  1.25000000e-01 -5.11718750e-01 -2.55859375e-01 -1.15722656e-01
  4.58984375e-02  1.82617188e-01  4.71191406e-02 -3.41796875e-02
  1.64062500e-01  1.79443359e-02  1.16210938e-01  1.85546875e-01
  1.88476562e-01  1.33789062e-01  8.54492188e-02  2.48046875e-01
 -1.157

#### Bidirectional LSTM model to predict the masked words in sentences
A bidirectional LSTM model is used to analyze the context around the masked word, for each of the masked parts of speech, and to predict an embedding as close as possible to that of the actual masked word. ***The prediction is a raw 300D embedding produced by two BiLSTM layers with 256 and 128 units respectively (tuned according to the model's performance) followed by a linear dense layer, and the model is trained with a combined loss. The combined loss is a weighted sum of the mean squared error (MSE) and a cosine similarity loss (a custom function that returns 1 - cosine similarity, so the closer the loss is to 0, the higher the cosine similarity). The MSE term reduces the absolute distance from the actual word embedding, while the cosine term reflects the fact that synonymous words may differ in magnitude as vectors yet point in a similar direction. The prediction is therefore pushed close to the actual embedding while still accounting for synonyms.***
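To make the two terms concrete, here is a small worked example in plain Python (no TensorFlow), assuming the same weights alpha=0.7 and beta=0.3 that the notebook uses. The vectors are made-up 3D toys rather than real embeddings. It illustrates that a prediction pointing in the same direction as the target but with a different magnitude has zero cosine loss yet a non-zero MSE, which is why the two terms are combined.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error between two equal-length vectors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def cosine_loss(y_true, y_pred):
    """1 - cosine similarity; 0 means the vectors point in the same direction."""
    dot = sum(t * p for t, p in zip(y_true, y_pred))
    norm_true = math.sqrt(sum(t * t for t in y_true))
    norm_pred = math.sqrt(sum(p * p for p in y_pred))
    return 1.0 - dot / (norm_true * norm_pred)


def combined_loss(y_true, y_pred, alpha=0.7, beta=0.3):
    """Weighted sum of the two terms, mirroring the notebook's loss."""
    return alpha * mse(y_true, y_pred) + beta * cosine_loss(y_true, y_pred)


target = [1.0, 2.0, 2.0]
same_direction = [2.0, 4.0, 4.0]  # target scaled by 2
orthogonal = [2.0, -1.0, 0.0]     # dot product with target is 0

print(cosine_loss(target, same_direction))  # 0.0
print(mse(target, same_direction))          # 3.0
print(cosine_loss(target, orthogonal))      # 1.0
print(combined_loss(target, same_direction) < combined_loss(target, orthogonal))  # True
```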

In [15]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam
import tensorflow.keras.backend as K

# Define Cosine Similarity Loss
def cosine_loss(y_true, y_pred):
    y_true = tf.nn.l2_normalize(y_true, axis=-1)  # Normalize true embeddings
    y_pred = tf.nn.l2_normalize(y_pred, axis=-1)  # Normalize predicted embeddings
    return 1 - tf.reduce_mean(tf.reduce_sum(y_true * y_pred, axis=-1))  # 1 - cosine similarity

# Combined Loss Function (MSE + Cosine Loss)
def combined_loss(y_true, y_pred, alpha=0.7, beta=0.3):
    mse_loss = tf.keras.losses.MSE(y_true, y_pred)
    cos_loss = cosine_loss(y_true, y_pred)
    return alpha * mse_loss + beta * cos_loss  # Weighted combination(fine tuned as per the model's performance)


# Define BiLSTM Model
model = Sequential([
    tf.keras.Input(shape=(40, 300)),  # 40 tokens per sentence, one 300D embedding per token
    Bidirectional(LSTM(256, return_sequences=True)),  # First BiLSTM layer (returns the full sequence)
    Dropout(0.3),  # Dropout to prevent overfitting
    Bidirectional(LSTM(128, return_sequences=False)),  # Final BiLSTM (outputs last hidden state)
    Dropout(0.3),  # Dropout to prevent overfitting
    Dense(300, activation="linear")  # Predicts raw word embeddings
])

# combined_loss is passed directly; its defaults (alpha=0.7, beta=0.3) are the tuned weights
model.compile(loss=combined_loss,
              optimizer=Adam(learning_rate=0.001),
              metrics=["mse", cosine_loss])


# Train Model
history = model.fit(
    x_train, y_train,
    batch_size=64,
    epochs=11,
    validation_split=0.1
)

model.summary()
#Printing a summary of all params(trainable, non-trainable) along with the description of each layer in the biLSTM
model.save("my_model.h5")





Epoch 1/11
683/683 ━━━━━━━━━━━━━━━━━━━━ 177s 252ms/step - cosine_loss: 0.6169 - loss: 0.1977 - mse: 0.0180 - val_cosine_loss: 0.3047 - val_loss: 0.0948 - val_mse: 0.0048
Epoch 2/11
683/683 ━━━━━━━━━━━━━━━━━━━━ 174s 254ms/step - cosine_loss: 0.5295 - loss: 0.1696 - mse: 0.0154 - val_cosine_loss: 0.2795 - val_loss: 0.0871 - val_mse: 0.0046
Epoch 3/11
683/683 ━━━━━━━━━━━━━━━━━━━━ 160s 235ms/step - cosine_loss: 0.5065 - loss: 0.1626 - mse: 0.0152 - val_cosine_loss: 0.2337 - val_loss: 0.0733 - val_mse: 0.0046
Epoch 4/11
683/683 ━━━━━━━━━━━━━━━━━━━━ 164s 240ms/step - cosine_loss: 0.4885 - loss: 0.1569 - mse: 0.0148 - val_cosine_loss: 0.2452 - val_loss: 0.0767 - val_mse: 0.0045
Epoch 5/11
683/683 ━━━━━━━━━━━━━━━━━━━━ 152s 223ms/step - cosine_loss: 0.4765 - loss: 0.1533 - mse: 0.0148 - val_cosine_loss: 0.2486 - val_loss: 0.077



#### Text pre-processing of test data

In [17]:
import pandas as pd
import string

test_df=pd.read_csv("test_set_f.csv") #reading test dataset into pandas dataframe

import re

def custom_preprocess(text):
    words = text.split() #tokenizing sentences to words(since sentences in the test data are mainly plain sentences with space separated words)

    processed_words = []
    for word in words:
        if word == "<MASKED>":  
            processed_words.append(word)  # keeping the <MASKED> token as it is
        else:
            word = word.lower()
            # Removing punctuation; a punctuation-only word becomes an empty string, which later shows up as an empty token
            word = re.sub(r'[^\w\s]', '', word)
            processed_words.append(word)

    return " ".join(processed_words)

test_df["MASKED SENTENCES"] = test_df["MASKED SENTENCES"].apply(custom_preprocess) 

def tokenize_test(text):
    return text.split(' ')

test_df['TOKENS']=test_df['MASKED SENTENCES'].apply(tokenize_test) #creating a new Pandas series containing all tokens corresponding to every sentence

test_df.head(10)

| Unnamed: 0 | IDS | MASKED SENTENCES | TOKENS |
|---|---|---|---|
| 0 | 1 | the sweat stood upon it in <MASKED> | [the, sweat, stood, upon, it, in, <MASKED>, ] |
| 1 | 2 | the city was named for judge <MASKED> r mckee | [the, city, was, named, for, judge, <MASKED>, ... |
| 2 | 3 | a <MASKED> of girls are cheering | [a, <MASKED>, of, girls, are, cheering, ] |
| 3 | 4 | tom resigned as he wasnt <MASKED> valued at work | [tom, resigned, as, he, wasnt, <MASKED>, value... |
| 4 | 5 | in the disastrous days that followed maurice w... | [in, the, disastrous, days, that, followed, ma... |
| 5 | 6 | many legume seeds have been proven to contain ... | [many, legume, seeds, have, been, proven, to, ... |
| 6 | 7 | the name was for mr square the philosopher <MA... | [the, name, was, for, mr, square, the, philoso... |
| 7 | 8 | fonaris gardenline the pat <MASKED> show and t... | [fonaris, gardenline, the, pat, <MASKED>, show... |
| 8 | 9 | a group of six people men and women hold up a ... | [a, group, of, six, people, men, and, women, h... |
| 9 | 10 | a <MASKED> of a racer from gokart street race | [a, <MASKED>, of, a, racer, from, gokart, stre... |


#### Generating x_test:
After pre-processing and tokenization of the test data, each word is converted to its corresponding embedding, and the "<MASKED>" token is converted to the same manually defined embedding used for the training data. Every sentence is padded or truncated to 40 tokens. The resulting x_test is a NumPy array in the format the model expects, from which it predicts raw embeddings for the missing (masked) words.
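The conversion can be illustrated on a toy scale with plain Python lists. This is only a sketch under stated assumptions: `toy_vectors` is a hypothetical stand-in for the Word2Vec model, `embed_sentence` is an illustrative helper, and the dimensions are shrunk (the notebook uses 40 tokens and 300 dimensions).

```python
MAX_LEN = 5        # the notebook uses 40
EMBEDDING_DIM = 3  # the notebook uses 300
MASK_VALUE = 9.99  # every component of the mask embedding

# Hypothetical stand-in for the Word2Vec lookup
toy_vectors = {"a": [0.1, 0.2, 0.3], "of": [0.4, 0.5, 0.6]}


def embed_sentence(tokens, vectors, max_len=MAX_LEN, dim=EMBEDDING_DIM):
    """Map tokens to vectors, then pad with zero rows (or truncate) to max_len."""
    rows = []
    for token in tokens[:max_len]:  # truncation
        if token == "<MASKED>":
            rows.append([MASK_VALUE] * dim)  # constant mask embedding
        else:
            rows.append(list(vectors.get(token, [0.0] * dim)))  # zeros for unknown words
    rows.extend([0.0] * dim for _ in range(max_len - len(rows)))  # zero padding
    return rows


matrix = embed_sentence(["a", "<MASKED>", "of", "girls"], toy_vectors)
for row in matrix:
    print(row)
```

Note that unknown words and padding positions both end up as zero rows, so the model cannot tell them apart.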

In [19]:
import numpy as np
import pandas as pd
from gensim.models import KeyedVectors

# Constants
MAX_LEN = 40
EMBEDDING_DIM = 300  

# Load Google News Word2Vec Model
word2vec_model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# Placeholder for masked token embedding
MASKED_EMBEDDING = np.full((EMBEDDING_DIM,), 9.99)  # Constant embedding (every component 9.99) marking the masked word, as in training

def words_to_embeddings(tokenized_sentences, word2vec_model):
    x_test = []
    
    for sentence in tokenized_sentences:
        embedded_sentence = []
        
        for word in sentence:
            if word == "<MASKED>":
                embedded_sentence.append(MASKED_EMBEDDING)  # Use MASKED token embedding
            elif word in word2vec_model:
                embedded_sentence.append(word2vec_model[word])  # Get word embedding
            else:
                embedded_sentence.append(np.zeros(EMBEDDING_DIM))  # Handle unknown words
        
        # Padding/truncation to MAX_LEN
        if len(embedded_sentence) < MAX_LEN:
            embedded_sentence += [np.zeros(EMBEDDING_DIM)] * (MAX_LEN - len(embedded_sentence))
        else:
            embedded_sentence = embedded_sentence[:MAX_LEN]
        
        x_test.append(embedded_sentence)
    
    return np.array(x_test)
tokenized_test_sentences = test_df["TOKENS"].tolist()  

# Convert tokenized test sentences into embeddings
x_test = words_to_embeddings(tokenized_test_sentences, word2vec_model)

print(f"x_test shape: {x_test.shape}")  # (num_samples, MAX_LEN, EMBEDDING_DIM)


x_test shape: (10000, 40, 300)


#### Final word prediction and storage into a CSV file (predicted_words.csv)

In [None]:
import numpy as np
import re
from gensim.models import KeyedVectors

# Load the Word2Vec Model
word2vec_model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def closest_valid_word(predicted_vector, word2vec_model):
    """
    Finds the closest valid word (only lowercase alphabets) in Word2Vec for the given embedding.
    """
    try:
        # Get top 12 most similar words
        top_matches = word2vec_model.most_similar([predicted_vector], topn=12)

        # Regex pattern to allow only lowercase words
        alpha_pattern = re.compile(r"^[a-z]+$")  

        # Filter valid lowercase words **without converting case**
        for word, _ in top_matches:
            if alpha_pattern.match(word):  
                return word  # Return the first valid lowercase word
        
        return "the"  # If no valid word is found, return "the" as the fallback word

    except KeyError:
        return "<UNK>"

# Predict embeddings from your trained model
predicted_embeddings = model.predict(x_test)  # Shape: (num_samples, 300)

# Convert embeddings to words
predicted_words = [closest_valid_word(emb, word2vec_model) for emb in predicted_embeddings]
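
The decoding step in the cell above (rank the vocabulary by similarity to the predicted vector, then take the first candidate made only of lowercase letters) can be tried out on a toy vocabulary without loading the embedding file. This is a hedged sketch: `toy_vocab`, `cosine` and `closest_lowercase_word` are hypothetical names, and the brute-force ranking merely stands in for gensim's `most_similar`.

```python
import math
import re

# Hypothetical 2D vocabulary standing in for Word2Vec
toy_vocab = {
    "New_York": [0.9, 0.1],
    "group": [0.8, 0.3],
    "the": [0.1, 0.9],
}
LOWERCASE_WORD = re.compile(r"^[a-z]+$")


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def closest_lowercase_word(vector, vocab, topn=12, fallback="the"):
    """Return the most similar all-lowercase word among the top `topn` matches."""
    ranked = sorted(vocab, key=lambda word: cosine(vector, vocab[word]), reverse=True)
    for word in ranked[:topn]:
        if LOWERCASE_WORD.match(word):
            return word
    return fallback  # no valid candidate within the top matches


print(closest_lowercase_word([1.0, 0.1], toy_vocab))          # group ("New_York" is closer but filtered out)
print(closest_lowercase_word([1.0, 0.1], toy_vocab, topn=1))  # the (fallback)
```

The second call shows how the fallback can be triggered when every top candidate is rejected by the filter, which may contribute to "the" appearing often in the predictions.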


In [50]:
print(predicted_words[:250])

['blue', 'for', 'group', 'as', 'the', 'as', 'played', 'on', 'the', 'man', 'the', 'the', 'the', 'me', 'the', 'just', 'with', 'the', 'do', 'that', 'the', 'as', 'the', 'the', 'is', 'for', 'the', 'the', 'the', 'the', 'the', 'by', 'is', 'just', 'many', 'with', 'the', 'do', 'the', 'the', 'the', 'for', 'i', 'just', 'other', 'can', 'constructed', 'just', 'have', 'the', 'out', 'in', 'the', 'just', 'man', 'the', 'the', 'do', 'just', 'at', 'blue', 'has', 'just', 'the', 'north', 'the', 'out', 'going', 'very', 'the', 'are', 'first', 'the', 'in', 'on', 'as', 'many', 'sitting', 'the', 'the', 'by', 'on', 'the', 'the', 'that', 'the', 'out', 'the', 'man', 'the', 'people', 'are', 'the', 'in', 'the', 'the', 'the', 'also', 'the', 'it', 'the', 'the', 'the', 'the', 'just', 'is', 'is', 'just', 'the', 'can', 'down', 'is', 'the', 'out', 'the', 'have', 'there', 'for', 'the', 'the', 'the', 'the', 'sitting', 'the', 'very', 'white', 'the', 'the', 'the', 'on', 'just', 'the', 'give', 'the', 'in', 'know', 'just', 'the

In [52]:
i=0
for i in range(250):
    print(test_df['MASKED SENTENCES'][i])
    print("Predicted word: ")
    print(predicted_words[i])

the sweat stood upon it in <MASKED> 
Predicted word: 
blue
the city was named for judge <MASKED> r mckee 
Predicted word: 
for
a <MASKED> of girls are cheering 
Predicted word: 
group
tom resigned as he wasnt <MASKED> valued at work 
Predicted word: 
as
in the disastrous days that followed maurice was subject to fredericks <MASKED> 
Predicted word: 
the
many legume seeds have been proven to contain high lectin <MASKED> termed hemagglutinating activity 
Predicted word: 
as
the name was for mr square the philosopher <MASKED> in henry fieldings tom jones 
Predicted word: 
played
fonaris gardenline the pat <MASKED> show and the handyman hotline with larry egan 
Predicted word: 
on
a group of six people men and women hold up a pole in the <MASKED> of a forest while another woman monitors this action 
Predicted word: 
the
a <MASKED> of a racer from gokart street race 
Predicted word: 
man
drongen is the birthplace of belgian professional footballer kevin de <MASKED> 
Predicted word: 
the
he 

In [54]:
import pandas as pd

# Create a DataFrame with IDS and PREDICTED WORDS
df = pd.DataFrame({
    "IDS": test_df["IDS"].values,        # IDs taken from the test set (they start at 1, unlike range(len(...)))
    "PREDICTED WORDS": predicted_words   # Predicted words column
})

# Save to CSV
df.to_csv("predicted_words.csv", index=False)

print("CSV file 'predicted_words.csv' has been saved successfully.")


CSV file 'predicted_words.csv' has been saved successfully.
