# Leveraging Gen AI for SAT Prep - Semantic Similarity

This notebook shows how I used semantic similarity to pick the best-suited vocabulary word for a given genre, with the goal of reducing hallucination.

Semantic similarity is based on this paper: https://arxiv.org/pdf/2108.06130

In [4]:
%pip install llama-index-embeddings-huggingface
%pip install llama-index-embeddings-instructor
%pip install llama-index

Defaulting to user installation because normal site-packages is not writeable
Collecting llama-index-embeddings-huggingface
  Downloading llama_index_embeddings_huggingface-0.5.2-py3-none-any.whl (8.9 kB)
Collecting sentence-transformers>=2.6.1
  Downloading sentence_transformers-3.4.1-py3-none-any.whl (275 kB)
Installing collected packages: sentence-transformers, llama-index-embeddings-huggingface
Successfully installed llama-index-embeddings-huggingface-0.5.2 sentence-transformers-3.4.1
Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
Collecting llama-index-embeddings-instructor
  Downloading llama_index_embeddings_instructor-0.3.0-py3-none-any.whl (3.6 kB)
Collecting sentence-transformers<3.0.0,>=2.2.2
  Downloading sentence_transformers-2.7.0-py3-none-

I use BAAI/bge-large-en-v1.5 as the embedding model for evaluating semantic similarity.

In [6]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# local directory used to cache downloaded model weights
cache_dir = "/home/ubuntu/Pragyan/model_cache"
embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-large-en-v1.5", cache_folder=cache_dir, trust_remote_code=True
)

2025-03-19 03:47:31.284242: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1742356051.314005    2951 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1742356051.323617    2951 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


Similarity scores here range from roughly 0 to 1: 0 means no similarity and 1 means high semantic similarity. The threshold is set to 0.5, so a word passes only if its similarity to the genre is 0.5 or higher.
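
To make the score concrete, here is a minimal sketch of the kind of comparison behind it, assuming the evaluator compares the two embeddings with cosine similarity and passes a pair when the score is at or above the threshold. The three-dimensional vectors are made-up toy values, not output from the embedding model, and `cosine_similarity` and `passes` are hypothetical helper names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def passes(score, threshold=0.5):
    """A pair passes when its score is at or above the threshold."""
    return score >= threshold

# Toy "embeddings" (made-up values for illustration only)
genre_vec = [0.9, 0.1, 0.3]
word_vec = [0.8, 0.2, 0.4]
unrelated_vec = [-0.2, 0.9, -0.4]

close = cosine_similarity(genre_vec, word_vec)
far = cosine_similarity(genre_vec, unrelated_vec)
print(passes(close), passes(far))
```

Vectors that point in nearly the same direction score close to 1, while vectors that point in different directions score near 0 or below.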

In [7]:
from llama_index.core.evaluation import SemanticSimilarityEvaluator
from llama_index.core.embeddings import resolve_embed_model

evaluator = SemanticSimilarityEvaluator(
    embed_model=embed_model,
    similarity_threshold=0.5,
)

In [8]:
from transformers import LlamaForCausalLM, AutoModelForCausalLM, AutoTokenizer
import torch
from huggingface_hub import login

model_id="meta-llama/Meta-Llama-3-8B-Instruct"
access_token="<your HF Token>"
login(token = access_token)

device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(device)

cache_dir="/home/ubuntu/Pragyan/model_cache"

model=AutoModelForCausalLM.from_pretrained(model_id, token=access_token, cache_dir=cache_dir).to(device)
tokenizer= AutoTokenizer.from_pretrained(model_id, token=access_token, cache_dir=cache_dir)

cuda:0


Loading checkpoint shards: 100%|██████████| 4/4 [00:11<00:00,  2.83s/it]


In [148]:
from transformers import GenerationConfig
generation_config = GenerationConfig(
        # maximum number of new tokens to generate
        max_new_tokens=20,
        # sample only from the 20 most likely tokens
        top_k=20,
        # sample from the distribution instead of using greedy decoding
        do_sample=True,
        # a very low temperature makes sampling nearly deterministic
        temperature=0.001,
        # use the tokenizer's EOS token as the pad token
        pad_token_id=tokenizer.eos_token_id,
        # return the raw (unprocessed) logits for each generated token
        output_logits=True,
        # return the processed prediction scores for each generated token
        output_scores=True,
        # return hidden states along with the output
        output_hidden_states=True,
        # return the output as a dict-like object
        return_dict_in_generate=True,
        # reduce repetition
        # repetition_penalty=1.5
    )

print(generation_config)

GenerationConfig {
  "do_sample": true,
  "max_new_tokens": 20,
  "output_hidden_states": true,
  "output_logits": true,
  "output_scores": true,
  "pad_token_id": 128009,
  "return_dict_in_generate": true,
  "temperature": 0.001,
  "top_k": 20
}
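
The combination of `do_sample=True` with `temperature=0.001` behaves almost like greedy decoding. The sketch below illustrates why, using a plain softmax over made-up logits; `softmax_with_temperature` is a hypothetical helper written for this illustration and is not part of the `transformers` API.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for three candidate tokens
logits = [2.0, 1.5, 0.5]

for t in (1.0, 0.001):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 4) for p in probs])
```

At a temperature of 1.0 the probability mass is spread across the candidates; at 0.001 practically all of it lands on the highest-logit token, so sampling returns that token almost every time.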



In [149]:
# run inference against the model and return the generated text along with its scores
def run_inference(prompt):
    inputs = tokenizer([prompt], return_tensors="pt").to(device)
    outputs = model.generate(**inputs, generation_config=generation_config)
    # log-probability of each generated token
    transition_scores = model.compute_transition_scores(outputs.sequences, outputs.scores, normalize_logits=True)
    # decoder-only models echo the prompt, so the generated tokens start after it
    input_length = 1 if model.config.is_encoder_decoder else inputs.input_ids.shape[1]
    # prompt plus completion
    complete_text = ''.join(tokenizer.decode(t) for t in outputs.sequences)
    # completion only
    generated_tokens = outputs.sequences[:, input_length:]
    generated_text = ''.join(tokenizer.decode(t) for t in generated_tokens)
    return [generated_text, generated_tokens, transition_scores, complete_text]
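
With `normalize_logits=True`, the transition scores are log-probabilities of the chosen tokens (computed from the processed scores, so they reflect the temperature setting). A small sketch of how such values can be read, using hypothetical numbers rather than real model output:

```python
import math

# Hypothetical per-token log-probabilities, standing in for one row of transition_scores
token_logprobs = [-0.01, -0.25, -1.20]

# exp() turns each log-probability back into a probability
token_probs = [math.exp(lp) for lp in token_logprobs]

# log-probabilities add, so the sum is the log-probability of the whole sequence
sequence_logprob = sum(token_logprobs)
sequence_prob = math.exp(sequence_logprob)

print([round(p, 3) for p in token_probs])
print(round(sequence_prob, 3))
```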

### Word dataset with definition

In [195]:
import pandas as pd
import random
import time
from random import randrange
vocab_df = pd.read_csv('sat_vocab.csv')
print("Sample word: {} ".format(vocab_df.head(5)))

Sample word:          word
0       Abate
1  Aberration
2       Abhor
3      Abject
4      Abjure 


### Genre dataset

In [34]:
genre_df = pd.read_csv('sat_genre.csv')
print("Sample genre: {} ".format(genre_df.head(5)))

Sample genre:                          genre
0    Emergence of Homo sapiens
1  Use of fire by early humans
2   Development of stone tools
3      Agricultural Revolution
4     Establishment of Jericho 


In [183]:
def find_antonym(word):
    # one-shot prompt: the abate/intensify pair shows the model the expected answer format
    prompt = "Antonym for the word abate is intensify. What is the antonym of the word {} is? It should be an uncommon word.".format(word)
    output = run_inference(prompt)
    # keep the last word of the first generated sentence
    antonym = output[0].split('.')[0].lower().strip().split(" ")[-1]
    return antonym

def find_word_meaning(word):
    prompt = "Definition of the word {} is".format(word)
    output = run_inference(prompt)
    # keep the first generated sentence
    meaning = output[0].split('.')[0].lower().strip()
    # drop a leading colon, keeping any later colons in the definition
    if meaning.startswith(":"):
        meaning = meaning.split(":", 1)[1]
    return meaning


In [175]:
print(find_antonym('abate'))
print(find_antonym('castigate'))

augment
extol


In [176]:
print(find_word_meaning('augment'))
print(find_word_meaning('extol'))

 to increase or add to something, especially to make it more effective or valuable
 to praise highly; to glorify; to commend
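
The string handling in `find_antonym` and `find_word_meaning` can be tried without loading the model. The sketch below applies the same parsing steps to made-up completions; `parse_antonym`, `parse_meaning`, and the sample strings are hypothetical and only illustrate the expected shape of the model output. Unlike the original, `parse_meaning` also strips the space left after the leading colon.

```python
def parse_antonym(completion):
    """Return the last word of the first sentence, lowercased."""
    first_sentence = completion.split('.')[0]
    return first_sentence.lower().strip().split(" ")[-1]

def parse_meaning(completion):
    """Return the first sentence, lowercased, without a leading colon."""
    meaning = completion.split('.')[0].lower().strip()
    if meaning.startswith(":"):
        meaning = meaning.split(":", 1)[1].strip()
    return meaning

# Made-up completions, not actual model output
sample_antonym = " The antonym of the word castigate is extol. It is an uncommon word."
sample_meaning = ": to praise highly; to glorify. It is often used"

print(parse_antonym(sample_antonym))
print(parse_meaning(sample_meaning))
```

This parsing assumes the answer sits at the end of the first sentence; a completion phrased differently would yield the wrong word, so the results are worth spot-checking.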


The following code picks a random genre and `word_count` random words (20 here), then scores the semantic similarity of each genre-word pair. It keeps the word with the highest score, provided at least one word scores 0.5 or above, and uses the lowest-scoring words as incorrect answer choices.

In [204]:
output_df = pd.DataFrame(columns=['genre', 'word', 'similarity_score', 'answer_choices', 'answer_choices_with_score'])
test_cases_count=100
word_count=20
invalid_choice_count=2
counter = 0
while True:
    words=[]
    genre = (genre_df['genre'][randrange(genre_df.shape[0])]).lower()
    for i in range(word_count):
        # pick a random row from the vocabulary
        key = randrange(vocab_df.shape[0])
        words.append((vocab_df['word'][key]).lower())
    
    highest_score=0
    similarity_scores={}
    selected_word=''
    passing_count=0
    for i in range (len(words)):
        result = await evaluator.aevaluate(
            response=genre,
            reference=words[i],
        )
        similarity_scores.update({result.score:words[i]})
        if (result.passing):
            passing_count += 1
        # print("{},{}".format(words[i],result.score))
    
    # we need at least one word with a similarity score at or above the threshold (0.5)
    if passing_count == 0:
        continue
        
    # sort the scores so that we can pick the word with the highest similarity and use
    # the lowest-scoring words (invalid_choice_count of them) as invalid choices
    scores = list(similarity_scores.keys())
    scores.sort()
    sorted_scores = {i: similarity_scores[i] for i in scores}

    # word with highest similarity
    selected_word = sorted_scores[scores[len(scores) - 1]]

    answer_choices=[]
    answer_choices.append(selected_word)

    answer_choices_with_score={}
    # add invalid choices
    for k in range(invalid_choice_count):
        invalid_choice = sorted_scores[scores[k]]
        answer_choices.append(invalid_choice)
        answer_choices_with_score[invalid_choice] = scores[k]
    
    # add an antonym
    antonym = find_antonym(selected_word)
    answer_choices.append(antonym)
    random.shuffle(answer_choices)
    answer_choices_with_score[antonym] = 'antonym'
    answer_choices_with_score[selected_word] = 'correct answer'
    
    # capture the word definition
    answer_choices_with_def={}
    for i in range (len(answer_choices)):
        choice = answer_choices[i]
        answer_choices_with_def[choice] = find_word_meaning(choice)

    

    # store in the dataframe (DataFrame.append was removed in pandas 2.0)
    row = {'genre': genre, 'word': selected_word,
        'similarity_score': scores[len(scores) - 1], 'answer_choices': answer_choices_with_def,
        'answer_choices_with_score': answer_choices_with_score}
    output_df = pd.concat([output_df, pd.DataFrame([row])], ignore_index=True)
    print("{} - Genre: {}; Word {}; Score: {} ".format(counter, genre, selected_word, scores[len(scores) - 1]))

    counter += 1
    
    # break if test cases count has been reached
    if counter >= test_cases_count:
        break


0 - Genre: korean war; Word impasse; Score: 0.5604859574272113 
1 - Genre: invention of the printing press; Word catalyst; Score: 0.531469707132798 
2 - Genre: 9/11 terrorist attacks; Word resilient; Score: 0.5455340221089306 
3 - Genre: unification of germany; Word laud; Score: 0.5351095440083806 
4 - Genre: futurist conceptual designs; Word whimsical; Score: 0.5253310403956992 
5 - Genre: the history of curling; Word engross; Score: 0.5197745729355853 
6 - Genre: deconstructivist museums; Word enrapture; Score: 0.5289703605621967 
7 - Genre: formation of the grand canyon; Word exigent; Score: 0.5432685705065593 
8 - Genre: advent of quantum computing; Word quandary; Score: 0.5411630610170756 
9 - Genre: the history of ice hockey; Word anachronistic; Score: 0.5212942984138771 
10 - Genre: development of modern ocean currents; Word ubiquitous; Score: 0.5260845268153582 
11 - Genre: persian wars; Word temptation; Score: 0.5271377074430801 
12 - Genre: african tribal huts; Word dilapidat
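
The selection step in the loop above can be sketched in isolation. This version keys the scores by word instead of by score, so two words that happen to share a score do not overwrite each other. The function name `pick_choices` and the scores are hypothetical, chosen only to illustrate the logic.

```python
def pick_choices(scores_by_word, threshold=0.5, invalid_choice_count=2):
    """Pick the best-scoring word and the lowest-scoring words as distractors.

    Returns None when no word reaches the threshold.
    """
    # sort (word, score) pairs from lowest to highest score
    ranked = sorted(scores_by_word.items(), key=lambda item: item[1])
    best_word, best_score = ranked[-1]
    if best_score < threshold:
        return None
    distractors = [word for word, _ in ranked[:invalid_choice_count]]
    return best_word, best_score, distractors

# Made-up similarity scores for one genre
scores = {"impasse": 0.56, "whimsical": 0.41, "laud": 0.38, "abate": 0.41}
print(pick_choices(scores))
```

Checking the best score against the threshold is equivalent to requiring at least one passing word, which is what the `passing_count` check does in the notebook.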

In [205]:
# write the dataframe to a csv file
output_df.to_csv('test_eval_word_genre.csv', index=False)