## Set Up

### Install

In [None]:
!pip install transformers

In [None]:
!pip install -U sentence-transformers==2.2.2

In [None]:
!pip install InstructorEmbedding

### Import

In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import torch
import re
from InstructorEmbedding import INSTRUCTOR
import difflib

### Load Models

In [None]:
# Initialize summarization pipeline and model
summarizer_pipeline = pipeline("summarization", model="Oulaa/teachMy_sum")
tokenizer_sum = AutoTokenizer.from_pretrained("Oulaa/teachMy_sum")
model_sum = AutoModelForSeq2SeqLM.from_pretrained("Oulaa/teachMy_sum")

In [None]:
# sentence transformer
model_semantic = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
# text generation with DistilGPT-2
tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2')
model = GPT2LMHeadModel.from_pretrained('distilgpt2')

# Answer Evaluation Development

**Note:** I added the descriptions of what I did in retrospect.

This section tries to choose a suitable method for answer evaluation. My research showed how open-ended this problem is: only large language models have an intrinsic ability to do this task. There are no training resources, so the choices are based on trying increasingly complex methods and finding suitable models for the task.

### experiment 1: evaluate using a distilled large language model

💡 The idea is to use a distilled open-source LLM to generate an answer and compare it with the user's answer.

💭 Result: I didn't develop this much further because I realised Sonia's model already generates an answer, so the technique should focus on comparing the generated answer (gen_answer) with the user's answer. Here that comparison is just a direct match, so I'm going to work on improving the comparison part.

Loading the LLM was also too slow and crashed too often, so it wasn't feasible to use it anyway.

In [None]:
def generate_text_from_prompt(prompt):
    inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
    max_length = len(inputs[0]) + 50
    outputs = model.generate(inputs, max_length=max_length, num_return_sequences=1)
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_text

In [None]:
def summarize_text(text, max_length_output=200):
    # summarise context to 50%
    summarized = summarizer_pipeline(text, max_length=max_length_output, min_length=int(max_length_output / 2), do_sample=False)
    return summarized[0]['summary_text']

In [None]:
def summarize_text(text, max_length_output=200):
    # dynamically adjust max_length based on input length, with a minimum
    input_length = len(tokenizer_sum(text)["input_ids"])
    max_length_dynamic = min(max(input_length // 2, 50), max_length_output)  #summarise context to 50%

    summarized = summarizer_pipeline(text, max_length=max_length_dynamic, min_length=max_length_dynamic // 2, do_sample=False)
    return summarized[0]['summary_text']
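
The summarizer warning in the output further down ("Your max_length is set to 50, but your input_length is only 34") comes from the floor of 50 in the length rule above. Below is a minimal sketch of the same arithmetic as a pure function. The name `summary_length_bounds` and the extra cap at the input length are assumptions added for illustration, not something the notebook does.

```python
def summary_length_bounds(input_length, max_length_output=200, floor=50):
    """Return (max_length, min_length) for the summarizer.

    Mirrors the rule above (half the input, at least `floor`, at most
    `max_length_output`), then additionally caps max_length at the input
    length. The extra cap is an assumption added for illustration.
    """
    target = min(max(input_length // 2, floor), max_length_output)
    max_length = max(1, min(target, input_length))
    min_length = max_length // 2
    return max_length, min_length


# 34 tokens: the uncapped rule would ask for max_length=50, longer than the input
print(summary_length_bounds(34))    # (34, 17)
print(summary_length_bounds(300))   # (150, 75)
print(summary_length_bounds(1000))  # (200, 100)
```

With the cap, very short contexts are no longer asked for a summary longer than the input, which is the situation the warning describes.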

In [None]:
def evaluate_answer_enhanced(context, question, user_answer, user_question=None):
    prompt = f"Context: {context} Question: {question} Answer:"
    if user_question:
        prompt = f"User's question: {user_question} " + prompt

    # generate response from DistilGPT-2
    generated_text = generate_text_from_prompt(prompt)

    if user_answer.lower() in generated_text.lower():
        feedback = "Correct or close enough! Your answer aligns with expected information."
    else:
        # If answer is incorrect, summarize context
        correction_context = summarize_text(context)
        feedback = f"Not quite right. Here's a brief overview for clarity: {correction_context}"

    return feedback

In [None]:
context = "Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991."
question = "Who created Python?"
user_answer = "Linus Torvalds"  # incorrect

feedback = evaluate_answer_enhanced(context, question, user_answer)
print(feedback)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Your max_length is set to 50, but your input_length is only 34. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)


Not quite right. Here's a brief overview for clarity: Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991.
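
One thing to keep in mind with the direct match: for a decoder-only model like DistilGPT-2, the decoded output contains the prompt followed by the continuation, so `user_answer.lower() in generated_text.lower()` also matches words that only appear in the context. A minimal sketch of comparing against the continuation only is shown below. The helper names and the made-up continuation string are hypothetical, and the sketch assumes the decoded text starts with the prompt verbatim.

```python
def continuation_only(generated_text, prompt):
    """Return only the text the model added after the prompt.

    Assumes the decoded output starts with the prompt verbatim; if it does
    not, the full text is returned unchanged.
    """
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text


def direct_match(user_answer, generated_text, prompt):
    # Compare against the continuation so words from the context cannot match
    continuation = continuation_only(generated_text, prompt)
    return user_answer.lower() in continuation.lower()


prompt = "Context: Created by Guido van Rossum. Question: Who created Python? Answer:"
generated = prompt + " It is a language."  # made-up continuation for illustration

print("guido van rossum" in generated.lower())              # True, matched via the prompt
print(direct_match("Guido van Rossum", generated, prompt))  # False
```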


### experiment 1.2: add semantic evaluation step

💡 First compare directly, then use cosine similarity if it's not a direct match.

💭 Result: it didn't recognise a correct answer phrased in different words.

In [None]:
def compute_semantic_similarity(text1, text2):
    embeddings1 = model_semantic.encode([text1], convert_to_tensor=True)
    embeddings2 = model_semantic.encode([text2], convert_to_tensor=True)

    similarity_score = cosine_similarity(embeddings1, embeddings2)

    return similarity_score[0][0]
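
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures whether two embeddings point in the same direction, not whether a statement is true. A minimal pure-Python sketch of the formula follows. The function name `cosine` and the toy vectors are illustrative only; the notebook itself uses scikit-learn on sentence embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (plain Python lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


print(cosine([1, 0], [0, 1]))    # 0.0, orthogonal vectors
print(cosine([1, 0], [-1, 0]))   # -1.0, opposite directions
print(round(cosine([1, 2, 3], [2, 4, 6]), 6))  # 1.0, same direction
```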

In [None]:
def evaluate_answer_with_semantics(context, question, user_answer, user_question=None):
    prompt = f"Question: {question} Context: {context[:250]}... The answer is"
    if user_question:
        prompt = f"User's question: {user_question} " + prompt

    generated_text = generate_text_from_prompt(prompt)

    # direct comparison
    if user_answer.lower() in generated_text.lower():
        feedback = "Correct or close enough! Your answer aligns with expected information."
    else:
        # semantic similarity if it's not a direct match
        summary_context = summarize_text(context)
        similarity_score = compute_semantic_similarity(user_answer, summary_context)

        if similarity_score > 0.75:  # threshold
            feedback = "Correct or close enough! Your answer captures the essence of the expected information."
        else:
            feedback = f"Your answer might not be aligned with the expected information. Consider this: {summary_context}"

    return feedback

In [None]:
context = "Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991."
question = "Who created Python?"
user_answer = "The person who initiated Python is a Dutch programmer."  # Meaningful but differently worded

feedback = evaluate_answer_with_semantics(context, question, user_answer)
print(feedback)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Your max_length is set to 50, but your input_length is only 34. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)


Your answer might not be aligned with the expected information. Consider this: Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991.


In [None]:
# a bulk test for 4 scenarios

context = "Python is an interpreted, high-level, general-purpose programming language. It emphasizes code readability with its notable use of significant indentation. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects."
question = "What is Python known for?"

# Scenario 1: Correct Answer in Very Similar Words
user_answer_1 = "Python is recognized for its emphasis on code readability and significant indentation."
# Scenario 2: Correct Answer in Different Words
user_answer_2 = "This programming language is famed for making code understandable and using indentation to do so."
# Scenario 3: Wrong Answer in Similar Words
user_answer_3 = "Python is known for its speed and performance in complex calculations."
# Scenario 4: Wrong Answer in Different Words
user_answer_4 = "The primary feature of Python is its ability to perform fast computations and handle large data sets efficiently."

In [None]:
# revised eval function
def evaluate_answer_with_semantics_verbose(context, question, user_answer, user_question=None):
    prompt = f"Question: {question} Context: {context[:250]}... The answer is"
    if user_question:
        prompt = f"User's question: {user_question} " + prompt

    # answer
    generated_text = generate_text_from_prompt(prompt)
    print(f"Generated Text: {generated_text}\n")

    # summarising the context before comparison
    summary_context = summarize_text(context)
    print(f"Summary Context: {summary_context}\n")

    # compare using cosine similarity
    similarity_score = compute_semantic_similarity(user_answer, summary_context)
    print(f"Semantic Similarity Score: {similarity_score:.2f}\n")

    # evaluation and feedback
    if similarity_score > 0.75:
        feedback = "Correct or close enough! Your answer captures the essence of the expected information."
    else:
        feedback = f"Your answer might not be aligned with the expected information. Consider this: {summary_context}"

    return feedback

In [None]:
# testing scenarios
print("Scenario 1:")
feedback_1 = evaluate_answer_with_semantics_verbose(context, question, user_answer_1)
print(feedback_1, "\n\n")

print("Scenario 2:")
feedback_2 = evaluate_answer_with_semantics_verbose(context, question, user_answer_2)
print(feedback_2, "\n\n")

print("Scenario 3:")
feedback_3 = evaluate_answer_with_semantics_verbose(context, question, user_answer_3)
print(feedback_3, "\n\n")

print("Scenario 4:")
feedback_4 = evaluate_answer_with_semantics_verbose(context, question, user_answer_4)
print(feedback_4, "\n\n")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Scenario 1:
Generated Text: Question: What is Python known for? Context: Python is an interpreted, high-level, general-purpose programming language. It emphasizes code readability with its notable use of significant indentation. Its language constructs and object-oriented approach aim to help programmers write clear, logi... The answer is that Python is a language that is not just a language, but a language that is not just a language. It is a language that is not just a language. It is a language that is not just a language. It is a language that is



The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Summary Context: Python is an interpreted, high-level, general-purpose programming language . it emphasizes code readability with its notable use of significant indentation .

Semantic Similarity Score: 0.84

Correct or close enough! Your answer captures the essence of the expected information. 


Scenario 2:
Generated Text: Question: What is Python known for? Context: Python is an interpreted, high-level, general-purpose programming language. It emphasizes code readability with its notable use of significant indentation. Its language constructs and object-oriented approach aim to help programmers write clear, logi... The answer is that Python is a language that is not just a language, but a language that is not just a language. It is a language that is not just a language. It is a language that is not just a language. It is a language that is



The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Summary Context: Python is an interpreted, high-level, general-purpose programming language . it emphasizes code readability with its notable use of significant indentation .

Semantic Similarity Score: 0.63

Your answer might not be aligned with the expected information. Consider this: Python is an interpreted, high-level, general-purpose programming language . it emphasizes code readability with its notable use of significant indentation . 


Scenario 3:
Generated Text: Question: What is Python known for? Context: Python is an interpreted, high-level, general-purpose programming language. It emphasizes code readability with its notable use of significant indentation. Its language constructs and object-oriented approach aim to help programmers write clear, logi... The answer is that Python is a language that is not just a language, but a language that is not just a language. It is a language that is not just a language. It is a language that is not just a language. It is a language th

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Summary Context: Python is an interpreted, high-level, general-purpose programming language . it emphasizes code readability with its notable use of significant indentation .

Semantic Similarity Score: 0.53

Your answer might not be aligned with the expected information. Consider this: Python is an interpreted, high-level, general-purpose programming language . it emphasizes code readability with its notable use of significant indentation . 


Scenario 4:
Generated Text: Question: What is Python known for? Context: Python is an interpreted, high-level, general-purpose programming language. It emphasizes code readability with its notable use of significant indentation. Its language constructs and object-oriented approach aim to help programmers write clear, logi... The answer is that Python is a language that is not just a language, but a language that is not just a language. It is a language that is not just a language. It is a language that is not just a language. It is a language th

### experiment 1.3: evaluation with cosine similarity only

💭 Result: I'm not sure why I tried this, but it's not suitable for this task: similarity alone is a poor measure of correctness.

In [None]:
def evaluate_user_answer(context, generated_answer, user_answer):
    context_embedding = model_semantic.encode(context, convert_to_tensor=True)
    answer_embedding = model_semantic.encode(generated_answer, convert_to_tensor=True)
    user_answer_embedding = model_semantic.encode(user_answer, convert_to_tensor=True)

    reference_embedding = torch.mean(torch.stack([context_embedding, answer_embedding]), dim=0)

    # semantic similarity
    # move tensors to the CPU first so .numpy() also works when the model runs on a GPU
    similarity_score = cosine_similarity([user_answer_embedding.cpu().numpy()], [reference_embedding.cpu().numpy()])[0][0]

    # correctness with an adjusted threshold
    correctness = 1 if similarity_score >= 0.65 else 0

    # user-facing message
    if correctness:
        model_output = "Correct! Your answer aligns well with the expected information."
    else:
        model_output = "Your answer might not fully capture the expected information. Consider revisiting the topic."

    return similarity_score, correctness, model_output

In [None]:
### SIMULATED TESTS ###

context = "Python is a high-level programming language known for its readability and wide applicability in various programming tasks."
generated_answer = "Python is famous for its simplicity and readability, making it suitable for beginners and professionals alike."

# Scenario 1: correct answer in similar words
user_answer_1 = "Python is well-regarded for its straightforward syntax and ease of use, appealing to both new and experienced coders."

# Scenario 2: correct answer in different words
user_answer_2 = "This language is appreciated for being easy to learn and versatile, which is why it's popular among programmers."

# Scenario 3: incorrect answer
user_answer_3 = "Python is primarily known for its speed and performance in mathematical computations."

print("Scenario 1:", evaluate_user_answer(context, generated_answer, user_answer_1), "\n")
print("Scenario 2:", evaluate_user_answer(context, generated_answer, user_answer_2), "\n")
print("Scenario 3:", evaluate_user_answer(context, generated_answer, user_answer_3), "\n")

Scenario 1: (0.84350085, 1, 'Correct! Your answer aligns well with the expected information.') 

Scenario 2: (0.5579801, 0, 'Your answer might not fully capture the expected information. Consider revisiting the topic.') 

Scenario 3: (0.68932354, 1, 'Correct! Your answer aligns well with the expected information.') 
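
The three scores above show why tuning the threshold cannot fix this: the incorrect answer (0.689) scores higher than one of the correct answers (0.558). A small sketch that checks whether any single threshold separates correct from incorrect answers, using the scores printed above. The helper name `separating_threshold` is hypothetical.

```python
def separating_threshold(scored):
    """scored: list of (score, is_correct) pairs.

    Return a threshold that accepts every correct answer and rejects every
    incorrect one, or None if no such threshold exists.
    """
    correct = [score for score, ok in scored if ok]
    incorrect = [score for score, ok in scored if not ok]
    if not correct or not incorrect:
        return None
    lowest_correct = min(correct)
    highest_incorrect = max(incorrect)
    if lowest_correct > highest_incorrect:
        # any value between the two groups works; take the midpoint
        return (lowest_correct + highest_incorrect) / 2
    return None


# scores from the three scenarios above, labelled by the intended outcome
scores = [(0.84350085, True), (0.5579801, True), (0.68932354, False)]
print(separating_threshold(scores))  # None
```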



### experiment 2: cosine similarity ensemble approach

In [None]:
# Load models for the ensemble
model1 = SentenceTransformer('all-MiniLM-L6-v2')
model2 = SentenceTransformer('sentence-transformers/all-distilroberta-v1')

In [None]:
def evaluate_user_answer(context, generated_answer, user_answer):
    # if the answer is a name, a number, or a single word check directly
    if re.fullmatch(r"[\w-]+", user_answer) or user_answer.isdigit():
        if user_answer.lower() in context.lower() or user_answer.lower() in generated_answer.lower():
            return 1.0, 1, "Correct! Your answer matches the key information."
        else:
            return 0.0, 0, "The answer provided does not match the key information in the context."

    # for more complex answers, ensemble approach
    embedding1_context = model1.encode(context)
    embedding1_user = model1.encode(user_answer)
    score1 = cosine_similarity([embedding1_context], [embedding1_user])[0][0]

    embedding2_context = model2.encode(context)
    embedding2_user = model2.encode(user_answer)
    score2 = cosine_similarity([embedding2_context], [embedding2_user])[0][0]

    # average the scores from the two models
    final_score = (score1 + score2) / 2

    # correctness based on the final score
    correct = 1 if final_score >= 0.65 else 0

    # user-facing output
    if correct:
        user_facing_output = "Correct! Your comprehensive answer aligns well with the expected information."
    else:
        user_facing_output = "Your answer might not fully capture the expected information. Consider revisiting the topic."

    return final_score, correct, user_facing_output


In [None]:
### SIMULATED TESTS ###

context = "Python is a high-level programming language with dynamic semantics. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development."
generated_answer = "Python is known for its dynamic semantics and suitability for Rapid Application Development."

# Scenario 1: correct answer (name)
user_answer_1 = "Python"

# Scenario 2: correct answer (different words)
user_answer_2 = "Python excels in quick development cycles and flexibility."

# Scenario 3: incorrect answer
user_answer_3 = "Python is mainly used for system programming."

print("Scenario 1:", evaluate_user_answer(context, generated_answer, user_answer_1))
print("Scenario 2:", evaluate_user_answer(context, generated_answer, user_answer_2))
print("Scenario 3:", evaluate_user_answer(context, generated_answer, user_answer_3))

Scenario 1: (1.0, 1, 'Correct! Your answer matches the key information.')
Scenario 2: (0.6231131613254547, 0, 'Your answer might not fully capture the expected information. Consider revisiting the topic.')
Scenario 3: (0.6954717993736267, 1, 'Correct! Your comprehensive answer aligns well with the expected information.')
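
A caveat on the single-word shortcut: `user_answer.lower() in context.lower()` is a substring test, so a fragment such as "thon" would count as a match for "Python". Below is a minimal sketch of a whole-word check that could replace it. The helper name `contains_whole_word` is hypothetical, and treating hyphens as part of a word is an assumption that follows the `[\w-]+` pattern used above.

```python
import re


def contains_whole_word(answer, text):
    """True if `answer` occurs in `text` as a whole word (case-insensitive).

    Word characters and hyphens count as part of a word, matching the
    [\\w-]+ pattern used for the single-word shortcut.
    """
    pattern = r"(?<![\w-])" + re.escape(answer.strip()) + r"(?![\w-])"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


context = "Python is a high-level programming language with dynamic semantics."

print("thon" in context.lower())                   # True, substring match
print(contains_whole_word("thon", context))        # False
print(contains_whole_word("Python", context))      # True
print(contains_whole_word("high-level", context))  # True
```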


### experiment 2.1: Ensemble on gen_answer instead of context

In [None]:
def dynamic_answer_evaluation(generated_answer, user_answer):
    # direct comparison
    if user_answer.strip().lower() == generated_answer.strip().lower():
        return 1.0, 1, "Correct! Your answer matches perfectly with the expected information."

    # cosine similarity ensemble
    embeddings1_gen = model1.encode([generated_answer])
    embeddings1_user = model1.encode([user_answer])
    score1 = cosine_similarity(embeddings1_gen, embeddings1_user)[0][0]

    embeddings2_gen = model2.encode([generated_answer])
    embeddings2_user = model2.encode([user_answer])
    score2 = cosine_similarity(embeddings2_gen, embeddings2_user)[0][0]

    final_score = (score1 + score2) / 2

    correct = 1 if final_score >= 0.65 else 0

    # user-facing feedback
    feedback = "Correct! Your answer is semantically aligned with the expected information." if correct else "Your answer might not fully capture the expected information. Consider revisiting the topic."

    return final_score, correct, feedback

In [None]:
### SIMULATED TESTS ###
generated_answers = {
    "Scenario 1": "Python's simple syntax.",
    "Scenario 2": "Readability and simplicity.",
    "Scenario 3": "Web development and data analysis.",  # Incorrect
    "Scenario 4": "Rapid development and scalability.",  # Incorrect
}

user_answers = {
    "Scenario 1": "Python is great for beginners due to its easy-to-understand syntax.",
    "Scenario 2": "The language is beginner-friendly because of its straightforward syntax.",
    "Scenario 3": "Python is mainly used for web development and data analysis.",  # Incorrect
    "Scenario 4": "It is appreciated for rapid development and scalability.",  # Incorrect
}

for scenario, user_answer in user_answers.items():
    generated_answer = generated_answers[scenario]
    print(f"\n{scenario}:")
    score, correct, feedback = dynamic_answer_evaluation(generated_answer, user_answer)
    print(f"Score: {score}\nCorrect: {correct}\nFeedback: {feedback}")


Scenario 1:
Score: 0.64680016040802
Correct: 0
Feedback: Your answer might not fully capture the expected information. Consider revisiting the topic.

Scenario 2:
Score: 0.4919956624507904
Correct: 0
Feedback: Your answer might not fully capture the expected information. Consider revisiting the topic.

Scenario 3:
Score: 0.5621111989021301
Correct: 0
Feedback: Your answer might not fully capture the expected information. Consider revisiting the topic.

Scenario 4:
Score: 0.7794828414916992
Correct: 1
Feedback: Correct! Your answer is semantically aligned with the expected information.


### experiment 3: use a pretrained model trained specifically for answer evaluation

In [None]:
from transformers import pipeline

# initialize the classifier pipeline
classifier = pipeline("text-classification", model="Giyaseddin/distilroberta-base-finetuned-short-answer-assessment", return_all_scores=True)

In [None]:
def evaluate_answer(context, question, ref_answer, student_answer):
    # as specified in the documentation
    body = " [SEP] ".join([context, question, ref_answer, student_answer])

    # get raw classification results
    raw_results = classifier([body])

    # mapping from label IDs
    _LABELS_ID2NAME = {0: "correct", 1: "correct_but_incomplete", 2: "contradictory", 3: "incorrect"}

    # Parse and format results
    results = []
    for result in raw_results:
        for score in result:
            results.append({_LABELS_ID2NAME[int(score["label"][-1:])]: score["score"]})

    return results

In [None]:
### SIMULATED TESTS ###

examples = [
    {
        "context": "A ball is dropped from a height and falls under the influence of gravity.",
        "question": "What forces act on the ball as it falls?",
        "ref_answer": "Gravity acts on the ball.",
        "student_answers": [
            "Only gravity acts on the ball.",  # Correct
            "Gravity and air resistance.",  # Correct but incomplete
            "No forces act on the ball.",  # Contradictory
            "Magnetic forces act on the ball."  # Incorrect
        ]
    },
    {
        "context": "An electric circuit consists of a battery connected to a resistor.",
        "question": "What happens to electrons in the circuit?",
        "ref_answer": "Electrons flow from the negative to the positive terminal of the battery.",
        "student_answers": [
            "Electrons move through the circuit.",  # Correct
            "Electrons oscillate in the wire.",  # Contradictory
            "Electrons are stationary.",  # Incorrect
            "Electrons flow from positive to negative."  # Correct but incomplete
        ]
    }
]

# Evaluate
for example in examples:
    context, question, ref_answer = example["context"], example["question"], example["ref_answer"]
    print(f"\nContext: {context}\nQuestion: {question}\nReference Answer: {ref_answer}\n")
    for student_answer in example["student_answers"]:
        results = evaluate_answer(context, question, ref_answer, student_answer)
        print(f"Student Answer: {student_answer}\nEvaluation: {results}\n")


Context: A ball is dropped from a height and falls under the influence of gravity.
Question: What forces act on the ball as it falls?
Reference Answer: Gravity acts on the ball.

Student Answer: Only gravity acts on the ball.
Evaluation: [{'correct': 0.005690970458090305}, {'correct_but_incomplete': 0.9135565757751465}, {'contradictory': 0.08016437292098999}, {'incorrect': 0.0005880309618078172}]

Student Answer: Gravity and air resistance.
Evaluation: [{'correct': 0.0014699019957333803}, {'correct_but_incomplete': 0.9963841438293457}, {'contradictory': 0.000541943998541683}, {'incorrect': 0.0016039920737966895}]

Student Answer: No forces act on the ball.
Evaluation: [{'correct': 0.00016824981139507145}, {'correct_but_incomplete': 0.0808926448225975}, {'contradictory': 0.7996024489402771}, {'incorrect': 0.1193365603685379}]

Student Answer: Magnetic forces act on the ball.
Evaluation: [{'correct': 0.00016559119103476405}, {'correct_but_incomplete': 0.0762305036187172}, {'contradictor

In [None]:
### SIMULATED TESTS ###
test_scenarios = {
    "1": {
        "context": "Python is a high-level programming language known for its readability and efficiency.",
        "question": "What makes Python suitable for beginners?",
        "ref_answer": "Python's simple syntax and readability make it accessible for newcomers to programming.",
        "student_answer": "Python is great for beginners due to its easy syntax."  # Should be 'correct'
    },
    "2": {
        "context": "Python supports multiple programming paradigms, including procedural, object-oriented, and functional programming.",
        "question": "What programming paradigms does Python support?",
        "ref_answer": "Python is versatile, supporting procedural, object-oriented, and functional programming.",
        "student_answer": "Python supports object-oriented programming."  # Might be 'correct_but_incomplete'
    },
    "3": {
        "context": "Guido van Rossum created Python with the idea of emphasizing code readability.",
        "question": "Who created Python and why?",
        "ref_answer": "Python was created by Guido van Rossum to emphasize readability.",
        "student_answer": "Python was made by Linus Torvalds for system programming."  # Should be 'incorrect' or 'contradictory'
    },
    "4": {
        "context": "Python's dynamic type system and automatic memory management support a wide range of programming paradigms.",
        "question": "What features support Python's programming paradigms?",
        "ref_answer": "The dynamic type system and automatic memory management facilitate diverse paradigms.",
        "student_answer": "Python uses static typing for efficiency."  # Should be 'incorrect' or 'contradictory'
    }
}

# Evaluate
for scenario, details in test_scenarios.items():
    results = evaluate_answer(**details)
    print(f"Scenario {scenario}: {results}")

Scenario 1: [{'correct': 0.0004157705116085708}, {'correct_but_incomplete': 0.0016016713343560696}, {'contradictory': 0.00038914362085051835}, {'incorrect': 0.997593343257904}]
Scenario 2: [{'correct': 0.0008491454063914716}, {'correct_but_incomplete': 0.0014567967737093568}, {'contradictory': 0.00011012798495357856}, {'incorrect': 0.9975838661193848}]
Scenario 3: [{'correct': 0.0031216482166200876}, {'correct_but_incomplete': 0.4721429646015167}, {'contradictory': 0.0008069787872955203}, {'incorrect': 0.5239284038543701}]
Scenario 4: [{'correct': 0.0008858765941113234}, {'correct_but_incomplete': 0.0008691618568263948}, {'contradictory': 0.0001318651920882985}, {'incorrect': 0.9981131553649902}]
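
To read these results more easily, the list of single-key dicts returned by `evaluate_answer` can be reduced to its highest-scoring label. A minimal sketch, using the Scenario 1 scores printed above; the helper name `top_label` is hypothetical.

```python
def top_label(results):
    """results: list of single-key dicts, e.g. [{"correct": 0.1}, ...].

    Return (label, score) for the highest-scoring label.
    """
    flat = {label: score for item in results for label, score in item.items()}
    best = max(flat, key=flat.get)
    return best, flat[best]


# Scenario 1 output from above; the test comment expected 'correct'
scenario_1 = [
    {"correct": 0.0004157705116085708},
    {"correct_but_incomplete": 0.0016016713343560696},
    {"contradictory": 0.00038914362085051835},
    {"incorrect": 0.997593343257904},
]

print(top_label(scenario_1))  # ('incorrect', 0.997593343257904)
```

Applied to the four scenarios above, the top label is 'incorrect' every time, including Scenarios 1 and 2, where the test comments expected 'correct' and 'correct_but_incomplete'.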


## Experiment 3.1: Combine previous approaches with the pretrained model

In [None]:
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

semantic_model = SentenceTransformer('all-MiniLM-L6-v2')
answer_assessment_model = pipeline("text-classification", model="Giyaseddin/distilroberta-base-finetuned-short-answer-assessment", return_all_scores=True)


In [None]:
def evaluate_answer(context, question, ref_answer, student_answer):
    # initialize evaluation and feedback
    evaluation_result = {"evaluation": None, "feedback": None}

    # Direct Comparison
    if student_answer.lower().strip() == ref_answer.lower().strip():
        evaluation_result["evaluation"] = "correct"
        evaluation_result["feedback"] = "Your answer is exactly correct."
        return evaluation_result

    # cosine similarity
    ref_embedding = semantic_model.encode(ref_answer)
    student_embedding = semantic_model.encode(student_answer)

    similarity = util.pytorch_cos_sim(ref_embedding, student_embedding)[0][0].item()

    # Apply thresholds
    if similarity > 0.8:
        evaluation_result["evaluation"] = "correct"
        evaluation_result["feedback"] = "Your understanding of the concept is on point."
        return evaluation_result
    elif similarity < 0.4:
        evaluation_result["evaluation"] = "incorrect"
        evaluation_result["feedback"] = "There seems to be a misunderstanding of the concept."
        return evaluation_result

    # DistilRoBERTa pretrained
    body = " [SEP] ".join([context, question, ref_answer, student_answer])
    raw_results = answer_assessment_model([body])
    best_result = max(raw_results[0], key=lambda x: x['score'])
    distilroberta_label = int(best_result['label'][-1])

    if distilroberta_label == 0:  # Correct
        evaluation_result["evaluation"] = "correct"
        evaluation_result["feedback"] = "Your answer aligns well with the core concept."
    elif distilroberta_label == 1:  # Correct but Incomplete
        evaluation_result["evaluation"] = "not_exactly"
        evaluation_result["feedback"] = "You're on the right track, but some details are missing."
    elif distilroberta_label == 2:  # Contradictory
        evaluation_result["evaluation"] = "incorrect"
        evaluation_result["feedback"] = "Your answer seems to contradict the key concept."
    else:  # Incorrect
        evaluation_result["evaluation"] = "incorrect"
        evaluation_result["feedback"] = "The provided answer doesn't match the expected concept."

    return evaluation_result
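
The function above only falls through to the DistilRoBERTa classifier when the cosine similarity lands between the two thresholds. As a minimal sketch, that routing can be isolated into a small helper and checked without loading any model. `route_by_similarity` and its return strings are hypothetical names used for illustration; the 0.8 and 0.4 cutoffs are copied from the function above.

```python
def route_by_similarity(similarity, high=0.8, low=0.4):
    """Name the stage that decides the outcome for a given cosine similarity."""
    if similarity > high:
        return "correct"      # decided by the upper similarity threshold
    if similarity < low:
        return "incorrect"    # decided by the lower similarity threshold
    return "classifier"       # ambiguous band: defer to the pretrained model


for s in (0.95, 0.8, 0.55, 0.4, 0.1):
    print(s, route_by_similarity(s))
```

Because both comparisons are strict, a similarity of exactly 0.8 or 0.4 is handed to the classifier rather than decided by the thresholds.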

In [None]:
### SIMULATED TESTS ###
scenarios = {
    "Example 1": {
        "context": "A ball is thrown upwards and comes back down.",
        "question": "What forces act on the ball?",
        "ref_answer": "Gravity",
        "student_answer": "Gravity"
    }
}

# evaluate
for name, details in scenarios.items():
    result = evaluate_answer(**details)  # This now captures the dictionary returned by evaluate_answer
    print(f"Scenario: {name}\nEvaluation: {result['evaluation']}\nFeedback: {result['feedback']}\nReference Answer: {details['ref_answer']}\n")

Scenario: Example 1
Evaluation: correct
Feedback: Your answer is exactly correct.
Reference Answer: Gravity



In [None]:
### SIMULATED TESTS ###
scenarios = {
    "Direct Match": {
        "context": "Gravity pulls objects towards the center of the Earth.",
        "question": "What force pulls objects towards Earth?",
        "ref_answer": "Gravity",
        "student_answer": "Gravity"  # pass at direct comparison step
    },
    "Semantic Similarity": {
        "context": "Photosynthesis is a process used by plants to convert light into energy.",
        "question": "How do plants create energy?",
        "ref_answer": "Through photosynthesis.",
        "student_answer": "Using light to produce energy."  # pass at semantic similarity step
    },
    "Nuanced Assessment - Correct": {
        "context": "The Earth revolves around the Sun in an elliptical orbit.",
        "question": "What path does Earth follow around the Sun?",
        "ref_answer": "An elliptical orbit",
        "student_answer": "A slightly oval path"  # might need assessment from pretrained
    },
    "Nuanced Assessment - Incorrect": {
        "context": "Evaporation is the process where liquid water turns into water vapor.",
        "question": "How does water turn into vapor?",
        "ref_answer": "Through evaporation.",
        "student_answer": "By freezing"  # Clearly incorrect, to test the model's ability to identify incorrect answers
    }
}

# evaluate each scenario
def evaluate_all_scenarios(scenarios):
    for scenario_name, scenario_details in scenarios.items():
        print(f"\nEvaluating Scenario: {scenario_name}")
        result = evaluate_answer(**scenario_details)  # 'result' now holds the dictionary returned by 'evaluate_answer'
        print(f"Evaluation: {result['evaluation']}\nFeedback: {result['feedback']}\n")

evaluate_all_scenarios(scenarios)


Evaluating Scenario: Direct Match
Evaluation: correct
Feedback: Your answer is exactly correct.


Evaluating Scenario: Semantic Similarity
Evaluation: incorrect
Feedback: The provided answer doesn't match the expected concept.


Evaluating Scenario: Nuanced Assessment - Correct
Evaluation: correct
Feedback: Your answer aligns well with the core concept.


Evaluating Scenario: Nuanced Assessment - Incorrect
Evaluation: not_exactly
Feedback: You're on the right track, but some details are missing.
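
Each feedback message in `evaluate_answer` is produced by exactly one branch, so the message printed for a scenario also reveals which stage made the decision. A minimal sketch of that reverse lookup follows. `FEEDBACK_STAGE` and `stage_for` are hypothetical names, and the message strings are copied from the function defined earlier.

```python
# Map each feedback message back to the branch of evaluate_answer that emits it
FEEDBACK_STAGE = {
    "Your answer is exactly correct.": "direct comparison",
    "Your understanding of the concept is on point.": "cosine similarity > 0.8",
    "There seems to be a misunderstanding of the concept.": "cosine similarity < 0.4",
    "Your answer aligns well with the core concept.": "DistilRoBERTa label 0",
    "You're on the right track, but some details are missing.": "DistilRoBERTa label 1",
    "Your answer seems to contradict the key concept.": "DistilRoBERTa label 2",
    "The provided answer doesn't match the expected concept.": "DistilRoBERTa label 3",
}


def stage_for(feedback):
    """Return the deciding stage for a feedback message, or 'unknown'."""
    return FEEDBACK_STAGE.get(feedback, "unknown")


# Message printed for the 'Semantic Similarity' scenario above
print(stage_for("The provided answer doesn't match the expected concept."))
```

Read this way, the 'Semantic Similarity' scenario was not settled by the similarity thresholds: its score fell in the 0.4 to 0.8 band and the classifier then labelled it incorrect.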



In [None]:
### SIMULATED TESTS ###
scenarios = {
    "Direct Match": {
        "context": "A force that pulls objects towards the center of the Earth.",
        "question": "What force pulls objects towards Earth?",
        "ref_answer": "Gravity",
        "student_answer": "Gravity"  # pass at direct match
    },
    "Correct but Different Word": {
        "context": "The process plants use to convert sunlight into food.",
        "question": "How do plants make their food?",
        "ref_answer": "Photosynthesis",
        "student_answer": "Through sunlight conversion"  # pass at semantic similarity (> 80%)
    },
    "Somewhat Related": {
        "context": "The organ in the human body that pumps blood.",
        "question": "What is the role of the heart?",
        "ref_answer": "Pumps blood",
        "student_answer": "Helps in circulation"  # should not exceed 80%
    },
    "DistilRoBERTa - Correct": {
        "context": "A change in velocity over time.",
        "question": "What is acceleration?",
        "ref_answer": "Change in speed",
        "student_answer": "Speed variation"  # assessed by DistilRoBERTa as correct
    },
    "DistilRoBERTa - Correct but Incomplete": {
        "context": "The body's response to foreign pathogens.",
        "question": "How does the immune system react to viruses?",
        "ref_answer": "By creating antibodies",
        "student_answer": "Fights off infections"  # incomplete
    },
    "DistilRoBERTa - Contradictory": {
        "context": "The celestial body that orbits the Earth.",
        "question": "What orbits the Earth?",
        "ref_answer": "The Moon",
        "student_answer": "The Sun"  # contradictory
    },
    "DistilRoBERTa - Incorrect": {
        "context": "Molecules composed of two hydrogen atoms and one oxygen atom.",
        "question": "What is water made of?",
        "ref_answer": "H2O",
        "student_answer": "Carbon dioxide"  #  incorrect
    },
    "Wrong": {
        "context": "The process by which green plants create their food.",
        "question": "What is photosynthesis?",
        "ref_answer": "Conversion of light energy into chemical energy",
        "student_answer": "Water cycle"  #  incorrect
    },
    "Unrelated": {
        "context": "The largest planet in the solar system.",
        "question": "What is Jupiter known for?",
        "ref_answer": "Its size",
        "student_answer": "Pythagorean theorem"  #  unrelated
    }
}

def evaluate_all_scenarios(scenarios):
    for scenario_name, scenario_details in scenarios.items():
        print(f"\nEvaluating Scenario: {scenario_name}")
        result = evaluate_answer(**scenario_details)
        print(f"Result: {result['evaluation']}\nFeedback: {result['feedback']}\nReference Answer: {scenario_details['ref_answer']}\n")

evaluate_all_scenarios(scenarios)



Evaluating Scenario: Direct Match
Result: correct
Feedback: Your answer is exactly correct.
Reference Answer: Gravity


Evaluating Scenario: Correct but Different Word
Result: incorrect
Feedback: The provided answer doesn't match the expected concept.
Reference Answer: Photosynthesis


Evaluating Scenario: Somewhat Related
Result: incorrect
Feedback: The provided answer doesn't match the expected concept.
Reference Answer: Pumps blood


Evaluating Scenario: DistilRoBERTa - Correct
Result: incorrect
Feedback: The provided answer doesn't match the expected concept.
Reference Answer: Change in speed


Evaluating Scenario: DistilRoBERTa - Correct but Incomplete
Result: incorrect
Feedback: There seems to be a misunderstanding of the concept.
Reference Answer: By creating antibodies


Evaluating Scenario: DistilRoBERTa - Contradictory
Result: incorrect
Feedback: Your answer seems to contradict the key concept.
Reference Answer: The Moon


Evaluating Scenario: DistilRoBERTa - Incorrect
Resul

# BE of User Feedback Function

In [None]:
def generate_user_feedback(evaluation_result, context):
    """
    Generates tailored user feedback by summarizing the evaluation feedback together with the context.

    Parameters:
    - evaluation_result (dict): Returned by evaluate_answer(); contains 'evaluation' and 'feedback'.
    - context (str): The context related to the question and answer.

    Returns:
    - str: Tailored user feedback based on the summarization of the feedback with context.
    """
    # Reuse the global summarizer_pipeline created under "Load Models"
    # instead of reloading the model on every call

    # concatenate feedback and context
    context_with_feedback = f"{evaluation_result['feedback']} [SEP] {context}"
    base_length = len(context_with_feedback.split())

    # threshold for when the input is too long
    length_threshold = 400

    if base_length > length_threshold:
        # If the input exceeds the threshold, summarize only the context to 75% of its original length
        max_length_output = int(len(context.split()) * 0.75)
        summarized_context = summarizer_pipeline(context, max_length=max_length_output, min_length=int(max_length_output / 2), do_sample=False)[0]['summary_text']
        # Prepend the feedback to the summarized context
        final_feedback = f"{evaluation_result['feedback']} [SEP] {summarized_context}"
    else:
        # If within threshold, include both feedback and context in summarization
        max_length_output = base_length  # summarization length equal to the base length
        final_feedback = summarizer_pipeline(context_with_feedback, max_length=max_length_output, min_length=base_length, do_sample=False)[0]['summary_text']

    return final_feedback

In [None]:
def generate_user_feedback(evaluation_result, context):
    """
    Prepends the evaluation feedback to a summary of the context.

    Reuses tokenizer_sum and summarizer_pipeline from "Load Models" rather than
    reloading them on every call.
    """
    separator = "\n\n"

    # Count context tokens with the summarizer's own tokenizer
    context_tokens = tokenizer_sum.tokenize(context)
    num_context_tokens = len(context_tokens)

    # threshold for when the input is too long
    token_threshold = 512

    # maximum length for summarization
    max_length_output = num_context_tokens  # use the context token count unless it exceeds the threshold
    if num_context_tokens > token_threshold:
        max_length_output = int(num_context_tokens * 0.75)  # Summarize to 75% of the context token count

    # Summarize context
    summarized_context = summarizer_pipeline(context, max_length=max_length_output, min_length=int(max_length_output / 2), do_sample=False)[0]['summary_text']

    # Prepend the feedback to the summarized context
    final_feedback = f"{evaluation_result['feedback']}{separator}{summarized_context}"

    return final_feedback
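
The length arithmetic in the function above can be checked on its own, without loading the summarizer. The sketch below mirrors that logic. `summary_length_bounds` is a hypothetical helper name, and the 512-token threshold and 0.75 ratio are copied from the function above.

```python
def summary_length_bounds(num_context_tokens, token_threshold=512, ratio=0.75):
    """Return (min_length, max_length) as derived in generate_user_feedback."""
    # Use the full token count unless the context exceeds the threshold
    max_length = num_context_tokens
    if num_context_tokens > token_threshold:
        max_length = int(num_context_tokens * ratio)
    # The minimum is half of the maximum, rounded down
    return int(max_length / 2), max_length


print(summary_length_bounds(12))    # (6, 12)
print(summary_length_bounds(512))   # (256, 512)
print(summary_length_bounds(600))   # (225, 450)
```

For contexts at or below the threshold, the maximum summary length equals the context's own token count, so the summarizer has no headroom beyond the input length.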


In [None]:
# test the previous scenarios

def test_user_feedback_with_scenarios(scenarios):
    for scenario_name, scenario_details in scenarios.items():
        print(f"\nScenario: {scenario_name}")
        evaluation_result = evaluate_answer(**scenario_details)
        feedback = generate_user_feedback(evaluation_result, scenario_details['context'])
        print(f"Feedback: {feedback}")


test_user_feedback_with_scenarios(scenarios)


Scenario: Direct Match
Feedback: Your answer is exactly correct.

force that pulls objects towards the center of the Earth.

Scenario: Correct but Different Word
Feedback: There seems to be a misunderstanding of the concept.

The process plants use to convert sunlight into food

Scenario: Somewhat Related
Feedback: There seems to be a misunderstanding of the concept.

the organ in the human body pumps blood.

Scenario: DistilRoBERTa - Correct
Feedback: There seems to be a misunderstanding of the concept.

a change in velocity over

Scenario: DistilRoBERTa - Correct but Incomplete
Feedback: There seems to be a misunderstanding of the concept.

the body's response to foreign pathogens

Scenario: DistilRoBERTa - Contradictory
Feedback: There seems to be a misunderstanding of the concept.

celestial body that orbits the Earth orbit

Scenario: DistilRoBERTa - Incorrect
Feedback: There seems to be a misunderstanding of the concept.

Molecules composed of two hydrogen atoms and one oxygen at

In [None]:
### SIMULATED TESTS ###

# Scenarios with varying context lengths (I don't know how long the contexts from other people's models will be)
long_scenarios = {
    "Direct Match": {
        "context": "Water boils at 100 degrees Celsius. This is a fundamental property of water.",
        "question": "At what temperature does water boil?",
        "ref_answer": "100 degrees Celsius",
        "student_answer": "100 degrees Celsius"  # direct match
    },
    "Correct but Different Wording": {
        "context": "Photosynthesis is the process through which plants use sunlight to synthesize nutrients from carbon dioxide and water. It involves the green pigment chlorophyll and generates oxygen as a byproduct.",
        "question": "How do plants create their food?",
        "ref_answer": "Through photosynthesis",
        "student_answer": "By converting sunlight into energy"  # different wording but correct
    },
    "Somewhat Related": {
        "context": "The heart pumps blood throughout the body. The circulatory system is essential for transporting nutrients and removing waste.",
        "question": "What is the function of the heart?",
        "ref_answer": "Pumps blood",
        "student_answer": "Circulates nutrients"  # not precise
    },
    "Middle Similarity but DistilRoBERTa - Correct": {
        "context": "Newton's third law states that for every action, there is an equal and opposite reaction.",
        "question": "What does Newton's third law state?",
        "ref_answer": "Equal and opposite reactions",
        "student_answer": "Forces come in pairs"
    },
    "Middle Similarity but DistilRoBERTa - Incorrect": {
        "context": "Evaporation is the process by which water changes from a liquid to a gas or vapor.",
        "question": "What is evaporation?",
        "ref_answer": "Water turning into gas",
        "student_answer": "Freezing of water"  # Incorrect
    },
    "DistilRoBERTa - Correct but Incomplete": {
        "context": "A catalyst is a substance that increases the rate of a chemical reaction without itself undergoing any permanent chemical change.",
        "question": "What is a catalyst?",
        "ref_answer": "Increases reaction rate without changing",
        "student_answer": "Speeds up reactions"  # incomplete
    },
    "DistilRoBERTa - Contradictory": {
        "context": "The moon orbits the Earth due to the gravitational pull of the Earth.",
        "question": "Why does the moon orbit the Earth?",
        "ref_answer": "Because of Earth's gravity",
        "student_answer": "Due to the sun's gravity"  # contradictory
    },
    "Wrong": {
        "context": "The mitochondrion is known as the powerhouse of the cell. It produces the energy currency of the cell through respiration.",
        "question": "Why is the mitochondrion called the powerhouse of the cell?",
        "ref_answer": "It produces energy",
        "student_answer": "It stores genetic information"  # Clearly wrong
    },
    "Unrelated": {
        "context": "The theory of relativity proposed by Einstein revolutionized the way we understand time and space.",
        "question": "What did Einstein's theory of relativity change?",
        "ref_answer": "Our understanding of time and space",
        "student_answer": "How plants grow"  # Completely unrelated
    }
}
test_user_feedback_with_scenarios(long_scenarios)


Scenario: Direct Match
Feedback: Your answer is exactly correct.

Water boils at 100 degrees Celsius. This is a fundamental property of water

Scenario: Correct but Different Wording
Feedback: There seems to be a misunderstanding of the concept.

photosynthesis is the process through which plants use sunlight to synthesize nutrients from carbon dioxide and water . it involves the green pigment chlorophyll and generates oxygen as a by

Scenario: Somewhat Related
Feedback: There seems to be a misunderstanding of the concept.

the heart pumps blood throughout the body. The circulatory system is essential for transporting nutrients and removing waste

Scenario: Middle Similarity but DistilRoBERTa - Correct
Feedback: There seems to be a misunderstanding of the concept.

Newton's third law states that for every action, there is an equal and opposite reaction

Scenario: Middle Similarity but DistilRoBERTa - Incorrect
Feedback: There seems to be a misunderstanding of the concept.

water chang