<a href="https://colab.research.google.com/github/Prabhat-005/Automatic_Answer_Evaluation_System/blob/main/Automatic_Answer_Evaluation_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**AUTOMATIC ANSWER EVALUATION SYSTEM**

**Selected Project Track- AI IN PERSONALIZED LEARNING**



The **objective** of this project is to build an AI-based system that allows
students to compare their answers with an expected answer.
The system evaluates the response and helps the student understand
their current level and areas for improvement.



**Real-world relevance and motivation**


Students often practice answers before exams but do not get clear feedback
on how good their answers are. Even when an expected answer is available,
it is difficult to judge the quality of one’s own response.

I designed this project for students who want to improve their answers. A student can enter their own answer along with the expected answer, and the system evaluates how close the response is to the ideal one.

I started with a classical NLP baseline and then improved the system
using a semantic model to better understand the meaning of answers.

#Data Understanding & Preparation

This project does not use a fixed dataset.
Instead, I designed the system to work on user-provided inputs at runtime.

The inputs include:
- Expected (ideal) answer
- Student answer
- Maximum marks assigned to the question

This approach makes the system flexible and usable for any subject.
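
As a minimal sketch of how these three inputs might be bundled and checked before scoring, the block below uses a hypothetical `AnswerInput` dataclass. The class name and its validation rules are assumptions made for illustration; the notebook itself simply assigns three variables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnswerInput:
    """Hypothetical container for the three runtime inputs."""
    ideal_answer: str
    student_answer: str
    maximum_marks: float

    def __post_init__(self):
        # Assumed rule: both answers must contain some non-whitespace text.
        if not self.ideal_answer.strip():
            raise ValueError("ideal_answer must not be empty")
        if not self.student_answer.strip():
            raise ValueError("student_answer must not be empty")
        # Assumed rule: marks must be positive so that scaling is meaningful.
        if self.maximum_marks <= 0:
            raise ValueError("maximum_marks must be positive")


example = AnswerInput(
    ideal_answer="Artificial Intelligence is the simulation of human intelligence by machines.",
    student_answer="AI is about building machines that can think and learn like humans.",
    maximum_marks=5,
)
print(example.maximum_marks)
```

Failing early on an empty answer avoids passing an empty document to the vectorizer later in the pipeline.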


EXPECTED ANSWER: Artificial Intelligence refers to creating machines that can perform tasks requiring human intelligence, such as thinking, learning, and decision making.

Sentences you can use:

1.   HIGH TF-IDF SCORE

*   Artificial Intelligence is the simulation of human intelligence by machines.
*   Artificial involves simulation of intelligence by machines.

2.   LOW TF-IDF but HIGH TF-IDF + N-GRAM SCORE

*   Artificial Intelligence allows machine to think.
*   Artificial Intelligence gives machines intelligence.

3.   LOW TF-IDF + N-GRAM but HIGH SBERT SCORE


*   Artificial Intelligence is about building machines that can think and learn like humans.
*   Artificial Intelligence allows machines to think, learn and make decisions similar to humans.







#Text Cleaning and Preprocessing

Text cleaning is required for classical NLP models.
It removes unnecessary noise and helps the model focus on meaningful words.

The following steps are applied:
- Convert text to lowercase
- Remove punctuation and symbols
- Remove extra spaces
- Remove common stopwords

To begin cleaning the text, I will first import the necessary NLTK library and download the 'stopwords' corpus if it hasn't been downloaded already. This is a prerequisite for removing common stop words from the text.


In [None]:
import re
import string
import nltk

try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')

from nltk.corpus import stopwords

print("NLTK and stopwords are ready.")

NLTK and stopwords are ready.


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


**Reasoning**:
Now that the necessary NLTK resources are confirmed to be available, I will define a function to clean text by lowercasing it, removing punctuation, and removing stopwords.



In [None]:
def clean_text(text):
    # Convert text to lowercase
    text = text.lower()

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = text.split()
    text = ' '.join([word for word in words if word not in stop_words])

    return text

print("Text cleaning function 'clean_text' is defined.")

Text cleaning function 'clean_text' is defined.
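
The step list above also mentions removing extra spaces, which `clean_text` handles only implicitly through `split()` and `join()`. The variant below is a sketch that makes that step explicit with `re.sub` and builds the punctuation table once. The function name `clean_text_sketch` and the tiny `STOP_WORDS` set are assumptions chosen so the example runs without downloading NLTK data; the notebook uses `stopwords.words('english')`.

```python
import re
import string

# Assumption: a small hand-picked stopword set, used here only so the sketch
# runs offline. The notebook itself relies on NLTK's English stopword list.
STOP_WORDS = frozenset(
    {"a", "an", "and", "as", "about", "can", "is", "such", "that", "the", "to"}
)

# Build the punctuation-removal table once instead of on every call.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text_sketch(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, collapse whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(_PUNCT_TABLE)
    # Collapse runs of whitespace (spaces, tabs, newlines) into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    return " ".join(word for word in text.split(" ") if word and word not in stop_words)


print(clean_text_sketch("AI  is about   building machines, that can think!"))
```

Deleting punctuation outright joins hyphenated words (for example `decision-making` becomes `decisionmaking`), which may or may not be desirable for a given subject.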


**Reasoning**:
Now that the `clean_text` function is defined, enter the ideal answer, the student answer, and the maximum marks you want to use for the comparison.



In [None]:
ideal_answer = "Artificial Intelligence refers to creating machines that can perform tasks requiring human intelligence, such as thinking, learning, and decision making."
student_answer = "AI is about building machines that can think and learn like humans."
maximum_marks = 5

print("Ideal, student answers and maximum marks are defined.")

Ideal, student answers and maximum marks are defined.


**Reasoning**:
With the `clean_text` function and the `ideal_answer` and `student_answer` variables defined, I apply the cleaning function to both texts and store the results in new variables.



In [None]:
cleaned_ideal_answer = clean_text(ideal_answer)
cleaned_student_answer = clean_text(student_answer)

print("Original Ideal Answer:", ideal_answer)
print("Cleaned Ideal Answer:", cleaned_ideal_answer)
print("\nOriginal Student Answer:", student_answer)
print("Cleaned Student Answer:", cleaned_student_answer)

Original Ideal Answer: Artificial Intelligence refers to creating machines that can perform tasks requiring human intelligence, such as thinking, learning, and decision making.
Cleaned Ideal Answer: artificial intelligence refers creating machines perform tasks requiring human intelligence thinking learning decision making

Original Student Answer: AI is about building machines that can think and learn like humans.
Cleaned Student Answer: ai building machines think learn like humans


#Model / System Design

This project was built in three stages:


1.   **TF-IDF MODEL** - I use a basic NLP approach to evaluate answers. TF-IDF is used to convert the cleaned text into numerical vectors based on word importance.
Cosine similarity measures how close the two vectors are.
The similarity score is scaled using the maximum marks assigned to the question, and feedback is generated.

*   **PIPELINE** - Input Answers → Text Cleaning → TF-IDF Vectorization → Cosine Similarity → Marks Scaling → Feedback


2.   **TF-IDF + N-grams MODEL** - This improves the basic TF-IDF model by adding n-grams. After text cleaning, TF-IDF is applied using `ngram_range=(1, 2)`, which allows the model to capture important word pairs as well as single words. It can outperform plain TF-IDF when the student reuses phrases from the ideal answer.
Cosine similarity is then calculated between the TF-IDF vectors.
The similarity score is scaled according to the maximum marks, and feedback is generated.



*   **PIPELINE** - Input Answers → Text Cleaning → TF-IDF with N-Grams → Cosine Similarity → Marks Scaling → Feedback

3.   **SBERT MODEL** - In this pipeline, I use a pre-trained semantic model called SBERT.
Minimal text processing is applied since the model works on natural sentences.
Both answers are converted into sentence-level embeddings and compared using cosine similarity.


*   **PIPELINE** - Input Answers → Minimal Text Processing → SBERT Sentence Embeddings → Cosine Similarity → Marks Scaling → Feedback



This model performs best because it evaluates answers based on meaning and handles paraphrased responses effectively.
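
All three pipelines share the same final steps. As a compact restatement, let $\mathbf{a}$ and $\mathbf{b}$ be the vectors for the ideal and student answers and $M$ the maximum marks. These symbols are notation introduced here, not names used in the notebook.

```latex
\[
\operatorname{sim}(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}
         {\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert},
\qquad
\text{marks} = \operatorname{sim}(\mathbf{a}, \mathbf{b}) \times M,
\qquad
\text{percentage} = \operatorname{sim}(\mathbf{a}, \mathbf{b}) \times 100
\]
```

For example, a similarity of 0.0504 with $M = 5$ gives $0.0504 \times 5 = 0.252$ marks, or 5.04%.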


#**TF-IDF MODEL**

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

print("TfidfVectorizer imported successfully.")

TfidfVectorizer imported successfully.


**Reasoning**:
Now that `TfidfVectorizer` is imported, I will create a list of the cleaned answers, instantiate the vectorizer, and then fit and transform the texts to create TF-IDF vectors for both the ideal and student answers.



In [None]:
documents = [cleaned_ideal_answer, cleaned_student_answer]

vectorizer = TfidfVectorizer()

tfidf_matrix = vectorizer.fit_transform(documents)

tfidf_ideal_answer = tfidf_matrix[0:1]
tfidf_student_answer = tfidf_matrix[1:2]

print("TF-IDF vectors created for ideal and student answers.")
print("Shape of TF-IDF vector for ideal answer:", tfidf_ideal_answer.shape)
print("Shape of TF-IDF vector for student answer:", tfidf_student_answer.shape)

TF-IDF vectors created for ideal and student answers.
Shape of TF-IDF vector for ideal answer: (1, 19)
Shape of TF-IDF vector for student answer: (1, 19)


**Reasoning**:
To calculate cosine similarity, I need to import the `cosine_similarity` function from `sklearn.metrics.pairwise`.



In [None]:
from sklearn.metrics.pairwise import cosine_similarity

print("cosine_similarity function imported successfully.")

cosine_similarity function imported successfully.


**Reasoning**:
Now that the `cosine_similarity` function is imported and the TF-IDF vectors (`tfidf_ideal_answer`, `tfidf_student_answer`) are available, I will calculate the cosine similarity between them, print the result, and scale it by the maximum marks.



In [None]:
cosine_sin_score = cosine_similarity(tfidf_ideal_answer, tfidf_student_answer)[0][0]

print(f"Cosine Similarity Score (formatted): {cosine_sin_score:.4f}")

scaled_similarity_score = cosine_sin_score * maximum_marks

print(f"Maximum Score: {maximum_marks}")
print(f"Scaled Similarity Score: {scaled_similarity_score:.4f}")

Cosine Similarity Score (formatted): 0.0504
Maximum Score: 5
Scaled Similarity Score: 0.2520


**Reasoning**:
To generate feedback, I first convert the cosine similarity into a percentage, because the feedback thresholds (90, 70, and 50) are defined on a 0 to 100 scale. A `generate_feedback` function then uses conditional logic to return a different message for each score range.



In [None]:
final_score_percentage = cosine_sin_score * 100
print(f"Final Score Percentage: {final_score_percentage:.2f}%")

Final Score Percentage: 5.04%


In [None]:
def generate_feedback(score):
    if score >= 90:
        return "Excellent match! Your answer is very similar to the ideal answer."
    elif score >= 70:
        return "Good match. Your answer aligns well with the ideal answer."
    elif score >= 50:
        return "Fair match. Your answer shows some similarities but also significant differences from the ideal answer."
    else:
        return "Poor match. Your answer has significant discrepancies with the ideal answer."

print("Feedback generation function 'generate_feedback' is defined.")

Feedback generation function 'generate_feedback' is defined.


**Reasoning**:
Now that the `generate_feedback` function is defined, I will call it with `final_score_percentage` to obtain the feedback message and then print it.



In [None]:
feedback = generate_feedback(final_score_percentage)

print(f"Feedback: {feedback}")

Feedback: Poor match. Your answer has significant discrepancies with the ideal answer.


#Evaluation & Analysis

The TF-IDF model performs well when student answers use similar words
and phrases as the ideal answer.

However, it struggles with:
- Paraphrased answers
- Synonyms
- Very short but correct answers
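
To see why the TF-IDF score for the example pair is so low, it helps to look at raw vocabulary overlap. The sketch below is a rough proxy only: it applies plain set operations to the cleaned strings printed earlier instead of calling `TfidfVectorizer`, and the variable names are introduced here for illustration.

```python
# Cleaned answers copied from the preprocessing output shown earlier.
cleaned_ideal = (
    "artificial intelligence refers creating machines perform tasks "
    "requiring human intelligence thinking learning decision making"
)
cleaned_student = "ai building machines think learn like humans"

ideal_vocab = set(cleaned_ideal.split())
student_vocab = set(cleaned_student.split())

# Words that appear in both answers, and the combined vocabulary.
shared = ideal_vocab & student_vocab
vocabulary = ideal_vocab | student_vocab

print("Shared words:", sorted(shared))
print("Vocabulary size:", len(vocabulary))
```

Only `machines` is shared, out of 19 distinct words, which matches the 19 columns of the TF-IDF vectors above. Pairs such as `think`/`thinking` and `human`/`humans` count as different tokens because no stemming or lemmatization is applied, and `ai` never matches `artificial intelligence`.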

#**TF-IDF + N-GRAMS MODEL**


In this model, I improve the basic TF-IDF model by adding n-grams.
While the earlier model considers only single words, this model
also captures short phrases.

The goal is to better represent meaningful word combinations
that often appear in descriptive answers.


**Reasoning**:
To capture unigrams and bigrams, I will instantiate `TfidfVectorizer` with `ngram_range=(1, 2)`, then fit and transform the cleaned ideal and student answers to create their respective TF-IDF vectors, and finally print their shapes.



In [None]:
cleaned_ideal_answer = clean_text(ideal_answer)
cleaned_student_answer = clean_text(student_answer)

documents_ngram = [cleaned_ideal_answer, cleaned_student_answer]

vectorizer_ngram = TfidfVectorizer(ngram_range=(1, 2))  #ngram_range = (1, 2) allows the model to consider both single words and two-word phrases.

tfidf_matrix_ngram = vectorizer_ngram.fit_transform(documents_ngram)

tfidf_ideal_answer_ngram = tfidf_matrix_ngram[0:1]
tfidf_student_answer_ngram = tfidf_matrix_ngram[1:2]

print("TF-IDF vectors with unigrams and bigrams created for ideal and student answers.")
print("Shape of TF-IDF vector for ideal answer (n-gram):", tfidf_ideal_answer_ngram.shape)
print("Shape of TF-IDF vector for student answer (n-gram):", tfidf_student_answer_ngram.shape)


TF-IDF vectors with unigrams and bigrams created for ideal and student answers.
Shape of TF-IDF vector for ideal answer (n-gram): (1, 38)
Shape of TF-IDF vector for student answer (n-gram): (1, 38)


**Reasoning**:
To calculate the cosine similarity between the n-gram TF-IDF vectors, I will use the `cosine_similarity` function with `tfidf_ideal_answer_ngram` and `tfidf_student_answer_ngram`, then extract and print the scalar result as `cosine_sin_score_ngram`.



In [None]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sin_score_ngram = cosine_similarity(tfidf_ideal_answer_ngram, tfidf_student_answer_ngram)[0][0]

print(f"Cosine Similarity Score (n-gram): {cosine_sin_score_ngram:.4f}")

Cosine Similarity Score (n-gram): 0.0268


**Reasoning**:
To scale the cosine similarity score to a graded mark, I multiply the n-gram cosine similarity score by the maximum marks defined earlier and print the result. I also compute the percentage used by the feedback step.



In [None]:
scaled_ngram_similarity_score = cosine_sin_score_ngram * maximum_marks
print(f"Scaled n-gram Similarity Score: {scaled_ngram_similarity_score:.2f}")

# The percentage is based on the raw cosine similarity, not the marks-scaled score
final_score_percentage = cosine_sin_score_ngram * 100
print(f"Final Score Percentage: {final_score_percentage:.2f}%")

Scaled n-gram Similarity Score: 0.13
Final Score Percentage: 2.68%


**Reasoning**:
Feedback generation




In [None]:
def generate_feedback(score):
    if score >= 90:
        return "Excellent match! Your answer is very similar to the ideal answer."
    elif score >= 70:
        return "Good match. Your answer aligns well with the ideal answer."
    elif score >= 50:
        return "Fair match. Your answer shows some similarities but also significant differences from the ideal answer."
    else:
        return "Poor match. Your answer has significant discrepancies with the ideal answer."

feedback = generate_feedback(final_score_percentage)

print(f"Feedback: {feedback}")

Feedback: Poor match. Your answer has significant discrepancies with the ideal answer.


#Evaluation and Analysis

Although n-grams improve phrase matching, this model still depends heavily on word overlap. If a student uses different words or synonyms,
the model may give a low score even when the meaning is correct.

TF-IDF with n-grams captures word and phrase overlap but fails to understand semantic meaning, leading to incorrect grading for paraphrased or keyword-stuffed answers.
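
For the example pair, the n-gram score (0.0268) is lower than the unigram score (0.0504). The sketch below suggests one reason, under the assumption that a simple sliding window over the cleaned tokens approximates the bigrams that `TfidfVectorizer` builds. The helper `ngrams` and the variable names are hypothetical.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (joined by spaces) in a token sequence."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Cleaned answers copied from the preprocessing output shown earlier.
ideal_tokens = (
    "artificial intelligence refers creating machines perform tasks "
    "requiring human intelligence thinking learning decision making"
).split()
student_tokens = "ai building machines think learn like humans".split()

ideal_bigrams = set(ngrams(ideal_tokens, 2))
student_bigrams = set(ngrams(student_tokens, 2))
shared_bigrams = ideal_bigrams & student_bigrams

# Unigram features plus bigram features across both answers.
unigram_features = set(ideal_tokens) | set(student_tokens)
bigram_features = ideal_bigrams | student_bigrams
total_features = len(unigram_features) + len(bigram_features)

print("Shared bigrams:", sorted(shared_bigrams))
print("Total features:", total_features)
```

No bigram is shared, and the feature count of 38 matches the n-gram vector shape above. The extra bigram columns add weight to each vector without adding any overlap, which is consistent with the drop in cosine similarity.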

### **SBERT MODEL**

I use a pre-trained Sentence-BERT (SBERT) model to perform semantic
answer evaluation.
The model is loaded using `SentenceTransformer("all-MiniLM-L6-v2")`.
This model converts an entire sentence into a fixed-length vector
that represents its meaning. It is lightweight, fast, and suitable for comparing short and
medium-length answers, which makes it appropriate for this task.


In [None]:
ideal_answer = "Artificial Intelligence refers to creating machines that can perform tasks requiring human intelligence, such as thinking, learning, and decision making."
student_answer = "AI is about building machines that can think and learn like humans."
maximum_marks = 5

print("Ideal answer, student answer, and maximum marks are defined.")

Ideal answer, student answer, and maximum marks are defined.


**Reasoning**:
I will define a function that lowercases the text and removes punctuation without heavy stop word removal or stemming, as SBERT works well with natural language, then apply it to the `ideal_answer` and `student_answer` and print the results.



In [None]:
import string

def minimal_clean_text(text):
    # Convert text to lowercase
    text = text.lower()

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    return text

min_cleaned_ideal_answer = minimal_clean_text(ideal_answer)
min_cleaned_student_answer = minimal_clean_text(student_answer)

print("Original Ideal Answer:", ideal_answer)
print("Minimally Cleaned Ideal Answer:", min_cleaned_ideal_answer)
print("\nOriginal Student Answer:", student_answer)
print("Minimally Cleaned Student Answer:", min_cleaned_student_answer)

Original Ideal Answer: Artificial Intelligence refers to creating machines that can perform tasks requiring human intelligence, such as thinking, learning, and decision making.
Minimally Cleaned Ideal Answer: artificial intelligence refers to creating machines that can perform tasks requiring human intelligence such as thinking learning and decision making

Original Student Answer: AI is about building machines that can think and learn like humans.
Minimally Cleaned Student Answer: ai is about building machines that can think and learn like humans


**Reasoning**:
To load the pre-trained SBERT model, I first need to ensure the 'sentence-transformers' library is installed, then import `SentenceTransformer` and load the 'all-MiniLM-L6-v2' model.



In [None]:
try:
    from sentence_transformers import SentenceTransformer
except ImportError:
    !pip install sentence-transformers
    from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

print("SBERT model 'all-MiniLM-L6-v2' loaded successfully.")

SBERT model 'all-MiniLM-L6-v2' loaded successfully.


**Reasoning**:
Now that the SBERT model has been successfully loaded, I will use it to generate sentence embeddings for both the minimally cleaned ideal and student answers.

Each embedding captures the overall meaning of the sentence,
rather than individual word frequencies.


In [None]:
ideal_answer_embedding = model.encode(min_cleaned_ideal_answer)
student_answer_embedding = model.encode(min_cleaned_student_answer)

print("Sentence embeddings created for ideal and student answers.")
print("Shape of ideal answer embedding:", ideal_answer_embedding.shape)
print("Shape of student answer embedding:", student_answer_embedding.shape)

Sentence embeddings created for ideal and student answers.
Shape of ideal answer embedding: (384,)
Shape of student answer embedding: (384,)


**Reasoning**:
To compute the cosine similarity between the generated SBERT embeddings, I will import the `cosine_similarity` function from `sklearn.metrics.pairwise`.



In [None]:
from sklearn.metrics.pairwise import cosine_similarity

print("cosine_similarity function imported successfully.")

cosine_similarity function imported successfully.


**Reasoning**:
With the `cosine_similarity` function already imported and the `ideal_answer_embedding` and `student_answer_embedding` available, I will now compute the cosine similarity between these two embeddings. This quantifies how similar the two answers are in meaning.



In [None]:
sbert_similarity_score = cosine_similarity(ideal_answer_embedding.reshape(1, -1), student_answer_embedding.reshape(1, -1))[0][0]

print(f"SBERT Cosine Similarity Score: {sbert_similarity_score:.4f}")

SBERT Cosine Similarity Score: 0.7941


**Reasoning**:
With the SBERT cosine similarity score calculated, the next step is to convert it into a percentage so that it can be checked against the same feedback thresholds used by the earlier models.



In [None]:
scaled_sbert_score = sbert_similarity_score * 100
print(f"Final Score Percentage: {scaled_sbert_score:.2f}%")

Final Score Percentage: 79.41%


**Reasoning**:
Feedback is generated as before.


In [None]:
def generate_feedback(score):
    if score >= 90:
        return "Excellent match! Your answer is very similar to the ideal answer."
    elif score >= 70:
        return "Good match. Your answer aligns well with the ideal answer."
    elif score >= 50:
        return "Fair match. Your answer shows some similarities but also significant differences from the ideal answer."
    else:
        return "Poor match. Your answer has significant discrepancies with the ideal answer."

feedback_sbert = generate_feedback(scaled_sbert_score)

print(f"SBERT Feedback: {feedback_sbert}")

SBERT Feedback: Good match. Your answer aligns well with the ideal answer.
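
The same scoring and feedback logic is repeated for each model, so it could be gathered into one helper. The sketch below is one possible shape, not the notebook's implementation: `grade_answer` is a hypothetical name, the thresholds are copied from `generate_feedback`, and clamping the similarity to the range 0 to 1 is an added assumption, since cosine similarity between embeddings can be slightly negative.

```python
def grade_answer(similarity, maximum_marks):
    """Turn a cosine similarity into marks, a percentage, and feedback."""
    # Assumption: clamp to [0, 1] so a negative similarity cannot
    # produce negative marks.
    clamped = min(max(float(similarity), 0.0), 1.0)
    percentage = clamped * 100
    marks = clamped * maximum_marks

    # Thresholds copied from generate_feedback in the notebook.
    if percentage >= 90:
        feedback = "Excellent match! Your answer is very similar to the ideal answer."
    elif percentage >= 70:
        feedback = "Good match. Your answer aligns well with the ideal answer."
    elif percentage >= 50:
        feedback = ("Fair match. Your answer shows some similarities but also "
                    "significant differences from the ideal answer.")
    else:
        feedback = "Poor match. Your answer has significant discrepancies with the ideal answer."

    return {
        "marks": round(marks, 2),
        "percentage": round(percentage, 2),
        "feedback": feedback,
    }


# SBERT similarity and maximum marks taken from the cells above.
result = grade_answer(0.7941, 5)
print(result)
```

With the SBERT similarity of 0.7941 and a maximum of 5 marks, this yields 3.97 marks and the same "Good match" feedback as the cell above.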



**Result**

The SBERT-based approach effectively captured the semantic similarity between the ideal and student answers, providing a nuanced grade and feedback. This method offers a more robust evaluation compared to traditional lexical matching techniques like TF-IDF for tasks requiring semantic understanding.

#**COMPARISON TABLE**



| Aspect                     | TF-IDF            | TF-IDF + N-grams     | SBERT                        |
| -------------------------- | ----------------- | -------------------- | ---------------------------- |
| NLP approach               | Classical         | Classical (improved) | Semantic (Transformer-based) |
| Text representation        | Single words      | Words + phrases      | Sentence meaning             |
| Handles paraphrasing       | ❌ Poor            | ⚠️ Limited           | ✅ Good                       |
| Depends on keyword overlap | High              | High                 | Low                          |
| Similarity method          | Cosine similarity | Cosine similarity    | Cosine similarity            |
| Overall effectiveness      | Baseline          | Better baseline      | Best                         |


#**Ethical Considerations & Responsible AI**



*   **Bias and Fairness Considerations** - This system evaluates answers based on text similarity and meaning.
It may favor answers that are clearly written or longer in length. The model does not intentionally discriminate,
but like all AI systems, it can reflect bias based on how text is expressed.
Because of this, results should be interpreted as guidance, not absolute judgment.


*   **Dataset Limitations** - This project does not use a large labeled dataset.
The evaluation is based only on the expected answer provided by the user. This makes the system flexible, but also limits its accuracy
for complex or highly technical questions.



*   **Responsible Use of AI Tools** - This system is designed as a learning support tool. It helps students understand how close their answers are
to an expected response. It should not be used as the sole method for grading exams. Final evaluation decisions should always involve human judgment. The goal is to support learning, not replace teachers or examiners.





#**Conclusion & Future Scope**
**Summary**

In this project, I built an automatic answer evaluation system
to help students assess their answers during practice.
I first implemented a classical NLP baseline using TF-IDF.
I then improved it using TF-IDF with n-grams to capture phrases.
Finally, I used SBERT to evaluate answers based on semantic meaning.

The results show that:


*   TF-IDF works for keyword-based similarity
*   N-grams improve phrase matching
*   SBERT performs best for paraphrased and semantically similar answers

This progression clearly demonstrates the benefit of semantic models
in learning-based evaluation systems.


**Future Scope**

In the future, the system can be extended to accept answers in image form.
Students will be able to upload photos of their handwritten or printed answers instead of typing them.
This will make the system easier to use and closer to real exam practice.