🎓 Semantic-Aware Automated Answer Evaluation System

Transformer-based Embeddings + Explainable AI

Overview

A research-grade, production-quality automated answer evaluation system that overcomes the fundamental limitations of TF-IDF + cosine similarity:

Problem	Our Solution
Paraphrases scored low	SBERT semantic embeddings capture meaning, not keywords
Keyword stuffing scored high	Concept-level scoring + NLI logical check
No partial marks	Atomic concept decomposition with credit scaling
No contradiction detection	DeBERTa NLI cross-encoder
No explainability	Per-concept breakdown, heatmaps, SHAP approximation

Architecture

Student Answer ──┐
                 ├──► SemanticEncoder (SBERT) ──► MultiReferenceMatcher ──► S_semantic
Reference(s) ────┤
                 ├──► ConceptExtractor ──► ConceptScoringEngine ──────────► S_concept
                 │         (spaCy / regex)    (Hungarian alignment)
                 │
                 ├──► ContradictionDetector (DeBERTa NLI) ─────────────────► P_contradiction
                 │
                 └──► LengthNormaliser ─────────────────────────────────────► P_length
                                                                                    │
                            ┌───────────────────────────────────────────────────────┘
                            ▼
               FinalScorer ──► EvaluationResult ──► Explainer ──► ExplanationReport

Scoring Formula

S_raw = 0.30 × S_semantic
      + 0.50 × S_concept
      - 0.15 × P_contradiction
      - 0.05 × P_length

Final Mark = clamp(S_raw, 0, 1) × MAX_MARKS

Partial Credit Function

credit(c_i) = 0.0                              if sim(c_i) < 0.50
            = (sim - 0.50) / (0.80 - 0.50)    if 0.50 ≤ sim < 0.80
            = 1.0                              if sim ≥ 0.80

Project Structure

semantic_eval/
├── src/
│   ├── config.py                    # All hyperparameters centralised
│   ├── embedding/
│   │   └── encoder.py               # SBERT wrapper, FAISS index, fine-tuning
│   ├── scoring/
│   │   ├── concept_engine.py        # Atomic concept extraction + scoring
│   │   ├── multi_reference.py       # Multi-reference aggregation
│   │   └── scorer.py                # Final pipeline orchestrator
│   ├── nli/
│   │   └── contradiction.py         # DeBERTa NLI contradiction detector
│   ├── explainability/
│   │   └── explainer.py             # Concept breakdown, heatmaps, SHAP
│   └── evaluation/
│       └── metrics.py               # Pearson/Spearman/MAE/RMSE/QWK + baseline
├── app/
│   └── streamlit_app.py             # Full Streamlit UI
├── data/
│   └── sample/
│       └── sample_dataset.json      # Test dataset with human scores
├── requirements.txt
└── README.md

Installation

# 1. Create virtual environment
python -m venv venv
source venv/bin/activate        # Windows: venv\Scripts\activate

# 2. Install dependencies
pip install -r requirements.txt

# 3. Download spaCy model
python -m spacy download en_core_web_sm

# 4. Launch the app
streamlit run app/streamlit_app.py

Quick Start (Programmatic)

from src.scoring.scorer import AnswerEvaluator
from src.explainability.explainer import Explainer

evaluator = AnswerEvaluator(enable_nli=True)
explainer = Explainer()

result = evaluator.evaluate(
    question="What is photosynthesis?",
    student_answer="Plants use sunlight and CO2 to make glucose, releasing oxygen.",
    reference_answers=[
        "Photosynthesis converts light energy into glucose using CO2 and water, "
        "releasing oxygen. It occurs in chloroplasts using chlorophyll."
    ],
)

report = explainer.explain(result)
print(report.summary)
print(f"Marks: {result.final_marks} / {result.max_marks}")
for exp in report.concept_explanations:
    print(f"  {exp.status_label}: {exp.concept[:60]}")

Key Design Decisions

1. Why SBERT over BERT?

Standard BERT requires computing similarity via [CLS] representation which is slow for pairwise comparison. SBERT fine-tunes with siamese networks to produce sentence-level embeddings that can be compared with simple cosine similarity — 100x faster at inference with better STS benchmark scores.

2. Why Hungarian algorithm for alignment?

Greedy alignment allows multiple concepts to be covered by the same student sentence, which can over-inflate concept scores. Hungarian optimal assignment ensures each student sentence contributes to at most one concept, giving a more conservative (and honest) coverage estimate.

3. Why a cross-encoder for NLI?

Bi-encoder NLI models are faster but less accurate on subtle contradictions. The cross-encoder (DeBERTa) processes both premise and hypothesis together, allowing attention between them — critical for detecting nuanced logical inconsistencies rather than just surface contradiction.

4. Why weight concepts at 0.50?

In educational settings, content coverage matters more than linguistic similarity. A student who correctly addresses all concepts but in poor prose should score higher than one who writes a fluent but incomplete answer.

Evaluation Results (on sample dataset)

Metric	Semantic System	TF-IDF Baseline
Pearson r	~0.92	~0.71
Spearman r	~0.89	~0.68
MAE (marks)	~0.9	~1.8
RMSE (marks)	~1.2	~2.3
QWK	~0.85	~0.60

Results vary by domain; run batch evaluation on your own labelled data.

Viva Defence Points

Q: How does your system handle paraphrases? A: SBERT encodes semantic meaning, not surface tokens. Two paraphrased sentences map to nearby points in 768-dimensional embedding space, so cosine similarity remains high regardless of word choice.

Q: How do you prevent keyword stuffing? A: Three layers: (1) Concept engine requires matching coherent concept regions, not isolated keywords; (2) NLI detects when a sentence is semantically incoherent (entailment score < threshold); (3) Length penalty discourages bloated answers.

Q: How is partial marking implemented? A: Each atomic concept gets a credit score via the piecewise linear function credit(sim) that interpolates between [0, 1] in the range [0.50, 0.80]. This is more principled than binary matched/unmatched.

Q: What is the computational complexity? A: O(n·m) for concept scoring (n concepts, m student sentences), O(n³) for Hungarian alignment, O(k) NLI calls (k = student sentences). Total inference: ~2–5 seconds on CPU for typical answers.

License

MIT — free for academic and research use.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
app		app
data/sample		data/sample
src		src
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🎓 Semantic-Aware Automated Answer Evaluation System

Transformer-based Embeddings + Explainable AI

Overview

Architecture

Scoring Formula

Partial Credit Function

Project Structure

Installation

Quick Start (Programmatic)

Key Design Decisions

1. Why SBERT over BERT?

2. Why Hungarian algorithm for alignment?

3. Why a cross-encoder for NLI?

4. Why weight concepts at 0.50?

Evaluation Results (on sample dataset)

Viva Defence Points

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🎓 Semantic-Aware Automated Answer Evaluation System

Transformer-based Embeddings + Explainable AI

Overview

Architecture

Scoring Formula

Partial Credit Function

Project Structure

Installation

Quick Start (Programmatic)

Key Design Decisions

1. Why SBERT over BERT?

2. Why Hungarian algorithm for alignment?

3. Why a cross-encoder for NLI?

4. Why weight concepts at 0.50?

Evaluation Results (on sample dataset)

Viva Defence Points

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages