# Candidate Evaluation Pipeline

This notebook demonstrates a unified pipeline that:
1. Loads both Technical and HR evaluation rubrics (pre‐generated).
2. Builds FAISS indexes for both Technical and HR question sets.
3. Defines helper functions for JSON extraction, rubric‐based scoring, and instructional‐style handling for HR.
4. Implements a single `evaluate_question_answer(...)` function that:
   - Detects whether a question is Technical or HR.
   - Checks for an **exact match** in the corresponding dataset. If found and its old score > 70, combines 70% old + 30% fresh rubric. If old score ≤ 70, uses 100% fresh rubric.
   - If no exact match, checks **relevance** among top‐3 FAISS neighbors. If relevant old questions exist, re-scores their stored answers with the rubric, averages those scores, and combines 30% average + 70% fresh rubric. Otherwise, uses 100% fresh rubric.
5. Randomly selects three questions (2 Technical, 1 HR) from a master list of eight (4 Technical, 4 HR), prompts for the candidate’s answer to each, and prints & saves a JSON summary.

Make sure to install required dependencies:

```bash
pip install pandas python-dotenv faiss-cpu langchain langchain-openai openai
```


## 1. Imports and Helper: Robust JSON Extraction


### 1.1 Imports


In [1]:
# 1.1 Imports
import os
import re
import random
import json
import pandas as pd
from dotenv import load_dotenv

from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain, RetrievalQA
from langchain.vectorstores import FAISS
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings

# 1.2 Helper: Robust JSON extraction for LLM outputs
def extract_json(text: str):
    """
    Finds the first balanced JSON object or array in `text` (by bracket matching),
    then returns the parsed Python object. Raises ValueError if none found.
    """
    for start_idx, ch in enumerate(text):
        if ch in ("{", "["):
            open_char = ch
            close_char = "}" if ch == "{" else "]"
            balance = 0
            for end_idx in range(start_idx, len(text)):
                if text[end_idx] == open_char:
                    balance += 1
                elif text[end_idx] == close_char:
                    balance -= 1
                    if balance == 0:
                        snippet = text[start_idx : end_idx + 1]
                        return json.loads(snippet)
            break
    raise ValueError(f"No complete JSON object/array found in LLM output:\n{text}")
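
The bracket counter above does not know about JSON strings, so a `}` or `]` inside a quoted value ends the snippet early, and the `break` means only the first `{` or `[` is ever tried. As a hedged alternative, the sketch below leans on the standard library's `json.JSONDecoder.raw_decode`, which parses one JSON value starting at a given index and ignores trailing text. The name `extract_json_lenient` is hypothetical and not part of the original notebook.

```python
import json

def extract_json_lenient(text: str):
    """Return the first parseable JSON object/array embedded in `text`."""
    decoder = json.JSONDecoder()
    for idx, ch in enumerate(text):
        if ch not in "{[":
            continue
        try:
            # raw_decode parses one JSON value starting at idx and
            # tolerates any extra text that follows it.
            obj, _end = decoder.raw_decode(text, idx)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this bracket; try the next candidate.
            continue
    raise ValueError("No complete JSON object/array found in LLM output")

reply = 'Sure, here it is:\n{"overall_score": 72.5, "note": "contains } in text"}\nHope that helps!'
parsed = extract_json_lenient(reply)
print(parsed["overall_score"])
```

Because `json.JSONDecodeError` is a subclass of `ValueError`, callers that already catch `ValueError` would not need to change.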


## 2. Load Environment Variables & Initialize AzureOpenAI Client

We load environment variables (`.env`) for AzureOpenAI and instantiate the `AzureChatOpenAI` client.


In [2]:
# 2. LOAD ENV AND AZURE OPENAI CLIENT
load_dotenv()  # copies variables from .env into os.environ
# Fail early with a clear message if a required variable is missing
for var in ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"):
    if not os.getenv(var):
        raise EnvironmentError(f"Missing environment variable: {var}")
os.environ["OPENAI_API_TYPE"] = "Azure"

llm = AzureChatOpenAI(
    openai_api_version="2023-12-01-preview",
    azure_deployment="GPT-4O-50-1",
)




## 3. Evaluation Helpers

- **`evaluate_with_rubric(...)`**: Prompts the LLM with a given rubric (JSON array of criteria) to score a single (question, answer) pair.  
- **`is_instructional(...)`**: Heuristic to detect if an HR “answer” is actually a set of meta‐instructions rather than a concrete response.  
- **`convert_to_sample_answer(...)`**: If the HR answer is instructional, uses the LLM to produce a concrete 1–2 paragraph sample answer. The notebook calls this helper in section 8 but does not define it.
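
A minimal sketch of how `convert_to_sample_answer(...)` might look. The prompt wording and the `generate` parameter (any callable that maps a prompt string to the model's reply) are assumptions made for illustration; the notebook itself passes the `llm` object as the third argument.

```python
from typing import Callable

# Hypothetical prompt; the original notebook does not show its wording.
SAMPLE_ANSWER_PROMPT = (
    "The text below is advice on how to answer an interview question, "
    "not an answer itself.\n\n"
    "Question:\n{question}\n\n"
    "Advice:\n{advice}\n\n"
    "Write a concrete first-person sample answer of one to two paragraphs "
    "that follows this advice. Return only the answer."
)

def convert_to_sample_answer(question: str, advice: str,
                             generate: Callable[[str], str]) -> str:
    """Turn meta-instructions into a concrete sample answer.

    `generate` is an assumed adapter: any function that sends a prompt
    string to the LLM and returns its text reply.
    """
    prompt = SAMPLE_ANSWER_PROMPT.format(question=question, advice=advice)
    reply = generate(prompt).strip()
    # Keep the original text if the model returns nothing usable.
    return reply or advice

# Stand-in for a real model call so the sketch runs offline.
def fake_generate(prompt: str) -> str:
    return "  In my last project I listened first, then proposed a compromise.  "

print(convert_to_sample_answer(
    "Tell me about a time you faced a conflict in a team.",
    "Best strategy: describe the situation, your action, and the result.",
    fake_generate,
))
```

With a LangChain chat model, `generate` could plausibly be a thin wrapper around the model call; the exact wrapper depends on the installed LangChain version and should be checked against its documentation.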


### 3.1 Evaluate with rubric


In [3]:
# 3.1 Evaluate with rubric (3× scoring + summary of rationales)
def evaluate_with_rubric(question: str, answer: str, rubric: list) -> dict:
    """
    Given a single question & answer and a rubric (list of {"name","description"}),
    calls the LLM three times for independent evaluations, then:
      - Averages each criterion's numeric "score" (rounded to two decimals).
      - Summarizes the three one‐sentence rationales into a single concise rationale.
      - Computes "overall_score" as the average of all averaged criterion scores.
    Returns a JSON object:
      {
        "scores": [
          {
            "name": <criterion name>,
            "score": <avg of 3 runs>,
            "explanation": <summarized rationale>
          },
          ...
        ],
        "overall_score": <float>
      }
    """
    # 1) Prepare the base prompt template (identical for all runs)
    rubric_str = json.dumps(rubric, indent=2)
    eval_prompt = PromptTemplate(
        input_variables=["rubric", "question", "answer"],
        template="""
You are an objective assessor. Here is a rubric (JSON array):
{rubric}

Now evaluate this response.

Question:
{question}

Answer:
{answer}

For each rubric item, produce an object with:
- "name"       (same as criterion)
- "score"      (integer 0–100)
- "explanation" (one-sentence rationale)

Then compute "overall_score" as the average of all scores.

Return **only** the final JSON object with keys "scores" and "overall_score".
""",
    )

    # 2) Run the LLM three times and parse each JSON output
    chain = LLMChain(llm=llm, prompt=eval_prompt)  # build once, reuse for every run
    runs = []
    for _ in range(3):
        raw = chain.run(
            rubric=rubric_str,
            question=question,
            answer=answer
        )
        runs.append(extract_json(raw))

    # 3) Combine the three runs:
    #    - Average numeric scores per criterion
    #    - Collect the three explanations per criterion
    first_scores = runs[0]["scores"]
    combined_scores = []

    for crit_obj in first_scores:
        crit_name = crit_obj["name"]

        # Collect the three scores and explanations for this criterion (matching by name)
        score_values = []
        explanations = []

        for run in runs:
            match = next(item for item in run["scores"] if item["name"] == crit_name)
            score_values.append(match["score"])
            explanations.append(match["explanation"])

        # 3a) Average the numeric scores
        avg_score = round(sum(score_values) / len(score_values), 2)

        # 3b) Summarize the three one‐sentence rationales into one concise sentence
        summary_prompt = PromptTemplate(
            input_variables=["exp0", "exp1", "exp2"],
            template="""
You have three one‐sentence rationales for the same evaluation criterion:
1) {exp0}
2) {exp1}
3) {exp2}

Please write a single, concise one‐sentence explanation that captures the essence of all three rationales.
Return **only** that one‐sentence summary.
""",
        )
        summary_chain = LLMChain(llm=llm, prompt=summary_prompt)
        raw_summary = summary_chain.run(
            exp0=explanations[0],
            exp1=explanations[1],
            exp2=explanations[2]
        )
        summarized_explanation = raw_summary.strip()

        combined_scores.append({
            "name": crit_name,
            "score": avg_score,
            "explanation": summarized_explanation
        })

    # 4) Compute overall_score as the average of averaged criterion scores
    if combined_scores:
        overall = round(
            sum(item["score"] for item in combined_scores) / len(combined_scores), 2
        )
    else:
        overall = 0.0

    return {
        "scores": combined_scores,
        "overall_score": overall
    }
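
The aggregation step can be exercised without any LLM calls. The sketch below applies the same per-criterion averaging to three hard-coded runs. `average_runs` is a hypothetical helper name, it leaves out the rationale summary, and unlike the cell above it simply skips a run that omits a criterion instead of raising.

```python
def average_runs(runs: list) -> dict:
    """Average per-criterion scores across several parsed LLM runs."""
    combined = []
    for crit in runs[0]["scores"]:
        name = crit["name"]
        # Collect this criterion's score from every run that reports it.
        values = [
            item["score"]
            for run in runs
            for item in run["scores"]
            if item["name"] == name
        ]
        combined.append({"name": name, "score": round(sum(values) / len(values), 2)})
    if combined:
        overall = round(sum(c["score"] for c in combined) / len(combined), 2)
    else:
        overall = 0.0
    return {"scores": combined, "overall_score": overall}

# Made-up scores standing in for three parsed LLM replies.
runs = [
    {"scores": [{"name": "Clarity", "score": 40}, {"name": "Relevance", "score": 60}]},
    {"scores": [{"name": "Clarity", "score": 30}, {"name": "Relevance", "score": 50}]},
    {"scores": [{"name": "Clarity", "score": 40}, {"name": "Relevance", "score": 50}]},
]
report = average_runs(runs)
print(report)
```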


## 4. Load Rubrics & Old Scores JSONs

- **Technical**  
  - `Data/Technical/tech_rubric.json`: JSON array of 5–8 technical evaluation criteria.  
  - `Data/Technical/tech_evaluation_results.json`: Pre‐computed LLM‐evaluated results for `Data/Technical/dataset.csv`.  
- **HR**  
  - `Data/HR/hr_rubric.json`: JSON array of 5–8 HR evaluation criteria.  
  - `Data/HR/hr_evaluation_results_with_samples.json`: Pre‐computed HR evaluation results.


In [4]:
# 4.1 Technical rubric & old results
TECH_RUBRIC_PATH      = "Data/Technical/tech_rubric.json"
TECH_OLD_RESULTS_PATH = "Data/Technical/tech_evaluation_results.json"

with open(TECH_RUBRIC_PATH, "r", encoding="utf-8") as f:
    tech_rubric = json.load(f)
with open(TECH_OLD_RESULTS_PATH, "r", encoding="utf-8") as f:
    tech_old = json.load(f)

# Build question → old overall_score map for Technical
tech_past_scores = {
    entry["question"]: entry["evaluation"].get("overall_score", 0.0)
    for entry in tech_old
}

# 4.2 HR rubric & old results
HR_RUBRIC_PATH      = "Data/HR/hr_rubric.json"
HR_OLD_RESULTS_PATH = "Data/HR/hr_evaluation_results_with_samples.json"

with open(HR_RUBRIC_PATH, "r", encoding="utf-8") as f:
    hr_rubric = json.load(f)
with open(HR_OLD_RESULTS_PATH, "r", encoding="utf-8") as f:
    hr_old = json.load(f)

# Build question → old overall_score map for HR
hr_past_scores = {
    entry["question"]: entry["evaluation"].get("overall_score", 0.0)
    for entry in hr_old
}
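
The maps above are keyed by the exact question text, so a lookup misses if the text returned by the match chain differs in case or whitespace. One possible mitigation, shown as a sketch with made-up entries, is to normalise keys on both sides. `normalize_question` and `build_score_map` are hypothetical helpers, and the entry shape (`question`, `evaluation.overall_score`) is inferred from the comprehensions above.

```python
def normalize_question(text: str) -> str:
    """Lower-case and collapse whitespace so near-identical strings match."""
    return " ".join(text.lower().split())

def build_score_map(entries: list) -> dict:
    """Map normalised question text to its stored overall_score."""
    return {
        normalize_question(e["question"]): e.get("evaluation", {}).get("overall_score", 0.0)
        for e in entries
    }

# Made-up entries in the shape the notebook reads from the results JSON.
old_entries = [
    {"question": "What is overfitting?", "evaluation": {"overall_score": 82.5}},
    {"question": "Explain  gradient descent.", "evaluation": {}},
]
score_map = build_score_map(old_entries)
print(score_map.get(normalize_question("  what is Overfitting? "), 0.0))
```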


## 5. Load Datasets & Build FAISS Indexes

- **Technical dataset**  
  - `Data/Technical/dataset.csv` should contain columns `"question","answer"`.  
  - We build a FAISS index on `question` texts for exact‐match/relevance checks.

- **HR dataset**  
  - `Data/HR/interview_best_answers_samples.csv` should contain `"question","answer"`.  
  - We similarly build a FAISS index on HR questions.
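
To make the retrieval step concrete without calling an embedding service, here is a toy stand-in: bag-of-words vectors compared by cosine similarity, returning the three closest questions. It only illustrates what a top-3 nearest-neighbour lookup does. It is not FAISS, it does not use Azure OpenAI embeddings, and the sample questions are illustrative.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: word counts (a real pipeline uses dense vectors)."""
    return Counter(text.lower().replace("?", "").split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k(query: str, corpus: list, k: int = 3) -> list:
    """Return the k corpus entries most similar to the query."""
    q_vec = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q_vec, embed(doc)), reverse=True)
    return ranked[:k]

questions = [
    "Which feature selection techniques do you know?",
    "How can we use machine learning for text classification?",
    "What is overfitting?",
    "Describe feature extraction techniques for audio",
]
print(top_k("Which feature extraction techniques work for audio?", questions))
```

A nearest-neighbour lookup always returns `k` results, even when none is a good match, which is why the pipeline adds separate exact-match and relevance checks on top of it.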


### 5.1 Technical dataset & FAISS index


In [5]:
# 5.1 Technical dataset & FAISS index
TECH_CSV_PATH = "Data/Technical/dataset.csv"  # contains "question","answer"
df_tech = pd.read_csv(TECH_CSV_PATH).dropna(subset=["question", "answer"])
tech_questions = df_tech["question"].astype(str).tolist()

embeddings = AzureOpenAIEmbeddings(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    openai_api_key=os.getenv("AZURE_OPENAI_API_KEY"),
)
tech_vectorstore = FAISS.from_texts(tech_questions, embeddings)
tech_retriever   = tech_vectorstore.as_retriever(search_kwargs={"k": 3})

# 5.2 HR dataset & FAISS index
HR_CSV_PATH = "Data/HR/interview_best_answers_samples.csv"
df_hr = pd.read_csv(HR_CSV_PATH).dropna(subset=["question", "answer"])
hr_questions = df_hr["question"].astype(str).tolist()

hr_vectorstore = FAISS.from_texts(hr_questions, embeddings)
hr_retriever   = hr_vectorstore.as_retriever(search_kwargs={"k": 3})


## 6. Set Up EXACT‐MATCH and RELEVANCE Chains

For both Technical and HR, we create:
1. An **exact‐match** `RetrievalQA` chain to see if the new question exactly matches one in our dataset.  
2. A **relevance** `LLMChain` that asks the model, “Of these top‐3 retrieved, which are truly relevant?” and returns a JSON array.
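
Because the exact-match chain answers in free text, its reply has to be parsed before the matched question can be looked up. Below is a small sketch of such a parser, assuming the model follows the `YES: "<matched question>"` / `NO` format requested in the prompts. `parse_match_reply` is a hypothetical name.

```python
import re
from typing import Optional

def parse_match_reply(reply: str) -> Optional[str]:
    """Return the matched question text, or None when there is no usable match."""
    cleaned = reply.strip()
    if not cleaned.upper().startswith("YES"):
        return None
    # Accept straight or curly quotes around the matched question.
    m = re.search(r'YES:\s*["“](.*)["”]', cleaned, flags=re.IGNORECASE | re.DOTALL)
    return m.group(1).strip() if m else None

print(parse_match_reply('YES: "What is overfitting?"'))
print(parse_match_reply("NO"))
```

Returning `None` for a bare `YES` without a quoted question lets the caller treat a malformed reply as "no match" instead of looking up an empty key.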


### 6.1 Technical EXACT‐MATCH chain


In [6]:
# 6.1 Technical EXACT‐MATCH chain
tech_match_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="""
You are an assistant that decides if a new Technical question exactly matches one in the dataset.
From these retrieved questions:
{context}

New Question:
{question}

Respond with **exactly**:
- YES: "<matched question>"
- NO
""",
)
tech_match_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=tech_retriever,
    chain_type_kwargs={"prompt": tech_match_prompt}
)


### 6.2 Technical RELEVANCE chain


In [7]:
# 6.2 Technical RELEVANCE chain
tech_relevance_prompt = PromptTemplate(
    input_variables=["new_question", "candidates"],
    template="""
New question:
{new_question}

Here are 3 candidate questions retrieved from the Technical dataset:
{candidates}

Decide for each candidate whether it is truly relevant to the new question.
Return **only** a JSON array containing the candidate strings that are relevant, copied verbatim (return [] if none are).
Example: ["Q1 text","Q3 text"]
""",
)
tech_relevance_chain = LLMChain(llm=llm, prompt=tech_relevance_prompt)



### 6.3 HR EXACT‐MATCH chain


In [8]:
# 6.3 HR EXACT‐MATCH chain
hr_match_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="""
You are an assistant that decides if a new HR question exactly matches one in the HR dataset.
From these retrieved questions:
{context}

New Question:
{question}

Respond with **exactly**:
- YES: "<matched question>"
- NO
""",
)
hr_match_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=hr_retriever,
    chain_type_kwargs={"prompt": hr_match_prompt}
)


### 6.4 HR RELEVANCE chain


In [9]:

# 6.4 HR RELEVANCE chain
hr_relevance_prompt = PromptTemplate(
    input_variables=["new_question", "candidates"],
    template="""
New question:
{new_question}

Here are 3 candidate questions retrieved from the HR dataset:
{candidates}

Decide for each candidate whether it is truly relevant to the new question.
Return **only** a JSON array containing the candidate strings that are relevant, copied verbatim (return [] if none are).
Example: ["Tell me about yourself.","What are your strengths?"]
""",
)
hr_relevance_chain = LLMChain(llm=llm, prompt=hr_relevance_prompt)

## 7. Master List of Questions (4 Technical + 4 HR)

The master list holds eight questions: four Technical and four HR. Each entry has:
- `"question"`: the exact question text.
- `"type"`: either `"Technical"` or `"HR"`.

Feel free to swap these out for your actual eight interview questions.


In [10]:
MASTER_QUESTIONS = [
    # --- Four Technical questions ---
    {
        "question": "Explain how you would implement a Transformer-based Speech Emotion Recognition (SER) pipeline end-to-end, from raw audio input to predicted emotion labels.",
        "type": "Technical"
    },
    {
        "question": "Given a large dataset of audio recordings, describe at least three feature-extraction techniques (e.g., Mel-spectrogram, log-Mel, MFCC) and discuss the pros and cons of each for a SER task.",
        "type": "Technical"
    },
    {
        "question": "How does Retrieval-Augmented Generation (RAG) improve answer evaluation compared to using a vanilla LLM? Illustrate with a pseudo-code or high-level workflow.",
        "type": "Technical"
    },
    {
        "question": "Suppose you have to fine-tune a pre-trained wav2vec 2.0 model on a new emotional‐speech dataset. Which steps would you follow (data preprocessing, training loop, hyperparameter tuning), and why?",
        "type": "Technical"
    },
    # --- Four HR questions ---
    {
        "question": "Tell me about a time you faced a conflict in a team. How did you resolve it?",
        "type": "HR"
    },
    {
        "question": "What are your greatest strengths and how do they apply to this role?",
        "type": "HR"
    },
    {
        "question": "Describe a situation when you had to adapt quickly to a significant change at work or in school. What did you learn?",
        "type": "HR"
    },
    {
        "question": "How do you handle constructive criticism? Give an example.",
        "type": "HR"
    },
]

# Build a lookup map: question_text → type
QUESTION_TYPE_MAP = { entry["question"]: entry["type"] for entry in MASTER_QUESTIONS }


## 8. Unified Evaluation Function

`evaluate_question_answer(...)` takes a `(question, answer)` pair, looks up whether it is Technical or HR, then applies:

1. **Non‐Answer Pre‐Check**:
   - If the answer is "I don't know" (or a similar phrase) or has fewer than three words, return zero scores immediately without calling the LLM.
2. **Exact‐Match Check** (using the relevant FAISS retriever + `RetrievalQA`):
   - If the model returns `YES: "<matched question>"` and the old score > 70, combine 70% old + 30% fresh rubric.
   - If old score ≤ 70, use 100% fresh rubric.
3. **Relevance Check** (if no exact match): 
   - Get the top‐3 neighbors from FAISS, ask “which are truly relevant?” via an LLM chain.
   - If none relevant, use 100% fresh rubric.
   - If some relevant, re-score each relevant question's stored dataset answer with the rubric, average those scores, then combine 30% average + 70% fresh rubric.

Returns a dictionary with:
- `"question"`, `"type"`,  
- `"old_dataset_score"`, `"rubric_score"`, `"final_combined_score"`,  
- `"rubric_breakdown"` (detailed per‐criterion scores + explanations),  
- Optionally `"error"` if something fails.
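
The weighting rules above reduce to a small pure function. The sketch below is one way to write it. `combine_scores` and its parameter names are hypothetical, while the threshold and weights are the ones stated in this section.

```python
from typing import Optional

def combine_scores(fresh: float, old: Optional[float], exact_match: bool) -> float:
    """Blend a fresh rubric score with an old dataset score.

    exact_match=True : old > 70 -> 70% old + 30% fresh, else 100% fresh.
    exact_match=False: `old` is the average over relevant neighbours
                       (None when there are none) -> 30% old + 70% fresh.
    """
    if old is None:
        return round(fresh, 2)
    if exact_match:
        if old > 70:
            return round(0.7 * old + 0.3 * fresh, 2)
        return round(fresh, 2)
    return round(0.3 * old + 0.7 * fresh, 2)

print(combine_scores(fresh=60.0, old=80.0, exact_match=True))
print(combine_scores(fresh=60.0, old=65.0, exact_match=True))
print(combine_scores(fresh=60.0, old=80.0, exact_match=False))
print(combine_scores(fresh=37.62, old=None, exact_match=False))
```

Keeping the arithmetic in one place makes the weights easy to unit-test and to change without touching the retrieval logic.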


In [11]:
import re

def is_instructional(text: str) -> bool:
    """
    Detects if a given HR answer is actually meta-instructions (e.g., “Best strategy: …”).
    """
    patterns = [
        r"^\s*(Start with|Remember that|BEST ANSWERS?|Best strategy|Example:|Remember, you|To answer this question|If you want to|The only right answer|To cover both|Many executives)",
        r"\b(you should|you must|always|never|exercise)\b"
    ]
    for pat in patterns:
        if re.search(pat, text, flags=re.IGNORECASE):
            return True
    return False


def evaluate_question_answer(question: str, answer: str) -> dict:
    """
    Version A: Pre-check “I don’t know” (or very short answers).
    If answer is literally “I don’t know” (case-insensitive) or fewer than 3 words,
    return zero scores immediately. Otherwise, proceed with exact/relevance logic.
    """
    q_type = QUESTION_TYPE_MAP.get(question)
    if q_type not in ("Technical", "HR"):
        raise ValueError(f"Question not found in MASTER_QUESTIONS: {question}")

    # Identify which rubric and chains to use
    if q_type == "Technical":
        old_scores_map  = tech_past_scores
        old_df          = df_tech
        old_match_chain = tech_match_chain
        old_relev_chain = tech_relevance_chain
        retriever       = tech_retriever
        rubric          = tech_rubric
    else:  # HR
        old_scores_map  = hr_past_scores
        old_df          = df_hr
        old_match_chain = hr_match_chain
        old_relev_chain = hr_relevance_chain
        retriever       = hr_retriever
        rubric          = hr_rubric

    # 1) PRE-CHECK: if answer is “I don't know” / “idk” / extremely short, return zeros
    normalized = answer.strip().lower()
    if normalized in ["i don’t know", "i don't know", "idk", "no idea"] or len(normalized.split()) < 3:
        zero_breakdown = [
            {
                "name": crit["name"],
                "score": 0.0,
                "explanation": "No substantive answer provided."
            }
            for crit in rubric
        ]
        return {
            "question": question,
            "type": q_type,
            "old_dataset_score": 0.0,
            "rubric_score": 0.0,
            "final_combined_score": 0.0,
            "rubric_breakdown": {
                "scores": zero_breakdown,
                "overall_score": 0.0
            }
        }

    # 2) If not “I don't know,” proceed with EXACT-MATCH / RELEVANCE logic as before:
    result = {
        "question": question,
        "type": q_type,
        "old_dataset_score": 0.0,
        "rubric_score": 0.0,
        "final_combined_score": 0.0,
        "rubric_breakdown": None,
    }

    # 2a) Retrieve top-3 neighbors (reused by the relevance check below)
    docs = retriever.get_relevant_documents(question)

    # 2b) Exact-match check
    match_out = old_match_chain.run(question).strip()
    if match_out.upper().startswith("YES"):
        m = re.search(r'YES:\s*"(.*)"', match_out)
        exact_q = m.group(1) if m else None
        old_score = old_scores_map.get(exact_q, 0.0)
        result["old_dataset_score"] = old_score

        # A fresh rubric evaluation is needed in both branches
        rub_report = evaluate_with_rubric(question, answer, rubric)
        rub_score  = rub_report["overall_score"]
        result["rubric_score"]     = rub_score
        result["rubric_breakdown"] = rub_report

        if old_score > 70:
            # Combine 70% old + 30% fresh rubric
            result["final_combined_score"] = round(0.7 * old_score + 0.3 * rub_score, 2)
        else:
            # old_score ≤ 70 → 100% fresh rubric
            result["final_combined_score"] = rub_score

        return result

    # 2c) No exact match → RELEVANCE check
    candidates_json = json.dumps([d.page_content for d in docs], indent=2)
    rel_raw = old_relev_chain.run(new_question=question, candidates=candidates_json)

    try:
        relevant_list = extract_json(rel_raw)
    except ValueError:
        relevant_list = []

    if not relevant_list:
        # 2c.i) No relevant → 100% fresh rubric
        rub_report = evaluate_with_rubric(question, answer, rubric)
        rub_score  = rub_report["overall_score"]
        result["rubric_score"]         = rub_score
        result["final_combined_score"] = rub_score
        result["rubric_breakdown"]     = rub_report
        return result
    else:
        # 2c.ii) Some relevant → re-score each relevant question's stored answer, then average
        old_scores_accum = []
        for q_old in relevant_list:
            try:
                a_old = old_df.loc[old_df.question == q_old, "answer"].iloc[0]
            except IndexError:
                # The LLM returned text that is not verbatim in the dataset
                a_old = ""

            if q_type == "HR" and is_instructional(a_old):
                try:
                    a_old = convert_to_sample_answer(q_old, a_old, llm)
                except Exception:
                    # Keep the original text if conversion fails or the helper is not defined
                    pass

            old_rub_report = evaluate_with_rubric(q_old, a_old, rubric)
            old_scores_accum.append(old_rub_report["overall_score"])

        avg_old = sum(old_scores_accum) / len(old_scores_accum)
        result["old_dataset_score"] = avg_old

        # Fresh rubric on new (question, answer)
        new_rub_report = evaluate_with_rubric(question, answer, rubric)
        rub_score      = new_rub_report["overall_score"]
        result["rubric_score"]       = rub_score
        result["rubric_breakdown"]   = new_rub_report

        combined = round(0.7 * rub_score + 0.3 * avg_old, 2)
        result["final_combined_score"] = combined
        return result


## 9. Main Interactive Loop

`main()` randomly selects three questions from `MASTER_QUESTIONS` (two Technical, one HR). For each:
1. Display the question and its type (Technical/HR).
2. Prompt the user to paste the candidate’s answer.
3. Show the closest question retrieved from the dataset (top‐1 FAISS neighbor).
4. Call `evaluate_question_answer(...)` and print the JSON‐formatted result (per‐criterion breakdown + combined score).

After the loop, all three results are saved to `candidate_evaluation_summary.json`.
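
`main()` below draws a random subset of `MASTER_QUESTIONS`; that draw can be made reproducible by sampling from a seeded generator. This is a sketch: `pick_questions`, its parameters, and the tiny question list are illustrative and not from the notebook.

```python
import random

def pick_questions(master: list, n_tech: int = 2, n_hr: int = 1, seed=None) -> list:
    """Sample n_tech Technical and n_hr HR questions, then shuffle them."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    tech = [q["question"] for q in master if q["type"] == "Technical"]
    hr = [q["question"] for q in master if q["type"] == "HR"]
    selected = rng.sample(tech, k=n_tech) + rng.sample(hr, k=n_hr)
    rng.shuffle(selected)
    return selected

# Placeholder entries in the same shape as MASTER_QUESTIONS.
demo_master = [
    {"question": "T1", "type": "Technical"},
    {"question": "T2", "type": "Technical"},
    {"question": "T3", "type": "Technical"},
    {"question": "H1", "type": "HR"},
    {"question": "H2", "type": "HR"},
]
print(pick_questions(demo_master, seed=42))
```

Using a private `random.Random` instance avoids changing the global random state that other cells may rely on.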


In [20]:
# ─── 9. Main Interactive Loop (Jupyter‐friendly prints + flush) ─────────────────────────────────────────────────────
import random
import sys
from IPython.display import display, Markdown

def main():
    """
    Randomly select 2 Technical and 1 HR question from MASTER_QUESTIONS,
    then ask the user to paste the candidate’s answer for each.
    Show:
      - The selected question itself (as Markdown)
      - The candidate's answer
      - The best‐retrieved question (top‐1 from FAISS)
    Then call evaluate_question_answer(...) and print its JSON output.
    Finally, write all results to candidate_evaluation_summary.json.
    """
    display(Markdown("## Candidate Evaluation Script (3 Random Questions)"))

    # 1) Build pools and randomly pick
    tech_pool = [q["question"] for q in MASTER_QUESTIONS if q["type"] == "Technical"]
    hr_pool   = [q["question"] for q in MASTER_QUESTIONS if q["type"] == "HR"]

    selected_tech = random.sample(tech_pool, k=2)
    selected_hr   = random.sample(hr_pool, k=1)
    selected_questions = selected_tech + selected_hr
    random.shuffle(selected_questions)

    all_results = []

    for q_text in selected_questions:
        q_type = QUESTION_TYPE_MAP[q_text]

        # 2) Display the question itself using Markdown
        display(Markdown(f"**QUESTION ({q_type}):**  \n{q_text}"))

        # 3) Prompt for candidate's answer (question remains visible above)
        prompt_str = "Enter candidate’s answer for the above question:\n> "
        user_ans = input(prompt_str).strip()

        # 4) Immediately echo candidate's answer (print will show under the prompt)
        print(f"\n**Candidate’s answer:**\n{user_ans}\n")

        # 5) Retrieve and display the best FAISS neighbor (top‐1)
        retriever = tech_retriever if q_type == "Technical" else hr_retriever
        docs = retriever.get_relevant_documents(q_text)
        if docs:
            best_q = docs[0].page_content
            print(f"**Best retrieved question from dataset:**\n{best_q}\n")
        else:
            print("**No retrieved questions found.**\n")

        # 6) Print “Evaluating…” and flush stdout immediately
        print("Evaluating…", end=" ", flush=True)
        sys.stdout.flush()

        try:
            report = evaluate_question_answer(q_text, user_ans)

            # Once done, print “Done.” and full JSON
            print("Done.\n")
            print(json.dumps(report, indent=2))
            all_results.append(report)

        except Exception as e:
            # Print any error, then continue
            print("ERROR:", e)
            all_results.append({
                "question": q_text,
                "type": q_type,
                "error": str(e)
            })

    # 7) Write all results to disk
    out_path = "candidate_evaluation_summary.json"
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(all_results, f, indent=2, ensure_ascii=False)

    print(f"\nAll results written to {out_path}\n")


if __name__ == "__main__":
    main()


## Candidate Evaluation Script (3 Random Questions)

**QUESTION (Technical):**  
Given a large dataset of audio recordings, describe at least three feature-extraction techniques (e.g., Mel-spectrogram, log-Mel, MFCC) and discuss the pros and cons of each for a SER task.


**Candidate’s answer:**
no kholedge

**Best retrieved question from dataset:**
Which feature selection techniques do you know?

Evaluating… Done.

{
  "question": "Given a large dataset of audio recordings, describe at least three feature-extraction techniques (e.g., Mel-spectrogram, log-Mel, MFCC) and discuss the pros and cons of each for a SER task.",
  "type": "Technical",
  "old_dataset_score": 0.0,
  "rubric_score": 0.0,
  "final_combined_score": 0.0,
  "rubric_breakdown": {
    "scores": [
      {
        "name": "Clarity",
        "score": 0.0,
        "explanation": "No substantive answer provided."
      },
      {
        "name": "Accuracy",
        "score": 0.0,
        "explanation": "No substantive answer provided."
      },
      {
        "name": "Completeness",
        "score": 0.0,
        "explanation": "No substantive answer provided."
      },
      {
        "name": "Relevance",
        "score": 0.0,
        "explanation": "No substantive answer provided."
      }

**QUESTION (Technical):**  
Explain how you would implement a Transformer-based Speech Emotion Recognition (SER) pipeline end-to-end, from raw audio input to predicted emotion labels.


**Candidate’s answer:**
using cnn

**Best retrieved question from dataset:**
How can we use machine learning for text classification?

Evaluating… Done.

{
  "question": "Explain how you would implement a Transformer-based Speech Emotion Recognition (SER) pipeline end-to-end, from raw audio input to predicted emotion labels.",
  "type": "Technical",
  "old_dataset_score": 0.0,
  "rubric_score": 0.0,
  "final_combined_score": 0.0,
  "rubric_breakdown": {
    "scores": [
      {
        "name": "Clarity",
        "score": 0.0,
        "explanation": "No substantive answer provided."
      },
      {
        "name": "Accuracy",
        "score": 0.0,
        "explanation": "No substantive answer provided."
      },
      {
        "name": "Completeness",
        "score": 0.0,
        "explanation": "No substantive answer provided."
      },
      {
        "name": "Relevance",
        "score": 0.0,
        "explanation": "No substantive answer provided."
      },
      {
        "name": "

**QUESTION (HR):**  
Tell me about a time you faced a conflict in a team. How did you resolve it?


**Candidate’s answer:**
i communicated with them

**Best retrieved question from dataset:**
Tell me about a situation when your work was criticized.

Evaluating… Done.

{
  "question": "Tell me about a time you faced a conflict in a team. How did you resolve it?",
  "type": "HR",
  "old_dataset_score": 0.0,
  "rubric_score": 37.62,
  "final_combined_score": 37.62,
  "rubric_breakdown": {
    "scores": [
      {
        "name": "Relevance",
        "score": 50.0,
        "explanation": "The answer is relevant to resolving team conflict through communication but lacks specific detail."
      },
      {
        "name": "Clarity",
        "score": 36.67,
        "explanation": "The response is unclear, excessively brief, and lacks sufficient detail to effectively convey the situation, communication process, or resolution."
      },
      {
        "name": "Professionalism",
        "score": 53.33,
        "explanation": "The tone is neutral and professional but lacks the depth and context

---

### Summary

- The code is separated into distinct notebook cells with explanatory Markdown.  
- File paths are defined as constants (`TECH_RUBRIC_PATH`, `HR_CSV_PATH`, and so on) in sections 4 and 5.  
- `MASTER_QUESTIONS` contains eight questions (4 Technical + 4 HR); each run samples two Technical and one HR question.  

To use this notebook:
1. Update the path constants in sections 4 and 5 so they point at your actual local files.  
2. Replace the eight sample questions with the exact ones you wish to ask (maintaining `"type": "Technical"` or `"type": "HR"`).  
3. Run each cell in order, then execute the `main()` cell.  
4. For each question, paste the candidate’s answer when prompted.  
5. At the end, review `candidate_evaluation_summary.json`, which contains per‐question breakdowns and overall combined scores.  

