
Evaluation: Improve scores for Telugu #699

@tanuprasad530

Description


Is your feature request related to a problem?
We are facing issues with the evaluation scores for our model: the score for Telugu questions is stuck in a narrow band of 0.3 to 0.35, and over 15 questions consistently return fallback responses across multiple iterations.

Describe the solution you'd like

  • Analyze and improve evaluation scores.
  • Investigate reasons behind fallback responses for the identified questions.
  • Utilize the provided resources, such as Golden QnA and KB files, to support the evaluation process.
  • Review previous evaluation results for insights.
  • Engage the team to understand and enhance eval scores based on findings.
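The eval metric referenced below is cosine similarity between golden and model answers. As a minimal sketch of that scoring step (assuming embeddings are plain numpy vectors; `score_answers` and the embedding source are hypothetical stand-ins for whatever the actual eval pipeline uses):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

def score_answers(golden_vecs, model_vecs):
    """Score each (golden answer, model answer) embedding pair.

    A whole run hovering around 0.3-0.35 would match the stagnant
    Telugu scores reported in this issue.
    """
    return [cosine_similarity(g, m) for g, m in zip(golden_vecs, model_vecs)]
```

This is only to make the metric concrete; the real pipeline's embedding model and aggregation may differ.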

Resources:

  • Golden QnA - for all states

  • KB files
Please note: for every state, we add its state-specific documents (available in the state-specific folders) and the Central documents (available in the "Common-docs" folder).
    You can find the md versions inside a folder named "transformed_files" inside each of these directories.

  • Model: gpt-4o

  • Prompts
    This is the latest prompt. You can find the history documented here.

I'll create the VS shortly and share for Odisha.

For all other states, you can find them in the eval configs.

Requesting the team's support to help understand and improve the eval scores for the LLM, particularly for Odisha (language: Odia).

Previous eval results can be accessed here
Initial findings are as follows (for Telugu):

  • The eval score (cosine similarity) is currently stagnant at 0.3 to 0.35.
  • We are consistently observing 15+ questions that return fallback responses across all five iterations.
  • We have also tried running two kinds of evals. First, we picked a small subset of these 15+ questions and ran an eval on their English translations alongside the original Telugu records. The translated QnA returned a response, but the original Telugu one still did not.
  • Upon replicating this across all 15+ questions, a similar pattern was observed, with some questions (translated to English) still returning fallback responses across all five iterations.
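The pattern above (questions that hit the fallback in every one of the five iterations) could be flagged mechanically with something like the sketch below. The `FALLBACK_MARKER` string and the shape of the results dict are assumptions, not taken from the actual eval output format:

```python
# Assumption: fallback responses can be recognised by a known marker string;
# adjust to however the eval results actually tag fallbacks.
FALLBACK_MARKER = "fallback"

def consistently_fallback(results_by_iteration, n_iterations=5):
    """Return question ids whose response contained the fallback marker
    in every one of the n_iterations runs.

    results_by_iteration: {question_id: [response_iter1, ..., response_iterN]}
    """
    return [
        qid for qid, responses in results_by_iteration.items()
        if len(responses) >= n_iterations
        and all(FALLBACK_MARKER in r.lower() for r in responses)
    ]
```

Running this over the five iterations' results would give a stable list of the 15+ problem questions to investigate (e.g. against retrieval hits in the KB files).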

Please note: all these efforts are just to check whether the LLM is able to return responses at all; we have not yet looked at the semantic correctness of the responses returned so far.

@PritamSGB Please feel free to add more in case I have missed something.
