
Evaluation: Improve scores for Telugu #699

@tanuprasad530

Description


Is your feature request related to a problem?
We are facing issues with the evaluation scores for our model: the score for Telugu questions is stuck in a narrow band of 0.3 to 0.35, and over 15 questions consistently return fallback responses across multiple iterations.

Describe the solution you'd like

  • Analyze and improve evaluation scores.
  • Investigate reasons behind fallback responses for the identified questions.
  • Utilize the provided resources, such as Golden QnA and KB files, to support the evaluation process.
  • Review previous evaluation results for insights.
  • Engage the team to understand and enhance eval scores based on findings.
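The eval metric referenced below is cosine similarity between golden and model answers. As a minimal sketch of that scoring step (assuming embeddings are plain numpy vectors; `score_answers` and the embedding source are hypothetical stand-ins for whatever the actual eval pipeline uses):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

def score_answers(golden_vecs, model_vecs):
    """Score each (golden answer, model answer) embedding pair.

    A whole run hovering around 0.3-0.35 would match the stagnant
    Telugu scores reported in this issue.
    """
    return [cosine_similarity(g, m) for g, m in zip(golden_vecs, model_vecs)]
```

This is only to make the metric concrete; the real pipeline's embedding model and aggregation may differ.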

Resources:

  • Golden QnA - for all states

  • KB files
Please note: for every state, we add its state-specific documents (available in the state-specific folders) and the Central documents (available in the "Common-docs" folder).
    You can find the md versions inside a folder named "transformed_files" inside each of these directories.

  • Model: gpt-4o

  • Prompts
    This is the latest prompt. You can find the history documented here.

I'll create the VS shortly and share for Odisha.

For all other states, you can find them in the eval configs.

Requesting the team's support to help understand and improve the eval scores for the LLM, particularly for Odisha (language: Odia).

Previous eval results can be accessed here
Initial findings are as follows (for Telugu):

  • The eval score (cosine similarity) is currently stagnant at 0.3 to 0.35.
  • We are consistently observing 15+ questions that return fallback responses across all five iterations.
  • We have also tried running two kinds of evals. First, we picked a small subset of these 15+ questions and ran an eval on their English translations alongside the original Telugu records. The translated QnA returned a response, but the original Telugu one still did not.
  • Upon replicating this across all 15+ questions, a similar pattern was observed, with some questions (translated to English) still returning fallback responses across all five iterations.
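The pattern above (questions that hit the fallback in every one of the five iterations) could be flagged mechanically with something like the sketch below. The `FALLBACK_MARKER` string and the shape of the results dict are assumptions, not taken from the actual eval output format:

```python
# Assumption: fallback responses can be recognised by a known marker string;
# adjust to however the eval results actually tag fallbacks.
FALLBACK_MARKER = "fallback"

def consistently_fallback(results_by_iteration, n_iterations=5):
    """Return question ids whose response contained the fallback marker
    in every one of the n_iterations runs.

    results_by_iteration: {question_id: [response_iter1, ..., response_iterN]}
    """
    return [
        qid for qid, responses in results_by_iteration.items()
        if len(responses) >= n_iterations
        and all(FALLBACK_MARKER in r.lower() for r in responses)
    ]
```

Running this over the five iterations' results would give a stable list of the 15+ problem questions to investigate (e.g. against retrieval hits in the KB files).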

Please note: all these efforts are just to check whether the LLM is able to return responses at all; we have not yet looked at the semantic correctness of the responses returned so far.

@PritamSGB Please feel free to add more in case I have missed something.
