FTM_Benchmarker

A Python Script That Uses A Provided openai API Key And A Fine-Tuned Model ID To Create A Client Session. Given Unit-Test Questions Paired With The Answers We Expect The Fine-Tuned Model To Give, We Compare These Expected Answers Against The Answers Actually Given By Our Fine-Tuned openai Model And Get A Percentage Similarity Score Between The Two By Asking Another openai Session How Semantically And Theoretically Similar The Two Solutions Are. This Gives Us A Good Gauge Of The Fine-Tuned Model's Capability In Answering Problems Related To Its Field Of Expertise.


Cornstarch <3 Cornstarch <3 Cornstarch <3 Cornstarch <3

The Breakdown:

Before Running The Program, 3 Specific Things Need To Be Done To Ensure Valid Benchmarking. These Are:
 1.) Provide fine_tune_id On Line 263.
 2.) Provide An openai.api_key On Line 6 (A Sketch Of Steps 1 And 2 Is Shown Below This List).
 3.) Provide A .json File That Follows The Style-Guideline (Shown Below), Filling In "question" And "expected_answer" For Each Entry In "questions".
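
A Minimal Sketch Of Those Two In-Script Edits, Assuming The Legacy (Pre-1.0) openai Python Library; The Key And ID Values Below Are Placeholders, Not Real Credentials:

import openai

openai.api_key = "sk-YOUR-API-KEY-HERE"   # Step 2.) -- Line 6 Of The Script
fine_tune_id = "YOUR-FINE-TUNE-JOB-ID"    # Step 1.) -- Line 263 Of The Script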

The Stylization Of The Benchmarks Should Be As Follows, Provided In A .json:

# You Don't Need "given_answer", "differences", Or "similarity" As These Will Be Generated During Runtime
{
  "questions": [
        {
          "question": "YOUR QUESTION",
          "expected_answer": "YOUR EXPECTED ANSWER (WHAT YOU WANT THE FINE-TUNED MODEL TO SAY)",
          "given_answer": "____",
          "similarity": ____,
          "differences": ____
        },
        {
          "question": "YOUR QUESTION",
          "expected_answer": "YOUR EXPECTED ANSWER (WHAT YOU WANT THE FINE-TUNED MODEL TO SAY)",
          "given_answer": "____",
          "similarity": ____,
          "differences": ____
        }
      ]
}
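
For Context, This Is Roughly How Such A .json File Gets Loaded At Runtime; A Minimal Sketch, Assuming A Placeholder File Name Of "benchmarks.json" (The Actual Script May Name Its File And Variables Differently):

import json

with open("benchmarks.json", "r") as handle:
    data = json.load(handle)

questions = data["questions"]   # List Of Benchmark Dicts With "question" / "expected_answer"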

After The 3 Preliminary Tasks Are Complete, You Can Run The Script.

When Running The Script, We Start By Finding The Actual Model ID Associated With The Fine-Tune ID openai Gives Us When Fine-Tuning (fine_tune_id). If We Provided A Valid Fine-Tune ID, We Then Dump The Contents Of Our "questions" (Benchmarks) Into A Struct. To Evaluate Each "question" Pulled From The .json We Use Multi-Threading. This Is Achieved In evaluate_benchmark(...); We Send Off A Worker Thread To Process Each Individual Benchmarking Question Entry, Which Allows Us To Process Many Benchmarks In Parallel Instead Of Sequentially Processing Questions.
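
A Minimal Sketch Of What This Stage Could Look Like, Assuming The Legacy (Pre-1.0) openai Python Library And A ThreadPoolExecutor For The Worker Threads; The Thread Count And The process_question(...) Signature Here Are Illustrative, Not Pulled From The Script:

from concurrent.futures import ThreadPoolExecutor

import openai


def evaluate_benchmark(fine_tune_id, questions):
    # Resolve The Deployable Model Name From The Fine-Tuning Job ID We Were Given.
    job = openai.FineTuningJob.retrieve(fine_tune_id)
    model_id = job.fine_tuned_model

    # Fan Each Benchmark Entry Out To A Worker Thread Instead Of Looping Sequentially.
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(lambda entry: process_question(model_id, entry), questions))

    return results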

The Worker Function (process_question(...)) Is Given An Individual Benchmark, And Each One Of These Contains A "question" And An "expected_answer". We First Ask Our Fine-Tuned Model The Question And Get Its Response. We Then Compare This Response To The Solution We Expected, Provided In "expected_answer". Using openai Again, We Ask The GPT-4 Model To Compare The Two Answers And Give A Similarity Score Based Upon Theoretical And Semantic Relations--This Lets Us Quickly Compare Our Differing Solutions And Recommend Changes To Our Model If It Is Lacking In A Specific Subtopic Of The Fine-Tuned Model's Informational Knowledge.
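
A Minimal Sketch Of A Worker In That Style, Again Assuming The Legacy openai Interface; The Grading Prompt Wording And The Way The Score/Reasoning Get Handled Are Assumptions, Not The Script's Exact Logic:

import openai


def process_question(model_id, benchmark):
    # 1.) Ask The Fine-Tuned Model The Benchmark Question.
    given_answer = openai.ChatCompletion.create(
        model=model_id,
        messages=[{"role": "user", "content": benchmark["question"]}],
    ).choices[0].message.content

    # 2.) Ask GPT-4 To Grade Semantic/Theoretical Similarity Between Expected And Given.
    grading = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                "Give A 0-100 Similarity Score (Semantic And Theoretical) For These Two "
                "Answers, Then Explain The Key Differences.\n"
                f"Expected: {benchmark['expected_answer']}\nGiven: {given_answer}"
            ),
        }],
    ).choices[0].message.content

    # The Real Script Separates The Score From The Reasoning; Here We Return The Raw Text.
    return {
        "question": benchmark["question"],
        "expected_answer": benchmark["expected_answer"],
        "given_answer": given_answer,
        "similarity": grading,
        "differences": grading,
    }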

After These Worker Threads Ask Their Questions And Get Their Similarity Scores, We Gather Them All Back Into A List Struct. When Leaving evaluate_benchmark(...) We Return This List Of All The Worker Threads' Answers And Inject It Back Into The Provided .json, Adding "given_answer", "similarity", And "differences" As Entries For Each Benchmark. Before The Process Ends We Also Print The Contents Of Each Benchmark, Outlining The Similarity Score And The Reasoning For That Score In The Terminal.
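
A Minimal Sketch Of That Final Write-Back And Printout, With Placeholder File And Variable Names Carried Over From The Earlier Sketches:

import json

results = evaluate_benchmark(fine_tune_id, data["questions"])

# Inject The Workers' Results Back Into The Loaded Benchmark Struct And Re-Save The .json.
data["questions"] = results
with open("benchmarks.json", "w") as handle:
    json.dump(data, handle, indent=2)

# Echo Each Benchmark's Similarity Score And Reasoning To The Terminal.
for entry in results:
    print(f"Question:    {entry['question']}")
    print(f"Similarity:  {entry['similarity']}")
    print(f"Differences: {entry['differences']}\n")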

Cornstarch <3 Cornstarch <3 Cornstarch <3 Cornstarch <3


Cornstarch <3 Cornstarch <3 Cornstarch <3 Cornstarch <3

Features:

Result After Parsing:

(Image: Terminal Output Of The Parsed Benchmark Results)

Cornstarch <3 Cornstarch <3 Cornstarch <3 Cornstarch <3
