Welcome to Battle of LLMs, a project that evaluates the responses of various language models (LLMs), including ChatGPT-4, ChatGPT, Gemini, Mistral, and Claude. The project uses the conversational QA datasets CoQA, DialFact, FaVIQ, and CoDAH to test these LLMs, compare their performance, and provide insights into their capabilities.
- Introduction
- Project Purpose
- Used Datasets
- ChatGPT
- Gemini
- Mistral
- Claude
- Usage Examples
- About Us
- Contributing
This project focuses on evaluating the responses of ChatGPT-4, ChatGPT, Gemini, Mistral, and Claude through the lens of the conversational QA datasets CoQA, DialFact, FaVIQ, and CoDAH. By comparing their performance, we aim to provide valuable insights into the strengths and weaknesses of each language model.
The primary purpose of this project is to:
- Evaluate responses of different LLMs using conversational QA datasets.
- Conduct a "Battle of LLMs" with datasets including CoQA, DialFact, FaVIQ, and CoDAH.
- Provide a platform for testing and comparing LLMs' capabilities.
Dataset | Description |
---|---|
CoQA | Conversational Question Answering |
DialFact | Dialogue-based Fact Verification |
FaVIQ | Fact Verification from Information-seeking Questions |
CoDAH | Commonsense Dataset Adversarially-authored by Humans |
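As a concrete illustration of the kind of data these benchmarks provide, below is a minimal sketch of reading one conversation from CoQA. It assumes the official CoQA dev JSON file (`coqa-dev-v1.0.json`) has been downloaded locally; the filename and field names follow the published CoQA release and are not part of this repository.

```python
# Minimal sketch: iterate over one CoQA conversation (assumes the official
# coqa-dev-v1.0.json file has been downloaded into the working directory).
import json

with open("coqa-dev-v1.0.json") as f:
    coqa = json.load(f)

conversation = coqa["data"][0]      # one passage with its dialogue turns
print(conversation["story"][:200])  # the passage the questions refer to

for question, answer in zip(conversation["questions"], conversation["answers"]):
    print("Q:", question["input_text"])
    print("A:", answer["input_text"])
```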
ChatGPT, a product of OpenAI's advanced language models, brings versatile natural language understanding and generation to your fingertips. Rooted in the powerful Transformer architecture, it seamlessly integrates into applications through the OpenAI API. With pre-training on diverse internet text and fine-tuning for conversations, ChatGPT excels at dynamic and context-aware interactions. However, users should be aware of occasional limitations in specific scenarios.
You can use ChatGPT through the ChatGPT website, obtain an API key from the OpenAI platform, and follow the setup instructions on the OpenAI Developer Portal.
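As a rough illustration (not the repository's exact code), a script in this project could query the OpenAI chat API along the following lines; the API key, model name, and prompt are placeholders only.

```python
# Minimal sketch of querying the OpenAI chat API (openai>=1.0 client).
# The API key, model name, and prompt below are placeholders.
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")  # key obtained from the OpenAI platform

response = client.chat.completions.create(
    model="gpt-4",  # or "gpt-3.5-turbo" for ChatGPT
    messages=[
        {"role": "system", "content": "Answer the question using only the given passage."},
        {"role": "user", "content": "Passage: ...\nQuestion: ..."},
    ],
)
print(response.choices[0].message.content)
```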
Gemini is a multimodal model that integrates text and image prompting. On the text side, it can produce creative and inventive responses, from compelling narratives to poetry. On the image side, it lets users turn text prompts into visual representations such as personalized artwork and conceptual visualizations.
Gemini's particular strength is combining these capabilities: it handles tasks that require blending textual and visual elements, such as generating cohesive stories that interweave both media or producing vivid, articulate captions for images.
For a more in-depth look at Gemini and its capabilities, visit the official Gemini AI website, which provides comprehensive information on the tool's functionality and applications.
- Clone the repository: `git clone https://github.com/IIT-DM/BattleofLLMs.git`
- Navigate to the `gemini` directory in the command line: `cd gemini`
- Install requirements using the command: `pip install -r requirements.txt`
- Install `google-generativeai` using: `pip install -q -U google-generativeai`
- Obtain your API key by clicking on the key icon 🔑
- Navigate to one of the four folders named after the QA datasets and choose the one you want to run.
- Run the Python file from the command line: `python <file_name.py>` (a minimal usage sketch follows this list)
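Once the key is configured, a script in the `gemini` directory could query the model roughly as follows. This is a hedged sketch rather than the repository's actual code; the API key and model name are illustrative placeholders.

```python
# Minimal sketch of querying Gemini with google-generativeai.
# The API key and model name are placeholders.
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")
model = genai.GenerativeModel("gemini-pro")

response = model.generate_content(
    "Passage: ...\nQuestion: ...\nAnswer in one sentence."
)
print(response.text)
```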
Mistral AI has introduced powerful open generative models tailored for developers, offering efficient deployment and customization options for production environments. The beta access to their initial platform services includes three chat endpoints for text generation based on instructions and an embedding endpoint, each offering distinct performance/price tradeoffs.
The generative endpoints, Mistral-tiny and Mistral-small, utilize Mistral AI's released open models, while Mistral-medium employs a prototype model with enhanced performance, currently being tested in a live setting. The models are instructed versions incorporating effective alignment techniques, such as efficient fine-tuning and direct preference optimization, and are pre-trained on data from the open web.
Mistral-tiny, the most cost-effective option, currently serves Mistral 7B Instruct v0.2, which scores 7.6 on MT-Bench and supports English only. Mistral-small serves the latest model, Mixtral 8x7B, which handles multiple languages and code and scores a notable 8.3 on MT-Bench. Mistral-medium, the highest-quality endpoint, deploys a prototype model that reaches an impressive 8.6 on MT-Bench across multiple languages and code. Mistral AI also provides a performance comparison of Mistral-medium against Mistral-small and a competitor's endpoint.
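For reference, the hosted endpoints described above can be called with Mistral's Python client roughly as shown below. This is a hedged sketch: the repository itself runs the models locally via `transformers` (see the installation steps below), the API key and model name are placeholders, and the client interface may differ between versions.

```python
# Minimal sketch of calling a hosted Mistral endpoint with the mistralai client.
# The API key and model name are placeholders; the interface shown follows the
# early (0.x) mistralai package and may differ in newer versions.
from mistralai.client import MistralClient
from mistralai.models.chat_completion import ChatMessage

client = MistralClient(api_key="YOUR_MISTRAL_API_KEY")

response = client.chat(
    model="mistral-tiny",  # or "mistral-small" / "mistral-medium"
    messages=[ChatMessage(role="user", content="Passage: ...\nQuestion: ...")],
)
print(response.choices[0].message.content)
```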
- Clone the repository: `git clone https://github.com/IIT-DM/BattleofLLMs.git`
- Navigate to the `mistral` directory in the command line: `cd mistral`
- Install requirements using the command (requires `transformers` version 4.36.1): `pip install -r requirements.txt`
- Upgrade the transformers version using: `pip install transformers --upgrade`
- Navigate to one of the four folders named after the QA datasets and choose the one you want to run.
- Run the Python file from the command line: `python <file_name.py>` (a minimal local-inference sketch follows this list)
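Below is a hedged sketch of what running Mistral locally with `transformers` (version 4.36 or newer, as required above) can look like. The checkpoint name and generation settings are illustrative, not the repository's exact code.

```python
# Minimal sketch: local inference with Mistral-7B-Instruct via transformers.
# The checkpoint name and generation parameters are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

messages = [{"role": "user", "content": "Passage: ...\nQuestion: ..."}]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(device)

output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=False)
# Strip the prompt tokens and decode only the newly generated answer.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```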
- Clone the repository: `git clone https://github.com/IIT-DM/BattleofLLMs.git`
- Navigate to the `claude` directory in the command line: `cd claude`
- Install requirements using the command: `pip install -r requirements.txt`
- Obtain your API key by clicking on the key icon 🔑
- Navigate to one of the four folders named after the QA datasets and choose the one you want to run.
- Run the Python file from the command line: `python <file_name.py>` (a minimal usage sketch follows this list)
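A hedged sketch of how a script in the `claude` directory could call the Anthropic API follows; the API key, model name, and prompt are illustrative rather than the repository's exact code.

```python
# Minimal sketch of querying Claude via the anthropic Python SDK.
# The API key and model name are placeholders.
import anthropic

client = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_API_KEY")

message = client.messages.create(
    model="claude-3-opus-20240229",  # illustrative model name
    max_tokens=256,
    messages=[{"role": "user", "content": "Passage: ...\nQuestion: ..."}],
)
print(message.content[0].text)
```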
To demonstrate the capabilities of the language models, each relevant directory contains runnable Python examples for different levels of prompting: 0-shot, N-shot, and CoT (Chain of Thought). Each example is designed to showcase the models' responses in various scenarios.
In the 0-shot scenario, the models are given a single prompt with no examples or additional context, testing their ability to generate a relevant response based solely on the input itself.
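For example, a 0-shot prompt might be assembled like the sketch below; the passage and question are placeholders, not actual dataset content.

```python
# Minimal sketch of a 0-shot prompt: the model only sees the task itself.
passage = "<a passage from one of the datasets>"
question = "<a question about the passage>"

zero_shot_prompt = (
    "Answer the question based on the passage.\n"
    f"Passage: {passage}\n"
    f"Question: {question}\n"
    "Answer:"
)
print(zero_shot_prompt)
```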
The N-shot examples involve providing the models with a context or set of examples before posing the main question. This allows the models to leverage the provided information for a more informed response.
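A hedged sketch of how an N-shot prompt can be built by prepending worked examples before the main question; the example pairs are placeholders.

```python
# Minimal sketch of an N-shot prompt: N worked examples precede the real question.
examples = [
    ("<example question 1>", "<example answer 1>"),
    ("<example question 2>", "<example answer 2>"),
]

n_shot_prompt = "Answer the question based on the passage.\n"
for q, a in examples:
    n_shot_prompt += f"Question: {q}\nAnswer: {a}\n"
n_shot_prompt += "Question: <the actual question>\nAnswer:"
print(n_shot_prompt)
```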
Chain of Thought (CoT) prompting asks the model to lay out its intermediate reasoning before committing to a final answer. By structuring the prompt so that the model works through the problem step by step, this approach guides its behavior toward contextually relevant, well-grounded responses.
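A hedged sketch of a Chain-of-Thought prompt, where the model is asked to write out its reasoning before the final answer; the question is a placeholder.

```python
# Minimal sketch of a Chain-of-Thought prompt: the model is asked to reason
# step by step before committing to an answer.
question = "<a question from one of the datasets>"

cot_prompt = (
    f"Question: {question}\n"
    "Let's think step by step. Explain your reasoning, then give the final "
    "answer on a new line starting with 'Answer:'."
)
print(cot_prompt)
```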
We are a passionate team of 2: Aman Rangapur and Aryan Rangapur. Our collective dedication lies in exploring the capabilities of language models and advancing the field of natural language processing. This project is fueled by our shared interest in understanding how different LLMs perform in conversational question-answering scenarios. Embracing the power of open-source collaboration, we welcome contributions from the community to enrich and enhance our project. Feel free to reach out to us with any questions or ideas – we look forward to building a vibrant community around this initiative.
- Aryan Rangapur
- Aman Rangapur
We invite the community to contribute, provide feedback, and share ideas. Feel free to raise issues, propose enhancements, or submit pull requests; together we can improve the project's quality and effectiveness.