# Unify
Unify is a platform that dynamically routes queries to the most appropriate endpoint across providers such as OpenAI, MistralAI, Perplexity AI, Together AI and more. A single API key gives you access to every provider supported by the [platform](https://unify.ai/hub).

When using models through the Unify API endpoint, you get:
- Access to all providers with a single API key.
- Dynamic routing to the best provider for your use case.

First, install LlamaIndex 🦙 and the Unify integration.

In [31]:
%pip install llama-index-llms-unify

In [2]:
%pip install llama-index

## Installation and Setup

- Get your Unify API key from the [Unify console](https://console.unify.ai/).
- Set the `UNIFY_API_KEY` environment variable to that key.

In [1]:
import os
os.environ["UNIFY_API_KEY"] = "<YOUR API KEY>"
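
Hardcoding a key in a notebook makes it easy to leak, so you may prefer to set `UNIFY_API_KEY` in your shell and only check for it here. The following is a minimal sketch of such a check; `require_unify_api_key` is a hypothetical helper written for this example and is not part of the Unify integration.

```python
import os


def require_unify_api_key(env=None):
    """Return the Unify API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; not part of llama-index or Unify.
    """
    env = os.environ if env is None else env
    key = env.get("UNIFY_API_KEY", "").strip()
    # Treat the placeholder from the cell above as "not set".
    if not key or key == "<YOUR API KEY>":
        raise RuntimeError(
            "UNIFY_API_KEY is not set; get a key from the Unify console first."
        )
    return key
```

Calling `require_unify_api_key()` before creating the LLM should surface a missing key as a readable error rather than as a failed request later on.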

Create a `Unify` LLM from a model string of the form `model@provider`, then call `complete` with a prompt.

In [3]:
from llama_index.llms.unify import Unify
llm = Unify(model="llama-2-70b-chat@deepinfra")
llm.complete("How are you today, llama?")

CompletionResponse(text='  I\'m doing well, thanks for asking! I\'m feeling a bit mischievous today, so I hope you\'re ready for some fun and games. What do you say we play a game of "Guess the Number" or "Hangman" to get things started? Or if you have a different game in mind, I\'m always up for trying something new. Just let me know what you\'d like to do and we\'ll get started!', additional_kwargs={}, raw={'id': 'chatcmpl-1fb0b98b67d34b7188ec4b30d26f4ac0', 'choices': [Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='  I\'m doing well, thanks for asking! I\'m feeling a bit mischievous today, so I hope you\'re ready for some fun and games. What do you say we play a game of "Guess the Number" or "Hangman" to get things started? Or if you have a different game in mind, I\'m always up for trying something new. Just let me know what you\'d like to do and we\'ll get started!', role='assistant', function_call=None, tool_calls=None, name=None))], 'c

One advantage of Unify is that a single API key gives you access to endpoints from different providers, such as Perplexity, Together AI, OctoAI and more.

In [13]:
llm_together = Unify(model="llama-2-70b-chat@together-ai")
print("Together AI response: \n", llm_together.complete("Why are you called llama, isn't it a goat or something?"))

llm_anyscale = Unify(model="llama-2-70b-chat@anyscale")
print("\n\nAnyscale response: \n", llm_anyscale.complete("Isn't goat a better name than llama?"))

Together AI response: 
   I'm called Llama because I'm a machine learning model, and my creators wanted to give me a unique and memorable name. They chose the name Llama because it's a bit quirky and unexpected, and it stands out among other AI models.

Llamas are actually domesticated animals that are native to South America, and they are known for their distinctive long necks and ears. They are not closely related to goats, although they are both members of the camelid family.

I don't have a physical body like a llama, of course - I exist only as a computer program designed to understand and generate text. But I'm glad to have a name that's a bit different and fun!


Anyscale response: 
   Both goat and llama are good names, but they have different connotations and associations.

Goat is a more general term that can refer to any member of the Bovidae family, including both male and female goats. It's a simple and straightforward name that is easily recognizable and understandable.
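
The model strings used above follow a `model@provider` pattern. If you want to compare the same model across several providers, a small helper can build those strings for you. This is only a sketch: `endpoint` is a hypothetical function, and the provider names are the ones that already appear in this notebook.

```python
def endpoint(model: str, provider: str) -> str:
    """Join a model name and a provider name into a `model@provider` string.

    Hypothetical helper for illustration; not part of the Unify integration.
    """
    if not model or not provider:
        raise ValueError("model and provider must both be non-empty")
    if "@" in model or "@" in provider:
        raise ValueError("model and provider must not contain '@'")
    return f"{model}@{provider}"


# Provider names taken from the examples in this notebook.
providers = ["deepinfra", "together-ai", "anyscale"]
endpoints = [endpoint("llama-2-70b-chat", p) for p in providers]
print(endpoints)
```

Each resulting string can then be passed as the `model` argument, as in `Unify(model=endpoints[0])`.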



## Streaming and Dynamic routing
Suppose you want to use `mixtral-8x7b-instruct-v0.1` at the fastest output speed and do not care which provider serves it. Unify's dynamic routing uses data from its live benchmarks to route your request to the provider that is performing best at that moment.

You can see how to configure dynamic routing [here](https://unify.ai/docs/hub/concepts/runtime_routing.html#how-to-use-it).

In [9]:
llm = Unify(model="mixtral-8x7b-instruct-v0.1@tks-per-sec")
response = llm.stream_complete("Why is mixtral so competitive with bigger models?")

In [10]:
show_provider = True
for r in response:
    if show_provider:
        print(f"Model and provider are : {r.raw['model']}\n")
        show_provider = False
    print(r.delta, end="", flush=True)

Model and provider are : mixtral-8x7b-instruct-v0.1@fireworks-ai

I'm assuming you are referring to MixTraL, a recently proposed language model architecture. MixTraL is a relatively new model, and while it has shown promising results in some benchmarks, it's not accurate to label it as "so competitive with bigger models" just yet.

MixTraL combines the strengths of Transformer and Mixture of Experts (MoE) models to achieve good performance while keeping the model size relatively small. The key idea is to partition the model's parameters into several expert sub-networks and use a gating network to select the most relevant experts for each input token. This approach allows MixTraL to scale up to a larger model size without a proportional increase in computational requirements.

However, MixTraL still faces some challenges compared to bigger models:

1. Data size: Larger models typically have access to more data, which can lead to better performance. MixTraL has been trained
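
If you need the full text after streaming as well as the provider that the router picked, you can accumulate the deltas while you print them. The sketch below assumes only what the loop above already uses: each streamed chunk has a `delta` attribute and a `raw` mapping with a `model` entry. `collect_stream` and the stand-in chunks are hypothetical and exist so the example runs without a network call.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Consume streamed chunks and return (endpoint, full_text).

    Hypothetical helper: assumes each chunk exposes `.delta` and `.raw["model"]`,
    as the streaming loop in this notebook does.
    """
    endpoint = None
    parts = []
    for chunk in chunks:
        if endpoint is None:
            # The first chunk tells us which `model@provider` served the request.
            endpoint = chunk.raw["model"]
        parts.append(chunk.delta or "")
    return endpoint, "".join(parts)


# Stand-in chunks for illustration; with a real LLM, pass `llm.stream_complete(...)`.
fake_chunks = [
    SimpleNamespace(delta="Hello", raw={"model": "mixtral-8x7b-instruct-v0.1@fireworks-ai"}),
    SimpleNamespace(delta=", world", raw={"model": "mixtral-8x7b-instruct-v0.1@fireworks-ai"}),
]
served_by, text = collect_stream(fake_chunks)
print(served_by)
print(text)
```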

## Async and cost-based routing
You can also call the model asynchronously and route by cost. For example, if you need to run a large volume of data through a model by tomorrow and care more about input cost than speed, Unify's dynamic router can optimize for input cost instead.

In [12]:
llm = Unify(model="mixtral-8x7b-instruct-v0.1@input-cost")
response = await llm.acomplete("What is the difference between sync and async code?")
print(f"Model and provider are : {response.raw['model']}\n")
print(response)

Model and provider are : mixtral-8x7b-instruct-v0.1@deepinfra

 Synchronous (sync) code and asynchronous (async) code are two different ways to write code that allow you to control the order in which operations are executed in your program.

Synchronous code is executed in a sequential manner, meaning that each statement is executed one after the other, and the program waits for the current statement to complete before moving on to the next one. This is useful when you have a series of operations that depend on each other and need to be executed in a specific order.

Asynchronous code, on the other hand, allows you to execute multiple operations concurrently, without waiting for each one to complete before moving on to the next. This is useful when you have operations that take a long time to complete, such as reading from a file or making a network request, and you don't want to block the execution of the rest of your program while you wait for them to finish.

In summary, the main di