# Testing LLM API Tradeoffs (UX vs Cost)

[![open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LinkedInLearning/generative-ai-and-llmops-deploying-and-managing-llms-in-production-4465782/blob/solution/ch-02/challenge_API_limitations.ipynb)

```python
!pip install groq openai -q
```

```python
from groq import Groq
import getpass

# Get an API key from console.groq.com; getpass prompts for it without echoing it.
client = Groq(api_key=getpass.getpass())
```
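Prompting with `getpass` works interactively but not in scheduled or CI runs. A minimal sketch, assuming the key is stored in an environment variable (the name `GROQ_API_KEY` and the helper `get_api_key` are illustrative assumptions), that falls back to the interactive prompt when the variable is unset:

```python
import getpass
import os


def get_api_key(env_var: str) -> str:
    """Return the API key stored in `env_var`, or prompt for it if unset.

    Assumption: the key lives in an environment variable such as GROQ_API_KEY.
    """
    key = os.environ.get(env_var)
    if key:
        return key
    # Fall back to an interactive prompt that does not echo the key.
    return getpass.getpass(f"{env_var}: ")
```

Usage would then be `client = Groq(api_key=get_api_key("GROQ_API_KEY"))`, which keeps the notebook usable both interactively and unattended.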

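```python
def generate_text(model, prompt):
    """Send a single-turn chat request and return the full completion object.

    Uses the module-level `client`, so it calls whichever provider `client`
    currently points to (Groq here, OpenAI after the OpenAI cell rebinds it).
    """
    chat_completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": prompt}
        ],
    )
    return chat_completion
```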
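`generate_text` reads the module-level `client`, so it silently switches providers when the OpenAI section later rebinds that name. A sketch of a variant that takes the client as an explicit argument (the name `generate_text_with` is hypothetical); it relies only on the `chat.completions.create(model=..., messages=...)` call that both the Groq and OpenAI SDKs expose:

```python
from typing import Any


def generate_text_with(llm_client: Any, model: str, prompt: str) -> Any:
    """Send a single-turn chat request through an explicitly passed client.

    Works with any object exposing chat.completions.create(model=..., messages=...),
    such as groq.Groq or openai.OpenAI instances.
    """
    return llm_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
```

With separate `groq_client` and `openai_client` objects, both benchmarks can run in one session without rebinding a shared global.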

```python
model_list = ['llama3-8b-8192', 'llama3-70b-8192', 'mixtral-8x7b-32768', 'gemma-7b-it']
prompt = "Write a blog about taking Generative AI applications to production"

for model in model_list:
    chat_completion = generate_text(model, prompt)
    total_tokens = chat_completion.usage.total_tokens
    # Groq reports server-side generation time (seconds) in usage.completion_time.
    # Note: total_tokens also counts prompt tokens, so the ratio below somewhat
    # overstates pure generation throughput.
    total_time = chat_completion.usage.completion_time
    print(f"Model: {model}")
    print(f"Total tokens: {total_tokens}")
    print(f"Total time: {total_time}")
    print(f"Throughput: {total_tokens/total_time}")
    print("-----------")
```

```text
Model: llama3-8b-8192
Total tokens: 805
Total time: 0.653333333
Throughput: 1232.1428577715014
-----------
Model: llama3-70b-8192
Total tokens: 794
Total time: 2.478155744
Throughput: 320.3995559691506
-----------
Model: mixtral-8x7b-32768
Total tokens: 743
Total time: 1.177711155
Throughput: 630.8847435515715
-----------
Model: gemma-7b-it
Total tokens: 479
Total time: 0.627873152
Throughput: 762.8929481603316
-----------
```
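The loop above divides `total_tokens`, which includes the prompt, by `completion_time`, which covers only generation. A stricter sketch divides completion tokens by completion time; it assumes the `usage` object exposes `completion_tokens` alongside the `completion_time` field already used above, and the helper name `generation_throughput` is hypothetical:

```python
from types import SimpleNamespace


def generation_throughput(usage) -> float:
    """Tokens generated per second of generation time.

    Assumes `usage` has completion_tokens and completion_time attributes.
    """
    if usage.completion_time <= 0:
        raise ValueError("completion_time must be positive")
    return usage.completion_tokens / usage.completion_time


# Example with a stand-in usage object (illustrative values, not measured).
example = SimpleNamespace(completion_tokens=500, completion_time=0.5)
print(generation_throughput(example))  # 1000.0
```

Inside the benchmark loop this would be called as `generation_throughput(chat_completion.usage)`.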


## OpenAI

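```python
from openai import OpenAI
import getpass
import time

# Prompt for an OpenAI API key. This rebinds the global `client`,
# so generate_text() now sends its requests to OpenAI.
client = OpenAI(api_key=getpass.getpass())
```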

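```python
gpt_model_list = ["gpt-3.5-turbo", "gpt-4", "gpt-4o"]
prompt = "Write a blog about taking Generative AI applications to production"

for model in gpt_model_list:
    # Reset the timer for each model so times are per request, not cumulative.
    # OpenAI's usage object has no timing fields, so this is client-side
    # wall-clock time (network, queueing, and generation).
    start_time = time.perf_counter()
    chat_completion = generate_text(model, prompt)
    total_time = time.perf_counter() - start_time
    total_tokens = chat_completion.usage.total_tokens
    print(f"Model: {model}")
    print(f"Total tokens: {total_tokens}")
    print(f"Total time: {total_time}")
    print(f"Throughput: {total_tokens/total_time}")
    print("-----------")
```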

```text
Model: gpt-3.5-turbo
Total tokens: 437
Total time: 4.186880826950073
Throughput: 104.37364187371246
-----------
Model: gpt-4
Total tokens: 674
Total time: 34.953261852264404
Throughput: 19.282892762591647
-----------
Model: gpt-4o
Total tokens: 850
Total time: 61.952709436416626
Throughput: 13.720142472095961
-----------
```
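The times printed above are cumulative: in the run that produced them, `start_time` was set once before the loop, so each value includes all earlier requests. Per-request latency is the difference between consecutive values. This sketch applies that to the recorded numbers (the helper name `per_request_times` is hypothetical):

```python
def per_request_times(cumulative: list[float]) -> list[float]:
    """Convert cumulative elapsed times into per-request durations."""
    durations = []
    previous = 0.0
    for elapsed in cumulative:
        durations.append(elapsed - previous)
        previous = elapsed
    return durations


# Token counts and cumulative times copied from the output above.
recorded = {
    "gpt-3.5-turbo": (437, 4.186880826950073),
    "gpt-4": (674, 34.953261852264404),
    "gpt-4o": (850, 61.952709436416626),
}
times = per_request_times([t for _, t in recorded.values()])
for (model, (tokens, _)), seconds in zip(recorded.items(), times):
    print(f"{model}: {seconds:.2f} s, {tokens / seconds:.1f} tokens/s")
# gpt-3.5-turbo: 4.19 s, 104.4 tokens/s
# gpt-4: 30.77 s, 21.9 tokens/s
# gpt-4o: 27.00 s, 31.5 tokens/s
```

Corrected this way, gpt-4o is slightly faster per request than gpt-4, not slower.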



## Model Summaries
### Groq
* llama3-8b-8192: Fastest model tested (805 tokens in about 0.65 s of completion time, roughly 1,230 tokens/s). A small, cost-efficient model whose low latency suits real-time, moderately complex tasks.
* llama3-70b-8192: Slowest Groq model (about 2.48 s, roughly 320 tokens/s), nearly 4× llama3-8b's completion time for about the same number of tokens (794). It provides more detailed and nuanced responses, so it fits in-depth, non-real-time queries.
* mixtral-8x7b-32768: Mid-range speed (about 1.18 s, roughly 630 tokens/s). Its 32,768-token context window makes it the versatile choice for longer inputs, with a good trade-off between speed and capability.
* gemma-7b-it: Fast (about 0.63 s, roughly 760 tokens/s) with the shortest response of the test (479 tokens); quick and cost-effective, and well suited to fast interactions and less complex tasks.

### OpenAI
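* gpt-3.5-turbo: About 4.19 s for 437 tokens (roughly 104 tokens/s). The fastest OpenAI model here, with moderate cost, but its throughput is well below that of every Groq model, making it less suited to high-performance needs.
* gpt-4: About 30.8 s for 674 tokens (roughly 22 tokens/s), since the printed 34.95 s also includes the gpt-3.5-turbo request. The slowest model tested and also high-cost, so it suits only tasks where detailed output is crucial and latency and cost matter less.
* gpt-4o: About 27.0 s for 850 tokens (roughly 31 tokens/s), since the printed 61.95 s is cumulative across all three requests. Slightly faster than gpt-4 and the longest response of the three, but its latency still limits practical use in interactive business applications.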
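Each model above is timed on a single request, so one slow network round trip or queueing delay can reorder the rankings. A minimal sketch of timing a call several times and reporting the median; `median_latency` and the `runs` default of 5 are illustrative assumptions:

```python
import statistics
import time
from typing import Callable


def median_latency(call: Callable[[], object], runs: int = 5) -> float:
    """Run `call` `runs` times and return the median wall-clock time in seconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    timings = []
    for _ in range(runs):
        start = time.perf_counter()  # monotonic, high-resolution timer
        call()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)
```

For example, `median_latency(lambda: generate_text("gpt-4o", prompt), runs=5)`. The median is less sensitive to a single outlier than the mean, but every run is a billed request.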
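## Overall Summary
Groq models delivered much lower latency and higher throughput in this test, making them more viable for real-time applications and offering balanced business value. OpenAI models, particularly gpt-4 and gpt-4o, produce detailed output but took roughly half a minute per request here, which reduces their business value for real-time applications. Two caveats apply: each model was timed on a single request, and the two sections measure time differently (Groq's server-reported `completion_time` versus client-side wall-clock time for OpenAI, which also includes network and queueing overhead), so the cross-provider gap is likely overstated. Cost was not measured directly in this notebook, so the cost remarks above are qualitative. For most business use cases, Groq models offer a better balance between user experience and cost-effectiveness.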
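The notebook records token counts but not cost. A sketch of turning `usage` numbers into a per-request cost estimate, assuming separate per-million-token input and output prices looked up on each provider's pricing page; the `Pricing` class, `request_cost` helper, and the prices below are hypothetical placeholders, not real rates:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices in USD (placeholders, not real rates)."""
    input_per_million: float
    output_per_million: float


def request_cost(prompt_tokens: int, completion_tokens: int, pricing: Pricing) -> float:
    """Estimate the USD cost of one request from its token usage."""
    return (
        prompt_tokens * pricing.input_per_million
        + completion_tokens * pricing.output_per_million
    ) / 1_000_000


# Placeholder prices: replace with the provider's published rates.
example_pricing = Pricing(input_per_million=1.0, output_per_million=2.0)
print(request_cost(1_000, 500, example_pricing))  # 0.002
```

In the benchmark loops, `chat_completion.usage.prompt_tokens` and `chat_completion.usage.completion_tokens` would supply the two token counts, letting cost sit next to latency in the comparison.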