Description
What happened?
Token tracking is not working as expected when using a streamified module with Amazon Bedrock. completion_tokens is populated as expected, but all other fields have a value of 0. The one exception is total_tokens, but that just ends up equal to completion_tokens since all the other fields are 0.
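For illustration, the usage dict returned by get_lm_usage() has roughly this shape (N stands for the completion token count, which varies per run; nested detail fields elided):

{'bedrock/us.amazon.nova-lite-v1:0': {'prompt_tokens': 0, 'completion_tokens': N, 'total_tokens': N, ...}}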
I have spent a lot of time trying to track down this issue. I've stepped through the LiteLLM code in the debugger and can see that all usage fields are being set, but at some point in the chunk iterator they seem to get stripped. The LiteLLM code is too dense and indirect for me to follow exactly where that happens.
The interesting thing is that while stepping through the LiteLLM code, all of the usage fields looked right: total_tokens was the sum of prompt and completion tokens, which tells me it's getting re-computed at some point.
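For reference, a direct LiteLLM streaming call can help confirm whether usage survives at that layer before DSPy touches it. A minimal diagnostic sketch, assuming the Bedrock provider honors stream_options={"include_usage": True} and attaches usage to the final chunk:

import litellm

response = litellm.completion(
    model="bedrock/us.amazon.nova-lite-v1:0",
    messages=[{"role": "user", "content": "What is the capital of Maine?"}],
    stream=True,
    stream_options={"include_usage": True},  # ask for usage on the stream itself
)
for chunk in response:
    # Usage, when reported, is typically attached only to the final chunk
    usage = getattr(chunk, "usage", None)
    if usage:
        print(usage)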
To Recap
Token usage is getting set properly inside LiteLLM, but somewhere between LiteLLM and DSPy every usage field except completion_tokens is zeroed out.
I have tried updating to the newest version of LiteLLM and that didn't make a difference.
LiteLLM versions tested: 1.71.1, 1.72.6.post1
Steps to reproduce
I have tried many different parameters for the streamify call. The example below uses async_streaming=False because the results are the same either way, and the example is less cumbersome without async streaming.
Here is a minimal example that will reproduce the error:
import os
import dspy
# If no AWS profile is configured you can use access key and secret key instead
os.environ["AWS_PROFILE"] = "YOUR_AWS_PROFILE"
# os.environ["AWS_ACCESS_KEY_ID"] = "YOUR_AWS_ACCESS_KEY_ID"
# os.environ["AWS_SECRET_ACCESS_KEY"] = "YOUR_AWS_SECRET_ACCESS_KEY"
# os.environ["AWS_SESSION_TOKEN"] = "YOUR_AWS_SESSION_TOKEN" <-- Optional
os.environ["AWS_REGION"] = "us-east-1"
class QAPredictor(dspy.Signature):
    """Answer questions with helpful, accurate responses."""

    question = dspy.InputField()
    answer = dspy.OutputField()
predictor = dspy.Predict(QAPredictor)
lm = dspy.LM("bedrock/us.amazon.nova-lite-v1:0")
dspy.configure(lm=lm, track_usage=True)
streaming_predictor = dspy.streamify(
    predictor,
    async_streaming=False,  # <--- Doesn't matter if this is True or False for token tracking
    include_final_prediction_in_output_stream=True,
)
def generate_response(question):
    for chunk in streaming_predictor(question=question):
        if isinstance(chunk, dspy.Prediction):
            usage = chunk.get_lm_usage()
            print(usage)
if __name__ == "__main__":
    generate_response(question="What is the capital of Maine?")
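For comparison, calling the same predictor without streamify shows whether full usage is reported on the non-streaming path. A quick sketch reusing the predictor and LM configured above; if prompt_tokens is non-zero here, the stripping is specific to streaming:

def generate_response_non_streaming(question):
    # Same module, no streamify: usage should be fully populated if the
    # bug is confined to the streaming chunk iterator
    prediction = predictor(question=question)
    print(prediction.get_lm_usage())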
DSPy version
3.0.0b1, also tested with 2.6.27