Extend OpenAI finish_reason handling #1985
base: master
Conversation
bertopic/representation/_openai.py
Outdated
label = choice.message.content.strip().replace("topic: ", "")
elif choice.finish_reason == "length":
logger.warn(f"Extracing Topics - Length limit reached for doc_ids ({repr_doc_ids})")
if hasattr(response.choices[0].message, "content"):
I am not sure of the API's behavior when hitting the token limit, so I kept the check for content.
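As a small illustration of the check being discussed: with the OpenAI SDK, `message.content` can exist yet be `None` when generation is cut off, so a truthiness check via `getattr` is a slightly safer guard than `hasattr` alone. The objects below are stand-ins, not real API responses.

```python
# Hedged sketch of the content guard discussed above. SimpleNamespace
# objects stand in for OpenAI response messages, not the real SDK types.
from types import SimpleNamespace

def safe_label(message):
    # content may be missing or None when the token limit cuts generation off
    content = getattr(message, "content", None)
    if content:
        return content.strip().replace("topic: ", "")
    return None
```

With this shape, a message whose `content` is `None` falls through to the fallback branch instead of raising.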
For now, this is fine I guess. Also, you can replace response.choices[0] with choice. Also, and this might be just me, could we replace the variable name choice with output? I find the term choice not easily readable/explicit.
bertopic/representation/_openai.py
Outdated
else:
logger.warn(f"Extracing Topics - No label due to finish_reason {choice.finish_reason} for doc_ids ({repr_doc_ids})")
I am not in love with exposing the finish_reason concept in the error message, but I wonder whether folks will have to learn about it when debugging the issue anyway. Wyt?
I also think I need to index into repr_doc_ids based on the index. Kudos to @sdehm for pointing that out.
> I am not in love with exposing the finish_reason concept in the error message but I wonder whether folks will have to learn about it when debugging the issue anyway. Wyt?

If that reason didn't trigger any of the other if statements, then to me it makes sense to log it, since we didn't expect it and there is no reason for the user to know which one it is.
Also, perhaps make the warning a bit more explicit. For instance: "OpenAI Topic Representation - Couldn't create a label due to {choice.finish_reason} for the following document IDs". I think being explicit here might improve any potential debugging.
Apologies for the delay, I left some comments here and there.
logger = MyLogger("WARNING")
I'm not sure whether importing the logger will work with the current logging procedure. My guess is that it will create a separate logger which does things twice. Also, the model should control the verbosity so that this warning is suppressed if needed by the user.
I believe, but I will have to check, that the functionality of the logger will change in an open PR.
That makes total sense. Since we want to leave verbosity controls to the model, I could use the logger instance from topic_model.logger. Wyt?
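As a sketch of that idea (class and parameter names here are hypothetical, not BERTopic's actual API), the representation model could accept an injected logger and only fall back to its own when none is supplied, leaving verbosity under the model's control:

```python
# Hypothetical sketch: a representation model that reuses a logger handed
# to it (e.g. the topic model's own logger) rather than creating a second
# module-level logger that would duplicate output.
import logging

class OpenAIRepresentation:
    def __init__(self, logger=None):
        # Reuse the injected logger so the model controls verbosity;
        # fall back to a module logger only when none is supplied.
        self.logger = logger if logger is not None else logging.getLogger(__name__)

    def warn_length(self, repr_doc_ids):
        self.logger.warning(
            f"Extracting Topics - Length limit reached for doc_ids ({repr_doc_ids})"
        )
```

This way a warning is suppressed or shown according to whatever level the caller configured on the shared logger.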
@@ -37,7 +38,7 @@
Topic name:"""
DEFAULT_CHAT_PROMPT = """
I have a topic that contains the following documents:
I have a topic that contains the following documents:
Although I expect no issues, it would be interesting to see the effect of removing a \n from this prompt. In my experience with recent LLMs (Phi-3/Llama-3), this can actually have a decent influence on performance (especially with respect to following instructions). Either way, good change!
That's a great point. I am tempted to undo this change then, since I don't necessarily want to modify any behavior other than logging.
bertopic/representation/_openai.py
Outdated
if hasattr(response.choices[0].message, "content"):
label = choice.message.content.strip().replace("topic: ", "")
else:
label = "Incomple output due to token limit being reached"
Should be "Incomplete" instead of "Incomple"
bertopic/representation/_openai.py
Outdated
else:
label = "Incomple output due to token limit being reached"
elif choice.finish_reason == "content_filter":
logger.warn(f"Extracing Topics - Content filtered for doc_ids ({repr_doc_ids})")
Perhaps make the warning a bit more explicit? For instance: "The content filter of OpenAI was triggered for the following document IDs:"
@MaartenGr, I made the following updates based on your feedback:
- Replace all uses of choices[0] with output

Open questions:
As part of #1979, I added:
- Handling for the finish_reason values: length, content_filter, and stop
- A warning that logs the finish_reason for any unspecified cases
- Logging of repr_doc_ids in all error cases to improve debugging

Fixes #1979
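Putting the thread together, the branching this PR describes might look roughly as follows. This is a hedged sketch, not the merged implementation: the mock response objects and the exact fallback label strings are illustrative, and the real code operates on OpenAI ChatCompletion responses inside bertopic/representation/_openai.py.

```python
# Sketch of finish_reason handling as discussed in this PR (illustrative).
import logging
from types import SimpleNamespace

logger = logging.getLogger("BERTopic")

def label_from_output(output, repr_doc_ids):
    """Return a topic label, or a fallback string, based on finish_reason."""
    if output.finish_reason == "stop":
        return output.message.content.strip().replace("topic: ", "")
    if output.finish_reason == "length":
        logger.warning(
            f"Extracting Topics - Length limit reached for doc_ids ({repr_doc_ids})"
        )
        # The API may still return partial content when the token limit is hit
        if getattr(output.message, "content", None):
            return output.message.content.strip().replace("topic: ", "")
        return "Incomplete output due to token limit being reached"
    if output.finish_reason == "content_filter":
        logger.warning(
            f"Extracting Topics - Content filtered for doc_ids ({repr_doc_ids})"
        )
        return "Output blocked by content filter"  # illustrative fallback
    # Any unspecified finish_reason: log it so users can debug unexpected cases
    logger.warning(
        f"Extracting Topics - No label due to finish_reason "
        f"{output.finish_reason} for doc_ids ({repr_doc_ids})"
    )
    return "No label returned"  # illustrative fallback
```

Note that each warning includes repr_doc_ids, per the review feedback, so a user can trace exactly which documents produced the degraded label.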