
Updates to the OpenAI Embeddings Provider #758

Merged: 23 commits merged into develop from feature/update-embedding-model on May 9, 2024

Conversation

@dkotter (Collaborator) commented Apr 16, 2024

Description of the Change

OpenAI recently introduced new Embedding models that are cheaper, give better results, and come with the ability to reduce the size of the vectors produced. The main purpose of this PR is to update our existing OpenAI Embeddings Provider to use one of these new models. In addition, a few other improvements have been made to how we use Embeddings.

High level overview of changes

  • Update from the text-embedding-ada-002 model to the text-embedding-3-small model. We decided to use this model instead of text-embedding-3-large as it's cheaper to run while still giving good results
  • Reduce our embedding vectors down to 512 dimensions, which is a new feature of the new OpenAI Embedding models. This gives us better performance while still getting fairly accurate results
  • For any content we get embeddings for, add a method that first chunks the content down (with an overlap between each chunk). This means that instead of generating a single embedding for an entire post_content field, we may generate 4 or 5 embeddings. This required changes in multiple places, as we now have an array of embeddings to compare; we merge those results together, sort by score, and remove duplicates, giving us the highest-matched items (see the sketch after this list)
  • Added a generic generate_embedding method that will take in some text and generate embeddings for it. Our more specific methods that generate embeddings for a post or a term now pass content into this generic method. This also makes it easier for 3rd parties to use this same method
  • Change our term query to raise the limit on the number of terms we return and to return terms sorted by most used to least used, so the terms we consider should be the most valuable
  • Show a dismissible admin notice that prompts an admin to regenerate embeddings, as embeddings produced with different models are not compatible. Clicking the link in this notice will trigger this process
  • Slight refactoring of our admin notice code to ensure notices only show to admins and that dismissing a notice only happens when the X button is clicked, plus some general cleanup
  • Fixed an undefined variable that was causing PHP warnings to be logged when running automatic classification
  • In our existing normalizer method, remove code that strips out abbreviations. It's not clear why this is there (it has been for a number of years), but it ends up removing text that we don't want removed (for instance, it removes the AI from ClassifAI, leaving just Classif)
  • Updated our similarity function to work better with the new embeddings model and renamed it to cosine_similarity so it's clearer what it does (and allows us to add other similarity functions if desired)
  • Added new helper methods that pass values through filters, allowing 3rd parties to more easily change things (like the model or the number of terms we process at once)
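
For reference, here's a minimal sketch of the chunking approach described above, assuming word-based chunks with a fixed overlap. The function name and default sizes are illustrative, not the plugin's actual implementation:

```php
<?php
/**
 * Split content into overlapping word-based chunks.
 *
 * Illustrative sketch only; chunk size and overlap are measured
 * in words and the defaults are arbitrary.
 */
function chunk_content( string $content, int $chunk_size = 500, int $overlap = 100 ): array {
	$words  = preg_split( '/\s+/', trim( $content ) );
	$total  = count( $words );
	$chunks = [];
	$step   = max( 1, $chunk_size - $overlap );

	for ( $start = 0; $start < $total; $start += $step ) {
		// Each chunk shares $overlap words with the previous chunk.
		$chunks[] = implode( ' ', array_slice( $words, $start, $chunk_size ) );

		if ( $start + $chunk_size >= $total ) {
			break;
		}
	}

	return $chunks;
}
```

Each chunk is then embedded individually (via the generic generate_embedding method), so a single post_content field may produce several embeddings whose match scores are later merged, sorted, and de-duplicated.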

How to test the Change

  1. On the latest released version and in a clean environment, set up OpenAI Embeddings as the Provider for the Classification Feature
  2. Ensure you have at least one taxonomy set and save the settings (this usually requires saving twice, which is an existing issue we need to fix)
  3. Once term embeddings have been generated, try running the previewer on an item and ensure results are shown
  4. Go to an item and try the manual classification feature and ensure it works
  5. Turn on automatic classification and ensure it works
  6. Now check out the changes in this PR
  7. Ensure you see an admin notice letting you know the model has changed and embeddings need to be regenerated. Click on the link in this notice
  8. This should rebuild the term embeddings and take you to the settings page, showing a success message
  9. Try all the same steps above and ensure proper results are shown (results may be slightly different since it's a different model)

Changelog Entry

Added - New filters that allow changing some of the OpenAI Embedding variables, like model or dimensions (see the example after this list)
Changed - Update from the text-embedding-ada-002 to the text-embedding-3-small model for the OpenAI Embeddings Provider. Note this requires regenerating all existing embeddings for comparisons to work properly. Also worth mentioning, you'll want to check your threshold settings to ensure they work well with these changes, as we're seeing lower threshold scores with this new model.
Changed - When generating embeddings, ensure we chunk our content down and reduce our vector dimensions to 512
Changed - Increase the number of terms we consider when running comparisons using the OpenAI Embeddings Provider from 500 to 5000
Fixed - Fix some undefined PHP variable warnings when running automatic classification
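
Since the changelog mentions new filters for the model and dimensions, here's a hypothetical example of how an extender might use them. The hook names below are illustrative only; check the PR code for the actual filter names:

```php
<?php
// Hypothetical filter names; see the PR code for the actual hooks.

// Swap in the larger model, trading higher cost for potentially
// better accuracy.
add_filter( 'classifai_openai_embeddings_model', function ( $model ) {
	return 'text-embedding-3-large';
} );

// Use larger vectors than the new 512-dimension default.
add_filter( 'classifai_openai_embeddings_dimensions', function ( $dimensions ) {
	return 1024;
} );
```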

Credits

Props @dkotter

Checklist:

  • I agree to follow this project's Code of Conduct.
  • I have updated the documentation accordingly.
  • I have added tests to cover my change.
  • All new and existing tests pass.

@dkotter dkotter added this to the 3.1.0 milestone Apr 16, 2024
@dkotter dkotter self-assigned this Apr 16, 2024
@dkotter dkotter requested review from jeffpaul and a team as code owners April 16, 2024 20:06
@github-actions github-actions bot added the needs:code-review This requires code review. label Apr 16, 2024
@dkotter (Collaborator, Author) commented Apr 16, 2024

Note that you'll see a couple of TODO statements in the code that talk about some existing issues. I was going to fix those in this PR but decided not to since it was already getting fairly large. I've opened #759 to track these; ideally we get that fixed before releasing this.

@iamdharmesh (Member) left a comment


Thanks a lot for working on this, @dkotter. This is such an amazing enhancement to ClassifAI.

The code looks good to me, and in most cases, it works fine. I just noticed a significant threshold % difference between the trunk and this branch. Is this expected? Could you please help to check if you experience the same, or is this something that only happens with my setup? Also, if this is expected, then we might want to add information in the notice to adjust the threshold.

[Screenshots comparing threshold percentages: trunk vs. this PR branch]

Thank you.

includes/Classifai/Providers/OpenAI/Embeddings.php (review comment: outdated, resolved)
@dkotter (Collaborator, Author) commented May 8, 2024

@iamdharmesh

I just noticed a significant threshold % difference between the trunk and this branch. Is this expected? Could you please help to check if you experience the same, or is this something that only happens with my setup?

I do think this is expected, as we're not only using a different model but have also slightly changed our comparison function. I'm not seeing differences quite as drastic as what you've shown, but I am seeing differences before and after.

The good things I'm seeing though are:

  1. The recommended results I'm seeing are similar to or even better than before, though the scores are different
  2. There's a wider range in the scores. Previously there was usually only a 10-15% difference between the top match and the lowest match, and often only a few percentage points (or sometimes less than a point) between results, which makes it hard to set a proper threshold. In my testing (and this matches your screenshots) we're now seeing a much wider range of scores, which I think is good

To test further, I used some of the examples that OpenAI provides (utilizing their Python SDK) to run comparisons, and I got very similar distance scores. A few examples:

  • For one particular term, our code gave a distance score of 0.33385; OpenAI's Python SDK gave a score of 0.32443
  • For another term, our code gave a distance score of 0.42320; OpenAI's Python SDK gave a score of 0.41547

These obviously don't match 100%, but I believe they are close enough to show things are working as expected (especially as we round and convert to a percentage, so you'd end up with 66.62% vs 67.56% and 57.68% vs 58.45% respectively).
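
For context, here's a minimal sketch of the cosine similarity comparison being discussed, assuming two equal-length embedding vectors. Names are illustrative, not the plugin's actual code:

```php
<?php
/**
 * Cosine similarity between two equal-length vectors.
 * Illustrative sketch only.
 */
function cosine_similarity( array $a, array $b ): float {
	$dot = $norm_a = $norm_b = 0.0;

	foreach ( $a as $i => $value ) {
		$dot    += $value * $b[ $i ];
		$norm_a += $value * $value;
		$norm_b += $b[ $i ] * $b[ $i ];
	}

	return $dot / ( sqrt( $norm_a ) * sqrt( $norm_b ) );
}

// Example vectors (real embeddings would have 512 dimensions).
$a = [ 0.12, -0.34, 0.56 ];
$b = [ 0.10, -0.30, 0.60 ];

// The distance scores above are 1 - similarity; the UI then shows
// ( 1 - distance ) * 100 as a percentage.
$distance = 1 - cosine_similarity( $a, $b );
$percent  = round( ( 1 - $distance ) * 100, 2 );
```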

Also, if this is expected, then we might want to add information in the notice to adjust the threshold.

I'm not worried about this for a few reasons:

  1. To be honest, I'm not sure anyone is currently using this feature :)
  2. In my testing, the scores you get are going to vary based on the content you have, so I'm not sure there's a threshold we can recommend as the default. Ideally each site would test things out and adjust accordingly, something we can make clearer in our readme.

I think we can call this out in our changelog, but I probably wouldn't do more than that; open to other thoughts though.

Also worth mentioning that I'm open to others testing things out and seeing if we need changes to our comparison function. I took what OpenAI is doing in their SDK and converted that from Python to PHP, but there's a good chance I missed something or messed something up in that conversion. There are also other comparison functions we could look at; cosine similarity seems to be the most widely used, but it may be worth testing others.

@dkotter dkotter requested a review from iamdharmesh May 8, 2024 23:06
@iamdharmesh (Member) left a comment


@dkotter Thanks for checking on this and providing a detailed explanation; everything looks good to me. Regarding the comparison function, I've already utilized it in finding similar terms, and it works really well there. So, there's no issue with that. I think we're good to merge this.

@dkotter dkotter merged commit 400529e into develop May 9, 2024
15 checks passed
@dkotter dkotter deleted the feature/update-embedding-model branch May 9, 2024 20:41