
Updates to the OpenAI Embeddings Provider #758

Merged: 23 commits merged into develop from feature/update-embedding-model on May 9, 2024

Conversation

@dkotter (Collaborator) commented Apr 16, 2024

Description of the Change

OpenAI recently introduced new Embedding models that are cheaper, give better results, and come with the ability to reduce the size of the vectors produced. The main purpose of this PR is to update our existing OpenAI Embeddings Provider to use one of these new models. In addition, a few other improvements have been made to how we use Embeddings.

High level overview of changes

  • Update from the text-embedding-ada-002 model to the text-embedding-3-small model. We decided to use this model instead of text-embedding-3-large as it's cheaper to run while still giving good results
  • Reduce our embedding vectors down to 512 dimensions, which is a new feature of the new OpenAI Embedding models. This gives us better performance while still getting fairly accurate results
  • For any content we get embeddings for, add a method that first chunks the content down (with an overlap between each chunk). This means that instead of generating a single embedding for an entire post_content field, we may generate 4 or 5 embeddings. This required changes in multiple places, as we now have an array of embeddings to compare; we merge those results together, sort by score, and remove duplicates, giving us the highest-matched items (see the sketch after this list)
  • Added a generic generate_embedding method that will take in some text and generate embeddings for it. Our more specific methods that generate embeddings for a post or a term now pass content into this generic method. This also makes it easier for 3rd parties to use this same method
  • Change our term query to raise the limit on the number of terms we return and to return terms sorted by most used to least used, so the terms we consider should be the most valuable
  • Show a dismissible admin notice that prompts an admin to regenerate embeddings, as embeddings produced with different models are not compatible. Clicking the link in this notice will trigger this process
  • Slight refactoring of our admin notice code to ensure notices only show to admins and that dismissing a notice only happens when the X button is clicked, plus some general cleanup
  • Fixed an undefined variable that was causing PHP warnings to be logged when running automatic classification
  • In our existing normalizer method, remove code that strips out abbreviations. It's not clear why this is there (it has been for a number of years), but it ends up removing text that we don't want removed (for instance, it removes the AI from ClassifAI, leaving just Classif)
  • Updated our similarity function to work better with the new embeddings model and renamed it to cosine_similarity so it's clearer what it does (and allows us to add other similarity functions if desired)
  • Added new helper methods that pass values through filters, allowing 3rd parties to more easily change things (like the model or the number of terms we process at once)
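
For reference, here's a minimal sketch of the chunking approach described above, assuming word-based chunks with a fixed overlap. The function name and default sizes are illustrative, not the plugin's actual implementation:

```php
<?php
/**
 * Split content into overlapping word-based chunks.
 *
 * Illustrative sketch only; chunk size and overlap are measured
 * in words and the defaults are arbitrary.
 */
function chunk_content( string $content, int $chunk_size = 500, int $overlap = 100 ): array {
	$words  = preg_split( '/\s+/', trim( $content ) );
	$total  = count( $words );
	$chunks = [];
	$step   = max( 1, $chunk_size - $overlap );

	for ( $start = 0; $start < $total; $start += $step ) {
		// Each chunk shares $overlap words with the previous chunk.
		$chunks[] = implode( ' ', array_slice( $words, $start, $chunk_size ) );

		if ( $start + $chunk_size >= $total ) {
			break;
		}
	}

	return $chunks;
}
```

Each chunk is then embedded individually (via the generic generate_embedding method), so a single post_content field may produce several embeddings whose match scores are later merged, sorted, and de-duplicated.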

How to test the Change

  1. On the latest released version and in a clean environment, set up OpenAI Embeddings as the Provider for the Classification Feature
  2. Ensure you have at least one taxonomy set and save the settings (this usually requires saving twice, which is an existing issue we need to fix)
  3. Once term embeddings have been generated, try running the previewer on an item and ensure results are shown
  4. Go to an item and try the manual classification feature and ensure it works
  5. Turn on automatic classification and ensure it works
  6. Now check out the changes in this PR
  7. Ensure you see an admin notice letting you know the model has changed and embeddings need to be regenerated. Click on the link in this notice
  8. This should rebuild the term embeddings and take you to the settings page, showing a success message
  9. Try all the same steps above and ensure proper results are shown (results may be slightly different since it's a different model)

Changelog Entry

Added - New filters that allow changing some of the OpenAI Embedding variables, like model or dimensions (see the example after this list)
Changed - Update from the text-embedding-ada-002 to the text-embedding-3-small model for the OpenAI Embeddings Provider. Note this requires regenerating all existing embeddings for comparisons to work properly. Also worth mentioning, you'll want to check your threshold settings to ensure they work well with these changes, as we're seeing lower threshold scores with this new model.
Changed - When generating embeddings, ensure we chunk our content down and reduce our vector dimensions to 512
Changed - Increase the number of terms we consider when running comparisons using the OpenAI Embeddings Provider from 500 to 5000
Fixed - Fix some undefined PHP variable warnings when running automatic classification
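
Since the changelog mentions new filters for the model and dimensions, here's a hypothetical example of how an extender might use them. The hook names below are illustrative only; check the PR code for the actual filter names:

```php
<?php
// Hypothetical filter names; see the PR code for the actual hooks.

// Swap in the larger model, trading higher cost for potentially
// better accuracy.
add_filter( 'classifai_openai_embeddings_model', function ( $model ) {
	return 'text-embedding-3-large';
} );

// Use larger vectors than the new 512-dimension default.
add_filter( 'classifai_openai_embeddings_dimensions', function ( $dimensions ) {
	return 1024;
} );
```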

Credits

Props @dkotter

Checklist:

  • I agree to follow this project's Code of Conduct.
  • I have updated the documentation accordingly.
  • I have added tests to cover my change.
  • All new and existing tests pass.

@dkotter dkotter added this to the 3.1.0 milestone Apr 16, 2024
@dkotter dkotter self-assigned this Apr 16, 2024
@dkotter dkotter requested review from jeffpaul and a team as code owners April 16, 2024 20:06
@github-actions github-actions bot added the needs:code-review This requires code review. label Apr 16, 2024
@dkotter (Collaborator, Author) commented Apr 16, 2024

Note that you'll see a couple of TODO statements in the code that talk about some existing issues. I was going to fix those in this PR but decided not to since it was already getting fairly large. I've opened #759 to track these; ideally we get that fixed before releasing this.

@iamdharmesh (Member) left a comment


Thanks a lot for working on this, @dkotter. This is such an amazing enhancement to ClassifAI.

The code looks good to me, and in most cases, it works fine. I just noticed a significant threshold % difference between the trunk and this branch. Is this expected? Could you please help to check if you experience the same, or is this something that only happens with my setup? Also, if this is expected, then we might want to add information in the notice to adjust the threshold.

[Screenshots comparing threshold percentages: trunk vs. this PR branch]

Thank you.

includes/Classifai/Providers/OpenAI/Embeddings.php (review comment: outdated, resolved)
@dkotter (Collaborator, Author) commented May 8, 2024

@iamdharmesh

I just noticed a significant threshold % difference between the trunk and this branch. Is this expected? Could you please help to check if you experience the same, or is this something that only happens with my setup?

I do think this is expected, as we're not only using a different model but have also slightly changed our comparison function. I'm not seeing differences quite as drastic as what you've shown, but I am seeing differences before and after.

The good things I'm seeing though are:

  1. The recommended results I'm seeing are similar to or even better than before, though the scores are different
  2. There's a wider range in the scores. Previously there was usually only a 10-15% difference between the top match and the lowest match, and often only a few percentage points (or sometimes less than a point) between results, which makes it hard to set a proper threshold. In my testing (and this matches your screenshots) we're now seeing a much wider range of scores, which I think is good

To test further, I used some of the examples that OpenAI provides (utilizing their Python SDK) to run comparisons, and I got very similar distance scores. A few examples:

  • For one particular term, our code gave a distance score of 0.33385; OpenAI's Python SDK gave a score of 0.32443
  • For another term, our code gave a distance score of 0.42320; OpenAI's Python SDK gave a score of 0.41547

These obviously don't match 100%, but I believe they are close enough to show things are working as expected (especially as we round and convert to a percentage, so you'd end up with 66.62% vs 67.56% and 57.68% vs 58.45% respectively).
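
For context, here's a minimal sketch of the cosine similarity comparison being discussed, assuming two equal-length embedding vectors. Names are illustrative, not the plugin's actual code:

```php
<?php
/**
 * Cosine similarity between two equal-length vectors.
 * Illustrative sketch only.
 */
function cosine_similarity( array $a, array $b ): float {
	$dot = $norm_a = $norm_b = 0.0;

	foreach ( $a as $i => $value ) {
		$dot    += $value * $b[ $i ];
		$norm_a += $value * $value;
		$norm_b += $b[ $i ] * $b[ $i ];
	}

	return $dot / ( sqrt( $norm_a ) * sqrt( $norm_b ) );
}

// Example vectors (real embeddings would have 512 dimensions).
$a = [ 0.12, -0.34, 0.56 ];
$b = [ 0.10, -0.30, 0.60 ];

// The distance scores above are 1 - similarity; the UI then shows
// ( 1 - distance ) * 100 as a percentage.
$distance = 1 - cosine_similarity( $a, $b );
$percent  = round( ( 1 - $distance ) * 100, 2 );
```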

Also, if this is expected, then we might want to add information in the notice to adjust the threshold.

I'm not worried about this for a few reasons:

  1. To be honest, I'm not sure anyone is currently using this feature :)
  2. In my testing, the scores you get are going to vary based on the content you have, so I'm not sure there's a threshold we can recommend as the default. Ideally each site would test things out and adjust accordingly, something we can make clearer in our readme.

I think we can call this out in our changelog, but I probably wouldn't do more than that; open to other thoughts though.

Also worth mentioning that I'm open to others testing things out and seeing if we need changes to our comparison function. I took what OpenAI is doing in their SDK and converted that from Python to PHP, but there's a good chance I missed something or messed something up in that conversion. There are also other comparison functions we could look at; cosine similarity seems to be the most widely used, but it may be worth testing others.

@dkotter dkotter requested a review from iamdharmesh May 8, 2024 23:06
@iamdharmesh (Member) left a comment


@dkotter Thanks for checking on this and providing a detailed explanation; everything looks good to me. Regarding the comparison function, I've already utilized it in finding similar terms, and it works really well there. So, there's no issue with that. I think we're good to merge this.

@dkotter dkotter merged commit 400529e into develop May 9, 2024
15 checks passed
@dkotter dkotter deleted the feature/update-embedding-model branch May 9, 2024 20:41