Skip to content
This repository has been archived by the owner on Nov 1, 2018. It is now read-only.

redoPop/BayesianAverage

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 

Repository files navigation

Not Maintained

It’s been a while since I last worked with CakePHP, and I haven’t kept this repository up-to-date.

If you’d like to maintain a fork, please shoot me an email and I’ll drop a link here to point others in the right direction.

BayesianAverage plugin for CakePHP

This CakePHP plugin provides a Behavior that calculates the Bayesian averages and unweighted averages of item ratings, so that you can meaningfully sort items by their rating.

Cool. What?

Let’s say you have a book rating site. Two of the books are “Slaughterhouse-Five”, and “I Love My Mommy” by some guy no one’s heard of. “Slaughterhouse-Five” got 399 5-star ratings, and one 3-star rating. “I Love My Mommy” got a single rating of 5 stars by the author’s mother. Because of Slaughterhouse’s one 3-star rating, its mean rating has become lower than that of “I Love My Mommy”, despite the fact that it has 398 more 5-star ratings than that book. In a list of books ordered by mean rating, “I Love My Mommy” would be considered “better” than “Slaughterhouse-Five”.

This CakePHP plugin provides a Behavior you can use to avoid such embarrassment by calculating what some call the “Bayesian average” or “Bayesian estimate”, which weighs the mean rating of an item toward the mean rating of all items in proportion to the number of votes it receives in relation to the average number of votes an item receives. Effectively, it produces a number that represents the item’s ratings in the context of other items in the table, so sorting by it is more meaningful.

Basic Use

This plugin should be deployed to the directory /app/plugins/bayesian_average, so that this document sits in the path /app/plugins/bayesian_average/README.textile

Your database should contain a table for the items being rated, and another table to store each rating, with the following fields:

items table:

  • id
  • ratings_count (an integer)
  • mean_rating (a decimal or float)
  • bayesian_rating (a decimal or float)

item_ratings table:

  • id
  • item_id
  • rating

The actual table and field names don’t matter; the names of any fields that differ from the above structure can be specified in the configuration array of the BayesianAverage Behavior when it is imported into the ItemRating (or equivalent) model:


  class ItemRating extends AppModel {
      ...

      var $actsAs = array('BayesianAverage.BayesianAverageable' => array(
          'fields' => array(
              'itemId' => 'item_id', # foreign key in "item_ratings", linking it to "items"
              'rating' => 'rating', # field in "item_ratings" that contains that rating's value
              'ratingsCount' => 'ratings_count', # field in "items" used to store ratings count for an item
              'meanRating' => 'mean_rating', # field in "items" used to store unweighted average rating for an item
              'bayesianRating' => 'bayesian_rating', # field in "items" used to store Bayesian average for an item
          )
      ));

      ...

You must also make sure that caching is enabled on your production environment. The line Configure::write('Cache.disable', true); should be commented out in /app/config/core.php.

The rest is done by magic. The ratings_count, mean_rating, and bayesian_rating will all be populated for you when ratings are saved. When you want to order your items by rating, order by bayesian_rating instead.

Read on for tips on how to get the best performance out of this plugin.

Basic Performance: Indexing and Normalizing Items

There are several things you can do to increase the performance of this plugin and guard against potential performance bottlenecks. The most basic of these is to create a single MySQL table index for the two fields ratings_count and mean_rating (or whatever they’re named in your “item_ratings” table schema).

If you have a wide “items” table (i.e., if it has a lot of columns in addition to the 4 outlined in “Basic Use”), you can get a further performance boost by separating the ratings_count, mean_rating, and bayesian_rating fields into an “item_averages” table, with an item_id field acting as a common primary key between the “item_ratings” and “item_averages” tables. You would then need to modify the Behavior’s configuration array to declare this:


  class ItemRating extends AppModel {
      ...
      var $actsAs = array('BayesianAverage.BayesianAverageable' => array(
          'fields' => array('itemId' => 'item_id'),
          'itemModel' => 'ItemAverage',
      ));
      ...

Intermediate Performance: Cache Configuration

An important aspect of calculating Bayesian averages is that the constant C and total mean m must remain consistent for all items. (See the “Bayesian Average Formula” section of this document to understand what these variables mean.) To this end, the behavior must occasionally recalculate the Bayesian averages for all items in your “item” model.

Every half hour (or half your app’s default cache lifetime), the plugin compares a freshly calculated constant C and mean m against a cache of those values. If the difference between the two is more than 10%, the freshly calculated values replace the previous cache and the Bayesian averages are updated for all items; otherwise, the cached values continue to be used for the next half hour until the difference is more prominent.

As the number of ratings in your database increases, the cache must be replaced — and item averages must be recalculated — less and less frequently, resulting in better performance. However, CakePHP periodically destroys cache files, in which case the plugin must recreate them. You can set up a cache configuration in CakePHP to make sure that cache files are very rarely destroyed.

Creating and Applying a CakePHP Cache Configuration

To create an annual cache configuration (one that will keep cache files for an entire year), open /app/config/core.php and add the following:


  Cache::config('annual', array(
      'engine' => 'File',
      'duration'=> '+1 year',
  ));

You must now modify the Behavior’s configuration array to apply this cache configuration:


  var $actsAs = array('BayesianAverage.BayesianAverageable' => array(
      'fields' => array('itemId' => 'item_id'),
      'cache' => array('config' => 'annual'),    // <= this line!
  ));

Lastly, double check that caching is enabled for your app. The line Configure::write('Cache.disable', true); should be commented out in /app/config/core.php

Advanced Performance: Manual Configuration of Constants

By the time you have a sizable number of items and ratings, C and m should be involatile, so that a recalculation of all items’ averages will not occur unless the cache is destroyed. However, you can escape the potential performance problem entirely by setting the value of the constant C and the mean m directly in the Behavior’s configuration array.

To do this, simply remember that C is the average number of votes you expect per item, and m is the average rating an item is given. For example, if you notice a trend that most items average about 50 ratings and the mean rating is 3.5, you can set the values of C and m to be 50 and 3.5 respectively:


  var $actsAs = array('BayesianAverage.BayesianAverageable' => array(
      'C' => 50,
      'm' => 3.5,
  ));

Whether calculated or declared explicitly, if the value of C is 2 or below, the Bayesian average has no serious value. So that you can continue to order by the bayesian_rating field, the mean rating will be used instead.

“Bayesian Average” Formula

The thing I’m calling a “Bayesian Average” might be better named otherwise. This is the actual formula I’m using:


  (n / (n + C)) * j + (C / (n + C)) * m

  • C is the average number of ratings an item receives
  • m is the average rating across all items
  • n is the number of ratings the current item
  • j is the average rating for the current item

I’m mot a mathematician. I stole it from the shoulders of Giants and was using it for some time before building this Behavior.

All Configuration Options

To show the proper syntax of its configuration options, here is how the Behavior would look if imported with every setting applied:


  class ItemRating extends AppModel {
      ...

      var $actsAs = array('BayesianAverage.BayesianAverageable' => array(
          'fields' => array(
              'itemId' => 'item_id',
              'rating' => 'rating',
              'ratingsCount' => 'ratings_count',
              'meanRating' => 'mean_rating',
              'bayesianRating' => 'bayesian_rating',
          ),
          'itemModel' => 'Item',
          'C' => 50,
          'm' => 3.5,
          'cache' => array(
            'config' => null,
            'prefix' => 'BayesianAverage_',
            'calculationDuration' => 1800,
          ),
      ));

      ...

fields

This array is used to configure the Behavior according to the field names in your table schema: itemId is the name of the foreign key that establishes the link between your “item_ratings” and “items” tables, etc. Its use is illustrated in the “Basic Use” section of this document.

itemModel

This is the name of the model that contains the ratings_count, mean_rating, and bayesian_rating for each item. Usually, the plugin is able to determine the item model from its relationship with the rating model, but sometimes this relationship doesn’t exist. The “Basic Performance” section of this document describes how this field can be used to help normalize the items table for increased performance.

If there is no relationship between the Rating model and “itemModel”, the behavior will assume that “itemId” (from the “fields” array) is a common foreign key to both models.

C and m

The numbers C and m represent the average number of votes you expect per item, and the average rating an item is given, respectively. The “Advanced Performance” section of this document describes how these values can be used to further increase performance once you have a sufficient pool of data.

cache

These settings control how the plugin uses its cache files.

  • config – string name of the CakePHP cache configuration to use. If this value is null, the default cache configuration will be used.
  • prefix – string prefix prepended to the plugin’s cache files. You shouldn’t need to change this unless it causes a naming conflict with other cache files.
  • calculationDuration – how long the plugin should wait, in seconds, before recalculating the values C and m to see if all items need to be updated.

The “Intermediate Performance” section of this document shows how to use the cache configuration settings to make sure CakePHP destroys cache data as infrequently as possible.

About

[Deprecated] Bayesian average calculations for CakePHP.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages