># Evaluation and Testing of Recommendations

Typical machine learning metrics cannot be applied directly to recommender systems because of the nature of the problem. To judge whether recommendations are good, we therefore need different methodologies. A live system is usually the best way to measure this, but during development we also need indicators that show the general direction of recommendation quality.

One such indicator is `coverage`. It answers two questions: does every user get a recommendation (user coverage), and does every item get recommended to at least one user (content coverage)?


In [1]:
def coverage(users, items, recommender):
    """Return (user_coverage, item_coverage) as fractions between 0 and 1."""
    users_with_recommendations = 0
    recommended_items = set()

    for user in users:
        recommendations_for_user = recommender.recommend(user.id, items)
        # Count the user only if at least one item came back.
        if recommendations_for_user is not None and len(recommendations_for_user) > 0:
            users_with_recommendations += 1
            recommended_items.update(recommendations_for_user)

    total_users = len(users)
    total_items = len(items)

    # Guard against empty inputs to avoid division by zero.
    user_coverage = users_with_recommendations / total_users if total_users else 0.0
    item_coverage = len(recommended_items) / total_items if total_items else 0.0

    return user_coverage, item_coverage

>It is important to have tests for the smaller pieces of the code itself.

For example, in similarity checks we can verify that two identical vectors give a similarity of 1, orthogonal vectors give 0, and opposite vectors give -1.
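The three checks above can be written as a small unit test. This is a minimal sketch that assumes the similarity in question is cosine similarity; `cosine_similarity` and `test_cosine_similarity` are illustrative names, not functions from the notes.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def test_cosine_similarity():
    v = [1.0, 2.0, 3.0]
    # Identical vectors should give 1.
    assert math.isclose(cosine_similarity(v, v), 1.0)
    # Orthogonal vectors should give 0 (absolute tolerance, since 0 has no relative scale).
    assert math.isclose(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 0.0, abs_tol=1e-12)
    # Opposite vectors should give -1.
    assert math.isclose(cosine_similarity(v, [-x for x in v]), -1.0)


test_cosine_similarity()
```

The comparisons use `math.isclose` rather than `==` because floating-point rounding can leave the result a tiny distance away from the exact value.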

To test further, we need a test dataset called the `complete ground truth`, which contains an entry for every user and item combination. In practice we would never have such a dataset, because it would make the recommendation system irrelevant. For testing purposes, however, we can create a synthetic one and build on it.
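One possible way to build such a synthetic dataset is sketched below. It assumes ratings on a 1 to 5 scale that are driven by random latent "taste" vectors; the function name, the rating scale, and the latent-factor model are all illustrative choices rather than anything prescribed by the notes.

```python
import random


def make_complete_ground_truth(n_users, n_items, n_factors=3, seed=0):
    """Return {(user, item): rating} for every user-item pair, ratings in 1..5."""
    rng = random.Random(seed)
    # Hypothetical latent vectors with entries in [0, 1).
    user_factors = [[rng.random() for _ in range(n_factors)] for _ in range(n_users)]
    item_factors = [[rng.random() for _ in range(n_factors)] for _ in range(n_items)]

    ratings = {}
    for u, u_vec in enumerate(user_factors):
        for i, i_vec in enumerate(item_factors):
            # Average of element-wise products, so affinity lies in [0, 1).
            affinity = sum(a * b for a, b in zip(u_vec, i_vec)) / n_factors
            ratings[(u, i)] = 1 + round(4 * affinity)  # map onto the 1..5 scale
    return ratings


ground_truth = make_complete_ground_truth(n_users=4, n_items=6)
```

Because the seed is fixed, the same dataset is produced on every run, which keeps tests built on it repeatable.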

>## Offline testing

The simplest (and least useful) way of testing a recommender system is to take the historical data (the ground truth) and hide part of the ratings that users gave to items. We then use the recommender system to predict the hidden ratings and compare its outputs with the actual values using error measures such as MSE or RMSE.
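The hold-out procedure described above can be sketched as follows. The ratings dictionary, the `split_ratings` and `rmse` helpers, and the global-mean baseline predictor are assumptions made for illustration; a real recommender would take the place of the baseline.

```python
import math
import random


def split_ratings(ratings, hidden_fraction=0.2, seed=0):
    """Split {(user, item): rating} into a visible part and a hidden part."""
    rng = random.Random(seed)
    keys = sorted(ratings)
    rng.shuffle(keys)
    n_hidden = int(len(keys) * hidden_fraction)
    hidden = {k: ratings[k] for k in keys[:n_hidden]}
    visible = {k: ratings[k] for k in keys[n_hidden:]}
    return visible, hidden


def rmse(predict, hidden):
    """Root mean squared error of predict(user, item) over the hidden ratings."""
    squared_errors = [(predict(u, i) - r) ** 2 for (u, i), r in hidden.items()]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy historical data and a baseline that always predicts the visible mean.
ratings = {("u1", "a"): 5, ("u1", "b"): 3, ("u2", "a"): 4,
           ("u2", "c"): 2, ("u3", "b"): 1}
visible, hidden = split_ratings(ratings, hidden_fraction=0.4)
global_mean = sum(visible.values()) / len(visible)
error = rmse(lambda user, item: global_mean, hidden)
```

Only the visible ratings are used to build the predictor; the hidden ones are touched only when computing the error.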

This type of evaluation reportedly does not work well in practice.

>## Decision Support Metrics

A typical ML project uses classification evaluation metrics such as precision, recall, and accuracy. In recommendation systems, identifying the values needed to calculate these metrics is a bit more complicated, so we define them as follows, tailored to a recommendation system.

- True Positive: The item was recommended and the user consumed it.
- False Positive: The item was recommended but the user did not consume it.
- False Negative: The item was not recommended but the user consumed it.
- True Negative: The item was not recommended and the user did not consume it.
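With these four definitions, precision and recall for a single user can be computed from plain sets of item ids. The sketch below treats a true negative as an item that was neither recommended nor consumed; the helper names and the toy catalog are hypothetical.

```python
def confusion_counts(recommended, consumed, catalog):
    """Count TP, FP, FN, TN for one user from collections of item ids."""
    recommended, consumed, catalog = set(recommended), set(consumed), set(catalog)
    tp = len(recommended & consumed)            # recommended and consumed
    fp = len(recommended - consumed)            # recommended, not consumed
    fn = len(consumed - recommended)            # consumed, not recommended
    tn = len(catalog - recommended - consumed)  # neither recommended nor consumed
    return tp, fp, fn, tn


def precision_recall(recommended, consumed, catalog):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN); 0.0 if undefined."""
    tp, fp, fn, _ = confusion_counts(recommended, consumed, catalog)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


catalog = {"a", "b", "c", "d", "e", "f"}
recommended = {"a", "b", "c"}
consumed = {"b", "c", "d", "e"}
print(confusion_counts(recommended, consumed, catalog))  # (2, 1, 2, 1)
print(precision_recall(recommended, consumed, catalog))  # precision 2/3, recall 1/2
```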


<center><image src="./images/Evaluation Metrics.jpg" width="500px" /></center>




### **Mean Average Precision**

Average precision measures how good a ranking is by taking the precision over the top k recommended items. Here m indicates the recommended items and k denotes the relevant items. Both are computed per user, for that user's recommendations.

<center><image src="./images/Precision for k.jpg" width="350px" /></center>

<center><image src="./images/Mean average precision.jpg" width="250px" /></center>
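A minimal sketch of these metrics follows. It uses one common definition of average precision (precision at each position that holds a relevant item, divided by the number of relevant items), which may differ in normalization from the formulas in the figures above. All function names are illustrative.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / k


def average_precision(recommended, relevant):
    """Sum of precision@k at each relevant position, divided by len(relevant)."""
    if not relevant:
        return 0.0
    total = 0.0
    for k, item in enumerate(recommended, start=1):
        if item in relevant:
            total += precision_at_k(recommended, relevant, k)
    return total / len(relevant)


def mean_average_precision(recommended_by_user, relevant_by_user):
    """Average of the per-user average precision over all users."""
    scores = [
        average_precision(recs, relevant_by_user.get(user, set()))
        for user, recs in recommended_by_user.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


recommended_by_user = {"u1": ["a", "b", "c", "d"], "u2": ["x", "y"]}
relevant_by_user = {"u1": {"a", "c"}, "u2": {"y"}}
# u1: hits at positions 1 and 3 -> (1/1 + 2/3) / 2 = 5/6; u2: hit at position 2 -> 1/2
map_score = mean_average_precision(recommended_by_user, relevant_by_user)
```

Placing a relevant item higher in the list raises the score, which is what makes the metric sensitive to ranking rather than just to the set of recommended items.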

### **Discounted Cumulative Gain**

This metric considers not only the relevance of an item but also its position (rank) in the list. In the equation below, rel[i] can be the predicted rating, the business profit for the item, etc., depending on the use case.

<center><image src="./images/Discounted cumulative gain.jpg" width="250px" /></center>
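As a rough illustration, the sketch below uses the widely used form of the metric, DCG = sum over positions i of rel[i] / log2(i + 1). This is an assumption about the formula in the figure; other variants exist (for example, with 2^rel[i] - 1 in the numerator).

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance scores listed in ranked order."""
    return sum(
        rel / math.log2(position + 1)
        for position, rel in enumerate(relevances, start=1)
    )


# Same three items, two different orderings: the better ranking scores higher.
best_first = dcg([3, 2, 1])
worst_first = dcg([1, 2, 3])
```

The logarithmic discount is what penalizes a highly relevant item for appearing low in the list: the first position is divided by log2(2) = 1, the third by log2(4) = 2.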

For more details of the implementation check [This Link](https://github.com/benhamner/Metrics).

>## Online Evaluation

The two main methods of online testing are as follows.

1. Controlled Experiment: We expose the recommendation system to a closed group of users and collect feedback from them.
2. A/B Testing: We divert a portion of users to the new recommendations while the rest continue to receive the existing implementation. This is useful for understanding the impact of the new recommendation system compared to what already exists.

Either way, the point is to identify the parameters and techniques that provide better recommendations to users.
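For the A/B test described above, the traffic split is often done deterministically so that a given user always sees the same variant. The following is a sketch under that assumption; the function name, the experiment label, and the 10% share are hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment="new-recommender", treatment_share=0.1):
    """Deterministically map a user to 'treatment' or 'control'."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    # Interpret the first 8 hex digits of the hash as a number in [0, 1).
    bucket = int(hashlib.sha256(key).hexdigest()[:8], 16) / 16**8
    return "treatment" if bucket < treatment_share else "control"


variants = [assign_variant(user_id) for user_id in range(10_000)]
treatment_fraction = variants.count("treatment") / len(variants)
```

Including the experiment label in the hashed key means a new experiment reshuffles users instead of reusing the same treatment group every time.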

Another important concept in recommendation systems is `Feedback Loops`. As the system keeps making recommendations, users may get stuck with the same type of items, because what they are shown shapes what they consume, and that in turn shapes future recommendations. To keep the recommended content diverse, it is essential to add new items as inputs. These can be random items, items from searches done by users, etc.
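One simple way to inject such new items is to reserve a few slots at the end of each recommendation list for randomly chosen items. The sketch below assumes that approach; `mix_in_exploration` and its parameters are illustrative names, and uniform random sampling is only one of several possible choices.

```python
import random


def mix_in_exploration(recommended, catalog, n_explore=2, seed=None):
    """Replace the last few recommendations with random items not already listed."""
    rng = random.Random(seed)
    candidates = sorted(set(catalog) - set(recommended))
    # Never swap out more slots than the list has or than candidates exist.
    n_swap = min(n_explore, len(candidates), len(recommended))
    explore = rng.sample(candidates, n_swap)
    keep = recommended[: len(recommended) - n_swap]
    return keep + explore


catalog = ["a", "b", "c", "d", "e", "f", "g", "h"]
recommended = ["a", "b", "c", "d"]
mixed = mix_in_exploration(recommended, catalog, n_explore=2, seed=7)
```

The top of the list is left untouched, so the strongest recommendations are still shown while the tail exposes the user to items outside their usual pattern.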