<h1>CS4619: Artificial Intelligence II</h1>
<h1>Recommender Systems IV</h1>
<h2>
    Derek Bridge<br />
    School of Computer Science and Information Technology<br />
    University College Cork
</h2>

<h1>Initialization</h1>
$\newcommand{\Set}[1]{\{#1\}}$ 
$\newcommand{\Tuple}[1]{\langle#1\rangle}$ 
$\newcommand{\v}[1]{\pmb{#1}}$ 
$\newcommand{\cv}[1]{\begin{bmatrix}#1\end{bmatrix}}$ 
$\newcommand{\rv}[1]{[#1]}$ 
$\DeclareMathOperator{\argmax}{arg\,max}$ 
$\DeclareMathOperator{\argmin}{arg\,min}$ 
$\DeclareMathOperator{\dist}{dist}$
$\DeclareMathOperator{\abs}{abs}$

In [2]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.metrics import mean_absolute_error

<h1>A Broader Perspective on Recommender Systems</h1>
<ul>
    <li>Now that we know something about how recommender systems work, we'll use this lecture to reflect
        on the way that the field has changed. 
    </li>
    <li>It is a story about how the field has matured by relaxing various assumptions over time.</li>
    <li>This might be instructive because other parts of AI, especially those parts that use Machine Learning,
        are undergoing similar journeys.
    </li>
</ul>

<h2>From predicted ratings to ranking</h2>
<ul>
    <li>Early recommender systems focused on the prediction of ratings:
        <ul>
            <li>The loss function is, e.g., MSE.</li>
            <li>We evaluate the performance of a model on a test set by measuring prediction error, e.g. MAE.</li>
        </ul>
    </li>
    <li>But:
        <ul>
            <li>Correctly ranking the candidates is more important than correctly predicting their ratings.
                Why?
            </li>
            <li>Correctly ranking the candidates is more important at the top of the ranking than at the
                bottom of the ranking. Why?
            </li>
        </ul>
    </li>
    <li>Hence, later recommender systems focused on ranking and on the top of the ranking:
        <ul>
            <li>New loss functions (we won't look at these).</li>
            <li>New measures of performance, called ranking metrics &mdash; some of them borrowed from Information Retrieval.
                <ul>
                    <li>Suppose we say that some items in the test set are <em>relevant</em> to a user, e.g.
                        the ones that she rated 4 or 5, i.e. $rel_{ui} = (r_{ui} \geq 4)$.
                    </li>
                    <li>Then we want metrics that reward a model if, for a given user, her top-$N$ 
                        includes relevant items,
                        and especially if those relevant items come earlier in the top-$N$.
                        <ul>
                            <li>E.g. precision @$N$: the fraction of the top-$N$ that are relevant.</li>
                            <li>E.g. mean average precision @$N$: (roughly) the precision at 
                                $1, 2,\ldots,N$, averaged.
                            </li>
                            <li>E.g. normalized discounted cumulative gain @ $N$ is based on the ranks of the
                                relevant items.
                            </li>
                        </ul>
                    </li>
                </ul>
            </li>
        </ul>
    </li>
</ul>
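<p>
    The three ranking metrics above can be sketched as follows. This is a minimal illustration with binary relevance, not a reference implementation: the function names, the toy item identifiers and the choice to normalise average precision by $\min(|relevant|, N)$ are assumptions made for this example.
</p>

```python
import math

def precision_at_n(relevant, ranked, n):
    """Fraction of the top-n ranked items that are relevant."""
    return sum(1 for item in ranked[:n] if item in relevant) / n

def average_precision_at_n(relevant, ranked, n):
    """Average of precision@k over the ranks k <= n that hold a relevant item."""
    hits, total = 0, 0.0
    for k, item in enumerate(ranked[:n], start=1):
        if item in relevant:
            hits += 1
            total += hits / k
    denom = min(len(relevant), n)  # assumed normalisation
    return total / denom if denom > 0 else 0.0

def ndcg_at_n(relevant, ranked, n):
    """DCG with binary gains, divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(k + 1)
              for k, item in enumerate(ranked[:n], start=1)
              if item in relevant)
    ideal_hits = min(len(relevant), n)
    idcg = sum(1.0 / math.log2(k + 1) for k in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Toy data (hypothetical): two relevant items, two rankings of the same items.
relevant = {"i2", "i5"}
ranked_good = ["i2", "i5", "i1", "i3"]  # relevant items at the top
ranked_poor = ["i1", "i3", "i2", "i5"]  # relevant items at the bottom

for name, ranked in [("good", ranked_good), ("poor", ranked_poor)]:
    print(name,
          precision_at_n(relevant, ranked, 4),
          average_precision_at_n(relevant, ranked, 4),
          ndcg_at_n(relevant, ranked, 4))
```

<p>
    Both rankings have the same precision @4, but average precision and nDCG are higher for the ranking that places the relevant items earlier, which is the behaviour we asked for.
</p>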

<h2>Beyond-accuracy</h2>
<ul>
    <li>Whether we are looking at rating prediction or ranking, and whether we are measuring prediction error
        or ranking metrics, we are still focusing on recommendation accuracy or relevance.
    </li>
    <li>Another development in the field was to move beyond this focus on accuracy/relevance.</li>
    <li>Beyond-accuracy includes such criteria as:
        <ul>
            <li>Serendipity: the extent to which recommendations are a pleasant surprise.</li>
            <li>Novelty: the extent to which the recommender avoids popular items, exploring the long-tail 
                of items.
            </li>
            <li>Diversity: the extent to which a top-$N$ covers different tastes.</li>
        </ul>
    </li>
    <li>This led to new recommender systems:
        <ul>
            <li>E.g. recommender systems that, as we saw, greedily re-rank the set of candidates to obtain
                a set that still contains relevant recommendations but which is also diverse.
            </li>
        </ul>
    </li>
    <li>And it led to new evaluation metrics:
        <ul>
            <li>E.g. in my research, we have defined measures of surprise!</li>
        </ul>
        But it probably also led to a greater emphasis on evaluation using human users, instead of just using
        test sets: we need to find out whether users get a surprise or perceive the diversity.
    </li>
</ul>
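<p>
    A greedy re-ranker of this kind might look roughly like the sketch below. It is one possible formulation, not necessarily the one covered earlier in the module: the trade-off parameter <code>alpha</code>, the use of average pairwise distance as the diversity measure, and the toy scores and distances are all assumptions made for illustration.
</p>

```python
def intra_list_diversity(items, dist):
    """Average pairwise distance between the items in a list."""
    pairs = [(a, b) for idx, a in enumerate(items) for b in items[idx + 1:]]
    if not pairs:
        return 0.0
    return sum(dist[a][b] for a, b in pairs) / len(pairs)

def greedy_rerank(candidates, scores, dist, n, alpha=0.5):
    """Repeatedly pick the candidate with the best mix of relevance and
    average distance to the items already selected."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < n:
        def value(item):
            if selected:
                diversity = sum(dist[item][s] for s in selected) / len(selected)
            else:
                diversity = 0.0  # first pick is made on relevance alone
            return (1 - alpha) * scores[item] + alpha * diversity
        best = max(remaining, key=value)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data (hypothetical): a and b are near-duplicates, c is different.
scores = {"a": 0.9, "b": 0.85, "c": 0.6}
dist = {
    "a": {"b": 0.1, "c": 0.9},
    "b": {"a": 0.1, "c": 0.9},
    "c": {"a": 0.9, "b": 0.9},
}

plain = greedy_rerank(["a", "b", "c"], scores, dist, n=2, alpha=0.0)
diverse = greedy_rerank(["a", "b", "c"], scores, dist, n=2, alpha=0.5)
print(plain, intra_list_diversity(plain, dist))
print(diverse, intra_list_diversity(diverse, dist))
```

<p>
    With <code>alpha=0.0</code> the re-ranker simply returns the two highest-scoring items; with <code>alpha=0.5</code> it trades a little predicted relevance for a more diverse top-$N$.
</p>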

<h2>From explicit ratings to implicit ratings</h2>
<ul>
    <li>Another development was to focus less on explicit ratings.</li>
    <li>Explicit ratings are a form of user-item interaction in which the user gives feedback 
        on an item she has consumed.
        <ul>
            <li>These ratings are usually numeric, e.g. 1-5 stars, plus $\bot$.</li>
            <li>But they can be binary, e.g. thumbs-up/thumbs-down, plus $\bot$.</li>
        </ul>
    </li>
    <li>But not everyone bothers to rate the items that they consume.</li>
    <li>We could use other user-item interactions instead:
        <ul>
            <li>clicks, downloads, purchases,&hellip;</li>
        </ul>
        Since the user is not asked to rate, and these signals are instead inferred from other user actions, we call them implicit ratings.
    </li>
</ul>

<h3>Implicit ratings</h3>
<ul>
    <li>Implicit ratings are often going to be unary:
        <ul>
            <li>1 if you clicked/downloaded/purchased, plus $\bot$.
             <table style="border: 1px solid; border-collapse: collapse;">
            <tr>
                <th style="border: 1px solid black; text-align: left;"></th>
                <th style="border: 1px solid black; text-align: left;">$i_1$</th>
                <th style="border: 1px solid black; text-align: left;">$i_2$</th>
                <th style="border: 1px solid black; text-align: left;">$i_3$</th>
                <th style="border: 1px solid black; text-align: left;">$i_4$</th>
                <th style="border: 1px solid black; text-align: left;">$i_5$</th>
                <th style="border: 1px solid black; text-align: left;">$i_6$</th>
            </tr>
            <tr>
                <th style="border: 1px solid black; text-align: left;">$u_1$</th>
                <td style="border: 1px solid black; text-align: left;"></td>
                <td style="border: 1px solid black; text-align: left;">1</td>
                <td style="border: 1px solid black; text-align: left;">1</td>
                <td style="border: 1px solid black; text-align: left;">1</td>
                <td style="border: 1px solid black; text-align: left;">1</td>
                <td style="border: 1px solid black; text-align: left;">1</td>
            </tr>
            <tr>
                <th style="border: 1px solid black; text-align: left;">$u_2$</th>
                <td style="border: 1px solid black; text-align: left;">1</td>
                <td style="border: 1px solid black; text-align: left;">1</td>
                <td style="border: 1px solid black; text-align: left;"></td>
                <td style="border: 1px solid black; text-align: left;">1</td>
                <td style="border: 1px solid black; text-align: left;">1</td>
                <td style="border: 1px solid black; text-align: left;"></td>
            </tr>
            <tr>
                <th style="border: 1px solid black; text-align: left;">$u_3$</th>
                <td style="border: 1px solid black; text-align: left;"></td>
                <td style="border: 1px solid black; text-align: left;"></td>
                <td style="border: 1px solid black; text-align: left;"></td>
                <td style="border: 1px solid black; text-align: left;"></td>
                <td style="border: 1px solid black; text-align: left;">1</td>
                <td style="border: 1px solid black; text-align: left;"></td>
            </tr>
            <tr>
                <th style="border: 1px solid black; text-align: left;">$u_4$</th>
                <td style="border: 1px solid black; text-align: left;">1</td>
                <td style="border: 1px solid black; text-align: left;">1</td>
                <td style="border: 1px solid black; text-align: left;">1</td>
                <td style="border: 1px solid black; text-align: left;">1</td>
                <td style="border: 1px solid black; text-align: left;">1</td>
                <td style="border: 1px solid black; text-align: left;">1</td>
            </tr>
            <tr>
                <th style="border: 1px solid black; text-align: left;">$u_5$</th>
                <td style="border: 1px solid black; text-align: left;">1</td>
                <td style="border: 1px solid black; text-align: left;">1</td>
                <td style="border: 1px solid black; text-align: left;">1</td>
                <td style="border: 1px solid black; text-align: left;">1</td>
                <td style="border: 1px solid black; text-align: left;"></td>
                <td style="border: 1px solid black; text-align: left;"></td>
            </tr>
        </table>
            </li>
        </ul>
    </li>
    <li>Sometimes they are numeric, e.g.:
        <ul>
            <li>listening frequency for music;</li>
            <li>dwell-time for news articles.</li>
        </ul>
    </li>
</ul>
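<p>
    As a small illustration, the unary matrix in the table above can be held in a NumPy array, using <code>nan</code> to stand for $\bot$. The variable names are assumptions made for this sketch.
</p>

```python
import numpy as np

nan = np.nan  # stands for "no interaction observed"

# Rows are users u1..u5, columns are items i1..i6, as in the table above.
R = np.array([
    [nan, 1,   1,   1,   1,   1  ],
    [1,   1,   nan, 1,   1,   nan],
    [nan, nan, nan, nan, 1,   nan],
    [1,   1,   1,   1,   1,   1  ],
    [1,   1,   1,   1,   nan, nan],
])

observed = ~np.isnan(R)                       # True where an interaction exists
density = observed.mean()                     # fraction of cells that are filled
interactions_per_user = observed.sum(axis=1)
interactions_per_item = observed.sum(axis=0)

print(density, interactions_per_user, interactions_per_item)
```

<p>
    Every observed entry is a 1: the matrix records that an interaction happened, but it contains no entry that directly expresses dislike.
</p>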

<h3>Which are more reliable?</h3>
<ul>
    <li>Explicit ratings were previously thought to be more reliable.
        <ul>
            <li>Why? Because they involve a conscious act.</li>
            <li>But there are many factors that may reduce their reliability, e.g.:
                <ul>
                    <li>Users may not rate consistently.</li>
                    <li>Users' tastes may change over time.</li>
                    <li>Users may re-calibrate as they get exposed to more items.</li>
                    <li>There is a timing effect: people rate items that they consumed recently, items that they
                        consumed in the less recent past, and even items that they have not yet consumed.
                    </li>
                    <li>Users may withhold ratings due to privacy concerns.</li>
                    <li>Users may attempt to bias the system or to counteract perceived bias.</li>
                    <li>'Posturing' is rife!</li>
                </ul>
            </li>
        </ul>
    </li>
    <li>Many of the above concerns go away when it comes to implicit ratings. But new concerns arise, e.g.:
        <ul>
            <li>It is not easy to infer negative opinions from implicit ratings. Why not?</li>
            <li>It is not easy to infer that one item is preferred over another (even when implicit ratings are
                numeric). Why not?
            </li>
            <li>You must be aware that the ratings may be noisy (e.g. a user clicked but didn't mean to,
                a user purchased the item but didn't like it; e.g. the dwell-time was high but only because
                the user was doing something else, &hellip;).
            </li>
        </ul>
    </li>
    <li>An implicit ratings matrix, while still sparse, will be less sparse than an explicit
        ratings matrix.
    </li>
</ul>

<h3>Building recommender systems from implicit ratings</h3>
<ul>
    <li>People have developed recommender systems, similar to the ones we have
        studied, but which learn from an implicit ratings matrix.
    </li>
    <li>Without going into great detail, this usually involves learning something more like a classifier
        than a regressor, e.g. one that can classify whether one item is preferred to another, using cross-entropy
        as the loss function and making an assumption that item $i$ is preferred to item $j$ if $r_{ui} = 1$ but
        $r_{uj} = \bot$.
    </li>
</ul>
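<p>
    The pairwise idea above can be sketched as follows. The lecture does not name a specific algorithm; this sketch uses a small matrix factorization model trained by stochastic gradient descent on a pairwise cross-entropy loss, which is similar in spirit to Bayesian Personalized Ranking. The number of factors, the learning rate, the number of steps and all variable names are assumptions chosen for illustration.
</p>

```python
import numpy as np

rng = np.random.default_rng(0)
nan = np.nan

# Unary implicit ratings: 1 = interaction observed, nan = no interaction.
R = np.array([
    [nan, 1,   1,   1,   1,   1  ],
    [1,   1,   nan, 1,   1,   nan],
    [nan, nan, nan, nan, 1,   nan],
    [1,   1,   1,   1,   1,   1  ],
    [1,   1,   1,   1,   nan, nan],
])
observed = ~np.isnan(R)
n_users, n_items = R.shape

k = 2  # assumed number of latent factors
P = rng.normal(scale=0.1, size=(n_users, k))  # user factors
Q = rng.normal(scale=0.1, size=(n_items, k))  # item factors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_triple():
    """Sample (u, i, j) with r_ui = 1 and r_uj missing."""
    while True:
        u = rng.integers(n_users)
        pos = np.flatnonzero(observed[u])
        neg = np.flatnonzero(~observed[u])
        if len(pos) > 0 and len(neg) > 0:  # skip users with no such pair
            return u, rng.choice(pos), rng.choice(neg)

def mean_pairwise_loss():
    """Cross-entropy of 'i is preferred to j' over all assumed preference pairs."""
    losses = []
    for u in range(n_users):
        for i in np.flatnonzero(observed[u]):
            for j in np.flatnonzero(~observed[u]):
                x = P[u] @ (Q[i] - Q[j])
                losses.append(-np.log(sigmoid(x)))
    return float(np.mean(losses))

loss_before = mean_pairwise_loss()

lr = 0.1  # assumed learning rate
for _ in range(2000):
    u, i, j = sample_triple()
    x = P[u] @ (Q[i] - Q[j])
    g = sigmoid(x) - 1.0          # derivative of -log(sigmoid(x)) w.r.t. x
    p_u = P[u].copy()
    P[u] -= lr * g * (Q[i] - Q[j])
    Q[i] -= lr * g * p_u
    Q[j] += lr * g * p_u

loss_after = mean_pairwise_loss()
print(loss_before, loss_after)
```

<p>
    The model never predicts a rating. It only learns scores whose differences say which of two items a user is more likely to prefer, and those scores can then be used to rank the candidates.
</p>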

<h2>Towards context-aware recommender systems</h2>
<ul>
    <li>Especially for implicit ratings, we might want to record the context. 
        <ul>
            <li>E.g. when did the user click or purchase this item? (Day, time-of-day, &hellip;)</li>
            <li>Where was she? (Home, work, &hellip;)</li>
            <li>Who was she with? (Alone, her kids, her partner, &hellip;)</li>
            <li>What were the circumstances? (The weather, was she in her car, etc.)</li>
        </ul>
    </li>
    <li>If we think of each of these as further dimensions, then we get a ratings tensor that is even more
        sparse than the traditional (two-dimensional) ratings matrix.
    </li>
    <li>How do we build recommender systems that work in a context-aware fashion?
        <ul>
            <li>You could pre-filter: ignore ratings that came from contexts unlike the user's current context;
                then make recommendations in the normal way based on the remaining ratings. What's the problem?
            </li>
            <li>You could post-filter: make recommendations in the normal way based on all the ratings, 
                then discard any recommendations that do not seem
                suitable to the user's current context. What's the problem?
            </li>
            <li>You could make recommendations in a special way, based on all the ratings but giving more
                weight to those whose context is similar to the current context. 
            </li>
        </ul>
    </li>
</ul>
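<p>
    To make the first and third options concrete, here is a minimal sketch that contrasts pre-filtering with context weighting. The rating records, the context dimensions and the similarity measure (the fraction of context dimensions on which two contexts agree) are hypothetical, and a context-weighted item mean stands in for a real recommender.
</p>

```python
# Hypothetical ratings for one item, each recorded with its context.
ratings = [
    {"user": "u1", "item": "i1", "rating": 5,
     "context": {"day": "weekend", "company": "kids"}},
    {"user": "u2", "item": "i1", "rating": 2,
     "context": {"day": "weekday", "company": "alone"}},
    {"user": "u3", "item": "i1", "rating": 4,
     "context": {"day": "weekend", "company": "alone"}},
]

def context_similarity(c1, c2):
    """Fraction of shared context dimensions on which the two contexts agree."""
    keys = c1.keys() & c2.keys()
    if not keys:
        return 0.0
    return sum(c1[k] == c2[k] for k in keys) / len(keys)

def prefilter(ratings, current):
    """Keep only the ratings whose context matches the current context exactly."""
    return [r for r in ratings if r["context"] == current]

def weighted_item_mean(ratings, item, current):
    """Mean rating for an item, weighting each rating by context similarity."""
    num = den = 0.0
    for r in ratings:
        if r["item"] == item:
            w = context_similarity(r["context"], current)
            num += w * r["rating"]
            den += w
    return num / den if den > 0 else None

current = {"day": "weekend", "company": "kids"}
kept = prefilter(ratings, current)
score = weighted_item_mean(ratings, "i1", current)
print(len(kept), score)
```

<p>
    It is worth comparing how many ratings survive the pre-filter with how many contribute, with reduced weight, to the weighted mean.
</p>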

<h2>Towards sequence-aware recommender systems</h2>
<ul>
    <li>Moving away from explicit feedback to implicit feedback and to context-awareness is great.</li>
    <li>But even doing so still ignores the following, which are true of many domains, including e-commerce:
        <ul>
            <li>User actions are ordered by time.</li>
            <li>There can be different kinds of actions e.g. search, buy, download, play.</li>
            <li>There may be more than one interaction (action) per item.</li>
            <li>Different kinds of actions may be ordered semantically, e.g. play follows download.</li>
            <li>User actions are often grouped into sessions:
                <ul>
                    <li>a user’s goals/intentions and short-term/ephemeral preferences may vary from session 
                        to session.
                    </li>
                </ul>
            </li>
            <li>Users are often anonymous, or they may be infrequent visitors, which makes many of them cold-start users.</li>
        </ul>
    </li>
</ul>
<figure>
    <img src="images/seq_aware.png" />
</figure>

<h3>Sequence-aware recommender systems</h3>
<ul>
    <li>These are in their infancy. Many techniques are being tried.</li>
    <li>Nearest-neighbour methods;</li>
    <li>Association rule mining;</li>
    <li>Markov chains;</li>
    <li>Recurrent Neural Networks; and</li>
    <li>Embeddings learned from the sessions, which may be used in combination with the above.</li>
</ul>
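<p>
    As an illustration of one of the simpler techniques in this list, the sketch below builds a first-order Markov chain from session data and uses it to recommend likely next items. The sessions and function names are hypothetical, and the model deliberately ignores action types and anything earlier in the session than the most recent item.
</p>

```python
from collections import Counter, defaultdict

# Hypothetical sessions: each is the time-ordered list of items a user touched.
sessions = [
    ["i1", "i2", "i3"],
    ["i1", "i2", "i4"],
    ["i2", "i3"],
    ["i5", "i1", "i2"],
]

# Count how often each item is immediately followed by each other item.
transitions = defaultdict(Counter)
for session in sessions:
    for current_item, next_item in zip(session, session[1:]):
        transitions[current_item][next_item] += 1

def next_item_probs(item):
    """Estimated probability of each next item, given the current item."""
    counts = transitions.get(item, Counter())
    total = sum(counts.values())
    return {nxt: c / total for nxt, c in counts.items()} if total else {}

def recommend_next(item, n=2):
    """The n most frequently observed successors of the current item."""
    counts = transitions.get(item, Counter())
    return [nxt for nxt, _ in counts.most_common(n)]

print(next_item_probs("i2"))
print(recommend_next("i2", n=1))
```

<p>
    Because it needs only the current session, a model like this can make recommendations to anonymous and cold-start users, although it has nothing to offer for an item it has never seen followed by another.
</p>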

<h2>Beyond beyond-accuracy</h2>
<ul>
    <li>As we saw, beyond-accuracy is the idea of evaluating the quality of a recommender system not just on the basis
        of rating prediction error or top-$N$ recommendation relevance.
    </li>
    <li>Beyond beyond-accuracy is (my name for) an even wider set of criteria.
        <ul>
            <li>Transparency (explanations).</li>
            <li>Robustness to attacks.</li>
            <li>Privacy.</li>
            <li>Fairness.</li>
            <li>Freedom from systemic bias:
                <ul>
                    <li>This is bias caused by an imbalance in a dataset due to societal or historical factors.</li>
                </ul>
            </li>
            <li>Freedom from selection bias:
                <ul>
                    <li>This is bias in a dataset caused by the fact that users are not exposed to items at random.
                        (Why does this occur?)
                        <!-- They are more likely to be exposed to items that are already popular.
                             The recommender itself sets up a feedback loop.
                          -->
                        This influences the ratings that we obtain and learn from.
                    </li>
                </ul>
            </li>
            <li>Responsible recommendation:
                <ul>
                    <li>Maximizing the right objectives, e.g. not maximizing clicks because it promotes clickbait.</li>
                    <li>Avoiding misinformation and polarization.</li>
                    <li>Avoiding recommendations that do harm. <!-- e.g. recommendations to alcoholics or
                        gamblers, or people crossing the street. -->
                    </li>
                </ul>
            </li>
        </ul>
    </li>
</ul>

<h2>Conclusion</h2>
<ul>
    <li>On top of what we have covered, there are many, many more topics that are active areas of recommender
        systems research &amp; development.
    </li>
    <li>This makes the point about recommender systems, but also about AI in general, that there remains
        a lot to be done, even in fields that seem to be successful.
    </li>
</ul>