---
title: Reinforcement Learning - Personalizer
titleSuffix: Azure AI services
description: Personalizer uses information about actions and current context to make better ranking suggestions. The information about these actions and context consists of attributes or properties that are referred to as features.
author: jcodella
ms.author: jacodel
ms.manager: nitinme
ms.service: azure-ai-personalizer
ms.topic: conceptual
ms.date: 01/19/2024
---

# What is Reinforcement Learning?

[!INCLUDE Deprecation announcement]

Reinforcement Learning is an approach to machine learning that learns behaviors by getting feedback from its use.

Reinforcement Learning works by:

- Providing an opportunity or degree of freedom to enact a behavior, such as making decisions or choices.
- Providing contextual information about the environment and choices.
- Providing feedback about how well the behavior achieves a certain goal.

While there are many subtypes and styles of reinforcement learning, this is how the concept works in Personalizer:

- Your application provides the opportunity to show one piece of content from a list of alternatives.
- Your application provides information about each alternative and the context of the user.
- Your application computes a reward score and reports it back to Personalizer, as shown in the sketch after this list.
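For concreteness, here's a minimal sketch of that loop using the Personalizer Python SDK (`azure-cognitiveservices-personalizer`). The endpoint, key, event ID, feature values, and reward value are placeholders for illustration.

```python
from azure.cognitiveservices.personalizer import PersonalizerClient
from azure.cognitiveservices.personalizer.models import RankableAction, RankRequest
from msrest.authentication import CognitiveServicesCredentials

# Placeholder endpoint and key for your Personalizer resource.
client = PersonalizerClient(
    "https://<your-resource>.cognitiveservices.azure.com/",
    CognitiveServicesCredentials("<your-key>"))

# The list of alternatives, each described by its features.
actions = [
    RankableAction(id="article-sports", features=[{"topic": "sports", "length": "short"}]),
    RankableAction(id="article-news", features=[{"topic": "news", "length": "long"}]),
]

# The context of the user.
context = [{"timeOfDay": "morning", "device": "mobile"}]

# Ask Personalizer which alternative to show.
response = client.rank(rank_request=RankRequest(
    actions=actions, context_features=context, event_id="event-1"))
print(response.reward_action_id)  # the content to display

# After observing how well the choice worked, send back a reward score (0 to 1).
client.events.reward(event_id="event-1", value=1.0)
```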

Unlike some approaches to reinforcement learning, Personalizer doesn't require a simulation to work in. Its learning algorithms are designed to react to an outside world (versus controlling it) and to learn from each data point, with the understanding that each one is a unique opportunity that cost time and money to create, and that there's non-zero regret (loss of possible reward) if the service performs suboptimally.

## What type of reinforcement learning algorithms does Personalizer use?

The current version of Personalizer uses contextual bandits, an approach to reinforcement learning that is framed around making decisions or choices between discrete actions in a given context.

The decision memory (the model trained to capture the best possible decision, given a context) uses a set of linear models. Linear models have repeatedly shown business results and are a proven approach, partly because they can learn from the real world rapidly without needing multi-pass training, and partly because they can complement supervised learning models and deep neural network models.

Traffic is allocated randomly between exploration and the best action, following the percentage set for exploration; the default exploration algorithm is epsilon-greedy. The sketch below illustrates both pieces.
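As an illustration of these two ideas (not Personalizer's internal implementation), the following sketch scores each action with a per-action linear model and applies epsilon-greedy exploration. All weights and feature names here are made up.

```python
import random

def score(weights, features):
    """Linear model: dot product of an action's weights with the context features."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def choose_action(models, context_features, epsilon=0.2):
    """Epsilon-greedy: explore a random action with probability epsilon,
    otherwise exploit the best-scoring action."""
    if random.random() < epsilon:
        return random.choice(list(models))  # exploration traffic
    scores = {action: score(weights, context_features)
              for action, weights in models.items()}
    return max(scores, key=scores.get)      # best-action traffic

# Hypothetical per-action linear models and a user context.
models = {
    "article-sports": {"morning": 0.1, "mobile": 0.3},
    "article-news":   {"morning": 0.4, "mobile": 0.1},
}
print(choose_action(models, {"morning": 1.0, "mobile": 1.0}, epsilon=0.2))
```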

## History of Contextual Bandits

John Langford coined the name Contextual Bandits (Langford and Zhang [2007]) to describe a tractable subset of reinforcement learning and has worked on a half-dozen papers improving our understanding of how to learn in this paradigm:

- Beygelzimer et al. [2011]
- Dudík et al. [2011a, b]
- Agarwal et al. [2014, 2012]
- Beygelzimer and Langford [2009]
- Li et al. [2010]

John has also given several tutorials on topics such as Joint Prediction (ICML 2015), Contextual Bandit Theory (NIPS 2013), Active Learning (ICML 2009), and Sample Complexity Bounds (ICML 2003).

## What machine learning frameworks does Personalizer use?

Personalizer currently uses [Vowpal Wabbit](https://github.com/VowpalWabbit/vowpal_wabbit) as the foundation for its machine learning. This framework allows for maximum throughput and lowest latency when making personalization ranks and training the model with all events.
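For reference, Vowpal Wabbit's contextual-bandit training data is a plain-text format: a `shared` line carries the context features, each action gets its own line, and a label of the form `0:cost:probability` is attached to the action that was shown (VW minimizes cost, so a reward of 1 is logged as a cost of -1). The feature names below are illustrative, not Personalizer's internal schema.

```
shared |User timeOfDay=morning device=mobile
0:-1.0:0.8 |Action topic=sports length=short
|Action topic=news length=long
|Action topic=music length=short
```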

## References

- Langford, J., & Zhang, T. (2007). The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information. NIPS.
- Beygelzimer, A., Langford, J., Li, L., Reyzin, L., & Schapire, R. E. (2011). Contextual Bandit Algorithms with Supervised Learning Guarantees. AISTATS.
- Dudík, M., Hsu, D., Kale, S., Karampatziakis, N., Langford, J., Reyzin, L., & Zhang, T. (2011). Efficient Optimal Learning for Contextual Bandits. UAI.
- Dudík, M., Langford, J., & Li, L. (2011). Doubly Robust Policy Evaluation and Learning. ICML.
- Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., & Schapire, R. E. (2014). Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. ICML.
- Agarwal, A., Dudík, M., Kale, S., Langford, J., & Schapire, R. E. (2012). Contextual Bandit Learning with Predictable Rewards. AISTATS.
- Beygelzimer, A., & Langford, J. (2009). The Offset Tree for Learning with Partial Labels. KDD.
- Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010). A Contextual-Bandit Approach to Personalized News Article Recommendation. WWW.

## Next steps

Offline evaluation