info.json
{
  "abstract": "For undiscounted reinforcement learning in Markov decision\nprocesses (MDPs) we consider the <i>total regret</i> of\na learning algorithm with respect to an optimal policy.\nIn order to describe the transition structure of an MDP we propose a new parameter:\nAn MDP has <i>diameter</i> <i>D</i> if for any pair of states <i>s,s'</i> there is\na policy which moves from <i>s</i> to <i>s'</i> in at most <i>D</i> steps (on average).\nWe present a reinforcement learning algorithm with total regret\n<i>Õ(DS√AT)</i> after <i>T</i> steps for any unknown MDP\nwith <i>S</i> states, <i>A</i> actions per state, and diameter <i>D</i>.\nA corresponding lower bound of <i>Ω(√DSAT)</i> on the\ntotal regret of any learning algorithm is given as well.\n\n<br>\n\nThese results are complemented by a sample complexity bound on the\nnumber of suboptimal steps taken by our algorithm. This bound can be\nused to achieve a (gap-dependent) regret bound that is logarithmic in <i>T</i>.\n\n<br>\n\nFinally, we also consider a setting where the MDP is allowed to change\na fixed number of <i>l</i> times. We present a modification of our algorithm\nthat is able to deal with this setting and show a regret bound of\n<i>Õ(l<sup>1/3</sup>T<sup>2/3</sup>DS√A)</i>.",
  "authors": [
    "Thomas Jaksch",
    "Ronald Ortner",
    "Peter Auer"
  ],
  "id": "jaksch10a",
  "issue": 51,
  "pages": [
    1563,
    1600
  ],
  "title": "Near-optimal Regret Bounds for Reinforcement Learning",
  "volume": "11",
  "year": "2010"
}
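
The abstract defines the diameter D as the largest, over ordered pairs of states (s, s'), minimal expected number of steps needed to move from s to s'. To make that definition concrete, here is a minimal sketch (not code from the paper; the array layout, function names, and the value-iteration approach are my own illustration) that computes D for a small MDP with known transition probabilities, by solving the Bellman equation for expected hitting times:

```python
import numpy as np

def min_expected_hitting_times(P, target, iters=10_000, tol=1e-10):
    """Minimal expected number of steps to reach `target` from every state.

    Solves h(s) = 1 + min_a sum_{s'} P[s, a, s'] * h(s'), with h(target) = 0,
    by value iteration. P has shape (S, A, S); P[s, a] is a distribution
    over next states.
    """
    S = P.shape[0]
    h = np.zeros(S)
    for _ in range(iters):
        nxt = 1.0 + (P @ h).min(axis=1)  # (P @ h)[s, a] = E[h(next) | s, a]
        nxt[target] = 0.0                # already at the target: zero steps
        if np.max(np.abs(nxt - h)) < tol:
            return nxt
        h = nxt
    return h  # may not have converged if target is unreachable from some state

def diameter(P):
    """Diameter D: worst-case minimal expected travel time over all (s, s')."""
    S = P.shape[0]
    return max(min_expected_hitting_times(P, t).max() for t in range(S))

# Usage: a two-state deterministic MDP where action 0 stays in place
# and action 1 moves to the other state. The diameter is 1.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0  # action 0: stay
P[0, 1, 1] = P[1, 1, 0] = 1.0  # action 1: switch
print(diameter(P))  # -> 1.0
```

The iteration converges only for communicating MDPs, i.e. exactly those with finite diameter, which is the setting in which the abstract's Õ(DS√AT) regret bound applies.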