# Batch Learning in Stochastic Bandits

| | |
| --- | --- |
| Problem | Learning user preferences online is subject to feedback delay, and updating a recommender system sequentially after every single example is computationally expensive. |
| Hypothesis | A learning agent observes responses in batches, each collected over a certain time period. The impact of batch learning can be measured relative to the behavior of the corresponding online policy. |
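| Problem Statement | Given a finite set of arms $\mathcal{A}$, an environment $\nu = (P_a : a \in \mathcal{A})$ ($P_a$ is the distribution of rewards for action $a$), and a time horizon $n$, at each time step $t = 1, \dots, n$, the agent chooses an action $A_t \in \mathcal{A}$ and receives a reward $X_t \sim P_{A_t}$. The goal of the agent is to maximize the total reward $S_n = \sum_{t=1}^{n} X_t$. |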
| Solution | Sequential batch learning is a more general learning setting that covers both offline and online learning as special cases and combines their advantages. Unlike offline learning, it retains the sequential nature of the problem. Unlike online learning, it updates the agent only once per batch, which makes it appealing to implement in large-scale bandit problems. In this setting, responses are grouped in batches and observed by the agent only at the end of each batch. |
| Dataset | Mushroom, Synthetic |
| Preprocessing | Train/test split, label encoding |
| Metrics | Conversion rate, regret |
| Credits | Danil Provodin |
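
The batch protocol described in the Solution row can be sketched as a small interaction loop. This is a minimal illustration, not the notebook's implementation: the `EpsilonGreedy` toy policy, the `run_batched` helper, and the Bernoulli reward model are all assumptions made here for the example. The key point is that rewards are buffered and the policy only sees them when a batch closes.

```python
import random


class EpsilonGreedy:
    """Toy policy used only to exercise the protocol (hypothetical, not from the source)."""

    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def select(self):
        # Explore while any arm is still unobserved, or with probability epsilon.
        if 0 in self.counts or self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))
        means = [s / c for s, c in zip(self.sums, self.counts)]
        return means.index(max(means))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward


def run_batched(policy, means, horizon, batch_size, seed=0):
    """Play `horizon` steps; reveal rewards to the policy only at batch boundaries.

    Assumes Bernoulli rewards with success probabilities `means`.
    Returns (total reward, number of policy updates).
    """
    rng = random.Random(seed)
    buffer = []
    total = 0.0
    n_updates = 0
    for t in range(1, horizon + 1):
        arm = policy.select()  # decision uses statistics frozen since the last batch
        reward = 1.0 if rng.random() < means[arm] else 0.0
        total += reward
        buffer.append((arm, reward))
        # Close the batch when it is full, or when the horizon is reached.
        if t % batch_size == 0 or t == horizon:
            for a, r in buffer:
                policy.update(a, r)
            buffer.clear()
            n_updates += 1
    return total, n_updates
```

With `batch_size=1` the loop reduces to ordinary online learning, and with `batch_size=horizon` the policy never sees feedback before the run ends, which corresponds to the two special cases mentioned above.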

## Model

### Environments

| Name | Type | Rewards |
| --- | --- | --- |
| env1 | 2-arm environment | [0.7, 0.5] |
| env2 | 2-arm environment | [0.7, 0.4] |
| env3 | 2-arm environment | [0.7, 0.1] |
| env4 | 4-arm environment | [0.35, 0.18, 0.47, 0.61] |
| env5 | 4-arm environment | [0.40, 0.75, 0.57, 0.49] |
| env6 | 4-arm environment | [0.70, 0.50, 0.30, 0.10] |
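
To make the table concrete, the sketch below encodes the six environments and computes the two metrics listed earlier. It assumes the listed values are mean rewards of Bernoulli arms, which the table does not state explicitly. The helper names (`gaps`, `pseudo_regret`, `conversion_rate`) are hypothetical, and treating conversion rate as the fraction of positive rewards is also an assumption.

```python
# Reward values copied from the table above, keyed by environment name.
ENVIRONMENTS = {
    "env1": [0.7, 0.5],
    "env2": [0.7, 0.4],
    "env3": [0.7, 0.1],
    "env4": [0.35, 0.18, 0.47, 0.61],
    "env5": [0.40, 0.75, 0.57, 0.49],
    "env6": [0.70, 0.50, 0.30, 0.10],
}


def gaps(means):
    """Suboptimality gap of each arm: best mean minus that arm's mean."""
    best = max(means)
    return [best - m for m in means]


def pseudo_regret(means, chosen_arms):
    """Expected reward lost by playing `chosen_arms` instead of the best arm."""
    arm_gaps = gaps(means)
    return sum(arm_gaps[a] for a in chosen_arms)


def conversion_rate(rewards):
    """Fraction of rounds with a positive (binary) reward; 0.0 for an empty run."""
    return sum(rewards) / len(rewards) if rewards else 0.0
```

For example, in `env1` each pull of the second arm costs about 0.2 in expected reward, so `pseudo_regret(ENVIRONMENTS["env1"], [0, 1, 1])` is approximately 0.4. The smaller the gap (as in `env1` versus `env3`), the harder it is for a policy to tell the arms apart.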

### Simulation

| Application | Policy |
| --- | --- |
| Multi-armed bandit (MAB) | Thompson Sampling (TS) |
| Multi-armed bandit (MAB) | Upper Confidence Bound (UCB) |
| Contextual MAB (CMAB) | Linear Thompson Sampling (LinTS) |
| Contextual MAB (CMAB) | Linear UCB (LinUCB) |
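
A rough sketch of how the two non-contextual policies might behave under batched feedback is given below. It assumes Bernoulli rewards, a Beta(1, 1) prior for Thompson Sampling, and the classic UCB1 index; the class names and the `simulate` helper are hypothetical and are not taken from the notebook or the referenced repository. The contextual policies (LinTS, LinUCB) are not covered here.

```python
import math
import random


class BetaThompson:
    """Thompson Sampling for Bernoulli arms with a Beta(1, 1) prior (assumed)."""

    def __init__(self, n_arms, seed=0):
        self.alpha = [1.0] * n_arms
        self.beta = [1.0] * n_arms
        self.rng = random.Random(seed)

    def select(self):
        # Draw one sample per arm from the current posterior and play the largest.
        samples = [self.rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return samples.index(max(samples))

    def update(self, arm, reward):
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward


class UCB1:
    """UCB1 index policy; the statistics change only when update() is called."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms

    def select(self):
        # Play the first arm that has no observations yet.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        t = sum(self.counts)
        scores = [s / c + math.sqrt(2 * math.log(t) / c)
                  for s, c in zip(self.sums, self.counts)]
        return scores.index(max(scores))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward


def simulate(policy, means, horizon, batch_size, seed=1):
    """Return the pseudo-regret of `policy` when feedback arrives in batches."""
    rng = random.Random(seed)
    best = max(means)
    regret, buffer = 0.0, []
    for t in range(1, horizon + 1):
        arm = policy.select()
        reward = 1 if rng.random() < means[arm] else 0
        regret += best - means[arm]
        buffer.append((arm, reward))
        if t % batch_size == 0 or t == horizon:  # batch closes: reveal feedback
            for a, r in buffer:
                policy.update(a, r)
            buffer.clear()
    return regret
```

One design consequence worth noting: within a batch, this UCB1 sketch is deterministic and therefore repeats the same arm until the next update, whereas Thompson Sampling still randomizes because it resamples from the frozen posterior at every step. Calling `simulate(BetaThompson(2), [0.7, 0.5], horizon=2000, batch_size=10)` and varying `batch_size` gives a quick way to compare regret across batch sizes.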

## Tutorials

### Sequential Batch Learning in Stochastic MAB and Contextual MAB on Mushroom and Synthetic data

[direct link to notebook →](https://github.com/RecoHut-Stanzas/S873634/blob/main/nbs/P296669_Sequential_Batch_Learning_in_Stochastic_MAB_and_Contextual_MAB_on_Mushroom_and_Synthetic_data.ipynb)

![Process flow](https://github.com/RecoHut-Stanzas/S873634/raw/main/images/process_flow.svg)

## References

1. [https://github.com/RecoHut-Stanzas/S873634](https://github.com/RecoHut-Stanzas/S873634)
2. [https://arxiv.org/abs/2111.02071v1](https://arxiv.org/abs/2111.02071v1)
3. [https://github.com/danilprov/batch-bandits](https://github.com/danilprov/batch-bandits)