In [None]:
import matplotlib.pyplot as plt
import numpy as np
from collections import defaultdict
from main import parse_config, instantiate_agents, instantiate_auction

In [None]:
# Parse configuration file
rng, config, agent_configs, agents2items, agents2item_values,\
num_runs, max_slots, embedding_size, embedding_var,\
obs_embedding_size = parse_config('../auction-gym/config/SP_Oracle.json')

# Config

For an explanation of the config fields, see [CONFIG.md](https://github.com/amzn/auction-gym/blob/065f8bf325ebbec9c96631625ef1c36df3870cb3/CONFIG.md?plain=1#L30).

In [None]:
config

In [None]:
agent_configs

In [None]:
agents = instantiate_agents(rng, agent_configs, agents2item_values, agents2items)

auction, num_iter, rounds_per_iter, output_dir =\
    instantiate_auction(rng,
                        config,
                        agents2items,
                        agents2item_values,
                        agents,
                        max_slots,
                        embedding_size,
                        embedding_var,
                        obs_embedding_size)
config

In [None]:
# Let's decrease rounds_per_iter for now b/c we just want to test things out.
# Using a lower rounds_per_iter just means that the runs will take less time.
rounds_per_iter = 100

In [None]:
for _ in range(rounds_per_iter):
    auction.simulate_opportunity()

In [None]:
agents[0].net_utility

# Notes

The paper mentions (in the conclusion) that the bandit model it studies is a limiting approximation of the full bidding problem.

Bandit models are myopic: they associate the outcome of a single auction with their bid in that auction, which is good, but they ignore the fact that their bid in this auction will affect what other bidders learn from the auction, which in turn will affect how those bidders bid in subsequent auctions. And all of the other bidders’ bids will affect what we learn, and so on.

All of that feedback means that your bid now will affect not only this auction but future auctions, too. So the outcome of an auction should be associated not only with your bid in that single auction, but also with your bids in previous auctions. But how? It’s tough. In reinforcement learning this is called the credit assignment problem.

Episodic policy search is non-myopic and makes explicit credit assignment unnecessary.

Bayesian optimization is a good way to do episodic policy search in a simulation *and* in real, production advertising systems.
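
To make the "score a whole episode per policy" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the policy is a single constant shading factor `gamma`, the auctions are first-price against one static, uniformly random competitor, and a plain grid search stands in for Bayesian optimization. Because the competitor does not learn, the sketch shows only the structure of episodic search, not the feedback loop described above. None of the names come from AuctionGym.

```python
import numpy as np

def episode_utility(gamma, rng, num_auctions=1000):
    """Total utility of one episode of first-price auctions with a constant gamma."""
    values = rng.uniform(0.0, 1.0, size=num_auctions)          # our expected value per auction
    competing_bids = rng.uniform(0.0, 1.0, size=num_auctions)  # hypothetical best competing bid
    bids = gamma * values
    won = bids > competing_bids
    # First-price rule (an assumption of this sketch): the winner pays its own bid.
    return float(np.sum((values - bids)[won]))

def search_gamma(candidates, seed=0):
    """Score each candidate policy by a whole episode and return the best one."""
    # Re-seeding per candidate means every policy sees the same simulated auctions.
    scores = [episode_utility(g, np.random.default_rng(seed)) for g in candidates]
    best = int(np.argmax(scores))
    return float(candidates[best]), scores[best]

candidates = np.linspace(0.0, 1.0, 11)
best_gamma, best_utility = search_gamma(candidates)
print(best_gamma, best_utility)
```

The objective is the utility summed over the whole episode, so no individual auction outcome has to be attributed to an individual bid. A Bayesian optimizer would replace the grid in `search_gamma` but keep the same episode-level objective.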

See paragraph below eqn (14):

“Finally, note that these measures are only well-defined in the bandit-based setting where we can easily characterise the theoretically optimal bidding strategy. When moving to full reinforcement learning scenarios, this will no longer be the case. Indeed, when current actions influence future states, this adds significant complexity to the problem setting, obscuring the notion of optimality.”

To understand what a “bidder” does in the AuctionGym simulation, look at TruthfulBidder:
https://github.com/amzn/auction-gym/blob/065f8bf325ebbec9c96631625ef1c36df3870cb3/src/Bidder.py#L28C16-L28C16

The bid() method returns the price an advertiser will pay for a click on the ad (called value, e.g., value = $3/click) times the probability that the user will click on that ad (called estimated_CTR, e.g., estimated_CTR = 0.01). The returned number is the expected dollar revenue from showing the ad:

```
  E[$revenue] = P{user will click on the ad} * [$amount advertiser will pay for the click]
  E[$revenue] = estimated_CTR * value
```

See notes at https://github.com/amzn/auction-gym/blob/065f8bf325ebbec9c96631625ef1c36df3870cb3/src/Auction.py#L64

The TruthfulBidder says, “I will bid, in the ad auction, the actual expected value for this ad.”
```
bid = E[$revenue]
```
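
As a quick numeric check of the two formulas above, here is a small sketch. The function names are hypothetical and are not the AuctionGym API; only the arithmetic `estimated_CTR * value` comes from the notes.

```python
def expected_revenue(value, estimated_CTR):
    """Expected dollar revenue of showing the ad: P(click) * payment per click."""
    return estimated_CTR * value

def truthful_bid(value, estimated_CTR):
    """Bid exactly the expected value, with no shading."""
    return expected_revenue(value, estimated_CTR)

# $3 per click at a 1% click probability is worth roughly 3 cents per impression.
bid = truthful_bid(value=3.0, estimated_CTR=0.01)
print(bid)
```
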
The learning bidders are more clever: They try to win while still bidding a little lower than the true expected value, because, why not? They save money. The paper and code refer to this as “bid shading”.

```
bid = gamma*E[$revenue], where gamma, the shading amount, is in [0,1]
```
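
The shading rule can be sketched the same way. The function name and the explicit range check are assumptions added for illustration; the notes only state that gamma lies in [0,1].

```python
def shaded_bid(value, estimated_CTR, gamma):
    """Scale the expected value by a shading factor gamma in [0, 1]."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return gamma * estimated_CTR * value

full = shaded_bid(3.0, 0.01, gamma=1.0)  # gamma = 1 recovers the truthful bid
half = shaded_bid(3.0, 0.01, gamma=0.5)  # bid half of the expected value
print(full, half)
```

With gamma = 0 the bidder effectively opts out of the auction, and with gamma = 1 it behaves like the TruthfulBidder.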

The shading factor, gamma, is a function of value and estimated_CTR.

Look at method PolicyLearningBidder.bid()
https://github.com/amzn/auction-gym/blob/065f8bf325ebbec9c96631625ef1c36df3870cb3/src/Bidder.py#L348

PolicyLearningBidder uses BidShadingContextualBandit as its model.
See BidShadingContextualBandit at https://github.com/amzn/auction-gym/blob/065f8bf325ebbec9c96631625ef1c36df3870cb3/src/Models.py#L93

BidShadingContextualBandit is a PyTorch nn.Module. As such, it has learnable parameters that define the function mapping a feature vector, x, to gamma.

PolicyLearningBidder.bid() makes a feature vector, x, from (value, estimated_CTR)
https://github.com/amzn/auction-gym/blob/065f8bf325ebbec9c96631625ef1c36df3870cb3/src/Bidder.py#L360C53-L360C53

The model (BidShadingContextualBandit.forward()) maps the feature vector x to gamma. It also produces something called a propensity value, but we can discuss that later if you’re not already familiar with it.

Finally, PolicyLearningBidder.bid() returns gamma*value*estimated_CTR, the shaded bid value.
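
The flow from features to gamma to shaded bid can be sketched without PyTorch. This is a hypothetical stand-in, not the code in Models.py: it assumes a single linear layer followed by a sigmoid, uses NumPy, leaves the parameters untrained, and omits the propensity output.

```python
import numpy as np

class TinyShadingModel:
    """Hypothetical stand-in for a learned model: features x -> gamma in (0, 1)."""

    def __init__(self, rng, num_features=2):
        # Parameters a learner would update; here they are only randomly initialised.
        self.weights = rng.normal(scale=0.1, size=num_features)
        self.bias = 0.0

    def forward(self, x):
        logit = float(np.dot(self.weights, x) + self.bias)
        return 1.0 / (1.0 + np.exp(-logit))  # sigmoid keeps gamma strictly inside (0, 1)

def policy_bid(model, value, estimated_CTR):
    x = np.array([estimated_CTR, value])  # feature vector built from the two inputs
    gamma = model.forward(x)
    return gamma * value * estimated_CTR  # shaded bid

model = TinyShadingModel(np.random.default_rng(0))
bid = policy_bid(model, value=3.0, estimated_CTR=0.01)
print(bid)
```

Squashing the output through a sigmoid is one simple way to guarantee that gamma stays in the valid range whatever the parameter values are.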

In a second-price auction, the bidder who bids the highest price wins, but *pays* the bid of the second-highest bidder. So if you bid $3.00 and I bid $2.50, you win the auction and get to show your ad, but you pay only $2.50.

This is great, except that next time I might bid $3.01 to win, thus driving up the price of this ad slot.
As a rule, you want to bid as low as possible while still winning.
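
The second-price rule from the example above fits in a few lines. This is a minimal single-slot sketch with a hypothetical function name; it is not how AuctionGym allocates slots.

```python
def second_price_auction(bids):
    """Return (winner_index, price_paid) for a single-slot second-price auction."""
    if len(bids) < 2:
        raise ValueError("need at least two bids")
    # Rank bidder indices from highest to lowest bid.
    ranked = sorted(range(len(bids)), key=lambda i: bids[i], reverse=True)
    winner, runner_up = ranked[0], ranked[1]
    return winner, bids[runner_up]

# You bid $3.00 (index 0), I bid $2.50 (index 1): you win and pay $2.50.
winner, price = second_price_auction([3.00, 2.50])
print(winner, price)

# Next time I bid $3.01: now I win, and the price of the slot rises to $3.00.
winner2, price2 = second_price_auction([3.00, 3.01])
print(winner2, price2)
```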