# [Uncovering Bias in Ad Feedback Data Analyses & Applications](https://labtomarket.files.wordpress.com/2019/03/adfeedback.pdf)

## Context

- Trying to provide a rewarding investment to advertisers
- Minimize negative impact to users

`Annoying ads have a real cost to users beyond mere annoyance`: reduced visits of shorter duration, fewer referrals, long-term user disengagement...

It has been shown that it is better to show no ads at all than to show irrelevant ones.

=> Explicit feedback from users can help capture all these effects. Once integrated directly into the ad ranking score, it allows ads to be ranked in terms of both _short term_ and _long term_ expected revenue.

Bias can come from
* the fact that ads are targeted
* the type of users (interacting a lot or not with content)

## 1. Analysis
## 2. Bias: explanation and correction
## 3. Ad Ranking

## 1. Analysis

## Analysis

⇨ Investigate if the association between ads and ad feedback is affected by
* ads being targeted to users with particular demographics, interests, behaviours
* user behaviour (eg clicks, interaction with content)

Some users may dislike ads but not indicate this through a feedback option, whereas others may always give feedback, however minor the complaint.

## Analysis

### Data

40 million distinct users and 200 000 distinct ads

## Analysis

### Metric

$$ \text{Hide Rate} = \frac{\text{hides}}{\text{impressions}} $$

* feedback is generally a signal of bad quality: explicit negative signal
* CTR: the absence of click does not necessarily indicate a low quality ad, high CTR may not mean high quality
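
As a minimal sketch of the metric above, the hide rate of an ad (or of a user segment) can be computed from two counters. The function name and the counts are hypothetical and only illustrate the ratio.

```python
def hide_rate(hides, impressions):
    """Fraction of impressions on which the ad was hidden."""
    if impressions <= 0:
        raise ValueError("hide rate is undefined without impressions")
    return hides / impressions


# Hypothetical counts: 5 hides over 1000 impressions.
example_rate = hide_rate(5, 1000)
```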

## Analysis

### Features - to characterize users

* User demographics: age, gender, interests, location
* User behaviour: ad impressions, ad clicks, article clicks

## Analysis

### Features - to characterize ads

* Text-based: spam, readability, adult
* Image-based: contains text, contains flesh
* Advertiser: pagerank score of the ad landing page

## Analysis

### Formula

* Study the difference of ad hide behaviour for each user variable in turn

$$ HR_{var}(u) = \frac{HR(u) - \mathrm{mean}(HR(U))}{\mathrm{mean}(HR(U))} $$
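
The relative hide rate can be sketched as follows, assuming that _u_ is one level of a user variable _U_ and that the mean is taken over the hide rates of all levels of that variable. The hide rates in the example are made up for illustration and are not taken from the paper.

```python
from statistics import mean


def relative_hide_rates(hr_by_level):
    """Relative deviation of each level's hide rate from the mean over levels."""
    baseline = mean(list(hr_by_level.values()))
    return {level: (hr - baseline) / baseline for level, hr in hr_by_level.items()}


# Hypothetical hide rates for the two levels of a "gender" variable.
relative = relative_hide_rates({"female": 0.008, "male": 0.012})
```

A negative value means the level hides less than the average level, a positive value means it hides more.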

## Analysis

### Results

* User demographics
  * users from different states have different behaviours
  * demographic distributions differ per state
  * female users are less likely to hide than male users

* User interests
  * "Retail" and "Technology" most likely to hide
  * "Business/B2B" and "Telecommunication" less likely to hide
  * __feedback variations across these variables make them good candidates to identify bias due to targeting__

* Ad quality
  * more likely to be spam: more likely to be hidden
  * easier to read: more likely to be hidden
  * most and least "adultness": less likely to be hidden

## Analysis

### Conclusion

* Some types of users may be more sensitive to ads and have a higher tendency to provide feedback
* Since ads are targeted, the ad feedback may be from a group unrepresentative of the general population

## 2. Modelling and Correcting Bias

## Modelling Bias

An ad quality model based on such biased data will consistently over- or underestimate the quality of ads.

⇨ develop a model able to determine the proportion of bias present in the feedback on ads.

* Descriptive model
* Only include variables able to explain the source of selection bias

## Modelling Bias

Simple logistic regression based ad-user model
* one ad feature _a_
* one user selection feature _u_
* associated weights

$$ f(\hat{p}) = w_0 + w_a \cdot a + w_u \cdot u + \epsilon $$

* Ad-only model, with the single ad feature _a_

$$ f(\hat{p}) = \hat{w}_0 + \hat{w}_a \cdot a + \epsilon $$

* If both models are fit to feedback data that contains the selection bias, the coefficient of the ad-only model absorbs part of the user effect:

$$ \hat{w}_a = w_a + \rho w_u $$

* $\rho$ is the correlation between _a_ and _u_
* The bias in the ad-only model is the true user effect $w_u$ scaled by the correlation between the user feature and the ad feature

⇨ Goal is to identify the user selection bias term $w_u \cdot u$
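
The bias relation can be checked on simulated data. The sketch below is a linear-regression analogue rather than the logistic model of the slides: with standardized features the relation $\hat{w}_a \approx w_a + \rho w_u$ holds for ordinary least squares, while for logistic regression it should be read as an approximation. All weights, the correlation and the sample size are arbitrary assumptions.

```python
import random


def simple_slope(x, y):
    """OLS slope of y regressed on x (with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var = sum((xi - mean_x) ** 2 for xi in x)
    return cov / var


def ad_only_slope(w_a=1.0, w_u=2.0, rho=0.5, n=20000, seed=0):
    """Fit the ad-only model on data generated by the full ad-user model."""
    rng = random.Random(seed)
    a, y = [], []
    for _ in range(n):
        a_i = rng.gauss(0, 1)
        # u has unit variance and correlation rho with a.
        u_i = rho * a_i + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        a.append(a_i)
        y.append(w_a * a_i + w_u * u_i + rng.gauss(0, 1))
    return simple_slope(a, y)


w_a_hat = ad_only_slope()    # estimated while ignoring u
expected = 1.0 + 0.5 * 2.0   # w_a + rho * w_u
```

With these settings the ad-only estimate lands near `expected` rather than near the true `w_a`, which is the over-estimation the slide describes.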



## Modelling Bias

Deviance statistics for the models of interest
* Model name
* systematic structure (the variables included in the model)
* deviance statistic
* difference in deviance between the null model and the current model
* number of parameters used in the model


![table 3](table3.png)

* Age levels are more significant than gender or state
* Interest variables are popular targeting criteria

⇨ Suggests that there is a selection bias due to targeting present in the feedback data

⇨ + selection bias due to user ad sensitivity (click behaviour variables explain additional feedback)
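
The deviance comparison can be sketched without a model-fitting library. For a logistic model with a single categorical variable, the maximum-likelihood fitted probability of each level is that level's observed hide rate, and the null model predicts the overall hide rate everywhere. The age levels and counts below are hypothetical and are not the values from the table.

```python
import math


def log_likelihood(hides, impressions, p):
    """Bernoulli log-likelihood of the impressions in one group."""
    return hides * math.log(p) + (impressions - hides) * math.log(1 - p)


def deviance(groups, fitted):
    """Deviance for binary outcomes: -2 times the total log-likelihood."""
    return -2 * sum(
        log_likelihood(h, n, fitted[g]) for g, (h, n) in groups.items()
    )


# Hypothetical counts per age level: (hides, impressions).
groups = {"18-24": (30, 1000), "25-34": (10, 1000)}
total_hides = sum(h for h, _ in groups.values())
total_impressions = sum(n for _, n in groups.values())

null_fit = {g: total_hides / total_impressions for g in groups}
age_fit = {g: h / n for g, (h, n) in groups.items()}

null_deviance = deviance(groups, null_fit)
age_deviance = deviance(groups, age_fit)
deviance_drop = null_deviance - age_deviance  # explained by the age variable
```

A larger drop relative to the number of added parameters indicates that the variable explains more of the feedback.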

## Modelling Bias

* _net effect_ $\beta$: how each individual variable level affects the selection bias present in the feedback
* _p_ : probability of hiding ads

$$ \frac{p}{1-p} = e^\beta $$

$$ p = \frac{e^\beta}{1 + e^\beta} $$
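
The two formulas above translate directly into code. This is a small sketch; the function names are illustrative.

```python
import math


def hide_odds(beta):
    """Odds of hiding, p / (1 - p), for a net effect beta."""
    return math.exp(beta)


def hide_probability(beta):
    """Probability of hiding, e^beta / (1 + e^beta)."""
    odds = hide_odds(beta)
    return odds / (1 + odds)


p_neutral = hide_probability(0.0)  # beta = 0 gives even odds
```

A positive $\beta$ pushes the hide probability above 0.5 and a negative $\beta$ pushes it below.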

![table 4](table4.png)

* variables included in the model
* levels of each variable
* effects of each level ($\beta$)




## Correcting Bias

### Formula

Formula that explicitly models the user selection bias in addition to the ad features:

$$ f(\hat{p}) = w_0 + w_a \cdot a + I(w_u \cdot u) + \epsilon $$

* _I_ binarizes the user selection bias term using a threshold (0.5)
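
A minimal sketch of the systematic part of this formula follows. The slide does not say on which scale the 0.5 threshold is applied, so the code assumes it is compared directly against the term $w_u \cdot u$ with a strict inequality. Applying it on the probability scale instead would be an equally plausible reading. All weights and feature values are hypothetical.

```python
def indicator(value, threshold=0.5):
    """Binarize the user selection bias term (assumed: strict threshold)."""
    return 1.0 if value > threshold else 0.0


def corrected_score(w0, w_a, a, w_u, u, threshold=0.5):
    """w0 + w_a * a + I(w_u * u), without the noise term."""
    return w0 + w_a * a + indicator(w_u * u, threshold)


# Hypothetical weights; only the user feature u differs between the two calls.
flagged = corrected_score(w0=-4.0, w_a=1.0, a=0.5, w_u=2.0, u=0.4)
unflagged = corrected_score(w0=-4.0, w_a=1.0, a=0.5, w_u=2.0, u=0.1)
```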

### Conclusions

⇨ ads with low pagerank, low readability, low adult and low spam levels are considered as low quality

⇨ single features do not characterize the quality of an ad

⇨ features such as adultness and pagerank: for high levels, likely to receive feedback from general population, but less likely from a specific segment of users.

## 3. Ad Ranking

## Ad Auctions

* ads are ranked according to a function of their bids
* generally ranked by expected cost per impression `eCPI`: based on probability that ad is clicked given a user impression
* some companies have started to incorporate a **quality score**

$$ eCPI = bid_a \cdot P(C_a = 1 | U = u) $$

$$ eCPI_q = bid_a \cdot P(C_a = 1 | U = u) \cdot P(Q_a = 1 | U = u) $$


* $ P(C_a = 1 | U = u) $: probability that the ad is clicked given a user impression
* $ P(Q_a = 1 | U = u) $: probability that the ad will provide a good quality experience, i.e. $1 - p$ where $p$ is the hide probability
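
The effect of the quality term on the ranking can be sketched with two hypothetical ads. The bids, click probabilities and hide probabilities below are invented for illustration. In practice they would come from the auction and from the click and feedback models.

```python
def ecpi(bid, p_click):
    """Expected cost per impression without a quality score."""
    return bid * p_click


def ecpi_q(bid, p_click, p_hide):
    """Expected cost per impression weighted by quality, 1 - hide probability."""
    return bid * p_click * (1.0 - p_hide)


# Hypothetical ads: A bids more but is hidden far more often than B.
ads = [
    {"id": "A", "bid": 2.0, "p_click": 0.05, "p_hide": 0.30},
    {"id": "B", "bid": 1.5, "p_click": 0.05, "p_hide": 0.01},
]

by_ecpi = [
    ad["id"]
    for ad in sorted(ads, key=lambda ad: ecpi(ad["bid"], ad["p_click"]), reverse=True)
]
by_ecpi_q = [
    ad["id"]
    for ad in sorted(
        ads,
        key=lambda ad: ecpi_q(ad["bid"], ad["p_click"], ad["p_hide"]),
        reverse=True,
    )
]
```

Here ad A wins on bid and click probability alone, while ad B wins once the hide probability is taken into account.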

## Using ad feedback in ad ranking: model comparison

* (i)   _oracle_: empirical, use available logs
* (ii)  _biased_: use estimates from the biased model
* (iii) _unbiased_: use estimates from corrected model

![fig 5](figure5.png)



## Effect of ad feedback filtering on revenue

![Fig 6](figure6.png)