In [1]:
# Set up Reveal.js
from traitlets.config.manager import BaseJSONConfigManager
from pathlib import Path
path = Path.home() / ".jupyter" / "nbconfig"
cm = BaseJSONConfigManager(config_dir=str(path))
# ipyleaflet hack to load full map in Reveal.js
# These settings are also injected into the notebook metadata
# (Edit -> Edit Notebook Metadata), which is the preferred method
cm.update(
    "rise",
    {"minScale": 1.25,
     "width": "80%",
     "transition": "none"
    }
)

import numpy as np
np.random.seed(42)

from utils import AIUKSlides
bristol_center = (51.4545, -2.5879)

slides = AIUKSlides(local_center=bristol_center, isolines=[.0, .2, .4, .6, .8, 1.0],
                    width='600px', height='400px', grid_density=500)

# Computer says "I Don't Know"

<img width=35% align="right" src="PeterCartoon-square.jpg">

## The case for Honest AI

Peter Flach, University of Bristol and Alan Turing Institute, UK

[flach.github.io](https://flach.github.io)

Would you consider it newsworthy if a human passes a multiple-choice test? 

<img width=50% align="right" src="mc.jpg" alt="credit: https://www.learningscientists.org/blog/2017/10/10-1">

✓ **Probably not.**


Yet multiple-choice tests are behind many AI successes reported in the media, leading to recent headlines such as 

<p>&#128478; Researchers taught an AI to recognize smells!</p>

<p>&#128478; AI Trained on Old Scientific Papers Makes Discoveries Humans Missed!</p>

<p>&#128478; AI learns to recognize nerve cells!</p>

We are told that "AI passed the test" or "the algorithm worked" --

<img width=30% align="right" src="pass.jpg" alt="credit: TODO">

- but what exactly does that mean?

**Who sets the exam, and what is the passing grade?**

# The case for Honest AI

In this talk I will discuss why performance evaluation is not something that can be easily summarised in a catchy headline -- neither for humans nor for machines. 

Furthermore, I will argue that it is imperative for AI algorithms to become more *honest* about their own abilities.

<img width=25% align="right" src="honest.png" alt="credit: TODO">

Quantifying the **uncertainty** in predictions would be a good start.
- E.g., saying "the chance of rain is 60%" rather than "it will rain".
  
<img width=60% align="right" src="weather.jpg" alt="credit: Met Office">

Quantifying the uncertainty in that chance estimate would be even better. 
- Is it really 60%, or could it also be 40% or 80%?
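
One simple way to put a range around a chance estimate is the bootstrap. The sketch below assumes a made-up record of 20 past days with a similar forecast, 12 of which turned out rainy; the data, the variable names and the choice of a 95% percentile interval are all illustrative assumptions rather than how any weather service works.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical record: 1 = rain, 0 = dry, on 20 past days with a similar forecast
outcomes = np.array([1] * 12 + [0] * 8)
point_estimate = outcomes.mean()  # 12 out of 20

# Bootstrap: resample the record with replacement and re-estimate the chance each time
resampled = rng.choice(outcomes, size=(5000, outcomes.size), replace=True)
estimates = resampled.mean(axis=1)

# The spread of the re-estimates indicates how far the 60% could plausibly move
low, high = np.percentile(estimates, [2.5, 97.5])
print(f"chance of rain: {point_estimate:.0%} (95% interval roughly {low:.0%} to {high:.0%})")
```

With so few past days the interval is wide, which is exactly the information a bare "60%" hides.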

# Computer says "I don't know"

But what would really demonstrate an AI algorithm's awareness of its own strengths *and* limitations is for it to occasionally say **"I don't know"** --
- something that not many contemporary AI algorithms and machine-learned classifiers do;
- often leading to problems with "adversarial examples", which are doctored to mislead the algorithm. 
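
As a minimal sketch of what saying "I don't know" could look like in code, a classifier can abstain whenever its most probable class falls below a confidence threshold. The function name `predict_or_abstain`, the threshold of 0.75 and the example probabilities are hypothetical choices for illustration.

```python
import numpy as np

def predict_or_abstain(probabilities, labels, threshold=0.75):
    """Return the most probable label per row, or "I don't know" if it is not probable enough."""
    probabilities = np.asarray(probabilities, dtype=float)
    top = probabilities.argmax(axis=1)                   # index of the most probable class
    confident = probabilities.max(axis=1) >= threshold   # is that class probable enough?
    return [labels[i] if ok else "I don't know" for i, ok in zip(top, confident)]

# Hypothetical class probabilities for three queries (columns: "up", "down")
probs = [[0.95, 0.05], [0.55, 0.45], [0.10, 0.90]]
print(predict_or_abstain(probs, labels=["up", "down"]))
# → ['up', "I don't know", 'down']
```

This only helps when the probabilities themselves can be trusted: a model that is confidently wrong will still sail past the threshold.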

# In this talk...
I will discuss in an accessible way how this arises due to a focus on *discriminative learning*, and how recent research has developed ways to overcome this, 

<img width=50% align="right" src="lb.jpg">

allowing AI and machine learning to become more **honest and aware of their own limitations**.  

# Let's look at an example
<img height=80% align="center" src="PedroBandero.png" alt="credit: TODO">

In [3]:
# Some recent COVID-19 data
# Case numbers going up/down in red/blue
display(slides.map_covid_uk())

In [None]:
# Zooming in on Bristol
display(slides.map_covid_local())

In [None]:
# Train a model to distinguish between up/down areas
from utils import KDE
clf = KDE(bandwidth=0.005)
slides.train_local_classifier(clf)
slides.train_local_foreground()
display(slides.map_local_classifier_foreground())

In [None]:
# What actually happens with discriminative models
display(slides.map_local_classifier(fillopacity=0.4, lineopacity=1.0))

# What are discriminative$^\dagger$ models?

These are classifiers that learn to separate classes
- cats vs dogs, spam vs ham, COVID-19 cases going up or down, ...

by identifying distinguishing characteristics in the training data. 

<img width=40% align="right" src="spam-filter.png" alt="credit: https://appliedmachinelearning.blog/2017/01/23/email-spam-filter-python-scikit-learn/">

$^\dagger$Not to be confused with *discriminatory*...

# How else would you do that?

**Generative** models additionally learn what *typical* cats, dogs, etc. look like. 

This allows the model to recognise that a new "query" looks very different from the data used to train the model. 

<img width=25% align="right" src="atchoum.png" alt="credit: https://www.atchoumthecat.com">

However, generative models require much more computational effort to train. 
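
To make the idea concrete, here is a deliberately simple sketch in which each class is modelled by a single Gaussian, so "typical" just means "close to the class mean". The synthetic cat and dog data, the Gaussian assumption and the rule for the threshold are illustrative assumptions; real generative models are far richer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D training data for two classes
cats = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
dogs = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(200, 2))

def fit_gaussian(points):
    """Estimate mean and covariance: a very simple model of a 'typical' class member."""
    return points.mean(axis=0), np.cov(points, rowvar=False)

def log_density(x, mean, cov):
    """Log of the multivariate normal density at x."""
    diff = np.asarray(x, dtype=float) - mean
    maha = diff @ np.linalg.solve(cov, diff)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (maha + logdet + len(mean) * np.log(2 * np.pi))

models = {"cat": fit_gaussian(cats), "dog": fit_gaussian(dogs)}

# Threshold: the lowest density any training point receives under its own class model
threshold = min(
    log_density(p, *models[name])
    for name, data in (("cat", cats), ("dog", dogs))
    for p in data
)

def classify(x):
    """Pick the class under which x is most typical, unless x is atypical for every class."""
    scores = {name: log_density(x, mean, cov) for name, (mean, cov) in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "I don't know"

print(classify([0.1, -0.2]))   # near the cat cluster
print(classify([3.2, 2.9]))    # near the dog cluster
print(classify([10.0, -8.0]))  # far from all training data
```

The last query is less typical than anything seen in training under either class model, so the sketch declines to answer.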

In [None]:
# A discriminative model has no problem making 
# confident -- but unjustified -- predictions 
# in areas without training data. 
from sklearn.ensemble import RandomForestClassifier
clf2 = RandomForestClassifier()
slides.train_local_classifier(clf2)
display(slides.map_local_classifier(fillopacity=0.4, lineopacity=1.0))

# Each AI has its own comfort zone
A learned discriminative model usually operates without direct access to the data used to train it, and so has no way of knowing when it ventures out of its "comfort zone". 
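
A small sketch of this effect, under the assumption of a linear model with a logistic output trained on synthetic data: once trained, the model is just a weight vector and an offset, and its confidence keeps growing the further a query moves from the decision boundary, whether or not any training data lives there. All names and numbers below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D training data for two classes
cats = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
dogs = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(200, 2))

# The model keeps only a direction and an offset; the training data is then discarded
mean_cat, mean_dog = cats.mean(axis=0), dogs.mean(axis=0)
variance = np.concatenate([cats - mean_cat, dogs - mean_dog]).var()
weights = (mean_dog - mean_cat) / variance
offset = -weights @ (mean_cat + mean_dog) / 2

def prob_dog(x):
    """Probability of 'dog' from a linear score squashed by a logistic function."""
    score = weights @ np.asarray(x, dtype=float) + offset
    return 0.5 * (1.0 + np.tanh(score / 2.0))  # numerically stable sigmoid

print(prob_dog([3.0, 3.0]))      # inside the dog cluster
print(prob_dog([1.5, 1.5]))      # near the boundary between the clusters
print(prob_dog([300.0, 300.0]))  # nowhere near any training data
```

The model is only unsure near the boundary; far away from everything it has seen, it is at its most confident.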

<img width=30% align="right" src="lazyboy.jpg" alt="credit: https://www.woodworkingnetwork.com/furniture/la-z-boy-debuts-rechargeable-batteries-power-recliners">

Luckily there are techniques for identifying a model's comfort zone: one such technique, called [`Background Check`](https://reframe.github.io/background_check/), works by introducing an additional "background class" during training. 

<img height=100% align="right" src="BC.png" alt="credit: https://reframe.github.io/background_check/">
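
The general idea of a background class can be sketched as follows. This is not the Background Check implementation: it is a simplified illustration that assumes synthetic data, uniformly sampled background points and a hand-written nearest-neighbour vote standing in for the classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical foreground classes in 2-D
cats = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
dogs = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(200, 2))
foreground = np.vstack([cats, dogs])

# Synthetic background: uniform points over the data's bounding box plus a margin
margin = 3.0
low, high = foreground.min(axis=0) - margin, foreground.max(axis=0) + margin
background = rng.uniform(low, high, size=(400, 2))

train_x = np.vstack([foreground, background])
train_y = np.array(["cat"] * 200 + ["dog"] * 200 + ["background"] * 400)

def knn_predict(query, k=15):
    """Majority vote among the k nearest training points (a stand-in for any classifier)."""
    distances = np.linalg.norm(train_x - np.asarray(query, dtype=float), axis=1)
    nearest = train_y[np.argsort(distances)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return str(values[counts.argmax()])

print(knn_predict([0.1, -0.2]))   # dense cat region
print(knn_predict([3.2, 2.9]))    # dense dog region
print(knn_predict([10.0, -8.0]))  # outside the comfort zone
```

Where the foreground data is dense it outvotes the thinly spread background points; everywhere else the background class wins, and a "background" prediction can be read as "I don't know".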

In [None]:
# Using only the foreground classes, the model sticks to its comfort zone
display(slides.map_local_classifier_foreground())

# The bigger picture
<img height=70% align="center" src="biggerpicture.jpg" alt="credit: https://www.alburycity.nsw.gov.au/leisure/arts-and-culture/public-art/the-bigger-picture">

1. Never trust an algorithm (or person!) that always has an answer. 

2. Always determine an algorithm's comfort zone, area of expertise, or operating conditions. 

3. Expect honesty rather than magic.

# Acknowledgments

- Part of this work was funded or supported by 
  - the [SPHERE project](https://irc-sphere.ac.uk): a Sensor Platform for HEalth in Residential Environments
  - the [TAILOR European network](https://tailor-network.eu) on Trustworthy AI through Integrating Learning, Optimisation and Reasoning
  - the [UKRI Centre for Doctoral Training in Interactive AI](http://www.bristol.ac.uk/cdt/interactive-ai/)
  - the [Alan Turing Institute](https://www.turing.ac.uk/research/research-projects/measurement-theory-data-science-and-ai)

# Acknowledgments

- Thanks to **Miquel Perello Nieto**, Hao Song and Kacper Sokol for programming the COVID-19 examples. 

- Thanks to Meelis Kull, Jose Hernandez-Orallo, Telmo Filho, Yu Chen, Raul Santos-Rodriguez, Tom Diethe and others for joint work that (eventually) led to this presentation. 

- Thanks to all members of the `SaFE-AI` research group in Bristol (Slow and Fast, Explainable AI). 