<h1 style='text-align: center; color: lightblue; font-size: 40px'> Metrics and Data Ethics </h1>

In [1]:
import pandas as pd
from sklearn.metrics import confusion_matrix, recall_score, precision_score
import numpy as np

# Metrics fundamentals

## MSE vs. MAE

<img src='images/MAEvsMSE.png' style='align: left'></img> 

Which is which?

You want to predict people's ages from their photos. Which one do you pick?

You want to predict a building's energy consumption. Which one do you pick?

You want to predict apartment prices from their features. Which one do you pick?
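
To get a feel for the difference before answering, here is a minimal sketch of how the two metrics react to one large error. The ages are invented for illustration, and the last prediction is deliberately far off.

```python
import numpy as np

# Hypothetical ages in years; the last prediction is badly wrong on purpose.
y_true = np.array([20.0, 30.0, 40.0, 50.0])
y_pred = np.array([22.0, 28.0, 41.0, 80.0])

errors = y_pred - y_true          # [2, -2, 1, 30]
mae = np.mean(np.abs(errors))     # each error counts in proportion to its size
mse = np.mean(errors ** 2)        # squaring lets the largest error dominate

print("MAE:", mae)
print("MSE:", mse)
```

Dropping the last sample and rerunning shows how much of each metric came from that single outlier.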

## Accuracy, precision, recall

<img src='images/conf_matrix.png'></img>

<img src='images/conf_matrix2.png'>

<h2> In code: </h2>

In [2]:
actual = np.array([0,0,0,0,0,1,1,1,1,1])
predicted = np.array([0,0,0,0,0,1,1,1,1,1])

In [3]:
confusion_matrix(actual, predicted)

array([[5, 0],
       [0, 5]])

In [4]:
recall_score(actual, predicted)

1.0

In [5]:
actual = np.array([0,0,0,0,0,1,1,1,1,1])
predicted = np.array([0,0,0,0,0,0,0,0,0,0]) #all zeros
confusion_matrix(actual, predicted)

array([[5, 0],
       [5, 0]])

In [6]:
actual = np.array([0,0,0,0,0,1,1,1,1,1])
predicted = np.array([1,1,1,1,1,1,1,1,1,1]) #all ones
confusion_matrix(actual, predicted)

array([[0, 5],
       [0, 5]])

The numbers don't look like the above diagrams. Do you know why ?

### answer

In [7]:
??confusion_matrix

In [19]:
actual = np.array([0,0,0,0,0,1,1,1,1,1])
predicted = np.array([0,0,0,0,0,0,1,1,1,1]) #changed one 1 into 0
confusion_matrix(actual, predicted)

array([[5, 0],
       [1, 4]])

In [8]:
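# unpacking order taken from the confusion_matrix documentation:
# sklearn lays the matrix out as [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
# so TP is the bottom-right number, whereas the diagrams above put TP top left
# (the output below comes from a run on the all-ones predictions of In [6])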
tp

5

In [9]:
# you can put the positive class first by passing the labels explicitly
# (the output below comes from a run on the all-ones predictions of In [6])
confusion_matrix(actual, predicted, labels=[1,0])

array([[5, 0],
       [5, 0]])

In [25]:
# from sklearn.metrics import ...  # no specificity ?

Well, specificity is simply the recall of the negative class: treat 0 as the positive label and recall gives you specificity.

In [30]:
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
# specificity is the recall of the negative class, so pass pos_label=0
# (passing labels=[1,0] would still score class 1 in the binary case)
print("specificity=", tn / (tn + fp))
print("recall=", tp / (tp + fn))
print("recall sklearn:", recall_score(actual, predicted))
print("specificity sklearn:", recall_score(actual, predicted, pos_label=0))

specificity= 1.0
recall= 0.8
recall sklearn: 0.8
specificity sklearn: 1.0


<h2> In the end, what do you care about ? </h2>

- SPAM: Specificity (you want as few FP as possible)
- CANCER DETECTION: Recall (you want as few FN as possible) 
- Classify cat vs. dog: well, what's more important to you ?
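
Whichever error matters most to you, a common lever is the decision threshold applied to the model's scores. The sketch below assumes a hypothetical classifier that outputs a score per sample; the `scores` array and the three thresholds are invented for illustration.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Invented scores from a hypothetical classifier (higher = more likely class 1)
actual = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.4, 0.55, 0.8, 0.9, 0.95])

for threshold in (0.35, 0.5, 0.75):
    # Everything at or above the threshold is predicted positive
    predicted = (scores >= threshold).astype(int)
    p = precision_score(actual, predicted)
    r = recall_score(actual, predicted)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

On this toy data, raising the threshold removes false positives at the price of more false negatives, so the "right" threshold depends on which mistake is more expensive.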

## Loss functions in DL

Recap of class:

* Why is MSE not suited for classification ?
* Why is CrossEntropy not suited for multi-label classification ?
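
One way to see the second point, assuming "CrossEntropy" means softmax followed by cross-entropy: softmax outputs must sum to 1, so two labels that are both present have to share the probability mass. Independent sigmoids do not have that constraint. The logits below are invented.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Invented logits for three labels; suppose the first two are both present.
logits = np.array([4.0, 4.0, -2.0])

p_softmax = softmax(logits)   # one distribution over mutually exclusive classes
p_sigmoid = sigmoid(logits)   # one independent probability per label

print("softmax:", p_softmax.round(3), "sum =", p_softmax.sum().round(3))
print("sigmoid:", p_sigmoid.round(3))
```

With softmax, neither of the two present labels can get a probability above 0.5 here, while each sigmoid output can be close to 1 on its own.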

Another example: focal loss:

\begin{equation}
\mathrm{FL}\left(p_{\mathrm{t}}\right)=-\left(1-p_{\mathrm{t}}\right)^{\gamma} \log \left(p_{\mathrm{t}}\right)
\end{equation}

Where: 

\begin{equation}
p_{\mathrm{t}}=\left\{\begin{array}{ll}
{p} & {\text { if } y=1} \\
{1-p} & {\text { otherwise }}
\end{array}\right.
\end{equation}

In [43]:
# the model classifies correctly
p = 0.9
y = 1

print(-np.log(p))
-(1-p)**2 * np.log(p)

0.10536051565782628


0.0010536051565782623

In [44]:
# the model is unsure
p = 0.5
y = 1

print(-np.log(p))
-(1-p)**2 * np.log(p)

0.6931471805599453


0.17328679513998632
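
The two cells above can be folded into one function. This is a minimal NumPy sketch of the formula given earlier, not a framework implementation; the name `focal_loss` and the default `gamma=2.0` (the exponent used in the cells above) are choices made here.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """Focal loss for binary labels; p is the predicted probability of class 1."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y)
    p_t = np.where(y == 1, p, 1.0 - p)   # probability assigned to the true class
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

p = np.array([0.9, 0.5, 0.1])   # confident and right, unsure, confident and wrong
y = np.array([1, 1, 1])

print("cross-entropy:  ", focal_loss(p, y, gamma=0.0))   # gamma=0 recovers -log(p_t)
print("focal (gamma=2):", focal_loss(p, y, gamma=2.0))
```

Comparing the two printed rows shows that the focal term shrinks the loss of well-classified examples much more than the loss of badly classified ones.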

Which loss for which problem ? 

awesome recap:
https://heartbeat.fritz.ai/research-guide-advanced-loss-functions-for-machine-learning-models-aee68ed8a38c

# The importance of well-defined metrics and losses

## In traditional ML

In [99]:
# a very imbalanced classification problem
actual = (np.random.rand(1000) > 0.99).astype(int)
actual.sum()

5

In [116]:
preds = np.ones(1000)  # predict the positive class for every sample
print("precision:", precision_score(actual, preds))
print("recall:", recall_score(actual, preds))
confusion_matrix(actual, preds)

precision: 0.005
recall: 1.0


array([[  0, 995],
       [  0,   5]])

In [115]:
preds = np.zeros(1000)
preds[0] = 1
print("precision:", precision_score(actual, preds))
print("recall:", recall_score(actual, preds))
confusion_matrix(actual, preds)

precision: 0.0
recall: 0.0


array([[994,   1],
       [  5,   0]])
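
Accuracy tells a very different story on the same kind of data. The sketch below uses a fixed stand-in for the random draw above (exactly 5 positives out of 1000, an assumption made so the result is reproducible) and a "model" that never predicts the positive class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Fixed stand-in for the random draw: 5 positives out of 1000
actual = np.zeros(1000, dtype=int)
actual[:5] = 1

preds = np.zeros(1000, dtype=int)   # never raises an alarm

print("accuracy:", accuracy_score(actual, preds))
print("recall:", recall_score(actual, preds))
```

The accuracy is 995/1000 even though not a single positive is found, which is why accuracy alone is a poor metric for imbalanced problems.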

In [141]:
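# a better model: it catches every positive but also raises false alarms
preds = actual.copy()
# randn() is standard normal, so P(x > 0.95) is about 17%: roughly one negative in six gets flagged
for i in np.flatnonzero(actual == 0):
    if np.random.randn() > 0.95:
        preds[i] = 1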

In [142]:
# I found all the anomalies I was looking for, but it cost me more than 100 useless checks
print("precision:", precision_score(actual, preds))
print("recall:", recall_score(actual, preds))
confusion_matrix(actual, preds)

precision: 0.026455026455026454
recall: 1.0


array([[811, 184],
       [  0,   5]])

In [146]:
# Compare these models : 
print(np.array([[811, 184],
           [  4,   1]]))

print(np.array([[811, 184],
           [  1,   4]]))

[[811 184]
 [  4   1]]
[[811 184]
 [  1   4]]


Which one is better ? Why ?
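
To put numbers on the comparison, the sketch below derives precision, recall and F1 from the two matrices, reading them in sklearn's layout `[[tn, fp], [fn, tp]]`. The helper name `summarize` and the labels "A" and "B" are made up here.

```python
import numpy as np

def summarize(cm):
    """Precision, recall and F1 from a 2x2 matrix laid out as [[tn, fp], [fn, tp]]."""
    tn, fp, fn, tp = np.asarray(cm).ravel()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

model_a = [[811, 184], [4, 1]]
model_b = [[811, 184], [1, 4]]

for name, cm in (("A", model_a), ("B", model_b)):
    precision, recall, f1 = summarize(cm)
    print(f"model {name}: precision={precision:.4f} recall={recall:.2f} f1={f1:.4f}")
```

Both models raise the same number of false alarms; they differ only in how many real positives they miss. Whether that difference matters depends on what a missed positive costs compared with a useless check.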

https://www.washingtonpost.com/local/education/creative--motivating-and-fired/2012/02/04/gIQAwzZpvR_story.html

TL;DR:
A teacher was fired, yet everyone (principal, parents) thought she was a great teacher. What happened ?
- she was judged by her ability to improve students' performance
- how is student performance measured ? Test scores
- if a student was "below expectations" at reading at year N-1 and "meeting expectations" at year N, then the teacher has done a good job
- what is the problem with that ?

For more on mass student testing, read Freakonomics:
https://www.gradesaver.com/freakonomics/study-guide/summary-chapter-1

## In Deep Learning

- CoastRunners: https://openai.com/blog/faulty-reward-functions/


- Problems when metrics conflict, as in GANs: which one to optimize?

- Essay grading: 

https://www.vice.com/en_us/article/pa7dj9/flawed-algorithms-are-grading-millions-of-students-essays

short story: 

Several years ago, Les Perelman, the former director of writing across the curriculum at MIT, and a group of students developed the Basic Automatic B.S. Essay Language (BABEL) Generator, a program that patched together strings of sophisticated words and sentences into meaningless gibberish essays. The nonsense essays consistently received high, sometimes perfect, scores when run through several different scoring engines

Motherboard replicated the experiment. We submitted two BABEL-generated essays—one in the “issue” category, the other in the “argument” category—to the GRE’s online ScoreItNow! practice tool, which uses E-rater. Both received scores of 4 out of 6, indicating the essays displayed “competent examination of the argument and convey(ed) meaning with acceptable clarity.”

- More: see book "Weapons of Math Destruction"

# In AI

Tim Urban's world-renowned blog post on ASI (artificial superintelligence):

https://waitbutwhy.com/2015/01/artificial-intelligence-revolution-1.html


The answer isn’t anything surprising—AI thinks like a computer, because that’s what it is. But when we think about highly intelligent AI, we make the mistake of anthropomorphizing AI (projecting human values on a non-human entity) because we think from a human perspective and because in our current world, the only things with human-level intelligence are humans. To understand ASI, we have to wrap our heads around the concept of something both smart and totally alien.

Let me draw a comparison. If you handed me a guinea pig and told me it definitely won’t bite, I’d probably be amused. It would be fun. If you then handed me a tarantula and told me that it definitely won’t bite, I’d yell and drop it and run out of the room and not trust you ever again. But what’s the difference? Neither one was dangerous in any way. I believe the answer is in the animals’ degree of similarity to me.

A guinea pig is a mammal and on some biological level, I feel a connection to it—but a spider is an insect, with an insect brain, and I feel almost no connection to it. The alien-ness of a tarantula is what gives me the willies. To test this and remove other factors, if there are two guinea pigs, one normal one and one with the mind of a tarantula, I would feel much less comfortable holding the latter guinea pig, even if I knew neither would hurt me.

Now imagine that you made a spider much, much smarter—so much so that it far surpassed human intelligence? Would it then become familiar to us and feel human emotions like empathy and humor and love? No, it wouldn’t, because there’s no reason becoming smarter would make it more human—it would be incredibly smart but also still fundamentally a spider in its core inner workings. I find this unbelievably creepy. I would not want to spend time with a superintelligent spider. Would you??

When we’re talking about ASI, the same concept applies—it would become superintelligent, but it would be no more human than your laptop is. It would be totally alien to us—in fact, by not being biology at all, it would be more alien than the smart tarantula.


[....]



For example, what if we try to align an AI system’s values with our own and give it the goal, “Make people happy”?

Once it becomes smart enough, it figures out that it can most effectively achieve this goal by implanting electrodes inside people’s brains and stimulating their pleasure centers. Then it realizes it can increase efficiency by shutting down other parts of the brain, leaving all people as happy-feeling unconscious vegetables. If the command had been “Maximize human happiness,” it may have done away with humans all together in favor of manufacturing huge vats of human brain mass in an optimally happy state. We’d be screaming Wait that’s not what we meant! as it came for us, but it would be too late. The system wouldn’t let anyone get in the way of its goal.

If we program an AI with the goal of doing things that make us smile, after its takeoff, it may paralyze our facial muscles into permanent smiles. Program it to keep us safe, it may imprison us at home. Maybe we ask it to end all hunger, and it thinks “Easy one!” and just kills all humans. Or assign it the task of “Preserving life as much as possible,” and it kills all humans, since they kill more life on the planet than any other species.

# Metrics in life: is our loss what we call "interest" ?

There are three kinds of incentives: economic, social, and moral, and often incentive schemes will include all three of these. Levitt uses crime as an example: why don't more people commit crimes? Because there exist economic incentives—being jailed, losing your house, being fined—that stop us from doing so, as well as moral incentives, like the refusal to do something morally wrong, and social incentives–we do not want others to see us doing something wrong. These types of incentives are how society attempts to mitigate crime.

<img src='images/covey_center_habit_two.png'>

Napoléon and the Légion d'honneur:

He sets out his ideas on the Légion d'honneur and replies to Councillor of State Berlier, who held that baubles and ribbons belonged to monarchies and were distinctions unworthy of a republic: "The Romans had patricians, knights, citizens and slaves. For each they had different dress and different customs. As rewards they granted all kinds of distinctions, names that recalled services rendered, mural crowns, the triumph! I defy anyone to show me a republic, ancient or modern, in which there were no distinctions. They call these baubles! Well, it is with baubles that men are led."

Accordingly, the order of the Légion d'honneur was created on 19 May 1802 to reward military and civil service. The order still exists: the President of the French Republic is today its grand master.

- "Is a disinterested act possible?" in <u>Raisons pratiques: sur la théorie de l'action</u>, Pierre Bourdieu

## Metric in science: publications

https://en.wikipedia.org/wiki/Replication_crisis

https://en.wikipedia.org/wiki/Data_dredging

Bernard Pudal, a sociologist of political commitment, at a conference: "when my interviews did not fit my model, I left them out"

## Metric for a politician: point towards a hospital 

Charles Duhigg, <u>The Power of Habit</u>: the government would build hospitals even when they weren’t needed.

The cue was budget money. The routine was to build a hospital. The reward was the politician being able to point and say “look what I did!” to climb the ladder of success.

Actually, it is probably more complicated than that. Obama, <u>The Audacity of Hope</u>: fear of humiliation by defeat.

# Takeaways: are metrics bad ?

*
*