# Data Scientists Need to Know Just One Test

![test](https://miro.medium.com/max/630/1*NqycnGXFLuvypvOQ-vfw4Q.png)

**I am here to reassure you: as a data professional, there is only one test that you need to know. Not because 1 test is important and the other 103 are negligible. But because:**

## All the statistical tests are in reality the same one test!

We will solve four very different statistical problems, and we will solve all of them with exactly the same algorithm.

1. You have thrown a die 10 times. You got [1, 1, 1, 1, 1, 2, 2, 2, 3, 3]. Is the die loaded?
2. Your friend claims that some Scrabble tiles fell out of the bag and, coincidentally, the letters formed a real word: “F-E-A-R”. You suspect that your friend is just trying to make fun of you. Is your friend lying?
3. In a customer satisfaction survey, 100 customers gave an average rating of 3.00 to product A and 2.63 to product B. Is this difference significant?
4. You trained a binary classification model. It has an area under the ROC curve of 70% on your test set (made of 100 observations). Is the model significantly better than random?

You have a hypothesis — called “null hypothesis” — and you want to put it to the test. Thus, you ask yourself:
*“If the hypothesis were true, how often would I get an outcome as suspect as the one I actually observed?”*

In the example of the die, the question becomes: “If the die were fair, how often would I get a sequence as unexpected as [1, 1, 1, 1, 1, 2, 2, 2, 3, 3]?” Since you are asking “how often”, the answer must necessarily be a number between 0 and 1, where 0 means never and 1 means always.

**In statistics, this “how often” is called “p-value”.**

## The ingredients of statistical testing

Reading the previous paragraph, you may have guessed that we need two ingredients:

- The distribution of the possible outcomes under the null hypothesis.
- A measure of the “unexpectedness” of any outcome.

## A unique statistical test
But how do we do it in Python? The algorithm is the following:

1. Define a function draw_random_outcome. This function should return the outcome of a random trial, given that the null hypothesis is true. It may be a single number, an array, a list of arrays, an image, practically anything: it depends on the specific case.

2. Define a function unexp_score (which stands for “unexpectedness score”). The function should take an experiment outcome as input and return a single number. This number must be a score of how unexpected the outcome is, assuming it was generated under the null hypothesis. The score may be positive or negative, integer or float; it doesn’t matter. The only property it must have is the following: the unlikelier the outcome is, the higher this score must be.

3. Run the function draw_random_outcome (defined in step 1) many times (e.g. 10,000 times) and, for each random outcome, compute its unexp_score (defined in step 2). Store all the scores in an array called random_unexp_scores.

4. Compute unexp_score of the observed outcome, and call it observed_unexp_score.

5. Compute the fraction of random outcomes that are at least as unexpected as the observed outcome. That is to say, count how many elements of random_unexp_scores are greater than or equal to observed_unexp_score, and divide by the number of iterations. This is the p-value.


**The first two steps are the only ones that require a bit of creativity**, depending on the specific case, while steps 3, 4, and 5 are purely mechanical.
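
Steps 3, 4, and 5 can be wrapped once in a helper and reused for every problem. The following is a minimal sketch, not part of the original notebook: the helper name `simulated_pvalue` and the coin-flip toy case are assumptions made only for illustration.

```python
import numpy as np

def simulated_pvalue(draw_random_outcome, unexp_score, observed_outcome, n_iter=10_000):
    """Fraction of simulated outcomes at least as unexpected as the observed one."""
    # step 3: simulate under the null hypothesis and score every simulated outcome
    random_unexp_scores = np.array(
        [unexp_score(draw_random_outcome()) for _ in range(n_iter)]
    )
    # step 4: score the outcome that was actually observed
    observed_unexp_score = unexp_score(observed_outcome)
    # step 5: share of simulated scores that reach the observed score
    return float(np.mean(random_unexp_scores >= observed_unexp_score))

# Hypothetical toy case: a coin is flipped 10 times and lands heads 9 times.
rng = np.random.default_rng(0)  # seed chosen arbitrarily, for reproducibility

def draw_coin_flips():
    # step 1: ten flips of a fair coin (1 = heads, 0 = tails)
    return rng.integers(0, 2, size=10)

def coin_unexp_score(flips):
    # step 2: distance of the number of heads from the expected 5
    return abs(int(flips.sum()) - 5)

coin_pvalue = simulated_pvalue(draw_coin_flips, coin_unexp_score, np.array([1] * 9 + [0]))
print(coin_pvalue)
```

Only the two functions passed in change from one problem to the next; the helper itself stays the same.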

Now, to make it more concrete, let’s go through the examples.

## Example 1. Rolling A Die

We have rolled a die 10 times and obtained this outcome:

In [1]:
import numpy as np

observed_outcomes = np.array([1,1,1,1,1,2,2,2,3,3])

The null hypothesis is that the die is fair. Under this hypothesis, it’s easy to draw random outcomes: it’s enough to use NumPy’s random.choice. So, this is the first step of our algorithm:

In [2]:
# step 1
def draw_random_outcome():
  return np.random.choice([1,2,3,4,5,6], size=10)

The second step is to define a function called unexp_score that should assign a score of unexpectedness to each possible outcome.

If the die is fair, we expect each face to appear on average one sixth of the time. So we should check the distance between the observed frequency of each face and 1/6. Then, to obtain a single score, we should take the average of these distances. In this way, the higher the average distance from one sixth, the more unexpected the outcome.

In [3]:
# step 2
def unexp_score(outcome):
  outcome_distribution = np.array([np.mean(outcome == face) for face in [1,2,3,4,5,6]])
  return np.mean(np.abs(outcome_distribution - 1/6))
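
To see what this score does, it can be worked out by hand for the observed rolls and compared with a more even sequence. This is a small sketch; the "more even" sequence is made up for illustration and is not in the original data.

```python
import numpy as np

faces = np.arange(1, 7)

def mean_abs_deviation(outcome):
    # Frequency of each face, then average distance of those frequencies from 1/6
    freq = np.array([np.mean(outcome == face) for face in faces])
    return float(np.mean(np.abs(freq - 1 / 6)))

observed_rolls = np.array([1, 1, 1, 1, 1, 2, 2, 2, 3, 3])
# Frequencies are 0.5, 0.3, 0.2, 0, 0, 0: the six distances from 1/6 add up to 1
observed_score = mean_abs_deviation(observed_rolls)
print(round(observed_score, 4))  # 0.1667

# Hypothetical, more even sequence: frequencies 0.2, 0.2, 0.2, 0.2, 0.1, 0.1
even_rolls = np.array([1, 2, 3, 4, 5, 6, 1, 2, 3, 4])
even_score = mean_abs_deviation(even_rolls)
print(round(even_score, 4))  # 0.0444
```

The lopsided observed sequence gets the higher score, which is the property required of unexp_score.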

In [5]:
# step 3
n_iter = 10000
random_unexp_scores = np.empty(n_iter)
for i in range(n_iter):
  random_unexp_scores[i] = unexp_score(draw_random_outcome())
# step 4
observed_unexp_score = unexp_score(observed_outcomes)
# step 5
pvalue = np.sum(random_unexp_scores >= observed_unexp_score) / n_iter

In [6]:
pvalue

0.0229

![unexp_scores](https://miro.medium.com/max/630/1*C7LRHNOVL1Z8OAwIim2eiA.png)
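
One practical detail: the score above is a float, and many different sequences (for instance, any sequence with counts 5, 3, 2 on three faces) have the same score as the observed one, so the `>=` comparison can be affected by rounding in the last digit. A possible variant, sketched here under the assumption that an integer score is acceptable, counts faces and uses the sum of |6 × count − 10|, which is the same score multiplied by a positive constant. It also simulates all experiments at once. The seed and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility
n_iter, n_rolls, n_faces = 10_000, 10, 6

def int_unexp_score(counts):
    # Integer version of the score: a positive multiple of the mean |frequency - 1/6|
    return np.abs(n_faces * counts - n_rolls).sum(axis=-1)

observed_outcomes = np.array([1, 1, 1, 1, 1, 2, 2, 2, 3, 3])
observed_counts = np.bincount(observed_outcomes, minlength=n_faces + 1)[1:]

# steps 1 and 3: each row is one simulated experiment of 10 fair rolls
rolls = rng.integers(1, n_faces + 1, size=(n_iter, n_rolls))
# Count how often each face appears in each experiment -> shape (n_iter, n_faces)
random_counts = (rolls[:, :, None] == np.arange(1, n_faces + 1)).sum(axis=1)
random_unexp_scores = int_unexp_score(random_counts)

# steps 4 and 5
observed_unexp_score = int_unexp_score(observed_counts)
pvalue_int = float(np.mean(random_unexp_scores >= observed_unexp_score))
print(pvalue_int)
```

Because of simulation noise and the handling of ties, the value printed here need not match the one shown above exactly.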

## Example 2. Scrabble mystery


Your friend claims that some Scrabble tiles fell out of the bag and, coincidentally, the letters formed a real word: “F-E-A-R”. You suspect that your friend is teasing you. How can you check statistically whether your friend is lying?

First of all, the observed outcome is a sequence of letters, therefore a string:

In [7]:
observed_outcome = 'FEAR'

Suppose the bag contained the 26 letters of the alphabet. The null hypothesis is that a random number of letters (between 1 and 26) fell out of the bag in random order. So, we will have to use Numpy’s random for both the number of letters and the choice of the letters:

In [11]:
# step 1
import string

def draw_random_outcome():
  size=np.random.randint(low=1, high=27)
  return ''.join(np.random.choice(list(string.ascii_uppercase), size=size, replace=False))

Now, how can we evaluate unexpectedness in this scenario?

In general, it’s reasonable to expect that the more letters fall from the bag, the less likely they are to form a real word.

Therefore, we can use this rule: if the string is an existing word, then its score will be the length of the word. If the string is not a real word, then its score will be minus the length of the word.
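
As a quick illustration of this rule, here is a sketch with a tiny, hypothetical dictionary (`toy_words` is a stand-in invented for this example, not a real word list):

```python
# Hypothetical three-word dictionary, used only to illustrate the scoring rule
toy_words = {"FEAR", "EAR", "A"}

def toy_unexp_score(outcome, words=toy_words):
    # Real word: +length. Not a word: -length.
    sign = 1 if outcome in words else -1
    return sign * len(outcome)

print(toy_unexp_score("FEAR"))  # 4
print(toy_unexp_score("EAR"))   # 3
print(toy_unexp_score("XQZJ"))  # -4
```

Any real word outranks any non-word, and among real words the longer ones count as more unexpected.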

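In [9]:
# step 2
# Assumes a version of the english-words package that exposes english_words_set
from english_words import english_words_set

# Upper-case the dictionary and keep it as a set, so membership checks stay fast
english_words_set = {w.upper() for w in english_words_set}
def unexp_score(outcome):
  is_in_dictionary = outcome in english_words_set
  return (1 if is_in_dictionary else -1) * len(outcome)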

In [12]:
# step 3
n_iter = 10000
random_unexp_scores = np.empty(n_iter)
for i in range(n_iter):
  random_unexp_scores[i] = unexp_score(draw_random_outcome())
# step 4
observed_unexp_score = unexp_score(observed_outcome)
# step 5 (divide by n_iter so that the p-value is a fraction, not a count)
pvalue = np.sum(random_unexp_scores >= observed_unexp_score) / n_iter

In [13]:
pvalue

0.0002

## Example 3. Difference between two means

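In the survey, 100 customers gave an average rating of 3.00 to product A and 2.63 to product B. The null hypothesis is that the ratings of the two products come from the same distribution, so any rating could equally have been given to either product.

In [None]:
# Ratings from 1 to 5 given by 100 customers to each product
product_a = np.repeat([1,2,3,4,5], 20)  # mean 3.00
product_b = np.array([1]*27+[2]*25+[3]*19+[4]*16+[5]*13)  # mean 2.63
observed_outcome = np.mean(product_a) - np.mean(product_b)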

In [None]:
# step 1
def draw_random_outcome():
  pr_a, pr_b = np.random.permutation(np.hstack([product_a, product_b])).reshape(2,-1)
  return np.mean(pr_a) - np.mean(pr_b)
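
The pool-shuffle-split idea in step 1 is easier to see on a tiny case. This sketch uses two made-up groups of three values each; note that `reshape(2, -1)` splits the pooled array into two halves, so it relies on both groups having the same size (as they do here, with 100 ratings each).

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility

# Hypothetical mini-groups, for illustration only
group_a = np.array([1, 2, 3])
group_b = np.array([4, 5, 6])

# Pool all values, shuffle them, and deal them back into two groups of equal size
pooled = np.hstack([group_a, group_b])
shuffled_a, shuffled_b = rng.permutation(pooled).reshape(2, -1)

print(shuffled_a, shuffled_b)
print(np.mean(shuffled_a) - np.mean(shuffled_b))
```

Every value is still present exactly once; only the group labels have been reassigned at random, which is what the null hypothesis says should make no difference.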

In [None]:
# step 2
def unexp_score(outcome):
  return np.abs(outcome)

In [None]:
# step 3
n_iter = 10000
random_unexp_scores = np.empty(n_iter)
for i in range(n_iter):
  random_unexp_scores[i] = unexp_score(draw_random_outcome())
# step 4
observed_unexp_score = unexp_score(observed_outcome)
# step 5
pvalue = np.sum(random_unexp_scores >= observed_unexp_score) / n_iter

## Example 4. Area under the ROC curve

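You trained a binary classification model that reaches an area under the ROC curve of 70% on a test set of 100 observations. The null hypothesis is that the model’s scores are unrelated to the labels.

In [None]:
# Placeholder labels and scores that stand in for the real test set
y_test = np.random.choice([0,1], size=100, p=[.9,.1])
proba_test = np.random.uniform(low=0, high=1, size=100)
# Observed area under the ROC curve, as given in the problem statement
observed_outcome = .7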

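In [None]:
# step 1
from sklearn.metrics import roc_auc_score

def draw_random_outcome():
  # Shuffling the scores breaks any link between scores and labels
  return roc_auc_score(y_test, np.random.permutation(proba_test))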

In [None]:
# step 2
def unexp_score(outcome):
  return np.abs(outcome - .5)

In [None]:
# step 3
n_iter = 10000
random_unexp_scores = np.empty(n_iter)
for i in range(n_iter):
  random_unexp_scores[i] = unexp_score(draw_random_outcome())
# step 4
observed_unexp_score = unexp_score(observed_outcome)
# step 5
pvalue = np.sum(random_unexp_scores >= observed_unexp_score) / n_iter
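
To check that shuffling the scores really simulates a model that is no better than random, one can look at where the shuffled AUCs concentrate. The sketch below is an illustration under stated assumptions: it uses made-up labels with exactly 10 positives, and a small NumPy helper (`rank_auc`, a hypothetical name) based on the rank-sum formula for the AUC, so that it runs without scikit-learn. The helper assumes binary labels and no tied scores.

```python
import numpy as np

def rank_auc(y_true, scores):
    """AUC from the rank-sum formula; assumes 0/1 labels and no tied scores."""
    ranks = scores.argsort().argsort() + 1  # rank 1 = lowest score
    n_pos = int(np.sum(y_true == 1))
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility

# Hypothetical test set: 10 positives, 90 negatives, random scores
y_toy = np.array([1] * 10 + [0] * 90)
scores_toy = rng.uniform(size=100)

# AUC of shuffled scores, repeated many times: the distribution under the null
null_aucs = np.array(
    [rank_auc(y_toy, rng.permutation(scores_toy)) for _ in range(2000)]
)
print(null_aucs.mean())

# Sanity check: scores that rank every positive above every negative
perfect_auc = rank_auc(y_toy, np.arange(100, 0, -1).astype(float))
print(perfect_auc)
```

The shuffled AUCs are centred near 0.5, which is why step 2 measures unexpectedness as the distance from 0.5.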