# Lesson 5: Hypothesis Testing

**Python learning objectives**

1. Learn how to use an `if` and `else` statement
2. Learn how to write your own functions

**What you will be able to do with these skills**

1. Automate a T-test with an `if` statement
2. Write a function that can automate a T-test
3. Understand what *Student's T-test* is

**T-test**

Data scientists and statisticians are often faced with yes-no questions about the world. To answer these questions with statistical evidence we use hypothesis testing. In this lesson we are going to be using the *T-test*.

Firstly, as always, we need to import the `pandas` library. Once again we are giving it a nickname of `pd`. In this lesson we are also going to be using a function from `scipy`, so we need to import that library as well.

Instead of importing the whole `scipy` library we are going to only import a submodule - `stats` - which has all the statistical functions we need.

In [None]:
import pandas as pd 
from scipy import stats

We are going to start this lesson with data from the UK government about *A-Level* results [1]. *A-Levels* are exams taken by students aged 16-18 and are similar to the *International Baccalaureate* or the *European Baccalaureate*.

In [None]:
resultsdf = pd.read_csv("https://raw.githubusercontent.com/ThomasJewson/datasets/master/ALevelResults2018-19.csv")
resultsdf

If you observe the data, a useful piece of information to know would be whether men or women perform better in these examinations.

How would we go about doing this?

Initially, you might take an average of both the `Male Average Score Point (A level)` and `Female Average Score Point (A level)` columns and compare them. This can be done quite easily with the code below.

In [None]:
resultsdf["Male Average Score Point (A level)"].mean()

In [None]:
resultsdf["Female Average Score Point (A level)"].mean()

Simply comparing the means like this may give us an idea of who performs better. However, it does not tell us whether the difference is statistically significant, that is, whether it is too large to be explained by random variation alone.

To assess this significance we use *Student's T-test*.

We will learn how to perform a one-sample T-test with the `stats.ttest_1samp()` [function](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html). We need to provide this function with two arguments: a population (our sample of data) and a mean to compare that population to.

For example, if we want to compare the male and female mean score point we can either provide:

- The *mean* `resultsdf["Male Average Score Point (A level)"].mean()` and the *population* `resultsdf["Female Average Score Point (A level)"]`

- The *mean* `resultsdf["Female Average Score Point (A level)"].mean()` and the *population* `resultsdf["Male Average Score Point (A level)"]`


For the following example we will be using the **mean `Male Average Score Point (A level)`** and the **population of `Female Average Score Point (A level)`**. 

Firstly, let's calculate and save the *mean* `Male Average Score Point (A level)` to a variable called `MASPmean`.

In [None]:
MASPmean = resultsdf["Male Average Score Point (A level)"].mean()
MASPmean

Secondly, let's save the population of `Female Average Score Point (A level)` to the variable called `FASPpop`.

In [None]:
FASPpop = resultsdf["Female Average Score Point (A level)"]
FASPpop

Now let's insert this data into our `stats.ttest_1samp()` [function](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html) with the population being the first positional argument and the mean being the second positional argument.

In [None]:
stats.ttest_1samp(FASPpop,MASPmean)

This function returns a result object rather than a single number. The first value, labelled `statistic`, is the *t-statistic*, which is used to calculate the *p-value*. This number is unimportant for this lesson.

The second value, labelled `pvalue`, is the *two-tailed p-value* of our *t-test*, and it is the important result from this function.
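
Although the *t-statistic* is not needed for this lesson, a minimal sketch of where it comes from may help. The code below calculates it by hand and compares it with the value from `stats.ttest_1samp()`. The `sample` values and `comparison_mean` are made up for illustration and are not taken from the A-Level data.

```python
import math
from scipy import stats

# Hypothetical scores, made up purely for illustration (not the A-Level data)
sample = [30.0, 32.0, 31.0, 35.0, 29.0, 33.0]
comparison_mean = 30.0  # the fixed mean we compare the sample against

n = len(sample)
sample_mean = sum(sample) / n

# Sample standard deviation (note the division by n - 1)
sample_sd = math.sqrt(sum((x - sample_mean) ** 2 for x in sample) / (n - 1))

# t = (sample mean - comparison mean) / (standard deviation / square root of n)
t_by_hand = (sample_mean - comparison_mean) / (sample_sd / math.sqrt(n))

result = stats.ttest_1samp(sample, comparison_mean)
print(t_by_hand)
print(result.statistic)
```

The two printed numbers should agree, which suggests the t-statistic measures how far the sample mean is from the comparison mean relative to the spread of the data.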

To output just the *p-value* of this function we just need to put `.pvalue` at the end of the function call. Below, we save this output to a variable called `pval`.

To reiterate, to get the p-value we do the following steps:

1. Start with our `stats.ttest_1samp()` function.
```Python
stats.ttest_1samp()
```
2. Insert the arguments into the function. The first positional argument is the *population* and the second is the *mean* which we want to compare to. In this case, that is `FASPpop` and `MASPmean` respectively.
```Python
stats.ttest_1samp(FASPpop,MASPmean)
```
3. The output of this function is a result object that holds both the t-statistic and the p-value. To output just the p-value place `.pvalue` at the end of the function call.
```Python
stats.ttest_1samp(FASPpop,MASPmean).pvalue
```
4. Now let's save this to a variable called `pval`.
```Python
pval = stats.ttest_1samp(FASPpop,MASPmean).pvalue
```

In [None]:
pval = stats.ttest_1samp(FASPpop,MASPmean).pvalue
pval

What is the p-value?

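*The p-value is the probability of seeing a sample mean at least as far from the mean you provided as the one in your data, assuming the true mean of the population is **equal** to the mean you provided. A small p-value means your data would be unusual if the two means really were equal.*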
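
As a rough sketch of how the two-tailed p-value relates to the t-statistic, the code below recomputes it from the t-distribution using `stats.t.sf()`. The `sample` values are made up for illustration, and the degrees of freedom are assumed to be the sample size minus one, which is the usual choice for a one-sample T-test.

```python
from scipy import stats

# Hypothetical scores, made up purely for illustration (not the A-Level data)
sample = [30.0, 32.0, 31.0, 35.0, 29.0, 33.0]
comparison_mean = 30.0

result = stats.ttest_1samp(sample, comparison_mean)

# Degrees of freedom for a one-sample T-test: sample size minus one
degrees_of_freedom = len(sample) - 1

# stats.t.sf gives the probability of a t-statistic larger than the given value.
# Doubling it counts values that are equally extreme in either direction (two tails).
p_by_hand = 2 * stats.t.sf(abs(result.statistic), degrees_of_freedom)

print(p_by_hand)
print(result.pvalue)
```

Both printed values should match. Each one is the probability, assuming the null hypothesis is true, of a t-statistic at least this far from zero.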

To use this number to help answer our question, "Do women score the same in A-Levels as men?", we need to construct two hypotheses.

1. Our null hypothesis (H0) is that the mean of `Female Average Score Point (A level)` is *EQUAL TO* the mean of `Male Average Score Point (A level)`.

2. Our alternative hypothesis (H1) is that the mean of `Female Average Score Point (A level)` is *NOT EQUAL TO* the mean of `Male Average Score Point (A level)`.

Then we need to compare this p-value with a significance level, written as alpha. A common choice is a 95% confidence level, which corresponds to a significance level of alpha = 0.05. This alpha value comes from 1 - 95% = 1 - 0.95 = 0.05.
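
The arithmetic above can be sketched in Python as follows. The variable names `confidence_level`, `alpha` and `example_pval` are chosen here for illustration. The call to `round()` is an added precaution, because subtracting decimals on a computer can leave a tiny rounding error.

```python
confidence_level = 0.95

# Without round() the result is very slightly more than 0.05 because of
# how computers store decimal numbers
alpha = round(1 - confidence_level, 2)
print(alpha)

# The p-value quoted in this lesson for the A-Level data
example_pval = 0.425
print(example_pval > alpha)
```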

*When the p-value is **smaller** than our alpha value (0.05) then we reject our null hypothesis (H0) as there is evidence for the alternative hypothesis (H1).*

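*When the p-value is **greater** than our alpha value (0.05) then we do not reject our null hypothesis (H0) as there is insufficient evidence for the alternative hypothesis (H1). In this lesson we describe this as accepting H0, although strictly it only means that the data do not contradict H0.*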

Therefore, as we have a p-value = 0.425 and alpha = 0.05 we can tell that p-value > alpha. This means *we accept our null hypothesis (H0)* that female and male A-Level scores are the same, as there is insufficient evidence for the alternative hypothesis.

We get this result despite the means not being equal because random variation alone could have led to women scoring higher than men. As this is a particularly small dataset with only 11 points of data, the chance of random variation affecting the means is greater. If we use a larger dataset, like in the example at the end of this lesson, there is a smaller chance that random variation led to a difference in the means, so the same difference provides stronger evidence for H1.
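
To illustrate the effect of dataset size, here is a small sketch using made-up numbers rather than real data. The same five values are repeated to build a small and a large sample, so both samples have the same mean and a similar spread, and only the number of data points differs.

```python
from scipy import stats

# Hypothetical values, made up for illustration; their mean is 32
pattern = [28, 30, 32, 34, 36]
comparison_mean = 31

small_sample = pattern * 2    # 10 data points
large_sample = pattern * 40   # 200 data points

p_small = stats.ttest_1samp(small_sample, comparison_mean).pvalue
p_large = stats.ttest_1samp(large_sample, comparison_mean).pvalue

print("p-value with 10 data points:", p_small)
print("p-value with 200 data points:", p_large)
```

With these numbers the small sample gives a p-value above 0.05 and the large sample gives a p-value below 0.05, even though the difference between the sample mean and the comparison mean is the same in both cases.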

Below, we have used comparison operators to compare our `pval` to our alpha value of `0.05`. The output of each of these expressions is either `True` or `False`. These are known as Boolean values, named after George Boole, who helped establish modern symbolic logic.

In [None]:
pval > 0.05

In [None]:
pval < 0.05

The full set of common comparison operators is listed below.

|Comparison|Operator|True example|False example|
|-|-|-|-|
|Less than|`<`|`2 < 3`|`2 < 2`|
|Greater than|`>`|`3 > 2`|`3 > 3`|
|Less than or equal to|`<=`|`2 <= 2`|`3 <= 2`|
|Greater than or equal to|`>=`|`3 >= 3`|`2 >= 3`|
|Equal|`==`|`3 == 3`|`3 == 2`|
|Not equal|`!=`|`3 != 2`|`2 != 2`|
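
The operators in the table can be tried out with a short sketch like the one below. The variable names are chosen here for illustration, and each result is a Boolean value.

```python
a = 2
b = 3

less_than = a < b            # True, because 2 is less than 3
greater_than = a > b         # False, because 2 is not greater than 3
less_or_equal = a <= a       # True, because 2 is equal to 2
greater_or_equal = a >= b    # False, because 2 is neither greater than nor equal to 3
equal = a == b               # False, because 2 is not equal to 3
not_equal = a != b           # True, because 2 is not equal to 3

print(less_than, greater_than, less_or_equal, greater_or_equal, equal, not_equal)
```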


**If Statements**

We can automate this T-test with the use of an `if` [statement](https://docs.python.org/3/tutorial/controlflow.html#if-statements). To construct an `if` statement we need to include a comparison that evaluates to a Boolean result, either `True` or `False`. If that comparison evaluates to `True` then the code within the `if` statement is run.

For example, if we wanted to code "if 5 is greater than 3 then print hello world" we can do it quite simply, see below.

It should be noted that the blank space (indentation) at the start of the line following `if ... :` is required. Most editors, including Jupyter notebooks, insert this indentation automatically.

In [None]:
if 5 > 3:
    print("Hello World")

**Exercise 1:** *In the `if` statement below we have the expression `8 > 10` which is `False`. Predict before running the code whether `print("Hello World")` will be executed.*

In [None]:
if 8 > 10:
    print("Hello World")

#Answer: print("Hello world") will not be run 

Notice that after the colon the next line of code is indented. This indentation is required. To indent code, press the Tab key. All the indented code will be run when the `if` condition is `True`.

Observe the two pieces of code below for an example of this. 

In [None]:
if 5 < 3: # This is a False statement, therefore the indented code does not run
    print("Hello World")
    print("My name is Lewis Carroll")

print("This print function is not a part of the if statement")

In [None]:
if 5 > 3: # This statement is True, therefore both the indented code and the code without an indent are run
    print("Hello World")
    print("My name is Lewis Carroll")

print("This print function is not a part of the if statement")

We can use `else` to extend our `if` statement to cover all scenarios. 

For example, below we have coded the following logical statement: "if 12 is greater than 15 then print 'Twelve is greater than fifteen', else (when 12 is not greater than 15) print 'Twelve is less than fifteen'".

In [None]:
if 12 > 15:
    print("Twelve is greater than fifteen")
else:
    print("Twelve is less than fifteen")
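
One detail worth noting is that `else` runs whenever the `if` comparison is `False`, which includes the case where the two numbers are equal. The sketch below illustrates this with two made-up variables, `a` and `b`, and stores the text in a variable called `message` before printing it.

```python
a = 12
b = 12

if a > b:
    message = "a is greater than b"
else:
    # This branch runs when a is less than b AND when a is equal to b
    message = "a is not greater than b"

print(message)
```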

Therefore to make an `if` and `else` statement we do the following:

1. Start with `if`. 
```Python
if
```
2. Place your logic statement after the `if`. In this case we are going to use `10 != 5`, where `!=` is the operator for *not equal to*.
```Python
if 10 != 5
```
3. Place a colon (`:`) after your logic statement. 
```Python
if 10 != 5:
```
4. The code we want to run within the `if` statement we write directly below. However, remember this code needs to be indented. To indent use the tab key. 
```Python
if 10 != 5:
        print("10 is not equal to 5")
```
5. To introduce an `else` statement we need to put `else:` in line with the `if 10 != 5:`. Therefore, we need to remove the indentation. 
```Python
if 10 != 5:
        print("10 is not equal to 5")
else:
```
6. The code we want to run within the `else` statement we write directly below. However, remember this code needs to be indented. To indent use the tab key. 
```Python
if 10 != 5:
        print("10 is not equal to 5")
else:
        print("10 is equal to 5?!")
```

We can use `if` statements to automate our T-test. 

If our p-value is less than 0.05 then we reject our null hypothesis (H0) as there is evidence for our alternative hypothesis (H1). Else, we accept our null hypothesis (H0) as there is insufficient evidence for our alternative hypothesis. 

See the code below.

In [None]:
if pval < 0.05:
    print("We reject our null hypothesis (H0) as there is evidence for our alternative hypothesis (H1).")
    print("Therefore, at the 95% confidence level we can say that men and women do not score the same at A-Levels.")
else:
    print("We accept our null hypothesis (H0) as there is insufficient evidence for our alternative hypothesis (H1).")
    print("Therefore, at the 95% confidence level we have no evidence that men and women score differently at A-Levels.")

**Defining our own functions**

In all programming languages a core feature is [defining your own functions](https://docs.python.org/3/tutorial/controlflow.html#defining-functions). In conjunction with `if` statements we will be able to define our own function that will completely automate the T-test.

The key purpose of defining a function is to give a name to a computational process that may be applied multiple times. The code in this function is only run when it is called.

The function below, called `double`, simply doubles the number given to it.

In [None]:
#Our first function definition

def double(x):
    """ Double x """
    return 2*x

To define a function we begin with `def`. Here is a breakdown of the other parts (known as the *syntax*) of this function. [2]

![title](https://raw.githubusercontent.com/ThomasJewson/datasets/master/function_definition.jpg)
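
In case the image does not load, the same breakdown can be sketched as comments on the `double` function. The wording of the comments is our own summary and is not taken from the image.

```python
def double(x):         # "def" starts the definition, "double" is the name, "x" is the argument name
    """ Double x """   # the docstring: a short description of what the function does
    return 2*x         # the body is indented; "return" gives the result back to whoever called it

# Calling the function runs the code inside it
result = double(10)
print(result)
```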

Let's call the function `double()` below and give it an argument of `10`.

In [None]:
double(10)

It is possible to add more arguments into a defined function.  

In [None]:
def divide(x,y):
    """This code divides x by y"""
    return x/y

In [None]:
divide(10,5)

In [None]:
divide(5,10)
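
The two calls above give different answers because the arguments are positional: the first value is used as `x` and the second as `y`. As a small sketch of an alternative, Python also lets you name the arguments when calling a function, in which case the order no longer matters.

```python
def divide(x, y):
    """This code divides x by y"""
    return x / y

positional = divide(10, 5)       # x is 10 and y is 5
swapped = divide(5, 10)          # x is 5 and y is 10
by_name = divide(y=5, x=10)      # named arguments: x is 10 and y is 5

print(positional, swapped, by_name)
```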

**Exercise 2:** *Write a function below that multiplies two numbers together called `multiply`.*

In [None]:
def multiply(x,y):
    """This code multiplies x by y"""
    return x*y

**Exercise 3:** *Run `multiply(2,10)` to test your function.*

In [None]:
multiply(2,10)

Now, with what we have learnt, we are able to construct a function for our T-test. The code we want to run within the function is the same as what we have used before. 

In [None]:
def ttest(data,mean):
    """This function outputs which hypothesis is accepted or rejected in a T-test.
    
    More precisely, this function runs the stats.ttest_1samp() function to
    obtain a p-value which it compares to an alpha value of 0.05 (a 95%
    confidence level). Then it outputs whether H0 is accepted or rejected.
    
    """
    pval = stats.ttest_1samp(data,mean).pvalue
    
    if pval < 0.05:
        print("We reject our null hypothesis (H0) as there is evidence for our alternative hypothesis (H1).")
        print("Therefore, the means are not equal.")
    else:
        print("We accept our null hypothesis (H0) as there is insufficient evidence for our alternative hypothesis (H1).")
        print("Therefore, the means are equal.")

To check that this is working as before, let's run the A-level results data through it again.

In [None]:
ttest(FASPpop,MASPmean)
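
The `ttest` function prints its conclusion, which is convenient to read but hard to reuse in later code. As one possible variation, the sketch below returns `True` or `False` instead and lets the caller choose the alpha value. The name `ttest_decision`, the `alpha=0.05` default and the `sample` values are all hypothetical and are not part of the original lesson.

```python
from scipy import stats

def ttest_decision(data, mean, alpha=0.05):
    """Return True if H0 is rejected at the given alpha value, otherwise False."""
    pval = stats.ttest_1samp(data, mean).pvalue
    return bool(pval < alpha)

# Hypothetical values, made up for illustration; their mean is 32
sample = [28, 30, 32, 34, 36] * 40

rejected_for_31 = ttest_decision(sample, 31)   # comparison mean differs from the sample mean
rejected_for_32 = ttest_decision(sample, 32)   # comparison mean equals the sample mean

print(rejected_for_31, rejected_for_32)
```

Returning a value rather than printing it means the result can be stored in a variable or used inside another `if` statement.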

This function can now accept any data we want to perform a T-test on. Below we import another dataset. This dataset is the age of death among members of the sovereignty, aristocracy, and gentry within the UK. [3]

The sovereignty are members of parliament and royals. The aristocracy are members of the highest class, which typically had hereditary titles, and the gentry are people who are moneyed and typically own large amounts of land.

In [None]:
sovr = pd.read_csv("https://raw.githubusercontent.com/ThomasJewson/datasets/master/Age_of_death_of_Gentry_Sov_Arist/sovr.csv")
aris = pd.read_csv("https://raw.githubusercontent.com/ThomasJewson/datasets/master/Age_of_death_of_Gentry_Sov_Arist/aris.csv")
gent = pd.read_csv("https://raw.githubusercontent.com/ThomasJewson/datasets/master/Age_of_death_of_Gentry_Sov_Arist/gent.csv")

We can see the mean age of death of each class below.

In [None]:
print("Sovereign class average age of death:",sovr["age"].mean())
print("Aristocracy class average age of death:",aris["age"].mean())
print("Gentry class average age of death:",gent["age"].mean())

This dataset is much larger than our A-Level dataset. For example, the `aris` DataFrame has 2291 rows.

In [None]:
aris

Because this dataset is larger, random variation has less influence on the means. Therefore, even a small difference in the means is much more likely to lead to the null hypothesis being rejected.

Let's do a T-test comparing the `sovr["age"]` population and the mean of the `aris` DataFrame and run it through our function.

In [None]:
ttest(sovr["age"],aris["age"].mean())

This example displays the power of user-defined functions.

---------------------------

**Conclusions:**

*You should now be able to do the following:*

1. Construct a T-test with the `stats.ttest_1samp()` function
2. Obtain the p-value from the output of the T-test function with `stats.ttest_1samp().pvalue`
3. Use comparison expressions, and know how to use all of the comparison operators `>`, `<`, `<=`, `>=`, `==`, `!=`
4. Use an `if` and `else` statement
5. Construct a program that automatically outputs the result of a T-test
6. Define your own functions
7. Define a function that outputs the answers to a T-test

Sources:

[1] https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/840414/2019_provisional_National_tablesv2.xlsx

[2] https://www.inferentialthinking.com/chapters/08/Functions_and_Tables.html

[3] William Guy, Journal of the Statistical Society of London:

- On the Duration of Life Among the English Gentry (March 1846), Vol. 9, pp. 37-49
- On the Duration of Life of Sovereigns (March 1847), Vol. 10, pp. 62-69
- On the Duration of Life Among the Families of the Peerage and Baronetage of the UK (March 1845), Vol. 8, pp. 69-77