<a href="https://colab.research.google.com/github/DavidSenseman/STA1403/blob/master/Assignment_06.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **STA1403: "Biostats"**

### **Assignment 6: Hypothesis Testing**

#### In this assignment you will learn about:

* Extracting data with Boolean masks
* Hypothesis testing
* _t_ Test
* Analysis of Variance (ANOVA)

### Google CoLab Instructions

The following code will map your GDrive to ```/content/drive``` and print out your Google GMAIL address.

In [None]:
# You must run this cell first
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    from google.colab import auth
    auth.authenticate_user()
    COLAB = True
    print("Note: using Google CoLab")
    %tensorflow_version 2.x
    import requests
    gcloud_token = !gcloud auth print-access-token
    gcloud_tokeninfo = requests.get('https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=' + gcloud_token[0]).json()
    print(gcloud_tokeninfo['email'])
except:
    print("Note: not using Google CoLab")
    COLAB = False

## **Introduction to Hypothesis Testing**

When performing biomedical experiments, it is typical to start by expressing a **_hypothesis_** to explain an observed phenomenon. For example, Mackowiak _et al_. (1992) measured the body temperature of a number of healthy men and women and found that the average of their body temperatures was less than the normally accepted value of 98.6°F.

To explain this observation, they hypothesized that the value of 98.6 <sup>o</sup>F reported by the great German physician, Carl Reinhold August Wunderlich in 1856 (see photograph) was too high and that the real value was somewhat lower.

![__](https://biologicslab.co/STA1403/images/A06/Wunderlich.png)

In this lesson we look at **_hypothesis testing_** as a formal way to statistically compare two (or more) sets of numerical values. 

As in many areas of mathematics and statistics, certain Greek letters and/or symbols are commonly used in specific types of problems. In statistics, for example, the lower case Greek letter mu ( **_μ_** ) is used to represent the **_population mean_** of a variable being studied. In this example, the variable is body temperature. Using this notation we can express Mackowiak _et al_'s hypothesis as _μ_ < 98.6 <sup>o</sup>F. 

For many students, the next step is the hardest since it seems to be opposite of what you want to show. The trick is that you have to find another explanation, also expressed as a hypothesis, that **_invalidates_** (annuls) your proposed hypothesis. We call this hypothesis the **_null hypothesis_** and denote it as  **_H<sub>0</sub>_**. 

The _null hypothesis_ usually reflects the “status quo” or “nothing of interest”. In this example the _status quo_ would simply be the old value reported by Dr. Wunderlich of 98.6 <sup>o</sup>F. Imagine trying to win a blue ribbon at a high school science fair if all your project showed was that normal body temperature was the good old 98.6 <sup>o</sup>F? In other words, proving a **_null hypothesis_** would definitely be considered “<u>nothing of interest</u>” and not likely to make you famous, or win any prize at a science fair for that matter.     

But now look at this situation the other way around. What if you could **_prove_** that the great Carl Reinhold August Wunderlich had been **_wrong_** all these years and that, because of your brilliant research, the true body temperature was **_not_** 98.6 <sup>o</sup>F after all? In that case, you (and everyone else) would have to **_reject the null hypothesis_**. And more importantly, if the null hypothesis is rejected, you (and everyone else) would now be **_forced_** to accept **YOUR** hypothesis. 

We call your hypothesis--the hypothesis that you are investigating--the **_alternative hypothesis_** and we denote it as **_H<sub>A</sub>_**. When you are able to disprove (i.e. **_reject_**) the null hypothesis, _H<sub>0</sub>_, it means you have to accept the alternative hypothesis, _H<sub>A</sub>_, as being true.

-----------------------------------------------
#### **The Secret of Hypothesis Testing**

The **_secret of hypothesis testing_** is that the only way to show that your hypothesis (the alternative hypothesis **_H<sub>A</sub>_**) is correct, is to **_prove_** the null hypothesis (**_H<sub>0</sub>_**) is **_wrong_**. When you **reject** the null hypothesis, you are logically forced to accept the alternative hypothesis as being true since it is the only possible alternative.

----------------------------------------

We will return to the topic of hypothesis testing later in this lesson. Before we are ready, we need to learn:

1. How to **_read_** data files stored locally on your computer
2. How to create a **_dataframe_** to store the data in a convenient format
3. How to **_extract specific data_** from dataframes using **_Boolean Masks_**.

For this example, Mackowiak _et al_ hypothesized that the widely accepted value of 98.6 <sup>o</sup>F was wrong due to a long-standing error made by the German physician, Carl Reinhold August Wunderlich, and that the true mean body temperature is lower, _μ_ < 98.6. The competing hypothesis is that the true mean body temperature is not lower, _μ_ ≥ 98.6. We refer to this competing hypothesis as the **_null hypothesis_** and denote it as _H<sub>0</sub>_.

The null hypothesis usually reflects the “status quo” or “nothing of interest”. In contrast, we refer to our hypothesis (i.e., the hypothesis we are investigating through a scientific study) as the **_alternative hypothesis_** and denote it as _H<sub>A</sub>_. It is common to express the null hypothesis in the simplest form possible. For the above example, to annul the alternative hypothesis, _H<sub>A</sub>_ : _μ_ < 98.6, it suffices to show that _H<sub>0</sub>_ : _μ_ = 98.6. This makes the task of evaluating a hypothesis easier. 

The procedure for evaluating a hypothesis is called **_hypothesis testing_** and it arises in many scientific problems. A common approach for hypothesis testing is to focus on the null hypothesis, which is usually simpler than the alternative hypothesis, and decide whether or not to reject it. To this end, we examine the evidence that the observed data provide against the null hypothesis _H<sub>0</sub>_. If the evidence against _H<sub>0</sub>_ is strong, we **_reject_** _H<sub>0</sub>_. If not, we state that the evidence provided by the data is not strong enough to reject _H<sub>0</sub>_, and we **_fail to reject it_**.


## **Two-Sample t-tests for Comparing the Means**

For many hypothesis testing problems, we might be indifferent to the direction of departure from the null value. In such cases, we can express the null and alternative hypotheses as _H<sub>0</sub>_ : _μ_ = _μ_<sub>0</sub> and _H<sub>A</sub>_ : _μ_ $\not=$ _μ_<sub>0</sub>, respectively. Then we consider sample means that are **_either_** much larger or much smaller than _μ_<sub>0</sub> as evidence against the null hypothesis, and our alternative hypothesis is referred to as **_two-sided_**.

For example, suppose we believe that the average normal body temperature is different from the accepted value 98.6°F, but we are not sure whether it is higher or lower than 98.6. Then the null hypothesis remains _H<sub>0</sub>_ : _μ_ = 98.6, but the alternative hypothesis is expressed as _H<sub>A</sub>_ : _μ_ $\not=$ 98.6. 
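
As a minimal sketch of the two-sided test just described, the cell below uses `scipy.stats.ttest_1samp` on a handful of made-up temperature readings. The values in `sample_temps` are hypothetical and are not taken from Mackowiak _et al_; only the null value 98.6 comes from the text.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical body temperatures (deg F), invented for illustration only
sample_temps = np.array([98.2, 97.9, 98.6, 98.4, 98.0, 98.8, 97.7, 98.3])

# H0: mu = 98.6   versus   HA: mu != 98.6 (two-sided is the default)
t_stat, p_val = ttest_1samp(sample_temps, popmean=98.6)

print(f"Sample mean: {sample_temps.mean():.2f}")
print(f"T-statistic: {t_stat:.3f}, P-value: {p_val:.3f}")
```

Because the alternative is two-sided, a sample mean far above 98.6 would count as evidence against _H<sub>0</sub>_ just as much as one far below it.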

### **Reading data files from the Internet**

As mentioned in an earlier lesson, a data file stored on your computer is considered a data type known as a **_stream_**. Python has a relatively large number of ways to read different types of data streams depending on the size of the data and the type of data connection.

If you want to store the information in the datafile in a data structure known as a **_dataframe_**, then the easiest way to read the datafile is using the `read_csv()` function that comes with the `pandas` library. This function works both for datafiles stored locally on a hard drive attached directly to your computer and for datafiles available at an Internet address (URL).

### Example 1: Read a Datafile and Create a DataFrame

Before we dive into the details of hypothesis testing, we first need to read a datafile and store its information in a data structure called a **_DataFrame_**. DataFrames are part of a software package called `Pandas`. Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures (e.g., DataFrames) and operations for manipulating numerical tables and time series.

A _DataFrame_ is very similar to a Microsoft(C) Excel spreadsheet with columns and rows. In both an Excel spreadsheet and DataFrame, the rows hold the data values for each subject (or object) in the dataset while the columns contain values for different variables associated with each subject. These data values can be either strings (letters) or numbers.

In this example the experimental data is contained in a textfile called "BodyTemperature.txt". In order to read this data using the `pandas` function `read_csv()` we need to provide the function with two _arguments_:

1. The **_filepath_** to the datafile.
2. The text character used in the datafile to **_separate_** data values. 

In this example, the datafile is stored on a web server, so the **_filepath_** is the full Internet address (URL) of the datafile, `https://biologicslab.co/STA1403/data/BodyTemperature.txt`. For a datafile stored on your computer, you only need to provide the filename, including its file extension, **IF** the file is located in the current working directory. If the datafile is **not** in your current working directory, then the _filepath_ must include the names of all of the directories and subdirectories as well as the filename.

The second thing you need to tell the function is the text character used to **separate** data values. The name of this argument is `sep` which stands for _separator_. The most common separators are commas, tabs and spaces. In the example below, data values are separated by a space, so the separator argument `sep` is set to equal a space `sep = " "`. (NOTICE: There is exactly one space between the quotes).

As is typical of many Python packages, we give the `pandas` package a "nickname" when we import it. If you look at the top of the code cell below, you will see the line: `import pandas as pd`. The `as pd` means that we don't have to write out the entire word `pandas` when calling the function, just its abbreviation `pd`. While this only saves us from writing a few letters of code, Python programmers strive to be as efficient as possible.

Finally, we assign the output of the function to a variable, which will be the name of the **_dataframe_** the function creates for us. In this example, the dataframe will be called `btDF`. We will use the suffix `DF` as a reminder that this variable is a `pandas` dataframe.

In [None]:
# Example 1: Read in datafile and create a dataframe

import pandas as pd

# Read data file
btDF = pd.read_csv("https://biologicslab.co/STA1403/data/BodyTemperature.txt", 
                   sep=' ')  # define the separator as a space

# Print out the first 5 rows
btDF.head()

If the code is correct, you should see the following table:

![__](https://biologicslab.co/STA1403/images/A06/A06_Image01.png)

When reading a datafile into a dataframe, it is always a good idea to check if your code was correct. The command `btDF.head()` is a simple way to print out the first 5 rows of the `btDF` dataframe. If your code was correct, your data should appear in nice neat columns as seen above. We can also see the _exact name_ of each column, which we will need if we want to extract a subportion of the data from our DataFrame, `btDF`. If the data is all piled up in a single column, it probably means that you didn't specify the correct separator argument.
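
To see what a wrong separator does, here is a small self-contained sketch. It reads the same text twice from an in-memory string (via `io.StringIO`) instead of a real file; the contents of `raw_text` and the names `right_df` and `wrong_df` are hypothetical and only serve the illustration.

```python
import io
import pandas as pd

# Hypothetical space-separated text standing in for a datafile
raw_text = "Gender Temperature HeartRate\nM 97.0 70\nF 98.4 74\nM 98.8 68\n"

# Correct separator: one column per variable
right_df = pd.read_csv(io.StringIO(raw_text), sep=" ")

# Wrong separator: every line ends up in a single column
wrong_df = pd.read_csv(io.StringIO(raw_text), sep=",")

print(right_df.shape)   # (3, 3)
print(wrong_df.shape)   # (3, 1)
```

Checking `.shape` (rows, columns) alongside `.head()` is a quick way to confirm that the number of columns matches what you expect.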

### **Exercise 1: Read a datafile and create a dataframe**

The textfile "balance.csv" contains the results of an experiment where the amount of mean sway range (in millimeters) in the forward/backward plane and side/side plane were recorded for two groups of subjects, young and elderly, while taking part in a reaction time test. The data set includes 9 elderly subjects and 8 young subjects. Each subject was asked to stand barefoot on a “force platform” and maintain a stable upright position. Then, they were supposed to react as quickly as possible to an unpredictable noise by pressing a hand-held button. The noise was produced randomly. The platform automatically measured how much a subject swayed in millimeters in both the forward/backward and the side-to-side directions.

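In the code cell below, use `pd.read_csv()` to read the textfile called "balance.csv" into a dataframe called `balDF` (the name used in the exercises that follow), then use the `.head()` method to print out the first 5 rows. In this textfile, the text symbol used to separate data values is a comma, not a space.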

In [None]:
# Insert your code for Exercise 1 here




If your code is correct you should see the following output:

![__](https://biologicslab.co/STA1403/images/A06/A06_Image02.png)

This indicates that the first 5 subjects were `elderly`. In this instance, you can't see that the subjects in the bottom half of the dataframe were given the label `young` in the `Age` column.

-------------------------------------------------

### **WARNING!&nbsp;&nbsp;WARNING!&nbsp;&nbsp;WARNING!**

If your code is **_not_** correct you might see the following:

![__](https://biologicslab.co/STA1403/images/A06/A06_Image03.png)

If your output looks like the second image **_you will need to fix your code before you can continue_**. To fix your problem, go back and carefully reread the instructions and try again.

------------------------------------------------

### Example 2: Create male and female "Boolean masks"

By visual inspection of the `btDF` in Example 1, we can see that data for men and women are _intermingled_; data for males are indicated by the letter `M` in the **Gender** column while female data is indicated by the letter `F`. Suppose we want to compare the body temperature or the heart rate between males and females? How are we going to extract (separate) the male and female data?

One way to extract specific types of information embedded in a dataframe is to create a **_Boolean Mask_**. A Boolean Mask is simply a list-like sequence (a `pandas` Series) containing either `True` or `False` for each subject (or object) in the DataFrame. The Python code to create two new variables called `mask_male` and `mask_female` using boolean logic is shown below.

As noted above, the gender of each subject in the `btDF` is specified in the `Gender` column as either `M` for male, and `F` for female. The code below uses the boolean **_equality operator (==)_** to see if the letter in the `Gender` column is an `M` or an `F`. 

As will be shown below, we can use these _boolean masks_ to separate variables (e.g. body temperatures and heart rates) of male subjects from female subjects. 


In [None]:
# Example 2: Create separate 'masks' for males and females

# Test if Gender is M or F
mask_male = btDF.Gender == 'M'
mask_female = btDF.Gender == 'F'

# Print out first 5 values of mask_male
mask_male.head()

Your output should look like this:

~~~text
0     True
1     True
2     True
3    False
4    False
Name: Gender, dtype: bool
~~~

The output shows the first 5 values in the `mask_male` were: `True, True, True, False, False`. If you look at the output of `btDF.head()` shown in Example 1, you will see that the first three subjects were males, and the next two were females. In other words, our `mask_male` worked as expected.

Remember, the `mask_male` is simply a list of `True` and `False` values depending upon the gender of the subject. If the value of `mask_male` is `True` for a particular subject, the subject's temperature will be used in the analysis. Conversely, if the value of `mask_male` is `False` (i.e., the subject is a female), the temperature value will **not** be included in the analysis of male body temperatures.

This ability of the boolean mask to selectively include or exclude data will be demonstrated in Example 3 below.
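
The following sketch builds a mask on a tiny made-up DataFrame so that every value can be checked by eye. The names `toy_df` and `toy_mask_female` and all of the data values are hypothetical.

```python
import pandas as pd

# Hypothetical miniature dataset, invented for illustration
toy_df = pd.DataFrame({
    "Gender": ["M", "F", "F", "M", "F"],
    "Temperature": [97.0, 98.4, 98.8, 98.2, 99.1],
})

# Comparing a column with == returns one True/False value per row
toy_mask_female = toy_df.Gender == "F"

print(toy_mask_female.tolist())   # [False, True, True, False, True]
print(toy_mask_female.sum())      # True counts as 1, so this is the number of females: 3
```

Summing a mask is a handy way to count how many subjects fall into a group before doing any further analysis.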

### **Exercise 2: Create `elderly` and `young` Boolean Masks**

In the code cell below, create a variable called `mask_elderly` and another variable called `mask_young` using the `Age` column in the  `balDF` dataframe. Use the `.head()` method to print out the first 5 boolean values of the `mask_elderly`.

In [None]:
# Insert your code for Exercise 2 here



If your code is correct you should see the following:
~~~text
0    True
1    True
2    True
3    True
4    True
Name: Age, dtype: bool
~~~

As expected from **Exercise 1**, the first 5 subjects (numbers 0 - 4) in `balDF` were all `elderly`.

### Example 3: Print out the first 5 male body temperature values

So why are boolean masks useful? Boolean masks can be used to separate data in a DataFrame based on a data value. In this example, we use the `mask_male` to separate the male data from the female data. To do this, we use the `mask_male` as an **_index_** into the `btDF` dataframe as follows:

`btDF.Temperature[mask_male]`

If the value of `mask_male` is `True` for a particular (i.e. male) subject, the `Temperature` data will be used -- in this case printed out. If the `mask_male` evaluates to `False` (i.e. a female) the data will not be printed. 

The Python code in the cell below uses the `.head()` method to print out the body temperatures of the first 5 male subjects in the `btDF` dataframe.

In [None]:
# Example 3: Print out the first 5 male temperature values.

# use .head() method to view first 5 values
btDF.Temperature[mask_male].head()

The output should be:
~~~text
0     97.0
1     98.8
2     96.2
5    101.3
9     99.2
Name: Temperature, dtype: float64
~~~

Notice that using the `mask_male`, the temperature values were selected from the first 3 subjects (0 through 2), and then two subjects (3 and 4) were skipped before the selection continued. This is exactly what we wanted--to select only the male data and to skip over the female data.
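
As a hedged aside, the same selection can also be written with the `.loc` indexer, and a mask can be inverted with the `~` operator. The sketch below uses a made-up DataFrame; `toy_df`, `toy_mask_male` and the data values are hypothetical.

```python
import pandas as pd

# Hypothetical miniature dataset, invented for illustration
toy_df = pd.DataFrame({
    "Gender": ["M", "F", "F", "M", "F"],
    "Temperature": [97.0, 98.4, 98.8, 98.2, 99.1],
})
toy_mask_male = toy_df.Gender == "M"

# Same selection written two ways
by_attribute = toy_df.Temperature[toy_mask_male]
by_loc = toy_df.loc[toy_mask_male, "Temperature"]

# ~ flips every True/False, which here selects the non-male rows
not_male = toy_df.loc[~toy_mask_male, "Temperature"]

print(by_loc.tolist())     # [97.0, 98.2]
print(not_male.tolist())   # [98.4, 98.8, 99.1]
```

Note that the selected values keep their original row numbers (0 and 3 for the males here), which is why the index in the output above skips over the excluded subjects.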

### **Exercise 3: Print out the first 5 `side_to_side` values for the `young` subjects**

In the code cell below use your `mask_young`, in conjunction with the `.head()` method, to print out the `side_to_side` sway measurements of the first 5 young subjects in the `balDF` DataFrame.

In [None]:
# Insert your code for Exercise 3 here



If your code is correct you should see the following output:
~~~text
9     17
10    10
11    16
12    22
13    12
Name: side_to_side, dtype: int64
~~~


### Example 4: Calculate the mean male and female body temperature

The most common way to characterize a series of numerical values is to calculate its average or **_mean_**. In statistics, the word _mean_ is preferred over the word _average_.

We will begin by simply calculating the mean body temperature for males and females in the `btDF` DataFrame. To do this we can use the `np.mean()` function that comes with the `NumPy` package.

In statistics, the letter typically used to represent the mean value of some variable is the lowercase Greek letter mu, _μ_. So the mean male body temperature would be written as _μ<sub>Male</sub>_ and the mean female body temperature would be written as _μ<sub>Female</sub>_.

The code for calculating the mean male body temperature, _μ<sub>Male</sub>_, and the mean female body temperature, _μ<sub>Female</sub>_, is shown in the code cell below.

In [None]:
# Example 4: Calculate mean male body temperature

import numpy as np

# Compute mean
mu_maleBT = np.mean(btDF.Temperature[mask_male])
mu_femaleBT = np.mean(btDF.Temperature[mask_female])

# Print results
print("The mean body temperature in males = ", mu_maleBT, "deg F")
print("The mean body temperature in females = ", mu_femaleBT, "deg F")

Rounding off to two decimal places, the mean male body temperature, _μ<sub>Male</sub>_ = 98.20 <sup>o</sup>F and the mean female body temperature, _μ<sub>Female</sub>_ = 98.46 <sup>o</sup>F. Given the data in this study, the mean female body temperature was 0.26 <sup>o</sup>F _higher_ than the mean male body temperature.
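
For comparison, `pandas` can also compute the mean of every group in one step with `groupby`. This is an alternative to the mask-based approach of Example 4, not a replacement for it; the DataFrame `toy_df` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature dataset, invented for illustration
toy_df = pd.DataFrame({
    "Gender": ["M", "F", "F", "M", "F"],
    "Temperature": [97.0, 98.0, 99.0, 98.0, 100.0],
})

# Mask-based mean, as in Example 4
toy_mean_male = np.mean(toy_df.Temperature[toy_df.Gender == "M"])

# groupby computes the mean of every group in one step
toy_group_means = toy_df.groupby("Gender")["Temperature"].mean()

print(toy_mean_male)           # 97.5
print(toy_group_means["M"])    # 97.5
print(toy_group_means["F"])    # 99.0
```

Both approaches give the same male mean; the mask version makes the selection step explicit, which is why it is used in this lesson.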

### **Exercise 4: Calculate the mean `side_to_side` sway measurements for elderly and young subjects**

In the cell below write the code to calculate the mean `side_to_side` measurements for `elderly` and `young` subjects in the `balDF` DataFrame. Store each calculation in a suitably named variable and then print out the results.

In [None]:
# Insert your code for Exercise 4 here



If your code is correct you should see something like the following output:
~~~text
The mean side to side sway in the elderly was =  22.22222222222222 mm
The mean side to side sway in the young was =  15.125 mm
~~~
Rounding off to one decimal place, the mean side_to_side measurement for elderly subjects, _μ<sub>Elderly</sub>_ = 22.2 mm and _μ<sub>Young</sub>_ = 15.1 mm.

Taking the difference, the elderly subjects swayed 7.1 mm _more_ in the side-to-side plane than the young subjects.

## **How to compare two mean values?**

Focusing on the body temperature data, in our sample of 49 healthy male subjects and 51 healthy female subjects we found a **0.26 <sup>o</sup>F difference** in mean body temperature.

There is a very large number of differences between males and females when it comes to their anatomy and physiology. These differences, such as the shape of the pelvic girdle, are referred to as **_sexually dimorphic characteristics_**. 

If YOU could use the data in `btDF` to **_prove_** that body temperature in humans is _sexually dimorphic_, you would become immediately famous! It would mean that, since the dawn of medical science, this basic physiological fact had somehow escaped discovery until now.

The problem, of course, is **_biological variability_**. As a biologist you naturally expect to see differences in a wide variety of anatomical and physiological characteristics -- not just between men and women, but even between members of the same sex. 

One of Charles Darwin's keenest insights was that there is always significant variation between individuals of any species, be they plants, bugs, birds, or humans. It is this individual variation that complicates the comparison of two mean values -- and why biostatistics is so important!

We can easily use Python's maximum and minimum methods to find the highest and lowest body temperature in our dataset. To find the maximum male body temperature, for example, we can use this code:

`btDF.Temperature[mask_male].max()`

And to find the minimum male body temperature we can use this code: 

`btDF.Temperature[mask_male].min()`

Using these maximum and minimum methods, you would find the highest male body temperature in this study was 101.3 <sup>o</sup>F and the lowest 96.2 <sup>o</sup>F. For female subjects, the highest body temperature recorded was 100.8 <sup>o</sup>F and the lowest 96.8 <sup>o</sup>F.

Given that there is considerable variation in body temperatures **_between_** subjects in this study, how can we decide if our finding that the mean female body temperature is 0.26 <sup>o</sup>F higher than the male value is worth sending a quick paper off to _Nature (Lond)_?

The key idea is to use biostatistics to find out if there is a **_significant difference_** between these two mean values that can't be explained by simple random variation between the individuals used in this study.

### Example 5: Compute the maximum and minimum body temperatures for males and females

To get some sense of how much individual variation was present in the body temperature data, we can use Python's _maximum_ and _minimum_ methods. The code in the cell below calculates the maximum and minimum body temperature values for both men and women and prints out the results.

In [None]:
# Example 5: Find the max and min body temperatures for men and women

# Use max and min methods with male data
maxMaleBT = btDF.Temperature[mask_male].max()
minMaleBT = btDF.Temperature[mask_male].min()

# Use max and min methods with female data
maxFemaleBT = btDF.Temperature[mask_female].max()
minFemaleBT = btDF.Temperature[mask_female].min()

# Print out results
print("The maximum male body temperature = ", maxMaleBT, "deg F.")
print("The minimum male body temperature = ", minMaleBT, "deg F.")
print("The maximum female body temperature = ", maxFemaleBT, "deg F.")
print("The minimum female body temperature = ", minFemaleBT, "deg F.")


These results are quite interesting. Even though the females in this study had, on average, a higher mean body temperature, the highest body temperature, 101.3 <sup>o</sup>F, was recorded in a male subject, not a female. 
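
A compact way to tabulate the same kind of summary is to combine `groupby` with `.agg()`. The sketch below is only an illustration on made-up numbers; `toy_df`, `toy_summary` and every data value are hypothetical, and the added `range` column is simply the maximum minus the minimum.

```python
import pandas as pd

# Hypothetical miniature dataset, invented for illustration
toy_df = pd.DataFrame({
    "Gender": ["M", "F", "F", "M", "F", "M"],
    "Temperature": [97.0, 98.4, 98.8, 99.5, 99.1, 96.5],
})

# One row per group, one column per summary statistic
toy_summary = toy_df.groupby("Gender")["Temperature"].agg(["min", "max"])

# The range (max minus min) is a rough measure of individual variation
toy_summary["range"] = toy_summary["max"] - toy_summary["min"]

print(toy_summary)
```

A wide range within each group, compared with a small difference between the group means, is exactly the situation in which a formal statistical test is needed.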

### **Exercise 5: Compute the maximum and minimum `forward_backward` sway for elderly and young subjects**

In the cell below, calculate the maximum and minimum `forward_backward` sway for both elderly and young subjects and print out the results.

In [None]:
# Insert your code for Exercise 5 here



If your code is correct you should see the following output:
~~~text
The maximum elderly forward_backward sway =  50 mm.
The minimum elderly forward_backward sway =  19 mm.
The maximum young forward_backward sway = 25 mm.
The minimum young forward_backward sway =  14 mm.
~~~

Here we note that the forward/backward sway distances are consistently larger for elderly subjects compared to young subjects. This makes sense since elderly people typically sway more than young adults.

## **Hypothesis Testing: Revisited**

The statistical tools of **_hypothesis testing_** have been developed to help investigators like biologists and clinicians decide if the differences that they observe in their investigations and experiments reflect **_real_** differences between groups (perhaps due to an experimental treatment) or simply reflect the inherent biological variation between individuals that can be seen in any living species. 

The two basic statistical tools for comparing numerical data like body temperature are (1) Student's _t_-Test and (2) the Analysis of Variance (ANOVA). We use the _t_-Test when we need to compare just two sets of numerical data, and we use ANOVA when we have three (or more) sets of numerical data.

### Example 6: Two-Sample _t_-Test Body Temperature, Two-Sided

We can use the Two-sample _t_-Test to test the hypothesis that normal male body temperature is different from the normal female body temperature. Specifically, we will be comparing the _mean_ male body temperature _μ<sub>Male</sub>_ with the _mean_ female body temperature _μ<sub>Female</sub>_.

In a **_Two-sided_** analysis we do not care whether any difference between the two means is in a particular _direction_, either greater or smaller. In this case, we can express the null hypothesis as _H<sub>0</sub>_ : _μ<sub>Male</sub>_ = _μ<sub>Female</sub>_ and the alternative hypothesis as _H<sub>A</sub>_ : _μ<sub>Male</sub>_ $\not=$ _μ<sub>Female</sub>_. In other words, the null hypothesis, _H<sub>0</sub>_, is that there is no significant difference between the mean body temperature of the male subjects (_μ<sub>Male</sub>_ = 98.20 <sup>o</sup>F) and the female subjects (_μ<sub>Female</sub>_ = 98.46 <sup>o</sup>F).

Since we don't care whether the mean male body temperature is higher or lower than the mean female body temperature, we add the argument `alternative = "two-sided"` to the **_Two-Sample_** _t_-Test function as shown in the code cell below.


In [None]:
# Example 6: Two-sample t-Test (two-sided)

from scipy import stats
from scipy.stats import ttest_ind

# Perform t-Test
t_statistic, p_value = ttest_ind(btDF.Temperature[mask_male],
                                 btDF.Temperature[mask_female],
                                 alternative="two-sided")

# Print results
print(f"T-statistic: {t_statistic}, P-value: {p_value}")

### **Interpretation of the _t_-Test Results, Two-Sided**

The output shows that the _p_-value = 0.18 when rounded to 2 decimal places. This is well above the conventional significance level of 0.05 (which corresponds to a 95% confidence level). This means that a difference at least as large as the one we observed would arise fairly often (about 18% of the time) from individual variation alone, even if the male and female population means were identical.

So given the data in this dataset, we **_fail to reject_** the null hypothesis, _H<sub>0</sub>_ : _μ<sub>Male</sub>_ = _μ<sub>Female</sub>_.

In short, this data does not provide sufficient evidence that body temperature is different between males and females.
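
To make the _t_-statistic less of a black box, the sketch below computes it by hand and compares it with `ttest_ind`. By default `ttest_ind` assumes equal variances in the two groups (`equal_var=True`) and uses the pooled formula

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}, \qquad s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

The arrays `group_a` and `group_b` are hypothetical values invented for this illustration; they are not the `btDF` data.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical samples, invented for illustration
group_a = np.array([97.0, 98.8, 96.2, 98.1, 99.2, 97.6])
group_b = np.array([98.4, 98.8, 97.9, 99.0, 98.6])

n_a, n_b = len(group_a), len(group_b)

# Pooled variance: weighted average of the two sample variances (ddof=1)
pooled_var = ((n_a - 1) * group_a.var(ddof=1) +
              (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2)

# t = difference in means divided by its standard error
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# scipy's default (equal_var=True) uses the same pooled formula
t_scipy, p_scipy = ttest_ind(group_a, group_b)

print(f"Manual t: {t_manual:.4f}, scipy t: {t_scipy:.4f}, p-value: {p_scipy:.4f}")
```

The formula shows why individual variation matters: the larger the variances within the groups, the smaller the _t_-statistic for the same difference in means.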

### **Exercise 6: Two-Sample _t_-Test `side_to_side` Sway, Two-sided**

In the cell below, write the Python code to perform a two-sided _t_-Test on the `side_to_side` sway of elderly subjects compared to young subjects.


In [None]:
# Insert your code for Exercise 6 here




If your code is correct you should see
~~~text
T-statistic: 1.8349136259468637, P-value: 0.08642680897586714
~~~

In the Markdown cell below, write a short statement whether you **_reject_** or **_fail to reject_** the null hypothesis that: _H<sub>0</sub>_ : _μ<sub>Elderly</sub>_ = _μ<sub>Young</sub>_ with regards to the amount of sway in the side_to_side plane.

------------------------
**Write your analysis of the t-Test results in the Markdown cell below**




### **Example 7: Two-Sample _t_-Test Body Temperature, One-Sided**

In Example 6 we compared male and female body temperatures using a **_Two-Sided_** _t_-Test. In a two-sided _t_-Test we don't care if the mean female body temperature was higher, or lower than the mean male body temperature. However, we know from **Example 4** that the mean female body temperature was 0.26 <sup>o</sup>F _higher_ than the mean male body temperature. 

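We can test the hypothesis that female body temperatures are higher than males using a **_One-Sided_** _t_-Test as shown in the code cell below. To do this we have to change the `alternative` argument to `"less"`. Because the male data is passed to the function first, `"less"` specifies the alternative hypothesis that the mean of the first sample (males) is less than the mean of the second sample (females).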

In [None]:
# Example 7: Two-sample t-Test ONE-SIDED

from scipy import stats
from scipy.stats import ttest_ind

# Perform t-Test
t_statistic, p_value = ttest_ind(btDF.Temperature[mask_male],
                                 btDF.Temperature[mask_female],
                                 alternative="less")

# Print results
print(f"T-statistic: {t_statistic}, P-value: {p_value}")

The output should be:
~~~text
T-statistic: -1.3583148568839252, P-value: 0.08874126222561389
~~~

### **Interpretation of the _t_-Test Results, One-Sided**

The output shows that the _p_-value = 0.09 when rounded to 2 decimal places. Changing the _t_-Test from _Two-Sided_ to _One-Sided_ cut the _p_-value in half, from _p_ = 0.18 to _p_ = 0.09. However, a value of _p_ = 0.09 is still greater than the conventional significance level of 0.05, so as before, we **_fail to reject_** the null hypothesis. Given the data, there is not sufficient evidence that female body temperature is higher than male body temperature.
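
The halving of the _p_-value is not a coincidence. The sketch below runs all three settings of `alternative` on made-up data; `group_a` and `group_b` are hypothetical arrays, chosen so that the first group has the lower mean.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical samples, invented for illustration (group_a has the lower mean)
group_a = np.array([97.0, 98.8, 96.2, 98.1, 99.2, 97.6])
group_b = np.array([98.4, 98.8, 97.9, 99.0, 98.6])

t_two, p_two = ttest_ind(group_a, group_b, alternative="two-sided")
t_less, p_less = ttest_ind(group_a, group_b, alternative="less")
t_greater, p_greater = ttest_ind(group_a, group_b, alternative="greater")

print(f"two-sided: {p_two:.4f}")
print(f"less:      {p_less:.4f}")
print(f"greater:   {p_greater:.4f}")
```

When the observed difference lies in the direction named by the alternative (here `"less"`), the one-sided _p_-value is half of the two-sided one. In the opposite direction (`"greater"`) it is one minus that half, so the direction should be chosen from the research question and not adjusted after looking at the result.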

### **Exercise 7: Two-Sample _t_-Test `side_to_side` Sway, One-sided**

In the cell below, write the Python code to perform a Two-Sample _t_-Test on the `side_to_side` sway of elderly subjects compared to young subjects. Since we might expect elderly persons to sway more than young ones, set the argument `alternative` to `"greater"` and print out your results.

In [None]:
# Insert your code for Exercise 7 here



If your code is correct you should see
~~~text
T-statistic: 1.8349136259468637, P-value: 0.04321340448793357
~~~
In the Markdown cell below, write a short statement whether you **_reject_** or **_fail to reject_** the null hypothesis that: _H<sub>0</sub>_ : _μ<sub>Elderly</sub>_ = _μ<sub>Young</sub>_ with regards to the amount of sway in the side to side plane.

------------------------
**Write your analysis of the t-Test results in the Markdown cell below**




# **Analysis of Variance (ANOVA)**

The **_Analysis of Variance_** (ANOVA) can be thought of as an extension to the _t_-Test. The independent _t_-Test is used to compare the means of a condition between 2 groups while the ANOVA is used when one wants to compare the means of a condition between three (or more) groups. 
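One way to see the connection to the _t_-Test is to run both tests on just two groups. The sketch below uses made-up numbers (the names `group_1` and `group_2` are hypothetical) and SciPy's `f_oneway` function. With two groups, the ANOVA _F_-statistic equals the square of the two-sided, equal-variance _t_-statistic, and the two _p_-values agree.

```python
from scipy.stats import f_oneway, ttest_ind

# Hypothetical measurements for two groups (illustration only)
group_1 = [4.1, 5.0, 4.7, 5.3, 4.9]
group_2 = [5.6, 6.1, 5.2, 6.4, 5.9]

# Two-sided t-Test with equal variances assumed (the SciPy default)
t_stat, t_p = ttest_ind(group_1, group_2)

# One-way ANOVA on the same two groups
f_stat, f_p = f_oneway(group_1, group_2)

print(f"t squared = {t_stat**2:.4f}, F = {f_stat:.4f}")
print(f"t-Test p  = {t_p:.4f}, ANOVA p = {f_p:.4f}")
```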

ANOVA is an _omnibus_ test, meaning it tests all of the data as a **whole**. In other words, ANOVA tests whether at least one group mean differs from the others, but it does not tell you which mean(s) is/are different. To find out which groups differ from one another, one has to conduct additional post-hoc tests. 

Although it can be thought of as an extension of the _t_-Test, mathematically speaking, ANOVA is more of a **_regression model_**, which we will cover in a future lesson. 

The testing hypothesis of an ANOVA is as follows:

* _H<sub>0</sub>_: No difference between means, i.e. _μ_<sub>1</sub> =  _μ_<sub>2</sub> =  _μ_<sub>3</sub> ...
* _H<sub>A</sub>_: A difference between means exists somewhere in the data.

## **ANOVA Assumptions**

There are 3 assumptions that need to be met for the results of an ANOVA test to be considered accurate and trustworthy. It’s important to note that the assumptions apply to the residuals and not the variables themselves. The ANOVA assumptions are the same as for linear regression and are:

* **_Normality_**
    + Caveat: if group sizes are equal, the F-statistic is robust to violations of normality
* **_Homogeneity of variance_**
    + Same caveat as above: if group sizes are equal, the F-statistic is robust to this violation
* **_Independent observations_**

If possible, it is best to have groups the same size so corrections to the data do not need to be made. However, with real world data, that is often not the case and one will have to make corrections to the data. 
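The first two assumptions can be checked with standard tests. The sketch below is one possible approach, not a required part of this assignment: it uses made-up data (the `groups` dictionary is hypothetical), computes the residuals as each observation minus its own group mean, and then applies SciPy's Shapiro-Wilk test for normality and Levene's test for equal variances.

```python
import numpy as np
from scipy.stats import shapiro, levene

# Hypothetical measurements for three groups (illustration only)
groups = {
    "a": np.array([3.1, 2.8, 3.6, 3.3, 2.9, 3.4]),
    "b": np.array([4.0, 4.4, 3.7, 4.9, 4.2, 4.6]),
    "c": np.array([5.1, 6.0, 4.8, 5.5, 6.3, 5.2]),
}

# Residuals: each observation minus the mean of its own group
residuals = np.concatenate([values - values.mean() for values in groups.values()])

# Shapiro-Wilk test: H0 is that the residuals are normally distributed
shapiro_stat, shapiro_p = shapiro(residuals)

# Levene test: H0 is that all groups have the same variance
levene_stat, levene_p = levene(*groups.values())

print(f"Shapiro-Wilk: W = {shapiro_stat:.3f}, p = {shapiro_p:.3f}")
print(f"Levene:       W = {levene_stat:.3f}, p = {levene_p:.3f}")
```

In both tests a small _p_-value (below 0.05) suggests the assumption is violated, while a large _p_-value means there is no evidence against it.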

In [None]:
# Load the Cushings data

import pandas as pd

# Read data file
cushDF = pd.read_csv("https://biologicslab.co/STA1403/data/Cushings.csv",
                     sep=',')  # define the separator as a comma

# Display the first five rows
cushDF.head()

If your code is correct you should see the following output:

![__](https://biologicslab.co/STA1403/images/A06/A06_Image04.png)



In [None]:
# Example: Extract tumor specific data 

# Use a Boolean mask on the 'Type' column to select the rows for each tumor type
a_TumorDat = cushDF.loc[cushDF['Type'] == 'a']
b_TumorDat = cushDF.loc[cushDF['Type'] == 'b']
c_TumorDat = cushDF.loc[cushDF['Type'] == 'c']
u_TumorDat = cushDF.loc[cushDF['Type'] == 'u']

# print out the contents of one 
a_TumorDat

Notice that `a_TumorDat` contains _both_ the urinary secretion rates for the metabolite Tetrahydrocortisone (`TCort`) and those for the metabolite Pregnanetriol (`PregN`). Therefore these 4 variables could be used to perform an ANOVA on either metabolite. 
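As a rough illustration of how groups extracted with Boolean masks can be fed into an ANOVA, the sketch below uses SciPy's `f_oneway` function on a tiny made-up dataframe. The dataframe `demoDF` and its values are hypothetical stand-ins that only borrow the `Type` and `TCort` column names; they are not the real Cushings data.

```python
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical stand-in for the Cushings data (values are made up)
demoDF = pd.DataFrame({
    "Type":  ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
    "TCort": [3.1, 3.0, 1.9, 8.3, 3.8, 3.9, 10.2, 9.2, 9.6],
})

# One Boolean mask per tumor type, keeping only the column of interest
tcort_a = demoDF.TCort[demoDF.Type == "a"]
tcort_b = demoDF.TCort[demoDF.Type == "b"]
tcort_c = demoDF.TCort[demoDF.Type == "c"]

# One-way ANOVA: each group is passed as a separate argument
f_stat, p_value = f_oneway(tcort_a, tcort_b, tcort_c)

print(f"F-statistic: {f_stat:.4f}, P-value: {p_value:.4f}")
```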

### Example 8: ANOVA of Tetrahydrocortisone secretion

In this example we will test the hypothesis that the urinary secretion of the metabolite Tetrahydrocortisone differs among patients with different types of adrenal cortical tumors. As always, the null hypothesis, _H<sub>0</sub>_, is that there is _nothing interesting going on_. In other words, the mean amount of Tetrahydrocortisone secreted in a 24-hr period is the same for all four tumor types. Since we are comparing 4 different tumor types, we need to use an Analysis of Variance (ANOVA).  

There are several statistical add-in packages for performing ANOVA in Python. In this example we will use the `statsmodels` package, which was imported at the start of this lesson:

`import statsmodels.api as sm` <br>
`from statsmodels.formula.api import ols`

The `statsmodels` package is convenient since it supports using data held in a `pandas` dataframe. This makes it easier to set up the ANOVA formula, since we don't need to separate the data into individual groups using "Boolean masks".

In [None]:
# Example 8: ANOVA of Tetrahydrocortisone with statsmodels 

import statsmodels.api as sm
from statsmodels.formula.api import ols

# Generate ANOVA model
TCort_model = ols('TCort ~ Type', data = cushDF).fit()

# Save results 
TCort_results = sm.stats.anova_lm(TCort_model)

# Print out the results
print(TCort_results)

If the code is correct, you should see the following table:
~~~text
            df       sum_sq     mean_sq         F    PR(>F)
Type       3.0   893.521000  297.840333  3.225739  0.041218
Residual  23.0  2123.645667   92.332420       NaN       NaN
~~~
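The columns of this table are linked by simple arithmetic: each `mean_sq` is `sum_sq` divided by `df`, `F` is the ratio of the two mean squares, and `PR(>F)` is the upper-tail probability of that `F` value. The sketch below recomputes these quantities from the numbers printed above, using SciPy's _F_ distribution (`scipy.stats.f`); the variable names are chosen only for this illustration.

```python
from scipy.stats import f

# Values copied from the ANOVA table in Example 8
df_type, ss_type = 3, 893.521000
df_resid, ss_resid = 23, 2123.645667

# Mean square = sum of squares / degrees of freedom
ms_type = ss_type / df_type
ms_resid = ss_resid / df_resid

# F = between-group mean square / residual mean square
f_stat = ms_type / ms_resid

# p-value = probability of an F at least this large under the null hypothesis
p_value = f.sf(f_stat, df_type, df_resid)

print(f"mean_sq (Type)     = {ms_type:.6f}")
print(f"mean_sq (Residual) = {ms_resid:.6f}")
print(f"F = {f_stat:.6f}, PR(>F) = {p_value:.6f}")
```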

### **Interpretation of the ANOVA Results from Example 8**

As before, we look at the _p_-value to decide whether we should **_reject_** or **_fail to reject_** the null hypothesis. We see that _p_ = 0.041, which is less than the cutoff value of 0.05. That means that if the null hypothesis were true, there would be less than a 5% chance of seeing differences among the group means at least as large as those observed. Therefore we should **_reject_** the null hypothesis.

In other words, the 4 tumor types do **not** appear to secrete the same levels of Tetrahydrocortisone. That's really all the ANOVA tells us. We would have to perform additional post-hoc tests to figure out which tumor (or tumors) secrete more or less of this metabolite.
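One commonly used post-hoc test is Tukey's HSD (honestly significant difference), which compares every pair of groups while adjusting for the number of comparisons. The sketch below is only an illustration: it assumes SciPy 1.8 or newer (for `scipy.stats.tukey_hsd`) and uses made-up secretion values for three hypothetical groups rather than the Cushings data.

```python
from scipy.stats import tukey_hsd

# Hypothetical secretion rates for three tumor types (values are made up)
type_a = [2.0, 2.5, 3.0, 2.2, 2.8]
type_b = [2.4, 3.1, 2.6, 2.9, 3.3]
type_c = [6.0, 6.5, 7.1, 6.2, 6.8]

# Compare every pair of groups; groups are numbered 0, 1, 2 in the order given
result = tukey_hsd(type_a, type_b, type_c)

# Print the table of pairwise differences and adjusted p-values
print(result)

# result.pvalue[i, j] holds the adjusted p-value for group i versus group j
print(f"a vs c adjusted p-value: {result.pvalue[0, 2]:.4f}")
```

Each pair with an adjusted _p_-value below 0.05 is a pair of groups whose means differ significantly.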

### **Exercise 8: ANOVA of Pregnanetriol secretion**

In the code cell below write the Python code to perform an Analysis of Variance (ANOVA) on the secretion of the metabolite Pregnanetriol (`PregN`) by tumor type. 

In [None]:
# Insert your code for Exercise 8 here




If your code is correct you should see the following output:
~~~text
            df      sum_sq    mean_sq         F    PR(>F)
Type       3.0   72.411467  24.137156  3.539263  0.030507
Residual  23.0  156.856000   6.819826       NaN       NaN
~~~


### **Interpretation of the ANOVA Results from Exercise 8**

In the blank cell below, write a short statement about the results of **Exercise 8**. Your statement should include the null hypothesis, the alternative hypothesis and the reason that you either **_reject_** the null hypothesis or **_fail to reject_** the null hypothesis. 

### **Lesson Turn-In**

When you have completed all of the exercises and run **every** cell in this Lesson, print out a PDF copy and upload it to Canvas. Your PDF should be called `Assignment_06_lastname.pdf` where _lastname_ is your last name.

### **References**

Mackowiak, P.A., Wasserman, S.S., Levine, M.M.: A critical appraisal of 98.6°F, the upper limit of the normal body temperature, and other legacies of Carl Reinhold August Wunderlich. JAMA 268, 1578–1580 (1992) 