# What is the True Normal Human Body Temperature? 

#### Background

The mean normal body temperature was held to be 37$^{\circ}$C or 98.6$^{\circ}$F for more than 120 years since it was first conceptualized and reported by Carl Wunderlich in a famous 1868 book. But, is this value statistically correct?

<div class="span5 alert alert-info">
<h3>Exercises</h3>

<p>In this exercise, you will analyze a dataset of human body temperatures and employ the concepts of hypothesis testing, confidence intervals, and statistical significance.</p>

<p>Answer the following questions <b>in this notebook below and submit to your Github account</b>.</p> 

<ol>
<li>  Is the distribution of body temperatures normal? 
    <ul>
    <li> Although this is not a requirement for CLT to hold (read CLT carefully), it gives us some peace of mind that the population may also be normally distributed if we assume that this sample is representative of the population.
    </ul>
<li>  Is the sample size large? Are the observations independent?
    <ul>
    <li> Remember that this is a condition for the CLT, and hence the statistical tests we are using, to apply.
    </ul>
<li>  Is the true population mean really 98.6 degrees F?
    <ul>
    <li> Would you use a one-sample or two-sample test? Why?
    <li> In this situation, is it appropriate to use the $t$ or $z$ statistic? 
    <li> Now try using the other test. How is the result different? Why?
    </ul>
<li>  At what temperature should we consider someone's temperature to be "abnormal"?
    <ul>
    <li> Start by computing the margin of error and confidence interval.
    </ul>
<li>  Is there a significant difference between males and females in normal temperature?
    <ul>
    <li> What test did you use and why?
    <li> Write a story with your conclusion in the context of the original problem.
    </ul>
</ol>

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources

+ Information and data sources: http://www.amstat.org/publications/jse/datasets/normtemp.txt, http://www.amstat.org/publications/jse/jse_data_archive.htm
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

****
</div>

In [43]:
import pandas as pd
import numpy as np
import math

df = pd.read_csv('data/human_body_temperature.csv')
len(df)  # number of observations

130

In [54]:
print (df.describe())
df.head()

       temperature  heart_rate
count   130.000000  130.000000
mean     98.249231   73.761538
std       0.733183    7.062077
min      96.300000   57.000000
25%      97.800000   69.000000
50%      98.300000   74.000000
75%      98.700000   79.000000
max     100.800000   89.000000


Unnamed: 0,temperature,gender,heart_rate
0,99.3,F,68.0
1,98.4,F,81.0
2,97.8,M,73.0
3,99.2,F,66.0
4,98.0,F,73.0


#### 1) Is the distribution of body temperatures normal?

In [3]:
mean_BTpopulation = 98.6
sample_size = 130
mean_BTsample = 98.249231
std_BTsample =0.733183
print ('median_BTsample = %0.4f' % df['temperature'].median())
print ('median_heartrate = %0.4f' % df['heart_rate'].median())

median_BTsample = 98.3000
median_heartrate = 74.0000


In [7]:
import scipy
from scipy import stats  # provides stats.shapiro for the normality test below

In [46]:
# testing the null hypothesis that the body temperature was sampled from population that has a normal distribution. 
# Let us assume that our significance level is 0.05.

temp = df['temperature']

shapiro_norm = scipy.stats.shapiro(temp)

matrix_nd = pd.DataFrame.from_dict({'Test Statistic':shapiro_norm[0], 'p-value':shapiro_norm[1]}, orient='index')
matrix_nd

Unnamed: 0,0
Test Statistic,0.986577
p-value,0.233168


Since the p-value (0.23) is higher than the significance level of 0.05, we fail to reject the null hypothesis that the body temperatures were sampled from a normally distributed population. The data are consistent with a normal distribution.

#### 2) Is the sample size large? Are the observations independent?

A) The mean and median of the sample are very close, which suggests the data are roughly symmetric. There is no evidence that the population is not normally distributed [see the above results]. Under the usual rule of thumb for the CLT, a sample size of at least 30 is considered sufficient for a distribution like this one. As we have a sample size of 130, the sample is large enough.

B) Are the observations independent? One of the conditions for the CLT is randomization. Was the sample randomly selected, and was it drawn with or without replacement? I believe the sample was drawn without replacement, as is usually the case, and 130 people is far less than 10% of the population [the 10% condition], so the observations can be treated as independent.

#### 3) Is the true population mean really 98.6 degrees F?

This calls for a one-sample test, because the question is whether the population mean differs significantly from a single reference value (in this case, 98.6). Strictly, the $t$-statistic applies when the population standard deviation is unknown, as it is here, but with a sample size well above 30 the Z-statistic is a close approximation, so it is used below. Let us test the hypothesis that the true population mean is 98.6.

Assume the null hypothesis $H_0: \mu = 98.6$ and the alternative hypothesis $H_a: \mu \neq 98.6$, where $\mu$ is the population mean.

mean_BTpopulation = 98.6

sample_size = 130

mean_BTsample = 98.249231

std_BTsample =0.733183

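$$Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$

where $\bar{x}$ is the sample mean, $\mu_0$ is the hypothesized population mean, $\sigma$ is the population standard deviation, and $n$ is the sample size.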

As our sample size is greater than 30, we can substitute the sample standard deviation for the population standard deviation. Let us take a confidence level of 95%. For a two-sided interval, we look up the z-score at the 97.5th percentile in the Z-table, which is 1.96.

In [48]:
# Estimating the population mean using the Z-score

# -1.96 < Z < 1.96; rearranging the formula above gives the bounds on the mean

population_mean_LB = 98.249231 - ((1.96)*(0.733183/math.sqrt(130)))
population_mean_UB = 98.249231 + ((1.96)*(0.733183/math.sqrt(130)))

print('The estimated lower value of the mean body temperature of the population = %0.2f' % population_mean_LB)
print('The estimated upper value of the mean body temperature of the population = %0.2f' % population_mean_UB)

The estimated lower value of the mean body temperature of the population = 98.12
The estimated upper value of the mean body temperature of the population = 98.38


Based on the above result, we are 95% confident that the true population mean body temperature lies between 98.12 and 98.38 degrees F, an interval that excludes the given value of 98.60 degrees F. Therefore, we reject the null hypothesis in favor of the alternative.
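The interval-based decision can be cross-checked by computing the test statistic and its p-value directly. The following is a minimal sketch that uses only the summary statistics reported earlier and Python's standard-library `statistics.NormalDist`; the variable names (`mean_null`, `std_error`, `z_stat`, `z_critical`) are introduced here for illustration and do not appear in the original cells.

```python
import math
from statistics import NormalDist

# Summary statistics reported earlier in the notebook
mean_sample = 98.249231
std_sample = 0.733183
n = 130
mean_null = 98.6  # population mean under the null hypothesis

# Standard error of the sample mean and the resulting z statistic
std_error = std_sample / math.sqrt(n)
z_stat = (mean_sample - mean_null) / std_error

# Two-sided p-value from the standard normal distribution
p_value = 2 * NormalDist().cdf(-abs(z_stat))

# Critical value for a two-sided test at the 5% significance level
z_critical = NormalDist().inv_cdf(0.975)

print('z statistic = %0.4f' % z_stat)
print('two-sided p-value = %0.2e' % p_value)
print('critical value = %0.4f' % z_critical)
```

If the absolute value of the z statistic exceeds the critical value, the test rejects the null hypothesis, which should agree with whether 98.6 falls outside the 95% confidence interval.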

#### 4) At what temperature should we consider someone's temperature to be "abnormal"?

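A confidence level of 95% was used, and the Z-score for that level is 1.96. The margin of error is the Z-score multiplied by the standard error of the mean, that is, the standard deviation of the sampling distribution of the sample mean. From the results below, the 95% confidence interval for the mean body temperature is 98.1232 to 98.3753 degrees F, which is taken here as the normal range. Note that this interval describes the population mean, not the spread of individual readings.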

In [104]:
margin_error = 1.96 * (0.733183/math.sqrt(130))
print('The margin of error (E) = %0.4f' % margin_error)
print('Lower bound of the population mean = %0.4f' % (mean_BTsample - margin_error))
print('Upper bound of the population mean = %0.4f' % (mean_BTsample + margin_error))

The margin of error (E) = 0.1260
Lower bound of the population mean = 98.1232
Upper bound of the population mean = 98.3753
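The bounds above describe where the population mean is likely to lie. Another possible reading of "abnormal" concerns a single person's reading, in which case the spread of individual temperatures matters rather than the standard error. The sketch below is one way to approximate that range, assuming individual temperatures are roughly normally distributed; `individual_lower` and `individual_upper` are names introduced here for illustration.

```python
from statistics import NormalDist

mean_sample = 98.249231   # sample mean from df.describe()
std_sample = 0.733183     # sample standard deviation from df.describe()

# z multiplier that leaves 2.5% in each tail of the standard normal
z_critical = NormalDist().inv_cdf(0.975)

# Range expected to cover about 95% of individual readings,
# ASSUMING temperatures are approximately normally distributed
individual_lower = mean_sample - z_critical * std_sample
individual_upper = mean_sample + z_critical * std_sample

print('Approximate 95%% range for an individual reading: %0.2f to %0.2f'
      % (individual_lower, individual_upper))
```

This range is much wider than the confidence interval for the mean because it uses the sample standard deviation itself, not the standard deviation divided by the square root of the sample size.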


#### 5) Is there a significant difference between males and females in normal temperature?

To answer this question, I use the independent two-sample Z-test, because we are examining two groups that have no natural one-to-one pairing. Let us run the hypothesis test to determine whether normal body temperature differs between males and females.
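For reference, the usual form of the two-sample Z statistic for independent groups is sketched below. The symbols are introduced here for illustration: $\bar{x}_F$ and $\bar{x}_M$ are the female and male sample means, $s_F$ and $s_M$ the sample standard deviations, and $n_F$ and $n_M$ the group sizes. Under the null hypothesis the difference in population means is 0.

$$Z = \frac{(\bar{x}_F - \bar{x}_M) - 0}{\sqrt{\dfrac{s_F^2}{n_F} + \dfrac{s_M^2}{n_M}}}$$

The denominator is the standard error of the difference: the variances of the two sample means add, so the standard errors combine under the square root and are not summed directly.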

In [53]:
# Assume null hypothesis, Ho: mean_pop_male = mean_pop_female
# alternative hypothesis, Ha: mean_pop_male <> mean_pop_female

# mean_pop_male

male_temp = df[df['gender'] == 'M']
female_temp = df[df['gender'] == 'F']

print(male_temp.describe())
print(female_temp.describe())

       temperature  heart_rate
count    65.000000   65.000000
mean     98.104615   73.369231
std       0.698756    5.875184
min      96.300000   58.000000
25%      97.600000   70.000000
50%      98.100000   73.000000
75%      98.600000   78.000000
max      99.500000   86.000000
       temperature  heart_rate
count    65.000000   65.000000
mean     98.393846   74.153846
std       0.743488    8.105227
min      96.400000   57.000000
25%      98.000000   68.000000
50%      98.400000   76.000000
75%      98.800000   80.000000
max     100.800000   89.000000


In [102]:
# There are 65 male and 65 female observations.
# Significance level: 5%

std_female = 0.743488
std_male = 0.698756
no_diff = 0 # Assuming mean_male = mean_female

# Sample mean of male and female is 98.104615 and 98.393846 respectively

mean_diff = 98.393846-98.104615


# Estimate Z-score; the standard error of a difference of independent means
# is the square root of the sum of the squared standard errors
std_error_diff = math.sqrt((std_female**2/65) + (std_male**2/65))
Z_score = (mean_diff - no_diff)/std_error_diff
print('Estimated Z-score = %0.4f' % Z_score)

# Two-sided P-value at the estimated Z-score
P_value_Zscore = 2 * stats.norm.sf(abs(Z_score))
print ('P-value = %0.4f' % P_value_Zscore)

# Critical Z-value at our significance level of 5 %;

Z_criticalValue = 1.96

print ('Z_critical value = %0.4f ' % Z_criticalValue)


Estimated Z-score = 2.2854
P-value = 0.0223
Z_critical value = 1.9600 


We reject the null hypothesis, as the calculated Z-score (2.2854) is greater than the critical Z-value (1.96) at the 0.05 significance level [P-value = 0.0223, which is below the 0.05 threshold]. Therefore, there is evidence of a significant difference between males and females in normal temperature: the female sample mean (98.39) is about 0.29 degrees F higher than the male sample mean (98.10).

After conducting the above analysis, can we still say that the value of 98.6 degrees F is statistically correct? The sample size is reasonably large and the distribution is roughly symmetric. The sample is also balanced by gender. The 95% confidence interval for the population mean is 98.12 to 98.38 degrees F, which excludes the value of 98.60 degrees F. The mean body temperatures of males and females differ significantly at the 0.05 level, by about 0.29 degrees F. The difference between the estimated mean body temperature and the standard value is small in absolute terms, but statistically speaking the 98.6 value is not supported by this sample, as it lies outside the range of plausible values for the mean. However, what would the result be if more samples were collected? A larger sample would narrow the interval, and the conclusion could change.