# Hypothesis Testing of Human Height Data

In this lab, you will learn how to use R to perform and understand the basics of hypothesis testing. Hypothesis testing is widely used: any time you need to determine whether a parameter or relationship is statistically significant, you can perform a hypothesis test.

In this lab you will explore and perform hypothesis tests on a famous data set collected by Francis Galton, who invented the regression method. Galton collected these data from families living in late 19th century London. Galton published his famous paper in 1885, showing that the heights of adult children regressed to the mean of the population, regardless of the heights of the parents. From this seminal study we have the term regression in statistics.

## Exercise 1. Explore the data

In this first exercise you will load the Galton data set. You will then explore differences between some of the variables in these data using some simple visualization techniques.

****
**Note:** Data visualization is covered in subsequent modules of this course.

### Load and examine the data set

Execute the code in the cell below to load the Galton data set. 

In [None]:
library("AzureML")
ws <- workspace()
galton <- download.datasets(ws, "GaltonFamilies.csv")
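
The cell above needs access to an Azure ML workspace. If that is not available, one possible alternative is the **HistData** package from CRAN, which ships a `GaltonFamilies` data frame. This is an assumption and not part of the lab: the packaged data appear to hold the same 934 cases, but without the case number column. The helper name `load.galton.alt` is hypothetical, and the sketch returns `NULL` when the package is not installed.

```r
## Hypothetical fallback loader; assumes the HistData package may be installed.
load.galton.alt <- function() {
  if (!requireNamespace("HistData", quietly = TRUE)) {
    return(NULL)  # package not available, nothing to load
  }
  env <- new.env()
  ## Load the packaged data frame into a private environment
  utils::data("GaltonFamilies", package = "HistData", envir = env)
  env$GaltonFamilies
}

galton.alt <- load.galton.alt()
if (!is.null(galton.alt)) print(dim(galton.alt))
```

If you use this copy, assign it to `galton` so that the remaining cells run unchanged.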

With the data loaded, you can examine the first few rows by executing the code in the cell below:

In [None]:
head(galton)

This data set has 9 features:
 1. A case or row number.
 2. A unique code for each family in the sample.
 3. The height of the father in inches.
 4. The height of the mother in inches.
 5. The average height of the parents.
 6. The number of children in the family.
 7. A code for each unique child in the family.
 8. The gender of the child.
 9. The height of the adult child in inches. 
 
Execute the code in the cell below to determine the number of cases in this data set.

In [None]:
dim(galton)

There are a total of 934 cases, or children, in the sample comprising this data set.

### Visualizing some relationships in these data

To develop a better understanding of some of the relationships in these data you will create and compare some histograms of some of the variables. 

The code in the cell below creates a pair of histograms to compare the distributions of two variables. The histograms are plotted on the same horizontal scale to aid in comparison. A red line is plotted at the mean value of each variable.

Execute the code in the cell below to plot a pair of histograms comparing the height of mothers to the height of their sons. You can safely ignore any warnings about building a font cache or about position_stack.

In [None]:
options(repos = c(CRAN = "http://cran.rstudio.com"))
install.packages('gridExtra')
## Plot a histogram of column col with bin width bw on the x range [min, max]
hist.plot = function(df, col, bw, max, min){
    ggplot(df, aes_string(col)) +
      geom_histogram( binwidth = bw) + 
      xlim(min, max)
}
    
hist.family = function(df, col1, col2, num.bin = 30){
  require(ggplot2)
  require(gridExtra)
  
  ## Compute bin width
  max = max(c(df[, col1], df[, col2]))
  min = min(c(df[, col1], df[, col2]))  
  bin.width = (max - min)/num.bin
  
  ## Create a first histogram
  p1 = hist.plot(df, col1, bin.width, max, min)
  p1 = p1 + geom_vline(xintercept = mean(df[, col1]),
                        color = 'red', size = 1)
  
  ## Create a second histogram
  p2 = hist.plot(df, col2, bin.width, max, min)
  p2 = p2 + geom_vline(xintercept = mean(df[, col2]),
                        color = 'red', size = 1)
  
  ## Now stack the plots
  grid.arrange(p1, p2, nrow = 2, ncol = 1)
}

sons = galton[galton$gender == 'male', ]
hist.family(sons, 'childHeight', 'mother')

Examine these histograms and note the following:

- The distributions of the heights of the mothers and their sons have a fair degree of overlap.
- The mean height of the sons is noticeably greater than that of the mothers.

Next you will compare the heights of mothers to the heights of their daughters. 

In [None]:
daughters = galton[galton$gender == 'female', ]
hist.family(daughters, 'childHeight', 'mother')

Examine these histograms and note the following:

- The distributions of the heights of the mothers and their daughters overlap almost entirely.
- The mean height of the daughters is nearly the same as that of the mothers.

In summary, it appears that sons are usually taller than their mothers, whereas the height of daughters does not appear to be much different from that of their mothers. But how valid is this conclusion statistically?
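
The visual comparison can also be backed by computing the difference in means directly. The following is a minimal sketch on a made-up data frame that only borrows the column names used in this lab. The rows are illustrative and are not values from the Galton data, and the helper name `mean.diff` is hypothetical.

```r
## Made-up rows for illustration only; not values from the Galton data
toy <- data.frame(
  mother      = c(64.0, 64.0, 62.5, 66.0, 63.5, 65.0),
  childHeight = c(69.5, 63.0, 70.0, 65.5, 68.0, 64.0),
  gender      = c("male", "female", "male", "female", "male", "female")
)

## Mean difference (child minus mother) for the children of one gender
mean.diff <- function(df, g) {
  rows <- df[df$gender == g, ]
  mean(rows$childHeight) - mean(rows$mother)
}

print(mean.diff(toy, "male"))
print(mean.diff(toy, "female"))
```

With the real data loaded, the same helper could be called as `mean.diff(galton, "male")`.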

## Apply a t-test

Now that you have examined some of the relationships between the variables in these data, you will apply formal hypothesis testing. In hypothesis testing, a null hypothesis is evaluated using a test statistic computed from the data. Here the null hypothesis is simply that the difference is not significant. Depending on the value of the test statistic, you either reject or fail to reject the null hypothesis.

In this case, you will use the two-sided t-test to determine if the means of two variables are significantly different. The null hypothesis is that there is no significant difference between the means. There are multiple criteria which are used to interpret the test results. You will determine if you can reject the null hypothesis based on the following criteria:

- Select a **significance level** of **5%**, or **0.05**. This corresponds to a 95% confidence level.
- Determine if the absolute value of the t-statistic exceeds the **critical value** for the degrees of freedom. The standardized difference in means of Normally distributed variables follows a t-distribution. A large t-statistic indicates that the difference in means is unlikely to have arisen by chance alone.
- Determine if the P-value is less than the **significance level**. A small P-value indicates that a difference in means at least this extreme would be unlikely to arise by chance alone if the null hypothesis were true.
- Determine if the **confidence interval** around the difference of the means excludes **0**. If the **confidence interval** does not contain **0**, a difference in means of **0** is not plausible at the chosen level.

Based on these criteria you will either reject or fail to reject the null hypothesis. However, rejecting the null hypothesis should not be confused with accepting the alternative. It simply means the data are not consistent with the null hypothesis.
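
The criteria above can be sketched in a few lines of base R. This is a minimal illustration on simulated heights, not on the Galton data. The sample sizes, means, and standard deviations are assumptions chosen only to produce a clear difference, and the variable names are hypothetical.

```r
## Simulated heights in inches; illustrative values, not Galton's data
set.seed(42)
x <- rnorm(200, mean = 64, sd = 2.3)
y <- rnorm(200, mean = 69, sd = 2.6)

alpha <- 0.05            # significance level
res <- t.test(x, y)      # two-sided Welch t-test, the t.test default

## Welch t-statistic computed by hand, for comparison with t.test
se <- sqrt(var(x) / length(x) + var(y) / length(y))
t.by.hand <- (mean(x) - mean(y)) / se

## Critical value of the t-distribution for a two-sided test
crit <- qt(1 - alpha / 2, df = unname(res$parameter))

## The three criteria; each is TRUE when the null hypothesis is rejected
reject.by.t  <- unname(abs(res$statistic) > crit)
reject.by.p  <- res$p.value < alpha
reject.by.ci <- res$conf.int[1] > 0 || res$conf.int[2] < 0

print(c(t = unname(res$statistic), by.hand = t.by.hand, critical = crit))
print(c(by.t = reject.by.t, by.p = reject.by.p, by.ci = reject.by.ci))
```

For a two-sided t-test at a single significance level, the three criteria are equivalent, so they always lead to the same decision.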

The **families.test** function in the cell below wraps the base R **t.test** function to compute the two-sided t-statistic. The **hist.family.conf** function calls **t.test** directly, with **paired = FALSE** by default, then plots and prints the results. Execute this code to compute and display the results.

In [None]:
families.test <- function(df, col1, col2, paired = TRUE){
  t.test(df[, col1], df[, col2], paired = paired)
}

hist.family.conf <- function(df, col1, col2, num.bin = 30, paired = FALSE){
  require(ggplot2)
  require(gridExtra)
  
  ## Compute bin width
  max = max(c(df[, col1], df[, col2]))
  min = min(c(df[, col1], df[, col2]))  
  bin.width = (max - min)/num.bin
  
  mean1 <- mean(df[, col1])
  mean2 <- mean(df[, col2])
  t <- t.test(df[, col1], df[, col2], paired = paired)
  pv1 <- mean2 + t$conf.int[1]
  pv2 <- mean2 + t$conf.int[2]
  
  ## Plot a histogram
  p1 <- hist.plot(df, col1, bin.width, max, min)
  p1 <- p1 + geom_vline(xintercept = mean1,
                        color = 'red', size = 1) + 
             geom_vline(xintercept = pv1,
                        color = 'red', size = 1, linetype = 2)  + 
             geom_vline(xintercept = pv2,
                        color = 'red', size = 1, linetype =2) 
  
  ## Plot a second histogram
  p2 <-  hist.plot(df, col2, bin.width, max, min)
  p2 <- p2 + geom_vline(xintercept = mean2,
                        color = 'red', size = 1.5)
  
  ## Now stack the plots
  grid.arrange(p1, p2, nrow = 2)
  
  print(t)
}

hist.family.conf(sons, 'mother', 'childHeight')

Examine the printed table of results and the charts, noting the following:

- The difference of the means is 5.2 inches. You can see this difference graphically by comparing the positions of the solid red lines showing the means of the two distributions.
- The **critical value** of the two-sided t-statistic at 945 degrees of freedom is **1.96**. The t-statistic of -32.5565 is larger in magnitude than this **critical value**.
- The P-value is effectively 0, which is smaller than the **significance level** of 0.05.
- The 95% **confidence interval** of the difference in means is from -5.5 to -4.9, which does not include 0. You can see the confidence interval plotted as the two dashed red lines in the upper chart shown above. This **confidence interval**, plotted around the mean of the mothers' heights, does not overlap the mean of the sons' heights.

Overall, these statistics indicate you can reject the null hypothesis; that is, the difference in the means is unlikely to be **0**.
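
The critical value and P-value quoted above can be checked with the base R functions **qt** and **pt**. This is a minimal sketch that takes the reported degrees of freedom and t-statistic as given. The variable names are hypothetical.

```r
alpha  <- 0.05        # significance level
dof    <- 945         # degrees of freedom reported above
t.stat <- -32.5565    # t-statistic reported above

## Two-sided critical value of the t-distribution
crit <- qt(1 - alpha / 2, df = dof)

## Two-sided P-value for the reported t-statistic
p.value <- 2 * pt(-abs(t.stat), df = dof)

print(round(crit, 2))
print(abs(t.stat) > crit)   # TRUE means reject by the critical value
print(p.value < alpha)      # TRUE means reject by the P-value
```

Note that the comparison uses the absolute value of the t-statistic, since the test is two-sided.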

In [None]:
hist.family.conf(daughters, 'mother', 'childHeight')

Examine the printed table of results, which are quite different from the test of the heights of mothers vs. sons. Examine the statistics and charts noting the following:

- The difference of the means is only 0.04 inches. You can see this small difference graphically by comparing the positions of the solid red lines showing the means of the two distributions.
- The **critical value** of the two-sided t-statistic at 902 degrees of freedom is **1.96**. The t-statistic of 0.35 is smaller than this **critical value**.
- The P-value is 0.77, which is larger than the **significance level** of 0.05.
- The 95% **confidence interval** of the difference is from -0.26 to 0.35, which includes 0. You can see the confidence interval plotted as the two dashed red lines in the upper chart shown above. This **confidence interval**, plotted around the mean of the mothers' heights, overlaps the mean of the daughters' heights.

Overall, these statistics indicate you cannot reject the null hypothesis that there is no significant difference in the means.

**Evaluation question**

You have found that you could not reject the null hypothesis that there was no significant difference between the heights of mothers and their adult daughters. But what about the difference in height between fathers and their adult daughters? Perform the t-test on the Galton data set to answer the question below:

- Can you reject the null hypothesis that there is no significant difference in the heights of fathers and their adult daughters?