# FDA Project Dec 2018

by Colm Doherty

*Summarise the history of the box plot and situations in which it is used

*Demonstrate the use of the box plot using data of your choosing. EXPAND

*Explain any relevant terminology such as the terms quartile and percentile. EXPAND

*Compare the box plot to alternatives.

## History of the box plot

The "box plot", or more correctly the "box-and-whisker plot", is a histogram-like method of displaying groups of numerical data through their quartiles<sup>[1]</sup>. It was invented in 1970 by John Wilder Tukey<sup>[1a]</sup>, an American mathematician best known for developing the FFT algorithm and the box plot. The Tukey range test, the Tukey lambda distribution, the Tukey test of additivity, and the Teichmüller–Tukey lemma all bear his name. He is also credited with coining the term 'bit'<sup>[2]</sup>.

![alt text](Boxplot Cholesterol.png "Boxplot Cholesterol")

In his 1995 book "The Same AND Not the Same"<sup>[3]</sup>, Roald Hoffmann, a theoretical chemist who won the 1981 Nobel Prize in Chemistry<sup>[4]</sup>, said this about data: 
"They are different, but not different enough to matter - like the maple leaves off the tree in my yard, when all I want to do is rake them up."<sup>[5]</sup> 

"The Same AND Not the Same" is a short, accurate description of almost any set of data...a pile of maple leaves for example. Maple leaves have approximately the same size, but with some variation. Descriptive statistics are an attempt to use numbers to describe how data are the same and not the same. The box plot is a standardized way of displaying the distribution of data based on the five number summary: minimum, first quartile, median, third quartile, and maximum. In the simplest box plot the central rectangle spans the first quartile to the third quartile (the interquartile range or IQR). A segment inside the rectangle shows the median and "whiskers" above and below the box show the locations of the minimum and maximum.

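The five-number summary described above can be computed directly. The following is a minimal sketch, assuming NumPy's default (linear interpolation) definition of quartiles; the `leaf_widths` values and the `five_number_summary` helper are made up for illustration.

```python
import numpy as np

# Hypothetical leaf widths in cm (made-up values, for illustration only)
leaf_widths = np.array([4, 7, 8, 9, 10, 12, 15])


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a 1-D sequence."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return (float(np.min(values)), float(q1), float(median),
            float(q3), float(np.max(values)))


minimum, q1, median, q3, maximum = five_number_summary(leaf_widths)
iqr = q3 - q1  # the height of the box in the plot
print(minimum, q1, median, q3, maximum, iqr)
```

Other software may use a slightly different quartile convention, so the box edges can differ a little between tools for small samples.
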
## Situations in which the box plot is used

When comparing groups of data, histograms are helpful for visualizing the distribution of variables. But if you need to 'drill down' for more information, box plots are very useful. Perhaps we want a clearer view of the spread? Perhaps the median is quite different from the mean, suggesting skew or outliers? What if there is some skew and many of the values are concentrated to one side?

That’s where box plots come in. Box plots help to answer all of these questions. The bottom and top of the solid-lined box are always the first and third quartiles (i.e. the 25th and 75th percentiles of the data), and the band inside the box is always the second quartile (the median). The whiskers (i.e. the dashed lines with the bars on the end) extend from the box to show the range of the data. (https://towardsdatascience.com/5-quick-and-easy-data-visualizations-in-python-with-code-a2284bae952f)

The box plot is particularly suitable for comparing range and distribution for groups of numerical data. It is ideal for comparing distributions because the centre, spread and overall range are immediately apparent.

While it does not show a distribution in as much detail as a histogram does, it is useful for indicating whether a distribution is skewed and whether there are potentially unusual observations (outliers) in the data set. Box plots are also very useful when large numbers of observations are involved and when two or more data sets are being compared.<sup>[6]</sup>
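
How a box plot flags "potentially unusual observations" depends on the convention used. A common one, attributed to Tukey and used by default in matplotlib, treats any point more than 1.5 × IQR beyond the box as an outlier and ends the whiskers at the most extreme points inside those limits. The following is a rough sketch of that rule, using made-up values:

```python
import numpy as np

# Hypothetical observations (made-up values); 40 is deliberately extreme
values = np.array([4, 7, 8, 9, 10, 12, 15, 40])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Limits beyond which a point is drawn individually as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
inside = values[(values >= lower_fence) & (values <= upper_fence)]

# Whiskers stop at the most extreme observations that are not outliers
whisker_low, whisker_high = inside.min(), inside.max()
print(outliers, whisker_low, whisker_high)
```

With this convention the whiskers reach the sample minimum and maximum only when no observation lies outside the limits.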

ADVANTAGES: The box plot organizes large amounts of data and makes outlier values visible.

DISADVANTAGES: It is not suited to detailed analysis of data, as it shows only a summary of the data distribution.<sup>[7]</sup>

### Interpreting a box plot

![alt text](boxplotimage.png "Boxplot definitions")


To interpret a boxplot:

Step 1: Assess the key characteristics. <sup>[8]</sup>
Examine the center and spread of the distribution. Assess how the sample size may affect the appearance of the box plot.

Step 2: Look for indicators of nonnormal or unusual data.
Skewed data indicate that data may be nonnormal. Outliers may indicate other conditions in your data.

Step 3: Assess and compare groups.
If your box plot has groups, assess and compare the center and spread of the groups.
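
The three steps above can also be followed numerically. This is only a sketch: the `groups` data and the `summarise` helper are hypothetical, and the tail lengths are a rough stand-in for what the eye picks up from the whiskers and any outlier points.

```python
import numpy as np

# Hypothetical measurements for two groups (made-up values, for illustration only)
groups = {
    "group_a": [1, 2, 3, 4, 5],
    "group_b": [2, 4, 6, 8, 30],
}


def summarise(values):
    """Return the numbers a reader would take from one box in the plot."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return {
        "n": len(values),                       # Step 1: sample size
        "median": float(median),                # Step 1: centre
        "iqr": float(q3 - q1),                  # Step 1: spread
        "lower_tail": float(q1 - min(values)),  # Step 2: reach below the box
        "upper_tail": float(max(values) - q3),  # Step 2: reach above the box
    }


# Step 3: compare the groups side by side
summaries = {name: summarise(values) for name, values in groups.items()}
for name, summary in summaries.items():
    print(name, summary)
```

Here `group_b` has an upper tail far longer than its lower tail, which is the kind of asymmetry that suggests skew or an unusual observation.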

## Example of using the Box plot with a chosen dataset

### Import Libraries

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
%matplotlib inline
```

### About the Dataset - Infants with SIRDS

The following example<sup>[9]</sup> from the Open University relates to birth weights of infants exhibiting severe idiopathic respiratory distress syndrome (SIRDS), and the question ‘Is it possible to relate the chances of eventual survival to birth weight?’ The data consists of the recorded birth weights of infants (in kgs) who displayed the syndrome, divided into two groups - those who died, and those who survived.


### Import Dataset

```python
# import the dataset and describe it
df = pd.read_csv("Birthweights.csv")
df.describe()
```

### Box Plot of the data

```python
# using the pandas box plot function
df.plot.box()
```

We know from the data description that the mean birth weight of the infants who survived is considerably higher than the mean birth weight of the infants who died. The standard deviation of the birth weights of the infants who survived is also higher. 

For the birth weights (in kg) of the infants who survived, the lower quartile, median and upper quartile are 1.74, 2.30 and 2.76. For infants who died, the corresponding quartiles are 1.24, 1.69 and 2.07. Boxplots of the two data sets clearly depict these values, along with the adjacent sample maxima and minima, so that the whiskers extend to the ends of the sample range. 
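
Quartiles like those quoted above can be read off with pandas rather than from the plot. The sketch below uses a small made-up table in place of `Birthweights.csv`; the column names `died` and `survived` and all the values are assumptions for illustration, not the real data.

```python
import pandas as pd

# Hypothetical birth weights in kg: made-up values standing in for Birthweights.csv
weights = pd.DataFrame({
    "died": [1.0, 1.2, 1.6, 2.0, 2.4],
    "survived": [1.5, 2.0, 2.5, 3.0, 3.5],
})

# Rows are the lower quartile, median and upper quartile; columns are the groups
quartiles = weights.quantile([0.25, 0.5, 0.75])

# Interquartile range per group: the height of each box
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]

print(quartiles)
print(iqr)
```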

```python
# using seaborn box plot
# select the first two columns by position
sns.boxplot(data=df.iloc[:, 0:2])
plt.show()
```

### What can a box plot highlight that descriptive statistics may not?

Most of the information depicted in a boxplot is also available in the descriptive statistics, but a boxplot can give us an immediate "feel" for the dataset at a glance. Humans are visual creatures, and most of us process information based on what we see. In fact, 65 percent of us are visual learners, according to the Social Science Research Network (https://www.forbes.com/sites/tjmccue/2013/01/08/what-is-an-infographic-and-ways-to-make-it-go-viral/).

Of course a distribution plot will also provide a quick visual feel for the data, but the weight of the data distribution in the middle 50% may not seem quite so obvious. Skew also may be apparent, but the boxplot can reveal the 'outliers' of the dataset far more clearly.
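
To see the difference, the same values can be drawn as a histogram and as a box plot side by side. This is a minimal sketch with made-up data; the `Agg` backend line is only there so that it also runs outside a notebook and can be dropped when `%matplotlib inline` is in use.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical observations (made-up values); 40 is deliberately extreme
values = np.array([4, 7, 8, 9, 10, 12, 15, 40])

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram: the extreme value is just one more bar on the far right
counts, bin_edges, _ = ax_hist.hist(values, bins=6)
ax_hist.set_title("Histogram")

# Box plot: the extreme value is drawn as a separate outlier point
box = ax_box.boxplot(values)
ax_box.set_title("Box plot")

# The plotted outliers and median can be read back from the returned artists
flier_values = box["fliers"][0].get_ydata()
median_value = box["medians"][0].get_ydata()[0]
print(flier_values, median_value)
# Call plt.show() or fig.savefig(...) to view the figure
```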

### Explaining the terminology such as quartiles and percentiles, etc
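
A percentile is the value below which roughly a given percentage of the observations fall. The quartiles are the 25th, 50th and 75th percentiles: the first quartile (Q1), the median (Q2) and the third quartile (Q3). The interquartile range (IQR) is Q3 minus Q1 and covers the middle 50% of the data. As a short sketch, assuming NumPy's default linear interpolation and a made-up set of values:

```python
import numpy as np

# Hypothetical scores (made-up values): the integers 1 to 9
scores = np.arange(1, 10)

# Quartiles are just three particular percentiles
q1, q2, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1

# Any other percentile is computed the same way
p90 = np.percentile(scores, 90)

# np.quantile takes fractions (0 to 1) instead of percentages (0 to 100)
same_q1 = np.quantile(scores, 0.25)

print(q1, q2, q3, iqr, p90, same_q1)
```

Different packages interpolate between observations in slightly different ways, so quartiles computed elsewhere may not match these exactly for small samples.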

### References
1. http://mathworld.wolfram.com/Box-and-WhiskerPlot.html
1a. http://vita.had.co.nz/papers/boxplots.pdf
2. https://en.wikipedia.org/wiki/John_Tukey
3. https://www.amazon.com/Same-Not-Roald-Hoffmann/dp/0231101392
4. https://en.wikipedia.org/wiki/Roald_Hoffmann
5. http://www.physics.csbsju.edu/stats/box2.html
6. https://www150.statcan.gc.ca/n1/edu/power-pouvoir/ch12/5214889-eng.htm
7. https://help.qlik.com/en-US/sense/September2017/Subsystems/Hub/Content/Visualizations/BoxPlot/when-to-use-box-plot.htm
8. https://support.minitab.com/en-us/minitab-express/1/help-and-how-to/graphs/boxplot/interpret-the-results/key-results/#step-2-look-for-indicators-of-nonnormal-or-unusual-data
9. https://www.open.edu/openlearn/science-maths-technology/mathematics-and-statistics/statistics/interpreting-data-boxplots-and-tables/content-section-1.1.3



