## Investigating the box plot

The box plot, also known as the box-and-whisker plot, was invented by the American mathematician [John W. Tukey](https://en.wikipedia.org/wiki/John_Tukey). It is a type of graph that displays patterns in numerical data sets using a five-number summary: the [minimum, first quartile, median, third quartile, and maximum](http://www.physics.csbsju.edu/stats/box2.html). It is so called because of the box shape in which that summary is displayed.

An example of a box plot is pictured below:

![seaborn-boxplot-2.png](https://raw.githubusercontent.com/AideenByrne/FundamentalsProject/master/img/seaborn-boxplot-2.png)
*Image credit https://seaborn.pydata.org/generated/seaborn.boxplot.html*

Box plots can summarise [data from multiple sources and display the results in a single graph](http://asq.org/learn-about-quality/data-collection-analysis-tools/overview/box-whisker-plot.html), which is particularly useful when you have multiple data sets from independent sources that relate to each other in some way. [The Seaborn tutorial](https://seaborn.pydata.org/tutorial/categorical.html#categorical-tutorial) describes the box plot as a "categorical distribution plot", which refers to its function of providing information about the distribution of values in each category it visualises. [This paper](http://vita.had.co.nz/papers/boxplots.pdf), which charts the evolution of the box plot over the 40 years since Tukey developed it, details how it has "become one of the most frequently used statistical graphics, and is one of the few plot types invented in the 20th century that has found widespread adoption".

Box plots are based on [quartiles](https://en.wikipedia.org/wiki/Quartile), three summary measures that divide a rank-ordered data set into four equal parts:

The first quartile is the middle value of all the data points below the median (the median being the value that separates the higher half of the data set from the lower half).  
The second quartile is the median of the data set.  
The third quartile is the middle value of all the data points above the median.
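
The definitions above can be sketched in a few lines of Python. This is a minimal illustration of the "median of each half" method described here, using only the standard library; the function name `quartiles` and the sample values are made up for the example. Other quartile conventions, such as the interpolation used by default in NumPy and pandas, can give slightly different values for the same data.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two data points")
    half = n // 2
    lower = data[:half]          # data points below the median
    upper = data[half + n % 2:]  # data points above the median (the median itself is skipped when n is odd)
    return median(lower), median(data), median(upper)


# Hypothetical sample, used only for illustration
sample = [7, 15, 36, 39, 40, 41]
q1, q2, q3 = quartiles(sample)
print(q1, q2, q3)  # prints: 15 37.5 40
```

With an even number of points the median is the average of the two middle values, which is why the second quartile here falls between 36 and 39.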

[The Interquartile Range (IQR)](https://en.wikipedia.org/wiki/Interquartile_range) is the difference between the third and first quartiles. It describes the middle 50% of the distribution of the data set, as the image below demonstrates:

![IQR](      )
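
As a rough numerical illustration of the IQR, the sketch below computes the quartiles of a small made-up sample with `statistics.quantiles` from the Python standard library and then counts how many points fall between the first and third quartiles. The sample values are hypothetical, and `method="inclusive"` is one of several quartile conventions, so other tools may report slightly different quartiles for the same data.

```python
from statistics import quantiles

# Hypothetical sample, used only for illustration
sample = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]

# n=4 splits the data into four parts and returns the three cut points (Q1, Q2, Q3);
# method="inclusive" interpolates linearly between neighbouring data points
q1, q2, q3 = quantiles(sample, n=4, method="inclusive")

# The IQR is the width of the box in a box plot
iqr = q3 - q1

# Points inside [Q1, Q3] make up roughly the middle half of the data
inside = [x for x in sample if q1 <= x <= q3]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"{len(inside)} of {len(sample)} points lie within [Q1, Q3]")
```

For a small sample the share of points inside the box is only approximately 50%, because the quartiles fall between data points.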



In [2]:
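# pandas for data handling, NumPy for numerics, seaborn for plotting
import pandas as pd
import numpy as np
import seaborn as sns

# Load the TII SCRIM 2017 pavement data set and summarise its numeric columns
df = pd.read_csv("http://data.tii.ie/Datasets/Pavement/Scrim/TII_SCRIM_2017_ML_dTIMS_100m_CSC.csv")
df.describe()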

| | DTIMS_Chain | SU | CSC | ITM_E | ITM_N |
|---|---|---|---|---|---|
| count | 52753.0 | 52753.0 | 52706.0 | 52724.0 | 52724.0 |
| mean | 11.539031 | 115.444619 | 0.482584 | 591684.668812 | 710407.766103 |
| std | 7.946781 | 79.469718 | 0.0697 | 69600.743215 | 91381.875036 |
| min | 0.01 | 1.0 | 0.16 | 444412.7 | 533021.3 |
| 25% | 4.791 | 48.0 | 0.43 | 537252.6 | 633952.45 |
| 50% | 10.396 | 104.0 | 0.48 | 588168.71215 | 712371.75 |
| 75% | 17.199 | 172.0 | 0.53 | 647862.475 | 776931.5 |
| max | 35.264 | 353.0 | 0.9 | 728224.1 | 937353.6 |


In [None]:
# hue="ITM_E" is dropped: ITM_E is continuous, so it would create a hue level for each distinct value
sns.boxplot(x="SU", y="CSC", data=df)