# Boxplots in Data Analysis

![title](Images/BoxPlot.png)

In this project I will be looking at boxplots and their uses in data analytics. I will:

1) Summarise the history of the boxplot and the situations in which it is used.

2) Demonstrate the use of the boxplot using some data.

3) Explain any relevant terminology such as the terms quartile and percentile.

4) Compare the boxplot to alternative models.

## The Python Libraries to be used

Pandas is a package providing fast, flexible and expressive data structures designed to make working with data both easy and intuitive.

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

## The history of the boxplot

A boxplot is a method for graphically depicting groups of numerical data through their quartiles. It displays the five-number summary of a set of data: the minimum, first quartile, median, third quartile and maximum. Box and whisker plots were introduced by John Tukey in the 1970s as part of his toolkit for exploratory data analysis. Boxplots are a more compact distributional summary than a histogram, displaying less detail but also taking up less space. They are one of the few plot types invented in the 20th century that have found widespread adoption, and they are particularly useful for comparing distributions across groups.

![title](Images/BoxPlot1.png)

## Some Terminology

A breakdown of the terminology used for boxplots can be seen in Figure 1 above.

Quartiles are the values that divide a sorted list of numbers into quarters. The quartiles can also be referred to as special percentiles. For example, the first quartile is also the 25th percentile: 25% of the data will be less than the 25th percentile and 75% of the data will be more than it.
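As a small illustration of the link between quartiles and percentiles, the sketch below computes the three quartiles with `Series.quantile`. The values are the ten elevations from the first rows of the summit table shown later in this project, used here only as a stand-in sample. Pandas interpolates linearly between neighbouring values by default, so other tools may report slightly different quartiles for the same data.

```python
import pandas as pd

# Stand-in sample: the ten elevations from the first rows of the summit table
elevations = pd.Series(
    [1038.6, 1008.2, 1000.0, 988.0, 973.4, 973.0, 958.0, 956.5, 951.7, 939.0],
    name="Elevation",
)

# quantile() takes fractions between 0 and 1, so 0.25 is the 25th percentile
quartiles = elevations.quantile([0.25, 0.5, 0.75])
q1, median, q3 = quartiles.tolist()

print(f"First quartile (25th percentile):  {q1:.3f}")
print(f"Median (50th percentile):          {median:.3f}")
print(f"Third quartile (75th percentile):  {q3:.3f}")
```

With such a small sample the "25% below" rule only holds approximately; it becomes more accurate as the number of observations grows.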

The interquartile range (IQR) is the difference between the largest and smallest values in the middle 50% of a dataset, that is, the third quartile minus the first quartile.

Boxplots may also have lines (whiskers) extending vertically from the boxes, indicating variability outside the upper and lower quartiles, and so can also be called box-and-whisker plots. An outlier in these plots is an observation that lies an abnormal distance from other values in a data sample.
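What counts as an "abnormal distance" needs a rule. A common convention, and the default in Matplotlib's boxplot, is to flag any point more than 1.5 times the interquartile range (third quartile minus first quartile) beyond the box. The sketch below applies that rule to the ten prominence values from the first rows of the summit table shown later in this project. The 1.5 multiplier is a convention rather than a fixed law, and the variable names are only illustrative.

```python
import pandas as pd

# Stand-in sample: the ten prominence values from the first rows of the summit table
prominence = pd.Series(
    [1038, 90, 99, 253, 24, 38, 53, 37, 934, 74],
    name="Prominence",
)

q1 = prominence.quantile(0.25)
q3 = prominence.quantile(0.75)
iqr = q3 - q1  # interquartile range: spread of the middle 50% of the data

# Fences under the 1.5 x IQR convention; points outside them are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = prominence[(prominence < lower_fence) | (prominence > upper_fence)]

print(f"IQR: {iqr}")
print(f"Fences: {lower_fence} to {upper_fence}")
print("Outliers:", outliers.tolist())
```

For this sample the two flagged values are 1038 and 934, the prominences of Carrauntoohil and Brandon, which stand well apart from the other eight summits.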

## Data for this project

For this project I have selected data from Mountainviews.ie, which has a comprehensive list of all of the hills and mountains in Ireland. I plan to apply some of the boxplot functionality from Matplotlib to demonstrate how these plots can be used to read data.

In [17]:
df = pd.read_csv("Summits.csv")
print(df)

                      Mountain Name                          Area Name  \
0                     Carrauntoohil              MacGillycuddy's Reeks   
1                        Beenkeragh              MacGillycuddy's Reeks   
2                             Caher              MacGillycuddy's Reeks   
3                     Knocknapeasta              MacGillycuddy's Reeks   
4                    Caher West Top              MacGillycuddy's Reeks   
5                        Maolan Bui              MacGillycuddy's Reeks   
6                 Cnoc an Chuillinn              MacGillycuddy's Reeks   
7                    The Bones Peak              MacGillycuddy's Reeks   
8                           Brandon                      Brandon Group   
9                       The Big Gun              MacGillycuddy's Reeks   
10                      Cruach Mhor              MacGillycuddy's Reeks   
11       Cnoc an Chuillinn East Top              MacGillycuddy's Reeks   
12                      Lugnaquilla  D

In [18]:
df.loc[:, ["Mountain Name", "Area Name", "Elevation", "Prominence"]]  # keep only the columns of interest

| | Mountain Name | Area Name | Elevation | Prominence |
|---|---|---|---|---|
| 0 | Carrauntoohil | MacGillycuddy's Reeks | 1038.6 | 1038 |
| 1 | Beenkeragh | MacGillycuddy's Reeks | 1008.2 | 90 |
| 2 | Caher | MacGillycuddy's Reeks | 1000.0 | 99 |
| 3 | Knocknapeasta | MacGillycuddy's Reeks | 988.0 | 253 |
| 4 | Caher West Top | MacGillycuddy's Reeks | 973.4 | 24 |
| 5 | Maolan Bui | MacGillycuddy's Reeks | 973.0 | 38 |
| 6 | Cnoc an Chuillinn | MacGillycuddy's Reeks | 958.0 | 53 |
| 7 | The Bones Peak | MacGillycuddy's Reeks | 956.5 | 37 |
| 8 | Brandon | Brandon Group | 951.7 | 934 |
| 9 | The Big Gun | MacGillycuddy's Reeks | 939.0 | 74 |


In [20]:
df.loc[:,["Elevation"]].describe()   # a look at the descriptive statistics of the Irish mountain heights

| | Elevation |
|---|---|
| count | 1501.0 |
| mean | 449.404797 |
| std | 174.374681 |
| min | 94.0 |
| 25% | 312.0 |
| 50% | 444.0 |
| 75% | 565.0 |
| max | 1038.6 |
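The summary statistics above are the numbers a boxplot draws: the 25% and 75% rows become the edges of the box and the 50% row becomes the median line. As a minimal sketch of how such a plot could be produced with Matplotlib, the helper below (`plot_elevation_boxplot` is a hypothetical name, not part of any library) draws a single boxplot. It is shown on the ten elevations from the summit table as stand-in data; the axis label assumes the elevations are in metres.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_elevation_boxplot(elevations):
    """Draw one vertical boxplot of summit elevations (hypothetical helper)."""
    fig, ax = plt.subplots(figsize=(4, 6))
    # whis=1.5 is Matplotlib's default: whiskers reach the furthest points
    # within 1.5 x IQR of the box, and anything beyond is drawn as a flier
    artists = ax.boxplot(list(elevations), whis=1.5)
    ax.set_title("Summit elevations")
    ax.set_ylabel("Elevation (m)")  # assumption: values are in metres
    ax.set_xticks([1])
    ax.set_xticklabels(["Elevation"])
    return fig, ax, artists


# Stand-in data: the ten elevations from the first rows of the summit table
sample = pd.Series(
    [1038.6, 1008.2, 1000.0, 988.0, 973.4, 973.0, 958.0, 956.5, 951.7, 939.0]
)
fig, ax, artists = plot_elevation_boxplot(sample)
```

With the full dataset loaded, the same helper could be called as `plot_elevation_boxplot(df["Elevation"])`. Taking the quartiles reported above (312 and 565), the interquartile range is 253, so under the 1.5 × IQR convention any summit above 565 + 379.5 = 944.5 would be drawn as an individual outlier point.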
