# `BOXPLOT`

The boxplot, introduced by John Tukey in his classic book *Exploratory Data Analysis* close to 50 years ago, is great for visualizing data distributions from multiple groups. It captures a summary of the data efficiently with a simple box and whiskers, which makes it easy to compare across groups. A boxplot summarizes a sample using its 25th, 50th and 75th percentiles, also known as the lower quartile, median and upper quartile. The advantage of comparing quartiles is that they are far less sensitive to outliers than statistics such as the mean.
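To make the point about outliers concrete, here is a minimal sketch using NumPy. The scores are made up for illustration; the only thing it shows is that appending one extreme value barely moves the quartiles while the mean shifts a lot.

```python
import numpy as np

# Hypothetical sample: nine typical scores (illustrative values only).
scores = np.array([52, 55, 57, 58, 60, 61, 63, 65, 68])
# The same sample with one extreme value appended.
with_outlier = np.append(scores, 500)

# 25th, 50th and 75th percentiles: lower quartile, median, upper quartile.
q_clean = np.percentile(scores, [25, 50, 75])
q_outlier = np.percentile(with_outlier, [25, 50, 75])

print("quartiles:             ", q_clean, " mean:", scores.mean())
print("quartiles with outlier:", q_outlier, " mean:", with_outlier.mean())
```

Comparing the two printed lines, the quartiles change by a small amount, whereas the mean is pulled strongly towards the extreme value.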


------------------------------------------------------------------

![](http://web.pdx.edu/~stipakb/download/PA551/boxplot_files/boxplot4.jpg)

------------------------------------------------------------------

### Definitions
`Median`
The median (middle quartile) marks the mid-point of the data and is shown by the line that divides the box into two parts. Half the scores are greater than or equal to this value and half are less.

`Inter-quartile range`
The middle “box” represents the middle 50% of scores for the group. The range of scores from lower to upper quartile is referred to as the inter-quartile range (IQR).

`Upper quartile`
Seventy-five percent of the scores fall below the upper quartile.

`Lower quartile`
Twenty-five percent of scores fall below the lower quartile.

`Whiskers`
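The upper and lower whiskers represent scores outside the middle 50%. By default, seaborn and matplotlib extend each whisker to the most extreme score that lies within 1.5 times the inter-quartile range of the box; scores beyond that are drawn as individual outlier points. Whiskers often (but not always) stretch over a wider range of scores than the middle quartile groups.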
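The definitions above can be sketched numerically with pandas. The scores below are hypothetical, and the 1.5 × IQR fences follow Tukey's rule, which is the default used by matplotlib and seaborn; other whisker conventions exist.

```python
import pandas as pd

# Hypothetical scores, one of them unusually high (illustrative values only).
scores = pd.Series([28, 31, 33, 35, 36, 38, 40, 41, 43, 70])

# Lower quartile, median and upper quartile.
q1, median, q3 = scores.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # inter-quartile range: the height of the box

# Tukey's rule: points farther than 1.5 * IQR from the box count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations that lie inside the fences.
inside = scores[(scores >= lower_fence) & (scores <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = scores[(scores < lower_fence) | (scores > upper_fence)]

print(f"box: {q1} to {q3} (median {median}, IQR {iqr})")
print(f"whiskers: {whisker_low} to {whisker_high}")
print("outliers:", outliers.tolist())
```

Note that the whiskers stop at actual data points, not at the fences themselves, which is why the two whiskers of a box can have different lengths.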


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from ipywidgets import interact

In [2]:
df = pd.read_csv("http://bit.ly/2cLzoxH")

In [3]:
df.head()

| | country | year | pop | continent | lifeExp | gdpPercap |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | 1952 | 8425333.0 | Asia | 28.801 | 779.445314 |
| 1 | Afghanistan | 1957 | 9240934.0 | Asia | 30.332 | 820.85303 |
| 2 | Afghanistan | 1962 | 10267083.0 | Asia | 31.997 | 853.10071 |
| 3 | Afghanistan | 1967 | 11537966.0 | Asia | 34.02 | 836.197138 |
| 4 | Afghanistan | 1972 | 13079460.0 | Asia | 36.088 | 739.981106 |


In [4]:
def get_year(year, swarm=False, palette="deep", save=False):
    """Draw a boxplot of life expectancy per continent for one year."""
    year_df = df[df.year == year]  # keep only the rows for the selected year
    plt.figure(figsize=(15, 10))

    bplot = sns.boxplot(y='lifeExp', x='continent',
                        data=year_df,
                        width=0.5,
                        palette=palette, linewidth=3)
    bplot.axes.set_title(f"{year}: Life Expectancy vs. Continent",
                         fontsize=16)

    bplot.set_xlabel("Continent",
                     fontsize=14)
    bplot.tick_params(labelsize=10)
    if swarm:
        # Overlay one point per country on top of the boxes.
        bplot = sns.swarmplot(y='lifeExp', x='continent', data=year_df,
                              color=".1")
    if save:
        # Write the figure to the working directory, e.g. "2007 boxplot.jpg".
        filename = f"{year} boxplot.jpg"
        bplot.figure.savefig(filename,
                             format='jpeg',
                             dpi=100)
    plt.show()
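The boxes that `get_year` draws are built from per-continent quartiles. As a rough sketch of the numbers behind such a plot, the snippet below computes them with a pandas `groupby`. The `sample` frame is made up and only mirrors the column names used above; its values are not taken from the dataset.

```python
import pandas as pd

# Small made-up frame with the same column names as df; values are illustrative only.
sample = pd.DataFrame({
    "continent": ["Asia", "Asia", "Asia", "Europe", "Europe", "Europe"],
    "year": [2007] * 6,
    "lifeExp": [60.0, 70.0, 80.0, 74.0, 78.0, 82.0],
})

# Quartiles of life expectancy per continent for a single year.
summary = (
    sample[sample.year == 2007]
    .groupby("continent")["lifeExp"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
summary.columns = ["lower quartile", "median", "upper quartile"]
print(summary)
```

Replacing `sample` with `df` should give the box edges and median lines for the real data, one row per continent.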

# Identify Skewness
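We can also identify the skewness of our data by observing the shape of the box plot. If the box plot is symmetric, with the median near the middle of the box and whiskers of similar length, the data are roughly symmetric. A normal distribution is one example, but symmetry alone does not show that the data are normal. If the box plot is not symmetric, the data are skewed: a longer upper whisker and a median closer to the bottom of the box indicate positive (right) skew, and the reverse indicates negative (left) skew. You can get a better understanding by looking at the diagrams below: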

![](https://cdn-images-1.medium.com/max/800/1*X_IbsDGoX8Tdad8hC-xYOQ.png)
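The asymmetry of a box can also be put into a number. The sketch below uses Bowley's quartile skewness, which compares the upper and lower halves of the box; it is one possible measure, not something the plot itself computes. The two samples are randomly generated for illustration, and `quartile_skewness` is a helper name introduced here.

```python
import numpy as np

def quartile_skewness(values):
    """Bowley's quartile skewness: near 0 for a symmetric box, > 0 for right skew."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    # Upper half of the box minus lower half, scaled by the box height (IQR).
    return ((q3 - median) - (median - q1)) / (q3 - q1)

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
symmetric = rng.normal(loc=60, scale=10, size=10_000)
right_skewed = rng.exponential(scale=10, size=10_000)

print("symmetric sample:   ", round(quartile_skewness(symmetric), 3))
print("right-skewed sample:", round(quartile_skewness(right_skewed), 3))
```

A value close to zero corresponds to a median line near the middle of the box, while a clearly positive value corresponds to a median sitting closer to the lower quartile.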

In [5]:
interact(get_year,
         year = [2007, 2002, 1997, 1992, 1987, 1982, 1977, 1972, 1967, 1962, 1957,1952],
         swarm = [True,False],
         palette = ["deep", "muted", "pastel", "bright", "dark", "colorblind","Set3","Set2"],
         save = [True,False]);
