# Lab - Data Visualizations with Seaborn & Pandas

## Introduction

In this lab, you will apply your knowledge of the advanced visualization library Seaborn to generate plots that provide insight into a randomly generated dataset as well as the popular `penguins` and `iris` datasets.

## Objectives

You will be able to:
    
- Create a boxplot using Seaborn
- Label plots with appropriate axis labels and titles
- Create an lmplot using Seaborn
- Create a subplot using Seaborn
- Create data visualizations with Pandas

## Part I: Seaborn

We will use a randomly generated dataset to practice using Seaborn. Begin by running the code below without changes.

In [1]:
# CodeGrade step0

# Run this cell without change

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# The seed must be 42 for the data to replicate
seed = 42
np.random.seed(seed)

# Data
data = np.random.normal(size=(20, 10)) + np.arange(10) / 2

We'll begin with basic visualizations with Seaborn.

### Step 1

Create a boxplot with `sns.boxplot()` and pass in the parameter `data=data`. Store the returned object in the variable `boxplot`.
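As a rough illustration of the call described above, the sketch below draws a boxplot from a small array. The `sample` array and the `rng` generator are hypothetical stand-ins, not the lab's `data` variable.

```python
import numpy as np
import seaborn as sns

# Hypothetical sample data (not the lab's `data` array): 20 rows, 3 columns
rng = np.random.default_rng(0)
sample = rng.normal(size=(20, 3))

# For a 2D array, Seaborn draws one box per column.
# sns.boxplot returns the matplotlib Axes it drew on.
ax = sns.boxplot(data=sample)
```

Keeping a reference to the returned Axes is what makes later customization (labels, titles) possible.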

In [2]:
# CodeGrade step1



### Step 2
Repeat Step 1, then call the boxplot object's `set()` method with the following parameters:
* `xlabel='X Label'`
* `ylabel='Y Label'`
* `title='Example Boxplot'`
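The labeling pattern can be sketched as follows. The data and the label strings here are made up for illustration and are not the values requested in this step.

```python
import numpy as np
import seaborn as sns

# Hypothetical sample data, used only to have something to label
rng = np.random.default_rng(0)
sample = rng.normal(size=(20, 3))

ax = sns.boxplot(data=sample)

# Axes.set() accepts several properties at once as keyword arguments
ax.set(xlabel="Column index", ylabel="Value", title="Sketch Boxplot")
```

A single `set()` call is equivalent to calling `set_xlabel()`, `set_ylabel()`, and `set_title()` separately.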

In [3]:
# CodeGrade step2



Now we'll move to changing style and content.

### Step 3

Repeat Step 2, but first call Seaborn's `set_style()` function and pass in the string `'darkgrid'`.
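A minimal sketch of how a style is applied, using `'whitegrid'` as a stand-in for the style requested above and a hypothetical `sample` array. Seaborn's built-in styles are `'darkgrid'`, `'whitegrid'`, `'dark'`, `'white'`, and `'ticks'`.

```python
import numpy as np
import seaborn as sns

# The style affects axes created after this call,
# so it is set before the plot is drawn.
sns.set_style("whitegrid")

# Hypothetical sample data
rng = np.random.default_rng(0)
sample = rng.normal(size=(20, 3))

ax = sns.boxplot(data=sample)

# sns.axes_style() with no arguments reports the style settings now in effect
current_style = sns.axes_style()
```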

In [4]:
# CodeGrade step3




### Step 4

While the plot looks much better now, the text for the ticks and axis labels is so small that it would be hard to read unless you are right in front of the monitor. That's a problem if the visualizations are going to be used in something like a tech talk or presentation.

* Call Seaborn's set_context() method and pass in the string 'poster' (the default is 'notebook')
* Recreate the labeled boxplot that we made in Step 3
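To get a feel for what a context changes, the sketch below prints the base font size each named context would apply and then activates one of them. It only inspects settings and draws nothing.

```python
import seaborn as sns

# plotting_context(name) returns the scaled settings for that context
# without applying them
for name in ("paper", "notebook", "talk", "poster"):
    print(name, sns.plotting_context(name)["font.size"])

# set_context() applies the settings to plots created afterwards
sns.set_context("poster")
```

Like `set_style()`, the context is meant to be set before the plot is created.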



In [5]:
# CodeGrade step4



We'll now turn to the canonical Seaborn penguins dataset.

In [6]:
# CodeGrade step0

# Run this cell without change

penguins = sns.load_dataset("penguins")
penguins

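|     | species | island    | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex    |
|-----|---------|-----------|----------------|---------------|-------------------|-------------|--------|
| 0   | Adelie  | Torgersen | 39.1           | 18.7          | 181.0             | 3750.0      | Male   |
| 1   | Adelie  | Torgersen | 39.5           | 17.4          | 186.0             | 3800.0      | Female |
| 2   | Adelie  | Torgersen | 40.3           | 18.0          | 195.0             | 3250.0      | Female |
| 3   | Adelie  | Torgersen | NaN            | NaN           | NaN               | NaN         | NaN    |
| 4   | Adelie  | Torgersen | 36.7           | 19.3          | 193.0             | 3450.0      | Female |
| ... | ...     | ...       | ...            | ...           | ...               | ...         | ...    |
| 339 | Gentoo  | Biscoe    | NaN            | NaN           | NaN               | NaN         | NaN    |
| 340 | Gentoo  | Biscoe    | 46.8           | 14.3          | 215.0             | 4850.0      | Female |
| 341 | Gentoo  | Biscoe    | 50.4           | 15.7          | 222.0             | 5750.0      | Male   |
| 342 | Gentoo  | Biscoe    | 45.2           | 14.8          | 212.0             | 5200.0      | Female |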


### Step 5

Now use Seaborn to create a histogram of `body_mass_g` with the following settings:
*   the context is set to 'talk'
*   the width of the bins is set to 100 g
*   the color of the bars distinguishes male and female penguins (as well as N/A)
*   the horizontal axis is labeled 'Body Mass (g)', the vertical axis 'Number of Penguins', and the title is 'Penguin Mass Distribution by Sex'



In [7]:
# CodeGrade step5



### Step 6

Create a scatter plot of the bill length (horizontal axis) vs. the bill depth (vertical axis), with

* the context set to 'paper'
* the color of the points representing the sex
* the shape of the points representing the species
* the horizontal axis labeled 'Bill Length (mm)', the vertical axis labeled 'Bill Depth', and the title 'Scatterplot of Bill Sizes of Penguins'
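In `sns.scatterplot()`, color and marker shape are controlled by the `hue` and `style` parameters. The sketch below shows the pattern on a hypothetical `toy` DataFrame whose column names and labels are made up for illustration.

```python
import pandas as pd
import seaborn as sns

# Hypothetical toy data standing in for the penguins DataFrame
toy = pd.DataFrame({
    "length": [39.1, 40.3, 46.8, 50.4, 45.2, 36.7],
    "depth": [18.7, 18.0, 14.3, 15.7, 14.8, 19.3],
    "sex": ["M", "F", "F", "M", "F", "F"],
    "kind": ["p", "p", "q", "q", "q", "p"],
})

sns.set_context("paper")

# hue picks the point color, style picks the marker shape
ax = sns.scatterplot(data=toy, x="length", y="depth", hue="sex", style="kind")
ax.set(xlabel="Length (mm)", ylabel="Depth (mm)", title="Toy Scatterplot")
```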

In [8]:
# CodeGrade Step6


## Part II: Pandas

### Visualizing High Dimensional Data

Pandas also has plotting tools that help with visualizing high dimensional data, i.e., way too many columns/variables to inspect individually. Let's explore some of these features by loading in the iris flower dataset.

The iris dataset is a classic multivariate dataset, which includes the sepal length, sepal width, petal length, and petal width for 150 samples (50 from each of three species) of the iris flower.

In [None]:
# CodeGrade step0

# Run this cell without change

iris = sns.load_dataset('iris')
iris.head()

### Step 7

Pandas has a plotting tool that allows us to create a scatter matrix from a DataFrame. A scatter matrix is a way of comparing each column in a DataFrame to every other column in a pairwise fashion.

The scatter matrix creates scatter plots between the different variables and histograms along the diagonals.

This allows us to quickly see some of the more obvious patterns in the dataset. Let's use it to visualize the iris DataFrame and see what insights we can gain from our data. We will use the function `pd.plotting.scatter_matrix()` and pass in our dataset as an argument. Because the setup cell at the top of this lab does not import Pandas, start with `import pandas as pd`.
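A minimal sketch of a scatter matrix on a hypothetical three-column DataFrame (`toy`, with made-up column names) shows the shape of the result: one row and one column of plots per numeric variable.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three numeric columns, 30 rows
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(30, 3)), columns=["a", "b", "c"])

# Returns a 2D NumPy array of matplotlib Axes, with histograms on the diagonal
axes = pd.plotting.scatter_matrix(toy, figsize=(6, 6))
```

Non-numeric columns are left out of the grid, so a categorical label column does not get its own row.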

In [None]:
# CodeGrade Step7


### Step 8

Now repeat the previous step, but use Seaborn's `pairplot()` function with `hue` set to `'species'`.

In [None]:
# CodeGrade Step8



Compare the two visual outputs and consider where they are similar and where they differ.

### Step 9

Pandas includes a plotting tool for creating parallel coordinates plots which could be a great way to visualize multivariate data.

Parallel coordinates plots are a common way of visualizing high dimensional multivariate data. Each variable in the dataset corresponds to an equally spaced, parallel, vertical line. Each individual observation is then drawn as a line connecting its values across the variables.

Let's create a parallel coordinates plot for the 4 predictor variables in the iris dataset and see if we can make any further judgments about the nature of the data. We will use the `pd.plotting.parallel_coordinates()` function, passing in the iris DataFrame along with the name of the response column (`'species'`). Let's also apply some customizations.

Color the lines by the class given in the 'species' column (this makes it easier to spot patterns).
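As a small sketch of the call, the example below uses a hypothetical four-row `toy` DataFrame. The second argument names the class column, and `colormap` is one optional way to choose the per-class line colors (the `'viridis'` choice is arbitrary).

```python
import pandas as pd

# Hypothetical toy data: three numeric columns plus a class label
toy = pd.DataFrame({
    "a": [1.0, 1.2, 3.1, 3.3],
    "b": [2.0, 2.1, 0.5, 0.4],
    "c": [0.3, 0.2, 1.8, 2.0],
    "label": ["low", "low", "high", "high"],
})

# One line per row, colored by the value in the class column
ax = pd.plotting.parallel_coordinates(toy, "label", colormap="viridis")
```

Each class gets one legend entry, so lines that cluster by color suggest variables that separate the classes.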

In [None]:
# CodeGrade Step9
