In this lesson you'll add several new plot types to your growing toolkit: **scatter plots**, **histograms**, and **density plots**.

# Set up the notebook

As always, we begin by setting up the coding environment.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print("Setup Complete")

# Select a dataset

We'll work with a dataset of 150 different flowers: 50 each from three species of iris (*Iris setosa*, *Iris versicolor*, and *Iris virginica*).

<img src="images/tut3_iris.png">

# Load and examine the data

In the dataset, each species is stored in its own CSV file.  For each flower, there are four measurements: the sepal length and width, along with the petal length and width.

Run the code below (without changes) to load the files.

In [None]:
# Paths of the files to read
iris_set_filepath = "../input/iris_setosa.csv"
iris_ver_filepath = "../input/iris_versicolor.csv"
iris_vir_filepath = "../input/iris_virginica.csv"

# Read the files into variables 
iris_set_data = pd.read_csv(iris_set_filepath, index_col="Id")
iris_ver_data = pd.read_csv(iris_ver_filepath, index_col="Id")
iris_vir_data = pd.read_csv(iris_vir_filepath, index_col="Id")

# Print the first 5 rows of the Iris setosa data
iris_set_data.head()

# Histogram

To create a **histogram**, we use the `sns.distplot` command and use `a=` to choose the column we'd like to plot.  

In the code cell below, we create three different histograms to understand how petal length varies between species.  In this case, it is helpful to use a legend, and you can set the values in the legend by using `label=`.  When creating a histogram, you should also set `kde=False`; leaving it out overlays a KDE curve on the histogram and rescales the vertical axis from counts to density.

In this case, the legend does not automatically appear on the plot.  To force it to show (for any plot type), we can always use `plt.legend()`.

In [None]:
# Plot a histogram for each species
sns.distplot(a=iris_set_data['Petal Length (cm)'], label='Iris-setosa', kde=False)
sns.distplot(a=iris_ver_data['Petal Length (cm)'], label='Iris-versicolor', kde=False)
sns.distplot(a=iris_vir_data['Petal Length (cm)'], label='Iris-virginica', kde=False)

# Show the legend
plt.legend()

# Density plot

The next type of plot you'll learn to make is a **kernel density estimate (KDE)** plot.  In case you're not familiar with KDE plots, you can think of one as a smoothed histogram.
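
To make the "smoothed histogram" idea concrete, here is a minimal NumPy sketch of a Gaussian kernel density estimate: each observation contributes a small bell curve, and the estimate is the average of those curves. The function name `gaussian_kde_sketch`, the bandwidth value, and the sample values are illustrative assumptions; this is not how seaborn is implemented, and seaborn chooses its bandwidth automatically.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point."""
    samples = np.asarray(samples, dtype=float)
    # Standardized distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Made-up petal lengths in cm, for illustration only
samples = [1.3, 1.4, 1.4, 1.5, 1.5, 1.6, 1.7]
grid = np.linspace(0.0, 3.0, 301)
density = gaussian_kde_sketch(samples, grid, bandwidth=0.1)

# A density curve encloses an area of approximately 1
area = density.sum() * (grid[1] - grid[0])
peak_location = grid[np.argmax(density)]
```

A larger `bandwidth` widens each bump and gives a smoother curve; a smaller one follows the individual observations more closely.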

To make a KDE plot, we use the `sns.kdeplot` command (_where you'll notice that the process of setting the data column and the label is identical to when we made a histogram_).  Setting `shade=True` colors the area below each curve.

In [None]:
# KDE plot for each species
sns.kdeplot(data=iris_set_data['Petal Length (cm)'], label='Iris-setosa', shade=True)
sns.kdeplot(data=iris_ver_data['Petal Length (cm)'], label='Iris-versicolor', shade=True)
sns.kdeplot(data=iris_vir_data['Petal Length (cm)'], label='Iris-virginica', shade=True)

# Show the legend
plt.legend()

# A _slightly_ different dataset

For the next part of the lesson, instead of working with the dataset as three different files, we'll load the dataset as a single file.

In [None]:
# Path of the file to read
iris_filepath = "../input/iris.csv"

# Read the file into a variable iris_data
iris_data = pd.read_csv(iris_filepath, index_col="Id")

# Print the first 5 rows of the data
iris_data.head()

For all of the remaining plots, we'll use `iris_data` (instead of `iris_set_data`, `iris_ver_data`, and `iris_vir_data`, from above) to access the dataset.
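
If only the three per-species tables were available, a combined table could be assembled with `pd.concat`. The sketch below illustrates the idea on tiny made-up frames; the measurement values and the `Species` labels are assumptions for illustration, and the real `iris.csv` may differ in its labels or column order.

```python
import pandas as pd

# Tiny made-up stand-ins for the three per-species tables
setosa = pd.DataFrame({"Petal Length (cm)": [1.4, 1.3]},
                      index=pd.Index([1, 2], name="Id"))
versicolor = pd.DataFrame({"Petal Length (cm)": [4.7, 4.5]},
                          index=pd.Index([51, 52], name="Id"))
virginica = pd.DataFrame({"Petal Length (cm)": [6.0, 5.1]},
                         index=pd.Index([101, 102], name="Id"))

# Tag each table with its species, then stack the tables row-wise
combined = pd.concat([
    setosa.assign(Species="Iris-setosa"),
    versicolor.assign(Species="Iris-versicolor"),
    virginica.assign(Species="Iris-virginica"),
])

species_counts = combined["Species"].value_counts()
```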

# Scatter plot, Part 1

To create a simple **scatter plot**, we use the `sns.scatterplot` command and specify the values for the horizontal x-axis (`x=iris_data['Petal Width (cm)']`) and the vertical y-axis (`y=iris_data['Petal Length (cm)']`).

In [None]:
sns.scatterplot(x=iris_data['Petal Width (cm)'], y=iris_data['Petal Length (cm)'])

The scatter plot above suggests that petal length and petal width are **positively correlated**: plants with longer petals typically also have wider petals.
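
One way to put a number on "positively correlated" is the Pearson correlation coefficient, which pandas computes with `Series.corr`. The values below are made up for illustration (they are not taken from `iris_data`); with the real data, the method would be called on the two columns plotted above.

```python
import pandas as pd

# Made-up measurements in cm, for illustration only
petal_width = pd.Series([0.2, 0.4, 1.3, 1.5, 1.8, 2.3])
petal_length = pd.Series([1.4, 1.7, 4.0, 4.5, 5.5, 6.0])

# Pearson correlation: values near +1 indicate a strong positive linear relationship
r = petal_width.corr(petal_length)
print(f"Pearson r = {r:.3f}")
```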

To double-check the strength of this relationship, you might like to add a **regression line**, or the line that best fits the data.  We do this by changing the command to `sns.regplot`.

In [None]:
sns.regplot(x=iris_data['Petal Width (cm)'], y=iris_data['Petal Length (cm)'])

The line certainly appears to provide a close fit!
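
To see what "the line that best fits the data" means numerically, a least-squares straight line can be fitted with `np.polyfit`. This is only a sketch on made-up values; `sns.regplot` also draws a confidence band around its line, which this code does not reproduce.

```python
import numpy as np

# Made-up measurements in cm, for illustration only
petal_width = np.array([0.2, 0.4, 1.3, 1.5, 1.8, 2.3])
petal_length = np.array([1.4, 1.7, 4.0, 4.5, 5.5, 6.0])

# A degree-1 polynomial fit is a least-squares straight line
slope, intercept = np.polyfit(petal_width, petal_length, deg=1)

# Residuals are the vertical gaps between the points and the fitted line
predicted = slope * petal_width + intercept
residuals = petal_length - predicted
print(f"fitted line: length = {slope:.2f} * width + {intercept:.2f}")
```

Small residuals relative to the spread of the data are what a "close fit" looks like in numbers.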

# Scatter plot, Part 2

We can use scatter plots to display the relationships between (_not two, but..._) three variables!  One way of doing this is by color-coding the points.  

For instance, to determine the relationship between petal length, sepal width, and species, we can color-code the points by `'Species'`, and plot the other two variables (`'Petal Length (cm)'`, `'Sepal Width (cm)'`) on the axes.

In [None]:
sns.scatterplot(x=iris_data['Petal Length (cm)'], 
                y=iris_data['Sepal Width (cm)'], 
                hue=iris_data['Species'])
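
Seaborn also accepts the whole DataFrame through `data=` with the columns named as strings, which some readers find easier to scan. The sketch below shows that form on a made-up stand-in frame (the hypothetical `demo`), not on the real `iris_data`.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up stand-in for iris_data, NOT the real measurements
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Petal Length (cm)": np.concatenate([
        rng.normal(1.5, 0.2, 20), rng.normal(4.3, 0.5, 20), rng.normal(5.5, 0.5, 20)]),
    "Sepal Width (cm)": np.concatenate([
        rng.normal(3.4, 0.4, 20), rng.normal(2.8, 0.3, 20), rng.normal(3.0, 0.3, 20)]),
    "Species": ["Iris-setosa"] * 20 + ["Iris-versicolor"] * 20 + ["Iris-virginica"] * 20,
})

fig, ax = plt.subplots()
# Columns are referenced by name; hue= assigns one color per species
sns.scatterplot(data=demo, x="Petal Length (cm)", y="Sepal Width (cm)",
                hue="Species", ax=ax)

legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
```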

One pattern that can be seen in the scatter plot is that the plants seem to belong to one of two groups, where _Iris versicolor_ and _Iris virginica_ seem to have similar values for petal length and sepal width, while _Iris setosa_ belongs in a category all by itself. 

There's a plot that communicates this story even more clearly, and that's the two-dimensional (2D) **kernel density estimate (KDE) plot**.  To create a 2D KDE plot, we use the `sns.kdeplot` command.  

For this plot type, the color-coded rectangle in the middle of the plot shows us how likely we are to see different combinations of sepal width and petal length, where darker parts of the figure are more likely.  

In [None]:
sns.kdeplot(data=iris_data['Petal Length (cm)'],
            data2=iris_data['Sepal Width (cm)'],
            shade=True)

Alternatively, we can build a 2D KDE plot with the `sns.jointplot` command, which gives us two additional KDE plots for free:
- the curve at the top of the figure is a KDE plot for the data on the x-axis (in this case, `iris_data['Petal Length (cm)']`), and
- the curve on the right of the figure is a KDE plot for the data on the y-axis (in this case, `iris_data['Sepal Width (cm)']`).

In [None]:
sns.jointplot(x=iris_data['Petal Length (cm)'],
              y=iris_data['Sepal Width (cm)'],
              kind="kde")

If we're only interested in the relationship between petal length and species, there's a scatter plot for that, too!  In this case, since `'Species'` contains categorical entries, a traditional scatter plot (as above) isn't the best bet.  (Before moving on, make sure that this makes sense to you!)  

However, there is a type of scatter plot (called a **categorical scatter plot**) that is useful for this, and we build it by using the `sns.swarmplot` command.

In [None]:
sns.swarmplot(x=iris_data['Species'], 
              y=iris_data['Petal Length (cm)'])

This plot communicates similar information to the histogram and KDE plot above: namely, *Iris setosa* tends to have the shortest petals, *Iris virginica* tends to have the longest, and *Iris versicolor* falls somewhere in the middle.
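
The same comparison can be checked numerically by grouping on `'Species'` and summarizing petal length. The frame below is a small made-up stand-in for `iris_data`, so the printed numbers are illustrative only.

```python
import pandas as pd

# Made-up stand-in for iris_data, NOT the real measurements
demo = pd.DataFrame({
    "Species": ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 3 + ["Iris-virginica"] * 3,
    "Petal Length (cm)": [1.4, 1.5, 1.3, 4.5, 4.2, 4.7, 5.6, 6.0, 5.1],
})

# Median petal length per species, sorted from shortest to longest
median_length = (
    demo.groupby("Species")["Petal Length (cm)"]
    .median()
    .sort_values()
)
print(median_length)
```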

# What's next?

write later.