
# Set up the notebook

As always, we begin by setting up the coding environment.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print("Setup Complete")

# Select a dataset

We'll work with a dataset of three different iris species.

<img src="images/tut3_iris.png">


# Load and examine the data

In this lesson, we'll work with two different versions of the dataset.  

For the first part of the lesson, we'll work with a version where each species has a different CSV file.

In [None]:
# Paths of the files to read
iris_set_filepath = "../input/iris_setosa.csv"
iris_ver_filepath = "../input/iris_versicolor.csv"
iris_vir_filepath = "../input/iris_virginica.csv"

# Read the files into variables 
iris_set_data = pd.read_csv(iris_set_filepath, index_col="Id")
iris_ver_data = pd.read_csv(iris_ver_filepath, index_col="Id")
iris_vir_data = pd.read_csv(iris_vir_filepath, index_col="Id")

# Print the first 5 rows of the Iris setosa data
iris_set_data.head()

# Histogram

In [None]:
# Plot a histogram for each species
# (sns.histplot replaces the deprecated sns.distplot(..., kde=False))
sns.histplot(iris_set_data['Petal Length (cm)'], label='Iris-setosa')
sns.histplot(iris_ver_data['Petal Length (cm)'], label='Iris-versicolor')
sns.histplot(iris_vir_data['Petal Length (cm)'], label='Iris-virginica')

# Show the legend
plt.legend()

# Density plot

The next plot type is a **kernel density estimate (KDE)** plot, which you can think of as a smoothed histogram.

In [None]:
# Plot the KDE for each species
# fill=True shades the area under each curve (it replaces the deprecated shade=True)
sns.kdeplot(iris_set_data['Petal Length (cm)'], label='Iris-setosa', fill=True)
sns.kdeplot(iris_ver_data['Petal Length (cm)'], label='Iris-versicolor', fill=True)
sns.kdeplot(iris_vir_data['Petal Length (cm)'], label='Iris-virginica', fill=True)

# Show the legend
plt.legend()
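
To make "kernel density estimate" less abstract, here is a minimal sketch of the idea behind the curves above: place a small Gaussian bump on every observation and average the bumps. The helper `gaussian_kde_1d`, the fixed `bandwidth`, and the sample values are all assumptions made for illustration; seaborn chooses its bandwidth automatically and its internals may differ.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample,
    # shape (len(grid), len(samples))
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Made-up petal lengths, for illustration only (not the real iris data)
petal_lengths = [1.3, 1.4, 1.4, 1.5, 1.7]
grid = np.linspace(0.0, 3.0, 301)
density = gaussian_kde_1d(petal_lengths, grid, bandwidth=0.1)

# A density curve should enclose an area of about 1
area = density.sum() * (grid[1] - grid[0])
peak = grid[np.argmax(density)]
print(f"area under curve: {area:.3f}, peak near: {peak:.2f} cm")
```

A larger `bandwidth` gives a smoother, flatter curve, while a smaller one follows the individual observations more closely.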

# A _slightly_ different dataset

In [None]:
# Path of the file to read
iris_filepath = "../input/iris.csv"

# Read the file into a variable iris_data
iris_data = pd.read_csv(iris_filepath, index_col="Id")

# Print the first 5 rows of the data
iris_data.head()
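
The single-file version can be thought of as the three per-species tables stacked on top of each other, with a column that records the species. Below is a minimal sketch of that relationship. It assumes tiny made-up tables in place of the real files, and it assumes the species label is added by hand (the actual files may already contain a `Species` column).

```python
import pandas as pd

# Made-up stand-ins for the three per-species tables (not the real data)
per_species = {
    "Iris-setosa": pd.DataFrame(
        {"Petal Length (cm)": [1.4, 1.5]}, index=pd.Index([1, 2], name="Id")),
    "Iris-versicolor": pd.DataFrame(
        {"Petal Length (cm)": [4.7, 4.5]}, index=pd.Index([51, 52], name="Id")),
    "Iris-virginica": pd.DataFrame(
        {"Petal Length (cm)": [6.0, 5.1]}, index=pd.Index([101, 102], name="Id")),
}

# Label each table with its species, then stack them into one DataFrame
combined = pd.concat(
    [df.assign(Species=name) for name, df in per_species.items()]
)
print(combined)
```

Having everything in one DataFrame with a `Species` column is what lets a single plotting call distinguish the species.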

# Scatter plot, Part 1

To create a simple **scatter plot**, we use the `sns.scatterplot` command and need only specify the values that will be used for the horizontal x-axis (`x=iris_data['Petal Width (cm)']`) and the vertical y-axis (`y=iris_data['Petal Length (cm)']`).

In [None]:
sns.scatterplot(x=iris_data['Petal Width (cm)'], y=iris_data['Petal Length (cm)'])

The scatter plot above suggests that petal length and petal width are **positively correlated**: plants with longer petals typically also have wider petals.

To double-check the strength of this relationship, you might like to add a **regression line**, or the line that best fits the data.  We do this by changing the command to `sns.regplot`.

In [None]:
sns.regplot(x=iris_data['Petal Width (cm)'], y=iris_data['Petal Length (cm)'])

The line certainly appears to provide a close fit!
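
If you would like a number to go with the visual impression, one option is to compute the Pearson correlation coefficient and the slope of a least-squares line directly. The sketch below uses a small made-up table named `toy` (an assumption for illustration, not the real iris data), so the printed values say nothing about the actual dataset.

```python
import numpy as np
import pandas as pd

# Made-up measurements, for illustration only (not the real iris data)
toy = pd.DataFrame({
    "Petal Width (cm)":  [0.2, 0.4, 1.3, 1.5, 2.0, 2.3],
    "Petal Length (cm)": [1.4, 1.6, 4.2, 4.6, 5.5, 6.0],
})

x = toy["Petal Width (cm)"]
y = toy["Petal Length (cm)"]

# Pearson correlation: close to +1 means a strong positive linear relationship
r = x.corr(y)

# Ordinary least-squares straight line: y is approximately slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)

print(f"r = {r:.3f}, slope = {slope:.2f}, intercept = {intercept:.2f}")
```

With the real data, the same two lines can be applied to the corresponding columns of `iris_data`.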

# Scatter plot, Part 2

We can use scatter plots to display the relationships between (_not two, but..._) three variables!  One way of doing this is by color-coding the points.  

For instance, to determine the relationship between petal length, sepal width, and species, we can color-code the points by `'Species'` and plot the other two variables (`'Petal Length (cm)'`, `'Sepal Width (cm)'`) on the axes.

In [None]:
sns.scatterplot(x=iris_data['Petal Length (cm)'], 
                y=iris_data['Sepal Width (cm)'], 
                hue=iris_data['Species'])

This chart communicates a lot of useful information to the viewer, but perhaps the most compelling story here is that while petal length can easily b ...

If we're only interested in the relationship between petal length and species, there's a scatter plot for that, too!  In this case, since `'Species'` contains categorical entries, a traditional scatter plot (as above) isn't the best bet.  (Before moving on, make sure that this makes sense to you!)  

However, there is a type of scatter plot (called a **categorical scatter plot**) that is useful for this, and we build it by using the `sns.swarmplot` command.

In [None]:
sns.swarmplot(x=iris_data['Species'], 
              y=iris_data['Petal Length (cm)'])
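
A numeric companion to a categorical scatter plot is a per-species summary table. The sketch below groups a small made-up table named `toy` (an assumption for illustration, not the real iris data) by species and reports the range of petal lengths in each group.

```python
import pandas as pd

# Made-up measurements, for illustration only (not the real iris data)
toy = pd.DataFrame({
    "Species": ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 3 + ["Iris-virginica"] * 3,
    "Petal Length (cm)": [1.4, 1.5, 1.3, 4.5, 4.7, 4.0, 6.0, 5.1, 5.8],
})

# One row per species: smallest, middle, and largest petal length
summary = toy.groupby("Species")["Petal Length (cm)"].agg(["min", "median", "max"])
print(summary)
```

If the ranges of two species do not overlap, petal length alone is enough to tell them apart, which is the same thing a swarm plot shows visually.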

Another pattern visible in the color-coded scatter plot (petal length against sepal width) is that the plants seem to fall into two groups: _Iris versicolor_ and _Iris virginica_ have similar values for petal length and sepal width, while _Iris setosa_ sits in a category all by itself.


There's a plot that communicates this story even more clearly, and that's the two-dimensional (2D) **kernel density estimate (KDE) plot**.

In [None]:
sns.jointplot(x=iris_data['Petal Length (cm)'],
              y=iris_data['Sepal Width (cm)'],  
              kind="kde")

In this plot, darker colors inside the box mark regions of higher density, much like taller points on a hill.