# Improving plots I - Coloring



So far you have learned the basics of plotting data. We've seen that data visualization is a very powerful tool that allows us to communicate a lot of information in a clear and concise way. By adding more elements to our plots, we can communicate even more information in a single graph. In this lesson we will focus on using color to enhance our plots.

Up until now, we have focused on using **matplotlib** for our plotting and **seaborn** to load example datasets. However, **seaborn** can also be used for plotting, and allows for easily customizable coloring of points. Here, we will learn about coloring our plots using **seaborn** plotting functions.

In [None]:
# import the packages we need: numpy, matplotlib, seaborn
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

For our first example, we will utilize the iris dataset from seaborn. Let's load the data and remind ourselves what information is contained in the dataset by looking at the first few rows.

In [None]:
# load iris and preview the data



Say we want to look at the relationship between `sepal_length` and `sepal_width` within our dataset. From our previous lessons, we learned that a scatterplot is a good tool for looking at the relationship between two numeric variables. We'll use the `sns.scatterplot` function to plot this.

In [None]:
# plot sepal_length vs sepal_width


# don't forget a title!
# We can add a title layer using plt.title as we've done before
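
As a rough sketch of what the cell above could contain: the block below builds a small hand-typed stand-in for the iris data (the name `iris_small` and its nine rows are assumptions made so the example runs without downloading anything) and draws a basic scatterplot. With the full dataset you would pass the DataFrame returned by `sns.load_dataset("iris")` instead.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hand-typed stand-in for the iris data (an assumption for this sketch):
# same column names as the seaborn dataset, only three rows per species.
iris_small = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.2, 3.1, 3.3, 2.7, 3.0],
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

# Start from a fresh figure so earlier plots are not drawn over
fig, ax = plt.subplots()

# One point per row: sepal_length on the x axis, sepal_width on the y axis
sns.scatterplot(data=iris_small, x="sepal_length", y="sepal_width", ax=ax)

# plt.title adds the title to the current axes, as in earlier lessons
plt.title("Sepal Length vs Sepal Width")
```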


If we want to make the overall trend easier to see, we can use the `sns.lmplot` function, which works similarly to `sns.scatterplot` but adds a linear trendline to help visualize the relationship between our x and y variables.

In [None]:
# plot sepal_length vs sepal_width with a linear trendline


# don't forget a title!
# We can add a title layer using plt.title as we've done before


## Coloring by a Categorical Variable

This gives us a general idea of the trend between `sepal_length` and `sepal_width`, but what if we want to explore the relationship at a more granular level? For example, we might want to see how this relationship differs between the species in our dataset. Let's first determine how many different species the dataset contains.

In [None]:
# find unique values within the species column of iris
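
One possible way to do this uses the pandas `unique` and `nunique` methods on the `species` column. The sketch below runs on a small hand-typed stand-in for the iris data (`iris_small` is a hypothetical name, not part of the lesson); the same two calls apply to the full DataFrame.

```python
import pandas as pd

# Hand-typed stand-in for the iris data (an assumption for this sketch):
# same column names as the seaborn dataset, only three rows per species.
iris_small = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.2, 3.1, 3.3, 2.7, 3.0],
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

# unique() returns each distinct value once, in order of first appearance
species_names = iris_small["species"].unique()

# nunique() counts the distinct values
n_species = iris_small["species"].nunique()

print(species_names)
print(n_species)
```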


We want to highlight each of the three species using different colored points in our scatterplot. We can do this easily using the `hue` parameter.

In [None]:
# plot sepal_length vs sepal_width colored by species


plt.title("Sepal Length vs Sepal Width by Species")

# the line below moves the legend outside of the plot borders
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
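
A minimal sketch of the `hue` parameter in action, using a small hand-typed stand-in for the iris data (`iris_small` is a hypothetical name introduced here so the block runs on its own). Passing a column name to `hue` asks seaborn to give each distinct value its own color and to build a legend for it.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hand-typed stand-in for the iris data (an assumption for this sketch):
# same column names as the seaborn dataset, only three rows per species.
iris_small = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.2, 3.1, 3.3, 2.7, 3.0],
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

fig, ax = plt.subplots()

# hue="species" colors each point by the value in the species column
sns.scatterplot(
    data=iris_small, x="sepal_length", y="sepal_width", hue="species", ax=ax
)

plt.title("Sepal Length vs Sepal Width by Species")

# Move the legend outside of the plot borders
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
```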

Similarly, we can use the `sns.lmplot` function to add a linear trendline for each species separately.

In [None]:
# plot sepal_length vs sepal_width colored by species, with a trendline per species


plt.title("Sepal Length vs Sepal Width by Species")
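
The per-species trendlines might be drawn along these lines. This is a sketch on a hand-typed stand-in for the iris data (`iris_small` is a hypothetical name); because that sample has only three rows per species, the confidence band is switched off with `ci=None`, which you would likely leave at its default on the full dataset.

```python
import pandas as pd
import seaborn as sns

# Hand-typed stand-in for the iris data (an assumption for this sketch):
# same column names as the seaborn dataset, only three rows per species.
iris_small = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.2, 3.1, 3.3, 2.7, 3.0],
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

# lmplot creates its own figure and returns a FacetGrid, not an Axes.
# hue="species" fits and draws one linear trendline per species.
g = sns.lmplot(
    data=iris_small,
    x="sepal_length",
    y="sepal_width",
    hue="species",
    ci=None,  # skip the bootstrapped confidence band for this tiny sample
)

# With no row/col faceting, the single Axes is available as g.ax
g.ax.set_title("Sepal Length vs Sepal Width by Species")
```

Note that `sns.lmplot` places its legend outside the plot area on its own, so the `plt.legend` adjustment used with `sns.scatterplot` is not needed here.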

Now we can draw some important conclusions about our data. When we looked at sepal length vs. sepal width across all of our data, we did not see a strong relationship between the two variables (as evidenced by the nearly flat trendline). However, when we explore this trend by species, we see that the *setosa* species has the strongest relationship between sepal length and sepal width (as evidenced by the steeply sloped trendline), and that the relationship is similarly moderate for the *versicolor* and *virginica* species.

You may also want to change the color palette you are working with. You can do this with the `palette` parameter. You can find built-in seaborn color palettes here: https://seaborn.pydata.org/tutorial/color_palettes.html

In [None]:
# plot sepal_length vs sepal_width colored by species, using a custom palette

plt.title("Sepal Length vs Sepal Width by Species")
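
As an illustration, the sketch below passes the built-in palette name `"Set2"` to `palette`; the choice of `"Set2"` is arbitrary, and any palette name from the page linked above should work the same way. The data is a hand-typed stand-in for iris (`iris_small` is a hypothetical name).

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hand-typed stand-in for the iris data (an assumption for this sketch):
# same column names as the seaborn dataset, only three rows per species.
iris_small = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.2, 3.1, 3.3, 2.7, 3.0],
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

fig, ax = plt.subplots()

# palette="Set2" swaps the default colors for the first three Set2 colors
sns.scatterplot(
    data=iris_small,
    x="sepal_length",
    y="sepal_width",
    hue="species",
    palette="Set2",
    ax=ax,
)

plt.title("Sepal Length vs Sepal Width by Species")
```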

If we want to further emphasize the difference between our datapoints by species, we can also change the point markers, using the `style` parameter.

In [None]:
# plot sepal_length vs sepal_width with color and marker style set by species


plt.title("Sepal Length vs Sepal Width by Species")
# the line below moves the legend outside of the plot borders
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

The `style` parameter can be used in conjunction with the `hue` parameter, or on its own.

In [None]:
# plot sepal_length vs sepal_width with marker style set by species (no hue)


plt.title("Sepal Length vs Sepal Width by Species")
# the line below moves the legend outside of the plot borders
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
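
To compare the two uses of `style` side by side, the sketch below draws both variants on one figure. The two-panel layout, the panel titles, and the stand-in data `iris_small` are assumptions made for illustration rather than part of the lesson.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hand-typed stand-in for the iris data (an assumption for this sketch):
# same column names as the seaborn dataset, only three rows per species.
iris_small = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.2, 3.1, 3.3, 2.7, 3.0],
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

fig, (ax_both, ax_style) = plt.subplots(1, 2, figsize=(11, 4))

# Left panel: species controls both the color and the marker shape
sns.scatterplot(
    data=iris_small, x="sepal_length", y="sepal_width",
    hue="species", style="species", ax=ax_both,
)
ax_both.set_title("Color and marker by species")

# Right panel: species controls only the marker shape; one color throughout
sns.scatterplot(
    data=iris_small, x="sepal_length", y="sepal_width",
    style="species", ax=ax_style,
)
ax_style.set_title("Marker by species only")

fig.tight_layout()
```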

## Coloring by a Continuous Variable

It is sometimes useful to color by a continuous variable rather than a categorical variable. For example, let's recreate our scatterplot from above, but instead of coloring by the `species` column, let's color by the `sepal_length` column. 

In [None]:
# plot sepal_length vs sepal_width colored by sepal_length


plt.title("Sepal Length vs Sepal Width")
# the line below moves the legend outside of the plot borders
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

Since `sepal_length` is a numeric (i.e. continuous) variable, seaborn automatically assigns a sequential color palette rather than a qualitative (categorical) one. As before, we can customize the color palette to suit our preferences.

In [None]:
# plot sepal_length vs sepal_width colored by sepal_length, using a custom palette


plt.title("Sepal Length vs Sepal Width")
# the line below moves the legend outside of the plot borders
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
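
A sketch of coloring by a numeric column, here with the `"viridis"` colormap as the custom palette (an arbitrary choice for illustration). The data is a hand-typed stand-in for iris (`iris_small` is a hypothetical name) in which every `sepal_length` value is different, so each point should receive its own shade.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hand-typed stand-in for the iris data (an assumption for this sketch):
# same column names as the seaborn dataset, only three rows per species.
iris_small = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.2, 3.1, 3.3, 2.7, 3.0],
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

fig, ax = plt.subplots()

# A numeric hue column is mapped onto a continuous color scale;
# palette="viridis" replaces seaborn's default sequential palette.
sns.scatterplot(
    data=iris_small,
    x="sepal_length",
    y="sepal_width",
    hue="sepal_length",
    palette="viridis",
    ax=ax,
)

plt.title("Sepal Length vs Sepal Width")

# Move the legend outside of the plot borders
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
```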

Another useful way to illustrate a continuous variable is the `size` parameter, which scales each point according to its value.

In [None]:
# plot sepal_length vs sepal_width sized by sepal_length


plt.title("Sepal Length vs Sepal Width")
# the line below moves the legend outside of the plot borders
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

By default, seaborn chooses the range of point sizes for us. We can adjust this range by passing a `(min, max)` tuple to the `sizes` parameter in addition to `size`.

In [None]:
# plot sepal_length vs sepal_width sized by sepal_length, with a custom size range


plt.title("Sepal Length vs Sepal Width")
# the line below moves the legend outside of the plot borders
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
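
A minimal sketch of `size` together with `sizes`, assuming a range of `(20, 200)` picked purely for illustration. The smallest `sepal_length` is drawn at the lower bound and the largest at the upper bound. The data is a hand-typed stand-in for iris (`iris_small` is a hypothetical name).

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hand-typed stand-in for the iris data (an assumption for this sketch):
# same column names as the seaborn dataset, only three rows per species.
iris_small = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.2, 3.1, 3.3, 2.7, 3.0],
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

fig, ax = plt.subplots()

# size= picks the column that drives point size;
# sizes=(min, max) sets the smallest and largest marker areas.
sns.scatterplot(
    data=iris_small,
    x="sepal_length",
    y="sepal_width",
    size="sepal_length",
    sizes=(20, 200),
    ax=ax,
)

plt.title("Sepal Length vs Sepal Width")

# Move the legend outside of the plot borders
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
```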

Again, the `size` parameter can be used either on its own or in conjunction with the `hue` parameter.

In [None]:
# plot sepal_length vs sepal_width colored and sized by sepal_length


plt.title("Sepal Length vs Sepal Width")
# the line below moves the legend outside of the plot borders
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

Furthermore, we can mix and match, illustrating a categorical variable with `hue` and a continuous variable with `size`.

In [None]:
# plot sepal_length vs sepal_width colored by species, sized by sepal_length


plt.title("Sepal Length vs Sepal Width by Species")
# the line below moves the legend outside of the plot borders
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
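
Combining the two encodings could look like the sketch below, where color carries the species and point size carries the sepal length. As in the other sketches, `iris_small` is a hypothetical hand-typed stand-in for the iris data.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hand-typed stand-in for the iris data (an assumption for this sketch):
# same column names as the seaborn dataset, only three rows per species.
iris_small = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.2, 3.1, 3.3, 2.7, 3.0],
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

fig, ax = plt.subplots()

# Categorical variable -> color, continuous variable -> point size
sns.scatterplot(
    data=iris_small,
    x="sepal_length",
    y="sepal_width",
    hue="species",
    size="sepal_length",
    ax=ax,
)

plt.title("Sepal Length vs Sepal Width by Species")

# Move the legend outside of the plot borders
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
```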

## Coloring with Other Plot Types

We can use the `hue` parameter in the same way with the other plot types we've learned about. The seaborn documentation has examples of how to color other plot types:

* Bar plot: https://seaborn.pydata.org/generated/seaborn.barplot.html
* Line graph: https://seaborn.pydata.org/generated/seaborn.lineplot.html
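
As one hedged example of `hue` on another plot type, the sketch below draws a grouped bar plot. The iris data has only one categorical column, so the block first reshapes a hand-typed stand-in (`iris_small`, a hypothetical name) into long form with pandas `melt`; the column names `measurement` and `cm` are introduced here for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hand-typed stand-in for the iris data (an assumption for this sketch):
# same column names as the seaborn dataset, only three rows per species.
iris_small = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.2, 3.1, 3.3, 2.7, 3.0],
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

# Reshape to long form: one row per (species, measurement) pair
iris_long = iris_small.melt(
    id_vars="species",
    value_vars=["sepal_length", "sepal_width"],
    var_name="measurement",
    value_name="cm",
)

fig, ax = plt.subplots()

# Bar height is the mean of "cm" per group; hue splits each
# measurement into one colored bar per species.
sns.barplot(data=iris_long, x="measurement", y="cm", hue="species", ax=ax)

ax.set_title("Mean Sepal Measurements by Species")
```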