## Scatter plots

Here is a step-by-step code-along that takes you from the default output of a scatter plot to a polished, customized visualization.

The goal of data visualization is to learn and communicate insights about a dataset. A scatter plot does this by showing the relationship between two variables in the dataset.
At the end of the notebook you will know how to:

    * Create scatter plots
    * Change color schemes
    * Work with colormaps

We'll use pandas DataFrames as the basis for these exercises, since that is the common case when doing exploratory data analysis (EDA).

We will be working with the Wine dataset from the UCI Machine Learning Repository. More information is available at https://archive.ics.uci.edu/ml/datasets/wine


source: https://betterprogramming.pub/how-to-use-colormaps-with-matplotlib-to-create-colorful-plots-in-python-969b5a892f0c by Elizabeth Ter Sahakyan

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Use the Wine dataset from the UCI Machine Learning Repository

# Define column headers
columns = ['Alcohol','Malic acid','Ash','Alcalinity of ash', 'Magnesium','Total phenols','Flavanoids',
           'Nonflavanoid phenols','Proanthocyanins','Color intensity','Hue','OD280/OD315 of diluted wines','Proline']

# read_csv directly from the URL. Each row of wine.data has 14 fields (a class
# label followed by the 13 attributes named above), so with only 13 names
# pandas uses the class label as the index.
wine_df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', names = columns)
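
A detail worth knowing about the cell above: each row of `wine.data` holds a class label followed by the 13 attributes, while only 13 names are passed to `read_csv`. The sketch below illustrates how pandas handles that mismatch. It uses two made-up rows and a shortened, hypothetical set of column names so that it runs without a network connection.

```python
import io
import pandas as pd

# Two made-up rows in the same layout as wine.data: class label first,
# then the measurements (only two here, to keep the sketch short).
raw = "1,14.23,5.64\n2,12.37,1.95\n"

# Fewer names than fields: pandas uses the leading field as the index.
implicit = pd.read_csv(io.StringIO(raw), names=["Alcohol", "Color intensity"])

# Naming every field keeps the class label as a regular column instead.
explicit = pd.read_csv(io.StringIO(raw), names=["Class", "Alcohol", "Color intensity"])

print(implicit)
print(explicit)
```

Either form works for the plots in this notebook, since they only use the 'Alcohol' and 'Color intensity' columns. Naming the class column explicitly would matter if you later wanted to color points by wine class.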

In [None]:
# Take a look at the data
wine_df.head()

In [None]:
wine_df.describe().round(2)

In [None]:
# Set up the figure and axes using the object-oriented interface
fig, ax1 = plt.subplots(figsize=(10,8))

# Labels
ax1.set_xlabel('Alcohol')
ax1.set_ylabel('Color Intensity')
ax1.set_title('Relationship Between Color Intensity and Alcohol Content in Wines')

# Plot
ax1.scatter(wine_df['Alcohol'], # values for the x axis
            wine_df['Color intensity'], # values for the y axis
            s = 300); # marker size (area in points squared)

The simplest way to change the color is to pass the 'color' parameter when calling the scatter function.

In [None]:
# Same setup as before
fig, ax1 = plt.subplots(figsize=(10,8))

ax1.set_xlabel('Alcohol')
ax1.set_ylabel('Color Intensity')
ax1.set_title('Relationship Between Color Intensity and Alcohol Content in Wines')

# Set the color inside the scatter method and adjust 'alpha' to add transparency
ax1.scatter(wine_df['Alcohol'],
            wine_df['Color intensity'],
            s = 300,
            color = 'green',
            alpha = 0.5);
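
The 'color' argument is not limited to color names. As a small sketch (the variable names are only for illustration), the following uses `matplotlib.colors.to_rgba` to show that a named color, a hex string, and an RGB tuple can all describe the same green, and that transparency can be folded into the color itself.

```python
from matplotlib import colors

# Three ways of writing the same green; each is a valid value for `color`.
named = colors.to_rgba("green")                      # named color
hex_code = colors.to_rgba("#008000")                 # hex string
rgb_tuple = colors.to_rgba((0.0, 128 / 255, 0.0))    # RGB tuple, values in [0, 1]

# Alpha can be attached to the color instead of being passed separately.
translucent = colors.to_rgba("green", alpha=0.5)

print(named, hex_code, rgb_tuple, translucent)
```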

### Colormaps

It is possible to color the scatter plot points based on a value in the dataset by using a colormap. To use a matplotlib colormap, leave out the color argument and pass 'c' and 'cmap' to scatter() instead.

c is the array of numbers that will be mapped onto the colors of the colormap, so point it at the column in the data that should determine each point's color. cmap is the name of the colormap to use.
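
To make the mapping concrete, here is a minimal sketch of what happens to the numbers passed as c. It assumes the default behavior, where the values are scaled linearly between their minimum and maximum, and it uses made-up values rather than the wine data.

```python
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize

# Made-up values standing in for a column such as wine_df['Color intensity'].
values = [1.0, 4.0, 7.0, 13.0]

# By default, scatter() scales `c` linearly so that min -> 0 and max -> 1.
norm = Normalize(vmin=min(values), vmax=max(values))

cmap = plt.get_cmap("spring")
cmap_r = plt.get_cmap("spring_r")  # the '_r' suffix reverses the colormap

for v in values:
    position = float(norm(v))  # position along the colormap, between 0 and 1
    print(v, round(position, 2), cmap(position), cmap_r(position))
```

Each printed color is an RGBA tuple. The smallest value lands at one end of the colormap and the largest at the other, and the reversed colormap swaps those ends.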

There are many colormaps to choose from and the idea is to find one that works for the specifics of your dataset. Consider the following when choosing:

    * Whether you are representing form or metric data
    * Your knowledge of the dataset (e.g., is there a critical value from which the other values deviate?)
    * Whether there is an intuitive color scheme for the parameter you are plotting
    * Whether there is a standard in the field that the audience may be expecting

For more, read:

    * Color Sequences for Univariate Maps by Colin Ware: http://ccom.unh.edu/sites/default/files/publications/Ware_1988_CGA_Color_sequences_univariate_maps.pdf
    * All about colorblindness: https://www.color-blindness.com/
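
As one illustration of the "critical value" consideration above, a diverging colormap can be centered on that value. The sketch below assumes a hypothetical threshold of 5.0 and made-up data. `TwoSlopeNorm` and the 'coolwarm' colormap are part of matplotlib, but choosing them here is only an example, not a recommendation taken from the sources above.

```python
import matplotlib.pyplot as plt
from matplotlib.colors import TwoSlopeNorm

# Made-up data; 5.0 is a hypothetical critical value chosen for illustration.
x = [11.0, 12.0, 13.0, 14.0, 15.0]
y = [2.0, 3.5, 5.0, 8.0, 12.0]
critical = 5.0

# Map the critical value to the middle of the colormap, even though it is
# not halfway between the smallest and largest value.
norm = TwoSlopeNorm(vmin=min(y), vcenter=critical, vmax=max(y))

fig, ax = plt.subplots(figsize=(6, 4))
points = ax.scatter(x, y, c=y, cmap="coolwarm", norm=norm, s=300)
fig.colorbar(points, ax=ax, label="value (diverging around 5.0)")
```

Points below the threshold take colors from one half of the colormap and points above it from the other half, so the deviation from the critical value is visible at a glance.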

In [None]:
# Same setup again
fig, ax1 = plt.subplots(figsize=(10,8))

ax1.set_xlabel('Alcohol')
ax1.set_ylabel('Color Intensity')
ax1.set_title('Relationship Between Color Intensity and Alcohol Content in Wines')

# Plot!
sc = ax1.scatter(wine_df['Alcohol'],
                 wine_df['Color intensity'],
                 c = wine_df['Color intensity'], # values mapped to colors
                 cmap = 'spring_r', # try removing '_r' (which reverses the colormap) from the end of the cmap name
                 s = 300,
                 alpha = 0.5)
cbar = fig.colorbar(sc, ax=ax1)
cbar.set_label('Color Intensity')