# A Litany of Pandas Plotting Examples

* Pandas, building on Matplotlib, provides a wide range of built-in visualizations

In [None]:
# load up the libraries we need

%matplotlib inline
import numpy as np
import pandas as pd

In [None]:
df = pd.read_csv("diabetes.csv")
# Drop rows that contain missing values
df = df.dropna()
df.head(5)

## Density Plot

* Visualises the distribution of data over a continuous interval or time period
* A variation of a Histogram that uses kernel smoothing to plot values, which smooths out the noise and gives a smoother picture of the distribution.
* The peaks of a Density Plot help display where values are concentrated over the interval.

https://datavizcatalogue.com/methods/density_plot.html
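
To make "kernel smoothing" a little more concrete, here is a minimal sketch of a Gaussian kernel density estimate written directly in NumPy. It is not what `plot.density()` runs internally; the helper name `gaussian_kde_1d`, the synthetic data, and the choice of Scott's rule of thumb for the bandwidth are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point."""
    samples = np.asarray(samples, dtype=float)
    # Standardised distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Synthetic stand-in for a numeric column such as "chol"
rng = np.random.default_rng(0)
values = rng.normal(loc=200, scale=40, size=300)

# Scott's rule of thumb: wider bumps for noisier or smaller samples
bandwidth = values.std(ddof=1) * len(values) ** (-1 / 5)
grid = np.linspace(values.min() - 4 * bandwidth, values.max() + 4 * bandwidth, 500)
density = gaussian_kde_1d(values, grid, bandwidth)

# A density should integrate to roughly 1; its peak marks where values concentrate
area = float(np.sum(density) * (grid[1] - grid[0]))
peak = float(grid[np.argmax(density)])
print(f"area under curve: {area:.3f}, peak near: {peak:.1f}")
```

A smaller bandwidth follows the data more closely (and keeps more of the noise), while a larger one gives a smoother curve.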


In [None]:
# overlay both densities on one set of axes; legend=True labels each curve
df["chol"].plot.density(legend=True)
df["weight"].plot.density(legend=True);

## Box Plot

* A convenient way of visually displaying the distribution of the data through its quartiles
* The lines extending from the boxes are known as the whiskers and indicate variability outside the upper and lower quartiles.
* Outliers are sometimes plotted as individual dots that are in line with the whiskers.
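
The numbers behind a box plot can be computed by hand. The sketch below uses made-up values and the common 1.5 × IQR convention for the whiskers (Matplotlib's default for its `whis` setting); other conventions exist, so treat the fences as an assumption.

```python
import numpy as np

# Made-up sample with one obviously extreme value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

# The box spans the lower quartile to the upper quartile, with a line at the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points still inside the fences
whisker_low = values[values >= lower_fence].min()
whisker_high = values[values <= upper_fence].max()
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("box:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```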


In [None]:
df["chol"].plot.box();

## Histogram

* Visualises the distribution of data over a continuous interval or certain time period. 
* Each bar in a histogram represents the tabulated frequency at each interval/bin.
* Helps give an estimate as to where values are concentrated, what the extremes are and whether there are any gaps or unusual values.

https://datavizcatalogue.com/methods/histogram.html
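
The "tabulated frequency at each interval/bin" is easy to inspect without drawing anything. This is a small sketch with made-up values, using `np.histogram` to return the counts and bin edges that a histogram would draw as bars.

```python
import numpy as np

# Made-up values: a cluster around 3, a gap, and one unusual value at 9
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9], dtype=float)

# Fine bins reveal the gap between 6 and 8
fine_counts, fine_edges = np.histogram(values, bins=8)

# Coarse bins hide the gap and only show the overall skew
coarse_counts, coarse_edges = np.histogram(values, bins=2)

print("fine:  ", fine_counts, fine_edges)
print("coarse:", coarse_counts, coarse_edges)
```

The zero counts in the fine binning are the "gaps" a histogram helps you spot, and they disappear when the bins are too wide.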



In [None]:
df["chol"].hist(bins=50);

In [None]:
# use the bins parameter to adjust the granularity
df["chol"].hist(bins=10);

## Scatter Plot

* Uses a collection of points placed using Cartesian coordinates to display values from two variables.
* By displaying a variable in each axis, you can detect if a relationship or correlation between the two variables exists.

https://datavizcatalogue.com/methods/scatterplot.html
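
A scatter plot shows a relationship visually; the Pearson correlation coefficient puts a number on its linear part. The sketch below uses synthetic data (the column names are invented for illustration) and `Series.corr`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=200)

# Synthetic columns: one built from x plus noise, one independent of x
frame = pd.DataFrame({
    "x": x,
    "related": 2 * x + rng.normal(scale=0.5, size=200),
    "unrelated": rng.normal(size=200),
})

# Pearson correlation: near +1 or -1 for a strong linear trend, near 0 for none
r_related = frame["x"].corr(frame["related"])
r_unrelated = frame["x"].corr(frame["unrelated"])
print(f"related: {r_related:.2f}, unrelated: {r_unrelated:.2f}")
```

A coefficient near zero only rules out a linear relationship, so it is still worth looking at the plot itself.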


In [None]:
df.plot(kind='scatter', x='chol', y='weight', title="Cholesterol vs. Weight");

In [None]:
# The c parameter may be given as the name of a column to provide colors for each point
df.plot(kind='scatter', x='chol', y='weight', 
        c='stab.glu', 
        title="Cholesterol vs. Weight");

In [None]:
# the s parameter can be used to adjust the size of the points
df.plot(kind='scatter', x='chol', y='weight', 
        c='stab.glu', 
        s=df["age"], 
        title="Cholesterol vs. Weight");

## Scatterplot Matrix

Scatterplot matrices are a quick way to roughly determine whether there is a linear correlation between multiple variables. This is particularly helpful in pinpointing specific variables that might have similar correlations.

https://www.r-bloggers.com/scatterplot-matrices/
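
A numeric companion to the scatterplot matrix is the correlation matrix, which holds one coefficient per panel. The sketch below uses synthetic columns named `a`, `b` and `c` (assumptions for illustration) and picks out the most strongly correlated pair.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
a = rng.normal(size=300)

# Synthetic data: b tracks a closely, c is independent noise
frame = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.3, size=300),
    "c": rng.normal(size=300),
})

# One Pearson coefficient for every pair of columns
corr = frame.corr()

# Ignore the diagonal (every column correlates perfectly with itself)
strength = corr.abs().to_numpy().copy()
np.fill_diagonal(strength, 0.0)
i, j = np.unravel_index(np.argmax(strength), strength.shape)
strongest_pair = (corr.index[i], corr.columns[j])

print(corr.round(2))
print("strongest pair:", strongest_pair)
```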

In [None]:
from pandas.plotting import scatter_matrix
scatter_matrix(df[['chol', 'stab.glu', 'hdl', 'ratio', 'glyhb']], # Make a scatter matrix of the selected columns
               figsize=(30, 30), # Set plot size
               diagonal='kde');  # Show distribution estimates on diagonal

## Bubble Chart

* Uses a Cartesian coordinate system to plot points along a grid where the X and Y axes are separate variables.
* Each point is assigned a label or category
* Each plotted point then represents a third variable by the area of its circle.
* Colours can be used to distinguish between categories or used to represent an additional data variable.
* Used to compare and show the relationships between categorised circles, by the use of positioning and proportions.
* The overall picture can be used to analyse for patterns/correlations.
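
Matplotlib interprets the `s` parameter as a marker area (in points squared), so raw column values often need rescaling before they make readable bubbles. The helper below is a hypothetical sketch, not part of pandas: `bubble_areas` and its default area range are assumptions chosen for illustration.

```python
import numpy as np

def bubble_areas(values, min_area=10.0, max_area=300.0):
    """Linearly map values onto marker areas between min_area and max_area."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # All values equal: give every bubble the same mid-sized area
        return np.full_like(values, (min_area + max_area) / 2)
    return min_area + (values - values.min()) / span * (max_area - min_area)

# Made-up values for the third variable
areas = bubble_areas([50, 100, 200])
print(areas)
```

The result can be passed as `s=` in a scatter call. Note the design choice: a min-max mapping keeps the smallest bubble visible, but areas are then no longer strictly proportional to the values.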




In [None]:
df.plot(kind='scatter', x='chol', y='weight', 
        c='hdl', 
        s=df['stab.glu'] / 2, 
        title="Cholesterol vs. Weight");

## Hexbin Plot 


* Hexbin plots can be a useful alternative to scatter plots or bubble charts if your data are too dense to plot each point individually.
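
Under the hood a hexbin plot is a two-dimensional histogram with hexagonal cells: each hexagon's colour encodes how many points fall inside it. The sketch below uses synthetic data and reads the per-hexagon counts back from the plot. It assumes the hexagons are the first collection on the returned axes, which holds for a fresh plot in the versions I have used.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; not needed with %matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic cloud that would be badly overplotted as a scatter plot
rng = np.random.default_rng(3)
frame = pd.DataFrame({"x": rng.normal(size=5000), "y": rng.normal(size=5000)})

ax = frame.plot.hexbin(x="x", y="y", gridsize=20)

# The hexagons are stored as a collection; its array holds the count per hexagon
counts = ax.collections[0].get_array()
total = int(np.sum(counts))
busiest = int(np.max(counts))
print(f"{total} points binned, busiest hexagon holds {busiest}")

plt.close("all")
```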


In [None]:
# generate a hexbin plot
df.plot.hexbin(x='chol', y='weight', gridsize=20);


In [None]:
# use the gridsize parameter to adjust the granularity
df.plot.hexbin(x='chol', y='weight', gridsize=50);

## Bar Chart

* Uses either horizontal or vertical bars to show discrete, numerical comparisons across categories. 
* One axis of the chart shows the specific categories being compared and the other axis represents a discrete value scale.
* Differs from a Histogram in that it does not display continuous developments over an interval.
* The discrete data in a Bar Chart is categorical, so the chart answers the question of "how many?" in each category.

https://datavizcatalogue.com/methods/bar_chart.html


In [None]:
# For continuous data:

df["chol"].plot.bar();

* ACK! That isn't very useful: a continuous column produces one bar per row.
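
One possible way to get a readable bar chart out of a continuous column is to bin it first and plot the count per bin. This sketch uses made-up values, and the cut points and labels are arbitrary choices for illustration, not clinical thresholds.

```python
import pandas as pd

# Made-up continuous values standing in for a column such as "chol"
chol = pd.Series([150, 180, 195, 205, 220, 239, 240, 310], name="chol")

# Turn the continuous values into three ordered categories;
# right=False makes each bin include its left edge, e.g. [200, 240)
bands = pd.cut(
    chol,
    bins=[0, 200, 240, float("inf")],
    right=False,
    labels=["<200", "200-239", ">=240"],
)

# Count rows per category, keeping the categories in their natural order
band_counts = bands.value_counts().sort_index()
print(band_counts)
```

Chaining `.plot(kind='bar', rot=0)` onto `band_counts` then gives one bar per band instead of one bar per row.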

In [None]:
# For discrete data
df["sex"].value_counts().sort_index().plot(kind='bar', rot=0);

## RadViz

* RadViz is a way of visualizing multi-variate data. 
* It is based on a simple spring tension minimization algorithm. 
* Basically you set up a bunch of points in a plane. In our case they are equally spaced on a unit circle. Each point represents a single attribute. 
* You then pretend that each sample in the data set is attached to each of these points by a spring, the stiffness of which is proportional to the numerical value of that attribute (they are normalized to unit interval). 
* The point in the plane, where our sample settles to (where the forces acting on our sample are at an equilibrium) is where a dot representing our sample will be drawn. 
* Depending on which class that sample belongs to, it will be colored differently.

https://pandas.pydata.org/pandas-docs/stable/visualization.html#visualization-radviz
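
The spring description above boils down to a weighted average: with springs pulling toward each anchor, the equilibrium point is the mean of the anchor positions weighted by the normalized attribute values. Here is a minimal NumPy sketch of that calculation. The function name `radviz_positions` and the sample data are assumptions, and this is an illustration of the idea rather than the pandas source.

```python
import numpy as np

def radviz_positions(data):
    """Return the 2-D resting point for each row of a numeric array."""
    data = np.asarray(data, dtype=float)

    # Normalise every attribute (column) to the unit interval
    lo, hi = data.min(axis=0), data.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    weights = (data - lo) / span

    # One anchor per attribute, equally spaced on the unit circle
    n_attrs = data.shape[1]
    angles = 2 * np.pi * np.arange(n_attrs) / n_attrs
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])

    # Spring equilibrium: stiffness-weighted mean of the anchor positions
    totals = weights.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # a row of all zeros stays at the origin
    return (weights @ anchors) / totals

samples = np.array([
    [1.0, 0.0, 0.0, 0.0],  # only attribute 0 pulls: lands on anchor 0
    [0.0, 1.0, 0.0, 0.0],  # only attribute 1 pulls: lands on anchor 1
    [1.0, 1.0, 1.0, 1.0],  # all attributes pull equally: lands in the centre
    [0.0, 0.0, 0.0, 0.0],  # nothing pulls: stays in the centre
])
positions = radviz_positions(samples)
print(positions.round(3))
```

Samples dominated by one attribute end up near that attribute's anchor, while balanced samples collect near the centre.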


In [None]:
iris_data = pd.read_csv("iris.csv")
iris_data.head()

In [None]:
from pandas.plotting import radviz
radviz(iris_data, 'Name');

In [None]:
radviz(df[["chol", "weight", "age", "height", "location"]], "location");

## Parallel Coordinates Plot

* Used for plotting multivariate, numerical data. 
* Ideal for comparing many variables together and seeing the relationships between them. 
* Examples:
    * Comparing computer or car specs across different models
    * Comparing drug efficacy across patient cohorts
* Each variable is given its own axis and all the axes are placed in parallel to each other. 
* Each axis can have a different scale, as each variable works off a different unit of measurement, or all the axes can be normalised to keep all the scales uniform.
* Values are plotted as a series of lines that connect across all the axes.

https://datavizcatalogue.com/methods/parallel_coordinates.html
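
When the variables use very different units, one common option is to min-max normalise each column to the range 0 to 1 before plotting, so that no single axis dominates. The sketch below uses a made-up table of car specs; the column names and values are invented for illustration.

```python
import pandas as pd

# Made-up specs with very different scales per column
specs = pd.DataFrame({
    "model": ["A", "B", "C"],
    "horsepower": [90, 150, 210],
    "weight_kg": [1000, 1500, 1250],
})

# Rescale every numeric column to the unit interval
numeric = specs.drop(columns="model")
scaled = (numeric - numeric.min()) / (numeric.max() - numeric.min())

# Put the class column back so each line can be coloured by model
scaled["model"] = specs["model"]
print(scaled)
```

The rescaled frame can then be passed to `parallel_coordinates(scaled, 'model')` in place of the raw values.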

In [None]:
from pandas.plotting import parallel_coordinates
parallel_coordinates(iris_data, 'Name', colormap='gist_rainbow');


## Pie Chart 

* Yup, you can make them.

In [None]:
# For continuous data (pretty useless)
df[["chol"]].plot.pie(y='chol', subplots=False, figsize=(8, 4));

In [None]:
# For discrete data: one slice per category, sized by its count
df["sex"].value_counts().sort_index().plot.pie(figsize=(8, 4));

In [None]:
# For another discrete column
df["frame"].value_counts().sort_index().plot.pie(figsize=(8, 4));