<em><sub>This page is available as an executable or viewable <strong>Jupyter Notebook</strong>:</sub></em>
<br/><br/>
<a href="https://mybinder.org/v2/gh/JetBrains/lets-plot/v2.0.0demos1?filepath=docs%2Fexamples%2Fjupyter-notebooks%2Fcorrelation_plot.ipynb"
   target="_parent">
   <img align="left"
        src="https://mybinder.org/badge_logo.svg">
</a>
<a href="https://nbviewer.jupyter.org/github/JetBrains/lets-plot/blob/master/docs/examples/jupyter-notebooks/correlation_plot.ipynb"
   target="_parent">
   <img align="right"
        src="https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.png"
        width="109" height="20">
</a>
<br/>
<br/>

## Correlation Plot

The `corr_plot` builder takes a dataframe (either a Pandas `DataFrame` or a plain Python `dict`) as input and 
builds a correlation plot.

It lets you combine 'tile', 'point' and 'label' layers in a matrix of type 'full', 'lower' or 'upper'.

A call to the terminal `build()` method creates the resulting 'plot' object. 
This object can be further refined using the regular Lets-Plot (ggplot) API, for example `+ ggtitle()` or `+ ggsize()`.
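
To make the three matrix types mentioned above concrete, here is a minimal pure-Python sketch of which cells of an n-by-n correlation matrix each type keeps. The helper `visible_cells` and its `diag` flag are hypothetical and only serve as an illustration; they are not part of the Lets-Plot API, and the orientation of the 'lower' and 'upper' triangles on the rendered plot is an assumption.

```python
def visible_cells(n, matrix_type="full", diag=True):
    """Return the (row, col) cells kept for an n x n correlation matrix.

    Hypothetical helper for illustration only; it is not part of Lets-Plot.
    Assumes 'lower' means row >= col and 'upper' means row <= col.
    """
    if matrix_type not in ("full", "lower", "upper"):
        raise ValueError(f"unknown matrix type: {matrix_type!r}")
    cells = []
    for row in range(n):
        for col in range(n):
            if row == col and not diag:
                continue  # drop the trivial self-correlation cells
            if matrix_type == "lower" and row < col:
                continue
            if matrix_type == "upper" and row > col:
                continue
            cells.append((row, col))
    return cells


for kind in ("full", "lower", "upper"):
    print(kind, len(visible_cells(3, kind)))
```

Because a correlation matrix is symmetric, the 'lower' and 'upper' variants carry the same information as 'full', which is what makes it possible to show different layers in each triangle.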


The Ames Housing dataset for this demo was downloaded from [House Prices - Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=train.csv) (train.csv), (c) Kaggle.

In [1]:
import pandas as pd
from lets_plot import *
from lets_plot.bistro.corr import *

LetsPlot.setup_html()

The geodata is provided by © OpenStreetMap contributors and is made available here under the Open Database License (ODbL).


In [2]:
mpg_df = pd.read_csv('https://jetbrains.bintray.com/lets-plot/mpg.csv').drop(columns=['Unnamed: 0'])

In [3]:
def group(plots):
    """
    Arrange the plots in a two-column grid (450 x 350 px cells) using GGBunch.
    """
    bunch = GGBunch()
    for idx, p in enumerate(plots):
        x = (idx % 2) * 450    # column offset
        y = (idx // 2) * 350   # row offset
        bunch.add_plot(p, x, y)

    return bunch
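
The layout arithmetic in `group` can be checked without rendering anything. The sketch below reproduces the same offsets in plain Python; `grid_offsets` and its parameters are hypothetical names introduced here, with the 450 and 350 pixel cell sizes taken from the function above.

```python
def grid_offsets(count, columns=2, cell_width=450, cell_height=350):
    """Top-left (x, y) offset of each plot in a row-major grid.

    Hypothetical helper mirroring the arithmetic used in group().
    """
    return [
        ((idx % columns) * cell_width, (idx // columns) * cell_height)
        for idx in range(count)
    ]


print(grid_offsets(4))
# → [(0, 0), (450, 0), (0, 350), (450, 350)]
```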

### Combining 'tile', 'point' and 'label' layers.

When combining layers, `corr_plot` chooses an acceptable plot configuration by default.

In [4]:
group([
    corr_plot(mpg_df).tiles().build() + ggtitle("Tiles"),
    corr_plot(mpg_df).points().build() + ggtitle("Points"), 
    corr_plot(mpg_df).tiles().labels().build() + ggtitle("Tiles and labels"),
    corr_plot(mpg_df).points().labels().tiles().build() + ggtitle("Tiles, points and labels")
])

The default plot configuration adapts to the options you set: compare the 'Tiles and labels' plot above and below.

You can also override the default plot configuration using the `type` parameter: compare the 'Tiles, points and labels' plot above and below.

In [5]:
group([
    corr_plot(mpg_df).tiles().labels(color="white").build() + ggtitle("Tiles and labels"),
    (corr_plot(mpg_df)
     .tiles(type="upper")
     .points(type="lower")
     .labels(type="full").build() + ggtitle("Tiles, points and labels"))
])

### Customizing colors.

Instead of the default blue-grey-red gradient you can define your own lower-middle-upper colors, or 
choose one of the available 'Brewer' diverging palettes.

Let's create a gradient resembling one of the Seaborn gradients.

In [6]:
bld = corr_plot(mpg_df).points().labels().tiles()

# Configure gradient resembling one of Seaborn gradients.
gradient = (bld
            .palette_gradient(low='#417555', mid='#EDEDED', high='#963CA7')
            .build()) + ggtitle("Custom gradient")

# Configure Brewer 'BrBG' palette.
brewer = (bld
            .palette_BrBG()
            .build()) + ggtitle("Brewer")


In [7]:
group([
    gradient,
    brewer
])
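
Conceptually, a three-color diverging gradient maps a correlation value in [-1, 1] to a color between the low, mid and high anchors. The following is a minimal sketch assuming plain linear interpolation in RGB space with the midpoint at 0; Lets-Plot may interpolate in a different color space, and `hex_to_rgb` and `corr_to_hex` are hypothetical helpers, not library functions.

```python
def hex_to_rgb(color):
    """Convert '#RRGGBB' to a tuple of three ints."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def corr_to_hex(value, low="#417555", mid="#EDEDED", high="#963CA7"):
    """Map a correlation in [-1, 1] to a hex color.

    Assumption: linear RGB interpolation with the midpoint at 0.
    """
    value = max(-1.0, min(1.0, value))  # clamp to the valid range
    start, end = (mid, high) if value >= 0 else (mid, low)
    t = abs(value)  # 0 at the midpoint, 1 at either extreme
    channels = (
        round(a + (b - a) * t)
        for a, b in zip(hex_to_rgb(start), hex_to_rgb(end))
    )
    return "#{:02X}{:02X}{:02X}".format(*channels)


for r in (-1.0, 0.0, 1.0):
    print(r, corr_to_hex(r))
```

The default colors in the sketch are the same low, mid and high values passed to `palette_gradient()` above, so the three printed colors are the anchors of that gradient.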

### Correlation plot with large number of variables in dataset.

The [Kaggle House Prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=train.csv) dataset contains 81 variables.

In [8]:
housing_df = pd.read_csv("../data/Ames_house_prices_train.csv")
housing_df.shape

(1460, 81)


A correlation plot showing all the correlations in this dataset is too large to be useful. 

In [9]:
corr_plot(housing_df).tiles(type='lower').palette_BrBG().build()


#### The 'threshold' parameter.

The 'threshold' parameter lets us specify a minimum absolute correlation value; correlations below it are not shown.

In [10]:
(corr_plot(housing_df, threshold=.5).tiles(diag=False).palette_BrBG().build() 
 + ggtitle("Threshold: 0.5")
 + ggsize(550, 400))
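
As a rough illustration of what `threshold` filters on, the sketch below computes Pearson correlations for a few toy columns in plain Python and keeps only the pairs whose absolute correlation reaches the threshold. The data values are made up, `pearson` and `strong_pairs` are hypothetical helpers, and whether Lets-Plot treats the boundary value as included is an assumption.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def strong_pairs(columns, threshold):
    """Return {(name_a, name_b): r} for pairs with |r| >= threshold."""
    result = {}
    for (name_a, xs), (name_b, ys) in combinations(columns.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= threshold:
            result[(name_a, name_b)] = r
    return result


# Toy data, invented for illustration only.
toy = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],   # almost a linear function of "a"
    "c": [5, 4, 3, 1, 2],    # tends to decrease as "a" grows
    "d": [3, 1, 4, 1, 5],    # only weakly related to "a"
}

for pair, r in strong_pairs(toy, threshold=0.8).items():
    print(pair, round(r, 2))
```

Note that the filter uses the absolute value, so strong negative correlations survive it just like strong positive ones.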


Let's further increase our threshold in order to see only highly correlated variables.



In [11]:
(corr_plot(housing_df, threshold=.8)
 .tiles()
 .labels(color='white')
 .palette_BrBG().build() 
 + ggtitle("Threshold: 0.8")
 + ggsize(550, 400))


This plot suggests possible improvements to the dataset, for example when preparing it for 
training a linear regression model.

For instance, the variables 'GarageArea' and 'GarageCars' are most likely not both needed for model training; 
they can be removed and replaced with a single synthetic variable.
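
One simple way to build such a synthetic variable is to standardize both columns and average them. The sketch below does this in plain Python on a handful of made-up values; the numbers are not taken from the dataset, `zscores` and `combine_standardized` are hypothetical helpers, and averaging z-scores is only one of several reasonable choices (keeping just one of the two columns is another).

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize values to zero mean and unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def combine_standardized(first, second):
    """Average the z-scores of two equally long columns, row by row."""
    return [(a + b) / 2 for a, b in zip(zscores(first), zscores(second))]


# Made-up values standing in for 'GarageArea' and 'GarageCars'.
garage_area = [250, 480, 520, 700, 850]
garage_cars = [1, 2, 2, 3, 3]

garage_score = combine_standardized(garage_area, garage_cars)
print([round(v, 2) for v in garage_score])
```

Standardizing first keeps the column with the larger numeric range (here the area) from dominating the combined score.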


### The 'stat_corr()' function.

The 'stat_corr()' function simply creates a bare-bones `ggplot` layer. 

In practice, what you get is a basic heatmap, so using the 'stat_corr' function is rarely justified
unless you are willing to tweak it further to achieve a look that you couldn't otherwise get with the 'corr_plot' builder API.

In [12]:
group([
    (ggplot(mpg_df) 
     + stat_corr() 
     + ggtitle("Basic Heatmap") + ggsize(450, 400)),
    (ggplot(mpg_df, aes(fill='..corr_abs..'))
     + stat_corr(geom='point', type="lower", diag=False, shape=5)
     + stat_corr(geom='text', type="lower", diag=False, label_format='.1f')
     + scale_color_gradient(low='red', high='blue', guide='none')
     + stat_corr(geom='tile', type="upper", diag=True, threshold=.8)
     + stat_corr(geom='text', type="upper", diag=True, threshold=.8, label_format='.1f', size=1.2, color='white')
     + scale_fill_gradient2(low='white', mid='black', high='white', midpoint=.999, guide='none')
     + ggtitle("Weird Corr Plot") + ggsize(450, 400) + theme(axis_line='blank'))
])
