# Data Visualization: Implementation of treemaps in R and Python
#### Author: Alexander Ullmann

## Outline
1. Introduction to treemaps
2. Quality of Life Categorical Data
3. Implementation in R
4. (Implementation in SAS)
5. Implementation in Python
6. Comparison of different implementation methods
7. Exercises

## 1. Introduction to treemaps
In information visualization and computing, treemapping is a method for displaying hierarchical data using nested figures, usually rectangles. The rectangles represent areas proportional to specified dimensions of the data. Often the leaf nodes are colored to show a separate dimension of the data. [source: wikipedia/Treemapping](https://en.wikipedia.org/wiki/Treemapping). 
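
To make the idea of area-proportional rectangles concrete, here is a minimal sketch in `Python` that splits a unit square into vertical slices. The function name `slice_layout` and the example shares are our own illustrative assumptions. Plotting libraries usually rely on more elaborate tiling algorithms (for example squarified layouts), so treat this as a simplified illustration only.

```python
def slice_layout(values, x=0.0, y=0.0, width=1.0, height=1.0):
    """Split one rectangle into vertical slices whose areas are
    proportional to `values` (a very simple 'slice' treemap layout)."""
    total = sum(values)
    rects = []
    offset = x
    for value in values:
        slice_width = width * value / total  # width share of this value
        rects.append((offset, y, slice_width, height))
        offset += slice_width  # the next slice starts where this one ends
    return rects


# Illustrative shares (made up): three categories covering 60%, 30% and 10%.
shares = [0.6, 0.3, 0.1]
for share, (rx, ry, rw, rh) in zip(shares, slice_layout(shares)):
    print(f"share={share:.1f} -> x={rx:.2f}, width={rw:.2f}, area={rw * rh:.2f}")
```

Each slice has the full height of the square, so its area equals its width and therefore its share of the total.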
<br>
<br>
In this tutorial we show you how to build treemaps in `R` and `Python`. We will be using data from a questionnaire on satisfaction with quality of life. Similar data are often found in clinical research, where quality of life and satisfaction after a medical procedure or medication regimen are of interest.
<br>
***
The basic idea of our approach is to give an efficient visual overview of the data. The aim is to provide a first comprehensive look at the distribution of an outcome variable, in this case a pseudo-continuous one (on a Likert scale), across different subgroup variables that are categorical in nature. From there, a more sophisticated analysis could follow, for example with regression models and the like.
***

## 2. Quality of Life Categorical Data

The data which we will use is provided by GESIS - Leibniz-Institut für Sozialwissenschaften (2016): German General Social Survey - ALLBUS 2014. GESIS Datenarchiv, Köln. ZA5242 Datenfile Version 1.0.0, http://dx.doi.org/10.4232/1.12437. This data set was put online by the PSI Organisation within the [Wonderful Wednesdays Workshops](https://www.psiweb.org/sigs-special-interest-groups/visualisation/welcome-to-wonderful-wednesdays). This data is publicly available and made free for usage in their [github repository](https://github.com/VIS-SIG/Wonderful-Wednesdays/tree/master/data/2020/2020-05-13) under the CC0 1.0 Universal license. 
<br>
<br>

The primary end point was overall satisfaction with life.
<br>
<br>

### The question to participants was:
* How would you rate your satisfaction with your life overall on a scale from 0-10? (10=completely satisfied)
<br>
<br>

### Categorical variables that were also collected:
* Age
* BMI
* Working hours
* Doctor visits per year
* Net income
* Smoker
* Gender
* Employment status
* Graduation
* Graduation of father
* Graduation of mother
* Highest educational grade (f=father/m=mother)
<br>
<br>

### Aim: What are the most relevant factors for quality of life/satisfaction in life? 
We want to provide a first graphical overview for the data that might facilitate a further analysis to answer this question. A deep dive into statistical modeling of the data is not intended to be part of this tutorial. 

## 3. (Implementation in SAS)
When it comes to SAS and treemaps, there is no easy-to-use solution in SAS® Base. However, it is possible to create treemaps with [SAS Visual Analytics](https://documentation.sas.com/?docsetId=vaobj&docsetTarget=p0cvtwmyn64desn1w721b6u8tzur.htm&docsetVersion=8.1&locale=en). Note that this software comes with a price and may therefore not be easily accessible throughout the departments at your company or in your private work environment.


## 4. Implementation in Python
For the following code to work you should switch your Jupyter Kernel from `R` to `Python`. We are using `Python 3`.

You can use Anaconda Navigator to install packages for your environment. Make sure that your kernel/Python environment has the packages installed. Alternatively, use the Anaconda Prompt to install the packages like this:
* activate the environment (for example, if the environment for your Python kernel is named "my_python_env"): `conda activate my_python_env`
* `pip install pandas`
* `pip install matplotlib`
* `pip install plotly`
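
If you are unsure whether the installation worked, a quick check along the following lines may help. It is only a sketch: the helper name `missing_packages` is hypothetical, and it merely tests whether each module can be found by the active kernel.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The three packages used in this section of the tutorial.
required = ["pandas", "matplotlib", "plotly"]
missing = missing_packages(required)

if missing:
    print("Please install:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Running this in the notebook before loading the packages shows which installation step, if any, still needs to be repeated.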

### Load the packages

In [2]:
# load the packages
from IPython.display import Image
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
%matplotlib inline


#### Make more space for your plots, so you don't have a box with a scroll bar.

In [2]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines){
    return false;
}


### Read the data + Data wrangling

In [4]:
# read the data
my_data = pd.read_csv("https://raw.githubusercontent.com/VIS-SIG/Wonderful-Wednesdays/master/data/2020/2020-05-13/Satisfaction_wW2005.csv")

# handle missings
my_data = my_data.fillna("missing")
my_data = my_data.replace(to_replace = ".", value = "missing")
my_data.head()

|   | Unnamed: 0 | ID | age | bmi | w_hours | todoctor | income | smoker | gender | employed | graduat | graduat_f | graduat_m | high_grad | high_grad_f | high_grad_m | satisfaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1359 | >40 | >30 | missing | <6 visits | <1000 | Y | female | non-working | secondary school | secondary school | university entrance | apprenticeship | industrial/agric. teaching | master craftman | 10 |
| 1 | 2 | 2455 | >40 | <=30 | <35h | <6 visits | <1000 | N | female | regular half-time | secondary school | elementary school | elementary school | apprenticeship | master craftman | commercial teaching | 9 |
| 2 | 3 | 200 | 30-40 | >30 | missing | <6 visits | missing | N | male | non-working | university entrance | university entrance | university entrance | university degree | university degree | university degree | 8 |
| 3 | 4 | 1280 | >40 | <=30 | >45h | <6 visits | missing | N | male | regular full-time | tech.college entrance qual. | secondary school | secondary school | tech.college degree | commercial teaching | commercial teaching | 8 |
| 4 | 5 | 2384 | <30 | <=30 | missing | <6 visits | missing | N | female | non-working | missing | secondary school | elementary school | missing | industrial/agric. teaching | industrial/agric. teaching | 7 |
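
Note that `fillna` and `replace` as used above act on every column of the data frame. A more cautious variant, sketched below on made-up toy data, restricts the recoding to the categorical columns so that the response is guaranteed to keep its numeric type. The toy values and the choice of `satisfaction` as the only numeric column are assumptions for illustration.

```python
import pandas as pd

# Toy data standing in for the survey file; the column names mirror the
# tutorial data, the values are made up for illustration.
toy = pd.DataFrame({
    "income": ["<1000", ".", None, "1000-2000"],
    "smoker": ["Y", "N", None, "N"],
    "satisfaction": [10, 9, 8, 7],
})

# Assumption: every column except the response is categorical.
categorical_cols = [col for col in toy.columns if col != "satisfaction"]

# Recode missing values (NaN/None and ".") only in the categorical columns.
toy[categorical_cols] = (
    toy[categorical_cols]
    .fillna("missing")
    .replace(".", "missing")
)

print(toy)
print(toy["satisfaction"].dtype)  # the response stays numeric
```

With this variant, a missing response value would remain `NaN` and would simply be skipped when the mean is calculated.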


We want to use a simple and basic approach to manipulate the data
that stays as close as possible to the `R` code above.  
The interested reader is encouraged to implement a more elegant way.

Define a function to calculate the `n`, `percent` and `mean` of satisfaction for a subgroup.

In [5]:
def summarizing(data, variables, labels, response, stat="mean"):
    """Summarize `response` for every level of each grouping variable.

    For every variable we compute the group size, the relative frequency
    and the chosen statistic (`stat`, default: mean) of `response`.
    """
    df_results = pd.DataFrame()  # empty dataframe to store the end results

    for variable, label in zip(variables, labels):
        df_var = data.groupby(variable).size()  # counts per group level
        df_percent = df_var / df_var.sum()  # relative frequencies
        group_values = df_percent.index.tolist()  # capture the group levels
        df_stat = data.groupby(variable).agg({response: stat})  # statistic of the response (default: mean)
        # column-bind the three summaries and drop the group index
        df_temp = pd.concat([df_var, df_percent, df_stat], axis=1).reset_index(drop=True)
        df_temp["labels"] = label  # assign the display label
        df_temp["group"] = group_values  # assign the group levels
        # row-bind: the newest block is placed on top of the earlier results
        df_results = pd.concat([df_temp, df_results])

    return df_results
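
One possible "more elegant way" is sketched below, as an illustration rather than a replacement for the function above. It uses the named aggregation of `pandas` to get the counts and the mean in one step. The function name `summarize_subgroups`, the column name `mean_response` and the toy data are hypothetical.

```python
import pandas as pd


def summarize_subgroups(data, variables, labels, response):
    """Hypothetical variant of `summarizing` based on named aggregation."""
    pieces = []
    for variable, label in zip(variables, labels):
        summary = (
            data.groupby(variable)[response]
            .agg(N="size", mean_response="mean")  # count and mean per level
            .reset_index()
            .rename(columns={variable: "Group"})
        )
        summary["Percentage"] = summary["N"] / summary["N"].sum()
        summary["Labels"] = label
        pieces.append(summary)
    # row-bind all subgroup summaries in the order they were requested
    return pd.concat(pieces, ignore_index=True)


# Made-up toy data with two grouping variables.
toy = pd.DataFrame({
    "age": ["<30", "<30", ">40", ">40"],
    "smoker": ["Y", "N", "N", "N"],
    "satisfaction": [8, 6, 9, 7],
})
result = summarize_subgroups(toy, ["age", "smoker"], ["Age", "Smoker"], "satisfaction")
print(result)
```

Unlike `summarizing`, this sketch keeps the subgroups in the order in which they were passed and names the columns directly, so no renaming step is needed afterwards.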

In [6]:
# apply the function to two subgroups to test it
d_test = summarizing(data=my_data, 
                 variables=["age", "bmi"],
                 labels=["Age", "BMI"],
                 response="satisfaction",
                 stat="mean")

# rename columns
d_test.columns = ["N", "Percentage", "Mean of satisfaction", "Labels", "Group"] 
d_test.head()

|   | N | Percentage | Mean of satisfaction | Labels | Group |
|---|---|---|---|---|---|
| 0 | 2784 | 0.803231 | 7.562141 | BMI | <=30 |
| 1 | 635 | 0.183208 | 7.538583 | BMI | >30 |
| 2 | 47 | 0.01356 | 7.404255 | BMI | missing |
| 0 | 604 | 0.174264 | 7.698675 | Age | 30-40 |
| 1 | 401 | 0.115695 | 7.613466 | Age | <30 |


In [7]:
# create the data that we need for plotting, using all the variables that we have
d1 = summarizing(data=my_data, 
                 variables=["age", "bmi", "w_hours", "todoctor", "income", "smoker", "gender", "employed",
                           "graduat", "graduat_f", "graduat_m", "high_grad", "high_grad_f",
                            "high_grad_m"],
                 labels=["Age", "BMI", "Working hours", "Doctor visits per year","Net income", "Smoker", "Gender", "Employment status",
                        "Graduation", "Graduation of father", "Graduation of mother", "Highest edu. grade", "Highest edu. grade (father)",
                         "Highest educational grade (mother)"],
                 response="satisfaction",
                 stat="mean")

# assign proper labels
d1.columns = ["N", "Percentage", "Mean of satisfaction", "Labels", "Group"]
# round numbers
d1 = d1.round(3) 
d1.head(10)

|   | N | Percentage | Mean of satisfaction | Labels | Group |
|---|---|---|---|---|---|
| 0 | 760 | 0.219 | 7.655 | Highest educational grade (mother) | commercial teaching |
| 1 | 832 | 0.24 | 7.54 | Highest educational grade (mother) | industrial/agric. teaching |
| 2 | 148 | 0.043 | 7.703 | Highest educational grade (mother) | master craftman |
| 3 | 264 | 0.076 | 7.583 | Highest educational grade (mother) | missing |
| 4 | 1109 | 0.32 | 7.429 | Highest educational grade (mother) | no degree |
| 5 | 86 | 0.025 | 7.535 | Highest educational grade (mother) | other degree |
| 6 | 106 | 0.031 | 7.425 | Highest educational grade (mother) | tech.college degree |
| 7 | 161 | 0.046 | 7.957 | Highest educational grade (mother) | university degree |
| 0 | 231 | 0.067 | 7.593 | Highest edu. grade (father) | commercial teaching |
| 1 | 1491 | 0.43 | 7.436 | Highest edu. grade (father) | industrial/agric. teaching |


### Making the plot
We will use the `plotly` module in `Python` to create an interactive plot.

In [20]:
fig = px.treemap(d1,
                 path=['Labels', 'Group'], # nesting structure: First "Labels" then values within the Labels group as "Group"
                 color='Mean of satisfaction', # Variable for coloring the tiles
                 values='Percentage', # this variable controls the partitioning of the tiles;
                                      # we want the tiles to be proportional to "Percentage"
                 color_continuous_scale='RdYlBu' # color palette             
                 )

# save title as string variable
my_title = ("How would you rate your satisfaction with your life overall on a scale from 0-10? (10=completely satisfied)." +
            "<br>Descriptive analysis by subgroup: tiles are proportional to rel. frequencies of the choices within a subgroup")
            
# update to show the title
fig.update_layout(title=my_title, width=1000, height=800)
# show the figure
fig.show()

The plot above is interactive. We can zoom into the subgroups, which makes the smaller text easier to read. The plot also supports mouse-over effects: hovering over a tile shows the percentage of the category and the mean of the outcome variable for that subgroup. We observe that a higher level of graduation, being a non-smoker, and being healthy (few doctor's visits per year) are associated with higher mean satisfaction.

The treemap approach demonstrated here is only one of many possible ways to look at the data. Feel free to explore the data set further with other statistical methods. For example, one could use regression modeling to identify the variables that contribute most to the outcome `satisfaction`, or use other visual tools such as boxplots to capture variability.
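
A note on the colors of the parent tiles: according to the `plotly` documentation, when a numeric `color` column is passed, the color of a parent node is the average of the color values of its children, weighted by `values`. Assuming this behavior, the value behind the color of the "BMI" tile can be reproduced by hand from the test output of `summarizing` shown earlier. The helper name `weighted_mean` is our own.

```python
def weighted_mean(values, weights):
    """Average of `values`, each weighted by the matching entry in `weights`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Rows of the BMI subgroup, copied from the test output of `summarizing`.
bmi_means = [7.562141, 7.538583, 7.404255]   # mean of satisfaction per level
bmi_shares = [0.803231, 0.183208, 0.013560]  # relative frequency per level

print(round(weighted_mean(bmi_means, bmi_shares), 3))
```

Because the largest category dominates the weighting, the parent value lies very close to the mean of the "<=30" group.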

## 5. Comparison of different implementation methods
In this Jupyter tutorial we showed how to create efficient overview graphics for questionnaire data about quality of life. We used the `ggplot2` approach in `R` and the `plotly` module in `Python`. The advantage of `plotly` is that it easily incorporates mouse-over effects for the tiles. Additionally, one can zoom into the subgroups and therefore get a better view of the proportions and text labels of the tiles.

Please note that the `plotly` package is also available for the `R` programming language, so an interactive graphic is achievable with `R` as well. We intentionally showed two different approaches. We noticed that producing an interactive plot in `R` with `plotly` is a bit more advanced if you are not familiar with the `plotly` functionality, see the [documentation](https://plotly.com/r/reference/). `Python` has the same hurdle when it comes to `plotly`, but we were able to produce a decent output without diving too deep into it. Note that the [d3treeR package](https://github.com/d3treeR/d3treeR/) for interactive treemaps may not be fully supported for some R versions; we had difficulties reproducing the same output as in `Python` with it.

In `SAS®` you would have to use `SAS® Visual Analytics` for treemaps. The `R` and `Python` approaches are open source and free to use, whereas `SAS®` comes with a price.

## 6. Exercises
In this section we want to encourage you to get into the programming yourself and solve some fun exercises.
<br>
<br>
***
R-Exercises:

1. The graphic that we produced with `ggplot` in Jupyter wasn't quite readable because of the limited dimensions within
   the notebook. One solution would be to save the graphic and look at the plot in fullscreen mode on your monitor. 
   Hint: use `ggsave()`. Get information through the help function `?ggsave` on how to use this command in `R`.
   Make sure to try different values for the "width", "height" and "dpi" parameters to make the plot more readable.
       
       
2. Customization of colors. We used the diverging palette 'RdYlBu', which emphasizes the middle values and both ends of the scale. 
   Change the color palette of the tiles to a sequential pattern and observe the difference.
   Hint: `?display.brewer.all` (from the `RColorBrewer` package).
***

Python-Exercises:
***    
1. Please save the interactive plot that we produced in `Python` so it can be shared among others.
   Hint: [interactive-html-export documentation](https://plotly.com/python/interactive-html-export/).
   
   
2. Change the range of the color scale. In our `Python` example we didn't specify the range of the color scale. Therefore 
   the color range defaults to the range of the input data. 
   The color legend starts at about 6.4 and goes up to 9. In some cases
   you want your color scale to cover the whole range. Adjust the range of the colors and see if that makes any sense for our plot.
   Hint: use `range_color = [ , ]`.
   
   
3. Choosing different color scales. There are plenty of built-in color scales in `plotly`. Use the 'Inferno' and 'Sunset' color scales to see if they fit our data better.
   Hint: You can get the sequential color scale patterns with 
         fg_cols = px.colors.sequential.swatches()
         fg_cols.show()
   See also the complete [color scale documentation](https://plotly.com/python/builtin-colorscales/).
***