## Additional Readings and Resources

Design:

* The Visual Display of Quantitative Information by Edward Tufte
* Envisioning Information by Edward Tufte
* Visual Explanations: Images and Quantities, Evidence and Narrative by Edward Tufte


Color:

* Information Visualization: Perception for Design by Colin Ware

## Key phrases and Concepts

* Data variables: nominal, ordinal, and quantitative; discrete v. continuous; dependent v. independent
* The perceptual accuracy of how different chart elements represent data variables
* How glyphs represent multiple dimensions of individual data items, how parallel coordinates plot data over many dimensions, and how streamgraphs improve on stacked bar charts
* Chartjunk, the data-ink ratio, and other design rules
* Hue, saturation, value, and other ways of thinking about color

## Introduction

We can use [Tableau](https://www.tableau.com/) to visualize and analyze data. Tableau is easy to navigate: for example, its "Show Me" button helps us choose a chart type suited to the selected data.

## Data Visualization Framework

A framework can simplify our understanding of how data visualization works. The **data visualization framework** is shown in the image below:

![](images/data-visualization-framework.png)

The **Data Layer** contains two entities: data sources and the data collection. Before data is collected, it is preprocessed so that every item is imported in the proper format. After import, the data should be stored in the best possible structure. In a database, for example, we can maintain the relations between data with database indexes, which reduces retrieval time. Once storage is settled, we can optimize the collection further. Data analysis lets us judge the quality of the collected data; its tasks include inspecting, cleansing, and transforming, and its goal is to reduce noise in the collection by removing or transforming data. Another crucial optimization is data aggregation, which has a significant impact on the retrieval performance of the collection; its tasks include classification, clustering, and so on.
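
As a minimal sketch of the cleansing and aggregation steps in the data layer, the Python snippet below drops unusable records and then reduces the rest to one value per group. The record layout and the field names (`city`, `temp_c`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw records; the field names are illustrative assumptions.
raw = [
    {"city": "Oslo", "temp_c": "4.5"},
    {"city": "Oslo", "temp_c": "5.5"},
    {"city": "Lima", "temp_c": "n/a"},   # noise: not a number
    {"city": "Lima", "temp_c": "19.0"},
    {"city": "", "temp_c": "7.0"},       # noise: missing label
]

def cleanse(records):
    """Inspect each record, drop unusable ones, and convert text to floats."""
    clean = []
    for rec in records:
        if not rec["city"]:
            continue  # remove records without a label
        try:
            value = float(rec["temp_c"])
        except ValueError:
            continue  # remove records whose value cannot be parsed
        clean.append({"city": rec["city"], "temp_c": value})
    return clean

def aggregate(records):
    """Group cleaned records by city and reduce each group to its mean."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["city"]].append(rec["temp_c"])
    return {city: mean(values) for city, values in groups.items()}

summary = aggregate(cleanse(raw))
print(summary)
```

Grouping by a label is only one possible form of aggregation; clustering or classification would replace the `aggregate` step.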

The **Mapping Layer** is a complex layer that relies heavily on linear algebra and computer graphics algorithms. In this layer, we work out how to associate appropriate geometry with the data channels. For example, to represent clusters of data, we can map the size of each cluster to the area of a circle.
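
The cluster example can be sketched in a few lines. Because the area of a circle grows with the square of its radius, the radius has to scale with the square root of the value for the area to stay proportional to the cluster size. The function name `radius_for_value` and the maximum radius of 40 units are assumptions made for this illustration.

```python
import math

def radius_for_value(value, max_value, max_radius):
    """Return a radius whose circle area is proportional to value."""
    if value < 0 or max_value <= 0:
        raise ValueError("value must be >= 0 and max_value must be > 0")
    return max_radius * math.sqrt(value / max_value)

cluster_sizes = [10, 40, 160]
radii = [radius_for_value(s, max(cluster_sizes), 40.0) for s in cluster_sizes]
print(radii)  # [10.0, 20.0, 40.0]
```

Each cluster here is four times larger than the previous one, so each radius doubles while each area quadruples.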

The **Graphics Layer** relies heavily on computer graphics algorithms and user interaction. Its main objective is to turn the geometry calculated by the mapping layer into a displayable image on the computer screen. For interactive visualization, we also need to ensure that every input is processed correctly and produces the expected effect on the graphical representation of our data.
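
One small part of that work is converting data-space coordinates into pixel coordinates. The sketch below assumes a screen whose origin is the top-left corner, which is common but not universal; the function name `to_screen` and its parameters are hypothetical.

```python
def to_screen(x, y, x_range, y_range, width, height):
    """Map a data-space point to pixel coordinates (origin at top-left)."""
    (x0, x1), (y0, y1) = x_range, y_range
    px = (x - x0) / (x1 - x0) * width
    py = height - (y - y0) / (y1 - y0) * height  # flip: screen y grows downward
    return px, py

print(to_screen(5, 5, (0, 10), (0, 10), 800, 600))  # (400.0, 300.0)
print(to_screen(0, 0, (0, 10), (0, 10), 800, 600))  # (0.0, 600.0)
```

For interaction, the same mapping is applied in reverse to turn a mouse position back into data values.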

## Data Types

There are four types of discrete data:

1. Ordinal data is ordered data defined by range. For example: small and large.
2. Quantitative data is ordered data defined by step. For example: 1, 2, 3, 4, ...
3. Nominal data is unordered data defined by label. For example: shapes (circle / rectangle) and gender (male / female).
4. Categorical data is unordered data defined by group. For example: social age (young / old) and nationality.

There are two types of continuous data:

1. Fields are ordered data defined by a standardized scale. For example: temperature may be measured in degrees Celsius or kelvins, and altitude may be measured in minutes or degrees.
2. Cyclic values are unordered data defined by perception. For example: direction may be indicated by north, south, east, and west, or by clock positions.

The image below briefly describes each type of data with an example:

![data_type](images/data-type.png "Data types")
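
Cyclic values wrap around, which is one way to see why they have no natural order: a bearing of 350 degrees is closer to 10 degrees than to 300 degrees. The helper below is a minimal sketch of measuring distance on a cycle; the name `cyclic_difference` and the default period of 360 are assumptions for this illustration.

```python
def cyclic_difference(a, b, period=360.0):
    """Smallest distance between two values on a cycle of the given period."""
    diff = abs(a - b) % period
    return min(diff, period - diff)

print(cyclic_difference(350, 10))            # 20.0
print(cyclic_difference(23, 1, period=24))   # 2 (hours on a 24-hour clock)
```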


## Data as Variable

There are two types of data as variables:

* An independent variable is described by itself and does not change when another value changes.
* A dependent variable is described by another variable. For example, if y is the dependent variable and x is the independent variable, then the value of y is determined by the value of x.

The image below briefly shows how data is viewed from three points of view: science, database, and data warehouse.

![data_variables](images/data-variable.png "Data variables")

## Mapping

As explained for the mapping layer, the mapping process focuses on producing the proper geometric primitive for each data channel. It is very important to know which geometry best represents the corresponding data. Choosing an ill-suited geometry leads to poor perception, which degrades the quality of interpretation and conclusions.

There are three groups of geometric primitives, classified by dimension:

1. One dimension
    * Position
    * Length
    ![position_length](images/position-length.png "Position and length")
    * Angle/Slope
    ![angle_slope](images/angle.png "Angle or slope")
<br><br>
2. Two dimensions
    * Area
    ![area](area.png "Area")
    * Texture
    * Connection
    * Containment
    ![texture-connection-containment](images/texture-connection-containtment.png "Texture, Connection and Containment")
    * Shape
<br><br>
3. Three dimensions
    * Volume
    * Color/Density
    ![volume-density](images/volume-density.png "Volume and density")
    * Saturation
    * Hue

Below are three lists of suitable geometric primitives for three commonly used data types (quantitative, ordinal, and nominal), ordered by **perceptual accuracy** from most to least accurate.

| quantitative | ordinal      | nominal      |
| ------------ | ------------ | ------------ |
| position     | position     | position     |
| length       | density      | hue          |
| angle        | saturation   | texture      |
| slope        | hue          | connection   |
| area         | texture      | containment  |
| volume       | connection   | density      |
| density      | containment  | saturation   |
| saturation   | length       | shape        |
| hue          | angle        | length       |
|              | slope        | angle        |
|              | area         | slope        |
|              | volume       | area         |
|              |              | volume       |
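
One way to put the ranking to work is a simple lookup that hands each variable the most accurate channel still available for its data type. The sketch below copies the three columns of the table above; the function names and the rule that every channel is used only once are simplifying assumptions (in practice, position can carry two variables, one per axis).

```python
# Rankings copied from the table above, most accurate first.
RANKING = {
    "quantitative": ["position", "length", "angle", "slope", "area",
                     "volume", "density", "saturation", "hue"],
    "ordinal": ["position", "density", "saturation", "hue", "texture",
                "connection", "containment", "length", "angle", "slope",
                "area", "volume"],
    "nominal": ["position", "hue", "texture", "connection", "containment",
                "density", "saturation", "shape", "length", "angle",
                "slope", "area", "volume"],
}

def best_channel(data_type, used=()):
    """Return the most accurate channel for data_type that is not yet used."""
    for channel in RANKING[data_type]:
        if channel not in used:
            return channel
    raise ValueError("no free channel left for " + data_type)

def assign_channels(variables):
    """Greedily assign one channel per (name, data_type) pair, in order."""
    used, mapping = set(), {}
    for name, data_type in variables:
        channel = best_channel(data_type, used)
        used.add(channel)
        mapping[name] = channel
    return mapping

print(assign_channels([("price", "quantitative"),
                       ("rating", "ordinal"),
                       ("brand", "nominal")]))
# {'price': 'position', 'rating': 'density', 'brand': 'hue'}
```

Because the assignment is greedy, the order of the variables matters: the variable listed first gets the best channel.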

## Charts

Data visualizations often consist of very simple charts, but their success can depend on how we map our data to the elements of those charts. Here we show some simple charts and when to use each of them.

1. Bar chart
    ![bar](images/bar.png "Bar chart")
    <br><br>
2. Line chart
    ![line](images/line.png "Line chart")
    <br><br>
3. Scatterplot
    ![scatter](images/scatter.png "Scatter plot")    
    <br><br>
4. Gantt chart
    ![gantt](images/gantt.png "Gantt chart")    
    <br><br>
5. Table
    ![table](images/table.png "Table")


We can classify each type of chart by its characteristics:

![when-use-charts](images/when-use-charts.png "When use charts")

## Glyphs

A glyph is any indicator that encodes a data point as a shape, a color, or both. For example, in a tornado visualization, we may use length and angle to indicate wind speed and direction. We may also use hue and shape to indicate density and coverage, such as wind convergence or divergence in an area. The image below illustrates the use of glyphs in various charts:

![glyphs](images/glyphs-in-charts.png "Glyphs in Charts")

Furthermore, we can use glyphs to show data attributes in charts and other visualizations. A very common practical example is visualizing the standard deviation and variance of sample data, as shown in the image below:

![standard deviation](images/glyphs-standart-deviation.png "Glyphs of standard deviation")
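
The numbers behind such a glyph are easy to compute. The sketch below returns the three positions needed to draw a bar centered on the mean and extending one standard deviation in each direction. Using one sample standard deviation as the bar length is an assumption; some charts use the standard error or a confidence interval instead. The function name `error_bar` is hypothetical.

```python
from statistics import mean, stdev

def error_bar(samples):
    """Return (low, centre, high) for a mean +/- one standard deviation glyph."""
    centre = mean(samples)
    spread = stdev(samples)  # sample standard deviation (n - 1 denominator)
    return centre - spread, centre, centre + spread

low, centre, high = error_bar([2, 4, 4, 4, 5, 5, 7, 9])
print(low, centre, high)
```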

Glyphs can also be used in presentation visualizations. A heatmap is one example: it places colored glyphs on a map. Some groups or individuals may not be comfortable with the heatmap representation and may demand higher accuracy. In that case, we can use colored glyphs as boxes with various hues in a table visualization. The x and y axes of the table chart describe the data variables, and the glyphs help show the differences between data points.

![Heatmap](images/heatmap.png "Heatmap with colorized glyphs")

In some cases, we might want to transform our visualization into a more compact form. We may lose some detail and accuracy in the process. For example, limited space may prevent us from plotting a detailed graph for each class of data. In that case, we trade accuracy for classification. The image below shows that glyphs can reveal differences in life ratio between continents while giving little detail about the life ratio values in each class.

![Life Ratio](images/life-ratio.png "Life ratio")

Glyphs can also be represented as features with complex geometry. The image below is an example in which glyphs represent lawyers as faces; a bigger face may indicate that a lawyer has experience in more cases:

![Chernoff Faces](images/chernoff-faces.png "Chernoff faces")

## Parallel Coordinates

Parallel coordinates is a technique for visualizing high-dimensional data. It is helpful because most display technology can only show a high-dimensional space as a projection onto two dimensions. The main objective of the technique is to reveal certain features of high-dimensional data, such as **collinearity**, and the main reason to use it is that it reduces the visual complexity caused by high dimensionality. The steps below construct parallel coordinates for n dimensions:

1. Draw two parallel lines (axes) for the first two dimensions, for example x and y.
2. Place each data point on each axis according to its value in that dimension. For example, points A(3, 2) and B(4, 3) are placed at 3 and 4 on the x axis and at 2 and 3 on the y axis.
3. Draw a line between the two axes for each data point. For example, A becomes a line from 3 on x to 2 on y, and B becomes a line from 4 on x to 3 on y.
4. Remove the glyphs drawn on each axis, so that only the lines between the axes remain.
5. Add one parallel axis for the next dimension and repeat from step 2 until every dimension has its own axis.
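
The construction above can be sketched as a function that turns each data point into a polyline with one vertex per axis. Rescaling every axis to the range 0 to 1 with its own minimum and maximum is an assumption of this sketch, as is the function name `parallel_coordinates`; the steps themselves do not prescribe a scale.

```python
def parallel_coordinates(rows):
    """Turn n-dimensional rows into polylines across n parallel axes.

    Each returned polyline is a list of (axis_index, height) vertices,
    where height is the value rescaled to the range 0..1 on that axis.
    """
    dims = len(rows[0])
    lows = [min(row[d] for row in rows) for d in range(dims)]
    highs = [max(row[d] for row in rows) for d in range(dims)]

    def scale(value, d):
        span = highs[d] - lows[d]
        # A constant dimension has no spread; place it mid-axis.
        return 0.5 if span == 0 else (value - lows[d]) / span

    return [[(d, scale(row[d], d)) for d in range(dims)] for row in rows]

# Points A(3, 2) and B(4, 3) from the steps above.
lines = parallel_coordinates([(3, 2), (4, 3)])
print(lines)  # [[(0, 0.0), (1, 0.0)], [(0, 1.0), (1, 1.0)]]
```

Adding a dimension only adds one vertex to every polyline, which is why the technique scales to many dimensions.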

The most interesting property of parallel coordinates is that collinearity is easy to detect. In the example below, which shows four-dimensional data, the orange, yellow, green, and dark green points are collinear in the x-y plane:

![Parallel coordinate](images/parallel-coordinate.png "Parallel coordinate")

## Stacked Graphs

Stacked graphs are used when we want to represent accumulated data (multiple variables on the same axes). The following chart types belong to the stacked graph family:

* Stacked bar chart
* Relative stacked bar chart
* Pie chart
* Diverging stacked bar chart
* Stacked line graph
* Stacked graph layout

In some cases, we need to use stacked bar charts carefully. We already know that position is more accurate than length for representing quantitative data. Since a stacked bar chart uses both position and length, we need to prioritize the position encoding over the length encoding. The image below illustrates why the stacking order matters.

![Stacking order](images/stacked-bar-chart1.png "Stacking order")
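
The reason order matters shows up in the arithmetic of a single stacked bar: only the first segment starts at the common baseline, so only its value can be read by position, while every later segment starts wherever the previous one ended and must be read by length. A minimal sketch, with the hypothetical helper name `stack_segments`:

```python
from itertools import accumulate

def stack_segments(values):
    """Return (bottom, top) for each segment of one stacked bar, bottom first."""
    tops = list(accumulate(values))
    bottoms = [0] + tops[:-1]
    return list(zip(bottoms, tops))

print(stack_segments([5, 3, 2]))  # [(0, 5), (5, 8), (8, 10)]
print(stack_segments([2, 3, 5]))  # [(0, 2), (2, 5), (5, 10)]
```

Placing the most important series first gives it the shared baseline.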

A stacked bar chart may not be effective for large amounts of accumulated data. For example, when visualizing multiple stock prices over a long time interval, a stacked bar chart may not show price changes effectively. In that case, we can transform the stacked bar chart into a stacked graph layout.

![Stacked graph layout](images/stacked-graph-layout.png "Stacked graph layout")

To inspect area or distribution, or to test a confidence interval, we may transform the stacked graph into a ThemeRiver layout:

![ThemeRiver layout](images/themeriver-layout.png "ThemeRiver layout")

To smooth the layout, if it is possible to add some weights, we may transform the ThemeRiver layout into a streamgraph layout:

![Streamgraph layout](images/streamgraph-layout.png "Streamgraph layout")
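
The layouts in this section differ mainly in where the bottom of the stack (the baseline) sits at each time step. The sketch below compares two baselines. The symmetric one centers the stack on zero, as in a ThemeRiver layout. The weighted one is the wiggle-reducing baseline usually quoted for streamgraphs, \\(g_0 = -\frac{1}{n+1}\sum_{i=1}^{n}(n-i+1)f_i\\); reading "adding some weight" as this formula is an assumption, and all function names are hypothetical.

```python
def themeriver_baseline(layers):
    """Symmetric baseline: the stack is centred on zero at every time step."""
    return [-0.5 * sum(column) for column in zip(*layers)]

def streamgraph_baseline(layers):
    """Weighted baseline g0 = -1/(n+1) * sum((n - i + 1) * f_i), i = 1..n."""
    n = len(layers)
    return [
        # enumerate counts from 0, so (n - i) here equals (n - i + 1) above
        -sum((n - i) * value for i, value in enumerate(column)) / (n + 1)
        for column in zip(*layers)
    ]

def stack_on(layers, baseline):
    """Stack layers on a baseline; returns one list of (bottom, top) per layer."""
    bottoms = list(baseline)
    result = []
    for layer in layers:
        tops = [b + v for b, v in zip(bottoms, layer)]
        result.append(list(zip(bottoms, tops)))
        bottoms = tops
    return result

layers = [[1, 2, 3], [3, 2, 1]]  # two series over three time steps
print(themeriver_baseline(layers))   # [-2.0, -2.0, -2.0]
print(streamgraph_baseline(layers))
print(stack_on(layers, themeriver_baseline(layers)))
```

A plain stacked graph corresponds to a baseline of zero everywhere, so `stack_on(layers, [0, 0, 0])` reproduces it.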

Then, to prioritize position, we may reorder the layers:

![Streamgraph ordering](images/streamgraph-ordering.png "Streamgraph ordering")

## Tufte's Design Rules and Using Color

Tufte's design rules are design criteria formulated by [Edward Tufte](https://www.edwardtufte.com/tufte/) that encourage minimalism when visualizing data.

### Tufte's Design Rules

There are some important design rules described by Tufte:

1. Let the data speak: To present data reliably, we must show the detail of the data while guiding the audience in how to interpret it. For example, if some data is missing, show the gap with as little distraction as possible and let the audience reason about what the missing data might be.
2. Prefer pictures to text: A picture is a great way to represent data because it has volume, area, or length to indicate size, color to indicate differences, and more. Using color can help avoid misinterpretation.
3. Annotate: Annotations in a visualization inform the audience about details and supplementary information.
4. Chartjunk: A 2D representation may seem boring, but a 3D representation may lead to misperception.
5. The data-ink ratio: When a graphic is printed on paper, follow Tufte's minimalism: devote as large a share of the ink as possible to the data, and remove ink (including color) that carries no information.
6. Micro/macro: For detail, we prefer the micro scale; for an overview, we prefer the macro scale.
7. Information layers: We need to tell the audience what information can be drawn from the visualization. For example, we may give different elements a different appearance, such as different colors and shapes.
8. Multiples: We need to stay consistent when representing the same kind of data in different cases. For example, use exactly the same chart and glyph types when showing the same measurement for the performance of two different machines.
9. Color: Be careful with color combinations, because a poor combination may lead to misinterpretation and misperception.
10. Narrative: A good visualization tells the audience a story.

### Using Color

*Hue* is defined by an angle on the color wheel: for example, \\(0^\circ\\) is red, \\(60^\circ\\) is yellow, \\(120^\circ\\) is green, \\(180^\circ\\) is cyan, \\(240^\circ\\) is blue, and \\(300^\circ\\) is magenta. *Saturation* is how far a color is from gray. *Value* is how far a color is from black.
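
These three quantities can be tried out with Python's standard `colorsys` module, which expects hue, saturation, and value as fractions between 0 and 1. The wrapper below, whose name `hsv_degrees_to_rgb255` is an assumption for this sketch, accepts hue in degrees and returns 8-bit RGB.

```python
import colorsys

def hsv_degrees_to_rgb255(hue_deg, saturation=1.0, value=1.0):
    """Convert hue in degrees plus saturation and value in 0..1 to 8-bit RGB."""
    r, g, b = colorsys.hsv_to_rgb((hue_deg % 360) / 360.0, saturation, value)
    return tuple(round(c * 255) for c in (r, g, b))

for hue in (0, 60, 120, 180, 240, 300):
    print(hue, hsv_degrees_to_rgb255(hue))
# 0 (255, 0, 0)      red
# 60 (255, 255, 0)   yellow
# 120 (0, 255, 0)    green
# 180 (0, 255, 255)  cyan
# 240 (0, 0, 255)    blue
# 300 (255, 0, 255)  magenta
```

Lowering `saturation` moves a color toward gray, and lowering `value` moves it toward black, matching the definitions above.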

Research suggests that humans can reliably distinguish only five to ten hues (Healey, 1996), and Ware's "Information Visualization" recommends only twelve colors (6+6).

![12 colors](images/12-colors.png)<br><br>

In most cases, saturation adds detail to an object. So use saturated colors for points, strokes, and symbols, and desaturated colors for fills and larger areas (for example, desaturated with white to increase luminance). In specific cases, such as a bar chart, desaturated fills may work better than saturated ones.

*Contrast* means difference; in the context of color, it concerns hue and, optionally, brightness.