### FIT5147 Data Visualisation Notes

One reason visualisations are effective is that they can contain a lot of information. Standard statistical measures such as the mean, median, standard deviation or correlation summarise the data using just a few numbers. An information graphic potentially provides much more information about the data as it can show thousands (even millions) of graphic elements, each of which can use position, colour, pattern or shape to encode information about the data.

Visualisations can reveal patterns in the data even when the summary statistics look almost identical.
Example:  
![image.png](attachment:image.png)  
<style type="text/css">
    img {
        width: 400px;
    }
</style>
<!-- https://stackoverflow.com/a/67994843 -->
(The datasets in these images all have almost identical means and standard deviations in both x and y, and almost identical lines of best fit!)
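A minimal sketch of the same idea, using Anscombe's quartet as a stand-in. Whether the figure above shows this exact dataset is an assumption; the quartet is used here only because its values are well known. The helper names (`fit_line`, `summarise`, `plot_quartet`) are made up for this example, and the plotting function assumes matplotlib is installed.

```python
from statistics import mean, stdev

# Anscombe's quartet: four small datasets with near-identical summary statistics.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(xs, ys):
    """Least-squares slope and intercept of y on x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def summarise(xs, ys):
    """Collect the summary statistics mentioned in the text."""
    slope, intercept = fit_line(xs, ys)
    return {
        "mean_x": mean(xs), "sd_x": stdev(xs),
        "mean_y": mean(ys), "sd_y": stdev(ys),
        "slope": slope, "intercept": intercept,
    }


SUMMARIES = {name: summarise(xs, ys) for name, (xs, ys) in QUARTET.items()}

# The printed rows are nearly indistinguishable.
for name, s in SUMMARIES.items():
    print(name, {k: round(v, 2) for k, v in s.items()})


def plot_quartet():
    """Scatter each dataset with its fitted line (requires matplotlib)."""
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)
    for ax, (name, (xs, ys)) in zip(axes.flat, QUARTET.items()):
        slope, intercept = fit_line(xs, ys)
        ax.scatter(xs, ys)
        ax.plot([4, 19], [intercept + slope * 4, intercept + slope * 19])
        ax.set_title(name)
    plt.show()
```

Calling `plot_quartet()` in a notebook draws the four scatter plots, where the differences that the numbers hide become obvious.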

Much of human visual processing is pre-attentive and occurs in parallel, so we can often see patterns without having to crunch the numbers mentally.

Data visualisation is used for three main purposes in Data Science.

1. Data checking and cleaning. 
    - When you first get your data, you should do some quick plots of the individual features to check that there are no obvious errors and to get a feel for the distribution of values.

2. Exploration and discovery. 
    - According to Mike Loukides (What is Data Science? n.d.), Hilary Mason, one of the world’s leading data scientists, says that when she gets a new data set, she starts by making a dozen or more scatter plots, trying to get a sense of what might be interesting. Visualisation reveals possible connections and patterns that can then be confirmed (or not) using other kinds of analysis. Visualisation also plays a key role in understanding any spatial data.

3. Presentation and communication of results. 
    - The other important use of visualisation is to present the results of your analysis. This has two main purposes: (1) to help you and other modellers/analysts understand the results and (2) to communicate the results to other stakeholders.

When you first get data, check it for errors (this is part of data wrangling).
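As a rough sketch of such a first check, the snippet below counts missing values and reports the range of one feature, with an optional histogram. The data, the feature name and both helper functions are hypothetical; the histogram assumes matplotlib is installed.

```python
def summarise_feature(values):
    """Count missing entries and report the range of the observed values."""
    present = [v for v in values if v is not None]
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
    }


def plot_feature(values, name):
    """Quick histogram of one feature (requires matplotlib)."""
    import matplotlib.pyplot as plt

    present = [v for v in values if v is not None]
    plt.hist(present, bins=20)
    plt.title(name)
    plt.show()


# Hypothetical heights in cm: one missing value and one likely
# data-entry error (1720 where 172 was probably intended).
heights_cm = [172, 165, None, 181, 1720, 158, 169]
report = summarise_feature(heights_cm)
print(report)
```

A maximum of 1720 cm stands out immediately in the printed range, and `plot_feature(heights_cm, "height_cm")` would show it as an isolated bar far from the rest.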

If you are using statistical tests that require a normal distribution, check that the data appears to be normally distributed. There are statistical tests for this, but they can be quite strict. It is often more practical to plot the data and assess normality graphically, for example with a histogram or a Q-Q plot.
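One way to check normality graphically is a Q-Q plot. The sketch below computes the plot coordinates with the standard library and leaves the drawing to an optional matplotlib helper. The sample is synthetic, the function names are made up for this example, and the plotting position `(i + 0.5) / n` is just one common convention.

```python
import random
from statistics import NormalDist, mean, stdev


def qq_points(sample):
    """Pair each sorted observation with the normal quantile expected at its rank."""
    n = len(sample)
    fitted = NormalDist(mean(sample), stdev(sample))
    ordered = sorted(sample)
    # (i + 0.5) / n is one common choice of plotting position.
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return theoretical, ordered


def plot_qq(sample):
    """Draw the Q-Q plot with a y = x reference line (requires matplotlib)."""
    import matplotlib.pyplot as plt

    theoretical, ordered = qq_points(sample)
    plt.scatter(theoretical, ordered, s=10)
    lo, hi = theoretical[0], theoretical[-1]
    plt.plot([lo, hi], [lo, hi])
    plt.xlabel("Theoretical normal quantiles")
    plt.ylabel("Ordered sample values")
    plt.show()


# Synthetic, normally distributed sample for illustration only.
random.seed(0)
sample = [random.gauss(50, 10) for _ in range(200)]
theoretical, ordered = qq_points(sample)
```

If the data is roughly normal, the points from `plot_qq(sample)` fall close to the reference line; systematic curvature suggests skew or heavy tails. SciPy's `scipy.stats.probplot` offers a ready-made alternative.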


## The What-Why-How framework

One of the most widely used frameworks for understanding the design and evaluation of data visualisations was developed by Tamara Munzner. It has three parts:

1. What is the kind of data to be visualised?
2. Why is the data being visualised: what task does the user wish to perform?
3. How is the data visually represented and what interaction is provided?


### What: Kinds of data

Different sorts of visualisation are appropriate for different kinds of data. 

1. Tabular dataset: Data that is conceptually organised into a table, with each row corresponding to a different data point or item and each column corresponding to a different attribute of the data. This is the sort of data used in traditional statistics, and the kind that R is designed for.
2. Network dataset: Data consists of nodes or items and links between these nodes, representing different kinds of abstract relationships such as ‘reports-to’ or ‘is-married-to’. The items can contain attributes. A hierarchy is a kind of network dataset.
3. Spatial dataset: Data in which items are associated with a geographic location or region. This geographic key is a natural way to organise and understand the data.
4. Textual dataset: Data that consists of sequences of words and punctuation.

These are not the only ways data can be organised, but they are the most common kinds of dataset in data science and data visualisation. Another way of organising data, common in scientific visualisation, is the field, in which data is sampled from a continuous, conceptually infinite domain. An example of data organised as a field is an X-ray image.
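The four dataset kinds listed above could be represented in code roughly as follows. This is only an illustrative sketch: every name and value is made up, and real projects would usually use dedicated libraries for each kind.

```python
import re

# Tabular: one dict per row (item), one key per column (attribute).
tabular = [
    {"name": "Ana", "height_cm": 172, "size": "medium"},
    {"name": "Ben", "height_cm": 181, "size": "large"},
]

# Network: nodes (which can carry attributes) plus typed links between them.
nodes = {
    "ana": {"role": "manager"},
    "ben": {"role": "analyst"},
    "cho": {"role": "analyst"},
}
links = [("ben", "reports-to", "ana"), ("cho", "reports-to", "ana")]

# Following the links answers relationship questions directly.
reports_to_ana = [src for src, rel, dst in links
                  if rel == "reports-to" and dst == "ana"]

# Spatial: each item carries a geographic key (here latitude/longitude).
# The site names and readings are hypothetical.
spatial = [
    {"site": "Site A", "lat": -37.81, "lon": 144.96, "reading": 3.2},
    {"site": "Site B", "lat": -33.87, "lon": 151.21, "reading": 4.1},
]

# Textual: a sequence of word and punctuation tokens.
text = "Data is visualised. Patterns appear!"
tokens = re.findall(r"\w+|[^\w\s]", text)

print(reports_to_ana)
print(tokens)
```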

Attributes in data items are simple values that can be measured or logged. They can be:

- Categorical: data that does not have an inherent ordering. E.g. names.
- Ordered: data that can be ranked or ordered. It has two subtypes
    - Ordinal: data that can be ranked but for which the difference between items does not make arithmetic sense. Examples include clothes sizes (small, medium, large), survey response scales such as a 5-point scale (‘disagree strongly’, ‘disagree’, ‘neutral’, ‘agree’, ‘agree strongly’), months in a year, and the year in which something happened.
    - Quantitative: data that has a magnitude supporting arithmetic comparison, for instance height or weight. This may be an integer or a real number. Time is an important example of quantitative data. Sometimes quantitative data is split into interval and ratio data, but this distinction is usually not that important. The difference is that ratio data has a natural 0 while interval data does not. Thus length is an example of ratio data, while a date is an example of interval data.
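The interval versus ratio distinction shows up in Python's own types. In this small sketch (the values are arbitrary), lengths can be divided to give a meaningful ratio, while dates support differences but not ratios.

```python
from datetime import date

# Ratio data: a natural zero makes "twice as long" meaningful.
length_a_cm, length_b_cm = 20.0, 10.0
ratio = length_a_cm / length_b_cm

# Interval data: the difference between two dates is meaningful...
start, end = date(2020, 1, 1), date(2020, 3, 1)
gap = end - start  # a timedelta

# ...but the ratio of two dates is not, and Python refuses to compute it.
try:
    end / start
    date_ratio_ok = True
except TypeError:
    date_ratio_ok = False

print(ratio, gap.days, date_ratio_ok)
```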

Ordered data can be sequential, in which case there is a minimum and maximum value, or diverging, in which case it can be understood as two sequences going in opposite directions, e.g. like/dislike scales from -5 to 5 diverge around the neutral value of 0. Ordered data can also be cyclic, where the values wrap around. Time measurements are often cyclic, e.g. months in the year.
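A short sketch of what diverging and cyclic mean in practice. Both helper functions are hypothetical names introduced for this example: one splits a diverging score into a direction and a magnitude, the other measures distance on a scale that wraps around.

```python
def diverging_parts(score, neutral=0):
    """Split a diverging value into its direction and its distance from neutral."""
    if score < neutral:
        direction = "dislike"
    elif score > neutral:
        direction = "like"
    else:
        direction = "neutral"
    return direction, abs(score - neutral)


def cyclic_distance(a, b, period=12):
    """Shortest distance between two values on a scale that wraps around."""
    d = abs(a - b) % period
    return min(d, period - d)


# December (12) and January (1) are neighbours on a cyclic scale,
# even though they are 11 apart on a plain sequential one.
print(cyclic_distance(12, 1), abs(12 - 1))
print(diverging_parts(-5), diverging_parts(3))
```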

The data displayed need not be only the original data; it may also be data computed from the original data, such as statistical values.
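For instance, a chart might show a mean per category rather than every raw value. The following is a minimal sketch with made-up data and a hypothetical `group_means` helper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical original data: (category, measurement) pairs.
rows = [
    ("small", 12.0), ("small", 14.0),
    ("large", 20.0), ("large", 26.0), ("large", 23.0),
]


def group_means(rows):
    """Derive one value per category from the original rows."""
    groups = defaultdict(list)
    for category, value in rows:
        groups[category].append(value)
    return {category: mean(values) for category, values in groups.items()}


# The derived data, not the raw rows, is what a bar chart of means would display.
derived = group_means(rows)
print(derived)
```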

Not all types of visualisations are suitable for all types of data.

### Why: The tasks

Discovery: Get information (insight/knowledge) from the data. Can form or test a hypothesis.

Presentation: Present information to some intended audience.

Enjoy: Data visualisations should entice casual users to engage with the data.





Data graphic design summary:
- Data density
- Graphical integrity
- Data correspondence 
- Aesthetics

Design methodology:
- Discussing the data and business needs with the client
- Brainstorming ideas based on the knowledge of the data and these needs
- Creating design-sheets for possible solutions
- Discussing these design-sheets with the client and receiving their feedback
- Deciding on the final design realisation

![image.png](attachment:image.png)