# Data Visualisations

There are two broad types of visualisation:
- Tables (tabular presentation of information)
- Plots and pictograms

Tools:
- Excel: basic to advanced plots; the default plots look basic, but with heavy customisation they can look good
- Google Sheets: basic to intermediate plots; the default plots are well designed, but there is little room for customisation
- PowerBI, Tableau
- Figma: pretty plots
- Adobe Illustrator: more complicated infographics
- Python:
  - Matplotlib
  - Seaborn
  - Plotly
  - Sympy: for plotting mathematical functions
- R: more statistics-oriented plots

# Matplotlib

# Tabular representations of data


## Tables



## Contingency table

**Contingency table** (crosstabulation, cross table) - a useful way to summarise two categorical variables.
- Top row = categories of variable 1
- Left column = categories of variable 2
- Cells = frequencies

<img src="Media/Crosstabulation.png" width="530"/>
 
<img src="Media/Contingency-table.png" width="400"/>

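*Marginal distribution* - the distribution of one variable on its own, read from the row or column totals, e.g. distribution of sex (52, 48), distribution of handedness (87, 13).

*Conditional distribution* - the distribution of one variable within a single category of the other, e.g. distribution of sex within right-handed people.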

![image.png](attachment:image.png)
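
The marginal and conditional distributions can be computed directly from the cell counts. Below is a minimal sketch using only the standard library; the four cell counts are hypothetical, chosen so that they reproduce the marginal totals quoted above (52/48 and 87/13).

```python
from collections import Counter

# Hypothetical cell counts, chosen to match the marginal totals quoted above
# (sex: 52/48, handedness: 87/13).
counts = {
    ("male", "right"): 43,
    ("male", "left"): 9,
    ("female", "right"): 44,
    ("female", "left"): 4,
}

# Marginal distributions: sum the cells over the other variable.
sex_totals = Counter()
hand_totals = Counter()
for (sex, hand), n in counts.items():
    sex_totals[sex] += n
    hand_totals[hand] += n

# Conditional distribution of sex within right-handed people:
# divide each right-handed cell by the right-handed total.
sex_given_right = {
    sex: counts[(sex, "right")] / hand_totals["right"] for sex in sex_totals
}

print(dict(sex_totals))   # {'male': 52, 'female': 48}
print(dict(hand_totals))  # {'right': 87, 'left': 13}
print(sex_given_right)
```

The conditional proportions sum to 1 because they are computed within the right-handed group only.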

## Contingency table variations

These are examples of a visually appealing way to present a contingency table:

![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)


## Frequency distribution table

Like a histogram in tabular form: each row is a class interval together with its frequency.

<img src="Media/Frequency_distribution_table.png" width="250"/>

## Frequency table (ordered data points)

| Age | Data points |
| - | - |
| 9 and younger | 0.5, 1, 5, 8, 9, 9 |
| 10-19 | 10, 10, 11, 15, 16, 18, 18, 19 |
| 20-29 | 20, 21, 21, 22, 25, 27, 28, 28, 29, 29, 29 |


## Stem-and-leaf diagram

<img src="Media/Stem_and_leaf.png" width="300"/>
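
A stem-and-leaf diagram can be built by splitting each value into a stem (here the tens digit) and a leaf (the units digit). The sketch below is one possible way to do it, using the whole-number ages from the frequency table above; `stem_and_leaf` is a hypothetical helper name.

```python
from collections import defaultdict

# Whole-number ages taken from the frequency table above.
ages = [1, 5, 8, 9, 9, 10, 10, 11, 15, 16, 18, 18, 19,
        20, 21, 21, 22, 25, 27, 28, 28, 29, 29, 29]


def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted list of leaves (units digits)."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        stems[stem].append(leaf)
    return dict(stems)


diagram = stem_and_leaf(ages)
for stem, leaves in diagram.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
# 0 | 1 5 8 9 9
# 1 | 0 0 1 5 6 8 8 9
# 2 | 0 1 1 2 5 7 8 8 9 9 9
```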

## Two-way frequency table

Take two variables: one occupies the columns and the other occupies the rows, and each cell holds the frequency of that combination.

- If the cells are expressed as percentages, it is a two-way relative frequency table.

<img src="Media/Two-way_frequency_table.png" width="500"/>
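
Turning a two-way frequency table into a two-way relative frequency table means dividing every cell by the grand total. A minimal sketch, with hypothetical variables and counts:

```python
# Hypothetical counts: rows = time of day, columns = drink ordered.
counts = {
    "morning": {"coffee": 60, "tea": 20},
    "evening": {"coffee": 30, "tea": 90},
}

grand_total = sum(sum(row.values()) for row in counts.values())

# Relative frequency (%) of each cell with respect to the grand total.
relative = {
    row: {col: round(100 * n / grand_total, 1) for col, n in cells.items()}
    for row, cells in counts.items()
}

for row, cells in relative.items():
    print(row, cells)
# morning {'coffee': 30.0, 'tea': 10.0}
# evening {'coffee': 15.0, 'tea': 45.0}
```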

# Dotplot

The number of dots in the plot equals the number of observations.

Variables: X - categorical, Y - numerical. 

<img src="Media/Dotplot.png" width="500"/>

<img src="Media/Dotplot_2.png" width="500"/>
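
One way to draw such a plot with Matplotlib is to plot every observation as a marker with no connecting line. This is only a sketch; the categories and values are hypothetical.

```python
import matplotlib.pyplot as plt

# Hypothetical observations: (category, numerical value) pairs.
observations = [("A", 3), ("A", 5), ("A", 4), ("B", 6),
                ("B", 7), ("C", 2), ("C", 3), ("C", 5)]
xs = [category for category, _ in observations]
ys = [value for _, value in observations]

fig, ax = plt.subplots()
# One marker per observation, no line between them.
(dots,) = ax.plot(xs, ys, linestyle="none", marker="o")
ax.set_xlabel("Category")
ax.set_ylabel("Value")
```

Outside a notebook, call `plt.show()` or `fig.savefig(...)` to render the figure.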

# Scatterplot

Features:
- Shows the relationship between two quantitative (continuous) variables
- Often a precursor to a line chart if a pattern persists
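
A minimal Matplotlib sketch of a scatterplot; the two variables and their values are hypothetical.

```python
import matplotlib.pyplot as plt

# Hypothetical paired measurements: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 64, 70, 74, 79, 85]

fig, ax = plt.subplots()
points = ax.scatter(hours, scores)
ax.set_xlabel("Hours studied")
ax.set_ylabel("Exam score")
ax.set_title("Relationship between two quantitative variables")
```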


# Bar plot

Features:
- Compares quantities across categories, so it needs one categorical variable and one numerical variable
- Use a horizontal bar chart when the category labels are long or there are too many categories

<img src="Media/Bar_plots.png" width="700"/>

<img src="Media/data-viz/bar-plot-types.png">
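
A horizontal bar chart for long category labels could look like the following Matplotlib sketch; the labels and counts are hypothetical.

```python
import matplotlib.pyplot as plt

# Hypothetical counts per category; the long labels are the reason for barh.
categories = [
    "Customer support requests",
    "Billing and invoicing questions",
    "Feature suggestions",
]
values = [120, 75, 42]

fig, ax = plt.subplots()
bars = ax.barh(categories, values)
ax.invert_yaxis()  # keep the first category at the top
ax.set_xlabel("Count")
fig.tight_layout()  # leave room for the long labels
```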


# Boxplot (box-and-whiskers plot)

- Provides a five-number summary of the data: minimum, 1st quartile, median, 3rd quartile, maximum
- Useful for comparing distributions between different groups or identifying outliers in your data
- Needs at least one continuous numeric variable
- Swarm plot: a scatter plot of the individual data points, arranged so that they do not overlap; often overlaid on box plots.

e.g. dataset {4, 4, 6, 7, 10, 11, 12, 14, 15}. Median (Q2) = 10; Q1 = 5 (median of the lower half 4, 4, 6, 7), Q3 = 13 (median of the upper half 11, 12, 14, 15), IQR = 13 - 5 = 8.

![image.png](attachment:image.png)
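
The five-number summary for the dataset above can be checked with Python's `statistics` module. For this dataset, the default `"exclusive"` method of `statistics.quantiles` gives the same quartiles as the hand calculation. The 1.5 × IQR outlier rule in the last step is a common convention and an assumption here, since the notes do not state which rule is used.

```python
import statistics

data = [4, 4, 6, 7, 10, 11, 12, 14, 15]

# Quartiles with the default "exclusive" method.
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

five_number_summary = (min(data), q1, median, q3, max(data))
print(five_number_summary)  # (4, 5.0, 10.0, 13.0, 15)
print(iqr)                  # 8.0

# Assumed convention: points beyond 1.5 * IQR from the quartiles are outliers.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low or x > high]
print(outliers)             # []
```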

## Violin plot

An enhancement to the boxplot. 

It can show the nuances in the data distributions that are hidden in the boxplot. 

However, the outliers are less visible. 
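
The difference is easiest to see on data with two clusters. The sketch below draws a boxplot and a violin plot of the same hypothetical bimodal sample side by side with Matplotlib; the sample is randomly generated for illustration only.

```python
import random

import matplotlib.pyplot as plt

# Hypothetical bimodal sample: two clusters that a boxplot summarises as one box.
random.seed(0)
sample = ([random.gauss(2, 0.5) for _ in range(200)]
          + [random.gauss(6, 0.5) for _ in range(200)])

fig, (ax_box, ax_violin) = plt.subplots(1, 2, sharey=True)
ax_box.boxplot(sample)
ax_box.set_title("Boxplot")
parts = ax_violin.violinplot(sample, showmedians=True)
ax_violin.set_title("Violin plot")
```

The violin should show two bulges, one per cluster, while the box only shows the quartiles of the combined sample.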

# Histogram

Features:
- Shows the distribution of one or more numeric variables (typically continuous, grouped into bins)
- Used to show the frequency and spread of values across the intervals (bins) of a dataset
- Loses the identities of individual observations: each bar is a count for a single numerical variable

> A related form is the **pictograph**, where a quantity is represented by a number of pictures:

<img src="Media/Pictograph.png" width="300"/>

A histogram can also show percentages instead of counts for each bar; this is a *relative frequency histogram*.
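
A relative frequency histogram can be sketched in Matplotlib by giving each observation a weight of 1/n, so that the bar heights sum to 1. The ages are the whole-number values from the frequency table earlier; the bin edges are an assumption.

```python
import matplotlib.pyplot as plt

# Whole-number ages from the frequency table earlier in these notes.
ages = [1, 5, 8, 9, 9, 10, 10, 11, 15, 16, 18, 18, 19,
        20, 21, 21, 22, 25, 27, 28, 28, 29, 29, 29]

# Each observation contributes 1/n, so the bar heights are proportions.
weights = [1 / len(ages)] * len(ages)

fig, ax = plt.subplots()
heights, bin_edges, _ = ax.hist(ages, bins=[0, 10, 20, 30], weights=weights)
ax.set_xlabel("Age")
ax.set_ylabel("Relative frequency")
```

Dropping the `weights` argument gives the ordinary count histogram.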

# Line plot

Features:
- Represents quantitative data for 1 or more variables:
  - Shows relationship between two quantitative variables (one for each axis) OR
  - Displays trends over time (time series plot): the x-axis is time and the y-axis is a numeric variable
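
A minimal Matplotlib sketch of a time series line plot; the years and values are hypothetical.

```python
import matplotlib.pyplot as plt

# Hypothetical yearly values for a time series plot.
years = [2018, 2019, 2020, 2021, 2022, 2023]
sales = [120, 135, 128, 150, 162, 170]

fig, ax = plt.subplots()
(line,) = ax.plot(years, sales, marker="o")
ax.set_xlabel("Year")
ax.set_ylabel("Sales")
```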

# Pie chart

Each slice = category.

Size of slice = frequency. 

https://matplotlib.org/stable/gallery/pie_and_polar_charts/pie_and_donut_labels.html#sphx-glr-gallery-pie-and-polar-charts-pie-and-donut-labels-py
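
A minimal Matplotlib sketch of the slice-to-category mapping; the labels and counts are hypothetical.

```python
import matplotlib.pyplot as plt

# Hypothetical categories and their frequencies.
labels = ["No Answer", "Rejected", "1st Interviews"]
counts = [4, 9, 4]

fig, ax = plt.subplots()
# Each wedge's angle is proportional to its share of the total.
ax.pie(counts, labels=labels, autopct="%1.0f%%")
ax.set_title("Share of each category")
```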

# Sankey Diagram

Resources:
- https://sankeymatic.com/
- https://sankeymatic.com/build/
- plotly

**Sankeymatic.com**

**Note:** having one joint "Rejected" node that receives flows from every stage ("Jobs applied", "1st interview" if rejected, "2nd interview" if rejected) doesn't work: the math doesn't add up.
An example of this is below.

```txt
// Enter Flows between Nodes, like this:
//         Source [AMOUNT] Target

:Jobs applied #3CB043
:No reply #808080
:Rejected #FF0000
:1st Interview #3CB043

// total applications: 100
// Initial response: no reply (50), rejected (40), 1st interview (10)

Jobs applied [40] Rejected #FF0000
Jobs applied [50] No reply #808080
Jobs applied [10] 1st Interview

1st Interview [10] Rejected


// You can set a Node's color, like this:
// :Budget #708090
//            ...or a color for a single Flow:
// Budget [160] Other Necessities #0F0

// Use the controls below to customize
// your diagram's appearance...
```


Sample of the final job application Sankey diagram:
```txt
// Sample Job Search diagram:


Applications [4] 1st Interviews
Applications [9] Rejected
Applications [4] No Answer

1st Interviews [2] 2nd Interviews
1st Interviews [2] No Offer

2nd Interviews [2] Offers

Offers [1] Accepted
Offers [1] Declined

:Applications #3CB043
:No Answer #808080
:Rejected #FF0000
:1st Interviews #3CB043
:2nd Interviews #3CB043
:No Offer #FF0000
:Offers #3CB043
:Accepted #3CB043
:Declined #FF0000

```
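
Before pasting flows into the builder, it can help to check that they add up: for every intermediate node, the total flowing in should equal the total flowing out. The sketch below is one possible check, assuming the `Source [AMOUNT] Target` syntax used above; `FLOW` and `node_totals` are hypothetical names and not part of SankeyMATIC.

```python
import re
from collections import defaultdict

# Matches "Source [AMOUNT] Target" with an optional trailing "#colour".
FLOW = re.compile(
    r"^(?P<source>.+?)\s*\[(?P<amount>\d+(?:\.\d+)?)\]\s*"
    r"(?P<target>.+?)(?:\s+#[0-9A-Fa-f]{3,8})?$"
)


def node_totals(text):
    """Return {node: [inflow, outflow]} for the flow lines in `text`."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments and node colour definitions.
        if not line or line.startswith(("//", ":")):
            continue
        match = FLOW.match(line)
        if match:
            amount = float(match["amount"])
            totals[match["source"]][1] += amount
            totals[match["target"]][0] += amount
    return dict(totals)


flows = """
Applications [4] 1st Interviews
Applications [9] Rejected
Applications [4] No Answer
1st Interviews [2] 2nd Interviews
1st Interviews [2] No Offer
2nd Interviews [2] Offers
Offers [1] Accepted
Offers [1] Declined
"""

totals = node_totals(flows)
for node, (inflow, outflow) in totals.items():
    # Only nodes with both inflow and outflow need to balance.
    if inflow and outflow and inflow != outflow:
        print(f"{node}: in {inflow} != out {outflow}")
```

For the sample above nothing is printed, because every intermediate node is balanced.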


WIP:
- Width: 1000
- Labels: Placement: Per Stage: After the Nodes
```txt
// Sample Job Search diagram:


Applications [9] Further Communication
Applications [9] Rejected #FF0000
Applications [105] No Answer #808080

Further Communication [1] 1st Interview
Further Communication [3] Online Assessment
Further Communication [5] No Further Communication

:Applications #3CB043
:No Answer #808080
:Rejected #FF0000
:Further Communication #3CB043
:1st Interview #3CB043
:Online Assessment #3CB043
:No Further Communication #FF0000

```


# Small multiples (providing context)

<img src="Media/data-viz/small-multiples.png">

<img src="Media/data-viz/small-multiples2.jpg">


# Dashboards

- KPIs should be front and centre in your reports or dashboards: visually distinct and easily understandable at a glance.

<h2>PowerBI</h2>

- ✅ Easy to make professional dashboards;
- ❌ Dashboards can only be shared with people who have PowerBI accounts;
- ❌ A free account cannot share dashboards;
- ❌ Publish to web makes the dashboard public, not private;

<h2>Tableau</h2>

- ✅ Same strengths as PowerBI;
- ❌ Expensive: $900 per year for the dashboard analyst, plus a subscription for each user;
- ❌ A licence is needed both to create and to read dashboards;

<h2>R - Shiny</h2>

- ❌ Need to know R

<h2>Python - Plotly Dash</h2>

- ✅ Highly customisable; 
- ✅ Can build full-featured dashboards;
- ✅ Moderately easy to use;
- ✅ Recommended for production;
- ✅ Can set login;
- ✅ Customisable frontend components;
- ❌ Steep learning curve: Python, plotly dash library, HTML, CSS

<h2>Python - Streamlit</h2>

- ✅ Incredibly easy to use;
- ✅ Makes it super fast to create web apps;
- ✅ Native multi-page support;
- ✅ Free built-in hosting provided by Streamlit;
- ✅ No need to know frontend
- ❌ There are not many options for customisation; you can quickly access pre-built functionalities, but it's not nearly as customisable as plotly dash;
- ❌ Need to know Python
- ❌ Too simple
