# Data Visualization

## Important Packages

- ```ggplot2```
    - Part of the ```tidyverse``` metapackage.
    - Creates all sorts of visualizations of data.
- ```RColorBrewer```
    - Provides the ability to pick custom colour schemes.
- ```lubridate```
    - Part of the ```tidyverse``` metapackage (in ```tidyverse``` versions before 2.0.0 it is installed but not loaded, so it must be loaded individually).
    - Converts character strings to date vectors.

## Choosing the Visualization

<div style="display: flex; flex-direction: row; align-items:flex-start;">
    <div style="width: 500px;">
        <ul>
            <li><b>Scatter Plots</b> visualize the relationship between two quantitative variables.</li>
            <li><b>Line Plots</b> visualize trends with respect to an independent, ordered quantity (e.g. time).</li>
            <li><b>Bar Plots</b> visualize comparisons of amounts.</li>
            <li><b>Histograms</b> visualize the distribution of one quantitative variable (i.e. all its possible values and how often they occur).</li>
        </ul>
    </div>
    <img src="media/choosing_visualization.png" style="width:200px; margin-left: 50px;"> 
</div>

***AVOID PIE CHARTS***, use bar graphs instead, as it is easier to compare bar heights than pie slice sizes.

***DON'T USE 3-D VISUALIZATIONS***, they are hard to understand when converted to a static 2-D image format.

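***USE A BAR PLOT*** when one of your variables is non-numeric (categorical). (Notably, year can be treated as a categorical variable when comparing amounts across a handful of years; to show a trend over an ordered quantity such as time, use a line plot.)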

## Refining the Visualization

### Convey the Message

- Make the visualization answer the question you have asked as simply and plainly as possible.
- Use legends and labels so that your visualization is understandable without reading the surrounding text.
- Ensure the text, symbols, lines, etc., on your visualization are big enough to be easily read.
- Ensure the data are clearly visible; don't hide the shape/distribution of the data behind other objects (e.g., a bar).
- Make sure to use color schemes that are understandable by those with colorblindness.
- Redundancy can be helpful; sometimes conveying the same message in multiple ways reinforces it for the audience.

### Minimize Noise

- Use colors sparingly. Too many colors can be distracting, create false patterns, and detract from the message.
- Be wary of overplotting. Overplotting is when marks that represent the data overlap. It is problematic because it prevents you from seeing how many data points are represented in the areas of the visualization where it occurs. If your plot has too many dots or lines and starts to look like a mess, you need to do something different.
- Only make the plot area (where the dots, lines, bars are) as big as needed. Simple plots can be made small.
- Don't adjust the axes to zoom in on small differences. If the difference is small, show that it's small!

## Creating visualizations with ```ggplot2```

<img src="media/visualizations_ggplot2.png" width="500px;">

- The **aesthetic mapping** tells ```ggplot``` how the columns in the data frame map to the properties of the visualization.
    - To create an aesthetic mapping, we use the ```aes``` function.
    - Here, we set ```x``` to ```date_measured``` and ```y``` to ```ppm```.
    - Common aesthetics: ```x```, ```y```, ```fill```, ```colour```, ```shape```.
- The **geometric object** specifies how the mapped data should be displayed.
    - We use a ```geom_*``` function to create a geometric object.
    - In this case we use ```geom_point```.
    - Common geometric objects: ```geom_point()```, ```geom_line()```, ```geom_histogram()```, ```geom_bar()```, ```geom_vline()```, ```geom_hline()```.
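
A minimal sketch of how the aesthetic mapping and the geometric object fit together. It assumes R's built-in ```co2``` time series (monthly Mauna Loa CO2 readings starting in January 1959) as a stand-in for the data in the figure; the name ```co2_df``` and the way ```date_measured``` is built are illustrative assumptions, not the original code.

```r
library(ggplot2)

# Assumption: build a data frame from the built-in `co2` time series,
# which holds one reading per month starting in January 1959.
co2_df <- data.frame(
  date_measured = seq(as.Date("1959-01-01"), by = "month", length.out = length(co2)),
  ppm = as.numeric(co2)
)

# Aesthetic mapping: date_measured -> x, ppm -> y.
# Geometric object: geom_point() draws one point per row.
co2_scatter <- ggplot(co2_df, aes(x = date_measured, y = ppm)) +
  geom_point()

co2_scatter
```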

- ```scale_x_continuous()```, ```scale_y_continuous()```
    - The ```breaks``` argument sets where the tick marks appear on the axis.
        - E.g. ```ggplot + scale_x_continuous(breaks = c(2, 5))``` puts ticks on the x axis only at 2 and 5; ticks at 1, 3, and 4 would not be drawn.
        - E.g. ```ggplot + scale_x_continuous(breaks = seq(1, 6, 0.33))``` puts a tick every 0.33 units starting at 1 (1, 1.33, 1.66, 1.99, and so on). Note that ```breaks``` changes only the tick positions, not the range of the axis.
- ```options(repr.plot.width = ..., repr.plot.height = ...)```
    - Sets the size of plots displayed in a Jupyter notebook. Put this line at the start of the code block.
    - ```width = 9```, ```height = 7``` works well on a 13-inch display.
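
A small sketch of the two settings above used together. The data frame ```toy_df``` is made up purely to show where the ticks land.

```r
library(ggplot2)

# Only changes the displayed plot size in Jupyter notebooks; harmless elsewhere.
options(repr.plot.width = 9, repr.plot.height = 7)

# Assumption: a tiny made-up data frame, used only to illustrate tick placement.
toy_df <- data.frame(x = 1:6, y = c(2, 4, 3, 6, 5, 7))

# Ticks on the x axis appear only at 2 and 5; the axis range is unchanged.
toy_plot <- ggplot(toy_df, aes(x = x, y = y)) +
  geom_point() +
  scale_x_continuous(breaks = c(2, 5))

toy_plot
```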

### Visualizing Oscillations

Given the following graph:

<img src="media/visualizing_oscillations1.png" width="400px;">

We can better understand the oscillation by changing the visualization slightly. We can do this by using *scales*. We scale the horizontal axis using the ```xlim``` function, and the vertical axis with the ```ylim``` function.

Here, we will be using the ```xlim``` function to zoom in on just 5 years of the data.

<img src="media/visualizing_oscillations2.png" width="400px;">

***NOTE:*** ```lubridate``` is a package that is installed by the ```tidyverse``` metapackage. In ```tidyverse``` versions before 2.0.0 it is not loaded by ```library(tidyverse)``` and must be loaded separately.
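
A hedged sketch of zooming in with ```xlim``` on a date axis. It assumes the built-in ```co2``` time series as a stand-in for the plotted data, and the five-year window (1990 to 1994) is an illustrative choice, not necessarily the one in the figure.

```r
library(ggplot2)
library(lubridate)

# Assumption: monthly CO2 readings from the built-in `co2` series,
# with dates created by lubridate's ymd().
co2_df <- data.frame(
  date_measured = seq(ymd("1959-01-01"), by = "month", length.out = length(co2)),
  ppm = as.numeric(co2)
)

# xlim() takes the lower and upper limits of the horizontal axis.
# Because the x axis holds dates, the limits must be dates too.
co2_zoom <- ggplot(co2_df, aes(x = date_measured, y = ppm)) +
  geom_line() +
  xlim(c(ymd("1990-01-01"), ymd("1994-12-01")))

co2_zoom
```

Rows outside the limits are dropped before plotting, so ```ggplot2``` prints a warning about removed rows; that is expected here.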

### Axis transformation and colored scatter plots

#### Transform Axis Logarithmically

When dealing with data that take both *very large* and *very small* values, you can adjust the horizontal and vertical axes so that they are on a **logarithmic** (or log) scale.

We can accomplish logarithmic scaling in a ```ggplot``` using the ```scale_x_log10``` and ```scale_y_log10``` functions.

<img src="media/log_scaling.png" width="400px;">
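
As a rough illustration, the sketch below applies both log scales to the ```msleep``` data set that ships with ```ggplot2```, chosen here only because its body and brain weights span several orders of magnitude. It is not the data shown in the figure.

```r
library(ggplot2)

# `msleep` ships with ggplot2; bodywt and brainwt are both in kilograms.
# Rows with a missing brainwt are dropped with a warning.
log_plot <- ggplot(msleep, aes(x = bodywt, y = brainwt)) +
  geom_point() +
  scale_x_log10() +  # horizontal axis on a log10 scale
  scale_y_log10() +  # vertical axis on a log10 scale
  labs(x = "Body weight (kg)", y = "Brain weight (kg)")

log_plot
```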

#### Add colors and shapes to points

To distinguish between more categories, we can add colors and shapes by adding arguments to the ```aes``` function.

We can then style the legend by using the ```legend.position``` and ```legend.direction``` arguments of the ```theme``` function.

<img src="media/colors_shapes.png" width="400px">

***NOTE:*** We can use ```RColorBrewer``` to add color palettes that are colorblind friendly. 
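
A minimal sketch combining these pieces, assuming the ```mpg``` data set that ships with ```ggplot2``` as example data. The ```Dark2``` palette is one of the ```RColorBrewer``` palettes flagged as colorblind friendly; ```scale_colour_brewer``` is the ```ggplot2``` function that applies it.

```r
library(ggplot2)

# Map the same categorical column (drv) to both colour and shape,
# so each category is distinguishable in two ways.
colour_plot <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv, shape = drv)) +
  geom_point() +
  labs(x = "Engine displacement (L)", y = "Highway miles per gallon",
       colour = "Drive type", shape = "Drive type") +
  scale_colour_brewer(palette = "Dark2") +  # RColorBrewer palette
  theme(legend.position = "top", legend.direction = "horizontal")

colour_plot
```

Giving ```colour``` and ```shape``` the same label in ```labs()``` lets ```ggplot2``` merge them into a single legend.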

### Bar Plots

We create bar plots via the ```geom_bar``` function in ```ggplot2```. However, by default, ```geom_bar``` sets the height of the bars to the number of times a value appears in a data frame; here, we want to plot exactly the values in the data frame, so we have to pass the ```stat = "identity"``` argument to ```geom_bar```.

<img src="media/barplots1.png" width="500px;">
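
A minimal sketch of a bar plot whose heights come straight from a column. It assumes ```islands_df``` can be approximated from R's built-in ```islands``` vector (landmass areas in thousands of square miles); the column names ```landmass``` and ```size``` are assumptions for illustration.

```r
library(ggplot2)

# Assumption: rebuild a data frame from the built-in named vector `islands`.
islands_df <- data.frame(landmass = names(islands), size = as.numeric(islands))

# stat = "identity" makes the bar height equal to the value in `size`
# instead of a count of rows.
islands_bar <- ggplot(islands_df, aes(x = landmass, y = size)) +
  geom_bar(stat = "identity") +
  labs(x = "Landmass", y = "Size (1000 square miles)")

islands_bar
```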

If we only want to keep the largest land masses, we can use ```slice_max()```:

```r
islands_top12 <- slice_max(islands_df, order_by = size, n = 12)
```

If we want to reorder the bars so that they appear in increasing order, we can use ```fct_reorder```. (Also note that by setting the ```fill``` argument in ```labs()```, we can rename the legend.)

- First arg: the column whose categories should be reordered.
- Second arg: the column whose values determine the new order.

<img src="media/fct_reorder.png" width="400px">
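
The two steps above can be sketched as follows, again assuming a data frame built from the built-in ```islands``` vector with assumed column names ```landmass``` and ```size```.

```r
library(ggplot2)
library(dplyr)
library(forcats)

# Assumption: rebuild a data frame from the built-in named vector `islands`.
islands_df <- data.frame(landmass = names(islands), size = as.numeric(islands))

# Keep only the 12 largest landmasses.
islands_top12 <- slice_max(islands_df, order_by = size, n = 12)

# fct_reorder(landmass, size) orders the landmass categories by increasing size.
islands_ordered <- ggplot(islands_top12,
                          aes(x = size, y = fct_reorder(landmass, size))) +
  geom_bar(stat = "identity") +
  labs(x = "Size (1000 square miles)", y = "Landmass")

islands_ordered
```

Putting the landmass names on the vertical axis keeps long labels readable without rotating them.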

### Histograms

Histograms help us visualize how a particular variable is distributed in a data set by separating the data into bins, and then using vertical bars to show how many data points fell into each bin.

To create histograms, we use ```geom_histogram```, mapping only the ```x``` aesthetic to the variable we want to analyze.

<img src="media/histogram1.png" width="400px">

To show how accurate the values plotted on a histogram really are, we can use ```geom_vline``` to mark, on the x axis, the true value of the variable. Notably, there is a similar ```geom_hline``` that is used for horizontal lines.

<img src="media/histogram2.png" width="400px">
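
A hedged sketch of a histogram with a reference line, assuming R's built-in ```morley``` data set (Michelson's speed-of-light measurements) as example data. Its ```Speed``` column is recorded in km/s minus 299,000, so the true speed of light, 299,792.458 km/s, sits at 792.458 on that scale.

```r
library(ggplot2)

# `morley` is built into R: columns Expt, Run, and Speed.
morley_hist <- ggplot(morley, aes(x = Speed)) +
  geom_histogram() +  # only x is mapped; the bar heights are counts
  geom_vline(xintercept = 792.458, linetype = "dashed") +  # true value
  labs(x = "Measured speed of light (km/s minus 299,000)",
       y = "Number of measurements")

morley_hist
```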

If we want to separate out the histogram even more, we can use the ```fill``` aesthetic mapping. We need to use ```as_factor()``` to turn the variable into a factor before using it in ```fill```. To make sure that overlapping colors can be seen, we can set the ```alpha``` argument in ```geom_histogram``` to ```0.5``` so that the bars are partially transparent. We also add ```position = "identity"``` in ```geom_histogram``` so that the histograms for each experiment are overlaid on top of one another instead of being stacked (stacking is the default for bar plots and histograms when they are colored by another categorical variable).

<img src="media/histogram3.png" width="400px">
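
A sketch of the overlaid, colored histograms described above, assuming the built-in ```morley``` data set, where ```Expt``` is an integer experiment number that must be converted to a factor before it is used for ```fill```.

```r
library(ggplot2)
library(forcats)

morley_by_expt <- ggplot(morley, aes(x = Speed, fill = as_factor(Expt))) +
  # alpha = 0.5 makes bars translucent; position = "identity" overlays
  # the per-experiment histograms instead of stacking them.
  geom_histogram(alpha = 0.5, position = "identity") +
  geom_vline(xintercept = 792.458, linetype = "dashed") +
  labs(x = "Measured speed of light (km/s minus 299,000)",
       y = "Number of measurements",
       fill = "Experiment ID")

morley_by_expt
```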

#### Choosing a binwidth for histograms

When creating a histogram in R, the default number of bins is 30. You can set the number of bins yourself by using the ```bins``` argument in the ```geom_histogram``` geometric object. Alternatively, you can set the *width* of the bins using the ```binwidth``` argument in the ```geom_histogram``` geometric object.

<img src="media/diff_bins.png" style="width: 400px;">
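
A short sketch of both arguments, assuming the built-in ```morley``` data set. The specific values (10 bins, width 50) are arbitrary choices for illustration.

```r
library(ggplot2)

# Fix the number of bins.
morley_bins <- ggplot(morley, aes(x = Speed)) +
  geom_histogram(bins = 10)

# Fix the width of each bin, in the units of the x variable.
morley_binwidth <- ggplot(morley, aes(x = Speed)) +
  geom_histogram(binwidth = 50)

morley_bins
morley_binwidth
```

Trying a few values and comparing the plots is usually the quickest way to find a bin size that shows the shape of the distribution without too much noise.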

### Using ```facet_grid```

```facet_grid``` is used to create a plot that has multiple subplots arranged in a grid. The argument to ```facet_grid``` specifies the variable(s) used to split the plot into subplots, and how to split them (rows/columns). If the plot is to be split horizontally, into rows, then the ```rows``` argument is used, and if vertically, into columns, the ```cols``` argument is used.

Note that column names must be surrounded with ```vars()```. This function allows the column names to be correctly evaluated in the context of the data frame.

<img src="media/facet_grid1.png" width="400px;">
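
A minimal sketch of faceting by rows, assuming the built-in ```morley``` data set with one subplot per experiment. The bin count is an arbitrary choice.

```r
library(ggplot2)

# One row of the grid per value of Expt; note the vars() wrapper.
morley_facet <- ggplot(morley, aes(x = Speed)) +
  geom_histogram(bins = 15) +
  facet_grid(rows = vars(Expt)) +
  labs(x = "Measured speed of light (km/s minus 299,000)",
       y = "Number of measurements")

morley_facet
```

Swapping ```rows = vars(Expt)``` for ```cols = vars(Expt)``` would place the subplots next to each other in columns instead.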

## Note on ```fill``` and ```color```

- ```fill =```
    - With ```geom_bar()```/```geom_histogram()``` this aesthetic **fills in the bars** with a specific colour or separates the counts by a variable different from the x-axis.
- ```color =```
    - With ```geom_bar()```/```geom_histogram()``` this aesthetic **outlines the bars** with a specific colour or separates the counts by a variable different from the x-axis.
    - With ```geom_point()```, it **fills in the points** (colouring them based on a particular (categorical) variable aside from the x/y-axis).
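
A brief sketch contrasting the two, assuming the ```mpg``` data set that ships with ```ggplot2```. The colour names are arbitrary.

```r
library(ggplot2)

# Constant values set in the geom: fill colours the inside of each bar,
# colour draws the outline.
bar_styles <- ggplot(mpg, aes(x = class)) +
  geom_bar(fill = "steelblue", colour = "black")

# Mapped inside aes(): fill splits each bar's count by another variable.
bar_by_group <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar()

# With points, colour is what colours the points themselves.
point_by_group <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point()

bar_styles
bar_by_group
point_by_group
```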

## Describing the Visualization

- **Direction:** positive/negative/little or no relationships
- **Strength:** The relationship is strong when the scatter points are close together and look more like a "line" or "curve" rather than a "cloud."
- **Shape:** Linear/nonlinear

Discuss your visualization as a story:

1. Establish the setting and scope, and describe why you did what you did.
2. Pose the question that your visualization answers. Justify why the question is important to answer.
3. Answer the question using your visualization. Make sure you describe *all* aspects of the visualization (including describing the axes). But you can emphasize different aspects based on what is important to answer your question:
    - **trends (lines)**: Does a line describe the trend well? Is it linear? Is it positive/negative? Is there oscillation? Is it noisy or smooth?
    - **distributions (scatters, histograms)**: How spread out are the data? Where are they centered, roughly? Are there obvious "clusters" or "subgroups," which would be visible as multiple bumps in the histogram?
    - **distributions of two variables (scatters)**: Is there a clear / strong or weak or no relationship?
    - **amounts (bars)**: How large are the bars relative to one another? Are there patterns in different groups of bars?
4. Summarize your findings, and use them to motivate whatever you will discuss next.

## Saving the visualization

### Raster Images

**Raster** images are represented as a 2-D grid of square pixels, each with its own color. They are often compressed before storing, and the compression can be either lossy or lossless.

A compressed format is ***lossy*** if the image cannot be perfectly re-created when loading and displaying, with the hope that the change is not noticeable.

***Lossless formats*** allow a perfect display of the original image.

Common file types:
- JPEG (.jpg, .jpeg): lossy, usually used for photographs
- PNG (.png): lossless, usually used for plots/line drawings.
- BMP (.bmp): lossless, raw image data, no compression (rarely used)
- TIFF (.tif, .tiff): Typically lossless, no compression, used mostly in graphic arts, publishing.


### Vector Images

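**Vector** images are represented as a collection of mathematical objects (lines, surfaces, shapes, curves), so they can be rescaled without losing quality.

Common file types:
- SVG (.svg): general-purpose use
- EPS (.eps): general-purpose use (rarely used)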

### Saving the plot as an image

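```r
library(ggplot2)

# Assumes `faithful_plot` is an existing ggplot object and that the img/ directory exists.
# ggsave() picks the file format from the file extension.
ggsave("img/faithful_plot.png", faithful_plot)
ggsave("img/faithful_plot.jpg", faithful_plot)
ggsave("img/faithful_plot.bmp", faithful_plot)
ggsave("img/faithful_plot.tiff", faithful_plot)
ggsave("img/faithful_plot.svg", faithful_plot)  # SVG output needs the svglite package installed
```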