<h1>MT Review “Cheat Sheet”</h1>

**Chapter 1
R and tidyverse**


**LO1: Identify the different types of data analysis questions and categorize a question into the correct type:**

Different types of questions:

**Descriptive**: asks about summarized characteristics of a data set without interpretation. Asks what is, rather than why or how.
Ex: how many languages are there in India?

**Exploratory**: asks if there are trends, relationships within a single data set. 
Ex: does exam success rate change with the amount of sleep students get?

**Predictive**: asks about predicting measurements for things or people. FOCUS on what things predict the outcome, NOT what causes the outcome. Classify new observations based on existing data.
Ex: what is the type of a newly observed tumour based on its width and texture?

**Inferential**: looks for patterns, relationships in a single data set AND makes inference on the wider population.
Ex: does sleep time affect the success rate of students in all of Canada?

**Causal**: asks about whether changing one factor will lead to change in another factor, in the wider population. 
Ex: does sleep deprivation lead to lower success rate in students in BC?

**Mechanistic**: asks about the underlying mechanism of the observed relationships. How does it happen?
Ex: How does sleep deprivation lead to lower success rates in students of BC?


**LO2: Load the tidyverse package into R**

options(repr.matrix.max.rows = 6)
library(palmerpenguins)

**Function**: a special word in R that takes instructions (arguments) and does something with them.

**R package**: a collection of functions that can be used in addition to R's built-in functions. **ex**: read_csv() is a function provided by the tidyverse package.

#code# 
library(tidyverse)

**Tidyverse**: a meta package: it bundles several other packages, which together provide functions to load, clean, wrangle, and visualize data.

# We have to put quotes "" around file names and other words in the code cell to distinguish
# them from special words (like function names) in R

**LO3: read tabular data with read_csv**

.csv: tabular data in the "comma-separated values" format, where the values in each row are separated by commas.

**read_csv:** function, expects data files to:
- have column names (headers)
- use a comma (,) to separate columns
- not have row names

##code:
## name <- read_csv("folder name/data file name")
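
As a minimal sketch of the pattern above, the following writes a tiny made-up CSV to a temporary file and reads it back. The file contents and the names `csv_path` and `toy_data` are assumptions for illustration only.

```r
library(readr)

# Write a tiny CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("language,speakers",
             "English,100",
             "French,50"), csv_path)

# read_csv() takes the first line as column names and guesses column types.
toy_data <- read_csv(csv_path, show_col_types = FALSE)
toy_data
```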

**LO4: naming things in R**

#using the assignment symbol "<-"
my_number <- 3

my_number + 2

**Conventions**: when naming, only use lowercase letters, numbers and _ to separate words

**LO5: create and organize subsets of tabular data using filter, select, arrange, and slice**

**filter**: obtain a smaller set of rows with specific values (ex: only want rows with the year 2022)

**select**: obtain a smaller set of columns (ex: only want year and pollutants)

**code for filter**: filter(name of data, year=="2022")

logical statement: the second argument of the function; it evaluates to TRUE or FALSE for each row, and filter keeps only the rows where it evaluates to TRUE.

" " is used to tell R this is a **string** value, not a special word in the R language

**code for select**: select(data name, column1, column2)

**arrange**: **order the rows** of data frame by **values of a particular column**

**code for arrange**: arrange(dataframe, by=desc(column))

-Descending: (desc()), from largest to smallest

**arrange function automatically orders rows in ascending**

**slice**: function which selects rows according to **row number**

**code for slice**: slice(dataframe, 1:10)

-the second argument tells R to keep rows 1 to 10
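
The four functions above can be tried together on a small made-up table. This is only a sketch: the data frame `air` and its columns are hypothetical, and `year` is stored as a number here, so it is compared without quotes.

```r
library(dplyr)

# Hypothetical toy data.
air <- tibble(
  year = c(2021, 2022, 2022, 2022),
  city = c("A", "B", "C", "D"),
  pollutants = c(10, 30, 20, 40)
)

air_2022 <- filter(air, year == 2022)              # keep rows where year is 2022
air_cols <- select(air_2022, city, pollutants)     # keep only two columns
air_sorted <- arrange(air_cols, desc(pollutants))  # largest pollutants first
top_two <- slice(air_sorted, 1:2)                  # keep rows 1 and 2
top_two
```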

**LO6: add and modify columns in tabular data using mutate**

**mutate**: perform a calculation, making use of existing columns to compute a new column

**code for mutate**: mutate(dataframe, new column name = the equation)

col_double means that the data in this column is a number-type, specifically real numbers (meaning that these values can contain decimals)

col_integer means that the data in this column is integers (whole numbers)

col_character means that the data in this column contains text (e.g., letter or words)

**LO7: visualize data with a ggplot bar plot**

**Benefits of visualization**: great tool for summarizing information, help effectively communicate with audience.

# a hashtag (#) starts a comment

?filter

this is a way to pull up the documentation for most functions.

**Chapter 2 (reading in data locally and from the web)**

**Important packages for chapter 2**

- readxl: provides the read_excel() function to load a sheet from an Excel file into R
- DBI: provides the dbConnect() function to connect to a database (e.g., SQLite), and the dbListTables() function to list the tables in a database
- dbplyr: provides the tbl() function to create a reference to a database table so that it can be queried, and collect() to retrieve the result of a database query and bring it into R
- RPostgres: allows us to work with PostgreSQL databases

**LO1: Define the types of path and use them to locate files**

- a file can live locally (on your computer) or remotely (on the internet); the two are reached with different kinds of paths

**1. Relative file path**: where the file is with respect to the folder you are currently in (the **working directory**) on the computer.

**2. Absolute file path**: where the file is with respect to the base (root) folder of the computer's filesystem, regardless of where you are working.

- **Always start with "/"**

**" . " means reach a file from current directory (folder)**

**" .. " means go up to the parent directory**

**Generally**, it is better to use relative paths, because they help ensure the code can be run on a different computer, and they are shorter and easier to write. (A relative path is the same on any computer that has the project folder, whereas an absolute path depends on how each person has named and organized the folders above it.)

**LO2: read data into R from various types of path using following functions**

Plain text file: a document containing only text

**1. read_csv**: for reading tabular data with comma separated values
- the delimiter (separator): ","

code: canlang_data <- read_csv("data/can_lang.csv")
- data/ is put before file's name because the data set is located in a sub-folder called data, relative to where we are running our R code.


**skipping rows when reading data** : Sometimes extra information about the data (metadata) is included at the top of the data file. These lines have NO delimiters and are not intended to be read into the data frame with the tabular data.

- in this case, use skip argument: read_csv("data/name", skip=3)
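
A small sketch of the `skip` argument, assuming a made-up file whose first three lines are metadata (the file contents and the name `with_skip` are hypothetical):

```r
library(readr)

# Hypothetical file: three metadata lines, then the real table.
path <- tempfile(fileext = ".csv")
writeLines(c("Data source: made up for illustration",
             "Collected by nobody in particular",
             "Units are arbitrary",
             "category,count",
             "a,1",
             "b,2"), path)

# skip = 3 tells read_csv() to ignore the first three lines,
# so the fourth line is used as the header.
with_skip <- read_csv(path, skip = 3, show_col_types = FALSE)
with_skip
```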

**2. read_tsv**: tsv=tab-separated values files. 

code: read_tsv("data/can_lang.tsv")

**3. read_delim**: a more general function; read_csv and read_tsv are special cases of it. You NEED to specify a **delimiter**.

- delim = "\t" is for tab-separated values file
- delim = "," is for comma-separated values file
- delim = ";" is for semicolon-separated values file

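**data frames need to have column names**: if the file has none, one option is the col_names argument, e.g. col_names = c("name1", "name2") (with col_names = FALSE, the columns are named X1, X2, ...).
Another option is the function rename(data, **new_name=old_name**, column2= X2).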
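
Both naming options can be sketched on a made-up semicolon-separated file without a header row. The file contents and the column names `language` and `speakers` are assumptions for illustration.

```r
library(readr)
library(dplyr)

# Hypothetical semicolon-separated file with no header line.
path <- tempfile(fileext = ".txt")
writeLines(c("English;100", "French;50"), path)

# Option 1: read without names (columns become X1, X2), then rename.
no_header <- read_delim(path, delim = ";", col_names = FALSE,
                        show_col_types = FALSE)
renamed <- rename(no_header, language = X1, speakers = X2)

# Option 2: supply the names directly while reading.
named <- read_delim(path, delim = ";",
                    col_names = c("language", "speakers"),
                    show_col_types = FALSE)
```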

**4. a)read tabular data directly from URL**

(URL): Uniform Resource Locator

**code**: url <- "https://raw.githubusercontent.com/UBC-DSCI/data/main/can_lang.csv"

canlang_data <- read_csv(url)

**4. b)downloading data from a URL**

for URLs that the read_* functions cannot read directly

**code**: url <- "https...."

download.file(url,"data/can_lang.csv")

2nd argument is the path to store the downloaded file

**5. reading tabular data from Microsoft Excel file**

- the file name extension is .xlsx
- this is not a plain text file
- use library(readxl)
- use function read_excel()
- use sheet argument to specify the sheet number or name
- use range argument to specify cell ranges (for when single sheet contains multiple tables)

**why should we always explore the data file before importing into R**

- helps me decide which function and arguments I will need to load the data into R successfully.

**6. reading data from a database**

**database**: a type of data storage -> almost all database management systems employ SQL (structured query language) to obtain data from the database

a) SQLite databases are usually stored and accessed locally on one computer from a file with a .db extension or .sqlite extension.

- NOT plain text files, CANNOT be read in a plain text editor
1. connect R to the database using dbConnect() function from DBI package
- dbConnect() opens up a communication channel that R can use to send SQL commands to database

**library(DBI)
canlang_conn <- dbConnect(RSQLite::SQLite(), "data/can_lang.db")**

- can use dbListTables(canlang_conn) to list the table names in the database

2. **tbl(canlang_conn, "lang")** function allows us to reference this table so we can perform operations and work with data stored in databases as if they were just regular data frames WITHOUT having to store all its data in R's memory.
3. head() function allows us to see the first few rows of a dataset
4. use **collect** function to download the transformed data from the database and store it in a dataframe.
5. write data from R to a .csv file: use write_csv(the collected dataframe, "data/newname.csv")
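
The steps above can be sketched end to end with a throwaway in-memory SQLite database. This is an assumption-laden toy: `":memory:"` stands in for a real `.db` file, the table contents are made up, and the RSQLite and dbplyr packages are assumed to be installed.

```r
library(DBI)
library(dplyr)  # tbl() and collect(); dbplyr does the translation to SQL

# Hypothetical stand-in for a real database file.
conn <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(conn, "lang", data.frame(
  language = c("English", "French", "Cree"),
  mother_tongue = c(100, 50, 5)
))

table_names <- dbListTables(conn)  # which tables exist?
lang_db <- tbl(conn, "lang")       # a reference, not the data itself

# filter() is translated to SQL; collect() brings the result into R.
big_langs <- lang_db |>
  filter(mother_tongue > 10) |>
  collect()

dbDisconnect(conn)
```

Until `collect()` is called, `lang_db` only describes a query, which is why large tables do not need to fit in R's memory.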

**Reading data from a PostgreSQL database**

- designed to be used and accessed on a network -> have to provide more information to R when connecting to Postgres databases

Example: 

library(RPostgres)

***canmov_conn <- dbConnect(RPostgres::Postgres(), dbname = "can_mov_db",
                        host = "fakeserver.stat.ubc.ca", port = 5432, user = "user0001", password = "abc123")***

**Advantages of database:**
1. allow storing large data set across multiple computers with backups
2. Allow multiple users to access them simultaneously and remotely without conflicts and errors
3. provide mechanisms for ensuring data integrity and validating input
4. provide security to keep data safe 

**Chapter 3 (Data Wrangling)**

**Important Packages for Chapter 3**

- dplyr: part of tidyverse metapackage (if loaded tidyverse, then do not need to load this)

  ->provides functions like (select, filter, mutate, arrange, summarize, and group_by)
- purrr: part of the tidyverse metapackage.
- allows us to use the map() and map_df() functions

**LO1: define the term "tidy data"**

**Criteria for Tidy Data!!**

1. Each row is a single observation
2. Each column is a single variable
3. Each value is a single cell

**LO2: discuss the advantages of storing data in a tidy data format**

- tidy data is a single, consistent format that almost every function in tidyverse recognizes, making it easy to manipulate, plot, and analyze using the same tools.
- tidy data is easier for human to interpret.
- Untidy data requires more complex code, which is error-prone and hard for others to understand.

**LO3: define what vectors, lists, and data frames are in R, and describe how they relate to each other**

**data frame**: table-like structure for storing data in R. (stores observations, variables and their values)

- variable: a characteristic, number, or quantity that can be measured
- observation: all of the measurements for a given entity
- value: a single measurement of a single variable for a given entity


**what is a vector?**

- vectors are objects that can contain one or more elements that MUST ALL BE THE SAME DATA TYPE
- you can use c() function to create vectors in R: vector_name <- c("200", "300", "400")

**what is a list?**

- Lists are also objects with multiple, ordered elements, BUT the elements in a list **DO NOT** have to be the **same type**.

**data frames**: is just a special kind of list:

- each element itself must either be a vector or a list
- each element (vector or list) must have the same length

**Tibbles are a special kind of data frame with some enhancements**

**LO4: describe the common types of data in R and their uses**

**Data Type**

- **character: (chr)**, letters or numbers surrounded by quotes, ex: "1", "world"
- **double (dbl)**, numbers with decimal values, ex: 1.2333
- **integer (int)**, whole numbers, no decimals, ex: 1L,20L ("L" tells R to store it as int)
- **logical (lgl)**, either true or false, ex: TRUE, FALSE
- **factor (fct)**, used to represent data with a limited number of values(usually categories), ex: color(**categorical**) variable **with levels** red, green, orange. 

Even though factors sometimes **look** like characters, they are not used to represent text, words, names, and paths in the way characters are. Factors help us encode variables that represent **categories**.

**LO5: Use the following functions for their intended data wrangling tasks**

**1. pivot_longer**

- combines columns, making data frame longer and narrower.
- **combine** columns that are really part of the **same variable** but currently stored in separated columns.

pivot_longer(dataframe,
            cols= columns to combine,
            names_to= "new column 1",
            values_to= "new column 2")

- input for 1st argument is the data frame
- input for 2nd argument are the names of the columns we want to combine into a single column
- input for 3rd: the new column1 that will be created, values come from the **names** of the columns that we want to combine
- input for 4th: the new column2 that will be created, values will come from the **values** of the combined columns
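
A minimal sketch of `pivot_longer` on a made-up wide table, where two columns really hold the same variable (a count) split by type. The data and column names are assumptions for illustration.

```r
library(tidyr)
library(tibble)

# Hypothetical wide data: one count column per type.
wide <- tibble(
  region = c("Toronto", "Vancouver"),
  most_at_home = c(10, 20),
  most_at_work = c(5, 8)
)

long <- pivot_longer(wide,
                     cols = most_at_home:most_at_work,
                     names_to = "type",     # old column names go here
                     values_to = "count")   # old cell values go here
long
```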

**2. pivot_wider**

- if there's one type of observation spread across multiple rows rather than a single row
- use pivot_wider to increase the number of columns and decrease the number of rows

pivot_wider(data frame,
            names_from = col_name_1,
            values_from = col_name_2)

- 1st input is the data frame
- 2nd input is **the column** from which the **names** of the new columns are taken
- 3rd input is **the column** from which the **values** of the new columns are taken
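
A matching sketch for `pivot_wider`, assuming a made-up long table in which each region's observation is spread over two rows:

```r
library(tidyr)
library(tibble)

# Hypothetical long data: two rows per region.
long_counts <- tibble(
  region = c("Toronto", "Toronto", "Vancouver", "Vancouver"),
  type = c("most_at_home", "most_at_work", "most_at_home", "most_at_work"),
  count = c(10, 5, 20, 8)
)

wide_counts <- pivot_wider(long_counts,
                           names_from = type,    # becomes new column names
                           values_from = count)  # fills the new columns
wide_counts
```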

**3. separate**

- use this to split a column that stores multiple values in the same cell (separated by a delimiter) into several columns

separate(dataframe,
    col= col_name,
    into = c("col_name1", "col_name2"),
    sep = "/")

  1. specify the column we want to split
  2. a character vector of the new column names we would like the split columns to have
  3. the separator on which to split
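
A small sketch of `separate` on made-up data. Note that the new columns come out as character, so a conversion step is usually needed afterwards if they hold numbers.

```r
library(tidyr)
library(tibble)

# Hypothetical data: two values packed into one cell, separated by "/".
packed <- tibble(region = c("Toronto", "Vancouver"),
                 value = c("50/1000", "20/300"))

split_up <- separate(packed,
                     col = value,
                     into = c("most_at_home", "most_at_work"),
                     sep = "/")
split_up
```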

**4. select**

- use to extract columns, including a whole range of columns
- typing out all of the needed column names can be time-consuming; **instead**, use a "select helper"
  
**select helpers**: **operators** that make it easier for us to select columns

ex: to choose a range of columns, use **(:)** to denote the range. 

ex: select(dataframe, starts_with(" ")) **(starts_with())** is a select helper to choose columns with names start with a particular word or letter.

ex: select(dataframe, contains("_")), **contains()** is a select helper to choose column names that contain a particular thing.
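
The three helpers can be compared on one made-up table (the column names are assumptions for illustration):

```r
library(dplyr)

# Hypothetical data with a mix of column names.
langs_wide <- tibble(category = "a", most_at_home = 1,
                     most_at_work = 2, lang_known = 3)

by_range <- select(langs_wide, most_at_home:lang_known)  # a range of columns
by_start <- select(langs_wide, starts_with("most"))      # names starting with "most"
by_part  <- select(langs_wide, contains("_"))            # names containing "_"
```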


**5. filter**

- use filter to extract rows where logical statement evaluates to TRUE.

ex: extracting rows that have a certain value with ==, filter(dataframe, column == "value")

ex: extracting rows that do not have a certain value with !=, filter(dataframe, column != "value")

ex: extracting rows satisfying multiple conditions using (,) or (&), 

filter(dataframe, column1 == "value1", column2 == "value2")

ex: extracting rows satisfying **at least one** condition using (|),

filter(dataframe, column1 == "value1" | column1 == "value2")

ex: extracting rows with values in a vector using (%in%), 

similar to using (|), but easier as the values are collected in a vector. 
**different** from ==: comparing a column to a vector with == checks the values pairwise, element by element (recycling the vector), so it does not test membership. %in% keeps every row whose value matches any of the elements in the vector.

vector_name <- c("value1", "value2", "value3")
filter(dataframe, column_name %in% vector_name)

ex: extracting rows above or below a threshold using > and <

filter(dataframe, column > 2345)
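
A sketch of the filter variants above on a made-up table (data and names are hypothetical):

```r
library(dplyr)

langs <- tibble(
  language = c("English", "French", "Cree", "Inuktitut"),
  speakers = c(100, 50, 5, 2)
)

wanted <- c("English", "French")

with_in  <- filter(langs, language %in% wanted)    # matches ANY element
with_or  <- filter(langs, language == "English" | language == "French")
both     <- filter(langs, speakers > 3, language != "English")  # "," acts like &
```

`with_in` and `with_or` give the same rows; `%in%` just scales better as the list of values grows.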

**6. mutate**

Ex: using mutate to modify columns

mutate(dataframe, new_name = as_factor(column))

- here mutate converts the column to a factor; using the existing column's name instead of new_name overwrites that column in place.

Ex: using mutate to create new columns

mutate(dataframe, new_column = an expression using existing columns)

**7. pipe operator |>**

- used to combine functions, results in a **cleaner, and easier to follow code**
- takes the output from the function on the left and passes it as the first argument to the function on the right

**reasons why writing multiple lines of code and storing temporary objects is not preferred**

- difficult for readers to understand
- tricks the reader to think the temporary intermediate objects are important
- reader has to look through and find where the intermediate objects are used

**composing (nesting) functions is also not a good idea**

- the functions are written in the opposite order from the one in which R computes them (the innermost call runs first)
- long code makes it difficult for readers to understand

**When should we store temporary objects**

- store a temporary object before feeding it to a plot function, so you can look at the wrangled data before plotting it to make sure there are no errors.
- piping many functions can be difficult to debug
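
The three styles discussed above can be put side by side. This is a sketch on made-up data; all three produce the same result, and only readability differs.

```r
library(dplyr)

langs <- tibble(language = c("French", "English", "Cree"),
                speakers = c(50, 100, 5))

# 1. Nested (composed) calls: read inside-out.
nested <- arrange(select(filter(langs, speakers > 10), language), language)

# 2. Temporary objects: extra names the reader has to track.
step1 <- filter(langs, speakers > 10)
step2 <- select(step1, language)
stored <- arrange(step2, language)

# 3. Pipe: read top to bottom, in the order R runs the steps.
piped <- langs |>
  filter(speakers > 10) |>
  select(language) |>
  arrange(language)
```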

**8. summarize**

- use summarize to calculate summary statistics:

ex: summarize(dataframe, new column name = max(old column))

here, the min and max functions calculate the minimum and maximum value of the specified column.

**Basic summary functions**

- min
- max
- mean
- sum

**if there's NA in the column's element:**

- add argument na.rm= TRUE into the summary functions to remove the NA.

**9. group_by() + summarize()**

- this combination is used when you want to apply the same function to groups of rows

group_by(dataframe, col_names) |>
summarize(
            min_col_name1 = min(col_name1),
            max_col_name2 = max(col_name2),
            total_volume = mean(total_volume, na.rm =TRUE))

- group_by() takes an existing data set and converts it into a grouped data set where operations are performed "by group".
- summarize() works analogous to mutate() function, EXCEPT instead of adding columns to an existing data frame, it creates a new data frame. USED to calculate **summary statistics** (max, min, mean) for each group of rows created with group_by()
- pairing these functions together can let you summarize values for subgroups within a data set
- the result has one row per group: the grouping columns from group_by() plus the summary columns created by summarize()
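
A minimal sketch of the combination, using a made-up table with one missing value to show why `na.rm = TRUE` matters:

```r
library(dplyr)

# Hypothetical data with an NA in one group.
sales <- tibble(region = c("east", "east", "west", "west"),
                volume = c(10, NA, 4, 6))

by_region <- sales |>
  group_by(region) |>
  summarize(min_volume = min(volume, na.rm = TRUE),
            mean_volume = mean(volume, na.rm = TRUE))
by_region
```

Without `na.rm = TRUE`, the summaries for the "east" group would be `NA`.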

**10. summarize() + across ()**

- to calculate summary statistics on many columns

summarize(dataframe, across(column1:column4, ~max(.x, na.rm=TRUE)))
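
A runnable sketch of the same pattern on made-up columns `q1` to `q3`:

```r
library(dplyr)

scores <- tibble(q1 = c(1, 5, NA), q2 = c(2, 2, 9), q3 = c(7, 0, 3))

# The data frame is the first argument; across() picks the columns
# and applies the function to each one (.x stands for the column).
col_max <- summarize(scores, across(q1:q3, ~ max(.x, na.rm = TRUE)))
col_max
```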

**11. map(), map_dfr()**

- alternative to summarize+across, for applying function to many columns
- map takes two arguments, an object(a vector, data frame or list) and the function that you would like to apply
- map() does not give dataframe, it gives list instead
- map_dfr() gives data frame, combining row-wise
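
A small sketch of the difference between the two, on made-up data. It assumes the purrr and dplyr packages are installed (`map_dfr()` relies on dplyr to build its result).

```r
library(purrr)

scores <- data.frame(q1 = c(1, 5, NA), q2 = c(2, 2, 9))

# Extra arguments (here na.rm) are passed on to the function.
as_list <- map(scores, max, na.rm = TRUE)      # a named list
as_row  <- map_dfr(scores, max, na.rm = TRUE)  # a one-row data frame
```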

**12. mutate + across**

- ex: when converting units of measurements across many columns
- or we want to change every value in a data frame to another data type

mutate(dataframe, across(column1:column4, as.integer))
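
A runnable sketch of the pattern, with the data frame as the first argument of `mutate()` and `across()` inside it (the data are made up):

```r
library(dplyr)

measures <- tibble(id = c("a", "b"),
                   height = c(1.9, 2.2),
                   width = c(3.7, 4.1))

# Convert only the selected columns; id is left untouched.
as_int <- mutate(measures, across(height:width, as.integer))
as_int
```

Note that `as.integer()` drops the decimal part rather than rounding.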

**13. rowwise + mutate**

- apply function across columns but within one row
- Ex: we want the max value from different columns in one row (ie find the maximum from values in one row)

rowwise(dataframe) |>
mutate(maximum= max(c(column1, column2, column3, column4)))

**similar to group_by(), rowwise() doesn't appear to do anything when it is called by itself, but we can apply rowwise with other functions to change how these other functions operate.**
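
A sketch showing what `rowwise()` changes, using made-up columns `x`, `y`, and `z`:

```r
library(dplyr)

wide_vals <- tibble(x = c(1, 9), y = c(5, 2), z = c(3, 4))

# With rowwise(): max() is computed separately for each row.
row_max <- wide_vals |>
  rowwise() |>
  mutate(maximum = max(c(x, y, z))) |>
  ungroup()

# Without rowwise(): max() sees all values of all three columns at once.
overall_max <- mutate(wide_vals, maximum = max(c(x, y, z)))
```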

**Chapter 4 (Effective data visualization)**

**Important packages for chapter 4.0**

- `ggplot2`
  - part of tidyverse metapackage. (if loaded tidyverse, then do not need to load this)
  - This package allows you to create all sorts of visualizations of data.
- `RColorBrewer`
  - This package provides the ability to pick custom colour schemes some of which are colourblind friendly.
- `lubridate`
  - part of the tidyverse metapackage. (still need to load this package **individually** in tidyverse versions before 2.0.0)
  - This package is a tool to convert character strings to date vectors.

**Basic functions used to aid data visualization**

- `n()`
  - number of rows/observations in the data
  - usually used like `group_by()` + `summarize(n = n())` to give you the count of the rows for each group
- `slice_max(data, order_by = ..., n = ...)`
  - `data`: what data frame we are using
  - `order_by =`: which column we select to order, default is largest first
  - `n`: number of rows selected
  - This function is used to select only the top `n` data rows ordered by some column from a data frame to generate a new data frame
  - same purpose as arrange()+slice(), but more specific and efficient
- `as.factor()`: simply converts an existing vector to a factor
- `factor(col_name, levels = c(...,...,...))`: To encode a vector as a factor; allows you to specify the values, and whether they are ordered or not.
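
These helpers can be sketched on a made-up table (the `pets` data and all names are assumptions for illustration):

```r
library(dplyr)

pets <- tibble(species = c("dog", "cat", "dog", "bird", "dog", "cat"),
               weight_kg = c(30, 4, 12, 0.1, 25, 5))

# Count rows per group.
species_counts <- pets |>
  group_by(species) |>
  summarize(n = n())

# Keep the two rows with the largest weight.
heaviest_two <- slice_max(pets, order_by = weight_kg, n = 2)

# Control the order of the categories explicitly.
size <- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large"))
```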

**LO1: Describe when to use what kinds of visualizations to answer specific questions using a data set**

A great visualization clearly answers your question without distraction or additional explanation.

**4 Kinds of visualization**:
1. **scatter plots**: visualize the relationship between **two quantitative variables**
2. **line plots**: visualize **trends** with respect to an **independent ordered** quantity (e.g., time)
3. **bar plots**: visualize **comparisons of amounts**
4. **histograms**: visualize the distribution of **one quantitative variable** (e.g, all its possible values and how often they occur)

**Avoid**
- avoid using **pie charts**, better to use bars, as it's easier to compare bar heights than pie slice sizes. 
- avoid using **3D visualizations**, as they are hard to understand when converted to 2D image format
- do not use **tables** to make **numerical comparisons**

**LO2: Given a data set and a question, select from the above plot types and use R to create a visualization that best answers the question**

- bar plots ex: Compare the amount of poop different dog breeds have in 2020.
- scatter plots ex: Visualize the relationship between BMI and health insurance cost.
- line plots ex: visualize the trend of CO2 emissions from 2010 to 2020.
- histograms ex: visualize the midterm grade distribution in class of 2020.

**LO3: Effective visualizations and rules of thumb**

**Convey the message, minimize the noise!!!**

**1. Convey the message**

- Make sure the visualization answers the question most simply and plainly as possible.
- Use **legends**, **labels** so that your visualization is understandable without reading explanations.
- Make sure the **text, symbols, lines...** are big enough to be easily read.
- Make sure the data are **clearly visible**
- Make sure to **use color schemes** that are **colorblind friendly**
- Redundancy can be **helpful**, sometimes conveying the same message in multiple ways reinforces it for the audience.

**2. Minimize noise**

- Too many **different colours** can be distracting, create false patterns
- **Overplotting** is when marks that represent the data **overlap**, which prevents you from seeing how many data points are represented in areas of the visualization.
- Make plots in the **appropriate size**
- **Don't** adjust the axes to zoom in on small differences; if the difference is small, show that it's small!

**General tools used to refine the 4 visualizations**

Geometric Objects: specifies how the mapped data should be displayed `geom_*`

- `geom_point()` for scatterplot, `geom_line()` for line plot, `geom_histogram()` for histogram, `geom_bar()` for bar plots
- `geom_vline(xintercept = ...)` to add a vertical line to the plot at the specified x-intercept
  - `geom_vline(xintercept =..., linetype = "dashed", size = 1)` (newer ggplot2 versions use `linewidth` instead of `size` for lines)
- `geom_hline(yintercept = ...)` to add a horizontal line at the specified y-intercept

Scales: used to modify axes and legends. Adjust how aesthetic mappings are displayed
- `scale_x_continuous()` :customize the appearance of continuous variables on the x-axis, allows you to adjust axis labels, breaks, limits, transformations
- `scale_y_continuous()` :customize the appearance of continuous variables on the y-axis, allows you to adjust axis labels, breaks, limits, transformations

Aesthetic Mappings: tell `ggplot` how the variables in the data frame map to properties of the visualization (colour, shape, position, size)
- `x`,`y`: position on the x and y axes
- `fill`: the colour used to fill bars
- `colour`: the colour of points and lines
- `shape`: the shape of points

Labelling:

- `xlab()`: adds a label " " to the x axis; usually include units and make the label name less technical
- `ylab()`: adds a label " " to the y axis
- `labs()`: general function for all labels (x, y, legend, colour...)

Font control and legend positioning:

- `theme()`: changes the font size in plots and the position of the legend

`theme(text = element_text(size = 12))`

Flipping axes:

- `coord_flip()`: swaps the x and y coordinate axes

Subplots:

- `facet_grid()`: divides a plot into subplots based on the values of one or more discrete variables

**`ggplot()` Basics**

- `ggplot(data,aes(x= , y= , ...)) + geom_...() + ...`
  - ggplot takes two arguments.
  - 1st argument is the dataframe to visualize
  - 2nd argument is an aesthetic mapping, which links columns of the data frame to properties of the visualization.
  - After the ggplot function, different layers are **added** to the plot using `+` instead of `|>`

- `aes()`
  - `x =`: assign variable to x-axis
  - `y=`: assign variable to y-axis
  - `colour =`: assign different colors by factors of the **categorical variable** (non-numerical, factor) you input in this argument

ex: in `aes(..., colour = Column (that has categories, factor))`

   - `shape =`: assign different shapes by factors of the **categorical variable** you input in this argument
   - `fill =` :(for geom_histogram and geom_bar) what factor is used to color the bars
   - `fct_reorder()`: often used with `aes()` to reorder values
     - The first argument defines the column to be reordered
     - The second argument is the criteria used for reordering
     - `fct_reorder()` uses **ascending** order by default, can change into descending by `.desc=TRUE`
     - EX: `aes(..., y=fct_reorder(column, criteria, .desc=TRUE),...)`

**Note**:

- `fill` and `colour` can also be used outside the `aes()` function. This is done when you want to manually assign a colour to your points/bars.
- Anything you define in the `aes()` function should be given a human-readable label in the `labs()` function

- `geom_...()`
  - `geom_bar(stat = "identity")`: tells ggplot2 that you will provide the y-values for the barplot, rather than counting the aggregate number of rows for each x value (which is the default, `stat = "count"`)
  - `geom_histogram(position = "identity")`: ensures the histograms for each factor are overlaid on top of one another, instead of stacked (which is the default for bar plots and histograms when they are coloured by another categorical variable)

- `...`
  - `xlab()`: x-axis label, (can add `\n` in the name to create line break)
  - `ylab()`: y-axis label
  - `xlim()`: set the scale limits for the x-axis. `xlim(c(lower boundary, upper boundary))`
  - `ylim()`: set the scale limits for the y-axis. `ylim(lower, upper)`
  - `theme(text=element_text(size = 20))`: changes the font size in plots. a good start is 20
  - `theme(legend.position = "top", legend.direction = "vertical")`: move the legend to better display the plot. 
  - `scale_x_log10`: scale the x values to log scale.
  - `scale_y_log10`: scale the y values to log scale.
  - `scale_color_brewer(palette = " ")`: allows you to choose the specific colour palette you want from the `RColorBrewer` package
  - `scale_fill_manual`: manually choose the colours used to fill the bars
  - `coord_flip()`: swaps x and y coordinate axes, to give more space to labels on the x axis
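
Several of the pieces above can be combined in one small bar plot. This is only a sketch under stated assumptions: the `island_area` data are made up, and the forcats package (part of the tidyverse) is assumed to be installed for `fct_reorder()`.

```r
library(ggplot2)

# Hypothetical data: one pre-computed amount per category.
island_area <- data.frame(
  island = c("Alpha", "Beta", "Gamma"),
  area_km2 = c(120, 45, 300)
)

bar_plot <- ggplot(island_area,
                   aes(x = area_km2,
                       y = forcats::fct_reorder(island, area_km2))) +
  geom_bar(stat = "identity", fill = "steelblue") +  # heights come from the data
  labs(x = "Area (square km)", y = "Island") +       # readable labels with units
  theme(text = element_text(size = 20))              # larger font

bar_plot
```

Here `fill` is set outside `aes()` because every bar gets the same manually chosen colour.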

**`facet_grid()`**

Facets divide a plot into subplots based on the values of one or more discrete variables.

- To facet into rows based on the discrete variable, use `rows = vars(colname)` argument
- To facet into columns based on the discrete variable, use `cols = vars(colname)` argument
- **Note**, column name must be wrapped by `vars()`
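
A sketch comparing two ways to show the distribution of one variable for two groups: overlaid histograms with `position = "identity"`, and one subplot per group with `facet_grid()`. The simulated `measurements` data and the `alpha` and `bins` values are assumptions for illustration.

```r
library(ggplot2)

set.seed(1)
measurements <- data.frame(
  group = rep(c("a", "b"), each = 50),
  value = c(rnorm(50, mean = 0), rnorm(50, mean = 2))
)

# Overlaid: both groups share one panel; alpha keeps overlaps visible.
overlaid <- ggplot(measurements, aes(x = value, fill = group)) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 20) +
  labs(x = "Value", y = "Count", fill = "Group")

# Faceted: one row of the grid per group.
faceted <- ggplot(measurements, aes(x = value)) +
  geom_histogram(bins = 20) +
  facet_grid(rows = vars(group)) +
  labs(x = "Value", y = "Count")
```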

**Why is line plot sometimes better**:

- Line plots connect the sequence of x and y coordinates of the observations with line segments, emphasizing their order, as x variable (eg time) has a natural order to it.
- issue with scatterplot: overplotting can occur where data points overlap on top of one another, making the information presented unclear.

**When is scatterplot better**:

- scatterplot is good when neither of the two quantitative variables have natural order

**Key characteristics of data**

- **Direction**: if the y variable tends to increase when x increases, then y has a **positive** relationship with x. If y tends to **decrease** when x increases, then y has **negative** relationship with x. If y does not **meaningfully** increase or decrease as x increases, then y has **little or no** relationship with x.
- **Strength**: if y **reliably** increases, decreases, or stays flat as x increases, then the relationship is **strong**. Otherwise the relationship is **weak**. (Strong when the points are more clustered and look more like a "line" than a "cloud")
- **Shape**: if you can draw a straight line roughly through the data points, the relationship is **linear**. Otherwise, it is **nonlinear**

**Example of visual redundancy**

- Conveying the same information with **both scatter point color and shape** - can further improve the clarity of your visualization.

**Bar plots vs Histograms**

- It is better to use bar plots to compare value of an amount (size, proportion, count, percentage) across **different groups of categorical variables**
- It is better to use histograms to show the **distribution** of all individual data points of one quantitative variable, rather than only a single summary such as the **mean or median**.

**Histograms** help us visualize how a particular variable is distributed in a data set by separating the data into bins, and then using vertical bars to show how many data points fell in each bin.

**Saving the visualization**

- Generally, images come in two flavours: **raster** and **vector** formats
  - **Raster** images are represented as a 2D grid of square pixels, each with its own colour
    - they are often compressed before storing to take up less space.
    - **Lossy** formats cannot perfectly re-create the image when loading and displaying it
    - **Lossless** formats allow a perfect display of the original image
    - Common **raster image** file types:
      - JPEG(.jpg, .jpeg): lossy, usually for photographs
      - PNG (.png): lossless, usually for plots, line drawings
      - BMP (.bmp): lossless, raw image data, no compression (rare)
      - TIFF(.tif, .tiff): typically lossless, no compression, used mostly in graphic arts publishing

  - **Vector** images are represented as a collection of mathematical objects (lines, surfaces, shapes, curves). When the computer displays the image, it redraws all of the elements using their mathematical formulas.
     - Common **vector image** file types:
       - SVG (.svg): general-purpose use
       - EPS (.eps), general-purpose use (rare)

**Raster and vector images have opposing advantages and disadvantages**

- Raster images of the same size take the same time and space to load no matter how complex they are, while vector images take more time and space to load the more complex the image is.
- You can zoom into/ scale up vector graphics as much as you like without the image looking bad

**To save the graph**:

- `ggsave(file name, plot name)`: file name could end with .png, .jpg, .bmp, .tiff, .svg
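
A minimal sketch of saving a plot, assuming a made-up scatter plot and a temporary output path; the `width` and `height` values (in inches by default) are arbitrary choices.

```r
library(ggplot2)

toy_plot <- ggplot(data.frame(x = 1:3, y = c(2, 4, 8)),
                   aes(x = x, y = y)) +
  geom_point()

# The file extension decides the format (.png here, a lossless raster format).
png_path <- file.path(tempdir(), "toy_plot.png")
ggsave(png_path, toy_plot, width = 4, height = 3)
```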

**Chapter 12 Collaboration with version control**

**LO1: Describe what version control is and why data analysis projects can benefit from it**

- **Version Control**: the process of **keeping a record** of **changes** to documents. (**when** changes were made) (**who** made them) throughout the history of development.
    
**Advantages of version control**
1. version control tracks changes to the files in the analysis over the lifespan of the project, including when changes were made and by whom. It provides the ability to view earlier versions of the project and revert changes.
2. Being able to record and view the history of a data analysis project is **important** for understanding how and why decisions were made to use one method.
3. Helps with collaboration by sharing edits with others and resolving conflicting edits.
4. Version control tools usually include a remote repository hosting service (GitHub) that can act as a backup of the local files on computer.

**TWO things needed to version control a project**
1. **Version control system**: the software responsible for tracking changes, sharing changes with others, obtaining changes from others, and resolving conflicting edits. `Git`
2. **Repository hosting service**: stores a copy of the version-controlled project online, so team members can access it remotely, discuss issues and bugs, and distribute the final product. `GitHub`

**LO2: Create a remote version control repository on GitHub**

**Typically, when we put a project under version control, we create two copies of the repository.**
1. **Local repository**: the primary workspace where you create, edit, and delete files. It commonly exists on your computer, or on a server such as **JupyterHub**.
2. **Remote repository**: typically stored on a repository hosting service (**GitHub**), where we can easily share it with our collaborators.

- Both copies of the repository have a **working directory**, where you can create, store, edit, and delete files.
- Both copies maintain the full project **history**.

**LO3: Use Jupyter's Git version control tools for project versioning and collaboration**

**Cloning a repository**

- **Copying/downloading the entire contents** (files, project history, location of remote repository) of a remote GitHub repository **to a computer** (your local workspace)

**Git has a distinct step of ADDING files to the STAGING AREA because**:

- Not all changes we make are ones we want to push to our remote GitHub repository.
- It allows us to edit multiple files at once, but associate particular commit messages with particular files (so the commit messages can more specifically reflect the changes that were made).

**Commits**

A commit is a snapshot of the file contents together with metadata about the repository (who made the commit and when it was made).

- Each commit has a human-readable **message**: a description of what work was done since the last commit, so that you can easily and effectively review the project's history.
- When we commit our changes to Git, the snapshot of changes, the commit message, the time, and the user are all saved to the Git history on the LOCAL computer (local repository).

To commit, we first add the files to the **staging area**. This is not a physical location on the computer; it is a **conceptual placeholder** for the files until they are **committed**.

**Pushing**

Push the commits in your local repository to the remote repository on **GitHub**, so that the remote matches what you have locally. Collaborators will then be able to see the changes in the remote repository.

- Pushing with Git is the act of sending changes that were committed to Git to a remote repository, for example, on GitHub.com.
- You should push your work to GitHub any time you want to share your work with others, or when you are done with a work session and want to back up your work.

**Pulling**

Pull to obtain new changes made by others from the remote repository and synchronize your local repository with what is on the remote repository.

- If the remote repository contains changes that you do not have locally, you will **not be able to push** your own changes **until you pull** those changes.
- Pulling is the act of collecting changes that exist in a remote repository but do not yet exist on the local computer you are working on.

**Version control workflows**

Generally, there are **three additional steps** on top of the regular (edit, create, delete) workflow:
1. Tell `Git` to make a **commit** of your own changes in **local repository**
2. Tell `Git` **when** to send your **new commits** to the **remote `GitHub` repository**
3. Tell `Git` **when** to **retrieve** any new changes made **by others** from the remote repository `GitHub`

**Example Workflow on JupyterHub**
1. Edit, create, and delete files in your cloned local repository on JupyterHub
2. Once you want to record your current version, specify which files to "add" to Git's staging area (the modified files that you want a snapshot of).
3. Commit those flagged files to your repository, and include a helpful commit message to tell your collaborators what changes you have made. At this point, the remote repository on GitHub has not changed.
4. Continue working, and commit again whenever you want to record a new version.
5. When you want to store the commits from your local repository in the cloud to share with your collaborators, push them to the hosted repository on GitHub.

**Resolve merge conflicts**

Merge conflicts: these occur when you forget to pull before making new changes to a file, and you and a collaborator have both changed the same line of code, so Git cannot automatically merge the changes.

- To fix a merge conflict, open the file in a plain text editor.
- The beginning of a merge conflict is marked by `<<<<<<< HEAD` and the end is marked by `>>>>>>>`. The version before the separator `=======` is your change, and the version after it is your collaborator's.
- Use the plain text editor to decide which version to keep (or combine them), and remove the special markings.
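
To make the markers concrete, here is a minimal sketch in R of what a conflicted file might look like and how the two versions sit between the markers. The file contents, the variable names, and the commit ID `1a2b3c4` are all hypothetical. In practice you would resolve the conflict by hand in a text editor; the code only illustrates the structure.

```r
# Hypothetical contents of a file containing one merge conflict,
# stored as one string per line
conflicted <- c(
  "library(tidyverse)",
  "<<<<<<< HEAD",
  "penguins_avg <- mean(penguins$body_mass_g, na.rm = TRUE)",
  "=======",
  "penguins_avg <- median(penguins$body_mass_g, na.rm = TRUE)",
  ">>>>>>> 1a2b3c4",
  "print(penguins_avg)"
)

# Locate the three special marker lines
conflict_start <- which(startsWith(conflicted, "<<<<<<<"))
separator <- which(conflicted == "=======")
conflict_end <- which(startsWith(conflicted, ">>>>>>>"))

# Your version sits before the separator, the other version after it
yours <- conflicted[(conflict_start + 1):(separator - 1)]
theirs <- conflicted[(separator + 1):(conflict_end - 1)]

# One possible resolution: keep your version and drop all the markers
resolved <- c(
  conflicted[seq_len(conflict_start - 1)],
  yours,
  conflicted[-seq_len(conflict_end)]
)
print(resolved)
```

After resolving, the file contains no marker lines, and it can be added, committed, and pushed as usual.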

**Communicating using GitHub issues**

- Emails and messaging apps are not designed for project specific communication.

**GitHub issues**: an alternative written communication platform to email and messaging apps

- Issues are opened from the "Issues" tab on the project's GitHub page, and they remain there even after the conversation is over and the issue is closed.
- One issue thread is usually created per topic, and issues are easily searchable using GitHub's search tools.
- All issues are accessible to all project collaborators, so no one is left out of the conversation.
- Issues can be set up so that team members get email notifications when a new issue is created or a new post is made in an issue thread.
