<h1>Final Review “Cheat Sheet”</h1>

**Chapter 1 (R and tidyverse)**


**LO1: Identify the different types of data analysis questions and categorize a question into the correct type:**

Different types of questions:

**Descriptive**: asks about summarized characteristics of a data set without interpretation. Asks what is, rather than why or how. 
Ex: how many languages are there in India?

**Exploratory**: asks whether there are trends or relationships within a single data set. 
Ex: does exam success rate change with the amount of sleep students get?

**Predictive**: asks about predicting measurements for things or people. FOCUS on what things predict the outcome, NOT what causes the outcome. Classify new observations based on existing data.
Ex: what is the type of a newly observed tumour based on its width and texture?

**Inferential**: looks for patterns or relationships in a single data set AND infers whether they hold in the wider population.
Ex: is sleep time associated with exam success rate for students in all of Canada?

**Causal**: asks about whether changing one factor will lead to change in another factor, in the wider population. 
Ex: does sleep deprivation lead to lower success rate in students in BC?

**Mechanistic**: asks about the underlying mechanism of the observed relationships. How does it happen?
Ex: How does sleep deprivation lead to lower success rates in students of BC?


**LO2: Load the tidyverse package into R**

In [23]:
options(repr.matrix.max.rows = 6)
library(palmerpenguins)

**Function**: a special word in R that takes instructions (arguments) and does something with them.

**R package**: a collection of functions that can be used in addition to R's built-in functions. **ex**: read_csv() is a function made available by loading the tidyverse package.

In [22]:
#code# 
library(tidyverse)

**Tidyverse**: a meta package that installs and loads several other packages at once. Together they provide functions to load, clean, wrangle, and visualize data.

In [24]:
# We have to put quotes "" around file names and other text in the code cell to distinguish
# them from special words (like function names) in R

**LO3: read tabular data with read_csv**

.csv: tabular data in the "comma-separated values" format, where each value in the table is separated by a comma.

**read_csv:** function that expects data files to:
- have column names (headers)
- use a comma (,) to separate columns
- not have row names

In [25]:
##code:
## name <- read_csv("folder name/data file name")

**LO4: naming things in R**

In [34]:
#using the assignment symbol "<-"
my_number <- 3

In [35]:
my_number + 2

**Conventions**: when naming, only use lowercase letters, numbers and _ to separate words

**LO5: create and organize subsets of tabular data using filter, select, arrange, and slice**

**filter**: obtain a smaller set of rows with specific values (ex: only want rows with the year 2022)

**select**: obtain a smaller set of columns (ex: only want year and pollutants)

**code for filter**: filter(name of data, year=="2022")

logical statement: the second argument in the function. It evaluates to TRUE or FALSE for each row, and filter keeps only the rows where it evaluates to TRUE.

" " is used to tell R this is a **string** value, not a special word in the R language

**code for select**: select(data name, column1, column2)

**arrange**: **order the rows** of data frame by **values of a particular column**

**code for arrange**: arrange(dataframe, by=desc(column))

-Descending: (desc()), from largest to smallest

**arrange function automatically orders rows in ascending**

**slice**: function which selects rows according to **row number**

**code for slice**: slice(dataframe, 1:10)

-the second argument tells R to keep rows 1 to 10
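The four functions above can be chained one step at a time. The sketch below uses a small made-up table (the `air` data frame and its columns are invented for illustration, and `year` is stored as a number here rather than a string).

```r
library(dplyr)

# Hypothetical air-quality table, invented for this example
air <- tibble(
  year = c(2021, 2022, 2022, 2022),
  city = c("Vancouver", "Vancouver", "Kelowna", "Victoria"),
  pollutant = c(7.1, 6.4, 9.8, 5.2)
)

rows_2022 <- filter(air, year == 2022)           # keep rows where the condition is TRUE
two_cols  <- select(rows_2022, city, pollutant)  # keep only these columns
sorted    <- arrange(two_cols, desc(pollutant))  # largest pollutant value first
top_two   <- slice(sorted, 1:2)                  # keep rows 1 to 2
top_two
```

Storing each step in its own object makes it easy to inspect the intermediate results while learning what each function does.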

**LO6: add and modify columns in tabular data using mutate**

**mutate**: perform a calculation, making use of existing columns to compute a new column

**code for mutate**: mutate(dataframe, new column name = the equation)
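As a minimal sketch of the pattern above, assuming a made-up `trips` table (the names and numbers are invented for illustration):

```r
library(dplyr)

# Hypothetical data, invented for this example
trips <- tibble(distance_km = c(10, 25, 40), hours = c(0.5, 1, 2))

# speed_kmh is a new column computed from two existing columns
trips_speed <- mutate(trips, speed_kmh = distance_km / hours)
trips_speed
```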

col_double means that the data in this column is a number-type, specifically real numbers (meaning that these values can contain decimals)

col_integer means that the data in this column is integers (whole numbers)

col_character means that the data in this column contains text (e.g., letter or words)

**LO7: visualize data with a ggplot bar plot**

**Benefits of visualization**: great tool for summarizing information, help effectively communicate with audience.

In [36]:
# the hash symbol (#) marks a comment; R ignores the rest of the line

?filter

this is a way to pull up the documentation for most functions.

**Chapter 2 (reading in data locally and from the web)**

**Important packages for chapter 2**

- readxl: provides the read_excel() function to load a sheet from an Excel file into R
- DBI: provides the dbConnect() function to connect to a database (e.g., SQLite) and the dbListTables() function to list the tables in a database
- dbplyr: lets tbl() create a reference to a database table so it can be queried like a data frame, and lets collect() retrieve the result of a database query and bring it into R
- RPostgres: allows us to work with PostgreSQL databases

**LO1: Define the types of path and use them to locate files**

- a file could live locally (on your computer) or remotely (on the internet); the two cases use different kinds of paths

**1. Relative file path**: where the file is with respect to the folder you are currently in (the **working directory**) on the computer.

**2. Absolute file path**: where the file is with respect to the base (root) folder of the computer's filesystem, regardless of where you are working.

- **Always starts with "/"**

**" . " means the current directory (folder)**

**" .. " means the parent directory (one level up from the current one)**

**Generally**, it is better to use relative paths, because they are shorter and easier to write and they help ensure the code can run on a different computer. A relative path stays the same wherever the project folder is copied, whereas an absolute path depends on how each person's folders are named above the project.

**LO2: read data into R from various types of path using following functions**

Plain text file: a document containing only text

**1. read_csv**: for reading tabular data with comma separated values
- the delimiter (separator): ","

code: canlang_data <- read_csv("data/can_lang.csv")
- data/ is put before file's name because the data set is located in a sub-folder called data, relative to where we are running our R code.


**skipping rows when reading data**: sometimes extra information about the data (metadata) is included at the top of the data file. These lines have NO delimiters and are not intended to be read into the data frame with the tabular data.

- in this case, use the skip argument: read_csv("data/name", skip=3)

**2. read_tsv**: tsv=tab-separated values files. 

code: read_tsv("data/can_lang.tsv")

**3. read_delim**: a more general function; read_csv and read_tsv are special cases of it. You NEED to specify a **delimiter**. 

- delim = "\t" is for tab-separated values files
- delim = "," is for comma-separated values files
- delim = ";" is for semicolon-separated values files

**data frames need to have column names**: if the file has no header row, one option is the argument col_names = c("name1", "name2").
Another option is to rename afterwards with rename(data, **new_name = old_name**), e.g. rename(data, column1 = X1, column2 = X2).
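The `delim`, `skip`, and `col_names` arguments can be combined in one call. The sketch below writes a tiny file to a temporary location first so that it runs anywhere; the file contents and column names are made up for illustration.

```r
library(readr)

# Write a small semicolon-separated file with two metadata lines at the top.
# The contents are invented for this example.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "Source: example data",
  "Collected: 2022",
  "English;100",
  "French;50"
), path)

langs <- read_delim(
  path,
  delim = ";",                           # values are separated by semicolons
  skip = 2,                              # ignore the two metadata lines
  col_names = c("language", "speakers")  # the file has no header row
)
langs
```

Opening the file in a plain text editor first is what reveals the two metadata lines, the missing header, and the semicolon delimiter.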

**4. a)read tabular data directly from URL**

(URL): Uniform Resource Locator

**code**: url <- "https://raw.githubusercontent.com/UBC-DSCI/data/main/can_lang.csv"

canlang_data <- read_csv(url)

**4. b)downloading data from a URL**

for URLs whose contents are not formatted nicely enough to be read directly with the read_* functions

**code**: url <- "https...."

download.file(url,"data/can_lang.csv")

2nd argument is the path to store the downloaded file

**5. reading tabular data from Microsoft Excel file**

- the file name extension is .xlsx
- this is not a plain text file
- use library(readxl)
- use function read_excel()
- use sheet argument to specify the sheet number or name
- use range argument to specify cell ranges (for when single sheet contains multiple tables)

**why should we always explore the data file before importing into R**

- helps me decide which function and arguments I will need to load the data into R successfully.

**6. reading data from a database**

**database**: a type of data storage -> almost all database management systems employ SQL (structured query language) to obtain data from the database

a) SQLite databases are usually stored and accessed locally on one computer, in a file with a .db or .sqlite extension.

- NOT plain text files, CANNOT be read in a plain text editor
1. connect R to the database using the dbConnect() function from the DBI package
- dbConnect() opens up a communication channel that R can use to send SQL commands to the database

**library(DBI)
canlang_conn <- dbConnect(RSQLite::SQLite(), "data/can_lang.db")**

- can use dbListTables(canlang_conn) to list the table names in the database

2. **tbl(canlang_conn, "lang")** function allows us to reference this table so we can perform operations and work with data stored in databases as if they were just regular data frames WITHOUT having to store all its data in R's memory.
3. head() function allows us to see the first few rows of a dataset
4. use **collect** function to download the transformed data from the database and store it in a dataframe.
5. write data from R to a .csv file: use write_csv(the collected dataframe, "data/newname.csv")
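The steps above can be sketched end to end. To stay self-contained, this example builds an in-memory SQLite database instead of opening a .db file; the table contents are invented, and it assumes the RSQLite and dbplyr packages are installed.

```r
library(DBI)
library(dplyr)

# In-memory SQLite database so the example needs no file on disk.
# Assumes the RSQLite and dbplyr packages are installed.
conn <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(conn, "lang", data.frame(
  language = c("English", "French", "Cree"),
  speakers = c(100, 50, 5)
))

dbListTables(conn)                         # names of the tables in the database

lang_db <- tbl(conn, "lang")               # a reference; no rows are loaded yet
big_db  <- filter(lang_db, speakers > 10)  # translated to SQL, still not loaded
big_df  <- collect(big_db)                 # run the query and bring the rows into R

dbDisconnect(conn)
big_df
```

Only the rows that survive the filter are transferred into R's memory, which is the main reason to filter before calling `collect()`.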

**Reading data from a PostgreSQL database**

- designed to be used and accessed on a network -> have to provide more information to R when connecting to Postgres databases

Example: 

library(RPostgres)

***canmov_conn <- dbConnect(RPostgres::Postgres(), dbname = "can_mov_db",
                        host = "fakeserver.stat.ubc.ca", port = 5432, user = "user0001", password = "abc123")***

**Advantages of database:**
1. Allow storing large data sets across multiple computers with backups
2. Allow multiple users to access them simultaneously and remotely without conflicts and errors
3. provide mechanisms for ensuring data integrity and validating input
4. provide security to keep data safe 

**Chapter 3 (Data Wrangling)**

**Important Packages for Chapter 3**

- dplyr: part of tidyverse metapackage (if loaded tidyverse, then do not need to load this)

  ->provides functions like (select, filter, mutate, arrange, summarize, and group_by)
- purrr: part of the tidyverse metapackage.

  ->allows us to use the map() and map_dfr() functions

**LO1: define the term "tidy data"**

**Criteria for Tidy Data!!**

1. Each row is a single observation
2. Each column is a single variable
3. Each value is a single cell

**LO2: discuss the advantages of storing data in a tidy data format**

- tidy data is a single, consistent format that almost every function in tidyverse recognizes, making it easy to manipulate, plot, and analyze using the same tools.
- tidy data is easier for humans to interpret.
- Untidy data requires more complex code that is error-prone and hard for others to understand.

**LO3: define what vectors, lists, and data frames are in R, and describe how they relate to each other**

**data frame**: table-like structure for storing data in R. (stores observations, variables and their values)

- variable: a characteristic, number, or quantity that can be measured
- observation: all of the measurements for a given entity
- value: a single measurement of a single variable for a given entity


**what is a vector?**

- vectors are objects that can contain one or more elements that MUST ALL BE THE SAME DATA TYPE
- you can use c() function to create vectors in R: vector_name <- c("200", "300", "400")

**what is a list?**

- Lists are also objects with multiple, ordered elements, BUT the elements in a list **DO NOT** have to be the **same type**.

**data frames**: is just a special kind of list:

- each element itself must either be a vector or a list
- each element (vector or list) must have the same length

**Tibbles are a special kind of data frame with some enhancements**

**LO4: describe the common types of data in R and their uses**

**Data Type**

- **character: (chr)**, letters or numbers surrounded by quotes, ex: "1", "world"
- **double (dbl)**, numbers with decimal values, ex: 1.2333
- **integer (int)**, whole numbers, no decimals, ex: 1L,20L ("L" tells R to store it as int)
- **logical (lgl)**, either true or false, ex: TRUE, FALSE
- **factor (fct)**, used to represent data with a limited number of values(usually categories), ex: color(**categorical**) variable **with levels** red, green, orange. 

Even though factors sometimes **look** like characters, they are not used to represent text, words, names, and paths in the way characters are. Factors help us encode variables that represent **categories**.

**LO5: Use the following functions for their intended data wrangling tasks**

**1. pivot_longer**

- combines columns, making data frame longer and narrower.
- **combine** columns that are really part of the **same variable** but currently stored in separate columns.

pivot_longer(dataframe,
            cols= columns to combine,
            names_to= "new column 1",
            values_to= "new column 2")

- input for 1st argument is the data frame
- input for 2nd argument are the names of the columns we want to combine into a single column
- input for 3rd: the new column1 that will be created, values come from the **names** of the columns that we want to combine
- input for 4th: the new column2 that will be created, values will come from the **values** of the combined columns
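As an illustration of the four arguments, here is a sketch with a made-up wide table in which each year has its own column:

```r
library(tidyr)
library(tibble)

# Hypothetical wide table: one column per year
wide <- tibble(
  region = c("BC", "AB"),
  `2020` = c(10, 20),
  `2021` = c(11, 21)
)

long <- pivot_longer(
  wide,
  cols = `2020`:`2021`,  # columns to combine
  names_to = "year",     # the old column names go here
  values_to = "count"    # the old cell values go here
)
long
```

The result has four rows, one per region and year. Note that `year` is stored as text because it came from column names.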

**2. pivot_wider**

- if there's one type of observation spread across multiple rows rather than a single row
- use pivot_wider to increase the number of columns and decrease the number of rows

pivot_wider(data frame,
            names_from = col_name_1,
            values_from = col_name_2)

- 1st input is the data frame
- 2nd input is **the column** from which the **names** of the new columns are taken
- 3rd input is **the column** from which the **values** of the new columns are taken
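A minimal sketch of the reverse operation, assuming a made-up long table where each region's measurements are spread over two rows (the numbers are invented):

```r
library(tidyr)
library(tibble)

# Hypothetical long table: one observation split across two rows
long <- tibble(
  region  = c("BC", "BC", "AB", "AB"),
  measure = c("population", "area", "population", "area"),
  value   = c(5.0, 944, 4.4, 661)
)

wide <- pivot_wider(
  long,
  names_from = measure,  # new column names come from this column
  values_from = value    # new cell values come from this column
)
wide
```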

**3. separate**

- use this to split a column in which each cell stores multiple values separated by a delimiter

separate(dataframe,
    col= col_name,
    into = c("col_name1", "col_name2"),
    sep = "/")

  1. specify the column we want to split
  2. a character vector of the new column names we would like the split columns to have
  3. the separator on which to split
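For instance, a column holding "month/year" strings can be split into two columns. The data below are invented for this sketch:

```r
library(tidyr)
library(tibble)

# Hypothetical table where one cell stores two values
dates <- tibble(id = 1:2, month_year = c("01/2020", "02/2021"))

parts <- separate(
  dates,
  col = month_year,            # the column to split
  into = c("month", "year"),   # names for the new columns
  sep = "/"                    # the separator to split on
)
parts
```

The new columns are character columns; converting them to numbers would be a separate `mutate()` step.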

**4. select**

- use to extract a range of columns
- typing out all of the column names to select may be time-consuming; **instead**, use a "select helper"
  
**select helpers**: **operators and functions** that make it easier for us to select columns

ex: to choose a range of columns, use **(:)** to denote the range, e.g. select(dataframe, column1:column4)

ex: select(dataframe, starts_with(" ")), **starts_with()** is a select helper to choose columns whose names start with a particular word or letter.

ex: select(dataframe, contains("_")), **contains()** is a select helper to choose columns whose names contain a particular string.


**5. filter**

- use filter to extract rows where logical statement evaluates to TRUE.

ex: extracting rows that have a certain value with ==, filter(dataframe, column == "value")

ex: extracting rows that do not have a certain value with !=, filter(dataframe, column != "value")

ex: extracting rows satisfying multiple conditions using (,) or (&), 

filter(dataframe, column1 == "value1", column2 == "value2")

ex: extracting rows satisfying **at least one** condition using (|),

filter(dataframe, column1 == "value1" | column1 == "value2")

ex: extracting rows with values in a vector using (%in%), 

similar to using (|), but easier because the allowed values are collected in a vector. 
**different** from == because == compares the column with the vector element by element (recycling the vector if it is shorter), so it does not test membership. %in% keeps the values that match any of the elements in the vector.

vector_name <- c("value1", "value2", "value3")
filter(dataframe, column_name %in% vector_name)

ex: extracting rows above or below a threshold using > and <

filter(dataframe, column > 2345)
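The difference between `%in%` and `==` is easiest to see on a small example. The `cities` table below is made up for illustration:

```r
library(dplyr)

# Hypothetical data, invented for this example
cities <- tibble(
  city = c("Toronto", "Vancouver", "Calgary", "Toronto"),
  pop  = c(2.8, 0.7, 1.3, 2.8)
)
wanted <- c("Toronto", "Vancouver")

# %in% asks: is each city found anywhere in `wanted`?
by_in <- filter(cities, city %in% wanted)

# The same rows, written out with |
by_or <- filter(cities, city == "Toronto" | city == "Vancouver")

# == compares position by position, recycling `wanted`:
# row 4 ("Toronto") is compared with "Vancouver" and is dropped
by_eq <- filter(cities, city == wanted)

nrow(by_in)
nrow(by_eq)
```

Here `by_in` keeps three rows while `by_eq` keeps only two, with no warning, which is why `%in%` is the safer choice for matching against a vector.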

**6. mutate**

Ex: using mutate to modify columns

mutate(dataframe, new_name = as_factor(column))

- here, mutate converts the elements of the column into a factor. Using the existing column's name on the left-hand side overwrites that column instead of adding a new one.

Ex: using mutate to create new columns

mutate(dataframe, new_column = operation on existing columns)

**7. pipe operator |>**

- used to combine functions, resulting in **cleaner, easier-to-follow code**
- takes the output from the function on the left and passes it as the first argument to the function on the right

**reasons why writing multiple lines of code and storing temporary objects is not preferred**

- difficult for readers to understand
- tricks the reader to think the temporary intermediate objects are important
- reader has to look through and find where the intermediate objects are used

**composing (nesting) functions is also not a good idea**

- the functions are written in the opposite order from the order in which R computes them
- long lines of code are difficult for readers to understand

**When should we store temporary objects**

- store a temporary object before feeding it to a plot function, so you can look at the wrangled data before plotting it to make sure there are no errors.
- piping many functions can be difficult to debug
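To compare the styles discussed above, here is the same two-step wrangling written as nested calls and as a pipeline. The `scores` table is made up, and the native pipe `|>` assumes R 4.1 or newer.

```r
library(dplyr)

# Hypothetical data, invented for this example
scores <- tibble(
  student = c("a", "b", "c", "d"),
  grade   = c(55, 82, 91, 67)
)

# Nested (composed) calls: read from the inside out
nested <- arrange(filter(scores, grade > 60), desc(grade))

# Piped: read top to bottom, in the order R does the work
piped <- scores |>
  filter(grade > 60) |>
  arrange(desc(grade))

piped
```

Both versions return the same rows; the piped one lists the steps in the order they happen.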

**8. summarize**

- use summarize to calculate summary statistics:

ex: summarize(dataframe, new column name = max(old column))

here, the min and max functions calculate the minimum and maximum values of the specified column.

**Basic summary functions**

- min
- max
- mean
- sum

**if the column contains NA values:**

- add the argument na.rm = TRUE to the summary function so the NAs are ignored.

**9. group_by() + summarize()**

- this combination is used when you want to apply the same function to groups of rows

group_by(dataframe, col_names) |>

summarize(
            min_col_name1 = min(col_name1),
            max_col_name2 = max(col_name2),
            mean_total_volume = mean(total_volume, na.rm = TRUE))

- group_by() takes an existing data set and converts it into a grouped data set where operations are performed "by group".
- summarize() works analogously to the mutate() function, EXCEPT instead of adding columns to an existing data frame, it creates a new data frame. USED to calculate **summary statistics** (max, min, mean) for each group of rows created with group_by()
- pairing these functions together can let you summarize values for subgroups within a data set
- in the result, group_by() contributes the grouping columns and summarize() contributes the summary columns; together they form the new data frame
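A small sketch of the pairing, using an invented `sales` table that contains one missing value:

```r
library(dplyr)

# Hypothetical data, invented for this example
sales <- tibble(
  region = c("west", "west", "east", "east"),
  volume = c(10, NA, 4, 6)
)

by_region <- sales |>
  group_by(region) |>
  summarize(
    mean_volume = mean(volume, na.rm = TRUE),  # NA is ignored within each group
    n_rows = n()                               # number of rows in each group
  )
by_region
```

The result has one row per region, with the grouping column first and the summary columns after it.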

**10. summarize() + across ()**

- to calculate summary statistics on many columns

summarize(dataframe, across(column1:column4, ~ max(.x, na.rm = TRUE)))
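As a runnable sketch of this pattern (the table `m` and its columns are made up), `across()` picks the columns and the `~` formula says what to compute for each one, with `.x` standing for the current column:

```r
library(dplyr)

# Hypothetical data, invented for this example
m <- tibble(
  a = c(1, 5, NA),
  b = c(2, 8, 3),
  label = c("x", "y", "z")
)

# Maximum of columns a through b; the label column is left out
maxes <- summarize(m, across(a:b, ~ max(.x, na.rm = TRUE)))
maxes
```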

**11. map(), map_dfr()**

- alternative to summarize + across, for applying a function to many columns
- map takes two arguments: an object (a vector, data frame, or list) and the function that you would like to apply
- map() does not return a data frame; it returns a list instead
- map_dfr() returns a data frame, combining results row-wise
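The difference in return type can be seen on a small made-up data frame. This sketch assumes the dplyr package is installed, since `map_dfr()` relies on it to build the data frame.

```r
library(purrr)

# Hypothetical data, invented for this example
df <- data.frame(a = c(1, 5, NA), b = c(2, 8, 3))

# Extra arguments after the function are passed on to it
as_list  <- map(df, max, na.rm = TRUE)      # a list with one element per column
as_frame <- map_dfr(df, max, na.rm = TRUE)  # a one-row data frame

as_list
as_frame
```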

**12. mutate + across**

- ex: when converting units of measurements across many columns
- or we want to change every value in a data frame to another data type

mutate(dataframe, across(column1:column4, as.integer))

**13. rowwise + mutate**

- apply function across columns but within one row
- Ex: we want the max value from different columns in one row (ie find the maximum from values in one row)

rowwise(dataframe) |>
mutate(maximum= max(c(column1, column2, column3, column4)))

**similar to group_by(), rowwise() doesn't appear to do anything when it is called by itself, but we can apply rowwise with other functions to change how these other functions operate.**
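To see what `rowwise()` changes, compare the same `mutate()` call with and without it. The table and column names below are invented for this sketch:

```r
library(dplyr)

# Hypothetical data, invented for this example
quarters <- tibble(q1 = c(3, 9), q2 = c(7, 2), q3 = c(5, 4))

# With rowwise(): max() sees only the three values in the current row
row_max <- quarters |>
  rowwise() |>
  mutate(maximum = max(c(q1, q2, q3))) |>
  ungroup()

# Without rowwise(): max() sees all six values at once
overall_max <- quarters |>
  mutate(maximum = max(c(q1, q2, q3)))

row_max$maximum
overall_max$maximum
```

The row-wise version gives 7 and 9, while the plain version repeats the overall maximum, 9, in every row.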

**Chapter 4 (Effective data visualization)**

**Important packages for chapter 4.0**

- `ggplot2`
  - part of tidyverse metapackage. (if loaded tidyverse, then do not need to load this)
  - This package allows you to create all sorts of visualizations of data.
- `RColorBrewer`
  - This package provides the ability to pick custom colour schemes some of which are colourblind friendly.
- `lubridate`
  - part of the tidyverse metapackage. (in tidyverse versions before 2.0.0 you still need to load this package **individually**)
  - This package is a tool to convert character strings to date vectors.

**Basic functions used to aid data visualization**

- `n()`
  - number of rows/observations in the data
  - usually used like `group_by()` + `summarize(n = n())` to give you the count of the rows for each group
- `slice_max(data, order_by = ..., n = ...)`
  - `data`: what data frame we are using
  - `order_by =`: which column we select to order, default is largest first
  - `n`: number of rows selected
  - This function is used to select only the top `n` data rows ordered by some column from a data frame to generate a new data frame
  - same purpose as arrange()+slice(), but more specific and efficient
- `as.factor()`: simply converts an existing vector to a factor
- `factor(col_name, levels = c(...,...,...))`: encodes a vector as a factor; allows you to specify the levels and their order, and whether the factor is ordered.
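A short sketch of counting rows per group with `n()` and then keeping the largest groups with `slice_max()`. The `pets` table is made up, and the count column is named `count` here to keep it distinct from the `n` argument.

```r
library(dplyr)

# Hypothetical data, invented for this example
pets <- tibble(kind = c("dog", "cat", "dog", "fish", "dog", "cat"))

counts <- pets |>
  group_by(kind) |>
  summarize(count = n())   # number of rows in each group

# Keep the 2 rows with the largest count, largest first
top2 <- slice_max(counts, order_by = count, n = 2)
top2
```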

**LO1: Describe when to use what kinds of visualizations to answer specific questions using a data set**

Great visualizations clearly answer your question without distraction or additional explanation.

**4 Kinds of visualization**:
1. **scatter plots**: visualize the relationship between **two quantitative variables**
2. **line plots**: visualize **trends** with respect to an **independent ordered** quantity (e.g., time)
3. **bar plots**: visualize **comparisons of amounts**
4. **histograms**: visualize the distribution of **one quantitative variable** (e.g, all its possible values and how often they occur)

**Avoid**
- avoid using **pie charts**, better to use bars, as it's easier to compare bar heights than pie slice sizes. 
- avoid using **3D visualizations**, as they are hard to understand when converted to 2D image format
- do not use **tables** to make **numerical comparisons**

**LO2: Given a data set and a question, select from the above plot types and use R to create a visualization that best answers the question**

- bar plots ex: Compare the amount of poop different dog breeds have in 2020.
- scatter plots ex: Visualize the relationship between BMI and health insurance cost.
- line plots ex: visualize the trend of CO2 emissions from 2010 to 2020.
- histograms ex: visualize the midterm grade distribution in class of 2020.

**LO3: Effective visualizations and rules of thumb**

**Convey the message, minimize the noise!!!**

**1. Convey the message**

- Make sure the visualization answers the question as simply and plainly as possible.
- Use **legends**, **labels** so that your visualization is understandable without reading explanations.
- Make sure the **text, symbols, lines...** are big enough to be easily read.
- Make sure the data are **clearly visible**
- Make sure to **use color schemes** that are **colorblind friendly**
- Redundancy can be **helpful**, sometimes conveying the same message in multiple ways reinforces it for the audience.

**2. Minimize noise**

- Too many **different colours** can be distracting, create false patterns
- **Overplotting** is when marks that represent the data **overlap**, which prevents you from seeing how many data points are represented in areas of the visualization.
- Make plots in the **appropriate size**
- **Don't** adjust the axes to zoom in on small differences; if the difference is small, show that it's small!

**General tools used to refine the 4 visualizations**

Geometric Objects: specifies how the mapped data should be displayed `geom_*`

- `geom_point()` for scatterplot, `geom_line()` for line plot, `geom_histogram()` for histogram, `geom_bar()` for bar plots
- `geom_vline(xintercept = ...)` to add a vertical line to the plot at the specified x-intercept
  - `geom_vline(xintercept =..., linetype = "dashed", size = 1)`
- `geom_hline(yintercept = ...)` to add a horizontal line at the specified y-intercept

Scales: used to modify axes and legends; they adjust how aesthetic mappings are displayed
- `scale_x_continuous()` :customize the appearance of continuous variables on the x-axis, allows you to adjust axis labels, breaks, limits, transformations
- `scale_y_continuous()` :customize the appearance of continuous variables on the y-axis, allows you to adjust axis labels, breaks, limits, transformations

Aesthetic Mappings: tell `ggplot` how the variables in the data frame map to properties of the visualization (colour, shape, position, size)
- `x`,`y`: position along the x and y axes
- `fill`: fill colour of bars
- `colour`: colour of points and lines
- `shape`: shape of points

Labelling:

- `xlab()`: adds a label to the x axis; labels usually include units and use a less technical name than the column name
- `ylab()`: adds a label to the y axis
- `labs()`: general function for all labels (x, y, legend, colour...)

Font control and legend positioning:

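- `theme()`: changes the font size in plots and the position of the legend

`theme(text = element_text(size = 12))`

Flipping axes:

- `coord_flip()`: swaps the x and y axes

Subplots:

- `facet_grid()`: divides a plot into subplots based on the values of one or more discrete variables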

**`ggplot()` Basics**

- `ggplot(data,aes(x= , y= , ...)) + geom_...() + ...`
  - ggplot takes two arguments.
  - 1st argument is the dataframe to visualize
  - 2nd argument is an aesthetic mapping, created with `aes()`, that maps variables to properties of the visualization.
  - After the ggplot function, different layers are **added** to the plot using `+` instead of `|>`

- `aes()`
  - `x =`: assign variable to x-axis
  - `y=`: assign variable to y-axis
  - `colour =`: assign different colors by factors of the **categorical variable** (non-numerical, factor) you input in this argument

ex: in `aes(..., colour = Column (that has categories, factor))`

   - `shape =`: assign different shapes by factors of the **categorical variable** you input in this argument
   - `fill =` :(for geom_histogram and geom_bar) what factor is used to color the bars
   - `fct_reorder()`: often used with `aes()` to reorder values
     - The first argument defines the column to be reordered
     - The second argument is the criteria used for reordering
     - `fct_reorder()` uses **ascending** order by default; change to descending with `.desc = TRUE`
     - EX: `aes(..., y=fct_reorder(column, criteria, .desc=TRUE),...)`

**Note**:

- `fill` and `colour` can also be used outside the `aes()` function. This is done when you want to manually assign a colour to your points/bars.
- Anything you define in the `aes()` function MUST be labelled in the `labs()` function

- `geom_...()`
  - `geom_bar(stat = "identity")`: tells ggplot2 that you will provide the y-values for the barplot, rather than counting the number of rows for each x value (which is the default, `stat = "count"`)
  - `geom_histogram(position = "identity")`: ensures the histograms for each category are overlaid on top of one another instead of stacked (stacking is the default for bar plots and histograms when they are coloured by another categorical variable)

- `...`
  - `xlab()`: x-axis label, (can add `\n` in the name to create line break)
  - `ylab()`: y-axis label
  - `xlim()`: set the scale limits for the x-axis. `xlim(c(lower boundary, upper boundary))`
  - `ylim()`: set the scale limits for the y-axis. `ylim(lower, upper)`
  - `theme(text=element_text(size = 20))`: changes the font size in plots. a good start is 20
  - `theme(legend.position = "top", legend.direction = "vertical")`: move the legend to better display the plot. 
  - `scale_x_log10`: scale the x values to log scale.
  - `scale_y_log10`: scale the y values to log scale.
  - `scale_color_brewer(palette = " ")`: allows you to choose the specific colour palette you want from the `RColorBrewer` package
  - `scale_fill_manual`: manually select the colours used to fill the bars
  - `coord_flip()`: swaps x and y coordinate axes, to give more space to labels on the x axis

**`facet_grid()`**

Facets divide a plot into subplots based on the values of one or more discrete variables.

- To facet into rows based on the discrete variable, use `rows = vars(colname)` argument
- To facet into columns based on the discrete variable, use `cols = vars(colname)` argument
- **Note**, column name must be wrapped by `vars()`
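Putting the pieces together, here is a sketch of a layered plot with an aesthetic mapping, labels, a font-size theme, and facets. It uses `mpg`, an example data set that ships with ggplot2; the label text is chosen for illustration.

```r
library(ggplot2)

# mpg is an example data set that ships with ggplot2
p <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point(alpha = 0.6) +                 # semi-transparent points reduce overplotting
  facet_grid(cols = vars(year)) +           # one subplot per model year, side by side
  labs(
    x = "Engine displacement (litres)",
    y = "Highway fuel economy\n(miles per gallon)",
    colour = "Drive train"
  ) +
  theme(text = element_text(size = 20))     # larger text for readability
p
```

Each layer is added with `+`, and every aesthetic used in `aes()` gets a readable label in `labs()`.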

**Why is line plot sometimes better**:

- Line plots connect the sequence of x and y coordinates of the observations with line segments, emphasizing their order, as x variable (eg time) has a natural order to it.
- issue with scatterplot: overplotting can occur where data points overlap on top of one another, making the information presented unclear.

**When is scatterplot better**:

- scatterplot is good when neither of the two quantitative variables has a natural order

**Key characteristics of data**

- **Direction**: if the y variable tends to increase when x increases, then y has a **positive** relationship with x. If y tends to **decrease** when x increases, then y has **negative** relationship with x. If y does not **meaningfully** increase or decrease as x increases, then y has **little or no** relationship with x.
- **Strength**: if y **reliably** increases, decreases, or stays flat as x increases, then the relationship is **strong**. Otherwise the relationship is **weak**. (Strong when the points are more clustered and look more like a "line" than a "cloud")
- **Shape**: if you can draw a straight line roughly through the data points, the relationship is **linear**. Otherwise, it is **nonlinear**

**Example of visual redundancy**

- Conveying the same information with **both scatter point color and shape** - can further improve the clarity of your visualization.

**Bar plots vs Histograms**

- It is better to use bar plots to compare value of an amount (size, proportion, count, percentage) across **different groups of categorical variables**
- When you are tempted to display only the **mean or median values** as bars, it is often better to use a histogram, which shows the distribution of all individual data points.

**Histograms** help us visualize how a particular variable is distributed in a data set by separating the data into bins, and then using vertical bars to show how many data points fell in each bin.

**Saving the visualization**

- Generally, images come in two flavours: **raster** and **vector** formats
  - **Raster** images are represented as a 2D grid of square pixels, each with its own colour
    - they are often compressed before storing to take up less space.
    - a **lossy** format is one where the image cannot be perfectly re-created when loading and displaying
    - a **lossless** format allows a perfect display of the original image
    - Common **raster image** file types:
      - JPEG (.jpg, .jpeg): lossy, usually for photographs
      - PNG (.png): lossless, usually for plots, line drawings
      - BMP (.bmp): lossless, raw image data, no compression (rare)
      - TIFF (.tif, .tiff): typically lossless, no compression, used mostly in graphic arts publishing

  - **Vector** images are represented as a collection of mathematical objects (lines, surfaces, shapes, curves). When the computer displays the image, it redraws all of the elements using their mathematical formulas.
     - Common **vector image** file types:
       - SVG (.svg): general-purpose use
       - EPS (.eps): general-purpose use (rare)

**Raster and vector images have opposing advantages and disadvantages**

- Raster images of the same size take the same amount of time and space to load no matter how complex they are; vector images take more or less time and space depending on how complex the image is.
- You can zoom into/ scale up vector graphics as much as you like without the image looking bad

**To save the graph**:

- `ggsave(file name, plot name)`: the file name could end with .png, .jpg, .bmp, .tiff, .svg; the extension determines the file format
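A minimal sketch of saving a plot as a PNG. The file is written to a temporary folder so the example leaves nothing behind; the file name, width, and height are arbitrary choices for illustration.

```r
library(ggplot2)

# mpg is an example data set that ships with ggplot2
p <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(bins = 20)

# The .png extension tells ggsave which format to write
out <- file.path(tempdir(), "hwy_hist.png")
ggsave(out, plot = p, width = 6, height = 4)

file.exists(out)
```

Passing `plot = p` explicitly avoids relying on whichever plot happened to be displayed last.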

**Chapter 12 Collaboration with version control**

**LO1: Describe what version control is and why data analysis projects can benefit from it**

- **Version Control**: the process of **keeping a record** of **changes** to documents. (**when** changes were made) (**who** made them) throughout the history of development.
    
**Advantages of version control**
1. Version control tracks changes to the files in the analysis over the lifespan of the project, including when changes were made and by whom. It provides the ability to view earlier versions of the project and revert changes.
2. Being able to record and view the history of a data analysis project is **important** for understanding how and why decisions were made to use one method over another.
3. Helps with collaboration by sharing edits with others and resolving conflicting edits.
4. Version control tools usually include a remote repository hosting service (GitHub) that can act as a backup of the local files on computer.

**TWO things to version control a project**
1. **version control system**: the software responsible for tracking changes, sharing changes with others, obtaining changes from others, and resolving conflicting edits. `Git`
2. **repository hosting service**: stores a copy of the version-controlled project online, so that team members can access it remotely, discuss issues and bugs, and distribute the final product. `GitHub`

**LO2: Create a remote version control repository on GitHub**

**Typically, when we put a project under version control, we create two copies of the repository.**
1. **local repository**: the primary workspace where we create, edit, and delete files (commonly exists on your computer, or on a server such as **JupyterHub**).
2. **remote repository**: Typically stored in a repository hosting service (**GitHub**), where we can easily share it with our collaborators.

- Both copies of the repository have a **working directory**: where you can create, store, edit, and delete files.
- Both maintain the full project **history**.

**LO3: Use Jupyter's Git version control tools for project versioning and collaboration**

**Cloning a repository**

- **Copying/downloading the entire contents** (files, project history, location of remote repository) of a remote GitHub repository **to a computer** (your local workspace)

**Git has a distinct step of ADDING files to the STAGING AREA because**:

- Not all changes we make are ones we want to push to our remote GitHub repository.
- It allows us to edit multiple files at once but associate particular commit messages with particular files (so the commit messages can more specifically reflect the changes that were made).

**Commits**

A commit is a snapshot of the file contents as well as the metadata about the repository (who made the commit and when it was made).

- Each commit has a human-readable **message**: a description of what work was done since the last commit, so that you can easily and effectively review the project's history.
- When we commit our changes to Git, the snapshot of changes, the commit message, the time, and the user are all saved to the Git history on the LOCAL computer (local repository).

To commit, we first add the files to the **staging area**: not a physical location on the computer, but a **conceptual placeholder** for the files until they are **committed**.

**Pushing**

Push the commits in your local repository to the remote repository on **GitHub** so that it matches what you have locally (collaborators will then be able to see the changes in the remote repository).

- Pushing with Git is the act of sending changes that were committed to Git to a remote repository, for example, on GitHub.com.
- You should push your work to GitHub anytime you want to share your work with others, or when you are done with a work session and want to back up your work.

**Pulling**

Pulling obtains new changes made by others from the remote repository and synchronizes your local repository with what is on the remote repository.

- If the remote repository contains changes that you do not yet have, you will **not be able to push** any more changes yourself **until you pull** them.
- Pulling is the act of collecting changes that exist in a remote repository but do not yet exist on the local computer you are working on.

**Version control workflows**

Generally there are **three additional steps** as part of the regular (edit, create, delete) workflow:
1. Tell `Git` to make a **commit** of your own changes in **local repository**
2. Tell `Git` **when** to send your **new commits** to the **remote `GitHub` repository**
3. Tell `Git` **when** to **retrieve** any new changes made **by others** from the remote repository `GitHub`

**Example Workflow on JupyterHub**
1. Edit, create, and delete files in your cloned local repository on JupyterHub
2. Once you want to record your current version, specify which files to "add" to Git's staging area (the modified files that you want a snapshot of).
3. Commit those flagged files to your repository, and include a helpful commit message to tell your collaborators what changes you have made. At this point, the remote repository on GitHub has not changed.
4. Continue working.
5. When you want to store your commits from your local repository in the cloud to share with your collaborators, push them to the hosted repository on GitHub.

**Resolve merge conflicts**

Merge conflicts: these occur when you forget to pull before making new changes to a file, and you and another collaborator have edited the same line; Git is then unable to merge the changes automatically.

- To fix a merge conflict, open the file in a plain text editor.
- The beginning of a merge conflict is marked by `<<<<<<< HEAD` and the end is marked by `>>>>>>>`. The version before the separator `=======` is your change, and the version after it is the other collaborator's.
- Use the plain text editor to keep the version you want and remove the special markings.

**Communicating using GitHub issues**

- Emails and messaging apps are not designed for project specific communication.

**GitHub issues**: an alternative written communication platform to email and messaging apps

- Issues are opened from the "issues" tab on the project's GitHub page, and they remain there even after the conversation is over and issue is closed.
- One issue thread is usually created per topic, and they are easily searchable using GitHub's search tools.
- All issues are accessible to all project collaborators, so no one is left out of the conversation.
- Issues can be set up so that team members get email notifications when a new issue is created or a new post is made in an issue thread.

**Chapter 5 Classification I: training and predicting**

***Important packages for chapter 5***

- `forcats`
  - The forcats package enables us to easily manipulate factors in R.
  - Factors are a special categorical type of variable in R that are often used for class/label data.
- `tidymodels`
  - The tidymodels collection provides tools to help make and use models, such as classifiers.
  - It also provides the `workflows` package for chaining preprocessing and modelling steps together.
- `parsnip`
  - Part of the `tidymodels` package collection (not the `tidyverse`).
  - The K-nearest neighbors algorithm, along with many other models, is implemented in parsnip.

**LO1: Recognize situations where a classifier would be appropriate for making predictions**

**Classification**: predicting a **categorical class (label)** for an observation given its other variables (features).

- Generally, a classifier assigns an observation without a known class, to a class based on how similar it is to other observations with known class.

**K-nearest neighbors**: (a classifier, an algorithm), one method used to predict a categorical class/label for an observation.

- *binary classification*: basic classification problem where only *two* categorical class/labels are involved.

**LO2: Describe what a training data set is and how it is used in classification.**

- **Training set**: a collection of observations with known classes/labels that is used to train (teach) the classifier, which can then predict the class of a new observation.

***Common functions to use in this chapter***

- `glimpse(df)`
  - This function can make it easier to inspect the data when we have a lot of columns
- `factor(col_name, levels = c(..., ..., ...))`
  - Used to encode a vector as a factor; lets you specify the allowed values (levels) and their order
  - The first argument is the column (vector) you want to convert
  - The second argument, `levels`, lists the categories in the order you want them
- `add_row(df, col_name_1 = ..., col_name_2 = ..., ..., col_name_n = ...)`
  - creates and adds a row/observation to the df
  - specify the name and respective values of each column of the df in argument
- `as_factor()`
  - converts the column/variable into a statistical categorical variable
  - `mutate(df, new_name = as_factor(chosen column))`
- `levels()`
  - Factors have what are called "levels", which you can think of as categories
  - This function return the name of each category in that column
  - levels() function requires a vector as its argument
- `dist()`
  - finds the Euclidean distance between the specified observations of the data frame.
  - this is used with the `slice()` function to first obtain the rows, and then the result is piped into `dist()`
  - if there are more than 2 rows, the result is a matrix showing the distance between each pair of rows
- `distinct()`
  - can be used to see all the unique class values present in a column
- `fct_recode()`
  - Used to replace the names of factor values with other names, using `"new name" = "old name"`
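A small sketch of the factor functions listed above, assuming the `forcats` package is installed. The diagnosis codes are hypothetical.

```r
library(forcats)

# Hypothetical diagnosis codes stored as plain text
diagnosis <- c("M", "B", "B", "M", "B")

# Encode as a factor, stating the categories and their order explicitly
diagnosis_fct <- factor(diagnosis, levels = c("M", "B"))
levels(diagnosis_fct)

# Rename the levels using "new name" = "old name"
diagnosis_named <- fct_recode(diagnosis_fct, "Malignant" = "M", "Benign" = "B")
levels(diagnosis_named)
```

The order of the levels matters later on, because arguments such as `event_level = "first"` refer to the position of a label in `levels()`.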


**LO3: Compute, by hand, the straight-line (Euclidean) distance between points on a graph when there are two predictor variables**

1. Distance between points:

   - Formula: `Distance = sqrt((ax - bx)^2 + (ay - by)^2 + (az - bz)^2 + ...)`
   - Use `mutate` to calculate the distance from the new observation.
   - For example, to manually find the K = 5 nearest neighbors:
     - `mutate(dist_from_new = sqrt((column1 - new_obs_column1)^2 + (column2 - new_obs_column2)^2))`
     - `slice_min(dist_from_new, n = 5)`, which takes the 5 rows with the smallest distances
     - then classify the new observation by majority vote
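A worked example of the distance formula with two predictor variables, using base R only. The two points `a` and `b` are made up so that the arithmetic is easy to check by hand.

```r
# Two hypothetical observations with two predictor variables each
a <- c(x = 2, y = 3)
b <- c(x = 5, y = 7)

# Straight-line distance: sqrt((ax - bx)^2 + (ay - by)^2)
#                       = sqrt((-3)^2 + (-4)^2) = sqrt(25)
by_hand <- sqrt((a[["x"]] - b[["x"]])^2 + (a[["y"]] - b[["y"]])^2)

# dist() gives the same number when the points are stacked as rows
with_dist <- as.numeric(dist(rbind(a, b)))
```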

**Summary of K-nearest neighbors algorithm**
1. Compute the distance between the new observation and each observation in the training set. `mutate()`
2. Sort the data table in ascending order of distance. `slice_min()`
3. Choose the top K rows of the sorted table
4. Classify the new observation based on majority vote of the neighbor classes
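The four steps can be sketched by hand with `dplyr`. This is only an illustration: the training data, the column names, and the new observation are hypothetical.

```r
library(dplyr)

# Hypothetical training data: two predictors and a known class label
train_data <- data.frame(
  perimeter = c(1.0, 1.2, 0.9, 3.0, 3.2, 2.9, 3.1),
  concavity = c(1.1, 0.8, 1.0, 3.1, 2.8, 3.0, 3.3),
  class     = c("B", "B", "B", "M", "M", "M", "M")
)

# Hypothetical new observation to classify
new_perimeter <- 2.8
new_concavity <- 2.9
k <- 5

# Steps 1 to 3: compute distances and keep the K closest rows
neighbors <- train_data |>
  mutate(dist_from_new = sqrt((perimeter - new_perimeter)^2 +
                              (concavity - new_concavity)^2)) |>
  slice_min(dist_from_new, n = k)

# Step 4: majority vote among the K neighbors
predicted_class <- neighbors |>
  count(class, name = "votes") |>
  slice_max(votes, n = 1) |>
  pull(class)
```

Here four of the five nearest neighbors are labelled "M", so the vote goes to "M".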

**LO4: Explain the K-nearest neighbors classification algorithm**

- The K-nearest neighbors algorithm is a classification method that classifies a new observation based on its **similarity** to **nearby points**
- KNN works in the following order:
  1. choose K: the number of neighbors
  2. calculate the Euclidean distance from the new observation to each observation in the training set
  3. find the K nearest neighbors
  4. assign the class to the new observation by the majority vote among K nearest neighbors

**LO5: Perform K-nearest neighbors classification in R using `tidymodels`**

1. Create a *model specification* for KNN.

   `knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
   set_engine("kknn") |>
   set_mode("classification")`

- First, use the `nearest_neighbor()` function.
- The `weight_func` argument controls how neighbors vote; `"rectangular"` gives each of the K nearest neighbors exactly 1 vote.
- Set K = 5 using the `neighbors` argument.
- Second, use `set_engine()` to specify which package or system will be used for *training the model*; here, we use the `"kknn"` engine.
- Last, specify that this is a classification problem with `set_mode()`.

2. *fit* the model on the data frame

   `fit(knn_spec, response_variable ~ predictor_variable1 + predictor_variable2, data = data_frame)`

   - First, give the model specification `knn_spec` to be fitted.
   - Then give a formula: the response variable goes before the `~`, and the predictor variables to use for the prediction go after it.
   - Lastly, don't forget the data frame to fit the model to, with `data =`.
- *Note*: you can use the formula `response_variable ~ .` to use all the other variables in the data as predictors.
- *Note*: The fit object lists the function that trains the model as well as the "best" settings for the number of neighbours and weight function.

*Side Note*: `new_obs <- tibble(x_column = ..., y_column = ..., z_column = ..., ...)` creates a new observation with the given x and y values. It can have more columns.

3. make the *prediction* on the new observation

   `predict(knn_fit, new_obs)`

   - prediction is made on the new observation using the fitted model and the new observation
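Putting the three steps together, a minimal sketch assuming the `tidymodels` and `kknn` packages are installed. The training data, column names, and new observation are hypothetical.

```r
library(tidymodels)

# Hypothetical training data: a class label and two predictors
train_data <- tibble(
  class     = factor(c("B", "B", "B", "B", "B", "M", "M", "M", "M", "M")),
  perimeter = c(1.0, 1.2, 0.9, 1.1, 0.8, 3.0, 3.2, 2.9, 3.1, 2.8),
  concavity = c(1.1, 0.8, 1.0, 1.3, 0.9, 3.1, 2.8, 3.0, 3.3, 2.7)
)

# 1. Model specification: K = 5, one vote per neighbor
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

# 2. Fit the model: response before the ~, predictors after it
knn_fit <- fit(knn_spec, class ~ perimeter + concavity, data = train_data)

# 3. Predict the class of a new observation
new_obs <- tibble(perimeter = 3.0, concavity = 3.0)
prediction <- predict(knn_fit, new_obs)
```

The result is a one-row tibble with a `.pred_class` column. The new point sits inside the "M" cluster, so all five nearest neighbors vote "M".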

**LO6: Use a `recipe` to center, scale, balance, and impute data as a preprocessing step**

1. *Center and scaling*

   - Because KNN predicts classes by identifying the nearest observations using Euclidean (straight-line) distance, **any variable with a large scale will have a much larger effect than variables with a small scale**.
   - Just because a variable has a large scale doesn't mean it's more important.
   - In many other predictive models, the *center* of each variable matters as well.

***Standardize*** the data:
1. Subtract the mean from each value *(center the variable)*. All variables will have a mean of **0**.
2. Divide each value by the standard deviation *(scale the variable)*. All variables will have a standard deviation of **1**.

**Common problems using K-NN classification**
1. **Varying scales of each variable**
   When using KNN, the scale of each variable matters, since large-scale variables can have a greater (unwanted) effect.
2. **Class imbalance**
   Another potential issue in a classifier is class imbalance, *when one label is much more common than another*.

   If there are many more data points with one label overall, the algorithm is more likely to pick that label in general.

***The recipe***:

`recipe <- recipe(response ~ predictors, data = dataframe) |>`

          `step_scale(all_predictors()) |>`

          `step_center(all_predictors()) |>`

          `prep()`

`scaled_df <- bake(recipe, df)`

- `recipe()` creates a recipe for preprocessing data.
- `prep()` finalizes the recipe by using the data to compute anything necessary to run the recipe (in this case, the column means and standard deviations).
- `bake()` applies the results of `prep()` to the data.
- **`prep()` and `bake()` are separate so that the values computed once by `prep()` (from the training data) can then be applied with `bake()` to any other data set, such as a new observation.**

**Proper use of a recipe helps keep our code simple, readable, and error-free**

**There are tools provided by `tidymodels` to automatically apply `prep` and `bake`**
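A runnable sketch of the recipe steps, assuming the `recipes` package is installed. The data frame and its column names (`area`, `smoothness`) are hypothetical, chosen so that the two predictors sit on very different scales.

```r
library(recipes)

# Hypothetical data: area is on a much larger scale than smoothness
train_data <- data.frame(
  class      = factor(c("B", "B", "B", "M", "M", "M")),
  area       = c(100, 200, 300, 400, 500, 600),
  smoothness = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6)
)

# prep() computes the standard deviations and means from train_data
scaling_recipe <- recipe(class ~ area + smoothness, data = train_data) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors()) |>
  prep()

# bake() applies those stored values to a data set
scaled_train <- bake(scaling_recipe, train_data)

# After standardizing, each predictor has mean 0 and standard deviation 1
mean(scaled_train$area)
sd(scaled_train$area)
```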

**Why we need to standardize data**:

- For unscaled data, the variable with the larger scale dominates the distance calculation, making the smaller-scaled variable's impact negligible. This causes biased nearest neighbors to be selected, leading to unreliable classification. Standardization makes sure all variables contribute equally to the nearest-neighbor selection, improving the reliability of the prediction.

**Balancing**

Issue: *imbalance*, when one label is much more common than another. Since KNN uses the labels of nearby points to predict the new point, if one label is much more frequent than another, the algorithm is more likely to pick that label in general.

- Rebalance the data by oversampling the rare class.
- Basically, replicate the rare observations multiple times in our data.
- Use `step_upsample(responder, over_ratio = 1, skip = FALSE)` in the recipe (`step_upsample()` comes from the `themis` package).

**Missing data**

Issue: observations where the values of some of the variables were not recorded.

- KNN requires access to *all* values for *all* observations in the training set to calculate straight-line distance to nearby training observations.

1. If *not too many missing values*: simply remove the observations : `drop_na()`
2. If *many of the rows have missing entries*: **impute** the missing entries with the *mean*. Use `step_impute_mean(all_predictors())` in the recipe to fill in the missing values.
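A small sketch of mean imputation inside a recipe, assuming the `recipes` package is installed. The data frame is hypothetical and has one missing value in each predictor.

```r
library(recipes)

# Hypothetical data with missing entries (NA) in both predictors
df <- data.frame(
  class   = factor(c("B", "M", "B", "M")),
  area    = c(1, NA, 3, 5),
  texture = c(2, 4, NA, 6)
)

# prep() computes each column mean from the non-missing values
impute_recipe <- recipe(class ~ ., data = df) |>
  step_impute_mean(all_predictors()) |>
  prep()

# bake() fills every NA with the stored column mean
imputed <- bake(impute_recipe, df)
```

The missing `area` value is replaced by the mean of 1, 3, and 5, and the missing `texture` value by the mean of 2, 4, and 6.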

**LO7: Combine preprocessing and model training using a `workflow`**

`workflow`: a way to chain together multiple data analysis steps without a lot of code for intermediate steps

The `prep()` function is unnecessary when the preprocessing is placed in a workflow

`knn_fit <- workflow() |>`

    `add_recipe(recipe_name) |>`
    
    `add_model(knn_spec) |>`
    
    `fit(data = dataframe)`

*`fit()`* is used to fit the whole workflow on the data.

**A formula is not needed in `fit()` because it is already included in the recipe**

- Now the fit object lists the function that trains the model as well as the "best" settings for the number of neighbors and weight function.
- *The fit object* *also* includes information about the **overall workflow**, including the centering and scaling preprocessing steps.
- **NOW** when we apply the `predict` function to a new observation, it will apply the same recipe steps to the new observation first.
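A complete sketch of a workflow that combines a recipe and a model, assuming `tidymodels` and `kknn` are installed. The data and the column names are hypothetical.

```r
library(tidymodels)

# Hypothetical training data with predictors on very different scales
train_data <- tibble(
  class      = factor(c("B", "B", "B", "B", "B", "M", "M", "M", "M", "M")),
  area       = c(100, 120, 90, 110, 80, 300, 320, 290, 310, 280),
  smoothness = c(0.11, 0.08, 0.10, 0.13, 0.09, 0.31, 0.28, 0.30, 0.33, 0.27)
)

# Recipe without prep(): the workflow takes care of that step
knn_recipe <- recipe(class ~ area + smoothness, data = train_data) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

# No formula in fit(): it is already part of the recipe
knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = train_data)

# predict() standardizes new_obs with the same recipe before classifying
new_obs <- tibble(area = 305, smoothness = 0.30)
prediction <- predict(knn_fit, new_obs)
```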

***set.seed()***: used to ensure every operation involving random numbers produces reproducible results.

#### Little additions the day before the midterm

- Databases do things in the *laziest* way possible: work is only done when the results are actually needed, which makes things a lot faster with large data sets.
- `facet_wrap()` is used to create many plots side by side, wrapping onto a new line if too many plots are created.
- `!is.na(col)` can be used in `filter()` to keep only the rows where the column is NOT `NA`.
- `options(repr.plot.width = 10, repr.plot.height = 20)` is used to adjust the width and height of plots on the screen.
- `show_query(tbl(dataset, "table"))` is used to look at the SQL commands sent to the database from the `tbl` commands.

**Chapter 6: Evaluation and Tuning**

**Common functions we may use**

- `bind_cols(col_object,df)`

**LO 1: Describe what training, validation and test data sets are and how they are used in classification**

How to measure how "good" our classifier is?

- Our classifier is good if it provides accurate predictions on data NOT seen during training: this shows that it has actually learned about **the relationship** between the predictor variables and the class, instead of simply memorizing the labels of individual training data examples.

To evaluate the classifier without needing to collect a great amount of new data from the actual source:

- We can SPLIT the data into two sets: a training set and a test set.
1. First, use ONLY the **training set** to build our classifier, and keep the test set untouched.
2. Then use the classifier to predict the labels in the test set, and compare the predictions with the actual labels. This tells us how confident we can be that the classifier will accurately predict labels for **BRAND NEW** observations.

**Code to split the data in training, validation and test sets.**

**LO 2: Describe and interpret accuracy, precision, and recall in R using test set, single validation set, and cross-validation.**

**Ways to assess how well the predictions match the actual:**


1. `prediction accuracy`: (# correct predictions)/(total predictions)

   - Benefits: Convenient, general-purpose way to summarize the performance of a classifier with a single number.
   - Downside: ONLY tells us how OFTEN the classifier makes mistakes in GENERAL, **NOT what KINDS of mistake**

2. `confusion matrix`: Shows how many test set labels of **each type** are predicted correctly and incorrectly.

   - Benefits: Gives more detail about the kinds of mistakes the classifier tends to make

**4 kinds of predictions the classifier can make:**

- We typically refer to the label we are more interested in as *positive*
1. **True Positive**: *positive* label classified as *positive*
2. **False Positive**: *negative* label classified as *positive*
3. **True Negative**: *negative* label classified as *negative*
4. **False Negative**: *positive* label classified as *negative*

3. `Precision`: quantifies how many of the positive predictions were ***actually*** positive

Formula: (# correct positive predictions)/(total positive predictions)
   - High precision: when the classifier predicts something to be positive, we can TRUST that it is actually positive.

4. `Recall`: quantifies **how many** of the positive observations in the test set **were identified** as positive.

Formula: (# correct positive predictions)/(total positive test set observations)
   - High recall: we can TRUST that the classifier will FIND the positive labels present
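A worked example of the three formulas in base R. The labels are made up, with "M" treated as the positive label.

```r
# Hypothetical test-set labels; "M" is the positive label
truth <- c(rep("M", 60), rep("B", 40))
predicted <- c(rep("M", 40), rep("B", 20),  # 40 true positives, 20 false negatives
               rep("M", 10), rep("B", 30))  # 10 false positives, 30 true negatives

true_pos  <- sum(predicted == "M" & truth == "M")
false_pos <- sum(predicted == "M" & truth == "B")
false_neg <- sum(predicted == "B" & truth == "M")

accuracy_value  <- mean(predicted == truth)           # 70 / 100
precision_value <- true_pos / (true_pos + false_pos)  # 40 / 50
recall_value    <- true_pos / (true_pos + false_neg)  # 40 / 60

# Confusion matrix as a plain table of counts
table(predicted = predicted, truth = truth)
```

The same classifier has an accuracy of 0.7, a precision of 0.8, and a recall of about 0.67, which shows why a single number does not tell the whole story.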
     

**LO 3: Set the random seed in R using the `set.seed()` function**

Purpose: We use randomness anytime we need to make a decision in our analysis that **needs to be fair, unbiased, and NOT influenced by human input**

EX: Here for classification evaluation, we want R to *randomly* split the data

**In R, the randomness is actually NOT truly *random*.** 
Once we set the seed with `set.seed()`, everything may LOOK random, but is actually totally reproducible, ***AS LONG AS it's the SAME seed value!***

**Watch out**: 
- If you do not set the seed at the beginning, the results will NOT be reproducible!!
- If you set the seed multiple times throughout the analysis, the results will NOT be as random as they should be!!
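A quick demonstration in base R. The seed value 2024 is arbitrary; the seed is reset here only to show that the same seed reproduces the same draw.

```r
# Same seed, same "random" result
set.seed(2024)
first_draw <- sample(1:100, 5)

set.seed(2024)
second_draw <- sample(1:100, 5)

identical(first_draw, second_draw)

# Without resetting the seed, the next draw continues the sequence
third_draw <- sample(1:100, 5)
```

In a real analysis the seed is set once at the beginning, not before every random step.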

**LO 4:Evaluating performance with `tidymodels`** 

Steps to assess the classifier:

**1. Create the Training set and Test Set**

- The training set is typically between 50% and 95% of the data (0.75 is a common choice)
- The test set is the remaining 5% to 50% of the data (commonly 0.25)
  - You want to trade off between:
    - training an accurate model (by using a **larger training** data set)
    - getting an accurate evaluation of its performance (by using a **larger test** data set)


- `initial_split(data, prop=..., strata= target_column)`
  - `prop=` is the proportion you want for the training set (eg. 0.75)
  - `strata=` stratifies by the given variable, ensuring that roughly the same proportion of each class of that variable ends up in BOTH the training and testing sets.
  - use `set.seed()` for reproducible results, as `initial_split()` randomly samples from the data.
  - use `training(split_object)` & `testing(split_object)` to extract the training and test sets into their own objects.

- can use `glimpse(testing)` to view the data frame.
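A minimal sketch of the split, assuming the `rsample` package (part of `tidymodels`) is installed. The data frame is hypothetical, with 60 observations of one class and 40 of the other.

```r
library(rsample)

# Hypothetical data: 100 observations, two classes
full_data <- data.frame(
  class = factor(rep(c("B", "M"), times = c(60, 40))),
  x     = seq_len(100)
)

# Set the seed first, because the split is random
set.seed(1)
data_split <- initial_split(full_data, prop = 0.75, strata = class)

train_data <- training(data_split)
test_data  <- testing(data_split)

# Check the sizes and the class proportions in each set
nrow(train_data)
nrow(test_data)
table(train_data$class) / nrow(train_data)
table(test_data$class) / nrow(test_data)
```

Because of `strata = class`, the class proportions in the two sets should be close to those in the full data.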

**2. Pre-Process the data**

- K-NN is sensitive to the scale of predictors, so we should perform some preprocessing to standardize them
- We should create the standardization preprocessor **USING ONLY the training data**. (This ensures our test data does not influence any aspect of our model training.) We want our model to have never seen the test set before!
- Create the recipe with `step_scale(all_predictors())` and `step_center(all_predictors())`.
- Once we have created the standardization preprocessor, we can then apply it **separately** to both the **training** and **test** datasets.

**3. Train the Classifier**

- Create the K-nearest neighbor classifier with **only** the **training set**, using the standard procedure: `knn_spec`, then `knn_fit` with a `workflow()`...

  (Here, the K-nearest neighbours algorithm can randomly select among the tied classes if there is a tie in the vote between two or more classes. This is another reason WHY `set.seed()` is useful!)

**4. Create the labels in the Test set**

- Predict the class labels for our **test set** using the `predict()` function
- Use `bind_cols(prediction data, test data)` to add the column of predictions to the original test data, creating the predictions data frame.

**5. Compute the accuracy**

- To assess the classifier's accuracy, we use the `metrics()` function.
  - `metrics(prediction data, truth = target_column, estimate = .pred_class)`
  - `filter(.metric == "accuracy")`
- To check the order of labels in the target variable:
  - `pull(prediction data, target_column) |> levels()`
- To find the precision:
  - `precision(prediction data, truth = target_column, estimate = .pred_class, event_level = "first")` (use `"first"` or `"second"`, whichever is the position of the positive label in `levels()`)
- To find the recall:
  - `recall(prediction data, truth = target_column, estimate = .pred_class, event_level = "first")` (same choice of `"first"` or `"second"`)

**We can also look at the confusion matrix for the classifier**:

- `conf_mat(prediction data, truth = target_column, estimate = .pred_class)`
- shows us the table of predicted labels and correct labels
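A sketch of these functions on a tiny predictions data frame, assuming the `yardstick` and `dplyr` packages are installed. The labels are hypothetical; "M" is the first level, so it is the positive label when `event_level = "first"`.

```r
library(dplyr)
library(yardstick)

# Hypothetical predictions data frame: true labels and predicted labels
predictions <- data.frame(
  class       = factor(c("M", "M", "M", "M", "B", "B", "B", "B", "B", "B"),
                       levels = c("M", "B")),
  .pred_class = factor(c("M", "M", "M", "B", "M", "B", "B", "B", "B", "B"),
                       levels = c("M", "B"))
)

# Accuracy: 8 of the 10 predictions are correct
acc_value <- predictions |>
  metrics(truth = class, estimate = .pred_class) |>
  filter(.metric == "accuracy") |>
  pull(.estimate)

# Precision: 3 of the 4 "M" predictions are truly "M"
prec_value <- predictions |>
  precision(truth = class, estimate = .pred_class, event_level = "first") |>
  pull(.estimate)

# Recall: 3 of the 4 true "M" observations were found
rec_value <- predictions |>
  recall(truth = class, estimate = .pred_class, event_level = "first") |>
  pull(.estimate)

# Confusion matrix of predicted against true labels
confusion <- conf_mat(predictions, truth = class, estimate = .pred_class)
```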

* It is important to look at the confusion matrix to see what kinds of mistakes the classifier makes; whether the classifier is "good", and whether we want high precision or high recall, depends on the application.
* The majority classifier ALWAYS guesses the majority class label from the training data, no matter the predictors. So we would like our classifier's accuracy to be higher than the majority classifier's accuracy.

**LO5: Tuning the model: choose the number of neighbors in a K-nearest neighbors classifier by maximizing the estimated cross-validation accuracy**

- Predictive models such as K-NN have parameters that we have to pick: in K-NN we have to pick the number of neighbours K for the class vote.
- Choosing the best value of such a parameter is called **tuning** the model, and is part of model training.

**Question**

How do we tune the model?

**Answer**

1. Split the **Training data** further into **two** subsets, called the **training (sub)set** and **validation set**.
2. Use the **training (sub)set** for building the classifier, and the **validation set** for evaluating it!!
3. Then we will try different values of the parameter K and pick the one that yields the highest accuracy!

**Cross Validation Method**

- Unlike `initial_split`, which makes just **one** split, here we are free to create multiple train/validation splits and so train multiple classifiers. We then choose a parameter value based on **all** of the different results.
- This leads to a better choice of K for the overall set of training data.

* Averaging all validation set accuracies helps reduce the influence of any one (un)lucky validation set on the estimate.

**How does cross-validation work?**

- Instead of splitting the data at random each time, we want each observation in the data set to be used in a validation set only a single time.
- This strategy is called cross-validation.
- In cross-validation, we split our overall **training data** into *V* evenly sized chunks (folds).
- Then we iteratively use 1 chunk as the **validation set** and combine the remaining *V-1* chunks as the **training (sub)set**.


**Use the following functions to perform *V-fold* Cross Validation**

- `vfold_cv(training_dataframe, v=..., strata= target_column)`
  - This function splits our training data into V folds automatically
  - This is done after the data has been split into **training** and **testing** sets
  - Cross-validation uses a random process to decide how to partition the training data, so `set.seed()` is important here

- `fit_resamples(..., resamples = df_vfold)` instead of `fit(data)`
  - Used instead of `fit()` when doing cross-validation for **one specified number of neighbors**
  - This runs cross-validation on each train/validation split
  - The first argument is the `workflow()`, which is piped in.

- `tune_grid(..., resamples = df_vfold, grid = n)`
  - Used **instead** of `fit_resamples()` when doing cross-validation over several values of the number of neighbors.
  - Fits the model for each value in a range of parameter values
  - The third argument, `grid`, is either a number *n* (try at most *n* values of K when tuning) or a data frame listing the values to try.
  - The first argument, the `workflow()`, is piped in.
  - We set the seed at the very beginning to ensure results from tuning are reproducible.

- Specifying the `grid` (k_vals)
  - We use the code: `k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 5))`

- `collect_metrics(...)`
  - Used instead of `metrics()` when doing cross-validation
  - Used to aggregate the mean and standard error of the classifier's validation accuracy across the folds
  - Its argument is the result of `fit_resamples()` or `tune_grid()`

- `tune()`
  - Each parameter in the model to be tuned should be specified as `tune()` in the model specification rather than given a particular value

- To pull out the K with the highest accuracy:
  `best_k <- accuracies_df |> arrange(desc(mean)) |> head(1) |> pull(neighbors)`
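The tuning functions above fit together roughly as follows. This is a sketch on simulated data, assuming `tidymodels` and `kknn` are installed; the data, the column names `x1` and `x2`, and the grid of K values are all made up for illustration.

```r
library(tidymodels)

set.seed(1)

# Hypothetical training data: two classes that overlap a little
train_data <- tibble(
  class = factor(rep(c("B", "M"), each = 50)),
  x1    = c(rnorm(50, mean = 0), rnorm(50, mean = 2)),
  x2    = c(rnorm(50, mean = 0), rnorm(50, mean = 2))
)

knn_recipe <- recipe(class ~ x1 + x2, data = train_data) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# neighbors = tune() marks K as the parameter to be tuned
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 5-fold cross-validation on the training data only
folds <- vfold_cv(train_data, v = 5, strata = class)

# Values of K to try
k_vals <- tibble(neighbors = seq(from = 1, to = 21, by = 5))

accuracies <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_tune) |>
  tune_grid(resamples = folds, grid = k_vals) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# K with the highest mean cross-validation accuracy
best_k <- accuracies |>
  arrange(desc(mean)) |>
  head(1) |>
  pull(neighbors)
```

`accuracies` has one row per value of K, with the mean accuracy across folds in `mean` and its standard error in `std_err`.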

**Notes about V-fold cross validation**

- When you do cross-validation, you need to consider the size of the data and the speed of the algorithm and the computer.
- In practice, V is typically chosen to be either **5** or **10**.
- The more folds, the lower the standard error, BUT the more expensive the computation.
- How good or not the prediction accuracy is depends entirely on the downstream application.

**How do you decide which parameter value K is the Best?**

1. We get roughly the **optimal accuracy**, and changing the value to a nearby K does not **decrease** the accuracy **too much**, making our choice reliable in the presence of uncertainty.
2. The cost of training the model is not prohibitive (too large a K is EXPENSIVE!)

**LO6: Describe underfitting and overfitting, and relate them to the number of neighbors in K-nearest neighbors classification**

**Underfitting**:

- As we increase K, more and more training observations get a "say" in what the class of a new observation is.
- This causes an "averaging effect", making the boundary between where the classifier would predict type 1 and type 2 smooth out and become simpler (too simple).

- **In general, if the model isn't influenced enough by the training data, it is said to underfit the data.**

**Overfitting**:

- As we decrease K, each individual data point has a stronger and stronger vote.
- This causes a more "jagged" boundary between type 1 and type 2, a **less simple model**, and the classifier becomes unreliable on new data.
- If we had a different training set, the predictions would be completely different.

- **In general, if the model is influenced TOO much by the training data, it is said to overfit the data.**

**Advantages and Disadvantages of KNN Classification**

**Advantages:**

- Simple and easy to understand
- No assumptions about what the data must look like
- Works easily for binary (two-class) and multi-class (> 2 classes) classification problems

**Disadvantages**

- As data gets bigger, KNN gets slower and slower
- Does not perform well with a large number of predictors
- Does not perform well when classes are imbalanced (significantly more observations of one class than the other)

**Predictor selection**

- **Adding irrelevant predictors** to KNN hurts accuracy by adding random noise to the distance calculations
- **Accuracy** declines as noise increases
- **Best subset selection** tries every predictor combination to find the best accuracy, BUT is too **slow** with many variables
- **Forward selection** adds predictors one at a time, choosing at each step the one that most improves cross-validation accuracy. It is faster, but can still overfit.

**Chapter 7.1: Regression (KNN regression)**

**LO 1: Explain the KNN regression algorithm and describe how it differs from KNN classification**

The regression method is very similar to classification: both need to split the data into training and testing sets, tune the model, and use cross-validation to choose K.

**The difference is that regression predicts *numerical* variables based on one or more predictor variables**

**When would regression be appropriate for making predictions?**

Answer: when we use past information to predict future observations whose values are *numerical* instead of *categorical*.

- The value we want to predict is called the "response variable", and in regression, response variables are *numerical*

**LO 2: In a data set with two or more variables, perform KNN regression**

**Question: Can we use a numerical predictor variable to predict a numerical response variable?**

**1. Create a scatterplot to explore the relationship between the two variables**: predictor on the x-axis, response variable on the y-axis.

**2. Split the data**

- Split the data into training and testing sets, and store the testing set in a "lock box". Only come back to it when we have chosen our final model.

**3. Create a model specification for K-NN regression**

***Note***: We use `set_mode("regression")`, which tells tidymodels to use different metrics for tuning and evaluation (e.g. **RMSPE** instead of accuracy).

**4. Create a recipe**

- For the preprocessing step, we may choose to center and scale our data. However, if there is only ONE predictor variable, we may choose to skip preprocessing, as it would not impact our prediction.

**5. Cross-validation to choose K**

**Because we are predicting *numerical values*, our predictions will almost never EXACTLY equal the true values, so accuracy does not apply here**

- We will use **RMSPE** (root mean squared prediction error), estimated by cross-validation within the training set.
- To compute RMSPE: take the difference between each predicted and true value, square it, average the squared differences, and THEN take the SQUARE ROOT.
  - **Large, positive** RMSPE means large mistakes
  - **Small, positive** RMSPE means small mistakes (predictions close to the true values)
  - **SO, we choose the K with the smallest RMSPE from cross-validation**
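
The RMSPE recipe above (difference, square, average, square root) can be sketched in a few lines of base R. The vectors `truth` and `predicted` are made-up values for illustration only, not data from the course.

```r
# Made-up true values and model predictions (illustrative only)
truth     <- c(10, 12, 15, 20)
predicted <- c(11, 12, 13, 24)

# 1. difference, 2. square, 3. average, 4. square root
squared_errors <- (predicted - truth)^2   # 1, 0, 4, 16
rmspe <- sqrt(mean(squared_errors))       # sqrt(5.25), about 2.29
rmspe
```

The same arithmetic gives RMSE when `truth` and `predicted` come from the training data instead of held-out data.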

**Important NOTES**:

- `RMSPE` is the root mean squared error computed on the testing/validation data, i.e. when predicting on unseen data. It measures how well our model predicts data it was NOT trained on.
  - this indicates how well our model generalizes to future data
- `RMSE` is the same calculation done on the training set. It measures how well our model predicts the data it was trained on.
  - this indicates how well our model fits our data

**6. Put recipe, model into a workflow()**

**7. Run our cross-validation**

- Here, we still use `collect_metrics()`, but we `filter(.metric == "rmse")`
- The `mean` column shows the cross-validation estimate of the `RMSPE` (the average across the folds)

**8. Lastly, we use `filter(mean == min(mean))` to find the best K, the one with the *minimum* RMSPE**

**9. Evaluate on the test set**

- To assess how well our model predicts, we compute its RMSPE on the test data

Standard procedure for evaluation:
1. make a new model specification using the best K
2. fit the model to the training data
3. predict the test data with this model and `bind_cols` the predictions to the test data
4. obtain the metrics with `metrics(truth =, estimate =)` and `filter(.metric == "rmse")`
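
Steps 2 to 9 can be strung together roughly as follows. This is a sketch, not the course's official solution: it assumes the `tidymodels` and `kknn` packages are installed, and the built-in `mtcars` data, the `mpg ~ wt` formula, the 5 folds and the grid `1:10` are all my own choices for illustration.

```r
library(tidyverse)
library(tidymodels)

set.seed(2024)

# Step 2: split the data (mtcars is only an illustrative data set)
car_split <- initial_split(mtcars, prop = 0.75)
car_train <- training(car_split)
car_test  <- testing(car_split)

# Step 3: model specification with K left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

# Step 4: recipe built from the training data only
knn_recipe <- recipe(mpg ~ wt, data = car_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# Steps 5 to 7: cross-validation over a grid of K values
car_folds <- vfold_cv(car_train, v = 5)
k_grid <- tibble(neighbors = 1:10)

cv_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = car_folds, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "rmse")

# Step 8: K with the smallest cross-validation RMSPE
best_k <- cv_results |>
  filter(mean == min(mean)) |>
  slice(1) |>
  pull(neighbors)

# Step 9: refit on the training data with the best K, then evaluate on the test set
best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("regression")

best_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(best_spec) |>
  fit(data = car_train)

test_rmspe <- best_fit |>
  predict(car_test) |>
  bind_cols(car_test) |>
  metrics(truth = mpg, estimate = .pred) |>
  filter(.metric == "rmse") |>
  pull(.estimate)
```

The `slice(1)` only guards against ties, in case two values of K share the minimum.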

**LO 3: Describe underfitting and overfitting, and relate it to the number of neighbors in K-NN regression**

- By setting the K **too small** or **too large**, we cause the RMSPE to increase.

**Overfitting**

- **K too small**
- The predictions follow the training data **too closely**; the model is influenced too much by the individual observations.
- If we changed the training set observations we would get ENTIRELY different predictions, so a model fitted to one training set would not predict new data well.

**Underfitting**

- **K too large**
- Prediction line is very smooth, almost flat.
- Our predicted values depend on too many neighbouring observations
- Similar, inaccurate predictions for very different inputs.
- The prediction line does not follow the training set very closely; changing any observations in the training set would hardly affect the prediction line at all.
- **Model is too simple to capture the patterns of the data**
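
The effect of K can be seen with a small hand-rolled K-NN regression in base R. This is a sketch under assumptions: `knn_predict()` is a hypothetical helper written for this example (not a tidymodels function), and the data are simulated.

```r
# Hypothetical helper: K-NN regression with one numeric predictor
knn_predict <- function(x_train, y_train, x_new, k) {
  sapply(x_new, function(x0) {
    nearest <- order(abs(x_train - x0))[1:k]  # indices of the k closest points
    mean(y_train[nearest])                    # average their responses
  })
}

rmse <- function(truth, estimate) sqrt(mean((estimate - truth)^2))

set.seed(1)
x <- seq(0, 10, length.out = 60)      # simulated predictor
y <- sin(x) + rnorm(60, sd = 0.3)     # simulated noisy, non-linear response

train_id <- sample(60, 40)
x_train <- x[train_id];  y_train <- y[train_id]
x_test  <- x[-train_id]; y_test  <- y[-train_id]

# Compare a very small K, a moderate K, and K = all 40 training points
for (k in c(1, 5, 40)) {
  train_rmse <- rmse(y_train, knn_predict(x_train, y_train, x_train, k))
  test_rmspe <- rmse(y_test,  knn_predict(x_train, y_train, x_test,  k))
  cat("K =", k, "| training RMSE =", round(train_rmse, 3),
      "| test RMSPE =", round(test_rmspe, 3), "\n")
}
```

With K = 1 the training RMSE is exactly 0, because every training point is its own nearest neighbour (overfitting). With K = 40 every prediction is just the mean of all training responses, which is a flat line (underfitting).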

**LO 4: Strengths and Limitations of KNN regression**

**Strengths**: 

- Simple, and does not require many assumptions about what the data should look like for the model to work
- Works well with **non-linear** relationships

**Weaknesses**:

- Becomes very slow as data gets larger
- Does not perform well with **large numbers of predictors**
- May not predict well for **data beyond** the range in the training data

**Chapter 7.2: Linear Regression**

**LO 1: Use R to fit simple and multivariable linear regression models on training data**

**Simple linear regression**: involves only one predictor and one response variable. It predicts a numerical response variable for new observations based on the data already available. 

**Difference between linear and KNN**:

- KNN makes predictions by looking at the K nearest neighbors and averaging over their values
- Linear regression creates a line of **best fit** and looks up predictions from that line

**How linear regression works**

- a straight line is fully described by its **slope** coefficient and its **y-intercept**, so finding the best-fit line just means finding the best values of these two numbers
  - Then, we can use the line to make predictions for any predictor value we are given.

**How to choose the BEST best-fit**

- simple linear regression chooses the line of best fit by choosing the line that *minimizes* the **average squared vertical distance** (equivalent to minimizing the **RMSE**)
  - this is the average squared vertical distance between the line and all of the actual data points in the training data.
- **Then**, to assess the **predictive accuracy** of the lm model, we use **RMSPE**
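
To see the "minimizes the RMSE" idea in action, here is a minimal base R sketch using `lm()`. The data frame `train` and its columns `size` and `price` are made up for illustration.

```r
# Made-up training data (illustrative only)
train <- data.frame(
  size  = c(1, 2, 3, 4, 5),
  price = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

fit <- lm(price ~ size, data = train)
coef(fit)   # the y-intercept and the slope of the best-fit line

# Training RMSE of the best-fit line
rmse_fit <- sqrt(mean(residuals(fit)^2))

# Any other line does worse on the training data, e.g. a slightly steeper one
other_pred <- coef(fit)[1] + (coef(fit)[2] + 0.1) * train$size
rmse_other <- sqrt(mean((train$price - other_pred)^2))
rmse_fit < rmse_other   # TRUE

# Predictions are read off the line
predict(fit, newdata = data.frame(size = 6))
```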

**Additionally, we do not need to standardize our predictors**

**Performing linear regression**

**1. Split the data into testing and training sets**

**2. Create the model specification for linear regression**

- use `linear_reg()` instead of `nearest_neighbor()`
- `set_engine("lm")`
- `set_mode("regression")`

**3. Create the recipe using training data**

**4. Use workflow() to fit/build the model on the training data**

**5. Use the model fitted from training data to predict the test data, evaluate the accuracy**

- after `predict` and `bind_cols`, use `metrics()`
- Then, find the RMSPE with `filter(.metric == "rmse") |> select(.estimate) |> pull()`
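
The five steps above might look like this in code. It is a sketch that assumes `tidymodels` is installed; the built-in `mtcars` data and the `mpg ~ wt` formula are my own illustrative choices.

```r
library(tidyverse)
library(tidymodels)

set.seed(2024)

# 1. Split the data (mtcars is only an illustrative data set)
car_split <- initial_split(mtcars, prop = 0.75)
car_train <- training(car_split)
car_test  <- testing(car_split)

# 2. Model specification for linear regression
lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

# 3. Recipe using the training data (no standardization steps)
lm_recipe <- recipe(mpg ~ wt, data = car_train)

# 4. Fit the workflow on the training data
lm_fit <- workflow() |>
  add_recipe(lm_recipe) |>
  add_model(lm_spec) |>
  fit(data = car_train)

# 5. Predict the test data and compute the RMSPE
lm_rmspe <- lm_fit |>
  predict(car_test) |>
  bind_cols(car_test) |>
  metrics(truth = mpg, estimate = .pred) |>
  filter(.metric == "rmse") |>
  select(.estimate) |>
  pull()
```

For a multivariable model, only the recipe formula changes, for example `mpg ~ wt + hp`.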

**Side Note: To visualize a simple linear regression model, we can plot the predictions on top of the actual values: make predictions at the minimum and maximum predictor values, connect them with a straight line, and superimpose that line on the original scatterplot**

**LO 3: Compare and contrast predictions obtained from KNN regression to those obtained using linear regression** 

**Advantages of KNN regression**

- Useful when relationship between predictor and response is **non-linear**.
  - In this case **linear regression** would underfit (the predicted values would not match the actual values very well). It would have a **high RMSE** (poor fit) and a **high RMSPE** when assessing prediction quality on the test data.

**Advantages of Linear regression**

1. KNN regression does NOT predict well **beyond the range of predictors in the training data**; linear regression can, by extending the fitted line (provided the linear trend continues).
2. KNN regression gets significantly slower as the training data set gets bigger; linear regression does not have that problem.
3. In linear regression, standardization does not affect the fit, but it DOES affect the coefficients of the equation.
4. Easy to interpret: only needs the **intercept** and **slope**
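
Point 1 can be checked with a tiny base R example. The data are made up and perfectly linear, and the K-NN prediction is computed by hand rather than with tidymodels.

```r
# Made-up, perfectly linear training data
x <- 1:10
y <- 2 * x

x_new <- 20   # far beyond the largest training predictor value (10)

# K-NN regression with K = 3: average the responses of the 3 nearest points
nearest  <- order(abs(x - x_new))[1:3]   # the points with x = 10, 9, 8
knn_pred <- mean(y[nearest])             # 18

# Linear regression: extend the fitted line
lm_fit  <- lm(y ~ x)
lm_pred <- unname(predict(lm_fit, newdata = data.frame(x = x_new)))  # 40

c(knn = knn_pred, lm = lm_pred)
```

K-NN can never predict above the largest responses it has seen, so it stays at 18, while the line continues the trend to 40.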

**LO 4: Describe how linear regression is affected by outliers and multicollinearity**

**Outliers**: can distort the regression line by pulling it towards **extreme** values, leading to **poor fit** and **misleading predictions**
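
A small base R sketch of the outlier effect, using made-up data: one extreme point is enough to change the slope substantially.

```r
x <- 1:10
y_clean <- 2 * x          # made-up data lying exactly on a line with slope 2

y_outlier <- y_clean
y_outlier[10] <- 100      # replace the last response (20) with an extreme value

slope_clean   <- unname(coef(lm(y_clean ~ x))[2])     # 2
slope_outlier <- unname(coef(lm(y_outlier ~ x))[2])   # about 6.36

c(clean = slope_clean, with_outlier = slope_outlier)
```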

**Multicollinearity**: causes the regression coefficients to become highly sensitive to specific values in the data, making the model's estimates unreliable. 

**Chapter 8: Clustering**

**Important packages for clustering**

- `broom`
  - provides us with `augment()` function
  - provides us with `glance()` function

**LO 1: Describe when clustering is an appropriate technique to use, and what insight it might extract from the data**

**Clustering** is a data analysis technique that involves **separating** a data set into **subgroups** of related data.

**Why perform clustering?**
- We can then use the **subgroups** to generate **new questions** about the data and follow up with predictive modeling.
- It can improve predictive analysis
- Here, **clustering is mostly used for exploratory analysis, uncovering patterns in the data**

**Clustering examples**
1. Separate a data set of documents into subgroups that correspond to topics.
2. Separate a data set of human genetic information into groups that correspond to ancestral subpopulations
3. Separate a data set of online customers into groups that correspond to purchasing behaviours

**Important**: Clustering is an ***unsupervised*** task: 
- We are trying to understand and examine the structure of data without any **response** variable labels or values to help us.

**Because there is no response variable, it is NOT as EASY to evaluate the "*quality*" of a clustering**

- With classification, we can use a test data set to assess prediction performance. In clustering, there is no single good choice for evaluation.
- We will use visualization here to determine the quality of clustering

**LO 2: Explain the K-means clustering algorithm**

**1. load `library(tidyverse)` and `set.seed()`**

- Setting a random seed is **important** because the K-means clustering algorithm uses randomness when choosing a starting position for each cluster

**Measuring cluster quality**

- The **K-means algorithm** is a procedure that groups data into K clusters
- It starts with an **initial clustering** of the data, and then iteratively improves it by **making adjustments** to the **assignment** of data to clusters until it cannot improve any further.
- In **K-means clustering**, we measure the **quality of a cluster** by its within-cluster sum of squared distances (WSSD)

Computing the within-cluster sum of squared distances (WSSD): 

1. Find the cluster center by computing the mean of each variable over the data points in the cluster. **The cluster center is like the "middle point" of the cluster, based on the average of each variable**

- If you have a cluster of **4 data points** and using **2 variables**
  - add up all the `x` variable values in the cluster and divide by 4 - gives the **average** `x`
  - add up all the `y` values in the cluster and divide by 4 - gives the **average** `y`
  - The cluster center will take on the **average x** and **average y**
2. Next, add up the **squared distance** between **each point** in the cluster **and** the **cluster centre**, using **Euclidean (straight-line) distance**
  - The larger the sum of squared distances, the more spread out the cluster.
  - **BUT**, a cluster where points are **very close** to the center might still have a large sum of squared distances if **there are many data points in the cluster**

3. Then we add up the WSSD for each cluster to get the **total WSSD**
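
The three steps above can be worked through in base R. The two small clusters are made-up numbers, and `wssd()` is a hypothetical helper written for this example.

```r
# Hypothetical helper: WSSD of one cluster (rows = points, columns = variables)
wssd <- function(cluster) {
  center <- colMeans(cluster)                       # step 1: average of each variable
  diffs  <- sweep(as.matrix(cluster), 2, center)    # each point minus the center
  sum(rowSums(diffs^2))                             # step 2: add up squared distances
}

# Made-up clusters: 4 points and 2 points, each with 2 variables
cluster_1 <- data.frame(x = c(1, 1, 3, 3), y = c(1, 3, 1, 3))
cluster_2 <- data.frame(x = c(10, 12),     y = c(5, 5))

colMeans(cluster_1)   # center of cluster 1: x = 2, y = 2
wssd(cluster_1)       # 8 (each of the 4 points is at squared distance 2)
wssd(cluster_2)       # 2 (each of the 2 points is at squared distance 1)

# Step 3: total WSSD is the sum over clusters
total_wssd <- wssd(cluster_1) + wssd(cluster_2)   # 10
total_wssd
```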

**NOTE**: Since K-means uses straight-line distances to measure the quality of a clustering, it only works for quantitative variables

**LO 3: The clustering algorithm**

**1. We begin the K-means algorithm by *picking the K***

- randomly assign a roughly equal number of observations to each of the K clusters

**2. Two major steps to minimize the sum of WSSDs**

1. **Center update**: Compute the center of each cluster
2. **Label update**: Reassign each data point to the cluster with the nearest center.

*These two steps are repeated until the cluster assignments no longer change*
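
The center update and label update loop can be written out by hand in base R. This is a teaching sketch, not how `kmeans()` is implemented internally: the six points are made up, and the starting labels are fixed rather than random so the result is reproducible.

```r
# Six made-up points forming two obvious groups
points <- rbind(c(1, 1), c(1, 2), c(2, 1),
                c(8, 8), c(8, 9), c(9, 8))
k <- 2

# Fixed (deliberately poor) starting labels; normally these are assigned at random
labels <- rep(1:k, length.out = nrow(points))   # 1 2 1 2 1 2

repeat {
  # Center update: mean of each variable within each cluster (one row per cluster)
  centers <- t(sapply(1:k, function(j) {
    colMeans(points[labels == j, , drop = FALSE])
  }))

  # Label update: squared distance from every point to every center
  sq_dist <- sapply(1:k, function(j) {
    rowSums(sweep(points, 2, centers[j, ])^2)
  })
  new_labels <- max.col(-sq_dist, ties.method = "first")  # nearest center per point

  # Stop when the assignments no longer change
  if (all(new_labels == labels)) break
  labels <- new_labels
}

labels    # 1 1 1 2 2 2
centers   # row 1: (4/3, 4/3), row 2: (25/3, 25/3)
```

This sketch does not handle a cluster becoming empty, which a real implementation would need to guard against.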

**3. Random starts**

- Unlike regression and classification, **K-means** can get "stuck" in a bad solution
- **We can get UNLUCKY random initialization**
  - To solve this problem: We should randomly re-initialize the labels a few times, run K-means for each initialization, and pick the clustering that has the **lowest final total WSSD**
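
The random restart idea can be sketched with base R's `kmeans()` from the `stats` package. The built-in `faithful` data set, K = 4 and the 10 restarts are illustrative choices of mine.

```r
# Standardize the illustrative data set
scaled <- scale(faithful)

set.seed(42)

# Run K-means 10 times, each from a single random initialization
runs <- lapply(1:10, function(i) kmeans(scaled, centers = 4, nstart = 1))

# Final total WSSD of each run
wssd_runs <- sapply(runs, function(fit) fit$tot.withinss)
wssd_runs

# Keep the clustering with the lowest final total WSSD
best_run <- runs[[which.min(wssd_runs)]]
best_run$tot.withinss
```

If some runs got "stuck", their total WSSD will be visibly larger than the best one.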

**4. In order to cluster data using K-means, we also have to pick the number of clusters K**

- No response variable, **cannot perform cross-validation**
- If **K too small**, the clustering merges two or more separate groups of data, giving a **large total WSSD**
- If **K too big**, the clustering divides subgroups into multiple even smaller groups. This does **decrease the total WSSD**, but by **ONLY an insignificant amount**
- For the best K, we want the "elbow" point, where the total WSSD starts to level off
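
A quick way to see the elbow without tidyclust is to compute the total WSSD for several K values using base R's `kmeans()`. The built-in `faithful` data set and the range `1:6` are illustrative assumptions.

```r
scaled <- scale(faithful)   # standardize both variables

set.seed(1)
ks <- 1:6
total_wssd <- sapply(ks, function(k) {
  kmeans(scaled, centers = k, nstart = 10)$tot.withinss
})

# With K = 1 the total WSSD is the total sum of squares of the scaled data:
# (272 - 1) * 2 = 542
data.frame(K = ks, total_WSSD = total_wssd)

# Elbow plot: look for the K where the curve starts to level off
plot(ks, total_wssd, type = "b",
     xlab = "Number of clusters (K)", ylab = "Total WSSD")
```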

**Common functions used**

- `augment(kmeans_fit_object, original_data)`
  - takes in the model and the original data frame, and returns a data frame with the data and the cluster assignments for each point (kind of like bind_cols)
  - this function helps us to plot and identify different clusters

- `unnest(glanced)`
  - unpacks a column of nested data frames into regular columns
  - this is used when we have created a data frame containing the clustering statistics for each K-means object and want to "unpack" these statistics
  - Each value of the column is itself a list of statistics, so these lists are nested inside the elements of the column.

**LO 4: Performing K-means**

**NOTE**: K-means clustering uses straight-line distance to decide which cluster a point belongs to, so we **must scale the variables**

**(1). Create a recipe**

- **make sure to scale all variables**: start from `recipe(~ ., data = )` and add `step_scale(all_predictors())` and `step_center(all_predictors())`

**(2). Create the `k_means` model specification**

- We will use `k_means(num_clusters = )` function for this clustering model
- `set_engine("stats")`

**(3) create the workflow()**

- combine the recipe and model specification in a workflow and use the `fit` function

**NOTE**: K-means uses random initialization of the assignments, so we should call `set.seed` at the beginning to make the clustering reproducible.

**(4) Visualize the clusters with *coloured scatter plot***

- We first need to augment the original data frame with the cluster assignments
- use `augment(kmeans_fit, original_data)`

*The `augment()` function is from `tidyclust`*

- After we have our augmented data frame, we can plot the *unstandardized* data, and colour the clusters
- `colour = .pred_cluster`
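
Steps (1) to (4) could look roughly like this. It is a sketch that assumes `tidymodels` and `tidyclust` are installed; the built-in `faithful` data set and `num_clusters = 2` are my own illustrative choices.

```r
library(tidyverse)
library(tidymodels)
library(tidyclust)

set.seed(1)   # K-means starts from a random initialization

# (1) Recipe: no response variable, standardize every column
kmeans_recipe <- recipe(~ ., data = faithful) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# (2) Model specification
kmeans_spec <- k_means(num_clusters = 2) |>
  set_engine("stats")

# (3) Workflow and fit
kmeans_fit <- workflow() |>
  add_recipe(kmeans_recipe) |>
  add_model(kmeans_spec) |>
  fit(data = faithful)

# (4) Add the cluster assignments to the original (unstandardized) data and plot
clustered <- kmeans_fit |>
  augment(faithful)

cluster_plot <- ggplot(clustered,
                       aes(x = eruptions, y = waiting, colour = .pred_cluster)) +
  geom_point() +
  labs(x = "Eruption time (min)", y = "Waiting time (min)", colour = "Cluster")
cluster_plot
```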

**(5) Select the best K (finding the "elbow point")**

- We plot the total WSSD of each K with the number of clusters (K).
- **Side note**: We can view the total WSSD `tot.withinss` by using the `glance(fitted_model)` function

1. To calculate the total WSSD for a **variety of K** values, we create a data frame with a column named `num_clusters`: `tibble(num_clusters = 1:10)`

2. Then we create the model specification again: specifying we want to **tune** the `num_clusters`
- `k_means(num_clusters = tune())`

3. We combine the recipe and the new model specification in a workflow()
- We then pipe the workflow() into:
- `tune_cluster(resamples = apparent(original_data), grid = kvaltibble)`
- This runs K-means for each of the different settings of `num_clusters`
- The `grid` argument controls which Ks we want to try
- `resamples = apparent(original_data)` tells K-means to run on the **whole** data set for each value of `num_clusters`
- lastly, use `collect_metrics()`

4. To get the total WSSD results

- total WSSD corresponds to the `mean` column in the rows where `.metric == "sse_within_total"`
- We should also rename the `mean` column to `total_WSSD` with
- `mutate(total_WSSD = mean)`
- then we select `num_clusters, total_WSSD`

5. Lastly, now that we have the total_WSSD for each value of num_clusters, we can make a `geom_point` + `geom_line` plot

- find the "elbow" for the K value to use
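
Steps 1 to 5 for choosing K might be put together as follows. Again this is a sketch: it assumes `tidymodels` and `tidyclust` are installed, and the built-in `faithful` data set and the grid `1:9` are illustrative choices.

```r
library(tidyverse)
library(tidymodels)
library(tidyclust)

set.seed(1)

kmeans_recipe <- recipe(~ ., data = faithful) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# 1. The K values to try
k_values <- tibble(num_clusters = 1:9)

# 2. Model specification with num_clusters left to be tuned
kmeans_tune_spec <- k_means(num_clusters = tune()) |>
  set_engine("stats")

# 3. Run K-means on the whole data set for every K in the grid
kmeans_results <- workflow() |>
  add_recipe(kmeans_recipe) |>
  add_model(kmeans_tune_spec) |>
  tune_cluster(resamples = apparent(faithful), grid = k_values) |>
  collect_metrics()

# 4. Keep only the total WSSD for each K
elbow_data <- kmeans_results |>
  filter(.metric == "sse_within_total") |>
  mutate(total_WSSD = mean) |>
  select(num_clusters, total_WSSD)

# 5. Elbow plot
elbow_plot <- ggplot(elbow_data, aes(x = num_clusters, y = total_WSSD)) +
  geom_point() +
  geom_line() +
  labs(x = "Number of clusters (K)", y = "Total WSSD")
elbow_plot
```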

**How do we prevent having UNLUCKY initialization?**

- We use the argument `nstart = 10` in the model specification:
- `set_engine("stats", nstart = 10)`
- Now when we run the new model specification, K-means clustering will perform 10 random starts for each K value.
- For each K, K-means keeps the best clustering of the 10 random starts (lowest total WSSD), and that is what `collect_metrics()` reports

**A larger `nstart` gives a more reliable result but takes much longer to run.** A balance is needed.

**LO 5: Disadvantages of multi-dimensional data**

- makes it very difficult to interpret the different properties of each cluster
- We want to visualize the clusters, but we **cannot** visualize them in a higher-dimensional space, so it becomes **difficult** to **assess the quality of our clustering**