# DSCI 100 Review

## Reading 

### Relative and Absolute Paths

Understand the difference between relative and absolute paths. 

Absolute path Ex. 
`"/home/dsci-100/worksheet_02/data/happiness_report.csv"`

Relative path Ex.
`"data/happiness_report.csv"`

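Relative paths are generally preferred over absolute paths. A relative path is resolved from the current working directory, so it keeps working when the project folder is moved or opened on a different computer. An absolute path only works where that exact directory structure exists.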
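As a rough illustration of the difference, the sketch below builds a throwaway project folder in a temporary directory and reads the same file through both kinds of path. The folder name `worksheet_demo` and the file contents are made up for the example.

```r
# Build a throwaway project folder so the example runs anywhere.
project_dir <- file.path(tempdir(), "worksheet_demo")   # hypothetical folder name
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("country,score", "Canada,7.3"),             # made-up contents
           file.path(project_dir, "data", "happiness_report.csv"))

old_wd <- setwd(project_dir)   # pretend this is the worksheet folder

relative_path <- "data/happiness_report.csv"
absolute_path <- normalizePath(relative_path)   # full path on this machine only

# Both paths point at the same file, but only the relative one would
# still be valid after copying the project folder to another computer.
from_relative <- read.csv(relative_path)
from_absolute <- read.csv(absolute_path)

setwd(old_wd)   # restore the original working directory
```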

### Differing Read functions

`read_*()` <- family of functions that read a file into a data frame; they differ in the delimiter they expect

`read_csv()` <- commas (",")

`read_csv2()` <- semicolons (";")

`read_tsv()` <- tabs ("\t")

`read_excel()` <- Excel spreadsheets (".xlsx"), from the `readxl` package

`read_delim()` <- any delimiter, given through `delim`

Ex. `read_delim("path", delim = "\t", col_names = TRUE/FALSE, skip = ...)`

Alternatively, you can read directly from a URL by passing it in place of the path.

`download.file(URL, path)` downloads the file to `path` first, which is useful when reading directly from the URL doesn't work
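A minimal sketch of the read functions above, assuming the `readr` package is installed. The two files are written to temporary locations with made-up contents, and the first line of the tab-separated file is a note that `skip = 1` drops.

```r
library(readr)

# Write the same small table with two different delimiters (made-up data).
csv_file <- tempfile(fileext = ".csv")
tsv_file <- tempfile(fileext = ".tsv")
writeLines(c("name,score", "a,1", "b,2"), csv_file)
writeLines(c("exported by hand", "name\tscore", "a\t1", "b\t2"), tsv_file)

# read_csv() assumes commas.
from_csv <- read_csv(csv_file, show_col_types = FALSE)

# read_delim() takes the delimiter explicitly; skip = 1 ignores the note
# on the first line, and col_names = TRUE uses the next line as the header.
from_delim <- read_delim(tsv_file, delim = "\t", skip = 1,
                         col_names = TRUE, show_col_types = FALSE)
```

Here `read_tsv(tsv_file, skip = 1)` would be an equivalent shortcut for the second call.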

### Reading from a database 

To read from a database, we must first load the library

`library(DBI)`


Connect to the database using `data <- dbConnect(RSQLite::SQLite(), "path.db")`

Use `dbListTables(data)` to list all tables in the database, where `data` is the connection

Use `tbl(data, table_name)` (from `dplyr`, with `dbplyr` installed) to get a reference to a table in the database

To convert the reference into a data frame, use `collect()`. After that, all the usual data frame operations are available.

It is better to use `filter()` or `select()` before collecting, so that only the data you need is loaded into R
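The workflow above can be sketched as follows, assuming `DBI`, `RSQLite`, `dplyr`, and `dbplyr` are installed. Instead of a `.db` file, the sketch uses an in-memory SQLite database with a made-up `scores` table so that it runs on its own.

```r
library(DBI)
library(dplyr)   # tbl(), filter(), select(), collect(); needs dbplyr installed

# ":memory:" creates a temporary SQLite database instead of opening a .db file.
conn <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(conn, "scores",                      # hypothetical table
             data.frame(name = c("a", "b", "c"), score = c(10, 25, 40)))

table_names <- dbListTables(conn)   # names of all tables in the database

scores_ref <- tbl(conn, "scores")   # a reference; no rows are loaded into R yet

# Filter and select inside the database, then collect only what is needed.
high_scores <- scores_ref |>
  filter(score > 20) |>
  select(name) |>
  collect()

dbDisconnect(conn)
```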

For a PostgreSQL database, we need to use `library(RPostgres)`

We need a host, port, user and password to connect to this database

Ex. `dbConnect(RPostgres::Postgres(), dbname = ... , host = ... , port = ... , user = ... , password = ...)`

The rest of the process is the same as for SQLite.

Benefits of Databases:
- They enable storing large data sets across multiple computers with backups
- They provide mechanisms for ensuring data integrity and validating input
- They provide security and data access control
- They allow multiple users to access data simultaneously and remotely without conflicts and errors

`write_csv(df_name, path)` writes a data frame to a .csv file at `path`
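A short round-trip sketch, assuming `readr` is installed: a made-up data frame is written to a temporary file and read back in.

```r
library(readr)

scores <- data.frame(name = c("a", "b"), score = c(1.5, 2.5))   # made-up data
out_path <- tempfile(fileext = ".csv")

write_csv(scores, out_path)                                # data frame -> file
read_back <- read_csv(out_path, show_col_types = FALSE)    # file -> data frame
```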

## Wrangling 

### What is Tidy Data

Variable - a characteristic, number, or quantity that can be measured.

Observation - all of the measurements for a given entity.

Value - a single measurement of a single variable for a given entity.

A tidy data frame meets the following requirements:
- each column has **one** variable
- each row has **one** observation
- each cell has **one** value 

A vector must consist of elements of one type, while a list can contain elements of different types. The columns of a data frame are typically vectors, but they can be lists as well.

The columns of a data frame **must** all have the same length
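These rules can be checked in base R. The sketch below uses made-up values; the `tryCatch()` wrapper is only there so the failing call does not stop the script.

```r
# A vector holds one type: mixing types coerces everything to a common type.
mixed_vector <- c(1, "two", TRUE)
class(mixed_vector)            # "character"

# A list keeps each element's own type.
mixed_list <- list(1, "two", TRUE)

# Columns of equal length make a valid data frame.
ok <- data.frame(id = 1:3, label = c("a", "b", "c"))

# Columns of length 3 and 2 cannot be combined, so data.frame() raises an error.
unequal <- tryCatch(
  data.frame(id = 1:3, label = c("a", "b")),
  error = function(e) "error"
)
```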


### Tidy Functions


`pivot_longer` combines several columns into 2 columns: one holding the old column names and one holding their values<br>
`pivot_longer(cols = ..., names_to = ..., values_to = ...)` 

`pivot_wider` spreads 2 columns (names and values) into several columns<br>
`pivot_wider(names_from = ..., values_from = ...)`

Use `separate` if there is a column with multiple values in each cell<br>
`separate(col = ..., into = c("col_name_1", "col_name_2"), sep = ...)`
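The three functions above might be used as in this sketch, assuming `tidyr` is installed. All data and column names (`region`, `year`, `count`, `span`, `low`, `high`) are made up for illustration.

```r
library(tidyr)

# One column per year: the year is a variable hiding in the column names.
wide <- data.frame(region = c("north", "south"),
                   `2020` = c(10, 20), `2021` = c(15, 25),
                   check.names = FALSE)

# Gather the year columns into a name column and a value column.
long <- pivot_longer(wide, cols = c("2020", "2021"),
                     names_to = "year", values_to = "count")

# Spread them back out to one column per year.
wide_again <- pivot_wider(long, names_from = year, values_from = count)

# Each cell of span holds two values; split them at the dash.
ranges <- data.frame(id = 1:2, span = c("1-5", "6-9"))
split_ranges <- separate(ranges, col = span, into = c("low", "high"), sep = "-")
```

Note that `separate()` leaves the new columns as character; they would need converting before doing arithmetic on them.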


### Wrangling Functions

- `select()` 
- `filter()`
- `mutate()`
- `group_by()`
- `summarize()`
- `across()`
- `map()`

| Function     | Description                                                         |
| ------------ | ------------------------------------------------------------------- |
| across       | allows you to apply function(s) to multiple columns                 |
| filter       | subsets rows of a data frame                                        |
| group_by     | allows you to apply function(s) to groups of rows                   |
| mutate       | adds or modifies columns in a data frame                            |
| map          | general iteration function                                          |
| pivot_longer | generally makes the data frame longer and narrower                  |
| pivot_wider  | generally makes a data frame wider and decreases the number of rows |
| rowwise      | applies functions across columns within one row                     |
| separate     | splits up a character column into multiple columns                  |
| select       | subsets columns of a data frame                                     |
| summarize    | calculates summaries of inputs                                      |
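A small sketch of several of these functions working together, assuming `dplyr` and `purrr` are installed. The `scores` table and its columns are made up for illustration.

```r
library(dplyr)
library(purrr)

scores <- tibble(group = c("a", "a", "b", "b"),   # made-up data
                 math = c(60, 80, 70, 90),
                 art = c(50, 70, 90, 100))

# filter() keeps rows, mutate() adds a column, select() keeps columns.
result <- scores |>
  filter(math >= 70) |>
  mutate(total = math + art) |>
  select(group, total)

# group_by() + summarize() + across(): mean of each subject within each group.
group_means <- scores |>
  group_by(group) |>
  summarize(across(math:art, mean))

# map() applies a function to each column and returns a list.
col_max <- map(select(scores, math, art), max)
```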

## Visualization 

### Do's and Don'ts

- Convey the message
    - Make sure the visualization answers a specific question
    - Use legends and titles
    - The graph should be readable without any surrounding context
    - Use color schemes that are readable for people with color blindness

- Minimize noise
    - Use colors sparingly
    - Watch out for overplotting (too many overlapping points hide the pattern)
    - Only make the plot as big as it needs to be

### Various sub-functions in ggplot

- `aes(x = ..., y = ...)`
    - `color = col_name`
    - `fill = col_name` <- fill color for bar or histogram plots (usually a categorical variable)
- `geom_...()`
    - `alpha = ...` <- makes the data translucent (0 = fully transparent, 1 = opaque)
    - `binwidth = ...` <- width of each bin, only for histograms
    - `bins = ...` <- number of bins, only for histograms
- `xlab()`
- `ylab()` 
- `labs()`
    - `color = ...`
- `theme()`
    - `text = element_text(size = ...)`
    - `legend.position = "top"`
    - `legend.direction = "vertical"`

- `xlim(c(lower, upper))`
- `ylim(c(lower, upper))`
- `scale_x_log10()`
    - `labels = label_comma()` <- from the `scales` package
- `scale_y_log10()`
    - `labels = label_comma()` <- from the `scales` package
- `scale_color_brewer(palette = ...)`
- `facet_grid(rows = vars(col_name))`
- `ggtitle()`
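Several of the pieces listed above can be combined into one plot, as in this sketch. It assumes `ggplot2` is installed and uses the built-in `mtcars` data set, which is not part of these notes; the axis labels, title, and the "Dark2" palette are illustrative choices.

```r
library(ggplot2)

plot_demo <- ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
  geom_point(alpha = 0.6) +                      # translucent points
  labs(x = "Horsepower", y = "Miles per gallon", color = "Cylinders") +
  scale_x_log10() +                              # log scale on the x axis
  scale_color_brewer(palette = "Dark2") +        # a ColorBrewer palette
  facet_grid(rows = vars(am)) +                  # one panel row per value of am
  ggtitle("Fuel efficiency versus horsepower") +
  theme(text = element_text(size = 12), legend.position = "top")
```

Printing `plot_demo` draws the figure; each `+` adds one layer or setting to the plot object.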



### geom_bar

By default, the height of each bar is the number of times a value appears in the data frame (`stat = "count"`).
Instead, we can make the height represent a value stored in the data frame with `stat = "identity"`.
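The two behaviours can be compared side by side. In this sketch (assuming `ggplot2` is installed, with made-up `pets` data), the first plot counts rows and the second uses precomputed totals; both end up with the same bar heights.

```r
library(ggplot2)

# Raw data: one row per pet, so the bar height is the number of rows (count).
pets <- data.frame(pet = c("cat", "dog", "cat", "cat"))
count_plot <- ggplot(pets, aes(x = pet)) +
  geom_bar()

# Summarized data: the bar height is taken from the n column (identity).
totals <- data.frame(pet = c("cat", "dog"), n = c(3, 1))
identity_plot <- ggplot(totals, aes(x = pet, y = n)) +
  geom_bar(stat = "identity")
```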

### Answering the Question

Direction - positive or negative relationship between x and y. 

Strength - the relationship is stronger the more closely the points follow a clear pattern such as a line, and weaker if they look like a cloud.

Shape - linear if a straight line describes the trend in the data points; otherwise non-linear.


### Saving a Visualization

An image can be stored either as a raster image or as a vector image. A raster image (e.g. `.png`, `.jpg`) is stored as a grid of pixels; it usually loads quickly but becomes blurry when zoomed in. A vector image (e.g. `.svg`) is stored as a set of shapes that the computer redraws each time; this often takes longer to load, but lets you zoom while maintaining clarity.
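One possible way to save a plot in both kinds of format is `ggsave()` from `ggplot2`, which these notes do not mention; it picks the format from the file extension. The sketch uses the built-in `mtcars` data and writes to temporary files. PDF stands in for the vector format here, since saving `.svg` needs the additional `svglite` package.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point()

raster_path <- tempfile(fileext = ".png")   # raster: a grid of pixels
vector_path <- tempfile(fileext = ".pdf")   # vector: drawing instructions

# Width and height are in inches by default.
ggsave(raster_path, plot = p, width = 4, height = 3)
ggsave(vector_path, plot = p, width = 4, height = 3)
```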

## Classification

## Regression

## Clustering 

## Inference 