# Lecture 10: If Statements and Functions


## If Statements

#### Definition of if Statement
An `if` statement adds conditional logic to your code:
* it runs lines of code only when a condition is `TRUE`
* it skips those lines when the condition is `FALSE`

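Purpose of `if` statements:
* can be used to automate data preprocessing and data analysis
* can also be used to help with debugging

Limitation: the condition must be a single `TRUE` or `FALSE`. An `if` statement cannot process a logical vector of length > 1 (since `R` 4.2.0 this stops with an error).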

#### Syntax

```
if (Boolean Expression) {
  if statement body
}else if(Boolean Expression){
   else if statement body
}else{
   else statement body 
}
```

* Boolean expression: placed inside the parentheses; it must evaluate to `TRUE` or `FALSE`
* Statement body
  * If the Boolean expression is `TRUE`, the code in the `if` statement body is executed
  * If the Boolean expression is `FALSE`, the code in the `if` statement body is NOT executed, and `R` moves on to the next `else if` or `else` branch, if there is one
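
As a minimal sketch of the syntax above, the example below classifies a single value. The variable names and cut-offs (`temperature`, `status`, 36, 38) are made up for illustration.

```r
# Hypothetical example value; change it to see the other branches run
temperature <- 38.2

if (temperature >= 38) {
  status <- "fever"      # runs only when the first condition is TRUE
} else if (temperature >= 36) {
  status <- "normal"     # checked only when the first condition is FALSE
} else {
  status <- "low"        # runs when every condition above is FALSE
}

print(status)
```

Only one branch ever runs: `R` stops checking as soon as it finds a condition that is `TRUE`.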



#### ifelse vs else if

**else if**
* is used within `if` statements for conditional branching
* checks multiple conditions sequentially, one single `TRUE`/`FALSE` value at a time

**ifelse**
* is a function used for element-wise conditional evaluation of vectors
* it evaluates a condition for every element and returns one value where it is `TRUE` and another where it is `FALSE`
* syntax: `ifelse(condition, value_if_true, value_if_false)`
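
The difference can be sketched with a small, made-up vector of scores (the names `scores`, `result` and `grade` are illustrative only):

```r
scores <- c(45, 72, 88)

# ifelse() checks every element and returns a vector of the same length
result <- ifelse(scores >= 60, "pass", "fail")
print(result)

# if / else if works on ONE value at a time, here the first score
if (scores[1] >= 80) {
  grade <- "high"
} else if (scores[1] >= 60) {
  grade <- "pass"
} else {
  grade <- "fail"
}
print(grade)
```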


## Functions

#### Definition
Functions take arguments as input, perform a given task using these inputs, and return an object as output (a list can be used to return several objects at once).

**Why use functions?**<br>
* Avoid rewriting and rerunning the same lines of code over and over again, which is more efficient and less error-prone
* Allow us to use functions others have built without having to know how they work internally


#### Syntax
```
function_name <- function(arg_1, arg_2, ...) {
   Function body
   return(output)
}
```

* Function name: the name the function is stored under; the function itself is an `R` object of class `function`
* Arguments: placeholders for inputs. When you call the function, you pass a value to each argument
* Function body: the collection of commands performed by the function when it is called
* Return value: the output of the function, given by `return()`. If there is no `return()`, the function returns the value of the last expression evaluated in the body
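
A minimal sketch of these four parts, using a hypothetical function called `bmi()` that is not part of the lecture material:

```r
# bmi() is a hypothetical function name used only for illustration
bmi <- function(weight_kg, height_m) {
  # function body: compute body mass index from the two arguments
  value <- weight_kg / height_m^2
  return(value)
}

# the function is stored as an R object of class "function"
print(class(bmi))

# calling the function passes values to the arguments
print(bmi(weight_kg = 70, height_m = 1.75))
```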


#### Verbosity
Verbosity is the amount of information a function prints or logs while executing. It is typically controlled by a `verbose` argument, a logical (`TRUE`/`FALSE`) parameter that determines whether the function outputs detailed messages about its progress.<br>
* `FALSE`: no progress messages are printed
* `TRUE`: the function prints its progress messages
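
One possible way to wire up a `verbose` argument is sketched below. The function name `col_means()` and the message text are assumptions made for this example.

```r
# col_means() is a hypothetical helper: it returns the mean of each column
col_means <- function(df, verbose = FALSE) {
  out <- numeric(ncol(df))
  for (j in seq_along(df)) {
    if (verbose) {
      # progress message, shown only when verbose = TRUE
      message("Processing column ", j, " of ", ncol(df))
    }
    out[j] <- mean(df[[j]])
  }
  return(out)
}

example_df <- data.frame(a = 1:3, b = 4:6)
quiet <- col_means(example_df)                  # prints nothing
loud  <- col_means(example_df, verbose = TRUE)  # prints one message per column
```

The returned values are the same in both calls; only the amount of printed information changes.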


#### Scope
Scope refers to the context in which variables and functions are defined and accessed.
There are two main types:
* global scope, where variables are accessible throughout the script or session
* local scope, where variables exist only within the function that created them. Note that in `R`, blocks such as `if` statements and loops do not create their own scope; only functions do.

Functions first look for variables in their local environment before checking parent environments, so a local variable takes priority over a global variable with the same name.
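
The lookup order can be sketched as follows, assuming the code is run in a fresh session. The names `add_one()` and `local_total` are invented for this example.

```r
x <- 10                        # global variable

add_one <- function() {
  local_total <- x + 1         # x is not defined locally yet, so R finds the global x
  x <- 100                     # creates a new LOCAL x; the global x is untouched
  return(local_total)
}

result <- add_one()
print(result)                  # 11
print(x)                       # 10
print(exists("local_total"))   # FALSE: local_total existed only inside the function
```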

# Lecture 11: Loops

#### What are loops
A loop is a sequence of statements carried out several times in succession.
There are three major types of loops:
  * `for` loops
  * `while` loops
  * `repeat` loops


## For
#### Definition
The `for` loop iterates through each element of a vector or list and performs a task at each iteration. It completes execution after iterating through all elements of a vector or list.

#### Syntax

  ```
  for (iterator in vector/list) {
    for loop body
  }
  ```
* `for`:  is initiated using the `for` keyword

* `iterator` :  is a variable that iterates through each element of the vector

* `in`: is a special keyword that says we want to iterate through the elements "in" the vector or list

*  `vector/list`:  The `for` loop will use the `iterator` to iterate through each element of a vector or list. This includes columns of data frames since data frames are structured lists.

* `for` loop body : At each iteration, the code in the `for` loop body is executed
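
A short sketch of both cases described above, iterating over a vector and over the columns of a data frame (all names and values are made up):

```r
# iterate through the elements of a vector
values <- c(2, 4, 6)
total <- 0
for (v in values) {
  total <- total + v
}
print(total)  # 12

# iterate through the columns of a data frame by name
df <- data.frame(height = c(150, 160, 170), weight = c(50, 60, 70))
for (col in names(df)) {
  cat(col, "mean:", mean(df[[col]]), "\n")
}
```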

#### Nested loops
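A nested loop is a `for` loop inside another `for` loop. In `R` nested loops are needed less often than one might expect, since vectorized functions and the `apply` family can usually do the same task more concisely and often more efficiently.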
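
As an illustration, the sketch below fills a small multiplication table with a nested loop and then builds the same table with the base `R` function `outer()`. The choice of `outer()` is just one example of a vectorized alternative.

```r
n <- 3
tab <- matrix(0, nrow = n, ncol = n)

# nested loop: the inner loop runs completely for every value of i
for (i in 1:n) {
  for (j in 1:n) {
    tab[i, j] <- i * j
  }
}

# vectorized alternative: outer() multiplies every pair of elements
tab_vec <- outer(1:n, 1:n)

print(tab)
print(all.equal(tab, tab_vec))
```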

## While
#### Definition
* The `while` loop keeps iterating "while" a certain condition is true. It is very useful for stopping an algorithm once convergence is reached.


#### Syntax

```
while (boolean expression) {
  while loop body
}
```
* `while`:  is initiated using the `while` keyword
* `boolean expression`:  An expression that produces a `TRUE` or `FALSE`
* `while` loop body :  Code in the `while` loop body is executed for as long as the `boolean expression` is `TRUE`; the expression is re-checked before every iteration
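
A minimal sketch of stopping at convergence: the loop below approximates the square root of 2 with the averaging (Babylonian) update, which is chosen here only as a simple example. The tolerance and starting guess are arbitrary.

```r
target <- 2
guess <- 1           # arbitrary starting value
tolerance <- 1e-8    # how close guess^2 must be to the target
iterations <- 0

# keep updating while the guess is not yet close enough
while (abs(guess^2 - target) > tolerance) {
  guess <- (guess + target / guess) / 2
  iterations <- iterations + 1
}

print(guess)
print(iterations)
```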


## Repeat
#### Definition
* The `repeat` loop...well...repeats a task forever! Notice that there is no condition statement. A `break` statement is used to "break" out of the loop once a condition is met.

#### Syntax
```
repeat {
  repeat loop body
  conditional break statement
}
```
* `repeat`:  is initiated using the `repeat` keyword

* `repeat loop body`:  Code that is executed at each iteration of the loop

* `conditional break statement` : the conditional statement, typically an `if` statement, that indicates whether or not we should `break` out of the loop/end the loop
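
For example, the sketch below doubles a value until it exceeds 100 (the numbers are arbitrary):

```r
value <- 1

repeat {
  value <- value * 2     # repeat loop body
  if (value > 100) {     # conditional break statement
    break
  }
}

print(value)  # 128
```

Without the `break`, this loop would never end.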


# Lecture 12: Apply family

## What is the apply family?

The `apply` family performs implicit looping. Most of the time, we can use these functions instead of writing our own `for` loops!
The family includes `apply()`, `lapply()`, `sapply()`, and `tapply()`.

### apply()
#### Definition
The `apply()` function applies a function to the rows or the columns of a matrix or data frame. It takes a data structure, a margin (specifying whether to operate along rows or columns), and a function to apply. It returns a vector, a matrix, or a list, depending on the input, the margin, and what the function returns.

#### Syntax

`apply(X, MARGIN, FUN, ...)`

* `X`: an array or a matrix (a data frame is converted to a matrix first)
* `MARGIN`: a vector giving the subscripts which the function will be applied over
    * 1 for rows
    * 2 for columns
* `FUN`: the function to be applied
* `...`: optional arguments passed on to `FUN`
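
A small sketch with a made-up 2 x 3 matrix, including an optional argument (`na.rm`) that is passed on to the applied function:

```r
m <- matrix(1:6, nrow = 2)        # rows are (1, 3, 5) and (2, 4, 6)

row_sums  <- apply(m, 1, sum)     # MARGIN = 1: one result per row
col_means <- apply(m, 2, mean)    # MARGIN = 2: one result per column
print(row_sums)    # 9 12
print(col_means)   # 1.5 3.5 5.5

# optional arguments after FUN are passed on to the function
m_na <- m
m_na[1, 1] <- NA
row_sums_na <- apply(m_na, 1, sum, na.rm = TRUE)
print(row_sums_na) # 8 12
```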

### lapply()
#### Definition
The `lapply()` function applies a function to each element of a vector or list and returns a list of the results. The output list has the same length as the input and keeps its names.

#### Syntax
`lapply(X, FUN)`

* `X`: a vector or list or an expression object
* `FUN`: the function to be applied
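
A minimal sketch, using a made-up list of numeric vectors:

```r
measurements <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# apply mean() to each list element; the result is again a list
means <- lapply(measurements, mean)
str(means)
```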


### sapply()
#### Definition
* The `sapply()` function is similar to `lapply()`, but it attempts to simplify the output, returning a vector or matrix instead of a list when possible. This is useful when the function returns output of the same length for every element.

#### Syntax
`sapply(X, FUN)`

* `X`: a vector or list or an expression object
* `FUN`: the function to be applied
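
The simplification can be sketched with the same kind of made-up list. The third call shows what happens when the results have different lengths and cannot be simplified.

```r
measurements <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# one number per element: simplified to a named numeric vector
means_vec <- sapply(measurements, mean)
print(means_vec)

# two numbers per element: simplified to a matrix with one column per element
ranges <- sapply(measurements, range)
print(ranges)

# results of different lengths: sapply() falls back to a list
kept <- sapply(measurements, function(v) v[v > 1])
print(class(kept))
```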

### tapply()
#### Definition
The `tapply()` function applies a function to the elements of a vector grouped by the levels of a factor (categorical variable), returning one result per group.

#### Syntax
`tapply(X, INDEX, FUN)`

* `X`: the data to be split into groups, typically a vector
* `INDEX`: a list of one or more factors, each of the same length as `X`. The elements are coerced to factors by `as.factor`.
* `FUN`: the function to be applied
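
A minimal sketch with invented blood pressure values and a two-level grouping factor:

```r
bp <- c(120, 130, 110, 140)
group <- factor(c("control", "treated", "control", "treated"))

# mean blood pressure within each level of the factor
group_means <- tapply(bp, group, mean)
print(group_means)
```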

## do.call()
`do.call()` constructs and executes a function call from a function (or its name) and a list of arguments to be passed to it.
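
Two common uses are sketched below with made-up data: passing a prepared list of arguments to a function, and row-binding a list of data frames.

```r
# the list elements become the arguments of the call: mean(1:10, na.rm = TRUE)
args <- list(1:10, na.rm = TRUE)
avg <- do.call(mean, args)
print(avg)  # 5.5

# the function can also be given by name
total <- do.call("sum", list(1, 2, 3))
print(total)  # 6

# combine a list of data frames into one: rbind(pieces[[1]], pieces[[2]])
pieces <- list(data.frame(id = 1, x = "a"), data.frame(id = 2, x = "b"))
combined <- do.call(rbind, pieces)
print(combined)
```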

# Lecture 13: Importing/Exporting Data

## Different methods of storing data

### CSV Definition
* The file extension `.csv` means "comma-separated values"
* Data within `.csv` files is stored as a table with rows and columns
* Each row of data is a separate line
* Each column of data is separated by a comma (",")

#### Syntax Import
To import, we use `read.csv(file, header = TRUE/FALSE, sep = ",")`
* `file` - filepath to the dataset
* `header` - Set to `TRUE` if the first line in your data contains the variable names
* `sep` - The delimiter in your file. This is by default a "," for comma separated files

#### Syntax Export
To export, we use `write.csv(x, file, row.names = TRUE/FALSE)`
* `x` - the `R` data frame to be exported
* `file` - the filename to where the data frame will be exported
* `row.names` - Whether or not you would like the row names of the data frame to also be exported. I often set this to `FALSE`, unless there is a special reason to include the row names
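
A minimal round trip, writing a made-up data frame to a temporary folder and reading it back. The file name `scores.csv` and the use of `tempdir()` are choices made for this sketch.

```r
df <- data.frame(id = 1:3, score = c(90.5, 82, 77))

# build a path inside R's temporary directory
csv_path <- file.path(tempdir(), "scores.csv")

# export without row names, then import again
write.csv(df, csv_path, row.names = FALSE)
df_back <- read.csv(csv_path, header = TRUE, sep = ",")

print(df_back)
```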



### TXT Definition
TXT and CSV files are closely related; the main difference is the delimiter.
* Other than a comma, the tab is the most common delimiter when storing data as a `.txt` file. To specify a "tab" in `R`, we use the delimiter `"\t"`
* Other delimiters include period `"."`, `"|"`, `";"` etc.

#### Syntax Import
To import data, we generally use `read.table(file, header = TRUE/FALSE, sep = "?")`
* `file` - filepath to the dataset
* `header` - Set to `TRUE` if the first line in your data contains the variable names
* `sep` - The delimiter in your file. This is by default an empty string (`""`), which means any whitespace (spaces, tabs, or newlines) is treated as the delimiter.

#### Syntax Export
To export data, we generally use `write.table(x, file, sep = "?", row.names = TRUE/FALSE)`
  * `x` - the `R` data frame to be exported
  * `file` - the filename to where the data frame will be exported
  * `sep` - the delimiter when exporting the data frame
  * `row.names` - Whether or not you would like the row names of the data frame to also be exported. I often set this to `FALSE`, unless there is a special reason to include the row names
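
The same round trip with a tab delimiter is sketched below. The data, the file name `patients.txt`, and the temporary folder are made up for illustration.

```r
df <- data.frame(name = c("Ann", "Bob"), age = c(34, 51))

txt_path <- file.path(tempdir(), "patients.txt")

# export with a tab delimiter, then import using the same delimiter
write.table(df, txt_path, sep = "\t", row.names = FALSE)
df_back <- read.table(txt_path, header = TRUE, sep = "\t")

print(df_back)
```

The delimiter used for importing must match the one used when the file was written.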


### XLSX Definition
* Spreadsheets are commonly stored as Microsoft Excel (`.xlsx`) files. To import one, the file must exist in our folder structure (for example, uploaded to the working directory). Unlike the previous two formats, we can't read it directly from a URL.

#### Syntax Import
To import from an Excel sheet, we use the `read_excel(path, sheet)` function from the `readxl` library
  * `path` - filepath to the dataset
  * `sheet` - An optional argument specifying the sheet (by number or name) you would like to import

#### Syntax Export
To export to an Excel sheet, we use the `write_xlsx(x, path)` function from the `writexl` library
  * `x` - the `R` data frame to be exported
  * `path` - the filename to where the data frame will be exported

## Navigate Directories



| Function Name  | Parameter             | Description |
|---------------|----------------------|-------------|
| `getwd()`     | None                 | Gets the current working directory. |
| `list.files()` | directory (optional) | Lists files and folders in a directory; defaults to the current working directory. |
| `dir.create()` | filepath             | Creates a new folder from a specified filepath. |
| `setwd()`     | directory path        | Sets the working directory to a specified location. |
| `file.path()` | folder/file names     | Creates a file path by correctly joining folder and file names. |
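
The functions in the table can be combined as in the sketch below. It works inside `R`'s temporary directory so that no real folders are touched; the folder name `lecture13_demo` and file name `note.txt` are invented.

```r
# build a path for a new folder inside the temporary directory
new_dir <- file.path(tempdir(), "lecture13_demo")
dir.create(new_dir, showWarnings = FALSE)

# remember where we are, then move into the new folder
old_wd <- getwd()
setwd(new_dir)

# create a small file and list the folder contents
writeLines("hello", "note.txt")
files <- list.files()
print(files)

# go back to the original working directory
setwd(old_wd)
```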




# Lecture 14: JavaScript Object Notation and APIs

## JavaScript Object Notation
#### Definition
* Most datasets are stored in tabular format using `.csv`, `.txt`, or `.xlsx` file types. `R` loads these datasets as a data frame or "structured list". However, structured lists are highly inefficient at storing hierarchical or nested data structures
* A more efficient way of storing nested data is JavaScript Object Notation (JSON). JSON format can be thought of as an unstructured (nested) list, where each list item can contain another data structure, i.e. a list of lists, data frames, vectors or a mixture of these. JSON format avoids the repetitive formatting of tabular representations of nested data

#### Syntax - Loading JSON Data
`R` loads JSON formatted data as unstructured lists [`list()`]. We load JSON files using the `fromJSON()` function in the `jsonlite` library, for example:<br>
`data <- fromJSON("https://raw.githubusercontent.com/khasenst/datasets_teaching/refs/heads/main/nested_json_example.json")`<br>
* To view the contents of the unstructured list, we can use the `str()` and `names()` functions
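
To see the nested structure without downloading anything, the sketch below parses a small JSON string written by hand. The patient data in it is invented, and the example assumes the `jsonlite` library is installed.

```r
library(jsonlite)

# a hand-written JSON string: one patient with a nested list of visits
json_text <- '{"patient": "A01", "visits": [{"day": 1, "bp": 120}, {"day": 7, "bp": 118}]}'

parsed <- fromJSON(json_text)

names(parsed)      # top-level items of the list
str(parsed)        # full nested structure
parsed$visits$bp   # the nested visits were simplified to a data frame
```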

## APIs
#### Definitions
* Given the efficiency of the JSON format for storing nested data, many companies make their data available as JSON files

* The way in which they make their data available is through an application programming interface (API). An API is a set of rules that allows different software applications to communicate with each other for data transfer. We can use an API to download data from an institution's servers to our R workspace!

* An example is the API for the World Health Organization (WHO)  https://apps.who.int/gho/data/node.resources.api
    * Different APIs have different rules for extracting data
    * Institutions typically provide documentation on how to access their data
    * We will focus on the WHO API as an example


#### How is it tied to JSON files
* Available "variables" are listed here: https://ghoapi.azureedge.net/api/
* One of the variables is life expectancy (at birth) `WHOSIS_000001`. Let's take a look!

```
library(jsonlite)  # provides fromJSON()

# path to data
url_path <- "https://ghoapi.azureedge.net/api/"

# selected variable
variable <- "WHOSIS_000001"

# download the variable and parse the JSON response
# (add simplifyDataFrame = FALSE if you want it as a list, not a data frame)
json_data <- fromJSON(paste0(url_path, variable))
```


#### Filtering via URL
* The previous example shows how to import data in JSON format from an institutional website using their API. Similar to filtering/subsetting in `R`, we can subset the data in the url itself.

* For example, the script below downloads the same `WHOSIS_000001` dataset but with the following constraints
  * `SpatialDimType` must be `REGION`
  * `NumericValue` must not be missing (`NULL`)
  * `TimeDim` must be greater than or equal to `2020`

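```
library(jsonlite)  # provides fromJSON()

# root path to API
url_path <- "https://ghoapi.azureedge.net/api/"

# selected variable
variable <- "WHOSIS_000001"

# filter query (%20 is the URL encoding of a space)
filter1 <- "?$filter=SpatialDimType%20eq%20'REGION'"
filter2 <- "%20and%20NumericValue%20ne%20null"
filter3 <- "%20and%20TimeDim%20ge%202020"

# download only the rows that satisfy all three filters
json_data <- fromJSON(paste0(url_path, variable, filter1, filter2, filter3))
```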
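
The `%20` codes in the filter strings are URL-encoded spaces. As an alternative sketch, the same query can be written in readable form and encoded with the base `R` function `URLencode()`. The variable names `conditions`, `filter_text`, `query` and `full_url` are invented here, and nothing is downloaded.

```r
url_path <- "https://ghoapi.azureedge.net/api/"
variable <- "WHOSIS_000001"

# write each constraint in readable form
conditions <- c("SpatialDimType eq 'REGION'",
                "NumericValue ne null",
                "TimeDim ge 2020")

# join the constraints with "and", then encode the spaces as %20
filter_text <- paste(conditions, collapse = " and ")
query <- paste0("?$filter=", URLencode(filter_text))

full_url <- paste0(url_path, variable, query)
print(full_url)
```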

## Exporting data as a JSON file

* We are able to export datasets as JSON files using
  * `toJSON()` - converts an `R` object into a JSON-formatted character string (class `json`)
  * `write()` - then exports the character string using a given filename


* Similar to `.csv` for comma separated value files, here, we use `.json` as the extension for JSON files
* The `write()` function is similar to the other exporting functions [`write.csv()`, `write.table()`], where you specify the data and the filepath

```
library(jsonlite)  # provides toJSON()

# convert the data to a JSON string, then export it to Colab
write(toJSON(json_data), "student_data.json")
```
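
A self-contained round trip is sketched below with an invented `student_data` list, written to `R`'s temporary directory. It assumes the `jsonlite` library is installed; `auto_unbox` and `pretty` are optional `toJSON()` arguments used here to make the file easier to read.

```r
library(jsonlite)

# made-up nested data
student_data <- list(name = "Ana", scores = c(90, 85))

# step 1: convert the list to a JSON string
json_string <- toJSON(student_data, auto_unbox = TRUE, pretty = TRUE)

# step 2: write the string to a .json file
json_path <- file.path(tempdir(), "student_data.json")
write(json_string, json_path)

# read the file back to check the export
back <- fromJSON(json_path)
str(back)
```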

# Lecture 15: Shaping and Merging

## Shaping
#### Wide format
Wide format: Each subject/observation has multiple columns for repeated measures.
* In wide format,
  * each row is populated with a single patient
  * each row contains all repeated measurements (i.e. blood pressures across all time points)

* Collecting more time points would increase the number of columns in the dataset, making it a "wider" dataset

* Therefore, we refer to this format as ***wide format***


#### Long format
Long format: Each observation has its own row, with an identifier column and a key-value pair for the measured variables.
* In long format,
  * each row is populated with a single blood pressure measurement from a single time point
  * patients are included in multiple rows

* Collecting more time points would increase the number of rows in the dataset, making it a "longer" dataset

* Therefore, we refer to this format as ***long format***

#### Syntax: Converting from long to wide format
 To convert long to wide format, we must specify the following arguments in the `reshape()` function
  * `data` - Data frame you would like to convert to wide or long format
  * `timevar` - Repeated observation variable. This is often "time" but can be any variable that characterizes the repeated measurement
  * `idvar` - Variable for which multiple observations are collected. This is often the patient ID in biomedical applications.
  * `direction` - Set to `"wide"` to convert to wide format and `"long"` to convert to long format
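
A minimal sketch with an invented long-format data frame of two patients measured at two time points:

```r
bp_long <- data.frame(
  id       = c(1, 1, 2, 2),
  time     = c(1, 2, 1, 2),
  systolic = c(120, 125, 130, 128)
)

# one row per patient; one systolic column per time point
bp_wide <- reshape(bp_long,
                   timevar   = "time",
                   idvar     = "id",
                   direction = "wide")

print(bp_wide)
```

By default, `reshape()` names the new columns by joining the measurement name and the time value with a period, giving `systolic.1` and `systolic.2` here.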

#### Converting from wide to long format
To convert wide to long format, we must specify the following arguments in the `reshape()` function
  * `data` - Data frame you would like to convert to wide or long format
  * `varying` - Specifies the columns that we need to be combined into long format
  * `idvar` - Variable for which multiple observations are collected. This is often the patient ID in biomedical applications.
  * `v.names` - Names for the resulting value columns (Systolic and Diastolic)
  * `timevar` - The name of the column for time points
  * `times` - The labels for the time points
  * `direction` - Set to `"wide"` to convert to wide format and `"long"` to convert to long format
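
The reverse conversion is sketched below on an invented wide data frame. The column names `sys_t1` and `sys_t2` are assumptions made for this example.

```r
bp_wide <- data.frame(
  id     = c(1, 2),
  sys_t1 = c(120, 130),
  sys_t2 = c(125, 128)
)

# stack the two systolic columns into one, adding a time column
bp_long <- reshape(bp_wide,
                   varying   = c("sys_t1", "sys_t2"),
                   v.names   = "systolic",
                   timevar   = "time",
                   times     = c(1, 2),
                   idvar     = "id",
                   direction = "long")

print(bp_long)
```

Each patient now appears in two rows, one per time point.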

#### When do we use wide and long format?
* When to use long format?
  * The majority of the time, we want our data in long format
  * Most functions for statistical modeling and data visualization require data to be in long format
  * Long format is more appropriate when each individual has varying numbers of repeated observations
  * Data is stored vertically with multiple rows per subject
  * Each row represents one observation and has a variable type column
  * Used for analysis, visualization and modeling in R

* When to use wide format?
  * Wide format is more efficient for data storage
  * Wide format is more efficient for data summaries
  * Wide format is more appropriate when each individual has an equal number of observations
  * Data is stored horizontally, with one row per subject
  * Each column represents a different observation
  * Useful for reporting and some statistical methods.

## Merging using Joins


To join or merge two datasets/data frames, we must specify the following arguments in the `merge()` function
  * `x` - The left data frame in the merge
  * `y` - The right data frame in the merge
  * `by.x` - Column in data frame `x` used in the merge
  * `by.y` - Column in data frame `y` used in the merge
  * `all.x` - if `TRUE`, keep all data from data frame `x`; default is `FALSE`
  * `all.y` - if `TRUE`, keep all data from data frame `y`; default is `FALSE`

| Join Type   | Definition | Syntax |
|------------|------------|---------|
| **Inner Join** | Returns only matching rows from both datasets based on the specified key(s). Non-matching rows are removed. | `bp_inner <- merge(x = bp, y = demo, by.x = "Patient.ID", by.y = "id", all.x = FALSE, all.y = FALSE)` |
| **Left Join** | Returns all rows from the left dataset (`bp`) and matching rows from the right dataset (`demo`). Rows of `bp` with no match in `demo` get `NA` in the `demo` columns. | `bp_left <- merge(x = bp, y = demo, by.x = "Patient.ID", by.y = "id", all.x = TRUE, all.y = FALSE)` |
| **Right Join** | Returns all rows from the right dataset (`demo`) and matching rows from the left dataset (`bp`). Rows of `demo` with no match in `bp` get `NA` in the `bp` columns. | `bp_right <- merge(x = bp, y = demo, by.x = "Patient.ID", by.y = "id", all.x = FALSE, all.y = TRUE)` |
| **Full Join** | Returns all rows from both datasets. If there is no match, `NA` is assigned for missing values from either dataset. | `bp_full <- merge(x = bp, y = demo, by.x = "Patient.ID", by.y = "id", all.x = TRUE, all.y = TRUE)` |


#### Inner Join
* Consider two data frames `x` (left) and `y` (right)
* Inner Join - Combines the rows between `x` and `y` based on values in a common variable
  * If values in the common variable are not in both `x` and `y`, their data are excluded

#### Left Join
* Consider two data frames `x` (left) and `y` (right)
* Left Join - Combines the rows between `x` and `y` based on values in a common variable
  * If values in the common variable are in `y` but not `x`, their data are excluded
  * If values in the common variable are in `x` but not `y`, their data are still included

#### Right Join
* Consider two data frames `x` (left) and `y` (right)
* Right Join - Combines the rows between `x` and `y` based on values in a common variable
  * If values in the common variable are in `x` but not `y`, their data are excluded
  * If values in the common variable are in `y` but not `x`, their data are still included


#### Full Join
* Consider two data frames `x` (left) and `y` (right)
* Full Join - Combines the rows between `x` and `y` based on values in a common variable
  * If values in the common variable are in `x` but not `y`, their data are still included
  * If values in the common variable are in `y` but not `x`, their data are still included
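
The four joins can be compared on two tiny, invented data frames. The names `bp` and `demo` follow the table above, but the values are made up: patient 1 appears only in `bp` and patient 4 only in `demo`.

```r
bp   <- data.frame(Patient.ID = c(1, 2, 3), systolic = c(120, 130, 125))
demo <- data.frame(id = c(2, 3, 4), age = c(54, 61, 47))

bp_inner <- merge(bp, demo, by.x = "Patient.ID", by.y = "id")                 # patients 2, 3
bp_left  <- merge(bp, demo, by.x = "Patient.ID", by.y = "id", all.x = TRUE)   # patients 1, 2, 3
bp_right <- merge(bp, demo, by.x = "Patient.ID", by.y = "id", all.y = TRUE)   # patients 2, 3, 4
bp_full  <- merge(bp, demo, by.x = "Patient.ID", by.y = "id",
                  all.x = TRUE, all.y = TRUE)                                 # patients 1, 2, 3, 4

print(bp_left)   # age is NA for patient 1, who is missing from demo
print(bp_full)
```

In the merged output the key column keeps the name from `x` (`Patient.ID`).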