# Objectives

## High Level Objectives

### Problems we focus on

- Predicting a class/category for a new observation/measurement
- Predicting a value for a new observation/measurement
- Finding previously unknown/unlabelled subgroups in your data
- Estimating an average or a proportion from a representative sample (a group of people or units) and using that estimate to generalize to the broader population (e.g., the proportion of undergraduate students who own an iPhone)

### Types of Questions

| Question Type | Description | Example |
|:-------------:|:-----------:|:-------:|
| Descriptive | A question that asks about summarized characteristics of a data set without interpretation (i.e., report a fact). | How many people live in each province and territory in Canada? |
| Exploratory | A question that asks if there are patterns, trends, or relationships within a single data set. Often used to propose hypotheses for future study. | Does political party voting change with indicators of wealth in a set of data collected on 2,000 people living in Canada? |
| Predictive | A question that asks about predicting measurements or labels for individuals (people or things). The focus is on what things predict some outcome, but not what causes the outcome. | What political party will someone vote for in the next Canadian election? |
| Inferential | A question that looks for patterns, trends, or relationships in a single data set and also asks for quantification of how applicable these findings are to the wider population. | Does political party voting change with indicators of wealth for all people living in Canada? |
| Causal | A question that asks about whether changing one factor will lead to a change in another factor, on average, in the wider population. | Does wealth lead to voting for a certain political party in Canadian elections? |
| Mechanistic | A question that asks about the underlying mechanism of the observed patterns, trends, or relationships (i.e., how does it happen?). | How does wealth lead to voting for a certain political party in Canadian elections? |

### Types of Answers

1. **Summarization:** Computing and reporting aggregated values pertaining to a data set. Summarization is most often used to answer descriptive questions, and can occasionally help with answering exploratory questions.
2. **Visualization:** Plotting data graphically. Visualization is typically used to answer descriptive and exploratory questions, but plays a critical supporting role in answering all of the types of questions above.
3. **Classification:** Predicting a class or category for a new observation. Classification is used to answer predictive questions.
4. **Regression:** Predicting a quantitative value for a new observation. Regression is also used to answer predictive questions.
5. **Clustering:** Finding previously unknown/unlabeled subgroups in a data set. Clustering is often used to answer exploratory questions.
6. **Estimation:** Taking measurements for a small number of items from a large group and making a good guess for the average or proportion for the large group. Estimation is used to answer inferential questions.
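
As a rough illustration of summarization and estimation, the sketch below uses a small made-up sample (the `owns_iphone` vector is hypothetical, not real survey data) and base R only. It is a minimal example under those assumptions, not a full inferential analysis.

```r
# Hypothetical sample: whether each of 10 surveyed students owns an iPhone
owns_iphone <- c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE)

# Summarization: report an aggregated value for this data set
n_owners <- sum(owns_iphone)

# Estimation: use the sample proportion as a guess for the population proportion
prop_estimate <- mean(owns_iphone)

print(n_owners)
print(prop_estimate)
```

How good the estimate is depends on how representative the sample is and how large it is.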

## Basic functions used in DS

- ```install.packages("package_name")``` is used to install packages.

- ```library(package_name)``` loads a package into your workspace.

- ```nrow(data_frame)``` computes the total number of rows in a data frame.

- ```ncol(data_frame)``` computes the total number of columns in a data frame.

- Type ```?``` before the name of the function you want help with, and R will show you its documentation. <br>
    ```?read_csv```
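
The functions above can be tried out with a short sketch like the following. It assumes the built-in `mtcars` data frame from the `datasets` package, chosen here only because it ships with R; any other package and data frame would work the same way.

```r
# install.packages("package_name")  # run once for packages that are not yet installed

# Load a package; datasets ships with R, so this always succeeds
library(datasets)

# Count the rows and columns of the built-in mtcars data frame
n_rows <- nrow(mtcars)
n_cols <- ncol(mtcars)

print(n_rows)
print(n_cols)

# In an interactive session, open the documentation for nrow:
# ?nrow
```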

## Other functions

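- ```toupper()``` converts a character string to upper case.
- ```tolower()``` converts a character string to lower case.
- ```print()``` or ```print(..., n = ...)``` prints an object; when printing a tibble, ```n``` sets how many rows are shown.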
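
A minimal sketch of these three functions follows. The strings are arbitrary examples, and the `print(..., n = ...)` call assumes the `tibble` package is installed, so it is guarded and skipped otherwise.

```r
# Change the case of a character string
shout <- toupper("data science")
quiet <- tolower("Data Science")

print(shout)
print(quiet)

# print(..., n = ...) limits the rows shown for a tibble.
# Assumption: the tibble package is available; skip this step if it is not.
if (requireNamespace("tibble", quietly = TRUE)) {
  cars_tbl <- tibble::as_tibble(mtcars)
  print(cars_tbl, n = 5)  # show only the first 5 rows
}
```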

## Creating subsets of Data

- ```filter``` obtains the subset of rows with the desired values from a data frame. <br>```var <- filter(orig, col == "value")```

- ```is.na(...)``` returns ```TRUE``` if the value is ```NA``` and ```FALSE``` otherwise. <br>
  You can use ```!is.na()``` inside ```filter``` to drop rows that have missing values.

- ```select``` extracts specific columns from a data frame. <br> ```var <- select(orig, col1, col2)```

- ```arrange``` orders the rows of a data frame by the values of a particular column. <br>
  ```var <- arrange(orig, col)``` takes column names as input and orders the rows in ascending order. <br>
  ```var <- arrange(orig, desc(col))``` wraps the column in ```desc``` to order the rows in descending order. <br>

- ```slice``` selects rows according to their row number. <br> ```var <- slice(orig, 1:10)```

- ```mutate``` adds a new variable to a data frame as a function of the old columns. <br> ```var <- mutate(orig, new_col = col1/col2)```

- ```head(data_frame, n = ...)``` returns the first n rows of a data frame.

- ```tail(data_frame, n = ...)``` returns the last n rows of a data frame.
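
The subsetting functions above can be combined on a small example. The sketch below assumes the `dplyr` package is installed and uses a made-up data frame; the names `orig`, `group`, `col1`, and `col2` are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical toy data with one missing value in `group`
orig <- data.frame(
  name  = c("a", "b", "c", "d", "e"),
  group = c("x", "y", "x", NA, "x"),
  col1  = c(10, 20, 30, 40, 50),
  col2  = c(2, 4, 5, 8, 10)
)

only_x      <- filter(orig, group == "x")       # keep rows where group is "x"
no_missing  <- filter(orig, !is.na(group))      # drop rows with a missing group
two_cols    <- select(orig, name, col1)         # keep only two columns
ascending   <- arrange(orig, col1)              # smallest col1 first
descending  <- arrange(orig, desc(col1))        # largest col1 first
first_three <- slice(orig, 1:3)                 # rows 1 to 3 by position
with_ratio  <- mutate(orig, ratio = col1 / col2)  # add a computed column

head(orig, n = 2)  # first 2 rows
tail(orig, n = 2)  # last 2 rows
```

Each call returns a new data frame and leaves `orig` unchanged, which is why every result is assigned to its own variable.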