# **Introduction to Modern Statistics**
## Chapter 2: Study Design

#### **Populations and samples**

A question always refers to a **target population**.

Often it is not feasible to collect data for every case in a population. Collecting data for the entire population is called a ***census***, and it is a difficult and very expensive way to gather data. Instead, **a sample is taken**.

A **sample** is the data you have. Ideally, a sample is a small fraction of the population, and this sample data may be used to provide an estimate of the population average and to answer the research question.

We use specific terms to differentiate between a number calculated on a sample of data (a **statistic**) and a number calculated, or considered for calculation, on the entire population (a **parameter**).
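The distinction can be made concrete with a minimal sketch. The population below is made up for illustration (hypothetical years-to-graduation values), and the names `population`, `parameter`, and `statistic` are assumptions of this example, not part of the chapter.

```python
import random
import statistics

# Hypothetical population: years to graduation for 1,000 students (made-up values).
rng = random.Random(42)
population = [rng.choice([3.5, 4.0, 4.0, 4.0, 4.5, 5.0]) for _ in range(1000)]

# Parameter: a number computed on the entire population.
parameter = statistics.mean(population)

# Statistic: the same number computed on a sample drawn from that population.
sample = rng.sample(population, 50)
statistic = statistics.mean(sample)

print(f"population mean (parameter): {parameter:.3f}")
print(f"sample mean (statistic):     {statistic:.3f}")
```

In practice the parameter is usually unknown, and the statistic serves as its estimate.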

#### **Sampling from a population**

We should always randomly select a sample from a population, but first we **must determine what our target population is**.

For example, suppose we would like to estimate the time to graduation for Duke undergraduates in the last five years by collecting a sample of graduates. All graduates in the last five years represent the **target population**, and the graduates who are selected for review are collectively called the **sample**.

![image.png](attachment:image.png)

If someone were permitted to pick and choose exactly which graduates were included in the sample, it is entirely possible that the sample would overrepresent that person's interests, even if entirely unintentionally. This introduces **bias** into a sample. **Sampling randomly addresses this problem**.

A **simple random sample** is the most basic random sample, which is equivalent to drawing names out of a hat to select cases. This means that each case in the population has an equal chance of being included and the cases in the sample are not related to each other.
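As a rough sketch, "drawing names out of a hat" corresponds to sampling without replacement from a list of all cases. The list `graduates` and the helper `simple_random_sample` are hypothetical names used only for this illustration.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n cases without replacement; every case is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(population, n)

# Hypothetical sampling frame: one entry per graduate in the target population.
graduates = [f"graduate_{i}" for i in range(1, 501)]

sample = simple_random_sample(graduates, 25, seed=1)
print(sample[:5])
```

Fixing `seed` only makes the draw reproducible; it does not change the fact that each case has the same chance of selection.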

The act of taking a simple random sample helps minimize bias. However, **bias can crop up in other ways**, for example when the non-response rate in a survey is high.

#### **Four Sampling Methods**

**Simple Random Sampling**: Each case in the population has an equal chance of being included in the final sample, and knowing that a case is included in a sample does not provide useful information about which other cases are included.

**Stratified Sampling**: The population is divided into groups called *strata*. The strata are chosen so that similar cases are grouped together, then a second sampling method, usually simple random sampling, is employed within each stratum.

This is very useful when the cases in each stratum are very similar with respect to the outcome of interest.
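A minimal sketch of stratified sampling, assuming simple random sampling within each stratum and an equal number of cases per stratum. The `students` data, the `school` field, and the function name `stratified_sample` are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(cases, stratum_of, n_per_stratum, seed=None):
    """Group cases into strata, then take a simple random sample within each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for case in cases:
        strata[stratum_of(case)].append(case)
    sample = []
    for name in sorted(strata):  # sorted only to make the order reproducible
        sample.extend(rng.sample(strata[name], n_per_stratum))
    return sample

# Hypothetical population: 300 students spread evenly over three schools (the strata).
schools = ["Arts", "Engineering", "Nursing"]
students = [{"id": i, "school": schools[i % 3]} for i in range(300)]

sample = stratified_sample(students, lambda s: s["school"], 10, seed=0)
print(len(sample))  # 10 students from each of the 3 schools
```

Sampling the same number from every stratum is one possible design choice; sampling in proportion to stratum size is another.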

![image.png](attachment:image.png)

**Cluster Sample**: We break up the population into many groups, called **clusters**. Then we sample a fixed number of clusters and ***include all observations from each of those clusters in the sample***. 

A **Multistage sample** is like a cluster sample, but rather than keeping all observations in each cluster, we would collect a random sample within each selected cluster.

Sometimes cluster or multistage sampling can be more economical than the alternative sampling techniques.

- Unlike stratified sampling, these approaches are more helpful when there is a lot of case-to-case variability within a cluster but the clusters themselves don't look very different from one another.
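The difference between the two approaches can be sketched as follows. The `villages` dictionary and both function names are hypothetical; the only assumption is that the population is already organized as a mapping from cluster name to the cases in that cluster.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick clusters and keep every observation in each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [case for name in chosen for case in clusters[name]]

def multistage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Randomly pick clusters, then take a random sample within each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        sample.extend(rng.sample(clusters[name], n_per_cluster))
    return sample

# Hypothetical population: 30 villages (clusters) with 40 people each.
villages = {f"village_{v}": [f"v{v}_person_{p}" for p in range(40)] for v in range(30)}

cs = cluster_sample(villages, 5, seed=0)
ms = multistage_sample(villages, 5, 8, seed=0)
print(len(cs), len(ms))  # 5 * 40 people versus 5 * 8 people
```

Both designs only require visiting the selected clusters, which is where the potential cost savings come from.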

![image-2.png](attachment:image-2.png)

![image-3.png](attachment:image-3.png)

#### **Experiments**

##### **Principles of experimental design**

***Randomized experiments are fundamentally important when trying to show a causal connection between two variables***


- **Controlling**: 
    Researchers do their best to **control** any other differences between the groups.

- **Randomization**:
    Researchers randomize *patients* to account for variables that cannot be controlled. For example, some patients may be more susceptible to a disease than others due to their dietary habits. In this example, dietary habit is a ***confounding variable***, which is defined as a variable that is associated with both the explanatory and response variables. Randomizing patients into the treatment or control group helps even out such differences.

- **Replication**: 
    The more cases researchers observe, the more accurately they can estimate the effect of the explanatory variable on the response. In a single study, we **replicate** by collecting a sufficiently large sample.

    What is considered sufficiently large varies from experiment to experiment, but at a minimum we want to have multiple subjects (experimental units) per treatment group.

- **Blocking**: 
    Researchers sometimes know or suspect that variables other than the treatment influence the response. Under these circumstances, they may first group individuals based on such a variable into **blocks** and then randomize cases within each block to the treatment groups. This strategy is often referred to as blocking.

    For example, if we want to know the effect of a drug on heart attacks, we might first split patients in the study into low-risk and high-risk blocks, then randomly assign half the patients from each block to the control group and the other half to the treatment group.

    ![image.png](attachment:image.png)
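The blocking example above can be sketched in a few lines. The patient records, the `risk` field, and the function `blocked_assignment` are hypothetical; the sketch assumes each block is split in half, as in the heart attack example.

```python
import random

def blocked_assignment(patients, block_of, seed=None):
    """Group patients into blocks, then randomize to treatment/control within each block."""
    rng = random.Random(seed)
    blocks = {}
    for p in patients:
        blocks.setdefault(block_of(p), []).append(p)
    assignment = {}
    for name in sorted(blocks):
        members = blocks[name][:]  # copy so the input list is not reordered
        rng.shuffle(members)
        half = len(members) // 2
        for p in members[:half]:
            assignment[p["id"]] = "treatment"
        for p in members[half:]:
            assignment[p["id"]] = "control"
    return assignment

# Hypothetical study: 40 patients, of whom 10 are high-risk and 30 are low-risk.
patients = [{"id": i, "risk": "high" if i % 4 == 0 else "low"} for i in range(40)]

assignment = blocked_assignment(patients, lambda p: p["risk"], seed=3)
for risk in ("high", "low"):
    treated = sum(1 for p in patients
                  if p["risk"] == risk and assignment[p["id"]] == "treatment")
    print(risk, "block, treated:", treated)
```

Because the split happens inside each block, both groups end up with the same share of high-risk and low-risk patients, which plain randomization only achieves on average.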

##### **Reducing bias in human experiments**

For example, suppose researchers want to know whether a drug reduces deaths in patients.

These researchers designed a randomized experiment because they wanted to draw causal conclusions about the drug's effect. Study volunteers were randomly placed into two study groups: the treatment group, which received the drug, and the control group, which did not.

A person in the treatment group is given the drug and may expect it to help. A person in the control group doesn't receive the drug and sits idly, hoping her participation doesn't increase her risk of death. These perspectives suggest there are actually two effects in this study: *the one of interest is the effectiveness of the drug, and the second is an emotional effect of (not) taking the drug, which is difficult to quantify*.

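***Emotional effects might bias the study***. To reduce this bias, patients should not know which group they are in, so the control group is given a **placebo** (a fake treatment). A study in which patients are kept uninformed about their treatment is called a **blind study**.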

When the person who measures the outcomes in the treatment and control groups also does not know which patients received the drug, it is a **double-blind study**.


#### **Observational Studies**

***Observational studies are generally only sufficient to show associations or form hypotheses that can be later checked with experiments***

**Example**:

Suppose an observational study found that people who use more sunscreen are more likely to have skin cancer. Some previous research tells us that using sunscreen actually reduces skin cancer risk, so maybe there is another variable that can explain this hypothetical association between sunscreen usage and skin cancer. One important piece of information that is absent is sun exposure. If someone is out in the sun all day, they are more likely to use sunscreen and more likely to get skin cancer. Exposure to the sun is unaccounted for in the simple observational investigation.

Sun exposure is a confounding variable. ***The presence of confounding variables is what prevents observational studies from supporting causal claims***.
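A small worked example shows how a confounder can distort an association. The counts below are entirely made up for illustration (they are not data from any study), and the `counts` table and `rate` helper are hypothetical names. In these invented numbers, sunscreen users have a lower skin cancer rate within each sun-exposure group, yet a higher rate overall, because most users are in the high-exposure group.

```python
# Hypothetical counts: (sun exposure, uses sunscreen) -> (people, skin cancer cases).
counts = {
    ("high", True): (800, 80),
    ("high", False): (200, 30),
    ("low", True): (200, 4),
    ("low", False): (800, 24),
}

def rate(exposure=None, sunscreen=None):
    """Skin cancer rate among the selected people; None means 'do not filter'."""
    people = cases = 0
    for (exp, uses), (n, k) in counts.items():
        if exposure in (None, exp) and sunscreen in (None, uses):
            people += n
            cases += k
    return cases / people

# Ignoring sun exposure, sunscreen users look worse off.
print("overall:", rate(sunscreen=True), "vs", rate(sunscreen=False))

# Within each exposure group, sunscreen users look better off.
for exp in ("high", "low"):
    print(exp, "exposure:", rate(exp, True), "vs", rate(exp, False))
```

Comparing within exposure groups is only possible because exposure was recorded; an unrecorded confounder cannot be adjusted for this way.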

##### **Prospective study**

Identifies individuals and collects information as events unfold. 

##### **Retrospective studies**

Collect data after events have taken place.