Click [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Chap1/01-Intro-to-Inference.ipynb) to open an interactive version of the full text section.

# <a name="01intro">1.1: Introduction to Statistics</a>
---

<font color="dodgerblue">**Statistics**</font> is the study of the collection, organization, analysis, interpretation, and presentation of data. Statistical methods are essential for exploring questions that can be analyzed using data. Sometimes insights arise directly from a well-chosen visualization or plot. Often, a visualization leads to more advanced questions that require further statistical theory, models, and analysis. We will explore both modern and classical statistical methods, and in order to do so, we need data!


# <a name="01what-is-R">What is R?</a>
---

[R](https://www.r-project.org/about.html) is a programming language used largely for statistical computing, data wrangling, and visualization. We will be using R as a tool for exploring statistical theory. The first stable version of R was released in 2000, and since then a large community of R users has created many useful packages and shared interesting data sets that are frequently updated.



## <a name="01load-pack">Loading Packages with the `library()` Command</a>
---

Each time we start or restart an R session and want to access the functions and data in a package, we need to load the package with the `library()` command.

-   The `dplyr` package is already installed in Google Colaboratory.
-   We still need to use a `library()` command to load the package if we want to access data and functions in the package.
-   If we do not run the code cell below, we will not be able to run the rest of the code cells in this document without receiving error messages.
-   **Run the code cell below to load the `dplyr` package.**

In [None]:
library(dplyr)
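
If a `library()` call fails, it can help to check whether the package is installed and whether it is already attached. The sketch below is one way to do this with base R's `requireNamespace()` and `search()`. The helper name `package_status` is our own invention for illustration; it is not part of `dplyr` or base R.

```r
# Hypothetical helper (not part of dplyr or base R): report whether a package
# is installed and whether it is currently attached to the search path.
package_status <- function(pkg) {
  c(
    installed = requireNamespace(pkg, quietly = TRUE),
    attached  = paste0("package:", pkg) %in% search()
  )
}

package_status("stats")   # a base package that R attaches at startup
package_status("dplyr")   # attached becomes TRUE once library(dplyr) has run
```

A package that is installed but not attached just needs a `library()` call. A package that is not installed must be installed first.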

## <a name="01help">Finding Help Documentation</a>
---


We can find help without opening a separate browser window or tab. The `?` help operator and `help()` function provide access to the help manuals for R functions, data sets, and other objects. Running a `?` or `help()` command in a code cell opens a side bar with a tab displaying the help documentation.

-   For example, the package `dplyr` contains a data set called `storms`.
-   Where is the data from, and what variables are in the data set?
-   **Run the code cell below to access the help documentation for the `storms` data set.**
    -   Resizing the tab in the side bar may help the documentation be more readable.
    -   We can close the tab if we want to increase the size of our working window.

In [None]:
?storms
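
The `?` operator is shorthand for the `help()` function, so either form can be used. The sketch below uses `mtcars` and `median`, which ship with base R, as stand-ins for `storms` so that it runs even when `dplyr` is not loaded. Storing the result instead of printing it lets us check whether a help page was found.

```r
# Equivalent ways to request documentation for a topic.
# `mtcars` and `median` ship with base R and stand in for `storms` here.
doc_data <- help("mtcars")     # same topic as ?mtcars
doc_fun  <- help("median")     # same topic as ?median

# Printing a stored help object displays the help page.
# A result of length zero means no documentation was found.
length(doc_data) > 0
length(doc_fun) > 0

# With dplyr installed, the package can be named explicitly:
# help("storms", package = "dplyr")

# To search help pages by keyword instead of by exact name:
# help.search("median")        # same as ??median
```

In the notebook, we would normally run `help("storms")` or `?storms` on its own line so that the documentation is displayed right away.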

## <a name="01q1">Question 1</a>
---

After reading the `storms` help documentation, answer the following
questions:

a.  What is the source of the data?

b.  What variables are included in the data?

c.  Over what period of time and how frequently are observations recorded?

### <a name="01sol1">Solutions to Question 1</a>
---

<br> <br> <br> <br>
  



## <a name="01q2">Question 2</a>
---

Insert a code cell and run the command `?hist` to see the help
documentation for the histogram function.

a.  What option can we use to add a main title to the histogram?

b.  What option can we use to set the fill color for the bars of a histogram?

### <a name="01sol2">Solution to Question 2</a>
---

<br> <br> <br> <br>
  



# <a name="01inference">Statistical Inference</a>
---

A fundamental application of statistics is to use data from a subset of
a population to draw conclusions about the population.

-   A <font color="dodgerblue">**population**</font> includes all individuals or objects of interest.
-   A <font color="dodgerblue">**sample**</font> is a subset of the population.
-   <font color="dodgerblue">**Statistical inference**</font> is the process of drawing conclusions about the entire population based on information in a sample.
-   This process can be thought of as a cycle that is pictured below.


<figure>
<img
src="https://upload.wikimedia.org/wikipedia/commons/e/ee/01fig-inference.png"
style="width:80.0%"
alt="Image Credit: Loneshieling, modified by Adam Spiegler, CC BY-SA 4.0." />
<figcaption aria-hidden="true">Image Credit: <a
href="https://commons.m.wikimedia.org/wiki/File:Population_versus_sample_%28statistics%29.png">Loneshieling</a>,
modified by Adam Spiegler, <a
href="https://creativecommons.org/licenses/by-sa/4.0">CC BY-SA
4.0</a>.</figcaption>
</figure>

**This semester we will mainly focus on steps 3 and 4; however, the methods we learn are not as powerful without carefully considering steps 1 and 2!**
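
To make the vocabulary concrete, here is a minimal simulation of a population, a sample, and a simple inference. Everything in it is made up for illustration: the population size, the mean of 50, the standard deviation of 10, and the sample size of 100 are arbitrary assumptions, not values from any real data set.

```r
set.seed(2024)  # arbitrary seed so the sketch is reproducible

# Simulated population: 10,000 hypothetical measurements (made-up values)
population <- rnorm(10000, mean = 50, sd = 10)

# Sample: a random subset of 100 individuals drawn without replacement
samp <- sample(population, size = 100)

# The population mean is usually unknown in practice; we can compute it here
# only because the population is simulated.
pop_mean  <- mean(population)
samp_mean <- mean(samp)

# Inference: use the sample mean as an estimate of the population mean
c(population = pop_mean, sample = samp_mean, error = samp_mean - pop_mean)
```

Running the cell again with a different seed gives a different sample and a different estimate. Quantifying that variability is a central goal of statistical theory.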



## <a name="01q3">Question 3</a>
---

In the `storms` data set, is the data from a sample or a population? What information in the help documentation supports your answer? Recall you can run the command `?storms` to open the help documentation.

### <a name="01sol3">Solution to Question 3</a>
---

<br> <br> <br> <br>  
  



## <a name="01q4">Question 4</a>
---

What statistical questions might be worth investigating among the
variables in the `storms` data set? What data visualizations could be useful to uncover interesting questions?

Run the `summary(storms)` command in the code cell below to view a numerical summary for each variable in the data set to help formulate your question.

In [None]:
summary(storms)

### <a name="01sol4">Solution to Question 4</a>
---

<br> <br> <br> <br>  
  
  



# <a name="01design">Designing Studies</a>
---

Often in statistics we would like to investigate whether one variable is associated with another. Researchers carry out studies to understand the conditions and causes of certain outcomes.

-   Does daily exercise reduce the risk of early onset dementia?
-   Is rewarding people or punishing people a more effective incentive to help them quit smoking?
-   Is a new vaccine effective at preventing disease?

If we are using one variable to help us understand or predict the values (or category) of another variable:

- The first variable is called the <font color="dodgerblue">**explanatory, independent, or predictor variable**</font>.
- The second variable is called the <font color="dodgerblue">**response or dependent variable**</font>.
- Different categories of a predictor variable are called <font color="dodgerblue">**treatments or levels**</font>.
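
As an illustration of these terms, the sketch below uses `PlantGrowth`, a small data set that ships with base R. It is unrelated to the studies discussed in this section and was chosen only because it is always available. Here `group` plays the role of the predictor variable and `weight` the role of the response variable.

```r
# PlantGrowth ships with base R (datasets package): dried plant weights
# recorded under a control condition and two treatment conditions.
str(PlantGrowth)

# Predictor (explanatory) variable: the condition each plant received
levels(PlantGrowth$group)   # the treatments, or levels

# Response variable: the measured outcome, summarized within each treatment
tapply(PlantGrowth$weight, PlantGrowth$group, mean)
```

Each row is one plant, with one column for the predictor and one column for the response.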



## <a name="01q5">Question 5</a>
---

For each question below, which variable is the predictor variable and which is the response variable? How would you organize the data you collect in each case?

a.  Does daily exercise reduce the risk of early onset dementia?

b.  Is rewarding people or punishing people a more effective incentive to help them quit smoking?

c.  Is a new vaccine effective at preventing disease?

### <a name="01sol5">Solution to Question 5</a>
---

  
<br> <br> <br> <br>  
  



## <a name="01q6">Question 6</a>
---

Both studies below are designed to examine whether rewarding good
behavior or punishing bad behavior is a more effective method to help people quit smoking. Which study do you believe is better designed? Why?

### <a name="01studyA">Study A</a>
---

Employees at a large company voluntarily enroll in a quit smoking study.
When they join, they are provided two options to select from:

- Option 1 (<font color="mediumseagreen">**Reward-based group**</font>): If after six months the participant has quit smoking, they get an \$800 reward.

- Option 2 (<font color="tomato">**Deposit-based group**</font>): Pay an initial \$150 refundable deposit. If after six months the participant:
  - Has quit smoking, they receive their \$150 deposit back plus an additional \$800 reward.
  - Has not quit smoking, then they do not receive their \$150 deposit back.

After six months, we compare the success rate between the two groups to determine which method is more effective.

### <a name="01studyB">Study B</a>
---

Employees at a large company voluntarily enroll in a quit smoking study.

-  When they join, they are randomly assigned to either the <font color="mediumseagreen">**Reward-based group**</font> or the <font color="tomato">**Deposit-based group**</font>, with exactly the same reward and penalty system for each option as in [Study A](#01studyA).

After six months, we compare the success rate between the two groups to
determine which method is more effective.

### <a name="01sol6">Solution to Question 6</a>
---

<br> <br> <br> <br>
  
  
  



## <a name="01confounding">Confounding Variables</a>

---

A variable that is associated with both the predictor variable and the response variable is called a <font color="dodgerblue">**confounding or lurking variable**</font>.



<figure>
<a title="Adam Spiegler, CC BY-SA 4.0 &lt;https://creativecommons.org/licenses/by-sa/4.0&gt;" href="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b7/01fig-confounding.png/512px-01fig-confounding.png"><img width="512" alt="Diagram of Confounding Variable" src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b7/01fig-confounding.png/512px-01fig-confounding.png"></a>
<figcaption aria-hidden="true">Image Credit: Adam Spiegler, <a
href="https://creativecommons.org/licenses/by-sa/4.0">CC BY-SA 4.0</a>
</figcaption>
</figure>
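
A small simulation can show how a confounding variable produces an association between two variables that do not affect each other. The sketch below is purely illustrative: `x`, `y`, and `z` are simulated values with no real-world meaning, and the linear relationships are assumptions chosen for simplicity.

```r
set.seed(1)  # arbitrary seed for reproducibility
n <- 5000

# z plays the role of a confounding variable (all values are simulated)
z <- rnorm(n)
x <- z + rnorm(n)   # predictor depends on z, but not on y
y <- z + rnorm(n)   # response depends on z, but not on x

# x and y are associated even though neither causes the other
cor_xy <- cor(x, y)

# After removing the part of each variable explained by z,
# the remaining association is close to zero
cor_adjusted <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

c(unadjusted = cor_xy, adjusted_for_z = cor_adjusted)
```

In a real study the confounding variable is often unmeasured, so this kind of adjustment may not be possible.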


## <a name="01q7">Question 7</a>
---

In [Question 6](#01q6), identify a possible confounding variable in [Study A](#01studyA) or explain why there are no confounding variables. Identify a possible confounding variable in [Study B](#01studyB) or explain why there are no confounding variables.

### <a name="01sol7">Solution to Question 7</a>
---

  
<br> <br> <br> <br>  
  



## <a name="01expriment">Experiments and Observational Studies</a>
---

- An <font color="dodgerblue">**observational study**</font> is a study in which the researcher does not actively control the assignment of individuals to different treatments or levels of a predictor variable.
  - If the treatment groups are chosen by the individuals in the study, the samples in each treatment group are likely to differ in some meaningful way other than just the treatment.

- An <font color="dodgerblue">**experiment**</font> is a study in which the researcher controls the assignment of individuals to different treatments or levels of a predictor variable.
- In a <font color="dodgerblue">**randomized experiment**</font>, the predictor variable for each individual is determined randomly, before the response variable is measured.
  - If treatment groups are randomly determined, they should be similar in every way except for the treatment itself.
  - <font color="mediumseagreen">**When properly designed, randomized experiments can show that a predictor variable causes a change in the response variable.**</font>

<font color="tomato">**There are almost always confounding variables in observational studies. Thus observational studies can almost never be used to establish causation.**</font> Sometimes it is not possible to design an experiment for ethical or practical reasons. We can still investigate whether two variables are *associated* with each other in observational studies.
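
The claim that randomly formed treatment groups should be similar can be checked with a short simulation. The sketch below assumes a hypothetical study with 1,000 participants and one baseline characteristic, `age`. The group labels `"A"` and `"B"` and all numeric values are invented for illustration.

```r
set.seed(42)  # arbitrary seed for reproducibility
n <- 1000

# Hypothetical baseline characteristic measured before treatment (simulated)
age <- rnorm(n, mean = 40, sd = 10)

# Random assignment: shuffle 500 "A" labels and 500 "B" labels
group <- sample(rep(c("A", "B"), each = n / 2))

table(group)              # equal group sizes by construction
tapply(age, group, mean)  # group means of the baseline variable should be close
```

Because the assignment ignores every characteristic of the participants, the same balance is expected, on average, for characteristics we did not measure.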



## <a name="01q8">Question 8</a>
---

In [Question 6](#01q6), determine whether [Study A](#01studyA) is an observational study or an experiment. Determine whether [Study B](#01studyB) is an observational study or an experiment. Explain how you determined your answers.

### <a name="01sol8">Solution to Question 8</a>
---

  
<br> <br> <br> <br>  
  



# <a name="CC License">Creative Commons License Information</a>
---

![Creative Commons
License](https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png)

*Statistical Methods: Exploring the Uncertain*  by [Adam
Spiegler (University of Colorado Denver)](https://github.com/CU-Denver-MathStats-OER/Statistical-Theory)
is licensed under a [Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International
License](http://creativecommons.org/licenses/by-nc-sa/4.0/). This work is funded by an [Institutional OER Grant from the Colorado Department of Higher Education (CDHE)](https://cdhe.colorado.gov/educators/administration/institutional-groups/open-educational-resources-in-colorado).

For similar interactive OER materials in other courses funded by this project in the Department of Mathematical and Statistical Sciences at the University of Colorado Denver, visit <https://github.com/CU-Denver-MathStats-OER>.