# Loading Libraries for This Session



Have you installed `DataExplorer` and `dplyr`? You will need them to run
some of the code cells in this document.

-   If so, great!
    -   We still need to load the packages before we can access the
        functions and data they contain.
-   If you have not installed the packages, you will need to before
    running the code cell below.
    -   There should be a message at the top of this window asking if you
        would like to install. Click that link!

    -   Or, run the command
        `install.packages(c("DataExplorer", "dplyr"))` in the console.

**Run the code cell below to load the `DataExplorer` and `dplyr` packages.**

In [None]:
# message=FALSE means output to console not displayed as output

library(DataExplorer)
library(dplyr)

-   Nothing appears on screen, but the code is running in the console.
-   The `message=FALSE` chunk option means that messages generated by
    the code (such as package startup messages) are not displayed in
    the output.
-   See <https://rmarkdown.rstudio.com/lesson-3.html> for a deeper
    explanation of code chunk options.
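
If you are unsure whether the packages are installed, one way to check from code is sketched below. This is only a sketch and not part of the original workflow: it relies on base R's `requireNamespace()`, and the names `pkgs` and `installed` are made up for illustration.

```r
# Packages this document needs
pkgs <- c("DataExplorer", "dplyr")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
installed

# Names of any packages that still need install.packages()
pkgs[!installed]
```

Any package listed by the last line can then be installed with `install.packages()` as described above.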

# Introduction to Statistical Inference

------------------------------------------------------------------------

<span style="color: blue;">**Statistics**</span> is the study of
collection, organization, analysis, interpretation, and presentation of
data.

## Getting to Know Your Data

------------------------------------------------------------------------

The package `dplyr` contains a dataset called `storms`. Let’s check out
the documentation for the data.

In [None]:
?storms  # must load dplyr

-   Whenever you have a question about a dataset or function, type
    `?data_name` or `?function_name` to view help documentation.
-   For example, run the code cell below to learn about the `glimpse`
    function.

In [None]:
?glimpse  # must load dplyr

### Getting a Glimpse of the Data

------------------------------------------------------------------------

In [None]:
glimpse(storms)  # get a glimpse of storms data

## Question 1

------------------------------------------------------------------------

What do you notice about the dataset `storms`? How would you summarize
what information is contained in this dataset?

### Solution to Question 1

------------------------------------------------------------------------

  
  
  

## Question 2

------------------------------------------------------------------------

What additional information would be nice to know about the `storms`
dataset? Write some questions below. Then experiment with different
commands in the code cell below to try to answer them.

### Solution to Question 2

------------------------------------------------------------------------

In [None]:
options(width = 100)  # sets width of output so summary displays

summary(storms)  # summary of each variable in storms
#head(storms)  # prints first 6 rows to screen
#View(storms)  # opens new tab to display full dataset
#tail(storms)  # prints last 6 rows to screen
#plot_intro(storms)  # summary of data, requires DataExplorer
#plot_missing(storms)  # where is missing data, requires DataExplorer

# The Structure of Data

------------------------------------------------------------------------

<span style="color: blue;">**Data frames**</span> are two-dimensional
data objects and are the **fundamental** data structure used by most of
R’s libraries of functions and datasets.

-   Tabular data is <span style="color: blue;">**tidy**</span> if each
    row corresponds to a different observation and each column
    corresponds to a different variable.

Each column of a data frame is a
<span style="color: blue;">**variable**</span> (stored as a **vector**)
of possibly different data types. If the variable:

-   Is measured or counted by a number, it is a
    <span style="color: blue;">**quantitative**</span> or
    <span style="color: blue;">**numerical**</span> variable.
    -   Quantitative variables may be discrete (integers) or continuous
        (decimals).
-   Groups observations into different categories or rankings, it is a
    <span style="color: blue;">**qualitative**</span> or
    <span style="color: blue;">**categorical**</span> variable.
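
To see how R is storing each variable, it can help to check the class of every column. The sketch below uses a small made-up data frame (`toy`, with invented values) rather than `storms`, so it runs without any extra packages.

```r
# A made-up data frame with one column of each common type
toy <- data.frame(
  name     = c("Amy", "Bob", "Cal"),   # text labels
  wind     = c(25.5, 40.0, 65.2),      # continuous (decimals)
  category = c(1L, 3L, 5L),            # discrete (integers)
  stringsAsFactors = FALSE             # keep text as character, not factor
)

# Report how R stores each column
col_types <- sapply(toy, class)
col_types
```

The same call, `sapply(storms, class)`, should work on `storms` once `dplyr` is loaded.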

## Working with Categorical Data

------------------------------------------------------------------------

-   Sometimes we think a variable is one data type, but it is actually
    being stored (and thus interpreted by R) as a different data type.
-   One common issue is that categorical data is stored as characters. We
    would like observations with the same values to be grouped together.

Categorical data should be stored as a
<span style="color: blue;">**factor**</span> in R.

In [None]:
storms$status <- factor(storms$status)
summary(storms$status)
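
To see why the conversion matters, compare how R summarizes the same values stored as characters and as a factor. The vector below is made up for illustration. Its labels are only meant to resemble storm statuses.

```r
# Made-up status labels stored as plain character strings
status_chr <- c("hurricane", "tropical storm", "hurricane",
                "tropical depression")

# For characters, summary() reports only length, class, and mode
summary(status_chr)

# Converting to a factor groups identical values into levels
status_fct <- factor(status_chr)
levels(status_fct)    # the distinct categories
summary(status_fct)   # number of observations in each category
```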

# Data Visualization

------------------------------------------------------------------------

For an overview of different types of plots, see the [Overview of Plots
Help
Document](https://htmlpreview.github.io/?https://github.com/aspiegler/Statistical-Theory/blob/main/Overview-of-Plots.html).

The type of analysis we can do depends on whether:

-   We are investigating a single variable, or looking for correlation
    between multiple variables.
-   The variable(s) are numerical and/or categorical.
-   The data satisfies certain assumptions.

In [None]:
par(mfrow = c(2, 2))  # Create a 2 x 2 array of plots

# The next 4 plots created will be arranged in the array
boxplot(storms$wind)  # create boxplot of wind speed

# Code below creates a histogram of wind speed
hist(storms$wind,
     col = "steelblue")  # change color of bars

plot(storms$status, 
     col = "gold")  # plots status, which is categorical

plot(wind ~ pressure, data = storms)  # plots two numerical variables

In [None]:
par(mfrow = c(1, 1))   # change settings so one image displayed in a window

# Compare numerical wind speed for different categories of storms
plot(wind ~ status, data = storms, col = "springgreen4")

-   **For one numerical variable:** histograms, boxplots, and density
    plots.
-   **For one categorical variable:** barplots and pie charts.
-   **For two numerical variables:** scatter plots.
-   **For one numerical and one categorical variable:** side-by-side
    boxplots or density plots.
-   **For two categorical variables:** grouped barplots.
-   **For three or more variables:** add distinguishing colors,
    shape/line types, and/or interactivity to plots.
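
Two of the plot types listed above, grouped barplots and density plots, are not shown in the earlier code cells. A minimal base R sketch follows. It assumes the built-in `mtcars` dataset as a stand-in so that it runs without loading any packages, and it treats the `cyl` and `am` columns as categorical.

```r
# Two categorical variables: count cars by cylinders and transmission type
counts <- table(cylinders = mtcars$cyl, transmission = mtcars$am)

# Grouped barplot: one cluster of bars per transmission type,
# one bar per number of cylinders within each cluster
barplot(counts, beside = TRUE, legend.text = TRUE,
        xlab = "Transmission (0 = automatic, 1 = manual)",
        ylab = "Number of cars")

# One numerical variable: density plot of miles per gallon
plot(density(mtcars$mpg), main = "Density of mpg")
```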

# Statistical Inference

------------------------------------------------------------------------

-   A <span style="color: blue;">**population**</span> includes all
    individuals or objects of interest.
-   A <span style="color: blue;">**sample**</span> is a subset of the
    population.
-   <span style="color: blue;">**Statistical inference**</span> is the
    process of drawing conclusions about the entire population based on
    information in a sample.
-   This semester we will **focus on inference**, and we will need some
    **probability** to do so.
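
The idea of using a sample to learn about a population can be sketched with a small simulation. Everything below is made up for illustration: the population is simulated, and the seed, sizes, mean, and standard deviation are arbitrary choices.

```r
set.seed(2024)  # arbitrary seed so the sketch is reproducible

# A simulated population of 10,000 measurements
population <- rnorm(10000, mean = 50, sd = 10)

# A random sample of 100 individuals from that population
sample_values <- sample(population, size = 100)

# The sample mean is our estimate of the (usually unknown) population mean
mean(sample_values)
mean(population)
```

In practice we only see the sample, so the second mean is exactly what inference tries to estimate.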

<figure>
<img
src="https://lh3.googleusercontent.com/BbpECaHMPmwzL30K21igjP4mt8ryFChYQRhAvn3U4MxaC56KnKCXsahklT8-vJRP_-o=w2400"
width="400" alt="Diagram of Statistical Inference" />
<figcaption aria-hidden="true">Diagram of Statistical
Inference</figcaption>
</figure>

## Question 3

------------------------------------------------------------------------

In the `storms` data example, is the data from a sample or a population?

### Solution to Question 3

------------------------------------------------------------------------

  
  
  

## Question 4

------------------------------------------------------------------------

What statistical questions might be worth investigating among the
variables in the `storms` dataset?

### Solution to Question 4

------------------------------------------------------------------------

  
  
  

# Collecting Data: Sampling

------------------------------------------------------------------------

Since drawing a sample that resembles the population in every way
(except smaller in number) is critical for drawing valid conclusions,
how we pick samples is sometimes the most important step.

<figure>
<img
src="https://lh3.googleusercontent.com/eiJozbALtlXJS1mpD8FImS3_QQ7yRpjr-RJ--O0PNyg0ICC6PmRb8WQy78T34X15iB0=w2400"
width="450" alt="Diagram of Sampling Methods" />
<figcaption aria-hidden="true">Diagram of Sampling Methods</figcaption>
</figure>

## Summary of Sampling Methods

------------------------------------------------------------------------

-   When selecting a <span style="color: blue;">**simple random
    sample**</span>, all individuals are equally likely to be selected.
-   When selecting a <span style="color: blue;">**stratified
    sample**</span>, the population is subdivided into groups based on
    some meaningful characteristic, and individuals are then randomly
    selected from each group.
-   When selecting a <span style="color: blue;">**systematic
    sample**</span>, the first individual is chosen at random. Then a
    rule is used so that every $\mbox{n}^{\mbox{th}}$ individual is
    selected after that.
-   When selecting a <span style="color: blue;">**cluster
    sample**</span>, groups, rather than individual units of the target
    population, are selected at random. For example, only people whose
    phone number ends in 8 are chosen.
-   A <span style="color: red;">**convenience sample**</span> is one in
    which people or elements are selected on the basis of their
    accessibility and availability.
-   <span style="color: red;">**Voluntary sampling**</span> is a type of
    convenience sampling.

<span style="color: blue;">**Sampling bias**</span> occurs when the
method of selecting a sample causes the sample to differ from the
population in some relevant way.

<span style="color: blue;">**Randomly selecting samples is the best way
to avoid bias!**</span>
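
The four random sampling methods above can be sketched in base R. This is an illustration under stated assumptions, not a prescribed procedure: the `population` data frame is made up, the sample sizes and seed are arbitrary, and `id` is assumed to equal the row number.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Made-up population: 100 individuals in 4 groups of 25
population <- data.frame(
  id    = 1:100,
  group = rep(c("A", "B", "C", "D"), each = 25)
)

# Simple random sample: every individual is equally likely to be chosen
srs <- population[sample(nrow(population), size = 12), ]

# Stratified sample: randomly choose 3 individuals from each group
strat_ids <- unlist(lapply(split(population$id, population$group),
                           function(ids) sample(ids, size = 3)))
stratified <- population[strat_ids, ]

# Systematic sample: random starting point, then every k-th individual
k <- 10
start <- sample(k, size = 1)
systematic <- population[seq(from = start, to = nrow(population), by = k), ]

# Cluster sample: choose whole groups at random and keep everyone in them
chosen_groups <- sample(unique(population$group), size = 2)
cluster <- population[population$group %in% chosen_groups, ]
```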

# Collecting Data: Designing Studies

------------------------------------------------------------------------

Often in statistics we would like to investigate whether one variable is
associated with another. Researchers carry out studies to understand the
conditions and causes of certain outcomes.

-   Does smoking cause lung cancer?
-   Is paying people or punishing people a more effective incentive to
    get vaccinated?
-   Is a new vaccine effective at preventing disease?

If we are using one variable to help us understand or predict the values
(or category) of another variable, we call the first variable the
<span style="color: blue;">**explanatory or predictor variable**</span>
and the second the <span style="color: blue;">**response
variable**</span>.
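
R's formula notation mirrors this distinction: in `response ~ explanatory`, the variable to the left of `~` is the response. The short sketch below uses a made-up data frame (`toy`, with invented values) to show the convention.

```r
# Made-up measurements: lower pressure paired with higher wind speed
toy <- data.frame(
  pressure = c(1010, 1000, 990, 980, 970),
  wind     = c(25, 40, 55, 70, 90)
)

# wind is the response, pressure is the explanatory variable
f <- wind ~ pressure
all.vars(f)   # variable names in order: response first, then explanatory

# Scatter plot with the response on the vertical axis
plot(f, data = toy)
```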

## Question 5

------------------------------------------------------------------------

Both studies below are designed to determine whether rewarding
good behavior or punishing bad behavior is a more effective method to
help people quit smoking. Which study do you believe is better designed?
Why?

### Study a

------------------------------------------------------------------------

Employees at a large company voluntarily enroll in a quit smoking study.
When they join, they are provided two options to select from:

-   Option 1 (Reward-based group): If after six months the participant
    has quit smoking, they get an \$800 reward.
-   Option 2 (Deposit-based group): Pay an initial \$150 refundable
    deposit.

If after six months the participant:

-   Has quit smoking, they receive their \$150 deposit back plus an
    additional \$800 reward.
-   Has not quit smoking, then they do not receive their \$150 deposit
    back.

After six months, the success rate is compared between the two groups.

### Study b

------------------------------------------------------------------------

Employees at a large company voluntarily enroll in a quit smoking study.

-   When they join, they are randomly assigned to either the
    Reward-based or the Deposit-based group (same as described above).

After six months, the success rate is compared between the two groups.

### Solution to Question 5

------------------------------------------------------------------------

  
  
  

## Confounding Variables

------------------------------------------------------------------------

A variable that is associated with both the explanatory variable and the
response variable is called a <span style="color: blue;">**confounding
variable**</span>.

<figure>
<img
src="https://lh3.googleusercontent.com/IPFq9yoP6CaEy4mp7HkZJ0qq7UakZDQLPVp_Eu5ShphIGGxrdmfiOncs8LCjs0rPyrc=w2400"
width="250" alt="Diagram of Confounding Variable" />
<figcaption aria-hidden="true">Diagram of Confounding
Variable</figcaption>
</figure>
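
A small simulation can show how a confounding variable creates an association between two variables that do not affect each other. Everything below is made up for illustration: `z` plays the role of the confounder, and the coefficients, seed, and sample size are arbitrary. The final step uses a linear model (`lm`) as one possible way to adjust for `z`.

```r
set.seed(7)  # arbitrary seed so the sketch is reproducible
n <- 500

z <- rnorm(n)           # confounding variable
x <- 2 * z + rnorm(n)   # explanatory variable depends only on z
y <- 3 * z + rnorm(n)   # response variable depends only on z, not on x

# x and y are strongly correlated even though x has no effect on y
cor(x, y)

# Once z is included in the model, the coefficient on x is close to zero
coef(lm(y ~ x + z))["x"]
```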

## Experiments and Observational Studies

------------------------------------------------------------------------

-   An <span style="color: blue;">**observational study**</span> is a
    study in which the researcher does not actively control the value of
    any variable.
-   An <span style="color: blue;">**experiment**</span> is a study in
    which the researcher actively controls one or more of the
    explanatory variables.
-   The different categories of the explanatory variable are called
    <span style="color: blue;">**treatments**</span>.
-   In a <span style="color: blue;">**randomized experiment**</span> the
    explanatory variable for each unit is determined randomly, before
    the response variable is measured.
-   If treatment groups are randomly determined, they should be similar
    in every way except for the treatment.
-   <span style="color: red;">There are almost always confounding
    variables in observational studies. Thus observational studies can
    almost never be used to establish causation.</span>
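
Random assignment to treatments can be sketched in a few lines of base R. The participant labels, group sizes, and seed below are made up for illustration. The treatment names are borrowed from the quit-smoking studies above.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Made-up list of 20 enrolled participants
participants <- paste0("employee_", 1:20)

# Shuffle 10 "Reward" and 10 "Deposit" labels into a random order
treatment <- sample(rep(c("Reward", "Deposit"), each = 10))

# Each participant receives whichever label lands in their position
assignment <- data.frame(participant = participants, treatment = treatment)
table(assignment$treatment)
```

Because the labels are shuffled rather than chosen by participants, characteristics such as motivation should be balanced across the two groups on average.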