# Introduction to Statistical Science


## Introduction

This module introduces the concept of data and covers: data structures, variables, summaries, and basic data collection techniques. This is the first step towards understanding and working with data.   

## Learning Outcomes 

- Introduction to Data
- Data types
- Population and Sample
- Bias
- Basic Statistics

### Reading and Resources

We invite you to further supplement this notebook with the following recommended resources:

- Diez, D., Barr, C. & Çetinkaya-Rundel, M. (2017). Chapter 1: Introduction to Data *OpenIntro Statistics (3rd Ed.)*.

  https://www.openintro.org/stat/textbook.php?stat_book=os 


- Rauser. J. (2014). Statistics without the agonizing pain. Keynote Strata + Hadoop 2014. https://www.youtube.com/watch?v=5Dnw46eC-0o

[Paper referenced in the video above] 
 - Lefèvre T, Gouagna L-C, Dabiré KR, Elguero E, Fontenille D, Renaud F, et al. (2010) Beer Consumption Increases Human Attractiveness to Malaria Mosquitoes. PLoS ONE 5(3): e9546. https://doi.org/10.1371/journal.pone.0009546

   
- Scheurer, V. (2011). Convicted on Statistics? *Understanding Uncertainty.* University of Cambridge. Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales (CC BY-NC-SA 2.0 UK) https://understandinguncertainty.org/node/545  

# Brief History of Statistics

![History.png](attachment:History.png "History of statistics")

Timeline of Statistics. See related article here: https://www.statslife.org.uk/history-of-stats-science/1190-the-timeline-of-statistics


The image above tells the story of how statistical conventions were developed over time. Some of the key developments you can see include:  

- Beginning of civilization – census of the population or trade records
- 5th century BCE – Thucydides, in "History of the Peloponnesian War", describes the use of the mode to determine the height of the walls
- 9th century CE – earliest writing on statistics, "Manuscript on Deciphering Cryptographic Messages", written by Al-Kindi on the use of statistics and frequency analysis to decipher encrypted messages.
- 14th century, "Nuova Cronica", history of  Florence – use of statistical information on population, commerce and trade, education, religious facilities, and has been described as the first introduction of statistics as a positive element in history.
- 1662 – birth of Statistics, development of early statistical and census methods by John Graunt and William Petty as a framework for modern demography.
- 1713 – Ars Conjectandi (Latin for "The Art of Conjecturing"), a book by Jacob Bernoulli on probability theory, containing the very first version of the law of large numbers. Abraham de Moivre's The Doctrine of Chances (1718) – first textbook on probability theory.
- 1761 – Thomas Bayes proved Bayes' theorem which describes the probability of an event, based on prior knowledge of conditions that might be related to the event. 
- 1765 – Joseph Priestley invented the first timeline charts.
- Beginning of 19th century – central limit theorem (Pierre-Simon Laplace) and method of least squares (Carl Friedrich Gauss), with further development of the theory of errors later in the 19th century
- Late 19th and early 20th centuries – emergence of modern statistics, establishment of Royal Statistical Society and American Statistical Association
- From the same period onward – development of correlation, principal component analysis (1903), design of experiments (1935), the Turing machine (1936), the Kaplan–Meier estimator (1958), creation of the statistical programming language R (1993), and the first appearance of the term "big data" (1997)




# Examples of Applications of Statistics

Listed below are just a few interesting applications of statistics. In modern times, it would be difficult to find a sector of industry, government or research that does not apply statistics in some way.

- Astrostatistics - used to process the vast amount of data produced by automated scanning of the cosmos, to characterize complex datasets, and to link astronomical data to astrophysical theory
- Business analytics – applies statistical methods to data sets (including Big Data) to develop new insights and understanding of business performance and opportunities.
- Demography – the statistical study of populations.
- Reliability Engineering – the study of the ability of a system or component to perform its required functions under stated conditions for a specified period of time
- Banking - When customers deposit money in a bank, the bank assumes that most of them will not withdraw the full amount in the near future, so it lends this money to other customers to earn profit in the form of interest. Banks use statistical methods to manage this service: they compare the number of people making deposits against the number of people requesting loans, and estimate when deposits are likely to be withdrawn.
- Economics - Many concepts in economics depend on statistics. Data collected to measure national income, employment, inflation, and so on are interpreted using statistical methods. The theory of demand and supply and the relationship between exports and imports are also studied statistically. A good example is the census, in which a bureau uses statistical formulas to estimate a country's population.


# Definition of Data

Data and Information are often used interchangeably; however, the extent to which a set of data is informative to someone depends on the extent to which it is unexpected by that person. 

Given below are definitions from the Cambridge Dictionary and the Business Dictionary.

"Information, especially facts or numbers, collected to be examined and considered and used to help with making decisions"
-- Cambridge Dictionary

"Information in raw or unorganized form (such as alphabets, numbers, or symbols) that refer to, or represent, conditions, ideas, or objects. Data is limitless and present everywhere in the universe. See also information and knowledge."
-- Business Dictionary



# Working with Data
- Statistics is the study of how best to collect, analyze, and draw conclusions from data.
 
 ### Steps in the process of working with data:

The first step in the problem solving and decision making process is always to **identify** and **define** the problem. 

   1. Identify a problem - A problem can be regarded as the difference between the actual situation and the desired situation, and it is often stated as a hypothesis. Without a problem or a hypothesis to test, decision-making becomes a difficult and time-consuming process. In some cases, such as data mining, working without a hypothesis or problem can be appropriate.
   
   2. Collect relevant data - Collecting relevant data is a very important step. Many investigations can be addressed with a small number of data collection techniques, analytic tools, and fundamental concepts in statistical inference. A clearly formulated problem asks for what subjects/cases to be studied. It is important to consider how data is collected so that they are reliable and help achieve the goal.
   
   3. Analyze the data - This comprises all the observational studies and experiments that need to be carried out to find (ideally quantified) evidence for or against the hypothesis, so that better-informed decisions can be made with (ideally) quantified uncertainties.
 
   4. Form a conclusion - With all the above mentioned steps, it is possible to gather evidence and come up with a decision.  Statistics as a subject focuses on making all these steps rigorous and efficient.


# Types of Data

At the highest level, two kinds of data exist: **quantitative** and **qualitative**.

## Quantitative data (or Numerical data)
Deals with numbers and things you can measure **objectively** such as: 
   - dimensions such as height, width, and length
   - temperature and humidity
   - prices, and
   - area and volume.

### Types of Quantitative data
   - **Discrete**:
   Discrete data is a count that cannot be made more precise. Typically, it involves integers. For instance, the number of children (or adults, or pets) in your family is discrete data, because you are counting whole, indivisible entities: you cannot have 2.5 kids, or 1.3 pets. Some examples are: 
        - Numerical values obtained by counting (e.g. number of students in a class)
        - Counts of outcomes in an experiment (e.g. number of heads in ten coin tosses)
   - **Continuous**:
   Continuous data, on the other hand, could be divided and reduced to finer and finer levels. For example, you can measure the height of your kids at progressively more precise scales—meters, centimeters, millimeters, and beyond—so height is continuous data.  Some examples are: 
        - Value obtained by measuring (e.g. height of all students)
        - All values in a given interval of numbers (e.g. federal spending)

## Qualitative data (sometimes referred to as Categorical data)

Deals with characteristics and descriptors that cannot be easily measured, but can be observed **subjectively** such as:
- tastes
- textures
- attractiveness, and 
- color. 

Broadly speaking, when you measure something and give it a number value, you create quantitative data. When you classify or judge something, you create qualitative data. But this is just the highest level of data: there are also different types of quantitative and qualitative data.

### Types of Qualitative data

When you classify or categorize something, you create qualitative or attribute data. There are several different types of qualitative data. For example, categorical data, as the name implies, is grouped into some sort of category or multiple categories. If I were to collect information about a person's pet preferences, I would have to group that information by the type of pet. 

Categorical data is also data that is collected in an either/or, or yes/no fashion. For example, if I were to ask the people in my office to check 'yes' or 'no' on whether they had children, then I could display that information in a bar graph or a pie chart comparing co-workers who have children with co-workers who do not. Common types of qualitative data are summarized below:

   - **Categorical**  
        - **Ordinal** : When items are assigned to categories that have some kind of implicit or natural order, such as "Short, Medium, or Tall.", the data is of ordinal nature. Another example is a survey question that asks us to rate an item on a 1 to 10 scale, with 10 being the best. This implies that 10 is better than 9, which is better than 8, and so on.
        - **Nominal** : Any categorical data that doesn't have an order (e.g. "blue", "red", "green"). When collecting unordered or nominal data, we assign individual items to named categories that do not have an implicit or natural value or rank. If we went through a box of Jujubes and recorded the color of each piece, the result would be nominal data.  
    
        - **Binary** : Binary data place things in one of two mutually exclusive categories: right/wrong, true/false, or accept/reject.  It is nominal but with only two categories.  
          

   - **Other**  e.g. Text, Video,  


## Data Types Example 

*county dataset* – summarizes economic and demographic information from 3,143 counties in the United States


![us_census.png](attachment:us_census.png "US Census")

### Data types of variables from the table above are:

name, state : Categorical, Nominal

pop2010 : Numerical, Discrete

fed_spend, poverty : Numerical, Continuous

smoking_ban : Categorical, Ordinal
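
The classification above can be sketched in a few lines of Python. This is only an illustration: the three records are made-up placeholders shaped like the county table (not values from the real dataset), the smoking-ban levels `none`, `partial`, and `comprehensive` are assumed category names, and `describe` is a hypothetical helper.

```python
# Hypothetical records shaped like the county table; all values are made up.
counties = [
    {"name": "County A", "state": "XX", "pop2010": 50000,
     "fed_spend": 6.1, "poverty": 10.5, "smoking_ban": "none"},
    {"name": "County B", "state": "YY", "pop2010": 180000,
     "fed_spend": 7.3, "poverty": 12.0, "smoking_ban": "partial"},
    {"name": "County C", "state": "YY", "pop2010": 27000,
     "fed_spend": 8.8, "poverty": 25.0, "smoking_ban": "comprehensive"},
]

# Ordinal columns need their order spelled out (assumed level names).
ORDINAL_LEVELS = {"smoking_ban": ["none", "partial", "comprehensive"]}


def describe(column, value):
    """Return a (kind, subtype) label for one value of a column."""
    if column in ORDINAL_LEVELS:
        return ("categorical", "ordinal")
    if isinstance(value, bool):          # checked before int: bool is an int
        return ("categorical", "binary")
    if isinstance(value, int):
        return ("numerical", "discrete")
    if isinstance(value, float):
        return ("numerical", "continuous")
    return ("categorical", "nominal")


labels = {col: describe(col, val) for col, val in counties[0].items()}
for col, label in labels.items():
    print(col, label)

# Ordinal data can be ranked; nominal data such as `state` cannot.
levels = ORDINAL_LEVELS["smoking_ban"]
strictest = max(counties, key=lambda c: levels.index(c["smoking_ban"]))["name"]
print("Strictest ban:", strictest)
```

Inferring the type from the Python value is a shortcut that works for this toy table; in practice the analyst decides the type from what the variable means, not from how it is stored.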


# Relationships Between Data
Many analyses are looking for a relationship between two or more variables. For example, a data scientist may like to answer some of the following questions:

(1) Is federal spending, on average, higher or lower in counties with high rates of poverty?

(2) If homeownership is lower than the national average in one county, will the percent
of multi-unit structures in that county likely be above or below the national average?

(3) Which counties have a higher average income: those that enact one or more smoking
bans or those that do not?

To answer these questions, exploratory data analysis must first be conducted. Examining summary statistics could provide insights for each of the three questions about counties. Additionally, graphs can be used to visually summarize data and are useful for answering such questions as well. The scatter plot is a basic tool that shows the relationship between two numeric variables. More techniques will be discussed as we progress through the course.

Here are a few key definitions that you will need to be familiar with as we move forward:

- **Independent variables**
  If two variables are not associated, then they are said to be independent. That is, two variables are independent if there is no evident relationship between the two.
  
- **Associated, or dependent** 
  When two variables show some connection with one another, they are called associated variables. Associated variables can also be called dependent variables, and vice versa.
    - **Positive association** - Two variables are positively associated when the value of one tends to increase as the value of the other increases; for a linear relationship, this corresponds to a positive slope. For example, the amount people spend is positively associated with the money people make, if we assume that people who earn more spend more.
    - **Negative association** - Two variables are negatively associated when the value of one tends to go down as the value of the other goes up. This is often characterized by a downward trend when the variables are plotted.

   
NOTE: A pair of variables are either related in some way (associated) or not (independent). **No pair of variables is both associated and independent**



Scatterplots are used to compare two sets of numerical data. The paired values are plotted on two axes, each labelled as for continuous data. When there is a good correlation between the two variables and the pattern is linear, a straight "line of best fit" can be drawn through the data points; it need not pass through the origin. Nonlinear patterns instead follow some curve. When there is no pattern, there is no clear relationship between the variables.

![image.png](attachment:image.png "linear and nonlinear patterns")

[ Source: Walker, J.I. (2017)] 

The graphs above show relationships between data using scatter plots. Variables whose points follow a straight-line pattern are linearly associated, while nonlinear relationships follow a curve. 
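
To make "association" and "line of best fit" concrete, here is a minimal sketch that computes the Pearson correlation coefficient and a least-squares line by hand. The income and spending figures are invented for illustration, and `pearson_r` and `best_fit_line` are hypothetical helper names, not functions from any library.

```python
from math import sqrt


def _sums(xs, ys):
    """Return the means and the sums of squared/cross deviations."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return mx, my, sxy, sxx, syy


def pearson_r(xs, ys):
    """Correlation in [-1, 1]; the sign gives the direction of association."""
    _, _, sxy, sxx, syy = _sums(xs, ys)
    return sxy / sqrt(sxx * syy)


def best_fit_line(xs, ys):
    """Least-squares slope and intercept of the line y = slope * x + intercept."""
    mx, my, sxy, sxx, _ = _sums(xs, ys)
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical data (thousands of dollars per year).
income = [30, 40, 50, 60, 70]
spending = [25, 33, 41, 46, 55]

r = pearson_r(income, spending)
slope, intercept = best_fit_line(income, spending)
print(f"r = {r:.3f}, spending ~ {slope:.2f} * income + {intercept:.2f}")

# Reversing one variable flips the direction of the association.
r_reversed = pearson_r(income, spending[::-1])
print(f"r with reversed spending = {r_reversed:.3f}")
```

A correlation near zero would suggest no linear relationship, but it would not rule out a nonlinear one, which is why plotting the data first is still worthwhile.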

# Variable Types

There are two key types of variables, which are listed below.

- **Explanatory variable** – a variable, or a set of variables, that can influence the response variable and/or explain changes in it
- **Response variable** – the observed variable whose changes we want to explain

To identify the explanatory variable in a pair of variables, determine which of the two is suspected of affecting the other and plan an appropriate analysis.

Consider a problem such as this: 

*Is federal spending, on average, higher or lower in counties with high rates of poverty?*

If we suspect poverty might affect spending in a county, then poverty is the explanatory variable and federal spending is the response variable in the relationship.

**NOTE**: In some cases, there is no explanatory or response variable.

# Association does not imply causation
Labeling variables as explanatory and response **does not guarantee the relationship between the two is actually causal**, even if there is an association, or *correlation*, identified between the two variables. We use these labels only to keep track of which variable we suspect affects the other.

Note that "causes" is an _asymmetric_ relation (X causes Y is different from Y causes X), whereas "is correlated with" is a _symmetric_ relation.

For instance, homeless population and crime rate might be correlated, in that both tend to be high or low in the same locations. It is equally valid to say that homeless population is correlated with crime rate, or that crime rate is correlated with homeless population. To say that crime causes homelessness, or that homeless populations cause crime, are different statements, and correlation does not imply that either is true. For instance, the underlying cause could be a third variable such as drug abuse or unemployment.



# Data Collection Types
There are two primary types of data collection: observational studies and experiments.

- **Observational Studies**:
  - Observational studies can provide evidence of a naturally occurring association between variables, but they cannot by themselves show a causal connection. For example, data may be collected via surveys, records (like medical or company records), or follow a cohort of many similar individuals for a particular study. In each of these situations, researchers merely observe the data that arise. Hence an observational study is conducted when data is collected in a way that does not directly interfere with how the data arise.
  

- **Experiments**:
  - When the possibility of a causal connection needs to be investigated, an experiment needs to be conducted. There is an explanatory variable and a response variable in this case.
  
  - To check if there really is a causal connection between the explanatory variable and the response, a sample of individuals is collected and split into groups. The individuals in each group are assigned a treatment. 
  
  - When individuals are randomly assigned to a group, the experiment is called a randomized experiment.


In general, association does not imply causation, and causation can only be inferred from a randomized experiment.
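
Random assignment itself is simple to carry out. The sketch below shuffles a list of subjects and splits it in half; the volunteer names, the group size of 20, and the helper name `randomize` are all assumptions made for illustration.

```python
import random


def randomize(subjects, seed=None):
    """Randomly split subjects into (treatment, control) groups of near-equal size."""
    rng = random.Random(seed)      # a seed makes the assignment reproducible
    shuffled = list(subjects)      # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical volunteers.
volunteers = [f"volunteer_{i}" for i in range(1, 21)]
treatment, control = randomize(volunteers, seed=42)

print("Treatment group:", sorted(treatment))
print("Control group:  ", sorted(control))
```

Because chance alone decides who lands in which group, neither the researcher's preferences nor the subjects' characteristics can systematically influence the split.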


# Observational Studies

Generally, data in observational studies are collected only by monitoring what occurs, while experiments require the primary explanatory variable in a study be assigned for each subject by the researchers.

Making causal conclusions based on experiments is often reasonable. However, making the same causal conclusions based on observational data can be treacherous and is not recommended. Thus, observational studies are generally only sufficient to show associations.

*Example*: Observational study to track sunscreen use and if it is related to skin cancer.

*Observed fact*: The more sunscreen someone used, the more likely the person was to have skin cancer.

*Question*: Does this mean sunscreen causes skin cancer?  

The observations in this study cannot by themselves establish _causation_. At most, an association can be determined.



# Confounding Variables

- In statistics, a **confounding variable** is a variable that influences both the dependent variable and independent variable causing a spurious association. Confounding is a causal concept, and as such, cannot be described in terms of correlations or associations. 

- A confounding variable is also sometimes referred to as a *confounding factor, lurking variable,* or *confounder*, and is a variable correlated with both the explanatory and response variables.

Continuing from the previous section, say, some previous research tells us that using sunscreen actually reduces skin cancer risk, so maybe there is another variable that can explain this hypothetical association between sunscreen usage and skin cancer. One important piece of information that is absent is sun exposure. If someone is out in the sun all day, she is more likely to use sunscreen and more likely to get skin cancer. Exposure to the sun is unaccounted for in the simple investigation.
Hence, sun exposure is a confounding variable.
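
The sunscreen example can be imitated with a small simulation. Every probability below is invented purely for illustration (it is not an estimate of any real risk), and the simulated world is built so that sunscreen lowers risk while heavy sun exposure raises both sunscreen use and risk. `simulate` and `cancer_rate` are hypothetical helper names.

```python
import random


def simulate(n=200_000, seed=0):
    """Return a list of (high_exposure, uses_sunscreen, has_cancer) tuples."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        high_exposure = rng.random() < 0.5
        # Assumption: people who are in the sun a lot use sunscreen more often.
        uses_sunscreen = rng.random() < (0.8 if high_exposure else 0.2)
        # Assumption: exposure raises risk; sunscreen cuts it by 40%.
        risk = 0.15 if high_exposure else 0.02
        if uses_sunscreen:
            risk *= 0.6
        people.append((high_exposure, uses_sunscreen, rng.random() < risk))
    return people


def cancer_rate(people, sunscreen, exposure=None):
    """Cancer rate among (non-)users, optionally within one exposure level."""
    group = [p for p in people
             if p[1] == sunscreen and (exposure is None or p[0] == exposure)]
    return sum(p[2] for p in group) / len(group)


people = simulate()

print("Ignoring exposure:")
print("  users    ", round(cancer_rate(people, True), 4))
print("  non-users", round(cancer_rate(people, False), 4))

for exposure in (True, False):
    print("High exposure:" if exposure else "Low exposure:")
    print("  users    ", round(cancer_rate(people, True, exposure), 4))
    print("  non-users", round(cancer_rate(people, False, exposure), 4))
```

In this made-up world, sunscreen users show the higher cancer rate when exposure is ignored, yet the lower rate within each exposure level. Comparing like with like (stratifying by the confounder) is what reveals the protective effect.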

- One method to justify making causal conclusions from observational studies is to exhaust the search for confounding variables, but there is no guarantee that all confounding variables can be examined or measured.


Confounding arises due to various reasons.  Some of the reasons are listed below:

- **Confounding by indication**: When evaluating the effect of a particular drug, many times people who take the drug differ from those who do not according to the medical indication for which the drug is prescribed.
- **Selection bias**: Not everyone invited to participate in a study participates, causing imbalance between groups.
- **Recall bias**: Not everyone with an exposure recalls their exposure history correctly, perhaps causing uneven recall in different groups.
- **Many other ways**, including unbalanced groups by chance, especially in smaller studies.

**Collinearity** (or multicollinearity or ill-conditioning) occurs when independent variables in a regression are so highly correlated that it becomes difficult or impossible to distinguish their individual effects on the dependent
variable. Thus, collinearity can be viewed as an extreme case of confounding, when essentially the same variable is entered into a regression equation twice, or when two variables contain exactly the same information as two other variables, and so on.



# Observational Study or an Experiment 

Please watch this 12-minute keynote talk by John Rauser. Statistics Without the Agonizing Pain. Strata Hadoop 2014.

https://www.youtube.com/watch?v=5Dnw46eC-0o 


In the video above by John Rauser, what kind of study was used to determine if beer consumption increases human attractiveness to malaria mosquitoes: observational or experimental?

Answer - This is an **experiment**, as the researchers assigned the volunteers to a treatment group (beer or water).

For hypothesis testing, many studies focus on the comparison of groups. 

In observational studies, this comparison focuses on how the response variable differs between naturally occurring groups in a sample from the population of interest. 

In experiments, this comparison focuses on how an attribute differs across treatment groups. 


# Forms of Observational Studies

Observational studies come in two forms: prospective and retrospective studies. 

- **Prospective study** – A prospective study identifies individuals and collects information as events unfold. Prospective studies are used when precise estimates of either the incidence or the likelihood of an outcome are required.

- **Retrospective study** – Retrospective studies collect data after events have taken place.  A retrospective study looks backwards and examines factors in relation to an outcome that is established at the start of the study.  Special care is to be taken to avoid sources of bias and confounding in retrospective studies.


# Population versus Sample

As we discussed previously, the first step towards working with data is "problem-identification".  Each problem refers to a target population. Say our problem is, "Does a new drug reduce the number of deaths in patients with severe heart disease?". In this study, the target population is "patients with severe heart disease".  Often, it is too expensive to collect data for every case in a population. 

Instead, a *sample* is taken. A sample represents a subset of the cases and is often a small fraction of the population. For instance, 100 patients with heart disease (or some other number) in the population might be selected, and this sample data may be used to provide an estimate of the population average.  

- **Population** is a collection of people, items, or events and includes all members of a defined group that we are studying or collecting information on for data driven decisions.

- **Sample** is a subset, a small fraction of a population.

*Population* is any large collection of objects or individuals, such as Americans, students, or trees, about which information is desired.  A *parameter* is any summary number, like an average or percentage, that describes the entire population.
The population mean μ and the population proportion p are two different population parameters. 

We might be interested in learning about μ, the average weight of all middle-aged female Americans. The population consists of all middle-aged female Americans, and the parameter is μ. Or, we might be interested in learning about p, the proportion of likely American voters approving of the president's job performance. The population comprises all likely American voters, and the parameter is p.

The problem is that most of the time, we don't know the real value of a population parameter. The best we can do is estimate the parameter! This is where samples and statistics come into play.
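
The difference between a parameter and its estimate can be seen in a short simulation. Here we pretend to know an entire population, which is rarely possible in practice; the population size, the mean of 70 and the spread of 12 are arbitrary assumptions for illustration.

```python
import random
from statistics import fmean

rng = random.Random(1)

# Hypothetical population: 100,000 weights (kg) drawn from a bell-shaped curve.
population = [rng.gauss(70, 12) for _ in range(100_000)]

# Parameter: a summary of the whole population (usually unknown in practice).
mu = fmean(population)

# Statistic: the same summary computed from a random sample of 100 cases.
sample = rng.sample(population, 100)
x_bar = fmean(sample)

print(f"population mean (parameter): {mu:.2f}")
print(f"sample mean (statistic):     {x_bar:.2f}")
```

The sample mean will generally not equal μ exactly, and a different random sample would give a slightly different estimate. Quantifying that sample-to-sample variation is a central task of statistics.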



# Sampling Principles

A finite subset of a population, selected with the objective of investigating the population's properties, is called a ‘*sample*’. The number of units in the sample is known as the sample size.  A sample helps in drawing conclusions about the characteristics of the population: after inspecting the sample, we decide whether to accept or reject a hypothesis about the population.  

- **Randomly** selected 

  In general, we always seek to randomly select a sample from a population. The most basic type of random selection is equivalent to how raffles are conducted. A random sample is a group or set chosen from a larger population or group of factors or instances in a random manner that allows for each member of the larger group to have an equal chance of being chosen.

What if we picked a sample by hand?  It is entirely possible that the sample could be skewed to that person's interests, which may be entirely unintentional. This introduces bias into a sample. Sampling randomly helps resolve this problem. The most basic random sample is called a simple random sample.

- **Representative** sample
  
  A representative sample is a group or set chosen from a larger statistical population or group of factors or instances that adequately replicates the larger group according to whatever characteristic or quality is under study. For instance, if approximately 15% of the United States’ population is of Hispanic descent, a sample of 100 Americans should include around 15 Hispanic people to be representative.

As much as taking random sample decreases bias, bias can _creep_ in many ways.  

- In surveys where the *non-response rate is high*, caution is needed even if people are picked at random.  
  For instance, if only 30% of the people randomly sampled for a survey actually respond, then it is unclear whether the results are representative of the entire population. This *non-response bias* can skew results.

- In cases of _convenience samples_, only individuals easily accessible are included in the sample. For instance, if a political survey is done by stopping people walking on Bay Street, this will not represent all of the city of Toronto. It is often difficult to discern what sub-population a convenience sample represents.



# Sampling methods
Almost all statistical methods are based on the notion of implied randomness.

- **Simple random sampling** – each case in the population has an equal chance of being included in the final sample.
For example, at a fundraising event, every attendee's name is entered for the chance to win a prize. Names are selected at random, then returned to the jar, giving individuals multiple opportunities to win prizes.

![random_sampling.png](attachment:random_sampling.png "Random sampling")

- **Stratified sampling** – random sampling from groups of similar cases
Stratified Sampling is possible when it makes sense to partition the population into groups based on a factor that may influence the variable that is being measured. These groups are then called strata.  An individual group is called a stratum.  With stratified sampling one should:

    - partition the population into groups (strata)
    - obtain a simple random sample from each group (stratum), and
    - collect data on each sampling unit that was randomly sampled from each group (stratum)
    
For example, to pick a stratified sample of people living in the US, we can take the population to be people living in the US and the groups (strata) to be the 4 time zones of the contiguous US (Eastern, Central, Mountain, Pacific). We draw a random sample of 500 people from each of the time zones and present 4 x 500 = 2000 people as our stratified sample.

![stratified_sampling.png](attachment:stratified_sampling.png "Stratified sampling")
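
Both sampling methods in this section can be sketched with Python's standard `random` module. The population below is synthetic: the stratum sizes are arbitrary assumptions, and `simple_random_sample` and `stratified_sample` are hypothetical helper names. Note that `random.sample` draws without replacement, whereas the raffle described earlier returns names to the jar (which would correspond to `random.choices`).

```python
import random


def simple_random_sample(population, k, rng):
    """Every unit has the same chance of selection (drawn without replacement)."""
    return rng.sample(population, k)


def stratified_sample(population, stratum_of, k_per_stratum, rng):
    """Partition into strata, then take a simple random sample from each one."""
    strata = {}
    for unit in population:
        strata.setdefault(stratum_of(unit), []).append(unit)
    return {name: rng.sample(members, k_per_stratum)
            for name, members in strata.items()}


# Synthetic population of (person_id, time_zone) pairs; sizes are made up.
sizes = {"Eastern": 9000, "Central": 6000, "Mountain": 1500, "Pacific": 3500}
population = [(f"{zone}-{i}", zone)
              for zone, size in sizes.items() for i in range(size)]

rng = random.Random(7)
srs = simple_random_sample(population, 2000, rng)
stratified = stratified_sample(population, lambda unit: unit[1], 500, rng)

for zone in sizes:
    in_srs = sum(1 for _, z in srs if z == zone)
    print(f"{zone:9s} simple random: {in_srs:4d}   stratified: {len(stratified[zone])}")
```

The simple random sample mirrors the population's proportions only approximately, so small strata contribute few cases. The stratified sample guarantees 500 cases per time zone, at the cost of needing weights if the strata are later combined into one population-wide estimate.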

# Other types of Sampling 

- **Cluster sample** – break up the population into groups (clusters), sample a fixed number of clusters, and include all observations from each of those clusters in the sample. Cluster sampling is appropriate when there is a lot of case-to-case variability within a cluster but the clusters themselves are very similar to one another.

  It is important to note that, unlike with the strata in stratified sampling, the clusters should be microcosms, rather than subsections, of the population. Each cluster should be heterogeneous. Additionally, the statistical analysis used with cluster sampling is not only different, but also more complicated than that used with stratified sampling. Cluster sampling really works best when there are a reasonable number of clusters relative to the entire population. 
  
  
- **Multistage sample** – similar to cluster sample, but instead of including all observations from each selected cluster, collect random sample within each selected cluster


Use of Stratified, or Cluster sample and preference of one over the other are very subjective. Questions of interest and the study would affect which sampling method is to be chosen. 
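
The contrast between cluster and multistage sampling shows up clearly in code. In this sketch the 30 villages of 40 households each are invented, and `cluster_sample` and `multistage_sample` are hypothetical helper names.

```python
import random


def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep every unit inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]


def multistage_sample(clusters, n_clusters, n_per_cluster, rng):
    """Pick clusters at random, then take a random sample within each one."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen
            for unit in rng.sample(clusters[name], n_per_cluster)]


# Hypothetical population: 30 villages with 40 households each.
clusters = {f"village_{v}": [f"v{v}_h{h}" for h in range(40)]
            for v in range(30)}

rng = random.Random(3)
cluster_units = cluster_sample(clusters, 5, rng)
multistage_units = multistage_sample(clusters, 5, 10, rng)

print("cluster sample size:   ", len(cluster_units))
print("multistage sample size:", len(multistage_units))
```

Either way, only the selected villages need to be visited, which is the main practical attraction of these designs. The multistage version additionally limits the work done within each village.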


# Designing Experiments

Studies where the researchers assign treatments to cases are called experiments. When this assignment includes randomization, we refer to them as randomized experiments. Randomized experiments are fundamentally important when trying to show a causal connection between two variables. They are generally built on four principles:


- **Controlling** the differences in the group

  Suppose a farmer wishes to evaluate a new fertilizer. She uses the new fertilizer on one field of crops (A), while using her current fertilizer on another field of crops (B). The irrigation system on field A has recently been repaired and provides adequate water to all of the crops, while the system on field B will not be repaired until next season. She concludes that the new fertilizer is far superior.
  
  The problem with this experiment is that the farmer has neglected to control for the effect of the differences in irrigation. This leads to experimental bias, the favoring of certain outcomes over others. To avoid this bias, the farmer should have tested the new fertilizer in identical conditions to the control group, which did not receive the treatment. Without controlling for outside variables, the farmer cannot conclude that it was the effect of the fertilizer, and not the irrigation system, that produced a better yield of crops.
  

- **Randomization** to account for variables that the researchers cannot control

  Because it is generally extremely difficult for experimenters to eliminate bias using only their expert judgment, the use of randomization in experiments is common practice. In a randomized experimental design, objects or individuals are randomly assigned (by chance) to an experimental group. Using randomization is the most reliable method of creating homogeneous treatment groups, without involving any potential biases.
  
  
- **Replication** by collecting a sufficiently large sample

  Although randomization helps to ensure that treatment groups are as similar as possible, the results of a single experiment, applied to a small number of objects or subjects, should not be accepted without question. To improve the significance of an experimental result, replication, the repetition of an experiment on a large group of subjects, is required.


- **Blocking** is used when the researchers know that other variables influence the response
 
  Experimental subjects are first divided into homogeneous blocks before they are randomly assigned to a treatment group. If, for instance, an experimenter had reason to believe that age might be a significant factor in the effect of a given medication, he might choose to first divide the experimental subjects into age groups, such as under 30 years old, 30-60 years old, and over 60 years old. Then, within each age level, individuals would be assigned to treatment groups using a completely randomized design. In a block design, both control and randomization are considered.
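
A block design along the lines of the age example can be sketched as follows. The subject IDs and ages are randomly generated placeholders, the age boundaries follow the three groups named above (with ages 30 and 60 placed in the middle block as an assumption), and `age_block` and `blocked_assignment` are hypothetical helper names.

```python
import random


def age_block(age):
    """Map an age to one of the three blocks used in the example."""
    if age < 30:
        return "under 30"
    if age <= 60:
        return "30-60"
    return "over 60"


def blocked_assignment(subjects, rng):
    """Randomize to treatment/control separately within each age block."""
    blocks = {}
    for subject_id, age in subjects:
        blocks.setdefault(age_block(age), []).append(subject_id)

    assignment = {}
    for block, members in blocks.items():
        rng.shuffle(members)               # randomization happens inside the block
        half = len(members) // 2
        for subject_id in members[:half]:
            assignment[subject_id] = (block, "treatment")
        for subject_id in members[half:]:
            assignment[subject_id] = (block, "control")
    return assignment


rng = random.Random(11)
# Hypothetical subjects: (id, age) pairs with ages between 18 and 85.
subjects = [(f"s{i}", rng.randint(18, 85)) for i in range(60)]
assignment = blocked_assignment(subjects, rng)

for block in ("under 30", "30-60", "over 60"):
    n_treat = sum(1 for b, g in assignment.values() if b == block and g == "treatment")
    n_ctrl = sum(1 for b, g in assignment.values() if b == block and g == "control")
    print(f"{block:9s} treatment: {n_treat:2d}  control: {n_ctrl:2d}")
```

Because each age block is split evenly, age cannot end up concentrated in one treatment group by bad luck, which a completely randomized design does not guarantee in small studies.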


# Definition of Bias


Randomized experiments are the gold standard for data collection, but they do not ensure an unbiased view of cause-and-effect relationships in all cases. Neglecting to control for differences between groups can cause bias (as in our example of the farmer evaluating fertilizer above). The results of medical experiments are often distorted by placebo effects. A classic example of a placebo is a sugar pill that is made to look like the actual treatment pill. All of these are sources of bias.


Statistical experiments often involve several components: treatment groups, control groups, and placebos. An experiment is a study that imposes a treatment (or control) on the subjects (participants), controls their environment (for example, by restricting their diets or giving them certain dosage levels of a drug or placebo), and records the responses.

The purpose of most experiments is to pinpoint a cause-and-effect relationship between two factors (such as alcohol consumption and impaired vision; or dosage level of a drug and intensity of side effects). An example question that experiments try to answer could be:

Does taking zinc help reduce the duration of a cold? Some studies show that it does.

**Treatment-Group versus Control-Group Tests**

Most experiments try to determine whether some type of experimental treatment (or important factor) has a significant effect on an outcome. For example, does zinc help to reduce the length of a cold? Subjects who are chosen to participate in the experiment are typically divided into two groups: a treatment group and a control group. 

The treatment group consists of participants who receive the experimental treatment whose effect is being studied (in this case, zinc tablets).

The control group consists of participants who do not receive the experimental treatment being studied. Instead, they get a placebo (a fake treatment; for example, a sugar pill); a standard, nonexperimental treatment (such as vitamin C, in the zinc study); or no treatment at all, depending on the situation.

In the end, the responses of those in the treatment group are compared with the responses from the control group to look for differences that are statistically significant (unlikely to have occurred just by chance).
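
One way to make "unlikely to have occurred just by chance" concrete is a permutation test: shuffle the group labels many times and count how often the shuffled difference in means is at least as large as the observed one. The sketch below is illustrative only. The cold durations are made-up numbers, not results from any zinc study, and the function name `permutation_p_value` is a hypothetical helper written for this example.

```python
import random
from statistics import mean


def permutation_p_value(treatment, control, n_permutations=10_000, seed=0):
    """Estimate how often a random relabeling of the subjects gives a
    difference in means at least as large (in absolute value) as observed."""
    rng = random.Random(seed)
    observed = mean(treatment) - mean(control)
    pooled = list(treatment) + list(control)
    n_treatment = len(treatment)

    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # pretend the group labels were assigned by chance
        diff = mean(pooled[:n_treatment]) - mean(pooled[n_treatment:])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_permutations


# Made-up cold durations in days, for illustration only (not real study data)
zinc_group = [5, 6, 4, 7, 5, 6]
placebo_group = [7, 8, 6, 9, 7, 8]

observed, p_value = permutation_p_value(zinc_group, placebo_group)
print(f"observed difference in means: {observed:.2f} days")
print(f"estimated p-value: {p_value:.3f}")
```

A small estimated p-value means that random relabeling rarely produces a gap as large as the observed one, which is the sense in which a difference is called statistically significant.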

**Placebo Tests**

A placebo is a fake treatment, such as a sugar pill. Placebos are given to the control group to account for a psychological phenomenon called the placebo effect, in which patients receiving a fake treatment still report having a response, as if it were the real treatment. For example, after taking a sugar pill a patient experiencing the placebo effect might say, “Yes, I feel better already”. By measuring the placebo effect in the control group, you can tease out what portion of the reports from the treatment group were due to a real physical effect and what portion were likely due to the placebo effect. 

**Bias** - Intentional or unintentional favouring of one group or outcome over other potential groups or outcomes in the population.


Two main categories of biases:
 - Selection bias
 - Response bias
 
**Selection Bias** - The bias that results from an unrepresentative sample is called selection bias:

 - Undercoverage - occurs when some members of the population are inadequately represented in the sample
 - Nonresponse bias - bias that results when respondents differ in meaningful ways from nonrespondents
 - Voluntary response bias - sample members are self-selected volunteers

**Response Bias** - Response bias refers to the bias that results from problems in the measurement process:
- **Leading questions** – questions that encourage the expected answer
- **Social desirability** – responses may be biased toward what the respondents believe is socially desirable

**Reducing Bias in Human Experiments** 
- Split participants into two groups:
  - Treatment group
  - Control group
- The study is ideally _double-blind_: neither the participants nor the researchers who interact with them know which group each participant belongs to.



# Cognitive Bias

![cognitive_bias.png](attachment:cognitive_bias.png "Cognitive bias")

(Source: https://commons.wikimedia.org/wiki/File:The_Cognitive_Bias_Codex_-_180%2B_biases,_designed_by_John_Manoogian_III_(jm3).png)

The diagram above shows the four problems that biases help us address, namely:
- information overload,
- lack of meaning,
- the need to act fast, and
- how to know what needs to be remembered for later.

# Other Types of Biases
You may want to explore the many types of bias that exist. Here are some resources for you to explore:

- List of cognitive biases used in the previous image: https://en.wikipedia.org/wiki/List_of_cognitive_biases

- List of types of statistical biases: https://en.wikipedia.org/wiki/Bias_(statistics) 



# Mean and Median

There are several techniques to explore and summarize numerical data (variables). Some of them are scatter plots (for pairs of variables), dot plots (for one variable), and histograms.

When a numerical variable needs to be studied, we use metrics like mean and median. Suppose we are studying email data and our variable is "number of characters in an email" and our sample size is 50.  

**Mean** – the average of the numbers

  The mean, sometimes called the average, is a common way to measure the center of a distribution of data. To find the mean number of characters in the 50 emails, we add up all the character counts and divide by the number of emails.

**Median** – "middle" of a sorted list

  If the data are ordered from smallest to largest, the median is the observation
right in the middle. If there are an even number of observations, there will be two
values in the middle, and the median is taken as their average.

- Value in the “middle”
- 50% of the points below; 50% above
- Average the two middle values if there is an even number of points

**Mean and Median Example**
Below is a subset of data used by John Rauser in the video.

[John Rauser (2014). Statistics Without the Agonizing Pain. Strata Hadoop 2014. Retrieved from https://www.youtube.com/watch?v=5Dnw46eC-0o]

Number of mosquitos attracted to people drinking water:
21  22  15  12  21  16

Mean: 

$$\overline{X} = \frac{(21+22+15+12+21+16)}{6} = 17.83 $$

Median:

First, reorder:
12  15  16  21  21  22

Then average the two numbers in the middle:
 $$\frac{(16+21)}{2}  = 18.5 $$
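
The same two numbers can be checked with Python's standard `statistics` module. This is a minimal sketch; the variable name `water` is just a label chosen here for the six counts listed above.

```python
from statistics import mean, median

# Mosquito counts for the water drinkers, as listed above
water = [21, 22, 15, 12, 21, 16]

print(round(mean(water), 2))  # 17.83
print(sorted(water))          # [12, 15, 16, 21, 21, 22]
print(median(water))          # 18.5
```

With an even number of observations, `median` averages the two middle values, which matches the hand calculation.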

Mean and median are two statistics we can use to quantify observations. When studying the observations of a variable, we may also be interested in other summaries, such as the minimum, maximum, quartiles, variability, the shape of the distribution, and outliers.

**NOTE**: The mean has one main disadvantage: it is particularly susceptible to the influence of **outliers**. Outliers are values that are unusual compared to the rest of the data set by being especially small or large in numerical value. Suppose we want the average salary of 5 employees whose salaries are 40k, 50k, 45k, 40k, and 100k. The mean is the sum of the salaries (275k) divided by 5, which is 55k. This isn't the best representation of the group because most of the salaries are between 40k and 50k. The mean is being pulled upward by the one large salary. In this situation, a measure of central tendency that is less sensitive to outliers, such as the median (45k here), is a better choice.
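
The effect of the single large salary can be seen by computing both statistics with and without it. This is a small sketch using the five salaries from the note; the variable names are chosen here for illustration.

```python
from statistics import mean, median

# The five salaries from the note above, in dollars
salaries = [40_000, 50_000, 45_000, 40_000, 100_000]
# The same group without the one unusually large salary
without_outlier = [s for s in salaries if s != 100_000]

print("mean with outlier:     ", mean(salaries))
print("median with outlier:   ", median(salaries))
print("mean without outlier:  ", mean(without_outlier))
print("median without outlier:", median(without_outlier))
```

Removing the 100k salary moves the mean from 55k to 43.75k, while the median only moves from 45k to 42.5k.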

# Variance and Standard Deviation

Computing the mean or the middle value of a dataset is often not enough. Variability in the data is also important. Here, we introduce two measures of variability: the variance and the standard deviation. The standard deviation roughly describes how far away the typical observation is from the mean. We call the distance of an observation from the mean its deviation.

Since both measures are based on the mean, they are also sensitive to outliers and to skew in the data.

For other distributions, such as binomial distributions (where the data are binary in nature), the mean and median are interpreted differently.

- **Variance**
   - A measure of the variability of the data
   - Roughly the average squared distance from the mean

- **Standard Deviation**
    - Roughly how far a typical observation is from the mean. The distance of an observation from the mean is called its deviation
    - Standard deviation is the square root of the variance
    - For roughly bell-shaped (approximately normal) data, about 95% of the points are within 2 standard deviations of the mean
    
**Variance and Standard Deviation Example**
Same dataset as above:
21  22  15  12  21  16

Mean: 

$$\overline{X} = 17.83 $$

Variance:

$$S^2 = \frac{(21-17.83)^2 + (22-17.83)^2 + \dots + (16-17.83)^2}{5} = 16.57 $$ 

Standard Deviation:
$$S = \sqrt{S^2} = \sqrt{16.57} = 4.07$$
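
The calculation above divides by 5, which is the number of observations minus one. That is the sample variance, and it is what `statistics.variance` and `statistics.stdev` compute. The sketch below reproduces the hand calculation and compares it with the library functions; the variable names are chosen here for illustration.

```python
from statistics import mean, pvariance, stdev, variance

water = [21, 22, 15, 12, 21, 16]
x_bar = mean(water)

# Sample variance by hand: squared deviations summed, divided by n - 1
s2_manual = sum((x - x_bar) ** 2 for x in water) / (len(water) - 1)

print(round(s2_manual, 2))        # 16.57
print(round(variance(water), 2))  # 16.57 (sample variance, divides by n - 1)
print(round(stdev(water), 2))     # 4.07  (square root of the sample variance)
print(round(pvariance(water), 2)) # 13.81 (population variance, divides by n)
```

Dividing by n instead of n - 1 gives the smaller population variance, so it matters which function is used when comparing results with a hand calculation.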




**End of Module**

You have reached the end of this module.

If you have any questions, please reach out to your peers using the discussion boards. If you and your peers are unable to come to a suitable conclusion, do not hesitate to reach out to your instructor on the designated discussion board.

When you are comfortable with the content, you may proceed to the next module.


## References


- Bias (statistics), (n.d.). Retrieved Dec 5, 2018 from Wikipedia https://en.wikipedia.org/wiki/Bias_(statistics)

- Champkin, J. (2014). *The timeline of statistics*. Significance. Retrieved December 5, 2018 from https://www.statslife.org.uk/history-of-stats-science/1190-the-timeline-of-statistics

- Cognitive Bias Codex, 2016. Retrieved Dec 5, 2018 from Wikimedia Commons https://commons.wikimedia.org/wiki/File:The_Cognitive_Bias_Codex_-_180%2B_biases,_designed_by_John_Manoogian_III_(jm3).png Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

- Data, (n.d.). Retrieved Dec 5, 2018 from Business Dictionary http://www.businessdictionary.com/

- Data, (n.d.). Retrieved Dec 5, 2018 from Cambridge Dictionary https://dictionary.cambridge.org/dictionary/english/data

- List of cognitive biases, (n.d.). Retrieved Dec 5, 2018 from Wikipedia https://en.wikipedia.org/wiki/List_of_cognitive_biases

- Rauser, J. (2014). Statistics Without the Agonizing Pain. Strata Hadoop 2014. Retrieved from https://www.youtube.com/watch?v=5Dnw46eC-0o

- United States Census, 2018. Economic and Demographic information by County. Retrieved Dec 5, 2018 from https://www.census.gov/data.html

- Walker, J.I. (2017). *Reading Scatterplots*. Retrieved Dec 5, 2018 from https://www.mathbootcamps.com/reading-scatterplots/
        