## **1. Statistical Inference**  
Using data analysis and statistics to make conclusions about a population is called statistical inference.

The main types of statistical inference are:  
1. Estimation
2. Hypothesis testing

## **1.1 Estimation**  
Statistics from a sample are used to estimate population parameters.

The most likely value is called a **point estimate**.

There is **always** uncertainty when estimating.

The uncertainty is often expressed as **confidence intervals** defined by a likely lowest and highest value for the parameter.

An example could be a confidence interval for the number of bicycles a Dutch person owns:  
> "The average number of bikes a Dutch person owns is between 3.5 and 6."
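As a rough illustration of a point estimate and a confidence interval, here is a minimal sketch in Python. The sample values are made up for illustration and are not real survey data. The sketch also assumes a 95% confidence level and uses the t-distribution, which is a common choice for small samples but is not covered in this text.

```python
import numpy as np
from scipy import stats

# Hypothetical sample: bikes owned by 10 surveyed people (made-up data)
bikes = np.array([3, 5, 4, 6, 2, 5, 4, 7, 3, 5])

# Point estimate of the population mean
point_estimate = bikes.mean()

# Standard error of the mean, using the sample standard deviation (ddof=1)
standard_error = bikes.std(ddof=1) / np.sqrt(len(bikes))

# Assumption: 95% confidence level, t-distribution with n - 1 degrees of freedom
t_critical = stats.t.ppf(0.975, df=len(bikes) - 1)
lower = point_estimate - t_critical * standard_error
upper = point_estimate + t_critical * standard_error

print(f"Point estimate: {point_estimate:.2f}")
print(f"95% confidence interval: {lower:.2f} to {upper:.2f}")
```

The interval is centred on the point estimate and gets narrower as the sample grows or as the data become less spread out.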

## **1.2 Hypothesis Testing**  
**Hypothesis testing** is a method to check if a claim about a population is true. More precisely, it checks how likely it is that a hypothesis is true, based on the sample data.

There are different types of hypothesis testing.

The steps of the test depend on:
- Type of data (categorical or numerical)
- Whether you are looking at:
    - A single group
    - Comparing one group to another
    - Comparing the same group before and after a change  

Some examples of claims or questions that can be checked with hypothesis testing:  
> - 90% of Australians are left handed
> - Is the average weight of dogs more than 40kg?
> - Do doctors make more money than lawyers?
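The dog weight question above could be checked along the following lines. This is only a sketch: the weights are made-up numbers, and the one-sample t-test is one possible choice of test that is not described in this text. The `alternative` argument of `ttest_1samp` requires SciPy 1.6 or newer.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of dog weights in kg (made-up data for illustration)
weights = np.array([42.1, 39.5, 44.0, 41.2, 38.7, 43.3, 40.9, 45.1])

# Null hypothesis: the population mean is 40 kg
# Alternative hypothesis: the population mean is greater than 40 kg
result = stats.ttest_1samp(weights, popmean=40, alternative="greater")

print(f"Sample mean: {weights.mean():.2f} kg")
print(f"t-statistic: {result.statistic:.3f}")
print(f"p-value: {result.pvalue:.4f}")
```

A small p-value would suggest that a sample mean this far above 40 kg is unlikely if the true average were 40 kg.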

## **1.3 Probability Distributions**  
Statistical inference methods rely on probability calculation and probability distributions.

You will learn about the most important probability distributions in the next pages.
1. Normal Distribution
2. Standard Normal Distribution

## **2. Normal Distribution**  
The normal distribution is an important probability distribution used in statistics.

Many real world examples of data are normally distributed.  

The normal distribution is described by the mean ($\mu$) and the standard deviation ($\sigma$).

The normal distribution is often referred to as a 'bell curve' because of its shape:

- Most of the values are around the centre ($\mu$)
- The <u>median</u> and mean are equal
- It has only one <u>mode</u>
- It is symmetric, meaning it decreases the same amount on the left and the right of the centre

The area under the curve of the normal distribution represents probabilities for the data.

The area under the whole curve is equal to 1, or 100%

Here is a graph of a normal distribution with probabilities between standard deviations ($\sigma$):
<img src='Normal_Distribution_1.png' alt='Normal Distributions with indicated probabilities.'>  

- Roughly 68.3% of the data is within 1 standard deviation of the average (from μ-1σ to μ+1σ)
- Roughly 95.5% of the data is within 2 standard deviations of the average (from μ-2σ to μ+2σ)
- Roughly 99.7% of the data is within 3 standard deviations of the average (from μ-3σ to μ+3σ)  
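
The percentages in this list can be reproduced with the `norm.cdf()` function from SciPy. The sketch below works on the standard scale (mean 0, standard deviation 1), where "within k standard deviations" is the interval from -k to k. The dictionary name `within` is only illustrative.

```python
from scipy.stats import norm

# Probability of falling within k standard deviations of the mean
within = {}
for k in (1, 2, 3):
    # Area between -k and +k = area up to +k minus area up to -k
    within[k] = norm.cdf(k) - norm.cdf(-k)
    print(f"Within {k} standard deviation(s): {within[k]:.4f}")
```

This prints approximately 0.6827, 0.9545 and 0.9973, matching the rounded percentages above.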

> **Note:** Probabilities of the normal distribution can only be calculated for intervals (between two values).

## **2.1 Different Mean and Standard Deviations**  
The mean describes where the centre of the normal distribution is.

Here is a graph showing three different normal distributions with the **same** standard deviation but different means.
<img src='Normal_Distribution_2.png' alt='Normal Distributions with different means.'>  

The standard deviation describes how spread out the normal distribution is.

Here is a graph showing three different normal distributions with the **same** mean but different standard deviations.
<img src='Normal_Distribution_3.png' alt='Normal Distributions with different standard deviations.'>  

The purple curve has the biggest standard deviation and the black curve has the smallest standard deviation.

The area under each of the curves is still 1, or 100%.
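
To see this numerically, the sketch below evaluates three normal curves with the same mean and different standard deviations, then approximates the area under each one with a simple sum over a fine grid. The standard deviations 1, 2 and 5 and the grid range are arbitrary choices for illustration and are not taken from the graph.

```python
import numpy as np
from scipy.stats import norm

# Fine grid that is wide enough to cover the widest curve
x = np.linspace(-60, 60, 12001)
dx = x[1] - x[0]

areas = {}
for sigma in (1, 2, 5):
    # Height of the normal curve at every grid point
    density = norm.pdf(x, loc=0, scale=sigma)
    # Approximate the area as the sum of thin rectangles
    areas[sigma] = density.sum() * dx
    print(f"sigma={sigma}: peak height={density.max():.4f}, area={areas[sigma]:.4f}")
```

A larger standard deviation gives a lower and wider curve, while the area stays at about 1 in every case.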

## **2.2 A Real Data Example of Normally Distributed Data**  
Real world data is often normally distributed.

Here is a histogram of the age of Nobel Prize winners when they won the prize:

<img src='Normal_Distribution_4.png' alt='Histogram of the age of Nobel Prize winners when they won the prize and normal distribution fitted to the data.'>  

The normal distribution drawn on top of the histogram is based on the population mean ($\mu$) and standard deviation ($\sigma$) of the real data.

We can see that the histogram is close to a normal distribution.

Examples of real world variables that can be normally distributed:
- Test scores
- Height
- Birth weight

## **2.3 Probability Distributions** 
Probability distributions are functions that calculate the probabilities of the outcomes of random variables.

Typical examples of random variables are coin tosses and dice rolls.

Here is a graph showing the results of a growing number of coin tosses and the expected values of the results (heads or tails).

The expected values of the coin toss are given by the probability distribution of the coin toss.
<img src='Normal_Distribution_5.gif' alt='Simulated coin tosses and expected values.'> 

Notice how the result of random coin tosses gets closer to the expected values (50%) as the number of tosses increases.

Similarly, here is a graph showing the results of a growing number of dice rolls and the expected values of the results (from 1 to 6).  

<img src='Normal_Distribution_6.gif' alt='Simulated dice rolls and expected values.'>

Notice again how the results of random dice rolls get closer to the expected values (1/6, or about 16.67%) as the number of rolls increases.

When the random variable is a **sum** of dice rolls, the results and expected values take a different shape.

The different shape comes from there being more ways of getting a sum near the middle than a small or large sum.
<img src='Normal_Distribution_7.gif' alt='Simulated sum of two dice rolls and expected values.'>

As we keep increasing the number of dice in the sum, the shape of the results and expected values looks more and more like a normal distribution.
<img src='Normal_Distribution_8.gif' alt='Simulated sum of 3 dice rolls and expected values.'><img src='Normal_Distribution_9.gif' alt='Simulated sum of 5 dice rolls and expected values.'>
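
The dice experiments can be imitated with a short simulation. This is a minimal sketch using NumPy's random number generator. The number of rolls, the seed and the helper name `simulate_sums` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable
n_rolls = 100_000

def simulate_sums(n_dice):
    """Roll n_dice fair dice n_rolls times and return the sum of each roll."""
    rolls = rng.integers(1, 7, size=(n_rolls, n_dice))  # upper bound is exclusive
    return rolls.sum(axis=1)

summary = {}
for n_dice in (1, 2, 5):
    sums = simulate_sums(n_dice)
    # Count how often each possible sum occurred
    values, counts = np.unique(sums, return_counts=True)
    most_common = int(values[counts.argmax()])
    summary[n_dice] = (sums.mean(), sums.std(), most_common)
    print(f"{n_dice} dice: mean={sums.mean():.2f}, "
          f"sd={sums.std():.2f}, most common sum={most_common}")
```

With two dice, the most common sum is 7, because it can be made in more ways than any other sum. With more dice, the sums pile up around the middle value (3.5 times the number of dice) in an increasingly bell-shaped pattern.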

Many real world variables follow a similar pattern and naturally form normal distributions.

Normally distributed variables can be analysed with well-known techniques.

You will learn about some of the most common and useful techniques in the following pages.

## **3. Standard Normal Distribution**
The standard normal distribution is a normal distribution where the mean is 0 and the standard deviation is 1.

Normally distributed data can be transformed into a standard normal distribution.

Standardizing normally distributed data makes it easier to compare different sets of data.

The standard normal distribution is used for:  
- Calculating confidence intervals
- Hypothesis tests  

Here is a graph of the standard normal distribution with probability values (p-values) between the standard deviations:
<img src='Standard_Normal_Distribution.png' alt='Standard Normal Distribution with indicated probabilities.'>

Standardizing makes it easier to calculate probabilities.

The functions for calculating probabilities are complex and difficult to calculate by hand.

Typically, probabilities are found by looking up tables of pre-calculated values, or by using software and programming.

The standard normal distribution is also called the 'Z-distribution' and the values are called 'Z-values' (or Z-scores).

## **3.1 Z-Values** 
Z-values express how many standard deviations from the mean a value is.

The formula for calculating a Z-value is:  
$\displaystyle Z = \frac{x-\mu}{\sigma}$  
$x$ is the value we are standardizing, $\mu$ is the mean, and $\sigma$ is the standard deviation.

For example, if we know that:
> The mean height of people in Germany is 170 cm ($\mu$)
>
> The standard deviation of the height of people in Germany is 10 cm ($\sigma$)
> 
> Bob is 200 cm tall ($x$)

Bob is 30 cm taller than the average person in Germany.

30 cm is 3 times 10 cm. So Bob's height is 3 standard deviations larger than the mean height in Germany.

Using the formula:  
$\displaystyle Z = \frac{x-\mu}{\sigma} = \frac{200-170}{10} = \frac{30}{10} = \underline{3}$

The Z-value of Bob's height (200 cm) is 3.
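
The same calculation can be written as a small Python function. This is a minimal sketch and the function name `z_score` is only illustrative.

```python
def z_score(x, mean, standard_deviation):
    """Return how many standard deviations x lies from the mean."""
    return (x - mean) / standard_deviation

# Bob's height (200 cm), with mean 170 cm and standard deviation 10 cm
bob_z = z_score(200, 170, 10)
print(bob_z)  # 3.0
```

A value below the mean gives a negative Z-value. For example, a height of 160 cm with the same mean and standard deviation gives a Z-value of -1.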

## **3.2 Finding the P-value of a Z-Value**  
Using a Z-table (refer to the file '4_References_Statistics.ipynb' for more information) or programming, we can calculate how many people in Germany are shorter than Bob and how many are taller.

Example  
With Python, use the SciPy Stats library `norm.cdf()` function to find the probability of getting less than a Z-value of 3:


```python
import scipy.stats as stats

# Probability of getting a Z-value less than 3
print(stats.norm.cdf(3))
```

Output:

```
0.9986501019683699
```


Using either method (a Z-table or programming), we find that the probability is $\approx 0.9987$, or $99.87\%$.

This means that Bob is taller than 99.87% of the people in Germany.

Here is a graph of the standard normal distribution and a Z-value of 3 to visualize the probability:
<img src='P-value_of_Z-Value.png' alt='Standard Normal Distribution with indicated probability for a z-value of 3.'>

These methods find the p-value up to the particular z-value we have.

To find the p-value above the z-value we can calculate 1 minus the probability.

So in Bob's example, we can calculate 1 - 0.9987 = 0.0013, or 0.13%.

This means that only 0.13% of Germans are taller than Bob.
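
A short sketch of the same calculation in Python is shown below. Besides subtracting from 1, SciPy also offers `norm.sf()` (the survival function), which returns the probability above a value directly. `norm.cdf()` also accepts `loc` and `scale` arguments, so the height can be used without standardizing first. The variable names are only illustrative.

```python
from scipy.stats import norm

z = 3  # Z-value of Bob's height

# Probability below the Z-value
p_below = norm.cdf(z)

# Probability above the Z-value, computed in two equivalent ways
p_above = 1 - p_below
p_above_sf = norm.sf(z)

# Same probability below, using the height scale directly (mean 170, sd 10)
p_below_direct = norm.cdf(200, loc=170, scale=10)

print(f"Below: {p_below:.4f}")   # 0.9987
print(f"Above: {p_above:.4f}")   # 0.0013
```

Both ways of finding the upper probability give about 0.0013, in line with the calculation above.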

## **3.3 Finding the P-Value Between Z-Values**  
