# What is statistics?

Probability theory computes probabilities of complex events given the underlying base probabilities.

Statistics takes us in the opposite direction.

We are given **data** that was generated by a **stochastic process**.

We **infer** properties of the underlying base probabilities.

# Example: deciding whether a coin is biased

In a previous video we discussed the distribution of the number of heads when flipping a fair coin many times.

Let's turn the question around: we flip a coin 1000 times and get 570 heads.

Can we conclude that the coin is biased (not fair)?

What can we conclude if we got 507 heads?

### The logic of statistical inference
The answer uses the following logic.

* Suppose that the coin is fair. 

* Use **probability theory** to compute the probability of getting at least 570 (or 507) heads.

* If this probability is very small, then we can **reject** <font color='red'>with confidence</font> the hypothesis that the coin is fair.
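The second step above can be sketched directly for the two cases in this example. This is a minimal illustration, not part of the original notebook: the helper name `fair_coin_tail` is an assumption, and it computes the exact binomial tail probability for a fair coin using integer arithmetic.

```python
from math import comb

def fair_coin_tail(n, k):
    """Exact P(number of heads >= k) in n flips of a fair coin.

    Hypothetical helper for illustration: counts the outcomes with at
    least k heads and divides by the 2**n equally likely sequences.
    """
    favorable = sum(comb(n, j) for j in range(k, n + 1))
    return favorable / 2**n

p_570 = fair_coin_tail(1000, 570)
p_507 = fair_coin_tail(1000, 507)
print(f"P(at least 570 heads) = {p_570:.2e}")
print(f"P(at least 507 heads) = {p_507:.3f}")
```

The first probability is tiny while the second is roughly one in three, which is why only the 570-heads outcome leads us to reject the fair-coin hypothesis.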

## Calculating the answer
Recall the simulations we did in the video "What is probability".

We used $x_i=-1$ for tails and $x_i=+1$ for heads.

We looked at the sum $S_k=\sum_{i=1}^k x_i$; here $k=1000$.

If the number of heads is $570$, then the number of tails is $430$ and $S_{1000} = 570-430 = 140$.

For a fair coin, it is very unlikely that $|S_{1000}| > 4\sqrt{k} \approx 126.5$.

In [None]:
from math import sqrt
4*sqrt(1000)

126.49110640673517

Since $140 > 126.5$, it is very unlikely that the coin is unbiased.
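The threshold $4\sqrt{k}$ used above can be motivated as follows. This is a sketch that assumes the flips are independent and the coin is fair; the last line additionally assumes that the normal approximation to $S_k$ is adequate for $k=1000$.

$$
E[x_i] = 0, \qquad \mathrm{Var}(x_i) = E[x_i^2] = 1
\quad\Rightarrow\quad
\mathrm{Var}(S_k) = k, \qquad \sigma(S_k) = \sqrt{k}
$$

So $4\sqrt{k}$ is four standard deviations of $S_k$. Chebyshev's inequality already gives a bound that needs no approximation, and the normal approximation gives a much smaller estimate:

$$
P\big(|S_k| \geq 4\sqrt{k}\big) \leq \frac{\mathrm{Var}(S_k)}{(4\sqrt{k})^2} = \frac{1}{16},
\qquad
P\big(|S_k| \geq 4\sqrt{k}\big) \approx 2\big(1-\Phi(4)\big) \approx 6.3 \times 10^{-5}
$$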

### What about 507 heads?

507 heads and 493 tails $\Rightarrow S_{1000} = 507 - 493 = 14$, and $14 \ll 126.5$.

We cannot conclude that the coin is biased.

## Conclusion
The probability that an unbiased coin would generate a sequence with 570 or more heads is extremely small. We can therefore conclude, <font color='red'>with high confidence</font>, that the coin **is** biased.

On the other hand, a deviation of $\big| S_{1000} \big | \geq 14$, the value that corresponds to 507 heads, is quite likely for an unbiased coin. So getting 507 heads does not provide evidence that the coin is biased.
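The two conclusions can also be checked empirically. The simulation below is an illustrative sketch rather than the notebook's own code: the function name, the number of experiments (2000), and the seed are all assumptions. It repeats the 1000-flip experiment with a fair coin and records how often $|S_{1000}|$ reaches 14 and how often it reaches 140.

```python
import random

def simulate_sums(num_experiments, k, seed=0):
    """Simulate S_k = (#heads - #tails) for a fair coin, num_experiments times.

    Hypothetical helper: each experiment draws k random bits at once and
    counts the ones as heads.
    """
    rng = random.Random(seed)
    sums = []
    for _ in range(num_experiments):
        heads = bin(rng.getrandbits(k)).count("1")
        sums.append(2 * heads - k)  # heads - tails = 2*heads - k
    return sums

sums = simulate_sums(num_experiments=2000, k=1000)
frac_14 = sum(abs(s) >= 14 for s in sums) / len(sums)
frac_140 = sum(abs(s) >= 140 for s in sums) / len(sums)
print(f"fraction with |S| >= 14:  {frac_14:.3f}")
print(f"fraction with |S| >= 140: {frac_140:.4f}")
```

A deviation of 14 shows up in most runs, while a deviation of 140 is rare enough that a simulation of this size will usually not produce it at all.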

# Real-World examples
You might ask "why should I care whether a coin is biased?"

* This is a valid critique. 
* We will give a few real-world cases in which we want to know whether a "coin" is biased or not.

## Case 1: Polls
* Suppose elections will take place in a few days and we want to know how people plan to vote.
* Suppose there are just two parties: **D** and **R**.

* We could try to ask **all** potential voters.

* That would be very expensive.

* Instead, we can use a poll: call up a small randomly selected set of people.

* Call $n$ people at random and count the number of **D** votes.

* Can you say <font color='red'>with confidence</font> that there are more **D** votes, or more **R** votes?

* This is mathematically equivalent to flipping a biased coin and asking whether you can say <font color='red'>with confidence</font> that it is biased towards "Heads" or towards "Tails".
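The poll question can be sketched with the same $\pm 1$ sum used for the coin. The code below is an illustration under stated assumptions: the function names and the poll numbers (1000 respondents, 540 for **D**) are hypothetical, respondents are assumed to be sampled independently, and the p-value uses the normal approximation to the sum.

```python
from math import erfc, sqrt

def poll_z_score(n, d_votes):
    """Number of standard deviations by which the poll deviates from a tie.

    Counts +1 for each D answer and -1 for each R answer; under an exact
    tie the sum has mean 0 and standard deviation sqrt(n).
    """
    s = d_votes - (n - d_votes)
    return s / sqrt(n)

def two_sided_p_value(z):
    """Normal-approximation probability of a deviation at least |z| in either direction."""
    return erfc(abs(z) / sqrt(2))

# Hypothetical poll: 540 of 1000 respondents say they will vote D.
z = poll_z_score(n=1000, d_votes=540)
p = two_sided_p_value(z)
print(f"z = {z:.2f}, two-sided p-value = {p:.4f}")
```

A small p-value means that a result this lopsided would be unlikely if the electorate were evenly split, which is the same reasoning used to reject the fair-coin hypothesis.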

## Case 2: A/B testing
A common practice when optimizing a web page is to perform A/B tests.

* A and B refer to two alternative designs for the page.

![AB](images/AB.png)

* To see which design users prefer we randomly present design A or design B.

* We measure how long the user stayed on a page, or whether the user clicked on an advertisement.

* We want to decide, <font color='red'>with confidence</font>, which of the two designs is better.

* Again: this is similar to making a decision <font color='red'>with confidence</font> on whether "Heads" is more probable than "Tails" or vice versa.
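One common way to make this decision when the measurement is a click (yes or no) is a two-proportion z-test. The sketch below is not taken from the original notebook: the function name and the click counts are hypothetical, and the test assumes independent users and samples large enough for the normal approximation.

```python
from math import erfc, sqrt

def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Compare the click rates of designs A and B.

    Returns (z, p_value), where p_value is the two-sided probability of a
    difference at least this large if both designs had the same click rate.
    """
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled rate: the best estimate of the common rate if A and B do not differ.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_a - rate_b) / std_err
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value

# Hypothetical experiment: each design is shown to 1000 users.
z, p = two_proportion_z_test(clicks_a=120, views_a=1000, clicks_b=90, views_b=1000)
print(f"z = {z:.2f}, two-sided p-value = {p:.4f}")
```

A positive `z` favors design A and a negative one favors design B; how small the p-value must be before declaring a winner is a choice made before running the test.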

In [2]:
import pandas as pd
# Import data from GitHub (or from your local computer)
df = pd.read_csv("https://raw.githubusercontent.com/kirenz/datasets/master/wage.csv")

In [4]:
df.head()

| | Unnamed: 0 | year | age | maritl | race | education | region | jobclass | health | health_ins | logwage | wage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 231655 | 2006 | 18 | 1. Never Married | 1. White | 1. < HS Grad | 2. Middle Atlantic | 1. Industrial | 1. <=Good | 2. No | 4.318063 | 75.043154 |
| 1 | 86582 | 2004 | 24 | 1. Never Married | 1. White | 4. College Grad | 2. Middle Atlantic | 2. Information | 2. >=Very Good | 2. No | 4.255273 | 70.47602 |
| 2 | 161300 | 2003 | 45 | 2. Married | 1. White | 3. Some College | 2. Middle Atlantic | 1. Industrial | 1. <=Good | 1. Yes | 4.875061 | 130.982177 |
| 3 | 155159 | 2003 | 43 | 2. Married | 3. Asian | 4. College Grad | 2. Middle Atlantic | 2. Information | 2. >=Very Good | 1. Yes | 5.041393 | 154.685293 |
| 4 | 11443 | 2005 | 50 | 4. Divorced | 1. White | 2. HS Grad | 2. Middle Atlantic | 2. Information | 1. <=Good | 1. Yes | 4.318063 | 75.043154 |


### Measures of central tendency

In [3]:
# mode (most frequent value) of age
df['age'].mode()

0    40
dtype: int64

In [5]:
# mode (most frequent value) of wage
df['wage'].mode()

0    118.884359
dtype: float64

In [6]:
# calculation of the mean (e.g. for age)
df["age"].mean()

42.41466666666667

In [7]:
# calculation of the mean (e.g. for wage)
df["wage"].mean()

111.70360820174366

In [8]:
# calculation of the mean (e.g. for age) and round the result
round(df["age"].mean(), 2)

42.41

In [9]:
# calculation of the mean (e.g. for wage) and round the result
round(df["wage"].mean(), 2)

111.7

In [10]:
# calculation of the median (e.g. for age)
df["age"].median()

42.0

In [11]:
# calculation of the median (e.g. for wage)
df["wage"].median()

104.921506533664

### Measures of dispersion 

In [12]:
# quartiles (25th, 50th, and 75th percentiles) of age
df['age'].quantile([.25, .5, .75])

0.25    33.75
0.50    42.00
0.75    51.00
Name: age, dtype: float64

In [13]:
# range (maximum minus minimum) of age
df['age'].max() - df['age'].min()

62

In [14]:
# sample standard deviation of age, rounded to two decimals
round(df['age'].std(),2)

11.54

### Summary statistics

In [15]:
# summary statistics for all numerical columns
round(df.describe(),2)

| | Unnamed: 0 | year | age | logwage | wage |
|---|---|---|---|---|---|
| count | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 |
| mean | 218883.37 | 2005.79 | 42.41 | 4.65 | 111.7 |
| std | 145654.07 | 2.03 | 11.54 | 0.35 | 41.73 |
| min | 7373.0 | 2003.0 | 18.0 | 3.0 | 20.09 |
| 25% | 85622.25 | 2004.0 | 33.75 | 4.45 | 85.38 |
| 50% | 228799.5 | 2006.0 | 42.0 | 4.65 | 104.92 |
| 75% | 374759.5 | 2008.0 | 51.0 | 4.86 | 128.68 |
| max | 453870.0 | 2009.0 | 80.0 | 5.76 | 318.34 |


#### Compare summary statistics for specific groups in the data

In [16]:
# summary statistics of age for each education level
df.groupby('education')['age'].describe()

| education | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 1. < HS Grad | 268.0 | 41.794776 | 12.611111 | 18.0 | 33.0 | 41.5 | 50.25 | 75.0 |
| 2. HS Grad | 971.0 | 42.217302 | 12.02348 | 18.0 | 33.0 | 42.0 | 50.0 | 80.0 |
| 3. Some College | 650.0 | 40.887692 | 11.523327 | 18.0 | 32.0 | 40.0 | 49.0 | 80.0 |
| 4. College Grad | 685.0 | 42.773723 | 10.902406 | 22.0 | 34.0 | 43.0 | 51.0 | 76.0 |
| 5. Advanced Degree | 426.0 | 45.007042 | 10.263468 | 25.0 | 38.0 | 44.0 | 53.0 | 76.0 |


# Summary
Statistics is about analyzing real-world data and drawing conclusions.

Examples include:

* Using polls to estimate public opinion.

* Performing A/B tests to design web pages.

* Estimating the rate of global warming.

* Deciding whether a medical procedure is effective.