# **Basic Statistics Concepts**

## **1.1 Statistics Introduction**  
 Statistics gives us methods of gaining knowledge from data.

## **What is Statistics Used for?**  
Statistics is used in all kinds of science and business applications.  
  
Statistics gives us more accurate knowledge which helps us make better decisions.  
  
Statistics can focus on making predictions about what will happen in the future. It can also focus on explaining how different things are connected.  

## **Typical Steps of Statistical Methods**  
The typical steps are:
 - Gathering data
 - Describing and visualizing data
 - Making conclusions

It is important to keep all three steps in mind for any question we want more knowledge about.

Knowing which types of data are available can tell you what kinds of questions you can answer with statistical methods.

Knowing which questions you want to answer can help guide what sort of data you need. A lot of data might be available, and knowing what to focus on is important.

## **How is Statistics Used?**  
Statistics can be used to explain things in a precise way. You can use it to understand and make conclusions about the group that you want to know more about. This group is called the **`population`**.

A population could be many different kinds of groups. It could be:

 - All of the people in a country
 - All the businesses in an industry
 - All the customers of a business
 - All people that play football who are older than 45

And so on. It just depends on what you want to know about.

Gathering data about the population will give you a sample. This is a part of the whole population. Statistical methods are then used on that sample.

The results of the statistical methods from the sample are used to make conclusions about the population.

## **Important Concepts in Statistics**  
1. Predictions and Explanations
2. Populations and Samples
3. Parameters and Sample Statistics
4. Sampling Methods
5. Data Types
6. Measurement Level
7. Descriptive Statistics
8. Random Variables
9. Univariate and Multivariate Statistics
10. Probability Calculation
11. Probability Distributions
12. Statistical Inference
13. Parameter Estimation
14. Hypothesis Testing
15. Correlation
16. Regression Analysis
17. Causal Inference  
All of these topics are discussed step by step in this tutorial file and the coming tutorial files.

## **Statistics and Programming**  
Statistical analysis is typically done with computers. Small amounts of data can be analyzed reasonably well without computers.

Historically, all data analysis was performed manually. It was time-consuming and prone to errors.

Nowadays, programming and software are typically used for data analysis.

In this course, we will show examples of code to do statistics with the programming language Python.

## **1.2 Gathering Data**  
Gathering data is the first step in statistical analysis.

Say for example that you want to know something about all the people in France.

The **population** is then all of the people in France.

It is usually too much effort to gather information about all of the members of a population (e.g. all 67+ million people living in France). It is often much easier to collect data from a smaller group of that population and analyze that. This is called a **`sample`**.

## **A representative sample**  
The sample needs to be **`similar`** to the whole population of France. It should have the same characteristics as the population. If you only include people named Jacques living in Paris who are 48 years old, the sample will not be similar to the whole population.

So for a good sample, you will need people from all over France, with different ages, professions, and so on.

If the members of the sample have similar characteristics (like age, profession, etc.) to the whole population of France, we say that the sample is **`representative`** of the population.

A good **`representative sample`** is crucial for statistical methods.

> **Note:** Data from a proper sample is often just as good as data from the whole population, as long as it is representative!  
  A good sample allows you to make accurate conclusions about the whole population.
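
The difference between a representative and a non-representative sample can be sketched in Python. This is a minimal illustration, not real data: the population of ages is randomly generated, and the names `population_ages`, `random_sample` and `biased_sample` are made up for this example.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 ages drawn uniformly from 0 to 90.
population_ages = [rng.randint(0, 90) for _ in range(100_000)]

# A random sample of 1,000 people from the whole population.
random_sample = rng.sample(population_ages, 1_000)

# A biased sample: only people who are exactly 48 years old.
biased_sample = [age for age in population_ages if age == 48]

print("Population mean age:", round(statistics.mean(population_ages), 1))
print("Random sample mean: ", round(statistics.mean(random_sample), 1))
print("Biased sample mean: ", statistics.mean(biased_sample))
```

The random sample should give a mean close to the population mean, while the biased sample can only ever report 48.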

## **1.3 Describing Data**  
Describing data is typically the second step of statistical analysis after gathering data.

## **Descriptive Statistics**  
The information (data) from your sample or population can be visualized with graphs or **summarized** by numbers. This will show key information in a simpler way than just looking at raw data. It can help us understand how the data is **distributed**.

Graphs can visually show the data distribution.  

Examples of graphs include:  
- Histograms
- Pie charts
- Bar graphs
- Box plots  

Some graphs have a close connection to numerical summary statistics. Calculating those gives us the basis of these graphs.

For example, a box plot visually shows the **quartiles** of a data distribution.

Quartiles are the values that split the sorted data into four equal-sized parts, or quarters. A quartile is one type of summary statistic.
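
As a small sketch, quartiles can be calculated with Python's built-in `statistics` module. The values in `data` are made up for illustration. Note that there are several conventions for calculating quartiles, so other tools may give slightly different cut points.

```python
import statistics

data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15, 18]  # made-up example values

# With n=4, statistics.quantiles returns the three cut points Q1, Q2 and Q3
# that split the data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4)

print("Q1 (first quartile): ", q1)
print("Q2 (second quartile):", q2)
print("Q3 (third quartile): ", q3)
print("Median:              ", statistics.median(data))
```

The second quartile is the same as the median, which is the middle line drawn inside a box plot.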

## **Summary statistics**  
Summary statistics take a large amount of information and sum it up in a few key values.

These values are numbers calculated from the data, and they also describe the shape of the distribution. Each such number is an individual 'statistic'.

Some important examples are:
- Mean, median and mode
- Range and interquartile range
- Quartiles and percentiles
- Standard deviation and variance
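
Several of these summary statistics can be calculated with Python's built-in `statistics` module. The sketch below uses a small made-up list called `values`; it is only meant to show which function belongs to which statistic.

```python
import statistics

values = [3, 5, 5, 6, 8, 9, 13]  # made-up example data

mean = statistics.mean(values)            # arithmetic average
median = statistics.median(values)        # middle value of the sorted data
mode = statistics.mode(values)            # most common value
value_range = max(values) - min(values)   # largest value minus smallest value
variance = statistics.variance(values)    # sample variance (divides by n - 1)
std_dev = statistics.stdev(values)        # sample standard deviation

print("Mean:", mean, "Median:", median, "Mode:", mode)
print("Range:", value_range)
print("Variance:", round(variance, 2), "Standard deviation:", round(std_dev, 2))
```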

> **Note:** Descriptive statistics is often presented as a part of statistical analysis.
>
> Descriptive statistics is also useful for guiding further analysis, giving insight into the data, and finding what is worth investigating more closely.

## **1.4 Making Conclusions**  
Using statistics to make conclusions about a population is called statistical inference.  

This section covers two types of inference:  
- Statistical inference
- Causal inference

## **Statistical Inference**  
Statistics from the data in the **sample** are used to make conclusions about the whole **population**. This is a type of **statistical inference**.

**Probability theory** is used to calculate the certainty that those statistics also apply to the population.

When using a sample, there will **always** be some uncertainty about what the data looks like for the population.

Uncertainty is often expressed as **confidence intervals**.

A confidence interval is a range of values, calculated from the sample, that is likely to contain the **true value** of the parameter for the population.
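
One common way to calculate a confidence interval for a mean can be sketched as follows. This is a simplified illustration under stated assumptions: the heights in `sample` are made up, and the calculation uses a normal approximation, which is only reasonable when the sample is not too small.

```python
import statistics

# Made-up sample of 40 measured heights in cm (hypothetical data).
sample = [168, 172, 175, 169, 181, 177, 165, 174, 179, 170,
          173, 176, 171, 180, 167, 178, 172, 175, 169, 174,
          182, 166, 173, 177, 170, 175, 171, 179, 168, 176,
          174, 172, 180, 169, 173, 178, 171, 175, 170, 177]

n = len(sample)
sample_mean = statistics.mean(sample)

# The standard error describes how much the sample mean is expected to vary.
standard_error = statistics.stdev(sample) / n ** 0.5

# z-value for 95% confidence from the standard normal distribution (about 1.96).
z = statistics.NormalDist().inv_cdf(0.975)

lower = sample_mean - z * standard_error
upper = sample_mean + z * standard_error

print(f"Sample mean: {sample_mean:.1f} cm")
print(f"Approximate 95% confidence interval: {lower:.1f} to {upper:.1f} cm")
```

A larger sample gives a smaller standard error, and therefore a narrower interval.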

**Hypothesis testing** is another way of checking if a statement about a population is true. More precisely, it checks how well the sample data agree with a hypothesis about the population.

Some examples of statements or questions that can be checked with hypothesis testing:  
- Are people in the Netherlands taller than people in Denmark?
- Do people prefer Pepsi or Coke?
- Does a new medicine cure a disease?

> **Note:** Confidence intervals and hypothesis testing are closely related and describe the same things in different ways. Both are widely used in science.
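
The idea behind a hypothesis test can be sketched with a simple permutation test. This is only one of many possible tests and is chosen here because it needs no formulas. The two groups of heights are made up, and the names `group_a` and `group_b` are hypothetical.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Made-up heights in cm for two small hypothetical samples.
group_a = [183, 181, 185, 179, 184, 182, 186, 180]
group_b = [180, 178, 182, 177, 181, 179, 183, 176]

observed = statistics.mean(group_a) - statistics.mean(group_b)

# Permutation test: if the two groups do not really differ, the group labels
# are arbitrary. Shuffle the labels many times and count how often a
# difference at least as large as the observed one appears by chance.
pooled = group_a + group_b
n_a = len(group_a)
n_rounds = 10_000
at_least_as_large = 0
for _ in range(n_rounds):
    rng.shuffle(pooled)
    diff = statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:])
    if diff >= observed:
        at_least_as_large += 1

p_value = at_least_as_large / n_rounds  # one-sided p-value
print("Observed difference in means:", observed)
print("Estimated p-value:", p_value)
```

A small p-value means that a difference this large would rarely appear by chance alone if the groups were actually the same.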

## **Causal Inference**  
Causal inference is used to investigate if something causes another thing.

For example: Does rain make plants grow?

If we think two things are related we can investigate to see if they **correlate**. Statistics can be used to find out how strong this relation is.

Even if things are correlated, finding out if something is caused by other things can be difficult. It can be done with good **experimental design** or other special statistical techniques.

> **Note:** Good experimental design is often difficult to achieve because of ethical concerns or other practical reasons.
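
The strength of a relation between two things is often measured with the Pearson correlation coefficient. The sketch below calculates it by hand for made-up rainfall and plant growth numbers. The function name `pearson_correlation` and the data are assumptions for illustration, and a high correlation here does not by itself show that rain causes growth.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How much the two variables vary together.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable varies on its own.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_variation / (spread_x * spread_y)


# Made-up data: weekly rainfall in mm and plant growth in cm.
rainfall = [5, 12, 20, 8, 30, 25, 15]
growth = [1.0, 2.1, 3.2, 1.5, 4.8, 4.1, 2.6]

r = pearson_correlation(rainfall, growth)
print("Correlation between rainfall and growth:", round(r, 3))
```

The coefficient is always between -1 and 1. Values close to 1 or -1 indicate a strong linear relation, and values close to 0 indicate a weak one.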

## **1.5 Prediction and Explanation**  
Some types of statistical methods are focused on predicting what will happen.

Other types of statistical methods are focused on explaining how things are connected.

## **Prediction**  
Some statistical methods are not focused on explaining how things are connected. Only the accuracy of prediction is important.

Many statistical methods are successful at predicting without giving insight into how things are connected.

Some types of machine learning let computers do the hard work, but the way they predict is difficult to understand. These approaches can also be vulnerable to mistakes if the circumstances change, since how they work is less clear.

> **Note:** Predictions about future events are called **forecasts**. Not all predictions are about the future.

Some predictions can be about something else that is unknown, even if it is not in the future.

## **Explanation**  
Different statistical methods are often used for explaining how things are connected. These statistical methods may not make good predictions.

These statistical methods often explain only small parts of the whole situation. But, if you only want to know how a few things are connected, the rest might not matter.

If these methods accurately explain how all the relevant things are connected, they will also be good at prediction. But managing to explain every detail is often challenging.

Sometimes we are specifically interested in figuring out if one thing causes another. This is called **causal inference**.

If we are looking at complicated situations, many things are connected. To figure out what causes what, we need to untangle every way these things are connected.

> **Note:** Making conclusions about causality should be done carefully.

## **1.6 Populations and Samples**  
The terms 'population' and 'sample' are important in statistics and refer to key concepts that are closely related.


## **Population and Samples**  
**Population:** Everything in the group that we want to learn about.

**Sample:** A part of the population.

Examples of populations and a sample from those populations:

|Population|Sample|  
|---|---|  
|All of the people in Germany|500 Germans|
|All of the customers of Netflix|300 Netflix customers|
|Every car manufacturer|Tesla, Toyota, BMW, Ford|  

For good statistical analysis, the sample needs to be as "similar" as possible to the population. If they are similar enough, we say that the sample is **representative** of the population.

The sample is used to make conclusions about the whole population. If the sample is not similar enough to the whole population, the conclusions could be useless.

> **Note:** Many words have specific meanings in statistics.  
>
> The word 'population' normally refers to a group of people. In statistics, it is any specific group that we are interested in learning about.

## **1.7 Parameters and Statistics**  
The terms 'parameter' and (sample) 'statistic' refer to key concepts that are closely related in statistics.

They are also directly connected to the concepts of populations and samples.

## **Parameters and Statistics**  
**Parameter:** A number that describes something about the whole population.

**Sample statistic:** A number that describes something about the **sample**.

The parameters are the key things we want to learn about. The parameters are usually unknown.

Sample statistics give us **estimates** for parameters.

There will always be some **uncertainty** about how accurate estimates are. More certainty gives us more useful knowledge.

For every parameter we want to learn about we can get a sample and calculate a sample statistic, which gives us an estimate of the parameter.
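
The relationship between a parameter and its estimates can be sketched with a simulated population. Everything here is made up for illustration: in practice the population mean (`parameter`) is usually unknown, and only one sample is available.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population of 50,000 incomes (randomly generated values).
population = [rng.gauss(30_000, 8_000) for _ in range(50_000)]
parameter = statistics.mean(population)  # the population mean

# Each sample gives a slightly different estimate of the same parameter.
estimates = []
for _ in range(5):
    sample = rng.sample(population, 200)
    estimates.append(statistics.mean(sample))

print("Population mean (parameter):", round(parameter))
for number, estimate in enumerate(estimates, start=1):
    print(f"Sample {number} mean (estimate): {round(estimate)}")
```

The estimates differ from sample to sample, which is the uncertainty described above.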

## **Some Important Examples**  
|Parameter|Sample statistic|
|---|---|
|Mean|Sample mean|
|Median|Sample median|
|Mode|Sample mode|
|Variance|Sample variance|
|Standard deviation|Sample standard deviation|  
  
**Mean, median and mode** are different types of averages (typical values in a population).

For example:

- The typical age of people in a country
- The typical profits of a company
- The typical range of an electric car

**Variance** and **standard deviation** are two types of values describing how spread out the values are.

A single class of students in a school would usually be about the same age. The age of the students will have **low** variance and standard deviation.

A whole country will have people of all kinds of different ages. The variance and standard deviation of age in the whole country would then be **bigger** than in a single school class.
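
This comparison can be sketched with two small made-up lists of ages. The names `class_ages` and `country_ages` are hypothetical and the numbers are not real data.

```python
import statistics

class_ages = [15, 15, 16, 15, 16, 15, 16, 16]   # hypothetical school class
country_ages = [2, 9, 17, 25, 33, 41, 58, 76]   # hypothetical people from a country

class_variance = statistics.variance(class_ages)
class_std_dev = statistics.stdev(class_ages)
country_variance = statistics.variance(country_ages)
country_std_dev = statistics.stdev(country_ages)

print("Class:   variance", round(class_variance, 2),
      "standard deviation", round(class_std_dev, 2))
print("Country: variance", round(country_variance, 2),
      "standard deviation", round(country_std_dev, 2))
```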

## **1.8 Study Types**  
A statistical study can be a part of the process of gathering data.

There are different types of studies. Some are better than others, but they might be harder to do.

## **Main Types of Statistical Studies**  
The main types of statistical studies are **observational** and **experimental** studies.

We are often interested in knowing if something is the **cause** of another thing.

Experimental studies are generally better than observational studies for investigating this, but usually require more effort.

An observational study is when we observe and gather data without changing anything.

## **Experimental Studies**  
In an experimental study, the **circumstances** around the sample are changed. Usually, we compare two groups from a population and these two groups are treated **differently**.

One example can be a medical study to see if a new medicine is effective.

One group receives the medicine and the other does not. These are the different circumstances around those samples.

We can compare the health of both groups afterwards and see if the results are different.

Experimental studies can allow us to investigate causal relationships. A well designed experimental study can be useful since it can **isolate** the relationship we are interested in from **other effects**. Then we can be more confident that we are measuring the true effect.
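
One common way to isolate the effect of a treatment is to assign participants to the two groups at random. The sketch below shows this step only. The participant names are placeholders, and random assignment is an assumption about the study design rather than something every experiment uses.

```python
import random

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical list of 20 study participants.
participants = [f"person_{i}" for i in range(1, 21)]

# Shuffle a copy of the list, then split it in half.
shuffled = participants[:]
rng.shuffle(shuffled)
treatment_group = shuffled[:10]  # receives the medicine
control_group = shuffled[10:]    # does not receive the medicine

print("Treatment group:", treatment_group)
print("Control group:  ", control_group)
```

Because chance decides who ends up in which group, other differences between people tend to be spread evenly across both groups.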

## **1.9 Sample Types**  
A study needs participants and there are different ways of gathering them.

Some methods are better than others, but they might be more difficult.

## **Different Types of Sampling Methods**  
1. Random Sampling  
2. Convenience Sampling
3. Systematic Sampling 
4. Stratified Sampling
5. Clustered Sampling

## **1.9.1 Random Sampling**  
A random sample is where every member of the population has an **equal chance** to be chosen.

Random sampling is generally considered the best method. But, it can be difficult, or impossible, to make sure that it is completely random.

> **Note:** Every other sampling method is compared to how close it is to a random sample - the closer, the better.
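
A simple random sample can be sketched with Python's `random` module. The population here is just a made-up list of ID numbers.

```python
import random

rng = random.Random(3)  # fixed seed so the sketch is repeatable

# Hypothetical population: ID numbers 1 to 1000.
population = list(range(1, 1001))

# Every member has an equal chance of being chosen, and nobody is chosen twice.
sample = rng.sample(population, 50)

print("First ten sampled IDs:", sample[:10])
```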

## **1.9.2 Convenience Sampling**  
A convenience sample is where the participants that are the easiest to reach are chosen.

> **Note:** Convenience sampling is the easiest to do.
>
> In many cases this sample will not be similar enough to the population, and the conclusions can potentially be useless.

## **1.9.3 Systematic Sampling**  
A systematic sample is where the participants are chosen by some regular system.

For example:
- The first 30 people in a queue
- Every third person on a list
- The first 10 and the last 10
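
These three systems can be sketched with list slicing in Python. The list `people` is a made-up queue of 60 placeholder names.

```python
# Hypothetical queue of 60 people, in order of arrival.
people = [f"person_{i}" for i in range(1, 61)]

first_30 = people[:30]                        # the first 30 people in the queue
every_third = people[2::3]                    # person 3, 6, 9, and so on
first_and_last_10 = people[:10] + people[-10:]  # the first 10 and the last 10

print("Every third person starts with:", every_third[:3])
print("Sample sizes:", len(first_30), len(every_third), len(first_and_last_10))
```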

## **1.9.4 Stratified Sampling**  
A stratified sample is where the population is split into smaller groups called 'strata'.

The 'strata' can, for example, be based on demographics, like:
- Different age groups
- Professions  
  
Stratification of a sample is the first step. Another sampling method (like random sampling) is used for the second step of choosing participants from all of the smaller groups (strata).
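
The two steps can be sketched as follows. The population, the three age groups and the choice of 10 people per stratum are all assumptions made for this example. Real studies often choose the number per stratum in proportion to the size of each stratum.

```python
import random
from collections import defaultdict

rng = random.Random(11)  # fixed seed so the sketch is repeatable

# Hypothetical population: (person id, age group) pairs.
age_groups = ["18-29", "30-49", "50+"]
population = [(person_id, age_groups[person_id % 3]) for person_id in range(300)]

# Step 1: split the population into strata by age group.
strata = defaultdict(list)
for person_id, group in population:
    strata[group].append(person_id)

# Step 2: draw a random sample of the same size from every stratum.
per_stratum = 10
stratified_sample = {
    group: rng.sample(members, per_stratum)
    for group, members in strata.items()
}

for group, chosen in stratified_sample.items():
    print(group, "->", chosen)
```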

## **1.9.5 Clustered Sampling**  
A clustered sample is where the population is split into smaller groups called 'clusters'.

The clusters are usually natural, like different cities in a country.

The clusters are chosen randomly for the sample.

All members of the clusters can participate in the sample, or members can be chosen randomly from the clusters in a third step.
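
A clustered sample can be sketched in a similar way. The cities and residents below are placeholders, and the numbers of clusters and members chosen are arbitrary.

```python
import random

rng = random.Random(5)  # fixed seed so the sketch is repeatable

# Step 1: the population is split into natural clusters.
# Hypothetical clusters: 10 cities with 100 residents each.
cities = {
    f"city_{c}": [f"city_{c}_resident_{r}" for r in range(100)]
    for c in range(10)
}

# Step 2: choose some clusters at random.
chosen_cities = rng.sample(sorted(cities), 3)

# Optional step 3: choose members at random inside each chosen cluster.
cluster_sample = []
for city in chosen_cities:
    cluster_sample.extend(rng.sample(cities[city], 20))

print("Chosen cities:", chosen_cities)
print("Sample size:", len(cluster_sample))
```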

## **2.0 Data Types**  
Data can be of different types, and different types require different statistical methods to analyze.

## **Different types of data**  
There are two main types of data: Qualitative (or 'categorical') and quantitative (or 'numerical'). These main types also have different sub-types depending on their **measurement level**.

## **Qualitative Data**  
Information about something that can be sorted into different categories that can't be described directly by numbers.

Examples:  
- Brands
- Nationality
- Professions  
  
With categorical data we can calculate statistics like **proportions**. For example, the proportion of Indian people in the world, or the percent of people who prefer one brand to another.
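
Proportions for categorical data can be sketched by counting how often each category appears. The answers in `preferences` are made up.

```python
from collections import Counter

# Made-up survey answers about which brand people prefer.
preferences = ["Brand A", "Brand B", "Brand B", "Brand A", "Brand B",
               "Brand B", "Brand A", "Brand B", "Brand B", "Brand B"]

counts = Counter(preferences)  # number of answers per category
proportions = {
    brand: count / len(preferences)
    for brand, count in counts.items()
}

for brand, proportion in proportions.items():
    print(f"{brand}: {proportion:.0%}")
```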

## **Quantitative Data**  
Information about something that is described by numbers.

Examples:  
- Income
- Age
- Height  

With numerical data we can calculate statistics like the **average** income in a country, or the **range** of heights of players in a football team.

## **2.1 Measurement Levels**  
Different data types have different measurement levels.

Measurement levels are important for what types of statistics can be calculated and how to best present the data.  

The main types of data are Qualitative (categories) and Quantitative (numerical). These are further split into the following measurement levels, which are also called measurement 'scales':

1. Nominal Level
2. Ordinal Level
3. Interval Level
4. Ratio Level



## **2.1.1 Nominal Level**  
Categories (qualitative data) without any order.

Examples:  
- Brand names
- Countries
- Colors

## **2.1.2 Ordinal Level**  
Categories that can be ordered (from low to high), but the precise "distance" between each is not meaningful.

Examples:  
- Letter grade scales from F to A
- Military ranks
- Level of satisfaction with a product  

Consider letter grades from F to A: Is the grade A precisely twice as good as a B? And, is the grade B also twice as good as C?

Exactly how much distance there is between grades is not clear and precise. If the grades are based on amounts of points on a test, you can say that there is a precise "distance" on the point scale, but not between the grades themselves.
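
Ordinal data can still be sorted, and statistics that only depend on order (like the median) still make sense. The sketch below assumes a five-step grade scale (F, D, C, B, A), which is one of several scales in use, and a made-up list of grades.

```python
# Assumed grade scale, ordered from lowest to highest.
grade_order = ["F", "D", "C", "B", "A"]
rank = {grade: position for position, grade in enumerate(grade_order)}

grades = ["B", "A", "C", "B", "F", "A", "D"]  # made-up grades

# Sort by position on the scale, not alphabetically.
sorted_grades = sorted(grades, key=rank.get)

# The median is the middle value of the sorted grades (odd number of grades).
median_grade = sorted_grades[len(sorted_grades) // 2]

print("Sorted grades:", sorted_grades)
print("Median grade: ", median_grade)
```

The ranks are only used for ordering. Calculating an average of the rank numbers would treat the distance between grades as equal, which the ordinal level does not guarantee.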

## **2.1.3 Interval Level**  
Data that can be ordered and the distance between them is objectively meaningful. But there is no natural 0-value where the scale originates.

Examples:  
- Years in a calendar
- Temperature measured in Fahrenheit  

> **Note:** Interval scales are usually invented by people, like degrees of temperature.  
>
> 0 degrees Celsius is 32 degrees Fahrenheit. There are consistent distances between each degree (for every 1 extra degree Celsius, there are 1.8 extra degrees Fahrenheit), but the two scales do not agree on where 0 degrees is.
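
A short sketch shows why ratios are not meaningful on an interval scale. The function name `celsius_to_fahrenheit` and the two example temperatures are chosen for illustration.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


cold_c, warm_c = 10.0, 20.0
cold_f = celsius_to_fahrenheit(cold_c)
warm_f = celsius_to_fahrenheit(warm_c)

# Differences are consistent: 10 degrees Celsius is always 18 degrees Fahrenheit.
print("Difference in Celsius:   ", warm_c - cold_c)
print("Difference in Fahrenheit:", warm_f - cold_f)

# Ratios are not: the same two temperatures give different ratios on each scale.
print("Ratio in Celsius:   ", warm_c / cold_c)
print("Ratio in Fahrenheit:", warm_f / cold_f)
```

Since the ratio depends on which scale is used, it does not make sense to say that 20 degrees is "twice as warm" as 10 degrees.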

## **2.1.4 Ratio Level**  
Data that can be ordered and there is a consistent and meaningful distance between them. And it also has a natural 0-value.

Examples: 
- Money
- Age
- Time  

Data that is on the ratio level (or "ratio scale") gives us the most detailed information. Crucially, we can compare precisely how big one value is compared to another. This would be the ratio between these values, like twice as big, or one tenth as big.