# Basic Statistics

In this section we will revise the basic statistical concepts that you need to be comfortable with before getting started with data analysis.


### What is Statistics?

Statistics is the practice of handling data effectively:
its collection, organisation, analysis, interpretation and presentation.

We use the term "statistics" to describe two rather different activities:

* Analysing a set of data to produce simple summary metrics (we call this **descriptive statistics**).

* Using the data we have to estimate something we can’t directly measure (we call this **inferential statistics**).

### Population vs sample

**Population**:
The entire set of individuals you would like to know something about.

**Sample**:
The set of individuals for which you have data (e.g. the people who answer your questionnaire).

The population includes the sample… but a sample usually doesn’t cover the whole population!


### Parameter vs statistic

**Parameter**: a number that describes the *population* (i.e. a theoretical quantity which cannot be observed directly).
It is a fixed number, but we do not know its exact value.

e.g. theoretical mean ($\mu$) and variance ($\sigma^2$) of the height of the UK population.


**Statistic**: a number calculated from the *sample* that is used to describe or summarise the data that were actually collected.

e.g. sample mean ($\bar{x}$) and variance ($s^2$) of the height of the students on this course.



Although we cannot examine a large population directly, we can use statistics such as $\bar{x}$ and $s^2$ to *estimate* unknown parameters of interest. This process is called **statistical inference**.
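
As a minimal sketch of this idea, the Python snippet below simulates a hypothetical population of heights and then estimates its parameters from a small random sample. The population size, the mean of 170 cm and the standard deviation of 10 cm are assumptions invented for illustration. In a real study the population values would not be available, so $\mu$ and $\sigma^2$ could not be computed this way.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100,000 simulated heights in cm (assumed values).
population = [random.gauss(170, 10) for _ in range(100_000)]

# Parameters (mu, sigma^2): computable here only because we simulated
# the whole population ourselves.
mu = statistics.fmean(population)
sigma2 = statistics.pvariance(population)

# Sample: the 50 individuals we actually "measured".
sample = random.sample(population, 50)

# Statistics (x_bar, s^2): calculated from the sample alone.
x_bar = statistics.fmean(sample)
s2 = statistics.variance(sample)  # sample variance, n - 1 denominator

print(f"population mean  mu    = {mu:.2f}, sample mean     x_bar = {x_bar:.2f}")
print(f"population var   sigma2 = {sigma2:.2f}, sample variance s2    = {s2:.2f}")
```

Note the two different variance functions: `statistics.pvariance` treats its input as a complete population, whereas `statistics.variance` treats it as a sample and divides by $n - 1$.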


<center><img src="../Resources/stats_cycle.jpg" style="height:300px" /></center>

N.B. Because a term such as "mean" can be associated with both the sample and the population, it is important to be clear about exactly what we are talking about. Notice the symbols that are chosen: we tend to use Greek letters (e.g. $\mu$) for population parameters, but Roman letters (e.g. $\bar{x}$) for sample statistics. This can be helpful to remember!

---


# Types of data

Before diving into the various statistics we can use to describe sampled data, we need to think carefully about the different *types* of data we might have collected. There are two basic data types in statistics: **quantitative** and **categorical**:

## Quantitative data
Quantitative data arise from a *measurement* or a *counting* process. We can distinguish between two subtypes:

### Continuous data 
(*any value is allowed, within a relevant interval*)

* Blood pressure
* Relative humidity
* Maximum velocity achieved by a projectile


### Discrete data 
(*only specific values are allowed*)

* Number of chlorine atoms in a molecule
* Shoe size
* Change in number of coal-fired power stations, 2000-2020


## Categorical data
With categorical data, there are *no relevant numerical relationships* between the values that we might collect. Once again, we distinguish between two subtypes:

### Nominal data 
(*values have no relevant ordering*)

* Manufacturer of a SARS-CoV-2 vaccine
* Genus of an insect
* Type of a rock (e.g. basalt / granite / sandstone / ...)


### Ordinal data 
(*values have a relevant ordering*)

* Degree of agreement (e.g. a [Likert scale](https://en.wikipedia.org/wiki/Likert_scale))
* Perceived expertise (e.g. beginner / intermediate / advanced)
* Life stage (e.g. embryo / larva / pupa / adult)
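
To make the four data types above concrete, here is a small Python sketch showing which operations are sensible for each. All of the values are invented for illustration, and the variable names are hypothetical. It is one possible way to represent such data, not a prescribed approach.

```python
from collections import Counter
import statistics

# Hypothetical observations, one list per data type (values invented).
blood_pressure = [118.5, 121.0, 130.2, 115.8]              # continuous
chlorine_atoms = [0, 1, 1, 2, 4]                           # discrete
rock_type = ["basalt", "granite", "basalt", "sandstone"]   # nominal
expertise = ["beginner", "advanced", "intermediate",
             "beginner", "intermediate"]                   # ordinal

# Quantitative data: arithmetic on the values is meaningful.
mean_bp = statistics.fmean(blood_pressure)
total_atoms = sum(chlorine_atoms)

# Nominal data: we can count categories, but adding or ranking them
# has no meaning.
rock_counts = Counter(rock_type)

# Ordinal data: sorting makes sense once the ordering is stated explicitly.
levels = ["beginner", "intermediate", "advanced"]
rank = {level: position for position, level in enumerate(levels)}
ordered_expertise = sorted(expertise, key=rank.get)

print(mean_bp, total_atoms)
print(rock_counts)
print(ordered_expertise)
```

Note that sorting the `expertise` strings without the explicit `rank` mapping would order them alphabetically, which is not the ordering we care about.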


---