# Statistics 101

As you set out to learn about data, it is important to know the basics of statistics, which underpin any analysis you will perform. In this course, we will cover what statistics is and the simple metrics you need to master in order to run a thorough analysis 🚀.

## What you'll learn in this class  🧐🧐

- What statistics is and the different types of data
- How to collect statistical data
- How to create representative samples
- How to compute a mean, median and mode
- How to measure variation with intervals, standard deviations and Z-scores

## Statistics and data types

### Definition

As defined in the Merriam-Webster dictionary 📖: 

* _Statistics is the set of methods for collecting, processing and interpreting observational data relating to a group of individuals or units._

### Data types

#### Quantitative Data #️⃣

When we think of statistics, we usually think of numbers (e.g. age, weight, income, etc.). These are referred to as numerical or quantitative data.

##### Discrete

Discrete quantitative data can only take **specific values**. For example, you can have 15 people in a class but not 15.3 people. Similarly, if you study shoe sizes, you will only find sizes between 33 and 48, and never a size such as 33.78. These are called **discrete values**: they can only take certain values within an interval.

##### Continuous

Conversely, continuous quantitative data can take any value within a given interval. Examples include an employee's salary, the measurement of a temperature or the income of a company.

#### Qualitative Data 🥃

Gender, country of origin, eye colour, skin colour, etc. are considered categories. This is called **qualitative or categorical data**. These can also be of two types:

##### Ordinal

If your qualitative data can be ordered, then they are **ordinal**. For example, when you grade your students between A and D (A being the highest grade, D the lowest), you can give an order: grade A is better than B, which is better than C, which is better than D. Similarly, when you run a survey in which respondents are asked to answer "Perfect", "Good", "Not bad", "Can do better" or "Needs review", you can give an order to these answers.

##### Nominal

For other qualitative data, no order can be given. For example, country, sex, eye colour or skin colour are purely **nominal** data (_e.g._ you cannot say that Male is better than Female).
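To make the difference concrete, here is a minimal Python sketch showing that ordinal answers can be sorted once an explicit order is supplied, which is not possible for nominal data. The helper `sort_ordinal` and the list of responses are hypothetical and only serve the illustration.

```python
# Survey answers from the example above, ordered from best to worst
ANSWER_ORDER = ["Perfect", "Good", "Not bad", "Can do better", "Needs review"]


def sort_ordinal(values, order):
    """Sort ordinal values according to an explicit category order."""
    rank = {category: position for position, category in enumerate(order)}
    return sorted(values, key=lambda value: rank[value])


# Hypothetical responses collected in a survey
responses = ["Good", "Needs review", "Perfect", "Not bad", "Good"]
sorted_responses = sort_ordinal(responses, ANSWER_ORDER)
print(sorted_responses)
# ['Perfect', 'Good', 'Good', 'Not bad', 'Needs review']
```

Nominal values such as countries have no such `order` list, so any sorting of them (alphabetical, for instance) is a display convenience rather than a ranking.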

## Sampling & Collecting Data

### What is sampling?

It is important to consider whether your studies should focus on the total population of individuals or on a portion of that population (a sample). If you wanted to study the entire French population, you would have to conduct a *census* which, in a country like France, is extremely expensive and time-consuming. Companies almost never conduct a census. Instead they use **sampling**, i.e. a study of a representative part of the entire population of individuals they wish to study.

#### Parameters

The term **parameter** defines any factor descriptive of a particular population. For example, the proportion of men and women, the average age, or the unemployment rate can be three different parameters.

### Collecting data

#### Experimental VS observational studies

There are two completely different ways of collecting data:

- Observing what's happening in a population.
- Running an experiment on that population.

For example, we could go out on the street and report the color of every car on the street. This would be an observational study. Conversely, if we took each of the cars on a track to test their performance, we're clearly in an experimental study.

The difference is sometimes ambiguous. For example, if you run a marketing study where you send a questionnaire to people, you are closer to an observational study, unless you influence your respondents by giving them information before they answer. In that case, you are running an experimental study.

#### Transversal, retrospective or longitudinal studies

Experimental as well as observational studies can be of two types: transversal or longitudinal.

- Transversal studies correspond to data collection at time T. They can be complemented by retrospective studies that look for historical data to see how things have changed in the past.
- Longitudinal studies follow a group of individuals into the future (pharmaceutical studies, for example, follow patients to test their medication).

### Create a representative sample

#### Simple sampling

How do you get unbiased data? The simplest way is to select individuals from the population at random. If each of them has an equal chance of being selected, then we have what is called a _simple random sample_. A simple way to do this is to create a system. For example, you can put the people in a list in random order, number them and take every even number.

However, there are two other more rigorous techniques for creating representative samples.
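A simple random sample can be sketched in Python with the standard `random` module. This is only an illustration: the population of 100 numbered people and the helper name `simple_random_sample` are assumptions, not part of the course material.

```python
import random


def simple_random_sample(population, k, seed=None):
    """Draw k distinct individuals, each with an equal chance of selection."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(population, k)


# Hypothetical population of 100 numbered people
population = [f"person_{i}" for i in range(1, 101)]
sample = simple_random_sample(population, k=10, seed=42)
print(sample)
```

Fixing the seed is a convenience for teaching and debugging. In a real study you would usually document the seed so that the draw can be audited.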

#### Stratified Sampling

When we talk about stratified sampling, we look at the proportion of individuals with certain characteristics in the population and try to replicate these same proportions in the sample. Be careful: these characteristics must be mutually exclusive for the sampling to be valid. For example, if the population is made up of 30% men and 70% women, your sample should also contain 30% men and 70% women. The proportions match those of the population, and the characteristic (male or female) is mutually exclusive (since the person can be either a man or a woman but not both).
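The idea of replicating population proportions can be sketched as follows. The function `stratified_sample`, its parameters and the 30/70 population are hypothetical. The sketch assumes each individual belongs to exactly one stratum and uses simple rounding, so the final sample size can differ slightly from the requested one when proportions do not divide evenly.

```python
import random
from collections import defaultdict


def stratified_sample(population, stratum_of, sample_size, seed=None):
    """Sample so that each stratum keeps roughly its population proportion."""
    rng = random.Random(seed)

    # Group individuals by their (mutually exclusive) stratum
    strata = defaultdict(list)
    for individual in population:
        strata[stratum_of(individual)].append(individual)

    sample = []
    for members in strata.values():
        # Quota proportional to the stratum's share of the population
        quota = round(sample_size * len(members) / len(population))
        sample.extend(rng.sample(members, quota))
    return sample


# Hypothetical population: 30% men, 70% women
population = [("man", i) for i in range(30)] + [("woman", i) for i in range(70)]
sample = stratified_sample(
    population, stratum_of=lambda person: person[0], sample_size=10, seed=0
)
men = sum(1 for gender, _ in sample if gender == "man")
women = len(sample) - men
print(men, women)  # 3 7
```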

#### Cluster sampling

Cluster sampling is carried out in successive steps:

- Constitute random groups of individuals
- Randomly select a group to be studied by simple random sampling
- Study each individual in this group.

For example, to study the employees of a company, we could form 10 groups of employees and then randomly choose one group to study.
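The steps listed above can be sketched in Python. The company of 200 employees and the helper `cluster_sample` are hypothetical, and the groups are formed at random as in the description above (in practice, clusters are often natural groups such as offices or teams).

```python
import random


def cluster_sample(population, n_clusters, seed=None):
    """Form random groups, pick one at random and return all its members."""
    rng = random.Random(seed)

    # Step 1: constitute random groups of individuals
    shuffled = population[:]
    rng.shuffle(shuffled)
    clusters = [shuffled[i::n_clusters] for i in range(n_clusters)]

    # Step 2: randomly select one group
    chosen = rng.choice(clusters)

    # Step 3: every individual of the chosen group is studied
    return chosen


# Hypothetical company of 200 employees split into 10 groups
employees = [f"employee_{i}" for i in range(1, 201)]
studied = cluster_sample(employees, n_clusters=10, seed=1)
print(len(studied))  # 20
```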

#### Convenience Sampling

Unlike the previous methods, this one should be avoided as much as possible. Convenience sampling involves selecting the individuals to be studied yourself. For example, you can go out into the street and ask people to respond to your study. These people, chosen by you, are likely to belong to a socio-economic group influenced, for example, by the neighbourhood in which you are interviewing people. Similarly, if you are doing a study and you are interviewing only friends of yours, you are performing convenience sampling.

Even though the results will be of limited significance, businesses may still use this type of sampling when they are under severe time or money constraints.

### Test your results

How can we determine whether our experiment is producing results that are not representative of reality? One way is to repeat the experiment a number of times and see if the results remain the same. This is what scientists do to confirm the results of their experiments: they replicate them on different samples and under different circumstances.

#### Blinding

Blinding is widespread: subjects do not know which treatment they are receiving, which lets them behave as they would in real life. This technique is widely used when testing drugs: one group takes the drug while another takes a placebo, and the subjects do not know which group they are in.

An even more effective approach is a double-blind study, in which neither the subjects nor the scientists know who is receiving the treatment.

#### Confounding


One concept that is important to keep in mind is *confounding*: a situation in which you cannot tell which factor actually influences the outcome of your study. For example, it is well known that smokers are more prone to cardiovascular disease. However, an observational comparison of smokers and non-smokers is not enough on its own to demonstrate causality, because other factors such as socio-economic background or air quality may also influence the outcome.


## Measuring the center of a distribution

There are three ways to measure the centre of a distribution in statistics: mean, median and mode.

### Mean (or Average)

The mean is the sum of all the measurements in your sample divided by the total number of individuals. The formula is as follows:

$$
\bar{X} = \frac{\sum_{i=1}^{n}X_{i}}{n}
$$

Example: in the following sample, Mary got 17/20 on her homework, Roman got 14/20 and Michael got 15/20. On average, the three of them got:

$$
\bar{X} = \frac{17+14+15}{3} \approx 15.33
$$
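The same calculation can be checked in Python. This is a minimal sketch: the variable names are illustrative, and `statistics.mean` from the standard library is used only as a cross-check.

```python
from statistics import mean

# Grades out of 20 from the example above
grades = {"Mary": 17, "Roman": 14, "Michael": 15}

# Sum of all measurements divided by the number of individuals
average = sum(grades.values()) / len(grades)
print(round(average, 2))  # 15.33

# Cross-check with the standard library
library_average = mean(list(grades.values()))
```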

### Median

The median is literally the middle value of your sample, once the sample has been sorted from the smallest value to the largest. If you have an even number of individuals in your sample, you simply take the average of the two values in the middle of your dataset.

Example: Here is the distribution of salaries among 5 employees:

<table>
  <tr>
   <td>Alexis
   </td>
   <td>20 000€ / year
   </td>
  </tr>
  <tr>
   <td>Sarah
   </td>
   <td>22 000€ / year
   </td>
  </tr>
  <tr>
   <td>Jean-Claude
   </td>
   <td>23 000€ / year
   </td>
  </tr>
  <tr>
   <td>Mathilde
   </td>
   <td>40 000€ / year
   </td>
  </tr>
  <tr>
   <td>Bertrand
   </td>
   <td>70 000€ / year
   </td>
  </tr>
</table>

The median of this sample is: 23 000€.
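As a quick check, the median of the table above can be computed with `statistics.median`. The four-employee variant is a hypothetical addition that illustrates the even-sized case.

```python
from statistics import median

# Yearly salaries from the table above, in euros
salaries = [20_000, 22_000, 23_000, 40_000, 70_000]
median_salary = median(salaries)
print(median_salary)  # 23000

# Hypothetical even-sized sample: the same employees without Bertrand.
# The median is then the average of the two middle values.
median_without_bertrand = median(salaries[:-1])
print(median_without_bertrand)  # 22500.0
```

`median` sorts the data internally, so the input list does not need to be ordered beforehand.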

### Mode

Finally, the mode is the value that appears most frequently in your sample. If no value repeats, the sample has no meaningful mode. Conversely, if several values share the highest frequency, each of them is a mode, and the distribution is called multi-modal. Be aware that software does not always handle these cases in the same way: some functions return only one of the modes.
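A short sketch with Python's standard `statistics` module shows how a library may treat these cases. The shoe sizes are hypothetical, and the behaviour of `mode` described in the comments is the one documented for Python 3.8 and later.

```python
from statistics import mode, multimode

# Hypothetical shoe sizes: 40 and 42 both appear twice
shoe_sizes = [38, 40, 40, 42, 42, 44]

all_modes = multimode(shoe_sizes)
print(all_modes)  # [40, 42]

# mode() returns a single value: the first mode encountered
single_mode = mode(shoe_sizes)
print(single_mode)  # 40

# With no repeated value, every value is returned as a mode
no_repeat = multimode([1, 2, 3])
print(no_repeat)  # [1, 2, 3]
```

When a sample may be multi-modal, `multimode` is the safer call because it does not silently hide the other modes.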

## When and how to use these measurements

Means are simple to calculate but are not informative for every sample. This is especially true when studying wages. If 9 people earn €50,000 / year and one person earns €1,000,000 / year, then the mean is €145,000 / year, a figure that represents nobody in the population. Here the median (€50,000 / year) would have been more relevant. This is why governments tend to report the median salary rather than the average salary.

The median is also a better measure of the centre for ordinal qualitative data, for which a mean cannot be computed.

Finally, you can also compute weighted averages when you want some data to weigh more than others. This is often done for school grades, for example when mathematics has a higher weight than English. Here is the mathematical formula:
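The wage example can be reproduced in a few lines of Python. This is a sketch using the standard `statistics` module, with variable names chosen for the illustration.

```python
from statistics import mean, median

# 9 people earn 50,000 per year and one person earns 1,000,000 per year
salaries = [50_000] * 9 + [1_000_000]

mean_salary = mean(salaries)
median_salary = median(salaries)

print(f"mean:   {mean_salary:,.0f}")    # mean:   145,000
print(f"median: {median_salary:,.0f}")  # median: 50,000
```

A single extreme value pulls the mean far from what most people earn, while the median stays at the typical salary.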

$$
\bar{X} = \frac{\sum_{i=1}^{n}w_{i}x_{i}}{\sum_{i=1}^{n}w_{i}}
$$
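The weighted average formula translates directly into code. In this sketch, the function name `weighted_mean`, the two grades and the weights (3 for mathematics, 1 for English) are hypothetical values chosen for the illustration.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


# Hypothetical grades out of 20: mathematics (weight 3), English (weight 1)
grades = [12, 16]
weights = [3, 1]
result = weighted_mean(grades, weights)
print(result)  # 13.0
```

With equal weights the function reduces to the ordinary mean, which would be 14.0 here. The higher weight on mathematics pulls the result towards 12.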

Now let's look at the shape of your distribution. You've probably heard of the "bell curve" that describes the distribution of many statistical samples. When your mean, median and mode are equal, the curve is symmetric. However, this is rarely the case. Your curve is usually either right skewed, meaning that your mean and median are larger than your mode, or left skewed, meaning that your mode is larger than your mean and median.

![](https://essentials-assets.s3.eu-west-3.amazonaws.com/M03-Python_programming_and_statistics/D01-Introduction_to_python_and_statistics/normal_distribution.png)

![](https://essentials-assets.s3.eu-west-3.amazonaws.com/M03-Python_programming_and_statistics/D01-Introduction_to_python_and_statistics/new_skewed_distributions.png)

Having a normal distribution is important for certain statistical calculations. One example is Student's t-test, which can be used in an A/B test to check whether the difference between two means is statistically significant. Many operations on means do not strictly require a normal distribution, but a best practice is to always check the assumptions behind each test or theorem.


## Measure of Variation

### Intervals

The simplest way to measure variation is with an interval (also called the range), which is the difference between the largest and smallest value in your sample. For example, in 1993, the lowest temperature of the year in Paris was -5 degrees Celsius and the highest was 33 degrees. So the interval is 38 degrees.
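In code, the interval is simply the maximum minus the minimum. In the sketch below, the helper name `value_range` and the intermediate readings are hypothetical. Only the extremes (-5 and 33) come from the example above.

```python
def value_range(values):
    """Difference between the largest and the smallest value."""
    return max(values) - min(values)


# Hypothetical temperature readings in degrees Celsius;
# only the extremes (-5 and 33) come from the example above
temperatures = [-5, 2, 11, 18, 33, 24, 9]
interval = value_range(temperatures)
print(interval)  # 38
```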

### Standard Deviation

A much more widely used measure is the standard deviation. It tells us how much the values in our sample deviate from the mean. Here is the formula for a sample of $n$ values:

$$
 \sigma = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}
$$
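The formula above can be written out step by step and compared with `statistics.stdev`, which also uses the $n-1$ denominator. This is a sketch: the function name `sample_std` is illustrative, and the data are the three grades from the mean example.

```python
import math
from statistics import stdev


def sample_std(values):
    """Sample standard deviation with the n - 1 denominator."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are required")
    x_bar = sum(values) / n
    squared_deviations = sum((x - x_bar) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))


# Grades from the mean example: 17, 14 and 15 out of 20
grades = [17, 14, 15]
sigma = sample_std(grades)
print(round(sigma, 2))  # 1.53
```

If your data cover the whole population rather than a sample, the denominator is usually $n$ instead of $n-1$. The standard library exposes that variant as `statistics.pstdev`.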

Although standard deviations are sensitive to statistical outliers (i.e. exceptionally large or small numbers), they are not as affected as intervals. The larger your dataset, the less your results will be affected by these extreme values. It is still recommended to look at all points that are abnormally far from "normal" values and to remove them from your sample when you know that they only bias your calculations.

Standard deviations can be calculated on any distribution. When the data follow a normal distribution, about 68% of the values lie within one standard deviation of the mean, about 95% within two standard deviations and about 99.7% within three standard deviations. This is known as the three-sigma rule (or empirical rule). It is a mathematical property of the normal distribution and only an approximation for other bell-shaped distributions.

![](https://essentials-assets.s3.eu-west-3.amazonaws.com/M03-Python_programming_and_statistics/D01-Introduction_to_python_and_statistics/three_sigma_rule.png)
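These proportions can be observed on simulated data. The sketch below draws values from a normal distribution with Python's `random.gauss`. The mean of 100, the standard deviation of 15, the sample size and the seed are arbitrary assumptions, and the shares you obtain will be close to, but not exactly, 68%, 95% and 99.7%.

```python
import random
from statistics import mean, stdev

# Simulated sample from a normal distribution (arbitrary parameters)
rng = random.Random(0)
values = [rng.gauss(100, 15) for _ in range(20_000)]

mu = mean(values)
sigma = stdev(values)


def share_within(values, mu, sigma, k):
    """Proportion of values within k standard deviations of the mean."""
    inside = sum(1 for x in values if abs(x - mu) <= k * sigma)
    return inside / len(values)


shares = {k: share_within(values, mu, sigma, k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

Running the same check on a strongly skewed sample (wages, for instance) generally gives different proportions, which is a quick way to see that the rule depends on the shape of the distribution.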

### Z-Score

For each individual in your sample, you can calculate his or her Z-score. This measure tells you how many standard deviations the individual is from the mean.

$$
Z = \frac{x - \mu}{\sigma}
$$

A Z-score is positive for all values that are above average and negative for all values that are below average. For example, if you get a Z-score of 1.5 for an individual in your sample, this means that the individual is 1.5 standard deviations above the mean.

By convention, values:

- Within plus or minus 2 standard deviations of the mean are considered ordinary.
- More than 2 standard deviations from the mean are considered uncommon.
- More than 3 standard deviations from the mean are considered abnormal.
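The Z-score formula and the convention above can be sketched as follows. The function names `z_scores` and `label` are hypothetical, and the sketch assumes the sample standard deviation (with the $n-1$ denominator) is used as $\sigma$. The data are the five salaries from the median example.

```python
from statistics import mean, stdev


def z_scores(values):
    """Z-score of each value: (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # assumption: sample standard deviation (n - 1)
    return [(x - mu) / sigma for x in values]


def label(z):
    """Classify a Z-score with the convention described above."""
    distance = abs(z)
    if distance > 3:
        return "abnormal"
    if distance > 2:
        return "uncommon"
    return "ordinary"


# Yearly salaries from the median example, in euros
salaries = [20_000, 22_000, 23_000, 40_000, 70_000]
scores = z_scores(salaries)
for salary, z in zip(salaries, scores):
    print(salary, round(z, 2), label(z))
```

In this small sample even the highest salary stays below 2 standard deviations from the mean (its Z-score is about 1.66), so every value is labelled ordinary.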

## Box plot 🍱

Another way to look at statistics is through quartiles, quintiles, deciles and percentiles, which split your sample into groups of 25%, 20%, 10% and 1% respectively. Quartiles are typically displayed in a box plot that looks like the one below.

![](https://essentials-assets.s3.eu-west-3.amazonaws.com/M03-Python_programming_and_statistics/D01-Introduction_to_python_and_statistics/boxplot_explained.png)

To compute quartiles, you need to find 3 numbers (Q1, Q2 and Q3) that divide your sample into 4 equal parts, so that 25% of your sample falls in the first quartile, 25% in the second quartile, and so on. For example, in a sample of 100 people who were asked their salary:

- 25 earn less than 2000€ / month
- 25 earn between 2001 and 3000€ / month
- 25 earn between 3001 and 4000€ / month
- 25 earn more than 4000€ / month

Q2 is the median value of your sample. The difference between Q3 and Q1 is called the interquartile interval (or interquartile range), which covers the middle 50% of the values in your sample. For example, the interquartile interval of the sample above is 2000€ (4000€ - 2000€).
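Quartiles and the interquartile interval can be computed with `statistics.quantiles` (available from Python 3.8). The eight monthly salaries below are hypothetical. Note that several conventions exist for interpolating quartiles, so other tools may return slightly different values for Q1 and Q3 on the same data.

```python
from statistics import median, quantiles

# Hypothetical monthly salaries in euros
salaries = [1500, 1800, 2100, 2600, 3100, 3600, 4200, 5000]

# n=4 asks for the 3 cut points that split the data into 4 equal parts
q1, q2, q3 = quantiles(salaries, n=4)
iqr = q3 - q1  # interquartile interval

print("Q1:", q1)
print("Q2:", q2, "(median:", median(salaries), ")")
print("Q3:", q3)
print("Interquartile interval:", iqr)
```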

## Resources 📚📚

- Definition of Statistics - [https://bit.ly/2r5M0ZZ](https://bit.ly/2r5M0ZZ)
- Sample - [http://bit.ly/2vvlT2S](http://bit.ly/2vvlT2S)
- Marketing Research: Sampling - [http://bit.ly/2esWUpo](http://bit.ly/2esWUpo)
- Stratified Sampling - [http://bit.ly/2xLfBxj](http://bit.ly/2xLfBxj)
- Cluster Sampling - [http://bit.ly/2jBTYJ8](http://bit.ly/2jBTYJ8)
- Searching for Sex - [http://nyti.ms/2wYjdKw](http://nyti.ms/2wYjdKw)
- Understanding Boxplots - [https://bit.ly/2kopsX](https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51)