# Statistics Two - Descriptive Statistics

Descriptive statistics gives us insight into data without having to look at all of it in detail.



### Key Features to Describe about Data
Getting a quick overview of how the data is distributed is an important step in statistical methods.

We calculate key numerical values that tell us about the distribution of the data. We also draw graphs that show visually how the data is distributed.

Key Features of Data:

- Where is the center of the data? (location)
- How much does the data vary? (scale)
- What is the shape of the data? (shape)

These can be described by summary statistics (numerical values).


### The Center of the Data
The **center** of the data is where most of the values are concentrated.

Different kinds of averages, like mean, median and mode, are **measures** of the center.

**Note**: Measures of the center are also called location parameters, because they tell us something about where data is 'located' on a number line.
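
As a minimal sketch of these three averages, the snippet below uses Python's standard `statistics` module. The `ages` list is made up for illustration and is not the Nobel Prize data.

```python
import statistics

# Hypothetical sample of ages, made up for illustration.
ages = [45, 52, 52, 58, 61, 63, 68]

mean_age = statistics.mean(ages)      # sum of the values divided by their count
median_age = statistics.median(ages)  # middle value of the sorted data
mode_age = statistics.mode(ages)      # most frequent value

print("mean:", mean_age)
print("median:", median_age)
print("mode:", mode_age)
```

The three measures need not agree: here the mode sits below the mean and the median, because the repeated value is on the low side of the sample.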

### The Variation of the Data
The variation of the data is how spread out the data are around the center.

Statistics like standard deviation, range and quartiles are measures of variation.

**Note**: Measures of variation are also called scale parameters.

### The Shape of the Data
The shape of the data refers to how the data are bunched up on either side of the center.

Statistics like skew describe whether the right or left side of the center is bigger. Skew is one type of shape parameter.
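
One common way to put a number on skew is the moment coefficient of skewness, $m_3 / m_2^{3/2}$, where $m_2$ and $m_3$ are the second and third central moments. The sketch below assumes that definition (others exist). The `skewness` helper and the sample lists are hypothetical. A positive result means the tail on the right is longer, and a negative result means the tail on the left is longer.

```python
import statistics

def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Made-up samples for illustration.
right_tailed = [1, 2, 2, 3, 3, 3, 10]     # one large value stretches the right side
left_tailed = [-x for x in right_tailed]  # mirror image: tail on the left
symmetric = [1, 2, 3, 4, 5]

for name, sample in [("right", right_tailed), ("left", left_tailed), ("symmetric", symmetric)]:
    print(name, round(skewness(sample), 3))
```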

## 1.0 Frequency Tables
One typical way of presenting data is with frequency tables.

A frequency table counts and orders data into a table. Typically, the data will need to be sorted into intervals.

Frequency tables are often the basis for making graphs to visually present the data.
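
The counting and ordering described above can be sketched in a few lines of Python. The `ages` list and the interval width of 10 are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical ages, made up for illustration.
ages = [23, 35, 37, 41, 44, 48, 52, 55, 55, 61, 67, 72]
width = 10  # interval width (an assumed choice)

# Map each value to the lower edge of its interval, then count per interval.
counts = Counter((age // width) * width for age in ages)

# Order the intervals and build rows such as ("20-29", 1).
table = [(f"{low}-{low + width - 1}", counts[low]) for low in sorted(counts)]

print("interval | frequency")
for interval, frequency in table:
    print(f"{interval:>8} | {frequency}")
```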

## 1.1 Visualizing Data
Different types of graphs are used for different kinds of data. For example:

- Pie charts for qualitative data
- Histograms for quantitative data
- Scatter plots for bivariate data

Graphs often have a close connection to numerical summary statistics.

For example, box plots show where the quartiles are.

Quartiles also tell us where the minimum and maximum values, range, interquartile range, and median are.



## 1.2 Statistics - Histograms
A histogram visually presents quantitative data.

### Histograms
A histogram is a widely used graph to show the distribution of quantitative (numerical) data.

It shows the frequency of values in the data, usually in intervals of values. Frequency is the number of times a value appears in the data.

Each interval is represented with a bar, placed next to the other intervals on a number line.

The height of the bar represents the frequency of values in that interval.

Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020:



![histo.PNG](attachment:histo.PNG)

This histogram uses age intervals from 10 to 19, 20 to 29, and so on.

**Note**: Histograms are similar to bar graphs, which are used for qualitative data.

### Bin Width
The intervals of values are often called 'bins', and the length of an interval is called the 'bin width'.

We can choose any width. It is best to choose a bin width that shows enough detail without being confusing.

Here is a histogram of the same Nobel Prize winner data, but with bin widths of 5 instead of 10:

![binhisto.PNG](attachment:binhisto.PNG)

This histogram uses age intervals from 15 to 19, 20 to 24, 25 to 29, and so on.

Smaller intervals give a more detailed look at the distribution of the age values in the data.
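
To see the effect of bin width without plotting, the sketch below counts the same values with bin widths of 10 and 5 and prints a rough text histogram. The `bin_counts` helper and the `ages` list are hypothetical. Bins that contain no values are simply left out.

```python
from collections import Counter

def bin_counts(values, width):
    """Count how many values fall in each bin [low, low + width)."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical ages, made up for illustration.
ages = [23, 35, 37, 41, 44, 48, 52, 55, 55, 61, 67, 72]

for width in (10, 5):
    print(f"bin width {width}")
    for low, freq in bin_counts(ages, width).items():
        # One '#' per value in the bin, so bar length equals frequency.
        print(f"{low:>3}-{low + width - 1:<3} {'#' * freq}")
```

With a width of 5 there are more, shorter bars: the same 12 values are spread over more bins, which shows finer detail but also more gaps.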

## 1.3 Statistics - Bar Graphs
A bar graph visually presents qualitative data.

### Bar Graphs
Bar graphs are used to show the distribution of qualitative (categorical) data.

A bar graph shows the frequency of values in the data. Frequency is the number of times a value appears in the data.

Each category is represented with a bar. The height of the bar represents the frequency of values from that category in the data.

Here is a bar graph of the number of people who have won a Nobel Prize in each category up to the year 2020:

![bar.PNG](attachment:bar.PNG)

Some of the categories have existed longer than others. Multiple winners are also more common in some categories. So there is a different number of winners in each category.

**Note**: Bar graphs are similar to histograms, which are used for quantitative data.
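
The frequencies behind a bar graph can be computed by counting how often each category occurs. In the sketch below, the `winners` list is a made-up list of category labels, not the real Nobel Prize counts.

```python
from collections import Counter

# Made-up list of category labels (not the real Nobel Prize counts).
winners = ["Physics", "Peace", "Chemistry", "Physics", "Medicine",
           "Physics", "Chemistry", "Peace"]

frequency = Counter(winners)  # maps each category to its number of occurrences

# One text 'bar' per category; the bar length equals the frequency.
for category, count in frequency.most_common():
    print(f"{category:<10} {'#' * count} ({count})")
```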

## 1.4 Statistics - Pie Charts
A pie chart visually presents qualitative data.

### Pie Charts
Pie charts are used to show the distribution of qualitative (categorical) data.

They show the **frequency** or **relative frequency** of values in the data.

Frequency is the number of times a value appears in the data. Relative frequency is the frequency expressed as a percentage of the total.

Each category is represented with a slice in the 'pie' (circle). The size of each slice represents the frequency of values from that category in the data.

Here is a pie chart of the number of people who have won a Nobel Prize in each category up to the year 2020:

![pie.PNG](attachment:pie.PNG)

This pie chart shows relative frequency. So each slice is sized by the percentage for each category.

Some of the categories have existed longer than others. Multiple winners are also more common in some categories. So there is a different number of winners in each category.
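
Relative frequency, and the slice angle it corresponds to, can be worked out from the counts. The sketch below assumes a full circle of 360 degrees shared out in proportion to frequency. The category labels are made up for illustration.

```python
from collections import Counter

# Made-up category labels, for illustration only.
labels = ["A"] * 5 + ["B"] * 3 + ["C"] * 2

frequency = Counter(labels)
total = sum(frequency.values())

# Slice angle in degrees: each category's share of the full 360-degree circle.
slices = {category: count * 360 / total for category, count in frequency.items()}

for category, count in frequency.items():
    share = count / total  # relative frequency as a fraction of the total
    print(f"{category}: {count} values, {share:.0%} of total, {slices[category]:.0f} degrees")
```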

## 1.5 Statistics - Box Plots
A box plot is a graph used to show key features of quantitative data.

### Box Plots
A box plot is a good way to show many important features of quantitative (numerical) data.

It shows the median of the data. This is the middle value of the data and one type of average.

It also shows the range and the quartiles of the data. This tells us something about how spread out the data is.

Here is a box plot of the age of all the Nobel Prize winners up to the year 2020:



![box.PNG](attachment:box.PNG)

**Violin plot**: https://towardsdatascience.com/violin-plots-explained-fb1d115e023d

The **median** is the red line through the middle of the 'box'. We can see that this is just above the number 60 on the number line below. So the middle value of age is 60 years.

The left side of the box is the 1st **quartile**. This is the value that separates the first **quarter**, or 25% of the data, from the rest. Here, this is 51 years.

The right side of the box is the 3rd **quartile**. This is the value that separates the first three **quarters**, or 75% of the data, from the rest. Here, this is 69 years.

The distance between the sides of the box is called the **inter-quartile range (IQR)**. This tells us where the 'middle half' of the values are. Here, half of the winners were between 51 and 69 years.

The ends of the lines from the box at the left and the right are the minimum and maximum values in the data. The distance between these is called the **range**.

The youngest winner was 17 years old, and the oldest was 97 years old. So the range of the age of winners was 80 years.

**Note**: Box plots are also called 'box and whiskers plots'.
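
The numbers a box plot displays are often called the five-number summary. As a sketch, they can be computed with `statistics.quantiles`. The `five_number_summary` helper is hypothetical, and the `ages` list is a made-up sample chosen so that its summary equals the values read off the plot above. It is not the Nobel Prize data. The `method="inclusive"` setting is an assumption: tools differ slightly in how they interpolate quartiles.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for numerical data."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

# Made-up ages, chosen to reproduce the values read off the box plot.
ages = [17, 40, 51, 55, 60, 64, 69, 80, 97]

minimum, q1, median, q3, maximum = five_number_summary(ages)
iqr = q3 - q1                   # width of the box
data_range = maximum - minimum  # distance between the whisker ends

print("min:", minimum, "Q1:", q1, "median:", median, "Q3:", q3, "max:", maximum)
print("IQR:", iqr, "range:", data_range)
```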

## 1.6 Statistics - Variation
Variation is a measure of how spread out the data is around the center of the data.

### The Variation of the Data
Measures of variation are statistics that describe how far the values in the observations (data points) are from each other.

There are different measures of variation. The most commonly used are:

- Range
- Quartiles and Percentiles
- Interquartile Range
- Standard Deviation

Measures of variation, combined with an average (measure of center), give a good picture of the distribution of the data.

**Note**: These measures of variation can only be calculated for numerical data.



### (a). Range
The range is the difference between the smallest and the largest value of the data.

Range is the simplest measure of variation.

Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the range:

![range.PNG](attachment:range.PNG)

The youngest winner was 17 years and the oldest was 97 years. The range of ages for Nobel Prize winners is then 80 years.

### (b). Quartiles and Percentiles
Quartiles and percentiles are ways of dividing the data into parts that each contain an equal number of values.

**Quartiles** are values that separate the data into four equal parts.

**Percentiles** are values that separate the data into 100 equal parts.

Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the **quartiles**:

![quartile.PNG](attachment:quartile.PNG)

The quartiles $(Q0,Q1,Q2,Q3,Q4)$ are the values that separate each quarter.

Between $Q0$ and $Q1$ are the 25% lowest values in the data. Between $Q1$ and $Q2$ are the next 25%. And so on.

- $Q0$ is the smallest value in the data.
- $Q2$ is the middle value (median).
- $Q4$ is the largest value in the data.


### Percentiles
Percentiles are values that separate the data into 100 equal parts.

For example, the 95th percentile separates the lowest 95% of the values from the top 5%.

- The 25th percentile (P25%) is the same as the first quartile $(Q1)$.

- The 50th percentile (P50%) is the same as the second quartile $(Q2)$ and the median.

- The 75th percentile (P75%) is the same as the third quartile $(Q3)$.
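
The link between percentiles and quartiles can be checked numerically. The sketch below uses synthetic data (the whole numbers 0 to 100) and assumes the `"inclusive"` interpolation method of `statistics.quantiles`. Other tools may use slightly different rules and give slightly different cut points.

```python
import statistics

values = list(range(101))  # synthetic data: 0, 1, ..., 100

# n=4 gives the three cut points Q1, Q2, Q3.
quartiles = statistics.quantiles(values, n=4, method="inclusive")

# n=100 gives 99 cut points; percentiles[k - 1] is the k-th percentile.
percentiles = statistics.quantiles(values, n=100, method="inclusive")

p25, p50, p75, p95 = (percentiles[k - 1] for k in (25, 50, 75, 95))

print("Q1, Q2, Q3:", quartiles)
print("P25, P50, P75:", p25, p50, p75)
print("median:", statistics.median(values))
print("P95:", p95)
```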

### (c). Statistics - Interquartile Range
Interquartile range is a measure of variation, which describes how spread out the data is.

### Interquartile Range
Interquartile range is the difference between the first and third quartiles (Q1 and Q3).

The 'middle half' of the data is between the first and third quartile.

The first quartile is the value in the data that separates the bottom 25% of values from the top 75%.

The third quartile is the value in the data that separates the bottom 75% of the values from the top 25%.

Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the **interquartile range (IQR)**:

![image.png](attachment:image.png)

Here, the middle half of the data is between 51 and 69 years. The interquartile range for Nobel Prize winners is then 18 years.
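
As a quick check that the interquartile range really brackets the 'middle half', the sketch below computes Q1 and Q3 for synthetic data (the whole numbers 1 to 100) and counts the share of values that fall between them. The `"inclusive"` quartile method is an assumption.

```python
import statistics

values = list(range(1, 101))  # synthetic data: 1, 2, ..., 100

q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: distance between Q1 and Q3

# Values that lie between the first and third quartile.
inside = [v for v in values if q1 <= v <= q3]
share_inside = len(inside) / len(values)

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("share of values between Q1 and Q3:", share_inside)
```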


### (d). Statistics - Standard Deviation
Standard deviation is the most commonly used measure of variation, which describes how spread out the data is.

### Standard Deviation
Standard deviation (σ) measures how far a 'typical' observation is from the average of the data (μ).

Standard deviation is important for many statistical methods.
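
A minimal sketch of the calculation, assuming the population form of the standard deviation (dividing by the number of values $n$; the sample form divides by $n - 1$ instead): take the average squared distance from the mean, then the square root. The `values` list is made up for illustration.

```python
import math
import statistics

# Hypothetical data, made up for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mu = sum(values) / len(values)                               # average of the data
variance = sum((x - mu) ** 2 for x in values) / len(values)  # mean squared distance from mu
sigma = math.sqrt(variance)                                  # standard deviation

print("mu:", mu, "sigma:", sigma)

# The standard library gives the same population standard deviation.
print("pstdev:", statistics.pstdev(values))
```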

Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing **standard deviations**:

![image.png](attachment:image.png)

Each dotted line in the histogram shows a shift of one extra standard deviation.

If the data is **normally distributed**:

- Roughly 68.3% of the data is within 1 standard deviation of the average (from μ-1σ to μ+1σ)


- Roughly 95.5% of the data is within 2 standard deviations of the average (from μ-2σ to μ+2σ)


- Roughly 99.7% of the data is within 3 standard deviations of the average (from μ-3σ to μ+3σ)
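
These percentages can be reproduced from the normal distribution itself. The sketch below uses `statistics.NormalDist` with μ = 0 and σ = 1. The `share_within` helper is hypothetical. Because the shares depend only on the number of standard deviations, the same figures hold for any normal distribution.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.2%}")
```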