# **Descriptive Statistics**  

Descriptive statistics gives us insight into data without having to look at all of it in detail.

## **1. Key Features to Describe about Data**  
Getting a quick overview of how the data is distributed is an important step in statistical methods.

We calculate key numerical values that tell us about the distribution of the data. We also draw graphs that show visually how the data is distributed.

Key Features of Data:

1. Where is the centre of the data? (location)
2. How much does the data vary? (scale)
3. What is the shape of the data? (shape)  

These can be described by **summary statistics** (numerical values).

## **1.1 The Centre of the Data**  
The **centre** of the data is where most of the values are concentrated.

Different kinds of averages, like mean, median and mode, are **measures** of the centre.  

> **Note:** Measures of the centre are also called location parameters, because they tell us something about where data is 'located' on a number line.

## **1.2 The Variation of the Data**  

The **variation** of the data is how spread out the data are around the centre.

Statistics like standard deviation, range and quartiles are **measures** of variation.

> **Note:** Measures of variation are also called **scale parameters**.

## **1.3 The Shape of the Data**  
The shape of the data refers to how the data are bunched up on either side of the centre.

Statistics like **skew** describe whether the right or left side of the centre is bigger. Skew is one type of **shape parameter**.
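
As a rough illustration of a shape statistic, the sketch below computes one common definition of skew (the third standardized moment) in plain Python. The helper name `skewness` and the example values are assumptions made up for this illustration; other definitions of skew exist, and libraries may apply a bias correction.

```python
def skewness(values):
    """Third standardized moment (population form, no bias correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # variance
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 2, 3, 4, 50]  # one large value stretches the right side

skew_symmetric = skewness(symmetric)
skew_right = skewness(right_tail)

print(skew_symmetric)   # 0.0
print(skew_right > 0)   # True
```

A value of zero means both sides balance out, while a positive value means the right side is stretched further from the centre.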

## **2. Visualizing Data**  
Different types of graphs are used for different kinds of data. For example:

- Pie charts for qualitative data
- Histograms for quantitative data
- Scatter plots for bivariate data  

Graphs often have a close connection to numerical summary statistics.

For example, box plots show where the **quartiles** are.

Quartiles also tell us where the minimum and maximum values, range, interquartile range, and median are.

## **3. Frequency Tables**  
A frequency table is a way to present data. The data are counted and ordered to summarize larger sets of data.

With a frequency table you can analyze the way the data is distributed across different values.  

One typical way of presenting data is with **frequency tables**.

A **frequency table** counts and orders data into a table. Typically, the data will need to be sorted into intervals.

Frequency tables are often the basis for making graphs to visually present the data. There are three common types:  

1. Frequency Tables
2. Relative Frequency Tables  
3. Cumulative Frequency Tables

## **3.1 Frequency Tables**  
Frequency means the number of times a value appears in the data. A table can quickly show us how many times each value appears.

If the data has many different values, it is easier to use intervals of values to present them in a table.

Here is the age of the 934 Nobel Prize winners up until the year 2020. In the table each row is an age interval of 10 years.  

|Age Interval|Frequency|
|---|---|
|10-19|1|
|20-29|2|
|30-39|48|
|40-49|158|
|50-59|236|
|60-69|262|
|70-79|174|
|80-89|50|
|90-99|3|

We can see that only one winner was aged 10 to 19, and that the largest number of winners were in their 60s.  

> **Note:** The intervals for the values are also called 'bins'.
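
A minimal sketch of how such a table could be built, assuming a small list of made-up ages (not the Nobel Prize data) and 10-year bins. The variable names are illustrative only.

```python
from collections import Counter

# Hypothetical ages, made up for illustration (not the Nobel Prize data).
ages = [23, 35, 37, 41, 44, 48, 52, 55, 58, 59, 61, 63, 67, 72]

bin_width = 10
# Map each age to the lower edge of its bin, e.g. 37 -> 30.
counts = Counter((age // bin_width) * bin_width for age in ages)

frequency_table = {
    f"{low}-{low + bin_width - 1}": counts[low] for low in sorted(counts)
}
for interval, frequency in frequency_table.items():
    print(interval, frequency)
```

The frequencies always add up to the number of observations, which is a quick way to check a table.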

## **3.2 Relative Frequency Tables**  
Relative frequency means the number of times a value appears in the data compared to the total number of values. A percentage is a relative frequency.

Here are the relative frequencies of the ages of Nobel Prize winners. Now, all the frequencies are divided by the total (934) to give percentages.  

|Age Interval|Relative Frequency|
|---|---|
|10-19|0.11%|
|20-29|0.21%|
|30-39|5.14%|
|40-49|16.92%|
|50-59|25.27%|
|60-69|28.05%|
|70-79|18.63%|
|80-89|5.35%|
|90-99|0.32%|
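
The percentages can be reproduced from the counts in the frequency table in section 3.1. This is a small sketch of that calculation; the variable names are illustrative.

```python
# Frequencies copied from the age-interval table in section 3.1.
frequencies = {
    "10-19": 1, "20-29": 2, "30-39": 48, "40-49": 158, "50-59": 236,
    "60-69": 262, "70-79": 174, "80-89": 50, "90-99": 3,
}

total = sum(frequencies.values())  # 934

# Relative frequency = frequency / total, shown here as a percentage.
relative = {
    interval: round(100 * count / total, 2)
    for interval, count in frequencies.items()
}
for interval, percent in relative.items():
    print(f"{interval}: {percent:.2f}%")
```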

## **3.3 Cumulative Frequency Tables**  
Cumulative frequency counts up to a particular value.

Here are the cumulative frequencies of ages of Nobel Prize winners. Now, we can see how many winners have been younger than a certain age.  

|Age|Cumulative Frequency|
|---|---|
|Younger than 20|1|
|Younger than 30|3|
|Younger than 40|51|
|Younger than 50|209|
|Younger than 60|445|
|Younger than 70|707|
|Younger than 80|881|
|Younger than 90|931|
|Younger than 100|934|

Cumulative frequency tables can also be made with relative frequencies (percentages).
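
A cumulative frequency is a running total. The sketch below uses a short list of made-up frequencies (an assumption for illustration) to show both the cumulative counts and the cumulative percentages.

```python
from itertools import accumulate

# Hypothetical frequencies per bin, made up for illustration.
frequencies = [2, 5, 8, 4, 1]

cumulative = list(accumulate(frequencies))  # running totals
total = cumulative[-1]                      # last running total = all values
cumulative_percent = [round(100 * c / total, 1) for c in cumulative]

print(cumulative)          # [2, 7, 15, 19, 20]
print(cumulative_percent)  # [10.0, 35.0, 75.0, 95.0, 100.0]
```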

## **4. Histograms**  
A histogram is a widely used graph that visually presents the distribution of quantitative (numerical) data.

It shows the **frequency** of values in the data, usually in intervals of values. Frequency is the number of times a value appears in the data.

Each interval is represented with a bar, placed next to the other intervals on a number line.

The height of the bar represents the frequency of values in that interval.

Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020:  

<img src='Histogram_1.png'>  

This histogram uses age intervals from 10 to 19, 20 to 29, and so on.  

> **Note:** Histograms are similar to bar graphs, which are used for qualitative data.

## **4.1 Bin Width**  
The intervals of values are often called 'bins', and the length of an interval is called the 'bin width'.

We can choose any width. The best bin width is one that shows enough detail without being confusing.

Here is a histogram of the same Nobel Prize winner data, but with bin widths of 5 instead of 10:  
<img src='Histogram_2.png'> 

This histogram uses age intervals from 15 to 19, 20 to 24, 25 to 29, and so on.

Smaller intervals give a more detailed look at the distribution of the age values in the data.
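
To see the effect of the bin width on the counts behind a histogram, here is a small sketch that bins the same made-up ages with widths of 10 and 5. The helper `bin_counts` and the data are assumptions for illustration.

```python
from collections import Counter

def bin_counts(values, bin_width):
    """Count values per bin; keys are the lower edge of each bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical ages, made up for illustration.
ages = [31, 34, 36, 42, 43, 47, 48, 51, 55, 59]

wide = bin_counts(ages, 10)
narrow = bin_counts(ages, 5)

print(wide)    # {30: 3, 40: 4, 50: 3}
print(narrow)  # {30: 2, 35: 1, 40: 2, 45: 2, 50: 1, 55: 2}
```

The narrower bins reveal more detail, at the cost of more bars with fewer values in each.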


## **5. Bar Graphs**  
A bar graph visually presents qualitative data.  

Bar graphs are used to show the distribution of qualitative (categorical) data.

They show the **frequency** of values in the data. Frequency is the number of times a value appears in the data.

Each category is represented with a bar. The height of the bar represents the frequency of values from that category in the data.

Here is a bar graph of the number of people who have won a Nobel Prize in each category up to the year 2020:  

<img src='Bar_graph_1.png'>  

Some of the categories have existed longer than others. Multiple winners are also more common in some categories. So there is a different number of winners in each category.

> **Note:** Bar graphs are similar to histograms, which are used for quantitative data.

## **6. Pie Charts**  
A pie chart visually presents qualitative data.  

Pie charts are used to show the distribution of qualitative (categorical) data.

They show the **frequency** or **relative frequency** of values in the data.

Frequency is the number of times a value appears in the data. Relative frequency is the percentage of the total.

Each category is represented with a slice in the 'pie' (circle). The size of each slice represents the frequency of values from that category in the data.

Here is a pie chart of the number of people who have won a Nobel Prize in each category up to the year 2020:  

<img src='Pie_chart_1.png'> 
 
This pie chart shows relative frequency. So each slice is sized by the percentage for each category.

Some of the categories have existed longer than others. Multiple winners are also more common in some categories. So there is a different number of winners in each category.
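
As a sketch of how slice sizes follow from relative frequencies, the example below converts made-up category counts into percentages and angles of the circle. The category names and counts are assumptions for illustration, not the Nobel Prize figures.

```python
# Hypothetical category counts, made up for illustration.
counts = {"A": 5, "B": 3, "C": 2}
total = sum(counts.values())

# Each slice's share of the circle is the category's relative frequency.
percentages = {name: 100 * c / total for name, c in counts.items()}
angles = {name: 360 * c / total for name, c in counts.items()}

print(percentages)  # {'A': 50.0, 'B': 30.0, 'C': 20.0}
print(angles)       # {'A': 180.0, 'B': 108.0, 'C': 72.0}
```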


## **7. Box Plots**  
A box plot is a graph used to show key features of quantitative data.  

A box plot is a good way to show many important features of quantitative (numerical) data.

It shows the median of the data. This is the middle value of the data and one type of average.

It also shows the range and the quartiles of the data. This tells us something about how spread out the data is.

Here is a box plot of the age of all the Nobel Prize winners up to the year 2020:  

<img src='Box_plot_1.png'>  

The **median** is the red line through the middle of the 'box'. We can see that this is just above the number 60 on the number line below. So the middle value of age is 60 years.

The left side of the box is the 1st **quartile**. This is the value that separates the first **quarter**, or 25% of the data, from the rest. Here, this is 51 years.

The right side of the box is the 3rd **quartile**. This is the value that separates the first three **quarters**, or 75% of the data, from the rest. Here, this is 69 years.

The distance between the sides of the box is called the **inter-quartile range (IQR)**. This tells us where the 'middle half' of the values are. Here, half of the winners were between 51 and 69 years.

The ends of the lines from the box at the left and the right are the minimum and maximum values in the data. The distance between these is called the **range**.

The youngest winner was 17 years old, and the oldest was 97 years old. So the range of the age of winners was 80 years.  

> **Note:** Box plots are also called 'box and whiskers plots'.
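
The numbers a box plot displays can be computed directly. The sketch below does this for a small example list using Python's `statistics.quantiles`; the choice of `method="inclusive"` (linear interpolation with the smallest and largest values as end points) is an assumption, and other quartile conventions give slightly different results.

```python
from statistics import quantiles

values = [13, 21, 21, 40, 42, 48, 55, 72]

# Three cut points split the ordered data into four quarters.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")

summary = {
    "min": min(values),
    "Q1": q1,
    "median": q2,
    "Q3": q3,
    "max": max(values),
    "IQR": q3 - q1,                       # width of the box
    "range": max(values) - min(values),   # distance between the whisker ends
}
print(summary)
```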

## **8. Average**  
An average is a measure of where most of the values in the data are located.  

## **The Center of the Data**  
The center of the data is where most of the values in the data are located. Averages are measures of the location of the center.

There are different types of averages. The most commonly used are:
- Mean
- Median
- Mode  

> **Note:** In statistics, averages are often referred to as 'measures of central tendency'.

## **9. Mean**  
The mean is usually referred to as 'the average'.  
The mean is a type of average value, which describes where the center of the data is located.  

The mean is the sum of all the values in the data divided by the total number of values in the data:  
(40 + 21 + 55 + 21 + 48 + 13 + 72)/7 = 38.57  

The mean is calculated for numerical variables. A variable is something in the data that can vary, like:

- Age
- Height
- Income

> **Note:** There are multiple types of mean values. The most common type of mean is the arithmetic mean.
>
>In this tutorial, 'mean' refers to the arithmetic mean.

## **9.1 Calculating the Mean**  
You can calculate the mean for both the population and the sample.

The formulas are the same but use different symbols to refer to the population mean ($\mu$) and sample mean ($\bar{x}$).

Calculating the **population mean** ($\mu$) is done with this formula:  
$\displaystyle  \mu = \frac{\sum x_{i}}{n}$ 

Calculating the sample mean ($\bar{x}$) is done with this formula:  
$\displaystyle \bar{x} = \frac{\sum x_{i}}{n}$
 

The bottom part of the fraction ($n$) is the total number of observations.

 $\sum$ is the symbol for adding together a list of numbers.

 $x_{i}$ is the list of values in the data: $x_{1}, x_{2}, x_{3}, \ldots$

The top part of the fraction ($\sum x_{i}$) is the sum of $x_{1}, x_{2}, x_{3}, \ldots$ added together.

So, if a sample has 4 observations with values: 4, 11, 7, 14 the calculation is:  
$\displaystyle \bar{x} = \frac{4 + 11 + 7 + 14}{4} = \frac{36}{4} = \underline{9}$

## **9.2 Calculation with Programming**  
The mean can easily be calculated with many programming languages.

Using software and programming to calculate statistics is more common for bigger sets of data, as calculating by hand becomes difficult.  
  
Example  
With Python, use the NumPy library `mean()` function to find the mean of the values 4, 11, 7, 14:

```python
import numpy

valuesMean = [4, 11, 7, 14]
# numpy.mean adds the values together and divides by how many there are
mean = numpy.mean(valuesMean)

print(mean)
```

```
9.0
```


## **9.3 Statistics Symbol Reference**  

|Symbol|Description|
|---|---|
|$\mu$|The population mean. Pronounced 'mu'.|
|$\bar{x}$|The sample mean. Pronounced 'x-bar'.|
|$\sum$|The summation operator, 'capital sigma'.|
|$x$|The variable 'x' we are calculating the average for.|
|$i$|The index 'i' of the variable 'x'. This identifies each observation for a variable.|
|$n$|The number of observations.|

## **10. Median**  
The median is the '**middle** value' of the data set ordered from low to high.   
The median is a type of average value, which describes where the center of the data is located.

The median is found by ordering all the values in the data and picking the middle value:  
> 13, 21, 21, <u>40</u>, 48, 55, 72

The middle value, 40, is the median.

The median is less influenced by extreme values in the data than the mean.

Changing the last value to 356 does not change the median:  
> 13, 21, 21, 40, 48, 55, 356  

The median is still 40.

Changing the last value to 356 changes the mean a lot:  
> (13 + 21 + 21 + 40 + 48 + 55 + 72)/7 = 38.57
>
> (13 + 21 + 21 + 40 + 48 + 55 + 356)/7 = 79.14  

***
> **Note:** Extreme values are values in the data that are much smaller or larger than the average values in the data.
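
The comparison above can be checked with Python's built-in `statistics` module. This is a small sketch using the same two lists of values:

```python
from statistics import mean, median

values = [13, 21, 21, 40, 48, 55, 72]
with_extreme = [13, 21, 21, 40, 48, 55, 356]  # last value replaced

mean_before, median_before = mean(values), median(values)
mean_after, median_after = mean(with_extreme), median(with_extreme)

print(round(mean_before, 2), median_before)  # 38.57 40
print(round(mean_after, 2), median_after)    # 79.14 40
```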

## **10.1 Finding the Median**  
The median can only be calculated for numerical variables.

The position of the middle value in the ordered data is given by the formula:  
$\displaystyle \frac{n + 1}{2}$
 

Where $n$ is the total number of observations.

If the total number of observations is an **odd** number, the formula gives a whole number and the value of this observation is the median.

> 13, 21, 21, <u>40</u>, 48, 55, 72

Here, there are 7 total observations, so the median is the 4th value:  
$\displaystyle \frac{7 + 1}{2} = \frac{8}{2} = 4$
 
 

The 4th value in the ordered list is **40**, so that is the median.

If the total number of observations is an **even** number, the formula gives a decimal number that falls between the positions of two observations.

> 13, 21, 21, <u>40</u>, <u>42</u>, 48, 55, 72

Here, there are 8 total observations, so the median is between the 4th and 5th values:  
$\displaystyle \frac{8 + 1}{2} = \frac{9}{2} = 4.5$
 
 

The 4th and 5th values in the ordered list are **40** and **42**, so the median is the mean of these two values. That is, the sum of those two values divided by 2:  
$\displaystyle \frac{40+42}{2} = \frac{82}{2} = \underline{41}$

 
> **Note:** It is important that the numbers are ordered before you can find the median.
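
The position rule can be written as a short function. This is a sketch of the procedure described above, with `median_by_position` as an illustrative helper name rather than a library function.

```python
def median_by_position(values):
    """Median via the (n + 1) / 2 position rule (positions are 1-based)."""
    ordered = sorted(values)  # the values must be ordered first
    n = len(ordered)
    position = (n + 1) / 2
    if n % 2 == 1:
        # Odd n: the position is a whole number, e.g. 4 for n = 7.
        return ordered[int(position) - 1]
    # Even n: the position falls between two values, e.g. 4.5 for n = 8.
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2

odd_median = median_by_position([40, 21, 55, 21, 48, 13, 72])
even_median = median_by_position([13, 21, 21, 40, 42, 48, 55, 72])

print(odd_median)   # 40
print(even_median)  # 41.0
```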

## **10.2 Finding the Median with Programming**  
The median can easily be found with many programming languages.

Using software and programming to calculate statistics is more common for bigger sets of data, as finding it manually becomes difficult.

Example  
With Python, use the NumPy library `median()` function to find the median of the values 13, 21, 21, 40, 42, 48, 55, 72:

```python
import numpy

valuesMedian = [13,21,21,40,42,48,55,72]

# numpy.median sorts the values internally before picking the middle
median = numpy.median(valuesMedian)

print(median)
```

```
41.0
```


## **11. Mode**  
The **mode** is the value (or values) that appears most often in the data.  
The mode is a type of average value, which describes where most of the data is located.  

A dataset can have multiple values that are modes.  
A distribution of values with only one mode is called **unimodal**.  
A distribution of values with two modes is called **bimodal**. In general, a distribution with more than one mode is called **multimodal**.  
The mode can be found for both categorical and numerical data.  

> 40, 21, 55, <u>21</u>, 48, 13, 72

Here, 21 appears two times, and the other values only once. The mode of this data is 21.

The mode is also used for **categorical data**, unlike the median and mean. Categorical data, like names, can't be described directly with numbers:

> Alice, <u>John</u>, Bob, Maria, <u>John</u>, Julia, Carol

Here, John appears two times, and the other values only once. The mode of this data is John.

> **Note:** There can be more than one mode if multiple values appear the same number of times in the data.



## **11.1 Finding the Mode**  
Here is a **numerical** example:

4, <u>7</u>, 3, 8, 11, <u>7</u>, 10, 19, 6, 9, <u>12</u>, <u>12</u>

Both 7 and 12 appear two times each, and the other values only once. The modes of this data are 7 and 12.

Here is a **categorical** example with names:

Alice, <u>John</u>, Bob, Maria, <u>John</u>, Julia, Carol

John appears two times, and the other values only once. The mode of this data is John.
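
Counting how often each value appears is all that is needed to find the mode. Here is a minimal sketch for the categorical example, using `collections.Counter` from the standard library:

```python
from collections import Counter

names = ["Alice", "John", "Bob", "Maria", "John", "Julia", "Carol"]

counts = Counter(names)
highest = max(counts.values())
# Keep every value that reaches the highest count (there may be several).
modes = [value for value, count in counts.items() if count == highest]

print(counts["John"])  # 2
print(modes)           # ['John']
```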

## **11.2 Finding the Mode with Programming**  
The mode can easily be found with many programming languages.

Using software and programming to calculate statistics is more common for bigger sets of data, as calculating manually becomes difficult.

Example  
With Python, use the `statistics` module `multimode()` function to find the modes of the values 4, 7, 3, 8, 11, 7, 10, 19, 6, 9, 12, 12:

```python
from statistics import multimode

valuesMode = [4,7,3,8,11,7,10,19,6,9,12,12]

# multimode returns a list with every value that is tied for most common
mode = multimode(valuesMode)

print(mode)
```

```
[7, 12]
```


## **12. Variation**  
Variation is a measure of how spread out the data is around the centre of the data.  

### **The Variation of the Data**
Measures of variation are statistics of how far away the values in the observations (data points) are from each other.  

There are different measures of variation. The most commonly used are:  
1. Range
2. Quartiles and Percentiles
3. Interquartile Range
4. Standard Deviation  

Measures of variation combined with an average (measure of centre) give a good picture of the distribution of the data.  

> **Note:** These measures of variation can only be calculated for numerical data.

## **13. Range**  
The range is a measure of variation, which describes how spread out the data is.  

The range is the difference between the smallest and the largest value of the data.  

Range is the simplest measure of variation.

Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the **range:**
<img src='Range_1.png'>  
  
The youngest winner was 17 years and the oldest was 97 years. The range of ages for Nobel Prize winners is then 80 years.


## **13.1 Calculating the Range**  
The range can only be calculated for numerical data.

First, find the smallest and largest values of this example:  
> <u>13</u>, 21, 21, 40, 48, 55, <u>72</u>

Calculate the difference by subtracting the smallest from the largest:  
72 - 13 = <u>59</u>

## **13.2 Calculating the Range with Programming**  
The range can easily be found with many programming languages.

Using software and programming to calculate statistics is more common for bigger sets of data, as finding it manually becomes difficult.

Example  
With Python, use the NumPy library `ptp()` function to find the range of the values 13, 21, 21, 40, 48, 55, 72:

```python
import numpy

valuesRange = [13,21,21,40,48,55,72]

# ptp ("peak to peak") is the largest value minus the smallest value
Range = numpy.ptp(valuesRange)

print(Range)
```

```
59
```


## **14. Quartiles and Percentiles**  
Quartiles and percentiles are measures of variation, which describe how spread out the data is.  
Quartiles and percentiles are both types of **quantiles**.  

Quartiles and percentiles are values that split the ordered data into parts, each containing an equal number of values.

1. **Quartiles** are values that separate the data into four equal parts.

2. **Percentiles** are values that separate the data into 100 equal parts.

Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the **quartiles:**  
<img src='Quartiles_Percentiles.png'>  

  
The quartiles (Q0,Q1,Q2,Q3,Q4) are the values that separate each quarter.

Between Q0 and Q1 are the 25% lowest values in the data. Between Q1 and Q2 are the next 25%. And so on.  
- Q0 is the smallest value in the data.
- Q2 is the middle value (median).
- Q4 is the largest value in the data.

## **14.1 Quartiles**  
**Quartiles** are values that separate the data into four equal parts.

Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the **quartiles:**  
<img src='Quartiles_Percentiles.png'>  
The quartiles (Q0,Q1,Q2,Q3,Q4) are the values that separate each quarter.

Between Q0 and Q1 are the 25% lowest values in the data. Between Q1 and Q2 are the next 25%. And so on.

- Q0 is the smallest value in the data.
- Q1 is the value separating the first quarter from the second quarter of the data.
- Q2 is the middle value (median), separating the bottom from the top half.
- Q3 is the value separating the third quarter from the fourth quarter.
- Q4 is the largest value in the data.

## **14.2 Calculating Quartiles with Programming**  
Quartiles can easily be found with many programming languages.

Using software and programming to calculate statistics is more common for bigger sets of data, as finding it manually becomes difficult.

Example  
With Python, use the NumPy library `quantile()` function to find the quartiles of the values 13, 21, 21, 40, 42, 48, 55, 72:

```python
import numpy

valuesQuartiles = [13,21,21,40,42,48,55,72]

# 0, 0.25, 0.5, 0.75 and 1 correspond to Q0, Q1, Q2, Q3 and Q4
quartiles = numpy.quantile(valuesQuartiles, [0,0.25,0.5,0.75,1])

print(quartiles)
```

```
[13.   21.   41.   49.75 72.  ]
```


## **14.3 Percentiles**  
**Percentiles** are values that separate the data into 100 equal parts.

For example, the 95th percentile separates the lowest 95% of the values from the top 5%.

The 25th percentile (P25%) is the same as the first quartile (Q1).

The 50th percentile (P50%) is the same as the second quartile (Q2) and the median.

The 75th percentile (P75%) is the same as the third quartile (Q3).
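
The link between percentiles and quartiles can be checked numerically. The sketch below uses `statistics.quantiles` from the standard library with `method="inclusive"`, which is an assumed interpolation convention; the values are the same example list used elsewhere in this tutorial.

```python
from statistics import quantiles

values = [13, 21, 21, 40, 42, 48, 55, 72]

# 99 cut points split the data into 100 parts; index 24 is the 25th percentile.
percentiles = quantiles(values, n=100, method="inclusive")
quartiles = quantiles(values, n=4, method="inclusive")

p25, p50, p75 = percentiles[24], percentiles[49], percentiles[74]

print(p25, p50, p75)  # 21.0 41.0 49.75
print(quartiles)      # [21.0, 41.0, 49.75]
```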

## **14.4 Calculating Percentiles with Programming**  
Percentiles can easily be found with many programming languages.

Using software and programming to calculate statistics is more common for bigger sets of data, as finding it manually becomes difficult.

Example  
With Python, use the NumPy library `percentile()` function to find the `65`th percentile of the values 13, 21, 21, 40, 42, 48, 55, 72:

```python
import numpy

valuesPercentiles = [13,21,21,40,42,48,55,72]

# the second argument is the percentile to find, from 0 to 100
percentiles = numpy.percentile(valuesPercentiles, 65)

print(percentiles)
```

```
45.3
```


## **15. Interquartile Range**  
Interquartile range is a measure of variation, which describes how spread out the data is.  

Interquartile range is the difference between the first and third quartiles (Q1 and Q3).

The 'middle half' of the data is between the first and third quartile.  

The first quartile is the value in the data that separates the bottom 25% of values from the top 75%.

The third quartile is the value in the data that separates the bottom 75% of the values from the top 25%.

Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the **interquartile range (IQR):**  
<img src='InterQuartile_range.png'>  
Here, the middle half of the winners is between 51 and 69 years. The interquartile range for Nobel Prize winners is then 18 years.

## **15.1 Calculating the Interquartile Range with Programming**  
The interquartile range can easily be found with many programming languages.

Using software and programming to calculate statistics is more common for bigger sets of data, as finding it manually becomes difficult.

Example  
With Python, use the SciPy library `stats.iqr()` function to find the interquartile range of the values 13, 21, 21, 40, 42, 48, 55, 72:

```python
from scipy import stats

valuesInterQuartile = [13,21,21,40,42,48,55,72]

# iqr returns the third quartile minus the first quartile (Q3 - Q1)
interQuartile = stats.iqr(valuesInterQuartile)

print(interQuartile)
```

```
28.75
```


## **16. Standard Deviation**  
Standard deviation is the most commonly used measure of variation, which describes how spread out the data is.  

Standard deviation (σ) measures how far a 'typical' observation is from the average of the data (μ).

Standard deviation is important for many statistical methods.

Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing **standard deviations:**  
<img src='Standard_Deviation.png'>  

Each dotted line in the histogram shows a shift of one extra standard deviation.

> **Note:** A normal distribution has a "bell" shape and spreads out equally on both sides.

If the data is **normally distributed:**  
- Roughly 68.3% of the data is within 1 standard deviation of the average (from μ-1σ to μ+1σ)
- Roughly 95.4% of the data is within 2 standard deviations of the average (from μ-2σ to μ+2σ)
- Roughly 99.7% of the data is within 3 standard deviations of the average (from μ-3σ to μ+3σ)  

> **Note:** Values within one standard deviation (σ) are considered to be typical.
> 
> Values outside three standard deviations are considered to be outliers.
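
These shares can be checked against a normal distribution with Python's `statistics.NormalDist`. The sketch below uses a standard normal distribution (mean 0, standard deviation 1), which is an assumption for convenience; the shares are the same for any normal distribution.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of a normal distribution within k standard deviations of the mean.
within = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)
}
for k, share in within.items():
    print(f"within {k} standard deviation(s): {100 * share:.1f}%")
```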

## **16.1 Calculating the Standard Deviation**  
You can calculate the standard deviation for both the population and the sample.

The formulas are **almost** the same and use different symbols to refer to the population standard deviation ($\sigma$) and the sample standard deviation ($s$).

Calculating the **population standard deviation** ($\sigma$) is done with this formula:  
$\displaystyle  \sigma = \sqrt{\frac{\sum (x_{i}-\mu)^2}{n}}$  

Calculating the **sample standard deviation** ($s$) is done with this formula:  
$\displaystyle s = \sqrt{\frac{\sum (x_{i}-\bar{x})^2}{n-1}}$
 

 $n$ is the total number of observations.

 $\sum$ is the symbol for adding together a list of numbers.

 $x_{i}$ is the list of values in the data: $x_{1}, x_{2}, x_{3}, \ldots$

 $\mu$ is the population mean and $\bar{x}$ is the sample mean (average value).  

 $(x_{i} - \mu )$ and $(x_{i} - \bar{x} )$ are the differences between the values of the observations ($x_{i}$) and the mean.

Each difference is squared, and the squared differences are added together.

Then the sum is divided by $n$ (or $n - 1$), and finally we take the square root.

Using these 4 example values for calculating the **population standard deviation:**

> 4, 11, 7, 14

We must first find the <u>mean</u>:  
$\displaystyle \mu = \frac{\sum x_{i}}{n} = \frac{4 + 11 + 7 + 14}{4} = \frac{36}{4} = \underline{9}$  

Then we find the difference between each value and the mean $(x_{i}- \mu)$:  
- $4-9 = -5$
- $11-9 = 2$
- $7-9 = -2$
- $14-9 = 5$

Each value is then squared, or multiplied with itself $( x_{i}- \mu )^2$:  
- $(-5)^2 = (-5)(-5) = 25$
- $2^2 = 2 \cdot 2 = 4$
- $(-2)^2 = (-2)(-2) = 4$
- $5^2 = 5 \cdot 5 = 25$

All of the squared differences are then added together $\sum (x_{i} -\mu )^2$:  
$25 + 4 + 4 + 25 = 58$

Then the sum is divided by the total number of observations, $n$:  
$\displaystyle \frac{58}{4} = 14.5$

Finally, we take the square root of this number:  
$\sqrt{14.5} \approx \underline{3.81}$

So, the standard deviation of the example values is roughly: $3.81$ 
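
The hand calculation can be mirrored step by step in plain Python. This sketch follows the same steps as above and, for comparison, also divides by $n - 1$ to get the sample standard deviation; the variable names are illustrative.

```python
import math

values = [4, 11, 7, 14]
n = len(values)

mean = sum(values) / n                           # 9.0
differences = [x - mean for x in values]         # [-5.0, 2.0, -2.0, 5.0]
squared = [d ** 2 for d in differences]          # [25.0, 4.0, 4.0, 25.0]
sum_of_squares = sum(squared)                    # 58.0

population_sd = math.sqrt(sum_of_squares / n)    # divide by n
sample_sd = math.sqrt(sum_of_squares / (n - 1))  # divide by n - 1

print(round(population_sd, 2))  # 3.81
print(round(sample_sd, 2))      # 4.4
```

Dividing by $n - 1$ instead of $n$ always gives a slightly larger result, and the difference shrinks as the number of observations grows.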

## **16.2 Calculating the Standard Deviation with Programming**  
The standard deviation can easily be calculated with many programming languages.

Using software and programming to calculate statistics is more common for bigger sets of data, as calculating by hand becomes difficult.

**Population Standard Deviation**  

Example  
With Python, use the NumPy library `std()` function to find the standard deviation of the values 4, 11, 7, 14:

```python
import numpy

valuesStandardDeviation = [4,11,7,14]

# by default numpy.std divides by n (population standard deviation)
StandardDeviation = numpy.std(valuesStandardDeviation)

print(StandardDeviation)
```

```
3.8078865529319543
```


**Sample Standard Deviation**  

Example 2  
With Python, use the NumPy library `std()` function with `ddof=1` to find the **sample** standard deviation of the values 4, 11, 7, 14:

```python
import numpy

valuesStandardDeviation2 = [4,11,7,14]

# ddof=1 divides by n - 1 instead of n (sample standard deviation)
StandardDeviation2 = numpy.std(valuesStandardDeviation2, ddof=1)

print(StandardDeviation2)
```

```
4.396968652757639
```


## **16.3 Statistics Symbol Reference**  
|Symbol|Description|
|---|---|
|$\sigma$|Population standard deviation. Pronounced 'sigma'.|
|$s$|Sample standard deviation.|
|$\mu$|The population mean. Pronounced 'mu'.|
|$\bar{x}$|The sample mean. Pronounced 'x-bar'.|
|$\sum$|The summation operator, 'capital sigma'.|
|$x$|The variable 'x' we are calculating the average for.|
|$i$|The index 'i' of the variable 'x'. This identifies each observation for a variable.|
|$n$|The number of observations.|