# Central Tendency: Mean, Median, Mode

Using these measures (**mean**, **median**, and **mode**), we can summarize our data and get a basic understanding of it. Although these measures of central tendency are limited in the information they provide, they are an effective starting point for describing our data.

In [31]:
import pandas as pd

In [32]:
df = pd.DataFrame(data=[4, 6, 7, 1, 6, 9, 2, 7, 2, 5, 9, 2, 5, 7, 1, 4, 5, 4, 5, 5, 8], columns=["points"])

df

| | points |
|---|---|
| 0 | 4 |
| 1 | 6 |
| 2 | 7 |
| 3 | 1 |
| 4 | 6 |
| 5 | 9 |
| 6 | 2 |
| 7 | 7 |
| 8 | 2 |
| 9 | 5 |


## Mean

In [33]:
# Determine the mean of each column using the `mean()` method
df.mean()

points    4.952381
dtype: float64

In [34]:
# You can also compute the mean by dividing the column sum by the number of rows
mean = df.sum() / len(df['points'])
mean

points    4.952381
dtype: float64

## Median
Once our values are sorted, the **<mark>median</mark>** is the middle-most value if there is an odd number of values. If there is an even number of values, we average the two center-most values. Our dataset has 21 values, an odd number, so the median is the 11th value in sorted order, which is **<mark>5</mark>**. This value is slightly different from our mean value (about 4.95), but is another way to describe the center of our data.

The median is useful in that, unlike the mean, it is fairly unaffected by outliers.
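
The odd/even rule can be sketched in a few lines of plain Python. The helper name `manual_median` and the short four-value list are hypothetical and used only for illustration; in practice, `median()` from pandas does the same work.

```python
import pandas as pd

points = [4, 6, 7, 1, 6, 9, 2, 7, 2, 5, 9, 2, 5, 7, 1, 4, 5, 4, 5, 5, 8]


def manual_median(values):
    """Return the median by sorting and picking the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value is the median
        return float(ordered[mid])
    # Even count: average the two center-most values
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_case = manual_median(points)         # 21 values, so one middle value
even_case = manual_median([1, 3, 4, 8])  # hypothetical 4-value list

print(odd_case, even_case)
print(odd_case == pd.Series(points).median())
```

For the 21 values used here, the middle value is the 11th in sorted order, so the manual result agrees with what pandas reports.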

In [35]:
# You can determine the median using the `median()` method
df.median()

points    5.0
dtype: float64

## Mode
The **<mark>mode</mark>** is the most frequently occurring value or set of values. If several values are tied for the highest frequency, they are all reported as modes, and the dataset is said to be bimodal (two modes) or multimodal (more than two). The mode is most useful when your data is repetitive and you want to identify which values occur most often. For our dataset, the most frequently occurring value is **<mark>5</mark>**.

Our mode value (5) matches the median (5.0) and is close to the mean (about 4.95). Like them, it can be used to describe the central tendency of a dataset.

In [36]:
df['points'].mode()[0]

np.int64(5)
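
To see where the mode comes from, it can help to look at the frequency of every value. The sketch below uses `value_counts()` for that, and then shows how `mode()` returns more than one value when there is a tie. The `tied` series is a hypothetical example, not part of the dataset above.

```python
import pandas as pd

df = pd.DataFrame(
    data=[4, 6, 7, 1, 6, 9, 2, 7, 2, 5, 9, 2, 5, 7, 1, 4, 5, 4, 5, 5, 8],
    columns=["points"],
)

# Frequency of each distinct value, most frequent first
counts = df["points"].value_counts()
top_value = counts.idxmax()  # the value that occurs most often
top_count = counts.max()     # how many times it occurs

# Hypothetical series with a tie: 1 and 2 both occur twice
tied = pd.Series([1, 1, 2, 2, 3])
tied_modes = tied.mode().tolist()

print(counts)
print(top_value, top_count)
print(tied_modes)
```

Because `mode()` always returns a Series, indexing with `[0]` picks only the first mode. That is fine for this dataset, which has a single mode, but it would hide the other values in a bimodal or multimodal dataset.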

# <p style="text-align: center;">Comparing mean, median, and mode</p>

If our data is normally distributed with a bell-shaped curve, then the **mean**, **median** and **mode** are nearly identical.

<img src="./images/normal.png"/>
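
One way to check this claim is to draw a large sample from a normal distribution and compare the three measures. The sketch below assumes NumPy is available. The seed, the distribution parameters (mean 50, standard deviation 10), and the rounding step used to get a meaningful mode from continuous values are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 100,000 draws from a normal distribution (mean 50, std 10)
rng = np.random.default_rng(seed=0)
sample = pd.Series(rng.normal(loc=50, scale=10, size=100_000))

sample_mean = sample.mean()
sample_median = sample.median()
# Continuous values rarely repeat, so round to whole numbers before taking the mode
sample_mode = sample.round().mode()[0]

print(f"mean={sample_mean:.2f}, median={sample_median:.2f}, mode={sample_mode:.0f}")
```

With a sample this large, all three values should land close to 50, though the exact numbers depend on the random draw.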

However, if there are extreme outliers (values that are much higher or lower than the rest), they will cause the **mean** to shift significantly in the direction of the outliers, whereas the **median** is fairly unaffected.

<div class="alert alert-block alert-warning">
    <b>Alert:</b> When the <mark>mean</mark> and <mark>median</mark> differ widely, that suggests the data is skewed, often because outliers are present.
</div>
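
The effect of an outlier can be demonstrated with the same 21 values used earlier. In this sketch, the extra value of 500 is a hypothetical outlier added only for illustration.

```python
import pandas as pd

points = pd.Series([4, 6, 7, 1, 6, 9, 2, 7, 2, 5, 9, 2, 5, 7, 1, 4, 5, 4, 5, 5, 8])

# Append one hypothetical extreme value to the data
with_outlier = pd.concat([points, pd.Series([500])], ignore_index=True)

mean_before, median_before = points.mean(), points.median()
mean_after, median_after = with_outlier.mean(), with_outlier.median()

print(f"before: mean={mean_before:.2f}, median={median_before}")
print(f"after:  mean={mean_after:.2f}, median={median_after}")
```

A single extreme value pulls the mean from about 4.95 to about 27.45, while the median stays at 5.0.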

<img src='./images/skewed.png'>
