In [1]:
import pandas as pd
import numpy as np

import scipy
from scipy import stats

#### Mean

The mean is a measure of the central tendency of the data.

It is defined as the sum of all observations divided by the number of observations.

The observations are spread around the mean: the deviations above and below it sum to zero.

In [2]:
# Example

prices = pd.Series([0,0,35,40,10,54,87,12,95,64,56,4,45,76,34,56,87,34,56,32,56,48,89,
                   42,65,100,99,98,100,96,99])
prices.mean()

57.064516129032256
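As a quick check of the definition, the following sketch recomputes the mean as the sum of the observations divided by their count. It reuses the values of the example series; the names total, n_obs and manual_mean are only illustrative.

```python
import pandas as pd

# Same values as the example series above
prices = pd.Series([0, 0, 35, 40, 10, 54, 87, 12, 95, 64, 56, 4, 45, 76, 34, 56, 87, 34,
                    56, 32, 56, 48, 89, 42, 65, 100, 99, 98, 100, 96, 99])

total = prices.sum()          # sum of all observations
n_obs = prices.count()        # number of non-missing observations
manual_mean = total / n_obs   # sum divided by count

print(total, n_obs, manual_mean)
print(prices.mean())          # should agree with manual_mean
```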

In [3]:
# Adding some missing values in the previous series
prices = pd.Series([0,0,35,40,10,54,87,np.nan,12,95,64,56,4,45,76,34,np.nan,56,
                    87,34,56,32,56,48,np.nan,89,42,65,100,99,98,100,96,99])
prices.mean()

57.064516129032256

By default, pandas skips missing values (NaN) and calculates the mean of the remaining data in the series, so the result is the same as before.
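The skipping of missing values is controlled by the skipna argument of Series.mean, which defaults to True. The sketch below uses a small made-up series (not the prices data) to contrast that default with skipna=False and with plain NumPy, which does not skip NaN unless np.nanmean is used.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, np.nan, 30])      # hypothetical small series with one missing value

mean_skip = s.mean()                     # NaN is skipped: (10 + 20 + 30) / 3
mean_keep = s.mean(skipna=False)         # NaN propagates, so the result is NaN
np_mean = np.mean(s.to_numpy())          # plain NumPy mean also returns NaN
np_nanmean = np.nanmean(s.to_numpy())    # NumPy variant that ignores NaN

print(mean_skip, mean_keep, np_mean, np_nanmean)
```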

#### Trimmed mean

In [4]:
l = [1,2,3,4,5,6,7,8,9,10]

scipy.stats.trim_mean(l,proportiontocut=0.20)

5.5

In [5]:
# How the trimmed mean works:
# 20% of the observations are removed from each end (bottom and top) of the sorted sequence,
# so 1, 2, 9 and 10 are removed from the list 'l'.
# The mean is then calculated from the remaining numbers.
np.mean([3,4,5,6,7,8])

5.5

In [6]:
# So, the mean calculated above matches the value returned by scipy.stats.trim_mean
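The manual steps can be wrapped in a small helper. This is only a sketch: manual_trim_mean is a hypothetical function written for illustration, and it assumes a one-dimensional input with enough values left after trimming. It also shows why trimming is useful: a single outlier moves the ordinary mean far more than the trimmed mean.

```python
import numpy as np
from scipy import stats

def manual_trim_mean(values, proportiontocut):
    """Sort, drop int(proportiontocut * n) values from each end, then average."""
    data = np.sort(np.asarray(values, dtype=float))
    k = int(proportiontocut * data.size)      # number of values cut from each tail
    return data[k:data.size - k].mean()

l = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
manual = manual_trim_mean(l, 0.20)
reference = stats.trim_mean(l, proportiontocut=0.20)
print(manual, reference)

# Replace the largest value with an outlier
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
plain_mean = np.mean(with_outlier)                   # pulled up by the outlier
trimmed_mean = manual_trim_mean(with_outlier, 0.20)  # outlier is cut away
print(plain_mean, trimmed_mean)
```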

#### Median

The median is the middle observation of the data when the values are arranged in ascending or descending order.

It divides the data into two equal parts. Thus, it is a positional average.

The median is the middle term if the number of observations is odd.

The median is the average of the two middle terms if the number of observations is even.

In [7]:
prices.median()

56.0
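The odd and even cases can be checked on two small lists. The lists here are made up for illustration and are not part of the prices data.

```python
import numpy as np

odd = [7, 1, 5]        # sorted: 1, 5, 7 -> the middle term is 5
even = [7, 1, 5, 3]    # sorted: 1, 3, 5, 7 -> the two middle terms are 3 and 5

median_odd = np.median(odd)     # middle term
median_even = np.median(even)   # average of the two middle terms

print(median_odd, median_even)
```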

#### Mode

The mode of the data is the value that has the highest frequency.

In other words, it is the most frequently repeated observation.

Data can be unimodal, bimodal, or multimodal, i.e. there can be one or more modes in a series of data.

In [8]:
prices.mode()

0    56.0
dtype: float64
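Because a series can have more than one mode, Series.mode returns a Series rather than a single number. The sketch below uses a small made-up series in which two values tie for the highest frequency.

```python
import pandas as pd

bimodal = pd.Series([1, 1, 2, 2, 3])   # hypothetical data: 1 and 2 both occur twice

modes = bimodal.mode()                 # one row per mode, in ascending order
counts = bimodal.value_counts()        # frequency of each distinct value

print(modes.tolist())
print(counts)
```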

### Partition Values

#### Quartiles

The values that divide the data into four equal parts are called quartiles.

25% of the data lies below the first quartile (Q1) and 75% above it.

The second quartile divides the data into two equal halves. Q2 is the median.

The third quartile (Q3) divides the observations 75%-25%, i.e. 75% of the data lies below Q3 and 25% above it.

In [9]:
# First Quartile
prices.quantile(0.25)

34.5

In [10]:
# Second Quartile or Median
prices.quantile(0.50)

56.0

In [11]:
# Third Quartile
prices.quantile(0.75)

88.0
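The first quartile, 34.5, is not itself a value in the data. With its default settings, Series.quantile interpolates linearly between the two nearest ordered observations. The sketch below reproduces that step by hand for Q1 and then requests all three quartiles in one call; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Same non-missing values as the prices series used above
prices = pd.Series([0, 0, 35, 40, 10, 54, 87, 12, 95, 64, 56, 4, 45, 76, 34, 56, 87, 34,
                    56, 32, 56, 48, 89, 42, 65, 100, 99, 98, 100, 96, 99])

ordered = np.sort(prices.to_numpy())
n = ordered.size                        # number of observations

pos = 0.25 * (n - 1)                    # fractional position of Q1 in the ordered data
lower = ordered[int(np.floor(pos))]     # observation just below that position
upper = ordered[int(np.ceil(pos))]      # observation just above that position
q1_manual = lower + (pos - np.floor(pos)) * (upper - lower)   # linear interpolation

quartiles = prices.quantile([0.25, 0.50, 0.75])   # Q1, Q2 and Q3 in one call

print(pos, lower, upper, q1_manual)
print(quartiles)
```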

#### Decile

The values that divide the dataset into 10 equal parts are called deciles.

The first value in the decile output below is the 0th percentile, i.e. the minimum value, and is not itself a decile.

10% of the data lies below the first decile and 90% above it.

Similarly, 20% of the data lies below the second decile and 80% above it.

The fifth decile is the same as the second quartile, i.e. D5 = Q2 = Median.

Therefore, there are 9 deciles.

In [12]:
# Removing NaNs to calculate decile
prices.dropna(inplace=True)

In [13]:
np.percentile(prices,np.arange(0,100,10))

array([ 0., 10., 34., 40., 48., 56., 64., 87., 95., 99.])

10% of the data values lie below 10

20% of the data values lie below 34

30% of the data values lie below 40

40% of the data values lie below 48

50% of the data values lie below 56

60% of the data values lie below 64

70% of the data values lie below 87

80% of the data values lie below 95

90% of the data values lie below 99
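To get only the nine deciles, without the minimum, the percentile range can start at 10 instead of 0. This is a small variation on the call above, shown as a sketch; values and deciles are illustrative names.

```python
import numpy as np

# Same non-missing values as the prices series used above
values = np.array([0, 0, 35, 40, 10, 54, 87, 12, 95, 64, 56, 4, 45, 76, 34, 56, 87, 34,
                   56, 32, 56, 48, 89, 42, 65, 100, 99, 98, 100, 96, 99])

# 10th, 20th, ..., 90th percentiles: the nine deciles D1 to D9
deciles = np.percentile(values, np.arange(10, 100, 10))
print(deciles)

# The fifth decile (index 4) coincides with the median
print(deciles[4], np.median(values))
```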