## **Central Tendencies** 
A measure of central tendency is a single value that describes a data set by identifying its central position. Measures of central tendency are sometimes also called measures of central location or summary statistics. 
<br>
The following are common measures of central tendency:

#### 1. Mean 
The mean (or average) is the most popular and well-known measure of central tendency. It is equal to the sum of all the values in the data set divided by the number of values. The sample mean is usually denoted by $\overline{x}$ (x bar) and the population mean by $\mu$ (mu). <br>  
The formula for calculating the mean is:
$$\overline{x} = {{x_1 + x_2 + \dots + x_n}\over{n}} = {{\sum{x}}\over{n}}$$

In [7]:
def mean(x):
    """
    x is a list of numbers
    """
    if not isinstance(x, list):
        return "mean calc is not possible"
    return sum(x) / len(x)

print(mean([1, 2, 3, 4]))
print(mean(1)) #Error ?

2.5
mean calc is not possible


In [8]:
def mean(x):
    """
    x is a list (or other sized iterable) of numbers
    """
    try:
        return sum(x) / len(x)
    except (TypeError, ZeroDivisionError):
        # non-iterable input or an empty sequence
        return "mean calc is not possible"
print(mean([1, 2, 3, 4]))
print(mean(1)) #Error ?

2.5
mean calc is not possible


In [45]:
# Generate 51 random integers in the range [0, 100) (the upper bound is exclusive)
from numpy.random import randint, seed
seed(42)
x = randint(0, 100, 51)
print(x)

[51 92 14 71 60 20 82 86 74 74 87 99 23  2 21 52  1 87 29 37  1 63 59 20
 32 75 57 21 88 48 90 58 41 91 59 79 14 61 61 46 61 50 54 63  2 50  6 20
 72 38 17]


In [46]:
import numpy as np
np.mean(x)

50.1764705882353

The mean has one main disadvantage: it is particularly susceptible to the influence of outliers, i.e. values that are unusual compared to the rest of the data set. <br>
In such situations we use another measure of central tendency, the median.
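
To see the effect, here is a small sketch with made-up numbers (the `salaries` list is purely illustrative and not part of the data set above). It uses Python's standard `statistics` module, imported under aliases so the notebook's own functions are not shadowed, to compare how the mean and the median react when one extreme value is added.

```python
from statistics import mean as stat_mean, median as stat_median

# Hypothetical data, chosen only for illustration
salaries = [30, 32, 35, 38, 40]
with_outlier = salaries + [500]  # one unusually large value

# Mean and median without the outlier
print(stat_mean(salaries), stat_median(salaries))
# Mean and median with the outlier
print(stat_mean(with_outlier), stat_median(with_outlier))
```

With these numbers the mean jumps from 35 to 112.5, while the median only moves from 35 to 36.5.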

<br>

#### 2. Median
The median is the middle value of a data set that has been arranged in order of magnitude. When the number of values is even, it is the average of the two middle values. The median is less affected by outliers and skewed data than the mean. 

In [47]:
def median(v):
    n = len(v)
    sorted_v = sorted(v)
    midpoint = n // 2
    if n % 2 == 1:
        return sorted_v[midpoint]
    
    low = midpoint - 1
    high = midpoint
    return (sorted_v[low]+sorted_v[high]) / 2

In [48]:
print(median(x))
print(type(median(x)))

54
<class 'numpy.int64'>


In [49]:
int(np.median(x))

54

<br>

#### 3. Mode 
The mode is the value that appears most frequently in a data set. A set of data may have one mode, more than one mode, or no mode at all.

In [52]:
from collections import Counter # Revise collections module of python

def mode(x):
    c = Counter(x)
    if not c:
        return []
    # Compute the highest frequency once instead of on every iteration
    highest = max(c.values())
    return [value for value, freq in c.items() if freq == highest]

In [53]:
temparr = [1, 1, 2, 2, 3]
print(mode(x))
print(mode(temparr))

[20, 61]
[1, 2]


In [54]:
import statistics
statistics.multimode(x)

[20, 61]

<br>

#### 4. Quantile
Quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way. The median is a quantile: it is the point at which half of the data lies below and half lies above. Because it cuts a distribution into two equal areas, it is sometimes called the 2-quantile.

In [55]:
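def quantile(x, p):
    """Return the value at fraction p (0 <= p <= 1) of the sorted data."""
    # Clamp the index so that p = 1 returns the maximum instead of raising IndexError
    p_index = min(int(p * len(x)), len(x) - 1)
    return sorted(x)[p_index]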

In [57]:
print(sorted(x))

[1, 1, 2, 2, 6, 14, 14, 17, 20, 20, 20, 21, 21, 23, 29, 32, 37, 38, 41, 46, 48, 50, 50, 51, 52, 54, 57, 58, 59, 59, 60, 61, 61, 61, 63, 63, 71, 72, 74, 74, 75, 79, 82, 86, 87, 87, 88, 90, 91, 92, 99]


In [63]:
quantile(x, 0.25)

21

In [64]:
quantile(x, 0.5)

54

In [65]:
quantile(x, 0.75)

74

In [66]:
np.quantile(x, 0.3)

32.0

In [68]:
np.quantile(x, 0.33)

37.5
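
`np.quantile` returns 37.5 here, a value that does not occur in `x`, whereas the `quantile` function above always returns an element of the data. The difference is that NumPy's default method interpolates linearly between the two nearest sorted values. A minimal sketch of that idea follows; `quantile_linear` is a name made up for this example, it assumes non-empty data and 0 <= p <= 1, and it is meant to illustrate the interpolation rather than reproduce NumPy's implementation.

```python
def quantile_linear(values, p):
    """Quantile with linear interpolation between neighbouring sorted values."""
    s = sorted(values)
    pos = p * (len(s) - 1)           # fractional position in the sorted data
    low = int(pos)                   # index just below (or at) the position
    high = min(low + 1, len(s) - 1)  # index just above, clamped to the last element
    frac = pos - low                 # how far the position lies between the two
    return s[low] + frac * (s[high] - s[low])

data = [10, 20, 30, 40]  # illustrative data
print(quantile_linear(data, 0.5))   # halfway between 20 and 30
print(quantile_linear(data, 0.25))  # three quarters of the way from 10 to 20
```

For the 51-value array above, p = 0.33 corresponds to position 0.33 × 50 = 16.5, halfway between the sorted values 37 and 38, which gives 37.5.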

<br>

## **Dispersion**
Dispersion in statistics describes how spread out a set of data is. Measures of dispersion help quantify the variability of the data in an objective manner. <br>
If all the values are close together, the spread is low; if some or all of the values differ by a large amount from the mean (and from each other), the spread is large.
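
As a quick illustration with made-up numbers, the two lists below share the same mean but differ greatly in spread. The names `tight`, `wide`, and `deviations` are hypothetical; the sketch only prints each value's deviation from the mean.

```python
tight = [48, 49, 50, 51, 52]  # values close together
wide = [10, 30, 50, 70, 90]   # values far apart

def deviations(values):
    """Return each value's difference from the mean of the list."""
    avg = sum(values) / len(values)
    return [v - avg for v in values]

print(deviations(tight))
print(deviations(wide))
```

Both lists have a mean of 50, but the deviations in `wide` are twenty times larger than those in `tight`.
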
The following are common measures of dispersion:

#### 1. Variance 
Variance is the arithmetic mean of the squared deviations of the values from their arithmetic mean. In simpler terms, it is calculated by taking the difference between each number in the data set and the mean, squaring the differences to make them non-negative, and dividing the sum of the squares by the number of values in the data set.
Variance is calculated using the following formula:
$$ V = \frac{\sum_{i=1}^{n} (x_{i} - \bar{x})^{2}}{n} $$

In [73]:
def variance(x):
    n = len(x)
    avg = mean(x)
    sum_squares = sum([(ele-avg)**2 for ele in x])  # Homework: Revise list comprehensions
    
    #sum_squares = 0
    #for ele in x:
    #    sum_squares += (ele-avg)**2
    
    return sum_squares / n
    
    

In [74]:
variance(x)

782.4198385236446

In [75]:
np.var(x)

782.4198385236447

<br>

#### 2. Standard Deviation 
Because variance is computed from squared distances from the mean, its unit is the square of the unit of the original data. Taking the square root restores the original unit: the standard deviation is the positive square root of the variance. It is denoted by $\sigma$ (sigma) and can be calculated using the following formula:
$$ \sigma = \sqrt{V} = \sqrt{\frac{\sum_{i=1}^{n} (x_{i} - \bar{x})^{2}}{n}} $$ 

<br>

**Note:** The denominator depends on whether the variance / standard deviation of a population or of a sample is being calculated: the population formula divides by $n$ (as above), whereas the sample formula divides by $n - 1$.
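
The effect of the denominator can be sketched as follows. `variance_ddof` is a hypothetical helper written for this example; its `ddof` argument is the amount subtracted from the number of values in the denominator, following the same convention as the `ddof` parameter of `np.var` and `np.std` (whose default of 0 gives the population formula). The data are made up, and the standard library's `statistics` functions are imported under aliases purely as a cross-check.

```python
from statistics import pvariance, variance as sample_variance

def variance_ddof(values, ddof=0):
    """Variance with denominator n - ddof (ddof=0: population, ddof=1: sample)."""
    n = len(values)
    avg = sum(values) / n
    return sum((v - avg) ** 2 for v in values) / (n - ddof)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data with mean 5

# Population variance: divide by n
print(variance_ddof(data), pvariance(data))
# Sample variance: divide by n - 1
print(variance_ddof(data, ddof=1), sample_variance(data))
```

The sample variance is always the larger of the two, and the gap shrinks as the number of values grows.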

In [76]:
from math import sqrt
def std(x):
    return sqrt(variance(x))

In [78]:
std(x)

27.971768598421598

In [79]:
np.std(x)

27.9717685984216

<br> 

#### 3. Range
The range is the simplest measure of dispersion. It is the difference between the highest and lowest values in a distribution.

In [80]:
def srange(x):
    return max(x) - min(x)

In [81]:
srange(x)

98

In [82]:
np.ptp(x) #peak to peak

98

<br>

#### 4. Inter-Quartile Range (IQR)
In descriptive statistics, the interquartile range (IQR), also called the midspread or H-spread, is a measure of dispersion equal to the difference between the 75th and 25th percentiles (i.e. the 0.75 and 0.25 quantiles). <br>

$$ IQR = Q3 - Q1 $$

The interquartile range is often used to find outliers in data. Outliers here are defined as observations that fall below Q1 − 1.5 IQR or above Q3 + 1.5 IQR. In a boxplot, the highest and lowest values occurring within these limits are indicated by the whiskers of the box (frequently with an additional bar at the end of each whisker), and any outliers are drawn as individual points.

<div align = "center">
    
<img src = "https://upload.wikimedia.org/wikipedia/commons/thumb/1/1a/Boxplot_vs_PDF.svg/800px-Boxplot_vs_PDF.svg.png" width = "400" />

<br>
    
Boxplot (with IQR) and a pdf (Probability density function) of a Normal N(0, $\sigma^2$) population
    
</div>

In [83]:
def iqr(x):
    return np.quantile(x, 0.75) - np.quantile(x, 0.25)

In [84]:
print(iqr(x))

51.0
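
The 1.5 × IQR rule for outliers described above can be sketched in plain Python. The function name `iqr_outliers`, its parameter `k`, and the data are made up for illustration; the quartiles come from `statistics.quantiles` (Python 3.8+) with `method="inclusive"`, which interpolates linearly between sorted values.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1  # the interquartile range
    lower, upper = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < lower or v > upper]

data = [10, 12, 13, 15, 16, 18, 19, 21, 95]  # 95 is far from the rest
print(iqr_outliers(data))
```

Here Q1 = 13 and Q3 = 19, so the IQR is 6 and the limits are 4 and 28; only 95 falls outside them.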
