In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

> __*How do we summarize data numerically? Measure its center and how spread out it is!*__

# 4. Measure of center (central tendency)

<hr>

### a. Mean (balance point)

- __Mean__ or average: the population mean is written $\mu$ and the sample mean $\bar{x}$. It is the sum of the values divided by the number of values $n$:

    - Without frequencies:

    $$\displaystyle \bar{x} = \frac {\sum{x}} {n}$$
    
    - With frequencies (each distinct value $x$ occurs $f$ times, so $n = \sum f$):

    $$\displaystyle \bar{x} = \frac {\sum{xf}} {n}$$

#### _Example_

> Calculate the mean of __1,3,3,4,4,4,5,5__!

- Without frequencies:
    
    $\displaystyle \bar{x} = \frac {\sum{x}} {n} = \frac {1+3+3+4+4+4+5+5} {8} = \frac {29} {8} $= __3.625__
    
- With frequencies, create a frequency table

    x|f|xf
    ---|---|---
    1|1|1
    3|2|6
    4|3|12
    5|2|10
    $\sum$|8|29

    $\displaystyle \bar{x} = \frac {\sum{x} f} {n} = \frac {29} {8}$ = __3.625__
    
    

In [6]:
x = np.array([1,3,3,4,4,4,5,5])
x = pd.Series(x)
x.mean()

3.625
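
The frequency-table calculation above can also be checked in code. The following is a minimal sketch, assuming the distinct values and their frequencies are stored in two arrays; the names `values`, `freqs`, `mean_manual` and `mean_weighted` are illustrative and do not come from the notebook.

```python
import numpy as np

# Distinct values and their frequencies, copied from the frequency table
values = np.array([1, 3, 4, 5])
freqs = np.array([1, 2, 3, 2])

n = freqs.sum()                                     # total number of observations
mean_manual = (values * freqs).sum() / n            # sum(x * f) / n
mean_weighted = np.average(values, weights=freqs)   # same result via a weighted average

print(n, mean_manual, mean_weighted)
```

Both approaches should agree with `x.mean()` on the expanded list, because weighting each value by its frequency is the same as repeating it that many times.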

<hr>

### b. Median (middle point)

- __Median__ or middle quartile $Q_{2}$: the middle value of the data in sorted order. When the number of values is even, it is the average of the two middle values.

#### _Example_

> Calculate the median of __1,3,3,4,4,4,5,5__!

Middle value of __1, 3, 3, 4 | 4, 4, 5, 5__

Median = $\displaystyle \frac {4+4} {2}$ = __4__

In [7]:
x = np.array([1,3,3,4,4,4,5,5])
x = pd.Series(x)
x.median()

4.0

<hr>

### c. Mode

- __Mode__: the value that occurs most often. It is usually used for categorical data.

#### _Example_

> Calculate the mode of __1,3,3,4,4,4,5,5__!

Mode = __4__ because it occurs __3__ times, more often than any other value.

In [8]:
x = np.array([1,3,3,4,4,4,5,5])
x = pd.Series(x)
x.mode()

0    4
dtype: int32

# 5. Measure of spread

<hr>

Measure of spread, aka __data distribution__: how spread out the values are around the center. Why do we need to measure spread? Because measuring the center alone is not enough.

For instance, the following lists have the same mean and median but different spread / distribution. If you look only at the center, you miss how the data are spread.
    
__5,5,5,5,5,5,5__ mean == median, no spread

__1,1,1,5,9,9,9__ mean == median, values concentrated at both edges

__1,2,4,5,6,8,9__ mean == median, spread out
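
To make the comparison concrete, here is a small sketch that summarizes the three lists side by side. The labels and the `summary` table are illustrative names; the range and the sample standard deviation it reports are the spread measures introduced in the rest of this section.

```python
import pandas as pd

# The three lists from the text, with illustrative labels
samples = {
    "no spread": [5, 5, 5, 5, 5, 5, 5],
    "both edges": [1, 1, 1, 5, 9, 9, 9],
    "spread out": [1, 2, 4, 5, 6, 8, 9],
}

rows = []
for label, data in samples.items():
    s = pd.Series(data)
    rows.append({
        "label": label,
        "mean": s.mean(),
        "median": s.median(),
        "range": s.max() - s.min(),
        "std": s.std(),          # sample standard deviation (ddof=1)
    })

summary = pd.DataFrame(rows).set_index("label")
print(summary)
```

The mean and median columns are identical across the three rows, while the range and standard deviation columns differ.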
    
Which measures can be used to quantify data spread?

- __Range__: largest value minus smallest value. Its weakness is that a single extreme value can change it greatly.

    __1,3,3,4,4,4,5,5__ range = 5 - 1 = __4__
    
    __1,3,3,4,4,4,5,60__ range = 60 - 1 = __59__
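
A quick sketch of the two range calculations above, showing how one extreme value changes the result. The variable names are illustrative.

```python
import numpy as np

a = np.array([1, 3, 3, 4, 4, 4, 5, 5])
b = np.array([1, 3, 3, 4, 4, 4, 5, 60])   # same data, but the last value is extreme

range_a = a.max() - a.min()   # largest value minus smallest value
range_b = b.max() - b.min()

print(range_a, range_b)
```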
    
<hr>    

- __IQR (*Interquartile Range*)__ $= Q_{3} - Q_{1}$: the range of the middle 50% of the data. Its limitation is that it uses only two values, $Q_{1}$ and $Q_{3}$. A measure that takes all the data into account describes the spread better.

    __1,3,3,4,4,4,5,5__ 
    
    $Q_{1} = 3$ and $Q_{3} = 4.5$ so IQR $= Q_{3} - Q_{1} = 4.5 - 3$ = __1.5__
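
The quartiles above appear to come from the "median of each half" method. The sketch below implements that method in a hypothetical helper, `quartiles_median_of_halves`, which is not part of NumPy or pandas and assumes at least two values. It also shows that `np.percentile`, which interpolates linearly by default, can give a different $Q_{3}$ for the same data.

```python
import numpy as np

def quartiles_median_of_halves(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    For an odd number of values the middle value is left out of both halves.
    """
    x = np.sort(np.asarray(data, dtype=float))
    half = len(x) // 2
    return np.median(x[:half]), np.median(x[-half:])

data = [1, 3, 3, 4, 4, 4, 5, 5]

q1, q3 = quartiles_median_of_halves(data)
iqr = q3 - q1
print(q1, q3, iqr)

# NumPy's default percentile uses linear interpolation instead
q1_np, q3_np = np.percentile(data, [25, 75])
print(q1_np, q3_np, q3_np - q1_np)
```

Several quartile conventions are in common use, so it is worth checking which one a tool applies before comparing IQR values.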

<hr>

- __Standard Deviation__: roughly the typical distance of the values from the mean. The symbol is $\sigma$ for a population and __*s*__ for a sample. The sample standard deviation is calculated as:

    $$\displaystyle s = \sqrt{\frac {\sum(x - \bar{x}) ^ 2} {n - 1}}$$

    or, if using frequencies:

    $$\displaystyle s = \sqrt{\frac {\sum(x - \bar{x}) ^ 2 f} {n - 1}}$$

    _Example:_ __11,13,14,14__, so $\bar{x} = 13$
    
    <table>
    <thead>
        <tr><th>$x$</th><th>$x-\bar{x}$</th><th>$(x-\bar{x})^2$</th></tr>
    </thead>
    <tbody>
        <tr><td>11</td><td>-2</td><td>4</td></tr>
        <tr><td>13</td><td>0</td><td>0</td></tr>
        <tr><td>14</td><td>1</td><td>1</td></tr>
        <tr><td>14</td><td>1</td><td>1</td></tr>
        <tr><td></td><td>$\sum$</td><td><b>6</b></td></tr>
    </tbody>
    </table>
    
    So the sample standard deviation is $\displaystyle s = \sqrt{\frac {6} {4-1}} = \sqrt{2} \approx 1.41$


In [13]:
x = np.array([11,13,14,14])
x = pd.Series(x)

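print(x.std(ddof=1))  # sample standard deviation: divides by n - 1
print(np.std(x))      # population standard deviation: np.std defaults to ddof=0, divides by n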

1.4142135623730951
1.224744871391589
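
The table-based calculation can be reproduced step by step. This sketch mirrors the columns of the table and also applies the frequency form of the formula. All variable names are illustrative.

```python
import numpy as np

x = np.array([11, 13, 14, 14])

mean = x.mean()                    # x-bar
deviations = x - mean              # column: x - x-bar
squared = deviations ** 2          # column: (x - x-bar)^2
sum_sq = squared.sum()             # bottom row of the table

s = np.sqrt(sum_sq / (len(x) - 1))   # sample standard deviation (n - 1)
sigma = np.sqrt(sum_sq / len(x))     # population standard deviation (n)

# Frequency form: each distinct value weighted by how often it occurs
values = np.array([11, 13, 14])
freqs = np.array([1, 1, 2])
n = freqs.sum()
mean_f = (values * freqs).sum() / n
s_freq = np.sqrt((freqs * (values - mean_f) ** 2).sum() / (n - 1))

print(sum_sq, s, sigma, s_freq)
```

The values `s` and `sigma` should match the two numbers printed by the cell above, which differ only in the divisor.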


<hr>

- __z__ or __the number of standard deviations__: how far a value lies from the mean, measured in units of standard deviation. For example, a value can lie 0.5 standard deviations from the mean.

    $\displaystyle z = \frac {x - \bar{x}} {s}$ and so $\displaystyle x = \bar{x} + zs$
    
    - _Example:_ How many standard deviations from the mean is the median of __1,3,3,4,4,4,5,5__? For this data, __median = 4__, __mean = 3.625__ and __standard deviation ≈ 1.3__, so we can find z.
    
    $\displaystyle z = \frac {x - \bar{x}} {s} = \frac {4 - 3.625} {1.3}$ = __0.288__ standard deviations from the mean
    
    - _Example:_ With the same data, what value is 2 standard deviations below the mean? "Below" means the offset is negative, so $x = \bar{x} - zs$ with $z = 2$.
    
    $\displaystyle x = \bar{x} - zs = 3.625 - 2(1.3)$ = __1.025__
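
Both examples can be reproduced with pandas. This is a minimal sketch with illustrative variable names. Note that it uses the unrounded sample standard deviation, so the results differ slightly from the hand calculations, which round $s$ to 1.3.

```python
import pandas as pd

x = pd.Series([1, 3, 3, 4, 4, 4, 5, 5])

mean = x.mean()
s = x.std()          # sample standard deviation (pandas default ddof=1)

# z-score of the median: (x - mean) / s
z_median = (x.median() - mean) / s

# Value lying 2 standard deviations below the mean: mean - 2 * s
x_below = mean - 2 * s

print(round(z_median, 3), round(x_below, 3))
```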

<hr>

- __Outliers__: values that lie far from the rest of the data. An outlier may be the lowest or the highest value in a dataset. For example, in __1,3,3,5,87__ the outlier is __87__. How do we decide which values are outliers?
    
    1. __IQR Method__: 
        - low outlier = values lower than $Q_{1} - 1.5 \times $ __IQR__
        - high outlier = values higher than $Q_{3} + 1.5 \times $ __IQR__
        
        _Example_: from __2,3,5,6,7,14__ we get $Q_{1}$=3, $Q_{3}$=7 & IQR = 7-3 = 4. Outliers are values below 3 - 1.5(4) = -3 or above 7 + 1.5(4) = 13, so the only outlier in the list is __14__.
        
    2. __Standard deviations / z-score method__:
        A value more than $k$ standard deviations from the mean is an outlier, where the threshold $k$ is commonly __2__ or __2.5__.
        
        - low outlier = values below $\displaystyle \bar{x} - k \cdot s$
        - high outlier = values above $\displaystyle \bar{x} + k \cdot s$
        
        _Example_: from __1,3,3,4,4,4,5,5__ we get $\bar{x}$=3.625, $s$=1.3. With $k = 2$, outliers are values below 3.625 - 2(1.3) = __1.025__ or above 3.625 + 2(1.3) = __6.225__, so the outlier is __1__. With $k = 2.5$ the lower limit drops to 3.625 - 2.5(1.3) = 0.375 and no value is flagged.
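
The two outlier rules can be sketched as small helper functions. `iqr_outliers` and `z_outliers` are hypothetical names used only for illustration, not library functions. The IQR helper assumes the quartiles are the medians of the lower and upper halves, which matches the quartile values used in the examples.

```python
import numpy as np

def iqr_outliers(data, k=1.5):
    """Values below Q1 - k*IQR or above Q3 + k*IQR (median-of-halves quartiles)."""
    x = np.sort(np.asarray(data, dtype=float))
    half = len(x) // 2
    q1, q3 = np.median(x[:half]), np.median(x[-half:])
    iqr = q3 - q1
    return x[(x < q1 - k * iqr) | (x > q3 + k * iqr)]

def z_outliers(data, k=2.0):
    """Values more than k sample standard deviations away from the mean."""
    x = np.asarray(data, dtype=float)
    mean, s = x.mean(), x.std(ddof=1)
    return x[np.abs(x - mean) > k * s]

out_iqr = iqr_outliers([2, 3, 5, 6, 7, 14])
out_z2 = z_outliers([1, 3, 3, 4, 4, 4, 5, 5], k=2.0)
out_z25 = z_outliers([1, 3, 3, 4, 4, 4, 5, 5], k=2.5)

print(out_iqr, out_z2, out_z25)
```

The choice of threshold matters: in this data the value 1 is flagged with `k=2.0` but not with `k=2.5`.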

# 6. Effects of Outliers on Center & Spread

<hr>

How do outliers affect center & spread?

- _Example_: 
    
    Temperature data logger ($^{\circ}C$): __-350, 15, 20.5, 26, 30.5, 31, 31__
    
    Outlier = __-350__

    Measure|With Outlier|Without Outlier
    ---|---|---
    Mean|-28|25.667
    Median|26|28.25
    Mode|31|31
    Range|381|16
    
    The table above shows that the presence of an outlier:
    
    - __Strongly affects__ the Mean & Range, and measures derived from them such as the Standard Deviation.
    
    - __Has relatively little effect on__ the Median & Mode, and measures derived from them.
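
The comparison table can be regenerated from the raw readings. This is a sketch under the assumption that the outlier is simply dropped for the second column. The `summarize` helper and the extra `Std` row are additions for illustration and are not part of the original table.

```python
import pandas as pd

temps = pd.Series([-350, 15, 20.5, 26, 30.5, 31, 31])
without = temps[temps != -350]      # drop the outlier reading

def summarize(s):
    """Center and spread measures for one series (illustrative helper)."""
    return {
        "Mean": s.mean(),
        "Median": s.median(),
        "Mode": s.mode().iloc[0],   # first mode if there are several
        "Range": s.max() - s.min(),
        "Std": s.std(),             # sample standard deviation (ddof=1)
    }

table = pd.DataFrame({
    "With Outlier": summarize(temps),
    "Without Outlier": summarize(without),
})
print(table.round(3))
```

The `Mean`, `Range` and `Std` rows change sharply between the two columns, while `Median` shifts only slightly and `Mode` does not change.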