# Mean without outliers (Optional Challenge)



📚 As you already know, the mean is defined by:

$$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \dots + x_{n-1} + x_n}{n}$$

⚠️ However, a single outlier can heavily distort the mean.

💪 The median is a more robust measure of central tendency.
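As a quick illustration of the two statements above, here is a minimal sketch comparing the mean and the median on a small list containing one extreme value. The list `values` and the variable names are assumptions made for this example only.

```python
import numpy as np

# illustrative data: ten regular observations plus one extreme value
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]

mean_with_outlier = np.mean(values)          # pulled upwards by the value 100
mean_without_outlier = np.mean(values[:-1])  # mean of the ten regular observations
median_with_outlier = np.median(values)      # barely affected by the value 100

print(round(mean_with_outlier, 2), mean_without_outlier, median_with_outlier)
```

Here the mean jumps from 5.5 to about 14.09 once the value 100 is included, while the median only moves from 5.5 to 6.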

🤔 But what if we could create a function `mean_without_outliers` to compute, as the name says, the mean without outliers?




## Preliminary step: defining `outliers`

Answering this question requires a preliminary step: deciding what counts as an `outlier`.

For each observation:

* `option 1:` We could consider that an outlier is an observation with a **`z-score`** below -3 or above 3, for example.
    - This relies on a strong assumption: that your distribution is roughly Gaussian.
    - We could also be stricter by replacing the 3-std limit with 2, or looser by replacing it with 4 or 5.

* `option 2:` We could use the definition of an outlier in a **`box-and-whisker plot`**, where an outlier is an observation that lies below `Q1 - 1.5 * IQR` or above `Q3 + 1.5 * IQR`.
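
Written as formulas, the two options could look as follows. The symbols $s$ (the standard deviation of the sample) and $c$ (the chosen cutoff, for example 3) are notation introduced here for illustration; they do not appear elsewhere in this notebook.

$$ z_i = \frac{x_i - \bar{x}}{s}, \qquad x_i \text{ is an outlier if } |z_i| > c $$

$$ IQR = Q_3 - Q_1, \qquad x_i \text{ is an outlier if } x_i < Q_1 - 1.5 \, IQR \ \text{ or } \ x_i > Q_3 + 1.5 \, IQR $$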



In [2]:
import numpy as np
import pandas as pd

In [3]:
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]

## Outliers defined by Z-score

### Draft

- For your sample, compute:
    - the mean
    - the standard deviation
    - the z-score of each observation
- Remove the outliers (observations with a z-score higher than your cutoff or lower than -cutoff)
- Compute the mean with the remaining elements

Once you are satisfied with your steps, you can wrap these steps up into a single function in the next section of this notebook.

In [9]:
# YOUR CODE HERE
sample = np.array(sample)
sample_mean = sample.mean()
sample_std = sample.std()

In [10]:
sample_z_scores = (sample - sample_mean)/sample_std
sample_z_scores

array([-0.47944098, -0.44281701, -0.40619305, -0.36956909, -0.33294512,
       -0.29632116, -0.2596972 , -0.22307323, -0.18644927, -0.14982531,
        3.14633142])

In [12]:
np.where(abs(sample_z_scores)<3)

(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),)

### `mean_without_outliers_z_score`

In [15]:
def mean_without_outliers_z_score(elements):
    ''' return the mean of a list of elements without outliers using the z_score'''
    elements = np.array(elements)
    z_scores = (elements - elements.mean()) / elements.std()
    # keep only the observations whose absolute z-score is below the cutoff of 3
    return elements[np.abs(z_scores) < 3].mean()

In [16]:
mean_without_outliers_z_score(sample)

5.5
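
The options listed earlier mention stricter or looser limits than 3 standard deviations. Below is a minimal sketch of how the cutoff could be exposed as a parameter. The function name `mean_without_outliers_cutoff`, its `cutoff` argument and the list `values` are hypothetical and not part of the exercise.

```python
import numpy as np

def mean_without_outliers_cutoff(elements, cutoff=3):
    """Hypothetical variant: mean of the elements whose |z-score| is below `cutoff`."""
    elements = np.asarray(elements, dtype=float)
    # np.std uses the population standard deviation (ddof=0) by default
    z_scores = (elements - elements.mean()) / elements.std()
    return elements[np.abs(z_scores) < cutoff].mean()

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]
strict_mean = mean_without_outliers_cutoff(values, cutoff=2)  # 100 is removed
loose_mean = mean_without_outliers_cutoff(values, cutoff=4)   # nothing is removed
```

With these values, a cutoff of 4 keeps the observation 100, because its z-score is about 3.15, so `loose_mean` equals the plain mean of the whole list.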

## Outliers defined by the boxplot

### Draft

- For your sample, compute:
    - Q1
    - Q3
    - IQR
    - the lower bound Q1 - 1.5 IQR
    - the upper bound Q3 + 1.5 IQR
- Remove the outliers (observations that are lower than the lower bound or greater than the upper bound)
- Compute the mean with the remaining elements

Once you are satisfied with your steps, you can wrap these steps up into a single function in the next section of this notebook.

In [19]:
# YOUR CODE HERE
# assumes `sample` is sorted in ascending order
if 0.25*len(sample)%1==0:
    Q1 = sample[int(0.25*len(sample))-1]  # an index must be an integer
else :
    Q1 = (sample[round(0.25*len(sample))]+sample[round(0.25*len(sample))-1])/2

In [21]:
# assumes `sample` is sorted in ascending order
if 0.75*len(sample)%1==0:
    Q3 = sample[int(0.75*len(sample))-1]  # an index must be an integer
else :
    Q3 = (sample[round(0.75*len(sample))]+sample[round(0.75*len(sample))-1])/2

In [23]:
IQR = Q3 - Q1

In [24]:
l_bound = Q1 - 1.5*IQR

In [25]:
u_bound = Q3 + 1.5*IQR

In [27]:
Q1,Q3,IQR,u_bound,l_bound

(3.5, 8.5, 5.0, 16.0, -4.0)

In [29]:
filtered = [elt for elt in sample if elt>l_bound and elt<u_bound]

In [30]:
np.mean(filtered)

5.5
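
As a cross-check of the hand-computed quartiles, NumPy's `np.percentile` can be used. This is only a sketch, assuming that NumPy's default linear interpolation is an acceptable quartile definition; other conventions can give slightly different values for Q1 and Q3. The names `values`, `q1`, `q3`, `lower`, `upper` and `inside` are illustrative.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100])

# first and third quartiles with NumPy's default (linear) interpolation
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# boolean mask selecting the observations strictly inside the two bounds
inside = (values > lower) & (values < upper)
mean_inside = values[inside].mean()
```

For these values the results agree with the ones found above: Q1 = 3.5, Q3 = 8.5, bounds of -4 and 16, and a filtered mean of 5.5.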

### `mean_without_outliers_boxplot`

In [0]:
def mean_without_outliers_boxplot(elements):
    ''' return the mean of elements without outliers using the boxplot definition'''
    elements = np.sort(elements)  # the quartile lookup below needs sorted data
    n = len(elements)
    if 0.25*n%1==0:
        Q1 = elements[int(0.25*n)-1]
    else :
        Q1 = (elements[round(0.25*n)]+elements[round(0.25*n)-1])/2
    if 0.75*n%1==0:
        Q3 = elements[int(0.75*n)-1]
    else :
        Q3 = (elements[round(0.75*n)]+elements[round(0.75*n)-1])/2
    IQR = Q3 - Q1
    l_bound = Q1 - 1.5*IQR
    u_bound = Q3 + 1.5*IQR
    # keep only the observations strictly between the two bounds
    return np.mean([elt for elt in elements if elt>l_bound and elt<u_bound])

## Comparisons

*Uncomment the following cell*

In [0]:
# data = {'method': ['mean', 'mean filtering by z-score', 'mean filtering by boxplot'], 
#         'result': [np.mean(sample),mean_without_outliers_z_score(sample), mean_without_outliers_boxplot(sample)]}
# comparison_df = pd.DataFrame(data = data)
# round(comparison_df,2)

👏 If you managed to finish the optional challenge, congrats!

💾 Do not forget to `git add/commit/push` your work!