# Estimates of location

Estimates of location are useful for describing the overall behaviour of a variable.

"Get a typical value for each feature" is the main idea behind these estimates.

A single value that describes a feature is far easier to study than thousands of original inputs.

These estimates describe where most of the data is located, i.e. the central tendency.

In [16]:
import numpy as np
import pandas as pd
from scipy import stats

### Mean

The mean is the simplest of the location estimates.

It is very sensitive to outliers: a single extreme value can shift it considerably.

In [43]:
def mean(numbers):
    return sum(numbers)/len(numbers)

Suppose we have the following incomes for the members of a family:

[1000, 5500, 4250, 3700]

Then the mean (average) income for the family would be:

In [44]:
mean([1000, 3700, 4250, 5500])

3612.5

But if one member of this family makes much more money than the rest, the average income rises and no longer gives a truthful overall description of the family's income:

In [23]:
mean([1000, 3700, 4250, 5500, 100000])

22890.0

That's why the mean is not robust to outliers.

### Trimmed mean

Taking the same example from the simple mean, we can apply the trimmed mean.

The trimmed mean removes a given percentage of values from both tails of a variable before averaging, which makes it more robust to outliers than the simple mean.

In [25]:
round(stats.trim_mean([1000, 5500, 4250, 3700, 100000], 0.2),1)

4483.3
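To make the trimming step explicit, here is a minimal hand-written sketch. The helper name `trimmed_mean` is hypothetical, and the sketch assumes the cut is rounded down to a whole number of values per tail and that `proportion` is below 0.5. With 5 values and a proportion of 0.2, one value is dropped from each end.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each tail.

    Hypothetical helper for illustration; assumes proportion < 0.5.
    """
    ordered = sorted(values)
    # Number of values to drop from each end (rounded down).
    k = int(proportion * len(ordered))
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


incomes = [1000, 5500, 4250, 3700, 100000]
trimmed = trimmed_mean(incomes, 0.2)
print(round(trimmed, 1))  # 4483.3
```

Here the smallest (1000) and largest (100000) incomes are discarded, and the mean is taken over the remaining three values.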

### Weighted mean

The weighted mean is useful in cases like:
- there is an imbalance between classes (one class has more samples than the others)
- some features vary more or are less accurate than others (e.g. if one sensor is less accurate than the others, give it a smaller weight)

In [38]:
def weig_mean(values, weights):
    # Multiply each value by its weight, then divide by the total weight
    return sum(v*w for v,w in zip(values, weights))/sum(weights)

In [39]:
sensor_values = [10, 11, 79, 11, 11, 48]
weights = [2, 2, 0.5, 2, 2, 0.5] # less accurate sensor

weig_mean(sensor_values, weights)

16.61111111111111
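As a cross-check, NumPy's `np.average` accepts a `weights` argument and should give the same result. The sketch below reuses the sensor readings and weights from above and also computes the unweighted mean for comparison. The variable names `weighted` and `plain` are only illustrative.

```python
import numpy as np

sensor_values = [10, 11, 79, 11, 11, 48]
weights = [2, 2, 0.5, 2, 2, 0.5]  # smaller weight for the less trusted readings

# Weighted mean: sum(value * weight) / sum(weights)
weighted = np.average(sensor_values, weights=weights)

# Unweighted mean, where every reading counts equally
plain = np.mean(sensor_values)

print(weighted, plain)
```

Down-weighting the readings 79 and 48 pulls the estimate toward the values reported by the other readings, while the unweighted mean is noticeably higher.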

### Median

The median is the middle value of the sorted data: half of the values lie above it and the other half below.

In [40]:
def median(numbers):
    size = len(numbers)
    numbers = sorted(numbers)
    middle = size // 2
    # Even size: average the two middle values; odd size: take the middle one
    return (numbers[middle-1] + numbers[middle])/2 if size%2 == 0 else numbers[middle]
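To see how the median behaves on the family income example, the sketch below uses `statistics.median` and `statistics.mean` from the Python standard library (rather than the function defined above) so that it runs on its own. The variable names are illustrative.

```python
import statistics

incomes = [1000, 3700, 4250, 5500]
incomes_with_outlier = incomes + [100000]

# Even number of values: the median is the average of the two middle ones
median_before = statistics.median(incomes)
print(median_before)  # 3975.0

# Odd number of values: the median is the middle one
median_after = statistics.median(incomes_with_outlier)
print(median_after)  # 4250

# The mean, by contrast, is pulled strongly toward the outlier
mean_after = statistics.mean(incomes_with_outlier)
print(mean_after)
```

Adding the 100000 income moves the median only from 3975.0 to 4250, whereas the mean jumps from 3612.5 to 22890.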