In [26]:
# dependencies
import numpy as np
import pandas as pd
import statistics
from scipy.stats import trimboth, trim_mean

## Techniques
A handful of descriptive statistics come up again and again in classes and articles. Each one summarizes a dataset with a single number or a small set of numbers. The most common are:

- count
- mean 
- standard deviation
- median
- mode
- range
- percentiles
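
As a quick illustration of the list above, here is a minimal sketch that computes each statistic for a small sample. The `values` list and the `summary` dictionary are hypothetical, chosen only so the results are easy to check by hand.

```python
import statistics
import numpy as np

# Hypothetical sample, used only for illustration.
values = [2, 4, 4, 5, 10]

summary = {
    "count": len(values),
    "mean": statistics.mean(values),
    "std": statistics.stdev(values),        # sample standard deviation (divides by n - 1)
    "median": statistics.median(values),
    "mode": statistics.mode(values),
    "range": max(values) - min(values),
    "p25": np.percentile(values, 25),       # 25th percentile, linear interpolation
    "p75": np.percentile(values, 75),       # 75th percentile, linear interpolation
}

for name, value in summary.items():
    print(f"{name}:\t{value}")
```

Note that `statistics.stdev` is the sample standard deviation, while `np.std` defaults to the population version (`ddof=0`), so the two can disagree on the same data.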

In [30]:
prices = [198300, 2385000, 658200, np.nan, 658200]
indices = ['house_a', 'house_b', 'house_c', 'house_d', 'house_e']

# build the frame from the prices list defined above
prices_df = pd.DataFrame(prices, index=indices, columns=['price'])
prices_df

Unnamed: 0,price
house_a,198300.0
house_b,2385000.0
house_c,658200.0
house_d,
house_e,658200.0


### `nan` handling

#### Different libraries, different defaults
Different libraries' implementations of the same statistic may handle missing data in different ways. Some propagate `nan` into the result, so a single missing value makes the whole statistic `nan`, while others skip `nan` and compute over the remaining values.

In [28]:
# `mode` on its own is not imported above; call it through the statistics module
print(statistics.mode(prices))

658200


In [33]:
print(np.median(prices))
print(np.mean(prices))
print(np.std(prices))

nan
nan
nan


In [32]:
print(statistics.mode(prices))
print(statistics.median(prices))
print(statistics.mean(prices))
print(statistics.stdev(prices))

658200
658200
nan
nan
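
When `nan` propagates as it does in the cells above, one option is NumPy's nan-aware variants, which ignore missing values. The sketch below assumes the same `prices` list and compares the results with a pandas `Series`, which skips `nan` by default. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

prices = [198300, 2385000, 658200, np.nan, 658200]

# nan-aware NumPy functions compute over the non-missing values only.
nan_mean = np.nanmean(prices)
nan_median = np.nanmedian(prices)
nan_std = np.nanstd(prices, ddof=1)  # ddof=1: sample std, the pandas default

# pandas reductions skip nan unless skipna=False is passed.
series = pd.Series(prices)
print('nanmean:\t', nan_mean, '\tpandas mean:\t', series.mean())
print('nanmedian:\t', nan_median, '\tpandas median:\t', series.median())
print('nanstd:\t\t', nan_std, '\tpandas std:\t', series.std())
print('skipna=False:\t', series.mean(skipna=False))
```

Passing `ddof=1` matters for the comparison: without it, `np.nanstd` returns the population standard deviation and will not match the pandas value.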


In [20]:
prices_df.describe()

Unnamed: 0,price
count,3.0
mean,1080500.0
std,1152895.134
min,198300.0
25%,428250.0
50%,658200.0
75%,1521600.0
max,2385000.0


In [4]:
prices_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, house_a to house_c
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   price   3 non-null      int64
dtypes: int64(1)
memory usage: 48.0+ bytes


### Similar or relevant techniques

In some cases, our data may contain a small number of outliers. Depending on how realistic those outliers are in context, it can be useful to replace a basic statistic such as the mean with a modified version when describing a dataset.

When the outliers are plausible values but would dominate a plain mean, a trimmed mean cuts a specified proportion off each tail of the data and then averages what remains.
- `trim_mean(arr, 0.01)`: orders the contents of `arr`, slices off the lowest and highest 1% of the values, and returns the mean of the rest
- `clip()`: instead of dropping extreme values, caps them at a chosen lower and upper bound (available as `np.clip` and as the `clip` method on pandas objects)
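
The two approaches in the list above can be sketched as follows. `manual_trim_mean` is a hypothetical helper written for illustration, not SciPy's implementation, and the clip bounds are arbitrary values picked from the sample.

```python
import numpy as np
from scipy.stats import trim_mean

def manual_trim_mean(values, proportion):
    """Sort, drop int(proportion * n) items from each end, average the rest."""
    ordered = np.sort(np.asarray(values, dtype=float))
    cut = int(proportion * ordered.size)      # items removed per tail
    return ordered[cut:ordered.size - cut].mean()

sample = [1, 25, 156, 78, 465, 12312, 98, 5651, 75615]

manual = manual_trim_mean(sample, 0.2)
library = trim_mean(sample, 0.2)
print('manual trimmed mean:\t', manual)
print('scipy trim_mean:\t', library)

# Clipping keeps every observation but caps the extremes at the given bounds.
clipped = np.clip(sample, 25, 12312)
print('clipped:\t', clipped)
print('clipped mean:\t', clipped.mean())
```

Trimming changes how many observations contribute to the mean, while clipping keeps the count the same and only limits how far any single value can pull the result.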

In [5]:
data = [1, 25, 156, 78, 465, 12312, 98, 5651, 75615]

print('dataset:', data)
print('mean:\t', np.mean(data))

dataset: [1, 25, 156, 78, 465, 12312, 98, 5651, 75615]
mean:	 10489.0


In [6]:
trimmed_data = trimboth(data, 0.05)

print('new dataset', trimmed_data)
print('new mean:\t', np.mean(trimmed_data))

new dataset [    1    98    25    78   156   465  5651 12312 75615]
new mean:	 10489.0
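
The mean did not change because nothing was actually trimmed: `trimboth` removes `int(proportion * n)` items from each end, and with 9 values and a proportion of 0.05 that rounds down to 0. A small sketch, using the same `data`, makes the cutoff visible. The names `kept_small` and `kept_large` are illustrative.

```python
import numpy as np
from scipy.stats import trimboth

data = [1, 25, 156, 78, 465, 12312, 98, 5651, 75615]
n = len(data)

# Number of items cut from each tail for two proportions.
for proportion in (0.05, 0.2):
    print(proportion, '->', int(proportion * n), 'item(s) cut per tail')

kept_small = trimboth(data, 0.05)   # int(0.45) == 0, so all 9 values remain
kept_large = trimboth(data, 0.2)    # int(1.8) == 1, so the min and max are dropped

print('kept at 0.05:', kept_small.size, 'mean:', np.mean(kept_small))
print('kept at 0.2: ', kept_large.size, 'mean:', np.mean(kept_large))
```

On small datasets the proportion has to be at least `1 / n` before anything is removed, which is why 0.2 has an effect here and 0.05 does not.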


In [7]:
trim_mean(data, .2)

2683.5714285714284