<img src="../data/images/statistics.jpg" style="width: 700px" />

In [2]:
# let's import packages that we need

import math
import statistics
import numpy as np
import scipy.stats
import pandas as pd

In [3]:
"""
1. We create lists x and x_with_nan. 
2. They’re almost the same, with the difference that x_with_nan contains a *nan* value. 
3. It’s important to understand how the Python statistics routines behave when they come across a not-a-number (nan) value.
4. In data science, missing values are common and we often replace them with nan.
"""

x = [8.0, 1, 2.5, 4, 28.0]

x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]

<div class="alert alert-block alert-info">
<b>How to get a nan value? You can use any of these expressions interchangeably ¯\_(ツ)_/¯:</b>
</div>


- float('nan')
- math.nan
- np.nan
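One consequence worth seeing in code: nan compares unequal to everything, including itself, so an equality test cannot detect it. A minimal sketch using only the standard library (the variable name `a` is illustrative):

```python
import math

a = float('nan')

# nan never compares equal, not even to itself
print(a == a)                # False
print(math.nan == math.nan)  # False

# use the dedicated predicate instead
print(math.isnan(a))         # True
print(math.isnan(math.nan))  # True
```

NumPy offers the analogous `np.isnan()` for arrays.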

In [4]:
# create np.ndarray and pd.Series objects that correspond to x and x_with_nan:
y, y_with_nan = np.array(x), np.array(x_with_nan)
z, z_with_nan = pd.Series(x), pd.Series(x_with_nan)

print(y)
print(y_with_nan)

print(z)
print(z_with_nan)


[ 8.   1.   2.5  4.  28. ]
[ 8.   1.   2.5  nan  4.  28. ]
0     8.0
1     1.0
2     2.5
3     4.0
4    28.0
dtype: float64
0     8.0
1     1.0
2     2.5
3     NaN
4     4.0
5    28.0
dtype: float64


## MEAN

- The sample mean is also called the sample arithmetic mean or simply the average.
- It is the arithmetic average of all the items in a dataset. 
- The mean of a dataset 𝑥 is mathematically expressed as Σᵢ𝑥ᵢ/𝑛, where 𝑖 = 1, 2, …, 𝑛. 
- In other words, it’s the sum of all the elements 𝑥ᵢ divided by the number of items in the dataset 𝑥.

In [5]:
# calculate the mean with pure Python using sum() and len(), WITHOUT importing libraries:
mean_ = sum(x) / len(x)
mean_

8.7

In [6]:
# alternatively, apply built-in Python statistics functions:

mean_ = statistics.mean(x)
mean_


8.7

In [7]:
# fmean() was introduced in Python 3.8 as a faster alternative to mean(). It always returns a floating-point number:
mean_ = statistics.fmean(x)
mean_

8.7

In [7]:
# If there are nan values among our data, then statistics.mean() and statistics.fmean() will return nan:

mean_ = statistics.mean(x_with_nan)
print(mean_)


mean_ = statistics.fmean(x_with_nan)
print(mean_)


nan
nan


In [8]:
# If you use NumPy, then you can get the mean with np.mean():

mean_ = np.mean(y)
mean_

8.7

In [9]:
# Above we called np.mean() as a function; because y is a NumPy array,
# we can call the corresponding method .mean() as well:
mean_ = y.mean()
mean_

8.7

In [10]:
# Often we don’t want a nan value as the result.
# If you prefer to ignore nan values, then you can use np.nanmean()
np.nanmean(y_with_nan)

8.7
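Pandas takes the opposite default from NumPy here: `Series.mean()` skips nan values unless told otherwise. A short sketch, assuming the same data as `x_with_nan` above (the names `mean_skip` and `mean_keep` are illustrative):

```python
import math
import pandas as pd

z_with_nan = pd.Series([8.0, 1, 2.5, math.nan, 4, 28.0])

# skipna=True is the default, so nan is ignored
mean_skip = z_with_nan.mean()

# skipna=False makes nan propagate, as np.mean() does
mean_keep = z_with_nan.mean(skipna=False)

print(mean_skip)
print(mean_keep)
```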

## WEIGHTED MEAN

<div class="alert alert-block alert-info">
<b>The weighted mean, also called the weighted arithmetic mean or weighted average, is a generalization of the arithmetic mean that enables you to define the relative contribution of each data point to the result.</b>
</div>


- Define one weight 𝑤ᵢ for each data point 𝑥ᵢ of the dataset 𝑥, where 𝑖 = 1, 2, …, 𝑛 and 𝑛 is the number of items in 𝑥.
- Multiply each data point by the corresponding weight, sum all the products, and divide the obtained sum by the sum of weights: Σᵢ(𝑤ᵢ𝑥ᵢ) / Σᵢ𝑤ᵢ
- The weighted mean is very handy when you need the mean of a dataset containing items that occur with given relative frequencies.

#### EXAMPLE

You have a set in which: 
- 20% of all items are equal to 2, 
- 50% of the items are equal to 4, 
- 30% of the items are equal to 8.

In [11]:
# calculate the mean of a set like this:

0.2 * 2 + 0.5 * 4 + 0.3 * 8

4.8
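To see why this works, here is a small check, assuming a hypothetical 10-item dataset built with exactly those proportions. Its plain mean should match the weighted calculation:

```python
import statistics

# hypothetical dataset: 20% twos, 50% fours, 30% eights
data = [2] * 2 + [4] * 5 + [8] * 3

plain_mean = statistics.fmean(data)
weighted = 0.2 * 2 + 0.5 * 4 + 0.3 * 8

print(plain_mean)
print(weighted)
```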

In [13]:
# WITHOUT IMPORTING LIBRARIES
# Implement the weighted mean in pure Python by combining sum() with either range() or zip() 

x = [8.0, 1, 2.5, 4, 28.0]
w = [0.1, 0.2, 0.3, 0.25, 0.15]

wmean = sum(w[i] * x[i] for i in range(len(x))) / sum(w)
print(wmean)

wmean = sum(x_ * w_ for (x_, w_) in zip(x, w)) / sum(w)
print(wmean)

6.95
6.95


In [14]:
# USE NumPy for large datasets
# use np.average() to get the weighted mean of NumPy arrays or Pandas Series:

y, z, w = np.array(x), pd.Series(x), np.array(w)
wmean = np.average(y, weights=w)
print(wmean)


wmean = np.average(z, weights=w)
print(wmean)


6.95
6.95


## HARMONIC MEAN

- Harmonic mean is the reciprocal of the mean of the reciprocals of all items in the dataset: 
- 𝑛 / Σᵢ(1/𝑥ᵢ), where 𝑖 = 1, 2, …, 𝑛 and 𝑛 is the number of items in the dataset 𝑥. 
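A typical use of the harmonic mean is averaging rates. The sketch below assumes a hypothetical round trip driven at 40 km/h one way and 60 km/h back over the same distance. The true average speed (total distance over total time) equals the harmonic mean of the two speeds, not their arithmetic mean of 50:

```python
import statistics

distance = 120     # km each way (hypothetical value)
speeds = [40, 60]  # km/h for each leg

# total time is the sum of the time spent on each leg
total_time = sum(distance / s for s in speeds)
average_speed = 2 * distance / total_time

hm = statistics.harmonic_mean(speeds)

print(average_speed)
print(hm)
```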

<img src="../data/images/harmonic.jpg" style="width: 700px" />

<img src="../data/images/harmonicWHY.jpg" style="width: 700px" />

<img src="../data/images/harmonicWHY2.jpg" style="width: 700px" />

In [17]:
# WITHOUT IMPORTING LIBRARIES

hmean = len(x) / sum(1 / item for item in x)
hmean

# NB: It’s quite different from the value of the arithmetic mean for the same data x, which we calculated to be 8.7 :)

2.7613412228796843

In [18]:
# using Statistics library

hmean = statistics.harmonic_mean(x)
print(hmean)

2.7613412228796843


- If the dataset contains a nan value, then `statistics.harmonic_mean()` returns nan. 
- If there’s at least one 0, then it returns 0. 
- If you provide at least one negative number, then you’ll get statistics.StatisticsError

In [19]:
statistics.harmonic_mean(x_with_nan)


nan

In [20]:
statistics.harmonic_mean([1, 0, 2])

0

In [21]:
statistics.harmonic_mean([1, 2, -2])

StatisticsError: harmonic mean does not support negative values

## GEOMETRIC MEAN

- The geometric mean is the 𝑛-th root of the product of all 𝑛 elements 𝑥ᵢ in a dataset 𝑥: 
- ⁿ√(Πᵢ𝑥ᵢ), where 𝑖 = 1, 2, …, 𝑛.
- In other words: Geometric Mean is a special type of average where we multiply the numbers together and then take a square root (for two numbers), cube root (for three numbers) etc.

<img src="../data/images/geometric.jpg" style="width: 700px" />

<img src="../data/images/geometricWHY.jpg" style="width: 700px" />

In [15]:
# pure Python

gmean = 1
for item in x:
    gmean *= item
gmean **= 1 / len(x)
gmean

# geometric mean, in this case, differs significantly from the values of 
# the arithmetic (8.7) 
# harmonic (2.76) 
# means for the same dataset x.

4.677885674856041

In [16]:
gmean = statistics.geometric_mean(x)
gmean

4.67788567485604

In [17]:
# will return NAN
gmean = statistics.geometric_mean(x_with_nan)
gmean

# If there’s a zero or negative number among your data, 
# then statistics.geometric_mean() will raise the statistics.StatisticsError. 

nan
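For long datasets, the running product in the pure Python version can overflow or underflow. A common workaround, sketched here under the assumption that all values are positive, is to average the logarithms and exponentiate the result (`math.prod()` needs Python 3.8+):

```python
import math

x = [8.0, 1, 2.5, 4, 28.0]

# n-th root of the product, as in the definition
gmean_prod = math.prod(x) ** (1 / len(x))

# equivalent form: exp of the mean of the logs
gmean_log = math.exp(sum(math.log(item) for item in x) / len(x))

print(gmean_prod)
print(gmean_log)
```

Both values should agree up to floating-point rounding.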

## MEDIAN

- Median is the middle element of a sorted dataset. 
- The dataset can be sorted in increasing or decreasing order. 
- If the number of elements 𝑛 of the dataset is odd, then the median is the value at the middle position: 0.5(𝑛 + 1). 
- If 𝑛 is even, then the median is the arithmetic mean of the two values in the middle, that is, the items at the positions 0.5𝑛 and 0.5𝑛 + 1.

<img src="../data/images/median.jpg" style="width: 700px" />

<div class="alert alert-block alert-warning">
<b>IMPORTANT:</b> The main difference between the behavior of the mean and median is related to dataset outliers or extremes. The mean is heavily affected by outliers, but the median only depends on outliers either slightly or not at all. 
</div>

<img src="../data/images/mean-median.png" style="width: 700px" />
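The effect of an outlier can be checked directly. In this sketch, the largest value of a small dataset is replaced by a hypothetical extreme value; the mean changes drastically while the median stays put:

```python
import statistics

data = [1, 2.5, 4, 8.0, 28.0]
data_outlier = [1, 2.5, 4, 8.0, 2800.0]  # last value replaced by an extreme one

mean_before = statistics.fmean(data)
mean_after = statistics.fmean(data_outlier)

median_before = statistics.median(data)
median_after = statistics.median(data_outlier)

print(mean_before, mean_after)
print(median_before, median_after)
```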

In [18]:
# MEDIAN
"""
The two most important steps of this implementation are as follows:

1. Sorting the elements of the dataset
2. Finding the middle element(s) in the sorted dataset
"""

n = len(x)
if n % 2:
    median_ = sorted(x)[round(0.5*(n-1))]
else:
    x_ord, index = sorted(x), round(0.5 * n)
    median_ = 0.5 * (x_ord[index-1] + x_ord[index])
median_

4

In [19]:
"""
The sorted version of x is [1, 2.5, 4, 8.0, 28.0]

--> element in the middle is 4. 

The sorted version of x[:-1] is [1, 2.5, 4, 8.0]

--> there are two middle elements, 2.5 and 4. Their average is 3.25.
"""



median_ = statistics.median(x)
print(median_)

median_ = statistics.median(x[:-1])
print(median_)

4
3.25


`median_low()` and `median_high()` functions always return an element from the dataset:

- If the number of elements is odd, then there’s a single middle value, so these functions behave just like `median()`.
- If the number of elements is even, then there are two middle values. In this case, `median_low()` returns the lower and `median_high()` the higher middle value.

In [20]:
statistics.median_low(x[:-1])


2.5

In [21]:
statistics.median_high(x[:-1])

4
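NumPy follows the same pattern for the median as it does for the mean. A short sketch with the same data as above: `np.median()` propagates nan, while `np.nanmedian()` ignores it.

```python
import numpy as np

y = np.array([8.0, 1, 2.5, 4, 28.0])
y_with_nan = np.array([8.0, 1, 2.5, np.nan, 4, 28.0])

median_plain = np.median(y)
median_nan = np.median(y_with_nan)         # nan propagates
median_skip = np.nanmedian(y_with_nan)     # nan is ignored

print(median_plain)
print(median_nan)
print(median_skip)
```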

## MODE

- Mode is the value in the dataset that occurs most __frequently__. 
- If there isn’t a single such value, then the set is multimodal since it has multiple modal values. 
- For example, in the set that contains the points `2, 3, 2, 8, and 12`, the number 2 is the mode because it occurs twice, unlike the other items that occur only once.

In [22]:
u = [2, 3, 2, 8, 12]
mode_ = max((u.count(item), item) for item in set(u))[1]
mode_

2

#### mode()

- Returns a single value
- If there’s more than one modal value, then `mode()` returns the first one encountered (since Python 3.8; earlier versions raised StatisticsError)

#### multimode()

- returns a list that contains the result
-  if there is more than one modal value, `multimode()` returns the list with all modes.


__Both can handle NAN__

In [23]:
mode_ = statistics.mode(u)
print(mode_)

mode_ = statistics.multimode(u)
print(mode_)

2
[2]


In [24]:
v = [12, 15, 12, 15, 21, 15, 12]

# statistics.mode(v)  # --> returns 12, the first mode encountered (raised StatisticsError before Python 3.8)

statistics.multimode(v)

[12, 15]
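To see what happens behind a multimodal lookup, here is a minimal sketch built on `collections.Counter`. It is not the library's actual implementation, just one way to get the same list of modes:

```python
from collections import Counter

v = [12, 15, 12, 15, 21, 15, 12]

# count occurrences; Counter keeps first-seen order of the values
counts = Counter(v)
top = max(counts.values())

# keep every value whose count equals the highest count
modes = [value for value, count in counts.items() if count == top]
print(modes)
```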

In [25]:
#Pandas example

u, v, w = pd.Series(u), pd.Series(v), pd.Series([2, 2, math.nan])

print(u.mode())
print(v.mode())
print(w.mode())

0    2
dtype: int64
0    12
1    15
dtype: int64
0    2.0
dtype: float64


## VARIANCE

- The sample variance quantifies the spread of the data. 
- It shows numerically how far the data points are from the mean. 
- You can express the sample variance of the dataset 𝑥 with 𝑛 elements mathematically as 𝑠² = Σᵢ(𝑥ᵢ − mean(𝑥))² / (𝑛 − 1), where 𝑖 = 1, 2, …, 𝑛 and mean(𝑥) is the sample mean of 𝑥. 


<div class="alert alert-block alert-info">
<b>Variance is --> the average of the squared differences from the Mean. For the sample variance, the sum of squared differences is divided by 𝑛 − 1 rather than 𝑛.</b> 
</div>


__To calculate the variance follow these steps__:

- Work out the Mean (the simple average of the numbers)
- Then for each number: subtract the Mean and square the result (the squared difference).
- Then work out the average of those squared differences (divide their sum by 𝑛 for the population variance, or by 𝑛 − 1 for the sample variance).
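The steps above can be sketched in pure Python. The example computes both the population variance (divide by 𝑛) and the sample variance (divide by 𝑛 − 1) and compares them with the `statistics` functions; the variable names are illustrative:

```python
import statistics

x = [8.0, 1, 2.5, 4, 28.0]
n = len(x)

# step 1: the mean
mean_ = sum(x) / n

# step 2: squared difference of each item from the mean
squared_diffs = [(item - mean_) ** 2 for item in x]

# step 3: average the squared differences
pop_var = sum(squared_diffs) / n           # population variance
sample_var = sum(squared_diffs) / (n - 1)  # sample variance

print(pop_var, statistics.pvariance(x))
print(sample_var, statistics.variance(x))
```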

<img src="../data/images/dogs.jpg" style="width: 700px" />

<img src="../data/images/variance-example.jpg" style="width: 700px" />

In [37]:
var_ = statistics.variance(x)
var_

123.2

In [38]:
statistics.variance(x_with_nan)

nan

In [40]:
#NumPy

# NOTE: ddof=1. 
# That’s how you set the delta degrees of freedom to 1. 
# This parameter allows the proper calculation of 𝑠², with (𝑛 − 1) in the denominator instead of 𝑛.

var_ = np.var(y, ddof=1)
print(var_)

# method 
var_ = y.var(ddof=1)
print(var_)


123.19999999999999
123.19999999999999


In [41]:
np.var(y_with_nan, ddof=1)

nan

In [42]:
# skip nan values

np.nanvar(y_with_nan, ddof=1)

123.19999999999999

In [43]:
# Pandas skips NAN by default

z_with_nan.var(ddof=1)

123.19999999999999

## STANDARD DEVIATION

- The sample standard deviation is another measure of data spread. 
- It’s connected to the sample variance, as standard deviation, 𝑠, is the positive square root of the sample variance. 
- The standard deviation is often more convenient than the variance because it has the same unit as the data points.

### Remember our previous example about dogs?

<div class="alert alert-block alert-info">
The Standard Deviation is a measure of how spread out numbers are.
The formula is easy: it is the square root of the Variance.
</div>
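That relationship is easy to verify. A minimal sketch, taking the square root of the sample variance and comparing it with `statistics.stdev()`:

```python
import math
import statistics

x = [8.0, 1, 2.5, 4, 28.0]

# standard deviation as the square root of the sample variance
std_manual = math.sqrt(statistics.variance(x))
std_library = statistics.stdev(x)

print(std_manual)
print(std_library)
```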

<img src="../data/images/st-dev.jpg" style="width: 700px" />

In [46]:
std_ = statistics.stdev(x)
std_

11.099549540409287

In [31]:
# try out these examples one by one

np.std(y, ddof=1)
# y.std(ddof=1)
# np.std(y_with_nan, ddof=1)
# y_with_nan.std(ddof=1)
# np.nanstd(y_with_nan, ddof=1)

11.099549540409285

## PERCENTILES

- Percentile: the value below which a given percentage of the data falls.
- Each dataset has three __quartiles__, which are the percentiles that divide the dataset into four parts. 


<img src="../data/images/percentile.jpg" style="width: 700px" />

__The first quartile__ 
- is the sample 25th percentile. 
- It divides roughly 25% of the smallest items from the rest of the dataset.

__The second quartile__ 
- is the sample 50th percentile or the median. 
- Approximately 25% of the items lie between the first and second quartiles and another 25% between the second and third quartiles.

__The third quartile__ 
- is the sample 75th percentile. 
- It divides roughly 25% of the largest items from the rest of the dataset.

In [34]:
"""
In the result, 8.0 is the median of x, while 0.1 and 21.0 are the sample 25th and 75th percentiles, respectively. 
The parameter n defines the number of equal-probability intervals, so n - 1 cut points are returned (n=4 gives the three quartiles), 
and method determines how to calculate them.
"""

x = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]
statistics.quantiles(x, n=4, method='inclusive')

[0.1, 8.0, 21.0]

In [36]:
y = np.array(x)

# find 5th percentile
np.percentile(y, 5)

# find 95th percentile (only this last expression is displayed as the cell output)
np.percentile(y, 95)

34.919999999999995

In [37]:
# to ignore nan values, use np.nanpercentile() 

np.nanpercentile(y_with_nan, [25, 50, 75])

array([2.5, 4. , 8. ])
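Pandas offers the same calculation through `Series.quantile()`, which takes fractions between 0 and 1 instead of percentages and skips nan values. A short sketch with the same data as `y_with_nan`:

```python
import math
import pandas as pd

z_with_nan = pd.Series([8.0, 1, 2.5, math.nan, 4, 28.0])

# quartiles as fractions; nan is ignored
quartiles = z_with_nan.quantile([0.25, 0.5, 0.75])
print(quartiles)
```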

## Correlation Between Pairs of Data

__Measures of correlation between pairs of data:__

- Positive correlation exists when larger values of 𝑥 correspond to larger values of 𝑦 and vice versa.
- Negative correlation exists when larger values of 𝑥 correspond to smaller values of 𝑦 and vice versa.
- Weak or no correlation exists if there is no such apparent relationship.

<img src="../data/images/correlation.png" style="width: 700px" />

In [42]:
# setup

x = list(range(-10, 11))
y = [0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14]

x_, y_ = np.array(x), np.array(y)
x__, y__ = pd.Series(x_), pd.Series(y_)

#### CORRELATION COEFFICIENT

- The correlation coefficient, or Pearson product-moment correlation coefficient, is denoted by the symbol 𝑟. 
- The coefficient is another measure of the correlation between data. 


1. The value 𝑟 > 0 indicates positive correlation.
1. The value 𝑟 < 0 indicates negative correlation.
1. The value 𝑟 = 1 is the maximum possible value of 𝑟. It corresponds to a perfect positive linear relationship between variables.
1. The value 𝑟 = −1 is the minimum possible value of 𝑟. It corresponds to a perfect negative linear relationship between variables.
1. The value 𝑟 ≈ 0, or when 𝑟 is around zero, means that the correlation between variables is weak.
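For reference, 𝑟 can be computed by hand from its standard definition: the sum of products of deviations from the means, divided by the square root of the product of the two sums of squared deviations. The sketch below applies that definition to the `x` and `y` lists from the setup cell; the helper names are illustrative:

```python
import math

x = list(range(-10, 11))
y = [0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# sum of products of deviations from the means
sp_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

# sums of squared deviations for each variable
ss_x = sum((a - mean_x) ** 2 for a in x)
ss_y = sum((b - mean_y) ** 2 for b in y)

r = sp_xy / math.sqrt(ss_x * ss_y)
print(r)
```

The result is positive and well above zero, which matches the upward trend in `y` as `x` grows.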

<img src="../data/images/ice-cream1.jpg" style="width: 700px" />

In [1]:
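# NumPy 

import numpy as np  # needed here because this cell ran before NumPy was imported

# np.corrcoef() returns the correlation matrix; the off-diagonal entries hold 𝑟
corr_matrix = np.corrcoef(np.array([14.2, 16.4, 11.9, 15.2, 18.5]), np.array([215, 325, 332, 445, 408]))
corr_matrix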

#### How do we interpret the result?

<img src="data/images/cc1.jpg" style="width: 500px" />


__The value 𝑟 > 0 indicates positive correlation__
- note that the value 0.83 is close to one, so it is a strong positive correlation:
- when temperatures rise --> sales of ice cream rise too! There is a correlation between the two variables. 