# Statistics

In [77]:
import numpy as np

### Statistics is a field of study concerned with collecting and analyzing data. 
### A well-trained statistician can use the conclusions drawn from these analyses to help a business make better decisions.

# Two main types of *activities* in Statistics: 

### Descriptive statistics: 
> Encompasses the many tools used to *DESCRIBE* data 

### Inferential statistics: 
> Encompasses the many tools used to *INFER* from data.   
#### *Infer = Learn*

# Central Tendency: Mean vs. Median.

> The Central Tendency of a set refers to the general behavior of the set at its *"middle"*
##### Mean and Median are just 2 different definitions for this "middle" of the set.

# Mean
### The *AVERAGE* value of a set
<br>

### Given a set X with n items: 
> X = (X1, X2, X3, ... Xn)

### The MEAN can be generalized as: 
> MEAN = (1/n) * (X1 + X2 + X3 + ... + Xn)
<br>

### For Ex:
##### Find MEAN for X = [1,10,3,4,7].
> This is a set with n = 5 items; 
##### Therefore MEAN = (1/5) * (1 + 3 + 4 + 7 + 10)      = (1/5) * (25) = 25/5 = 5
#### <font color = "red">MEAN = 5 </font>

In [78]:
# Verify using Numpy:
a_set = 1,10,3,4,7
mean = np.mean(a_set)

print("mean:", mean)

mean: 5.0
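
The formula above can also be checked without any library. This is a minimal sketch in plain Python; the names `values` and `manual_mean` are only illustrative and do not come from the notebook.

```python
# Mean by the formula: (1/n) * (X1 + X2 + ... + Xn)
values = [1, 10, 3, 4, 7]

n = len(values)                 # number of items, n = 5
manual_mean = sum(values) / n   # 25 / 5

print("manual mean:", manual_mean)
```

This should agree with the value NumPy reports for the same set.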


#### The mean is sensitive to outliers: a single extreme value can pull it far from most of the data, which is one reason why the median is sometimes used instead of the mean.
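
To see this sensitivity in numbers, here is a small sketch that adds one made-up outlier (the value `1000` is an assumption chosen only for illustration) to the example set and compares both measures. It uses Python's standard `statistics` module rather than NumPy, purely to keep the example dependency-free.

```python
import statistics

values = [1, 10, 3, 4, 7]
with_outlier = values + [1000]   # hypothetical extreme value

mean_before = statistics.mean(values)
median_before = statistics.median(values)

mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

# The mean jumps from 5 to roughly 170.8, while the median only moves from 4 to 5.5.
print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

The median barely moves because it depends only on the position of the middle values, not on how large the extremes are.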
<br><br>

# Median
### The MIDDLE value of an ORDERED set.
<br>

### Given an ordered set X with n items: 
> X = (X1, X2, X3, ... Xn)

### The MEDIAN can be generalized as:

#### If n = ODD
> MEDIAN = X((n + 1)/2) 
#### In other words: The value at the middle position of the ordered set (counting positions from 1).

#### If n = EVEN
> MEDIAN = 1/2 * [X(n/2) + X((n/2) + 1)] 
#### In other words: Average of the two middle values of the ordered set.
<br><br>
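
The odd and even cases above can be sketched as a small function. This is an illustrative implementation, not the one NumPy uses; the name `median_by_formula` is hypothetical. Note that Python lists are indexed from 0, so the 1-based position (n + 1)/2 becomes index `n // 2`.

```python
def median_by_formula(values):
    """Return the median of a non-empty collection of numbers."""
    ordered = sorted(values)   # the median is defined on the ORDERED set
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty set is undefined")

    mid = n // 2
    if n % 2 == 1:
        # n is odd: single middle value (1-based position (n + 1)/2)
        return ordered[mid]
    # n is even: average of the two middle values (1-based positions n/2 and n/2 + 1)
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_result = median_by_formula([1, 10, 3, 4, 7])   # ordered: 1, 3, 4, 7, 10
even_result = median_by_formula([1, 10, 3, 4])     # ordered: 1, 3, 4, 10

print("odd n: ", odd_result)
print("even n:", even_result)
```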

### For Ex: 
##### Find MEDIAN for X = [1,10,3,4,7].
> First order the set: X = [1, 3, 4, 7, 10]
##### n = 5 is odd, therefore MEDIAN = X((5 + 1)/2) = X(3) = 4
#### <font color = "red">MEDIAN = 4 </font>

In [79]:
# Verify using Numpy:
a_set = 1,10,3,4,7
median = np.median(a_set)

print("median:", median)

median: 4.0


##### In this case the very same set gave us two different values for its Central Tendency. 
#### <font color = "red">MEAN = 5 || MEDIAN = 4</font>

# Data Spread: Variance Vs. Standard Deviation.

> Variance and Standard Deviation are values that describe the "spread" of a dataset,
i.e. how far the individual points of a given set lie from its mean.

# Variance

### The VARIANCE uses the MEAN to calculate spread.

### Given a set X with n items and MEAN M (the order of the items does not matter here): 
> X = (X1, X2, X3, ... Xn) || mean = M

### The VARIANCE can be generalized as:
> VARIANCE = [Sum[(X - MEAN)^2]] / n 
#### where X stands for each element in set.

> Formula: $$\sigma^2 = \frac{\displaystyle\sum_{i=1}^{n}(x_i - M)^2} {n}$$

### For Ex:
##### Find VARIANCE for X = [1,10,3,4,7].
> Take each difference (Xi - MEAN). Square it. Then average the result:
#### σ² = [ (1-5)^2 + (10-5)^2 + (3-5)^2 + (4-5)^2 + (7-5)^2 ] / 5
#### σ² = [ (-4)^2 + (5)^2 + (-2)^2 + (-1)^2 + (2)^2 ] / 5
#### σ² = [ 16 + 25 + 4 + 1 + 4 ] / 5
#### σ² = [ 50 ] / 5
#### σ² = 10.0

In [80]:
# Verify using Numpy:
a_set = 1,10,3,4,7
variance = np.var(a_set)

print("variance:", variance)

variance: 10.0
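
The worked example can be reproduced step by step in plain Python. The sketch below also computes the sample variance, which divides by n - 1 instead of n; that second quantity is not part of the text above and is shown only for contrast. `np.var` divides by n by default (`ddof=0`), which is why it matches the formula here; passing `ddof=1` would give the sample version instead.

```python
values = [1, 10, 3, 4, 7]
n = len(values)
mean = sum(values) / n                            # 5.0

# Squared distance of each element from the mean: 16, 25, 4, 1, 4
squared_diffs = [(x - mean) ** 2 for x in values]

population_variance = sum(squared_diffs) / n      # divide by n, as in the formula above
sample_variance = sum(squared_diffs) / (n - 1)    # divide by n - 1 (shown for contrast)

print("population variance:", population_variance)
print("sample variance:    ", sample_variance)
```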


# Standard Deviation

### The STD DEVIATION is the square root of the VARIANCE, so it measures spread in the same units as the data.

### Given a set X with n items and MEAN M: 
> X = (X1, X2, X3, ... Xn) || mean = M

### The STD DEVIATION can be generalized as:

> Formula: Notice it is simply the square root of the variance equation.
#### $$SD = \sqrt{\frac{\displaystyle\sum_{i=1}^{n}(x_i - M)^2} {n}}$$

### For Ex:
##### Find STANDARD DEVIATION for X = [1,10,3,4,7].
> Find The Variance: SEE ABOVE.
#### SD = Square Root(Variance)
#### SD = Square Root(10.0)
#### SD = 3.162277...

In [81]:
# Verify using Numpy:
a_set = 1,10,3,4,7
std_dev = np.std(a_set)

print("Standard Dev:", std_dev)

Standard Dev: 3.16227766017
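
One way to see why the square root matters: if every value is multiplied by 10, the variance grows by a factor of about 100, while the standard deviation grows by a factor of about 10, just like the data. The sketch below checks this with the standard `statistics` module (`pvariance` and `pstdev` are its population versions, dividing by n); the rescaled list `scaled` is a made-up illustration.

```python
import statistics

values = [1, 10, 3, 4, 7]
scaled = [10 * x for x in values]   # hypothetical rescaling, e.g. a change of units

# Ratio of spread after rescaling to spread before rescaling
var_ratio = statistics.pvariance(scaled) / statistics.pvariance(values)
sd_ratio = statistics.pstdev(scaled) / statistics.pstdev(values)

print("variance ratio:", var_ratio)
print("std dev ratio: ", sd_ratio)
```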


# Pulling It All Together!

In [82]:
# X is a Python List
X = [37.89, 53.18, 27.31, 39.33, 44.64, 53.79, 11.11, 22.12, 19.55]

# Sort the list in place (X itself is modified) and print it.
X.sort()
print(X)

[11.11, 19.55, 22.12, 27.31, 37.89, 39.33, 44.64, 53.18, 53.79]


In [83]:
# Using NumPy's built-in functions to Find Mean, Median, Variance and Standard Deviation
mean = np.mean(X)
median = np.median(X)
variance = np.var(X)
sd = np.std(X)

In [84]:
# Printing the values
print("Mean:", mean)
print("Median:", median)
print("Variance:", variance)
print("Standard Deviation:", sd) 

Mean: 34.3244444444
Median: 37.89
Variance: 203.773869136
Standard Deviation: 14.2749384985
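
As an independent cross-check of the numbers above, the same four quantities can be computed with Python's standard `statistics` module. This is only a sketch for comparison; the variable names ending in `_check` are illustrative. `pvariance` and `pstdev` are used because they divide by n, matching the formulas in this notebook.

```python
import statistics

X = [37.89, 53.18, 27.31, 39.33, 44.64, 53.79, 11.11, 22.12, 19.55]

mean_check = statistics.mean(X)
median_check = statistics.median(X)        # sorts internally, X need not be ordered
variance_check = statistics.pvariance(X)   # population variance (divide by n)
sd_check = statistics.pstdev(X)            # population standard deviation

print("Mean:", mean_check)
print("Median:", median_check)
print("Variance:", variance_check)
print("Standard Deviation:", sd_check)
```

The results should agree with the NumPy output up to floating-point rounding.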
