# Statistics and Probability
This section covers the "foundational" knowledge we need before going deeper into Business Analytics and Data Science. Statistics, Probability and Mathematics are vast areas in their own right, so we revisit only the concepts needed for the later sections on Descriptive, Predictive and Prescriptive Analytics.

Where appropriate, we will provide a forward reference from a concept to the later section where it is used. Conversely, the later sections will refer back to these core concepts.

An example:
 * We will revisit **Conditional Probability** and **Bayes' Theorem**. In the section on **Predictive Analytics** we will use Bayes' Theorem to develop a **Naive Bayes Classification Model**, which predicts the class of an outcome from a set of input variables.

Real-life applications of Naive Bayes Classification include:
* Classifying an email as **spam** based on the presence of a set of key words
* Classifying an insurance claim as valid or fraudulent based on some attributes of the claim (**fraud detection**)
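To make the spam example concrete, here is a minimal sketch of Bayes' Theorem for a single key word. The probabilities are made-up numbers chosen only for illustration, and the function name `posterior_spam` is hypothetical; a real Naive Bayes model combines many words and estimates these probabilities from data.

```python
def posterior_spam(p_spam, p_word_given_spam, p_word_given_ham):
    """P(spam | word) by Bayes' Theorem for one key word."""
    p_ham = 1 - p_spam
    # Total probability of seeing the word in any email
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / p_word

# Hypothetical inputs: 20% of emails are spam; the word appears in
# 60% of spam emails and in 5% of non-spam emails.
p = posterior_spam(p_spam=0.2, p_word_given_spam=0.6, p_word_given_ham=0.05)
print('P(spam | word) = ', round(p, 2))
```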


## Mean, Grand Mean (Mean-of-Means), introduction to ANOVA
The **Mean** of a collection of numbers is the average value of the collection of numbers. The **Mean** is the **sum** of all the numbers in the collection divided by the count of numbers of the collection.

When there are **multiple** collections of numbers and you want to check whether one of them differs **significantly** from the others, you can test that **hypothesis** with a technique called **Analysis of Variance**, or **ANOVA**.

**ANOVA** is a statistical process, widely used in **Descriptive Analytics**. One of the steps of **ANOVA** is to calculate the **Grand Mean or the Mean of Means** of multiple samples of data.

We will study **ANOVA** in detail in the **Descriptive Analytics** section.

A couple of simple Python functions to calculate the **Mean** and **Grand Mean** of a set of samples are shown below.


In [12]:
import math

def mean(x):
    return round((sum(x) / len(x)), 2)

def mean_of_means(x):
    list_of_mean = [round((mean(x_i)), 2) for x_i in x]
    return mean(list_of_mean)

# =================== Example of calculation of Mean and Grand Mean of several samples ========================

x1 = [23, 45, 67, 11, 89, 234]
x2 = [12, 55, 73, 11, 109, 234]
x3 = [67, 45, 84, 9, 87, 268]

list_of_lists = [x1, x2, x3]
list_of_mean = [round((mean(x_i)), 2) for x_i in list_of_lists]
m_of_m = mean_of_means (list_of_lists)

print('Individual lists are ', x1,',', x2, ',', x3)
print('List of Lists = ', list_of_lists)
print('List of means = ', list_of_mean)
print('Mean of Means or Grand Mean = ', m_of_m)

Individual lists are  [23, 45, 67, 11, 89, 234] , [12, 55, 73, 11, 109, 234] , [67, 45, 84, 9, 87, 268]
List of Lists =  [[23, 45, 67, 11, 89, 234], [12, 55, 73, 11, 109, 234], [67, 45, 84, 9, 87, 268]]
List of means =  [78.17, 82.33, 93.33]
Mean of Means or Grand Mean =  84.61


## Dispersion, Deviation, Variance and Standard Deviation of a Data Sample
**Dispersion** is a measure of **how spread out the data is** in the Data Sample. Its simplest measure is the **Range**: the difference between the **Maximum Value** and the **Minimum Value** of the Data Sample.

Another measure of the **spread of the data** in a Data Sample is the **Deviation**, which is the list of the differences of each data point from the Mean of the Data Sample. In **Regression Analysis** (which we will learn in detail later) the corresponding differences between observed and predicted values are called the **Errors** or **Residuals**.

The **Deviations** can be positive or negative, and they always sum to zero (up to rounding), however spread out the data is. So the sum of Deviations cannot be used as a measure of spread: it would give the wrong impression that the data is NOT widely spread out.

In notation, for a Data Sample $x_1, x_2, \dots, x_n$ with Mean $\bar{x}$, the Deviations are

$$d_i = x_i - \bar{x}, \qquad \sum_{i=1}^{n} d_i = 0$$

To counter this problem, the most widely used measure of the **spread of the data** in a Data Sample is the **Variance**, which is the **sum of squares of the deviations** of each data point from the Mean of the Data, divided by **(n-1)**, where **n is the sample size**.

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Another measure of the **spread of the data in a sample** is the **Standard Deviation**. Standard Deviation is the **square root** of the **Variance** of the data in the sample.

**Dispersion, Deviation and Variance** of a Data Sample can be easily calculated as follows


In [13]:
input_data = [2, 27, 48, 99, 348, 587, 439, 567, 602]

print('Input Data = ', input_data)
print('================================================')

def data_range(x):
    return max(x) - min(x)
print('Dispersion of Input Data = ', data_range(input_data))
print('===================================================')
def diff_from_mean(x):
    x_bar = mean(x)
    return [round((x_i - x_bar), 2) for x_i in x]
print('Diff from mean of Input Data = ', diff_from_mean(input_data))
print('===================================================')

print('Deviation of Input Data = ', round(sum(diff_from_mean(input_data)), 4))
print('===================================================')

def sum_of_squares(x):
    return(sum(x_i**2 for x_i in x))

def variance(x):
    l = len(x)
    deviations = diff_from_mean(x)
    return (sum_of_squares(deviations)/(l - 1))
print('Variance of Input Data = ', round(variance(input_data), 2))
print('===================================================')

def standard_deviation(x):
    v = variance(x)
    return math.sqrt(v)
print('Standard Deviation of Input Data = ', round(standard_deviation(input_data), 2))
print('===================================================')

Input Data =  [2, 27, 48, 99, 348, 587, 439, 567, 602]
Dispersion of Input Data =  600
Diff from mean of Input Data =  [-300.11, -275.11, -254.11, -203.11, 45.89, 284.89, 136.89, 264.89, 299.89]
Deviation of Input Data =  0.01
Variance of Input Data =  66710.61
Standard Deviation of Input Data =  258.28
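As a cross-check on the hand-written functions, the same quantities can be computed with Python's standard `statistics` module. This is a sketch, not part of the original notebook; it works with the unrounded mean, so the deviations sum to zero up to floating-point error rather than to 0.01.

```python
import statistics

input_data = [2, 27, 48, 99, 348, 587, 439, 567, 602]

x_bar = statistics.fmean(input_data)           # unrounded mean
deviations = [x_i - x_bar for x_i in input_data]

print('Sum of deviations  = ', round(sum(deviations), 6))
# statistics.variance and statistics.stdev use the (n - 1) denominator
print('Variance           = ', round(statistics.variance(input_data), 2))
print('Standard Deviation = ', round(statistics.stdev(input_data), 2))
```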


## Statistical Hypothesis Testing
### Null and Alternate Hypothesis
Statistical **Hypothesis Testing** means making an assumption (hypothesis) and then using sample data to decide whether that assumption should be rejected. Every hypothesis test, regardless of the data population and other parameters involved, requires the three steps below.
* Making an initial assumption.
* Collecting evidence (data).
* Based on the available evidence (data), deciding whether to reject or not reject the initial assumption.

The initial assumption made is called the **Null Hypothesis (H-0)** and the alternative (opposite) to the **Null Hypothesis** is called the **Alternate Hypothesis (H-A)**.

Two widely used approaches to **hypothesis testing** are
* Critical value approach
* p-value approach

The **Critical value** approach involves comparing the observed test statistic to a cutoff value, called the **Critical Value**. If the test statistic is more extreme than the Critical Value (i.e. greater than the **Upper Critical Value** or less than the **Lower Critical Value**), the null hypothesis is rejected in favor of the alternative hypothesis. If the test statistic is not as extreme as the critical value, the null hypothesis is not rejected.

The **p-value** approach involves determining the probability of observing a test statistic at least as extreme as the one observed, in the direction of the **Alternate Hypothesis**, assuming the null hypothesis were true.

If the **p-value** is less than (or equal to) **α (the chosen Level of Significance)**, then the null hypothesis **is rejected** in favor of the alternative hypothesis. If the p-value is greater than **α**, then the null hypothesis **is not rejected**.
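The two approaches lead to the same decision. The sketch below shows this for a left-tail test on a standard normal test statistic. It uses `statistics.NormalDist` from the Python standard library (an assumption made so the example runs without extra packages; `scipy.stats.norm` offers the same `cdf` and an equivalent `ppf`), and the values of `z` and `alpha` are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()   # mean 0, standard deviation 1

z = -2.33       # observed test statistic (illustrative)
alpha = 0.05    # level of significance (illustrative)

# Critical value approach: reject if z falls below the lower critical value
lower_critical_value = std_normal.inv_cdf(alpha)
reject_by_critical_value = z < lower_critical_value

# p-value approach: reject if the left-tail probability is at most alpha
p_value = std_normal.cdf(z)
reject_by_p_value = p_value <= alpha

print('Lower critical value = ', round(lower_critical_value, 3))
print('p-value = ', round(p_value, 6))
print('Same decision? ', reject_by_critical_value == reject_by_p_value)
```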

### Z-Score and p-Value
In this section we only define the **Z-Score** and the **p-Value** and how they relate to each other. In a subsequent section we will use the Z-Score and p-value together with the **Level of Confidence** or **Level of Significance** to test a hypothesis, i.e. to either Reject the Null Hypothesis (the Alternate Hypothesis is accepted as the new norm) or Fail to Reject the Null Hypothesis (the Null Hypothesis remains valid).

A **Z-Score** expresses a data value in units of standard deviation relative to the mean. It shows how far (**how many Standard Deviations**) a specific data value is from the sample **Mean**.
The Z-Score is calculated by the formula

**z = (X - X-bar)/Std-dev**

where 

X = a Data Value

X-bar = Sample Mean
      
Std-dev = Standard Deviation of the sample

The **p-value** of a Data Value is the probability of obtaining sample data at least as *extreme* as the data observed, assuming the Null Hypothesis is true.

The p-value of a z-score can be obtained from a Statistical Z-Table or using a Python Library function. Here we will use the Python Library function.

**p-value = stats.norm.cdf(z-score)**

However, depending on how the value we are testing compares with the currently known data, we may have to use a slightly different formula. In the example later in this section, a class average of 53 is tested against a National Average of 60 with a Standard Deviation of 3. To handle such cases we need to learn the **Left Tail** and **Right Tail** tests.

### Left-Tail, Right-Tail and Two-Tail Tests of Hypothesis
If the value we are testing (a class average of 53) is **less than** the **Mean** (60), we use the **Left Tail Test**. If the value (say a class average of 66 instead of 53) is **greater than** the **Mean** (60), we use the **Right Tail Test**. A **Two-Tail Test** is used when a difference in either direction matters.

For a **Right Tail Test** the formula for p-value (again using a Python Library function) is

**p-value =  1- stats.norm.cdf(z-score)**

***p-value for a z-score can be looked up from the Statistical Z-Table***
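The left-tail and right-tail formulas can be wrapped in one helper. This is a sketch: the function name `p_value_from_z` and its `tail` argument are hypothetical, and it uses the standard library's `statistics.NormalDist` in place of `stats.norm` so that it runs without SciPy. The two-tail branch doubles the smaller tail area, which is one common convention.

```python
from statistics import NormalDist

def p_value_from_z(z, tail):
    """Return the p-value of a z-score for a 'left', 'right' or 'two' tail test."""
    left_area = NormalDist().cdf(z)      # same quantity as stats.norm.cdf(z)
    if tail == 'left':
        return left_area
    if tail == 'right':
        return 1 - left_area
    if tail == 'two':
        return 2 * min(left_area, 1 - left_area)
    raise ValueError("tail must be 'left', 'right' or 'two'")

print(round(p_value_from_z(-2.33, 'left'), 6))
print(round(p_value_from_z(2.0, 'right'), 6))
print(round(p_value_from_z(1.96, 'two'), 4))
```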

#### An Example of Z-Score and p-value
Assume that we have the scores of a test in Business Analytics in a class of 100. The Mean of the sample (100 test scores) is 53. The National Average of the same test is 60 with a Standard Deviation of 3. We want to calculate the Z-score and p-value for this class sample (Average is 53) with respect to the National data (Average = 60, Standard Deviation = 3) to test our hypothesis "the class score is similar to the National Average"

Here we will calculate the z-score and corresponding p-value for Case-1 where the **class average is 53** and Case-2 where the **class average is 66**
      

In [2]:
import scipy.stats as stats

# Example of a Left Tail Test
print('========== Example of a Left Tail Test ============')
# Case-1 where class score mean = 53
print('Class score mean = ', 53)
# Calculating the z-score of 53 with respect to the National Score (Mean = 60, S-Dev = 3)
zscore1 = round((53 - 60)/3, 2)
print('Zscore for mean class score (53) = ', zscore1)
# Since 53 is less than the national average 60 we will do the Left Tail Test
prob1 = round(stats.norm.cdf(zscore1), 6)
print('p-value for the mean class score (53) = ',  prob1)

# Example of a Right Tail Test
print('========== Example of a Right Tail Test ============')
# Case-2 where class score mean = 66
print('Class score mean = ', 66)
# Calculating the z-score of 66 with respect to the National Score (Mean = 60, S-Dev = 3)
zscore2 = round((66 - 60)/3, 2)
print('Zscore for mean class score (66) = ', zscore2)
# Since 66 is more than the national average 60 we will do the Right Tail Test
prob2 = round(1 - stats.norm.cdf(zscore2), 6)
print('p-value for the mean class score (66) = ',  prob2)

Class score mean =  53
Zscore for mean class score (53) =  -2.33
p-value for the mean class score (53) =  0.009903
Class score mean =  66
Zscore for mean class score (66) =  2.0
p-value for the mean class score (66) =  0.02275


### Level of Confidence and Level of Significance
Since the results of a statistical test are not **definite proof** of the conclusion, they are always associated with a **Level of Confidence** or a **Level of Significance**. Normally we strive for a high **Level of Confidence**, i.e. a low **Level of Significance**, when testing whether the Null Hypothesis should be retained or replaced by the Alternate Hypothesis.

Usually the **Level of Confidence (C)** used is 95% (0.95), 99% (0.99) etc. for the conclusions of a hypothesis test to be considered **"reliable"**. The **Level of Significance** is the complement of the Level of Confidence, i.e. 

**Level of Significance = 1 - Level of Confidence** or S = 1 - C. For a Level of Confidence of 99% (0.99) the Level of Significance is 0.01, and for a Level of Confidence of 95% (0.95) the Level of Significance is 0.05.

In the majority of hypothesis tests a Level of Significance of 0.05 is used. In this text the chosen Level of Significance is called the **Critical Value α**, and the p-value (calculated in the previous step) is compared against it.

If the p-value is **less than** the **Critical Value α** = 0.05, the test results are considered "highly significant". By the same token, a p-value below **Critical Value α = 0.01** is considered "very highly significant".

### Hypothesis Testing Using Z-Score, p-Value and Level of Significance
In a hypothesis test using the Z-Score and p-value, if the p-value is less than the **Critical Value α** (0.05 in our case), the result is considered statistically significant: the Null Hypothesis is rejected and the Alternate Hypothesis is accepted. Otherwise the Null Hypothesis is not rejected.

In test case-1, where the mean class score is 53, the p-value is 0.009903, which is less than the Critical Value α (0.05). The Null Hypothesis, that the mean marks of the class are similar to the national average, is **Rejected**.

In test case-2, where the mean class score is 66, the p-value is 0.02275, which is also less than the Critical Value α (0.05), so the Null Hypothesis is again **Rejected**. At a stricter α of 0.01 it would be **Retained**, because 0.02275 is greater than 0.01.

A Two-Tailed test can also be used in the above case using the same concepts of Z-Score, p-value and α, the Critical Significance Level. We will discuss Hypothesis Testing in more detail in the **Descriptive Analytics** section.

### Getting p-value from z-score and z-score from p-value
We have already used **stats.norm.cdf(zscore1)** to get p-value from z-score

***p-value = stats.norm.cdf(zscore1)***

Now we will use stats.norm.ppf(p-value) to get z-score from p-value

***z-score = stats.norm.ppf(c-value), remembering, p-value = 1 - c-value***

Let us calculate z-score for the most commonly used **Confidence Levels (C)** of 90% (0.9), 95% (0.95), 98% (0.98) and 99% (0.99), i.e. the most commonly used **Significance Levels (S)** of 0.1, 0.05, 0.02 and 0.01 respectively

In [3]:
import scipy.stats as stats

# One-tail critical z-scores: ppf(C) is the z-score with area C to its left
z_score_1 = stats.norm.ppf(0.9) # for C = 0.9 i.e. α = 0.1
print(z_score_1)
z_score_2 = stats.norm.ppf(0.95) # for C = 0.95 i.e. α = 0.05
print(z_score_2)
z_score_3 = stats.norm.ppf(0.98) # for C = 0.98 i.e. α = 0.02
print(z_score_3)
z_score_4 = stats.norm.ppf(0.99) # for C = 0.99 i.e. α = 0.01
print(z_score_4)
# For a 2-tail test α is split equally between the tails, so we use ppf(1 - α/2).
# The corresponding z-scores are (+-)1.645, 1.96, 2.33 and 2.576 respectively
print("===================================================================")
z_score_5 = stats.norm.ppf(0.95) # for C = 0.9 i.e. α/2 = 0.05 on each tail
print(z_score_5)
z_score_6 = stats.norm.ppf(0.975) # for C = 0.95 i.e. α/2 = 0.025 on each tail
print(z_score_6)
z_score_7 = stats.norm.ppf(0.99) # for C = 0.98 i.e. α/2 = 0.01 on each tail
print(z_score_7)
z_score_8 = stats.norm.ppf(0.995) # for C = 0.99 i.e. α/2 = 0.005 on each tail
print(z_score_8)

1.2815515655446004
1.6448536269514722
2.0537489106318225
2.3263478740408408
1.6448536269514722
1.959963984540054
2.3263478740408408
2.5758293035489004
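`ppf` is the inverse of `cdf`, so converting a confidence level to a z-score and back should return the starting value. The sketch below checks this round trip and computes two-tail critical values from α/2. It uses `statistics.NormalDist().inv_cdf`, the standard-library counterpart of `stats.norm.ppf`, purely so that the example has no external dependencies.

```python
from statistics import NormalDist

std_normal = NormalDist()

for confidence in (0.90, 0.95, 0.98, 0.99):
    alpha = 1 - confidence
    one_tail_z = std_normal.inv_cdf(confidence)        # area C to the left
    two_tail_z = std_normal.inv_cdf(1 - alpha / 2)     # alpha/2 in each tail
    round_trip = std_normal.cdf(one_tail_z)            # should equal C again
    print(confidence, round(one_tail_z, 3), round(two_tail_z, 3), round(round_trip, 4))
```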


### Example Scenarios of Different Types of Hypothesis Tests
#### Example - 1

***A company states that its straw machine makes straws that are 4 mm in diameter. A worker believes that the machine no longer makes straws of this size and samples 100 straws to perform a hypothesis test at a 99% Confidence Level. Write the null and alternate hypotheses and any other related data.***

                   H-0: µ = 4 mm   H-a: µ != 4 mm   n = 100   C = 0.99   Critical Value α = 1 - C = 0.01 

#### Example - 2
***Doctors believe that the average teen sleeps no longer than 10 hours per day. A researcher believes that teens sleep longer. Write the H-0 and H-a.***

                   H-0: µ <= 10   H-a: µ > 10

#### Example - 3
***The school board claims that at least 60% of students bring a phone to school. A teacher believes this number is too high and randomly samples 25 students to test at a Significance Level of 0.02. Write the H-0, H-a and other related information.***

                  H-0: p >= 0.60  H-a: p < 0.60  n = 25  Critical Value α = 0.02   C = 1 - α = 1 - 0.02 = 0.98 (98%)
                  
With the available information, it is possible to write the **null** and **alternate** hypotheses, but in these examples we do not have enough information to test them.

Recall the steps of a hypothesis test outlined above:

* Write the hypotheses H-0 and H-a
* Given µ and the standard deviation, calculate the z-score for the value to be tested using the formula z = (X-bar - µ)/Std-dev
* Calculate the left-tail probability using the Python function stats.norm.cdf(z-score)
* Given the Significance Level, use it as the Critical Value α; given the Confidence Level C, calculate Critical Value α = 1 - C
* For a **Left Tail Test**, the p-value is the probability calculated above
* For a **Right Tail Test**, p-value = 1 - (the probability calculated above)
* For a **Two Tail Test**, compare the smaller of the two tail probabilities with α/2
* If the p-value is **less** than the Critical Value α (α/2 for a Two Tail Test), **reject** the Null Hypothesis, else **fail to reject** the Null Hypothesis

***Note: If H-a has <, it is a Left Tail Test, if H-a has >, it is a Right Tail Test, if H-a has != it is a 2-Tail Test***
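The steps and the note above can be sketched as a single function. The name `z_test` and its arguments are hypothetical, `statistics.NormalDist` stands in for `stats.norm`, and the z-score is computed exactly as in the list above (dividing by the standard deviation that is given). The numbers in the usage line are illustrative.

```python
from statistics import NormalDist

def z_test(x_bar, mu, std_dev, alpha, h_a):
    """Run the steps above. h_a is the sign in H-a: '<', '>' or '!='."""
    z = (x_bar - mu) / std_dev
    left_area = NormalDist().cdf(z)
    if h_a == '<':                       # Left Tail Test
        p_value, cutoff = left_area, alpha
    elif h_a == '>':                     # Right Tail Test
        p_value, cutoff = 1 - left_area, alpha
    elif h_a == '!=':                    # Two Tail Test: smaller tail vs alpha/2
        p_value, cutoff = min(left_area, 1 - left_area), alpha / 2
    else:
        raise ValueError("h_a must be '<', '>' or '!='")
    decision = 'reject H-0' if p_value < cutoff else 'fail to reject H-0'
    return z, p_value, decision

# Illustrative numbers: H-0: mu >= 100, H-a: mu < 100
z, p_value, decision = z_test(x_bar=97, mu=100, std_dev=2, alpha=0.05, h_a='<')
print(round(z, 2), round(p_value, 4), decision)
```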

So, to be able to test the hypothesis we need µ (the hypothesized mean), x-bar (the sample mean), std-dev (the sample standard deviation) and the required Confidence Level or Significance Level.

In the next example we will go through these steps (assuming all the necessary information are given)

#### Example - 4
Records show that students on average score less than or equal to 850 on a test. A test prep company says that the students who take their course will score higher than this. To test, they sample 1000 students who score on an average of 856 with a standard deviation of 98 after taking the course. At 0.05 Significance Level, test the company claim.

            H-0: µ <= 850  H-a: µ > 850  n = 1000  x-bar = 856  std-dev = 98  α = 0.05 (C = 0.95 or 95%)
       
Let's calculate the z-score and p-value to test the hypothesis. It is a **Right Tail Test**


In [68]:
import numpy as np
from scipy.stats import norm

x_bar = 856
µ = 850
s_dev = 98
z_score = (x_bar - µ)/s_dev
print("Z-score = ", z_score)
p_value = (1 - norm.cdf(z_score)) # since it is a Right Tail test
print("p-value = ", p_value)

Z-score =  0.061224489795918366
p-value =  0.4755902131389005


***Since the calculated p-value is greater than α (0.05) we fail to reject the null hypothesis, i.e. with this calculation the company's claim is NOT Statistically Significant***
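Note that the cell above divides by the sample standard deviation itself. When the quantity being tested is a sample mean, the z statistic is more commonly computed with the standard error, s/√n, in the denominator. The sketch below applies that variant to the Example 4 numbers; it is an alternative calculation offered for comparison, not the notebook's own method, and it uses `statistics.NormalDist` instead of SciPy.

```python
import math
from statistics import NormalDist

x_bar, mu, s_dev, n = 856, 850, 98, 1000    # values from Example 4

standard_error = s_dev / math.sqrt(n)       # standard deviation of the sample mean
z_score = (x_bar - mu) / standard_error
p_value = 1 - NormalDist().cdf(z_score)     # Right Tail Test

print('Standard error = ', round(standard_error, 3))
print('Z-score = ', round(z_score, 2))
print('p-value = ', round(p_value, 3))
```

With n = 1000 the standard error is about 3.1, so the z-score is roughly 1.94 and the p-value falls below α = 0.05. Under this convention the Null Hypothesis would be rejected, so the conclusion depends on which denominator is appropriate for the data.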

#### Example - 5
A newspaper reports that the average age a woman gets married is 25 years or  less. A researcher thinks that the average age is higher. He samples 213 women and gets an average of 25.4 years with standard deviation of 2.3 years. With 95% Confidence Level, test the researcher's claim.


        H-0: µ <= 25  H-a: µ > 25  n = 213  x-bar = 25.4  s-dev = 2.3  C = 95% = 0.95  α = 0.05

Let's calculate the z-score and p-value to test the hypothesis. It is a **Right Tail Test**

In [69]:
import numpy as np
from scipy.stats import norm

x_bar = 25.4
µ = 25
s_dev = 2.3
z_score = (x_bar - µ)/s_dev
print("Z-score = ",z_score)

p_value = (1 - norm.cdf(z_score)) # since it is a Right Tail test
print("p-value = ", p_value)

Z-score =  0.17391304347826025
p-value =  0.43096690081487876


***Since the calculated p-value is greater than α (0.05) we fail to reject the null hypothesis, i.e. with this calculation the researcher's claim is NOT Statistically Significant***

#### Example - 6
A study showed that on an average women in a city had 1.48 kids. A researcher believes that the number is wrong. He surveys 128 women in the city and finds that on an average these women had 1.39 kids with standard deviation of 0.84 kids. At 90% Confidence Level, test the claim.

    H-0: µ = 1.48 H-a: µ != 1.48   n = 128   x-bar = 1.39   s-dev = 0.84   C = 90% = 0.9. 
    
    
Let's calculate the z-score and p-value to test the hypothesis. It is a **Two Tail Test**, so the tail probability is compared with α/2 = (1 - C)/2 = 0.05
    


In [70]:
import numpy as np
from scipy.stats import norm

x_bar = 1.39
µ = 1.48
s_dev = 0.84
z_score = (x_bar - µ)/s_dev
print("Z-score = ", z_score)
p_value = norm.cdf(z_score) # left-tail area; for a Two Tail test it is compared with α/2
print("p-value = ",p_value)

Z-score =  -0.10714285714285725
p-value =  0.4573378238740764


***Since the calculated p-value is greater than α/2 (0.05) we fail to reject  the null hypothesis, i.e. researcher's claim is invalid or NOT Statistically Significant***

#### Example - 7
The government says the average weight of males is 162.9 pounds or greater. A researcher thinks this is too high. He does a study of 39 males and gets an average weight of 160.1 pounds with a standard deviation of 1.6 pounds. At 0.05 Significance Level, test the claim.

    H-0: µ >= 162.9   H-a: µ < 162.9   n = 39    x-bar = 160.1    s-dev = 1.6   α = 0.05

Let's calculate the z-score and p-value to test the hypothesis. It is a **Left Tail Test**

In [79]:
import numpy as np
from scipy.stats import norm

x_bar = 160.1
µ = 162.9
s_dev = 1.6
z_score = (x_bar - µ)/s_dev
print("Z-score = ", z_score)
p_value = norm.cdf(z_score) # since it is a Left Tail test
print("p-value = ",p_value)

Z-score =  -1.750000000000007
p-value =  0.040059156863816475


***Since the calculated p-value is less than α (0.05) we reject  the null hypothesis, i.e. researcher's claim is valid or Statistically Significant***



## Analysis of Variance (ANOVA)


## What is ANOVA
ANOVA, or Analysis of Variance, is a set of statistical tests of whether there is a **significant** difference between the **means** of a set of samples, i.e. whether the means of the various samples are (***statistically***) equal or not. In its simplest form, ANOVA tests whether at least one sample mean is significantly different from the others. It does not tell us whether **more than one** sample mean differs, nor does it make **pair-wise** comparisons between the samples. More advanced forms of ANOVA help researchers answer those questions (we will not discuss them).

An important fact to note is that while we use ANOVA to test whether the sample means differ significantly, we actually compare **variances**. Hence the name **Analysis of Variance**.

## What is this test for?
The one-way analysis of variance (ANOVA) is used to determine whether there are any statistically significant differences between the means of three or more independent (unrelated) groups.

## What does this test do?
The one-way ANOVA compares the means between the groups you are interested in and determines whether any of those means are statistically significantly different from each other. Specifically, it tests the **Null Hypothesis**:

                                H-0: µ-1 = µ-2 = ... = µ-k

where µ = group mean and k = number of groups. 

If, however, the one-way ANOVA returns a statistically significant result, we accept the **Alternative Hypothesis (H-A)**, which is that **at least** one group mean is statistically significantly different from the others.

At this point, it is important to realize that the one-way ANOVA is an omnibus test and cannot tell you which specific groups were statistically significantly different from each other, only that at least two groups were. To determine which specific groups differed from each other, you need to use a post hoc test.

## Some Definitions
### Grand Mean
The **Grand Mean** of a set of multiple samples is the mean of all observations: add all the observations of all samples and divide the SUM by the **total** number of observations (the joint sample size).

When all samples have the **same** number of observations, the Grand Mean can equivalently be calculated by first calculating the **Mean** of each individual sample, then adding the **sample means** and dividing the SUM by the number of samples or **Groups**.

The **number of observations** in each of the samples does **NOT** have to be the same. With **unequal** sample sizes the Grand Mean must be computed from all observations (or as a **weighted average** of the sample means, weighted by sample size). The **Sum Square of Treatments** (Treatments are also called **Groups**) and **Mean Square of Treatments** defined below take the unequal sizes into consideration by weighting each group's squared deviation by its sample size.
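A short sketch of the difference between the two calculations. The sample values are invented for illustration and the function names are hypothetical. With equal sample sizes the mean of means equals the Grand Mean of all observations; with unequal sizes only the size-weighted average of the sample means does.

```python
def mean(values):
    return sum(values) / len(values)

def pooled_mean(samples):
    """Grand Mean: all observations added together, divided by their count."""
    all_values = [v for sample in samples for v in sample]
    return mean(all_values)

def mean_of_means(samples):
    return mean([mean(sample) for sample in samples])

def weighted_mean_of_means(samples):
    total = sum(len(sample) for sample in samples)
    return sum(len(sample) * mean(sample) for sample in samples) / total

equal = [[1, 2, 3], [10, 20, 30]]
unequal = [[1, 2, 3], [10, 20]]

print(pooled_mean(equal), mean_of_means(equal))            # identical
print(pooled_mean(unequal), mean_of_means(unequal))        # differ
print(weighted_mean_of_means(unequal))                     # matches pooled mean
```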

## Sum Square of Treatments (SST) and Mean Square of Treatments (MST)

**Sum of Squares of Treatments (SST)** measures the variation **between** Groups/Treatments and is defined as

         SUM[ for sample-1 ( sample size * (Sample Mean - Grand-Mean)-Square ) + 
              for sample-2 ( sample size * (Sample Mean - Grand-Mean)-Square ) +
              .....
              for sample-K ( sample size * (Sample Mean - Grand-Mean)-Square ) ]
              
       i.e. SST = n1 * (mean1 - Grand-Mean)**2 + n2 * (mean2 - Grand-Mean)**2 + ... + nK * (meanK - Grand-Mean)**2
       
Where 
       
       n1, n2...nK are the sample sizes of the K Treatments or Groups
       
       mean1, mean2.....meanK are the means of each of the K samples
       
       Grand-Mean is the Grand Mean defined above
       
**Mean Square of Treatments (MST)** of **K** Treatments or Groups is defined as

       SST/(K - 1)
       
## Degree of Freedom between Treatments/Groups
The **Degree of Freedom** between **K** Treatments or Groups is defined as

                 DF-between = (K - 1)

## Sum of Square of Errors (SSE) and Mean Square of Errors (MSE)
**Sum of Squares of Errors (SSE)** measures the variation **within** Groups/Treatments. For a single sample it is defined as

      SUM [ (x1 - Sample-Mean)**2 + (x2 - Sample-Mean)**2 + ... + (xn - Sample-Mean)**2 ]

Where

      x1, x2,...xn are the observations of the sample
      n is the size of the sample
      
The **SSE** of the whole data set is the sum of these per-sample values over all K samples. Using the above definition, we can see that for each sample it can also be written as

      SSE of a sample = (n - 1) * Sample Variance
      
**Mean Square of Errors (MSE)** of K samples is defined as
     
      MSE = SSE / (N - K)
      
When every sample has the same size ni, N - K = (ni * K) - K = K * (ni - 1).

Where 

      ni is the size of the i-th sample
      K is the number of groups/samples
      N is "Total" number of observations i.e. SUM(ni) over K groups/samples
      
## Degree of Freedom within Group/Treatment
**Degree of Freedom** within groups is defined for **K** groups with a total of **N** observations as 

      DF-within = N - K      (= K * (ni - 1) when every group has ni observations)
      
## F-Statistic for One Way ANOVA
**F-Statistic** for One Way ANOVA is defined as

     F-Statistic = MST/MSE
     
## p-value of One Way ANOVA
One Way ANOVA uses the F-Statistic (the ratio MST/MSE follows the F-Distribution) as opposed to the Z-Statistic (for the Normal Distribution) that we used in the previous section on **Hypothesis Testing** for samples with a sample size of **30 or greater**.

Statistical tables are available to get p-values from the F-Statistic. One important point to note is that ANOVA is ***always a Right Tail Test***, and hence the p-value for hypothesis testing is calculated as ***(1 - cumulative probability of the F-Statistic)***.

In our case we will use the Python **cdf** function for the **F-Distribution**, just as we used the **cdf** function for the Normal Distribution in the section on Hypothesis Testing.

## One Way ANOVA Testing Steps
Following the above definitions, the following are the steps of One Way ANOVA

* Calculate the Grand Mean
* Calculate the SST (between groups/treatments)
* Calculate the MST (between groups/treatments)
* Calculate the SSE (within groups/treatments)
* Calculate the MSE (within groups/treatments)
* Calculate F-Statistic = MST/MSE
* Get the p-value from the F-Statistic
* If the calculated p-value is **smaller** than the "Required" Level of Significance, **Reject** the Null Hypothesis (i.e. **at least one of the sample means differs significantly from the other sample means**), otherwise **Fail to Reject** the Null Hypothesis (i.e. ***there is no significant evidence that the sample means differ***)
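The steps above can be sketched in plain Python up to the F-Statistic. The three groups are the New York, Texas and Oregon columns of the data set used in the cells that follow; the function name `one_way_anova_f` is hypothetical. Unlike the notebook's own helpers, this sketch does no intermediate rounding and computes the Grand Mean from all observations, so it also works for unequal group sizes.

```python
def one_way_anova_f(groups):
    """Return (SST, MST, SSE, MSE, F) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-group variation, each group weighted by its size
    sst = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    mst = sst / (k - 1)

    # Within-group variation
    sse = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    mse = sse / (n_total - k)

    return sst, mst, sse, mse, mst / mse

new_york = [18, 19, 20, 21, 22, 23, 18, 19, 20, 21]
texas = [18, 20, 16, 20, 21, 20, 18, 19, 17, 18]
oregon = [21, 22, 17, 18, 22, 19, 21, 20, 18, 23]

sst, mst, sse, mse, f_stat = one_way_anova_f([new_york, texas, oregon])
print('SST =', round(sst, 4), ' SSE =', round(sse, 4), ' F =', round(f_stat, 4))
```

The p-value then comes from the F-Distribution with (K - 1, N - K) degrees of freedom, here (2, 27).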

We will first do the One Way ANOVA manually (using a spreadsheet or calculator). Next we will do the same One Way ANOVA using the "1-Factor ANOVA" in the Excel Analysis ToolPak, and then we will do the same using Python.

In [2]:
import pandas as pd
anovadf = pd.read_csv("../../../CSV/anova-1way-csv.csv")

anovadf["New York"]

0    18
1    19
2    20
3    21
4    22
5    23
6    18
7    19
8    20
9    21
Name: New York, dtype: int64

In [3]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [4]:
anovadf.head(10)

Unnamed: 0,New York,Texas,Oregon
0,18,18,21
1,19,20,22
2,20,16,17
3,21,20,18
4,22,21,22
5,23,20,19
6,18,18,21
7,19,19,20
8,20,17,18
9,21,18,23


In [5]:
import scipy.stats as stats
stats.f_oneway(anovadf["New York"], anovadf["Texas"], anovadf["Oregon"])

F_onewayResult(statistic=2.1025029797377845, pvalue=0.1417043469872377)

In [6]:
1- stats.f.cdf(2.102502, 2, 27)

stats.f.ppf(q=0.99, dfn= 2, dfd=27)

0.1417044671120501

5.488117768420701
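The cell above applies both approaches from the Hypothesis Testing section to the F-Statistic: the p-value approach (1 - cdf) and the critical value approach (ppf). A sketch that spells out the two decisions is below. The variable names are illustrative, and `stats.f.sf` (the survival function) is used as an equivalent way of writing `1 - stats.f.cdf`.

```python
import scipy.stats as stats

f_statistic = 2.102502      # from f_oneway above
df_between, df_within = 2, 27
alpha = 0.01                # matches q = 0.99 in the ppf call above

p_value = stats.f.sf(f_statistic, df_between, df_within)
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

print('p-value = ', round(p_value, 4), ' reject H-0? ', p_value < alpha)
print('F critical = ', round(f_critical, 3), ' reject H-0? ', f_statistic > f_critical)
```

Both comparisons give the same answer: the F-Statistic is below the critical value and the p-value is above α, so the Null Hypothesis is not rejected.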

In [7]:

def mean(x):
    return round((sum(x) / len(x)), 2)

def diff_from_mean(x):
    x_bar = mean(x)
    return [round((x_i - x_bar), 2) for x_i in x]

def sum_of_squares(x):
    return round((sum(x_i**2 for x_i in x)), 2)

def variance(x):
    l = len(x) 
    deviations = diff_from_mean(x)
    return round((sum_of_squares(deviations)/(l - 1)), 2)

def calc_summary1(in_df):
    i = 0
    summary = []
    while i < len(in_df.columns):
        x = in_df[in_df.columns[i]]
        name = x.name
        gr_sum = sum(x)
        gr_count = len(x)
        gr_mean = mean(x)
        deviation = diff_from_mean(x)
        gr_variance = variance(x)
        ss_within_gr = sum_of_squares(deviation)
        summary.append({'Groups': name,'Count':gr_count, 'Sum': gr_sum, 'Average': gr_mean, 'Variance': gr_variance})
        #ret_df = pd.DataFrame([])
        #ret_df = ret_df.append(summary)
        i += 1
    ret_df = pd.DataFrame(summary)
    ret_df = ret_df[['Groups', 'Count', 'Sum', 'Average', 'Variance']]
    
    return ret_df

result_df1 = calc_summary1(anovadf)
result_df1.head(10)

Unnamed: 0,Groups,Count,Sum,Average,Variance
0,New York,10,201,20.1,2.77
1,Texas,10,187,18.7,2.46
2,Oregon,10,201,20.1,4.1


In [10]:
def calc_sse_mse(in_df):
    num_samples = 0
    sse = 0
    total_sample_size = 0
    while num_samples < len(in_df.columns):
        data = in_df[in_df.columns[num_samples]]
        dev = diff_from_mean(data)
        sse += sum_of_squares(dev)
        total_sample_size += len(data)
        num_samples += 1
    mse = round(sse/(total_sample_size-num_samples), 4)  # computed once, after all groups are summed
    #print('SSE = ', sse, 'Total Sample Size =', total_sample_size, 'MSE = ', mse)
    return sse, mse, total_sample_size

anova_sse, anova_mse, anova_sample_size = calc_sse_mse(anovadf)

def grand_mean_df(in_df):
    cum_mean = 0
    num_groups = 0
    while num_groups < len(in_df.columns):
        cum_mean += mean (in_df[in_df.columns[num_groups]])
        num_groups += 1
    return round(cum_mean/num_groups, 4)

def calc_sst_mst(in_df):
    grand_mean = grand_mean_df(in_df)
    num_groups = 0
    sst = 0
    while num_groups < len(in_df.columns):
        data = in_df[in_df.columns[num_groups]]
        sst +=  round(len(data) * (mean(data) - grand_mean)**2, 4)
        num_groups += 1
    mst = round(sst/(num_groups -1), 4)
    return sst, mst, num_groups
anova_sst, anova_mst, groups = calc_sst_mst(anovadf)

def calc_summary_2(in_df):
    # Use the DataFrame passed in, not the global anovadf
    anova_sst, anova_mst, groups = calc_sst_mst(in_df)
    anova_sse, anova_mse, anova_sample_size = calc_sse_mse(in_df)
    df_between = groups -1
    df_within = anova_sample_size - groups
    f_stats = anova_mst/anova_mse
    p_value = 1- stats.f.cdf(f_stats, df_between, df_within)
    summary = [{'Source of Variance': 'Between Groups', 'SST/SSE': anova_sst, 'df': df_between, 'MST/MSE': anova_mst,
                 'F_Statistics': f_stats, 'p_value': p_value},
               {'Source of Variance': 'Within Groups', 'SST/SSE': anova_sse, 'df': df_within, 'MST/MSE': anova_mse,
                 'F_Statistics': '', 'p_value': ''}
               ]
    ret_df = pd.DataFrame(summary)
    ret_df = ret_df[['Source of Variance', 'SST/SSE', 'MST/MSE', 'df', 'F_Statistics', 'p_value']]
    return ret_df
result_df = calc_summary_2(anovadf)
result_df.head(10)
    

Unnamed: 0,Source of Variance,SST/SSE,MST/MSE,df,F_Statistics,p_value
0,Between Groups,13.0667,6.5334,2,2.10253,0.141701
1,Within Groups,83.9,3.1074,27,,


In [9]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

def anova_1way(in_df):
    ret_df1 = calc_summary1(in_df)
    #result_df1.head(10)
    ret_df2 = calc_summary_2(in_df)
    #result_df2 = result_df2[['Source of Variance', 'SST/SSE', 'MST/MSE', 'df', 'F_Statistics', 'p_value']]
    return ret_df1, ret_df2
    
res_df1, res_df2 = anova_1way(anovadf)
res_df1.head(10)
res_df2.head(10)  
    

Unnamed: 0,Groups,Count,Sum,Average,Variance
0,New York,10,201,20.1,2.77
1,Texas,10,187,18.7,2.46
2,Oregon,10,201,20.1,4.1


Unnamed: 0,Source of Variance,SST/SSE,MST/MSE,df,F_Statistics,p_value
0,Between Groups,13.0667,6.5334,2,2.10253,0.141701
1,Within Groups,83.9,3.1074,27,,


## Covariance, Correlation, Least Squares Method in Regression Analysis
### Covariance and Correlation Coefficient
**Sample Covariance** measures the strength and the direction of the linear relationship between the elements of **two** samples. **Variance**, as defined before, deals with **one** sample of data, whereas **Covariance** measures how much, and in what direction (***positive, negative or not at all***), one variable changes when the second variable changes.

**Covariance** of two samples of data [x1, x2, ..., xn] and [y1, y2, ..., yn] is measured as

**Cov(x, y) = SUM((xi - x-bar)(yi - y-bar)) / (n - 1)** where

xi, yi = The ith value of the two samples (x, y) of data

x-bar, y-bar = Average of x-data sample and y-data sample

n = sample size

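**Positive Covariance** means the y-values tend to increase as the x-values increase. **Negative Covariance** means the y-values tend to decrease as the x-values increase. **Zero Covariance** (a covariance value of zero or close to zero) means there is **no, or almost no, linear relationship** between the x-values and y-values. This is weaker than independence: independent variables have zero covariance, but zero covariance does not by itself imply that the variables are independent.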
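A caveat worth illustrating: a covariance of zero only rules out a *linear* relationship. The sketch below uses made-up data and a hypothetical helper name (`sample_covariance`, deliberately unrounded and not one of the notebook's own functions) to show a positive, a negative and a zero covariance, where the zero case is a variable that is fully determined by x.

```python
# Sample covariance: SUM((xi - x_bar)(yi - y_bar)) / (n - 1)
def sample_covariance(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

xs = [-2, -1, 0, 1, 2]                 # illustrative data, not from the notebook
y_up = [2 * xi + 1 for xi in xs]       # y rises as x rises
y_down = [-3 * xi for xi in xs]        # y falls as x rises
y_square = [xi ** 2 for xi in xs]      # y depends on x, but not linearly

cov_up = sample_covariance(xs, y_up)
cov_down = sample_covariance(xs, y_down)
cov_square = sample_covariance(xs, y_square)
print(cov_up, cov_down, cov_square)    # prints: 5.0 -7.5 0.0
```

The third case has zero covariance even though y is completely determined by x, which is why zero covariance should be read as "no linear association" rather than "independent".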

**Sample Correlation**, also called **Correlation Coefficient** between data samples x and y is measured from the **Covariance** between x, y using the formula

**r-xy = (S-xy) / ((sigma-x)(sigma-y))** where 

r-xy = Correlation Coefficient between x and y

S-xy = Covariance between x, y

sigma-x = Standard Deviation of x

sigma-y = Standard Deviation of y

**Correlation Coefficient** is **unit-less** and has values between -1 (perfect anti-correlation) and +1 (perfect correlation). 

Positive, negative and zero/near-zero **Correlation Coefficients** are interpreted in the same way as positive, negative and zero/near-zero **Covariance**.
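To see what "unit-less" means in practice, the following sketch rescales one variable and compares the results. The numbers are invented for illustration, and `sample_covariance` and `pearson_r` are hypothetical helper names chosen so that they do not overwrite the notebook's own functions; the standard deviation comes from Python's built-in `statistics.stdev` (sample standard deviation).

```python
import statistics

def sample_covariance(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

def pearson_r(x, y):
    # r-xy = S-xy / (sigma-x * sigma-y), using sample standard deviations
    return sample_covariance(x, y) / (statistics.stdev(x) * statistics.stdev(y))

# Hypothetical bank data: assets and income, both in millions
assets_m = [10.0, 20.0, 30.0, 40.0, 50.0]
income_m = [1.2, 1.9, 3.2, 3.8, 5.1]
# The same assets expressed in thousands instead of millions
assets_k = [a * 1000 for a in assets_m]

cov_m = sample_covariance(assets_m, income_m)
cov_k = sample_covariance(assets_k, income_m)
r_m = pearson_r(assets_m, income_m)
r_k = pearson_r(assets_k, income_m)

print('Covariance (millions, thousands):', cov_m, cov_k)
print('Correlation (millions, thousands):', r_m, r_k)
```

Changing the unit of measurement multiplies the covariance by the same factor (1000 here), while the correlation coefficient stays the same up to floating-point noise. That is the practical reason correlation is easier to compare across data sets.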

We will use **Covariance** and the **Correlation Coefficient** in detail in **Regression Analysis (Predictive Analytics section)**. In **Regression Analysis** we will primarily use the **Least Squares** method of finding the best fit for the **Regression Line** through the data.

We will discuss the **Least Squares Method** briefly here and in more detail in the **Regression Analysis (Predictive Analytics)** section.

The **Covariance** and **Correlation Coefficient** of data samples can be calculated using Python as follows



In [73]:
x = [1, 2, 3, 4, 5]
y = [10, 20, 30, 40, 50]

def dot(v, w):
    return sum(v_i * w_i for v_i, w_i in zip(v, w))
print(dot(x, y))

def covariance(x, y):
    n = len(x) # length of both x and y are required to be the same
    return (dot(diff_from_mean(x), diff_from_mean(y)))/ (n-1)

print('Covariance between x and y = ', covariance(x, y))
print('===================================================')

def correlation(x, y):
    sdev_x = standard_deviation(x)
    sdev_y = standard_deviation(y)
    if sdev_x > 0 and sdev_y > 0:
        return covariance(x,y)/(sdev_x * sdev_y)
    else:
        return 0
    
print('Correlation between x and y = ', correlation(x, y))
print('===================================================')

550
Covariance between x and y =  25.0
Correlation between x and y =  1.0


### Least Squares Method
**Covariance** and **Correlation** are measures of linear association. In **Linear Regression** the first variable xi is called the **explanatory or predictive** variable. The corresponding observation yi, taken for the input xi, is called the **response**. For example, can we explain or predict the **income of banks (response variable)** from their **assets (explanatory variable)**?

In **Linear Regression**, the response variable is linearly related to the explanatory variable, but is subject to deviation, or **error**. So the relationship can be expressed as 


  **y-i = alpha + beta * x-i + error**
  
Our goal is, given the data (the x-i’s and y-i’s), to find the values of **alpha** and **beta** that give the line with the best fit to the data. The principle of **Least Squares Regression** states that the best choice of this linear relationship is the one that minimizes the **sum of the squared vertical distances (errors)** between the y values in the data and the y values on the regression line. Thus, our problem of finding the **best fit** line translates to a **minimization** problem.
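The quantity being minimized can be sketched in a few lines of Python. The data and the three candidate lines below are made up for illustration, and `sum_squared_errors` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
def sum_squared_errors(x, y, alpha, beta):
    # Sum of squared vertical distances between each observed y-i
    # and the value alpha + beta * x-i predicted by the line
    return sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))

x_obs = [1, 2, 3, 4, 5]
y_obs = [2.1, 3.9, 6.2, 7.8, 10.1]     # illustrative observations

# Candidate (alpha, beta) pairs, chosen by hand
candidates = {
    'too flat': (1.5, 1.5),
    'too steep': (-1.0, 2.5),
    'close': (0.0, 2.0),
}

sse_by_line = {name: sum_squared_errors(x_obs, y_obs, a, b)
               for name, (a, b) in candidates.items()}
best_candidate = min(sse_by_line, key=sse_by_line.get)

for name, sse in sse_by_line.items():
    print(name, round(sse, 2))
print('Smallest sum of squared errors:', best_candidate)
```

Among these three hand-picked lines the one labelled `close` has the smallest sum of squared errors. Least squares regression looks for the line that is best among *all* possible values of alpha and beta, not just a short list of candidates.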

This can be done with a small amount of calculus, or numerically with an iterative method such as "Gradient Descent"; we will **not do** either here. We will also have to note two important facts
* ***With the best fit, the errors sum to zero, so the average error is zero***
* ***The best fit line passes through the point (x-bar, y-bar)***

Skipping the calculus, the value of **beta** for the best fit (called **beta-hat**) is

**beta-hat = Covariance(x,y) / Variance (x)**

Also since the best fit line passes through (x-bar, y-bar), 

**y-bar = alpha-hat + beta-hat * x-bar + 0** (the average error is 0 for the best fit line)


**alpha-hat = y-bar - beta-hat * x-bar**

We have already created the Python functions for **Covariance(x,y) and Variance(x)**, **x-bar and y-bar**, and so we can easily calculate the value of **beta-hat** using those functions. Once **beta-hat** is calculated, **alpha-hat** can be calculated by substituting the values of **beta-hat, x-bar and y-bar**.
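As a minimal sketch of that calculation, the block below defines its own unrounded helpers (`exact_mean`, `sample_cov`, `sample_var` and `least_squares_fit` are hypothetical names), because the `mean` and `variance` functions defined earlier round to two decimals, which can distort the slope on small numbers. The data are invented for illustration.

```python
def exact_mean(values):
    return sum(values) / len(values)

def sample_cov(x, y):
    x_bar, y_bar = exact_mean(x), exact_mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)

def sample_var(x):
    # The variance of x is the covariance of x with itself
    return sample_cov(x, x)

def least_squares_fit(x, y):
    beta_hat = sample_cov(x, y) / sample_var(x)            # beta-hat = Cov(x,y) / Var(x)
    alpha_hat = exact_mean(y) - beta_hat * exact_mean(x)   # alpha-hat = y-bar - beta-hat * x-bar
    return alpha_hat, beta_hat

x_obs = [1, 2, 3, 4, 5]
y_obs = [2.1, 3.9, 6.2, 7.8, 10.1]     # illustrative observations

alpha_hat, beta_hat = least_squares_fit(x_obs, y_obs)
residuals = [yi - (alpha_hat + beta_hat * xi) for xi, yi in zip(x_obs, y_obs)]

print('alpha-hat =', round(alpha_hat, 4), 'beta-hat =', round(beta_hat, 4))
print('Sum of errors =', round(sum(residuals), 10))
print('Line at x-bar =', round(alpha_hat + beta_hat * exact_mean(x_obs), 4),
      'y-bar =', round(exact_mean(y_obs), 4))
```

For this data the fit is approximately alpha-hat = 0.05 and beta-hat = 1.99. The last two prints check the two facts used in the derivation: the errors add up to zero (up to floating-point noise), and the fitted line passes through (x-bar, y-bar).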

We will get back to this subject in more detail in the **Linear Regression (Predictive Analytics)** section.

The discussion we have had so far is called **Simple Linear Regression**, where the **dependent variable (response)** depends on a **single** **independent (explanatory) variable**.

We will also discuss the case of **Multiple Linear Regression**, where the **dependent variable (response)** depends on **multiple independent (explanatory) variables**.

A third method of regression, called **Logistic Regression**, will also be discussed.



## Probability, Conditional Probability, Bayes' Theorem
## Conditional Probability
**Conditional Probability is defined as the probability of an event ( A ), given that another ( B ) has already occurred.**

For any two events A and B, whether or not they are independent, the probability of the intersection of A and B (the probability that both events occur) is given by 
P(A and B) = P(A)P(B|A).

From this definition, the conditional probability P(B|A) is easily obtained by dividing by P(A):

**P(B|A) = P(B and A) / P(A)**
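A small worked example may help. The sketch below enumerates the 36 equally likely outcomes of rolling two fair dice and applies the formula directly; the events and the helper names (`prob`, `first_is_six`, `sum_at_least_ten`) are chosen purely for illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))

def first_is_six(o):          # event A
    return o[0] == 6

def sum_at_least_ten(o):      # event B
    return o[0] + o[1] >= 10

p_a = prob(first_is_six)
p_b = prob(sum_at_least_ten)
p_a_and_b = prob(lambda o: first_is_six(o) and sum_at_least_ten(o))

# P(B|A) = P(B and A) / P(A)
p_b_given_a = p_a_and_b / p_a

print('P(A) =', p_a, ' P(B) =', p_b, ' P(B|A) =', p_b_given_a)
```

Here P(B) is 1/6, but P(B|A) is 1/2: knowing that the first die shows a six makes a total of ten or more much more likely, so the two events are not independent.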

In the Predictive Analytics section we will learn a very widely used **Classification** algorithm called the **Naive Bayes Classification Algorithm**.

It is a Machine Learning algorithm that is often used on data sets with multiple attributes. It is cheap to compute and hence is often used to classify things in real time, such as deciding whether an email containing a set of key words is spam, whether a newly published article belongs to a given class of articles, or whether an insurance claim that was just submitted is genuine or fraudulent.

The **Bayes** part of the name comes from Thomas Bayes, after whom the foundational Bayes' theorem is named, and the **Naive** part of the name comes from the assumption that the factors guiding the occurrence of an event are **independent** of each other (given the class), even though in real life they may not be so (a somewhat **naive** assumption). Despite this, the algorithm often produces good, reliable results in practice and is widely used.



## Bayes' Theorem
Bayes' Theorem (also called Bayes' Law or Bayes' Formula) is stated as

***The probability of an event A, given that an event B has occurred, is equal to the probability of B given that A has occurred, multiplied by the probability of A, divided by the probability of B***

***P(A|B) = (P(B|A) X P(A))/P(B)***

where

P(A|B) = Probability of event A given the event B has occurred

P(B|A) = Probability of event B given the event A has occurred

P(A), P(B) = Probabilities of event A and B respectively
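The formula can be checked against simple counts. In the sketch below all numbers are invented: out of 1000 emails, 200 are spam, 120 of the spam emails contain the word "offer", and 40 of the non-spam emails contain it. Event A is "the email is spam" and event B is "the email contains the word offer".

```python
from fractions import Fraction

# Hypothetical counts, for illustration only
total_emails = 1000
spam = 200
spam_with_offer = 120
ham_with_offer = 40
with_offer = spam_with_offer + ham_with_offer

p_a = Fraction(spam, total_emails)              # P(A): email is spam
p_b = Fraction(with_offer, total_emails)        # P(B): email contains "offer"
p_b_given_a = Fraction(spam_with_offer, spam)   # P(B|A)

# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b_bayes = p_b_given_a * p_a / p_b

# Direct count: the share of "offer" emails that are spam
p_a_given_b_direct = Fraction(spam_with_offer, with_offer)

print('P(A|B) via Bayes =', p_a_given_b_bayes)
print('P(A|B) by counting =', p_a_given_b_direct)
```

Both routes give 3/4. Bayes' Theorem is most useful when P(A|B) cannot be counted directly but P(B|A), P(A) and P(B) are known or can be estimated.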

### Commonly used terms in Bayesian Classification
A is called the **Proposition** and B is called the **Evidence**

P(A) is called the **Prior Probability of Proposition** and P(B) is called the **Prior probability of Evidence**

P(A|B) is called the **Posterior**

P(B|A) is called the **Likelihood**


In other words

***Posterior = (Likelihood X Prior Probability of Proposition)/Prior Probability of Evidence***
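In practice the prior probability of the evidence is often not given directly. A common approach, not spelled out in the text above, is to obtain it from the law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A). The sketch below applies this to a made-up fraud example; the function name `posterior` and all the probabilities are assumptions for illustration.

```python
def posterior(prior, likelihood, likelihood_if_not):
    # Evidence via the law of total probability:
    # P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
    evidence = likelihood * prior + likelihood_if_not * (1 - prior)
    # Posterior = (Likelihood * Prior Probability of Proposition) / Evidence
    return likelihood * prior / evidence

# Hypothetical numbers: A = "claim is fraudulent", B = "claim is flagged by a rule"
p_fraud = 0.02               # prior: 2% of claims are fraudulent
p_flag_given_fraud = 0.90    # likelihood: 90% of fraudulent claims are flagged
p_flag_given_valid = 0.05    # 5% of valid claims are flagged as well

p_fraud_given_flag = posterior(p_fraud, p_flag_given_fraud, p_flag_given_valid)
print('P(fraud | flagged) =', round(p_fraud_given_flag, 4))
```

With these numbers the posterior is about 0.27. Even though the rule catches 90% of fraudulent claims, most flagged claims are still valid, because fraud is rare to begin with. This is the effect of the prior.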

### Bayes' Theorem as applied to the Naive Bayes Algorithm
In Machine Learning classification there are multiple classes C1, C2, C3, ... and each object to be classified has multiple features x1, x2, x3, ... (e.g. an insurance claim is in class 'Valid' or 'Fraud', and each claim has features such as 'amount of the claim', 'doctor submitting the claim', 'frequency of high-value claims for the same treatment by the same doctor' etc.). The aim of the algorithm is to determine the **Conditional Probability** that an object (an insurance claim) with features x1, x2, ..., xn belongs to a class Ci.
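The idea can be sketched with a toy example. Under the naive independence assumption, the probability of class Ci given the features is proportional to P(Ci) multiplied by P(x1|Ci), P(x2|Ci), and so on. The training data, feature names and the function `naive_bayes_posteriors` below are all hypothetical, and the sketch leaves out refinements such as smoothing for feature values never seen in a class.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical training data: (features of a claim, known class)
training_claims = [
    ({'amount': 'high', 'repeat_doctor': 'yes'}, 'Fraud'),
    ({'amount': 'high', 'repeat_doctor': 'yes'}, 'Fraud'),
    ({'amount': 'low', 'repeat_doctor': 'yes'}, 'Fraud'),
    ({'amount': 'high', 'repeat_doctor': 'no'}, 'Valid'),
    ({'amount': 'low', 'repeat_doctor': 'no'}, 'Valid'),
    ({'amount': 'low', 'repeat_doctor': 'no'}, 'Valid'),
    ({'amount': 'low', 'repeat_doctor': 'yes'}, 'Valid'),
    ({'amount': 'low', 'repeat_doctor': 'no'}, 'Valid'),
]

def naive_bayes_posteriors(training, new_features):
    class_counts = Counter(label for _, label in training)
    scores = {}
    for label, count in class_counts.items():
        score = Fraction(count, len(training))            # prior P(Ci)
        for name, value in new_features.items():
            matches = sum(1 for feats, lab in training
                          if lab == label and feats[name] == value)
            score *= Fraction(matches, count)             # likelihood P(xj | Ci)
        scores[label] = score
    evidence = sum(scores.values())                       # normalizing constant
    return {label: score / evidence for label, score in scores.items()}

new_claim = {'amount': 'high', 'repeat_doctor': 'yes'}
posteriors = naive_bayes_posteriors(training_claims, new_claim)
print(posteriors)
```

For the new claim the sketch assigns a probability of 10/11 to 'Fraud' and 1/11 to 'Valid', so the claim would be classified as 'Fraud'. Exact fractions are used only to keep the arithmetic easy to follow.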

We will learn Bayesian Classification and its calculation (using Python) in much more detail in the **Predictive Analytics** section.
