# Advanced Statistics - Hypothesis Testing

### 1. Reference Material
<br>1. Basics of Probability and Statistics - https://www.analyticsvidhya.com/blog/2017/02/basic-probability-data-science-with-examples/
<br>2. Cartoon Guide to Statistics - Go to Google Drive (E-Book Folder)
<br>3. Inferential Statistics - https://www.analyticsvidhya.com/blog/2017/01/comprehensive-practical-guide-inferential-statistics-data-science/
<br>4. Permutations and combinations at mathisfun.com
<br>5. Descriptive Stats Udacity - https://www.udacity.com/course/intro-to-descriptive-statistics--ud827#
<br>6. Inferential Stats Udacity - https://www.udacity.com/course/intro-to-inferential-statistics--ud201#
<br>7. Stats & Probability detailed Course by Khanacademy.org - https://www.khanacademy.org/math/statistics-probability
<br>8. Statistics Crash Course Playlist by Khanacademy - https://www.youtube.com/watch?v=uhxtUt_-GyM&list=PL1328115D3D8A2566

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as stat
import math as m
from scipy.stats import binom

### 6. T-tests

T-tests are very similar to the z-score approach. The only difference is that instead of the Population Standard Deviation, we now use the Sample Standard Deviation. The rest is the same as before: we calculate probabilities on the basis of t-values.

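The Sample Standard Deviation is given as:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$

where $X_i$ are the sample values, $\bar{X}$ is the sample mean, $n$ is the sample size, and the $n-1$ in the denominator is Bessel's correction for estimating the population parameter.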
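Bessel's correction can be checked numerically. The sketch below uses a small hypothetical sample (the values are made up for illustration) and compares a manual calculation with NumPy, which divides by n unless `ddof=1` is passed.

```python
import numpy as np

# Hypothetical sample, chosen only for illustration.
sample = np.array([4.0, 7.0, 6.0, 9.0, 4.0])
n = sample.size

# Manual computation with n - 1 in the denominator (Bessel's correction).
manual_sd = np.sqrt(np.sum((sample - sample.mean()) ** 2) / (n - 1))

# np.std divides by n by default; ddof=1 switches the denominator to n - 1.
numpy_sample_sd = np.std(sample, ddof=1)
numpy_population_sd = np.std(sample)  # ddof=0, divides by n

print(manual_sd, numpy_sample_sd, numpy_population_sd)
```

Note that pandas `Series.std()` uses `ddof=1` by default, so it already returns the sample standard deviation.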

Another difference between z-scores and t-values is that t-values depend on the degrees of freedom of a sample. Let us define what the degrees of freedom of a sample are.

#### The Degrees of Freedom
It is the number of values in a sample that are free to vary. For example, in a sample of size 10 with mean 10, 9 values can be arbitrary but the 10th value is forced by the sample mean. A one-sample t-test on n observations therefore has n - 1 degrees of freedom.
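A minimal sketch of the example above: nine values are drawn arbitrarily (the random range and seed are assumptions made only for illustration), and the tenth is then fully determined by the requirement that the mean equals 10.

```python
import numpy as np

n, target_mean = 10, 10.0

# Nine values chosen freely (here at random, purely for illustration).
rng = np.random.default_rng(0)
free_values = rng.uniform(0, 20, size=n - 1)

# The tenth value has no freedom left: it must make the total equal n * mean.
forced_value = n * target_mean - free_values.sum()

sample = np.append(free_values, forced_value)
print(forced_value, sample.mean())
```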

Points to note about the t-tests:

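1. The greater the difference between the sample mean and the population mean, the greater the chance of rejecting the Null Hypothesis. Why? (We discussed this above.)
2. For a given difference and sample standard deviation, the greater the sample size, the greater the chance of rejecting the Null Hypothesis, because the standard error shrinks as the sample grows.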
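The effect of sample size can be sketched with the one-sample t statistic, (sample mean - hypothesised mean) / (s / sqrt(n)). The helper `one_sample_t` and the numbers passed to it are hypothetical; the point is only that, with the difference and standard deviation held fixed, the statistic grows with the square root of n.

```python
import math

def one_sample_t(sample_mean, hypothesised_mean, sample_sd, n):
    """t statistic for a one-sample test, computed from summary statistics."""
    return (sample_mean - hypothesised_mean) / (sample_sd / math.sqrt(n))

# Same difference (1.0) and same standard deviation (50.0), growing sample size.
for n in (25, 100, 400, 1600):
    print(n, round(one_sample_t(201.0, 200.0, 50.0, n), 3))
```

Each fourfold increase in n doubles the t statistic, which is why even a small difference can become statistically significant in a very large sample.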

#### (A) One Sample T-Test

This is the same test as we described above. This test is used to:

* Determine whether the mean of a group differs from the specified value.
* Calculate a range of values that are likely to include the population mean.

For example, a pizza delivery manager may perform a 1-sample t-test to check whether their delivery time is significantly different from the 30 minutes advertised by their competitors.

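The test statistic is:

$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}}$$

where $\bar{X}$ = sample mean

$\mu$ = population mean (the value stated in the Null Hypothesis)

$s$ = sample standard deviation

$n$ = sample size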
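As a rough check of the one-sample statistic, (sample mean - population mean) / (s / sqrt(n)), the sketch below computes it by hand and compares it with `stat.ttest_1samp`. The `delivery_times` values and `mu0` are hypothetical, loosely following the pizza example.

```python
import numpy as np
import scipy.stats as stat

# Hypothetical delivery times in minutes (made up for illustration).
delivery_times = np.array([28.5, 31.0, 33.2, 29.8, 35.1, 30.4, 32.7, 34.0])
mu0 = 30.0  # advertised time, i.e. the mean under the Null Hypothesis

n = delivery_times.size
x_bar = delivery_times.mean()
s = delivery_times.std(ddof=1)  # sample standard deviation

t_manual = (x_bar - mu0) / (s / np.sqrt(n))
# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_manual = 2 * stat.t.sf(abs(t_manual), df=n - 1)

result = stat.ttest_1samp(delivery_times, mu0)
print(t_manual, p_manual)
print(result.statistic, result.pvalue)
```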


In [2]:
# SciPy's built-in functions for t-tests
# stat.ttest_1samp - one sample
# stat.ttest_ind   - two independent samples
# stat.ttest_rel   - paired samples

In [3]:
hr_attr = pd.read_csv('03_Challenge1.csv')

In [4]:
hr_attr.head()

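| | EmpNo | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | Dept | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A1 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | A2 | 0.8 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | A3 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | A4 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | A5 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |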


**H0 : average_montly_hours = 200 <br>
H1 : average_montly_hours != 200**

In [5]:
# Named mean_hours rather than m, so the `math as m` import is not overwritten
mean_hours = hr_attr.average_montly_hours.mean()
mean_hours

201.0503366891126

In [5]:
?stat.ttest_1samp

In [6]:
stat.ttest_1samp(hr_attr.average_montly_hours, 200)

Ttest_1sampResult(statistic=2.5756342895976716, pvalue=0.010015125950754933)

In [7]:
alpha = 0.05  # significance level
pvalue = stat.ttest_1samp(hr_attr.average_montly_hours, 200).pvalue
print(pvalue)
if pvalue <= alpha:
    print("Reject the Null Hypothesis, i.e. average monthly hours differ from 200")
else:
    print("Failed to reject the Null Hypothesis, i.e. no evidence that average monthly hours differ from 200")

0.010015125950754933
Reject the Null Hypothesis, i.e. average monthly hours differ from 200
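The second use of the one-sample test, a range of values likely to include the population mean, can be sketched as a t-based confidence interval. The helper `mean_confidence_interval` is a hypothetical name, and the small input list is made up so the block runs without the CSV file.

```python
import numpy as np
import scipy.stats as stat

def mean_confidence_interval(sample, confidence=0.95):
    """Two-sided t-based confidence interval for the population mean."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    mean = sample.mean()
    standard_error = sample.std(ddof=1) / np.sqrt(n)
    # Critical t value for the chosen confidence level, n - 1 degrees of freedom.
    t_critical = stat.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    margin = t_critical * standard_error
    return mean - margin, mean + margin

# Illustrative data only.
low, high = mean_confidence_interval([1, 2, 3, 4, 5])
print(low, high)
```

With the notebook's data this could be called as `mean_confidence_interval(hr_attr.average_montly_hours)`. A 95% interval that excludes 200 corresponds to rejecting H0 at the 5% level in the two-sided test.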


#### (B) Two Sample T-Test

This test is used to:
1. Determine whether the means of two independent groups differ.
2. Calculate a range of values that is likely to include the difference between the population means.
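The sketch below shows what `stat.ttest_ind` computes for two independent groups. The groups are simulated (sizes, means and spreads are assumptions made for illustration). By default the function pools the two variances (`equal_var=True`); passing `equal_var=False` gives Welch's version, which does not assume equal variances.

```python
import numpy as np
import scipy.stats as stat

# Two simulated, independent groups (illustrative values only).
rng = np.random.default_rng(42)
group_a = rng.normal(loc=200, scale=50, size=120)
group_b = rng.normal(loc=205, scale=50, size=100)

n_a, n_b = group_a.size, group_b.size
var_a, var_b = group_a.var(ddof=1), group_b.var(ddof=1)
mean_diff = group_a.mean() - group_b.mean()

# Pooled-variance (Student) t statistic.
pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
t_pooled = mean_diff / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Welch's t statistic: each group keeps its own variance.
t_welch = mean_diff / np.sqrt(var_a / n_a + var_b / n_b)

student = stat.ttest_ind(group_a, group_b)                 # equal_var=True by default
welch = stat.ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test

print(t_pooled, student.statistic)
print(t_welch, welch.statistic)
```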


#### Demo
**H0 : Average monthly hours is the same for low and medium salary <br>
H1 : Average monthly hours is not the same for low and medium salary**

In [8]:
hr_attr.head()

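| | EmpNo | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | Dept | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A1 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | A2 | 0.8 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | A3 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | A4 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | A5 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |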


In [9]:
?stat.ttest_ind

In [10]:
hr_attr.groupby('salary').average_montly_hours.describe()

| salary | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| high | 1237.0 | 199.867421 | 47.710446 | 96.0 | 161.0 | 199.0 | 241.0 | 307.0 |
| low | 7316.0 | 200.996583 | 50.832214 | 96.0 | 155.0 | 199.0 | 246.0 | 310.0 |
| medium | 6446.0 | 201.338349 | 49.344188 | 96.0 | 156.0 | 201.0 | 245.0 | 310.0 |


In [11]:
stat.ttest_ind(hr_attr.loc[hr_attr.salary == 'low','average_montly_hours'], 
               hr_attr.loc[hr_attr.salary == 'medium','average_montly_hours'])

Ttest_indResult(statistic=-0.39900653336152675, pvalue=0.6898945822032513)

In [12]:
low=hr_attr.loc[hr_attr['salary']=='low','average_montly_hours']
medium=hr_attr.loc[hr_attr['salary']=='medium','average_montly_hours']
high=hr_attr.loc[hr_attr['salary']=='high','average_montly_hours']

In [13]:
stat.ttest_ind(high,medium)

Ttest_indResult(statistic=-0.9654006464129494, pvalue=0.3343745714238685)

In [14]:
stat.ttest_ind(high,low)

Ttest_indResult(statistic=-0.7288680398062308, pvalue=0.4661023478267259)

#### (C) Paired T Test

In [20]:
?stat.ttest_rel

In [21]:
pre = np.array([67, 45, 78, 90])
post= np.array([78, 59, 85, 92])

In [22]:
pre.mean()

70.0

In [23]:
post.mean()

78.5

In [24]:
stat.ttest_rel(pre,post)

Ttest_relResult(statistic=-3.2716515254078793, pvalue=0.04671879927774855)

In [25]:
# A paired t-test can be converted into a one-sample t-test on the differences
# c = pre - post
# H0 --> mu_diff = 0
# Ha --> mu_diff != 0
# stat.ttest_1samp(c, 0)
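The equivalence noted in the cell above can be verified directly. This sketch reuses the `pre` and `post` arrays from the notebook, computes the t statistic of the differences by hand, and compares it with both SciPy functions.

```python
import numpy as np
import scipy.stats as stat

pre = np.array([67, 45, 78, 90])
post = np.array([78, 59, 85, 92])

# Paired differences; the test asks whether their mean differs from 0.
diff = pre - post
n = diff.size
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))

paired = stat.ttest_rel(pre, post)
one_sample = stat.ttest_1samp(diff, 0)

# All three statistics should agree (about -3.27 for this data).
print(t_manual, paired.statistic, one_sample.statistic)
print(paired.pvalue, one_sample.pvalue)
```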

### Types of Errors in Hypothesis Testing

- Type I error - the null hypothesis is actually true, but we reject it.
- Type II error - the null hypothesis is false, but we fail to reject it.

Now that we have defined a basic Hypothesis Testing framework, it is important to look at the mistakes that can be made while performing Hypothesis Testing and to classify them where possible.
<br>If we look at the definition of the Null Hypothesis, we notice that it is a statement made by the tester, like you and me, and not an established fact. The Null Hypothesis can therefore be true or false, and our decision about it can be wrong in either case.
<br><br>There are two types of errors that are generally encountered while conducting Hypothesis Testing.
<br>**(A) Type I Error:** Consider the following scenario: a male human tests positive for being pregnant. Since this is not possible, it is a case of False Positive. More formally, a Type I error is the incorrect rejection of a true Null Hypothesis. The Null Hypothesis in this case would be: the male human is not pregnant.
<br>**(B) Type II Error:** Consider another scenario where the Null Hypothesis is: a male human is pregnant, and the test supports the Null Hypothesis. This is a case of False Negative. More formally, a Type II error is the failure to reject a false Null Hypothesis.
<br>The image below summarizes the types of error:
<img src="ht_errors.jpg">
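A small simulation can make the Type I error concrete. When the Null Hypothesis is true, the significance level alpha is the probability of rejecting it by mistake. The sketch below draws many samples from a population whose mean really is 200 and counts how often a one-sample t-test rejects H0 at alpha = 0.05. The sample size, number of simulations and population parameters are assumptions chosen for illustration.

```python
import numpy as np
import scipy.stats as stat

rng = np.random.default_rng(0)
alpha = 0.05
n_simulations = 2000
sample_size = 30
true_mean = 200.0  # H0 is true by construction

rejections = 0
for _ in range(n_simulations):
    sample = rng.normal(loc=true_mean, scale=50.0, size=sample_size)
    p_value = stat.ttest_1samp(sample, true_mean).pvalue
    if p_value <= alpha:
        rejections += 1  # H0 is true, so every rejection is a Type I error

type_i_rate = rejections / n_simulations
print(type_i_rate)  # expected to be close to alpha
```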