# Inferential Statistics
## Hypothesis Testing

1. Cumulative Distribution Function (CDF): stats.norm.cdf returns the probability of an observation less than or equal to a given value from the distribution. In other words: given a z-score, what is the cumulative probability up to that z-score?

2. Percent Point Function (PPF): stats.norm.ppf returns the value at or below which the given cumulative probability of the distribution lies. In other words: given a cumulative probability, what is the z-score?

3. PPF is the inverse of CDF: applying one and then the other returns the original value.
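
The inverse relationship between the two functions can be checked with a quick round trip. This is a minimal sketch; the z-score of 1.5 and the variable names are chosen only for illustration.

```python
import scipy.stats as stats

z = 1.5                         # an arbitrary z-score chosen for illustration
p = stats.norm.cdf(z)           # cumulative probability up to z
z_back = stats.norm.ppf(p)      # ppf maps that probability back to the z-score

print(p, z_back)
```

Up to floating-point error, `z_back` should equal the original `z`.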

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import math

In [2]:
print("confidence Interval of 95.45% is:")
stats.norm.ppf(0.02275),stats.norm.ppf(1 - 0.02275)   # given a cumulative probability, ppf returns the z-score

confidence Interval of 95.45% is:


(-2.0000024438996036, 2.0000024438996027)

In [3]:
print("confidence interval of 95% is:")
stats.norm.ppf(0.025),stats.norm.ppf(1 - 0.025)                       # 100 - 95 = 5, alpha = 5/100 = 0.05, alpha/2 = 0.05/2 = 0.025

confidence interval of 95% is:


(-1.9599639845400545, 1.959963984540054)

In [9]:
print("confidence interval of 99.69 is:")
stats.norm.ppf(0.00155),stats.norm.ppf(1-0.00155)            # 100 - 99.69 = 0.31, alpha = 0.31/100 = 0.0031, alpha/2 = 0.00155

confidence interval of 99.69 is:


(-2.9576439172550852, 2.957643917255075)

In [10]:
print("confidence interval of 99.7305 is:")
stats.norm.ppf(0.0013475),stats.norm.ppf(1-0.0013475)# 100 - 99.7305 = 0.2695,alpha = 0.2695/100 = 0.002695, alpha/2 = 0.0013475

confidence interval of 99.7305 is:


(-3.0005415302965868, 3.0005415302965988)
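
The same alpha/2 arithmetic is repeated in each cell above, so it can be wrapped in a small helper. This is only a sketch: `central_z_bounds` is a hypothetical name, not a SciPy function, and it assumes a standard normal distribution.

```python
import scipy.stats as stats

def central_z_bounds(confidence):
    """Return the z-scores that bound the central `confidence` share of N(0, 1)."""
    alpha = 1 - confidence                  # total probability left in the two tails
    lower = stats.norm.ppf(alpha / 2)       # z-score cutting off the left tail
    upper = stats.norm.ppf(1 - alpha / 2)   # z-score cutting off the right tail
    return lower, upper

for level in (0.95, 0.9545, 0.9973):
    print(level, central_z_bounds(level))
```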

In [24]:
print(stats.norm.cdf([-2.0]))
print(stats.norm.cdf([-1.95])) # a negative z-score gives a small cumulative probability (the left-tail area)
print(stats.norm.cdf([1.95]))   # a positive z-score gives the area up to that point, i.e. 1 minus the right-tail area
print(stats.norm.cdf([2.95]))
print(stats.norm.cdf([-3.000]))

[0.02275013]
[0.02558806]
[0.97441194]
[0.99841113]
[0.0013499]


In [23]:
stats.norm.ppf(0.9984)

2.9478425521848974

## Hypothesis Testing using Z - test and P - value

## P Value
- Sampling (and therefore any sample statistic) is never perfect. Some random samples will be unrepresentative and give a misleading impression. For example, one random sample whose mean weight is much lower than the true mean weight does not by itself show a problem with the overall weights if the other samples agree with the true mean.

The p-value is the probability, computed under the assumption that the null hypothesis is true, of obtaining a test statistic equal to or more extreme than the one actually observed. A small p-value means the observed result would be unusual if the null hypothesis held; it is not the probability that the null hypothesis is true.

If the p-value is less than the chosen significance level, you reject the null hypothesis, i.e. the sample gives reasonable evidence in favour of the alternative hypothesis. This does NOT imply a "meaningful" or "important" difference; that is for you to decide when considering the real-world relevance of the result.
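
The definition and the decision rule can be sketched for a z statistic as follows. The helper `z_p_value` and its `tail` argument are hypothetical names introduced for illustration, and the z value of 2.0 is an arbitrary example.

```python
import scipy.stats as stats

def z_p_value(z, tail="right"):
    """p-value of a z statistic under H0 for a right-, left- or two-tailed test."""
    if tail == "right":
        return stats.norm.sf(z)            # P(Z >= z)
    if tail == "left":
        return stats.norm.cdf(z)           # P(Z <= z)
    if tail == "two-sided":
        return 2 * stats.norm.sf(abs(z))   # area in both tails
    raise ValueError("tail must be 'right', 'left' or 'two-sided'")

alpha = 0.05
p = z_p_value(2.0, tail="right")
print(p, "reject H0" if p < alpha else "fail to reject H0")
```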

### Example 1

A principal at a certain school claims that the students in his school are of above-average intelligence. A random sample of thirty students' IQ scores has a mean of 112.5. Is there sufficient evidence to support the principal's claim? The population mean IQ is 100 with a standard deviation of 15. Use alpha = 0.05.

* Null hypothesis: the accepted fact is that the population mean is 100, so H0: μ = 100.
* Alternative hypothesis: the claim is that the students have above-average IQ scores, so H1: μ > 100.

Because we are looking for scores "greater than" a certain point, this is a one-tailed (right-tailed) test.

In [2]:
alpha = 0.05
# critical z-values that leave alpha = 0.05 in each tail; the right-tailed test uses the upper one
stats.norm.ppf(0.05),stats.norm.ppf(0.95)

(-1.6448536269514729, 1.6448536269514722)

In [13]:
n = 30
xbar = 112.5
mean = 100
std_d = 15
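# z_test_stat = (xbar - mean) / (std_d / np.sqrt(n)), where std_d / np.sqrt(n) is about 2.7387
z_test_stat = 12.5/2.7387
print("z test statistic:",z_test_stat)
print("z score:", 1.64)   # critical value stats.norm.ppf(0.95), truncated to two decimals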

z test statistic: 4.564209296381494
z score: 1.64


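Z Test
- Since this is a right-tailed test, the test statistic must be greater than the critical z-value (1.645 at alpha = 0.05) to reject the null hypothesis.
- If it is less than the critical z-value, you cannot reject the null hypothesis.
- Here the test statistic (4.56) is greater than 1.645, so the null hypothesis is rejected: the sample supports the principal's claim.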
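
The whole critical-value test for Example 1 can be written without hard-coded intermediate numbers. This is a sketch under the example's stated values; the variable names `mu0`, `sigma`, `z_stat` and `z_crit` are illustrative.

```python
import numpy as np
import scipy.stats as stats

n, xbar, mu0, sigma = 30, 112.5, 100, 15   # values from Example 1
alpha = 0.05

standard_error = sigma / np.sqrt(n)        # about 2.7386
z_stat = (xbar - mu0) / standard_error     # the parentheses around the difference matter
z_crit = stats.norm.ppf(1 - alpha)         # right-tailed critical value, about 1.645

reject_h0 = z_stat > z_crit
print(z_stat, z_crit, reject_h0)
```

Note that writing `xbar - mu0 / standard_error` without parentheses would divide only `mu0` by the standard error and give a different, incorrect statistic.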

P value

In [18]:
p_val_upper_reg = stats.norm.cdf(z_test_stat)   # area to the left of the test statistic
# the right-tailed p-value is the remaining area to the right: 1 - cdf
p_val_upper_reg = 1 - p_val_upper_reg
p_val_upper_reg

2.5069020465062763e-06

In [20]:
p_value_lower_region = stats.norm.cdf(-1 * z_test_stat)   # by symmetry, the left tail below -z equals the right tail above z
p_value_lower_region

2.506902046495801e-06
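
The two results above differ slightly in their last digits because `1 - cdf` subtracts two nearly equal numbers. SciPy also provides the survival function `stats.norm.sf`, which computes the upper-tail area directly. The sketch below uses a deliberately extreme z value, chosen only to make the effect visible.

```python
import scipy.stats as stats

z = 10.0                                   # deliberately extreme, for illustration
via_cdf = 1 - stats.norm.cdf(z)            # cdf(z) rounds to 1.0, so this collapses to 0.0
via_sf = stats.norm.sf(z)                  # computed directly in the upper tail

print(via_cdf, via_sf)
```

For moderate z values such as the 4.56 in Example 1, both approaches agree to many decimal places.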

### Example 3
An engineer measured the Brinell hardness of 25 pieces of ductile iron that were subcritically annealed. The resulting data were: [170,167,174,179,179,187,179,183,179,156,163,156,187,156,167,156,174,170,183,179,174,179,170,159,187]

The engineer hypothesized that the mean Brinell hardness of all such ductile iron pieces is greater than 170. Therefore, he was interested in testing the hypotheses:
* H0: μ = 170
* HA: μ > 170

The engineer did some basic statistics:
* *n=25, xbar=172.52, s=10.31, se=2.06*

In [2]:
sample = [170,167,174,179,179,187,179,183,179,156,163,156,187,156,167,156,174,170,183,179,174,179,170,159,187]

In [3]:
data = pd.Series(sample)
n = len(data)
xbar = round(data.mean(), 2)    # sample mean
s = round(data.std(), 2)        # sample standard deviation (pandas divides by n - 1 by default)
μ = 170                         # hypothesized mean under H0

In [4]:
tt = (xbar - μ) / (s / np.sqrt(n))    # test statistic: (sample mean - hypothesized mean) / standard error
p_value = stats.t.sf(np.abs(tt), n - 1) * 2     # two-sided p-value = Prob(|T| > |tt|); halve it for the one-sided HA when tt > 0
print("t statistic = %6.3f, p value = %6.4f" % (tt, p_value))

In [6]:
stats.ttest_1samp(a = sample,popmean = 170)

Ttest_1sampResult(statistic=1.2218430153659992, pvalue=0.23363279636357662)

In [7]:
round(stats.t.ppf(q = 0.95, df = len(sample) - 1), 4)   # right-tailed critical value: q = 1 - alpha, df = n - 1

1.7109

## t-test-statistic
If the engineer set his significance level α at 0.05 and used the critical value approach to conduct his hypothesis test, he would reject the null hypothesis if his t-test statistic were greater than 1.7109, the 0.95 quantile of the t distribution with n - 1 = 24 degrees of freedom.


Since the engineer's t-test statistic (1.22), is not greater than 1.7109, the engineer fails to reject the null hypothesis. That is, the test statistic does not fall in the "critical region." There is insufficient evidence, at the α = 0.05 level, to conclude that the mean Brinell hardness of all such ductile iron pieces is greater than 170.
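
The critical-value decision for the hardness data can be reproduced end to end. This is a sketch assuming the one-sided alternative HA: μ > 170 stated above; the names `mu0`, `t_stat`, `t_crit` and `p_one_sided` are illustrative.

```python
import numpy as np
import scipy.stats as stats

sample = [170, 167, 174, 179, 179, 187, 179, 183, 179, 156, 163, 156, 187,
          156, 167, 156, 174, 170, 183, 179, 174, 179, 170, 159, 187]
mu0, alpha = 170, 0.05

n = len(sample)
xbar = np.mean(sample)
s = np.std(sample, ddof=1)                   # sample standard deviation (n - 1 in the denominator)
t_stat = (xbar - mu0) / (s / np.sqrt(n))     # one-sample t statistic

t_crit = stats.t.ppf(1 - alpha, df=n - 1)    # right-tailed critical value
p_one_sided = stats.t.sf(t_stat, df=n - 1)   # P(T >= t_stat) under H0

print(t_stat, t_crit, p_one_sided, t_stat > t_crit)
```

The one-sided p-value is half of the two-sided value reported by `stats.ttest_1samp`, and it is still well above 0.05, which agrees with the critical-value conclusion.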

