## Chi-square test


At a steel plant, statistical quality-control methods have been used very successfully
in controlling slab width on continuous casting units.
The company claims that these methods produced a large reduction in the variance of steel slab width.

Suppose the required level for the variance of steel slab widths is 156 (squared units).
A test is carried out to determine whether the variance is above the required level,
with the intention of taking corrective action *if it is concluded that the variance is greater than 156.*

A random sample of 25 slabs gives a sample variance of 175.
Using α = 0.05, should corrective action be taken?


In [19]:
#importing libraries
from scipy.stats import chi2
from scipy.stats import chisquare
from scipy.stats import chi2_contingency

In [20]:
# Sample size n = 25, sample variance s^2 = 175
# Hypothesized variance σ0^2 = 156
# H0: σ^2 <= 156, H1: σ^2 > 156, tested at significance level 0.05
# Right-tailed test: the rejection region is the upper 0.05 area of the distribution.

# Test statistic: χ^2 = (n - 1) * s^2 / σ0^2

test_statistic = (25-1)*175/156
print("test_statistic:",test_statistic)

# Critical value from the chi-square distribution for the given significance level
# and degrees of freedom (df = n - 1 = 24).
# Because this is a right-tailed test, q = 0.95 gives the critical value.

crit_value = chi2.ppf(df=24,q=1-0.05)
print("crit_value:",crit_value)

# p-value for the test statistic with df = 24.
# chi2.cdf gives the cumulative probability (area to the left of the test statistic),
# so the p-value of this right-tailed test is 1 - chi2.cdf(); H0 is rejected when it is below 0.05.

pvalue = 1-chi2.cdf(test_statistic, df=24)

print("pvalue:",pvalue)

# Rejection criterion for the right-tailed test

if test_statistic > crit_value:
    print('\nNull Rejected')
else:
    print('\nFailed to Reject Null')


test_statistic: 26.923076923076923
crit_value: 36.41502850180731
pvalue: 0.30805154704070414

Failed to Reject Null


*The test statistic is less than the critical value => so the null hypothesis is not rejected.*

*The p-value is greater than the significance level => so the null hypothesis is not rejected.*

*So the sample data do not show that the variance is greater than 156. Hence no corrective action is needed.*
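
The steps of this right-tailed variance test can be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: the function name `variance_test_right_tail` and its parameters are hypothetical, and it uses `chi2.sf` (the upper-tail area) in place of `1 - chi2.cdf`.

```python
from scipy.stats import chi2


def variance_test_right_tail(n, sample_var, hyp_var, alpha=0.05):
    """Right-tailed chi-square test of H0: sigma^2 <= hyp_var (hypothetical helper)."""
    df = n - 1
    stat = df * sample_var / hyp_var      # (n - 1) * s^2 / sigma0^2
    crit = chi2.ppf(1 - alpha, df)        # upper critical value
    pvalue = chi2.sf(stat, df)            # upper-tail area, equivalent to 1 - cdf
    reject = bool(stat > crit)            # True means H0 is rejected
    return stat, crit, pvalue, reject


# Steel slab example: n = 25, s^2 = 175, hypothesized variance 156
stat, crit, pvalue, reject = variance_test_right_tail(25, 175, 156)
print("test_statistic:", stat)
print("crit_value:", crit)
print("pvalue:", pvalue)
print("reject H0:", reject)
```

Comparing the statistic with the critical value and comparing the p-value with alpha always lead to the same decision, so returning both is only for convenience.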

### Two-tailed chi-square test

Test the claim at the 5% significance level.

H0: σ^2 = 152, H1: σ^2 != 152

This is a two-tailed test.

A sample of n = 13 has a standard deviation of 7.2.

What is the decision?

In [47]:
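# Critical values for a 5% two-tailed test at df = 13 - 1 = 12 are the
# 0.025 and 0.975 quantiles of the chi-square distribution.
# chi2.interval returns the (lower, upper) pair enclosing the central 95%.
# The confidence level is passed positionally because its keyword name
# differs between SciPy versions (alpha in older releases, confidence in newer ones).

criticial_values = chi2.interval(0.95, df=12) # (lower, upper)
lower_critical_value = criticial_values[0]
upper_critical_value = criticial_values[1]

# Test statistic: χ^2 = (n - 1) * s^2 / σ0^2
test_statistic = (13-1)*(7.2**2)/(152)

# Lower-tail probability of the test statistic; compare it with 0.05/2 = 0.025.
# (The two-sided p-value is twice this value, about 0.037.)
pvalue = chi2.cdf(test_statistic, df=12)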

print("\ncriticial_values:",criticial_values)
print('\ntest_statistic:',test_statistic)
print("\npvalue:",pvalue)

# Rejection criterion for the null hypothesis (two-tailed).
if test_statistic < lower_critical_value or test_statistic > upper_critical_value:
    print('\nNull Rejected')
else:
    print('\nFailed to Reject Null')



criticial_values: (4.403788506981702, 23.33666415864534)

test_statistic: 4.092631578947369

pvalue: 0.018293771496678428

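Null Rejected


*The test statistic is below the lower critical value => so the null hypothesis is rejected.*

*So the sample data indicate that the variance differs from 152.*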
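
For a two-tailed test, the probability returned by `chi2.cdf` covers only one tail. A common convention, assumed here, is to double the smaller of the two tail areas to get a two-sided p-value that can be compared directly with the significance level. A minimal sketch using the same numbers:

```python
from scipy.stats import chi2

n, s, hyp_var, alpha = 13, 7.2, 152, 0.05
df = n - 1
stat = df * s**2 / hyp_var

lower_tail = chi2.cdf(stat, df)   # area to the left of the statistic
upper_tail = chi2.sf(stat, df)    # area to the right of the statistic

# Equal-tailed convention (an assumption): double the smaller tail area.
p_two_sided = 2 * min(lower_tail, upper_tail)

print("two-sided p-value:", p_two_sided)
print("reject H0:", p_two_sided < alpha)
```

Comparing `p_two_sided` with 0.05 is equivalent to comparing the smaller tail area with 0.025, so the decision agrees with the critical-value check.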


### Goodness of fit test

NULL H: The observed frequencies match the expected frequencies.

ALTERNATE H: The observed frequencies do not match the expected frequencies.

In [9]:
#expected and observed frequencies.
expected = [10, 20, 30, 40, 50]
observed = [18, 12, 20, 42, 58]

from scipy.stats import chisquare

# Test statistic and p-value (df = number of categories - 1 = 4).
stat, p = chisquare(f_obs=observed, f_exp=expected)
print('\n\t ch2 stat:', stat)
print('\n\t p value:', p)

signi_level = 0.05

# Rejection criterion for the null hypothesis.
if p <= signi_level: 
    print('\n\t REJECT NULL') 
else: 
    print('\n\t FAIL TO REJECT NULL') 


	 ch2 stat: 14.313333333333334

	 p value: 0.006359337473559783

	 REJECT NULL


*The null hypothesis is rejected.*

*So the observed frequencies differ significantly from the expected frequencies.*
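
To see where the statistic comes from, it can be recomputed by hand as the sum of (observed - expected)^2 / expected over the categories. The sketch below uses NumPy for the arithmetic; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2

observed = np.array([18, 12, 20, 42, 58])
expected = np.array([10, 20, 30, 40, 50])

# Per-category contribution to the chi-square statistic
contributions = (observed - expected) ** 2 / expected
stat = contributions.sum()

# Degrees of freedom: number of categories minus one
df = len(observed) - 1
pvalue = chi2.sf(stat, df)   # upper-tail area

print("contributions:", contributions)
print("ch2 stat:", stat)
print("p value:", pvalue)
```

Inspecting `contributions` shows which categories drive the result; here the first category contributes the most.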

### Test for homogeneity

Claim to test: "Survey response is influenced by the gender of the pollster."
One group of people is polled by male pollsters and another by female pollsters, and the responses are tabulated.
The topic, abortion rights, is chosen purposely to test this claim. Use alpha = 5%.

| Response | Interviewer: Man | Interviewer: Woman |
|---|---|---|
| Men-Agree | 560 | 308 |
| Men-Disagree | 240 | 92 |

NULL H: Survey response is not influenced by the pollster's gender, i.e. the distribution of responses is the same for male and female interviewers.

ALTERNATE H: Survey response is influenced by the pollster's gender, i.e. the distribution of responses differs between male and female interviewers.

In [14]:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency 
 
data = [[560,308], [240,92]] 
print(data)
stat, p, dof, expected = chi2_contingency(data) 

print('\n\t ch2 stat:', stat)
print('\n\t p value:', p)
print('\n\t degrees of freedom:', dof)
print('\n\t expected frequencies:', '\n\t', expected[0], '\n\t', expected[1])
 
signi_level = 0.05

# Rejection criterion for the null hypothesis.
if p <= signi_level: 
    print('\n\t REJECT NULL') 
else: 
    print('\n\t FAIL TO REJECT NULL') 

[[560, 308], [240, 92]]

	 ch2 stat: 6.184241574593305

	 p value: 0.012889293542101093

	 degrees of freedom: 1

	 expected frequencies: 
	 [578.66666667 289.33333333] 
	 [221.33333333 110.66666667]

	 REJECT NULL


*The null hypothesis is rejected.*

*Hence survey response is influenced by the pollster's gender, i.e. the responses given to male and female interviewers are not equally distributed.*
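
The expected frequencies reported by `chi2_contingency` are row total × column total / grand total. For a 2x2 table (one degree of freedom), the function also applies Yates' continuity correction by default, which is why the statistic above is slightly smaller than the textbook formula gives. The sketch below reproduces the expected counts by hand and compares the corrected and uncorrected statistics.

```python
import numpy as np
from scipy.stats import chi2_contingency

data = np.array([[560, 308], [240, 92]])

# Expected counts under H0: (row total * column total) / grand total
row_totals = data.sum(axis=1, keepdims=True)
col_totals = data.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / data.sum()

# Default call: Yates' continuity correction is applied because dof = 1
stat_corr, p_corr, dof, expected = chi2_contingency(data)

# Same test without the continuity correction
stat_raw, p_raw, _, _ = chi2_contingency(data, correction=False)

print("expected (manual):\n", expected_manual)
print("with correction:    stat =", stat_corr, " p =", p_corr)
print("without correction: stat =", stat_raw, " p =", p_raw)
```

For this table both versions fall below the 5% level, so the decision does not depend on the correction.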

### Test for independence

Test the claim that Smoking & Gender are independent.

NULL H: Smoking & Gender are independent

ALTERNATE H: Smoking & Gender are not independent

In [16]:
data = pd.DataFrame({'Gender' : ['F','M', 'M', 'M', 'F', 'F','M', 'M','F','M'] * 5,
                   'Smoking' : ['Smoker','Non-Smoker', 'Smoker', 'Non-Smoker', 'Non-Smoker', 
                                'Smoker','Smoker','Smoker','Non-Smoker','Smoker'] * 5
                  })

In [17]:
data_table = pd.crosstab(data['Gender'], data['Smoking'])
data_table

| Gender | Non-Smoker | Smoker |
|---|---|---|
| F | 10 | 10 |
| M | 10 | 20 |


In [18]:
stat, p, dof, expected = chi2_contingency(data_table) 

print('\n\t ch2 stat:', stat)
print('\n\t p value:', p)
print('\n\t degrees of freedom:', dof)
print('\n\t expected frequencies:', '\n\t', expected[0], '\n\t', expected[1])
 
signi_level = 0.05

# Rejection criterion for the null hypothesis.
if p <= signi_level: 
    print('\n\t REJECT NULL') 
else: 
    print('\n\t FAIL TO REJECT NULL') 


	 ch2 stat: 0.78125

	 p value: 0.3767591178115821

	 degrees of freedom: 1

	 expected frequencies: 
	 [ 8. 12.] 
	 [12. 18.]

	 FAIL TO REJECT NULL


*We fail to reject the null hypothesis.*

*Hence the sample data do not provide evidence that Smoking & Gender are dependent.*
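
The same decision can be reached with the critical-value approach used in the variance tests: compare the statistic with the upper 5% point of the chi-square distribution at the reported degrees of freedom. A minimal sketch, with the contingency table entered directly as a nested list instead of a crosstab:

```python
from scipy.stats import chi2, chi2_contingency

# Gender (rows: F, M) by Smoking (columns: Non-Smoker, Smoker)
table = [[10, 10], [10, 20]]

stat, p, dof, expected = chi2_contingency(table)

# A test of independence is right-tailed: reject H0 for large statistics
signi_level = 0.05
crit_value = chi2.ppf(1 - signi_level, dof)
reject = bool(stat > crit_value)

print("ch2 stat:", stat)
print("crit_value:", crit_value)
print("reject H0:", reject)
```

The p-value rule and the critical-value rule are equivalent here, since the p-value is the upper-tail area beyond the statistic.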