## Tests of Goodness of Fit for Normal Distribution
Note: The normal distribution is continuous, so we must change how the categories are defined and how the expected frequencies are computed.

Chemline hires approximately 400 new employees annually for its four plants located throughout the United States. An aptitude test is given to new employees. The personnel director asks whether a normal distribution applies for the population of test scores. If such a distribution can be used, the distribution would be helpful in evaluating specific test scores; that is, scores in the upper 20%, lower 40%, and so on, could be identified quickly. Hence, we want to test the null hypothesis that the population of test scores has a normal distribution.

### Load Sample Data, Compute Sample Mean and Standard Deviation

In [1]:
from scipy.stats import chi2, norm
import numpy as np

In [2]:
# Create a numpy array to store the 50 sample test scores
score = np.array([71, 66, 61, 65, 54, 93, 60, 86, 70, 70, 73, 73, 55, 63, 56, 62, 76, 54, 82, 79, 76, 68, 53, 58, 85, 80,
                  56, 61, 61, 64, 65, 62, 90, 69, 76, 79, 77, 54, 64, 74, 65, 65, 61, 56, 63, 80, 56, 71, 79, 84])

In [3]:
X_bar = score.mean()     # sample mean
s = score.std(ddof=1)    # sample standard deviation (n - 1 in the denominator)
print('{} {:.10f}'.format(X_bar, s))

68.42 10.4140603542
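
The `ddof=1` argument matters here: NumPy's default (`ddof=0`) divides by $n$, whereas the sample standard deviation divides by $n - 1$. A minimal sketch of the same calculation written out by hand, using the scores above (the names `n`, `x_bar`, and `s_manual` are illustrative only):

```python
import numpy as np

score = np.array([71, 66, 61, 65, 54, 93, 60, 86, 70, 70, 73, 73, 55, 63, 56, 62, 76, 54, 82, 79, 76, 68, 53, 58, 85, 80,
                  56, 61, 61, 64, 65, 62, 90, 69, 76, 79, 77, 54, 64, 74, 65, 65, 61, 56, 63, 80, 56, 71, 79, 84])

n = score.size
x_bar = score.sum() / float(n)

# Sample standard deviation: squared deviations summed, divided by n - 1
s_manual = np.sqrt(((score - x_bar) ** 2).sum() / (n - 1))

print('{:.4f}'.format(s_manual))      # agrees with score.std(ddof=1)
print('{:.4f}'.format(score.std()))   # ddof=0 version, slightly smaller
```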


### Develop Hypotheses
Using these values, we state the following hypotheses about the distribution of the job applicant test scores. 

$H_0$: The population of test scores has a normal distribution with mean 68.42 and standard deviation 10.41 

$H_a$: The population of test scores does not have a normal distribution with mean 68.42 and standard deviation 10.41
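
If $H_0$ is not rejected, the fitted normal distribution can be used to locate cutoff scores such as the upper 20% or lower 40% mentioned in the introduction. A minimal sketch, assuming the rounded mean and standard deviation stated in $H_0$ (the variable names are illustrative only):

```python
from scipy.stats import norm

mu, sigma = 68.42, 10.41   # values stated in H0

# Score above which the top 20% of test takers fall (80th percentile)
upper_20_cutoff = norm.ppf(0.80, loc=mu, scale=sigma)
# Score below which the bottom 40% of test takers fall (40th percentile)
lower_40_cutoff = norm.ppf(0.40, loc=mu, scale=sigma)

print('{:.2f} {:.2f}'.format(upper_20_cutoff, lower_40_cutoff))
```

Under these assumptions the cutoffs come out at roughly 77.2 and 65.8.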

### Establish Categories
* With the continuous normal probability distribution, we must use a different procedure for defining the categories. We need to define the categories in terms of intervals of test scores. 

* Recall the rule of thumb for an expected frequency of at least five in each interval or category. We define the categories of test scores such that the expected frequencies will be at least five for each category. With a sample size of 50, one way is to divide the normal distribution into 10 equal-probability intervals.
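
The choice of 10 intervals follows directly from the rule of thumb: with equal-probability intervals, each expected frequency is $n/k$, so the largest usable $k$ is $n/5$ rounded down. A small sketch of that arithmetic (`max_equal_probability_intervals` is a hypothetical helper, not part of any library):

```python
def max_equal_probability_intervals(n, min_expected=5):
    """Largest k for which the expected frequency n / k is at least min_expected."""
    return n // min_expected

n = 50
k = max_equal_probability_intervals(n)
expected_per_interval = n / float(k)

print('{} {}'.format(k, expected_per_interval))
```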

In [4]:
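# Interval boundaries: the 10th, 20th, ..., 90th percentiles of the fitted normal distribution
bins = [-float('inf')]
for i in range(9):
    p = (i + 1) / 10.        # cumulative probability
    z = norm.ppf(p)          # standard normal quantile for p
    bound = X_bar + z * s    # corresponding test score
    print([p, '{:6.4f}'.format(z), '{:6.4f}'.format(bound)])
    bins.append(bound)
bins.append(float('inf'))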

[0.1, '-1.2816', '55.0738']
[0.2, '-0.8416', '59.6553']
[0.3, '-0.5244', '62.9589']
[0.4, '-0.2533', '65.7816']
[0.5, '0.0000', '68.4200']
[0.6, '0.2533', '71.0584']
[0.7, '0.5244', '73.8811']
[0.8, '0.8416', '77.1847']
[0.9, '1.2816', '81.7662']
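
The same boundaries can also be obtained without a loop, since `norm.ppf` accepts an array of probabilities together with `loc` and `scale`. A sketch under the assumption that the sample statistics are the values printed earlier (`probs` and `inner_bounds` are illustrative names):

```python
import numpy as np
from scipy.stats import norm

X_bar, s = 68.42, 10.4140603542   # sample statistics printed earlier

probs = np.arange(1, 10) / 10.0                       # 0.1, 0.2, ..., 0.9
inner_bounds = norm.ppf(probs, loc=X_bar, scale=s)    # percentiles of N(X_bar, s)
bins = np.concatenate(([-np.inf], inner_bounds, [np.inf]))

print(np.round(inner_bounds, 4))
```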


### Observed and Expected Frequencies

In [5]:
frequency = []
for i in range(10):
    # Count the scores in [bins[i], bins[i+1]); each True counts as 1
    observed = sum(bins[i] <= num < bins[i+1] for num in score)
    expected = 50 * .1    # sample size times the probability of each interval
    print('{:2d} {}'.format(observed, expected))
    frequency.append((observed, expected))

 5 5.0
 5 5.0
 7 5.0
 8 5.0
 2 5.0
 5 5.0
 2 5.0
 5 5.0
 5 5.0
 6 5.0
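
The test statistic computed in the next cell compares the observed frequencies $f_i$ with the expected frequencies $e_i$ over the $k = 10$ intervals:

$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$$

With $e_i = 5$ in every interval, the squared deviations from the table above are 0, 0, 4, 9, 9, 0, 9, 0, 0, and 1, so

$$\chi^2 = \frac{0 + 0 + 4 + 9 + 9 + 0 + 9 + 0 + 0 + 1}{5} = \frac{32}{5} = 6.4$$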


In [6]:
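chi_square = sum((obs - exp)**2. / exp for obs, exp in frequency)

# Critical value at the 0.05 level of significance
# df = k - p - 1 = 10 - 2 - 1 = 7, where k is the number of intervals and
# p = 2 is the number of parameters estimated from the sample (mean and standard deviation)
crit = chi2.ppf(0.95, 7)

p_value = 1 - chi2.cdf(chi_square, 7)
print('{:.1f} {:.10f} {:.12f}'.format(chi_square, crit, p_value))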

6.4 14.0671404493 0.493894649969
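
As a cross-check, `scipy.stats.chisquare` computes the same statistic and p-value in one call. The sketch below takes the observed counts from the output above and assumes a 0.05 level of significance, matching the 0.95 quantile used for the critical value; `alpha` and `decision` are illustrative names.

```python
from scipy.stats import chisquare

observed = [5, 5, 7, 8, 2, 5, 2, 5, 5, 6]   # counts printed above
expected = [5.0] * 10                        # 50 * 0.1 per interval

# ddof=2 accounts for the two estimated parameters (mean and standard deviation),
# so the p-value uses k - 1 - ddof = 10 - 1 - 2 = 7 degrees of freedom
stat, p_value = chisquare(observed, f_exp=expected, ddof=2)

alpha = 0.05   # assumed level of significance
decision = 'reject H0' if p_value < alpha else 'do not reject H0'
print('{:.1f} {:.4f} {}'.format(stat, p_value, decision))
```

Because the p-value (about 0.49) exceeds 0.05, and equivalently 6.4 is below the critical value of 14.07, $H_0$ is not rejected: the sample gives no evidence against a normal distribution for the test scores.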


### Comment
* The numbers here differ slightly from those in the book (*Statistics for Business and Economics*) because of rounding in the book's intermediate values.
    * The values here are computed at full floating-point precision.
    * The 30th percentile here is 62.96, whereas it is 63.01 in the book.
    * As a result, the two scores of 63 fall in the 4th interval here rather than the 3rd, giving counts of 7 and 8 for the 3rd and 4th intervals. The book reports 9 and 6, respectively.
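
The size of the discrepancy is consistent with two-decimal rounding. The sketch below assumes, as an inference rather than something stated in the source, that the book uses $z = -0.52$ and $s = 10.41$, as a printed z-table would give:

```python
from scipy.stats import norm

X_bar, s = 68.42, 10.4140603542   # sample statistics printed earlier

# Full-precision 30th percentile, as computed in this notebook
bound_full = X_bar + norm.ppf(0.3) * s
# Assumption: z and s rounded to two decimals
bound_rounded = 68.42 + (-0.52) * 10.41

print('{:.2f} {:.2f}'.format(bound_full, bound_rounded))
# A score of 63 lies between the two bounds, so it changes interval
print(bound_full < 63 < bound_rounded)
```

Under that assumption the rounded boundary is 63.01, and a score of 63 sits between the two versions of the boundary, which accounts for the shift of two observations between the 3rd and 4th intervals.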


### Summary
<img src="Fig12-6.bmp">