In [1]:
#All code present was constructed by and is the explicit property of Kerry Hall. 
#Problems and datasets sourced from "Probability and Statistics for Engineers and Scientists" by Walpole, Myers, Myers, and Ye, 9th ed.
import numpy as np
from scipy.stats import norm, chi2, t
from statistics import mean, median, mode

Problem set: 8.4, 8.6, 8.24, 8.28, 8.42, 8.50

8.4: The number of tickets issued for traffic violations by 8 state troopers during the Memorial Day weekend are 5, 4, 7, 7, 6, 3, 8, 6. (a) If these values represent the number of tickets issued by a random sample of 8 state troopers from Montgomery County in Virginia, define a suitable population. (b) If the values represent the tickets issued by a random sample of 8 state troopers from South Carolina, define a suitable population.

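For part (a), a suitable population is the number of tickets issued over the Memorial Day weekend by each state trooper in Montgomery County, Virginia, since the sample was drawn at random from those troopers. For part (b), a suitable population is the number of tickets issued over the same weekend by each state trooper in South Carolina. In both cases the population is the group from which the sample was randomly drawn. Extending it to all of Virginia or to the whole United States would not be justified, because troopers outside the sampled group had no chance of being selected. The appropriate boundary therefore depends on how the sample was taken, and a different sampling scheme would call for a different population.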

8.6: Find the mean, median, and mode for the sample whose observations, 15, 7, 8, 95, 19, 12, 8, 22, and 14, represent the number of sick days claimed on 9 federal income tax returns. Which value appears to be the best measure of the center of these data? State the reasons for your preference.

In [2]:
#The 95 is moved to the end of the array so that it can be sliced off when computing the mean without it for comparison
set_86 = np.array([15, 7, 8, 19, 12, 8, 22, 14, 95], float)
print("\nThe mean is: ", np.round(mean(set_86), 2), 
      "\n\nThe mode is: ", mode(set_86), 
      "\n\nThe median is: ", median(set_86),
      "\n\nThe mean without the 95 is: ", mean(set_86[:-1]), sep="")


The mean is: 22.22

The mode is: 8.0

The median is: 14.0

The mean without the 95 is: 13.125


The median best represents this dataset. The single claim of 95 days is an outlier, and because the mean weights every observation by its size, that one value pulls the mean up to 22.22, which is larger than every observation except the 95 itself. Removing the 95 brings the mean down to 13.125, close to the median of 14, which shows how much of the mean is due to that single value. The median depends only on the middle of the ordered data, so it is barely affected by the outlier. The mode of 8 reflects nothing more than one repeated value in a small sample and sits near the low end of the data, so it says little about the center. It is still worth computing all three, since they describe related aspects of a dataset, but their usefulness depends on what the data actually look like.
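
The sensitivity of each measure to the value of 95 can be checked directly. The following is a minimal sketch using only the standard-library `statistics` module. The helper name `summarize` and the variable names are hypothetical and not part of the notebook above.

```python
from statistics import mean, median

def summarize(values):
    """Return (mean, median) of a sequence of numbers."""
    return mean(values), median(values)

sick_days = [15, 7, 8, 19, 12, 8, 22, 14, 95]
without_outlier = [x for x in sick_days if x != 95]

mean_all, median_all = summarize(sick_days)
mean_trim, median_trim = summarize(without_outlier)

# Dropping the 95 moves the mean by about 9 days but the median by only 1 day.
print("mean shift:  ", round(mean_all - mean_trim, 2))
print("median shift:", median_all - median_trim)
```

The much smaller shift in the median is what makes it the more robust measure of center for this sample.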

8.24: If a certain machine makes electrical resistors having a mean resistance of 40 ohms and a standard deviation of 2 ohms, what is the probability that a random sample of 36 of these resistors will have a combined resistance of more than 1458 ohms?

In [3]:
n824 = 36
mu824 = 40
std824 = 2
resistance_needed = 1458
mu_rqd = resistance_needed/n824
condition_p824 = 1 - norm.cdf((mu_rqd-mu824)*np.sqrt(n824)/std824)
print("\nThe probability that the resistors will have a combined resistance greater than 1458 ohms is: ", np.round(100*condition_p824, 2),"%.\n", sep="")


The probability that the resistors will have a combined resistance greater than 1458 ohms is: 6.68%.
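
The same probability can be cross-checked by working with the sum directly instead of the sample mean: under the central limit theorem, the total of n resistances is approximately normal with mean n*mu and standard deviation sigma*sqrt(n). This is a minimal sketch, assuming Python 3.8 or later for `statistics.NormalDist`, which is used here in place of `scipy.stats.norm` so that the block needs only the standard library. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n, mu, sigma = 36, 40, 2
total_needed = 1458

# Approximate sampling distribution of the sum of n resistances
total_dist = NormalDist(mu=n * mu, sigma=sigma * sqrt(n))

# P(sum > 1458) is the upper-tail area beyond the required total
p_exceeds = 1 - total_dist.cdf(total_needed)
print(f"P(total > {total_needed} ohms) = {100 * p_exceeds:.2f}%")
```

Both routes standardize to the same z value of 1.5, so they should agree.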



8.28: A random sample of size 25 is taken from a normal population having a mean of 80 and a standard deviation of 5. A second random sample of size 36 is taken from a different normal population having a mean of 75 and a standard deviation of 3. Find the probability that the sample mean computed from the 25 measurements will exceed the sample mean computed from the 36 measurements by at least 3.4 but less than 5.9. Assume the difference of the means to be measured to the nearest tenth.

In [4]:
#Intermediate variables hold the mean and standard deviation of the difference between the two sample means
s1_mu = 80
s2_mu = 75
s1_std = 5
s2_std = 3
n1 = 25
n2 = 36
UL828 = 5.9
LL828 = 3.4
diff_mu = s1_mu - s2_mu
diff_std = np.sqrt(s1_std**2/n1 + s2_std**2/n2)
condition_p828 = norm.cdf((UL828-diff_mu)/diff_std) - norm.cdf((LL828-diff_mu)/diff_std)
print("\nThe probability that the first sample's mean will exceed the second sample's mean by a value between 3.4 and 5.9 is: ", np.round(100*condition_p828, 2),"%.\n", sep="")


The probability that the first sample's mean will exceed the second sample's mean by a value between 3.4 and 5.9 is: 71.34%.
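
The problem statement asks that the difference of the means be treated as measured to the nearest tenth. One common reading of that instruction, sketched here as an assumption rather than as the method used in the cell above, is to apply a continuity correction of half a unit (0.05) to each limit, so that "at least 3.4 but less than 5.9" covers the recorded values 3.4 through 5.8, that is, the interval from 3.35 to 5.85. The sketch uses `statistics.NormalDist` (Python 3.8 or later) and illustrative variable names.

```python
from math import sqrt
from statistics import NormalDist

mu1, sigma1, n1 = 80, 5, 25
mu2, sigma2, n2 = 75, 3, 36

# Sampling distribution of (xbar1 - xbar2) for two independent normal samples
diff_dist = NormalDist(mu=mu1 - mu2,
                       sigma=sqrt(sigma1**2 / n1 + sigma2**2 / n2))

half_unit = 0.05  # half of the 0.1 measurement resolution (assumed correction)
lower = 3.4 - half_unit
upper = 5.9 - half_unit

p_corrected = diff_dist.cdf(upper) - diff_dist.cdf(lower)
print(f"P({lower:.2f} < difference < {upper:.2f}) = {p_corrected:.4f}")
```

Under this reading the probability comes out slightly below the uncorrected 71.34%, because shifting the interval down by 0.05 trades area near the center of the distribution for area further out in the lower tail.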



8.42: The scores on a placement test given to college freshmen for the past five years are approximately normally distributed with a mean of 74 and a variance of 8. Would you consider a variance of 8 to be a valid value of the variance if a random sample of 20 students who take the placement test this year obtain a sample variance value of 20?

A sample variance of 20 is 2.5 times the claimed population variance of 8, which already suggests that the claimed value may no longer hold, but a chi-squared test is needed to judge whether a difference this large could plausibly arise by chance in a sample of 20. If the test rejects the claimed variance, one possible explanation is the presence of two groups among this year's students, one scoring above the historical mean and the other below it. Looking at the students' different backgrounds would be a reasonable next step in searching for the source of the extra spread.

In [5]:
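mu842 = 74  #Population mean (given, but not needed for the variance test)
var842 = 8
n842 = 20
svar842 = 20
#The statistic (n-1)*s^2/sigma^2 follows a chi-squared distribution with n-1 degrees of freedom
chi_val = (n842 - 1)*svar842/var842
chi_ll = chi2.ppf(0.025, df=n842 - 1)
chi_ul = chi2.ppf(0.975, df=n842 - 1)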
print("\nThe calculated Chi squared value is: ", np.round(chi_val, 2),
      "\n\nThe Chi upper limit is: ", np.round(chi_ul, 2),
      "\n\nThe Chi lower limit is: ", np.round(chi_ll, 2),
      "\n\nAs the calculated value is outside the limit, it is unreasonable to maintain the assumption that the variance is still valid.\n", sep="")


The calculated Chi squared value is: 47.5

The Chi upper limit is: 32.85

The Chi lower limit is: 8.91

As the calculated value is outside the limit, it is unreasonable to maintain the assumption that the variance is still valid.
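
The chi-squared comparison above can be sanity-checked by simulation: draw many samples of 20 scores from a normal population with variance 8 and count how often the sample variance reaches 20. This is a rough sketch using only the standard library. The seed, the trial count, the helper `sample_variance`, and all variable names are arbitrary choices made for illustration.

```python
import random
from math import sqrt

random.seed(0)  # arbitrary seed so the run is repeatable

def sample_variance(xs):
    """Unbiased sample variance (n - 1 denominator)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

mu, pop_var, n = 74, 8, 20
observed_var = 20
trials = 20_000

hits = 0
for _ in range(trials):
    sample = [random.gauss(mu, sqrt(pop_var)) for _ in range(n)]
    if sample_variance(sample) >= observed_var:
        hits += 1

fraction = hits / trials
print(f"Fraction of simulated samples with variance >= {observed_var}: {fraction:.4f}")
```

A fraction well below 2.5% is consistent with the calculated chi-squared value lying beyond the upper limit.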



8.50: A maker of a certain brand of low-fat cereal bars claims that the average saturated fat content is 0.5 gram. In a random sample of 8 cereal bars of this brand, the saturated fat content was 0.6, 0.7, 0.7, 0.3, 0.4, 0.5, 0.4, and 0.2. Would you agree with the claim? Assume a normal distribution.

In [6]:
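fat_contents = np.array([0.6, 0.7, 0.7, 0.3, 0.4, 0.5, 0.4, 0.2])
mu850 = 0.5
sample_mu850 = mean(fat_contents)
n850 = len(fat_contents)
std850 = np.std(fat_contents, ddof=1)  #ddof=1 gives the sample standard deviation
#t statistic with n-1 degrees of freedom, compared against the two-sided 95% limits
calc_t = np.sqrt(n850)*(sample_mu850-mu850)/std850
ul850 = t.ppf(0.975, n850 - 1)
ll850 = t.ppf(0.025, n850 - 1)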
print("\nBecause the upper t-limit is: ", np.round(ul850, 2), ", and the lower t-limit is: ", 
      np.round(ll850, 2), " and because the calculated t value of: ", np.round(calc_t, 2), 
      " falls within that range, there is insufficient evidence to disagree with the claim.", sep="")


Because the upper t-limit is: 2.36, and the lower t-limit is: -2.36 and because the calculated t value of: -0.39 falls within that range, there is insufficient evidence to disagree with the claim.
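
An equivalent way to view this result is through a 95% confidence interval for the mean: if the claimed 0.5 gram lies inside the interval, the two-sided test at the 5% level does not reject the claim. The sketch below uses only the standard library and hard-codes the critical value 2.365 for 7 degrees of freedom from a standard t table (the value that `t.ppf(0.975, 7)` prints as 2.36 above). The variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

fat = [0.6, 0.7, 0.7, 0.3, 0.4, 0.5, 0.4, 0.2]
claimed_mean = 0.5
t_crit = 2.365  # t table value for 7 degrees of freedom, upper 2.5% tail

n = len(fat)
xbar = mean(fat)
s = stdev(fat)  # sample standard deviation (n - 1 denominator)
margin = t_crit * s / sqrt(n)

ci_low, ci_high = xbar - margin, xbar + margin
print(f"95% CI for the mean: ({ci_low:.3f}, {ci_high:.3f})")
print("Claim inside interval:", ci_low <= claimed_mean <= ci_high)
```

The interval also shows how imprecise a sample of 8 bars is: it is wide enough that the data are compatible with a range of mean fat contents around the claimed value.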
