# Chapter 6: Inference for categorical data

### Guided Practice 6.2

We get that $SE_{\hat{p}} = \sqrt{\frac{\hat{p} \times (1 - \hat{p})}{n}} = 0.0159$.

### Guided Practice 6.4

A plausible pair of hypotheses would be $H_{0}$: support and opposition are the same fraction ($p = 0.50$) and $H_{A}$: support and opposition are in different proportions ($p \ne 0.50$).

### Guided Practice 6.5

Let's check the success-failure condition with the null value: $np_{0} = 412.5 \approx 413 = n(1 - p_{0})$. Both expected counts are well above 10, so we can use the normal distribution.

### Guided Practice 6.8

We want $1.645 \times \sqrt{\frac{p(1 - p)}{n}} \lt 0.01$. It is convenient to write a Python routine that takes `p`, the confidence level and the margin of error as input and returns the minimum sample size `n` that achieves it. We run the function with the three proportions found previously and look at the respective `n`.

In [1]:
import math

from scipy import stats

def get_sample_size(p, confidence_level, error_margin):
    # Two-sided critical value, rounded to two decimals as in a z-table.
    z = round(stats.norm.interval(confidence_level)[-1], 2)

    # Round up: rounding down would leave the margin of error above the target.
    return math.ceil((z ** 2) * ((p * (1 - p)) / (error_margin ** 2)))

for p in [0.017, 0.062, 0.013]:
    print(f"With p = {p}, we need a sample size of n = {get_sample_size(p, 0.90, 0.01)}.")


With p = 0.017, we need a sample size of n = 450.
With p = 0.062, we need a sample size of n = 1565.
With p = 0.013, we need a sample size of n = 346.


### Guided Practice 6.10

We can again leverage the previously created function to answer this question.

In [2]:
get_sample_size(0.70, 0.95, 0.05)

323
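
As a quick sanity check, the sample size can be plugged back into the margin-of-error formula. The sketch below assumes the same rounded critical value $z = 1.96$ that `get_sample_size` uses for a 95% confidence level; the helper name `margin_of_error` is ours and not part of the original notebook.

```python
import math

def margin_of_error(p, n, z):
    """Margin of error z * sqrt(p * (1 - p) / n) for a single proportion."""
    return z * math.sqrt(p * (1 - p) / n)

# z = 1.96 is the rounded 95% critical value (an assumption matching the routine above).
moe_323 = margin_of_error(0.70, 323, 1.96)
moe_322 = margin_of_error(0.70, 322, 1.96)

print(f"n = 323: margin of error = {moe_323:.4f}")
print(f"n = 322: margin of error = {moe_322:.4f}")
```

With one observation fewer the margin of error rises just above the 5% target, so 323 is indeed the smallest sample size that works.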

### Exercise 6.1 - Vegetarian college students.

* (a) Since $np = 60 \times 0.08 = 4.8 \lt 10$ and $n(1 - p) = 60 \times 0.92 = 55.2$, the success-failure condition fails and the normal distribution won't work well.
* (b) Yes, since the proportion is closer to 0 than to 1 and the sample size is not very big for such a proportion.
* (c) A sample size of $n = 125$ would give us $SE_{\hat{p}} = 0.024$ and therefore $Z \approx 1.65$. This is less than 2 standard errors from the expected proportion, so the value is not unusual.
* (d) With a sample size of $n = 250$ we get $SE_{\hat{p}} = 0.017$ and therefore $Z = 2.33$, so the same sample proportion would now be considered unusual.
* (e) False. Doubling the sample size from 125 to 250 reduces the standard error by a factor of $\frac{1}{\sqrt{2}}$, i.e. by about 29%, not by half.
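
The Z-scores in (c) and (d) can be reproduced with a few lines of Python. This is a minimal sketch: the observed proportion `p_hat = 0.12` is an assumption inferred from the reported Z-scores, and `z_score` is a hypothetical helper name.

```python
import math

def z_score(p_hat, p, n):
    """Z-score of a sample proportion when the true proportion is p."""
    se = math.sqrt(p * (1 - p) / n)
    return (p_hat - p) / se

p = 0.08      # population proportion used in part (a)
p_hat = 0.12  # assumed observed proportion (inferred, not stated above)

z_125 = z_score(p_hat, p, 125)
z_250 = z_score(p_hat, p, 250)
print(f"n = 125: Z = {z_125:.2f}")
print(f"n = 250: Z = {z_250:.2f}")
```

Because the standard errors are not rounded before dividing, the first value comes out slightly lower than when dividing by the rounded 0.024.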

### Exercise 6.2 - Young Americans, Part I.

* (a) Slightly left skewed, since $p$ is closer to 1 and the sample size is only 20.
* (b) Since $np = 30.8$ and $n(1 - p) = 9.2 \lt 10$, we fail the success-failure condition, so the normal approximation won't work well.
* (c) Since $SE_{\hat{p}} = 0.054$ we have that $Z = 1.47$, which is within 2 standard errors, so the observation would not be considered unusual.
* (d) In this case, $Z = 2.08$, so the observation would be considered unusual.

### Exercise 6.3 - Orange tabbies.

* (a) It is left skewed: the sample size is small and the proportion is close to 1, so we expect a noticeable left skew.
* (b) True, since that would halve the _standard error_ (remember that $n$ is under the square root).
* (c) True, since the large sample size makes the normal approximation reasonable: $np = 140 \times 0.90 = 126$ and $n(1 - p) = 14$, both at least 10.
* (d) True, a bigger sample size would increase both _success-failure condition_ values.

### Exercise 6.4 - Young Americans, Part II.

* (a) True, since we have a very small sample size and a proportion closer to 0. 
* (b) True, since we need to meet both of the following:

$$
\begin{cases}np \ge 10 \\ n(1 - p) \ge 10 \end{cases} \Rightarrow \begin{cases}n \ge 40 \\ n \ge 13.33 \end{cases} \Rightarrow n \ge 40
$$

* (c) We have $SE_{\hat{p}} = 0.057$ so $Z = -0.88$, therefore it is not considered unusual.
* (d) We now have $Z = -1.53$, which is still within 2 standard errors, so the sample proportion is not unusual.
* (e) False, a sample size $3n$ will decrease the _standard error_ by a factor $\frac{1}{\sqrt{3}}$.
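
The bound in (b) follows from solving both success-failure inequalities for $n$. A minimal sketch, assuming the usual threshold of 10 expected successes and failures and $p = 0.25$ (the value implied by the bounds above); `min_n_success_failure` is a hypothetical helper name.

```python
import math

def min_n_success_failure(p, threshold=10):
    """Smallest n with n * p >= threshold and n * (1 - p) >= threshold."""
    # The binding constraint is the one for the rarer outcome.
    return math.ceil(threshold / min(p, 1 - p))

print(min_n_success_failure(0.25))
```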



### Exercise 6.5 - Gender equality.

* (a) False, a confidence interval is a statement about the population proportion, not about the sample.
* (b) True, since a confidence interval built from the sample data gives a range of plausible values for the population proportion.
* (c) True, since this is what the confidence level means: about 95% of the intervals built this way capture the population proportion.
* (d) True, since quadrupling the sample size will halve the _standard error_.
* (e) True, since the entire confidence interval lies above 50%.

### Exercise 6.6 - Elderly drivers.

* (a) We have $SE_{\hat{p}} = 0.015$, so our 95% confidence interval is $[63.08\%,\ 68.91\%]$, i.e. a margin of error of about 3%, so the values are correct.
* (b) No, since the whole interval lies below 70%.

### Exercise 6.7 - Fireworks on July 4<sup>th</sup>.

For a 95% confidence interval we have a margin of error of $1.96 * SE_{\hat{p}} = 0.04 = 4\%$.

### Exercise 6.8 - Life rating in Greece.

* (a) The parameter of interest is the proportion $p$ of all Greek people living in a condition poor enough to be considered "suffering". Its point estimate from the sample is $\hat{p} = 0.25$.
* (b) We check the _success-failure condition_ with the sample proportion: $n\hat{p} = 0.25 \times 1000 = 250$ and $n(1 - \hat{p}) = 0.75 \times 1000 = 750$, both at least 10.
* (c) The 95% confidence interval is $[22.32\%,\ 27.68\%]$.
* (d) A higher confidence level would give us a wider confidence interval.
* (e) A larger sample would shrink the standard error, thus shrinking the confidence interval itself.
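
Parts (c) to (e) can be explored numerically. The sketch below builds a normal-approximation confidence interval for a single proportion using the numbers from (b) and (c); `proportion_ci` is a hypothetical helper name, and the 99% level and the sample size of 4000 are illustrative choices.

```python
from scipy import stats

def proportion_ci(p_hat, n, confidence_level=0.95):
    """Normal-approximation (Wald) confidence interval for one proportion."""
    z = stats.norm.ppf(1 - (1 - confidence_level) / 2)  # two-sided critical value
    margin = z * (p_hat * (1 - p_hat) / n) ** 0.5
    return p_hat - margin, p_hat + margin

low_95, high_95 = proportion_ci(0.25, 1000)        # part (c)
low_99, high_99 = proportion_ci(0.25, 1000, 0.99)  # part (d): higher confidence level
low_4k, high_4k = proportion_ci(0.25, 4000)        # part (e): larger sample

print(f"95%, n = 1000: [{low_95:.4f}, {high_95:.4f}]")
print(f"99%, n = 1000: [{low_99:.4f}, {high_99:.4f}]")
print(f"95%, n = 4000: [{low_4k:.4f}, {high_4k:.4f}]")
```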

### Exercise 6.9 - Study abroad.

* (a) An optional web survey may be completed by those who choose to do so, therefore this is not a representative sample of the population of interest.
* (b) The 90% confidence interval is $[52.89\%,\ 57.11\%]$. We are 90% confident that the proportion of students sure to spend a study period abroad is between 52.89% and 57.11%.
* (c) It means that if we took many samples and built a 90% confidence interval from each of them, about 90% of these intervals would capture the true population parameter.
* (d) If this sample were representative, then yes since the interval is above 50%.
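
The interpretation in (c) can be illustrated by simulation. A minimal sketch, assuming a hypothetical true proportion of 0.55 and samples of size 1500 (both illustrative values, not taken from the exercise):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

true_p = 0.55      # hypothetical population proportion
n = 1500           # hypothetical sample size
n_samples = 10_000
z = stats.norm.ppf(0.95)  # critical value for a 90% two-sided interval

# Draw many samples and build a 90% confidence interval from each one.
p_hats = rng.binomial(n, true_p, size=n_samples) / n
margins = z * np.sqrt(p_hats * (1 - p_hats) / n)
covered = (p_hats - margins <= true_p) & (true_p <= p_hats + margins)

coverage = covered.mean()
print(f"Share of intervals capturing the true proportion: {coverage:.3f}")
```

The printed share should land close to 0.90, which is what the confidence level promises in the long run.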

### Exercise 6.10 - Legalization of marijuana, Part I.

* (a) It is a sample statistic, since it is computed from a sample.
* (b) We are 95% confident that the true population parameter (the fraction of US residents wanting to legalize cannabis) lies within the interval $[58.59\%,\ 63.41\%]$.
* (c) Let's check the _success-failure_ condition: $np = 1578 \times 0.61 \approx 963$ and $n(1 - p) = 1578 \times 0.39 \approx 615$, therefore the normal model is reasonable.
* (d) Yes, the statement is justified, since the entire 95% confidence interval lies above 50%.

### Exercise 6.11 - National Health Plan, Part I.

* (a) The null hypothesis is that half of the Independents support a National Health Plan, while the alternative hypothesis is that the proportion is greater than one half (so a one-tailed test is appropriate). We have $SE_{\hat{p}} = 0.02$ and the resulting _p-value_ is $0.006 \lt 0.05$. Therefore there is enough evidence to reject the null hypothesis.
* (b) Since we rejected the null hypothesis at the 5% level, we would not expect a 95% confidence interval to include 0.5 (50%), although a 99% confidence interval might.
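
The test in (a) can be sketched as a one-sided z-test for a single proportion. The inputs are partly assumptions: the observed proportion 0.55 is the value reported for Independents in Part III, and `n = 617` is an assumed sample size consistent with the standard error of about 0.02. The name `one_sided_p_value` is hypothetical.

```python
from scipy import stats

def one_sided_p_value(p_hat, p0, n):
    """Upper-tail test of H0: p = p0 versus HA: p > p0 (normal approximation)."""
    se = (p0 * (1 - p0) / n) ** 0.5  # the standard error uses the null value p0
    z = (p_hat - p0) / se
    return z, stats.norm.sf(z)       # sf(z) = 1 - cdf(z)

# n = 617 is an assumed sample size, consistent with SE of about 0.02.
z, p_value = one_sided_p_value(0.55, 0.50, 617)
print(f"Z = {z:.2f}, p-value = {p_value:.4f}")
```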

### Exercise 6.12 - Is college worth it? Part I.

* (a) The null hypothesis is that half of the participants said they did not go to college because they could not afford it, while the alternative hypothesis is that the proportion of those who did not go to college because they could not afford it is less than 0.5, i.e. a minority. We found a p-value of 0.25 (one-sided), therefore we can't reject the null hypothesis with this data, so we cannot confirm the statement.
* (b) Yes, since the p-value is greater than 0.05 and we did not reject the null value of 0.5.

### Exercise 6.13 - Taste test.

* (a) Given H<sub>0</sub>: p = 0.5 and H<sub>A</sub>: p $\ne$ 0.5 we find a p-value of 0.31 (two tailed test), therefore we cannot reject the null hypothesis.
* (b) In this context the p-value is the probability of seeing a success rate at least as extreme as the one observed if participants were randomly guessing.

### Exercise 6.14 - Is college worth it? Part II.

* (a) The 90% confidence interval is $[43.48\%,\ 52.52\%]$.
* (b) Using our previously defined function `get_sample_size` we get $n = 2984$.

### Exercise 6.15 - National Health Plan, Part II.

An appropriate sample size would be $n = 6657$.

### Exercise 6.16 - Legalize Marijuana, Part II.

We need to use a sample size of $n = 2285$.

### Guided Practice 6.13

Let $p_{fo} = \frac{145}{12933} = 0.011$ be the proportion of heart attacks in the group receiving fish oil (treatment group) and $p_{pl} = \frac{200}{12938} = 0.015$ the proportion of heart attacks in the control group. We have $p_{fo} - p_{pl} = 0.011 - 0.015 = -0.004$ with $SE = 0.0014$ (using the formula implemented in `standard_error_difference` below). Using unrounded values, the 95% confidence interval is $[-0.0070,\ -0.0015]$, which means that we're 95% confident that the heart attack rate in the treatment group is between 0.15 and 0.70 percentage points lower than in the control group. Since the interval lies entirely below 0, the difference is statistically significant, although small.
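
The interval can be recomputed directly from the raw counts, which avoids compounding rounding errors. A minimal sketch under the normal approximation; `diff_proportion_ci` is a hypothetical helper name.

```python
from scipy import stats

def diff_proportion_ci(x1, n1, x2, n2, confidence_level=0.95):
    """Confidence interval for p1 - p2 from success counts and group sizes."""
    p1, p2 = x1 / n1, x2 / n2
    se = (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5  # unpooled standard error
    z = stats.norm.ppf(1 - (1 - confidence_level) / 2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

# Fish oil group: 145 of 12933; control group: 200 of 12938.
low, high = diff_proportion_ci(145, 12933, 200, 12938)
print(f"95% CI for p_fo - p_pl: [{low:.4f}, {high:.4f}]")
```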

### Guided Practice 6.14

It is an experimental study since we have both a treatment and a control group.

### Guided Practice 6.15

H<sub>0</sub> : p<sub>m</sub> - p<sub>nm</sub> = 0 and H<sub>A</sub> : p<sub>m</sub> - p<sub>nm</sub> $\ne$ 0, with p<sub>m</sub> being the proportion in the group receiving mammograms and p<sub>nm</sub> the proportion in the control group.

### Guided Practice 6.19

H<sub>0</sub> : p<sub>new</sub> - p<sub>old</sub> = 0.03 and H<sub>A</sub> : p<sub>new</sub> - p<sub>old</sub> $\ne$ 0.03.

### Guided Practice 6.21

Recall that for independent $A$ and $B$ we have $Var[\alpha A + \beta B] = \alpha^{2} Var[{A}] + \beta^{2} Var[{B}]$. In particular $Var[\alpha X] = \alpha^{2}Var[X]$, so $Var[-X] = (-1)^2Var[X] = Var[X]$, and we get that $SE_{\hat{p}_1 - \hat{p}_2}^{2} = SE_{\hat{p}_{1}}^{2} + SE_{\hat{p}_{2}}^{2}$.
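
The identity can also be checked empirically. The following simulation is a sketch with arbitrary illustrative parameters (`n1`, `n2`, `p1`, `p2` are not from the text) and assumes the two samples are independent.

```python
import numpy as np

rng = np.random.default_rng(0)

n1, p1 = 200, 0.30   # illustrative values
n2, p2 = 300, 0.50   # illustrative values
reps = 200_000

# Simulate independent sample proportions and look at their difference.
p_hat1 = rng.binomial(n1, p1, size=reps) / n1
p_hat2 = rng.binomial(n2, p2, size=reps) / n2

empirical_var = np.var(p_hat1 - p_hat2)
theoretical_var = p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2  # SE_1^2 + SE_2^2

print(f"empirical:   {empirical_var:.6f}")
print(f"theoretical: {theoretical_var:.6f}")
```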

### Exercise 6.17 - Social experiment, Part I.

Let's calculate the pooled proportion as $\hat{p}_{yes} = \frac{20}{45} = 0.44$ and check the success-failure condition for both groups:

$$
\begin{aligned}
n_{pro} \times \hat{p}_{yes} &= 20 \times 0.44 = 8.8 \\
n_{pro} \times (1 - \hat{p}_{yes}) &= 20 \times 0.56 = 11.2 \\
n_{con} \times \hat{p}_{yes} &= 25 \times 0.44 = 11 \\
n_{con} \times (1 - \hat{p}_{yes}) &= 25 \times 0.56 = 14
\end{aligned}
$$

The first value, 8.8, is below 10, so the success-failure condition is not met.

### Exercise 6.18 - Heart transplant success.

We have that $\hat{p}_{survived} = \frac{28}{103} = 0.2718$ and we fail to have $n_{survived} \times \hat{p}_{survived} \ge 10$. The confidence interval won't be accurate as it won't be symmetric around the estimate.

### Exercise 6.19 - Gender and color preference.

* (a) False, since that is an interval.
* (b) True, since we are 95% confident that the true difference lies within those percentages.
* (c) True, since this is another definition of a confidence interval.
* (d) True, since our confidence interval lies entirely above 0.
* (e) False, as it is simply the negated interval.

### Exercise 6.20 - Government shutdown.

* (a) With a significance level of 5%, we need a p-value less than 0.05 to reject the null hypothesis. However, the given confidence interval contains 0, so we can't deduce that there is any significant difference, and this is false.
* (b) False, it's the opposite: we are 95% confident that the proportion of those who make less than \$40,000 and are personally affected is 16% lower to 2% higher than among those who make \$40,000 or more.
* (c) False, a 90% confidence interval would be narrower.
* (d) True. We get that $SE = 0.05$ so roughly yes, we get that interval. 

### Exercise 6.21 - National Health Plan, Part III.

* (a) First of all we have $p_{D} - p_{I} = 0.79 - 0.55 = 0.24$ and $SE_{p_{D} - p_{I}} = 0.042$ (calculated using the functions defined below). The 95% confidence interval is therefore $[0.16,\ 0.32]$.
* (b) True, since the 95% confidence interval lies entirely above 0.

In [3]:
def standard_error(p, n):
    return ((p * (1 - p)) / n) ** 0.5

def standard_error_difference(p1, p2, n1, n2):
    return (standard_error(p1, n1) ** 2 + standard_error(p2, n2) ** 2) ** 0.5


### Exercise 6.22 - Sleep deprivation, CA vs. OR, Part I.

Since $p_{OR} - p_{CA} = 0.088 - 0.080 = 0.008$ and $SE_{p_{OR} - p_{CA}} = 0.005$, the 95% confidence interval is $[-0.002,\ 0.018]$. The observed difference might be due to chance, since the confidence interval contains 0.

### Exercise 6.23 - Offshore drilling, Part I.

* (a) Respectively, $p_{grad} = 0.2374 = 23.74\%$ and $p_{no\ grad} = 0.3368 = 33.68\%$.
* (b) $H_{0} : p_{no\ grad} = p_{grad}$ and $H_{A} : p_{no\ grad} \ne p_{grad}$. We obtain $SE_{p_{no\ grad} - p_{grad}} = 0.0314$, which gives a p-value of $0.0015$ (two-tailed test). Thus there's convincing evidence that the proportion of non college graduates who do not have an opinion differs from (and, given the sample proportions, is greater than) the proportion of college graduates who do not have an opinion.

### Exercise 6.24 - Sleep deprivation, CA vs. OR, Part II.

* (a) The success-failure condition is met, so we can conduct a hypothesis test with $H_{0} : p_{Ca} = p_{Or}$ and $H_{A} : p_{Ca} \ne p_{Or}$. We get a p-value of $0.099$, which tells us there's no strong evidence supporting the alternative hypothesis.
* (b) If we made an error here, it would be a Type II error, since we failed to reject the null hypothesis.

### Exercise 6.25 - Offshore drilling, Part II.

* (a) Respectively $35.16\%$ and $33.93\%$.
* (b) We get a p-value of 0.71, so there is no strong evidence that the difference between the two proportions is due to anything other than chance.

### Exercise 6.26 - Full body scan, Part I.

* (a) $H_{0} : p_{r} = p_{d}$ and $H_{A} : p_{r} \ne p_{d}$. We find a p-value of $0.498$, therefore we can't reject the null hypothesis.
* (b) We would make a Type II error, i.e. failing to reject a null hypothesis which ought to be rejected indeed.

### Exercise 6.27 - Sleep deprived transportation workers.

We check if we meet the success-failure condition using the script below.

In [4]:
def success_failure_two(p1, p2, n1, n2):
    p_p = (p1 * n1 + p2 * n2) / (n1 + n2)
    return all(
        x >= 10 for x in [p_p * n1, p_p * n2, (1 - p_p) * n1, (1 - p_p) * n2]
    )

success_failure_two(35 / 203, 35 / 292, 203, 292)

True

Let's now create a routine which outputs the p-value for us (it will be rudimentary at first).

In [5]:
def p_val_calc_diff(p1, p2, n1, n2):
    if not success_failure_two(p1, p2, n1, n2):
        return -1
    else:
        se = standard_error_difference(p1, p2, n1, n2)

        z = (p1 - p2) / se
        print(z)

        if z < 0:
            return stats.norm.cdf(z) * 2
        else:
            return (1 - stats.norm.cdf(z)) * 2

p_val_calc_diff(35 / 292, 35 / 203, 292, 203)

-1.6109110850818888


0.10719910357164927

We get a p-value of about 0.107, which is greater than 0.05, so we fail to reject the null hypothesis.
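
The routine above uses the unpooled standard error. Because the null hypothesis states that the two proportions are equal, a common alternative is to build the standard error from the pooled proportion. The sketch below shows that variant on the same counts; `pooled_two_proportion_test` is a hypothetical name and not part of the original notebook.

```python
from scipy import stats

def pooled_two_proportion_test(x1, n1, x2, n2):
    """Two-sided z-test for H0: p1 = p2 using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = (p_pool * (1 - p_pool) * (1 / n1 + 1 / n2)) ** 0.5
    z = (p1 - p2) / se
    p_value = 2 * stats.norm.sf(abs(z))  # two-sided tail area
    return z, p_value

z, p_value = pooled_two_proportion_test(35, 292, 35, 203)
print(f"Z = {z:.3f}, p-value = {p_value:.3f}")
```

Both versions give a p-value above 0.05 for these data.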

### Exercise 6.28 - Prenatal vitamins and Autism.

* (a) $H_{0} : p_{v} = p_{nv}$ and $H_{A} : p_{v} \ne p_{nv}$.
* (b) We can check using the routine defined above. Since we get a p-value less than 0.05, we reject the null hypothesis in favor of the alternative.
* (c) Since we have not proven causation, we might use a softer term.

### Exercise 6.29 - HIV in sub-Saharan Africa.

* (a) Let's use pandas for this (see below).
* (b) $H_{0} : p_{n} = p_{l}$ and $H_{A} : p_{n} \ne p_{l}$.
* (c) We find a p-value of 0.003, therefore we reject the null hypothesis.

In [6]:
import pandas as pd

df = pd.DataFrame({
    "outcome": ["virologic failure"] * 26 + ["success"] * 94 + ["virologic failure"] * 10 + ["success"] * 110,
    "medicine": ["nevirapine"] * 120 + ["lopinavir"] * 120
})

pd.crosstab(df.medicine, df.outcome, margins=True)

| medicine | success | virologic failure | All |
|---|---|---|---|
| lopinavir | 110 | 10 | 120 |
| nevirapine | 94 | 26 | 120 |
| All | 204 | 36 | 240 |


### Exercise 6.30 - An apple a day keeps the doctor away.

It depends on the number of students and the purpose of this analysis.

### Guided Practice 6.23

We would expect 33 of the jurors to be Hispanic and 24.75 from other races.

### Guided Practice 6.24

* (a) The center shifts to the right.
* (b) The variability increases.
* (c) The shape becomes more symmetric and normal-like.

### Guided Practice 6.28

In [7]:
1 - stats.chi2.cdf(11.7, df=7)

0.11086624560489045

### Guided Practice 6.29

In [8]:
1 - stats.chi2.cdf(10, df=4)

0.04042768199451274

### Guided Practice 6.30

In [9]:
1 - stats.chi2.cdf(9.21, df=3)

0.026625274336369742
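
The same upper-tail areas can also be obtained with the survival function `stats.chi2.sf`, which computes one minus the cdf directly. A short sketch covering the three computations above:

```python
from scipy import stats

# (statistic, degrees of freedom) pairs from Guided Practice 6.28 to 6.30
cases = [(11.7, 7), (10, 4), (9.21, 3)]

p_values = [stats.chi2.sf(x2, df) for x2, df in cases]
for (x2, df), p in zip(cases, p_values):
    print(f"X2 = {x2}, df = {df}: p-value = {p:.4f}")
```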

### Guided Practice 6.34

Let's use, as above, Python.

In [10]:
import numpy as np

obs = np.array([717, 369, 155, 69, 28, 14, 10])
exp = np.array([743, 338, 153, 70, 32, 14, 12])

# ddof adjusts the degrees of freedom (k - 1 - ddof); the default 0 gives k - 1 = 6.
x2 = stats.chisquare(obs, exp)
x2.statistic

4.626783138388285

### Guided Practice 6.35

We need to use $n - 1 = 7 - 1 = 6$ degrees of freedom.
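
Putting the last two answers together, the statistic can be computed from its definition and converted into a p-value with 6 degrees of freedom. A minimal sketch using the observed and expected counts from Guided Practice 6.34 (the variable names are ours):

```python
import numpy as np
from scipy import stats

observed = np.array([717, 369, 155, 69, 28, 14, 10])
expected = np.array([743, 338, 153, 70, 32, 14, 12])

# Chi-square statistic: sum of (observed - expected)^2 / expected over all bins.
chi2_stat = np.sum((observed - expected) ** 2 / expected)
dof = len(observed) - 1  # k - 1 = 6
p_value = stats.chi2.sf(chi2_stat, dof)

print(f"X2 = {chi2_stat:.4f}, df = {dof}, p-value = {p_value:.4f}")
```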

### Exercise 6.31 - True or false, Part I.

* (a) False. The chi-square has only one parameter, i.e. the degrees of freedom.
* (b) True, although the distribution looks more normal as the degrees of freedom increase.
* (c) True.
* (d) False. It becomes more normal.

### Exercise 6.32 - True or false, Part II.

* (a) True. It becomes more normal shaped.
* (b) True. We would get a p-value of 0.8197.
* (c) False. Only the right tail.
* (d) True, as it becomes more normal (and the tails gets larger).

### Exercise 6.33 - Open source textbook.