# Chapter 6: Inference for categorical data

### Guided Practice 6.2

We get that $SE_{\hat{p}} = \sqrt{\frac{\hat{p} \times (1 - \hat{p})}{n}} = 0.0159$.

### Guided Practice 6.4

A plausible pair of hypotheses is $H_{0}$: support and opposition are equally common ($p = 0.50$), and $H_{A}$: support and opposition occur in different proportions ($p \ne 0.50$).

### Guided Practice 6.5

Let's check the success-failure condition with the null value: $np_{0} = n(1 - p_{0}) = 412.5 \approx 413$. Both counts are well above 10, so we can use the normal distribution.

### Guided Practice 6.8

We want $1.645 \times \sqrt{\frac{p(1 - p)}{n}} < 0.01$, which rearranges to $n > 1.645^2 \times \frac{p(1 - p)}{0.01^2}$. We can write a Python routine that takes `p`, the confidence level and the margin of error as input and returns the minimum sample size `n` that achieves it. We run the function on the three proportions found previously and look at the respective `n`.

In [5]:
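from scipy import stats

def get_sample_size(p, confidence_level, error_margin):
    # Two-sided critical value z*, rounded to two decimals (1.64 for a 90% level).
    z = round(stats.norm.interval(confidence_level)[-1], 2)

    # Solve z * sqrt(p * (1 - p) / n) = error_margin for n.
    # round() returns the nearest integer; math.ceil would round up instead.
    return round((z ** 2) * ((p * (1 - p)) / (error_margin ** 2)))

# Sample size needed for each of the three proportions at a 90% confidence level.
for p in [0.017, 0.062, 0.013]:
    print(f"With p = {p}, we need a sample size of n = {get_sample_size(p, 0.90, 0.01)}.")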


With p = 0.017, we need a sample size of n = 449.
With p = 0.062, we need a sample size of n = 1564.
With p = 0.013, we need a sample size of n = 345.
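
A stricter variant is sketched below as a cross-check, not as a replacement for the routine above. It assumes we keep the critical value at full precision (about 1.645 rather than 1.64) and round the result up, so the margin of error is never exceeded. The name `min_sample_size` is hypothetical, and the standard library's `statistics.NormalDist` is used in place of `scipy` only to keep the sketch self-contained.

```python
import math
from statistics import NormalDist

def min_sample_size(p, confidence_level, error_margin):
    """Smallest n for which z* * sqrt(p * (1 - p) / n) <= error_margin."""
    # Two-sided critical value at full precision (about 1.645 for a 90% level).
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    # Solve the inequality for n, then round up to the next integer.
    return math.ceil(z ** 2 * p * (1 - p) / error_margin ** 2)

for p in [0.017, 0.062, 0.013]:
    print(f"With p = {p}, rounding up gives n = {min_sample_size(p, 0.90, 0.01)}.")
```

Because of the larger critical value and the rounding up, these sample sizes come out slightly larger than the ones printed above.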


### Guided Practice 6.10

We can reuse the `get_sample_size` function defined above to answer this question.

In [6]:
get_sample_size(0.70, 0.95, 0.05)

323

### Exercise 6.1 - Vegetarian college students.

* (a) Since $np = 60 \times 0.08 = 4.8 < 10$ (while $n(1 - p) = 60 \times 0.92 = 55.2$), the success-failure condition fails and the normal distribution won't work well.
* (b) Yes, the distribution is right skewed, since the proportion is much closer to 0 than to 1 and the sample size is not big enough for such a proportion.
* (c) A sample size of $n = 125$ gives $SE_{\hat{p}} = 0.024$, so $Z = 1.65$. The sample proportion is within 2 standard errors of the expected value, so it is not unusual.
* (d) With a sample size of $n = 250$ we get $SE_{\hat{p}} = 0.017$ and therefore $Z = 2.33$, so the sample proportion is now unusual.
* (e) False. Doubling the sample size multiplies the standard error by $\frac{1}{\sqrt{2}} \approx 0.71$, a reduction of about 29%, not one half.
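
The Z-scores in (c) and (d) can be checked with a minimal sketch. It assumes a sample proportion of $\hat{p} = 0.12$ (inferred from the Z-scores above, not stated in these notes) and the usual rule of thumb that an observation more than 2 standard errors from the expected value is unusual. The helper name `z_score` is hypothetical.

```python
import math

def z_score(p_hat, p, n):
    """Standardize a sample proportion using the SE under the population proportion p."""
    se = math.sqrt(p * (1 - p) / n)
    return (p_hat - p) / se

p = 0.08      # population proportion used in part (a)
p_hat = 0.12  # assumed sample proportion (inferred, see the lead-in)

for n in (125, 250):
    z = z_score(p_hat, p, n)
    # Rule of thumb: |Z| > 2 marks an unusual sample proportion.
    print(f"n = {n}: Z = {z:.2f}, unusual: {abs(z) > 2}")
```

Since the sketch keeps the standard error unrounded, its Z-scores can differ in the second decimal from hand calculations that use a rounded standard error.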

### Exercise 6.2 - Young Americans, Part I.

* (a) Yes, left skewed: $p$ is closer to 1 and the sample size is only 20, so $n(1 - p) = 4.6 < 10$.
* (b) Since $np = 30.8$ but $n(1 - p) = 9.2 < 10$, we fail the success-failure condition, so the normal approximation won't work well.
* (c) Since $SE_{\hat{p}} = 0.054$ we have that $Z = 1.47$, which is within 2 standard errors, so the observation would not be considered unusual.
* (d) In this case, $Z = 2.08$, so the observation is unusual.

### Exercise 6.3 - Orange tabbies.

* (a) True. The sample size is small and the proportion is close to 1, so we expect a noticeable left skew.
* (b) True, since that would halve the _standard error_ (remember that $n$ is under the square root).
* (c) True, since the sample size is large enough for the normal approximation: $np = 140 \times 0.90 = 126$ and $n(1 - p) = 14$ are both at least 10.
* (d) True, a bigger sample size only increases both _success-failure condition_ values.
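
The effect of the sample size on the standard error can be illustrated with a small sketch. It uses $p = 0.90$ and $n = 140$ from part (c) purely as example values; the ratio does not depend on them. The helper name `standard_error` is hypothetical.

```python
import math

def standard_error(p, n):
    """Standard error of a sample proportion."""
    return math.sqrt(p * (1 - p) / n)

p, n = 0.90, 140  # example values taken from part (c)

for factor in (2, 3, 4):
    # Ratio of the SE at the larger sample size to the SE at the original one.
    ratio = standard_error(p, factor * n) / standard_error(p, n)
    print(f"{factor}x the sample size: SE is multiplied by {ratio:.3f}")
```

The ratio equals $\frac{1}{\sqrt{k}}$ for a sample $k$ times as large, so only a fourfold increase halves the standard error.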

### Exercise 6.4 - Young Americans, Part II.

* (a) True, since we have a very small sample size and a proportion closer to 0. 
* (b) True, since we need to meet the following:

$$
\begin{cases}np \ge 10 \\ n(1 - p) \ge 10 \end{cases} \Rightarrow \begin{cases}n \ge 40 \\ n \ge 13.33 \end{cases} \Rightarrow n \ge 40
$$

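* (c) Computing the standard error from the population proportion $p = 0.25$, we have $SE_{\hat{p}} = 0.061$ so $Z = -0.82$, therefore it is not considered unusual.
* (d) We now have $SE_{\hat{p}} = 0.035$ and $Z = -1.41$, which is still within 2 standard errors, so the sample proportion is not unusual.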
* (e) False, a sample size $3n$ will decrease the _standard error_ by a factor $\frac{1}{\sqrt{3}}$.
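
The minimum sample size in (b) can also be found programmatically. The sketch below assumes the usual threshold of 10 expected successes and 10 expected failures; the function name `min_n_success_failure` is hypothetical.

```python
import math

def min_n_success_failure(p, threshold=10):
    """Smallest n with n * p >= threshold and n * (1 - p) >= threshold."""
    # The binding constraint comes from the rarer outcome, min(p, 1 - p).
    return math.ceil(threshold / min(p, 1 - p))

print(min_n_success_failure(0.25))
```

For $p = 0.25$ the binding constraint is $np \ge 10$, which gives the same $n \ge 40$ as the derivation in (b).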



### Exercise 6.5 - Gender equality.

* (a) False, a confidence interval is a statement about the population proportion, not about the sample, whose proportion is known exactly.
* (b) True, since a confidence interval built from the sample data expresses our degree of confidence about the population proportion.
* (c) True, since this is what the confidence level means under repeated sampling.
* (d) True, since quadrupling the sample size will halve the _standard error_.
* (e) True, since we still get a low margin of error.

### Exercise 6.6 - Elderly drivers.

* (a) We have $SE_{\hat{p}} = 0.015$ and so our 95% confidence interval should be $[63.09\%,\ 68.91\%]$, a margin of error of about 3%, so the values are correct.
* (b) No, since the entire interval lies below 70%.

### Exercise 6.7 - Fireworks on July 4<sup>th</sup>.

For a 95% confidence interval we have a margin of error of $1.96 \times SE_{\hat{p}} = 0.04 = 4\%$.

### Exercise 6.8 - Life rating in Greece.

* (a) The parameter of interest is the proportion $p$ of all Greek people living in a condition poor enough to be considered "suffering". Its point estimate from the sample is $\hat{p} = 0.25$.
* (b) We may want to check the _success-failure condition_ using the sample proportion: $n\hat{p} = 0.25 \times 1000 = 250$ and $n(1 - \hat{p}) = 0.75 \times 1000 = 750$, both at least 10.
* (c) The 95% confidence interval is $[22.32\%,\ 27.68\%]$.
* (d) A higher confidence level would give us a wider confidence interval.
* (e) A larger sample would shrink the standard error, thus shrinking the confidence interval itself.
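
The interval in (c) and the effects described in (d) and (e) can be reproduced with a minimal sketch. The function name `proportion_ci` is hypothetical, and the standard library's `statistics.NormalDist` supplies the critical value so that the block is self-contained.

```python
import math
from statistics import NormalDist

def proportion_ci(p_hat, n, confidence_level=0.95):
    """Normal-approximation confidence interval for a proportion."""
    # Two-sided critical value, about 1.96 for a 95% level.
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

low, high = proportion_ci(0.25, 1000)
print(f"95% CI with n = 1000: [{low:.2%}, {high:.2%}]")

# A higher confidence level widens the interval, a larger sample narrows it.
for level, n in [(0.99, 1000), (0.95, 4000)]:
    low, high = proportion_ci(0.25, n, level)
    print(f"{level:.0%} CI with n = {n}: width = {high - low:.4f}")
```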

### Exercise 6.9 - Study abroad.

* (a) An optional web survey is completed only by those who choose to do so (a voluntary response sample), therefore it may not be a representative sample of the population of interest.
* (b) The 90% confidence interval is $[52.89\%,\ 57.11\%]$. We are 90% confident that the proportion of students sure to spend a study period abroad is between 52.89% and 57.11%.
* (c) It means that if we took many samples and built a 90% confidence interval from each one, about 90% of these intervals would capture the true population parameter.
* (d) If this sample were representative, then yes, since the entire interval is above 50%.
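
The interpretation in (c) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the true proportion `p_true = 0.55` is hypothetical (the real population value is unknown), the sample size 1509 matches the calculation for this interval, and the function name `coverage` is made up for the example.

```python
import math
import random
from statistics import NormalDist

def coverage(p_true, n, confidence_level, n_samples, seed=0):
    """Fraction of simulated confidence intervals that contain p_true."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    hits = 0
    for _ in range(n_samples):
        # Draw one sample of size n and compute its sample proportion.
        successes = sum(rng.random() < p_true for _ in range(n))
        p_hat = successes / n
        se = math.sqrt(p_hat * (1 - p_hat) / n)
        # Count the interval as a hit when it captures the true proportion.
        hits += (p_hat - z * se) <= p_true <= (p_hat + z * se)
    return hits / n_samples

# p_true is a hypothetical value chosen for the simulation.
print(coverage(p_true=0.55, n=1509, confidence_level=0.90, n_samples=1000))
```

The printed fraction should land close to 0.90, with some run-to-run variation that depends on the seed and the number of simulated samples.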

### Exercise 6.10 - Legalization of marijuana, Part I.

In [9]:
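p = 0.55  # sample proportion
n = 1509  # sample size

# Standard error of the sample proportion.
se = ((p * (1 - p)) / n) ** 0.5
# 90% confidence interval, using the critical value z* = 1.645.
p - 1.645 * se, p + 1.645 * se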

(0.5289326997892237, 0.5710673002107763)