# HW 6: Binomial Tests, Z-Tests for Proportions


In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

import scipy.stats as stats
import scipy
from statsmodels.stats.proportion import proportions_ztest as Z_test

import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.simplefilter(action="ignore", category=np.VisibleDeprecationWarning)

## Section 1: Male Suicide

In this section, we will use the *suicide_summary* data set, which is a summary of the suicide information from the *mort22* dataset.  After looking at the overall data, we will next focus on suicides among Black males in the US.  

**Question 1.1** Use `Table.read_table` to read in a data set called *suicide_summary.csv*.

In [None]:
suicide = ...

suicide

**Question 1.2** What percent of these deaths are due to suicide?  You can do this via simple arithmetic or with a combination of Table methods and the sum function.  

In [None]:
overall_suicide_percent = ...
overall_suicide_percent

**Question 1.3** Run a z-test for population proportion to see if Black Males commit suicide at a rate different than the overall rate.  

In [None]:
...

If you did that problem correctly, the test statistic is negative.  What does that mean?

1. That the proportion of suicides amongst Black male deaths is less than 1.5%.

2. That the proportion of suicides amongst Black male deaths is exactly 1.5%.

3. That the proportion of suicides amongst Black male deaths is greater than 1.5%.

In [None]:
answer_to_1_3 = ...

**Question 1.4** Run a z-test for population proportion to see if Black Males commit suicide at a rate that is significantly lower than the general population.  

In [None]:
...

**Question 1.5** What percent of Black Male deaths are suicides?

In [None]:
...

**Question 1.6** State your conclusion to the question, "Do Black Males commit suicide at a rate lower than the general public?"  Be sure to reference what the rate in the general public is, what the rate among Black Males is, and the $p$-value in your response.  

*Write your conclusion to 1.6 here.*

## Section 2

In this section, we'll compare White Males to both the general population and Black Males.  


**Question 2.1** Run a z-test for population proportion to see if White males commit suicide at a rate higher than the general population.  

In [None]:
...

**Question 2.2** What percentage of White Male deaths are suicides?

In [None]:
...

**Question 2.3** State your conclusion to the question, "Do White males commit suicide at a rate higher than the overall population?"

*Write your answer to 2.3 here*

**Question 2.4** Run a 2-sample z-test for population proportions to see if White males commit suicide at a rate higher than Black males. 

In [None]:
...

**Question 2.5** Make a bar graph showing the percentages of suicides for both White and Black males on the same graph.  

In [None]:
...

## Section 3: Two Two-Proportion Z-Tests

You of course remember from the lecture that there are two ways of running a 2-proportion z-test.  The standard error can either be pooled or unpooled.  

How is it done?  First, find $\displaystyle \hat{p}_1 = \frac{x_1}{n_1}$ and $\displaystyle \hat{p}_2 = \frac{x_2}{n_2}$.

Then find $\displaystyle z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}  + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}}$.

This is the UNpooled standard error, $\displaystyle \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}  + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$, which I prefer.
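
As a rough illustration of the formula above, here is a minimal sketch using made-up counts (they are NOT the homework data, and the variable names are only for this example). It computes the two sample proportions, the unpooled standard error, and the resulting test statistic.

```python
import math

# Hypothetical counts, chosen only for illustration (not from the homework data).
x1, n1 = 30, 1000   # "successes" and sample size in group 1
x2, n2 = 45, 1200   # "successes" and sample size in group 2

p1 = x1 / n1
p2 = x2 / n2

# Unpooled standard error: each group contributes its own variance estimate.
unpooled_se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Test statistic for the difference p1 - p2.
z = (p1 - p2) / unpooled_se

print(unpooled_se, z)
```

With these made-up numbers group 1 has the smaller sample proportion, so the statistic for $\hat{p}_1 - \hat{p}_2$ comes out negative.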


**Question 3.1** Use your knowledge of arithmetic in Python to calculate the unpooled standard error for the z-test.

In [None]:
x1 = ...
n1 = ...

x2 = ...
n2 = ...

p1 = x1/n1
p2 = x2/n2

unpooled_SE = ...

unpooled_SE

You likely recall from class that I told you that the z-test function we can use from `statsmodels` uses the pooled standard error.

The pooled standard error is found by first letting $\displaystyle \hat{p} = \frac{x_1+x_2}{n_1+n_2}$ then $\displaystyle \sqrt{\hat{p}(1-\hat{p}) \left( \frac{1}{n_1} + \frac{1}{n_2}\right)}$
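
To make the pooled formula concrete, here is a small sketch with made-up counts (again, NOT the homework data). It uses only the standard library; `NormalDist` stands in for the normal CDF, and the two-sided p-value is just one possible choice of alternative for the example.

```python
import math
from statistics import NormalDist

# Hypothetical counts, chosen only for illustration (not from the homework data).
x1, n1 = 30, 1000
x2, n2 = 45, 1200

p1 = x1 / n1
p2 = x2 / n2

# Pooled proportion: treat both groups as one big sample.
pooled_prop = (x1 + x2) / (n1 + n2)

# Pooled standard error: one shared variance estimate for both groups.
pooled_se = math.sqrt(pooled_prop * (1 - pooled_prop) * (1 / n1 + 1 / n2))

# Test statistic and a two-sided p-value from the standard normal distribution.
z_pooled = (p1 - p2) / pooled_se
p_two_sided = 2 * NormalDist().cdf(-abs(z_pooled))

print(pooled_se, z_pooled, p_two_sided)
```

The only difference from the unpooled version is the variance estimate under the square root; the numerator of the test statistic is the same.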

**Question 3.2** Calculate the pooled standard error in this situation.  

In [None]:
pooled_prop = (x1+x2)/(n1+n2)

pooled_SE = ...

pooled_SE

In [None]:
## Just run this cell.  

z_unpooled = (p2 - p1)/unpooled_SE

p_value_unpooled = 1-scipy.stats.norm.cdf(z_unpooled)



z_pooled = (p2-p1)/pooled_SE
p_value_pooled = 1-scipy.stats.norm.cdf(z_pooled)


(z_unpooled, p_value_unpooled, z_pooled, p_value_pooled)

**Question 3.3** In your opinion, does it make any difference if we pool or unpool the data when calculating the standard error?  The cell just above this one might help you decide.  It shows the values of the test statistics and p-values when using the pooled and unpooled standard errors.  

*Write your answer to 3.3 here*

## Section 4: Monthly Trends

We processed the *mort22* data further to find the percentage of deaths due to suicide among just the males.  The resulting file is called *MonthPercentageSuicide.csv*.  

**Question 4.1**  Read in *MonthPercentageSuicide.csv* and call it `monthly_trends`.


In [None]:
monthly_trends = ...

monthly_trends

**Question 4.2** Make a bar chart of `monthly_trends` showing the Percent each month.  

In [None]:
...

**Question 4.3** There appears to be a relationship where the percent of suicides among males increases in the summer time.  To see if that's true we could run a Chi-squared goodness of fit test to start.  Run the test assuming that 1/12th of all suicides occur in each month.  
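
For reference, here is a sketch of how a goodness of fit test works mechanically, using made-up monthly counts (NOT the values in *MonthPercentageSuicide.csv*). The expected counts are the null proportions times the total, and `scipy.stats.chisquare` takes the observed counts and the expected counts.

```python
import numpy as np
import scipy.stats

# Hypothetical monthly counts, January through December (illustration only).
observed = np.array([100, 95, 105, 110, 120, 125, 130, 128, 115, 108, 98, 96])

# Null hypothesis: each month gets 1/12 of the total.
null_proportions = np.repeat(1 / 12, 12)
expected = null_proportions * observed.sum()

# Chi-squared goodness of fit test: compares observed counts to expected counts.
result = scipy.stats.chisquare(observed, f_exp=expected)

print(result.statistic, result.pvalue)
```

Note that the test works on counts, not proportions, and the expected counts must add up to the same total as the observed counts.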


In [None]:
null_proportions = np.repeat(1/12, 12)

yes = monthly_trends.column("Yes")

total = sum(yes)

proportions = yes/total
display(np.round(proportions, 3))

scipy.stats.chisquare(..., ... )


**Question 4.4** Is the rate of suicides amongst males constant throughout the year?  If not, when is the suicide rate the highest?  In your answer, cite a $p$-value and some relevant percentages.  

*Write your answer to 4.4 here*

## Section 5:  Biology (sort of)

According to [VeryWellHealth.com](https://www.verywellhealth.com/what-is-the-rarest-eye-color-5087302) approximately 27% of Americans have blue eyes.  

Suppose 5 people out of 6 have blue eyes, is that reasonable?  Let's work our way through a binomial test.  

Why the binomial test?  Recall that the z-test is just an approximation, and what it approximates is the result of the binomial test.  Like many z-procedures, it depends mathematically on the Central Limit Theorem.  Well, guess what?  The Central Limit Theorem does NOT apply to a sample this small.  Because the sample is too small to trust the CLT, and small enough that we can compute the exact values of the binomial distribution, it really is more appropriate to use the Binomial Test.  
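
To see how far off the approximation can be for a tiny sample, here is a sketch with made-up numbers (6 successes out of 8 with a null proportion of 0.4; these are NOT the blue-eyes numbers). It compares the exact binomial upper-tail probability with the normal approximation. The helper name `upper_tail_exact` is hypothetical, written just for this example.

```python
from math import comb, sqrt
from statistics import NormalDist

def upper_tail_exact(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p), summing the binomial probabilities."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical example: 6 successes in 8 trials, null proportion 0.4.
k, n, p0 = 6, 8, 0.4

exact_p = upper_tail_exact(k, n, p0)

# Normal (z-test) approximation to the same upper-tail probability.
z = (k / n - p0) / sqrt(p0 * (1 - p0) / n)
approx_p = 1 - NormalDist().cdf(z)

print(exact_p, approx_p)
```

In this made-up case the normal approximation gives a noticeably smaller tail probability than the exact calculation, which is the kind of disagreement that makes the exact test the safer choice for small samples.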


**Question 5.1** Assume that the 27% estimate is correct.  Calculate the probability that exactly 5 out of 6 people would have blue eyes.  Assume independence; that's a requirement of the binomial distribution.  

*Hint:* The function `scipy.special.comb` may be useful.  

In [None]:
...

**Question 5.2** Assume that the 27% estimate is correct.  Calculate the probability that 6 out of 6 people would have blue eyes.  Again assume independence, else we can't use the binomial distribution. 

In [None]:
...

**Question 5.3** Add together these last two probabilities.  

In [None]:
...

**Question 5.4** Run a binomial test using `scipy.stats.binomtest` to see if it's reasonable to assume that the proportion of a population with blue eyes is 27% based on this data (5 of 6 people with blue eyes).

In [None]:
...

If you've done the last three problems correctly, the probability you found in 5.3 is the same as the p-value from 5.4.  That's because the steps that I made you follow on the first few problems here are what the Binomial Test is doing.  You'll recall that from class.  

Now, confession time.  The 6 people involved in this problem were members of the same family, a husband, wife and their children.  

**Question 5.5**  Was the assumption of independence valid in this case?

*Write your answer here*

The small $p$-value in the case above does not really prove that people are more than 27% likely to have blue eyes.  Eye color is genetic, and if both parents have blue eyes then [this optometrist](https://www.johnoconnor.co.nz/eye-colour/) claims that each child has a 99% chance of being blue-eyed too.  Therefore, if both parents are blue-eyed then the probability that at least 3 of their 4 children would be blue-eyed is closer to 99%.  
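
As a quick check on that last figure, here is a sketch of the arithmetic. It ASSUMES each child is blue-eyed with probability 0.99 independently of the siblings once we know both parents are blue-eyed; that conditional independence is a simplification made for this example, not something the source states.

```python
from math import comb

p_child = 0.99     # claimed chance a child is blue-eyed when both parents are
n_children = 4

# P(at least 3 of 4 children are blue-eyed) = P(exactly 3) + P(exactly 4).
prob_at_least_3 = sum(
    comb(n_children, k) * p_child**k * (1 - p_child)**(n_children - k)
    for k in (3, 4)
)

print(prob_at_least_3)
```

Under that assumption the probability comes out a little above 99%, nowhere near the tiny value the binomial test produces when it treats the family as six independent draws from the general population.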

A tragic real-world example of something like this is the story of Sally Clark.  If you've never heard her story you can read it [here](https://en.wikipedia.org/wiki/Sally_Clark).  But here's an extremely short version.  She had two children die of Sudden Infant Death Syndrome (SIDS), but was convicted of murdering them partly based on [Roy Meadow's](https://en.wikipedia.org/wiki/Roy_Meadow) miscalculated probability. Quote, "He claimed that, for an affluent non-smoking family like the Clarks, the probability of a single cot death was 1 in 8543" then he assumed that he could use the assumption of independence when calculating the probability of two sons dying that way.  

**Question 5.6** KNOWING FULL WELL that this is not correct, what probability did he produce?

In [None]:
miscalculated_probability = ...

miscalculated_probability

Because her children were both boys, and were related, calculating the exact probability that they'd both die of SIDS is difficult and complicated.  But, some experts estimate it could be higher than 1%, and that once her first son died of SIDS it became very likely that her second son might.  

Sally Clark's wrongful conviction ruined her life, obviously.  She died rather young and never psychologically recovered from the two-fold tragedy of losing her children and then spending years in prison.  Meadow's dreadful error hurt his career and reputation, and his misuse of statistics and probability has also been cited as an example of fallacies such as the [Prosecutor's Fallacy](https://en.wikipedia.org/wiki/Base_rate_fallacy).  Their story is a cautionary tale about the dangers of assuming independence when it does not apply.

## Congratulations.  

You're done with HW 6. Download it as HTML or PDF and upload that file to D2L.  
