In [1]:
from scipy.stats import chi2, norm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from math import sqrt

**Problem 23.**

Childbirth, part 1. There is some concern that if a woman has an epidural to reduce pain during childbirth, the drug can get into the baby’s bloodstream, making the baby sleepier and less willing to breastfeed. The International Breastfeeding Journal published results of a study conducted at Sydney University. Researchers followed up on 1178 births, noting whether the mother had an epidural and whether the baby was still nursing after 6 months. Below are their results.

(a) What kind of test would be appropriate? 

(b) State the null and alternative hypotheses.

**Answers to Problem 23**

(a) The Chi-Square Test of Independence would be appropriate.

(b) $H_0:$ Breastfeeding at 6 months is independent of whether the mother had an epidural.

<font color="red"> $H_0$: Whether a baby is still nursing 6 months after birth is independent of whether its mother had an epidural. </font>

$H_A:$ Breastfeeding at 6 months is not independent of whether the mother had an epidural.

In [3]:
def sol23():
    study = pd.DataFrame(
        data=np.array([[206, 498],[190, 284]]),
        columns=["Drug","NoDrug"],
        index=["Feed","NoFeed"]
    )
    samples = study.sum().sum()
    dist = (study.Drug + study.NoDrug) / samples
    study["DrugExp"] = study.Drug.sum() * dist
    study["NoDrugExp"] = study.NoDrug.sum() * dist
    print(study)
    chi2score = ((study.Drug-study.DrugExp)**2/study.DrugExp 
                 + (study.NoDrug-study.NoDrugExp)**2/study.NoDrugExp).sum()
    print(chi2score)
    print(f"pval (chi2): {1-chi2.cdf(chi2score,1)}")

    k1 = 206
    n1 = k1 + 190
    k2 = 498
    n2 = k2 + 284
    p1hat = k1 / n1
    p2hat = k2 / n2
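    # Unpooled standard error of the difference in sample proportions
    se = sqrt(p1hat * (1-p1hat) / n1  + p2hat * (1-p2hat) / n2)
    # Two-sided p-value; -abs(z) keeps it valid whichever sign z has
    z = (p1hat - p2hat) / se
    pval = 2 * norm.cdf(-abs(z))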
    print(f"pval (2 props): {pval}")
    
sol23()

        Drug  NoDrug     DrugExp   NoDrugExp
Feed     206     498  236.658744  467.341256
NoFeed   190     284  159.341256  314.658744
14.869338194764707
pval (chi2): 0.00011522069676450641
pval (2 props): 0.00012683851459129597
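
As a cross-check, the same statistic can be obtained from `scipy.stats.chi2_contingency`. This is a minimal sketch, assuming `correction=False` (no Yates continuity correction) so that it matches the hand calculation; the variable names are illustrative. It also runs a two-proportion z-test with a pooled standard error, whose squared statistic equals the chi-square statistic for a 2 × 2 table. The unpooled standard error used in `sol23` is why the two printed p-values differ slightly.

```python
import numpy as np
from math import sqrt
from scipy.stats import chi2_contingency, norm

# Observed counts: rows = Feed / NoFeed, columns = Drug / NoDrug
observed = np.array([[206, 498], [190, 284]])

# correction=False turns off the Yates continuity correction
chi2_stat, p_chi2, dof, expected = chi2_contingency(observed, correction=False)
print(f"chi2 = {chi2_stat:.4f}, dof = {dof}, p = {p_chi2:.6f}")

# Two-proportion z-test with a pooled standard error
n1, n2 = observed[:, 0].sum(), observed[:, 1].sum()
p1, p2 = observed[0, 0] / n1, observed[0, 1] / n2
pooled = observed[0].sum() / observed.sum()
se_pooled = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se_pooled
p_z = 2 * norm.sf(abs(z))
print(f"z^2  = {z**2:.4f}, p = {p_z:.6f}")
```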


**Problem 25**

Childbirth, part 2. In Exercise 23, the table shows results of a study investigating whether aftereffects of epidurals administered during childbirth might interfere with successful breastfeeding. We’re planning to do a chi-square test.

(a) How many degrees of freedom are there?

(b) The smallest expected count will be in the epidural/no breastfeeding cell. What is it?

(c) Check the assumptions and conditions for inference.

**Answers to Problem 25**

(a) There is 1 degree of freedom.

(b) The smallest observed count is 190 (mothers who had an epidural and were not breastfeeding at 6 months).

<font color="red"> According to the calculations in Problem 23, the smallest expected count is 159.34. </font>

(c) Counted data condition: The data are counts of births in each category.

Independence Assumption:  These 1178 births may not come from a random sample, but one mother's decisions about epidural use and breastfeeding should be independent of another's.

Expected cell frequency condition:  The smallest expected count is 159.34, well above 5, so the expected cell frequency condition is met.
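
The degrees of freedom and the expected-count condition can be checked directly from the table margins. The sketch below is illustrative (the variable names are not from the textbook): it uses $(r-1)(c-1)$ for the degrees of freedom and row total × column total / grand total for each expected count.

```python
import numpy as np

# Observed counts: rows = Feed / NoFeed, columns = Drug / NoDrug
observed = np.array([[206, 498], [190, 284]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)

# Expected count for each cell under independence
expected = np.outer(row_totals, col_totals) / observed.sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

smallest = expected.min()
condition_met = bool((expected >= 5).all())

print(f"dof: {dof}")
print(f"smallest expected count: {smallest:.2f}")
print(f"all expected counts >= 5: {condition_met}")
```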

**Problem 27**

Childbirth, part 3. In Exercises 23 and 25, we’ve begun to examine the possible impact of epidurals on successful breastfeeding.

(a) Calculate the component of chi-square for the epidural/no breastfeeding cell.

(b) For this test, $\chi^2 = 14.87$. What’s the P-value? 

(c) State your conclusion.

In [5]:
def sol27a():
    obs = 190
    exp = (474 / 1178) * 396
    chi_square = (( obs - exp ) ** 2 / exp)
    print(f"exp: {exp:.2f}")
    print(f"chi_square: {chi_square:.2f}")

sol27a()

exp: 159.34
chi_square: 5.90


**Answer to 27(a)** 

The component of chi-square for the epidural/no breastfeeding cell is 5.90.
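
For context, the same formula can be applied to every cell at once. This is a small illustrative sketch (not part of the textbook solution) showing that the four components add up to the test statistic of about 14.87.

```python
import numpy as np

# Observed counts: rows = Feed / NoFeed, columns = Drug / NoDrug
observed = np.array([[206, 498], [190, 284]])
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Per-cell component: (observed - expected)^2 / expected
components = (observed - expected) ** 2 / expected

print(np.round(components, 2))
print(f"sum of components: {components.sum():.2f}")
```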

In [6]:
def sol27b():
    df = 1
    chi_square = 14.87
    pval = 1 - chi2.cdf(chi_square, df)
    print(f"P-value: {pval:.6f}")
    
sol27b()

P-value: 0.000115


**Answer to 27(b)** 

The P-value is 0.000115.
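
The same P-value can also be computed with SciPy's survival function, `chi2.sf`, which returns the upper-tail probability directly instead of subtracting a CDF from 1. The sketch below also checks it against the normal distribution: with 1 degree of freedom, the chi-square upper tail at $x$ equals the two-sided normal tail at $\sqrt{x}$.

```python
from math import sqrt
from scipy.stats import chi2, norm

chi_square = 14.87
df = 1

# Upper-tail probability of the chi-square distribution
p_sf = chi2.sf(chi_square, df)

# Equivalent two-sided normal tail, valid only for df = 1
p_norm = 2 * norm.sf(sqrt(chi_square))

print(f"chi2.sf:       {p_sf:.6f}")
print(f"2 * norm.sf:   {p_norm:.6f}")
```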

**Answer to 27(c)**

The P-value of 0.000115 is very small, so I reject the null hypothesis.  Successful breastfeeding is not independent of epidural use during childbirth.

**Problem 29**

Childbirth, part 4. In Exercises 23, 25, and 27, we’ve tested a hypothesis about the impact of epidurals on successful breastfeeding. The following table shows the test’s residuals.

(a) Show how the residual for the epidural/no breastfeeding cell was calculated.

(b) What can you conclude from the standardized residuals?

In [7]:
def sol29a():
    obs = 190
    exp = (474 / 1178) * 396
    residual = (obs - exp) / sqrt(exp)
    print(f"Residual: {residual:.2f}")
    
sol29a()

Residual: 2.43


**Answer to 29(a)**

The residual for the epidural/no breastfeeding cell is 2.43.

**Answer to 29(b)**

The residual for the epidural/breastfeeding cell is negative with a sizable magnitude, which tells us that fewer mothers than expected had an epidural and were still breastfeeding at 6 months. The largest residual is in the epidural/no breastfeeding cell, which tells us that more mothers than expected had an epidural and had stopped breastfeeding by 6 months.

"I note one residual in the cell "epidural/breastfeeding" that has a negative value with significant magnitude.  This signifies that the observed babies who continue breastfeeding after 6 months and whose mothers had an epidural during birth are significantly fewer than what we expect."

The calculated residuals are all of non-negligible size.  This signifies that our observed counts are far from what we would expect if we assume that ... and ... are independent.  Therefore, just by looking at the residuals, we would be able to assert that the null hypothesis should be rejected.
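
The residual table itself is not reproduced here, so the sketch below recomputes all four residuals from the counts in Problem 23 using the same formula as `sol29a`, $(\text{obs} - \text{exp}) / \sqrt{\text{exp}}$. The labels `Drug`, `NoDrug`, `Feed`, and `NoFeed` follow the earlier code cell; everything else is illustrative.

```python
import numpy as np
import pandas as pd

observed = pd.DataFrame(
    [[206, 498], [190, 284]],
    columns=["Drug", "NoDrug"],
    index=["Feed", "NoFeed"],
)

# Expected counts under independence: row total * column total / grand total
expected = pd.DataFrame(
    np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.to_numpy().sum(),
    columns=observed.columns,
    index=observed.index,
)

# Residual for each cell: (observed - expected) / sqrt(expected)
residuals = (observed - expected) / np.sqrt(expected)
print(residuals.round(2))
```

A negative residual marks a cell with fewer observations than expected, and a positive one marks a cell with more.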

**Problem 31**

Childbirth, part 5. In Exercises 23, 25, 27, and 29, we’ve looked at a study examining epidurals as one factor that might inhibit successful breastfeeding of newborn babies. Suppose a broader study included several additional issues, including whether the mother drank alcohol, whether this was a first child, and whether the parents occasionally supplemented breastfeeding with bottled formula. Why would it not be appropriate to use chi-square methods on the 2 × 8 table with yes/no columns for each potential factor?

**Answer to Problem 31**

The counts in the 2 × 8 table with yes/no columns for each potential factor are not independent. For example, a mother who is counted in the epidural columns may also be counted in the alcohol columns.

It is not appropriate to use a chi-square test on such a 2 × 8 table with yes/no columns for each of several factors.  This is because the columns counting yes/no for each factor concern the same group of subjects.  As a result, we have dependent columns, and the conditions/assumptions of a chi-square test are violated.

**Problem 43**

Grades. Two different professors teach an introductory statistics course. The table shows the distribution of final grades they reported. We wonder whether one of these professors is an “easier” grader.

a) Will you test goodness-of-fit, homogeneity, or independence?

b) Write appropriate hypotheses.

c) Find the expected counts for each cell, and explain why the chi-square procedures are not appropriate.

**Answers to Problem 43**

(a) I will use a chi-square test of homogeneity.

(b) $H_0:$ The distributions of final grades are homogeneous for both professors.

<font color="red">$H_0:$ Final grades given by the two professors share the same distribution. </font>

$H_A:$ The distributions of final grades are not the same for both professors.

(c) The expected counts in the "F" row (2.22 and 1.78) and in Professor Beta's "D" cell (4.89) are less than 5, so the expected cell frequency condition is not met and the chi-square procedures are not appropriate.

In [8]:
def sol43c():
    grades = pd.DataFrame(data=np.array([[3,9],[11,12],[14,8],[9,2],[3,1]]),
                          columns=["Alpha","Beta"],
                          index=["A","B","C","D","F"])
    total = grades.sum().sum()
    dist = (grades.Alpha + grades.Beta) / total
    grades["AlphaEXP"] = grades.Alpha.sum() * dist
    grades["BetaEXP"] = grades.Beta.sum() * dist

    grades["AlphaChi2"] = (grades.Alpha - grades.AlphaEXP) ** 2 / grades.AlphaEXP
    grades["BetaChi2"] = (grades.Beta - grades.BetaEXP) ** 2 / grades.BetaEXP

    print(grades)

    print(grades.AlphaChi2.sum() + grades.BetaChi2.sum())
    print(1-chi2.cdf(grades.AlphaChi2.sum() + grades.BetaChi2.sum(), 4))
    
    
sol43c()

   Alpha  Beta   AlphaEXP    BetaEXP  AlphaChi2  BetaChi2
A      3     9   6.666667   5.333333   2.016667  2.520833
B     11    12  12.777778  10.222222   0.247343  0.309179
C     14     8  12.222222   9.777778   0.258586  0.323232
D      9     2   6.111111   4.888889   1.365657  1.707071
F      3     1   2.222222   1.777778   0.272222  0.340278
9.36106719367589
0.05268163187757102
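
The expected-count condition can also be checked programmatically. This is a minimal sketch that takes the expected counts from `scipy.stats.chi2_contingency` and flags every cell below 5; the threshold of 5 is the rule of thumb used in these answers, and the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows = grades A, B, C, D, F; columns = Professor Alpha, Professor Beta
grades = np.array([[3, 9], [11, 12], [14, 8], [9, 2], [3, 1]])

# Only the expected counts are needed here
_, _, _, expected = chi2_contingency(grades, correction=False)

# Boolean mask of cells that violate the expected cell frequency condition
small = expected < 5

print(np.round(expected, 2))
print(f"cells with expected count < 5: {int(small.sum())}")
```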


**Problem 45**

Grades, again. In some situations where the expected cell counts are too small, as in the case of the grades given by Professors Alpha and Beta in Exercise 43, we can complete an analysis anyway. We can often proceed after combining cells in some way that makes sense and also produces a table in which the conditions are satisfied. Here, we create a new table displaying the same data, but calling D’s and F’s “Below C”:

a) Find the expected counts for each cell in this new table, and explain why a chi-square procedure is now appropriate.

b) With this change in the table, what has happened to the number of degrees of freedom?

c) Test your hypothesis about the two professors, and state an appropriate conclusion.

In [9]:
def sol45a():
    grades = pd.DataFrame(data=np.array([[3,9],[11,12],[14,8],[12,3]]),
                          columns=["Alpha","Beta"],
                          index=["A","B","C","BelowC"])
    total = grades.sum().sum()
    dist = (grades.Alpha + grades.Beta) / total
    grades["AlphaEXP"] = grades.Alpha.sum() * dist
    grades["BetaEXP"] = grades.Beta.sum() * dist

    grades["AlphaChi2"] = (grades.Alpha - grades.AlphaEXP) ** 2 / grades.AlphaEXP
    grades["BetaChi2"] = (grades.Beta - grades.BetaEXP) ** 2 / grades.BetaEXP

    print(grades)

    print(grades.AlphaChi2.sum() + grades.BetaChi2.sum())
    print(1-chi2.cdf(grades.AlphaChi2.sum() + grades.BetaChi2.sum(), 3))
    
sol45a()

        Alpha  Beta   AlphaEXP    BetaEXP  AlphaChi2  BetaChi2
A           3     9   6.666667   5.333333   2.016667  2.520833
B          11    12  12.777778  10.222222   0.247343  0.309179
C          14     8  12.222222   9.777778   0.258586  0.323232
BelowC     12     3   8.333333   6.666667   1.613333  2.016667
9.305839920948616
0.025489182240714503


**Answers to Problem 45**

(a) The expected counts in all cells are at least 5 (the smallest is 5.33), so the expected cell frequency condition is met and the chi-square procedures are now appropriate.

(b) With this change in the table, the number of degrees of freedom drops from 4 to 3.

(c) The P-value of 0.025 is small, so I reject the null hypothesis.  I conclude that the distributions of final grades are not the same for both professors.
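
The whole procedure, from merging the D and F rows to running the test, can be sketched in a few lines. This is an illustrative cross-check rather than the textbook's method: it builds the "BelowC" row with pandas and passes the merged table to `scipy.stats.chi2_contingency`.

```python
import pandas as pd
from scipy.stats import chi2_contingency

grades = pd.DataFrame(
    [[3, 9], [11, 12], [14, 8], [9, 2], [3, 1]],
    columns=["Alpha", "Beta"],
    index=["A", "B", "C", "D", "F"],
)

# Combine the D and F rows into a single "BelowC" row
below_c = grades.loc[["D", "F"]].sum().rename("BelowC")
merged = pd.concat([grades.loc[["A", "B", "C"]], below_c.to_frame().T])

chi2_stat, pval, dof, expected = chi2_contingency(merged.to_numpy(), correction=False)

print(merged)
print(f"chi2 = {chi2_stat:.3f}, dof = {dof}, p = {pval:.4f}")
print(f"smallest expected count: {expected.min():.2f}")
```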