# API-201 ABC REVIEW SESSION #7

**Friday, October 28**

# Table of Contents
1. [Lecture Recap](#Lecture-Recap)
2. [Exercises](#Exercises)

# Lecture Recap <a class="anchor" id="Lecture-Recap"></a>

# Exercise: Project STAR Part 2 <a class="anchor" id="Exercises"></a>

**From last week:** Project STAR (Student-Teacher Achievement Ratio) was designed to determine the effect of smaller class sizes in the earliest grades on short-term and long-term pupil performance ([source](https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/10766)). Over 7,000 students in 79 schools across the state of Tennessee were randomly assigned to one of three interventions: small class (13 to 17 students per teacher), regular class (22 to 25 students per teacher), and regular-with-aide class (22 to 25 students with a full-time teacher's aide). Classroom teachers were also randomly assigned to the classes they would teach. The interventions began when the students entered kindergarten and continued through third grade.



In this exercise, we are going to use data from Project STAR to assess whether class size has a statistically significant impact on student achievement.

Unlike last class, the data has been aggregated at the teacher level so that scores are averages across all students taught by that teacher. 

[Download the data using this link.](https://github.com/5harad/API201-students/raw/main/review_sessions/review_7/STAR_teachers.xlsx)

## Data Dictionary
* `gktchid`: kindergarten teacher ID
* `gkclasstype`: kindergarten class type; S - Small, R - Regular/Large
* `gktreadss`: average kindergarten reading score
* `gktmathss`: average kindergarten math score
* `hsactmath`: average high school ACT math score
* `hsactread`: average high school ACT reading score

**1. Upload the Excel file `STAR_teachers.xlsx` to Google Colab and use `read_excel` to read its first worksheet as a new table called `star_teachers`. Examine the first 10 rows of the data.**

In [86]:
library(tidyverse)
library(readxl)

# Your answer here!

# START
star_teachers <- read_excel(path = "STAR_teachers.xlsx", sheet = 1)
head(star_teachers, 10)
# END  

| gktchid | gkclasstype | gktreadss | gktmathss | hsactread | hsactmath |
|---|---|---|---|---|---|
| `<dbl>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 11203801 | S | 406.5 | 448.5 | 18.5 | 19.5 |
| 11203802 | R | 410.0 | 454.0 | 25.0 | 20.0 |
| 11203803 | R | 410.2 | 479.4 | 19.4 | 20.6 |
| 12305601 | R | 437.875 | 477.875 | 23.125 | 19.75 |
| 12305602 | S | 418.5 | 472.75 | 19.5 | 19.5 |
| 12305603 | R | 429.5 | 480.0 | 19.7 | 16.5 |
| 12806801 | S | 454.0 | 503.3333 | 26.33333 | 22.66667 |
| 12806802 | R | 444.5 | 492.75 | 22.5 | 19.75 |
| 12806803 | R | 444.4 | 513.4 | 22.4 | 20.0 |
| 12807601 | S | 430.5 | 469.8333 | 19.5 | 17.5 |


**2. Calculate the number of teachers and the mean and variance of reading score by class type. Which class type has a higher average reading score? Which class type has greater variance in reading score?** 


In [15]:
# Your answer here!

# START
star_teachers %>%
    group_by(gkclasstype) %>%
    summarize(n = n(),
              mean = mean(gktreadss),
              var = var(gktreadss))
# END

| gkclasstype | n | mean | var |
|---|---|---|---|
| `<chr>` | `<int>` | `<dbl>` | `<dbl>` |
| R | 196 | 445.8346 | 362.3407 |
| S | 126 | 451.6607 | 548.6109 |


**3. Suppose $\hat\mu_R$ is the sample mean of the reading score in regular classes and $\hat\mu_S$ is the sample mean in small classes. Using your results from (2), calculate the difference in sample means and the standard error of $\hat\mu_S - \hat\mu_R$.**

Recall that $SE(\hat\mu_S - \hat\mu_R) = \sqrt{\frac{\hat\sigma_S^2}{n_S} + \frac{\hat\sigma_R^2}{n_R}}$ where $\hat\sigma_S^2$ and $\hat\sigma_R^2$ are the sample variances and $n_S$ and $n_R$ are the sample sizes.

In [36]:
# Your answer here!

# START
# Rounded means, variances, and sample sizes from question 2
diff <- 451.7 - 445.8
se <- sqrt(548.6 / 126 + 362.3 / 196)
c("Difference in Means" = diff, "Standard error" = se)
# END
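
As a cross-check on the hand-entered numbers, the same calculation can be wrapped in a small helper that takes the unrounded summary statistics printed in question 2. This is only a sketch: `diff_and_se` and its argument names are made up for illustration and are not part of the course materials.

```r
# Hypothetical helper (not from the course materials): difference in sample
# means and its standard error, computed from group-level summary statistics.
diff_and_se <- function(mean_s, var_s, n_s, mean_r, var_r, n_r) {
  c("Difference in Means" = mean_s - mean_r,
    "Standard error" = sqrt(var_s / n_s + var_r / n_r))
}

# Unrounded values copied from the question 2 output table
res <- diff_and_se(mean_s = 451.6607, var_s = 548.6109, n_s = 126,
                   mean_r = 445.8346, var_r = 362.3407, n_r = 196)
res
```

Because these inputs are rounded less aggressively, the result may differ slightly from the hand calculation that uses one decimal place.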

**4. What is the 95% confidence interval of $\mu_S - \mu_R$?**

In [63]:
# Your answer here!

# START
# 2 is a rounded version of the 1.96 critical value for a 95% interval
c("Lower Bound" = diff - 2 * se, 
  "Upper Bound" = diff + 2 * se)
# END
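
The multiplier 2 in the interval is a convenient rounding of the standard normal critical value, which `qnorm(0.975)` returns more precisely. The sketch below compares the two intervals. The names `diff_hat`, `se_hat`, and `z_crit` are assumptions introduced here so that the notebook's own `diff` and `se` are not overwritten.

```r
# Illustration only: diff_hat and se_hat repeat the rounded inputs from
# question 3 under new names so the notebook's own variables are untouched.
diff_hat <- 451.7 - 445.8
se_hat <- sqrt(548.6 / 126 + 362.3 / 196)

# Two-sided 95% critical value of the standard normal (about 1.96)
z_crit <- qnorm(0.975)

ci_exact <- c("Lower Bound" = diff_hat - z_crit * se_hat,
              "Upper Bound" = diff_hat + z_crit * se_hat)
ci_approx <- c("Lower Bound" = diff_hat - 2 * se_hat,
               "Upper Bound" = diff_hat + 2 * se_hat)

# Stack the two intervals for a side-by-side comparison
rbind(exact = ci_exact, approx = ci_approx)
```

The interval built with 2 is slightly wider than the one built with `qnorm(0.975)`, so it is marginally more conservative.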

**5. What is the Z-score corresponding to the null hypothesis $\mu_S - \mu_R = 0$?**

In [87]:
# Your answer here!

# START
z <- (diff - 0) / se
c("Z-score" = z)
# END

**6. What is the p-value corresponding to the Z-score? Is the difference in means statistically significant?**

In [85]:
# Your answer here!

# START
p <- 2 * pnorm(-abs(z))
c("p-value" = p)
c("Reject null hypothesis?" = p < .05)
# END
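
A two-sided p-value below .05 and a 95% confidence interval that excludes 0 are two views of the same test, so they should lead to the same decision. The sketch below checks this using the rounded inputs from question 3. The variable names are assumptions chosen to avoid overwriting the notebook's own objects.

```r
# Illustration only: same rounded inputs as question 3, under new names.
diff_hat <- 451.7 - 445.8
se_hat <- sqrt(548.6 / 126 + 362.3 / 196)

z_hat <- diff_hat / se_hat                 # Z-score under H0: difference = 0
p_hat <- 2 * pnorm(-abs(z_hat))            # two-sided p-value
ci <- diff_hat + c(-1, 1) * qnorm(0.975) * se_hat

rejects_by_p <- p_hat < 0.05
rejects_by_ci <- ci[1] > 0 || ci[2] < 0    # TRUE when the interval excludes 0
c("Reject by p-value" = rejects_by_p, "Reject by CI" = rejects_by_ci)
```

The two decisions match exactly only when the interval uses `qnorm(0.975)`. With the rounded multiplier 2, a borderline result could in principle differ.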