Before you turn this problem set in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All). Note that in code sections, you must replace `stop("Not Implemented")` with your code. Otherwise, you will have points automatically deducted in the grading process.

**Please do not rename this file.**

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER/EXPLANATION HERE". In addition, please do not include your name on this assignment to ensure anonymity for the peer reviews.

---

# Problem 1

Part of this assignment will examine the [Gapminder](http://www.gapminder.org/) dataset, which can be imported into R as follows:

In [None]:
# This information is available within the gapminder package
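# Install only if the package is missing, so re-running all cells does not reinstall it
if (!requireNamespace("gapminder", quietly = TRUE)) install.packages("gapminder")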
library(gapminder)
data("gapminder")

Now there should be a data set called `gapminder` in the R global environment. Conduct a few tests on the data. First, make a data frame called `year_2007` that contains only information from the year 2007, the most recent year in the data set. Make sure that all of the columns have the same names as those in the original data set.

In [None]:
# YOUR CODE HERE
year_2007 = gapminder[gapminder$year == 2007,]

Run the next code cell to test if the data set was correctly subsetted.

In [None]:
stopifnot(max(year_2007$year) == 2007,
         min(year_2007$year) == 2007,
         names(year_2007) == c("country", "continent", "year", "lifeExp", "pop", "gdpPercap"),
         dim(year_2007) == c(142L, 6L))

Life expectancy in Africa is lower than much of the rest of the world. Create a new variable in the `year_2007` data frame called `in_Africa` that indicates whether a country is on the continent of Africa. Create another variable in the data frame called `lifeExp_above_60` that indicates whether the life expectancy for a country is above 60. In order to make sure the variables are easy to understand, make the values for the `in_Africa` variable either "In Africa" or "Not in Africa", and the values for the `lifeExp_above_60` variable either "Above 60" or "Not Above 60".

In [None]:
# YOUR CODE HERE
year_2007["in_Africa"] = ifelse(year_2007$continent == "Africa", "In Africa", "Not in Africa")
year_2007["lifeExp_above_60"] = ifelse(year_2007$lifeExp > 60,"Above 60","Not Above 60" )

Now see if the new variables were created successfully:

In [None]:
stopifnot(unique(year_2007$in_Africa) == c("Not in Africa", "In Africa"),
         unique(year_2007$lifeExp_above_60) == c("Not Above 60", "Above 60"),
         table(year_2007$lifeExp_above_60) == structure(c(99L, 43L), .Dim = 2L, .Dimnames = structure(list(
    c("Above 60", "Not Above 60")), .Names = ""), class = "table"))

In order to see the data that will be tested, make a $2 \times 2$ table of these two variables called `my_table`. Make sure that the table is readable!
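
As a rough illustration of cross-tabulating two labelled variables, the sketch below uses made-up vectors (`region` and `life` are hypothetical names, not part of the assignment data). It assumes that passing `dnn` to `table()` is an acceptable way to label the dimensions; other approaches work too.

```r
# Toy data (hypothetical, not taken from gapminder)
region <- c("In Africa", "Not in Africa", "In Africa", "Not in Africa", "Not in Africa")
life   <- c("Not Above 60", "Above 60", "Above 60", "Above 60", "Not Above 60")

# table() counts each combination; dnn supplies a label for each dimension
toy_table <- table(region, life, dnn = c("Region", "Life expectancy"))
print(toy_table)
```

Rows and columns are ordered alphabetically by label, which is why "In Africa" and "Above 60" come first.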

In [None]:
# YOUR CODE HERE
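# Select the columns by name so that plain vectors (not one-column data frames) are tabulated
my_table = table(year_2007$in_Africa, year_2007$lifeExp_above_60)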

This will check to see if the table was created correctly:

In [None]:
stopifnot(my_table == structure(c(12L, 87L, 40L, 3L), .Dim = c(2L, 2L), .Dimnames = structure(list(
    c("In Africa", "Not in Africa"), c("Above 60", "Not Above 60"
    )), .Names = c("", "")), class = "table"))

Now perform a test for equal proportions for the hypotheses
$$ \begin{array}{ll}
H_0: & p_{Africa} \geq p_{non-Africa} \\
H_1: & p_{Africa} < p_{non-Africa}
\end{array}$$
These will test to see if Africa has a significantly lower proportion of countries with a life expectancy greater than 60 years than the rest of the world. Use the continuity correction and make a 97% confidence interval for $p_{Africa} - p_{non-Africa}$. Assign the test to an object named `lifeExp_prop_test`.
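
For orientation, here is a minimal sketch of a one-sided test of equal proportions on made-up counts (`toy_counts` and the group labels are hypothetical). When `prop.test` is given a two-column matrix, it reads the first column as successes and the second as failures, and `alternative = "less"` tests whether the first row's proportion is below the second row's.

```r
# Hypothetical counts: one row per group, columns are (successes, failures)
toy_counts <- matrix(c(15, 35,
                       40, 10),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("Group A", "Group B"),
                                     c("Success", "Failure")))

# One-sided test with continuity correction and a 97% confidence level
toy_test <- prop.test(toy_counts, alternative = "less",
                      conf.level = 0.97, correct = TRUE)
toy_test$p.value
toy_test$conf.int
```

With a one-sided alternative the reported interval is one-sided as well, so its lower limit is -1.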

In [None]:
# YOUR CODE HERE
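# alternative = "less" matches H_1: p_Africa < p_non-Africa (first row vs. second row of my_table)
lifeExp_prop_test = prop.test(my_table, alternative = "less", conf.level = 0.97, correct = TRUE)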

If you did this correctly, the following cell will run:

In [None]:
stopifnot(round(lifeExp_prop_test$statistic,digits = 3) == structure(81.091, .Names = "X-squared"))

---

# Problem 2

This problem will continue examining the Gapminder data set. If the R environment has been cleared, the following commands will bring the `gapminder` data frame back to the environment:

In [None]:
library(gapminder)
data("gapminder")

Create another subset of the `gapminder` data set containing information on Oceania and Europe in the year 2007. Name this data frame `EO_year_2007`, and make sure that the column names are the same as in the original data set.

In [None]:
# YOUR CODE HERE
EO_year_2007 = gapminder[gapminder$year == 2007 & gapminder$continent %in% c("Europe", "Oceania"), ]

Check that the subset was done correctly:

In [None]:
stopifnot(unique(EO_year_2007$continent) == structure(4:5, .Label = c("Africa", "Americas", "Asia", "Europe", 
"Oceania"), class = "factor"),
         dim(EO_year_2007) == c(32L, 6L))

There are multiple ways to perform calculations within groups. Probably the easiest way to do this without loading any extra packages in R is to use the `aggregate` function. Formulas can be used within the `aggregate` function to easily specify the groups to aggregate over. For example, if attempting to aggregate the average (mean) `Sepal.Length` in the `iris` dataset, one can use the code:

In [None]:
data("iris")
aggregate(Sepal.Length ~ Species,data=iris,FUN=mean)

Using the `aggregate` function, find the average GDP per capita (`gdpPercap`) for Europe and Oceania. Name the aggregate data frame `gdp_means`, and make sure it has two columns with the names `continent` and `gdpPercap`. 

In [None]:
# YOUR CODE HERE
gdp_means = aggregate(gdpPercap ~ continent, data = EO_year_2007,FUN =mean)

If you did this correctly, the following test should run without errors:

In [None]:
stopifnot(round(gdp_means$gdpPercap,digits=2) == c(25054.48, 29810.19))

Based on the aggregation, the average GDP per capita for Oceania is higher than that of Europe. This is supported by a boxplot of GDP per capita for each continent in 2007:

In [None]:
# Note that this will only work if you have created the year_2007 data frame in problem 1
boxplot(gdpPercap/1000 ~ continent,data=year_2007,ylab = "GDP per Capita ($K)",main = "GDP per capita in 2007")

Before testing to see if the means are significantly different for the two continents, examine whether or not the groups have equal variances. Test the hypotheses:
$$\begin{align}
H_0: & \sigma_{Europe}^2 = \sigma_{Oceania}^2 \\
H_1: & \sigma_{Europe}^2 \neq \sigma_{Oceania}^2
\end{align}$$
Name the test object `EO_gdp_var_test`.
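
As a quick sketch of what the F test for equal variances computes, the example below runs `var.test` on two made-up numeric vectors (`x_toy` and `y_toy` are hypothetical). The test statistic is the ratio of the two sample variances.

```r
# Hypothetical samples (not from gapminder)
x_toy <- c(12.1, 14.3, 11.8, 15.2, 13.7, 12.9)
y_toy <- c(10.4, 16.8, 9.7, 18.1, 13.3)

# Two-sided F test of H0: var(x) = var(y)
toy_var_test <- var.test(x_toy, y_toy)
toy_var_test$statistic   # equals var(x_toy) / var(y_toy)
toy_var_test$parameter   # numerator and denominator degrees of freedom
toy_var_test$p.value
```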

In [None]:
# YOUR CODE HERE
# Pass plain numeric vectors (selected by column name) to var.test
EO_gdp_var_test = var.test(EO_year_2007$gdpPercap[EO_year_2007$continent == "Europe"],
                           EO_year_2007$gdpPercap[EO_year_2007$continent == "Oceania"])

Check that the test was performed correctly:

In [None]:
stopifnot(round(EO_gdp_var_test$p.value,digits = 3) == 0.833)

Is the difference in the average GDP per capita *statistically significant*? Perform a t-test (using your `EO_year_2007` data frame) to see if there is a statistically significant difference in the means of GDP per capita between Europe and Oceania. Based on the variance test above, the variances in the GDPs for the two continents are not significantly different (because the p-value is higher than any reasonable value for $\alpha$), so make sure your t-test treats the two variances as equal. Test the hypotheses
$$ \begin{array}{ll}
H_0: & \mu_{Europe} = \mu_{Oceania} \\
H_1: & \mu_{Europe} \neq \mu_{Oceania}
\end{array} $$
and assign the test to an object named `EO_mean_gdp_test`. 
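
The distinction that matters here is between R's default Welch t-test and the pooled-variance t-test. The sketch below, on made-up vectors (`a_toy` and `b_toy` are hypothetical), shows how `var.equal = TRUE` switches `t.test` to the pooled version, which uses $n_1 + n_2 - 2$ degrees of freedom.

```r
# Hypothetical samples of unequal size (not from gapminder)
a_toy <- c(20.1, 22.4, 19.8, 23.0, 21.5, 20.9, 22.2)
b_toy <- c(24.0, 21.7, 25.3)

# Default: Welch test, which does not assume equal variances
welch  <- t.test(a_toy, b_toy)

# Pooled test: assumes the two groups share a common variance
pooled <- t.test(a_toy, b_toy, var.equal = TRUE)

welch$method;  welch$parameter;  welch$p.value
pooled$method; pooled$parameter; pooled$p.value
```

The two versions generally report different p-values, so a result that does not match the expected value is often a sign that the wrong version was run.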

In [None]:
# YOUR CODE HERE
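# The variance test did not reject equal variances, so use the pooled-variance
# t-test (var.equal = TRUE) rather than R's default Welch test
EO_mean_gdp_test = t.test(EO_year_2007$gdpPercap[EO_year_2007$continent == "Europe"],
                          EO_year_2007$gdpPercap[EO_year_2007$continent == "Oceania"],
                          var.equal = TRUE)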

The following test will work if you successfully coded the hypothesis test:

In [None]:
stopifnot(round(EO_mean_gdp_test$p.value,digits = 3) == 0.581)

Based on the test output, the average GDP per capita is not significantly different for Oceania and Europe, as the p-value is much higher than any reasonable $\alpha$-level (this is usually set to 0.05). 

Not so fast! One of the assumptions of the t-test is that the underlying distribution for each group is normal or approximately normal. The following plot shows the densities for the two groups in the sample:

In [None]:
plot(density(subset(EO_year_2007,continent == "Oceania")$gdpPercap/1000),
     xlim = c(0,60),
     xlab = "GDP per capita ($K)",
     col="blue",
     lwd=2,
     main = "GDP per Capita - Density Curves")
lines(density(subset(EO_year_2007,continent == "Europe")$gdpPercap/1000),
      col='red',
      lwd=2)
legend("topright",legend = c("Europe","Oceania"),col = c("red","blue"),lwd=2)

Neither of the curves looks very much like the normal density curve (do a Google search if that isn't familiar). This suggests that the t-test might not be the correct method to use. Instead, conduct a Mann-Whitney test on the following hypotheses:
$$\begin{align}
H_0: & m_{Europe} = m_{Oceania} \\
H_1: & m_{Europe} \neq m_{Oceania} 
\end{align}$$
In this case, $m$ represents the median of each group. Name the test object `EO_gdp_mann_whitney`.
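
In R, the Mann-Whitney test is run with `wilcox.test` on two numeric vectors. The sketch below uses made-up values (`g1_toy` and `g2_toy` are hypothetical). The statistic `W` counts the pairs in which a value from the first sample exceeds a value from the second.

```r
# Hypothetical samples without ties (not from gapminder)
g1_toy <- c(3.1, 4.7, 5.2, 6.8, 7.4, 8.9)
g2_toy <- c(2.5, 9.6, 10.3)

# Two-sided Mann-Whitney (Wilcoxon rank sum) test; inputs must be numeric vectors
toy_mw <- wilcox.test(g1_toy, g2_toy)
toy_mw$statistic
toy_mw$p.value
```

Note that `wilcox.test` expects plain numeric vectors, so a column selected by name (for example `df$column`) is a safer input than a one-column data frame.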

In [None]:
# YOUR CODE HERE
# wilcox.test needs numeric vectors, so select gdpPercap by name
x = EO_year_2007$gdpPercap[EO_year_2007$continent == "Europe"]
y = EO_year_2007$gdpPercap[EO_year_2007$continent == "Oceania"]
EO_gdp_mann_whitney = wilcox.test(x, y)

Finally, check to see that the test was conducted correctly:

In [None]:
stopifnot(round(EO_gdp_mann_whitney$p.value,digits=3) == 0.681)

Since the p-value is still much higher than a reasonable $\alpha$-level (remember, this is usually set to 0.05), the medians of the two groups are not significantly different. 

---

# Problem 3

In the context of hypothesis testing using data in a contingency table, a test of proportions can be used when the dimensions of the tables are $2 \times 2$ or $1 \times 2$, though data in the latter case are rarely represented in a contingency table. However, if the contingency table has dimensions greater than $2 \times 2$, a test of proportions won't work without simplifying the table, as was done in Problem 1. For a higher-dimension contingency table, Pearson's $\chi^2$ test can be used. In order to show this, begin by again subsetting the `gapminder` data to look at only the information from the year 2007. In addition, use the `cut` function to create a new variable *within* the subsetted data frame called `gdp_bin` such that the `gdpPercap` variable is grouped by whether the GDP per capita is in the sets (0,10000], (10000,20000], (20000,30000], (30000,40000], or (40000,50000]. Name the subsetted data frame `problem_3`. Note that R will name the bins using scientific notation for the numbers by default. This should not be changed.
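
To see how `cut` bins values and labels the bins, the sketch below applies it to a few made-up numbers (`toy_gdp` is hypothetical, not real GDP data). By default the intervals are closed on the right, so a value of exactly 10000 falls in the first bin.

```r
# Hypothetical values (not from gapminder)
toy_gdp <- c(950, 10000, 12500, 34000, 49000)

# Bin edges; cut() builds right-closed intervals (a, b] by default
toy_breaks <- c(0, 10000, 20000, 30000, 40000, 50000)
toy_bin <- cut(toy_gdp, toy_breaks)

levels(toy_bin)   # labels use scientific notation, e.g. "(0,1e+04]"
table(toy_bin)    # empty bins are kept with a count of 0
```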

In [None]:
library(gapminder)
data("gapminder")

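# Bin edges for GDP per capita (named `breaks` to avoid masking base R's levels())
breaks = c(0,10000,20000,30000,40000,50000)
problem_3 = gapminder[gapminder$year == 2007,c("continent","gdpPercap")]
problem_3$gdp_bin = cut(problem_3$gdpPercap,breaks)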


Check to make sure this was done correctly:

In [None]:
stopifnot(table(problem_3$continent,problem_3$gdp_bin) == structure(c(47L, 16L, 20L, 5L, 0L, 5L, 7L, 3L, 6L, 0L, 0L, 0L, 
6L, 6L, 1L, 0L, 1L, 2L, 11L, 1L, 0L, 1L, 2L, 2L, 0L), .Dim = c(5L, 
5L), .Dimnames = structure(list(c("Africa", "Americas", "Asia", 
"Europe", "Oceania"), c("(0,1e+04]", "(1e+04,2e+04]", "(2e+04,3e+04]", 
"(3e+04,4e+04]", "(4e+04,5e+04]")), .Names = c("", "")), class = "table"))

Now conduct Pearson's $\chi^2$ test to see if there is an association between which continent a country is on and its GDP per capita. Name the test object `gdp_continent_chisq`. Do not change any of the default options for the test in R. Note: R will return a warning message for this test because there are cells in the contingency table with small expected values. Ignore this warning in this assignment, but not in research!
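
As a sketch of what Pearson's $\chi^2$ test computes, the example below runs `chisq.test` on a made-up $3 \times 3$ table (`toy_grid` and its labels are hypothetical) and reproduces the statistic by hand as $\sum (O - E)^2 / E$, where the expected counts $E$ come from the row and column totals.

```r
# Hypothetical 3 x 3 contingency table (not from gapminder)
toy_grid <- matrix(c(30, 10,  5,
                     15, 20, 10,
                      5, 15, 25),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(Group = c("A", "B", "C"),
                                   Bin = c("Low", "Mid", "High")))

toy_chisq <- chisq.test(toy_grid)

# Expected counts under independence: (row total * column total) / grand total
expected_by_hand <- outer(rowSums(toy_grid), colSums(toy_grid)) / sum(toy_grid)
stat_by_hand <- sum((toy_grid - expected_by_hand)^2 / expected_by_hand)

toy_chisq$statistic   # matches stat_by_hand
toy_chisq$parameter   # degrees of freedom: (rows - 1) * (columns - 1)
```

In this toy table every expected count is well above 5, so no small-count warning is expected here; the `$expected` element of the result is the place to look when R does issue one.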

In [None]:
# YOUR CODE HERE
tbl = table(problem_3$continent, problem_3$gdp_bin)
gdp_continent_chisq = chisq.test(tbl)

Finally, check to see that the test performed correctly: 

In [None]:
stopifnot(round(gdp_continent_chisq$statistic,digits=3) == structure(73.731, .Names = "X-squared"))

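Based on this test, there is strong evidence of an association between continent and GDP per capita: with 16 degrees of freedom, a test statistic of 73.731 gives a p-value far below 0.05.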