Before you turn this problem set in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All). Note that in code sections, you must replace `stop("Not Implemented")` with your code. Otherwise, you will have points automatically deducted in the grading process.

**Please do not rename this file.**

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER/EXPLANATION HERE". In addition, please do not include your name on this assignment to ensure anonymity for the peer reviews.

---

#### Note: Please post any questions that you have to the lesson 7 discussion board. This includes any questions about notation.

# Problem 1

### Weighted Averages

This will require the `gapminder` data set, which was used in an earlier assignment (the `gapminder` package should, therefore, be installed on your computer). For more information about the data, load the `gapminder` package with `library(gapminder)`, and find the data set description with `?gapminder`.

If one wants to examine some worldwide trend, it is often important that the data be **weighted** in order to obtain numbers that make sense. For example, if one wants to know the global human life expectancy, it makes sense that the nation-wide life expectancy in India or China should have more weight than that in, say, Luxembourg. In such a case, it is logical to weigh each country's contribution to the worldwide life expectancy by its population relative to the population of the planet at a certain time. For country $i$ at time $t$, the population weight of that country can be calculated as 

$$\text{weight}_{i,t} = \frac{\text{pop}_{i,t}}{\sum_{i = 1}^n \text{pop}_{i,t}}$$

If you're a little unsure about the notation here, $\sum_{i = 1}^n \text{pop}_{i,t}$ reads as "the sum of all of the countries' populations at time $t$."
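
As a concrete illustration of the formula, here is a minimal sketch for a single year. The three populations are made up for the example and are not taken from `gapminder`; the names `pop` and `weight` are only illustrative.

```r
# Hypothetical populations (in millions) for three countries in one year
pop <- c(A = 50, B = 30, C = 20)

# Each weight is the country's share of the total population that year
weight <- pop / sum(pop)

weight       # A = 0.5, B = 0.3, C = 0.2
sum(weight)  # weights within a year always sum to 1
```

Checking that the weights sum to 1 within each year is a quick sanity test for the real data as well.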

Begin by correctly finding a weight for all of the countries using any method. As a hint, the `aggregate`, `merge`, and `transform` functions may be useful. Make a data frame called `pop_weight` that includes, at a minimum, all of the variables in the `gapminder` data frame and a variable called `pop_weight` that contains each country's population weight for all 12 different years that there are measurements. The `pop_weight` data frame should have 1704 rows.
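
The `aggregate`, `merge`, `transform` pattern mentioned in the hint can be sketched on a toy data frame. The data and the names `toy`, `totals`, and `total_pop` are hypothetical; this is one possible approach, not the required one.

```r
# Toy data (hypothetical values, not from gapminder)
toy <- data.frame(
  country = c("A", "B", "A", "B"),
  year    = c(2000, 2000, 2005, 2005),
  pop     = c(10, 30, 20, 60),
  stringsAsFactors = FALSE
)

# Step 1: total population per year
totals <- aggregate(pop ~ year, data = toy, FUN = sum)
names(totals)[names(totals) == "pop"] <- "total_pop"

# Step 2: attach each year's total to every row from that year
toy_w <- merge(toy, totals, by = "year")

# Step 3: divide each population by its year's total
toy_w <- transform(toy_w, weight = pop / total_pop)
toy_w
```

With the real data, `pop` is stored as an integer, so converting it with `as.numeric()` before summing is a reasonable precaution against integer overflow.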

In [None]:
library("gapminder")
data("gapminder")
# Total world population per year (as.numeric avoids integer overflow when summing)
by_year_sum = aggregate(list(total_pop = as.numeric(gapminder$pop)),
                        by = list(year = gapminder$year), FUN = sum)
# Attach each year's total to every country row by matching on year,
# rather than relying on the row order of the data
pop_weight = merge(gapminder, by_year_sum, by = "year")
# Population weight = country population / world population in that year
pop_weight = transform(pop_weight, pop_weight = pop / total_pop)

#second round function is creating an error with the max value from .24099 to .24100

Check to see that the calculation was done correctly.

In [None]:
stopifnot(c(names(gapminder),"pop_weight") %in% names(pop_weight),
         round(summary(round(pop_weight$pop_weight,digits = 5)),digits = 5) == structure(c(2e-05, 0.00067, 0.00167, 0.00704, 0.00418, 0.24099
), .Names = c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", 
"Max."), class = c("summaryDefault", "table")))

Now find the population-weighted average life expectancy across all countries for each of the years in the `gapminder` data set. The `aggregate` and `transform` functions may be helpful here. Create a data frame called `weighted_life` which contains two variables, `year` and `weighted_lifeExp`. The weighted average can be found as

$$\text{weighted average}_t = \sum_{i = 1}^n \text{weight}_{i,t} \times \text{lifeExp}_{i,t}$$

Similarly, create another data frame called `unweighted_life` which contains the variables `year` and `unweighted_lifeExp`. Each observation in these data sets should correspond to the global average life expectancy in a given year. Both data frames should have 12 observations. Check to make sure the averages were found correctly. 
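
To see how the two averages can differ, here is a small sketch with two made-up countries in a single year. All numbers are hypothetical; base R's `weighted.mean()` is used only as a cross-check.

```r
# Hypothetical life expectancies and population weights for one year
life   <- c(A = 80, B = 60)
weight <- c(A = 0.25, B = 0.75)  # B holds three quarters of the population

# Weighted average: sum of weight * lifeExp, as in the formula above
weighted_avg <- sum(weight * life)    # 0.25 * 80 + 0.75 * 60 = 65

# Unweighted average: every country counts equally
unweighted_avg <- mean(life)          # (80 + 60) / 2 = 70

# Cross-check against base R's weighted.mean()
weighted.mean(life, weight)           # 65
```

The weighted average is pulled toward the value of the more populous country.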

In [None]:
# YOUR CODE HERE
# Reuses the pop_weight data frame (with its pop_weight column) from the cell above
# Unweighted: plain mean of lifeExp across countries within each year
unweighted_life = aggregate(list(unweighted_lifeExp = pop_weight$lifeExp),
                            by = list(year = pop_weight$year), FUN = mean)
# Weighted: sum of weight * lifeExp within each year
weighted_life = aggregate(list(weighted_lifeExp = pop_weight$pop_weight * pop_weight$lifeExp),
                          by = list(year = pop_weight$year), FUN = sum)

#rounding error...the numbers are getting rounded to 2 digits!

Now check to make sure you calculated the averages correctly.

In [None]:
stopifnot(round(summary(round(weighted_life$weighted_lifeExp,digits=5)),digits = 5) == structure(c(48.94424, 55.81933, 62.05951, 60.63986, 65.94676, 
68.91909), .Names = c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", 
"Max."), class = c("summaryDefault", "table")),
         round(summary(round(unweighted_life$unweighted_lifeExp,digits=5)),digits = 5) == structure(c(49.05762, 55.16103, 60.55168, 59.47444, 64.37392, 
67.00742), .Names = c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", 
"Max."), class = c("summaryDefault", "table")))

Next, create a plot that compares the two averages over the years. Remember to include all of the necessary information on the graph, as discussed in lesson 6. Your plot must be able to run on any computer without any extra packages added (other than the `gapminder` package for the data). 

In [None]:
# YOUR CODE HERE
# Axis ranges cover both series so that neither line is clipped
xrange = range(weighted_life$year)
yrange = range(weighted_life$weighted_lifeExp, unweighted_life$unweighted_lifeExp)
plot(xrange, yrange, type = "n", xlab = "Year", ylab = "Average life expectancy (years)",
     main = "Global average life expectancy by year")
# Each series only needs to be drawn once
lines(weighted_life$year, weighted_life$weighted_lifeExp, type = "b", lwd = 1.5, col = "red")
lines(unweighted_life$year, unweighted_life$unweighted_lifeExp, type = "b", lwd = 1.5, col = "blue")
legend("topleft", legend = c("Population-weighted", "Unweighted"),
       col = c("red", "blue"), lty = 1, pch = 1, lwd = 1.5, cex = 0.8)

Note that there is an observable difference in the two averaging methods. Why does the weighting have the effect that it does? This is a rhetorical question, but it's very important to think about this type of comparison in statistical applications! 

---

# Problem 2

Continuing the analysis of the `gapminder` data from problem 1, find the population-weighted average life expectancy for each continent. First, find the continent-level weights. That is, for country $i$ on continent $j$ at time $t$, find

$$ \text{weight}_{i,t} = \frac{\text{pop}_{i,t}}{\sum_{i \text{ in } j} \text{pop}_{i,t}} $$

Note that this can be done in a couple of different ways, though the easiest option without loading extra packages might be to use `aggregate`, `merge`, and `transform` again. Create a single data frame called `continent_weights` that includes all of the variables in the `gapminder` data frame, plus the variable `continent_weight`.
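
As an alternative to the `aggregate`/`merge`/`transform` route, group-wise totals can also be computed in place with base R's `ave()`. The sketch below uses a made-up data frame; the names `toy` and `group_total` are hypothetical.

```r
# Toy data (hypothetical values): two continents, one year
toy <- data.frame(
  country   = c("A", "B", "C", "D"),
  continent = c("X", "X", "Y", "Y"),
  year      = c(2000, 2000, 2000, 2000),
  pop       = c(10, 30, 5, 15),
  stringsAsFactors = FALSE
)

# For every row, ave() returns the sum of pop within that row's continent-year group
group_total <- ave(toy$pop, toy$continent, toy$year, FUN = sum)

# Weight of each country within its own continent and year
toy$continent_weight <- toy$pop / group_total
toy
```

Because `ave()` returns a vector aligned with the original rows, no merge step is needed and the row order is preserved.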

In [None]:
# YOUR CODE HERE
library("gapminder")
data("gapminder")
# Kept so that later cells can refer to the continent and year labels
continents = levels(factor(gapminder$continent))
years = levels(factor(gapminder$year))
# Total population per continent and year (as.numeric avoids integer overflow when summing)
continent_sums = aggregate(list(continent_pop = as.numeric(gapminder$pop)),
                           by = list(continent = gapminder$continent, year = gapminder$year),
                           FUN = sum)
# Attach the matching continent-year total to each country row, then divide
continent_weights = merge(gapminder, continent_sums, by = c("continent", "year"))
continent_weights = transform(continent_weights, continent_weight = pop / continent_pop)

#max value not matching after rounding

Now check to make sure that the weights were calculated correctly.

In [None]:
stopifnot(round(summary(round(continent_weights$continent_weight,digits=5)),digits = 5) == structure(c(9e-05, 0.00393, 0.01054, 0.03521, 0.02521, 0.83567
), .Names = c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", 
"Max."), class = c("summaryDefault", "table")))

Calculate the population-weighted averages for each continent. Create a data frame named `continent_weighted` with three variables: `year`, `continent`, and `avg_life` where each observation corresponds to the population-weighted average life expectancy for a year and continent. This data frame should have 60 observations.

In [None]:
# YOUR CODE HERE
# Sum weight * lifeExp within each year-continent pair. Grouping with aggregate keeps
# every average attached to its own year and continent labels.
continent_weighted = aggregate(
    list(avg_life = continent_weights$lifeExp * continent_weights$continent_weight),
    by = list(year = continent_weights$year, continent = continent_weights$continent),
    FUN = sum)

#rounding error here as well

Now check that the averages were calculated correctly.

In [None]:
stopifnot(round(summary(round(continent_weighted$avg_life,digits = 5)),digits = 5) == structure(c(38.79973, 54.39396, 67.87161, 64.33739, 72.71974, 
81.06215), .Names = c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", 
"Max."), class = c("summaryDefault", "table")))

Finally, make a visualization that shows the change in life expectancy for each continent over time. Make sure the visualization allows the viewer to compare the trends between continents. In addition, include the global weighted average that you calculated in problem 1. Again, be sure to follow the best practices when making a plot, as outlined in lesson 6.
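
One base R way to draw several groups as separate lines is to reshape the long data into a year-by-group matrix and pass it to `matplot()`. The sketch below uses made-up data; the names `toy` and `wide` are hypothetical, and this is only one of several reasonable plot designs.

```r
# Toy long-format data (hypothetical values): one row per group and year
toy <- data.frame(
  year  = rep(c(2000, 2005, 2010), times = 2),
  group = rep(c("G1", "G2"), each = 3),
  value = c(50, 55, 60, 70, 72, 74),
  stringsAsFactors = FALSE
)

# Reshape to a year-by-group matrix; each cell holds exactly one value here
wide <- tapply(toy$value, list(toy$year, toy$group), FUN = sum)

# One line per matrix column, with a legend identifying each group
matplot(as.numeric(rownames(wide)), wide, type = "b", pch = 1, lty = 1,
        col = c("red", "blue"), xlab = "Year", ylab = "Value",
        main = "Toy example: one line per group")
legend("topleft", legend = colnames(wide), col = c("red", "blue"), lty = 1, pch = 1)
```

An extra series, such as a global average, could be added afterwards with `lines()`.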

In [None]:
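# YOUR CODE HERE
library(ggplot2)  # assumes ggplot2 is already installed; avoid reinstalling on every "Run All"
sp = ggplot(data = continent_weighted, aes(x = factor(year), y = avg_life, fill = continent)) +
    geom_bar(stat = "identity", position = position_dodge()) +
    geom_line(data = weighted_life, aes(x = factor(year), y = weighted_lifeExp, group = 1), inherit.aes = FALSE) +
    xlab("Year") + ylab("Population-weighted average life expectancy (years)") +
    ggtitle("Average life expectancy by continent and year", subtitle = "Black line: global population-weighted average") +
    theme(plot.title = element_text(hjust = 0.5))
sp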

---

# Problem 3

Using only the data from the year 2007 in the `gapminder` data frame, create a bar plot that includes the five countries from each continent with the highest life expectancies. This will end up including 22 countries, as the Oceania continent only has Australia and New Zealand. Make sure the visualization includes visual cues that allow the viewer to easily identify the continent of the country. Be sure to include all of the elements of a plot discussed in lesson 6. This will take a few steps to complete correctly. Depending on personal preference, an online search may be needed to find small tweaks to the plot. The lesson 7 discussion board is also available.
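
Selecting the top few rows within each group can be done in base R by sorting, splitting by group, and taking the head of each piece. The sketch below uses a made-up data frame; the names `toy`, `toy_sorted`, and `top2` are hypothetical.

```r
# Toy data (hypothetical values)
toy <- data.frame(
  name  = c("a", "b", "c", "d", "e"),
  group = c("G1", "G1", "G1", "G2", "G2"),
  score = c(3, 9, 5, 7, 1),
  stringsAsFactors = FALSE
)

# Sort so the highest scores come first
toy_sorted <- toy[order(toy$score, decreasing = TRUE), ]

# Split by group, keep at most two rows per group, and stack the pieces again
top2 <- do.call(rbind, lapply(split(toy_sorted, toy_sorted$group), head, n = 2))
top2
```

`head()` simply returns every row when a group has fewer rows than requested, which is how a continent with only two countries would be handled.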

In [None]:
# YOUR CODE HERE
library('gapminder')
library(ggplot2)  # assumes ggplot2 is already installed; avoid reinstalling on every "Run All"
# Keep country, continent, year, and lifeExp for 2007 only
x = as.data.frame(gapminder[gapminder$year == 2007, c("country", "continent", "year", "lifeExp")])
# Sort by life expectancy (highest first), then keep up to five countries per continent;
# Oceania has only two countries, so head() returns both
x = x[order(x$lifeExp, decreasing = TRUE), ]
data = do.call(rbind, lapply(split(x, x$continent), head, n = 5))

# Fill color identifies the continent of each country
sp = ggplot(data, aes(x = country, y = lifeExp, fill = continent)) +
    geom_bar(stat = "identity") +
    ylab("Life expectancy (years)") + xlab("Country") +
    ggtitle("Highest life expectancies by continent, 2007")
sp + theme(axis.text.x = element_text(angle = 90, hjust = 1),
           plot.title = element_text(hjust = 0.5))



---

# Problem 4

Using the `NBAStandings1e` and `NBAStandings2016` datasets in the `Lock5Data` package, create a slopegraph to show how the win percentage (`WinPct`) for each team changed between the 2010-2011 and 2015-2016 seasons. Remember to clearly label each team for both seasons, and offset the labels from each other so that all of the team names are readable. In addition, some teams changed names and cities over the five-year span of the data set. Account for this by removing the old team name from the dataset for the 2010-2011 season. Once again, all plot elements covered in lesson 6 should be included.
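
For readers unfamiliar with slopegraphs, the basic construction can be sketched in base R with made-up numbers. The team names and win percentages below are hypothetical and do not come from `Lock5Data`; matching teams across the two real data sets and spacing overlapping labels are left out of this sketch.

```r
# Hypothetical win percentages for three made-up teams in two seasons
teams  <- c("Team A", "Team B", "Team C")
before <- c(0.40, 0.65, 0.55)
after  <- c(0.60, 0.50, 0.58)

# Empty canvas: x = 1 and x = 2 are the two seasons, with extra room for labels
plot(c(1, 2), range(before, after), type = "n", xlim = c(0.5, 2.5),
     xaxt = "n", xlab = "Season", ylab = "Win percentage",
     main = "Change in win percentage (toy data)")
axis(1, at = c(1, 2), labels = c("Season 1", "Season 2"))

# One line segment per team, from its first-season to its second-season value
segments(x0 = 1, y0 = before, x1 = 2, y1 = after)

# pos = 2 places a label to the left of its point, pos = 4 to the right
text(1, before, labels = teams, pos = 2, cex = 0.8)
text(2, after,  labels = teams, pos = 4, cex = 0.8)
```

When two teams have nearly identical values, their labels will collide; nudging the label `y` positions slightly is one way to keep them readable.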

In [None]:
# Load the data
library(Lock5Data)
data("NBAStandings1e")
data("NBAStandings2016")

# YOUR CODE HERE
stop("Not Implemented")