# Exercises: Unit 7, bringing probability theory and statistics together

You will not need a computer for this set of exercises. A major point of the unit was to show how combining probability and statistics lets us quantify the uncertainty of a proportion or a mean in a population without a computer-intensive method such as bootstrapping. You can, of course, use R if you wish, but in principle none of the exercises requires it. You will need a calculator to take a square root here and there, though.

## Exercises:

### Exercise 1: Rolling a die

A die will be rolled 10 times. The chance that it never shows a 6 can be found by one of the following computations. Which one, and why?

(i) $\left(\frac{1}{6}\right)^{10}$

(ii) $1 - \left(\frac{1}{6} \right)^{10}$

(iii) $\left(\frac{5}{6} \right)^{10}$

(iv) $1 - \left(\frac{5}{6} \right)^{10}$

The correct answer is (iii): the chance of no six in a single roll is $5/6$, and the rolls are independent, so the chance of no six in 10 rolls is $(5/6)^{10}$. You may check your answer by comparing:

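    # Binomial chance of 0 sixes in 10 rolls, with chance 1/6 of a six per roll
    dbinom(0, 10, 1/6)
    # Direct computation of the same chance
    (5/6)^10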

### Exercise 2: Family composition

Of families with 4 children, what proportion have more girls than boys, assuming that the chance of a girl is 0.487?

The families with four children that have more girls than boys are those with 3 or 4 girls. Assuming the births are independent, the proportion is:

${4 \choose 3} \times 0.487^3 \times 0.513 + {4 \choose 4} \times 0.487^4 \times 0.513^0 \approx 0.2932578$
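
A quick check in R, in the same spirit as the check in Exercise 1. This is a minimal sketch that treats the number of girls among 4 children as binomial with 4 trials and success chance 0.487, which assumes independent births; the variable names are only illustrative.

```r
# Chance of a girl, as given in the exercise
p_girl <- 0.487

# More girls than boys among 4 children means 3 or 4 girls
by_hand <- choose(4, 3) * p_girl^3 * (1 - p_girl) + choose(4, 4) * p_girl^4

# The same quantity from the binomial probabilities for 3 and 4 girls
by_dbinom <- sum(dbinom(3:4, size = 4, prob = p_girl))

by_hand
by_dbinom
```

Both values should agree, since `dbinom` evaluates exactly the terms written out by hand.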

### Exercise 3: Expected value and standard deviation of a box

Find the expected value and the standard error for the sum of 100 draws at random with replacement from the
following boxes:

![box_exercise-2.png](attachment:box_exercise-2.png)

Box (a): expected value $100 \times 2 = 200$, standard error $\sqrt{100} \times 2.345 = 23.45$

Box (b): expected value $100 \times (-0.25) = -25$, standard error $\sqrt{100} \times 1.479 = 14.79$

Box (c): expected value $100 \times 0 = 0$, standard error $\sqrt{100} \times 2.16 = 21.6$

Box (d): expected value $100 \times 0.667 = 66.7$, standard error $\sqrt{100} \times 0.471 = 4.71$
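
The rule behind these answers can be sketched in R. The box contents are shown in the figure and are not repeated here, so the box `c(0, 1, 1)` below is purely a hypothetical example, and `box_sd` and `sum_ev_se` are helper names made up for this sketch. Note that the SD of a box divides by the number of tickets, whereas R's built-in `sd()` divides by one less.

```r
# SD of a box: root mean square of the deviations from the box average
box_sd <- function(box) sqrt(mean((box - mean(box))^2))

# Expected value and standard error for the sum of n draws with replacement
sum_ev_se <- function(box, n) {
  c(ev = n * mean(box), se = sqrt(n) * box_sd(box))
}

# Hypothetical box, chosen only for illustration
example_box <- c(0, 1, 1)
sum_ev_se(example_box, 100)
```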

### Exercise 4: Find the sequence

The expected value for a sum is 50, with an SE of 5. The chance process generating the sum is repeated 10 times.
Which is the sequence of observed values?

(a) 51, 57, 48, 52, 57, 61, 58, 41, 53, 48

(b) 51, 49, 50, 52, 48, 47, 53, 50, 49, 47

(c) 45, 50, 55, 45, 50, 55, 45, 50, 55, 45

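Answer: It must be (a). With an SE of 5, the observed values should typically be off from 50 by around 5. In (b) the numbers are too close to 50, and in (c) the pattern is too regular to come from a chance process.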
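
One way to put numbers on this argument is to compare the typical size of the deviations from 50 with the SE of 5. The sketch below uses the root-mean-square deviation as the measure of typical size, which is an illustrative choice and not the only possible one; `rms_dev` is a helper name made up here.

```r
expected <- 50

seq_a <- c(51, 57, 48, 52, 57, 61, 58, 41, 53, 48)
seq_b <- c(51, 49, 50, 52, 48, 47, 53, 50, 49, 47)
seq_c <- c(45, 50, 55, 45, 50, 55, 45, 50, 55, 45)

# Root-mean-square deviation from the expected value
rms_dev <- function(x) sqrt(mean((x - expected)^2))

sapply(list(a = seq_a, b = seq_b, c = seq_c), rms_dev)
```

The value for (b) is far below 5. For (c) the size of the deviations is not unusual, so it is the repeating pattern that rules it out.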

### Exercise 5: Assessing the probability of the magnitude of a sum

One hundred draws will be made at random with replacement from the box [[1], [3], [3], [9]].

(a) How large can the sum be and how small can it be?

(b) How likely is it that the sum is in the range from 370 to 430?

Answer: (a) The largest possible sum is 900, the smallest is 100. (b) The average of the box is 4 and its SD is 3, so the sum has expected value $100 \times 4 = 400$ and SE $\sqrt{100} \times 3 = 30$. In standard units, 370 and 430 are $-1$ and $1$. With 100 draws the normal approximation is reasonable, so the chance that the sum falls within 1 SE of its expected value is about 68 %.
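
The numbers behind this answer can be reproduced in R. This is a minimal sketch with illustrative variable names; the SD is computed with the box formula (dividing by the number of tickets) and not with `sd()`.

```r
box <- c(1, 3, 3, 9)
n_draws <- 100

box_avg <- mean(box)                       # average of the box
box_sd  <- sqrt(mean((box - box_avg)^2))   # SD of the box

ev_sum <- n_draws * box_avg                # expected value of the sum
se_sum <- sqrt(n_draws) * box_sd           # standard error of the sum

# The range 370 to 430 in standard units
z <- (c(370, 430) - ev_sum) / se_sum

# Normal approximation: area under the normal curve between the two values
pnorm(z[2]) - pnorm(z[1])
```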

### Exercise 6: The probability of girls again

According to the natural probabilities, the sex of a baby is determined as if by a random draw from the box [[Girl], [Boy]] with probabilities 0.487 and 0.513. What is the chance that the next 2500 births (ignoring twins and other multiple births) will show more than 1275 girls? (Hint: Use the table for the standard normal distribution from unit 7.)

2500 is large enough to use a normal approximation. The expected number of girls is $2500 \times 0.487 \approx 1218$, with a standard error of $\sqrt{2500} \times \sqrt{0.487 \times 0.513} \approx 25$. In standard units, 1275 is $(1275 - 1218)/25 = 2.28$. The area under the normal curve to the right of 2.28 is approximately $0.011$, so the chance is about 1 %.
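
If you prefer R to the table, the normal approximation can be evaluated with `pnorm`, and `pbinom` gives the binomial chance directly for comparison. This is a sketch with illustrative variable names; it uses the unrounded expected value and SE, so the standard units differ slightly from the hand calculation.

```r
n <- 2500
p_girl <- 0.487

ev <- n * p_girl                               # expected number of girls
se <- sqrt(n) * sqrt(p_girl * (1 - p_girl))    # SE for the number of girls

# Normal approximation: area to the right of 1275 in standard units
z <- (1275 - ev) / se
approx_chance <- pnorm(z, lower.tail = FALSE)

# Binomial chance of more than 1275 girls, for comparison
exact_chance <- pbinom(1275, size = n, prob = p_girl, lower.tail = FALSE)

c(z = z, normal = approx_chance, binomial = exact_chance)
```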

### Exercise 7: Polling

A simple random sample of 1600 persons is taken to estimate the percentage of voters for party A, one of two parties, among the 25000 eligible voters in a certain town. It turns out that 917 people in the sample are voters for party A.

(a) Find the 95 % confidence interval for the percentage of voters for party A among the 25000 eligible voters.

(b) In the example, is the value of 917 expected or observed? Explain.

(c) Is the SD of the box exactly equal to $\sqrt{0.573\times 0.427}$, or is it estimated from the data? Explain.

(d) Is the SE for the number of voters for party A in the sample exactly equal to 20, or is it estimated from the data as 20? Explain.

Answer:

(a) $917/1600 \times 100 \% \approx 57.3 \%$ is the estimate of the percentage of voters for party A. The SD of the box is estimated as $\sqrt{0.573 \times 0.427} \approx 0.5$. The SE for the number of voters for party A is $\sqrt{1600} \times 0.5 = 20$, so the SE for the percentage is $20/1600 \times 100 \% = 1.25 \%$. The 95 % confidence interval is then $57.3 \% \pm 2 \times 1.25 \%$, that is, from 54.8 % to 59.8 %.

(b) Observed.

(c) Estimated from the data.

(d) Estimated from the data.
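
The calculation in (a) can be sketched in R as follows. The variable names are illustrative, and the sketch keeps the unrounded SD instead of rounding it to 0.5, so the SE for the number comes out slightly below 20 and the interval endpoints agree with the hand calculation only after rounding.

```r
n <- 1600
voters_a <- 917

p_hat    <- voters_a / n                 # sample proportion for party A
sd_box   <- sqrt(p_hat * (1 - p_hat))    # SD of the box, estimated from the sample
se_count <- sqrt(n) * sd_box             # SE for the number of party A voters in the sample
se_pct   <- se_count / n * 100           # SE for the percentage

# Approximate 95 % confidence interval: estimate plus/minus 2 SE
ci <- p_hat * 100 + c(-2, 2) * se_pct
round(ci, 1)
```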

### Exercise 8: A student survey

A university has 30,000 registered students. As part of a survey, 900 of these students are chosen at random. The average age in the sample turns out to be 22.3 years and the SD is 4.5 years.

(a) The average age of all 30,000 students is estimated as?

(b) This estimate is likely to be off by?

(c) Find a 95 % confidence interval for the average age of all 30,000 registered students.

Answer:

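The sample average is like the average of the draws. The SD of the box is estimated by the SD of the sample, 4.5 years. The SE for the sum of the draws is then $\sqrt{900} \times 4.5 = 135$ years, so the SE for the average is $135/900 = 0.15$ years.

(a) The estimate is 22.3 years. (b) It is likely to be off by about 0.15 years.

(c) The 95 % confidence interval is $22.3 \pm 2 \times 0.15$ years, that is, [22.0, 22.6] years.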
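
The same steps in R, as a minimal sketch with illustrative variable names:

```r
n <- 900
sample_avg <- 22.3   # average age in the sample, in years
sample_sd  <- 4.5    # SD of the sample, used as the estimate for the SD of the box

se_sum <- sqrt(n) * sample_sd   # SE for the sum of the draws
se_avg <- se_sum / n            # SE for the average of the draws

# Approximate 95 % confidence interval: estimate plus/minus 2 SE
ci <- sample_avg + c(-2, 2) * se_avg
ci
```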

## Project people count: The future of humanity in numbers and pictures

In this course you have worked on a project called "people count" using demographic data. Census and demographic data provided the context for applying some of the concepts we learned in this course: visualizing data, subsetting and extracting particular datasets from a larger dataset using R's subsetting rules, computing the mean and median age for grouped data, quantifying the uncertainty attached to such estimates with a bootstrap algorithm, and predicting population growth. We often looked at three typical countries: Kenya, a so-called bottom-layered country, named for the shape of its population barplot, which is typical of a very young population; the US, as an example of a middle-layered country; and Japan, as an example of an aging society.

This week I want to encourage you to think about an outline for producing a data story about your country, using the analysis done during the project so far as a guide. It should be a text built around data and charts that tells the story of your country using the population numbers from this course, which come from the `JWL` package and are called `population_statistics_by_age_and_sex`. You may, but need not, also use the general demographic dataset called `general_demograpics`; the R help function gives a description of its variables.

The data story should:

- Be about 4 pages long.
- Be written as a Jupyter Notebook.
- Use the analysis you did in the project in units 1 - 6.
- Describe your country, either the country where you were born or the one where you currently live.
- Contain text, graphics and data, using the two demographic datasets we used in the course.
- Be roughly built around the template:
    - My country: a short description in words.
    - My country viewed from the perspective of census data: a short demographic description in numbers, pictures and text, giving a view into the past (10 years ago) and an outlook into the future (say, 10 years from now).
    - Some concluding passage.

You have considerable freedom in how you write this, but make sure to build in the analysis you did throughout the assignment. This will be one important element for the grading of the project work.


Use the project work time of this week to make a plan for this essay and to familiarize yourself a bit with working with text and code in one Notebook. Here is a very brief intro to the most basic elements.

Jupyter Notebooks use Markdown for formatting text. Choose `Markdown` from the menu bar above so that Jupyter knows that the cell you are currently working in is a Markdown cell and not a code cell.

Markdown is a lightweight, easy to learn markup language for formatting plain text. 

Remember that this exercise sheet was written in a Jupyter notebook, so all of the narrative text and images you have seen so far were produced with Markdown and code. Let’s cover the basics with a quick example:

# This is a level 1 heading

## This is a level 2 heading

This is some plain text that forms a paragraph. Add emphasis via **bold** and __bold__, or *italic* and _italic_. 

Paragraphs must be separated by an empty line. 

* Sometimes we want to include lists. 
* Which can be bulleted using asterisks. 

1. Lists can also be numbered. 
2. If we want an ordered list.

[It is possible to include hyperlinks](https://www.example.com)

Inline code uses single backticks: `foo()`, and code blocks use triple backticks: 
```
bar()
``` 
Or can be indented by 4 spaces: 

    foo()

And finally, adding images is easy: ![Alt text](https://www.example.com/image.jpg)

To see the source text from which these lines were rendered, double-click the cell. It will then show you the source text. When you execute the cell, it will show you the formatted text again.

## Sharing Your Notebooks

When people talk about sharing their notebooks, there are generally two paradigms they may be considering.

Most often, individuals share the end result of their work, much like this exercise sheet itself, which means sharing non-interactive, pre-rendered versions of their notebooks. However, it is also possible to collaborate on notebooks with the aid of version control systems such as Git or online platforms like Google Colab.

### Before You Share

A shared notebook will appear exactly in the state it was in when you export or save it, including the output of any code cells. Therefore, to ensure that your notebook is share-ready, so to speak, there are a few steps you should take before sharing:

1. Click “Cell > All Output > Clear”
2. Click “Kernel > Restart & Run All”
3. Wait for your code cells to finish executing and check that they ran as expected

This will ensure that your notebooks don’t contain intermediary output, don’t have a stale state, and execute in order at the time of sharing.

### Exporting Your Notebooks

Jupyter has built-in support for exporting to HTML and PDF as well as several other formats, which you can find from the menu under “File > Download As.”

You can of course always share .ipynb files directly, and this is also the format I prefer for submitting your data essay.