# Statistical Data Management Session 6: Discrete Random Variables (chapter 4 in McClave & Sincich)


**It is not necessary to use Python for all of the exercises. We clearly indicate when it is. ("Run the following cell of code" or "Use Python" etc.)**


## 1. Expected Value

You play the following game at a fair. You pay €4.20 to participate and are presented with a bag containing 240 marbles: 10 red and 230 blue. You may pick one at random without looking. If the marble you draw is red, you win €100; if it is blue, you win nothing.

1. Let $x$ represent your profit. Find $E(x)$.

| $x$ | $-4.20$ | $100-4.20 = 95.80$ |
|:---:|:---:|:---:|
| $P(x)$ | $230/240$ | $10/240$ |

$E(x) = -4.20 \cdot \frac{230}{240} + 95.80 \cdot \frac{10}{240} \approx -0.033 < 0$, so this game is not profitable (for the player).

2. Now let the fee to play be a variable $c$ (it used to be €4.20). How many marbles should be red to make this game profitable to you, as a function of $c$? The potential prize of €100 stays the same. Check with $c=4.2$.

Let $r$ represent the number of red marbles. The game is profitable to you if $E(x)>0$:

$E(x) = -c \cdot \frac{240-r}{240} + (100-c) \cdot \frac{r}{240} > 0 \Leftrightarrow \frac{100r - 240c}{240} > 0 \Leftrightarrow r > 2.4 \cdot c$. For $c = 4.2$ this gives $r > 10.08$, so the original game is profitable from 11 red marbles onwards.
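
The two results above can be checked numerically. The following is a minimal sketch, not part of the original exercise; the helper names `expected_profit` and `min_red_for_profit` are made up for this illustration.

```python
import math

total_marbles = 240
prize = 100.0


def expected_profit(fee, red):
    """Expected profit of one draw, given the fee and the number of red marbles."""
    p_red = red / total_marbles
    return (prize - fee) * p_red - fee * (1 - p_red)


def min_red_for_profit(fee):
    """Smallest integer r with r > 2.4 * fee, i.e. the smallest r with E(x) > 0."""
    return math.floor(total_marbles * fee / prize) + 1


print(expected_profit(4.20, 10))   # original game: slightly negative
print(min_red_for_profit(4.20))    # number of red marbles needed for c = 4.2
print(expected_profit(4.20, 11))   # positive once that number is reached
```

The `+ 1` after `math.floor` reflects the strict inequality $r > 2.4 \cdot c$: if $2.4 \cdot c$ happens to be a whole number, that many red marbles only gives $E(x) = 0$.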

## 2. Expected Value *(based on ex. 4.48 from the book)*

In a casino, one can play the following game. You spin a wheel with 20 values on it, marked 5, 10, ..., 100. Contestants spin the wheel once or twice, with the objective of obtaining the highest total score *without going over 100*. They win their total score in euros. Let $x$ represent the score of a single contestant playing the game. Assume a "fair" wheel (i.e., a wheel with equally likely outcomes). If the total of the player's spins exceeds 100, the total score $x$ is set to 0.
1. If the player is permitted only one spin of the wheel, find the probability distribution for $x$.
2. Find $E(x)$ and interpret this value.
3. Find the standard deviation of $x$.
4. Suppose the player obtains a 20 on the first spin and decides to spin again. In this case, what is the probability that the player's total score exceeds 100 (and is consequently reset to 0)? Find the probability distribution for $x$ given that the player obtained 20 on the first spin and decides to spin again.
5. **Run the following cell of code.** Make sure you understand what happens. The `scipy.stats` (`sts`) package API may be found [here](https://docs.scipy.org/doc/scipy/reference/stats.html). Have a look at 'rv_discrete'.


Assume that $x$ is a number/score that comes out after a contestant spins the wheel. 
1. There are 20 outcomes in total, all equally likely, so the probability of any given number after one spin is 1/20. Hence, $P(5) = 0.05, P(10) = 0.05$, etc.
2. Expected value: 
    $E(x) = \sum xP(x) = 5 \cdot 0.05 + 10 \cdot 0.05 + \cdots + 100 \cdot 0.05 = 52.5$.
3. $\sigma^2 = \sum (x-E(x))^2 P(x) = (5-52.5)^2 \cdot 0.05 + \cdots + (100-52.5)^2 \cdot 0.05 = 831.25$.

    $\sigma = \sqrt{831.25} \approx 28.83$.
 
4. If the outcome from the first spin is 20, possible values for $x$, which is the score of the first and second spin combined, but possibly reset to 0, are: 0 (with second spin outcome of 85, 90, 95 or 100), 25, 30, 35, 40, ..., 100. Therefore $P(x = 0)$ is the probability that the sum of the first spin and second spin exceeds 100. $P(x = 0) = 4/20  = 0.2$ (4 possible values on the second spin such that the sum will exceed 100). The full probability distribution is $P(x=0)=0.2$, $P(x = 25) = 0.05$ (if the second outcome is 5), $P(x = 30) = 0.05$ (if the second outcome is 10), ..., $P(x = 100) = 0.05$ (if the second outcome of the spin is 80). Note that these probabilities indeed sum to 1.
5.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats as sts
%matplotlib inline

outcomes = np.arange(5, 101, 5)
print(outcomes)
probabilities = np.empty(20)
probabilities.fill(0.05)

#manually overwrite in case of non-equal probabilities (make sure they sum to 1 though!):
#uncomment following two lines to use this
#probabilities[0] = 0.07
#probabilities[1] = 0.03

print(probabilities)
x = sts.rv_discrete(values=(outcomes, probabilities))
print(x.mean())
print(x.std())
print(x.var())

plt.figure(figsize=(10, 6))
plt.stem(outcomes, x.pmf(outcomes))
plt.title("Stem diagram of discrete probability distribution", fontsize=16)
plt.show()
plt.close()
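
The distribution from part 4 (first spin equal to 20, followed by a second spin) can be built in the same way with `rv_discrete`. This is only a sketch under the same fair-wheel assumption; the variable names are chosen for this illustration.

```python
import numpy as np
from scipy import stats as sts

first_spin = 20
second_spin = np.arange(5, 101, 5)            # 20 equally likely outcomes

totals = first_spin + second_spin
scores = np.where(totals > 100, 0, totals)    # totals above 100 are reset to 0

# collapse identical scores (the four "bust" outcomes all map to 0)
values, counts = np.unique(scores, return_counts=True)
probs = counts / second_spin.size

x_two_spins = sts.rv_discrete(values=(values, probs))

print(dict(zip(values.tolist(), probs.tolist())))
print(x_two_spins.pmf(0))    # probability of going over 100
print(x_two_spins.mean())    # expected score after the second spin
```

Comparing `x_two_spins.mean()` with the 20 already in hand indicates whether spinning again pays off on average.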

## 3. The Binomial Random Variable *(based on ex. 4.66 from the book)*

**Caesarian births.** The American College of Obstetricians and Gynecologists reports that 32% of all births in the United States take place by Caesarian section each year. (*National Vital Statistics Reports*, Mar. 2010).
1. In a random sample of 100 births, how many, on average, will take place by Caesarian section?
2. What is the expected standard deviation of the number of Caesarian section births in a sample of 100 births?
3. Assuming that the distribution is mound-shaped, use your answers from 1 and 2 to compute the interval that will likely contain 95% of the observations.
4. **Run the following cell of code.** What do the numerical values represent? (Look up `binom` in the `stats` API!)
5. Plot the cdf **using Python**.

Let $x$ be the number of births by Caesarian section in a sample of 100 births.
1. $E(x) = np = 100 \cdot 0.32 = 32$.
2. $\sigma = \sqrt{npq} = \sqrt{100 \cdot 0.32 \cdot 0.68} \approx 4.66$.
3. By the Empirical Rule, approximately 95% of the observations fall within two standard deviations of the mean: $\mu \pm 2\sigma \Rightarrow [22.67; 41.33]$. In roughly 95% of such samples, the number of Caesarian births would therefore lie between 22 and 42 (rounding outwards to be on the safe side).

In [None]:
caeserian = sts.binom(100, 0.32)
outcomes = np.arange(0, 101, 1)

plt.figure(figsize=(10, 6))
plt.stem(outcomes, caeserian.pmf(outcomes))
plt.title('PMF of binomial distribution with $n=100$, $p=0.32$', fontsize=16)
plt.show()
plt.close()

plt.figure(figsize=(10, 6))
plt.stem(outcomes, caeserian.cdf(outcomes))
plt.title('CDF of binomial distribution with $n=100$, $p=0.32$', fontsize=16)
plt.show()
plt.close()

print(caeserian.mean()) # expected value
print(caeserian.std()) # standard deviation
print(caeserian.cdf(20)) # probability of at most 20 Caesarian sections in the sample of 100 (20 included)
print(caeserian.pmf(30)) # probability of exactly 30 Caesarian sections in the sample of 100

6. Find the probability that, in a random sample of 100 births, 40 or more take place by Caesarian section.

In [None]:
print(1 - caeserian.cdf(39)) # cdf(39) is probability up to and including 39

7. Run the following cell of code, but first you have to give the variable a meaningful name. Explain why the two outcomes are the same.

Changing 0.32 into 0.68 models the number of births *without* Caesarian section; $\sigma = \sqrt{npq}$ stays the same because $p$ and $q$ merely swap roles. The first number is the probability of at most 40 Caesarian sections. The second number is 1 minus the probability of at most 59 births *without* Caesarian section, i.e. the probability of 60 or more births without Caesarian section. In a sample of 100 births, at most 40 Caesarian sections means exactly the same thing as at least 60 births without one, so both probabilities are equal.

In [None]:
noncaeserian = sts.binom(100, 0.68)
print(caeserian.cdf(40))
print(1 - noncaeserian.cdf(59))
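
Part 3 of this exercise relied on the Empirical Rule, which is only an approximation. As a rough check, the exact binomial probability of landing inside $\mu \pm 2\sigma$ can be computed directly. The sketch below also shows the survival function `sf` as an alternative way to write the tail probability of part 6; the variable names are chosen for this illustration.

```python
import numpy as np
from scipy import stats as sts

n, p = 100, 0.32
births = sts.binom(n, p)

mu, sigma = births.mean(), births.std()
lower, upper = mu - 2 * sigma, mu + 2 * sigma

# integer outcomes that actually lie inside [lower, upper]
k_lo, k_hi = int(np.ceil(lower)), int(np.floor(upper))
exact_inside = births.cdf(k_hi) - births.cdf(k_lo - 1)

# the interval rounded outwards to 22..42, as in the written solution
exact_widened = births.cdf(42) - births.cdf(21)

print(k_lo, k_hi, exact_inside)
print(exact_widened)

# P(x >= 40): sf(k) equals 1 - cdf(k), i.e. P(x > k)
print(births.sf(39), 1 - births.cdf(39))
```

Both interval probabilities should come out close to 95%, which is in line with the mound-shaped assumption made in part 3.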

## 4. The Poisson Random Variable

The 1953 storm that caused severe flooding in the Netherlands and Belgium is described as a "once in 250 year storm". (SNICK, I. 2009. "Duizendjarige storm zet zeespiegel vier meter hoger". https://www.standaard.be/cnt/9i25dc0f). **Use Python** to answer the following questions.


1. How many such storms are to be expected in the next 1000 years? How does the knowledge that one took place only 70 years ago influence your answer?
2. Run the following cell of code.


1. With on average one such storm per 250 years, $\lambda = 1000/250 = 4$ storms are to be expected. Knowledge of a recent storm does not change this number, because the Poisson model assumes that the storms occur independently of each other. Thinking otherwise is similar to the Gambler's fallacy ("Every time I lose a bet, my chances of winning the next one increase"), which is of course wrong.

**Remark:** $\lambda$ needn't necessarily be a natural number! If this was a "once in 400 years storm", we would have had $\lambda = 2.5$. The set of possible outcomes is still the natural numbers in that case (either 0 storms, 1 storm, 2 storms etc.).

In [None]:
storms = sts.poisson(4)
outcomes = np.arange(0, 15, 1)

# two different plots, but here with some extra tricks to represent both 
# clearly in one graph (you're not required to be able to do this yourself)
plt.figure(figsize=(10,6))
plt.title('Number of storms to be expected over the next 1000 years', fontsize=16)
plt.stem(outcomes, storms.cdf(outcomes), label="cdf")
plt.stem(outcomes, storms.pmf(outcomes), linefmt='C1-', markerfmt='C1o', label="pmf")
plt.legend()
plt.show()
plt.close()

3. What is the probability that 2 or fewer such storms occur in the next 1000 years?
4. What is the probability that no such storms occur in the next 1000 years?
5. What is the probability that 5 or more such storms occur in the next 1000 years?
6. What is the probability that 100 or more such storms occur in the next 1000 years?


In [None]:
print('3.', storms.cdf(2))
print('4.', storms.pmf(0))
print('5.', 1 - storms.cdf(4)) # including 5 => excluding 4 or fewer
print('6.', 1 - storms.cdf(99)) # the exact value is positive but far below floating-point precision, so this prints (essentially) 0
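
For very small tail probabilities such as the one in question 6, subtracting the cdf from 1 loses the answer to rounding. A possible alternative, sketched here with variable names chosen for this illustration, is the survival function `sf`, which computes $P(x > k)$ directly.

```python
from scipy import stats as sts

return_period = 250      # "once in 250 years"
horizon = 1000           # years considered
lam = horizon / return_period

storms = sts.poisson(lam)

naive_tail = 1 - storms.cdf(99)   # tiny tail is lost when subtracting from 1
direct_tail = storms.sf(99)       # P(x > 99) = P(x >= 100), computed directly

print(lam)
print(naive_tail)
print(direct_tail)
```

The second value is still astronomically small, but it is no longer rounded away to zero.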

## 5. Train Delays

You commute by train and have noticed that it has a 15% probability of being delayed. You take a course with 13 exercise sessions, and you will sometimes arrive late because of a delayed train. If the lecturer wants to allow for only a 1% probability of wrongfully accusing a student, from how many late arrivals onwards should they suspect that you are using train delays as an excuse too often? **Use Python to answer this question.**

In [None]:
delayed = sts.binom(13, 0.15)
print(delayed.cdf(4)) # P(at most 4 delays): still below 99%
print(delayed.cdf(5)) # P(at most 5 delays): over 99%

There is a probability of over 99% that you arrive late 5 or fewer times because of train delays, and hence a probability of less than 1% that you arrive late 6 or more times for this reason. The lecturer should therefore become suspicious from 6 late arrivals onwards (6 included).
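
Instead of trying values of the cdf by hand, the threshold can also be searched for programmatically. The following sketch assumes the same binomial model; `alpha` and `threshold` are names introduced for this illustration.

```python
from scipy import stats as sts

n_sessions, p_delay = 13, 0.15
alpha = 0.01    # tolerated probability of wrongfully accusing a student

delayed = sts.binom(n_sessions, p_delay)

# smallest k for which P(x >= k) = sf(k - 1) drops to alpha or below
threshold = next(
    k for k in range(n_sessions + 1) if delayed.sf(k - 1) <= alpha
)

print(threshold)
print(delayed.sf(threshold - 1))   # P(x >= threshold), below 1%
print(delayed.sf(threshold - 2))   # P(x >= threshold - 1), still above 1%
```

The search uses `sf(k - 1)` because `sf(k)` is $P(x > k)$, so shifting by one gives $P(x \geq k)$.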

## 6. Website Visits

Our website receives on average 150 HTTP GET requests per second. **Use Python to address the following questions.**

1. Our web hosting account allows up to 200 requests per second; any additional requests are ignored. What is the probability that the server ignores at least one request in a given second?
2. We want to allow for at most a 5% probability of having ignored requests. From which average number of requests per second onwards should we upgrade our account?

In [None]:
print('1')
requests = sts.poisson(150)
probability_having_ignored_requests = 1 - requests.cdf(200)
print(probability_having_ignored_requests)
print()
print('2')

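# increase the average load step by step until the probability of
# exceeding 200 requests in a second reaches 5%
mean_requests = 150
while probability_having_ignored_requests < 0.05:
    mean_requests += 1
    requests = sts.poisson(mean_requests)
    probability_having_ignored_requests = 1 - requests.cdf(200)

print(mean_requests) # first average load at which the 5% limit is reached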
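
The same search can be written without an explicit loop by evaluating the Poisson tail for a whole range of candidate means at once. This is a sketch only; the search range 150 to 200 and the variable names are assumptions made for this illustration.

```python
import numpy as np
from scipy import stats as sts

capacity = 200     # requests per second allowed by the hosting account
max_risk = 0.05    # tolerated probability of ignoring at least one request

# candidate average loads (assumed search range for this sketch)
candidate_means = np.arange(150, 201)

# P(x > capacity) for every candidate mean at once; sf(k) = 1 - cdf(k)
risk = sts.poisson.sf(capacity, candidate_means)

# first candidate mean at which the risk reaches the limit
upgrade_from = int(candidate_means[np.argmax(risk >= max_risk)])

print(upgrade_from)
print(sts.poisson.sf(capacity, upgrade_from))
```

`np.argmax` on a boolean array returns the index of the first `True`, which works here because the risk increases with the mean and reaches the limit within the assumed range.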

## 7.  Tracking Missiles *(ex. 4.65 from the book)*

The U.S. government has devoted considerable funding to missile defense research over the past 20 years. The latest development is the Space-Based Infrared System (SBIRS), which uses satellite imagery to detect and track missiles (*Chance*, Summer 2005). The probability that an intruding object (e.g., a missile) will be detected on a flight track by SBIRS is 0.8. Consider a sample of 20 simulated tracks, each with an intruding object. Let $x$ equal the number of these tracks on which SBIRS detects the object. **Use Python where required.**

1. Graph the probability mass function (pmf) and the cumulative distribution function (cdf) for all possible numbers of tracks with objects detected by the SBIRS. Use two different stem plots.
2. Find $P(x = 15)$, the probability that SBIRS will detect the object on exactly 15 tracks.
3. Find $P(x \geq 15)$, the probability that SBIRS will detect the object on at least 15 tracks. 
4. Find $E(x)$ and interpret the result.

In [None]:
tracking = sts.binom(20,0.8)
outcomes = np.arange(0,21)

# two separate plots would have been enough, but here are some extra tricks to
# represent both clearly in one graph (you're not required to be able to do this yourself)
plt.figure(figsize=(10,6))
plt.title('TRACKING', fontsize=16)
plt.stem(outcomes,tracking.cdf(outcomes),  label="cdf")
plt.stem(outcomes,tracking.pmf(outcomes), linefmt='C1-', markerfmt='C1o', label="pmf")
plt.legend()
plt.show()
plt.close()

print('2.', tracking.pmf(15))
print('3.', 1 - tracking.cdf(14))
print('4.', tracking.mean())
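
As a cross-check of parts 2 to 4, the binomial formula $P(x = k) = \binom{n}{k} p^k (1-p)^{n-k}$ can be evaluated by hand and compared with `scipy.stats`. This is a minimal sketch; the helper `binom_pmf` is a name introduced for this illustration.

```python
from math import comb

from scipy import stats as sts

n_tracks, p_detect = 20, 0.8
tracking = sts.binom(n_tracks, p_detect)


def binom_pmf(k, n, p):
    """Binomial probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


p_exactly_15 = binom_pmf(15, n_tracks, p_detect)
p_at_least_15 = sum(
    binom_pmf(k, n_tracks, p_detect) for k in range(15, n_tracks + 1)
)
expected = n_tracks * p_detect

print(p_exactly_15, tracking.pmf(15))
print(p_at_least_15, tracking.sf(14))
print(expected, tracking.mean())
```

For part 4, $E(x) = np = 16$: over many repeated samples of 20 tracks, SBIRS would detect the object on 16 tracks per sample on average.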