# Statistical Data Management Session 7: Continuous Random Variables (chapter 5 in McClave & Sincich)


**It is not necessary to use Python for all of the exercises. We clearly indicate when it is. ("Run the following cell of code" or "Use Python" etc.)**

## 1. Uniform Distribution

You ask a computer to generate a random real number between 0 and 15.
1. What is the expected value of this number?
2. What is the probability that this number is less than 4?
3. What is the probability that this number is exactly 7.474114198980?
4. What is the probability that this number is less than or equal to 4?
5. What is the probability that this number is more than 7?

1. $E(x) = (0+15)/2 = 7.5$
2. $P(x<4) = \int_0^4 \frac{1}{15} dx = \frac{1}{15}(4-0) = 4/15$.
3. 0
4. Same as in 2., i.e. $4/15$: for a continuous random variable $P(x=4)=0$, so $P(x\leq 4)=P(x<4)$.
5. $P(x>7) = 1 - P(x\leq 7) = 1-7/15 = 8/15$.
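
The answers above can also be checked numerically. The cell below is a minimal sketch, assuming `scipy.stats.uniform` with `loc=0` and `scale=15` as the model for the generated number (scipy describes the uniform distribution on the interval from `loc` to `loc + scale`). The variable names are chosen here for illustration.

```python
from scipy import stats

# Uniform distribution on [0, 15]: loc is the lower bound, scale the width
u = stats.uniform(loc=0, scale=15)

expected_value = u.mean()   # midpoint of the interval
p_less_than_4 = u.cdf(4)    # P(x < 4), identical to P(x <= 4)
p_more_than_7 = u.sf(7)     # survival function: 1 - cdf(7)

print('1.', expected_value)
print('2.', p_less_than_4)
print('5.', p_more_than_7)
```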

## 2. Standard Normal Distribution *(ex 5.26 from the book)*

Find and visualise **(using and altering the code below)** the following probabilities for the standard normal random variable $z$. 
1. $P(z<-1.56)$
2. $P(z>1.46)$
3. $P(0.67 \leq z < 2.41)$
4. $P(-1.96\leq z\leq-0.33)$
5. $P(z\geq0)$
6. $P(z\geq-2.33)$
7. $P(z<2.33)$

In [None]:
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats as sts
%matplotlib inline

normal = sts.norm(0, 1) # expected value 0, standard deviation 1 for a standard normal distribution
xrange = np.linspace(-3, 3, 10000)
part = np.linspace(-3, -1.56, 10000)  # alter the boundaries to visualise different areas

plt.plot(xrange, normal.pdf(xrange))
plt.fill_between(part, normal.pdf(part))
plt.show()
plt.close()

print('1.', normal.cdf(-1.56))
print('2.', 1 - normal.cdf(1.46))
print('3.', normal.cdf(2.41) - normal.cdf(0.67))
print('4.', normal.cdf(-0.33) - normal.cdf(-1.96))
print('5.', 0.50) # no calculation needed!
print('6.', 1 - normal.cdf(-2.33))
print('7.', normal.cdf(2.33), '(same as 6. due to symmetry)')
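
The symmetry argument used for 6. and 7. can be verified directly. This is a small sketch that uses the survival function `sf` (defined as `1 - cdf`) of a frozen `scipy.stats.norm` object; the names `z`, `p_upper` and `p_lower` are illustrative only.

```python
from scipy import stats

z = stats.norm(0, 1)      # standard normal distribution

p_upper = z.sf(-2.33)     # P(z >= -2.33)
p_lower = z.cdf(2.33)     # P(z < 2.33)

# Both areas coincide because the standard normal pdf is symmetric around 0
print(p_upper, p_lower)
```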

## 3. Normal Distribution *(ex 5.44 from the book)*

When attempting to score a goal in soccer, where should you aim your shot? Should you aim for a goalpost (as some soccer coaches teach), the middle of the goal, or some other target? To answer these questions, *Chance* (Fall 2009) utilized the normal probability distribution. Suppose the accuracy $x$ of a professional soccer player's shots follows a normal distribution with a mean of zero feet and a standard deviation of 3 feet. (For example, if the player hits the target, $x=0$; if they miss 2 feet to the right, $x=2$; and if they miss 1 foot to the left, $x=-1$.) A regulation soccer goal is 24 feet wide. Assume that a goalkeeper will stop (save) all shots within 9 feet of where they are standing; all other shots on goal will score. Consider a goalkeeper who stands in the middle of the goal. Visualise the following probabilities and calculate **using Python**.

1. If the player aims for the right goalpost, what is the probability that they will score?
2. If the player aims for the centre of the goal, what is the probability that they will score?
3. If the player aims for halfway between the right goalpost and the outer limit of the goalkeeper’s reach, what is the probability that they will score?

1. Let $X$ be the position of the shot relative to the centre of the goal, so the aim point is at 12 feet. Goal: $9<X<12 \Longrightarrow$ z-score between $-1$ and $0$ $\Longrightarrow P(9<X<12) = P(-1<z<0) = P(z<0) - P(z<-1) = 0.5 - 0.1587 = 0.3413.$ Technically, one could also aim right and score to the left of the goalkeeper, but that probability is negligible.
2. Goal: $-12<X<-9$ or $9<X<12 \Longrightarrow P(\text{goal}) = 2P(9<X<12)$ due to symmetry. The z-scores of 9 and 12 are 3 and 4 respectively. $P(3<z<4) = P(z<4) - P(z<3) \approx 1 - 0.9987 = 0.0013$, so $P(\text{goal}) \approx 0.0026$.
3. Goal: $9<X<12$ with the aim point at 10.5 feet, so the z-scores of 9 and 12 are $-0.5$ and $0.5$ $\Longrightarrow P(\text{goal}) = P(-0.5<z<0.5) \approx 0.3829$.

In [None]:
# Code used to find probabilities above:
print('1.', 0.5 - normal.cdf(-1))
print('2.', 2*(normal.cdf(4) - normal.cdf(3)))
print('3.', normal.cdf(0.5) - normal.cdf(-0.5))
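
The same probabilities can be obtained without converting to z-scores by hand. The sketch below assumes a coordinate system with the centre of the goal at 0, the goalposts at $\pm 12$ feet and the goalkeeper's reach at $\pm 9$ feet, and models the landing position as a normal distribution centred on the aim point. The helper `p_score` and the constants are hypothetical names introduced here. It also includes the small chance of scoring on the far side of the goalkeeper.

```python
from scipy import stats

SIGMA = 3.0             # standard deviation of the shot (feet)
GOAL_HALF_WIDTH = 12.0  # goalposts at -12 and +12 feet
KEEPER_REACH = 9.0      # goalkeeper saves everything in [-9, 9]

def p_score(aim):
    """Probability of scoring when aiming at position `aim` (feet from centre)."""
    shot = stats.norm(loc=aim, scale=SIGMA)
    right = shot.cdf(GOAL_HALF_WIDTH) - shot.cdf(KEEPER_REACH)
    left = shot.cdf(-KEEPER_REACH) - shot.cdf(-GOAL_HALF_WIDTH)
    return right + left

p_post = p_score(12.0)     # aim for the right goalpost
p_centre = p_score(0.0)    # aim for the centre of the goal
p_halfway = p_score(10.5)  # aim halfway between 9 and 12 feet

print('1.', p_post)
print('2.', p_centre)
print('3.', p_halfway)
```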

## 4. Linear Combination of Random Variables

Wooden planks are delivered to a furniture factory, where they will be sanded and packaged in cardboard. The thickness of the planks follows a normal distribution with mean 20 mm and standard deviation 0.5 mm, $N(20,0.5)$. A machine sands off a thickness which follows a normal distribution $N(1,0.1)$, on either side of the plank. Two planks are then wrapped together in cardboard (packaging goes totally around). The thickness of the cardboard (one layer of it) is exactly 6 mm. 

1. Find the mean and standard deviation of the thickness of the total packaged product.
2. It can be shown that a linear combination of normal distributions is also a normal distribution. Knowing this, and using the information from 1., find the probability that the thickness of the total packaged product is more than 50 mm. **Use Python for this question.**

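1. Let $P$ and $S$ be the random variables representing the thickness of a plank and the thickness sanded off one side, respectively. A package contains two planks, four sanded sides and two layers of cardboard ($2\cdot 6 = 12$ mm), so the total thickness is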

$T=2P-4S+12$

$E(T)=2E(P)-4E(S)+12=40-4+12=48$

$Var(T)= Var(2P-4S+12) = 4Var(P)+16Var(S) = 4\cdot 0.25 + 16\cdot 0.01 = 1.16$, so the standard deviation is $\sqrt{1.16}\approx 1.0770$. The constant 12 does not contribute to the variance. We needed to assume that $P$ and $S$ are independent, which they indeed are: the planks arrive from a different factory, which is independent of our sanding machine. This is not the case for all phenomena. Take e.g. the sum of the lengths of both your thumbs. As you take the lengths of the thumbs of _the same person_, both lengths will be (highly) correlated.

Note that despite taking **off** ($-$ sign) a bit of the thickness by sanding, the variance of the machine **adds** ($+$ sign) to the total variance of the final package.

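2. Since $T$ is normally distributed with mean 48 and standard deviation $\sqrt{1.16}$, the requested probability is $P(T>50) = 1 - P(T\leq 50)$: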

In [None]:
package_thickness = sts.norm(48, np.sqrt(1.16))
print(1 - package_thickness.cdf(50))
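
As a sanity check, the mean, standard deviation and tail probability can be approximated by simulation. The sketch below follows the formula $T=2P-4S+12$ as written, i.e. one draw of $P$ scaled by 2 and one independent draw of $S$ scaled by 4. Drawing every plank and every sanding pass separately would be a different model. The seed and sample size are arbitrary choices made here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)  # fixed seed for reproducibility
n = 1_000_000                        # number of simulated packages

P = rng.normal(20, 0.5, size=n)      # plank thickness (mm)
S = rng.normal(1, 0.1, size=n)       # sanded-off thickness (mm)
T = 2 * P - 4 * S + 12               # total thickness, as in the formula above

sim_mean = T.mean()
sim_std = T.std(ddof=1)
sim_tail = (T > 50).mean()           # fraction of packages thicker than 50 mm

exact_tail = stats.norm(48, np.sqrt(1.16)).sf(50)

print(sim_mean, sim_std)
print(sim_tail, exact_tail)
```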

## 5. Assessing Normality

Some statistical tests rely on the assumption that the data are drawn from a normally distributed population. To assess whether this is a realistic assumption, one can make so-called Q-Q plots **using Python**. The more "normal" the data look, the closer the observations are to the red theoretical line.
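
To see what a Q-Q plot compares, the points can be built by hand: the sorted observations are paired with the normal quantiles at matching plotting positions. The sketch below uses the simple positions $(i-0.5)/n$, which is an assumption made here and differs slightly from the positions `sts.probplot` uses. The data are synthetic, and the helper names are hypothetical. The correlation between both sets of quantiles is close to 1 for normal-looking data and lower for skewed data.

```python
import numpy as np
from scipy import stats

def qq_points(sample):
    """Return (theoretical normal quantiles, ordered observations)."""
    ordered = np.sort(np.asarray(sample, dtype=float))
    n = len(ordered)
    positions = (np.arange(1, n + 1) - 0.5) / n   # assumed plotting positions
    theoretical = stats.norm.ppf(positions)
    return theoretical, ordered

def qq_correlation(sample):
    """Correlation between the two coordinates of the Q-Q points."""
    theoretical, ordered = qq_points(sample)
    return np.corrcoef(theoretical, ordered)[0, 1]

rng = np.random.default_rng(seed=1)
r_normal = qq_correlation(rng.normal(50, 10, size=500))    # synthetic normal data
r_skewed = qq_correlation(rng.exponential(10, size=500))   # synthetic skewed data

print(r_normal, r_skewed)
```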

1. Run the code below to make a Q-Q plot of the durations of the campaign videos of presidential candidates from session 1. Also make a histogram. Would you conclude that the assumption of normality is realistic?

In [None]:
conn = sqlite3.connect("../../shared/boas_peru_metadata.db")
df_durations = pd.read_sql_query("SELECT duration FROM videos_peru", conn)

plt.figure(figsize=(10,6))    
sts.probplot(df_durations['duration'], plot=plt)
plt.title("Q-Q Plot for video durations")
plt.show()
plt.close()

plt.figure(figsize=(10,6))   
plt.hist(df_durations['duration'], bins='sqrt')
plt.title("Histogram for video durations")
plt.show()
plt.close()

# Clearly, it's unrealistic to assume normality.

2. The following dataframe contains the respective height (in cm) and weight (in kg) of 507 active individuals (see https://www.openintro.org/book/statdata/?data=bdims). Generate Q-Q plots, histograms and boxplots for both attributes. Assess their normality.

In [None]:
df = pd.read_csv('../../shared/bdims.csv',delimiter=",")
print(df)

figsize = (8,5)

for column in df.columns:
    plt.figure(figsize=figsize)
    plt.boxplot(df[column], vert=False)
    plt.title("Boxplot for " + column)
    plt.show()
    plt.close()
    
    plt.figure(figsize=figsize)
    plt.hist(df[column], bins="sqrt")
    plt.title("Histogram for " + column)
    plt.show()
    plt.close()
    
    plt.figure(figsize=figsize)
    sts.probplot(df[column], plot=plt)
    plt.show()
    plt.close()
    
# The assumption that these are drawn from a normally distributed population is realistic (for both attributes).

3. Run the following cell. Based on the covariance matrix, do you expect a positive or negative correlation between the attributes?

In [None]:
mu = df.mean()
print(mu)
cov = df.cov()
print(cov)

# Positive off-diagonal elements => expect positive correlation, confirmed by the 2D histogram (higher count in bin = warmer colour)

plt.figure(figsize=(8,8))
plt.hist2d(df["height"], df["weight"], bins= int(np.round(np.sqrt(len(df)),0)))   
plt.show()
plt.close()
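
The link between the covariance matrix and the correlation can be made explicit: dividing each covariance by the product of the two standard deviations gives the correlation coefficient. The sketch below uses a synthetic data frame (the numbers are made up for illustration and are not taken from `bdims.csv`) and compares the result with pandas' own `corr()`.

```python
import numpy as np
import pandas as pd

# Synthetic, made-up data with a built-in positive relationship
rng = np.random.default_rng(seed=2)
height = rng.normal(171, 9, size=507)
weight = 69 + 0.9 * (height - 171) + rng.normal(0, 10, size=507)
toy = pd.DataFrame({"height": height, "weight": weight})

cov = toy.cov()
std = np.sqrt(np.diag(cov))              # standard deviations from the diagonal
corr_from_cov = cov / np.outer(std, std)  # rho_ij = cov_ij / (sigma_i * sigma_j)

print(corr_from_cov)
print(toy.corr())
```

The sign of an off-diagonal covariance always equals the sign of the corresponding correlation, because the standard deviations in the denominator are positive.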

## 6. Other Continuous Distributions...

A certain phenomenon $x$ follows a Fisk distribution with $c=4$ (see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisk.html#scipy.stats.fisk). **Use Python to answer the following questions.** 

Note: we didn't cover this distribution and will never use it again, but we want to show that once you know the distribution of a phenomenon, calculations can be left to Python!

1. Plot the pdf and cdf of this distribution, starting from the code below.
2. Calculate $P(x<1.3)$.
3. Calculate $P(x \geq 2.5)$.
4. Calculate $P(1 < x \leq 2.7)$.
5. Find the expected value and standard deviation of the distribution of $x$.

In [None]:
x = sts.fisk(4)
xrange = np.linspace(0,3,10000)

plt.figure(figsize=(10, 6))
plt.plot(xrange, x.pdf(xrange), label="pdf")
plt.plot(xrange, x.cdf(xrange), label="cdf")
plt.legend()
plt.show()
plt.close()

print('2.', x.cdf(1.3))
print('3.', 1 - x.cdf(2.5))
print('4.', x.cdf(2.7) - x.cdf(1))
print('5.', x.mean(), x.std())
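
The `cdf` and `mean` calls can be cross-checked against the definition of a continuous distribution: probabilities are areas under the pdf, and the expected value is the integral of $t\cdot f(t)$. The sketch below uses `scipy.integrate.quad` for the numerical integration and the closed-form cdf of the Fisk (log-logistic) distribution with unit scale, $F(t)=1/(1+t^{-c})$. The helper `fisk_cdf` is a name introduced here.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

c = 4
x = stats.fisk(c)

def fisk_cdf(t, c):
    """Closed-form cdf of the Fisk distribution with unit scale."""
    return 1.0 / (1.0 + t ** (-c))

area, _ = quad(x.pdf, 0, 1.3)                               # P(x < 1.3) as area under the pdf
closed_form = fisk_cdf(1.3, c)                              # same probability, closed form
mean_numeric, _ = quad(lambda t: t * x.pdf(t), 0, np.inf)   # E(x) by integration

print(area, closed_form, x.cdf(1.3))
print(mean_numeric, x.mean())
```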

## 7. Joint Probability Functions

The following function represents the joint probability density function for two random variables, with $X$ ranging from $0$ to $+\infty$, $Y$ between $1$ and $3$ and $c$ a constant to be determined.

$$f_{X,Y}(x,y)=c\cdot e^{-x}y$$

1. Find the value for $c$ so that this is a correct pdf.
2. Find the marginal pdfs $f_Y(y)$ and $f_X(x)$.
3. Are $X$ and $Y$ independent?
4. Write down the correct integrals (without calculating!) to find $P(2<X<3,Y<2)$ and $P(Y<X)$.
5. Calculate these probabilities.

1. First of all, for all values for $x$ and $y$ over the domain, $f_{X,Y}(x,y)$ is positive (assuming $c$ is as well). Secondly, the integral over the entire domain should be equal to 1: 
\begin{align*}
\int_0^{+\infty}\int_{1}^3 f_{X,Y}(x,y) \, dydx  &= c\int_{0}^{+\infty}e^{-x} dx \int_{1}^3 y dy\\ &= c\cdot\left[ -e^{-x} \right]_0^{+\infty}\cdot \left[ \frac{1}{2}y^2 \right]_1^{3} \\ &= c\cdot 1\cdot 4 = 1 \\ \Leftrightarrow &\; c=\frac{1}{4}.
\end{align*}

2. By integrating out $x$, we find $f_Y(y) = \int_0^{+\infty} f_{X,Y}(x,y)\,dx = \frac{1}{4}y\int_0^{+\infty}e^{-x}\,dx = \frac{1}{4}y$. Similarly, $f_X(x) = \int_1^{3} f_{X,Y}(x,y)\,dy = \frac{1}{4}e^{-x}\int_1^3 y\,dy = e^{-x}$.
3. Yes, as $f_{X,Y}(x,y)$ factorises into $f_X(x)\cdot f_Y(y)$.
4. $$P(2<X<3,Y<2) = \int_2^{3}\int_{1}^2 f_{X,Y}(x,y) \, dydx $$ $$P(Y<X) = \int_{y=1}^3\int_{x=y}^{+\infty} f_{X,Y}(x,y) \, dxdy $$
5. $$P(2<X<3,Y<2) = \int_2^{3}\int_{1}^2 f_{X,Y}(x,y) \, dydx = \frac{1}{4}\int_2^{3}e^{-x}dx\int_{1}^2y\;dy =  \frac{1}{4}\cdot (e^{-2}-e^{-3})\cdot \frac{3}{2} \approx 0.0321$$

For the following integral, you'll need that, applying integration by parts, $\int ye^{-y}dy=-ye^{-y}-e^{-y}+C$.
\begin{align*}
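P(Y<X) &= \int_{y=1}^3\int_{x=y}^{+\infty} f_{X,Y}(x,y) \, dxdy \\
&= \frac{1}{4}\int_{y=1}^3 y\left(\int_{x=y}^{+\infty}e^{-x}dx\right)dy \\
&= \frac{1}{4}\int_{y=1}^3ye^{-y}dy \\
&= \frac{1}{4}\left[-ye^{-y}-e^{-y}\right]_1^3 \\
&= \frac{1}{4}\left(2e^{-1}-4e^{-3}\right) \\
&= \frac{e^{-1}}{2}-e^{-3} \approx 0.1342.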
\end{align*}
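
The results of this exercise can be verified numerically. The sketch below uses `scipy.integrate.dblquad`, which expects the integrand as `func(inner, outer)` and the inner limits as functions of the outer variable. The function and variable names are introduced here for illustration, and $c=\frac{1}{4}$ is taken from question 1.

```python
import numpy as np
from scipy.integrate import dblquad

def joint_pdf(y, x):
    """Joint pdf with c = 1/4; y is the inner variable, x the outer one."""
    return 0.25 * np.exp(-x) * y

# Total probability: x from 0 to infinity (outer), y from 1 to 3 (inner)
total, _ = dblquad(joint_pdf, 0, np.inf, lambda x: 1, lambda x: 3)

# P(2 < X < 3, Y < 2): x from 2 to 3 (outer), y from 1 to 2 (inner)
p_box, _ = dblquad(joint_pdf, 2, 3, lambda x: 1, lambda x: 2)

# P(Y < X): y from 1 to 3 (outer), x from y to infinity (inner),
# so the integrand takes x first here
p_y_less_x, _ = dblquad(lambda x, y: 0.25 * np.exp(-x) * y,
                        1, 3, lambda y: y, lambda y: np.inf)

print(total)
print(p_box, 3 / 8 * (np.exp(-2) - np.exp(-3)))
print(p_y_less_x, np.exp(-1) / 2 - np.exp(-3))
```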