# Statistical Data Management Session 7: Continuous Random Variables (chapter 5 in McClave & Sincich)


**It is not necessary to use Python for all of the exercises. We clearly indicate when it is. ("Run the following cell of code" or "Use Python" etc.)**

## 1. Uniform Distribution

You ask a computer to generate a random real number between 0 and 15.
1. What is the expected value of this number?
2. What is the probability that this number is less than 4?
3. What is the probability that this number is exactly 7.474114198980?
4. What is the probability that this number is less than or equal to 4?
5. What is the probability that this number is more than 7?
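
As a sketch of how such questions translate to Python, the cell below uses `scipy.stats.uniform` on a different, purely illustrative interval (2 to 10, an assumption made for this example, not the interval of the exercise). Note that scipy parametrises this distribution by `loc` (the lower bound) and `scale` (the width), not by lower and upper bound.

```python
from scipy import stats as sts

# Illustrative interval [2, 10]: loc is the lower bound, scale is the width (10 - 2)
u = sts.uniform(loc=2, scale=8)

mean_u = u.mean()      # midpoint of the interval
p_below_4 = u.cdf(4)   # P(X < 4); for a continuous variable this equals P(X <= 4)
p_above_7 = u.sf(7)    # P(X > 7), the survival function 1 - cdf

print(mean_u, p_below_4, p_above_7)
```

Because a single point has zero width, the probability of hitting one exact value is zero for any continuous distribution, which is why `<` and `<=` give the same result here.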

## 2. Standard Normal Distribution *(ex 5.26 from the book)*

Find and visualise **(using and altering the code below)** the following probabilities for the standard normal random variable $z$. Find the same probabilities using the z-table on Toledo.
1. $P(z<-1.56)$
2. $P(z>1.46)$
3. $P(0.67 \leq z < 2.41)$
4. $P(-1.96\leq z\leq-0.33)$
5. $P(z\geq0)$
6. $P(z\geq-2.33)$
7. $P(z<2.33)$

In [None]:
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats as sts
%matplotlib inline

normal = sts.norm(0, 1) # expected value 0, standard deviation 1 for a standard normal distribution
xrange = np.linspace(-3, 3, 10000)
part = np.linspace(-3, -1.56, 10000)  # alter the boundaries to visualise different areas

plt.plot(xrange, normal.pdf(xrange))
plt.fill_between(part, normal.pdf(part))
plt.show()
plt.close()

print('1.', normal.cdf(...))  # replace ... with the boundary for probability 1
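
The three kinds of probabilities in the list (left tail, right tail, interval) can all be built from the cumulative distribution function. The following is a minimal sketch with illustrative boundaries of $\pm 1$, chosen for this example and not taken from the exercise.

```python
from scipy import stats as sts

normal = sts.norm(0, 1)  # standard normal distribution

# Illustrative boundaries (not those of the exercise)
p_left = normal.cdf(1.0)                        # P(z < 1)
p_right = normal.sf(1.0)                        # P(z > 1) = 1 - cdf(1)
p_between = normal.cdf(1.0) - normal.cdf(-1.0)  # P(-1 <= z <= 1)

print(round(p_left, 4), round(p_right, 4), round(p_between, 4))
```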

## 3. Normal Distribution *(ex 5.44 from the book)*

When attempting to score a goal in soccer, where should you aim your shot? Should you aim for a goalpost (as some soccer coaches teach), the middle of the goal, or some other target? To answer these questions, *Chance* (Fall 2009) utilized the normal probability distribution. Suppose the accuracy $x$ of a professional soccer player's shots follows a normal distribution with a mean of zero feet and a standard deviation of 3 feet. (For example, if the player hits the target, $x=0$; if he/she misses 2 feet to the right, $x=2$; and if he/she misses 1 foot to the left, $x=-1$.) Now, a regulation soccer goal is 24 feet wide. Assume that a goalkeeper will stop (save) all shots within 9 feet of where he/she is standing; all other shots on goal will score. Consider a goalkeeper who stands in the middle of the goal. Visualise the following probabilities and calculate them with a z-table.

1. If the player aims for the right goalpost, what is the probability that he/she will score?
2. If the player aims for the centre of the goal, what is the probability that he/she will score?
3. If the player aims for halfway between the right goalpost and the outer limit of the goalkeeper's reach, what is the probability that he/she will score?
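
To use a z-table for a normal variable that is not standard, each boundary is first standardised with $z = (x-\mu)/\sigma$. The sketch below checks, for hypothetical values of $\mu$, $\sigma$ and $x$ (assumptions for illustration only), that the z-table route and a direct computation on the original scale agree.

```python
from scipy import stats as sts

mu, sigma = 10.0, 2.0   # hypothetical mean and standard deviation
x = 13.0                # hypothetical boundary

z = (x - mu) / sigma                   # standardised value to look up in a z-table
p_via_z = sts.norm(0, 1).cdf(z)        # P(Z < z) on the standard normal
p_direct = sts.norm(mu, sigma).cdf(x)  # P(X < x) on the original scale

print(z, round(p_via_z, 4), round(p_direct, 4))
```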



## 4. Linear Combination of Random Variables

Wooden planks are delivered to a furniture factory, where they will be sanded and packaged in cardboard. The thickness of the planks follows a normal distribution with mean 20 mm and standard deviation 0.5 mm, $N(20,0.5)$. A machine then sands each side of the plank; the thickness removed per side follows a normal distribution $N(1,0.1)$. Two planks are then wrapped together in cardboard (the packaging goes all the way around). The thickness of the cardboard (one layer of it) is exactly 6 mm.

1. Find the mean and standard deviation of the thickness of the total packaged product.
2. It can be shown that a linear combination of independent normally distributed random variables is itself normally distributed. Knowing this, find the probability that the thickness of the total packaged product is more than 50 mm. **Use Python for this question.**
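
For independent random variables, the mean of $\sum a_i X_i$ is $\sum a_i \mu_i$ and its variance is $\sum a_i^2 \sigma_i^2$. The sketch below applies this to hypothetical components (all numbers are assumptions for illustration, not the values of the exercise) and contrasts adding two separate draws of a variable with counting one draw twice.

```python
import numpy as np
from scipy import stats as sts

# Hypothetical independent components: A ~ N(10, 0.3), B1 and B2 ~ N(2, 0.4)
mu_a, sd_a = 10.0, 0.3
mu_b, sd_b = 2.0, 0.4

# T = A + B1 + B2: two separate draws of B
mean_t = mu_a + 2 * mu_b
sd_t = np.sqrt(sd_a**2 + 2 * sd_b**2)    # variances add: 0.09 + 0.16 + 0.16

# Contrast: S = A + 2*B uses one draw of B counted twice
sd_s = np.sqrt(sd_a**2 + (2 * sd_b)**2)  # the coefficient is squared: 0.09 + 0.64

p_t_above_15 = sts.norm(mean_t, sd_t).sf(15)  # P(T > 15)
print(mean_t, sd_t, sd_s, p_t_above_15)
```

Both constructions have the same mean, but the standard deviations differ, so it matters whether a quantity appears as several independent draws or as a multiple of a single draw.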



## 5. Assessing Normality and Bivariate Gaussian Distribution

Some statistical tests rely on the assumption that the data are drawn from a normally distributed population. To assess whether this is a realistic assumption, one can make so-called Q-Q plots **using Python**. The more "normal" the data look, the closer the observations are to the red theoretical line.

1. Run the code below to make a Q-Q plot of the durations of the campaign videos of presidential candidates from session 1. Also make a histogram. Would you conclude that the assumption of normality is realistic?

In [None]:
conn = sqlite3.connect("../../shared/boas_peru_metadata.db")
df_durations = pd.read_sql_query("SELECT duration FROM videos_peru", conn)
conn.close()  # the connection is no longer needed once the data is loaded

plt.figure(figsize=(10,6))
sts.probplot(df_durations['duration'], plot=plt)
plt.title("Q-Q Plot for video durations")
plt.show()
plt.close()
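
Besides the visual check, `sts.probplot` also returns the correlation coefficient `r` of the least-squares line through the Q-Q points, which gives a rough numerical feel for how closely the points follow the line. The sketch below uses two synthetic samples (generated here for illustration, not the video data) to compare a normal sample with a clearly skewed one.

```python
import numpy as np
from scipy import stats as sts

rng = np.random.default_rng(0)  # fixed seed for reproducibility
normal_sample = rng.normal(loc=50, scale=10, size=500)
skewed_sample = rng.exponential(scale=10, size=500)

# Without a plot argument, probplot returns the Q-Q points and a least-squares fit
(_, _), (_, _, r_normal) = sts.probplot(normal_sample)
(_, _), (_, _, r_skewed) = sts.probplot(skewed_sample)

print(round(r_normal, 3), round(r_skewed, 3))
```

A value of `r` close to 1 is consistent with normality, but it is only a summary; the plot itself shows where the deviations occur (for instance in one tail).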

2. The following dataframe contains the respective height (in cm) and weight (in kg) of 507 active individuals (see https://www.openintro.org/book/statdata/?data=bdims). Generate Q-Q plots, histograms and boxplots for both attributes. Assess their normality.

In [None]:
df = pd.read_csv('../../shared/bdims.csv',delimiter=",")
print(df)


3. Run the following cell. Based on the covariance matrix, do you expect a positive or negative correlation between the attributes?

In [None]:
mu = df.mean()
print(mu)
cov = df.cov()
print(cov)
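
The sign of the off-diagonal covariance determines the sign of the correlation, since the correlation is the covariance divided by the product of the two standard deviations. A minimal sketch on a small made-up dataframe (the five rows below are hypothetical, not taken from `bdims.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({"height": [160.0, 170.0, 175.0, 180.0, 190.0],
                    "weight": [55.0, 65.0, 72.0, 80.0, 95.0]})

cov_toy = toy.cov()
sd = np.sqrt(np.diag(cov_toy))               # standard deviations from the diagonal
corr_from_cov = cov_toy / np.outer(sd, sd)   # divide each covariance by sd_i * sd_j

print(corr_from_cov)
print(toy.corr())  # pandas computes the same matrix directly
```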

4. Run the code below to generate a 3D plot of the bivariate density function (defined by this mean vector and covariance matrix). The second plot combines a contour plot of the bivariate density function with a scatter plot of the data. Interpret the result. [The code itself is for illustration purposes only; you are not expected to analyse/understand it in detail.]

In [None]:
from mpl_toolkits.mplot3d import Axes3D

height = np.linspace(140, 210, 1000)
weight = np.linspace(20, 150, 1000)
X, Y = np.meshgrid(height, weight)
pos = np.dstack((X, Y))

height_vs_weight = sts.multivariate_normal(mu, cov)
density_function = height_vs_weight.pdf(pos)

fig = plt.figure(figsize=(25,15))
ax = fig.add_subplot(projection='3d')
ax.plot_surface(X, Y, density_function, cmap='plasma')
ax.set_xlabel('Height')
ax.set_ylabel('Weight')
plt.show()
plt.close()

plt.figure(figsize=(10,8))
plt.scatter(df["height"], df["weight"])
plt.contour(X,Y, density_function,cmap='plasma')
plt.xlabel('Height', fontsize=18)
plt.ylabel('Weight', fontsize=18)
plt.show()
plt.close()

## 6. Joint Probability Functions

The following function represents the joint probability density function for two random variables, with $X$ ranging from $0$ to $+\infty$, $Y$ between $1$ and $3$ and $c$ a constant to be determined.

$$f_{X,Y}(x,y)=c\cdot e^{-x}y$$

1. Find the value of $c$ for which this is a valid pdf.
2. Find the marginal pdfs $f_Y(y)$ and $f_X(x)$.
3. Are $X$ and $Y$ independent?
4. Write down the correct integrals (without calculating!) to find $P(2<X<3,Y<2)$ and $P(Y<X)$.
5. Calculate these probabilities.
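
Hand calculations of this kind can be checked numerically with `scipy.integrate.dblquad`. The sketch below does so for a different, hypothetical density $f(x,y)=c\,(x+y)$ on the unit square (an assumption chosen for illustration, not the density of the exercise): it first finds the normalising constant and then a probability over a region whose inner bound depends on $x$.

```python
from scipy import integrate

# Hypothetical joint density on the unit square: f(x, y) = c * (x + y)
def unnormalised(y, x):
    # dblquad expects the inner integration variable (y) as the first argument
    return x + y

# Outer integral over 0 <= x <= 1, inner integral over 0 <= y <= 1
total, _ = integrate.dblquad(unnormalised, 0, 1, lambda x: 0, lambda x: 1)
c = 1 / total  # the constant that makes the density integrate to 1

# P(Y < X): the inner bounds may depend on x, here 0 <= y <= x
p_y_below_x, _ = integrate.dblquad(lambda y, x: c * (x + y),
                                   0, 1, lambda x: 0, lambda x: x)

print(c, p_y_below_x)
```

For an unbounded range, `np.inf` can be passed as an integration limit.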




## 7. Other Continuous Distributions...

A certain phenomenon $x$ follows a Fisk distribution with $c=4$ (see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisk.html#scipy.stats.fisk). **Use Python to answer the following questions.** 

Note: we didn't cover this distribution and will never use it again, but we want to show that once you know the distribution of a phenomenon, calculations can be left to Python!

1. Plot the pdf and cdf of this distribution, starting from the code below.
2. Calculate $P(x<1.3)$.
3. Calculate $P(x \geq 2.5)$.
4. Calculate $P(1 < x \leq 2.7)$.
5. Find the expected value and standard deviation of the distribution of $x$.

In [None]:
x = sts.fisk(4)
xrange = np.linspace(0,3,10000)
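
Every frozen scipy distribution offers the same set of methods, so the pattern is identical whatever the distribution. As a sketch, the cell below shows these methods on an exponential distribution with scale 2 (an illustrative choice, unrelated to the Fisk distribution of the exercise).

```python
from scipy import stats as sts

d = sts.expon(scale=2)  # illustrative distribution, not the one from the exercise

p_interval = d.cdf(3) - d.cdf(1)  # P(1 < X <= 3)
p_tail = d.sf(3)                  # P(X > 3)
median = d.ppf(0.5)               # value below which half of the probability lies
mean_d, sd_d = d.mean(), d.std()  # expected value and standard deviation

print(p_interval, p_tail, median, mean_d, sd_d)
```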

