# DSCI 551: Descriptive Statistics and Probability for Data Science
# Lab 4

# Submission instructions

rubric={mechanics:5}
- To submit this assignment, submit this Jupyter notebook `.ipynb` file completed with your answers.
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions/).
- Use proper English, spelling, and grammar throughout your submission.


## Exercise 1


### 1(a)
rubric={raw:5}

Let $X\sim\mathcal{N}(0,1)$ and $Y=2+3X$.

1. Are $X$ and $Y$ independent? 
2. What is the distribution of $Y$? 
3. What is $E[XY]$?
4. What is $Cov(X,Y)$?
5. What is the correlation coefficient $\rho$ between $X$ and $Y$?

Provide your answers to Exercise **1a** below.

YOUR ANSWER HERE

### (Optional) 1(b)
rubric={reasoning:2}

1. With $X$ and $Y$ as defined in 1(a), find $P(\max(X,Y)< 0)$.

Provide your answer to Exercise **1b** below.

YOUR ANSWER HERE

## Exercise 2
rubric={reasoning:2}

For each of the following, explain why it is not a valid covariance matrix for a bivariate Gaussian distribution.

1. $\Sigma=\begin{pmatrix} -2 & 0 \\ 0 & 2\end{pmatrix}$
2. $\Sigma=\begin{pmatrix} 4 & 0 \\ -1 & 5\end{pmatrix}$

Provide your answers to question **2** below. 

YOUR ANSWER HERE

## Exercise 3
rubric={raw:9}

For each of the following bivariate (2-variable) joint distributions, answer the following questions by filling in the table at the bottom of the page. No need to explain your answers.

Questions for each distribution: 

1. Is the correlation coefficient (ρ) positive, negative, or zero?
2. Is E[X] greater, less than, or equal to E[Y]? 
3. Is the distribution a bivariate Gaussian, or not? 

![](pdf1.png)

![](pdf2.png)

![](pdf3.png)

Provide your answers to question **3** in the following table.

|                   | ρ>0, ρ<0, or ρ=0? | E[X]>E[Y], E[X]<E[Y], or E[X]=E[Y]?  | Bivariate Gaussian? (yes/no) | 
|-------------------|-------------------|---------|----------|
| Distribution (a)  |                   |           |        |
| Distribution (b)  |                   |           |        |
| Distribution (c)  |                   |           |        |

YOUR ANSWER HERE

## Exercise 4
rubric={reasoning:7}

Let $X$ and $Y$ denote the DSCI 511 and DSCI 551 grades, respectively, for an MDS student. According to the historical records, the random vector $(X,Y)$ can be modeled using the bivariate (2D) normal distribution with parameters $\mu_x = 85, \mu_y = 80, \sigma_x = 4, \sigma_y = 5, \rho=0.25$.

Answer the following questions, using R (and the internet) as needed.

1. What is the covariance matrix, $\Sigma$?
2. What's the marginal distribution of DSCI 551 grades?
3. Which do you expect to be larger, the conditional probability $P(Y \geq 80 \mid X = 70)$ or the marginal probability $P(Y \geq 80)$? Briefly justify your answer. (You're welcome to compute the actual probabilities, but you don't have to.)
4. Which do you expect to be larger, the conditional probability $P(Y \geq 80 \mid X=Y)$ or the marginal probability $P(Y \geq 80)$? Briefly justify your answer. (You're welcome to compute the actual probabilities, but you don't have to.)
5. Define $Z= (X+Y)/2$, the student's average grade across the two courses. Find $P(Z\geq 80)$. Hint: see [here](https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables#Correlated_random_variables).
6. What is $P(X\geq Y)$?
7. For two general jointly Gaussian RVs $X$ and $Y$ (no longer necessarily distributed as above), is it possible to have $P(X\geq Y) > 0.5$ and at the same time $E[X] < E[Y]$? Briefly justify your answer.
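
As a rough illustration of the kind of computation involved, the sketch below builds a covariance matrix from standard deviations and a correlation, draws correlated Gaussian pairs with base R, and compares a simulated probability for an average against the exact normal calculation. Every number and name in it (`mu`, `sigma_a`, `sigma_b`, `rho_ab`, the threshold 16, the seed) is a hypothetical value chosen for illustration, not a parameter from this exercise, and the Cholesky-based sampling is just one of several ways to simulate such data.

```r
# Illustrative sketch only: all parameter values below are hypothetical,
# not the ones from the exercise.
set.seed(42)
mu <- c(10, 20)   # hypothetical means
sigma_a <- 2      # hypothetical standard deviations
sigma_b <- 3
rho_ab <- -0.5    # hypothetical correlation

# Covariance matrix: variances on the diagonal, rho * sd * sd off the diagonal
cov_ab <- rho_ab * sigma_a * sigma_b
Sigma_ab <- matrix(c(sigma_a^2, cov_ab,
                     cov_ab,    sigma_b^2), nrow = 2, byrow = TRUE)

# If Z has iid N(0,1) entries and t(R) %*% R equals Sigma_ab (R from chol()),
# then the rows of Z %*% R have covariance matrix Sigma_ab.
n_sim <- 100000
Z <- matrix(rnorm(2 * n_sim), ncol = 2)
sims <- sweep(Z %*% chol(Sigma_ab), 2, mu, "+")

# Monte Carlo estimate of P(average >= 16) ...
avg <- rowMeans(sims)
p_sim <- mean(avg >= 16)

# ... compared with the exact value: the average of jointly Gaussian variables
# is Gaussian with variance (var_a + var_b + 2 * cov_ab) / 4.
var_avg <- (sigma_a^2 + sigma_b^2 + 2 * cov_ab) / 4
p_exact <- pnorm(16, mean = mean(mu), sd = sqrt(var_avg), lower.tail = FALSE)

print(c(simulated = p_sim, exact = p_exact))
```

Checking a closed-form answer against a simulation like this is a cheap way to catch arithmetic slips, though the simulated value will only agree with the exact one up to Monte Carlo error.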

Provide answers to question **4** below.

YOUR ANSWER HERE

## Exercise 5
rubric={reasoning:5}

The [Monty Hall problem](https://en.wikipedia.org/wiki/Monty_Hall_problem) is an infamous brain teaser and a good demonstration of Bayes' theorem. The problem goes like this:

Suppose you're on a game show, and you have to choose one of three doors labelled **A**, **B**, and **C**. Behind one door is a car; behind the other two doors are goats. You pick a door, say **Door A**. After you've chosen your door (and before it is opened), the host, who knows what's behind all the doors, opens another door, say **Door B**, which has a goat. The host then asks you, _"Do you want to change your choice to **Door C**?"_ What should you do? Is it to your advantage to switch your choice of door from **Door A** to **Door C**?

Provide your answer to question **5** below.

YOUR ANSWER HERE

## (Optional) Exercise 6: Copulas
rubric={reasoning:2}

This exercise explores the concept of dependence. Even if you can't complete all of it, try to attempt at least part of it, since it is partly a walk-through.

### 6(a) PIT scores

Let the random variable $Y$ have a continuous, strictly increasing cdf $F$. One can prove mathematically that $F(Y)$ has a Uniform(0,1) distribution. This result is called the Probability Integral Transform (PIT).

1. Generate 1000 observations from the Exponential distribution with mean 1 using the `rexp` function in R -- let's call this vector `y`. Plot these observations against the observation number by applying the `plot` function to `y`, to show what exponentially distributed data look like.
2. Now plot `pexp(y)` (call this vector `u`) -- what distribution do these data follow, and why?
3. Now transform `u` by `qnorm(u)`, and plot this new vector. What distribution do these data follow?

In fact, using this approach we can transform a random variable to have any distribution we want. 
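
To make the idea concrete on a distribution other than the one used in this exercise, the sketch below pushes Weibull draws through their own cdf and then through the quantile function of a Student's t distribution. The choice of Weibull and t, the shape and degrees-of-freedom values, the seed, and the names `w`, `u_w`, and `t_w` are all assumptions made purely for illustration.

```r
# Illustrative sketch only: the distributions, parameter values, seed,
# and variable names are assumptions, not part of the exercise.
set.seed(551)
n_obs <- 1000
w <- rweibull(n_obs, shape = 2, scale = 1)  # draws from a continuous distribution

# Step 1: apply the distribution's own cdf; by the PIT the result
# should behave like a Uniform(0,1) sample.
u_w <- pweibull(w, shape = 2, scale = 1)

# Step 2: apply the quantile function of the target distribution
# (here Student's t with 5 degrees of freedom).
t_w <- qt(u_w, df = 5)

# Kolmogorov-Smirnov tests against the theoretical distributions;
# large p-values are consistent with the claimed distributions.
ks_unif <- ks.test(u_w, "punif")
ks_t <- ks.test(t_w, "pt", df = 5)
print(c(uniform = ks_unif$p.value, t5 = ks_t$p.value))

# Histograms give a quick visual check of each stage.
hist(u_w, main = "After applying the cdf")
hist(t_w, main = "After applying the target quantile function")
```

The two steps are deliberately kept separate so that the intermediate uniform sample can be inspected on its own.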

```r
# Provide answers for question 6a below.

# your code here
fail() # No Answer - remove if you provide an answer
```

### 6(b) Normal scores plots

Copulas describe the _full_ dependence between random variables. After applying the PIT to each random variable, the joint distribution of the transformed variables is called a _copula_. We'll visualize two copulas using a _normal scores plot_ -- a scatter plot of data whose marginal distributions have been transformed to standard Gaussian.

The following code generates two bivariate samples of size 1000, called `dat1` and `dat2`, each consisting of iid observations.

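```r
set.seed(123)
n <- 1000

## Sample 1
rho <- 0.8
z1 <- rnorm(n)
z2 <- rnorm(n, mean=rho*z1, sd=sqrt(1-rho^2))  # z2 given z1: (z1, z2) has correlation rho
x <- qexp(pnorm(z1))  # PIT to Uniform(0,1), then to Exponential(1)
y <- qexp(pnorm(z2))
dat1 <- matrix(c(x,y), ncol=2)

## Sample 2
u <- runif(n)
tau <- runif(n)
alpha <- 0.3
cdf <- function(x) 1-(1+x)^(-alpha)         # marginal cdf on the intermediate scale
qf <- function(tau) (1-tau)^(-1/alpha) - 1  # its inverse (quantile function)
QYgX <- function(tau, x) (1+x)*(1-tau)^(-1/(1+alpha)) - 1 - x  # conditional quantile of Y given X = x
xpar <- qf(u)
ypar <- QYgX(tau, xpar)
x <- qexp(cdf(xpar))  # PIT to Uniform(0,1), then to Exponential(1)
y <- qexp(cdf(ypar))
dat2 <- matrix(c(x,y), ncol=2)
```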

The marginal distribution of each random variable is Exponential with mean 1.

1. Make a scatter plot of the data for each sample.
2. Transform the data so that each random variable has a standard Gaussian marginal distribution.
3. Make a scatter plot of the transformed data for each sample. These are called normal scores plots, and they are a way to visualize the dependence between two random variables. Based on these plots, is the dependence structure different between the two samples? Sample 1 has a _Gaussian_ copula; Sample 2 has an _MTCJ_ copula.
4. Both transformed samples have a (Pearson) correlation of about 0.8. Does this mean that the dependence structure between the two variables is the same for both samples?

```r
# Provide answer for question 6b below

# your code here
fail() # No Answer - remove if you provide an answer
```