<div class="w3-bar w3-blue-grey w3-padding">
<div class="w3-bar-item"><h2> SAMBa Training </h2></div>
<div class="w3-bar-item w3-right"><img class="w3-image w3-right" style="width:40%;max-width:400px" src="../images/SAMBa_white.png"></div>
</div>

This worksheet contains a few simple problems involving some basic methods in probability and statistics. You are expected to solve these problems using the programming language $\texttt{R}$.


<div class="w3-panel w3-leftbar w3-border-yellow w3-pale-yellow w3-padding-small">
    <h3><i class="fa fa-pencil-square-o"></i> 1
</h3>
</div>

The Central Limit Theorem states that the sum $Z = X_1 + \dots + X_n$ of $n$ independent identically distributed (iid) random variables with finite variance is approximately normally distributed for large $n$. In particular, if $X_1, \dots, X_n$ have mean $\mu$ and variance $\sigma^2$, then the standardised sum $Z_s=((X_1-\mu) + \dots + (X_n-\mu))/\sqrt{n\sigma^2}$ is approximately standard normal, $N(0,1)$, for large $n$.
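
Writing the sum in terms of the sample mean $\bar{X} = (X_1 + \dots + X_n)/n$ gives an equivalent form of the standardised sum, which is convenient when only the mean and variance of a sample are available:

$$
Z_s = \frac{\sum_{i=1}^n (X_i - \mu)}{\sqrt{n\sigma^2}} = \frac{n(\bar{X} - \mu)}{\sqrt{n}\,\sigma} = \frac{\sqrt{n}\,(\bar{X} - \mu)}{\sigma}.
$$

For example, a $Pois(\lambda)$ random variable has $\mu = \sigma^2 = \lambda$.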

Susie D gets many emails each day. Let $X_i \overset{\text{iid}}{\sim} Pois(70)$ for $i=1,...,n$ denote the number of emails she receives during day $i$.

Write a function `func` that takes n as input, generates a random sample $x$ of $n$ independent draws of $X_i$ and returns the sample mean and variance of $x$. You may find the function `rpois` and the following function structure helpful.

In [None]:
# A function to generate n iid X_i and return the sample mean and variance.
func <- function(n){
  x <- rpois(n, 70)
  mean_x <- mean(x)
  var_x <- var(x)
  return(c(mean_x, var_x))
}

Write a function `func2` that takes n as an input, makes a call to `func`, and uses both outputs of `func` to calculate an approximate standardised sum $Z_s$ (with the sample variance standing in for $\sigma^2$), and finally returns $Z_s$.

In [None]:
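func2 <- function(n){
  y <- func(n)
  # Standardised sum: sqrt(n) * (sample mean - mu) / sd, with mu = 70 for Pois(70)
  # and the sample variance y[2] standing in for sigma^2.
  Z_s <- sqrt(n) * (y[1] - 70) / sqrt(y[2])
  Z_s
}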

Use the function `sapply` with the first argument `rep(500,1000)` to generate $N = 1000$ such $Z_s$ for $n = 500$. 

In [None]:
Zs_samples=sapply(rep(500,1000),func2)

Plot a histogram of your samples of $Z_s$. What do you observe? Use the function `qqnorm` on your samples of $Z_s$ to produce a normal QQ plot. What do you observe and why is this the case? 

In [None]:
hist(Zs_samples)
qqnorm(Zs_samples)
# The samples are approximately normally distributed, as the CLT predicts.
# n = 500 is large enough for the sample variance to be a good
# approximation to the true variance.
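
As a rough numerical check of the histogram and QQ plot, one could compare the sample mean and standard deviation of the standardised sums with the $N(0,1)$ values 0 and 1. The sketch below is self-contained: the helper `standardised_sum` is a hypothetical name, it uses the known values $\mu = \sigma^2 = 70$ rather than sample estimates, and the seed is arbitrary.

```r
# Hypothetical helper: standardised sum of n iid Pois(lambda) draws,
# using the true mean and variance (both equal to lambda).
standardised_sum <- function(n, lambda = 70){
  x <- rpois(n, lambda)
  sum(x - lambda) / sqrt(n * lambda)
}

set.seed(1)  # arbitrary seed, only for reproducibility
z <- sapply(rep(500, 1000), standardised_sum)

# If the CLT approximation is good, these should be close to 0 and 1.
z_mean <- mean(z)
z_sd <- sd(z)
print(c(mean = z_mean, sd = z_sd))

# Histogram on the density scale with the N(0,1) density overlaid.
hist(z, freq = FALSE, main = "Standardised sums, n = 500")
curve(dnorm(x), add = TRUE)

# QQ plot with a reference line through the quartiles.
qqnorm(z)
qqline(z)
```

Points lying close to the reference line in the QQ plot are consistent with approximate normality.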


<div class="w3-panel w3-leftbar w3-border-yellow w3-pale-yellow w3-padding-small">
    <h3><i class="fa fa-pencil-square-o"></i> 2
</h3>
  

</div>


Data in $\texttt{R}$ are usually stored in data frames. A data frame is a table in which each column is a vector containing the values of one variable. The variables can be of numeric, factor or character type.

One of Andreas K's favourite built-in datasets in R is called `trees`. Use the command `dat <- trees` to create a new data frame with this data. Have a play around with this dataset so that you are comfortable extracting a particular row or column. Use the following commands to perform some exploratory data analysis.

In [None]:
dat <- trees #let dat be the trees dataframe
dat[1:10,] #print the first 10 rows
summary(dat) #print a summary of the dataset
pairs(dat,pch=19) #plot a matrix of scatterplots
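
A few common ways to extract rows and columns from a data frame are sketched below, using the built-in `trees` data (31 rows; columns `Girth`, `Height` and `Volume`). The variable names on the left-hand side are illustrative only.

```r
dat <- trees

row3 <- dat[3, ]                 # third row, returned as a one-row data frame
girth <- dat$Girth               # one column as a numeric vector
height <- dat[["Height"]]        # same idea, using the column name as a string
first_two <- dat[, 1:2]          # first two columns, all rows
tall <- dat[dat$Height > 80, ]   # rows satisfying a logical condition
vol5 <- dat[5, "Volume"]         # a single entry: row 5 of the Volume column

dim(dat)    # number of rows and columns
str(dat)    # type of each column
```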

Fit a simple linear model with response variable `Volume` and predictor variable `Girth`, storing the results as an object `mod`.

In [None]:
mod=lm(Volume~Girth,data=dat)

Use `plot(mod)` and `summary(mod)` to assess the goodness of fit of the model. Does `Girth` predict `Volume` well?

In [None]:
plot(mod)
summary(mod)

From the formula for the volume of a cylinder, we may expect the variable `Girth`$^2$ to be a better predictor for `Volume`. Add this variable as a new column `Girth2` to the data frame dat.

In [None]:
dat$Girth2 <- (dat$Girth)^2 #note numerical operations are performed elementwise in R.

Fit a linear model as before but with `Girth` replaced by `Girth2`. Did this improve the model?

In [None]:
mod2=lm(Volume~Girth2,data=dat)
summary(mod2)
# Yes, the model improves: the F-statistic increases and the residual standard error decreases.
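
One way to make the comparison concrete is to pull the relevant numbers out of the two `summary` objects rather than reading them off the printout. The sketch below refits both models so that it is self-contained; the names `fit_lin`, `fit_sq` and `comparison` are illustrative.

```r
dat <- trees
dat$Girth2 <- dat$Girth^2

fit_lin <- lm(Volume ~ Girth, data = dat)
fit_sq <- lm(Volume ~ Girth2, data = dat)

# summary.lm stores R^2 in $r.squared and the residual standard error in $sigma.
comparison <- data.frame(
  model = c("Volume ~ Girth", "Volume ~ Girth2"),
  r.squared = c(summary(fit_lin)$r.squared, summary(fit_sq)$r.squared),
  resid.se = c(summary(fit_lin)$sigma, summary(fit_sq)$sigma)
)
print(comparison)
```

Both models have one predictor and the same number of observations, so a higher $R^2$ and a lower residual standard error point in the same direction.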

<div class="w3-panel w3-leftbar w3-border-yellow w3-pale-yellow w3-padding-small">
    <h3><i class="fa fa-pencil-square-o"></i> 3
</h3>
  

</div>


When Paul M was a child he sold newspapers at a kiosk. Each morning he bought newspapers at 25 Pesos each for the first 200, after which the price fell to 5 Pesos each. He sold the newspapers at 100 Pesos each. Any newspapers left unsold at the end of the day earned no refund. From previous experience, he inferred that the daily demand for newspapers followed a $N(178,21^2)$ distribution. He would now like to know what the optimal number of newspapers to buy each day would have been.

Let $n$ be the number of papers Paul M buys each day. Write the following functions:

`cost(n)` to give the cost of buying $n$ papers,

`profit(n,d)` to give the profit from sales when there is demand for $d$ papers and Paul M bought $n$ papers, and

`average.profit(n,nreps)` to simulate `nreps` values of demands $d\sim N(178,21^2)$ and return the average profit for these cases.
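
Under the pricing described above, the two deterministic functions can be written out as follows (a sketch of one possible formulation, with $n$ papers bought and demand $d$):

$$
\text{cost}(n) = 25\min(n, 200) + 5\max(n - 200, 0), \qquad \text{profit}(n, d) = 100\min(n, d) - \text{cost}(n).
$$

For example, $\text{cost}(250) = 25\cdot 200 + 5\cdot 50 = 5250$, and with demand $d = 180$ the profit is $100\cdot 180 - 5250 = 12750$.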

In [None]:
cost <- function(n){
  min(200,n)*25+(n>200)*(n-200)*5
}
profit <- function(n,d){
  100*min(n,d)-cost(n)
}
average.profit <- function(n,nreps){
  demand_vec <- rnorm(nreps,178,21) # note that the 3rd argument of rnorm in R is sd, not variance!
  profit_vec <- sapply(demand_vec,profit,n=n) # alternatively you may use a for loop, but sapply is a neat way to do things in R.
  av_profit <- mean(profit_vec)
  return(av_profit)
}

Use `average.profit` to work out the optimal number of newspapers Paul M should buy each day. You may find it helpful to use a small number of simulations initially, then increase this number to get accurate results as your search closes in on the optimal value. Be careful: a local maximum may not be a global maximum! Make sure to let Paul M know what you discover.

In [None]:
#after some initial searching with small nreps, we find the maximum is in the interval (185,220)
x=seq(from=185,to=220,by=0.5)
y=sapply(x,average.profit,nreps=500000) #takes a minute
plot(x,y)
#global maximum is around 214
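
Because the demand is normal, the expected profit also has a closed form, which gives a noise-free cross-check of the simulated curve. The sketch below is not part of the original solution: `expected.profit` is a hypothetical helper, it redefines `cost` (vectorised with `pmin`/`pmax`) so that it is self-contained, and it relies on the identity $E[\min(n,D)] = n - \sigma\,(z\,\Phi(z) + \phi(z))$ with $z = (n-\mu)/\sigma$ for $D \sim N(\mu,\sigma^2)$.

```r
# Vectorised version of the cost function (same pricing as in the text).
cost <- function(n){
  25 * pmin(n, 200) + 5 * pmax(n - 200, 0)
}

# Hypothetical helper: exact expected profit for normal demand.
expected.profit <- function(n, mu = 178, sigma = 21){
  z <- (n - mu) / sigma
  expected_sales <- n - sigma * (z * pnorm(z) + dnorm(z))  # E[min(n, D)]
  100 * expected_sales - cost(n)
}

n_grid <- 150:250
ep <- expected.profit(n_grid)
best_n <- n_grid[which.max(ep)]
print(best_n)

plot(n_grid, ep, type = "l", xlab = "n", ylab = "expected profit")
abline(v = 200, lty = 2)  # price break at 200 papers
```

The kink in the cost at $n = 200$ is what allows separate local maxima on either side of the price break, so it is worth comparing the two peaks directly rather than trusting the first one found.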

<div class="w3-bar w3-blue-grey">
<a href="./03_matlab-ws.ipynb" class="w3-bar-item w3-button"><h2><i class="fa fa-angle-double-left"></i> Previous</h2></a>
<a href="./00_schedule.ipynb" class="w3-bar-item w3-button w3-center" style="width:60%"><h2>Schedule</h2></a>
<a href="./00_schedule.ipynb" class="w3-bar-item w3-button w3-right"><h2>Next <i class="fa fa-angle-double-right"></i></h2></a>
</div>