In [None]:
# Define colors
Pitt.Blue<-"#003594"
Pitt.Gold<-"#FFB81C"
Pitt.DGray <- "#75787B"
Pitt.Gray <- "#97999B"
Pitt.LGray <- "#C8C9C7"
# ggplot preferences
library("ggplot2")
library("repr")
options(repr.plot.width=10, repr.plot.height=10/1.68)
pitt.theme<-theme( panel.background = element_rect(fill = "white",linewidth = 0.5, linetype = "solid"),
  panel.grid.major = element_line(linewidth = 0.5, linetype = 'solid', colour =Pitt.Gray), 
  panel.grid.minor = element_line(linewidth = 0.25, linetype = 'solid', colour = "white")
  )
base<- ggplot() +aes()+ pitt.theme


## Simulation as a Tool
Similar to our numerical approach to solving equations and optimizing problems, we might also struggle to deal with the analytical complexities of transforming our results.

* While we know how to get various features from linear OLS models, these are normally inputs to larger problems.
* Sometimes the linkages in these problems can be non-obvious.

Even where we **can** probably calculate something if we thought about it long enough, the time it takes to do this could have been dedicated to running a quick simulation and getting a ballpark on the number.

### Law of Large Numbers
The fundamental idea here is that we make use of the law of large numbers.

That is, when we think of an iid sample  $(X_1,X_2,\ldots,X_n)$, where each has mean $\mu_X$ and variance $\sigma^2_X$, the LLN tells us that for any positive difference $\epsilon$ as $n\rightarrow\infty$ we have:
$$\Pr\left\{\left| \overline X_n-\mu_X \right|>\epsilon \right\} \rightarrow 0 $$



So if we explicitly set out to construct the random sample $(X_1,X_2,\ldots,X_n)$ to be independent, and where each of the $X_i$ draws are taken from the right distribution, then we can look at the sample average as an approximation for the true expectation.

The LLN tells us that so long as we draw a large enough sample, we should get an outcome that is close to the truth.
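A minimal sketch of the LLN at work, assuming (purely for illustration) exponential draws with a true mean of 2. The names `mu.true`, `lln.n` and `lln.means` are my own and not part of the notes:

```r
set.seed(42)                      # fix the seed so the run is reproducible
mu.true <- 2                      # assumed true mean for this illustration
lln.n <- 10^(2:6)                 # increasing sample sizes
# For each n, draw an iid exponential sample with mean mu.true and average it
lln.means <- sapply(lln.n, function(n) mean(rexp(n, rate = 1/mu.true)))
# Absolute gap between each sample average and the true mean
data.frame(n = lln.n, sample.mean = lln.means, abs.error = abs(lln.means - mu.true))
```

The gap need not shrink at every single step, but it should be small for the largest samples.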

### Galton Board
Here I can show you a little table-top toy that illustrates the concept.

### Simulation as a brute-force calculator
So if there's a value $\mu$ that we'd like to get an approximation for, and we can construct a random variable $X$ that has the property that $\mathbb{E}X=\mu$, then one option for assessing $\mu$ is simply to simulate it!

## Example: Calculating a fixed value
A simple geometric problem is trying to calculate the value of $\pi$, a geometric constant that appears in a lot of formulae!

If you're super clever, there are some really pretty math ideas to derive $\pi$. 

For instance, consider the infinite series:
$$\pi =3+ \frac{4}{2\times 3\times 4}-\frac{4}{4\times 5\times 6}+\frac{4}{6\times 7\times 8}-\cdots$$


In [None]:
nilakanthaPi<- function(n) {
    x=3 #initialize
    for (k in  1:n ) x=x+(4*(-1)**(k+1))/((1+2*k)**3-(1+2*k)) #loop n times
    return(x) # return value
} 

In [None]:
nilamkanthatseq<- data.frame("n"=c(3:200))
# sapply just applies the list/vector to the function
nilamkanthatseq["pi"]<-sapply(nilamkanthatseq$n,nilakanthaPi)
nilamkanthatseq["error"]<-abs(pi-nilamkanthatseq["pi"])
nilamkanthatseq["log.error"]<-log( abs(pi-nilamkanthatseq["pi"] )  )

In [None]:
ggplot(data=nilamkanthatseq) +aes(x=n,y=error)+ pitt.theme+geom_point(color=Pitt.Blue) + scale_y_continuous(trans='log10')

In [None]:
# It's about machine error at 50,000 terms
nilakanthaPi(50000)
pi

### But suppose we didn't know this math sequence
If we have a good sense for what we need this particular number for, it's not too hard to begin using the Law of Large Numbers to brute force measure it via simulation. 

The question then is how to construct a random variable $X$ with the property that $\mathbb{E}X=\pi$.

But here we will deploy our sense for what $\pi$ is measuring, where we will think of the area of a circle. In that way, we will use the formula for a circle and a region of known area to construct the random variable $X$.

* In particular we will first draw a pair of initial random variables $(U_1,U_2)$, each drawn from the interval $[-1,1]$
* We'll then assess whether this pair live inside a circle of radius 1 centered at zero. $$\text{InCircle}(u_1,u_2):=u_1^2+u_2^2\leq 1$$

<table>
    <tr><td>
    <h3>Drawing from:</h3> </td><td><h3>Returns 1 if:</h3></td></tr>
    <tr><td>
    <img src=https://alistairjwilson.github.io/MQE_AW/i/PiCalc1.svg  alt="Original draw"></td><td>
    <img src=https://alistairjwilson.github.io/MQE_AW/i/PiCalc2.svg alt="Event definition"></td></tr>
</table>

**Question:** What is the chance that a random draw from the blue region $(U_1,U_2)$, lies in the yellow region?

Our random variable $X$ will therefore be a function of the pair of  uniform random variables:
$$ X=\begin{cases} 4 & \text{if }U_1^2+U_2^2\leq1\\ 0 & \text{otherwise}\end{cases} $$
where we know that $$\begin{aligned}\mathbb{E}X&=\Pr\left\{\left(U_1,U_2\right)\text{ in circle}\right\}\cdot 4 + \Pr\left\{(U_1,U_2)\text{ not in circle}\right\}\cdot 0\\ &=\frac{\pi 1^2}{2\times2}\cdot 4=\pi\end{aligned}$$

So let's draw some values

In [None]:
#runif provides us with random uniform draws
u1<- runif(n=5000,min=-1,max=1) 
u2<- runif(n=5000,min=-1,max=1)
simData.pi<-data.frame("u1"=u1,"u2"=u2)
# Enter the binary outcome on whether they lie inside the circle
simData.pi["InCircle"]<-ifelse(  (simData.pi$u1**2+simData.pi$u2**2)<=1,4,0)
# take the average across this binary event!
head(simData.pi)
mean(simData.pi$InCircle)

Plotting it:

In [None]:
# Change the plot output to Jupyter to have constant aspect ratio
options(repr.plot.width=10, repr.plot.height=10)
ggplot() + pitt.theme+geom_point( data=subset(simData.pi,InCircle>0),aes(x=u1,y=u2),color=Pitt.Gold)+
geom_point( data=subset(simData.pi,InCircle==0),aes(x=u1,y=u2),color=Pitt.Blue)
# Set the output back to previous
options(repr.plot.width=10, repr.plot.height=10/1.68)

Doing the same thing but where we only store the mean value via a function.

In [None]:
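# runif(n) draws on [0,1], so this samples the quarter circle in the unit square;
# the hit probability is still pi/4, the same as for the full circle in [-1,1]^2
pi.sim<- function(n) mean( ifelse( (runif(n)**2 + runif(n)**2)<= 1,4,0) )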
pi.sim.df<-data.frame(order=3:8) 
pi.sim.df["n"]<-10**pi.sim.df["order"]
#sapply is used to apply the list to the pi.sim function in turn,
# where the runif() term doesn't like being applied to a list!
pi.sim.df["pi"]<-sapply(pi.sim.df$n,pi.sim)

In [None]:
pi.sim.df["error"]<-abs(pi-pi.sim.df["pi"])
ggplot(data=pi.sim.df) +aes(x=n,y=error)+ pitt.theme+
geom_point(color=Pitt.Blue,size=4) + 
scale_x_continuous(trans='log10')+ scale_y_continuous(trans='log10')
pi.sim.df

### So...
* It's obviously not as accurate as the more complex way of calculating it, but as an approximation, this is pretty good! 
* The benefit here was that we didn't need to be a first-rate mathematician to construct a sequence that quickly converged; I just had to know what the constant I was after was measuring.
* That said, if you did know the mathematical formula, you'd be much quicker using it.
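One way to put a number on "pretty good": since $X$ only takes the values 4 and 0, its variance is $16\cdot\tfrac{\pi}{4}-\pi^2=\pi(4-\pi)$, so the standard error of an average of $n$ draws is $\sqrt{\pi(4-\pi)/n}$. The sketch below compares that formula with the spread across repeated simulations. The helper names `pi.se` and `pi.draw` are hypothetical, and the formula uses the known value of $\pi$ only as a check:

```r
set.seed(1)
# Theoretical standard error of the simulated estimate of pi with n draws
pi.se <- function(n) sqrt(pi * (4 - pi) / n)
# One estimate of pi from n pairs of uniform draws on [-1,1]
pi.draw <- function(n) mean(ifelse(runif(n, -1, 1)^2 + runif(n, -1, 1)^2 <= 1, 4, 0))
# Repeat the n = 5000 experiment 2000 times and compare the two spreads
est <- replicate(2000, pi.draw(5000))
c(simulated.sd = sd(est), theoretical.se = pi.se(5000))
```

Because the error shrinks with $\sqrt{n}$, each extra digit of accuracy costs roughly 100 times as many draws.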

## Figuring out complex variables

There are many events for which we'd like a good prediction of the likely chances, but where we know that the event is an aggregate of lots of other smaller events. Analytically, the permutations involved can often be overwhelming. In contrast, a particular realization may be relatively easy to simulate.

While it can therefore be *possible* to calculate something, you're likely better just simulating each of the components that we do have a better model of. Here we'll look at an example that was of recent prominence!

## Electoral College
Take the recent US presidential election. The winner is determined by the vote totals in each of the separate states, where each state has a specific number of electoral college votes.

The event that we want to forecast is the likelihood that one candidate wins. Given this structure, and some way of simulating each of the component states, we can use simulation methods to understand the final result.

### Component parts
For this particular model then we need a  core (outer loop) procedure that  runs the model many times, storing the relevant variables so we can calculate things afterwards.

But inside each run of the election we need:
 * A parameterized random variable that draws an election outcome for each state
 * A function that takes outcomes for each state and produces a number of electoral college votes
 * A function that aggregates the electoral college votes into an outcome
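The three components and the outer loop can be sketched on a made-up three-state election. The state names, probabilities and vote counts below are hypothetical, chosen only so the answer can be checked by hand:

```r
set.seed(2020)
# Hypothetical three-state example (all numbers are made up)
toy.prob <- c(A = 0.9, B = 0.5, C = 0.2)   # chance party 1 wins each state
toy.ev   <- c(A = 10,  B = 6,   C = 5)     # electoral votes per state

# (1) draw an outcome per state: TRUE if party 1 wins it
draw.states <- function(prob) runif(length(prob)) < prob
# (2) turn state outcomes into electoral votes for party 1
votes.won <- function(wins, ev) sum(ev[wins])
# (3) aggregate into an election result: does party 1 hold a majority?
wins.election <- function(votes, ev) votes > sum(ev) / 2

# Outer loop: repeat the whole election many times and store each result
toy.results <- replicate(20000,
    wins.election(votes.won(draw.states(toy.prob), toy.ev), toy.ev))
mean(toy.results)   # simulated probability that party 1 wins
```

In this toy case party 1 wins exactly when it carries at least two states, which by enumeration has probability 0.55, so the simulated value should land near that.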

#### Parameterized model
We could be more sophisticated here and think of a model that takes into account:
* Demographics
* Electoral spending
* Historical outcomes
* Common shocks (national good/bad news)

For now though we'll think about a much simpler model, one that models each state independently, and we'll come back to integrate in a common shock.

## Setting the parameters
Ideally, we would *estimate* the parameters of the model from data, which would also give us a sense for how far off we are. However, we haven't gotten to that point in the class (soon though!).

To roughly calibrate the parameters I'm going to instead use probability data from *The Economist* magazine's model (which is publicly available [here](https://github.com/TheEconomist/us-potus-model)).

Their model is more complicated, where it takes as inputs many other features, and allows for many other moving parts/nuance. However, we're going to start simple!

In [None]:
# Enter the data from the economist:
economist.Data<-read.csv('economist/state_averages_and_predictions_topline.csv')
# Enter the electoral votes for each state (I copied these from wikipedia!)
ev.Data<-read.csv('./ev.csv') 
# show the data frame heads
head(economist.Data ) 
head(ev.Data) 

In [None]:
elecVotes<-ev.Data["ev"]
rownames(elecVotes) <- ev.Data$state

Checking the data, it's obvious that these are win probabilities for the Democratic candidate, rather than the Republican candidate.

In [None]:
stateProb<-economist.Data['projected_win_prob']
# set the row names to the states
rownames(stateProb) <- economist.Data$state
# rename the column
names(stateProb)<-"dem.Prob"
# Two-party model, so rep.Prob is
stateProb["rep.Prob"]<-1-stateProb["dem.Prob"]
head(stateProb) 

In [None]:
stateProb["ev"]<-1
for ( state in economist.Data$state) { 
    stateProb[state,"ev"]<-elecVotes[state,"ev"]
}

So we can now access all of our parameters as follows:

In [None]:
# Dem Probability in Alaska
stateProb["AK","dem.Prob"]
# Rep Probability in Arizona
stateProb["AZ","rep.Prob"]
# Electoral votes from New York
stateProb["NY","ev"]

It will also be helpful to define the list of states:

In [None]:
stateList<-rownames(stateProb)
stateList

### Running the simulation
Set the simulation size:

In [None]:
n.sims<-10000

Now we need to write a function that draws an outcome for each state, giving a single repetition of the simulation.

Let's carefully go through what this function does.

In [None]:
stateDraw<- function(probList,evList){
    # This is 51 uniform random numbers in[0,1]
    # which we will use to create the realization
    rnd<-runif(51,min=0,max=1)
    # initialize the out vector for ease of putting it together
    out=c(1:53)
    # Determine outcome for each of the 51
    for (ii in 1:51) {
        # if the model rnd number is less than the model prob then output the electoral votes, 
        # otherwise zero
        out[ii]<- ifelse(  rnd[ii] < probList[ii]  , evList[ii] ,  0 )
    }
    # sum the electoral votes
    demev<-sum(out[1:51])
    # name the output vector
    names(out)<-c(stateList,"dem.Total","rep.Total")
    # assign vote totals to the last two locations
    out["dem.Total"]<-demev
    out["rep.Total"]<- (538- demev)
    out
}

So let's run the function once and check its output.

**Question:** What are we expecting it to look like?

In [None]:
stateDraw( stateProb$dem.Prob ,stateProb$ev ) 

### Repeating
So if we're happy with it, we can move on to simulating it **lots** of times.

In [None]:
n.sims <- 100000
outputMatrix <- matrix(1,nrow=n.sims, ncol=53)
for (rep in 1:n.sims) {
    outputMatrix[rep,] <- stateDraw(stateProb$dem.Prob,stateProb$ev)
}
outSim.Economist<-data.frame(outputMatrix)
names(outSim.Economist) <- c(stateList,"dem.Total","rep.Total")
head(outSim.Economist)
# we'll later define a function to do this more generally

### Outcomes
So now that we have the simulation run, let's calculate the resulting outcome. There are **538** total electoral votes, so we'll check the fraction of times the Democrats get enough votes to win outright.

In [None]:
mean(ifelse(outSim.Economist$dem.Total>538/2,1,0))

and the complementary probability for the Republicans to win or draw:

In [None]:
mean(ifelse(outSim.Economist$dem.Total<=538/2,1,0))

This seems a little high. One problem here is that we're not accounting for the correlations.

*The Economist* data actually includes their estimated correlation matrix:

In [None]:
inMatrix<-read.csv('./economist/state_correlation_matrix.csv')
stateCorrMatrix<-as.matrix(inMatrix[1:51,2:52])
rownames(stateCorrMatrix)<-inMatrix$state
stateCorrMatrix[c("PA","AZ","GA","MI","FL"),c("PA","AZ","GA","MI","FL")]

**Question:** How can we generate the analog to this from our simulation?

In [None]:
head(outSim.Economist) # c("PA","AZ","GA","MI","FL")
#code_to_calculate_correlations
cor(outSim.Economist[,c("PA","AZ","GA","MI","FL")])

### Adding common shocks
While we could go back to their primitive model to play with where these correlations come from, we're just going to do something quick and dirty to put in *some* correlation via common shocks.

I'm going to use a parametric model for generating the state win probabilities:
$$ \Pr(\text{State }j\text{ is Dem})=\frac{\exp(\alpha_j+\epsilon)}{\exp(\alpha_j+\epsilon)+1}$$
where $\epsilon$ is a common $U[-k,k]$ shock and $\alpha_j$ is a state-level parameter. 

**Question:** What does the shock do to the probability?

**Question:** Why use the function $\tfrac{\exp(x)}{\exp(x)+1}$?

### Illustrating the function

In [None]:
expProb<-function(x) exp(x)/(exp(x)+1)
base+geom_function(fun = expProb, colour=Pitt.Blue, linewidth=2)+xlim(-10,10)

So this is just a way of mapping a real-valued number in $\mathbb{R}$ into a probability in $[0,1]$, where lower numbers indicate lower probabilities.
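As a side note, base R already provides this mapping as `plogis()` (the CDF of the logistic distribution). A quick sketch checking that the two agree, and showing one reason to prefer the built-in for extreme inputs. The definition of `expProb` is repeated here only so the snippet runs on its own:

```r
expProb <- function(x) exp(x) / (exp(x) + 1)   # same definition as in the notes
x <- seq(-10, 10, by = 0.5)
all.equal(expProb(x), plogis(x))               # the two agree on this range
# For very large x, exp(x) overflows to Inf and Inf/Inf is NaN,
# whereas plogis() returns the limiting value
expProb(800)
plogis(800)
```

For the parameter ranges used in these notes the hand-written version is fine; the overflow only bites for inputs in the hundreds.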

### Where to get the $\alpha_j$ parameters?
In order to calculate the new parameters from the original probabilities, I integrated the formula across a uniform shock in $[-k,k]$ to set the unconditional probability equal to the value from the Economist data $p_j$:  
$$\mathbb{E}P_j=\int^k_{-k} \frac{1}{2k} \frac{\exp(\alpha_j+\epsilon)}{\exp(\alpha_j+\epsilon)+1} d\epsilon=p_j.$$

I then inverted this formula to figure out how to map the given probability $p_j$ for each state $j$ into a parameter $\alpha_j$ for the model *(don't worry about this step, this is just so I can use the Economist probability data).*
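For anyone curious, the inversion can be sketched as follows (this is my reconstruction from the integral above, so treat it as a check rather than the original derivation). Since $\frac{d}{d\epsilon}\log\left(1+e^{\alpha_j+\epsilon}\right)=\frac{e^{\alpha_j+\epsilon}}{e^{\alpha_j+\epsilon}+1}$, the integral evaluates to
$$\frac{1}{2k}\left[\log\left(1+e^{\alpha_j+k}\right)-\log\left(1+e^{\alpha_j-k}\right)\right]=p_j \quad\Longrightarrow\quad \frac{1+e^{\alpha_j+k}}{1+e^{\alpha_j-k}}=e^{2kp_j}.$$
This is linear in $e^{\alpha_j}$, and solving it gives
$$e^{\alpha_j}=\frac{e^{k}\left(e^{2kp_j}-1\right)}{e^{2k}-e^{2kp_j}} \quad\Longrightarrow\quad \alpha_j=k+\log\left(e^{2kp_j}-1\right)-\log\left(e^{2k}-e^{2kp_j}\right).$$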

This led to the below function:

In [None]:
gen.alpha<- function(p,k) {
    if (p <= 0.0001) { # This makes sure we don't output -infinity
        return (-10)
    }
    else if (p>=0.9999) { # This makes sure we don't output +infinity
        return (10)
    } 
    else { # Standard output for the economist probability!
        return (k+ log(( exp(2*p*k)-1))-log((exp(2*k) -exp(2*k*p) )) )
    }
}

In order to check this function works, we'll make use of a quick ***simulation*** to check it!

In [None]:
# calculate the parameters 
alpha.0.1 <- gen.alpha(0.1,2) 
alpha.0.8 <- gen.alpha(0.8,4)
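# draw 1,000,000 shocks from the uniform on [-1,1] 
shock<-runif(1000000,min=-1,max=1)
# Calculate the mean for p=0.1 with a [-2,2] shock
alpha.0.1
mean(exp(alpha.0.1+shock*2)/(exp(alpha.0.1+shock*2)+1))
# Calculate the mean for p=0.8 with a [-4,4] shock
alpha.0.8
mean(exp(alpha.0.8+shock*4)/(exp(alpha.0.8+shock*4)+1))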


So it looks like my formula was at least correct for these values. That is, I've checked that the marginal probabilities for each state will match the probabilities we read in from *The Economist* model.
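The same check can also be done without simulation noise by integrating numerically. This is only a sketch: `alpha.from.p` copies the main branch of `gen.alpha` so the snippet is self-contained, and `implied.p` is a hypothetical helper name:

```r
# Copy of gen.alpha's main branch so the snippet runs on its own
alpha.from.p <- function(p, k) k + log(exp(2*p*k) - 1) - log(exp(2*k) - exp(2*k*p))
# Unconditional probability implied by alpha under a U[-k,k] shock,
# computed by numerical integration rather than simulation
implied.p <- function(alpha, k) {
    integrate(function(e) plogis(alpha + e) / (2*k), lower = -k, upper = k)$value
}
implied.p(alpha.from.p(0.1, 2), 2)   # should recover p = 0.1
implied.p(alpha.from.p(0.8, 4), 4)   # should recover p = 0.8
```

Agreement here and in the simulation is a decent sign that the inversion is right, at least away from the extreme probabilities handled by the cutoffs.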

Now let's check that the model's doing what we want in creating some variation in the probabilities across different shocks.

In [None]:
#Interquartile range p=0.1, shock scaled by k=2 and then k=4:
IQR(exp(alpha.0.1+2*shock)/(exp(alpha.0.1+2*shock)+1))
IQR(exp(alpha.0.1+4*shock)/(exp(alpha.0.1+4*shock)+1))
#Interquartile range p=0.8, shock scaled by k=2 and then k=4:
IQR(exp(alpha.0.8+2*shock)/(exp(alpha.0.8+2*shock)+1))
IQR(exp(alpha.0.8+4*shock)/(exp(alpha.0.8+4*shock)+1))

**Question:** What else could I run here to check that it generates variation in the probabilities? 

In [None]:
quantile(exp(alpha.0.1+4*shock)/(exp(alpha.0.1+4*shock)+1),c(0.1,0.9))

In [None]:
quantile(exp(alpha.0.1+2*shock)/(exp(alpha.0.1+2*shock)+1), probs=c(0.25,0.5,0.75))

So as we make $k$ larger, the range of probabilities widens.

### Figure out the parameter for each state:
Using the formula above, we can calculate the $\alpha$ parameter for each state:

In [None]:
# Enter the alpha parameter for each state
kVal <- 5
stateProb["alpha"]<-0 
for (state in stateList) {
    stateProb[state,"alpha"]=gen.alpha(stateProb[state,"dem.Prob"],kVal)
    }
head(stateProb)
stateProb["NY",]

### Re-simulate the model with common shocks
While the shock $\epsilon$ is mean-zero and random *across* simulations, I'm going to make it common to each state *within* each simulation.

In this way if the shock is very negative it will decrease/increase the chances for a Democratic/Republican win in each state.
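To see why a shared shock creates correlation across states, here is a small sketch with two hypothetical toss-up states ($\alpha=0$) and an assumed shock range of $k=3$. None of these numbers come from the election data:

```r
set.seed(7)
n.toy <- 20000
k.toy <- 3                                       # assumed shock range for the illustration
# Two hypothetical toss-up states (alpha = 0), simulated n.toy times
eps <- runif(n.toy, min = -k.toy, max = k.toy)   # one common shock per simulation
p.common <- plogis(0 + eps)                      # same win probability in both states
state1 <- as.numeric(runif(n.toy) < p.common)
state2 <- as.numeric(runif(n.toy) < p.common)
cor(state1, state2)                              # clearly positive

# Same marginal probability (0.5) but no common shock
indep1 <- as.numeric(runif(n.toy) < 0.5)
indep2 <- as.numeric(runif(n.toy) < 0.5)
cor(indep1, indep2)                              # close to zero
```

Given the shock, the two states are still drawn independently; all of the correlation comes from both probabilities moving together.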

Let's write a function to carry out the simulation:

In [None]:
stateDrawC<- function( alphaList , evList, kparam) {
    eps<-runif(1,min=-kparam,max=kparam) # the common shock draw
    rnd<-runif(51,min=0,max=1) # This is 51 uniform random numbers in [0,1] to determine outcome
    
    out=c(1:54) #initialize the out vector as 54 values
    # Determine outcome for each of the 51 electoral regions
    for (ii in 1:51) {
        # if the model rnd number is less than the model prob (the exp/(exp+1) term) then output the electoral votes, otherwise zero
        out[ii]<-ifelse( rnd[ii] < exp( alphaList[ii]+eps) /( exp( alphaList[ii]+eps )+1), evList[ii] ,0)
        # Here I adjust for the way that Maine distributes its electoral votes by just splitting them every time
        if (ii==22) out[ii]=2
    }
    demev<-sum(out[1:51]) # sum of the electoral votes
    names(out)<-c(stateList,"dem.Total","rep.Total","shock") # name the out vector
    out["dem.Total"]<-demev # assign the 52nd entry to the democratic total
    out["rep.Total"]<- (538- demev) # assign the 53rd entry to the republican total
    out["shock"]<-eps
    out # this is the returned output
}
# Run it once to see the output!
stateDrawC(stateProb$alpha,stateProb$ev,kVal)

### Run the new simulation:
So now we just repeat the simulation again.

However, because we're spending so much time writing code to add repetitions to a matrix and convert to a data frame, let's write it as a function!

In [None]:
monte.carlo.sim<-function(fun,fun.arg,nSims=10000){
    # this line just runs fun(arguments in fun.arg list) so if fun.arg=(x,y,z) do.call(fun,fun.arg) runs fun(x,y,z)
    rep1<- do.call(fun,fun.arg) 
    # Set the dimensions for the output matrix
    nc<-length(rep1)
    lbl<-names(rep1)
    outputMatrix <- matrix(1,nrow=nSims, ncol=nc)
    outputMatrix[1, ]<-rep1 # write the sim to the first line
    for (rep in seq_len(nSims)[-1]) { # for each of the remaining sims, add them in (empty when nSims is 1)
        outputMatrix[rep, ] <- do.call(fun,fun.arg)
    }
    df<-data.frame(outputMatrix) # convert it to a data frame
    names(df)<-lbl  # get the names from the simulation output lbl
    return(df) # return the data frame as the output
}

So we don't have to run this code anymore:
```
outputMatrix <- matrix(1,nrow=n.sims, ncol=54) # Nsims x 54 matrix of ones
for (rep in 1:n.sims) { 
    outputMatrix[rep,] <- stateDrawC(stateProb$alpha,stateProb$ev,kVal)
}
outSim.Economist.Corr<-data.frame(outputMatrix) 
names(outSim.Economist.Corr) <- c(stateList,"dem.Total","rep.Total","shock") 
```
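For what it's worth, base R's `replicate()` offers a compact alternative to a hand-written loop of this kind. The sketch below uses a hypothetical stand-in function `toy.draw` rather than the election code:

```r
set.seed(3)
# Hypothetical stand-in for a simulation function: returns a named vector
toy.draw <- function(n, mu) {
    x <- rnorm(n, mean = mu)
    c(avg = mean(x), sdev = sd(x))
}
# replicate() returns one column per repetition, so transpose to get one row per sim
toy.sims <- as.data.frame(t(replicate(1000, toy.draw(n = 50, mu = 2))))
head(toy.sims)
colMeans(toy.sims)
```

Writing our own wrapper still has the advantage of making each step explicit, which is the point of the exercise here.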

Running the sim:

In [None]:
outSim.Economist.Corr <- monte.carlo.sim( #calling our simulation function
    stateDrawC, # the function to simulate
    nSims=10000, # number of times to simulate
    fun.arg=list(alphaList=stateProb$alpha,evList=stateProb$ev,kparam=kVal) #arguments to the function (same k used to calibrate alpha)
)
head(outSim.Economist.Corr)
# Average win probability for Dems in the simulation:
mean(ifelse(outSim.Economist.Corr$dem.Total>538/2,1,0))

So not a huge difference on average, despite really quite large shocks. Let's make sure that the shock is doing what it's meant to.

Here I'm going to plot a smoothed version of the conditional mean number of electoral votes for the Democrats against the shock:

In [None]:
ggplot(data=outSim.Economist.Corr) +aes(x=shock,y=dem.Total)+ 
pitt.theme+geom_smooth(color=Pitt.Blue,linewidth=3,method="loess",formula="y~x")

So even with fairly large shocks our model still suggests that it was likely that the Democrats would win the election. *The Economist* model is instead built on other fundamentals within each state; as such, it has quite a different correlation structure.

Let's examine five key swing states: PA, AZ, GA, MI, FL to look at the correlations

In [None]:
# Original Independent Prob Model 
round( cor(outSim.Economist[c("PA","AZ","GA","MI","FL")]),3 )
# Common Shock Prob Model
round( cor(outSim.Economist.Corr[c("PA","AZ","GA","MI","FL")]),3 )
# Economist Model
stateCorrMatrix[c("PA","AZ","GA","MI","FL"),c("PA","AZ","GA","MI","FL")]

Our correlations certainly are not close to *The Economist* model correlations. However, we're fairly close to the model's predictions on the [probability of a Biden win](https://projects.economist.com/us-2020-forecast/president): 97 % if we set $k\approx 3.3$.

In [None]:
kVal <- 3.31
stateProb["alpha"]<-0
for (state in stateList) {
    stateProb[state,"alpha"]=gen.alpha(stateProb[state,"dem.Prob"],kVal)
    }
# re-run the simulation with the recalibrated alpha parameters
outSim.Economist.Corr <- monte.carlo.sim( #calling our simulation function
    stateDrawC, # the function to simulate
    nSims=100000, # number of times to simulate
    fun.arg=list(alphaList=stateProb$alpha,evList=stateProb$ev,kparam=kVal) #arguments to the function
)
# Average win probability for Dems in the simulation:
mean(ifelse(outSim.Economist.Corr$dem.Total>538/2,1,0))
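A simulated probability like this one carries simulation noise of its own, which the usual binomial standard error can gauge. The sketch below uses a hypothetical vector `toy.wins` of win indicators in place of the real simulation output:

```r
set.seed(11)
# Hypothetical stand-in for a column of simulated win indicators (1 = win)
toy.wins <- rbinom(100000, size = 1, prob = 0.97)
p.hat <- mean(toy.wins)
se.hat <- sqrt(p.hat * (1 - p.hat) / length(toy.wins))   # binomial standard error
# Approximate 95-percent interval for the simulated probability
c(estimate = p.hat, lower = p.hat - 1.96 * se.hat, upper = p.hat + 1.96 * se.hat)
```

This only measures noise from using a finite number of simulations; it says nothing about whether the model's parameters are right.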

### Let's make the model more competitive
I wouldn't be entirely satisfied with the above model, as I think the chances for a Democratic win seem very high, though maybe this is a function of the Economist data we use as an input.

However, to show you how we can use such a model once we have it, let's play with the underlying numbers.

Let's shift the alpha parameters for the model:
$$ \Pr(\text{State }j\text{ is Dem})=\frac{\exp(\alpha_j+\epsilon)}{\exp(\alpha_j+\epsilon)+1}$$

In [None]:
# Generate a new version of alpha for all the states
stateProb["alpha.comp"]<-stateProb["alpha"]-3.5

**Question:** What will this shift do?

The above simply pushes all of the state parameters to the left (funnily enough, having the opposite effect on outcomes)

In [None]:
outSim.Comp.Corr<-monte.carlo.sim(
    stateDrawC,nSims=10000, 
    fun.arg=list(alphaList=stateProb$alpha.comp,evList=stateProb$ev,kparam=kVal)  
) 
head(outSim.Comp.Corr)
# Average win probability for Dems in the simulation:
mean(ifelse(outSim.Comp.Corr$dem.Total>538/2,1,0))
print(paste0("Probability dem win:", round(mean(ifelse(outSim.Comp.Corr$dem.Total>538/2,1,0)),3)))

So we've set up more of a toss-up election by modifying the parameters.

Let's look at what the win probabilities are by state in the model:

In [None]:
ProbVector<-1:51
names(ProbVector)<-stateList 
for (state in stateList) {
   ProbVector[state]=round(mean(outSim.Comp.Corr[,state])/stateProb[state,"ev"],2)
}
ProbVector

Seems reasonable enough. Now that we have a competitive parameterization, let's see what happens as we change things...

### The counter-factual
Normally, once we have a working model, the point is then to use it to figure out *what might be*.

In this case, we're going to try and work out what the effects are for the Republican candidate from Texas becoming a lean-Democratic state.

So, let's look at the current value of the $\alpha$ parameter for Texas, as a strongly lean R state. 

In [None]:
stateProb["TX","alpha.comp"]

To try and understand the effects within the electoral college from the migration patterns in Texas, we will change Texas to a lean-D state.

**Question:** How can we do this?

In [None]:
# But what happens when we make TX a lean-Dem state, holding everything else at the more competitive values
stateProb["TX","alpha.comp"]<- 2.5
outSim.Comp.Corr<-monte.carlo.sim(stateDrawC,nSims=10000, 
        fun.arg=list(alphaList=stateProb$alpha.comp,evList=stateProb$ev,kparam=kVal))
head(outSim.Comp.Corr)
# Average win probability for Dems in the simulation:
mean(ifelse(outSim.Comp.Corr$dem.Total>538/2,1,0))
print(paste0("Probability dem win:", round(mean(ifelse(outSim.Comp.Corr$dem.Total>538/2,1,0)),2)))

So Texas represents only 7 percent of the electoral votes:

In [None]:
100*stateProb["TX","ev"]/538

But the model is telling us that flipping it lean-D causes a shift in the probabilities of 20 percentage points, signaling that the paths to victory within the electoral college become much harder without this big state.

**Question:** How can I check the model probability of the Republicans winning without winning Texas?

In [None]:
#what_to_enter
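One possible approach is to condition on the simulations in which Texas goes Democratic and then average the Republican win indicator within that subset. The sketch below runs on a made-up data frame `toySim` (its columns mimic `TX` and `rep.Total`, but the numbers are invented); with the real output the same idea would presumably apply to `outSim.Comp.Corr`:

```r
set.seed(5)
# Hypothetical simulation output: TX holds the Democratic votes won in Texas
# (38 or 0) and rep.Total the Republican electoral-vote total; numbers are made up
toySim <- data.frame(TX = sample(c(38, 0), 10000, replace = TRUE, prob = c(0.6, 0.4)))
toySim$rep.Total <- round(rnorm(10000, mean = 275 - 0.6 * toySim$TX, sd = 30))
# Keep only the simulations where the Democrats carry Texas ...
lostTX <- subset(toySim, TX > 0)
# ... and ask how often the Republicans still reach a majority
mean(lostTX$rep.Total > 538/2)
```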

## Simulating an econometric method
Another setting where we frequently want to know not just expected values but also distributions is in trying to understand the properties of econometric models.

In particular, while asymptotic results mean that we can use $t$ and $F$ tests for parameters in OLS models, unless the disturbances are normally distributed these tests are not exact in finite samples.

**But when can we treat the data as if it is "large"?**

Simulations offer us a way to examine finite-sample properties of a procedure... 


...or for newer/more-experimental methods, they can help us understand whether the method is well formed under the relevant assumptions.

### Linear Model Example

In [None]:
simLinearModel<-function(n,beta0=1,beta1=1,sigmaX=1,sigmaU=1){
  # Draw the x values 
  xD=rnorm(n,mean=0,sd=sigmaX)
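  # Draw the u values from a very non-normal (U-shaped) distribution:
  # Beta(0.5,0.5) has mean 0.5 and variance 1/8, so centering at 0.5 and
  # scaling by sqrt(8) gives disturbances with mean 0 and sd sigmaU
  uD=(rbeta(n,shape1=0.5,shape2=0.5)-0.5)*sigmaU*sqrt(8)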
  # Put them both into a dataset (y,x,u) 
  # where y=b0+b1*x+u
  simdata.d<-data.frame(y=beta0+beta1*xD+uD,x=xD,u=uD)
  # Estimate the model
  simdata.m<-lm(y~x,data=simdata.d)
  # Return the slope estimate and its standard error as the output of the function
  return(coefficients(summary(simdata.m))["x",c("Estimate","Std. Error")])
}

### Run the model once to check

In [None]:
simLinearModel(100)

### Simulation with 25 observations per LM

In [None]:
sim.df.25<-monte.carlo.sim (
    simLinearModel,
    fun.arg=list(n=25),nSims=10000)
head(sim.df.25)  

In [None]:
mean(sim.df.25$Estimate)

In [None]:
sd(sim.df.25$Estimate)
1/sqrt(25-2)

### Simulation with 100 observations per LM

In [None]:
sim.df.100<-monte.carlo.sim(simLinearModel,fun.arg=list(n=100),nSims=10000)
head(sim.df.100) 

## Let's look at the distribution of the $\beta_1$ estimate

In [None]:
ggplot(sim.df.25, aes(x = Estimate))+
geom_histogram(aes(y = after_stat(density)),breaks = seq(0.4, 1.6, by = 0.05), 
                   colour = Pitt.DGray, 
                   fill = Pitt.Blue,linewidth=1)+
geom_function(fun = dnorm, args = list(mean = 1, sd = 1/5),color=Pitt.Gold,linewidth =3)+theme( 
    panel.background = element_rect(fill = Pitt.LGray, linewidth = 0.5, linetype = "solid"),
  panel.grid.major = element_line(linewidth=  0.5, linetype = 'solid', colour =Pitt.Gray), 
  panel.grid.minor = element_line(linewidth= 0.25, linetype = 'solid', colour = Pitt.Gray)
  )

## t-statistic distribution
The true value of the parameter $\beta_1$ is 1, so we should be able to look at a t-statistic for the null:
$$ H_0:\beta_1=1$$
which is given by:
$$ \frac{\hat{\beta}_1-1}{\text{se}(\hat{\beta}_1)} $$

In [None]:
sim.df.25["tStat1"]<- (sim.df.25["Estimate"]-1 ) /sim.df.25["Std. Error"]

In [None]:
ggplot(sim.df.25, aes(x = tStat1))+
geom_histogram(aes(y = after_stat(density)),breaks = seq(-3, 3, by = 0.25), 
                   colour = Pitt.DGray, 
                   fill = Pitt.Blue,linewidth=1)+
geom_function(fun = dt, args = list(df = 23),color=Pitt.Gold,linewidth = 3)+theme( 
    panel.background = element_rect(fill = Pitt.LGray, linewidth = 0.5, linetype = "solid"),
  panel.grid.major = element_line(linewidth = 0.5, linetype = 'solid', colour =Pitt.Gray), 
  panel.grid.minor = element_line(linewidth = 0.25, linetype = 'solid', colour = Pitt.Gray)
  )

Let's check the type-I error of the test, as we know that the null here is true, as we used this value to generate the data!

The two-sided 95-percent critical value for the t-statistic is given by:

In [None]:
qt(0.975,df=23)

So we can calculate how often we would falsely reject:

In [None]:
mean(ifelse(abs(sim.df.25["tStat1"])>qt(0.975,df=23),1,0))

**Question:** Is this close to or far from the nominal 5-percent level?
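One rough way to judge "close" is to ask how much a rejection rate would bounce around from simulation noise alone if the test really had 5-percent size. A small sketch, assuming the 10,000 repetitions used above (the variable names are my own):

```r
n.reps <- 10000                                # number of simulated regressions
alpha.level <- 0.05                            # nominal size of the test
mc.se <- sqrt(alpha.level * (1 - alpha.level) / n.reps)
# Range of rejection rates consistent with a correctly sized test (about 95% of runs)
alpha.level + c(-1.96, 1.96) * mc.se
```

A simulated rejection rate well outside this band would suggest the test's actual size differs from the nominal one at this sample size.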

## Soccer score lines.

In a homework you will build a simulation model of scorelines for soccer.

For this we'll use parameters from an estimated model from [fivethirtyeight.com](https://data.fivethirtyeight.com/) but where I'll ask you to do the simulations.


In [None]:
inTeams<-read.csv("https://projects.fivethirtyeight.com/soccer-api/club/spi_global_rankings.csv")
#inTeams<-read.csv('./538/spi_global_rankings.csv')
head(inTeams)

Get the mean parameters (and the log mean)

In [None]:
mean.off<-mean(inTeams$off)
mean.def<-mean(inTeams$def)
lmean.off<-mean(log(inTeams$off))
lmean.def<-mean(log(inTeams$def) ) 
mean.off
mean.def

Get just the teams in the Premier League

In [None]:
premLeague<-subset(inTeams,league=="Barclays Premier League")
premTeams<-premLeague$name
rownames(premLeague)<-premTeams
df.prem<-premLeague[,c("off","def")]
df.prem

The offense parameter from fivethirtyeight is the expected goals when playing against an average team, where the model uses a Poisson process for the number of goals produced. 

A Poisson random variable has the following mass function at each non-negative integer $k\in\{0,1,2,\ldots\}$:
$$ \Pr(k)= \frac{\lambda^k e^{-\lambda}}{k!}.$$

A property of this distribution is that the expected value is $\lambda.$
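A quick sketch checking both statements in R, using an assumed rate of 1.5 goals purely for illustration (`lambda.toy`, `manual.pmf` and `pois.mean` are hypothetical names):

```r
set.seed(9)
lambda.toy <- 1.5                                   # assumed goal rate for illustration
k <- 0:5
# The mass function above, written out by hand, against R's dpois()
manual.pmf <- lambda.toy^k * exp(-lambda.toy) / factorial(k)
all.equal(manual.pmf, dpois(k, lambda.toy))
# The average of many Poisson draws should be close to lambda
pois.mean <- mean(rpois(100000, lambda.toy))
pois.mean
```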

Because the value for $\lambda$ has to be greater than 0, we will use a model where team $i$, when playing team $j$, scores $\lambda_{ij}=\exp(\alpha_i-\delta_j)$ where:
* $\alpha_i$ is an offense parameter for team $i$
* $\delta_j$ is a defense parameter for team $j$

So we can set the model parameters (and eliminate the $\lambda_0$ term) to get:

In [None]:
lmean.def<- log(mean(df.prem$def))
lmean.off<- log(mean(df.prem$off))               
df.prem["alpha"]<-log(df.prem["off"])-lmean.def
df.prem["delta"]<-lmean.off-log(df.prem["def"])
head(df.prem) 

where this will guarantee that the expected goals for/conceded against the average team will match the given `off` and `def` parameters

The fivethirtyeight model incorporates some additional terms reflecting a home game bonus effect, and a slight increase in the likelihood of a draw. We'll ignore that for now.

Set up lists of the parameters:

In [None]:
alphaList<-df.prem$alpha
deltaList<-df.prem$delta
names(alphaList)<-rownames(df.prem)
names(deltaList)<-rownames(df.prem)
alphaList["Liverpool"]

Using the `rpois` command we can take a random Poisson draw for the scoreline in a particular match. Here we'll try this for *Liverpool* vs. *Manchester City*:

In [None]:
rpois(1,exp(alphaList["Liverpool"]-deltaList["Manchester City"]) )

In [None]:
c(rpois(1,exp(alphaList["Liverpool"]-deltaList["Manchester City"])),
  rpois(1,exp(alphaList["Manchester City"]-deltaList["Liverpool"])))

Generalizing this into a function that takes as arguments the two team names:

In [None]:
draw.score<-function(team1,team2){
    c(
        rpois(1,exp(alphaList[team1]-deltaList[team2])),
  rpois(1,exp(alphaList[team2]-deltaList[team1]))
    )
}
draw.score("Liverpool","Arsenal")

In [None]:
df.prem[c("Liverpool","Arsenal"),]

In [None]:
# Any guesses ?
draw.score("Liverpool","Arsenal")

We can assemble the set of all matches

In [None]:
#install.packages('gtools')
library('gtools')
# All possible matches in a season
allMatches<-permutations(20, 2, v=rownames(df.prem),repeats.allowed=FALSE)
colnames(allMatches)<-c("home","away")
head(allMatches,9)
nrow(allMatches) # number of matches (one per row)

*(Again, the fivethirtyeight model is a bit more complicated, and it incorporates the dynamics, for when a match **means** something extra to one team; we will also ignore this, and not try to carry running totals)*

Your assignment will be to form this into a coherent picture of the outcomes for an entire league season.

In [None]:
# Example scores through the entire season
ScoresMatrix <- matrix(nrow=nrow(allMatches),  ncol=4)
for (ii in 1:nrow(allMatches)  ) {
     ScoresMatrix[ii,1:2]=allMatches[ii,]
     ScoresMatrix[ii,3:4]= draw.score(allMatches[ii,"home"],allMatches[ii,"away"] )  
}
colnames(ScoresMatrix)<-c("home.team","away.team","home.score","away.score")
head(ScoresMatrix)