In [2]:
import time
import numpy as np
import pandas as pd
import scipy.stats as stats
import scipy.optimize as optimize
import scipy.integrate as integrate
import matlab.engine

ModuleNotFoundError: No module named 'matlab'

### Part I: Execution time comparisons of algorithms to estimate the univariate IID Student t model with MLE (BFGS) and pseudo-MLE (titer)

#### Motivation:
From the first take-home assignment, you all now have familiarity with obtaining the MLE of the univariate IID location-scale Student t model via the BFGS algorithm. The point of this easy exercise is to make you explicitly aware of algorithm convergence tolerance and significant digits, allowing fair comparison across methods.

It also is to show you how some "tricks" can be deployed to fit a model faster than just "plug and chug" brute force MLE optimization. I have numerous such inventions, and we will see another below, in the context of the multivariate noncentral Student's t distributions. This one here, for the IID Student t model, is a bit of a joke, but I have a more sophisticated application for fitting an asymmetric GARCH model with stable-Paretian innovations: The MLE requires several seconds, whereas my way takes literally a microsecond. We need to compute it thousands of times for backtest applications, so we need this speed improvement. And, amazingly, my method *beats the MLE* in accuracy and MSE. If you are curious about this, see my time-series book.

For keeping track of time, in Matlab, there are functions to do this. The oldest way in Matlab, still there, is "tic and toc" (oh, those dorky Matlab engineers...). Use the Matlab help command to investigate how this works, but crucially, it will then list the more sophisticated methods of starting a clock, and ending a clock, and reporting the elapsed time, also saving the result as a variable. I suggest you use Matlab's better methods for doing this, instead of the "manual", simple, tic-toc.
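For those working in Python rather than Matlab, a minimal timing sketch follows. It assumes `time.perf_counter` is an acceptable stand-in for Matlab's timing functions; the helper name `timed` is hypothetical and only for illustration.

```python
import time

def timed(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()          # start the clock
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start  # stop the clock, keep the result as a variable
    return result, elapsed

# Toy usage: time a simple summation.
total, seconds = timed(sum, range(1_000_000))
print(f"sum = {total}, elapsed = {seconds:.4f} s")
```

The elapsed time is returned as an ordinary variable, so it can be accumulated over a FOR loop.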

Back now to the task at hand.

---
#### Preparation:
Read and understand Example 4.3 in Fundamental Statistical Inference. It delivers "nearly" the MLE of the 3-parameter Student t model, using some tricks to enhance speed. It is not as accurate or efficient as true MLE, but in reality, it does not matter. That it does not matter requires some explanation:

This is a major "takeaway" I want you to learn in this course. For real data, in real life, most surely multivariate, you do not know the distribution (or, more generally, the intractable and unknowable DGP). In our simple course content of IID models and emphasis on distributions, your assumption of Student t might be reasonably justified, but the data are of course not actually Student t (unless obviously you are simulating the data).

So, the parameter estimation method is not as important as you might think. I belabor this idea extensively in two of my recent papers, and in my time-series book, using Gaussian (and Laplace) multivariate mixtures for modeling asset returns. I get blatantly better (out of sample, risk adjusted) portfolio performance when I use my own devised estimation methods, which are different than the MLE, and not even consistent. The reason it works, and consistency is irrelevant, is because the true DGP is, w.p.1, not the specified model.

In this field (and maybe in life in general...), all that counts is performance, and not useless superficialities. Here, what counts is out of sample (OOS) portfolio performance compared to the major benchmark models, notably after accounting for transaction costs. And for those of you who have heard of GARCH, watch how many papers (I only mean the ones in the top journals; nobody cares about the rest) ignore transaction costs. The portfolio turnover is much higher with GARCH, ruining your day when you account for the transaction costs.

In addition to OOS performance, the ease of model parameter estimation, and ease of computing the predictive portfolio weights, are also relevant factors. If you are curious, again see my time-series book, and also (just read the abstract):

https://www.sciencedirect.com/science/article/abs/pii/S2452306217300126

---
#### The Task. 
Use my attached Matlab program and confirm it works by running:

x=trnd(3,1e5,1);  [dfhat, muhat, chat, iters] = titer(x)

If a piece of the code is deprecated, you figure out how to modify it to Matlab 2023 standards.

We wish to determine which is faster of the two methods, *for a fixed parameter estimation tolerance level, necessarily the same across both algorithms*.

(i) My t-iteration (titer.m) program, or
(ii) brute force MLE maximization using BFGS.

This is done with simulation and recording the time it takes for, say, k=100 or 1000 estimations. You report the two times as "Based on 100 (or 1000) replications, and a requested parameter estimation tolerance of (say) 4 significant digits, the total times required by the two algorithms are xxx seconds and yyy seconds, respectively".
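As a rough Python counterpart to method (ii), here is a sketch of brute force MLE via BFGS using `scipy.optimize.minimize`. The function names, the log-reparameterization of df and scale, and the starting values are my assumptions, not taken from the course codes. Note that `gtol` is a tolerance on the gradient norm, not directly on the parameters, so it still has to be calibrated to the 4-digit target.

```python
import numpy as np
import scipy.optimize as optimize
import scipy.stats as stats

def t_negloglik(theta, x):
    """Negative log-likelihood; theta = (log df, mu, log c) keeps df and c positive."""
    df, mu, c = np.exp(theta[0]), theta[1], np.exp(theta[2])
    return -np.sum(stats.t.logpdf(x, df, loc=mu, scale=c))

def fit_t_bfgs(x, gtol=1e-5):
    """Return (df, mu, c) estimates and the BFGS iteration count."""
    start = np.array([np.log(5.0), np.median(x), np.log(np.std(x))])  # assumed start values
    res = optimize.minimize(t_negloglik, start, args=(x,), method="BFGS",
                            options={"gtol": gtol})
    est = np.array([np.exp(res.x[0]), res.x[1], np.exp(res.x[2])])
    return est, res.nit

rng = np.random.default_rng(1)
x = rng.standard_t(3, size=5000)      # true df=3, location 0, scale 1
(dfhat, muhat, chat), iters = fit_t_bfgs(x)
print(dfhat, muhat, chat, iters)
```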

##### First for titer:
For a fixed sample size, say T=250 (about one year of daily stock return data), generate a simulated IID Student t data set, call it X. Run titer several times on this *same* data set X, varying the estimation tolerance, until you determine the value that leads to the outputted parameter vector having, say, 4 digit accuracy. Not at least 4, but (roughly) just 4.

You will need to go look in the titer.m codes and find everywhere a tolerance is required, e.g., in the root-finding function calls. An obvious suggestion is that you augment the function and pass parameter "tol" to it. Notice you could devise an algorithm to do this operation of determining the optimal value of tol. I guess the "round" command in Matlab will be of use. If you make a correct algorithm to do this, please show it and brag about it in your report. You can of course also do this step manually, but in real life, best you "algorithmatize" it.

Note: You do *not* compare the parameter estimate vector to the true parameter vector! Rather, start with tol=0.0000001, get the (pseudo-)MLE from titer, and keep increasing tol for as long as the first four digits (in each of the 3 parameters) do not change. The largest such tol is the one you want.

Confirm with a few different X data sets until you are sure.

As an example, with tol set to an extreme value like 0.000001, you get, say for the df parameter,

df-hat-high-tol 5.125xxxxxx

and now you continue to increase tol until df-hat is 5.125, but the remaining digits differ from the value df-hat-high-tol. Notice every time, we use the same data set X. Ensure we get this 4-digit accuracy for all 3 parameters. Maybe one of the 3 parameters has a higher accuracy: that is okay, but all 3 need to have at least 4 digit accuracy, and ideally no more. We want the largest tol value possible to ensure this, so that we can do reliable speed comparisons with the other method.

No need to go overboard: The optimal value of tol should itself be accurate to about one significant digit, e.g., maybe you find tol=0.001, or 0.0005 to be right. That is enough.
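One way to "algorithmatize" the tolerance search is sketched below in Python. Everything here is illustrative: `round_sig`, `largest_tol`, and especially `toy_fit` are hypothetical names, and `toy_fit` is only a stand-in iterative estimator (a reweighted location estimate with df and scale held fixed), not titer. Comparing rounded values can misfire when an estimate sits right on a rounding boundary, so treat the output as a guide and confirm on several data sets.

```python
import math
import numpy as np

def round_sig(values, digits=4):
    """Round each entry to the given number of significant digits."""
    out = []
    for v in np.atleast_1d(values):
        v = float(v)
        out.append(0.0 if v == 0 else
                   round(v, digits - 1 - math.floor(math.log10(abs(v)))))
    return np.array(out)

def largest_tol(fit, x, tols, ref_tol=1e-10, digits=4):
    """Largest tol (scanning upward) whose estimate matches the reference to `digits` digits."""
    ref = round_sig(fit(x, ref_tol), digits)
    best = None
    for tol in sorted(tols):
        if np.array_equal(round_sig(fit(x, tol), digits), ref):
            best = tol
        else:
            break                      # first tol that changes a digit: stop
    return best

def toy_fit(x, tol, df=4.0, max_iter=10_000):
    """Stand-in estimator: iteratively reweighted location, stopped when the step < tol."""
    mu = float(np.median(x))
    new = mu
    for _ in range(max_iter):
        w = (df + 1.0) / (df + (x - mu) ** 2)
        new = float(np.sum(w * x) / np.sum(w))
        if abs(new - mu) < tol:
            break
        mu = new
    return np.array([new])

rng = np.random.default_rng(0)
x = 3.0 + rng.standard_t(4, size=250)          # same data set X for every tol
tols = [10.0 ** (-p) for p in range(1, 9)]     # one significant digit is enough
tol_star = largest_tol(toy_fit, x, tols)
print(tol_star, toy_fit(x, tol_star))
```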

Again, check this with a few other X data sets. We need to ensure that the optimal tolerance value is robust to the data set used.

##### Now repeat the above, but for the BFGS-based algorithm for the exact MLE. 
Do not compare the MLE from BFGS to the estimated parameter vector from titer! Nor compare it to the true parameter vector. Both such comparisons are irrelevant for what we are trying to do.

So, now you have a tol value for titer, say tol_titer, and a tol for BFGS, say tol_BFGS, such that both deliver the estimated parameter vector to (about) 4 significant digits. Maybe these two tolerance values are close in value, but it does not matter.

Crucial, again, is to realize that the estimated parameter vector from titer, and that from BFGS, will differ, because the former is only pseudo-MLE. So, there is no need to compare the estimated parameter vectors across the two methods. We know they will not be exactly the same.

##### Next
fix a seed value, and confirm that, in a FOR loop over k, for 100 (or 1000) iterations, we generate 100 different IID Student t data sets, but such that, when we repeat this, *with the same seed value*, we get the same 100 data sets.

Now we know both algorithms (for this Student t model and parameter constellation you used, and target tolerance, here 4 significant digits) have the same accuracy. Thus, it now makes sense to compare their estimation times. So, you set up a FOR loop, at least k=100 iterations, in which you generate an IID Student t data set with T observations, and pass that data vector to *one* of the two estimation methods. You need to keep track of how long this takes, over the entire FOR loop. Then, you repeat this, *using the same seed value so you ensure that each of the 100 data sets is the same*, using the *other* estimation method; and likewise keep track of how long it takes.

NOTE: Realistically, you actually could use different seed values, because we are doing this for 100 (or 1000) data sets, as opposed to just one, but it is so easy to manage the seed value to ensure that you compare the two estimation methods with the same 100 data sets. Also, managing seed values in simulations is important in general, and hence why I want you to do this.
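The seed management can be sketched in Python as follows, assuming NumPy's `default_rng` generator; the function name `simulate_datasets` and the default values are mine. The check confirms that two runs with the same seed give identical data sets, while data sets within one run differ.

```python
import numpy as np

def simulate_datasets(seed, k=100, T=250, df=3.0):
    """Generate k IID Student t data sets of length T from one seeded generator."""
    rng = np.random.default_rng(seed)
    return [rng.standard_t(df, size=T) for _ in range(k)]

run_a = simulate_datasets(seed=42)   # data sets for the first estimation method
run_b = simulate_datasets(seed=42)   # same seed: data sets for the second method

same_across_runs = all(np.array_equal(a, b) for a, b in zip(run_a, run_b))
distinct_within_run = not np.array_equal(run_a[0], run_a[1])
print(same_across_runs, distinct_within_run)
```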

---
#### In Report:
You of course show your codes (no need to show the titer.m codes; I have those, unless you modified them in some significant way), and discuss and demonstrate the above ideas for ensuring the two algorithms deliver about the same estimation tolerance. Then you report the estimation times. My hope is, titer is faster, but who knows.

Note: There is no efficiency gain for using the EM algorithm for MLE when working with the *univariate* Student t. That all changes with the multivariate Student t. So, no need to use it for the above comparison, as a "third method".

### Part II: The conditional distribution of a bivariate Student t.

#### Motivation:
This is a useful conceptual idea for dealing with a measure-zero event, and I used it to confirm the theoretical distributional results from a multivariate conditional saddlepoint approximation for AR(p) model selection. No need to look, but if you are curious, see my time series book, pp 382-3.

---
#### Preparation:
From my time series book, read section C.1. You can just skim up to equation C.14, and read and understand from C.14 to C.27.

Next, skim-read the attached short article, from author Ding on the *conditional* distribution of X2 given X1, where X1 and X2 are some margins of a multivariate t distribution. Note how recent the article is! You would think such results would have been known for 90 years.

We will be working in this assignment with just the bivariate version, so X1 and X2 will be scalars, not vectors. You do not need to understand the derivation, but notice the final result is very clear, and you can easily program and compute this, given the density parameter values.

We want to verify the author's result is correct (it is), because, as Ding points out, there were some previous errors in the literature on this, as well as more complicated derivations. Ding's method of proof is clean, smart, and correct.

---
#### Assignment:
##### II.1. 
From that paper, make a program that simply inputs the parameters of a bivariate Student t (2 location terms, and a 2X2 dispersion matrix, and the df parameter), and computes the parameters of the conditional distribution of X2 given X1.
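A possible Python sketch for the bivariate case is given below. It encodes the result as I understand it (conditional df equal to df+1, the usual regression location, and the Schur-complement scale inflated by a factor depending on x1); please verify each term against the paper before relying on it. The function name and the convention of returning the scale (not its square) are my choices.

```python
import numpy as np

def conditional_t_params(x1, mu, Sigma, df):
    """Parameters (df, location, scale) of X2 | X1 = x1 for a bivariate Student t."""
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    d1 = (x1 - mu[0]) ** 2 / Sigma[0, 0]                      # squared Mahalanobis distance of x1
    loc = mu[1] + Sigma[1, 0] / Sigma[0, 0] * (x1 - mu[0])    # regression-type location
    s22_1 = Sigma[1, 1] - Sigma[1, 0] ** 2 / Sigma[0, 0]      # Schur complement
    df_cond = df + 1                                          # df increases by dim(X1) = 1
    scale = np.sqrt((df + d1) / df_cond * s22_1)
    return df_cond, loc, scale

print(conditional_t_params(0.0, [0.0, 0.0], np.eye(2), 3.0))
```

Note that, unlike the Gaussian case, the conditional scale depends on the conditioning value x1.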

##### II.2. 
From my time series book section that I mentioned above, make a program to simulate a bivariate Student t. You pass to your function the true parameters: 2 location terms, a 2 X 2 dispersion matrix, and the single df value, and of course, the number of bivariate vectors to simulate, say n. You can use location = [0, 0] and start with identity matrix dispersion, and play around with using df=1, 2, 3, 4. You can make a bivariate plot of the data, in Matlab; see functions like mesh, and related functions. Once you get all the coding done and make the nice plots, then, for the next questions, use a non-identity dispersion matrix, i.e., such that X1 and X2 are correlated.
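A minimal Python sketch of the simulator follows. It assumes the standard representation of the multivariate Student t as a correlated Gaussian vector divided by the square root of an independent chi-square over its df; whether this matches the book section line for line is for you to check. The function name is hypothetical.

```python
import numpy as np

def simulate_mvt(n, mu, Sigma, df, rng):
    """Draw n vectors from a multivariate Student t with dispersion matrix Sigma."""
    mu = np.asarray(mu, dtype=float)
    L = np.linalg.cholesky(np.asarray(Sigma, dtype=float))
    Z = rng.standard_normal((n, mu.size)) @ L.T      # rows ~ N(0, Sigma)
    W = rng.chisquare(df, size=n) / df               # one common mixing variable per row
    return mu + Z / np.sqrt(W)[:, None]

rng = np.random.default_rng(7)
X = simulate_mvt(100_000, [1.0, -1.0], [[1.0, 0.5], [0.5, 2.0]], 3.0, rng)
print(X.shape, np.median(X, axis=0))
```

The single mixing variable shared by both coordinates is one way to see why the margins are dependent even with an identity dispersion matrix.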

Note: In the multivariate Student t, even with identity dispersion matrix, the univariate margin distributions are *not* independent. Can you say why? I can think of two explanations. You try.

##### II.3. 
Make a function that inputs the n X 2 matrix of simulated bivariate Student t values, as well as a parameter, a scalar, called x2. Choose a value where there is significant "mass" of the density, e.g., 0 is the best if E[X2]=0.

What you will do is "condition on x2". Clearly, the event X2 = x2 is measure zero, so that will not work in a simulation exercise. Instead, you extract from the n X 2 matrix those values of x1 that correspond to values of X2 that lie in the interval (x2-eps, x2+eps), for some small value eps>0. Call this subset of values X1midX2.

Notice that, as eps gets smaller, you will need a larger value of n to ensure that you get some elements out of it. Choose eps and n to enhance accuracy, meaning, very large n, and small eps, so that you get, say, at least 100, better 1000, values of x1 conditional on X2 being in the specified interval. Notice simulating the bivariate Student t is fast, so taking n to be, say, 1e6, is feasible.
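The extraction step can be sketched in Python with a boolean mask. The function name is mine, and the tiny hand-made matrix only shows the mechanics; in practice X would be the n X 2 simulated matrix with n very large.

```python
import numpy as np

def condition_on_x2(X, x2, eps):
    """Return the x1 values whose paired X2 lies in the open interval (x2-eps, x2+eps)."""
    mask = np.abs(X[:, 1] - x2) < eps
    return X[mask, 0]

# Toy 4 X 2 matrix: rows 1 and 3 have X2 within 0.05 of zero.
X = np.array([[0.5, 0.01],
              [1.2, 0.30],
              [-0.7, -0.02],
              [2.0, 1.00]])
X1midX2 = condition_on_x2(X, x2=0.0, eps=0.05)
print(X1midX2, X1midX2.size)
```

Checking `X1midX2.size` after each call is a simple way to confirm that eps and n deliver enough conditional observations.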

##### II.4. 
The goal is to confirm that the paper is correct with its theoretical statement that the conditional is indeed Student t, but with different df value, as given in the paper. We do this by estimating the univariate IID Student t model on data set X1midX2. You can make a kernel density plot of the Student t conditional distribution based on the MLE from data set X1midX2, and overlay this with the theoretical Student t distribution from the paper. They should be very close, notably if you choose n large, and eps small. (For fun, you can use BFGS, and titer, and show both, along with, obviously, the theory result.)

Repeat this with other values of x2, say 4 values in total, e.g., 0, 1, 2, 3. Also, say, 3 different df values of the bivariate Student t model. I think the df value does not have to be larger than 1.0 (please check). If so, then you could thus try, say, df=1, df=3, df=5.

---
#### In Report:
This could result in a single-page graphic in your report, containing the 3*4=12 sub-plots. You figure out how best to illustrate your results. But please not 12 pages, each with one graphic...

### Part III: Estimating the bivariate Student t with MLE, and determining the distribution of a linear combination of the two jointly distributed random variables.

#### Motivation:
This has immediate usage for determining the distribution of a portfolio of financial assets. It is a starting point (or, as they say, is a "toy example"), because the multivariate Student t is in the elliptical class, meaning, the margins are symmetric, and the tail behavior of each asset (after accounting for the scale term) is the same: Remember, the multivariate Student t only has *one* shape parameter, the df. This model is also IID, which is not realistic for daily financial returns.

There is an amazing literature on estimating the parameters of the IID multivariate Student t, with numerous variations on the EM algorithm. See (just to get an idea; you will not implement it) https://core.ac.uk/download/pdf/82371806.pdf

---
#### Assignment:
##### III.1. 
Make a program to implement direct MLE for the bivariate case, noting there are only 2+3+1 = 6 parameters. That is, you have a function to return the log-likelihood of the bivariate IID Student t, and this gets passed to fminunc in Matlab (or fmincon if you wish to use it). HINT: I have it already, as Matlab code, in my time series book, page 527. So, this has become a "non-question" :-)

Note how we need to use constrained optimization, enforcing that the two scale terms in the dispersion matrix are positive, and the df is positive. You can just use my program, assuming it (still) works.

So, optional: Expand on mine and also ensure that the dispersion matrix is positive definite. This is harder, if you code it as a nonlinear constraint (using, say, eigenvalues), but in this case, with just one off-diagonal element, I think it should be very easy. You figure it out if you want, and perhaps use fmincon, which is a useful function to know about, e.g., in portfolio optimization when you cannot convert the problem to a quadratic programming problem and use quadprog.m, which is far faster.
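For Python users, here is a sketch of the bivariate log-likelihood and a direct MLE. It is not the book's program: the parameterization (log scales, tanh of an unconstrained value for the correlation, log df) is my assumption, chosen so that positivity and positive definiteness hold automatically and an unconstrained BFGS can be used instead of a constrained optimizer.

```python
import numpy as np
import scipy.optimize as optimize
from scipy.special import gammaln

def unpack(theta):
    """Map 6 unconstrained values to (mu, Sigma, df) with Sigma positive definite."""
    mu = theta[:2]
    s1, s2 = np.exp(theta[2]), np.exp(theta[3])   # positive scale terms
    rho = np.tanh(theta[4])                       # correlation strictly inside (-1, 1)
    df = np.exp(theta[5])                         # positive df
    Sigma = np.array([[s1 ** 2, rho * s1 * s2], [rho * s1 * s2, s2 ** 2]])
    return mu, Sigma, df

def mvt_negloglik(theta, X):
    """Negative log-likelihood of IID bivariate Student t observations (rows of X)."""
    mu, Sigma, df = unpack(theta)
    d = 2
    dev = X - mu
    maha = np.einsum("ij,ij->i", dev @ np.linalg.inv(Sigma), dev)
    _, logdet = np.linalg.slogdet(Sigma)
    const = (gammaln((df + d) / 2) - gammaln(df / 2)
             - (d / 2) * np.log(df * np.pi) - 0.5 * logdet)
    return -np.sum(const - (df + d) / 2 * np.log1p(maha / df))

def fit_mvt(X):
    """Direct MLE via BFGS from assumed starting values; returns (mu, Sigma, df)."""
    start = np.array([*np.median(X, axis=0), 0.0, 0.0, 0.0, np.log(5.0)])
    res = optimize.minimize(mvt_negloglik, start, args=(X,), method="BFGS")
    return unpack(res.x)

# Sanity check: at x = mu with identity dispersion the density is 1/(2*pi) for any df.
check = mvt_negloglik(np.array([0.0, 0.0, 0.0, 0.0, 0.0, np.log(3.0)]), np.zeros((1, 2)))
print(check, np.log(2 * np.pi))
```

With only one off-diagonal element, bounding the correlation is enough for positive definiteness, which is why no eigenvalue constraint appears here.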

Now simulate from a bivariate Student t (try a few different parameter constellations) and confirm the estimation algorithm works. Make nice, convincing, smart plots showing you get the correct parameters, such as a 6-element boxplot, with each of the 6 obviously corresponding to one of the 6 parameters. Be smart: For each of the 6 parameters, you boxplot the MLE estimates *minus the true parameter value*. This way, all 6 of the boxplots are nicely centered at zero. Remember, perfect labeling, font sizes, colors, everything. Go total Steve Jobs here.

##### III.2. 
Now imagine you have a set of bivariate financial asset returns data (e.g., daily percentage log or regular returns data on two stocks), and wish to fit the bivariate IID Student t model. For now, we will just use simulated bivariate Student t data. We want to know the distribution of a_1 X_1 + a_2 X_2, where a_1 and a_2 are, for us, between 0 and 1, and sum to one. This is the constraint for a long only, fully invested portfolio. So, as usual, you choose some set of parameters for the bivariate Student t, with a non-diagonal dispersion matrix, and then:

Simulate n replications, n large, and compute vector P = a_1 X_1 + a_2 X_2. Make a kernel density plot of this univariate object (the portfolio distribution), for some given weights a_1 and a_2 that you choose and fix.

Then use my equation C.27 to program the relevant characteristic function. Use the inversion theorem over a suitable grid of "x-values" to plot the density. Notice this is not the same as in your first take-home assignment: There, X_1 and X_2 were independent. As mentioned above, even with a diagonal dispersion matrix, X_1 and X_2 are not independent.
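The inversion step can be sketched in Python as below. I do not reproduce equation C.27 here. Instead, the sketch assumes two standard facts: a linear combination of a multivariate Student t is univariate Student t with the same df, location a'mu and scale sqrt(a'Sigma a); and the standard Student t characteristic function has a closed form in terms of the modified Bessel function `scipy.special.kv`. The function names and the finite upper integration limit are my choices; confirm the result against your own implementation of C.27.

```python
import numpy as np
from scipy import integrate, special, stats

def t_cf(t, df):
    """Characteristic function of a standard Student t (real-valued by symmetry)."""
    z = np.sqrt(df) * abs(t)
    if z == 0.0:
        return 1.0
    return special.kv(df / 2, z) * z ** (df / 2) / (special.gamma(df / 2) * 2 ** (df / 2 - 1))

def portfolio_pdf(x, a, mu, Sigma, df, upper=200.0):
    """Density of P = a'X at x by numerically inverting the characteristic function."""
    a = np.asarray(a, dtype=float)
    m = a @ np.asarray(mu, dtype=float)                     # location of P
    c = np.sqrt(a @ np.asarray(Sigma, dtype=float) @ a)     # scale of P
    # cf of P is exp(i*t*m) * t_cf(c*t); the real part of exp(-i*t*x) * cf is the integrand.
    integrand = lambda t: np.cos(t * (x - m)) * t_cf(c * t, df)
    val, _ = integrate.quad(integrand, 0.0, upper, limit=500)
    return val / np.pi

a, mu, Sigma, df = [0.4, 0.6], [0.1, 0.2], [[1.0, 0.5], [0.5, 2.0]], 4.0
f_inv = portfolio_pdf(0.3, a, mu, Sigma, df)
f_ref = stats.t.pdf(0.3, df, loc=0.16, scale=np.sqrt(1.12))   # closed-form reference
print(f_inv, f_ref)
```

Evaluating `portfolio_pdf` over a grid of x values gives the curve to overlay on the kernel density estimate of the simulated P.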

Overlay the true density from the inversion theorem, and the kernel density estimate from the simulation, and confirm they are very close. Make beautiful plots with appropriately thick lines, labeled x and y axes, a graphics legend, title, etc., with nice font sizes.

### Part IV: Non-elliptic distributions for modeling financial asset returns.

#### Motivation:
Asset returns (daily and higher-frequency) are not just non-Gaussian, they are non-elliptic. We need more sophisticated distributions, and also smart laws of motion for the changing scale and correlation terms (simple CCC or DCC GARCH are far from adequate). As always, I love to boast about my research: Just read the abstracts (and note the target journals) to get an idea of where these basic homework ideas can lead:

https://www.sciencedirect.com/science/article/abs/pii/S0304407619301563

https://www.sciencedirect.com/science/article/abs/pii/S0378426621000042

Here, we deal with just the distributional aspect, and stick to an IID structure for simplicity, this being the theme of this first course. We will meet two non-elliptic distributions in this assignment.

---
#### Preparation:
Read just the single page of my time series book, page 650. This is for a multivariate Laplace distribution. Notice, crucially, this sentence:

Let (Y ∣ G = g) ∼ N_d (𝝁, g𝚺) for 𝝁 ∈ ℝ^d and 𝚺 > 0, i.e., positive definite, and let G ∼ Gam (b, 1).

So, you see that it is trivial to simulate realizations from Y ∼ Lap(𝝁, 𝚺, b). We will be working only in the d=2 (bivariate) case, so it will be easy for you to make a 2 X 2 positive definite covariance matrix 𝚺.
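In Python, the simulation implied by that sentence might look as follows: draw G from a Gamma(b, 1), then a Gaussian vector with covariance G times Sigma. The function name is hypothetical, and the gamma is taken with unit scale as in the sentence quoted above.

```python
import numpy as np

def simulate_mv_laplace(n, mu, Sigma, b, rng):
    """Draw n vectors from Lap(mu, Sigma, b) via the Gaussian-gamma mixture."""
    mu = np.asarray(mu, dtype=float)
    L = np.linalg.cholesky(np.asarray(Sigma, dtype=float))
    G = rng.gamma(shape=b, scale=1.0, size=n)          # mixing variable G ~ Gam(b, 1)
    Z = rng.standard_normal((n, mu.size)) @ L.T        # rows ~ N(0, Sigma)
    return mu + np.sqrt(G)[:, None] * Z                # (Y | G = g) ~ N(mu, g * Sigma)

rng = np.random.default_rng(3)
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
Y = simulate_mv_laplace(200_000, [0.0, 0.0], Sigma, 2.0, rng)
print(np.cov(Y, rowvar=False))   # should be near E[G] * Sigma = 2 * Sigma
```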

Next, same book, see page 654, equation (14.41). This is a discrete mixture distribution. We saw this already in Chapter 5 of the Fundamental Statistical Inference book, in the univariate, Gaussian context.

We will only consider a mixture of two components, so k=2.

We saw above that simulating from a single multivariate Laplace is easy. To simulate a discrete mixture of them, this is, in turn, addressed in Fundamental Statistical Inference, the very short section 5.1.2, complete with Matlab codes, in the context of simulating from a discrete mixture of univariate Gaussian. You can trivially adapt them to the model at hand here.
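A Python sketch of the mixture simulation is given below, as an alternative to adapting the Matlab codes. It first draws a component label for each observation, then fills in the rows of each component with a Laplace draw. All names and the example parameter values are illustrative assumptions.

```python
import numpy as np

def simulate_mix_laplace(n, weights, mus, Sigmas, bs, rng):
    """Draw n vectors from a discrete mixture of multivariate Laplace components."""
    weights = np.asarray(weights, dtype=float)
    labels = rng.choice(len(weights), size=n, p=weights)   # component of each draw
    d = len(mus[0])
    Y = np.empty((n, d))
    for j in range(len(weights)):
        idx = np.flatnonzero(labels == j)
        L = np.linalg.cholesky(np.asarray(Sigmas[j], dtype=float))
        G = rng.gamma(shape=bs[j], scale=1.0, size=idx.size)
        Z = rng.standard_normal((idx.size, d)) @ L.T
        Y[idx] = np.asarray(mus[j], dtype=float) + np.sqrt(G)[:, None] * Z
    return Y, labels

rng = np.random.default_rng(11)
Y, labels = simulate_mix_laplace(
    50_000, weights=[0.7, 0.3],
    mus=[[0.0, 0.0], [-0.1, -0.1]],
    Sigmas=[np.eye(2), 10.0 * np.eye(2)],   # second component is far more dispersed
    bs=[10.0, 5.0], rng=rng)
print(Y.shape, np.mean(labels == 0))
```

Returning the labels alongside the data is optional, but it helps when checking an estimation routine later.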

continued:
Again, time-series book, please read section 12.2, but just pages 530-532, for the multivariate noncentral Student t distribution (let's call it NCT). We will not need to simulate from it, but it is easy if needed, being a continuous Gaussian mixture distribution like the Laplace.

Notice my approximation to the (log of the) density given there. It is extremely accurate, and insanely fast. It is even more accurate than my saddlepoint approximation to the NCT in the univariate case. So, you need to just copy-paste and use this. Evaluation of the exact density is ridiculously time consuming, and far too slow for actual applications.

It even turns out that there is, so far at least, no EM algorithm for the (notably multivariate) NCT, which is highly unfortunate. As such, the density approximation I provide is highly useful, and the only way I know to actually use this distribution for modeling large sets of financial asset returns. I told you, lots of tricks!


---
#### Assignment:
##### IV.1. 
Make a program to input all the relevant parameters for simulating the 2-component, bivariate discrete mixture of Laplace. Document it nicely. Choose reasonable values for parameters, such as might roughly correspond to financial data, e.g., b1=10, b2=5, and the Sigma2 matrix has much larger diagonal elements than Sigma1, this 2nd component of the mixture capturing the more extreme behavior of the returns. Simulate realizations, and produce one nice-looking 3D plot.

##### IV.2. 
Recall in question III.1 above, you got a free ride, since the exact codes are right in my book. That was not an accident. Now, you are to make something similar, but for the k=2, d=2 discrete mixture multivariate Laplace.

For d larger, you would need the EM algorithm I derived in my book for this distribution. For our case with k=d=2, there are (let's count) 4 location terms, Sigma_1 and Sigma_2 have a total of 6 terms, then there is b_1 and b_2, so a total of 12 parameters (you please check this), with obvious constraints on the Sigma elements, and the b_i. So, still very modest for brute force BFGS optimization, and the likelihood is easily obtained from time-series book equations (14.31) and (14.41).

Notice we again need the "modified Bessel function of the third kind", which I give in eq (14.30). To be sure that the function you use (I think besselk in Matlab) is correct, do what I wrote in a recent email: Compute the integral in (14.30), which is definitely what we need, and compare it to the besselk function output. You include the codes for this, and show equivalence.
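In Python, the analogous check could be sketched as follows, with `scipy.special.kv` playing the role of besselk. I do not have eq (14.30) in front of me here, so the integrand below is the standard integral representation, K_v(x) = (1/2) ∫_0^∞ u^(v-1) exp(-(x/2)(u + 1/u)) du for x > 0, which I am assuming is equivalent; compare it to the book's form before using it.

```python
import numpy as np
from scipy import integrate, special

def besselk_integral(v, x):
    """Modified Bessel function K_v(x), x > 0, by direct numerical integration."""
    integrand = lambda u: 0.5 * u ** (v - 1.0) * np.exp(-0.5 * x * (u + 1.0 / u))
    val, _ = integrate.quad(integrand, 0.0, np.inf)
    return val

v, x = 1.5, 2.0
by_integral = besselk_integral(v, x)
by_library = special.kv(v, x)
print(by_integral, by_library, abs(by_integral - by_library))
```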

Make a program to compute the MLE of this distribution (again, just d=k=2).

Choose a large sample size, say T=10,000, and generate *one* data set, and estimate it, and report, in a nice table, the true parameters in one row, and the MLE values in the other. Based on this sample size, the MLE values should be close to the true parameter values. Notice we have what is called a "label switching problem" here, in that mixture components 1 and 2 could be switched (and also of course the two weights in the mixture), and the model does not change. Be sure to keep that in mind when reporting your results.

Using my generic MLE codes from Chapter 4 with BFGS, you can also produce approximate standard errors for the parameters. Report these also. You do *not* need to generate more accurate confidence intervals via a bootstrap. There are two reasons I do not request this:

a) you have enough to do already in this assignment,

b) *who cares* about them, or arguably even the (freely obtained) "standard errors" from BFGS? The point of the model is to fit real data, and do portfolio optimization. The only relevant model assessment is OOS performance of portfolio allocation (or, risk assessment perhaps). The actual parameters of the model are, for us, essentially "nuisances that need to be calibrated".

##### IV.3. 
Make a program to compute the MLE of the bivariate (so, dimension 2) NCT. Recall the "preparation" notes above. It inputs obviously a T X 2 data set, and outputs the MLE parameter vector.

Now make a program that inputs a T X 2 data set, and computes and outputs the MLE of the bivariate 2-component mixture Laplace, and that of the bivariate NCT. This is obviously trivial --- it just combines your previous two programs. What is new is this: You also output the AIC and BIC values corresponding to each of the two models, so that we can, via these in-sample model-comparison methods, ascertain which model is preferred by the data.
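The two criteria are simple functions of the maximized log-likelihood, the number of free parameters, and the sample size T. A small Python sketch, with a hypothetical function name and made-up input numbers, using the usual definitions AIC = 2k - 2 loglik and BIC = k log(T) - 2 loglik (smaller is better):

```python
import numpy as np

def aic_bic(loglik, n_params, n_obs):
    """Return (AIC, BIC) from the maximized log-likelihood."""
    aic = 2.0 * n_params - 2.0 * loglik
    bic = n_params * np.log(n_obs) - 2.0 * loglik
    return aic, bic

# Made-up numbers, only to show the call.
aic, bic = aic_bic(loglik=-1234.5, n_params=6, n_obs=1000)
print(aic, bic)
```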

##### IV.4. 
I attach a data set (a matrix) of (percentage log) daily returns from the stocks on the DJIA, with d=25 stocks. There is an Excel version, and a Matlab version, the former if some of you wish to use, say, Python. (See note below about Python.)

As you all know, the DJIA index has 30 stocks. I have a data set from Jan-1990 to Oct-2021 (yielding about 8000 time points). At that starting date, only 25 stocks have valid data, and hence I only send the returns for those 25 stocks. That is fine for our purpose here. (This data set is also not corrected for survivorship bias. Ask me in class if you are curious what that is, and/or web-search it. We do not care about this bias for our purpose in this assignment.)

For this exercise, we do not need time-stamps, or names of stocks. The reason is that we are in pure statistics learning mode here. In reality, you would need that further stock level info (and account for survivorship bias in the index), and also, as I belabored above, augment in-sample measures (AIC and BIC, etc.) with out-of-sample performance measures, e.g., several of the various risk-return measures such as Sharpe, max-drawdown, Sortino, Rachev ratio, etc., along with accounting for transaction costs, doing various robustness checks, etc.

Set up a (nicely documented) program that inputs this data set, and then:

---
You can of course use more than 50! Please try if feasible. And, you do not need to worry if you draw the same pair. With unlimited computing resources, you would choose, say, 1000 pairs:

For rep = 1:50
- randomly draw 2 out of the 25 stocks.

- fit the Mix-Lap and the NCT; store their AIC and BIC values
 
end
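The random draw of 2 out of 25 stocks in the loop above might be done in Python as follows, assuming NumPy's generator and sampling without replacement within each pair. The function name is hypothetical; the returned values are zero-based column indices into the returns matrix.

```python
import numpy as np

def draw_pairs(n_stocks=25, n_reps=50, seed=0):
    """Return an n_reps X 2 array of distinct column indices, one pair per replication."""
    rng = np.random.default_rng(seed)
    return np.array([rng.choice(n_stocks, size=2, replace=False)
                     for _ in range(n_reps)])

pairs = draw_pairs()
print(pairs[:5])
```

The same pair may appear in more than one replication, which is fine here; only the two stocks within a pair must differ.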

Output of the function:

1. The 50 X 4 table of AIC and BIC values, for each of the two distributional models.

2. A *smart plot* (everything should be totally clear by just looking at it) comparing the AIC values. And then, similar, for BIC.

---
#### In Report:
Write in your report the conclusions, e.g., "According to AIC (BIC), the mix-lap is preferred in 80% (74%) of the cases." Be sure to discuss in your report your algorithm for randomly choosing 2 stocks. Trivial for you data-science majors, and actually all of you, but it is even little things like this that "non quant finance people" have no clue how to do. But they do know the CAPM model, have passed the first CFA exam, and know that a matrix is not just a movie.