# Introduction to Quantitative Finance

Copyright (c) 2019 Python Charmers Pty Ltd, Australia, <https://pythoncharmers.com>. All rights reserved.

<img src="img/python_charmers_logo.png" width="300" alt="Python Charmers Logo">

Published under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. See `LICENSE.md` for details.

Sponsored by Tibra Global Services, <https://tibra.com>

<img src="img/tibra_logo.png" width="300" alt="Tibra Logo">


## Module 1.3: Ordinary Least Squares

### 1.3.2 Regression Tests

In this module we will look further into Multivariate OLS and examine some of the requirements of the algorithm, as well as some of the details of the regression results we saw in the last module.

When performing OLS for Linear Regression Models, there are a few assumptions that need to be met. The key ones are:

The first assumption is the key one: the relationship between $X$ and $Y$ can, in fact, be described using the model $Y = X\beta + u$. The relationship may *not* follow this model exactly, but it may be close enough that the difference doesn't matter.

The second assumption is that the expected value of $u$ is zero. Individual values in the vector $u$ may fluctuate, but the overall expected value is 0. More formally, we assume that $E(u|X) = 0$, that is, the expected value of $u$ given $X$ is zero. If the errors had a non-zero mean, we could shift the bias (intercept) term to absorb it. OLS learns the bias term, so the fitted errors end up with a zero mean anyway!
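
The role of the bias term can be checked numerically. The following is a minimal sketch on synthetic data, assuming an error term whose mean is deliberately set to 3.0; the variable names and the data are illustrative and not part of the course dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
# Error term with a non-zero mean of 3.0 (an assumption made for this demo)
u = rng.normal(loc=3.0, scale=1.0, size=n)
y = 2.0 + 1.5 * x + u

# Design matrix with a column of ones for the bias (intercept) term
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

print("estimated intercept:", beta_hat[0])      # absorbs the mean of u
print("mean of residuals:  ", residuals.mean())  # numerically zero
```

The estimated intercept lands near 2.0 + 3.0 rather than 2.0, while the slope is unaffected: the mean of the error has been folded into the bias term.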

The third assumption is that the error term $u$ and the data $X$ are not correlated. In other words, $u$ is unexplained error that cannot be explained by the data. Put more formally, there is no heteroskedasticity (the variance of $u$ does not change with $X$) and no autocorrelation (the errors are not correlated with one another). This is a stronger assumption than the second, but along the same lines. We will cover these terms more formally in a later module.

The fourth assumption is that $X$ has a finite variance. This is sometimes (slightly incorrectly) referred to as $X$ being non-stochastic. We will investigate how variance plays into the model in several later modules.

The fifth assumption is that there is no exact linear relationship between the measurements (variables, columns, features) in $X$. This is known as $X$ having **full column rank**.
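
Column rank can be inspected directly with NumPy. This is a small sketch on made-up data: `X_ok` and `X_bad` are hypothetical design matrices, and the third column of `X_bad` is built as an exact linear combination of the other two.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=100)
b = rng.normal(size=100)

# Two unrelated columns plus a constant: full column rank
X_ok = np.column_stack([np.ones(100), a, b])
# Third column equals 2 * a + 1 * constant: an exact linear relationship
X_bad = np.column_stack([np.ones(100), a, 2.0 * a + 1.0])

rank_ok = np.linalg.matrix_rank(X_ok)
rank_bad = np.linalg.matrix_rank(X_bad)
print("rank", rank_ok, "of", X_ok.shape[1], "columns")
print("rank", rank_bad, "of", X_bad.shape[1], "columns")
```

When the rank is lower than the number of columns, $X^TX$ cannot be inverted and the OLS solution is not unique.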

If any of these assumptions is untrue, the resulting model does not necessarily have the properties we will discuss in the rest of this module, and the model itself might be biased or inaccurate. However, it may still be *useful* in a practical sense. For instance, if two variables are slightly linearly related, we are straining the last assumption, but in practice the model is generally still useful. If they are heavily related, however, the resulting model will be unstable.
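
One way to see the difference between "slightly" and "heavily" related variables is the condition number of $X$, which grows as the columns approach an exact linear relationship. The sketch below uses synthetic data; the helper `condition_number` and the mixing strengths 0.2 and 0.999 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
a = rng.normal(size=n)
noise = rng.normal(size=n)

def condition_number(strength):
    """Condition number of [1, a, b], where b mixes a with independent noise."""
    b = strength * a + (1.0 - strength) * noise
    X = np.column_stack([np.ones(n), a, b])
    return np.linalg.cond(X)

cond_slight = condition_number(0.2)    # b only slightly related to a
cond_heavy = condition_number(0.999)   # b is almost a copy of a
print(cond_slight, cond_heavy)
```

A large condition number means small changes in the data can produce large changes in the fitted coefficients, which is the instability described above.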


<div class="alert alert-warning">
    Like most models and concepts, there is always some debate about the definitions and assumptions behind them. Further, some people use the same term to describe different concepts. When discussing an algorithm, it is best practice to note any key assumptions you make, and any deviation from the "norm". If you aren't sure, provide a reference.
</div>


In [1]:
%run setup.ipy

Let's load some data for a regression problem and have a look at the results. In this dataset, we are trying to predict house prices in Boston, Massachusetts, from other characteristics of the area. Prices are in thousands of dollars, but are from 1978, so are quite low!


In [4]:
# Load the Boston house prices dataset from scikit-learn
# scikit-learn is a machine learning library that ships with some example datasets
from sklearn.datasets import load_boston

ImportError: 
`load_boston` has been removed from scikit-learn since version 1.2.

The Boston housing prices dataset has an ethical problem: as
investigated in [1], the authors of this dataset engineered a
non-invertible variable "B" assuming that racial self-segregation had a
positive impact on house prices [2]. Furthermore the goal of the
research that led to the creation of this dataset was to study the
impact of air quality but it did not give adequate demonstration of the
validity of this assumption.

The scikit-learn maintainers therefore strongly discourage the use of
this dataset unless the purpose of the code is to study and educate
about ethical issues in data science and machine learning.

In this special case, you can fetch the dataset from the original
source::

    import pandas as pd
    import numpy as np

    data_url = "http://lib.stat.cmu.edu/datasets/boston"
    raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
    data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
    target = raw_df.values[1::2, 2]

Alternative datasets include the California housing dataset and the
Ames housing dataset. You can load the datasets as follows::

    from sklearn.datasets import fetch_california_housing
    housing = fetch_california_housing()

for the California housing dataset and::

    from sklearn.datasets import fetch_openml
    housing = fetch_openml(name="house_prices", as_frame=True)

for the Ames housing dataset.

[1] M Carlisle.
"Racist data destruction?"
<https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8>

[2] Harrison Jr, David, and Daniel L. Rubinfeld.
"Hedonic housing prices and the demand for clean air."
Journal of environmental economics and management 5.1 (1978): 81-102.
<https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air>


In [None]:
# Load the Boston house prices dataset
boston_data = load_boston()

In [None]:
# Check the type of the Boston house prices dataset object
type(boston_data)

sklearn.utils.Bunch

In [None]:
# Print the description of the Boston house prices dataset
print(boston_data.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

In [None]:
# Helper function to convert a scikit-learn dataset to a pandas DataFrame
# Source: https://stackoverflow.com/questions/38105539/how-to-convert-a-scikit-learn-dataset-to-a-pandas-dataset/46379878#46379878
def sklearn_to_df(sklearn_dataset):
    # Create the DataFrame, using the dataset's feature names as column names
    df = pd.DataFrame(sklearn_dataset.data, columns=sklearn_dataset.feature_names)
    # Add the target variable as a new column
    df['target'] = pd.Series(sklearn_dataset.target)
    return df

In [48]:
# URL of the Boston house prices dataset
data_url = "http://lib.stat.cmu.edu/datasets/boston"
# Read the data from the URL: whitespace-separated, skip the first 22 rows, no header row
boston = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
#boston = boston.rename(columns={x : i for x, i in enumerate(['CRIM', 'ZN' , 'INDUS' , 'CHAS' , 'NOX' , 'RM' , 'AGE' , 'DIS' , 'RAD' , 'TAX' , 'PTRATIO' , 'B' , 'LSTAT', 'target'])})

In [49]:
# Show the first 5 rows of the Boston house prices data
boston.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3
1,396.9,4.98,24.0,,,,,,,,
2,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8
3,396.9,9.14,21.6,,,,,,,,
4,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8


In [46]:
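# Import the statsmodels formula API
import statsmodels.formula.api as smf

# Build a multivariate linear regression model using OLS (ordinary least squares)
# The dependent variable is target (house price); the independent variables are all 13 features
# Fit the model with fit(); the formula API adds the constant term automatically
est = smf.ols(formula='target ~ CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO + B + LSTAT',
              data=boston).fit()  # Does the constant for us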

PatsyError: Error evaluating factor: NameError: name 'LSTAT' is not defined
    target ~ CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO + B + LSTAT
                                                                                         ^^^^^

In [47]:
# Show the detailed regression results, including coefficients, standard errors, t-values, p-values and R-squared
est.summary()

NameError: name 'est' is not defined
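
The `PatsyError` above appears because the raw file stores each record across two rows and the resulting DataFrame has integer column labels rather than names such as `LSTAT`. Below is a minimal sketch of one way to rebuild a named DataFrame, following the stacking approach quoted in the scikit-learn error message. The helper `stack_boston_rows` and the constant `COLUMNS` are illustrative names, not part of the original notebook, and the sample reuses only the first four raw rows shown in the output above.

```python
import numpy as np
import pandas as pd

COLUMNS = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
           'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'target']

def stack_boston_rows(raw_df):
    """Join each pair of raw rows into one 14-column record (hypothetical helper)."""
    values = raw_df.values
    # Even rows hold the first 11 fields; odd rows hold B, LSTAT and the target
    data = np.hstack([values[::2, :], values[1::2, :3]])
    return pd.DataFrame(data, columns=COLUMNS)

# The first four raw rows from the head() output, with NaN padding on the odd rows
nan = np.nan
raw_sample = pd.DataFrame([
    [0.00632, 18.0, 2.31, 0.0, 0.538, 6.575, 65.2, 4.09, 1.0, 296.0, 15.3],
    [396.9, 4.98, 24.0] + [nan] * 8,
    [0.02731, 0.0, 7.07, 0.0, 0.469, 6.421, 78.9, 4.9671, 2.0, 242.0, 17.8],
    [396.9, 9.14, 21.6] + [nan] * 8,
])
sample = stack_boston_rows(raw_sample)
print(sample[['CRIM', 'LSTAT', 'target']])
```

Applied to the full raw frame, for example `boston = stack_boston_rows(boston)`, this should give a DataFrame whose column names match the ones used in the regression formula.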

In the above table, there is a coef column, which gives the values for $\beta$ in our model for each independent variable.

If the coefficient is negative, there is an inverse relationship between the independent variable and the dependent one.

It is important to note that this is not a direct relationship: each coefficient is estimated in the presence of all the other variables, so retraining the model with just one variable will likely change this coefficient:


In [13]:
# Import the statsmodels formula API
import statsmodels.formula.api as smf

# Build a simple linear regression model that uses CRIM (crime rate) as the only independent variable to predict the house price (target)
est_simple = smf.ols(formula='target ~ CRIM',
                     data=boston).fit()  # Does the constant for us

In [14]:
# Get the estimated parameters of the simple model (the intercept and the coefficient for CRIM)
est_simple.params

Intercept    24.033106
CRIM         -0.415190
dtype: float64
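
The same effect can be reproduced on synthetic data where the true coefficients are known. This is a sketch under stated assumptions: `x1` and `x2` are made-up correlated variables, the true coefficients are 2.0 and -3.0, and the helper `ols` is illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
x1 = rng.normal(size=n)
# x2 is correlated with x1 (an assumption built into this synthetic data)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.5, size=n)

def ols(columns, y):
    """Least squares fit with a constant column prepended."""
    X = np.column_stack([np.ones(len(y))] + columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

beta_full = ols([x1, x2], y)
beta_simple = ols([x1], y)
print("x1 coefficient, both variables:", beta_full[1])
print("x1 coefficient, x1 only:       ", beta_simple[1])
```

With both variables present the coefficient for `x1` is close to its true value of 2.0. With `x1` alone, it also has to carry the effect of the omitted `x2`, so here it even changes sign.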

In addition to the coefficient itself, we are given the standard error, the p-value from a t-test of whether the coefficient differs from zero (the coefficient is conventionally called significant if this value is less than 0.05), and the lower and upper bounds for the 95% confidence interval - where we can say with 95% confidence that the true value lies within those bounds.
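
To make these columns less of a black box, here is a rough sketch of how they can be computed by hand on synthetic data. It assumes a large-sample normal approximation for the p-values and the interval, whereas the summary table uses the t distribution, so the numbers would differ slightly from statsmodels output. All names and data are illustrative.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(4)
n = 400
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
k = X.shape[1]
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat

# Unbiased estimate of the error variance, then the coefficient standard errors
sigma2 = residuals @ residuals / (n - k)
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

t_stats = beta_hat / std_err
# Two-sided p-values using the normal approximation (an assumption of this sketch)
p_values = [2.0 * (1.0 - NormalDist().cdf(abs(t))) for t in t_stats]

z = NormalDist().inv_cdf(0.975)  # roughly 1.96
ci_lower = beta_hat - z * std_err
ci_upper = beta_hat + z * std_err
print(beta_hat, std_err, p_values, ci_lower, ci_upper)
```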


A key reason for this is related to the second warning, which indicates that there is strong multicollinearity. We will review this term in the next module and fix the problem it is causing over the next few modules. For now, it indicates that the independent variables are correlated to a high degree, which breaks an assumption of OLS. In short, it means the independent variables are each predicting the same components of the output, and the coefficients are effectively arbitrary.

As an example, if we have two variables $a$ and $b$ that are correlated, the coefficient value for $a$ and $b$ in a trained model is effectively shared between them, and whatever value actually appears in the OLS model is just one of many possibilities.
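
The extreme case, where $b$ is an exact copy of $a$, makes this easy to verify. In the sketch below (made-up data, illustrative names), three different ways of splitting a total effect of 3.0 between the two variables give identical predictions.

```python
import numpy as np

rng = np.random.default_rng(5)
a = rng.normal(size=50)
b = a.copy()                      # b is perfectly correlated with a
X = np.column_stack([a, b])

# Three different ways of sharing a total effect of 3.0 between a and b
candidates = [np.array([3.0, 0.0]),
              np.array([0.0, 3.0]),
              np.array([1.5, 1.5])]
predictions = [X @ beta for beta in candidates]

same = all(np.allclose(predictions[0], p) for p in predictions[1:])
print(same)
```

Since the data cannot distinguish between these candidates, the particular split reported by OLS carries little meaning on its own.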

For the test statistics, good values (for various definitions of "good") allow us to say with high confidence that the model accurately predicts the data. Bad values indicate that the model should not be used. We will now review a few key values from this table as a means of validating our model.


### The $R^2$ statistic

The key statistic to review, and the "one value" that you are likely to report in your executive summary, is the $R^2$ statistic. It measures how much of the variance in the predicted variable ($Y$) is explained by your model ($X\beta$), compared to the error of the model ($u$). A high value (near 1) indicates that the model explains almost all of the variance in the variable being predicted. A low value (near 0) indicates that the model explains almost none of it; a value of exactly 0 is achieved if the model always predicts the mean of $Y$. The score can be negative as well, which happens when the model predicts worse than simply predicting the mean of $Y$ every time.

In the above results, the $R^2$ value is around 0.741, indicating that around 74% of the variance in the predicted variable $Y$ can be explained by the model $X\beta$. That said, our model has a few problems which we will address soon.

To obtain the $R^2$ value, store the regression results object obtained above and extract it:


In [13]:
# Get the model's R-squared value (coefficient of determination), which measures how well the model fits the data
est.rsquared

0.7406077428649428
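
For reference, $R^2$ can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses a tiny made-up vector and a hypothetical helper `r_squared` to show the three regimes described above: a perfect fit, always predicting the mean, and predictions worse than the mean.

```python
import numpy as np

def r_squared(y, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
r2_perfect = r_squared(y, y)                        # predictions match exactly
r2_mean = r_squared(y, np.full_like(y, y.mean()))   # always predict the mean
r2_worse = r_squared(y, y[::-1])                    # predictions in reverse order
print(r2_perfect, r2_mean, r2_worse)
```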

#### Exercises

1. Review the documentation at the following link to see what other values can be obtained from a trained estimator: http://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.RegressionResults.html#statsmodels.regression.linear_model.RegressionResults
2. What is the difference between `est.rsquared` and `est.rsquared_adj`? When should you use one over the other?

There are quite a few terms on the documentation page we haven't seen yet - many will be reviewed in later modules in this course.


The adjusted R-squared is a modified version of R-squared that accounts for predictors that are not significant in a regression model. In other words, the adjusted R-squared shows whether adding additional predictors improve a regression model or not.

 - from https://corporatefinanceinstitute.com/resources/knowledge/other/adjusted-r-squared/
 
In general I would prefer the adjusted R-squared, unless the model is a very simple multiple regression with a small number of uncorrelated predictors.
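
As a rough illustration, the usual textbook formula for the adjusted value is $1 - (1 - R^2)\frac{n-1}{n-p-1}$, where $n$ is the number of samples and $p$ the number of predictors. The helper `adjusted_r_squared` below is a hypothetical name; the inputs are the $R^2$ value reported above and the 506 instances and 13 attributes listed in the dataset description.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n samples and p predictors (excluding the constant)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

r2 = 0.7406077428649428   # est.rsquared from the output above
adj = adjusted_r_squared(r2, n=506, p=13)
print(round(adj, 3))
```

The adjusted value is always a little lower than the plain $R^2$, and the gap widens as predictors are added without a matching improvement in fit.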


### The $F$ statistic


The $F$ statistic is another measure of how significant the fit is. It divides the mean square explained by the model by the mean square of the error term in the model. The probability value reported with it is the probability of obtaining an $F$ statistic at least this large *if all the coefficients (other than the intercept) were zero*. In our model, this probability is very low (6.72e-135), indicating there is almost no chance that such an $F$ statistic would be obtained from such a "zero" model.


In [15]:
# Get the value of the F statistic, which tests the significance of the regression model as a whole
est.fvalue

108.07666617432622

In [16]:
# Get the p-value of the F test, used to assess the overall significance of the regression model
est.f_pvalue

6.722174750114365e-135

To put this formally, the F statistic is a test against the null hypothesis:

$H_0: \beta_i = 0 \forall i$

The alternative hypothesis is that *at least* one of the values in $\beta$ is not 0.

The F statistic can be computed using the following terms:

$ F = \frac{ESS}{RSS}$

Where $ESS$ is the explained variance of the model and $RSS$ is the unexplained variance. Given that the explained variance of the model is due to the component $X\beta$ and the unexplained component is due to $u$, we can derive the equations as below:

$ ESS = \frac{1}{k-1}\sum{(\hat{Y_i} - \bar{Y})^2}$

Where $\hat{Y_i}$ is the *ith* predicted value, $\bar{Y}$ is the overall mean of $Y$, and $k$ is the number of coefficients in $\beta$ (the variables plus the constant). In other words, it is the total deviation from the mean that the model explains, averaged over its $k-1$ degrees of freedom.

For the variance left unexplained in the residuals, we get:

$ RSS = \frac{1}{n-k}\sum{u_i^2}$

Where $u$ is the error term in our linear regression model and $n$ is the number of samples. There are a few ways to alter these equations to make them easier to compute, all based on performing algebra with the OLS estimator equations defined in earlier modules.
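
The ratio can be computed by hand to check the definitions. The sketch below uses synthetic data and takes $k$ to be the number of coefficients including the constant, with $k-1$ and $n-k$ as the two divisors. It also cross-checks the result against the equivalent expression in terms of $R^2$. Names and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 300
X_vars = rng.normal(size=(n, 3))
y = 0.5 + X_vars @ np.array([1.0, -2.0, 0.0]) + rng.normal(size=n)

X = np.column_stack([np.ones(n), X_vars])
k = X.shape[1]                      # coefficients, including the constant
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
u_hat = y - y_hat

ess = np.sum((y_hat - y.mean()) ** 2) / (k - 1)   # explained mean square
rss = np.sum(u_hat ** 2) / (n - k)                # residual mean square
f_stat = ess / rss

# Cross-check against the equivalent expression in terms of R^2
r2 = 1.0 - np.sum(u_hat ** 2) / np.sum((y - y.mean()) ** 2)
f_from_r2 = (r2 / (k - 1)) / ((1.0 - r2) / (n - k))
print(f_stat, f_from_r2)
```

The two routes agree because, with a constant in the model, the total sum of squares splits exactly into an explained part and a residual part.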


### Likelihood Function, Akaike information criterion (AIC) and Bayesian information criterion (BIC)

These three measures are related, and represent the plausibility of the observed data given the set of parameters in the model.

In all three cases, we use them as relative values. That is, we use these values to compare two different models fitted to the same data. For the AIC and BIC we choose the model with the lowest score, while for the likelihood itself higher is better (equivalently, a lower negative log likelihood). Use whichever single statistic you are most concerned with.

For instance, if model 1 has a BIC of 3085 and model 2 has a BIC of 4000, we choose model 1.

The key function here is the likelihood function, which is used to compute the AIC and BIC. The likelihood function $\mathcal{L}(\beta \mid x)$ measures how likely it is that the data was generated from a model with the given parameters. From an information theory perspective, we aim to maximise the likelihood function. From a computing perspective, it is often easier to compute the *log likelihood*, and to *minimise the negative log likelihood*. A key reason for this is that computers handle adding numbers better than multiplying very small numbers, and taking logs turns a product into a sum:

$\log(xy) = \log(x) + \log(y)$

When dealing with probabilities, many probability values are very small, and multiplying small numbers near zero is hard for computers. Often, the result "underflows": the computer treats a very small number as exactly zero, and then any product from that point on is zero. Instead, we compute the log of each number and add the logs together - no underflow!
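
A short sketch of the underflow problem, using 2000 made-up probabilities of 0.001 each. The true product is far smaller than the smallest number a 64-bit float can represent, so the running product collapses to zero, while the sum of logs stays finite.

```python
import math

probabilities = [1e-3] * 2000

product = 1.0
for p in probabilities:
    product *= p          # underflows to 0.0 long before the loop ends

log_sum = sum(math.log(p) for p in probabilities)

print("product of probabilities:", product)
print("sum of log probabilities:", log_sum)
```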

Once the likelihood function has been maximised (or, equivalently, the negative log likelihood minimised), we call the maximum value it takes $\hat{L}$. From here, the AIC is defined as:

$ AIC = 2k - 2\ln(\hat{L})$

where $k$ is the number of parameters estimated by the model.

The BIC is defined similarly:

$ BIC = \ln(n)k - 2\ln({\hat L})$

Typically the BIC is preferred, as it is more stable in most circumstances. However, for the BIC to be valid, the number of samples $n$ must be much larger than the number of parameters $k$.
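
To tie the definitions together, here is a sketch that computes both criteria for two competing models on synthetic data. It assumes normally distributed errors, so the maximised log likelihood has a closed form, and it counts only the regression coefficients in $k$; other conventions also count the error variance. The helper names are illustrative.

```python
import numpy as np

def gaussian_log_likelihood(residuals):
    """Maximised log likelihood of a linear model with normal errors."""
    n = len(residuals)
    sigma2 = np.sum(residuals ** 2) / n      # maximum likelihood variance estimate
    return -0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)

def aic(log_l, k):
    return 2.0 * k - 2.0 * log_l

def bic(log_l, k, n):
    return np.log(n) * k - 2.0 * log_l

rng = np.random.default_rng(7)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

def fit_residuals(X):
    """Residuals of a least squares fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

res_mean_only = fit_residuals(np.ones((n, 1)))                 # constant only
res_with_x = fit_residuals(np.column_stack([np.ones(n), x]))   # constant and x

bic_mean_only = bic(gaussian_log_likelihood(res_mean_only), k=1, n=n)
bic_with_x = bic(gaussian_log_likelihood(res_with_x), k=2, n=n)
print(bic_mean_only, bic_with_x)
```

The model that includes `x` pays a penalty for its extra parameter, but its much higher likelihood more than makes up for it, so it has the lower BIC and would be chosen.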
