# Introduction to Quantitative Finance

Copyright (c) 2019 Python Charmers Pty Ltd, Australia, <https://pythoncharmers.com>. All rights reserved.

<img src="img/python_charmers_logo.png" width="300" alt="Python Charmers Logo">

Published under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. See `LICENSE.md` for details.

Sponsored by Tibra Global Services, <https://tibra.com>

<img src="img/tibra_logo.png" width="300" alt="Tibra Logo">


## Module 1.1: Distributions and Random Processes

### 1.1.4: Normality Tests

If you are analysing data on the assumption that it is normally distributed, you should test that assumption first. Properties of normal distributions do not necessarily apply to data that has a different underlying distribution. As an example, an ANOVA test assumes normality in your data, and the results of an ANOVA are not valid if the data comes from some other distribution.

There are a number of normality tests that help us assess whether the data plausibly comes from a normal distribution.

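One method of testing for normality is to compute the skew and kurtosis of the data. Any normal distribution has a skew of 0 and a kurtosis of 3. Note that `scipy.stats.kurtosis` reports *excess* kurtosis (kurtosis minus 3) by default, so it returns a value near 0 for normally distributed data.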

In [1]:
%run setup.ipy

In [2]:
# Read the Apple (AAPL) stock data from a pickle file into a DataFrame
aapl = pd.read_pickle("data/AAPL.pkl")

In [3]:
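# Compute the day-to-day change in the adjusted closing price
aapl['Gain'] = aapl['Adj Close'].diff()
# Drop rows containing missing values (the first row has no previous day)
aapl.dropna(inplace=True)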

In [4]:
# Compute the skewness of the AAPL gains, a measure of the asymmetry of their distribution
stats.skew(aapl['Gain'])

0.49966243998511045

In [5]:
# Compute the kurtosis of the gains
# scipy reports excess kurtosis by default, which is 0 for a normal distribution
stats.kurtosis(aapl['Gain'])

20.448973653158085
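
The two kurtosis conventions are easy to confuse, so here is a minimal sketch on synthetic data. The seed and sample size are arbitrary assumptions and are unrelated to the AAPL analysis. It relies on the `fisher` argument of `stats.kurtosis`: `fisher=True` (the default) returns excess kurtosis, while `fisher=False` returns the Pearson kurtosis, which is 3 for a normal distribution.

```python
import numpy as np
from scipy import stats

# Synthetic standard normal sample (arbitrary seed and size, for illustration only)
rng = np.random.default_rng(42)
sample = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Fisher definition (scipy's default): excess kurtosis, close to 0 for normal data
excess = stats.kurtosis(sample)
# Pearson definition: close to 3 for normal data
pearson = stats.kurtosis(sample, fisher=False)
# Skewness: close to 0 for normal data
sample_skew = stats.skew(sample)

print("skew={:.3f}, excess kurtosis={:.3f}, Pearson kurtosis={:.3f}".format(
    sample_skew, excess, pearson))
```

The two kurtosis values always differ by exactly 3, so the value of about 20.4 reported for the AAPL gains is already measured relative to a normal distribution.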

Based on these values, the daily AAPL price changes do not appear to be normally distributed. Let's have a look at the histogram again:


In [6]:
# Create a bar chart showing the distribution of the stock gains
# X axis: the Gain values, grouped into at most 100 bins
# Y axis: the count in each bin
alt.Chart(aapl).mark_bar().encode(
        alt.X("Gain", bin=alt.Bin(maxbins=100)),
        y='count()',
    )

A high kurtosis is typical of distributions with a very "sharp" peak and heavy tails, such as this one. The skew is not that high, but it is positive, indicating that the distribution has a longer tail on its right-hand side.

More objective tests are available in the `scipy.stats` package. For instance, the Shapiro-Wilk test is commonly used and works well for small to medium datasets of up to a few thousand data points.


In [7]:
# Use the Shapiro-Wilk test to check whether the AAPL gains are normally distributed
statistic, p = stats.shapiro(aapl['Gain'])

In [8]:
# Inspect the p-value; comparing it with the chosen significance level (alpha) tells us whether to reject the null hypothesis
p

1.9754031136943266e-56

In [9]:
# Decide whether the data is consistent with a normal distribution
# If p > 0.05, we fail to reject normality
# If p <= 0.05, we reject normality
if p > 0.05:
    print("The data looks like it was drawn from a normal distribution")
else:
    print("The data does not look like it was drawn from a normal distribution")

The data does not look like it was drawn from a normal distribution


### What is a p-value?

The p-value above is a commonly used quantity for summarising the outcome of a statistical test. It is the probability of observing data at least as extreme as the data we actually have, assuming that the hypothesis being tested (the null hypothesis, described below) is true.

As it is a probability, it has a value between 0 and 1. Values near 0 indicate that the observed data would be very unlikely if the hypothesis were true, while larger values indicate that the data is consistent with the hypothesis. Note that the p-value is not the probability that the hypothesis itself is true. Often, we apply a threshold: if our p-value is greater than that threshold, we continue as if the hypothesis were true, and if it is smaller, we reject the hypothesis.

It is very common to use a threshold of 0.05 when performing a test. That is, we reject the hypothesis only if data this extreme would occur less than 5% of the time when the hypothesis holds. While this is an adequate rule of thumb, it is not a one-size-fits-all solution to the problem of choosing a p-value threshold.

In classical statistics, the p-value is normally used together with a null hypothesis and an alternative hypothesis. We will delve deeper into these later, but since they are used above: the null hypothesis is our "nothing is surprising" hypothesis, and the alternative is "there is something interesting here". For the Shapiro-Wilk test used above, the hypotheses are:

* $H_0$ (the Null hypothesis): The data was drawn from a normal distribution
* $H_A$ (the Alternative hypothesis): The data was not drawn from a normal distribution

Here we have mutually exclusive hypotheses. The p-value is computed under the assumption that the Null hypothesis is true, so it is not the probability that either hypothesis is true, and $1 - p$ is not the probability that the Alternative is true. Statisticians are a pessimistic bunch, and so require a very high amount of evidence before rejecting the Null hypothesis. Therefore, to reject the Null hypothesis and indicate that something else is going on, we require the p-value to be less than 0.05, i.e. data this extreme would arise less than 5% of the time if the Null hypothesis were true.

This might seem like a high standard to meet, but humans often see patterns in data that are not there. We use statistics to test these patterns and ensure we don't fall afoul of this overconfident pattern matching.

Before you decide to run a new statistical test, you should first check what its p-value would tell you, that is, what its null hypothesis is. The usual language is "reject the null hypothesis" or "fail to reject the null hypothesis". This will tell you how to use the test.
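
To make the meaning of a p-value concrete, the sketch below estimates one by simulation instead of relying on the value returned by `stats.shapiro`. It is an illustration under stated assumptions: the exponential sample, the seed, and the number of simulations are all arbitrary choices, and this is not how SciPy computes its p-value internally. The idea is to simulate the Shapiro-Wilk statistic W many times for data that really is normal, then count how often the simulated W is at least as extreme (here, at least as small) as the observed one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed
n = 100

# A clearly non-normal sample (exponential), standing in for "observed" data
observed = rng.exponential(scale=1.0, size=n)
w_observed, p_scipy = stats.shapiro(observed)

# Simulate the W statistic under the null hypothesis (data truly normal)
n_sims = 2000
w_null = np.array([stats.shapiro(rng.normal(size=n))[0] for _ in range(n_sims)])

# Small W signals departure from normality, so "at least as extreme"
# means a simulated W that is less than or equal to the observed W
p_mc = np.mean(w_null <= w_observed)

print("observed W = {:.4f}".format(w_observed))
print("Monte Carlo p-value = {:.4f}".format(p_mc))
print("scipy p-value = {:.3g}".format(p_scipy))
```

A small Monte Carlo p-value says that normal data almost never produces a W this low, which is the evidence we use to reject the null hypothesis.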


We could already see from the kurtosis that the dataset above wasn't normal. Let's look at a different set of data.


In [10]:
# Create a NumPy array containing 100 height measurements
heights = np.array([
    205.61624376, 155.80577135, 202.09636984, 159.19312848,
    160.0263383 , 147.44200373, 160.96891569, 160.76304892,
    167.59165377, 164.31571823, 151.11269914, 176.43856129,
    176.88435091, 138.04177187, 183.87507305, 162.81488426,
    167.96767641, 144.68437342, 180.88771461, 179.18997091,
    189.81672505, 163.68662119, 175.70135072, 167.32793289,
    163.72509862, 207.93257342, 177.41722601, 167.28154916,
    170.26294662, 187.01142671, 178.3108478 , 168.8711774 ,
    202.77222671, 138.55043572, 187.10284379, 155.13494037,
    175.24219374, 188.54739561, 191.42024196, 174.34537673,
    158.36285104, 183.17014557, 166.36310929, 185.3415384 ,
    163.87673308, 173.70401469, 168.78499868, 167.39762991,
    166.89193943, 191.04035344, 148.02108024, 140.82772936,
    168.85378921, 142.13536543, 189.77084606, 173.7849811 ,
    157.61303804, 171.62493617, 173.30529631, 162.92083214,
    169.52974326, 142.01039665, 176.01691215, 170.32439763,
    172.64616031, 158.35076247, 185.96332979, 176.6176222 ,
    204.68516079, 161.43591954, 172.42384543, 179.36900257,
    170.01353653, 194.40269002, 139.96802012, 156.47281846,
    210.21895193, 153.30508193, 157.10282665, 200.07040619,
    174.69616438, 168.97403285, 188.9396949 , 156.19358617,
    179.56494356, 175.04014032, 164.1384659 , 167.90219562,
    184.80752625, 143.56580744, 169.80537836, 186.5894398 ,
    166.39251657, 165.65510886, 195.49137372, 152.21650272,
    163.14001055, 170.27382512, 147.63901378, 190.32910286])

In [11]:
# Use the Shapiro-Wilk test to check whether the heights are normally distributed
statistic, p = stats.shapiro(heights)

In [12]:
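# If the p-value is greater than 0.05 (the significance level)
if p > 0.05:
    # Report that the data is consistent with a normal distribution
    print("The data looks like it was drawn from a normal distribution")
    # Print the p-value to 3 decimal places
    print("p={:.3f}".format(p))
else:
    # Report that the data is not consistent with a normal distribution
    print("The data does not look like it was drawn from a normal distribution")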

The data looks like it was drawn from a normal distribution
p=0.278


#### Exercise

Two other commonly used tests for normality are available in `scipy.stats`. They are `stats.normaltest` and `stats.kstest`. Review the help and references for these functions, and run them on the `heights` data. What are the strengths and weaknesses of each test?

In [13]:
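# Run two further normality tests on the heights data
statistic_chi, p_c = stats.normaltest(heights)  # D'Agostino-Pearson test (based on skew and kurtosis)
# Note: cdf='norm' compares against the standard normal N(0, 1), not a normal fitted to the data
statistic_k, p_k = stats.kstest(heights, cdf='norm')  # Kolmogorov-Smirnov test
print(str(p_c), str(p_k))

# Print a note on the difference between the KS and Shapiro-Wilk tests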
print("From Stack overflow, the Kolmogorov-Smirnov is for a completely specified distribution, while the Shapiro-Wilk is for normality, with unspecified mean and variance.")

0.6994130645220737 0.0
From Stack overflow, the Kolmogorov-Smirnov is for a completely specified distribution, while the Shapiro-Wilk is for normality, with unspecified mean and variance.


*For solutions, see `solutions/scipy_normal_tests.py`*
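
The p-value of `0.0` from `stats.kstest` above does not mean the heights are far from normal. With `cdf='norm'` and no further arguments, the test compares the data with the *standard* normal distribution (mean 0, standard deviation 1), which heights in centimetres clearly do not follow. The sketch below illustrates the difference on synthetic data; the sample, its mean of 170 and standard deviation of 17 are hypothetical values chosen for illustration. It passes the estimated mean and standard deviation through the `args` parameter of `stats.kstest`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed
# Hypothetical heights-like data: normal, but not *standard* normal
sample = rng.normal(loc=170.0, scale=17.0, size=100)

# Compare against N(0, 1): rejected, because the location and scale are wrong
stat_std, p_std = stats.kstest(sample, 'norm')

# Compare against a normal with the sample's own mean and standard deviation
mean, std = sample.mean(), sample.std(ddof=1)
stat_fit, p_fit = stats.kstest(sample, 'norm', args=(mean, std))

print("against N(0, 1):         p = {:.3g}".format(p_std))
print("against fitted normal:   p = {:.3g}".format(p_fit))
```

One caveat: estimating the mean and standard deviation from the same data that is being tested makes the plain Kolmogorov-Smirnov p-value too large. The Lilliefors variant of the test corrects for this.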

### Statsmodels

We will now perform a normality test using the `statsmodels` package. This package provides higher-level statistical tools than the `scipy` module we have been using. We will be using `statsmodels` for much of the ordinary least squares computation in future modules.


In [14]:
# Import the statsmodels API module, used for statistical modelling and analysis
import statsmodels.api as sm

In [15]:
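# Use statsmodels' kstest_normal to run a Kolmogorov-Smirnov normality test on heights
# (mean and variance are estimated from the data, i.e. this is the Lilliefors test)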
statistic, p_value = sm.stats.diagnostic.kstest_normal(heights)

In [16]:
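# Use the p-value to decide whether the data is consistent with a normal distribution
if p_value > 0.05:  # p-value is greater than the 0.05 significance level
    print("The data looks like it was drawn from a normal distribution")  # consistent with normality
    print("p={:.3f}".format(p_value))  # print the p-value to 3 decimal places
else:  # p-value is less than or equal to 0.05
    print("The data does not look like it was drawn from a normal distribution")  # normality rejected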

The data looks like it was drawn from a normal distribution
p=0.395


#### Exercise

Review the documentation for `statsmodels` at https://www.statsmodels.org and run the Jarque-Bera test for normality on this data.

In [17]:
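# Import the statistical tools module from statsmodels
from statsmodels.stats import stattools

# Run the Jarque-Bera normality test on the heights data
# jbstat: the JB test statistic
# pvalue: the p-value of the test
# skew: the estimated skewness
# kurtosis: the estimated kurtosis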
jbstat, pvalue, skew, kurtosis = stattools.jarque_bera(heights)
print(pvalue)

0.6714923453511482


*For solutions, see `solutions/jarque_bera.py`*
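
The Jarque-Bera test formalises the skew-and-kurtosis check from the start of this section. Its statistic is $JB = \frac{n}{6}\left(S^2 + \frac{K^2}{4}\right)$, where $S$ is the sample skewness and $K$ is the sample excess kurtosis, and under the null hypothesis it approximately follows a chi-squared distribution with 2 degrees of freedom. The sketch below computes it by hand and compares it with `scipy.stats.jarque_bera`. The synthetic sample and seed are assumptions for illustration, and SciPy's version is used here rather than the statsmodels one from the exercise.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)  # arbitrary seed
# Hypothetical heights-like sample, for illustration only
sample = rng.normal(loc=170.0, scale=17.0, size=100)

n = sample.size
s = stats.skew(sample)       # sample skewness
k = stats.kurtosis(sample)   # sample excess kurtosis (0 for a normal distribution)

# Jarque-Bera statistic and its p-value from the chi-squared(2) distribution
jb_manual = n / 6.0 * (s**2 + k**2 / 4.0)
p_manual = stats.chi2.sf(jb_manual, df=2)

# The same test as provided by SciPy
jb_scipy, p_scipy = stats.jarque_bera(sample)

print("manual: JB={:.4f}, p={:.4f}".format(jb_manual, p_manual))
print("scipy:  JB={:.4f}, p={:.4f}".format(jb_scipy, p_scipy))
```

Because the chi-squared approximation is only accurate for large samples, the Jarque-Bera p-value should be treated with caution on small datasets.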

### Handling conflicts

There are many different normality tests. If you get the same result for all the tests (i.e. multiple tests suggest normal data), then you can be reasonably sure the data does come from a normal distribution.

If you get conflicting results, the picture is not quite so clear. In a conflicting case, it is unlikely that the results will be wildly different. Instead, you are likely to get a few p-values slightly "above the line" and a few slightly "below the line". Depending on the use case, you can interpret a single "is normal" result as being good enough. Much of the later analysis you can do will be fine for "normal-like" data, rather than strictly normal data.

If you do have a very sensitive application that requires a high degree of confidence in your normality test, research the assumptions behind the different normality tests further and see which are most applicable to your application. The SciPy and Statsmodels documentation contain references for each of the normality tests.
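
One way to spot conflicts is to run several tests on the same data and view the p-values side by side. The sketch below does this for three tests available in `scipy.stats`. The synthetic sample, the seed, and the choice of tests are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)  # arbitrary seed
# Hypothetical heights-like sample, for illustration only
sample = rng.normal(loc=170.0, scale=17.0, size=100)

alpha = 0.05

# Each test returns a (statistic, p-value) pair; keep only the p-value
_, p_sw = stats.shapiro(sample)
_, p_dp = stats.normaltest(sample)
_, p_jb = stats.jarque_bera(sample)

p_values = {
    "Shapiro-Wilk": p_sw,
    "D'Agostino-Pearson": p_dp,
    "Jarque-Bera": p_jb,
}

for name, p in p_values.items():
    verdict = "fail to reject" if p > alpha else "reject"
    print("{:20s} p={:.3f} -> {} normality".format(name, p, verdict))

# Count how many of the tests reject normality at this alpha
n_reject = sum(p <= alpha for p in p_values.values())
print("{} of {} tests reject normality".format(n_reject, len(p_values)))
```

If the verdicts disagree, the p-values themselves show how close each test is to the threshold, which is more informative than the pass or fail result alone.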

A major factor is the number of samples in your dataset. Some tests work better with more samples, and some work better with fewer. We will investigate this in the last exercise for this module.

#### Exercise

We are going to investigate the relationship between sample size and the results of a normality test. We want to estimate how often a normality test rejects the normality hypothesis for a dataset that *actually is generated from a normal distribution*, as the sample size increases.

Write a script that:

1. Creates a normal distribution
2. Randomly samples N data points from that distribution
3. Checks for normality against four different normality tests
4. Repeats steps 1-3 a large number of times, and with varying N
5. Plots how often each test fails for a given sample size.

Below is a snippet of code that draws 20 samples, runs one normality test on each, and counts how many of them the test judges to be normal. For an alpha value of 0.05, you would expect about 1 of the 20 tests to fail on average.
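
The "about 1 in 20" figure follows from treating each test on truly normal data as an independent trial that fails with probability alpha. The number of failures in 20 tests is then binomially distributed. The sketch below works through the arithmetic with `scipy.stats.binom`; the independence of the tests is an assumption of this sketch.

```python
from scipy import stats

alpha = 0.05    # probability that a single test rejects truly normal data
n_tests = 20    # number of independent tests

# Number of failures among n_tests independent tests
failures = stats.binom(n=n_tests, p=alpha)

expected_failures = failures.mean()   # n_tests * alpha
p_no_failures = failures.pmf(0)       # (1 - alpha) ** n_tests
p_two_or_more = failures.sf(1)        # P(failures >= 2)

print("expected failures: {:.2f}".format(expected_failures))
print("P(no failures):    {:.3f}".format(p_no_failures))
print("P(two or more):    {:.3f}".format(p_two_or_more))
```

The expected count is exactly 1, but a run with no failures at all occurs roughly 36% of the time, so a single batch of 20 tests can easily show 0, 1 or 2 failures.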


In [18]:
# Define a function that estimates how often a normality test rejects normal data
def normality_script(sample_size, test_type, repetitions=100):
    # Create a standard normal distribution object
    distribution = stats.norm()

    # Initialise the counters
    passed = 0
    failed = 0
    # Repeat the test the requested number of times
    for i in range(repetitions):
        # Draw a fresh sample from the normal distribution on every iteration
        data = distribution.rvs(sample_size)

        # Run the normality test selected by test_type
        if test_type == "sw":
            # Shapiro-Wilk test
            statistic, p = stats.shapiro(data)
        elif test_type == "cs":
            # D'Agostino-Pearson test (its statistic is chi-squared distributed)
            statistic, p = stats.normaltest(data)
        elif test_type == "ks":
            # Kolmogorov-Smirnov test against the standard normal
            statistic, p = stats.kstest(data, cdf='norm')
        elif test_type == "jb":
            # Jarque-Bera test
            statistic, p, skew, kurtosis = stattools.jarque_bera(data)
        else:
            raise ValueError("Unknown test type: {}".format(test_type))

        # Decide whether the test passed, using alpha = 0.05
        if p > 0.05:
            passed += 1
        else:
            failed += 1

    # Return the fraction of repetitions in which normality was rejected
    return failed / (passed + failed)

In [19]:
# Import the required libraries
import pandas as pd
from statsmodels.stats import stattools
from scipy import stats
import altair as alt
import numpy as np

# Define the sample sizes to test
sample_sizes = [10, 30, 50, 100, 1000, 5000]
# sample_sizes = np.linspace(10, max_sample_size, 20, dtype=int)

# Define the normality tests to compare
# sw: Shapiro-Wilk test
# cs: D'Agostino-Pearson test (stats.normaltest)
# ks: Kolmogorov-Smirnov test
# jb: Jarque-Bera test
test_types = ["sw", "cs", "ks", "jb"]
data = []

# Run every test at every sample size
for size in sample_sizes:
    for test in test_types:
        # Estimate the failure rate for this sample size and test
        p_fail = normality_script(size, test)
        # Build a row holding the test type, sample size and failure rate
        row = [test, size, p_fail]
        data.append(row)

# Convert the results to a DataFrame
df = pd.DataFrame(data, columns=['Test', 'Sample', 'Failed'])

# Draw a line chart with Altair
# The x axis shows the sample size
# The y axis shows the failure rate
# Each test is drawn in a different colour
alt.Chart(df).mark_line().encode(
    x='Sample',
    y='Failed',
    color='Test'
)


In [20]:
# Set the sample size to 30
sample_size = 30
# Initialise the pass and fail counters
passed = 0
failed = 0
# Run 20 tests
for i in range(20):
    # Create a standard normal distribution object
    distribution = stats.norm()
    # Draw sample_size random values from the distribution
    data = distribution.rvs(sample_size)
    # Test the sample with the D'Agostino-Pearson normality test
    stat, p = stats.normaltest(data)
    # If p > 0.05, we fail to reject normality
    if p > 0.05:
        passed += 1
    else:
        failed += 1
# Print the number of tests that passed and failed
print("{} passed and {} failed".format(passed, failed))

19 passed and 1 failed


*For solutions see `solutions/many_normal_tests.py`*