# Introduction to Quantitative Finance

Copyright (c) 2019 Python Charmers Pty Ltd, Australia, <https://pythoncharmers.com>. All rights reserved.

<img src="img/python_charmers_logo.png" width="300" alt="Python Charmers Logo">

Published under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. See `LICENSE.md` for details.

Sponsored by Tibra Global Services, <https://tibra.com>

<img src="img/tibra_logo.png" width="300" alt="Tibra Logo">


## Module 1.1: Distributions and Random Processes

### 1.1.3: Moments

Moments describe distributions. We'll focus on the normal (and normal-ish) distributions for now, but will look at other distributions later.


A normal distribution is fully described by its first two moments, the mean and the variance. Reviewing the help for `stats.norm`, we can see that it takes only two parameters: `loc` (the mean) and `scale` (the standard deviation, i.e. the square root of the variance). See the docstring of the function.

In [1]:
# Run the initialisation script setup.ipy
%run setup.ipy

<div class="alert alert-success">
    Note: it's worth opening up setup.ipy and seeing what's in there. This file will be run at the start of most of our notebooks.
</div>

In [2]:
# View the help documentation for scipy.stats.norm

stats.norm?

Signature:       stats.norm(*args, **kwds)
Type:            norm_gen
String form:     <scipy.stats._continuous_distns.norm_gen object at 0x7f9f8fa35280>
File:            ~/miniconda3/envs/quant_finance/lib/python3.12/site-packages/scipy/stats/_continuous_distns.py
Docstring:      
A normal continuous random variable.

The location (``loc``) keyword specifies the mean.
The scale (``scale``) keyword specifies the standard deviation.

As an instance of the `rv_continuous` class, `norm` object inherits from it
a collection of generic methods (see below for the full list),
and completes them with details specific for this particular distribution.

Methods
-------
rvs(loc=0, scale=1, size=1, random_state=None)
    Random variates.
pdf(x, loc=0, scale=1)
    Probability density function.
logpdf(x, loc=0, scale=1)
   

As noted in that description, the first moment, the mean, is referred to as the location. It specifies where the normal distribution is centred.
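As a quick check of how `loc` and `scale` map onto the first two moments, the sketch below freezes a normal distribution and queries its mean, standard deviation and variance. The parameter values 3 and 2 are arbitrary choices made only for illustration.

```python
from scipy import stats

# Freeze a normal distribution with loc (mean) 3 and scale (standard deviation) 2.
# These values are arbitrary and chosen only for illustration.
dist = stats.norm(loc=3, scale=2)

mean = dist.mean()      # first moment: the location
std = dist.std()        # the scale parameter
variance = dist.var()   # second central moment: the scale squared

print(mean, std, variance)
```

Note that the variance is the square of the `scale` argument, not the argument itself.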



In [3]:
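# Define a function that plots a histogram of samples from a normal distribution
def plot_histogram_normal(mean, standard_deviation, color):
    # Create a normal distribution object with the given mean and standard deviation
    distribution = stats.norm(mean, standard_deviation)
    # Draw 10,000 random samples from the distribution and wrap them in a DataFrame
    normal_values = pd.DataFrame({"value": distribution.rvs(10000)})

    # Build the histogram with Altair
    chart = alt.Chart(normal_values).mark_bar().encode(
        # X axis: the sampled values, grouped into at most 100 bins
        alt.X("value", bin=alt.Bin(maxbins=100)),
        # Y axis: the count in each bin
        y='count()',
        # Colour of the bars
        color=alt.value(color)
    )
    return chart

# Create two histograms with different parameters
# First histogram: mean 0, standard deviation 1, red
chart_1 = plot_histogram_normal(0, 1, "red")
# Second histogram: mean 3, standard deviation 5, blue
chart_2 = plot_histogram_normal(3, 5, "blue")
# Overlay the two histograms
chart_1 + chart_2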

The mean is the expected value of the distribution. If we draw *n* values at random from this distribution, their average (the sample mean) will be close to the mean of the distribution, and it gets closer as *n* grows. This might seem circular, but note that the two values are computed in different ways:



In [4]:
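# Set the actual mean to 57
actual_mean = 57
# Set the standard deviation to a random number between 0 and 10
standard_deviation = random.random() * 10
# Set the number of trials to 100,000
N_TRIALS = 100000

# Create a normal distribution with mean actual_mean and standard deviation standard_deviation
distribution = stats.norm(actual_mean, standard_deviation)
# Draw N_TRIALS random samples from the distribution
normal_values = distribution.rvs(N_TRIALS)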

In [5]:
# Compute the error between the sample mean and the actual mean
error = np.mean(normal_values) - actual_mean
# Print the actual mean and the computed sample mean
print("The actual mean was {actual_mean}, while the computed mean was {computed_mean:.3f}".format(
    actual_mean=actual_mean, computed_mean=np.mean(normal_values)))
# Print the error
print("This gives an error of {error:.3f}".format(error=error))

The actual mean was 57, while the computed mean was 57.005
This gives an error of 0.005
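To see how the gap between the sample mean and the distribution mean depends on the number of draws, here is a minimal sketch that repeats the experiment for several sample sizes. The fixed standard deviation of 5, the seed and the sample sizes are assumptions made for illustration; they are not part of the cells above.

```python
import numpy as np
from scipy import stats

# Assumed values for illustration: mean 57 (as above) and a fixed standard deviation of 5.
actual_mean = 57
distribution = stats.norm(actual_mean, 5)

# A seeded generator makes the run repeatable.
rng = np.random.default_rng(42)

errors = {}
for n in (100, 10_000, 1_000_000):
    sample = distribution.rvs(n, random_state=rng)
    errors[n] = abs(sample.mean() - actual_mean)
    print(f"n = {n:>9,}: |sample mean - actual mean| = {errors[n]:.4f}")
```

The error typically shrinks in proportion to $1/\sqrt{n}$, although any single run can deviate from that trend.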


Note that the mean is not the median. For a normal distribution the two are theoretically identical, and in a sample drawn from one they are usually close. The median is not a "moment".

In [6]:
# Compute the median of the data sampled from the normal distribution
np.median(normal_values)

The second moment of a normal distribution is the variance, which measures how spread out the distribution is (its square root is the scale parameter of `stats.norm`). It is the expected value of the squared difference between a random value and the mean:


$V=\frac{1}{n}\sum_{i=1}^{n}(X_i-\mu)^2$
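The formula above can be written out directly in NumPy. The sketch below uses a small made-up data set (assumed to be measurements in metres) and compares the hand-written calculation with `np.var`, which uses the same $1/n$ normalisation by default.

```python
import numpy as np

# Small made-up data set, in metres.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()
# Mean of the squared deviations from the mean (the 1/n formula above).
manual_variance = np.sum((x - mu) ** 2) / len(x)

# np.var uses the same 1/n normalisation by default (ddof=0).
print(manual_variance, np.var(x))
```

Here the mean is 5 m and the variance is 4 m², which already shows the change of units discussed next.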

Note that the square in the result makes the unit squared as well. For instance, if our measurements were in metres $m$, the variance would be in metres squared, $m^2$. As a result, it's not directly comparable to the initial value. For instance:

In [7]:
# Compute the variance of the samples drawn from the normal distribution
V = np.var(normal_values)
# Display the variance
V

2.6513375771328085

We cannot directly compare this to our original units, i.e. we cannot say the variance is "about 5% of the mean".
Such a statement is meaningless as the units are different.
For that reason, we usually use the square root of the variance, known as the standard deviation, which is in the same units as X and is therefore directly comparable:


$V=\sigma^2=\frac{1}{n}\sum_{i=1}^{n}(X_i-\mu)^2$, so that $\sigma=\sqrt{V}$

It is this "standard deviation" that is the second input into our `stats.norm` function:

In [8]:
# Create two normal-distribution histograms and overlay them
chart_3 = plot_histogram_normal(0, 1, "green")  # mean 0, standard deviation 1, green
chart_4 = plot_histogram_normal(6, 2, "orange")  # mean 6, standard deviation 2, orange
chart_3 + chart_4  # overlay the two histograms in one chart

The larger standard deviation makes the distribution more spread out, but it is the same shape, simply "scaled".
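The "same shape, simply scaled" statement can be checked numerically: shifting and scaling the standard normal density reproduces the density of any other normal distribution. The sketch below uses the mean 6 and standard deviation 2 from `chart_4`; the grid of `x` values is an arbitrary choice.

```python
import numpy as np
from scipy import stats

mean, standard_deviation = 6, 2   # the parameters used for chart_4 above
x = np.linspace(-2, 14, 9)        # arbitrary evaluation points

# Density of the normal distribution with mean 6 and standard deviation 2, evaluated directly...
direct = stats.norm(mean, standard_deviation).pdf(x)

# ...and rebuilt from the standard normal: convert x to standard units,
# evaluate the standard normal density, then divide by the scale.
rescaled = stats.norm.pdf((x - mean) / standard_deviation) / standard_deviation

print(np.allclose(direct, rescaled))
```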

### Further Moments

There are two further moments in common use. The third standardised moment is called the skewness (or skew).
It can be visualised as "pulling" the distribution to the left (negative skew) or right (positive skew).

A normal distribution is symmetrical, and has a skew of 0. This is why it does not appear in the equation or function calls to generate the normal distribution.

The fourth standardised moment is the kurtosis, a measure used more often with financial data than with many other datasets. A higher value indicates "fatter tails" than a normal distribution. The kurtosis of a normal distribution is always 3, and we use this as our baseline when interpreting the kurtosis of other distributions. Note that `stats.kurtosis` reports excess kurtosis (kurtosis minus 3) by default, so its baseline is 0.
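SciPy can report these higher moments analytically for its built-in distributions. The sketch below asks for the skewness and kurtosis of the normal distribution and, as an illustrative comparison that is not part of the original material, of a Student's t distribution with 5 degrees of freedom. SciPy returns excess kurtosis (kurtosis minus 3), so 3 is added back to match the baseline used in the text.

```python
from scipy import stats

# moments="sk" requests skewness and excess kurtosis (kurtosis minus 3).
normal_skew, normal_excess_kurtosis = stats.norm.stats(moments="sk")

# Student's t with 5 degrees of freedom: symmetric but fat-tailed,
# chosen here only as an illustrative comparison.
t_skew, t_excess_kurtosis = stats.t(df=5).stats(moments="sk")

print("normal:", float(normal_skew), float(normal_excess_kurtosis) + 3)
print("t(5):  ", float(t_skew), float(t_excess_kurtosis) + 3)
```

Both distributions are symmetric (skew 0), but the t distribution has a much larger kurtosis, reflecting its fatter tails.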

In [9]:
# Use skewnorm to create a skew-normal distribution object with shape parameter a = 4

stats.skewnorm(4)

<scipy.stats._distn_infrastructure.rv_continuous_frozen at 0x7f9f8c23f8f0>

In [10]:
# View the help documentation for scipy.stats.skewnorm, which generates skew-normal distributions

stats.skewnorm?

Signature:       stats.skewnorm(*args, **kwds)
Type:            skewnorm_gen
String form:     <scipy.stats._continuous_distns.skewnorm_gen object at 0x7f9f8dfb9400>
File:            ~/miniconda3/envs/quant_finance/lib/python3.12/site-packages/scipy/stats/_continuous_distns.py
Docstring:      
A skew-normal random variable.

As an instance of the `rv_continuous` class, `skewnorm` object inherits from it
a collection of generic methods (see below for the full list),
and completes them with details specific for this particular distribution.

Methods
-------
rvs(a, loc=0, scale=1, size=1, random_state=None)
    Random variates.
pdf(x, a, loc=0, scale=1)
    Probability density function.
logpdf(x, a, loc=0, scale=1)
    Log of the probability density function.
cdf(x, a, loc=0, scale=1)
    Cumulative distribution f

In [11]:
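# Define a function that plots a histogram of samples from a skew-normal distribution
def plot_histogram_normal_skewed(mean, standard_deviation, skew, color):
    # Create the skew-normal distribution from the shape (skew), location and scale parameters.
    # Note: loc and scale equal the mean and standard deviation only when skew is 0.
    distribution = stats.skewnorm(skew, loc=mean, scale=standard_deviation)
    # Draw 10,000 random samples from the distribution and wrap them in a DataFrame
    normal_values = pd.DataFrame({"value": distribution.rvs(10000)})

    # Build the histogram with Altair
    chart = alt.Chart(normal_values).mark_bar().encode(
        # X axis: the sampled values, grouped into at most 100 bins
        alt.X("value", bin=alt.Bin(maxbins=100)),
        # Y axis: the count in each bin
        y='count()',
        # Colour of the bars
        color=alt.value(color)
    )
    return chart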

In [12]:
# Call plot_histogram_normal_skewed to plot a skew-normal histogram
# Arguments:
# - location (mean) = 0
# - scale (standard_deviation) = 1
# - shape (skew) = 2, i.e. positive (right) skew
# - colour = blue
plot_histogram_normal_skewed(0, 1, 2, "blue")
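One subtlety with `stats.skewnorm` is that `loc` and `scale` are no longer the mean and standard deviation once the shape parameter is non-zero. The sketch below, which assumes the same arguments as the call above, compares SciPy's reported moments with the standard closed-form mean of a skew-normal distribution.

```python
import numpy as np
from scipy import stats

a = 2  # shape parameter, the "skew" argument used above
mean, variance, skewness = stats.skewnorm(a, loc=0, scale=1).stats(moments="mvs")

# Closed-form mean of the skew-normal for loc=0, scale=1.
delta = a / np.sqrt(1 + a**2)
expected_mean = delta * np.sqrt(2 / np.pi)

# A negative shape parameter mirrors the distribution and flips the sign of the skewness.
mirrored_skewness = stats.skewnorm(-a).stats(moments="s")

print(float(mean), float(variance), float(skewness), float(mirrored_skewness))
```

With a shape of 2 the mean moves to the right of `loc` and the variance drops below `scale` squared, so the histogram above is not centred on 0.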

To see kurtosis in action, let's look at some data. We will load the AAPL stock price from a pickle file:



In [13]:
# Read the Apple (AAPL) stock data from a pickle file
aapl = pd.read_pickle("data/AAPL.pkl")

In [14]:
# Show the last 5 rows of the Apple stock data
aapl.tail()

Date,Open,High,Low,Close,Volume,Adj Close
2012-12-12,547.77,548.0,536.27,539.0,17398000,539.0
2012-12-13,531.15,537.64,525.8,529.69,22330700,529.69
2012-12-14,514.75,518.13,505.58,509.79,36056400,509.79
2012-12-17,508.93,520.0,501.23,518.83,27057400,518.83
2012-12-18,525.0,534.9,520.25,533.9,22319000,533.9


#### Exercises

1. Compute the increase in price for each day (Close - Open)
2. Plot a histogram of these increases
3. Investigate the `stats.skew` and `stats.kurtosis` functions to compute the third and fourth moment of the dataset.

*For solutions, see `solutions/moments.py`*

#### Extended exercise

Quandl has a Python module for extracting datasets. The documentation is available at https://www.quandl.com/tools/python

Install this module, and review the documentation to obtain stock prices for the following four tech giants:
* IBM
* Google
* Apple (more up-to-date than our dataset)
* Amazon

Compute the skew and kurtosis of each stock, and compare the results. Looking at the histograms of the stock prices, the skew and the kurtosis, what does this tell you about the usefulness of these moments?

Note: Extended exercises are more open-ended than normal exercises, and may take significantly longer to complete. They also tend to be harder than other exercises. 

In [15]:
# Exercise 1
aapl['Gain'] = aapl['Close'] - aapl['Open']

# Exercise 2: an enhanced histogram
chart = alt.Chart(aapl).mark_bar(
    opacity=0.5,  # set the opacity
    color='blue'  # set the colour
).encode(
    alt.X("Gain", 
          bin=alt.Bin(maxbins=50),
          title="Price change (USD)"  # X axis title
    ),
    alt.Y('count()',
          title="Frequency"  # Y axis title
    )
).properties(
    title="Distribution of daily price changes",  # chart title
    width=600,  # chart width
    height=400  # chart height
)

chart.display()

# Exercise 3
skew = stats.skew(aapl['Gain'])
print("The skew is {}".format(skew))
kurtosis = stats.kurtosis(aapl['Gain'])
print("The kurtosis is {}".format(kurtosis))

The skew is -0.3589561655618472
The kurtosis is 12.862787458978701
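When reading the value above, it helps to know that `stats.kurtosis` uses the Fisher (excess) definition by default, so the normal baseline is 0 rather than 3. The sketch below uses simulated normal data (an assumption, standing in for a real series) to show both conventions side by side.

```python
import numpy as np
from scipy import stats

# Simulated data standing in for a real series: 100,000 draws from a standard normal.
rng = np.random.default_rng(0)
sample = rng.normal(size=100_000)

excess = stats.kurtosis(sample)                 # Fisher definition (default): normal baseline 0
pearson = stats.kurtosis(sample, fisher=False)  # Pearson definition: normal baseline 3

print(f"excess kurtosis:  {excess:.3f}")
print(f"Pearson kurtosis: {pearson:.3f}")
```

Under this convention, the excess kurtosis of about 12.86 for the AAPL gains corresponds to a Pearson kurtosis of about 15.86, far above the normal baseline of 3.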


In [16]:
# Compute the daily price change
aapl['Price_Change'] = aapl['Close'] - aapl['Open']
aapl.head()

# Create a histogram of the price changes
# Draw red bars with Altair
# bin=alt.Bin(maxbins=100) groups the data into at most 100 bins
chart = alt.Chart(aapl).mark_bar().encode(
        alt.X("Price_Change", bin=alt.Bin(maxbins=100)),
        y='count()',
        color=alt.value('red'))
chart.display()

# Print the skewness and kurtosis of the price changes
# Skewness measures the asymmetry of the distribution; kurtosis measures the weight of its tails
print("Skew: " + str(stats.skew(aapl['Price_Change'])))
print("Kurtosis: " + str(stats.kurtosis(aapl['Price_Change'])))

Skew: -0.3589561655618472
Kurtosis: 12.862787458978701



### Z-scores

A "z-score" is a common way to normalise data. It removes the original scale of the data and instead expresses each value as the number of standard deviations it lies above or below the mean. It is a transformation of the data from one scale to another, using the mean and standard deviation:

In [17]:
# Create a NumPy array containing 7 values
original_data = np.array([10, 20, 5, 105, 30, 17, 19], dtype=np.float32)
# Compute the mean of the array
m = np.mean(original_data)
# Compute the standard deviation of the array
s = np.std(original_data)

The transformation is to subtract the mean, and divide by the standard deviation:

In [18]:
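# Compute the z-scores: subtract the mean and divide by the standard deviation, giving the number of standard deviations each point lies from the mean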
zscores = (original_data - m) / s

In [19]:
# Display the standardised data (z-scores), which measure how far each point lies from the mean
zscores

array([-0.612737  , -0.29735765, -0.77042663,  2.3833666 ,  0.01802167,
       -0.39197144, -0.3288956 ], dtype=float32)
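A useful sanity check on any z-score calculation is that the transformed data has mean 0 and standard deviation 1. The sketch below repeats the calculation above (using `float64` rather than `float32`, an assumption made to keep rounding error small) and compares it with `scipy.stats.zscore`, which performs the same transformation.

```python
import numpy as np
from scipy import stats

original_data = np.array([10, 20, 5, 105, 30, 17, 19], dtype=np.float64)
zscores = (original_data - original_data.mean()) / original_data.std()

# After the transformation the data has mean 0 and standard deviation 1.
print(np.isclose(zscores.mean(), 0.0), np.isclose(zscores.std(), 1.0))

# scipy.stats.zscore performs the same calculation (population standard deviation, ddof=0).
print(np.allclose(zscores, stats.zscore(original_data)))
```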

The values of the z-scores are normalised, allowing us to compare data from different scales - for instance, comparing the stock prices between AAPL and MSFT for a period of one month, where direct comparisons are initially hard. 


Let's load some data from Quandl. To do that, create a file called `my_secrets.py` and create a value called `QUANDL_API_KEY` and set that equal to your API key from Quandl. You can obtain one by signing up at https://www.quandl.com/tools/api and then viewing your profile page at https://www.quandl.com/account/profile

You can copy the file `my_secrets_template.py` to create this file for you. Just copy the file and fill out the data. Ensure this file is in the same directory as your notebooks.
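As an alternative way of supplying the key, the sketch below looks for it in an environment variable first and falls back to `my_secrets.py`. The helper name `load_api_key` and the environment variable name `QUANDL_API_KEY` are assumptions made for this example; they are not part of Quandl's library.

```python
import os


def load_api_key(env_var="QUANDL_API_KEY"):
    """Return the API key from the environment, falling back to my_secrets.py.

    Both this helper and the environment variable name are assumptions
    made for this sketch.
    """
    key = os.environ.get(env_var)
    if key:
        return key
    try:
        import my_secrets  # the local file described above
    except ImportError:
        raise RuntimeError(
            f"Set the {env_var} environment variable or create my_secrets.py"
        ) from None
    return my_secrets.QUANDL_API_KEY
```

It could then be used as `quandl.ApiConfig.api_key = load_api_key()`. Either way, the key itself is best kept out of shared notebooks and version control.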

In [20]:
%%writefile my_secrets.py

# Variable holding your Quandl API key (replace the placeholder with your own key)
QUANDL_API_KEY = "YOUR_QUANDL_API_KEY"

Overwriting my_secrets.py


In [21]:
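# Import the Quandl library, used to fetch financial data
import quandl
# Import the local configuration file containing the API key
import my_secrets
# Set the Quandl API key, used to authenticate with the Quandl data service
quandl.ApiConfig.api_key = my_secrets.QUANDL_API_KEY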

Get the ten companies with the largest market capitalisation

In [22]:
!pip install yfinance



In [23]:
import quandl
import pandas as pd
import altair as alt
import yfinance as yf

def get_market_caps():
    """
    Fetch market cap data with yfinance, because Quandl's market cap data requires a paid subscription
    """
    # Get the list of S&P 500 constituents
    sp500 = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')[0]
    tickers = sp500['Symbol'].tolist()

    # Fetch the market cap data
    market_caps = {}
    for ticker in tickers[:50]:  # limit the number of requests to avoid timeouts
        try:
            stock = yf.Ticker(ticker)
            info = stock.info
            if 'marketCap' in info and info['marketCap'] > 0:
                market_caps[ticker] = {
                    'name': info.get('longName', ''),
                    'marketcap': info['marketCap']
                }
        except Exception:
            # Skip tickers whose data cannot be fetched
            continue

    # Convert to a DataFrame
    df = pd.DataFrame.from_dict(market_caps, orient='index')
    df['ticker'] = df.index
    df['marketcap_billions'] = df['marketcap'] / 1e9

    # Keep the 10 largest companies (among the tickers requested above)
    return df.nlargest(10, 'marketcap')

def plot_market_caps(top_10):
    """
    Create a chart of the market caps
    """
    chart = alt.Chart(top_10).mark_bar().encode(
        x=alt.X('marketcap_billions:Q', 
                title='Market cap (USD billions)'),
        y=alt.Y('ticker:N', 
                sort='-x',  # sort by market cap, descending
                title='Ticker'),
        tooltip=['name', 'marketcap_billions'],
        color=alt.value('#1f77b4')  # use a single blue for all bars
    ).properties(
        title='Top 10 US companies by market cap',
        width=600,
        height=400
    )

    return chart

try:
    # Fetch the data
    print("Fetching market cap data...")
    top_10_companies = get_market_caps()

    # Show the data
    print("\nTop 10 companies by market cap:")
    print(top_10_companies[['ticker', 'name', 'marketcap_billions']].to_string())

    # Draw the chart
    chart = plot_market_caps(top_10_companies)
    chart.display()

except Exception as e:
    print(f"Error while fetching data: {str(e)}")

Fetching market cap data...

Top 10 companies by market cap:
      ticker                          name  marketcap_billions
AAPL    AAPL                    Apple Inc.         3678.581031
AMZN    AMZN              Amazon.com, Inc.         2357.357969
GOOGL  GOOGL                 Alphabet Inc.         2355.123716
GOOG    GOOG                 Alphabet Inc.         2355.123716
ABBV    ABBV                   AbbVie Inc.          320.241107
ACN      ACN                 Accenture plc          221.326098
AXP      AXP      American Express Company          213.503181
AMD      AMD  Advanced Micro Devices, Inc.          203.451695
ABT      ABT           Abbott Laboratories          197.433590
ADBE    ADBE                    Adobe Inc.          189.536911


Plot the stock price distributions of the ten companies with the largest market capitalisation

In [24]:
!pip install seaborn



In [25]:
import quandl
import pandas as pd
import numpy as np
import altair as alt
import yfinance as yf
from scipy import stats
import seaborn as sns

def get_market_caps():
    """
    Fetch market cap data with yfinance
    """
    try:
        # Use pandas to read the table from the web page
        sp500 = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')[0]
        tickers = sp500['Symbol'].tolist()

        # Fetch the market cap data
        market_caps = {}
        print("Fetching market cap data for individual stocks...")
        for ticker in tickers[:50]:  # limit the number of requests
            try:
                stock = yf.Ticker(ticker)
                info = stock.info
                if 'marketCap' in info and info['marketCap'] > 0:
                    market_caps[ticker] = {
                        'name': info.get('longName', ticker),
                        'marketcap': info['marketCap']
                    }
                    print(f"Fetched market cap data for {ticker}")
            except Exception as e:
                print(f"Error while fetching data for {ticker}: {str(e)}")
                continue

        # Convert to a DataFrame
        df = pd.DataFrame.from_dict(market_caps, orient='index')
        df['ticker'] = df.index
        df['marketcap_billions'] = df['marketcap'] / 1e9

        # Keep the 10 largest companies (among the tickers requested above)
        return df.nlargest(10, 'marketcap')

    except Exception as e:
        print(f"Error while fetching market cap data: {str(e)}")
        return None

def get_stock_data(tickers, start_date='2015-01-01', end_date='2025-01-01'):
    """
    Fetch historical price data for the given tickers
    """
    price_data = pd.DataFrame()

    for ticker in tickers:
        try:
            print(f"Fetching historical price data for {ticker}...")
            # auto_adjust=False keeps the 'Adj Close' column, which some yfinance versions omit by default
            stock = yf.download(ticker, start=start_date, end=end_date,
                                progress=False, auto_adjust=False)
            stock_prices = stock['Adj Close'].rename(ticker)
            price_data = pd.concat([price_data, stock_prices], axis=1)
        except Exception as e:
            print(f"Error while fetching data for {ticker}: {str(e)}")

    return price_data

def analyze_distribution(price_data):
    """
    Analyse the distribution of each stock's prices
    """
    # Build the price distribution chart
    melted_data = price_data.melt(var_name='ticker', value_name='price')
    chart = alt.Chart(melted_data).mark_area(
        opacity=0.3,
        interpolate='step'
    ).encode(
        alt.X('price:Q', bin=alt.Bin(maxbins=50), title='Price'),
        alt.Y('count():Q', stack=None, title='Frequency'),
        alt.Color('ticker:N', title='Ticker'),
        tooltip=['ticker:N', 'count():Q']
    ).properties(
        width=800,
        height=400,
        title='Comparison of stock price distributions'
    )

    # Compute summary statistics
    stats_df = pd.DataFrame()
    for column in price_data.columns:
        data = price_data[column].dropna()
        stats_df.loc[column, 'Mean'] = data.mean()
        stats_df.loc[column, 'Std'] = data.std()
        stats_df.loc[column, 'Skewness'] = stats.skew(data)
        stats_df.loc[column, 'Kurtosis'] = stats.kurtosis(data)  # excess kurtosis
        _, p_value = stats.normaltest(data)
        stats_df.loc[column, 'Normal Test p-value'] = p_value

    return chart, stats_df

def main():
    try:
        # Fetch the market cap data
        print("Fetching market cap data...")
        top_10_companies = get_market_caps()

        if top_10_companies is None or top_10_companies.empty:
            print("Could not fetch market cap data")
            return

        # Show the top ten companies by market cap
        print("\nTop 10 companies by market cap:")
        print(top_10_companies[['ticker', 'name', 'marketcap_billions']].to_string())

        # Fetch the historical price data
        tickers = top_10_companies['ticker'].tolist()
        price_data = get_stock_data(tickers)

        if price_data.empty:
            print("Could not fetch price data")
            return

        # Analyse the distributions
        print("\nAnalysing price distributions...")
        distribution_chart, stats_df = analyze_distribution(price_data)

        # Show the summary statistics
        print("\nSummary statistics:")
        pd.set_option('display.float_format', lambda x: '%.3f' % x)
        print(stats_df)

        # Show the distribution chart
        distribution_chart.display()

        # Show the normality test results
        print("\nNormality test results:")
        for index, row in stats_df.iterrows():
            is_normal = row['Normal Test p-value'] > 0.05
            print(f"{index}: {'consistent with a normal distribution' if is_normal else 'not normally distributed'} (p-value: {row['Normal Test p-value']:.3f})")

    except Exception as e:
        print(f"Error during analysis: {str(e)}")

if __name__ == "__main__":
    main()

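Fetching market cap data...
Fetching market cap data for individual stocks...
Fetched market cap data for MMM
Fetched market cap data for AOS
Fetched market cap data for ABT
Fetched market cap data for ABBV
Fetched market cap data for ACN
Fetched market cap data for ADBE
Fetched market cap data for AMD
Fetched market cap data for AES
Fetched market cap data for AFL
Fetched market cap data for A
Fetched market cap data for APD
Fetched market cap data for ABNB
Fetched market cap data for AKAM
Fetched market cap data for ALB
Fetched market cap data for ARE
Fetched market cap data for ALGN
Fetched market cap data for ALLE
Fetched market cap data for LNT
Fetched market cap data for ALL
Fetched market cap data for GOOGL
Fetched market cap data for GOOG
Fetched market cap data for MO
Fetched market cap data for AMZN
Fetched market cap data for AMCR
Fetched market cap data for AEE
Fetched market cap data for AEP
Fetched market cap data for AXP
Fetched market cap data for AIG
Fetched market cap data for AMT
Fetched market cap data for AWK
Fetched market cap data for AMP
Fetched market cap data for AME
Fetched market cap data for AMGN
Fetched market cap data for APH
Fetched market cap data for ADI
Fetched market cap data for ANSS
Fetched market cap data for AON
Fetched market cap data for APA
Fetched market cap data for APO
Fetched market cap data for AAPL
Fetched market cap data for AMAT
Fetched market cap data for APTV
Fetched market cap data for ACGL
Fetched market cap data for ADM
Fetched market cap data for ANET
Fetched market cap data for AJG
Fetched market cap data for AIZ
Fetched market cap data for T
Fetched market cap data for ATO
Fetched market cap data for ADSK

Top 10 companies by market cap: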
      ticker                          name  marketcap_billions
AAPL    AAPL                    Apple Inc.         3678.581031
AMZN    AMZN              Amazon.com, Inc.         2357.357969
GOOGL  GOOGL                 Alphabet Inc.         2355.12371

In [26]:
# Fetch MSFT and AAPL stock data from Quandl
data = quandl.get_table('WIKI/PRICES', ticker = ['MSFT', 'AAPL'], # request the Microsoft and Apple stock data
                        qopts = { 'columns': ['ticker', 'date', 'adj_close'] }, # keep only the ticker, date and adjusted close columns
                        date = { 'gte': '2017-01-01', 'lte': '2024-01-01' }, # date range: 2017-01-01 to 2024-01-01
                        paginate=True) # page through the results so that large queries return all rows

In [27]:
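# Take a random sample of 5 rows (not displayed, since only the last expression in a cell is shown)
data.sample(5)
# Check the type of the data object
type(data)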

pandas.core.frame.DataFrame

If we compare the means, we see that AAPL has a higher adjusted close value.



In [28]:
# Group by ticker and compute the mean adjusted close for each stock

data.groupby("ticker")['adj_close'].mean()

ticker
AAPL    154.137248
MSFT     75.098922
Name: adj_close, dtype: float64

However, we might be more interested to see whether movements swing wildly, or are stable with regard to the current price.

In [29]:
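# Build a bar chart of the distribution of closing prices with Altair
alt.Chart(data).mark_bar(opacity=0.4).encode(
    # X axis: adjusted close (adj_close), grouped into at most 30 bins
    x=alt.X("adj_close", bin=alt.Bin(maxbins=30)),
    # Y axis: count per bin; stack=None stops the two tickers being stacked
    y=alt.Y('count()', stack=None),
    # column='ticker',  # commented out: would show one panel per ticker
    # Use a different colour for each ticker
    color='ticker',
)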

To truly compare these distributions, we need to convert them to z-scores first, which gives us more information about the relative stock price movements:



In [30]:
# Show all column names of the DataFrame

data.columns

Index(['ticker', 'date', 'adj_close'], dtype='object')

In [31]:
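# Pivot the data into a price matrix indexed by date, with one column per ticker
prices = data.pivot(columns="ticker", index="date", values='adj_close')
# Compute the z-scores for each stock: (price - mean price) / standard deviation
z_scores = (prices - prices.mean())/prices.std()
# Show the first 5 rows of z-scores
z_scores.head()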

date,AAPL,MSFT
2017-01-03,-2.399152,-1.330589
2017-01-04,-2.406966,-1.356847
2017-01-05,-2.371503,-1.356847
2017-01-06,-2.293364,-1.306206
2017-01-09,-2.228449,-1.324962
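One detail worth knowing when computing z-scores on a DataFrame: pandas' `.std()` divides by $n-1$ by default, whereas `np.std`, used in the earlier array example, divides by $n$. The sketch below uses made-up prices for two hypothetical tickers (`AAA` and `BBB`) to show both conventions.

```python
import numpy as np
import pandas as pd

# Made-up prices for two hypothetical tickers on very different scales.
prices = pd.DataFrame({"AAA": [10.0, 11.0, 12.0, 13.0],
                       "BBB": [200.0, 220.0, 180.0, 240.0]})

# pandas' .std() divides by n - 1 (ddof=1) by default...
z_sample = (prices - prices.mean()) / prices.std()
# ...whereas np.std, used earlier, divides by n (ddof=0).
z_population = (prices - prices.mean()) / prices.std(ddof=0)

# Each set of z-scores has unit standard deviation under its own convention.
print(z_sample.std())
print(z_population.std(ddof=0))
```

For long price series the two conventions give almost identical z-scores, but the difference is visible on small samples like this one.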


In [32]:
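# Build a bar chart of the distribution of the standardised closing prices
alt.Chart(z_scores.melt(value_name="z_score_adj_close")).mark_bar(opacity=0.4).encode(
    # X axis: standardised closing price, grouped into at most 30 bins
    x=alt.X("z_score_adj_close", bin=alt.Bin(maxbins=30)),
    # Y axis: count per bin, not stacked
    y=alt.Y('count()', stack=None),
    # column='ticker',  # commented out: would show one panel per ticker
    # Use a different colour for each ticker
    color='ticker',
)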

We can now compare the distributions, visually and directly against each other. This specific analysis doesn't tell us much, but we can use z-scores to compare distributions of data from different scales, as we saw above.



#### Exercise

Perform the same analysis, but using the increase in adjusted closing price in a given day, rather than the absolute value.

In [33]:
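# Compute the daily returns
returns = prices.pct_change().iloc[1:,:]
# Standardise the returns by computing their z-scores
z_rets = (returns - returns.mean())/returns.std()

# Build a bar chart of the distribution of the standardised returns
# Draw the histogram with Altair, using an opacity of 0.4
alt.Chart(z_rets.melt(value_name="z_score_returns")).mark_bar(opacity=0.4).encode(
    # X axis: standardised returns, grouped into at most 30 bins
    x=alt.X("z_score_returns", bin=alt.Bin(maxbins=30)),
    # Y axis: count per bin, not stacked
    y=alt.Y('count()', stack=None),
    # column='ticker',  # commented out: would show one panel per ticker
    # Use a different colour for each ticker
    color='ticker',
)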
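The solution above relies on `pct_change`, which turns a price series into relative day-on-day changes. A minimal sketch with made-up prices shows what it computes and why the first row is dropped with `.iloc[1:, :]`.

```python
import pandas as pd

# Made-up closing prices for illustration.
prices = pd.Series([100.0, 110.0, 99.0])

returns = prices.pct_change()            # first entry is NaN: there is no previous price
manual = prices / prices.shift(1) - 1    # the same calculation written out

print(returns.tolist())
```

Here the price rises by 10% and then falls by 10%, which leaves it below where it started (99 rather than 100).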



*For solutions, see `solutions/adjusted_increases.py`*