# Readme

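- We also provide solutions for the Python version of `datatable`, and compare them with the R version where possible.
- As in the R version, we print only the **first 5 rows** of each result.
- For other notes, see the readme section of *anwer-keys-rdt.ipynb*.
- You are welcome to follow our WeChat official account **大喵与村长的R进制**.
- If you have any questions, please open an issue on our GitHub page. Finally, thanks to renkun!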

# Import data

In [5]:
import datatable as dt
data = dt.fread("data/stock-market-data.csv")
data[:5,:]

|  | symbol | date | pre_close | open | high | low | close | volume | amount | adj_factor | capt | index_w50 | index_w300 | index_w500 | industry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 600000.SH | 20120104 | 8.49 | 8.54 | 8.56 | 8.39 | 8.41 | 34201379 | 290230000.0 | 6.65527 | 125501000000.0 | 0.0464093 | 0.0212594 | 0 | BANKS |
| 1 | 600000.SH | 20120105 | 8.41 | 8.47 | 8.82 | 8.47 | 8.65 | 132116203 | 1144750000.0 | 6.65527 | 129082000000.0 | 0.0464093 | 0.0212594 | 0 | BANKS |
| 2 | 600000.SH | 20120106 | 8.65 | 8.63 | 8.78 | 8.62 | 8.71 | 61778687 | 537044000.0 | 6.65527 | 129977000000.0 | 0.0464093 | 0.0212594 | 0 | BANKS |
| 3 | 600000.SH | 20120109 | 8.71 | 8.72 | 8.99 | 8.68 | 8.95 | 80136249 | 711430000.0 | 6.65527 | 133559000000.0 | 0.0464093 | 0.0212594 | 0 | BANKS |
| 4 | 600000.SH | 20120110 | 8.95 | 8.95 | 9.1 | 8.88 | 9.07 | 72004632 | 647207000.0 | 6.65527 | 135350000000.0 | 0.0464093 | 0.0212594 | 0 | BANKS |


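- The data set is panel data: it covers many stocks (the cross-section), and each stock has several variables observed over dates (the time series).

- The stock code `symbol` and the date `date` together form the key of the data set: each unique combination of `symbol` and `date` identifies exactly one observation.

- The whole data set is sorted first by `symbol` and then by `date`.

- Main variables:
    - `symbol`: stock code. Codes ending in .SH are listed in Shanghai; codes ending in .SZ are listed in Shenzhen
    - `date`: trading date
    - `pre_close`: previous day's closing price
    - `open`: opening price
    - `high`: highest price (intraday)
    - `low`: lowest price (intraday)
    - `close`: closing price
    - `volume`: trading volume
    - `amount`: turnover (traded value)
    - `industry`: industry
    - `index_w50`: weight of the stock in the SSE 50 index
    - `index_w300`: weight of the stock in the CSI 300 index
    - `index_w500`: weight of the stock in the CSI 500 index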
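
The key and sort-order properties described above can be checked mechanically. The following is a minimal plain-Python sketch on a few rows copied from the outputs in this notebook (symbol, date, close); the variable names are illustrative and not part of the original solutions.

```python
# A few (symbol, date, close) rows copied from the outputs in this notebook
rows = [
    ("600000.SH", 20120104, 8.41),
    ("600000.SH", 20120105, 8.65),
    ("600008.SH", 20120104, 4.96),
    ("600008.SH", 20120105, 4.85),
]

# Each (symbol, date) pair should appear exactly once
keys = [(symbol, date) for symbol, date, _ in rows]
key_is_unique = len(keys) == len(set(keys))

# Rows should be ordered by symbol first, then by date
is_sorted = keys == sorted(keys)

print(key_is_unique, is_sorted)
```

The same two checks could be run on the full data set after converting the `symbol` and `date` columns to Python lists.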

# Answer Keys

## 1. Which stock codes contain the digit "8"?

In [39]:
data[dt.re.match(dt.f.symbol, ".*8.*"),:][:5, :]

|  | symbol | date | pre_close | open | high | low | close | volume | amount | adj_factor | capt | index_w50 | index_w300 | index_w500 | industry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 600008.SH | 20120104 | 5.2 | 5.29 | 5.29 | 4.92 | 4.96 | 6685638 | 34015100.0 | 3.29223 | 10912000000.0 | 0 | 0.00122996 | 0 | UTILITIE |
| 1 | 600008.SH | 20120105 | 4.96 | 4.97 | 4.98 | 4.78 | 4.85 | 5600697 | 27303400.0 | 3.29223 | 10670000000.0 | 0 | 0.00122996 | 0 | UTILITIE |
| 2 | 600008.SH | 20120106 | 4.85 | 4.86 | 5.0 | 4.78 | 4.98 | 3773453 | 18459000.0 | 3.29223 | 10956000000.0 | 0 | 0.00122996 | 0 | UTILITIE |
| 3 | 600008.SH | 20120109 | 4.98 | 4.98 | 5.2 | 4.91 | 5.17 | 5749379 | 29119000.0 | 3.29223 | 11374000000.0 | 0 | 0.00122996 | 0 | UTILITIE |
| 4 | 600008.SH | 20120110 | 5.17 | 5.17 | 5.32 | 5.13 | 5.28 | 8276808 | 43363100.0 | 3.29223 | 11616000000.0 | 0 | 0.00122996 | 0 | UTILITIE |
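
The cell above returns every matching row, so the same stock appears once per trading day. As a cross-check of the matching rule, here is a minimal plain-Python sketch. It assumes that `dt.re.match` requires the whole string to match, which would explain the `.*` on both sides of the pattern; the two .SZ codes are made up for illustration.

```python
import re

# 600000.SH and 600008.SH appear in this notebook; the .SZ codes are hypothetical.
symbols = ["600000.SH", "600008.SH", "000858.SZ", "300001.SZ", "600008.SH"]

pattern = re.compile(r".*8.*")
# fullmatch() needs the whole string to match, hence the surrounding ".*"
matching = [s for s in symbols if pattern.fullmatch(s)]

# Keep one entry per stock, in first-seen order
unique_matching = list(dict.fromkeys(matching))

print(unique_matching)
```

For a fixed digit, the plain substring test `"8" in s` selects the same codes without a regular expression.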


## 2. How many stocks rise and fall each day?

In [58]:
data[:, dt.update(updown_tag = dt.ifelse(dt.f.close > dt.f.pre_close, "up", dt.f.close < dt.f.pre_close, "down", "steady"))]
data[:, dt.count(dt.f.symbol), dt.by(dt.f.date, dt.f.updown_tag)][:5, :]

|  | date | updown_tag | symbol |
|---|---|---|---|
| 0 | 20120104 | down | 2007 |
| 1 | 20120104 | steady | 122 |
| 2 | 20120104 | up | 191 |
| 3 | 20120105 | down | 2071 |
| 4 | 20120105 | steady | 117 |
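
The tagging and counting logic can be restated without `datatable`. The sketch below is a plain-Python equivalent on toy rows: the helper name `updown_tag` is illustrative, and the 10.0/10.0 row is invented to exercise the "steady" branch.

```python
from collections import Counter

def updown_tag(close, pre_close):
    """Classify one observation the same way as the chained dt.ifelse call."""
    if close > pre_close:
        return "up"
    if close < pre_close:
        return "down"
    return "steady"

# (date, pre_close, close); the 10.0/10.0 row is made up
rows = [
    (20120104, 8.49, 8.41),
    (20120104, 5.2, 4.96),
    (20120104, 10.0, 10.0),
    (20120105, 8.41, 8.65),
    (20120105, 4.96, 4.85),
]

# Count observations per (date, tag), like dt.by(dt.f.date, dt.f.updown_tag)
counts = Counter((date, updown_tag(close, pre_close)) for date, pre_close, close in rows)

for key in sorted(counts):
    print(key, counts[key])
```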


## 3. How many stocks rise and fall on each exchange each day?

In [66]:
data[:, dt.update(
    updown_tag = dt.ifelse(dt.f.close > dt.f.pre_close, "up", dt.f.close < dt.f.pre_close, "down", "steady"),
    exch_tag = dt.str.slice(dt.f.symbol, -2, None)           
    )]
data[:, dt.count(dt.f.symbol), dt.by(dt.f.date, dt.f.updown_tag, dt.f.exch_tag)][:5, :]


|  | date | updown_tag | exch_tag | symbol |
|---|---|---|---|---|
| 0 | 20120104 | down | SH | 794 |
| 1 | 20120104 | down | SZ | 1213 |
| 2 | 20120104 | steady | SH | 42 |
| 3 | 20120104 | steady | SZ | 80 |
| 4 | 20120104 | up | SH | 85 |
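
The exchange tag is simply the last two characters of the code. A small plain-Python sketch of the same slice is shown here, along with an alternative that splits on the dot; 000001.SZ is only an illustrative Shenzhen-style code.

```python
# 600000.SH appears in this notebook; 000001.SZ is only an illustrative code.
symbols = ["600000.SH", "000001.SZ"]

# Negative start index: take the last two characters, like slice(-2, None)
exch_by_slice = [s[-2:] for s in symbols]

# Alternative that does not depend on the suffix length: split on the last dot
exch_by_split = [s.rsplit(".", 1)[1] for s in symbols]

print(exch_by_slice, exch_by_split)
```

Splitting on the dot would keep working if a suffix longer than two characters ever appeared, at the cost of assuming every code contains a dot.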


## 4. Among CSI 300 constituents, how many stocks rise and fall each day?

In [70]:
data[:, dt.update(updown_tag = dt.ifelse(dt.f.close > dt.f.pre_close, "up", dt.f.close < dt.f.pre_close, "down", "steady"))]
data[dt.f.index_w300 > 0,:
    ][:, dt.count(dt.f.symbol), dt.by(dt.f.date, dt.f.updown_tag)][:5, :]

|  | date | updown_tag | symbol |
|---|---|---|---|
| 0 | 20120104 | down | 275 |
| 1 | 20120104 | steady | 5 |
| 2 | 20120104 | up | 20 |
| 3 | 20120105 | down | 242 |
| 4 | 20120105 | steady | 8 |


## 5. How many stocks are there in each industry each day?

In [74]:
data[:, dt.count(dt.f.symbol), dt.by(dt.f.date, dt.f.industry)][:5, :]

|  | date | industry | symbol |
|---|---|---|---|
| 0 | 20120104 | AERODEF | 10 |
| 1 | 20120104 | AIRLINE | 12 |
| 2 | 20120104 | AUTO | 85 |
| 3 | 20120104 | BANKS | 16 |
| 4 | 20120104 | BEV | 30 |


## 6. Is the industry with the most stocks always the industry with the largest total turnover?

In [105]:
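# Per day and industry: number of stocks and total turnover.
# Stored under a new name so that `data` stays intact for the later questions.
industry_stats = data[:, {
    'symbol_num' : dt.count(dt.f.symbol),
    'amount_num' : dt.sum(dt.f.amount)
    }, dt.by(dt.f.date, dt.f.industry)
    ]
# Attach the daily maximum of both measures to every row of that day
industry_stats[:, dt.update(
    symbol_max = dt.max(dt.f.symbol_num),
    amount_max = dt.max(dt.f.amount_num)
    ), dt.by(dt.f.date)
    ]
# Keep the industries that top either measure; barring ties, a daily count of 1
# means one industry tops both, and a count of 2 means they differ
industry_stats[(dt.f.symbol_num == dt.f.symbol_max) | (dt.f.amount_num == dt.f.amount_max), :
    ][:, dt.count(), dt.by(dt.f.date)
    ][:5, :]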

|  | date | count |
|---|---|---|
| 0 | 20120104 | 1 |
| 1 | 20120105 | 2 |
| 2 | 20120106 | 1 |
| 3 | 20120109 | 2 |
| 4 | 20120110 | 2 |
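
In the output above, a count of 1 on a date indicates that a single industry leads on both measures, while 2 indicates two different leaders (ignoring ties). The same comparison can be sketched in plain Python by finding the leading industry for each measure directly. The rows and the helper name `top_industry` below are made up for illustration.

```python
from collections import defaultdict

# (date, industry, amount) for individual stocks; toy values
rows = [
    (20120104, "BANKS", 100.0),
    (20120104, "BANKS", 200.0),
    (20120104, "AUTO", 50.0),
    (20120105, "AUTO", 10.0),
    (20120105, "AUTO", 20.0),
    (20120105, "BANKS", 500.0),
]

n_stocks = defaultdict(int)        # (date, industry) -> number of stocks
total_amount = defaultdict(float)  # (date, industry) -> summed turnover
for date, industry, amount in rows:
    n_stocks[(date, industry)] += 1
    total_amount[(date, industry)] += amount

def top_industry(stats, date):
    """Industry with the largest value in `stats` on the given date."""
    candidates = {ind: value for (d, ind), value in stats.items() if d == date}
    return max(candidates, key=candidates.get)

dates = sorted({date for date, _, _ in rows})
same_leader = {d: top_industry(n_stocks, d) == top_industry(total_amount, d) for d in dates}

print(same_leader)
```

Unlike the count-based approach, this version returns a direct yes/no per date, but `max` silently picks one industry when there is a tie.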


## 7. How many stocks gain more than 5% and lose more than 5% each day?

In [111]:
data[:, dt.update(
    updown_percent = dt.ifelse(dt.f.close/dt.f.pre_close-1 > 0.05, "up5p", dt.f.close/dt.f.pre_close-1 < -0.05, "down5p", "others")
    )]
data[dt.f.updown_percent != "others", :
    ][:, dt.count(dt.f.symbol), dt.by(dt.f.date, dt.f.updown_percent)
    ][:5, :]

|  | date | updown_percent | symbol |
|---|---|---|---|
| 0 | 20120104 | down5p | 277 |
| 1 | 20120104 | up5p | 17 |
| 2 | 20120105 | down5p | 886 |
| 3 | 20120105 | up5p | 10 |
| 4 | 20120106 | down5p | 66 |


## 8. What is the ratio between the total turnover of the 10 biggest gainers and that of the 10 biggest losers each day?

In [181]:
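# Compute each stock's return, then sort by date and by return (descending)
data[:, dt.update(
    ret = dt.f.close/dt.f.pre_close - 1)]
data = data[:, :, dt.sort(dt.f.date, -dt.f.ret)]

# Total turnover (amount) of all stocks on each day
data[:, dt.update(
    daily_amount = dt.sum(dt.f.amount)
    ), dt.by(dt.f.date)]

# Sum the turnover of the 10 biggest gainers and the 10 biggest losers on each day.
# Stored under a new name so that `data` stays intact for the later questions.
top_bottom = dt.rbind(data[:10, {"amount": dt.sum(dt.f.amount),
                            "daily_amount": dt.f.daily_amount,
                            "tag": "top"
                            }, dt.by(dt.f.date)], 
                data[-10:,{"amount": dt.sum(dt.f.amount),
                            "daily_amount": dt.f.daily_amount,
                            "tag": "bottom"
                            }, dt.by(dt.f.date)])

# Drop the duplicated rows, then compute each group's share of the day's total turnover
top_bottom = top_bottom[:, :, dt.sort(dt.f.date, dt.f.tag)]
top_bottom = top_bottom[0, :, dt.by(dt.f.date, dt.f.tag)]
top_bottom[:, {"ratio": dt.f.amount/dt.f.daily_amount}, dt.by(dt.f.date, dt.f.tag)][:5, :]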

|  | date | tag | ratio |
|---|---|---|---|
| 0 | 20120104 | bottom | 0.0057457 |
| 1 | 20120104 | top | 0.00830585 |
| 2 | 20120105 | bottom | 0.00435634 |
| 3 | 20120105 | top | 0.00750696 |
| 4 | 20120106 | bottom | 0.00932084 |
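
The `ratio` column above is each group's share of the day's total turnover. If a single top-to-bottom figure per day is wanted, it can be obtained by dividing one group's turnover by the other's. The sketch below shows that direct calculation in plain Python on made-up rows; the function name `top_bottom_ratio` and its parameter `n` are hypothetical, and `n` is kept small so that the toy data is easy to follow.

```python
from collections import defaultdict

def top_bottom_ratio(rows, n):
    """rows: iterable of (date, ret, amount).

    Returns {date: turnover of the n best performers / turnover of the n worst}.
    """
    by_date = defaultdict(list)
    for date, ret, amount in rows:
        by_date[date].append((ret, amount))

    ratios = {}
    for date, items in by_date.items():
        # Sort by return, best first
        items.sort(key=lambda item: item[0], reverse=True)
        top = sum(amount for _, amount in items[:n])
        bottom = sum(amount for _, amount in items[-n:])
        ratios[date] = top / bottom
    return ratios

# Toy data: four stocks on one day
rows = [
    (20120104, 0.03, 100.0),
    (20120104, 0.01, 400.0),
    (20120104, -0.02, 50.0),
    (20120104, -0.04, 200.0),
]

print(top_bottom_ratio(rows, n=1))
print(top_bottom_ratio(rows, n=2))
```

With the real data, `n=10` would correspond to the question; days with fewer than `2 * n` stocks would make the two groups overlap.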


## 9. How many stocks hit limit-up at the open and at the close each day? (A stock counts as limit-up when its return exceeds 1.5%.)

In [6]:
# Flag stocks whose open or close return exceeds the limit-up threshold
data[:, dt.update(
    close_stop = dt.ifelse(dt.f.close/dt.f.pre_close - 1 > 0.015, "close_stop", "others"),
    open_stop = dt.ifelse(dt.f.open/dt.f.pre_close - 1 > 0.015, "open_stop", "others")
    )]

data1 = data[:, dt.count(dt.f.symbol), dt.by(dt.f.date, dt.f.close_stop)
    ][dt.f.close_stop != "others", :]
data1.names = {"close_stop": "up_tag"}

data2 = data[:, dt.count(dt.f.symbol), dt.by(dt.f.date, dt.f.open_stop)
    ][dt.f.open_stop != "others", :]
data2.names = {"open_stop": "up_tag"}

# Stack both counts under a new name so that `data` is not overwritten
result = dt.rbind(data1, data2)
result = result[:, :, dt.sort(dt.f.date)]
result[:5, :]

|  | date | up_tag | symbol |
|---|---|---|---|
| 0 | 20120104 | close_stop | 70 |
| 1 | 20120104 | open_stop | 325 |
| 2 | 20120105 | close_stop | 60 |
| 3 | 20120105 | open_stop | 27 |
| 4 | 20120106 | close_stop | 743 |
