Download the **raw** data

+ Original source: https://nlp.stanford.edu/pubs/stock-event.html
+ Baidu Netdisk mirror: https://pan.baidu.com/s/1QdQTDLNhAKYJnZP-nc9jhw?pwd=1t22 (extraction code: 1t22)

Note: if you are running the simplified version of the project, you do not need to download the raw data.

References

Lee, Heeyoung, Mihai Surdeanu, Bill MacCartney, and Dan Jurafsky. 2014. “On the Importance of Text Analysis for Stock Price Prediction.” In Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014, 1170–75.

# Decompress the files

In [1]:
import os
import gzip
import tarfile
from pathlib import Path

Define the dataset folder path `data_fp` and make sure it contains the following files:

+ `8K.tar.gz`
+ `EPS.tar.gz`
+ `price_history.tar.gz`
+ `snp_list.tar.gz`

The raw archives above take up 1.85 GB of disk space in total, and the extracted files take up another 1.84 GB.

Note: the archive of the simplified project already contains reduced versions of these files.

In [2]:
data_fp = Path("C:/Users/xh_z/SynologyDrive/Projects/FinTechCaseStudies/CaseStudy3Sim")
data_fp_unzipped = data_fp/'.unzipped'

In [3]:
print([file for file in os.listdir(data_fp) if file.endswith('.tar.gz')])

['8K.tar.gz', 'EPS.tar.gz', 'price_history.tar.gz', 'snp_list.tar.gz']


Decompress

In [4]:
gz_files = ['8K.tar.gz', 'EPS.tar.gz', 'price_history.tar.gz', 'snp_list.tar.gz']

In [5]:
for gz_file in gz_files:
    with tarfile.open(data_fp/gz_file) as f:
        print(f'processing {gz_file}')
        f.extractall(data_fp_unzipped)

processing 8K.tar.gz
processing EPS.tar.gz
processing price_history.tar.gz
processing snp_list.tar.gz
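
`tarfile.extractall` writes every path stored in the archive, so it can be worth checking the member names before extracting an archive from an external source. The following is a minimal sketch, not part of the original notebook: the helpers `unsafe_members` and `build_demo_archive` are hypothetical, and the demo archive is built on the fly purely for illustration.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def unsafe_members(archive_path, target_dir):
    """Return the names of members that would land outside target_dir."""
    target = Path(target_dir).resolve()
    flagged = []
    with tarfile.open(archive_path) as tar:
        for member in tar.getmembers():
            dest = (target / member.name).resolve()
            # a safe destination is target itself or something below it
            if dest != target and target not in dest.parents:
                flagged.append(member.name)
    return flagged


def build_demo_archive(path, names):
    """Create a tiny .tar.gz whose members carry the given names (demo only)."""
    payload = b"demo"
    with tarfile.open(path, "w:gz") as tar:
        for name in names:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))


demo_dir = Path(tempfile.mkdtemp())
demo_archive = demo_dir / "demo.tar.gz"
build_demo_archive(demo_archive, ["8K-gz/AAPL.gz", "../escape.txt"])

flagged = unsafe_members(demo_archive, demo_dir / ".unzipped")
print(flagged)
```

If the returned list is empty, extracting with `extractall` as in the cell above should stay inside the target folder.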


In [6]:
# every file in the archive '8K.tar.gz' is an archive on its own
# and has been extracted to subfolder "8K-gz"
path_8K = data_fp_unzipped/'8K-gz'
gz_files_8K = [file for file in os.listdir(path_8K) if file.endswith('.gz')]
print(gz_files_8K[:10])

['AAPL.gz']


In [7]:
# for gz_file_8K in gz_files_8K:
#     with gzip.open(path_8K/gz_file_8K, 'rb') as f:
#         file_content = f.read()
#     save_as = path_8K/gz_file_8K.replace('.gz', '.txt')
#     with open(save_as, 'w', encoding='utf-8') as f:
#         f.write(file_content.decode('utf-8'))
#     os.remove(path_8K/gz_file_8K)
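
The commented-out cell above reads each `.gz` file fully into memory before writing it back out. As an alternative sketch (an assumption on my part, not what the notebook runs), the decompression can be streamed with `shutil.copyfileobj`. The helper name `gunzip_to_txt` and the demo file are hypothetical, and unlike the cell above this version copies raw bytes and does not delete the source file.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gunzip_to_txt(gz_path):
    """Stream-decompress `NAME.gz` into `NAME.txt` next to it and return the new path."""
    gz_path = Path(gz_path)
    out_path = gz_path.with_suffix(".txt")
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        # copy in chunks instead of loading the whole file at once
        shutil.copyfileobj(src, dst)
    return out_path


# build a tiny demo file; its content is a placeholder, not real filing data
demo_dir = Path(tempfile.mkdtemp())
demo_gz = demo_dir / "DEMO.gz"
with gzip.open(demo_gz, "wb") as f:
    f.write(b"<DOCUMENT>\nTIME:placeholder\n</DOCUMENT>\n")

txt_path = gunzip_to_txt(demo_gz)
print(txt_path.name)
```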

# Text preprocessing

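The extracted folder `8K-gz` contains the 8-K filings of a subset of S&P 1500 companies from 2002 to 2012. Each company has one compressed `*.gz` file named after its ticker symbol. Decompressing that file yields a plain-text file that concatenates all 8-K documents the company filed between 2002 and 2012. Each 8-K document is wrapped in a `<DOCUMENT></DOCUMENT>` tag. See the sample file `AAPL.xml`.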
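
To make the file layout concrete, here is a rough standard-library sketch that splits a concatenated file into individual documents with a regular expression. This is only an illustration: the notebook itself uses BeautifulSoup with `lxml` for this step, and the helper `split_documents` and the sample string are hypothetical.

```python
import re

# non-greedy match so each <DOCUMENT>...</DOCUMENT> pair is captured separately
DOC_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", flags=re.DOTALL)


def split_documents(raw_text):
    """Return the text inside every <DOCUMENT></DOCUMENT> pair, stripped."""
    return [chunk.strip() for chunk in DOC_RE.findall(raw_text)]


# placeholder content, not real filing data
sample = (
    "<DOCUMENT>\nTIME:first\nTEXT:\nbody one\n</DOCUMENT>\n"
    "<DOCUMENT>\nTIME:second\nTEXT:\nbody two\n</DOCUMENT>\n"
)
docs = split_documents(sample)
print(len(docs))
```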

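We first extract each 8-K document, then parse its content to pull out metadata such as the filing time and the content (event) types, then trim the body text so that only substantive content remains, and finally tokenize the body text.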
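
The metadata step can be sketched on a toy document. The example below assumes the line prefixes that `parse_one_doc` relies on (`TIME:`, a tab-separated `EVENTS:` line, `TEXT:`, then leading `ITEM:` lines). The helper `read_header` and all field values are hypothetical placeholders, not real filing data.

```python
def read_header(doc):
    """Collect the time, event types, item titles and body lines of one document."""
    meta = {"time": None, "events": [], "items": [], "content": []}
    lines = doc.split("\n")
    for i, line in enumerate(lines):
        if line.startswith("TIME:"):
            meta["time"] = line[len("TIME:"):].strip()
        elif line.startswith("EVENTS:"):
            meta["events"] = line[len("EVENTS:"):].strip().split("\t")
        elif line.startswith("TEXT:"):
            rest = lines[i + 1:]
            k = 0
            # leading ITEM: lines list the sections of the filing
            while k < len(rest) and rest[k].startswith("ITEM:"):
                meta["items"].append(rest[k][len("ITEM:"):].strip())
                k += 1
            # as in parse_one_doc, the first non-ITEM line is skipped
            meta["content"] = [s.strip() for s in rest[k + 1:] if s.strip()]
            break
    return meta


toy_doc = "\n".join([
    "TIME:placeholder-timestamp",
    "EVENTS:Event A\tEvent B",
    "TEXT:",
    "ITEM:Event A",
    "ITEM:Event B",
    "",
    "Some heading",
    "",
    "Body of the filing.",
])
meta = read_header(toy_doc)
print(meta["events"], meta["items"], meta["content"])
```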

The output of this step is stored in the subfolder `.processed`. For the raw files, this folder ends up at 5.62 GB.

+ Install `spaCy` for text preprocessing, following [https://spacy.io/usage](https://spacy.io/usage). The commands I used:
```
conda install -c conda-forge spacy
conda install -c conda-forge cupy
python -m spacy download zh_core_web_sm
python -m spacy download en_core_web_sm
```
Note: `spaCy` can use a GPU to speed up computation, but you need to install CUDA yourself (the CUDA bundled with PyTorch does not work). CUDA download: [https://developer.nvidia.com/cuda-downloads](https://developer.nvidia.com/cuda-downloads).

+ Install [BeautifulSoup4](https://beautiful-soup-4.readthedocs.io/en/latest/#installing-beautiful-soup), together with `lxml`.
```
pip install beautifulsoup4 lxml
```

In [8]:
import re
import spacy
import json

from bs4 import BeautifulSoup
from tqdm import tqdm

In [9]:
# spacy.require_gpu()

In [10]:
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])
nlp.max_length = nlp.max_length*10

In [11]:
def parse_one_doc(doc):
    line_l = doc.split('\n')
    # step 1: extract metadata entries and the 'TEXT' entry
    # this step produces `toc`, which stands for 'table of contents'
    # and is a list of section titles
    # the defaults guard against documents that lack one of the entries
    time_str, etypes, toc, content_lines = '', [], [], []
    for i, line in enumerate(line_l):
        if line.startswith('TIME:'):
            time_str = line[5:].strip()
        elif line.startswith('EVENTS:'):
            etypes = line[7:].strip().split('\t')
        elif line.startswith('TEXT:'):
            # the lines after this one are the filing content
            text_lines = line_l[(i+1):]
            for j, text_line in enumerate(text_lines):
                if text_line.startswith('ITEM:'):
                    # the first several lines of the filing content list the filing items
                    # keep them in `toc`
                    toc.append(text_line[5:].strip())
                else:
                    # substantive filing content starts from the next line
                    content_lines = [s.strip() for s in text_lines[(j+1):] if len(s.strip()) > 0]
                    break
            break

    # step 2: drop stereotyped lines
    dropped = [False] * len(content_lines)
    for i, line in enumerate(content_lines):
        if not dropped[i]:
            if line.startswith('Table of Contents'):
                dropped[i] = True
            elif line.startswith('Check the appropriate box below if the Form 8-K'):
                dropped[i] = True
                # also check the next few lines
                # if two consecutive lines correspond to a checkbox, then drop them
                for j in range(i+2, min(i+2*10, len(content_lines)), 2):
                    if content_lines[j] == 'o':
                        dropped[j-1] = True
                        dropped[j] = True
    content_lines = [line for line, is_dropped in zip(content_lines, dropped) if not is_dropped]
    n = len(content_lines)

    # step 3: separate doc into sections
    toc_spans = [[None, None] for _ in toc]
    if toc:
        sec_i = 0
        sec = toc[sec_i].lower()
        for i, line in enumerate(content_lines):
            lowered = line.lower()
            if lowered.endswith(sec) or lowered.endswith(sec+'.'):
                # start of an item section
                toc_spans[sec_i][0] = i
                # also the end of the previous item section
                if sec_i > 0:
                    toc_spans[sec_i-1][1] = i
                sec_i += 1
                if sec_i == len(toc):
                    # the last section ends with the document
                    break
                sec = toc[sec_i].lower()
        toc_spans[-1][1] = n
    # the split succeeded only if every section has both a start and an end
    seperated = bool(toc) and all(
        (start is not None) and (end is not None) for start, end in toc_spans)
    if seperated:
        body = {sec: ' '.join(content_lines[span[0]:span[1]]) for sec, span in zip(toc, toc_spans)}
        # keep the content not in any toc section in `head`
        head = ' '.join(content_lines[0:toc_spans[0][0]])
    else:
        # simple strategy failed: keep the whole document as a single section
        body = {'all': ' '.join(content_lines)}
        head = ''
    # step 4: save results
    doc_d = {'body': body, 'head': head,
             'time': time_str, 'events': etypes, 'toc': toc,
             'toc_spans': toc_spans, 'seperated': seperated}
    return doc_d

In [12]:
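def cleanup_text(doc, logging=False):
    try:
        # collapse all runs of whitespace into single spaces
        text = re.sub(r'\s+', ' ', doc).strip()
        parsed = nlp(text)
        # tokenization, then only keep alphabetic tokens, lowercased
        tokens = [tok.text.lower() for tok in parsed if tok.is_alpha]
        return ' '.join(tokens)
    except Exception:
        # show the text that could not be processed and fall back to an empty string
        print(doc)
        return ''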
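
To show the effect of this cleanup without loading a spaCy model, here is a rough standard-library approximation. It is an assumption for illustration only: a regular expression stands in for spaCy's tokenizer and `is_alpha` check, so the results will differ from `cleanup_text` on edge cases such as hyphenated words. The helper name `cleanup_text_approx` is hypothetical.

```python
import re

# whole words made of ASCII letters only; "Q3" and "12%" do not match
WORD_RE = re.compile(r"\b[A-Za-z]+\b")


def cleanup_text_approx(text):
    """Collapse whitespace, keep purely alphabetic words, and lowercase them."""
    text = re.sub(r"\s+", " ", text).strip()
    return " ".join(word.lower() for word in WORD_RE.findall(text))


cleaned = cleanup_text_approx("Revenue rose 12%  in\nQ3, the company said.")
print(cleaned)
```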

In [13]:
text_path = data_fp/'.processed'
# create the output folder if it does not exist yet
text_path.mkdir(exist_ok=True)

In [14]:
for gz_file_8K in tqdm(gz_files_8K):
    firm = gz_file_8K.split('.')[0]
    save_as = text_path/(firm+'.json')
    if not os.path.isfile(save_as):
        # unzip
        with gzip.open(path_8K/gz_file_8K, 'rb') as f:
            file_content = f.read().decode('ascii')
        # extract 8K documents
        root_node = BeautifulSoup(file_content, 'lxml')
        doc_l = [doc.text for doc in root_node.find_all('document')]
        # for each 8K doc, perform text preprocessing pipelines
        doc_d_l = []
        for one_doc in doc_l:
            doc_d = parse_one_doc(one_doc)
            doc_d['body'] = {sec: cleanup_text(doc) for sec, doc in doc_d['body'].items()}
            doc_d_l.append(doc_d)
        # save to json
        with open(save_as, 'w', encoding='utf-8') as f:
            json.dump(doc_d_l, f)

100%|████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<?, ?it/s]
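
Each output file is a JSON list with one dictionary per 8-K document, using the keys produced by `parse_one_doc`. The sketch below shows how such a file might be read back and summarized. The helper `load_filings` is hypothetical, and the demo record is a hand-written placeholder written to a temporary folder rather than real output.

```python
import json
import tempfile
from pathlib import Path


def load_filings(json_path):
    """Load the list of parsed 8-K documents stored for one firm."""
    with open(json_path, encoding="utf-8") as f:
        return json.load(f)


# placeholder record with the same keys as the notebook's output
demo_record = {
    "body": {"all": "revenue rose"},
    "head": "",
    "time": "placeholder-timestamp",
    "events": ["Event A"],
    "toc": ["Event A"],
    "toc_spans": [[None, None]],
    "seperated": False,
}
demo_path = Path(tempfile.mkdtemp()) / "DEMO.json"
demo_path.write_text(json.dumps([demo_record]), encoding="utf-8")

filings = load_filings(demo_path)
# number of cleaned tokens per document, summed over its sections
n_tokens = [sum(len(text.split()) for text in d["body"].values()) for d in filings]
print(len(filings), n_tokens)
```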


In [16]:
# doc_d_l[0]['body']

In [None]:
!jupyter nbconvert --to html --TagRemovePreprocessor.remove_cell_tags='{"remove_cell"}' data_preprocess_1.ipynb