Preprocess the raw CNN/Dailymail story files.

1. Build a summary from each story's highlights.
2. Write each story and its summary to a single file, with the story and summary separated by a tab.

The data can be obtained from:

https://github.com/JafferWilson/Process-Data-of-CNN-DailyMail

Parameters:

- DATA_DIRECTORIES: The directories containing stories to process.
- OUTPUT_DIR: Where the processed stories will be stored.
- MAX_SUMMARY_SENTENCES: The maximum number of highlights used to build the summary, or None to use all of them.
- EXTENSION: The file extension to use for the processed stories.

## Parameters

In [1]:
DATA_DIRECTORIES = ['../cnn/stories', '../dailymail/stories']
OUTPUT_DIR = '../data/preprocessed_stories'
MAX_SUMMARY_SENTENCES = 2
EXTENSION = 'clean'

## Get data files

In [2]:
import glob
import os

In [3]:
FILES = []
for directory in DATA_DIRECTORIES:
    stories = glob.glob(os.path.join(directory, '*'))
    FILES.extend(stories)
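
The `'*'` pattern above matches every entry in each directory, and `glob.glob` does not guarantee any particular order. A minimal sketch of a stricter variant follows, assuming all raw files end in `.story` (as the file names in this notebook suggest). `list_story_files` is a hypothetical helper, not part of the original notebook.

```python
import glob
import os


def list_story_files(directories, pattern='*.story'):
    """Collect files matching `pattern` from each directory, in sorted order."""
    files = []
    for directory in directories:
        # sorted() makes the order reproducible across runs and machines.
        files.extend(sorted(glob.glob(os.path.join(directory, pattern))))
    return files
```

It could be used as `FILES = list_story_files(DATA_DIRECTORIES)` if a reproducible file order matters for later steps.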

In [4]:
len(FILES)

312085

In [5]:
FILES[:10]

['../cnn/stories/0001d1afc246a7964130f43ae940af6bc6c57f01.story',
 '../cnn/stories/0002095e55fcbd3a2f366d9bf92a95433dc305ef.story',
 '../cnn/stories/00027e965c8264c35cc1bc55556db388da82b07f.story',
 '../cnn/stories/0002c17436637c4fe1837c935c04de47adb18e9a.story',
 '../cnn/stories/0003ad6ef0c37534f80b55b4235108024b407f0b.story',
 '../cnn/stories/0004306354494f090ee2d7bc5ddbf80b63e80de6.story',
 '../cnn/stories/0005d61497d21ff37a17751829bd7e3b6e4a7c5c.story',
 '../cnn/stories/0006021f772fad0aa78a977ce4a31b3faa6e6fe5.story',
 '../cnn/stories/00083697263e215e5e7eda753070f08aa374dd45.story',
 '../cnn/stories/000940f2bb357ac04a236a232156d8b9b18d1667.story']

## Build story parser

In [6]:
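import os

def parse(file, max_summary_sentences=None):
    """Return (file_id, story, summary) for one raw .story file."""
    with open(file) as f:
        content = f.read()
    # The file ID is the basename up to the first '.'.
    file_id = os.path.basename(file).partition('.')[0]
    # Tab is reserved as the story/summary separator in the output file.
    content = content.replace('\t', '<tab>')
    # Text before the first '@highlight' is the story; the rest are highlights.
    context, *highlights = content.split('@highlight')
    if max_summary_sentences is not None:
        highlights = highlights[:max_summary_sentences]
    # Highlights are joined as-is, so the newlines around each one are kept.
    summary = ' . '.join(highlights) + ' .'
    return file_id, context.strip(), summary.strip()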
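
To make the splitting step concrete, here is a small sketch that applies the same logic as `parse` to an in-memory string. The toy input assumes the raw layout is a story body followed by `@highlight` markers separated by blank lines. `split_story` and its `strip_highlights` option are hypothetical and only illustrate a possible variant; the notebook itself does not strip individual highlights.

```python
def split_story(content, max_summary_sentences=None, strip_highlights=False):
    """Split raw story text into (story, summary), mirroring parse() above."""
    content = content.replace('\t', '<tab>')
    context, *highlights = content.split('@highlight')
    if max_summary_sentences is not None:
        highlights = highlights[:max_summary_sentences]
    if strip_highlights:
        # Assumption: a variant that drops the blank lines around each highlight.
        highlights = [h.strip() for h in highlights]
    summary = ' . '.join(highlights) + ' .'
    return context.strip(), summary.strip()


# Toy input in the assumed raw layout.
raw = (
    'Story body.\n\n'
    '@highlight\n\nFirst point\n\n'
    '@highlight\n\nSecond point\n\n'
    '@highlight\n\nThird point'
)

story, summary = split_story(raw, max_summary_sentences=2)
print(repr(summary))   # 'First point\n\n . \n\nSecond point\n\n .'

_, flat = split_story(raw, max_summary_sentences=2, strip_highlights=True)
print(repr(flat))      # 'First point . Second point .'
```

The first result shows that the summary keeps the blank lines that surround each highlight in the raw file, which matters for any downstream step that expects one record per line.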

In [7]:
import tqdm
summaries = [parse(f, MAX_SUMMARY_SENTENCES) for f in tqdm.tqdm(FILES)]

100%|██████████| 312085/312085 [02:32<00:00, 2050.34it/s]


In [8]:
summaries[0]

('0001d1afc246a7964130f43ae940af6bc6c57f01',
 'It\'s official: U.S. President Barack Obama wants lawmakers to weigh in on whether to use military force in Syria.\n\nObama sent a letter to the heads of the House and Senate on Saturday night, hours after announcing that he believes military action against Syrian targets is the right step to take over the alleged use of chemical weapons.\n\nThe proposed legislation from Obama asks Congress to approve the use of military force "to deter, disrupt, prevent and degrade the potential for future uses of chemical weapons or other weapons of mass destruction."\n\nIt\'s a step that is set to turn an international crisis into a fierce domestic political battle.\n\nThere are key questions looming over the debate: What did U.N. weapons inspectors find in Syria? What happens if Congress votes no? And how will the Syrian government react?\n\nIn a televised address from the White House Rose Garden earlier Saturday, the president said he would take his c

## Calculate some basic statistics on data

The summary statistics should match the figures reported in the paper quoted below when no limit is placed on summary length and newlines are not split out:

"The source documents in the training set have 766 words spanning 29.74 sentences on an average while the summaries consist of 53 words and 3.72 sentences."

See https://arxiv.org/pdf/1602.06023.pdf

In [9]:
context_lens, summary_lens = [], []
context_sentences, summary_sentences = [], []
for _, context, summary in tqdm.tqdm(summaries):
    # Word count: split on single spaces only, so newlines do not separate words.
    context_lens.append(len(context.split(' ')))
    # Sentence count: approximated by the number of periods.
    context_sentences.append(context.count('.'))
    summary_lens.append(len(summary.split(' ')))
    summary_sentences.append(summary.count('.'))

100%|██████████| 312085/312085 [00:34<00:00, 9068.00it/s] 
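
The counts above are rough proxies rather than true word and sentence counts. The toy example below is a sketch of how the two choices behave: splitting on `' '` keeps words joined across newlines, and counting `'.'` also counts periods inside abbreviations. The sample text is made up for illustration.

```python
text = 'The U.S. economy grew.\n\nAnalysts agree.'

words_space_only = len(text.split(' '))   # newline-joined words count as one token
words_whitespace = len(text.split())      # splits on any whitespace
period_count = text.count('.')            # also counts the periods in 'U.S.'

print(words_space_only, words_whitespace, period_count)  # 5 6 4
```

The text has two sentences and six words, so the period count overestimates sentences and the space-only split underestimates words.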


In [10]:
import pandas as pd
df = pd.DataFrame({
    'context_len': context_lens,
    'summary_len': summary_lens,
    'context_sent': context_sentences,
    'summary_sentences': summary_sentences
})

In [11]:
df.describe()

| | context_len | summary_len | context_sent | summary_sentences |
|---|---|---|---|---|
| count | 312085.0 | 312085.0 | 312085.0 | 312085.0 |
| mean | 655.043568 | 27.544294 | 31.035487 | 2.142243 |
| std | 320.493317 | 6.812521 | 19.038796 | 0.577843 |
| min | 1.0 | 4.0 | 0.0 | 1.0 |
| 25% | 417.0 | 23.0 | 18.0 | 2.0 |
| 50% | 597.0 | 26.0 | 27.0 | 2.0 |
| 75% | 833.0 | 30.0 | 40.0 | 2.0 |
| max | 2356.0 | 131.0 | 2269.0 | 13.0 |
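
A minimum `context_len` of 1 together with a minimum `context_sent` of 0 is consistent with at least one empty story, because `''.split(' ')` returns `['']`. A minimal sketch for listing such records follows, assuming they are `(file_id, context, summary)` tuples like the entries of `summaries`. `find_empty_stories` is a hypothetical helper and the records are toy data.

```python
def find_empty_stories(records):
    """Return the IDs of records whose story text is empty after stripping."""
    return [file_id for file_id, context, _ in records if not context.strip()]


# Toy records in the same (file_id, context, summary) shape as `summaries`.
toy = [
    ('a', 'Some story text.', 'A point .'),
    ('b', '', 'Another point .'),
]

print(len(''.split(' ')))        # 1
print(find_empty_stories(toy))   # ['b']
```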


## Write to disk

In [12]:
# Create the output directory if it does not already exist.
os.makedirs(OUTPUT_DIR, exist_ok=True)

In [13]:
for file_id, context, summary in tqdm.tqdm(summaries):
    text = '\t'.join([context, summary])
    dst = os.path.join(OUTPUT_DIR, f'{file_id}.{EXTENSION}')
    with open(dst, 'w') as f:
        f.write(text)

100%|██████████| 312085/312085 [01:40<00:00, 3114.33it/s]
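
As a sanity check, a processed file can be read back and split on the tab separator. This is a sketch under stated assumptions: `read_processed` is a hypothetical helper, the round trip uses a toy file rather than real data, and `encoding='utf-8'` is an assumption (the notebook itself relies on the platform default encoding).

```python
import os
import tempfile


def read_processed(path):
    """Read one processed file and return (story, summary)."""
    with open(path, encoding='utf-8') as f:
        text = f.read()
    # Tabs in the raw text were replaced with '<tab>', so the only tab left
    # is the separator between story and summary.
    story, summary = text.split('\t')
    return story, summary


# Round trip on a toy example written in the same layout.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.clean')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('\t'.join(['Story text.\n\nMore text.', 'First point . Second point .']))
    story, summary = read_processed(path)

print(repr(summary))  # 'First point . Second point .'
```

Because the story can span several lines, each file has to be read as a whole rather than line by line.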
