In [1]:
import os
from os.path import join
import pandas as pd
from pathlib import Path
import pickle
import spacy
import sys
from tqdm import tqdm

project_root = Path('..')
sys.path.append(os.path.abspath(project_root))
from notebooks.utils import init_data_dir  # noqa

init_data_dir(project_root)

raw_path = Path('../data/raw')
preprocess_path = Path('../data/preprocess')
resources_path = Path('../resources')

nlp = spacy.load('en_core_web_sm')

# Extracting the British Academic Written English Corpus

This notebook extracts essays from the British Academic Written English (BAWE) corpus[<sup>1</sup>](#fn1). The corpus contains about 3,000 student assignments from four disciplinary areas: Arts and Humanities, Social Sciences, Life Sciences, and Physical Sciences. These assignments are further distributed across four levels of study.

### Benefits
* The corpus contains ~300,000 sentences, which is enough data for models with a fairly large number of parameters, though likely not enough for neural networks.
* The corpus is made up of university-level essays, which is close to the data that our program will encounter.
* There are many authors in the dataset, so a wide range of writing styles is represented.

### Flaws
* From a quick look through the dataset, most essays appear to respond to different prompts. More investigation is needed here. Essays written for the same prompt are desirable because they allow style differences between very similar essays to be tested.
* Having a large set of authors means that the dataset is unbalanced when considering one author vs. the rest.
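
To make the imbalance point concrete, here is a minimal sketch of how the one-vs-rest class sizes could be counted. The helper name `one_vs_rest_counts` and the toy author labels are assumptions made for illustration; they are not part of the notebook or taken from BAWE.

```python
from collections import Counter


def one_vs_rest_counts(authors, target):
    """Return (positives, negatives) when `target` is the positive class."""
    counts = Counter(authors)
    positives = counts[target]
    negatives = len(authors) - positives
    return positives, negatives


# Toy author labels, invented for this example (not BAWE data).
toy_authors = [1, 1, 1, 1, 2, 2, 3, 3, 3, 4]
pos, neg = one_vs_rest_counts(toy_authors, target=1)
print(f'author 1: {pos} essays vs. {neg} from everyone else')
```

With thousands of essays and only a handful per author, the negative class would dominate far more strongly than in this toy list.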

## Instructions

First, download the BAWE data and extract the archive. Move the resulting `download` folder into the `/data/raw` directory and rename it to `bawe`. The data directory should look like this:

```
|-- data
|   |-- raw
|   |   |-- bawe
|   |   |   |-- CORPUS_ASCII
|   |   |   |-- CORPUS_ByDisc
|   |   |   |-- ...
```
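
Before running the cells, it may help to confirm that the folder landed in the right place. The sketch below is one possible check; the helper name `missing_bawe_dirs` is hypothetical, and the default list of expected folders is an assumption based on the loading cell in this notebook, which reads from `bawe/CORPUS_TXT`.

```python
from pathlib import Path


def missing_bawe_dirs(raw_dir, expected=('CORPUS_TXT',)):
    """List the expected subdirectories that are absent under raw_dir/bawe."""
    bawe_dir = Path(raw_dir) / 'bawe'
    return [name for name in expected if not (bawe_dir / name).is_dir()]


# '../data/raw' matches raw_path as defined at the top of this notebook.
missing = missing_bawe_dirs('../data/raw')
if missing:
    print(f'Missing under data/raw/bawe: {missing}')
else:
    print('BAWE layout looks complete.')
```

Pass a longer `expected` tuple to check other corpus folders as well.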

Now, run the following code cells to load the data into a pandas DataFrame. The DataFrame will be saved as `/data/preprocess/bawe_df.hdf5`.

In [2]:
def bawe_texts(filenames):
    """Yield (author, genre, text) for each BAWE file in `filenames`."""
    # We use the plain text version of the corpus
    corpus_path = raw_path / 'bawe' / 'CORPUS_TXT'

    for filename in tqdm(filenames):
        # The first 4 characters of the filename indicate the author; the
        # 5th character indicates the genre.
        author = int(filename[:4])
        genre = filename[4]

        with open(join(corpus_path, filename), 'r') as f:
            text = f.read()

        yield author, genre, text
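
The filename convention used in `bawe_texts` can be illustrated in isolation. This is a small sketch only: the helper `parse_bawe_filename` is hypothetical, and `'0001a.txt'` is a made-up filename assumed to follow the pattern described in the comments above.

```python
def parse_bawe_filename(filename):
    """Split a BAWE-style filename into (author_id, genre_letter)."""
    author = int(filename[:4])  # first four characters: author number
    genre = filename[4]         # fifth character: genre letter
    return author, genre


# '0001a.txt' is an invented filename used only for illustration.
print(parse_bawe_filename('0001a.txt'))  # (1, 'a')
```

Converting the author prefix with `int` drops leading zeros, which is why the `author` column in the resulting DataFrame holds plain integers.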

In [3]:
with open(join(resources_path, 'bawe_splits.p'), 'rb') as f:
    bawe_splits = pickle.load(f)
    train_filenames = bawe_splits['train']

df = pd.DataFrame(bawe_texts(train_filenames),
                  columns=['author', 'genre', 'text'])
df = df.sort_values(by=['author', 'genre']).reset_index(drop=True)

df

100%|██████████| 2577/2577 [00:00<00:00, 24873.94it/s]


| | author | genre | text |
|---|---|---|---|
| 0 | 1 | a | Racism is still a problem within our society t... |
| 1 | 1 | b | Official statistics are those produced by eith... |
| 2 | 1 | c | Since the fourteenth century the practice of m... |
| 3 | 1 | d | Much more reproductive choice is now available... |
| 4 | 2 | a | Victorian notions of women's madness were larg... |
| ... | ... | ... | ... |
| 2572 | 6998 | a | E. Warwick Slinn describes dramatic monologue ... |
| 2573 | 6998 | b | Hugh Blair voices an attack on the practices o... |
| 2574 | 6998 | c | 'The first thing to remember about Donne,' wri... |
| 2575 | 6998 | d | Susan Wiseman calculated that the latest possi... |


In [4]:
# print('Counting sentences...', flush=True)
# sentence_counts = [len(list(nlp(text).sents)) for text in tqdm(df['text'])]

In [5]:
# df['sentence_count'] = sentence_counts

# authors = set(df['author'])
# sentence_count = sum(sentence_counts)
# min_sentence_count = min(sentence_counts)
# max_sentence_count = max(sentence_counts)
# avg_sentence_count = sentence_count / len(sentence_counts)

# print(f'Author count: {len(authors)}')
# print(f'Sentence count: {sentence_count}')
# print(f'Minimum sentence count: {min_sentence_count}')
# print(f'Maximum sentence count: {max_sentence_count}')
# print(f'Average sentence count: {avg_sentence_count}')

# df

In [6]:
df.to_hdf(join(preprocess_path, 'bawe_df.hdf5'), key='bawe_df')

### References

1. <span id="fn1">British Academic Written English Corpus (BAWE), Coventry University: https://www.coventry.ac.uk/research/research-directories/current-projects/2015/british-academic-written-english-corpus-bawe/</span>