# Preprocess

Preprocessing of the dataset. First, all words are extracted and each one is assigned a total count, a document count and an ID. Then, all articles are encoded as lists of word IDs for quick analysis.

The preprocessed data is used in the `Collocation.ipynb` notebook to quickly find collocation counts for words in the dataset.

This notebook was inspired by the following tutorial:

`https://github.com/sgsinclair/alta/blob/a482d343142cba12030fea4be8f96fb77579b3ab/ipynb/utilities/Collocates.ipynb`

## Loading the dataset

In [None]:
from datasets import load_dataset
import re
from tqdm import tqdm
from collections import Counter
import numpy as np
import pandas as pd
import os

In [None]:
dataset = load_dataset("cnn_dailymail.py", "3.0.0", split="train") 

In [None]:
docs = dataset['article']
n_docs = len(docs)

## Counting words

The first step in preprocessing the data is counting which words exist in the data and how often they occur.

In [None]:
def tokenise(text):
    return re.findall(r'\b\w[\w-]*\b', text.lower()) # from the tutorial notebook
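To make the tokenisation rule concrete, here is a small sketch on a made-up sentence (the sample text is an assumption, not taken from the dataset). It repeats the same regular expression so that it can be run on its own.

```python
import re

def tokenise(text):
    # same pattern as above: one word character, then any word characters or hyphens
    return re.findall(r'\b\w[\w-]*\b', text.lower())

sample = "The state-of-the-art model isn't new."  # hypothetical example sentence
tokens = tokenise(sample)
print(tokens)
```

Hyphenated words such as `state-of-the-art` stay together as one token, while the apostrophe splits `isn't` into `isn` and `t`.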

Counting how often each word occurs:

In [None]:
word_count = Counter() # counts how often each word occurs in the whole dataset
big_word_list = [] # temporary cache so the counter doesn't have to be updated at each document

counter_update = 1000 # the counting is faster if you don't update the counter at each document

for i in tqdm(range(n_docs)):
    tokenised = tokenise(docs[i])
    big_word_list += tokenised
    if i % counter_update == 0:
        word_count += Counter(big_word_list)
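        big_word_list = []

word_count += Counter(big_word_list) # flush the documents still in the cache after the last update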

Counting how many documents contain each word at least once:

In [None]:
doc_count = Counter() # counts how many documents contain each token
big_word_list = [] # temporary cache so the counter doesn't have to be updated at each document

counter_update = 1000 # the counting is faster if you don't update the counter at each document

for i in tqdm(range(n_docs)):
    tokenised = tokenise(docs[i])
    big_word_list += list(set(tokenised))
    if i % counter_update == 0:
        doc_count += Counter(big_word_list)
        big_word_list = []

doc_count += Counter(big_word_list) # flush the documents still in the cache after the last update
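The difference between the two counters is easiest to see on toy data. The sketch below uses three made-up documents and a plain whitespace split (both are assumptions for illustration), and calls `Counter.update` once per document instead of batching.

```python
from collections import Counter

toy_docs = ["the cat saw the dog", "the dog barked", "a cat"]  # made-up documents

toy_word_count = Counter()  # total occurrences of each word
toy_doc_count = Counter()   # number of documents containing each word

for doc in toy_docs:
    tokens = doc.split()               # whitespace split is enough for this toy data
    toy_word_count.update(tokens)      # every occurrence counts
    toy_doc_count.update(set(tokens))  # each word counts once per document

print(toy_word_count['the'], toy_doc_count['the'])
```

Here `the` occurs three times in total but in only two documents, so the two counters disagree for that word.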

In [None]:
all_words = list(word_count)
print('the dataset contains', len(all_words), 'unique words')

Storing the word statistics in a DataFrame:

In [None]:
all_words = list(word_count)

# look up both statistics for every word, in the same order as all_words
col_word_count = [word_count[word] for word in all_words]
col_doc_count = [doc_count[word] for word in all_words]

In [None]:
df = pd.DataFrame({'word_count':col_word_count, 'doc_count':col_doc_count}, index=all_words)
df['avg_word_count'] = df['word_count'] / n_docs
df['avg_doc_count'] = df['doc_count'] / n_docs
df.head()

Sorting the DataFrame so that the words appearing in the most documents come first, and assigning each word an ID (the most widespread words get the lowest IDs):

In [None]:
df = df.sort_values('avg_doc_count', ascending=False)
df['id'] = range(len(df))
df
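As a rough illustration of the ID assignment, the sketch below applies the same two steps to a toy table with invented counts. Note that the order follows the document count, not the total word count.

```python
import pandas as pd

# toy statistics (made-up numbers): 'the' occurs most often, 'cat' is in the most documents
toy = pd.DataFrame(
    {'word_count': [5, 3, 1], 'doc_count': [2, 3, 1]},
    index=['the', 'cat', 'saw'],
)

toy = toy.sort_values('doc_count', ascending=False)  # most widespread word first
toy['id'] = range(len(toy))                          # ID equals the rank after sorting
print(toy)
```

In this toy table `cat` receives ID 0, `the` ID 1 and `saw` ID 2.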

Exporting the DataFrame:

In [None]:
df.to_csv('dataset_words.csv')

In [None]:
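# keep_default_na=False stops pandas from reading words such as 'nan' or 'null' as missing values
pd.read_csv('dataset_words.csv', index_col=0, keep_default_na=False)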

## Encoding the articles

To make finding collocations as quick as possible, each article is encoded as an array of word IDs, using the IDs assigned in the previous step.

In [None]:
def encode(text, df):
    # encodes the article as an array of word IDs using df as a conversion table
    text_tok = tokenise(text)
    # words that do not occur in the encoding scheme will receive id -1
    text_enc = np.array([df['id'][word] if word in df.index else -1 for word in text_tok])
    return text_enc
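If encoding turns out to be slow on the full dataset, one possible variant is to convert the `id` column to a plain dictionary once and look words up there. The sketch below is an assumption rather than part of the original notebook: `encode_fast`, `word_to_id` and the toy table are hypothetical names, and no timing was measured.

```python
import re

import numpy as np
import pandas as pd

def tokenise(text):
    return re.findall(r'\b\w[\w-]*\b', text.lower())

def encode_fast(text, word_to_id):
    # hypothetical variant of encode(); unknown words still receive id -1
    return np.array([word_to_id.get(word, -1) for word in tokenise(text)], dtype=np.int64)

toy_df = pd.DataFrame({'id': [0, 1, 2]}, index=['the', 'cat', 'sat'])  # toy conversion table
word_to_id = toy_df['id'].to_dict()  # build the lookup once, reuse it for every article

encoded = encode_fast("The cat sat on the mat", word_to_id)
print(encoded)
```

The words `on` and `mat` are missing from the toy table, so they are encoded as -1.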

This is the encoding for the first article in the dataset:

In [None]:
enc_art_1 = encode(docs[0], df)
enc_art_1

Decoding function to make sure the encoding worked:

In [None]:
def decode(enc, df):
    text_enc = [df[df['id'] == tok].index[0] for tok in enc]
    return ' '.join(text_enc)
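Note that `decode` raises an `IndexError` when it meets the ID -1, because no row carries that ID. A minimal sketch of a variant that tolerates unknown words is shown below. `decode_fast` and the `<unk>` placeholder are hypothetical, and the function assumes the IDs run from 0 to `len(df) - 1`.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({'id': [0, 1, 2]}, index=['the', 'cat', 'sat'])  # toy conversion table

def decode_fast(enc, df, unknown='<unk>'):
    # hypothetical variant of decode(); position i of id_to_word holds the word with ID i
    id_to_word = df.sort_values('id').index.to_numpy()
    return ' '.join(id_to_word[i] if i >= 0 else unknown for i in enc)

decoded = decode_fast(np.array([0, 1, 2, -1, 0]), toy_df)
print(decoded)
```

Building the lookup array once also avoids filtering the whole DataFrame for every token.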

This is how the article is reconstructed (without punctuation and capitalisation, since those are removed when tokenising):

In [None]:
decode(enc_art_1, df)

Exporting the preprocessed articles:

In [None]:
def str_encode(text, df):
    # converts the array of IDs to a string so that it can be written to CSV format
    return ' '.join([str(i) for i in encode(text, df)])
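Whoever reads the CSV files later has to turn these strings back into arrays. How `Collocation.ipynb` does this is not shown here, so the following is only a sketch of one possible inverse, with `str_decode` as a hypothetical name.

```python
import numpy as np

def str_decode(encoded):
    # hypothetical inverse of str_encode(): "3 -1 7" becomes array([3, -1, 7])
    return np.array([int(tok) for tok in encoded.split()], dtype=np.int64)

ids = np.array([3, -1, 7])
as_text = ' '.join(str(i) for i in ids)  # same joining step as str_encode()
restored = str_decode(as_text)
print(as_text, restored)
```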

Let's encode a small portion of the dataset to see what the resulting DataFrame will look like:

In [None]:
start_row, end_row = 0, 1000

batch_article = docs[start_row:end_row]
batch_highlights = dataset['highlights'][start_row:end_row]
batch_id = dataset['id'][start_row:end_row]
batch_encoding = [str_encode(text, df) for text in batch_article]
batch_df = pd.DataFrame({'article': batch_article, 'highlights': batch_highlights, 'encoding': batch_encoding, 'id':batch_id})
batch_df['label'] = np.nan
batch_df

The following cell writes the entire preprocessed dataset to disk. It uses multiple files to avoid getting files that are too large.

Since the Hugging Face dataset contains almost 300,000 rows, this may take a while.

In [None]:
csv_size = 10000 # number of rows per csv file
filepath = 'test/' # directory the files are written to


start_row = 0 # first row to start encoding
while start_row < n_docs:
    # calculate slice of dataset and generate file name
    end_row = min(start_row + csv_size, n_docs) # make sure not to exceed last row
    filename = str(start_row) + '-' + str(end_row) + '.csv'
    print('preparing', filename)
    
    # process the data and store it in a DataFrame
    batch_article = docs[start_row:end_row]
    batch_highlights = dataset['highlights'][start_row:end_row]
    batch_id = dataset['id'][start_row:end_row]
    batch_encoding = [str_encode(text, df) for text in batch_article]
    batch_df = pd.DataFrame({'article': batch_article, 'highlights': batch_highlights, 'encoding': batch_encoding, 'id':batch_id})
    batch_df['label'] = np.nan
    
    # write the file to disk
    write_to = filepath + filename
    
    os.makedirs(filepath, exist_ok=True)
    batch_df.to_csv(write_to, index=True)
    print('wrote to file:', write_to)
    
    start_row = end_row
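To use the exported files again, they have to be read back in row order. The sketch below shows one way this could be done. `load_batches` is a hypothetical helper, and the demonstration writes two tiny made-up files to a temporary directory rather than touching the real output.

```python
import glob
import os
import tempfile

import pandas as pd

def load_batches(folder):
    # hypothetical helper: read every '<start>-<end>.csv' file in order of its start row
    paths = glob.glob(os.path.join(folder, '*.csv'))
    # sort numerically, since a plain string sort would put '100000-...' before '20000-...'
    paths.sort(key=lambda p: int(os.path.basename(p).split('-')[0]))
    return pd.concat((pd.read_csv(p, index_col=0) for p in paths), ignore_index=True)

# tiny demonstration with two made-up batch files
with tempfile.TemporaryDirectory() as folder:
    pd.DataFrame({'encoding': ['0 1', '2 2']}).to_csv(os.path.join(folder, '0-2.csv'))
    pd.DataFrame({'encoding': ['3 4']}).to_csv(os.path.join(folder, '2-3.csv'))
    combined = load_batches(folder)

print(len(combined))
```

Each file carries its own row index starting at 0, so `ignore_index=True` is used to give the combined DataFrame one continuous index.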

