# Learning How to Read: Text Preprocessing

A big part of working with natural languages is preparing a text for later work. In this notebook I demonstrate the process of preparing a document using an e-mail sent out by [arXiv.org](https://arxiv.org/), a website containing preprints of academic papers in the sciences. This particular e-mail contains the names and abstracts of recent papers from select computer science subjects.

I load in the e-mail as a string below.

In [None]:
with open("arxiv_email_cs_030918.txt") as f:
    email_string = f.read()

In [None]:
print(email_string)

In [None]:
email_string[:(1024 * 10)]

In raw form there is a lot of undesired formatting. We may be primarily interested in abstracts, but we will need to identify their locations. Additionally there are newline characters, `\n`, placed in an undesirable way; they separate lines, not necessarily paragraphs. We will see how to deal with these issues, along with how to approach the words in the e-mail.

## Noise and Formatting

This e-mail contains multiple abstracts, and we may view each one as its own document. Besides the abstract itself, the title and authors may be interesting information. Formatting also separates one abstract from another. All other formatting and information should be removed, and the hard line breaks inside each abstract should be undone.

We would like to turn this into a corpus of documents, each one containing a single abstract with appropriate formatting, and also track title and authorship information in a separate document. A lot of the work we want to do can be done using regular expressions.

In this e-mail, not all papers have abstracts (they could be revisions to existing papers). Those papers are listed at the bottom of the document and should be excluded.

In [None]:
import re, string
import nltk

In [None]:
abstract = r"""-{78}[\n]                        # A line of ------
               [\\]{2}[\n]                      # Two slashes, new line
               arXiv:[0-9.]+[\n]                # arXiv number
               Date:.*                          # Has a date
               [\n]{2}                          # Two new lines
               Title:.*[\n](?:.*[\n])*?         # Capture Title field, including multiple lines
               Authors:.*[\n](?:.*[\n])*?       # Capture Author field, ...
               Categories:.*[\n](?:.*[\n])*?    # Capture Categories field, ...
               (?:Comments:.*[\n](?:.*[\n])*?)? # If a Comments field exists, capture it too, ...
               [\\]{2}[\n]                      # Check for an isolated \\ ; this helps catch articles w/ abstracts
               \s.*[\n](?:.*[\n])*?             # Abstracts starts with a space; then get the rest of the content
               [\\]{2}.*[\n]                    # Line ends with \\ ( ... ) so get this
               -{78}                            # Final line of -------
            """

abstract_strs = re.findall(abstract, email_string, re.X)    # Using re.X allows us to split up our regex and add comments
print(abstract_strs[2])

In [None]:
len(abstract_strs)
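
To see how the pieces of a verbose pattern fit together, here is a minimal sketch on a made-up text. The `sample` string and the simplified `record` pattern are assumptions for illustration only; they loosely imitate the e-mail layout and are not the real arXiv format.

```python
import re

# Hypothetical two-record text; the layout only loosely imitates the e-mail
sample = (
    "-----\n"
    "Title: A first paper\n"
    "  with a wrapped title\n"
    "Authors: A. Author, B. Author\n"
    "-----\n"
    "Title: A second paper\n"
    "Authors: C. Author\n"
    "-----\n"
)

record = r"""-{5}\n                     # A line of five dashes
             Title:.*\n(?:.*\n)*?       # Title field, possibly wrapped over several lines
             Authors:.*\n               # Authors field on one line
             (?=-{5})                   # Look ahead for the next dashed line without consuming it
          """

records = re.findall(record, sample, re.X)
print(len(records))
print(records[0])
```

The lazy `(?:.*\n)*?` takes as few extra lines as possible, so the title stops as soon as an `Authors:` line can match.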

`abstract_strs` contains abstract substrings; now we want to extract abstracts and other data. Our earlier regular expression can be modified to extract this data.

In [None]:
abstract_title = r"""Title: (.*(?:.*[\n])*?)Authors"""

abstract_title_strs = list(map(lambda x: re.findall(abstract_title, x)[0][:-1], abstract_strs))    # [:-1] to remove \n
abstract_title_strs

In [None]:
abstract_authors = r"""Authors: (.*(?:.*[\n])*?)Categories"""

abstract_authors_strs = list(map(lambda x: re.findall(abstract_authors, x)[0][:-1], abstract_strs))
abstract_authors_strs

In [None]:
abstract_text = r"""[\\][\n]\s(.*(?:.*[\n])*?)[\\].*[\n]-"""

abstract_text_strs = list(map(lambda x: re.findall(abstract_text, x)[0][1:-1], abstract_strs))
abstract_text_strs
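
The extraction idea can also be sketched with named capture groups and a single `re.search()` call. This is a variant rather than the approach used above, and the `entry` string is a hypothetical record with fewer fields than a real one.

```python
import re

# Hypothetical record; real entries carry more fields
entry = (
    "Title: Learning to read\n"
    "  with regular expressions\n"
    "Authors: A. Author and B. Author\n"
    "Categories: cs.CL\n"
)

# (?s) lets "." match newlines so a field can span wrapped lines;
# the lazy ".*?" stops at the first following field name
fields = re.search(
    r"(?s)Title: (?P<title>.*?)\nAuthors: (?P<authors>.*?)\nCategories:",
    entry,
)

print(repr(fields.group("title")))
print(repr(fields.group("authors")))
```

Named groups make it harder to mix up which capture is which when several fields come out of the same pattern.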

Now we replace all whitespace characters with a single space for every entry of these lists.

In [None]:
abstract_title_strs = list(map(lambda x: re.sub(r"\s+", " ", x), abstract_title_strs))
abstract_authors_strs = list(map(lambda x: re.sub(r"\s+", " ", x), abstract_authors_strs))
abstract_text_strs = list(map(lambda x: re.sub(r"\s+", " ", x), abstract_text_strs))

abstract_title_strs

In [None]:
abstract_authors_strs

In [None]:
abstract_text_strs
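
A minimal sketch of what the whitespace substitution does, using a hypothetical wrapped string. It also shows `" ".join(s.split())`, an alternative I am assuming is acceptable here, which additionally drops leading and trailing whitespace.

```python
import re

wrapped = "We study  tokenizers\nfor scientific\n  abstracts.\n"    # hypothetical text

# Every run of spaces, tabs, and newlines becomes one space
collapsed = re.sub(r"\s+", " ", wrapped)
print(repr(collapsed))

# split() with no argument discards all whitespace, including at the ends
tidy = " ".join(wrapped.split())
print(repr(tidy))
```

The `re.sub()` version keeps a single trailing space where the final newline was; whether that matters depends on what is done with the string afterwards.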

For authors we may want consistent formatting (notice that names are sometimes separated by `,` and sometimes by "and"). Let's address that.

In [None]:
abstract_authors_strs = list(map(lambda x: re.sub(r",(?: and\b)?", " and", x), abstract_authors_strs))    # ", and" also becomes " and"
abstract_authors_strs
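
If a list of individual names is more convenient than a single string, the separators can be split on instead of replaced. This is a sketch under the assumption that names never contain a comma or a standalone "and"; `split_authors` is a hypothetical helper, not part of the notebook.

```python
import re

def split_authors(authors):
    """Split an author string on commas and on the word "and"."""
    # ",", ", and" and " and " all count as separators
    parts = re.split(r"\s*,\s*(?:and\s+)?|\s+and\s+", authors)
    return [p for p in parts if p]

names = split_authors("A. Author, B. Author and C. Author")
print(names)
print(" and ".join(names))
```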

## Tokenization

Tokenization separates a sentence into tokens, which are words, parts of words (for example, we may separate `it's` into `it` and `'s`), or punctuation. The naïve approach is to split on spaces.

In [None]:
abstract_text_strs[0].split(' ')

NLTK provides smarter tokenizers for us to use, though. There are several options to choose from, but we'll keep it simple and use `wordpunct_tokenize()`.

In [None]:
from nltk.tokenize import wordpunct_tokenize

In [None]:
print(wordpunct_tokenize(abstract_text_strs[0]))

In [None]:
abstract_text_structs = list(map(wordpunct_tokenize, abstract_text_strs))
abstract_text_structs
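
To make the difference between the two tokenizers concrete, here is a small sketch using only the standard library. The regular expression `\w+|[^\w\s]+` is, to my understanding, the pattern NLTK documents for its word/punctuation tokenizer, but treat that as an assumption; the sentence is made up.

```python
import re

sentence = "It's a well-known result (see arXiv)."    # hypothetical sentence

# Splitting on spaces leaves punctuation glued to the words
naive = sentence.split(" ")
print(naive)

# Runs of word characters, or runs of non-space punctuation
tokens = re.findall(r"\w+|[^\w\s]+", sentence)
print(tokens)
```

Note that this style of tokenizer breaks `It's` into three tokens and keeps adjacent punctuation marks together as one token.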

## Part of Speech Tagging

The next step is to tag the words with part of speech tags, which label each word in a sentence with its part of speech classification (for example, "book" can be a noun and "he" is a pronoun). I tag the original words rather than their stems, since stemming comes later in this notebook.

The recommended part of speech tagger from NLTK is `pos_tag()`, though the package offers many taggers and facilities for training a tagger.

For each word, the tagger creates a tuple, with the first string in the tuple being the word, and the second being the word's part of speech classification.

In [None]:
from nltk.tag import pos_tag, pos_tag_sents

In [None]:
pos_tag(abstract_text_structs[0])

In [None]:
nltk.help.upenn_tagset()    # See what tags mean

In [None]:
abstract_text_structs = pos_tag_sents(abstract_text_structs)
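
Once the words are `(word, tag)` tuples, ordinary Python is enough to work with them. The sketch below uses hand-written pairs standing in for tagger output; the tags are my assumption of what a Penn Treebank style tagger might produce, not actual results.

```python
from collections import Counter

# Hand-written (word, tag) pairs standing in for tagger output
tagged = [("we", "PRP"), ("propose", "VBP"), ("a", "DT"), ("new", "JJ"),
          ("model", "NN"), ("for", "IN"), ("parsing", "VBG"), ("sentences", "NNS")]

# Noun tags in the Penn Treebank tagset all start with "NN"
nouns = [word for word, tag in tagged if tag.startswith("NN")]
tag_counts = Counter(tag for _, tag in tagged)

print(nouns)
print(tag_counts)
```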

## Stemming

Are the words "run" and "running" the same? If we think so, we may want to use a stemmer to reduce both words to "run", the stem they share.

NLTK provides stemmers, one of which is the popular Snowball stemmer.

In [None]:
from nltk.stem import SnowballStemmer

In [None]:
stemmer = SnowballStemmer("english")
stemmer.stem("running")

In [None]:
# Stem each word while keeping its part of speech tag
abstract_text_structs = [[(stemmer.stem(w[0]), w[1]) for w in a] for a in abstract_text_structs]
abstract_text_structs
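
To illustrate the shape of this step without NLTK, here is a toy suffix stripper applied to tagged pairs. `crude_stem` is a hypothetical stand-in and is much cruder than the Snowball algorithm; it only shows how the word changes while the tag is carried along.

```python
def crude_stem(word):
    """Toy suffix stripper; NOT the Snowball algorithm."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three letters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    # Undo a doubled final consonant: "runn" -> "run"
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiouls":
        word = word[:-1]
    return word

tagged = [("Running", "VBG"), ("runs", "VBZ"), ("run", "VB")]    # hypothetical tagger output
stemmed = [(crude_stem(word), tag) for word, tag in tagged]
print(stemmed)
```

All three words collapse to the same stem, yet their tags still record that they were different forms.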

## Removing Stopwords and Punctuation

Words like "and", "the", "a", etc. don't distinguish documents, so we want to remove them. Also we are not particularly interested in punctuation, so we will remove that as well.

In [None]:
from nltk.corpus import stopwords
import string

In [None]:
stopwords.words("english")

In [None]:
string.punctuation

In [None]:
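stop_set = set(stopwords.words("english"))    # Build the lookup set once instead of once per word
# Drop stopwords and tokens made up entirely of punctuation, such as "," or ")."
abstract_text_structs = [[w for w in a if w[0] not in stop_set and
                          not all(c in string.punctuation for c in w[0])] for a in abstract_text_structs]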
abstract_text_structs
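
A small sketch of the filtering idea on hypothetical tagged tokens, with a tiny hand-written stopword set standing in (as an assumption) for NLTK's English list. It also shows that `token in string.punctuation` is a substring test on a string, so a multi-character token such as `).` is only caught by checking each character.

```python
import string

# Tiny hand-picked stopword set; an assumption standing in for NLTK's English list
stop_set = {"the", "a", "and", "of", "we"}

# Hypothetical stemmed, tagged tokens
tagged = [("we", "PRP"), ("studi", "VBP"), ("the", "DT"), ("model", "NN"),
          (",", ","), ("data", "NNS"), (").", ".")]

def is_punct(token):
    """True when every character of the token is punctuation."""
    return all(ch in string.punctuation for ch in token)

kept = [(w, t) for w, t in tagged if w not in stop_set and not is_punct(w)]
print(kept)

# Substring test on a string: "," is found, but ")." is not
print("," in string.punctuation, ")." in string.punctuation)
```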

We will save our work in a collection of files organized as a tagged corpus.

In [None]:
import pandas as pd
import os

In [None]:
ids = pd.DataFrame({"title": abstract_title_strs, "authors": abstract_authors_strs},
                   index=pd.Index(["abs" + str(i) for i in range(len(abstract_text_strs))]))
ids

In [None]:
ids.to_csv("abstracts_id.csv")

' '.join(w[0] + "/" + w[1] for w in abstract_text_structs[0])

In [None]:
os.makedirs("abstracts", exist_ok=True)    # Don't fail if the directory already exists

In [None]:
for a, name in zip(abstract_text_structs, ids.index):
    abstract_file_text = ' '.join(w[0] + "/" + w[1] for w in a)
    with open("abstracts/" + name + ".txt", mode='x') as f:
        f.write(abstract_file_text)
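
The files use a plain `word/tag` format, with tokens separated by spaces. The sketch below writes that format for a few hypothetical tokens and parses it back with the standard library alone, to show that no information is lost.

```python
tagged = [("model", "NN"), ("learn", "VBZ"), ("fast", "RB")]    # hypothetical tokens

# Same formatting as used for the files: word/tag pairs joined by spaces
line = " ".join(word + "/" + tag for word, tag in tagged)
print(line)

# Split each token on its last "/" so a word containing "/" stays intact
parsed = [tuple(token.rsplit("/", 1)) for token in line.split()]
print(parsed == tagged)
```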

We can now handle these files as a tagged NLTK corpus.

In [None]:
from nltk.corpus.reader.tagged import TaggedCorpusReader

In [None]:
abstracts_dir = os.path.abspath('abstracts')
abstract_corpus = TaggedCorpusReader(abstracts_dir, r".*\.txt")    # Raw string so \. is passed to the regex engine unchanged

In [None]:
abstract1 = nltk.Text(abstract_corpus.words("abs0.txt"))
abstract1[:5]

In [None]:
abstract1.plot()

In [None]:
abstract1.collocations()

In [None]:
abstract_corpus.tagged_words()

In [None]:
abstract_corpus.tagged_words(fileids=["abs10.txt"])

We now have a (small) dataset ready for later work.