Processing large amounts of text data can be challenging, which is why optimizing each preprocessing step is crucial for a real-life NLP project.

In this article I walk through several problems you may run into while cleaning your text, in particular when preparing it for topic modelling.

In [10]:
from time import time
import pandas as pd

## 1. PDF text extraction

This may seem niche, but I'll include it since it could be useful to more than one reader.

When extracting text from PDF files, you may be tempted to use the PyPDF2 package, since it is recommended in most Google results for the query "how to extract text from pdf". However, the results this package gives are quite suboptimal and miss large amounts of text. I recommend the tika package instead: its parser is not perfect, but it greatly increased the amount of properly extracted text.

In [13]:
import requests
import PyPDF2

import io
from tika import parser
from bs4 import BeautifulSoup

In [4]:
pdf_file = "pdf_es/CCU 2019 - 30 marzov6 EEFF completos.pdf"

In [5]:
def extract_content(path_or_url):
    """
    Extract the plain text of a PDF with PyPDF2.
    Despite the parameter name, nothing is downloaded here: the argument is
    passed straight to PdfFileReader, which expects a file path or a binary stream.
    """
    # open the PDF (PdfFileReader is the PyPDF2 < 3.0 API)
    pdf = PyPDF2.PdfFileReader(path_or_url)
    # extract the text of every page
    text = [pdf.getPage(i).extractText() for i in range(pdf.getNumPages())]
    # return concatenated content
    return "\n".join(text)

In [6]:
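def extract_content_tika(path_or_url):
    file_data = []
    _buffer = io.StringIO()
    # ask tika for XHTML so that page boundaries (<div class="page">) are kept
    data = parser.from_file(path_or_url, xmlContent=True)
    xhtml_data = BeautifulSoup(data['content'])
    for content in xhtml_data.find_all('div', attrs={'class': 'page'}):
        _buffer.write(str(content))
        parsed_content = parser.from_buffer(_buffer.getvalue())
        # rewind before truncating: truncate() alone cuts at the current
        # position (the end), which would keep every earlier page in the buffer
        _buffer.seek(0)
        _buffer.truncate()
        # the parsed content can be None when a page has no text
        file_data.append(parsed_content['content'] or "")
    return "\n".join(file_data)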
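
When an `io.StringIO` buffer is reused across pages, as in `extract_content_tika`, it has to be rewound before it is truncated, otherwise earlier pages stay in it. The following is a minimal sketch using only the standard library; the page strings are made up for illustration.

```python
import io

# Without rewinding: truncate() cuts at the current position, which is the
# end of the buffer right after a write, so nothing is removed.
buf = io.StringIO()
buf.write("page 1")
buf.truncate()
buf.write(" page 2")
without_rewind = buf.getvalue()

# With rewinding: seek(0) moves the position to the start, so truncate()
# empties the buffer before the next page is written.
buf = io.StringIO()
buf.write("page 1")
buf.seek(0)
buf.truncate()
buf.write("page 2")
with_rewind = buf.getvalue()

print(repr(without_rewind))
print(repr(with_rewind))
```

In the first case the second page is appended to the first one, so each later page would be parsed together with everything before it.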

In [7]:
start_time = time()
text_pdf2 = extract_content(pdf_file)
delta_time = int(time() - start_time)
print(f"Time taken to extract text using pyPDF2: {delta_time}s.")

Time taken to extract text using pyPDF2: 45


In [14]:
start_time = time()
text_tika = extract_content_tika(pdf_file)
delta_time = int(time() - start_time)
print(f"Time taken to extract text using tika: {delta_time}s.")

2020-12-21 15:47:58,476 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...


Time taken to extract text using tika: 77
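
The same start/stop timing pattern recurs in several cells. One way to avoid repeating it is a small context manager, sketched here under the hypothetical name `timed`; it uses `time.perf_counter`, a clock intended for measuring intervals, and stores each duration in a dictionary supplied by the caller.

```python
from contextlib import contextmanager
from time import perf_counter

@contextmanager
def timed(label, results):
    """Time the enclosed block and store the duration (seconds) in results[label]."""
    start = perf_counter()
    try:
        yield
    finally:
        results[label] = perf_counter() - start
        print(f"Time taken to {label}: {results[label]:.2f}s.")

# usage with a stand-in workload instead of a real PDF extraction
results = {}
with timed("run a tiny loop", results):
    total = sum(range(1000))
```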


In [15]:
print(f"Length of text extracted using pyPDF2: {len(text_pdf2)} characters")

Length of text extracted using pyPDF2: 1748569 characters


In [16]:
print(f"Length of text extracted using tika: {len(text_tika)} characters")

Length of text extracted using tika: 492704776 characters


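The tika output is about 281 times as long as the PyPDF2 output (492,704,776 vs. 1,748,569 characters), so tika gives us far more text to work with. Part of that gap is probably duplication rather than recovered text: unless the buffer in `extract_content_tika` is rewound with `seek(0)` before `truncate()`, every page is parsed again together with all the pages before it. Compare a few pages by eye before relying on the ratio.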
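
Character count alone cannot tell recovered text from repeated text. One rough check, sketched here with a hypothetical helper `duplicate_line_ratio`, is to measure what share of the non-empty lines are repeats of an earlier line. Legitimately repeated lines such as page headers will also count, so treat the number as a hint rather than a verdict.

```python
from collections import Counter

def duplicate_line_ratio(text):
    """Share of non-empty lines that repeat an earlier line (0.0 means no repeats)."""
    lines = [ln.strip() for ln in text.split("\n") if ln.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    # every occurrence beyond the first one is a repeat
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(lines)

# tiny made-up sample: "a" appears three times, so two of the four lines are repeats
sample_ratio = duplicate_line_ratio("a\nb\na\na")
print(sample_ratio)
```

Calling it on both extracted strings would show whether one output is dominated by repeats.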

## 2. Cleaning text

From this point on, the main discussion is why relying on the pandas `.apply()` method is a bad idea and how to replace it with much more efficient methods. We'll use the text extracted with tika, since that is the extraction method recommended above.

In [30]:
import string
import re

In [60]:
text_df = pd.DataFrame({"ext": ["tika", "pypdf2"], "text": [text_tika, text_pdf2]})

The first instinct for many people is "let's process these strings of text on a per-row basis, so that my function is applied to each row like a map function over a list", since that's what feels most intuitive to most of us. So they write a per-row, apply-style function for their pandas data and end up with something like this:

In [58]:
def clean_content_row(text):
    # keep printable ASCII plus Spanish accented letters and ñ; drop everything else
    printable = set(string.printable).union(['á','é','í','ó','ú','Á','É','Í','Ó','Ú','ñ','Ñ'])
    text = ''.join(filter(lambda x: x in printable, text))
    lines = []
    prev = ""
    # aggregate consecutive lines where text may be broken down:
    # merge when the line starts with a space or the previous one does not end with a dot
    for line in text.split('\n'):
        if line.startswith(' ') or not prev.endswith('.'):
            prev = prev + ' ' + line
        else:
            # new paragraph
            lines.append(prev)
            prev = line
    # don't forget left-over paragraph
    lines.append(prev)

    # clean paragraphs from extra space, unwanted characters, urls, etc.
    # best effort clean up, consider a more versatile cleaner
    sentences = []
    for line in lines:
        # removing header number
        line = re.sub(r'^\s?\d+(.*)$', r'\1', line)
        # removing trailing spaces
        line = line.strip()
        # words may be split between lines: drop the spaces around the hyphen
        # (the hyphen itself is kept)
        line = re.sub(r'\s?-\s?', '-', line)
        # remove space prior to punctuation
        line = re.sub(r'\s?([,:;\.])', r'\1', line)
        # ESG contains a lot of figures that are not relevant to grammatical structure
        line = re.sub(r'\d{5,}', r' ', line)
        # remove mentions of URLs
        line = re.sub(r'((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\.([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:@\-_=#])*', r' ', line)
        # remove multiple spaces
        line = re.sub(r'\s+', ' ', line)
        sentences.append(line)
    return sentences
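
The line-merging rule at the top of `clean_content_row` is easier to follow on a tiny input. Below is a self-contained sketch of that step alone; the name `merge_broken_lines` and the sample text are made up for illustration, and a final `strip()` is added only to make the result easier to read.

```python
def merge_broken_lines(text):
    """Join lines that belong to the same paragraph.

    A line is glued to the previous one when it starts with a space or when
    the previous text does not end with a dot; otherwise it starts a new paragraph.
    """
    paragraphs = []
    prev = ""
    for line in text.split("\n"):
        if line.startswith(" ") or not prev.endswith("."):
            prev = prev + " " + line
        else:
            paragraphs.append(prev)
            prev = line
    # don't forget the left-over paragraph
    paragraphs.append(prev)
    return [p.strip() for p in paragraphs]

sample = "Los ingresos\naumentaron.\nLos costos\nbajaron."
merged = merge_broken_lines(sample)
print(merged)
```

The four input lines collapse into two paragraphs, one per sentence.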

In [76]:
start_time = time()
statements_apply = text_df[text_df.ext=="tika"]["text"].apply(clean_content_row)
delta_time = int(time() - start_time)
print(f"Time taken to process text using apply: {delta_time}s.")

Time taken to process text using pyPDF2: 429s.


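I propose a different way of processing the text: using what are called vectorized functions.

Pandas provides a nice set of vectorized string functions, available through the `.str` accessor, which operate on a whole column of data at once instead of being called row by row from your own code.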
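
As a minimal illustration of the two styles, the sketch below collapses whitespace in a made-up two-row Series, once with a per-row `.apply()` and once with the `.str.replace()` string method. Both give the same result; the difference is only in how the operation is expressed and dispatched.

```python
import re

import pandas as pd

docs = pd.Series(["uno   dos", "tres\tcuatro  cinco"])

# per-row style: a Python function is called once for every row
via_apply = docs.apply(lambda s: re.sub(r"\s+", " ", s))

# column style: one call on the whole Series through the .str accessor
via_str = docs.str.replace(r"\s+", " ", regex=True)

print(via_str.tolist())
```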

In [81]:
def clean_content_vectorized(text):
    text = text.str.replace('\n', ' ', regex=False)
    text = text.str.replace(r'([a-zA-Z0-9]+)\-(?: *)([a-zA-Z0-9]+)', r'\1\2', regex=True) # join words split by a hyphen
    text = text.str.replace(r'(http|https)://[^ ]+', '', regex=True) # remove links
    text = text.str.replace(r'[^\w \.]', '', regex=True) # remove special characters
    text = text.str.replace(r'[0-9_]', '', regex=True) # remove digits and underscores
    text = text.str.replace(r'\s+', ' ', regex=True) # collapse runs of whitespace into one space
    text = text.str.replace(r'[ \.]{2,}', '.', regex=True) # collapse runs of dots and spaces into one dot
    return text.str.split('.')
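
One detail of the hyphen-joining pattern matters for Spanish text: the character class `[a-zA-Z0-9]` does not include accented letters, so a word broken right after an accented letter is not rejoined. The sketch below uses only the `re` module and a made-up sample to compare it with a `\w`-based variant. The variant is an assumption of mine rather than part of the original cleaner, and both patterns also merge legitimately hyphenated words.

```python
import re

# pattern used in clean_content_vectorized (ASCII letters and digits only)
ascii_join = re.compile(r"([a-zA-Z0-9]+)-(?: *)([a-zA-Z0-9]+)")
# hypothetical variant: \w is Unicode-aware for str patterns in Python 3
unicode_join = re.compile(r"(\w+)-(?: *)(\w+)")

sample = "produc- tos del sector pú- blico"

ascii_result = ascii_join.sub(r"\1\2", sample)
unicode_result = unicode_join.sub(r"\1\2", sample)

print(ascii_result)
print(unicode_result)
```

The ASCII pattern rejoins "produc- tos" but leaves "pú- blico" untouched, while the `\w` variant rejoins both.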

In [82]:
start_time = time()
statements_vectorized = clean_content_vectorized(text_df[text_df.ext=="tika"]["text"])
delta_time = int(time() - start_time)
print(f"Time taken to process text using vectorized functions: {delta_time}s.")

Time taken to process text using vectorized functions: 115s.


We see a nearly 4-fold speedup (429s vs. 115s) when changing from the apply method to vectorized functions. Note that the gain may be much larger if your text cleaning process is simpler, since much of what we do in these examples corresponds to cleaning up characters that arise from the PDF extraction process.

## 3. Lemmatization

spaCy seems to be the go-to package for real-life NLP projects, since it's very convenient out of the box and performs well. However, when the text is not in English, the results of the model are not what we would expect, so a better way to lemmatize the text is needed.

In the following section we compare the results of lemmatizing with spaCy's model against those of a newer package that we found, Stanza.

In [88]:
from functools import partial
from multiprocessing import Pool

import numpy as np
import spacy
import stanza
from nltk.corpus import stopwords

In [86]:
def parallelize(df, func, cores):
    num_of_processes = cores
    data_split = np.array_split(df, num_of_processes)
    pool = Pool(num_of_processes)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data

def run_on_subset(func, extra_data, data_subset):
    # data_subset is a Series; extra_data holds the extra positional arguments for func
    return data_subset.apply(func, args=tuple(extra_data or ()))

def multiprocess_apply(df, func, cores=8, data=None):
    return parallelize(df, partial(run_on_subset, func, data), cores)
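
The helpers above rely on `functools.partial` to bind the function and its extra arguments, so that `pool.map` only has to pass each data chunk. The sketch below shows that mechanism in isolation. To stay self-contained it swaps the process pool for the built-in `map` and the Series for plain lists; `add_suffix` and the sample chunks are made up for illustration.

```python
from functools import partial

def run_on_subset(func, extra_args, chunk):
    # apply func to every item of the chunk, passing the extra arguments along
    return [func(item, *extra_args) for item in chunk]

def add_suffix(text, suffix):
    return text + suffix

chunks = [["a", "b"], ["c"]]

# partial binds the first two parameters; the resulting callable takes only a chunk
worker = partial(run_on_subset, add_suffix, ("!",))

# built-in map stands in for pool.map here
results = [out for part in map(worker, chunks) for out in part]
print(results)
```

Note that the extra arguments are given as a one-element tuple `("!",)`; without the trailing comma, `("!")` is just the string `"!"`.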

In [None]:
def lemmatize_spacy(row, nlp):
    # a spaCy Doc iterates over its tokens; the lemma string is token.lemma_
    doc = nlp(row)
    return " ".join([token.lemma_ for token in doc if token.lemma_ not in stop_words])

In [92]:
def lemmatize_stanza(row, nlp):
    doc = nlp(row)
    sentence = doc.sentences[0]
    return " ".join([word.lemma for word in sentence.words if word.lemma not in stop_words])

In [91]:
stop_words = set(stopwords.words('spanish'))  # a set makes the membership test in the lemmatizers fast
nlp_gpu = stanza.Pipeline('es', processors='pos, tokenize, lemma', use_gpu=True)
nlp_cpu = stanza.Pipeline('es', processors='pos, tokenize, lemma', use_gpu=False)

2020-12-21 18:03:49 INFO: Loading these models for language: es (Spanish):
| Processor | Package |
-----------------------
| tokenize  | ancora  |
| pos       | ancora  |
| lemma     | ancora  |

2020-12-21 18:03:49 INFO: Use device: gpu
2020-12-21 18:03:49 INFO: Loading: tokenize
2020-12-21 18:03:49 INFO: Loading: pos
2020-12-21 18:03:50 INFO: Loading: lemma
2020-12-21 18:03:50 INFO: Done loading processors!
2020-12-21 18:03:50 INFO: Loading these models for language: es (Spanish):
| Processor | Package |
-----------------------
| tokenize  | ancora  |
| pos       | ancora  |
| lemma     | ancora  |

2020-12-21 18:03:50 INFO: Use device: cpu
2020-12-21 18:03:50 INFO: Loading: tokenize
2020-12-21 18:03:50 INFO: Loading: pos
2020-12-21 18:03:51 INFO: Loading: lemma
2020-12-21 18:03:51 INFO: Done loading processors!


In [None]:
esg_lemma_gpu = esg["sentences"][:1000].apply(lemmatize_stanza, args=(nlp_gpu,))

In [None]:
esg_lemma_multicore = multiprocess_apply(esg["sentences"][:1000], lemmatize_stanza, cores=8, data=[nlp_cpu])

# esg_lemma_multicore = multiprocess_apply(esg["sentences"][:10000], lemmatize_stanza, cores=8, data=[nlp_cpu])