# Teksta izgūšana | Text extraction

**LV**: Teksta korpusa izveide ir viens no priekšnoteikumiem daudzos valodas modelēšanas uzdevumos. Tā kā vienkāršs teksts (*plain-text*) bieži vien nav pieejams, nākas nodarboties ar vienkārša teksta izguvi no dokumentiem, kas pieejami citos formātos.

Šajā piezīmju grāmatiņā ir apskatīti trīs vienkāršoti gadījumi teksta izguvei no dažāda formāta dokumentiem: HTML, PDF un VERT.

Populārajiem dokumentu formātiem (PDF, DOCX, HTML u.c.) eksistē dažādas *Python* bibliotēkas, kuras var izmantot vienkāršā teksta izguves uzdevumam, kā tas ir nodemonstrēts šajā nodarbībā.

---

**EN**: Creating a text corpus is one of the prerequisites for many language modeling tasks. Since plain text is often not readily available, it has to be extracted from documents that are available in other formats.

This notebook covers three simplified cases for extracting text from documents in various formats: HTML, PDF and VERT.

For the popular document formats (PDF, DOCX, HTML, etc.), there are various Python libraries that can be used for the plain-text extraction task as demonstrated in this session.

## HTML-to-Text

**LV**: Viena no populārākajām un vienkāršāk izmantojamajām *Python* bibliotēkām HTML formāta dokumentu parsēšanai un satura "noskrāpēšanai" (*web scraping*) ir *BeautifulSoup* (https://pypi.org/project/beautifulsoup4/).

*BeautifulSoup* savukārt izmanto zemāka līmeņa HTML/XML parsētāju: var tikt izmantots gan *Python* iebūvētais `html.parser`, gan ārējas bibliotēkas (piem., `lxml`, `html5lib`), kas nodrošina dažādas priekšrocības (piem., ātrdarbību un papildu funkcionalitāti). Šajā demonstrācijā ir izmantots iebūvētais HTML parsētājs.

---

**EN**: One of the most popular and easy-to-use Python libraries for parsing HTML documents and web scraping is BeautifulSoup (https://pypi.org/project/beautifulsoup4/).

BeautifulSoup relies on a lower-level HTML/XML parser: either Python's built-in `html.parser` or an external library (e.g. `lxml`, `html5lib`). The external libraries can offer advantages in performance and functionality, but this demo uses the built-in HTML parser.
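
To give a rough idea of what the built-in parser does on its own, the following sketch uses `html.parser.HTMLParser` from the standard library directly, without BeautifulSoup. The class name `TextCollector`, the list of skipped tags and the sample markup are assumptions made for illustration only; this is not how BeautifulSoup is implemented internally.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects text, skipping the content of some tags."""

    def __init__(self, skipped_tags=("header", "footer", "script", "style")):
        # convert_charrefs=True decodes entities such as &amp; in the text
        super().__init__(convert_charrefs=True)
        self.skipped_tags = set(skipped_tags)
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skipped_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.skipped_tags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


# Made-up sample markup
sample = "<header>Menu</header><p>Fish &amp; chips</p><footer>Contacts</footer>"

collector = TextCollector()
collector.feed(sample)
collector.close()

extracted = "".join(collector.chunks)
print(extracted)
```

BeautifulSoup builds a navigable tree on top of such parser events, which is what makes calls like `find_all` and `decompose` possible in the cell below.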

In [None]:
!pip install beautifulsoup4

In [None]:
!wget -O "sample_article.html" https://www.vti.lu.lv/par-mums/zinas/zina/t/82316/

In [2]:
from bs4 import BeautifulSoup
import html
import re


# Removes specific HTML elements from a webpage to get rid of needless content.
# For instance: header and footer blocks, menus, etc.
# Needs to be adapted for each website to get the best results.
def remove_html_elements(text):
    soup = BeautifulSoup(text, "html.parser")

    # Filtering by specific HTML elements
    for element in soup.find_all(["header", "footer", "button"]):
        element.decompose() # Removes an element from the tree

    # Filtering by HTML elements having specific attributes
    for element in soup.find_all(["div"], attrs={"class": re.compile(".*([Mm]enu|share|backlink).*")}):
        element.decompose()

    return str(soup)


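# (1) Removes HTML tags while keeping the content; the parser also decodes
#     HTML entities found in the text.
# (2) Unescapes any entities that are still left (e.g. doubly escaped ones).
# Tags are removed first: unescaping first would turn escaped markup such as
# &lt;b&gt; into real tags, which would then be stripped as well.
# For instance: <p>content</p> => content, &amp; => &.
# This function is universal - it can be applied to any webpage from any website.
def convert_to_plaintext(text):
    text = BeautifulSoup(text, "html.parser").get_text() # 1
    text = html.unescape(text)                           # 2
    return text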


# Normalizes spaces and line breaks in the plain-text.
def normalize_white_spaces(text):
    text = re.sub("[ ]+", " ", text)
    text = re.sub("[ ]?\n+", "\n", text) # Try to comment out this line
    return text


# (1) Removal of needless HTML elements.
# (2) Unescaping HTML entities and removal of HTML tags.
# (3) Normalization of whitespaces in the plain-text.
def html_to_txt(html_file, txt_file):
    text = ""

    with open(html_file, "r", encoding="utf-8") as input_file:
        text = input_file.read()

    text = remove_html_elements(text)   # 1
    text = convert_to_plaintext(text)   # 2
    text = normalize_white_spaces(text) # 3

    with open(txt_file, "w", encoding="utf-8") as output_file:
        output_file.write(text)


html_to_txt("sample_article.html", "sample_article.txt")
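
To see what each substitution in `normalize_white_spaces` contributes, the two regular expressions can be applied one at a time to a small string. The sample string below is made up for illustration; the patterns are the same as in the function above.

```python
import re

# Made-up sample with repeated spaces, trailing spaces and empty lines
sample = "Title   \n\n\n  First   paragraph. \nSecond paragraph.\n"

# Step 1: collapse runs of spaces into a single space
step1 = re.sub("[ ]+", " ", sample)

# Step 2: collapse runs of line breaks (and a space right before them)
step2 = re.sub("[ ]?\n+", "\n", step1)

print(repr(step1))
print(repr(step2))
```

Note that a space at the beginning of a line survives both steps, so a real pipeline may need an additional rule for leading whitespace.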

## PDF-to-Text

**LV**: Saistīta teksta izguve no PDF dokumentiem lielākoties ir sarežģīta un ķēpīga: informācija par teksta struktūru un noformējumu bieži vien nav viennozīmīgi izgūstama, un teksta segmentēšana teikumos un rindkopās ir apgrūtināta, jo PDF formāts ir veidots satura drukāšanas nevis mašīnlasīšanas vajadzībām.

Arī PDF dokumentu apstrādei ir pieejamas dažādas ārējās *Python* bibliotēkas, piemēram, `pypdf`, `PyPDF2`, `PDFMiner`, `PyMuPDF`, `tabula-py`. Demonstrācijas nolūkiem izmantosim `pypdf` (https://pypi.org/project/pypdf/).

Plašāk par PDF dokumentu mašīnlasīšanas problemātiku aprakstīts `pypdf` [dokumentācijā](https://pypdf.readthedocs.io/en/stable/user/extract-text.html).

Eksperimentēšanas vērta alternatīva pieeja: konvertēt PDF uz HTML un tālāk strādāt ar HTML dokumentiem. Konvertēšanas funkcionalitāti nodrošina, piemēram, `PDFMiner` (https://pypi.org/project/pdfminer/).

---

**EN**: Retrieving continuous text from PDF documents is generally difficult and cumbersome: information about the text structure and presentation often cannot be unambiguously retrieved. Segmenting such text into sentences and paragraphs is difficult, since the PDF format is designed for printing rather than machine-reading the content.

Various external Python libraries are available for processing PDF documents, such as `pypdf`, `PyPDF2`, `PDFMiner`, `PyMuPDF`, `tabula-py`. For demonstration purposes, we will use `pypdf` (https://pypi.org/project/pypdf/).

The problem of text extraction from PDF documents is described in more detail [here](https://pypdf.readthedocs.io/en/stable/user/extract-text.html).

An alternative approach worth experimenting with is to convert PDF to HTML and then work with the HTML documents. Conversion functionality is provided, for example, by `PDFMiner` (https://pypi.org/project/pdfminer/).

In [None]:
!pip install pypdf

In [None]:
!wget -O "sample_paper.pdf" https://www.apgads.lu.lv/fileadmin/user_upload/lu_portal/apgads/PDF/Valoda-nozime-forma/VNF-10/vnf_10-16_Nespore_Saulite_Rituma.pdf

In [None]:
from pypdf import PdfReader


# Provides a very basic text extraction functionality
def pdf_to_txt(pdf_file, txt_file):
    text = ""

    with open(pdf_file, "rb") as input_file:
        reader = PdfReader(input_file)

        # Reads the document page by page
        for page in reader.pages:
            text += page.extract_text() + "\n"

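    # Normalizes the text before counting lines, so that the number printed
    # below describes the text that is actually written to the file
    text = normalize_white_spaces(text)

    with open(txt_file, "w", encoding="utf-8") as output_file:
        output_file.write(text)

    print("Total number of lines in the text:", text.count("\n"))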


pdf_to_txt("sample_paper.pdf", "sample_paper.txt")

**LV**: Papētot iegūto rezultātu (`sample_paper.txt`), redzams, ka ar tik vienkāršiem soļiem nepietiek, lai no PDF dokumenta izgūtu kvalitatīvu tekstu.

Izteiktākā problēma ir tā, ka izgūtajā tekstā ir saglabāts teksta dalījums rindās un lappusēs tā, kā tas drukas vajadzībām ir izkārtots PDF dokumentā, bet mums ir nepieciešams teksts, kas būtu strukturēts atbilstoši teikumu un rindkopu robežām, nevis dokumenta vizuālajam noformējumam.

Lai iegūtu vēlamo rezultātu (t.i., tuvinātos tam), ir jāveic papildu soļi teksta noformējuma analīzē un atbilstošā pēcapstrādē.

---

**EN**: By examining the obtained result (`sample_paper.txt`), we can see that such a simple approach is not enough to extract quality text from a PDF document.

The most obvious problem is that the retrieved text preserves the line and page split which follows the layout of the PDF document. However, we need the text to be structured according to sentence and paragraph boundaries, not the visual layout of the document.

To obtain the desired result (i.e. to approximate it), additional steps are required to analyze the text layout and formatting and to adjust the post-processing accordingly.

In [None]:
# Analyzes (heuristically) line breaks and merges lines if necessary.
def merge_lines(text):
    lower_case_letter = "[a-zāčēģīķļņšūž]" # FIXME: \p{Ll}

    # If a line ends with a lower case letter followed by a hyphen,
    # we *assume* this is a hyphenation of a word.
    text = re.sub(r"(?<="+lower_case_letter+")[--]\n(?="+lower_case_letter+")", "", text)

    # If a line begins with a lower case letter,
    # we *assume* this is a continuation of a sentence.
    text = re.sub(r"(\n)+(?="+lower_case_letter+")", " ", text)

    return text


# A more elaborate implementation of the basic text extractor
def pdf_to_txt_2(pdf_file, txt_file):
    text = ""

    with open(pdf_file, "rb") as input_file:
        reader = PdfReader(input_file)

        for page in reader.pages:
            text += page.extract_text() + "\n"

    text = merge_lines(normalize_white_spaces(text))

    with open(txt_file, "w", encoding="utf-8") as output_file:
        output_file.write(text)

    print("Total number of lines in the text:", text.count("\n"))


pdf_to_txt_2("sample_paper.pdf", "sample_paper_2.txt")
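
The `FIXME` in `merge_lines` refers to the Unicode property `\p{Ll}`, which Python's built-in `re` module does not support. One possible workaround that stays within the standard library is to test characters with `str.islower()`, which is Unicode-aware. The sketch below is a hypothetical variant under that assumption: the function name `merge_lines_unicode` and the sample text are made up, it only recognizes the plain hyphen `-`, and it expects text that has already been whitespace-normalized.

```python
def merge_lines_unicode(text):
    """Hypothetical variant of merge_lines that relies on str.islower()."""
    merged = []

    for line in text.split("\n"):
        previous = merged[-1] if merged else ""

        # A line beginning with a lower case letter is *assumed*
        # to continue the previous line.
        if previous and line[:1].islower():
            if previous.endswith("-") and previous[-2:-1].islower():
                # Assumed hyphenation: drop the hyphen and join the halves
                merged[-1] = previous[:-1] + line
            else:
                # Assumed continuation of a sentence: join with a space
                merged[-1] = previous + " " + line
        else:
            merged.append(line)

    return "\n".join(merged)


# Made-up sample that imitates the line layout of an extracted PDF page
sample = "Valodas tehno-\nloģijas ir\nšodien svarīgas.\nNākamais teikums."
merged_sample = merge_lines_unicode(sample)
print(merged_sample)
```

As with the regular-expression version, these are heuristics: a sentence that continues on the next line with a capitalized word (e.g. a proper noun) is not merged.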

## VERT-to-Text

**LV**: Dažkārt nākas saskarties ar failu formātiem, kas nav plaši izplatīti vai tiek izmantoti pamatā tikai valodu tehnoloģiju jomā, un to apstrādei nav pieejamas jau gatavas bibliotēkas, vai arī pastāv vairāki varianti, kā attiecīgais datu formāts var tikt realizēts vai interpretēts.

Daži piemēri: CoNLL, VERT, VRT, TSV3. Tie ir specifiski *tab-separated* datu apmaiņas formāti, kas valodu tehnoloģiju jomā tiek izmantoti dažāda veida anotētiem valodas resursiem. Lai izgūtu saistītu tekstu, nepieciešams analizēt faila struktūru, nolasīt nepieciešamos datu laukus un rekonstruēt tekstu.

Demonstrācijas nolūkā apskatīsim VERT formātu, kuru izmanto populārā teksta korpusu platforma [SketchEngine](https://www.sketchengine.eu/my_keywords/vertical/) un tās atvērtā pirmkoda versija [NoSketchEngine](https://nlp.fi.muni.cz/trac/noske). Šis formāts tiek izmantots arī latviešu valodas korpusu kolekcijā [Korpuss.lv](https://korpuss.lv/).

Lai no VERT faila iegūtu vienkāršu, saistītu tekstu ir nepieciešams ievērot teksta segmentēšanu teikumos un teikumu segmentēšanu tekstvienībās atbilstoši VERT failā lietotajam strukturālajam marķējumam.

Demonstrācijai izmantosim atvērto Raiņa tekstu korpusu (2,3 milj. tekstvienību), kas pieejams CLARIN-LV repozitorijā: http://hdl.handle.net/20.500.12574/41

---

**EN**: Sometimes you have to deal with file formats that are not widely used, or are used mainly in the field of language technology, so there might be no ready-made libraries for processing them. There might also be several alternative ways in which such a data format is implemented or interpreted.

Some examples: CoNLL, VERT, VRT, TSV3. These are specific tab-separated file formats used to encode and exchange annotated language resources. To retrieve continuous text, it is necessary to analyze the file structure, read the necessary data fields and reconstruct the text.

For demonstration purposes, we will consider the VERT format used by the popular text corpora platform [SketchEngine](https://www.sketchengine.eu/my_keywords/vertical/) and its open source version [NoSketchEngine](https://nlp.fi.muni.cz/trac/noske). This format is also used by the Latvian corpora collection [Korpuss.lv](https://korpuss.lv/).

In order to obtain plain and continuous text from a VERT file, it is necessary to follow the segmentation of the text into sentences and the segmentation of sentences into tokens according to the structural markup used in the VERT file.

We will use the open text corpus of Rainis (a Latvian poet; 2.3M tokens), which can be downloaded from the CLARIN-LV repository: http://hdl.handle.net/20.500.12574/41
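
As a rough illustration of the markup described above, the sketch below rebuilds a paragraph from a tiny, made-up VERT fragment held in a string. The sample tokens, the lemma and tag columns, and the helper name `vert_to_paragraphs` are assumptions for illustration only; real corpora may use different columns and structural tags.

```python
# Made-up VERT fragment: one token per line, columns separated by tabs,
# structural tags (<doc>, <p>, <s>, <g/>) on lines of their own
SAMPLE_VERT = """<doc id="demo">
<p>
<s>
Labrīt\tlabrīt\tr
<g/>
,\t,\tz
pasaule\tpasaule\tn
<g/>
!\t!\tz
</s>
<s>
Kā\tkā\tr
iet\tiet\tv
<g/>
?\t?\tz
</s>
</p>
</doc>
"""


def vert_to_paragraphs(vert_text):
    """Hypothetical helper: returns a list of reconstructed paragraphs."""
    paragraphs = []
    pieces = []   # tokens and spaces of the current paragraph
    glue = False  # True if the next token must follow without a space

    for line in vert_text.splitlines():
        if not line:
            continue
        if line.replace(" ", "") == "<g/>":
            glue = True
        elif line == "</p>":
            paragraphs.append("".join(pieces))
            pieces = []
        elif line.startswith("<") and not line.startswith("<\t"):
            continue  # other structural tags are ignored in this sketch
        else:
            token = line.split("\t")[0]  # the word form is the first column
            if pieces and not glue:
                pieces.append(" ")
            pieces.append(token)
            glue = False

    return paragraphs


paragraphs = vert_to_paragraphs(SAMPLE_VERT)
print(paragraphs)
```

The sketch keeps all sentences of a paragraph on one line; the function in the cell below applies the same idea to a file, reading it line by line.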

In [None]:
!wget -O "sample_corpus.vert" https://repository.clarin.lv/repository/xmlui/bitstream/handle/20.500.12574/41/rainis_v20180716.vert

In [16]:
# Reads a VERT file, line by line, and reconstructs sentences and paragraphs.
def vert_to_txt(vert_file, txt_file):
    with open(vert_file, "r", encoding="utf-8") as input_file, \
         open(txt_file, "w", encoding="utf-8") as output_file:

        text = ""

        for line in input_file:

            # An empty line: outputs the text constructed so far
            if line == "\n":
                if text != "":
                    output_file.write(text.rstrip(" ") + "\n")
                text = ""

            # If a line contains a tag, it has to be processed accordingly
            # (a line starting with "<\t" is a "<" token, not a tag)
            elif line.startswith("<") and not line.startswith("<\t"):

                # </doc>, </p> - end of a document, paragraph:
                # outputs the text segment constructed so far (without the
                # trailing space) and begins a new one
                if line == "</doc>\n":
                    output_file.write(text.rstrip(" ") + "\n\n")
                    text = ""
                elif line == "</p>\n":
                    output_file.write(text.rstrip(" ") + "\n")
                    text = ""

                # <g/> - "glue" tag - there should be no space between the consecutive tokens,
                # e.g. a word and the following punctuation mark
                elif line in ("<g/>\n", "<g />\n"):
                    text = text.rstrip(" ")

                # Ignores </s> as well as the opening <doc>, <p>, <s> tags (and other tags):
                # the sentences of a paragraph stay on one line, separated by the
                # space that follows the last token of each sentence
                else:
                    continue

            # If a line does not contain a structural tag, it contains a text token,
            # which is the first element of the tab-separated line
            else:
                text = text + line.rstrip("\n").split("\t")[0] + " "

        # Outputs the remaining text if the file does not end with a closing tag
        if text != "":
            output_file.write(text.rstrip(" ") + "\n")


vert_to_txt("sample_corpus.vert", "sample_corpus.txt")