# Regex and digitized books

## Get text data

Let us download one of the digitized books from The Royal Danish Library. The digitized books have been scanned with OCR (Optical Character Recognition) and are available as PDF files.

The book is [Macdonald, James. Travels through Denmark and Part of Sweden during the Winter and Spring of the Year 1809 : Containing Authentic Particulars of the Domestic Condition of Those Countries, the Opinions of the Inhabitants, and the State of Agriculture. 2015.](https://soeg.kb.dk/permalink/45KBDK_KGL/1pioq0f/alma99122806627205763)

We download the book, extract the text content and store it in a variable called full_text.


In [None]:
#! pip install PyPDF2
import requests
from io import BytesIO
from PyPDF2 import PdfReader

# URL to the OCR-scanned PDF version of the text
url = "https://www.kb.dk/e-mat/dod/130014244515_bw.pdf"

# Download the pdf file
response = requests.get(url)
response.raise_for_status()  # Check if the request was successful

# Open the pdf file in memory
pdf_file = BytesIO(response.content)

# Create a PDF reader object
reader = PdfReader(pdf_file)

# Extract text from each page starting from page 5
text_content = []
for page in reader.pages[4:]:
    text_content.append(page.extract_text())

# Join all the text content into a single string
full_text = "\n".join(text_content)

# Print the first 1000 characters of the extracted text
print(full_text[0:1000])

In [None]:
reader.pages[0].extract_text()

## RegEx to clean / preprocess text 

When we work with Regex, the website [regex101.com](https://regex101.com/) is a brilliant tool. It helps both with understanding Regex and with writing a Regex pattern. It is a good idea to take ten minutes to familiarize yourself with the page.

Try inserting this text string in the 'TEXT STRING' field: 

_He observes, that he has not only committed to paper his  
own opinions, but also, those of persons with wliom he con-  
versed in the above-mcnt ioned eountries_ i

In the 'REGULAR EXPRESSION' field you can write this pattern: `\W+`.

What happens when you start typing?


In the field EXPLANATION, at the top of the right side, you can read an explanation of the regex pattern.
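The same pattern can also be tried directly in Python. The sketch below is only an illustration: it assumes a shortened, hand-corrected version of the sample sentence above and shows what `\W+` (one or more non-word characters) matches, and what is left when the string is split on those matches.

```python
import re

# Shortened, hand-corrected version of the sample sentence (an assumption for the demo)
sample = "He observes, that he has not only committed to paper his own opinions"

# \W+ matches one or more characters that are NOT letters, digits or underscores
separators = re.findall(r'\W+', sample)
print(separators)

# Splitting on the same pattern leaves only the words
words = re.split(r'\W+', sample)
print(words)
```

Note that the comma and the space after "observes" come back as a single match, because `+` groups consecutive non-word characters together.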

### Write a function to clean text

We use the [Python function](https://www.w3schools.com/python/python_functions.asp) `clean` below to strip the text of all characters other than letters.


In [None]:
import re
def clean(text): 
    
    # match a variety of punctuation and special characters
    # the backslash \ and the pipe symbol | play important roles, for example here: \?
    # now is a good time to look up what \ and | do
    text = re.sub(r'\.|,|:|;|!|\?|\(|\)|\||\+|\'|\"|‘|’|“|”|\'|\’|…|\-|_|–|—|\$|&|\*|>|<|\/|\[|\]', ' ', text)

    # Regex pattern to match numbers and words containing numbers
    text = re.sub(r'\b\w*\d\w*\b', '', text)
     

    # lower the letters
    text = text.lower()

    # replace sequences of non-word characters ('\W+') with a single space. 
    # The 'strip()' removes any leading or trailing whitespace that could come from the substitution.
    text = re.sub(r'\W+', ' ', text).strip()
    
    return text
    

text = clean(full_text)

In [None]:
text[:3000]
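To see what each cleaning step contributes, it can help to run the steps one at a time on a short string. The sketch below is a simplified illustration: the sample sentence is made up, and the punctuation pattern is a shortened stand-in for the longer one used in `clean`.

```python
import re

# Made-up sample sentence (not taken from the book)
sample = "In 1809, the 2d edition cost 10s. (about half-a-guinea)!"

# Step 1: replace some punctuation marks with a space (shortened pattern)
step1 = re.sub(r'\.|,|\(|\)|!|\-', ' ', sample)

# Step 2: remove numbers and words containing a digit
step2 = re.sub(r'\b\w*\d\w*\b', '', step1)

# Step 3: lowercase, collapse the remaining non-word runs, trim the ends
step3 = re.sub(r'\W+', ' ', step2.lower()).strip()

print(step3)  # in the edition cost about half a guinea
```

Tokens such as "2d" and "10s" disappear entirely, because the second pattern removes any word that contains at least one digit.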

## `\w+` together with `\b`

Does anything happen on Sunday, on Monday or yesterday?

Finding words with a particular ending, e.g. _day_, can help us gain insight into where and when the events in a text take place.

The regex pattern `\w+day` finds sequences of word characters that end with the letters 'day'.

`\b` matches a word boundary, i.e. the position where a word starts or ends. Placed after `day`, it ensures that 'day' really is the end of the word.

You can also use the endings to find grammatical forms, e.g. words with a special suffix like '-ly' would be relatively easy to identify. Try it.
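As a small illustration of both points, the sketch below uses two made-up sample strings (not taken from the book). The first shows what the trailing `\b` changes, the second shows a search for the ending '-ly'.

```python
import re

# Made-up sample strings for the demonstration
days = "on sunday and monday we kept the holidays"
adverbs = "the roads were tolerably good and the people remarkably civil but only rarely did we fly"

# Without \b the match may stop in the middle of a longer word
print(re.findall(r'\w+day', days))      # ['sunday', 'monday', 'holiday']

# With \b the letters 'day' must be the end of the word
print(re.findall(r'\w+day\b', days))    # ['sunday', 'monday']

# Words ending in 'ly'
ly_words = re.findall(r'\b\w+ly\b', adverbs)
print(ly_words)  # ['tolerably', 'remarkably', 'only', 'rarely', 'fly']
```

The last result also contains 'fly', which is not an adverb. A suffix pattern only looks at spelling, so the hits still need a human eye.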





In [None]:
ending = re.findall(r'\w+day\b', text)
print(ending)

## More metacharacters, as well as pipes, lists and question marks

In literature, comparisons are often used to illustrate a point more clearly by painting a picture of what is being described. Comparisons also help make a text more lively and interesting.

Regex makes it a manageable task to retrieve examples of comparisons, because we can find text strings that follow the pattern of a typical comparison.

We can illustrate it in the following way. We look for phrases whose pattern is either as a ... or as an ....

There are two different ways.

The first way is to use the pipe `|`. The pipe means "or". The regex pattern will then look like this: `'as\sa\s\w+|as\san\s\w+'`

Another way is to use square brackets `[]`, which define a list of characters. It looks like this: `'as\sa[n]?\s\w+'`.

The list holds the letters that are allowed at that position in the word, and exactly one of them is matched. The question mark indicates that the letter may or may not be there.
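The sketch below compares the two patterns on a made-up sample sentence (not from the book) to show that they find the same phrases. It also includes a third, shorter spelling, `n?`, which behaves the same way because a list with a single letter is equivalent to the letter itself.

```python
import re

# Made-up sample sentence for the demonstration
sample = "as a rule the peasant is as an ox in strength and as a child in temper"

with_pipe = re.findall(r'as\sa\s\w+|as\san\s\w+', sample)
with_list = re.findall(r'as\sa[n]?\s\w+', sample)
without_brackets = re.findall(r'as\san?\s\w+', sample)

print(with_pipe)  # ['as a rule', 'as an ox', 'as a child']
print(with_pipe == with_list == without_brackets)  # True
```

Be aware that none of these patterns anchors the first word, so a phrase like "was a ..." would also produce a hit starting at "as". Putting `\b` in front of `as` is one way to avoid that.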

In [None]:
comparison = re.findall(r'as\sa\s\w+', text)
print(comparison)

In [None]:
comparison = re.findall(r'as\sa\s\w+|as\san\s\w+', text)
print(comparison)

In [None]:
comparison = re.findall(r'as\sa[n]?\s\w+', text)
print(comparison)

## Curly brackets for a concordance tool

We will now try curly brackets in our RegEx, using a concrete example: building a concordance tool. A concordance tool extracts text snippets based on a keyword and a window of surrounding characters.

We will find snippets containing the keyword _eye_. This is a concrete example of how we can zoom in on the text and see exactly how a term is used. In horror novels, for instance, I would imagine that words like _eyes_ play a special role.

For this task we also need the dot (`.`), which matches any single character except a line break, and curly brackets, which state how many times the preceding element must be repeated. Together, `.{30}` matches exactly 30 characters of any kind, so the pattern below returns the keyword with 30 characters of context on each side.
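Because `.{30}` demands exactly 30 characters, a keyword that sits closer than that to the start or end of the text is not found. One possible variation, sketched below, uses a range such as `{0,30}` instead, which accepts up to 30 characters. The function name `concordance` and its parameters are hypothetical and only serve the illustration, and the sample sentence is made up.

```python
import re

def concordance(text, keyword, width=30):
    """Return snippets with up to `width` characters on each side of the keyword."""
    # re.escape protects keywords that contain regex metacharacters;
    # the doubled braces {{ }} produce literal braces inside the f-string
    pattern = rf'.{{0,{width}}}\b{re.escape(keyword)}s?\b.{{0,{width}}}'
    return re.findall(pattern, text)

# Made-up sample sentence for the demonstration
sample = "his eye was fixed on the coast while the eyes of the crew turned to the sails"

snippets = concordance(sample, "eye", width=10)
print(snippets)  # ['his eye was fixed', 'while the eyes of the cr']
```

The first snippet has only four characters of left context, which the fixed `{10}` version would have rejected.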

In [None]:
concordance_tool = re.findall(r'.{30}\beye[s]?\b.{30}', text)

concordance_tool

## Square brackets [A-Z]

Find words that begin with capital letters.

In [None]:
upper_case_words = re.findall(r'[A-Z]\w+', full_text)
upper_case_words = [i for i in upper_case_words if i.lower() not in full_text]

print(set(upper_case_words))
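The filter above checks whether the lowercase form occurs anywhere in `full_text`, including inside longer words. A possible alternative, sketched here on a made-up sample sentence, is to compare against a set of whole lowercase words instead. The variable names are assumptions chosen for the illustration.

```python
import re

# Made-up sample text for the demonstration
sample = "The king of Denmark left Copenhagen. The harbour of Elsinore was full, the town was quiet."

# Words that start with a capital letter
capitalised = re.findall(r'\b[A-Z]\w+', sample)

# Whole words that start with a lowercase letter
lowercase_words = set(re.findall(r'\b[a-z]\w*', sample))

# Keep a capitalised word only if it never appears as a lowercase word
likely_names = {w for w in capitalised if w.lower() not in lowercase_words}
print(likely_names)
```

Here "The" is dropped because "the" also occurs in lowercase, while "Denmark", "Copenhagen" and "Elsinore" remain.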

## Fuzzy searches in OCR text

Regex can also be used to perform a fuzzy search on OCR-processed text. Let's try to locate instances of the word "danish" or its misread form "danisli" in the text, allowing for the slight variations or errors that can occur during OCR processing.

`.{30}`: Matching any 30 characters before and after the target word, providing context around the match.

`danis(?:h|li)`: Looking for the word "danish" or "danisli", where:

`danis` is the fixed part of the word.

`(?:h|li)` allows either "h" or "li" to follow "danis", accommodating potential OCR errors or variations. The alternatives must sit in a group: inside square brackets the pipe is an ordinary character, so `[h|li]` would match just one of the characters "h", "|", "l" or "i". The group is non-capturing (`?:`) so that `re.findall` still returns the whole snippet rather than only the group.


In [None]:
fuzzy_search = re.findall(r'.{30}danis(?:h|li).{30}', text)

fuzzy_search
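The difference between square brackets and a group with a pipe is easy to miss, so here is a small side-by-side sketch. The sample string is made up, and the misreadings in it are assumptions for the sake of the demonstration.

```python
import re

# Made-up sample with invented OCR misreadings
sample = "the danish court, the danisli navy and the danis| misprint"

# Square brackets: exactly ONE character out of h, |, l, i
char_class = re.findall(r'danis[h|li]', sample)
print(char_class)  # ['danish', 'danisl', 'danis|']

# Non-capturing group with a pipe: either 'h' or the two letters 'li'
group = re.findall(r'danis(?:h|li)', sample)
print(group)  # ['danish', 'danisli']
```

Only the group version treats "li" as a unit, and it does not accept the stray pipe character.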