# Scraping "A Little Larger than the Entire Universe"

This Jupyter notebook contains the code that we wrote to scrape the poems in the book "A Little Larger than the Entire Universe". The book collects poems that Pessoa wrote under his own name (77 poems) and under his three most prolific heteronyms: Alberto Caeiro (52 poems), Ricardo Reis (56 poems) and Álvaro de Campos (41 poems). Most of the poems were originally written in Portuguese and were translated into English by Richard Zenith, a leading scholar of Pessoa's life.

## 1. Importing Modules

The following modules were used for the corpus collection and analysis. Please consult the comments within the code to see what each module was used for.

In [1]:
#Importing modules
import re
from pypdf import PdfReader
import pandas as pd
import os
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns

# Download the NLTK punkt_tab model that will be used for the sentence tokenization
nltk.download('punkt_tab')
# Ensure you have the NLTK stopwords set downloaded
nltk.download('stopwords')
# Loading NLTK stopwords
stop_words = set(stopwords.words('english'))
# Download the VADER lexicon which will be used for sentiment analysis
nltk.download('vader_lexicon')
# Initialize VADER Sentiment Intensity Analyzer
sia = SentimentIntensityAnalyzer()

[nltk_data] Downloading package punkt_tab to /Users/henry/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/henry/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/henry/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


## 2. Scraping and Corpus Collection Functions
We wrote several functions and applied them to the page ranges of the PDF that contain the poems of each heteronym.

### 2.1. Importing the poems
The function "ImportPDF" imports the specified pages from the PDF and stores the text of each page as one list entry. The argument "heteronym" serves only as a label: inside the function it is overwritten with a new, empty list, so the name of the resulting list is determined by the variable to which the return value is assigned. "heteronym_pages" gives the zero-based indices of the PDF pages to scrape. The function returns the list of pages.

In [2]:
# Importing and formatting the pages corresponding to the heteronym.
def ImportPDF(heteronym, heteronym_pages):

    # Importing the PDF.
    reader = PdfReader("Pessoa.pdf")

    # The "heteronym" argument is only a label; the page texts are collected in a new list.
    pages = []
    # Extracting the text of each requested page (zero-based page index).
    for page_index in heteronym_pages:
        page = reader.pages[page_index]
        text = page.extract_text(extraction_mode="layout")
        pages.append(text)

    return pages

### 2.2. Determining the indices/page numbers of poems which are split across two or more pages
Given the extraction settings of the previous function, a page that continues a poem started on an earlier page always begins with two line breaks followed by either a space or a capital letter. The function "FindIndeces" uses regular expressions to test the start of each page for this pattern: it checks that the first character is a line break and that the third character is a space or a capital letter. It stores the indices of the matching pages in a list and returns that list.

In [3]:
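# Finding the indices of the pages that continue a poem from a previous page.
def FindIndeces(heteronym):
    indeces = []
    marker1 = re.compile(r"^\n")  # The page starts with a line break.
    marker2 = re.compile(r"^[A-Z]")  # A capital letter.
    for index, page in enumerate(heteronym):
        # Pages shorter than three characters cannot match the pattern.
        if len(page) > 2 and marker1.match(page):
            # The third character is either a space or a capital letter.
            if page[2] == " " or marker2.match(page[2]):
                indeces.append(index)

    return indeces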
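
To make the page-start pattern concrete, the following minimal sketch runs the check on three invented page strings. The helper name `find_continuation_pages` and the sample pages are hypothetical, and the sketch assumes the stricter reading of the description above: a continuation page starts with exactly two line breaks followed by a space or a capital letter.

```python
import re

# Hypothetical helper, not part of the notebook.
# Assumption: a continuation page starts with two line breaks followed by
# a space or a capital letter.
CONTINUATION = re.compile(r"\n\n(?: |[A-Z])")

def find_continuation_pages(pages):
    """Return the list positions of pages that continue a previous poem."""
    # re.match only matches at the very start of each page string.
    return [i for i, page in enumerate(pages) if CONTINUATION.match(page)]

# Invented placeholder pages, not taken from the book.
sample_pages = [
    "TITLE ONE\n\nfirst page of a poem",    # starts a new poem
    "\n\nSecond page of the same poem",     # continuation page
    "TITLE TWO\n\na self-contained poem",   # starts a new poem
]

print(find_continuation_pages(sample_pages))  # [1]
```

Only the second page is reported, because it is the only one that begins with the two line breaks.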

### 2.3. Removing Footnotes
Some pages contain footnotes. This function removes them by using a regular expression to delete every line that begins with the "*" character, and then strips leading and trailing whitespace from the page.

In [4]:
# Define a function to remove footnotes and trailing unwanted text
def remove_footnotes(text):
    # Match only footnotes that are separate from the main content
    footnote_pattern = r'^\*.*$'  # Matches lines starting with '*' on a new line
    # Remove footnotes
    text = re.sub(footnote_pattern, '', text, flags=re.MULTILINE).strip()
    return text
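
A quick check on an invented page shows what the pattern removes and what it leaves alone: the whole footnote line is dropped, while an asterisk marker inside a verse line stays in place. The sample text is made up for illustration; the function body repeats the pattern used above.

```python
import re

def remove_footnotes(text):
    # Same pattern as above: delete every line that begins with "*".
    return re.sub(r'^\*.*$', '', text, flags=re.MULTILINE).strip()

# Invented sample page, not taken from the book.
page = "A line of verse\nAnother line with a marker*\n*An explanatory footnote."
cleaned = remove_footnotes(page)

print(repr(cleaned))  # 'A line of verse\nAnother line with a marker*'
```

If the in-text markers should disappear as well, a second substitution would be needed; the function above does not attempt that.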

### 2.4. Joining Pages
This function uses the previously collected list of indices of continuation pages (the global variable "indeces") to join poems that are split across pages. It does so by applying a series of if-conditions which determine whether an index belongs to a two-page poem, or to the beginning, the middle or the end of a longer poem.

The first if-condition applies specifically to two-page poems. It identifies them by checking that neither the index before the current one nor the index after it is in the list. If this is the case, it joins the current page with the page before it.

The second if-condition applies to the first and second pages of poems which are longer than two pages. These pages are identified when the index before the current one is not in the list, but the subsequent index is.

The third if-condition applies to all pages of multi-page poems except for the first, second and last. If both the previous and the subsequent index are in the list, the current page is added.

Finally, the fourth if-condition applies to the final page of multi-page poems. It applies if the previous index is in the list, but the subsequent index is not.

For each of these conditions, the relevant pages are added to the "pages_to_join" list. Once all pages of a poem have been gathered, they are joined, footnotes are removed, the poem is added to the "heteronym_joined" list, and "pages_to_join" is emptied again for the next poem.

The second for-loop in this function adds the poems which span only a single page to the "heteronym_joined" list, so that, in the end, all poems are collected in a single list, which the function returns. As a consequence, the multi-page poems come first in the returned list, followed by the single-page poems.

In [5]:
# Joining poems which were split across multiple pages into one.
def JoinPages(heteronym):
    heteronym_joined = []
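    pages_to_join = []
    # "indeces" is the global list produced by FindIndeces, so FindIndeces
    # must be run on the same list of pages before this function is called.
    for index in indeces:
        # Handling multi-page poems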
        if index - 1 not in indeces and index + 1 not in indeces:
            # Single split
            pages_to_join.append(heteronym[index - 1]) # First page
            pages_to_join.append(heteronym[index]) # Second page
            poem = "".join(pages_to_join)
            poem = remove_footnotes(poem)  # Apply footnote removal
            heteronym_joined.append(poem)
            pages_to_join = []
        elif index - 1 not in indeces and index + 1 in indeces:
            # First page of a multi-page poem
            pages_to_join.append(heteronym[index - 1]) # First page
            pages_to_join.append(heteronym[index]) # Second page
        elif index - 1 in indeces and index + 1 in indeces:
            pages_to_join.append(heteronym[index]) # Middle pages
        elif index - 1 in indeces and index + 1 not in indeces:
            pages_to_join.append(heteronym[index]) # Last page
            poem = "".join(pages_to_join)
            poem = remove_footnotes(poem)  # Apply footnote removal
            heteronym_joined.append(poem)
            pages_to_join = []

    for index, poem in enumerate(heteronym):
        # Handling single-page poems: the page is neither a continuation page
        # nor followed by one.
        if index not in indeces and index + 1 not in indeces:
            poem = remove_footnotes(poem)  # Apply footnote removal
            heteronym_joined.append(poem)
    
    return heteronym_joined
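
The four conditions can also be expressed as a single pass that attaches each continuation page to the poem before it. The sketch below is an alternative formulation, not the function used in this notebook: the name `join_pages_in_order` is hypothetical, the indices are passed in explicitly instead of being read from a global variable, footnote removal is left out, and the poems are returned in page order.

```python
def join_pages_in_order(pages, continuation_indices):
    """Group pages into poems by attaching continuation pages to the previous poem."""
    continuation = set(continuation_indices)  # fast membership tests
    poems = []
    for i, page in enumerate(pages):
        if i in continuation and poems:
            poems[-1] += page      # continue the poem that is currently open
        else:
            poems.append(page)     # start a new poem
    return poems

# Invented placeholder pages: poem A spans three pages, B one, C two.
pages = ["A1 ", "A2 ", "A3", "B1", "C1 ", "C2"]

print(join_pages_in_order(pages, [1, 2, 5]))  # ['A1 A2 A3', 'B1', 'C1 C2']
```

Because every page is visited exactly once, the same loop handles two-page poems, longer poems and single-page poems without separate branches.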

### 2.5. Cleaning and tokenizing the poems
This function cleans, tokenizes and reformats the joined poems. For each poem, regular expressions remove line breaks, excess whitespace, month names and digits (that is, the dates). The result, which still contains punctuation, is saved in the "heteronym_formatted" list. Next, the poem is tokenized with a regular expression that keeps only word characters. The tokens are used in two ways: rejoined with single spaces, they give the poem with all line breaks and punctuation removed (saved in the "heteronym_cleaned" list); lowercased and with stopwords removed, they give the content words of the poem (saved in the "heteronym_tokenized" list). Finally, the content words of every poem are appended to the "all_tokens" list, repetitions included, so that its length is the total number of content words in the corpus.

The function returns a tuple of "heteronym_cleaned", "heteronym_tokenized", "all_tokens" and "heteronym_formatted". When calling the function, a single element of the tuple can be selected by putting its index in square brackets after the call.

In [6]:
# Cleaning and tokenizing the poems
def CleanedTokenized(heteronym_joined):
    heteronym_formatted =[]
    heteronym_cleaned = []
    heteronym_tokenized = []
    all_tokens = []  # A list to hold all tokens for total count
    
    for poem in heteronym_joined:
        poem = re.sub(r'\n', ' ', poem)  # Remove newlines
        poem = re.sub(r'\s+', ' ', poem)  # Collapse multiple spaces
        # The following two lines of code will remove the months formatted with a first capitalized letter
        poem = re.sub(r'\b(January|February|March|April|May|June|July|August|September|October|November|December)\b', '', poem)  # Remove all the months
        poem = re.sub(r'\d+', '', poem)  # Remove all the digits
        heteronym_formatted.append(poem)

        
        # Tokenization and stopwords removal
        tokens = re.findall(r'\w+', poem)  # Tokenize without punctuation
        # Rejoining the tokens gives the poem without punctuation. The tokens contain
        # only word characters, so no spacing around punctuation has to be repaired.
        formatted_poem = " ".join(tokens)
        heteronym_cleaned.append(formatted_poem) # Saves the poem in a list
        
        filtered_tokens = [word.lower() for word in tokens if word.lower() not in stop_words]  # Remove stopwords
        heteronym_tokenized.append(filtered_tokens) # We want to leave only the tokenized content words in our heteronym_tokenized variable
        
        all_tokens.extend(filtered_tokens)  # Add filtered tokens to the all_tokens list

    return heteronym_cleaned, heteronym_tokenized, all_tokens, heteronym_formatted
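
The effect of the `\w+` tokenization is easiest to see on a single line. In the minimal sketch below, the sentence is invented and `demo_stop_words` is a small hand-picked stand-in for the NLTK stopword list (an assumption, so that the example runs without any downloads). It shows that contractions are split at the apostrophe, which is why they appear as two tokens in the cleaned text.

```python
import re

# Hand-picked stand-in for the NLTK English stopword list (assumption).
demo_stop_words = {"it", "s", "and", "i", "don", "t"}

# Invented sample line, not taken from the book.
line = "It's raining, and I don't mind."

tokens = re.findall(r"\w+", line)   # word characters only, punctuation dropped
cleaned = " ".join(tokens)          # text without punctuation
content = [t.lower() for t in tokens if t.lower() not in demo_stop_words]

print(tokens)   # ['It', 's', 'raining', 'and', 'I', 'don', 't', 'mind']
print(cleaned)  # It s raining and I don t mind
print(content)  # ['raining', 'mind']
```

The fragments left over from contractions ("s", "t", "don") are removed here only because they are in the stopword set.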

### 2.6. Creating a dataframe for each list of poems
This function first creates a dictionary in which each key numbers a poem ("poem_1", "poem_2" and so on) and each value is the cleaned poem. Subsequently, a dataframe is created from the dictionary. Next, the "heteronym_formatted" and "heteronym_tokenized" lists are added as columns, as well as another column containing the name of the heteronym.

The function returns a tuple of the dataframe and "heteronym_dict". When calling the function, a single element of the tuple can be selected by putting its index in square brackets after the call.

In [7]:
# Create a DataFrame that will store the poems for each heteronym and Pessoa himself
def GetDF(heteronym, heteronym_cleaned, heteronym_formatted, heteronym_tokenized):
    heteronym_dict = {}
    n = 1
    for poem in heteronym_cleaned:
        heteronym_dict["poem_"+str(n)] = poem
        n += 1
    
    # Adding the poems to a DataFrame.
    df = pd.DataFrame(list(heteronym_dict.items()), columns=["Index", "Cleaned_text"])
    df["Formatted_text"] = heteronym_formatted
    df["Tokenized_text_Content"] = heteronym_tokenized
    df["Heteronym"] = heteronym
    
    return df, heteronym_dict

### 2.7. Exporting the poems to .txt
This function creates a folder in the working directory for each heteronym (if this doesn't already exist), and saves each formatted poem as a .txt file in that folder.

In [8]:
# Exporting the corpus to .txt files.
def ExportToTxt(heteronym, heteronym_dict):  # This relies on the previous function outputting the heteronym's dictionary.
    os.makedirs(heteronym, exist_ok=True)  # Creates a folder for the heteronym if it doesn't already exist.
    for key in heteronym_dict:
        # Opening in "w" mode creates the file if it doesn't exist and overwrites it otherwise.
        with open(os.path.join(heteronym, key + ".txt"), "w", encoding="utf-8") as file:
            file.write(heteronym_dict[key])

    return "Complete"
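
For comparison, the same export step can be written with `pathlib`. This is a sketch under assumptions rather than the function used in this notebook: the name `export_to_txt`, the `base_dir` parameter and the returned list of paths are all hypothetical additions, and the example writes into a temporary directory so that it leaves no files behind.

```python
import tempfile
from pathlib import Path

def export_to_txt(folder, poems_by_key, base_dir="."):
    """Write each poem to <base_dir>/<folder>/<key>.txt and return the paths."""
    target = Path(base_dir) / folder
    target.mkdir(parents=True, exist_ok=True)     # no error if the folder exists
    written = []
    for key, poem in poems_by_key.items():
        path = target / f"{key}.txt"
        path.write_text(poem, encoding="utf-8")   # creates or overwrites the file
        written.append(path)
    return written

# Invented two-poem dictionary, written to a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    paths = export_to_txt("Demo", {"poem_1": "first", "poem_2": "second"}, base_dir=tmp)
    print([p.name for p in paths])                # ['poem_1.txt', 'poem_2.txt']
    print(paths[1].read_text(encoding="utf-8"))   # second
```

Returning the paths instead of a fixed string makes it possible to check afterwards which files were written.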

## 3. Applying the functions to Pessoa and His Heteronyms

### 3.1. Alberto Caeiro

In [9]:
# Importing the PDF:
caeiro = ImportPDF("caeiro", range(56,126))

# Stripping the top of the pages and other unnecessary details.
for page in caeiro:
    if "a little larger than the entire universe" in page:
        caeiro[caeiro.index(page)] = page[2:].lstrip("a little larger than the entire universe")
    if "alberto caeiro" in page:
        caeiro[caeiro.index(page)] = page.strip("alberto caeiro")[2:]
    if "alber t o caeiro" in page:
        caeiro[caeiro.index(page)] = page.strip("alber t o caeiro")[2:]
    if len(page) == 0:
        caeiro.remove(page)
    if "            from\nTHE SHEPHERD IN LOVE" in page or "           from\nUNCOLLECTED POEMS" in page:
        caeiro.remove(page)

# Finding the indices of each poem split across two or more pages.
indeces = FindIndeces(caeiro)

# Joining poems which are split across multiple pages.
caeiro_joined = JoinPages(caeiro)

# Removing empty list entries.
caeiro_joined = [poem for poem in caeiro_joined if len(poem.strip()) > 0]

# Cleaning and tokenizing the poems.
# caeiro_all_tokens holds every content word and is used to count them.
caeiro_cleaned, caeiro_tokenized, caeiro_all_tokens, caeiro_formatted = CleanedTokenized(caeiro_joined)
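
One caveat about the header stripping in this cell: `str.strip` and `str.lstrip` treat their argument as a set of characters, not as a prefix, so they can also remove characters of the poem itself when those characters occur in the header text. The sketch below illustrates the effect on an invented page and shows one possible stricter alternative. The helper `strip_running_header` is hypothetical, and it assumes that the running header sits at the very top of the page, optionally preceded by a page number; it leaves the line breaks after the header untouched.

```python
import re

def strip_running_header(page, header):
    """Remove one running header (with an optional page number) from the top of a page."""
    pattern = r"\A\s*(?:\d+\s+)?" + re.escape(header) + r"(?:[ \t]+\d+)?"
    return re.sub(pattern, "", page, count=1, flags=re.IGNORECASE)

# Invented sample page, not taken from the book.
page = "alberto caeiro\n\nThe rain came late"

# strip() removes any of the characters "alberto caeiro" from BOTH ends,
# so the end of the last line is eaten as well.
print(repr(page.strip("alberto caeiro")))                # '\n\nThe rain cam'

# The anchored pattern removes only the header itself.
print(repr(strip_running_header(page, "alberto caeiro")))  # '\n\nThe rain came late'
```

Whether this matters for a given page depends on the characters next to the header and at the end of the page, so it is worth spot-checking a few exported poems.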

First, we keep track of the size of each heteronym's corpus, since it is relevant for parts of our analysis. As mentioned before, we mostly look at the content words, since they are more relevant for most parts of our analysis. The sentiment analysis is the exception: it uses a different variable, which stores the full text of the poems, including punctuation and function words.

In [10]:
print(f"Caeiro's corpus comprises {len(caeiro_all_tokens)} content words.")

Caeiro's corpus comprises 3522 content words.
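
Since the corpora printed in this notebook differ considerably in size (from 2234 to 12223 content words), raw word counts are not directly comparable across heteronyms. One common way to account for this, sketched here with invented token lists, is to express a count per 1,000 content words. The helper `per_thousand` is hypothetical and is not claimed to be the normalization used in the later analysis.

```python
from collections import Counter

def per_thousand(tokens, word):
    """Occurrences of `word` per 1,000 tokens (0.0 for an empty list)."""
    if not tokens:
        return 0.0
    return 1000 * Counter(tokens)[word] / len(tokens)

# Invented token lists of different sizes, for illustration only.
small = ["sea"] * 2 + ["sky"] * 98      # 100 tokens, "sea" occurs twice
large = ["sea"] * 4 + ["sky"] * 396     # 400 tokens, "sea" occurs four times

print(per_thousand(small, "sea"))  # 20.0
print(per_thousand(large, "sea"))  # 10.0
```

The larger list contains more occurrences in absolute terms, yet the word is relatively rarer there, which is the distinction a raw count would hide.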


In [11]:
# Present the corpus as a DataFrame.
caeiro_df = GetDF("Caeiro", caeiro_cleaned, caeiro_formatted, caeiro_tokenized)[0]
caeiro_df.head()

| | Index | Cleaned_text | Formatted_text | Tokenized_text_Content | Heteronym |
|---|---|---|---|---|---|
| 0 | poem_1 | II My gaze is clear like a sunﬂower It is my c... | II My gaze is clear like a sunﬂower. It is my ... | [ii, gaze, clear, like, sunﬂower, custom, walk... | Caeiro |
| 1 | poem_2 | IV This afternoon a thunderstorm Rolled down f... | IV This afternoon a thunderstorm Rolled down f... | [iv, afternoon, thunderstorm, rolled, slopes, ... | Caeiro |
| 2 | poem_3 | VIII One midday in late spring I had a dream t... | VIII One midday in late spring I had a dream t... | [viii, one, midday, late, spring, dream, like,... | Caeiro |
| 3 | poem_4 | XXVIII Today I read nearly two pages In the bo... | XXVIII Today I read nearly two pages In the bo... | [xxviii, today, read, nearly, two, pages, book... | Caeiro |
| 4 | poem_5 | XLVI In this way or that way As it may happen ... | XLVI In this way or that way, As it may happen... | [xlvi, way, way, may, happen, happen, sometime... | Caeiro |


In [12]:
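# Exports the corpus to .txt files.
# The formatted poems are passed in the "heteronym_cleaned" position here, so that
# the returned dictionary (and therefore each .txt file) holds the formatted text.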
caeiro_dict = GetDF("Caeiro", caeiro_formatted, caeiro_cleaned, caeiro_tokenized)[1]
ExportToTxt("Caeiro", caeiro_dict)

'Complete'

### 3.2. Ricardo Reis

In [23]:
#Importing the PDF
reis = ImportPDF("reis", range(128,189))

# Stripping the top of the pages and other unnecessary details.
for page in reis:
    if "a little larger than the entire universe" in page:
        reis[reis.index(page)] = page[2:].lstrip(" 0123456789 a little larger than the entire universe")
    if "ricardo reis" in page[:12]:
        reis[reis.index(page)] = page.lstrip("Ricardo reis")[2:]
    if len(page) == 0:
        reis.remove(page)

# Finding the indices of each poem split across two or more pages.
indeces = FindIndeces(reis)

# Joining poems which are split across multiple pages.
reis_joined = JoinPages(reis)

# Removing empty list entries
reis_joined = [poem for poem in reis_joined if len(poem.strip()) > 0]

# Cleaning and tokenizing the poems
# reis_all_tokens holds every content word and is used to count them.
reis_cleaned, reis_tokenized, reis_all_tokens, reis_formatted = CleanedTokenized(reis_joined)
print(f"Reis' corpus comprises {len(reis_all_tokens)} content words.")

Reis' corpus comprises 2234 content words.


In [14]:
# Present the corpus as a DataFrame.
reis_df = GetDF("Reis", reis_cleaned, reis_formatted, reis_tokenized)[0]
reis_df.head()

| | Index | Cleaned_text | Formatted_text | Tokenized_text_Content | Heteronym |
|---|---|---|---|---|---|
| 0 | poem_1 | To Alberto Caeiro Peaceful Master Are all the ... | To Alberto Caeiro Peaceful, Master, Are all th... | [alberto, caeiro, peaceful, master, hours, los... | Reis |
| 1 | poem_2 | Each thing in its time has its time The trees ... | Each thing, in its time, has its time. The tre... | [thing, time, time, trees, blossom, winter, wh... | Reis |
| 2 | poem_3 | THE CHESS PLAYERS I ve heard that once during ... | THE CHESS PLAYERS I’ve heard that once, during... | [chess, players, heard, know, war, persia, inv... | Reis |
| 3 | poem_4 | I love the roses of Adonis s gardens Yes Lydia... | I love the roses of Adonis’s gardens. Yes, Lyd... | [love, roses, adonis, gardens, yes, lydia, lov... | Reis |
| 4 | poem_5 | The god Pan isn t dead In each ﬁeld that shows... | The god Pan isn’t dead. In each ﬁeld that show... | [god, pan, dead, ﬁeld, shows, ceres, naked, br... | Reis |


In [15]:
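# Exports the corpus to .txt files.
# The formatted poems are passed in the "heteronym_cleaned" position here, so that
# the returned dictionary (and therefore each .txt file) holds the formatted text.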
reis_dict = GetDF("Reis", reis_formatted, reis_cleaned, reis_tokenized)[1]
ExportToTxt("Reis", reis_dict)

'Complete'

### 3.3. Álvaro de Campos

In [16]:
# Importing the PDF.
campos = ImportPDF("campos", range(192,318))

# Stripping the top of the pages.
# Due to how the PDF was imported, we had to apply several different patterns here.
for page in campos:
    if "a little larger than the entire universe" in page:
        campos[campos.index(page)] = page.lstrip(" 0123456789 a little larger than the entire universe")
    if "álvaro de campos" in page[:16]:
        campos[campos.index(page)] = page.lstrip("álvaro de campos")[3:]
    if "álva r o de campos" in page[:18]:
        campos[campos.index(page)] = page.lstrip("álvaro de campos")[3:]
    if len(page) == 0:
        campos.remove(page)

# Finding the indices of each poem split across two or more pages.
indeces = FindIndeces(campos)

# Joining poems which are split across multiple pages.
campos_joined = JoinPages(campos)

# Joining the poems results in a few more empty list entries, which are removed here.
campos_joined = [poem for poem in campos_joined if len(poem.strip()) > 0]


# Cleaning and tokenizing the poems
# campos_all_tokens holds every content word and is used to count them.
campos_cleaned, campos_tokenized, campos_all_tokens, campos_formatted = CleanedTokenized(campos_joined)
print(f"Campos' corpus comprises {len(campos_all_tokens)} content words.")

# Number of poems in the corpus.
len(campos_cleaned)

Campos' corpus comprises 12223 content words.


41

In [17]:
# Present the corpus as a DataFrame.
campos_df = GetDF("Campos", campos_cleaned, campos_formatted, campos_tokenized)[0]
campos_df.head()

| | Index | Cleaned_text | Formatted_text | Tokenized_text_Content | Heteronym |
|---|---|---|---|---|---|
| 0 | poem_1 | OPIARY It s before I take opium that my soul i... | OPIARY It’s before I take opium that my soul i... | [opiary, take, opium, soul, sick, feel, life, ... | Campos |
| 1 | poem_2 | TRIUMPHAL ODE By the painful light of the fact... | TRIUMPHAL ODE By the painful light of the fact... | [triumphal, ode, painful, light, factory, huge... | Campos |
| 2 | poem_3 | EXCERPTS FROM TWO ODES I Come ancient and unch... | EXCERPTS FROM TWO ODES I Come, ancient and unc... | [excerpts, two, odes, come, ancient, unchangin... | Campos |
| 3 | poem_4 | MARITIME ODE Alone this summer morning on the ... | MARITIME ODE Alone this summer morning on the ... | [maritime, ode, alone, summer, morning, desert... | Campos |
| 4 | poem_5 | SALUTATION TO WALT WHITMAN Portugal Inﬁnity el... | SALUTATION TO WALT WHITMAN Portugal, Inﬁnity— ... | [salutation, walt, whitman, portugal, inﬁnity,... | Campos |


In [18]:
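# Exports the corpus to .txt files.
# The formatted poems are passed in the "heteronym_cleaned" position here, so that
# the returned dictionary (and therefore each .txt file) holds the formatted text.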
campos_dict = GetDF("Campos", campos_formatted, campos_cleaned, campos_tokenized)[1]
ExportToTxt("Campos", campos_dict)

'Complete'

### 3.4. Fernando Pessoa - Himself

In [19]:
# Importing the PDF.
pessoa = ImportPDF("pessoa", range(322,428))

# Stripping the top of the pages.
# Due to how the PDF was imported, we had to apply several different patterns here.
for page in pessoa:
    if "a little larger than the entire universe" in page:
        pessoa[pessoa.index(page)] = page.lstrip(" 0123456789 a little larger than the entire universe")
    if "fernando pessoa–himself" in page[:23]:
        pessoa[pessoa.index(page)] = page.lstrip("fernando pessoa–himself")[3:]
    if len(page) == 0:
        pessoa.remove(page)

# Finding the indices of each poem split across two or more pages.
indeces = FindIndeces(pessoa)

# Joining poems which are split across multiple pages.
pessoa_joined = JoinPages(pessoa)

# Joining the poems results in a few more empty list entries, which are removed here.
pessoa_joined = [poem for poem in pessoa_joined if len(poem.strip()) > 0]

# Cleaning and tokenizing the poems
# pessoa_all_tokens holds every content word and is used to count them.
pessoa_cleaned, pessoa_tokenized, pessoa_all_tokens, pessoa_formatted = CleanedTokenized(pessoa_joined)
print(f"Pessoa's corpus comprises {len(pessoa_all_tokens)} content words.")

Pessoa's corpus comprises 4848 content words.


In [20]:
# Present the corpus as a DataFrame.
pessoa_df = GetDF("Pessoa", pessoa_cleaned, pessoa_formatted, pessoa_tokenized)[0]
pessoa_df.head()

| | Index | Cleaned_text | Formatted_text | Tokenized_text_Content | Heteronym |
|---|---|---|---|---|---|
| 0 | poem_1 | Swamps of yearnings brushing against my gilded... | Swamps of yearnings brushing against my gilded... | [swamps, yearnings, brushing, gilded, soul, di... | Pessoa |
| 1 | poem_2 | from SLANTING RAIN I My dream of an inﬁnite po... | from SLANTING RAIN I My dream of an inﬁnite po... | [slanting, rain, dream, inﬁnite, port, crosses... | Pessoa |
| 2 | poem_3 | She sings poor reaper perhaps Believing hersel... | She sings, poor reaper, perhaps Believing hers... | [sings, poor, reaper, perhaps, believing, happ... | Pessoa |
| 3 | poem_4 | DIARY IN THE SHADE Do you still remember me Yo... | DIARY IN THE SHADE Do you still remember me? Y... | [diary, shade, still, remember, knew, long, ti... | Pessoa |
| 4 | poem_5 | Where s my life going and who s taking it ther... | Where’s my life going, and who’s taking it the... | [life, going, taking, always, want, destiny, k... | Pessoa |


In [21]:
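# Exports the corpus to .txt files.
# The formatted poems are passed in the "heteronym_cleaned" position here, so that
# the returned dictionary (and therefore each .txt file) holds the formatted text.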
pessoa_dict = GetDF("Pessoa", pessoa_formatted, pessoa_cleaned, pessoa_tokenized)[1]
ExportToTxt("Pessoa", pessoa_dict)

'Complete'

## 4. Exporting the joined DF to .csv

In [22]:
# A DF containing the corpora of Pessoa himself and his three heteronyms.
all_df = pd.concat([caeiro_df, reis_df, campos_df, pessoa_df], axis=0)
all_df.to_csv("pessoa_heteronyms_full_corpus.csv")  # Exporting the DF to a .csv file.

## 5. References

Pessoa, F., & Zenith, R. (2014). A little larger than the entire universe: Selected poems. Penguin Books.

Encyclopædia Britannica, inc. (2025, January 3). Horace. Encyclopædia Britannica. https://www.britannica.com/biography/Horace-Roman-poet

Poetry Foundation. (2025). Ode. Poetry Foundation. https://www.poetryfoundation.org/education/glossary/ode
