# Punctuation art with Python

In 2022, I saw <a href="https://artishockrevista.com/2022/06/23/pedro-reyes-escultura-social">"Escultura Social"</a>, a collection of pieces by Pedro Reyes. The artist reflected on many current issues and also on the importance of language. The exhibition description explained that the artist thought of punctuation as two-dimensional sculpture: each mark is a symbol that represents an abstract notion.

I later saw the work of [Nicholas Rougeux](https://www.c82.net/work/?id=347), who also studied punctuation, but strictly as a way of examining the structure of classic literature such as Pride and Prejudice, Moby Dick, and Alice's Adventures in Wonderland.

I really enjoyed both of these projects and set out to see if I could replicate them. There are a number of classic texts I enjoy, and I wanted to see what their punctuation would look like. I was able to adapt the <a href="https://jss367.github.io/getting-text-from-project-gutenberg.html">code written by Julius Simonelli</a> to download raw text from Project Gutenberg. His code was really helpful.

In [1]:
# Import necessary libraries
import os
from urllib import request
import nltk
import re
import spacy
import en_core_web_sm
from html2image import Html2Image

## Web scraping raw Gutenberg file (Adapted from Julius Simonelli)

In [2]:
# Define function for web scraping

def text_from_gutenberg(title, author, url, path = 'corpora/canon_texts/', return_raw = False, return_tokens = False):
    # Convert inputs to lowercase
    title = title.lower()
    author = author.lower()

    # Check if the file is stored locally (the folder is created on first run)
    os.makedirs(path, exist_ok=True)
    filename = os.path.join(path, title)
    # Explicit UTF-8 keeps reads and writes consistent across operating systems
    if os.path.isfile(filename) and os.stat(filename).st_size != 0:
        print("{title} file already exists".format(title=title))
        print(filename)
        with open(filename, 'r', encoding='utf-8') as f:
            raw = f.read()

    else:
        print("{title} file does not already exist. Grabbing from Project Gutenberg".format(title=title))
        response = request.urlopen(url)
        raw = response.read().decode('utf-8-sig')
        print("Saving {title} file".format(title=title))
        with open(filename, 'w', encoding='utf-8') as outfile:
            outfile.write(raw)
            
    if return_raw:
        return raw
    
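    # Option to return tokens (word_tokenize needs NLTK's tokenizer data to be downloaded)
    if return_tokens:
        return nltk.word_tokenize(find_beginning_and_end(raw))

    # Default: return the text between the Gutenberg start and end markers
    return find_beginning_and_end(raw)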
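
The download-once pattern used by `text_from_gutenberg` (fetch the file the first time, read it from disk afterwards) can be tried in isolation. This is a minimal sketch rather than the code above: `cached_text` and its `fetch` parameter are hypothetical names, and the downloader is passed in as a function so the logic can be run without network access.

```python
import tempfile
from pathlib import Path


def cached_text(name, fetch, cache_dir):
    """Return text for `name`, calling `fetch()` only when no cached copy exists.

    `cached_text` and `fetch` are hypothetical names used for illustration.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / name.lower()

    # Reuse the local copy when it exists and is not empty
    if cache_file.is_file() and cache_file.stat().st_size > 0:
        return cache_file.read_text(encoding="utf-8")

    # Otherwise fetch once and store the result for next time
    text = fetch()
    cache_file.write_text(text, encoding="utf-8")
    return text


# Stand-in downloader that records how often it is called
calls = []


def fake_fetch():
    calls.append(1)
    return "How you, O Athenians ..."


with tempfile.TemporaryDirectory() as tmp:
    first = cached_text("Apology", fake_fetch, tmp)
    second = cached_text("Apology", fake_fetch, tmp)

print(len(calls))  # number of times the stand-in downloader ran
```

In a real run, `fetch` could wrap the `request.urlopen(url)` call from the function above.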

## Use Case: Plato's Apology

### Gutenberg Text Info

In [3]:
title = 'Apology'
author = 'Plato'
url = 'https://gutenberg.org/cache/epub/1656/pg1656.txt'

### Filtering manually, based on review of .txt file

In [4]:
# Find beginning of actual Apology in Raw Text
raw = text_from_gutenberg(title=title, author=author, url=url, return_raw=True)

# Search for beginning of Apology
phrase_to_find = "How you, O Athenians"

matches = re.finditer(re.escape(phrase_to_find), raw)

# If the phrase occurs more than once, the last occurrence is kept
for match in matches:
    start_index = match.start()

# Search for end of Apology (case-insensitive match on Gutenberg's closing marker)
end_regex = r'\*\*\* end of the project gutenberg ebook'
end_position = re.search(end_regex, raw, flags=re.IGNORECASE)

# Filter based on beginning and end of text
text = raw[start_index:end_position.start()]

apology file already exists
corpora/canon_texts/apology


### Filtering using spaCy: new line characters & retaining punctuation

In [5]:
# Replace new line characters with spaces and drop underscores
text_no_new = text.replace('\n', ' ')
text_no_underscore = text_no_new.replace('_', '')

# Tokenize text so it can be fed to spaCy
nlp = spacy.load("en_core_web_sm")
tokenized = nlp(text_no_underscore)


In [6]:
# Define spaCy filter that keeps only punctuation tokens (words and new lines are dropped)

def preprocessing(tokenized):
    normalized_text = []
    for token in tokenized:
        # Keep the token only if spaCy flags it as punctuation
        if token.is_punct:
            normalized_text.append(token.text) # Store the punctuation mark as-is
    return ' '.join(normalized_text)

# Extract punctuation using spaCy pipeline
punctuation = preprocessing(tokenized)
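
To see which marks dominate a text, the extracted symbols can also be tallied. The sketch below is an alternative under stated assumptions, not the pipeline above: it swaps spaCy's `token.is_punct` for the standard library's Unicode categories (any character whose category starts with `P`), so it works character by character and may differ slightly from spaCy's tokens (for example, `--` counts as two marks here). `extract_punctuation` is a hypothetical helper name.

```python
import unicodedata
from collections import Counter


def extract_punctuation(text):
    """Return every punctuation character in `text`, in order of appearance."""
    return [ch for ch in text if unicodedata.category(ch).startswith("P")]


sample = "Well, then; what is it? Nothing -- nothing at all!"
marks = extract_punctuation(sample)

# Space-separated string, in the same spirit as the poster text
print(" ".join(marks))

# Frequency of each mark
print(Counter(marks).most_common())
```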

### Create Poster as HTML

The background image I am using was created by <a href="https://unsplash.com/@henry_be?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash">Henry Be</a> on <a href="https://unsplash.com/photos/assorted-title-book-lot-on-shelf-TCsCykbwSJw?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash">Unsplash</a>.
  

In [7]:
# Define the background image URL with a relative path
background_image_url = 'url("Images/Henry_Be_Unsplash.jpg")'
# background_image_url = 'url("Henry_Be_Unsplash.jpg")'

# The textarea rows attribute expects a whole number, so use integer division
row_number = max(1, len(punctuation) // 50)

# Define the HTML content
html_content = f"""
<!DOCTYPE html>
<html>
<head>
    <title>HTML with Background Image and Page Dimensions</title>
    <style>
        body {{
            background-image: {background_image_url};
            background-repeat: no-repeat;
            background-size: cover;
            display: flex;
            flex-direction: column;
            justify-content: center;
            align-items: center;
            }}
        textarea {{
            background: #FFFFED;
            width: 850px;
            height: 425px;
            margin-top: 3px;
        }}

        #title {{
            color: #FFFFED;  /* Replace with your desired text color, e.g., red */
            font-size: 56px;  /* Replace with your desired font size */
            text-align: center; /* Center the text horizontally */
        }}
    </style>
</head>
<body>
    <p id = "title"> Punctuation symbols in {title} by {author} </p>
    <textarea id="textbox" rows="{row_number}">{punctuation}</textarea>
</body>
</html>
"""

   
# Specify the file name and join it with the current working directory
file_name = "poster.html"
file_path = os.path.join(os.getcwd(), file_name)

with open(file_path, "w") as file:
    # Write the HTML content to the file
    file.write(html_content)


In [8]:
# Convert HTML to Image
hti = Html2Image()

hti.screenshot(html_str=html_content, save_as='page.png')
# imgkit.from_string('Hello!', 'out.jpg')

['c:\\Users\\Sebastian\\Desktop\\Portfolio\\Punctuation-Art\\page.png']
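
Because the punctuation string is pasted directly into the page, characters such as `&`, `<`, or `>` could be read as markup by the browser. Here is a minimal sketch of one way to guard against that, assuming the standard library's `html.escape` is acceptable for this poster. `build_textarea` is a hypothetical helper, and the default of 50 symbols per row mirrors the `row_number` calculation above.

```python
import html
import math


def build_textarea(punctuation, symbols_per_row=50):
    """Return a <textarea> element holding the escaped punctuation string."""
    # The rows attribute must be a whole number, so round up
    rows = max(1, math.ceil(len(punctuation) / symbols_per_row))
    # Escape characters that HTML would otherwise treat as markup
    safe = html.escape(punctuation)
    return f'<textarea id="textbox" rows="{rows}">{safe}</textarea>'


print(build_textarea(", ; & < > !"))
```

The returned string could replace the `<textarea ...>` line inside the `html_content` template.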

### Optional: general function to filter out text within raw Gutenberg file

In [9]:
def find_beginning_and_end(raw):
    '''
    This function serves to find the text within the raw data provided by Project Gutenberg.
    It relies on the global `title` and `author` variables defined in the notebook.
    '''

    start_regex = r'\*\*\*\s?START OF TH(IS|E) PROJECT GUTENBERG EBOOK.*\*\*\*'
    draft_start_position = re.search(start_regex, raw)
    beginning = draft_start_position.end()

    # Skip past the title if it appears after the start marker
    title_position = re.search(re.escape(title), raw[beginning:], flags=re.IGNORECASE)
    if title_position:
        beginning += title_position.end()
        # If the title is present, check for the author's name as well
        author_position = re.search(re.escape(author), raw[beginning:], flags=re.IGNORECASE)
        if author_position:
            beginning += author_position.end()

    end_regex = r'\*\*\* end of th(is|e) project gutenberg ebook'
    end_position = re.search(end_regex, raw, flags=re.IGNORECASE)

    text = raw[beginning:end_position.start()]

    return text
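
Since `find_beginning_and_end` reads `title` and `author` from the notebook's global variables, it can help to try the marker logic on a small string first. The sketch below is a simplified variant under stated assumptions: `strip_gutenberg_markers` is a hypothetical name, it only trims at the `*** START OF ...` and `*** END OF ...` lines (it does not skip the title or author), and it returns the input unchanged when a marker is missing instead of raising an error.

```python
import re

# Marker patterns, matched case-insensitively
START_RE = re.compile(
    r"\*\*\*\s?START OF TH(?:IS|E) PROJECT GUTENBERG EBOOK.*\*\*\*", re.IGNORECASE
)
END_RE = re.compile(
    r"\*\*\*\s?END OF TH(?:IS|E) PROJECT GUTENBERG EBOOK", re.IGNORECASE
)


def strip_gutenberg_markers(raw):
    """Return the text between the Gutenberg start and end marker lines."""
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    # Fall back to the untouched text if either marker is missing
    if start is None or end is None:
        return raw
    return raw[start.end():end.start()].strip()


# Made-up miniature file used only to exercise the function
sample = (
    "Header and licence notes\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK APOLOGY ***\n"
    "How you, O Athenians ...\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK APOLOGY ***\n"
    "Footer and licence notes\n"
)

print(strip_gutenberg_markers(sample))
```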

### Your turn: Choose your own text

[Project Gutenberg](https://gutenberg.org/) is a wonderful resource that makes classical literature readily available on the internet. Find any text you are interested in and copy the link to the raw text file.
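
Both links used in this notebook follow the same shape, `https://gutenberg.org/cache/epub/<id>/pg<id>.txt`, where `<id>` is the number in the book's Project Gutenberg address. Assuming that pattern holds for the text you pick (worth checking in a browser first), a small hypothetical helper such as `gutenberg_txt_url` can build the link from the ID:

```python
def gutenberg_txt_url(ebook_id):
    """Build a plain-text URL from a Project Gutenberg ebook ID.

    Assumes the cache URL pattern seen in this notebook's two example links.
    """
    ebook_id = int(ebook_id)
    return f"https://gutenberg.org/cache/epub/{ebook_id}/pg{ebook_id}.txt"


print(gutenberg_txt_url(1656))  # the Apology link used in this notebook
print(gutenberg_txt_url(158))   # the Emma link used in this notebook
```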



In [10]:
# Define text of interest

# Replace here the title, author, and url objects to make your own poster!

title = 'Emma'
author = 'Jane Austen'
url = 'https://gutenberg.org/cache/epub/158/pg158.txt'

In [11]:
# Scrape Raw Text
raw = text_from_gutenberg(title=title, author=author, url=url, return_raw=True)

# Filter on beginning and end
text = find_beginning_and_end(raw)

# Replace new line characters with spaces and drop underscores
text_no_new = text.replace('\n', ' ')
text_no_underscore = text_no_new.replace('_', '')

# Tokenize text so it can be fed to spaCy
nlp = spacy.load("en_core_web_sm")
tokenized = nlp(text_no_underscore)

# Extract punctuation using spaCy pipeline
punctuation = preprocessing(tokenized)

emma file already exists
corpora/canon_texts/emma


In [None]:
# Recompute the row count for the new text, then define the HTML content
row_number = max(1, len(punctuation) // 50)
html_content = f"""
<!DOCTYPE html>
<html>
<head>
    <title>HTML with Background Image and Page Dimensions</title>
    <style>
        body {{
            background-image: {background_image_url};
            background-repeat: no-repeat;
            background-size: cover;
            display: flex;
            flex-direction: column;
            justify-content: center;
            align-items: center;
            }}
        textarea {{
            background: #FFFFED;
            width: 850px;
            height: 425px;
            margin-top: 3px;
        }}

        #title {{
            color: #FFFFED;  /* Replace with your desired text color, e.g., red */
            font-size: 56px;  /* Replace with your desired font size */
            text-align: center; /* Center the text horizontally */
        }}
    </style>
</head>
<body>
    <p id = "title"> Punctuation in {title} by {author} </p>
    <textarea id="textbox" rows="{row_number}">{punctuation}</textarea>
</body>
</html>
"""


In [None]:
# Specify the file name and join it with the current working directory
file_name = "emma.html"
file_path = os.path.join(os.getcwd(), file_name)

with open(file_path, "w") as file:
    # Write the HTML content to the file
    file.write(html_content)