In this Google Colab cell, we will be downloading two distinct datasets for a text style transfer task. The goal is to develop a model capable of translating text between the writing styles of William Shakespeare and a more modern English author. For this purpose, we have chosen two datasets:

    Shakespeare's Complete Works: This dataset includes the entire collection of plays and poems written by William Shakespeare. Shakespeare's works represent early modern English with a rich and unique style, making them ideal for studying stylistic differences compared to contemporary English.

    "Pride and Prejudice" by Jane Austen: As a representative of more modern English, we've chosen Jane Austen's famous novel "Pride and Prejudice". While not contemporary in the current sense, Austen's language is more aligned with modern English than Shakespeare's, yet it maintains a level of formality and complexity in its style. This contrast will provide a challenging yet insightful basis for the style transfer model.

By downloading and utilizing these two datasets, we aim to explore the nuances of linguistic style transfer, particularly focusing on the transformation of syntactic and stylistic elements between the two distinct forms of English.

The following script will download both datasets from Project Gutenberg, which is a reliable source for public domain texts:

In [1]:
# import requests

# # Download Shakespeare's Complete Works
# shakespeare_url = 'https://www.gutenberg.org/files/100/100-0.txt'
# response = requests.get(shakespeare_url)

# if response.status_code == 200:
#     shakespeare_text = response.text
#     with open('shakespeare_complete_works.txt', 'w', encoding='utf-8') as file:
#         file.write(shakespeare_text)
#     print("Downloaded Shakespeare's Complete Works")
# else:
#     print("Failed to download Shakespeare's Complete Works")

# # Download "Pride and Prejudice" by Jane Austen
# pride_prejudice_url = 'https://www.gutenberg.org/files/1342/1342-0.txt'
# response = requests.get(pride_prejudice_url)

# if response.status_code == 200:
#     pride_prejudice_text = response.text
#     with open('pride_and_prejudice.txt', 'w', encoding='utf-8') as file:
#         file.write(pride_prejudice_text)
#     print("Downloaded 'Pride and Prejudice' by Jane Austen")
# else:
#     print("Failed to download 'Pride and Prejudice'")


To achieve style transfer between Shakespearean English and more modern English using the datasets provided, you can follow these steps:

- **Data Preprocessing**:
  - Tokenize and clean both datasets. This includes removing special characters, headers, and footers from the Project Gutenberg files.
  - Break down the text into smaller chunks (e.g., sentences or paragraphs) for easier processing.

- **Exploratory Data Analysis (EDA)**:
  - Analyze the unique characteristics of each style, such as common words, sentence length, and syntactic structures.
  - Use visualization tools to highlight these differences.

- **Creating a Pseudo Parallel Corpus**
  - Explain why we need that
  - Propose unsupervised methods to find semantically similar sentences across the two datasets.
  - Check if they do indeed match

- **Model Selection and Implementation**:
  - **Choose a suitable NLP model for style transfer.** Discuss this choice
  - Fine-tune the model on the style transfer task using the prepared datasets.

Further, if time allows:
- **Evaluation**:
  - Develop a set of metrics to evaluate the effectiveness of the style transfer, such as BLEU score, perplexity, and a qualitative assessment by human readers.
  - Compare the output of the model with the target style to assess how well it has captured the stylistic elements.


To dos:

1. Pride and Prejudice cleaning @Now

2. Shakespear cleaning @Now

3. Research and understand Pseudo Parallel Corpus + how to incorporate both datasets @Now

4. Identify 1-2 model architectures (shakespearian style-transfers, fine-tuned GP LLMs): e.g. {https://github.com/ToruOwO/style-transfer-writing} @Next

5. Visualisation @Last

6. Train models a shakespear-to-modern style transfer model and then better general purpose models @Last


## Data Preprocessing

Both datasets appear quite different and will require unique preprocessing methods.
The focus was on ensuring all special characters were removed and that the sentences were in a similar format between Austin's and Shakespear's works. This meant adapting to their wrinting styles, with Austin writing in dialogues (meaning many quotations), and Shakespear writing in poems (meaning unique line breaks).

In [2]:
import pandas as pd
import re
import unicodedata

In [3]:
with open('pride_and_prejudice.txt', 'r', encoding='utf-8-sig') as file:
    austin_text = file.read()
austin_text = austin_text.replace('â', "« ").replace('â', "» ").replace('-', '').replace('_', '').replace('\n\n', "|").replace('||', "|").replace('\n', " ").replace('--', " ")
austin_text = austin_text.replace('â', "'")
austin_text = austin_text.split('|')

In [4]:
with open('shakespeare_complete_works.txt', 'r', encoding='utf-8-sig') as file:
    shakespeare_text = file.read()

shakespeare_text



### Jane Austin

In [6]:
pd.set_option('display.max_rows', None)  # No limit on the number of rows displayed.
pd.set_option('display.max_columns', None)  # No limit on the number of columns displayed.
pd.set_option('display.max_colwidth', None)  # Use maximum width to display each row.

In [7]:
def remove_unwanted_rows(df, column_name):
    """Update the regular expression to include 'PAGE' and any numerical values"""
    pattern = '\[|\]|CHAPTER|PAGE|\d+|\/\*|«|»'
    mask = df[column_name].str.contains(pattern, regex=True, case=False, na=False)
    df = df[~mask]
    return df

def split_into_sentences(text):
    sentence_endings_regex = r'(?<!\bMr)(?<!\bMrs)(?<!\bMs)(?<!\bDr)(?<!\bSir)\.\s+(?=[A-Z])'
    parts = re.split('({})'.format(sentence_endings_regex), text)
    sentences = []
    for i in range(0, len(parts) - 1, 2):
        combined = parts[i].strip()
        if i+1 < len(parts):
            combined += parts[i+1].strip()
        sentences.append(combined)

    # Include the last part if it doesn't end with a matched pattern
    if len(parts) % 2 != 0:
        sentences.append(parts[-1].strip())

    return [sentence for sentence in sentences if sentence]


In [8]:
df_austin = pd.DataFrame(austin_text, columns=['sentence'])
df_austin = df_austin[df_austin['sentence'] != '']

df_austin = remove_unwanted_rows(df_austin, 'sentence')

ranges = [(0,33), (2073,2088)]

for i, v in ranges:
    df_austin = df_austin.drop(df_austin.index[i:v]).reset_index(drop=True)


df_austin['sentences'] = df_austin['sentence'].apply(split_into_sentences)
df_austin = df_austin.explode('sentences')
df_austin = df_austin.drop(columns=['sentence'])
df_austin.head()

In [11]:
df_austin.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2029 entries, 0 to 750
Data columns (total 1 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   sentences  2029 non-null   object
dtypes: object(1)
memory usage: 31.7+ KB


In [12]:
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
pd.reset_option('display.width')

### Willie Shakespear

In [13]:
def replace_unicode_with_ascii(text):
    normalized_text = unicodedata.normalize('NFKD', text).encode('ASCII', 'ignore').decode('utf-8')
    return normalized_text


def basic_processing(corpus):

    corpus_ = corpus
    corpus_ = corpus_.replace('\r', '')
    corpus_ = corpus_.split('\n')
    corpus_ = [replace_unicode_with_ascii(sentence) for sentence in corpus_]
    corpus_ = [sentence.strip() for sentence in corpus_]
    corpus_ = [sentence for sentence in corpus_ if len(sentence) > 0]
    corpus_ = [sentence for sentence in corpus_ if not ((sentence[0]=='[')&(sentence[-1]==']'))]

    list_start_remove = ['ACT', 'Scene', 'Chorus']

    for elem in list_start_remove :
        corpus_ = [sentence for sentence in corpus_ if not sentence.startswith(elem)]

    corpus_ = [sentence for sentence in corpus_ if not sentence.upper == sentence ]

    return(corpus_)

content_list = basic_processing(shakespeare_text)[16:60]

def corpus_to_text_list(text_list, corpus):
    
    end_of_text_marker = '*** END OF THE PROJECT GUTENBERG EBOOK THE COMPLETE WORKS OF WILLIAM SHAKESPEARE ***'

    corpus_ = corpus[60:corpus.index(end_of_text_marker)]
    list_of_text = []
    for i in range(len(text_list)-1):
        list_of_text.append([corpus_[corpus_.index(text_list[i]):corpus_.index(text_list[i+1])]])
    list_of_text.append([corpus_[corpus_.index(text_list[-1])::]])
    return(list_of_text)

list_of_text = corpus_to_text_list(content_list, basic_processing(shakespeare_text))


In [14]:
list_of_text

[[['THE SONNETS',
   '1',
   'From fairest creatures we desire increase,',
   'That thereby beautyas rose might never die,',
   'But as the riper should by time decease,',
   'His tender heir might bear his memory:',
   'But thou contracted to thine own bright eyes,',
   'Feedast thy lightas flame with self-substantial fuel,',
   'Making a famine where abundance lies,',
   'Thyself thy foe, to thy sweet self too cruel:',
   'Thou that art now the worldas fresh ornament,',
   'And only herald to the gaudy spring,',
   'Within thine own bud buriest thy content,',
   'And, tender churl, makast waste in niggarding:',
   'Pity the world, or else this glutton be,',
   'To eat the worldas due, by the grave and thee.',
   '2',
   'When forty winters shall besiege thy brow,',
   'And dig deep trenches in thy beautyas field,',
   'Thy youthas proud livery so gazed on now,',
   'Will be a tattered weed of small worth held:',
   'Then being asked, where all thy beauty lies,',
   'Where all the tre

In [15]:
df_austin.head(n=50)

Unnamed: 0,sentences
0,"The astonishment of the ladies was just what he wishedthat of Mrs. Bennet perhaps surpassing the rest; though when the first tumult of joy was over, she began to declare that it was what she had expected all the while."
1,"The rest of the evening was spent in conjecturing how soon he would return Mr. Bennet's visit, and determining when they should ask him to dinner."
2,"Not all that Mrs. Bennet, however, with the assistance of her five daughters, could ask on the subject, was sufficient to draw from her husband any satisfactory description of Mr. Bingley."
2,"They attacked him in various ways, with barefaced questions, ingenious suppositions, and distant surmises; but he eluded the skill of them all; and they were at last obliged to accept the secondhand intelligence of their neighbour, Lady Lucas."
2,Her report was highly favourable.
2,Sir William had been delighted with him.
2,"He was quite young, wonderfully handsome, extremely agreeable, and, to crown the whole, he meant to be at the next assembly with a large party."
2,Nothing could be more delightful! To be fond of dancing was a certain step towards falling in love; and very lively hopes of Mr. Bingley's heart were entertained.
3,"In a few days Mr. Bingley returned Mr. Bennet's visit, and sat about ten minutes with him in his library."
3,"He had entertained hopes of being admitted to a sight of the young ladies, of whose beauty he had heard much; but he saw only the father."


## Pseudo Parallel Corpus

A pseudo parallel corpus is a type of linguistic resource used in natural language processing (NLP) and machine translation. It's a collection of texts that are not originally parallel but have been aligned or matched to function as a parallel corpus. Here's a breakdown of what this means:

1. **Parallel Corpus**: In NLP, a parallel corpus typically consists of a set of documents in one language and their direct translations in another language. These corpora are essential for training machine translation systems and for other multilingual tasks.

2. **Pseudo Parallel**: The term "pseudo" indicates that the corpus is not naturally parallel. In a pseudo parallel corpus, the text pairs are not exact translations of each other. Instead, they are independently created documents that are similar in content, style, or topic.

3. **Uses and Creation**: Pseudo parallel corpora are often used when a true parallel corpus is unavailable for a particular language pair or domain. They can be created by aligning texts based on similarity metrics, using techniques like statistical machine translation models or deep learning methods to find text pairs that convey similar information.

4. **Applications**: These corpora are valuable in scenarios where bilingual data is scarce. They can help in training machine translation systems, especially for low-resource languages. They're also used in cross-lingual information retrieval and text summarization, where understanding the gist of texts across languages is crucial.

In summary, a pseudo parallel corpus is an artificially created resource that simulates a parallel corpus, enabling various multilingual NLP tasks in scenarios where direct translations are not available.

## No Language Left Behind Translation