In this Google Colab cell, we download two datasets for a text style transfer task. The goal is to develop a model that can translate text between the writing style of William Shakespeare and that of a more modern English author. We have chosen the following two datasets:

- **Shakespeare's Complete Works**: This dataset contains all of William Shakespeare's plays and poems. They are written in Early Modern English with a rich, distinctive style, which makes them well suited to studying stylistic differences from later English.

- **"Pride and Prejudice" by Jane Austen**: We chose Jane Austen's novel "Pride and Prejudice" to represent more modern English. Although not contemporary, Austen's language is much closer to present-day English than Shakespeare's, while retaining a formal and complex style. This contrast gives the style transfer model a challenging but informative basis.

With these two datasets, we aim to explore linguistic style transfer, focusing on how syntactic and stylistic elements change between the two forms of English.

The following script will download both datasets from Project Gutenberg, which is a reliable source for public domain texts:

In [1]:
import requests


def download_text(url, path, label):
    """Download a plain-text file from `url`, save it to `path`, and return the text."""
    # A timeout keeps the cell from hanging if the server does not respond.
    response = requests.get(url, timeout=30)
    if response.status_code != 200:
        print(f"Failed to download {label} (HTTP {response.status_code})")
        return None
    # The file is written as UTF-8 below, so decode it as UTF-8 too
    # (assumed encoding of Project Gutenberg's '-0.txt' files).
    response.encoding = 'utf-8'
    with open(path, 'w', encoding='utf-8') as file:
        file.write(response.text)
    print(f"Downloaded {label}")
    return response.text


# Download Shakespeare's Complete Works
shakespeare_url = 'https://www.gutenberg.org/files/100/100-0.txt'
shakespeare_text = download_text(
    shakespeare_url, 'shakespeare_complete_works.txt', "Shakespeare's Complete Works")

# Download "Pride and Prejudice" by Jane Austen
pride_prejudice_url = 'https://www.gutenberg.org/files/1342/1342-0.txt'
pride_prejudice_text = download_text(
    pride_prejudice_url, 'pride_and_prejudice.txt', "'Pride and Prejudice' by Jane Austen")


Downloaded Shakespeare's Complete Works
Downloaded 'Pride and Prejudice' by Jane Austen


To achieve style transfer between Shakespearean English and more modern English using the datasets provided, you can follow these steps:

- **Data Preprocessing**:
  - Tokenize and clean both datasets. This includes removing special characters, headers, and footers from the Project Gutenberg files.
  - Break down the text into smaller chunks (e.g., sentences or paragraphs) for easier processing.

- **Exploratory Data Analysis (EDA)**:
  - Analyze the unique characteristics of each style, such as common words, sentence length, and syntactic structures.
  - Use visualization tools to highlight these differences.

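- **Creating a Pseudo Parallel Corpus**:
  - Explain why a pseudo parallel corpus is needed: the two datasets are independent texts, not sentence-by-sentence translations of each other.
  - Propose unsupervised methods to find semantically similar sentences across the two datasets.
  - Check whether the matched sentences are in fact semantically similar.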

- **Model Selection and Implementation**:
  - Choose a suitable NLP model for style transfer and discuss the reasons for this choice.
  - Fine-tune the model on the style transfer task using the prepared datasets.
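As a starting point for the EDA step above, the following is a minimal sketch that summarizes a list of sentences by mean sentence length, vocabulary size, and most common words. The function name `style_stats` and the simple regex tokenizer are illustrative assumptions, not part of the original notebook; a proper tokenizer would likely give better results on Shakespeare's contractions and archaic spellings.

```python
import re
from collections import Counter


def style_stats(sentences, top_n=5):
    """Summarize sentences: count, mean length in words, vocabulary, top words."""
    # Hypothetical tokenizer: lowercase runs of letters and apostrophes.
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    tokenized = [tokens for tokens in tokenized if tokens]
    words = [w for tokens in tokenized for w in tokens]
    mean_len = sum(len(t) for t in tokenized) / len(tokenized) if tokenized else 0.0
    return {
        "num_sentences": len(tokenized),
        "mean_sentence_length": mean_len,
        "vocab_size": len(set(words)),
        "top_words": Counter(words).most_common(top_n),
    }
```

Running `style_stats` on each dataset's sentences and comparing the two dictionaries side by side gives a first numeric picture of the stylistic differences before any plotting.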

Further, if time allows:
- **Evaluation**:
  - Develop a set of metrics to evaluate the effectiveness of the style transfer, such as BLEU score, perplexity, and a qualitative assessment by human readers.
  - Compare the output of the model with the target style to assess how well it has captured the stylistic elements.
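To make the BLEU idea above concrete, here is a minimal sketch of clipped n-gram precision, the building block of BLEU. It assumes a reference sentence in the target style is available for each model output (for example from the pseudo parallel corpus), and it uses plain whitespace tokenization. The function names are illustrative. Full BLEU additionally combines several n-gram orders and applies a brevity penalty, so an established library implementation is preferable for reported scores.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams found in the reference, where each
    n-gram is credited at most as often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / total
```

The clipping matters: without it, an output that repeats a single common word would score perfectly on unigram precision.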


To-dos:

1. Pride and Prejudice cleaning @Now

2. Shakespeare cleaning @Now

3. Research and understand Pseudo Parallel Corpus + how to incorporate both datasets @Now

4. Identify 1-2 model architectures (Shakespearean style-transfer models, fine-tuned general-purpose LLMs), e.g. https://github.com/ToruOwO/style-transfer-writing @Next

5. Visualisation @Last

6. Train models: first a Shakespeare-to-modern style transfer model, then stronger general-purpose models @Last


## Data Preprocessing

The two datasets are structured quite differently (a collection of plays and poems versus a single novel), so each will need its own preprocessing steps.
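The steps shared by both files, removing the Project Gutenberg header and footer and splitting the text into sentences, can be sketched as follows. The marker patterns are an assumption about how the downloaded files delimit the book text and should be checked against the actual files. The function names and the naive punctuation-based splitter are illustrative only.

```python
import re

# Assumed marker lines, e.g. "*** START OF THE PROJECT GUTENBERG EBOOK ... ***".
START_RE = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE | re.DOTALL,
)
END_RE = re.compile(
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK", re.IGNORECASE
)


def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END markers.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_RE.search(text)
    end = END_RE.search(text)
    begin = start.end() if start else 0
    stop = end.start() if end and end.start() > begin else len(text)
    return text[begin:stop].strip()


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip()
    if not normalized:
        return []
    return [s for s in re.split(r"(?<=[.!?])\s+", normalized) if s]
```

This splitter will break on abbreviations such as "Mr." in "Pride and Prejudice", and it ignores speaker names and stage directions in the plays, so dataset-specific cleaning is still needed on top of it.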

## Pseudo Parallel Corpus

A pseudo parallel corpus is a type of linguistic resource used in natural language processing (NLP) and machine translation. It's a collection of texts that are not originally parallel but have been aligned or matched to function as a parallel corpus. Here's a breakdown of what this means:

1. **Parallel Corpus**: In NLP, a parallel corpus typically consists of a set of documents in one language and their direct translations in another language. These corpora are essential for training machine translation systems and for other multilingual tasks.

2. **Pseudo Parallel**: The term "pseudo" indicates that the corpus is not naturally parallel. In a pseudo parallel corpus, the text pairs are not exact translations of each other. Instead, they are independently created documents that are similar in content, style, or topic.

3. **Uses and Creation**: Pseudo parallel corpora are often used when a true parallel corpus is unavailable for a particular language pair or domain. They can be created by aligning texts based on similarity metrics, using techniques like statistical machine translation models or deep learning methods to find text pairs that convey similar information.

4. **Applications**: These corpora are valuable in scenarios where bilingual data is scarce. They can help in training machine translation systems, especially for low-resource languages. They're also used in cross-lingual information retrieval and text summarization, where understanding the gist of texts across languages is crucial.

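In summary, a pseudo parallel corpus is an artificially created resource that simulates a parallel corpus, enabling tasks that normally require aligned text in scenarios where direct translations are not available. In this project the two "languages" are Shakespeare's style and Austen's style: since neither text is a rewrite of the other, sentence pairs have to be constructed by matching sentences with similar content across the two datasets.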
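A minimal sketch of the alignment idea follows: each source sentence is paired with its most similar target sentence, and only pairs above a similarity threshold are kept. For simplicity it uses bag-of-words cosine similarity, which measures lexical overlap rather than meaning. A real pipeline would more likely swap in sentence embeddings for the similarity function. The function names and the `threshold` value are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def bow(sentence):
    """Bag-of-words vector: lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_pseudo_parallel(source_sentences, target_sentences, threshold=0.5):
    """Pair each source sentence with its best-matching target sentence,
    keeping only pairs whose similarity is at least `threshold`."""
    target_vecs = [bow(t) for t in target_sentences]
    pairs = []
    for src in source_sentences:
        src_vec = bow(src)
        scored = [
            (cosine(src_vec, vec), tgt)
            for vec, tgt in zip(target_vecs, target_sentences)
        ]
        if not scored:
            continue
        score, best = max(scored, key=lambda item: item[0])
        if score >= threshold:
            pairs.append((src, best, score))
    return pairs
```

The comparison is quadratic in the number of sentences, so on the full datasets some form of pre-filtering or approximate nearest-neighbour search would be needed. Returning the score with each pair makes it easy to inspect matches manually and tune the threshold.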