In this Google Colab cell, we will be downloading two distinct datasets for a text style transfer task. The goal is to develop a model capable of translating text between the writing styles of William Shakespeare and a more modern English author. For this purpose, we have chosen two datasets:

- **Shakespeare's Complete Works**: This dataset includes the entire collection of plays and poems written by William Shakespeare. Shakespeare's works represent Early Modern English with a rich and unique style, making them ideal for studying stylistic differences compared to contemporary English.

- **"Pride and Prejudice" by Jane Austen**: As a representative of more modern English, we've chosen Jane Austen's famous novel "Pride and Prejudice". While not contemporary in the current sense, Austen's language is more aligned with modern English than Shakespeare's, yet it maintains a level of formality and complexity in its style. This contrast will provide a challenging yet insightful basis for the style transfer model.

By downloading and utilizing these two datasets, we aim to explore the nuances of linguistic style transfer, particularly focusing on the transformation of syntactic and stylistic elements between the two distinct forms of English.

The following script will download both datasets from Project Gutenberg, which is a reliable source for public domain texts:

In [None]:
import requests

def download_text(url, path):
    """Download a text file, save it to `path`, and return its contents (None on failure)."""
    response = requests.get(url, timeout=30)
    if response.status_code != 200:
        return None
    # Decode as UTF-8 explicitly instead of relying on the encoding requests guesses
    # (assumes these Gutenberg "-0.txt" files are UTF-8).
    response.encoding = 'utf-8'
    with open(path, 'w', encoding='utf-8') as file:
        file.write(response.text)
    return response.text

# Download Shakespeare's Complete Works
shakespeare_url = 'https://www.gutenberg.org/files/100/100-0.txt'
shakespeare_text = download_text(shakespeare_url, 'shakespeare_complete_works.txt')
if shakespeare_text is not None:
    print("Downloaded Shakespeare's Complete Works")
else:
    print("Failed to download Shakespeare's Complete Works")

# Download "Pride and Prejudice" by Jane Austen
pride_prejudice_url = 'https://www.gutenberg.org/files/1342/1342-0.txt'
pride_prejudice_text = download_text(pride_prejudice_url, 'pride_and_prejudice.txt')
if pride_prejudice_text is not None:
    print("Downloaded 'Pride and Prejudice' by Jane Austen")
else:
    print("Failed to download 'Pride and Prejudice'")


Downloaded Shakespeare's Complete Works
Downloaded 'Pride and Prejudice' by Jane Austen


To achieve style transfer between Shakespearean English and more modern English using the datasets provided, you can follow these steps:

- **Data Preprocessing**:
  - Tokenize and clean both datasets. This includes removing special characters, headers, and footers from the Project Gutenberg files.
  - Break down the text into smaller chunks (e.g., sentences or paragraphs) for easier processing.

- **Exploratory Data Analysis (EDA)**:
  - Analyze the unique characteristics of each style, such as common words, sentence length, and syntactic structures.
  - Use visualization tools to highlight these differences.

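- **Creating a Pseudo-Parallel Corpus**:
  - Explain why a pseudo-parallel corpus is needed (the two datasets contain no aligned sentence pairs).
  - Propose unsupervised methods to find semantically similar sentences across the two datasets.
  - Verify that the matched sentences do indeed correspond in meaning.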

- **Model Selection and Implementation**:
  - **Choose a suitable NLP model for style transfer.** Discuss this choice.
  - Fine-tune the model on the style transfer task using the prepared datasets.
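
The preprocessing step listed above could be sketched as follows. This is a minimal sketch using only the standard library. It assumes the Project Gutenberg files delimit the body with lines beginning `*** START OF` and `*** END OF`, and the helper names `strip_gutenberg_boilerplate` and `split_sentences` are hypothetical. The regex splitter is deliberately naive (it will mis-split abbreviations such as "Mr."), so a proper sentence tokenizer may be preferable on the real texts.

```python
import re

def strip_gutenberg_boilerplate(text):
    """Return the text between the '*** START OF' and '*** END OF' marker lines.

    Assumption: the file uses these Project Gutenberg markers. If a marker
    is missing, the corresponding end of the text is left untouched.
    """
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith("*** START OF"):
            start = i + 1          # body begins after the start marker
        elif line.startswith("*** END OF"):
            end = i                # body stops before the end marker
            break
    return "\n".join(lines[start:end]).strip()

def split_sentences(text):
    """Naive splitter: collapse whitespace, then break after '.', '!' or '?'."""
    flat = re.sub(r"\s+", " ", text).strip()
    parts = re.split(r"(?<=[.!?])\s+", flat)
    return [p for p in parts if p]

# Toy input standing in for a downloaded file.
sample = (
    "Header line\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "It is a truth universally acknowledged. Is it\nnot so?\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Footer line\n"
)
body = strip_gutenberg_boilerplate(sample)
sentences = split_sentences(body)
print(sentences)
```

Collapsing whitespace before splitting matters here because the Gutenberg files hard-wrap lines, so a sentence often spans several lines of the raw text.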

Further, if time allows:
- **Evaluation**:
  - Develop a set of metrics to evaluate the effectiveness of the style transfer, such as BLEU score, perplexity, and a qualitative assessment by human readers.
  - Compare the output of the model with the target style to assess how well it has captured the stylistic elements.
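
As one building block for the evaluation step, the clipped n-gram precision that BLEU is built on can be computed with the standard library alone. This is a minimal sketch, not a full BLEU implementation: it omits the brevity penalty and the geometric mean over n-gram orders, it tokenizes by whitespace, and the function name `clipped_ngram_precision` and the toy sentences are hypothetical. For real evaluation an established BLEU implementation is the safer choice.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_ngram_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it appears
    in the reference (the "clipping" used by BLEU).
    """
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

reference = "you are a villain"
exact = clipped_ngram_precision("you are a villain", reference)
partial = clipped_ngram_precision("thou art a villain", reference)
repeated = clipped_ngram_precision("a a a a", reference)
print(exact, partial, repeated)
```

The third call shows why clipping is needed: without it, a candidate that repeats one reference word would score a perfect unigram precision.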


In [1]:
import torch
from transformers import AutoTokenizer, AutoModel


In [2]:
!pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting torchvision (from sentence-transformers)
  Obtaining dependency information for torchvision from https://files.pythonhosted.org/packages/02/b6/a540edc7ebcd510d42611e4344bbaa9c73e0c262750652e276866b43e33e/torchvision-0.16.1-cp310-cp310-macosx_11_0_arm64.whl.metadata
  Downloading torchvision-0.16.1-cp310-cp310-macosx_11_0_arm64.whl.metadata (6.6 kB)
Collecting scikit-learn (from sentence-transformers)
  Obtaining dependency information for scikit-learn from https://files.pythonhosted.org/packages/70/d0/50ace22129f79830e3cf682d0a2bd4843ef91573299d43112d52790163a8/scikit_learn-1.3.2-cp310-cp310-macosx_12_0_arm64.whl.metadata
  Downloading scikit_learn-1.3.2-cp310-cp310-macosx_12_0_arm64.whl.metadata (11 kB)
Collecting s

In [3]:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
print(embeddings)

Downloading .gitattributes: 100%|██████████| 1.18k/1.18k [00:00<00:00, 2.42MB/s]
Downloading 1_Pooling/config.json: 100%|██████████| 190/190 [00:00<00:00, 931kB/s]
Downloading README.md: 100%|██████████| 10.6k/10.6k [00:00<00:00, 26.9MB/s]
Downloading config.json: 100%|██████████| 612/612 [00:00<00:00, 2.86MB/s]
Downloading (…)ce_transformers.json: 100%|██████████| 116/116 [00:00<00:00, 553kB/s]
Downloading data_config.json: 100%|██████████| 39.3k/39.3k [00:00<00:00, 10.2MB/s]
Downloading pytorch_model.bin: 100%|██████████| 90.9M/90.9M [00:18<00:00, 4.92MB/s]
Downloading (…)nce_bert_config.json: 100%|██████████| 53.0/53.0 [00:00<00:00, 163kB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 112/112 [00:00<00:00, 385kB/s]
Downloading tokenizer.json: 100%|██████████| 466k/466k [00:00<00:00, 2.62MB/s]
Downloading tokenizer_config.json: 100%|██████████| 350/350 [00:00<00:00, 1.16MB/s]
Downloading train_script.py: 100%|██████████| 13.2k/13.2k [00:00<00:00, 18.9MB/s]
Downloading vocab

[[ 6.76569268e-02  6.34958595e-02  4.87131663e-02  7.93049484e-02
   3.74480151e-02  2.65273103e-03  3.93748954e-02 -7.09838420e-03
   5.93614578e-02  3.15370075e-02  6.00980110e-02 -5.29051572e-02
   4.06067595e-02 -2.59308219e-02  2.98427958e-02  1.12688739e-03
   7.35148787e-02 -5.03818542e-02 -1.22386679e-01  2.37027854e-02
   2.97265742e-02  4.24768254e-02  2.56337989e-02  1.99515815e-03
  -5.69191836e-02 -2.71599442e-02 -3.29035930e-02  6.60248697e-02
   1.19007140e-01 -4.58791628e-02 -7.26214647e-02 -3.25841382e-02
   5.23413569e-02  4.50552553e-02  8.25307053e-03  3.67024280e-02
  -1.39414705e-02  6.53919429e-02 -2.64272261e-02  2.06371697e-04
  -1.36643583e-02 -3.62810344e-02 -1.95043348e-02 -2.89738476e-02
   3.94270569e-02 -8.84091258e-02  2.62428215e-03  1.36713954e-02
   4.83062603e-02 -3.11565734e-02 -1.17329232e-01 -5.11690266e-02
  -8.85287672e-02 -2.18963381e-02  1.42985675e-02  4.44168337e-02
  -1.34815322e-02  7.43392855e-02  2.66382322e-02 -1.98762473e-02
   1.79191

The `all-MiniLM-L6-v2` project aims to train sentence embedding models on very large sentence-level datasets using a self-supervised contrastive learning objective. Its authors started from the pretrained **nreimers/MiniLM-L6-H384-uncased** model and fine-tuned it on a dataset of 1B sentence pairs. The objective is contrastive: given one sentence from a pair, the model should pick, from a set of candidates that includes randomly sampled other sentences, the one that was actually paired with it in the training data.
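
A contrastive objective of this kind is commonly written as a softmax cross-entropy over similarity scores, with the partners of the other pairs in a batch acting as the sampled negatives. The block below is a pure-Python sketch of that idea, not the project's training code: the function name `in_batch_contrastive_loss`, the toy 2-d embeddings, and the `scale` value of 20.0 are illustrative assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]
    and every other positives[j] serves as a negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * dot(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        # Negative log-probability assigned to the true partner.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy unit-length embeddings (assumption: real ones would come from the encoder).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each anchor is closest to its own partner
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each anchor is closest to the wrong partner

low = in_batch_contrastive_loss(anchors, aligned)
high = in_batch_contrastive_loss(anchors, swapped)
print(low, high)
```

The loss is near zero when every sentence is most similar to its true partner and grows when a sampled negative scores higher, which is the behavior the objective is meant to reward and penalize.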

**nreimers/MiniLM-L6-H384-uncased** is a 6-layer version of **microsoft/MiniLM-L12-H384-uncased**, obtained by keeping only every second layer.

**MiniLM**: Small and Fast Pre-trained Models for Language Understanding and Generation

MiniLM is a distilled model introduced in the paper "MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers".
Details on preprocessing and training are available in the original MiniLM repository.
Please note: the base MiniLM checkpoint can be used as a drop-in replacement for BERT, but it needs to be fine-tuned before use.


In [4]:
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
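
Because the pipeline printed above ends with a `Normalize()` module, the embeddings it returns have unit length, so the dot product of two embeddings equals their cosine similarity. One way this could be used to pair sentences for the pseudo-parallel corpus is sketched below. Toy 2-d vectors stand in for the output of `model.encode(...)`, and the function name `best_matches` and the threshold of 0.8 are hypothetical choices, not values taken from the notebook.

```python
import math

def normalize(v):
    """Scale a vector to unit length, mimicking the Normalize() module."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def best_matches(source_vecs, target_vecs, threshold=0.8):
    """For each source vector, find the most similar target vector.

    Returns (source_index, target_index, score) tuples, keeping only pairs
    whose cosine similarity reaches the threshold.
    """
    pairs = []
    for i, s in enumerate(source_vecs):
        # For unit-length vectors the dot product is the cosine similarity.
        scores = [sum(a * b for a, b in zip(s, t)) for t in target_vecs]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs

# Toy stand-ins for sentence embeddings from the two corpora.
source = [normalize(v) for v in [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]]
target = [normalize(v) for v in [[0.1, 1.0], [1.0, 0.0]]]

pairs = best_matches(source, target)
print([(i, j) for i, j, _ in pairs])
```

The threshold is what keeps the corpus "pseudo-parallel" rather than arbitrary: a source sentence with no sufficiently close counterpart, like the third toy vector, is dropped instead of being forced into a pair.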

In [8]:
from prettytable import PrettyTable

def count_parameters(model):
    table = PrettyTable(["Modules", "Parameters"])
    total_params = 0
    for name, parameter in model.named_parameters():
        if not parameter.requires_grad:
            continue
        params = parameter.numel()
        table.add_row([name, params])
        total_params += params
    print(table)
    print(f"Total Trainable Params: {total_params}")
    return total_params
    
count_parameters(model)

+----------------------------------------------------------------+------------+
|                            Modules                             | Parameters |
+----------------------------------------------------------------+------------+
|         0.auto_model.embeddings.word_embeddings.weight         |  11720448  |
|       0.auto_model.embeddings.position_embeddings.weight       |   196608   |
|      0.auto_model.embeddings.token_type_embeddings.weight      |    768     |
|            0.auto_model.embeddings.LayerNorm.weight            |    384     |
|             0.auto_model.embeddings.LayerNorm.bias             |    384     |
|    0.auto_model.encoder.layer.0.attention.self.query.weight    |   147456   |
|     0.auto_model.encoder.layer.0.attention.self.query.bias     |    384     |
|     0.auto_model.encoder.layer.0.attention.self.key.weight     |   147456   |
|      0.auto_model.encoder.layer.0.attention.self.key.bias      |    384     |
|    0.auto_model.encoder.layer.0.attent

22713216
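
The total above can be reproduced by hand from BERT-style layer shapes. The sketch below assumes a vocabulary of 30522 tokens, 512 positions, 2 token types, a hidden size of 384, a feed-forward size of 1536, 6 encoder layers, and a pooler layer. The first four values are consistent with the visible rows of the table (for example, 11720448 = 30522 × 384); the feed-forward size, layer count, and pooler are assumptions, since the table is truncated before those rows.

```python
hidden, vocab, positions, token_types = 384, 30522, 512, 2
intermediate, num_layers = 1536, 6   # assumed: not visible in the truncated table

def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden              # LayerNorm weight + bias

# Word, position and token-type embeddings, followed by one LayerNorm.
embeddings = (vocab + positions + token_types) * hidden + layer_norm

per_layer = (
    4 * linear(hidden, hidden)       # query, key, value, attention output
    + layer_norm                     # LayerNorm after attention
    + linear(hidden, intermediate)   # feed-forward expansion
    + linear(intermediate, hidden)   # feed-forward projection
    + layer_norm                     # LayerNorm after feed-forward
)

pooler = linear(hidden, hidden)      # assumed: BERT pooler on top of the encoder

total = embeddings + num_layers * per_layer + pooler
print(embeddings, per_layer, pooler, total)
```

Under these assumptions the sum matches the 22713216 trainable parameters reported by `count_parameters(model)`.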

In [9]:
model_big = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

Downloading .gitattributes: 100%|██████████| 1.18k/1.18k [00:00<00:00, 3.98MB/s]
Downloading 1_Pooling/config.json: 100%|██████████| 190/190 [00:00<00:00, 698kB/s]
Downloading README.md: 100%|██████████| 10.6k/10.6k [00:00<00:00, 38.8MB/s]
Downloading config.json: 100%|██████████| 571/571 [00:00<00:00, 4.64MB/s]
Downloading (…)ce_transformers.json: 100%|██████████| 116/116 [00:00<00:00, 866kB/s]
Downloading data_config.json: 100%|██████████| 39.3k/39.3k [00:00<00:00, 62.7MB/s]
Downloading pytorch_model.bin: 100%|██████████| 438M/438M [01:31<00:00, 4.80MB/s] 
Downloading (…)nce_bert_config.json: 100%|██████████| 53.0/53.0 [00:00<00:00, 246kB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 239/239 [00:00<00:00, 922kB/s]
Downloading tokenizer.json: 100%|██████████| 466k/466k [00:00<00:00, 1.67MB/s]
Downloading tokenizer_config.json: 100%|██████████| 363/363 [00:00<00:00, 1.41MB/s]
Downloading train_script.py: 100%|██████████| 13.1k/13.1k [00:00<00:00, 25.7MB/s]
Downloading vocab.

In [10]:
count_parameters(model_big)

+----------------------------------------------------------+------------+
|                         Modules                          | Parameters |
+----------------------------------------------------------+------------+
|      0.auto_model.embeddings.word_embeddings.weight      |  23444736  |
|    0.auto_model.embeddings.position_embeddings.weight    |   394752   |
|         0.auto_model.embeddings.LayerNorm.weight         |    768     |
|          0.auto_model.embeddings.LayerNorm.bias          |    768     |
|   0.auto_model.encoder.layer.0.attention.attn.q.weight   |   589824   |
|    0.auto_model.encoder.layer.0.attention.attn.q.bias    |    768     |
|   0.auto_model.encoder.layer.0.attention.attn.k.weight   |   589824   |
|    0.auto_model.encoder.layer.0.attention.attn.k.bias    |    768     |
|   0.auto_model.encoder.layer.0.attention.attn.v.weight   |   589824   |
|    0.auto_model.encoder.layer.0.attention.attn.v.bias    |    768     |
|   0.auto_model.encoder.layer.0.atten

109486464
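
Using only the figures printed in the two tables above, a short calculation shows how the two models compare in size and how much of each is taken up by the word-embedding table. The variable names are illustrative.

```python
# Totals and word-embedding row sizes copied from the two count_parameters outputs.
mini_total, mini_word_emb = 22713216, 11720448
mpnet_total, mpnet_word_emb = 109486464, 23444736

size_ratio = mpnet_total / mini_total
mini_share = mini_word_emb / mini_total
mpnet_share = mpnet_word_emb / mpnet_total

print(f"all-mpnet-base-v2 is {size_ratio:.2f}x the size of all-MiniLM-L6-v2")
print(f"word-embedding share: MiniLM {mini_share:.1%}, MPNet {mpnet_share:.1%}")
```

Roughly half of the smaller model's parameters sit in its word-embedding table, whereas in the larger model most of the parameters are in the encoder layers.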