In [1]:
!pip install transformers



In [2]:
import pandas as pd

In [3]:
# reading the dataset
path = '/content/drive/MyDrive/MSc Data Science/datasets/english_balanced_10k.jsonl'
df = pd.read_json(path, lines = True)

In [4]:
df.head()

| | masked_text | unmasked_text | token_entity_labels | tokenised_unmasked_text |
|---|---|---|---|---|
| 0 | [PREFIX_1] [FIRSTNAME_1] [MIDDLENAME_1] [LASTN... | Mr. Adolphus Reagan Ziemann, as a Central Prin... | [B-PREFIX, I-PREFIX, B-FIRSTNAME, I-FIRSTNAME,... | [mr, ., adolph, ##us, reagan, z, ##ie, ##mann,... |
| 1 | Hello [FIRSTNAME_1], would you please investig... | Hello Hannah, would you please investigate the... | [O, B-FIRSTNAME, O, O, O, O, O, O, O, O, O, O,... | [hello, hannah, ,, would, you, please, investi... |
| 2 | We also request a review of our policies with ... | We also request a review of our policies with ... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... | [we, also, request, a, review, of, our, polici... |
| 3 | Dear [FIRSTNAME_1], a company-wide presentatio... | Dear Devan, a company-wide presentation is req... | [O, B-FIRSTNAME, I-FIRSTNAME, O, O, O, O, O, O... | [dear, dev, ##an, ,, a, company, -, wide, pres... |
| 4 | Can we also have a session on how to manage st... | Can we also have a session on how to manage st... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... | [can, we, also, have, a, session, on, how, to,... |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10912 entries, 0 to 10911
Data columns (total 4 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   masked_text              10912 non-null  object
 1   unmasked_text            10912 non-null  object
 2   token_entity_labels      10912 non-null  object
 3   tokenised_unmasked_text  10912 non-null  object
dtypes: object(4)
memory usage: 341.1+ KB


## Data Preprocessing
Before training, the text has to be converted into the numerical format the model expects. The tokenizers provided by the transformers library handle this conversion.

In [6]:
training_texts = df['unmasked_text']
training_texts

0        Mr. Adolphus Reagan Ziemann, as a Central Prin...
1        Hello Hannah, would you please investigate the...
2        We also request a review of our policies with ...
3        Dear Devan, a company-wide presentation is req...
4        Can we also have a session on how to manage st...
                               ...                        
10907    Can you please provide a breakdown of the comp...
10908    A transaction for user Patsy_Volkman on 23/10/...
10909    We are curious about the current investments i...
10910    Can you create an update presentation about th...
10911    Can you help with the documentation to set up ...
Name: unmasked_text, Length: 10912, dtype: object

In [7]:
df.loc[0, 'unmasked_text']

"Mr. Adolphus Reagan Ziemann, as a Central Principal Applications Executive at McLaughlin, Nader and Purdy, your knowledge of change management is vital for our company's transformation. We request you to create a change management strategy."

In [8]:
df.loc[0, 'masked_text']

"[PREFIX_1] [FIRSTNAME_1] [MIDDLENAME_1] [LASTNAME_1], as a [JOBDESCRIPTOR_1] [JOBTITLE_1] at [COMPANY_NAME_1], your knowledge of change management is vital for our company's transformation. We request you to create a change management strategy."

In [9]:
df.loc[0, 'token_entity_labels']

['B-PREFIX',
 'I-PREFIX',
 'B-FIRSTNAME',
 'I-FIRSTNAME',
 'B-MIDDLENAME',
 'B-LASTNAME',
 'I-LASTNAME',
 'I-LASTNAME',
 'O',
 'O',
 'O',
 'B-JOBDESCRIPTOR',
 'B-JOBTITLE',
 'I-JOBTITLE',
 'I-JOBTITLE',
 'O',
 'B-COMPANY_NAME',
 'I-COMPANY_NAME',
 'I-COMPANY_NAME',
 'I-COMPANY_NAME',
 'I-COMPANY_NAME',
 'I-COMPANY_NAME',
 'I-COMPANY_NAME',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O']

In [10]:
df.loc[0, 'tokenised_unmasked_text']

['mr',
 '.',
 'adolph',
 '##us',
 'reagan',
 'z',
 '##ie',
 '##mann',
 ',',
 'as',
 'a',
 'central',
 'principal',
 'applications',
 'executive',
 'at',
 'mclaughlin',
 ',',
 'nad',
 '##er',
 'and',
 'pu',
 '##rdy',
 ',',
 'your',
 'knowledge',
 'of',
 'change',
 'management',
 'is',
 'vital',
 'for',
 'our',
 'company',
 "'",
 's',
 'transformation',
 '.',
 'we',
 'request',
 'you',
 'to',
 'create',
 'a',
 'change',
 'management',
 'strategy',
 '.']

The "token_entity_labels" column holds one entity label per token of "tokenised_unmasked_text". Each row is a list of labels whose length matches that row's token list (row 0 above has 48 tokens and 48 labels).

Here's a breakdown:

- **"tokenised_unmasked_text":** This column contains the tokenized version of the "unmasked_text" column. The tokens are lowercased words or subwords; a "##" prefix marks a subword that continues the previous token (for example, "adolph" followed by "##us").

- **"token_entity_labels":** This column contains one label per token, in what appears to be the BIO scheme. A "B-" prefix marks the first token of an entity (for example, "B-FULLNAME"), "I-" marks the following tokens of the same entity ("I-FULLNAME"), and "O" marks tokens that do not belong to any entity.

Here's an example row:

```plaintext
masked_text: "In our video conference, discuss the role of evidence in the arbitration process involving [FULLNAME_1] and [FULLNAME_2]."
unmasked_text: "In our video conference, discuss the role of evidence in the arbitration process involving Dr. Marvin Rolfson and Julius Daugherty."
token_entity_labels: ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "B-FULLNAME", "I-FULLNAME", "I-FULLNAME", "I-FULLNAME", "I-FULLNAME", "O", "B-FULLNAME", "I-FULLNAME", "I-FULLNAME", "I-FULLNAME", "O"]
tokenised_unmasked_text: ["in", "our", "video", "conference", ",", "discuss", "the", "role", "of", "evidence", "in", "the", "arbitration", "process", "involving", "dr", ".", "marvin", "rolf", "##son", "and", "julius", "da", "##ugh", "##erty", "."]
```

In this example, the "token_entity_labels" list corresponds to the labels for each token in the "tokenised_unmasked_text" list. The labels "B-FULLNAME" and "I-FULLNAME" indicate the entity boundaries for the full names mentioned in the text. The "O" labels represent tokens that do not belong to any named entity.
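To make the labelling scheme concrete, the sketch below groups tokens into entity spans from their B-/I-/O labels. The function name `extract_entities` is a hypothetical helper introduced here for illustration, and the sample input is the first five tokens of row 1 shown earlier. It assumes the labels follow the BIO convention described above.

```python
def extract_entities(tokens, labels):
    """Group BIO-labelled tokens into (entity_type, [tokens]) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A B- tag always opens a new span, closing any span still open.
            if current_tokens:
                spans.append((current_type, current_tokens))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            # An I- tag of the same type extends the open span.
            current_tokens.append(token)
        else:
            # "O" (or an I- tag with no matching open span) closes the span.
            if current_tokens:
                spans.append((current_type, current_tokens))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, current_tokens))
    return spans


tokens = ["hello", "hannah", ",", "would", "you"]
labels = ["O", "B-FIRSTNAME", "O", "O", "O"]
entities = extract_entities(tokens, labels)
print(entities)  # [('FIRSTNAME', ['hannah'])]
```

Applied to a full row, for instance `extract_entities(df.loc[0, 'tokenised_unmasked_text'], df.loc[0, 'token_entity_labels'])`, this gives a quick way to check that the labels line up with the tokens.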

In [11]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

In the code below, training_texts should be a list of strings, where each string represents a training sample. Each training sample corresponds to the "unmasked_text" column in your dataset. This list of strings is what you'll pass to the tokenizer to convert into numerical format suitable for training.

In [12]:
# Assuming df is your DataFrame containing the dataset
training_texts = df['unmasked_text'].tolist()

# Tokenize your training data
encoded_data = tokenizer(training_texts, truncation=True, padding=True)

print(encoded_data[: 10])

[Encoding(num_tokens=216, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]), Encoding(num_tokens=216, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]), Encoding(num_tokens=216, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]), Encoding(num_tokens=216, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]), Encoding(num_tokens=216, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]), Encoding(num_tokens=216, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]), Encoding(num_tokens=216, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]), Encoding(num_tokens=216, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]), Encoding(num_tokens=216, attrib
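The encoded inputs carry no labels yet. A minimal sketch of how the "token_entity_labels" lists could be turned into integer label ids of the same length as the padded inputs is shown below. It assumes that the tokens stored in "tokenised_unmasked_text" match the tokenizer's output exactly apart from the added [CLS] and [SEP] tokens, which is worth checking (for example by comparing against `tokenizer.tokenize(text)`) before relying on it. The helper names `build_label_map` and `align_labels` and the toy label lists are illustrative, not part of the dataset or the library.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def build_label_map(label_lists):
    """Map every distinct tag to an integer id (sorted for reproducibility)."""
    tags = sorted({tag for labels in label_lists for tag in labels})
    return {tag: i for i, tag in enumerate(tags)}


def align_labels(labels, label2id, seq_len):
    """Return label ids padded to seq_len, leaving room for [CLS] and [SEP]."""
    ids = [label2id[tag] for tag in labels][: seq_len - 2]  # truncate if too long
    ids = [IGNORE_INDEX] + ids + [IGNORE_INDEX]             # [CLS] ... [SEP]
    return ids + [IGNORE_INDEX] * (seq_len - len(ids))      # padding positions


# Toy label lists standing in for df['token_entity_labels'].
label_lists = [["O", "B-FIRSTNAME", "O"], ["B-PREFIX", "I-PREFIX", "O", "O"]]
label2id = build_label_map(label_lists)
aligned = [align_labels(labels, label2id, seq_len=8) for labels in label_lists]
print(label2id)
print(aligned)
```

In the notebook, `seq_len` would be the padded length of the encoded inputs (216 in the output above), and the aligned lists could be stored as `encoded_data["labels"]`.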

### Defining the model:
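Choose a suitable pre-trained model from the transformers library and fine-tune it on the task. The goal is to recover masked_text from unmasked_text, which amounts to predicting one entity label per token. The cell below loads a sequence-classification head, which predicts a single label for the whole text; since the labels here are per token, a token-classification head (for example AutoModelForTokenClassification, with num_labels set to the number of distinct tags) is likely the better fit.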

In [13]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
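Because the labels in this dataset are per token, a token-classification head may suit the task better than a sequence-classification head. The sketch below shows the input and output shapes of such a model. To stay small and avoid downloading weights, it builds a tiny, randomly initialised BERT from a config; the layer sizes, the three-tag label set, and the input ids are illustrative assumptions, not values from the notebook. In the notebook itself, the equivalent would presumably be `AutoModelForTokenClassification.from_pretrained("bert-base-uncased", num_labels=...)`.

```python
import torch
from transformers import BertConfig, BertForTokenClassification

# Illustrative tag set; the real one would come from df['token_entity_labels'].
tags = ["O", "B-FIRSTNAME", "I-FIRSTNAME"]

# Tiny configuration so the example runs quickly without pre-trained weights.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=len(tags),
)
model = BertForTokenClassification(config)

# One padded sequence of six positions; the last position is padding.
input_ids = torch.tensor([[1, 5, 6, 7, 2, 0]])
attention_mask = torch.tensor([[1, 1, 1, 1, 1, 0]])
# -100 marks positions (special tokens, padding) that the loss ignores.
label_ids = torch.tensor([[-100, 0, 1, 2, -100, -100]])

out = model(input_ids=input_ids, attention_mask=attention_mask, labels=label_ids)
logits_shape = tuple(out.logits.shape)  # (batch, sequence length, number of tags)
print(logits_shape, out.loss.item())
```

The model returns one score per tag for every token position, and the loss is averaged only over positions whose label is not -100.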


### Training the model

In [15]:
import transformers
print(transformers.__version__)

4.35.1


In [16]:
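from transformers import Trainer, TrainingArguments

# With PyTorch, Trainer in this transformers version also relies on the
# `accelerate` package; a missing install is a common cause of an ImportError
# at this step (e.g. run `!pip install accelerate` and restart the runtime).
training_args = TrainingArguments(
    output_dir="./model",
    per_device_train_batch_size=8,
    save_total_limit=2,
    num_train_epochs=3,
)

class EncodedDataset:
    """Serve the tokenizer output as one feature dict per example."""
    def __init__(self, encodings):
        self.encodings = encodings
    def __len__(self):
        return len(self.encodings["input_ids"])
    def __getitem__(self, idx):
        return {key: values[idx] for key, values in self.encodings.items()}

trainer = Trainer(
    model=model,
    args=training_args,
    # Trainer indexes the dataset by example, so the raw tokenizer output is wrapped.
    # The encodings also need a "labels" entry before a loss can be computed.
    train_dataset=EncodedDataset(encoded_data),
    # Add more parameters like eval_dataset, data_collator, etc., as needed
)

trainer.train()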

ImportError: ignored