# BERT for Data Extraction

https://arxiv.org/pdf/2010.09885

In [23]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("DeepChem/ChemBERTa-77M-MLM") # load tokenizer class

model = AutoModelForMaskedLM.from_pretrained("DeepChem/ChemBERTa-77M-MLM")

# Tokenizing the Current Dataset

Before the model can use the dataset, every text entry in it has to be tokenized.

- Models like BERT cannot process raw text directly. They require the text to be converted into numerical representations (tokens).
- Tokenization splits the text into smaller units (tokens), maps them to unique IDs, and prepares them for input into the model.
- Tokenizers are model-specific: the tokenizer of one model is generally not compatible with another model, because each tokenizer is built with its own vocabulary and tokenization method.

### How the Hugging Face Tokenizer Works

- Convert the text into tokens.
- Map tokens to IDs (each token in the vocabulary has a unique ID).
- Add special tokens such as [CLS] and [SEP], which mark the start of a sequence and the end of (or boundary between) sequences.
- Return an attention mask that indicates which tokens are actual input and which are padding.
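
The four steps above can be sketched in plain Python. This is only a toy illustration: `TOY_VOCAB` and `toy_encode` are hypothetical names, the vocabulary is a made-up character-level one, and a real Hugging Face tokenizer uses a learned subword vocabulary instead.

```python
# Toy illustration only: a made-up character-level vocabulary, not ChemBERTa's real one.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "C": 4, "O": 5, "N": 6, "=": 7}


def toy_encode(text, max_length=8):
    # 1. Convert the text into tokens (here: one token per character)
    tokens = list(text)
    # 2. Map tokens to IDs, falling back to [UNK] for unknown tokens
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    # 3. Add the special tokens that mark the start and end of the sequence
    ids = [TOY_VOCAB["[CLS]"]] + ids[: max_length - 2] + [TOY_VOCAB["[SEP]"]]
    # 4. Build the attention mask, then pad both lists up to max_length
    attention_mask = [1] * len(ids)
    pad = max_length - len(ids)
    ids = ids + [TOY_VOCAB["[PAD]"]] * pad
    attention_mask = attention_mask + [0] * pad
    return {"input_ids": ids, "attention_mask": attention_mask}


toy_result = toy_encode("C=O")
print(toy_result)
```

The returned dictionary has the same two keys as the real tokenizer output: `input_ids` and `attention_mask`.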

BERT-style models use subword-based tokenization: a word that is not in the vocabulary as a whole is split into multiple subword tokens.

In [None]:
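# Note: ChemBERTa's tokenizer is built for SMILES strings, so plain English
# text like this is reduced to the few characters it recognizes (see output).
text = "phenol is organic"

# Step 1: split the text into tokens
tokens = tokenizer.tokenize(text)
print(tokens)

# Step 2: map each token to its vocabulary ID
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)

# encode() also adds the special start and end tokens (IDs 12 and 13 in the output)
encoded = tokenizer.encode(text, max_length=512, padding=True, truncation=True)
print(encoded)

# Calling the tokenizer directly returns input_ids plus an attention mask
# (1 for actual tokens, 0 for padding)
att_encoded = tokenizer(text, max_length=512, padding=True, truncation=True)
print(att_encoded)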

['p', 'n', 'o', 's', 'o', 'n', 'c']
[206, 25, 44, 42, 44, 25, 15]
[12, 206, 25, 44, 42, 44, 25, 15, 13]
{'input_ids': [12, 206, 25, 44, 42, 44, 25, 15, 13], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}


- max_length: Controls the maximum size of the sequence.
- padding: Ensures all sequences in a batch are the same length.
- truncation: Shortens sequences that exceed the maximum length.
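
As a rough sketch of how these three parameters interact on a batch, the snippet below truncates each sequence to `max_length` and then pads every sequence to the longest one remaining, which mirrors the behaviour of `padding=True`. The helper `pad_and_truncate` and the ID values are hypothetical, and special tokens are ignored for simplicity.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    # Truncation: cut every sequence down to at most max_length IDs
    truncated = [ids[:max_length] for ids in batch_ids]
    # Padding: extend every sequence to the longest one left in the batch
    target = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    # Attention mask: 1 for actual tokens, 0 for padding
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = [[5, 6], [7, 8, 9, 10, 11, 12], [4]]
padded = pad_and_truncate(toy_batch, max_length=4)
print(padded["input_ids"])
print(padded["attention_mask"])
```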

### 1. Load the dataset and convert it from a pandas DataFrame to a Hugging Face Dataset

- Why convert? A Hugging Face Dataset is designed for large datasets that may not fit into memory. It is backed by Apache Arrow and can use memory-mapped files, which lets you work with data that would be too large for a pandas DataFrame.

- It also makes it easy to apply transformations to the whole dataset with the .map() function.

In [12]:
from datasets import Dataset
from transformers import AutoTokenizer
import pandas as pd

# Read the CSV file into a pandas DataFrame
data = pd.read_csv(r"D:\Bunker\OneDrive - Amrita vishwa vidyapeetham\BaseCamp\ML\PLAI\Dataset_17_feat.csv")
data.head()

# Convert the DataFrame into a Hugging Face Dataset
dataset = Dataset.from_pandas(data)

### 2. Tokenizer instantiation

In [None]:
tokenizer = AutoTokenizer.from_pretrained("DeepChem/ChemBERTa-77M-MLM") # load tokenizer class

### 3. Tokenize Dataset

"tokenized_dataset = dataset.map(token_it, batched=True)"

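This line applies the token_it function to the whole dataset.
Because batched=True is set, map does not call token_it once per row. It passes the rows in batches (1,000 rows at a time by default) as a dictionary of column name to list of values, and adds the returned input_ids and attention_mask as new columns.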
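
To make the batched calling convention concrete, here is a conceptual sketch in plain Python. It is not how the datasets library is implemented: `toy_batched_map` and `toy_token_fn` are hypothetical stand-ins, and the "tokenizer" simply converts characters to their character codes.

```python
def toy_batched_map(rows, fn, batch_size=2):
    # rows: list of dicts (one per row); fn receives a dict of column -> list of values
    out_rows = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        batch = {col: [row[col] for row in chunk] for col in chunk[0]}
        new_cols = fn(batch)
        # Merge the returned columns back into each row of the chunk
        for i, row in enumerate(chunk):
            merged = dict(row)
            for col, values in new_cols.items():
                merged[col] = values[i]
            out_rows.append(merged)
    return out_rows


def toy_token_fn(batch):
    # Hypothetical stand-in for token_it: "tokenizes" by taking character codes
    return {"input_ids": [[ord(ch) for ch in text] for text in batch["DP_Group"]]}


toy_rows = [{"DP_Group": "CO"}, {"DP_Group": "N"}, {"DP_Group": "CCO"}]
mapped = toy_batched_map(toy_rows, toy_token_fn)
print(mapped)
```

The function passed to a batched map must therefore accept lists of values per column and return lists of the same length.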

In [44]:


def token_it(batch):
    # batch is a dict mapping column names to lists of values
    return tokenizer(batch["DP_Group"], padding=True, truncation=True, max_length=512)

tokenized_dataset = dataset.map(token_it, batched=True)
display(tokenized_dataset)

Map: 100%|██████████| 3783/3783 [00:00<00:00, 43645.96 examples/s]


Dataset({
    features: ['Experimental_index', 'DP_Group', 'LA/GA', 'Polymer_MW', 'CL Ratio', 'Drug_Tm', 'Drug_Pka', 'Initial D/M ratio', 'DLC', 'SA-V', 'SE', 'Drug_Mw', 'Drug_TPSA', 'Drug_NHA', 'Drug_LogP', 'Time', 'T=0.25', 'T=0.5', 'T=1.0', 'Release', 'input_ids', 'attention_mask'],
    num_rows: 3783
})

Nice! We have now tokenized the entire dataset.
Next, let's fine-tune the model.

# Training (Fine-Tuning) the BERT Model

Transformers provides the Trainer API, which offers a comprehensive set of training features, for fine-tuning any of the models on the Hub.
https://huggingface.co/docs/transformers/en/training

### Importing 

In [48]:
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments


In [49]:

# Create a data collator for masked language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15
)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./chemberta_finetuned",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=500,
    save_steps=1000,
    save_total_limit=2,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
)

# Train the model
trainer.train()

# Save the fine-tuned model
model.save_pretrained("./chemberta_finetuned")
tokenizer.save_pretrained("./chemberta_finetuned")

| Step | Training Loss |
|------|---------------|
| 500  | 0.7601        |
| 1000 | 0.3186        |


('./chemberta_finetuned\\tokenizer_config.json',
 './chemberta_finetuned\\special_tokens_map.json',
 './chemberta_finetuned\\vocab.json',
 './chemberta_finetuned\\merges.txt',
 './chemberta_finetuned\\added_tokens.json',
 './chemberta_finetuned\\tokenizer.json')
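
For intuition about what the masked-language-modeling collator used above does with mlm_probability=0.15, here is a simplified sketch in plain Python. It assumes the usual scheme in which each selected token is replaced by the mask token 80% of the time, by a random token 10% of the time, and left unchanged otherwise, with a label of -100 marking positions that the loss ignores. The function `toy_mlm_mask`, the mask ID 14, and the vocabulary size 600 are hypothetical values chosen only for illustration.

```python
import random


def toy_mlm_mask(input_ids, mask_id, vocab_size, special_ids, mlm_probability=0.15, seed=0):
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 marks positions the loss ignores
    for i, tok in enumerate(input_ids):
        # Never mask special tokens; select other tokens with mlm_probability
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels


# IDs taken from the tokenizer output shown earlier; 12 and 13 are the special tokens
toy_ids = [12, 206, 25, 44, 42, 44, 25, 15, 13]
masked_ids, labels = toy_mlm_mask(toy_ids, mask_id=14, vocab_size=600, special_ids={12, 13})
print(masked_ids)
print(labels)
```

Because the positions are chosen at random in every batch, the model sees different masked versions of the same row across epochs.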