<a href="https://colab.research.google.com/github/ShaliniAnandaPhD/Fine_Tuning/blob/main/Fine_tuning_a_BERT_Model_for_Text_Prediction_on_Perovskite_Solar_Cell_Literature.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

This notebook walks through a small end-to-end machine learning workflow: web scraping, data preprocessing, fine-tuning a language model, and making predictions with the fine-tuned model. The subject matter is perovskite solar cells, a type of solar cell under active research for its potential to outperform traditional silicon cells. The aim is to fine-tune a BERT model on perovskite-related documents so that it can better predict masked words in new texts about this technology.


# Test

To test the code, we scrape arXiv search results, a Wikipedia article, and a university site to gather texts related to perovskite solar cells. These documents are combined, shuffled, and split into training (80%), validation (10%), and test (10%) sets. The training set is then used to fine-tune a BERT model on a masked language modeling task using PyTorch and Hugging Face's `transformers` library.
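
The 80/10/10 split described above can be sketched in isolation. This is a minimal illustration rather than the notebook's exact code: the helper name `split_docs`, the fixed seed, and the placeholder documents are assumptions added here so the split is repeatable and easy to check.

```python
import random

def split_docs(docs, seed=0):
    """Shuffle a copy of docs and split it 80/10/10 into train, val, test."""
    shuffled = list(docs)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded RNG makes the split repeatable
    n = len(shuffled)
    train_end, val_end = int(0.8 * n), int(0.9 * n)
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]

# Placeholder documents standing in for the scraped paragraphs
example_docs = [f"document {i}" for i in range(10)]
train, val, test = split_docs(example_docs)
print(len(train), len(val), len(test))  # 8 1 1
```

Using a private `random.Random(seed)` instance keeps the split independent of any other code that touches the global random state.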

In the training phase, the model learns to predict masked words in a sentence from the surrounding context, which adapts it to the vocabulary and phrasing of the text data. Following training, the model and tokenizer are saved for later use.
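
To make the masked-word objective concrete, here is a rough pure-Python sketch of BERT-style masking for a single sequence. It assumes the usual recipe: about 15% of non-special tokens are selected, and of those 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged. The function name `mask_tokens` and the example ids are illustrative; in practice a library collator such as `DataCollatorForLanguageModeling` from `transformers` does this per batch.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_tokens(token_ids, mask_id, vocab_size, special_ids, mlm_probability=0.15, seed=0):
    """Return (inputs, labels) for one sequence using BERT-style masking."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    for i, tok in enumerate(token_ids):
        # Never select special tokens such as [CLS], [SEP], or [PAD]
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok                  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id          # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels

# Illustrative ids only; 101, 102, and 103 are [CLS], [SEP], and [MASK] in bert-base-uncased
CLS, SEP, MASK = 101, 102, 103
sequence = [CLS] + list(range(2000, 2040)) + [SEP]
masked_inputs, labels = mask_tokens(sequence, MASK, vocab_size=30522, special_ids={CLS, SEP})
```

The loss is then computed only at positions where `labels` is not `-100`, so the model is never rewarded for simply copying visible tokens.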

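In the prediction phase, a sentence related to perovskite solar cells is passed through the fine-tuned BERT model, which returns its most likely token at every position. The input text is: "One of the advantages of perovskite solar cells over traditional silicon cells is their depth". Because this input contains no `[MASK]` token, the output largely reproduces the input tokens rather than predicting a missing word.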
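
To get a genuine prediction from a masked language model, one token of the input is usually replaced with `[MASK]` and the scores at that position are ranked. The sketch below shows only the ranking step on toy numbers, in plain Python. The helper `top_k_at_mask` and the toy logits are assumptions for illustration; with the real model, the per-position scores would come from `output.logits[0]` and the mask id from `tokenizer.mask_token_id`.

```python
import math

def top_k_at_mask(logits, input_ids, mask_id, k=3):
    """Return the k most probable (token_id, probability) pairs at the first masked position."""
    mask_pos = input_ids.index(mask_id)        # raises ValueError if nothing is masked
    row = logits[mask_pos]
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]   # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(row)), key=lambda t: probs[t], reverse=True)
    return [(t, probs[t]) for t in ranked[:k]]

# Toy example: a 5-token vocabulary and a 3-token sequence whose middle token is masked
MASK_ID = 4
toy_input_ids = [0, MASK_ID, 1]
toy_logits = [
    [5.0, 0.0, 0.0, 0.0, 0.0],
    [0.1, 0.2, 3.0, 1.0, 0.0],  # scores at the masked position
    [0.0, 5.0, 0.0, 0.0, 0.0],
]
predictions = top_k_at_mask(toy_logits, toy_input_ids, MASK_ID, k=2)
print([token_id for token_id, _ in predictions])  # [2, 3]
```

Reading only the masked position avoids the situation where the model appears to "predict" a sentence that it was simply given as input.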

# Conclusion

This code covers a full machine learning cycle, from data gathering to prediction. It uses BeautifulSoup for web scraping, Hugging Face's `transformers` library for language model fine-tuning, and PyTorch for training and inference, and shows how these libraries fit together in a small NLP application.

In the field of renewable energy, a model fine-tuned this way can help with the specialized language of research papers, and it could serve as a starting point for applications such as document summarization and question answering. Future work could explore different architectures or training strategies to further improve the model's performance.

In [None]:
import re
import requests
from bs4 import BeautifulSoup
import random
from tqdm import tqdm

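# Web scrape papers from arxiv.org using search term "perovskite"
# `start` is a result offset, so step by the page size (assumed here to be the default of 50)
papers = []
for page in tqdm(range(9)):
    url = f"https://arxiv.org/search/?query=perovskite&start={page * 50}"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()  # stop early on HTTP errors instead of parsing an error page
    soup = BeautifulSoup(resp.text, 'html.parser')
    for match in soup.find_all('p', class_='mathjax'):
        papers.append(match.text)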

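# Scrape intro texts from Wikipedia, education sites
intros = []
for url in ["https://en.wikipedia.org/wiki/Perovskite_(structure)", "http://www.pv.unsw.edu.au/perovskite-solar-cells"]:
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
    except requests.RequestException as err:
        print(f"Skipping {url}: {err}")  # keep going if one site is unreachable
        continue
    soup = BeautifulSoup(resp.text, 'html.parser')
    for para in soup.find_all('p'):
        intros.append(para.text)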

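# Combine all documents, collapsing whitespace so each document fits on one line
# (the training cell reads the files line by line) and dropping empty entries
docs = [re.sub(r"\s+", " ", doc).strip() for doc in papers + intros]
docs = [doc for doc in docs if doc]

# Split into train, val, test sets (80/10/10)
random.seed(42)  # fixed seed so the split is reproducible
random.shuffle(docs)
n = len(docs)
train = docs[:int(0.8*n)]
val = docs[int(0.8*n):int(0.9*n)]
test = docs[int(0.9*n):]

# Write to text files, one document per line
for name, split in [("train", train), ("val", val), ("test", test)]:
    with open(f"perovskite_{name}.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(split))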

In [None]:
# Install necessary libraries
!pip install transformers
!pip install datasets

# Imports
from transformers import AutoModelForMaskedLM, AutoTokenizer
from datasets import load_dataset, Dataset
from torch.utils.data import DataLoader
from torch.optim import AdamW  # the AdamW class in transformers is deprecated in favor of PyTorch's

import torch

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

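# Load dataset: one document per line, skipping blank lines.
# The path is relative so it matches the file written by the scraping cell;
# a missing file raises FileNotFoundError instead of failing later with a NameError.
with open("perovskite_train.txt", "r", encoding="utf-8") as file:
    data = [line.strip() for line in file if line.strip()]
if not data:
    raise ValueError("perovskite_train.txt is empty; run the scraping cell first.")
dataset = Dataset.from_dict({'text': data})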

tokenized_dataset = dataset.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

tokenized_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])

# Prepare data loader. The collator masks 15% of the tokens in each batch and
# builds `labels` that are -100 everywhere except at the masked positions.
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
train_dataloader = DataLoader(tokenized_dataset, shuffle=True, batch_size=8, collate_fn=data_collator)

# Optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)

# Move model to GPU if available
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

# Training loop
model.train()

gradient_accumulation_steps = 4
num_train_epochs = 5

for epoch in range(num_train_epochs):
    optimizer.zero_grad()
    for step, batch in enumerate(train_dataloader):
        batch = {key: value.to(device) for key, value in batch.items()}
        outputs = model(**batch)  # the loss is computed only on the masked positions
        loss = outputs.loss / gradient_accumulation_steps  # Normalize the loss because it is accumulated over multiple batches
        loss.backward()

        # Perform the optimization step once enough gradients have accumulated, or on the last batch
        if (step + 1) % gradient_accumulation_steps == 0 or step == len(train_dataloader) - 1:
            optimizer.step()
            optimizer.zero_grad()

# Save model and tokenizer
model.save_pretrained("perovskite_bert_model")
tokenizer.save_pretrained("perovskite_bert_tokenizer")




Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Map (num_proc=4):   0%|          | 0/64 [00:00<?, ? examples/s]



('perovskite_bert_tokenizer/tokenizer_config.json',
 'perovskite_bert_tokenizer/special_tokens_map.json',
 'perovskite_bert_tokenizer/vocab.txt',
 'perovskite_bert_tokenizer/added_tokens.json',
 'perovskite_bert_tokenizer/tokenizer.json')

In [2]:
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load the saved tokenizer
tokenizer = AutoTokenizer.from_pretrained("perovskite_bert_tokenizer")

# Load the saved model
model = AutoModelForMaskedLM.from_pretrained("perovskite_bert_model")

# Move the model to GPU if available
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model = model.to(device)

# Some new text data for prediction
text = "One of the advantages of perovskite solar cells over traditional silicon cells is their depth"

# Encode the text
encoded_input = tokenizer(text, return_tensors='pt')

# Move the encoded input to the same device as the model
encoded_input = encoded_input.to(device)

# Make prediction
with torch.no_grad():
    output = model(**encoded_input)

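# Get the most likely token id at every position. The input contains no [MASK]
# token, so the model mostly reproduces the tokens it was given.
predicted_token_ids = torch.argmax(output.logits, dim=-1)

# Decode the token ids to tokens (the first and last positions correspond to [CLS] and [SEP])
predicted_tokens = tokenizer.convert_ids_to_tokens(predicted_token_ids[0])

print(predicted_tokens)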


['the', 'one', 'of', 'the', 'advantages', 'of', 'per', '##ov', '##ski', '##te', 'solar', 'cells', 'over', 'traditional', 'silicon', 'cells', 'is', 'their', 'depth', '.']
