## KeyPhrase Detection

In [1]:
!pip install keyphrase-vectorizers

Collecting keyphrase-vectorizers
  Downloading keyphrase_vectorizers-0.0.13-py3-none-any.whl.metadata (47 kB)
Collecting spacy-transformers>=1.1.6 (from keyphrase-vectorizers)
  Downloading spacy_transformers-1.3.8-cp312-cp312-win_amd64.whl.metadata (7.2 kB)
Collecting spacy-curated-transformers>=0.2.2 (from keyphrase-vectorizers)
  Downloading spacy_curated_transformers-2.1.2-py2.py3-none-any.whl.metadata (2.8 kB)
Collecting curated-transformers<3.0.0,>=2.0.0 (from spacy-curated-transformers>=0.2.2->keyphrase-vectorizers)
  Downloading curated_transformers-2.0.1-py2.py3-none-any.whl.metadata (5.3 kB)
Collecting curated-tokenizers<3.0.0,>=2.0.0 (from spacy-curated-transformers>=0.2.2->keyphrase-vectorizers)
  Downloading curated_tokenizers-2.0.0-cp312-cp312-win_amd64.whl.metadata (2.0 kB)
INFO: pip is looking at multiple versions of spacy-curated-transformers to determine which version is compatible with other requirements. This could take a while.
Collecting spacy-curated-transformers




In [2]:
!pip install keybert

Collecting keybert
  Downloading keybert-0.9.0-py3-none-any.whl.metadata (15 kB)
Collecting sentence-transformers>=0.3.8 (from keybert)
  Downloading sentence_transformers-3.4.1-py3-none-any.whl.metadata (10 kB)
Collecting Pillow (from sentence-transformers>=0.3.8->keybert)
  Using cached pillow-11.1.0-cp312-cp312-win_amd64.whl.metadata (9.3 kB)
Downloading keybert-0.9.0-py3-none-any.whl (41 kB)
Downloading sentence_transformers-3.4.1-py3-none-any.whl (275 kB)
Using cached pillow-11.1.0-cp312-cp312-win_amd64.whl (2.6 MB)
Installing collected packages: Pillow, sentence-transformers, keybert
Successfully installed Pillow-11.1.0 keybert-0.9.0 sentence-transformers-3.4.1





In [3]:
docs=[
    """
Mitochondria are double-membraned organelles found in most eukaryotic cells. They are often referred to as the "powerhouses" of the cell because they generate most of the cell's energy in the form of adenosine triphosphate (ATP). Mitochondria play a crucial role in cellular respiration, which is the process by which cells convert nutrients into usable energy.
The structure of mitochondria consists of an outer membrane, which surrounds the entire organelle, and an inner membrane that is highly folded to form structures called cristae. The inner membrane encloses the mitochondrial matrix, which contains enzymes and DNA molecules necessary for various metabolic reactions.
One of the primary functions of mitochondria is to carry out aerobic respiration, a process that uses oxygen to break down glucose and other organic molecules, releasing energy in the form of ATP. This process occurs in the inner membrane of the mitochondria, specifically in the electron transport chain and the citric acid cycle.
Apart from energy production, mitochondria have other important roles in the cell. They are involved in the regulation of cellular metabolism, calcium signaling, and apoptosis (programmed cell death). Mitochondria also contain their own DNA, known as mitochondrial DNA (mtDNA), which is separate from the nuclear DNA found in the cell's nucleus.

It's worth noting that while mitochondria are present in most eukaryotic cells, certain cell types may have varying numbers of mitochondria depending on their energy requirements. For example, muscle cells and liver cells often contain a higher number of mitochondria due to their high energy demands.

"""
]

In [4]:
from keyphrase_vectorizers import KeyphraseCountVectorizer
vectorizer=KeyphraseCountVectorizer()

In [5]:
document_keyphrase_matrix=vectorizer.fit_transform(docs).toarray()
document_keyphrase_matrix.shape

(1, 51)

In [6]:
from keybert import KeyBERT

In [7]:
kB=KeyBERT()

In [8]:
import numpy

In [9]:
kB.extract_keywords(docs=docs,vectorizer=vectorizer)

[('mitochondria', 0.6586),
 ('cellular metabolism', 0.541),
 ('cellular respiration', 0.5234),
 ('organelles', 0.5204),
 ('mitochondrial matrix', 0.5142)]
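
The scores above are similarities between the document and each candidate phrase: KeyBERT embeds both and ranks candidates by cosine similarity. The sketch below reproduces only that ranking step in plain Python. The three-dimensional vectors and the helper names `cosine` and `rank_candidates` are made up for illustration; real embeddings come from a sentence-transformers model and have hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(doc_vec, candidate_vecs, top_n=5):
    """Score each candidate phrase against the document and keep the top_n."""
    scored = [(phrase, round(cosine(doc_vec, vec), 4))
              for phrase, vec in candidate_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# Hypothetical toy "embeddings", chosen only to make the ranking visible
doc_vec = [1.0, 1.0, 0.0]
candidate_vecs = {
    "mitochondria": [1.0, 0.9, 0.1],
    "liver cells": [0.1, 0.2, 1.0],
    "cellular respiration": [0.8, 0.3, 0.2],
}

ranked = rank_candidates(doc_vec, candidate_vecs, top_n=2)
print(ranked)
```

In the notebook, the candidate phrases themselves come from `KeyphraseCountVectorizer`, which is why the results are noun phrases rather than arbitrary n-grams.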

## Answer-Aware Question Generation

In [10]:
import pandas as pd
import json

In [11]:
with open('../SQuAD/train-v2.0.json','r') as f:
    data=json.load(f)

In [12]:
contexts=[]
answers=[]
questions=[]

In [13]:
for article in data['data']:
    for paragraph in article['paragraphs']:
        context=paragraph['context']
        for qa in paragraph['qas']:
            question=qa['question']
            for answer in qa['answers']:
                answer_text=answer['text']
                contexts.append(context)
                answers.append(answer_text)
                questions.append(question)
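
One consequence of the loop above is worth making explicit: SQuAD 2.0 marks unanswerable questions with `is_impossible: true` and an empty `answers` list, so the innermost loop never runs for them and they are dropped silently. The sketch below applies the same flattening logic to a tiny hand-written dictionary (the passage and both questions are invented for illustration, and `flatten_squad` is a hypothetical helper name).

```python
def flatten_squad(squad):
    """Return one (question, answer, context) row per answer in a SQuAD-style dict."""
    rows = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                # Unanswerable questions have an empty "answers" list,
                # so this inner loop contributes no rows for them.
                for answer in qa["answers"]:
                    rows.append((qa["question"], answer["text"], context))
    return rows

# Invented two-question example in the SQuAD 2.0 layout
toy = {
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "Paris is the capital of France.",
            "qas": [
                {"question": "What is the capital of France?",
                 "is_impossible": False,
                 "answers": [{"text": "Paris", "answer_start": 0}]},
                {"question": "What is the capital of Mars?",
                 "is_impossible": True,
                 "answers": []},
            ],
        }],
    }]
}

rows = flatten_squad(toy)
print(len(rows), rows[0][:2])
```

For question generation this filtering is convenient, since every remaining row has an answer span to condition on.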

In [14]:
SQuAD_data={
    'Question': questions,
    'Answer': answers,
    'Context': contexts
}

In [15]:
data=pd.DataFrame(SQuAD_data)

In [16]:
data.head(2)

| | Question | Answer | Context |
|---|---|---|---|
| 0 | When did Beyonce start becoming popular? | in the late 1990s | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... |
| 1 | What areas did Beyonce compete in when she was... | singing and dancing | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... |


In [17]:
len(data)

86821

In [18]:
data_sample=data[:100]

In [19]:
import torch
from torch.utils.data import Dataset, DataLoader
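from transformers import T5Tokenizer, T5ForConditionalGeneration
# transformers.AdamW is deprecated and missing from newer releases; torch's AdamW defaults to weight_decay=0.01 (not 0.0)
from torch.optim import AdamW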

In [20]:
class QuestionGenerationDataset(Dataset):
    def __init__(self, context_list, answer_list, question_list, tokenizer):
        self.context_list = context_list
        self.answer_list = answer_list
        self.question_list = question_list
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.context_list)

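    def __getitem__(self, idx):
        context = self.context_list[idx]
        answer = self.answer_list[idx]
        question = self.question_list[idx]

        # Prompt format: task prefix, then the passage, then the answer the question should target.
        # Truncation cuts from the end, so a very long context can push the trailing answer out of the input.
        input_text = f"generate question: {context} Answer: {answer}"
        target_text = question

        # Pad or truncate to fixed lengths so examples can be stacked into batches
        input_ids = self.tokenizer.encode(input_text, truncation=True, padding='max_length', max_length=512, return_tensors='pt')[0]
        target_ids = self.tokenizer.encode(target_text, truncation=True, padding='max_length', max_length=32, return_tensors='pt')[0]

        # T5's pad token id is 0, so ne(0) marks the real (non-padding) positions
        return {"input_ids": input_ids, "attention_mask": input_ids.ne(0), "target_ids": target_ids, "target_attention_mask": target_ids.ne(0)}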
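
A possible refinement, not part of the run recorded in this notebook: the padded `target_ids` are later passed to the model as `labels`, so padding positions count toward the loss. Hugging Face models skip any label equal to `-100`, and a common practice is to replace padding ids with that value. The sketch below shows the replacement on a plain list; the token ids and the helper name `mask_padding` are hypothetical.

```python
PAD_TOKEN_ID = 0      # T5's padding token id
IGNORE_INDEX = -100   # label value that the cross-entropy loss skips

def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID, ignore_index=IGNORE_INDEX):
    """Replace padding ids with the ignore index so they do not affect the loss."""
    return [ignore_index if tok == pad_token_id else tok for tok in label_ids]

# Hypothetical token ids for a short question, followed by padding
padded = [363, 19, 8, 1, 0, 0, 0, 0]
labels = mask_padding(padded)
print(labels)
```

On tensors the equivalent would be `target_ids.masked_fill(target_ids == 0, -100)`, computed after the attention mask so the mask still reflects the real tokens.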


In [21]:
context_list=data_sample['Context'].tolist()
answer_list=data_sample['Answer'].tolist()
question_list=data_sample['Question'].tolist()

tokenizer=T5Tokenizer.from_pretrained('t5-base')
dataset=QuestionGenerationDataset(context_list,answer_list,question_list,tokenizer)

In [22]:
model=T5ForConditionalGeneration.from_pretrained('t5-base')
epochs=3
batch_size=2
learning_rate=0.0001
dataloader=DataLoader(dataset,batch_size=batch_size, shuffle=True)
device =torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model.to(device)
optimizer=AdamW(model.parameters(),lr=learning_rate)
scheduler=torch.optim.lr_scheduler.StepLR(optimizer,step_size=1,gamma=0.1)
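
To see what `StepLR(step_size=1, gamma=0.1)` implies when `scheduler.step()` is called once per epoch, the sketch below evaluates the step-decay rule `lr = initial_lr * gamma ** (epoch // step_size)` for three epochs. The function name `step_lr` is an illustrative stand-in, not a PyTorch API.

```python
def step_lr(initial_lr, gamma, step_size, epoch):
    """Learning rate in effect during a zero-based epoch under step decay."""
    return initial_lr * gamma ** (epoch // step_size)

# Same settings as the cell above: lr=0.0001, gamma=0.1, step_size=1, 3 epochs
schedule = [step_lr(1e-4, 0.1, 1, epoch) for epoch in range(3)]
for epoch, lr in enumerate(schedule, start=1):
    print(f"epoch {epoch}: lr = {lr:.0e}")
```

The rate falls tenfold after every epoch, so the third epoch trains at roughly one hundredth of the initial rate. A larger `step_size` or a `gamma` closer to 1 would give a gentler decay.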

In [23]:
from tqdm import tqdm

In [24]:
for epoch in range(epochs):
    model.train()
    total_loss = 0
    for batch in tqdm(dataloader,desc=f'Epoch {epoch}'):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        target_ids = batch['target_ids'].to(device)
        target_attention_mask = batch['target_attention_mask'].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=target_ids, decoder_attention_mask=target_attention_mask)

        loss = outputs.loss
        total_loss += loss.item()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    scheduler.step()

    avg_loss = total_loss / len(dataloader)
    print(f"Epoch {epoch+1}/{epochs} - Loss: {avg_loss}")

model.save_pretrained("./fine_tuned_t5_model")
tokenizer.save_pretrained("./fine_tuned_t5_tokenizer")


Epoch 0: 100%|██████████| 50/50 [18:05<00:00, 21.72s/it]


Epoch 1/3 - Loss: 2.189678477048874


Epoch 1: 100%|██████████| 50/50 [19:25<00:00, 23.31s/it]


Epoch 2/3 - Loss: 0.9359392833709717


Epoch 2: 100%|██████████| 50/50 [18:46<00:00, 22.53s/it]


Epoch 3/3 - Loss: 0.9229644989967346


('./fine_tuned_t5_tokenizer\\tokenizer_config.json',
 './fine_tuned_t5_tokenizer\\special_tokens_map.json',
 './fine_tuned_t5_tokenizer\\spiece.model',
 './fine_tuned_t5_tokenizer\\added_tokens.json')

In [26]:
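def generate_question(answer,context,model,tokenizer):
    # Build the prompt in the same format used during fine-tuning
    input_text=f'generate question: {context} Answer: {answer}'
    encoded=tokenizer(input_text, truncation=True, padding='max_length', max_length=512, return_tensors='pt').to(device)
    model.eval()  # the training loop leaves the model in train mode, so switch off dropout
    # Pass the attention mask so padded positions are ignored, as they were in training
    output=model.generate(input_ids=encoded['input_ids'], attention_mask=encoded['attention_mask'])
    generated_question=tokenizer.decode(output[0],skip_special_tokens=True)
    return generated_question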

In [27]:
extracted_keyword='mitochondria'
print(generate_question(extracted_keyword,docs[0], model,tokenizer))



Mitochondria are often referred to as the "powerhouses"


In [28]:
print(generate_question('cellular metabolism',docs[0],model,tokenizer))



entailment of this article.


In [29]:
print(generate_question('adenosine triphosphate', docs[0],model, tokenizer))



adenosine triphosphate (ATP) is the most common ATP
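
The two halves of this notebook can be chained: each extracted keyphrase becomes the answer for one generated question. The sketch below shows one way to wire that up. `questions_for_keywords` and the `min_score` cutoff are hypothetical, and a stub stands in for the model so the example runs without the fine-tuned weights.

```python
def questions_for_keywords(keywords, context, generate_fn, min_score=0.5):
    """Generate one question per keyphrase whose score reaches min_score."""
    results = []
    for phrase, score in keywords:
        if score < min_score:
            continue  # skip weakly related phrases
        results.append({
            "answer": phrase,
            "score": score,
            "question": generate_fn(phrase, context),
        })
    return results

# Keyphrases and scores as returned by extract_keywords earlier in the notebook
keywords = [
    ("mitochondria", 0.6586),
    ("cellular metabolism", 0.541),
    ("cellular respiration", 0.5234),
    ("organelles", 0.5204),
    ("mitochondrial matrix", 0.5142),
]

# Stub generator for illustration; it only echoes the answer phrase
def stub_generate(answer, context):
    return f"<question about {answer}>"

results = questions_for_keywords(keywords, "placeholder context", stub_generate, min_score=0.52)
for item in results:
    print(item["answer"], "->", item["question"])
```

In the notebook, the stub could be swapped for `lambda a, c: generate_question(a, c, model, tokenizer)`.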
