
# **Paper Selection**


I selected the paper "BeliN: A Novel Corpus for Bengali Religious News Headline Generation using Contextual Feature Fusion" as the basis for this study. The paper introduces BeliN, a curated corpus of Bengali religious news articles annotated with three contextual features: category, aspect, and sentiment. I use the mT5 model to generate abstractive headlines from this dataset and compare two input settings: article content only, and article content fused with the full set of contextual features (category, aspect, sentiment), following the paper’s proposed MultiGen approach. Headline quality is evaluated with BLEU and ROUGE-L to measure the effect of feature fusion on generation performance. The study should also show how much contextual features help in a low-resource language task, here Bengali religious news.

Paper Link: https://doi.org/10.1016/j.nlp.2025.100138

Dataset Link: https://github.com/akabircs/BeliN
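
The two input settings described above can be sketched as follows. This is a minimal illustration, not the paper's MultiGen implementation: the helper `build_input`, the toy record, and the choice of a literal ` [SEP] ` string as separator are assumptions made for this example (the separator mirrors the one used in this notebook's preprocessing step).

```python
# Hypothetical helper: build the model input for either experimental setting.
SEP = " [SEP] "  # literal separator string; an assumption, not a special tokenizer token

def build_input(record, use_features=True):
    """Return the article alone, or the article prefixed with category, aspect, sentiment."""
    if not use_features:
        return record["article"]
    parts = [record["category"], record["aspect"], record["sentiment"], record["article"]]
    return SEP.join(parts)

# Toy record with placeholder values (not taken from the BeliN corpus)
example = {
    "category": "CAT",
    "aspect": "ASP",
    "sentiment": "Positive",
    "article": "Article body text.",
}

content_only = build_input(example, use_features=False)
with_features = build_input(example, use_features=True)
print(content_only)
print(with_features)
```

Training one model per setting on otherwise identical data and hyperparameters keeps the comparison focused on the effect of the fused features.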


In [1]:
# Step 1: Install required packages
!pip install pandas transformers datasets sentencepiece --quiet

# Step 2: Clone the dataset from GitHub
!git clone https://github.com/akabircs/BeliN.git
%cd BeliN/Dataset

# Step 3: List the dataset files
!ls

Cloning into 'BeliN'...
remote: Enumerating objects: 50, done.
remote: Counting objects: 100% (50/50), done.
remote: Compressing objects: 100% (47/47), done.
remote: Total 50 (delta 13), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (50/50), 7.02 MiB | 4.59 MiB/s, done.
Resolving deltas: 100% (13/13), done.
/content/BeliN/Dataset
Dataset.xlsx  test.csv	train.csv


In [7]:
# Step 4: Load the data
import pandas as pd

train_df = pd.read_csv("/content/BeliN/Dataset/train.csv")
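# Dataset.xlsx is an Excel file, so it is read with pd.read_excel rather than pd.read_csv.
# Note: it serves as the validation set here, but it has more rows (2519) than train.csv (2015),
# so it may overlap with the train/test splits; treat the validation loss with caution.
val_df = pd.read_excel("/content/BeliN/Dataset/Dataset.xlsx")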
test_df = pd.read_csv("/content/BeliN/Dataset/test.csv")

# Rename columns in all three DataFrames to the lowercase names expected by preprocess_data
column_map = {
    'Category': 'category',
    'Article': 'article',
    'Aspect': 'aspect',
    'Sentiment': 'sentiment',
    'Headlines': 'headline'
}

train_df = train_df.rename(columns=column_map)
val_df = val_df.rename(columns=column_map)
test_df = test_df.rename(columns=column_map)


print("Train Data Sample:")
print(train_df.head())

print("Train Data Columns:", train_df.columns.tolist())
print("Validation Data Columns:", val_df.columns.tolist()) # Print validation data columns
print("Test Data Columns:", test_df.columns.tolist())     # Print test data columns


# Step 5: Preprocess the data for headline generation
from datasets import Dataset

def preprocess_data(df):
    # Fail early with a clear message if any expected column is missing
    expected_columns = ['category', 'aspect', 'sentiment', 'article', 'headline']
    missing_cols = [col for col in expected_columns if col not in df.columns]
    if missing_cols:
        raise ValueError(f"DataFrame is missing required columns: {missing_cols}. Available columns: {df.columns.tolist()}")

    inputs = df['category'] + ' [SEP] ' + df['aspect'] + ' [SEP] ' + df['sentiment'] + ' [SEP] ' + df['article']
    targets = df['headline']
    return pd.DataFrame({'input_text': inputs, 'target_text': targets})

train_data = preprocess_data(train_df)
val_data = preprocess_data(val_df)

train_dataset = Dataset.from_pandas(train_data)
val_dataset = Dataset.from_pandas(val_data)

print("\nSample Preprocessed Data:")
print(train_data.head())

# Step 6: Load tokenizer and model (mT5 or Bengali T5)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/mt5-small"  # You can replace this with a Bengali fine-tuned T5 if available
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Step 7: Tokenize the dataset
def tokenize_function(batch):
    model_inputs = tokenizer(batch['input_text'], max_length=512, truncation=True)
    labels = tokenizer(batch['target_text'], max_length=64, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_val = val_dataset.map(tokenize_function, batched=True)



Train Data Sample:
          Source                                           headline  \
0  দৈনিক ইনকিলাব       নবী সাহাবীদের পদধন্য ফিলিস্তিন আজ রক্তাক্ত\n   
1     কালের কন্ঠ      সমাজের সবার ওপর কোরআন প্রতিযোগিতার প্রভাব আছে   
2  Dhaka Tribune  বগুড়া থেকে গ্রেপ্তার হলেন ‘বাংলা ভাইয়ের ভাতিজা...   
3     কালের কন্ঠ                        নিজেকে বদলে ফেলার মাস রমজান   
4     কালের কন্ঠ         বাড়ছে মধুর চাহিদা, ব্যস্ত জর্দানের মৌয়ালরা   

     category                                            article  \
0  ইসলাম ধর্ম  ফিলিস্তিন মধ্যপ্রাচ্যের দক্ষিণাংশের একটি ভূখণ্...   
1  ইসলাম ধর্ম  আল-হামদুলিল্লাহ, আমি প্রথম থেকে কুরআনের নূরের ...   
2    অন্যান্য  বগুড়ার গাবতলী থেকে নিষিদ্ধ জঙ্গি সংগঠন জামাআতু...   
3  ইসলাম ধর্ম  মাহে রমজান নিজেকে পুনর্গঠন করার মাস। নিজেকে বদ...   
4  ইসলাম ধর্ম  করোনাকালে বিশ্বের অন্যান্য দেশের মতো ধাক্কা খা...   

              aspect sentiment  
0  ধর্মীয় প্রতিবেদন  Negative  
1  ধর্মীয় প্রতিবেদন  Negative  
2  ধর্মীয় প্রতিবেদন  Positive  
3     ধর্মীয়



Map:   0%|          | 0/2015 [00:00<?, ? examples/s]

Map:   0%|          | 0/2519 [00:00<?, ? examples/s]
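
When labels are padded to a fixed length, as in the retokenization step that follows, the padded positions keep the pad token id and therefore contribute to the training loss. A common convention with Hugging Face seq2seq models is to replace those positions with -100, the default `ignore_index` of PyTorch's cross-entropy loss. The sketch below shows the idea on plain Python lists; the helper name `mask_padding` and the toy token ids are hypothetical, and 0 is only assumed to stand in for the tokenizer's `pad_token_id`.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_padding(label_ids, pad_token_id):
    """Replace padding ids in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]

# Toy label batch; 0 stands in for the tokenizer's pad_token_id (an assumption here)
toy_labels = [[259, 1042, 1, 0, 0], [873, 1, 0, 0, 0]]
masked = mask_padding(toy_labels, pad_token_id=0)
print(masked)
```

Inside a tokenization function, this could be applied as `model_inputs["labels"] = mask_padding(labels["input_ids"], tokenizer.pad_token_id)`. Doing so would change the loss values, so numbers from a masked run are not directly comparable with the ones reported in this notebook.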

In [11]:
# Step 7 (revised): Tokenize again, padding inputs and labels to fixed lengths
# Note: padded label positions keep the pad token id, so they are included in the loss
def tokenize_function(batch):
    model_inputs = tokenizer(batch['input_text'], max_length=512, truncation=True, padding="max_length")
    labels = tokenizer(batch['target_text'], max_length=64, truncation=True, padding="max_length")
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_val = val_dataset.map(tokenize_function, batched=True)

# %%
# Step 8: Set training arguments and Trainer
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=2,
)

# Initialize the data collator with the same tokenizer that was used for tokenization.
# tokenize_function already pads inputs and labels to fixed lengths, so the
# collator only has to batch the features into tensors.
# DataCollatorForSeq2Seq is the collator intended for seq2seq models and would
# allow dynamic per-batch padding of both inputs and labels instead.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    data_collator=data_collator,  # batches the pre-padded features into tensors
)

# Step 9: Train the model
trainer.train()

# Optional: Save the model
trainer.save_model("bengali_headline_gen_model")

print("Training Complete!")

Map:   0%|          | 0/2015 [00:00<?, ? examples/s]

Map:   0%|          | 0/2519 [00:00<?, ? examples/s]

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Epoch,Training Loss,Validation Loss
1,31.1644,15.875946
2,17.6914,10.899961
3,13.5145,9.496676


Training Complete!
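
The study plan names ROUGE-L as one of the evaluation metrics. As a rough illustration of what it measures, the sketch below computes a sentence-level ROUGE-L F1 score from the longest common subsequence (LCS) of two token lists. It is a simplified stand-in, not the paper's evaluation code: whitespace tokenization and the balanced F1 form are assumptions, the function names are hypothetical, and library implementations may differ in tokenization and weighting.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbour
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 over whitespace tokens (assumed tokenization)."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# English toy strings for readability; Bengali headlines would be split the same way
score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
print(round(score, 3))
```

Averaging this score over all test pairs of actual and generated headlines gives a single number per model, which makes the content-only and feature-fused settings directly comparable.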


In [19]:
# Step 10: Load the trained model (optional if already in memory)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import os

# Load from the path where the full model and tokenizer were saved
model_path = "/content/BeliN/Dataset/bengali_headline_gen_model"

# Verify the contents of the directory
print(f"\nContents of {model_path}:")
if os.path.exists(model_path):
    !ls {model_path}
else:
    print(f"Directory {model_path} does not exist.")


# Now attempt to load the tokenizer and model
try:
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
    print("\nModel and Tokenizer loaded successfully!")

    # Example: Generate headline for one article from the test set
    sample = test_df.iloc[0]

    # Prepare input as done in training
    input_text = sample['category'] + ' [SEP] ' + sample['aspect'] + ' [SEP] ' + sample['sentiment'] + ' [SEP] ' + sample['article']
    print("\nInput Text:")
    print(input_text)

    # Tokenize input
    inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)

    # Generate headline
    # Ensure the model is in evaluation mode
    model.eval()
    # Use the correct device for generation
    import torch
    if torch.cuda.is_available():
        model.to('cuda')
        inputs = {k: v.to('cuda') for k, v in inputs.items()}


    output = model.generate(**inputs, max_length=64, num_beams=4, early_stopping=True)
    generated_headline = tokenizer.decode(output[0], skip_special_tokens=True)

    print("\nGenerated Headline:")
    print(generated_headline)

    # Actual headline for comparison
    print("\nActual Headline:")
    print(sample['headline'])

except Exception as e:
    print(f"\nError loading model or generating headline: {e}")
    print("Please check the contents of the saved model directory and the saving process.")


Contents of /content/BeliN/Dataset/bengali_headline_gen_model:
config.json		special_tokens_map.json  tokenizer.json
generation_config.json	spiece.model		 training_args.bin
model.safetensors	tokenizer_config.json

Model and Tokenizer loaded successfully!

Input Text:
ইসলাম ধর্ম [SEP] ধর্মীয় শিক্ষা [SEP] Negative [SEP] শিশুরা পবিত্রতার প্রতীক। শিশুরা নিষ্পাপ। শিশুরা আনন্দের উপকরণ ও প্রেরণার উৎস। তাই শিশুদের প্রতি ভালোবাসা ও সম্মান প্রদর্শন করা জরুরি। কোরআন মজিদে বর্ণিত হয়েছে: ‘আল্লাহ তোমাদের থেকে তোমাদের জোড়া সৃষ্টি করেছেন এবং তোমাদের যুগল থেকে তোমাদের জন্য পুত্র ও পৌত্রাদি সৃষ্টি করেছেন এবং তোমাদের উত্তম জীবন উপকরণ দিয়েছেন। (সুরা-১৬ নাহল, আয়াত: ৭২)। শিশু মানবজাতির অতীব গুরুত্বপূর্ণ অংশ। শৈশবেই মানুষের জীবনের গতিপথ নির্ধারিত হয়। তাই শৈশবকাল বিশেষ গুরুত্বপূর্ণ। শিশুদের নিরাপদে ও স্বাচ্ছন্দ্যে বেড়ে ওঠার জন্মগত অধিকার রয়েছে। শিশুদের জন্য অনুকূল পরিবেশ নিশ্চিত করা আমাদের দায়িত্ব ও কর্তব্য। কিন্তু দেখা যায়, আমাদের সমাজে শিশুরা অহরহ নির্যাতনের শিকার হচ্ছে।

শিশুদের শারীরিক শাস্তি একটি সামাজি