# **OPEN-ARC**
---

### Project 6: News Headline Generation Model:
**Challenge:** Create an AI model, capable of generating convincing news headlines.


### Terms and Use:
Learn more about the project's [LICENSE](https://github.com/Infinitode/OPEN-ARC/blob/main/LICENSE) and read our [CODE_OF_CONDUCT](https://github.com/Infinitode/OPEN-ARC/blob/main/CODE_OF_CONDUCT) before contributing to the project. You can contribute to this project from here: [https://github.com/Infinitode/OPEN-ARC/](https://github.com/Infinitode/OPEN-ARC/).

---

Please fill out this performance sheet to help others quickly see your model's performance **(optional)**:

### Performance Sheet:
| Contributor | Architecture Type | Platform | Base Model | Dataset | BLEU-Score | Link |
|-------------|-------------------|----------|------------|---------|----------|------|
| Infinitode  | DistilBART  | Kaggle   | ✗  | NEWS SUMMARY | 52.8%    | [Notebook](https://github.com/Infinitode/OPEN-ARC/blob/main/Project-6-NHG/project-6-nhg.ipynb) |
| Username  | Unknown  | Kaggle   | ✗/✔  | NEWS SUMMARY | Score    | [Notebook](https://github.com) |

---

### Model: Pre-trained DistilBART:
This model is a pre-trained distilled BART model. We will fine-tune it to our specific dataset, and test its performance using the BLEU score, from `Duplipy`.

### Import the necessary libraries
---

In [1]:
import torch
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from sklearn.model_selection import train_test_split
from tqdm import tqdm  # Progress bar

### Load the data and pre-trained model with its tokenizer
---

In [2]:
# Load your data with different encodings
file_path = '/kaggle/input/news-summary/news_summary.csv'
encodings = ['utf-8', 'ISO-8859-1', 'latin1']  # List of possible encodings

data = None
for enc in encodings:
    try:
        data = pd.read_csv(file_path, encoding=enc)
        break
    except UnicodeDecodeError:
        continue

if data is None:
    raise ValueError("Failed to decode the file with provided encodings.")

# Load the pretrained DistilBART model and tokenizer
model_name = 'sshleifer/distilbart-xsum-12-1'  # Smaller, faster version of BART for summarization tasks
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.59k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]



pytorch_model.bin:   0%|          | 0.00/443M [00:00<?, ?B/s]

### Preprocess the input data
---

In [3]:
# Prepare the dataset, drop missing values, etc.
data = data[['headlines', 'text']].dropna()

texts = data['text']
summaries = data['headlines']

# Ensure consistency in sample size
min_len = min(len(texts), len(summaries))
texts = texts[:min_len]
summaries = summaries[:min_len]

# Tokenize the dataset
inputs = tokenizer(texts.tolist(), max_length=1024, return_tensors='pt', padding=True, truncation=True)
labels = tokenizer(summaries.tolist(), max_length=128, return_tensors='pt', padding=True, truncation=True)

# Split the data into training and validation sets
train_inputs, val_inputs, train_labels, val_labels = train_test_split(
    inputs['input_ids'], labels['input_ids'], test_size=0.2, random_state=42
)

# Move data to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

train_inputs = train_inputs.to(device)
train_labels = train_labels.to(device)
val_inputs = val_inputs.to(device)
val_labels = val_labels.to(device)

### Fine-tune and train the model
---

In [4]:
# Define the training function
def train_model(model, train_inputs, train_labels, val_inputs, val_labels, epochs=3, batch_size=2):
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    model.train()

    for epoch in range(epochs):
        total_loss = 0

        # Progress bar for training loop
        train_loader = tqdm(range(0, len(train_inputs), batch_size), desc=f"Epoch {epoch+1} Training", leave=False)

        # Training loop
        for i in train_loader:
            input_batch = train_inputs[i:i+batch_size]
            label_batch = train_labels[i:i+batch_size]

            # Forward pass
            outputs = model(input_ids=input_batch, labels=label_batch)
            loss = outputs.loss
            total_loss += loss.item()

            # Backward pass and optimization
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

            train_loader.set_postfix(loss=loss.item())

        avg_train_loss = total_loss / len(train_inputs)

        # Validation with progress bar
        model.eval()
        with torch.no_grad():
            val_loss = 0
            val_loader = tqdm(range(0, len(val_inputs), batch_size), desc=f"Epoch {epoch+1} Validation", leave=False)

            for i in val_loader:
                val_input_batch = val_inputs[i:i+batch_size]
                val_label_batch = val_labels[i:i+batch_size]

                val_outputs = model(input_ids=val_input_batch, labels=val_label_batch)
                val_loss += val_outputs.loss.item()

                val_loader.set_postfix(val_loss=val_outputs.loss.item())

            avg_val_loss = val_loss / len(val_inputs)

        print(f'Epoch {epoch+1}, Training Loss: {avg_train_loss:.4f}, Validation Loss: {avg_val_loss:.4f}')

# Train the model
train_model(model, train_inputs, train_labels, val_inputs, val_labels, epochs=3, batch_size=8)

                                                                                     

Epoch 1, Training Loss: 0.2155, Validation Loss: 0.1500


                                                                                     

Epoch 2, Training Loss: 0.1038, Validation Loss: 0.1470


                                                                                     

Epoch 3, Training Loss: 0.0589, Validation Loss: 0.1664




### Save the trained model
---

In [5]:
# Save the fine-tuned model
model.save_pretrained('distilbart_summarization_model')
tokenizer.save_pretrained('distilbart_summarization_model')

Non-default generation parameters: {'max_length': 62, 'min_length': 11, 'early_stopping': True, 'num_beams': 6, 'length_penalty': 0.5, 'no_repeat_ngram_size': 3, 'forced_eos_token_id': 2}


('distilbart_summarization_model/tokenizer_config.json',
 'distilbart_summarization_model/special_tokens_map.json',
 'distilbart_summarization_model/vocab.json',
 'distilbart_summarization_model/merges.txt',
 'distilbart_summarization_model/added_tokens.json',
 'distilbart_summarization_model/tokenizer.json')

In [8]:
!zip -r model.zip '/kaggle/working/distilbart_summarization_model'

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


  adding: kaggle/working/distilbart_summarization_model/ (stored 0%)
  adding: kaggle/working/distilbart_summarization_model/tokenizer.json (deflated 72%)
  adding: kaggle/working/distilbart_summarization_model/config.json (deflated 60%)
  adding: kaggle/working/distilbart_summarization_model/tokenizer_config.json (deflated 76%)
  adding: kaggle/working/distilbart_summarization_model/merges.txt (deflated 53%)
  adding: kaggle/working/distilbart_summarization_model/generation_config.json (deflated 45%)
  adding: kaggle/working/distilbart_summarization_model/vocab.json (deflated 59%)
  adding: kaggle/working/distilbart_summarization_model/model.safetensors (deflated 7%)
  adding: kaggle/working/distilbart_summarization_model/special_tokens_map.json (deflated 85%)


### Generate new headlines from text
---

In [13]:
# Summarization inference
def summarize(text):
    inputs = tokenizer([text], max_length=1024, return_tensors='pt', padding=True, truncation=True)
    inputs = inputs.to(device)

    # Generate the summary
    summary_ids = model.generate(inputs['input_ids'], max_length=100, min_length=10, length_penalty=2.0, num_beams=4, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# Test with a new text
new_text = "The quick brown fox jumps over the lazy dog. The dog, surprised, looks at the fox. They then decide to become friends and explore the forest together."
print(summarize(new_text))

Delhi fox jumps over lazy dog in a park


## Meassure the model's performance using Duplipy
---
You can now test the model using Duplipy's built-in BLEU Score function.

In [15]:
!pip install duplipy

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting duplipy
  Downloading duplipy-0.2.2-py3-none-any.whl.metadata (1.2 kB)
Collecting valx (from duplipy)
  Downloading valx-0.2.0-py3-none-any.whl.metadata (1.0 kB)
Downloading duplipy-0.2.2-py3-none-any.whl (10.0 kB)
Downloading valx-0.2.0-py3-none-any.whl (349 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m349.7/349.7 kB[0m [31m9.3 MB/s[0m eta [36m0:00:00[0m:00:01[0m
[?25hInstalling collected packages: valx, duplipy
Successfully installed duplipy-0.2.2 valx-0.2.0


In [16]:
from duplipy.similarity import bleu_score
import numpy as np

# Evaluate using BLEU score
def evaluate_bleu(reference_summaries, generated_summaries):
    bleu_scores = []
    for ref, gen in zip(reference_summaries, generated_summaries):
        score = bleu_score(ref, gen)
        bleu_scores.append(score)
    return np.mean(bleu_scores)

# Generate summaries for the validation set
generated_summaries = [summarize(text) for text in texts[:len(val_inputs)]]  # Adjust as needed
average_bleu = evaluate_bleu(summaries[:len(val_inputs)], generated_summaries)

print(f'Average BLEU Score: {average_bleu:.4f}')

Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


Average BLEU Score: 0.5284


### The End:

This is the end of this project notebook, make sure to experiment and contribute to help improve the model and implementation. You can browse more of the open-source free projects on our GitHub repository: https://github.com/Infinitode/OPEN-ARC. If you like this project, make sure to star the repo and contribute your implementation, or help others in the community.

~ Infinitode