# 📚 **Project Overview: Text Summarization with BART and Flask Deployment**

## 🎯 **Objective**
This project focuses on building a text summarization model using the pre-trained **BART** model. The model is fine-tuned on the CNN/Daily Mail dataset for summarization tasks, and a Flask web API is developed to deploy the model for real-time text summarization.

## 🧩 **Components**

### 1. **Dataset**
The **CNN/Daily Mail** dataset is used, consisting of articles and highlights (summaries). The dataset is preprocessed, tokenized, and split into inputs (articles) and labels (summaries) for training.

- **Dataset Structure**:
  - `article`: Textual content to be summarized.
  - `highlights`: Summarized content.

### 2. **Text Preprocessing**
The **BART tokenizer** from the `transformers` library is used to tokenize the input articles and summaries. The preprocessing steps include:
- Tokenizing the articles and summaries.
- Truncating and padding the sequences to fit the model's maximum length.

### 3. **Model Training**
The **BART model** is fine-tuned for the task of **text summarization**:
- **Model**: `facebook/bart-large-cnn`
- **Optimizer**: AdamW with a learning rate of `5e-5`
- **Batch Size**: 8
- **Epochs**: 3

The model is trained using **PyTorch**, with the option to run on **GPU** if available for faster training.

### 4. **Model Evaluation**
After training, the model's performance is evaluated by generating summaries for the first few articles in the dataset. The evaluation metric is the quality of generated summaries.

- **Summarization Process**:
  - Text is tokenized.
  - The model generates summaries using beam search for optimal output.

### 5. **Model Deployment with Flask**
A **Flask web application** is developed to deploy the trained summarization model:
- **Route**: `/summarize`
- **Method**: `POST`
- **Input**: Raw text to be summarized (in JSON format).
- **Output**: Summarized text in JSON format.

### 6. **Model Saving**
The fine-tuned model and tokenizer are saved for later use:
- **Saved Files**: Model weights and tokenizer files.
- **Directory**: `summarization_model`

## ⚙️ **Technologies Used**
- **Libraries**:
  - `transformers`: For the pre-trained BART model and tokenizer.
  - `torch`: For model training with PyTorch.
  - `flask`: For creating the web API.
  - `datasets`: For easy access to datasets like CNN/Daily Mail.
- **Tools**:
  - **Google Colab**: For model training and experimentation.
  - **GitHub**: For version control and project management.

## 🚀 **Execution Steps**
1. **Install necessary libraries** using `pip`.
2. **Load and preprocess the dataset** for text tokenization.
3. **Fine-tune the BART model** on the dataset.
4. **Evaluate the model** by generating summaries.
5. **Deploy the model** using Flask to create a web API.
6. **Save the model** for future use.

## 🎓 **Educational Objective**
This project is designed to demonstrate the process of **text summarization** using **transformer-based models** like **BART**. Additionally, it introduces the deployment of machine learning models using **Flask**, allowing the model to be accessed via a simple web interface. This hands-on project is ideal for learners who wish to apply NLP techniques and model deployment in real-world scenarios.

## 📜 **Future Improvements**
- Experimenting with different models (e.g., T5, GPT-3).
- Optimizing model performance using hyperparameter tuning.
- Extending the Flask app to support batch processing or multiple summarization models.

---

## ⚙️ **Install Necessary Libraries**
To begin, you'll need to install the required libraries for text summarization and web deployment. You can install them using `pip`. The essential libraries include:

- **Transformers**: For using pre-trained models like BART.
- **Torch**: For model training and deep learning operations.
- **Flask**: To create a simple web API for deployment.
- **Datasets**: To load and preprocess datasets such as CNN/Daily Mail.

```bash
!pip install transformers torch flask datasets

In [38]:
# Install required libraries

In [39]:
!pip install datasets transformers torch flask  # Installing necessary libraries for datasets, transformers, torch, and flask




---

In [40]:
import warnings  # Import warnings module to handle warning messages

In [41]:
warnings.filterwarnings('ignore')  # This will suppress all warnings during execution

---


---

### 🧑‍💻 **Step 2: Load and Preprocess the Dataset**

```markdown
## 📚 **Load and Preprocess the Dataset**
The dataset used in this project is the **CNN/Daily Mail** dataset, which contains news articles along with their summaries (highlights). To work with the dataset, we will load it using the `datasets` library and preprocess it for tokenization.

1. **Dataset Overview**:
   - `article`: The full article text.
   - `highlights`: The concise summary of the article.

2. **Preprocessing Steps**:
   - Load the dataset using the `datasets` library.
   - Tokenize both the articles (inputs) and the summaries (labels) using the BART tokenizer.

```python
from datasets import load_dataset
dataset = load_dataset('cnn_dailymail', '3.0.0')

# Sample preprocessing function (you can adjust tokenization as needed)
def preprocess_function(examples):
    inputs = examples['article']
    targets = examples['highlights']
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True, padding='max_length')
    labels = tokenizer(targets, max_length=150, truncation=True, padding='max_length')
    model_inputs['labels'] = labels['input_ids']
    return model_inputs

tokenized_datasets = dataset.map(preprocess_function, batched=True)

In [42]:
# Load and preprocess the dataset

In [43]:
from datasets import load_dataset  # Importing load_dataset function from datasets library

In [44]:
import pandas as pd  # Importing pandas for data manipulation

In [45]:
# Load the CNN/Daily Mail dataset (first 100 rows for faster processing)

In [46]:
dataset = load_dataset("cnn_dailymail", "3.0.0", split='train[:100]')  # Load the first 100 rows

In [47]:
# Convert the dataset to a pandas dataframe for easier manipulation

In [48]:
df = pd.DataFrame(dataset)  # Converting the dataset to a pandas dataframe

In [49]:
# Display first 5 rows

In [50]:
print("First 5 rows of the dataset:")  # Printing the first 5 rows of the dataset
print(df.head())  # Displaying the first 5 rows of the dataframe

First 5 rows of the dataset:
                                             article  \
0  LONDON, England (Reuters) -- Harry Potter star...   
1  Editor's note: In our Behind the Scenes series...   
2  MINNEAPOLIS, Minnesota (CNN) -- Drivers who we...   
3  WASHINGTON (CNN) -- Doctors removed five small...   
4  (CNN)  -- The National Football League has ind...   

                                          highlights  \
0  Harry Potter star Daniel Radcliffe gets £20M f...   
1  Mentally ill inmates in Miami are housed on th...   
2  NEW: "I thought I was going to die," driver sa...   
3  Five small polyps found during procedure; "non...   
4  NEW: NFL chief, Atlanta Falcons owner critical...   

                                         id  
0  42c027e4ff9730fbb3de84c1af0d2c506e41c3e4  
1  ee8871b15c50d0db17b0179a6d2beab35065f1e9  
2  06352019a19ae31e527f37f7571c6dd7f0c5da37  
3  24521a2abb2e1f5e34e6824e0f9e56904a2b0e88  
4  7fe70cc8b12fab2d0a258fababf7d9c6b5e1262a  


In [51]:
# Display dataset info

In [52]:
print("\nDataset Info:")  # Printing dataset information
print(df.info())  # Displaying dataset info (column types, non-null counts)


Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   article     100 non-null    object
 1   highlights  100 non-null    object
 2   id          100 non-null    object
dtypes: object(3)
memory usage: 2.5+ KB
None


In [53]:
# Display summary statistics

In [54]:
print("\nSummary Statistics:")  # Printing summary statistics
print(df.describe())  # Displaying summary statistics for numerical columns


Summary Statistics:
                                                  article  \
count                                                 100   
unique                                                 99   
top     AMMAN, Jordan (CNN) -- In the sunbathed school...   
freq                                                    2   

                                               highlights  \
count                                                 100   
unique                                                 99   
top     Jordan opens school doors to all Iraqi childre...   
freq                                                    2   

                                              id  
count                                        100  
unique                                       100  
top     f16446db34e2861f0450dfa34d8cdda541ab7b19  
freq                                           1  


In [55]:
# Check for missing values

In [56]:
print("\nMissing Values:")  # Printing missing value information
print(df.isnull().sum())  # Summing the missing values in each column


Missing Values:
article       0
highlights    0
id            0
dtype: int64


In [57]:
# Check for duplicate rows

In [58]:
print("\nDuplicate Rows:")  # Printing duplicate row information
print(df.duplicated().sum())  # Summing the number of duplicate rows


Duplicate Rows:
0


In [59]:
# Describe dataset, including non-numerical columns

In [60]:
print("\nDataset Description:")  # Printing descriptive statistics for all columns
print(df.describe(include='all'))  # Displaying descriptive statistics for both numerical and non-numerical columns



Dataset Description:
                                                  article  \
count                                                 100   
unique                                                 99   
top     AMMAN, Jordan (CNN) -- In the sunbathed school...   
freq                                                    2   

                                               highlights  \
count                                                 100   
unique                                                 99   
top     Jordan opens school doors to all Iraqi childre...   
freq                                                    2   

                                              id  
count                                        100  
unique                                       100  
top     f16446db34e2861f0450dfa34d8cdda541ab7b19  
freq                                           1  


In [61]:
# Show unique values in 'category' column if available

In [62]:
if 'category' in df.columns:  # Checking if 'category' column exists
    print("\nUnique Categories in the Dataset:")  # Printing unique values in the 'category' column
    print(df['category'].unique())  # Displaying unique categories in the dataset

---

---

In [63]:
# Text Preprocessing with BART Tokenizer

In [64]:
from transformers import BartTokenizer  # Importing BART tokenizer from transformers library

In [65]:
# Load pre-trained BART tokenizer

In [66]:
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")  # Loading the BART tokenizer

In [67]:
# Preprocess dataset (tokenize articles and summaries)

In [68]:
def preprocess_data(df):  # Defining a function to preprocess the data
    inputs = tokenizer(df['article'].tolist(), return_tensors='pt', max_length=1024, truncation=True, padding=True)  # Tokenizing the articles
    labels = tokenizer(df['highlights'].tolist(), return_tensors='pt', max_length=150, truncation=True, padding=True)  # Tokenizing the highlights (summaries)
    return inputs, labels  # Returning tokenized inputs and labels

In [69]:
inputs, labels = preprocess_data(df)  # Preprocessing the dataset

---

---


---

### 🏋️‍♂️ **Step 3: Fine-Tune the BART Model**

```markdown
## 🔧 **Fine-Tune the BART Model**
Now that the dataset is preprocessed, it's time to fine-tune the **BART** model. BART is a transformer-based model designed for text generation tasks such as summarization.

### Fine-Tuning Process:
1. **Model Selection**: Use the pre-trained `facebook/bart-large-cnn` model.
2. **Training Parameters**: Use the AdamW optimizer with a learning rate of `5e-5`, a batch size of `8`, and train for `3 epochs`.
3. **Hardware**: Utilize GPU if available to speed up the training process.

```python
from transformers import BartForConditionalGeneration, BartTokenizer, Trainer, TrainingArguments

# Load the pre-trained BART model and tokenizer
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
)

# Start fine-tuning
trainer.train()

In [70]:
# Load BART Model for Fine-Tuning

In [71]:
from transformers import BartForConditionalGeneration, AdamW  # Importing BART model and AdamW optimizer


In [72]:
import torch  # Importing torch for tensor manipulation

In [73]:
# Load pre-trained BART model for sequence-to-sequence tasks (summarization)

In [74]:
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")  # Loading the BART model for summarization


In [75]:
# Optimizer

In [76]:
optimizer = AdamW(model.parameters(), lr=5e-5)  # Defining the optimizer with a learning rate of 5e-5


In [77]:
# Move model to GPU if available

In [78]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # Selecting device (GPU if available, otherwise CPU)


In [79]:
model.to(device)  # Moving the model to the selected device (GPU/CPU)

BartForConditionalGeneration(
  (model): BartModel(
    (shared): BartScaledWordEmbedding(50264, 1024, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): BartScaledWordEmbedding(50264, 1024, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 1024)
      (layers): ModuleList(
        (0-11): 12 x BartEncoderLayer(
          (self_attn): BartSdpaAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (fc2): Linear(in_features=4096, out_features=1024, bias=True)
    

In [80]:
# Create DataLoader for training

In [81]:
from torch.utils.data import DataLoader, TensorDataset  # Importing DataLoader and TensorDataset for batching data


In [82]:
from tqdm import tqdm  # Importing tqdm for progress bar during training

In [83]:
train_data = TensorDataset(inputs['input_ids'], labels['input_ids'])  # Creating a TensorDataset from input_ids and label_ids


In [84]:
train_loader = DataLoader(train_data, batch_size=8, shuffle=True)  # Creating a DataLoader with batch size 8, shuffling the data


---

---

In [85]:
# Training loop

In [86]:
epochs = 3  # Number of training epochs

In [87]:
for epoch in range(epochs):  # Looping over epochs
    model.train()  # Setting the model to training mode
    total_loss = 0  # Initializing total loss for this epoch
    for batch in tqdm(train_loader, desc=f"Training Epoch {epoch+1}/{epochs}"):  # Looping over batches with progress bar
        input_ids, labels = [item.to(device) for item in batch]  # Moving input_ids and labels to device

        # Forward pass
        outputs = model(input_ids=input_ids, labels=labels)  # Performing forward pass to get model outputs
        loss = outputs.loss  # Extracting the loss from model outputs
        total_loss += loss.item()  # Accumulating the total loss

        # Backward pass and optimization
        optimizer.zero_grad()  # Zeroing the gradients
        loss.backward()  # Backpropagating the loss
        optimizer.step()  # Performing optimization step

    print(f"Epoch {epoch+1} | Loss: {total_loss / len(train_loader)}")  # Printing the average loss for the epoch


Training Epoch 1/3: 100%|██████████| 13/13 [00:40<00:00,  3.12s/it]


Epoch 1 | Loss: 3.5102916955947876


Training Epoch 2/3: 100%|██████████| 13/13 [00:42<00:00,  3.24s/it]


Epoch 2 | Loss: 0.9729984861153823


Training Epoch 3/3: 100%|██████████| 13/13 [00:42<00:00,  3.24s/it]

Epoch 3 | Loss: 0.45802743159807646





---

---


---

### 📊 **Step 4: Evaluate the Model**

```markdown
## 🧪 **Evaluate the Model**
After fine-tuning, it’s crucial to evaluate the model’s performance. Evaluation is done by generating summaries for a subset of test articles and comparing them to the actual summaries.

### Evaluation Process:
1. **Generate Summaries**: Use the fine-tuned model to generate summaries for test articles.
2. **Compare Summaries**: Evaluate the quality of generated summaries.

```python
# Generate summaries for the test set
model.eval()  # Switch to evaluation mode
generated_summaries = []

for article in tokenized_datasets['test']:
    inputs = tokenizer(article['article'], return_tensors="pt", max_length=1024, truncation=True, padding='max_length')
    summary_ids = model.generate(inputs['input_ids'])
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    generated_summaries.append(summary)

# Display the first few generated summaries
print(generated_summaries[:5])

In [88]:
# Model Evaluation

In [89]:
def generate_summary(input_texts):  # Defining a function to generate summaries
    inputs = tokenizer(input_texts, return_tensors="pt", max_length=1024, truncation=True, padding=True).to(device)  # Tokenizing the input texts
    summary_ids = model.generate(inputs['input_ids'], max_length=150, num_beams=4, length_penalty=2.0, early_stopping=True)  # Generating summaries
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)  # Decoding the summary back to text


In [90]:
# Test the model on the first 5 articles in the dataset

In [91]:
test_articles = df['article'].iloc[:5].tolist()  # Selecting the first 5 articles from the dataset

In [92]:
for article in test_articles:  # Looping through the test articles
    print("Original Article: ", article)  # Printing the original article
    print("Generated Summary: ", generate_summary([article]))  # Printing the generated summary
    print("\n" + "-"*100 + "\n")  # Printing a separator line for better readability

Original Article:  LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don't think I'll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office ch

---

---


---

### 🚀 **Step 5: Deploy the Model with Flask**

```markdown
## 🌐 **Deploy the Model with Flask**
The final step is to deploy the model using **Flask**. Flask will allow us to create a simple web application that accepts input text (article) and returns the summary generated by the model.

### Flask Web API:
1. **Route**: `/summarize`
2. **Method**: `POST`
3. **Input**: Raw article text (in JSON format).
4. **Output**: Summarized text (in JSON format).

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/summarize', methods=['POST'])
def summarize():
    article = request.json['article']
    inputs = tokenizer(article, return_tensors="pt", max_length=1024, truncation=True, padding='max_length')
    summary_ids = model.generate(inputs['input_ids'])
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return jsonify({"summary": summary})

if __name__ == '__main__':
    app.run(debug=True)

In [93]:
# Model Deployment with Flask

In [94]:
from flask import Flask, request, jsonify  # Importing necessary Flask modules

In [95]:
# Initialize Flask app

In [96]:
app = Flask(__name__)  # Initializing the Flask app

In [97]:
# Route to summarize text

In [98]:
@app.route('/summarize', methods=['POST'])  # Defining a POST route to accept text for summarization
def summarize():  # Defining the summarize function
    data = request.json  # Extracting the input text from the request body
    input_text = data['text']  # Storing the input text

    # Generate summary
    summary = generate_summary([input_text])  # Generating the summary for the input text

    return jsonify({'summary': summary})  # Returning the summary as a JSON response


In [99]:
# Run the Flask app

In [100]:
if __name__ == '__main__':  # Checking if the script is being run directly
    app.run(debug=True)  # Running the Flask app with debugging enabled

 * Serving Flask app '__main__'
 * Debug mode: on


 * Running on http://127.0.0.1:5000
INFO:werkzeug:[33mPress CTRL+C to quit[0m
INFO:werkzeug: * Restarting with stat


---

---


---

### 💾 **Step 6: Save the Model for Future Use**

```markdown
## 💡 **Save the Model for Future Use**
Once the model is trained and evaluated, it's essential to save it along with its tokenizer for future use, such as re-deployment or inference.

### Saving the Model:
1. **Save the Model**: Store the model weights and configuration files.
2. **Save the Tokenizer**: Store the tokenizer files for consistent text preprocessing during future inferences.

```python
model.save_pretrained('./summarization_model')
tokenizer.save_pretrained('./summarization_model')

In [101]:
# Save the fine-tuned model and tokenizer

In [102]:
model.save_pretrained("summarization_model")  # Saving the fine-tuned model

Non-default generation parameters: {'max_length': 142, 'min_length': 56, 'early_stopping': True, 'num_beams': 4, 'length_penalty': 2.0, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}


In [103]:
tokenizer.save_pretrained("summarization_model")  # Saving the tokenizer

('summarization_model/tokenizer_config.json',
 'summarization_model/special_tokens_map.json',
 'summarization_model/vocab.json',
 'summarization_model/merges.txt',
 'summarization_model/added_tokens.json')

---


---

### 📑 **Step 7: Summary and Next Steps**

```markdown
## 📌 **Summary and Next Steps**
In this project, we have:
1. Loaded and preprocessed the CNN/Daily Mail dataset.
2. Fine-tuned the **BART** model for text summarization.
3. Evaluated the model’s performance.
4. Deployed the model using **Flask** for real-time summarization.
5. Saved the model for future use.

### Next Steps:
- Experiment with other summarization models like **T5** or **GPT**.
- Improve model performance by tuning hyperparameters or using a larger dataset.
- Extend the Flask app to support batch processing or multi-model summarization.

This project provides a hands-on introduction to fine-tuning transformer-based models and deploying them in real-world applications.