# Fine Tuning Transformer for Summary Generation

This notebook fine-tunes a transformer model for the **summarization task**. The objective is to train the model to generate concise summaries of articles or documents. Summarization can be achieved in two main ways:

1. **Extractive Summarization**: The model identifies and assembles the key sentences from the article to form a summary, essentially "extracting" existing content without altering it.
   
2. **Abstractive Summarization**: This approach enables the model to create new sentences that encapsulate the core ideas of the article, resulting in a summary that may include information rephrased or condensed rather than simply copied from the article.
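
To make the contrast concrete, here is a minimal sketch of the extractive approach. It is a toy word-frequency heuristic, not the method used in this notebook (which fine-tunes T5 for abstractive summaries). The function name `extractive_summary` and the sample `article` are illustrative assumptions.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=1):
    """Pick the highest-scoring sentences by summed word frequency (toy heuristic)."""
    # Split on sentence-ending punctuation followed by whitespace
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    # Count how often each lowercase word occurs in the whole text
    freqs = Counter(re.findall(r'[a-z]+', text.lower()))

    def score(sentence):
        # A sentence scores the total frequency of the words it contains
        return sum(freqs[w] for w in re.findall(r'[a-z]+', sentence.lower()))

    ranked = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Emit the chosen sentences unchanged and in their original order
    return ' '.join(s for s in sentences if s in ranked)

article = "T5 is a transformer. T5 treats every task as text to text. Cats sleep."
summary = extractive_summary(article)
print(summary)
```

Every sentence in the output is copied verbatim from the input, which is the defining property of extractive summarization. An abstractive model is free to produce sentences that never appear in the article.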

The workflow includes:

1. **Setting Up the Environment**: Installing and importing the necessary libraries.
   
2. **Preparing the Dataset**: Creating a custom class to handle data processing.
   
3. **Model Fine-Tuning**: Writing functions to fine-tune the transformer model.
   
4. **Model Validation**: Evaluating the model’s performance after training.
   
5. **Main Workflow**:
    - Importing and preprocessing the dataset
    - Creating a dataset and dataloader
    - Defining the neural network and optimizer
    - Training the model
    - Validating the model and generating summaries

6. **Reviewing Summaries**: Analyzing some example summaries generated by the model.

### Technical Details

- **Dataset**: 
  Using the *News Summary* dataset from [Kaggle](https://www.kaggle.com/sunnysai12345/news-summary), which consists of news summaries from Indian newspapers. Focusing on `news_summary.csv`, which contains 4,514 entries with fields like author, publication date, headline, URL, a brief summary, and the full article text.

- **Model**: 
  Utilizing **T5**, a powerful transformer model designed for various NLP tasks. T5 frames each task as a "text-to-text" problem, where both input and output are treated as text, allowing it to work seamlessly across diverse NLP applications. Following guidance from the T5 research paper to ensure that the data aligns with the model’s requirements. [T5 paper](https://arxiv.org/abs/1910.10683) and [Hugging Face T5 documentation](https://huggingface.co/transformers/model_doc/t5.html) provide further technical insights.

- **Hardware and Libraries**: 
  Using Python 3.6+, PyTorch, and the Transformers library. Since the model is resource-intensive, operating with a GPU-enabled setup.

- **Objective**: 
  To fine-tune T5 to generate summaries that closely match or surpass the quality of actual article summaries, capturing essential points without losing critical information.

The project is set to explore how effectively T5 can be fine-tuned for summarization, aiming for high-quality summaries that retain the article's core information.

<a id='section01'></a>
### Preparing Environment and Importing Libraries

Installing the necessary libraries followed by importing the libraries and modules needed to run our script. 
Will be installing:
* transformers

Libraries imported are:
* Pandas
* PyTorch
* PyTorch Utils for Dataset and Dataloader
* Transformers
* T5 Model and Tokenizer

In [1]:
%pip install git+https://github.com/huggingface/transformers


In [None]:
!pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

Looking in indexes: https://download.pytorch.org/whl/cu124


In [2]:
# Importing stock libraries
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler

# Importing the T5 modules from huggingface/transformers
from transformers import T5Tokenizer, T5ForConditionalGeneration

In [3]:
from torch import cuda

# Check CUDA availability
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
print("PyTorch version:", torch.__version__)
print("GPU name:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU detected")
print("CUDA devices:", torch.cuda.device_count())

# Set the device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

CUDA available: True
CUDA version: 12.4
PyTorch version: 2.5.1+cu124
GPU name: NVIDIA GeForce RTX 4070 Laptop GPU
CUDA devices: 1
Using device: cuda


<a id='section02'></a>
### Preparing the Dataset for Data Processing: Class

Starting with the creation of a Dataset class, which defines how the text is pre-processed before feeding it to the neural network. This dataset will be utilized by the Dataloader method, which loads the data in batches for efficient training and processing in the neural network. Both Dataloader and Dataset will be implemented inside the `main()` function. In PyTorch, Dataset and Dataloader constructs are essential for defining data preprocessing and controlling its flow into the neural network. For more details, refer to the [PyTorch Dataset and Dataloader documentation](https://pytorch.org/docs/stable/data.html).

#### *CustomDataset* Class
- The *CustomDataset* class accepts a DataFrame as input and generates tokenized output compatible with the **T5** model for training.
- **T5** tokenizer is used to tokenize data in the `text` and `ctext` columns of the DataFrame.
- The tokenizer’s `batch_encode_plus` method performs tokenization, producing `source_ids` and `source_mask` from the article text (`ctext`), and `target_ids` and `target_mask` from the summary text (`text`).
- For more on this tokenizer, refer to the [T5 tokenizer documentation](https://huggingface.co/transformers/model_doc/t5.html#t5tokenizer).
- The *CustomDataset* class is instantiated twice: once for training and once for validation.
  - *Training Dataset*: Comprising 80% of the original data, used for fine-tuning the model.
  - *Validation Dataset*: Used to evaluate model performance on data unseen during training.
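
The fixed-length encoding described above can be sketched without a tokenizer. This is a simplified illustration, assuming a pad id of 0 (the pad token id of T5). The helper name `pad_or_truncate` and the token ids are made up for the example, and a real tokenizer also keeps the end-of-sequence token when truncating, which this sketch does not.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (ids, mask), each exactly max_len long; mask is 1 for real tokens, 0 for padding."""
    ids = list(token_ids)[:max_len]   # truncate sequences that are too long
    mask = [1] * len(ids)             # real tokens should be attended to
    padding = max_len - len(ids)      # number of pad positions still needed
    return ids + [pad_id] * padding, mask + [0] * padding

# A short sequence gets padded, a long one gets cut
short_ids, short_mask = pad_or_truncate([37, 1712, 1], max_len=5)
long_ids, long_mask = pad_or_truncate([37, 1712, 19, 307, 1], max_len=4)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every example comes out at the same length, the Dataloader can stack examples into a single tensor per batch.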

#### Dataloader: Called Inside the `main()`
- The Dataloader is responsible for creating training and validation dataloaders, which efficiently load data to the neural network in a controlled manner.
- Not all data can be loaded into memory at once, so the Dataloader's `batch_size` parameter, together with the maximum sequence lengths set in the dataset, controls the volume of data fed to the network.
- Training and Validation dataloaders are utilized in their respective phases in the workflow to manage memory and facilitate efficient processing.

In [4]:
# Creating a custom dataset for reading the dataframe and loading it into the dataloader
# to pass it to the neural network at a later stage for finetuning the model and to prepare it for predictions

class CustomDataset(Dataset):

    def __init__(self, dataframe, tokenizer, source_len, summ_len):
        self.tokenizer = tokenizer  # Initialize the tokenizer
        self.data = dataframe  # Store the dataframe
        self.source_len = source_len  # Set the maximum length for the source text
        self.summ_len = summ_len  # Set the maximum length for the summary text
        self.text = self.data.text  # Summaries (the targets)
        self.ctext = self.data.ctext  # Full article text (the model input)

    def __len__(self):
        return len(self.text)  # Return the number of samples in the dataset

    def __getitem__(self, index):
        ctext = str(self.ctext[index])  # Get the article text at the specified index
        ctext = ' '.join(ctext.split())  # Collapse runs of whitespace into single spaces

        text = str(self.text[index])  # Get the summary at the specified index
        text = ' '.join(text.split())  # Collapse runs of whitespace into single spaces

        # Tokenize the article, padding or truncating it to exactly source_len tokens
        source = self.tokenizer.batch_encode_plus(
            [ctext], max_length=self.source_len, padding='max_length', truncation=True, return_tensors='pt'
        )
        # Tokenize the summary, padding or truncating it to exactly summ_len tokens
        target = self.tokenizer.batch_encode_plus(
            [text], max_length=self.summ_len, padding='max_length', truncation=True, return_tensors='pt'
        )

        # Extract input IDs and attention masks for the source and target texts
        source_ids = source['input_ids'].squeeze()
        source_mask = source['attention_mask'].squeeze()
        target_ids = target['input_ids'].squeeze()
        target_mask = target['attention_mask'].squeeze()

        # Return a dictionary containing the encoded inputs and masks
        return {
            'source_ids': source_ids.to(dtype=torch.long),  # Source input IDs
            'source_mask': source_mask.to(dtype=torch.long),  # Source attention mask
            'target_ids': target_ids.to(dtype=torch.long),  # Target input IDs
            'target_ids_y': target_ids.to(dtype=torch.long)  # Target input IDs (for labels)
        }

<a id='section03'></a>
### Fine-Tuning the Model: Function

Defining a training function to fine-tune the model on the training dataset created earlier, iterating over the data for a specified number of epochs. An epoch represents a complete pass of the dataset through the network.

This function is called within `main()`.

The fine-tuning process in this function involves the following steps:
- The epoch count, tokenizer, model, device, training dataloader, and optimizer are passed to the `train()` function when called from `main()`.
- The dataloader feeds data to the model in batches as defined by the batch size.
- The labels (`lm_labels`) are generated from the `target_ids`, with padding positions set to -100 so that the loss ignores them; the source ids (`ids`) and attention mask (`mask`) are read from the batch.
- The model’s output provides a loss value for the forward pass.
- This loss value is then used to optimize the neural network weights.
- Every 500 steps, the loss value is printed to the console for quick reference.
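
The label preparation in the steps above can be illustrated on a single sequence with plain Python lists. This is a sketch under assumptions: `PAD_ID = 0` is hard-coded rather than read from the tokenizer, the helper name `make_decoder_inputs_and_labels` is hypothetical, and the token ids are invented. The value -100 is the default index that PyTorch's cross-entropy loss ignores.

```python
PAD_ID = 0           # assumed pad token id (T5 uses 0)
IGNORE_INDEX = -100  # positions with this label do not contribute to the loss

def make_decoder_inputs_and_labels(target_ids):
    """Shift one target sequence into decoder inputs and loss labels."""
    # Decoder inputs: every token except the last
    decoder_input_ids = target_ids[:-1]
    # Labels: every token except the first, with padding masked out
    labels = [IGNORE_INDEX if t == PAD_ID else t for t in target_ids[1:]]
    return decoder_input_ids, labels

# A padded target: two word tokens, the end-of-sequence token (1), then padding
dec_in, labels = make_decoder_inputs_and_labels([71, 1712, 1, 0, 0])
print(dec_in)
print(labels)
```

At each position the decoder sees the tokens up to `dec_in[i]` and is scored on predicting `labels[i]`, so padded positions add nothing to the loss.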

In [5]:
# Creating the training function. This will be called in the main function. It is run depending on the epoch value.
# The model is put into train mode and then we enumerate over the training loader and pass the data to the defined network

def train(epoch, tokenizer, model, device, loader, optimizer):
    model.train()  # Set the model to training mode
    for step, data in enumerate(loader):
        # Move the target IDs to the specified device and set the data type to long
        y = data['target_ids'].to(device, dtype=torch.long)
        # Prepare the decoder input IDs by removing the last token from each target sequence
        y_ids = y[:, :-1].contiguous()
        # Prepare the labels by removing the first token from each target sequence
        lm_labels = y[:, 1:].clone().detach()
        # Replace padding token IDs in the labels with -100 to ignore them in the loss calculation
        lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100
        # Move the source IDs and attention masks to the specified device and set the data type to long
        ids = data['source_ids'].to(device, dtype=torch.long)
        mask = data['source_mask'].to(device, dtype=torch.long)

        # Forward pass: compute the model output
        outputs = model(input_ids=ids, attention_mask=mask, decoder_input_ids=y_ids, labels=lm_labels)
        loss = outputs.loss  # Loss over the label positions that are not set to -100

        # Print the loss every 500 iterations
        if step % 500 == 0:
            print(f'Epoch: {epoch}, Loss:  {loss.item()}')

        optimizer.zero_grad()  # Clear the gradients of all optimized tensors
        loss.backward()  # Backward pass: compute the gradients
        optimizer.step()  # Update the model parameters

<a id='section04'></a>
### Validating the Model Performance: Function

During the validation stage, the unseen data (the validation dataset), the trained model, the tokenizer, and the device are passed to the function to perform the validation run. This step generates new summaries for articles that the model has not seen during training.

This function is called in `main()`.

This unseen data is the 20% of `news_summary.csv` that was separated during the dataset creation stage.
During the validation stage the weights of the model are not updated. We use the `generate` method to produce new text for the summary.

It relies on beam search decoding, a sequence generation method for models with a language modeling (LM) head.

The generated text and the original summary are decoded from tokens to text and returned to `main()`.
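
To show what beam search does, here is a minimal sketch over a hand-written probability table. The table `NEXT` and its values are hypothetical, and the sketch leaves out the repetition and length penalties that `generate` can also apply. It illustrates the idea only and is not the library's implementation.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token
NEXT = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.55, "dog": 0.45},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def beam_search(num_beams, max_length=5):
    """Return the best token sequence found while keeping num_beams hypotheses per step."""
    beams = [(["<s>"], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(max_length):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == "</s>":
                candidates.append((tokens, score))  # finished hypotheses carry over
                continue
            for tok, p in NEXT[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # Keep only the num_beams most probable hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
        if all(tokens[-1] == "</s>" for tokens, _ in beams):
            break
    return beams[0][0]

greedy = beam_search(num_beams=1)
beam2 = beam_search(num_beams=2)
print(greedy)
print(beam2)
```

With one beam the search commits to "the" because it is the most likely first token. With two beams it also keeps "a" alive and ends up with "a cat", whose overall probability (0.36) is higher than that of "the cat" (0.33).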

In [6]:
def validate(epoch, tokenizer, model, device, loader):
    model.eval()  # Set the model to evaluation mode
    predictions = []  # Initialize a list to store predictions
    actuals = []  # Initialize a list to store actual target sequences
    with torch.no_grad():  # Disable gradient calculation for validation
        for step, data in enumerate(loader):
            y = data['target_ids'].to(device, dtype=torch.long)  # Move target IDs to the specified device
            ids = data['source_ids'].to(device, dtype=torch.long)  # Move source IDs to the specified device
            mask = data['source_mask'].to(device, dtype=torch.long)  # Move source attention masks to the specified device

            # Generate predictions using the model
            generated_ids = model.generate(
                input_ids=ids,
                attention_mask=mask,
                max_length=150,  # Set the maximum length for generated sequences
                num_beams=2,  # Use beam search with 2 beams
                repetition_penalty=2.5,  # Set the repetition penalty
                length_penalty=1.0,  # Set the length penalty
                early_stopping=True  # Enable early stopping
            )

            # Decode the generated IDs to text
            preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]
            # Decode the target IDs to text
            target = [tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True) for t in y]

            # Print progress every 100 iterations
            if step % 100 == 0:
                print(f'Completed {step}')

            # Extend the predictions and actuals lists with the current batch
            predictions.extend(preds)
            actuals.extend(target)
    
    return predictions, actuals  # Return the predictions and actual target sequences

<a id='section05'></a>
### Main Function

The `main()`, as the name suggests, is the central location to execute all the functions and flows created above in the notebook. The following steps are executed in `main()`:


<a id='section502'></a>
#### Importing and Pre-Processing the domain data

We will be working with the data and preparing it for fine tuning purposes. 
*Assuming that the `news_summary.csv` is already downloaded in your `data` folder*

* The file is imported as a dataframe with headers as per the documentation.
* The file is cleaned to remove the unwanted columns.
* The string `summarize: ` is prepended to the main article column. This is done because **T5** used similar formatting for the summarization dataset.
* The final Dataframe will be something like this:

|text|ctext|
|--|--|
|summary-1|summarize: article 1|
|summary-2|summarize: article 2|
|summary-3|summarize: article 3|

* Top 5 rows of the dataframe are printed on the console.

<a id='section503'></a>
#### Creation of Dataset and Dataloader

* The updated dataframe is split in an 80-20 ratio into training and validation sets.
* Both dataframes are passed to the `CustomDataset` class for tokenization of the news articles and their summaries.
* The tokenization is done using the length parameters passed to the class.
* Train and validation parameters are defined and passed to the PyTorch `DataLoader` construct to create `train` and `validation` data loaders.
* These dataloaders will be passed to `train()` and `validate()` respectively for training and validation action.
* The shape of datasets is printed in the console.
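
The 80-20 split can be sketched on row indices alone. This is an illustration using the standard library rather than pandas, and the helper name `train_val_split` is hypothetical. The key property is that the two index sets are disjoint and together cover every row.

```python
import random

def train_val_split(num_rows, train_frac=0.8, seed=42):
    """Shuffle the row indices and cut them into disjoint train and validation sets."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    cut = round(num_rows * train_frac)    # number of rows in the training set
    return sorted(indices[:cut]), sorted(indices[cut:])

# 4,514 rows, the size of news_summary.csv
train_idx, val_idx = train_val_split(4514)
print(len(train_idx), len(val_idx))
```

If any row landed in both sets, the validation results would overstate how well the model handles unseen articles, so a quick disjointness check is worth running after the split.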


<a id='section504'></a>
#### Neural Network and Optimizer

* In this stage we define the model and optimizer that will be used for training and to update the weights of the network. 
* We are using the `t5-base` transformer model for our project. You can read about the `T5 model` and its features above. 
* We use the `T5ForConditionalGeneration.from_pretrained("t5-base")` command to define our model. `T5ForConditionalGeneration` adds a language model head to the `T5 model`, which allows us to generate text based on the training of the `T5 model`.
* We are using the `Adam` optimizer for our project. It is a standard choice and can be swapped out to see how different optimizers perform with different learning rates.
* There is also scope for doing more with the optimizer, such as decay or momentum, to dynamically update the learning rate and other parameters.
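
As one example of dynamically updating the learning rate, a linear decay schedule can be written as a plain function of the step number. This is a sketch and not part of the notebook's training loop. The function name `linear_decay_lr` is hypothetical and the default `base_lr` simply mirrors the 1e-4 used here. In PyTorch, a function like this could be attached to an optimizer through `torch.optim.lr_scheduler.LambdaLR`, which expects a multiplier of the base rate rather than the rate itself.

```python
def linear_decay_lr(step, total_steps, base_lr=1e-4):
    """Learning rate that falls linearly from base_lr at step 0 to 0 at total_steps."""
    remaining = max(0, total_steps - step)  # clamp so the rate never goes negative
    return base_lr * remaining / total_steps

# Learning rate at the start, the midpoint, and the end of a 100-step run
schedule = [linear_decay_lr(s, total_steps=100) for s in (0, 50, 100)]
print(schedule)
```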


<a id='section505'></a>
#### Training Model

* We call the `train()` with all the necessary parameters.
* Loss at every 500th step is printed on the console.


<a id='section506'></a>
#### Validation and generation of Summary

* After the training is completed, the validation step is initiated.
* As defined in the validation function, the model weights are not updated. We use the fine-tuned model to generate new summaries based on the article text.
* An output is printed on the console giving a count of how many steps are complete after every 100th step. 
* The original summary and generated summary are converted into a list and returned to the main function. 
* Both the lists are used to create the final dataframe with 2 columns **Generated Summary** and **Actual Summary**
* The dataframe is saved as a csv file in the local drive.
* A qualitative analysis can be done with the Dataframe. 
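
To complement reading the summaries side by side, a rough word-overlap score can flag rows that deserve a closer look. The sketch below is a simplified unigram F1 in the spirit of ROUGE-1. It is not the official ROUGE implementation and is not part of the notebook, and the function name `unigram_f1` is hypothetical.

```python
from collections import Counter

def unigram_f1(generated, reference):
    """F1 of the word overlap between a generated and a reference summary."""
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((gen & ref).values())      # words shared by both, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())  # share of generated words found in the reference
    recall = overlap / sum(ref.values())     # share of reference words that were recovered
    return 2 * precision * recall / (precision + recall)

score = unigram_f1("the cat", "the cat sat on")
print(round(score, 3))
```

Applied row by row to the **Generated Summary** and **Actual Summary** columns, a low score points to a summary that diverges from the reference. A high score does not guarantee that the summary is factually correct.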

In [None]:
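TRAIN_BATCH_SIZE = 2    # input batch size for training
VALID_BATCH_SIZE = 2    # input batch size for validation
TRAIN_EPOCHS = 2        # number of epochs to train
VAL_EPOCHS = 1          # number of passes over the validation set
LEARNING_RATE = 1e-4    # learning rate passed to the optimizer
SEED = 42               # random seed for reproducibility
MAX_LEN = 512           # maximum number of tokens in the source article
SUMMARY_LEN = 150       # maximum number of tokens in the target summary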

# Set random seeds and deterministic pytorch for reproducibility
torch.manual_seed(SEED) # pytorch random seed
np.random.seed(SEED) # numpy random seed
torch.backends.cudnn.deterministic = True

# Tokenizer for encoding the text
tokenizer = T5Tokenizer.from_pretrained("t5-base")

# Importing and pre-processing the domain data
# Selecting only the needed columns.
# Adding the "summarize: " prefix in front of the article text, to format the dataset the way T5 was trained for the summarization task.
df = pd.read_csv('../data/news_summary.zip',encoding='latin-1')
df = df[['text','ctext']]
df.ctext = 'summarize: ' + df.ctext
print(df.head())

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


                                                text  \
0  The Administration of Union Territory Daman an...   
1  Malaika Arora slammed an Instagram user who tr...   
2  The Indira Gandhi Institute of Medical Science...   
3  Lashkar-e-Taiba's Kashmir commander Abu Dujana...   
4  Hotels in Maharashtra will train their staff t...   

                                               ctext  
0  summarize: The Daman and Diu administration on...  
1  summarize: From her special numbers to TV?appe...  
2  summarize: The Indira Gandhi Institute of Medi...  
3  summarize: Lashkar-e-Taiba's Kashmir commander...  
4  summarize: Hotels in Mumbai and other Indian c...  


In [8]:
# Creation of Dataset and Dataloader

# Defining the train size. So 80% of the data will be used for training and the rest will be used for validation.
train_size = 0.8
train_dataset = df.sample(frac=train_size, random_state=SEED)  # Sample 80% of the data for training
val_dataset = df.drop(train_dataset.index).reset_index(drop=True)  # Use the remaining 20% for validation
train_dataset = train_dataset.reset_index(drop=True)  # Reset the index only after the validation rows have been selected

# Print the shapes of the full dataset, training dataset, and validation dataset
print("FULL Dataset: {}".format(df.shape))
print("TRAIN Dataset: {}".format(train_dataset.shape))
print("TEST Dataset: {}".format(val_dataset.shape))

FULL Dataset: (4514, 2)
TRAIN Dataset: (3611, 2)
TEST Dataset: (903, 2)


In [9]:

# Creating the Training and Validation dataset for further creation of Dataloader
training_set = CustomDataset(train_dataset, tokenizer, MAX_LEN, SUMMARY_LEN)  # Create a custom dataset for training
val_set = CustomDataset(val_dataset, tokenizer, MAX_LEN, SUMMARY_LEN)  # Create a custom dataset for validation

# Defining the parameters for creation of dataloaders
train_params = {
    'batch_size': TRAIN_BATCH_SIZE,  # Set the batch size for training
    'shuffle': True,  # Shuffle the data for training
    'num_workers': 0  # Number of worker subprocesses for loading data (0 = load in the main process)
}

val_params = {
    'batch_size': VALID_BATCH_SIZE,  # Set the batch size for validation
    'shuffle': False,  # Do not shuffle the data for validation
    'num_workers': 0  # Number of worker subprocesses for loading data (0 = load in the main process)
}

In [None]:
import os
import torch
from transformers import T5ForConditionalGeneration

# Define the path to save the model
MODEL_PATH = "t5_fine_tuned_model.pth"

# Check if CUDA is available and set the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create the dataloaders up front: the validation loader is needed even when a saved model is loaded
training_loader = DataLoader(training_set, **train_params)  # Create a DataLoader for the training set
val_loader = DataLoader(val_set, **val_params)  # Create a DataLoader for the validation set

# Check if the model already exists
if os.path.exists(MODEL_PATH):
    print("Trained model already exists. Loading the model...")
    model = T5ForConditionalGeneration.from_pretrained("t5-base")
    # weights_only=True restricts loading to tensors and avoids the torch.load FutureWarning
    model.load_state_dict(torch.load(MODEL_PATH, map_location=device, weights_only=True))
    model = model.to(device)
    model.eval()  # Set the model to evaluation mode
else:
    print("No trained model found. Starting training...")

    # Load the pretrained t5-base checkpoint and move it to the device
    model = T5ForConditionalGeneration.from_pretrained("t5-base")
    model = model.to(device)

    # Define the optimizer
    optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)

    # Training loop
    print('Initiating Fine-Tuning for the model on our dataset')

    for epoch in range(TRAIN_EPOCHS):
        train(epoch, tokenizer, model, device, training_loader, optimizer)

    # Save the model after training
    torch.save(model.state_dict(), MODEL_PATH)
    print(f"Model saved to {MODEL_PATH}")

Trained model already exists. Loading the model...




In [16]:
# Validation loop and saving the resulting file with predictions and actuals in a dataframe.
# Saving the dataframe as predictions.csv

print('Now generating summaries on our fine-tuned model for the validation dataset and saving it in a dataframe')

for epoch in range(VAL_EPOCHS):
    # Call the validate function to generate predictions and actual summaries for the validation dataset
    predictions, actuals = validate(epoch, tokenizer, model, device, val_loader)
    
    # Create a DataFrame to store the generated and actual summaries
    final_df = pd.DataFrame({'Generated Text': predictions, 'Actual Text': actuals})
    
    # Save the DataFrame to a CSV file named predictions.csv
    final_df.to_csv('predictions.csv')
    # final_df.to_csv('code\evaluation\predictions.csv')
    
    # Print a message indicating that the output files have been generated
    print('Output Files generated for review')

Now generating summaries on our fine-tuned model for the validation dataset and saving it in a dataframe
Completed 0
Completed 100
Completed 200
Completed 300
Completed 400
Output Files generated for review


<a id='section06'></a>
### Examples of the Summary Generated from the model

##### Example 1

**Original Text**
New Delhi, Apr 25 (PTI) Union minister Vijay Goel today batted for the unification of the three municipal corporations in the national capital saying a discussion over the issue was pertinent. The BJP leader, who was confident of a good show by his party in the MCD polls, the results of which will be declared tomorrow, said the civic bodies needed to be "revamped" in order to deliver the services to the people more effectively. The first thing needed was a discussion on the unification of the three municipal corporations and there should also be an end to the practice of sending Delhi government officials to serve in the civic bodies, said the Union Minister of State (Independent Charge) for Youth Affairs and Sports. "Barring one, the two other civic bodies have been incurring losses. It would be more fruitful and efficient if all the three were merged," he said, referring to the north, south and east Delhi municipal corporations. The erstwhile Municipal Corporation of Delhi (MCD) was trifurcated into NDMC, SDMC and EDMC by the then Sheila Dikshit-led Delhi government in 2012. Goel predicted a "thumping" victory for the BJP in the MCD polls. He said the newly-elected BJP councillors will be trained on the functioning of the civic bodies and dealing with the bureaucracy. 


**Original Summary**
Union Minister Vijay Goel has favoured unification of three MCDs ? North, South and East ? in order to deliver the services more effectively. "Barring one, the two other civic bodies have been incurring losses. It would be more fruitful and efficient if all the three were merged," he said. MCD was trifurcated into EDMC, NDMC and SDMC in 2012.

**Generated Summary**
BJP leader Vijay Goel on Saturday batted for the unification of three municipal corporations in the national capital saying a discussion over this was pertinent. "Barring one, two other civic bodies have been incurring losses," said Goels. The erstwhile Municipal Corporations of Delhi (MCD) were trifurcated into NDMC and SDMC by the then Sheilha Dikshi-led government in 2012. Notably, the MCD poll results will be declared tomorrow.

##### Example 2

**Original Text**
After much wait, the first UDAN flight took off from Shimla today after being flagged off by Prime Minister Narendra Modi.The flight will be operated by Alliance Air, the regional arm of Air India. PM Narendra Modi handed over boarding passes to some of passengers travelling via the first UDAN flight at the Shimla airport.Tomorrow PM @narendramodi will flag off the first UDAN flight under the Regional Connectivity Scheme, on Shimla-Delhi sector.Air India yesterday opened bookings for the first launch flight from Shimla to Delhi with all inclusive fares starting at Rs2,036.THE GREAT 'UDAN'The UDAN (Ude Desh ka Aam Naagrik) scheme seeks to make flying more affordable for the common people, holding a plan to connect over 45 unserved and under-served airports.Under UDAN, 50 per cent of the seats on each flight would have a cap of Rs 2,500 per seat/hour. The government has also extended subsidy in the form of viability gap funding to the operators flying on these routes.The scheme was launched to "make air travel accessible to citizens in regionally important cities," and has been described as "a first-of-its-kind scheme globally to stimulate regional connectivity through a market-based mechanism." Report have it the first flight today will not be flying at full capacity on its 70-seater ATR airplane because of payload restrictions related to the short Shimla airfield.|| Read more ||Udan scheme: Now you can fly to these 43 cities, see the full list hereUDAN scheme to fly hour-long flights capped at Rs 2,500 to smaller cities 


**Original Summary**
PM Narendra Modi on Thursday launched Ude Desh ka Aam Nagrik (UDAN) scheme for regional flight connectivity by flagging off the inaugural flight from Shimla to Delhi. Under UDAN, government will connect small towns by air with 50% plane seats' fare capped at?2,500 for a one-hour journey of 500 kilometres. UDAN will connect over 45 unserved and under-served airports.

**Generated Summary**
UDAN (Ude Desh Ka Aam Naagrik) scheme, launched to make air travel accessible in regionally important cities under the Regional Connectivity Scheme, took off from Shimla on Tuesday. The first flight will be operated by Alliance Air, which is the regional arm of India's Air India. Under the scheme, 50% seats would have?2,500 per seat/hour and 50% of the seats would have capped at this rate. It was also extended subsidy in form-based funding for operators flying these routes as well.

##### Example 3

**Original Text**
New Delhi, Apr 25 (PTI) The Income Tax department has issued a Rs 24,646 crore tax demand notice to Sahara Groups Aamby Valley Limited (AVL) after conducting a special audit of the company. The department, as part of a special investigation and audit into the account books of AVL, found that an income of over Rs 48,000 crore for a particular assessment year was allegedly not reflected in the record books of the firm and hence it raised a fresh tax demand and penalty amount on it. A Sahara Group spokesperson confirmed the development to PTI. "Yes, the Income Tax Department has raised Rs 48,085.79 crores to the income of the Aamby Valley Limited with a total demand of income tax of Rs 24,646.96 crores on the Aamby Valley Limited," the spokesperson said in a brief statement. Officials said the notice was issued by the taxman in January this year after the special audit of AVLs income for the Assessment Year 2012-13 found that the parent firm had allegedly floated a clutch of Special Purpose Vehicles whose incomes were later accounted on the account of AVL as they were merged with the former in due course of time. The AVL, in its income return filed for AY 2012-13, had reflected a loss of few crores but the special I-T audit brought up the added income, a senior official said. The Supreme Court, last week, had asked the Bombay High Courts official liquidator to sell the Rs 34,000 crore worth of properties of Aamby Valley owned by the Sahara Group and directed its chief Subrata Roy to personally appear before it on April 28.  


**Original Summary**
The Income Tax Department has issued a ?24,646 crore tax demand notice to Sahara Group's Aamby Valley Limited. The department's audit found that an income of over ?48,000 crore for the assessment year 2012-13 was not reflected in the record books of the firm. A week ago, the SC ordered Bombay HC to auction Sahara's Aamby Valley worth ?34,000 crore.

**Generated Summary**
the Income Tax department has issued a?24,646 crore tax demand notice to Sahara Groups Aamby Valley Limited (AVL) after conducting an audit of the company. The notice was issued in January this year after the special audit found that the parent firm had floated Special Purpose Vehicle income for the Assessment Year 2012-13 and later accounted on its account as they were merged with the former. "Yes...the Income Tax Department raised Rs48,085.79 crores to the income," he added earlier said at the notice.