
In [20]:
!pip install datasets



In [21]:
from datasets import load_dataset

ds = load_dataset("knkarthick/dialogsum")

In [22]:
ds

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

In [23]:
ds['train'][1]['dialogue']

"#Person1#: Hello Mrs. Parker, how have you been?\n#Person2#: Hello Dr. Peters. Just fine thank you. Ricky and I are here for his vaccines.\n#Person1#: Very well. Let's see, according to his vaccination record, Ricky has received his Polio, Tetanus and Hepatitis B shots. He is 14 months old, so he is due for Hepatitis A, Chickenpox and Measles shots.\n#Person2#: What about Rubella and Mumps?\n#Person1#: Well, I can only give him these for now, and after a couple of weeks I can administer the rest.\n#Person2#: OK, great. Doctor, I think I also may need a Tetanus booster. Last time I got it was maybe fifteen years ago!\n#Person1#: We will check our records and I'll have the nurse administer and the booster as well. Now, please hold Ricky's arm tight, this may sting a little."

In [24]:
ds['train'][1]['summary']

'Mrs Parker takes Ricky for his vaccines. Dr. Peters checks the record and then gives Ricky a vaccine.'

In [25]:
!pip install transformers



In [26]:
# Use a pipeline as a high-level helper
import torch
from transformers import pipeline

# Run on the first GPU when one is available, otherwise fall back to CPU (device=-1)
device = 0 if torch.cuda.is_available() else -1
pipe = pipeline("summarization", model="facebook/bart-large-cnn", device=device)


In [27]:
article1=ds['train'][1]['dialogue']

In [28]:
pipe(article1,max_length=20,min_length=10,do_sample=False)

[{'summary_text': 'Ricky has received his Polio, Tetanus and Hepatitis B shots.'}]

In [29]:
ds['train'][1]['summary']

'Mrs Parker takes Ricky for his vaccines. Dr. Peters checks the record and then gives Ricky a vaccine.'
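The baseline output above copies one sentence from the dialogue and misses most of the reference summary. One rough way to put a number on that gap is unigram overlap in the spirit of ROUGE-1. The sketch below is a simplified pure-Python approximation (lowercased word tokens, no stemming), not the official ROUGE scorer, and the function name `rouge1_sketch` is made up for illustration.

```python
import re
from collections import Counter

def rouge1_sketch(candidate, reference):
    """Simplified unigram overlap (ROUGE-1 style); not the official scorer."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# Strings copied from the cell outputs above
baseline = 'Ricky has received his Polio, Tetanus and Hepatitis B shots.'
reference = ('Mrs Parker takes Ricky for his vaccines. '
             'Dr. Peters checks the record and then gives Ricky a vaccine.')
scores = rouge1_sketch(baseline, reference)
print(scores)
```

Only three words are shared ("ricky", "his", "and"), so the score is low. The same function could be applied to the fine-tuned model's output for a like-for-like comparison.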

In [30]:
# With fine-tuning

In [31]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

In [32]:
# Tokenization

In [33]:
def preprocess_function(batch):
    # Tokenize dialogues (inputs) and summaries (targets); both are truncated or padded to 128 tokens
    source_ids = tokenizer(batch['dialogue'], truncation=True, padding='max_length', max_length=128)
    target_ids = tokenizer(batch['summary'], truncation=True, padding='max_length', max_length=128)

    # Replace padding token ids in the labels with -100 so the loss ignores them
    labels = [
        [(token_id if token_id != tokenizer.pad_token_id else -100) for token_id in example]
        for example in target_ids['input_ids']
    ]
    return {
        "input_ids": source_ids["input_ids"],
        "attention_mask": source_ids["attention_mask"],
        "labels": labels,
    }
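The label handling in `preprocess_function` can be seen in isolation with plain lists. The sketch below is an illustration only: the token ids are hypothetical, `pad_token_id=1` is an assumption made for the example, and the simple slice used for truncation ignores the special tokens a real tokenizer would keep in place.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def pad_and_mask(token_ids, max_length, pad_token_id):
    """Truncate or pad to max_length, then hide padding positions from the loss."""
    truncated = token_ids[:max_length]
    padded = truncated + [pad_token_id] * (max_length - len(truncated))
    return [tok if tok != pad_token_id else IGNORE_INDEX for tok in padded]

# Hypothetical token ids; pad_token_id=1 is assumed here for illustration
labels = pad_and_mask([0, 713, 16, 2], max_length=6, pad_token_id=1)
print(labels)  # [0, 713, 16, 2, -100, -100]
```

Without this masking, the model would also be trained to predict padding tokens, which would dilute the loss on short summaries.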

In [34]:
df_source=ds.map(preprocess_function,batched=True)

In [35]:
from transformers import TrainingArguments,Trainer

training_args=TrainingArguments(
    output_dir="/content",
    per_device_train_batch_size=8,
    num_train_epochs=2,
    remove_unused_columns=True
)

In [36]:
trainer=Trainer(
    model=model,
    args=training_args,
    train_dataset=df_source['train'],
    eval_dataset=df_source['test']
)

In [37]:
trainer.train()

Step,Training Loss
500,1.5923
1000,1.4886
1500,1.4343
2000,1.0836
2500,1.0181
3000,0.9997


Non-default generation parameters: {'max_length': 142, 'min_length': 56, 'early_stopping': True, 'num_beams': 4, 'length_penalty': 2.0, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}

TrainOutput(global_step=3116, training_loss=1.2605395825132633, metrics={'train_runtime': 3459.0154, 'train_samples_per_second': 7.204, 'train_steps_per_second': 0.901, 'total_flos': 6750530835578880.0, 'train_loss': 1.2605395825132633, 'epoch': 2.0})
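The `global_step` and throughput figures in `TrainOutput` can be cross-checked against the dataset size and the training arguments. This is a back-of-the-envelope sketch that assumes a single device and no gradient accumulation, which the notebook does not state explicitly.

```python
import math

num_train_rows = 12460      # train split size from the DatasetDict printout
batch_size = 8              # per_device_train_batch_size
num_epochs = 2              # num_train_epochs
train_runtime = 3459.0154   # seconds, from TrainOutput

# The last batch of each epoch is partial, hence the ceiling
steps_per_epoch = math.ceil(num_train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_train_rows * num_epochs / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3))
```

Under those assumptions the computed step count agrees with the reported `global_step=3116`, and the throughput agrees with `train_samples_per_second`.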

In [38]:
eval_results=trainer.evaluate()

In [39]:
eval_results

{'eval_loss': 1.6633961200714111,
 'eval_runtime': 50.1886,
 'eval_samples_per_second': 29.887,
 'eval_steps_per_second': 3.746,
 'epoch': 2.0}
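Two quick sanity checks can be read off these numbers. Assuming `eval_loss` is the mean token-level cross-entropy over the unmasked label tokens, its exponential gives a perplexity. Multiplying runtime by throughput recovers the number of evaluated rows. The values below are copied from the output above.

```python
import math

eval_loss = 1.6633961200714111
eval_runtime = 50.1886             # seconds
eval_samples_per_second = 29.887

# Perplexity is exp(cross-entropy), assuming the loss is a per-token average
perplexity = math.exp(eval_loss)
approx_eval_rows = round(eval_runtime * eval_samples_per_second)

print(f"perplexity ~ {perplexity:.2f}, evaluated rows ~ {approx_eval_rows}")
```

The row count matches the 1,500-example test split, which is the split passed as `eval_dataset`. The eval loss is also noticeably higher than the last logged training loss (0.9997), which may indicate some overfitting by the second epoch.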

## Save the Model


In [40]:
model.save_pretrained('/content/model_directory')
tokenizer.save_pretrained('/content/model_directory')


('/content/model_directory/tokenizer_config.json',
 '/content/model_directory/special_tokens_map.json',
 '/content/model_directory/vocab.json',
 '/content/model_directory/merges.txt',
 '/content/model_directory/added_tokens.json',
 '/content/model_directory/tokenizer.json')

In [41]:
tokenizer=AutoTokenizer.from_pretrained('/content/model_directory')
model=AutoModelForSeq2SeqLM.from_pretrained('/content/model_directory')

In [49]:
def summarize(blog_post):
    # Tokenize the input, truncating to the model's 1024-token limit
    inputs = tokenizer(blog_post, max_length=1024, truncation=True, return_tensors='pt')

    # Generate the summary with beam search
    summary_ids = model.generate(
        inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=60,
        min_length=30,
        length_penalty=2.0,
        num_beams=4,
        early_stopping=True,
    )

    # Decode the generated token ids back into text
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary
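The `length_penalty=2.0` argument affects how beam search ranks finished hypotheses. As a rough sketch of the idea, a hypothesis can be scored as its summed log-probability divided by its length raised to the penalty. Because log-probabilities are negative, a larger penalty favors longer outputs. The numbers and the `beam_score` helper below are hypothetical, and the exact normalization inside `generate` may differ between library versions.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized beam score: summed log-probs divided by length**penalty."""
    return sum_logprobs / (length ** length_penalty)

# Hypothetical hypotheses: (total log-probability, length in tokens)
short_hyp = (-4.0, 5)
long_hyp = (-6.0, 10)

for penalty in (0.0, 2.0):
    short_score = beam_score(*short_hyp, penalty)
    long_score = beam_score(*long_hyp, penalty)
    winner = "short" if short_score > long_score else "long"
    print(f"length_penalty={penalty}: short={short_score:.3f}, long={long_score:.3f} -> {winner}")
```

With no penalty the shorter hypothesis wins on raw log-probability. With a penalty of 2.0 the longer one ranks higher.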

In [50]:
blog_post="""
President Joe Biden's decision Sunday to drop out of the 2024 election sets the stage to end a nearly 50-year run when either a Bush, Clinton, or Biden appeared on the ballot as president or vice presidential candidate for the White House.

Members of the Bush and Clinton families, along with Joe Biden, have been on every presidential election ticket since 1980, when Ronald Reagan and running mate George H.W. Bush won.

Reagan and Bush easily won reelection in 1984 before Bush won the presidency himself in 1988.

The next four elections would feature either a Bush or Clinton on the ballot, with Bill Clinton defeating George H.W. Bush in 1992, before defeating Bob Dole in 1996, and George W. Bush winning elections in 2000 and 2004
"""

summary=summarize(blog_post)
print(f'Summary:{summary}')

Summary:Joe Biden's decision to drop out of the 2024 election sets the stage to end a nearly 50-year run when either a Bush, Clinton, or Biden appeared on the ballot as president or vice presidential candidate.


In [58]:
print(len(blog_post),len(summary))

741 202
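Those two lengths give a simple compression ratio. Note that they are character counts, not token counts, so the ratio is only a rough indicator of how much the summary shortens the post.

```python
post_chars = 741      # len(blog_post) from the output above
summary_chars = 202   # len(summary) from the output above

ratio = summary_chars / post_chars
print(f"summary is {ratio:.1%} of the original ({post_chars / summary_chars:.1f}x shorter)")
```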
