In [1]:
import pandas as pd
from fastai.text.all import *
from transformers import *
from blurr.data.all import *
from blurr.modeling.all import *

#Get data
df = pd.read_csv("news_summary.csv", encoding = "ISO-8859-1")
df.dropna().reset_index()
df.columns




Index(['author', 'date', 'headlines', 'read_more', 'text', 'ctext'], dtype='object')

In [2]:
df = df.drop(['author', 'date', 'headlines', 'read_more'], axis = 1)
df.head(5)

Unnamed: 0,text,ctext
0,The Administration of Union Territory Daman and Diu has revoked its order that made it compulsory for women to tie rakhis to their male colleagues on the occasion of Rakshabandhan on August 7. The administration was forced to withdraw the decision within 24 hours of issuing the circular after it received flak from employees and was slammed on social media.,"The Daman and Diu administration on Wednesday withdrew a circular that asked women staff to tie rakhis on male colleagues after the order triggered a backlash from employees and was ripped apart on social media.The union territory?s administration was forced to retreat within 24 hours of issuing the circular that made it compulsory for its staff to celebrate Rakshabandhan at workplace.?It has been decided to celebrate the festival of Rakshabandhan on August 7. In this connection, all offices/ departments shall remain open and celebrate the festival collectively at a suitable time wherein a..."
1,"Malaika Arora slammed an Instagram user who trolled her for ""divorcing a rich man"" and ""having fun with the alimony"". ""Her life now is all about wearing short clothes, going to gym or salon, enjoying vacation[s],"" the user commented. Malaika responded, ""You certainly got to get your damn facts right before spewing sh*t on me...when you know nothing about me.""","From her special numbers to TV?appearances, Bollywood actor Malaika Arora Khan has managed to carve her own identity. The actor, who made her debut in the Hindi film industry with the blockbuster debut opposite Shah Rukh Khan in Chaiyya Chaiyya from Dil Se (1998), is still remembered for the song. However, for trolls, she is a woman first and what matters right now is that she divorced a ?rich man?. On Wednesday, Malaika Arora shared a gorgeous picture of herself on Instagram and a follower decided to troll her for using her ?alumni? (read alimony) money to wear ?short clothes and going t..."
2,"The Indira Gandhi Institute of Medical Sciences (IGIMS) in Patna on Thursday made corrections in its Marital Declaration Form by changing 'Virgin' option to 'Unmarried'. Earlier, Bihar Health Minister defined virgin as being an unmarried woman and did not consider the term objectionable. The institute, however, faced strong backlash for asking new recruits to declare their virginity in the form.","The Indira Gandhi Institute of Medical Sciences (IGIMS) in Patna amended its marital declaration form on Thursday, replacing the word ?virgin? with ?unmarried? after controversy.Until now, new recruits to the super-specialty medical institute in the state capital were required to declare if they were bachelors, widowers or virgins.IGIMS medical superintendent Dr Manish Mandal said institute director Dr NR Biswas held a meeting on Thursday morning before directing that the word ?virgin? on the marital declaration form be immediately replaced with ?unmarried?. Dr Biswas had just returned aft..."
3,"Lashkar-e-Taiba's Kashmir commander Abu Dujana, who was killed by security forces, said ""Kabhi hum aage, kabhi aap, aaj aapne pakad liya, mubarak ho aapko (Today you caught me. Congratulations)"" after being caught. He added that he won't surrender, and whatever is in his fate will happen to him. ""Hum nikley they shaheed hone (had left home for martyrdom),"" he added.","Lashkar-e-Taiba's Kashmir commander Abu Dujana was killed in an encounter in a village in Pulwama district of Jammu and Kashmir earlier this week. Dujana, who had managed to give the security forces a slip several times in the past, carried a bounty of Rs 15 lakh on his head.Reports say that Dujana had come to meet his wife when he was trapped inside a house in Hakripora village. Security officials involved in the encounter tried their best to convince Dujana to surrender but he refused, reports say.According to reports, Dujana rejected call for surrender from an Army officer. The Army had..."
4,"Hotels in Maharashtra will train their staff to spot signs of sex trafficking, including frequent requests for bed linen changes and 'Do not disturb' signs left on room doors for days. A mobile phone app called Rescue Me, which will allow staff to alert police of suspicious behaviour, will be developed. The initiative has been backed by the Maharashtra government.","Hotels in Mumbai and other Indian cities are to train their staff to spot signs of sex trafficking such as frequent requests for bed linen changes or a ""Do not disturb"" sign left on the door for days on end. The group behind the initiative is also developing a mobile phone app - Rescue Me - which hotel staff can use to alert local police and senior anti-trafficking officers if they see suspicious behavior. ""Hotels are breeding grounds for human trade,"" said Sanee Awsarmmel, chairman of the alumni group of Maharashtra State Institute of Hotel Management and Catering Technology. ""(We) have h..."


In [3]:
articles = df.head(100)

In [6]:
#Import the pretrained model
pretrained_model_name = "facebook/bart-large-cnn"
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR.get_hf_objects(pretrained_model_name, 
                                                                  model_cls=BartForConditionalGeneration)

#Create mini-batch and define parameters
hf_batch_tfm = HF_Seq2SeqBeforeBatchTransform(hf_arch, hf_config, hf_tokenizer, hf_model, 
    task='summarization',
    text_gen_kwargs=
 {'max_length': 744,'min_length': 248,'do_sample': False, 'early_stopping': True, 'num_beams': 4, 'temperature': 1.0, 
  'top_k': 50, 'top_p': 1.0, 'repetition_penalty': 1.0, 'bad_words_ids': None, 'bos_token_id': 0, 'pad_token_id': 1,
 'eos_token_id': 2, 'length_penalty': 2.0, 'no_repeat_ngram_size': 3, 'encoder_no_repeat_ngram_size': 0,
 'num_return_sequences': 1, 'decoder_start_token_id': 2, 'use_cache': True, 'num_beam_groups': 1,
 'diversity_penalty': 0.0, 'output_attentions': False, 'output_hidden_states': False, 'output_scores': False,
 'return_dict_in_generate': False, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2, 'remove_invalid_values': False})


#Prepare data for training
blocks = (HF_Seq2SeqBlock(before_batch_tfm=hf_batch_tfm), noop)
dblock = DataBlock(blocks=blocks, get_x=ColReader('ctext'), get_y=ColReader('text'), splitter=RandomSplitter())
dls = dblock.dataloaders(articles, batch_size = 2)

Due to IPython and Windows limitation, python multiprocessing isn't available now.
So `number_workers` is changed to 0 to avoid getting stuck


In [7]:
#Define performance metrics
seq2seq_metrics = {
        'rouge': {
            'compute_kwargs': { 'rouge_types': ["rouge1", "rouge2", "rougeL"], 'use_stemmer': True },
            'returns': ["rouge1", "rouge2", "rougeL"]
        },
        'bertscore': {
            'compute_kwargs': { 'lang': 'fr' },
            'returns': ["precision", "recall", "f1"]}}

#Model
model = HF_BaseModelWrapper(hf_model)
learn_cbs = [HF_BaseModelCallback]
fit_cbs = [HF_Seq2SeqMetricsCallback(custom_metrics=seq2seq_metrics)]

#Specify training
learn = Learner(dls, model,
                opt_func=ranger,loss_func=CrossEntropyLossFlat(),
                cbs=learn_cbs,splitter=partial(seq2seq_splitter, arch=hf_arch)).to_fp16()

#Create optimizer with default hyper-parameters
learn.create_opt() 
learn.freeze()

#Training
learn.fit_one_cycle(3, lr_max=3e-5, cbs=fit_cbs)



epoch,train_loss,valid_loss,rouge1,rouge2,rougeL,bertscore_precision,bertscore_recall,bertscore_f1,time
0,2.28937,2.040586,0.43716,0.173449,0.274106,0.744127,0.741346,0.742623,13:07
1,2.026023,1.84613,0.419814,0.170931,0.281067,0.746231,0.745896,0.745943,11:23
2,1.817264,1.803282,0.444131,0.197494,0.306661,0.752699,0.755363,0.753893,11:19


Downloading: 100%|██████████| 29.0/29.0 [00:00<00:00, 29.0kB/s]
Downloading: 100%|██████████| 625/625 [00:00<00:00, 603kB/s]
Downloading: 100%|██████████| 972k/972k [00:02<00:00, 460kB/s]  
Downloading: 100%|██████████| 681M/681M [01:06<00:00, 10.7MB/s] 


In [8]:
text_to_generate = "Tesla has idled its Shanghai Gigafactory for two days amid a rise in China’s Omicron cases that has prompted the government to tighten restrictions there.The automaker sent a notice to employees and suppliers on Wednesday informing them of the closure, reported Reuters, which viewed the internal memo.The electric vehicle maker didn’t confirm the reason for suspending production on Wednesday and Thursday. However, the temporary suspension in production comes as Toyota and Volkswagen  – the world’s largest two automakers – also idled operations in China this week due to a local increase in COVID-19 cases and the additional restrictions that the government implemented to manage the surge.It is also possible that supply chain constraints contributed to the reason for the shutdown.The round-the-clock factory is key to Tesla’s global operations — and its bottom line. The Shanghai Gigafactory, which is Tesla’s largest by volume, exports a significant number of Model 3 and Model Y vehicles to Europe. The factory has been producing about 2,000 vehicles per day, so even a two-day shutdown could drastically reduce Tesla’s output and further delay deliveries.The virus is surging again in China, with cases for the first three months of the year surpassing the total number of cases in 2021. The number of new daily cases has begun reaching levels not seen since the pandemic’s arrival in March 2020.Throughout the pandemic, the Chinese government has enforced mass testing and isolation to contain the spread."
outputs = learn.blurr_generate(text_to_generate, early_stopping=False, num_return_sequences=1)

for idx, o in enumerate(outputs):
    print(f'=== Prediction {idx+1} ===\n{o}\n')

=== Prediction 1 ===
 Tesla has idled its Shanghai Gigafactory for two days amid a rise in China's Omicron virus cases. The automaker sent a notice to employees and suppliers on Wednesday informing them of the closure. Toyota and Volkswagen also idled operations in China this week due to a local increase in COVID-19 cases and the additional restrictions that the government implemented to manage the surge. The factory is key to Tesla's global operations and exports a significant number of Model 3 and Model Y vehicles to Europe.



In [9]:
learn.metrics = None
learn.export(fname='sum_model_export.pkl')