# Section: Transfer Learning for NLP





## <font color='#4073FF'>Project Solution: Text Summarization</font>

### <font color='#14AAF5'>This project aims to summarize text using transfer learning.</font>


### Project Brief:

Summarization is the task of condensing a text into a shorter version while preserving its key information and meaning. Because manual summarization is time-consuming and laborious, automating it has become increasingly popular and is an active area of academic research. In this project, you are required to scrape paragraphs from "https://en.wikipedia.org/wiki/Cassowary" and summarize them using transfer learning techniques.

### 1. Data collection

In [1]:
import urllib.request

import bs4 as bs

# Read the page from https://en.wikipedia.org/wiki/Cassowary
urlr = urllib.request.urlopen('https://en.wikipedia.org/wiki/Cassowary')
page = urlr.read()
soup = bs.BeautifulSoup(page, 'lxml')

# Concatenate the text of the first 20 <p> elements
text = ""
for p in soup.find_all('p')[:20]:
    text += p.text
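The request above goes out with the generic `Python-urllib` User-Agent, which some sites may throttle or reject. The following is a minimal sketch of one way to send a descriptive User-Agent instead. The helper names `build_request` and `fetch_html` and the agent string are illustrative assumptions, not part of the original notebook:

```python
import urllib.request

# Hypothetical agent string; replace it with one that identifies your own project.
USER_AGENT = "cassowary-summarizer-demo/0.1 (educational notebook)"

def build_request(url: str) -> urllib.request.Request:
    """Return a Request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

def fetch_html(url: str, timeout: float = 10.0) -> bytes:
    """Download the raw page bytes; raises urllib.error.URLError on failure."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()

# Usage (performs a network call, so it is left commented out here):
# page = fetch_html("https://en.wikipedia.org/wiki/Cassowary")
```

The bytes returned by `fetch_html` can be passed to `bs.BeautifulSoup` exactly as `page` is in the cell above.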


In [2]:
print(text)



Casuarius is a genus of birds in the order Casuariiformes, whose members are the cassowaries (Tok Pisin: muruk, Indonesian: kasuari). It is classified as a ratite (flightless bird without a keel on its sternum bone) and is native to the tropical forests of New Guinea (Papua New Guinea and Indonesia), Aru Islands (Indonesia), and northeastern Australia.[3]
Three species are extant: The most common, the southern cassowary, is the third-tallest and second-heaviest living bird, smaller only than the ostrich and emu. The other two species are represented by the northern cassowary and the dwarf cassowary. A fourth but extinct species is represented by the pygmy cassowary.
Cassowaries feed mainly on fruit, although all species are truly omnivorous and take a range of other plant foods, including shoots and grass seeds, in addition to fungi, invertebrates, and small vertebrates. Cassowaries are very wary of humans, but if provoked, they are capable of inflicting serious, even fatal, injuries

### 2. Data Cleaning

In [3]:
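import re

# Remove numeric citation markers such as [3]
x = re.sub(r"\[[0-9]*\]", "", text)
# Remove parenthetical asides (innermost parentheses only)
x = re.sub(r"\([^()]*\)", "", x)
# Remove non-breaking spaces
x = x.replace("\xa0", "")
# Remove the abbreviated genus prefix "C." (as in "C. casuarius")
x = x.replace("C.", "")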



In [4]:
x

'\n\nCasuarius is a genus of birds in the order Casuariiformes, whose members are the cassowaries . It is classified as a ratite  and is native to the tropical forests of New Guinea , Aru Islands , and northeastern Australia.\nThree species are extant: The most common, the southern cassowary, is the third-tallest and second-heaviest living bird, smaller only than the ostrich and emu. The other two species are represented by the northern cassowary and the dwarf cassowary. A fourth but extinct species is represented by the pygmy cassowary.\nCassowaries feed mainly on fruit, although all species are truly omnivorous and take a range of other plant foods, including shoots and grass seeds, in addition to fungi, invertebrates, and small vertebrates. Cassowaries are very wary of humans, but if provoked, they are capable of inflicting serious, even fatal, injuries to both dogs and people. The cassowary has often been labeled "the world\'s most dangerous bird".\nThe genus Casuarius was erected 
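The cleaned text above still contains leftovers such as `cassowaries .` and doubled spaces where parentheticals were removed. The sketch below is one possible way to wrap the cleaning steps in a reusable function and tidy that whitespace. The function name `clean_text`, the whitespace rules, and the choice to turn non-breaking spaces into regular spaces (rather than deleting them) are assumptions added for illustration, and the `"C."` removal is left out:

```python
import re

def clean_text(raw: str) -> str:
    """Strip Wikipedia-style markup residue from scraped paragraph text."""
    cleaned = re.sub(r"\[[0-9]*\]", "", raw)         # numeric citation markers such as [3]
    cleaned = re.sub(r"\([^()]*\)", "", cleaned)     # innermost parenthetical asides
    cleaned = cleaned.replace("\xa0", " ")           # non-breaking spaces -> regular spaces
    cleaned = re.sub(r"[ \t]+", " ", cleaned)        # collapse runs of spaces and tabs
    cleaned = re.sub(r" +([.,;:])", r"\1", cleaned)  # drop the space left before punctuation
    return cleaned.strip()

sample = "It is a ratite (flightless bird) native to Australia.[3]\nThree species are extant."
print(clean_text(sample))
```

Newlines are preserved, so paragraph boundaries from the scraped page survive the cleaning.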

In [5]:
with open('original_text.txt', 'w', encoding='utf-8') as f:
    f.write(x)

In [6]:
# Installing dependencies

!pip install sentencepiece
!pip install bert-extractive-summarizer



In [7]:
from summarizer import Summarizer, TransformerSummarizer

### 3. BERT Summarizer

In [8]:
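# Load the default extractive model (bert-large-uncased, per the log below)
bert_model = Summarizer()
# min_length=60 skips candidate sentences shorter than 60 characters
bert_summary = ''.join(bert_model(x, min_length=60))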

print("\n\n")
bert_summary

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).







'Casuarius is a genus of birds in the order Casuariiformes, whose members are the cassowaries . Three species are extant: The most common, the southern cassowary, is the third-tallest and second-heaviest living bird, smaller only than the ostrich and emu. The genus Casuarius was erected by French scientist Mathurin Jacques Brisson in his Ornithologie published in 1760. The evolutionary history of cassowaries, as of all ratites, is not well known. Studies show that ratites continued to evolve after this separation into their modern counterparts. Females are larger and more brightly coloured than the males. All three species have a keratinous, skin-covered casque on their heads that grows with age. The casque would help protect the skull from such collisions". Cassowaries eat fallen fruit, and consequently spend much time under trees where seeds the size of golfballs or larger fall from heights up to 30m ; the wedge-shaped casque may protect the head by deflecting falling fruit.[citation
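Every sentence in the summary above is copied verbatim from the source, because the summarizer is extractive. The `bert-extractive-summarizer` project describes its approach as embedding each sentence, clustering the embeddings, and keeping the sentences closest to the cluster centroids. The toy sketch below illustrates that idea only: it substitutes bag-of-words counts for BERT embeddings and a hand-written k-means for the library's clustering. All function names are hypothetical, and this is not the library's implementation:

```python
import math
import re
from collections import Counter

def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def embed(sentence: str, vocab: list[str]) -> list[float]:
    """Bag-of-words count vector; a stand-in for a BERT sentence embedding."""
    counts = Counter(re.findall(r"[a-z]+", sentence.lower()))
    return [float(counts[w]) for w in vocab]

def distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def toy_extractive_summary(text: str, num_sentences: int = 2, iterations: int = 10) -> str:
    sentences = split_sentences(text)
    if len(sentences) <= num_sentences:
        return " ".join(sentences)
    vocab = sorted({w for s in sentences for w in re.findall(r"[a-z]+", s.lower())})
    vectors = [embed(s, vocab) for s in sentences]

    # Deterministic initialisation: evenly spaced sentences act as starting centroids.
    step = len(sentences) / num_sentences
    centroids = [vectors[int(i * step)] for i in range(num_sentences)]

    for _ in range(iterations):
        # Assignment step: attach every sentence vector to its nearest centroid.
        clusters = [[] for _ in centroids]
        for vec in vectors:
            nearest = min(range(len(centroids)), key=lambda c: distance(vec, centroids[c]))
            clusters[nearest].append(vec)
        # Update step: move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)] if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]

    # Pick the sentence closest to each centroid, then restore document order.
    chosen = {
        min(range(len(vectors)), key=lambda i: distance(vectors[i], centroid))
        for centroid in centroids
    }
    return " ".join(sentences[i] for i in sorted(chosen))

demo = (
    "Cassowaries feed mainly on fruit. Cassowaries also eat fungi and fruit. "
    "The casque grows with age. The casque protects the head."
)
print(toy_extractive_summary(demo, num_sentences=2))
```

Two centroids can select the same sentence, so the result may contain fewer than `num_sentences` sentences.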

In [9]:
# Write the summary to bert_summary.txt

with open('bert_summary.txt', 'w', encoding='utf-8') as f:
    f.write(bert_summary)

### 4. GPT-2 Summarizer

In [10]:
GPT2_model = TransformerSummarizer(transformer_type="GPT2", transformer_model_key="gpt2-medium")
gpt2_summary = ''.join(GPT2_model(x, min_length=60))

print("\n\n")
gpt2_summary






'Casuarius is a genus of birds in the order Casuariiformes, whose members are the cassowaries . Cassowaries feed mainly on fruit, although all species are truly omnivorous and take a range of other plant foods, including shoots and grass seeds, in addition to fungi, invertebrates, and small vertebrates. As the publication date of Linnaeus\'s sixth edition was before the 1758 starting point of the International Commission on Zoological Nomenclature, Brisson, and not Linnaeus, is considered the authority for the genus. A fossil species was reported from Australia, but for reasons of biogeography, this assignment is not certain, and it might belong to the prehistoric Emuarius, which was a genus of cassowary-like primitive emus. Typically, all cassowaries are shy birds that are found in the deep forest. The southern cassowary of the far north Queensland rain forests is not well studied, and the northern and dwarf cassowaries even less so. Females are larger and more brightly coloured than 

In [11]:
# Write the summary to gpt2_summary.txt

with open('gpt2_summary.txt', 'w', encoding='utf-8') as f:
    f.write(gpt2_summary)

### 5. XLNet Summarizer

In [12]:
xlnet_model = TransformerSummarizer(transformer_type="XLNet", transformer_model_key="xlnet-base-cased")
xlnet_summary = ''.join(xlnet_model(x, min_length=60))

print("\n\n")
xlnet_summary

Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetModel: ['lm_loss.bias', 'lm_loss.weight']
- This IS expected if you are initializing XLNetModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLNetModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).







'Casuarius is a genus of birds in the order Casuariiformes, whose members are the cassowaries . The other two species are represented by the northern cassowary and the dwarf cassowary. The taxonomic name   papuanus also may be in need of revision to Casuarius  westermanni. A fossil species was reported from Australia, but for reasons of biogeography, this assignment is not certain, and it might belong to the prehistoric Emuarius, which was a genus of cassowary-like primitive emus. Typically, all cassowaries are shy birds that are found in the deep forest. They are adept at disappearing long before a human knows they were there. The casque\'s shape and size, up to 18cm , is species-dependent. Earlier research indicates the birds lower their heads when running "full tilt through the vegetation, brushing saplings aside and occasionally careening into small trees. Cassowaries eat fallen fruit, and consequently spend much time under trees where seeds the size of golfballs or larger fall fro

In [13]:
# Write the summary to xlnet_summary.txt

with open('xlnet_summary.txt', 'w', encoding='utf-8') as f:
    f.write(xlnet_summary)
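With three summaries saved, a quick numeric comparison can show how much each model condensed the text and how far their sentence choices agree. The sketch below is one simple way to do that. The helper names and the two metrics (word-count ratio and Jaccard overlap of sentences) are assumptions chosen for illustration, not an evaluation method from the original notebook:

```python
import re

def word_count(text: str) -> int:
    """Count whitespace-separated tokens."""
    return len(text.split())

def compression_ratio(original: str, summary: str) -> float:
    """Fraction of the original word count retained by the summary."""
    total = word_count(original)
    return word_count(summary) / total if total else 0.0

def sentence_set(text: str) -> set[str]:
    """Naive sentence split on ., ! or ? followed by whitespace."""
    return {s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()}

def sentence_overlap(summary_a: str, summary_b: str) -> float:
    """Jaccard overlap between the sentence sets of two extractive summaries."""
    a, b = sentence_set(summary_a), sentence_set(summary_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Usage with the notebook's variables (commented out because they only exist at runtime):
# for name, summary in [("BERT", bert_summary), ("GPT-2", gpt2_summary), ("XLNet", xlnet_summary)]:
#     print(name, round(compression_ratio(x, summary), 3))
# print(sentence_overlap(bert_summary, gpt2_summary))
```

A lower compression ratio means a shorter summary relative to the source, and an overlap of 1.0 means two models selected exactly the same sentences.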