<a href="https://colab.research.google.com/github/MohitPanchasara/BART-Text-Autoencoder/blob/main/BART_Large_CNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Implementation of the BART Model

```
Bidirectional and Auto-Regressive Transformers (BART) is a denoising autoencoder from Facebook AI that is used to pretrain sequence-to-sequence models. The checkpoint used in this notebook, facebook/bart-large-cnn, is fine-tuned on the CNN/DailyMail dataset.
```


## Structure of BART Model

The BART model is a denoising autoencoder used to pretrain sequence-to-sequence models; once trained, it can generate natural language text from a limited input. Let's build up the idea with an example.


=> Consider a neural network with 3 layers that has both encoder weights and decoder weights. The first layer is the input layer and the last is the output layer. We want the output to be the same size as the input, so the number of input neurons equals the number of output neurons. The output differs from the input only in that it is a reconstruction, ideally one that loses no information.

=> "Sequence-to-sequence" describes how data flows through the model: a sequence goes in and a sequence comes out. The encoder takes the input and runs a forward pass, and its output becomes the input to the decoder. The decoder's output is compared with the target using cross-entropy loss, and back-propagation carries the resulting gradients back through the decoder and the encoder, updating the weights at each neuron.
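
The 3-layer picture above can be sketched in a few lines of plain Python. This is only a toy illustration of "input size equals output size": the layer sizes, the random weights, and the helper names `make_layer` and `forward` are assumptions made for this sketch, and the real BART model is a much larger Transformer rather than a small dense network.

```python
import random

def make_layer(n_in, n_out, rng):
    """Return a weight matrix with n_out rows and n_in columns of small random values."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]

def forward(weights, vector):
    """Apply one dense layer (no bias, no activation): a matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

rng = random.Random(0)        # fixed seed so the run is repeatable
n_input, n_hidden = 6, 3      # hypothetical sizes chosen for illustration

encoder = make_layer(n_input, n_hidden, rng)   # input  -> hidden (compress)
decoder = make_layer(n_hidden, n_input, rng)   # hidden -> output (reconstruct)

x = [1.0, 0.0, 0.5, 0.25, 0.0, 1.0]
code = forward(encoder, x)                # encoder forward pass
reconstruction = forward(decoder, code)   # decoder forward pass

print(len(x), len(code), len(reconstruction))  # 6 3 6
```

With untrained weights the reconstruction is meaningless; training would adjust `encoder` and `decoder` until the output resembles the input.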

## Functioning of BART Model

The model works in the following steps:
1. Corrupt the original text with an arbitrary noising function.
2. Send the encoded noisy text to the decoder, which tries to decode the original text.
3. Compare the decoder's output with the original text.
4. Calculate the loss and send it back for backward propagation.
5. Repeat so that the model learns to reconstruct the original text.
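
The steps above can be sketched without any deep learning library. The snippet below is a minimal illustration, not BART's training code: it uses token masking as the noising function (one of several possible choices), and the two "decoders" are hypothetical stand-ins that simply assign a fixed probability to the correct token, so that the cross-entropy comparison in steps 3 and 4 is visible.

```python
import math
import random

MASK = "<mask>"

def corrupt(tokens, mask_prob, rng):
    """Step 1: replace each token with <mask> with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]

def cross_entropy(predicted_probs, target_tokens):
    """Steps 3-4: average negative log-probability given to the original tokens."""
    losses = [-math.log(probs[target])
              for probs, target in zip(predicted_probs, target_tokens)]
    return sum(losses) / len(losses)

rng = random.Random(0)
original = "the league was founded in 2007".split()
noisy = corrupt(original, mask_prob=0.3, rng=rng)
print(noisy)

# Stand-ins for step 2 (hypothetical): per-position token probabilities
# from a decoder that is confident and from one that is unsure.
good_decoder = [{tok: 0.9, "<other>": 0.1} for tok in original]
poor_decoder = [{tok: 0.5, "<other>": 0.5} for tok in original]

print(round(cross_entropy(good_decoder, original), 4))  # 0.1054
print(round(cross_entropy(poor_decoder, original), 4))  # 0.6931
```

The lower loss for the confident decoder is the signal that step 4 propagates backward; step 5 is the repetition of this loop over many corrupted texts.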

In [2]:
# Installing the transformers library
!pip install transformers

In [3]:
# pipeline API, used to load the model from the Hugging Face Hub
from transformers import pipeline

In [4]:
# downloading the BART model
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

In [24]:
# Original text; the model generates similar text/sentences of the desired length from it.
Orignal_text = """ Hello, my name is Mohit. Im studying at IIIT Vadodara.
"""

In [29]:
# printing length and type of input
print(len(Orignal_text))
print(type(Orignal_text))

56
<class 'str'>


In [60]:
# The summarizer takes Orignal_text, the maximum and minimum length of the generated text
# (max_length and min_length, counted in tokens), and do_sample=False to disable sampling.
# The result is stored in Final_text.
Final_text = summarizer(Orignal_text, max_length = 150, min_length = 30, do_sample = False)

Your max_length is set to 150, but you input_length is only 22. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=11)
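
The warning appears because `max_length` and `min_length` are counted in tokens, while `len(Orignal_text)` counts characters: the same input is 56 characters long but, according to the warning, only 22 tokens. The sketch below shows the difference and a hypothetical helper, `suggest_lengths`, that scales both limits to the input size. The helper, its `ratio` and `floor` parameters, and the choice of half the maximum as the minimum are assumptions for illustration; only the "half the input length" hint comes from the warning itself.

```python
def suggest_lengths(input_tokens, ratio=0.5, floor=5):
    """Return (min_length, max_length) in tokens, scaled to the input size.

    Hypothetical helper: ratio=0.5 mirrors the hint in the warning
    (22 input tokens -> max_length=11); floor is an arbitrary lower bound.
    """
    max_length = max(floor, int(input_tokens * ratio))
    min_length = max(1, max_length // 2)
    return min_length, max_length

text = " Hello, my name is Mohit. Im studying at IIIT Vadodara.\n"
print(len(text))            # 56 characters
print(len(text.split()))    # 10 whitespace-separated words
print(suggest_lengths(22))  # (5, 11)
print(suggest_lengths(42))  # (10, 21)
```

The returned values could then be passed as `min_length` and `max_length` to the summarizer. Note that asking for `min_length = 30` on a 22-token input requires the model to produce more tokens than it was given, which is a likely reason for output that goes beyond the input text.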


Output

In [61]:
# The pipeline returns a list with one dict that holds the text under 'summary_text'
print(Final_text)

[{'summary_text': 'Mohit is a student at IIIT Vadodara. Mohit is from Mumbai. He is studying to be a software engineer. He hopes to one day work in the IT industry.'}]


In [62]:
# Final text
print(f"Original Text: {Orignal_text}Length: {len(Orignal_text)}")

# Read the generated text directly from the 'summary_text' key
summary = Final_text[0]["summary_text"]
print(f"\nOutput Text: {summary}")
print(f"Length: {len(summary)}")

Original Text:  Hello, my name is Mohit. Im studying at IIIT Vadodara.
Length: 56

Output Text: Mohit is a student at IIIT Vadodara. Mohit is from Mumbai. He is studying to be a software engineer. He hopes to one day work in the IT industry.
Length: 145


### Another Example

In [63]:
orignal_txt = """The Indian Premier League is a professional men's Twenty20 cricket league, contested by ten teams based out of ten Indian cities. The league was founded by the Board of Control for Cricket in India in 2007"""

In [64]:
# Same call as before, but with do_sample=True the output is sampled and can differ between runs
Final_txt = summarizer(orignal_txt, max_length = 200, min_length = 50, do_sample = True)

Your max_length is set to 200, but you input_length is only 42. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=21)
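
The two examples differ in the `do_sample` flag. A toy sketch of what that flag changes when the next token is chosen: without sampling the most probable token is always taken, with sampling a token is drawn at random in proportion to its probability. The function `pick_next_token` and the probability table are hypothetical, and the real pipeline may additionally use beam search, which this sketch does not model.

```python
import random

def pick_next_token(probabilities, do_sample, rng):
    """Choose one token from a {token: probability} mapping."""
    if do_sample:
        # Sampling: draw a token at random, weighted by its probability
        tokens = list(probabilities)
        weights = [probabilities[t] for t in tokens]
        return rng.choices(tokens, weights=weights, k=1)[0]
    # No sampling: always take the most probable token
    return max(probabilities, key=probabilities.get)

# Hypothetical next-token distribution, made up for illustration
next_token_probs = {"league": 0.6, "tournament": 0.3, "competition": 0.1}

rng = random.Random(0)
print(pick_next_token(next_token_probs, do_sample=False, rng=rng))  # league
print([pick_next_token(next_token_probs, do_sample=True, rng=rng) for _ in range(5)])
```

This is why a run with `do_sample = False` is repeatable, while a run with `do_sample = True` can return a different text each time.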


In [66]:
print(f"Original Text: {orignal_txt}")
print(f"Length: {len(orignal_txt)}")

# Read the generated text directly from the 'summary_text' key
summary = Final_txt[0]["summary_text"]
print(f"\nOutput Text: {summary}")
print(f"Length: {len(summary)}")

Original Text: The Indian Premier League is a professional men's Twenty20 cricket league, contested by ten teams based out of ten Indian cities. The league was founded by the Board of Control for Cricket in India in 2007
Length: 205

Output Text: The Indian Premier League is a professional men's Twenty20 cricket league. It is contested by ten teams based out of ten Indian cities. The league was founded by the Board of Control for Cricket in India in 2007. It was first held in 2007 and has since expanded to ten cities.
Length: 276
