## Data Collection and Preprocessing:

### Description:

We downloaded articles from Wikipedia with the `wikipedia` Python package and prepared them with four pre-processing approaches, so that we could compare how each one affects model accuracy.<br>
Pre-processing cleans and transforms the raw text before analysis; we rely on NLTK, spaCy and scikit-learn for this step.<br>
Together these tools make large text datasets easier and more efficient to work with.

### Data collection:

For data collection, we compile a list of topics with a complex scientific background and use the `wikipedia` package to download and store<br>
the titles, content and summaries in a JSON file. Here is a code snippet showing how.

In [None]:
import json
import os

import wikipedia
from wikipedia.exceptions import DisambiguationError, PageError

# Set the language to English
wikipedia.set_lang("en")

OUTPUT_FILE = '10krun.json'
MAX_ARTICLES = 1000

keywords = ['Hilbert\'s fifth problem', 'P vs NP', 'Navier–Stokes existence and smoothness', 'Birch and Swinnerton-Dyer conjecture', 'Twin prime conjecture']  # ... (list continues)

# Resume from an existing file if there is one, otherwise start empty
if os.path.exists(OUTPUT_FILE):
    with open(OUTPUT_FILE, 'r') as f:
        data = json.load(f)
else:
    data = []
# Titles already stored, used to skip duplicates
titles = {article["topic"] for article in data}

for keyword in keywords:
    if len(data) > MAX_ARTICLES:
        break
    try:
        pages = wikipedia.search(keyword, results=600)
    except (PageError, KeyError):
        # If no pages are found for the keyword, skip it
        continue
    for page in pages:
        if page in titles:
            continue
        try:
            summary = wikipedia.summary(page)
            content = wikipedia.page(page).content
        except (DisambiguationError, PageError, KeyError):
            # Skip disambiguation pages and pages that cannot be loaded
            continue
        titles.add(page)
        # Add the article data to the list
        data.append({
            "topic": page,
            "summary": summary,
            "content": content
        })
        # Save after every article so an interrupted run loses nothing
        with open(OUTPUT_FILE, 'w') as f:
            json.dump(data, f)
        if len(data) > MAX_ARTICLES:
            break
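
Once collection finishes, it can help to sanity-check the stored file. The sketch below assumes the list-of-dictionaries layout written by the loop above (`topic`, `summary`, `content`). The helper name `validate_articles` and the sample records are hypothetical and only serve as an illustration; in practice the list would be loaded from `10krun.json` with `json.load`.

```python
REQUIRED_KEYS = {"topic", "summary", "content"}

def validate_articles(articles):
    """Split records into usable ones and a list of (index, problem) pairs."""
    valid, problems, seen = [], [], set()
    for index, article in enumerate(articles):
        missing = REQUIRED_KEYS - set(article)
        if missing:
            problems.append((index, "missing keys: " + ", ".join(sorted(missing))))
        elif not article["content"].strip():
            problems.append((index, "empty content"))
        elif article["topic"] in seen:
            problems.append((index, "duplicate topic"))
        else:
            seen.add(article["topic"])
            valid.append(article)
    return valid, problems

# Hypothetical records mimicking the structure of the collected file
sample = [
    {"topic": "P vs NP", "summary": "A summary.", "content": "Some content."},
    {"topic": "P vs NP", "summary": "A summary.", "content": "Some content."},
    {"topic": "Twin prime conjecture", "content": "Some content."},
    {"topic": "Cosmology", "summary": "A summary.", "content": "   "},
]
valid, problems = validate_articles(sample)
print(len(valid), "valid records")
for index, problem in problems:
    print("record", index, "->", problem)
```

Running such a check before pre-processing catches truncated or duplicated records early, which is cheaper than discovering them during fine-tuning.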

### Pre-processing:

We use four distinct preprocessing approaches: Raw, Traditional, Custom, and Combined. Comparing them lets us<br> see how each kind of cleaning affects the models and pick the one that gives the best scores.


#### Raw approach:

We utilize the content in its original form to establish a baseline that can aid in further ablation research.

Here's what the article content actually looks like.

In [2]:
content = "\n\n\n== Disciplines ==\nPhysics and Astrophysics have played central roles in shaping our understanding of the universe through scientific observation and experiment. Physical cosmology was shaped through both mathematics and observation in an analysis of the whole universe. The universe is generally understood to have begun with the Big Bang, followed almost instantaneously by cosmic inflation, an expansion of space from which the universe is thought to have emerged 13.799 \u00b1 0.021 billion years ago. Cosmogony studies the origin of the universe, and cosmography maps the features of the universe.\nIn Diderot's Encyclop\u00e9die, cosmology is broken down into uranology (the science of the heavens), aerology (the science of the air), geology (the science of the continents), and hydrology (the science of waters).Metaphysical cosmology has also been described as the placing of humans in the universe in relationship to all other entities. This is exemplified by Marcus Aurelius's observation that a man's place in that relationship: \"He who does not know what the world is does not know where he is, and he who does not know for what purpose the world exists, does not know who he is, nor what the world is.\"\n\n\n== Discoveries ==\n\n\n=== Physical cosmology ===\n\nPhysical cosmology is the branch of physics and astrophysics that deals with the study of the physical origins and evolution of the universe. It also includes the study of the nature of the universe on a large scale. In its earliest form, it was what is now known as \"celestial mechanics\", the study of the heavens. Greek philosophers Aristarchus of Samos, Aristotle, and Ptolemy proposed different cosmological theories. The geocentric Ptolemaic system was the prevailing theory until the 16th century when Nicolaus Copernicus, and subsequently Johannes Kepler and Galileo Galilei, proposed a heliocentric system. 
This is one of the most famous examples of epistemological rupture in physical cosmology.\nIsaac Newton's Principia Mathematica, published in 1687, was the first description of the law of universal gravitation. It provided a physical mechanism for Kepler's laws and also allowed the anomalies in previous systems, caused by gravitational interaction between the planets, to be resolved. A fundamental difference between Newton's cosmology and those preceding it was the Copernican principle\u2014that the bodies on Earth obey the same physical laws as all celestial bodies. This was a crucial philosophical advance in physical cosmology.\nModern scientific cosmology is usually considered to have begun in 1917 with Albert Einstein's publication of his final modification of general relativity in the paper \"Cosmological Considerations of the General Theory of Relativity\" (although this paper was not widely available outside of Germany until the end of World War I). General relativity prompted cosmogonists such as Willem de Sitter, Karl Schwarzschild, and Arthur Eddington to explore its astronomical ramifications, which enhanced the ability of astronomers to study very distant objects. Physicists began changing the assumption that the universe was static and unchanging. In 1922, Alexander Friedmann introduced the idea of an expanding universe that contained moving matter.\n\nIn parallel to this dynamic approach to cosmology, one long-standing debate about the structure of the cosmos was coming to a climax - the Great Debate (1917 to 1922) - with early cosmologists such as Heber Curtis and Ernst \u00d6pik determining that some nebulae seen in telescopes were separate galaxies far distant from our own. While Heber Curtis argued for the idea that spiral nebulae were star systems in their own right as island universes, Mount Wilson astronomer Harlow Shapley championed the model of a cosmos made up of the Milky Way star system only. 
This difference of ideas came to a climax with the organization of the Great Debate on 26 April 1920 at the meeting of the U.S. National Academy of Sciences in Washington, D.C. The debate was resolved when Edwin Hubble detected Cepheid Variables in the Andromeda Galaxy in 1923 and 1924. Their distance established spiral nebulae well beyond the edge of the Milky Way.\nSubsequent modelling of the universe explored the possibility that the cosmological constant, introduced by Einstein in his 1917 paper, may result in an expanding universe, depending on its value. Thus the Big Bang model was proposed by the Belgian priest Georges Lema\u00eetre in 1927 which was subsequently corroborated by Edwin Hubble's discovery of the redshift in 1929 and later by the discovery of the cosmic microwave background radiation by Arno Penzias and Robert Woodrow Wilson in 1964. These findings were a first step to rule out some of many alternative cosmologies.\nSince around 1990, several dramatic advances in observational cosmology have transformed cosmology from a largely speculative science into a predictive science with precise agreement between theory and observation. These advances include observations of the microwave background from the COBE, WMAP and Planck satellites, large new galaxy redshift surveys including 2dfGRS and SDSS, and observations of distant supernovae and gravitational lensing. These observations matched the predictions of the cosmic inflation theory, a modified Big Bang theory, and the specific version known as the Lambda-CDM model. This has led many to refer to modern times as the \"golden age of cosmology\".On 17 March 2014, astronomers at the Center for Astrophysics | Harvard & Smithsonian announced the detection of gravitational waves, providing strong evidence for inflation and the Big Bang. 
However, on 19 June 2014, lowered confidence in confirming the cosmic inflation findings was reported.On 1 December 2014, at the Planck 2014 meeting in Ferrara, Italy, astronomers reported that the universe is 13.8 billion years old and composed of 4.9% atomic matter, 26.6% dark matter and 68.5% dark energy.\n\n\n=== Religious or mythological cosmology ===\n\nReligious or mythological cosmology is a body of beliefs based on mythological, religious, and esoteric literature and traditions of creation and eschatology.\n\n\n=== Philosophical cosmology ===\n\nCosmology deals with the world as the totality of space, time and all phenomena. Historically, it has had quite a broad scope, and in many cases was found in religion. In modern use metaphysical cosmology addresses questions about the Universe which are beyond the scope of science. It is distinguished from religious cosmology in that it approaches these questions using philosophical methods like dialectics. Modern metaphysical cosmology tries to address questions such as:\nWhat is the origin of the universe? What is its first cause? Is its existence necessary? (see monism, pantheism, emanationism and creationism)\nWhat are the ultimate material components of the universe? (see mechanism, dynamism, hylomorphism, atomism)\nWhat is the ultimate reason for the existence of the universe? Does the cosmos have a purpose? (see teleology)\nDoes the existence of consciousness have a purpose? How do we know what we know about the totality of the cosmos? Does cosmological reasoning reveal metaphysical truths? (see epistemology)\n\n\n== Historical cosmologies ==\n\nTable notes: the term \"static\" simply means not expanding and not contracting. 
Symbol G represents Newton's gravitational constant; \u039b (Lambda) is the cosmological constant.\n\n\n== See also ==\n\n\n== References ==\n\n\n== External links ==\n\nNASA/IPAC Extragalactic Database (NED) (NED-Distances)\nCosmic Journey: A History of Scientific Cosmology Archived 21 October 2008 at the Wayback Machine from the American Institute of Physics\nIntroduction to Cosmology David Lyth's lectures from the ICTP Summer School in High Energy Physics and Cosmology\nThe Sophia Centre The Sophia Centre for the Study of Cosmology in Culture, University of Wales Trinity Saint David\nGenesis cosmic chemistry module\n\"The Universe's Shape\", BBC Radio 4 discussion with Sir Martin Rees, Julian Barbour and Janna Levin (In Our Time, 7 February 2002)"
print(content)




== Disciplines ==
Physics and Astrophysics have played central roles in shaping our understanding of the universe through scientific observation and experiment. Physical cosmology was shaped through both mathematics and observation in an analysis of the whole universe. The universe is generally understood to have begun with the Big Bang, followed almost instantaneously by cosmic inflation, an expansion of space from which the universe is thought to have emerged 13.799 ± 0.021 billion years ago. Cosmogony studies the origin of the universe, and cosmography maps the features of the universe.
In Diderot's Encyclopédie, cosmology is broken down into uranology (the science of the heavens), aerology (the science of the air), geology (the science of the continents), and hydrology (the science of waters).Metaphysical cosmology has also been described as the placing of humans in the universe in relationship to all other entities. This is exemplified by Marcus Aurelius's observation that a ma

#### Traditional Approach

We apply a range of traditional preprocessing techniques: URL removal, punctuation filtering, lowercasing, tokenization, and stopword removal.<br>
Tokenization is done with the spaCy library (`en_core_web_sm`), and the stopword list comes from NLTK.

We then use Term Frequency-Inverse Document Frequency (TF-IDF) to reduce each document to at most 1500 tokens,<br>
the limit we use for model input. The document is segmented into sentences, and each sentence is ranked by the <br>
sum of its TF-IDF weights. Only the highest-ranking sentences that fit under the 1500-token limit are retained, <br>
so that the most informative content is kept for training.

In [3]:
import spacy

nlp = spacy.load('en_core_web_sm')

import string
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
nltk.download('stopwords')

def Traditional_approach(text):
    '''
    Input:
        text: a string containing the raw article content.
    Output:
        a cleaned string: URLs, punctuation and stopwords removed, lowercased.

    '''
    # Remove URLs
    text = re.sub(r'https?://\S+', '', text)
    # Strip ASCII punctuation and lowercase
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.lower()
    stop_words = set(stopwords.words('english'))
    # Tokenize with spaCy, then drop stopwords and whitespace tokens
    tokens = [token.text for token in nlp(text)]
    filtered_text = [w for w in tokens
                     if w not in stop_words and '\n' not in w and ' ' not in w]
    return ' '.join(filtered_text)





import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk

def tfidf_content(content, max_tokens=1500):
    # Documents that already fit the budget are returned unchanged
    token = len(nltk.word_tokenize(content))
    if token <= max_tokens:
        return content
    # Split the document into sentences and score each one by the sum of its TF-IDF weights
    sentences = sentence_segmentation(content)
    tfidf = TfidfVectorizer().fit_transform(sentences).toarray()
    ranked = np.argsort(tfidf.sum(axis=1))[::-1]
    para = content
    n_keep = len(sentences) - 1
    # Drop the lowest-ranked sentence one at a time until the text fits
    while token > max_tokens:
        # Concatenate the selected sentences into a single input sequence
        para = ' '.join(sentences[i] for i in ranked[:n_keep])
        token = len(nltk.word_tokenize(para))
        n_keep -= 1
    return para

import nltk
nltk.download('punkt')
def sentence_segmentation(content):
    sentences = nltk.sent_tokenize(content)
    return sentences
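
To make the ranking step easier to follow, here is a dependency-free sketch of the same idea: score every sentence by the sum of its TF-IDF weights, then keep the best sentences that fit a token budget. It is a simplified variant, not the code used above. The helper names are hypothetical, tokens are counted with a plain regular expression instead of NLTK, the weights are not L2-normalised as in scikit-learn, sentences are picked greedily, and the kept sentences are restored to document order.

```python
import math
import re
from collections import Counter

def tokenize(sentence):
    # Simplistic tokenizer (assumption): lowercase alphanumeric runs
    return re.findall(r"[a-z0-9]+", sentence.lower())

def tfidf_sentence_scores(sentences):
    """Sum of TF-IDF weights per sentence, treating each sentence as a document."""
    tokenized = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Document frequency: in how many sentences each word appears
    df = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
        scores.append(sum(count * (math.log((1 + n) / (1 + df[word])) + 1)
                          for word, count in tf.items()))
    return scores

def select_sentences(sentences, max_tokens):
    """Greedily keep the highest-scoring sentences that fit, in document order."""
    scores = tfidf_sentence_scores(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    kept, used = [], 0
    for i in ranked:
        length = len(tokenize(sentences[i]))
        if used + length <= max_tokens:
            kept.append(i)
            used += length
    return [sentences[i] for i in sorted(kept)]

sentences = [
    "The universe began with the Big Bang.",
    "It is big.",
    "Cosmic inflation followed the Big Bang almost instantaneously.",
]
scores = tfidf_sentence_scores(sentences)
print(select_sentences(sentences, max_tokens=15))
```

Restoring document order is a design choice that keeps the reduced text readable; joining sentences in rank order, as `tfidf_content` does, keeps the same sentences but shuffles them.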






#### Here is what traditionally processed data looks like

disciplines physics astrophysics played central roles shaping understanding universe scientific observation experiment physical <br>
 cosmology shaped mathematics observation analysis whole universe universe generally understood begun big bang followed almost <br>
 instantaneously cosmic inflation expansion space universe thought emerged 13799 ± 0021 billion years ago cosmogony studies <br>
 origin universe cosmography maps features universe diderots encyclopédie cosmology broken uranology science heavens aerology <br>
 science air geology science continents hydrology science watersmetaphysical cosmology also described placing humans universe <br>
 relationship entities exemplified marcus aureliuss observation mans place relationship know world know know purpose world exists <br>
 know world discoveries physical cosmology physical cosmology branch physics astrophysics deals study physical origins evolution <br>


#### Custom Approach:

In this approach we remove non-informational sections of the content, such as references, notes or, in some cases, symbol lists, before running TF-IDF on it.

In [7]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
import re
def section_creator(article):
    # `article` is the plain-text content of a Wikipedia page
    section_content = ''

    # Split article into sections by headers
    sections = {}
    lines = article.split('\n')
    current_section = None
    for line in lines:
        if line.startswith('='):
            if current_section is not None:
                sections[current_section] = section_content.strip()
            current_section = line.strip('= ')
            section_content = ''
        else:
            section_content += line + '\n'
    if current_section is not None:
        sections[current_section] = section_content.strip()
        for head in sections.keys():
            sections[head] = re.sub(r'\s{1,}', ' ', sections[head]).replace('\n', '')

    return sections

exclude = ["See also",
"References",
"External links",
"Notes",
"Sources",
"Further reading",
"Bibliography",
"Production",
"Abstracting and indexing",
"Examples",
"Citations",
"Nomenclature",
"Evolution",
"Uses"]
def exclusion(content):
    new_content = {}
    for title, value in content.items():
        if title not in exclude:
            new_content[title] = value
    return new_content


def custom_approach(content):
    con_dict = section_creator(content)
    con_dict = exclusion(con_dict)
    total = ""
    for key in con_dict.keys():
        total += con_dict[key]
    return total
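
The effect of the section filter is easiest to see on a tiny input. The following sketch is a compact, self-contained variant of the idea above, not a drop-in replacement: `split_sections` and `drop_excluded` are hypothetical names, the header regular expression is stricter than the `startswith('=')` test used above, and only a subset of the `exclude` list is reproduced.

```python
import re

# A header line looks like "== Title ==" or "=== Title ==="
HEADER = re.compile(r"^(=+)\s*(.*?)\s*=+\s*$")
EXCLUDED = {"See also", "References", "External links"}  # subset, for illustration

def split_sections(article):
    """Map each section title to its text, with whitespace collapsed."""
    sections, title, buffer = {}, None, []
    for line in article.split("\n"):
        match = HEADER.match(line)
        if match:
            if title is not None:
                sections[title] = " ".join(" ".join(buffer).split())
            title, buffer = match.group(2), []
        else:
            buffer.append(line)
    if title is not None:
        sections[title] = " ".join(" ".join(buffer).split())
    return sections

def drop_excluded(sections):
    """Remove sections whose titles carry no article content."""
    return {title: text for title, text in sections.items() if title not in EXCLUDED}

article = (
    "\n\n== Disciplines ==\nPhysics and astrophysics.\n\n"
    "== See also ==\n\n== References ==\nSome citation."
)
kept = drop_excluded(split_sections(article))
print(kept)
```

As in `section_creator`, any text before the first header is discarded, so this sketch assumes the content starts with a section heading.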


#### Here is what the custom approach looks like

'Physics and Astrophysics have played central roles in shaping our understanding of the universe through scientific observation <br> 
and experiment. Physical cosmology was shaped through both mathematics and observation in an analysis of the whole universe. The <br>
universe is generally understood to have begun with the Big Bang, followed almost instantaneously by cosmic inflation, an expansion <br>
of space from which the universe is thought to have emerged 13.799 ± 0.021 billion years ago. Cosmogony studies the origin of the <br>
universe, and cosmography maps the features of the universe. In Diderot\'s Encyclopédie, cosmology is broken down into uranology <br>
(the science of the heavens), aerology (the science of the air), geology (the science of the continents), and hydrology (the science<br>
 of waters).Metaphysical cosmology has also been described as the placing of humans in the universe in relationship to all other entities.<br> 

#### Combined Approach

This is part of the ablation study: we apply both the custom and the traditional pre-processing to the content<br> for a better comparison and to see whether the combination yields higher accuracy.

In [9]:
def combined_approach(content):
    return Traditional_approach(custom_approach(content))

#### Here is what the combined approach looks like 

'physics astrophysics played central roles shaping understanding universe scientific observation experiment physical cosmology<br> 
shaped mathematics observation analysis whole universe universe generally understood begun big bang followed almost instantaneously<br>
 cosmic inflation expansion space universe thought emerged 13799 ± 0021 billion years ago cosmogony studies origin universe cosmography<br>
  maps features universe diderots encyclopédie cosmology broken uranology science heavens aerology science air geology science <br>
  continents hydrology science watersmetaphysical cosmology also described placing humans universe relationship entities exemplified<br>
   marcus aureliuss observation mans place relationship know world know know purpose world exists know world isphysical cosmology branch<br>
    physics astrophysics deals study physical origins evolution universe also includes study nature universe large scale earliest form<br>
     known celestial mechanics study heavens greek philosophers aristarchus samos aristotle ptolemy proposed different cosmological <br>

### Fine-tuning

#### Description:

To optimize our text summarization results, we started from pretrained models by Google (Pegasus) and Facebook (BART) and <br>
fine-tuned them on our preprocessed data using transfer learning. Fine-tuning on each preprocessing variant lets us compare<br>
 which one produces summaries closest to the reference summaries. This improved ROUGE-1 over the untuned baselines <br>in the example evaluated below.

In [None]:
from datasets import load_dataset

data_files = {"train": "3500_train_trad.json"}
dataset = load_dataset("PrathameshPawar/10ktesttrain", data_files=data_files)

In [None]:
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

In [None]:
from transformers import pipeline 
from transformers import AutoTokenizer

prefix = "summarize: "

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["content"]]
    model_inputs = tokenizer(inputs, max_length=8192, truncation=True)

    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs



In [None]:
dataset_train = dataset['train'].train_test_split(test_size=0.1)
dataset_train = dataset_train.map(preprocess_function, batched=True)

In [None]:
import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
from transformers import DataCollatorForSeq2Seq, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
if torch.cuda.is_available():
    device = torch.device('cuda:0')  
    torch.cuda.set_device(device)  
else:
    device = torch.device('cpu')

model_name = "google/pegasus-xsum"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name, max_position_embeddings=8192).to(device)
# Build the collator only after the tokenizer and model exist
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

training_args = Seq2SeqTrainingArguments(
    output_dir="pegasus_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=8,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset_train["train"],
    eval_dataset=dataset_train["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()


![](bart_custom.JPG)

### Model Evaluation:

####  We shall use ROUGE score as an evaluation metric for the summarization task.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a widely used evaluation metric for text summarization. It measures the overlap between the generated summary and the reference summary: ROUGE-1 and ROUGE-2 count matching unigrams and bigrams, and ROUGE-L is based on the longest common subsequence. Each variant reports precision, recall and an F-score; higher scores indicate closer agreement with the reference.
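
As a worked illustration of the n-gram overlap behind ROUGE-1 and ROUGE-2, the sketch below computes precision, recall and F1 by hand. The function `rouge_n` and the two example sentences are hypothetical; it uses whitespace tokenization and clipped n-gram counts, so library implementations, including the `rouge` package used in this notebook, may return slightly different numbers.

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Precision, recall and F1 of n-gram overlap between two texts."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Clipped overlap: an n-gram counts at most as often as it occurs in both texts
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f1}

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"

rouge_1 = rouge_n(candidate, reference, n=1)
rouge_2 = rouge_n(candidate, reference, n=2)
print("ROUGE-1 F1: {:.2f}".format(rouge_1["f"] * 100))
print("ROUGE-2 F1: {:.2f}".format(rouge_2["f"] * 100))
```

Here five of the six unigrams match, but only three of the five bigrams do, which is why ROUGE-2 is typically much lower than ROUGE-1 for the same pair of summaries.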

In [24]:

import re

def split_into_chunks(text, max_length):
    """
    Splits a string into chunks of text with complete sentences, where each chunk
    has a maximum length of `max_length` characters.
    """
    sentences = re.findall(r'[^\n.!?]+[.!?]', text)  # Split into sentences
    chunks = []
    current_chunk = ''
    
    for sentence in sentences:
        if len(current_chunk) + len(sentence) <= max_length:
            # If adding the sentence doesn't exceed max_length, add to current chunk
            current_chunk += sentence
        else:
            # Otherwise close the current chunk (skipping empty ones) and start a new one
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
            current_chunk = sentence
    
    # Add the last chunk if it's not empty
    if current_chunk:
        chunks.append(current_chunk.strip())
    
    return chunks



def get_chunks(input_text):
    max_length = 1025
    chunks = split_into_chunks(input_text, max_length)
    
    summary_temps=[]
    
    for i in chunks:
        summary_temps.append(summarizer(i,max_length=32))
        
    summary_temps_ = [i[0]['summary_text'] for i in summary_temps]
        
    return '. '.join(summary_temps_)

from rouge import Rouge
# Initialize ROUGE
rouge = Rouge()

def rouge_score_generation(generated_summary, reference_summary):
    """Print the ROUGE-1, ROUGE-2 and ROUGE-L F-scores (as percentages) for one summary pair."""
    # Compute ROUGE scores
    scores = rouge.get_scores(generated_summary, reference_summary)

    # Extract relevant ROUGE scores
    rouge_1 = scores[0]['rouge-1']['f']
    rouge_2 = scores[0]['rouge-2']['f']
    rouge_l = scores[0]['rouge-l']['f']

    # Print ROUGE scores
    print("ROUGE-1: {:.2f}".format(rouge_1 * 100))
    print("ROUGE-2: {:.2f}".format(rouge_2 * 100))
    print("ROUGE-L: {:.2f}".format(rouge_l * 100))
    
    return True
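
The chunk-then-summarize flow used by `get_chunks` can be tried without downloading a model. The sketch below is self-contained and uses a stand-in summarizer that simply keeps the first few words of each chunk. `chunk_sentences`, `first_words_summarizer` and `summarize_long_text` are hypothetical names, and the stand-in is only there to show the data flow, not to produce meaningful summaries.

```python
import re

def chunk_sentences(text, max_chars):
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own (over-long) chunk.
    """
    sentences = re.findall(r"[^\n.!?]+[.!?]", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current.strip())
            current = sentence
        else:
            current += sentence
    if current.strip():
        chunks.append(current.strip())
    return chunks

def first_words_summarizer(chunk, max_words=3):
    # Hypothetical stand-in for a real summarization model
    return " ".join(chunk.split()[:max_words])

def summarize_long_text(text, max_chars, summarize):
    """Summarize each chunk separately and join the partial summaries."""
    return ". ".join(summarize(chunk) for chunk in chunk_sentences(text, max_chars))

text = ("Stars form in clouds. Gravity pulls gas together. "
        "Fusion then begins! Planets may follow.")
chunks = chunk_sentences(text, max_chars=50)
summary = summarize_long_text(text, 50, first_words_summarizer)
print(chunks)
print(summary)
```

With a real model, `summarize` would wrap the Hugging Face pipeline call, and `max_chars` would be chosen so that each chunk stays within the model's input limit.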


In [9]:
from transformers import pipeline
from datasets import load_dataset
data_files = {"test": "1000_test.json"}

dataset = load_dataset("PrathameshPawar/summary_2k", data_files=data_files)



In [13]:
dataset

DatasetDict({
    test: Dataset({
        features: ['topic', 'summary', 'content', 'content_traditional', 'custom_approach', 'combined_approach'],
        num_rows: 1000
    })
})

### The BART-base model is used as the control to compare scores against

In [27]:
summarizer = pipeline("summarization", model="facebook/bart-base",)

bart_base_summary = get_chunks(dataset['test']['content'][0])

rouge_score_generation(dataset['test']['summary'][0],bart_base_summary)



ROUGE-1: 16.04
ROUGE-2: 1.97
ROUGE-L: 14.33


True

### We concluded that the BART-base model fine-tuned on custom-preprocessed data is the best-performing BART model

In [10]:
summarizer = pipeline("summarization", model="PrathameshPawar/bart_custom",)

bart_custom_summary = get_chunks(dataset['test']['custom_approach'][0])

bart_custom_summary

In [29]:
rouge_score_generation(dataset['test']['summary'][0],bart_custom_summary)

ROUGE-1: 21.21
ROUGE-2: 5.26
ROUGE-L: 21.21


True

### The Pegasus model is used as the control to compare scores against

In [28]:
summarizer = pipeline("summarization", model="google/pegasus-arxiv",)

pegasus_base_summary = get_chunks(dataset['test']['content'][0])

rouge_score_generation(dataset['test']['summary'][0],pegasus_base_summary)

ROUGE-1: 19.91
ROUGE-2: 3.47
ROUGE-L: 17.59


True

### We concluded that the Pegasus model fine-tuned on custom-preprocessed data is the best-performing Pegasus model

In [30]:
summarizer = pipeline("summarization", model="PrathameshPawar/pegasus_custom",)

pegasus_custom_summary = get_chunks(dataset['test']['custom_approach'][0])

In [32]:
rouge_score_generation(dataset['test']['summary'][0],pegasus_custom_summary)

ROUGE-1: 21.46
ROUGE-2: 3.87
ROUGE-L: 16.31


True

### Simplification ###
Using the Hugging Face Transformers library, we use Google's pretrained "t5-base" model to simplify the text.

In [1]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def simplify_text(text):
    inputs = tokenizer.encode(text, return_tensors="pt")
    outputs = model.generate(inputs, max_length=128, num_beams=4, early_stopping=True)
    simplified_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return simplified_text

text = """physics astrophysics played central roles shaping understanding universe scientific observation experiment physical cosmology 
shaped mathematics observation analysis whole universe universe generally understood begun big bang followed almost instantaneously
 cosmic inflation expansion space universe thought emerged 13799 ± 0021 billion years ago cosmogony studies origin universe cosmography<br>
  maps features universe diderots encyclopédie cosmology broken uranology science heavens aerology science air geology science 
  continents hydrology science watersmetaphysical cosmology also described placing humans universe relationship entities exemplified
   marcus aureliuss observation mans place relationship know world know know purpose world exists know world isphysical cosmology branch
    physics astrophysics deals study physical origins evolution universe also includes study nature universe large scale earliest form
     known celestial mechanics study heavens greek philosophers aristarchus samos aristotle ptolemy proposed different cosmological"""
simplified_text = simplify_text(text)
print("Original text: ", text)
print("Sample output: ", simplified_text)




Original text:  physics astrophysics played central roles shaping understanding universe scientific observation experiment physical cosmology 
shaped mathematics observation analysis whole universe universe generally understood begun big bang followed almost instantaneously
 cosmic inflation expansion space universe thought emerged 13799 ± 0021 billion years ago cosmogony studies origin universe cosmography<br>
  maps features universe diderots encyclopédie cosmology broken uranology science heavens aerology science air geology science 
  continents hydrology science watersmetaphysical cosmology also described placing humans universe relationship entities exemplified
   marcus aureliuss observation mans place relationship know world know know purpose world exists know world isphysical cosmology branch
    physics astrophysics deals study physical origins evolution universe also includes study nature universe large scale earliest form
     known celestial mechanics study heavens greek p

We compute ROUGE-1, ROUGE-2 (bigram overlap) and ROUGE-L between the original text and the simplified output.

In [4]:
from rouge import Rouge 

rouge = Rouge()

reference_text = text
predicted_text = simplified_text

scores = rouge.get_scores(reference_text, predicted_text, avg=True)
rouge1_score = scores["rouge-1"]["f"]
rouge2_score = scores["rouge-2"]["f"]
rougel_score = scores["rouge-l"]["f"]

print("ROUGE-1 score:", rouge1_score)
print("ROUGE-2 score:", rouge2_score)
print("ROUGE-l score:", rougel_score)

ROUGE-1 score: 0.575757571657484
ROUGE-2 score: 0.4878048741225461
ROUGE-l score: 0.575757571657484


In [5]:
import json

with open('1000_test.json') as f:
    data = json.load(f)
    
for article in data[:5]:
    summary = article['summary']
    simplified_summary = simplify_text(summary)
    print("Original summary: ", summary)
    print("Simplified summary: ", simplified_summary)
    scores = rouge.get_scores(summary, simplified_summary, avg=True)
    rouge2_score = scores["rouge-2"]["f"]

    # print ROUGE-2 score
    print("ROUGE-2 score:", rouge2_score)
    


Original summary:  Metagenomics is the study of genetic material recovered directly from environmental or clinical samples by a method called sequencing. The broad field may also be referred to as environmental genomics, ecogenomics, community genomics or microbiomics.
While traditional microbiology and microbial genome sequencing and genomics rely upon cultivated clonal cultures, early environmental gene sequencing cloned specific genes (often the 16S rRNA gene) to produce a profile of diversity in a natural sample. Such work revealed that the vast majority of microbial biodiversity had been missed by cultivation-based methods.Because of its ability to reveal the previously hidden diversity of microscopic life, metagenomics offers a powerful lens for viewing the microbial world that has the potential to revolutionize understanding of the entire living world. As the price of DNA sequencing continues to fall, metagenomics now allows microbial ecology to be investigated at a much greater