# Hometask 12

1. Summarize the following text using the NLP libraries NLTK and spaCy.

`The Orbiter Discovery, OV-103, is considered eligible for listing in the National Register of Historic Places (NRHP) in the context of the U.S. Space Shuttle Program (1969-2011) under Criterion A in the areas of Space Exploration and Transportation and under Criterion C in the area of Engineering. Because it has achieved significance within the past fifty years, Criteria Consideration G applies. Under Criterion A, Discovery is significant as the oldest of the three extant orbiter vehicles constructed for the Space Shuttle Program (SSP), the longest running American space program to date; she was the third of five orbiters built by NASA. Unlike the Mercury, Gemini, and Apollo programs, the SSP’s emphasis was on cost effectiveness and reusability, and eventually the construction of a space station. Including her maiden voyage (launched August 30, 1984), Discovery flew to space thirty-nine times, more than any of the other four orbiters; she was also the first orbiter to fly twenty missions. She had the honor of being chosen as the Return to Flight vehicle after both the Challenger and Columbia accidents. Discovery was the first shuttle to fly with the redesigned SRBs, a result of the Challenger accident, and the first shuttle to fly with the Phase II and Block I SSME. Discovery also carried the Hubble Space Telescope to orbit and performed two of the five servicing missions to the observatory. She flew the first and last dedicated Department of Defense (DoD) missions, as well as the first unclassified defense-related mission. In addition, Discovery was vital to the construction of the International Space Station (ISS); she flew thirteen of the thirty-seven total missions flown to the station by a U.S. Space Shuttle. She was the first orbiter to dock to the ISS, and the first to perform an exchange of a resident crew. Under Criterion C, Discovery is significant as a feat of engineering. 
According to Wayne Hale, a flight director from Johnson Space Center, the Space Shuttle orbiter represents a “huge technological leap from expendable rockets and capsules to a reusable, winged, hypersonic, cargo-carrying spacecraft.” Although her base structure followed a conventional aircraft design, she used advanced materials that both minimized her weight for cargo-carrying purposes and featured low thermal expansion ratios, which provided a stable base for her Thermal Protection System (TPS) materials. The Space Shuttle orbiter also featured the first reusable TPS; all previous spaceflight vehicles had a single-use, ablative heat shield. Other notable engineering achievements of the orbiter included the first reusable orbital propulsion system, and the first two-fault-tolerant Integrated Avionics System. As Hale stated, the Space Shuttle remains “the largest, fastest, winged hypersonic aircraft in history,” having regularly flown at twenty-five times the speed of sound.`

2. Tip

First, import the necessary libraries. For spaCy, a single statement is enough:

`import spacy`

Note that NLTK may require additional data to be downloaded first, such as the stop-word list or the tokenizer models.

`import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize`

In [1]:
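#install and download special packages
#!pip install spacy nltk
#!python -m spacy download en_core_web_sm
#NLTK data used below (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'):
#import nltk; nltk.download('stopwords'); nltk.download('punkt')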

In [2]:
#import libraries
import spacy
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from string import punctuation

Before working with spaCy, the required language model must be downloaded (see the commented-out command in the first cell) and then loaded. For English, we can load the small "en_core_web_sm" model:

`nlp = spacy.load('en_core_web_sm')`

In [3]:
nlp = spacy.load('en_core_web_sm')

### Preparing the text
Before you start creating text summaries, you need to prepare the text. This includes removing unnecessary characters, tokenization (breaking the text into separate words or sentences), removing stop words (words that do not carry essential information), and, if necessary, other text processing such as stemming or lemmatization.

#### Text to process
`text = "This is an example sentence for tokenization and lemmatization."`

#### Tokenization
`doc = nlp(text)`

`tokens = [token.text for token in doc]`

`print(tokens)`

In [4]:
# Text to process (triple quotes are needed because the string spans two lines)
text = """The Orbiter Discovery, OV-103, is considered eligible for listing in the National Register of Historic Places (NRHP) in the context of the U.S. Space Shuttle Program (1969-2011) under Criterion A in the areas of Space Exploration and Transportation and under Criterion C in the area of Engineering. Because it has achieved significance within the past fifty years, Criteria Consideration G applies. Under Criterion A, Discovery is significant as the oldest of the three extant orbiter vehicles constructed for the Space Shuttle Program (SSP), the longest running American space program to date; she was the third of five orbiters built by NASA. Unlike the Mercury, Gemini, and Apollo programs, the SSP’s emphasis was on cost effectiveness and reusability, and eventually the construction of a space station. Including her maiden voyage (launched August 30, 1984), Discovery flew to space thirty-nine times, more than any of the other four orbiters; she was also the first orbiter to fly twenty missions. She had the honor of being chosen as the Return to Flight vehicle after both the Challenger and Columbia accidents. Discovery was the first shuttle to fly with the redesigned SRBs, a result of the Challenger accident, and the first shuttle to fly with the Phase II and Block I SSME. Discovery also carried the Hubble Space Telescope to orbit and performed two of the five servicing missions to the observatory. She flew the first and last dedicated Department of Defense (DoD) missions, as well as the first unclassified defense-related mission. In addition, Discovery was vital to the construction of the International Space Station (ISS); she flew thirteen of the thirty-seven total missions flown to the station by a U.S. Space Shuttle. She was the first orbiter to dock to the ISS, and the first to perform an exchange of a resident crew. Under Criterion C, Discovery is significant as a feat of engineering. 
According to Wayne Hale, a flight director from Johnson Space Center, the Space Shuttle orbiter represents a “huge technological leap from expendable rockets and capsules to a reusable, winged, hypersonic, cargo-carrying spacecraft.” Although her base structure followed a conventional aircraft design, she used advanced materials that both minimized her weight for cargo-carrying purposes and featured low thermal expansion ratios, which provided a stable base for her Thermal Protection System (TPS) materials. The Space Shuttle orbiter also featured the first reusable TPS; all previous spaceflight vehicles had a single-use, ablative heat shield. Other notable engineering achievements of the orbiter included the first reusable orbital propulsion system, and the first two-fault-tolerant Integrated Avionics System. As Hale stated, the Space Shuttle remains “the largest, fastest, winged hypersonic aircraft in history,” having regularly flown at twenty-five times the speed of sound."""

# Tokenization
doc = nlp(text)
tokens = [token.text for token in doc]
print(tokens)

['The', 'Orbiter', 'Discovery', ',', 'OV-103', ',', 'is', 'considered', 'eligible', 'for', 'listing', 'in', 'the', 'National', 'Register', 'of', 'Historic', 'Places', '(', 'NRHP', ')', 'in', 'the', 'context', 'of', 'the', 'U.S.', 'Space', 'Shuttle', 'Program', '(', '1969', '-', '2011', ')', 'under', 'Criterion', 'A', 'in', 'the', 'areas', 'of', 'Space', 'Exploration', 'and', 'Transportation', 'and', 'under', 'Criterion', 'C', 'in', 'the', 'area', 'of', 'Engineering', '.', 'Because', 'it', 'has', 'achieved', 'significance', 'within', 'the', 'past', 'fifty', 'years', ',', 'Criteria', 'Consideration', 'G', 'applies', '.', 'Under', 'Criterion', 'A', ',', 'Discovery', 'is', 'significant', 'as', 'the', 'oldest', 'of', 'the', 'three', 'extant', 'orbiter', 'vehicles', 'constructed', 'for', 'the', 'Space', 'Shuttle', 'Program', '(', 'SSP', ')', ',', 'the', 'longest', 'running', 'American', 'space', 'program', 'to', 'date', ';', 'she', 'was', 'the', 'third', 'of', 'five', 'orbiters', 'built', 'b

NLTK also provides text-processing tools. With the functions `word_tokenize` and `sent_tokenize` and the `stopwords` corpus, we can get tokenized words and sentences, as well as a list of stop words.

`
tokens = word_tokenize(text)
sentences = sent_tokenize(text)
stop_words = set(stopwords.words('english'))
`

Don't forget about punctuation: the newline character should be skipped as well, so add it to the punctuation string.

`
punctuation = punctuation + '\n'
`

You can also count how often each word occurs in the text. Stop words and punctuation must be excluded from the count, otherwise they dominate the frequencies.

`
word_frequencies = {}
for word in doc:
  if word.text.lower() not in stop_words:
    if word.text.lower() not in punctuation:
      if word.text not in word_frequencies.keys():
        word_frequencies[word.text] = 1
      else:
        word_frequencies[word.text] += 1
`
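The counting step can also be sketched without spaCy or NLTK, which makes the logic easier to test in isolation. The version below is a minimal illustration, not the notebook's method: the `count_words` helper, the regex tokenizer, and the tiny `toy_stop_words` set are all assumptions made for this example. It lowercases every token before counting, so later lookups by lowercase word stay consistent.

```python
import re
from collections import Counter
from string import punctuation

# Hypothetical miniature stop-word list, for illustration only;
# the notebook itself uses stopwords.words('english') from NLTK.
toy_stop_words = {"the", "to", "of", "a", "and", "was", "is"}

def count_words(text, stop_words):
    """Count lowercase word tokens, skipping stop words and punctuation."""
    # A simple regex tokenizer stands in for spaCy/NLTK tokenization here.
    tokens = re.findall(r"[A-Za-z0-9'-]+", text.lower())
    return Counter(
        tok for tok in tokens
        if tok not in stop_words and tok not in punctuation
    )

sample = "Discovery flew to space. Discovery was the first orbiter to dock to the ISS."
freqs = count_words(sample, toy_stop_words)
print(freqs.most_common(3))
```

`Counter` removes the need for the explicit "is the key already there" branch, since missing keys start at zero.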

Once the text is prepared and spaCy or NLTK has extracted the necessary information, we can create the summary. One simple approach is extractive: score each sentence by the frequencies of the words it contains and keep the highest-scoring sentences.

In [5]:
tokens = word_tokenize(text)
sentences = sent_tokenize(text)
stop_words = set(stopwords.words('english'))

punctuation = punctuation + '\n'

word_frequencies = {}
for word in doc:
  if word.text.lower() not in stop_words:
    if word.text.lower() not in punctuation:
      if word.text not in word_frequencies.keys():
        word_frequencies[word.text] = 1
      else:
        word_frequencies[word.text] += 1

In [6]:
# Normalize word frequencies by the highest count:
max_frequency = max(word_frequencies.values())
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word]/max_frequency

# Sentence tokenization (done once, outside the loop):
sentence_tokens = [sent for sent in doc.sents]

In [7]:
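# Sentence scores: sum of the normalized frequencies of the words in each sentence.
# Note: word_frequencies is keyed by word.text (original case), so the lower-cased
# lookup below only matches words that also occur in lower case in the text.
sentence_scores = {}
for sent in sentence_tokens:
    for word in sent:
        if word.text.lower() in word_frequencies.keys():
            if sent not in sentence_scores.keys():
                sentence_scores[sent] = word_frequencies[word.text.lower()]
            else:
                sentence_scores[sent] += word_frequencies[word.text.lower()]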
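The normalization and scoring steps can be checked on toy data where the expected numbers are easy to work out by hand. This is a hedged sketch under simplifying assumptions: sentences are plain strings rather than spaCy spans, words are split on whitespace, and the helper names `normalize` and `score_sentences` are invented for this illustration.

```python
def normalize(freqs):
    """Scale raw counts so that the most frequent word has weight 1.0."""
    top = max(freqs.values())
    return {word: count / top for word, count in freqs.items()}

def score_sentences(sentences, weights):
    """Sum the weights of the known words in each sentence."""
    scores = {}
    for sent in sentences:
        for word in sent.lower().split():
            word = word.strip(".,;:()")  # drop punctuation attached to the word
            if word in weights:
                scores[sent] = scores.get(sent, 0.0) + weights[word]
    return scores

# Toy inputs, made up for this example.
toy_freqs = {"discovery": 2, "orbiter": 1, "space": 1}
toy_sentences = [
    "Discovery flew to space.",
    "She was the first orbiter to dock.",
    "Nothing relevant here.",
]

weights = normalize(toy_freqs)
scores = score_sentences(toy_sentences, weights)
print(scores)
```

A sentence with no known words never enters the dictionary, which is also how the notebook cell behaves.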

### The heapq library
The `heapq` module is part of the Python standard library and implements the heap queue (priority queue) algorithm. One of its functions, `nlargest`, returns the largest elements of an iterable.
`
from heapq import nlargest
`
The `nlargest(n, iterable, key=None)` function takes three arguments:

- `n`: the number of largest elements you want to get
- `iterable`: the iterable object from which the largest elements are selected
- `key` (optional): a one-argument function that computes the value by which elements are compared (for example, `key=str.lower`)

The `nlargest` function returns a list of the `n` largest elements from `iterable`, in descending order. If `n` is greater than the length of the iterable, the function returns all of its elements, sorted in descending order.

So `from heapq import nlargest` lets you call `nlargest` directly to pick the largest elements from an arbitrary iterable.
`
select_length = int(len(sentence_tokens))
summary = nlargest(select_length, sentence_scores, key = sentence_scores.get)
summary
`

Here `nlargest` picks the `select_length` highest-scoring items from the `sentence_scores` dictionary. The dictionary keys are sentences and the values are their scores. Iterating over a dictionary yields its keys, and `key=sentence_scores.get` tells `nlargest` to rank each sentence by the score stored for it. As a result, `summary` is a list of the `select_length` best sentences in descending order of score. For the result to be shorter than the original text, `select_length` should be smaller than the number of sentences, for example a fraction of `len(sentence_tokens)`.
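A small standalone example makes the behavior of `nlargest` with a dictionary concrete. The `toy_scores` dictionary and its labels are made up for this illustration; only `heapq.nlargest` itself comes from the standard library.

```python
from heapq import nlargest

# Made-up scores: keys stand in for sentences, values for their weights.
toy_scores = {
    "sentence A": 0.4,
    "sentence B": 1.5,
    "sentence C": 0.9,
    "sentence D": 0.1,
}

# Iterating over a dict yields its keys; key=toy_scores.get ranks them by value.
top_two = nlargest(2, toy_scores, key=toy_scores.get)

# Asking for more items than exist returns every key, sorted by descending score.
everything = nlargest(10, toy_scores, key=toy_scores.get)

print(top_two)
print(everything)
```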

In [8]:
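# Summarization:
from heapq import nlargest
# Keep roughly 30% of the sentences
select_length = int(len(sentence_tokens)*0.3)
# Pick the highest-scoring sentences (ordered by score, not by position in the text)
summary = nlargest(select_length, sentence_scores, key = sentence_scores.get)
# Each selected item is a spaCy span; join their texts into one string
final_summary = [sent.text for sent in summary]
summary = ' '.join(final_summary)
summary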

'Including her maiden voyage (launched August 30, 1984), Discovery flew to space thirty-nine times, more than any of the other four orbiters; she was also the first orbiter to fly twenty missions. Other notable engineering achievements of the orbiter included the first reusable orbital propulsion system, and the first two-fault-tolerant Integrated Avionics System. According to Wayne Hale, a flight director from Johnson Space Center, the Space Shuttle orbiter represents a “huge technological leap from expendable rockets and capsules to a reusable, winged, hypersonic, cargo-carrying spacecraft.” The Space Shuttle orbiter also featured the first reusable TPS; all previous spaceflight vehicles had a single-use, ablative heat shield. She flew the first and last dedicated Department of Defense (DoD) missions, as well as the first unclassified defense-related mission.'
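The summary above lists the sentences by descending score, so they no longer follow the order of the source text. If a summary that reads in the original order is preferred, one possible variation is to select the top sentences first and then emit them in document order. The sketch below assumes plain strings as sentences and a hypothetical helper name `summarize_in_order`; it is an optional refinement, not part of the task's required method.

```python
from heapq import nlargest

def summarize_in_order(sentences, scores, n):
    """Pick the n best-scoring sentences, then keep their original order."""
    best = set(nlargest(n, scores, key=scores.get))
    return " ".join(sent for sent in sentences if sent in best)

# Toy data, made up for this example.
toy_sentences = ["First.", "Second.", "Third.", "Fourth."]
toy_scores = {"First.": 0.2, "Second.": 0.7, "Third.": 0.1, "Fourth.": 0.9}

by_score = " ".join(nlargest(2, toy_scores, key=toy_scores.get))
in_order = summarize_in_order(toy_sentences, toy_scores, 2)

print(by_score)
print(in_order)
```

The trade-off is purely presentational: the same sentences are chosen either way, only their sequence in the output changes.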