Extractive summarization: building a summary from important sentences or phrases taken directly from the text

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
# Counter counts frequency of words
from collections import Counter
# imports list of stop words to be removed (a, an, the ...)
from nltk.corpus import stopwords
# sent_tokenize splits text into sentences
# word_tokenize splits a sentence into words
from nltk.tokenize import word_tokenize, sent_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from heapq import nlargest

In [None]:
def summarize(text, n):
  # Tokenize the text into sentences
  sentences = sent_tokenize(text)
  stop_words = set(stopwords.words('english'))

  # Tokenize into words, removing stop words and punctuation
  words = []
  for word in word_tokenize(text):
    if word.lower() not in stop_words and word.isalpha():
      words.append(word.lower())
  word_freq = Counter(words)

  # Score each sentence as the sum of the frequencies of its content words
  sentence_scores = {}
  for sentence in sentences:
    for word in word_tokenize(sentence):
      if word.lower() not in stop_words and word.isalpha():
        sentence_scores[sentence] = sentence_scores.get(sentence, 0) + word_freq[word.lower()]

  # Keep the n highest-scoring sentences, ordered by score (not by position in the text)
  summarized_sentences = sorted(sentence_scores, key=sentence_scores.get, reverse=True)[:n]
  return ' '.join(summarized_sentences)
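
The frequency scoring in `summarize` can be traced by hand on a tiny input. The following is a minimal sketch, not the notebook's method: it assumes a regex tokenizer and a very small stop-word set (`STOP_WORDS`, `simple_words`, and `score_sentences` are hypothetical names) in place of the NLTK tokenizers and stop-word list, so that it runs without downloading any data.

```python
import re
from collections import Counter

# Hypothetical stand-in for the NLTK stop-word list (assumption, far smaller)
STOP_WORDS = {"the", "is", "a", "an", "and", "in", "of", "to", "it"}

def simple_words(text):
  # Lowercased alphabetic tokens with stop words removed
  return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def score_sentences(sentences):
  # Word frequencies counted over the whole text
  word_freq = Counter(w for s in sentences for w in simple_words(s))
  # A sentence's score is the sum of the frequencies of its words
  return {s: sum(word_freq[w] for w in simple_words(s)) for s in sentences}

sentences = [
  "Weather is the change in the atmosphere.",
  "Weather includes wind and rain.",
  "Rain is wet.",
]
scores = score_sentences(sentences)
for sentence, score in scores.items():
  print(score, sentence)
```

Because the score is a plain sum, a sentence with more content words tends to score higher; here the second sentence outranks the first mainly because it contains one more counted word.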

In [None]:
def summarize_vectorized(text, n):
  # Tokenize the text into sentences
  sentences = sent_tokenize(text)
  # Build a TF-IDF matrix with one row per sentence
  # TF-IDF combines how often a word occurs in a sentence with how rare it is
  # across all sentences, so rarer words receive more weight
  vectorizer = TfidfVectorizer(stop_words='english')
  tfidf_matrix = vectorizer.fit_transform(sentences)

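  # Cosine similarity between the last sentence and every other sentence.
  # Note: the reference here is the final sentence, not the whole document,
  # and the final sentence itself is never a candidate for the summary.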
  sentence_scores = cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1])[0]

  #find indices for n sentences with highest similarity scores
  summarized_sentences = nlargest(n, range(len(sentence_scores)), key=sentence_scores.__getitem__)

  # Join the selected sentences in their original order
  return ' '.join([sentences[i] for i in sorted(summarized_sentences)])
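
`summarize_vectorized` scores sentences against the last sentence of the text. One alternative is to score every sentence against the document as a whole. The following is a minimal sketch of that idea under stated assumptions: it uses raw term counts instead of TF-IDF weights, a regex tokenizer, and no stop-word removal, so that it runs with only the standard library. `term_counts`, `cosine`, and `rank_against_document` are hypothetical helper names, not part of scikit-learn.

```python
import math
import re
from collections import Counter

def term_counts(text):
  # Bag-of-words vector stored as a Counter (raw counts, no TF-IDF weighting)
  return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
  # Dot product over shared terms divided by the product of the vector norms
  dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
  norm_a = math.sqrt(sum(v * v for v in a.values()))
  norm_b = math.sqrt(sum(v * v for v in b.values()))
  return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_against_document(sentences, n):
  # Reference vector: term counts of the whole document
  doc_vec = term_counts(' '.join(sentences))
  scores = [cosine(term_counts(s), doc_vec) for s in sentences]
  # Indices of the n best-scoring sentences, restored to document order
  best = sorted(range(len(sentences)), key=scores.__getitem__, reverse=True)[:n]
  return [sentences[i] for i in sorted(best)]

sentences = [
  "Rain and wind are weather.",
  "Weather stations measure rain.",
  "Cats sleep.",
]
print(rank_against_document(sentences, 2))
```

With this reference vector every sentence, including the last one, is a candidate, and sentences that share the document's most frequent terms rank highest.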

In [None]:
text =  '''
Weather is the day-to-day or hour-to-hour change in the atmosphere.
Weather includes wind, lightning, storms, hurricanes, tornadoes (also known as twisters), rain, hail, snow, and lots more.
Energy from the Sun affects the weather too.
Climate tells us what kinds of weather usually happen in an area at different times of the year.
Changes in weather can affect our mood and life. We wear different clothes and do different things in different weather conditions.
We choose different foods in different seasons.
Weather stations around the world measure different parts of weather.
Ways to measure weather are wind speed, wind direction, temperature and humidity.
People try to use these measurements to make weather forecasts for the future.
These people are scientists that are called meteorologists.
They use computers to build large mathematical models to follow weather trends.'''
# Frequency-based summary, printed one sentence per line
summary = summarize(text, 5)
summary_sentences = summary.split('. ')
formatted_summary = '.\n'.join(summary_sentences)

# TF-IDF / cosine-similarity summary, printed one sentence per line
summary_vec = summarize_vectorized(text, 5)
summary_sent_vec = summary_vec.split('. ')
formatted_summ_vec = '.\n'.join(summary_sent_vec)

print(formatted_summary)
print()
print(formatted_summ_vec)

We wear different clothes and do different things in different weather conditions.
Weather stations around the world measure different parts of weather.
Climate tells us what kinds of weather usually happen in an area at different times of the year.
Weather includes wind, lightning, storms, hurricanes, tornadoes (also known as twisters), rain, hail, snow, and lots more.
Ways to measure weather are wind speed, wind direction, temperature and humidity.

Energy from the Sun affects the weather too.
Changes in weather can affect our mood and life.
We wear different clothes and do different things in different weather conditions.
Weather stations around the world measure different parts of weather.
People try to use these measurements to make weather forecasts for the future.
