<a href="https://colab.research.google.com/github/Pratik94229/NLP/blob/main/Frequency_Summarizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [9]:
######################################################################################
# THis example is pretty much entirely based on this excellent blog post
# http://glowingpython.blogspot.in/2014/09/text-summarization-with-nltk.html
# Thanks to TheGlowingPython, the good soul that wrote this excellent article!
######################################################################################


######################################################################################
#  we will use 2 functions from nltk
#  sent_tokenize: given a group of text, tokenize (split) it into sentences
#  word_tokenize: given a group of text, tokenize (split) it into words
#  stopwords.words('english') to find and ignored very common words ('I', 'the',...) 
######################################################################################
from nltk.tokenize import sent_tokenize,word_tokenize
from nltk.corpus import stopwords

import nltk
nltk.download('stopwords')
nltk.download('punkt')

######################################################################################
# defaultdict is class that inherits from dictionary, but has
# one additional nice feature: Usually, a Python dictionary throws a KeyError if you try 
# to get an item with a key that is not currently in the dictionary. 
# The defaultdict in contrast will simply create any items that you try to access 
# (provided of course they do not exist yet). To create such a "default" item, it relies 
# a function that is passed in..more below. 
######################################################################################
from collections import defaultdict

######################################################################################
#  punctuation to ignore punctuation symbols
######################################################################################
from string import punctuation

######################################################################################
# heapq.nlargest is a function that given a list, easily and quickly returns
# the 'n' largest elements in the list. More below
######################################################################################
from heapq import nlargest

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [3]:
######################################################################################
# Our first class, named FrequencySummarizer 
######################################################################################
class FrequencySummarizer:
    # Constructer
    def __init__(self, min_cut=0.1, max_cut=0.9):
        self._min_cut = min_cut
        self._max_cut = max_cut 
        # Words that have a frequency term lower than min_cut 
        # or higer than max_cut will be ignored.

        self._stopwords = set(stopwords.words('english') + list(punctuation))
        # Punctuation symbols and stopwords (common words like 'an','the' etc) are ignored
        
        # Here self._min_cut, self._max_cut and self._stopwords are created as member variables
        # i.e. each object (instance) of this class will have an independent version of these
        # variables. 

   

    def _compute_frequencies(self, word_sent):
      '''
        next method (member function) which takes in self (the special keyword for this same object)
        as well as a list of sentences, and outputs a dictionary, where the keys are words, and
        values are the frequencies of those words in the set of sentences'''
      freq = defaultdict(int)
        # defaultdict, which we referred to above - is a class that inherits from dictionary,
        # with one difference: Usually, a Python dictionary throws a KeyError if you try 
        # to get an item with a key that is not currently in the dictionary. 
        # The defaultdict in contrast will simply create any items that you try to access 
        # (provided of course they do not exist yet). THe 'int' passed in as argument tells
        # the defaultdict object to create a default value of 0

      #Calculating frequency of each words and storing it in default dictionary 
      for sen in word_sent:
        # looping through sentence
          for word in sen:
            # looping through words in sentence
            if word not in self._stopwords:
                # if the word is in the member variable (dictionary) self._stopwords, then ignore it,
                # else increment the frequency. Had the dictionary freq been a regular dictionary (not a 
                # defaultdict, we would have had to first check whether this word is in the dict
                freq[word] += 1

      '''
      Now we will go through our frequency list and do 2 things
      normalize the frequencies by dividing each by the highest frequency (this allows us to 
      always have frequencies between 0 and 1, which makes comparing them easy and then
      filter out frequencies that are too high or too low.This would help us get better results.

      '''
      m = float(max(freq.values()))
      # getting the highest frequency of any word in the list of words
        
      for w in list(freq.keys()):
    
        freq[w] = freq[w]/m
        # divide each frequency by that max value, so it is now between 0 and 1 and updating it in original dictionary

        if freq[w] >= self._max_cut or freq[w] <= self._min_cut:
        # really common or really uncommon. In either case - delete it from our dictionary
         del freq[w]

         return freq
    
    #Function to rank sentences based on their importance(i.e frequency of words )
    def summarize(self, text, n):
        '''
        next method (member function) which takes in self (the special keyword for this same object)
        as well as the raw text, and the number of sentences we wish the summary to contain. Return the 
        summary of the document
        '''
        # split the text into sentences
        sents = sent_tokenize(text)
       
        # assert is a way of making sure a condition holds true, else an exception is thrown. Used to do 
        # sanity checks like making sure the summary is shorter than the original article.
        assert n <= len(sents)

        # converting each sentence to lower-case, then 
        # splits each sentence into words, then takes all of those lists (1 per sentence)
        # and packs them into bigger list
        word_sent = [word_tokenize(s.lower()) for s in sents]
        
        # make a call to the method (member function) _compute_frequencies, and places that in
        # the member variable _freq.
        self._freq = self._compute_frequencies(word_sent)

        # creating an empty dictionary (of the superior defaultdict variety) to hold the rankings of the 
        # sentences.  
        ranking = defaultdict(int)
        
        for i,sent in enumerate(word_sent):
          # enumerate(sequence) will return (0, thing[0]), (1, thing[1]), (2, thing[2]), and so forth.

          # for each word in this sentence
          for w in sent:
            # if this is not a stopword (common word), add the frequency of that word 
            # to the weightage assigned to that sentence 
            if w in self._freq:
              ranking[i] += self._freq[w]

        # we want to return the first n sentences with highest ranking so we are using the nlargest function to do so
        # this function needs to know how to get the list of values to rank, so give it a function - simply the 
        # get method of the dictionary
        sents_idx = nlargest(n, ranking, key=ranking.get)
        return [sents[sen] for sen in sents_idx]
       # return a list with these values in a list





In [19]:
######################################################################################
# Now to get a URL and summarize
######################################################################################
import urllib.request
from bs4 import BeautifulSoup


# This function takes in a URL as an argument, and returns only the text of the article in that URL.
def get_only_text_washington_post_url(url):
    # download the URL 
    page = urllib.request.urlopen(url).read().decode('utf8')

    # initialise a BeautifulSoup object with the text of that URL
    soup = BeautifulSoup(page)
    

    #to get everything in that text that lies between a pair of <article> and </article> tags.
    text = ' '.join(map(lambda p: p.text, soup.find_all('article')))
  

    # Owe got everything between the <article> and </article> tags, but that everything
    # includes a bunch of other stuff we don't want

    # Now - repeat, but this time we will only take what lies between <p> and </p> tags
    # these are HTML tags for "paragraph" i.e. this should give us the actual text of the article
    soup2 = BeautifulSoup(text)

    #to get everything in that text that lies between a pair of 
    # <p> and </p> tags.

    text = ' '.join(map(lambda p: p.text, soup2.find_all('p')))

    # Return a pair of values (article title, article body)
    return soup.title.text, text



In [23]:
# the article we would like to summarize
someUrl = "https://www.washingtonpost.com/technology/2023/05/02/ai-jobs-takeover-ibm/"

# get the title and text
textOfUrl = get_only_text_washington_post_url(someUrl)


# instantiate our FrequencySummarizer class and get an object of this class
fs = FrequencySummarizer()

# get a summary of this article that is 3 sentences long
summary = fs.summarize(textOfUrl[1], 3)


In [25]:
summary

['Goldman Sach’s March report predicted 18 percent of work worldwide could be computerized, with white-collar workers more at risk than manual laborers.The ability of AI software to generate new content that is “indistinguishable from human-created output” and breaks down communication barriers between humans and machines are key to why it might drastically affect the workforce, the report said.Using jobs data in both the United States and Europe, report writers found that roughly two-thirds of current jobs are exposed to some degree of AI automation, and that generative AI could substitute for up to one-fourth of current work done by humans.Your next job interview could be judged by AI.',
 '(Christopher Goodney/Bloomberg News)Listen4 minComment on this storyCommentGift ArticleShareIBM Corp. said it expects to pause hiring for jobs that artificial intelligence could do, indicating that the potentially groundbreaking technology is beginning to disrupt how humans work.Arvind Krishna, the