In [1]:
import bs4 as bs
import urllib.request
import re

scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Digital_image_processing')
article = scraped_data.read()

parsed_article = bs.BeautifulSoup(article,'lxml')
# print(parsed_article)  

paragraphs = parsed_article.find_all('p')
# print(paragraphs)  

article_text = ""

for p in paragraphs:
    article_text += p.text
    
# print(article_text)  

# We use urlopen from urllib.request to fetch the page, then call read() on the
# returned object to get the raw HTML.
# To parse it, we pass the raw HTML (article) and the parser name 'lxml' to BeautifulSoup
# (the lxml package must be installed).
# find_all('p') returns all paragraph elements in the article as a list; the loop
# concatenates their text into article_text.

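The paragraph-collection step can be tried offline. The sketch below is not the BeautifulSoup code from the cell above: it swaps in the standard library's `html.parser.HTMLParser` so that it runs without a network connection or extra packages. The class name `ParagraphText` and the sample HTML string are made up for illustration.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <p> elements we are currently inside
        self.chunks = []    # text fragments seen inside paragraphs

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


# Made-up HTML standing in for the downloaded page.
html = ("<html><body><h1>Title</h1>"
        "<p>First <b>bold</b> paragraph.[1]</p>"
        "<div>skip</div><p>Second.</p></body></html>")

parser = ParagraphText()
parser.feed(html)
parser.close()

# Same idea as the loop above: concatenate paragraph text with no separator.
article_text = "".join(parser.chunks)
print(article_text)
```

The heading and the `<div>` are dropped, and the paragraphs are joined without any separator, just as in the `article_text += p.text` loop.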
In [2]:
# Preprocessing
# The following script removes bracketed citation markers such as [1] and then
# collapses the resulting runs of whitespace into a single space.
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
article_text = re.sub(r'\s+', ' ', article_text)
# print(article_text)  

fname = "D:/Digital_image_processing.txt"
with open(fname, "w", encoding="utf-8") as f:
    f.write(article_text)

with open(fname, "r", encoding="utf-8") as reader:
    txt = reader.read()
# print(txt)

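As a small check of the two substitutions used in this cell, here is a minimal sketch on a made-up sentence. The function name `strip_citations` and the sample text are assumptions for illustration; the regular expressions are the ones from the cell.

```python
import re


def strip_citations(text):
    """Replace bracketed numeric markers such as [12] with a space,
    then collapse every run of whitespace into a single space."""
    text = re.sub(r'\[[0-9]*\]', ' ', text)
    return re.sub(r'\s+', ' ', text)


# Made-up sample with two citation markers and extra spaces.
sample = "Digital images[1] are arrays of pixels.[12]   They can be filtered."
print(strip_citations(sample))
# prints: Digital images are arrays of pixels. They can be filtered.
```

Because the pattern uses `[0-9]*`, it also matches an empty pair of brackets, `[]`.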
In [3]:
# Removing special characters and digits
formatted_article_text = re.sub('[^a-zA-Z]', ' ', txt )
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)
# print(formatted_article_text)
# formatted_article_text contains no punctuation, so it cannot be split into
# sentences at full stops. Sentence tokenization therefore uses txt instead.

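To see why the letters-only text cannot be split into sentences, the sketch below applies the same two substitutions to a made-up sample. The function name `letters_only` and the sample text are assumptions for illustration.

```python
import re


def letters_only(text):
    """Replace every character that is not an ASCII letter with a space,
    then collapse runs of whitespace into a single space."""
    text = re.sub('[^a-zA-Z]', ' ', text)
    return re.sub(r'\s+', ' ', text)


# Made-up sample containing digits and sentence-ending punctuation.
sample = "Filtering began in 1960. It spread quickly!"
flat = letters_only(sample)

print(repr(flat))    # the digits and the sentence boundary are gone
print('.' in flat)   # prints: False
```

The result keeps a trailing space where the final punctuation mark was, which is harmless for word counting because the tokenizer ignores it.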
In [7]:
# Note: this overwrites the file written earlier with the letters-only text.
fname = "D:/Digital_image_processing.txt"
with open(fname, "w", encoding="utf-8") as f:
    f.write(formatted_article_text)

In [5]:
import nltk
# nltk.download('punkt')  # run once to fetch the sentence tokenizer models
# Converting text to sentences (uses txt, which still has its punctuation)
sentence_list = nltk.sent_tokenize(txt)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\balkr\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


In [8]:
sentence_list

['Digital image processing is the use of a digital computer to process digital images through an algorithm.',
 'As a subcategory or field of digital signal processing, digital image processing has many advantages over analog image processing.',
 'It allows a much wider range of algorithms to be applied to the input data and can avoid problems such as the build-up of noise and distortion during processing.',
 'Since images are defined over two dimensions (perhaps more) digital image processing may be modeled in the form of multidimensional systems.',
 'The generation and development of digital image processing are mainly affected by three factors: first, the development of computers; second, the development of mathematics (especially the creation and improvement of discrete mathematics theory); third, the demand for a wide range of applications in environment, agriculture, military, industry and medical science has increased.',
 'Many of the techniques of digital image processing, or di

In [52]:
# Find Weighted Frequency of Occurrence

# we use the formatted_article_text variable to find the frequency of occurrence
# since it doesn't contain punctuation, digits, or other special characters.

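# nltk.download('stopwords')  # run once if the stopword corpus is missing
stopwords = nltk.corpus.stopwords.words('english')

# Note: the text is not lowercased here, so capitalized words such as "The" are
# not filtered as stopwords and are counted separately from "the".
word_frequencies = {}
for word in nltk.word_tokenize(formatted_article_text):
    if word not in stopwords:
        word_frequencies[word] = word_frequencies.get(word, 0) + 1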

# print(word_frequencies)  

# Finally, to find the weighted frequency, we divide the number of occurrences of each word
# by the number of occurrences of the most frequent word.

maximum_frequency = max(word_frequencies.values())

for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word] / maximum_frequency

# print(word_frequencies)      

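The weighted-frequency calculation can be checked by hand on a toy input. The sketch below is a simplified stand-in, not the cell's code: it uses `str.split` instead of `nltk.word_tokenize`, a tiny made-up stopword set instead of the NLTK list, and it lowercases the text before filtering (the cell above does not). The names `STOPWORDS` and `weighted_frequencies` are hypothetical.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword set for the example.
STOPWORDS = {"the", "of", "is", "a", "to"}


def weighted_frequencies(text):
    """Count non-stopword tokens, then divide each count by the largest count."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    counts = Counter(words)
    if not counts:          # nothing left after filtering
        return {}
    top = max(counts.values())
    return {word: count / top for word, count in counts.items()}


freqs = weighted_frequencies(
    "image processing is the processing of a digital image to improve image quality"
)
print(freqs["image"])       # most frequent word, so its weight is 1.0
print(freqs["processing"])  # 2 occurrences out of a maximum of 3
```

Here "image" occurs three times and gets weight 1.0, "processing" occurs twice and gets 2/3, and every word that occurs once gets 1/3.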
In [53]:
# Calculating Sentence Scores

sentence_scores = {}
for sent in sentence_list:
    # Skip long sentences (30 or more space-separated words).
    if len(sent.split(' ')) >= 30:
        continue
    # Tokens are lowercased here, so only lowercase keys in word_frequencies match.
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies:
            sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[word]

# print(sentence_scores)    

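The scoring rule is: a sentence's score is the sum of the weighted frequencies of its words, and sentences of 30 or more words are skipped. A minimal sketch on toy data follows. It assumes a simplified tokenizer (`str.split` plus stripping trailing punctuation) in place of `nltk.word_tokenize`, and the function name `score_sentences`, the `max_words` parameter, and the weights are made up for illustration.

```python
def score_sentences(sentences, word_freqs, max_words=30):
    """Sum the weights of known words in each sentence shorter than max_words."""
    scores = {}
    for sent in sentences:
        tokens = sent.lower().split()
        if len(tokens) >= max_words:
            continue  # long sentences are left out of the summary
        total = sum(word_freqs.get(tok.strip('.,;:!?'), 0.0) for tok in tokens)
        if total > 0:  # sentences with no known word get no entry
            scores[sent] = total
    return scores


# Made-up weights and sentences.
word_freqs = {"image": 1.0, "processing": 0.5, "quality": 0.25}
sentences = [
    "Image processing improves image quality.",
    "The weather is nice.",
]

scores = score_sentences(sentences, word_freqs)
print(scores)
# prints: {'Image processing improves image quality.': 2.75}
```

The first sentence scores 1.0 + 0.5 + 1.0 + 0.25 = 2.75; the second contains no weighted word and does not appear in the result.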
In [54]:
# Getting the Summary

import heapq

# We call heapq.nlargest to retrieve the 7 sentences with the highest scores.
# The sentences come back ordered by score, not by their position in the article.
summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)

summary = ' '.join(summary_sentences)
print(summary)


As a subcategory or field of digital signal processing, digital image processing has many advantages over analog image processing.  In computer science, digital image processing is the use of computer algorithms to perform image processing on digital images. In image processing, the input is a low-quality image, and the out put is an image with improved quality. Common image processing include image enhancement, restoration, encoding, and compression. The purpose of early image processing was to improve the quality of the image. DSP chips have since been widely used in digital image processing. Westworld (1973) was the first feature film to use the digital image processing to pixellate photography to simulate an android's point of view.
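
`heapq.nlargest` returns keys in descending score order, which is why the summary above does not follow the order of the article. The sketch below shows the selection on made-up scores and one possible way to put the chosen sentences back in their original order. The sentence strings and scores are invented, and the reordering step is an optional variation, not part of the cell above.

```python
import heapq

# Made-up scores; insertion order stands in for the order in the article.
sentence_scores = {
    "Sentence A.": 1.2,
    "Sentence B.": 3.4,
    "Sentence C.": 2.1,
    "Sentence D.": 4.0,
}

# Iterating a dict yields its keys; key= looks up each key's score.
top = heapq.nlargest(2, sentence_scores, key=sentence_scores.get)
print(top)       # prints: ['Sentence D.', 'Sentence B.']

# Optional: restore article order (dicts preserve insertion order).
chosen = set(top)
in_order = [sent for sent in sentence_scores if sent in chosen]
print(in_order)  # prints: ['Sentence B.', 'Sentence D.']

summary = ' '.join(in_order)
```

Whether to keep score order or article order is a design choice: score order puts the most representative sentence first, while article order tends to read more naturally.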
