# Import all Libraries

In [30]:
from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords
import requests
from urllib.request import urlopen
import bs4 as bs
import urllib.request
import re

# Web Scrapping :Method 1

In [31]:
def get_only_text(url):
    page=urlopen(url)
    soup=bs.BeautifulSoup(page,"lxml")
    text =''.join(map(lambda p: p.text, soup.find_all('p')))
    return soup.title.text, text


In [32]:
url="https://en.wikipedia.org/wiki/Python_(programming_language)"
text=get_only_text(url)
text

('Python (programming language) - Wikipedia',
 '\nPython is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python\'s design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.[28]\nPython is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly, procedural), object-oriented, and functional programming. Python is often described as a "batteries included" language due to its comprehensive standard library.[29]\nPython was conceived in the late 1980s as a successor to the ABC language. Python\xa02.0, released in 2000, introduced features like list comprehensions and a garbage collection system with reference counting.\nPython\xa03.0, released in 2008, was a major revision of the 

# Web Scrapping: Method 2

In [33]:
scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Python_(programming_language)')
article = scraped_data.read()

parsed_article = bs.BeautifulSoup(article,'lxml')

paragraphs = parsed_article.find_all('p')

text = ""

for p in paragraphs:
    text += p.text
text    

'\nPython is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python\'s design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.[28]\nPython is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly, procedural), object-oriented, and functional programming. Python is often described as a "batteries included" language due to its comprehensive standard library.[29]\nPython was conceived in the late 1980s as a successor to the ABC language. Python\xa02.0, released in 2000, introduced features like list comprehensions and a garbage collection system with reference counting.\nPython\xa03.0, released in 2008, was a major revision of the language that is not completely backward-compat

Wikipedia, references are enclosed in square brackets. The following script removes the square brackets and replaces the resulting multiple spaces by a single space. 


In [34]:
# Removing Square Brackets and Extra Spaces
text = re.sub(r'\[[0-9]*\]', ' ', text)
text = re.sub(r'\s+', ' ', text)

clean the text for obtaining word weighted frequency because we are going to use this word frequency later on to obtain sentence score

In [35]:
# Removing special characters and digits
Clean_text = re.sub('[^a-zA-Z]', ' ', text )
Clean_text = re.sub(r'\s+', ' ', Clean_text)

Converting Text To Sentences i.e sentence tokenization

In [36]:
import nltk
sentence_list = nltk.sent_tokenize(text)

## Find Weighted Frequency of Occurrence
To find the frequency of occurrence of each word, we use the clean_text variable. We used this variable to find the frequency of occurrence since it doesn't contain punctuation, digits, or other special characters. 

In [37]:
stopwords = nltk.corpus.stopwords.words('english')

word_frequencies = {}
for word in nltk.word_tokenize(Clean_text):
    if word not in stopwords:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1

## how calculate weighted frquency of each word

In [38]:
maximum_frequncy = max(word_frequencies.values())

for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/maximum_frequncy)

## Calculating Sentence Scores
We have now calculated the weighted frequencies for all the words. Now is the time to calculate the scores for each sentence by adding weighted frequencies of the words that occur in that particular sentence.

In [39]:
sentence_scores = {}
for sent in sentence_list:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]

In [40]:
import heapq
summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)

summary = ' '.join(summary_sentences)
print(summary)

Object-oriented programming and structured programming are fully supported, and many of its features support functional programming and aspect-oriented programming (including by metaprogramming and metaobjects (magic methods)). As a scripting language with modular architecture, simple syntax and rich text processing tools, Python is often used for natural language processing. Python's design and philosophy have influenced many other programming languages: Python's development practices have also been emulated by other languages. Python's performance compared to other programming languages has also been benchmarked by The Computer Language Benchmarks Game. Van Rossum's vision of a small core language with a large standard library and easily extensible interpreter stemmed from his frustrations with ABC, which espoused the opposite approach. In contrast, code that is difficult to understand or reads like a rough transcription from another programming language is called unpythonic. Python 

# Text Summarization using Gensim library

In [44]:
summary=summarize(str(text),ratio=0.06)


In [45]:
keyword=keywords(str(text),ratio=0.01)


In [46]:
print('summary:\n',summary)
print("original text character length:",len(str(text)))
print("summary text character length",len(summary))
print("compression ratio:",round(100-(100*len(summary)/len(str(text)))),"%")
print('\nkeywords\n:',keyword)

summary:
 Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace.
The language's core philosophy is summarized in the document The Zen of Python (PEP 20), which includes aphorisms such as: Rather than having all of its functionality built into its core, Python was designed to be highly extensible.
Van Rossum's vision of a small core language with a large standard library and easily extensible interpreter stemmed from his frustrations with ABC, which espoused the opposite approach.
When speed is important, a Python programmer can move time-critical functions to extension modules written in languages such as C, or use PyPy, a just-in-time compiler.
Python's statements include (among others): Python does not support tail call optimization or first-class continuations, and, according to Guido van Rossum, it never will.
The long term plan is to support gradual typing and from Python 3.5, t