### Importing Libraries
- In the script below, we first import the libraries required for scraping data from the web. We then use the urlopen function from urllib.request to fetch the page, and call the read function on the object returned by urlopen to read its contents. To parse the data, we create a BeautifulSoup object and pass it the scraped data (article) along with the lxml parser.
- In Wikipedia articles, all the text for the article is enclosed inside p tags, and the page scraped here is treated the same way. To retrieve the text, we call the find_all function on the object returned by BeautifulSoup, passing the tag name as a parameter. find_all returns all the paragraphs in the article as a list, and the paragraphs are then concatenated to recreate the article.

- Once the article is scraped, we need to do some preprocessing.

In [140]:
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import re
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
import bs4 as bs
import urllib.request
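# Assumes the NLTK 'stopwords' corpus and sentence tokenizer data were already fetched with nltk.download(...)
stopwords = nltk.corpus.stopwords.words('english')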
import heapq

In [149]:
scraped_data = urllib.request.urlopen('https://big4accountingfirms.com/ernst-and-young-wiki/')
article = scraped_data.read()

parsed_article = bs.BeautifulSoup(article,'lxml')

paragraphs = parsed_article.find_all('p')

article_text = ""

for p in paragraphs:
    article_text += p.text
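
- One detail worth checking: the loop above appends each paragraph with no separator, so the last word of one paragraph is glued to the first word of the next (this may be why the summary output shows text such as "services.EY"). The sketch below illustrates the effect on a made-up HTML snippet. It uses Python's built-in html.parser as a dependency-free stand-in for find_all('p'), not BeautifulSoup itself; ParagraphCollector and sample_html are hypothetical names.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text found inside <p> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Text inside nested tags (such as links) also arrives here.
        if self.in_paragraph:
            self.paragraphs[-1] += data


# Made-up page used only for illustration.
sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>First paragraph with a <a href='#'>link</a>.</p>"
    "<p>Second paragraph.</p>"
    "</body></html>"
)

collector = ParagraphCollector()
collector.feed(sample_html)
collector.close()

joined_without_space = "".join(collector.paragraphs)   # mirrors article_text += p.text
joined_with_space = " ".join(collector.paragraphs)     # keeps sentence boundaries apart

print(joined_without_space)
print(joined_with_space)
```

- If the glued sentences are a problem in your own run, appending a space after each paragraph in the loop is one possible adjustment. Note that it would change the sentences and scores produced further down.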

### Preprocessing
 - The first preprocessing step is to remove references from the article. In Wikipedia, references are enclosed in square brackets. The following script removes the bracketed reference numbers and replaces the resulting multiple spaces with a single space. Take a look at the script below:

 - After this step, the article_text object contains the text without reference markers. We do not remove anything else, since this is the original article: numbers, punctuation marks, and special characters stay in place, because the summary sentences are taken from this text and scored using the weighted word frequencies.

In [150]:
# Removing Square Brackets and Extra Spaces
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
article_text = re.sub(r'\s+', ' ', article_text)
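
- To see what these two substitutions do, here is a small sketch on a made-up sentence (the sample string is hypothetical and not taken from the scraped page). Note that only numeric markers such as [1] are matched, so a bracket containing words is left untouched.

```python
import re

# Hypothetical sample with Wikipedia-style reference markers.
sample = "EY is one of the Big Four[1] firms.[23]   It operates globally.[citation needed]"

# Replace numeric markers such as [1] or [23] with a space.
no_refs = re.sub(r'\[[0-9]*\]', ' ', sample)

# Collapse every run of whitespace into a single space.
cleaned = re.sub(r'\s+', ' ', no_refs)

print(cleaned)
```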

- To clean the text and calculate weighted frequencies, we will create another object. Take a look at the following script:

- We now have two objects: article_text, which contains the original article, and formatted_article_text, which contains the letters-only version. We will use formatted_article_text to compute weighted word frequencies, and then apply those weights to the words in the sentences of article_text.

In [151]:
# Removing special characters and digits
formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text )
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)
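
- As an illustration of the letters-only step, the sketch below applies the same two patterns to a made-up sentence (sample is a hypothetical string). It also shows a side effect worth knowing about: an apostrophe is replaced by a space, so a word like "EY's" becomes two tokens.

```python
import re

# Hypothetical sample containing digits, punctuation, and a currency symbol.
sample = "EY's revenues totaled $28.7 billion in 2015."

# Replace every character that is not an ASCII letter with a space.
letters_only = re.sub('[^a-zA-Z]', ' ', sample)

# Collapse the resulting runs of spaces.
formatted = re.sub(r'\s+', ' ', letters_only)

print(formatted.split())
```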

### Converting Text To Sentences
 - At this point we have preprocessed the data. Next, we need to tokenize the article into sentences. We use the article_text object for this since it still contains full stops. The formatted_article_text object contains no punctuation and therefore cannot be split into sentences at full stops.

 - The following script performs sentence tokenization:

In [152]:
sentence_list = nltk.sent_tokenize(article_text)
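
- The sketch below illustrates why the punctuated text is needed. It uses a deliberately naive regular-expression splitter as a stand-in, not nltk.sent_tokenize, which relies on the trained Punkt model and copes better with abbreviations. naive_sentence_split and the sample text are hypothetical.

```python
import re


def naive_sentence_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (rough stand-in)."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


# Made-up text used only for illustration.
original = "EY is a professional services firm. It has member firms worldwide. Revenue grew."

# Same letters-only cleaning as applied to formatted_article_text.
letters_only = re.sub(r'\s+', ' ', re.sub('[^a-zA-Z]', ' ', original))

with_punctuation = naive_sentence_split(original)
without_punctuation = naive_sentence_split(letters_only)

print(len(with_punctuation))      # sentences found in the punctuated text
print(len(without_punctuation))   # the letters-only text stays in one piece
```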

### Find Weighted Frequency of Occurrence
 - To find the frequency of occurrence of each word, we use the formatted_article_text variable, since it doesn't contain punctuation, digits, or other special characters. Take a look at the following script:
 - In the script below, we rely on the stopwords variable, which was filled with the English stop words from the nltk library in the import cell. We loop through all the words in formatted_article_text and first check whether each one is a stop word. If not, we check whether the word already exists in the word_frequencies dictionary. If the word is encountered for the first time, it is added to the dictionary as a key and its value is set to 1. Otherwise, if the word already exists in the dictionary, its value is incremented by 1.

In [153]:
word_frequencies = {}
for word in nltk.word_tokenize(formatted_article_text):
    if word not in stopwords:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1
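
- The same counting can be sketched more compactly with collections.Counter. The example below uses a tiny made-up stop-word set and a whitespace split in place of NLTK, so it is only an approximation of the cell above. It also shows a case-sensitivity point: NLTK's English stop words are lowercase, so a capitalized "The" is only filtered if the text is lowercased first.

```python
from collections import Counter

# Hypothetical mini stop-word set; the notebook uses NLTK's English list.
toy_stopwords = {"the", "is", "a", "of", "and"}

# Made-up text; whitespace split stands in for nltk.word_tokenize.
toy_text = "The audit firm is a firm of audit and tax advisers"

# Counting as-is: "The" is not in the lowercase stop-word set, so it is kept.
case_sensitive = Counter(
    word for word in toy_text.split() if word not in toy_stopwords
)

# Lowercasing first lets the stop-word filter catch "the".
lowercased = Counter(
    word for word in toy_text.lower().split() if word not in toy_stopwords
)

print(case_sensitive.most_common())
print(lowercased.most_common())
```

- Since the sentence-scoring step lowercases each sentence before looking words up, lowercasing here as well would keep the two steps aligned. That is a possible adjustment, not something the original cell does.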

- Finally, to find the weighted frequency, we divide the number of occurrences of each word by the frequency of the most frequently occurring word, as shown below:

In [154]:
maximum_frequency = max(word_frequencies.values())
for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/maximum_frequency)
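
- A worked example of the normalization on made-up counts (raw_counts is hypothetical): the most frequent word ends up with weight 1.0 and every other word gets a fraction of it.

```python
# Hypothetical raw counts for three words.
raw_counts = {"audit": 4, "firm": 2, "tax": 1}

# Divide every count by the largest count.
highest_count = max(raw_counts.values())
weighted = {word: count / highest_count for word, count in raw_counts.items()}

print(weighted)  # {'audit': 1.0, 'firm': 0.5, 'tax': 0.25}
```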

### Calculating Sentence Scores
 - We have now calculated the weighted frequencies for all the words. The next step is to calculate a score for each sentence by adding up the weighted frequencies of the words that occur in it. The following script calculates sentence scores:
 - In the script below, we first create an empty sentence_scores dictionary. The keys of this dictionary will be the sentences themselves and the values will be the corresponding scores of the sentences. Next, we loop through each sentence in the sentence_list and tokenize the sentence into words.

 - We then check whether the word exists in the word_frequencies dictionary. This check is needed because sentence_list was created from the article_text object, whereas the word frequencies were calculated from the formatted_article_text object with stop words skipped, so numbers, punctuation, and stop words have no entry in the dictionary.

 - We do not want very long sentences in the summary, so we calculate scores only for sentences with fewer than 30 words (you can tweak this parameter for your own use case). Next, we check whether the sentence exists in the sentence_scores dictionary. If it doesn't, we add it as a key and assign it the weighted frequency of the first matching word in the sentence as its value. If the sentence already exists in the dictionary, we simply add the weighted frequency of the word to the existing value.

In [155]:
sentence_scores = {}
for sent in sentence_list:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]
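
- The scoring logic can be traced by hand on a toy example. The sketch below uses made-up weights and sentences, and a whitespace split instead of nltk.word_tokenize, so it is an approximation of the cell above. The length check is done once per sentence rather than once per word, which gives the same result.

```python
# Hypothetical weighted frequencies and sentences.
toy_weights = {"audit": 1.0, "firm": 0.5, "tax": 0.25}
toy_sentences = [
    "The firm offers audit and tax services.",
    "Audit quality matters to every audit firm.",
    "It was founded long ago.",
]
max_words = 30  # same cut-off as the notebook

toy_scores = {}
for sentence in toy_sentences:
    words = sentence.lower().rstrip(".").split()
    if len(words) >= max_words:
        continue  # skip overly long sentences
    # Words without a weight (stop words, unseen words) contribute nothing.
    score = sum(toy_weights.get(word, 0.0) for word in words)
    if score > 0:
        # As in the notebook, sentences with no weighted word get no entry.
        toy_scores[sentence] = score

print(toy_scores)
```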

### Getting the Summary
 - Now we have the sentence_scores dictionary that contains sentences with their corresponding scores. To summarize the article, we can take the top N sentences with the highest scores. The following script retrieves the top 5 sentences and prints them on the screen.

 - In the script below, we use the heapq library and call its nlargest function to retrieve the 5 sentences with the highest scores.

In [157]:
summary_sentences = heapq.nlargest(5, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)
print("~"*50,"SUMMARY","~"*58)
print(summary)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ SUMMARY ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
$7.3 billion came from advisory and $2.5 billion came from transaction advisory services.EY has 231,000 employees working across its member firms as of 2016. Ey received a tax incentive to locate new jobs in San Antonio Texas.What’s weird is that the jobs will not be accounting jobs. That makes them the third largest public accounting firm in the world.EY’s revenues totaled $28.7 billion for their 2015 fiscal year. The solution that EY and Concur have teamed up to build will allow business travelers manage real time immigration and tax assessments. This will allow business travelers to see their immigration and tax obligations that results from their travel before they travel.
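
- heapq.nlargest returns the selected sentences ordered by score, highest first, which is not necessarily the order in which they appear in the article. The sketch below shows this on made-up scores and adds an optional re-ordering step. The re-ordering is an assumption about what a reader might prefer, not part of the notebook, and it relies on dictionaries preserving insertion order (Python 3.7+).

```python
import heapq

# Hypothetical scores, inserted in document order.
toy_scores = {
    "Sentence A.": 2.0,
    "Sentence B.": 0.5,
    "Sentence C.": 2.5,
    "Sentence D.": 1.0,
}

# Keys with the two highest scores, best first.
top_two = heapq.nlargest(2, toy_scores, key=toy_scores.get)

# Optional: put the selected sentences back in their original order.
in_document_order = [s for s in toy_scores if s in top_two]

print(top_two)
print(" ".join(in_document_order))
```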
