# Text Summarization of web text

### Process:
    1.Collect data through web scraping
    2.cleanup data
    3.Using NLTK to build tokens
    4.Calculating word Frequency
    5.Weighted frequency for each words
    6.Calculate score for each sentence
    7.Select top 10 sentences for summary
    8.Using text blob for polarity of the summary

# Collecting data

In [15]:
from bs4 import BeautifulSoup
import re
import nltk
import requests
import heapq
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\abhay\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\abhay\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [16]:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
url = "https://en.wikipedia.org/wiki/Machine_learning"
res = requests.get(url,headers=headers)
print("Getting the data......\n")
summary = ""
soup = BeautifulSoup(res.text,'html.parser') 
content = soup.findAll("p")
for text in content:
    summary +=text.text
print(summary)  

Getting the data......

Machine learning (ML) is the study of computer algorithms that improve automatically through experience.[1][2] It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.[3] Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop  conventional algorithms to perform the needed tasks.
Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a related field of study, focusing on exploratory data analysis through unsupervised learning.[5][6] In its application across business problems, m

## Cleaning Data

In [17]:
def clean(text):
    text = re.sub(r"\[[0-9]*\]"," ",text)
    text = text.lower()
    text = re.sub(r'\s+'," ",text)
    text = re.sub(r","," ",text)
    return text
summary = clean(summary)
print(summary)

machine learning (ml) is the study of computer algorithms that improve automatically through experience. it is seen as a subset of artificial intelligence. machine learning algorithms build a mathematical model based on sample data  known as "training data"  in order to make predictions or decisions without being explicitly programmed to do so. machine learning algorithms are used in a wide variety of applications  such as email filtering and computer vision  where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks. machine learning is closely related to computational statistics  which focuses on making predictions using computers. the study of mathematical optimization delivers methods  theory and application domains to the field of machine learning. data mining is a related field of study  focusing on exploratory data analysis through unsupervised learning. in its application across business problems  machine learning is also referred to as p

# Tokenizing

In [18]:
##Tokenixing
from nltk.tokenize import sent_tokenize,word_tokenize
sent_tokens = sent_tokenize(summary)

summary = re.sub(r"[^a-zA-z]"," ",summary)
word_tokens = word_tokenize(summary)

# Calculating Frequency

In [19]:
## Removing Stop words
from nltk.corpus import stopwords
word_frequency = {}
stopwords =  set(stopwords.words("english"))

for word in word_tokens:
    if word not in stopwords:
        if word not in word_frequency.keys():
            word_frequency[word]=1
        else:
            word_frequency[word] +=1
print(word_frequency) 

{'machine': 89, 'learning': 163, 'ml': 1, 'study': 5, 'computer': 12, 'algorithms': 43, 'improve': 5, 'automatically': 1, 'experience': 4, 'seen': 2, 'subset': 2, 'artificial': 22, 'intelligence': 8, 'build': 3, 'mathematical': 8, 'model': 33, 'based': 14, 'sample': 3, 'data': 79, 'known': 11, 'training': 46, 'order': 5, 'make': 5, 'predictions': 8, 'decisions': 4, 'without': 7, 'explicitly': 3, 'programmed': 3, 'used': 27, 'wide': 1, 'variety': 4, 'applications': 4, 'email': 3, 'filtering': 2, 'vision': 4, 'difficult': 2, 'infeasible': 2, 'develop': 2, 'conventional': 1, 'perform': 10, 'needed': 4, 'tasks': 15, 'closely': 3, 'related': 9, 'computational': 4, 'statistics': 10, 'focuses': 3, 'making': 3, 'using': 11, 'computers': 5, 'optimization': 5, 'delivers': 1, 'methods': 19, 'theory': 10, 'application': 4, 'domains': 3, 'field': 18, 'mining': 11, 'focusing': 1, 'exploratory': 1, 'analysis': 11, 'unsupervised': 13, 'across': 2, 'business': 1, 'problems': 11, 'also': 16, 'referred':

# Calculating weighted frequency

In [20]:
maximum_frequency = max(word_frequency.values())
print(maximum_frequency)

163


In [21]:
for word in word_frequency.keys():
    word_frequency[word] = (word_frequency[word]/maximum_frequency)
print(word_frequency)

{'machine': 0.5460122699386503, 'learning': 1.0, 'ml': 0.006134969325153374, 'study': 0.03067484662576687, 'computer': 0.0736196319018405, 'algorithms': 0.26380368098159507, 'improve': 0.03067484662576687, 'automatically': 0.006134969325153374, 'experience': 0.024539877300613498, 'seen': 0.012269938650306749, 'subset': 0.012269938650306749, 'artificial': 0.13496932515337423, 'intelligence': 0.049079754601226995, 'build': 0.018404907975460124, 'mathematical': 0.049079754601226995, 'model': 0.20245398773006135, 'based': 0.08588957055214724, 'sample': 0.018404907975460124, 'data': 0.48466257668711654, 'known': 0.06748466257668712, 'training': 0.2822085889570552, 'order': 0.03067484662576687, 'make': 0.03067484662576687, 'predictions': 0.049079754601226995, 'decisions': 0.024539877300613498, 'without': 0.04294478527607362, 'explicitly': 0.018404907975460124, 'programmed': 0.018404907975460124, 'used': 0.1656441717791411, 'wide': 0.006134969325153374, 'variety': 0.024539877300613498, 'appli

## Sentense Score

In [22]:
sentences_score = {}
for sentence in sent_tokens:
    for word in word_tokenize(sentence):
        if word in word_frequency.keys():
            if (len(sentence.split(" "))) <30:
                if sentence not in sentences_score.keys():
                    sentences_score[sentence] = word_frequency[word]
                else:
                    sentences_score[sentence] += word_frequency[word]
                    
print(max(sentences_score.values()))
def get_key(val): 
    for key, value in sentences_score.items(): 
         if val == value: 
             return key 
key = get_key(max(sentences_score.values()))
print(key+"\n")
print(sentences_score)

4.846625766871166
semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data).

{'machine learning (ml) is the study of computer algorithms that improve automatically through experience.': 1.98159509202454, 'it is seen as a subset of artificial intelligence.': 0.2085889570552147, 'machine learning is closely related to computational statistics  which focuses on making predictions using computers.': 1.889570552147239, 'the study of mathematical optimization delivers methods  theory and application domains to the field of machine learning.': 1.9938650306748467, 'data mining is a related field of study  focusing on exploratory data analysis through unsupervised learning.': 2.392638036809816, 'in its application across business problems  machine learning is also referred to as predictive analytics.': 1.7975460122699387, 'machine learning involves computers discovering how they can perform t

# Creating summary

In [23]:
import heapq
summary = heapq.nlargest(5,sentences_score,key=sentences_score.get)
print(" ".join(summary))

###  POlarity of the text

from textblob import TextBlob
obj = TextBlob(" ".join(summary))
sentiment = obj.sentiment.polarity
if sentiment == 0:
    print("\nText is neutral")
elif sentiment >0:
    print("\nText is positive")
else :
    print("\nText is negative")

semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data). a representative book of the machine learning research during the 1960s was the nilsson's book on learning machines  dealing mostly with machine learning for pattern classification. rule-based machine learning approaches include learning classifier systems  association rule learning  and artificial immune systems. machine learning also has intimate ties to optimization: many learning problems are formulated as minimization of some loss function on a training set of examples. generalization in this context is the ability of a learning machine to perform accurately on new  unseen examples/tasks after having experienced a learning data set.

Text is positive


In [9]:
def gen_summary(url,sent):
    from bs4 import BeautifulSoup
    import re
    import requests
    import heapq
    from textblob import TextBlob
    from nltk.tokenize import sent_tokenize,word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer 
  
    lemmatizer = WordNetLemmatizer() 
    
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    res = requests.get(url,headers=headers)
    summary = ""
    soup = BeautifulSoup(res.text,'html.parser') 
    content = soup.findAll("p")
    for text in content:
        summary +=text.text 
    def clean(text):
        text = re.sub(r"\[[0-9]*\]"," ",text)
        text = text.lower()
        text = re.sub( r'\[.*?\]', '', text)
        text = re.sub(r'\s+'," ",text)
        text = re.sub(r","," ",text)
        return text
    summary = clean(summary)  
    ##Tokenixing
    sent_tokens = sent_tokenize(summary)

    summary = re.sub(r"[^a-zA-z]"," ",summary)
    word_tokens = word_tokenize(summary)
    ## Removing Stop words

    word_frequency = {}
    stopwords =  set(stopwords.words("english"))

    for word in word_tokens:
        if word not in stopwords:
            if word not in word_frequency.keys():
                word = lemmatizer.lemmatize(word)
                word_frequency[word]=1
            else:
                word_frequency[word] +=1
    maximum_frequency = max(word_frequency.values()) 
    for word in word_frequency.keys():
        word_frequency[word] = (word_frequency[word]/maximum_frequency)
    sentences_score = {}
    for sentence in sent_tokens:
        for word in word_tokenize(sentence):
            if word in word_frequency.keys():
                if (len(sentence.split(" "))) <30:
                    if sentence not in sentences_score.keys():
                        sentences_score[sentence] = word_frequency[word]
                    else:
                        sentences_score[sentence] += word_frequency[word]
    def get_key(val): 
        for key, value in sentences_score.items(): 
            if val == value: 
                return key 
    key = get_key(max(sentences_score.values()))
    summary = heapq.nlargest(sent,sentences_score,key=sentences_score.get)
    summary = " ".join(summary)
    print(summary)

    ###  POlarity of the text


    obj = TextBlob(" ".join(summary))
    sentiment = obj.sentiment.polarity
    if sentiment == 0:
        print("\nText is neutral")
    elif sentiment >0:
        print("\nText is positive")
    else :
        print("\nText is negative")

In [10]:
gen_summary('https://en.wikipedia.org/wiki/data',5)

data processing commonly occurs by stages  and the "processed data" from one stage may be considered the "raw data" of the next stage. data analysis methodologies vary and include data triangulation and data percolation. whenever data needs to be registered  data exists in the form of a data documents. data may be used as a plural noun in this sense  with some writers—usually scientific writers—in the 20th century using datum in the singular and data for plural. according to a common view  data are collected and analyzed; data only becomes information suitable for making decisions once it has been analyzed in some fashion.

Text is neutral


In [30]:
import re
pattern = r'\[.*?\]'
s = """Issachar is a rawboned[a] donkey lying down among the sheep pens.[b]"""
re.sub(pattern, '', s)


'Issachar is a rawboned donkey lying down among the sheep pens.'