# Text Summarization with Weighted Frequency Analysis

#### Note: This summarizer relies only on the weighted frequency of words, so it cannot capture semantics. It is a simple mathematical tool that may or may not produce a good summary every time.

Steps followed:

1. Extract the paragraphs.
2. Split the text into sentences.
3. Preprocess: remove special characters, stop words, and numbers.
4. Tokenize the text into words.
5. Compute the weighted frequency of each word: (frequency of the i-th word) / (frequency of the most frequent word).
6. Compute the score of each sentence as the sum of the weighted frequencies of its words.
7. Sort the sentences by score in descending order.
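
The steps above can be sketched end to end with the standard library only. This is a minimal illustration, not the notebook's implementation: the tiny `STOP_WORDS` set, the regex sentence splitter, and the function name `summarize` are all assumptions made for the example (the notebook uses NLTK for stop words and tokenization).

```python
import re
from collections import Counter

# Hypothetical, very small stop-word list; the notebook uses NLTK's English list.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "on", "it"}

def summarize(text, num_sentences=1):
    # Naive sentence split: break after ., ! or ? when whitespace follows.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Keep letters only, lowercase, split on whitespace, drop stop words.
    words = [w for w in re.sub(r"[^a-zA-Z]", " ", text).lower().split()
             if w not in STOP_WORDS]
    if not words:
        return []
    # Weighted frequency = count / count of the most frequent word.
    counts = Counter(words)
    max_count = max(counts.values())
    weights = {w: c / max_count for w, c in counts.items()}

    def score(sentence):
        # A sentence's score is the sum of the weights of its words.
        tokens = re.sub(r"[^a-zA-Z]", " ", sentence).lower().split()
        return sum(weights.get(t, 0.0) for t in tokens)

    # Highest-scoring sentences first.
    return sorted(sentences, key=score, reverse=True)[:num_sentences]

text = "Cats chase mice. Dogs chase cats and mice. Birds sing."
print(summarize(text, 1))
```

Because the score is a plain sum, longer sentences that repeat frequent words tend to rank higher, which is one reason this approach cannot be expected to capture meaning.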

### Installing Beautiful Soup for web scraping

In [1]:
!pip install beautifulsoup4



### Installing lxml for parsing XML and HTML documents

In [2]:
!pip install lxml



### Scraping the data from a URL

In [3]:
import bs4 as bs
import urllib.request
import re

scrapped_data=urllib.request.urlopen("https://en.wikipedia.org/wiki/Machine_learning")
article=scrapped_data.read()

parsing=bs.BeautifulSoup(article,'lxml')
paragraphs=parsing.find_all('p')

text_article=""
for p in paragraphs:
  text_article+=p.text


In [4]:
text_article[:500]

'Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions.[1] Recently, artificial neural networks have been able to surpass many previous approaches in performance.[2][3]\nMachine learning approaches have been applied to many fields including natural language processing, computer vision, speech recognition, emai'
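
The cell above concatenates `p.text` for every `<p>` element without adding a separator. The sketch below imitates that with the standard library's `html.parser` so it runs without a network connection or extra packages. `ParagraphText` and `paragraph_text` are hypothetical names for this illustration, not part of Beautiful Soup.

```python
from html.parser import HTMLParser

class ParagraphText(HTMLParser):
    """Collects the text found inside <p> elements (illustrative stand-in)."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # greater than 0 while inside a <p> element
        self.chunks = []    # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def paragraph_text(html):
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    # Joined with no separator, mirroring `text_article += p.text`.
    return "".join(parser.chunks)

html = "<html><body><h1>Title</h1><p>First <b>bold</b> paragraph.</p><p>Second.</p></body></html>"
print(paragraph_text(html))
```

In this sample the two paragraphs end up glued together as `paragraph.Second.`. The Wikipedia text above avoids this only because its paragraphs happen to end with `\n`; for other pages, appending a space after each paragraph may be a safer choice.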

### Preprocessing

In [5]:
text_article=re.sub(r'\[[0-9]*\]',' ',text_article)
text_article=re.sub(r'\s+',' ',text_article)

In [6]:
text_article[:500]

'Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions. Recently, artificial neural networks have been able to surpass many previous approaches in performance. Machine learning approaches have been applied to many fields including natural language processing, computer vision, speech recognition, email filteri'
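
To see what the two substitutions in cell 5 do, here is a small self-contained sketch on a made-up sample string. The helper name `strip_citations` is an assumption for the example.

```python
import re

def strip_citations(text):
    # Replace bracketed numeric markers such as [1] or [23] with a space...
    text = re.sub(r"\[[0-9]*\]", " ", text)
    # ...then collapse every run of whitespace (including newlines) to one space.
    return re.sub(r"\s+", " ", text)

sample = "explicit instructions.[1] Recently, networks improved.[2][3]\nMachine learning"
print(strip_citations(sample))
```

Note that `[0-9]*` also matches an empty pair of brackets (`[]`), while non-numeric markers such as `[a]` are left untouched.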

In [7]:
# Remove punctuation, digits and every other non-letter character.
formatted_text_article=re.sub(r'[^a-zA-Z]',' ',text_article)
formatted_text_article=re.sub(r'\s+',' ',formatted_text_article)


In [8]:
formatted_text_article[:500]

'Machine learning ML is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data and thus perform tasks without explicit instructions Recently artificial neural networks have been able to surpass many previous approaches in performance Machine learning approaches have been applied to many fields including natural language processing computer vision speech recognition email filtering agricu'
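
The letters-only filter is blunt, and it helps to see what it discards. The sketch below applies the same two patterns to a made-up sentence. `letters_only` is a hypothetical helper, and the final `.strip()` is an addition that the notebook cell does not have.

```python
import re

def letters_only(text):
    # Every character outside a-z / A-Z becomes a space...
    text = re.sub(r"[^a-zA-Z]", " ", text)
    # ...and runs of whitespace collapse to a single space.
    return re.sub(r"\s+", " ", text).strip()

print(letters_only("The field's 10+2 system uses meta-learning (e.g. GPT-4)."))
```

Possessives, hyphenated terms and abbreviations are split into fragments (`field s`, `meta learning`, `e g`) and numbers vanish entirely, so the word counts built from this text are only an approximation.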

### Tokenization

In [9]:
import nltk
nltk.download('punkt')
sentence_list=nltk.sent_tokenize(text_article) # use text_article (punctuation intact) so sentences can be split at full stops

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sifta\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
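
Sentence splitting is harder than it looks. The sketch below is a naive regex splitter, offered only as a rough stand-in to show the problem. It is not how NLTK's Punkt tokenizer works, and `naive_sent_split` is a hypothetical name.

```python
import re

def naive_sent_split(text):
    # Split after ., ! or ? when whitespace follows. The rule knows nothing
    # about abbreviations, so "e.g." is treated as the end of a sentence.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

parts = naive_sent_split(
    "ML is a field of study. It learns from data (e.g. images). It generalizes."
)
for part in parts:
    print(repr(part))
```

The second sentence is cut in two at `(e.g.`. The summary printed later in this notebook also contains a sentence that stops at `(e.g.`, which suggests that even a trained tokenizer can occasionally split at an abbreviation.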


### Creating Word Frequency

In [10]:
nltk.download('stopwords')

stopwords=nltk.corpus.stopwords.words('english')
word_frequencies={}
for word in nltk.word_tokenize(formatted_text_article):
  # The comparison is case-sensitive and the stop-word list is lowercase,
  # so capitalised tokens such as 'The' are not filtered out here.
  if word not in stopwords:
    word_frequencies[word]=word_frequencies.get(word,0)+1

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sifta\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
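
A possible variation, shown here as a sketch rather than as the notebook's method, is to lowercase tokens before counting and to let `collections.Counter` do the bookkeeping. The function name `count_words` and the three-word stop list are assumptions for the example.

```python
from collections import Counter

def count_words(tokens, stop_words):
    # Lowercase first so "The" is dropped and "Machine"/"machine" are merged.
    lowered = (t.lower() for t in tokens)
    return Counter(t for t in lowered if t not in stop_words)

tokens = "Machine learning is the study of learning machines The machine learns".split()
freqs = count_words(tokens, stop_words={"is", "the", "of"})
print(freqs)
```

Lowercasing here would also keep the keys consistent with the sentence-scoring step, which lowercases each sentence before looking words up.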


In [11]:
# Show the first 10 (word, count) pairs in insertion order
word_freq = list(word_frequencies.items())[:10]

for word, freq in word_freq:
    print(f"Word: {word} | freq: {freq}")


Word: Machine | freq: 16
Word: learning | freq: 192
Word: ML | freq: 4
Word: field | freq: 19
Word: study | freq: 8
Word: artificial | freq: 24
Word: intelligence | freq: 10
Word: concerned | freq: 4
Word: development | freq: 1
Word: statistical | freq: 9


### Converting Word Frequency to Weighted Word Frequency

In [12]:
maximum_frequency=max(word_frequencies.values())
for word in word_frequencies.keys():
  word_frequencies[word]=(word_frequencies[word]/maximum_frequency)

In [13]:
# Show the first 10 (word, weighted frequency) pairs in insertion order
weighted_word = list(word_frequencies.items())[:10]

for word, weight in weighted_word:
    print(f"Word: {word} | weighted freq: {weight}")


Word: Machine | weighted freq: 0.08333333333333333
Word: learning | weighted freq: 1.0
Word: ML | weighted freq: 0.020833333333333332
Word: field | weighted freq: 0.09895833333333333
Word: study | weighted freq: 0.041666666666666664
Word: artificial | weighted freq: 0.125
Word: intelligence | weighted freq: 0.052083333333333336
Word: concerned | weighted freq: 0.020833333333333332
Word: development | weighted freq: 0.005208333333333333
Word: statistical | weighted freq: 0.046875
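
The normalisation can be checked by hand against the numbers printed above: `learning` is the most frequent word with 192 occurrences, so `artificial` gets 24 / 192 = 0.125. A minimal sketch follows, using a few counts copied from the earlier output. `weight_frequencies` is a hypothetical helper that returns a new dictionary instead of updating the existing one in place.

```python
def weight_frequencies(freqs):
    # Divide each count by the largest count so the top word gets weight 1.0.
    max_freq = max(freqs.values())
    return {word: count / max_freq for word, count in freqs.items()}

# Counts copied from the word-frequency output shown earlier.
sample = {"Machine": 16, "learning": 192, "artificial": 24, "statistical": 9}
weights = weight_frequencies(sample)
print(weights["artificial"])  # 24 / 192
```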


### Creating Sentence scores

In [14]:
sentence_scores={}
for sent in sentence_list:
  # Only sentences shorter than 30 space-separated words are scored.
  if len(sent.split(' '))>=30:
    continue
  for word in nltk.word_tokenize(sent.lower()):
    # Tokens are lowercased here, so only lowercase keys of word_frequencies can match.
    if word in word_frequencies:
      sentence_scores[sent]=sentence_scores.get(sent,0)+word_frequencies[word]

In [15]:
# Show the first 10 scored sentences in insertion order (not the 10 highest scores)
first_10_sentences = list(sentence_scores.items())[:10]

for sentence, score in first_10_sentences:
    print(f"Sentence: {sentence} | Score: {score}")


Sentence: Recently, artificial neural networks have been able to surpass many previous approaches in performance. | Score: 0.5677083333333334
Sentence: Machine learning approaches have been applied to many fields including natural language processing, computer vision, speech recognition, email filtering, agriculture, and medicine. | Score: 2.0312500000000004
Sentence: ML is known in its application across business problems under the name predictive analytics. | Score: 0.203125
Sentence: Although not all machine learning is statistically based, computational statistics is an important source of the field's methods. | Score: 1.9531249999999998
Sentence: The mathematical foundations of ML are provided by mathematical optimization (mathematical programming) methods. | Score: 0.41145833333333337
Sentence: Data mining is a related (parallel) field of study, focusing on exploratory data analysis (EDA) through unsupervised learning. | Score: 2.375
Sentence: From a theoretical point of view Pro
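
The scoring rule can be tried on a toy example. The sketch below is a simplified stand-in: it splits on whitespace and strips punctuation instead of calling `nltk.word_tokenize`, and the name `score_sentences` and the `max_words` parameter are assumptions. It also shows a side effect of the approach in cell 14: because each sentence is lowercased before lookup, a key stored as `Machine` never matches.

```python
def score_sentences(sentences, weights, max_words=30):
    scores = {}
    for sent in sentences:
        # Skip long sentences once, before looking at any word.
        if len(sent.split(" ")) >= max_words:
            continue
        # Simplified tokenization (assumption): whitespace split, punctuation stripped.
        tokens = [t.strip(".,;:!?()\"'").lower() for t in sent.split()]
        total = sum(weights.get(t, 0.0) for t in tokens)
        # Like the notebook, only sentences with at least one known word are kept.
        if total > 0:
            scores[sent] = total
    return scores

weights = {"learning": 1.0, "data": 0.5, "Machine": 0.25}
sentences = ["Machine learning uses data.", "Nothing relevant here."]
print(score_sentences(sentences, weights))
```

The first sentence scores 1.5 rather than 1.75 because the lowercased token `machine` does not match the capitalised key `Machine`.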

### Importing heapq to sort the sentence scores

In [16]:
import heapq
summary_sentences=heapq.nlargest(7,sentence_scores,key=sentence_scores.get)

summary='\n'.join(summary_sentences)
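
`heapq.nlargest` returns the sentences ordered by score, not by their position in the article, so the summary can read out of sequence. The sketch below shows the selection and one optional way to restore document order. The function `top_sentences` and its `keep_document_order` flag are hypothetical additions for illustration.

```python
import heapq

def top_sentences(scores, n, keep_document_order=False):
    # Iterating a dict yields its keys; key=scores.get ranks them by score.
    best = heapq.nlargest(n, scores, key=scores.get)
    if keep_document_order:
        # Dicts preserve insertion order, so this restores order of first appearance.
        position = {sent: i for i, sent in enumerate(scores)}
        best.sort(key=position.get)
    return best

scores = {"First sentence.": 1.3, "Second sentence.": 0.4, "Third sentence.": 2.1}
print(top_sentences(scores, 2))
print(top_sentences(scores, 2, keep_document_order=True))
```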


### Summarized Sentences

In [17]:
print(summary)

Robot learning is inspired by a multitude of machine learning methods, starting from supervised learning, reinforcement learning, and finally meta-learning (e.g.
Semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data).
A representative book on research into machine learning during the 1960s was Nilsson's book on Learning Machines, dealing mostly with machine learning for pattern classification.
When training a machine learning model, machine learning engineers need to target and collect a large and representative sample of data.
Rule-based machine learning approaches include learning classifier systems, association rule learning, and artificial immune systems.
Embedded Machine Learning could be applied through several techniques including hardware acceleration, using approximate computing, optimization of machine learning models and many more.
Machine learning also has intimate ties


### User input: URL or text

In [19]:
import bs4 as bs
import urllib.request
import re
import nltk
import heapq

def summarize_text(text, num_sentences):
    formatted_text = re.sub(r'\[[0-9]*\]', ' ', text)
    formatted_text = re.sub(r'\s+', ' ', formatted_text)
    formatted_text = re.sub(r'[^a-zA-Z]', ' ', formatted_text)
    formatted_text = re.sub(r'\s+', ' ', formatted_text)

    sentence_list = nltk.sent_tokenize(text)
    stopwords = nltk.corpus.stopwords.words('english')
    word_frequencies = {}

    for word in nltk.word_tokenize(formatted_text):
        if word not in stopwords:
            if word not in word_frequencies.keys():
                word_frequencies[word] = 1
            else:
                word_frequencies[word] += 1

    maximum_frequency = max(word_frequencies.values())
    for word in word_frequencies.keys():
        word_frequencies[word] = (word_frequencies[word] / maximum_frequency)

    sentence_scores = {}
    for sent in sentence_list:
        for word in nltk.word_tokenize(sent.lower()):
            if word in word_frequencies.keys():
                if len(sent.split(' ')) < 30:
                    if sent not in sentence_scores.keys():
                        sentence_scores[sent] = word_frequencies[word]
                    else:
                        sentence_scores[sent] += word_frequencies[word]

    summary_sentences = heapq.nlargest(num_sentences, sentence_scores, key=sentence_scores.get)
    summary = '\n'.join(summary_sentences)

    return summary

def summarize_url(url, num_sentences):
    scrapped_data = urllib.request.urlopen(url)
    article = scrapped_data.read()

    parsing = bs.BeautifulSoup(article, 'lxml')
    paragraphs = parsing.find_all('p')

    text_article = ""
    for p in paragraphs:
        text_article += p.text

    summary = summarize_text(text_article, num_sentences)
    return summary

def main():
    source = input("Enter 'url' to summarize content from a URL, or 'text' to summarize your own paragraph: \n")

    if source.lower() == 'url':
        url = input("Enter the URL to summarize: \n")
        num_sentences = int(input("Enter number of sentences in which you want to summarize: \n"))
        summary = summarize_url(url, num_sentences)
    elif source.lower() == 'text':
        text = input("Enter the paragraph to summarize: \n")
        num_sentences = int(input("Enter number of sentences in which you want to summarize: \n"))
        summary = summarize_text(text, num_sentences)
    else:
        print("Invalid input. Please enter 'url' or 'text'.")
        return  # nothing to summarize, so stop before `summary` is used

    print("\n*****\033[1m\033[4m\033[7mSUMMARIZATION:\033[0m*****\n")
    print(summary)

if __name__ == "__main__":
    main()
 

Enter 'url' to summarize content from a URL, or 'text' to summarize your own paragraph: 
url
Enter the URL to summarize: 
https://en.wikipedia.org/wiki/Education_in_India
Enter number of sentences in which you want to summarize: 
10

*****SUMMARIZATION:*****

The policy is a comprehensive framework for elementary education to higher education as well as vocational training in both rural and urban India.
[7]
Education in India covers different levels and types of learning, such as early childhood education, primary education, secondary education, higher education, and vocational education.
[159][160]
The school education structure in India typically follows a 10+2 system, which consists of 10 years of primary and secondary education followed by two years of higher secondary education.
These terms are widely used across the country to refer to the stage of education that follows primary education and precedes higher secondary education.
The District Education Revitalisati
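
The output above contains lines made up only of citation markers such as `[7]`. One plausible cause is that `summarize_text` passes the raw `text` to `nltk.sent_tokenize`, so markers and newlines survive inside the returned sentences. The sketch below shows one way to tidy the selected sentences afterwards. `tidy_summary_sentences` is a hypothetical helper, and the sample strings are shortened, made-up versions of the sentences above.

```python
import re

def tidy_summary_sentences(sentences):
    cleaned = []
    for sent in sentences:
        # Remove bracketed numeric markers, then collapse newlines and extra spaces.
        stripped = re.sub(r"\s+", " ", re.sub(r"\[[0-9]*\]", " ", sent)).strip()
        # Keep only candidates that still contain at least one letter.
        if re.search(r"[a-zA-Z]", stripped):
            cleaned.append(stripped)
    return cleaned

candidates = [
    "[7]\nEducation in India covers different levels.",
    "[159][160]\nThe school education structure follows a 10+2 system.",
    "[3]",
]
print(tidy_summary_sentences(candidates))
```

Cleaning the text before sentence tokenization, as the step-by-step cells earlier in the notebook do, would be an alternative that avoids the problem at its source.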

In [20]:
if __name__ == "__main__":
    main()


Enter 'url' to summarize content from a URL, or 'text' to summarize your own paragraph: 
text
Enter the paragraph to summarize: 
Cricket is a popular sport played worldwide, particularly in countries like India, England, Australia, and South Africa. It's a bat-and-ball game played between two teams of eleven players each. The game is played on a grass field with a flat strip called the pitch at the center. The objective is to score runs by hitting the ball and running between two sets of wickets, while the opposing team tries to get the batsmen out. The game has different formats, including Test cricket, One Day Internationals (ODIs), and Twenty20 (T20) matches, each with its own set of rules and strategies. Test cricket is the longest format, lasting up to five days, while T20 matches are completed in a few hours. Cricket has a rich history dating back centuries, with its origins traced to medieval England. It has evolved over time, with innovations such as limited-overs cricket and t

# Thank You