# Text Summarization

Text summarization is the task of condensing a long piece of text into a shorter one. The goal is a coherent, fluent summary that keeps only the main points of the document.
Automatic text summarization is a common problem in machine learning and natural language processing (NLP). This notebook builds a simple extractive summarizer: it scores each sentence by the frequency of the words it contains and keeps the highest-scoring sentences.



In [1]:
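# Importing libraries
# (the stop word list and the tokenizers need the NLTK "stopwords" and "punkt"
# data packages, or "punkt_tab" in recent NLTK releases; see nltk.download)
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from string import punctuation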
   

In [3]:
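# Input text to summarize
# (assumes the file is UTF-8 encoded; the context manager closes it after reading)
with open("Day 5/India.txt", "r", encoding="utf-8") as f:
    text = f.read()
text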

"India, officially the Republic of India (Hindi: Bhārat Gaṇarājya),[23] is a country in South Asia. It is the second-most populous country, the seventh-largest country by area, and the most populous democracy in the world. Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west;[f] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand and Indonesia. Modern humans arrived on the Indian subcontinent from Africa no later than 55,000 years ago.[24] Their long occupation, initially in varying forms of isolation as hunter-gatherers, has made the region highly diverse, second only to Africa in human genetic diversity.[25] Settled life emerged on the subcontinent in the western margins of the Indus river basin 9,000 y

In [5]:
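# Stop words and punctuation marks to ignore when counting
# (a set makes the membership test in the counting loop fast)
stopWords = set(stopwords.words("english")) | set(punctuation)

In [6]:
# Tokenizing the text into words
words = word_tokenize(text)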

In [7]:
# Creating a frequency table to keep the
# score of each word
freqTable = {}

for word in words:
    word = word.lower()
    if word not in stopWords:
        if word in freqTable:
            freqTable[word] += 1
        else:
            freqTable[word] = 1

freqTable

{'india': 14,
 'officially': 1,
 'republic': 1,
 'hindi': 1,
 'bhārat': 1,
 'gaṇarājya': 1,
 '23': 1,
 'country': 3,
 'south': 4,
 'asia': 3,
 'second-most': 1,
 'populous': 2,
 'seventh-largest': 1,
 'area': 1,
 'democracy': 1,
 'world': 1,
 'bounded': 1,
 'indian': 3,
 'ocean': 2,
 'arabian': 1,
 'sea': 1,
 'southwest': 1,
 'bay': 1,
 'bengal': 1,
 'southeast': 2,
 'shares': 1,
 'land': 1,
 'borders': 1,
 'pakistan': 1,
 'west': 1,
 'f': 1,
 'china': 1,
 'nepal': 1,
 'bhutan': 1,
 'north': 1,
 'bangladesh': 1,
 'myanmar': 1,
 'east': 2,
 'vicinity': 1,
 'sri': 1,
 'lanka': 1,
 'maldives': 1,
 'andaman': 1,
 'nicobar': 1,
 'islands': 1,
 'share': 1,
 'maritime': 1,
 'border': 1,
 'thailand': 1,
 'indonesia': 1,
 'modern': 1,
 'humans': 1,
 'arrived': 1,
 'subcontinent': 2,
 'africa': 2,
 'later': 1,
 '55,000': 1,
 'years': 2,
 'ago': 2,
 '24': 1,
 'long': 1,
 'occupation': 1,
 'initially': 1,
 'varying': 1,
 'forms': 1,
 'isolation': 1,
 'hunter-gatherers': 1,
 'made': 1,
 'region': 1

In [8]:
# Sorting the frequency table by count,
# most frequent words first
{k: v for k, v in sorted(freqTable.items(), key=lambda item: item[1], reverse=True)}

{'india': 14,
 'south': 4,
 'emerged': 4,
 'country': 3,
 'asia': 3,
 'indian': 3,
 'bce': 3,
 'northern': 3,
 'populous': 2,
 'ocean': 2,
 'southeast': 2,
 'east': 2,
 'subcontinent': 2,
 'africa': 2,
 'years': 2,
 'ago': 2,
 'life': 2,
 'western': 2,
 'indus': 2,
 'basin': 2,
 'gradually': 2,
 'language': 2,
 'hinduism': 2,
 'early': 2,
 'era': 2,
 'also': 2,
 'kingdoms': 2,
 'medieval': 2,
 'islam': 2,
 "'s": 2,
 'empire': 2,
 'rule': 2,
 'british': 2,
 'officially': 1,
 'republic': 1,
 'hindi': 1,
 'bhārat': 1,
 'gaṇarājya': 1,
 '23': 1,
 'second-most': 1,
 'seventh-largest': 1,
 'area': 1,
 'democracy': 1,
 'world': 1,
 'bounded': 1,
 'arabian': 1,
 'sea': 1,
 'southwest': 1,
 'bay': 1,
 'bengal': 1,
 'shares': 1,
 'land': 1,
 'borders': 1,
 'pakistan': 1,
 'west': 1,
 'f': 1,
 'china': 1,
 'nepal': 1,
 'bhutan': 1,
 'north': 1,
 'bangladesh': 1,
 'myanmar': 1,
 'vicinity': 1,
 'sri': 1,
 'lanka': 1,
 'maldives': 1,
 'andaman': 1,
 'nicobar': 1,
 'islands': 1,
 'share': 1,
 'marit
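
The counting and sorting steps above can also be sketched with `collections.Counter` from the standard library. This is a minimal sketch, not the notebook's method: `build_freq_table`, the tiny `STOP_WORDS` set, and the regex tokenizer are hypothetical stand-ins for NLTK's stop word list and `word_tokenize`, chosen so that the block runs without any NLTK data.

```python
import re
from collections import Counter

# Hypothetical stand-in for stopwords.words("english"); far from complete.
STOP_WORDS = {"the", "is", "a", "in", "of", "and", "it"}

def build_freq_table(text, stop_words=STOP_WORDS):
    """Count lower-cased tokens that are not stop words."""
    # Simple regex tokenizer (an assumption, not NLTK's word_tokenize):
    # runs of word characters, optionally joined by hyphens.
    tokens = re.findall(r"\w+(?:-\w+)*", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

sample = "India is a country in South Asia. India is the most populous democracy."
freq = build_freq_table(sample)

# most_common() returns (word, count) pairs sorted by descending count.
print(freq.most_common(3))
```

`Counter` handles the "first occurrence or increment" branch itself, and `most_common()` replaces the explicit `sorted(...)` call.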

In [13]:
# Splitting the text into sentences;
# each sentence is scored in the next cell

sentences = sent_tokenize(text)
# here the len(sentences) is 22
sentences, len(sentences)

(['India, officially the Republic of India (Hindi: Bhārat Gaṇarājya),[23] is a country in South Asia.',
  'It is the second-most populous country, the seventh-largest country by area, and the most populous democracy in the world.',
  'Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west;[f] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east.',
  'In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand and Indonesia.',
  'Modern humans arrived on the Indian subcontinent from Africa no later than 55,000 years ago.',
  '[24] Their long occupation, initially in varying forms of isolation as hunter-gatherers, has made the region highly diverse, second only to Africa in human genetic diversity.',
  '[25] Settled life emerged on the subcontinent in the western margin

In [14]:
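# Creating a dictionary to keep the score of each sentence:
# the sum of the frequencies of the words found in it
sentence_weight = dict()

for sentence in sentences:
    lowered = sentence.lower()
    for word, freq in freqTable.items():
        # Note: this is a substring test, so 'india' also matches 'indian'
        # and a one-letter token such as 'f' matches any sentence that
        # contains that letter, which inflates the scores.
        if word in lowered:
            if sentence in sentence_weight:
                sentence_weight[sentence] += freq
            else:
                sentence_weight[sentence] = freq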

sentence_weight

{'India, officially the Republic of India (Hindi: Bhārat Gaṇarājya),[23] is a country in South Asia.': 34,
 'It is the second-most populous country, the seventh-largest country by area, and the most populous democracy in the world.': 13,
 'Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west;[f] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east.': 50,
 'In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand and Indonesia.': 34,
 'Modern humans arrived on the Indian subcontinent from Africa no later than 55,000 years ago.': 34,
 '[24] Their long occupation, initially in varying forms of isolation as hunter-gatherers, has made the region highly diverse, second only to Africa in human genetic diversity.': 22,
 '[25] Settled life emerged on the subcontinent in t

In [19]:
sentence_weight['India, officially the Republic of India (Hindi: Bhārat Gaṇarājya),[23] is a country in South Asia.']

34
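
The scoring cell tests `word in sentence.lower()`, which is a substring match. A stricter variant compares whole tokens instead. The sketch below contrasts the two rules on made-up data; `tokens_of`, `score_by_tokens`, `score_by_substring`, and the sample counts are hypothetical names and values introduced for illustration, and the regex tokenizer is an assumed stand-in for `word_tokenize`. Unlike the cell above, both functions give a sentence with no matching word a score of 0 instead of leaving it out of the dictionary.

```python
import re

def tokens_of(sentence):
    # Hypothetical regex tokenizer standing in for NLTK's word_tokenize.
    return set(re.findall(r"\w+(?:-\w+)*", sentence.lower()))

def score_by_tokens(sentences, freq_table):
    # Sum the frequency of each distinct whole token of the sentence.
    return {s: sum(freq_table.get(t, 0) for t in tokens_of(s)) for s in sentences}

def score_by_substring(sentences, freq_table):
    # Same matching rule as the notebook cell: a word counts if it
    # occurs anywhere inside the lower-cased sentence.
    return {s: sum(f for w, f in freq_table.items() if w in s.lower())
            for s in sentences}

freq_table = {"india": 2, "indian": 1, "f": 1, "ocean": 1}  # made-up counts
sentences = ["India is a country.", "The Indian Ocean lies far to the south."]

token_scores = score_by_tokens(sentences, freq_table)
substring_scores = score_by_substring(sentences, freq_table)

# Second sentence: whole tokens match 'indian' and 'ocean' only, while
# substring matching also picks up 'india' (in 'indian') and 'f' (in 'far').
print(token_scores[sentences[1]], substring_scores[sentences[1]])  # 2 5
```

Which rule is preferable depends on the goal: substring matching loosely groups related forms such as "india" and "indian", at the cost of spurious hits from very short tokens.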

In [20]:
# Total score over all sentences
sumValues = sum(sentence_weight.values())
sumValues

672

In [21]:
# Average score of a sentence in the original text
# (truncated to an integer)

average = int(sumValues / len(sentence_weight))
print(average, sumValues, len(sentence_weight), sep='\n\n')

30

672

22


In [22]:
# Storing sentences into our summary.
summary = ''
counter = 0
for sentence in sentences:
    # To get a longer summary, multiply the average by a factor lower than 1.3;
    # a higher factor gives a shorter summary.
    if (sentence in sentence_weight) and (sentence_weight[sentence] > (1.3 * average)):
        summary += " " + sentence
        counter += 1

print(counter, summary, sep='\n\n')

4

 Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west;[f] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east. [g][34] In South India, the Middle kingdoms exported Dravidian-languages scripts and religious cultures to the kingdoms of Southeast Asia. In the early medieval era, Christianity, Islam, Judaism, and Zoroastrianism put down roots on India's southern and western coasts. Muslim armies from Central Asia intermittently overran India's northern plains,[37] eventually establishing the Delhi Sultanate, and drawing northern India into the cosmopolitan networks of medieval Islam.
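
With the average of 30 printed earlier, the selection rule keeps sentences scoring above 1.3 × 30 = 39, and 4 of the 22 sentences pass. The sketch below wraps that rule in a function and shows one possible alternative that fixes the summary length instead of the threshold. Both `select_summary` and `top_n_summary` are hypothetical helpers, and the sentences and scores are made-up placeholders rather than values from the notebook.

```python
def select_summary(sentences, scores, factor=1.3):
    # Keep sentences whose score exceeds factor * average, in document order.
    # The average is truncated to an integer, as in the notebook.
    average = int(sum(scores.values()) / len(scores))
    return [s for s in sentences if scores.get(s, 0) > factor * average]

def top_n_summary(sentences, scores, n):
    # Alternative: keep the n highest-scoring sentences, in document order.
    best = set(sorted(scores, key=scores.get, reverse=True)[:n])
    return [s for s in sentences if s in best]

sentences = ["s1", "s2", "s3", "s4"]                 # placeholder sentences
scores = {"s1": 50, "s2": 13, "s3": 34, "s4": 22}    # made-up scores

# Average is int(119 / 4) = 29, so the threshold is 1.3 * 29 = 37.7.
print(select_summary(sentences, scores))    # ['s1']
print(top_n_summary(sentences, scores, 2))  # ['s1', 's3']
```

A threshold adapts the summary length to how uneven the scores are, while a fixed `n` gives a predictable length; joining the returned list with `" ".join(...)` produces the summary string.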
