## Text summarization based on the frequency of words in the text

Importing libraries

In [1]:
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize
import re

Reading the dataset

In [2]:
df = pd.read_excel('TASK.xlsx')

In [3]:
df.head()

Unnamed: 0,TEST DATASET,Unnamed: 1
0,,Introduction
1,,Acnesol Gel is an antibiotic that fights bacte...
2,,Ambrodil Syrup is used for treating various re...
3,,Augmentin 625 Duo Tablet is a penicillin-type ...
4,,Azithral 500 Tablet is an antibiotic used to t...


In [4]:
df.shape

(1001, 2)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001 entries, 0 to 1000
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   TEST DATASET  0 non-null      float64
 1   Unnamed: 1    1001 non-null   object 
dtypes: float64(1), object(1)
memory usage: 15.8+ KB


In the dataset we can see that the 'TEST DATASET' column contains only NaN values, so we will drop it. The second column is named 'Unnamed: 1', and its 0th row contains only the word 'Introduction' rather than a paragraph. So, we will rename the second column to 'introduction' and remove the 0th row from the dataset.

Dropping 'TEST DATASET' column

In [6]:
df.drop('TEST DATASET',inplace = True,axis=1)

Renaming 'Unnamed: 1' column as 'introduction'

In [7]:
df.rename({'Unnamed: 1':'introduction'},inplace=True,axis=1)

Removing the 0th row from the dataset and resetting the index of the dataset.

In [8]:
df = df[1:]
df = df.reset_index(drop=True)
df.head()

Unnamed: 0,introduction
0,Acnesol Gel is an antibiotic that fights bacte...
1,Ambrodil Syrup is used for treating various re...
2,Augmentin 625 Duo Tablet is a penicillin-type ...
3,Azithral 500 Tablet is an antibiotic used to t...
4,Alkasol Oral Solution is a medicine used in th...


Since this dataset contains free text, we will clean it first, because the text contains many numbers, special characters, stopwords, etc. Our text summarization is based on the frequency of words in the text, so we will keep only the important words.

Steps to clean text:
* Remove all numbers and special characters with a regular expression and convert the text to lowercase.
* Tokenize the text, which converts it to a list of words.
* Convert each word to its root form with stemming.
* Remove all the stopwords.

In [9]:
def clean_text(text):
    text = re.sub('[^A-Za-z]',' ',text).lower()
    words = word_tokenize(text)
    stopWords = set(stopwords.words('english'))
    ps = PorterStemmer()
    cleanedText = []
    for word in words:
        word = ps.stem(word)
        if word in stopWords:
            continue
        else:
            cleanedText.append(word)
    return cleanedText
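
To see what the first two cleaning steps do on their own, here is a minimal sketch that uses only the standard library. The sample string is made up for illustration, and `str.split` stands in for `word_tokenize`, which is an assumption that is reasonable here only because the regular expression has already replaced all punctuation with spaces.

```python
import re

# Hypothetical sample text, not taken from the dataset
sample = "Augmentin 625 Duo Tablet is a penicillin-type antibiotic."

# Replace every non-letter character with a space, then lowercase
letters_only = re.sub('[^A-Za-z]', ' ', sample).lower()

# str.split stands in for word_tokenize in this sketch
tokens = letters_only.split()
print(tokens)
```

Note that the number 625 disappears and 'penicillin-type' becomes two separate tokens, since the hyphen is replaced by a space.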

#### Creating a dictionary of word frequency. 

The 'word_freq_dict' function returns a dictionary with each word as a key and its frequency as the value.

In [10]:
def word_freq_dict(text):
    words = clean_text(text)
    freq_dict = {}
    for word in words:
        if word in freq_dict:
            freq_dict[word]+=1
        else:
            freq_dict[word]=1
    return freq_dict
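
As a side note, the same counting can be done with `collections.Counter` from the standard library. The sketch below compares the two approaches on a hypothetical list of already-cleaned tokens (a stand-in for the output of 'clean_text'); it is an alternative, not the method used in this notebook.

```python
from collections import Counter

# Hypothetical, already-cleaned tokens (stand-in for clean_text output)
tokens = ['antibiot', 'fight', 'bacteria', 'antibiot', 'treat', 'bacteria', 'antibiot']

# Manual counting, equivalent to the loop in word_freq_dict
freq_dict = {}
for word in tokens:
    freq_dict[word] = freq_dict.get(word, 0) + 1

# Counter builds the same mapping in one call
counter = Counter(tokens)

print(freq_dict)
print(dict(counter) == freq_dict)
```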

#### Calculating score of sentences in the text

The text we want to summarize contains many sentences. So, we will calculate a score for every sentence based on the frequency of the words present in the sentence.

Long sentences contain more words, so their raw scores will be higher. To compensate, we divide the score of a sentence by the number of words left in the sentence after cleaning.
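
A small worked example may make the effect of this division clearer. The frequencies and token lists below are hypothetical, and 'normalized_score' is a helper introduced only for this illustration.

```python
# Hypothetical word frequencies for a whole text
freq_dict = {'antibiot': 3, 'bacteria': 2, 'treat': 1, 'skin': 1, 'infect': 1}

def normalized_score(words, freq_dict):
    # Add the frequency of each distinct word, then divide by the sentence length
    raw = sum(freq_dict.get(word, 0) for word in set(words))
    return raw, round(raw / len(words), 2)

short_sentence = ['antibiot', 'bacteria']
long_sentence = ['antibiot', 'bacteria', 'treat', 'skin', 'infect']

print(normalized_score(short_sentence, freq_dict))  # (5, 2.5)
print(normalized_score(long_sentence, freq_dict))   # (8, 1.6)
```

The long sentence has the higher raw score (8 against 5), but after dividing by length the short sentence scores higher (2.5 against 1.6), because it consists only of frequent words.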

The 'sentence_score' function returns a dictionary of sentence scores. The key is the slice 'sentence[5:20]' (the 6th to 20th characters of the sentence), chosen so that the key of one sentence is unlikely to match the key of another sentence. The value is the score of the sentence.

In [11]:
def sentence_score(text,freq_dict):
    sentences = sent_tokenize(text)
    sentence_importance = {}
    for sentence in sentences:
        words = clean_text(sentence)
        sentence_length = len(words)
        # Skip sentences with no words left after cleaning (avoids dividing by zero)
        if sentence_length == 0:
            continue
        # Add the frequency of each distinct word in the sentence
        score = sum(freq_dict.get(word,0) for word in set(words))
        # If two sentences share the same key, the later one overwrites the earlier one
        sentence_importance[sentence[5:20]] = round(score/sentence_length,2)
    return sentence_importance
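
One caveat of using a character slice as the dictionary key is that two different sentences can produce the same key. The sketch below shows this with two made-up sentences and, as one possible alternative (not the approach used in this notebook), keys the sentences by their position instead.

```python
# Two hypothetical sentences that differ only near the start and the end
first = "This medicine should be taken with food."
second = "That medicine should be taken at night."

# Both slices cover the same characters, so the keys collide
key_first = first[5:20]
key_second = second[5:20]
print(key_first == key_second)

# Keyed by slice: the second sentence overwrites the first
by_slice = {}
for sentence in [first, second]:
    by_slice[sentence[5:20]] = sentence

# Keyed by position: every sentence keeps its own entry
by_index = dict(enumerate([first, second]))

print(len(by_slice), len(by_index))
```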

#### Calculate average score of the text

We calculate the average score of the text because we want to show only those sentences whose score is greater than the average score.

In [12]:
def average_score(sentence_importance):
    # An empty dictionary has no average; return 0 instead of dividing by zero
    if not sentence_importance:
        return 0
    total = sum(sentence_importance.values())
    average = round(total/len(sentence_importance),2)
    return average

#### Generating summary

In [13]:
def create_summary(text,average,sentence_importance):
    sentences = sent_tokenize(text)
    summary = ''
    for sentence in sentences:
        if sentence[5:20] in sentence_importance and sentence_importance[sentence[5:20]]>average:
            summary += " "+sentence
    return summary
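
The selection rule can be illustrated with a few hypothetical scores. In this sketch the scores are keyed by sentence position rather than by a character slice, purely to keep the example short.

```python
# Hypothetical sentence scores, keyed by sentence position
scores = {0: 2.5, 1: 1.6, 2: 0.8, 3: 3.1}

# Average score of the text, rounded as in average_score
average = round(sum(scores.values()) / len(scores), 2)

# Keep only the sentences that score strictly above the average
selected = [key for key, score in scores.items() if score > average]

print(average)
print(selected)
```

With an average of 2.0, only the sentences at positions 0 and 3 would appear in the summary, in their original order.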

We are going to add the summaries to the same dataframe in a separate column named 'summary', so that they are saved next to the original texts.

In [14]:
df['summary']=''

In [15]:
for i in range(df.shape[0]):
    text = df.introduction[i]
    word_frequency_dictionary = word_freq_dict(text)
    text_sentence_score = sentence_score(text,word_frequency_dictionary)
    average_val = average_score(text_sentence_score)
    summary = create_summary(text,average_val,text_sentence_score)
    # Use .loc to avoid chained assignment, which may not update df
    df.loc[i,'summary'] = summary
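
The row-by-row loop can also be expressed with `Series.apply`. The sketch below uses a tiny made-up dataframe, and 'summarize' is a hypothetical stand-in that simply keeps the first sentence; in the notebook it would instead chain 'word_freq_dict', 'sentence_score', 'average_score' and 'create_summary'.

```python
import pandas as pd

def summarize(text):
    # Hypothetical stand-in for the notebook's pipeline: keep the first sentence
    return text.split('. ')[0]

# Made-up data for illustration only
df = pd.DataFrame({'introduction': [
    'First sentence. Second sentence.',
    'Only one sentence here',
]})

# apply runs summarize on every value and returns a new Series
df['summary'] = df['introduction'].apply(summarize)
print(df['summary'].tolist())
```

This avoids indexing by row number altogether, so it does not depend on the index having been reset.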

In [16]:
df.head()

Unnamed: 0,introduction,summary
0,Acnesol Gel is an antibiotic that fights bacte...,This medicine works by attacking the bacteria...
1,Ambrodil Syrup is used for treating various re...,It is advised not to use it for more than 14 ...
2,Augmentin 625 Duo Tablet is a penicillin-type ...,You should take it regularly at evenly spaced...
3,Azithral 500 Tablet is an antibiotic used to t...,"This medicine is taken orally, preferably eit..."
4,Alkasol Oral Solution is a medicine used in th...,Alkasol Oral Solution is a medicine used in t...


#### Saving the dataframe to an Excel file

In [17]:
df.to_excel('summary.xlsx')