====================================================================================================================

# Text Summarizer

Text summarization is the process of distilling the most important information from a source text.

====================================================================================================================

## Steps Involved in Building This Text Summarizer

- Data Cleaning
- Text Cleaning
- Word Tokenization
- Sentence Tokenization
- Word Frequency
- Sentence Frequency
- Clustering
- Summarization

=====================================================================================================================

## Data Cleaning 

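#### First step: extract the data from the Excel file with Python and pandas
Requirements:  
Install the required packages  
* pip install xlrd 
* pip install pandas
* pip install openpyxl (needed to read .xlsx files when xlrd 2.0 or later is installed, since that version reads only .xls)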

In [1]:
#Import the pandas library under the alias pd
import pandas as pd

In [2]:
#Read the Excel file
xls = pd.ExcelFile('TASK.xlsx')

In [3]:
#get all the sheet names
sheet_names = xls.sheet_names
sheet_names

['Sheet 1']

In [4]:
#Get the Data from 'Sheet 1'
Dataset = pd.read_excel(xls,'Sheet 1')
Dataset.head()

Unnamed: 0,TEST DATASET,Unnamed: 1
0,,Introduction
1,,Acnesol Gel is an antibiotic that fights bacte...
2,,Ambrodil Syrup is used for treating various re...
3,,Augmentin 625 Duo Tablet is a penicillin-type ...
4,,Azithral 500 Tablet is an antibiotic used to t...


The dataset is not clean yet: the real header sits in the second row of the sheet, and one column holds only NaN values. We first need to pick up the proper header and then remove the NaN values.

In [5]:
#Skip the first row and use the second row as the header
Dataset = pd.read_excel(xls,'Sheet 1',header=1)
Dataset.head(3)

Unnamed: 0.1,Unnamed: 0,Introduction
0,,Acnesol Gel is an antibiotic that fights bacte...
1,,Ambrodil Syrup is used for treating various re...
2,,Augmentin 625 Duo Tablet is a penicillin-type ...
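
The effect of `header=1` can be tried without the Excel file. The sketch below is only an illustration: it assumes a tiny made-up CSV with the same shape as the sheet (a title row, then the real header row) and uses `pd.read_csv`, which accepts the same `header` argument as `pd.read_excel`. The name `toy` and the sample rows are hypothetical.

```python
import io
import pandas as pd

# Hypothetical miniature of the sheet: a title row, then the real header row.
raw = io.StringIO(
    ",TEST DATASET\n"
    ",Introduction\n"
    ",Acnesol Gel is an antibiotic.\n"
    ",Ambrodil Syrup treats mucus.\n"
)

# header=1 uses the second row (0-indexed) as the column names
# and discards the row above it.
toy = pd.read_csv(raw, header=1)

print(list(toy.columns))
print(len(toy))
```

The empty first column has no header text, so pandas gives it a generated `Unnamed: ...` name, which is where the `Unnamed: 0` column in the real dataset comes from.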


In [6]:
#To check the datatypes of each column
Dataset.dtypes

Unnamed: 0      float64
Introduction     object
dtype: object

In [7]:
#Check the length of the Dataset
len(Dataset)

1000

In [8]:
#Let's check for NaN values in the data
Dataset.isnull().sum()

Unnamed: 0      1000
Introduction       0
dtype: int64

The Unnamed column contains no values at all (1000 NaN values in 1000 rows), so we are going to delete it.

In [9]:
#Display the full text of each column instead of truncating it
pd.set_option('display.max_colwidth', None)

In [10]:
#Keep only the columns with at least 1000 non-NaN values, which drops the all-NaN column
Dataset = Dataset.dropna(thresh=1000,axis='columns')
Dataset.head(2)

Unnamed: 0,Introduction
0,"Acnesol Gel is an antibiotic that fights bacteria. It is used to treat acne, which appears as spots or pimples on your face, chest or back. This medicine works by attacking the bacteria that cause these pimples.Acnesol Gel is only meant for external use and should be used as advised by your doctor. You should normally wash and dry the affected area before applying a thin layer of the medicine. It should not be applied to broken or damaged skin. Avoid any contact with your eyes, nose, or mouth. Rinse it off with water if you accidentally get it in these areas. It may take several weeks for your symptoms to improve, but you should keep using this medicine regularly. Do not stop using it as soon as your acne starts to get better. Ask your doctor when you should stop treatment.Common side effects like minor itching, burning, or redness of the skin and oily skin may be seen in some people. These are usually temporary and resolve on their own. Consult your doctor if they bother you or do not go away.It is a safe medicine, but you should inform your doctor if you have any problems with your bowels (intestines). Also, inform the doctor if you have ever had bloody diarrhea caused by taking antibiotics or if you are using any other medicines to treat skin conditions. Consult your doctor about using this medicine if you are pregnant or breastfeeding."
1,"Ambrodil Syrup is used for treating various respiratory tract disorders associated with excessive mucus. It works by thinning and loosens mucus in the nose, windpipe and lungs and make it easier to cough out.Ambrodil Syrup should be taken with food. For better results, it is suggested to take it at the same time every day. The dose and how often you take it depends on what you are taking it for. Your doctor will decide how much you need to improve your symptoms. It is advised not to use it for more than 14 days without doctor consultation.The most common side effects of this medicine include vomiting, nausea, and stomach upset. Talk to your doctor if you are worried about side effects or they would not go away. Generally, it is advised not to take alcohol while on treatment.Before taking this medicine, tell your doctor if you have liver or kidney disease or if you have stomach problems. Your doctor should also know about all other medicines you are taking as many of these may make this medicine less effective or change the way it works. You must take doctor's advice before using this medicine if you are pregnant or breastfeeding."


The data is now clean, which completes the data cleaning step.
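
The `thresh` argument of `dropna` is easy to misread, so here is a small sketch of what it does. It assumes a three-row toy DataFrame with the same two columns as the dataset; the names `toy`, `kept` and `kept_all` are made up for this example.

```python
import numpy as np
import pandas as pd

# Toy frame shaped like the dataset: one all-NaN column, one text column.
toy = pd.DataFrame({
    "Unnamed: 0": [np.nan, np.nan, np.nan],
    "Introduction": ["text a", "text b", "text c"],
})

# thresh=N keeps a column only if it has at least N non-NaN values.
# With N equal to the number of rows, any column containing a NaN is dropped.
kept = toy.dropna(thresh=len(toy), axis="columns")
print(list(kept.columns))

# how="all" is a narrower rule: it drops only columns that are entirely NaN.
# For this data the result is the same.
kept_all = toy.dropna(how="all", axis="columns")
print(list(kept_all.columns))
```

Using `len(toy)` instead of a hard-coded 1000 keeps the call correct if the number of rows changes. `how="all"` may be the safer choice when a few missing values in the text column should not remove the whole column.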

====================================================================================================================

## Text Cleaning 

  #### Second step: clean the text column by removing punctuation and stopwords. Install and import the necessary packages first.
  Install the required packages 
  * !pip install -U spacy
  * !python -m spacy download en
  * !python -m spacy download en_core_web_sm

In [11]:
#Import necessary packages and libraries
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from heapq import nlargest

In [12]:
#Store the list of stopwords in a variable
stopwords = list(STOP_WORDS)

In [13]:
#Load the model
nlp = spacy.load("en_core_web_sm")

In [14]:
#Check the list of punctuations
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [15]:
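#Function to clean up the text by lemmatizing it and removing punctuation and stopwords
def cleanup_text(docs, logging=False):
    #The parser and NER components are not needed here, so disable them for speed
    doc = nlp(docs, disable=['parser', 'ner'])
    #Lowercase the lemmas; '-PRON-' is the placeholder lemma that spaCy 2.x assigns to pronouns
    tokens = [tok.lemma_.lower().strip() for tok in doc if tok.lemma_ != '-PRON-']
    #Drop stopwords, punctuation marks and empty tokens
    tokens = [tok for tok in tokens if tok not in stopwords and tok not in punctuation]
    #Return the cleaned tokens joined into a single string
    return ' '.join(tokens)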
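
One detail of `cleanup_text` is worth a closer look: `punctuation` is a string, so `tok not in punctuation` is a substring test, not a per-character test. The sketch below compares the two. The helper names `is_punct_substring` and `is_punct_chars` are made up for this illustration and are not part of the notebook.

```python
from string import punctuation

PUNCT_SET = set(punctuation)

def is_punct_substring(tok):
    # The test used in cleanup_text: is tok a substring of the punctuation string?
    return tok in punctuation

def is_punct_chars(tok):
    # Alternative: is tok non-empty and made up only of punctuation characters?
    return bool(tok) and all(ch in PUNCT_SET for ch in tok)

samples = ["", ",", "()", "--", "..."]
for tok in samples:
    print(repr(tok), is_punct_substring(tok), is_punct_chars(tok))
```

The substring test has one useful side effect: it also filters out empty tokens, because the empty string is a substring of every string. On the other hand, repeated marks such as `--` or `...` are not substrings of `punctuation`, so they pass the filter. If that matters for your data, a combined check such as `not tok or is_punct_chars(tok)` would catch both cases.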

In [16]:
#Call the 'cleanup_text' function on each row of the 'Introduction' column and save the result in a new column
Dataset['Cleaned_Introduction'] = Dataset['Introduction'].apply(lambda x: cleanup_text(x, False))

In [17]:
#Compare the cleaned data with the original data
#For example, take the first row
print('---Introduction with punctuation and stopwords---\n')
print(Dataset['Introduction'][0])
print('\n---Introduction after removing punctuation and stopwords---\n')
print(Dataset['Cleaned_Introduction'][0])
print('======================================================================================================')
print('---Length of Original Data---\n')
print(len(Dataset['Introduction'][0]))
print('---Length of Cleaned Data---\n')
print(len(Dataset['Cleaned_Introduction'][0]))

---Introduction with punctuation and stopwords---

Acnesol Gel is an antibiotic that fights bacteria. It is used to treat acne, which appears as spots or pimples on your face, chest or back. This medicine works by attacking the bacteria that cause these pimples.Acnesol Gel is only meant for external use and should be used as advised by your doctor. You should normally wash and dry the affected area before applying a thin layer of the medicine. It should not be applied to broken or damaged skin. Avoid any contact with your eyes, nose, or mouth. Rinse it off with water if you accidentally get it in these areas. It may take several weeks for your symptoms to improve, but you should keep using this medicine regularly. Do not stop using it as soon as your acne starts to get better. Ask your doctor when you should stop treatment.Common side effects like minor itching, burning, or redness of the skin and oily skin may be seen in some people. These are usually temporary and resolve on their ow
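
Note that `len()` on a string counts characters, not words. If a word-level comparison is more useful, something like the sketch below could be used. `compression_ratio` is a hypothetical helper, not part of the notebook, and it assumes that splitting on whitespace is a good enough word count.

```python
def compression_ratio(original, cleaned):
    """Return the cleaned word count divided by the original word count."""
    original_words = len(original.split())
    cleaned_words = len(cleaned.split())
    # Avoid dividing by zero for an empty original text.
    if original_words == 0:
        return 0.0
    return cleaned_words / original_words

# First sentence of the first row, before and after cleaning.
original = "Acnesol Gel is an antibiotic that fights bacteria."
cleaned = "acnesol gel antibiotic fight bacteria"

print(len(original), len(cleaned))           # character counts
print(compression_ratio(original, cleaned))  # share of words kept
```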

In [18]:
#Function to summarize the text of one row
def Summarized_Data(cleaned_text):
    doc = nlp(cleaned_text)
    word_frequencies = {}
    sentence_scores = {}
    texts = []

    #Get the frequency of each word, keyed by its lowercase form so that the lookups below match
    for word in doc:
        token = word.text.lower()
        if token not in stopwords and token not in punctuation:
            word_frequencies[token] = word_frequencies.get(token, 0) + 1

    #Get the maximum word frequency
    max_frequency = max(word_frequencies.values())

    #Normalize the frequency of each word by the maximum
    for word in word_frequencies.keys():
        word_frequencies[word] = word_frequencies[word]/max_frequency

    #Get the sentence tokens
    sentence_tokens = [sent for sent in doc.sents]

    #Score each sentence as the sum of the normalized frequencies of its words
    for sent in sentence_tokens:
        for word in sent:
            token = word.text.lower()
            if token in word_frequencies:
                sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[token]

    #Number of sentences to keep: 30% of all sentences
    #You can change the percentage of sentences you require
    select_length = int(len(sentence_tokens)*0.3)

    #Get the highest-scoring sentences
    summary = nlargest(select_length,sentence_scores,key = sentence_scores.get)

    #Combine these sentences into one string
    final_summary = [sent.text for sent in summary]
    summary = ' '.join(final_summary)

    #Append it to a list and return it as a Series
    texts.append(summary)
    return pd.Series(texts)

In [19]:
#Check the summarized data of the first data point
Summarized_Data(Dataset['Cleaned_Introduction'][0])

0    doctor problem bowel intestine inform doctor bloody diarrhea cause antibiotic use medicine treat skin condition consult doctor use medicine pregnant breastfeeding area week symptom improve use medicine regularly stop use soon acne start ask doctor stop treatment common effect like minor itching burning redness skin
dtype: object
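
To make the scoring easier to follow, here is a minimal sketch of the same idea in plain Python, without spaCy. It assumes the sentences are already split and cleaned, and it differs from `Summarized_Data` in two ways that are choices of this sketch rather than the notebook's: it always keeps at least one sentence, and it returns the selected sentences in their original order. The name `summarize` and the sample sentences are made up.

```python
from collections import Counter
from heapq import nlargest

def summarize(sentences, ratio=0.3):
    """Pick the top-scoring sentences by normalized word frequency."""
    # Count every word across all sentences.
    counts = Counter(word for sent in sentences for word in sent.split())
    if not counts:
        return []

    # Normalize each count by the highest count, so weights fall in (0, 1].
    max_count = max(counts.values())
    weights = {word: n / max_count for word, n in counts.items()}

    # A sentence's score is the sum of the weights of its words.
    scores = {
        i: sum(weights[word] for word in sent.split())
        for i, sent in enumerate(sentences)
    }

    # Assumption of this sketch: keep at least one sentence for short inputs.
    keep = max(1, int(len(sentences) * ratio))
    top = nlargest(keep, scores, key=scores.get)

    # Assumption of this sketch: restore the original sentence order.
    return [sentences[i] for i in sorted(top)]

sentences = [
    "use medicine regularly",
    "consult doctor use medicine",
    "avoid contact eye",
    "ask doctor",
]
print(summarize(sentences, ratio=0.5))
```

Here `use`, `medicine` and `doctor` each appear twice and get weight 1.0, while every other word gets 0.5. The second sentence scores 3.5 and the first 2.5, so those two are kept. Because the score is a plain sum, longer sentences tend to score higher; dividing by the sentence length would be one way to reduce that bias.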

In [20]:
#Call the 'Summarized_Data' function on each row of the 'Cleaned_Introduction' column and save the result in a new column
Dataset['Summarized_Introduction'] = Dataset['Cleaned_Introduction'].apply(lambda x: Summarized_Data(x))

In [21]:
#Show the first two rows
Dataset.head(2)

Unnamed: 0,Introduction,Cleaned_Introduction,Summarized_Introduction
0,"Acnesol Gel is an antibiotic that fights bacteria. It is used to treat acne, which appears as spots or pimples on your face, chest or back. This medicine works by attacking the bacteria that cause these pimples.Acnesol Gel is only meant for external use and should be used as advised by your doctor. You should normally wash and dry the affected area before applying a thin layer of the medicine. It should not be applied to broken or damaged skin. Avoid any contact with your eyes, nose, or mouth. Rinse it off with water if you accidentally get it in these areas. It may take several weeks for your symptoms to improve, but you should keep using this medicine regularly. Do not stop using it as soon as your acne starts to get better. Ask your doctor when you should stop treatment.Common side effects like minor itching, burning, or redness of the skin and oily skin may be seen in some people. These are usually temporary and resolve on their own. Consult your doctor if they bother you or do not go away.It is a safe medicine, but you should inform your doctor if you have any problems with your bowels (intestines). Also, inform the doctor if you have ever had bloody diarrhea caused by taking antibiotics or if you are using any other medicines to treat skin conditions. 
Consult your doctor about using this medicine if you are pregnant or breastfeeding.",acnesol gel antibiotic fight bacteria use treat acne appear spot pimple face chest medicine work attack bacteria cause pimple acnesol gel mean external use use advise doctor normally wash dry affected area apply thin layer medicine apply broken damage skin avoid contact eye nose mouth rinse water accidentally area week symptom improve use medicine regularly stop use soon acne start ask doctor stop treatment common effect like minor itching burning redness skin oily skin people usually temporary resolve consult doctor bother away safe medicine inform doctor problem bowel intestine inform doctor bloody diarrhea cause antibiotic use medicine treat skin condition consult doctor use medicine pregnant breastfeeding,doctor problem bowel intestine inform doctor bloody diarrhea cause antibiotic use medicine treat skin condition consult doctor use medicine pregnant breastfeeding area week symptom improve use medicine regularly stop use soon acne start ask doctor stop treatment common effect like minor itching burning redness skin
1,"Ambrodil Syrup is used for treating various respiratory tract disorders associated with excessive mucus. It works by thinning and loosens mucus in the nose, windpipe and lungs and make it easier to cough out.Ambrodil Syrup should be taken with food. For better results, it is suggested to take it at the same time every day. The dose and how often you take it depends on what you are taking it for. Your doctor will decide how much you need to improve your symptoms. It is advised not to use it for more than 14 days without doctor consultation.The most common side effects of this medicine include vomiting, nausea, and stomach upset. Talk to your doctor if you are worried about side effects or they would not go away. Generally, it is advised not to take alcohol while on treatment.Before taking this medicine, tell your doctor if you have liver or kidney disease or if you have stomach problems. Your doctor should also know about all other medicines you are taking as many of these may make this medicine less effective or change the way it works. You must take doctor's advice before using this medicine if you are pregnant or breastfeeding.",ambrodil syrup use treat respiratory tract disorder associate excessive mucus work thinning loosen mucus nose windpipe lung easy cough ambrodil syrup food result suggest time day dose depend doctor decide need improve symptom advise use 14 day doctor consultation common effect medicine include vomiting nausea stomach upset talk doctor worried effect away generally advise alcohol treatment medicine tell doctor liver kidney disease stomach problem doctor know medicine medicine effective change way work doctor advice use medicine pregnant breastfeeding,alcohol treatment medicine tell doctor liver kidney disease stomach problem doctor know medicine medicine effective change way work doctor advice use medicine pregnant breastfeeding


In [22]:
#After getting the summarized data for each row, delete the 'Cleaned_Introduction' column
Dataset = Dataset.drop(['Cleaned_Introduction'],axis=1)

In [23]:
Dataset.head(2)

Unnamed: 0,Introduction,Summarized_Introduction
0,"Acnesol Gel is an antibiotic that fights bacteria. It is used to treat acne, which appears as spots or pimples on your face, chest or back. This medicine works by attacking the bacteria that cause these pimples.Acnesol Gel is only meant for external use and should be used as advised by your doctor. You should normally wash and dry the affected area before applying a thin layer of the medicine. It should not be applied to broken or damaged skin. Avoid any contact with your eyes, nose, or mouth. Rinse it off with water if you accidentally get it in these areas. It may take several weeks for your symptoms to improve, but you should keep using this medicine regularly. Do not stop using it as soon as your acne starts to get better. Ask your doctor when you should stop treatment.Common side effects like minor itching, burning, or redness of the skin and oily skin may be seen in some people. These are usually temporary and resolve on their own. Consult your doctor if they bother you or do not go away.It is a safe medicine, but you should inform your doctor if you have any problems with your bowels (intestines). Also, inform the doctor if you have ever had bloody diarrhea caused by taking antibiotics or if you are using any other medicines to treat skin conditions. Consult your doctor about using this medicine if you are pregnant or breastfeeding.",doctor problem bowel intestine inform doctor bloody diarrhea cause antibiotic use medicine treat skin condition consult doctor use medicine pregnant breastfeeding area week symptom improve use medicine regularly stop use soon acne start ask doctor stop treatment common effect like minor itching burning redness skin
1,"Ambrodil Syrup is used for treating various respiratory tract disorders associated with excessive mucus. It works by thinning and loosens mucus in the nose, windpipe and lungs and make it easier to cough out.Ambrodil Syrup should be taken with food. For better results, it is suggested to take it at the same time every day. The dose and how often you take it depends on what you are taking it for. Your doctor will decide how much you need to improve your symptoms. It is advised not to use it for more than 14 days without doctor consultation.The most common side effects of this medicine include vomiting, nausea, and stomach upset. Talk to your doctor if you are worried about side effects or they would not go away. Generally, it is advised not to take alcohol while on treatment.Before taking this medicine, tell your doctor if you have liver or kidney disease or if you have stomach problems. Your doctor should also know about all other medicines you are taking as many of these may make this medicine less effective or change the way it works. You must take doctor's advice before using this medicine if you are pregnant or breastfeeding.",alcohol treatment medicine tell doctor liver kidney disease stomach problem doctor know medicine medicine effective change way work doctor advice use medicine pregnant breastfeeding


In [24]:
#Save the output to an excel sheet
Dataset.to_excel('Summarized_Data.xlsx', index=False)