# Assignment 2

I created a single class named "SummarizeDocument". It has one method for each requirement. The methods are as follows:
<ol>
    <li><b>get_data:</b> Fetches the page content from the given URL</li>
    <li><b>process_clean_data:</b> Processes the data (extracts the text from the HTML paragraphs, etc.). It uses the clean_data method</li>
    <li><b>clean_data:</b> Cleans the data: removes bracketed text such as Wikipedia reference markers and replaces non-breaking space characters</li>
    <li><b>tokenize_data:</b> Creates the bag of words</li>
    <li><b>get_word_frequency:</b> Counts the occurrences of each word</li>
    <li><b>norm_min_max:</b> Normalization using the max and min of the word frequencies</li>
    <li><b>norm_mean_sd:</b> Normalization using the mean and standard deviation of the word frequencies</li>
    <li><b>get_sentences:</b> Splits the article into sentences</li>
    <li><b>get_sentence_weight:</b> Assigns a weight to each sentence</li>
    <li><b>get_summary:</b> Prints the summary of the article, made up of the top-weighted sentences</li>
</ol>
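
To make the cleaning step concrete, here is a minimal standalone sketch of what clean_data does to a single paragraph. The function name `clean_text` and the sample string are illustrative assumptions; the regular expression and the two replacements are the same ones used in the class.

```python
import re

def clean_text(text):
    """Standalone version of the cleaning step (hypothetical helper name)."""
    # Drop anything between square brackets, e.g. [1] or [citation needed]
    without_refs = re.sub(r"\[.*?\]", "", text)
    # Turn non-breaking spaces into regular spaces and drop newlines
    return without_refs.replace("\xa0", " ").replace("\n", "")

# Made-up sample in the style of a Wikipedia paragraph
sample = "COVID-19 is an infectious disease.[1]\xa0It spreads easily.[citation needed]\n"
print(clean_text(sample))
```

Note that the non-greedy `.*?` stops at the first closing bracket, so each reference marker is removed separately and the text between two markers is kept.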

In [264]:
# Loading the libraries

import string
import urllib.request
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
import re
from collections import defaultdict
from statistics import mean, stdev
import heapq

class SummarizeDocument(object):
    
    #Constructor
    def __init__(self):
        self.stopwords = set(stopwords.words('english')) # All possible stopwords

    # url (string): URL from where you want to fetch (crawl) the data
    def get_data(self, url):
        crawl_data = urllib.request.urlopen(url)
        html_content = crawl_data.read().decode("utf-8")
        
        return BeautifulSoup(html_content, "html.parser").find_all("p")
        
    # data (list) : Unparsed data from beautiful soup
    def process_clean_data(self, data):
        # clean the data
        processed_data = [self.clean_data(p.text) for p in data]

        return " ".join(processed_data).strip()
        
    # HTML tags could also be stripped with a regular expression. I used that approach first and it worked well,
    # but BeautifulSoup's .text attribute does the same job
    # data (string) : Text of a single paragraph
    def clean_data(self, data):
        #cleanr = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
        #cleantext = re.sub(cleanr, '', raw_html)
        #Both lines above are replaced by the .text attribute of BeautifulSoup, used in process_clean_data

        #Remove anything that comes between square brackets
        raw_data = re.sub(r'\[.*?\]', '', data)

        #Replace \xa0, the non-breaking space in Latin-1 (ISO 8859-1), with a regular space
        data = raw_data.replace(u'\xa0', u' ')
        return data.replace('\n','') # Remove \n from data
    
    # Tokenize the data
    # data (string) : Parsed and clean data
    def tokenize_data(self, data):
        all_tokens = []

        # Build a new list instead of removing items from the list being iterated, which would skip tokens
        for tok in word_tokenize(data):
            # Drop short clitics like 's or 'll
            if re.match(r"\'[A-Za-z0-9]{1,2}$", tok):
                continue
            # Remove the single quote before longer words like 'without, 'and etc.
            if re.match(r"\'[A-Za-z0-9]{3,}", tok):
                tok = tok.replace("'", "")
            all_tokens.append(tok.lower())

        # Drop punctuation tokens and stopwords
        return [tok for tok in all_tokens if tok not in string.punctuation and tok not in self.stopwords]

    # Count the frequency of every word
    # data (list) : Bag of words
    def get_word_frequency(self, data):
        freq_table = defaultdict(int)
        for word in data:
            freq_table[word] += 1 # Missing keys start at 0 in a defaultdict

        return freq_table
    
    # Normalization method - 1 using max and min
    # word_count (dictionary) : Dictionary having word frequency
    def norm_min_max(self, word_count):
        all_counts = word_count.values() # All the counts from the bag of words
        M = max(all_counts) # Maximum count
        m = min(all_counts) # Minimum count
        range_data = (M - m) or 1 # Range of counts; fall back to 1 if all counts are equal

        weight = defaultdict(int)

        for word, cnt in word_count.items():
            weight[word] = cnt / range_data

        return weight
    
    # Normalization method - 2 using mean and standard deviation
    # word_count (dictionary) : Dictionary having word frequency
    def norm_mean_sd(self, word_count):
        all_counts = word_count.values() # All the count from bag of words
        mean_val = mean(all_counts)
        stdev_val = stdev(all_counts)

        weight = defaultdict(int)

        for word, cnt in word_count.items():
            weight[word] = (cnt - mean_val) / stdev_val

        return weight
    
    
    # Convert the data into sentences
    # data (list) : Parsed and clean data
    def get_sentences(self, data):
        return sent_tokenize(data)
    
    # Give a weight to each sentence
    # sentences (list) : List with all sentences of document
    # word_with_weight (dictionary): All words with their normalized weights
    def get_sentence_weight(self, sentences, word_with_weight):
        sentence_weight = {}
        for sent in sentences:
            word_tokens = word_tokenize(sent)
            sent_length = len(sent.split(" "))

            # Only sentences shorter than 30 words are candidates for the summary
            if sent_length < 30:
                for word in word_tokens:
                    if word in word_with_weight:
                        sentence_weight[sent] = sentence_weight.get(sent, 0) + word_with_weight[word]

        return sentence_weight
    
    # Get summary of document
    # num_of_sent (int) : Number of sentences required as summary (This is top K)
    # weighted_sentences (dictionary) : All sentences with their respective weights
    def get_summary(self, num_of_sent, weighted_sentences):
        summary_sentences = heapq.nlargest(num_of_sent, weighted_sentences, key=weighted_sentences.get)
        print(" ".join(summary_sentences))

## COVID-19
### Normalization using max and min

In [283]:
#Initialization of SummarizeDocument
doc_summ = SummarizeDocument()

# Fetching the data from Wikipedia
unparsed_text = doc_summ.get_data("https://en.wikipedia.org/wiki/Coronavirus_disease_2019")

# Processing and cleaning the data
parsed_clean_text = doc_summ.process_clean_data(unparsed_text)

# Creating tokens
bag_of_words = doc_summ.tokenize_data(parsed_clean_text)

# Get word frequencies
word_frequency = doc_summ.get_word_frequency(bag_of_words)

# Normalization
word_with_weight = doc_summ.norm_min_max(word_frequency)

# Creating sentences from article
all_sentences = doc_summ.get_sentences(parsed_clean_text)

# Giving a weight to each sentence based on the words it contains
sent_with_weight = doc_summ.get_sentence_weight(all_sentences, word_with_weight)

# As per the requirement, the K summary sentences should contain about 5% of the words, e.g. 1000 / 20 = 50, where 50 is 5% of 1000

# Algorithm to find K
# We need to find the number of sentences.
#     1) Find the average number of words per sentence (that will be our expected sentence length)
#     2) Find the number of words that should be in the summary (divide the total word count by 20)
#     3) Divide the count from step 2 by the count from step 1; that is the number of sentences

# First step
total_words = len(word_tokenize(parsed_clean_text))
total_sentences = len(all_sentences)
average_sentence_length = total_words/total_sentences

# Second step
summary_no_of_words = total_words / 20

# Third step
K = round(summary_no_of_words / average_sentence_length)

doc_summ.get_summary(K, sent_with_weight)

During the initial outbreak in Wuhan, China, the virus and disease were commonly referred to as "coronavirus" and "Wuhan coronavirus", with the disease sometimes called "Wuhan pneumonia". People are most infectious when they show symptoms (even mild or non-specific symptoms), but may be infectious for up to two days before symptoms appear (pre-symptomatic transmission). It is most contagious during the first three days after the onset of symptoms, although spread is possible before symptoms appear, and from people who do not show symptoms. Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Following the infection, children may develop paediatric multisystem inflammatory syndrome, which has symptoms similar to Kawasaki disease, which can be fatal. The disease may take a mild course with few or no symptoms, resembling other common upper respiratory diseases such as the common cold. In a study of early cases
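
The three steps used to find K in the cell above can be wrapped in a small helper, sketched below. The name `estimate_k` and its `divisor` parameter are hypothetical, and the word and sentence counts are made-up numbers rather than the counts from the article.

```python
def estimate_k(total_words, total_sentences, divisor=20):
    """Number of sentences needed to cover 1/divisor of the words (5% for divisor=20)."""
    average_sentence_length = total_words / total_sentences        # first step
    summary_no_of_words = total_words / divisor                    # second step
    return round(summary_no_of_words / average_sentence_length)   # third step

# Made-up counts: 1000 words spread over 40 sentences
print(estimate_k(1000, 40))  # 50 words / 25 words per sentence -> 2
```

Because the total word count appears in both the numerator and the denominator, it cancels out: K is effectively the number of sentences divided by 20, rounded. Python's `round` rounds exact halves to the nearest even integer, so a ratio of 2.5 gives 2 rather than 3.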

### Normalization using mean and standard deviation

In [282]:
#Initialization of SummarizeDocument
doc_summ = SummarizeDocument()

url = "https://en.wikipedia.org/wiki/Coronavirus_disease_2019"

# Fetching the data from Wikipedia
unparsed_text = doc_summ.get_data(url)

# Processing and cleaning the data
parsed_clean_text = doc_summ.process_clean_data(unparsed_text)

# Creating tokens
bag_of_words = doc_summ.tokenize_data(parsed_clean_text)

# Get word frequencies
word_frequency = doc_summ.get_word_frequency(bag_of_words)

# Normalization
word_with_weight = doc_summ.norm_mean_sd(word_frequency)

# Creating sentences from article
all_sentences = doc_summ.get_sentences(parsed_clean_text)

# Giving a weight to each sentence based on the words it contains
sent_with_weight = doc_summ.get_sentence_weight(all_sentences, word_with_weight)

# As per the requirement, the K summary sentences should contain about 5% of the words, e.g. 1000 / 20 = 50, where 50 is 5% of 1000

# Algorithm to find K
# We need to find the number of sentences.
#     1) Find the average number of words per sentence (that will be our expected sentence length)
#     2) Find the number of words that should be in the summary (divide the total word count by 20)
#     3) Divide the count from step 2 by the count from step 1; that is the number of sentences

# First step
total_words = len(word_tokenize(parsed_clean_text))
total_sentences = len(all_sentences)
average_sentence_length = total_words/total_sentences

# Second step
summary_no_of_words = total_words / 20

# Third step
K = round(summary_no_of_words / average_sentence_length)

doc_summ.get_summary(K, sent_with_weight)

During the initial outbreak in Wuhan, China, the virus and disease were commonly referred to as "coronavirus" and "Wuhan coronavirus", with the disease sometimes called "Wuhan pneumonia". People are most infectious when they show symptoms (even mild or non-specific symptoms), but may be infectious for up to two days before symptoms appear (pre-symptomatic transmission). It is most contagious during the first three days after the onset of symptoms, although spread is possible before symptoms appear, and from people who do not show symptoms. Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Following the infection, children may develop paediatric multisystem inflammatory syndrome, which has symptoms similar to Kawasaki disease, which can be fatal. The disease may take a mild course with few or no symptoms, resembling other common upper respiratory diseases such as the common cold. The WHO additionally uses

# Conclusion

Both normalization methods produce almost the same summary. Only the last one or two sentences differ.
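
One possible way to put a number on "almost the same" is to measure how many of the top K sentences the two weightings share. The sketch below is only an illustration: `top_k` and `summary_overlap` are hypothetical helpers, and the weights are toy values, not the ones computed for the article.

```python
import heapq

def top_k(weighted_sentences, k):
    """Return the k highest-weighted sentences, as get_summary selects them."""
    return heapq.nlargest(k, weighted_sentences, key=weighted_sentences.get)

def summary_overlap(weights_a, weights_b, k):
    """Fraction of the top-k sentences shared by two weightings."""
    if k == 0:
        return 0.0
    shared = set(top_k(weights_a, k)) & set(top_k(weights_b, k))
    return len(shared) / k

# Toy sentence weights (made up); keys stand in for full sentences
min_max_weights = {"s1": 4.0, "s2": 3.5, "s3": 1.0, "s4": 0.5}
mean_sd_weights = {"s1": 2.1, "s2": 1.8, "s3": -0.4, "s4": 0.2}

# The two top-3 lists agree on s1 and s2 and differ in the last sentence
print(summary_overlap(min_max_weights, mean_sd_weights, 3))
```

With the real data, the two dictionaries would be the `sent_with_weight` results of the two runs above.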