## Data Importing and Preprocessing

In [404]:
from sklearn.datasets import fetch_20newsgroups
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import brown, stopwords
from nltk.cluster.util import cosine_distance
from nltk import ne_chunk, pos_tag
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from mlxtend.plotting import plot_confusion_matrix
from matplotlib import cm
import matplotlib.pyplot as plt
import itertools
import operator
from gensim.summarization.summarizer import summarize  # requires gensim < 4.0; the summarization module was removed in 4.0

Download the training and test datasets.
Note: the texts are newsgroup posts, so they carry fields such as "From", "Subject", and "Lines". Setting the `remove` argument strips newsgroup headers, blocks at the ends of posts that look like signatures, and lines that appear to quote another post.

In [2]:
train = fetch_20newsgroups(subset='train', shuffle=True, remove = ('headers','footers','quotes'))
test = fetch_20newsgroups(subset='test', shuffle=True, remove = ('headers','footers','quotes'))

In [3]:
type(train)

sklearn.utils.Bunch

In [4]:
train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR', 'description'])

In [5]:
print(train.target_names)
target_set = set(train.target)
print(target_set)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19}


As we can see, the label names are stored in `target_names`, and each message's label is represented by an integer from 0 to 19 when fitting models.
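To make the mapping concrete, here is a minimal sketch. It uses a hypothetical three-class stand-in for `train.target_names` and `train.target` (the lists below are made up so the snippet runs without downloading the dataset); each integer label is simply an index into the list of names.

```python
from collections import Counter

# Hypothetical stand-ins for train.target_names and train.target.
target_names = ['alt.atheism', 'comp.graphics', 'sci.space']
target = [2, 0, 2, 1, 2]

# Each integer label is an index into target_names.
labels = [target_names[i] for i in target]
print(labels)

# Count how many documents fall into each newsgroup.
counts = Counter(labels)
print(counts.most_common())
```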

The text now comes without headers, footers, or quoted lines.

In [504]:
## Clean the text.
import string

def clean(text):
    """Remove hyphens and punctuation tokens, and end every sentence with a period."""
    text = text.replace("-", "")
    tokenized_text_list = []
    for sent in sent_tokenize(text):
        # Drop punctuation tokens.
        word_list = [x for x in word_tokenize(sent) if x not in string.punctuation]
        if not word_list:
            # Skip sentences that held only punctuation; indexing an empty list would fail.
            continue
        word_list[-1] += '.'
        tokenized_text_list.append(" ".join(word_list))

    return " ".join(tokenized_text_list)
        

In [481]:
train.data[9]

"\n\n\nI've had the board for over a year, and it does work with Diskdoubler,\nbut not with Autodoubler, due to a licensing problem with Stac Technologies,\nthe owners of the board's compression technology. (I'm writing this\nfrom memory; I've lost the reference. Please correct me if I'm wrong.)\n\nUsing the board, I've had problems with file icons being lost, but it's\nhard to say whether it's the board's fault or something else; however,\nif I decompress the troubled file and recompress it without the board,\nthe icon usually reappears. Because of the above mentioned licensing\nproblem, the freeware expansion utility DD Expand will not decompress\na board-compressed file unless you have the board installed.\n\nSince Stac has its own product now, it seems unlikely that the holes\nin Autodoubler/Diskdoubler related to the board will be fixed.\nWhich is sad, and makes me very reluctant to buy Stac's product since\nthey're being so stinky. (But hey, that's competition.)\n-- "

In [480]:
clean(train.data[9])

"I 've had the board for over a year and it does work with Diskdoubler but not with Autodoubler due to a licensing problem with Stac Technologies the owners of the board 's compression technology. I 'm writing this from memory I 've lost the reference. Please correct me if I 'm wrong. Using the board I 've had problems with file icons being lost but it's hard to say whether it 's the board 's fault or something else however if I decompress the troubled file and recompress it without the board the icon usually reappears. Because of the above mentioned licensing problem the freeware expansion utility DD Expand will not decompress a boardcompressed file unless you have the board installed. Since Stac has its own product now it seems unlikely that the holes in Autodoubler/Diskdoubler related to the board will be fixed. Which is sad and makes me very reluctant to buy Stac 's product since they 're being so stinky. But hey that 's competition."

## Build Text Summarization Models

First, I try to build extractive summarization models.

### Tf-idf summarizer  

I built my first sentence ranker on tf-idf scores. I used `TfidfVectorizer` to convert the words in the text into a tf-idf matrix and stored the per-word scores in a dictionary. For each sentence, I summed the tf-idf scores of all its words and divided by the number of words; this average is the sentence's tf-idf score. I then ranked the sentences in descending order of score and took the top few as the extractive summary.
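The averaging and ranking steps can be sketched without any library. This is only an illustration under stated assumptions: the `weights` dictionary holds made-up tf-idf values, the helper name `average_score` is hypothetical, and a simple regular expression stands in for the NLTK tokenizer.

```python
import re

def average_score(sentence, weights):
    """Mean weight of the tokens in `sentence`; unknown tokens count as 0."""
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    if not tokens:
        return 0.0
    return sum(weights.get(tok, 0.0) for tok in tokens) / len(tokens)

# Hypothetical per-word tf-idf weights for one document.
weights = {"board": 0.5, "icons": 0.4, "licensing": 0.3}

sentences = [
    "The board loses icons.",
    "That is sad.",
    "Licensing is the problem.",
]

# Rank sentence indexes by average score, highest first.
ranked = sorted(range(len(sentences)),
                key=lambda i: average_score(sentences[i], weights),
                reverse=True)

# Keep the best-scoring sentence as a one-sentence summary.
summary = " ".join(sentences[i] for i in ranked[:1])
print(summary)
```

Because the score is an average, a long sentence with a few heavy words can rank below a short sentence with one heavy word; that is a property of the averaging choice, not of tf-idf itself.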

In [49]:
tf_idfvec = TfidfVectorizer()
tf_idfvec.fit(train.data)
tf_matrix = tf_idfvec.transform(train.data)

In [176]:
## Calculate the score for each sentence
def get_sentence_score_tf(sentence, tf_dict):
    """
    parameter: sentence - a string representing the sentence
               tf_dict - dictionary mapping each word to its tf-idf score
    return: the average tf-idf score of the words in this sentence
    """
    word_list = word_tokenize(sentence)
    n = len(word_list)
    if n == 0:
        return 0.0
    # TfidfVectorizer lowercases its vocabulary by default, so lowercase before the lookup.
    score = sum(tf_dict.get(word.lower(), 0.0) for word in word_list)
    return score / n


In [406]:
def Tfidf_summarizer(text, prop):
    """
    parameter: text - document to summarize, given as a string
               prop - float between 0 and 1, the proportion of sentences to keep in the summary
    return: string containing the summary of the text
    """

    corpus_matrix = tf_idfvec.transform([text])
    sentence_list = sent_tokenize(text)
    n = len(sentence_list)
    # get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names().
    feature_names = tf_idfvec.get_feature_names_out()

    ## get score: https://stackoverflow.com/questions/34449127/sklearn-tfidf-transformer-how-to-get-tf-idf-values-of-given-words-in-document
    doc = 0
    feature_index = corpus_matrix[doc, :].nonzero()[1]

    # Map each word that occurs in the text to its tf-idf score.
    tf_dict = {feature_names[i]: corpus_matrix[doc, i] for i in feature_index}

    # Score every sentence, then rank the sentence indexes by score in descending order.
    score_list = {i: get_sentence_score_tf(sentence_list[i], tf_dict) for i in range(n)}
    sorted_tuple = sorted(score_list.items(), key=operator.itemgetter(1), reverse=True)

    num_of_sentence = round(prop * n)
    result = ''.join(sentence_list[x[0]] for x in sorted_tuple[0:num_of_sentence])
    return(result)
    

In [487]:
print(Tfidf_summarizer(clean(train.data[9]), 0.3))

Using the board I 've had problems with file icons being lost but it's hard to say whether it 's the board 's fault or something else however if I decompress the troubled file and recompress it without the board the icon usually reappears.Because of the above mentioned licensing problem the freeware expansion utility DD Expand will not decompress a boardcompressed file unless you have the board installed.


### Named Entity Recognition Summarizer

I built my second text summarizer using named entity recognition (NER). Sentences with at least 3 named entities are treated as important and included in the summary.
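The selection rule itself is just a threshold on a per-sentence count. The sketch below shows that rule in isolation; note that `count_capitalized` is a crude, hypothetical stand-in (it counts capitalized words after the first token) and is not NLTK's named-entity chunker. The threshold is a parameter, and the default of 3 mirrors the cutoff used in the `NER_summarizer` code.

```python
def count_capitalized(sentence):
    """Crude stand-in for an entity counter: capitalized words after the first token."""
    tokens = sentence.split()
    return sum(1 for tok in tokens[1:] if tok[:1].isupper())

def select_sentences(sentences, min_entities=3, counter=count_capitalized):
    """Keep sentences whose count reaches the threshold, in document order."""
    return [s for s in sentences if counter(s) >= min_entities]

sentences = [
    "It works with Diskdoubler but not Autodoubler because of Stac Technologies.",
    "Please correct me if I am wrong.",
]
print(select_sentences(sentences))
```

A fixed threshold means the summary length varies from document to document, and a text with few entities can produce an empty summary.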

In [266]:
def get_sentence_score_ner(sentence):
    """
    parameter: sentence - a string representing a sentence
    return: score - an integer, the number of named-entity chunks in the sentence
    """
    score = 0
    chunks = ne_chunk(pos_tag(word_tokenize(sentence)))
    for c in chunks:
        # ne_chunk wraps each named entity in a subtree; plain tokens stay (word, tag) tuples.
        if isinstance(c, nltk.tree.Tree):
            score += 1
    return score


In [402]:
def NER_summarizer(text):
    """
    parameter: text - document to summarize, given as a string

    return: string containing the summary of the text
    """
    text = clean(text)
    sent_list = sent_tokenize(text)
    # Score each sentence once; looking scores up with list.index() would
    # return the first match when a sentence is duplicated.
    score_list = [get_sentence_score_ner(sent) for sent in sent_list]

    # Keep sentences with at least 3 named entities.
    result = [sent for sent, score in zip(sent_list, score_list) if score >= 3]
    result = ''.join(result)

    return(result)

In [482]:
print(NER_summarizer(train.data[9]))

I 've had the board for over a year and it does work with Diskdoubler but not with Autodoubler due to a licensing problem with Stac Technologies the owners of the board 's compression technology.


### TextRank Summarizer 

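Sentence similarity: the cosine similarity between the word-count vectors of two tokenized sentences, ignoring stopwords.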

In [489]:
from nltk.corpus import brown, stopwords
from nltk.cluster.util import cosine_distance
 
def sentence_similarity(sent1, sent2, stopwords=None):
    """Cosine similarity between two tokenized sentences (lists of words)."""
    if stopwords is None:
        stopwords = []
 
    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]
 
    all_words = list(set(sent1 + sent2))
 
    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)
 
    # build the vector for the first sentence
    for w in sent1:
        if w in stopwords:
            continue
        vector1[all_words.index(w)] += 1
 
    # build the vector for the second sentence
    for w in sent2:
        if w in stopwords:
            continue
        vector2[all_words.index(w)] += 1
 
    # A sentence made only of stopwords has a zero vector; treat it as
    # dissimilar instead of dividing by zero inside cosine_distance.
    if not any(vector1) or not any(vector2):
        return 0.0

    return 1 - cosine_distance(vector1, vector2)
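As a worked check of the similarity measure, here is a dependency-free sketch of cosine similarity over word counts. The function name `cosine_similarity` and the two toy sentences are illustrative only; stopword filtering is left out to keep the arithmetic easy to follow.

```python
import math
from collections import Counter

def cosine_similarity(tokens1, tokens2):
    """Cosine of the angle between the word-count vectors of two token lists."""
    c1, c2 = Counter(tokens1), Counter(tokens2)
    dot = sum(c1[w] * c2[w] for w in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty sentence is treated as dissimilar to everything
    return dot / (norm1 * norm2)

a = ["the", "cat", "sat"]
b = ["the", "cat", "ran"]
# Two shared words, each vector has norm sqrt(3): 2 / 3.
print(round(cosine_similarity(a, b), 3))  # 0.667
```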


In [500]:
import numpy as np

 
# get the English list of stopwords
stop_words = stopwords.words('english')
 
def build_similarity_matrix(sentences, stopwords=None):
    # sentences: list of tokenized sentences (each a list of words)
    # Create an empty similarity matrix
    S = np.zeros((len(sentences), len(sentences)))
 
 
    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2:
                continue
 
            S[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stopwords)
 
    # normalize the matrix row-wise, skipping all-zero rows to avoid division by zero
    for idx in range(len(S)):
        row_sum = S[idx].sum()
        if row_sum > 0:
            S[idx] /= row_sum
 
    return S
 

In [491]:
def pagerank(A, eps=0.0001, d=0.85):
    P = np.ones(len(A)) / len(A)
    while True:
        new_P = np.ones(len(A)) * (1 - d) / len(A) + d * A.T.dot(P)
        # Sum the absolute differences: signed differences cancel out,
        # because both vectors sum to about 1.
        delta = np.abs(new_P - P).sum()
        if delta <= eps:
            return new_P
        P = new_P
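To see the power iteration at work on a tiny graph, here is a pure-Python sketch (plain lists instead of NumPy, so it runs anywhere). The 3x3 matrix is made up for illustration, and the `max_iter` cap is an added safety assumption. The sketch measures convergence with the L1 distance between successive iterates, because the signed differences of two probability vectors cancel out when summed.

```python
def pagerank(A, eps=1e-4, d=0.85, max_iter=100):
    """Power iteration on a row-stochastic matrix A given as a list of lists."""
    n = len(A)
    P = [1.0 / n] * n
    for _ in range(max_iter):
        # new_P[j] = (1 - d) / n + d * sum_i A[i][j] * P[i]
        new_P = [(1 - d) / n + d * sum(A[i][j] * P[i] for i in range(n))
                 for j in range(n)]
        # L1 distance between successive iterates.
        delta = sum(abs(x - y) for x, y in zip(new_P, P))
        P = new_P
        if delta <= eps:
            break
    return P

# Hypothetical row-normalized similarities: sentences 0 and 1 are both
# most similar to sentence 2, so sentence 2 should rank highest.
A = [
    [0.0, 0.2, 0.8],
    [0.2, 0.0, 0.8],
    [0.5, 0.5, 0.0],
]
ranks = pagerank(A)
print([round(r, 3) for r in ranks])
```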

In [492]:
def textrank(sentences, top_n=5, stopwords=None):
    """
    sentences - a list of tokenized sentences (each sentence is a list of words)

    top_n - how many sentences the summary should contain
    stopwords - a list of stopwords
    """
    S = build_similarity_matrix(sentences, stopwords)
    sentence_ranks = pagerank(S)
 
    # Sort the sentence indexes by rank, highest first
    ranked_sentence_indexes = [item[0] for item in sorted(enumerate(sentence_ranks), key=lambda item: -item[1])]
    # Keep the top_n sentences and restore their original order
    selected_sentences = sorted(ranked_sentence_indexes[:top_n])
    # A list comprehension also works when only one sentence is selected
    summary = [sentences[idx] for idx in selected_sentences]
    return summary
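The final selection step (take the highest-ranked sentences, then put them back in document order) can be looked at on its own. The rank values below are hypothetical, and `select_top` is an illustrative helper name rather than part of the notebook.

```python
def select_top(sentences, ranks, top_n):
    """Return the top_n highest-ranked sentences in their original order."""
    # Indexes sorted by rank, highest first.
    by_rank = sorted(range(len(ranks)), key=lambda i: ranks[i], reverse=True)
    # Re-sort the winners by position to keep document order.
    keep = sorted(by_rank[:top_n])
    return [sentences[i] for i in keep]

sentences = ["first", "second", "third", "fourth"]
ranks = [0.1, 0.4, 0.2, 0.3]  # hypothetical PageRank scores
print(select_top(sentences, ranks, top_n=2))
```

Restoring document order tends to make the extract read more naturally than listing sentences by score.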


## Evaluating Models

In [44]:
text = train.data[9]
print(text)




I've had the board for over a year, and it does work with Diskdoubler,
but not with Autodoubler, due to a licensing problem with Stac Technologies,
the owners of the board's compression technology. (I'm writing this
from memory; I've lost the reference. Please correct me if I'm wrong.)

Using the board, I've had problems with file icons being lost, but it's
hard to say whether it's the board's fault or something else; however,
if I decompress the troubled file and recompress it without the board,
the icon usually reappears. Because of the above mentioned licensing
problem, the freeware expansion utility DD Expand will not decompress
a board-compressed file unless you have the board installed.

Since Stac has its own product now, it seems unlikely that the holes
in Autodoubler/Diskdoubler related to the board will be fixed.
Which is sad, and makes me very reluctant to buy Stac's product since
they're being so stinky. (But hey, that's competition.)
-- 


In [486]:
clean(train.data[9])

"I 've had the board for over a year and it does work with Diskdoubler but not with Autodoubler due to a licensing problem with Stac Technologies the owners of the board 's compression technology. I 'm writing this from memory I 've lost the reference. Please correct me if I 'm wrong. Using the board I 've had problems with file icons being lost but it's hard to say whether it 's the board 's fault or something else however if I decompress the troubled file and recompress it without the board the icon usually reappears. Because of the above mentioned licensing problem the freeware expansion utility DD Expand will not decompress a boardcompressed file unless you have the board installed. Since Stac has its own product now it seems unlikely that the holes in Autodoubler/Diskdoubler related to the board will be fixed. Which is sad and makes me very reluctant to buy Stac 's product since they 're being so stinky. But hey that 's competition."

tf-idf summarizer

In [483]:
print(Tfidf_summarizer(clean(train.data[9]), 0.3))

Using the board I 've had problems with file icons being lost but it's hard to say whether it 's the board 's fault or something else however if I decompress the troubled file and recompress it without the board the icon usually reappears.Because of the above mentioned licensing problem the freeware expansion utility DD Expand will not decompress a boardcompressed file unless you have the board installed.


NER summarizer

In [484]:
print(NER_summarizer(clean(train.data[9])))

I 've had the board for over a year and it does work with Diskdoubler but not with Autodoubler due to a licensing problem with Stac Technologies the owners of the board 's compression technology.


summarizing function from gensim package

In [485]:
print(summarize(clean(train.data[9])))# From gensim package

Since Stac has its own product now it seems unlikely that the holes in Autodoubler/Diskdoubler related to the board will be fixed.


textrank summarizer 

In [503]:
text = sent_tokenize(clean(train.data[9]))
text = [word_tokenize(sent) for sent in text]

In [505]:
for idx, sentence in enumerate(textrank(text, top_n = 3, stopwords=stopwords.words('english'))):
    print("%s. %s" % ((idx + 1), ' '.join(sentence)))
 

1. I 've had the board for over a year and it does work with Diskdoubler but not with Autodoubler due to a licensing problem with Stac Technologies the owners of the board 's compression technology .
2. Using the board I 've had problems with file icons being lost but it 's hard to say whether it 's the board 's fault or something else however if I decompress the troubled file and recompress it without the board the icon usually reappears .
3. But hey that 's competition .


"I 've had the board for over a year and it does work with Diskdoubler but not with Autodoubler due to a licensing problem with Stac Technologies the owners of the board 's compression technology." appears in two of the summaries (NER and TextRank).

"Using the board I 've had problems with file icons being lost but it's hard to say whether it 's the board 's fault or something else however if I decompress the troubled file and recompress it without the board the icon usually reappears." appears in two of the summaries (tf-idf and TextRank).

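One way to quantify how much the summarizers agree is the Jaccard overlap between the sets of sentences they select. The sketch below is only an illustration: the index lists are read by hand from the outputs shown for `train.data[9]` (sentence 0 is the first sentence of the cleaned text), and `jaccard` is a hypothetical helper.

```python
from itertools import combinations

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Sentence indexes chosen by each summarizer (read by hand, illustrative).
summaries = {
    "tfidf": [3, 4],
    "ner": [0],
    "textrank": [0, 3, 7],
}

for (name1, s1), (name2, s2) in combinations(summaries.items(), 2):
    print(name1, name2, round(jaccard(s1, s2), 2))
```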
## More to Go

- More about evaluation: how many sentences are important, how often the first and last few sentences are included, and a similarity plot.

- About identifying importance: how do we define the label, i.e. how do we know whether a sentence is important?

- Compare with more existing models.
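As a starting point for the first item, here is a rough sketch of how often a summarizer picks the opening or closing sentences of a document. Everything in it is an assumption made for illustration: the helper `position_rates`, the parameter `k`, and the sample selections are hypothetical and not results from this notebook.

```python
def position_rates(selections, doc_lengths, k=1):
    """Fraction of summaries that include any of the first k / last k sentences."""
    total = len(selections)
    if total == 0:
        return 0.0, 0.0
    first = last = 0
    for chosen, n in zip(selections, doc_lengths):
        chosen = set(chosen)
        first += any(i in chosen for i in range(min(k, n)))
        last += any(i in chosen for i in range(max(n - k, 0), n))
    return first / total, last / total

# Hypothetical selected sentence indexes for three documents,
# together with the number of sentences in each document.
selections = [[0, 3, 7], [2], [0, 1]]
doc_lengths = [8, 5, 4]

first_rate, last_rate = position_rates(selections, doc_lengths)
print(first_rate, last_rate)
```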