1) Topics displayed preferably as bubble charts
2) Keyword/Cluster Mapping to Sentences 
3) Sentence Abstract Summary 

**To-Do in Future:**

**** User-Directed Feedback for which model performs the best --> Use those selections as directed weights to help model selection for future events

In [1]:
# Load Libraries and packages
import numpy as np
import pandas as pd

# Parsing Tools for Summarizers
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer

# Extractive Text Summarizer Libraries
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.kl import KLSummarizer

# Abstractive Text Summarizers
## T5 Models
from transformers import T5Tokenizer, T5Config, T5ForConditionalGeneration
## BART Model
from transformers import BartForConditionalGeneration, BartTokenizer, BartConfig
## GPT-2 Model
from transformers import GPT2Tokenizer,GPT2LMHeadModel

In [2]:
col_names = ['Document_No', 'Dominant_Topic', 'Topic_Keywords', 'Text']
data = pd.read_csv('data/top_dominant_results.csv', usecols=col_names)
data.head()

Unnamed: 0,Document_No,Dominant_Topic,Topic_Keywords,Text
0,0,9.0,"car, call, company, tell, work, would, say, da...",Giving this location a low rating only because...
1,1,3.0,"order, wait, ask, say, minute, table, take, te...",I'll start off by saying that this was my favo...
2,2,3.0,"order, wait, ask, say, minute, table, take, te...",The mille crepe cake my sister gifted me for m...
3,3,7.0,"pizza, restaurant, love, service, bar, order, ...",Go someplace else there are better hotels in t...
4,4,9.0,"car, call, company, tell, work, would, say, da...",Used them a number of years ago and they were ...


In [3]:
print(f'Number of Rows in Dataframe: {len(data)}\n')
print(data.isnull().sum())

Number of Rows in Dataframe: 335433

Document_No         0
Dominant_Topic    143
Topic_Keywords    143
Text                0
dtype: int64


In [4]:
data.dropna(subset=['Dominant_Topic','Topic_Keywords'], inplace=True)
print(f'Number of Rows in Dataframe: {len(data)}\n')
data['Dominant_Topic'] = data['Dominant_Topic'].astype(int)
data.head()

Number of Rows in Dataframe: 335290



Unnamed: 0,Document_No,Dominant_Topic,Topic_Keywords,Text
0,0,9,"car, call, company, tell, work, would, say, da...",Giving this location a low rating only because...
1,1,3,"order, wait, ask, say, minute, table, take, te...",I'll start off by saying that this was my favo...
2,2,3,"order, wait, ask, say, minute, table, take, te...",The mille crepe cake my sister gifted me for m...
3,3,7,"pizza, restaurant, love, service, bar, order, ...",Go someplace else there are better hotels in t...
4,4,9,"car, call, company, tell, work, would, say, da...",Used them a number of years ago and they were ...


### Data currently stored in Dataframe format, however Summarizers require plain text format. Conversion and minor preprocessing to follow

In [5]:
# group text by topic number
grouped_text = data.groupby(['Dominant_Topic'], as_index = False).agg({'Text': '.'.join})

In [6]:
# Function to remove linebreaks from compiled text data
def remove_linebreaks(text):
    cleaned_string = '.'.join(text.splitlines())
    return cleaned_string

In [7]:
# Reformat new dataframe with grouped, cleaned text data through lambda application
grouped_text['Text'] = grouped_text.apply(lambda row : remove_linebreaks(row['Text']),axis = 1)

In [8]:
# Function to Save formatted data to Text format
def store_as_txt(groupby_column, target_column, file_location):
    ## Loop through dataframe by Topic Number
    for i in groupby_column:
        ## Store text data as local variable
        item = target_column.loc[groupby_column == i].item()
        ## Create unique text_doc for each topic
        with open(f"{file_location}{i}.txt", "w", encoding="utf-8") as text_file:
            text_file.write(item)
        print(f"Text Document {i} Complete")

In [9]:
# Designate Textfile-Save Directory Location
file_loc = "data/Text_Gen_Files/Clean_Text_Topic_"
# Execute Save Function
store_as_txt(grouped_text['Dominant_Topic'], grouped_text['Text'],file_loc)

Text Document 0 Complete
Text Document 1 Complete
Text Document 2 Complete
Text Document 3 Complete
Text Document 4 Complete
Text Document 5 Complete
Text Document 6 Complete
Text Document 7 Complete
Text Document 8 Complete
Text Document 9 Complete
Text Document 10 Complete
Text Document 11 Complete
Text Document 12 Complete
Text Document 13 Complete
Text Document 14 Complete
Text Document 18 Complete


In [10]:
# Load txt file as compiled string variable
def load_txt_file(file_loc,filename):
    with open(f"{file_loc}{filename}.txt","r",encoding="utf-8") as text_file:
        contents = text_file.read()
    return contents

In [13]:
file_dir = "data/Text_Gen_Files/"
filename = "Clean_Text_Topic_0"
topic_1_txt = load_txt_file(file_dir,filename)
# topic_1_txt

## Extractive Summarizers to be evaluated:
1) LexRank
2) LSA
3) Luhn 
4) KL-Sum

In [None]:
# Initialize Text Parser and Tokenizer for string variable as input
text_parser = PlaintextParser.from_string(topic_1_txt, Tokenizer('english'))

### 1. LexRank Summarizer

- unsupervised approach to text summarization based on graph-based centrality scoring of sentences

In [15]:
# Initialize LexRank Summarizer model
lex_rank_summarizer = LexRankSummarizer()
lexrank_summary = lex_rank_summarizer(text_parser.document, sentences_count=10)

# Print Summarized Text
for sentence in lexrank_summary:
    print(sentence)

=..i?I(?
?i ?
?i ?
?i ?
1 in [32].
On this...1.
?._IIIJ11l'aMm...1 _..R2 _.It3..----- - - L3.L2.Ll... -..........~.tI,: I...1t,1_11, _ _.AS.."...,., ,,--'.. \ .
and K.G.
It was good.
!.Loved it.


### 2. LSA Summarizer

- Extracts semantically significant sentences by applying singular value decomposition(SVD) to the matrix of term-document frequency.

In [16]:
lsa_summarizer = LsaSummarizer()
lsa_summary = lsa_summarizer(text_parser.document,sentences_count=10)

# Printing the summary
for sentence in lsa_summary:
    print(sentence)

Note that our method achieves the.optimal precision faster than SGD and also stops learning approximately when overfitting sets in...direction with high probability: for g?j > 0, we want P (gj ?
The simplest way to.provide this information is to add sensors which signal when a leg has reached an extreme forward or backward angle, as shown with dashed lines in Figure 3.
The important point here is while traditional projection pursuit.does not provide a well-founded justification for combining directions obtained from different indices, our framework allows to do precisely this ?
This does not change fundamentally the.bound (up to an additional complexity factor d log(n)), and justifies that we consider.simultaneously such a family of functions in the main algorithm...h1..h(x)..h2.h3.h4.h5..^.?4..^.?1..I..^.?3..^.?2..^.
I wanted to like this place as tonkotsu ramen is one of my favourite dishes on the planet after having lived in Kanto for three years..Was here tonight.
it started off as

### 3. Luhn Summarizer

- Approach is based on TF-IDF (Term Frequency-Inverse Document Frequency).
- Useful when very low frequent words as well as highly frequent words(stopwords) are both not significant

In [18]:
luhn_summarizer = LuhnSummarizer()
luhn_summary = luhn_summarizer(text_parser.document,sentences_count=10)

# Printing the summary
for sentence in luhn_summary:
    print(sentence)

Because of this, extra care K-SVM: select a transformation.learn a linear classifier.needs to be put in manually choosing the right.kernel in K-SVM; and in MCBoost, we may Figure 1: Duality between multiclass boosting and.not even be able to learn a good mapping if we SVM..preset some bad boundaries..SVCL.We can potentially overcome these limitations by combining boosting.and SVM to jointly learn both.the mapping and linear classifiers for a prediction space of arbitrary dimension d. We note that this.is not a straightforward merge of the two methods as this can lead to a computationally prohibitive.method; e.g. imagine having to solve the quadratic optimization of K-SVM before each iteration of.boosting.
Viola..778..Original..Denoise Shrinkage..Shrinkage Residual..Noised..Denoise Ours..Our Residual..Figure 3: (Original) the original image; (Noised) the image corrupted with white.gaussian noise (SNR 8.9 dB); (Denoise Shrinkage) the results of de-noising using.wavelet shrinkage or corin

### 4. KL Summarizer

- Selects sentences based on similarity of word distribution as the original text. 
- Aims to lower the KL-divergence criteria through a greedy optimization approach by adding sentences until the KL-divergence decreases

In [19]:
kl_summarizer = KLSummarizer()
kl_summary = kl_summarizer(text_parser.document,sentences_count=10)

# Printing the summary
for sentence in kl_summary:
    print(sentence)

(1) The features of the handwritten.character are extracted by the via-point estimation algorithm.
And that's the way it should be.
The bread was way too dense and the quality of meat and cheese was good, but there was just not enough of it.
He wasn't really clear and put us in a line with a group of guys.
The staff is great, the service is good, the store is clean and their selection is pretty wide.
That's the worst part about this place, the parking.
..The mojitos here are the best in the hotel.
That is why this is a 2 star and not 1 star.
Was was a family of three and the other consisted of two employee of the restaurant and a manager with a laptop in front of them...We sat there and discussed the fact that this is a huge bar with hardly any customers in it.
..The fries were the the highlight of the night!


## Abstractive Summarizer to be evaluated:

1) T5 Transformer
2) BART Model
3) GPT-2 Model

In [22]:
# Install T5 Library if not already installed
!pip install transformers



### T5 Transformer Model

In [48]:
topic_1_txt



In [45]:
# Import T5 Tool Library
from transformers import T5Tokenizer, T5Config, T5ForConditionalGeneration

# Initiate T5-Base model and Tokenizer
T5_model = T5ForConditionalGeneration.from_pretrained('t5-base')
tokenizer = T5Tokenizer.from_pretrained('t5-base')

# T5-Model needs keyword prior to text operation
## Concatenating the word "summarize:" to raw text
text = "summarize:" + topic_1_txt

# Retrieve encoded word ID's and generate summary from ID's
input_ids=tokenizer.encode(text, return_tensors='pt', max_length=512)
summary_ids = T5_model.generate(input_ids)

# Decode Generated ID's back to words & print results
t5_summary = tokenizer.decode(summary_ids[0])
print(t5_summary)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


<pad> a recent paper examines the use of statistical tests to improve optimization efficiency. the authors


### BART Model

In [46]:
# Import BART Model Library and Tokenizer tools
from transformers import BartForConditionalGeneration, BartTokenizer, BartConfig

# Initialize Tokenizer and Model
tokenizer=BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model=BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

# Encode inputs & pass to model.generate()
inputs = tokenizer.batch_encode_plus([topic_1_txt],max_length=512,return_tensors='pt')
summary_ids = model.generate(inputs['input_ids'],max_length=512, early_stopping=True)

# Decoding & print results
bart_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(bart_summary)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Statistical Tests for Optimization Efficiency. The paper argues that the loss function depends on stochastically generated data. This in turn determines an intrinsic scale of precision for statistical estimation. The proposed algorithms depend on a single interpretable parameter?.the probability for an update to be in the wrong direction.


### GPT-2 Model

In [47]:
# Import GPT-2 model & tokenizer
from transformers import GPT2Tokenizer,GPT2LMHeadModel

# Instantiate GPT-2 model & tokenizer
tokenizer=GPT2Tokenizer.from_pretrained('gpt2')
model=GPT2LMHeadModel.from_pretrained('gpt2')

# Encode text to get input ids & generate encoded summary
inputs=tokenizer.batch_encode_plus([topic_1_txt],return_tensors='pt',max_length=512)
summary_ids=model.generate(inputs['input_ids'],max_length=1024,early_stopping=True)

# Decode summary & print results
GPT_summary=tokenizer.decode(summary_ids[0],skip_special_tokens=True)
print(GPT_summary)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Statistical Tests for Optimization Efficiency..Levi Boyles, Anoop Korattikara, Deva Ramanan, Max Welling.Department of Computer Science.University of California, Irvine.Irvine, CA 92697-3425.{lboyles},{akoratti},{dramanan},{welling}@ics.uci.edu..Abstract.Learning problems, such as logistic regression, are typically formulated as pure.optimization problems defined on some loss function. We argue that this view.ignores the fact that the loss function depends on stochastically generated data.which in turn determines an intrinsic scale of precision for statistical estimation..By considering the statistical properties of the update variables used during the.optimization (e.g. gradients), we can construct frequentist hypothesis tests to.determine the reliability of these updates. We utilize subsets of the data for computing updates, and use the hypothesis tests for determining when the batch-size.needs to be increased. This provides computational benefits and avoids overfitting.by stopping w