### Extractive summarization using Hybrid Term Frequency - Inverse Document Frequency (TF-IDF)

- Load the preprocessed (lemmatized) reviews, where each review is broken down into sentences
- Each input record is one tokenized sentence
- Compute a TF-IDF score for each term in the sentences
- Compute one hybrid TF-IDF score per sentence, normalized by the number of unique words in the corpus, and sort the sentences by this score in descending order
- Select up to num_reviews sentences (the function default is 20; the run below passes 15) based on their cosine similarity
    - Encode each word with its pre-trained Global Vectors (GloVe) word embedding. Each embedding is a 300-dimensional vector.
    - Compute the sentence embedding as the average of its word vectors; in the code below, words missing from the GloVe vocabulary contribute a zero vector
    - Use threshold = (0.01, 0.99) and diff_threshold = 0.01 to select dissimilar sentences among those with the top hybrid TF-IDF scores
    - Each selected sentence has a cosine similarity to the top-ranked sentence above 0.01 and below 0.99
    - The similarity of each selected sentence differs from that of the previously selected sentence by more than 0.01
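
The thresholding rule in the last bullets can be sketched on toy data. This is a minimal, standard-library-only illustration, not the notebook's implementation: the helper names `cosine` and `select_dissimilar` and the 2-D vectors are made up for the example, and it assumes (as the function further down does) that every candidate is compared against the top-ranked sentence.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def select_dissimilar(ranked_vectors, max_items,
                      threshold=(0.01, 0.99), diff_threshold=0.01):
    """Greedy pick over vectors already sorted by score, best first.

    A candidate is kept when its similarity to the top-ranked vector lies
    strictly inside `threshold` and differs from the similarity of the
    previously kept candidate by more than `diff_threshold`.
    """
    selected = [0]   # the top-ranked sentence is always kept
    last_sim = 1.0   # similarity of the top sentence to itself
    for i in range(1, len(ranked_vectors)):
        if len(selected) >= max_items:
            break
        sim = cosine(ranked_vectors[0], ranked_vectors[i])
        if abs(last_sim - sim) > diff_threshold and threshold[0] < sim < threshold[1]:
            selected.append(i)
            last_sim = sim
    return selected

# Toy 2-D "sentence embeddings", assumed to be sorted by hybrid TF-IDF score
toy = [
    [1.0, 0.0],  # top-ranked sentence
    [2.0, 0.0],  # same direction as the top sentence -> rejected (upper bound)
    [1.0, 1.0],  # similarity about 0.707 -> kept
    [3.0, 3.0],  # same similarity as the previous pick -> rejected (diff_threshold)
    [0.0, 1.0],  # orthogonal to the top sentence -> rejected (lower bound)
    [1.0, 2.0],  # similarity about 0.447 -> kept
]
picked = select_dissimilar(toy, max_items=4)
print(picked)  # [0, 2, 5]
```

Only three of the six candidates survive, so the result can be shorter than `max_items` when few sentences pass both tests.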

In [1]:
%run lib.ipynb import *

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\sshre35\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sshre35\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\sshre35\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\sshre35\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sshre35\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sshre35\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is

In [3]:
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from collections import Counter
from functools import reduce 
import gensim.downloader as api
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import pandas as pd  # pd is used below; harmless if lib.ipynb already imports it
import nltk
import spacy
import math

In [4]:
# load the pre-trained GloVe model
glove_model = api.load('glove-wiki-gigaword-300')

In [7]:
domains_reviews = {}
domains = ["ride", "health", "investing"]
min_wc = 3

In [None]:
# load lemmatized reviews in each domain
for domain in domains:
    input_csv_file = DATA_DIR + "lemmatized/" + domain + "_lemmatized.csv"
    print("\n-----loading reviews from ", input_csv_file, "--------------\n")
    df = pd.read_csv(input_csv_file)
    df = df[df["word count"] > min_wc]
    domains_reviews[domain] = df

In [9]:
domains_reviews["ride"].head()

Unnamed: 0,review,sent,tokenized,lemmatized,word count
0,someone made an account on this app using my e...,someone made an account on this app using my e...,"someone,made,account,app,using,email,address","someone,make,account,app,use,email,address",7
1,someone made an account on this app using my e...,"i get all of their receipts, trip info, and cu...","get,receipts,trip,info,customer,service,responses","get,receipt,trip,info,customer,service,response",7
2,someone made an account on this app using my e...,they don't even have remotely the same name as...,"even,remotely,name,emailed,several,times,ask,s...","even,remotely,name,email,several,time,ask,stop...",9
3,someone made an account on this app using my e...,they had me describe myself to prove it's inco...,"describe,prove,incorrect,still,emailing","describe,prove,incorrect,still,email",5
4,someone made an account on this app using my e...,they should at least have some sort of email v...,"least,sort,email,verification,prove,emailing,v...","least,sort,email,verification,prove,email,vali...",9


In [10]:
domains_reviews["investing"].head()

Unnamed: 0,review,sent,tokenized,lemmatized,word count
1,"i like investing on here. yeah, savings are ma...","yeah, savings are mainly through round-ups, bu...","yeah,savings,mainly,roundups,saved,lot","yeah,saving,mainly,roundup,save,lot",6
2,"i like investing on here. yeah, savings are ma...","the spend account, however, neither makes sens...","spend,account,however,neither,makes,sense,conn...","spend,account,however,neither,make,sense,conne...",10
3,"i like investing on here. yeah, savings are ma...",you would think you could just move money betw...,"would,think,could,move,money,accounts,really,o...","would,think,could,move,money,account,really,on...",11
5,"i like investing on here. yeah, savings are ma...",most people signed up using another bank.,"people,signed,using,another,bank","people,sign,use,another,bank",5
7,"i like investing on here. yeah, savings are ma...",you can't even move invest funds to spend.,"even,move,invest,funds,spend","even,move,invest,fund,spend",5


In [11]:
domains_reviews["health"].head()

Unnamed: 0,review,sent,tokenized,lemmatized,word count
0,"i've tried many sleep apps the last few years,...","i've tried many sleep apps the last few years,...","tried,many,sleep,apps,last,years,helped,extent...","try,many,sleep,apps,last,year,help,extent,none...",14
1,"i've tried many sleep apps the last few years,...",there are dozens - maybe hundreds - to choose ...,"dozens,maybe,hundreds,choose,tell,within,minut...","dozen,maybe,hundred,choose,tell,within,minute,...",10
2,"i've tried many sleep apps the last few years,...","in the trial period, i was already figuring ou...","trial,period,already,figuring,individuals,want...","trial,period,already,figure,individual,want,fo...",12
3,"i've tried many sleep apps the last few years,...","in the past, lying with my eyes closed trying ...","past,lying,eyes,closed,trying,sleep,would,wind...","past,lie,eye,close,try,sleep,would,wind,make,f...",11
4,"i've tried many sleep apps the last few years,...",listening to the hypnosis audio that i like ta...,"listening,hypnosis,audio,like,takes,place,open...","listen,hypnosis,audio,like,take,place,openness...",10


In [22]:
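# matrix (corpus_size, num_features): per-sentence term weights for each unique word
#   (the caller below passes the unnormalized TfidfVectorizer output, not raw counts)
# df (num_features,), also known as document frequency: number of sentences that contain the given word

class HybridTfidfTransformer:
    def __init__(self, len_unique_words, corpus_size):
        self.len_unique_words = len_unique_words # number of unique words in the corpus
        self.corpus_size = corpus_size # number of sentences in the corpus

    def transform(self, matrix, df):
        tf = matrix / self.len_unique_words # normalize term weights by the number of unique words in the corpus
        idf = np.log(self.corpus_size / df)
        # matrix-vector product: sums the idf-weighted terms of each row, giving one score per sentence
        # (for the SciPy sparse matrix passed below, `tf * idf` performs this same product)
        tf_idf = tf @ idf
        return tf_idf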
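
The arithmetic in `transform` can be checked by hand on a toy corpus. The sketch below is a pure-Python stand-in, not the notebook's code: `hybrid_scores`, the vocabulary, and the integer weights are invented for illustration, and it assumes the product with `idf` collapses each row to a single score, which is what happens when the input is a SciPy sparse matrix.

```python
import math

def hybrid_scores(weights, doc_freq):
    """One score per row: sum over terms of (weight / V) * log(N / df).

    N is the number of rows (sentences) and V the number of columns
    (unique terms), mirroring corpus_size and len_unique_words.
    """
    n_sentences = len(weights)
    n_terms = len(doc_freq)
    idf = [math.log(n_sentences / d) for d in doc_freq]
    return [
        sum((w / n_terms) * i for w, i in zip(row, idf))
        for row in weights
    ]

# Toy term weights for 3 sentences over the vocabulary [app, crash, refund]
toy_weights = [
    [1, 0, 0],  # "app"
    [1, 2, 0],  # "app crash crash"
    [1, 0, 1],  # "app refund"
]
# Document frequency: number of sentences in which each term appears
toy_df = [sum(1 for row in toy_weights if row[t] != 0) for t in range(3)]

scores = hybrid_scores(toy_weights, toy_df)
ranking = sorted(range(len(scores)), key=lambda s: scores[s], reverse=True)
print(toy_df)   # [3, 1, 1]
print(ranking)  # [1, 2, 0]
```

Because "app" occurs in every sentence, its idf is log(3/3) = 0 and it contributes nothing; the ranking is driven entirely by the rarer terms.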

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def create_glove_embeddings(reviews, review_sent = False):
    
    review_embeddings = []
    for review_str in reviews:
        review_words = review_str
        top_words = []
        valid_words = 0
        if review_sent:
            review_words = nltk.word_tokenize(review_str)
        # get embeddings for each word
        review_words_embedding = []
        for word in review_words:
            if word in glove_model.key_to_index:
                review_words_embedding.append(glove_model[word])
                valid_words += 1
            else:
                review_words_embedding.append(np.zeros(300))
        review_embedding = np.mean(review_words_embedding, axis=0)
        # np.mean of an empty word list is a NaN scalar (shape ()), so fall back to a zero vector
        if len(review_embedding.shape) == 0: 
            review_embedding = np.zeros(300)
        else:
            # cosine similarity between the sentence centroid and each of its word vectors
            sim = cosine_similarity([review_embedding], review_words_embedding)[0]
            # keep the words closest to the sentence embedding (80% of the in-vocabulary word count)
            top_words_indices = sim.argsort()[::-1][:int(0.8*valid_words)]
            top_words = set([review_words[i] for i in top_words_indices])
        
        review_embeddings.append((review_embedding, ",".join(top_words)))

    # Normalize the review embeddings to unit length
    review_embeddings_arr = np.array([embedding for embedding, top_words in review_embeddings])
    norms = np.linalg.norm(review_embeddings_arr, axis=1).reshape(-1, 1)
    norms[norms == 0] = 1 # leave all-zero embeddings unchanged instead of dividing by zero
    review_embeddings_arr /= norms
    review_embeddings_top_words = [top_words for embedding, top_words in review_embeddings]
    return (review_embeddings_arr, review_embeddings_top_words)


    
def hybrid_tfidf_sentences_summary(raw_reviews, raw_sent, lemma_reviews, 
                                   output_file, num_reviews = 20, 
                                   threshold = (0.01, 0.99), diff_threshold = 0.01):

    review_words_str = [" ".join(words) for words in lemma_reviews]

    # create tfidf object and fit it with reviews
    vectorizer = TfidfVectorizer(norm=None) # do not apply L2 normalization to TF-IDF scores
    tfidf_matrix = vectorizer.fit_transform(review_words_str)
    
    # create a custom transformer for the hybrid TF-IDF formula
    corpus_size = len(lemma_reviews)
    len_unique_words = len(vectorizer.idf_)
    hybrid_transformer = HybridTfidfTransformer(len_unique_words, corpus_size)

    # calculate the document frequencies (df) for each feature
    df = np.array((tfidf_matrix != 0).sum(axis=0)).flatten()
    
    # apply the custom transformer to the TF-IDF vectors to obtain the hybrid TF-IDF scores
    sentence_scores = hybrid_transformer.transform(tfidf_matrix, df)
    
    review_embeddings, review_embeddings_top_words = create_glove_embeddings(lemma_reviews)
    sentence_embeddings = np.array(review_embeddings)

    
    # Generate the summary by selecting the top sentences
    selected_indices = np.argsort(sentence_scores)[::-1]
    
    index = 0
    next_index = index + 1
    review_index = selected_indices[0]
    count = 1
    dissimilar_reviews = [(review_index, sentence_scores[review_index], 
                           raw_reviews[review_index], 
                           raw_sent[review_index],
                           ",".join(lemma_reviews[review_index]))]
    last_sim = 1
    
    while count < num_reviews:
        # stop once every ranked candidate has been examined
        if next_index >= len(selected_indices):
            break
        next_review_index = selected_indices[next_index]
        review_matrix = [sentence_embeddings[review_index], sentence_embeddings[next_review_index]]
        cosine_value = cosine_similarity(review_matrix)[0][1]
        diff_last_sim = abs(last_sim - cosine_value)
        if diff_last_sim > diff_threshold and cosine_value > threshold[0] and cosine_value < threshold[1]:
            dissimilar_reviews.append((next_review_index, sentence_scores[next_review_index], 
                                       raw_reviews[next_review_index], 
                                       raw_sent[next_review_index],
                                       ",".join(lemma_reviews[next_review_index])))
            count += 1
            last_sim = cosine_value
        next_index += 1
    
    print("------------saving output to file", output_file, "------------")
    df = pd.DataFrame(dissimilar_reviews, columns = ["doc_id", "score", "raw", "sent",  "lemmatized"])
    df.to_csv(output_file, header=True, index=False)
    return df
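
The sentence-embedding step in `create_glove_embeddings` can be illustrated without downloading GloVe. This is a hedged sketch only: `toy_vectors` is a hypothetical 3-dimensional stand-in for the 300-dimensional model, and `sentence_embedding` and `l2_normalize` are illustrative helpers, not functions from the notebook or from gensim.

```python
import math

# Hypothetical 3-dimensional vectors standing in for the 300-d GloVe embeddings
toy_vectors = {
    "driver": [1.0, 0.0, 0.0],
    "late": [0.0, 1.0, 0.0],
}
DIM = 3

def sentence_embedding(words, vectors, dim):
    """Mean of the word vectors; unknown words count as zero vectors."""
    if not words:
        return [0.0] * dim
    rows = [vectors.get(w, [0.0] * dim) for w in words]
    return [sum(col) / len(rows) for col in zip(*rows)]

def l2_normalize(vec):
    """Scale to unit length, leaving an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return vec if norm == 0 else [x / norm for x in vec]

# "lyft" is not in the toy vocabulary, so it pulls the mean toward zero
emb = sentence_embedding(["driver", "late", "lyft"], toy_vectors, DIM)
unit = l2_normalize(emb)
all_unknown = l2_normalize(sentence_embedding(["lyft"], toy_vectors, DIM))
print(emb)
print(unit)
print(all_unknown)  # stays all zeros instead of becoming NaN
```

Averaging over all words, including out-of-vocabulary ones, shrinks the mean vector but does not change its direction, so the cosine similarities used for selection are unaffected once the vector is normalized. The zero-norm guard matters only for sentences in which no word has an embedding.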

In [None]:
domains_summary = {}

for domain in domains:
    df = domains_reviews[domain]
    raw_reviews = df["review"].tolist()
    sent_reviews = df["sent"].tolist()
    lemma_reviews = df["lemmatized"].tolist()
    lemmatized_reviews = [item.split(",") for item in lemma_reviews]
    output_file = DATA_DIR + domain + "_summary.csv"
    summary_df = hybrid_tfidf_sentences_summary(raw_reviews, sent_reviews, lemmatized_reviews, output_file, 15)
    domains_summary[domain] = summary_df

In [36]:
domains_summary["ride"].head()

Unnamed: 0,doc_id,score,raw,sent,lemmatized
0,27783,1.205823,i am afairlynew lyftriderand iheardheard that ...,i am afairlynew lyftriderand iheardheard that ...,"competition,lyft,find,true,ber,seem,much,well,..."
1,26214,1.180414,usually cheaper than uber not cheaper than a c...,usually cheaper than uber not cheaper than a c...,"usually,cheaper,uber,cheaper,cab,reliable,usua..."
2,62785,0.802241,i hate via they take way to long they take you...,i hate via they take way to long they take you...,"hate,via,take,way,long,take,money,paid,via,pas..."
3,35641,0.668979,i ordered a lift to night an d they told me i...,i ordered a lift to night an d they told me i...,"order,lift,night,told,would,want,get,text,tell..."
4,42767,0.646816,i was hesitant to book an uber bc of all the h...,i was hesitant to book an uber bc of all the h...,"hesitant,book,uber,horror,story,heard,ubers,st..."


In [37]:
domains_summary["health"].head()

Unnamed: 0,doc_id,score,raw,sent,lemmatized
0,25004,0.782249,had to go back & forth with customer service t...,had to go back & forth with customer service t...,"back,forth,customer,service,get,price,two,frie..."
1,63148,0.705859,just downloaded rootd today & i think itz such...,just downloaded rootd today & i think itz such...,"download,rootd,today,think,cute,lil,app,help,m..."
2,40152,0.580661,this app is gawd awful!! i accidentally downlo...,i accidentally downloaded it(via my niece) and...,"accidentally,download,via,niece,get,update,say..."
3,26196,0.440951,this app is for anxiety and to help people who...,this app is for anxiety and to help people who...,"app,anxiety,help,people,trouble,fall,asleep,be..."
4,4375,0.400531,i've been using betterhelp for maybe six month...,i've been using betterhelp for maybe six month...,"use,betterhelp,maybe,six,month,student,discoun..."


In [38]:
domains_summary["investing"].head()

Unnamed: 0,doc_id,score,raw,sent,lemmatized
0,37683,0.806133,m1 vs webull hands down m1 is better then webu...,webull has a stock lending program allows you ...,"webull,stock,lending,program,allows,loan,share..."
1,33941,0.743765,i am a disabled american man that was excited ...,i am a disabled american man that was excited ...,"disabled,american,man,excite,opening,account,p..."
2,26125,0.624518,i gave one star because had to do something to...,i gave one star because had to do something to...,"give,one,star,something,proceed,download,app,p..."
3,223916,0.538981,i enjoy trading on webull but mobile app and d...,i enjoy trading on webull but mobile app and d...,"enjoy,trading,webull,mobile,app,desktop,versio..."
4,26484,0.513329,nothing works on apple since march 2020 - not ...,an issue with broken watch lists on apple ipad...,"issue,broken,watch,list,apple,ipad,fidelity,ap..."
