## Types of Text Summarization Methods

![Imgur](https://i.imgur.com/J5KyMBJ.png)


# Import Packages 


In [33]:
import spacy
import pandas as pd 
import seaborn as sns
from gensim.models.callbacks import CallbackAny2Vec

import re

from nltk.corpus import stopwords

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from heapq import nlargest
from nltk import sent_tokenize
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

%matplotlib inline
stopwords = stopwords.words('english')
sns.set_context('notebook')

# Import Dataset 


In [117]:
reviews = pd.read_csv("../data/wine_reviews.csv", usecols=['points', 'title', 'description', 'price'], encoding='utf-8')
reviews = reviews.drop_duplicates(subset=['title'])
reviews = reviews.dropna()
reviews

Unnamed: 0,description,points,price,title
1,"This is ripe and fruity, a wine that is smooth...",87,15.0,Quinta dos Avidagos 2011 Avidagos Red (Douro)
2,"Tart and snappy, the flavors of lime flesh and...",87,14.0,Rainstorm 2013 Pinot Gris (Willamette Valley)
3,"Pineapple rind, lemon pith and orange blossom ...",87,13.0,St. Julian 2013 Reserve Late Harvest Riesling ...
4,"Much like the regular bottling from 2012, this...",87,65.0,Sweet Cheeks 2012 Vintner's Reserve Wild Child...
5,Blackberry and raspberry aromas show a typical...,87,15.0,Tandem 2011 Ars In Vitro Tempranillo-Merlot (N...
...,...,...,...,...
129966,Notes of honeysuckle and cantaloupe sweeten th...,90,28.0,Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ...
129967,Citation is given as much as a decade of bottl...,90,75.0,Citation 2004 Pinot Noir (Oregon)
129968,Well-drained gravel soil gives this wine its c...,90,30.0,Domaine Gresser 2013 Kritt Gewurztraminer (Als...
129969,"A dry style of Pinot Gris, this is crisp with ...",90,32.0,Domaine Marcel Deiss 2012 Pinot Gris (Alsace)


In [118]:
reviews.shape

(110581, 4)

# Text preprocessing

In [81]:
nlp = spacy.load('en_core_web_lg')
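def normalize_text(text):
    # Drop <pre>...</pre> and <code>...</code> blocks together with their contents.
    tm1 = re.sub('<pre>.*?</pre>', '', text, flags=re.DOTALL)
    tm2 = re.sub('<code>.*?</code>', '', tm1, flags=re.DOTALL)
    # Drop any remaining tags and stray copyright signs.
    tm3 = re.sub('<[^>]+>|©', '', tm2, flags=re.DOTALL)
    return tm3.replace("\n", "")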
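
To see what these substitutions do on a concrete string, here is a minimal self-contained sketch. The function name `strip_markup` and the sample string are hypothetical, and the last pattern assumes the intent is to drop any remaining tag and any copyright sign.

```python
import re

def strip_markup(text):
    """Hypothetical stand-in for normalize_text, for illustration only."""
    # Remove <pre> and <code> blocks including their contents.
    text = re.sub(r'<pre>.*?</pre>', '', text, flags=re.DOTALL)
    text = re.sub(r'<code>.*?</code>', '', text, flags=re.DOTALL)
    # Remove any remaining tags and copyright signs, keeping the enclosed text.
    text = re.sub(r'<[^>]+>|©', '', text)
    return text.replace("\n", "")

sample = "<p>Crisp and dry.</p>\n<pre>raw\nblock</pre><code>x = 1</code> Drink now."
cleaned = strip_markup(sample)
print(cleaned)
```

Note that replacing `"\n"` with an empty string glues together words that are separated only by a line break. Replacing it with a space may be safer if the raw descriptions contain such breaks.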

In [119]:
reviews['description_cleaned'] = reviews['description'].apply(normalize_text)

In [120]:
print('Before normalizing text-----\n')
print(reviews['description'][2])
print('\nAfter normalizing text-----\n')
print(reviews['description_cleaned'][2])

Before normalizing text-----

Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through, with crisp acidity underscoring the flavors. The wine was all stainless-steel fermented.

After normalizing text-----

Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through, with crisp acidity underscoring the flavors. The wine was all stainless-steel fermented.


Your methods: consider whether you have fully cleaned the dataset.
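
One way to act on the note above is to count how many descriptions still contain obvious residue. The sketch below is only an illustration: the name `residue_report`, the patterns in `CHECKS`, and the sample texts are all assumptions about what "unclean" might mean for this dataset.

```python
import re
from collections import Counter

# Hypothetical residue patterns; adjust them to the artifacts actually present.
CHECKS = {
    "html_tag": re.compile(r"<[^>]+>"),
    "html_entity": re.compile(r"&[a-zA-Z]+;|&#\d+;"),
    "url": re.compile(r"https?://\S+"),
    "repeated_whitespace": re.compile(r"\s{2,}"),
}

def residue_report(texts):
    """Count how many texts still match each residue pattern."""
    counts = Counter()
    for text in texts:
        for name, pattern in CHECKS.items():
            if pattern.search(text):
                counts[name] += 1
    return dict(counts)

sample_texts = [
    "Tart and snappy, the flavors of lime dominate.",
    "Ripe &amp; fruity, see http://example.com for notes.",
    "Firm  tannins and <b>juicy</b> red fruit.",
]
report = residue_report(sample_texts)
print(report)
```

Since the function only iterates over its argument, it could presumably be called on a column such as `reviews['description']` as well.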

In [121]:
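def cleanup_text(docs):
    # '.' is not in this set, so sentence-ending periods survive and
    # sent_tokenize can still split the cleaned text later.
    punctuations = '!"#$%&\'()*+,-/:;<=>?@[\\]^_`{|}~©'
    doc = nlp(docs, disable=['parser', 'ner'])
    # Lemmatize and lowercase; '-PRON-' is the pronoun lemma emitted by spaCy 2.x.
    tokens = [tok.lemma_.lower().strip() for tok in doc if tok.lemma_ != '-PRON-']
    # Drop stopwords and punctuation tokens (substring test against the string above).
    tokens = [tok for tok in tokens if tok not in stopwords and tok not in punctuations]
    tokens = ' '.join(tokens)
    return tokens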

reviews['description_cleaned'] = reviews['description_cleaned'].apply(lambda x: cleanup_text(x))

In [123]:
print('\nReviews description after removing punctuation and stopwords---\n')
print(reviews['description_cleaned'][2])


Reviews description after removing punctuation and stopwords---

tart snappy flavor lime flesh rind dominate . green pineapple poke crisp acidity underscore flavor . wine stainless steel ferment .


In [124]:
def split_data(texts):
    if isinstance(texts, str):
        sentences = sent_tokenize(texts)
    else:
        sentences = [sentence for text in texts for sentence in sent_tokenize(text)]
        
    return sentences

In [125]:
def generate_summary_cosin(cleaned_text):
    sentences = split_data(cleaned_text)
    # Nothing to rank with fewer than two sentences; return the text as a string.
    if len(sentences) < 2:
        return ' '.join(sentences)

    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(sentences)
    # Score every sentence except the last by cosine similarity to the last one.
    sentence_scores = cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1])[0]
    # Keep up to seven top-scoring sentences, restored to their original order.
    summary_sentences = nlargest(7, range(len(sentence_scores)), key=sentence_scores.__getitem__)

    summary_tfidf = ' '.join([sentences[i] for i in sorted(summary_sentences)])

    return summary_tfidf
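
The function above scores every sentence except the last by its cosine similarity to the last sentence and keeps the best-scoring ones, so the last sentence itself never appears in the summary. The sketch below reproduces that ranking logic in plain Python under a simplifying assumption: it uses raw word counts instead of TF-IDF weights. The names `cosine` and `rank_against_last` and the toy sentences are hypothetical.

```python
import math
from collections import Counter
from heapq import nlargest

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_against_last(sentences, k):
    """Return up to k earlier sentences most similar to the last one, in original order."""
    vectors = [Counter(s.split()) for s in sentences]
    # Compare the last sentence with every sentence before it.
    scores = [cosine(vectors[-1], v) for v in vectors[:-1]]
    top = nlargest(k, range(len(scores)), key=scores.__getitem__)
    return [sentences[i] for i in sorted(top)]

sentences = [
    "tart snappy lime flavor",
    "green pineapple crisp acidity",
    "wine stainless steel ferment",
    "crisp acidity underscore lime flavor",
]
picked = rank_against_last(sentences, k=2)
print(picked)
```

With a cutoff of seven and reviews that rarely exceed a handful of sentences, this scheme mostly amounts to dropping the final sentence, which matches the sample outputs shown in the following cells.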
    

In [127]:
print("Original Text:\n")
print(reviews['description_cleaned'][8])
print('\nSummarized text:\n')
print(generate_summary_cosin(reviews['description_cleaned'][8]))

Original Text:

savory dry thyme note accent sunny flavor preserve peach brisk dry wine . fruity fresh elegant sprightly footprint .

Summarized text:

savory dry thyme note accent sunny flavor preserve peach brisk dry wine .


In [129]:
print("Original Text:\n")
print(reviews['description_cleaned'][100])
print('\nSummarized text:\n')
print(generate_summary_cosin(reviews['description_cleaned'][100]))

Original Text:

fresh apple lemon pear flavor accent hint smoke nut bold full bodied pinot gris . rich bit creamy mouthfeel yet balanced briskly satisfy white wide pairing appeal . drink 2019 .

Summarized text:

fresh apple lemon pear flavor accent hint smoke nut bold full bodied pinot gris . rich bit creamy mouthfeel yet balanced briskly satisfy white wide pairing appeal .


In [131]:
print("Original Text:\n")
print(reviews['description_cleaned'][1000])
print('\nSummarized text:\n')
print(generate_summary_cosin(reviews['description_cleaned'][1000]))

Original Text:

arcane 's cab stylistically apart either california washington . define space . plenty new oak fruit acid tannin stand . sharp tangy cranberry raspberry strawberry citric acid play part . still young give time decanter cellar come together show good .

Summarized text:

arcane 's cab stylistically apart either california washington . define space . plenty new oak fruit acid tannin stand . sharp tangy cranberry raspberry strawberry citric acid play part .


In [132]:
reviews['summary'] = reviews['description_cleaned'].apply(generate_summary_cosin)

In [133]:
reviews['summary']


1         ripe fruity wine smooth still structure . firm...
2         tart snappy flavor lime flesh rind dominate . ...
3         pineapple rind lemon pith orange blossom start...
4         much like regular bottling 2012 come across ra...
5         blackberry raspberry aroma show typical navarr...
                                ...                        
129966    note honeysuckle cantaloupe sweeten deliciousl...
129967    citation give much decade bottle age prior rel...
129968    well drain gravel soil give wine crisp dry cha...
129969    dry style pinot gris crisp acidity . also weig...
129970    big rich dry power intense spiciness rounded t...
Name: summary, Length: 110581, dtype: object

In [25]:
data = reviews['summary'].astype(str).tolist()

def tagged_documents(data):
    # Tag each document with its position in `data` (0..n-1).
    return [TaggedDocument(text.split(), [i]) for i, text in enumerate(data)]

In [34]:
class ProgressCallback(CallbackAny2Vec):
    def __init__(self, total_epochs):
        self.total_epochs = total_epochs
        self.epoch = 0

    def on_epoch_begin(self, model):
        print(f"Epoch {self.epoch + 1}/{self.total_epochs} - Training: ", end='', flush=True)

    def on_epoch_end(self, model):
        print("Completed")
        self.epoch += 1

In [27]:
documents = tagged_documents(data)

In [28]:
documents

[TaggedDocument(words=['ripe', 'fruity', 'wine', 'smooth', 'still', 'structure', '.', 'firm', 'tannin', 'fill', 'juicy', 'red', 'berry', 'fruit', 'freshen', 'acidity', '.'], tags=[0]),
 TaggedDocument(words=['tart', 'snappy', 'flavor', 'lime', 'flesh', 'rind', 'dominate', '.', 'green', 'pineapple', 'poke', 'crisp', 'acidity', 'underscore', 'flavor', '.'], tags=[1]),
 TaggedDocument(words=['pineapple', 'rind', 'lemon', 'pith', 'orange', 'blossom', 'start', 'aroma', '.'], tags=[2]),
 TaggedDocument(words=['much', 'like', 'regular', 'bottling', '2012', 'come', 'across', 'rather', 'rough', 'tannic', 'rustic', 'earthy', 'herbal', 'characteristic', '.'], tags=[3]),
 TaggedDocument(words=['blackberry', 'raspberry', 'aroma', 'show', 'typical', 'navarran', 'whiff', 'green', 'herb', 'case', 'horseradish', '.', 'mouth', 'fairly', 'full', 'bodied', 'tomatoey', 'acidity', '.'], tags=[4]),
 TaggedDocument(words=['bright', 'informal', 'red', 'open', 'aroma', 'candy', 'berry', 'white', 'pepper', 'savo

In [35]:
model = Doc2Vec(vector_size=100,
                window=5,
                min_count=1,
                workers=8,
                alpha=0.025,
                min_alpha=0.01,
                dm=0)

model.build_vocab(documents)
model.train(documents, total_examples=len(documents), epochs=1000, callbacks=[ProgressCallback(total_epochs=1000)])

Epoch 1/1000 - Training: Completed
Epoch 2/1000 - Training: Completed
Epoch 3/1000 - Training: Completed
...

In [40]:
model.save('../model/doc2vec_model')
print('Model Saved')

Model Saved


In [44]:
reviews

Unnamed: 0,description,points,price,title,description_Cleaned_1,Description_Cleaned,Summary
1,"This is ripe and fruity, a wine that is smooth...",87,15.0,Quinta dos Avidagos 2011 Avidagos Red (Douro),"This is ripe and fruity, a wine that is smooth...",ripe fruity wine smooth still structure . firm...,ripe fruity wine smooth still structure . firm...
2,"Tart and snappy, the flavors of lime flesh and...",87,14.0,Rainstorm 2013 Pinot Gris (Willamette Valley),"Tart and snappy, the flavors of lime flesh and...",tart snappy flavor lime flesh rind dominate . ...,tart snappy flavor lime flesh rind dominate . ...
3,"Pineapple rind, lemon pith and orange blossom ...",87,13.0,St. Julian 2013 Reserve Late Harvest Riesling ...,"Pineapple rind, lemon pith and orange blossom ...",pineapple rind lemon pith orange blossom start...,pineapple rind lemon pith orange blossom start...
4,"Much like the regular bottling from 2012, this...",87,65.0,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,"Much like the regular bottling from 2012, this...",much like regular bottling 2012 come across ra...,much like regular bottling 2012 come across ra...
5,Blackberry and raspberry aromas show a typical...,87,15.0,Tandem 2011 Ars In Vitro Tempranillo-Merlot (N...,Blackberry and raspberry aromas show a typical...,blackberry raspberry aroma show typical navarr...,blackberry raspberry aroma show typical navarr...
...,...,...,...,...,...,...,...
129966,Notes of honeysuckle and cantaloupe sweeten th...,90,28.0,Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ...,Notes of honeysuckle and cantaloupe sweeten th...,note honeysuckle cantaloupe sweeten deliciousl...,note honeysuckle cantaloupe sweeten deliciousl...
129967,Citation is given as much as a decade of bottl...,90,75.0,Citation 2004 Pinot Noir (Oregon),Citation is given as much as a decade of bottl...,citation give much decade bottle age prior rel...,citation give much decade bottle age prior rel...
129968,Well-drained gravel soil gives this wine its c...,90,30.0,Domaine Gresser 2013 Kritt Gewurztraminer (Als...,Well-drained gravel soil gives this wine its c...,well drain gravel soil give wine crisp dry cha...,well drain gravel soil give wine crisp dry cha...
129969,"A dry style of Pinot Gris, this is crisp with ...",90,32.0,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),"A dry style of Pinot Gris, this is crisp with ...",dry style pinot gris crisp acidity . also weig...,dry style pinot gris crisp acidity . also weig...


In [41]:
model = Doc2Vec.load('../model/doc2vec_model')

In [154]:
def get_similar_wines(summary, model, data):
    vec = model.infer_vector(summary.split())
    similar_indices = model.dv.most_similar([vec], topn=5)
    similar_indices = [idx for idx, _ in similar_indices]
    similar_descriptions = data.iloc[similar_indices]

    return similar_descriptions
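
`most_similar` ranks the stored document vectors by cosine similarity to the query vector and returns `(tag, score)` pairs. Because the tags were assigned as positions 0..n-1, they are looked up with `iloc` rather than by index label. The sketch below illustrates both points in plain Python with toy 2-D vectors; `top_n_by_cosine`, `doc_vectors`, and `row_labels` are hypothetical names, not part of gensim.

```python
import math
from heapq import nlargest

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_n_by_cosine(query, vectors, n):
    """Return the n (position, score) pairs with the highest cosine similarity."""
    scored = ((pos, cosine(query, vec)) for pos, vec in enumerate(vectors))
    return nlargest(n, scored, key=lambda pair: pair[1])

doc_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
# Toy row labels with a gap, like a DataFrame index after dropna().
row_labels = [1, 2, 3, 5]

hits = top_n_by_cosine([1.0, 0.0], doc_vectors, n=2)
positions = [pos for pos, _ in hits]
print(positions, [row_labels[p] for p in positions])
```

The positions returned by the ranking differ from the row labels, which is why a positional lookup is the matching choice here.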

In [155]:
summary = 'tart snappy flavor lime flesh rind dominate . green pineapple poke crisp acidity underscore flavor .'

In [156]:
top_similar_descriptions = get_similar_wines(summary,model,reviews)

In [157]:
top_similar_descriptions

Unnamed: 0,description,points,price,title,description_cleaned,summary
2,"Tart and snappy, the flavors of lime flesh and...",87,14.0,Rainstorm 2013 Pinot Gris (Willamette Valley),tart snappy flavor lime flesh rind dominate . ...,tart snappy flavor lime flesh rind dominate . ...
111871,"Yellow-ish in color, heavy on the nose, more o...",86,15.0,Tomero 2009 Torrontés (Mendoza),yellow ish color heavy nose oily past aroma ch...,yellow ish color heavy nose oily past aroma ch...
181,"Crisp, coastal acidity dominates this wine, ma...",88,36.0,McIntyre Vineyards 2006 Mission Ranch Pinot No...,crisp coastal acidity dominate wine make mouth...,crisp coastal acidity dominate wine make mouth...
69973,"Aromas of passion fruit, tangerine, bell peppe...",85,15.0,Anakena 2012 Tama Vineyard Selection Sauvignon...,aroma passion fruit tangerine bell pepper gras...,aroma passion fruit tangerine bell pepper gras...
111011,"A lot of wood, still pretty rough, surrounds a...",87,28.0,Lolonis 1998 Tollini Vineyard Zinfandel (Redwo...,lot wood still pretty rough surround thick tan...,lot wood still pretty rough surround thick tan...


In [158]:
reviews.to_csv('../data/wine_reviews_final.csv')

In [170]:
columns_to_display = ['points', 'title', 'description', 'price']

print("Top 5 Similar Descriptions:\n")
for idx, row in top_similar_descriptions.iterrows():
    print(f"doc {idx + 1}:")
    for col in columns_to_display:
        print(f"{col}: {row[col]}")
    print()

Top 5 Similar Descriptions:

doc 3:
points: 87
title: Rainstorm 2013 Pinot Gris (Willamette Valley)
description: Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through, with crisp acidity underscoring the flavors. The wine was all stainless-steel fermented.
price: 14.0

doc 111872:
points: 86
title: Tomero 2009 Torrontés (Mendoza)
description: Yellow-ish in color, heavy on the nose, more oily than in the past, and with an aroma of cheap perfume. The wine has high acidity, a zesty bite, but not much in the middle, and thus the flavors of mango, melon and papaya are wan. Good for quaffing but not on a par with previous years.
price: 15.0

doc 182:
points: 88
title: McIntyre Vineyards 2006 Mission Ranch Pinot Noir (Arroyo Seco)
description: Crisp, coastal acidity dominates this wine, making it mouth-wateringly tart. May be too much for some folks, but it's quite elegant, with good cherry, black raspberry and cola flavors enriched with a touch of s