## Types of Text Summarization Methods

![Imgur](https://i.imgur.com/J5KyMBJ.png)


# Import Packages 


In [1]:
import spacy
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt

import re

from nltk.corpus import stopwords

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from heapq import nlargest
from nltk import sent_tokenize
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

%matplotlib inline
stopwords = stopwords.words('english')
sns.set_context('notebook')

# Import Dataset 


In [22]:
reviews = pd.read_csv("../data/wine_reviews.csv", usecols=['points', 'title', 'description', 'price'], encoding='utf-8')
reviews = reviews.dropna()
reviews.head()

| | description | points | price | title |
|---|---|---|---|---|
| 1 | This is ripe and fruity, a wine that is smooth... | 87 | 15.0 | Quinta dos Avidagos 2011 Avidagos Red (Douro) |
| 2 | Tart and snappy, the flavors of lime flesh and... | 87 | 14.0 | Rainstorm 2013 Pinot Gris (Willamette Valley) |
| 3 | Pineapple rind, lemon pith and orange blossom ... | 87 | 13.0 | St. Julian 2013 Reserve Late Harvest Riesling ... |
| 4 | Much like the regular bottling from 2012, this... | 87 | 65.0 | Sweet Cheeks 2012 Vintner's Reserve Wild Child... |
| 5 | Blackberry and raspberry aromas show a typical... | 87 | 15.0 | Tandem 2011 Ars In Vitro Tempranillo-Merlot (N... |


# Text preprocessing

In [23]:
nlp = spacy.load('en_core_web_lg')
def normalize_text(text):
    # Drop <pre> and <code> blocks together with their contents
    tm1 = re.sub('<pre>.*?</pre>', '', text, flags=re.DOTALL)
    tm2 = re.sub('<code>.*?</code>', '', tm1, flags=re.DOTALL)
    # Strip any remaining tags and stray © symbols
    tm3 = re.sub('<[^>]+>|©', '', tm2, flags=re.DOTALL)
    return tm3.replace("\n", "")

In [24]:
reviews['description_Cleaned_1'] = reviews['description'].apply(normalize_text)

In [25]:
print('Before normalizing text-----\n')
print(reviews['description'][2])
print('\nAfter normalizing text-----\n')
print(reviews['description_Cleaned_1'][2])

Before normalizing text-----

Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through, with crisp acidity underscoring the flavors. The wine was all stainless-steel fermented.

After normalizing text-----

Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through, with crisp acidity underscoring the flavors. The wine was all stainless-steel fermented.
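
The review above contains no markup, so normalization leaves it unchanged. The following is a minimal sketch of what the three substitutions do on text that does contain tags. It assumes the last pattern is meant to remove tags and the © symbol independently; `strip_markup` and the sample string are made up for illustration and are not part of the notebook.

```python
import re

def strip_markup(text):
    # Remove <pre> and <code> blocks together with their contents
    text = re.sub(r'<pre>.*?</pre>', '', text, flags=re.DOTALL)
    text = re.sub(r'<code>.*?</code>', '', text, flags=re.DOTALL)
    # Remove any remaining tag, or a bare © symbol (assumed intent)
    text = re.sub(r'<[^>]+>|©', '', text)
    return text.replace("\n", "")

# Hypothetical input; the wine reviews themselves appear to be plain text
sample = "<p>Ripe and <b>fruity</b>.</p><pre>raw\nblock</pre> © 2017"
cleaned = strip_markup(sample)
print(cleaned)
```

Note that newlines are deleted rather than replaced with a space, so words on either side of a line break would be joined.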


Your methods: consider whether you have cleaned the dataset completely.

In [26]:
def cleanup_text(docs):
    # '.' is not in this string, so periods survive and sent_tokenize can still split sentences later
    punctuations = '!"#$%&\'()*+,-/:;<=>?@[\\]^_`{|}~©'
    doc = nlp(docs, disable=['parser', 'ner'])
    # Lemmatize and lowercase; '-PRON-' is the pronoun lemma in spaCy v2 (v3 no longer emits it)
    tokens = [tok.lemma_.lower().strip() for tok in doc if tok.lemma_ != '-PRON-']
    # Drop stopwords and punctuation tokens
    tokens = [tok for tok in tokens if tok not in stopwords and tok not in punctuations]
    return ' '.join(tokens)

reviews['Description_Cleaned'] = reviews['description_Cleaned_1'].apply(cleanup_text)

In [29]:
print('Reviews description with punctuation and stopwords---\n')
print(reviews['description_Cleaned_1'][2])
print('\nReviews description after removing punctuation and stopwords---\n')
print(reviews['Description_Cleaned'][2])

Reviews description with punctuation and stopwords---

Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through, with crisp acidity underscoring the flavors. The wine was all stainless-steel fermented.

Reviews description after removing punctuation and stopwords---

tart snappy flavor lime flesh rind dominate . green pineapple poke crisp acidity underscore flavor . wine stainless steel ferment .
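
The cleaned text keeps its periods while commas and hyphens disappear. That follows from the filter in `cleanup_text`: `tok not in punctuations` is a substring test against a string that has no `.` in it. A minimal sketch of that filter alone, using a hand-written lemma list and a tiny stand-in stopword set (both hypothetical, so no spaCy model is needed):

```python
punctuations = '!"#$%&\'()*+,-/:;<=>?@[\\]^_`{|}~©'

# Hypothetical stand-ins for the NLTK stopword list and the spaCy lemmas
stop = {'the', 'of', 'and', 'was', 'all'}
lemmas = ['tart', 'and', 'snappy', ',', 'the', 'flavor', 'of', 'lime',
          '.', '', 'stainless', '-', 'steel']

# `in` on a string is a substring test, so '' is also filtered out
kept = [t for t in lemmas if t not in stop and t not in punctuations]
print(' '.join(kept))
```

Because the periods remain, `sent_tokenize` can still find sentence boundaries in the cleaned text.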


In [18]:
reviews = reviews.drop_duplicates(subset=['title'])

In [10]:
def split_data(texts):
    if isinstance(texts, str):
        sentences = sent_tokenize(texts)
    else:
        sentences = [sentence for text in texts for sentence in sent_tokenize(text)]
        
    return sentences

In [11]:
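def generate_summary_cosin(cleaned_text):
    sentences = split_data(cleaned_text)
    # Nothing to rank with fewer than two sentences; return the text as a string
    if len(sentences) < 2:
        return ' '.join(sentences)
    
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(sentences)
    # Score every sentence except the last by its cosine similarity to the last one
    sentence_scores = cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1])[0]
    # Keep up to 7 top-scoring sentences; the last sentence itself is never a candidate
    summary_sentences = nlargest(7, range(len(sentence_scores)), key=sentence_scores.__getitem__)

    # Re-join the selected sentences in their original order
    summary_tfidf = ' '.join(sentences[i] for i in sorted(summary_sentences))
    
    return summary_tfidf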
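
To make the scoring step concrete, here is a minimal sketch of the same TF-IDF and cosine-similarity ranking on three hand-split sentences. The sentences are invented for illustration, and only one sentence is kept instead of seven so that the ranking is visible.

```python
from heapq import nlargest

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up sentences, already split (no sent_tokenize needed here)
sentences = [
    "ripe cherry fruit with firm tannin .",
    "bright acidity lifts the finish .",
    "cherry fruit and tannin linger .",
]

tfidf = TfidfVectorizer(stop_words='english').fit_transform(sentences)

# Similarity of the last sentence to each earlier sentence
scores = cosine_similarity(tfidf[-1], tfidf[:-1])[0]

# Index of the single most similar earlier sentence
top = nlargest(1, range(len(scores)), key=scores.__getitem__)
print(scores, top)
```

The first sentence shares "cherry", "fruit" and "tannin" with the last one, so it ranks first. The second shares no non-stopword terms and scores zero. With the notebook's limit of 7, most reviews are shorter than that, so the summary is usually every sentence except the last.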
    

In [12]:
print("Original Text:\n")
print(reviews['description_Cleaned_1'][8])
print('\nSummarized text:\n')
print(generate_summary_cosin(reviews['Description_Cleaned'][8]))

Original Text:

Savory dried thyme notes accent sunnier flavors of preserved peach in this brisk, off-dry wine. It's fruity and fresh, with an elegant, sprightly footprint.

Summarized text:

savory dry thyme note accent sunny flavor preserve peach brisk dry wine .


In [13]:
print("Original Text:\n")
print(reviews['description_Cleaned_1'][100])
print('\nSummarized text:\n')
print(generate_summary_cosin(reviews['Description_Cleaned'][100]))

Original Text:

Fresh apple, lemon and pear flavors are accented by a hint of smoked nuts in this bold, full-bodied Pinot Gris. Rich and a bit creamy in mouthfeel yet balanced briskly, it's a satisfying white with wide pairing appeal. Drink now through 2019.

Summarized text:

fresh apple lemon pear flavor accent hint smoke nut bold full bodied pinot gris . rich bit creamy mouthfeel yet balanced briskly satisfy white wide pairing appeal .


In [14]:
print("Original Text:\n")
print(reviews['description_Cleaned_1'][1000])
print('\nSummarized text:\n')
print(generate_summary_cosin(reviews['Description_Cleaned'][1000]))

Original Text:

Arcane's Cab is stylistically apart from either California or Washington. It defines its own space. There's plenty of new oak, but the fruit, acid and tannins stand up to it. This is sharp and tangy; cranberry and raspberry, strawberry and citric acids all playing their part. Still young, give it some time in a decanter or in your cellar to come together and show its best.

Summarized text:

arcane 's cab stylistically apart either california washington . define space . plenty new oak fruit acid tannin stand . sharp tangy cranberry raspberry strawberry citric acid play part .


In [15]:
reviews['Summary'] = reviews['Description_Cleaned'].apply(generate_summary_cosin)

In [16]:
reviews['Summary']

1         ripe fruity wine smooth still structure . firm...
2         tart snappy flavor lime flesh rind dominate . ...
3         pineapple rind lemon pith orange blossom start...
4         much like regular bottling 2012 come across ra...
5         blackberry raspberry aroma show typical navarr...
                                ...                        
129966    note honeysuckle cantaloupe sweeten deliciousl...
129967    citation give much decade bottle age prior rel...
129968    well drain gravel soil give wine crisp dry cha...
129969    dry style pinot gris crisp acidity . also weig...
129970    big rich dry power intense spiciness rounded t...
Name: Summary, Length: 120975, dtype: object

In [207]:
data = reviews['Summary'].astype(str).tolist()

tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(data)]

In [227]:
n_epochs = 100
vec_size = 100
alpha = 0.025

model = Doc2Vec(vector_size=vec_size,
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm=1)
  
model.build_vocab(tagged_data)


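# Note: every train() call below runs model.epochs full passes over the corpus
# (the constructor default, since epochs is not set above), so this loop
# performs n_epochs * model.epochs passes in total.
for epoch in range(n_epochs):
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.epochs)
    # decrease the learning rate by hand after each call
    model.alpha -= 0.0002
    # pin min_alpha to alpha so the next call trains at a constant rate
    model.min_alpha = model.alpha
    
    print("Epoch: {}".format(epoch))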

Epoch: 0
Epoch: 1
Epoch: 2
Epoch: 3
Epoch: 4
Epoch: 5
Epoch: 6
Epoch: 7
Epoch: 8
Epoch: 9
Epoch: 10
Epoch: 11
Epoch: 12
Epoch: 13
Epoch: 14
Epoch: 15
Epoch: 16
Epoch: 17
Epoch: 18
Epoch: 19
Epoch: 20
Epoch: 21
Epoch: 22
Epoch: 23
Epoch: 24
Epoch: 25
Epoch: 26
Epoch: 27
Epoch: 28
Epoch: 29
Epoch: 30
Epoch: 31
Epoch: 32
Epoch: 33
Epoch: 34
Epoch: 35
Epoch: 36
Epoch: 37
Epoch: 38
Epoch: 39
Epoch: 40
Epoch: 41


KeyboardInterrupt: 
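
The loop lowers `alpha` by 0.0002 each time its body completes, and the output shows it was interrupted after printing Epoch 41, that is, after 42 completed iterations. A small sketch of the learning rate this leaves the model with, assuming the decrement ran exactly once per printed epoch; `alpha_after` is a hypothetical helper, not part of the notebook.

```python
START_ALPHA = 0.025   # initial learning rate from the notebook
STEP = 0.0002         # manual decrement applied after each train() call

def alpha_after(iterations, start=START_ALPHA, step=STEP):
    """Learning rate after the given number of completed loop iterations."""
    return start - iterations * step

interrupted = alpha_after(42)    # Epoch 0 through Epoch 41 were printed
planned_end = alpha_after(100)   # value if all n_epochs had run

print(round(interrupted, 4), round(planned_end, 4))
```

So the saved model was trained down to a learning rate of roughly 0.0166 rather than the 0.005 the full schedule would have reached.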

In [229]:
test_data = word_tokenize(reviews['Summary'][1].lower())

In [230]:
inferred_vector = model.infer_vector(test_data)
sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))

# Compare and print the most/median/least similar documents from the train corpus
print('Test Document : «{}»\n'.format(' '.join(test_data)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(tagged_data[int(sims[index][0])].words)))

Test Document : «ripe fruity wine smooth still structure . firm tannin fill juicy red berry fruit freshen acidity .»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec<dm/m,d100,n5,w5,s0.001,t3>:

MOST ('0', 0.8460253477096558): «ripe fruity wine smooth still structure . firm tannin fill juicy red berry fruit freshen acidity .»

MEDIAN ('4252', 0.1888270229101181): «soft dull vegetal .»

LEAST ('119931', -0.1409110575914383): «inaugural vintage owner atlas vineyard management napa also farm vineyard oregon california 's central coast exquisite candy popcorn aroma seasoning fennel .»


In [250]:
model.save('../model/doc2vec_model.pkl')
print('Model Saved')

Model Saved


In [247]:
model = Doc2Vec.load('../model/doc2vec_model.pkl')

In [248]:
test_data = word_tokenize(reviews['Summary'][1].lower())

In [249]:
inferred_vector = model.infer_vector(test_data)
sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))

# Compare and print the most/median/least similar documents from the train corpus
print('Test Document : «{}»\n'.format(' '.join(test_data)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(tagged_data[int(sims[index][0])].words)))

Test Document : «ripe fruity wine smooth still structure . firm tannin fill juicy red berry fruit freshen acidity .»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec<dm/m,d100,n5,w5,s0.001,t3>:

MOST ('0', 0.8341885209083557): «ripe fruity wine smooth still structure . firm tannin fill juicy red berry fruit freshen acidity .»

MEDIAN ('104902', 0.20941321551799774): «softness dry wine . forward rich currant smoky oak could use great acidity .»

LEAST ('36457', -0.105157271027565): «dry crisply acidic silky mouth rich flavor . truly distinctive pinot noir . show variety 's elusive personality rich cherry currant withdraw mineral herb . yet final impression fine silky decadence .»
