# Key Phrase Extraction and Summarization

This notebook attempts to synthesize useful information from the reviews using key phrase extraction and summarization.

The task is accomplished with two specialized packages, [pke](https://github.com/boudinfl/pke) for phrase extraction and [sumy](https://github.com/miso-belica/sumy) for summarization.

In [1]:
import re
import string
import os
from pathlib import Path
import pickle
import gzip
import inspect
from collections import Counter, OrderedDict

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

import matplotlib.pyplot as plt
import seaborn as sns

import spacy
import nltk
import gensim
from unidecode import unidecode


import pke
from pke.unsupervised import FirstPhrases,KPMiner,TfIdf,YAKE
from pke.unsupervised import MultipartiteRank,PositionRank,SingleRank,TextRank,TopicalPageRank,TopicRank

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers import kl, lex_rank, luhn, text_rank, lsa, reduction
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words

from tqdm import tqdm_notebook as tqdmn
import logging
from joblib import Parallel, delayed



In [3]:
pd.set_option('display.max_rows',100)

In [2]:
nlp = spacy.load("en_core_web_md")

In [2]:
rpath = Path('proc_data/raw_texts/critiques/')
cpath = Path('proc_data/raw_texts/crits_dedup')
docs = [*cpath.iterdir()]

In [4]:
# Old Google stopwords list from the 2000's 
# from: https://www.ranks.nl/stopwords
# plus a few custom add-ins
minimal_stops = set(
    ['i','a','about','an','are','as','at','be','by','for',
     'from','how','in','is','it','of','on','or','that','the','this',
     'to','was','what','when','where','who','will','with'] # -com, -www
    + ['and','your','essay','have','so','my'] 
    + list(string.punctuation))

In [None]:
def to_file_corpus(df, datapath, idcol='doc_id_rv', postcol='post_content_rv'):
    print(f'Writing {df.shape[0]} files to: {datapath} ...')
    for doc_id, content in df[[idcol,postcol]].values:
        Path(datapath/doc_id).with_suffix('.txt').write_text(content)
    print('Done.')

In [5]:
df_merge= pd.read_feather('proc_data/merge_deduped.df')
df_merge = df_merge.applymap(unidecode)
rmcrits = df_merge.post_content_rv

### Term Frequency File

Create a term frequency file for use in `KPMiner` and `TfIdf`.

Note: these models consistently underperformed other methods, so ultimately, this file was not used.

In [121]:
df_text = pd.read_pickle('proc_data/minparse/full_text_minproc.pkl')
crit_idx = np.load('proc_data/minparse/crit_idx.npy')
df_crit = df_text.loc[crit_idx]
crit_content = df_crit['post_content']

In [204]:
vec_params = dict(strip_accents='ascii', ngram_range=(1,5), max_df=0.8, min_df=5, stop_words=list(minimal_stops))
count_vctzr = CountVectorizer(**vec_params)

count_mat_crit = count_vctzr.fit_transform(crit_content)
feat_names = count_vctzr.get_feature_names()

term_counts = pd.Series(count_mat_crit.sum(axis=0).A1, index=feat_names)
term_counts.to_csv('proc_data/raw_texts/crit_termcounts.tsv',sep='\t', header=False)

In [209]:
!head -n 5 'proc_data/raw_texts/crit_termcounts.tsv'

00	6
000	19
10	106
10 year	5
10 years	25


In [210]:
!sed -i '1i--NB_DOC--\t{count_mat_crit.shape[0]}' proc_data/raw_texts/crit_termcounts.tsv

In [211]:
!head -n 5 'proc_data/raw_texts/crit_termcounts.tsv'

--NB_DOC--	5534
00	6
000	19
10	106
10 year	5
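The `sed -i '1i...'` call above relies on GNU sed syntax. As a portable alternative, the header and the counts can be written in one pass from Python. The sketch below is only an illustration under stated assumptions: `write_docfreq_file` is a hypothetical helper, whitespace tokenization stands in for the `CountVectorizer`, and the toy documents are invented. As far as the pke documentation describes it, this file holds document frequencies (the number of documents containing each term), so the sketch counts each term at most once per document; that is worth confirming against the pke docs before relying on it.

```python
import tempfile
from collections import Counter
from pathlib import Path


def write_docfreq_file(docs, path):
    """Write a tab-separated frequency file that starts with a --NB_DOC-- header.

    Hypothetical helper: whitespace tokenization replaces CountVectorizer.
    """
    doc_freq = Counter()
    for doc in docs:
        # set() so that a term is counted at most once per document
        doc_freq.update(set(doc.lower().split()))

    lines = [f"--NB_DOC--\t{len(docs)}"]
    lines += [f"{term}\t{n}" for term, n in sorted(doc_freq.items())]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return doc_freq


# Invented toy documents, written to a temporary directory
toy_docs = ["the essay is good", "the essay is too long", "be more specific"]
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "toy_docfreq.tsv"
    toy_freq = write_docfreq_file(toy_docs, out)
    header = out.read_text(encoding="utf-8").splitlines()[0]
```

Writing the header together with the counts avoids editing the file in place afterwards.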


In [18]:
xtrcrs = [SingleRank(), TextRank(), PositionRank(), TopicalPageRank(), MultipartiteRank(), TopicRank(), YAKE()]

In [19]:
for x in xtrcrs:
    print(f"===== {str(x.__class__).split('.')[-1][:-2]} =====")
    print('select:',[*inspect.signature(x.candidate_selection).parameters.keys()])
    print('weight:',[*inspect.signature(x.candidate_weighting).parameters.keys()])

===== SingleRank =====
select: ['pos']
weight: ['window', 'pos', 'normalized']
===== TextRank =====
select: ['pos']
weight: ['window', 'pos', 'top_percent', 'normalized']
===== PositionRank =====
select: ['grammar', 'maximum_word_number', 'kwargs']
weight: ['window', 'pos', 'normalized']
===== TopicalPageRank =====
select: ['grammar', 'kwargs']
weight: ['window', 'pos', 'lda_model', 'stoplist', 'normalized']
===== MultipartiteRank =====
select: ['pos', 'stoplist']
weight: ['threshold', 'method', 'alpha']
===== TopicRank =====
select: ['pos', 'stoplist']
weight: ['threshold', 'method', 'heuristic']
===== YAKE =====
select: ['n', 'stoplist', 'kwargs']
weight: ['window', 'stoplist', 'use_stems']


In [114]:
# https://boudinfl.github.io/pke/build/html/unsupervised.html
logging.disable(level=logging.CRITICAL)
def kpextract(extractor, doc, pos=None, topn=10, show=False, termfreq_file=None):
    """Apply a single keyphrase extraction method to a document
    
    Parameters
    ----------
    extractor : pke.unsupervised Model
        Keyphrase extraction model to use.
    doc : str or filepath-object
        Text passage to apply keyphrase extraction
    pos : iterable[str], (default: {'NOUN','VERB','ADJ','ADV'})
        Part of speech tags used in candidate selection and weighting when applicable.
    topn : int, (default: 10)
        Number of phrases to extract for each method.
    show : bool, (default: False)
        If true, print results as they are calculated
    termfreq_file : str or filepath-object, (default: None)
        Path to term frequency document. Only required when 'kpminer' or 'tfidf' are included
        
    Returns
    -------
    phrases : list[tuple(extractor:str, phrase:str, score:float)]
        A list with one tuple per extracted phrase, holding
        the extraction method, the extracted phrase, and the score of the phrase
    """
    
    doc = doc.as_posix() if isinstance(doc, Path) else doc
    pos = pos if pos is not None else set(['NOUN','VERB','ADJ','ADV'])
    phrases = []

    name = str(extractor.__class__).split('.')[-1][:-2]
    if show: print(f"===== {name} =====")
    
    if name not in ['KPMiner','TfIdf']:
        extractor.load_document(input=doc, language='en', encoding='utf-8')
        extractor.candidate_selection(pos=pos)
        try:
            if name in ['SingleRank', 'TextRank', 'PositionRank']: 
                extractor.candidate_weighting(pos=pos)
            else: 
                extractor.candidate_weighting()
        except Exception as e: # ignore TPR exception when it fails to produce candidates
            if show: print(e)
    else:
        assert termfreq_file is not None, "Term frequency file required for KPMiner and TfIdf: pass a valid filepath to `termfreq_file`"
        docfreq = pke.load_document_frequency_file(input_file=termfreq_file)
        extractor.load_document(input=doc, language='en')

        extractor.candidate_selection(lasf=3, cutoff=200, n=5, stoplist=list(minimal_stops))
        extractor.candidate_weighting(df=docfreq)

    keyphrases = extractor.get_n_best(n=topn)
    for x,y in keyphrases:
        phrases.append((name,x,y))
        if show: print(f'{y:<10.4f} {x}')
            
    return phrases

In [96]:
samp = rmcrits.sample(1)
print(samp.iloc[0])

Sunkanmi, did the university that you are applying to provide you with a prompt for your personal statement? If they did not, then you still have a chance to correct the content of your essay. This essay just doesn't fall within the expected requirements of a masters degree personal statement. It sounds more like you drafted this for a college application instead. Therefore, you need to work on creating a more relevant personal statement for your application.

For starters, the focus of the essay should consider writing a shorter, more informative and relevant essay. A reviewer will normally offer 5 minutes to reading your essay before he decides it has taken too long to get to the point then sets it aside for future reading. Try to keep this essay short because this is a statement for graduate school. So the information you indicate should represent more of your college academics and professional experience. Even though your mother was the reason that you started your medical career, 

In [97]:
multi_extract(samp.iloc[0], show=True)

===== SingleRank =====
0.0755     essay just does n't fall
0.0736     more relevant personal statement
0.0665     essay should consider writing
0.0653     personal essay
0.0628     essay is
0.0599     masters degree personal statement
0.0535     relevant essay
0.0533     essay content
0.0493     indicate should represent more
0.0489     essay short
===== TextRank =====
0.0821     essay just does n't fall
0.0663     essay should consider writing
0.0642     more relevant personal statement
0.0559     indicate should represent more
0.0521     personal essay
0.0511     do n't lose focus
0.0496     masters degree personal statement
0.0488     personality will thrive
0.0445     reviewer will normally offer
0.0432     non - related reasons
===== PositionRank =====
0.0792     personal essay
0.0740     relevant personal statement
0.0640     personal statement
0.0571     essay content
0.0558     relevant essay
0.0458     essay
0.0335     person
0.0305     statement
0.0259     college application

[[('SingleRank', "essay just does n't fall", 0.07550260655124301),
  ('SingleRank', 'more relevant personal statement', 0.07358865299907494),
  ('SingleRank', 'essay should consider writing', 0.06653609641263179),
  ('SingleRank', 'personal essay', 0.06530097046880788),
  ('SingleRank', 'essay is', 0.06280347902841764),
  ('SingleRank', 'masters degree personal statement', 0.059891787799678735),
  ('SingleRank', 'relevant essay', 0.05350622811292505),
  ('SingleRank', 'essay content', 0.0533159836964467),
  ('SingleRank', 'indicate should represent more', 0.04933620153832671),
  ('SingleRank', 'essay short', 0.04885619372052423)],
 [('TextRank', "essay just does n't fall", 0.08212028743866907),
  ('TextRank', 'essay should consider writing', 0.06631508028225536),
  ('TextRank', 'more relevant personal statement', 0.06415624119462564),
  ('TextRank', 'indicate should represent more', 0.055906057154897),
  ('TextRank', 'personal essay', 0.05212557413230238),
  ('TextRank', "do n't lose f

In [113]:
multi_extract(samp.iloc[0], include=['single','text','position'], pos=['VBZ'], show=True)

===== SingleRank =====
===== TextRank =====
===== PositionRank =====
0.0000     sunkanmi
0.0000     university
0.0000     prompt
0.0000     personal statement
0.0000     chance
0.0000     content
0.0000     essay
0.0000     requirements
0.0000     masters degree
0.0000     college application


[[],
 [],
 [('PositionRank', 'sunkanmi', 0.0),
  ('PositionRank', 'university', 0.0),
  ('PositionRank', 'prompt', 0.0),
  ('PositionRank', 'personal statement', 0.0),
  ('PositionRank', 'chance', 0.0),
  ('PositionRank', 'content', 0.0),
  ('PositionRank', 'essay', 0.0),
  ('PositionRank', 'requirements', 0.0),
  ('PositionRank', 'masters degree', 0.0),
  ('PositionRank', 'college application', 0.0)]]

In [12]:
def multi_extract(doc, include=None, exclude=None, pos=None, topn=10, show=False, termfreq_file=None):
    """Apply multiple keyphrase extraction methods to a document
    
    Parameters
    ----------
    doc : str or filepath-object
        Text passage to apply keyphrase extraction
    include : list[str], (default: ['single','text','position','topic','multipartite','topical','yake'])
        Extraction methods to use. Valid options: 
        ['single','text','position','topic','multipartite','topical','yake','kpminer','tfidf','firstphrases']
    exclude : list[str], (default: None)
        Extraction methods to omit. If not None, all valid, non-excluded options will be used.
    pos : list[str], (default: None)
        Part of speech tags. Passed to `kpextract`
    topn : int, (default: 10)
        Number of phrases to extract for each method
    show : bool, (default: False)
        If true, print results as they are calculated
    termfreq_file : str or filepath-object, (default: None)
        Path to term frequency document. Only required when 'kpminer' or 'tfidf' are included
        
    Returns
    -------
    phrases : list[list[tuple(extractor:str, phrase:str, score:float)]]
        A list of each included extractor containing
        a list of each extracted phrase containing
        tuples of the extraction method, extracted phrase, and score of the phrase
    """
    optnames = ['single','text','position','topic','multipartite','topical','yake',
                'kpminer','tfidf','firstphrases']
    opts = [SingleRank(), TextRank(), PositionRank(), TopicRank(), MultipartiteRank(), TopicalPageRank(), YAKE(),
            KPMiner(), TfIdf(), FirstPhrases()]
    
    optdict = {k:v for k,v in zip(optnames, opts)}
    default = optnames[:7]
    
    if include is None:
        extnames = default if exclude is None else list(set(optnames)-set(exclude))
    else:
        extnames = include
    
    extractors = [optdict[x] for x in extnames]
    
    phrases = []
    for extractor in extractors:
        phrases.append(kpextract(extractor=extractor, doc=doc, pos=pos, topn=topn, show=show, termfreq_file=termfreq_file))
    
    return phrases
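One side effect of the `set` difference in `multi_extract` is that passing `exclude` returns the extractors in arbitrary order, and a misspelled name only surfaces later as a `KeyError`. A minimal sketch of an order-preserving selection step with early validation follows. `resolve_extractors` is a hypothetical name, not part of the notebook or of pke, and it keeps the same rule as above: with `exclude` set, all valid, non-excluded options are used.

```python
OPTNAMES = ['single', 'text', 'position', 'topic', 'multipartite', 'topical',
            'yake', 'kpminer', 'tfidf', 'firstphrases']
DEFAULT = OPTNAMES[:7]


def resolve_extractors(include=None, exclude=None):
    """Return extractor names in a stable order, rejecting unknown names."""
    requested = list(include or []) + list(exclude or [])
    unknown = [name for name in requested if name not in OPTNAMES]
    if unknown:
        raise ValueError(f"Unknown extractor name(s): {unknown}")

    if include is not None:
        return list(include)      # explicit selection wins
    if exclude is None:
        return list(DEFAULT)      # the seven default extractors
    excluded = set(exclude)
    # all valid options minus the excluded ones, in their original order
    return [name for name in OPTNAMES if name not in excluded]


selected = resolve_extractors(exclude=['yake', 'tfidf'])
```

Note that with `exclude` set, `'kpminer'` is still selected unless it is excluded too, so `termfreq_file` would be needed in that case.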

In [45]:
def extract_all(doc, include_tpr=True, include_freq_models=False, termcount_file=None, show=False):
    """Deprecated. TODO: replace with `multi_extract`"""
    st_extractors = [SingleRank(), TextRank()]
    extractors = [PositionRank(), MultipartiteRank(), TopicRank(), YAKE()]  #FirstPhrases()
    
    docp = doc.as_posix() if isinstance(doc, Path) else doc
    
    pos=set(['NOUN','VERB','ADJ','ADV'])
    phrases = []

    for extractor in st_extractors:
        name = str(extractor.__class__).split('.')[-1][:-2]
        if show: print(f"===== {name} =====")
        extractor.load_document(input=docp, language='en', encoding='utf-8')
        extractor.candidate_selection(pos=pos)
        extractor.candidate_weighting(window=10, pos=pos) # top_percent=0.33

        keyphrases = extractor.get_n_best(n=10)
        for x,y in keyphrases:
            phrases.append((name,x,y))
            if show: print(f'{y:<10.4f} {x}')

    for extractor in extractors: # extractors
        name = str(extractor.__class__).split('.')[-1][:-2]
        if show: print(f"===== {name} =====")
        extractor.load_document(input=docp, language='en', encoding='utf-8')
        extractor.candidate_selection(pos=pos)
        extractor.candidate_weighting() #pos=set(['NOUN','VERB','ADJ'])

        keyphrases = extractor.get_n_best(n=10)
        for x,y in keyphrases:
            phrases.append((name,x,y))
            if show: print(f'{y:<10.4f} {x}')

    if include_tpr:
        try:
            extractor = TopicalPageRank()
            name = str(extractor.__class__).split('.')[-1][:-2]
            if show: print(f"===== {name} =====")
            extractor.load_document(input=docp, language='en', encoding='utf-8')
            extractor.candidate_selection(pos=pos)
            extractor.candidate_weighting() #pos=set(['NOUN','VERB','ADJ'])

            keyphrases = extractor.get_n_best(n=10)
            for x,y in keyphrases:
                phrases.append((name,x,y))
                if show: print(f'{y:<10.4f} {x}')
        except Exception as e: # ignore TPR exception when it fails to produce candidates
            if show: print(e)

    if include_freq_models:
        df_extractors = [KPMiner(), TfIdf()]
        docfreq = pke.load_document_frequency_file(input_file=termcount_file 
                                                   if termcount_file is not None 
                                                   else 'proc_data/raw_texts/crit_termcounts.tsv')
        for extractor in df_extractors:
            name = str(extractor.__class__).split('.')[-1][:-2]
            if show: print(f"===== {name} =====")
            extractor.load_document(input=docp, language='en')

            extractor.candidate_selection(lasf=3, cutoff=200, n=5, stoplist=list(minimal_stops))
            extractor.candidate_weighting(df=docfreq) #sigma=3.0, alpha=2.3

            keyphrases = extractor.get_n_best(n=10)
            for x,y in keyphrases:
                phrases.append((name,x,y))
                if show: print(f'{y:<10.4f} {x}')

    return phrases

In [29]:
exdoc = np.random.choice(docs)
print(exdoc, exdoc.read_text(), sep='\n\n')

proc_data\raw_texts\crits_dedup\16657_10.txt

Thanks for that discussion of the roles played by those health professionals, Casey.

I think of that kind of profession within its historical context... like, in every culture, the healers have been multidisciplinary practitioners who did what they knew how to do and got the best results possible.

So, all practitioners would be considered healers with some being more advanced than others. So, when I hear about any medical profession, I just thing 

In human society, a healer is a healer. And you'll gradually learn more modalities. Like, you might learn acupressure and trigger point work, for example,to compliment what you do. You might learn Ericksonian hypnosis. No matter what your day job is, you can be a part time freelance healer and try out various forms of therapy with various groups. It's good to get passionate about your own blend of therapeutic modalities.


In [118]:
samp = rmcrits.sample(1)
print(samp.iloc[0])

Can choices command? I don't know if I like that intro... but you could try different verbs...

There is A sense of clarity within me that intuitively makes me enables me to intuit a calling to Bi omedical Engineering.---here is an idea I had for you.

I believe The central nervous system is the most

 All this content is ineffective. Anyone can make claims.

This part is excellent, though, because it tells something about your unique experiences: ... feel confident for  about new academic challenges. Nice!I seemed to criticize you a lot, but actually the essay is very impressive. I wish you would include more specific examples of your intentions, your plans for how you might specialize, etc. This definitely shows great writing skill, too.


In [12]:
# disable warnings about default LDA model and fewer than 10 extracts
logging.disable(level=logging.CRITICAL)

In [10]:
# skip reviews with 100 characters or fewer
sm_rmcrits = rmcrits[rmcrits.str.len() > 100]

In [31]:
extracted_phrases = Parallel(n_jobs=3)(delayed(multi_extract)(x) for x in tqdmn(sm_rmcrits))
pickle.dump(extracted_phrases,open('proc_data/exphrases_no_tpr.pkl','wb'))

HBox(children=(IntProgress(value=0, max=5125), HTML(value='')))




In [13]:
tpr_extracted_phrases = Parallel(n_jobs=3)(delayed(multi_extract)(x) for x in tqdmn(sm_rmcrits))
pickle.dump(tpr_extracted_phrases,open('proc_data/exphrases_tpr.pkl','wb'))

HBox(children=(IntProgress(value=0, max=5125), HTML(value='')))




In [39]:
extracted_phrases = pickle.load(open('proc_data/exphrases_no_tpr.pkl','rb'))
tpr_extracted_phrases = pickle.load(open('proc_data/exphrases_tpr.pkl','rb'))
extracted = [e + t for e,t in zip(extracted_phrases,tpr_extracted_phrases)]
pickle.dump(extracted,open('proc_data/exphrases.pkl','wb'))

In [26]:
sm_docid = df_merge[df_merge.post_content_rv.str.len() > 100]['doc_id_rv']
extracted = pickle.load(open('proc_data/exphrases.pkl','rb'))

In [27]:
interm = pd.DataFrame([*zip(sm_docid,extracted)],columns=['doc_id','rest']).explode('rest') # fill doc_ids
exphrase_df = pd.DataFrame([(a,p,s) for ex in extracted for a,p,s in ex], columns=['kpe','phrase','score'])

exphrase_df['doc_id'] = interm.reset_index()['doc_id']
exphrase_df = exphrase_df.iloc[:, np.r_[-1,0:3]] # reorder columns

In [148]:
(exphrase_df[(exphrase_df['kpe'] == 'YAKE') & (exphrase_df['score'] > 1)].shape[0] 
 / exphrase_df[(exphrase_df['kpe'] == 'YAKE')].shape[0])

0.0005076340349096021

A small share (~0.05%) of `YAKE` scores fall outside the 0-1 range. So few outliers do not warrant rescaling, so they will be omitted. Similarly, `TopicalPageRank` scores exceed 1 where a phrase consists largely of underscores.

`FirstPhrases` will also be omitted since it is purely a baseline test to assess relative model performance.
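The filtering described above can be sketched without pandas as follows. The rows and the helper name `keep_in_unit_range` are invented for illustration; they are not the notebook's data or code.

```python
def keep_in_unit_range(rows, omit=("FirstPhrases",)):
    """Keep (kpe, phrase, score) rows with 0 <= score <= 1 from retained extractors."""
    return [(kpe, phrase, score) for kpe, phrase, score in rows
            if kpe not in omit and 0.0 <= score <= 1.0]


# Invented toy rows in the (extractor, phrase, score) layout used above
rows = [
    ("YAKE", "good idea", 4.96),             # score above 1, dropped
    ("YAKE", "personal statement", 0.12),
    ("TopicalPageRank", "_ _ _", 8.36),      # underscore run, dropped
    ("FirstPhrases", "thanks", 0.50),        # baseline extractor, dropped
    ("TextRank", "relevant essay", 0.05),
]
kept = keep_in_unit_range(rows)
share_dropped = 1 - len(kept) / len(rows)
```

Dropping rows instead of rescaling leaves the scores of the remaining phrases untouched.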

In [153]:
exphrase_df[~(exphrase_df['kpe'] == 'FirstPhrases') & (exphrase_df['score'] > 1)].sort_values('score',ascending=False)

Unnamed: 0,doc_id,kpe,phrase,score
134661,63711_1,TopicalPageRank,_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ ...,8.358186
134662,63711_1,TopicalPageRank,_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ ...,7.010091
134663,63711_1,TopicalPageRank,_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _,6.740472
52668,77173_1,YAKE,like the name,5.584342
355310,12081_1,YAKE,good idea,4.963447
222474,51326_1,YAKE,paragraph only contains,4.719739
306283,7664_2,YAKE,give the reasons,4.389942
85128,63636_2,TopicalPageRank,_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ ...,3.890064
86121,54768_1,YAKE,search in google,3.884757
85129,63636_2,TopicalPageRank,_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ ...,3.787694


In [172]:
exphrase_df = exphrase_df[exphrase_df.score.between(0,1)] # exclude negative and > 1

In [179]:
exphrase_df['phrase'] = exphrase_df['phrase'].str.replace(r"[^a-z0-9 ']",'',regex=True).str.replace(r" {2,}",' ',regex=True).str.strip()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [332]:
exphrase_df[exphrase_df.phrase.str.split().str.len() > 2].groupby(['phrase']).score.count().sort_values(ascending=False)[:100]

phrase
do n't think                   167
statement of purpose           165
will be able                   139
do n't have                    108
long term goals                104
do n't know                     96
masters degree course           90
do not have                     89
would be better                 85
year career plan                83
current work experience         83
masters degree student          78
do n't need                     76
masters degree studies          74
am not sure                     66
do not need                     61
short term goals                60
essay is good                   56
be more specific                56
future career plans             51
'm not sure                     49
is very good                    49
should be able                  48
specific program help           47
actual work experience          46
college application essay       45
would be best                   45
do n't want                     44
relevant work

Even out of context, we can see a few commonly expressed concepts:
* long term goals
* year career plan
* short term goals
* future career plans
* long term career goals
* relevant work experience
* maximum word count
* is too long
* be more specific

Many phrases also state some variant of "don't need" / "not necessary".

From this, we can start putting together the overarching themes. 

Goals are a frequently discussed topic, particularly long-term goals. The other main notion is omitting irrelevant content in an effort to keep the document short.
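The tally behind these observations is a frequency count over phrases of three or more tokens. A small standard-library sketch of that count, using invented phrases instead of the extracted data, might look like this; `count_long_phrases` is a hypothetical name.

```python
from collections import Counter


def count_long_phrases(phrases, min_tokens=3):
    """Count phrases with at least `min_tokens` whitespace-separated tokens."""
    return Counter(p for p in phrases if len(p.split()) >= min_tokens)


# Invented toy phrases; one- and two-token phrases are ignored
toy_phrases = ["long term goals", "be more specific", "long term goals",
               "essay", "personal statement", "long term goals"]
top = count_long_phrases(toy_phrases).most_common(2)
```

Raising `min_tokens` filters out short, generic phrases at the cost of fewer distinct candidates.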

In [79]:
def filter_pos(text, keep_pos=('ADJ','NOUN','VERB','ADV')):
    return " ".join([t.text for t in filter(lambda t: t.pos_ in keep_pos, nlp(text, disable=['parser','ner']))])

In [53]:
def show_topics(a, vocab, n_top_words=8):
    top_words = lambda t: [vocab[i] for i in np.argsort(t)[:-n_top_words-1:-1]]
    topic_words = ([top_words(t) for t in a])
    return [' '.join(t) for t in topic_words]

In [72]:
vctzr_params = dict(stop_words=list(minimal_stops), strip_accents='ascii', ngram_range=(1,3), max_df=0.8, min_df=5)
lda_params = dict(n_components=6, max_iter=50, batch_size=64, evaluate_every=1, 
                  learning_method='online', n_jobs=3, verbose=1)

In [73]:
count_vctzr = CountVectorizer(**vctzr_params)
lda = LatentDirichletAllocation(**lda_params)

In [74]:
cmat = count_vctzr.fit_transform(rmcrits)
ct_ftnames = [*map(lambda x: x.replace(' ','_'),count_vctzr.get_feature_names())]

In [76]:
Xlda = lda.fit_transform(cmat)

iteration: 1 of max_iter: 50, perplexity: 4908.5938
iteration: 2 of max_iter: 50, perplexity: 4363.2268
iteration: 3 of max_iter: 50, perplexity: 4252.7059
iteration: 4 of max_iter: 50, perplexity: 4206.4561
iteration: 5 of max_iter: 50, perplexity: 4181.2662
iteration: 6 of max_iter: 50, perplexity: 4165.5252
iteration: 7 of max_iter: 50, perplexity: 4154.5851
iteration: 8 of max_iter: 50, perplexity: 4146.6657
iteration: 9 of max_iter: 50, perplexity: 4140.7530
iteration: 10 of max_iter: 50, perplexity: 4136.1232
iteration: 11 of max_iter: 50, perplexity: 4132.3841
iteration: 12 of max_iter: 50, perplexity: 4129.3784
iteration: 13 of max_iter: 50, perplexity: 4126.8868
iteration: 14 of max_iter: 50, perplexity: 4124.7465
iteration: 15 of max_iter: 50, perplexity: 4122.8814
iteration: 16 of max_iter: 50, perplexity: 4121.2369
iteration: 17 of max_iter: 50, perplexity: 4119.7834
iteration: 18 of max_iter: 50, perplexity: 4118.4761
iteration: 19 of max_iter: 50, perplexity: 4117.3116
it

In [77]:
show_topics(lda.components_, ct_ftnames, 15)

['statement personal not information purpose reviewer personal_statement university prompt statement_purpose letter why application should academic',
 'goals want sop why do program you_want school talk them future why_you term plan help',
 'work career can research experience masters field university course study studies degree help not professional',
 'engineering reader end idea write main very letter thesis design interest topic theme motivation end_first',
 'not can if paragraph do if_you more you_can need should just make don because then',
 'sentence think good but like not more would me some very can first paragraph also']

In [95]:
def save_picklegz(model, count_vectorizer, filepath='saves/models/tmp.pkl.gz'):
    """Save an LDA model in the format required by pke"""
    ldabunch = (count_vectorizer.get_feature_names(),
                model.components_,
                model.exp_dirichlet_component_,
                model.doc_topic_prior_)
    with gzip.open(filepath,'wb') as f:
        pickle.dump(ldabunch,f)
    return filepath

In [96]:
save_picklegz(lda, count_vctzr, 'saves/models/sklda.pkl.gz')

'saves/models/sklda.pkl.gz'
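Before passing the file to `TopicalPageRank` through its `lda_model` argument, it can be worth checking that the archive round-trips. The sketch below uses placeholder values instead of a fitted model, and `roundtrip_picklegz` is a hypothetical helper; the four-element tuple layout simply mirrors `save_picklegz` above.

```python
import gzip
import pickle
import tempfile
from pathlib import Path


def roundtrip_picklegz(bunch, filepath):
    """Write `bunch` as a gzip-compressed pickle, then read it back."""
    with gzip.open(filepath, 'wb') as f:
        pickle.dump(bunch, f)
    with gzip.open(filepath, 'rb') as f:
        return pickle.load(f)


# Placeholder values standing in for (vocabulary, components_,
# exp_dirichlet_component_, doc_topic_prior_)
toy_bunch = (['essay', 'goals'], [[0.7, 0.3]], [[0.6, 0.4]], 0.1)
with tempfile.TemporaryDirectory() as tmp:
    restored = roundtrip_picklegz(toy_bunch, Path(tmp) / 'toy_lda.pkl.gz')
```

With a real model the tuple would hold NumPy arrays, which need an element-wise comparison instead of `==`.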

In [153]:
ex1 = exdoc.read_text()

```
Editor's choice awards:
TopicalPageRank: 'avoid flowery statements', 'bland introduction cheesy'
PositionRank: 'avoid flowery statements', 'tad corny'
SingleRank: 'writing could be clearer'
TextRank: 'writing could be clearer', 'eliminate unnecessary wording such'
```

## Summarization

In [48]:
# https://pypi.org/project/sumy/
def summerize_all(exdoc, nsents=3, lang='english'):
    # `exdoc` may be a path (str or Path) to a text file, or the raw text itself
    parser = (PlaintextParser.from_file(Path(exdoc).as_posix(), Tokenizer(lang)) 
              if os.path.isfile(exdoc) 
              else PlaintextParser.from_string(exdoc, Tokenizer(lang)))
    stemmer = Stemmer(lang)
    
    summ_models = [lsa.LsaSummarizer(stemmer),lex_rank.LexRankSummarizer(stemmer),
                   text_rank.TextRankSummarizer(stemmer),kl.KLSummarizer(stemmer),
                   luhn.LuhnSummarizer(stemmer),reduction.ReductionSummarizer(stemmer)]
    
    for summzr in summ_models:
        summzr.stop_words = minimal_stops
        name = str(summzr.__class__).split('.')[-1][:-2]
        print(f"===== {name} =====")
        for sentence in summzr(parser.document, nsents):
            print('>', sentence, end='\n\n')

In [49]:
summerize_all(samp.iloc[0], nsents=5)

===== LsaSummarizer =====
> The whole essay should be powerful, as if it is a beautiful explanation of a single profound thought.

> Can you inspire the reader with an insight into film that makes it, perhaps, the most meaningful pursuit possible!?

> After all, it is the most technologically sophisticated form of modern art -- and it really does include all the arts, from writing to music to storytelling and more.

> Can you take this as an opportunity to express an important idea of yours, without interruption?

> The body of the essay, with your info and accomplishments, must be presented in a way that supports your main idea -- which may be a philosophical point about film.

===== LexRankSummarizer =====
> The rule is like this: Say it, explain it, and then say it again.

> The whole essay should be powerful, as if it is a beautiful explanation of a single profound thought.

> Can you inspire the reader with an insight into film that makes it, perhaps, the most meaningful pursuit p