# Effort #1: Recreating Base

Luhn, Hans Peter. "The Automatic Creation of Literature Abstracts." *IBM Journal of Research and Development* 2.2 (1958): 159-165.

DOI: [10.1147/rd.22.0159](https://ieeexplore.ieee.org/document/5392672)

## Initial Setup

In [1]:
import luhn_abstract as la

## Exhibit 2
*Source: The New York Times, September 8, 1957, page E11*  
*Title: Chemistry Is Employed in a Search for New Methods to Conquer Mental Illness*  
*Author: Robert K. Plumb*

In [2]:
val_text = '''By coincidence this weekend in New York City marks the end of the annual meeting of the American Psychological Association and the begining of 
the annual meeting of the American Chemical Society. Psychologists and chemists have never had so much in common as they now have in new studies of the chemical 
basis for human behavior. Exciting new finds in this field were also discussed last week in Iowa City, Iowa, at the annual meeting of the American 
Physiological Society and at Zurich, Switzerland, at the Second International Congress for Psychiatry. Two major recent developments have called the attention 
of chemists, physiologists, physicists and other scientists to mental diseases: It has been found that extremely minute quantities of chemicals can induce 
hallucinations and bizarre psychic disturbances in normal people, and mood-altering drugs (tranquilizers, for instance) have made long-institutionalized people 
amenable to therapy. Money to finance resreach on the physical factors in mental illness is being made available. Progress has 
been achieved toward the understanding of the chemistry of the brain. New goals are in sight. At the psychiatrists meeting in Zurich last week, four New York 
City physicians urged their colleagues to broaden their concept of "mental disease," and to probe more deeply into the chemistry and metabolism of the human 
body for answers to mental disorders and their prevention. Dr. Felix Marti-Ibanez and three brothers, Dr. Mortimer D. Sackler, 
Dr. Raymond R. Sackler and Dr. Arthur M. Sackler cited evidence that the blood chemistry of victims of schizophrenia is different from that of normal people. 
Perhaps multiple biological factors are responsible for this chemical change, they suggested. Mental disease is a "developmental process" and long duration of 
a disorder may result in "permanent alteration of anatomy and physiology," they said. They urged that trials of new drugs which affect the brain should be 
concentrated on complex studies of the mechanism of action of the drugs. The variety of substances capable of producing profound mental effects is a new 
armory of weapons for use in investigating biological mechanisms underlying mental disease, they said. The sources of behavioral disturbance are many and they 
may come from external as well as internal forces, the four reported. This concept has already proven practical, for instances, when it enabled psychiatrists 
to predict that the administration of ACTH and cortisone could produce psychosis. "It led some years ago to the development of a blood test which was 80 percent 
accurate in the identification of schizophrenic patients," they said. "It permitted us on physiologic grounds to deny that the psychoneuroses and the 
psychoses were lesser and greater degrees of the same disease process, and, in fact, to affirm that they represented opposite and even mutually exclusive 
directions of physiologic disturbances," they said. Chemicals now available should he used not only to bring relief to the mentally sick but also to uncover 
the biological mechanisms of the disease processes themselves. "Only then will the metabolic era mature and bring to fruition man's long hoped for salvation 
from the ravages of mental disease," they reported.
At the psychologist's meeting here, a technique for tracing electrical activity in specific portions of the animal brain was described by researchers from the 
University of California at Los Angeles. They reported that deep brain implants in cat brains were used to record electrical discharges created as the animals 
respond to stimulations to which they had been conditioned. In this way the California group reported, it is possible to track the sequence in which the brain 
brings its various parts into play in learning. Specific areas of memory in the brain may be located. Furthermore, the electrical pathways so traced out can 
be blocked temporarily by the use of chemicals. This poses new possibilities for studying brain chemistry changes in health and sickness and their alleviation, 
the California researchers emphasized. The new studies of brain chemistry have provided practical therapeutic results and tremendous encouragement to those who 
must care for mental patients. One evidence that knowledge in the interdisciplinary field is accumulating fast came last week in an announcement from Washington. 
This was the establishment by the National Institute of Mental Health of a clearing house of information on psychopharmacology. Literature in the field will be 
classified and coded so that staff members can answer a wide variety bf technical and scientific questions. People working in the field are invited to send three 
copies of papers or other material — even informal letters describing work they may have in progress to the Technical Information Unit of the center In 
Silver Spring, Md.'''
val_text = val_text.replace('\n','')

In [3]:
luhn_rtn = la.luhn_abstract.run_auto_summarization(val_text=val_text,is_print=True,func_stem_selected=la.luhn_abstract.tokenize_stem_nltk)

Quantile Significance Lower = 1
Quantile Significance Upper = 4
Exciting new finds in this field were also discussed last week in Iowa City, Iowa, at the annual meeting of the American Physiological Society and at Zurich, Switzerland, at the Second International Congress for Psychiatry. [10.8889] "Only then will the metabolic era mature and bring to fruition man's long hoped for salvation from the ravages of mental disease," they reported.At the psychologist's meeting here, a technique for tracing electrical activity in specific portions of the animal brain was described by researchers from the University of California at Los Angeles. [8.5000] People working in the field are invited to send three copies of papers or other material — even informal letters describing work they may have in progress to the Technical Information Unit of the center In Silver Spring, Md. [8.0476] At the psychiatrists meeting in Zurich last week, four New York City physicians urged t
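
For orientation, the scoring idea from Luhn's paper can be sketched in a few lines. This is an illustration only and not the implementation inside `luhn_abstract`: the function name `luhn_sentence_score`, the `max_gap` parameter, and the toy sentence are assumptions made for the example. A sentence is scanned for clusters of significant words separated by at most `max_gap` non-significant words; each cluster scores (number of significant words)² divided by the cluster length in words, and the sentence takes its best cluster.

```python
def luhn_sentence_score(tokens, significant, max_gap=4):
    """Best cluster score for one tokenized sentence (hypothetical helper)."""
    # Positions of the significant words within the sentence
    positions = [i for i, tok in enumerate(tokens) if tok in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            # Gap is small enough: still inside the same cluster
            count += 1
        else:
            # Close the current cluster and start a new one
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = pos, 1
        prev = pos
    # Score the final (or only) cluster
    return max(best, count ** 2 / (prev - start + 1))


tokens = "the chemistry of the brain and the chemistry of behavior".split()
score = luhn_sentence_score(tokens, significant={"chemistry", "brain", "behavior"})
print(round(score, 4))
```

In the toy sentence all four significant words fall into one cluster spanning nine words, so the score is 16/9. Tightening `max_gap` splits the cluster and lowers the score, which is the kind of sensitivity the parameter search below explores.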

## Effort #1: Explore the Parameter Space

In [4]:
import pandas as pd
import numpy as np
from tqdm import tqdm

In [5]:
vec_quant_lower = np.arange(start=5,stop=31)
vec_quant_upper = np.arange(start=30,stop=56)
vec_num_apart = np.arange(start=4,stop=8)
vec_article_sent_id = [3,23,24]
vec_article_send_score = [4.0,5.4,5.4]

In [6]:
%%time
val_num_sents = len(vec_article_sent_id)
#Setup storage for iteration results...
val_cnt = 0
vec_iter_sent_ids = []
vec_iter_sent_scores = []
vec_iter_common_ids = []
vec_iter_bound_lower = []
vec_iter_bound_upper = []
vec_iter_sw_remove = []
vec_iter_sw_zero = []
vec_iter_stem = []
vec_iter_luhn_sw = []
vec_iter_num_apart = []
#Create iterations over test conditions...
for val_lower in tqdm(vec_quant_lower):
    for is_sw_remove in [True,False]:
        for is_sw_zero in [True,False]:
            for func_stem_selected in [None,la.luhn_abstract.tokenize_stem_nltk,la.luhn_abstract.tokenize_stem_luhn]:
                for vec_ignore_words in [[],la.luhn_abstract.vec_luhn_sw]:
                    for val_num_apart in vec_num_apart:
                        for val_upper in vec_quant_upper:
                            if(val_lower < val_upper):
                                luhn_rtn = la.luhn_abstract.run_auto_summarization(val_text=val_text,
                                                                                   val_lower_int=val_lower,
                                                                                   val_upper_int=val_upper,
                                                                                   val_spacing=val_num_apart,
                                                                                   val_n=val_num_sents,
                                                                                   is_sw_remove=is_sw_remove,
                                                                                   is_sw_zero=is_sw_zero,
                                                                                   is_use_luhn_tf=True,
                                                                                   vec_sw_luhn=vec_ignore_words,
                                                                                   vec_sw_add=[],
                                                                                   func_stem_selected=func_stem_selected,
                                                                                   func_summary_selected=None,
                                                                                   is_print=False)
                                if(luhn_rtn[1].shape[0]>0):
                                    vec_tmp_ids = luhn_rtn[1]['id_sent'].iloc[0:val_num_sents].values.tolist()
                                    vec_iter_sent_ids.append(vec_tmp_ids)
                                    vec_iter_sent_scores.append(luhn_rtn[1]['score'].iloc[0:val_num_sents].values.tolist())
                                    vec_iter_common_ids.append(len(set(vec_tmp_ids)&set(vec_article_sent_id)))
                                    del(vec_tmp_ids)
                                    vec_iter_bound_lower.append(val_lower)
                                    vec_iter_bound_upper.append(val_upper)
                                    vec_iter_sw_remove.append(is_sw_remove)
                                    vec_iter_sw_zero.append(is_sw_zero)
                                    vec_iter_stem.append(func_stem_selected.__name__ if func_stem_selected else 'None')
                                    vec_iter_luhn_sw.append([]==vec_ignore_words)
                                    vec_iter_num_apart.append(val_num_apart)
                            val_cnt += 1

100%|███████████████████████████████████████████████████████████████████████████████████| 26/26 [41:09<00:00, 94.99s/it]

CPU times: user 40min 42s, sys: 23 s, total: 41min 5s
Wall time: 41min 9s





In [7]:
df_param_search = pd.DataFrame({'vec_sent_ids':vec_iter_sent_ids,'vec_scores':vec_iter_sent_scores,'val_bound_lower':vec_iter_bound_lower,
                          'val_bound_upper':vec_iter_bound_upper,'val_num_apart':vec_iter_num_apart,'is_sw_remove':vec_iter_sw_remove,
                          'is_sw_zero':vec_iter_sw_zero,'is_luhn_sw':vec_iter_luhn_sw,'stem_func':vec_iter_stem,'cnt_common':vec_iter_common_ids})
df_param_search['sum'] = [sum(x) for x in df_param_search['vec_scores']]
df_param_search.sort_values(by=['cnt_common','sum'],ascending=[False,False],inplace=True)
df_param_search.head()

| | vec_sent_ids | vec_scores | val_bound_lower | val_bound_upper | val_num_apart | is_sw_remove | is_sw_zero | is_luhn_sw | stem_func | cnt_common | sum |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| 1495 | [16.0, 23.0, 24.0] | [2.5, 2.0833333333333335, 2.0] | 8 | 30 | 7 | False | True | True | tokenize_stem_nltk | 2 | 6.583333 |
| 1496 | [16.0, 23.0, 24.0] | [2.5, 2.0833333333333335, 2.0] | 8 | 31 | 7 | False | True | True | tokenize_stem_nltk | 2 | 6.583333 |
| 1497 | [16.0, 23.0, 24.0] | [2.5, 2.0833333333333335, 2.0] | 8 | 32 | 7 | False | True | True | tokenize_stem_nltk | 2 | 6.583333 |
| 1498 | [16.0, 23.0, 24.0] | [2.5, 2.0833333333333335, 2.0] | 8 | 33 | 7 | False | True | True | tokenize_stem_nltk | 2 | 6.583333 |
| 1499 | [16.0, 23.0, 24.0] | [2.5, 2.0833333333333335, 2.0] | 8 | 34 | 7 | False | True | True | tokenize_stem_nltk | 2 | 6.583333 |


In [8]:
print(f'''This resulted in {val_cnt:,} possible parameters to evaluate; however, based on the restrictions for $C$ and $D$, only {df_param_search.shape[0]:,} summaries were generated''')

This resulted in 64,896 possible parameters to evaluate; however, based on the restrictions for $C$ and $D$, only 12,144 summaries were generated
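
The total of 64,896 can be checked by hand, since it is the product of the sizes of the seven nested loops. A minimal sketch, with the loop sizes hard-coded from the ranges defined in the earlier cell (the dictionary keys simply reuse the loop variable names):

```python
import math

# Size of each loop in the parameter search
loop_sizes = {
    "val_lower": len(range(5, 31)),      # 26 lower bounds
    "is_sw_remove": 2,
    "is_sw_zero": 2,
    "func_stem_selected": 3,
    "vec_ignore_words": 2,
    "val_num_apart": len(range(4, 8)),   # 4 spacings
    "val_upper": len(range(30, 56)),     # 26 upper bounds
}
total = math.prod(loop_sizes.values())

# Number of (lower, upper) pairs that pass the val_lower < val_upper check
valid_pairs = sum(1 for lo in range(5, 31) for hi in range(30, 56) if lo < hi)
runs = valid_pairs * (total // (26 * 26))

print(f"{total:,} combinations, {runs:,} pass the bound check")
```

Only the pair (30, 30) fails the `val_lower < val_upper` check, so 64,800 combinations reach `run_auto_summarization`. Going by the loop's `shape[0]>0` test, the remaining drop to 12,144 would come from runs that returned an empty result table.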


Q: What is the distribution of the number of selected sentence IDs that match the baseline sentence IDs from the article (i.e., $[3,23,24]$)?  
A: &darr;

In [9]:
df_param_search['cnt_common'].value_counts()

cnt_common
1    6635
0    4621
2     888
Name: count, dtype: int64

Q: For results with 2 matching sentence IDs, how frequently were stopwords removed from the corpus?  
A: &darr;

In [10]:
df_param_search.loc[(df_param_search['cnt_common']==2)]['is_sw_remove'].value_counts()

is_sw_remove
False    888
Name: count, dtype: int64

Q: For results with 2 matching sentence IDs, how frequently were stopword scores zeroed in the corpus?  
A: &darr;

In [11]:
df_param_search.loc[(df_param_search['cnt_common']==2)]['is_sw_zero'].value_counts()

is_sw_zero
True     456
False    432
Name: count, dtype: int64

Q: For results with 2 matching sentence IDs, how frequently were the Luhn stopwords (prepositions, pronouns, articles) removed from the corpus?  
A: &darr;

In [12]:
df_param_search.loc[(df_param_search['cnt_common']==2)]['is_luhn_sw'].value_counts()

is_luhn_sw
True    888
Name: count, dtype: int64

Q: For results with 2 matching sentence IDs, how frequently were the different stemming functions utilized?  
A: &darr;

In [13]:
df_param_search.loc[(df_param_search['cnt_common']==2)]['stem_func'].value_counts()

stem_func
tokenize_stem_nltk    600
None                  144
tokenize_stem_luhn    144
Name: count, dtype: int64

Q: For results with 2 matching sentence IDs, what is the distribution for the length of the span (i.e., $n$)?  
A: &darr;

In [14]:
df_param_search.loc[(df_param_search['cnt_common']==2)]['val_num_apart'].value_counts()

val_num_apart
7    312
5    288
6    288
Name: count, dtype: int64

Q: For results with 2 matching sentence IDs and $n==7$, how frequently were stopword scores zeroed in the corpus?  
A: &darr;

In [15]:
df_param_search.loc[(df_param_search['cnt_common']==2) & (df_param_search['val_num_apart']==7)]['is_sw_zero'].value_counts()

is_sw_zero
True     168
False    144
Name: count, dtype: int64

Q: What are the minimum and maximum values for $C$ and $D$ for the different results by the number of matching sentence IDs to the baseline?  This question provides insight into the "resolving power" of the words and how $C$ and $D$ can behave in a similar manner to stopwords.  
A: &darr;

In [16]:
df_param_search.pivot_table(index='cnt_common',values=['val_bound_lower','val_bound_upper'],aggfunc={'val_bound_lower':pd.Series.min,
                                                                                                     'val_bound_upper':pd.Series.max})

| cnt_common | val_bound_lower | val_bound_upper |
| :-: | :-: | :-: |
| 0 | 5 | 55 |
| 1 | 5 | 54 |
| 2 | 8 | 41 |


Q: For results with 2 matching sentence IDs, what sentence IDs were identified?  To reduce rearrangements of the same IDs, the vector of IDs is first sorted for each result and then converted to a string value.  
A: &darr;

In [17]:
_ = [x.sort() for x in df_param_search['vec_sent_ids']]
df_param_search['str_sent_id'] = [','.join([f'{j:n}' for j in x]) for x in df_param_search['vec_sent_ids']]
df_param_search.loc[(df_param_search['cnt_common']==2)]['vec_sent_ids'].value_counts()

vec_sent_ids
[0.0, 3.0, 23.0]      576
[3.0, 16.0, 23.0]     288
[16.0, 23.0, 24.0]     24
Name: count, dtype: int64
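
The sort-then-join step described in the question can be shown in isolation. A minimal sketch on made-up ID vectors (the variable names here are illustrative, not taken from the notebook); it uses the non-mutating `sorted` and counts the resulting string keys, which are hashable whereas lists are not:

```python
from collections import Counter

# Toy ID vectors: the first two hold the same IDs in a different order
vec_ids = [[23.0, 3.0, 0.0], [0.0, 3.0, 23.0], [16.0, 23.0, 24.0]]

# Sort each vector, then join the integer IDs into one string key
keys = [",".join(str(int(j)) for j in sorted(ids)) for ids in vec_ids]
counts = Counter(keys)
print(counts)
```

Because the key no longer depends on the order in which sentences were ranked, the two rearrangements collapse into a single entry.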

Q: For results with 2 matching sentence IDs, and each vector of sentence IDs identified in the previous block, what were the ranges for $C$ and $D$?  This starts to shed light on some of the nuances for identifying all three sentences from the paper with the parameters.  
A: &darr;

In [19]:
df_param_search.loc[(df_param_search['cnt_common']==2) &
                    (df_param_search['is_luhn_sw']==True) &
                    (df_param_search['is_sw_remove']==False) &
                    (df_param_search['val_num_apart']==7)].pivot_table(index='str_sent_id',values=['val_bound_lower','val_bound_upper'],
                                                                       aggfunc={'val_bound_lower':pd.Series.min,'val_bound_upper':pd.Series.max})

| str_sent_id | val_bound_lower | val_bound_upper |
| :-: | :-: | :-: |
| 0,3,23 | 14 | 41 |
| 16,23,24 | 8 | 41 |


## Effort #2: Using More Modern NLP

This effort uses the [CNN/Daily Mail](https://paperswithcode.com/dataset/cnn-daily-mail-1) dataset, which provides news articles with short summaries written by the author.  The raw dataset was available from the [cnn-dailymail GitHub](https://github.com/abisee/cnn-dailymail) repository.  The parser written below provides a quick method to extract and partially clean the article text while separating the highlights (summary) and retaining the ID.  It is not intended to clean the text holistically or to produce text equivalent to the Stanford CoreNLP library; it is simply an easier way to obtain data for comparing automatic summarization.
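
The core of the parsing step is splitting each `.story` file at its `@highlight` markers. The following is a minimal sketch on an in-memory string; the function name `split_story` and the sample text are made up for illustration, and the sketch leaves out the `-LRB- CNN -RRB-` and `--` dateline trimming that the notebook cell also performs.

```python
def split_story(raw_text):
    """Split raw .story text into (story, highlights). Hypothetical helper."""
    story_lines, highlights = [], []
    is_past_story = False
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("@highlight"):
            # Everything after the first marker belongs to the highlights
            is_past_story = True
            continue
        (highlights if is_past_story else story_lines).append(line)
    return " ".join(story_lines), highlights


sample = """First sentence of the article .

Second sentence of the article .

@highlight

A one-line summary

@highlight

Another one-line summary"""

story, highlights = split_story(sample)
print(story)
print(highlights)
```

The number of highlights returned here is what the later comparison uses as the target summary length for each article.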

In [20]:
import glob
from rouge_score import rouge_scorer

In [21]:
%%time
#This code provides a rough ETL/parser for the CNN/DailyMail content...
vec_story_files = glob.glob('./data/*.story')
print(f'''There are {len(vec_story_files)} articles in total...''')
vec_id = []
vec_stories = []
vec_highlights = []
vec_ids = []
for i,val_file_path in enumerate(vec_story_files):
    #Read the file inside a context manager so the handle is closed...
    with open(val_file_path,'r') as tmp_file:
        tmp_open = tmp_file.readlines()
    tmp_story = ''
    tmp_highlights = []
    is_past_story = False
    for line in tmp_open:
        line = line.strip()
        if(not is_past_story and line.startswith('@h')):
            is_past_story = True
        if(not line=='' and not is_past_story):
            tmp_story = tmp_story+' '+line
        elif(not line=='' and not line.startswith('@h')):
            tmp_highlights.append(line)
    if(tmp_story.find('-LRB- CNN -RRB-')>=0):
        vec_story = tmp_story.split('-LRB- CNN -RRB-')
        tmp_story = ' '.join(vec_story[1:])
    if(0<=tmp_story.find('--')<50):
        vec_story = tmp_story.split('--')
        tmp_story = ' '.join(vec_story[1:])
    vec_ids.append(val_file_path.split('/')[-1].split('.')[0])
    vec_stories.append(tmp_story)
    vec_highlights.append(tmp_highlights)
    if(i>=4999):
        break

There are 92579 articles in total...
CPU times: user 268 ms, sys: 331 ms, total: 598 ms
Wall time: 2.74 s


In [22]:
df_cnn = pd.DataFrame({'article_id':vec_ids,'story':vec_stories,'highlight':vec_highlights})
print(df_cnn.shape)
df_cnn.head()

(5001, 3)


| | article_id | story | highlight |
| :-: | :-: | :-: | :-: |
| 0 | 638ba1352bdf405a8f5bd681d7fe5c928686afff | At the start of a big week for the Higgs boso... | [U.S.-based scientists say their data points t... |
| 1 | f9f9601180ab3278165d936821e8f145659997f3 | acquitted by a Florida jury over the death of... | [Zimmerman posts $ 5,000 bail ; he was accused... |
| 2 | 80ec0efb252ec4470aee44482d1e196111b5780b | Zlatan Ibrahimovic scored his third goal in a... | [Barcelona move three points clear of Real Mad... |
| 3 | 8435150be66ea9792999dfc233cc690f9c2fe2d0 | Nobel laureate Norman E. Borlaug , an agricul... | [Borlaug died at the age of 95 from complicati... |
| 4 | 1444cf4d1832507a29a98529c2cd1a41f0154b52 | Louisiana Gov. Bobby Jindal on Monday stood b... | [Louisiana Gov. Bobby Jindal decried `` no-go ... |


The following article IDs have no story text after parsing.

In [33]:
print('- '+'\n- '.join(df_cnn.loc[df_cnn['story']=='']['article_id'].tolist()))

- 226ca83313bb4db0917847f80fcf4a2d2af5007d
- c36fb222cee4c1f4e38cf62ad37e2eb8dd0a85be
- d4b4ee22583e0490d5e41e93941e8e6ec182d7ab
- 2cb398794fea7b2dd83501c401c034ca73362323


In [34]:
%%time
#Setup storage for iteration results...
val_cnt = 0
vec_iter_article_id = []
vec_iter_sent_ids = []
vec_iter_sent_scores = []
vec_iter_sent_strs = []
vec_iter_rouge = []
vec_iter_method = []

#ROUGE scoring function
rouge_scr = rouge_scorer.RougeScorer(['rouge1','rougeL'],use_stemmer=True)

#Create iterations over test conditions...
for i,val_row in tqdm(df_cnn.iterrows(),total=df_cnn.shape[0],miniters=10):
    if(not val_row['story']==''):
        for func_summary_method in [None,np.mean]:
            #Ensure the method is generating the same number of sentences as the highlight...
            val_num_sents = len(val_row['highlight'])
            luhn_rtn = la.luhn_abstract.run_auto_summarization(val_text=val_row['story'],
                                                               val_n=val_num_sents,
                                                               is_sw_remove=False,
                                                               is_sw_zero=False,
                                                               is_use_luhn_tf=True,
                                                               vec_sw_luhn=la.luhn_abstract.vec_luhn_sw,
                                                               func_stem_selected=la.luhn_abstract.tokenize_stem_nltk,
                                                               func_summary_selected=func_summary_method,
                                                               is_print=False)
            if(luhn_rtn[1].shape[0]>0):
                vec_iter_article_id.append(val_row['article_id'])
                vec_iter_method.append(func_summary_method.__name__ if func_summary_method else 'None')
                vec_tmp_ids = luhn_rtn[1]['id_sent'].iloc[0:val_num_sents].values.tolist()
                vec_iter_sent_ids.append(vec_tmp_ids)
                vec_iter_sent_scores.append(luhn_rtn[2])
                vec_iter_sent_strs.append(luhn_rtn[3])
                vec_iter_rouge.append(rouge_scr.score(target=' '.join(val_row['highlight']),prediction=' '.join(luhn_rtn[3])))
                del(vec_tmp_ids)
        val_cnt += 1

100%|███████████████████████████████████████████████████████████████████████████████| 5001/5001 [25:59<00:00,  3.21it/s]

CPU times: user 25min 44s, sys: 14.3 s, total: 25min 58s
Wall time: 25min 59s





In [35]:
df_cnn_msmts = pd.DataFrame({'article_id':vec_iter_article_id,'vec_sent_ids':vec_iter_sent_ids,'vec_scores':vec_iter_sent_scores,
                             'vec_sents':vec_iter_sent_strs,'func_summary_method':vec_iter_method,'rouge':vec_iter_rouge})
print(f'''This resulted in {val_cnt:,} iterations; only {df_cnn_msmts.shape[0]:,} summaries were generated''')
df_cnn_msmts.head()

This resulted in 5,001 iterations; only 9,994 summaries were generated


| | article_id | vec_sent_ids | vec_scores | vec_sents | func_summary_method | rouge |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | 638ba1352bdf405a8f5bd681d7fe5c928686afff | [15.0, 1.0, 3.0, 0.0] | [7.5625, 6.32258064516129, 5.76, 4.76470588235... | [`` We now have more than double the data we h... | | {'rouge1': (0.22448979591836735, 0.61111111111... |
| 1 | 638ba1352bdf405a8f5bd681d7fe5c928686afff | [15.0, 1.0, 3.0, 16.0] | [3.8823529411764706, 3.3870967741935485, 3.12,... | [`` We now have more than double the data we h... | mean | {'rouge1': (0.2125984251968504, 0.5, 0.2983425... |
| 2 | f9f9601180ab3278165d936821e8f145659997f3 | [10.0, 12.0, 6.0] | [14.08695652173913, 12.041666666666666, 10.888... | [In fact , it 's his second arrest for alleged... | | {'rouge1': (0.10869565217391304, 0.22727272727... |
| 3 | f9f9601180ab3278165d936821e8f145659997f3 | [10.0, 12.0, 6.0] | [6.576923076923077, 6.375, 5.526315789473684] | [In fact , it 's his second arrest for alleged... | mean | {'rouge1': (0.10869565217391304, 0.22727272727... |
| 4 | 80ec0efb252ec4470aee44482d1e196111b5780b | [16.0, 12.0, 1.0, 13.0] | [6.722222222222222, 6.722222222222222, 5.78571... | [The win lifted Zaragoza four points clear of ... | | {'rouge1': (0.27419354838709675, 0.55737704918... |
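
Each entry in the `rouge` column holds a (precision, recall, F-measure) triple per metric. To make those three numbers concrete, here is a simplified, pure-Python ROUGE-1 on whitespace tokens. It is a sketch for intuition only: it does no stemming and is not the `rouge_score` implementation, so its values will generally differ from the column; `rouge1_simple` and the two example strings are hypothetical.

```python
from collections import Counter


def rouge1_simple(target, prediction):
    """Unigram-overlap precision, recall, and F1 (simplified ROUGE-1)."""
    tgt = Counter(target.lower().split())
    pred = Counter(prediction.lower().split())
    # Clipped overlap: a token counts at most as often as it occurs in both
    overlap = sum((tgt & pred).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(tgt.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f = rouge1_simple(
    target="scientists report new brain chemistry findings",
    prediction="new findings on brain chemistry were reported by scientists this week",
)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Five of the six target words appear in the longer prediction, so recall is high while precision is diluted by the extra words. The same pattern shows up in the table above, where extracted sentences are typically longer than the highlights.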


In [55]:
##KDE Plots.....

<hr>

# Original Notebook
## Imports/Setup

In [36]:
import requests
import bs4 as bs

In [37]:
def scrape_page_wikipedia(val_url):
    html_scraped_text = requests.get(val_url)
    html_text_parsed = bs.BeautifulSoup(html_scraped_text.content,'lxml')
    html_text_content = html_text_parsed.find(id='content')
    html_text_paragraphs = html_text_content.find_all('p')
    val_text_formatted = ''
    for val in html_text_paragraphs:
        val_text_formatted += val.text
    return(val_text_formatted)

## Examples
### Request NLP Wikipedia Summary

In [38]:
%%time
val_text = scrape_page_wikipedia(val_url='https://en.wikipedia.org/wiki/natural_language_processing')

CPU times: user 111 ms, sys: 10.9 ms, total: 122 ms
Wall time: 488 ms


#### Experiment #1: Use Luhn Score

| Parameter | Value (English) | Value (Set) |
| :-: | :-: | :-: |
| Stop Words Removed from Text| No | False |
| Counting Method | Luhn Counting | True |
| Stop Words Removed from Score | No | False |
| Aggregation Function | Luhn Algorithm | None |

In [39]:
%%time
luhn_rtn_exp1 = la.luhn_abstract.run_auto_summarization(val_text=val_text,
                                                        is_sw_remove=False,
                                                        is_sw_zero=False,
                                                        is_use_luhn_tf=True,
                                                        vec_sw_luhn=[],
                                                        vec_sw_add=[],
                                                        func_stem_selected=None,
                                                        func_summary_selected=None,
                                                        is_print=True)

Quantile Significance Lower = 1
Quantile Significance Upper = 31
Machine learning approaches, which include both statistical and neural networks, on the other hand, have many advantages over the symbolic approach:  Although rule-based systems for manipulating symbols were still in use in 2020, they have become mostly obsolete with the advance of LLMs in 2023. [22.5000] [8] In 2003, word n-gram model, at the time the best statistical algorithm, was outperformed by a multi-layer perceptron (with a single hidden layer and context length of several words trained on up to 14 million of words with a CPU cluster in language modelling) by Yoshua Bengio with co-authors. [18.0000] The premise of symbolic NLP is well-summarized by John Searle's Chinese room experiment: Given a collection of rules (e.g., a Chinese phrasebook, with questions and matching answers), the computer emulates natural language understanding (or other NLP tasks) by applying those rules to the data it confr

#### Experiment #2: Remove Stop Words

| Parameter | Value (English) | Value (Set) |
| :-: | :-: | :-: |
| Stop Words Removed from Text| Yes | True |
| Counting Method | Luhn Counting | True |
| Stop Words Removed from Score | No | False |
| Aggregation Function | Luhn Algorithm | None |

In [40]:
%%time
luhn_rtn_exp2 = la.luhn_abstract.run_auto_summarization(val_text=val_text,
                                                        is_sw_remove=True,
                                                        is_sw_zero=False,
                                                        is_use_luhn_tf=True,
                                                        vec_sw_luhn=[],
                                                        vec_sw_add=[],
                                                        func_stem_selected=None,
                                                        func_summary_selected=None,
                                                        is_print=True)

Quantile Significance Lower = 1
Quantile Significance Upper = 8
Machine learning approaches, which include both statistical and neural networks, on the other hand, have many advantages over the symbolic approach:  Although rule-based systems for manipulating symbols were still in use in 2020, they have become mostly obsolete with the advance of LLMs in 2023. [12.0417] In the 2010s, representation learning and deep neural network-style (featuring many hidden layers) machine learning methods became widespread in natural language processing. [8.0667] [8] In 2003, word n-gram model, at the time the best statistical algorithm, was outperformed by a multi-layer perceptron (with a single hidden layer and context length of several words trained on up to 14 million of words with a CPU cluster in language modelling) by Yoshua Bengio with co-authors. [7.7576] [9] In 2010, Tomáš Mikolov (then a PhD student at Brno University of Technology) with co-authors applied a simpl

#### Experiment #3: Use Mean Score

| Parameter | Value (English) | Value (Set) |
| :-: | :-: | :-: |
| Stop Words Removed from Text| No | False |
| Counting Method | Luhn Counting | True |
| Stop Words Removed from Score | No | False |
| Aggregation Function | Mean | np.mean |

In [41]:
%%time
luhn_rtn_exp3 = la.luhn_abstract.run_auto_summarization(val_text=val_text,
                                                        is_sw_remove=False,
                                                        is_sw_zero=False,
                                                        is_use_luhn_tf=True,
                                                        vec_sw_luhn=[],
                                                        vec_sw_add=[],
                                                        func_stem_selected=None,
                                                        func_summary_selected=np.mean,
                                                        is_print=True)

Quantile Significance Lower = 1
Quantile Significance Upper = 31
Machine learning approaches, which include both statistical and neural networks, on the other hand, have many advantages over the symbolic approach:  Although rule-based systems for manipulating symbols were still in use in 2020, they have become mostly obsolete with the advance of LLMs in 2023. [10.1087] [8] In 2003, word n-gram model, at the time the best statistical algorithm, was outperformed by a multi-layer perceptron (with a single hidden layer and context length of several words trained on up to 14 million of words with a CPU cluster in language modelling) by Yoshua Bengio with co-authors. [8.7736] As an example, George Lakoff offers a methodology to build natural language processing (NLP) algorithms through the perspective of cognitive science, along with the findings of cognitive linguistics,[50] with two defining aspects: Ties with cognitive linguistics are part of the historical heritage of N

#### Experiment #4: Zero Score Stop Words

| Parameter | Value (English) | Value (Set) |
| :-: | :-: | :-: |
| Stop Words Removed from Text| No | False |
| Counting Method | Luhn Counting | True |
| Stop Words Removed from Score | Yes | True |
| Aggregation Function | Luhn Algorithm | None |

In [42]:
%%time
luhn_rtn_exp4 = la.luhn_abstract.run_auto_summarization(val_text=val_text,
                                                        is_sw_remove=False,
                                                        is_sw_zero=True,
                                                        is_use_luhn_tf=True,
                                                        vec_sw_luhn=[],
                                                        vec_sw_add=[],
                                                        func_stem_selected=None,
                                                        func_summary_selected=None,
                                                        is_print=True)

Quantile Significance Lower = 0
Quantile Significance Upper = 31
[57] Likewise, ideas of cognitive NLP are inherent to neural models multimodal NLP (although rarely made explicit)[58] and developments in artificial intelligence, specifically tools and technologies using large language model approaches[59] and new directions in artificial general intelligence based on the free energy principle[60] by British neuroscientist and theoretician at University College London Karl J. Friston. [29.8983] As an example, George Lakoff offers a methodology to build natural language processing (NLP) algorithms through the perspective of cognitive science, along with the findings of cognitive linguistics,[50] with two defining aspects: Ties with cognitive linguistics are part of the historical heritage of NLP, but they have been less frequently addressed since the statistical turn during the 1990s. [21.4912] [17] Symbolic approach, i.e., the hand-coding of a set of rules for manipula

#### Experiment #5: Use TF Counting w/Mean Score

$$tf_t=\frac{f_t}{\sum_{t'} f_{t'}}$$
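
The term-frequency formula above divides each term's raw count by the total number of tokens. A minimal sketch follows; the helper name `term_frequency` and the sample sentence are assumptions for illustration and are not taken from `luhn_abstract`.

```python
from collections import Counter

def term_frequency(tokens):
    """Return each term's count divided by the total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

# Six tokens: "the" and "cat" appear twice, the others once.
tf = term_frequency("the cat saw the other cat".split())
print(tf)
```

Because every value is a fraction of the document length, the scores are much smaller than raw counts and always sum to 1 across the vocabulary.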

| Parameter | Value (English) | Value (Set) |
| :-: | :-: | :-: |
| Stop Words Removed from Text| No | False |
| Counting Method | TF Counting | False |
| Stop Words Removed from Score | No | False |
| Aggregation Function | Mean | np.mean |

In [43]:
%%time
luhn_rtn_exp5 = la.luhn_abstract.run_auto_summarization(val_text=val_text,
                                                        is_sw_remove=False,
                                                        is_sw_zero=False,
                                                        is_use_luhn_tf=False,
                                                        vec_sw_luhn=[],
                                                        vec_sw_add=[],
                                                        func_stem_selected=None,
                                                        func_summary_selected=np.mean,
                                                        is_print=True)

Quantile Significance Lower = 0
Quantile Significance Upper = 0
Machine learning approaches, which include both statistical and neural networks, on the other hand, have many advantages over the symbolic approach:  Although rule-based systems for manipulating symbols were still in use in 2020, they have become mostly obsolete with the advance of LLMs in 2023. [10.1087] [8] In 2003, word n-gram model, at the time the best statistical algorithm, was outperformed by a multi-layer perceptron (with a single hidden layer and context length of several words trained on up to 14 million of words with a CPU cluster in language modelling) by Yoshua Bengio with co-authors. [8.7736] As an example, George Lakoff offers a methodology to build natural language processing (NLP) algorithms through the perspective of cognitive science, along with the findings of cognitive linguistics,[50] with two defining aspects: Ties with cognitive linguistics are part of the historical heritage of NL

#### Experiment #5: Use TF Counting

| Parameter | Value (English) | Value (Set) |
| :-: | :-: | :-: |
| Stop Words Removed from Text| No | False |
| Counting Method | TF Counting | False |
| Stop Words Removed from Score | No | False |
| Aggregation Function | Luhn Algorithm | None |
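
For contrast with the mean aggregation, the cluster-based significance factor described in Luhn's paper can be sketched as follows: significant words are grouped into clusters, and a cluster scores the square of its significant-word count divided by its total length. This is a sketch under stated assumptions, not the `luhn_abstract` implementation: the function name, the boolean-list input, and the `max_gap=4` default are all chosen here for illustration.

```python
def luhn_sentence_score(is_significant, max_gap=4):
    """Score one sentence from a list of booleans marking significant words.

    A cluster starts and ends on a significant word and may contain at most
    `max_gap` consecutive non-significant words. The cluster score is
    (significant words in cluster) ** 2 / (total words in cluster), and the
    sentence score is the best cluster score.
    """
    best = 0.0
    start = None  # index of the first word in the current cluster
    last = None   # index of the most recent significant word
    count = 0     # significant words in the current cluster
    for i, flag in enumerate(is_significant):
        if not flag:
            continue
        if last is not None and i - last - 1 > max_gap:
            # Gap too long: close the current cluster and start a new one.
            best = max(best, count ** 2 / (last - start + 1))
            start, count = i, 0
        if start is None:
            start = i
        last = i
        count += 1
    if count:
        # Close the final cluster.
        best = max(best, count ** 2 / (last - start + 1))
    return best

# Seven-word cluster containing four significant words: 4 ** 2 / 7.
print(luhn_sentence_score([True, False, True, True, False, False, True]))
```

Unlike a mean over word scores, this factor rewards sentences in which significant words sit close together, so one dense cluster can outscore a long sentence with the same number of scattered significant words.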

In [44]:
%%time
luhn_rtn_exp5 = la.luhn_abstract.run_auto_summarization(val_text=val_text,
                                                        is_sw_remove=False,
                                                        is_sw_zero=False,
                                                        is_use_luhn_tf=False,
                                                        vec_sw_luhn=[],
                                                        vec_sw_add=[],
                                                        func_stem_selected=None,
                                                        func_summary_selected=None,
                                                        is_print=True)

Quantile Significance Lower = 0
Quantile Significance Upper = 0
Machine learning approaches, which include both statistical and neural networks, on the other hand, have many advantages over the symbolic approach:  Although rule-based systems for manipulating symbols were still in use in 2020, they have become mostly obsolete with the advance of LLMs in 2023. [22.5000] [8] In 2003, word n-gram model, at the time the best statistical algorithm, was outperformed by a multi-layer perceptron (with a single hidden layer and context length of several words trained on up to 14 million of words with a CPU cluster in language modelling) by Yoshua Bengio with co-authors. [18.0000] The premise of symbolic NLP is well-summarized by John Searle's Chinese room experiment: Given a collection of rules (e.g., a Chinese phrasebook, with questions and matching answers), the computer emulates natural language understanding (or other NLP tasks) by applying those rules to the data it confro

#### Experiment #5: Use TF Counting w/Stop Words Zeroed

| Parameter | Value (English) | Value (Set) |
| :-: | :-: | :-: |
| Stop Words Removed from Text| No | False |
| Counting Method | TF Counting | False |
| Stop Words Removed from Score | Yes | True |
| Aggregation Function | Mean | np.mean |

In [45]:
%%time
luhn_rtn_exp5 = la.luhn_abstract.run_auto_summarization(val_text=val_text,
                                                        is_sw_remove=False,
                                                        is_sw_zero=True,
                                                        is_use_luhn_tf=False,
                                                        vec_sw_luhn=[],
                                                        vec_sw_add=[],
                                                        func_stem_selected=None,
                                                        func_summary_selected=np.mean,
                                                        is_print=True)

Quantile Significance Lower = 0
Quantile Significance Upper = 0
[57] Likewise, ideas of cognitive NLP are inherent to neural models multimodal NLP (although rarely made explicit)[58] and developments in artificial intelligence, specifically tools and technologies using large language model approaches[59] and new directions in artificial general intelligence based on the free energy principle[60] by British neuroscientist and theoretician at University College London Karl J. Friston. [15.3051] As an example, George Lakoff offers a methodology to build natural language processing (NLP) algorithms through the perspective of cognitive science, along with the findings of cognitive linguistics,[50] with two defining aspects: Ties with cognitive linguistics are part of the historical heritage of NLP, but they have been less frequently addressed since the statistical turn during the 1990s. [10.6780] [17] Symbolic approach, i.e., the hand-coding of a set of rules for manipulat

### Request George Washington Wikipedia Summary

In [46]:
%%time
val_text = scrape_page_wikipedia(val_url='https://en.wikipedia.org/wiki/George_Washington')

CPU times: user 270 ms, sys: 17.8 ms, total: 288 ms
Wall time: 451 ms


In [47]:
luhn_rtn_exp6 = la.luhn_abstract.run_auto_summarization(val_text=val_text,
                                                        is_sw_remove=False,
                                                        is_sw_zero=False,
                                                        is_use_luhn_tf=True,
                                                        vec_sw_luhn=[],
                                                        vec_sw_add=[],
                                                        func_stem_selected=None,
                                                        func_summary_selected=None,
                                                        is_print=True)

Quantile Significance Lower = 2
Quantile Significance Upper = 471
He also maintains that Washington never advocated outright confiscation of tribal land or the forcible removal of tribes and that he berated American settlers who abused natives, admitting that he held out no hope for peaceful relations as long as "frontier settlers entertain the opinion that there is not the same crime (or indeed no crime at all) in killing a native as in killing a white man. [44.4853] While his relationship with Washington would remain friendly, Washington's relationship with his Secretary of War Henry Knox deteriorated after rumors that Knox had profited from contracts for the construction of U.S. frigates which had been commissioned under the Naval Act of 1794 in order to combat Barbary pirates, forcing Knox to resign. [35.8519] Southern opposition was intense, antagonized by an ever-growing rift between North and South; many were concerned that Washington's remains could end up on 

### Request GMU Wikipedia Summary

In [48]:
val_text = scrape_page_wikipedia(val_url='https://en.wikipedia.org/wiki/George_Mason_University')

In [49]:
luhn_rtn_exp7 = la.luhn_abstract.run_auto_summarization(val_text=val_text,
                                                        is_sw_remove=False,
                                                        is_sw_zero=False,
                                                        is_use_luhn_tf=True,
                                                        vec_sw_luhn=[],
                                                        vec_sw_add=[],
                                                        func_stem_selected=None,
                                                        func_summary_selected=None,
                                                        is_print=True)

Quantile Significance Lower = 1
Quantile Significance Upper = 127
Among undergraduate students, 80% of students are enrolled full-time while 20% are enrolled part-time[146] In terms of ethnic and racial demographics: American Indian/Alaska Native people make up 0% of the student body and 0% of the full-time staff; Asian people make up 22% of the student body and 14% of the full-time staff; Black people make up 11% of the student body and 5% of the full-time staff; Hispanic and Latino people make up 17% of the student population and 3% of the full-time staff; Native Hawaiian/Pacific Islander people make up 0% of the student body and 0% of the full-time staff; non-resident alien people make up 5% of the student body and 8% of the full-time staff; people of two or more races/multiracial people make up 5% of the student body and 2% of the full-time staff; people of an Unknown ethno-racial demographic make up 3% of the student body and 3% of the full-time staff; and White people make up 36%

## Random Example

*Source: Scientific American*  
*Title: A New Quantum Cheshire Cat Thought Experiment Is Out of the Box*  
*Author: Manon Bischoff*

[A New Quantum Cheshire Cat Thought Experiment Is Out of the Box](https://www.scientificamerican.com/article/a-new-quantum-cheshire-cat-thought-experiment-is-out-of-the-box/)

In [50]:
val_text = '''Physicists seem to be obsessed with cats. James Clerk Maxwell, the father of electrodynamics, studied falling felines to investigate how they turned as they fell. Many physics teachers have used a cat’s fur and a hard rubber rod to explain the phenomenon of frictional electricity. And Erwin Schrödinger famously illustrated the strangeness of quantum physics with a thought experiment involving a cat that is neither dead nor alive.  So it hardly seems surprising that physicists turned to felines once again to name a newly discovered quantum phenomenon in a paper published in the New Journal of Physics in 2013. Their three-sentence study abstract reads, “In this paper we present a quantum Cheshire Cat. In a pre- and post-selected experiment we find the Cat in one place, and its grin in another. The Cat is a photon, while the grin is its circular polarization.”
The newfound phenomenon was one in which certain particle features take a different path from their particle—much like the smile of the Cheshire Cat in Alice’s Adventures in Wonderland, written by Lewis Carroll—a pen name of mathematician Charles Lutwidge Dodgson—and published in 1865. To date, several experiments have demonstrated this curious quantum effect. But the idea has also drawn significant skepticism. Critics are less concerned about the theoretical calculations or experimental rigor than they are about the interpretation of the evidence. “It seems a bit bold to me to talk about disembodied transmission,” says physicist Holger Hofmann of Hiroshima University in Japan. “Instead we should revise our idea of particles.”  Recently researchers led by Yakir Aharonov of Chapman University took the debate to the next level. Aharonov was a co-author of the first paper to propose the quantum Cheshire effect. Now, on the preprint server arXiv.org, he and his colleagues have posted a description of theoretical work that they believe demonstrates that quantum properties can move without any particles at all—like a disembodied grin flitting through the world and influencing its surroundings—in ways that bypass the critical concerns raised in the past.
Aharonov and his colleagues first encountered their quantum Cheshire cat several years ago as they were pondering one of the most fundamental principles of quantum mechanics: nothing can be predicted unambiguously. Unlike classical physics, the same quantum mechanical experiment can have different outcomes under exactly the same conditions. It is therefore impossible to predict the exact outcome of a single experiment—only its outcome with a certain probability. “Nobody understands quantum mechanics. It’s so counterintuitive. We know its laws, but we are always surprised,” says Sandu Popescu, a physicist at the University of Bristol in England, who collaborated with Aharonov on the 2013 paper and the new preprint.
But Aharonov was not satisfied with this uncertainty. So, since the 1980s, he has been exploring ways to investigate fundamental processes despite the probability-based nature of quantum mechanics. Aharonov—now age 92—employs an approach that involves intensively repeating an experiment, grouping results and then examining what came out before and after the experiment and relating these events to each other. “To do this, you have to understand the flow of time in quantum mechanics,” Popescu explains. “We developed a completely new method to combine information from measurements before and after the experiment.”
The researchers have stumbled across several surprises with this method—including their theoretical Cheshire cat. Their idea sounds simple at first: send particles through an optical tool called an interferometer, which causes each particle to move through one of two paths that ultimately merge again at the end. If the setup and measurements were carried out skillfully, Aharonov and his colleagues theorized, it could be shown that the particle traveled a path in the interferometer that differed from the path of its polarization. In other words, they claimed the property of the particle could be measured on one path even though the particle itself took the other—as if the grin and the cat had come apart.
Inspired by this theory, a team led by Tobias Denkmayr, then at the Vienna University of Technology, implemented the experiment with neutrons in a study published in 2014. The team showed that the neutral particles inside an interferometer followed a different path from that of their spin, a quantum mechanical property of particles similar to angular momentum: Denkmayr and his colleagues had indeed found evidence of the Cheshire cat theory. Two years later researchers led by Maximilian Schlosshauer of the University of Portland successfully implemented the same experiment with photons. The scientists saw evidence that the light particles took a different path in the interferometer than their polarization did.
But not everyone is convinced. “Such a separation makes no sense at all. The location of a particle is itself a property of the particle,” Hofmann says. “It would be more accurate to talk about an unusual correlation between location and polarization.” Last November Hofmann and his colleagues provided an alternative explanation based on widely known quantum mechanical effects.
And in another interpretation of the Cheshire cat results, Pablo Saldanha of the Federal University of Minas Gerais in Brazil and his colleagues argue that the findings can be explained with wave-particle duality. “If you take a different view, there are no paradoxes,” Saldanha says, “but all results can be explained with traditional quantum mechanics as simple interference effects.”
Much of the controversy surrounds the way in which particles’ properties and positions are detected in these experiments. Disturbing a particle could alter its quantum mechanical properties. For that reason, the photons or neutrons cannot be recorded inside the interferometer using an ordinary detector. Instead scientists must resort to a principle of weak measurement developed by Aharonov in 1988. A weak measurement makes it possible to scan a particle very lightly without destroying its quantum state. This comes at a price, however: the weak measurement result is extremely inaccurate. (Thus, these experiments must be repeated many times over, to compensate for the fact that each individual measurement is highly uncertain.)
In the quantum Cheshire cat experiments, a weak measurement is made along a path in the interferometer, the paths then merge, and the emerging particles are measured with an ordinary detector. Along one path of the interferometer, a weak measurement of the particle’s position can be taken and, along the other, its spin. Using detectors, physicists can more definitively characterize the particles that traveled through the interferometer and potentially reconstruct what occurred during the particle’s journey. For example, only certain particles will appear in certain detectors, helping the physicists piece together which path their neutron or photon previously took. According to Aharonov, Popescu and their colleagues, the Cheshire cat experiments ultimately reveal that the particle’s position can be confirmed on one path even as its polarization or spin was measured on the other.
Saldanha and his co-authors assert that it is impossible to make claims about quantum systems in the past given their measurements in the present. In other words, the photons and neutrons measured in the final detectors cannot tell us much about their previous trajectory. Instead the wave functions of particles passing through the paths of the interferometer could overlap, which would make it impossible to trace which path a particle had taken. “Ultimately, the paradoxical behaviors are related to the wave-particle duality,” Saldanha says. But in the papers that report evidence of the quantum Cheshire cat, he asserts, the findings “are processed in a sophisticated way that obscures this simpler interpretation.”
Hofmann, meanwhile, has stressed that the results will differ if you measure the system in a different way. This phenomenon is well-known in quantum physics: if, for example, you first measure the speed of a particle and then its position, the result can be different than it would be if you first measured the position of the same particle and then its speed. He and his colleagues therefore contend that Aharonov and his team’s conclusions were correct in themselves—that the particle moved along one path and the polarization followed the other—but that such differing paths do not apply simultaneously.
As Hofmann’s co-author Jonte Hance, also at Hiroshima University, told New Scientist, “It only looks like [the particle and polarization are] separated because you’re measuring one of the properties in one place and the other property in the other place, but that doesn’t mean that the properties are in one place and the other place, that means that the actual measuring itself is affecting it in such a way that it looks like it’s in one place and the other place.”
But these critiques are “missing the point,” Popescu says. He agrees that the work and reasoning put forward by Saldanha and Hofmann’s respective groups are correct—but adds that the best way to test any interpretation is to generate testable predictions from each. “As I understand it, there is no direct way to make predictions based on them,” Popescu says in reference to these alternative explanations. “They kind of have a very old-fashioned way of looking at things: there are contradictions, so you stop doing the math.”
With their recent preprint paper, Aharonov and Popescu, together with physicist Daniel Collins of the University of Bristol, have now described how a particle’s spin can move completely independently of the particle itself—without employing a weak measurement. In their new experimental setup, a particle is located in the left half of an elongated two-part cylinder that is sealed at the outer edges. Because of a highly reflective wall in the middle, the particle has a vanishingly small probability of tunneling through to the right-hand side of the cylinder. In their paper, the researchers provide a proof that even if the particle remains in the left-hand area in almost all cases, it should still be possible to measure a transfer of the particle’s spin at the right-hand outer wall. “It’s amazing, isn’t it?” Collins says. “You think the particle has a spin and the spin should stay with the particle. But the spin crosses the box without the particle.”
This approach would address several of the critical concerns raised thus far. The physicists don't need weak measurements. Nor do they need to group their experimental results to draw temporal conclusions. (That being said, grouping results would still improve the measurements, given that the angular momentum of the wall itself cannot be determined unambiguously because of the Heisenberg uncertainty principle.) But in this scenario, the only physical principles involved are conservation laws, such as the conservation of energy or the conservation of momentum and angular momentum. Popescu and Collins explain that they hope other groups will implement the experiment to observe the effects in the laboratory.
The new work has piqued Hofmann’s interest. “The scenario is exciting because the interaction between polarization and particle motion produces a particularly strong quantum effect that clearly contradicts the particle picture,” he says.
But he still does not see this as proof of disembodied (particle-free) spin transfer. “For me, this means, above all, that it is wrong to assume a measurement-independent reality,” Hofmann says. Instead quantum mechanics allows a particle’s residence to extend to the right-hand region of the cylinder, even if a residence in the left-hand region seems logically compelling. “I think it is quite clear to Aharonov, Collins and Popescu that the space in front of the wall is not really empty,” he adds.
Saldanha, meanwhile, still sees the researchers as overcomplicating what could be explained as traditional quantum interference effects. When discussing the particle’s very low probability of entering the right-hand side of the experimental setup, he explains, “we have to be careful about a ‘vanishingly small probability’ when we refer to waves.” The wave function of the particle could also expand into the right-hand side of the setup and thus influence the angular momentum of the wall. “The same predictions can be made without such dramatic conclusions,” he says.
In response to these critiques, Popescu says, “This is of course another way of thinking about it. The question is whether this interpretation is useful.” Regardless of which interpretation of the events is correct, the quantum Cheshire cat could enable new technological applications. For example, it could be used to transfer information or energy without moving a physical particle—whether made of matter or light.
For Popescu, however, the fundamental questions of physics play a more important role. “It all started when we thought about how time propagates in quantum mechanics,” he says. “And suddenly we were able to discover something fundamental about the laws of conservation.”
'''

In [51]:
luhn_rtn_exp8 = la.luhn_abstract.run_auto_summarization(val_text=val_text,
                                                        is_sw_remove=False,
                                                        is_sw_zero=False,
                                                        is_use_luhn_tf=True,
                                                        vec_sw_luhn=[],
                                                        vec_sw_add=[],
                                                        func_stem_selected=None,
                                                        func_summary_selected=None,
                                                        is_print=True)

Quantile Significance Lower = 1
Quantile Significance Upper = 56
As Hofmann’s co-author Jonte Hance, also at Hiroshima University, told New Scientist, “It only looks like [the particle and polarization are] separated because you’re measuring one of the properties in one place and the other property in the other place, but that doesn’t mean that the properties are in one place and the other place, that means that the actual measuring itself is affecting it in such a way that it looks like it’s in one place and the other place.” But these critiques are “missing the point,” Popescu says. [54.5684] This phenomenon is well-known in quantum physics: if, for example, you first measure the speed of a particle and then its position, the result can be different than it would be if you first measured the position of the same particle and then its speed. [31.3913] “If you take a different view, there are no paradoxes,” Saldanha says, “but all results can be explained with traditi

## Potential Issues...

- Splitting scraped Wikipedia text into sentences with NLTK sometimes yields segments that contain more than one sentence or independent clause, which can distort the per-sentence scores.
- Reproducing the results from the paper is complicated because exact values for the word-frequency cutoffs *C* and *D* are not documented.
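
One possible mitigation for the sentence-splitting issue, which this notebook does not apply, is to strip Wikipedia's numeric citation markers (for example the `[57]` that opens one of the summaries above) before sentence tokenization. The sketch below is an assumption-laden illustration: the helper name `strip_citations` is hypothetical, and the pattern also removes any legitimate bracketed number in the text.

```python
import re

# Matches bracketed numeric citation markers such as "[146]" or "[8]".
CITATION = re.compile(r"\[\d+\]")

def strip_citations(text):
    """Replace numeric citation markers with a space and collapse whitespace."""
    without = CITATION.sub(" ", text)
    return re.sub(r"\s+", " ", without).strip()

sample = "enrolled part-time.[146] In terms of demographics"
print(strip_citations(sample))
```

Bracketed editorial insertions such as `[the particle and polarization are]` are left alone because the pattern only matches digits. Whether this actually improves NLTK's sentence boundaries on a given page would need to be checked against the output.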