<a href="https://colab.research.google.com/github/Arteric-Jeff-Knight/collabs/blob/master/phrase_tokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Run the first code block once to load libraries and define functions.  

Ignore any output that isn't an error. The process takes about a minute and is finished when you see:

<font color="green">✔ Download and installation successful</font><br>
You can now load the model via spacy.load('en_core_web_sm')<br>
This script is now done.

In [None]:
#@title <= Run first one time

!pip install spacy==3.0.5 textacy==0.11.0 phrasemachine  # --upgrade 

import IPython
import uuid
from google.colab import output
from google.colab import files
import ipywidgets as widgets
import io, re, string, unicodedata
from itertools import dropwhile
import numpy as np
import pandas as pd 
from collections import Counter
import spacy
import textacy
from textacy import extract
import phrasemachine

compiled_word_counter = re.compile(r'\w+')

# print(spacy.__version__)

# functions used several times
def drop_fewer_than(counter_dict: Counter, threshold: int = 1):
  for key, count in dropwhile(lambda key_count: key_count[1] >= threshold, counter_dict.most_common()):
      del counter_dict[key]

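def split_df_into_data_and_configs(uploaded, defaults: dict = None, config_name: str = 'config'):
  # Avoid sharing one mutable default dictionary between calls
  defaults = {} if defaults is None else defaults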
  # Put uploaded file into dataframe
  filename = list(uploaded.keys())[0]
  df = pd.read_csv(io.BytesIO(uploaded[filename]),header=None)

  # Get every row with the config marker in the first column
  configs = df[df[0] == config_name] 
  # Build a dictionary from the key in the second column with values from the third
  defaults.update(dict(zip(configs[1], configs[2])))

  # Everything else that isn't a config, is data
  data = df[df[0] != config_name].reset_index(drop=True)
  # Assume that the first row is the column names now that configs are gone
  data.columns = data.iloc[0]
  # Drop the row with the column names
  data.drop(data.index[0], inplace=True)
  # Reset the index, so zero works below
  data = data.reset_index(drop=True)

  # Validate the column name configs
  defaults['columns'] = list(data.columns)

  if 'tokenize_col_in' not in defaults:
    # With nothing defined, use the first column
    defaults['tokenize_col_in'] = defaults['columns'][0]
  elif defaults['tokenize_col_in'] not in defaults['columns']:
    # Try the capitalized name before falling back to the first column
    if defaults['tokenize_col_in'].capitalize() in defaults['columns']:
      defaults['tokenize_col_in'] = defaults['tokenize_col_in'].capitalize()
    else:
      defaults['tokenize_col_in'] = defaults['columns'][0]

  if 'tokenize_file_out' not in defaults:
    defaults['tokenize_file_out'] = '-tokenized'

  if 'tokenize_output_filename' not in defaults:
      defaults['tokenize_output_filename'] = filename.replace('.csv', f"{defaults['tokenize_file_out']}.csv")

  try:
    defaults['tokenize_ng_threshold'] = min(6,int(defaults['tokenize_ng_threshold']))
    if defaults['tokenize_ng_threshold'] < 3:
      defaults['tokenize_ng_threshold'] = tuple([2])
    else:
      defaults['tokenize_ng_threshold'] = tuple(range(2,defaults['tokenize_ng_threshold']+1))
  except Exception as e:
    defaults['tokenize_ng_threshold'] = (2,3,4)

  try:
    defaults['tokenize_min_token_count'] = min(9,int(defaults['tokenize_min_token_count']))
    if defaults['tokenize_min_token_count'] < 1:
      defaults['tokenize_min_token_count'] = 1
  except Exception as e:
    defaults['tokenize_min_token_count'] = 3

  return data, defaults

spacy.cli.download('en_core_web_sm')
nlp = textacy.load_spacy_lang('en_core_web_sm')

print('This script is now done.')

# Next, run the following code block to upload a file for processing and set options

To process multiple files, start here for each one; there is no need to re-run the first block.

Processing is mostly handled by default values, but if you need to override them, configuration is handled by passing values in the input file. See **How to Configure** below for instructions.

The code will ask you to choose a file to import and select which column to tokenize and the maximum number of words in an n-gram. After uploading a file, you can change these values to influence the next steps. You do not need to re-run this script to change the option values.

In [None]:
#@title <= Run once for every file to tokenize

defaults = {
  'tokenize_col_in': 'lemmatized',
  'tokenize_file_out': '-tokenized',
  'tokenize_ng_threshold': 4,
  'tokenize_min_token_count': 3
}

df, configs = split_df_into_data_and_configs(files.upload(), defaults)

col_in = widgets.Dropdown(options=configs['columns'], value=configs['tokenize_col_in'])

print('\nSelect the column that contains the text to be tokenized:')
display(col_in)

ng_threshold = widgets.Dropdown(options=[('No n-grams', 0), 
  ('2-grams', (2, )), 
  ('3-grams', (2, 3)), 
  ('4-grams', (2, 3, 4)), 
  ('5-grams', (2, 3, 4, 5)), 
  ('6-grams', (2, 3, 4, 5, 6))], value=configs['tokenize_ng_threshold'])
print('\nWhat is the maximum size of the n-grams (zero to not run n-grams):')
display(ng_threshold)

print('\nConfigs:')
for key in configs:
  if 'tokenize_' in key:
    print('   ',key,':',configs[key])

# This code block tokenizes the data in the upload file

In [None]:
#@title <= Run every time you change the column and n-gram settings above
text_df = df[[col_in.value]]

pm_phrases = []
cluster_phrases = []
textrank_phrases = []
yake_phrases = []
scake_phrases = []
sgrank_phrases = []
ngram_phrases = []


# Iterate the column directly; squeeze() would turn a one-row column into a bare scalar
for docrow in text_df[col_in.value]:
  docrow = str(docrow)
  doc = textacy.make_spacy_doc(docrow, lang='en_core_web_sm')

  word_count = len(doc)

  # Start with the phrasemachine, maybe we can bail on the others
  pm = phrasemachine.get_phrases(docrow)
  for phrase in pm['counts']:
    pm_phrases.append({'token': phrase, 'phrasemachine_count': pm['counts'][phrase], 'phrasemachine_docs': 1})
      
  # if nc_count.value:
  chunks = Counter()
  for nc in doc.noun_chunks:
    # add only if is a phrase and does not start with a stop word
    if " " in nc.text and nc.text.split()[0] not in spacy.lang.en.stop_words.STOP_WORDS:
      chunks[nc.text] += 1

  # A roundabout way to preserve token frequency and document frequency
  for token in chunks:
    cluster_phrases.append({'token': token, 'cluster_count': chunks[token], 'cluster_docs': 1})

  try:
    for token, rank in extract.keyterms.textrank(doc=doc):
      if " " in token: # only add tokens that are phrases!
        textrank_phrases.append({'token': token, 'tr_count': 1, 'tr_docs': 1, 'textrank': rank})
  except Exception as e:
    pass

  try:
    for token, rank in extract.keyterms.yake(doc=doc):
      if " " in token: # only add tokens that are phrases!
        yake_phrases.append({'token': token, 'yake_count': 1, 'yake_docs': 1, 'yake': rank})
  except Exception as e:
    pass

  try:
    for token, rank in extract.keyterms.scake(doc=doc):
      if " " in token: # only add tokens that are phrases!
        scake_phrases.append({'token': token, 'scake_count': 1, 'scake_docs': 1, 'scake': rank})
  except Exception as e:
    pass

  try:
    for token, rank in extract.keyterms.sgrank(doc=doc):
      if " " in token: # only add tokens that are phrases!
        sgrank_phrases.append({'token': token, 'sgrank_count': 1, 'sgrank_docs': 1, 'sgrank': rank})
  except Exception as e:
    pass

  bot = doc._.to_bag_of_terms(ngs=ng_threshold.value, ents=True, weighting="count") #, as_strings=True)
  for term in bot:
    if " " in term:  # for some reason the function returns 1-grams, even if not asked to
        ngram_phrases.append({'token': term, 'ng_count': bot[term], 'ng_docs': 1, 'ng_freq': bot[term]/word_count})

# Drop all rows with count below threshold
# ncdf = ncdf[ncdf.noun_cluster_count > nc_count.value]

if pm_phrases:
  pmdf = pd.DataFrame(pm_phrases)
  # Group by token
  pmdf = pmdf.groupby('token', as_index=False).agg({'phrasemachine_count': 'sum', 'phrasemachine_docs': 'sum'})  
else:
  # the merges later on will require an empty dataframe
  pmdf = pd.DataFrame([], columns=['token', 'phrasemachine_count', 'phrasemachine_docs'])

if cluster_phrases:
  ncdf = pd.DataFrame(cluster_phrases)
  # Group by token
  ncdf = ncdf.groupby('token', as_index=False).agg({'cluster_count': 'sum', 'cluster_docs': 'sum'}) 
else:
  # the merges later on will require an empty dataframe
  ncdf = pd.DataFrame([], columns=['token', 'cluster_count', 'cluster_docs'])

if textrank_phrases:
  trdf = pd.DataFrame(textrank_phrases)
  # Group by token
  trdf = trdf.groupby('token', as_index=False).agg({'tr_count': 'sum', 'tr_docs': 'sum', 'textrank': 'max'})   
else:
  # the merges later on will require an empty dataframe
  trdf = pd.DataFrame([], columns=['token', 'tr_count', 'tr_docs'])

if yake_phrases:
  ykdf = pd.DataFrame(yake_phrases)
  # Group by token
  ykdf = ykdf.groupby('token', as_index=False).agg({'yake_count': 'sum', 'yake_docs': 'sum', 'yake': 'max'}) 
else:
  # the merges later on will require an empty dataframe
  ykdf = pd.DataFrame([], columns=['token', 'yake_count', 'yake_docs'])

if scake_phrases:
  scdf = pd.DataFrame(scake_phrases)
  # Group by token
  scdf = scdf.groupby('token', as_index=False).agg({'scake_count': 'sum', 'scake_docs': 'sum', 'scake': 'max'}) 
else:
  # the merges later on will require an empty dataframe
  scdf = pd.DataFrame([], columns=['token', 'scake_count', 'scake_docs'])

if sgrank_phrases:
  sgdf = pd.DataFrame(sgrank_phrases)
  # Group by token
  sgdf = sgdf.groupby('token', as_index=False).agg({'sgrank_count': 'sum', 'sgrank_docs': 'sum', 'sgrank': 'max'}) 
else:
  # the merges later on will require an empty dataframe
  sgdf = pd.DataFrame([], columns=['token', 'sgrank_count', 'sgrank_docs'])

if ngram_phrases:
  ngdf = pd.DataFrame(ngram_phrases)
  # Group by token
  ngdf = ngdf.groupby('token', as_index=False).agg({'ng_count': 'sum', 'ng_docs': 'sum'}) 
else:
  ngdf = pd.DataFrame([], columns=['token', 'ng_count', 'ng_docs'])

merged = pd.merge(left=pmdf, right=ncdf, left_on='token', right_on='token',how='outer')
merged = pd.merge(left=merged, right=trdf, left_on='token', right_on='token',how='outer')
merged = pd.merge(left=merged, right=ykdf, left_on='token', right_on='token',how='outer')
merged = pd.merge(left=merged, right=scdf, left_on='token', right_on='token',how='outer')
# Include the SGRank results alongside the other methods
merged = pd.merge(left=merged, right=sgdf, left_on='token', right_on='token',how='outer')
merged = pd.merge(left=merged, right=ngdf, left_on='token', right_on='token',how='outer')
merged = merged.fillna(0)
cols = ['phrasemachine_count', 'phrasemachine_docs', 
        'cluster_count', 'cluster_docs',
        'tr_count', 'tr_docs',
        'yake_count', 'yake_docs',
        'scake_count', 'scake_docs',
        'sgrank_count', 'sgrank_docs',
        'ng_count', 'ng_docs' ]
merged[cols] = merged[cols].astype(int)

# Add word count
merged['words'] = merged.apply(lambda x: len(compiled_word_counter.findall(x['token'])), axis=1)

# Keep phrases that at least one method counted min_count times or more
def reduce_merged_to_count(mdf, min_count: int = 3):
  return mdf[
    (mdf['phrasemachine_count'] >= min_count) 
    | (mdf['cluster_count'] >= min_count)
    | (mdf['tr_count'] >= min_count)
    | (mdf['yake_count'] >= min_count)
    | (mdf['scake_count'] >= min_count)
    | (mdf['sgrank_count'] >= min_count)
    | (mdf['ng_count'] >= min_count)
    ]

rowcount = []
for count in range(1,10):
  rows = len(reduce_merged_to_count(merged, count))
  rowcount.append((f'{count} or more for {rows} phrases',count))

min_token_count = widgets.Dropdown(options=rowcount,value=configs['tokenize_min_token_count'])
print('\nHow many times must a phrase, noun cluster or n-gram appear to be counted:')
display(min_token_count)


# Finally, set the phrase count option above and download

You can run both scripts once, and then change the option to generate different downloads.
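
The dropdown above controls a simple rule: a phrase is kept when at least one method counted it often enough. A minimal sketch of that rule in plain Python follows; the `keep_frequent` helper and the sample rows are illustrative assumptions standing in for the notebook's merged table, not code from the notebook itself.

```python
def keep_frequent(rows, min_count=3):
    """Keep rows where any column ending in '_count' reaches min_count."""
    return [
        row for row in rows
        if any(value >= min_count
               for key, value in row.items()
               if key.endswith("_count"))
    ]

# Hypothetical sample rows (not real output from the notebook).
rows = [
    {"token": "blood pressure", "phrasemachine_count": 5, "ng_count": 4},
    {"token": "side effect", "cluster_count": 3},
    {"token": "last week", "tr_count": 1, "ng_count": 2},
]

kept = [row["token"] for row in keep_frequent(rows, min_count=3)]
print(kept)
```

Raising the minimum shrinks the download; lowering it to 1 keeps every phrase that any method found.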



In [None]:
#@title <= Run this code to get download button.

class InvokeButton(object):
  def __init__(self, title, callback):
    self._title = title
    self._callback = callback

  def _repr_html_(self):
    callback_id = 'button-' + str(uuid.uuid4())
    output.register_callback(callback_id, self._callback)

    template = """<button id="{callback_id}">{title}</button>
        <script>
          document.querySelector("#{callback_id}").onclick = (e) => {{
            google.colab.kernel.invokeFunction('{callback_id}', [], {{}})
            e.preventDefault();
          }};
        </script>"""
    html = template.format(title=self._title, callback_id=callback_id)
    return html

def download_tokens():
  try:
    reduce_merged_to_count(merged, min_token_count.value).to_csv(configs['tokenize_output_filename'], index=False)
    files.download(configs['tokenize_output_filename'])
  except Exception as e:
    print('ex',e)

InvokeButton('Download Tokens', download_tokens)

# Returned File

The columns in the generated file are:

|  column | explanation |
|----|:----|
|`token` | The token identified by one or more of the processes |
|`phrasemachine_count` | Number of times the token is identified as a phrase by the *phrasemachine* algorithm |
|`phrasemachine_docs` | Number of rows in which the token is identified by the *phrasemachine* algorithm |
|`cluster_count` | Number of times the token is identified by spaCy as a *Noun Cluster* in all rows (must exceed `tokenize_cluster_min` to be counted at all) |
|`cluster_docs` | Number of rows in which the token is identified by spaCy as a *Noun Cluster* |
|`tr_count` | Number of times the token is identified as a phrase by the *TextRank* algorithm in all rows (must exceed `tokenize_textrank_min` to be counted at all) |
|`tr_docs` | Number of rows in which the token is identified by the *TextRank* algorithm |
|`textrank` | The maximum *TextRank* value for the token across all documents |
|`yake_count` | Number of times the token is identified as a phrase by the *YAKE* algorithm in all rows (must exceed `tokenize_yake_min` to be counted at all) |
|`yake_docs` | Number of rows in which the token is identified by the *YAKE* algorithm |
|`yake` | The maximum *YAKE* score for the token across all documents |
|`scake_count` | Number of times the token is identified as a phrase by the *sCAKE* algorithm in all rows (must exceed `tokenize_scake_min` to be counted at all) |
|`scake_docs` | Number of rows in which the token is identified by the *sCAKE* algorithm |
|`scake` | The maximum *sCAKE* score for the token across all documents |
|`sgrank_count` | Number of times the token is identified as a phrase by the *SGRank* algorithm in all rows (must exceed `tokenize_sgrank_min` to be counted at all) |
|`sgrank_docs` | Number of rows in which the token is identified by the *SGRank* algorithm |
|`sgrank` | The maximum *SGRank* score for the token across all documents |
|`ng_count` | Number of times the token is identified as an *n*-gram in all rows, where the maximum *n* is defined by `tokenize_ng_threshold` (must exceed `tokenize_ngram_min` to be counted at all) |
|`ng_docs` | Number of rows in which the token is identified as an *n*-gram |
|`words` | The token's word count |

# **Algorithms**

The processor uses several methods to identify phrases and key tokens from the input file. 

## **phrasemachine**:  

An implementation of the algorithm from the paper [Bag of What? Simple Noun Phrase Extraction for Text Analysis](http://brenocon.com/handler2016phrases.pdf). The paper presents a phrase-based method, **NPFST**, for enriching a unigram bag of words (BOW): it uses a part-of-speech tagger and a finite state transducer to extract multiword phrases, which are added to the unigram BOW.
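
To make the idea concrete, here is a rough sketch of pattern-based noun phrase extraction over part-of-speech tags. It is not the phrasemachine implementation: the tags are supplied by hand instead of coming from a tagger, and the one-letter tag scheme, the `NP_PATTERN` grammar and the `extract_phrases` helper are simplifications assumed for illustration. It returns every multiword span whose tags match the pattern, nested spans included.

```python
import re

# Simplified noun-phrase grammar over one-letter tags (an assumption for this sketch):
# A = adjective, N = noun, P = preposition, D = determiner.
NP_PATTERN = re.compile(r"(A|N)*N(PD*(A|N)*N)*")

def extract_phrases(tagged, min_words=2, max_words=6):
    """Return every multiword span whose tag sequence fully matches NP_PATTERN."""
    words = [word for word, _ in tagged]
    tags = "".join(tag for _, tag in tagged)
    phrases = []
    for start in range(len(words)):
        last_end = min(start + max_words, len(words))
        for end in range(start + min_words, last_end + 1):
            if NP_PATTERN.fullmatch(tags[start:end]):
                phrases.append(" ".join(words[start:end]))
    return phrases

# Hand-tagged toy input.
tagged = [("a", "D"), ("glass", "N"), ("of", "P"), ("red", "A"), ("wine", "N")]
print(extract_phrases(tagged))
```

In the notebook itself, `phrasemachine.get_phrases` handles both the tagging and the matching.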

## **TextRank**:  

https://www.aclweb.org/anthology/W04-3252.pdf

In this paper, we introduce TextRank – a graph-based ranking model for text processing, and show how this model can be successfully used in natural language applications. In particular, we propose two innovative unsupervised methods for keyword and sentence extraction, and show that the results obtained compare favorably with previously published results on established benchmarks.
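
As a rough illustration of the graph-based ranking idea, the sketch below links words that appear next to each other and scores them with a PageRank-style iteration. It is a simplified stand-in, not the textacy implementation the notebook calls: the `textrank_keywords` helper, its parameters and the pre-filtered word list are assumptions for this example.

```python
from collections import defaultdict

def textrank_keywords(words, window=2, damping=0.85, iterations=50):
    """Score words by a PageRank-style iteration over an undirected co-occurrence graph."""
    # Link each word to the words that follow it inside the window
    # (window=2 links only directly adjacent words).
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Every word starts with the same score; each pass shares a word's
    # score equally among its neighbors.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical candidate words, already reduced to content words.
words = ["compatibility", "systems", "linear", "constraints", "natural",
         "numbers", "linear", "equations", "linear", "systems"]
ranked = textrank_keywords(words)
print(ranked[0][0])
```

Words with many distinct neighbors collect score from each of them, which is why a well-connected word rises to the top.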

## **YAKE**:  A Text Feature Based Automatic Keyword Extraction Method for Single Documents

https://repositorio.inesctec.pt/bitstream/123456789/7622/1/P-00N-NF3.pdf

A lightweight approach for keyword extraction and ranking based on an unsupervised methodology to select the most important keywords of a single document. To understand the merits of our proposal, we compare it against RAKE, TextRank and SingleRank methods (three well-known unsupervised approaches) and the baseline TF.IDF, over four different collections to illustrate the generality of our approach. The experimental results suggest that extracting keywords from documents using our method results in a superior effectiveness when compared to similar approaches.

## **sCAKE**: Semantic Connectivity Aware Keyword Extraction

https://arxiv.org/abs/1811.10831v1

Keyword Extraction is an important task in several text analysis endeavors. In this paper, we present a critical discussion of the issues and challenges in graph-based keyword extraction methods, along with comprehensive empirical analysis. We propose a parameterless method for constructing graph of text that captures the contextual relation between words. A novel word scoring method is also proposed based on the connection between concepts. We demonstrate that both proposals are individually superior to those followed by the state-of-the-art graph-based keyword extraction algorithms. Combination of the proposed graph construction and scoring methods leads to a novel, parameterless keyword extraction method (sCAKE) based on semantic connectivity of words in the document.

Motivated by limited availability of NLP tools for several languages, we also design and present a language-agnostic keyword extraction (LAKE) method. We eliminate the need of NLP tools by using a statistical filter to identify candidate keywords before constructing the graph. We show that the resulting method is a competent solution for extracting keywords from documents of languages lacking sophisticated NLP support.

## **SGRank**: Combining Statistical and Graphical Methods to Improve the State of the Art in Unsupervised Keyphrase Extraction

https://www.aclweb.org/anthology/S15-1013.pdf

Keyphrase extraction is a fundamental technique in natural language processing. It enables documents to be mapped to a concise set of phrases that can be used for indexing, clustering, ontology building, auto-tagging and other information organization schemes. Two major families of unsupervised keyphrase extraction algorithms may be characterized as statistical and graph-based. We present a hybrid statistical-graphical algorithm that capitalizes on the heuristics of both families of algorithms and is able to outperform the state of the art in unsupervised keyphrase extraction on several datasets.

## ***n*-grams**: 

The simplest algorithm of all. For integer values of *n* from 2 to 6, ...
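
A minimal sketch of n-gram counting in plain Python is shown below. It is only an illustration: the notebook relies on textacy for this step, which may apply its own filtering (for example of stop words and punctuation), while the hypothetical `count_ngrams` helper here simply slides a window over whitespace-separated words.

```python
from collections import Counter

def count_ngrams(text, sizes=(2, 3, 4)):
    """Count every run of n consecutive words for each n in sizes."""
    words = text.lower().split()
    counts = Counter()
    for n in sizes:
        for start in range(len(words) - n + 1):
            counts[" ".join(words[start:start + n])] += 1
    return counts

counts = count_ngrams("high blood pressure and high blood sugar", sizes=(2, 3))
print(counts["high blood"])
```

The `sizes` tuple plays the same role as the n-gram dropdown: `(2, 3)` asks for 2-grams and 3-grams only.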

# **How to Configure**

To add configurations to a file, put the value `config` in the first column (no matter what the header is), the configuration key name in the second, and the value to be set in the third (anything beyond that is ignored). During processing, these rows are separated from the data and are not included in the returned file. For convenience, they can appear anywhere in the incoming file: before the headers, at the end, or mixed among the data.

For this particular operation, the configurations only set the default values for the option dropdowns; they are included mainly for consistency and to allow a future option to operate without interaction.
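
The separation of config rows from data rows can be sketched with the standard `csv` module. The notebook does this with pandas, so the `split_config_rows` helper and the sample upload below are assumptions made for illustration only.

```python
import csv
import io

def split_config_rows(csv_text, config_name="config"):
    """Separate config rows from data rows; the first non-config row is the header."""
    configs, data_rows = {}, []
    for row in csv.reader(io.StringIO(csv_text)):
        if row and row[0] == config_name:
            # Second column holds the key, third column holds the value.
            configs[row[1]] = row[2]
        elif row:
            data_rows.append(row)
    header, records = data_rows[0], data_rows[1:]
    return configs, header, records

# Hypothetical upload with config rows mixed in among the data.
sample = """config,tokenize_ng_threshold,3
id,lemmatized
1,patient report headache
config,tokenize_min_token_count,2
2,patient report nausea
"""
configs, header, records = split_config_rows(sample)
print(configs)
print(header, records)
```

Values arrive as strings, which is why the notebook converts the numeric settings with `int()` before using them.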

The possible configuration keys and their default values are:

| Key | Default | Notes |
|--------------|:-----------|:------|
| `tokenize_col_in` | `lemmatized` | *The exact, case-sensitive name of the column in the incoming file to process* |
| `tokenize_file_out` | `-tokenized` | *The text to add to the filename that is returned* |
| `tokenize_output_filename` | | *If passed, this value overrides the value calculated by applying `tokenize_file_out` to the uploaded filename* |
| `tokenize_ng_threshold` | `4` | *The maximum word length of the generated n-grams* |
| `tokenize_min_token_count` | `3` | *The minimum times a phrase must occur in all documents to be counted* |

- If the value in `tokenize_col_in` does not match any header in the uploaded file, the content in the first column will be processed.
- The value in `tokenize_file_out` will be inserted between the base filename and the '.csv' of the uploaded file to create `tokenize_output_filename` unless a value is passed as a configuration.
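
The fallback rules above can be sketched as small standalone functions. They mirror the behavior of the helper in the first code block, but the function names here (`resolve_column`, `output_filename`, `ngram_sizes`) are hypothetical and exist only for this illustration.

```python
def resolve_column(requested, columns):
    """Use the requested column, its capitalized form, or fall back to the first column."""
    if requested in columns:
        return requested
    if requested.capitalize() in columns:
        return requested.capitalize()
    return columns[0]

def output_filename(uploaded_name, suffix="-tokenized"):
    """Insert the suffix between the base filename and the .csv extension."""
    return uploaded_name.replace(".csv", f"{suffix}.csv")

def ngram_sizes(threshold, default=(2, 3, 4)):
    """Turn a maximum n-gram length into the sizes from 2 up to it, capped at 6."""
    try:
        top = min(6, int(threshold))
    except (TypeError, ValueError):
        # Unusable values fall back to the default sizes.
        return default
    return tuple(range(2, max(top, 2) + 1))

print(resolve_column("lemmatized", ["id", "Lemmatized"]))
print(output_filename("survey.csv"))
print(ngram_sizes("4"))
```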

# **Sample Configuration to Copy and Paste**

No need to add them all to a file, just the ones you want to change:

```
config,tokenize_col_in,lemmatized
config,tokenize_file_out,-tokenized
config,tokenize_output_filename,custom_filename.csv
config,tokenize_ng_threshold,4
config,tokenize_min_token_count,3
```
