![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/KEYPHRASE_EXTRACTION.ipynb)

# **Extract keyphrases from documents**

Browse the example outputs stored at the bottom of the notebook to see what the model can do, or enter your own texts in the "Inputs" section. For more detail on this keyphrase extraction model, see [this notebook](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/9.Keyword_Extraction_YAKE.ipynb).

## 1. Colab setup

Install dependencies

In [None]:
# Install Java
! apt-get update -qq
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
! java -version

# Install pyspark and Spark NLP
! pip install --ignore-installed -q pyspark==2.4.4
! pip install --ignore-installed -q spark-nlp

openjdk version "11.0.8" 2020-07-14
OpenJDK Runtime Environment (build 11.0.8+10-post-Ubuntu-0ubuntu118.04.1)
OpenJDK 64-Bit Server VM (build 11.0.8+10-post-Ubuntu-0ubuntu118.04.1, mixed mode, sharing)

Import dependencies

In [None]:
import json
import os
import pandas as pd
import numpy as np

os.environ['JAVA_HOME'] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ['PATH'] = os.environ['JAVA_HOME'] + "/bin:" + os.environ['PATH']

# Import pyspark
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel
from pyspark.sql import functions as F

# Import SparkNLP
import sparknlp
from sparknlp.annotator import *
from sparknlp.base import *

# Start Spark session
spark = sparknlp.start()

## 2. Inputs

Enter inputs as strings in this list. Later cells of the notebook will extract keyphrases from whatever inputs are entered here.

In [None]:
input_list = [
    """Extracting keywords from texts has become a challenge for individuals and organizations as the information grows in complexity and size. The need to automate this task so that text can be processed in a timely and adequate manner has led to the emergence of automatic keyword extraction tools. Yake is a novel feature-based system for multi-lingual keyword extraction, which supports texts of different sizes, domain or languages. Unlike other approaches, Yake does not rely on dictionaries nor thesauri, neither is trained against any corpora. Instead, it follows an unsupervised approach which builds upon features extracted from the text, making it thus applicable to documents written in different languages without the need for further knowledge. This can be beneficial for a large number of tasks and a plethora of situations where access to training corpora is either limited or restricted.""",
    """Iodine deficiency is a lack of the trace element iodine, an essential nutrient in the diet. It may result in metabolic problems such as goiter, sometimes as an endemic goiter as well as cretinism due to untreated congenital hypothyroidism, which results in developmental delays and other health problems. Iodine deficiency is an important global health issue, especially for fertile and pregnant women. It is also a preventable cause of intellectual disability.

Iodine is an essential dietary mineral for neurodevelopment among offsprings and toddlers. The thyroid hormones thyroxine and triiodothyronine contain iodine. In areas where there is little iodine in the diet, typically remote inland areas where no marine foods are eaten, iodine deficiency is common. It is also common in mountainous regions of the world where food is grown in iodine-poor soil.

Prevention includes adding small amounts of iodine to table salt, a product known as iodized salt. Iodine compounds have also been added to other foodstuffs, such as flour, water and milk, in areas of deficiency. Seafood is also a well known source of iodine.""",
    """The Prague Quadrennial of Performance Design and Space was established in 1967 to bring the best of design for performance, scenography, and theatre architecture to the front line of cultural activities to be experienced by professional and emerging artists as well as the general public. The quadrennial exhibitions, festivals, and educational programs act as a global catalyst of creative progress by encouraging experimentation, networking, innovation, and future collaborations. PQ aims to honor, empower and celebrate the work of designers, artists and architects while inspiring and educating audiences, who are the most essential element of any live performance. The Prague Quadrennial strives to present performance design as an art form concerned with creation of active performance environments, that are far beyond merely decorative or beautiful, but emotionally charged, where design can become a quest, a question, an argument, a threat, a resolution, an agent of change, or a provocation. Performance design is a collaborative field where designers mix, fuse and blur the lines between multiple artistic disciplines to search for new approaches and new visions.

The Prague Quadrennial organizes an expansive program of international projects and activities between the main quadrennial events – performances, exhibitions, symposia, workshops, residencies, and educational initiatives serve as an international platform for exploring the practice, theory and education of contemporary performance design in the most encompassing terms.""",
    """Author Nathan Wiseman-Trowse explained that the "approach to the sheer physicality of sound" integral to dream pop was "arguably pioneered in popular music by figures such as Phil Spector and Brian Wilson". The music of the Velvet Underground in the 1960s and 1970s, which experimented with repetition, tone, and texture over conventional song structure, was also an important touchstone in the genre's development George Harrison's 1970 album All Things Must Pass, with its Spector-produced Wall of Sound and fluid arrangements, led music journalist John Bergstrom to credit it as a progenitor of the genre.

Reynolds described dream pop bands as "a wave of hazy neo-psychedelic groups", noting the influence of the "ethereal soundscapes" of bands such as Cocteau Twins. Rolling Stone's Kory Grow described "modern dream pop" as originating with the early 1980s work of Cocteau Twins and their contemporaries, while PopMatters' AJ Ramirez noted an evolutionary line from gothic rock to dream pop. Grow considered Julee Cruise's 1989 album Floating into the Night, written and produced by David Lynch and Angelo Badalamenti, as a significant development of the dream pop sound which "gave the genre its synthy sheen." The influence of Cocteau Twins extended to the expansion of the genre's influence into Cantopop and Mandopop through the music of Faye Wong, who covered multiple Cocteau Twins songs, including tracks featured in Chungking Express, in which she also acted. Cocteau Twins would go on to collaborate with Wong on original songs of hers, and Wong contributed vocals to a limited release of a late Cocteau Twins single.

In the early 1990s, some dream pop acts influenced by My Bloody Valentine, such as Seefeel, were drawn to techno and began utilizing elements such as samples and sequenced rhythms. Ambient pop music was described by AllMusic as "essentially an extension of the dream pop that emerged in the wake of the shoegazer movement", distinct for its incorporation of electronic textures.

Much of the music associated with the 2009-coined term "chillwave" could be considered dream pop. In the opinion of Grantland's David Schilling, when "chillwave" was popularized, the discussion that followed among music journalists and bloggers revealed that labels such as "shoegaze" and "dream pop" were ultimately "arbitrary and meaningless".""",
    """North Ingria was located in the Karelian Isthmus, between Finland and Soviet Russia. It was established 23 January 1919. The republic was first served by a post office at the Rautu railway station on the Finnish side of the border. As the access across the border was mainly restricted, the North Ingrian postal service was finally launched in the early 1920. The man behind the idea was the lieutenant colonel Georg Elfvengren, head of the governing council of North Ingria. He was also known as an enthusiastic stamp collector. The post office was opened at the capital village of Kirjasalo.

The first series of North Ingrian stamps were issued in 21 March 1920. They were based on the 1917 Finnish "Model Saarinen" series, a stamp designed by the Finnish architect Eliel Saarinen. The first series were soon sold to collectors, as the postage stamps became the major financial source of the North Ingrian government. The second series was designed for the North Ingrian postal service and issued 2 August 1920. The value of both series was in Finnish marks and similar to the postal fees of Finland. The number of letters sent from North Ingria was about 50 per day, most of them were carried to Finland. They were mainly sent by the personnel of the Finnish occupying forces. Large number of letters were also sent in pure philatelic purposes.

With the Treaty of Tartu, the area was re-integrated into Soviet Russia and the use of the North Ingrian postage stamps ended in 4 December 1920. Stamps were still sold in Finland in 1921 with an overprinting "Inkerin hyväksi" (For the Ingria), but they were no longer valid. Funds of the sale went for the North Ingrian refugees."""
]

# Change these to wherever you want your inputs and outputs to go
INPUT_FILE_PATH = "inputs"
OUTPUT_FILE_PATH = "outputs"

Write the example inputs to the input folder.

In [None]:
! mkdir -p $INPUT_FILE_PATH

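for i, text in enumerate(input_list):
    # The first line of each file is a short preview of the text; the context
    # manager makes sure the file is flushed and closed
    with open(f'{INPUT_FILE_PATH}/Example{i + 1}.txt', 'w') as input_file:
        input_file.write(text[:min(len(text) - 10, 100)] + '... \n' + text)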
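Each input file therefore starts with a one-line preview (at most 100 characters followed by `... `) and then the full text. The layout can be checked with a minimal sketch like the one below, which writes to a temporary directory instead of `INPUT_FILE_PATH`; the helper name `write_example` and the sample string are assumptions made for illustration.

```python
import os
import tempfile


def write_example(path, text, preview_len=100):
    """Write a preview line followed by the full text (hypothetical helper)."""
    preview = text[:min(len(text) - 10, preview_len)]
    with open(path, 'w') as f:
        f.write(preview + '... \n' + text)


# Illustrative sample, shorter than the notebook's real inputs
sample = "Yake is a novel feature-based system for multi-lingual keyword extraction."

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'Example1.txt')
    write_example(path, sample)
    with open(path) as f:
        # Split the preview line from the body at the first newline
        first_line, body = f.read().split('\n', 1)

print(first_line)
print(body == sample)
```

Anything that reads these files back needs to skip the first line to recover the original text.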

## 3. Pipeline creation

Create the NLP pipeline.

In [None]:
# Transforms the raw text into a document readable by the later stages of the
# pipeline
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

# Separates the document into sentences
sentence_detector = SentenceDetector() \
    .setInputCols(['document']) \
    .setOutputCol('sentences')# \
    #.setDetectLists(True)

# Separates sentences into individual tokens (words)
tokenizer = Tokenizer() \
    .setInputCols(['sentences']) \
    .setOutputCol('tokens') \
    .setContextChars(['(', ')', '?', '!', '.', ','])

# The keyphrase extraction model. Change MinNGrams and MaxNGrams to set the
# minimum and maximum length (in words) of possible keyphrases, and change
# NKeywords to set the number of potential keyphrases identified per document.
keywords = YakeModel() \
    .setInputCols('tokens') \
    .setOutputCol('keywords') \
    .setMinNGrams(2) \
    .setMaxNGrams(5) \
    .setNKeywords(100) \
    .setStopWords(StopWordsCleaner().getStopWords())

# Assemble all of these stages into a pipeline, then fit the pipeline on an
# empty data frame so it can be used to transform new inputs.
pipeline = Pipeline(stages=[
    document_assembler, 
    sentence_detector,
    tokenizer,
    keywords
])
empty_df = spark.createDataFrame([[""]]).toDF('text')
pipeline_model = pipeline.fit(empty_df)

# LightPipeline is faster than Pipeline for small datasets
light_pipeline = LightPipeline(pipeline_model)
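To get a feel for what `MinNGrams`, `MaxNGrams` and the stop word list control, the sketch below enumerates candidate phrases from a token list in plain Python. It is not the `YakeModel` implementation: the function name `candidate_ngrams` is hypothetical, and the rule that a candidate may not start or end with a stop word is a simplifying assumption made for illustration.

```python
def candidate_ngrams(tokens, stop_words, min_n=2, max_n=5):
    """Return lowercase n-grams of min_n..max_n tokens that neither start
    nor end with a stop word (simplified, assumed candidate rule)."""
    lowered = [t.lower() for t in tokens]
    candidates = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the text
        for i in range(len(lowered) - n + 1):
            gram = lowered[i:i + n]
            if gram[0] in stop_words or gram[-1] in stop_words:
                continue
            candidates.append(' '.join(gram))
    return candidates


tokens = "Iodine deficiency is a lack of iodine".split()
stop_words = {'is', 'a', 'of'}  # tiny illustrative stop word list

print(candidate_ngrams(tokens, stop_words, min_n=2, max_n=3))
# → ['iodine deficiency', 'lack of iodine']
```

Raising the maximum length admits longer candidates, which is why the pipeline above asks for many keywords per document and filters them afterwards.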

## 4. Output creation

Utility functions to create more useful sets of keyphrases from the raw data frame produced by the model.

In [None]:
def adjusted_score(row, pow=2.5):
    """This function adjusts the scores of potential key phrases to give better
    scores to phrases with more words (which will naturally have worse scores
    due to the nature of the model). You can change the exponent to reward
    longer phrases more or less. Higher exponents reward longer phrases."""
    return ((row.result.count(' ') + 1) ** pow /
            (float(row.metadata['score']) + 0.1))

def get_top_ranges(phrases, input_text):
    """Combine phrases that overlap."""
    if not phrases:
        return []
    starts = sorted(row['begin'] for row in phrases)
    ends = sorted(row['end'] for row in phrases)

    # Sorting starts and ends separately is enough to find the gaps: a new
    # range begins wherever the i-th end falls before the (i + 1)-th start.
    ranges = [[starts[0], None]]
    for i in range(len(starts) - 1):
        if ends[i] < starts[i + 1]:
            ranges[-1][1] = ends[i]
            ranges.append([starts[i + 1], None])
    ranges[-1][1] = ends[-1]
    # 'end' offsets are inclusive, hence the + 1 when slicing
    return [{
        'begin': begin,
        'end': end,
        'phrase': input_text[begin:end + 1]
    } for begin, end in ranges]

def remove_duplicates(phrases):
    """Remove phrases that appear multiple times, keeping the first one."""
    i = 0
    while i < len(phrases):
        j = i + 1
        while j < len(phrases):
            if phrases[i]['phrase'] == phrases[j]['phrase']:
                # Delete by index and stay at j, because the next element
                # has just shifted into this position
                del phrases[j]
            else:
                j += 1
        i += 1

    return phrases

def get_output_lists(df_row):
    """Returns a tuple with two lists of up to five phrases each. The first
    combines key phrases that overlap to create longer key phrases, which is
    best for highlighting key phrases in text, and the second is simply the
    keyphrases with the highest scores, which is best for summarizing a
    document."""
    keyphrases = []
    for row in df_row.keywords:
        keyphrases.append({
            'begin': row.begin,
            'end': row.end,
            'phrase': row.result,
            'score': adjusted_score(row)
        })
    keyphrases = sorted(keyphrases, key=lambda x: x['score'], reverse=True)

    return (
        get_top_ranges(keyphrases[:20], df_row.text)[:5],
        remove_duplicates(keyphrases[:10])[:5]
    )
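The adjusted score divides a length bonus by the raw model score, so a phrase with more words can outrank a shorter one even when its raw score is worse. The sketch below works through the formula on two made-up annotations; the `Annotation` named tuple is a hypothetical stand-in for a Spark NLP annotation row, and the raw scores are illustrative values, not model output.

```python
from collections import namedtuple

# Hypothetical stand-in for an annotation row; only the two fields read by
# the scoring formula are modelled.
Annotation = namedtuple('Annotation', ['result', 'metadata'])


def adjusted_score(row, exponent=2.5):
    """Same formula as above: word_count ** exponent / (raw_score + 0.1)."""
    word_count = row.result.count(' ') + 1
    return word_count ** exponent / (float(row.metadata['score']) + 0.1)


# Illustrative raw scores (assumed, not produced by the model)
short = Annotation('north ingrian', {'score': '0.05'})
long = Annotation('north ingrian postal service', {'score': '0.20'})

print(adjusted_score(short))  # 2 ** 2.5 / 0.15
print(adjusted_score(long))   # 4 ** 2.5 / 0.30
```

With the default exponent the four-word phrase scores higher despite its larger raw score; with an exponent of 0 the length bonus disappears and the ordering flips.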

Transform the example inputs to create a data frame storing the identified keyphrases.

In [None]:
df = spark.createDataFrame(pd.DataFrame({'text': input_list}))
result = light_pipeline.transform(df).toPandas()

For each example, create two JSON files containing selections of the best keyphrases for the document. See the docstring of `get_output_lists` two cells above to learn more about the two JSON files produced. These JSON files are used directly in the public demo app for this model.

In [None]:
! mkdir -p $OUTPUT_FILE_PATH

for i in range(len(result)):
    top_ranges, top_summaries = get_output_lists(result.iloc[i])
    with open(f'{OUTPUT_FILE_PATH}/Example{i + 1}.json', 'w') as ranges_file:
        json.dump(top_ranges, ranges_file)
    with open(f'{OUTPUT_FILE_PATH}/Example{i + 1}_summaries.json', 'w') \
            as summaries_file:
        json.dump(top_summaries, summaries_file)
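Each JSON file holds a plain list of dictionaries, so a consumer can read it back with `json.load`. A minimal round-trip sketch follows; it writes to a temporary directory rather than `OUTPUT_FILE_PATH`, and the two entries are copied from the example output shown later in this notebook.

```python
import json
import os
import tempfile

# Two entries copied from the notebook's example output
top_ranges = [
    {'begin': 0, 'end': 11, 'phrase': 'North Ingria'},
    {'begin': 291, 'end': 318, 'phrase': 'North Ingrian postal service'},
]

with tempfile.TemporaryDirectory() as out_dir:
    path = os.path.join(out_dir, 'Example5.json')
    with open(path, 'w') as ranges_file:
        json.dump(top_ranges, ranges_file)
    with open(path) as ranges_file:
        loaded = json.load(ranges_file)

# The character offsets and phrases survive the round trip unchanged
phrases = [entry['phrase'] for entry in loaded]
print(phrases)
```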

## 5. Visualize outputs

The raw pandas data frame containing the outputs

In [None]:
result

| | text | document | sentences | tokens | keywords |
|---|---|---|---|---|---|
| 0 | Extracting keywords from texts has become a ch... | [(document, 0, 896, Extracting keywords from t... | [(document, 0, 135, Extracting keywords from t... | [(token, 0, 9, Extracting, {'sentence': '0'}, ... | [(keyword, 0, 18, extracting keywords, {'sente... |
| 1 | Iodine deficiency is a lack of the trace eleme... | [(document, 0, 1119, Iodine deficiency is a la... | [(document, 0, 90, Iodine deficiency is a lack... | [(token, 0, 5, Iodine, {'sentence': '0'}, []),... | [(keyword, 0, 16, iodine deficiency, {'sentenc... |
| 2 | The Prague Quadrennial of Performance Design a... | [(document, 0, 1548, The Prague Quadrennial of... | [(document, 0, 287, The Prague Quadrennial of ... | [(token, 0, 2, The, {'sentence': '0'}, []), (t... | [(keyword, 4, 21, prague quadrennial, {'senten... |
| 3 | Author Nathan Wiseman-Trowse explained that th... | [(document, 0, 2358, Author Nathan Wiseman-Tro... | [(document, 0, 205, Author Nathan Wiseman-Trow... | [(token, 0, 5, Author, {'sentence': '0'}, []),... | [(keyword, 0, 12, author nathan, {'sentence': ... |
| 4 | North Ingria was located in the Karelian Isthm... | [(document, 0, 1679, North Ingria was located ... | [(document, 0, 83, North Ingria was located in... | [(token, 0, 4, North, {'sentence': '0'}, []), ... | [(keyword, 0, 11, north ingria, {'sentence': '... |


The list of the top keyphrases (with overlapping keyphrases merged) for the last example

In [None]:
top_ranges

[{'begin': 0, 'end': 11, 'phrase': 'North Ingria'},
 {'begin': 291, 'end': 318, 'phrase': 'North Ingrian postal service'},
 {'begin': 462, 'end': 473, 'phrase': 'North Ingria'},
 {'begin': 599, 'end': 634, 'phrase': 'first series of North Ingrian stamps'},
 {'begin': 895, 'end': 918, 'phrase': 'North Ingrian government'}]
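Merging works on character offsets alone: any keyphrases whose inclusive `begin`/`end` spans overlap collapse into one range, which is how `north ingrian`, `north ingrian postal` and `north ingrian postal service` end up as the single range 291 to 318. A minimal sketch of that idea follows; it uses a sort-and-sweep variant rather than the separately sorted start and end lists in `get_top_ranges`, the function name `merge_overlapping` is hypothetical, and the offsets are taken from the outputs shown in this section.

```python
def merge_overlapping(spans):
    """Merge inclusive (begin, end) spans that overlap; result is sorted."""
    merged = []
    for begin, end in sorted(spans):
        if merged and begin <= merged[-1][1]:
            # Overlaps the previous span: extend it if this one reaches further
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([begin, end])
    return [tuple(span) for span in merged]


spans = [(291, 310), (291, 318), (291, 303), (895, 907), (895, 918)]
print(merge_overlapping(spans))
# → [(291, 318), (895, 918)]
```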

The list of the best summary keyphrases (with duplicates removed) for the last example

In [None]:
top_summaries

[{'begin': 291,
  'end': 310,
  'phrase': 'north ingrian postal',
  'score': 59.43933549489857},
 {'begin': 291,
  'end': 318,
  'phrase': 'north ingrian postal service',
  'score': 55.25119651613854},
 {'begin': 291,
  'end': 303,
  'phrase': 'north ingrian',
  'score': 41.928576798395355},
 {'begin': 895,
  'end': 907,
  'phrase': 'north ingrian',
  'score': 41.928576798395355}]