![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Open_Source/09.01.YakeKeywordExtraction.ipynb)

# **Keyword Extraction with YAKE!**

This notebook covers the parameters and usage of `YakeKeywordExtraction`. The annotator selects the most important keywords of a text. The underlying `YAKE!` algorithm relies on statistical features extracted from a single document.


**📖 Learning Objectives:**

1. Understand what `Keyword Extraction` is: the process of automatically extracting the most important keywords from a text document.

2. Understand how `YakeKeywordExtraction` follows an unsupervised approach that builds on features extracted from the text.

3. Become comfortable with the annotator's parameters, most of which define:
  - the total number of keywords to be selected,
  - the minimum and maximum number of words in a keyword,
  - the list of stop words.


**🔗 Helpful Links:**

- Documentation: [YakeKeywordExtraction](https://nlp.johnsnowlabs.com/docs/en/annotators#yakekeywordextraction)

- Python Docs: [YakeKeywordExtraction](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/keyword_extraction/yake_keyword_extraction/index.html)

- Scala Docs: [YakeKeywordExtraction](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/keyword/yake/YakeKeywordExtraction.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/8.Keyword_Extraction_YAKE.ipynb).

- Academic Reference Paper: [YAKE! Keyword extraction from single documents using multiple local features](https://www.sciencedirect.com/science/article/abs/pii/S0020025519308588)

## **📜 Background**

`Yake!` is a novel feature-based system for multilingual keyword extraction that supports texts of different sizes, domains, and languages.

Unlike other approaches, `Yake!` relies on neither dictionaries nor thesauri, and it is not trained against any corpora. Instead, it follows an unsupervised approach that builds on features extracted from the text itself, which makes it applicable to documents written in different languages without any further knowledge. This is useful for the many tasks and situations in which access to training corpora is limited or restricted.

## **🎬 Colab Setup**

In [None]:
!pip install -q pyspark==3.1.2  spark-nlp==4.2.4

In [2]:
from pyspark.sql import functions as F
from pyspark.sql.functions import lit
from pyspark.sql.types import StringType, DataType,ArrayType
from pyspark.sql.functions import udf, struct
from pyspark.ml import Pipeline
from sparknlp.base import LightPipeline
from IPython.display import display, HTML
import re

In [3]:
import sparknlp

import sys
sys.path.append('../../')

import pandas as pd

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.sql.functions import array_contains
from sparknlp.annotator import *
from sparknlp.common import RegexRule
from sparknlp.base import DocumentAssembler, Finisher

spark = sparknlp.start()

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

Spark NLP version 4.2.4
Apache Spark version: 3.1.2


## **🖨️ Input/Output Annotation Types**

- Input: `TOKEN`

- Output: `CHUNK`

## **🔎 Parameters**


- `setMinNGrams`: (int) Minimum number of words (n-gram length) an extracted keyword can have.

- `setMaxNGrams`: (int) Maximum number of words (n-gram length) an extracted keyword can have.

- `setNKeywords`: (int) Number of top keywords to extract.

- `setStopWords`: (list) List of stop words.

- `setThreshold`: (float) Upper bound for the keyword score. Each keyword receives a score greater than 0, and the lower the score, the better the keyword.

- `setWindowSize`: (int) Window size used when Yake builds its co-occurrence matrix. For example, `windowSize=2` looks at two words to both the left and the right of a candidate word.
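
The window described for `setWindowSize` can be illustrated in a few lines of plain Python. This is a minimal sketch and not Spark NLP's implementation: `cooccurrence_counts` is a hypothetical helper, and whitespace tokenization is assumed for brevity.

```python
from collections import Counter

def cooccurrence_counts(tokens, window_size):
    """Count how often each (word, neighbour) pair occurs within the window."""
    counts = Counter()
    for i, word in enumerate(tokens):
        # Look window_size positions to the left and to the right of position i.
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

tokens = "google is acquiring data science community kaggle".split()
counts = cooccurrence_counts(tokens, window_size=2)

# Neighbours of "data" with a window of two words on each side.
neighbours = sorted(n for (w, n) in counts if w == "data")
print(neighbours)
```

With a window of 2, "google" and "data" are three positions apart, so that pair is never counted.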


## **💻 YakeKeywordExtraction Pipeline**

The `YAKE!` algorithm uses the positions of sentences and tokens. The text must therefore pass through a sentence detector (`SentenceDetector`) and then a tokenizer (`Tokenizer`) before it reaches the annotator.



In [4]:
stopwords = StopWordsCleaner().getStopWords()

In [5]:
stopwords[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your']

In [6]:
document = DocumentAssembler() \
            .setInputCol("text") \
            .setOutputCol("document")

sentenceDetector = SentenceDetector() \
            .setInputCols("document") \
            .setOutputCol("sentence")

token = Tokenizer() \
            .setInputCols("sentence") \
            .setOutputCol("token") \
            .setContextChars(["(", ")", "?", "!", ".", ","])

keywords = YakeKeywordExtraction() \
            .setInputCols("token") \
            .setOutputCol("keywords")

yake_pipeline = Pipeline(stages=[document, sentenceDetector, token, keywords])

empty_df = spark.createDataFrame([['']]).toDF("text")

yake_Model = yake_pipeline.fit(empty_df)

In [8]:
keywords.extractParamMap()

{Param(parent='YakeKeywordExtraction_8c127e9efee7', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,
 Param(parent='YakeKeywordExtraction_8c127e9efee7', name='minNGrams', doc='Minimum N-grams a keyword should have'): 2,
 Param(parent='YakeKeywordExtraction_8c127e9efee7', name='maxNGrams', doc='Maximum N-grams a keyword should have'): 3,
 Param(parent='YakeKeywordExtraction_8c127e9efee7', name='nKeywords', doc='Number of Keywords to extract'): 30,
 Param(parent='YakeKeywordExtraction_8c127e9efee7', name='windowSize', doc='Window size for Co-Occurrence'): 3,
 Param(parent='YakeKeywordExtraction_8c127e9efee7', name='threshold', doc='Keyword Score threshold'): -1.0,
 Param(parent='YakeKeywordExtraction_8c127e9efee7', name='stopWords', doc="the words to be filtered out. by default it's english stop words from Spark ML"): ['i',
  'me',
  'my',
  'myself',
  'we',
  'our',
  'ours',
  'ourselves',
  'you',
  'your',
  'yours',
  'yourself',
 

In [None]:
# LightPipeline

light_model = LightPipeline(yake_Model)

text = '''
google is acquiring data science community kaggle. Sources tell us that google is acquiring kaggle, a platform that hosts data science and machine learning competitions. Details about the transaction remain somewhat vague , but given that google is hosting its Cloud Next conference in san francisco this week, the official announcement could come as early as tomorrow. Reached by phone, kaggle co-founder ceo anthony goldbloom declined to deny that the acquisition is happening. google itself declined 'to comment on rumors'. kaggle, which has about half a million data scientists on its platform, was founded by Goldbloom and Ben Hamner in 2010. The service got an early start and even though it has a few competitors like DrivenData, TopCoder and HackerRank, it has managed to stay well ahead of them by focusing on its specific niche. The service is basically the de facto home for running data science and machine learning competitions. With kaggle, google is buying one of the largest and most active communities for data scientists - and with that, it will get increased mindshare in this community, too (though it already has plenty of that thanks to Tensorflow and other projects). kaggle has a bit of a history with google, too, but that's pretty recent. Earlier this month, google and kaggle teamed up to host a $100,000 machine learning competition around classifying YouTube videos. That competition had some deep integrations with the google Cloud platform, too. Our understanding is that google will keep the service running - likely under its current name. While the acquisition is probably more about Kaggle's community than technology, kaggle did build some interesting tools for hosting its competition and 'kernels', too. On kaggle, kernels are basically the source code for analyzing data sets and developers can share this code on the platform (the company previously called them 'scripts'). Like similar competition-centric sites, kaggle also runs a job board, too. 
It's unclear what google will do with that part of the service. According to Crunchbase, kaggle raised $12.5 million (though PitchBook says it's $12.75) since its launch in 2010. Investors in kaggle include Index Ventures, SV Angel, Max Levchin, Naval Ravikant, google chief economist Hal Varian, Khosla Ventures and Yuri Milner
'''

light_result = light_model.fullAnnotate(text)[0]

[(s.metadata['sentence'], s.result) for s in light_result['sentence']]

In [None]:
keys_df = pd.DataFrame([(k.result, k.begin, k.end, k.metadata['score'],  k.metadata['sentence']) for k in light_result['keywords']],
                       columns = ['keywords','begin','end','score','sentence'])
keys_df['score'] = keys_df['score'].astype(float)

# sorted by sentence, then by score (a lower score means a more relevant keyword)
keys_df.sort_values(['sentence','score']).head(100)

|  | keywords | begin | end | score | sentence |
|---|---|---|---|---|---|
| 1 | data science | 21 | 32 | 0.255856 | 0 |
| 0 | acquiring data | 11 | 24 | 0.844244 | 0 |
| 31 | google is acquiring | 1 | 19 | 1.039254 | 0 |
| 3 | community kaggle | 34 | 49 | 1.040628 | 0 |
| 2 | science community | 26 | 42 | 1.152803 | 0 |
| 32 | acquiring data science | 11 | 32 | 1.26386 | 0 |
| 6 | data science | 123 | 134 | 0.255856 | 1 |
| 7 | machine learning | 140 | 155 | 0.466911 | 1 |
| 8 | learning competitions | 148 | 168 | 0.762934 | 1 |
| 4 | acquiring kaggle | 83 | 98 | 0.849239 | 1 |


## `setMinNGrams` and `setMaxNGrams`

The `setMinNGrams` (default: 2) and `setMaxNGrams` (default: 3) parameters set the minimum and maximum number of words (the n-gram size) a keyword can have.

In [None]:
keywords = YakeKeywordExtraction() \
            .setInputCols("token") \
            .setOutputCol("keywords") \
            .setMinNGrams(1) \
            .setMaxNGrams(2)

yake_pipeline = Pipeline(stages=[document, sentenceDetector, token, keywords])

empty_df = spark.createDataFrame([['']]).toDF("text")

yake_Model = yake_pipeline.fit(empty_df)

In [None]:
light_model = LightPipeline(yake_Model)

text = '''
google is acquiring data science community kaggle. Sources tell us that google is acquiring kaggle, a platform that hosts data science and machine learning competitions. Details about the transaction remain somewhat vague , but given that google is hosting its Cloud Next conference in san francisco this week, the official announcement could come as early as tomorrow. Reached by phone, kaggle co-founder ceo anthony goldbloom declined to deny that the acquisition is happening. google itself declined 'to comment on rumors'. kaggle, which has about half a million data scientists on its platform, was founded by Goldbloom and Ben Hamner in 2010. The service got an early start and even though it has a few competitors like DrivenData, TopCoder and HackerRank, it has managed to stay well ahead of them by focusing on its specific niche. The service is basically the de facto home for running data science and machine learning competitions. With kaggle, google is buying one of the largest and most active communities for data scientists - and with that, it will get increased mindshare in this community, too (though it already has plenty of that thanks to Tensorflow and other projects). kaggle has a bit of a history with google, too, but that's pretty recent. Earlier this month, google and kaggle teamed up to host a $100,000 machine learning competition around classifying YouTube videos. That competition had some deep integrations with the google Cloud platform, too. Our understanding is that google will keep the service running - likely under its current name. While the acquisition is probably more about Kaggle's community than technology, kaggle did build some interesting tools for hosting its competition and 'kernels', too. On kaggle, kernels are basically the source code for analyzing data sets and developers can share this code on the platform (the company previously called them 'scripts'). Like similar competition-centric sites, kaggle also runs a job board, too. 
It's unclear what google will do with that part of the service. According to Crunchbase, kaggle raised $12.5 million (though PitchBook says it's $12.75) since its launch in 2010. Investors in kaggle include Index Ventures, SV Angel, Max Levchin, Naval Ravikant, google chief economist Hal Varian, Khosla Ventures and Yuri Milner
'''

light_result = light_model.fullAnnotate(text)[0]

[(s.metadata['sentence'], s.result) for s in light_result['sentence']]

In [None]:
keys_df = pd.DataFrame([(k.result, k.begin, k.end, k.metadata['score'],  k.metadata['sentence']) for k in light_result['keywords']],
                       columns = ['keywords','begin','end','score','sentence'])
keys_df['score'] = keys_df['score'].astype(float)

# sorted by sentence, then by score (a lower score means a more relevant keyword)
keys_df.sort_values(['sentence','score']).head(30)

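|  | keywords | begin | end | score | sentence |
|---|---|---|---|---|---|
| 5 | kaggle | 44 | 49 | 0.057364 | 0 |
| 0 | google | 1 | 6 | 0.061313 | 0 |
| 2 | data | 21 | 24 | 0.110227 | 0 |
| 3 | science | 26 | 32 | 0.231976 | 0 |
| 4 | community | 34 | 42 | 0.270738 | 0 |
| 86 | data science | 21 | 32 | 0.284252 | 0 |
| 1 | acquiring | 11 | 19 | 0.365614 | 0 |
| 8 | kaggle | 93 | 98 | 0.057364 | 1 |
| 6 | google | 73 | 78 | 0.061313 | 1 |
| 10 | data | 123 | 126 | 0.110227 | 1 |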


With `setMinNGrams(1)` and `setMaxNGrams(2)`, every extracted keyword contains one or two words, as the new dataframe shows.
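
The effect of the two bounds on candidate generation can be sketched in plain Python. This is an illustration only: `ngram_candidates` is a hypothetical helper, not part of Spark NLP, and it ignores scoring and stop words entirely.

```python
def ngram_candidates(tokens, min_n, max_n):
    """Return every contiguous n-gram with min_n <= n <= max_n, grouped by length."""
    candidates = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the sentence.
        for start in range(len(tokens) - n + 1):
            candidates.append(" ".join(tokens[start:start + n]))
    return candidates

tokens = ["data", "science", "community", "kaggle"]
print(ngram_candidates(tokens, min_n=1, max_n=2))
```

Raising `min_n` to 2 removes the single-word candidates, which is why the earlier run with the default settings contained no one-word keywords.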

## `setNKeywords`

The `setNKeywords` parameter (default: 30) limits the number of keywords extracted from the text.
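
Conceptually, limiting the keyword count means keeping the candidates with the lowest scores. The sketch below shows that idea in plain Python. `top_n_keywords` is a hypothetical helper, the scores are copied from the outputs in this notebook, and Spark NLP's internal selection may differ in its details.

```python
import heapq

def top_n_keywords(scored, n):
    """Keep the n candidates with the lowest scores (lower means more relevant)."""
    return heapq.nsmallest(n, scored, key=lambda item: item[1])

scored = [
    ("data science", 0.255856),
    ("acquiring data", 0.844244),
    ("machine learning", 0.466911),
    ("learning competitions", 0.762934),
    ("cloud next", 0.514866),
]

best = top_n_keywords(scored, n=3)
print(best)
```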

In [None]:
keywords = YakeKeywordExtraction() \
            .setInputCols("token") \
            .setOutputCol("keywords") \
            .setNKeywords(10)

yake_pipeline = Pipeline(stages=[document, sentenceDetector, token, keywords])

empty_df = spark.createDataFrame([['']]).toDF("text")

yake_Model = yake_pipeline.fit(empty_df)

In [None]:
light_model = LightPipeline(yake_Model)

text = '''
google is acquiring data science community kaggle. Sources tell us that google is acquiring kaggle, a platform that hosts data science and machine learning competitions. Details about the transaction remain somewhat vague , but given that google is hosting its Cloud Next conference in san francisco this week, the official announcement could come as early as tomorrow. Reached by phone, kaggle co-founder ceo anthony goldbloom declined to deny that the acquisition is happening. google itself declined 'to comment on rumors'. kaggle, which has about half a million data scientists on its platform, was founded by Goldbloom and Ben Hamner in 2010. The service got an early start and even though it has a few competitors like DrivenData, TopCoder and HackerRank, it has managed to stay well ahead of them by focusing on its specific niche. The service is basically the de facto home for running data science and machine learning competitions. With kaggle, google is buying one of the largest and most active communities for data scientists - and with that, it will get increased mindshare in this community, too (though it already has plenty of that thanks to Tensorflow and other projects). kaggle has a bit of a history with google, too, but that's pretty recent. Earlier this month, google and kaggle teamed up to host a $100,000 machine learning competition around classifying YouTube videos. That competition had some deep integrations with the google Cloud platform, too. Our understanding is that google will keep the service running - likely under its current name. While the acquisition is probably more about Kaggle's community than technology, kaggle did build some interesting tools for hosting its competition and 'kernels', too. On kaggle, kernels are basically the source code for analyzing data sets and developers can share this code on the platform (the company previously called them 'scripts'). Like similar competition-centric sites, kaggle also runs a job board, too. 
It's unclear what google will do with that part of the service. According to Crunchbase, kaggle raised $12.5 million (though PitchBook says it's $12.75) since its launch in 2010. Investors in kaggle include Index Ventures, SV Angel, Max Levchin, Naval Ravikant, google chief economist Hal Varian, Khosla Ventures and Yuri Milner
'''

light_result = light_model.fullAnnotate(text)[0]

[(s.metadata['sentence'], s.result) for s in light_result['sentence']]

In [None]:
keys_df = pd.DataFrame([(k.result, k.begin, k.end, k.metadata['score'],  k.metadata['sentence']) for k in light_result['keywords']],
                       columns = ['keywords','begin','end','score','sentence'])
keys_df['score'] = keys_df['score'].astype(float)

# sorted by sentence, then by score (a lower score means a more relevant keyword)
keys_df.sort_values(['sentence','score']).head(30)

|  | keywords | begin | end | score | sentence |
|---|---|---|---|---|---|
| 1 | data science | 21 | 32 | 0.255856 | 0 |
| 0 | acquiring data | 11 | 24 | 0.844244 | 0 |
| 3 | data science | 123 | 134 | 0.255856 | 1 |
| 4 | machine learning | 140 | 155 | 0.466911 | 1 |
| 5 | learning competitions | 148 | 168 | 0.762934 | 1 |
| 2 | acquiring kaggle | 83 | 98 | 0.849239 | 1 |
| 14 | google cloud | 1450 | 1461 | 0.61196 | 10 |
| 15 | cloud platform | 1457 | 1470 | 0.796338 | 10 |
| 6 | cloud next | 262 | 271 | 0.514866 | 2 |
| 7 | data scientists | 567 | 581 | 0.562581 | 4 |


Limiting the number of keywords with `setNKeywords(10)`, instead of keeping the default value of 30, considerably reduced the number of extracted keywords.
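
In the table above, sentence 10 is listed before sentence 2. The sentence id is read from the annotation metadata as a string, so sorting on it is lexicographic. The sketch below reproduces the effect in plain Python with a few rows copied from the output. Casting the id to an integer restores document order, and in pandas the equivalent would be converting the column with `astype(int)` before calling `sort_values`.

```python
# (keyword, score, sentence id as a string), copied from the output above
rows = [
    ("google cloud", 0.61196, "10"),
    ("cloud next", 0.514866, "2"),
    ("data science", 0.255856, "0"),
    ("data scientists", 0.562581, "4"),
]

# String keys sort lexicographically: "10" comes before "2".
as_strings = sorted(rows, key=lambda r: (r[2], r[1]))

# Casting the sentence id to int restores document order.
as_numbers = sorted(rows, key=lambda r: (int(r[2]), r[1]))

print([r[2] for r in as_strings])
print([r[2] for r in as_numbers])
```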

## `setStopWords`

The `setStopWords` parameter specifies a user-defined list of words to filter out (default: the English stop words from Spark MLlib).

In [None]:
keywords = YakeKeywordExtraction() \
            .setInputCols("token") \
            .setOutputCol("keywords") \
            .setStopWords(stopwords)

yake_pipeline = Pipeline(stages=[document, sentenceDetector, token, keywords])

empty_df = spark.createDataFrame([['']]).toDF("text")

yake_Model = yake_pipeline.fit(empty_df)

In [None]:
light_model = LightPipeline(yake_Model)

text = '''
google is acquiring data science community kaggle. Sources tell us that google is acquiring kaggle, a platform that hosts data science and machine learning competitions. Details about the transaction remain somewhat vague , but given that google is hosting its Cloud Next conference in san francisco this week, the official announcement could come as early as tomorrow. Reached by phone, kaggle co-founder ceo anthony goldbloom declined to deny that the acquisition is happening. google itself declined 'to comment on rumors'. kaggle, which has about half a million data scientists on its platform, was founded by Goldbloom and Ben Hamner in 2010. The service got an early start and even though it has a few competitors like DrivenData, TopCoder and HackerRank, it has managed to stay well ahead of them by focusing on its specific niche. The service is basically the de facto home for running data science and machine learning competitions. With kaggle, google is buying one of the largest and most active communities for data scientists - and with that, it will get increased mindshare in this community, too (though it already has plenty of that thanks to Tensorflow and other projects). kaggle has a bit of a history with google, too, but that's pretty recent. Earlier this month, google and kaggle teamed up to host a $100,000 machine learning competition around classifying YouTube videos. That competition had some deep integrations with the google Cloud platform, too. Our understanding is that google will keep the service running - likely under its current name. While the acquisition is probably more about Kaggle's community than technology, kaggle did build some interesting tools for hosting its competition and 'kernels', too. On kaggle, kernels are basically the source code for analyzing data sets and developers can share this code on the platform (the company previously called them 'scripts'). Like similar competition-centric sites, kaggle also runs a job board, too. 
It's unclear what google will do with that part of the service. According to Crunchbase, kaggle raised $12.5 million (though PitchBook says it's $12.75) since its launch in 2010. Investors in kaggle include Index Ventures, SV Angel, Max Levchin, Naval Ravikant, google chief economist Hal Varian, Khosla Ventures and Yuri Milner
'''

light_result = light_model.fullAnnotate(text)[0]

[(s.metadata['sentence'], s.result) for s in light_result['sentence']]

In [None]:
keys_df = pd.DataFrame([(k.result, k.begin, k.end, k.metadata['score'],  k.metadata['sentence']) for k in light_result['keywords']],
                       columns = ['keywords','begin','end','score','sentence'])
keys_df['score'] = keys_df['score'].astype(float)

# sorted by sentence, then by score (a lower score means a more relevant keyword)
keys_df.sort_values(['sentence','score']).head(30)

|  | keywords | begin | end | score | sentence |
|---|---|---|---|---|---|
| 1 | data science | 21 | 32 | 0.255856 | 0 |
| 0 | acquiring data | 11 | 24 | 0.844244 | 0 |
| 31 | google is acquiring | 1 | 19 | 1.039254 | 0 |
| 3 | community kaggle | 34 | 49 | 1.040628 | 0 |
| 2 | science community | 26 | 42 | 1.152803 | 0 |
| 32 | acquiring data science | 11 | 32 | 1.26386 | 0 |
| 6 | data science | 123 | 134 | 0.255856 | 1 |
| 7 | machine learning | 140 | 155 | 0.466911 | 1 |
| 8 | learning competitions | 148 | 168 | 0.762934 | 1 |
| 4 | acquiring kaggle | 83 | 98 | 0.849239 | 1 |


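With the `setStopWords` parameter, a user-defined list of stop words was used to filter out words. Because the `stopwords` list passed here is the default English list returned by `StopWordsCleaner().getStopWords()`, the result matches the run with default settings.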
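
One way stop words can shape the candidate list is sketched below in plain Python. It assumes the rule described for `YAKE!`, in which a candidate that starts or ends with a stop word is discarded while stop words inside a phrase are tolerated. `drop_stopword_edges` is a hypothetical helper and the short stop word list is made up for the example. Spark NLP's internal handling may differ.

```python
def drop_stopword_edges(candidates, stop_words):
    """Discard candidates whose first or last word is a stop word."""
    stop = {w.lower() for w in stop_words}
    kept = []
    for phrase in candidates:
        words = phrase.lower().split()
        if words and words[0] not in stop and words[-1] not in stop:
            kept.append(phrase)
    return kept

# Hypothetical, shortened stop word list used only for this illustration.
stop_words = ["i", "me", "my", "is", "the", "a"]
candidates = ["google is", "is acquiring", "google is acquiring", "data science", "the service"]

print(drop_stopword_edges(candidates, stop_words))
```

Under this assumption "google is acquiring" survives, which is consistent with that phrase appearing in the output tables of this notebook.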


## `setThreshold`

The `setThreshold` parameter sets an upper bound on the keyword score and thereby filters the keywords. It is disabled by default (the default value is -1). Each keyword is given a score greater than 0, and the lower the score, the better the keyword.

In [None]:
keywords = YakeKeywordExtraction() \
            .setInputCols("token") \
            .setOutputCol("keywords") \
            .setThreshold(0.75)

yake_pipeline = Pipeline(stages=[document, sentenceDetector, token, keywords])

empty_df = spark.createDataFrame([['']]).toDF("text")

yake_Model = yake_pipeline.fit(empty_df)

In [None]:
light_model = LightPipeline(yake_Model)

text = '''
google is acquiring data science community kaggle. Sources tell us that google is acquiring kaggle, a platform that hosts data science and machine learning competitions. Details about the transaction remain somewhat vague , but given that google is hosting its Cloud Next conference in san francisco this week, the official announcement could come as early as tomorrow. Reached by phone, kaggle co-founder ceo anthony goldbloom declined to deny that the acquisition is happening. google itself declined 'to comment on rumors'. kaggle, which has about half a million data scientists on its platform, was founded by Goldbloom and Ben Hamner in 2010. The service got an early start and even though it has a few competitors like DrivenData, TopCoder and HackerRank, it has managed to stay well ahead of them by focusing on its specific niche. The service is basically the de facto home for running data science and machine learning competitions. With kaggle, google is buying one of the largest and most active communities for data scientists - and with that, it will get increased mindshare in this community, too (though it already has plenty of that thanks to Tensorflow and other projects). kaggle has a bit of a history with google, too, but that's pretty recent. Earlier this month, google and kaggle teamed up to host a $100,000 machine learning competition around classifying YouTube videos. That competition had some deep integrations with the google Cloud platform, too. Our understanding is that google will keep the service running - likely under its current name. While the acquisition is probably more about Kaggle's community than technology, kaggle did build some interesting tools for hosting its competition and 'kernels', too. On kaggle, kernels are basically the source code for analyzing data sets and developers can share this code on the platform (the company previously called them 'scripts'). Like similar competition-centric sites, kaggle also runs a job board, too. 
It's unclear what google will do with that part of the service. According to Crunchbase, kaggle raised $12.5 million (though PitchBook says it's $12.75) since its launch in 2010. Investors in kaggle include Index Ventures, SV Angel, Max Levchin, Naval Ravikant, google chief economist Hal Varian, Khosla Ventures and Yuri Milner
'''

light_result = light_model.fullAnnotate(text)[0]

[(s.metadata['sentence'], s.result) for s in light_result['sentence']]

In [None]:
keys_df = pd.DataFrame([(k.result, k.begin, k.end, k.metadata['score'],  k.metadata['sentence']) for k in light_result['keywords']],
                       columns = ['keywords','begin','end','score','sentence'])
keys_df['score'] = keys_df['score'].astype(float)

# sorted by sentence, then by score (a lower score means a more relevant keyword)
keys_df.sort_values(['sentence','score']).head(30)

|  | keywords | begin | end | score | sentence |
|---|---|---|---|---|---|
| 0 | data science | 21 | 32 | 0.255856 | 0 |
| 1 | data science | 123 | 134 | 0.255856 | 1 |
| 2 | machine learning | 140 | 155 | 0.466911 | 1 |
| 10 | google cloud | 1450 | 1461 | 0.61196 | 10 |
| 3 | cloud next | 262 | 271 | 0.514866 | 2 |
| 4 | data scientists | 567 | 581 | 0.562581 | 4 |
| 5 | ben hamner | 629 | 638 | 0.623279 | 4 |
| 6 | data science | 895 | 906 | 0.255856 | 6 |
| 7 | machine learning | 912 | 927 | 0.466911 | 6 |
| 8 | data scientists | 1024 | 1038 | 0.562581 | 7 |


Calling `setThreshold(0.75)` instead of keeping the default value (-1, which disables the filter) retains only the keywords whose scores fall under the upper bound of 0.75. Fewer keywords are extracted, but those that remain are more relevant.
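
The filtering step can be mimicked on the scores from the earlier tables. This is a minimal sketch under two assumptions that the annotator's documentation does not spell out here: a negative threshold means "no filtering", and the comparison is strict. `filter_by_threshold` is a hypothetical helper, not a Spark NLP function.

```python
def filter_by_threshold(scored, threshold):
    """Keep candidates scoring below the threshold; a negative threshold disables the filter."""
    if threshold < 0:
        return list(scored)
    return [(kw, score) for kw, score in scored if score < threshold]

# Scores copied from the default run shown earlier in this notebook.
scored = [
    ("data science", 0.255856),
    ("acquiring data", 0.844244),
    ("machine learning", 0.466911),
    ("learning competitions", 0.762934),
    ("acquiring kaggle", 0.849239),
]

print(filter_by_threshold(scored, 0.75))
print(len(filter_by_threshold(scored, -1.0)))
```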

## **🔆 Highlighting Keywords in a Text**


In addition to returning the keywords as a dataframe, it is also possible to highlight them in the text.

This example uses a dataset of `7537` texts.

In [None]:
! wget -q https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/pubmed/pubmed_sample_text_small.csv

df = spark.read\
                .option("header", "true")\
                .csv("pubmed_sample_text_small.csv")

df.show(truncate=120)

+------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                    text|
+------------------------------------------------------------------------------------------------------------------------+
|The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel ...|
|BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the stan...|
|OBJECTIVE: To investigate the relationship between preoperative atrialfibrillation and early and late clinical outcom...|
|Combined EEG/fMRI recording has been used to localize the generators of EEGevents and to identify subject state in co...|
|Kohlschutter syndrome is a rare neurodegenerative disorder presenting withintractable seizures, developmental regress...|
|Statistical ana

In [None]:
df.count()

7537

In [None]:
result = yake_pipeline.fit(df).transform(df)

In [None]:
result = result.withColumn('unique_keywords', F.array_distinct("keywords.result"))

In [None]:
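def highlight(text, keywords):
    # Wrap every case-insensitive, whole-word match of each keyword in a yellow <span>.
    for k in keywords:
        # re.escape keeps regex metacharacters inside a keyword (e.g. "+" or "(") from breaking the pattern
        text = re.sub(r'(\b%s\b)' % re.escape(k), r'<span style="background-color: yellow;">\1</span>', text, flags=re.IGNORECASE)
    return text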

In [None]:
highlight_udf = udf(highlight, StringType())

In [None]:
result = result.withColumn("highlighted_keywords",highlight_udf('text','unique_keywords'))

In [None]:
for r in result.select("highlighted_keywords").limit(20).collect():
    display(HTML(r.highlighted_keywords))
    print("\n\n")
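
The highlighting step can also be tried on a plain string without Spark. The sketch below is a standalone copy of the `highlight` idea that escapes each keyword with `re.escape`, so that keywords containing regex metacharacters are matched literally. The sample sentence and keyword list are made up for the illustration.

```python
import re

SPAN = r'<span style="background-color: yellow;">\1</span>'

def highlight(text, keywords):
    """Wrap each whole-word keyword occurrence in a yellow <span>, ignoring case."""
    for k in keywords:
        pattern = r'(\b%s\b)' % re.escape(k)
        text = re.sub(pattern, SPAN, text, flags=re.IGNORECASE)
    return text

sample = "Data science and machine learning competitions."
html = highlight(sample, ["data science", "machine learning"])
print(html)
```

Because the substitutions run one after another on the same string, a keyword such as "style" or "yellow" would also match inside markup inserted by an earlier pass. Handling that case would need a single combined pattern, which is left out of this sketch.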
