<a href="https://colab.research.google.com/github/AlfredIsair/Natural-Language-Processing-Projects/blob/main/Keyword_Extraction/Keyword_Extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Keyphrase or keyword extraction**
Keyphrase (or keyword) extraction is an NLP text analysis technique that extracts the most important words and phrases from an input text. These key phrases can be used in a variety of tasks, including information retrieval, document summarization, and content categorization.

In [32]:
! pip install -q pyspark==3.4.1 spark-nlp==5.1.2

In [33]:
import re

# display and HTML live in IPython.display; importing them from IPython.core.display is deprecated
from IPython.display import display, HTML
from pyspark.ml import Pipeline
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType


In [34]:
import sparknlp

from pyspark.ml import PipelineModel
from sparknlp.annotator import *
from sparknlp.base import *

spark = sparknlp.start()



In [35]:
# Default list of stop words to exclude; they add little value when identifying key concepts
stopwords = StopWordsCleaner().getStopWords()

In [36]:
stopwords[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your']

### **YAKE Keyword Extractor**

YAKE stands for Yet Another Keyword Extractor. It is an unsupervised approach to automatic keyword extraction that relies on features of the text itself.
YAKE is a feature-based system for multilingual keyword extraction that supports texts of different sizes, domains, and languages.
Unlike other approaches, YAKE does not rely on dictionaries or thesauri, nor is it trained on any corpus. Instead, it builds on features extracted from the text alone, which makes it applicable to documents written in different languages without further knowledge. This is useful for the many tasks and situations where access to training corpora is limited or restricted.

The algorithm uses the positions of sentences and tokens. The text must therefore pass through a sentence boundary detector and then a tokenizer before it reaches the annotator.

The following parameters can be tuned to get the best results from the annotator:

- *setMinNGrams(int)* Minimum length, in tokens, of an extracted keyword
- *setMaxNGrams(int)* Maximum length, in tokens, of an extracted keyword
- *setNKeywords(int)* Extract the top N keywords
- *setStopWords(list)* Set the list of stop words
- *setThreshold(float)* Upper bound for the keyword score. Each keyword is given a score greater than 0, and the lower the score, the better the keyword.
- *setWindowSize(int)* Window size used to build YAKE's co-occurrence matrix. For example, windowSize=2 looks at two words to the left and two to the right of a candidate word.
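
To make the n-gram and window parameters more concrete, here is a toy sketch in plain Python. It is not Spark NLP's implementation: `candidate_ngrams`, `cooccurrence_counts`, and the toy stop-word list are hypothetical names introduced for illustration, and the rule that a candidate may not start or end with a stop word is an assumption about how stop words are typically applied.

```python
from collections import Counter


def candidate_ngrams(tokens, min_n, max_n, stop_words):
    """Return n-grams with min_n <= n <= max_n that neither start nor end with a stop word."""
    stop = set(stop_words)
    candidates = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Assumed filtering rule: skip candidates bordered by a stop word
            if gram[0] in stop or gram[-1] in stop:
                continue
            candidates.append(" ".join(gram))
    return candidates


def cooccurrence_counts(tokens, window_size):
    """Count how often each neighbour appears within window_size positions of each word."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts


tokens = "we build an llm with a vector database".split()
toy_stop_words = ["we", "an", "with", "a"]

print(candidate_ngrams(tokens, 1, 2, toy_stop_words))
print(cooccurrence_counts(tokens, 2)[("vector", "llm")])  # "llm" is 3 positions away
print(cooccurrence_counts(tokens, 3)[("vector", "llm")])  # a wider window reaches it
```

With a window of 2, "llm" falls outside the neighbourhood of "vector"; widening the window to 3 brings it in. A larger window therefore lets more distant words influence each other's co-occurrence counts.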

In [37]:
# Build the YAKE pipeline
document= DocumentAssembler() \
        .setInputCol("text").setOutputCol("document")

sentence= SentenceDetector() \
        .setInputCols("document") \
        .setOutputCol("sentence")

token= Tokenizer()\
        .setInputCols("sentence")\
        .setOutputCol("token") \
        .setContextChars(["(", ")", "?", "!", ".", ","])

keywords= YakeKeywordExtraction() \
        .setInputCols("token")  \
        .setOutputCol("keywords") \
        .setMinNGrams(1) \
        .setMaxNGrams(3) \
        .setNKeywords(20) \
        .setStopWords(stopwords) \
        .setThreshold(0.5) \
        .setWindowSize(3)


yake_pipeline = Pipeline(stages=[
    document,
    sentence,
    token,
    keywords
])

In [38]:
# Fit on an empty DataFrame; this only turns the Pipeline into a PipelineModel, it does not store results.
empty_df = spark.createDataFrame([['']]).toDF("text")

yake_Model = yake_pipeline.fit(empty_df)

In [39]:
# Wrap the fitted model in a LightPipeline so it can annotate plain Python strings directly

light_model = LightPipeline(yake_Model)

text = '''
  This guide is about How to build an LLM with a Vector Database and improve LLM’s use of this flow. We’ll look at how combining these two can make LLMs more accurate and useful, especially for specific topics.
  Next, we offer a brief overview of Vector Databases, explaining the concept of vector embedding and its role in enhancing AI and machine learning applications.
  We’ll show you how these databases differ from traditional databases and why they are better suited for AI-driven tasks, particularly when working with unstructured data like text, images, and complex patterns.
  Further, we’ll explore the practical application of this technology in building a Closed-QA bot. This bot, powered by Falcon-7B and ChromaDB, demonstrates the effectiveness of LLMs when coupled with the right tools and techniques.
'''

light_result = light_model.fullAnnotate(text)[0]

# List each detected sentence with its index; keyword annotations refer to these indices via metadata['sentence']
[(s.metadata['sentence'], s.result) for s in light_result['sentence']]


[('0',
  'This guide is about How to build an LLM with a Vector Database and improve LLM’s use of this flow.'),
 ('1',
  'We’ll look at how combining these two can make LLMs more accurate and useful, especially for specific topics.'),
 ('2',
  'Next, we offer a brief overview of Vector Databases, explaining the concept of vector embedding and its role in enhancing AI and machine learning applications.'),
 ('3',
  'We’ll show you how these databases differ from traditional databases and why they are better suited for AI-driven tasks, particularly when working with unstructured data like text, images, and complex patterns.'),
 ('4',
  'Further, we’ll explore the practical application of this technology in building a Closed-QA bot.'),
 ('5',
  'This bot, powered by Falcon-7B and ChromaDB, demonstrates the effectiveness of LLMs when coupled with the right tools and techniques.')]

In [40]:
# Collect the extracted keywords from light_result into a pandas DataFrame, sort them by sentence and then by ascending score (lower is better), and display the first 30 rows.

import pandas as pd

keys_df = pd.DataFrame([(k.result, k.begin, k.end, k.metadata['score'], k.metadata['sentence'])
                        for k in light_result['keywords']],
                       columns = ['keywords', 'begin', 'end', 'score', 'sentence'])
keys_df['score'] = keys_df['score'].astype(float)

# order by sentence, then by relevance (lowest score first)
keys_df.sort_values(['sentence', 'score']).head(30)

| | keywords | begin | end | score | sentence |
|---|---|---|---|---|---|
| 1 | vector | 50 | 55 | 0.171555 | 0 |
| 12 | vector database | 50 | 64 | 0.315063 | 0 |
| 0 | llm | 39 | 41 | 0.465506 | 0 |
| 2 | database | 57 | 64 | 0.465506 | 0 |
| 3 | llms | 149 | 152 | 0.372603 | 1 |
| 4 | vector | 249 | 254 | 0.171555 | 2 |
| 6 | vector | 293 | 298 | 0.171555 | 2 |
| 5 | databases | 256 | 264 | 0.208702 | 2 |
| 13 | vector databases | 249 | 264 | 0.475439 | 2 |
| 7 | databases | 401 | 409 | 0.208702 | 3 |
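
The same keyword can appear in several sentences, as "vector" does in the output above. One possible follow-up step, sketched here in plain Python, is to keep a single entry per keyword and optionally drop weak candidates. The helper `best_keywords` and its `max_score` cutoff are hypothetical names for this illustration; the rows are copied from the output above.

```python
# (keyword, score, sentence) rows copied from the output above
rows = [
    ("vector", 0.171555, 0),
    ("vector database", 0.315063, 0),
    ("llm", 0.465506, 0),
    ("database", 0.465506, 0),
    ("llms", 0.372603, 1),
    ("vector", 0.171555, 2),
    ("vector", 0.171555, 2),
    ("databases", 0.208702, 2),
    ("vector databases", 0.475439, 2),
    ("databases", 0.208702, 3),
]


def best_keywords(rows, max_score=None):
    """Keep the lowest (best) score per keyword, optionally dropping scores above max_score."""
    best = {}
    for keyword, score, _sentence in rows:
        if max_score is not None and score > max_score:
            continue
        if keyword not in best or score < best[keyword]:
            best[keyword] = score
    # Lower YAKE scores mean more relevant keywords, so sort ascending
    return sorted(best.items(), key=lambda item: item[1])


for keyword, score in best_keywords(rows, max_score=0.4):
    print(f"{score:.6f}  {keyword}")
```

The cutoff plays the same role as `setThreshold` in the pipeline, only applied after extraction instead of during it.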


In [41]:
! wget -q https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/pubmed/pubmed_sample_text_small.csv

# Read the sample abstracts; the parentheses avoid a dangling line-continuation backslash
df = (
    spark.read
    .option("header", "true")
    .csv("pubmed_sample_text_small.csv")
)

df.show(5, truncate=False)




In [42]:
df.columns


['text']

In [43]:
result = yake_pipeline.fit(df).transform(df)

In [44]:
result = result.withColumn('unique_keywords', F.array_distinct("keywords.result"))

In [45]:
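def highlight(text, keywords):
    # Nothing to do for empty text or an empty keyword list
    if not text or not keywords:
        return text
    # Escape each keyword so regex metacharacters match literally, and try longer phrases first
    pattern = "|".join(re.escape(k) for k in sorted(keywords, key=len, reverse=True))
    # A single pass avoids re-matching words inside the inserted <span> markup
    return re.sub(r'\b(%s)\b' % pattern, r'<span style="background-color: yellow;">\1</span>', text, flags=re.IGNORECASE)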
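
Because keywords are placed inside a regular expression, a keyword containing metacharacters such as `.` or `)` can match more than intended or fail to compile. The sketch below shows the difference `re.escape` makes. The helper `count_matches` and the sample sentence are made up for illustration and are not part of the pipeline.

```python
import re


def count_matches(text, keyword, escape):
    """Count case-insensitive matches of keyword in text, or return None if the pattern is invalid."""
    pattern = re.escape(keyword) if escape else keyword
    try:
        return len(re.findall(pattern, text, flags=re.IGNORECASE))
    except re.error:
        return None


sample = "C.difficile was found (sample 2) but Cxdifficile is not a word."

# Unescaped, "." matches any character, so the misspelled word is matched as well
print(count_matches(sample, "c.difficile", escape=False))
print(count_matches(sample, "c.difficile", escape=True))

# Unescaped, "2)" is an invalid pattern (unbalanced parenthesis)
print(count_matches(sample, "2)", escape=False))
print(count_matches(sample, "2)", escape=True))
```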

In [46]:
highlight_udf = udf(highlight, StringType())

In [47]:
result = result.withColumn("highlighted_keywords",highlight_udf('text','unique_keywords'))

In [48]:
for r in result.select("highlighted_keywords").limit(10).collect():
    display(HTML(r.highlighted_keywords))
    print("\n\n")

## **Keyword Extraction Tool**

The tool below provides a simple user interface built with IPython widgets: the user enters text and clicks a button to extract keywords. The extracted keywords are displayed together with the input text, in which each keyword is highlighted. This is useful for quickly identifying key terms and concepts in a document.



In [49]:
keyword_model = yake_pipeline.fit(df)

In [50]:
import re
import ipywidgets as widgets
from IPython.display import display, HTML



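def highlight(text, keywords):
    # Nothing to do for empty text or an empty keyword list
    if not text or not keywords:
        return text
    # Escape each keyword so regex metacharacters match literally, and try longer phrases first
    pattern = "|".join(re.escape(k) for k in sorted(keywords, key=len, reverse=True))
    # A single pass avoids re-matching words inside the inserted <span> markup
    return re.sub(r'\b(%s)\b' % pattern, r'<span style="background-color: yellow;">\1</span>', text, flags=re.IGNORECASE)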

def extract_keywords(text):
    # Transform the input text using the pre-fitted pipeline model
    result = keyword_model.transform(spark.createDataFrame([(text,)], ["text"]))

    # Extract the keywords, dropping duplicates while keeping first-seen order
    annotations = result.select("keywords").collect()[0]["keywords"]
    unique_keywords = list(dict.fromkeys(a.result for a in annotations))

    # Highlight keywords in the input text
    highlighted_text = highlight(text, unique_keywords)
    return unique_keywords, highlighted_text


# Colab UI
text_input = widgets.Textarea(
    value='',
    placeholder='Enter your text here',
    description='Text:',
    disabled=False
)
display(text_input)

def on_button_click(b):
    uploaded_text = text_input.value
    if uploaded_text:
        keywords, highlighted_text = extract_keywords(uploaded_text)
        display(HTML(f"<b>Keywords:</b> {' | '.join(keywords)}"))
        display(HTML(highlighted_text))
    else:
        print("Please enter some text.")

button = widgets.Button(description="Extract Keywords")
button.on_click(on_button_click)
display(button)


Textarea(value='', description='Text:', placeholder='Enter your text here')

Button(description='Extract Keywords', style=ButtonStyle())