# Keyword extraction

- Keyword extraction is the task of figuring out **which words and phrases in a piece of text are the most important**.
- These keywords can be used **to summarise the content of the text**. A common use case is using keywords to **improve search engine optimization (SEO) and make content more easily discoverable online.**
- Natural language processing (NLP) techniques like **part-of-speech tagging and phrase chunking** underpin many keyword extraction methods.
- These methods can help you find the most important ideas and objects in a text and the most common words and phrases.
- Another popular keyword extraction method is **term frequency-inverse document frequency (TF-IDF) analysis**. With this method, you figure out how important each word in a document is by comparing how often it appears in that document to how often it appears in a group of documents.





Keyword extraction is useful for several reasons:

- It helps summarize the content of a document
- It improves search engine optimization (SEO)
- It improves content marketing
- It improves customer service

# How it works

Keyword extraction uses natural language processing (NLP) techniques to automatically identify the most important words and phrases in a text. This can be done using a variety of methods, including the following:



**1. Part-of-speech tagging** involves using algorithms to identify the part of speech (e.g. noun, verb, adjective) of each word in a text. Identifying the most commonly used nouns and other content words makes it possible to extract the main subjects and objects discussed in the text.

**2. Phrase chunking** involves using algorithms **to identify common phrases and patterns in a text**, typically by grouping tagged words into units such as noun phrases. Identifying the most commonly used phrases makes it possible **to extract the main ideas and themes discussed in the text.**

**3. Term frequency–inverse document frequency (TF-IDF) analysis** involves calculating the **relative importance of each word in a document by comparing its frequency in that document to its frequency across a corpus of documents**. Words that appear frequently in a particular document but in few others are considered important keywords for that document.
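To make the TF-IDF idea concrete, here is a minimal sketch that computes the scores by hand on a toy corpus. It assumes the plain textbook weighting, term frequency multiplied by `log(N / df)`; libraries such as scikit-learn use smoothed and normalized variants, so their numbers will differ. The function name `tf_idf` and the three example sentences are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {word: score} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores


# Toy corpus (illustrative only)
corpus = [
    "keyword extraction finds important words",
    "search engines index important pages",
    "important words summarise a text",
]
doc_scores = tf_idf(corpus)

# Rank the words of the first document by score
ranked = sorted(doc_scores[0].items(), key=lambda kv: kv[1], reverse=True)
for word, score in ranked:
    print(f"{word}: {score:.3f}")
```

In this toy corpus `important` occurs in every document, so its score is 0, while `keyword`, `extraction` and `finds` occur only in the first document and rank highest there.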

# Machine learning algorithms

Several machine learning algorithms can be used for keyword extraction, including the following:

**1. Supervised learning algorithms** require a pre-labelled training dataset, where the input data (i.e. the text) has already been manually annotated with the relevant keywords. The algorithm uses this training dataset to learn the patterns and associations between the input data and the labels and can then be applied to new, unseen data to identify the relevant keywords automatically.

**2. Unsupervised learning algorithms** do not require a pre-labelled training dataset and instead learn the patterns and associations in the data automatically, through techniques such as clustering. Unsupervised learning algorithms can be used to identify the most commonly used words and phrases in a text and the relationships between different words and phrases.

**3. Semi-supervised learning algorithms** combine supervised and unsupervised learning elements and can be helpful when only a small amount of pre-labelled training data is available. The algorithm uses the pre-labelled data to learn the patterns and associations between the input data and the labels. It then uses unsupervised learning techniques to identify the relevant keywords in new, unseen data.
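As one illustration of the unsupervised route, the sketch below scores words by running a PageRank-style iteration over a word co-occurrence graph, so that words linked to many other well-connected words rank higher. This graph-based approach (the idea behind TextRank) is not described above and is used here only as an example of exploiting relationships between words without labels. The window size, damping factor, stop-word list and function name are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption, not a standard resource)
STOP_WORDS = {"a", "an", "and", "the", "is", "in", "of", "for", "to", "are"}


def rank_words(text, window=2, damping=0.85, iterations=50):
    """Score words with a PageRank-style iteration over a co-occurrence graph."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    words = [w for w in words if w and w not in STOP_WORDS]

    # Link every pair of distinct words that occur within `window` positions
    neighbours = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)

    # Start from a uniform score and repeatedly redistribute it along the edges
    scores = {w: 1.0 / len(neighbours) for w in neighbours}
    for _ in range(iterations):
        scores = {
            w: (1 - damping) / len(neighbours)
            + damping * sum(scores[n] / len(neighbours[n]) for n in neighbours[w])
            for w in neighbours
        }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


text = (
    "Keyword extraction is an important task in NLP. "
    "Keyword extraction helps summarise a text and improves search."
)
for word, score in rank_words(text)[:3]:
    print(f"{word}: {score:.3f}")
```

No labelled keywords are needed: the ranking comes only from how the words co-occur in the text itself.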

# How to implement keyword extraction

**Preprocess the text:**
- Before extracting keywords from a text, it is essential to preprocess the text **to remove irrelevant or noisy information and stop words**.
- This may include **eliminating punctuation, special characters, and numbers, converting the text to lowercase, and stemming or lemmatizing the words**.
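The preprocessing steps above can be sketched with the standard library alone. This is a minimal sketch that assumes plain English (ASCII) text and uses a small hand-written stop-word list; in practice the list would usually come from NLTK or spaCy. Stemming and lemmatization are left out because they need such a library. The name `preprocess` is illustrative.

```python
import re

# Small illustrative stop-word list (an assumption, not a standard resource)
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "for", "in", "to", "this", "with"}


def preprocess(text):
    """Lowercase, drop punctuation, digits and special characters, remove stop words."""
    text = text.lower()
    # Keep only ASCII letters and whitespace; everything else becomes a space
    text = re.sub(r"[^a-z\s]", " ", text)
    return [word for word in text.split() if word not in STOP_WORDS]


print(preprocess("This is a Sample text, with 3 numbers & symbols!"))
# ['sample', 'text', 'numbers', 'symbols']
```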

**Identify essential words and phrases:**
- Many techniques can identify the most important words and phrases in a text, including **part-of-speech tagging, phrase chunking, and term frequency-inverse document frequency (TF-IDF) analysis**.
- These techniques can help identify the main subjects and objects discussed in the text and the most commonly used words and phrases.

**Filter and rank the keywords:**
- Once the most important words and phrases have been identified, it is essential **to filter out any irrelevant or redundant keywords and rank the remaining keywords according to their relevance and importance.**
- This can be done using statistical measures such as **term frequency-inverse document frequency**, as well as **domain-specific knowledge and expertise**.
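One possible way to filter and rank is sketched below. It assumes the candidates already carry a relevance score (for example a TF-IDF value), drops those below a threshold, and treats a candidate as redundant when all of its words are already covered by a higher-scoring keyword. The function name, the threshold and the scores in the example are made up for illustration.

```python
def filter_and_rank(scored_candidates, min_score=0.1, top_k=5):
    """Drop weak or redundant candidates and return the best `top_k`."""
    # Highest score first, so lower-scoring duplicates are the ones dropped
    ordered = sorted(scored_candidates.items(), key=lambda kv: kv[1], reverse=True)

    kept = []
    for phrase, score in ordered:
        if score < min_score:
            continue  # irrelevant: score too low
        words = set(phrase.split())
        # Redundant: every word is already covered by a kept keyword
        if any(words <= set(other.split()) for other, _ in kept):
            continue
        kept.append((phrase, score))
    return kept[:top_k]


# Made-up scores for illustration only
candidates = {
    "keyword extraction": 0.62,
    "keyword": 0.40,
    "sample text": 0.35,
    "text": 0.30,
    "thing": 0.05,
}
print(filter_and_rank(candidates))
# [('keyword extraction', 0.62), ('sample text', 0.35)]
```

Domain knowledge can enter at the same point, for instance as an extra allow-list or block-list checked inside the loop.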

**Use the keywords:**
- Once the keywords have been extracted and ranked, they can be used to summarize the **text’s content** and to enhance the discoverability and relevance of the document.
- In SEO, this may include using the keywords in the titles, headings, and body of a web page to improve its ranking or using them to create relevant and engaging content for target audiences.

# Implementation

## NLTK keyword extraction

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from sklearn.feature_extraction.text import TfidfVectorizer

# Download necessary NLTK data if you haven't already
# (recent NLTK releases may ask for 'punkt_tab' and
# 'averaged_perceptron_tagger_eng' instead)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Preprocess the text
text = "This is a sample text for keyword extraction."
text = text.lower()

# Tokenize and tag parts of speech
tokens = word_tokenize(text)
tagged_tokens = pos_tag(tokens)

# Filter for nouns (all Penn Treebank noun tags start with 'NN')
nouns = [word for word, tag in tagged_tokens if tag.startswith('NN')]
print(f"Identified nouns: {nouns}")

# For proper TF-IDF, we need multiple documents for comparison
# We'll create a small corpus using sentences as documents
sentences = [
    "This is a sample text for keyword extraction.",
    "Keyword extraction is an important task in NLP.",
    "Text analysis helps identify important keywords in a document.",
    "Sample documents are used to demonstrate keyword extraction."
]

# Collect the nouns that occur anywhere in the corpus
all_nouns = []
for sentence in sentences:
    sentence_tokens = word_tokenize(sentence.lower())
    tagged = pos_tag(sentence_tokens)
    all_nouns.extend(word for word, tag in tagged if tag.startswith('NN'))

# Keep only unique nouns
unique_nouns = list(set(all_nouns))

# Create a TF-IDF vectorizer restricted to the noun vocabulary and apply it
vectorizer = TfidfVectorizer(vocabulary=unique_nouns)
tfidf_matrix = vectorizer.fit_transform(sentences)

# Get the TF-IDF scores for the first document (our original text)
first_doc_vector = tfidf_matrix[0].toarray().flatten()

# Map scores to words and sort
word_score_pairs = [(word, first_doc_vector[idx]) for word, idx in vectorizer.vocabulary_.items()]
top_keywords = sorted(word_score_pairs, key=lambda x: x[1], reverse=True)[:3]

print("\nTop 3 keywords by TF-IDF score:")
for keyword, score in top_keywords:
    print(f"{keyword}: {score:.4f}")
```




```
Identified nouns: ['text', 'keyword', 'extraction']

Top 3 keywords by TF-IDF score:
text: 0.5496
sample: 0.5496
keyword: 0.4449
```


## spaCy keyword extraction

```python
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the small English spaCy model
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")


def extract_noun_phrases(doc):
    """Return lowercased noun chunks with stop words and punctuation removed."""
    phrases = []
    for chunk in doc.noun_chunks:
        words = [t.lower_ for t in chunk if not t.is_stop and not t.is_punct]
        if words:  # skip chunks such as "this" that contain only stop words
            phrases.append(" ".join(words))
    return phrases


# Original text
text = "This is a sample text for keyword extraction."
doc = nlp(text)
print(f"Identified noun phrases: {extract_noun_phrases(doc)}")

# For proper TF-IDF, we need multiple documents for comparison
# Create a small corpus using related sentences
corpus = [
    "This is a sample text for keyword extraction.",
    "Keyword extraction is an important task in NLP.",
    "Text analysis helps identify important keywords in documents.",
    "Sample documents demonstrate extraction techniques."
]

# Build a unique vocabulary of noun phrases from the whole corpus
vocabulary = sorted({p for d in nlp.pipe(corpus) for p in extract_noun_phrases(d)})

# ngram_range must cover the longest phrase: with the default (1, 1),
# multi-word vocabulary entries never match and always score 0
max_len = max(len(p.split()) for p in vocabulary)
vectorizer = TfidfVectorizer(vocabulary=vocabulary, ngram_range=(1, max_len))
tfidf_matrix = vectorizer.fit_transform(corpus)

# Get the TF-IDF scores for the first document (our original text)
first_doc_vector = tfidf_matrix[0].toarray().flatten()

# Map scores to noun phrases and sort
phrase_score_pairs = [(phrase, first_doc_vector[idx]) for phrase, idx in vectorizer.vocabulary_.items()]
top_keywords = sorted(phrase_score_pairs, key=lambda x: x[1], reverse=True)[:3]

print("\nTop 3 keyword phrases by TF-IDF score:")
for keyword, score in top_keywords:
    print(f"{keyword}: {score:.4f}")

# Alternative approach: a plain part-of-speech filter that lists the
# nouns and proper nouns of the original text (no scoring involved)
print("\nAlternative: nouns and proper nouns that are not stop words:")
for token in doc:
    if not token.is_stop and not token.is_punct and token.pos_ in ("NOUN", "PROPN"):
        print(token.text)
```


## BERT keyword extraction

BERT (Bidirectional Encoder Representations from Transformers) is a language model that can be used for various natural language processing tasks, including keyword extraction. It is pre-trained on a large corpus of text and learns contextual representations of the words and phrases in a text. BERT is not trained to pick keywords directly, so the example below relies on two simple heuristics as rough proxies for importance: the attention that the `[CLS]` token pays to each token, and the norm of each token's embedding.

```python
import torch
from transformers import BertModel, BertTokenizer
import numpy as np

# Load pre-trained model and tokenizer
model = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokens that are never reported as keywords
SPECIAL_TOKENS = ['[CLS]', '[SEP]', '[PAD]']
STOP_TOKENS = ['is', 'a', 'the', 'for', ',', '.']  # common stopwords and punctuation

# Input text
text = "This is a sample text for keyword extraction."

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=True)
input_ids = inputs["input_ids"]

# Get token strings to match with outputs later
tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
print(f"All tokens: {tokens}")

# Run the text through BERT
with torch.no_grad():
    outputs = model(**inputs)

# Get the attention weights from the last layer
# Shape is [batch_size, num_heads, seq_length, seq_length]
attention = outputs.attentions[-1]

# Average the attention across all heads and take the first (only) batch item
# Shape becomes [seq_length, seq_length]
attention_avg = torch.mean(attention, dim=1)[0]

# For keyword extraction, we'll use attention from the [CLS] token to all other tokens
# The [CLS] token is at position 0
cls_attention = attention_avg[0].numpy()

# Create pairs of tokens and their attention scores, skipping special and stop tokens
token_attention_pairs = [
    (token, float(score))
    for token, score in zip(tokens, cls_attention)
    if token not in SPECIAL_TOKENS and token not in STOP_TOKENS
]

# Get top 3 tokens with highest attention
top_keywords = sorted(token_attention_pairs, key=lambda x: x[1], reverse=True)[:3]

print("\nTop 3 keywords by BERT attention:")
for token, score in top_keywords:
    print(f"{token}: {score:.6f}")

# Alternative approach: Use token embeddings and their L2 norm as importance
token_embeddings = outputs.last_hidden_state[0].numpy()
token_norms = np.linalg.norm(token_embeddings, axis=1)

# Create pairs of tokens and their embedding norms
token_norm_pairs = [
    (token, float(norm))
    for token, norm in zip(tokens, token_norms)
    if token not in SPECIAL_TOKENS and token not in STOP_TOKENS
]

# Get top 3 tokens with highest embedding norms
top_keywords_norm = sorted(token_norm_pairs, key=lambda x: x[1], reverse=True)[:3]

print("\nAlternative - Top 3 keywords by embedding magnitude:")
for token, score in top_keywords_norm:
    print(f"{token}: {score:.6f}")
```




```
All tokens: ['[CLS]', 'this', 'is', 'a', 'sample', 'text', 'for', 'key', '##word', 'extraction', '.', '[SEP]']

Top 3 keywords by BERT attention:
##word: 0.079881
key: 0.077045
text: 0.075706

Alternative - Top 3 keywords by embedding magnitude:
##word: 16.146860
key: 16.077915
this: 15.463538
```
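The output lists `key` and `##word` separately because BERT's WordPiece tokenizer splits `keyword` into two pieces. A small post-processing step can merge such pieces back into whole words before ranking. The sketch below is one way to do it; averaging the scores of the pieces is an assumption (taking the maximum or the sum would be equally defensible), the helper name is hypothetical, and the scores are placeholders, not model output.

```python
def merge_wordpieces(tokens, scores):
    """Join '##' continuation pieces to the preceding token and average their scores."""
    words = []  # each entry is [word, [scores of its pieces]]
    for token, score in zip(tokens, scores):
        if token in ("[CLS]", "[SEP]", "[PAD]"):
            continue
        if token.startswith("##") and words:
            words[-1][0] += token[2:]
            words[-1][1].append(score)
        else:
            words.append([token, [score]])
    return [(word, sum(parts) / len(parts)) for word, parts in words]


tokens = ["[CLS]", "this", "is", "a", "sample", "text", "for",
          "key", "##word", "extraction", ".", "[SEP]"]
# Placeholder scores for illustration only; in practice pass e.g. `cls_attention`
scores = [0.0, 0.1, 0.1, 0.1, 0.2, 0.3, 0.1, 0.4, 0.6, 0.3, 0.1, 0.0]

for word, score in merge_wordpieces(tokens, scores):
    print(f"{word}: {score:.2f}")
```

With these placeholder values `key` and `##word` are reported as the single word `keyword` with the mean of their two scores.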


# Source

- https://spotintelligence.com/2022/12/13/keyword-extraction/